A friend pointed me to "How to crack product metric questions in PM interviews". In essence, it discusses how a service provider should respond to a change in a metric or KPI (Key Performance Indicator).
A definition from the article:
Metric interview questions test if candidates can perform data analysis and select key metrics that matter most to the success of a product. Employers like Facebook and Google use these questions to evaluate critical thinking and communication skills.
There are two types of metric question, and one builds on the other:

| Metric definition questions | Metric change questions |
| --- | --- |
| Define and prioritise the metrics that matter most to a product's success. | Diagnose, and respond to, a change in one of those metrics. |
Solving the Metric Definition question
The article recommends using the GAME method:
| Step | What it covers |
| --- | --- |
| Goals | What is the product or feature ultimately trying to achieve? |
| Actions | Which user actions contribute to (or detract from) those goals? |
| Metrics | Which measurements best capture those actions and goals? |
| Evaluation | How good are the chosen metrics? Are they measurable, and could they mislead or be gamed? |
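To make the framework concrete, here is a minimal, hypothetical GAME breakdown. The product, actions and metrics are invented for illustration and are not taken from the article:

```python
# A hypothetical GAME breakdown for an imaginary "saved searches" feature in a
# marketplace app. Every goal, action and metric here is illustrative only.
game = {
    "Goals": [
        "Help buyers find relevant items faster",
        "Increase repeat visits",
    ],
    "Actions": [
        "Buyer saves a search",
        "Buyer opens a saved-search notification",
        "Buyer contacts a seller from a saved-search result",
    ],
    "Metrics": [
        "% of weekly active buyers with at least one saved search",
        "Click-through rate on saved-search notifications",
        "7-day repeat-visit rate for buyers who use saved searches",
    ],
    "Evaluation": [
        "Repeat-visit rate is a lagging indicator; pair it with notification CTR",
        "Notification CTR can be inflated by over-notifying; watch opt-out rates",
    ],
}

# Print the breakdown in GAME order as a quick interview-style summary.
for step in ("Goals", "Actions", "Metrics", "Evaluation"):
    print(f"{step}:")
    for item in game[step]:
        print(f"  - {item}")
```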
Solving the Metric Change question
The article provides its own methodology.
| Step | What to do |
| --- | --- |
| Define the metric change | Make sure you fully understand the metric that is being presented to you. |
| Explore possible root-causes of the change | Use the MECE framework*. |
| Conclude | Make sure there aren't any horrible assumptions that you haven't validated / eliminated. |
* MECE = Mutually Exclusive, Collectively Exhaustive: break the possible causes into non-overlapping groups that together cover all the possibilities.
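As a rough illustration of what a MECE breakdown can look like in practice, here is a minimal sketch for a hypothetical drop in daily checkouts. The branches are invented and certainly not exhaustive for every product:

```python
# A hypothetical MECE-style checklist for a drop in daily checkouts.
# The top level splits into internal vs external causes; each branch is meant
# to be mutually exclusive, and together they aim to cover the whole space.
root_causes = {
    "Internal": {
        "Product change": ["recent release or A/B test", "feature deprecation"],
        "Technical": ["outage or elevated error rates", "tracking / logging bug"],
        "Business decision": ["pricing change", "marketing spend reduced"],
    },
    "External": {
        "Seasonality / calendar": ["holiday period", "weekend effect"],
        "Competition": ["competitor launch or promotion"],
        "Platform / ecosystem": ["app store policy change", "OS or browser update"],
    },
}

def walk(tree, depth=0):
    """Print the checklist as an indented tree so it can be worked through."""
    for branch, children in tree.items():
        print("  " * depth + f"- {branch}")
        if isinstance(children, dict):
            walk(children, depth + 1)
        else:
            for leaf in children:
                print("  " * (depth + 1) + f"- {leaf}")

walk(root_causes)
```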
Escalation and Communication
The article hardly touches on the critical, critical
point of issue escalation and communication.
After you have done your initial assessment, you need to answer
the question: Do I ring the alarm bell and to whom?
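Before ringing any bells, it is worth a quick sanity check that the change is actually outside normal variance. A minimal sketch, assuming you can pull a trailing window of daily values for the metric; the window length and the 3-sigma threshold are arbitrary choices of mine, not a recommendation from the article:

```python
# Rough check that today's value deviates from a trailing window by more than
# a given number of standard deviations. Window length and threshold are
# illustrative choices only.
from statistics import mean, stdev

def looks_anomalous(history, today, sigma_threshold=3.0):
    """True if `today` is more than `sigma_threshold` standard deviations
    away from the mean of the trailing `history` window."""
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return today != mu
    return abs(today - mu) / sd > sigma_threshold

# Two weeks of (invented) daily checkout counts, then today's suspicious value.
daily_checkouts = [1180, 1210, 1195, 1250, 1230, 1205, 1220,
                   1215, 1240, 1190, 1225, 1235, 1200, 1210]
print(looks_anomalous(daily_checkouts, today=980))  # True: worth investigating
```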
Judgement call: Let's assume that (a) the metric isn't a false positive and you've validated it to some significant extent, and (b) the metric matters. So, how high do you escalate the issue?
Factors to consider:
- How many other business functions are impacted by the incident?
- Is that metric important to them?
- Can they do something about it, either to resolve it (eg some root-cause analysis) or to reconfigure their business function to accommodate the impact of the incident?
I recommend using this yardstick:
| Level of concern | Action |
| --- | --- |
| Not a concern, or quickly solved and implemented | Handle within your own team; no wider escalation needed. |
| Somewhat concerning, but impact is contained | Inform the directly affected business functions and appoint an Incident Manager (see below); track to resolution. |
| Very concerning | Escalate to senior management across the impacted business functions. |
| Ridiculously high | Escalate to executive level and, if the problem rattles on, add a Communications Manager (see below). |
The importance of communicating - and re-communicating
As soon as you communicate outside of your team, make sure that someone
is knowingly and consistently appointed as Incident Manager (i.e. it could
be you!). Note that if the problem rattles on, you may need to appoint a
Communications Manager to sit in front of the Incident Manager, shielding
them from simply responding to communication enquiries rather than
actually working on the resolution.
It's also best to tell other business functions how you will update them
on progress and how frequently you will do so (even if there is no progress to report).
Use of incident support tools
There is a great danger in over-engineering the use of incident support
tools. Over-engineering here means the tooling is too clumsy or laborious
to use at the point of crisis. If others aren't familiar with the tools, they
will still ask for email / telephone / instant messenger updates anyway!
Next step: Deep Dive into Root-Cause Analysis
Once you have done your initial assessment and communicated your initial findings to the relevant business functions, it's time to dive deep to separate symptoms from root cause. You may need access to multiple test and near-production systems(*) to validate theories, so do make sure these are functional before the crisis hits.
Do ask around your team (DevOps, Release Management, Testing,
Engineering) for their input into symptoms and root cause. In many
circumstances, something similar has been seen before.
Cycle back with colleagues and partners based on status and resolution -
instant messenger is fab here!
(*) Near-production = an environment that most closely replicates your live or production environment.
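One lightweight way to keep those environments usable before a crisis is a scheduled health check. A minimal sketch, assuming each environment exposes a simple HTTP status endpoint; the environment names and URLs below are entirely hypothetical:

```python
# Minimal pre-crisis health check for test / near-production environments.
# The environment names, URLs and /health endpoint are hypothetical; adapt to
# whatever your environments actually expose. Standard library only.
import urllib.request

ENVIRONMENTS = {
    "test":      "https://test.example.internal/health",
    "staging":   "https://staging.example.internal/health",
    "near-prod": "https://nearprod.example.internal/health",
}

def check(name, url, timeout=5):
    """Return True if the environment answers its health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = resp.status == 200
    except OSError:
        ok = False
    print(f"{name}: {'OK' if ok else 'UNREACHABLE'}")
    return ok

if __name__ == "__main__":
    results = [check(name, url) for name, url in ENVIRONMENTS.items()]
    print("All environments healthy" if all(results) else "Some environments need attention")
```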
Multiple resolutions
There may be several resolutions available; think of it as a matrix of time horizon versus cost-benefit.
It can be absolutely appropriate to pursue several simultaneously, e.g. roll back a recent enhancement; test a proposed partial fix; kick it back to engineering for rework with new testing scenarios; and so on.
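To make the time horizon vs cost-benefit matrix concrete, here is a minimal, hypothetical sketch; the options, timings and assessments are invented for illustration:

```python
# A hypothetical time-horizon vs cost-benefit matrix for competing resolutions.
# Options, timings and assessments are invented purely for illustration.
options = [
    {"resolution": "Roll back the recent enhancement", "days_to_deploy": 1,
     "cost": "low",    "benefit": "stops the bleeding, but loses the feature"},
    {"resolution": "Ship a partial fix",               "days_to_deploy": 5,
     "cost": "medium", "benefit": "restores most of the metric"},
    {"resolution": "Engineering rework + new tests",   "days_to_deploy": 30,
     "cost": "high",   "benefit": "full fix, reduces risk of recurrence"},
]

# Sort by how quickly each option could be in production.
for opt in sorted(options, key=lambda o: o["days_to_deploy"]):
    print(f"{opt['days_to_deploy']:>3} days | {opt['cost']:<6} cost | "
          f"{opt['resolution']}: {opt['benefit']}")
```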