Product Management :: Product Marketing

26 October, 2020

Product Metrics Spikes

A friend pointed me to How to crack product metric questions in PM interviews. In essence, it discusses how a service provider should respond to a change in a metric or KPI (Key Performance Indicator).

A definition from the article:

Metric interview questions test if candidates can perform data analysis and select key metrics that matter most to the success of a product. Employers like Facebook and Google use these questions to evaluate critical thinking and communication skills.

There are two types of metric questions - one builds on the other:

Metric definition questions
  • What are the metrics that provide clarity on the health of a product or feature?
  • What metrics matter?
Metric change questions
  • Do you know what to do when a key product metric (e.g. traffic, revenue, engagement, etc.) is going up or down for no apparent reason?

Solving the Metric definition question

The article recommends using the GAME (Goals, Actions, Metrics, Evaluation) method:

Goals
  • Make sure you understand how the functionality helps the commercial goals of the company.
  • 80/20 rule - make sure you're measuring the 80% - the stuff that matters!
Actions
  • What are the key activities that indicate the functionality is achieving those goals?
  • This requires a good understanding of the product functionality.
  • Make sure that DevOps doesn't define and implement these unsupervised!
Metrics
  • Where in the value chain is a good place to position a measurement?
  • A Yes/No evaluation is a starting point, but is there another metric that is a precursor to the switch toggling from one condition to another?
  • In web performance management (ie website availability and uptime), "What goes slow, goes down." (see the Webmetrics case study)
Evaluation
  • What is normal, good or bad about a metric reading?
  • And what are the trade-offs and limitations in having this figure?
  • For example, seeing a drop in shopping basket conversions over a 24-hour period might be useless without understanding that there was a 60-minute outage from your payment provider.
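The Evaluation caveat above can be sketched in code. A minimal Python sketch (all data, timestamps and names are hypothetical, not from the article) of a conversion metric that can flag or exclude a known payment-provider outage window:

```python
from datetime import datetime, timedelta

# Hypothetical checkout sessions: (timestamp, converted?)
events = [
    (datetime(2020, 10, 26, 10, 0), True),
    (datetime(2020, 10, 26, 11, 30), False),  # fell inside the outage window
    (datetime(2020, 10, 26, 12, 15), True),
    (datetime(2020, 10, 26, 14, 0), False),
]

# Evaluation caveat: a known 60-minute payment-provider outage
outage_start = datetime(2020, 10, 26, 11, 0)
outage_end = outage_start + timedelta(minutes=60)

def conversion_rate(events, exclude_outage=True):
    """Metric: share of sessions that converted, optionally
    excluding sessions that hit the known outage window."""
    kept = [
        converted for ts, converted in events
        if not (exclude_outage and outage_start <= ts < outage_end)
    ]
    return sum(kept) / len(kept) if kept else None

print(conversion_rate(events, exclude_outage=False))  # raw reading: 0.5
print(conversion_rate(events))                        # outage-adjusted reading
```

The point is not the arithmetic but the habit: the metric definition carries its known limitations with it, so a raw dip can be read against the adjusted figure.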

Solving the Metric Change question

The article provides its own methodology.

Define the metric change
Make sure you fully understand the metric that is being presented to you:
  • What is it and what is it not?
  • You may have to get into the weeds about how the metric is measured. See Actions above.
  • For example, is 'Page Views' the total number of views, or page views by unique users? What happens if the same user logs in and views the page on two devices or in two different sessions - how is that logged?
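The 'Page Views' ambiguity above is easy to see in code. A toy Python sketch (the log format is invented for illustration) showing how one stream of events yields three very different readings of the "same" metric:

```python
# Hypothetical page-view log: (user_id, session_id, page)
views = [
    ("alice", "s1", "/home"),
    ("alice", "s2", "/home"),   # same user, second device/session
    ("bob",   "s3", "/home"),
    ("bob",   "s3", "/home"),   # refresh within the same session
]

# Three plausible definitions of "Page Views" for /home:
total_views = len(views)                                        # every hit counts
unique_users = len({user for user, _, _ in views})              # per unique user
unique_sessions = len({(user, sess) for user, sess, _ in views})  # per session

print(total_views, unique_users, unique_sessions)  # 4 2 3
```

If the dashboard quietly switched from one definition to another, you would see a "metric change" with no underlying change in user behaviour at all.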
Explore possible root-causes of the change
Use the MECE framework:
  • Mutually Exclusive and Collectively Exhaustive = identify cause(s) that are independent of all others OR understand what the cause is definitely NOT (*).
  • For a metric incident, the root-cause of the problem has to be internal OR external, as there is no other option.
    • If you think there is an internal AND an external cause, then you need to do more research. Did one external condition cascade to an internal condition?
    • Unfortunately, these can be super hard to run down!
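The internal-OR-external bucketing above can be sketched as a simple MECE check. A Python sketch (the candidate causes are invented examples, not from the article) that verifies no candidate cause sits in two buckets at once:

```python
# Hypothetical candidate causes for a traffic drop, bucketed MECE-style:
# internal vs external covers everything, and each cause goes in exactly one bucket.
causes = {
    "internal": ["bad release", "expired TLS certificate", "broken tracking tag"],
    "external": ["payment provider outage", "search ranking change", "seasonality"],
}

def is_mece(buckets):
    """Mutually exclusive: no cause appears in two buckets.
    (Collective exhaustiveness here simply means every candidate
    you have thought of is bucketed somewhere.)"""
    seen = set()
    for items in buckets.values():
        for item in items:
            if item in seen:
                return False  # overlap -> not mutually exclusive
            seen.add(item)
    return True

print(is_mece(causes))  # True
```

A cause that seems to belong in both buckets (the cascade case above) is the signal to keep digging rather than to double-count it.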
Conclude
Make sure there aren't any horrible assumptions that you haven't validated / eliminated.

* For info on MECE:

Escalation and Communication

The article hardly touches on the critical, critical point of issue escalation and communication.

After you have done your initial assessment, you need to answer the question: Do I ring the alarm bell and to whom?

Judgement call: let's assume that (a) the metric isn't a false positive and you've validated it to some significant extent, and (b) the metric matters. So, how high do you escalate the issue?
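One quick way to validate point (a) - an assumption of mine, the article doesn't prescribe a particular test - is to compare today's reading against the recent baseline before ringing any bells:

```python
from statistics import mean, stdev

def looks_anomalous(history, today, threshold=3.0):
    """Flag today's reading if it sits more than `threshold`
    standard deviations away from the recent baseline."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > threshold

daily_revenue = [100, 103, 98, 101, 99, 102, 100]  # hypothetical baseline
print(looks_anomalous(daily_revenue, 100.5))  # small wobble -> False
print(looks_anomalous(daily_revenue, 60))     # big drop -> True, worth escalating
```

This is deliberately crude (it ignores seasonality and trend), but it separates "normal wobble" from "someone should look at this" in one line of arithmetic.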

Factors to consider:

  • How many other business functions are impacted by the incident?
  • Is that metric important to them?
  • Can they do something about it, either to resolve it (eg some root-cause analysis) or to reconfigure their business function to accommodate the impact of the incident?

I recommend using this yardstick:

Level of concern → Action
Not a concern or quickly solved and implemented
  • If you are sure that impact is minimal and contained, then I recommend concentrating on solving the issue.
  • You can report that the problem was resolved with minimal impact.
  • Warning: if you get stuck in the resolution, then you quickly have to over-escalate the issue eg to 'Very concerning'
Somewhat concerning, but impact is contained
  • Keep researching until either you hit a brick wall or an unreasonable amount of time has elapsed
Very concerning
  • More business functions are required to investigate and you need to escalate it to a suitably high level to get resources devoted to it.
  • When communicating, be clear to separate known symptoms vs known conclusions to date vs guestimates as to possible causes, articulate mitigating factors and (partial) resolutions that have been applied to date.
Ridiculously high
  • This is a hard one to call. You most probably know that something basic has gone wrong and that others have spotted it and are working on a solution (and may have even resolved it).
  • However, that is a dangerous assumption. It's best to escalate it - even if it is likely to be a false positive.

The importance of communicating - and re-communicating

As soon as you communicate outside of your team, make sure that someone is knowingly and consistently appointed as Incident Manager (it could be you!). Note that if the problem rattles on, you may need to appoint a Communications Manager in front of the Incident Manager, to shield the Incident Manager from fielding communication enquiries rather than actually working on the resolution.

It's also best to tell the other business functions how, and how often, you will update them on progress (even if there is no progress).

Use of incident support tools

There is a great danger in over-engineering the use of incident support tools. Over-engineering = it is too clumsy or laborious at the point of the crisis. If others aren't familiar with the tools, then they will still ask for email / telephone / instant messenger updates anyway!

Next step: Deep Dive into Root-Cause Analysis

Once you have done your initial assessment and communicated your initial findings to the relevant business functions, it's time to dive deep to separate symptoms from root-cause. You may need access to multiple test and near-production systems (*) to validate theories. So do make sure these are functional before the crisis hits.

Do ask around your team (DevOps, Release Management, Testing, Engineering) for their input into symptoms and root cause. In many circumstances, something similar has been seen before.

Cycle back with colleagues and partners based on status and resolution - instant messenger is fab here!

(*) Near-production = an environment that most closely replicates your live or production environment.

Multiple resolutions

There may be several resolutions available: there's a matrix of time horizon vs cost-benefit.

Several resolutions might fit - and it's absolutely appropriate to pursue more than one simultaneously: eg roll back the recent enhancement; test a proposed partial fix; kick it back to engineering for a rework with new testing scenarios; etc.
