In a big data world, numbers dictate decisions, features and investments. It’s vital that we understand what the numbers are telling us, but depending on how it is analyzed, the same data can tell two completely different stories. This is called Simpson’s Paradox, and it can lead to poor decisions and costly errors.
More specifically, Simpson’s Paradox is a phenomenon in which a trend identified in a whole population reverses or disappears when the same data is examined at the sub-population level. Think about that again – conclusions drawn from an overall set of data are not necessarily indicative of the behavior of the underlying subsets. This is a problem when relying on a single overall value to summarize large sets of data.
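To make the reversal concrete, here is a minimal sketch in Python. All of the numbers are invented for illustration, and the "cached" and "uncached" segment names are hypothetical. Both segments get slower after a change, yet the overall mean improves because the mix of requests shifts toward the faster segment.

```python
# Hypothetical request counts and mean response times (ms), invented for illustration.
# Layout: {period: {segment: (request_count, mean_response_ms)}}
data = {
    "before": {"cached": (100, 100.0), "uncached": (900, 500.0)},
    "after": {"cached": (900, 120.0), "uncached": (100, 550.0)},
}


def overall_mean(segments):
    """Request-weighted mean response time across all segments."""
    total_requests = sum(count for count, _ in segments.values())
    total_time = sum(count * mean for count, mean in segments.values())
    return total_time / total_requests


overall_before = overall_mean(data["before"])
overall_after = overall_mean(data["after"])

# The aggregate says "faster" ...
print(f"overall: {overall_before:.0f} ms -> {overall_after:.0f} ms")

# ... while every individual segment says "slower".
for segment in data["before"]:
    before_ms = data["before"][segment][1]
    after_ms = data["after"][segment][1]
    print(f"{segment}: {before_ms:.0f} ms -> {after_ms:.0f} ms")
```

The overall figure drops only because far more requests land in the fast segment after the change, not because either segment improved.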
What the Paradox Looks Like
For example, consider the following chart. You just released a patch and are checking to see if there was any impact on speed in your performance monitoring data.
What do you see? Seems like everything is going well since the patch went live.
But what happens when we take a deeper peek?
One of the many things to check is whether every server updated to the release. In this example, the patch was rolled out to three servers: A, B, and C. Shading the scatterplot by responding server creates this chart:
The new chart hints that the servers may be performing differently. Trending by server makes this obvious:
Server A and Server C’s response times are steady, but Server B’s response time jumped after the release. That jump was not clear in the first scatterplot, which looked at the entire data set as one.
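The same breakdown can be sketched in code. This is only an illustration under stated assumptions: the samples are made up, the 50 ms threshold is arbitrary, and `medians_by` is a hypothetical helper, not part of any monitoring tool. It compares the overall median before and after the release with the per-server medians and lists the servers whose median rose by more than the threshold.

```python
from collections import defaultdict
from statistics import median

# Hypothetical samples: (server, period, response_ms). Values are invented.
samples = [
    ("A", "before", 200), ("A", "before", 210), ("A", "before", 190),
    ("B", "before", 205), ("B", "before", 195), ("B", "before", 200),
    ("C", "before", 198), ("C", "before", 202), ("C", "before", 207),
    ("A", "after", 195), ("A", "after", 205), ("A", "after", 200),
    ("B", "after", 320), ("B", "after", 310), ("B", "after", 330),
    ("C", "after", 201), ("C", "after", 199), ("C", "after", 204),
]


def medians_by(rows, key):
    """Group response times by key(server, period) and take each group's median."""
    groups = defaultdict(list)
    for server, period, ms in rows:
        groups[key(server, period)].append(ms)
    return {group: median(values) for group, values in groups.items()}


# Aggregate view: one median per period, all servers merged.
overall = medians_by(samples, lambda server, period: period)

# Sub-population view: one median per (server, period) pair.
per_server = medians_by(samples, lambda server, period: (server, period))

THRESHOLD_MS = 50  # arbitrary cut-off, chosen only for this illustration
servers = {server for server, _, _ in samples}
regressed = sorted(
    server
    for server in servers
    if per_server[(server, "after")] - per_server[(server, "before")] > THRESHOLD_MS
)

print("overall medians:", overall)
print("servers with a jump after the release:", regressed)
```

With this made-up data the overall median barely moves, while the per-server view singles out Server B. The median is used here because it is less sensitive to a few slow requests than the mean; other summary statistics would work for the same kind of comparison.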
Ops/DevOps vs. Simpson
Simpson’s Paradox is an excellent cautionary tale about the limitations Ops and DevOps teams hit when doing statistical analysis on performance data. While the above example may be a little overdramatized and simple to catch, Simpson’s Paradox becomes an issue when everyone stops investigating as soon as the data meets a specific desired condition, instead of diving further. How many times have we missed something important by relying on an overall value?
Take this Real User Monitoring (RUM) story, for example: the page’s speed improved as the traffic to the page increased severalfold. You may be looking at data that appears to show an improvement in response times, but when a factor that normally slows things down, like increased traffic, is involved, you need to dig deeper into the data for problems.
In Web Performance, we work with many different populations and combinations of data. In Synthetic testing, because the environment is controlled, you become familiar with the different sub-populations within your data. Slicing and dicing through edge cases and subsets becomes a habit with time. Because you’re aware of the known correlations associated with specific events, you’re able to sense when the data doesn’t quite add up.
Remember, be cautious when you read merged or aggregated data from different sub-populations. Simpson could be lurking, and he might be telling you the wrong story.