A Bird’s Eye View Via Boxplot

Published: February 15, 2017

The impact of website/app performance on the bottom line of an Internet firm is an undisputed fact (refer to our earlier blog for further discussion on the subject). Over the years, the industry has stopped treating performance as an afterthought and made it a top priority. Performance analysis, however, is easier said than done. For instance, let’s carry out a comparative performance analysis, measured via, say, webpage response time, of some of the leading airlines. The plot below shows a week-long snapshot where the aforementioned metric was sampled every 5 minutes (the data was extracted via the Catchpoint portal).

With the increasing maturity of tooling, data collection has become a commodity. However, any meaningful analysis of the plot, even a visual one, is not practical. One may wonder what would happen if one were to relax the sampling rate to contain the “too much data” problem.

The plot above corresponds to the same time period, but with a sampling period of 15 minutes. The overlap between the time series is still too heavy, making it very hard to derive any material insights. How about relaxing the sampling further?

The plot above corresponds to the same time period, but with a sampling period of 30 minutes. From the plot above we note that, on average, Alaska Airlines has the best performance and Virgin has the worst. Having said that, it is difficult to assess from the plot how often each airline experiences a performance hiccup. Concretely, digging deeper to figure out how often one’s website experiences a webpage response time of, say, >3 seconds might lead to a useful discovery regarding user churn. To this end, a common method is to analyze the probability density distribution of the metric of interest, as exemplified by the plot below (note that the plot below corresponds to the data set sampled every 5 minutes).

A lot of valuable insight can be extracted from the plot above based on the following:

Relative location of the peaks of each distribution
The spread (an indicator of variance) of the distribution
The fatness of the tails, which sheds light on the extent to which the user base is being impacted by, in the current context, outlier webpage response times
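To put a number on the tail, one can compute the share of samples above a threshold, such as the 3 seconds mentioned earlier. The sketch below is only an illustration: the function name `fraction_above` and the sample values are hypothetical and are not the airline data.

```python
def fraction_above(samples, threshold):
    """Return the share of samples strictly greater than threshold."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(1 for s in samples if s > threshold) / len(samples)


# Hypothetical response times in seconds (made up for illustration).
response_times = [1.2, 0.9, 3.4, 1.1, 5.0, 1.3, 0.8, 2.9, 1.0, 1.4]

slow_share = fraction_above(response_times, 3.0)
print(f"{slow_share:.0%} of samples exceeded 3 s")
```

Running the same function per airline gives a single comparable number for how heavy each tail is.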

Still, the probability density distribution is not conducive to comparing key statistics such as, but not limited to, the median, the first and third quartiles, and the density of outliers. The boxplot, proposed over four decades ago (see [1] and [7]), is tailor-made for this. An example illustration of a boxplot is shown below.

A boxplot is made up of five components that are carefully chosen to give a robust summary of the distribution of a dataset:

The median
The upper and lower quartiles (fourths), commonly referred to as “hinges”
The data values adjacent to the upper and lower fences, which lie 1.5 times the inter-quartile range (IQR) from the hinges
Two whiskers that connect the hinges to the fences
Anomalies, which are data points that lie beyond the fences
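The components above can be computed with a few lines of code. The following is a minimal sketch, assuming Tukey-style hinges (the median of each half of the sorted data, with the overall median included in both halves when the sample size is odd). Plotting libraries often use other quartile conventions, so their numbers may differ slightly. The function name `boxplot_stats` and the sample data are hypothetical.

```python
from statistics import median


def boxplot_stats(values, k=1.5):
    """Compute boxplot components using Tukey-style hinges (an assumption)."""
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    # Hinges: medians of the lower and upper halves of the sorted data.
    lower_hinge = median(xs[: (n + 1) // 2])
    upper_hinge = median(xs[n // 2:])
    iqr = upper_hinge - lower_hinge
    # Fences lie k times the IQR away from the hinges.
    lower_fence = lower_hinge - k * iqr
    upper_fence = upper_hinge + k * iqr
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    return {
        "median": median(xs),
        "lower_hinge": lower_hinge,
        "upper_hinge": upper_hinge,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        # Adjacent values: the most extreme data points still inside the fences.
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in xs if x < lower_fence or x > upper_fence],
    }


# Hypothetical response times in seconds (made up for illustration).
stats = boxplot_stats([1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 9.0])
print(stats["median"], stats["whisker_high"], stats["outliers"])
```

In this example the single slow sample of 9.0 seconds falls beyond the upper fence and is reported as an outlier, while the upper whisker stops at 1.7.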

The boxplot for the data set sampled every 5 minutes is shown below:

From the plot above, it is straightforward to compare the various descriptive statistics of webpage response time across the airlines. For instance, although both Southwest and United have a lower median than Delta, the latter has a lower spread (= IQR = height of the box) than the former two. In a similar vein, we note that not only does Virgin have the highest median webpage response time, it also has the highest IQR. This does not bode well for the experience of Virgin’s (potential) customers.

One of the common use cases of boxplots is to detect anomalies. Although robust anomaly detection is subject to a multitude of factors, boxplots serve as a first-cut means to filter out potential anomalies. In the case of a standard normal distribution, about 0.35% of the data points in each tail fall beyond the fences and are deemed anomalous (see below).
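The per-tail figure for the standard normal distribution can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library and assumes the usual fences at 1.5 times the IQR from the quartiles.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Quartiles of the standard normal distribution.
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

# Fences at 1.5 times the IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Probability mass beyond each fence.
lower_tail = std_normal.cdf(lower_fence)
upper_tail = 1.0 - std_normal.cdf(upper_fence)

print(f"upper fence at {upper_fence:.3f} standard deviations")
print(f"mass beyond each fence: {lower_tail:.2%} and {upper_tail:.2%}")
```

The fences land at roughly 2.7 standard deviations from the mean, so in total about 0.7% of normally distributed samples would be flagged.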

The main limitation of the boxplot is that it is primarily suited to:

Almost symmetric data
Approximately mesokurtic distribution, i.e., distributions with zero excess kurtosis

The above two assumptions do not hold in general for real world data. This is exemplified by the plot of the probability density distribution above.
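A quick way to check both assumptions on a sample is to compute its skewness and excess kurtosis. The following is a minimal sketch that uses the plain moment-based (population) estimators. These are not robust to outliers, which is part of the motivation for robust measures. The function name and the data are hypothetical.

```python
from statistics import fmean


def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (population estimators)."""
    xs = list(values)
    mean = fmean(xs)
    # Central moments of order 2, 3 and 4.
    m2 = fmean([(x - mean) ** 2 for x in xs])
    m3 = fmean([(x - mean) ** 3 for x in xs])
    m4 = fmean([(x - mean) ** 4 for x in xs])
    if m2 == 0:
        raise ValueError("all values are identical; moments are undefined")
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # zero for a normal distribution
    return skewness, excess_kurtosis


# Symmetric sample: skewness is zero.
print(skewness_and_excess_kurtosis([1, 2, 3, 4, 5]))

# Hypothetical response times with one slow sample: positive skewness.
print(skewness_and_excess_kurtosis([1.0, 1.1, 1.2, 1.3, 9.0]))
```

A skewness far from zero or a large positive excess kurtosis suggests that the standard fences will flag too many, or too few, points in one of the tails.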

One way to address the former, i.e., asymmetry, is to use medcouple – a robust metric to measure skewness of a univariate distribution. Using medcouple (MC), the whiskers of the boxplot are redefined as follows:
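A minimal sketch of this redefinition, assuming the fences of the adjusted boxplot in [6]: for MC ≥ 0 the fences are Q1 − 1.5·e^(−4·MC)·IQR and Q3 + 1.5·e^(3·MC)·IQR, and for MC < 0 they are Q1 − 1.5·e^(−3·MC)·IQR and Q3 + 1.5·e^(4·MC)·IQR. The medcouple below is a naive O(n²) version meant for small samples, and the function names and data are hypothetical.

```python
import math
from statistics import median, quantiles


def medcouple(values):
    """Naive O(n^2) medcouple; fine for small samples, slow for large ones."""
    xs = sorted(values)
    m = median(xs)
    upper = [x for x in xs if x >= m]  # ascending, ties with the median first
    lower = [x for x in xs if x <= m]  # ascending, ties with the median last
    ties = sum(1 for x in xs if x == m)
    first_tie = len(lower) - ties
    kernel = []
    for i, hi in enumerate(upper):
        for j, lo in enumerate(lower):
            if hi == lo:
                # Both values equal the median: use the sign-based tie rule.
                s = i + (j - first_tie) + 1 - ties
                kernel.append((s > 0) - (s < 0))
            else:
                kernel.append(((hi - m) - (m - lo)) / (hi - lo))
    return median(kernel)


def adjusted_fences(q1, q3, mc, k=1.5):
    """Skewness-adjusted fences, assuming the exponents given in [6]."""
    iqr = q3 - q1
    if mc >= 0:
        return (q1 - k * math.exp(-4 * mc) * iqr,
                q3 + k * math.exp(3 * mc) * iqr)
    return (q1 - k * math.exp(-3 * mc) * iqr,
            q3 + k * math.exp(4 * mc) * iqr)


# Hypothetical right-skewed sample (made up for illustration).
sample = [1, 2, 3, 5, 9, 20]
mc = medcouple(sample)
# Note: quantiles() uses its own quartile convention, not Tukey's hinges.
q1, _, q3 = quantiles(sample, n=4)
print(mc, adjusted_fences(q1, q3, mc))
```

For a right-skewed sample (MC > 0) the upper fence moves further out and the lower fence moves closer to Q1, so fewer points in the long right tail are flagged. With MC = 0 the fences reduce to the standard ones.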

A number of techniques have been proposed, see [6, 8, 9], to adapt the boxplot to different characteristics of the underlying distribution. Likewise, several variations of boxplots have been proposed, see [2]; one example is varying the width of the box based on the sample size. In a similar vein, the addition of other graphical elements to display distributional features like kurtosis [3], skewness and multimodality [4], and mean and standard error [5] has been proposed. A user of Catchpoint can plug in a boxplot plotting library of their choice in a straightforward fashion (refer to our earlier blog for this).

By: Arun Kejariwal, Ryan Pellette, and Mehdi Daoudi

Readings

[1] “Exploratory Data Analysis”, by J. W. Tukey, Addison–Wesley, 1977.

[2] “Variations of Box Plots”, by R. McGill, J. W. Tukey and W. A. Larsen, 1978.

[3] “Shape-finder box plots”, by M. Aslam and A. Khurshid, 1991.

[4] “Can the box plot be improved?”, by C. Choonpradub and D. McNeil, 2005.

[5] “The shifting boxplot”, by F. Marmolejo-Ramos and T. Tian, 2010.

[6] “An adjusted boxplot for skewed distributions”, by M. Hubert and E. Vandervieren, 2008.

[7] “40 years of Boxplots”, by H. Wickham and L. Stryjewski, 2011. http://vita.had.co.nz/papers/boxplots.pdf

[8] “A generalized boxplot for skewed and heavy-tailed distributions”, by C. Bruffaerts, V. Verardi and C. Vermandele, 2014.

[9] “A Generalized Boxplot for Skewed and Heavy-tailed Distributions implemented in Stata”, by V. Verardi. http://www.stata.com/meeting/uk14/abstracts/materials/uk14_verardi.pdf