With cloud computing becoming ubiquitous and the advent of IoT, the problems associated with the three Vs of Big Data (volume, velocity, and variety) will only be exacerbated. Speakers at industry conferences routinely cite the magnitude of the three Vs at their respective companies. However, there is very rarely any discussion of how to extract actionable insights from the data. This is akin to the famous lines by Samuel Taylor Coleridge in “The Rime of the Ancient Mariner”:
Water, water, everywhere,
Nor any drop to drink.
A common challenge faced in data analysis is, in signal processing parlance, how to filter noise from the underlying signal. Besides this, in production, there are many other data fidelity issues, such as:
- Data collection issues
- Missing data
- Exogenous factors such as autoscaling or a change in incoming traffic
- Concept drift: changes in the conditional distribution of the output (i.e., target variable) given the input (input features), while the distribution of the input may stay unchanged.
The following plot exemplifies an observed signal (in blue) with noise and the underlying signal without noise (in red).
Unlike the example above, which is amenable to visual analysis, in most cases the noise cannot be separated from the signal by eye. More importantly, given the sheer number of time series, visual analysis is not practical. It is imperative to carry out data analysis in an algorithmic fashion.
Further, removing the noise from the observed signal is not an end goal in itself. In a production setting, what is important is to extract actionable insights from the signal; otherwise the analysis becomes an academic exercise. Concretely, let’s look at the time series plot (see below) of Wait Time for healthcare.gov over a period of 12 days. From the plot we note that performance is good during the night (i.e., Wait Time is low) but slow during the day (i.e., Wait Time is high). This is informative but not actionable, as normalcy has not been defined.
While there are spikes in Wait Time in this particular instance, one must first define the point at which a spike is indicative of a capacity issue. If the upper bound on Wait Time were, say, 120 ms, then one could deduce from the data that there may be capacity issues, as there are multiple instances where Wait Time exceeds 120 ms.
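Once an upper bound has been agreed upon, the check itself is mechanical. The snippet below is a minimal sketch of that idea: the function name `find_breaches` and the sample values are hypothetical (they are not taken from the healthcare.gov data), and the 120 ms default simply mirrors the illustrative bound used above.

```python
def find_breaches(wait_times_ms, upper_bound_ms=120.0):
    """Return (index, value) pairs where Wait Time exceeds the upper bound."""
    return [(i, w) for i, w in enumerate(wait_times_ms) if w > upper_bound_ms]


# Hypothetical Wait Time samples in milliseconds, for illustration only.
samples = [40.0, 55.0, 130.0, 95.0, 180.0, 60.0]
breaches = find_breaches(samples)
print(f"{len(breaches)} of {len(samples)} samples exceed the bound: {breaches}")
```

In practice one would likely run such a check on the filtered signal rather than on raw samples, so that isolated noisy readings do not trigger an alert.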
Likewise, from the plot below we note a gradual increase in the value of the three metrics.
This increase can be automatically detected via a simple linear regression. A consistent increase in the three metrics is actionable for the operations team.
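As a rough illustration of how such a trend could be flagged, the sketch below fits an ordinary least squares line to one metric against its sample index and compares the slope with a threshold. The names `fit_line` and `is_increasing`, the `min_slope` threshold, and the sample values are all assumptions made for this example; a production check would probably also test whether the slope is statistically significant.

```python
def fit_line(values):
    """Ordinary least squares fit of values against their index 0..n-1.

    Returns (slope, intercept).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def is_increasing(values, min_slope):
    """Flag a metric whose fitted slope exceeds min_slope (units per sample)."""
    slope, _ = fit_line(values)
    return slope > min_slope


# Hypothetical metric with a gradual upward drift plus some noise.
metric = [100.0, 103.0, 101.0, 106.0, 108.0, 107.0, 112.0]
slope, intercept = fit_line(metric)
print(f"slope per sample: {slope:.2f}, increasing: {is_increasing(metric, 0.5)}")
```

The same function can be applied to each of the three metrics in turn.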
On the other hand, there are many examples where the data may shed light on “interesting” insights but is not actionable. For instance, the plot below shows the Response Time of Google.com from various 4G wireless nodes in New York City before, during, and after Memorial Day weekend. As you can see, the change in the performance of Google.com has nothing to do with Google itself; it stems from the wireless networks being saturated.
Though the comparative analysis is of potential use to the end user, no immediate actionable insights can be gleaned from the data.
How to remove noise?
Over multiple decades, a large amount of work has been done in many different fields (such as, but not limited to, signal processing, statistics, and information theory) to improve the signal-to-noise ratio (SNR). Noise reduction plays a key role in a large set of applications beyond operations, e.g., image, audio, and video processing.
A wide variety of filters have been proposed to address noise reduction. Broadly speaking, filters can be classified into two categories:
- Low pass filter: It passes signals with a frequency lower than a certain cut-off frequency and attenuates signals with frequencies higher than the cut-off frequency. In the context of a time series, a simple moving average (SMA) exemplifies a low pass filter.
The red line in the plot above is the SMA of the original signal shown in blue. From the plot we note that SMA filters out most of the noise and approximates the underlying signal (shown earlier in the blog) very well. Note that, by construction, there’s a lag between SMA and the underlying signal.
- High pass filter: It passes signals with a frequency higher than a certain cut-off frequency and attenuates signals with frequencies lower than the cut-off frequency.
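The two categories above can be sketched in a few lines. The example below is a minimal illustration, assuming a trailing window (which is what introduces the lag mentioned above); the function names and the `window` parameter are choices made for this sketch. The high pass variant shown here is simply the original signal minus its SMA, which is one crude way to keep the fast-changing component, not the only one.

```python
def simple_moving_average(signal, window):
    """Trailing SMA: mean of the current sample and the window - 1 before it.

    The first window - 1 outputs are None because the window is not yet full.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    out = [None] * len(signal)
    running = 0.0
    for i, x in enumerate(signal):
        running += x
        if i >= window:
            running -= signal[i - window]  # drop the sample leaving the window
        if i >= window - 1:
            out[i] = running / window
    return out


def high_pass_residual(signal, window):
    """Crude high pass: what remains after subtracting the low pass (SMA)."""
    sma = simple_moving_average(signal, window)
    return [None if s is None else x - s for x, s in zip(signal, sma)]


ramp = [1.0, 2.0, 3.0, 4.0, 5.0]
print(simple_moving_average(ramp, 3))  # [None, None, 2.0, 3.0, 4.0]
print(high_pass_residual(ramp, 3))     # [None, None, 1.0, 1.0, 1.0]
```

On the ramp, the SMA trails the input by one sample, which is the lag of a trailing window of length 3. A larger window smooths more aggressively at the cost of a longer lag.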
Depending on the requirement, either linear filters (such as SMA) or non-linear filters (such as the median filter) can be used. Commonly used filters include the Kalman filter, Recursive Least Squares (RLS), Least Mean Squares (LMS), and Wiener-Kolmogorov filters.
Noise reduction can be achieved in both the time domain and the frequency domain. In the latter case, the Fourier Transform or Wavelet Transform of the observed signal is obtained, and an appropriate filter is subsequently applied.
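To make the linear versus non-linear distinction concrete, the sketch below applies a centered median filter to a flat signal with a single spike. The function name, the odd `window` requirement, and the edge handling (the window shrinks near the ends) are assumptions of this example rather than a standard definition.

```python
from statistics import median


def median_filter(signal, window):
    """Centered median filter with an odd window length.

    Near the edges the window shrinks to the samples that exist, so an edge
    value may be the median of an even number of samples (their midpoint).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    n = len(signal)
    return [
        median(signal[max(0, i - half): min(n, i + half + 1)])
        for i in range(n)
    ]


# Hypothetical flat signal with one outlier.
spiky = [1.0, 1.0, 100.0, 1.0, 1.0]
print(median_filter(spiky, 3))  # [1.0, 1.0, 1.0, 1.0, 1.0]
```

The median discards the isolated spike entirely, whereas an averaging filter of the same length would spread it across three neighbouring outputs. This is why median filters are often preferred when the noise consists of occasional outliers.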
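The frequency domain route can be sketched as: transform, zero out the unwanted frequency bins, and transform back. The example below is only an illustration under stated assumptions: it uses a naive O(n²) discrete Fourier transform written with the standard library so that it is self-contained (in practice an FFT routine from a numerical library would be used), and the function names and the `cutoff_bin` parameter are hypothetical. The Wavelet Transform variant is not shown.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform, O(n^2); fine for short signals."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def fourier_low_pass(signal, cutoff_bin):
    """Zero every frequency bin above cutoff_bin, then transform back."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        # For a real signal, bins k and n - k carry the same frequency.
        if min(k, n - k) > cutoff_bin:
            spectrum[k] = 0
    return [v.real for v in idft(spectrum)]


# Synthetic example: a slow sine (1 cycle) plus a fast one (8 cycles).
n = 32
slow = [math.sin(2 * math.pi * t / n) for t in range(n)]
fast = [0.5 * math.sin(2 * math.pi * 8 * t / n) for t in range(n)]
observed = [s + f for s, f in zip(slow, fast)]
filtered = fourier_low_pass(observed, cutoff_bin=2)
print(max(abs(f - s) for f, s in zip(filtered, slow)))  # close to zero
```

A hard cut-off like this works cleanly on the synthetic signal because both components fall exactly on frequency bins. On real data it tends to introduce ringing, so a filter with a gradual roll-off is usually a better choice.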
As mentioned in a prior blog about the performance impact of ad blocking, a large set of metrics needs to be monitored to quantify ad blocking performance and to guide optimization. Data fidelity is key to extracting actionable insights with respect to performance characterization and optimization. This in turn requires filtering the noise from the underlying signal.
By: Arun Kejariwal and Mehdi Daoudi