Every company that monitors its website or application performance focuses on two key metrics: Availability and Speed. However, there is a third metric, Reliability, which is often misunderstood and in some cases ignored. Reliability measures the availability, accuracy, and delivery of a service within a time threshold.
Reliability is difficult to define and measure because it is different for each company and service. To simplify, you can think of Reliability as how consistent you are in delivering the “service”. The question it tries to answer is: can the “service” deliver the same experience, every time, to all users? The service could be a site, an application, a process, or a series of processes.
For example, for Amazon, reliability might answer the question: can a user browse the site, add an item to the cart, check out, and receive the right item on time? For an ad-serving company like DoubleClick (Google), reliability might mean: can the adserver serve the right ad to the right user on the right website, every time, within 200 ms?
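One way to turn that question into a number is to count the share of monitored requests that were available, accurate, and on time. The sketch below is only an illustration: the `Sample` record, its field names, and the `reliability` function are hypothetical, the samples are made up, and the 200 ms default simply mirrors the adserver example above.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One monitored request (hypothetical record layout)."""
    succeeded: bool      # availability: did the service answer at all?
    correct: bool        # accuracy: was the right content (e.g. the right ad) returned?
    response_ms: float   # speed: total response time in milliseconds


def reliability(samples, threshold_ms=200.0):
    """Fraction of requests that were available, accurate, and on time."""
    if not samples:
        raise ValueError("need at least one sample")
    good = sum(
        1 for s in samples
        if s.succeeded and s.correct and s.response_ms <= threshold_ms
    )
    return good / len(samples)


# Made-up samples for illustration only, not measured data.
samples = [
    Sample(True, True, 150.0),
    Sample(True, True, 180.0),
    Sample(True, False, 160.0),   # fast, but the wrong ad
    Sample(True, True, 450.0),    # right ad, but too slow
    Sample(False, False, 0.0),    # request failed
]
print(f"reliability: {reliability(samples):.0%}")
```

A request only counts as reliable when all three conditions hold, so a fast response with the wrong ad hurts the score just as much as a slow or failed one.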
Reliability is key to every business, but it is even more important to vendors that sell services affecting the performance of other products. The service of a CDN, adserver, or widget company can have a huge impact on the performance of a publisher or ecommerce site. Reliability is a key differentiator for such vendors.
To illustrate the Reliability concept, let’s compare DoubleClick’s ad-serving system with that of an unnamed ad-serving company, which we will call ABC. We define reliability here as how consistently the adserver delivers an ad to an end user. (Because we do not have access to the two products and cannot control the testing, we cannot go into more depth and measure things like whether the right ad was served on the right site.)
Let’s start by looking at the overall performance (Response measured in ms):
DoubleClick’s response time is consistently below 190 ms, as can be seen from the flat line. The other adserver’s response time fluctuates widely (between 250 and 850 ms). In other words, it is not consistent in delivering its ads.
Using the Catchpoint monitoring tool, we were able to easily rule out DNS and network connectivity issues (both were flat for both adservers) and identify the culprit: the Wait time (the time from when the request is sent to the server until the first byte of the response arrives; in other words, how long it took the server to process the request and respond).
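To make the Wait definition concrete, here is a minimal sketch that derives it from per-request timestamps. The `RequestTimeline` class and its field names are hypothetical and are not Catchpoint's data model, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class RequestTimeline:
    """Timestamps in ms since the test started (hypothetical field names)."""
    request_sent: float   # last byte of the request left the client
    first_byte: float     # first byte of the response arrived
    last_byte: float      # last byte of the response arrived

    @property
    def wait_ms(self):
        # Wait: request sent -> first byte of the response,
        # i.e. how long the server took to process and start responding.
        return self.first_byte - self.request_sent

    @property
    def receive_ms(self):
        # Time spent downloading the response body after the first byte.
        return self.last_byte - self.first_byte


# Made-up timeline for illustration only.
timeline = RequestTimeline(request_sent=40.0, first_byte=520.0, last_byte=540.0)
print(f"wait: {timeline.wait_ms:.0f} ms, receive: {timeline.receive_ms:.0f} ms")
```

When DNS and connect times stay flat and the download is short, swings in the total response time come almost entirely from `wait_ms`.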
DoubleClick Wait and Response time:
ABC Wait and Response Time:
The other company’s adserver response time fluctuates during critical business hours due to the wait time.
DoubleClick’s Performance by Hour:
ABC’s Performance by Hour:
ABC vs DoubleClick:
This leads me to believe that their servers are overloaded during business hours and cannot handle the request volume, given how their application performs. It seems the company needs to take a look at its capacity planning and also optimize its code.
Standard Deviation provides a simple way to define consistency: a large standard deviation indicates a high degree of inconsistency within the measurement population, whereas a small standard deviation indicates a higher degree of consistency.
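As a rough sketch of that comparison, the snippet below computes the mean and standard deviation of two response-time series with Python's standard `statistics` module. The two lists are made-up values chosen to resemble a flat line and a volatile one; they are not the measurements from the charts in this post, and the `summarize` helper is hypothetical.

```python
import statistics

# Made-up response times in ms, for illustration only.
steady = [182, 185, 179, 188, 184, 181]      # resembles a flat line
erratic = [260, 840, 310, 720, 255, 615]     # resembles a volatile line


def summarize(name, values):
    """Print and return the mean and population standard deviation."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    print(f"{name}: mean={mean:.1f} ms, stdev={stdev:.1f} ms")
    return mean, stdev


steady_mean, steady_stdev = summarize("steady", steady)
erratic_mean, erratic_stdev = summarize("erratic", erratic)
```

`pstdev` treats the list as the whole population; if the values are only a sample of all requests, `statistics.stdev` is the usual choice. Either way, the steadier series produces the smaller number.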
To summarize, Reliability is about how consistently your service performs. It looks at Speed, Availability, and Integrity (delivering the right ad) over time to ensure the service stays within acceptable thresholds. Reliable services have flat response lines, or low volatility.
Mehdi – Catchpoint