Monitoring SLA Performance
In this article you will learn why SLAs matter and how, if not abused, they can be a great tool to align clients and suppliers around facts and data.
The SLA (Service Level Agreement) topic is already well covered: books have been written and libraries filled on the subject, so I will try to avoid writing a book of my own. (BTW, my favorite reference is “Foundations of Service Level Management” by Rick Sturm and Wayne Morris.)
In my experience SLAs are very important, and (if not abused) they can be a great tool to align Clients and Suppliers around FACTS and DATA.
While VP of QoS at DoubleClick I was responsible for reviewing, approving and monitoring our external and internal SLAs; we had over 400 customers with SLAs, and each had at least 2 or 3 objectives!
One of my first tasks while setting up the QoS group was finding an effective way to measure the performance of third-party providers and to employ that technique in SLAs.
The challenge was that clients would look at their site performance, notice spikes, and attribute them to our system, while our own performance chart showed no problems. We couldn’t correlate the two charts, so we couldn’t agree on when a problem was ours and when it was someone else’s.
Working with an amazing statistician, Matt Briggs, we created a methodology we called Differential Performance Measurement (DPM).
The philosophy behind DPM was to measure, as accurately as possible, the performance and availability of DoubleClick’s services and their impact on our customers’ pages. That way we were responsible and accountable for the things we had control over, and there was no finger pointing.
The methodology added context to the measurements. DPM introduced clarity and comparison, removing absolute performance numbers from the SLAs.
Recipe for Differential Performance Measurement (Example with an Advert):
1- Take two pages, one without ads and one with one ad call.
- Page A = without ad
- Page B = with ad
2- Make sure the pages do not contain any other third-party references (CDNs etc.)
3- Make sure the page sizes (in KB) are the same
4- “Bake” – Measure response times for both pages and you get the following metrics:
- Differential Response (DR) will be (Response Time of page B) minus (Response Time of page A)
- Differential Response Percentage (DRP) = DR / A. (e.g. If Page A is 2 seconds, and Page B is 2.1 seconds, DR is 0.1 second, and DRP is 0.1/2=0.05 or 5%)
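The two metrics above are simple enough to sketch in a few lines of Python. This is only a minimal illustration of the arithmetic, not the tooling used at DoubleClick; the function name `differential_response` and its parameters are made up for this example.

```python
def differential_response(page_a_seconds: float, page_b_seconds: float) -> tuple[float, float]:
    """Return (DR, DRP) for baseline page A and page B, which adds one third-party call."""
    if page_a_seconds <= 0:
        raise ValueError("baseline response time must be positive")
    dr = page_b_seconds - page_a_seconds   # Differential Response, in seconds
    drp = dr / page_a_seconds              # Differential Response Percentage, as a fraction
    return dr, drp


# The worked example from the recipe: Page A = 2 s, Page B = 2.1 s.
dr, drp = differential_response(2.0, 2.1)
print(f"DR = {dr:.2f} s, DRP = {drp:.1%}")  # DR = 0.10 s, DRP = 5.0%
```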
With this methodology we were able to eliminate noise introduced by:
- Internet-related issues that no one has control over (fiber cuts etc.)
- Monitoring agent issues (which raises the separate topic of monitoring your monitoring services)
- Other third parties
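To see why the noise drops out, consider a small sketch with invented numbers (they are not real measurements). It assumes that a shared problem, such as a network slowdown, adds roughly the same delay to both pages, so the delay cancels when Page A is subtracted from Page B.

```python
# Hypothetical response times in seconds at five sample points.
# At samples 3 and 4 a shared network problem adds 1.5 s to BOTH pages.
page_a = [2.0, 2.0, 3.5, 3.5, 2.0]  # no third-party content
page_b = [2.1, 2.1, 3.6, 3.6, 2.1]  # same page plus one ad call

# The shared delay appears in both series, so it cancels in the difference.
dr_series = [round(b - a, 3) for a, b in zip(page_a, page_b)]
print(dr_series)  # [0.1, 0.1, 0.1, 0.1, 0.1]
```

Note that the ratio DRP does shift when the baseline itself moves (0.1/3.5 is smaller than 0.1/2.0), so under this assumption it is the plain difference DR that stays flat.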
Example:
Page A contains no third party content. Page B contains an Ad tag (or a CDN image, or a widget… it could be anything).
Raw Data:
We can graph this as follows:
- Scenario 1: The ad serving company is having performance problems and negatively impacting the customer’s site performance. The vendor breached the SLA threshold from Time 4 to Time 8.
- Scenario 2: The web site is having performance problems that are not caused by the ad serving company.
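The two scenarios can be told apart programmatically. The sketch below uses made-up series and a hypothetical SLA threshold of 10% DRP; neither comes from the original raw data, and the helper name `breached_samples` is invented for this illustration.

```python
SLA_DRP_THRESHOLD = 0.10  # hypothetical: the ad may add at most 10% to Page A


def breached_samples(page_a, page_b, threshold=SLA_DRP_THRESHOLD):
    """Return the 1-based sample numbers where DRP exceeds the threshold."""
    return [
        t
        for t, (a, b) in enumerate(zip(page_a, page_b), start=1)
        if (b - a) / a > threshold
    ]


# Scenario 1 (made-up numbers): the ad call slows down at Times 4-8.
s1_a = [2.0] * 10
s1_b = [2.1, 2.1, 2.1, 2.9, 3.0, 3.1, 3.0, 2.8, 2.1, 2.1]

# Scenario 2 (made-up numbers): the site itself slows down, the ad does not.
s2_a = [2.0, 2.0, 2.0, 3.0, 3.2, 3.1, 3.0, 2.9, 2.0, 2.0]
s2_b = [a + 0.1 for a in s2_a]

print(breached_samples(s1_a, s1_b))  # [4, 5, 6, 7, 8]
print(breached_samples(s2_a, s2_b))  # []
```

In Scenario 2 both pages get slower together, so the differential stays small and the vendor is not flagged, even though the raw Page B chart alone would show a spike.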
At the end of the day, the buyer and supplier can have a more meaningful dialogue using DPM results. There, that’s much better.
At Catchpoint we have developed an innovative solution for monitoring third-party content on your sites. To learn more, contact us (info (at) catchpoint.com).
I would be delighted to hear your comments on this methodology and, if you wish, share your experiences with others.
Mehdi.