If you’re an SRE, then you already know your SLOs from your SLAs, not to mention your SLIs. But even if you’re not au fait with those acronyms, you’ll soon discover how widespread and applicable these concepts are in this installment of our IPM Best Practices Series.
We’ll examine these concepts in detail and show how external monitoring can enhance the tracking of Service Level Objectives (SLOs), leading to positive user experiences and informed decision-making.
To begin, let's distinguish between our SLAs and our SLOs.
What is an SLO?
An SLO, or service-level objective, is a specific performance goal that a vendor must meet. It serves as a clear and measurable goal that outlines the expected level of service quality. For example, an SLO may specify that a web application should maintain 99% uptime, meaning it should be accessible and operational for at least 99% of the time within a given period, such as a month or a year.
What is an SLA?
SLAs, or service-level agreements, are contracts between a vendor and a client. They define the level of performance the vendor must provide and the consequences, typically financial, if performance does not meet the defined thresholds. An SLA often consists of a collection of individual SLOs that precisely define the scope of what is guaranteed.
For instance, an SLA between a cloud provider and a client might ensure 99.9% availability of their virtual servers. If this level of availability isn't maintained, the SLA could specify a refund or service credit.
What is an SLI?
An SLI, or service-level indicator, is the exact metric that is measured, against a defined threshold, to determine whether an SLO has been violated. For example, an SLI might be the "Test Failure rate," with the requirement that it should not exceed 1% during any one-week period. In this case, the SLI tracks the failure rate of a particular test, and 1% is the maximum acceptable failure rate within a weekly timeframe.
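To make the weekly failure-rate example concrete, here is a minimal Python sketch. The names `CheckResult`, `failure_rate`, and `violates_threshold` are hypothetical and chosen for illustration only. The sketch assumes each test run is recorded with a timestamp and a pass/fail flag, and it is not how any particular monitoring product computes the metric.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CheckResult:
    """One run of a monitoring test (hypothetical record shape)."""
    timestamp: datetime
    passed: bool


def failure_rate(runs, window_end, window=timedelta(days=7)):
    """Fraction of failed runs in the window ending at window_end.

    Returns 0.0 when no runs fall inside the window.
    """
    window_start = window_end - window
    in_window = [r for r in runs if window_start < r.timestamp <= window_end]
    if not in_window:
        return 0.0
    failures = sum(1 for r in in_window if not r.passed)
    return failures / len(in_window)


# Maximum acceptable failure rate from the example above (1%).
THRESHOLD = 0.01


def violates_threshold(runs, window_end):
    """True when the one-week failure rate exceeds the 1% threshold."""
    return failure_rate(runs, window_end) > THRESHOLD
```

With 100 runs in the week, one failure sits exactly at the threshold and is not a violation; a second failure pushes the rate to 2% and is.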
What is an error budget?
An error budget is a predetermined allowance for errors or deviations in a system's performance metrics, typically specified in SLOs. It represents the margin for acceptable issues or failures within a defined timeframe. When errors accumulate and exceed the budget, it signals a breach of the agreed-upon service quality.
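As a rough illustration of the arithmetic, the following Python sketch derives an error budget from an SLO target and a number of checks. `ErrorBudget` and `error_budget` are hypothetical names, and counting the budget in failed checks (rather than, say, minutes of downtime) is an assumption made for this example.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    allowed_failures: float  # failures the SLO tolerates in the period
    spent: int               # failures observed so far

    @property
    def remaining(self) -> float:
        return self.allowed_failures - self.spent

    @property
    def breached(self) -> bool:
        # The budget is breached once observed failures exceed the allowance.
        return self.spent > self.allowed_failures


def error_budget(slo_target: float, total_checks: int, failed_checks: int) -> ErrorBudget:
    """Build a budget from an SLO target given as a fraction, e.g. 0.99."""
    if not 0 < slo_target <= 1:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    allowed = (1 - slo_target) * total_checks
    return ErrorBudget(allowed_failures=allowed, spent=failed_checks)
```

For instance, a 99% target over 30 days of one-minute checks (43,200 checks) tolerates about 432 failed checks; 100 failures leave most of the budget intact, while 500 would breach it.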
Why are SLOs important?
Here are some key reasons why SLOs are vital:
- Enhanced user experience: SLOs are usually closely tied to the end-user experience. Tracking SLOs helps organizations to ensure that their services consistently meet or exceed user expectations. This eventually results in a positive user experience.
- Reliability assessment: SLOs help track the reliability of the application or service used. When SLOs are not met, this indicates the application experienced service disruptions or performance degradation.
- SLA compliance: SLOs are part of SLAs between organizations and their customers or between different teams within the same organization. Monitoring SLOs ensures adherence to these agreements, fostering accountability and trust. When the SLOs are not met, the affected customer can claim the financial penalties defined in the SLA.
- Data-driven decision-making: Organizations can use SLO metrics to make strategic decisions, allocate resources effectively, and prioritize improvements. With the help of historical SLO data, organizations can choose between products and services based on how reliable and consistent they have been.
How do SLOs work?
The goal of SLOs is to ensure the delivery of highly reliable, resilient, and responsive services that consistently meet or surpass user expectations. These objectives are often expressed as percentages that are close to, but not quite, 100%. For example, you might aim for 99% or 99.99% availability.
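To see what such percentages mean in practice, the sketch below converts an availability target into the downtime it permits. The function name `allowed_downtime` and the 30-day default period are assumptions made for illustration.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta = timedelta(days=30)) -> timedelta:
    """Downtime permitted by an availability target over the given period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99, 99.9, 99.99):
        print(f"{target}% over 30 days allows {allowed_downtime(target)} of downtime")
```

Over a 30-day period, 99% availability allows roughly 7 hours 12 minutes of downtime, while 99.99% allows only a little over 4 minutes, which is why each additional "nine" is considerably more expensive to achieve.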
Perfection, aiming for 100%, is simply impossible in the real world. Even if issues aren't your fault, something will fail at some point. Striving for absolute reliability is not only financially costly but also unrealistic in terms of human resources and engineer expectations.
Conversely, being too unreliable also comes at a cost. Users will eventually leave for more dependable competitors if they experience frequent disruptions. So, what's the balance?
Imagine ordering a dish at your favorite restaurant and things don't go perfectly. The kitchen may get backed up, or the waiter mixes up your order. It's a minor inconvenience, easily corrected. As long as such errors are infrequent, say only 5% of the time, you'll likely keep supporting the restaurant because you're generally satisfied. This encapsulates SLOs and site reliability engineering. It's about meeting expectations most of the time while accepting occasional hiccups. SLOs help strike this balance, ensuring user happiness and business success.
How can external monitoring help SLO tracking?
External monitoring plays a crucial role in enhancing SLO tracking by providing an unbiased and comprehensive view of service performance and reliability. Customers and stakeholders often place trust in external assessments, considering them impartial and transparent. This trust contributes to increased confidence in the service provider's claims.
Ultimately, external monitoring ensures positive user experiences, reliability, SLA compliance, and well-informed decision-making, making it an invaluable tool in maintaining service quality.
SLO Best Practices
Here are some best practices for setting up SLOs:
- Begin by listing the services you wish to monitor for SLOs. For example, your DNS provider, CDN vendor, third-party tags on your website, and cloud provider.
- Determine the metrics that need to be tracked. We recommend Availability and Test time (which helps you understand latency).
- Configure tests for specific services. For example, if you want to track the SLO of Adobe Tag Manager, create a test for the tag manager JS on the page.
- Determine the goals and objectives for the selected metrics and establish violation conditions for these metrics.
- Ensure you have a dashboard/console to track the spent and remaining budgets for the period.
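The steps above can be sketched as a small set of rule definitions plus a check for violation conditions. Everything in this example is hypothetical: the `SloRule` class, the service names, and the objective values are placeholders for illustration and do not reflect any product's configuration format.

```python
from dataclasses import dataclass


@dataclass
class SloRule:
    service: str            # the service being tracked, e.g. a DNS provider
    metric: str             # "availability" (percent) or "test_time_ms"
    objective: float        # the goal for the metric
    higher_is_better: bool  # True for availability, False for test time

    def is_violated(self, observed: float) -> bool:
        """Violation condition: the observed value is on the wrong side of the objective."""
        if self.higher_is_better:
            return observed < self.objective
        return observed > self.objective


# Placeholder rules for two of the service types listed above.
RULES = [
    SloRule("dns-provider", "availability", 99.9, higher_is_better=True),
    SloRule("cdn-vendor", "test_time_ms", 800.0, higher_is_better=False),
]


def evaluate(rules, observations):
    """Map each (service, metric) pair to True if its rule is violated.

    observations: dict keyed by (service, metric) holding the observed value.
    """
    return {
        (r.service, r.metric): r.is_violated(observations[(r.service, r.metric)])
        for r in rules
    }
```

Keeping the objective and the direction of comparison in the rule itself means the same evaluation loop can serve both availability-style and latency-style metrics.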
If you already have predefined SLOs in your SLA from an internal team or vendor, you can rely on the SLOs specified in those documents to guide your setup. However, if you're starting from scratch, consider these best practices to get started effectively.
Streamlined SLO Monitoring with Catchpoint’s unified dashboard
Any SLO dashboard should provide a single view to track the SLOs of all the services. Catchpoint's SLO dashboard offers efficient monitoring and provides a unified view of service performance, simplifying the implementation of the above best practices.
Below is a snapshot of Catchpoint’s SLO dashboard and its components.
- Test: The tests created for each service are listed here.
- Objective: Name of the SLO Rule configured and mapped to the test.
- Time Range: This section provides SLO data for various time ranges.
- Budget: This section shows how much budget is spent and how much remains for the timeframe.
The above dashboard provides a single view of all the services used and the corresponding SLO for various timeframes. It enables quick identification of whether there are any breaches in the SLO budget for any of the services that are used in the delivery stack.
In case of any SLO violation, it is important to understand the reason for the outage or degradation and have the data ready to share with stakeholders. From the above burndown chart, we can see the SLO budget trend and specific time details about the SLO violations.
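For readers who want to reproduce a burndown view from their own data, here is a minimal sketch of how such a series might be computed. It assumes the budget is counted in failed checks and that failures are available per day; `budget_burndown` and `first_breach` are hypothetical helper names, and this is not a description of how Catchpoint builds its chart.

```python
from itertools import accumulate


def budget_burndown(allowed_failures, failures_per_day):
    """Remaining error budget, as a percentage, after each day.

    Values below zero mean the budget has been overspent.
    """
    if allowed_failures <= 0:
        raise ValueError("allowed_failures must be positive")
    cumulative = accumulate(failures_per_day)
    return [100 * (allowed_failures - spent) / allowed_failures for spent in cumulative]


def first_breach(burndown):
    """Index of the first day the budget goes negative, or None if it never does."""
    for day, remaining_pct in enumerate(burndown):
        if remaining_pct < 0:
            return day
    return None
```

A flat stretch in the series corresponds to days without failures, and a steep drop pinpoints the day an outage consumed most of the budget, which is the kind of detail worth sharing with stakeholders.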
The need for a reliability strategy
SLA management is not just about holding vendors accountable. It is also about your IT organization ensuring reliable services independent of any vendor failure. It's essential to have a well-defined observability strategy that complements your SLA strategy and a reliability strategy that includes real-time monitoring and alerting for both the vendor's service and your own.
Your observability strategy should have the following capabilities:
- Active 24/7 monitoring, with a frequency of one minute or faster, of external DNS providers, both within and outside your environment.
- Observing from all key geographical locations where your users are situated.
- Observing from major transit ISPs in these geographical areas.
- Monitoring critical components of digital services, including DNS, network connectivity, HTTP, web transactions, email, WebSocket, MQTT, etc.
- Monitoring essential services managed by external vendors, such as DNS, CDN, cloud, APIs, SaaS, email, etc.
- Providing real-time data and alerts based on captured information.
- Leveraging real-time APIs and integrations with other tools used in multi-vendor strategies.
Catchpoint’s IPM platform encompasses all these capabilities, allowing you to confidently meet customer expectations and hold your vendors to account.
With Catchpoint, you can:
- Monitor SLIs with neutral, third-party data to validate service delivery.
- Ensure data relevance by implementing maintenance schedules.
- Align data with business goals for effective SLA management.
- Objectively handle customer complaints with third-party SLA reports.
- Maintain extensive, long-term data for legal readiness.
Join us in the next installment of our IPM Best Practices Series as we explore API Monitoring, diving into the intricacies of API transactions and learning how to improve resilience.