Advancements in cloud computing, edge networks, and distributed systems and architectures have added complexity to many digital services, and service level agreements (SLAs) are no exception. That’s why we’ve compiled this ultimate list of resources for managing, monitoring, and maintaining SLAs.
We selected entries from professionals with first-hand experience in SLA management and compiled our list for readers who are brand new to SLAs as well as more advanced readers looking for tips and best practices. While this list is not exhaustive, entries were selected based on the following criteria:
- Entries provided information suitable to today’s IT landscape
- Entries highlighted ways that SLAs ensure trust and accountability
- Entries focused on monitoring’s role in SLA management
We hope you find these service level agreement resources informative and inspiring!
SLA Best Practices
Clear SLAs are all alike; every unclear SLA is unclear in its own way. SLAs should express consumer objectives in a language accessible to all parties. Unfortunately, that’s not always the case. The entries in this section make a strong case for revising outdated attitudes on SLAs while making clear that the ultimate aim is the delivery of quality service levels that ensure quality end-user experiences.
If you’re looking to implement SLOs at your organization, then Google’s Customer Reliability Engineering team just made your job a lot easier. This kit includes all material—slide deck, script, facilitator handbook, and participant workbook—needed to run an interactive workshop for a broad audience on the value of measuring internal and external SLOs and SLIs.
by Theo Schlossnagle
Published in Seeking SRE: Conversations About Running Production Systems at Scale (2018), this chapter is for anyone who is fed up with meaningless SLA dashboards and reports displaying averages, minimums and maximums, and percentiles. (Or for anyone who needs an introductory crash course on statistical analysis and data visualization.) While identifying the shortcomings of popular methods for quantifying availability—specifically, marking time quanta and counting transaction failures—Schlossnagle emphasizes the role histograms can play in measuring system behaviors, allowing us to set better SLAs based on the full latency distribution rather than on performance at q(0.99) alone. For a more technically focused version of this chapter, check out Heinrich Hartmann’s “Latency SLOs Done Right”.
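The distribution-based approach can be sketched in a few lines. This is a hypothetical illustration of our own, assuming raw latency samples rather than the streaming histograms Schlossnagle actually recommends:

```python
def slo_attainment(latencies, threshold):
    """Fraction of requests completing within the SLO threshold.

    Working from the distribution of all samples, rather than a single
    percentile, shows how many users actually met the target.
    """
    within = sum(1 for latency in latencies if latency <= threshold)
    return within / len(latencies)

# Invented sample latencies in seconds:
latencies = [0.05, 0.07, 0.08, 0.12, 0.09, 0.30, 0.06, 0.11, 0.95, 0.07]

# Distribution-based SLO: "90% of requests complete within 300 ms"
print(slo_attainment(latencies, 0.3))  # 0.9 -> SLO met
```

The same check against a tighter 200 ms threshold would report 0.8, which is exactly the kind of shift a single q(0.99) number can hide.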
By Rick Strum, Wayne Morris, and Mary Jander
Bringing years of expertise to the table, Strum, Morris, and Jander present an accessible text that speaks to the wide-ranging audience involved in SLA management, from service providers and consumers to IT departments and business operations to procurement officers and financial analysts. Several chapters dedicated to SLA reporting—what to measure, where to monitor, how to report and review, etc.—emphasize a point Catchpoint shares as well: SLAs need to be meaningful to your customers.
“Service level management is not simply reacting to problems and reporting the achieved service levels. Properly implemented, service level management includes proactively developing the right procedures, policies, organization structure, and personnel skills to improve service quality and to ensure that users and the business are not impacted by any service difficulties” (182).
—Foundations of Service Level Management
By Reza Koranki
This entry provides a brief history of the rise, fall, and resurrection of SLAs over the last forty years. From the original equipment manufacturer (OEM) warranties of the 1980s to the rise of advanced replacement SLAs in the 1990s, to the impact that the decline in hardware had in the 2000s, and, finally, to the rise of cloud and third-party SLAs across the 2010s, this piece is a great overview for anyone interested in the intellectual and consumer history of SLAs.
By Mehdi Daoudi
This post recounts how Catchpoint’s co-founder and CEO implemented and scaled service level management across DoubleClick after paying more than $1 million in penalties for SLA violations to a single customer in 2001. Focused on end-user experience-based SLAs, this entry recounts centralizing current SLAs in one database, simulating daily risk of SLA breach using historical data, running “what-if scenarios” to accurately calculate the impact of performance on revenue, and creating a methodology for monitoring services from third-party providers called Differential Performance Measurement. This is a great post for anyone who has recently paid out SLA penalties and is rethinking their approach to service level management.
“Some people do not believe in SLAs. Bad SLAs are the ones that some companies put in the contract without real penalties or real measurements. I always see SLAs that guarantee 0% packet loss, but if you ask how it’s measured, you quickly realize that it’s useless. This is exactly what gives SLAs a bad reputation.”
—A Practical Guide to SLAs
Secrets of Service Level Management
By Ami Nahari
SLA management does not end when a customer and service provider sign a formal document. In fact, for Ami Nahari, that’s merely the first step. In Secrets of Service Level Management, Nahari presents a 5-part process for service level management (SLM) that includes: 1. SLA management, 2. Monitor and Report, 3. Service Review, 4. Service Improvement Plan, and 5. Continual Process Improvement. Unlike other entries on this list, Nahari focuses more on ways to implement processes—one reason why the book is structured in accordance with ITIL lifecycle phases—which can be customized according to specific needs. This is a great resource for new SRE teams at enterprise organizations tasked with implementing cross-department SLOs and SLIs and (re)building a culture of performance.
By Vladimir Fedak
This article looks at how the last ten years of IT outsourcing have led to the renewed importance of SLAs. What’s great about this post is that Fedak outlines the basic provisions to look for in your SLA with an IT service provider, such as exact timings, guaranteed variables, the procedure for incident report creation, and much more.
By Ellie Mirman
Wondering what to include in your SLAs? Whether you’re a service provider or consumer, HubSpot’s ultimate guide provides a comprehensive overview of the various elements to include, what each covers, and examples of different SLAs from some of today’s leading organizations. A great place to start if you’re new to SLAs!
By Alan Nance
Is it time to rebrand SLAs? That’s the case put forth in this post. But this isn’t merely a marketing gimmick. Nance makes a nuanced point that “[m]easuring what you can is not the same as doing what you must,” which leads to a thought-provoking piece on pivoting away from “managing IT services” and toward “managing the consumer’s experience of IT” with what he calls “experience-level agreements (XLAs).” Whether you jump aboard the XLAs bandwagon or not, one thing is for sure: the times they are a-changing when it comes to measuring and monitoring SLAs.
Internal and External SLAs
The value of external, customer-facing SLOs and SLIs is easy enough to understand. Internal, department-specific SLOs and SLIs can be a little tougher to grasp, especially for enterprise organizations with a small (or non-existent) SRE presence. This section’s entries each deal with different aspects of SLAs, from introducing internal SLOs and SLIs to increase operational efficiency, to calculating error budgets to prioritize releases and deployments. These articles provide tactical tips from real-world scenarios, offering insights useful for SREs, DevOps, system administrators, NOC teams, and many more!
By Stephanie Overby
Application sprawl has led to the exponential rise of shadow IT across enterprise organizations, but this article licensed by Bloomberg Professional Services outlines a plan of action for IT departments to regain control. The article introduces a 4-step SLA framework for applications managed outside the IT department. We love the recommendation to monitor shadow IT alongside IT-procured services so the overall value each service provides to the business can be evaluated by various stakeholders. This method holds managers and department heads accountable, while also ensuring the best tools and solutions are in service.
By Eyal Arazi
The 2019 AWS DDoS attack, as well as the year-over-year rise in DNS attacks, has many asking what level of protection their SLAs provide against these popular forms of cyber-attacks. However, when an attack occurs, many customers realize vendor promises are merely marketing promotions as SLAs prove ineffective in recouping losses. That’s why you need to ask the six questions presented in this post—and insist on the six performance indicators detailed as well—before signing any SLAs with DDoS clauses to ensure the mitigation of future DDoS attacks.
By Krishnan Raman and Joey Salacup
Finding the root cause of performance issues in big data pipelines is the digital equivalent of finding the proverbial needle in a haystack. Published on the LinkedIn Engineering blog, this entry shines a light into the black box of data pipelines. While SLAs and SLOs aren’t the focus of the piece, we selected this entry because it demonstrates ways of thinking about and testing with SLAs. More specifically, the authors address the growing need to ingest ever-larger datasets and publish them in near real time, and the monitoring required to pinpoint and detect performance bottlenecks. These “under the hood” SLO optimizations result in reductions of MTTI and MTTR for your on-call team if and when issues arise.
“As data at LinkedIn continues to grow and the number of teams depending on these datasets increases, we must be able to measure dataset SLO and publish these datasets in a timely manner. Monitoring and alerting data pipelines in real-time has been instrumental for pinpointing bottlenecks and helping the on-call engineer identify and triage issues without spending time digging into the logs.”
—An Inside Look at LinkedIn’s Data Pipeline Monitoring System
By Marc Alvidrez
This must-read chapter from Site Reliability Engineering: How Google Runs Production Systems (2016) reframes conventional thinking around SLAs, SLOs, and SLIs to make the case for rethinking service level agreement management as a form of risk management. Instead of measuring availability with a time-based metric, Google SREs define availability using request success rate to approximate unplanned downtime from the end user’s perspective. This method of quantifying availability is also amenable to backend systems and reduces the mean time to detect (MTTD) the root cause of errors and deviations across the frontend or backend before users are impacted.
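The request-success-rate definition of availability is simple to express. Here is a minimal sketch (the request counts are invented for illustration):

```python
def availability(successful_requests, total_requests):
    """Availability as request success rate rather than a time-based metric.

    This approximates unplanned downtime from the end user's perspective:
    every failed request counts, regardless of when it happened.
    """
    return successful_requests / total_requests

# A service that served 249,990 of 250,000 requests successfully:
print(availability(249_990, 250_000))  # 0.99996
```

Unlike uptime percentages, this metric applies just as naturally to a backend batch system, where "up" and "down" are harder to define than "request succeeded" and "request failed."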
By Dmytrii S. Shchadei
Similar to other entries in this post, this article offers a framework for service monitoring. However, this entry’s four pillars of service monitoring—compute infrastructure, application monitoring, dependencies monitoring, and network infrastructure—form a conceptual framework that can be scaled to suit your needs, whether you’re a small startup or an enterprise organization. What we like about this post is its explanation of application-level monitoring. While application monitoring answers the question, “Is my service running?”, the author unveils the complexity behind this question by approaching it from the perspectives of Business KPIs, End User Experience, and, finally, SLAs. Needless to say, the answer takes into account a lot more than uptime.
By Owen Sullivan
Published on the Workday Technology blog, this entry provides insights into the monitoring practices of Workday Developers who were able to increase their availability SLA to an industry-leading 99.7% earlier this year. The growing number of product and feature releases made it increasingly difficult for developers to set alerts when internal SLO thresholds were exceeded, which increased the risk of missing external customer SLAs. Workday developers turned to our friends at BigPanda for help. Applying machine learning algorithms to performance data, Workday was able to correlate alerts into insight-rich incidents to reduce MTTD and MTTR across their growing infrastructure.
In addition to noting the benefits of internal SLAs—managing expectations, increasing productivity, prioritizing initiatives, etc.—this entry provides practical tips on what to consider and who to include when introducing the concept of internal SLAs to your organization.
By Lilia Gutnik
From our friends at PagerDuty, this post does a great job displaying just how many points of failure exist in a “single service,” which, in reality, is built from multiple components. At the same time, the author warns of the risks at stake in only focusing on component performance: “Each component needs some level of telemetry and visibility, but that doesn’t mean you need to monitor and alert on every component. Doing so will cause you to focus on component health when you should be focusing on the overall health of the service instead.” In addition, we recommend this entry because of its honesty: defining internal SLAs, SLOs, and SLIs is hard work that will need to be refined time and time again.
By Robert Sturt
This entry offers enterprise organizations using SD-WAN for internet connectivity advice on what to look for (or what to insist on) in their SLAs with providers, such as measurements of latency, packet loss, and network uptime. We love this piece because of its emphasis on using reporting capabilities to customize internal SLAs.
“Instead of using latency and jitter figures, IT teams can determine which SLA factors best reflect how to meet business demands. They can create internal SLAs based on factors like connectivity origin, type, and conditions. SD-WAN also presents the unique opportunity for IT teams to create per application, per department or even per user SLAs.”
—SD-WAN and SLAs
By Splunk + VictorOps
Our partners over at Splunk + VictorOps provide some of the best content that focuses—as the tagline promises—on making on-call suck less. While it mentions SLAs and SLOs only in passing, this entry provides readers with an accessible history of the rise of Site Reliability Engineering (SRE) in relation to DevOps. More specifically, the article identifies points of confluence and divergence between the practices in a productive way to bypass long-standing impasses—prioritizing speed vs. reliability, rapid deployment vs. system resilience, etc.—and instead focus on the ways in which SREs and DevOps teams complement one another.
By Charity Majors
What are the internal, non-customer focused benefits of SLOs? This article details the ways in which SLOs resolve, or, better yet, prevent miscommunications, interferences, and roadblocks between individuals and teams or between different teams across an organization. This thought-piece underscores the communicative value of an SLA in the author’s presentation of all too common scenarios, from executive stakeholders micromanaging team roadmaps to the perfectionists on your engineering team tinkering with new features instead of shipping them.
By Rushabh Doshi
In translating a generic prioritization framework into the language of assurance windows for fixing bugs, this entry uncovers a shared middle ground between engineers, quality assurance, and SREs, where long-standing tensions between continuous delivery and continuous improvement can be renegotiated in terms of error budgets and technical debt.
“Regardless of how bad things are today, you can institute a framework to pay off your bug debt, while building and pushing features. An SLA system has built-in checks and balances to slow down feature development when things get serious and provide natural incentives for teams to pay attention to the quality of their software, leading to more productive and happier engineering teams in the long run.”
—Software Quality, Bugs, and SLAs
By Tammy Butow
We came for the title pun but stayed for the terrifying stories centered around problems encountered from not verifying your monitoring strategy, or when you don’t monitor the monitors. This article is a great introduction to chaos engineering and its controlled injection of failure into your systems in order to test your monitoring and alerting strategy. In presenting six all too common monitoring mistakes, the author makes the case for gamedays and ‘what-if’ scenario tests to check whether or not your monitoring strategy will ensure SLAs are maintained.
Common SLA Metrics
SLAs get all the attention, when, in reality, what gets managed, monitored, and measured are service level objectives (SLOs) and service level indicators (SLIs). The question, then, is how to select the SLOs and SLIs best suited to your or your customers’ needs? The entries in this section provide answers to that and much more.
[A Brief History of High Availability](https://www.cockroachlabs.com/blog/brief-history-high-availability/)
By Sean Loiselle
This entry covers the last 30 years to trace a change in attitude from a time when “[a]vailability was desirable, but not something to which we felt fundamentally entitled.” This article offers a great explanation of the long-lasting impact and influence that developments in database replication and availability have had on user behaviors and expectations. By the end, you may be rethinking your availability SLOs!
By Arnaud Lawson
In this presentation from SRECon 2019, Lawson shares insights from his experience of defining and measuring SLIs and SLOs when launching a new service at Squarespace. This entry is great for SREs and DevOps professionals new to managing projects based on internal SLOs and error budgets, as well as those looking to implement a culture of performance at their company. Plus, this is a great example of why setting SLOs and SLIs that best reflect user experiences is a must.
By Liz Fong-Jones, Kristina Bennett, Daniel Quinlan, Gwendolyn Stockman, and Stephen Thorne
This now-legendary SREcon presentation is the perfect introduction— or refresher! —for anyone tasked with setting SLOs. The authors prepared an SLO Workshop (and shared the material in a .pdf) that begins by breaking down the math behind the infamous “9s” to reveal the hidden costs and risks of downtime permitted in various reliability metrics. What’s great about this post is that the authors present different types of SLIs for measurements beyond uptime such as availability, latency, quality, freshness, coverage, and correctness. And to make the abstract concrete, the document ends with a fictional case study exemplifying previously covered practices in a real-world scenario.
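The "9s" math the workshop walks through reduces to a one-line calculation. This sketch is our own, not from the workshop materials, and shows the downtime each availability target permits over a 30-day month:

```python
def allowed_downtime(slo, window_minutes=30 * 24 * 60):
    """Minutes of downtime permitted by an availability SLO over a window.

    The default window is a 30-day month (43,200 minutes).
    """
    return (1 - slo) * window_minutes

# Each extra "9" cuts the permitted downtime by a factor of ten:
for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {allowed_downtime(target):.2f} min/month")
```

Seeing that a five-nines target leaves roughly 26 seconds of downtime per month makes the hidden cost of each additional "9" concrete before anyone signs up for it.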
By Benjamin Cane
While not explicitly about SLAs, this post from American Express Technology examines the design strategy for platforms requiring high scalability and high availability. One way to achieve these objectives is to reduce reliance on, or entirely remove, the database. Read the post to learn the pros and cons presented through a variety of hypothetical use cases and learn more about the “shared-nothing architecture” used by American Express’s engineering team for its currency-conversion API.
By Charlie Taylor
This Blameless interview with Tyler Wells, Director of Engineering at Twilio, covers the process implemented to achieve five 9s availability. A great post highlighting the different roles Empathy, Chaos Engineering, and an Operational Maturity Model (OMM) can play in defining and achieving customer-centric SLAs.
By Steven Thurgood, David Ferguson, Alex Hidalgo, and Betsy Beyer
Published in The Site Reliability Workbook (2018), this chapter expands on ways to leverage SLOs and Error Budgets to prioritize SRE and developer work. The chapter includes sections on measuring SLIs, establishing an Error Budget Policy, and decision making with SLOs, all illustrated with practical examples. We especially enjoyed the sections on time window considerations and modeling dependencies, which show ways SLOs can reduce calculation mistakes before you sign availability SLAs.
“An SLO of 100% means you only have time to be reactive. You literally cannot do anything other than react to < 100% availability, which is guaranteed to happen. Reliability of 100% is not an engineering culture SLO—it’s an operations team SLO.”
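The error-budget arithmetic the chapter builds on can be sketched in a few lines. This is a simplified, request-based model of our own; the chapter itself covers several budgeting approaches:

```python
def error_budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget left over a measurement window.

    The budget is (1 - SLO) * total requests; each failed request
    spends one unit of it. A negative result means the SLO is blown
    and feature work should slow down in favor of reliability work.
    """
    budget = (1 - slo) * total_requests
    return (budget - failed_requests) / budget

# A 99.9% SLO over 1,000,000 requests allows 1,000 failures;
# 250 failures spends a quarter of the budget (~0.75 remaining).
print(error_budget_remaining(0.999, 1_000_000, 250))
```

This is the quote's point in numeric form: at an SLO of 100% the budget is zero, so any failure at all immediately puts the team in reactive mode.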
By George Demarest and Abha Jain
What do you do when your customer’s executive team sets a goal to reduce mean time to resolve (MTTR) big data cluster incidents by 40%? That was the problem facing Demarest and Jain, and this post shares the insights they discovered by setting internal SLAs, granular performance-level reporting, and customized drill-down tags to optimize internal operations and application performance. Great for anyone looking for new ways to correlate performance SLAs with business KPIs.
By Benjamin Treynor Sloss, Shylaja Nukala, and Vivek Rau
Don’t cut corners by implementing only easy-to-measure metrics: that is the message in this entry from some of Google’s most prominent SREs. For more advanced readers, this article offers invaluable lessons learned from years of measuring service performance. Especially insightful is the discussion of why Google replaced speed SLOs based on median latency with SLOs based on measurements at the 95th and 99th percentiles, enabling detection of “long-tail” latency.
“‘Speed matters’ is a good axiom for SREs to apply when thinking about what makes a service attractive to users…A good follow-up question is, ‘Speed for whom?’ Engineers often think about measuring speed on the server-side…The problem is that users have no interest in this server-side metric. Users care about how fast or slow the application is when responding to their actions, and, unfortunately, this can have very little correlation with server-side latency.”
—Metrics That Matter
By Ronald Bartels
What makes managing network SLAs for Telcos so difficult? One answer is the number of metrics involved. In this follow-up post on specific SLA metrics and KPIs to measure, the author presents common topics that often require SLA stipulations for a Telco including availability, power outages, IPSLA functionality, SD-WAN, and more.
By Chris Jones, John Wilkes, Niall Murphy with Cody Smith and Betsy Beyer
Service Level Agreements (SLAs) may get more attention, but service level objectives (SLOs) and service level indicators (SLIs) are what really matter when it comes to monitoring service performance, availability, reliability, and reachability. Published in Site Reliability Engineering: How Google Runs Production Systems (2016), this chapter spotlights the need to set user and team expectations with SLOs and SLIs. More specifically, SLOs that reflect service provider and consumer interests help prioritize work among SREs and developers, who can decide whether to improve availability from 99.99% to 99.999% or instead focus on releasing new features while maintaining a 99.99% uptime objective.
By Jay Judkowitz
This entry covers the different roles SLAs, SLOs, and SLIs play in aligning SRE principles and practices with business objectives. Judkowitz provides clear definitions of each term while also illustrating how they relate and operate in relation to one another. But what makes this introductory article stand out is the emphasis placed on the dual function of SLOs as both a reporting tool as well as a future indicator of the probability that your system performs as expected. Simply put: don’t set and forget SLOs.
By Jeffrey C. Mogul, Rebecca Isaacs, and Brent Welch
How can system designers define overall availability objectives for large-scale infrastructures that are both meaningful and translatable into requirements on underlying components? This article examines ways to improve infrastructure availability and highlights the need to define meaningful, multi-dimensional SLOs over multiple SLIs.
By Adrian Hilton and Yaniv Aknin
SLIs are the building blocks for your SLO thresholds, which, in turn, form the basis of your SLA, but this article focuses on the value of correlating SLIs to end-user problems. The takeaway here is that if your SLI isn’t useful, then you shouldn’t be using it to monitor your end user’s experience.
“The cleaner your SLIs are, and the better they correlate to end-user problems, the more directly useful they will be to you. The ideal SLI to strive for (and perhaps never reach) is a near-real-time metric expressed as a percentage, which varies from 0%—all of your customers are having a terrible time—to 100%, where all your customers feel your site is working perfectly.”
—Tune Up Your SLI Metrics
By Yan Cui
Like other entries in this post, this article deals with the problem of relying on average latency measurements. Percentile latencies offered a significant improvement, but managing that amount of data is difficult. In fact, despite generating percentile latencies at the agent level, vendors summarize this immense amount of data in averages, which merely recreates the original problem. Cui, an application developer, offers an alternative: replace percentiles as the primary metric with the percentage of requests completed over the SLO threshold, in order to identify the number of requests affected during an outage. An interesting read that may just change your mind about your monitoring and incident response methods!
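The metric Cui proposes is straightforward to compute. Here is a minimal sketch with invented sample data:

```python
def pct_over_threshold(latencies, threshold):
    """Percentage of requests slower than the SLO threshold.

    Unlike a percentile, this number says directly how many requests
    (and therefore how many users) breached the SLO during a window.
    """
    over = sum(1 for latency in latencies if latency > threshold)
    return 100 * over / len(latencies)

# 1,000 simulated requests: 990 fast, 10 slow
latencies = [0.1] * 990 + [2.0] * 10
print(pct_over_threshold(latencies, 1.0))  # 1.0 -> 1% of requests breached
```

Averaging this percentage across agents or time windows stays meaningful, which is exactly where averaged percentiles fall apart.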
SLAs for Cloud Services
Thomas Trappler, the Associate Director of IT Strategic Sourcing for the University of California system, published a paper ten years ago with a title capturing the growing concerns with cloud computing. But, in retrospect, “If It’s in the Cloud, Get It On Paper: Cloud Computing Contract Issues” (2010) can be read as a harbinger of the problems consumers and providers have had in the intervening years when it comes to cloud SLAs. The article lays out what contract terms to include when negotiating cloud SLAs to safeguard organizations transitioning from a “technical managed solution” that was operated in-house to a “contractually managed solution” operated by external vendors and third parties. Beyond identifying the very issues companies encounter today when signing a cloud SLA, the article puts into perspective the rapid rate of technological advancements over the last ten years and what impact it has had on the role and business value of SLAs.
By Carl Levine
The first step of deciding to migrate to the cloud is often the easiest. The second step, selecting what type of cloud environment to migrate services, infrastructure, or data to, is much more difficult. Luckily our partners at NS1 published a great post with tips on how to select a cloud provider. Filled with questions to ask, digital components to look for, and ways to avoid vendor lock-in from “one-stop-shop solutions,” this is a must-read for anyone trying to negotiate or renegotiate cloud SLAs.
“How many 9s is the vendor willing to put on the line? No provider can offer a true 100% Service Level Agreement (SLA), but they can come pretty darn close. The industry recognizes 99.999% (Five Nines) as the most reliable level of service available. A best-of-breed Cloud provider will be able to guarantee their services with this SLA and should be considered first among others with lesser guarantees. Also, consider the redundancy strategies that the vendor leverages.”
—Choosing a Cloud Provider
By Blake Thorne
This entry on cloud SLAs uncovers why it is increasingly difficult to achieve “four 9s” availability (let alone 100% uptime) and provides practical strategies and tactics for SLA monitoring and reporting. Topics covered include eliminating single points of failure, monitoring across multiple geographic regions, and the benefits of tolerating a little downtime in your SLAs.
By Samir Nasser
Hybrid cloud software solutions present obstacles for upholding SLAs and require performance and resiliency testing in order to mitigate the impact of outages and performance degradation. The post is filled with tips and best practices for monitoring the performance and resiliency of hybrid cloud software, but the recommendation to “minimize geographic distances” between components to reduce network latency earned the post its spot on our list.
By Arijit Mukherji
Influenced by the next entry’s examination of cloud SLOs, this article expands the conversation on “very high availability” cloud SLAs to consider “very high reliability” cloud SLAs. Cloud providers resist offering high SLOs because the shared, multi-tenant infrastructure of cloud environments would require SLOs to account for customer behavior. Since customer behavior is unpredictable, cloud providers avoid the topic and instead concentrate on worst-case scenarios—complete outages, downtime for users around the world, etc.—instead of the everyday usage patterns that greatly affect reliability. To ensure “very high reliability”, then, requires a high frequency, high-resolution monitoring system capable of executing automated actions based on near-real-time alert triggers. But this will cost you! What we love about this piece is its implicit challenge to both cloud providers and cloud customers to determine when “good enough” reliability is in fact good enough.
By Jeffrey C. Mogul and John Wilkes
Defining SLOs is harder than many people believe, especially for cloud providers and cloud consumers whose goals are inherently at odds. This entry puts forward a method for re-aligning those interests that begins with more clearly defining SLOs. Like other entries, the authors highlight the confusion surrounding the “9s” of performance. Unlike other entries, the authors uncover a tendency in cloud provider SLOs to define exceptional “corner cases” instead of “normal customer behavior,” which, given the model of shared resources that makes cloud computing possible, is a much more important metric to cloud customers looking for clear reliability and performance SLOs.
By Mike Fisher
This entry details Etsy’s internal processes in selecting a cloud service provider in 2018. While SLAs aren’t the entire focus, the discussion of determining the suitability of various providers for maintaining current workloads and SLAs shows how this organization decided whether to use cloud services or build their own tool. Over five months, the migration team collected thousands of data points to build a decision matrix containing over 200 factors, 1,400 weights, and 400 scores. No spoilers, but what Etsy’s ultimate decision revealed was the vast amount of work that goes into cloud migration decisions.
By Rodri Rabbah
When it comes to “functions-as-a-service” (FaaS), sometimes called “serverless functions,” traditional web service SLAs do not apply. This post from Apache OpenWhisk explains why and is a great example of why cloud SLAs matter.
By David Lebutsch, Tim Waizenegger, and Daniel Pittner
This article introduces readers to Charly, Ulrike, and Theo, composite sketches of two consumers and one provider of a cloud service, and then proceeds to examine their different motivations, reactions, and relations to SLAs, graceful degradation, and component availability. What makes this entry stand out is its explanation of availability’s relative meaning among users as well as highlighting the value of various modes of graceful degradation when it comes to a cost-effective solution for improving availability.
SLA Monitoring and Reporting
The only way to ensure trust and accountability between service providers and consumers is by monitoring the delivery and quality of service. The entries in this section provide some best practices on what to monitor as well as how to report on SLA performance. In addition, we’ve included some real-life examples from industry leaders who rely on SLA monitoring and reporting to provide digital services to their global customer base.
By Frank Yue
Deciding what to monitor across the application delivery infrastructure to gain performance visibility is a difficult task that often requires a significant amount of trial and error. This post covers key aspects of the network infrastructure to monitor—content delivery networks, security protocols, capacity tools—to ensure optimal application performance.
Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure, Reliability, and Efficiency
By Di Lin, Girish Lingappa, and Jitender Aswani
Netflix created a data lineage system to give engineering teams a clear picture of data dependencies. Netflix instrumented a model for data ingestion at scale to address various business use cases across its organization, mapping micro-service interactions among those use cases. This approach enabled building a unified data model and repository, which, in turn, allowed the streaming giant to focus on SLA compliance. By defining job dependencies in ETL workflows, Netflix can now alert on potential SLA misses, thereby reducing disruptions to end-user streaming experiences. In addition, this architectural decision enables proactive alerts on potential delays to critical reports caused by upstream issues.
By Ben Treynor, Mike Dahlin, Vivek Rau, and Betsy Beyer
How much availability can you actually offer to your customers? The answer to this question depends on how many critical dependencies you use that are managed by external vendors, and what levels of availability those vendors guarantee in their SLAs. This article is a must-read and will help you avoid penalties for SLA violations.
“A service cannot be more available than the intersection of all its critical dependencies. If your service aims to offer 99.99 percent availability, then all of your critical dependencies must be significantly more than 99.99 percent available.”
—The Calculus of Service Availability
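The rule in the quote above has a direct arithmetic consequence: assuming independent failures, a service's availability is capped by the product of its critical dependencies' availabilities. A minimal sketch:

```python
from functools import reduce

def composite_availability(*dependency_availabilities):
    """Upper bound on a service's availability given independent
    critical dependencies: the product of their availabilities.
    """
    return reduce(lambda a, b: a * b, dependency_availabilities, 1.0)

# Three critical dependencies at 99.99% each already cap the
# service below a 99.99% target, before counting its own failures:
print(composite_availability(0.9999, 0.9999, 0.9999))  # ~0.9997
```

Running this before signing a 99.99% SLA makes it obvious why every critical dependency must be held to a stricter standard than the service itself.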
By Joe the IT Guy
What is a “watermelon SLA” you ask? “[O]ne that contains a metric target that, when assessed against, states that all is well,” explains Joe the IT Guy, “[w]hen, in reality, we’ve left a trail of unhappy customers in our wake.” Learn ways to avoid false negatives and false positives in your SLA report with this and many other SLA management posts found on Joe the IT Guy’s blog.
By Peter Christian Fraedrich
The first in a three-part series on application monitoring published on the Capital One Tech blog covers what metrics your monitoring solutions should track. What we love about this post is its emphasis on the distinction between operational metrics—client connections, CPU load, error rates, etc.—and business metrics—signups, buy-flow step hits, page hits, etc. Each metric is defined, with its value illustrated through real-world examples.
By Jonathan Mercereau
This chapter from Seeking SRE (2018) revisits build vs. buy debates from the perspective of SRE teams at modern organizations where more non-mission-critical components are being managed by third-party providers. What’s refreshing is the way in which Mercereau dispels fantasies of building in-house solutions, and instead spotlights the breadth of skills and creative thinking required to select data proxies and monitoring strategies capable of accurately monitoring so-called black box service providers.
This list is far from exhaustive and we intend to update it regularly with the latest in SLA management!