Over 400 SREs worldwide responded to our annual SRE Survey, providing valuable insights into the role of reliability practitioners. Here are the insights.
Download The SRE Report 2024 here (no registration required).
Register for the 'The SRE Report 2024: What the charts don't tell you' panel discussion on January 24th here.
If you Google “What is the shortest, complete sentence in American English?”, you may get “I am” as the first answer. However, “Go” is also considered a grammatically correct sentence, and it is shorter than “I am”.
This isn’t a post arguing perfect rules or syntax (we’re not trying to compile code). Underlying rules when it comes to such things as the “indirect you”, the use of the Oxford Comma, or the number of spaces that come after a period are part of the discussion – yes, absolutely. However, this is a post about whether we can use words (such as those written in The SRE Report 2024) and reliability practices (such as those encapsulated within SRE) to make it better, without getting bogged down or hung up in nuances.
“Oh, you do this, but not that, so you can’t be called an SRE.” Instead, how about, “Are we making it a little better each day? Are we learning along the way?”
It is in this vein of making it better that we would like to offer some additional lenses for reading The SRE Report 2024 (download here – no registration required). The goal is not to offer a list of perfect rules, but to offer ways to make reading the report better.
It is simple: The SRE Report is a piece of independent research targeted at anyone who cares about [digital] reliability. The idea for writing this report (with The SRE Report 2024 being the sixth edition) was inspired by Google, and evolved by the community.
Here are the section Insight stats for this year’s report. We will list them, and then explore.
Insight I:
Insight II:
Insight III:
Insight IV:
Insight V:
Insight VI:
Insight VII:
Let us look at a snapshot of each Insight, with accompanying lenses. While we list thoughts alongside specific Insights, they could really be applied to any of them.
This Insight is all about the continual increase in the use of third-party providers and how practitioners are rethinking control and visibility; we conversationally refer to these third-party providers as stuff outside our control. In the survey, we asked if practitioners should monitor this stuff and the answer was a majority Yes (64%).
To share a little anecdote: after Kurt and I wrote the survey (with much iteration and feedback from the community), someone shared a video with me (starring Sarah Butt and Alex Elman) entitled “Embracing the Multi-Party Dilemma: Learning from Incidents Across Company Boundaries”. Little did I know that incorporating parties outside of your organization’s four walls, broadening relationships with them, and learning from incidents that involve them was of interest to more than just me.
As you read this Insight, think about the natural pull toward having visibility into your own stuff. And then, rethink the way you think about gaining visibility into the other stuff.
This Insight is really an extension of the first Insight. It also dances around other forms of engineering which also try to make it better. While we do mention Platform Engineering in this year’s report, we really tried to emphasize that titles and monikers are not as important as the work.
As you read this Insight, consider your own journey or your company’s. How have roles or team structures adjusted as, for example, your company has grown? Do you passionately care about reliability, even if your title is not SRE? If so, that’s all that matters.
This Insight is inspired by the LFI community. It first establishes a baseline for just how many incidents (hint: a lot) practitioners must manage. It then continues to explore the ways in which we can consider doing a better job of learning from them.
As you read this Insight, think about how other dimensions (e.g., company size) might dictate where dedicated teams are used for specific functions (in this case, managing incidents). Now consider that Learning from Incidents (‘LFI’) is a universal opportunity, and that the responses did not substantially trend when broken down by company size. This non-correlation is arguably more fascinating than if it did trend by size!
This Insight was written so that it may become year-over-year benchmark data for future reports. We could not ignore the generative AI coverage seen over the last year. We also felt the topic was large enough to establish some baseline questions for year-over-year comparison.
As you read this Insight, think about other evolutions over the years. Were they a bust or a boon? Recall the first time you heard about migrating to the cloud, the iPhone (sorry, Blackberry), Bitcoin, or the Metaverse. How will new Gen AI play out? And are you ready to embrace it? Or bury it?
While this Insight section intro discusses service level breaches, the larger Insight discusses the importance of Service Level Indicators and Service Level Objectives in SRE. It also discusses their relationship.
As you read this Insight, consider your day-to-day responsibilities. Then go through a critical thought exercise and see if you can map these responsibilities – through a value stream – all the way to an end user’s experience and business outcomes. It is important to know how the work you do affects you, your team, and your business to get a better appreciation for how it ties together.
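For readers newer to the terms, the relationship between an SLI and an SLO can be sketched in a few lines. This is a minimal illustration, not something taken from the report: the function names, the request counts, and the 99.9% target are all hypothetical, and availability is only one of many possible indicators.

```python
# Hypothetical sketch: an SLI is a measurement, an SLO is the target for it.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative means a breach)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target  # what the SLO tolerates
    actual_failure = 1.0 - sli          # what was actually measured
    return 1.0 - actual_failure / allowed_failure


# Made-up numbers for illustration only.
sli = availability_sli(good_requests=99_950, total_requests=100_000)
slo_target = 0.999  # hypothetical objective: 99.9% of requests succeed
breached = sli < slo_target
budget_left = error_budget_remaining(sli, slo_target)
```

In this sketch the measured indicator sits above the objective, so there is no service level breach and roughly half of the error budget remains.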
This is my favorite Insight in this year’s report. While we do not intentionally look for ways to go against marketing rhetoric, sometimes the words on paper do come out that way. This Insight is meant to combat the pure fricking magic messaging of “consolidate all your [monitoring tool] needs with us [one provider]”. We wrote this insight to offer an alternative, pragmatic suggestion to approach tool usage (backed with some data). That is, use purposeful, multiple tools if their combined received value is larger than their cost (where the cost can take many forms).
As you read this Insight, consider that no one [probably] wakes up in the morning saying, “We need to add more tools [simply for the sake of adding them] to the stack because we don’t have enough”. Conversely, no one should then wake up saying, “We need to remove tools [simply for the sake of saying we reduced tool count] from the stack because we have too many”.
We wrote this Insight to incorporate human, social elements. This Insight also includes our famous benchmarking data against Google’s seminal recommendation. In this Insight, we asked, “Being which of these is most important to you?” and we are glad that ‘Being proud of my work’ ranked first (63%). Unfortunately (or fortunately?), as organizational rank increased, being efficient became more important.
As you read this insight, please take a moment to appreciate the quantity of work it takes to produce The SRE Report. In the same way it takes a hefty quantity of work to ensure systems are reliable and resilient, those who are observing from the outside may not understand what it truly takes.
It is in this vein that I’d like to take a moment and thank this year’s report contributors for doing ‘more than their day job’. I would also like to thank all the survey respondents who said they would like to contribute to this year’s report, but for whom we did not have enough contributor slots.
And finally, a sincere thank you to (in order of report appearance):