Empower the SREs - Conclusions from The SRE Report 2023

"My takeaway from the report conclusions is the theme of SRE empowerment. SREs in my experience thrive the most when they feel truly empowered." Steve McGhee, Reliability Advocate, SRE, Google Cloud

The following Conclusions from The SRE Report 2023 is reprinted with permission from Google Cloud. Download the full report here (no registration required).

Let's be honest, nobody loves surveys. OK, well, I sure don't. But surveys satisfy a huge demand for insights into complex human-computer, sociotechnical systems. It turns out that we've been measuring the computer part pretty well, but the humans – not as easy to keep track of. When Google SRE first defined toil as a metric we wanted to reduce, we spent far too long trying to quantify it numerically based on tooling and insights from computer systems. It turned out to be easy: just ask the humans. Never stop, just keep asking them. We never found a better measure for toil, and I don't expect we will.

So, a survey is a powerful tool, but it takes work. It takes unbiased, structured questions, lots of respondents who actually take it seriously, and lots of analysis at the end. Without all of these critical elements, a survey is too often a waste of time that ends up regurgitating existing biases and foregone conclusions. And those are easy to spot. I was excited to help with the creation and analysis of this particular survey. I felt that the questions were well-considered and the analysis was thoughtful. I wish there had been more respondents, but alas, there's always next year.

My takeaway from the conclusions this year is the theme of SRE empowerment. SREs, in my experience, thrive the most when they feel truly empowered: when their organizations trust them to do the right thing and give them the resources and freedom they need. This means leaders must listen to their needs and support them, without inserting preconceived notions or interpretations. SRE is a very young field. There is a lot of interpretation at play here.

AIOps sounds amazing. Heck, even the name sounds cool. But listen to the practitioners who are actually trying it out, not the sales pitch. What is it actually doing today? Does it actually solve a problem you have right now? If not, move on. Don't be lured by the siren song of all-seeing, all-dancing AI. Don't make a decision for the SREs. Empower them to choose (or not choose) tools based on their understanding of the current system and their own needs for operating it for the immediate future. Remember, the root of all evil is premature optimization.

Tool sprawl sounds scary. Too many tools? Sounds expensive! I know when I go to a mechanic or a woodworking shop, I look for the place with the fewest tools on the walls and workbench. Wait, that's not right. When it comes to skilled labor, or "operations" perhaps, you want teams to be able to reach for the right tool at the right time, not to be impeded by earlier decisions about what they thought they might need in the future. Also, what counts as a tool? If I combine two Unix commands in a script, does that make a third? Why are we even stressing about this? Cost is the boogeyman here. Teams either have a culture of IT as a cost center, which must be reduced over time, or they've been bitten by runaway costs of Cloud. APIs are powerful! Especially when you're not watching your billing statement. Instead of forcing SREs to rationalize every tool and prevent every possible overlap of functionality, empower the SREs. Give them transparency into cost, let them make the value judgment as a group, inform them of contract details like renewal dates, and let them propose alternatives.

Blamelessness is working. What better example of the benefit of a psychologically safe environment? This is another form of empowerment. Knowing that they can be trusted with a complex system, despite fallible humanity (we all make mistakes!), empowers an SRE and results in a stable, sustainable system. This is a great data point to see reflected in the survey.

Why do ICs and Execs disagree so broadly? Why aren't they aligned? One interpretation is that Execs are looking at the bigger picture and ICs are focusing on a smaller portion, missing the context. However, that's not the only way to work. That's certainly the traditional (Taylorist) model that is employed at many Enterprises today, but we can do better. By providing transparency, context, and rationale around budgets, revenue, and loss, leaders can help teams better understand tradeoffs made "above them" instead of simply throwing POs up to management to see what sticks. SREs fight for the user. Don't tie their hands. Instead, empower them to provide a big-picture solution. They can do it if you let them.

The last few years have been a heckuva ride. WFH is here to stay, remote work is only growing, and even the 4-day work week is approaching, depending on who you talk to. Is this possible? Is this great? Is this scary? How about all of the above? I don't think this bell can be un-rung, nor should we want it to be. Being knowledge workers in the age of the Cloud means you don't have to be datacenter-adjacent, or even in-office. SRE is about creating higher levels of abstraction by which to control the systems that society ever more depends on. Let this happen. Don't tie it down with old models of working, lest they come back to haunt you by way of attrition, burnout, and checked-out, uninspired employees.

Trust your SREs, empower them to defend the user (within clear expenditure limits), give them time and resources to be creative, and push for and reward sustainable behavior.

The above is an extract from The SRE Report 2023. Want to read the full thing? Get The SRE Report 2023 here (no registration required).
