
SRE Report 2023: Findings From the Field — Toil

“What percentage of your work, on average, is toil?” We've asked this question in the SRE Survey for the last 5 years and we've seen self-reported toil levels drop, but can toil levels be quantified?

Toil.

Few other words have the same visceral impact for SREs as their four-letter nemesis: toil. Although pretty much everyone agrees that toil is bad, the term is frequently misused. In common English usage, toil is defined as “long strenuous fatiguing labor”[1] or “work that is difficult and unpleasant and that lasts for a long time: long, hard labor”[2].

As a term of art in the SRE profession, “toil” has several very specific characteristics which distinguish it from other sorts of work which people spend time on. In the original SRE book, toil is defined with the following attributes[3]:

  1. manual,
  2. repetitive,
  3. automatable,
  4. tactical,
  5. devoid of enduring value, and
  6. scales linearly as a service grows.

The authors also refer to several other types of work which might be difficult and unpleasant and last for a long time: administrative overhead (team meetings and the like) and grungy work. The examples in the book were “cleaning up the entire alerting configuration for your service” and “removing clutter”. I’d also suggest that many migrations can fall into the category of being grungy, especially when dealing with the exceptions that inevitably impede the happy path to completion.

“What percentage of your work, on average, is toil?”

When we wrote the questions for the 2023 SRE Survey, we knew that confusion between toil as a term of art and toil as a common English word could muddy the survey results and their interpretation, so we included those six criteria in the descriptive field for the question: “What percentage of your work, on average, is toil?”

Here are the results from this year’s report, broken out by the self-reported level of the respondent:

Here’s the data representation in a cumulative distribution form as presented in the report:

The previous year’s responses look like this:

Some people report spending 90-100% of their time on toil, which is clearly an undesirable situation. But as one of the authors of the survey and the subsequent report, I’m more concerned about Manager+ respondents reporting that their work is “manual, repetitive, automatable, tactical, devoid of enduring value, and…scales linearly as a service grows.” This is not an “OR” list: work has to meet all six criteria to count as toil. Having been both an IC and a manager, I understand that management does involve a lot of manual, repetitive, tactical work. But dealing with people is neither “devoid of enduring value” nor something that scales linearly with the service(s) being supported. Many parts do scale with the number of employees within a manager’s span of control, but what sorts of work by managers, senior managers, and executives would technically qualify as toil?
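To make the “not an OR list” point concrete, here is a minimal sketch in Python. It assumes each kind of work can be scored yes/no against the six attributes, which is itself a judgment call. The `WorkItem` class, the function names, and the example scores are all hypothetical illustrations, not part of the survey or the SRE book.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class WorkItem:
    """One kind of work, scored yes/no against the six attributes."""
    name: str
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    devoid_of_enduring_value: bool
    scales_linearly_with_service: bool


def criteria(item: WorkItem) -> dict[str, bool]:
    """Return the six attribute flags by name, leaving out the item's name."""
    return {f.name: getattr(item, f.name) for f in fields(item) if f.name != "name"}


def is_toil(item: WorkItem) -> bool:
    """Strict reading: every attribute must hold (AND, not OR)."""
    return all(criteria(item).values())


def failed_criteria(item: WorkItem) -> list[str]:
    """List the attributes that keep this item from qualifying as toil."""
    return [name for name, met in criteria(item).items() if not met]


# Hypothetical scoring of a management task. The individual True/False
# values are assumptions made for illustration only.
weekly_one_on_ones = WorkItem(
    name="weekly 1-1s",
    manual=True,
    repetitive=True,
    automatable=False,
    tactical=True,
    devoid_of_enduring_value=False,
    scales_linearly_with_service=False,
)

print(is_toil(weekly_one_on_ones))          # False
print(failed_criteria(weekly_one_on_ones))
```

Under an “any of these” reading, the same item would count as toil because it is manual, repetitive, and tactical. Under the strict reading it does not, and `failed_criteria` shows which attributes rule it out.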

In 2019, Nikolaus Rath talked about the scaling problems of onboarding large numbers of services[4] as a limiting function for their SRE teams. In that situation the scaling element was not the size of the service or the end-user base, but rather the number of services themselves. As such, the onboarding process qualified as toil because the remaining manual sanity checking was a constraint on scaling and met all the other criteria as well.

Can toil be quantified?

There is so much judgment inherent in the qualification criteria for toil that I wonder whether it can be quantified at all. At a recent round-table discussion with a group of SREs, we talked through the aspects of each person’s work that they considered to be “toil”. When we examined those work items in light of the six criteria, nearly 80% of the items failed to meet one or more of them. One of the challenging areas was the question of whether something is “automatable”. Something that was not automatable five years ago might be today, with newer technology or with a very large investment of time and effort. Does that make it toil now, but not five years ago? Should cost even enter into the consideration of whether something is toil?

When you think about the way that you spend your work time, how much time do you spend on “grunge”? Grunge might be the same sort of work that could be toil, but might be one-off, or otherwise fall outside of the technical requirements to be considered toil. How much of your time is spent on “overhead”? Overhead might be team meetings, 1-1s with your manager, company All-Hands meetings, status reporting, or (mainly for contracting firms) tracking billing information. Most engineers consider “overhead” time to be unpleasant, and lasting for a long time (at least subjectively 😊) – but that doesn’t make it “toil”.
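If you want to put rough numbers on those questions for yourself, one approach is to tag your hours for a week and compute each category’s share. The sketch below assumes you can assign every entry to exactly one category; the category names, the `share_by_category` function, and the hours are hypothetical and do not come from the survey.

```python
from collections import defaultdict

# Hypothetical one-week time log: (category, hours). Made-up numbers.
time_log = [
    ("toil", 6.0),
    ("grunge", 4.0),
    ("overhead", 8.0),
    ("engineering", 22.0),
]


def share_by_category(entries):
    """Return each category's share of total logged hours, as a percentage."""
    totals = defaultdict(float)
    for category, hours in entries:
        totals[category] += hours
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # nothing logged, so no shares to report
    return {category: 100.0 * hours / grand_total
            for category, hours in totals.items()}


for category, percent in share_by_category(time_log).items():
    print(f"{category}: {percent:.0f}%")
```

The arithmetic is the easy part. The result is only as good as the judgment used to tag each entry, which is exactly where the definitional problems above come in.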

Assessing these sorts of complicated factors is difficult even within the context of a single company, which shares at least some level of a common culture. When we try to extend these assessments across an entire industry, we are battling the forces of semantic diffusion (which dilutes the technical definition of the term of art) as well as common-parlance usage. Many questions remain unanswered, but in spite of these measurement challenges, it’s still helpful to have these sorts of longer-term records to inform our future and help us understand where we have come from. The SRE Report, which has now run for five years, provides one of these records for our industry. When we solicit input later this year for the next survey, we hope that you will participate!

Learn more

You can read the full 2023 SRE Report here (no registration needed).

References

[1] https://www.merriam-webster.com/dictionary/toil

[2] https://www.britannica.com/dictionary/toil

[3] https://sre.google/sre-book/eliminating-toil/

[4] https://www.usenix.org/conference/srecon19americas/presentation/rath

