Our fourth annual SRE Report launched last week. I had the good fortune to be involved in writing and editing it this year for the first time alongside our very own driving force Leo Vasiliou and the brilliant Eveline Oehrlich at DevOps Institute (check out Eveline’s take on the report’s Key Takeaways here), in addition to a number of folks at VMware Tanzu.
Over 300 site reliability engineers from around the world responded to the SRE Survey, providing valuable insights into the role of an SRE across organizations of different sizes and industry verticals. These findings were then charted, analyzed, and discussed at length among the team and industry practitioners to form the basis of the report.
Here Are the Four Big Highlights of the Report
1. Baselining is essential to understanding (and improving) SRE conditions.
We baselined core SRE metrics and found that SREs are balancing their work across operations and development, although operations dominates. The median value of time spent exclusively on development was 40%, leaving 60% for operations work, 20% of which is spent on call.
Fascinatingly, we also saw that toil had dropped by 15% since 2020 – this was across the entire distribution, not just the median.
One of our nine Spotlight contributors had this to say about why this might be and what it means…
"If Google is correct that too much toil leads to frustration, boredom, and burnout, could it be that during a pandemic, with its enduring waves of stress, SREs chose to spend their time elsewhere? Either way, did anyone even notice--and what does that say about toil?" Jaime Woo, Co-editor, 97 Things Every SRE Should Know
Will self-reported toil rise next year as org structures are reintroduced and SREs are exposed again to the pain, problems, and challenges of everyday work?
We will need to compare the last two years with next year to see if this happens, or perhaps we’ll see SREs disputing the reintroduction of more toil, knowing now what it’s like to work with less. Most importantly, if you don’t have some type of baseline to know whether things got better or worse, you can’t answer these questions in the first place.
2. Multi-provider usage is growing and points to a looming scale ceiling for SRE teams and an opportunity for Platform Ops.
Across the board, multi-provider usage is growing – from cloud to CDN to DNS.
Having more than one vendor for these delivery components improves resilience by increasing failover opportunities. It also allows organizations to tap into different strengths and capabilities. At the same time, it brings with it greater complexities in managing the infrastructure.
We also found that as the number of SREs and employees within an organization increases, the SRE topology changes and SREs become more decentralized.
It is, of course, crucial to prevent a return to silos and to manage the scale ceiling imposed by these challenges. One way of doing this is to embrace Platform Ops and create a Platform Operations team. The centralized Platform Ops team can help overcome these challenges by providing a normalized set of capabilities to decentralized colleagues, which can then be integrated across company workflows.
3. AIOps has yet to be fully embraced across the board.
“Most SREs working at scale are already leveraging machine learning, especially when it comes to efficiencies around data centers (locations, cooling, and all the things that happen inside it), for networks and building out infrastructure … Evolving that into AIOps is the next logical step.” J. Bobby Dorlus, Staff Site Reliability Engineer, Twitter
There is enthusiasm for the potential of AIOps among SREs. That said, compared to the amount of hype and buzz in the industry, we found that adoption is slow.
Big data is draining the value of traditional monitoring tools, such as APM, and making it extremely difficult for SREs to be proactive. We see the value in AIOps as an aid to monitoring and management strategies because of its ability to pare through the increasing velocity and volume of data to rapidly arrive at actionable insights. By leveraging AI and ML, decision-making and efficiency levels can be improved.
We recommend that SREs break AIOps into individual components and determine the merits of each component in its own right. We also suggest training SRE teams in AI and ML. The value will come not in the short term, but in medium- to long-term gains.
4. While SREs are becoming more customer-focused, they should look to expand the boundaries of observability to better align with core business KPIs and needs.
The survey findings revealed that the top three monitoring tools in use are infrastructure performance (62% always used), network performance (42% always used), and application performance (49% always used). We see only a rare usage of benchmarking intelligence (9% always) and public sentiment/social media monitoring (8% always).
By including a wider range of monitoring tools, in particular, those centered around business perception and ranking, SREs can better realize their value to the business as a whole.
Similarly, monitoring from the outside-in perspective is valuable because it allows businesses to understand what the customer or end user is seeing. Organizations gain insight into the user journey and are able to make changes to directly improve customer experience.
To read the full SRE Report 2021, which includes nine contributions from industry practitioners in response to the key findings and an actionable path to consistently realize greater value for your customers, click here.