Balancing ‘Dev’ and ‘Ops’ - SRE Report 2020
The DevOps-SRE relationship raises the question of how to find the right balance between dev and ops, and of the need to shift reliability further left.
We recently released Catchpoint’s SRE Report 2020 that analyzed results from the SRE survey we conducted early this year along with a recent addendum survey. The report offers a detailed look at the current state of SRE and how the shift to an all-remote work environment has impacted SRE teams.
In this blog, we take a deeper look at one of the report highlights: ‘Heavy Ops Workload Comes at a Cost’.
The 50/50 Split
Google set the benchmark of an ideal 50/50 split in SRE teams in its pioneering 2016 book on SRE. Benjamin Treynor Sloss, the senior VP overseeing technical operations at Google, and the originator of the term “Site Reliability Engineering”, writes, “Google’s 50% cap on aggregate ’ops’ work for all SREs (such as tickets, on-call, manual tasks, etc.) ensures that the SRE team has enough time in their schedule to make the service stable and operable.” Sloss even defines this cap as “an upper bound” since ideally, “the SRE team should end up with very little operational load and almost entirely engage in development tasks because the service basically runs and repairs itself.” Google wants its SRE teams to spend “the remaining 50% of its time actually doing development.”
The Reality vs. the Goal
We were curious to find out how many of the more than 600 respondents to our survey this year are actually maintaining the ideal 50/50 split. When asked, “What percentage of your work is spent on development?”, 55% said between 0-25%, 31% said between 25-50%, and only 14% said greater than 50%.
Moreover, a net 10% of those surveyed said that, after two and a half months of working from home, their activities had shifted to include more ops work. For most SREs, then, the goal of a 50/50 split appears to be something of a pipe dream, as we wrote in the report.
We also asked respondents to identify, “Which of these activities do SREs do as part of their job?” The results were again very interesting. Only 25% selected dev activities (e.g. developing applications, writing software to help operations) versus 75% who plumped for ops activities (e.g. troubleshooting tickets, incident response).
Over 83% of respondents to our survey identified as doing SRE activities. As we caution in the report, however, “identifying as doing SRE activities does not mean actually being an SRE. This is because we must consider [SRE as a whole], not as parts or pieces. SRE teams are becoming more defined but spanning across different focal points does lend SRE work to being buried or hidden.”
We believe that with a different strategic approach at both the individual and the company level, SRE teams can shift left towards a more equal balance of dev and ops.
Why is Shifting Left Important?
Benjamin Treynor Sloss explains it this way: “Consciously maintaining this balance between ops and development work allows us to ensure that SREs have the bandwidth to engage in creative, autonomous engineering, while still retaining the wisdom gleaned from the operations side of running a service.” He adds, “because SREs are directly modifying code in their pursuit of making Google’s systems run themselves, SRE teams are characterized by both rapid innovation and a large acceptance of change.”
At Google, he writes, this sometimes involves moving some of the operations’ burden back to the development team or staffing up as opposed to assigning a team additional operational responsibilities. In other words, this approach is largely enabled by managerial decisions that support and understand the benefits of the 50/50 split.
Within the larger field, our report showed that only 46% of organizations have a dedicated SRE team separate from other ops/admins teams. 19% of respondents said the DevOps team handles SRE-related activities and 16% said the Operations and System Administration team is responsible for SRE activities. 13% said SRE activities are performed across the entire organization as opposed to being localized to one team and 7% responded that SRE is still new and it is currently unclear if this needs a separate team.
As we highlight in the report, “if we are all on a journey to being preventive through the design and implementation of observable systems, then we have a long way to go.”
Strategies for Becoming More Preventive
Reacting to incidents and problems is part of SRE life. Still, teams need to mature from being reactive to being proactive, in a way that fits the business and the organization’s context. In the SRE Report, we identified a few key takeaways on how SRE teams and their managers can shift left and become more preventive. These include:
- Start by recognizing the work being identified as SRE and actively measure how SRE time is spent.
- Identify when SREs are involved in the lifecycle (53% said they experienced challenges due to being involved late).
- Reintroduce a core goal of being preventive through the design of observable systems (the stages of this journey can be split into three phases: (i) reactive, (ii) proactive, (iii) preventive).
- Instead of debating the nature of an activity in isolation, shift the conversation to ask, “what are we preventing from happening by doing this?”
- Do you have an SRE charter? If yes, does it need reworking? If not, can you build one?
- Ask if SRE principles have been considered as necessary within your org, and if not, why not?
- Similarly, ask if a customer-centric approach to delivering services has been evaluated.
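The first takeaway, actively measuring how SRE time is spent, can be approximated with very little tooling. The following is a minimal Python sketch, not a method taken from the report: the activity names, the `OPS_ACTIVITIES` set, and the `ops_share` and `exceeds_cap` helpers are all hypothetical, and it assumes your team already logs hours per activity in some form.

```python
from collections import defaultdict

# Hypothetical activity categories; adjust to match your own tracking data.
OPS_ACTIVITIES = {"tickets", "on-call", "manual tasks", "incident response"}
OPS_CAP = 0.50  # the 50% upper bound on aggregate ops work discussed earlier


def ops_share(entries):
    """Return the fraction of logged hours spent on ops work.

    `entries` is a list of (activity, hours) pairs. Anything not listed
    in OPS_ACTIVITIES is counted as dev work (an assumption of this sketch).
    """
    totals = defaultdict(float)
    for activity, hours in entries:
        kind = "ops" if activity in OPS_ACTIVITIES else "dev"
        totals[kind] += hours
    total_hours = totals["ops"] + totals["dev"]
    if total_hours == 0:
        raise ValueError("no hours logged")
    return totals["ops"] / total_hours


def exceeds_cap(entries, cap=OPS_CAP):
    """Return True when the ops share is strictly above the cap."""
    return ops_share(entries) > cap


# Illustrative numbers only, not survey data.
week = [
    ("tickets", 12.0),
    ("on-call", 10.0),
    ("incident response", 8.0),
    ("automation tooling", 6.0),
    ("application development", 4.0),
]
print(f"ops share: {ops_share(week):.0%}, over cap: {exceeds_cap(week)}")
```

With the illustrative week above, 30 of 40 logged hours are ops work, so the share is 75% and the cap is exceeded. Tracking this figure over time gives a team and its managers a concrete number to discuss when deciding whether to hand operational load back to development or to staff up.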
When SRE work and value are recognized, they can start to be rewarded. If you can demonstrate how a shift left to a more preventive approach will impact the wider business, showing the benefits of reduced cost and improved team alignment and highlighting specific business outcomes, the 50/50 split or better will become more of a reality and less of an elusive chimera.
Read all the highlights from the report here and the full report here. Also, keep an eye out for an upcoming blog focusing on the results of the addendum survey, which dealt exclusively with how SREs have been handling all-remote IT.
Join our ‘SRE from Home’ event on Thursday, July 23rd for a live discussion on everything SRE. Register today and save your spot!