In January, my colleague Peter came to me with an idea to create a Site Reliability Engineering (SRE) survey. The goal of the survey was to find out what it really means to be an SRE by examining the types of organizations, skills, and culture where site reliability engineers work. We wanted to see whether there is a specific profile or core set of principles across organizations, including:
- Who are site reliability engineers? What are their work experience and skills?
- Where do SREs work? What types of organizations hire site reliability engineers? What is the team structure and what is the culture?
- What are the day-to-day tasks of an SRE? How are they spending their time, and how is their role defined?
- What tools and processes are used by site reliability engineers?
This comment from one respondent nicely summarizes some of our reasons for conducting the survey:
“The SRE role changes from organization to organization. Even though the role was defined at Google, organizations have incorporated the roles differently. Because of this, there can be confusion of the SRE role vs pre-existing operations roles. This confusion can lead to a misrepresentation of the SRE role to non-SREs. This usually creates toil for the SREs, as SREs end up having to do stories that may not be under the scope of SREs, or will have to push back on stories that are not under the scope of SREs.”
This post will focus not on the metrics from the report, which will be released later this month, but rather on some of the surprises in the responses and the data.
Diversity of SREs
When crafting the questions, we debated whether to ask about gender and ethnicity. We searched for examples of how to phrase such questions inclusively. What struck me in this research was the question the Human Rights Campaign raises about collecting gender data in surveys: is the data essential? We decided that this information was not essential and would not provide any surprises or insights that could help organizations looking to hire SREs or people looking to become site reliability engineers.
Excluding these questions still revealed fairly diverse backgrounds for SREs. While the majority of site reliability engineers have a degree, studied in a technical field, and have held a number of technical roles, some took a non-traditional path into the role. Some SREs hold degrees in English Literature & Fiction Writing, American Culture & Literature, Technical Theater, Zoology, and Theology.
As Site Reliability Engineers are described as having a mix of software development and operations responsibilities, we were interested to see what percentage had previous roles in development vs operations vs something else. I was expecting a fairly even split between the two. Of the respondents, 53% previously held a job as a developer or software engineer, compared to 64% as a SysAdmin. Reading through the “other” responses, I realized the options we provided left out two areas that are stepping stones into the SRE role: QA/test and help desk/support.
It is not uncommon to switch career paths; personally, I’ve held roles in education, support, sales, product management, and product marketing. A variety of roles and backgrounds can often lead to stronger teams. In addition to the expected technical roles, some SREs held roles in sales, product management, project management, and the military.
“I don’t think any of those technical skills particularly make a great SRE. A team of good SREs has a cross section of all of those such that the team has all the traits, but I can’t say any single SRE should have any or all of them to be good.”
In the words of Aretha Franklin “R-E-S-P-E-C-T”
The lack of respect many SREs feel was the most surprising aspect of the survey for me. This feeling is troubling, because when people aren’t treated with dignity and respect they are more likely to leave an organization or face physical and mental health issues. There were many comments about how the role is important but not currently valued. Organizations, and the people within them, need to find ways to value those doing this important work.
“I’m not the first person to say it but remove the hero mentality. Waking up at 3am and solving an outage by resetting connections to a database that got hung up by a new application not closing threads should not make you a hero. Rather everyone should have empathy for the rough night you had, and help you to ensure the proper documentation is written first, so that anyone can pick up that task if it arises again, and then move on to automating the fix, practicing the runbook in a controlled manner for the team, and writing alerting that catches and auto-resolves this before it even happens again. You’re only as strong as your weakest link and it’s everyone’s job to improve themselves and those around them, don’t settle for “but you’re our rockstar” as a reason to run down your own health and sanity.”
“Unless the organization itself is aimed for success, being an SRE is like being a daycare assistant.”
One way to show value for the SRE role is to provide reasonable, attainable, and relevant metrics for the team. When asked what metrics are used to measure the success of the SRE team, responses included “Incredibly stupid ones,” “Metrics are constantly being redefined,” and “Surely you jest, we have no stable metrics.” Think carefully about what metrics make sense and whether you are measuring the right things. Metrics won’t make everything better, but they can provide insights into how people are feeling. How many times are SREs woken in the middle of the night? How often do problems recur? These metrics can tell you whether processes need to change to improve things.
Those who feel more respected in their organizations refer to the SRE role as fun, extremely interesting, awesome, or “the best role in tech.” All organizations should strive to have all their Site Reliability Engineers feel this way.
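As a rough illustration of the two questions above, here is a minimal Python sketch that counts overnight pages and finds causes that keep coming back. Everything in it is an assumption made for the example, not something from the survey: the `PAGES` log, the cause labels, and the 22:00 to 06:00 definition of "night".

```python
from collections import Counter
from datetime import datetime

# Hypothetical page log: (time the on-call SRE was paged, root-cause label).
PAGES = [
    (datetime(2018, 3, 1, 3, 12), "db-connections-hung"),
    (datetime(2018, 3, 2, 14, 5), "disk-full"),
    (datetime(2018, 3, 5, 2, 47), "db-connections-hung"),
    (datetime(2018, 3, 9, 23, 30), "cert-expired"),
]


def is_night(ts, start_hour=22, end_hour=6):
    """True if the page arrived overnight (assumed window: 22:00 to 06:00)."""
    return ts.hour >= start_hour or ts.hour < end_hour


def night_page_count(pages):
    """How many times someone was woken in the middle of the night."""
    return sum(1 for ts, _ in pages if is_night(ts))


def recurring_causes(pages):
    """Causes that paged more than once, with how often each occurred."""
    counts = Counter(cause for _, cause in pages)
    return {cause: n for cause, n in counts.items() if n > 1}


if __name__ == "__main__":
    print("Night pages:", night_page_count(PAGES))
    print("Recurring causes:", recurring_causes(PAGES))
```

Neither number says much on its own, but watching both over time is one way to see whether process changes are actually reducing repeat incidents and interrupted nights.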
Latency vs Availability vs Response Time
End users have high expectations for website performance. It doesn’t matter whether an application is loaded on a mobile device or a desktop, or whether it is for work or pleasure: we want pages to load as quickly as possible. If a page takes too long to load, a user will consider it unavailable. I realized after the fact that these nuances were not considered in the phrasing of one of our questions. We asked, “What service level indicators are most important for your services?” Three of the options were end-user response time, latency, and availability. I view availability as whether the system is up or down, latency as the delay before a response is generated, and end-user response time as how long it takes for the user to receive the information they wanted. If an error message appears or the page fails to load, the application is unavailable. If a page takes 10 seconds to load, it’s available but incredibly frustrating to use. For SREs, availability means more than whether a system is up or down: if the response time or latency exceeds a certain threshold, the application is considered unavailable.
Asking a simple question about which service level indicators are used, without asking how those metrics are defined, raises questions about the results. For respondents who selected availability but not latency or response time, do they define a response over a certain threshold as the system being unavailable? For those who selected all three metrics, do they define them the way I did above? I would be curious to dig into this aspect a little more in the future.
Thank you to everybody who took the time to respond to the survey, asked questions, and provided feedback. The full State of Site Reliability Engineering report will be available at the end of March.
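To make the two readings of availability concrete, here is a small Python sketch that computes it both ways over the same requests. It is only an illustration under stated assumptions: the `Request` type, the sample data, treating any 5xx status as "down", and the 2,000 ms threshold are all hypothetical choices, not definitions taken from the survey.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code returned to the user
    response_ms: float  # time until the user received the response


def availability(requests, threshold_ms=None):
    """Fraction of requests counted as available.

    With no threshold, any non-5xx response counts (simple up/down view).
    With a threshold, the response must also arrive within threshold_ms.
    """
    if not requests:
        raise ValueError("no requests to evaluate")
    good = 0
    for r in requests:
        ok = r.status < 500  # assumption: 5xx means the service was down
        if threshold_ms is not None:
            ok = ok and r.response_ms <= threshold_ms
        good += ok
    return good / len(requests)


# Hypothetical sample: one fast page, one 10-second page, one error, one normal page.
SAMPLE = [
    Request(200, 120.0),
    Request(200, 10_000.0),
    Request(503, 50.0),
    Request(200, 300.0),
]

if __name__ == "__main__":
    print("Up/down availability:", availability(SAMPLE))
    print("Threshold availability:", availability(SAMPLE, threshold_ms=2_000))
```

On this sample the up/down view reports 75% availability, while the threshold view reports 50%, because the 10-second page load is counted as unavailable. Two teams could both select "availability" as an indicator and mean either number.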
Update: our 2018 SRE Report is now available here.