This year’s SRE from Anywhere (SREFA) brought together hundreds of registrants from around the world to gather virtually, share experiences, and network around all things SRE. We were thrilled to see so many friendly faces! We kicked off the event by celebrating the spirit of optimism in the air this year and shared why we brought back the SRE community event: to explore how the year that changed everything has made us more resilient and how site reliability engineers are thinking about and building the future.
Want to discover some killer insights from some of the field’s most dynamic practitioners?
THE CURRENT STATE OF SRE
Panel Discussion: SRE Survey Findings, Spurious Correlations Or Nah?
Catchpoint engineer Navya Dwarakanath kicked things off by hosting a panel on 2021’s SRE Survey Findings and the 2021 SRE Report (check out the report here). On the panel: Scott Rogers from VMware Tanzu’s observability team and two of the report’s lead authors: Eveline Oehrlich, Chief Research Officer at DevOps Institute, and Leo Vasiliou, former ITOps practitioner, now on the product marketing and research team at Catchpoint.
The panelists talked over the report’s key findings (from the underutilization of AIOps to what the growth of multi-provider usage means for scaling operations), discussed whether correlations between questions were spurious or not, and shared their top advice for attendees to take away.
“Across the distribution and geographies, the level of toil has reduced last year over the year before. Last year, we were locked down at home… Was there a correlation there? Did people find work more meaningful? Is this why there was a seismic drop in toil? We will need to compare with next year to see if the levels of self-reported toil indeed rise.” Leo Vasiliou, Product Marketing Manager, Catchpoint
One of the first and most debated topics was, “What is the most effective structure for SREs in an organization?” Attendees agreed with Eveline, who said it depends on the environment, organization maturity, technology, setup, and capabilities of a business.
“In my company,” one attendee wrote in the chat, “they have an SRE team for each DevOps team instead of having one SRE team for all. More silos.”
Another shared, “I facilitate a DevOps/SRE community of practice. I strongly believe SREs do belong on product teams, but the tools and practices must be shared.”
Another key topic was the next scalability ceiling for SRE, which is innately connected to the question of centralization vs. decentralization for SRE teams.
“If you have so many teams operating independently on different cloud providers,” Leo warned, “it’s inefficient. The way forward is to centralize those capabilities to be reused across all those teams. A Platform Ops team can normalize the capabilities on top of different providers, which can then be reused internally.”
Wrapping things up, each of the panelists shared their key takeaways with the audience.
Eveline exhorted people to "value your SREs. Hug your SREs. Give them love, case studies, and success."
Leo, meanwhile, encouraged attendees to really consider the need to "expand the boundaries of observability to include digital experience and business KPIs", the final key finding of the report.
Scott agreed, emphasizing the need for SREs and business leaders to find more alignment. “To make SRE successful,” he said, “you need to move beyond SRE practice and tie success back to the business. If you can get business leaders onboard with the tech, then you’ll be successful.”
OBSERVATIONS FROM THE FIELD
Lightning Talk: Why You Need Scheduled Downtime: the Powerful Benefits Of Taking Intentional Breaks – Jaime Woo
In his lightning talk, Jaime Woo – writer, SRE educator, and mindfulness expert – laid down some of the science behind the importance of breaks, examples of strong and not so strong breaks, and why you need them even when you think you don’t.
“Your brain hasn’t stopped working on a problem just because you’re on a break. In fact, if you take the right kind of break, the problem-solving moves to your subconscious. You will likely find a better solution than if you focused on the task exclusively.” Jaime Woo, Writer, SRE Educator, Certified Mindfulness-Based Stress Reduction Expert
Jaime wrapped things up with a powerful story about surgeons and anesthesiologists who took breaks during surgery. It was found that short breaks improved mood and concentration, and did not add a significant amount of time to the surgery or worsen patient outcomes. We can all learn from this, he told the attendees, especially SREs on call. Look at your typical work week, he advised, and think about where you can schedule more intentional downtime.
Lightning Talk: Using Kubernetes Probes To Improve Application Stack Stability – Regis Wilson
Regis Wilson, a founding engineer at Release (in addition to being an author, amateur comedian and professional poker player!), used his lightning talk to discuss how to use Kubernetes probes to do application health checks.
There are three types of probes, he shared:
- Startup probe: For slow-starting applications (these are a legacy setting and most people don’t need or use them).
- Liveness probe: Contrary to their name, these probes in fact kill containers that are not responding. Only use probes that are fail-safe and never use them on stateful containers, Regis cautioned.
- Readiness probe: For removing traffic from unresponsive or slow containers. These are the most important, even “mandatory,” said Regis, for any customer-facing service.
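To make the three probe types a little more concrete, here is a minimal sketch of how they might sit side by side on a single container. It is not taken from Regis’s talk: the container name, image, port, paths, and timings are all assumptions chosen for illustration. The field names (`startupProbe`, `livenessProbe`, `readinessProbe`, `httpGet`, `periodSeconds`, `failureThreshold`) are the ones Kubernetes uses in a container spec, and the script simply prints that spec as JSON.

```python
import json


def http_probe(path, port, period_seconds, failure_threshold):
    """Build an HTTP GET probe using Kubernetes probe field names."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }


# Hypothetical container: the name, image, port, and paths are assumptions.
container = {
    "name": "web",
    "image": "example/web:1.0",
    "ports": [{"containerPort": 8080}],
    # Startup: allow a slow-starting app up to 30 * 10s = 300s to come up.
    "startupProbe": http_probe("/healthz", 8080, 10, 30),
    # Liveness: the container is killed and restarted after 3 failed checks.
    "livenessProbe": http_probe("/healthz", 8080, 10, 3),
    # Readiness: traffic is withheld from the pod while this check fails.
    "readinessProbe": http_probe("/ready", 8080, 5, 2),
}

if __name__ == "__main__":
    print(json.dumps(container, indent=2))
```

Keeping the readiness check on its own endpoint (here the assumed `/ready`) lets a container say “alive but not ready for traffic” without being restarted.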
Evolving with SRE, The Game Plan - Santanoo Bhattacharjee
Santanoo, Solutions Expert at Accenture, is a self-described “techie by choice and evolutionist by passion.” In his thirty-minute talk, Santanoo walked attendees through how to effect impactful transformations as an SRE.
He broke this into four critical steps:
- Get the basics aligned
- Visualize the transformation
- Arrive at the game plan
- Define the road ahead
Throughout, Santanoo stressed the need for a critical mind that prioritizes introspection at all stages of the journey, particularly the start. He also emphasized why a tactical mindset is important for moving toward continuous improvement of systems. Lastly, he was adamant that architectural clarity is essential for everyone involved. If goals are not clear, he said, it will be hard for teams to align and therefore difficult to mature in the necessary direction.
“The SREs who are doing introspection before doing any kind of transformation are the ones who create masterpieces. We have to fundamentally agree that great creations are not made by templated blueprints, they are masterpieces because of the expansion capabilities the templates have.” Santanoo Bhattacharjee, Solutions Expert, Architect - Advanced Technology Solutions, Accenture
Lightning Talk: How To Alert On SLOs Using An Error Budget Burn Rate - Yuri Grinshteyn
Yuri Grinshteyn, SRE at Google on the customer reliability engineering (CRE) team, was next up. In his five-minute talk, Yuri focused on alerting strategy and the need to balance four considerations (precision, recall, detection time, and reset time) while minimizing operational load. The best way to do this, he counseled, is to understand error budget burn rate, i.e., how fast your service consumes its error budget relative to your service level objective (SLO).
Yuri detailed the three main factors used to calculate burn rate. You should then ask, he said, “How much error budget should my service burn before firing an alert?” (Want to know the answer? Watch the replay!) Ultimately, he counseled, you want to understand the burn rate in relation to two windows: a long window and a shorter one. By using multi-window alerts, Yuri said, “we keep our detection time low and minimize our reset time so that the alert doesn’t keep firing once the incident has been mitigated.”
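For readers who like to see the idea in code, here is a rough sketch of a multi-window burn rate check. It is not Yuri’s implementation: it assumes the commonly used definition of burn rate as the observed error ratio divided by the error budget (1 minus the SLO target), and the function names, the 99.9% target, and the threshold of 10 are all made up for illustration. The threshold in particular is not the answer from the talk.

```python
def burn_rate(error_ratio, slo_target):
    """Return how many times faster than 'exactly on budget' errors occur.

    Assumed definition: observed error ratio / (1 - SLO target).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_alert(long_error_ratio, short_error_ratio, slo_target, threshold):
    """Fire only when BOTH the long and the short window burn too fast."""
    long_burn = burn_rate(long_error_ratio, slo_target)
    short_burn = burn_rate(short_error_ratio, slo_target)
    return long_burn >= threshold and short_burn >= threshold


if __name__ == "__main__":
    slo = 0.999      # illustrative 99.9% target
    threshold = 10   # illustrative burn rate threshold, not from the talk

    # Incident in progress: both windows show elevated errors.
    print(should_alert(0.02, 0.03, slo, threshold))
    # Incident mitigated: the short window has recovered.
    print(should_alert(0.02, 0.0005, slo, threshold))
```

The long window keeps brief blips from paging anyone, while requiring the short window to agree means the alert stops firing soon after the errors stop, which is the reset-time benefit described in the quote above.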
How To “SRE” Your Way To Pipeline Improvement - Anders Wallgren
Anders Wallgren, VP of Technology Strategy at CloudBees, led the day's second thirty-minute session: “a mashup [talk] of couple of different things, using one to provide a little insight into the other.”
The two topics?
- Metrics – “A few small guidelines on how to look at metrics, how to think about metrics, the kinds of things that you want to apply metrics to, and then how and so forth.”
- SRE – In relation to “how we can improve our pipeline processes, [and] whether they're continuous integration or continuous delivery or other kinds of pipelines.”
Anders encouraged SREs to start by asking what “we really need and want out of our CI/CD pipelines.” The goal is to drive outcomes by improving those pipelines through using metrics and SRE techniques.
One of the main pieces of advice in relation to metrics is to reduce them.
“It really isn't that useful in most situations to have 50 different metrics that we're looking at,” Anders explained. “That means we're trying to do too much, or we're not very focused on what we're trying to do, or we don't understand what we're trying to do.” Instead, he suggested, focus on “a small amount of metrics at any given time” in order to concentrate on a particular problem.
In terms of SRE, Anders centered his observations on service level actions (i.e., “when something goes out of whack, what do we do?”). This was an opportunity for attendees to look at SLIs, SLOs and SLAs through a different lens – specifically in relation to applying them to CI/CD pipelines.
Lightning Talk: Why Does SRE Need SLOs? - Alex Hidalgo
The final lightning talk of the day came from Alex Hidalgo, Director of SRE at Nobl9 and the author of Implementing Service Level Objectives (O’Reilly 2020). Alex started by acknowledging the many definitions out there of service level objectives, then shared his own. When we say SLO, he said, we tend to mean three things:
- Service Level Indicator (SLI) – A metric showing you how you’re performing from a user perspective.
- Service Level Objective (SLO) – A metric that tells you the threshold of how often you want to be reliable.
- Error budget – A way of tracking your SLO status over time.
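Here is a small sketch of how those three pieces can fit together numerically. The recap doesn’t include Alex’s formulas, so this assumes one common, event-based way of computing them; the function name and the request counts are hypothetical.

```python
def error_budget_status(good_events, total_events, slo_target):
    """Return (sli, allowed_bad_events, fraction_of_budget_remaining).

    Assumes an event-based SLI: good events divided by total events.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_events / total_events                 # how we actually did
    allowed_bad = (1.0 - slo_target) * total_events  # the error budget
    actual_bad = total_events - good_events          # budget spent so far
    remaining = 1.0 - actual_bad / allowed_bad       # negative if overspent
    return sli, allowed_bad, remaining


if __name__ == "__main__":
    # Hypothetical numbers: 99,950 good requests out of 100,000, 99.9% SLO.
    sli, allowed_bad, remaining = error_budget_status(99_950, 100_000, 0.999)
    print(f"SLI: {sli:.4%}")
    print(f"Allowed bad events: {allowed_bad:.0f}")
    print(f"Error budget remaining: {remaining:.0%}")
```

With these made-up numbers the service is meeting its SLO and has spent about half of its budget, which leaves room for some failure without breaking the target.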
Finishing things off with a lightness of touch, Alex described how SLOs work through the everyday metaphor of ordering a pizza, likening a pizza customer’s satisfaction and tolerance for the occasional failure to those of Internet users. We can all relate to being given the wrong pizza and how we might feel and react!
“SLO and SRE are both about the users and ensuring that people are happy. This counts on both sides. Your users are happy if you give them correct pizza often enough. Your engineers and business stay happy by ensuring you’re not trying to deliver the correct pizza every single time.”
ASK AN SRE
Panel Discussion: Ask An SRE
The last event of the day was a terrific live panel with Jonathan Franconi (SRE Manager at VMware), Tim Kadlec (Performance Engineering Fellow at Catchpoint), Robert Ross (CEO and co-founder of FireHydrant), and Pablo Sanchez Torralba (SRE Manager, Google).
Topics covered included how SRE can be defined, the day-to-day work of SRE in different organizations, and the end-to-end support that SRE provides to the whole business. This took us full circle, back to the conversation in our first panel around centralized versus decentralized SRE teams.
At VMware, Jonathan shared that “SRE begins with the person that starts the first line of code and ends when it hits production.” VMware, he said, has “a very wide SRE organization in every business unit.” However, SRE is also a part of the central R&D organization called VMware Engineering Services. “We basically hold the glue together across all these teams and all these SRE groups.”
Meanwhile at Google, Pablo shared his insights from specifically working as an SRE at Google Photos and its focus on scaling and being cost-effective in relation to the enormous size of the product. “We are part of a smaller part of SRE inside of Google,” he said, “which is supporting actual products instead of building the infrastructure where most of the SRE lives.”
“The next step we are looking into,” Pablo shared, was, “let's try to do our monitoring based on, what is the user experiencing? And that's not a single interaction. You think about for us in Google Photos, sharing a photo is the journey. And there are many services involved, many calls from the application to the services behind, and that's where we want to monitor. Everything may be great, and still, the users are not able to share the photos and that's bad for them, bad for the business.”
“SRE is such a large term,” Robert posited in response, “it’s almost like we need to start breaking it up. At Google, you have CRE and you have SRE. It's almost like we need to have design reliability engineering, we need to have frontend reliability engineering and really start to break it out.” Food for thought.
The panel concluded with a refreshingly honest conversation about the benefits and challenges of remote SRE.
Jonathan took things back to the event’s title: “I think the working remote piece of it, from SRE… we are born into this mentality of SREs being anywhere in the world and being able to work from anywhere. So in our specific field, this was natural from day one.” He shared his opinion that the “legacy operational model of hopping all on the same computer in the same war room has subsided,” and is being replaced by a mindset of, “let’s jump into a conversational tool to meet and figure out how to solve this problem and work together.”
A fitting note to conclude.
Here’s to all of you SRE-ing from anywhere! Thanks for attending and generously sharing your time and insights with the speakers, panelists, and one another!
And finally, a huge thanks to our brilliant crew of speakers for sharing your wide-ranging and rich reflections on the state of SRE in 2021: Santanoo Bhattacharjee, Navya Dwarakanath, Jonathan Franconi, Yuri Grinshteyn, Alex Hidalgo, Tim Kadlec, Eveline Oehrlich, Scott Rogers, Robert Ross, Pablo Sanchez Torralba, Leo Vasiliou, Anders Wallgren, Regis Wilson, and Jaime Woo.