With SREcon not happening this year and the world turned upside down by COVID-19, we set out to hold a virtual event that would bring SREs together to share their experiences of what has changed. Last week’s SRE from Home was exactly that. With 1900 registrants, 20 lively Slack channels, six illuminating and entertaining talks from a diverse range of experts in the field, and our #askanSRE panel answering attendees’ questions with candid generosity, it was an amazing, jam-packed day. In this post, we look back at the day’s events and share some of the many highlights. For those of you who want to see the full program or catch up on individual events, check out the day’s recording here.
“Yes, You Can Improve Your Team’s Wellness”
Jaime Woo, the co-founder of Incident Labs and an avid researcher into mental health and mindfulness, kicked off the day with a talk about how to improve individual and team wellness. With a light and engaging touch, Jaime covered a wide range of serious wellness topics for the SRE community, including stress triggers, burnout, and unresolved feedback loops.
Addressing Structural Factors that Lead to Stress and Burnout
Jaime argued that we often dwell too long on individuals when we think about stress and burnout. Instead, he urged us to think about the structural factors involved, including:
- The conditions people are working under
- What are people’s incentives for success at work – financial, institutional and/or social
- What is helping people achieve their goals
- What is blocking them
Following his talk, Jaime shared a valuable list of resources on Slack looking at how organizations can better support their teams’ wellness.
The next talk, also on the theme of wellness, came from Dawn Parzych, an ex-Catchpointer and now a Developer Advocate at LaunchDarkly, where she specializes in the intersection of technology and psychology. She described parenting, from the point of view of DevOps, as “the worst on-call scenario ever”, especially right now when “there’s no one to hand the pager off to.”
Three Ways to Adapt DevOps to Parenting
Dawn shared how she was using three DevOps principles to help with her parenting right now:
- Flow: Flow relates to how you communicate within an organization and how you want to receive feedback: in short, amplified cycles that allow you to incorporate it. Dawn stressed that when it comes to kids, reducing ambiguity is essential to building a more successful flow.
- Feedback: Dawn discussed the idea of “kaikaku”, that is, a radical or destructive change, expected or otherwise, that must be dealt with. When COVID shut down schools, she and her husband built a schedule to see how best to balance things, and they continued to improve and adjust it as the situation changed.
- Continuous Improvement and Learning: One of the ways in which Dawn related this principle to parenting was the need to share with your community and workplace what you are having to juggle when you need help.
OK, so you are not Google. What should SRE mean for your organization?
Sanjeev Sharma, the author of DevOps for Dummies, The DevOps Adoption Playbook and a lively blog, led our third talk on adopting SRE in organizations that aren’t Google. Sanjeev kicked things off by looking at why Google created SRE as a practice.
How to do SRE when you’re not Google
Sanjeev discussed how challenging Ops in an enterprise is, especially at a time when most organizations are hybrid and multi-cloud. He then looked at several ways not to adopt SRE.
One of the key takeaways, a theme that was picked up in the related Slack channel, was his reminder that DevOps shouldn’t be a team in and of itself. SRE is different as it is focused on resilience and managing SLAs and SLOs. Let Ops specialists do Ops, let software engineers in Ops do SRE, let everyone do DevOps, he said.
The Pandemic Brief, Assuring Essential Services
Freelance developer Henri Helvetica finished the morning’s events by examining the role of SRE during the pandemic, “when the web has become a vital resource”. In this context, Henri suggested that SREs are a form of essential worker, ensuring that essential services are in place for people to access. “Things without the web, without some kind of stability, reliability”, Henri said, “things get serious or I like to say, SREious.”
Lighthouse Scores for Government Sites on COVID-19
At times of crisis, people rely on government or news sites for essential information, so it is necessary to ensure that end users can access the information they need.
Relating this to crisis sites communicating information about COVID-19, Henri shared the success of the California government health site. Since the site receives millions of daily visitors on days when big announcements are made, one of its goals is to “quickly and reliably respond to visitors on any device with any accessibility constraint.”
Henri compared its Lighthouse score of 96/100 to those of similar government health sites in Florida (22/100), Texas (25/100) and Georgia (23/100), and shared that the California developer team is working with IT teams across the country to share their approach and help others make similar improvements.
In conclusion, Henri stressed the need among SREs and their teams for “the disciplined delivery of data”, keeping sites as lean as possible so that everyone can access them. COVID-19 is “teaching us that firsthand”.
Emerging from Burnout with Amy Tobey
Amy Tobey, Staff SRE at Blameless, kicked off our afternoon talks with our final wellness talk, Emerging from Burnout. As Dawn Parzych observed on the Slack Q&A channel, it was “a tough subject but so many important points.”
Firsthand Insight into Emerging from Burnout
Amy described how burnout felt to her when she was fired from Netflix and shared some valuable insight into how she emerged from this dark place, encouraging other SREs to:
- Take breaks to regain “cognitive capacity”
- “Put our sleep on a schedule” – to recover cognitive capacity
- Factor hindsight bias into how we interpret the past to better understand it
- Use observability towards ourselves – turn the skill we use for looking at huge complex systems at work inwards to notice “what clues will tell me what is going on with me”
- Chaos engineering/chaos exercise – Amy emphasized the need to exercise “the whole system”, sharing that she does this with yoga and cardio
- Take time off and discover where we are “a single point of failure” for the system
- Self-kindness: this, Amy said, she would put as “the number one value of an SRE” – being kind to yourself and to others.
Amy pointed out that the blame for burnout is often laid at the door of the individual, but we also need to look at the overall system to see how it can change to prevent burnout in the first place.
Next up was our #askanSRE panel, moderated by Liz Fong-Jones, Developer Advocate at Honeycomb, with Holly Allen, Head of Reliability at Slack; Lex Neva, SRE at Fastly; Maira Zarate, Application Engineer at Autodesk; and Tony Ferrelli, VP of Technology Operations at Catchpoint. The questions, gathered from attendees in advance and on the day, began with a deceptively simple one, “How is SRE defined for you at your organization?”, and ended with a career-focused one, “How to make the transition into SRE coming from a software developer’s background?”
The panel stressed the need to actively work on maintaining robust communication channels. Maira summarized this as, “Figure out how your team works as one.” Tony agreed, saying his team was holding daily standups to bring everyone together in the same virtual room to discuss things. He also encouraged managers to set up a water cooler channel where people can gather and mimic more informal ways of having a conversation.
One of the many strands of the conversation that caught our attention at Catchpoint was around “the biggest challenge in handling observability tools and infrastructure and why is it our job as SREs to think about observability.” Holly shared that at Slack, while they “are lucky enough to have a dedicated monitoring team”, one of the biggest challenges was to “get out of the operator mindset” and really consider how to use metrics to help solve the “real problems that developers have”.
Liz shared that her boss, Charity Majors, believes that over time, the observability needs of an organization will become roughly 30% of its cloud budget. You need to spend that amount of resources, either on your own engineers or on a third-party solution, to have “sufficient visibility into what your systems are doing, what your software is doing.”
Maintaining Mean-Time-to-Joy: Managing a Global Incident at Netflix
Our final talk came from J. Paul Reed, Senior Applied Resilience Engineer, and Tim Heckman, Senior Site Reliability Engineer, both at Netflix. In a classic three-act structure, the SREs shared “what it’s been like for Netflix during a global pandemic and how we’ve dealt with that situation.”
The CORE team usually works on short-term incidents, with long-running ones being purely security-related. What has made COVID so different is the uncertain, longer-term nature of a worldwide situation that is changing how end users interact with Netflix, not to mention the surge in subscriptions and traffic volume.
Tim explained that the CORE team set up daily Working Group meetings, sent out weekly emails and communicated across teams to share regular updates and give the engineers “some confidence to know that the system around them would be fine”.
“2020 Turned our Flashlight into a Dumpster Fire”
J. Paul Reed described the importance of “telling stories about the future” and how as SREs, we “walk through the world, shining this flashlight based on where we are, the stories and knowledge that we have, and these different framings of what’s possible, what’s probable but also possible, what do we actually want, what’s preferable.”
Unfortunately, as Tim stated, “2020 turned our flashlight into a dumpster fire.” J. Paul Reed then turned his attention to how to move forward: “The point that I want to make is, as we go on in COVID, the stories that we tell each other” are “compressed and that is cognitively painful.” The CORE team recognized that it had to “reconstruct the stories” that have been broken by COVID, figure out what stories to tell about our systems and our teams and “make those coherent again.” The duo concluded with a rousing cry: “We’ll SRE through this dumpster fire together.”
And a Toast!
In the last few minutes of SREfh, Catchpoint’s CEO, Mehdi Daoudi, came online to thank our partners, speakers and all the attendees for making this such a terrific event.
Mehdi shared a story from when he was at Google handling a tooling team called “Quality of Service”, before the concept of SRE existed. While working from home, he deleted an image on the ad servers, taking down the entire DoubleClick ad serving system.
The next day he was called into a conference room by his boss, the CIO, along with Mehdi’s peers, a bunch of VPs and legal. He was told his action had taken the entire system down. “The question he asked me was formidable”, Mehdi shared. “What did you do?” He continued, “We want to learn from this so that it never happens again.”
The story demonstrates the importance of a blameless culture for people to be able to get better at their jobs. People make mistakes and it is important to take care of both our teams and future user experience.
Mehdi ended with a toast to the SREs, “Thank you to all the SREs, thank you for making sure that everything is working and humming, you’re doing an awesome job and the Internet is better because of you.”
To hear the full recording of SRE from Home, click here.