NEW YORK, June 15, 2021 -- Catchpoint®, the leader in Digital Experience Management, conducted a study with VMware Tanzu and DevOps Institute of nearly 300 site reliability engineers (SREs). The SRE Report is one of the most data-backed studies of its kind and has played a critical role in defining the nature of what it means to be an SRE since it launched four years ago. This year’s report underscores the challenges of multi-cloud, calls out the underutilization of AIOps, and shows a systemic shift in core baselining data. The report concludes by offering an actionable path for SREs to consistently deliver customer value.
Download the report here.
“SREs deal with a very broad set of challenges that span across transformational and operational activities,” says Mehdi Daoudi, CEO of Catchpoint. “This report arms them with the insights they need to help address these challenges – to balance the need for agility against the need for stability when building and operating massive, distributed, and reliable systems.”
Levels of Toil Fall Around the World
Toil is the work tied to a production service that tends to be manual, repetitive, automatable, and devoid of enduring value. Google suggests that SREs should spend no more than 50% of their time on ops work (including toil), leaving at least 50% for dev work. This year, the SRE Report notes an average year-over-year drop in toil of 15%.
“The reason this is such an impactful insight is that the drop in toil was across all geographies,” says Tony Ferrelli, Vice President of Technical Operations at Catchpoint. “If this drop in toil was because work felt more meaningful since COVID-19 led to SREs working-from-home, then will reported toil levels rise next year as people return to the office or a hybrid work environment?”
The Accelerating Use of Multiple Providers Warns of a Looming Scalability Ceiling
If the cloud is your new datacenter, then third-party services like DNS and CDN are your new racks and cabinets. Given the rising use of multiple same-service platforms (e.g., multi-cloud) combined with the increase in the volume, velocity, and variety of monitoring data, it is little wonder that lack of visibility across the stack (53%) was the most cited cloud-app monitoring challenge, or that SREs continually refine service level objectives (50%). The survey responses raise a critical question: how can companies most effectively scale SRE implementations?
“Spanning the gaps between the interfaces and the data that each provider offers increases the difficulty for SRE teams to automate across those multiple providers. These integrations are rarely simple except for the most superficial aspects. Effectively mapping disparate data models together may be the next frontier for SRE in a multi-vendor environment,” says Kurt Andersen, SRE Architect, Blameless.
The Shift Toward AIOps Is Slow
AIOps has been widely touted to reduce laborious ops work and to intelligently sift through the ever-increasing volumes of data that organizations are continually presented with. However, the report shows that many SREs have never used AIOps, and ratings of the value received were spread evenly across the 1-9 scale.
According to J. Bobby Dorlus, Staff SRE at Twitter, “Most SREs working at this scale are already leveraging machine learning, especially when it comes to efficiencies around data centers (locations, cooling, and all the things that happen inside it) for networks and building out infrastructure … Evolving that into AIOps is the next logical step.”
Observability Should Include Digital Experience Metrics and Business KPIs
SREs that fail to deliver customer value run the risk of being stuck in an operational toil rut. Conversely, businesses that fail to recognize the importance of SRE activities risk losing talented employees and their competitive edge.
The highest-ranked driver for successful SRE implementations was incident resolution (60%), while expanding the business was fifth lowest (33%). These findings show that SREs are still inwardly focusing on IT operations versus outwardly focusing on the business results that deliver customer value. To close this IT-to-business gap, SRE teams must expand observability boundaries to include digital experience metrics and business KPIs.
“The balancing work of innovation while providing operational excellence has forced many IT teams to put heavy emphasis on improving reliability and stability of services and applications,” says Eveline Oehrlich, Chief Research and Content Officer at the DevOps Institute. “What SREs now need to do is make sure the value of these reliable services and applications are understood by the customer.”
The report concludes with an actionable path for SREs to consistently deliver customer value:
- Businesses and SREs need to establish a baselining program around core SRE tenets and business-level metrics to know whether things are getting better or worse.
- Platform Operations teams should be implemented to achieve higher levels of scale and efficiency. Platform Ops should develop normalized capabilities for SREs across the organization to draw on (even though underlying platforms will have different interfaces) and treat those capabilities as a product to sell and market to other teams within the business.
- To achieve the promise of AIOps, SREs and managers must break down AIOps into smaller components and incrementally develop from there, in addition to investing in training in AI and ML for SRE teams.
- It is crucial to find ways to bridge the gap between SRE and business goals. For instance, start conversations around capabilities instead of focusing on low-level monitoring metrics and high-level business outcomes.
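As an illustration of the baselining recommendation above, the following minimal Python sketch compares a current period against a baseline period for one SRE metric and one business-level metric. It is not taken from the report: the function name `compare_to_baseline`, the metric names, and all sample values are hypothetical, and a real program would need to choose its own metrics, time windows, and thresholds.

```python
from statistics import mean


def compare_to_baseline(baseline, current, higher_is_better):
    """Return (relative_change, verdict) for the current period's mean
    against the baseline period's mean."""
    base = mean(baseline)
    now = mean(current)
    if base == 0:
        raise ValueError("baseline mean is zero; relative change is undefined")
    change = (now - base) / base
    if change == 0:
        verdict = "unchanged"
    elif (change > 0) == higher_is_better:
        verdict = "better"
    else:
        verdict = "worse"
    return change, verdict


# Hypothetical sample data, invented for illustration only.
# Toil hours per engineer per week (lower is better).
toil_baseline = [20, 18, 22]
toil_current = [16, 14, 18]

# Completed orders per week, a business-level metric (higher is better).
orders_baseline = [100, 110, 90]
orders_current = [95, 90, 100]

toil_change, toil_verdict = compare_to_baseline(
    toil_baseline, toil_current, higher_is_better=False
)
orders_change, orders_verdict = compare_to_baseline(
    orders_baseline, orders_current, higher_is_better=True
)

print(f"toil hours: {toil_change:+.0%} ({toil_verdict})")
print(f"orders:     {orders_change:+.0%} ({orders_verdict})")
```

Tracking an operational metric and a business metric side by side is one way to see whether improvements in the first are accompanied by improvements in the second.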
Connect with Catchpoint
Catchpoint, the global leader in Digital Experience Monitoring (DEM), empowers business and IT leaders to protect and advance the experience of their customers and employees. In a digital economy, enabled by cloud, SaaS and IoT, applications and users are everywhere. Catchpoint offers the largest and most geographically distributed monitoring network in the industry – it’s the only DEM platform that can scale and support today’s customer and employee location diversity and application distribution. It helps enterprises proactively detect, identify and validate user and application reachability, availability, performance and reliability, across an increasingly complex digital delivery chain. Industry leaders like Google, L'Oréal, Verizon, Oracle, LinkedIn, Honeywell, and Priceline trust Catchpoint’s out-of-the-box monitoring platform to proactively detect, repair, and optimize customer and employee experiences. Learn more at www.catchpoint.com.