20 Essential Books for Site Reliability Engineers
The following 20 must-read books for site reliability engineers (SREs) cover topics such as post-incident review and SLA management.
Site Reliability Engineering (SRE) continues to evolve its practices and expand its presence across different industries. Whether you’re a seasoned SRE or just starting out in the field, the Catchpoint team has compiled this must-read list of site reliability engineering books. The list includes classics and new releases on topics including SRE implementation, systems thinking, post-incident review, SLA management, and more!
Accelerate: Building & Scaling High Performing Technology Organizations
By Nicole Forsgren, PhD, Jez Humble, & Gene Kim
The promise of BizDevOps has been to cultivate an organization where technology drives business value, but the performance of software delivery teams has been missing from the conversation. The publication of Accelerate fills this gap. It is a ground-breaking presentation of four years of research into, and statistical analysis of, the capabilities and practices most important to the development and delivery of software products.
A Seat at the Table: IT Leadership in the Age of Agility
By Mark Schwartz
Whether IT departments report to CIOs, CISOs, CDOs, or even CTOs (to name only a few), there’s no denying the state of flux that IT leadership finds itself in today. In A Seat at the Table, Schwartz, an experienced CIO, lays out a blueprint for what IT leadership should be: a value creation engine. Part field guide, part manifesto, the book is a call to action for IT professionals: becoming an Agile IT leader requires the courage to cast off traditional thinking and a willingness to fail, a tenet that we know is near and dear to some of today’s most successful SRE teams.
Continuous Delivery: Reliable Software Released through Build, Test, & Deployment Automation
By Jez Humble & David Farley
Your latest software release was designed, developed, and deployed, but it’s not getting in front of your target audience. Figuring out what to do in this scenario, or ideally avoiding this scenario altogether, is the aim of Continuous Delivery. Humble and Farley set out strategic principles and tactical practices that enable continuous, incremental software delivery. For SREs looking to reduce toil, the chapter on automated acceptance testing is a must-read.
Data Visualisation: A Handbook for Data Driven Design
By Andy Kirk
In Data Visualisation, Kirk provides a handy resource for deciding which data visualizations to use for deeper data analysis, post-mortem reviews, SLM negotiations, external presentations, and more.
Foundations of Service Level Management
By Rick Sturm, Wayne Morris & Mary Jander
The rapid rise of SaaS and IT-as-a-service means SLM and SLAs are even more important now than when Foundations of Service Level Management was released in 2000. The authors present strategies for developing and enforcing SLAs with third-party vendors and service providers. The book remains pertinent today because it also shows how vendors and providers can optimize their own practices.
High Performance Web Sites: Essential Knowledge for Front-End Engineers
By Steve Souders
When Souders published this now-classic text in 2007, he shocked many web developers by claiming that the client side takes up 80% of the time it takes for a web page to load. To reduce response and page load times, Souders presents 14 specific rules for optimizing website performance. Many of these rules still hold true, but SREs looking for even more tips on improving site and application performance should also check out Souders’ sequel, Even Faster Web Sites (2009).
Inspired: How to Create Tech Products Customers Love
By Marty Cagan
The age of customer experience has companies competing to offer the best experience possible, which has also raised the bar on what customers expect from digital interactions. Originally published ten years ago, Inspired is arguably more relevant today in its recently released second edition, in which Cagan provides insight into assembling customer-centric teams and designing, developing, and delivering products that exceed market demand and business objectives.
Platform Revolution (2017)
By Geoffrey G. Parker, Marshall W. Van Alstyne, & Sangeet Paul Choudary
From SaaS to IaaS and now to XaaS, we are inundated with tech acronyms heralding digital disruptions and transformations but have few insights into the mechanisms and behaviors driving these business model changes. In Platform Revolution, the authors take a deep dive into the Platform-as-a-Service phenomenon. They examine the historical context, operational tactics, and economic impact of the emergence of PaaS organizations and their effect on our interactions with technology. A must-read for our brave new world.
Post-Incident Reviews: Learning from Failure for Improved Incident Response
By Jason Hand
While IT environments have drastically changed, the same cannot be said of post-incident reviews. The Post-Incident Reviews report addresses the shortcomings of traditional post-incident review techniques, like root cause analysis, when it comes to understanding problems in complex, distributed IT systems and preventing them from recurring.
Practical Reliability Engineering, 5th Edition
By Patrick D. T. O’Connor & Andre Kleyner
Practical Reliability Engineering presents high-level reliability theory concepts alongside practical real-world applications and industry best practices. This comprehensive approach to reliability will appeal to a wide range of engineering professionals, but SREs will find chapters on software reliability, analyzing reliability data, and maintainability, maintenance, and availability especially insightful.
Principles of Network and System Administration
By Mark Burgess
Released 15 years ago, this foundational text introduces overarching principles and operational tactics for establishing, configuring, and maintaining computer systems and networks. This is a must-have resource for your library whether you’re a seasoned or novice SRE.
Seeking SRE: Conversations About Running Production Systems at Scale
By David N. Blank-Edelman
After the success of Site Reliability Engineering: How Google Runs Production Systems (2016), demand for more SRE content accelerated, especially on nurturing SRE practices at non-tech organizations. Seeking SRE meets this need with essays from nearly 40 SREs and tech professionals following SRE practices. We’re recommending this book, however, because the contributors focus on humans, not technology, in presenting what SREs can do for people.
Site Reliability Engineering: How Google Runs Production Systems (2016)
By Betsy Beyer, Chris Jones, Jennifer Petoff & Niall R. Murphy
When it comes to essential SRE reading, there’s no better place to start than with this 2016 collection of essays. Each chapter is rooted in the personal experiences of industry experts involved in putting business-IT into practice at Google. “The most impressive thing of all about this book is its very existence,” observe the editors, who then go on to remind us that “[i]mplementations are ephemeral, but documented reasoning is priceless.” We couldn’t agree more.
The Field Guide to Understanding ‘Human Error’, 3rd Edition
By Sidney Dekker
While embracing failure is a core tenet among SREs, it can be much more difficult to bring risk-averse business leaders around to realizing the long-term value of failure. The Field Guide to Understanding ‘Human Error’, now in its third edition, stages an intervention into how organizations perceive ‘human error’ problems. The Field Guide moves the conversation on ‘human error’ forward, rethinking accidents, post-mortems, and our safety systems.
The Human Side of Postmortems
By Dave Zwieback
As reported in the 2019 SRE Report, stress levels of SREs are at an all-time high. And yet, few how-tos or guides on running postmortems address how stress and other human factors can contribute to and even prolong an outage. The Human Side of Postmortems makes the case for why SRE and DevOps teams need both a technical and a human postmortem to mitigate stress-induced mistakes during an outage.
The Lean Product Playbook: How to Innovate with Minimum Viable Products & Rapid Customer Feedback
By Dan Olsen
This practical, no-nonsense guide is a great go-to resource for small, or even one-person, SRE teams looking to improve or adopt lean thinking workflows. With step-by-step instructions and processes, The Lean Product Playbook can help SRE teams establish themselves as integral partners in accomplishing organizational objectives in any industry.
The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2
By Thomas A. Limoncelli, Strata R. Chalup & Christina J. Hogan
As more and more organizations migrate to “the cloud,” what can DevOps/SRE principles and practices do to help redefine and reposition Information Technology departments? The authors of this volume provide case studies on operating and running systems at industry giants like Netflix, Etsy, and Amazon while highlighting why distributed systems require a fundamentally different system administration that may not be offered by your cloud services provider.
The Site Reliability Workbook: Practical Ways to Implement SRE
By Betsy Beyer, Niall R. Murphy, David K. Rensin, Kent Kawahara & Stephen Thorne
The highly anticipated sequel to Site Reliability Engineering (2016) expands upon its predecessor with a hands-on focus that presents concrete examples of SRE in action. “The purpose of this second SRE book is (a) to add more implementation detail to the principles outlined in the first volume,” the editors explain. But for us the second reason is key: “(b) to dispel the idea that SRE is implementable only at ‘Google scale’ or in ‘Google Culture.’”
The Systems Bible: The Beginner’s Guide to Systems Large and Small, 3rd Edition
By John Gall
This systems engineering treatise expands on Gall’s field-defining insight into system failures, which holds that failure is an intrinsic feature of systems. For SREs, The Systems Bible offers 40 chapters on the benefits of conceptualizing systems premised on failure when it comes to measuring, optimizing, and managing systems both big and small.
Thinking, Fast and Slow
By Daniel Kahneman
In Thinking, Fast and Slow, Kahneman presents two systems, one slow, one fast, that drive the way we think, and then examines how these systems guide our professional and personal choices. We recommend the discussion on the inside versus outside view and the problems that arise when teams—or entire organizations—extrapolate and forecast based on only the internal view that fails to account for “unknown unknowns.” The outside view, or what we call the end user experience, provides the baseline needed when making predictions and long-term investments.
2019 SRE Report
Catchpoint runs an annual survey of site reliability engineers and the 2019 SRE Report is available for free.