Chaos Engineering: A Lesson from the Experts
Chaos engineering is the practice of experimenting on a distributed system to expose gaps and weaknesses in your infrastructure that could cause performance problems. As the tech industry expands and becomes more complex, chaos engineering is becoming an important part of a company’s digital business strategy. As the complexity of your systems increases, so does the risk of issues that could degrade your user experience.
We recently had the opportunity to host a webinar with a panel of industry experts, who answered a series of audience-chosen questions and shared tips and lessons learned. The Chaos Engineering and DiRT AMA panel included Bruce Wong, Senior Engineering Manager at Twilio; Casey Rosenthal, Engineering Manager (Traffic and Chaos Team) at Netflix; and John Welsh, Cloud Infrastructure and SRE Disaster Recovery at Google.
Below is an excerpt from the live webinar. The entire transcript and webinar recording are available here.
Casey, we have integration tests, so why do we need chaos engineering?
That’s one of my favorite questions. Let me think about that for a second. We view chaos engineering as a form of experimentation, so we draw a distinction between testing and experimentation. In some circles, QA has a bad connotation, but if you’ll leave that aside, testing and chaos engineering kind of live in that space of QA, quality assurance, where we’re building confidence in a system or product. That’s kind of where they live. Testing, strictly speaking, doesn’t create new knowledge. It just determines the valence of a known property. In classical testing, given some conditions and a function, you should have this output. Usually, you’re determining the binary truthiness of that assertion. Right? Either this is true or false.
Whereas, experimentation is a formal method to generate new knowledge. You have a hypothesis and you try doing something, and you see what comes out of it. If your hypothesis is not supported by the data, then it’s kicking off this form of exploration. The difference there is that, in generating new knowledge, we’re looking at things that are much more complicated than you can reasonably expect an engineer to test. Because when you’re testing something, you have to know what to test. Right? That’s a given. You have to make an assertion on a known property.
In chaos engineering, we’re saying, “Look, these systems are just too complicated at this point to reasonably expect an engineer to know all of the properties that a system has.” This form of experimentation helps us uncover new properties that the system has. Definitely, testing is very important. Chaos engineering is not testing; it’s complementary.
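The distinction drawn above can be sketched in a few lines of Python. This is an illustrative sketch only, not Netflix tooling: `fetch_recommendations`, its fallback behavior, and `run_experiment` are hypothetical names invented for this example. The test asserts one known property, while the experiment states a hypothesis about a steady-state metric and compares a control group against a group with the failure injected.

```python
from dataclasses import dataclass


def fetch_recommendations(user_id, dependency_up):
    """Hypothetical service call with a fallback path (an assumption)."""
    if dependency_up:
        return {"status": 200, "items": ["personalized"]}
    # Fallback: serve a generic list instead of failing the request.
    return {"status": 200, "items": ["popular"]}


def test_returns_personalized_items():
    # Classical test: one known property, one true/false answer.
    response = fetch_recommendations(1, dependency_up=True)
    assert response["items"] == ["personalized"]


@dataclass
class ExperimentResult:
    control_success_rate: float
    experiment_success_rate: float
    hypothesis_supported: bool


def run_experiment(users, tolerance=0.01):
    """Hypothesis: failing the dependency does not lower the success rate."""
    half = len(users) // 2
    control, experiment = users[:half], users[half:]

    def success_rate(group, dependency_up):
        ok = sum(
            fetch_recommendations(u, dependency_up)["status"] == 200
            for u in group
        )
        return ok / len(group)

    control_rate = success_rate(control, dependency_up=True)
    experiment_rate = success_rate(experiment, dependency_up=False)
    supported = (control_rate - experiment_rate) <= tolerance
    return ExperimentResult(control_rate, experiment_rate, supported)


result = run_experiment(list(range(100)))
```

When the hypothesis is not supported, the gap between the two groups is the new knowledge: it points at a property of the system that nobody had written an assertion for.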
To go off of what you’re saying about knowledge, I started thinking about the difference between knowledge and skills, as well. Knowledge is a good start, but for leveling up your teams and training individuals, I found chaos engineering to be very helpful. For example, every craft has its own journey. A doctor who has graduated from med school has a lot of knowledge, but he or she still goes through years of training, years of fellowship. We call it a practice. I think in our industry we also have that notion of a craft and a practice. There’s a base of knowledge, but you need to build skills upon that base.
Outages are really what make engineers a lot better and a lot more honed. Chaos engineering is really a good way to accelerate learning, to not just build knowledge about a system, but also to build skills in responding to such a system.
To comment on that, Bruce, at Google we have this concept of a blameless post-mortem culture. Every time you have an outage, a real outage for example, rather than have fear around that or distrust around that, we celebrate those situations because we want to document what happened, we want to share that broadly, and we want people to own that so that we can say, “Look, this is what happened to me. Here’s how I responded. Here’s how I fixed it. Here’s how I’m preventing it from happening again,” and the next team can take advantage of those lessons.
By that knowledge share and by that openness – you’re right, they do get more experience because they have these outages, but they actually get to benefit from other people’s experience having that culture.
Yeah, and I think that sharing knowledge with blameless post-mortems is a great start, and I think that’s a practice that we definitely do at Twilio, as well. But I think chaos gives you the ability to actually give other people the experience of that outage in a controlled manner, whether it’s loading the right dashboards and understanding what the metrics actually mean, or seeing the different parts of the system and how they respond to that outage situation. Getting to practice when it’s 3:00 p.m. and not 3:00 a.m. is pretty advantageous.
Bruce, when you put a new application live, do you actually plan chaos testing around that app before it’s taking any traffic? Or is it expected that reliability standards were followed, and then it will be randomly tested later?
It’s a great question. That was one of the things I had to get acquainted with when joining Twilio. We were actually building brand new products that had zero customers. If there’s a great time to break production, it’s when you have zero customers. Our staging environment in those cases actually has more traffic than production because the product hasn’t launched yet. When we have a private alpha or an open beta, the tolerance for the service being unavailable is a lot higher, so we can be a lot more aggressive with our chaos testing.
That said, I think there are definitely engineering standards that we have. We call it our Operational Maturity model, similar to what Casey mentioned, a Chaos Maturity model. We have that across many, many different dimensions, and we actually make sure that every team follows those things before they actually make products generally available. Chaos actually has an entire portion of that maturity model to make sure that we actually have done the resilient … Not only are we resilient, but are we actually validating that our designs and implementation of that resilience actually works.
Yeah, I think there’s a lot of standards and we’re finding that chaos is actually giving us the ability to iterate on other aspects, not just resilience, as well. Telemetry or insight or monitoring is a good example. I read a lot of RCAs or RFOs, the root cause analysis documents put out by different companies when there’s an outage. One of the themes that I’ve noticed is, oftentimes, there’s either missing alerts or missing telemetry and you need to add more.
Chaos engineering, actually, has allowed us to iterate on our telemetry before we even launch, and so, we cause a failure that we know can and will happen. Did our monitors and alerts catch it? Do we have enough telemetry around that particular scenario to respond appropriately? We can actually iterate on that in isolation before we even get to a launch.
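The loop described above (cause a known failure, then ask whether the monitors and alerts caught it) can be sketched as a small check. This is a simplified illustration, not Twilio’s tooling: `AlertRule`, `inject_database_outage`, `telemetry_gaps`, the metric names, and the thresholds are all assumptions made up for this example, and the failure is simulated by editing a dictionary of metrics.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    metric: str
    threshold: float

    def fires(self, metrics: dict) -> bool:
        # A rule cannot fire on a metric that is never emitted.
        return metrics.get(self.metric, 0.0) > self.threshold


def inject_database_outage(metrics: dict) -> dict:
    """Simulated failure: metrics as they might look during an outage."""
    degraded = dict(metrics)
    degraded["db_error_rate"] = 0.9
    degraded["request_latency_ms"] = 2500.0
    return degraded


def telemetry_gaps(rules, metrics, expected_alerts):
    """Return the expected alerts that did not fire, sorted by name."""
    fired = {rule.name for rule in rules if rule.fires(metrics)}
    return sorted(set(expected_alerts) - fired)


rules = [AlertRule("HighDbErrorRate", "db_error_rate", 0.05)]
healthy = {"db_error_rate": 0.001, "request_latency_ms": 120.0}
gaps = telemetry_gaps(
    rules,
    inject_database_outage(healthy),
    expected_alerts=["HighDbErrorRate", "HighLatency"],
)
# gaps == ["HighLatency"]: no rule watches latency yet.
```

Each entry in `gaps` is an alert to add before launch, which is the kind of iteration on telemetry the answer describes.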
Casey, this is also another really good question. How often do your exercises negatively affect your users and subscribers? Does it ever become an issue for your team?
Oh, never. Yeah, so, occasionally. I think the key to having a good practice is that we never run … I mentioned this in the webinar chat. We never run a chaos experiment if we expect that the system is going to negatively impact our customers. The point of the whole discipline is to build confidence in the system, so if we know that there’s going to be a negative impact, then of course, we engineer a solution to remove that impact. There’s no sense in running an experiment just to verify that we think something’s going to go poorly. As long as we have a good expectation that things are going to go well, a chaos experiment, hopefully, most of the time confirms that.
In the cases where it doesn’t … Those are rare, but they do happen. That is a very important signal for us. That signal gives us new knowledge and tells us where to focus our engineering effort to make the system more resilient. We do have automated controls in place to stop some of our more sophisticated chaos experiments if we can automatically get a signal that our customers are being impacted. Aside from that, numerically, we know that if we are causing some small inconvenience to some of our customers, the ROI to the entire customer base is much greater because of the resiliency that we’re adding and the outages that we’ve prevented that could have affected much larger swaths of our customers.
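An automated stop of the kind mentioned above might look roughly like the following. This is a minimal sketch under stated assumptions, not a description of Netflix’s controls: `run_guarded_experiment`, the error-rate signal, the `max_increase` budget, and the simulated injection steps are all hypothetical.

```python
from typing import Callable, Iterable


def run_guarded_experiment(
    inject_steps: Iterable[Callable[[], None]],
    customer_error_rate: Callable[[], float],
    rollback: Callable[[], None],
    baseline: float,
    max_increase: float = 0.02,
) -> dict:
    """Run injection steps, aborting once customer impact exceeds the budget."""
    steps_run = 0
    try:
        for step in inject_steps:
            step()
            steps_run += 1
            # Check the customer-facing signal after every injection.
            if customer_error_rate() - baseline > max_increase:
                return {"aborted": True, "steps_run": steps_run}
        return {"aborted": False, "steps_run": steps_run}
    finally:
        rollback()  # Always undo injected failures, aborted or not.


# Simulated system state standing in for real telemetry (an assumption).
state = {"error_rate": 0.01, "rolled_back": False}


def add_latency():
    state["error_rate"] += 0.005


def drop_dependency():
    state["error_rate"] += 0.05


def never_reached():
    state["error_rate"] += 0.5


def rollback():
    state["rolled_back"] = True


result = run_guarded_experiment(
    [add_latency, drop_dependency, never_reached],
    customer_error_rate=lambda: state["error_rate"],
    rollback=rollback,
    baseline=0.01,
)
# The second step pushes the error rate past the budget, so the third never runs.
```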
Does that ever cause trouble for our team? I assume that’s a reference to perception within the company or management. At Netflix, no, although the culture here is probably pretty unusual for most tech companies. We have very strong support for the chaos practice and a very strong focus on that ultimate ROI.
It’s interesting talking to other industries, like banking, where they’re like, “Well, we can’t do chaos experiments because there’s real money on the line.” Turns out that banking is one of the industries, now, that is quickly picking up chaos engineering as a practice. I know ING speaks about it. We have anecdotes from other large banks: DB and the large bank in Australia, the name slips my mind. The financial industry is actually looking at this as a very useful, solid practice.
I find that the fear of affecting customers and consumers of a service is well-warranted, but that should not prevent the adoption of the practice. Because if it’s well thought out, the ROI of the practice outweighs the inconvenience or the minimal harm that it can do along the way.
I think there’s a lot of validation for investing in a team like Casey’s, and in the controls and tooling around failure injection. I think, in some of the earlier iterations of failure injection at Netflix, there was a large amount of customer impact caused by … I think it was called Latency Monkey. It was the right strategy, but the tooling, and our understanding of how to control failure injection, wasn’t as mature as it is today. A lot of the work that Kolton Andrus did around the failure injection testing framework allowed chaos to be a lot more surgical in its precision. That actually allowed the team to build a lot of confidence across the organization.
Like Casey’s saying, I think it comes down to ROI and figuring out where you are: where your organization is at, where your tooling is at, and how resilient your application is. That’s what you have to think about when deciding where to get started.
If I can build on that…At Google, we think of the equation in terms of minimizing the cost and maximizing the value of what you’re trying to learn. Cost can come in the form of financial cost, reputational cost to the company brand, cost in terms of developer or employee productivity. That’s a big one. If you disrupt 30,000 people for an hour, that’s a pretty expensive test. What you’re trying to get out of that should be highly valued. It’s kind of a gut check that we do, where we say, “What is the cost of the test? What are we hoping to learn?” We actually ask the test proctors to document: what is the goal of the test, what is the risk, what is the impact, what are the mitigations so that we can help evaluate that equation of minimum cost and maximum value.
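The four items the proctors are asked to document, plus a rough cost figure, could be captured in a small record like the one below. This is a hypothetical sketch, not Google’s DiRT tooling: `ExercisePlan`, its field names, and the sample plan text are invented here, and person-hours is only one possible stand-in for the productivity cost mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class ExercisePlan:
    goal: str
    risk: str
    impact: str
    mitigations: list = field(default_factory=list)
    people_disrupted: int = 0
    hours_disrupted: float = 0.0

    def missing_fields(self) -> list:
        """Names of required sections that are still empty."""
        required = {
            "goal": self.goal,
            "risk": self.risk,
            "impact": self.impact,
            "mitigations": self.mitigations,
        }
        return [name for name, value in required.items() if not value]

    def person_hours(self) -> float:
        """Rough productivity cost: people disrupted times hours disrupted."""
        return self.people_disrupted * self.hours_disrupted


# Sample plan with made-up content; the 30,000 people for one hour
# mirrors the example given in the discussion.
plan = ExercisePlan(
    goal="Confirm that office network failover works",
    risk="Failover does not engage",
    impact="Employees lose connectivity during the test window",
    people_disrupted=30_000,
    hours_disrupted=1.0,
)
```

Here `plan.missing_fields()` reports that mitigations have not been written down yet, and `plan.person_hours()` puts a number (30,000 person-hours) next to the goal so reviewers can weigh cost against expected value.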
Then, just to comment about what Bruce was talking about, culture. At Google, one of the reasons that DiRT is popular and successful is because we have a lot of executive buy-in, and we make it fun. We have a storyline, we inject a bit of play into it, we have a blog story, and we have memes and things like that that go along with the theme of the testing. We have both this, “Okay, I know I’m disrupting you,” but I’m trying to take a little bit of the edge off by having a theme or some fun around it. Then, if things really do go crazy, we have air cover from executives who are saying, “No, this is okay. We’ve reviewed these tests. It’s okay. We’re going to move forward with these.” That’s the model that we’ve built at Google.
Yeah, if I could come back to that, too … One of our big current focuses right now in our chaos tooling is around what we call “Precision Chaos.” Aaron Blohowiak will be talking about this at Velocity in a couple of months, which is actually engineering into our experimentation platform very small blast radiuses. Again, in a lot of cases you can engineer in a stronger statistically relevant signal on a smaller audience. Let’s not let engineering stop us from minimizing the impact on customers.
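One common way to keep a blast radius small while still getting a comparable signal is to assign a small, stable fraction of users to an experiment group and an equally sized control group. The sketch below shows that idea only; it is not Netflix’s experimentation platform, and `assign_group`, the 1% fraction, and the experiment name are assumptions chosen for illustration.

```python
import hashlib
from collections import Counter


def assign_group(user_id: str, experiment: str, fraction: float = 0.01) -> str:
    """Deterministically place a user in a small experiment or control group."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    # Map the first 8 bytes of the hash to a roughly uniform value in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < fraction:
        return "experiment"  # Failure is injected for these users.
    if bucket < 2 * fraction:
        return "control"  # Same size, no injection, used for comparison.
    return "unaffected"  # Everyone else never sees the experiment.


counts = Counter(
    assign_group(f"user-{i}", "latency-injection") for i in range(50_000)
)
```

Because the assignment depends only on the user and the experiment name, a given user stays in the same group for the duration of the experiment, and comparing the two small groups against each other avoids exposing the rest of the audience.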