Today Southwest Airlines COO Michael Van De Ven went on Twitter to explain the “catastrophic technological failure” that took down Southwest’s systems for most of the day yesterday, and how its customers were affected. With no network, Southwest was unable to check passengers in to flights, move crews around, or find baggage. Van De Ven said that the company has redundancies in place, but they did not kick in as expected, and about 800 of its servers were affected.
The minute I saw his tweet I had to look at the data. Here’s what I found (above). Right before 3:00pm EDT you can see something that looks like a spike and an error. Then the site’s response time drops. This is weird. Shouldn’t a failure look more like an error? On the surface it looks like their page got better!
The story becomes clearer if you follow Southwest’s tweets. At 1:54pm EDT the company tweeted that it was investigating systems issues. This appears to be in response to a spike in wait times at 1:20pm EDT. See below:
As you can see, the issue with wait times escalated quickly by 3:00pm EDT, before the drop in web page response time. The drop appears to have happened because Southwest switched to its distress plan and served only its error page, which is lighter and faster to load than the full site. The company identified an issue and moved quickly to respond and manage customer expectations.
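One way to avoid misreading a drop like this is to check what the page contains, not only how fast it responds. The sketch below is a minimal illustration of that idea, not the tool behind the charts in this post. The `Sample` fields, the marker string, the thresholds, and all the numbers are assumptions made up for the example.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Sample:
    """One synthetic check of a page (hypothetical fields)."""
    response_ms: float
    status: int
    body: str


def classify(sample, baseline_ms, marker, spike_factor=3.0, drop_factor=0.5):
    """Label one sample relative to a baseline response time.

    The thresholds are illustrative assumptions, not recommended values.
    """
    if sample.status >= 500 or marker not in sample.body:
        # Fast or slow, the expected page content is missing.
        return "error_page"
    if sample.response_ms > baseline_ms * spike_factor:
        return "slow"
    if sample.response_ms < baseline_ms * drop_factor:
        # The right content, but unusually fast: worth a second look.
        return "suspiciously_fast"
    return "ok"


# Made-up numbers: a normal morning, a spike, then a "faster" error page.
baseline = median([1000, 1100, 900])  # 1000 ms
marker = "Book a flight"              # text expected on the healthy page

print(classify(Sample(1000, 200, "Book a flight today"), baseline, marker))  # ok
print(classify(Sample(4000, 200, "Book a flight today"), baseline, marker))  # slow
print(classify(Sample(200, 200, "We are sorry"), baseline, marker))          # error_page
```

With a content check in place, the fast-loading error page is labeled as an error rather than as an improvement.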
This is a great example of how a company should respond to major interruptions of service: detect an issue early to minimize the impact; use social media to keep your users informed; and as soon as the issue escalates, notify users further so they are not left in the dark.
Every company will eventually have to manage a situation like this one. Southwest may have had a difficult day yesterday, but its response to this failure is a model for others and probably minimized the issue’s negative impact on customer loyalty and the business.