How Etsy Turns Monitoring Data into Actionable Alerts
In this article we examine Etsy's approach to performance monitoring with a focus on their actionable alerts strategy.
Most people know Etsy as the online marketplace for handmade craft items from around the world. The 10-year-old eCommerce site processes nearly $2 billion worth of transactions a year. Just as it sells handmade merchandise, Etsy takes a “craft” approach to its web operations using open source tooling rather than commercial off-the-shelf software. The company’s Code As Craft blog details the various projects Etsy’s IT department is working on and is a popular destination site in the IT management world.
Etsy has published more than 60 IT management projects on the open source community site GitHub, including the widely used metrics aggregation daemon StatsD. So, I was interested in checking out Etsy performance engineer Allison McKnight’s presentation, titled “Crafting Performance Alerting Tools,” at the recent Velocity New York conference.
Full disclosure: Though Etsy builds most of their IT management systems themselves using open source components, they do use Catchpoint for synthetic monitoring and testing of their Web applications. Even the craftiest IT departments can’t replicate our global node coverage and range of test types. That said, we’ve blogged about Etsy’s approach to IT performance management before and remain fans of how they do things there.
McKnight began by describing Etsy’s IT environment. The company initially logged raw performance data with the Etsy-developed open source tool Logster, then graphed that data using the open source tool Graphite. Graphite was created by online travel company Orbitz, which is also a Catchpoint customer, and contributed to the open source community.
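To make the logging-to-graphing step concrete, here is a minimal sketch of how a single measurement can be formatted for Graphite's plaintext protocol, which takes one line per sample in the form "path value timestamp". The metric name and the helper function are hypothetical, chosen for illustration; this is not Logster's code or Etsy's metric naming.

```python
import time
from typing import Optional


def graphite_line(path: str, value: float, timestamp: Optional[int] = None) -> str:
    # Graphite's plaintext protocol expects "<path> <value> <timestamp>"
    # followed by a newline, with the timestamp in Unix epoch seconds.
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"


# Hypothetical metric name; a real deployment would send this line
# over TCP to the Carbon listener (port 2003 by default).
line = graphite_line("web.homepage.render_ms", 412.0, 1444000000)
print(line, end="")
```

Keeping the formatting separate from the network send makes the sample easy to test without a running Graphite instance.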
Using these tools, Etsy put together a number of dashboards that show how its site (coded in PHP) was performing. While that left it with a nice set of charts, it still needed monitoring to establish a baseline of good performance and detect regressions from that baseline, which would indicate a performance problem. Not surprisingly, Etsy turned to popular open source monitoring tool Nagios to handle the next phase of this project.
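The idea of a baseline plus regression detection can be sketched in a few lines. The example below compares the median of recent timings with the median of a baseline window and maps the percentage change to the standard Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL). The 10% and 25% thresholds, the use of medians, and the function name are assumptions for illustration, not details from McKnight's talk.

```python
from statistics import median

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_regression(baseline_ms, recent_ms, warn_pct=10.0, crit_pct=25.0):
    """Return (status, percent change) of recent timings against a baseline."""
    base = median(baseline_ms)
    current = median(recent_ms)
    change_pct = (current - base) / base * 100.0
    if change_pct >= crit_pct:
        status = CRITICAL
    elif change_pct >= warn_pct:
        status = WARNING
    else:
        status = OK
    return status, change_pct


# Hypothetical page load timings in milliseconds.
status, change = check_regression(
    baseline_ms=[200, 210, 205, 198, 202],
    recent_ms=[230, 228, 235, 240, 226],
)
print(status, round(change, 1))
```

A real Nagios check would print a one-line status message and exit with the status code; the comparison logic is the part shown here.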
Etsy initially created a simple system that sent regression reports from Nagios by email. This proved ineffective, however, because it didn’t catch small or slow-creep regressions that turned into major performance degradations over time. The reports were also difficult to tune, which increased the number of false alarms and led to alert fatigue, and each one required additional investigation before the regression could be understood.
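Here is one way a slow-creep regression can slip past a check. The sketch assumes a report that compares each week only with the week before it, which is an assumption on our part; the talk summary does not say how Etsy's original reports were computed. With a page that gets 3% slower every week and a 10% alert threshold, the week-over-week check never fires, while a check against a fixed baseline fires once the cumulative slowdown passes 10%.

```python
def week_over_week_alerts(weekly_ms, threshold_pct=10.0):
    """Indices of weeks that regressed past the threshold versus the prior week."""
    return [
        i for i in range(1, len(weekly_ms))
        if (weekly_ms[i] - weekly_ms[i - 1]) / weekly_ms[i - 1] * 100.0 >= threshold_pct
    ]


def fixed_baseline_alerts(weekly_ms, threshold_pct=10.0):
    """Indices of weeks that regressed past the threshold versus week 0."""
    base = weekly_ms[0]
    return [
        i for i in range(1, len(weekly_ms))
        if (weekly_ms[i] - base) / base * 100.0 >= threshold_pct
    ]


# Hypothetical data: 200 ms page that gets 3% slower each week for 8 weeks.
weekly = [200.0 * 1.03 ** n for n in range(8)]

print(week_over_week_alerts(weekly))   # no single week crosses 10%
print(fixed_baseline_alerts(weekly))   # cumulative drift does
```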
So, Etsy set out to build its own alerting tools on top of Nagios, with three goals: change the alerting mechanism, create tools to help investigate regressions, and change the alert format. It built a tool called Perfnag to check for regressions at regular intervals and visualize the regressions it detects. Then, it created a tool called Nagios Herald, which adds context to alerts, such as time series data and a breakdown of which elements on the page are causing performance thresholds to be breached. Alerts are then sent, or not sent, accordingly. For example, if a third-party payment processor is the problem, alerts won’t be sent to the IT personnel who are in charge of internal systems.
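The routing behavior described above can be illustrated with a small sketch: each breach records which page element caused it and who owns that element, and the owner decides who gets notified. Every name here (the Breach fields, the route table, the team names) is hypothetical, and the code does not reflect Nagios Herald's actual interfaces.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Breach:
    page: str
    component: str      # page element that breached its threshold
    owner: str          # "internal" or "third_party" (hypothetical labels)
    observed_ms: float
    threshold_ms: float


# Hypothetical routing table: who is notified for each kind of owner.
ROUTES: Dict[str, List[str]] = {
    "internal": ["web-oncall"],
    "third_party": ["vendor-liaison"],
}


def recipients(breach: Breach) -> List[str]:
    # Unknown owners fall back to the internal on-call list so no alert is dropped.
    return ROUTES.get(breach.owner, ROUTES["internal"])


def format_alert(breach: Breach) -> str:
    # Put the context a responder needs into the alert text itself.
    over = breach.observed_ms - breach.threshold_ms
    return (
        f"{breach.page}: {breach.component} took {breach.observed_ms:.0f} ms, "
        f"{over:.0f} ms over the {breach.threshold_ms:.0f} ms threshold"
    )


payment = Breach("checkout", "payment-processor", "third_party", 1800.0, 1200.0)
print(recipients(payment))
print(format_alert(payment))
```

In this sketch a slow third-party payment processor notifies only the vendor liaison, so the team responsible for internal systems is not paged for a problem it cannot fix.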
Soon, Etsy was finding it much easier and faster to investigate and resolve performance issues, and found that better alerting with more context fostered a better spirit of team collaboration. The company is looking to add more and better context to alerts going forward, making sure the right stakeholders in the organization are notified appropriately. And it wants to make alerts more comprehensive, extending them to its mobile site, front end, and APIs.
Kudos to Etsy for developing these tools in house. They seem to be working well as the company’s IT department has evolved from logging and graphing, to monitoring and regression detection, to context-rich alerting. A monitoring system is only effective if it knows when to generate an alert, what metrics to generate an alert from, and to whom to send that alert.
Therefore, it’s no accident that alerting has been our most requested integration from our customers at Catchpoint. We currently support integrations with alerting vendors OpsGenie, PagerDuty, and VictorOps, along with enterprise communications service Slack. Catchpoint, like most commercial monitoring tools, has its own built-in alerting system that alerts users when dynamic thresholds have been breached. These alerts include detailed debugging information, including the request that impacted performance.
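For readers unfamiliar with the term, a dynamic threshold is one that is derived from recent data rather than fixed in advance. The sketch below shows one common, generic way to do this: flag a value that exceeds the mean of a recent window by more than k standard deviations. This is an illustration of the general idea only, with an assumed window and an assumed k of 3; it is not a description of how Catchpoint computes its thresholds.

```python
from statistics import mean, stdev


def dynamic_threshold(history, k=3.0):
    """Threshold that adapts to recent data: mean plus k sample standard deviations."""
    return mean(history) + k * stdev(history)


def breaches(history, value, k=3.0):
    """True when the new value exceeds the threshold derived from the history."""
    return value > dynamic_threshold(history, k)


# Hypothetical response times in milliseconds from a recent window.
history = [100, 102, 98, 101, 99]

print(round(dynamic_threshold(history), 2))
print(breaches(history, 110))
print(breaches(history, 103))
```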
Integrations with our alerting and communications platform partners can be set up with just a few clicks. They enable you to consolidate alerts from Catchpoint and other monitoring tools, direct those alerts to the right stakeholders in your organization, and foster better collaboration, communication, and knowledge sharing among those stakeholders. These alerting systems get smarter over time, keeping track of how alerts were routed, escalated, and responded to in the past, which can lead to faster problem resolution and fewer false positives.
Etsy’s “roll your own” approach fits its culture, and the company is rightfully acknowledged as a leader in building powerful open-source IT management tools. However, that’s not the only right way to handle monitoring and alerting. Integrated commercial monitoring and alerting tools, in lieu of or in addition to open-source tools, are a better fit for many organizations. As our partnering strategy demonstrates, we believe there are a lot of strong commercial alerting and communications tools on the market and you needn’t build your own.
The approach you take should be the one that best serves the goal of keeping your systems available and running at optimal performance levels, and your customers happy. With the holiday shopping season fast approaching, achieving that goal has never been more critical to your business success.