Our recap of top monitoring stories of 2019 is influenced by the year’s various reminders of just how fragile the Internet is. 2019 included a wave of serious outages taking down many of the web’s biggest players, including Facebook, AWS, and Google Cloud. The scale of the outages and the increasingly interdependent nature of the web underlines the importance of having robust, early alert systems in place. Many vantage points (monitoring nodes) spread over multiple clouds, ISPs, and wireless networks provide the earliest possible warnings of any problem and thereby reduce mean time to resolution (MTTR).
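To make the vantage-point idea concrete, here is a minimal sketch of quorum-based alerting across multiple monitoring nodes. Everything here is hypothetical for illustration (the vantage-point names, the 50% quorum, and the 500 ms SLO are assumptions, not Catchpoint's actual logic): the point is that one noisy node alone shouldn't page anyone, but agreement across networks should.

```python
from dataclasses import dataclass

@dataclass
class ProbeResult:
    vantage_point: str   # e.g. a cloud region or a consumer ISP
    reachable: bool
    latency_ms: float

def should_alert(results, failure_quorum=0.5, latency_slo_ms=500.0):
    """Alert only when a quorum of vantage points sees the target
    as unreachable or slower than the SLO."""
    bad = sum(
        1 for r in results
        if not r.reachable or r.latency_ms > latency_slo_ms
    )
    return bad / len(results) >= failure_quorum

# Hypothetical probe data from three different networks:
results = [
    ProbeResult("aws-us-east", True, 120.0),
    ProbeResult("gcp-europe", False, 0.0),
    ProbeResult("isp-broadband", True, 900.0),
]
print(should_alert(results))  # 2 of 3 vantage points unhealthy -> True
```

The more diverse the vantage points (clouds, ISPs, wireless networks), the earlier a real problem crosses the quorum and the lower the MTTR.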
Another trending topic in the monitoring world over the last 12 months has been the benefits of SaaS monitoring amidst increased cloud complexity. For those companies picking best-of-breed services, cloud complexity is inevitable. While the ability to choose your vendors comes with many benefits, primarily flexibility, it can also make it harder to manage the multiple workloads, databases, platforms, storage systems, security models, and so on that you're juggling. Within this context, it is particularly important to institute an objective SaaS monitoring solution to guarantee performance and hold third-party vendors accountable when necessary.
Other important monitoring stories from the last year include Gartner’s introduction of their first-ever DEM Guide, the growing trend towards observability, and the aftermath of Dynatrace’s Gomez shutting down.
Here are Catchpoint’s top 9 monitoring stories of 2019:
1. Summer of Outages Exposes the Fragility of the Internet
2019's surge in systemic outages has been a wake-up call for all of us. With IP traffic likely to grow threefold from 2017 to 2022, guaranteeing reliable services is becoming an ever more difficult proposition. This summer's wave of outages spanned many of the web's biggest players, from whom we increasingly expect complete reliability – from Facebook and Instagram failing to load photographs on both apps in July to an ISP in Pennsylvania setting off a domino effect in June that took down huge chunks of the web, including AWS and Cloudflare (which called the event "a small heart attack"). Being briefly unable to access your favorite social media app may seem trivial, but such outages can seriously impact a company's bottom line and expose the increasingly fragile nature of the Internet.
Asked by Forbes how businesses can respond proactively to the challenge, Catchpoint CEO Mehdi Daoudi stressed the importance of monitoring from infrastructure outside your own firewall in order to maintain the accessibility of your digital properties.
2. BGP Leaks, Cloudflare Outages, and AWS Issues
We ran a five-part blog series about the Border Gateway Protocol (BGP) back in October, an important topic to understand given its inherent vulnerabilities, which have been dramatically on display over this past year. BGP was conceived in the early days of the web when there were only a handful of players online, hence its design didn’t protect against accidental or deliberate errors. In June, a widespread BGP routing leak was the initial cause behind the major outage that impacted Cloudflare and the hundreds of websites the popular CDN supports, making them far slower than usual or sending them entirely offline. Catchpoint detected the problem within seconds, enabling us to track performance, availability, and reachability in real-time across multiple sites.
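One common signature of a BGP leak or hijack is a more-specific prefix suddenly announced from an unexpected origin AS. As a minimal, hypothetical sketch (the baseline table, prefixes, and AS numbers below are illustrative, not a real routing feed or Catchpoint's detection logic), a route monitor can compare announcements against a known-good baseline:

```python
import ipaddress

# Illustrative baseline of expected origin AS per prefix.
BASELINE = {
    "104.16.0.0/12": 13335,     # a Cloudflare block (AS13335)
    "205.251.192.0/19": 16509,  # an Amazon block (illustrative)
}

def suspicious_announcements(announcements):
    """Flag announcements that fall inside a known block but carry
    an origin AS that differs from the baseline -- the classic shape
    of a route leak or hijack."""
    alerts = []
    for prefix, origin_as in announcements:
        net = ipaddress.ip_network(prefix)
        for known, expected_as in BASELINE.items():
            known_net = ipaddress.ip_network(known)
            if net.subnet_of(known_net) and origin_as != expected_as:
                alerts.append((prefix, origin_as, expected_as))
    return alerts

# A hypothetical leaked, more-specific route inside Cloudflare space:
feed = [("104.16.0.0/12", 13335), ("104.20.0.0/20", 396531)]
print(suspicious_announcements(feed))
```

Real detection also needs to account for legitimate origin changes and prepended AS paths, but flagging unexpected more-specifics within seconds is exactly what lets a monitor catch an event like June's leak as it unfolds.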
3. Monitoring and Chaos Engineering
Chaos Engineering has emerged as a response to the increased complexity of the web driven by the rapid rise of microservices and distributed cloud architectures. The practice, which involves literally "breaking things on purpose" by deliberately injecting faults to build antifragile systems, started around a decade ago at Amazon and Netflix, enabling those web-scale companies to test their complex distributed systems against worst-case scenarios before they occur. Gremlin, founded by a former employee of both companies, aims to make chaos engineering available to companies that don't have their own teams of Site Reliability Engineers (SREs).
Without the proper collection of baseline metrics and advanced monitoring, however, it is impossible to comprehensively understand the impact of the faults you’ve introduced into your system. Google’s manual on Site Reliability Engineering discusses four fundamental areas to monitor: latency, traffic, errors, and saturation. By closely observing each of these metrics, you can witness most of the effects of your chaos engineering-related experiments; for instance, by searching for correlations, you can better see how components interact; by watching errors and saturation, you can introduce load and work out how scalable your system really is; and so on.
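A toy sketch of how the four golden signals might be computed from request samples gathered around a fault injection. The sample data, the capacity figure, and the nearest-rank percentile method are all assumptions for illustration, not Google's or Catchpoint's implementation:

```python
# Hypothetical request samples: (latency_ms, HTTP status code)
samples = [(110, 200), (95, 200), (480, 500), (130, 200),
           (2100, 200), (88, 200), (530, 503), (120, 200)]

CAPACITY_RPS = 200   # assumed per-instance capacity
observed_rps = 150   # traffic measured during the experiment

def golden_signals(samples, observed_rps, capacity_rps):
    latencies = sorted(l for l, _ in samples)
    errors = sum(1 for _, code in samples if code >= 500)
    return {
        # Latency: track a high percentile, not the mean --
        # fault injection usually hurts the tail first.
        "p95_latency_ms": latencies[int(0.95 * (len(latencies) - 1))],
        "traffic_rps": observed_rps,
        "error_rate": errors / len(samples),
        "saturation": observed_rps / capacity_rps,
    }

snapshot = golden_signals(samples, observed_rps, CAPACITY_RPS)
print(snapshot)
```

Capturing a snapshot like this before the experiment gives you the baseline; diffing it against the snapshot taken during the fault injection is what tells you whether the blast radius stayed where you expected.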
4. Gartner Introduces the First-Ever DEM Guide
At the start of September, Gartner introduced its first-ever guide to Digital Experience Monitoring (DEM). The global research and advisory firm was recently named one of the World's Most Admired Companies for 2019 by Fortune. The research areas it focuses on are widely noted and can play a tectonic role in shifting the industry landscape, hence the significance of its concentration on DEM for the monitoring industry.
The Gartner DEM guide focuses on improving the end-user experience for digital transformation, a difficult proposition while I&O is losing direct control of infrastructure and applications. Gartner stresses that I&O leaders "must [therefore] use digital experience monitoring to optimize business transactions and customer journeys regardless of where workloads reside". The guide includes a market description, market direction, market analysis (including common DEM approaches), a list of representative vendors (including Catchpoint), market recommendations, and finally, Gartner recommended reading.
5. Increasing Cloud Complexity, Managing It, and the Benefits of SaaS Monitoring
Cloud complexity is the result of continually increasing migration rates to the cloud and net-new development without anticipation of the complexity these changes bring to operations. As David Linthicum of Deloitte Consulting put it in an interview with TechBeacon, “with cloud computing on track to become the mother of all shifts, especially in regards to IT’s approaches to development and operations, we yet again face the issue of conversion mistakes – this time a hundredfold greater than previous moves to distributed computing and the web”.
The challenges involved make it essential that a DEM solution is put in place for all three phases of cloud migration: before (when benchmarking is essential), during (to guarantee continued availability while changes are happening) and afterwards (to ensure your services are running at their best).
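For the before and after phases, even a simple baseline comparison can flag regressions introduced by a migration. The page names, latency figures, and 10% tolerance below are purely illustrative:

```python
def regression_report(baseline_ms, current_ms, tolerance=0.10):
    """Compare post-migration latencies against the pre-migration
    baseline, per page or transaction; flag anything more than
    `tolerance` slower, reporting the fractional slowdown."""
    regressions = {}
    for page, before in baseline_ms.items():
        after = current_ms.get(page)
        if after is not None and after > before * (1 + tolerance):
            regressions[page] = round((after - before) / before, 2)
    return regressions

# Hypothetical median latencies (ms) before and after a migration:
baseline = {"home": 220.0, "checkout": 540.0, "search": 310.0}
post_migration = {"home": 230.0, "checkout": 690.0, "search": 300.0}
print(regression_report(baseline, post_migration))  # {'checkout': 0.28}
```

This is precisely why benchmarking before the migration matters: without the baseline dictionary, there is nothing to compare the post-migration numbers against.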
Software-as-a-service (SaaS) offerings have become one of the most common deployments for cloud-based applications. The need to monitor SaaS apps' performance, reachability, and availability has become correspondingly important. One of the primary benefits of end-to-end SaaS monitoring is that it allows users to examine their applications from the end user's point of view.
6. The Trend Toward Observability Continues
Observability has been a buzzword in the monitoring world for several years now, and that trend certainly continued across 2019. The term was popularized in the monitoring context by Twitter's Observability Engineering Team in 2013. Increasing numbers of companies are using the term to describe their services, including Grafana Labs, which earlier in the year revealed its goal to be the first vendor to build an open-source-based observability platform with the aid of $24 million in Series A funding.
Others are wondering whether observability is merely a new way to describe monitoring; as distributed systems engineer Cindy Sridharan tweeted, "'Observability' – because devs don't like to do 'monitoring,' we need to package it in new nomenclature to make it palatable and trendy."
In a just-released episode of the o11ycast podcast, our CEO Mehdi Daoudi defined observability as "the ability to connect various dots from all this telemetry data we have, and how can we quickly answer what is broken and why?"
7. Outages in China and the Growth in Monitoring (and its Challenges)
China has been suffering from its own share of major outages this year. In May, China Telecom experienced an outage that lasted almost eight hours, impacting a huge number of sites, including Western ones such as Apple, Amazon, Slack, and Microsoft. Substantial packet loss across China Telecom's backbone was witnessed for web traffic entering and exiting the country. At the outage's peak, over 100 services were disrupted.
In August, we added new monitoring nodes to our system in China. Catchpoint now has 140 nodes in mainland China, making up 14% of our global footprint. We’ve written before about the need for a separate monitoring strategy tailored to China, involving carefully selecting monitoring locations, running traceroutes and DNS tests, creating charts in local China time and setting up different alerts for China.
While China's government does block various Western sites and services (via what is known as "The Great Firewall of China"), the outages demonstrated that Chinese ISPs like China Telecom today have a worldwide presence and importance. China Telecom both controls infrastructure beyond the Chinese borders and provides regional interconnects for other service providers in various locations around the world.
8. Dynatrace’s Gomez Shutting Down… What’s Next?
In a significant move this year, industry heavyweight Dynatrace shut down its old Gomez/Keynote synthetic monitoring product to move all its users over to a cloud-based monitoring service, spelling the end of location-based monitoring at Dynatrace. In a blog post from July 2019, Ben Rushlo of Dynatrace's Digital Experience team outlined why Dynatrace believes "the future of synthetic testing is in the cloud".
We've long seen that Gomez doesn't work effectively, and we don't believe the "new synthetic solution", which is essentially about saving money at the expense of Dynatrace's customers, works sufficiently either. Catchpoint, by contrast, has more than 500 backbone and broadband nodes, thousands of servers in hundreds of geographical locations around the world, and a unique architectural approach to building synthetic monitoring at scale that works.
9. How to Make Post-Incident Reviews More Effective (and Coping with Post-Incident Stress)
A story that isn't told often enough: the impact that incidents have on people and the reality of post-incident stress. Tackling what can be a taboo subject, Jaime Woo and Dawn Parzych of Catchpoint opened up about this delicate topic earlier this year, inviting SREs at an SREcon lightning talk to engage in some audience participation ahead of the release of the now-annual Catchpoint SRE Report, which delves into the subject in more detail.
Meanwhile, more focus is also being brought to how to make post-incident reviews more effective. Earlier this fall, Woo and Emil Stolarsky, co-founders of Incident Labs, released the first issue of The Post-Incident Review, a zine that brings together original articles and curated links to postmortems elsewhere.
As 2019 comes to a close, certain trends are likely to become even more top-of-mind in 2020. The importance of SaaS monitoring, visibility into cloud infrastructure, and monitoring end user experience are all likely to be popular topics of discussion. For more info on these IT monitoring trends, read Gartner’s Market Guide for Digital Experience Monitoring (DEM) report in full.