Recently, I was visiting a large media company that had moved its entire infrastructure to Google Cloud a short time earlier. The dev team was pushing for an APM solution and was looking at all the usual suspects: Dynatrace, AppD, New Relic, Datadog… all great choices!
The discussion that followed is highly typical in a world in which Dev and Ops must coexist.
Dev was pushing for APM because they wanted to understand how their apps work and to trace performance issues. So far, so good. However, the dev team was refusing to consider consuming telemetry data from an OUTSIDE point of view. Their attitude was: all we care about is our application data. That will be the source of every piece of information we need, the single source of truth.
As the conversation was going nowhere, I decided to show them a few graphs.
The graph below shows the Connect Time to a Google Storage Object from Los Angeles (based on data from 1/1/2018 to 12/31/2018):
Let’s look at the data again, this time breaking it down by ISP:
And here is how the data looks currently in 2019 (again breaking down by ISP):
So, what’s going on here?
In 2019, our nodes in Los Angeles, running on Verizon, returned the following IPs for the Google storage URL:
Another way to look at it:
The IPs with the highest latency were the ones furthest away from Los Angeles. In 2018, our various nodes were sent as far away as Madrid, while in 2019 the majority of the IPs with the highest connect time were on the East Coast of the U.S.
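For readers who want to reproduce this kind of check from their own vantage point, here is a minimal sketch, not the tooling behind the graphs above. It resolves a hostname, times one TCP handshake to each returned IP, and lists the slowest first. The function names are hypothetical, and the hostname `storage.googleapis.com` is an assumption, since the exact storage URL is not given here.

```python
import socket
import time


def resolve_ips(host, port=443):
    """Return the distinct IPv4 addresses the local resolver gives for host."""
    infos = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def connect_time_ms(ip, port=443, timeout=5.0):
    """Time a single TCP handshake to ip:port; return None if it fails."""
    start = time.perf_counter()
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None


def slowest_first(timings):
    """Sort {ip: ms} by connect time, slowest first, dropping failed probes."""
    ok = [(ip, ms) for ip, ms in timings.items() if ms is not None]
    return sorted(ok, key=lambda pair: pair[1], reverse=True)


def probe(host="storage.googleapis.com"):
    """Measure connect time to every IP returned for host (hostname is assumed)."""
    timings = {ip: connect_time_ms(ip) for ip in resolve_ips(host)}
    return slowest_first(timings)
```

Calling `probe()` from a single machine only shows what that machine's resolver and ISP see, and a single handshake is a noisy sample. The point of the data above is that the answer changes with location and provider, so the same probe has to run from many vantage points, repeatedly, to be meaningful.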
I said to the media company, “Great, you have your APM data, and you have your RUM data… meanwhile, your customers in LA are screaming that your application is dead slow and they’re going to dump you because they can’t get their work done. And yet, your internal telemetry is telling you that everything is OK! Even worse, you cannot find why people in the LA area are complaining.”
So, how do you solve this? If you don’t have the right telemetry to gain insights from your customers’ point of view, you can easily end up wasting a lot of time going in circles troubleshooting, burning out your team and annoying your customers.
By the end of our meeting, the media company realized that an APM-only approach wasn’t enough when it came to their customers’ end-user experience. They agreed to rely on APM for application tracing only and to embrace a true, dedicated Digital Experience Monitoring solution in which proactive, end-user-based monitoring allows them to catch problems before they impact the customer.