Google Saves Time And Reduces Latency With Catchpoint

Reduce MTTR from hours to minutes

Minimize issue analysis time

Google needed to analyze vast volumes of data in as little time as possible to determine long-term benchmarks and trends, reduce latency, drill down into specific data over long periods of time, and make data easily accessible for the many people and departments to whom it may be relevant.

Google partnered with Catchpoint to:

Provide active observability data across their digital properties and networks.
Push their performance data to a designated endpoint for storage and analysis.
Integrate Catchpoint alerts with their own alerting tool to become more proactive.

Employees:

139,995

Revenue:

$66,001,000,000

Headquarters:

Mountain View, CA

Industry:

Catchpoint’s webhooks give us the control and flexibility to visualize and analyze our data, and integrate it with our alerting tools. With this tool, we were able to use Catchpoint's real-time measurements to pinpoint and resolve Google Public DNS latency. Instead of a long process, we were able to get at it almost instantly, and turn around the problem in just minutes instead of tens of minutes.

Matthew White

SRE Manager

Problem

As one of the largest enterprise companies in the world, Google has a massive number of digital properties under its control, which require constant internal and external monitoring to maintain the technology brand’s reputation for digital excellence.

To ensure excellent performance across their many different digital properties, Google must be able to collect, store, and analyze huge amounts of data in as little time as possible. A traditional REST API solution cannot satisfy this need because of system limits that cap the number of requests that can be made in an allotted period of time. Instead, they require a way to collect and store all of the data as it comes in so that they can analyze it in real time.

Google must be able to analyze this information across months and years, be it for determining long-term benchmarks and trends, or for drilling down into specific data over long periods of time.

Additionally, due to the scope of the organization, the data must be stored in a place that’s easily accessible to the many different people and departments to whom it may be relevant.


Solution

To manage all of this data, Google’s Site Reliability Engineering (SRE) team relies on Catchpoint’s Test Data Webhook feature. This tool allows the client to select which of their tests push Catchpoint data to a specified endpoint in real time, where it can then be integrated with any number of third-party tools for storage and visualization; in Google’s case, this is done using their own in-house tools such as Google Data Studio.

By enabling the Test Data Webhook, Google’s performance data is pushed to their designated endpoint every single time they run a test within the Catchpoint platform, where they can then run their ETL (Extract, Transform, Load) process. In doing so, they are able to overcome the system limits of the REST API and handle all of their performance data as soon as it’s collected by Catchpoint, as well as store it for even longer than Catchpoint’s industry-leading three-year storage offering.

After the Catchpoint node collects the test data from the test target, the information is compiled into JSON (XML is another formatting option) and sent to Google’s endpoint, where it is posted to an App Engine application running on Google’s platform. There it undergoes the ETL functions and is then stored in Cloud Bigtable, from which it is visualized and analyzed using Data Studio or any other visualization tool they wish (e.g., Grafana or Geckoboard).
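The receive-and-transform step of this flow can be illustrated with a minimal Python sketch. It is not Google’s or Catchpoint’s implementation: the payload field names (test_id, node, timestamp, metrics.response_ms), the port, and the in-memory STORE list standing in for Cloud Bigtable are all assumptions made for illustration, since the real payload follows whatever template is configured for the webhook.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory stand-in for the data store (the case study uses Cloud Bigtable).
STORE = []


def transform(raw_body):
    """Extract one pushed test result and flatten it into a storable row.

    Every field name below is hypothetical; adjust to the actual payload.
    """
    record = json.loads(raw_body)
    return {
        "test_id": record["test_id"],
        "node": record["node"],
        "timestamp": record["timestamp"],
        "response_ms": float(record["metrics"]["response_ms"]),
    }


class WebhookHandler(BaseHTTPRequestHandler):
    """Accepts one JSON document per POST and loads the transformed row."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            STORE.append(transform(body))
            self.send_response(204)  # accepted, nothing to return
        except (ValueError, KeyError, TypeError):
            self.send_response(400)  # malformed or unexpected payload
        self.end_headers()


def serve(host="127.0.0.1", port=8080):
    """Run the receiver until interrupted (host and port are assumptions)."""
    HTTPServer((host, port), WebhookHandler).serve_forever()
```

Calling serve() starts the receiver; each test run then arrives as a single push, so the endpoint never has to poll a rate-limited API to keep up with incoming results.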


Results

The measurements that Catchpoint provides have enabled Google to detect performance issues in multiple digital properties under their control, including both their Public DNS and their backbone infrastructure.

In the case of Google Public DNS, the service was experiencing very high query latency that was undetectable by their own internal monitoring. Because there is no TCP connection, there is no way of knowing how long a DNS answer takes to reach the client once it is sent; essentially, they cannot measure the round-trip time between the client and the server.

With Catchpoint, however, Google’s SRE team was able to detect query latency issues from a network perspective, specifically by identifying the ASNs that were experiencing the most latency. From there, the SRE team could drill down directly to where the problem was, rather than having to go back and forth between the ISP that had reported the problem and their customers, and then the SRE support team. Ultimately, they were able to detect and fix the problem in just a few minutes, when it ordinarily could have taken close to an hour.
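As a rough illustration of this kind of drill-down, the Python sketch below groups stored measurements by ASN and ranks them by median latency. The row shape (asn, latency_ms) and the function name are hypothetical, the ASNs come from the range reserved for documentation, and the case study does not say which statistic the SRE team actually used.

```python
from collections import defaultdict
from statistics import median


def rank_asns_by_latency(rows):
    """Return (asn, median_latency_ms) pairs, slowest ASN first.

    Each row is assumed to be a dict with hypothetical keys
    "asn" and "latency_ms".
    """
    by_asn = defaultdict(list)
    for row in rows:
        by_asn[row["asn"]].append(row["latency_ms"])
    ranked = [(asn, median(values)) for asn, values in by_asn.items()]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Made-up sample measurements for two documentation-range ASNs.
sample = [
    {"asn": 64500, "latency_ms": 20},
    {"asn": 64500, "latency_ms": 40},
    {"asn": 64501, "latency_ms": 200},
    {"asn": 64501, "latency_ms": 220},
    {"asn": 64501, "latency_ms": 180},
]
slowest_first = rank_asns_by_latency(sample)
```

The median is used here because a handful of extreme measurements would otherwise dominate a per-network average.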

Furthermore, whenever Google has problems on their backbone that require a post-mortem, one of the things they’re interested in learning is how those failures have affected their Cloud product. Because anybody in the company can access the data once it reaches the data store, the appropriate people can perform the analysis themselves and include it in the post-mortem or performance report. They no longer have to rely on someone with direct access to the Catchpoint platform to create a report for them, which saves time for multiple teams.
