Catchpoint’s webhooks give us the control and flexibility to visualize and analyze our data and to integrate it with our alerting tools. With this tool, we were able to use Catchpoint’s real-time measurements to pinpoint and resolve Google Public DNS latency. Instead of working through a long process, we were able to get at the problem almost instantly and turn it around in just minutes instead of tens of minutes.
To manage all of this data, Google’s Site Reliability Engineering (SRE) team relies on Catchpoint’s Test Data Webhook feature. This tool lets the client select which of their tests push Catchpoint data to a specified endpoint in real time, where it can then be integrated with any number of third-party tools for storage and visualization; in Google’s case, this is done using their own in-house tools such as Google Data Studio.
With the Test Data Webhook enabled, Google’s performance data is pushed to their designated endpoint every time they run a test within the Catchpoint platform, where they can then run their ETL (Extract, Transform, Load) process. This lets them work around the system limits of the REST API and handle all of their performance data as soon as Catchpoint collects it, as well as store it for even longer than Catchpoint’s industry-leading three-year storage offering.
After the Catchpoint node collects the test data from the test target, the information is compiled into JSON (XML is another formatting option) and sent to Google’s endpoint, where it is posted to an App Engine application running on Google’s platform. There it undergoes the ETL functions and is then stored in Cloud Bigtable, from which it is visualized and analyzed using Data Studio or any other visualization tool that they wish (e.g., Grafana, Geckoboard, etc.).
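The flow described above (receive a JSON POST, transform it, store it) can be sketched in a few lines of Python. This is a minimal illustration, not Google’s or Catchpoint’s actual implementation: the payload field names (`test_id`, `node_id`, `timestamp`, `asn`, `response_ms`), the row-key layout, and the in-memory dictionary standing in for Cloud Bigtable are all assumptions made for the example.

```python
import json
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory stand-in for the data store (the article uses Cloud Bigtable).
STORE = {}


def transform(payload: bytes) -> dict:
    """Extract the fields we care about and flatten them into one row.

    All field names here are hypothetical; a real webhook payload is
    defined by the template configured on the sending side.
    """
    record = json.loads(payload)
    ts = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
    stamp = ts.strftime("%Y%m%dT%H%M%SZ")
    return {
        # Row key groups measurements by test, then node, then time.
        "row_key": f'{record["test_id"]}#{record["node_id"]}#{stamp}',
        "asn": record["asn"],
        "response_ms": float(record["response_ms"]),
    }


def load(row: dict, store: dict) -> None:
    """Write the row's columns into the store under its row key."""
    store[row["row_key"]] = {k: v for k, v in row.items() if k != "row_key"}


class WebhookHandler(BaseHTTPRequestHandler):
    """Accepts one measurement per POST and runs transform + load on it."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            row = transform(body)
        except (ValueError, KeyError, TypeError):
            # Malformed JSON or a missing/invalid field: reject the delivery.
            self.send_response(400)
            self.end_headers()
            return
        load(row, STORE)
        self.send_response(204)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Run the receiver until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Keeping `transform` and `load` separate from the HTTP handler means the ETL logic can be exercised without a running server, and the storage backend can be swapped without touching the parsing code.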
The measurements that Catchpoint provides have enabled Google to detect performance issues in multiple digital properties under their control, including both their Public DNS and their backbone infrastructure.
In the case of Google Public DNS, the service was experiencing very high query latency, which was undetectable under their own internal monitoring. Because there is no TCP connection, the server has no way of knowing how long a DNS answer takes to reach the client once it is sent; essentially, there is no way for them to measure the round-trip time between the client and the server.
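Round-trip time of this kind can only be observed from the client side. As a rough illustration of what such a measurement involves (not Catchpoint’s method, which the article does not describe), the sketch below builds a minimal DNS query for an A record and times the UDP exchange. The function names, the default timeout, and the fixed query ID are choices made for this example; 8.8.8.8 is a Google Public DNS address.

```python
import socket
import struct
import time


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet for an A record (class IN)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE = 1 (A), QCLASS = 1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def time_query(name: str, server: str = "8.8.8.8", timeout: float = 2.0) -> float:
    """Return the client-observed round-trip time of one query, in ms.

    Raises socket.timeout (an OSError) if no answer arrives in time.
    """
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        sock.sendto(query, (server, 53))
        sock.recvfrom(512)  # wait for the answer; content is not parsed here
        return (time.perf_counter() - start) * 1000.0
```

Calling `time_query("example.com")` from many vantage points, and recording the network each one sits in, gives the per-network latency view that the server alone cannot produce.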
With Catchpoint, however, Google’s SRE team was able to detect query latency issues from a network perspective, specifically by identifying the ASNs that were experiencing the most latency. From there, the SRE team could drill down directly to where the problem was, rather than having to go back and forth between the ISP that had reported the problem, their customers, and the SRE support team. Ultimately, they were able to detect and fix the problem in just a few minutes, when it ordinarily could have taken close to an hour.
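Identifying the worst-affected ASNs amounts to grouping measurements by network and ranking the groups. The sketch below assumes each stored row carries an `asn` and a `response_ms` field (hypothetical names) and ranks by median latency; the article does not say which statistic Google’s team used, so the median is an assumption chosen because it is less sensitive to single outliers than the mean.

```python
from collections import defaultdict
from statistics import median


def slowest_asns(rows, top_n=3):
    """Return the top_n (asn, median_latency_ms) pairs, slowest first."""
    by_asn = defaultdict(list)
    for row in rows:
        by_asn[row["asn"]].append(row["response_ms"])
    ranked = sorted(
        ((asn, median(values)) for asn, values in by_asn.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_n]


# Sample rows using ASNs from the range reserved for documentation.
sample = [
    {"asn": 64500, "response_ms": 20},
    {"asn": 64500, "response_ms": 40},
    {"asn": 64501, "response_ms": 200},
    {"asn": 64501, "response_ms": 300},
    {"asn": 64501, "response_ms": 250},
    {"asn": 64502, "response_ms": 15},
]
```

With the sample rows, `slowest_asns(sample, top_n=2)` puts AS64501 first with a median of 250 ms, followed by AS64500 at 30 ms.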
Furthermore, whenever Google has problems on their backbone that require a post-mortem, one of the things that they’re interested in is learning how those failures have affected their Cloud product. Because anybody in the company can access the data once it reaches the data store, the appropriate people can perform the analysis themselves and include it in the post-mortem or performance report. They no longer have to rely on someone with direct access to the Catchpoint platform to create a report for them, which saves time for multiple teams.