Given the size and global reach of LinkedIn’s user base, Catchpoint’s global node infrastructure is a critical part of the company’s monitoring strategy. With more than 600 monitoring agents across dozens of ISPs worldwide, including both backbone and last-mile agents, LinkedIn’s SRE team can see the end-user experience wherever those users are located. This is vital for confirming that their third-party infrastructure, such as their CDNs, is performing as expected.
“I've used a lot of monitoring tools, and I have to say that Catchpoint stands out from the pack,” says Samir Jafferali, edge performance SRE at LinkedIn. “There are so many nodes and so many features, and we're always able to present the data that's important in any fashion that we want. The other tools, yes, they have agents in other locations, but it's not necessarily the locations you care about.”
LinkedIn’s CDN tests run on Catchpoint’s web transaction monitor, which exercises HTTP content through custom Selenium scripts and captures the headers for every object on the page. Each test is tied to specific performance thresholds for each of LinkedIn’s CDNs, and the results are streamed into LinkedIn’s internal systems in real time through the Catchpoint Test Data Webhook. This lets the SRE team detect latency increases as they occur and, when necessary, shift users to a different CDN while the vendor addresses the problem.
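To make the threshold-and-handoff step more concrete, here is a minimal Python sketch of the decision logic a webhook consumer might apply. It is not LinkedIn’s implementation or Catchpoint’s API: the record shape (`cdn`, `node`, `response_ms`), the CDN names, the 300 ms thresholds, and the use of the median as the comparison statistic are all hypothetical choices for illustration.

```python
"""Sketch: evaluate CDN test results against per-CDN latency thresholds.

Assumes a hypothetical, simplified record shape
{"cdn": str, "node": str, "response_ms": float}; mapping the real
Catchpoint Test Data Webhook payload onto it is not modeled here.
"""
from collections import defaultdict
from statistics import median


def breaching_cdns(records, thresholds_ms):
    """Return the set of CDNs whose median response time exceeds their threshold."""
    samples = defaultdict(list)
    for rec in records:
        samples[rec["cdn"]].append(rec["response_ms"])
    # CDNs without a configured threshold are ignored rather than flagged.
    return {
        cdn
        for cdn, values in samples.items()
        if cdn in thresholds_ms and median(values) > thresholds_ms[cdn]
    }


def pick_cdn(preference, breaching):
    """Return the first CDN in preference order that is not breaching.

    Falls back to the top preference if every CDN is breaching, so
    traffic always has a target.
    """
    for cdn in preference:
        if cdn not in breaching:
            return cdn
    return preference[0]


if __name__ == "__main__":
    thresholds = {"cdn_a": 300.0, "cdn_b": 300.0}  # hypothetical thresholds
    batch = [
        {"cdn": "cdn_a", "node": "node-1", "response_ms": 420.0},
        {"cdn": "cdn_a", "node": "node-2", "response_ms": 390.0},
        {"cdn": "cdn_b", "node": "node-1", "response_ms": 180.0},
    ]
    bad = breaching_cdns(batch, thresholds)
    print(sorted(bad), pick_cdn(["cdn_a", "cdn_b"], bad))
```

Comparing the median rather than a single sample keeps one slow node from triggering a handoff on its own; a production system would likely also require a breach to persist across several batches before moving traffic.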
On the DNS side, LinkedIn runs continuous tests with two of Catchpoint’s DNS monitors. DNS experience tests have Catchpoint act as a DNS resolver, measuring latency from the end user’s perspective and providing granular performance data. Direct name server tests monitor the name servers themselves, producing data tied directly to the availability of those servers and, by extension, of the site itself. Together, these tests give LinkedIn a complete view of its DNS performance.
The ability to collect, analyze, and share this data quickly, even in the middle of a performance crisis, is vital to the LinkedIn service offering. By combining historical trends with custom visualizations, LinkedIn can present the data in whatever way best suits the problem at hand and get straight to its root cause.
“The other tools have historical trending, but the graphic capabilities are limited,” notes Jafferali. “The ability for [Catchpoint] to capture headers for every single object on the page and then do analysis on the headers and plot variations of headers over time from different ASs is very powerful.”
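The header analysis Jafferali describes amounts to grouping per-object header values by autonomous system and time window. The Python sketch below shows one way to build that table for a single header so it can be plotted; the observation fields (`asn`, `timestamp`, `headers`), the one-hour bucket, and the `X-Cache` header in the example are hypothetical and do not reflect Catchpoint’s actual export format.

```python
"""Sketch: tally one response header's values per autonomous system and time bucket.

The observation shape {"asn": int, "timestamp": int, "headers": dict} is a
hypothetical simplification, not Catchpoint's export format.
"""
from collections import Counter, defaultdict


def header_variation(observations, header, bucket_s=3600):
    """Map (asn, bucket_start) -> Counter of values seen for `header`."""
    wanted = header.lower()
    table = defaultdict(Counter)
    for obs in observations:
        # HTTP header names are case-insensitive, so normalize before lookup.
        headers = {name.lower(): value for name, value in obs["headers"].items()}
        value = headers.get(wanted, "<missing>")
        # Align the Unix timestamp (seconds) to the start of its bucket.
        bucket = obs["timestamp"] - obs["timestamp"] % bucket_s
        table[(obs["asn"], bucket)][value] += 1
    return dict(table)


if __name__ == "__main__":
    # ASNs 64500 and 64501 fall in the range reserved for documentation.
    sample = [
        {"asn": 64500, "timestamp": 3600, "headers": {"X-Cache": "HIT"}},
        {"asn": 64500, "timestamp": 3700, "headers": {"x-cache": "MISS"}},
        {"asn": 64501, "timestamp": 7300, "headers": {"X-Cache": "HIT"}},
    ]
    for key, counts in sorted(header_variation(sample, "X-Cache").items()):
        print(key, dict(counts))
```

Each (ASN, bucket) row can then be drawn as a stacked series, which makes a shift in header values from one network stand out against the others.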
With Catchpoint’s public URL feature, the team can share data and graphs with a vendor in moments while working together to resolve an issue. Because speed/latency and availability are collected separately, the SRE team can isolate problems fast, use their Catchpoint-powered analytics engine to find the root cause, and export the data to the relevant vendor right away so the vendor can troubleshoot and fix the problem.