LinkedIn is the world's largest professional social network, with more than 500 million users in over 200 countries and territories. To serve this massive global user base, the company relies on an expansive IT architecture composed of many first- and third-party components.
LinkedIn partnered with Catchpoint to:
- Conduct active observability tests of their first- and third-party global infrastructure to optimize performance and availability.
- Monitor DNS, CDNs, PoPs, and various pages (profile, feed, etc.) from both an internal and client-facing perspective.
- Push real-time monitoring data into their internal systems to catch performance issues before users are impacted.
- Share data with third-party vendors when troubleshooting issues.
- Detect localized performance issues in key geographies.
LinkedIn serves users in every corner of the globe, which requires a wide array of third-party vendors, including three domain name system (DNS) resolution services and five content delivery networks (CDNs), in addition to their internal CDN. With so much reliance on third-party solutions for digital delivery, the LinkedIn site reliability engineering (SRE) team must keep a close eye on the performance of these cloud solutions to ensure that end users get the best digital experience possible. This can be especially challenging in China, India, other parts of Asia, and South America, where the local infrastructure is often unable to provide the level of performance that LinkedIn requires.
Users generally allot a fixed amount of time to a site, so the faster it performs, the more pages a user will view. LinkedIn’s SRE team manages all their external services, optimizing every step of the path from the nearest point of presence to the end user, with the ultimate goal of minimizing latency and maximizing availability. To do this, they need to generate performance data for all of these services to get the best possible view of the end user experience, and they need to collect and analyze this data in real time to ensure that users are handed off to the right CDN at the right time.
Given the size and geographic spread of LinkedIn’s user base, Catchpoint’s global node infrastructure is a critical part of their monitoring strategy. With more than 600 global monitoring agents across dozens of ISPs at their disposal, including both backbone and last-mile agents, LinkedIn’s SRE team can get perspective on the end user experience regardless of where those users are located. This is vital for ensuring that their third-party infrastructure is performing up to expectations.
“I've used a lot of monitoring tools, and I have to say that Catchpoint stands out from the pack,” says Samir Jafferali, LinkedIn edge performance SRE. “There are so many nodes and so many features, and we're always able to present the data that's important in any fashion that we want. The other tools, yes they have agents in other locations, but it's not necessarily the locations you care about.”
LinkedIn’s CDN tests are run using Catchpoint’s web transaction monitor, which tests the HTTP content via custom Selenium scripting and captures headers for every single object on the page. The tests are tied to specific performance thresholds for each of their CDNs, and the results are piped into LinkedIn’s internal system in real time through the Catchpoint Test Data Webhook. As a result, the SRE team can detect increases in latency as they occur and, when applicable, hand users off to a different CDN while the vendor addresses the problem.
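The threshold check and CDN handoff described above can be sketched roughly as follows. This is a minimal illustration, not LinkedIn's or Catchpoint's actual implementation: the payload fields (`cdn`, `node`, `response_ms`, `succeeded`), the threshold values, and the majority-breach rule are all assumptions made for the example, and the real Test Data Webhook payload may look quite different.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical per-CDN latency thresholds in milliseconds (illustrative values only).
THRESHOLDS_MS = {"cdn-a": 800.0, "cdn-b": 1000.0}


@dataclass
class CdnSample:
    """One test result, as it might arrive from a monitoring webhook."""
    cdn: str
    node: str
    response_ms: float
    succeeded: bool


def parse_result(payload: dict) -> CdnSample:
    # Assumed payload shape; a real webhook body would need its own mapping.
    return CdnSample(
        cdn=payload["cdn"],
        node=payload["node"],
        response_ms=float(payload["response_ms"]),
        succeeded=bool(payload["succeeded"]),
    )


def breaches_threshold(sample: CdnSample) -> bool:
    """A sample breaches if the test failed or exceeded its CDN's latency limit."""
    limit = THRESHOLDS_MS.get(sample.cdn)
    if limit is None:
        return False  # no threshold configured for this CDN
    return (not sample.succeeded) or sample.response_ms > limit


def pick_cdn(current: str, samples: List[CdnSample], candidates: List[str]) -> str:
    """Stay on the current CDN unless most of its recent samples breach.

    When they do, return the first candidate whose samples are all healthy.
    """
    mine = [s for s in samples if s.cdn == current]
    breaches = sum(breaches_threshold(s) for s in mine)
    if not mine or breaches * 2 <= len(mine):
        return current
    for name in candidates:
        if name == current:
            continue
        theirs = [s for s in samples if s.cdn == name]
        if theirs and not any(breaches_threshold(s) for s in theirs):
            return name
    return current  # no healthy alternative found


recent = [
    parse_result({"cdn": "cdn-a", "node": "mumbai", "response_ms": 1200, "succeeded": True}),
    parse_result({"cdn": "cdn-a", "node": "delhi", "response_ms": 950, "succeeded": True}),
    parse_result({"cdn": "cdn-a", "node": "chennai", "response_ms": 400, "succeeded": True}),
    parse_result({"cdn": "cdn-b", "node": "mumbai", "response_ms": 300, "succeeded": True}),
]
chosen = pick_cdn("cdn-a", recent, ["cdn-a", "cdn-b"])
```

Requiring a majority of breaching samples before switching is one simple way to avoid flapping on a single slow measurement; a production system would likely also scope the decision by geography and time window.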
On the DNS side, LinkedIn runs constant tests using Catchpoint’s two DNS monitors. DNS experience tests allow Catchpoint to behave like a DNS resolver, measuring latency from an end user perspective and providing granular performance data. Direct name server tests monitor the name servers themselves, providing data that ties directly to the availability of the server and, by extension, the site itself. Together, these DNS tests give LinkedIn a complete picture of their DNS performance.
The ability to collect, analyze, and share this data in a timely manner, even during a performance crisis, is vital to the LinkedIn service offering. With historical trends tied to custom visualizations, LinkedIn can present the data in whatever way suits their needs and cuts to the root cause of the issue.
“The other tools have historical trending, but the graphic capabilities are limited,” notes Jafferali. “The ability for [Catchpoint] to capture headers for every single object on the page and then do analysis on the headers and plot variations of headers over time from different ASs is very powerful.”
And with Catchpoint’s public URL feature, the data and graphs can be shared quickly and easily as the team works with the vendor to resolve the issue. Because they collect separate data on speed/latency and on availability, the SRE team can use their Catchpoint-powered analytics engine to isolate the root cause of an issue and export the data to their vendors right away, so that the vendors can troubleshoot and solve the problem.
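To illustrate why keeping latency and availability as separate measurements helps isolate problems, here is a small sketch that summarizes both per vendor. The vendor names, sample values, and the `(vendor, succeeded, latency_ms)` tuple layout are hypothetical and chosen only for the example; they are not taken from LinkedIn's analytics engine.

```python
from collections import defaultdict
from statistics import median


def summarize(samples):
    """Summarize availability and median latency per vendor.

    `samples` is an iterable of (vendor, succeeded, latency_ms) tuples.
    Latency is computed over successful tests only, so a failed test
    lowers availability without skewing the latency figure.
    """
    grouped = defaultdict(list)
    for vendor, succeeded, latency_ms in samples:
        grouped[vendor].append((succeeded, latency_ms))

    report = {}
    for vendor, rows in grouped.items():
        ok_latencies = [ms for ok, ms in rows if ok]
        report[vendor] = {
            "availability_pct": 100.0 * len(ok_latencies) / len(rows),
            "median_latency_ms": median(ok_latencies) if ok_latencies else None,
        }
    return report


# Illustrative data: one vendor drops requests, the other is merely slow.
samples = [
    ("dns-1", True, 40),
    ("dns-1", True, 60),
    ("dns-1", False, 0),
    ("dns-1", True, 50),
    ("cdn-x", True, 900),
    ("cdn-x", True, 1100),
]
report = summarize(samples)
```

In this made-up data, `dns-1` shows an availability problem with healthy latency, while `cdn-x` is fully available but slow. A single blended score would hide that distinction, which is the kind of detail a vendor needs in order to troubleshoot.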