We provide a consumerized-style web experience both directly and via our APIs, so users expect reliable access and fast performance. That makes it table stakes to understand the customer experience, and Catchpoint plays an increasingly valuable role in helping Box achieve that.
Dennhardt and a colleague evaluated several active observability options before selecting Catchpoint. According to Dennhardt, Catchpoint stood out in several key areas. “When it comes to what we are trying to achieve, a picture is worth a thousand words and we were very impressed by Catchpoint’s charting and visualization,” he says.
He also appreciated that he could rapidly instrument and capture metrics to share. “With Catchpoint, we don’t need to export the data and post-process it in Excel or pipe it to a metrics system to arrive at meaningful information.” The third major deciding factor was Catchpoint’s worldwide network of observers. “Our primary data center presence is in the western U.S. and, while our network footprint is worldwide, it is less complete than Catchpoint’s,” Dennhardt says.
A Proactive Approach to Customer Experience Observability
With Catchpoint, Box gains visibility into customer experience throughout the delivery chain. “This is critical for a SaaS business but very hard since we don’t control what goes on beyond our network boundaries,” says Dennhardt. Box also uses Catchpoint as a neutral point of comparison between its own metrics and customer metrics. “More often than not, we can reconcile what we see with what our customers see because we capture good logs and metrics. But it is valuable to have this capability, allowing us to verify our findings,” he says.
Finding the Needle in the Haystack
While Box extensively monitors production traffic, it can be challenging to identify and test for a specific condition from its internal logs. As Dennhardt explains, “We handle 3 billion to 5 billion public API calls per day and it’s growing quickly! It can be tough finding an answer in that mountain of data.” For example, Box might see a drop in client throughput, but because throughput is sometimes spiky by nature, it can be difficult to determine the reason for the dip. Using Catchpoint, Box’s monitoring and Network Operations Center (NOC) teams can more easily investigate when a failure is more than transient. “We fire Catchpoint tests for a few days and compare the results to our broader set of data, which allows us to identify problems and discover new insights,” explains Dennhardt.
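To make the distinction between a transient blip and a sustained failure concrete, here is a minimal Python sketch. It is illustrative only and does not describe Box’s or Catchpoint’s actual logic: the function names, the pass/fail input format, and the threshold of three consecutive failures are all assumptions.

```python
def longest_failure_run(results):
    """Return the length of the longest run of consecutive failed tests.

    `results` is a hypothetical list of booleans, one per synthetic test
    run in time order: True means the run passed, False means it failed.
    """
    longest = current = 0
    for passed in results:
        current = 0 if passed else current + 1
        longest = max(longest, current)
    return longest


def is_sustained_failure(results, min_consecutive=3):
    """Treat a failure as sustained once it spans `min_consecutive` runs.

    The default of 3 is an arbitrary illustrative threshold.
    """
    return longest_failure_run(results) >= min_consecutive


# Isolated failures look transient; a streak looks sustained.
spiky = [True, False, True, True, False, True]
streak = [True, False, False, False, True]
print(is_sustained_failure(spiky))   # False
print(is_sustained_failure(streak))  # True
```

In practice a team would tune the threshold to the test frequency, since three consecutive failures mean something different for a one-minute test than for an hourly one.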
Quickly Pinpointing Degradation Impact
Not only do millions of users and applications around the world use Box, but users can also take more than 100 distinct actions across Box’s API. While these actions share common services, each follows a different code path. This makes it challenging to quickly ascertain whether a performance degradation is specific to a particular geography, specific to a particular API action, or occurring more broadly. Using the Catchpoint dashboard, the NOC and monitoring teams can quickly zero in and dissect these issues. The teams can also use instant Catchpoint tests to verify or rule out network degradation. “Catchpoint gives us a direct signal of what might be happening outside our network so we no longer need to draw inferences around scenarios like a lower throughput of requests,” says Dennhardt.
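The question of whether a degradation is tied to a geography, to an API action, or is broader can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Box’s method or a Catchpoint API: the sample format, region and action names, the 200 ms baseline, and the 1.5x slowness factor are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def classify_degradation(samples, baseline_ms, factor=1.5):
    """Classify where slowness is concentrated.

    `samples` is a hypothetical iterable of (region, action, response_ms)
    tuples from synthetic tests. A group counts as slow when its median
    response time exceeds `factor * baseline_ms`.
    Returns (scope, slow_regions, slow_actions).
    """
    by_region = defaultdict(list)
    by_action = defaultdict(list)
    for region, action, ms in samples:
        by_region[region].append(ms)
        by_action[action].append(ms)

    def slow_groups(groups):
        limit = factor * baseline_ms
        return sorted(k for k, v in groups.items() if median(v) > limit)

    slow_regions = slow_groups(by_region)
    slow_actions = slow_groups(by_action)

    if not slow_regions and not slow_actions:
        scope = "none"
    elif slow_regions and not slow_actions:
        scope = "geography"
    elif slow_actions and not slow_regions:
        scope = "action"
    else:
        scope = "broad"
    return scope, slow_regions, slow_actions


regions = ["us-west", "eu", "apac"]
actions = ["upload", "download", "search"]

# Only the "apac" observers see slow responses, for every action.
geo_samples = [(r, a, 600 if r == "apac" else 200)
               for r in regions for a in actions]
print(classify_degradation(geo_samples, baseline_ms=200))
# ('geography', ['apac'], [])
```

Using the median per group keeps a single slow region from making every action look slow, and vice versa. That only holds when each group mixes enough regions and actions, as it does in this toy data.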
Maintaining Site Health
The NOC team uses Catchpoint regularly as a general signal for maintaining site health. As Dennhardt explains, the Box NOC has a difficult job. “They have to be aware of and think about so much to understand site health and recover it quickly when it’s lower than acceptable. With Catchpoint, the team gets tests with detail they didn’t have previously so they can more easily get to the root of problems.”
Integration with Key Tools
By integrating Catchpoint with Wavefront and PagerDuty, Box has a complete suite of monitoring at its disposal. Box ingests Catchpoint data into Wavefront, its monitoring and application metrics tool, to get more robust metrics. By feeding Catchpoint data into PagerDuty, Box can alert its NOC team to investigate platform issues and assess the duration of impact and recovery.
With Catchpoint, Box can more easily track changes in API success/failure rates and performance over time, and it has seen a dramatic improvement since it began using Catchpoint. “Catchpoint wasn’t the sole reason for the improvement, but it was a meaningful factor in making the case to invest more in reliability and availability and provides an objective measure that our investment is paying off,” says Dennhardt.
This expanded focus on reliability and availability prompted Box’s site reliability engineering team to implement its existing tests in Catchpoint for Box’s crucial log-in flow. Another Box team used Catchpoint to compare network routing vendors and help validate its large investment.
In the future, Box plans to standardize its approach across the engineering teams that own slices of the Box API, such as uploads/downloads and search. “We operate under the ‘You build it, you run it’ principle, so each API team will be running their own set of synthetic tests for their area of responsibility. We’ll standardize operational dashboards across teams so they can easily compare internal and synthetic metrics,” concludes Dennhardt.