Box is a leader in cloud content management. The company delivers its cloud-based product via web applications, integrations with partner solutions such as Microsoft Office 365 and G Suite, and publicly exposed and supported application programming interfaces (APIs) that enable customers and partners to integrate Box into their systems. The company worked with Catchpoint to:
- Enable a proactive approach to customer experience observability.
- Quickly pinpoint degradation impact.
- Monitor and analyze performance and availability issues.
- Maintain site health.
- Leverage data integration with key tools.
Box provides a consumer-style web experience both directly and via its APIs, so users expect reliable access and fast performance. That makes understanding the customer experience table stakes, and Catchpoint plays an increasingly valuable role in helping Box achieve that.
After experiencing some broad service issues in late 2018, Box realized it needed better insight into the customer experience with Box APIs. For every site-impacting event, Box writes a detailed explanation akin to a postmortem. For some such events, Box struggled to reconcile its own understanding of the issue with what customers said they experienced. Through its metrics, Box knew what was happening inside its network, but it could only use those metrics as a proxy when assessing customer impact.
As Dan Dennhardt, senior product manager for Box, explains, uptime is the lifeblood of a software as a service (SaaS) application because user retention is critical for business success. Knowing that, Box has deployed sophisticated means for monitoring and analyzing any hiccups in performance and availability.
“We had a good handle on the experience with our web app but were in the dark when it came to what customers experienced on the other end of API integrations,” he says. “To better understand and serve our API customers, we needed a more accurate representation of their experiences.”
Dennhardt and a colleague evaluated several active observability options before selecting Catchpoint. According to Dennhardt, Catchpoint stood out in several key areas. “When it comes to what we are trying to achieve, a picture is worth a thousand words and we were very impressed by Catchpoint’s charting and visualization,” he says.
He also appreciates that he can rapidly instrument and capture metrics to share. “With Catchpoint, we don’t need to export the data and post-process it in Excel or pipe it to a metrics system to arrive at meaningful information.” The third major deciding factor was Catchpoint’s worldwide network of observers. “Our primary data center presence is in the western U.S. and, while our network footprint is worldwide, it is less complete than Catchpoint’s,” Dennhardt says.
A Proactive Approach to Customer Experience Observability
With Catchpoint, Box gains visibility into the customer experience throughout the delivery chain. “This is critical for a SaaS business but very hard since we don’t control what goes on beyond our network boundaries,” says Dennhardt. Box also uses Catchpoint as a neutral point of comparison between its own metrics and customer metrics. “More often than not, we can reconcile what we see with what our customers see because we capture good logs and metrics. But it is valuable to have this capability, allowing us to verify our findings,” he says.
Finding the Needle in the Haystack
While Box extensively monitors production traffic, it can be challenging to identify and test for a specific condition from its internal logs. As Dennhardt explains, “We handle 3 billion to 5 billion public API calls per day and it’s growing quickly! It can be tough finding an answer in that mountain of data.” For example, Box might see a drop in client throughput, but because throughput is sometimes spiky by nature, it can be difficult to determine the reason for such a dip. Using Catchpoint, Box’s monitoring and Network Operations Center (NOC) teams can more easily investigate when a failure is more than transient. “We fire Catchpoint tests for a few days and compare the results to our broader set of data, which allows us to identify problems and discover new insights,” explains Dennhardt.
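One way to separate a transient blip from a sustained failure in a series of synthetic test results is to look for runs of consecutive failures. The following is a minimal sketch under stated assumptions, not Box's or Catchpoint's actual logic: the function name `sustained_failures`, the `min_run` threshold, and the sample data are all hypothetical.

```python
from typing import Iterable, List, Tuple


def sustained_failures(results: Iterable[bool], min_run: int = 3) -> List[Tuple[int, int]]:
    """Return (start_index, length) for each run of at least `min_run`
    consecutive failed tests. `True` means the test passed.

    Hypothetical helper: the threshold of 3 is an illustrative assumption.
    """
    runs: List[Tuple[int, int]] = []
    start = None   # index where the current failure run began
    length = 0     # number of failures in the current run
    for i, ok in enumerate(results):
        if not ok:
            if start is None:
                start, length = i, 0
            length += 1
        else:
            # A passing test ends the current run; keep it only if long enough.
            if start is not None and length >= min_run:
                runs.append((start, length))
            start = None
    # Handle a run that continues to the end of the series.
    if start is not None and length >= min_run:
        runs.append((start, length))
    return runs


# Hypothetical test history: one isolated failure, then four in a row.
history = [True, False, True, True, False, False, False, False, True, True]
print(sustained_failures(history))  # [(4, 4)]
```

The isolated failure at index 1 is ignored as transient, while the run of four failures starting at index 4 is reported for investigation.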
Quickly Pinpointing Degradation Impact
Not only do millions of users and applications around the world use Box, but users can take more than 100 distinct actions across Box’s API. While these actions share common services, each follows a different code path. This makes it challenging to quickly ascertain whether a performance degradation is specific to a particular geography, specific to a particular API action, or occurring more broadly. Using the Catchpoint dashboard, the NOC and monitoring teams can quickly zero in and dissect these issues. The teams can also use instant Catchpoint tests to verify or rule out network degradation. “Catchpoint gives us a direct signal of what might be happening outside our network so we no longer need to draw inferences around scenarios like a lower throughput of requests,” says Dennhardt.
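The geography-versus-action question can be illustrated by grouping test results by region and API action and checking which groups exceed a baseline. This is a minimal sketch under stated assumptions: the `Sample` type, the `classify_scope` function, the region and action names, the baselines, and the 2x threshold are all hypothetical and do not represent Catchpoint's dashboard logic.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Sample(NamedTuple):
    region: str        # where the synthetic test ran (hypothetical names)
    action: str        # which API action was exercised
    latency_ms: float  # observed response time


def classify_scope(samples: Iterable[Sample],
                   baseline_ms: Dict[str, float],
                   factor: float = 2.0) -> str:
    """Classify a degradation as 'none', 'broad', 'region:<name>',
    'action:<name>', or 'mixed'.

    A (region, action) cell counts as degraded when its mean latency
    exceeds `factor` times the baseline for that action (an assumption).
    """
    buckets = defaultdict(list)
    for s in samples:
        buckets[(s.region, s.action)].append(s.latency_ms)

    all_cells = set(buckets)
    degraded = {
        (region, action)
        for (region, action), vals in buckets.items()
        if sum(vals) / len(vals) > factor * baseline_ms[action]
    }

    if not degraded:
        return "none"
    if degraded == all_cells:
        return "broad"

    regions = {r for r, _ in degraded}
    actions = {a for _, a in degraded}
    # Region-specific: every action tested in exactly one region is degraded.
    if len(regions) == 1:
        (region,) = regions
        if degraded == {c for c in all_cells if c[0] == region}:
            return f"region:{region}"
    # Action-specific: one action is degraded in every region that tested it.
    if len(actions) == 1:
        (action,) = actions
        if degraded == {c for c in all_cells if c[1] == action}:
            return f"action:{action}"
    return "mixed"


baseline = {"upload": 200.0, "search": 100.0}
samples = [
    Sample("us-west", "upload", 210.0),
    Sample("us-west", "search", 95.0),
    Sample("eu-central", "upload", 650.0),
    Sample("eu-central", "search", 400.0),
]
print(classify_scope(samples, baseline))  # region:eu-central
```

In this made-up data, both actions are slow only in one region, so the sketch reports a region-specific degradation instead of an action-specific or broad one.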
Maintaining Site Health
The NOC team uses Catchpoint regularly as a general signal for maintaining site health. As Dennhardt explains, the Box NOC has a difficult job. “They have to be aware of and think about so much to understand site health and recover it quickly when it’s lower than acceptable. With Catchpoint, the team gets tests with detail they didn’t have previously so they can more easily get to the root of problems.”
Integration with Key Tools
By integrating Catchpoint with Wavefront and PagerDuty, Box has a complete suite of monitoring at its disposal. Box ingests Catchpoint data into Wavefront, its monitoring and application metrics tool, to get more robust metrics. By feeding Catchpoint data into PagerDuty, Box can alert its NOC team to investigate platform issues and assess the duration of impact and recovery.
With Catchpoint, Box can more easily track changes in API success/failure rates and performance over time, and it has seen a dramatic improvement in those measures since adopting Catchpoint. “Catchpoint wasn’t the sole reason for the improvement, but it was a meaningful factor in making the case to invest more in reliability and availability and provides an objective measure that our investment is paying off,” says Dennhardt.
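Tracking a success rate over time amounts to bucketing test outcomes by day and dividing successes by total runs. The following is a minimal sketch with made-up data; the function name `daily_success_rate` and the input shape are hypothetical and not taken from Catchpoint or Box.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def daily_success_rate(results: Iterable[Tuple[date, bool]]) -> Dict[date, float]:
    """Return the fraction of successful tests per day, in date order.

    `results` is an iterable of (day, succeeded) pairs (hypothetical shape).
    """
    totals = defaultdict(lambda: [0, 0])  # day -> [successes, total runs]
    for day, ok in results:
        totals[day][1] += 1
        if ok:
            totals[day][0] += 1
    return {day: s / t for day, (s, t) in sorted(totals.items())}


# Made-up outcomes for two days of synthetic API tests.
d1, d2 = date(2020, 1, 1), date(2020, 1, 2)
outcomes = [(d1, True), (d1, True), (d1, False), (d1, True),
            (d2, True), (d2, True), (d2, True), (d2, True)]
rates = daily_success_rate(outcomes)
print(rates[d1], rates[d2])  # 0.75 1.0
```

Plotting such a series over weeks or months gives the kind of objective trend line that can show whether reliability work is paying off.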
This expanded focus on reliability and availability triggered Box’s site reliability engineering team to implement its existing tests in Catchpoint for Box’s crucial log-in flow. Another Box team used Catchpoint when comparing vendors that do network routing to help validate its large investment.
In the future, Box will be standardizing across the engineering teams that own slices of the Box API, such as uploads/downloads and search. “We operate under the ‘You build it, you run it’ principle, so each API team will be running their own set of synthetic tests for their area of responsibility. We’ll standardize operational dashboards across teams so they can easily compare internal and synthetic metrics,” concludes Dennhardt.