Digging into Code during Catchpoint’s reIMAGINE Hackathon

Published

December 20, 2021

As an Engineering Manager, I don't get the chance to dig into code as much as I did when I was a developer. Catchpoint's semi-annual hackathon provided me that opportunity last month.

Digging into code to quickly unearth the source of errors and bugs

As an engineering team, we strive to write secure, maintainable, performant code. We drink our own champagne, so we know when our user experience falters. Sometimes the cause is external, but sometimes our application code has an error and a bug surfaces. When that happens we need answers, and we turn to our application logs. What's happening? What's gone wrong? Who is accessing what, and how is that affecting things? Too often, though, answering those questions is reactive because the problem has already happened. As a proactive team, we wanted to change that.

For context, Catchpoint’s core infrastructure is built to reliably handle tens of millions of Synthetic tests, RUM page views, and Endpoint experience results every day. Catchpoint Symphony leverages React and ASP.NET Core to provide an improved user experience from configuration to analysis. Our unparalleled global reach makes troubleshooting and debugging production issues challenging. We rely on logging across our applications and microservices to reduce MTTR.

My reIMAGINE hackathon team set out to enhance our backend logging capabilities. Our goal was to cover more services, knowing that the added information would give us clearer answers to the questions noted above. In addition, we could use that data to build more dashboards with more insights for our users!

In order to achieve our goal, we had to keep a few things in mind:

Keeping our centralized logging was imperative so we weren’t wasting time logging into individual servers to view logs.
Extending the data sources for our logs to increase visibility across all services.
Reducing noise in application logs and limiting them to the most valuable information for easier troubleshooting.
Keeping our timestamps consistent across all log data and all system and application metric data, so we can correlate high resource usage with specific operations.
Tracing, tracking, and monitoring our connected services for a continuous view of critical paths, to ensure our performance and reliability hit the mark.
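
To make two of these requirements concrete, here is a minimal Python sketch of noise reduction and timestamp normalization. It is an illustration only, not the code the team wrote: the entry fields (`timestamp`, `level`), the severity names, and the choice to treat timestamps without an offset as UTC are all assumptions.

```python
from datetime import datetime, timezone

# Hypothetical severity ranking; real services may use different level names.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}


def to_utc_iso(timestamp: str) -> str:
    """Parse an ISO 8601 timestamp and re-emit it in UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


def normalize(entries, min_level="WARNING"):
    """Drop entries below min_level and rewrite the rest with UTC timestamps."""
    threshold = LEVELS[min_level]
    kept = []
    for entry in entries:
        # Unknown level names rank lowest, so they are dropped as noise.
        if LEVELS.get(entry["level"], 0) < threshold:
            continue
        kept.append({**entry, "timestamp": to_utc_iso(entry["timestamp"])})
    return kept
```

With every source emitting the same UTC format, log entries and metric samples can be lined up on a shared time axis without per-source offset handling.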

Improving our data processing for an even better customer experience

We started by setting up watchers on our log directories and sending web server logs, OS-level event logs, and general server information to our log store. Parsing our application logs proved to be troublesome and noisy.
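
The post does not say which tooling the watchers used, so the following is only a sketch of the general idea in Python: poll a directory and pick up lines appended to its log files since the last poll. The class name `DirectoryWatcher` and the `.log` suffix are hypothetical, and the sketch ignores log rotation and truncation, which a production shipper would need to handle.

```python
import os


class DirectoryWatcher:
    """Polls a directory and returns lines appended to *.log files since the last poll."""

    def __init__(self, directory):
        self.directory = directory
        self.offsets = {}  # file path -> number of bytes already consumed

    def poll(self):
        new_lines = []
        for name in sorted(os.listdir(self.directory)):
            if not name.endswith(".log"):
                continue
            path = os.path.join(self.directory, name)
            offset = self.offsets.get(path, 0)
            with open(path, "rb") as handle:
                handle.seek(offset)
                for raw in handle:
                    if not raw.endswith(b"\n"):
                        # Partial line still being written; pick it up next poll.
                        break
                    offset += len(raw)
                    text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
                    new_lines.append((name, text))
            self.offsets[path] = offset
        return new_lines
```

Each call to `poll()` returns `(file name, line)` pairs that could then be forwarded to a central log store. Tracking byte offsets per file means a line is never shipped twice.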

Our first dashboard was simple and had only four widgets:

Count of web requests split by server.
Count of log entries split by service.
Top pages requested by service.
Average request/response time correlated with CPU and memory usage.
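
The first three widgets are simple aggregations. A rough Python sketch, assuming each parsed request or log entry is a dictionary with hypothetical `server`, `service`, and `path` fields, might look like this:

```python
from collections import Counter, defaultdict


def requests_per_server(requests):
    """Widget 1: count of web requests, split by server."""
    return Counter(r["server"] for r in requests)


def entries_per_service(entries):
    """Widget 2: count of log entries, split by service."""
    return Counter(e["service"] for e in entries)


def top_pages_by_service(requests, limit=3):
    """Widget 3: the most requested pages for each service."""
    pages = defaultdict(Counter)
    for r in requests:
        pages[r["service"]][r["path"]] += 1
    return {service: counts.most_common(limit) for service, counts in pages.items()}
```

In practice a log store's query language would usually compute these server-side; the sketch only shows what each widget is counting.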

This initial dashboard answered basic questions about our users’ experiences. It also proved that we can configure widgets across multiple data sources using a single, unified dashboard. Using information from multiple data sources will provide further insights across our services and help us track down issues that require a holistic view of our system.
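
Combining data sources comes down to joining them on a shared key, most often time. As a hedged sketch of the response-time and CPU correlation, the Python below buckets both sources by minute and keeps the minutes present in both. The field names `duration_ms` and `cpu_percent` are hypothetical, and the sketch assumes both sources already report timestamps in the same time zone.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def minute_bucket(timestamp: str) -> str:
    """Truncate an ISO 8601 timestamp to the start of its minute."""
    parsed = datetime.fromisoformat(timestamp)
    return parsed.replace(second=0, microsecond=0).isoformat()


def join_latency_with_cpu(requests, cpu_samples):
    """Average request duration and CPU usage per minute, for minutes seen in both sources."""
    latency = defaultdict(list)
    for r in requests:
        latency[minute_bucket(r["timestamp"])].append(r["duration_ms"])

    cpu = defaultdict(list)
    for s in cpu_samples:
        cpu[minute_bucket(s["timestamp"])].append(s["cpu_percent"])

    # Only minutes that appear in both sources can be correlated.
    shared = sorted(latency.keys() & cpu.keys())
    return {
        minute: {
            "avg_duration_ms": mean(latency[minute]),
            "avg_cpu_percent": mean(cpu[minute]),
        }
        for minute in shared
    }
```

Because the join key is the formatted minute string, timestamps from different time zones would silently fail to match, which is one reason consistent timestamps matter so much for this kind of dashboard.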

Continuing to make improvements for our customers

The one-week hackathon gave us a great start on what will be a long-running project that we can continue to build on. As a result of the week spent on this project, we now have less noise in our logs, along with metrics from our web servers, databases, and Redis instances.

Our expanded logs have already helped us figure out ways to make our data processing more efficient, and we have plans for more improvements to come. For example, we plan to enhance our internal application performance data, gather more client-side information, and correlate web logs with Orchestra logs (our in-house NoSQL engine). In addition, tracking services across releases during testing will provide actionable insights, so we can quickly address potential challenges before they get to production.

All of this is the result of just one week of work and our full dedication to giving our users the best experience.

Want to find out how to join our team? We are hiring! Check out our open positions in Engineering and this insightful blog from our VP of Engineering, which gives a sneak peek into what the interview process at Catchpoint might hold.
