Last week, Catchpoint was one of the sponsors of Splunk’s half-day DevOps & Observability Best Practices event. It was a jam-packed conference that examined what observability is, its key drivers, and how observability and monitoring exist “like two peas in a pod,” perfect complements to one another in enabling enterprises to better understand overall systems behavior and health.
The Practice of Observability/DevOps
The conference kicked off with a high-level overview of the twin practices of observability and DevOps from Splunk’s Arijit Mukherji, Distinguished Architect (formerly CTO at SignalFx), and Marc Chipouras, Engineering Director.
“Move fast and break things,” Mukherji began. “If we don’t move fast, the competition will beat us. That’s the reality. 52% of Fortune 500 companies have disappeared since 2000. Without DevOps services, digital transformation is doomed because it won’t be able to keep up. Without observability, DevOps is doomed. Observability provides me with vision,” he explained. “Imagine DevOps as a fast car, but to stop it from going in circles or crashing headlong into a wall, I need to know where I’m going. That’s the lens observability provides.”
His central piece of advice: build an observability program, “meaning a set of people whose headache it is to figure out what observability means for the organization, what do we want to get out of it and how do we get there.” Within this, he and Chipouras shared three key principles to follow:
(i) Focus your tools. In an era when there are tools for everything (logs, metrics, traces/APM, RUM/synthetics, incident management, network monitoring, etc.), ask yourself how many different tools do you have? How many tools do your users need to learn? Then choose one or a few ways to do logs instead of keeping all ten. The big advantage of consolidation is that you can ask broader, deeper questions.
As engineering director, Chipouras advises, first ask what key patterns you want to reinforce across dev, stage and prod. That may drive people to fewer tools, but first, “I want to look at how I can ensure I’m reinforcing these patterns so that when my engineers receive an alert at 3am, they can use them instantly without having to learn them.”
(ii) Develop standards. This lets teams better understand their upstream and downstream dependencies. Standards can also help lessen friction and elevate the overall quality of observability.
In practice, Chipouras does this by issuing a minimal set of essential guidelines to his team. These center around logs, metrics, tracing, and naming to ensure they can perform robust automation while permitting customization to enable alignment across observability tools.
(iii) Future-proof your systems. “Your systems need to be flexible,” said Mukherji. You need to be able to invest in current technologies while staying open to new ones. Choose tools backed by a vibrant, robust community, he stressed. It’s also important to avoid lock-in as much as possible to “make change easier to do, not harder.” Open standards (not just OSS) for telemetry collection mean your application isn’t tied to a specific backend or vendor. Finally, Mukherji advised all observability teams to build a telemetry pipeline. “If you are going to remember anything from this talk, this is a superpower,” he stressed. Having a pipeline where all your telemetry is being collected and processed “allows for enormous flexibility down the line.”
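To make the telemetry pipeline idea a little more concrete, here is a minimal sketch in Python. It is not taken from the talk: the `Event` and `Pipeline` classes, the three processors, and the two in-memory "backends" are all hypothetical names invented for illustration. The point it tries to show is that once every log, metric, and trace flows through one place, you can normalize, filter, and enrich the data centrally and send it to more than one backend without touching application code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Event:
    """One piece of telemetry (hypothetical shape, for illustration only)."""
    kind: str  # "log", "metric", or "trace"
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


# A processor returns the (possibly modified) event, or None to drop it.
Processor = Callable[[Event], Optional[Event]]
# An exporter hands the event to some backend.
Exporter = Callable[[Event], None]


class Pipeline:
    """Runs each event through the processors, then fans out to all exporters."""

    def __init__(self, processors: List[Processor], exporters: List[Exporter]):
        self.processors = processors
        self.exporters = exporters

    def send(self, event: Event) -> bool:
        """Return True if the event was exported, False if a processor dropped it."""
        current: Optional[Event] = event
        for process in self.processors:
            current = process(current)
            if current is None:
                return False
        for export in self.exporters:
            export(current)
        return True


def normalize_name(event: Event) -> Optional[Event]:
    # Enforce one naming style so every team's data lines up.
    event.name = event.name.strip().lower().replace(" ", "_")
    return event


def drop_debug_logs(event: Event) -> Optional[Event]:
    # Filter noise centrally instead of in every application.
    if event.kind == "log" and event.attributes.get("level") == "debug":
        return None
    return event


def add_environment(env: str) -> Processor:
    # Enrich events with an "env" attribute unless one is already set.
    def processor(event: Event) -> Optional[Event]:
        event.attributes.setdefault("env", env)
        return event
    return processor


# Two stand-in backends; real exporters would call a vendor or open-source API.
backend_a: List[Event] = []
backend_b: List[Event] = []

pipeline = Pipeline(
    processors=[normalize_name, drop_debug_logs, add_environment("prod")],
    exporters=[backend_a.append, backend_b.append],
)

pipeline.send(Event("metric", "Checkout Latency", {"service": "cart"}))
pipeline.send(Event("log", "cache miss", {"level": "debug"}))  # dropped
```

In this sketch, adding or swapping a backend means changing only the `exporters` list, which is one way to read the "flexibility down the line" that Mukherji described.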
Making the Case for Observability
At the center of the event was a Q&A with Kelly Ann Fitzpatrick, Industry Analyst at RedMonk, and Rick Rackow, Senior Service Reliability Engineer at Red Hat, moderated by Josh Atwell, Technology Advocate at Splunk. Atwell began with a set of statistics from 451 Research: “Monitoring and logging of containerized environments will become the largest section of the container market by 2024. This will go from 24% in 2019 to 32% in 2024, clearly highlighting the importance of visibility in these environments.”
Rackow started off by sharing the questions he sees Red Hat customers having around Kubernetes. They may have now openly adopted Kubernetes and OpenShift, he observed, but many are “still trying to find their right way of dealing with them.” The increased complexity of the development environment today has led to changes in the type of visibility tools businesses need. “All I mainly care about,” he said, is “how is my app performing? This has led to a shift in monitoring to become more service-oriented and symptom-based, moving away from root causes.”
Fitzpatrick shared her perspective gained from talking to the many enterprises, web companies, SaaS companies, and vendors RedMonk serves, saying that in terms of observability, the analyst firm is getting “a mix of signals.” Two years ago, she said, they were “telling companies to do observability, but now vendors are coming to tell us about observability.” She admitted there remains a good deal of confusion around the concept of observability. “What is the relationship between observability and monitoring? What is the terminology around observability that most people don’t know? What is telemetry? What does it mean to instrument your code?”
What RedMonk is certain about are the drivers behind its growing adoption as a practice: uncertainty in systems and apps, complexity tied to the rise of microservices and containerization, plus the amount of testing necessary to figure out what’s going on across these different systems. Visibility into what’s going on inside this complex ecosystem is essential. Additionally, Fitzpatrick noted an increased demand for changes to come out regularly and quickly. “Observability helps teams move more quickly with more confidence,” she said, because its three pillars (logs, metrics, and traces) can be put to use at any stage across the development and production lifecycle to understand precisely what is going on.
Essential Proactive Monitoring for Best in Class Observability
Catchpoint’s Brandon DeLap, Senior Performance Engineer, led the second presentation of the day with a focus on the way in which “monitoring and observability work together to help provide clear visibility into your overall system, allowing observability teams to make better decisions.”
Toward the end of his session, Brandon shared eight must-haves for effective digital experience monitoring (DEM):
- Monitor real users – Utilize passive Real User Monitoring (RUM) and WebSee, Catchpoint’s new user sentiment tool.
- Run 24/7 proactive monitoring – to gain a steady baseline into your systems.
- Test from everywhere – city, continent, connection type.
- Test everything – test every component along the delivery chain to identify precisely what is causing a poor experience for your end user.
- Deploy a comprehensive set of test types – Catchpoint offers out-of-the-box monitors that let you test individual components, e.g., Network Insights, an out-of-the-box network visibility solution.
- Access accurate and granular raw data – you need to be able to trust the data you’re relying on to measure the four pillars of DEM and run your own calculations.
- Utilize historical charting and analytics – gather at least 13 months of data to see how your app has trended historically across all four DEM pillars.
- Use advanced API capabilities – these let you take all the data you’ve captured and send it into different solutions to correlate with other data sources.
Brandon closed by sharing Catchpoint’s integrations with Splunk and SignalFx:
Fig 1: Catchpoint Splunk Search app
The Catchpoint Splunk Search app uses a Catchpoint REST API to pull raw performance data and alerts out of Catchpoint and into Splunk, where they can be combined with data from other systems, such as APM or machine data logs, to improve troubleshooting and make processes and reporting within Splunk more efficient.
Fig 2: SignalFx data webhook integration
With the Catchpoint SignalFx data integration through the Test Data Webhook, you can push data in real-time directly from the Catchpoint nodes into SignalFx.
Fig 3: SignalFx end-to-end/deep-link integration
With the Catchpoint SignalFx app integration, you can now click through DeepLinks from the synthetic tests in Catchpoint to the SignalFx traces to quickly understand and answer why backend timings and errors may be occurring from an external point of view.
Other sessions included a Doers4DevOps Rapid Fire from HashiCorp, Puppet, and Google; a webcast from JFrog on observability for your IT value stream; and Splunk4DevOps Rapid Fire demos of how Splunk works.
To view the sessions in their entirety, [click here](https://vshow.on24.com/vshow/Splunk_Dev_Obs/home?utm_campaign=Global_FY21Q2_Glbl_GEM_OnEvt_AppDev_EN_VirtEvnt_DevOps_ObsBP_SPLK_Conf_Sept16&regPageId=17855&utm_content=Open%20Source%20Insights) and/or sign up here for Splunk’s upcoming Conf 2020.