This guide explains the various disciplines that define the practice of observability and tracks its evolution over the last decade from its roots in systems monitoring. What started as a need to monitor a server’s computing resources like CPU and memory utilization as part of a client-server architecture gradually transformed into observing digital experiences and isolating the root cause of application slowdowns with the help of distributed tracing and log analysis.
As form follows function, monitoring follows application architecture, so for this guide to have context, we must start by reviewing recent changes to application architecture.
Application architecture based on microservices
We now pejoratively refer to client-server applications as “monolithic” applications and, with hindsight, recognize their limitations: centralized hardware created a single point of failure, and the design could not adapt efficiently to a changing workload. In response, application architects broke the monolithic application into components (known as microservices) that communicate through the abstraction layer of an application programming interface (API). They also invited anyone with a web browser to access the application over the Internet through a load balancer designed to spread the traffic across multiple nodes. This distributed architecture helps in many ways:
- A development team dedicated to a microservice can work independently.
- Each microservice can scale on its own by adding more computing resources.
- The application can operate without having a single point of failure.
- The application utilizes compute resources more efficiently.
The diagram above simplifies the concept of microservices, but the reality is more complex. An application typically comprises dozens of microservices, each replicated across dozens of containers (a container is a lightweight alternative to a virtual machine that shares the host’s operating system kernel), forming an environment with hundreds of containers updating and communicating constantly. This dynamic environment requires a new operational model.
The rise of DevOps
The operational processes have also adapted to the changes in application architecture. Independent development teams can now release code more frequently (as many as dozens of times a day) and roll back mistakes almost instantaneously, blurring the lines between operations and development and resulting in a closer daily collaboration popularized by the term DevOps.
As lightweight containers replaced heavier servers and virtual machines, it became simpler to replace containers than to reconfigure them to keep up with the changing needs of the application. This concept is known as immutable architecture, visually explained below. These changes have made it possible for operations teams to configure an infrastructure environment the way developers write code, giving rise to a new paradigm known as Infrastructure as Code (IaC) that we introduce next.
Infrastructure as Code (IaC)
In the early days of infrastructure management, operations engineers logged into machines and manually updated the configuration files each time the infrastructure required a change (now referred to as a mutable architecture). The second generation of configuration management tools (such as Chef and Puppet) relied on local software agents to execute centrally issued changes simultaneously on dozens of servers, reducing the manual labor. The latest approach to updating infrastructure (promoted by tools such as Terraform) simply declares the desired state and relies on specialized orchestration software (such as Kubernetes) to bring the environment in line with it.
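The declarative idea can be illustrated with a minimal Python sketch. It is not how Terraform or Kubernetes is implemented; the function names `plan_changes` and `apply_changes`, the service names, and the replica counts are all hypothetical and chosen only for illustration. The operator states the desired number of replicas per service, and the code works out the actions needed to get there from the current state.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str, int]  # (verb, service, number of replicas)


def plan_changes(desired: Dict[str, int], actual: Dict[str, int]) -> List[Action]:
    """Compare desired and actual replica counts and list the actions needed."""
    actions: List[Action] = []
    for service in sorted(set(desired) | set(actual)):
        want = desired.get(service, 0)
        have = actual.get(service, 0)
        if want > have:
            actions.append(("start", service, want - have))
        elif want < have:
            actions.append(("stop", service, have - want))
    return actions


def apply_changes(actual: Dict[str, int], actions: List[Action]) -> Dict[str, int]:
    """Return the state that results from applying the planned actions."""
    result = dict(actual)
    for verb, service, count in actions:
        current = result.get(service, 0)
        updated = current + count if verb == "start" else current - count
        if updated == 0:
            result.pop(service, None)  # no replicas left: the service is gone
        else:
            result[service] = updated
    return result


# Hypothetical desired state (the "declaration") and observed state.
desired = {"checkout": 3, "catalog": 2}
actual = {"checkout": 1, "legacy-batch": 1}

actions = plan_changes(desired, actual)
for action in actions:
    print(action)

# After applying the plan, the environment matches the declaration.
assert apply_changes(actual, actions) == desired
```

Running the plan a second time against the resulting state yields no actions, which is the property that lets a declarative tool be re-run safely.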
Continuous Integration and Continuous Delivery (CI/CD)
Once teams can independently release new code into a production environment and update it, they need a pipeline to automate the process to save time. The code delivery pipeline must compile the code and its dependencies, initiate testing of the new code before deployment, and trigger third-party processes (like updating the configuration of monitoring tools). This pipeline is commonly called Continuous Integration and Continuous Delivery (CI/CD) and is visually summarized below. A popular open-source project representing this category of functionality is Jenkins.
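As a rough sketch of the pipeline behavior described above, the Python below runs a list of stages in order and skips everything after the first failure. It does not represent Jenkins or any specific CI/CD product; the stage names and the `run_pipeline` function are assumptions made for illustration, and each stage is reduced to a function that returns True on success.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]  # (stage name, step returning True on success)


def run_pipeline(stages: List[Stage]) -> List[Tuple[str, str]]:
    """Run stages in order; once one fails, mark the remaining stages as skipped."""
    results: List[Tuple[str, str]] = []
    failed = False
    for name, step in stages:
        if failed:
            results.append((name, "skipped"))
            continue
        ok = step()
        results.append((name, "passed" if ok else "failed"))
        failed = not ok
    return results


# Hypothetical stages mirroring the text: build, test, deploy, then notify
# a third-party process such as a monitoring tool.
stages: List[Stage] = [
    ("build", lambda: True),
    ("test", lambda: False),  # simulate a failing test suite
    ("deploy", lambda: True),
    ("update-monitoring", lambda: True),
]

for name, status in run_pipeline(stages):
    print(f"{name}: {status}")
```

Because the simulated test stage fails, the deploy and monitoring stages never run, which is the safeguard a pipeline provides over manual releases.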
The new processes and tools that emerged after the breakup of monolithic applications into microservices upended the practice of monitoring. The emphasis has shifted from monitoring discrete infrastructure components to observing application services, bringing the word “Observability” into the lexicon of DevOps practitioners.
Services vs. Infrastructure
CPU and memory measurements can reveal a problem when a monolithic application is hosted on a single server, but those measurements are no longer relevant by themselves to the digital experience of an application relying on microservices spanning hundreds of short-lived containers. The modern application environment’s dynamic complexity forces operators to observe services from the outside instead of only depending on monitoring each component from the inside. The monitoring of application infrastructure is still necessary to isolate problems once an application service degradation is detected, but what ultimately matters is the digital experience of the end-users.
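To make the contrast concrete, here is a small Python sketch of a service-level check that looks at request latency as seen from the outside rather than at CPU or memory on any single container. The sample values, the 500 ms threshold, and the function names are invented for illustration, and the nearest-rank percentile is only one of several common percentile definitions.

```python
from typing import List


def percentile(values: List[int], pct: int) -> int:
    """Nearest-rank percentile: the smallest value covering pct percent of samples."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def service_alert(latencies_ms: List[int], threshold_ms: int, pct: int = 95) -> bool:
    """Alert when the chosen latency percentile exceeds the threshold."""
    return percentile(latencies_ms, pct) > threshold_ms


# Hypothetical sample: most requests are fast, but a few are very slow.
latencies_ms = [120] * 18 + [2400, 2600]

print(percentile(latencies_ms, 50))      # median looks healthy
print(percentile(latencies_ms, 95))      # the tail reveals the problem
print(service_alert(latencies_ms, 500))  # True: raise a service-level alert
```

In this sample the median stays low while the 95th percentile is far above the threshold, the kind of degradation that per-server resource averages can hide.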
A complete observability strategy must cover the entire path of a digital transaction, which includes:
- The digital experience of the end-users from their desktops or mobile devices.
- The public networks that the transactions traverse to reach the application.
- The application infrastructure hosted in a private or public cloud.
When applications were architected based on the client-server model, end-users resided on the same local area network as the application servers. The current-day users of applications freely roam the globe and access the application services from various device types, including mobile, emphasizing the importance of monitoring the digital experience and the shared public networks.
Troubleshooting in the Age of DevOps
Monitoring the application infrastructure hasn’t become redundant. It is, in fact, more sophisticated than ever, with the help of new technologies that let operators isolate a single error in terabytes of distributed log files that modern tools aggregate and index. New infrastructure monitoring tools store transaction data at sub-second granularity in scalable databases that display granular information on demand. What has changed, however, is the separation between the “alerting” signals used to notify operators of a service problem and the “debugging” signals used by DevOps engineers to isolate the root cause of a performance issue and fix it.
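The value of aggregating and indexing logs can be sketched with a toy inverted index in Python. This is a simplification for illustration, not the design of any particular log platform; the container names, log lines, and the `build_index` and `search` functions are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Location = Tuple[str, int]  # (log source, line number)


def build_index(logs: Dict[str, List[str]]) -> Dict[str, Set[Location]]:
    """Map every lowercase token to the set of places where it appears."""
    index: Dict[str, Set[Location]] = defaultdict(set)
    for source, lines in logs.items():
        for line_no, line in enumerate(lines, start=1):
            for token in re.findall(r"[a-z0-9_]+", line.lower()):
                index[token].add((source, line_no))
    return index


def search(index: Dict[str, Set[Location]], *terms: str) -> List[Location]:
    """Return the locations that contain all of the given terms."""
    if not terms:
        return []
    hits = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*hits))


# Hypothetical logs aggregated from two short-lived containers.
logs = {
    "checkout-7f9c": [
        "GET /cart 200 12ms",
        "POST /pay 500 timeout contacting payments",
    ],
    "payments-2b1d": [
        "charge accepted id=41",
        "ERROR db connection refused",
    ],
}

index = build_index(logs)
print(search(index, "500", "timeout"))  # [('checkout-7f9c', 2)]
print(search(index, "error"))           # [('payments-2b1d', 2)]
```

Once the index is built, a lookup only touches the entries for the requested terms instead of rescanning every file, which is why indexed log stores can answer queries quickly across many containers.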
Alerting vs. Debugging Signals
DevOps engineers rely on application service metrics to notify them of an application slowdown or outage. However, as soon as an application problem arises, all focus shifts to isolating the root cause in the infrastructure, whether manually or automatically, using orchestration and artificial intelligence. A typical sequence of events entails:
- Digital experience observers detect a problem and notify the operators.
- Operators identify or rule out the shared public networks as the root cause.
- Transaction tracing helps operators isolate the microservice impacted first.
- Real-time metrics and indexed log files help engineers isolate the root cause.
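The tracing step in the sequence above can be illustrated with a short Python sketch. It assumes a heavily simplified trace in which each span records a service name, start and end times, and an error flag. It also uses a simple heuristic that is an assumption of this example rather than a rule from any tracing standard: errors propagate from a callee back to its callers, so the failed span that finished earliest is treated as the likely origin.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    service: str
    start_ms: int
    end_ms: int
    error: bool


def first_impacted_service(spans: List[Span]) -> Optional[str]:
    """Return the service whose failed span finished first, or None if none failed."""
    failed = [span for span in spans if span.error]
    if not failed:
        return None
    return min(failed, key=lambda span: span.end_ms).service


# Hypothetical trace of one transaction: the gateway calls checkout,
# which calls catalog (healthy) and payments (failing).
trace = [
    Span("gateway", 0, 900, True),
    Span("checkout", 10, 890, True),
    Span("catalog", 15, 60, False),
    Span("payments", 20, 880, True),
]

print(first_impacted_service(trace))  # payments
```

With the suspect service identified, engineers can narrow the real-time metrics and indexed logs to that service's containers instead of searching the whole environment.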
In summary, the term “observability” refers to the approach of observing application services to alert operators to a legitimate problem. The operators then rely on infrastructure monitoring tools to isolate the root cause with the help of transaction tracing, performance metrics, and system logs.
This guide dedicates a chapter to each of the disciplines used in the practice of observability.