Monitoring Microservices

Software architecture has changed significantly over the last decade from traditional, tightly-coupled monolithic applications to loosely coupled microservices. Software monitoring has had to adapt to suit this new architecture. Monitoring microservice health is not as simple as tracking the CPU and memory utilization of a few servers. In this article, you’ll learn how to effectively monitor microservices using a standard set of alarms paired with effective logging and tracing.

Traditional vs. Microservice Monitoring

Microservices are small, single-purpose application components that together form a complete software application. These services typically run inside of containers (such as Docker) and scale across multiple hosting environments.

The application architecture based on microservices is designed for high performance, availability, efficient use of computing resources, and increased development speed. Applications built this way handle variable workloads and failures gracefully by running many small containers. Additionally, development is safer and faster when working on a single microservice at a time.

However, this flexibility comes at a cost. The operational complexity and scale of microservice architectures are greater than those of legacy architectures. Applications can have tens or even hundreds of unique microservices, so remotely logging into individual instances over SSH is no longer a practical option.

Containers are an additional layer of abstraction on top of Virtual Machines (VMs), creating distinct infrastructure metrics and logs. Many versions of the same application run inside containers across multiple physical locations. The lifespan of a container is shorter than that of a VM, which makes logs and metrics harder to access during debugging.

In the next section, you’ll learn more about these challenges.

{{banner-20="/design/banners"}}

Microservice Monitoring Challenges

Below is a table describing monitoring goals, their traditional solution, and the challenges you’ll face when moving to microservices.

| Monitoring Goal | Traditional Solution | Microservice Challenge |
| --- | --- | --- |
| Understand server health | CPU and memory metrics | Many different containerized applications live on the same host |
| Check server logs | SSH into the server | Containers are short-lived, and their termination removes the logs useful for troubleshooting |
| Trace the response to an HTTP request through the system | All code methods live in the same application | A single request spans many services, so it must be traced across service boundaries |
| Understand application health | Monitor server health | The health or even failure of a single node no longer affects application health |

Here is where traditional infrastructure monitoring starts to show its cracks. What infrastructure needs to be monitored? VMs, containers, autoscaling groups, load balancers, Kubernetes pods, and database servers all need monitoring, but not all need alarms.

The solution for effective microservice monitoring is understanding the difference between alarming metrics and debugging metrics. We’ll then add logs and transaction tracing on top of this solid metric foundation.

Alarming Metrics vs. Debugging Metrics

Alarms are notifications sent to on-call engineers indicating a problem with the system. We refer to the key metrics that trigger these alarms as alarming metrics. In contrast, engineers use what we call debugging metrics to identify the alarm’s root cause.

The critical difference between alarming and debugging metrics is that alarming metrics must reflect user-facing symptoms. Alarms notify engineers that they must intervene to fix the system. If a system user is experiencing no symptoms, then the problem does not require human intervention. Examples of user-facing symptoms are:

  • Slow interactions with the user interface
  • Errors when using a public API
  • Missing data or notifications
  • Delays in batch processing

Debugging metrics are highly detailed application-level metrics. Only Subject Matter Experts (SMEs) need to understand these details for an application. Debugging metrics are helpful during the investigation of an incident but are overly specific as alarm triggers. Extremely detailed metrics will confuse engineers, leading to an ineffective remedial response.

Below you’ll learn the alarming core metrics used in microservice monitoring. You’ll then learn how to layer additional application and infrastructure metrics into your system as debugging metrics.

{{banner-21="/design/banners"}}

Four Key Alarming Metrics

The table below shows four metrics applicable to all microservices. The two dimensions are:

  • The request type: requests that come from users’ browsers via HTTP and requests that come from other microservices via an application message bus or an Application Programming Interface (API).
  • The metric type: the error percentages and delay times measured for those requests.

HTTP requests are calls made by a user or another application to your application. Examples include a user logging in, the user interface requesting data for a dashboard, or another application sending metric data to your application. HTTP requests are synchronous because your application must respond immediately. The microservices themselves communicate with each other via an application message bus and/or an API to collectively process a user transaction. The message bus and/or API are ideal points for collecting metrics related to the availability, performance, and error-free operation of the microservices (we have devoted a separate article to the best practices associated with API monitoring).

Alternatively, some work within your application is processed asynchronously. Batch jobs, nightly email reports, or backend billing calculations don’t happen based on real-time requests. Message buses (such as Kafka) store messages related to asynchronous tasks, and services pull work from the bus when they are ready.
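
As a rough illustration of this pull model, here is a minimal sketch of an asynchronous worker built with the kafka-python client; the topic name, consumer group, and processing function are assumptions made for the example, not part of any specific system.

```python
# Minimal sketch of pull-based asynchronous work (topic and group names are assumed).
from kafka import KafkaConsumer  # kafka-python client

consumer = KafkaConsumer(
    "billing-jobs",                       # hypothetical topic holding queued work
    bootstrap_servers="localhost:9092",
    group_id="billing-workers",
    enable_auto_commit=False,
)

def process(message_value: bytes) -> None:
    """Placeholder for the actual asynchronous task (e.g., a billing calculation)."""

for message in consumer:                  # the worker pulls work only when it is ready
    try:
        process(message.value)
        consumer.commit()                 # acknowledge only after successful processing
    except Exception:
        # Failures here feed the asynchronous "consumer error rate" metric
        # described below rather than an immediate user-facing HTTP error.
        pass
```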

Pushed versus pulled work is an easy way to distinguish between synchronous and asynchronous requests. Synchronous work is pushed onto services, while services pull asynchronous work when they finish their current tasks. Both types of requests can result in errors and delays.

| Metric | Metric Type | Request Type |
| --- | --- | --- |
| The percentage of HTTP errors | Error | Synchronous |
| HTTP response time | Delay | Synchronous |
| The error rate of consumers on an application messaging bus | Error | Asynchronous |
| The queue size of consumers on an application messaging bus | Delay | Asynchronous |

The table above may not seem to contain many alarming metrics. What’s important is that your system, subsystems, and microservices all need this type of monitoring. The following section describes the advantages of this three-level monitoring.
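
As an illustration, the sketch below shows how a Python microservice might expose these four metrics with the prometheus_client library; the metric names, label names, and port are illustrative assumptions rather than a prescribed standard.

```python
# Sketch: exposing the four core alarming metrics from a Python microservice.
# Metric, label, and service names are illustrative assumptions.
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Synchronous (HTTP) metrics: the error percentage is derived from this counter.
http_requests_total = Counter(
    "http_requests_total", "HTTP requests handled", ["service", "status"]
)
http_response_seconds = Histogram(
    "http_response_seconds", "HTTP response time in seconds", ["service"]
)

# Asynchronous (message bus) metrics.
consumer_errors_total = Counter(
    "consumer_errors_total", "Failed message consumptions", ["service", "topic"]
)
consumer_queue_size = Gauge(
    "consumer_queue_size", "Messages waiting on the bus", ["service", "topic"]
)

if __name__ == "__main__":
    start_http_server(9100)  # scrape endpoint for the monitoring system

    # Example usage inside request and consumer handlers:
    with http_response_seconds.labels(service="ingest").time():
        pass  # handle the request here
    http_requests_total.labels(service="ingest", status="500").inc()
    consumer_errors_total.labels(service="billing", topic="billing-jobs").inc()
    consumer_queue_size.labels(service="billing", topic="billing-jobs").set(42)
```

An alerting rule would then compute the HTTP error percentage from the counter (for example, 5xx responses divided by total requests) rather than alarming on raw counts.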

{{banner-22="/design/banners"}}

Three-Level Monitoring

The metrics described above treat the monitored system as a black box. Once your application grows beyond a few microservices, the implementation details become too many for a single engineer to understand. The four metrics above indicate user-facing symptoms since errors and delays are the only symptoms users experience during an outage or a slowdown regardless of the inner complexity of each microservice.

Alarming on errors and delays across the whole system is a significant first step but is not an effective microservice monitoring strategy. Monitoring only the whole system will detect total system outages. With microservice architectures, subsystems and services are decoupled from one another, reducing the risk of total system outages. However, decoupling services increases the risk of partial system outages. You must implement alarms at the subsystem and service level to monitor these partial system outages.

The advantage of this three-level monitoring approach is that alarms immediately indicate the subsystem or microservice affected by the incident. On-call engineers can escalate the incident to subject matter experts (SMEs) for the affected service early in their debugging process, shortening the duration of the incident.
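
As a rough sketch of what three-level evaluation can look like, the Python snippet below rolls per-service error counts up to the subsystem and system levels and checks the same error-rate threshold at each level; the topology, counts, and threshold are illustrative assumptions.

```python
# Sketch: evaluating the same error-rate alarm at service, subsystem, and system level.
# The topology, counts, and threshold below are illustrative assumptions.
from collections import defaultdict

# (subsystem, service) -> (errors, requests) over the evaluation window.
service_counts = {
    ("ingestion", "http-api"): (40, 1000),
    ("ingestion", "validator"): (2, 1000),
    ("processing", "enricher"): (0, 5000),
    ("notification", "emailer"): (1, 300),
}

ERROR_RATE_THRESHOLD = 0.02  # 2%, an assumed example threshold

def check(scope: str, errors: int, requests: int) -> None:
    rate = errors / requests if requests else 0.0
    if rate > ERROR_RATE_THRESHOLD:
        print(f"ALARM {scope}: error rate {rate:.1%}")  # page the on-call engineer

# Service level: pinpoints the failing microservice for the SME.
for (subsystem, service), (errors, requests) in service_counts.items():
    check(f"service={subsystem}/{service}", errors, requests)

# Subsystem level: catches partial outages spread across a subsystem's services.
subsystem_totals = defaultdict(lambda: [0, 0])
for (subsystem, _), (errors, requests) in service_counts.items():
    subsystem_totals[subsystem][0] += errors
    subsystem_totals[subsystem][1] += requests
for subsystem, (errors, requests) in subsystem_totals.items():
    check(f"subsystem={subsystem}", errors, requests)

# System level: catches total outages.
check("system",
      sum(e for e, _ in service_counts.values()),
      sum(r for _, r in service_counts.values()))
```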

Three-Level Monitoring Examples

To highlight how three-level monitoring works, consider the following examples.

Imagine a system that ingests data, processes that data asynchronously using a message bus, then sends email notifications back to the customer. There are three subsystems in this system: data ingestion, data processing, and notification sending.

If you receive an HTTP error alarm for data ingestion, the subsystem may require human intervention. The reasons behind the user-facing errors are varied: for example, the subsystem may depend on a failing cache, a slow database, or containers that are producing errors. Such a scenario requires an alarm so that an expert can diagnose the specific issue using debugging metrics.

In contrast, alarming on debugging metrics can cause false alarms. The notification subsystem in our example may rely upon a third-party email provider. The third-party provider failure count is important to track but is not a reliable alarming metric.

Consider what happens when the third-party provider has a partial outage. The third-party provider failure count metric will jump. If your team has prepared for this scenario, the notification subsystem will automatically postpone sending the delayed emails when it encounters such an error. In other words, there are no user-facing symptoms and, therefore, no need for manual intervention.
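
A hedged sketch of this "postpone instead of page" behavior is shown below; the provider client, error type, retry queue, and metric are hypothetical placeholders rather than a specific vendor's API.

```python
# Sketch: treating third-party provider failures as a debugging metric and
# postponing the work instead of paging an engineer.
# The provider client, error type, and retry queue are hypothetical placeholders.
import logging

from prometheus_client import Counter

provider_failures_total = Counter(
    "email_provider_failures_total", "Third-party email provider failures"
)
log = logging.getLogger("notification")


class EmailProviderError(Exception):
    """Placeholder for whatever error the real provider client raises."""


def send_email(notification) -> None:
    """Placeholder for the third-party provider call."""
    raise EmailProviderError("provider returned 503")


def requeue_with_delay(notification, delay_seconds: int) -> None:
    """Placeholder for putting the notification back on the message bus."""
    log.info("requeued %s for retry in %ss", notification, delay_seconds)


def deliver(notification) -> None:
    try:
        send_email(notification)
    except EmailProviderError:
        # Debugging metric: valuable to the SME, too noisy to alarm on directly.
        provider_failures_total.inc()
        # No user-facing symptom: the email is delayed, not lost,
        # so no alarm and no manual intervention are needed.
        requeue_with_delay(notification, delay_seconds=300)
```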

Layering in Infrastructure Metrics

We can return to traditional monitoring metrics now that we understand the difference between alarming and debugging metrics. CPU percentage, memory utilization, and other server-level metrics are no longer vital alarming metrics when monitoring microservices, but they are valuable debugging metrics. Infrastructure metrics, along with detailed application metrics, help SMEs determine an incident’s root cause after isolating the part of the application infrastructure contributing to the incident.

Layering in Logs and Tracing

This article has focused on application and server metrics because metrics are the key to proper alarming. Metric values are aggregated, which summarizes information for operators, but aggregation is also a disadvantage for detailed debugging. Aggregation strips metrics of their context. For example, an error count indicates that a service has errors but doesn’t describe the error messages. Centralized logging (collecting and visualizing all logs in a tool such as an Elasticsearch, Logstash, and Kibana stack) and transaction tracing (e.g., Jaeger) are used to apply context to aggregated metrics.

As mentioned earlier in this article, microservices break application logic into multiple loosely-coupled services. Application logs and tracing tie microservices back together, allowing operators to pinpoint errors as they propagate through the system. Logging tools like Elastic collect text-based log files from distributed systems and index them to help users isolate an individual error message by searching. Tracing tools like Jaeger discover and visualize the relationships between microservices to help users understand the dependencies that are instrumental to isolating the root cause.
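
As a rough illustration of tying logs and traces together, the sketch below uses the OpenTelemetry Python API to attach the current trace ID to a log line; a tracing SDK configured to export spans to Jaeger is assumed (exporter setup is omitted), and the service, span, and logger names are illustrative.

```python
# Sketch: correlating logs and traces by logging the current trace ID.
# Assumes an OpenTelemetry SDK and Jaeger-compatible exporter are configured elsewhere;
# without them the API falls back to a no-op tracer and the trace ID is zero.
import logging

from opentelemetry import trace

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("order-service")
tracer = trace.get_tracer("order-service")

def handle_order(order_id: str) -> None:
    with tracer.start_as_current_span("process-order") as span:
        trace_id = format(span.get_span_context().trace_id, "032x")
        # Including the trace ID lets an operator jump from a log search
        # (e.g., in Kibana) straight to the corresponding trace in Jaeger.
        log.info("processing order %s trace_id=%s", order_id, trace_id)
        # ... calls to downstream microservices join the same trace ...

handle_order("example-123")
```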

Publishing too many logs or traces can become quite expensive, however. Consider sampling logs and traces to reduce storage costs. You can also reduce your storage costs by implementing dynamic log levels, allowing operators to turn on and off detailed logging in a production environment at runtime.
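
One way to implement dynamic log levels is a small runtime toggle around Python's standard logging module, sketched below; how the toggle is triggered in production (an admin endpoint, a signal handler, or a configuration watch) is left as an assumption.

```python
# Sketch: changing a service's log level at runtime to control log volume.
# The mechanism that calls set_log_level (admin endpoint, signal, config watch) is assumed.
import logging

logging.basicConfig(level=logging.WARNING)     # quiet by default to limit storage cost
log = logging.getLogger("payment-service")

def set_log_level(level_name: str) -> None:
    """Called by an operator-facing toggle during an incident."""
    log.setLevel(getattr(logging, level_name.upper(), logging.WARNING))

log.debug("not emitted at the default WARNING level")
set_log_level("DEBUG")                          # operator turns on detailed logging
log.debug("now emitted while the incident is being debugged")
set_log_level("WARNING")                        # and turns it back off afterwards
```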

{{banner-sre="/design/banners"}}

Conclusion

Monitoring a legacy client-server application was limited to watching its infrastructure. Monitoring modern applications begins with observing the digital experience of their end users and tracking the service quality of their supporting microservices.

The challenge of monitoring an application based on microservices is the effort required to sift through the enormous amount of data that it generates and to tame its dynamic complexity. Servers, containers, load balancers, databases, and other system components are short-lived and continuously produce time-series data and log files.

The best approach to monitoring a microservice is to treat it as a black box and rely on its service latency and error rate, which mask its inner complexity. These two metrics are ideal for alarming because they are simple to understand and apply to a single microservice just as they apply to an entire software application.

The alarms generated from these two metrics can be forwarded to experts who specialize in a given microservice to isolate the problem’s root cause. The experts must rely on distributed tracing to uncover where the problem originated and use log monitoring tools to search and find the earliest indicative error message.

Keep in mind that monitoring microservices is a journey. It must continuously improve by reducing alarm noise and by analyzing recent outages to identify new metrics and log entries that can make future troubleshooting more efficient.