Infrastructure Monitoring

Despite the advent of modern monitoring techniques like synthetic monitoring and APM (application performance monitoring), infrastructure monitoring remains a fundamental component of any application observability strategy.

Infrastructure monitoring tools collect, store, and analyze data from an organization’s IT infrastructure, including operating systems, virtual machines, servers, storage volumes, networks, databases, and application messaging platforms. Infrastructure monitoring has also evolved to cover public cloud services, such as serverless functions, and container orchestration platforms like Kubernetes.

When applications were designed based on the client-server architecture paradigm, operations teams only monitored the infrastructure to detect performance bottlenecks. However, the infrastructure supporting modern applications became too complex and dynamic (with autoscaling and load balancing), so the focus shifted to monitoring application services instead of infrastructure. Now, the two approaches complement each other. Once the tools designed to observe user experience detect a service degradation, infrastructure monitoring tools help identify the root cause.

This article addresses the following topics to help you understand and evaluate a modern infrastructure monitoring solution:

  • Infrastructure objects typically covered by infrastructure monitoring
  • Metrics, logs, and traces
  • The core functionality of infrastructure monitoring solutions
  • Popular open-source infrastructure monitoring tools
  • Best practices for choosing and implementing an infrastructure monitoring tool

Basic elements of infrastructure monitoring

First, let's highlight the main elements of modern IT infrastructure that organizations need to monitor.

Servers

A server is the fundamental piece of infrastructure hardware. Modern servers include management modules (such as iLO in HP servers or iDRAC in Dell servers) that allow you to directly obtain the status of their components (processor, memory, storage, network, fans, etc.). In modern environments, servers often run hypervisors.

Hypervisors

A hypervisor is a software layer that decouples an operating system and its applications from the underlying physical hardware. The most popular hypervisors are VMware ESXi, Xen, Linux KVM, and Microsoft Hyper-V.

Hypervisors, like servers, include management modules that report on their own status and on the servers they run on. When consolidated under a single management system, hypervisors offer broad functionality for maintaining the health of virtual machines. Even so, it is still important to collect performance and health metrics that can highlight potential or existing problems.

Virtual machines

A virtual machine is a software-defined computer or server. Collecting metrics from virtual machines makes it possible to identify cases where they do not receive the resources they need from the hypervisor or otherwise suffer from performance issues.

Operating systems

An operating system is a set of programs designed to manage computer resources and organize user interaction. Here, at a minimum, we should collect and analyze CPU, memory, disk, and network utilization data, the status of system services, and events from logs.
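As a rough illustration of what one such sample might look like, the sketch below gathers a few host readings using only the Python standard library. The function name collect_os_sample and the dictionary keys are assumptions made for this example; a real agent would also read memory, network, and service status, which require platform-specific sources.

```python
import os
import shutil
import time


def collect_os_sample(path="."):
    """Return one point-in-time sample of basic host readings."""
    # Total, used, and free bytes for the filesystem that holds `path`.
    disk = shutil.disk_usage(path)
    sample = {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "disk_used_percent": round(100 * disk.used / disk.total, 1) if disk.total else 0.0,
    }
    try:
        # Load averages are only exposed on Unix-like systems.
        sample["load_1m"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass
    return sample


if __name__ == "__main__":
    print(collect_os_sample())
```

An agent would typically call a function like this on a fixed interval and ship each sample to the monitoring system.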

Containers

A container is a lightweight, stand-alone executable software package that includes everything you need to run your application: code, runtime, system tools, system libraries, and settings. The de facto standard in this area is Docker. From each container, we can collect resource utilization metrics and the logs it generates during execution. When containers run under an orchestration tool (Kubernetes, OpenShift, Nomad, etc.), monitoring should move to a higher level, because the failure of a single container may not affect the performance of the application or service.

Database management systems

A database management system (DBMS) is a set of programs that allow you to create databases and manipulate data (insert, update, select, and delete). The most common products are MySQL, PostgreSQL, Microsoft SQL Server, and Oracle Database. Data is vital to most applications and services, so the availability and performance of databases are critical. DBMSs expose a wide range of metrics, such as response time, open connections, query execution speed, transactional locks and deadlocks, and buffer and cache usage.

Message brokers

A message broker is an application that receives and sends messages between the individual modules or applications of a larger system. The most popular products are RabbitMQ and Kafka. Key monitoring metrics for message brokers are the number of messages in queues and memory consumption.

Data storage systems

A data storage system combines hardware and software into a single solution designed to store and process large amounts of information. In addition to the standard metrics for the health of storage hardware components, these three main metrics are critical to assessing storage performance:

  • Service time - often referred to as latency or response time
  • IO/s - the number of input/output operations per second (IOPS)
  • MB/s - the number of megabytes transferred per second
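Storage systems usually expose these figures as cumulative counters, so a monitoring tool derives the rates from two successive readings. The sketch below shows that arithmetic. The DiskCounters class and its field names are hypothetical, not taken from any particular storage product.

```python
from dataclasses import dataclass


@dataclass
class DiskCounters:
    """Cumulative counters read at one moment (hypothetical field names)."""
    timestamp_s: float  # when the snapshot was taken, in seconds
    io_ops: int         # completed read and write operations so far
    bytes_moved: int    # bytes read and written so far
    busy_ms: float      # total time spent servicing those operations, in milliseconds


def storage_rates(prev, curr):
    """Derive IO/s, MB/s, and average service time between two snapshots."""
    interval = curr.timestamp_s - prev.timestamp_s
    if interval <= 0:
        raise ValueError("snapshots must be in chronological order")
    ops = curr.io_ops - prev.io_ops
    return {
        "iops": ops / interval,
        "mb_per_s": (curr.bytes_moved - prev.bytes_moved) / interval / 1_000_000,
        # Average time per operation; zero when no I/O happened in the interval.
        "avg_service_time_ms": (curr.busy_ms - prev.busy_ms) / ops if ops else 0.0,
    }


if __name__ == "__main__":
    before = DiskCounters(timestamp_s=0, io_ops=1_000, bytes_moved=50_000_000, busy_ms=2_000)
    after = DiskCounters(timestamp_s=10, io_ops=6_000, bytes_moved=550_000_000, busy_ms=9_500)
    print(storage_rates(before, after))
    # 5,000 operations and 500 MB in 10 s: 500 IO/s, 50 MB/s, 1.5 ms per operation
```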

Network devices

These are components (such as routers, switches, access points, and firewalls) used to connect computers, servers, data storage systems, and other devices so that they can share data and resources. Network devices expose the necessary metrics (port status and utilization, errors, packet loss, etc.) through the Simple Network Management Protocol (SNMP). Many network devices can also provide detailed traffic information through the NetFlow protocol and send logs through Syslog.
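Port utilization is usually not read directly. It is computed from two readings of an octet counter and the port speed. The sketch below assumes the values have already been fetched (for example, the IF-MIB objects ifInOctets, a 32-bit wrapping counter, and ifSpeed, in bits per second) and shows only the arithmetic, including counter wrap-around. The function names are illustrative.

```python
COUNTER32_MODULUS = 2 ** 32


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Difference between two readings of a counter that wraps at `modulus`."""
    return (current - previous) % modulus


def port_utilization_percent(prev_octets, curr_octets, interval_s, speed_bps):
    """Share of the port's capacity used during the polling interval."""
    if interval_s <= 0 or speed_bps <= 0:
        raise ValueError("interval and speed must be positive")
    bits = counter_delta(prev_octets, curr_octets) * 8
    return 100 * bits / (interval_s * speed_bps)


if __name__ == "__main__":
    # The counter wrapped between polls: 1,000,000 octets were actually received.
    print(port_utilization_percent(4_294_000_000, 32_704, interval_s=10, speed_bps=10_000_000))
    # 8,000,000 bits over 10 s on a 10 Mbit/s port is 8.0 percent
```

The modulo handles a single wrap between polls. On fast links, a 32-bit counter can wrap more than once per interval, which is one reason to poll frequently or to use 64-bit counters where the device offers them.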

Public clouds

A public cloud is an infrastructure model for providing access to compute, storage, network, or other resources. Of course, all public cloud service providers have internal monitoring, but if public cloud resources are part of your IT infrastructure, you should monitor them directly.

Understanding metrics, logs, and traces

Now that we know what infrastructure to monitor, let's look at what types of data can be collected and analyzed.

  • Metrics - quantitative or qualitative indicators that reflect a characteristic of the monitored object
  • Logs - files containing records of system or application actions
  • Traces - records of a request's path through a distributed system. A trace is a tree of spans: a root span covers the whole request, and child spans break it down into the individual operations or functions performed by each service involved.
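To make the tree structure of a trace concrete, here is a minimal sketch of a span tree. The Span class, the service names, and the timings are invented for illustration and do not follow any particular tracing library.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One timed operation within a trace; children are nested operations."""
    name: str
    start_ms: float
    end_ms: float
    children: list = field(default_factory=list)

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms

    @property
    def self_time_ms(self):
        """Time spent in this span excluding its direct children."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def render(span, depth=0):
    """Return the trace as indented lines, one per span."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms:.0f} ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


if __name__ == "__main__":
    trace = Span("GET /checkout", 0, 120, [
        Span("auth-service", 5, 25),
        Span("order-service", 30, 110, [Span("db query", 40, 100)]),
    ])
    print("\n".join(render(trace)))
```

Reading the tree from the root down shows where the request spent its time. In this made-up trace, most of the 120 ms is inside the database query called by order-service.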

Infrastructure monitoring key functionality

The table below summarizes the key functionality of infrastructure monitoring tools. When assessing infrastructure monitoring tools, be sure to consider this functionality as you make your comparisons.

Now, let’s take a closer look at each one.

Collection agents

A monitoring agent is a lightweight application that collects information about the system and the applications running on it. Agents reduce the load on the monitoring system’s core and make it possible to run checks directly from the node. For example, an agent can check the availability of a network resource that an application or service hosted on the node depends on. In some monitoring systems, if the monitoring server is unavailable, the agent can buffer data locally and forward it once the server becomes available again.
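The buffering behavior can be sketched in a few lines. This is a simplified illustration, not how any specific agent is implemented: the BufferingAgent class is hypothetical, and send stands in for whatever transport a real agent uses.

```python
from collections import deque


class BufferingAgent:
    """Holds samples locally while the monitoring server is unreachable."""

    def __init__(self, send, max_buffered=1000):
        self._send = send  # callable that raises ConnectionError on failure
        # A bounded queue: once full, the oldest samples are dropped first.
        self._buffer = deque(maxlen=max_buffered)

    @property
    def pending(self):
        return len(self._buffer)

    def report(self, sample):
        """Queue a new sample and try to deliver everything queued so far."""
        self._buffer.append(sample)
        return self.flush()

    def flush(self):
        """Send buffered samples in order; stop at the first failure."""
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                return False  # server still unreachable, keep the samples
            self._buffer.popleft()  # remove only after a successful send
        return True
```

Bounding the buffer is a design choice. It keeps memory use predictable on the monitored node at the cost of losing the oldest data during a long outage.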

Metrics and logs storage

Some monitoring systems include an engine for storing data. Others allow you to configure database management systems for storage.

Good infrastructure monitoring tools will allow you to configure different storage policies depending on the metrics.
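A per-metric storage policy often comes down to a lookup from metric name to retention period. The sketch below is one possible shape for such a configuration. The metric name patterns and the retention periods are made-up values, not recommendations.

```python
from fnmatch import fnmatchcase

# Ordered from most to least specific; the first matching pattern wins.
# Patterns and retention periods are illustrative assumptions.
RETENTION_POLICIES = [
    ("cpu.*", 14),   # high-volume, short-lived value
    ("disk.*", 90),  # kept longer for capacity planning
    ("*", 30),       # default for everything else
]


def retention_days(metric_name, policies=RETENTION_POLICIES):
    """Return how many days samples of this metric should be kept."""
    for pattern, days in policies:
        if fnmatchcase(metric_name, pattern):
            return days
    raise LookupError(f"no retention policy matches {metric_name!r}")


def is_expired(metric_name, age_days, policies=RETENTION_POLICIES):
    """True when a sample is older than its metric's retention period."""
    return age_days > retention_days(metric_name, policies)
```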

Alerting

One of the primary purposes of monitoring is to generate alerts when a problem occurs. The infrastructure monitoring tool should allow you to configure the rules for triggering alerts and the workflow that follows when they fire. For example, defining alert contacts, alerting channels (e.g., email, SMS, or Slack message), and when the alert is sent (immediately, after 5 minutes, after an hour, etc.) are essential features of infrastructure monitoring tools.
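A delayed alert, such as one sent only after a condition has lasted 5 minutes, can be modeled as a threshold rule that remembers when the breach started. The ThresholdRule class below is a hypothetical sketch of that idea. Real tools express the same thing in their own rule syntax.

```python
class ThresholdRule:
    """Fires only after a value has stayed above a threshold for a hold period."""

    def __init__(self, threshold, hold_seconds):
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self._breach_started = None  # time of the first sample above the threshold

    def evaluate(self, value, now):
        """Return True when an alert should be active for this sample."""
        if value <= self.threshold:
            self._breach_started = None  # condition cleared, reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = now
        return now - self._breach_started >= self.hold_seconds


if __name__ == "__main__":
    rule = ThresholdRule(threshold=90, hold_seconds=300)
    for now, cpu in [(0, 95), (120, 96), (300, 97), (360, 50), (400, 95)]:
        print(now, cpu, rule.evaluate(cpu, now))
```

In the example run, only the sample at 300 seconds fires. The dip at 360 seconds resets the timer, so the new breach at 400 seconds starts counting again from zero.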

In some systems, it is also possible to configure the dependencies of some alerts on others to reduce the number of alerts in case of massive failures (for example, network failures).

Dashboards

Dashboards are interactive panels with important information grouped on one or more screens. Dashboards allow you to group data from multiple data sources. With dashboards, you can implement an infrastructure health map, a network map, or a service connectivity map.

Reports

Typically, an infrastructure monitoring tool contains several pre-configured reports on infrastructure status and recent events. A report engine also lets users customize reports and schedule them to run automatically at regular intervals.

Analytics and machine learning

An infrastructure monitoring system with an analytics and machine learning module allows you to identify problems based on historical data and discovered dependencies. By analyzing the patterns of behavior of metrics, it can automatically detect deviations from normal behavior. In addition, an advanced version of this functionality allows you to predict some problems in advance.
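Commercial tools use far more sophisticated models, but the basic idea of detecting a deviation from normal behavior can be shown with a z-score check against recent history. This is a deliberately simple stand-in, not the method any particular product uses. The function name and the threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, value, z_threshold=3.0):
    """Flag `value` if it lies more than `z_threshold` standard deviations
    from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate normal behavior
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation
    if sigma == 0:
        return value != mu  # a perfectly flat history: any change is a deviation
    return abs(value - mu) / sigma > z_threshold


if __name__ == "__main__":
    cpu_history = [50, 52, 49, 51, 48, 50, 53, 47]  # mean 50, standard deviation 2
    print(is_anomalous(cpu_history, 51))  # False (0.5 standard deviations away)
    print(is_anomalous(cpu_history, 75))  # True (12.5 standard deviations away)
```

A fixed z-score ignores seasonality, such as daily or weekly load patterns, which is one of the gaps that the machine learning modules described above aim to close.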

Open-source infrastructure monitoring tools

Now that we understand what to look for in infrastructure monitoring tools, let’s look at some popular open-source solutions.

Zabbix

Zabbix is an open-source, enterprise-class distributed monitoring solution. It monitors numerous network parameters as well as the health and integrity of servers, uses a flexible notification mechanism, and offers reporting and data visualization based on historical data. Zabbix supports both pollers (for actively collecting metrics) and trappers (for passively receiving metrics).

InfluxDB

InfluxDB is an open-source time-series database for recording metrics, events, and analytics. It is not a complete infrastructure monitoring solution, but it is an effective engine for storing monitoring metrics.

Prometheus

Prometheus is an open-source monitoring solution with a dimensional data model, a flexible query language, an efficient time-series database, and a modern approach to alerting. Prometheus is well suited to collecting metrics from applications with dynamic architectures (microservices). It does not include a full dashboard module of its own, but its data is easily visualized in Grafana.
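In a dimensional data model, each sample is identified by a metric name plus a set of key-value labels. The sketch below renders one sample as a single line in the style of the Prometheus text exposition format. The helper format_sample is written for this article and is not part of any Prometheus client library, and the metric and label names are only examples.

```python
def format_sample(name, labels, value):
    """Render one sample as `name{label="value",...} value`."""

    def escape(text):
        # Label values escape backslashes, double quotes, and newlines.
        return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    if not labels:
        return f"{name} {value}"
    rendered = ",".join(f'{key}="{escape(str(val))}"' for key, val in sorted(labels.items()))
    return f"{name}{{{rendered}}} {value}"


if __name__ == "__main__":
    print(format_sample("node_cpu_seconds_total", {"cpu": "0", "mode": "idle"}, 1234.5))
    # node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
```

Because labels are part of a sample's identity, the same metric name can be sliced by CPU, mode, host, or any other dimension at query time.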

Elastic

Elasticsearch is a distributed, RESTful search and analytics engine that stores data centrally and supports fast search, fine-tuned relevancy, and analytics at scale. It is one of the most common solutions for storing application logs. Elasticsearch is usually deployed together with Kibana for data visualization.

Grafana

Grafana is an open-source, multi-platform web-based analytics and interactive visualization application. It provides charts, graphs, and alerts for the web when connected to supported data sources. Grafana has wide functionality and allows you to create truly beautiful dashboards that are appreciated not only by administrators and engineers but also by top managers.

Best practices for choosing and implementing an infrastructure monitoring tool

Here are some important tips on how to choose and implement an infrastructure monitoring tool:

  1. Prepare your infrastructure plan and examine how the monitoring utility covers the elements of your infrastructure out of the box. The more elements you can cover before customizing the monitoring tool, the less time it takes to fully cover your infrastructure.

  2. Prepare an escalation plan for the alerts generated by your infrastructure monitoring solution by considering the following:
    • Where to forward the alerts (e.g., Slack, email, PagerDuty, even console)
    • A strategy for deferring alerts that don’t require immediate action
    • Integration with your incident management system
    • Integration with automation systems that can take basic remedial actions

  3. If your selected monitoring tool supports multiple database systems (e.g., relational, NoSQL, cloud-based), start with the one that can scale more and cost less in the long run. Changing a database system is a complex and challenging task and deserves the upfront research investment.

  4. A common problem with infrastructure alerts is their sheer volume. Consider a tool that supports event correlation (we use the terms event and alert interchangeably), or augment your tool with one that does. Such technologies suppress symptomatic alerts, avoiding a flood of notifications. For example, if a network route is down, the tool generates a single alert instead of one from every unreachable segment of the network.

Conclusion

Infrastructure monitoring is an indispensable tool for identifying the root cause of performance issues in modern environments. We can think of it as the lowest level in the pyramid of information related to the inner functioning of an application. Infrastructure monitoring covers servers and all middleware systems responsible for the physical and virtual delivery of the application services. Everything the user experiences depends on the performance of the underlying infrastructure. Choosing the correct infrastructure monitoring solution based on predefined selection criteria will significantly help improve your overall observability strategy.