
The guide to observability vs monitoring

Log Monitoring

Logs are strings of text that record events occurring on a system. They come in many different formats and can be written locally to a log file or sent over the network when an event happens.

There are two basic types of logs:

  • System logs provide information about events happening at the OS (operating system) level. Examples of events in system logs include system-level authentication, connection attempts, service and process starts and stops, configuration changes, errors, and point-in-time usage and performance metrics.
  • Application logs provide information about events happening at the software level. “Software level” includes specialized server software – think of dedicated proxies or firewalls – and other software applications. These logs include events like application-level authentication, CRUD operations, software configuration changes, errors, and application-specific functions. Examples of application-specific logs include proxy logs, firewall logs, and log statements inserted by a developer.
The two types of logs used in monitoring (source)

{{banner-20="/design/banners"}}

Logging strategies

Log monitoring is the practice of reviewing logs to determine what events are occurring on your systems. Logs contain valuable information and granular detail. There are different ways you can monitor logs:

  • Local log monitoring is done by directly accessing a system and reviewing local log files. This is the fastest way to perform log monitoring for a single system and can be very helpful for developers building and testing code locally. However, it is not easily scalable, and log retention policies can leave your systems’ disk space clogged with logs. Local monitoring is also not feasible for infrastructure where you do not have direct access to the servers, such as Functions-as-a-Service (FaaS) and Software-as-a-Service (SaaS).
  • Vendor-provided log consoles are interfaces provided by vendors for you to review logs without direct server access. These are necessary as SaaS solutions do not expose the underlying server, but users are still responsible for monitoring application-level behavior. The flexibility in searching, filtering, and exporting these logs varies. These consoles can be helpful for admins working specifically in the tool, but again, scaling and log retention are issues. An organization with multiple SaaS vendors will find it cumbersome to validate even basic activity across different SaaS platforms: administrators cannot easily manage all the different logins, UI navigation, and search languages. Additionally, vendors generally have log retention policies that do not align with how long you may be legally required to store logs.
  • Centralized log management moves your logs from their origin to a central repository that can have its own data retention, storage, and access policies. You can normalize your data, search across your entire environment, and schedule alerts in this centralized location. Log management platforms can be on-prem, self-hosted, or SaaS deployments.
A centralized log management server aggregates log data and provides a UI to view and sort logs (source)

Log monitoring use cases

Local log monitoring may be sufficient for developers, and vendor-provided log consoles may be adequate for specialized admin teams. Still, there are many operational and security reasons to centralize your log management. For example, centralized logging really shines when you need to standardize monitoring across servers or services, correlate disparate data sources with each other, or monitor ephemeral environments. Common use cases for log monitoring include:

  • Monitor all authentication systems for password spray attacks and alert the Security Operations Center (SOC) for investigation.
  • Correlate Intrusion Detection System (IDS) logs with vulnerability scan data and asset management data to determine when an attack occurs against a vulnerable system. When an attack is detected, the system can alert both the SOC and the asset owner.
  • Scope a project to upgrade all servers to use TLS 1.2 for communication by identifying active usage trends and high-volume systems.
  • Monitor server health and cut tickets to the operations team when servers are showing troubling KPIs.
  • Notify the applications team when a spike in errors occurs within the application logs of a critical system.
  • Troubleshoot dropped traffic between two endpoints.
  • Discover system relationships by injecting test data into a workflow and tracing its progression.
  • Investigate user activity across all systems to assist with a help desk call about persistent lockout issues.
This centralized log server daily review workflow includes logs from a network device, a security appliance, and a server (source)

{{banner-21="/design/banners"}}

Log monitoring features

When evaluating log management solutions, here are some essential features to consider:

Ingestion types

A log management solution should support different kinds of log ingestion. Syslog-ng and other collectors provide flexible, generalized ingestion points. Agents installed on endpoints can simplify configuration for endpoint log collection. API or webhook integrations – both push and pull methodologies – are useful for SaaS platforms and microservice infrastructures.
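
For illustration, here is a minimal Python sketch of collector-based ingestion that ships application events to a syslog listener using only the standard library; the hostname, port, and logger name are placeholders, not references to any specific product:

```python
import logging
from logging.handlers import SysLogHandler

# Assumed collector endpoint; point this at your syslog-ng (or similar) listener.
handler = SysLogHandler(address=("logs.example.internal", 514))

logger = logging.getLogger("payments-service")  # hypothetical application name
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# The event travels over the network to the central collector as it happens.
logger.info("user_id=42 action=checkout status=success")
```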

Log filtering and masking

During log ingest, you want mechanisms to filter out unneeded logs and mask sensitive data. Verbose debug logging is helpful for application developers. However, it is generally not useful for a centralized log management solution. Dropping these logs can help you save on ingest cost and downstream compute costs. It can also simplify user search results.

As a result, filtering can help improve performance and protect against accidental data exposure, such as exposing customer PII in overly descriptive application logs.

PII concerns are also why you want to make sure your log management solution has masking capabilities. Suppose your logs include sensitive information such as customer names, email addresses, IDs, or plaintext passwords. In that case, you want to put safeguards in place to prevent this information from being ingested. Otherwise, you risk exposing PII.
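
As a rough sketch of what ingest-time masking looks like, the Python snippet below scrubs two common PII patterns before a log line is forwarded. The patterns and placeholder tokens are illustrative; a real deployment would lean on the platform’s built-in masking rules:

```python
import re

# Example patterns only; production rule sets are far more extensive.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_pii(line: str) -> str:
    """Replace common PII patterns before the log leaves the host."""
    line = EMAIL.sub("<masked-email>", line)
    return SSN.sub("<masked-ssn>", line)

print(mask_pii("login ok for jane.doe@example.com ssn=123-45-6789"))
# login ok for <masked-email> ssn=<masked-ssn>
```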

Secure transport and storage

Logs should be encrypted both in transit and at rest. Encryption prevents tampering, data exposure, and other security concerns. It also simplifies onboarding for sensitive log sources containing data subject to regulatory encryption minimums. The transport mechanism itself should be robust and mitigate the risk of dropping logs. Although UDP was previously the standard for log transport, the modern syslog standards (RFC 5424 and its TLS transport mapping, RFC 5425) favor TLS-based transport. You may still see UDP logging in legacy equipment, but it is not best practice due to the potential for log loss.
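
To make the transport side concrete, here is a simplified Python sketch of sending a syslog message over TLS (RFC 5425 reserves port 6514 for this). The endpoint is a placeholder, and real RFC 5425 framing adds an octet count in front of each message:

```python
import socket
import ssl

HOST, PORT = "logs.example.internal", 6514  # assumed TLS syslog endpoint

context = ssl.create_default_context()  # verifies the server certificate
with socket.create_connection((HOST, PORT)) as raw:
    with context.wrap_socket(raw, server_hostname=HOST) as tls:
        # Simplified RFC 5424-style message; production senders add framing.
        msg = "<134>1 2024-01-01T00:00:00Z web01 app - - - user login ok\n"
        tls.sendall(msg.encode("utf-8"))
```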

Parsing

A log management solution should provide up-to-date and easily configurable parsing for common data structures and popular vendor tools. The power of a log management tool shows in its ability to parse logs, letting the user easily and efficiently search against the parsed fields.

Unfortunately, there is no global standard for log formatting. There are commonly used structures such as JSON, XML, and key-value pairs, but vendors and application teams can and will use custom formatting. Choosing a log management solution with an existing parsing library and auto-parsing on standard formats will save a lot of time during log onboarding and product upgrades that impact the log format.

Parsing occurs at different points in different log management solutions. Some are more flexible than others, but that flexibility can come at a processing cost. When choosing a solution, it is important to understand when parsing happens and what the process is to re-parse logs if parsing breaks or the log format changes.
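
To show why auto-parsing on standard formats matters, here is a toy best-effort parser: try JSON first, then fall back to key-value pairs. Real platforms apply far more sophisticated, vendor-specific rules:

```python
import json
import re

KV = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_line(line: str) -> dict:
    """Best-effort parse: try JSON first, then fall back to key=value pairs."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return {k: v.strip('"') for k, v in KV.findall(line)}

print(parse_line('{"level": "error", "code": 500}'))
print(parse_line('level=error code=500 msg="disk full"'))
```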

Searching

Every log management solution should offer a flexible, robust searching language for efficiently querying your logs. UI-based filtering can be helpful for new users, but text-based queries make creating and sharing queries much more efficient in the long term.

In addition to simple keyword searching against parsed data, you should also be able to do the following (sketched in code after the list):

  • Search using wildcards
  • Query raw text of the logs
  • Manipulate and store information in variables
  • Use regex to extract and match text
  • Perform basic mathematical and statistical functions
  • Manipulate strings
  • Perform basic comparisons and aggregations across a set of logs
  • Combine disparate datasets
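
Each platform expresses these operations in its own query language, so purely as an illustration of the kinds of operations involved, here is what regex extraction plus a simple aggregation over raw log text looks like in plain Python:

```python
import re
from collections import Counter

logs = [
    "2024-01-01T00:00:01Z status=500 path=/api/orders",
    "2024-01-01T00:00:02Z status=200 path=/api/orders",
    "2024-01-01T00:00:03Z status=500 path=/api/users",
]

# Extract a field with regex, then aggregate across the set of logs.
status = re.compile(r"status=(5\d\d)")
errors = Counter(m.group(1) for line in logs if (m := status.search(line)))
print(errors)  # Counter({'500': 2})
```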

Supplemental data

A log management solution should provide easy upload mechanisms to add supplemental data to the platform. Logs tell you which events occur in the environment, but they do not tell you the state of the environment.

A business can have a lot of supplemental information that correlates with logs and can significantly reduce manual triage. This information can help answer questions like:

  • Who owns this application?
  • What office does this server live in?
  • Who is the manager of this employee?
  • What error text does this error number map to?

IT ticket info, asset data, employee data, frameworks, and manual uploads are all common supplemental data sources.
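
The mechanics of using such data are simple joins. As a sketch, with a hypothetical asset inventory uploaded as a lookup table, enrichment might look like this:

```python
# Hypothetical asset inventory uploaded as supplemental data.
ASSETS = {
    "web01": {"owner": "payments-team", "office": "Berlin"},
    "db02": {"owner": "data-platform", "office": "Austin"},
}

def enrich(event: dict) -> dict:
    """Join a parsed log event against the asset lookup table."""
    return {**event, **ASSETS.get(event.get("host"), {})}

print(enrich({"host": "web01", "msg": "disk 91% full"}))
# {'host': 'web01', 'msg': 'disk 91% full', 'owner': 'payments-team', 'office': 'Berlin'}
```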

Saved searches and alerting

Users should be able to save searches for reuse and schedule saved searches for proactive alerting. Saved searches should be sharable, with the option to run on demand or on a periodic schedule to generate alerts and reports in a common data format. Typically, alerts are scheduled using cron expressions.
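
In spirit, a scheduled saved search is just a stored query run on a timer. The loop below is a stand-in, with saved_search() and the threshold as placeholders; an actual platform would let you attach a cron expression such as */15 * * * * to the search instead:

```python
import time

def saved_search() -> int:
    """Placeholder for a stored query, e.g., failed logins in the last 15 minutes."""
    return 0  # a real implementation would query the log platform

# Approximates a "*/15 * * * *" cron schedule in plain Python.
while True:
    if saved_search() > 10:  # hypothetical alert threshold
        print("ALERT: failed-login threshold exceeded")
    time.sleep(15 * 60)
```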

Resource allocation and scaling

Understand platform limitations when it comes to users, ingest, and compute. Where does the capacity to grow exist, and where would usage patterns cause a problem? If you’ve never built a log management solution before, accurately scoping capacity is a challenge. Safeguards should exist to limit how users on the system impact each other. For example, one user who kicks off a compute-heavy search should not monopolize all the platform’s resources.

{{banner-22="/design/banners"}}

Documentation

Centralized logging is intended to be a company-wide log management solution. This means the documentation must be excellent to keep the barrier to entry low. You don’t want to buy something so complicated you have to send everyone to training for it before getting value from the tool. Make sure your vendor has easily accessible, clear, concise documentation on how to perform key functions. Bonus if they have free self-led training.

Retention and recovery

From both a billing and an efficiency standpoint, it can be advantageous to implement a hot/cold storage strategy. Most users do not need more than the past 30 days of logs, but legal requirements can mandate years-long retention. Moving rarely accessed logs to cold storage can save on processing and reduce storage costs. Make sure all log storage has recovery capabilities in case of log destruction or system failure.
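
Managed platforms usually expose hot/cold tiering as a lifecycle policy, but the underlying idea is simple enough to sketch: anything older than the hot-retention window moves to cheaper storage. The paths and the 30-day window below are assumptions:

```python
import shutil
import time
from pathlib import Path

HOT = Path("/var/logs/hot")    # assumed fast, expensive tier
COLD = Path("/mnt/cold/logs")  # assumed cheap, slower tier
CUTOFF = time.time() - 30 * 24 * 3600  # 30-day hot retention window

for f in HOT.glob("*.log"):
    if f.stat().st_mtime < CUTOFF:
        shutil.move(str(f), str(COLD / f.name))
```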

Residency

Make sure your log management solution can accommodate any data residency laws in play in your countries of business. Some data may be legally mandated to remain on in-country servers. Can the vendor accommodate such a need?

Pricing

Pricing strategies for log management solutions are diverse, and many companies will use a mix of pricing components that aggregate into your final bill. Different pricing strategies are better depending on your ingest, storage, compute, search, user, and functionality requirements.

Pricing strategies may include the following components (a worked example follows the list):

  • Flat rate pricing charges a single fee for usage of the tool, regardless of component costs.
  • Ingestion pricing charges a fee per amount of data ingested.
  • Storage pricing charges a fee per amount of data stored.
  • Compute pricing charges a fee per amount of compute resources consumed, i.e., per the deployment’s search load.
  • Seats pricing charges the customer per number of users or “seats” that are in use.
  • Features pricing charges based on specific components of the solution, often described as “features”, “modules”, “functionality”, etc.
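
To see how these components aggregate, here is a toy bill calculation; every rate and quantity is hypothetical and exists only to show the arithmetic:

```python
# Hypothetical rates and volumes, purely illustrative.
ingest_gb, rate_per_ingest_gb = 500, 0.10    # $0.10 per GB ingested
stored_gb, rate_per_stored_gb = 2000, 0.02   # $0.02 per GB-month stored
seats, rate_per_seat = 25, 15.00             # $15 per seat per month

monthly_bill = (
    ingest_gb * rate_per_ingest_gb
    + stored_gb * rate_per_stored_gb
    + seats * rate_per_seat
)
print(f"${monthly_bill:,.2f}")  # $465.00
```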

Best Practices

Now let’s look at some log management best practices.

Always maintain an inventory

Asset management is key to good log hygiene. You can only capture logs from what you know is out there.

Logs document events, not state

Logs are useful for detecting when something happens in the environment, but they cannot provide on-demand state values.

Take configuration, for example. A log management solution can tell you which servers have recently initiated a session using a TLS version below 1.2. However, it cannot spot machines with TLS 1.0 enabled that haven’t used the protocol in any transactions.

Know your legal landscape

Nothing grinds a technical working session to a halt faster than uncertainty about whether onboarding a particular set of logs is legally allowed. Have your legal counsel survey relevant data residency laws.

Within your company, clearly identify and document which logs have in-country restrictions, which countries require paperwork to be submitted (and the status of that paperwork), and which countries have no data residency restrictions. Maintain a similar document with regard to data retention policies for different log sources.

Plan for sensitive data

Identify high-risk logs that may expose sensitive information. Even if you mask the data, create a response plan for how you will react if that sensitive data is exposed.

Log formats can change, temporarily invalidating masking rules. Developers may turn on more verbose logging to help debug an issue, only to discover these new logs contain new sensitive information. Will you restrict access to these particular logs? Delete any log that contains sensitive information, or let it roll off? Make sure to have well-defined answers to these questions.

Take a balanced approach to logging

Deciding which logs to ingest is a balancing act. A good equilibrium has to take into account ingestion fees, value from potential alerts generated, convenience, maintenance/administrative overhead, vendor tools that can offer similar functionality for a slice of the environment, and legal or compliance requirements.

Sometimes an alert is enough

Some log sources can be prohibitively large to ingest. In those cases, ingesting alerts and other action-item events can be a good compromise. Users who need more information will have to go elsewhere to retrieve the full logs (such as a vendor-provided log console or the server itself). However, by ingesting alerts, you can still take advantage of a log management platform’s centralized correlation and alerting functionality.

Configure health monitoring for log ingestion

Monitoring the health of log ingestion into the log management platform can be tricky. Consistent, high-volume log types are easy enough to spot when they go down. If you have infrastructure that is only used occasionally – such as a failover device or a small application with infrequent user activity – you’ll find that setting the right threshold for when the logging counts as ‘broken’ is more an art than a science.
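
One common approach is a per-source silence threshold: chatty sources alert after minutes of silence, quiet ones after days. Here is a minimal sketch, with thresholds as assumptions you would tune per source:

```python
import time

# Hypothetical per-source thresholds: seconds of silence before alerting.
MAX_SILENCE = {
    "core-firewall": 5 * 60,           # chatty source: alert within minutes
    "failover-device": 7 * 24 * 3600,  # quiet source: a week of silence is normal
}

def broken_sources(last_seen: dict) -> list:
    """Return sources whose ingestion looks broken, given last-seen epoch times."""
    now = time.time()
    return [s for s, t in last_seen.items() if now - t > MAX_SILENCE.get(s, 3600)]
```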

Normalize timestamps

Regardless of ingestion type, the log management solution should create a universal timestamp of when the log was ingested. The log itself will contain a timestamp of when the event occurred.

To avoid timezone issues, you should normalize to a universal timestamp. By comparing the event timestamp with the ingestion timestamp, you can easily identify ingestion delay issues.
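
A minimal sketch of both steps in Python: normalize the event timestamp to UTC, then compare it against the ingestion timestamp to measure delay. The example offset is arbitrary:

```python
from datetime import datetime, timezone

# Event timestamp as emitted by the source (arbitrary example offset).
event_time = datetime.fromisoformat("2024-01-01T09:00:00-05:00")
ingest_time = datetime.now(timezone.utc)  # stamped by the platform at ingest

event_utc = event_time.astimezone(timezone.utc)  # normalize to UTC
delay = ingest_time - event_utc
print(f"event={event_utc.isoformat()} delay={delay.total_seconds():.0f}s")
```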

Normalization can also be helpful when log timestamps may not clearly reflect the timeline of events. For example, a popular IT vendor timestamps security alerts with the time the suspicious event occurred, rather than when the activity was determined malicious and the alert was generated. The log timestamp can make it look like a security alert was ignored for days, when in reality it has only just arrived.

{{banner-sre="/design/banners"}}

Normalize your data

When parsing your logs, use vendor-agnostic field names. This simplifies the environment for users exploring the data and reduces administrative overhead when a technology producing the same kind of data is added or swapped out.
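
As a sketch of what this looks like in practice, the mapping below renames two hypothetical vendors’ source/destination fields to shared, vendor-agnostic names so one search covers both:

```python
# Hypothetical vendor-specific fields mapped to shared names.
FIELD_MAP = {
    "vendor_a": {"srcip": "src_ip", "dstip": "dest_ip"},
    "vendor_b": {"SourceAddress": "src_ip", "DestAddress": "dest_ip"},
}

def normalize(vendor: str, event: dict) -> dict:
    """Rename vendor-specific keys to the shared schema, passing others through."""
    mapping = FIELD_MAP[vendor]
    return {mapping.get(k, k): v for k, v in event.items()}

# Both vendors now produce the same searchable field names.
print(normalize("vendor_a", {"srcip": "10.0.0.1"}))
print(normalize("vendor_b", {"SourceAddress": "10.0.0.1"}))
```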