Observability vs. Monitoring: How Are They Different?

March 16, 2022

The terms data observability and monitoring are often used interchangeably in discussions of distributed systems, which leads to questions like “observability vs. monitoring: what’s the difference?”

Observability and monitoring are two distinct concepts that depend on each other and are essential for building and managing distributed systems.

Observability is a property of a distributed system that helps you understand what’s slow, broken, or inefficient. Monitoring is the act of tracking a system’s performance.

This article explores both concepts, explains their importance in the modern data stack, and elaborates on the differences between monitoring and observability.

Let’s start with monitoring.

What is data monitoring? #

According to Google Cloud’s DevOps Research and Assessment (DORA) program, monitoring is tooling that allows teams to watch and understand the state of their systems. Monitoring is based on gathering predefined sets of metrics or logs.

When you know which elements within your tech stack are prone to failure or bottlenecks, you can use monitoring in the form of dashboards and alerts to keep track of such elements.

For instance, data monitoring systems generate alerts to notify the DataOps team whenever a metric deviates from its expected range. These alerts allow teams to track, manage, and improve distributed microservices.

The alerts could be:

Cause-based: All possible error conditions — critical or non-critical — are listed and an alert is generated for each condition.
Symptom-based: These are created for prominent or highly critical errors. They focus more on user-facing symptoms, but also keep track of non-user-facing symptoms.

The DataOps team can receive these alerts through SMS, email, or dedicated mobile monitoring apps.
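
To make the two alert types concrete, here is a minimal Python sketch of rule-based alerting. The metric names (`disk_used_pct`, `error_rate_pct`), the thresholds, and the `AlertRule` helper are all hypothetical, chosen only for illustration. Real monitoring tools have their own rule formats.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AlertRule:
    name: str
    kind: str  # "cause" or "symptom"
    condition: Callable[[Dict[str, float]], bool]


def evaluate(rules: List[AlertRule], metrics: Dict[str, float]) -> List[str]:
    """Return one message per rule whose condition holds for this snapshot."""
    return [f"[{rule.kind}] {rule.name}" for rule in rules if rule.condition(metrics)]


# Hypothetical rules and thresholds, for illustration only.
RULES = [
    # Cause-based: watches a specific error condition inside the system.
    AlertRule("disk usage above 90%", "cause", lambda m: m["disk_used_pct"] > 90),
    # Symptom-based: watches what users would actually experience.
    AlertRule("error rate above 5%", "symptom", lambda m: m["error_rate_pct"] > 5),
]

snapshot = {"disk_used_pct": 72.0, "error_rate_pct": 8.5}
alerts = evaluate(RULES, snapshot)
print(alerts)
```

With this snapshot only the symptom-based rule fires, since disk usage is still below its assumed threshold.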

Now let’s explore observability and how it connects with monitoring.

The evolution of observability from monitoring #

According to Gartner, observability is the evolution of monitoring into a process that offers insight into digital business applications, speeds innovation, and enhances customer experience.

That’s because traditional monitoring tools don’t help you find the root cause of performance issues, especially in cloud-native infrastructure. They only tell you what’s wrong, but don’t elaborate on why something went wrong and how it affects the business KPIs.

That’s where observability can help. Observability picks up where monitoring left off.

What is meant by data observability? #

Observability is a measure of how well you can infer the internal state of a system using only its outputs. You interpret a system’s overall health and internal state by analyzing the data it generates — logs, metrics, and traces.

Here’s how James Burns, Head of Lightstep Research, explains observability by using a cruise control analogy:

Under constant power, a car’s velocity would decrease as it drives up a hill. So, an algorithm changes the engine’s power output in response to the measured speed to keep the vehicle’s speed consistent regardless of the grade or terrain. This system, cruise control, interprets the vehicle’s state by observing the measured output, i.e., the speed of the car.

Why is observability so crucial for the modern data infrastructure? #

Observability isn’t a new concept. It’s borrowed from control theory where observability is used to describe and understand self-regulating systems. Observability provides valuable insights into various cloud-distributed applications or microservices in the data ecosystem.

It’s vital in dealing with the uncertainties that modern distributed systems present.

For example, you know that a system will crash when it exceeds its memory limit. However, several factors could be causing the system to consume more memory, and you don’t yet understand how they work.

Monitoring these factors helps you determine the root cause. This is a “known unknown”: you know the problem exists, and monitoring helps you understand it better. Monitoring also helps with the “known knowns”, the elements you are aware of and already understand.

Now suppose you notice that an entire system has slowed down, but all the factors known to cause slowdowns look fine.

You’re in a situation that you don’t understand and weren’t aware of in the first place. That’s an “unknown unknown”. What helps here is observing the system, learning from it, and working out how to deal with it, so that you can fix problems before they grow and affect user experience.

With observability tools, you get the complete picture of your entire data stack in real-time. You can track every event or request and get the full context to understand the impact on everything from infrastructure to business applications.

For instance, here are some questions you can answer using a proper observability tool:

Why is something broken? What went wrong?
At what point in time did the element start malfunctioning? Why?
Which services depend on the broken element? How are they related?
How does the broken element affect the overall user experience for your customers? Who’s getting impacted the most?
Which metrics can help you spot such issues right away?

These are just a few, but they demonstrate the value of observability in managing the overall performance of a system.
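
As one illustration, the question “which services depend on the broken element?” can be answered by walking a dependency graph. The sketch below assumes a hypothetical `DEPENDS_ON` map with made-up service names and no circular dependencies. An observability tool would typically derive this graph from telemetry rather than from a hand-written dictionary.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical map: each service lists the services it calls directly.
DEPENDS_ON: Dict[str, List[str]] = {
    "dashboard": ["reporting-api"],
    "reporting-api": ["warehouse"],
    "billing": ["warehouse", "payments"],
    "warehouse": [],
    "payments": [],
}


def impacted_by(broken: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every service that depends on `broken`, directly or indirectly."""
    # Invert the map so we can look up "who calls this service?".
    dependents: Dict[str, List[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk from the broken service to everything upstream of it.
    seen: Set[str] = set()
    queue = deque([broken])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen


print(sorted(impacted_by("warehouse", DEPENDS_ON)))
```

In this made-up graph, a failure in `warehouse` reaches `dashboard` even though the two never talk to each other directly.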

Next, let’s understand the data points that observability tools take into account.

The three pillars of observability: Logs, metrics, and traces #

Data observability helps you visualize the internal system activities with the help of external data outputs or telemetry, also known as the three pillars of observability. These include:

Metrics (what)
Logs (why)
Traces (where)

Let’s understand each pillar in-depth.

1. Metrics: What went wrong? #

Metrics are numeric counts or values recorded over time. They give you data on how a system is behaving, and keeping an eye on the right metrics can alert you whenever something goes wrong.

Metrics can be further divided into the following types:

Gauge metrics: They record a value at a point in time. Examples include CPU use sampled every 10 minutes, fan speed, or the time spent executing a process.
Delta metrics: They compute the difference between the previous and current measurements. An example is the change in network throughput since the last recorded reading.
Cumulative metrics: They track running totals of counters over a certain duration. Examples include the number of system bugs, the number of emails sent, or the number of successful or failed API calls.
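
The three metric types can be sketched in a few lines of Python. The class names `Gauge`, `Counter`, and `Delta` below are illustrative and do not come from any particular metrics library.

```python
from typing import Optional


class Gauge:
    """Gauge metric: holds the most recently sampled value."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Counter:
    """Cumulative metric: a running total that only ever increases."""

    def __init__(self) -> None:
        self.total = 0

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a cumulative counter cannot decrease")
        self.total += amount


class Delta:
    """Delta metric: the change since the previous reading."""

    def __init__(self) -> None:
        self._previous: Optional[float] = None

    def observe(self, reading: float) -> float:
        # The first reading has nothing to compare against, so report 0.0.
        change = 0.0 if self._previous is None else reading - self._previous
        self._previous = reading
        return change


cpu = Gauge()
cpu.set(41.5)
cpu.set(63.0)  # a gauge keeps only the latest sample

emails_sent = Counter()
emails_sent.inc()
emails_sent.inc(3)  # a counter accumulates

bytes_transferred = Delta()
changes = [bytes_transferred.observe(r) for r in (1000.0, 1500.0, 1800.0)]

print(cpu.value, emails_sent.total, changes)
```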

2. Logs: Why did it go wrong? #

Logs capture and store event-related data. So, you can investigate why something went wrong using logs as they contain information on:

How and when a process begins
The problems it experiences
How and when it ends

Log entries are timestamped and are generally treated as immutable, append-only records. So, they help you investigate an unpredictable event (the unknown unknown) and theorize what could have happened.

However, while logs are easy to generate, they’re difficult to interpret. That’s why you need more information or context.
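
One common way to make logs easier to interpret is to emit them as structured records with explicit context fields. The sketch below uses only Python’s standard library. The `log_event` helper, the process name `nightly_load`, and the field names are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone


def log_event(process: str, event: str, **context) -> str:
    """Build one timestamped, JSON-encoded log line with optional context."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "process": process,
        "event": event,
        **context,
    }
    return json.dumps(record)


# A hypothetical process: how and when it begins, what goes wrong, how it ends.
lines = [
    log_event("nightly_load", "started", source="orders"),
    log_event("nightly_load", "error", message="connection reset", attempt=2),
    log_event("nightly_load", "finished", status="failed"),
]

# Because every line is structured, filtering for problems is a simple lookup.
records = [json.loads(line) for line in lines]
errors = [r for r in records if r["event"] == "error"]
print(errors[0]["message"])
```

Free-text log lines would need pattern matching to answer the same question, which is part of why raw logs are hard to interpret at scale.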

3. Traces: Where did it go wrong? #

A trace shows the execution flow across connected services. It records the path of an event (a request, transaction, or operation) as it travels across a distributed environment.

Traces provide further context for other telemetry — logs and metrics — by indicating where something went wrong. Observability tracing can show you how events are interdependent and which element is causing a problem or a bottleneck.
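
A trace is often modeled as a tree of spans, where each span records one step of the request and points to its parent. The Python sketch below uses a hypothetical `Span` type and made-up service names. It assumes that child spans run one after another inside their parent, so a span’s own time is its duration minus the durations of its direct children.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    duration_ms: float


def self_times(spans: List[Span]) -> Dict[str, float]:
    """Time spent in each span itself, excluding its direct children."""
    result = {span.span_id: span.duration_ms for span in spans}
    for span in spans:
        if span.parent_id is not None:
            result[span.parent_id] -= span.duration_ms
    return result


def bottleneck(spans: List[Span]) -> str:
    """Return the service whose span has the largest self time."""
    times = self_times(spans)
    slowest_id = max(times, key=times.get)
    return next(span.service for span in spans if span.span_id == slowest_id)


# One hypothetical request as it travels through four services.
TRACE = [
    Span("a", None, "api-gateway", 480.0),
    Span("b", "a", "orders", 450.0),
    Span("c", "b", "inventory", 40.0),
    Span("d", "b", "payments-db", 380.0),
]

print(bottleneck(TRACE))
```

The gateway has the longest total duration, but almost all of that time is spent waiting on downstream calls. Looking at self time points to where the delay actually occurs.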

Here’s how software engineer Tyler Treat explains the relationship between these three pillars:

Everything is really just events, of which we want a different lens to view. Data, such as logs and metrics, provides context for the event itself. Data, such as traces, describes relationships between events.

Logs, metrics, and traces: Putting it all together #

Tracking all three outputs and observing their interrelationships helps you spot problems, quickly fix them, and set up new metrics or benchmarks to deal with similar issues in the future.

Here’s how James Burns puts it:

Let’s say there is a sudden regression in the performance of a particular backend service deep in your stack.
It turns out that the underlying issue was that one of your many customers changed their traffic pattern and started sending significantly more complex requests.
This would be obvious within seconds after looking at aggregate trace statistics, though it would have taken days just looking at logs, metrics, or even individual traces on their own.
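
The kind of aggregation described in that scenario can be approximated by grouping trace durations by an attribute and comparing the groups. The sketch below uses made-up customer names and latencies. The `mean_latency_by` helper is hypothetical and is not a Lightstep API.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def mean_latency_by(traces: List[dict], key: str) -> Dict[str, float]:
    """Average trace duration for each distinct value of `key`."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for trace in traces:
        groups[trace[key]].append(trace["duration_ms"])
    return {value: mean(durations) for value, durations in groups.items()}


# Made-up trace summaries, each tagged with the customer that sent the request.
TRACES = [
    {"customer": "acme", "duration_ms": 120.0},
    {"customer": "acme", "duration_ms": 130.0},
    {"customer": "globex", "duration_ms": 900.0},
    {"customer": "globex", "duration_ms": 1100.0},
    {"customer": "initech", "duration_ms": 110.0},
]

by_customer = mean_latency_by(TRACES, "customer")
slowest_customer = max(by_customer, key=by_customer.get)
print(slowest_customer, by_customer[slowest_customer])
```

A single slow trace from this data set would not reveal much on its own. The pattern only becomes visible once traces are grouped by the attribute that changed.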

With the concepts out of the way, let’s compare observability vs. monitoring.

Observability vs. monitoring: Exploring the relationship #

The observability vs. monitoring debate is critical in understanding the performance of a distributed architecture.

Monitoring indicates a system’s failure, whereas observability assists in investigating the reason for that failure. As we’ve mentioned earlier, observability evolved from monitoring, picking up where monitoring left off.

So, before exploring the differences, let’s understand how they’re connected.

Is data monitoring a subset of data observability? #

Yes. You can only monitor what’s observable. Monitoring aggregates data on a system’s performance and tells you that something is wrong (the what).

Data observability helps you diagnose the root cause behind a system failure (the why).

Together, monitoring and observability provide real-time visibility of system processes, report incidents, and oversee the application infrastructure. DataOps teams need both practices to set up and maintain a high-performing cloud-based tech stack.

Observability vs. monitoring: Key differences #

Aspect	Observability	Monitoring
Definition	Observability is an approach for understanding an application’s internal state by analyzing the logs, metrics, and traces it generates.	Monitoring is tooling that aggregates logs and metrics for watching and understanding an application’s performance.
Purpose	It provides context on performance bottlenecks or system failure.	It provides visibility into the elements causing bottlenecks or errors.
Application	It helps DataOps teams ask questions and get valuable system insights.	It assists DataOps teams in visualizing system performance via dashboards.
Key difference	It works in real-time and proactively processes information to help you understand distributed systems and find the root cause of problems.	It consumes information passively to show you the status of your tech stack and alert you about expected anomalies.
Tooling	Observability tools are new and will continue to evolve as the use cases grow.	Monitoring tools have proven use cases and the market is well-established.

Build robust applications with data observability and monitoring #

For developing highly configurable distributed applications, organizations need to make informed decisions. Relying on traditional monitoring approaches that process information passively and generate reports isn’t enough.

Managing distributed infrastructure means adopting real-time data observability solutions that can offer further context on a performance issue by monitoring metrics, logs, and traces and finding patterns in them.

Such platforms can help you monitor, track, and triage incidents to prevent downtime and boost overall performance.

To know more about data observability for the modern data stack, check out this article on the future of the modern data stack in 2023.
