Open Source Data Observability: Tools and Insights for 2024

Updated October 25th, 2024


Open source data observability tools allow organizations to monitor, understand, and manage data pipelines. These tools enable transparency, detect anomalies, and optimize data reliability across systems without the cost of proprietary solutions.

Here are 7 popular open-source data observability tools:

  1. Prometheus
  2. Loki
  3. OpenTelemetry
  4. Fluentd
  5. Opstrace
  6. Vector
  7. SigNoz


Table of contents #

  1. Popular open-source data observability tools
  2. Prometheus
  3. Loki
  4. OpenTelemetry
  5. Fluentd
  6. Opstrace
  7. Vector
  8. SigNoz
  9. Conclusion
  10. FAQs on Open Source Data Observability
  11. Related deep dives on popular data tools

Data observability tools bring visibility into data systems. Using these tools, you can monitor and debug the operations and performance of your databases, data warehouses, data lakes, and data pipelines.

A single open-source data observability tool usually doesn’t have all the features required to enable complete visibility into your data systems. Some tools are helpful for log and metric collection, while others specialize in log and event tracing; some are good at visualization, while others are good at efficiently storing event and metrics data. You usually have to use several of these tools in concert to get that visibility.

According to Grafana Labs’ latest survey, around 76% of observability teams use OpenTelemetry, often alongside Prometheus, making them the most popular open-source observability frameworks. Notably, 85% of OpenTelemetry users also integrate Prometheus, highlighting a growing preference for combining application and infrastructure monitoring tools to address complex operational challenges.

This article will walk you through some of the significant event tracing and collection tools that you can integrate with your data stack. Let’s understand each of their capabilities in detail.


Prometheus #

Prometheus Overview #


In 2012, SoundCloud built Prometheus, a metrics collection tool with a built-in time-series database. In 2016, SoundCloud handed it over to the Cloud Native Computing Foundation for further maintenance.

Prometheus uses pull-based data collection by default. However, applications can also push data into Prometheus through the Pushgateway. Using labels, Prometheus supports multi-dimensional data collection, storing metrics as key-value pairs in its time-series database.
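The pull model and labels come together in the server's scrape configuration. The fragment below is an illustrative `prometheus.yml` snippet, not a default; the job name, target, and labels are hypothetical:

```yaml
# Illustrative scrape config: Prometheus pulls /metrics from each target
# every 30s and attaches the static labels to every collected series.
scrape_configs:
  - job_name: "airflow"          # hypothetical data-pipeline job
    scrape_interval: 30s
    static_configs:
      - targets: ["airflow-host:9112"]
        labels:
          env: "prod"
          team: "data-platform"
```

Labels such as `env` and `team` are what make the collected data multi-dimensional: queries can later slice and aggregate series along those dimensions.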

Prometheus integrates with many popular databases, message brokers, HTTP servers, APIs, log aggregators, etc. With Prometheus, basic visualization is available with the Expression browser integrated with Prometheus, but you might find tools like Grafana more helpful for any advanced use cases.

Prometheus Features #


By default, the Prometheus server stores the data on your local disk. However, you can also opt for external storage systems. The storage system has crash safety built in: Prometheus stores the data in two-hour blocks backed by a write-ahead log (WAL), similar to PostgreSQL. It keeps the WAL in smaller 128MB segments, which it can replay in case of a crash.

Although Prometheus doesn’t support distributed storage or replication, you can make your storage highly reliable and available by taking timely snapshots and backups. If you’re unsure how your storage requirements will change over time, you can opt for a cloud-based managed deployment of Prometheus. All the major cloud platforms, including AWS, Azure, and Google Cloud, offer a managed service for Prometheus.

Prometheus has a few official client libraries for Go, Java, Scala, Python, Ruby, and Rust. Many unofficial client libraries are available for integrating with any other programming language.
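In practice, these client libraries expose an HTTP endpoint that Prometheus scrapes. The standard-library sketch below shows the core idea behind that pull model, hand-rendering the text exposition format rather than using the official client; the metric name and label are illustrative:

```python
# Minimal sketch of Prometheus's pull model, using only the standard library:
# an HTTP endpoint serving metrics in the text exposition format, plus a
# "scrape" of it. Real applications would use the official client library.
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request

ROWS_PROCESSED = 1280  # pretend counter, updated by your pipeline

def render_metrics() -> str:
    # Prometheus text exposition format: HELP/TYPE comments plus samples.
    return (
        "# HELP pipeline_rows_processed_total Rows processed by the pipeline.\n"
        "# TYPE pipeline_rows_processed_total counter\n"
        f'pipeline_rows_processed_total{{table="orders"}} {ROWS_PROCESSED}\n'
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # ephemeral port
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Simulate one Prometheus scrape of the endpoint.
scraped = urllib.request.urlopen(f"http://127.0.0.1:{port}/metrics").read().decode()
print(scraped)
server.shutdown()
```

A real Prometheus server would hit this same endpoint on every scrape interval and store each sample, with its labels, in the time-series database.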

Prometheus Resources #


Documentation | Slack | Blog | GitHub


Loki #

Loki Overview #


Grafana markets Loki as “Prometheus, but for logs.” One of Loki’s major features is its ability to order logs by time to enable faster search. The search feature is helpful for visualization, log aggregation, and tracing. Teams at Apple, Etsy, MathWorks, Datavisor, and others use Loki for log aggregation.

To ensure excellent search performance, Loki only indexes log metadata. It doesn’t create full-text indexes for the data. The log data is stored in compressed chunks locally on disk or remote storage systems like Amazon S3 or Google Cloud Storage. Loki accesses the compressed data based on your search queries through the indexes.
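This metadata-only indexing shapes how Loki is queried: a LogQL query first selects log streams by their indexed labels, then filters the compressed chunks. The query below is a hypothetical example (the `job` and `env` labels are assumptions):

```
{job="spark", env="prod"} |= "OutOfMemoryError"
```

The label matcher in braces uses the index to narrow down streams cheaply; the `|=` line filter then greps only the matching chunks, which is why Loki can skip full-text indexing entirely.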

Loki Features #


One key difference between Loki and Prometheus is that Loki focuses on pushing the logs instead of pulling metrics from different servers. Similar to Prometheus, Loki also offers native support for Grafana for seamless integration.

Data engineers and developers need logs stored for extended periods so they can debug issues even when bugs are caught late. Some data engineering workloads, such as complex Spark jobs, generate quite verbose logs. Pairing Loki with a cloud storage account gives access to very cheap long-term storage, such as S3 Glacier.

Loki’s service discovery mechanism is the same as Prometheus’s. Promtail, the agent that ships logs from your local servers to a Loki instance, is responsible for discovering targets, attaching labels to log streams, and pushing them to Loki.
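A Promtail configuration makes this concrete. The snippet below is an illustrative sketch; the file paths, labels, and Loki URL are assumptions for a hypothetical Spark deployment:

```yaml
# Illustrative Promtail config: tail Spark logs, label the stream, push to Loki.
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml        # where Promtail records read offsets
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: spark-jobs
    static_configs:
      - targets: [localhost]
        labels:
          job: spark
          env: prod
          __path__: /var/log/spark/*.log   # files to tail
```

The `scrape_configs` block intentionally mirrors Prometheus’s configuration format, which is what makes the two tools feel like one system.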

For working with logs, Loki provides a compelling use case in the PLG (Promtail, Loki, Grafana) stack, which you can use in place of the widely used ELK (Elastic, Logstash, Kibana) stack.

Loki Resources #


Documentation | Discussion Board | GitHub


OpenTelemetry #

OpenTelemetry Overview #


OpenTelemetry is the product of a merger of two major open-source projects. Senior engineers from Google, Lightstep, Microsoft, Uber, and others took the initiative to create a single tool that brings together the distributed tracing and metrics features of OpenCensus and OpenTracing.

Getting started with OpenTelemetry is quick and easy. It provides automatic instrumentation packages that allow data engineers and developers to embed instrumentation into their applications.

OpenTelemetry Features #


OpenTelemetry commits to a vendor-independent implementation of data collection, processing, and publishing. This frees you from supporting and maintaining multiple observability data formats, such as those of Jaeger and Prometheus.

Data engineering systems often use many languages and frameworks, so managing telemetry data coming from all components is often a challenge. OpenTelemetry reduces application overhead as it enables you to use a standard specification for transporting telemetry data.

These qualities of OpenTelemetry make it an effective tool to have in your stack. For these reasons, SigNoz, the open-source APM, bases its telemetry and instrumentation data collection on OpenTelemetry. More about this later in the article.

OpenTelemetry Resources #


Documentation | Blog | Slack | GitHub


Fluentd #

Fluentd Overview #


Fluentd is an open-source data collector meant for unified data collection and consumption. It was created by Treasure Data back in 2011 after one of its founders realized that a good chunk of time-series data was unused because there were no sound systems to collect it.

Fluentd helps solve this problem by unifying all logs in a common, machine-readable interchange format: JSON. With a consistent format across the system, Fluentd lets you decouple data sources from your backend systems, reducing interdependencies.

Thousands of companies, including Microsoft, Atlassian, AWS, Nintendo, Line, and Twilio, use Fluentd to solve their data observability problems. According to the documentation, Fluentd’s largest known deployment collects logs from more than fifty thousand servers.

Fluentd Features #


Fluentd follows a hub-and-spoke model, where it acts as the hub and the data sources as the spokes. You can ingest data into Fluentd from access logs, application logs, system logs, and databases. Using various tools, you can filter and route your observability events to target systems that would help you raise alerts, do analysis, and archive your data for further usage.
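A minimal Fluentd configuration illustrates the hub-and-spoke routing; the file paths and tag below are hypothetical:

```
# Illustrative Fluentd config: tail a JSON application log (spoke),
# tag the events, and route matching tags to a destination (hub -> sink).
<source>
  @type tail
  path /var/log/app/app.log
  pos_file /var/log/fluentd/app.log.pos
  tag app.access
  <parse>
    @type json
  </parse>
</source>

<match app.**>
  @type stdout          # swap for s3, elasticsearch, kafka, etc.
</match>
```

Each `<source>` is a spoke feeding events into Fluentd; `<match>` blocks route them by tag to whichever backend raises alerts, powers analysis, or archives the data.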

Fluentd Resources #


Documentation | Blog | Google Group | GitHub


Opstrace #

Opstrace Overview #


Unlike the other tools on our list, Opstrace is a full-fledged, cloud-based observability platform that you can install in your Google Cloud or AWS account. Opstrace builds on top of three other open-source tools: Prometheus, Loki, and Cortex.

In December 2021, GitLab acquired Opstrace to integrate into their DevOps platform. This acquisition will also come with some changes to Opstrace’s architecture. They are now developing another logging backend on top of ClickHouse. GitLab will remove the Loki backend for logging as Loki’s license doesn’t comply with their license model.

Opstrace Features #


Opstrace makes it very easy to get started. You can get an Opstrace cluster up and running in minutes. It gets even easier if you already have Prometheus set up: to send data from Prometheus to your Opstrace cluster, you only need to add a few lines to your Prometheus configuration file. This is especially useful for reducing the load on your Prometheus instance.

Scaling observability can be a pain. Products like Thanos and Cortex have tried solving the issues with scaling Prometheus. Opstrace chose Cortex to provide high scalability, high availability, multi-tenancy, and long-term storage.

Opstrace automatically generates TLS certificates and authenticates API endpoints on both read and write paths, securing data coming into and going out of the cluster. This saves you quite a lot of time when you want to go to production.

Opstrace Resources #


Documentation | GitLab | Slack | Roadmap


Vector #

Vector Overview #


Timber.io launched Vector in mid-2020 as a tool to help you build data pipelines for your logs and metrics. Vector enables you to collect, process, and deliver your data application’s spans, traces, logs, and metrics to a tool of your choice.

Several companies, including Atlassian, Fastly, Zendesk, Discord, and Comcast, use Vector to make their observability platforms more reliable, robust, and cost-effective. With Vector’s adoption rising quickly after launch, Datadog acquired it in February 2021.

Vector Features #


Vector takes a DAG-based approach to move and transform data from one stage to another. Every DAG contains three kinds of components: sources, transforms, and sinks. Vector lets you define your DAG in a TOML, YAML, or JSON file and uses Tokio to schedule its execution.

Vector defines any incoming metric or logs data as events. These events are then cleaned, enriched, and transformed. After the transformations, you can push these events into one or more sinks of your choice. You can send logs to AWS S3 for debugging and archival purposes. You can also feed these transformed events to Kafka, Loki, NewRelic, etc.

Vector uses a customized language for transforming observability data. Vector Remap Language (VRL) helps you to transform observability events and specify conditions for filtering and routing events further.
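The TOML fragment below sketches a small source-transform-sink DAG with a VRL remap step; the file paths, field names, and added `env` attribute are illustrative assumptions:

```toml
# Illustrative Vector config: file source -> VRL remap -> console sink.
[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]

[transforms.parse]
type = "remap"                  # transform written in VRL
inputs = ["app_logs"]
source = '''
. = parse_json!(string!(.message))  # parse the raw line as JSON
.env = "prod"                       # enrich every event
'''

[sinks.debug]
type = "console"                # swap for aws_s3, loki, kafka, etc.
inputs = ["parse"]
encoding.codec = "json"
```

Each `inputs` list is an edge in the DAG, so fan-out is just a matter of listing the same transform as an input to multiple sinks.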

Vector Resources #


GitHub | Releases | Documentation | Discord


SigNoz #

SigNoz Overview #


SigNoz aims to be an alternative to a full-scale application performance monitoring tool. It uses distributed tracing to enable you to create application and infrastructure metrics. It also gives you a detailed view of the traces to help you with the root cause analysis of any problem you might face.

SigNoz aims to provide a fully integrated view of your data application’s metrics and telemetry data in a single UI. This enables you to filter and aggregate over traces, as well as over other types of incoming logs.

SigNoz Features #


SigNoz builds on top of OpenTelemetry. It stores the collected instrumentation data in a columnar format in ClickHouse and exposes it through a ReactJS and TypeScript frontend. For long-term storage and analytics, you can also use Amazon S3 as a cold-storage tier behind ClickHouse.

Most data engineering applications use a variety of languages and frameworks, such as Python, NodeJS, Java, or Scala. Because it relies on OpenTelemetry, SigNoz makes it easy for you to add manual or automatic instrumentation to your application.
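As a hedged sketch of what automatic instrumentation can look like for a Python application, the commands below use the standard OpenTelemetry auto-instrumentation tooling and assume a SigNoz collector listening on the typical local OTLP port; the service name and `app.py` are placeholders:

```shell
# Install the OpenTelemetry distro and detect instrumentation for installed libs.
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run the app with auto-instrumentation, exporting OTLP data to SigNoz.
OTEL_RESOURCE_ATTRIBUTES=service.name=my-pipeline \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
opentelemetry-instrument python app.py
```

No application code changes are needed; the `opentelemetry-instrument` wrapper patches supported libraries at startup, which is what makes SigNoz quick to adopt.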

Another interesting feature that SigNoz is working on right now is an anomaly detection framework, largely inspired by LinkedIn’s ThirdEye. This framework will give you dynamic thresholds and better response capabilities for alerting.

SigNoz Resources #


Documentation | Roadmap | Slack | GitHub


Other Tools #

The data observability space is expanding and evolving very fast, and not every good tool can be featured in one article. We mention a few more here because some of them are essential for a good data observability setup. Jaeger and Zipkin are two of the most important distributed tracing tools, and both work seamlessly with OpenTelemetry.

Although Opstrace chose Cortex for its integration, Thanos is an equally capable open-source tool for building a highly available Prometheus setup with long-term storage. Another such tool is Rsyslog, which you can use for log processing; think of it as an alternative to Fluentd or Logstash (of the ELK stack).


Conclusion #

Atlan enables customers to integrate with data observability tools and surface metadata from those tools along the lineage graph.

It leverages this observability metadata to create and propagate alerts along downstream impacted assets, enabling proactive data quality management.

Atlan also connects to pipeline systems, including Kafka, Airflow, Fivetran, and Matillion, to monitor for anomalous enterprise-wide fluctuations.

Book your personalized demo today to find out how Atlan can help your organization capture end-to-end, column-level lineage.


FAQs on Open Source Data Observability #

1. What is data observability, and why is it important? #


Data observability is the practice of monitoring and understanding data systems’ health, performance, and reliability. It is important because it helps identify issues in real time, ensuring data accuracy and system efficiency, which are crucial for decision-making.


2. What are some popular open-source data observability tools? #


Some popular open-source data observability tools in 2024 include Prometheus, Loki, OpenTelemetry, Fluentd, and Opstrace. These tools help in monitoring, logging, tracing, and managing data health across various systems.

3. How does Prometheus contribute to data observability? #


Prometheus is an open-source monitoring tool widely used for collecting and analyzing metrics. It provides real-time data, allowing organizations to detect anomalies and maintain performance across data pipelines.

4. What role does OpenTelemetry play in data observability? #


OpenTelemetry provides standardized libraries, APIs, and SDKs that allow organizations to instrument and collect telemetry data, making it easier to monitor distributed systems.

5. How can Loki enhance data observability? #


Loki is a logging tool that integrates well with Prometheus, providing scalable log aggregation. This enables users to identify and trace issues through logs without needing excessive indexing.




