OpenMetadata vs. OpenLineage: Primary Capabilities, Architecture & More
OpenMetadata is an open-source data discovery, profiling, and lineage tool built for the modern data stack by the engineers who worked on Databook. It was built to address metadata silos and the lack of metadata standards, two problems the team observed while working on metadata at Uber.
OpenLineage is an open standard and API for producing and consuming lineage metadata. It was created by the same team that built Marquez at WeWork, which is why Marquez became the first reference implementation of the OpenLineage standard. The goal of the standard is to let the various components of the modern data stack produce and consume lineage in a common format.
In this article, you will learn how these two tools compare across several areas and how they fit into your implementation of the modern data stack.
Table of contents
- OpenMetadata vs. OpenLineage: Factors for comparison
- Core capabilities
- Technical architecture
- Data lineage
- Data integration
- Summary
- Additional resources
OpenMetadata vs. OpenLineage: Factors for comparison
Although these two tools solve quite different problems, they overlap significantly in metadata integration and data lineage. For that reason, we will not do a feature-by-feature comparison. We will instead talk about the overlapping features, their implementations, product maturity, and some of the use cases where using these tools makes sense. We’ll discuss these points under the following headings:
- Core capabilities
- Technical architecture
- Data lineage
- Data integration
While we’ll talk mostly about data lineage, we’ll also mention the secondary features that each tool offers. With that in mind, let’s start by understanding the core capabilities of both OpenMetadata and OpenLineage.
OpenMetadata vs. OpenLineage: Core capabilities
OpenMetadata is a general-purpose data catalog with features like data asset search and discovery, sharing and collaboration, tagging and classification, monitoring, and alerting. OpenMetadata also has data quality, profiling, and lineage capabilities.
OpenMetadata compares well to other tools in the open-source data catalog ecosystem, such as Amundsen, Apache Atlas, and DataHub. With the features above, OpenMetadata covers a wide range of use cases in a modern data stack, including discovery, observability, security, and governance.
Tools like OpenMetadata and Amundsen do solve the data lineage problem, but each one locks you into its own custom metadata model. With OpenMetadata, for example, your lineage metadata has to conform to, and be consumed through, the model that OpenMetadata defines.
OpenLineage addresses that problem with an open standard and API for collecting and processing data lineage from various data sources. It was designed to make the components of the modern data stack more interoperable when they share and integrate lineage metadata.
OpenMetadata vs. OpenLineage: Technical architecture
The core architecture of OpenMetadata is similar to that of most open-source data catalogs: it connects to various data stores, extracts metadata and stores it in a database, makes all of that metadata searchable, and exposes features that help you navigate your data assets.
The connect-and-extract functionality in OpenMetadata is based on an HTTP API that feeds the data into a MySQL database, which serves as the primary entity store.
The same API layer also feeds Elasticsearch, which powers search and discovery. OpenMetadata defines a standard contract in JSON Schema so that the core API, the metadata serving layer, and the ingestion framework all communicate using the same structures.
OpenLineage is an API that integrates with different data sources to collect lineage metadata in a standardized form, so that other tools can store and retrieve it across platforms.
The OpenLineage specification has a standard data model to which all collected lineage metadata conforms. The model has datasets, jobs, and runs as its standard entities, and facets as user-defined metadata for enrichment purposes.
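To make the data model concrete, here is a minimal sketch of a lineage event built as a plain Python dictionary. The top-level field names (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`, `schemaURL`) follow the OpenLineage run event structure; everything else, including the namespaces, dataset names, the `exampleTeamFacet` facet, and both URLs, is a hypothetical placeholder chosen for illustration.

```python
import json
import uuid
from datetime import datetime, timezone

# Placeholder values (assumptions, not real endpoints or schema versions).
PRODUCER = "https://example.com/my-pipeline"
SCHEMA_URL = "https://example.com/openlineage-schema-placeholder"


def dataset(namespace, name, facets=None):
    """A dataset is identified by a namespace and a name, plus optional facets."""
    return {"namespace": namespace, "name": name, "facets": facets or {}}


def run_event(event_type, job_name, inputs, outputs, run_facets=None):
    """Assemble one event that ties a run of a job to its input and output datasets."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
        "run": {"runId": str(uuid.uuid4()), "facets": run_facets or {}},
        "job": {"namespace": "example-namespace", "name": job_name, "facets": {}},
        "inputs": inputs,
        "outputs": outputs,
    }


event = run_event(
    "COMPLETE",
    "daily_orders_rollup",
    inputs=[dataset("warehouse", "raw.orders")],
    outputs=[dataset("warehouse", "analytics.daily_orders")],
    # Hypothetical custom facet: extra metadata attached to the run.
    run_facets={
        "exampleTeamFacet": {
            "_producer": PRODUCER,
            "_schemaURL": SCHEMA_URL,
            "owner": "data-platform",
        }
    },
)

print(json.dumps(event, indent=2))
```

The point of the sketch is the shape rather than the values: a run belongs to a job, the job reads and writes datasets, and facets hang off any of the three to carry additional context.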
OpenMetadata vs. OpenLineage: Data lineage
OpenMetadata works with the table, pipeline, and dashboard constructs and allows you to trace their lineage at both the entity level and the column level. It also integrates with workflow engines like Airflow and data transformation tools like dbt to extract dependencies and lineage metadata.
Moreover, OpenMetadata lets you manually edit your data lineage graph using a drag-and-drop interface. This is useful because automated extraction does not always produce an accurate picture of the actual lineage, for example when you cannot integrate all of your data sources or when a specific type of lineage metadata cannot be fetched.
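Tracing lineage at the column level and then rolling it up to the entity level amounts to a graph traversal. The following is a minimal sketch of that idea in plain Python; it does not use OpenMetadata's API or data model, and the `UPSTREAM` mapping and all column names in it are hypothetical.

```python
from collections import deque

# Hypothetical column-level edges: each column maps to the columns it is derived from.
UPSTREAM = {
    "dashboard.revenue.total": ["analytics.daily_orders.amount"],
    "analytics.daily_orders.amount": ["raw.orders.price", "raw.orders.quantity"],
    "analytics.daily_orders.day": ["raw.orders.created_at"],
}


def trace_upstream(column, edges):
    """Breadth-first walk that returns every upstream column of `column`."""
    found = []
    visited = {column}
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if parent not in visited:
                visited.add(parent)
                found.append(parent)
                queue.append(parent)
    return found


def table_of(column):
    """Drop the last dotted segment to get the owning table or dashboard."""
    return column.rsplit(".", 1)[0]


columns = trace_upstream("dashboard.revenue.total", UPSTREAM)
tables = sorted({table_of(c) for c in columns})

print(columns)
print(tables)
```

A manual edit in a lineage UI corresponds to adding or removing an entry in a mapping like `UPSTREAM`; the traversal itself stays the same.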
OpenMetadata, like many other open-source data catalogs, implements its own data model to store lineage metadata. OpenLineage, by contrast, defines its model with the intent of being an open standard for collecting and storing lineage across data sources and platforms.
OpenLineage’s core data model for storing data lineage consists of jobs, runs, and datasets. All these constructs have facets attached to them, which are user-defined chunks of metadata to enrich the lineage metadata extracted from the source. Like OpenMetadata, OpenLineage also allows you to manually edit and annotate lineage for some connectors like Airflow.
The key idea with OpenLineage is to conform data lineage into this standard model across various data sources in the modern data stack like dbt, Snowflake, Dagster, Airflow, Spark, Flink, and more. With that in mind, let’s discuss the integration capabilities of both these tools in the next section.
OpenMetadata vs. OpenLineage: Data integration
From a data lineage standpoint, OpenMetadata can connect to seven popular data sources, including BigQuery, Snowflake, Redshift, and Databricks. If you want to get lineage metadata from any other database, you can use query logs, which will feed the query text into OpenMetadata’s SQL parser and extract dependencies and lineage metadata.
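To illustrate what extracting lineage from query text means, here is a deliberately simplified sketch. It is not OpenMetadata's SQL parser: it uses two regular expressions as a toy stand-in, handles only simple `INSERT INTO ... SELECT` and `CREATE TABLE ... AS SELECT` statements, and would miss subqueries, CTEs, and quoted identifiers. The function name `extract_lineage` and the sample query are hypothetical.

```python
import re

# Toy patterns (assumption): the write target follows INSERT INTO / CREATE TABLE,
# and every read source follows FROM or JOIN.
TARGET_RE = re.compile(r"\b(?:insert\s+into|create\s+table)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(query):
    """Return the table a query writes to and the tables it reads from."""
    target = TARGET_RE.search(query)
    sources = sorted({name.lower() for name in SOURCE_RE.findall(query)})
    return {
        "target": target.group(1).lower() if target else None,
        "sources": sources,
    }


query = """
INSERT INTO analytics.daily_orders
SELECT o.created_at, SUM(o.price)
FROM raw.orders o
JOIN raw.customers c ON c.id = o.customer_id
GROUP BY o.created_at
"""

lineage = extract_lineage(query)
print(lineage)
```

Each source-to-target pair found this way becomes one table-level lineage edge. A real SQL parser does the same job on a full syntax tree, which is what makes column-level lineage possible.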
As mentioned earlier, OpenMetadata does a lot more than just data lineage. To power its core functionality, it also extracts metadata from other systems in your data stack: data quality and profiling systems; orchestration and pipeline tools like Airflow, Dagster, Glue, and Airbyte; ML model services like MLFlow and SageMaker; BI tools like PowerBI, Redash, Superset, and Metabase; and other data cataloging tools like Amundsen and Apache Atlas.
OpenLineage, on the other hand, integrates with data stack components that generate or consume lineage metadata. This means that OpenLineage can be used by data lineage producers, such as Great Expectations, Airflow, dbt, Spark, and Egeria, and by data lineage consumers, such as Amundsen, Marquez, and Atlan.
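The consumer side can be sketched in the same spirit. The snippet below assumes events shaped like OpenLineage run events (with `eventType`, `job`, `inputs`, and `outputs` fields) and folds them into dataset-to-dataset edges. It is a minimal illustration, not how Marquez or any other consumer is implemented; the sample events and the function names are hypothetical.

```python
def dataset_key(ds):
    """Identify a dataset by namespace and name."""
    return f'{ds["namespace"]}/{ds["name"]}'


def edges_from_events(events):
    """Turn completed run events into (source, job, target) lineage edges."""
    edges = set()
    for event in events:
        # Only completed runs are treated as confirmed lineage in this sketch.
        if event.get("eventType") != "COMPLETE":
            continue
        job = event["job"]["name"]
        for src in event.get("inputs", []):
            for dst in event.get("outputs", []):
                edges.add((dataset_key(src), job, dataset_key(dst)))
    return sorted(edges)


# Hypothetical events, as two different producers might emit them.
events = [
    {
        "eventType": "START",
        "job": {"namespace": "airflow", "name": "load_orders"},
        "inputs": [{"namespace": "s3", "name": "landing/orders"}],
        "outputs": [{"namespace": "warehouse", "name": "raw.orders"}],
    },
    {
        "eventType": "COMPLETE",
        "job": {"namespace": "airflow", "name": "load_orders"},
        "inputs": [{"namespace": "s3", "name": "landing/orders"}],
        "outputs": [{"namespace": "warehouse", "name": "raw.orders"}],
    },
    {
        "eventType": "COMPLETE",
        "job": {"namespace": "dbt", "name": "daily_orders"},
        "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],
        "outputs": [{"namespace": "warehouse", "name": "analytics.daily_orders"}],
    },
]

edges = edges_from_events(events)
for source, job, target in edges:
    print(f"{source} --[{job}]--> {target}")
```

Because both producers describe their work in the same format, the consumer needs no tool-specific logic to stitch the Airflow load and the dbt model into one lineage chain.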
Summary
This article took you through the primary capabilities and features of both OpenMetadata and OpenLineage. Here’s a summary of some of the points that we discussed, along with some of the features that can highlight the differences between the two tools in question:
Feature | OpenMetadata | OpenLineage |
---|---|---|
Data catalog | Yes, OpenMetadata is primarily a data catalog with a data lineage feature. | No, OpenLineage does not have data cataloging capabilities. |
Data governance | Yes, OpenMetadata has RBAC, tagging, and classification features, among others, to enable data governance. | No, but data lineage can help with ownership propagation across platforms. |
Data lineage | Yes, OpenMetadata supports table and column-level lineage for many sources. | Yes, OpenLineage supports granular table and column-level lineage across lineage producers. |
Data quality | Yes, OpenMetadata connects to your data quality and profiling tools, such as Great Expectations. | Yes, OpenLineage also supports connections with data quality tools, such as Great Expectations. |
Integrations | OpenMetadata connects with over 40 connectors spread across multiple types of systems, such as databases, data lakes, BI tools, workflow engines, messaging services, and other data catalogs. | OpenLineage connects with producers and consumers of data lineage that support the OpenLineage standard. |
Additional resources
OpenMetadata
Slack | GitHub | Documentation | Sandbox | Medium | Roadmap | Swagger
OpenLineage
Slack | GitHub | Mastodon | Blog | OpenAPI Documentation | YouTube