OpenMetadata: Design Principles, Architecture, Applications & More
Share this article
What is OpenMetadata? #
OpenMetadata is an open-source metadata store that can help you enable data cataloging, discovery, and collaboration across your data ecosystem. OpenMetadata was launched in the latter half of 2021. It has had twelve minor releases, with the latest one being 0.12.0; a major release is yet to take place.
See How Atlan Streamlines Metadata Management – Start Tour
OpenMetadata was inspired by the learnings accumulated while building Uber’s metadata infrastructure, which can be thought of as the first iteration of OpenMetadata. Uber’s metadata system features in-house tools like Databook.
In their announcement blog, Suresh Srinivas, founder of OpenMetadata cited reasons why Uber’s in-house system wasn’t open-sourced itself, rather Open Metadata was built ground up. The reasons fundamentally stem from the idea to ensure that the priorities of the company and the open-source community are not in conflict in the process of evolution of such a tool.
OpenMetadata is one of the latest additions to the open-source data cataloging landscape that includes other tools like Amundsen, DataHub, Apache Atlas, and so on.
Here, we will take you through the basics of OpenMetadata in terms of the following key themes:
- Design principles and architecture choices
- Features
- Integrations supported
In the end, we’ll also supply you with further reading materials, links, and resources. Let’s dive right in.
Table of contents #
- What is OpenMetadata?
- Design principles and architecture choices that define OpenMetadata
- Applications of OpenMetadata
- Integrations supported by OpenMetadata
- OpenMetadata Resources
- Conclusion
- FAQs on OpenMetadata
- Related reads
Design principles and architecture choices that define OpenMetadata #
In this section, we’ll take a look at the following principles that guided OpenMetadata’s design and architecture:
- Unified metadata model
- Open and standardized APIs for integrations
- Metadata extensibility
- Pull-based metadata ingestion
- Graph storage for metadata
Unified metadata model #
Businesses work with a range of data sources to serve different purposes. These data sources have their architectures aligned to specific use cases; some are document-oriented, some store geolocation data, and so on. Because these data sources differ in how they store data, it is natural for them to store the underlying metadata also differently.
To enable organization-wide data discovery, data governance, and data lineage features, you need to have a unified metadata model. This will enable you to configure and maintain different integrations in a centralized fashion. With a unified metadata model, it will also be easy to expose metadata for the consumption of internal microservices and external applications. Here’s a diagram from OpenMetadata’s blog that depicts such as setup.
Open and standardized APIs for integrations #
The unified metadata model helps OpenMetadata to enable better integration with diverse data sources. That added with open APIs based on well-documented and widely-accepted schema standards helps OpenMetadata to expose the unified data model for various downstream applications, such as a data catalog, data quality engine, and so on.
You can get the Open API specification for the REST API that exposes all the metadata extracted and enriched in OpenMetadata from the Swagger specification document.
The Open APIs are backed by the same strongly-type, well-structured, and annotated schema following the JSON Schema specification. OpenMetadata also uses the same specification for defining data quality tests.
Metadata extensibility #
If there’s one thing you can be sure of in any business is that organization, processes, and priorities always change. To cater to custom requirements, the metadata model needs to be flexible enough to handle any additional data points, nodes, and other fields.
This means that the unified metadata model can be conceptually split into two parts - the base metadata model and the extended metadata model.
The base metadata model consists of all the metadata that is common across multiple data sources and the extended metadata model will take care of any data source-specific customizations. OpenMetadata, much like DataHub and many others, has been designed to be extensible.
Pull-based metadata ingestion #
Most metadata ingestion systems are pull-based, which means that the metadata extraction is the responsibility of the metadata engine, and not the data source. Some metadata catalogs, such as DataHub support both push and pull-based metadata ingestion.
OpenMetadata has taken the pull-based approach as the authors of OpenMetadata believe, “no metadata system can be purely push-based”.
The thinking behind this choice is that data sources can’t be reasonably expected to push data into a metadata aggregation system. The job of extracting and transforming metadata into a unified metadata model falls on the data cataloging tool, much like what an ETL tool does for creating data lakes and data warehouses.
Graph storage for metadata #
OpenMetadata takes the approach of storing metadata in a centralized fashion where it is “actively organized as a graph connecting data” with all teams, tools, and processes.
This enables organizations to build, maintain, and utilize a “Metadata Graph” that can be consumed by downstream applications to enable many value-adding features, such as data cataloging, data governance, data lineage, automated data quality, and testing, data profiling, data observability, and so on.
A Guide to Building a Business Case for a Data Catalog
Download free ebook
Applications of OpenMetadata #
OpenMetadata is built to support the following applications:
- Data discovery
- Data governance
- Data lineage
- Data quality
- Integrations
- Metadata versioning
Data discovery #
OpenMetadata’s data discovery features are powered by a full-text search engine that can search through not just the entity definitions, but also their descriptions, extended metadata, conversation threads, tasks, and announcements. When you are on the OpenMetadata console, you can initiate a search by using the CMD + K
shortcut, as shown in the image below:
To complement the search engine functionality, OpenMetadata offers an easy way to navigate both the technical and business metadata for your data sources. The technical metadata is captured from the data sources as is and is enriched by features like conversation threads, tasks, and announcements, as mentioned earlier.
Data governance #
Backed by the unified metadata model, OpenMetadata has implemented the following three features to enable data governance across your organization:
- Role-based access control (RBAC)
- Ownership
- Importance
A sophisticated role-based access control system with an organization-wide team hierarchy and a role-policy-rule-based access control sets a solid foundation for data governance in OpenMetadata.
Building an ownership and importance layer on top of the RBAC enhances the value OpenMetadata brings to a business. Let’s take a glimpse of OpenMetadata’s RBAC engine in action.
The following image shows the page on the UI where you can create and manage roles:
And this image shows the page on the UI where you can create and manage different policies.
Data lineage #
OpenMetadata primarily capitalizes on its query parser to collect lineage data, however, it also uses dbt and data source query logs to build and enrich data lineage.
OpenMetadata manages data lineage in the following ways:
- Automated collection of data lineage
- Manual addition of data lineage
- Editing existing data lineage
OpenMetadata captures lineage in an automated manner, triggered by tools like Airflow, Prefect, etc.
It also allows you to add lineage manually because there might be cases where the data sources might not provide reliable information about the lineage.
And finally, OpenMetadata takes it one step forward by allowing you to edit data lineage if the data lineage visualization doesn’t reflect the actual lineage between different data assets.
Here’s a quick peek into how data lineage is visualized in OpenMetadata:
Modern Data Catalogs: The Key Trends, the Data Stack, and the Humans of Data
Download ebook
Data quality #
Tackling data quality across data sources is one of the most challenging tasks in the data engineering domain today, but again, because of OpenMetadata’s unified data model, it is easy to define tests and run profiles on data assets across different data sources.
OpenMetadata allows you to group different tests together and create a test suite, as shown in the image below:
You can run a test suite on the data assets you want. The following image shows you the output for the test runs for one of the sample data assets:
OpenMetadata has tightly integrated data quality in the UI to enable data teams to make it a part of their usual workflow. This way data quality issues are always visible to the team consuming the data, which makes fixing these issues faster and easier.
Metadata versioning #
Similar to how you capture changes in data using CDC tools, OpenMetadata enables you to capture changes in the structure of data assets along with any related metadata with the help of metadata versioning. OpenMetadata’s metadata versioning follows a major.minor versioning pattern with any minor release being backward compatible and any major release being backward incompatible.
Metadata versioning is instrumental in providing valuable information to developers and data users when they’re collaborating across teams with different data sources and also when they are trying to debug an issue with the data. This enables transparency in the handling of data across the organization which results in a better overall collaboration between teams while keeping the metadata clean and up-to-date.
An overview of OpenMetadata
Integrations supported by OpenMetadata #
Most data cataloging tools now enable data extraction using a Singer-like, connector-based model.
OpenMetadata currently offers more than fifty connectors for metadata ingestion from data sources like databases, data lakes, data warehouses, business intelligence tools, message queues, data pipelines, and even other data catalogs.
As OpenMetadata is open-source, you may see more connectors being written by members of the community as and when required. OpenMetadata also integrates with Great Expectations for data quality workloads and Prefect for data workflows.
OpenMetadata Resources #
Although it has only been just over a year since OpenMetadata’s launch, there’s been quite a bit of development. Here’s a curated list of resources that might help you navigate your OpenMetadata learning journey and keep up to speed with further developments.
- GitHub Repo
- Slack Community
- Google Group
- Demo
- Technology Blog
- Roadmap
- Community Discussions
- Swagger API Specification
Conclusion #
Here, we took you through the basic design, architecture, and prominent features of OpenMetadata.
The resources we’ve shared above should be able to steer you in the right direction if you’re thinking about evaluating OpenMetadata as a metadata management platform for your stack.
When evaluating OpenMetadata, take your time to review your data cataloging, governance, and lineage requirements and OpenMetadata’s features in those areas, and see if there’s enough alignment for you to go through a POC.
Also, as with any other open-source project, assess it on specific general criteria, like popularity, maturity, activity, release cycles, and the roadmap. A combined view of all these things will help you decide which of the data cataloging and governance tools makes the most sense for your business.
If you are a data consumer or producer and are looking to champion your organization to optimally utilize the value of a modern data stack, it’s worth taking a look at off-the-shelf alternatives like Atlan — Atlan is built on open source, and open by default.
Atlan empowers organizations to establish and scale data governance programs by automating metadata management, providing end-to-end lineage tracking, enabling collaboration across diverse personas, and offering an extensible platform for customized governance workflows and integrations.
Atlan’s approach ensures data quality, security, and compliance while fostering data literacy and self-service across the organization.
Tide’s Story of GDPR Compliance: Embedding Privacy into Automated Processes #
- Tide, a UK-based digital bank with nearly 500,000 small business customers, sought to improve their compliance with GDPR’s Right to Erasure, commonly known as the “Right to be forgotten”.
- After adopting Atlan as their metadata platform, Tide’s data and legal teams collaborated to define personally identifiable information in order to propagate those definitions and tags across their data estate.
- Tide used Atlan Playbooks (rule-based bulk automations) to automatically identify, tag, and secure personal data, turning a 50-day manual process into mere hours of work.
Book your personalized demo today to find out how Atlan can help your organization in establishing and scaling data governance programs.
A demo of Atlan for data discovery
FAQs on OpenMetadata #
What is OpenMetadata? #
OpenMetadata is an open-source metadata management platform that supports data cataloging, discovery, and collaboration. It enables organizations to maintain a unified metadata system, fostering data governance and quality.
What are the main principles behind OpenMetadata’s design? #
OpenMetadata is built on key principles such as a unified metadata model, open APIs for integration, metadata extensibility, pull-based metadata ingestion, and graph storage.
How does OpenMetadata handle metadata ingestion? #
OpenMetadata uses a pull-based metadata ingestion method, meaning the metadata engine retrieves data from sources rather than relying on those sources to push data, which ensures consistency and reliability.
What makes OpenMetadata’s integration unique? #
It offers open and standardized APIs that follow widely accepted schema standards, allowing seamless integration with downstream applications like data catalogs and quality engines.
Why does OpenMetadata use graph storage for metadata? #
Graph storage in OpenMetadata enables the organization to build a metadata graph, connecting data with processes, tools, and teams, which supports features like data lineage, governance, and observability.
Related reads #
- Openmetadata vs DataHub: Understand how both these tools compare based on their architecture, ingestion methods, capabilities, available integrations, and more.
- Learn more about how Amundsen compares with other open-source data catalog and metadata tools
- What Is a Data Catalog? & Do You Need One?
- Lyft Amundsen Vs. Linkedin DataHub: A deep dive into how Amundsen and DataHub compare in terms of architecture, metadata ingestion, ease of deployment, and core data discovery features.
- Lyft Amundsen Vs. Apache Atlas: Read more about how Amundsen and Apache Atlas compare and contrast in data discovery, data catalog, and data lineage features.
- Understanding AWS Glue data catalog: Architecture, components, and crawlers
- Data Catalog: What It Is & How It Drives Business Value
- What Is a Metadata Catalog? - Basics & Use Cases
- Modern Data Catalog: What They Are, How They’ve Changed, Where They’re Going
- Open Source Data Catalog - List of 6 Popular Tools to Consider in 2024
- 5 Main Benefits of Data Catalog & Why Do You Need It?
- Enterprise Data Catalogs: Attributes, Capabilities, Use Cases & Business Value
- The Top 11 Data Catalog Use Cases with Examples
- 15 Essential Features of Data Catalogs To Look For in 2024
- Data Catalog vs. Data Warehouse: Differences, and How They Work Together?
- Snowflake Data Catalog: Importance, Benefits, Native Capabilities & Evaluation Guide
- Data Catalog vs. Data Lineage: Differences, Use Cases, and Evolution of Available Solutions
- Data Catalogs in 2024: Features, Business Value, Use Cases
- AI Data Catalog: Exploring the Possibilities That Artificial Intelligence Brings to Your Metadata Applications & Data Interactions
- Amundsen Data Catalog: Understanding Architecture, Features, Ways to Install & More
- Machine Learning Data Catalog: Evolution, Benefits, Business Impacts and Use Cases in 2024
- 7 Data Catalog Capabilities That Can Unlock Business Value for Modern Enterprises
- Data Catalog Architecture: Insights into Key Components, Integrations, and Open Source Examples
- Data Catalog Market: Current State and Top Trends in 2024
- Build vs. Buy Data Catalog: What Should Factor Into Your Decision Making?
- How to Set Up a Data Catalog for Snowflake? (2024 Guide)
- Data Catalog Pricing: Understanding What You’re Paying For
- Data Catalog Comparison: 6 Fundamental Factors to Consider
- Alation Data Catalog: Is it Right for Your Modern Business Needs?
- Collibra Data Catalog: Is It a Viable Option for Businesses Navigating the Evolving Data Landscape?
- Informatica Data Catalog Pricing: Estimate the Total Cost of Ownership
- Informatica Data Catalog Alternatives? 6 Reasons Why Top Data Teams Prefer Atlan
- Data Catalog Implementation Plan: 10 Steps to Follow, Common Roadblocks & Solutions
- Data Catalog Demo 101: What to Expect, Questions to Ask, and More
- Data Mesh Catalog: Manage Federated Domains, Curate Data Products, and Unlock Your Data Mesh
- Best Data Catalog: How to Find a Tool That Grows With Your Business
- How to Build a Data Catalog: An 8-Step Guide to Get You Started
- The Forrester Wave™: Enterprise Data Catalogs, Q3 2024 | Available Now
- How to Pick the Best Enterprise Data Catalog? Experts Recommend These 11 Key Criteria for Your Evaluation Checklist
- Collibra Pricing: Will It Deliver a Return on Investment?
- Data Lineage Tools: Critical Features, Use Cases & Innovations
- OpenMetadata vs. DataHub: Compare Architecture, Capabilities, Integrations & More
- Automated Data Catalog: What Is It and How Does It Simplify Metadata Management, Data Lineage, Governance, and More
- Data Mesh Setup and Implementation - An Ultimate Guide
- What is Active Metadata? Your 101 Guide
Share this article