Finding a good open-source data governance tool can be challenging. There are many reasons for that. First, the biggest barrier to deciding on anything related to data governance is the lack of a standardized approach - the goals aren't well defined. Moreover, the data governance capabilities of most open-source tools aren't clear. You have to sift through pages of documentation and GitHub repos to ascertain whether a particular tool solves for a specific use case. To simplify your evaluation process, we've put together a list of 7 open-source data governance tools popular amongst data practitioners.
7 popular open source data governance tools in 2022
Amundsen was initially built at Lyft and is currently hosted and maintained by LF AI & Data Foundation. With respect to data governance, it primarily solved for data security and compliance with data privacy and sovereignty laws. The idea was to tag and classify all the data on the metadata layer.
With Amundsen, you can search the metadata and understand who is using the data and how frequently they are using it. You can understand data quite a bit by looking at these data access patterns, but this approach is more reactive. For a more proactive approach, you’d need to have granular access control to prevent people from accessing data based on data access policies for teams, roles, individuals, systems, and so on.
Amundsen Data Governance Features
You don’t have RBAC (role-based access control) in Amundsen yet, but you still have some necessary data governance features, such as tagging and classification of metadata.
The data governance capabilities that rely on the default Neo4j backend are quite limited, so Amundsen added support for Apache Atlas. Because Apache Atlas is one of the most mature metadata management platforms, many of its features have been tried and tested in various systems, which brings reliability to the data cataloging and governance solution. With Atlas, Amundsen gets good support for data lineage and tag/badge propagation (using the lineage).
The Neo4j or Atlas backend should work for most businesses, but some want more advanced features from their data cataloging and governance solution.
Amundsen Data Governance Resources
Square created its version of Amundsen, which supports additional graph node types for representing column-level metadata in more detail.
Read more about that in this blog post on the Square blog. Several others have implemented their versions too. An Estonian company has worked on getting automated, column-level, cross-system lineage data into their Amundsen environment.
Amundsen Release Information
The latest release of Amundsen, 2.5.1, was in March 2021. You can keep an eye out for the developments here.
Want to try your hand at Amundsen? Here's the sandbox
LinkedIn created DataHub after WhereHows stopped being a viable solution for the increasing demands from a metadata search and discovery tool. Before DataHub, LinkedIn had used other tools in conjunction with WhereHows to add some data governance features.
DataHub gives you fine-grained access control over the metadata. Access is driven by policies, which you can declare from both the web UI and the GraphQL API. DataHub’s policies work on two layers: platform and metadata. Platform policies control user permissions for DataHub itself, for example, which features a user can see and use, and to what extent. You can apply these policies to individual users or groups. Metadata policies, on the other hand, control which users can access different metadata entities (charts, data sources, dashboards, etc.) and which operations they can perform on them. However, DataHub currently doesn’t let you control read permissions.
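To make the two policy layers concrete, here is a minimal sketch in plain Python. It does not use DataHub's API or policy schema; the `Policy` class, the privilege names, and the `is_allowed` helper are hypothetical and only model the idea that platform policies grant feature-level privileges while metadata policies grant operations on specific entity types.

```python
from dataclasses import dataclass


# Hypothetical model for illustration only; not DataHub's actual policy schema.
@dataclass(frozen=True)
class Policy:
    layer: str                             # "platform" or "metadata"
    privilege: str                         # made-up privilege names
    users: frozenset = frozenset()         # individual users the policy applies to
    groups: frozenset = frozenset()        # groups the policy applies to
    entity_types: frozenset = frozenset()  # only used by metadata policies


def is_allowed(policies, user, user_groups, privilege, entity_type=None):
    """Return True if any policy grants `privilege` to the user or one of their groups."""
    for policy in policies:
        if policy.privilege != privilege:
            continue
        applies_to_actor = user in policy.users or bool(set(user_groups) & policy.groups)
        if not applies_to_actor:
            continue
        if policy.layer == "platform":
            # Platform privileges are not tied to a particular entity.
            return True
        if policy.layer == "metadata" and entity_type in policy.entity_types:
            return True
    return False


policies = [
    Policy("platform", "MANAGE_POLICIES", groups=frozenset({"admins"})),
    Policy("metadata", "EDIT_TAGS", users=frozenset({"alice"}),
           entity_types=frozenset({"dataset", "dashboard"})),
]
```

In this sketch, a member of `admins` can manage policies regardless of entity, while `alice` can edit tags on datasets and dashboards but not on charts. There is no read privilege in the example, which mirrors the limitation noted above.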
Several other features are part of the DataHub roadmap but don’t have a clearly defined timeline as of now. One of the main data governance features is RBAC (role-based access control) for entities and aspects (PDL records). RBAC will not only enable finer access control for the metadata, but it will also help achieve better tag management, data preview access control, and so on.
In terms of governance/privacy: DataHub supports dataset-level classification, governed data movement, automated data deletion, data export, etc. They have plans to open-source some of the compliance capabilities, listed as part of their roadmap.
DataHub Release Information
In conclusion, DataHub is a tool that solves many problems simultaneously with different levels of sophistication. Several organizations already run it in production. The latest release of DataHub, 0.8.20, was in December 2021.
Want to try your hand at DataHub? Here's the sandbox
Apache Atlas Overview
Apache Atlas was one of the first open-source data catalogs to integrate data governance features. However, development on the project is somewhat slow, and it was built specifically for the Hadoop ecosystem. It works well with anything that integrates with Hive.
Apache Atlas Features
Apache Atlas is especially good with classification. It can dynamically create data sensitivity, expiry, and quality classifications. This brings us to data lineage, another sought-after feature of Apache Atlas. Atlas implemented true data lineage, i.e., lineage that is actionable. Using the lineage data, Apache Atlas can propagate metadata properties to entities down the lineage hierarchy. This is one feature that you don’t find as well implemented in other data governance tools.
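The propagation idea can be sketched as a graph traversal. The example below is a simplified illustration in plain Python, not Atlas code or its REST API: the `propagate_classifications` function and the table names are hypothetical, and lineage is assumed to be a simple mapping from each entity to the entities derived from it.

```python
from collections import deque


def propagate_classifications(lineage, classifications):
    """Push each entity's classifications to every downstream entity.

    lineage: dict mapping an entity to the list of entities derived from it.
    classifications: dict mapping an entity to its directly assigned labels.
    Returns a new dict with direct plus inherited labels for every entity.
    """
    entities = set(lineage) | set(classifications)
    for targets in lineage.values():
        entities.update(targets)
    effective = {e: set(classifications.get(e, ())) for e in entities}

    for source, labels in classifications.items():
        queue = deque([source])
        seen = {source}  # guards against cycles in the lineage graph
        while queue:
            current = queue.popleft()
            for child in lineage.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    effective[child] |= labels
                    queue.append(child)
    return effective


# Hypothetical lineage: two raw tables feed a shared downstream mart.
lineage = {
    "raw.customers": ["staging.customers"],
    "staging.customers": ["marts.customer_orders"],
    "raw.orders": ["marts.customer_orders"],
}
direct = {"raw.customers": {"PII"}}
effective = propagate_classifications(lineage, direct)
```

Here the `PII` label assigned to `raw.customers` reaches both downstream tables, while `raw.orders` stays unlabeled because it sits upstream of the mart rather than downstream of the labeled table.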
Apache Atlas has a range of data privacy and security features too. It has fine-grained access controls for entities and classifications. Atlas also works well with Apache Ranger to implement data authorization and masking. When working in tandem, these features form an effective data privacy and security safety net that allows the data to be masked or classified as PII, SENSITIVE, etc. It also gives you the framework to control who can access the PII and SENSITIVE data.
Atlas Release Information
The latest release of Apache Atlas, 2.2.0, was in August 2021.
Magda was developed by Data61, the data sciences arm of CSIRO (Australia's Commonwealth Scientific and Industrial Research Organisation). MAGDA is an acronym that stands for Making Australian Government Data Available. CSIRO deployed Magda to create an open data portal with over 70,000 data sets from the Australian federal and state governments. They also open-sourced the project for others to use.
While the richest and most mature feature of Magda remains search and discovery, it also provides great support for tagging and defining dataset themes. Magda also has a built-in data preview option, available as both a spreadsheet and an interactive chart. Other tools like Amundsen require integration with Superset for this. To note: integration with a tool like Superset for data preview is more extensible.
Magda currently doesn’t support RBAC (role-based access control), but it supports some features that allow for strict control over access to the resources ingested into Magda. Magda uses Kubernetes to stay cloud-agnostic. It uses Open Policy Agent (OPA), an open-source policy engine, to manage access policies. This helps implement different types of access controls, such as role-based, attribute-based, and so on.
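The difference between role-based and attribute-based checks is easy to see in a small example. The sketch below is plain Python rather than an OPA policy, and every field name in it (`roles`, `org_unit`, `access_level`) is an assumption made for illustration, not Magda's schema.

```python
# Hypothetical records; the field names are illustrative, not Magda's schema.
def rbac_allows(user, action, role_permissions):
    """Role-based: the decision depends only on the roles the user holds."""
    return any(action in role_permissions.get(role, set()) for role in user["roles"])


def abac_allows(user, action, dataset):
    """Attribute-based: the decision compares user and resource attributes."""
    if action != "read":
        return False
    if dataset["access_level"] == "public":
        return True
    # Restricted data sets are readable only within the owning org unit.
    return dataset["org_unit"] == user["org_unit"]


role_permissions = {"editor": {"read", "update"}, "viewer": {"read"}}
analyst = {"roles": ["viewer"], "org_unit": "finance"}
payroll = {"access_level": "restricted", "org_unit": "hr"}
budget = {"access_level": "restricted", "org_unit": "finance"}
```

With these made-up records, the role check lets the analyst read anything a `viewer` can read, whereas the attribute check lets the same analyst read `budget` but not `payroll`, because the decision also depends on attributes of the data set itself.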
Magda Release Information
Magda is definitely under active development, as the roadmap suggests. The latest release of Magda, 1.1.0, was in December 2021.
OpenMetadata was announced in August 2021. This open-source project defines specifications to standardize metadata with a schema-first approach. It comprises a centralized metadata store and an ingestion framework supporting popular connectors in the data stack.
OpenMetadata takes a different approach to tagging. It allows you to tag a data set with its data owners. It further allows you to classify your data sets into multiple tiers based on their importance. OpenMetadata also implements versioning across all your metadata. This means that all metadata related to database entities (tables, views, schemas), tags, dataset ownership details, and business glossaries is versioned, and the details of each change, such as who made it and when, are captured as well.
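As a rough illustration of what versioned metadata involves, the following sketch keeps a full history of an entity's metadata along with who changed it and when. It is a toy model in plain Python, not OpenMetadata's API; the class names, the `tier` values, and the `diff` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MetadataVersion:
    version: int
    changed_by: str
    changed_at: datetime
    fields: dict


class VersionedEntity:
    """Keeps every past state of an entity's metadata (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.history = []

    def update(self, changed_by, **changes):
        # Start from a copy of the latest state so older versions stay untouched.
        current = dict(self.history[-1].fields) if self.history else {}
        current.update(changes)
        version = MetadataVersion(
            version=len(self.history) + 1,
            changed_by=changed_by,
            changed_at=datetime.now(timezone.utc),
            fields=current,
        )
        self.history.append(version)
        return version

    def diff(self, old, new):
        """Return {field: (old_value, new_value)} between two version numbers."""
        a = self.history[old - 1].fields
        b = self.history[new - 1].fields
        return {k: (a.get(k), b.get(k)) for k in a.keys() | b.keys() if a.get(k) != b.get(k)}


table = VersionedEntity("sales.orders")
table.update("alice", owner="alice", tier="Tier2")
table.update("bob", tier="Tier1", tags=["finance"])
```

Calling `table.diff(1, 2)` reports that `tier` and `tags` changed between the two versions, and `table.history[1].changed_by` records that `bob` made the change.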
OpenMetadata Release Information
OpenMetadata is a new and fast-evolving community; you can follow the official roadmap here.
Launched in 2019, Egeria is maintained by the Linux Foundation's AI & Data arm. Egeria is designed to enable the easy exchange of metadata between tools and platforms in a vendor-agnostic manner. Other tools achieve this with SDKs and APIs, but there are limits to what they can do. Egeria is good with this because it is built around the principles of platform independence, easy scalability, and data accessibility.
While all the other tools we've looked at till now deal mostly with the problem of metadata management and governance primarily from a user's perspective, Egeria tries to solve the problem both for users and systems. Egeria works well with a wide variety of data tools.
Egeria provides you with fine-grained control over your metadata with features like governance zones, effectivity dating, metadata archival, metadata provenance, and so on. Some of these features are unique to Egeria. It also comes with over 800 predefined metadata types but doesn't limit you to those. You can define your own types based on your business requirements, which makes Egeria flexible enough to suit your needs.
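Two of these ideas, governance zones and effectivity dating, can be sketched in a few lines. The example below is a simplified model in plain Python and does not use Egeria's APIs or type system; the `MetadataElement` class, the zone names, and the inclusive date boundaries are assumptions, and Egeria's actual semantics may differ.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class MetadataElement:
    name: str
    zones: frozenset                      # governance zones the element belongs to
    effective_from: Optional[date] = None  # None means no lower bound
    effective_to: Optional[date] = None    # None means no upper bound


def is_effective(element, on):
    """True if `on` falls inside the element's effectivity window (inclusive)."""
    if element.effective_from is not None and on < element.effective_from:
        return False
    if element.effective_to is not None and on > element.effective_to:
        return False
    return True


def visible_elements(elements, caller_zones, on):
    """Names of elements that are effective on `on` and share a zone with the caller."""
    return [e.name for e in elements
            if is_effective(e, on) and e.zones & set(caller_zones)]


elements = [
    MetadataElement("customer_glossary", frozenset({"data-lake"})),
    MetadataElement("legacy_schema", frozenset({"data-lake"}),
                    effective_to=date(2021, 12, 31)),
    MetadataElement("hr_salaries", frozenset({"hr-restricted"})),
]
visible_now = visible_elements(elements, {"data-lake"}, date(2022, 2, 1))
visible_before = visible_elements(elements, {"data-lake"}, date(2021, 6, 1))
```

In this toy model, a caller limited to the `data-lake` zone never sees `hr_salaries`, and `legacy_schema` drops out of the results once its effectivity window has ended.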
Egeria Release Information
Egeria v1.0 launched in February 2019, and development has moved at a swift pace since then. Three years later, in February 2022, Egeria is at v3.5. You can find information about upcoming features and fixes in the official roadmap.
Finally, there is TrueDat, which is arguably the only full-fledged open-source data governance tool on this list. TrueDat was created by BlueTab (now an IBM company) after understanding the market's needs as a data solutions provider and finding a gap in the data governance space.
TrueDat has an overlapping set of features with the other tools that have been mentioned above. It has a data catalog, a search engine, data lineage capabilities, and so on. Still, the features that people enjoy most are the business glossary and the ability to share data between teams with very granular control, heavily focusing on data stewardship and data ownership management, taxonomy, and so on.
Other features make TrueDat unique in this list. One is its data sharing feature, which resembles Snowflake's data sharing and makes it easier for teams to share data and collaborate. Furthermore, to ensure a high level of security and control over the data, there are subscription and notification features that can be used to log change events in an audit trail and monitor them in real time.
TrueDat Release Information
With the latest stable version, v4.35, released in January 2022, this is one of the most mature open-source data governance tools out there.
Here's a concise matrix that summarizes the major data governance features you might be looking for in your data governance tool. For simplicity's sake, the matrix values have been kept to Yes and No; however, these tools implement the same features with differing levels of sophistication and maturity.
| Tool | Data Lineage | Business Glossary | Tagging/Classification | Tag/Classification Propagation | RBAC | ABAC | Data Sharing |
^ partially implemented or in the immediate roadmap
It is also important to remember that most of these open-source data governance tools are made by engineers, for engineers. It will take significant time and resources to get up and running with them. While you are in the evaluation process, you may also like to review off-the-shelf solutions like Atlan, which has all the capabilities of mature open-source data governance tools and more.