Apache Atlas is an open source metadata management and governance system designed to help you easily find, organize, and manage data assets.
What is metadata? Metadata is data that describes one or more aspects of your data. Put simply, it’s data about data. A metadata management system like Atlas helps you catalog, classify, govern, and collaborate on all your data assets and their metadata.
Atlas was incubated by Hortonworks under the umbrella of the Data Governance Initiative (DGI) and joined the Apache Software Foundation Incubator in May 2015, where it lived and grew until it graduated as a top-level project in June 2017. The initial focus was the Apache Hadoop environment, although Apache Atlas has no dependencies on the Hadoop platform itself.
The open source project continues to see stable year-on-year development, with useful contributions from committers at organizations like Hortonworks, Aetna, Merck, and Target. As for the future, with metadata itself becoming big data, Apache Atlas can be considered one of the building blocks of the modern data platform.
What is Apache Atlas used for?
Using metadata to visualize lineage
Atlas reads the content of metadata to build relationships among data assets. When it receives query information, it notes the input and output of the query and generates a lineage map that traces all usage and transformations over time. This visualization of lineage allows data teams to quickly identify the source of data and to understand the impact of data and schema changes.
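The idea of deriving lineage from query inputs and outputs can be illustrated with a small in-memory sketch. This is not how Atlas is implemented internally; the `LineageGraph` class, its method names, and the dataset names are all hypothetical and exist only to show the principle.

```python
from collections import defaultdict


class LineageGraph:
    """Toy lineage store: each recorded query links its inputs to its outputs."""

    def __init__(self):
        self.upstream = defaultdict(set)    # dataset -> datasets it was derived from
        self.downstream = defaultdict(set)  # dataset -> datasets derived from it

    def record_query(self, inputs, outputs):
        # Every output of a query depends on every input of that query.
        for out in outputs:
            for inp in inputs:
                self.upstream[out].add(inp)
                self.downstream[inp].add(out)

    @staticmethod
    def _walk(edges, start):
        # Depth-first traversal collecting every dataset reachable from `start`.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def sources_of(self, dataset):
        """All datasets that directly or indirectly feed `dataset`."""
        return self._walk(self.upstream, dataset)

    def impacted_by(self, dataset):
        """All datasets that would be affected by a change to `dataset`."""
        return self._walk(self.downstream, dataset)


graph = LineageGraph()
graph.record_query(
    inputs=["raw.orders", "raw.customers"],
    outputs=["staging.orders_enriched"],
)
graph.record_query(
    inputs=["staging.orders_enriched"],
    outputs=["reports.daily_revenue"],
)

print(sorted(graph.sources_of("reports.daily_revenue")))
print(sorted(graph.impacted_by("raw.orders")))
```

Walking the graph upstream answers "where did this data come from?", while walking it downstream answers "what breaks if this table or schema changes?".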
Adding entities to metadata makes searching easier
Atlas manages entities like “classifications” and “labels” that can be created and used to enhance the metadata of data assets. These can be for anything from identifying data stages to recording comments on data assets.
Creating a common vocabulary for the data
Atlas also provides the infrastructure to create and maintain a business glossary for labeling data assets. A glossary helps align department- or organization-wide vocabulary with the data and the people who use it.
Capabilities of Apache Atlas
The core capabilities of Apache Atlas, as defined by the incubator project, included the following:
Data Classification
to create an understanding of the data within a data platform such as Hadoop and provide a classification of this data to external and internal sources.
Centralized Auditing
to provide a framework for capturing and reporting on access to and modifications of data.
Search and Lineage
to allow pre-defined and ad-hoc exploration of data and metadata while maintaining a history of how a data source or explicit data was constructed.
Security and Policy Engine
to protect data and rationalize data access according to regulations and compliance policy.
How does it work?
Considering Apache Atlas is an open source tool, this part can get quite technical. The key components of Atlas are grouped under the following main categories:
Type System: As per the Apache documentation, a ‘Type’ is a definition of how a particular type of metadata object is stored and accessed in Atlas. One key benefit of using this system is that it allows data stewards to define both technical and business metadata.
Graph Engine: For data storage, Atlas uses a graph database called JanusGraph. Apart from storage, the graph engine also creates indices for the metadata objects so that they can be searched efficiently.
Ingest / Export: The ingest component adds metadata to Atlas, while the export component exposes metadata changes as events that consumers can react to in real time.
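To make the Type System a little more concrete, here is a minimal sketch of what a custom type definition might look like when prepared as a JSON payload. The field names follow the general shape of Atlas v2 type definitions as I understand them, but check them against the documentation for your Atlas version. The `sample_report` type and its attributes are hypothetical, and the helper functions are illustrative only.

```python
import json


def attribute_def(name, type_name="string", optional=True, unique=False):
    """Describe one attribute of a type (field names assumed from Atlas v2 typedefs)."""
    return {
        "name": name,
        "typeName": type_name,
        "isOptional": optional,
        "cardinality": "SINGLE",
        "isUnique": unique,
        "isIndexable": True,
    }


def entity_def(name, super_types, attributes, description=""):
    """Describe an entity type that inherits from one or more existing types."""
    return {
        "name": name,
        "description": description,
        "superTypes": list(super_types),
        "attributeDefs": attributes,
    }


# Hypothetical business-facing type that extends the built-in DataSet type.
typedefs = {
    "entityDefs": [
        entity_def(
            "sample_report",
            ["DataSet"],
            [
                attribute_def("refresh_schedule"),  # technical metadata
                attribute_def("business_owner"),    # business metadata
            ],
            description="A reporting asset tracked alongside technical datasets",
        )
    ]
}

payload = json.dumps(typedefs, indent=2)
print(payload)
```

A payload like this mixes technical attributes with business ones in a single type, which is the benefit the Type System description points to.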
Metadata management and new model creation in Apache Atlas can be done via these two methods:
REST API: All the functionality of Atlas, including creating, updating, and deleting types and entities, is available to users via a REST API. This is the primary route to query and discover your metadata with Atlas.
Messaging: Users can also integrate with Atlas using a messaging interface based on Apache Kafka. Atlas uses Kafka as a notification server for communication between hooks and downstream consumers of metadata notification events.
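As a rough illustration of the REST route, the sketch below builds (but does not send) two HTTP requests using only the Python standard library. It assumes a local Atlas instance on port 21000 with the default `admin`/`admin` credentials, and it assumes the v2 endpoint paths `/api/atlas/v2/search/basic` and `/api/atlas/v2/entity`. Verify all of these against your own deployment. The helper function names and the sample attribute values are hypothetical.

```python
import base64
import json
import urllib.parse
import urllib.request

ATLAS_URL = "http://localhost:21000"  # assumption: local Atlas on its default port


def _auth_header(user, password):
    # HTTP Basic authentication header.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}


def build_search_request(type_name, query, user="admin", password="admin"):
    """Prepare a basic-search request for entities of a given type."""
    params = urllib.parse.urlencode({"typeName": type_name, "query": query})
    return urllib.request.Request(
        f"{ATLAS_URL}/api/atlas/v2/search/basic?{params}",
        headers={**_auth_header(user, password), "Accept": "application/json"},
        method="GET",
    )


def build_create_entity_request(type_name, attributes, user="admin", password="admin"):
    """Prepare a request that creates (or updates) a single entity."""
    body = json.dumps({"entity": {"typeName": type_name, "attributes": attributes}})
    return urllib.request.Request(
        f"{ATLAS_URL}/api/atlas/v2/entity",
        data=body.encode("utf-8"),
        headers={**_auth_header(user, password), "Content-Type": "application/json"},
        method="POST",
    )


search = build_search_request("hive_table", "orders")
create = build_create_entity_request(
    "hive_db",
    # Sample values only; the required attributes depend on the type definition.
    {"qualifiedName": "sales@primary", "name": "sales", "clusterName": "primary"},
)

print(search.get_method(), search.full_url)
print(create.get_method(), create.full_url)
# To actually send a request against a running instance:
#     with urllib.request.urlopen(search) as response:
#         results = json.load(response)
```

Keeping request construction separate from sending makes the calls easy to inspect and test without a running Atlas server.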
Atlas supports ingesting and managing metadata from the following sources within the Apache ecosystem:
- HBase - A distributed, column-oriented data store designed to provide quick random access to huge amounts of structured data.
- Hive - A data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
- Sqoop - A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
- Storm - An open source distributed realtime computation system.
- Kafka - An open source distributed event streaming platform.
Who should use it?
The advantages of using a system like Apache Atlas for managing your metadata and working towards data governance and compliance are apparent from what’s mentioned above.
However, like any other open source tool, it is made by engineers for engineers, and is thus quite technical to set up, use, and get the most out of. Make sure you’ve got a band of engineers and time at your disposal to set this up.
If you’re still weighing build versus buy, it’s worth taking a look at off-the-shelf alternatives like Atlan. What you'll get is a tool that is ready to use from the get-go and built to draw the most out of open source systems like Apache Atlas, without too much hassle.