Apache Atlas is an open source metadata management and governance system designed to help you easily find, organize, and manage data assets.
What is Metadata?
Metadata is data that provides information about one or more aspects of your data. Or, to put it simply, it’s data about data. A metadata management system like Atlas helps you catalog, classify, govern, and collaborate on all your data assets, including their metadata.
Apache Atlas Origins
Atlas was incubated by Hortonworks under the umbrella of the Data Governance Initiative (DGI) and joined the Apache Incubator in May 2015, where it lived and grew until it graduated as a top-level project in June 2017. The initial focus was the Apache Hadoop environment, although Apache Atlas has no dependencies on the Hadoop platform itself.
The open source project continues to see steady year-on-year development, with contributions from committers at organizations like Hortonworks, Aetna, Merck, and Target. As for the future, with metadata itself becoming big data, Apache Atlas can be considered one of the building blocks of the modern data platform.
What is Apache Atlas used for?
Apache Atlas is used by organizations to build a catalog of their data assets. It enables metadata management and governance capabilities and supports the following:
- Using metadata to visualize lineage
- Enriching metadata to make search easier
- Creating a common vocabulary for the data
Using metadata to visualize lineage
Atlas reads the content of metadata to build relationships among data assets. When it receives query information, it notes the input and output of the query and generates a lineage map that traces all usage and transformations over time. This visualization of lineage allows data teams to quickly identify the source of data and to understand the impact of data and schema changes.
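As a sketch of how lineage can be consumed programmatically, the snippet below walks a response shaped like the output of Atlas's GET /api/atlas/v2/lineage/{guid} endpoint to find everything upstream of a table. The GUIDs and table names are invented for illustration:

```python
# Sample response shaped like Atlas's lineage API output; the GUIDs
# ("t1", "p1", "t3") and qualified names are made up for this example.
lineage = {
    "baseEntityGuid": "t3",
    "guidEntityMap": {
        "t1": {"typeName": "hive_table",
               "attributes": {"qualifiedName": "sales.raw_orders@prod"}},
        "p1": {"typeName": "hive_process",
               "attributes": {"qualifiedName": "etl_clean_orders@prod"}},
        "t3": {"typeName": "hive_table",
               "attributes": {"qualifiedName": "sales.clean_orders@prod"}},
    },
    "relations": [
        {"fromEntityId": "t1", "toEntityId": "p1"},
        {"fromEntityId": "p1", "toEntityId": "t3"},
    ],
}

def upstream_of(lineage, guid):
    """Walk 'relations' edges backwards to find every entity feeding `guid`."""
    parents = {}
    for rel in lineage["relations"]:
        parents.setdefault(rel["toEntityId"], []).append(rel["fromEntityId"])
    seen, stack = set(), [guid]
    while stack:
        for src in parents.get(stack.pop(), []):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return {lineage["guidEntityMap"][g]["attributes"]["qualifiedName"]
            for g in seen}

print(upstream_of(lineage, lineage["baseEntityGuid"]))
```

This traces the clean table back through the ETL process to the raw source table, which is exactly the impact-analysis question lineage answers.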
Enriching metadata to make search easier
Atlas manages entities like “classifications” and “labels” that can be created and used to enhance the metadata of data assets. These can be for anything from identifying data stages to recording comments on data assets.
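For illustration, here is a minimal sketch of the JSON body used to attach classifications to an entity via POST /api/atlas/v2/entity/guid/{guid}/classifications. The "PII" classification name and its "level" attribute are assumed examples, not built-ins:

```python
import json

def classification_payload(type_name, attributes=None, propagate=True):
    """Body for POST /api/atlas/v2/entity/guid/{guid}/classifications:
    a JSON list of classification objects to attach to one entity."""
    return [{
        "typeName": type_name,
        "attributes": attributes or {},
        # propagate=True lets the tag flow to derived entities via lineage
        "propagate": propagate,
    }]

body = classification_payload("PII", {"level": "high"})
print(json.dumps(body, indent=2))
```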
Creating a common vocabulary for the data
Atlas also provides an infrastructure to create and maintain a business glossary whose terms can be used to label data assets. A glossary helps align department- or organization-wide vocabulary, so that data and data users speak the same language.
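As a sketch, the payload below creates a term under an existing glossary via POST /api/atlas/v2/glossary/term; the glossary GUID and the term itself are made-up examples:

```python
import json

def glossary_term(glossary_guid, name, description):
    """Body for POST /api/atlas/v2/glossary/term; the 'anchor' field
    ties the term to an existing glossary identified by its GUID."""
    return {
        "name": name,
        "longDescription": description,
        "anchor": {"glossaryGuid": glossary_guid},
    }

# "1234-abcd" stands in for the GUID of a glossary you created earlier.
term = glossary_term("1234-abcd", "Customer Churn",
                     "Share of customers lost over a reporting period.")
print(json.dumps(term, indent=2))
```

Once created, the term can be assigned to entities so that a search for "Customer Churn" surfaces every table and column that implements it.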
Capabilities of Apache Atlas
- Data Classification
- Centralized Metadata
- Search & Lineage
- Security and Policy Engine
The core capabilities of Apache Atlas, as defined by the incubator project, included the following:
Data Classification
Apache Atlas creates an understanding of the data within a data platform such as Hadoop and classifies this data for external and internal sources. It gives you the ability to create classifications for PII and other sensitive data, and a data asset can be associated with multiple classifications. Classifications can also be propagated through lineage, automatically ensuring that derived data inherits the same classification and security controls.
Centralized Metadata
Apache Atlas enables the definition of new metadata types and easy exchange of metadata via a common metadata store. This allows interoperability across multiple metadata repositories, one of the core requirements of building a forward-looking data stack.
Search and Lineage
Apache Atlas provides an intuitive UI for pre-defined and ad-hoc exploration of data assets by type, classification, attribute value, or free text. It also maintains a history of how a data source or a specific dataset was constructed and how it has evolved over time. Lineage can also be accessed and updated via REST APIs.
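A minimal sketch of such an ad-hoc query, expressed as the JSON body for Atlas's POST /api/atlas/v2/search/basic endpoint (the filter values are illustrative):

```python
import json

# Basic-search body: a type filter, a classification filter, and
# free text can all be combined in a single request.
search = {
    "typeName": "hive_table",        # restrict to Hive tables
    "classification": "PII",         # only entities tagged PII (example tag)
    "query": "orders",               # free-text match
    "excludeDeletedEntities": True,
    "limit": 25,
}
print(json.dumps(search))
```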
Security and Policy Engine
Apache Atlas is primarily a data governance tool. It allows fine-grained security for metadata access, enabling controls on access to entity instances and on operations such as adding, updating, or removing classifications. Integration with Apache Ranger also allows masking and authorization of data based on the classifications associated with data assets.
How does Apache Atlas work?
Because Apache Atlas is an open source tool, this part can get quite technical. The key components of Atlas are grouped under the following main categories:
Type System: As per the Apache documentation, a ‘Type’ is a definition of how particular kinds of metadata objects are stored and accessed in Atlas. One key benefit of this system is that it allows data stewards to define both technical and business metadata.
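As a sketch of the type system in practice, the snippet below builds the JSON body for POST /api/atlas/v2/types/typedefs, which registers a new entity type; the type name and attributes here are hypothetical:

```python
import json

def entity_typedef(name, attributes):
    """Body for POST /api/atlas/v2/types/typedefs defining a new entity
    type. Subtyping 'DataSet' lets instances participate in lineage."""
    return {"entityDefs": [{
        "name": name,
        "superTypes": ["DataSet"],
        "attributeDefs": [
            {"name": attr, "typeName": type_name,
             "isOptional": True, "cardinality": "SINGLE"}
            for attr, type_name in attributes
        ],
    }]}

# Hypothetical custom type with two attributes.
typedef = entity_typedef("my_stream_v1",
                         [("retentionDays", "int"), ("owner", "string")])
print(json.dumps(typedef, indent=2))
```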
Graph Engine: For data storage, Atlas uses a graph database called JanusGraph. Apart from storage, the graph engine also creates indices for the metadata objects so that they can be searched efficiently.
Ingest / Export: Metadata can be added to Atlas via ingest, while metadata changes are exposed as events that downstream consumers can process in real time via export.
Metadata management and new model creation in Apache Atlas can be done via two methods:
REST API: All the functionality of Atlas, including creating, updating, and deleting types and entities, is available to users via a REST API. This is the primary route to query and discover your metadata in Atlas.
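A minimal sketch of calling that API from Python's standard library, assuming an Atlas server on its default port (21000) with demo credentials; the table names are invented:

```python
import base64
import json
import urllib.request

ATLAS_URL = "http://localhost:21000"              # assumed local server
AUTH = base64.b64encode(b"admin:admin").decode()  # demo credentials

def create_entity_request(type_name, qualified_name, name):
    """Build the POST /api/atlas/v2/entity request that registers
    (or updates, keyed on qualifiedName) an entity in Atlas."""
    body = {"entity": {
        "typeName": type_name,
        "attributes": {"qualifiedName": qualified_name, "name": name},
    }}
    return urllib.request.Request(
        f"{ATLAS_URL}/api/atlas/v2/entity",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {AUTH}"},
        method="POST",
    )

req = create_entity_request("hive_table", "sales.orders@prod", "orders")
# Sending it requires a running Atlas server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```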
Messaging: Users can also integrate with Atlas using a messaging interface based on Apache Kafka. Atlas uses Kafka as a notification server for communication between hooks and downstream consumers of metadata notification events.
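The notifications Atlas publishes on its ATLAS_ENTITIES Kafka topic are JSON; a downstream consumer decodes each message and reacts to the operation type. The message below is simplified and hand-written for illustration (real notifications carry more fields):

```python
import json

# A message shaped like those on the ATLAS_ENTITIES topic; a Kafka
# consumer subscribed to that topic would decode records the same way.
raw = json.dumps({
    "message": {
        "operationType": "ENTITY_CREATE",
        "entity": {"typeName": "hive_table",
                   "attributes": {"qualifiedName": "sales.orders@prod"}},
    }
})

event = json.loads(raw)["message"]
if event["operationType"] == "ENTITY_CREATE":
    print("new entity:", event["entity"]["attributes"]["qualifiedName"])
```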
Ingestion Sources Supported By Apache Atlas
Atlas supports ingesting and managing metadata from the following sources within the Apache ecosystem:
- HBase - An open-source, distributed, non-relational database designed to provide quick random read/write access to huge amounts of structured data.
- Hive - An open-source data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Hive scales easily, is approachable even for non-programmers, and is designed to handle large amounts of data quickly using batch processing.
- Sqoop - A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
- Storm - An open-source distributed realtime computation system. Apache Storm has multiple use cases, such as real-time analytics, continuous computation, and ETL.
- Kafka - An open-source distributed event streaming platform. Kafka lets you read and write event streams, store them durably, and process them as they happen or retrospectively.
When should you consider using Apache Atlas?
Capabilities like metadata exchange, fine-grained metadata access control, and visual lineage make Apache Atlas one of the building blocks of a modern data platform. You may consider building on top of Apache Atlas if you are solving for the following use cases:
- Intelligent data cataloging & classification
- Federated metadata access & management
- Centralized data governance
- Security and Policy Engine
However, like any other open source tool, it is made by engineers for engineers, and is therefore quite technical to set up, use, and leverage to its full capability. Make sure you’ve got a band of engineers and time at your disposal before taking it on.
If you’re still weighing build vs. buy, it’s worth taking a look at off-the-shelf alternatives like Atlan. You get a ready-to-use tool from the get-go, built to draw the most out of open source systems like Apache Atlas without too much hassle.