What is Apache Atlas?

March 10th, 2021

Apache Atlas is an open source metadata management and governance system designed to help you easily find, organize, and manage data assets.

What is metadata? Metadata is data that provides information about one or more aspects of your data. Or, to put it simply, it’s data about data. A metadata management system like Atlas helps you catalog, classify, govern, and collaborate on all your data assets, including their metadata.

Origins

Atlas was incubated by Hortonworks under the umbrella of the Data Governance Initiative (DGI) and entered the Apache Incubator in May 2015, where it lived and grew until it graduated as a top-level project in June 2017. The initial focus was the Apache Hadoop environment, although Apache Atlas has no dependencies on the Hadoop platform itself.

The open source project continues to see stable year-on-year development, with contributions from committers at organizations like Hortonworks, Aetna, Merck, and Target. As for the future, with metadata itself becoming big data, Apache Atlas can be considered one of the building blocks of the modern data platform.

What is Apache Atlas used for?

Using metadata to visualize lineage

Atlas reads the content of metadata to build relationships among data assets. When it receives query information, it notes the input and output of the query and generates a lineage map that traces all usage and transformations over time. This visualization of lineage allows data teams to quickly identify the source of data and to understand the impact of data and schema changes.

Visualizing lineage with Apache Atlas. Source: Apache Atlas documentation
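To make the lineage idea described above concrete, here is a minimal sketch in Python of how a lineage map can be derived from the inputs and outputs of queries. It is an illustration only, not how Atlas is implemented internally; the `record_query` and `upstream` helpers and all table names are hypothetical.

```python
from collections import defaultdict

# Maps each output asset to the set of assets it was directly derived from.
lineage = defaultdict(set)

def record_query(inputs, outputs):
    """Record one query: every output depends on every input."""
    for out in outputs:
        lineage[out].update(inputs)

def upstream(asset):
    """Return every asset that `asset` depends on, directly or indirectly."""
    seen = set()
    stack = [asset]
    while stack:
        current = stack.pop()
        for parent in lineage.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Hypothetical queries, e.g. two CREATE TABLE ... AS SELECT statements.
record_query(inputs=["raw.orders", "raw.customers"], outputs=["staging.orders_enriched"])
record_query(inputs=["staging.orders_enriched"], outputs=["reports.daily_revenue"])

print(sorted(upstream("reports.daily_revenue")))
```

Walking the same graph in the opposite direction would answer the impact question: which downstream assets are affected if a source table or its schema changes.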

Adding entities to metadata makes searching easier

Atlas manages entities like “classifications” and “labels” that can be created and used to enhance the metadata of data assets. These can be for anything from identifying data stages to recording comments on data assets.

Classification propagation to impacted entities with Apache Atlas. Source: Apache Atlas documentation
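Propagation, as named in the caption above, means that a classification attached to one entity can flow to the entities derived from it. The following Python sketch shows the general idea under simplified assumptions; the `downstream` map, the `propagate` helper, and the asset names are made up for illustration and do not reflect Atlas internals.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets derived from it.
downstream = {
    "raw.customers": ["staging.customers_clean"],
    "staging.customers_clean": ["reports.customer_summary"],
    "raw.products": ["reports.product_summary"],
}

def propagate(classifications, source, tag):
    """Attach `tag` to `source` and to every asset derived from it."""
    queue = deque([source])
    visited = {source}
    while queue:
        asset = queue.popleft()
        classifications.setdefault(asset, set()).add(tag)
        for child in downstream.get(asset, []):
            if child not in visited:
                visited.add(child)
                queue.append(child)
    return classifications

tags = propagate({}, "raw.customers", "PII")
print(sorted(tags))
```

Assets that are not derived from the classified source, such as `reports.product_summary` here, stay untagged.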

Creating a common vocabulary for the data

Atlas also provides the infrastructure to create and maintain a business glossary for labeling data assets. A glossary helps align department- or organization-wide vocabulary with the data, so that data users share the same terms.

Business metadata is a type supported by the Atlas type system. Source: Apache Atlas documentation

Capabilities of Apache Atlas

The core capabilities of Apache Atlas, as defined by the incubator project, included the following:

Data Classification

to create an understanding of the data within a data platform such as Hadoop and provide a classification of this data to external and internal sources.

Centralized Auditing

to provide a framework for capturing and reporting on access to and modifications of data.

Search and Lineage

to allow pre-defined and ad-hoc exploration of data and metadata while maintaining a history of how a data source or explicit data was constructed.

Security and Policy Engine

to protect data and rationalize data access according to regulations and compliance policy.

How does it work?

Considering Apache Atlas is an open source tool, this part can get quite technical. The key components of Atlas are grouped under the following main categories:

Atlas high level architecture. Source: Apache Atlas documentation

Core Components

Type System: As per the Apache documentation, a ‘Type’ is a definition of how particular types of metadata objects are stored and accessed in Atlas. One key benefit of using this system is that it allows data stewards to define both technical and business metadata.
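As a rough illustration of what a type definition looks like, the sketch below builds a JSON payload in the general shape the Atlas v2 REST API accepts for type definitions (an `entityDefs` list whose entries carry `name`, `superTypes`, and `attributeDefs`). The type name `sample_report`, its attributes, and the `attribute` helper are hypothetical; check the field names against the documentation for your Atlas version before relying on them.

```python
import json

def attribute(name, type_name, optional=True):
    """Build one attribute definition (helper name is hypothetical)."""
    return {
        "name": name,
        "typeName": type_name,
        "isOptional": optional,
        "cardinality": "SINGLE",
        "isUnique": False,
        "isIndexable": False,
    }

# "sample_report" and its attributes are made-up names for illustration.
typedefs = {
    "entityDefs": [
        {
            "name": "sample_report",
            "superTypes": ["DataSet"],  # built-in Atlas supertype for datasets
            "attributeDefs": [
                attribute("refresh_schedule", "string"),  # technical attribute
                attribute("business_owner", "string"),    # business-facing attribute
            ],
        }
    ]
}

print(json.dumps(typedefs, indent=2))
```

Extending a built-in supertype such as `DataSet` is what lets a custom type take part in features like lineage without redefining common attributes.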

Graph Engine: For data storage, Atlas uses a graph database called JanusGraph. Apart from storage, the graph engine also creates indices for the metadata objects so that they can be searched efficiently.

Ingest / Export: The ingest component allows metadata to be added to Atlas, while the export component exposes metadata changes as events that can be consumed in real time.

Integrations

Metadata management and new model creation in Apache Atlas can be done via these two methods:

REST API: All the functionality of Atlas, including creating, updating, and deleting types and entities, is available to users via a REST API. This is the primary route to query and discover your metadata with Atlas.
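For a feel of what a REST call might look like, here is a small Python sketch that builds a basic-search request against the v2 API using only the standard library. The host, port, and credentials are assumptions (a local server with default settings), and `build_search_request` and `search` are hypothetical helper names; adjust them to your deployment.

```python
import base64
import json
import urllib.parse
import urllib.request

# Assumptions: a local Atlas server on its default port, using basic auth.
ATLAS_URL = "http://localhost:21000"
USERNAME = "admin"
PASSWORD = "admin"

def build_search_request(type_name, query, limit=10):
    """Build (but do not send) a basic-search request."""
    params = urllib.parse.urlencode(
        {"typeName": type_name, "query": query, "limit": limit}
    )
    url = f"{ATLAS_URL}/api/atlas/v2/search/basic?{params}"
    token = base64.b64encode(f"{USERNAME}:{PASSWORD}".encode()).decode()
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
        method="GET",
    )

def search(type_name, query, limit=10):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_search_request(type_name, query, limit)) as resp:
        return json.load(resp)

# Building the request needs no running server; calling search() does.
request = build_search_request("hive_table", "orders")
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes the URL and headers easy to inspect before pointing the code at a real server.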

Messaging: Users can also integrate with Atlas using a messaging interface based on Apache Kafka. Atlas uses Kafka as a notification server for communication between hooks and downstream consumers of metadata notification events.
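A downstream consumer of these notifications typically reads JSON messages from Kafka and reacts to the operation they describe. The sketch below leaves out the Kafka client and only shows the handling step, using a simplified sample payload. The payload shape is an assumption made for illustration (real messages carry more fields and may differ between Atlas versions), and `summarize` is a hypothetical helper.

```python
import json

# Simplified, illustrative payload; not a complete Atlas notification.
sample_event = json.dumps({
    "message": {
        "type": "ENTITY_NOTIFICATION_V2",
        "operationType": "ENTITY_CREATE",
        "entity": {
            "typeName": "hive_table",
            "guid": "00000000-0000-0000-0000-000000000000",
            "attributes": {"qualifiedName": "default.orders@cluster"},
        },
    }
})

def summarize(raw_event):
    """Turn one raw notification into a short human-readable line."""
    message = json.loads(raw_event).get("message", {})
    entity = message.get("entity", {})
    name = entity.get("attributes", {}).get("qualifiedName", "<unknown>")
    operation = message.get("operationType", "<unknown>")
    return f"{operation}: {entity.get('typeName', '<unknown>')} {name}"

print(summarize(sample_event))
```

In a real setup, `raw_event` would be the value of a record read by a Kafka consumer subscribed to the topic Atlas publishes entity changes to.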

Atlas supports ingesting and managing metadata from the following sources within the Apache ecosystem:
  • HBase - A distributed, column-oriented database designed to provide quick random access to huge amounts of structured data.
  • Hive - Data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
  • Sqoop - A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Storm - An open source distributed realtime computation system.
  • Kafka - An open source distributed event streaming platform.

Who should use it?

The advantages of using a system like Apache Atlas for managing your metadata, and for data governance and compliance, are apparent from what’s mentioned above.

However, like many open source tools, it is made by engineers for engineers, and is therefore quite technical to set up and use. Make sure you’ve got a band of engineers and enough time at your disposal to set it up and take advantage of its capabilities.

If you’re still weighing build versus buy, it’s worth taking a look at off-the-shelf alternatives like Atlan. You get a ready-to-use tool that is built to get the most out of open source systems like Apache Atlas, without too much hassle.
