AWS Data Catalog — Key Considerations & Tools Evaluation Guide

September 30th, 2022


If you’re evaluating data catalogs for AWS Glue, the built-in data catalog might seem like the obvious choice because it integrates with the rest of the AWS ecosystem.

However, before settling on a specific catalog, you should consider several key factors: connectivity, multi-cloud compatibility and interoperability, lineage, governance, and more.

Here we will explore the key factors to consider during your data catalog evaluation, followed by an assessment of a list of suitable data catalogs for AWS Glue.

Let’s start with the primary evaluation criteria.

Data catalogs for AWS Glue: Some factors to consider

  1. Data connectors available: The data sources supported and the ease of requesting as well as setting up custom connectors for upcoming sources
  2. Hosting model: Support for on-premises and cloud-hosted infrastructure
  3. Data governance: IAM-based granular access controls in compliance with local and regional regulations
  4. Usability: Because data consumers are diverse, the catalog should be user-friendly for both technical and business users
  5. Extensibility: An open platform that makes it easy to connect with other data products using APIs
  6. Technical support: The level of support available to set up, manage, and troubleshoot the data catalog
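
One way to put these criteria to work is a weighted scoring matrix. The sketch below is a minimal illustration rather than a prescribed method: the weights, the 1 to 5 scoring scale, and the candidate names `catalog_a` and `catalog_b` are hypothetical placeholders, not ratings of any real product.

```python
# Hypothetical weights for the six criteria above; they should sum to 1.0.
CRITERIA_WEIGHTS = {
    "connectors": 0.25,
    "hosting_model": 0.10,
    "governance": 0.20,
    "usability": 0.20,
    "extensibility": 0.15,
    "support": 0.10,
}


def weighted_score(scores):
    """Return the weighted sum of 1-5 scores across all criteria."""
    missing = set(CRITERIA_WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(CRITERIA_WEIGHTS[name] * scores[name] for name in CRITERIA_WEIGHTS)


def rank_candidates(candidates):
    """Sort candidate names from highest to lowest weighted score."""
    return sorted(candidates, key=lambda name: weighted_score(candidates[name]), reverse=True)


# Placeholder scores (1 = poor, 5 = excellent) for two made-up candidates.
candidates = {
    "catalog_a": {"connectors": 4, "hosting_model": 3, "governance": 5,
                  "usability": 3, "extensibility": 4, "support": 4},
    "catalog_b": {"connectors": 5, "hosting_model": 4, "governance": 3,
                  "usability": 4, "extensibility": 3, "support": 3},
}

for name in rank_candidates(candidates):
    print(name, round(weighted_score(candidates[name]), 2))
```

Adjusting the weights to match your own priorities (for example, raising `governance` for a regulated industry) changes the ranking without touching the scores themselves.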


Data catalogs for AWS Glue

While there are several solutions available for cataloging the ETL jobs on AWS Glue, we’re going to focus on four popular platforms:

  1. AWS Glue Data Catalog
  2. Atlan
  3. Amundsen
  4. DataHub

Each tool has its USPs, benefits, and limitations. Let's explore these aspects further to help you pick the right tool for your data stack.

Let’s start with the AWS Glue Data Catalog.

AWS Glue Data Catalog

The AWS Glue Data Catalog is a persistent metadata repository for AWS Glue, a fully managed, cloud-based ETL service. It holds references to the data used as sources and targets of ETL jobs in AWS Glue, and it automatically fetches and stores information on data location, format, column schema, and more.

The AWS Glue Data Catalog uses crawlers to connect to various data sources (Redshift, RDS, S3), determine the schema for each data entity, and create metadata tables.
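
As a rough illustration of what defining a crawler involves, the sketch below assembles the keyword arguments for boto3's `create_crawler` call without contacting AWS. The crawler name, IAM role ARN, database name, S3 path, and schedule are hypothetical placeholders; the actual call is left as a comment because it needs boto3 and AWS credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Assemble keyword arguments for the Glue create_crawler API call."""
    request = {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes
        "DatabaseName": database,    # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule:
        request["Schedule"] = schedule  # cron expression; omit to run on demand
    return request


# All values below are made-up placeholders.
request = build_crawler_request(
    name="sales-data-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
    schedule="cron(0 2 * * ? *)",
)
print(request)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("glue").create_crawler(**request)
```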

Information such as data location, format, and column schema is discovered automatically and stored as tables, where each table describes a single data store.

Many organizations use AWS Glue Data Catalog to extract the schema of their S3 files.
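
To make the stored metadata concrete, here is a minimal sketch that pulls the location, format, and column schema out of a table definition. The `sample_table` dictionary is hand-written to resemble the `Table` entry that boto3's `glue.get_table` returns; the bucket, database, and column names are hypothetical, and reading the format from the `classification` parameter assumes the table was created by a crawler.

```python
def summarize_table(table):
    """Extract location, format, and column schema from a Glue table definition."""
    descriptor = table.get("StorageDescriptor", {})
    # Partition keys are stored separately from regular columns.
    columns = descriptor.get("Columns", []) + table.get("PartitionKeys", [])
    return {
        "name": f'{table["DatabaseName"]}.{table["Name"]}',
        "location": descriptor.get("Location"),
        "format": table.get("Parameters", {}).get("classification"),
        "columns": {col["Name"]: col["Type"] for col in columns},
    }


# Hand-written sample; not output from a real AWS account.
sample_table = {
    "Name": "orders",
    "DatabaseName": "sales_db",
    "StorageDescriptor": {
        "Location": "s3://example-bucket/sales/orders/",
        "Columns": [
            {"Name": "order_id", "Type": "bigint"},
            {"Name": "amount", "Type": "double"},
        ],
    },
    "PartitionKeys": [{"Name": "order_date", "Type": "string"}],
    "Parameters": {"classification": "parquet"},
}

summary = summarize_table(sample_table)
print(summary)
```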

Supported data sources

AWS Glue supports several data sources:

  • AWS data sources: Redshift, Athena, RDS, S3, DynamoDB, and DocumentDB
  • Non-relational databases (MongoDB), Hadoop databases (Apache Hive), and streaming data (Apache Kafka)

Pros

  • AI/ML algorithms identify and auto-populate descriptions for each data entity
  • Hive-compatible metastore
  • Built-in workflows to handle ETL workloads
  • Built-in classifiers and transforms
  • IAM-based fine-grained access control
  • Pay-as-you-go pricing as AWS Glue Data Catalog is serverless
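
As an example of what IAM-based fine-grained access control can look like, the sketch below builds a policy document that grants read-only access to a single catalog table. The region, account ID, database, and table names are hypothetical, and the chosen actions are one plausible read-only set rather than a recommended baseline.

```python
import json


def read_only_table_policy(region, account_id, database, table):
    """Build an IAM policy document granting read access to one catalog table."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions"],
                # Table access is scoped through the catalog, database, and table ARNs.
                "Resource": [
                    f"{prefix}:catalog",
                    f"{prefix}:database/{database}",
                    f"{prefix}:table/{database}/{table}",
                ],
            }
        ],
    }


# Placeholder identifiers, not a real account.
policy = read_only_table_policy("us-east-1", "123456789012", "sales_db", "orders")
print(json.dumps(policy, indent=2))
```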

Cons

  • Glue is a black box for developers, making it difficult to adapt and extend for the various data tools in the modern data stack
  • Limited programming language support (Python and Scala) for writing custom ETL code
  • Glue’s UI/UX is too poor for real end-user workflows

AWS Glue Data Catalog resources

AWS Glue components | AWS Glue data catalog explained | Data catalog and crawlers in AWS Glue | AWS Glue data catalog on Github



Atlan

Atlan is a third-generation enterprise data catalog for the modern data platform. Atlan’s data catalog acts as a collaboration and orchestration layer, bringing together the diverse humans of data, the data they need, and the tools they use.

Atlan is built on the premise of embedded collaboration — work happens where you are, with no friction. So, Atlan’s catalog brings together micro-workflows across multiple tools and lets you get all the context without switching between various data tools.

Atlan’s open and extensible API makes it easy to connect with any data tool — databases, lakes, warehouses, BI platforms, CRMs, data movement tools — of your choice.

Moreover, using programmable bots, you can automate several aspects of cataloging to save time, reduce errors, and eliminate data silos.

Supported data sources

Atlan supports the following AWS data sources:

Atlan also offers connectivity to various data movement and BI tools.

→ Here’s the complete list of supported integrations on Atlan

Atlan Data catalog facilitates metadata search across your AWS data stack. Source: Atlan

Required AWS Glue components

Atlan integrates directly with AWS services such as Redshift, Glue, and Athena. To understand how to set up Atlan for AWS Glue, check out these step-by-step guides:

Pros

  • Pull-based crawlers ingest metadata from different sources and update it frequently
  • Google-like search with advanced filters to search for data beyond tables — across BI dashboards, pipelines, code, models, queries, metrics, and more
  • Seamless crawling of AWS Glue assets in sync with the AWS instances
  • Data asset profiles with relevant context such as the column names, owners, classifications, certifications, related terms, and README
  • Collaborate on data — discuss data assets, look up definitions, and get alerts on your workflows without leaving Atlan
  • Custom classifications, programmable PII bots, automated propagation, custom masking, and hashing policies
  • An API-driven platform that’s open and extensible, and not a closed, black-box tool
  • Pay-as-you-go pricing

Cons

Atlan doesn’t deploy on GCP and Azure yet.

Atlan data catalog resources

Atlan data catalog overview | Atlan API reference | Atlan Data Catalog: User stories | Atlan and AWS



Amundsen

Amundsen is an open-source data catalog tool developed by Lyft. Amundsen was built to make searching for data and metadata easy for technical and business users.

Amundsen’s Databuilder supports several extractors to simplify metadata ingestion. It also integrates seamlessly with Airflow.

Besides its default Neo4j backend, Amundsen also supports AWS Neptune and Apache Atlas as backend environments.

Supported AWS data sources

Amundsen supports the following AWS data sources:

Amundsen also supports ETL, querying, and BI tools. You can find the complete list of supported integrations here.

Amundsen, an open-source data catalog for your AWS data stack. Source: Atlan

Required AWS Glue components

Amundsen maintains metadata in Neo4j and uses Elasticsearch for data discovery. So, you can deploy Amundsen on Docker with Neo4j as the backend database.

The cloud components required to set up Amundsen for AWS include:

  • AWS EC2 instance
  • AWS Elastic Container Service (ECS)
  • Neo4j or Apache Atlas backend

Alternatively, you can use:

Pros

  • Metadata search and discovery through search interface with previews
  • Provides relevant context with table/column description and basic statistics (count, null values, min, and max)
  • Supports several Hadoop and non-Hadoop data sources
  • Officially supports Helm charts

Cons

  • Like other open-source tools, Amundsen requires substantial engineering expertise and resources for setup and maintenance
  • No support for masking PII or sensitive information
  • Depends on Apache Atlas for lineage support
  • Doesn’t offer fine-grained access control or versioning

Amundsen resources

Amundsen Lyft Overview | Amundsen Demo | Amundsen Setup Guide | Amundsen GitHub Repository | A guide to configure and set up Amundsen on AWS



DataHub

DataHub is an open-source metadata platform originally developed by LinkedIn. It uses Elasticsearch for data discovery, and you can deploy it on Docker with Neo4j or MySQL as the backend database.

DataHub is known for its metadata governance, GraphQL API, Great Expectations integration, and application monitoring capabilities.

Supported AWS data sources

DataHub also supports cloud data warehouses and lakes, lineage, querying, and BI tools. You can find out more about all the supported integrations here.

Required AWS Glue components

Like Amundsen, DataHub uses Elasticsearch for data discovery. It maintains metadata in Neo4j or MySQL, so you can deploy DataHub on Docker with either of these as the backend database. DataHub also supports GraphQL, as well as Kafka for streaming data.

The cloud components required to set up DataHub for AWS include:

  • AWS EC2 instance
  • AWS Elastic Kubernetes Service (EKS)
  • Neo4j or MySQL backend
  • AWS CLI for managing AWS resources

You can also use AWS-managed services for the storage layer.

Pros

  • Supports several Hadoop and non-Hadoop data sources
  • Enables text-based search and discovery of datasets and metadata
  • Supports column-level and dataset-level classification, PII tagging, and automatic data deletion for GDPR
  • Supports Helm charts (Helm 3) for deploying on Kubernetes clusters at scale

Cons

  • Doesn’t offer column-level lineage
  • Doesn’t support multiple backend environments
  • Doesn’t support AWS Neptune yet

DataHub resources

LinkedIn DataHub Overview | DataHub Setup Guide | DataHub GitHub Repository | Amundsen vs DataHub


Data catalog for AWS: Summing up

While each tool mentioned above has merits, you should pick one that solves your specific needs and use cases.

So, look for factors such as connectivity to your data stack, programming language support, and supported cloud environments, as well as ease of setup and management. The catalog you choose should also help build a collaborative workspace for all teams using your data.

In addition to the points mentioned above, also consider our data catalog evaluation criteria to choose the perfect data catalog for your AWS environment.

If you are evaluating a data catalog for your AWS data sources, you might want to check out Atlan.


