Metadata Management in AWS: A Detailed Look at Your Options

Updated June 09th, 2023

Amazon Web Services (AWS) offers a comprehensive suite of on-demand computing services to build and manage applications, store data, and deliver content at scale.

This article will discuss the importance of metadata management in AWS and how it can help organizations handle data at scale. We will start by exploring the data ecosystem offered by AWS, and then discuss the different cataloging options available, including AWS Glue, enterprise data catalogs, and open-source data catalogs.

Lastly, as an example, we’ll take a detailed look at integrating AWS Glue with Atlan for better data collaboration and metadata management of your data assets.

Table of contents

Metadata management in the AWS data ecosystem
AWS Glue data catalog
Exploring open-source metadata management options for AWS
Wrapping up
Metadata management in AWS: Related reads

AWS: An early entrant in the data space since 2006

AWS demonstrated foresight by investing early in data services. According to Andy Jassy, who led AWS before becoming CEO of Amazon in 2021:

“Most of the successful technology startups use AWS and have built their business on top of us and these are companies like Airbnb and Pinterest”.

Five tools — S3, SQS, RDS, SNS, and Redshift — played a vital role in establishing AWS as a top player in the market. Let’s see why:

In early 2006, AWS launched a cloud data storage service called the Simple Storage Service (S3), revolutionizing the data storage space.
Also in 2006, the Simple Queue Service (SQS) was launched as a secure queueing system for messages traveling between services.
In 2009, AWS came up with the Relational Database Service (RDS) to simplify database setup, operations, and scaling.
Launched in 2010, the Simple Notification Service (SNS) provided developers with a robust, scalable messaging tool.

SNS + SQS fan out pattern - Source: Twitter.

In 2012, AWS launched Redshift, its flagship data warehousing service, which brought fast, fully managed data analysis to the cloud and set the stage for other cloud data tools.

According to dbt Labs founder and CEO Tristan Handy, Redshift’s launch in 2012 kicked things off for the modern data stack - Source: dbt.

With Redshift, metadata management at scale became crucial for effective data management. So, let’s see how the AWS ecosystem handled metadata management.

Metadata management in the AWS data ecosystem

In the vast landscape of AWS, various tools come into play for metadata management.

For instance, the AWS Glue data catalog serves as an internal tool, offering a consolidated view of data across AWS. It simplifies the discovery of data and its associated metadata, easing the accessibility of data assets.

A major advantage of AWS Glue data catalog - Source: Twitter.

However, it comes with limitations, such as limited support for programming languages and frameworks and a closed, black-box-like architecture.

In contrast, external tools — both open-source and enterprise data catalogs — offer a wider scope. These tools, in conjunction with AWS services, address specific data and metadata management needs, thereby handling data cataloging, metadata, governance, and discovery.

Next, let’s explore the AWS Glue data catalog, which is an internal tool for metadata management in AWS.

AWS Glue data catalog

AWS Glue data catalog is one of the options you have for metadata management within the AWS data ecosystem. Here’s how the Glue data catalog works:

A Glue crawler connects to a data source and runs classifiers against it to recognize the data format and infer the schema. These classifiers can be built-in or user-defined.
The crawler then writes the extracted metadata to the Glue data catalog as table definitions.
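
To make the crawl, classify, and write flow more concrete, here is a minimal Python sketch of what a classifier conceptually does: look at a data sample, infer column names and types, and emit a table definition. The infer_type and classify_csv helpers are hypothetical and far simpler than Glue’s own classifiers; the output only mirrors the general shape of a Glue table definition (a name plus a StorageDescriptor with Columns and Location), and the bucket path is a placeholder.

```python
import csv
import io


def infer_type(values):
    """Guess a Hive-style column type from sample values (toy logic only)."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if values and all_parse(int):
        return "bigint"
    if values and all_parse(float):
        return "double"
    return "string"


def classify_csv(sample_text, table_name, location):
    """Build a Glue-like table definition from a CSV sample with a header row."""
    rows = list(csv.reader(io.StringIO(sample_text)))
    header, data = rows[0], rows[1:]
    columns = [
        {"Name": name, "Type": infer_type([row[i] for row in data])}
        for i, name in enumerate(header)
    ]
    return {
        "Name": table_name,
        "StorageDescriptor": {"Columns": columns, "Location": location},
    }


# Hypothetical sample data and S3 location, for illustration only.
sample = "order_id,amount,country\n1,19.99,DE\n2,5.00,US\n"
table = classify_csv(sample, "orders", "s3://example-bucket/orders/")
```

In a real setup the crawler does this work for you; the sketch is only meant to show why the catalog ends up holding column names, types, and locations rather than the data itself.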

AWS Glue data catalog - Source: Hevo.

Glue supports JDBC data sources for metadata collection and is Hive metastore-compatible.

Let’s explore the connectivity between Glue and Apache Hive further.

Integrating AWS Glue with Apache Hive and Spark for metadata management

Starting with Amazon EMR release 5.8.0, both Spark SQL and Hive can be configured to use AWS Glue Data Catalog as their metastore.

To specify AWS Glue Data Catalog as Spark’s metastore using AWS CLI, you would need to provide a specific value for hive.metastore.client.factory.class through the spark-hive-site classification. The value is typically “com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory”.

Expressed as an EMR configuration in JSON, this looks like the following example:

[
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]

To use a Data Catalog that belongs to another AWS account, add the hive.metastore.glue.catalogid property to the Properties section. Replace acct-id with the AWS account ID that owns the Data Catalog.

[
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
      "hive.metastore.glue.catalogid": "acct-id"
    }
  }
]
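
If you generate cluster configurations programmatically, the two JSON variants above can come from one small helper. The following is a minimal Python sketch, not an official API: the function name glue_metastore_config is hypothetical, and the account ID is a placeholder. The classification name and both property keys are taken from the examples above.

```python
import json

GLUE_CLIENT_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_metastore_config(classifications=("spark-hive-site",), catalog_id=None):
    """Return an EMR configurations list that points each classification at Glue."""
    properties = {"hive.metastore.client.factory.class": GLUE_CLIENT_FACTORY}
    if catalog_id is not None:
        # Only needed when the Data Catalog lives in another AWS account.
        properties["hive.metastore.glue.catalogid"] = catalog_id
    return [
        {"Classification": name, "Properties": dict(properties)}
        for name in classifications
    ]


# Placeholder account ID; replace it with the account that owns the catalog.
config = glue_metastore_config(catalog_id="123456789012")
config_json = json.dumps(config, indent=2)
```

Calling glue_metastore_config() with no arguments yields the same-account variant, and config_json can be saved to a file and passed to the cluster at creation time.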

When using AWS Glue, it is important to make sure that the EC2 instance profile for a cluster has the necessary IAM permissions for AWS Glue actions.

If you enable encryption for AWS Glue Data Catalog objects, the role must also have permission to use the AWS KMS key chosen for encryption: specifically, to encrypt, decrypt, and generate data keys with it.
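
As a rough illustration of the KMS side, the Python sketch below builds an IAM statement that grants the three KMS actions usually involved (kms:Encrypt, kms:Decrypt, and kms:GenerateDataKey) on a single key. The helper name and the key ARN are placeholders assumed for this example; check the statement against your own key policy and security requirements before using it.

```python
import json


def kms_statement_for_catalog(key_arn):
    """Return an IAM statement letting a role use one KMS key for catalog encryption."""
    return {
        "Effect": "Allow",
        "Action": ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey"],
        "Resource": key_arn,
    }


# Placeholder ARN; substitute the key configured for your Data Catalog.
statement = kms_statement_for_catalog(
    "arn:aws:kms:us-east-1:123456789012:key/example-key-id"
)
policy = {"Version": "2012-10-17", "Statement": [statement]}
policy_json = json.dumps(policy, indent=2)
```

Scoping Resource to one key ARN, rather than "*", keeps the grant limited to the key that actually protects the catalog.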

As mentioned before, you can replace the Hive metastore in EMR with the Glue data catalog. Let’s see how.

Replacing the Hive metastore in Amazon EMR with the Glue data catalog: An example

Here’s a walkthrough to replace the Hive metastore in Amazon EMR with AWS Glue Data Catalog:

Prepare your AWS environment
Configure AWS Glue
Set up Amazon EMR
Replace Hive metastore
Configure IAM roles
Test your setup

Let’s delve into the specifics.

Step 1: Prepare your AWS environment

Ensure you have an AWS account with the required permissions set up for Amazon EMR, AWS Glue, and AWS IAM roles. For instance, to set up AWS CLI, you might use:

pip install awscli
aws configure

Step 2: Configure AWS Glue

Next, set up your AWS Glue Data Catalog by creating databases and tables as per your requirements. Here’s a sample AWS CLI command to create a database in Glue:

aws glue create-database --database-input '{"Name": "my_database"}'
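
When this command is generated from a script, hand-written JSON inside shell quotes is easy to get wrong. The Python sketch below assembles the same create-database call as an argument list and lets json.dumps produce the --database-input value. The helper name is hypothetical, and the optional description field is an assumption added for illustration.

```python
import json
import shlex


def create_database_command(name, description=None):
    """Build the argument list for `aws glue create-database`."""
    database_input = {"Name": name}
    if description:
        database_input["Description"] = description
    return [
        "aws", "glue", "create-database",
        "--database-input", json.dumps(database_input),
    ]


command = create_database_command("my_database")
# A copy-pasteable, safely quoted form of the same command.
printable = shlex.join(command)
```

The list in command can be passed directly to subprocess.run, which avoids shell quoting altogether.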

Step 3: Set up Amazon EMR

Set up your Amazon EMR cluster, ensuring you choose EMR release version 5.8.0 or later. Here is a basic AWS CLI command to create a cluster; the instance type and count are example values:

aws emr create-cluster --name "Test cluster" --release-label emr-5.8.0 --applications Name=Hive Name=Spark --instance-type m4.large --instance-count 3 --use-default-roles

Step 4: Replace Hive metastore

When you create the EMR cluster, select Glue Data Catalog as your metastore so that Hive uses it automatically. Spark SQL needs the same property under the spark-hive-site classification, as shown earlier.

Here’s an example configuration in the EMR cluster settings:

[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
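
Steps 3 and 4 can also be combined into a single cluster-creation call by passing the configuration through the --configurations option. The Python sketch below builds such a command with both the hive-site and spark-hive-site classifications. The helper name, the instance type, and the instance count are assumptions chosen for illustration, not recommendations.

```python
import json

GLUE_CLIENT_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def create_cluster_command(name, release_label="emr-5.8.0",
                           instance_type="m4.large", instance_count=3):
    """Build `aws emr create-cluster` arguments with Glue as the metastore."""
    configurations = [
        {
            "Classification": classification,
            "Properties": {
                "hive.metastore.client.factory.class": GLUE_CLIENT_FACTORY
            },
        }
        # hive-site covers Hive; spark-hive-site covers Spark SQL.
        for classification in ("hive-site", "spark-hive-site")
    ]
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        "--applications", "Name=Hive", "Name=Spark",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--use-default-roles",
        "--configurations", json.dumps(configurations),
    ]


cluster_command = create_cluster_command("Test cluster")
```

Keeping the configuration next to the cluster definition makes it harder to launch a cluster that silently falls back to a local Hive metastore.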

Step 5: Configure IAM roles

The EC2 instance profile for your cluster (EMR_EC2_DefaultRole by default) needs permissions for AWS Glue actions. The IAM policy might look something like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateDatabase",
        "glue:CreateTable",
        "glue:DeleteDatabase",
        "glue:DeleteTable",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables"
      ],
      "Resource": "*"
    }
  ]
}
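
Before attaching a policy, it can help to check that it actually covers the actions listed above. The Python sketch below is a simplified, hypothetical checker: it only looks at Allow statements and their Action patterns (including wildcards such as glue:Get*), and deliberately ignores Deny statements, Resource, Condition, and NotAction, which real IAM evaluation takes into account.

```python
from fnmatch import fnmatchcase

REQUIRED_GLUE_ACTIONS = [
    "glue:CreateDatabase", "glue:CreateTable",
    "glue:DeleteDatabase", "glue:DeleteTable",
    "glue:GetDatabase", "glue:GetDatabases",
    "glue:GetTable", "glue:GetTables",
]


def _as_list(value):
    """IAM allows a single item or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def missing_actions(policy, required=REQUIRED_GLUE_ACTIONS):
    """Return the required actions that no Allow statement in the policy matches."""
    patterns = [
        action.lower()  # IAM action names are case-insensitive
        for statement in _as_list(policy.get("Statement", []))
        if statement.get("Effect") == "Allow"
        for action in _as_list(statement.get("Action", []))
    ]
    return [
        action for action in required
        if not any(fnmatchcase(action.lower(), pattern) for pattern in patterns)
    ]


# Hypothetical policy that grants read access but only one write action.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["glue:Get*", "glue:CreateTable"], "Resource": "*"}
    ],
}
gaps = missing_actions(example_policy)
```

An empty result means every listed action is matched by some Allow pattern; it does not prove the policy is sufficient for every Hive or Spark workload.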

Step 6: Test your setup

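Now you’re ready to test your setup by running a Hive or Spark job. For example, you might submit a Hive query as an EMR step like this, replacing the cluster ID j-2AXXXXXXGAPLF with your own and region in the S3 path with your cluster’s AWS Region: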

aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps Type=CUSTOM_JAR,Name="Hive program",Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["/usr/lib/hive/bin/hive","-e","select * from my_database.my_table"]
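
The inline shorthand above gets hard to read once a query contains quotes or commas. As an alternative, the step can be described in JSON and passed with --steps file://steps.json. The Python sketch below builds that JSON for the same Hive query; the helper name, the us-east-1 Region, and the CONTINUE failure action are assumptions for illustration.

```python
import json


def hive_query_step(query, region, name="Hive program"):
    """Describe an EMR step that runs one Hive query through script-runner."""
    return {
        "Type": "CUSTOM_JAR",
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # assumed; pick what suits your cluster
        "Jar": f"s3://{region}.elasticmapreduce/libs/script-runner/script-runner.jar",
        "Args": ["/usr/lib/hive/bin/hive", "-e", query],
    }


# Example Region only; use the Region your cluster runs in.
steps = [hive_query_step("select * from my_database.my_table", region="us-east-1")]
steps_json = json.dumps(steps, indent=2)
```

Saving steps_json to steps.json and running aws emr add-steps --cluster-id <your-cluster-id> --steps file://steps.json submits the same step without any shell quoting inside the query.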

Next, we will delve into how we can augment the capabilities of AWS Glue by integrating it with Atlan, a robust platform for data collaboration and metadata management.

Let’s explore this further.

Augmenting AWS Glue capabilities through Atlan

To integrate AWS Glue with Atlan, you’ll need to follow two steps:

Create an IAM policy
Select an authentication mechanism

Let’s explore each step further.

Step 1: Create an IAM policy

Start by creating an IAM policy. The policy should be structured as follows, with <region> replaced by your Glue instance’s AWS region and <account_id> replaced by your account ID:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "glue:GetTables",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetDatabase",
        "glue:SearchTables",
        "glue:GetTableVersions",
        "glue:GetTableVersion",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetUserDefinedFunctions",
        "glue:GetUserDefinedFunction"
      ],
      "Resource": [
        "arn:aws:glue:<region>:<account_id>:tableVersion/*/*/*",
        "arn:aws:glue:<region>:<account_id>:table/*/*",
        "arn:aws:glue:<region>:<account_id>:catalog",
        "arn:aws:glue:<region>:<account_id>:database/*"
      ]
    }
  ]
}
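
Filling in the placeholders by hand across four ARNs invites typos. The Python sketch below renders the same read-only policy from a Region and account ID. The function name is hypothetical, the example values are placeholders, and the only validation is a basic check that the account ID is 12 digits.

```python
import json

READ_ACTIONS = [
    "glue:GetTables", "glue:GetDatabases", "glue:GetTable", "glue:GetDatabase",
    "glue:SearchTables", "glue:GetTableVersions", "glue:GetTableVersion",
    "glue:GetPartition", "glue:GetPartitions",
    "glue:GetUserDefinedFunctions", "glue:GetUserDefinedFunction",
]

RESOURCE_TEMPLATES = [
    "arn:aws:glue:{region}:{account_id}:tableVersion/*/*/*",
    "arn:aws:glue:{region}:{account_id}:table/*/*",
    "arn:aws:glue:{region}:{account_id}:catalog",
    "arn:aws:glue:{region}:{account_id}:database/*",
]


def render_glue_read_policy(region, account_id):
    """Fill the region and account ID into the read-only Glue policy."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12-digit numbers")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": READ_ACTIONS,
                "Resource": [
                    template.format(region=region, account_id=account_id)
                    for template in RESOURCE_TEMPLATES
                ],
            }
        ],
    }


# Placeholder Region and account ID.
glue_policy = render_glue_read_policy("eu-west-1", "123456789012")
glue_policy_json = json.dumps(glue_policy, indent=2)
```

The resulting glue_policy_json can be pasted into the IAM console’s JSON editor when creating the policy.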

Step 2: Select an authentication mechanism

After creating the policy, you need to select an authentication method.

You have three options:

User-based authentication
Role-based authentication
Role delegation-based authentication

Let’s explore each authentication method further.

User-based authentication

Create an AWS IAM user and attach the policy you created to this user. Once the user is created, download the user’s access key ID and secret access key.

Note that the secret access key can only be downloaded at the moment it is created, so store it securely.

Role-based authentication

Alternatively, you can use role-based authentication. In this case, attach the policy created to the EC2 role that Atlan uses for its EC2 instances in the EKS cluster.

Role delegation-based authentication

To set this up, raise a support ticket with Atlan to get the ARN of the Node Instance Role for your Atlan EKS cluster. Then, create a new role in your AWS account and attach the policy created earlier to it. You’ll also need to create a trust relationship that allows the Node Instance Role to assume this new role.

If you want to use an external ID for added security, generate it within Atlan and add it to the role’s trust policy.
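
For orientation, a trust relationship of this kind generally follows the standard IAM cross-account pattern. The Python sketch below builds such a trust policy, with an optional external ID condition. It is a generic illustration rather than Atlan’s documented template: the helper name, the role ARN, and the external ID value are placeholders.

```python
import json


def build_trust_policy(node_instance_role_arn, external_id=None):
    """Allow the given role to assume this role, optionally requiring an external ID."""
    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": node_instance_role_arn},
        "Action": "sts:AssumeRole",
    }
    if external_id:
        # The caller must present this exact value when assuming the role.
        statement["Condition"] = {"StringEquals": {"sts:ExternalId": external_id}}
    return {"Version": "2012-10-17", "Statement": [statement]}


# Placeholder ARN and external ID; use the values provided by Atlan.
trust_policy = build_trust_policy(
    "arn:aws:iam::111122223333:role/example-node-instance-role",
    external_id="example-external-id",
)
trust_policy_json = json.dumps(trust_policy, indent=2)
```

The external ID condition means that knowing the role ARN alone is not enough to assume the role.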

Next, let’s look at open-source metadata management options outside of the AWS ecosystem.

Exploring open-source metadata management options for AWS

Let’s take a look at the open-source metadata management alternatives that let you overcome the extensibility issues with AWS Glue:

DataHub: An open-source metadata platform created by LinkedIn for finding, understanding, and using organizational data.
Amundsen: A data discovery and metadata tool created by Lyft.
OpenMetadata: An open-source metadata store that offers APIs for discovering and managing metadata.

If you want to remain within the AWS infrastructure but are keen on exploring alternatives to AWS Glue, the AWS Marketplace is an excellent resource. It lists a variety of managed open-source and enterprise solutions, all of which can be smoothly integrated with your existing AWS setup.

The AWS Marketplace contract structure - Source: AWS.

If you’re seeking enterprise-grade solutions, consider Atlan, which is built on open source and open by default.

Atlan was recognized as a Leader in The Forrester Wave™: Enterprise Data Catalogs, Q3 2024, and in the G2 Spring 2023 Grid® Reports for Data Governance, Machine Learning Data Catalog, and Data Quality. This recognition makes it a compelling off-the-shelf choice for metadata management in AWS.

Wrapping up

AWS offers several options for metadata management. AWS Glue Data Catalog is the built-in choice for cataloging data across your AWS ecosystem. However, it falls short in areas such as flexibility and extensibility.

This article covered the setup process for AWS Glue, augmenting it with Atlan, and then suggested open-source and off-the-shelf enterprise alternatives to AWS Glue. Choose the tool that best fits the needs of your organization, the size of your team, and specific use cases.