Working with Apache Iceberg and AWS Glue: A Complete Guide [2025]

Updated March 7, 2025


AWS first supported Apache Iceberg in Amazon Athena, bringing ACID transactions to the query engine. Since then, Apache Iceberg support has expanded to more than ten AWS services, and AWS Glue is one of them.


In this article, you will see how AWS Glue integrates with Apache Iceberg for various use cases – data integration, ETL, and even cataloging. You will also learn about a few other related components that come into the picture with this integration, especially the AWS Glue Data Catalog as a backend catalog for Iceberg.

Finally, you will explore the need for a metadata control plane and how it can help you get the most value out of your data ecosystem.


Table of Contents #

  1. Apache Iceberg and AWS Glue: What’s the scope of this integration?
  2. Apache Iceberg and AWS Glue: Why you need a metadata control plane for this integration
  3. Apache Iceberg and AWS Glue: Summing up
  4. Working with Apache Iceberg and AWS Glue: Related reads

Apache Iceberg and AWS Glue: What’s the scope of this integration? #

While AWS Glue works with Apache Iceberg in many ways, there are two central use cases:

  • ETL and integration tool: For AWS-native Spark or Python-based data pipelines that read and write from Iceberg objects stored in Amazon S3
  • Catalog for Apache Iceberg: As a REST-based alternative to other Apache Iceberg catalogs, such as the Hive Metastore, a JDBC catalog on MySQL/PostgreSQL (self-managed or on Amazon RDS), or DynamoDB

Both use cases are extremely important, regardless of whether or not you’re dealing with AWS-native services. Also, the catalog integration fits right in if you’re using AWS Glue as the ETL tool.

On the other hand, if you are using another data platform like Snowflake or Databricks, those platforms integrate with the AWS Glue Data Catalog too.

In the following sections, we will explore both of these use cases – AWS Glue as the ETL and integration tool, and the AWS Glue Data Catalog as the catalog for Iceberg – and see how each integrates with Apache Iceberg.

AWS Glue: The serverless ETL and integration tool #


AWS Glue supports working with various data lake frameworks, such as Delta Lake, Apache Iceberg, and Apache Hudi. The main intention is to help you read and write data in Amazon S3 using the table format of your choice.

Iceberg is a key ingredient for building modern data lakes and data lakehouses. There’s a need for reading and writing Iceberg data objects natively from a tool that brings data in, transforms it, and writes it to the destination. That’s where AWS Glue can help, as shown in the snippets below.

# Creating an Iceberg table and registering it in the Glue Data Catalog

dataFrame.createOrReplaceTempView("<tmp_tbl_name>")

create_table_query = f"""
CREATE TABLE glue_catalog.<db_name>.<tbl_name>
USING iceberg
TBLPROPERTIES ('format-version'='2')
AS SELECT * FROM <tmp_tbl_name>
"""
spark.sql(create_table_query)

# Reading an Iceberg table registered in the Glue Data Catalog using Glue Context

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# additional_options can carry extra read options for the Iceberg table;
# an empty dict works for a plain read
additional_options = {}

dataFrame = glueContext.create_data_frame.from_catalog(
    database="<db_name>",
    table_name="<tbl_name>",
    additional_options=additional_options
)

# Reading an Iceberg table registered in the Glue Data Catalog using Spark

dataFrame = spark.read.format("iceberg").load("glue_catalog.<db_name>.<tbl_name>")
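Writing goes through the same catalog. Here is a minimal sketch using Spark's DataFrameWriterV2 API, assuming the Glue-backed catalog is named glue_catalog and the database/table placeholders are filled in:

# Appending a DataFrame to an Iceberg table registered in the Glue Data Catalog

dataFrame.writeTo("glue_catalog.<db_name>.<tbl_name>").append()

# Or create (or replace) the table from the DataFrame in one step:
# dataFrame.writeTo("glue_catalog.<db_name>.<tbl_name>").createOrReplace()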

AWS Glue also allows you to work directly with Lake Formation, which governs data written natively to an Amazon S3 data lake. Starting with version 3.0, AWS Glue natively supports Apache Iceberg and other open table formats; you can choose between this native integration and a custom Iceberg version, in which case you attach all the relevant dependencies to the Glue job yourself.
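If you opt for the native integration, enabling Iceberg on a Glue 3.0+ job comes down to job parameters. Below is a sketch of what that configuration typically looks like; the bucket, prefix, and catalog name are placeholders, so adjust them to your environment:

# Glue job parameters enabling the native Iceberg integration (Glue 3.0+)

--datalake-formats iceberg
--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.warehouse=s3://<bucket>/<prefix>
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO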

AWS Glue Data Catalog: The technical catalog for Iceberg tables #


Iceberg is a table format with an integrated component – a technical catalog that maintains critical metadata about table structure, snapshots, and so on. Iceberg also supports plugging in an external catalog, and that's where the AWS Glue Data Catalog comes into the picture.

When you use AWS Glue Data Catalog with Iceberg, the following things happen:

  • An Iceberg namespace gets mapped to a Glue Database
  • An Iceberg table gets mapped to a Glue Table
  • A version of an Iceberg table gets stored as a TableVersion in Glue

This lays the foundation of creating a data lakehouse in AWS.

Data lakehouse in AWS - Source: Noise.
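To see this mapping in action, here is a minimal sketch, assuming a Spark session already configured with a Glue-backed catalog named glue_catalog (as shown in the next snippet); the database and table names are hypothetical:

# Creating an Iceberg namespace and table through a Glue-backed catalog

spark.sql("CREATE NAMESPACE IF NOT EXISTS glue_catalog.sales")  # appears as a Glue Database

spark.sql("""
CREATE TABLE IF NOT EXISTS glue_catalog.sales.orders (
    order_id bigint,
    order_ts timestamp
) USING iceberg
""")  # appears as a Glue Table; each subsequent commit adds a TableVersion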

Configuring AWS Glue as the data catalog for an Iceberg job is simple. You specify the correct Iceberg-Spark runtime package along with configuration that sets the name, type, warehouse, and io-impl of the catalog, as shown in the snippet below.

# start Spark SQL client shell

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.8.1,org.apache.iceberg:iceberg-aws-bundle:1.8.1 \
    --conf spark.sql.defaultCatalog=my_catalog \
    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/key/prefix \
    --conf spark.sql.catalog.my_catalog.type=glue \
    --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO

You might also need Amazon DynamoDB as a lock manager: its optimistic locking ensures atomic commits to Iceberg tables. This is only necessary if you're using an AWS SDK version earlier than 2.17.131; with later versions, AWS Glue takes care of optimistic locking itself.
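If you fall into that category, Iceberg's AWS module provides a DynamoDB-based lock manager that you can enable with two extra catalog properties – a sketch, with the DynamoDB table name as a placeholder:

# Additional spark-sql --conf entries for DynamoDB-based optimistic locking

    --conf spark.sql.catalog.my_catalog.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager \
    --conf spark.sql.catalog.my_catalog.lock.table=<dynamodb_lock_table>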

Once you run the spark-sql command shown earlier, your Glue job and its linked objects will read from and write to the Glue-backed Iceberg catalog by default, unless you explicitly reference another catalog in the job.


Apache Iceberg and AWS Glue: Why you need a metadata control plane for this integration #

While the AWS Glue Data Catalog serves as a REST catalog for Iceberg and, in doing so, brings data from various sources together, its function is still that of an internal, technical metadata catalog. It's not the most effective way to interface and interact with your data sets across the organization, and that's where the need for a control plane for metadata arises.

A metadata control plane sits horizontally across your organization's data ecosystem. It brings metadata from all your data assets together in one place – not just for cataloging, but also for governing, profiling, analyzing, and thoroughly using them.

Atlan is a metadata control plane for your data, metadata, and AI assets. It addresses your data discovery, cataloging, lineage, collaboration, governance, and documentation needs, bringing everything under one roof.


Apache Iceberg and AWS Glue: Summing up #

AWS Glue is a key data service that many customers use for its serverless Python and Spark environments, which support data integration and ETL at small to large scales. Apache Iceberg is equally important to organizations that value vendor freedom and want tooling interoperability at the core of their systems design.

This article guided you through the AWS Glue + Apache Iceberg integration, pointing out its strengths and limitations. It also explained that to enable organization-wide data discovery, cataloging, governance, lineage, and collaboration, you need a control plane for metadata. A platform like Atlan can perform all of the above functions for your organization, irrespective of the data tools and technologies you use.

Learn more about the Apache Iceberg + Atlan integration in the official documentation.


