Working with Apache Iceberg and AWS Glue: A Complete Guide [2025]
AWS first added Apache Iceberg support to bring ACID transactions to Amazon Athena. Since then, Iceberg has been integrated with more than ten AWS services, and AWS Glue is one of them.
In this article, you will see how AWS Glue integrates with Apache Iceberg for various use cases – data integration, ETL, and even cataloging. You will also learn about a few other related components that come into the picture with this integration, especially AWS Glue Data Catalog as a backend catalog for Iceberg.
Finally, you will explore the need for a metadata control plane and how it can help you get the most value out of your data ecosystem.
Table of Contents #
- Apache Iceberg and AWS Glue: What’s the scope of this integration?
- Apache Iceberg and AWS Glue: Why you need a metadata control plane for this integration
- Apache Iceberg and AWS Glue: Summing up
- Working with Apache Iceberg and AWS Glue: Related reads
Apache Iceberg and AWS Glue: What’s the scope of this integration? #
While AWS Glue works with Apache Iceberg in many ways, there are two central use cases:
- ETL and integration tool: For AWS-native Spark or Python-based data pipelines that read and write from Iceberg objects stored in Amazon S3
- Catalog for Apache Iceberg: As a REST-based alternative to other Apache Iceberg catalogs, such as the Hive Metastore, a JDBC catalog backed by Amazon RDS or a self-managed MySQL/PostgreSQL database, or DynamoDB
Both use cases are important regardless of whether the rest of your stack is AWS-native. And if you're already using AWS Glue as your ETL tool, the catalog integration fits right in.
On the other hand, if you are using another data platform like Snowflake or Databricks, they integrate with AWS Glue Data Catalog too. Here’s how:
- Snowflake: Integrates with AWS Glue Data Catalog as a catalog option in Snowflake’s Apache Polaris-based catalog, Snowflake Open Catalog
- Databricks: Integrates with AWS Glue Data Catalog and also brings other Hive metastores along with the Glue Catalog via Glue Federation in Unity Catalog
In the following sections, we will explore both use cases – AWS Glue as an ETL and integration tool, and AWS Glue Data Catalog as a catalog for Iceberg – and see how each integration works.
AWS Glue: The serverless ETL and integration tool #
AWS Glue supports working with various data lake frameworks, such as Delta Lake, Apache Iceberg, and Apache Hudi. The main intention is to help you read and write data in Amazon S3 using the table format of your choice.
Iceberg is a key ingredient for building modern data lakes and data lakehouses. There’s a need for reading and writing Iceberg data objects natively from a tool that brings data in, transforms it, and writes it to the destination. That’s where AWS Glue can help, as shown in the snippets below.
# Create an Iceberg table from a Spark DataFrame and register it in the Glue Data Catalog
dataFrame.createOrReplaceTempView("<tmp_tbl_name>")

create_table_query = f"""
    CREATE TABLE glue_catalog.<db_name>.<tbl_name>
    USING iceberg
    TBLPROPERTIES ("format-version"="2")
    AS SELECT * FROM <tmp_tbl_name>
"""
spark.sql(create_table_query)
# Read an Iceberg table registered in the Glue Data Catalog using GlueContext
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Optional Iceberg read options; an empty dict reads the current snapshot
additional_options = {}

dataFrame = glueContext.create_data_frame.from_catalog(
    database="<db_name>",
    table_name="<tbl_name>",
    additional_options=additional_options
)
# Reading an Iceberg table registered in the Glue Data Catalog using Spark
dataFrame = spark.read.format("iceberg").load("glue_catalog.<db_name>.<tbl_name>")
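Because Iceberg's Spark connector also accepts read options, the same load can time travel to an earlier snapshot. Here's a minimal sketch using Iceberg's documented as-of-timestamp read option; the timestamp value is a placeholder:
# Time travel: read the table as of a point in time (epoch milliseconds)
dataFrame = (
    spark.read.format("iceberg")
    .option("as-of-timestamp", "1700000000000")  # placeholder timestamp
    .load("glue_catalog.<db_name>.<tbl_name>")
)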
AWS Glue also works with AWS Lake Formation, so you can apply fine-grained access controls while writing data natively to an Amazon S3 data lake. Glue additionally lets you choose between its native Iceberg integration and a custom Iceberg version, where you attach the relevant dependency JARs to the Glue job yourself. Since version 3.0, AWS Glue supports Apache Iceberg and other open table formats natively, so the custom route is mainly useful when you need an Iceberg version that Glue doesn't ship with.
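Enabling the native integration is a matter of job configuration: per the AWS Glue documentation, you set the --datalake-formats job parameter to iceberg and pass the catalog configuration through the --conf parameter. The warehouse path below is a placeholder:
# Glue job parameters for the native Iceberg integration (Glue 3.0+)
--datalake-formats iceberg
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.glue_catalog.warehouse=s3://<your-warehouse-dir>/ \
    --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
    --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO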
AWS Glue Data Catalog: The technical catalog for Iceberg tables #
Iceberg is a table format with an integrated component – an internal technical catalog that maintains critical metadata about table structure, snapshots, and so on. Iceberg also supports plugging in an external catalog of your choice, and that's where the AWS Glue Data Catalog comes into the picture.
When you use AWS Glue Data Catalog with Iceberg, the following things happen:
- An Iceberg namespace gets mapped to a Glue Database
- An Iceberg table gets mapped to a Glue Table
- A version of an Iceberg table gets stored as a TableVersion in Glue
This lays the foundation for building a data lakehouse in AWS.
Data lakehouse in AWS - Source: Noise.
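You can observe this mapping directly with the AWS SDK. Below is a minimal boto3 sketch; the database and table names are hypothetical:
import boto3

glue = boto3.client("glue")

# An Iceberg table surfaces as an ordinary Glue table whose parameters
# point at the current Iceberg metadata file in S3
table = glue.get_table(DatabaseName="db_example", Name="tbl_example")
print(table["Table"]["Parameters"].get("table_type"))        # "ICEBERG"
print(table["Table"]["Parameters"].get("metadata_location"))

# Each Iceberg commit is recorded as a new Glue TableVersion
versions = glue.get_table_versions(DatabaseName="db_example", TableName="tbl_example")
print(len(versions["TableVersions"]))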
Defining AWS Glue as the data catalog for an Iceberg job is simple. You need to define the correct Iceberg-Spark runtime package along with the configuration that specifies the name, type, warehouse, and io-impl of the catalog, as shown in the snippet below.
# start the Spark SQL client shell with a Glue-backed Iceberg catalog
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.8.1,org.apache.iceberg:iceberg-aws-bundle:1.8.1 \
    --conf spark.sql.defaultCatalog=my_catalog \
    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/key/prefix \
    --conf spark.sql.catalog.my_catalog.type=glue \
    --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
If you're using an AWS SDK version older than 2.17.131, you might also need Amazon DynamoDB as a lock manager to guarantee atomic commits. With newer SDK versions, the Glue catalog handles optimistic locking itself.
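For those older SDK versions, Iceberg's DynamoDB lock manager can be enabled with two extra catalog properties; the lock table name below is a placeholder:
# Additional catalog properties for DynamoDB-based locking (older AWS SDKs)
--conf spark.sql.catalog.my_catalog.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager \
    --conf spark.sql.catalog.my_catalog.lock.table=myGlueLockTable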
Once you start the shell with the spark-sql command shown earlier, your session reads from and writes to the Glue-backed Iceberg catalog by default, unless you explicitly reference another catalog in the job.
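With spark.sql.defaultCatalog set, unqualified table names resolve through the Glue catalog. Here's a quick sketch of what that looks like from a Spark session configured this way; the database and table names are hypothetical:
# Unqualified names below resolve to my_catalog, i.e., the Glue Data Catalog
spark.sql("CREATE DATABASE IF NOT EXISTS db_example")
spark.sql("""
    CREATE TABLE IF NOT EXISTS db_example.tbl_example (id bigint, data string)
    USING iceberg
""")
spark.sql("INSERT INTO db_example.tbl_example VALUES (1, 'a')")
spark.sql("SELECT * FROM db_example.tbl_example").show()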
Apache Iceberg and AWS Glue: Why you need a metadata control plane for this integration #
While the AWS Glue Data Catalog serves as the REST catalog for Iceberg and, in doing so, brings metadata from various sources together, it still functions as an internal, technical metadata catalog. It's not the most effective way for people across the organization to discover and interact with data sets, and that's where the need for a metadata control plane arises.
A metadata control plane sits horizontally across your organization's data ecosystem. It brings the metadata from all your data assets together in one place, not just for cataloging but also for governing, profiling, analyzing, and putting that data to full use.
Atlan is a metadata control plane for your data, metadata, and AI assets. It addresses your data discovery, cataloging, lineage, collaboration, governance, and documentation needs, bringing everything under one roof.
Apache Iceberg and AWS Glue: Summing up #
AWS Glue is a key data service for many customers because its serverless Python and Spark environments handle data integration and ETL workloads from small to large scale. Apache Iceberg, in turn, matters to organizations that value vendor freedom and want tooling interoperability at the core of their systems design.
This article guided you through the AWS Glue + Apache Iceberg integration, pointing out its strengths and limitations. It also explained that to enable organization-wide data discovery, cataloging, governance, lineage, and collaboration, you need a control plane for metadata. A platform like Atlan can perform all of the above functions for your organization, irrespective of the data tools and technologies you use.
Learn more about the Apache Iceberg + Atlan integration in the official documentation.
Working with Apache Iceberg and AWS Glue: Related reads #
- Apache Iceberg: All You Need to Know About This Open Table Format in 2025
- Apache Iceberg Data Catalog: What Are Your Options in 2025?
- Apache Iceberg Tables Data Governance: Here Are Your Options in 2025
- Apache Iceberg Alternatives: What Are Your Options for Lakehouse Architectures?
- Apache Parquet vs. Apache Iceberg: Understand Key Differences & Explore How They Work Together
- Apache Hudi vs. Apache Iceberg: 2025 Evaluation Guide on These Two Popular Open Table Formats
- Apache Iceberg vs. Delta Lake: A Practical Guide to Data Lakehouse Architecture
- Polaris Catalog from Snowflake: Everything We Know So Far
- Polaris Catalog + Atlan: Better Together
- Snowflake Horizon for Data Governance
- What does Atlan crawl from Snowflake?
- Snowflake Cortex for AI & ML Analytics: Here’s Everything We Know So Far
- Snowflake Copilot: Here’s Everything We Know So Far About This AI-Powered Assistant
- How to Set Up Data Governance for Snowflake: A Step-by-Step Guide
- How to Set Up a Data Catalog for Snowflake: A Step-by-Step Guide
- Snowflake Data Catalog: What, Why & How to Evaluate
- AI Data Catalog: Exploring the Possibilities That Artificial Intelligence Brings to Your Metadata Applications & Data Interactions
- What Is a Data Catalog? & Do You Need One?