Apache Iceberg Data Catalog: What Are Your Options in 2025?
An Apache Iceberg data catalog acts as the single source of truth for assets using the Iceberg table format. The catalog is responsible for letting you create, drop, rename, and alter any table within its purview.
For any query or compute engine like Dremio, Snowflake, or Trino to perform these operations, you’ll first need to initialize and configure a catalog in Iceberg. With Iceberg, you can set up an internal technical data catalog like AWS Glue Data Catalog, Project Nessie, or a custom JDBC catalog, or build one using Iceberg’s REST API. This flexibility helps you integrate your Iceberg assets with the rest of the data ecosystem.
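To make this concrete, here is a minimal sketch of those lifecycle operations using the pyiceberg Python client. The catalog name, REST endpoint, and table identifiers are hypothetical placeholders, and the exact connection properties depend on which catalog implementation you pick (more on the options below):

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType

# Placeholder connection details -- adjust for your own deployment.
catalog = load_catalog(
    "analytics",
    type="rest",
    uri="http://localhost:8181",  # hypothetical REST catalog endpoint
)

# The catalog owns the table lifecycle: create, rename, drop.
catalog.create_namespace("apac")

schema = Schema(
    NestedField(field_id=1, name="order_id", field_type=LongType(), required=True),
    NestedField(field_id=2, name="status", field_type=StringType(), required=False),
)
table = catalog.create_table("apac.orders_raw", schema=schema)

catalog.rename_table("apac.orders_raw", "apac.orders")
catalog.drop_table("apac.orders")
```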
This article will take you through the role of a data catalog in Apache Iceberg, along with the various types of technical catalogs that can work with Apache Iceberg. Toward the end, you will also learn about the need for a control plane for metadata, where we’ll talk about the integration between Iceberg and Atlan.
Table of Contents #
- Apache Iceberg data catalog: What is it and how does it work?
- Apache Iceberg data catalog: Types of technical catalogs available
- Choosing the right Apache Iceberg data catalog
- Integrating Apache Iceberg with Atlan to set up a control plane for data, metadata, and AI
- Summing up
- FAQs on Apache Iceberg data catalog
- Apache Iceberg data catalog: Related reads
What is an Apache Iceberg data catalog and how does it work? #
The data catalog in Apache Iceberg is a technical data catalog, acting as the center and facilitator for all table-related operations. Iceberg maintains all the metadata for a table in a single metadata.json file per table, and the catalog tracks the pointer to the current version of that file.
In other words, it is the single source of truth regarding the state of your tables.
Also, read → All you need to know about Apache Iceberg in 2025
Apache Iceberg data catalog is responsible for the following:
- Tracking table names and organizing them in namespaces (a schema or database can be a namespace)
- Maintaining the location of the current metadata file, across the several versions of that file that accumulate as the table changes
To maintain the location of the current metadata, Apache Iceberg uses a persistent tree data structure – it preserves the previous versions of itself when modified, allowing access to any historical version.
In Apache Iceberg, updates to table structures are done in an immutable fashion, i.e., every change triggers the creation of a new file, for example, v2.metadata.json. The new file is then atomically swapped in for the old one, for example, v1.metadata.json.
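Under concurrent writers, this swap has to be atomic: only one writer can move the pointer away from a given version. Here is a minimal conceptual sketch of that compare-and-swap idea in Python. It illustrates the principle only and is not Iceberg’s actual code; real catalogs delegate the atomic check-and-set to a database transaction, a conditional write, or a similar primitive:

```python
# Conceptual sketch only -- not Iceberg's actual implementation.
# A catalog commit is a compare-and-swap (CAS) on the metadata pointer.

def commit_metadata(pointers: dict, table: str, expected: str, new: str) -> bool:
    """Move `table`'s pointer from `expected` to `new`, or fail."""
    if pointers.get(table) != expected:  # another writer committed first
        return False                     # caller must re-read and retry
    pointers[table] = new                # pointer now names the new file
    return True

pointers = {"apac.orders_raw": "v1.metadata.json"}

# The first writer wins the swap...
assert commit_metadata(pointers, "apac.orders_raw",
                       "v1.metadata.json", "v2.metadata.json")

# ...a second writer still based on v1 is rejected and must retry.
assert not commit_metadata(pointers, "apac.orders_raw",
                           "v1.metadata.json", "v3.metadata.json")
```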
It is important to note that the technical data catalog doesn’t store any of the data itself; it only maintains a pointer to the current metadata file. The snapshots, manifest lists, and manifest files that describe the table are stored in Iceberg’s metadata layer.
Here’s a detailed example of a metadata.json file:

```json
{
  "format-version": 2,
  "table-uuid": "2fccc490-64d9-4d94-a142-ecf64ba7e4fc",
  "location": "s3://analytics/apac/orders_raw/",
  "last-updated-ms": 1739067679893,
  "schemas": [
    {
      "schema-id": 1,
      "fields": [
        {"id": 1, "name": "order_id", "type": "long"},
        {"id": 2, "name": "customer_id", "type": "long"},
        {"id": 3, "name": "order_datetime", "type": "string"},
        {"id": 4, "name": "amount", "type": "string"},
        {"id": 5, "name": "status", "type": "string"}
      ]
    }
  ],
  "partition-specs": [
    {
      "spec-id": 1,
      "fields": [
        {"source-id": 3, "transform": "year", "name": "year"},
        {"source-id": 3, "transform": "month", "name": "month"}
      ]
    }
  ],
  "default-spec-id": 1,
  "properties": {
    "write.format.default": "parquet",
    "commit.retry.num-retries": "5"
  },
  "current-snapshot-id": 1002,
  "snapshots": [
    {
      "snapshot-id": 1002,
      "parent-snapshot-id": 1001,
      "timestamp-ms": 1739067679893,
      "summary": {"operation": "append", "added-data-files": "13"},
      "manifest-list": "s3://analytics/apac/orders_raw/manifest-list-1002.avro"
    },
    {
      "snapshot-id": 1001,
      "timestamp-ms": 1739063179212,
      "summary": {"operation": "create"},
      "manifest-list": "s3://analytics/apac/orders_raw/manifest-list-1001.avro"
    }
  ],
  "refs": {"main": {"snapshot-id": 1002, "type": "branch"}}
}
```
As you can see, the metadata.json file holds a lot of information, but most of it consists of pointers to the locations where, say, a snapshot, a manifest list, or the table data itself is stored. With the flexibility of Iceberg, you can maintain this file in various locations, depending on the Iceberg catalog you end up implementing.
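To make the pointer-following concrete, here is a hedged sketch of how you might inspect this from Python with pyiceberg, reusing the hypothetical catalog endpoint from earlier:

```python
from pyiceberg.catalog import load_catalog

# Hypothetical endpoint, as before.
catalog = load_catalog("analytics", type="rest", uri="http://localhost:8181")

table = catalog.load_table("apac.orders_raw")

# The catalog hands back the location of the current metadata file...
print(table.metadata_location)

# ...and that file, in turn, points at snapshots and their manifest lists.
for snapshot in table.metadata.snapshots:
    print(snapshot.snapshot_id, snapshot.manifest_list)
```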
The next section will discuss the different types of catalogs and when to choose which one.
What are the different types of Apache Iceberg technical catalogs? #
There are two broad types of technical catalogs in Apache Iceberg – file-based and service-based catalogs.
File-based catalogs use files to maintain the metadata pointers. Only one catalog falls into this category: the Hadoop Catalog. Despite its name, the Hadoop Catalog works with any file system, not just HDFS.
Service-based catalogs depend on external services to maintain the metadata pointers. In this option, an external service stores the actual metadata.json and sits between the query engine and Apache Iceberg. There are many examples of service-based catalogs, such as Hive Metastore, AWS Glue, Project Nessie, JDBC, and REST.
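The distinction is easiest to see in engine configuration. Below is an illustrative PySpark sketch that registers one catalog of each flavor; the catalog names, warehouse path, and metastore URI are placeholders, and it assumes the Iceberg Spark runtime jar is on the classpath:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-catalogs")
    # File-based: the "catalog" is just a directory tree of metadata files
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://analytics-warehouse/")
    # Service-based: an external service (here, a Hive Metastore)
    # holds the metadata pointers
    .config("spark.sql.catalog.hms", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hms.type", "hive")
    .config("spark.sql.catalog.hms.uri", "thrift://metastore.internal:9083")
    .getOrCreate()
)
```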
With its 0.14.0 release in July 2022, Iceberg also started allowing catalog implementations using the Iceberg REST API.
Snowflake created a fully featured, multi-engine, interoperable catalog using the Iceberg REST API. Snowflake’s Polaris catalog is now an incubating Apache project. It works with Apache Doris, Apache Flink, Dremio, and Trino, among others.
With so many options to implement the catalog, it can be confusing to decide which one fits the bill for your use cases. Let’s tackle that in the next section.
How do you choose the right Apache Iceberg data catalog? #
There are a few questions that need answers before you can pick the right technical catalog for Apache Iceberg. These include (but aren’t limited to):
- Where does the catalog store the metadata.json file?
- What are the concurrency restrictions and locking mechanisms that the backend storage supports?
- Which compute or query engines can the catalog support? In other words, does the catalog have multi-engine support?
- Does the catalog allow Git-like version control to have rollback, time travel, and branching?
The following table answers these questions to make it easy to choose the correct technical data catalog for your setup.
Technical catalog | Storage backend | Engine support | Version control | Transactions, locking, and atomicity
---|---|---|---|---
Hadoop | HDFS (or any file system) | Spark, Hive | No | No locking or atomic commits
Hive Metastore | Apache Derby, MySQL, PostgreSQL, etc. | Spark, Trino, Flink, Presto, Hive | No | Atomic transactions and row-level locking
JDBC | Any relational database | Spark, Trino, Flink, Presto, Hive | No | Atomic transactions with row-level locking
REST | Any object store like S3 | Spark, Trino, Flink, Presto, Hive | Not by default, but Polaris (which uses the Iceberg REST API) has full version control | CAS (compare-and-swap) implementation for atomic commits
AWS Glue | AWS DynamoDB | Spark, Trino, Flink, Presto, Hive | No | Atomic transactions using the DynamoDB TransactWriteItems API
Nessie | Any versioned object store like S3 | Spark, Trino, Flink, Presto, Hive | Yes | Atomic transactions
In addition to these features, your use case and current data setup also matter when choosing the catalog. Let’s look at some scenarios:
- If it is a Hive-based ecosystem, you might want to go with the Hive Metastore to ensure the highest level of compatibility.
- If you are on an AWS-based data stack using services like Redshift, EMR, and Athena, using Glue might be the way to go.
- You might want to go with the REST option for the highest level of flexibility in terms of cloud platform and query engine support. As mentioned earlier, Polaris is a fully featured implementation of Apache Iceberg’s REST API (see the configuration sketch after this list).
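One practical consequence of Iceberg’s design is that switching between these options is largely a configuration change rather than a data migration. As an illustration (all endpoints and warehouse locations below are placeholders), the same pyiceberg call can target any of the three scenarios above:

```python
from pyiceberg.catalog import load_catalog

# Placeholder endpoints -- substitute your own.
CATALOG_CONFIGS = {
    # Hive-based ecosystem: reuse the existing metastore
    "hive": {"type": "hive", "uri": "thrift://metastore.internal:9083"},
    # AWS-based stack: let Glue track the metadata pointers
    "glue": {"type": "glue", "warehouse": "s3://analytics-warehouse/"},
    # Maximum flexibility: any REST-spec catalog, such as Polaris
    "rest": {"type": "rest", "uri": "https://polaris.example.com/api/catalog"},
}

catalog = load_catalog("analytics", **CATALOG_CONFIGS["rest"])
print(catalog.list_namespaces())
```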
Another thing to note is that the platforms covered above are technical catalogs meant to maintain and manage Apache Iceberg’s internal features and functionality. Such catalogs aren’t intended for direct consumption by your wider team of data engineers, analysts, and other business users.
Moreover, these catalogs don’t offer (nor are they supposed to offer) key business capabilities like data governance, quality assurance, and lineage. For these capabilities to be implemented organization-wide, irrespective of the underlying data stack, you need a control plane for metadata.
That’s where Atlan and its integration with Apache Iceberg’s REST-based catalog Polaris can help.
How can you integrate Apache Iceberg with Atlan for a unified control plane? #
Atlan was one of the first data and AI governance platforms to integrate with Snowflake’s Polaris Catalog (now open-sourced as an Apache incubating project). This integration promises interoperability, which, in turn, allows enterprises to keep up with and adopt best-in-class tools without getting locked into any single vendor.
These principles of openness and interoperability align with Atlan’s mission too.
Also, read → Atlan’s Open API documentation
Atlan’s integration with Polaris means you can bring Iceberg tables from everywhere in your business into Atlan as long as they flow through Polaris.
Having said that, Atlan is not just a catalog for Iceberg or Polaris; it is a catalog for many other data sources. That means you can bring data assets from everywhere into Atlan and manage them in one place, something you cannot do with Iceberg or Polaris alone.
Atlan brings all the metadata in one place, creating a metadata lake that powers the metadata platform. Whether making your organization’s data AI-ready, democratizing data consumption across the business, or enforcing privacy and governance, Atlan’s metadata control plane supports diverse end-users, use cases, and applications.
Summing up #
In this article, we looked into the structure and purpose of an Iceberg catalog and the various implementation options you have for the catalog. We also covered the key differences between these catalogs, especially those related to multi-engine support, ACID transactions, and version control.
Finally, you saw how Atlan empowers data teams by integrating with Polaris, giving them an easy way to bring in and manage all their Iceberg tables from Atlan. To learn more, check out the details of the Polaris + Atlan integration.
FAQs on Apache Iceberg data catalog #
What Is an Apache Iceberg Data Catalog and How Does It Work? #
The Apache Iceberg data catalog serves as the central repository for managing metadata related to Iceberg tables. It helps track table names, schemas, and historical versions of tables through its metadata files. It acts as the “single source of truth” for Iceberg assets and facilitates operations like creating, dropping, or altering tables.
What Are the Different Types of Apache Iceberg Data Catalogs? #
Apache Iceberg supports two main types of catalogs: file-based and service-based. File-based catalogs, like Hadoop, store metadata directly in files, while service-based catalogs (e.g., AWS Glue, Hive Metastore, or REST APIs) use external services to manage metadata. Each type has different features, such as multi-engine support and version control.
How Can You Integrate Apache Iceberg with Atlan for Data Management? #
Atlan integrates with Apache Iceberg’s REST-based catalogs (e.g., Polaris) to create a unified control plane for data, metadata, and AI. This integration allows organizations to bring all their Iceberg tables into Atlan, centralizing metadata management and enhancing governance, security, and data democratization across the business.
What Are the Benefits of Using Apache Iceberg with Atlan’s Control Plane? #
Integrating Apache Iceberg with Atlan’s control plane provides a comprehensive solution for metadata management, data governance, and AI readiness. It allows businesses to manage Iceberg tables alongside other data sources in one unified platform, ensuring better collaboration, compliance, and the ability to scale data operations without being tied to a single vendor.
Apache Iceberg data catalog: Related reads #
- Apache Iceberg: All You Need to Know About This Open Table Format in 2025
- Polaris Catalog from Snowflake: Everything We Know So Far
- Polaris Catalog + Atlan: Better Together
- Snowflake Horizon for Data Governance
- What does Atlan crawl from Snowflake?
- Snowflake Cortex for AI & ML Analytics: Here’s Everything We Know So Far
- Snowflake Copilot: Here’s Everything We Know So Far About This AI-Powered Assistant
- How to Set Up Data Governance for Snowflake: A Step-by-Step Guide
- How to Set Up a Data Catalog for Snowflake: A Step-by-Step Guide
- Snowflake Data Catalog: What, Why & How to Evaluate
- AI Data Catalog: Exploring the Possibilities That Artificial Intelligence Brings to Your Metadata Applications & Data Interactions
- What Is a Data Catalog? & Do You Need One?