AWS Glue Catalog: An overview
AWS Glue Data Catalog is the key data catalog service offered by AWS. It integrates closely with AWS data services such as Redshift, S3, Athena, EMR, and Lake Formation, among others.
It’s only when you go beyond your AWS stack to full-fledged data platforms like Databricks or Snowflake, or use other tools for orchestration, transformation, etc., that the limitations of Glue Data Catalog show up.
For teams operating entirely within AWS, it is a capable and cost-effective starting point. For teams running hybrid or multi-cloud environments, or those maturing toward governed, collaborative data platforms, Glue Data Catalog requires significant augmentation to meet enterprise needs.
Understanding the six primary limitations of the AWS Glue Data Catalog
Let’s now take a look at how AWS Glue Catalog’s limitations affect various data workflows in an organization, and why they exist. We’ll look at possible mitigation paths in a later section.
1. Specialized data catalog for the AWS data landscape with limited connector support
AWS Glue Data Catalog is not a general-purpose data catalog. It is a technical metadata aggregator with seamless integrations with AWS services and limited support for connecting to external systems.
If you’ve got a multi-cloud or multi-platform setup for your organization, you’ll first find it hard to overcome the siloing and fragmentation caused by the different catalogs native to different systems.
Even if you solve that problem, external systems won’t suddenly become first-class citizens in AWS Glue. Limited connector support will still make it hard to sync and keep metadata up to date.
2. Crawler-based schema inference is limited and untrustworthy at times
AWS Glue uses crawlers to infer the schema of your data, but the approach varies by format.
- For CSV files, the crawler reads the first 1,000 records or the first megabyte, whichever comes first.
- For JSON files, it reads the first megabyte and can go up to 10 megabytes in increments.
- For Parquet files, it reads the schema directly from the file.
This sampling approach has an obvious limitation: errors, or potential errors, that sit outside the sampled portion can be missed, especially when entire data partitions are skipped.
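To make the sampling risk concrete, here is a minimal sketch in plain Python. It is not how the Glue crawler is implemented; `infer_schema` and its `sample_rows` parameter are hypothetical names used only to show how a type change that appears after the sampled rows goes unnoticed.

```python
import csv
import io
from typing import Dict, Optional

# Types ordered from most specific to most general.
ORDER = ["int", "double", "string"]


def infer_type(value: Optional[str]) -> str:
    """Classify a single CSV cell as int, double, or string."""
    for caster, name in ((int, "int"), (float, "double")):
        try:
            caster(value)
            return name
        except (ValueError, TypeError):
            continue
    return "string"


def widen(current: Optional[str], new: str) -> str:
    """Merge two inferred types, keeping the more general one."""
    if current is None:
        return new
    return max(current, new, key=ORDER.index)


def infer_schema(csv_text: str, sample_rows: Optional[int] = None) -> Dict[str, str]:
    """Infer column types from the first `sample_rows` rows (all rows if None)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    schema: Dict[str, Optional[str]] = {name: None for name in reader.fieldnames or []}
    for i, row in enumerate(reader):
        if sample_rows is not None and i >= sample_rows:
            break
        for column in schema:
            schema[column] = widen(schema[column], infer_type(row.get(column)))
    return {column: t or "string" for column, t in schema.items()}


# 1,000 clean rows followed by one row with a non-numeric amount.
rows = ["order_id,amount"] + [f"{i},{i * 10}" for i in range(1000)] + ["1000,N/A"]
data = "\n".join(rows)

sampled = infer_schema(data, sample_rows=1000)  # sees only the clean rows
full = infer_schema(data)                       # sees the bad row as well
print(sampled["amount"], full["amount"])        # int string
```

The sampled pass reports `amount` as `int`, while a full scan has to widen it to `string`. Any job that trusts the sampled schema would fail, or silently produce nulls, on the last row.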
Running crawlers frequently to keep the schema up to date is expensive, and even then, they offer little control over how your data is crawled. As a result, teams end up maintaining their own separate schema manifests and table definitions using other tools.
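As an illustration of what such a team-maintained manifest check might look like, the sketch below compares an expected schema with a table definition shaped like the response of Glue's `GetTable` API (`Table.StorageDescriptor.Columns`, each entry with `Name` and `Type`). The manifest format and the `diff_schema` function are assumptions made for this example, and the table definition is hard-coded rather than fetched from AWS.

```python
from typing import Dict, List


def diff_schema(manifest: Dict[str, str], glue_table: dict) -> Dict[str, List[str]]:
    """Compare an expected {column: type} manifest with a Glue table definition."""
    columns = glue_table["Table"]["StorageDescriptor"]["Columns"]
    catalog = {c["Name"]: c["Type"] for c in columns}
    shared = set(manifest) & set(catalog)
    return {
        "missing_in_catalog": sorted(set(manifest) - set(catalog)),
        "unexpected_in_catalog": sorted(set(catalog) - set(manifest)),
        "type_mismatch": sorted(n for n in shared if manifest[n] != catalog[n]),
    }


# Hypothetical manifest kept in version control by the data team.
manifest = {"order_id": "bigint", "amount": "double", "created_at": "timestamp"}

# Hard-coded stand-in for a GetTable response.
glue_table = {
    "Table": {
        "Name": "orders",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "amount", "Type": "string"},
                {"Name": "note", "Type": "string"},
            ]
        },
    }
}

report = diff_schema(manifest, glue_table)
print(report)
```

A check like this can run in CI or after each crawler run, so that a crawler-inferred type change is flagged before downstream jobs pick it up.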
3. Native lineage is dependent on Amazon DataZone
AWS Glue Data Catalog itself doesn’t track lineage metadata. To get lineage, you can use Amazon DataZone (now part of SageMaker Catalog) or use Glue 5.0’s built-in support for OpenLineage to capture and send lineage events to DataZone. Once set up, lineage from Glue tables added to DataZone can be captured automatically.
That said, getting to a position where lineage works automatically requires setting up DataZone, IAM permissions, and lineage events on Glue jobs. Moreover, lineage only works for Spark DataFrames; it doesn’t work for Glue DynamicFrames yet.
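For readers who haven’t seen one, a lineage event in the OpenLineage format is a small JSON document describing a job run and its input and output datasets. The sketch below builds one with the standard library only. It is not what Glue emits internally: `build_run_event`, the `glue-jobs` namespace, the producer URI, and the bucket names are all illustrative assumptions, and the schema URL should be checked against the OpenLineage spec version you target.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import List, Tuple

# Assumed values for illustration only.
PRODUCER = "https://example.com/my-lineage-emitter"
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def build_run_event(
    job_name: str,
    inputs: List[Tuple[str, str]],
    outputs: List[Tuple[str, str]],
    event_type: str = "COMPLETE",
    namespace: str = "glue-jobs",
) -> dict:
    """Build a minimal OpenLineage-style run event as a plain dictionary."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        # Each dataset is identified by a (namespace, name) pair.
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
    }


event = build_run_event(
    job_name="orders_daily",
    inputs=[("s3://raw-bucket", "orders/")],
    outputs=[("s3://curated-bucket", "orders_clean/")],
)
print(json.dumps(event, indent=2))
```

Lineage tools stitch events like this together by matching the output datasets of one job to the input datasets of another.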
4. Governance enforcement needs other services and remains siloed and fragmented
AWS organizations and accounts are great for SecOps and FinOps requirements, but they come at a cost to the data landscape. AWS Glue Data Catalog doesn’t have built-in cross-account data governance; you need Lake Formation for that purpose.
Even with Lake Formation, governance metadata doesn’t work seamlessly across accounts:
- LF-Tags don’t propagate to consuming accounts.
- Each account has to create and maintain its own tags independently.
- Cross-region setups complicate things even further.
- There is no single pane of glass to view your organization’s GRC posture.
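Because each account maintains its own tags, drift between accounts is easy to introduce. A minimal sketch of a drift check follows, assuming you have already collected each account’s tag definitions in the shape of the `LFTags` list returned by Lake Formation’s `ListLFTags` API (`TagKey` and `TagValues`). The `lf_tag_drift` function, the account IDs, and the sample tags are hypothetical.

```python
from typing import Dict, List


def lf_tag_drift(producer_tags: List[dict], consumer_tags: List[dict]) -> Dict[str, dict]:
    """Report tag keys or values defined in the producer but absent in the consumer."""
    producer = {t["TagKey"]: set(t["TagValues"]) for t in producer_tags}
    consumer = {t["TagKey"]: set(t["TagValues"]) for t in consumer_tags}
    drift: Dict[str, dict] = {}
    for key, values in producer.items():
        if key not in consumer:
            drift[key] = {"status": "missing_key", "missing_values": sorted(values)}
        elif values - consumer[key]:
            drift[key] = {
                "status": "missing_values",
                "missing_values": sorted(values - consumer[key]),
            }
    return drift


# Hard-coded stand-ins for each account's ListLFTags output.
producer_tags = [
    {"CatalogId": "111111111111", "TagKey": "classification",
     "TagValues": ["internal", "pii", "public"]},
    {"CatalogId": "111111111111", "TagKey": "domain", "TagValues": ["sales"]},
]
consumer_tags = [
    {"CatalogId": "222222222222", "TagKey": "classification",
     "TagValues": ["internal", "public"]},
]

drift = lf_tag_drift(producer_tags, consumer_tags)
print(drift)
```

Here the consumer account lacks the `pii` value and the entire `domain` key, which is the kind of gap that otherwise surfaces only when an access policy behaves unexpectedly.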
5. No collaboration features for data teams
AWS Glue Data Catalog is built for pipeline execution and metadata registration, not for the humans who need to understand, annotate, and discuss data assets. There are no native features for adding business context to tables or columns, leaving comments or questions on data assets, managing annotation workflows, or flagging data quality concerns for team review. In practice, this pushes collaboration into Slack threads, Confluence pages, and spreadsheets, thereby fragmenting the context that should live alongside the data itself.
6. No out-of-the-box support for data mesh
AWS Glue Data Catalog organizes data into databases and tables. That structure works for pipeline-centric workflows, but doesn’t map to data mesh concepts. There is no native support for data domains, data products, or the ownership and discoverability model that data mesh requires.
As a result, teams implementing data mesh on AWS typically have to layer additional tooling on top of Glue, or bypass it entirely, to achieve the domain-oriented architecture and self-serve data access that data mesh promises.
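One lightweight form this layering can take is a naming or tagging convention on top of Glue’s own structures. The sketch below groups tables by a `domain` entry in each table’s `Parameters` map, using records shaped like the `TableList` returned by Glue’s `GetTables` API. The `domain` parameter key is a hypothetical convention, not a Glue feature, and `group_by_domain` is an illustrative helper.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_domain(tables: List[dict], key: str = "domain") -> Dict[str, List[str]]:
    """Group fully qualified table names by a domain label in table parameters."""
    domains: Dict[str, List[str]] = defaultdict(list)
    for table in tables:
        # Tables without the label fall into an explicit "unassigned" bucket.
        domain = table.get("Parameters", {}).get(key, "unassigned")
        domains[domain].append(f'{table["DatabaseName"]}.{table["Name"]}')
    return {d: sorted(names) for d, names in sorted(domains.items())}


# Hard-coded stand-ins for GetTables results across two databases.
tables = [
    {"DatabaseName": "raw", "Name": "orders", "Parameters": {"domain": "sales"}},
    {"DatabaseName": "curated", "Name": "orders_clean", "Parameters": {"domain": "sales"}},
    {"DatabaseName": "raw", "Name": "tickets", "Parameters": {"domain": "support"}},
    {"DatabaseName": "raw", "Name": "tmp_export"},
]

by_domain = group_by_domain(tables)
print(by_domain)
```

A convention like this gives a rough domain view, but ownership, data product contracts, and discovery still have to be built and enforced outside the catalog.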
How can you solve the limitations of AWS Glue Data Catalog?
Some limitations, such as service quotas, language support, and crawler accuracy, will improve over time as new features roll out. They also have workarounds that use other AWS services.
Other limitations are more structural and are unlikely to be fixed by a Glue version upgrade. These include:
- Multi-platform cataloging, governance, and lineage aren’t easily possible right now. AWS Glue Data Catalog is quite specific to AWS data services, as one would expect.
- Lack of support for lineage and lineage-based impact analysis is another limitation that doesn’t have a direct solution. Using DataZone and SageMaker Catalog can help, but they also share some limitations.
- Lack of propagation of crucial metadata that should be set in one place and used everywhere, such as compliance policies, data classifications, tags, etc.
- Limited connector ecosystem for metadata exchange. Connectors exist, but metadata exchange and synchronization between systems are both limited.
These issues stem from the lack of a unified context plane to consolidate all metadata from across your organization. The key issues are data silos and fragmented ecosystems. Once these are resolved, other problems can be tackled as well.
Atlan is a context plane built with a key idea in mind: bring all the metadata in one place and activate it. Let’s take a look at how Atlan helps mitigate some of the limitations we’ve discussed above.
Using Atlan to mitigate the AWS Glue Data Catalog’s limitations
As mentioned earlier, Glue Data Catalog works really well with AWS services. A tool like Atlan is needed to bring together metadata from several tools, including AWS Glue Data Catalog, especially when your organization uses multiple tools in its data stack.
Some of the key features of Atlan that help you mitigate AWS Glue Data Catalog’s limitations are as follows:
- A mature ecosystem of connectors supporting a wide variety of data tools, both within and outside AWS.
- Domain-based organization, persona-based search and discovery.
- End-to-end granular data lineage with the ability to edit and synchronize it with other systems.
- Two-way synchronization and propagation of attributes like tags, policies, and classifications.
- Automated data quality monitoring leading to better and live trust signals.
- A business glossary that works across all the tools in the data stack of your organization.
- Governance that both uses AI and covers the AI assets within your organization.
Atlan’s control plane, combined with the wealth of information stored and purposefully organized in Atlan’s metadata lakehouse, enables an organization to get the most out of its metadata for all use cases, including data governance, quality, security, discovery, and lineage.
Let’s take a look at how Atlan’s customers leverage the tool to break down silos and fragmentation in metadata.
Real stories from real customers using Atlan with AWS
From scattered metadata to unified intelligence: How Nasdaq governs 140B events daily
"We needed visibility across our entire AWS data infrastructure—S3 lakes, Glue transformations, Redshift warehouses, and QuickSight dashboards. Atlan gave us that end-to-end view while letting Glue continue doing what it does best for our AWS services."
Data Platform Team
Nasdaq
🎧 Listen to podcast: Nasdaq's data transformation with Atlan
Moving forward with a sovereign context layer for your data and AI ecosystem
AWS Glue Data Catalog works really well with AWS-native environments. It is serverless, cost-effective, and deeply integrates with other AWS services. However, it comes with limitations. These limitations are most pronounced when you have to work across regions, platforms, and external tools in your data landscape.
Some of these limitations are mitigated by using other services, such as DataZone and Lake Formation, but other key issues remain unsolved by any of these tools. This is where a unified metadata control plane comes into the picture to consolidate and assimilate all the metadata from all of your organization’s systems in one place, a metadata lakehouse.
See how a unified context plane can extend your AWS stack.
FAQs about AWS Glue Catalog limitations
1. What are the key limitations of the AWS Glue Data Catalog?
AWS Glue Data Catalog is a technical metadata catalog that works primarily with AWS native services while also supporting non-AWS data systems in a limited fashion. Governance, lineage, and discovery aren’t within Glue Data Catalog’s purview, so it doesn’t have any of these features built in.
2. Does the Glue Data Catalog work with non-AWS data sources?
It does, but in a limited way. You can find connectors, but they usually come with their own quirks and complications. Most connectors are good enough for basic data exchange, but when it comes to metadata for governance, quality, and especially lineage, none of them work quite that well.
3. Can the Glue Data Catalog track data lineage?
Glue Data Catalog doesn’t natively support lineage, but Glue 5.0 has OpenLineage support built into its ETL engine, which can help you capture lineage. You can then visualize that lineage using Amazon DataZone.
4. Does the Glue Data Catalog support a data mesh architecture?
AWS Glue Data Catalog doesn’t natively support data mesh constructs like domains, data products, and data asset ownership. For that, you’ll need to use a metadata control and context plane like Atlan, which natively supports these constructs.
5. Can the Glue Data Catalog maintain business context?
To a very limited extent, it can, but Glue Data Catalog is, first and foremost, a technical metadata catalog. It doesn’t have a built-in mechanism to create business glossaries, ontologies, and semantic layers. You’ll need an external tool to do that.