Atlan vs. Apache Atlas: What to Consider When Evaluating?
Last Updated on: May 16th, 2023, Published on: May 12th, 2023
Atlan is an active metadata platform built on open standards and open-source technologies including Apache Atlas. Atlan was first used as an internal tool at Social Cops, where it was battle-tested through more than 100 data projects that improved access to energy, healthcare, and education.
Apache Atlas is a prominent open-source data governance and metadata framework. Hortonworks and Cloudera created it in 2014 to manage their enterprise data platform built on Apache Hadoop. The Apache Software Foundation later incubated this project.
This article takes you through the capabilities, architecture, and cost of ownership of both tools, among other things, so that you can better understand which one to choose for your use cases.
Table of contents
- Atlan vs. Apache Atlas: How do they compare in terms of core features?
- Things you must consider while evaluating these data catalogs
- Atlan vs Apache Atlas: Related Reads
What is Atlan?
Atlan is built around the idea of providing businesses with a single source of truth by being a collaborative workspace where data teams handle all things data: discovery, quality, governance, lineage, and documentation.
Atlan can be deployed on AWS with a high level of isolation using Kubernetes. With its rich data cataloging and discovery features, security, and reliability, various companies like Autodesk, Postman, Monster, WeWork, Nasdaq, Unilever, Juniper Networks, and Ralph Lauren have chosen Atlan as their metadata management platform. Atlan deployment in Google Cloud and Azure is also on the roadmap.
What is Apache Atlas?
Apache Atlas is an open-source project which supports the core features of a data cataloging, discovery, and governance engine. Although it was created with the Apache Hadoop ecosystem in mind, Apache Atlas now supports a wide range of data sources outside that ecosystem too.
With the core features being search, discovery, governance, lineage, and security, Apache Atlas is one of the most evolved open-source data catalogs out there. Various companies like New York Life Insurance, JP Morgan Chase, InMobi, and Target, have trusted Apache Atlas to take care of metadata management at scale.
Atlan vs. Apache Atlas: How do they compare in terms of core features?
- Data discovery
- Data lineage
- Data governance
- Active metadata
1. Data discovery
Search and discovery on a metadata platform should feel intuitive for it to be useful for business users. A natural language search interface comes in handy to deliver such an experience. Atlan uses Elasticsearch, whereas Apache Atlas uses Apache Solr for full-text search.
Atlan takes the search experience to the next level by mimicking the feel of an online shopping website, where you would not only search for a product but also use rich filters and advanced sorting techniques to go through products. The same is done here with data assets. In addition, Atlan embeds trust signals from integrations with tools like Slack to guide you better in your search queries.
Apache Atlas gives you a barebones search experience with the ability to perform full-text searches. It does, though, allow you to filter based on business taxonomy. Beyond that, Apache Atlas provides a DSL search option and an option to search for relationships (powered by JanusGraph).
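To make the difference between full-text and DSL search concrete, here is a minimal sketch that builds request URLs for the Apache Atlas v2 REST search endpoints. The base URL, the `hive_table` type, and the search terms are illustrative assumptions, and authentication is left out; treat it as a starting point rather than a complete client.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumption: a local Atlas instance on its default port.
ATLAS_BASE_URL = "http://localhost:21000"


def basic_search_url(query: str, type_name: Optional[str] = None, limit: int = 10) -> str:
    """Build a URL for the full-text (basic) search endpoint."""
    params = {"query": query, "limit": limit}
    if type_name:
        # Optionally restrict results to one entity type.
        params["typeName"] = type_name
    return f"{ATLAS_BASE_URL}/api/atlas/v2/search/basic?{urlencode(params)}"


def dsl_search_url(dsl_query: str, limit: int = 10) -> str:
    """Build a URL for the DSL search endpoint."""
    params = {"query": dsl_query, "limit": limit}
    return f"{ATLAS_BASE_URL}/api/atlas/v2/search/dsl?{urlencode(params)}"


# Illustrative queries; the asset names are made up.
print(basic_search_url("sales", type_name="hive_table"))
print(dsl_search_url('hive_table where name = "sales"'))
```

Sending a GET request to either URL (with valid credentials) would return matching entities as JSON. The DSL form is what lets you filter on attributes instead of relying on free text.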
2. Data lineage
Both Apache Atlas and Atlan support column-level data lineage. However, Apache Atlas has limited lineage support across data sources, especially for assets such as dashboards. Line Corp, for example, had to spend considerable engineering time and effort building and enhancing column-level lineage features on top of its Apache Atlas deployment.
Atlan’s automated data lineage features include automated parsing of SQL queries and scripts, and automated propagation of tags, classifications, and column descriptions to data assets both upstream and downstream.
Atlan’s in-line actions also help with metadata enrichment: you can add business and technical context to your data assets and data lineage, create alerts for failed or problematic data assets, create Jira tickets, ask questions on Slack, and perform impact analysis.
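Impact analysis boils down to walking the lineage graph downstream from the asset that changed. The sketch below illustrates the idea with a breadth-first traversal over a hand-written lineage map. The asset names and the `LINEAGE` structure are hypothetical and do not reflect either tool's internal model.

```python
from collections import deque

# Hypothetical lineage: each key feeds the assets listed as its value.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["marts.revenue", "marts.customer_ltv"],
    "marts.revenue": ["dashboard.weekly_revenue"],
    "marts.customer_ltv": [],
    "dashboard.weekly_revenue": [],
}


def downstream_assets(lineage, start):
    """Return every asset reachable downstream of `start`, in breadth-first order."""
    seen = {start}
    queue = deque([start])
    impacted = []
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # guard against revisiting shared descendants
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


# Which assets are affected if staging.orders breaks?
print(downstream_assets(LINEAGE, "staging.orders"))
```

A catalog would typically run this kind of traversal over lineage it has captured automatically, then attach the result to an alert or ticket.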
3. Data governance
Features like data asset classification, propagation of classifications via data lineage, fine-grained data security, and authorization power Apache Atlas’s data governance capabilities. Apache Ranger manages authorization and data masking within Apache Atlas. Atlan also uses it to control some aspects of authorization and data masking, but Atlan’s data governance offering is much more power-packed.
Atlan allows users to run Playbooks to auto-identify PII, HIPAA, and GDPR data. Tags and classifications within Atlan are automatically propagated to dependent data assets. Moreover, you can also customize governance based on personas, purposes, and compliance requirements.
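Propagating a classification through lineage can be pictured as copying a tag from a source asset to everything derived from it. The following is a simplified sketch of that idea, assuming downstream-only propagation and a hypothetical column-level lineage map; it is not how either product implements propagation.

```python
from collections import deque


def propagate_tags(lineage, tags):
    """Copy each asset's tags to every asset downstream of it.

    Returns a new dict of asset -> set of tags; the inputs are not modified.
    """
    result = {asset: set(asset_tags) for asset, asset_tags in tags.items()}
    for source, source_tags in tags.items():
        queue = deque(lineage.get(source, []))
        seen = set()
        while queue:
            asset = queue.popleft()
            if asset in seen:
                continue
            seen.add(asset)
            # Dependent assets inherit the source's tags.
            result.setdefault(asset, set()).update(source_tags)
            queue.extend(lineage.get(asset, []))
    return result


# Hypothetical column-level lineage and one manually tagged column.
column_lineage = {
    "crm.users.email": ["staging.users.email"],
    "staging.users.email": ["marts.contacts.email"],
}
manual_tags = {"crm.users.email": {"PII"}}

print(propagate_tags(column_lineage, manual_tags))
```

Tagging one source column is enough for the dependent columns to inherit the classification, which is what keeps governance policies consistent as pipelines grow.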
Atlan has been big on shifting left and AI-led governance for tackling data governance problems. During one of their recent events, they teased the beta release of Atlan AI, which promises to significantly enhance governance workflows with automation and generative AI capabilities.
4. Embedded collaboration
Although it comes with many useful features, Apache Atlas largely lacks collaboration features. The Apache Atlas web interface does not integrate with productivity and task management tools like Slack and Jira.
In contrast, Atlan achieves embedded collaboration by forming deep integrations with those tools and more. This means that you can work seamlessly with the tools you use every day and collaborate effectively with your team without leaving Atlan.
5. Active metadata
Metadata activation is one of the critical features of Atlan. Atlan’s active metadata is always-on, intelligent, and action-oriented. What this means is that the metadata in Atlan is always available, ready to be used, and connected in a way that allows the deduction of meaning or intelligence.
How does active metadata manifest in your data flows? Here are some examples:
One of the most annoying things you see in badly-managed data catalogs is the high frequency of stale and unused data assets. You can ask people to clean that mess up, but it will be costly and error-prone. Instead, you can use Atlan to purge stale metadata based on custom rules.
Freshness and accuracy are both critical metrics for data assets. A data catalog doesn’t add much value if you cannot identify which data assets comply with data quality rules to be classified as usable. Your metadata should be able to indicate both of these. Atlan’s automation-based approach allows you to perform all these tasks with ease.
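As a rough illustration of rule-based metadata automation, the sketch below classifies assets as fresh, outdated, or stale from two timestamps. The `AssetMetadata` fields, the thresholds, and the labels are all assumptions made for this example and are not Atlan's actual rules or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class AssetMetadata:
    name: str
    last_updated: datetime  # when the data was last refreshed
    last_queried: datetime  # when anyone last read the asset


def classify(asset, now, max_age=timedelta(days=1), max_idle=timedelta(days=90)):
    """Apply two custom rules: an idle threshold and a freshness threshold."""
    if now - asset.last_queried > max_idle:
        return "stale"  # unused for too long: candidate for purging
    if now - asset.last_updated > max_age:
        return "outdated"  # still used, but fails the freshness rule
    return "fresh"


now = datetime(2023, 5, 16)
assets = [
    AssetMetadata("marts.revenue", datetime(2023, 5, 15, 12), datetime(2023, 5, 15)),
    AssetMetadata("marts.customer_ltv", datetime(2023, 5, 10), datetime(2023, 5, 15)),
    AssetMetadata("tmp.old_export", datetime(2022, 12, 1), datetime(2023, 1, 1)),
]
for asset in assets:
    print(asset.name, classify(asset, now))
```

Running rules like these on a schedule, instead of asking people to audit the catalog by hand, is the kind of task that metadata-driven automation is meant to take over.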
Atlan vs. Apache Atlas: Things you must consider while evaluating these data catalogs
- Managed vs. self-hosted tools
- Ease of setting up
- Integration with other data tools
- The actual cost of deploying an open-source tool
1. Managed vs. self-hosted tools
Open-source tools are superb, but it gets tough to manage them, especially in their nascent stage with limited community support and long feature development and bug resolution cycles. Without substantial engineering effort, you won’t be able to manage an open-source tool. Therefore, it is usually wise to use a managed, SaaS-based tool.
Apache Atlas is an open-source data cataloging and governance tool, while Atlan is an enterprise data catalog. Enterprise data catalogs like Atlan are managed products. With feature development, design, security, disaster recovery, and infrastructure all taken care of, managed tools make a lot of sense, especially when you are looking to invest in something for the long term.
On the other hand, open-source, self-hosted tools are suitable for smaller projects unless, as mentioned earlier, you have the engineering capacity and expertise to deploy and maintain those tools.
2. Ease of setting up
Unlike most open-source tools, which offer quickstart and scalable deployments using Docker and Kubernetes, Apache Atlas still requires you to build and install it using Apache Maven. Although the installation steps are straightforward, they can be hard to follow if you’ve never worked with build tools.
Apache Atlas gives you a fair bit of freedom to configure different backends, including Apache HBase, Apache Solr, BerkeleyDB, Apache Cassandra, etc. This is just the installation. Setting it up with numerous data sources, enabling data lineage, setting up access controls, SSO, etc., can take several weeks, maybe even months.
Atlan takes a much simpler approach to installation. Setting up Atlan end-to-end can take as little as one week. It comes with a pre-configured Apache Ranger to manage access and policies for the metastore.
A wide range of connectors within Atlan ensures that integrating data sources is a cakewalk. Backups and disaster recovery are set up out of the box. All the persistent data stores, such as Cassandra, Elasticsearch, PostgreSQL, and more, are backed up every day, which makes the RPO (recovery point objective) 24 hours.
3. Integration with other data tools
Apache Atlas was designed for the Hadoop ecosystem, which is why most integrations happen through Apache Hive, Apache HBase, Apache Flink, Apache Kafka, and so on. For instance, you cannot directly connect Snowflake, AWS Redshift, or Azure Synapse Analytics using a JDBC connector with Apache Atlas. You’ll instead have to use the Apache Hive hook to get data into Apache Atlas. Updates to Apache Hive would come via Apache Kafka.
Atlan’s integrations, on the other hand, are purpose-built. These integrations for partners like dbt and Snowflake enable Atlan to treat domain-specific metadata, such as dbt metrics, as first-class citizens.
For instance, Atlan’s dbt + GitHub integration allows you to preemptively detect breaking changes before they are pushed to your Git repository. Atlan is also the first data catalog to be certified as a Snowflake Ready Technology Partner.
Risk of failure while adopting the tool
Bringing a new tool to your stack is always tricky, irrespective of whether it is self-hosted or managed. The risk of no/low adoption is very real; this risk depends on factors such as ease of setup, ease of use, integrations with existing stack, features, and more. Even with great features, open-source data catalogs need more precise and helpful documentation and technical support, among other things.
A logistics and transportation company tried implementing Apache Atlas. After spending many months on it, they decided to scrap Apache Atlas because they were facing issues with bad UI/UX, lack of integrations, and bad overall usability. They eventually shifted to Atlan, where these risks were significantly reduced because most of the UI/UX, integration, and ease-of-use issues were already taken care of.
4. The actual cost of deploying an open-source tool
Most open-source data catalogs give the impression of being easy to set up. That is true in most cases, but what you get is a very barebones installation. To do anything on top of it, i.e., to make the tool usable, you need to do a lot of work on infrastructure, authentication, authorization, and disaster recovery.
Beyond the extra effort of setting up such a tool for your organization, you also risk signing up for ongoing feature development, updates, and refinement to get people to use the tool; otherwise, there’s a high risk of failure.
Without sponsorship, open-source projects cannot afford feature development that matches enterprise tools like Atlan, which is why it may be better to choose a tool like Atlan that harnesses the power of open-source tooling but, at the same time, provides you with the reliability and assurance of an enterprise-grade product.
Apache Atlas architecture
Apache Atlas is built on top of some of the Apache Software Foundation’s most prominent open-source projects, such as Hive, Ranger, Solr, Kafka, and HBase. The only external tool used in the architecture is JanusGraph, a graph database that manages data asset relationships and lineage, among other things.
For enhancing security, Apache Atlas provides options to use one-way and two-way SSL and service authentication using Kerberos and JAAS. It also supports an extensible and pluggable authorization engine, the most popular plugin to handle authorization being Apache Ranger (also used by Atlan). You can also configure SPNEGO-based HTTP authentication on Apache Atlas.
Apache Atlas doesn’t offer high availability for the web service out of the box. To run Apache Atlas in production, you can configure high availability with the help of Apache ZooKeeper: you keep hot backups and perform a manual failover when your web service instance fails. It is also possible to change other components, such as the index store and metadata store, but that would require quite a lot of code changes, configuration, and testing before it can be used.
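When you run more than one Apache Atlas web service instance, a client or operator needs to know which instance is currently serving requests. The sketch below polls the admin status endpoint of each instance and picks the one reporting `ACTIVE`. The host names are hypothetical, authentication is omitted, and the status values assumed here (`ACTIVE`, `PASSIVE`) should be checked against your Atlas version.

```python
import json
from urllib.request import urlopen


def fetch_status(base_url, timeout=5.0):
    """Ask one Atlas instance for its HA status (authentication omitted for brevity)."""
    with urlopen(f"{base_url}/api/atlas/admin/status", timeout=timeout) as response:
        return json.load(response)["Status"]


def pick_active(statuses):
    """Return the URL of the single ACTIVE instance, or None if there is not exactly one."""
    active = [url for url, status in statuses.items() if status == "ACTIVE"]
    return active[0] if len(active) == 1 else None


# Hypothetical snapshot, as if fetch_status had been called on each instance.
snapshot = {
    "http://atlas-1.example.com:21000": "PASSIVE",
    "http://atlas-2.example.com:21000": "ACTIVE",
}
print(pick_active(snapshot))
```

Returning `None` when zero or several instances claim to be active is a deliberate choice: in both cases it is safer to stop and investigate than to send traffic to a guess.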
Atlan architecture
Atlan integrates with Loft’s Virtual Clusters to run its microservices on Kubernetes. Atlan is big on managed open-source tools like Argo. Containers for different microservices are orchestrated using Argo Workflows. Atlan also uses GitHub Actions for CI/CD at scale.
Atlan’s infrastructure is currently deployed on AWS. However, support for deployments in Google Cloud and Azure is coming soon. Using CloudCover’s ArgoCD and Loft-based implementation, Atlan emulates a single-tenant deployment in the cloud to ensure the highest data security and privacy standards.
Atlan uses a healthy mix of managed open-source and enterprise tools to ensure the high availability and reliability of the data catalog. Tools like Rancher for Kubernetes cluster management, Velero for cluster volume backups, Apache Calcite for parsing SQL, and Apache Atlas as the metadata backend are central to Atlan’s architecture.
This article showed you the main features of the open-source data cataloging and governance tool Apache Atlas and the active metadata platform, Atlan. It also discussed the various things you need to consider when choosing one data cataloging tool for your business.
When considering these tools, you should also clarify their future plans. Apache Atlas doesn’t have a public roadmap that gives you a view from afar of what’s coming next in terms of features, integrations, etc. They have a Jira board, which mainly consists of tickets for bug fixes with a few scattered tickets for new features and enhancements.
Atlan’s focus on AI-led data governance and metadata-based automation is well-known and distinguishes it from the rest. You can take both of these tools for a quick test drive to find out more. You can deploy Apache Atlas as a PoC on your own machine and also ask for a proof-of-value from Atlan!
Atlan vs Apache Atlas: Related Reads
- 5 popular open source data discovery and catalog tools to evaluate in 2023.
- Introduction to Apache Atlas : An open source metadata management and governance platform.
- A step-by-step guide to installing Apache Atlas.
- Atlan vs. DataHub : Which Tool Offers Better Collaboration and Governance Features?
- Atlan vs Amundsen : A Comprehensive Comparison of Features, Integration, Ease of Use, Governance, and Cost for Deployment
- Amundsen Vs Atlas: A comparison in architecture, data discovery features, deployment, and data observability.
- Understanding AWS Glue data catalog: Use cases, benefits and more