Databricks Unity Catalog: A Comprehensive Guide to Features, Capabilities, and Architecture
Last updated on: May 16th, 2023, Published on: April 12th, 2023
In the Databricks world, Unity Catalog has been the talk of the town for a while now; however, not many people realize that it reached general availability (GA) less than a year ago, in August 2022. Since then, it has become the preferred choice for existing Databricks customers because it ties in seamlessly with the other components of the Databricks ecosystem.
What is Databricks Unity Catalog?
Databricks Unity Catalog is a unified governance solution that enables the management of data assets. It serves as a central repository of all data assets, coupled with a data governance framework and an extensive audit log of all actions performed on the data stored in a Databricks account.
This article will take you through the features, capabilities, and technical architecture of the Unity Catalog, along with a brief history of its development at Databricks. Let’s dive right in.
Table of contents
- What is Databricks Unity Catalog?
- Why did Databricks create the Unity Catalog?
- Databricks Unity Catalog: Features and capabilities
- Databricks Unity Catalog: Components and architecture
- Databricks Unity Catalog: Concluding thoughts
- Databricks Unity Catalog: Related reads
Why did Databricks create the Unity Catalog?
Once Databricks had succeeded with the storage and processing aspects of a data platform, it started investing in components for two neglected areas: discovery and governance.
Mid-2021 saw the introduction of the Unity Catalog to solve the data governance problem. Before that, data cataloging and governance were usually outsourced to an enterprise or open-source tool.
Although there were great options in the market, the idea was to create a data cataloging, discovery, and governance solution that would work seamlessly with the Databricks ecosystem. In particular, it had to handle the various asset types that are part of the Lakehouse architecture and provide tremendous value to Databricks customers.
Databricks, in their initial release blog post, mention that they observed a lack of granular security controls for data lakes in existing tools. They also observed that existing tools were cloud-platform-specific, i.e., AWS Glue Catalog for platforms built on AWS and Azure Data Catalog for platforms built on Azure.
For all these reasons and more, Databricks ended up creating Unity Catalog, which saw a gated public preview for Azure and AWS in April 2022, and finally a GA release in August 2022.
Databricks Unity Catalog: Features and capabilities
Databricks Unity Catalog’s features can be broadly grouped into four categories: data discovery, governance, lineage, and sharing. This section will shed some light on Unity Catalog’s capabilities in each of them.
Unity Catalog brings the power of a well-structured organization of metadata along with a powerful search interface. It exposes the search metadata but restricts access to that metadata based on the logged-in user’s privileges and permissions. This ensures security at the metadata level. Here’s what the search interface looks like:
We’ll talk about data lineage in detail later, but lineage metadata also helps search and discovery by depicting relationships between several entities and layers of the data. With its data discovery features, Unity Catalog succeeds in creating a “unified and secure search experience.”
The Databricks Unity Catalog is designed to provide a search and discovery experience enabled by a central repository of all data assets, such as files, tables, views, dashboards, etc. This, coupled with a data governance framework and an extensive audit log of all the actions performed on the data stored in a Databricks account, makes Unity Catalog very attractive for businesses.
From an identity and access management perspective, Databricks identities can be users, service principals, and groups. You can establish a trust relationship between these identities and Databricks workspaces, which gives you identity federation.
Unity Catalog lets you use pure SQL to control access based on rows and columns. The next level of granularity will come with attribute-based access control, which is currently in the works but has yet to be released.
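As a rough illustration of column-level control in SQL, the Python sketch below generates a dynamic view definition that hides one column from users outside a given group. The is_account_group_member SQL function is part of Databricks SQL; the helper name dynamic_view_sql and the view, table, column, and group names are hypothetical, and this is only one possible way to express such a rule.

```python
from typing import Sequence


def dynamic_view_sql(
    view: str,
    source: str,
    columns: Sequence[str],
    masked_column: str,
    group: str,
) -> str:
    """Build a CREATE VIEW statement that nulls out one column for non-members.

    Hypothetical helper: identifiers are interpolated as-is, so only pass
    trusted names (no quoting or escaping is done here).
    """
    select_items = []
    for col in columns:
        if col == masked_column:
            # Members of `group` see the real value; everyone else sees NULL.
            select_items.append(
                f"CASE WHEN is_account_group_member('{group}') "
                f"THEN {col} ELSE NULL END AS {col}"
            )
        else:
            select_items.append(col)
    select_list = ",\n  ".join(select_items)
    return f"CREATE OR REPLACE VIEW {view} AS\nSELECT\n  {select_list}\nFROM {source}"


# Illustrative names only; none of these objects come from the article.
example_sql = dynamic_view_sql(
    view="main.hr.employees_masked",
    source="main.hr.employees",
    columns=["id", "name", "salary"],
    masked_column="salary",
    group="hr_admins",
)
print(example_sql)
```

Granting users access to the view rather than the underlying table is what makes the rule effective; a row-level rule could be sketched the same way by adding a WHERE clause that uses the same function.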
Data lineage is becoming increasingly important for several data engineering use cases, such as tracking and monitoring jobs, debugging failures, understanding complex workflows, and tracing transformation rules. Unity Catalog uses the SQL parser to extract lineage metadata from queries and from external tools like dbt and Airflow. Lineage in Unity Catalog is not limited to SQL; it is available for any code you write in your workspace.
Lineage data holds critical information about your company’s data flow, so Unity Catalog protects it from bad actors with the same governance model: access to data lineage is restricted based on the logged-in user’s privileges.
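To make the idea of privilege-aware lineage concrete, here is a minimal sketch of a table-level lineage graph whose results are filtered by what a user is allowed to read. It is a toy model under stated assumptions, not how Unity Catalog is implemented: the LineageGraph class, its methods, and the table names are hypothetical, and simply dropping unreadable tables from the result is a simplification.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set


class LineageGraph:
    """Toy table-level lineage: edges point from a source table to a target."""

    def __init__(self) -> None:
        # target table -> set of tables it was directly derived from
        self._upstream: Dict[str, Set[str]] = defaultdict(set)

    def add_edge(self, source: str, target: str) -> None:
        self._upstream[target].add(source)

    def upstream_of(self, table: str) -> Set[str]:
        """Return every table that feeds `table`, directly or transitively."""
        seen: Set[str] = set()
        queue = deque([table])
        while queue:
            current = queue.popleft()
            for parent in self._upstream.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return seen

    def visible_upstream(self, table: str, readable: Iterable[str]) -> Set[str]:
        """Upstream tables restricted to those the user may read (assumption)."""
        return self.upstream_of(table) & set(readable)


graph = LineageGraph()
graph.add_edge("main.raw.orders", "main.silver.orders")
graph.add_edge("main.raw.customers", "main.silver.orders")
graph.add_edge("main.silver.orders", "main.gold.revenue")

all_upstream = graph.upstream_of("main.gold.revenue")
analyst_view = graph.visible_upstream(
    "main.gold.revenue", readable=["main.silver.orders", "main.gold.revenue"]
)
```

Here the analyst sees only the silver table in the lineage of the gold table, because the two raw tables are outside the set of objects they can read.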
And finally, data sharing from the platform has been one of the welcome developments in the data engineering space. It gives businesses more control over how, why, and what data is being shared with whom. When such a setup is not in place, people who have access to the data download it and share it with their team manually over email, Slack, Teams, etc.
Unity Catalog’s built-in, tightly integrated method of sharing data alleviates the pain and difficulty of managing data permissions across a business. It is based on the platform-agnostic open protocol for data sharing called Delta Sharing.
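On the consumer side, the open-source Delta Sharing connector addresses a shared table with a URL of the form profile-file#share.schema.table, where the profile file holds the endpoint and credentials issued by the data provider. The sketch below only builds such a URL; the helper name sharing_table_url and all share, schema, and table names are hypothetical.

```python
def sharing_table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the '<profile>#<share>.<schema>.<table>' address for a shared table.

    Hypothetical helper; it rejects names that would make the address ambiguous.
    """
    for part in (share, schema, table):
        if not part or "." in part or "#" in part:
            raise ValueError(f"invalid name component: {part!r}")
    return f"{profile_path}#{share}.{schema}.{table}"


# Illustrative values only.
url = sharing_table_url("config.share", "sales_share", "retail", "orders")

# With the open-source connector installed (pip install delta-sharing),
# the table could then be loaded roughly like this:
#   import delta_sharing
#   df = delta_sharing.load_as_pandas(url)
```

Because the protocol is open, the recipient does not need to be a Databricks user to read the shared table.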
A highly transparent way of sharing data not only reduces the workload of your data team but also helps them monitor and control access to data with clarity.
Databricks Unity Catalog: Components and architecture
As Databricks is a closed-source, proprietary data platform, the full implementation details of the Unity Catalog are not public. Still, thorough details about the Unity Catalog object model, the backbone of it all, are available, and that will be our focus in this section.
One of the distinguishing factors of the Unity Catalog object model is that it uses a three-level namespace to address various types of data assets in the catalog. This differs from most databases and data warehouses, where you address a data asset using the schema_name.table_name format; in Unity Catalog, the equivalent is catalog_name.schema_name.table_name.
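The difference between two-level and three-level addressing can be sketched in a few lines of Python. The parse_table_name helper and the default catalog name "main" are assumptions made for illustration; the sketch also ignores backtick-quoted identifiers that contain dots.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    catalog: str
    schema: str
    table: str


def parse_table_name(name: str, default_catalog: str = "main") -> TableRef:
    """Resolve a dotted table name into catalog, schema, and table.

    Two-level names fall back to `default_catalog` (a hypothetical default).
    """
    parts = name.split(".")
    if any(not part for part in parts):
        raise ValueError(f"empty name component in {name!r}")
    if len(parts) == 3:
        return TableRef(*parts)
    if len(parts) == 2:
        return TableRef(default_catalog, *parts)
    raise ValueError(f"expected schema.table or catalog.schema.table, got {name!r}")


full = parse_table_name("prod.sales.orders")
short = parse_table_name("sales.orders")
```

The extra level means the same schema and table names can exist side by side in different catalogs, for example one catalog per environment or business unit.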
The Unity Catalog has a metastore similar to a Hive or Hive-compatible metastore, which is used by cloud-platform-specific data catalogs like AWS Glue Catalog. It adds one additional abstraction layer to enable users to categorize data assets better. The following diagram depicts how the object model is structured.
The metastore acts as the container for all your data assets that are categorized, first into multiple catalogs, then schemas, and then entities like tables, views, functions, and so on. Unity Catalog uses its custom metastore, but the metastore is compatible with Apache Hive to a large extent.
Depending on where you are deploying Databricks, i.e., on AWS, Azure, or elsewhere, your metastore will end up using a different storage backend. For instance, on AWS, your metastore will be stored in an S3 bucket. You can also choose to use an external Hive metastore instead of using the native Unity Catalog metastore if you want Apache Hive’s full functionality.
Moreover, if not all of your data infrastructure is contained within the Databricks ecosystem, you’ll probably benefit from using an external catalog, such as Atlan, which will let you integrate all your metadata sources and give you a holistic view of your data infrastructure, both in terms of discovery and governance.
Storage for data lineage
Before moving ahead, it is essential to note that the metastore is only one component, albeit a central one, of the Unity Catalog. The Unity Catalog internally stores both table-level and column-level lineage data based on your queries, workflows, CTAS statements, etc.
All of this data is stored in the metastore, which is one of the reasons the need for a custom metastore arose. Still, if you use an external Apache Hive metastore, you will be able to make some customizations and store the lineage metadata.
Audit logs, on the other hand, are delivered to a separate storage location (a different S3 bucket if you are on AWS). This means that even if you delete a metastore, audit logs will still be available for compliance purposes.
The audit logs capture all events related to the Unity Catalog. This includes the creation, deletion, and modification of all the components of the metastore, and the metastore itself. These events also cover activities related to storing and retrieving credentials, access control lists, data-sharing requests, and so on.
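Once audit logs have been delivered as JSON records, picking out the Unity Catalog events is mostly a filtering exercise. The sketch below assumes each record carries a serviceName field set to "unityCatalog" and an actionName field; treat those field names, the helper unity_catalog_actions, and the sample records (which are invented for illustration, not real log output) as assumptions to check against your own logs.

```python
import json
from collections import Counter
from typing import Iterable

# Invented sample records, one JSON document per line.
SAMPLE_LOG_LINES = [
    '{"serviceName": "unityCatalog", "actionName": "createTable", '
    '"userIdentity": {"email": "a@example.com"}}',
    '{"serviceName": "unityCatalog", "actionName": "getTable", '
    '"userIdentity": {"email": "b@example.com"}}',
    '{"serviceName": "unityCatalog", "actionName": "getTable", '
    '"userIdentity": {"email": "a@example.com"}}',
    '{"serviceName": "notebook", "actionName": "runCommand", '
    '"userIdentity": {"email": "a@example.com"}}',
]


def unity_catalog_actions(lines: Iterable[str]) -> Counter:
    """Count Unity Catalog events per action name, ignoring other services."""
    counts: Counter = Counter()
    for line in lines:
        event = json.loads(line)
        if event.get("serviceName") == "unityCatalog":
            counts[event.get("actionName", "unknown")] += 1
    return counts


action_counts = unity_catalog_actions(SAMPLE_LOG_LINES)
```

The same loop could group by the user identity instead of the action name to answer "who touched what" questions during a compliance review.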
The identity and access management model in Unity Catalog is designed with custom privileges that work on different levels of the three-level namespace in the metastore. Privileges in the Unity Catalog are inherited downwards in the hierarchy of namespaces.
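Downward inheritance can be modeled as a lookup that walks from a securable up through its parents. The following is a minimal sketch, not the actual Unity Catalog implementation: the Grants class and the principal and object names are hypothetical, and it leaves out details of the real model such as the separate USE CATALOG and USE SCHEMA privileges.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


class Grants:
    """Toy privilege store where grants apply to a securable and its children."""

    def __init__(self) -> None:
        # (principal, securable name) -> privileges granted directly on it
        self._grants: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def grant(self, principal: str, privilege: str, securable: str) -> None:
        self._grants[(principal, securable)].add(privilege)

    def has_privilege(self, principal: str, privilege: str, securable: str) -> bool:
        """Check the securable itself, then its schema, then its catalog."""
        parts = securable.split(".")
        for depth in range(len(parts), 0, -1):
            ancestor = ".".join(parts[:depth])
            if privilege in self._grants.get((principal, ancestor), set()):
                return True
        return False


grants = Grants()
# A grant on the schema `main.sales` flows down to every table inside it.
grants.grant("analysts", "SELECT", "main.sales")

can_read_orders = grants.has_privilege("analysts", "SELECT", "main.sales.orders")
can_read_hr = grants.has_privilege("analysts", "SELECT", "main.hr.employees")
can_read_catalog = grants.has_privilege("analysts", "SELECT", "main")
```

In this model a grant on a schema covers the tables under it, but it never flows upward to the catalog or sideways to another schema.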
Databricks has a workspace-level permission model that lets you control access to workspace assets like DLT pipelines, SQL warehouses, notebooks, and so on, using ACLs (Access Control Lists). These ACLs are managed by admin users and by users who have been assigned ACL management privileges.
Databricks Unity Catalog: Concluding thoughts
This article took you through the core features, use cases, and technical architecture of the Unity Catalog. In conclusion, although the Unity Catalog has been a valuable addition to the Databricks ecosystem, there might be cases when you need to bring in a third-party data catalog to sit horizontally across all of your data infrastructure, of which Databricks might be just one part.
Atlan’s REST API connects with Databricks Unity Catalog to bring in all the metadata from Databricks clusters and workspaces. Atlan also brings in data from all the other places that are not connected to the Unity Catalog. This is extremely useful if your data infrastructure is spread across multiple cloud platforms, data processing frameworks, orchestration engines, BI tools, and even on-prem databases. In that situation, you need a metadata integration and presentation layer, which is where Atlan comes in.
Databricks Unity Catalog: Related reads
- Databricks Lineage — Overview, Benefits, How to Set Up?
- Databricks Governance: What To Expect, Setup Guide, Tools
- Databricks Metadata Management — FAQs, Tools, Getting started
- AI Data Catalog: Exploring the Possibilities That Artificial Intelligence Brings to Your Metadata Applications & Data Interactions
- Machine Learning Data Catalogs: Evolution, Benefits, Business Value and Uses in 2023
- Amundsen Data Catalog: Understanding Architecture, Features, Ways to Setup & More