Databricks Data Mesh: Native Capabilities, Benefits of Integration with Atlan, and More
Data mesh is a decentralized data platform design pattern that has gained popularity in recent years. It rests on the core idea of applying product thinking to data assets. While there is no single authoritative prescription for implementing data mesh, every platform offers best practices and recommendations that align with its native features. In that spirit, this article discusses working with the data mesh pattern on Databricks, one of the major data platforms for organizations.
The data mesh pattern addresses issues caused by silos in traditional data warehouses or data lakes. It shifts from centralization to decentralization, enabling teams or business units to own their data and fostering self-service data production and consumption across the organization.
Now, let’s dive into the native Databricks capabilities that support data mesh implementation.
Table of contents #
- Data mesh capabilities in Databricks
- Implementing data mesh in Databricks with Atlan
- Best practices for leveraging data mesh with Databricks and Atlan
- Best practices and recommendations for Databricks connectivity
- Related Reads
Data mesh capabilities in Databricks #
There are four core principles of data mesh: domain-driven architecture, data as a product, self-serve data infrastructure, and federated data governance. Databricks supports all of these principles to a significant extent. You can build a domain-driven data platform using the Lakehouse architecture, with Databricks workspaces aligned to domains, each having its own ownership and access control over the underlying data.
A Databricks workspace serves as the space where data product owners publish data products and data consumers access them, creating a self-service environment. Combined with Delta Sharing, it enables cross-domain and cross-organization collaboration through features like Databricks Cleanrooms and Databricks Marketplace, all governed and secured by a federated yet centralized layer in Unity Catalog.
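To make this concrete, here is a minimal sketch of how a domain team might carve out its own catalog and publish a curated table as a data product via Delta Sharing. It assumes the databricks-sql-connector Python package; the hostname, HTTP path, token, and all object names are hypothetical placeholders, not a prescribed setup.

```python
# A minimal sketch of a domain publishing a data product with Unity Catalog
# and Delta Sharing. All connection values and object names below are
# hypothetical placeholders; adapt them to your workspace.
from databricks import sql  # pip install databricks-sql-connector

DOMAIN_DDL = [
    # Each domain owns its own catalog, giving it ownership and
    # access control over the underlying data.
    "CREATE CATALOG IF NOT EXISTS sales_domain",
    "GRANT USE CATALOG ON CATALOG sales_domain TO `sales-engineers`",
    # Publish a curated table as a data product via a Delta Share so
    # other domains (or organizations) can consume it.
    "CREATE SHARE IF NOT EXISTS sales_products",
    "ALTER SHARE sales_products ADD TABLE sales_domain.marts.orders_daily",
]

with sql.connect(
    server_hostname="adb-1234567890.0.azuredatabricks.net",  # hypothetical
    http_path="/sql/1.0/warehouses/abc123",                  # hypothetical
    access_token="dapi-REDACTED",                            # hypothetical
) as conn:
    with conn.cursor() as cursor:
        for stmt in DOMAIN_DDL:
            cursor.execute(stmt)
```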
Databricks offers the features and flexibility to support various data mesh topologies, making it adaptable for organizations of different sizes, operational models, and workflows. However, despite these robust features, there’s still a gap in achieving true self-service, collaboration, and end-to-end governance.
This is where a platform like Atlan comes in, seamlessly integrating across your data ecosystem, including Databricks. With a focus on helping you find, trust, and govern data, Atlan enables a true data mesh architecture by elevating data products to first-class citizens. In the next section, we’ll explore how.
Implementing data mesh in Databricks with Atlan #
The native Databricks approach to implementing a data mesh architecture pattern relies mostly on the Unity Catalog and Delta Sharing, along with the access control mechanisms available. While the native features provide a good standard for operating and working with data within Databricks, you can do more around collaboration, self-service, and data governance. That’s exactly where Atlan comes in!
Atlan creates a unified control plane for data, which lays the foundation for implementing the data mesh principles discussed in the previous section. It leverages the native features of Databricks and builds on top of them to make the experience more streamlined, efficient, and user-centric. Atlan is specifically built to address the core data mesh principles in the following ways:
- Decentralized domain ownership - Personas enable domain-specific organization and personalization for data users.
- Data as a product - Data products become first-class, producible and consumable assets, organized in custom hierarchies of domains and subdomains under your organization’s specific policies.
- Federated computational governance - Drive governance at scale using event-driven automation and Atlan’s open APIs.
- Self-serve data infrastructure - While Databricks manages the data platform infrastructure, Atlan enables self-service by making all data products readily available through an intuitive search and discovery interface.
To get started, you’ll need to connect Atlan with Databricks, primarily through Unity Catalog. This connection requires the BROWSE privilege so Atlan can access metadata via information_schema, the REST API, and the lineage graph. Through this connection, Atlan crawls databases, schemas, tables, views, and stored procedures. It also enables importing tags from Databricks, reverse-syncing tags back to Databricks, and extracting lineage and usage metadata.
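If you want to preview what such a crawl will surface, you can query the same Unity Catalog information_schema views directly. A minimal sketch, again assuming the databricks-sql-connector package; the connection values and catalog name are hypothetical placeholders:

```python
# A hedged sketch: preview crawlable metadata by querying the Unity Catalog
# information_schema views. Connection values are hypothetical placeholders.
from databricks import sql

with sql.connect(
    server_hostname="adb-1234567890.0.azuredatabricks.net",  # hypothetical
    http_path="/sql/1.0/warehouses/abc123",                  # hypothetical
    access_token="dapi-REDACTED",                            # hypothetical
) as conn:
    with conn.cursor() as cursor:
        cursor.execute(
            """
            SELECT table_catalog, table_schema, table_name, table_type
            FROM system.information_schema.tables
            WHERE table_catalog = 'sales_domain'  -- hypothetical catalog
            ORDER BY table_schema, table_name
            """
        )
        for row in cursor.fetchall():
            print(row)
```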
Several companies, including a Fortune 50 investment management firm, a Forbes 100 cloud technology company, and a leading digital automotive platform, use Atlan as the metadata foundation for their data stack.
One key example is Autodesk, which had a complex problem to solve for over 60 teams using distinct ownership models; that’s when Atlan came into the picture and helped them realize the full value of the data mesh architecture. Let’s look at some best practices that would help you do the same with Databricks and Atlan.
Best practices for leveraging data mesh with Databricks and Atlan #
Atlan simplifies the process with carefully curated resources on the Databricks Connectivity page. These resources guide you through setting up the Databricks-Atlan connection, crawling metadata for data assets, configuring PrivateLink (AWS, Azure), and extracting lineage metadata, among other tasks.
Examples from the Atlan community #
Atlan also has an active, engaged community that continuously shares new ways to integrate Atlan into organizational processes, workflows, and data culture. Their suggestions often focus on areas like documentation best practices, efficient project management, and improving data ingestion. Let’s explore a couple of examples.
- Guidelines around defining personas for data users based on your organization’s operating model and ways of working. This is essential for implementing role-based access, data masking, and data sharing controls, among other things.
- Ways of creating a semantic layer for metrics and a business glossary so that all users can better understand the data products; a sketch of seeding such a glossary programmatically follows below.
You can explore many more examples like these on the community page.
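As an illustration of the second example, here is a minimal sketch of seeding a business glossary programmatically with Atlan’s open-source pyatlan SDK. The glossary and term names are hypothetical, and method names may vary across pyatlan versions; treat this as a sketch rather than a definitive implementation.

```python
# A minimal sketch of seeding a business glossary with pyatlan
# (pip install pyatlan). Glossary and term names are hypothetical;
# AtlanClient reads ATLAN_BASE_URL and ATLAN_API_KEY from the environment.
from pyatlan.client.atlan import AtlanClient
from pyatlan.model.assets import AtlasGlossary, AtlasGlossaryTerm

client = AtlanClient()

# Create a glossary to hold shared metric definitions.
glossary = AtlasGlossary.creator(name="Business Metrics")
response = client.asset.save(glossary)
glossary_guid = response.assets_created(asset_type=AtlasGlossary)[0].guid

# Add a term so every consumer reads the same definition of the metric.
term = AtlasGlossaryTerm.creator(
    name="Monthly Active Users", glossary_guid=glossary_guid
)
client.asset.save(term)
```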
Best practices and recommendations for Databricks connectivity #
Efficient metadata ingestion while maintaining data security and privacy is a key principle behind Atlan’s design and architecture, which is why Atlan has strong, opinionated recommendations on these topics when connecting with Databricks.
- Use one of the three secure authentication methods that Atlan offers for connecting with Databricks: personal access token authentication, AWS service principal authentication, or Azure service principal authentication.
- In addition to the CATALOG- and SCHEMA-level permissions, configure the Atlan + Databricks connection with access to the system.access schema for extracting lineage, the CAN MANAGE permission on your SQL warehouses to mine query history, and the APPLY TAG privilege to manage your Databricks tags from Atlan (a sketch of these grants follows the list).
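To make that second recommendation concrete, here is a hedged sketch of issuing those grants with Databricks SQL from Python. The service principal name, catalog, and connection values are hypothetical placeholders; note that CAN MANAGE on a SQL warehouse is granted through the warehouse permissions UI or REST API rather than a SQL GRANT statement.

```python
# A hedged sketch of the additional grants described above, issued to a
# hypothetical Atlan service principal. Statements are illustrative; the
# principal name and catalog are placeholders.
from databricks import sql

GRANTS = [
    # Lineage extraction reads the system.access schema.
    "GRANT USE SCHEMA ON SCHEMA system.access TO `atlan-service-principal`",
    "GRANT SELECT ON SCHEMA system.access TO `atlan-service-principal`",
    # Managing Databricks tags from Atlan needs APPLY TAG on the catalog.
    "GRANT APPLY TAG ON CATALOG sales_domain TO `atlan-service-principal`",
]
# Note: the CAN MANAGE permission on SQL warehouses (for query-history
# mining) is set through the warehouse permissions UI or REST API, not SQL.

with sql.connect(
    server_hostname="adb-1234567890.0.azuredatabricks.net",  # hypothetical
    http_path="/sql/1.0/warehouses/abc123",                  # hypothetical
    access_token="dapi-REDACTED",                            # hypothetical
) as conn:
    with conn.cursor() as cursor:
        for stmt in GRANTS:
            cursor.execute(stmt)
```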
There’s much more in the official documentation that can help you connect and derive value from Atlan and Databricks.
Databricks Data Mesh: Related Reads #
- Databricks Unity Catalog: A Comprehensive Guide to Features, Capabilities, Architecture
- Data Catalog for Databricks: How To Setup Guide
- Databricks Lineage: Why is it Important & How to Set it Up?
- Databricks Governance: What To Expect, Setup Guide, Tools
- Databricks Metadata Management: FAQs, Tools, Getting Started
- Data Catalog: What It Is & How It Drives Business Value
- Snowflake Cost Optimization: Typical Expenses & Strategies to Handle Them Effectively
- Databricks Cost Optimization: Top Challenges & Strategies