Databricks Governance: What To Expect, Setup Guide, Tools
September 30th, 2022
Databricks governance is essential to get complete visibility of your lakehouse assets. In the past, governance was seen as a means of just ensuring regulatory compliance; today, it plays a more prominent role. According to Gartner, governance should “accommodate offensive capabilities that add business value, as well as defense capabilities to protect the organization”.
Governance can influence business outcomes with proper accountability and end-to-end visibility of all data assets and how they’re created, stored, consumed, and managed by an organization. As a result, organizations can work with high-quality, reliable data that’s easily accessible and understandable.
That’s why governance is central to extracting value from your Databricks lakehouse assets, besides dealing with regulatory compliance.
Here we will explore the fundamental tenets of a comprehensive governance solution, analyze data governance capabilities native to Databricks, and recognize what’s missing.
Sneak peek of the article:
- What is Databricks?
- Databricks governance: What should it offer?
- How does Databricks governance work?
- Governance in Databricks: What’s missing?
- Databricks governance with Atlan: An end-to-end, business-user-friendly solution
- Databricks data governance: What’s next?
What is Databricks?
Databricks is an enterprise software company that combines data warehouses and data lakes to offer an open and unified platform for data and AI.
The platform started as an open-source project (i.e., Apache Spark™) in academia. One engineer describes the core Databricks platform as “Spark, but with a nice GUI on top and many automated easy-to-use features.”
Today, Databricks is valued at $38 billion and is considered to be one of the world’s most valuable startups, joining the ranks of Stripe, Nubank, ByteDance, and SpaceX.
Databricks governance: What should it offer?
The Databricks lakehouse makes it possible to process large volumes of structured, semi-structured, and unstructured data in real time.
However, without a proper governance framework in place, all that data will turn into a data swamp — you must know where your data comes from, which storage architecture is required for each asset type, and how to put your data to maximum use.
Moreover, regulations like GDPR and CCPA expect you to document how you use and store data (especially sensitive data), to show whether you encrypt or mask that data, and to maintain detailed audit logs.
That’s why an ideal data governance platform would ensure:
- Auto-classification and masking of sensitive data
- Standardized glossary with all business definitions
- Searchable and discoverable data assets
- Role-based access controls for data and metadata
- End-to-end visualization of data lineage
- Audit logs of all data pipelines
- Easy data sharing across apps and platforms
Besides the above capabilities, the governance platform must be adaptive — enabling you to tailor your governance strategy according to your business requirements.
According to Gartner’s VP Analyst Saul Judah:
“A typical ‘one-size-fits-all,’ command-and-control-based IT governance capability has neither the scope nor the agility to meet the needs of digital business. Adaptive governance enables flexible and nimble decision-making processes that help an organization respond quickly to opportunities, while continuously addressing investments, risk, and value.”
So, the platform should be customizable and personalized to your projects.
Now let’s see what Databricks governance looks like and how adaptive the capabilities are for modern data teams.
How does Databricks governance work?
Databricks offers a native data catalog named Unity Catalog as a “unified governance solution for all data and AI assets” within Databricks. Let’s explore its components and capabilities.
1. Metastore in Unity Catalog
All Databricks objects, such as tables, views, and the permissions governing their access, are stored in a metastore.
Each region gets a separate metastore. To exchange data across all metastores, you can use Delta Sharing — an open protocol for secure data sharing. The metastore brings together schemas (or databases), which organize tables and views in Unity Catalog.
Each object within the catalog has an owner, who can read and modify its contents. To read any data within the metastore, you must have the following permissions, according to Databricks:
- SELECT on the table or view
- USAGE on the schema that owns the table
- USAGE on the catalog that owns the schema
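To make the rule concrete, here is a minimal Python sketch of the check described above. It is not how Unity Catalog is implemented. The `Grant` model, the `can_read` helper, and the catalog, schema, table, and group names are all hypothetical, and object ownership is deliberately left out.

```python
from typing import NamedTuple, Set


class Grant(NamedTuple):
    """One privilege held by a principal on a securable object (hypothetical model)."""
    principal: str
    privilege: str  # "SELECT" or "USAGE"
    securable: str  # e.g. "main", "main.sales", "main.sales.orders"


def can_read(grants: Set[Grant], principal: str, table: str) -> bool:
    """Return True only if all three privileges listed above are present."""
    catalog, schema, _ = table.split(".")
    required = {
        Grant(principal, "SELECT", table),
        Grant(principal, "USAGE", f"{catalog}.{schema}"),
        Grant(principal, "USAGE", catalog),
    }
    # Subset test: every required grant must exist in the granted set.
    return required <= grants


# Illustrative data only; these names do not come from the article.
grants = {
    Grant("analysts", "USAGE", "main"),
    Grant("analysts", "USAGE", "main.sales"),
    Grant("analysts", "SELECT", "main.sales.orders"),
    Grant("interns", "SELECT", "main.sales.orders"),
}

print(can_read(grants, "analysts", "main.sales.orders"))  # True
print(can_read(grants, "interns", "main.sales.orders"))   # False
```

The second call fails because SELECT on the table alone is not enough: the group also needs USAGE on the parent schema and catalog.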
2. Access control with SQL queries
To grant access to a team member, you can use the following SQL syntax:
GRANT USAGE ON CATALOG <catalog_name> TO `<group_name>`;
GRANT USAGE ON SCHEMA <catalog_name>.<schema_name> TO `<group_name>`;
GRANT SELECT ON <catalog_name>.<schema_name>.<table_name> TO `<group_name>`;
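If you need the same three grants for many tables or groups, generating the statements may be less error-prone than typing them by hand. The Python sketch below only builds the SQL text and does not connect to Databricks. The helper names `quote_ident` and `read_grants` are hypothetical, and the backtick-doubling escape is an assumption about how Databricks SQL quotes identifiers, so check it against your own environment.

```python
from typing import List


def quote_ident(name: str) -> str:
    """Wrap an identifier in backticks, doubling any embedded backtick (assumed escape rule)."""
    return "`" + name.replace("`", "``") + "`"


def read_grants(catalog: str, schema: str, table: str, group: str) -> List[str]:
    """Build the three GRANT statements that give a group read access to one table."""
    c, s, t, g = (quote_ident(x) for x in (catalog, schema, table, group))
    return [
        f"GRANT USAGE ON CATALOG {c} TO {g};",
        f"GRANT USAGE ON SCHEMA {c}.{s} TO {g};",
        f"GRANT SELECT ON {c}.{s}.{t} TO {g};",
    ]


# Illustrative names only.
for statement in read_grants("main", "sales", "orders", "analysts"):
    print(statement)
```

You could paste the printed statements into a SQL editor or pass them to whichever SQL client you already use.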
Alternatively, if you wish to create a separate view for someone from another team, you can use the following SQL syntax:
CREATE VIEW <catalog_name>.<schema_name>.<view_name> AS
SELECT
  id,
  CASE WHEN is_member('<group_name>') THEN email ELSE 'REDACTED' END AS email,
  country,
  product,
  total
FROM <catalog_name>.<schema_name>.<table_name>;
GRANT USAGE ON CATALOG <catalog_name> TO `<group_name>`;
GRANT USAGE ON SCHEMA <catalog_name>.<schema_name> TO `<group_name>`;
GRANT SELECT ON <catalog_name>.<schema_name>.<view_name> TO `<group_name>`;
3. Audit logs
Databricks can capture audit logs and deliver them to S3 buckets roughly every 15 minutes. These logs record the actions performed against each metastore. Databricks captures audit logs at two levels:
- Workspace: This includes accounts, clusters and their policies, jobs, IAM roles, web terminals, SQL permissions, etc.
- Account: This includes single sign-on settings, billable account usage, and actions done in the accounts console.
To maintain audit logs, you must configure the storage (ideally an S3 bucket), create IAM roles, and set up audit log delivery for the required workspaces.
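Once logs land in the bucket, a first step is often to summarize which actions occurred. The Python sketch below assumes each delivered file holds one JSON object per line with `serviceName` and `actionName` fields. Verify these field names against the audit log schema in your own deliveries. The sample records are invented for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def summarize(lines: Iterable[str]) -> Counter:
    """Count audit events per (serviceName, actionName) pair, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Field names are assumptions; missing fields fall back to "unknown".
        key = (event.get("serviceName", "unknown"), event.get("actionName", "unknown"))
        counts[key] += 1
    return counts


# Invented sample records, not real audit log output.
sample = [
    '{"serviceName": "clusters", "actionName": "create"}',
    '{"serviceName": "clusters", "actionName": "create"}',
    '{"serviceName": "accounts", "actionName": "login"}',
    "",
]

for (service, action), n in summarize(sample).most_common():
    print(service, action, n)
```

In practice you would read the lines from files downloaded from the bucket rather than from an in-memory list.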
4. Data sharing across metastores
As mentioned earlier, Databricks uses Delta Sharing to exchange data across metastores. When you enable Delta Sharing, Unity Catalog runs a Delta Sharing server that indexes tables from multiple metastores.
Although the tables will appear as read-only objects, you can set up access permissions with custom SQL queries.
Governance in Databricks: What’s missing?
While Databricks enables cataloging and governance for its assets, it isn’t enough to ensure governance across non-Databricks sources and assets.
Moreover, working with Databricks requires writing SQL queries. That might not be ideal for non-technical users as they must rely on engineering for data discovery, usage permissions, and data sharing.
Other areas that aren’t native to Databricks governance capabilities include:
Personalized access policies for users and assets
Setting up and monitoring policies for an organization is arduous. As data consumers become increasingly diverse, the requests become more complex — a business user would want to see the latest updates to a dashboard, whereas a data engineer would want to find and fix failed pipelines quickly.
That’s why the best way is to define policies and curate data assets by business domains, user personas, or projects, and not just user roles.
Auto-propagation of access permissions via lineage mapping
Data assets will get used across countless upstream and downstream applications. Defining access permissions or data classifications manually at each stage isn’t feasible or scalable.
The ideal solution would be to enable auto-propagation of these policies and trace them by mapping data flow (i.e., lineage mapping).
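As a rough illustration of the idea, propagation along lineage can be modeled as a graph traversal: start at the asset you classified and tag everything reachable downstream. The Python sketch below uses a breadth-first search over a hypothetical lineage map. It is not any vendor's implementation, and the column names are invented.

```python
from collections import deque
from typing import Dict, List, Set


def propagate(lineage: Dict[str, List[str]], start: str) -> Set[str]:
    """Return the starting asset plus every asset reachable downstream of it."""
    tagged = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in tagged:  # guards against cycles and repeat visits
                tagged.add(child)
                queue.append(child)
    return tagged


# Hypothetical lineage: each key feeds the columns listed as its value.
lineage = {
    "raw.users.email": ["staging.users.email"],
    "staging.users.email": ["mart.customers.contact", "mart.marketing.recipient"],
    "raw.users.country": ["mart.customers.country"],
}

pii = propagate(lineage, "raw.users.email")
print(sorted(pii))
```

Tagging `raw.users.email` reaches the staging and mart columns derived from it, while `mart.customers.country` stays untagged because it has no lineage path from the email column.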
Easy masking and hashing for sensitive data
Compliance with regulations such as GDPR or CCPA requires you to ensure sensitive data remains masked. Since data pours in from various sources and formats, the easiest way to ensure governance at scale is to set up custom programs to identify and mask sensitive data automatically.
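A custom masking program can start out very small. The Python sketch below finds email-like strings with a deliberately simple regular expression and replaces each one with a salted SHA-256 digest. The pattern, the salt handling, and the function names are illustrative assumptions. Production detection usually needs broader rules, and hashing of this kind is pseudonymization rather than full anonymization.

```python
import hashlib
import re

# Deliberately simple pattern; it will miss some valid addresses.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_value(value: str, salt: str) -> str:
    """Return the hex SHA-256 digest of the salted value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def mask_emails(text: str, salt: str) -> str:
    """Replace every email-like substring with its salted hash."""
    return EMAIL.sub(lambda match: hash_value(match.group(0), salt), text)


# Illustrative input and salt only.
masked = mask_emails("Contact jane@example.com today", salt="demo-salt")
print(masked)
```

Because the same input and salt always produce the same digest, masked values can still be joined or counted without exposing the original address.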
Without these capabilities, you cannot achieve true governance across all data assets. That’s where a modern data catalog with active data governance can help.
Databricks governance with Atlan: An end-to-end, business-user-friendly solution
Atlan is a third-generation data catalog with active governance that comes equipped with:
- A powerful search engine that lets you discover the information you want by scouring your entire data landscape, including READMEs, descriptions, business metadata, dashboards, and more
- A GitHub-like repository that gives you complete context on every asset, including warnings, issues, resources from Slack or Jira, and a visual map of how data flows through your landscape
- A mechanism that lets anyone with view access see what data exists and offer suggestions on missing terms, descriptions, data use, and more
With Atlan, you can take your existing Databricks governance setup to the next level by improving the capabilities of Unity Catalog. Here’s how:
A Netflix-like experience lets you create personalized experiences for the different humans of data. Personalization allows setting granular access policies for domains, projects, and users.
In Atlan, you can define purposes (the ways you interact with data) and personas (who interacts with what data). You can then use purposes and personas to define access controls for your Databricks metadata, data, and glossary content.
Programmable bots auto-classify and mask sensitive data in your Databricks lakehouse. For example, you can create compliance bots for HIPAA, GDPR, BCBS 239, and more.
You can also set up custom classifications with a user-friendly UI: click Governance → Classifications → Add classification.
Policy propagation in Atlan is automatic and takes place via lineage mapping. For example, if you classify a column from a specific table in the Unity Catalog metastore as PII, then Atlan will categorize all downstream columns as PII too.
Similarly, if you label an entire table as “Product”, Atlan will classify all the columns in that table as “Product” too.
Embedded collaboration lets you share data-related concerns on Slack or by sending the asset profile link. Embedded collaboration is also how Atlan achieves active metadata management as data flows from Atlan to business apps and vice versa.
For instance, you can create Jira tickets, send questions to Slack channels, and look up data definitions for your Databricks assets without leaving Atlan.
Databricks data governance: What’s next?
If you are evaluating best-in-class data governance for your Databricks data sources and are ready to deploy, give Atlan a spin.
The deep integration and the open API enable Atlan to solve other modern data governance use cases across DataOps, workflow management, and pipeline automation.
Atlan has been named a leader in The Forrester Wave™: Enterprise Data Catalogs for DataOps, Q2 2022.
The report states,
“Atlan is the tool of choice for DataOps and data product deployment. Atlan’s vision is to create frictionless data product deployment through a single metadata and data automation platform.”
Getting started with Databricks data governance with Atlan:
- How to crawl Databricks metadata
- How to extract lineage from Databricks
- What does Atlan crawl from Databricks?
- How to set up a private network link to Databricks?
Databricks data governance: Related reads
- Databricks lineage — Overview, benefits, how to set up?
- Databricks metadata management — FAQs, tools, getting started
- 8 best practices for a robust data governance program
- 7 popular open-source data governance tools to consider in 2023
- Data governance and its importance in the modern data stack