Databricks Cost Optimization: Top Challenges & Strategies

Updated August 21st, 2024


Highly scalable and available data warehouses, lakes, and lakehouses built on platforms like Databricks have revolutionized data management. However, this evolution brings challenges – rising technical debt, inefficiencies (slower and buggier apps), and higher costs.

The rapid development pace and feature-rich implementations often lead to suboptimal resource use, causing your costs to spiral. In this article, we’ll explore Databricks cost optimization strategies, including understanding the factors driving costs, optimizing resource allocation, and leveraging native features alongside complementary tools like Atlan.


Table of contents #

  1. Databricks cost optimization: How does it work?
  2. Understanding Databricks cost
  3. Typical causes for high Databricks cost
  4. Databricks-native cost optimization options
  5. How Atlan can help with Databricks cost optimization
  6. Summary
  7. Related reads

Databricks cost optimization: How does it work? #

In a post on its Platform blog, Databricks highlights that “the ease of creating compute resources comes with the risk of spiraling cloud costs when left unmanaged and without guardrails.”

Unfortunately, this advice often goes unheeded. To keep Databricks costs from becoming a hindrance to your business, follow the recommended best practices and take a metadata-driven approach.

To get started, you can leverage Databricks’s technical metadata and tooling to identify common data storage, modeling, engineering, movement, and management mistakes that can snowball your Databricks costs.

While that’s a legitimate way to sort out cost-related issues, it requires you to invest time in writing scripts and creating reports and dashboards from the metadata.

An alternative approach is to use complementary tools that connect with Databricks and the rich metadata stored in Unity Catalog (or in the legacy Hive metastore, if you’re still using it).

Before exploring both these solutions further, let’s understand how Databricks costs work and some typical reasons for the costs to grow significantly over time.


Understanding Databricks cost #

Before we begin, it’s worth building a high-level understanding of the Databricks architecture and the features that come at a premium. The pricing calculator on the Databricks website can help you get started.

To get the specifics, you’ll have to dig into factors such as the cloud provider, geographical region, instance type, compute type, Databricks tier, and any upfront usage commitments.

Besides these, there are storage and data transfer costs, too. So, let’s explore the most significant cost components further.

1. Compute #


The compute costs in Databricks depend on:

  • Instance type (cloud provider and region): This is native to the cloud platform. For instance, an m4.2xlarge instance in AWS is rated at 1.5 DBU/hour, which works out to roughly $0.825/hour in Databricks charges at all-purpose rates (the underlying EC2 instance is billed separately by AWS).
  • Compute type: Databricks currently offers around fifteen compute types, including All-Purpose Compute, Jobs Compute, SQL Compute, and Model Serving/Serverless Real-time Inference. Each has its own DBU/hour and hourly rate. For example, if you use Jobs Compute to run ETL jobs, the rate would be $0.15/DBU (a sketch of how these rates combine follows this list).
  • Databricks tier: Feature availability, and hence cost, also depends on the tier you choose to deploy.
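
To see how these pieces combine, here’s a back-of-the-envelope sketch in Python using the illustrative figures above. Actual DBU rates and instance prices vary by cloud, region, compute type, and tier, so treat every number as a placeholder:

```python
# Back-of-the-envelope estimate of the Databricks side of a compute bill.
# The cloud provider bills the underlying VMs separately (for classic compute).
# All figures below are illustrative placeholders; check the actual DBU rate
# for your compute type, tier, cloud, and region.

def estimate_dbu_cost(dbu_per_hour: float, dbu_rate: float, node_hours: float) -> float:
    """DBUs consumed (instance rating x node-hours) multiplied by the $/DBU rate."""
    return dbu_per_hour * node_hours * dbu_rate

# Example: a 4-node job on instances rated at 1.5 DBU/hour, running for 3 hours,
# billed at the Jobs Compute rate mentioned above ($0.15/DBU).
cost = estimate_dbu_cost(dbu_per_hour=1.5, dbu_rate=0.15, node_hours=4 * 3)
print(f"Estimated Databricks charge: ${cost:.2f}")  # -> $2.70
```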

2. Storage #


Storage costs in Databricks include the fees incurred from using the cloud platform’s storage layer, such as S3 in AWS, GCS in Google Cloud, and ADLS Gen2 in Azure. Object storage is usually billed at a per-GB rate that varies with the cloud platform, region, and availability options, among other things.

Moreover, some storage cost is incurred by the operation of Databricks features such as Delta Lake Time Travel, which retains older versions of data files.
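
If time travel history is a meaningful share of your storage bill, trimming the retention window is one lever. A minimal sketch to run in a Databricks notebook (the table name is a placeholder, and shortening retention limits how far back you can time travel or restore):

```python
# Reclaim storage held for Delta time travel by vacuuming files older than the
# retention window. The table name is a placeholder; pick a retention window
# you can live with, since it caps how far back you can time travel.
spark.sql("VACUUM main.sales.orders RETAIN 168 HOURS")  # keep ~7 days of history

# Optionally cap retention going forward via table properties.
spark.sql("""
  ALTER TABLE main.sales.orders SET TBLPROPERTIES (
    'delta.deletedFileRetentionDuration' = 'interval 7 days',
    'delta.logRetentionDuration'         = 'interval 30 days'
  )
""")
```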

3. Data transfer #


While storage isn’t super expensive, moving data around can affect you in terms of network and data transfer costs. Here are some questions to ask when you’re thinking about data transfer costs:

  • Is the data egressing within the same region, to a different region or cloud platform, or out to the public internet?
  • Is the transfer happening over public or private networks within the cloud platform?
  • Does the transfer cross availability zones, i.e., different data centers within the same geographical region?

You’ll also need to factor in any business continuity and disaster recovery plans, as cross-region replication can quickly ramp up both storage and data transfer expenses.

After getting a basic view of these components, you should focus on the areas you want to monitor and optimize. So, let’s look at the typical causes of high costs.


Typical causes for high Databricks cost #

Databricks costs usually spiral when you don’t follow architecture best practices and make sloppy use of the infrastructure. Some of the most common causes that inflate your Databricks costs are:

  • Choosing the wrong type of compute (and possibly over-provisioning it)
  • Ignoring architecture best practices and guidelines
  • Having database objects that are rarely used or unused
  • Using serverless compute features and services where they are not needed

Let’s explore each issue further.

Choosing the wrong type of compute #


Databricks is a PaaS data platform that lets you choose the type of compute you need for specific workloads. There are different compute types for data warehousing, machine learning and AI, and ETL and orchestration workloads, among others. Each consumes DBUs at a different rate, and the price per DBU also depends on a number of factors, as discussed earlier.

When choosing the type of compute, you’ll also need to decide which instance types to run it on. Depending on the workload, you’ll need to strike the right balance between CPU and memory.

Ingestion workloads generally need little compute and memory. Heavy ETL and data warehousing workloads are more compute-intensive but might not need a lot of memory. ML and AI workloads, especially inference, often call for specialized instance types and a lot of memory to do the job.

You need to take all of this into consideration; otherwise, you can quickly land in over-provisioning trouble that inflates costs. This compute creation cheat sheet from Databricks can help.
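
As a concrete illustration of right-sizing, here’s a hedged sketch of an all-purpose cluster spec that uses autoscaling and auto-termination so you don’t pay for idle capacity. The workspace URL, token, instance type, worker counts, and runtime version are all placeholders to adapt to your own workload:

```python
import requests

# A right-sized all-purpose cluster: autoscaling keeps it small under light
# load, and auto-termination stops billing when it sits idle.
# Instance type, worker counts, and runtime version are placeholders; choose
# them based on whether the workload is CPU-, memory-, or I/O-bound.
new_cluster = {
    "cluster_name": "dev-exploration",        # hypothetical name
    "spark_version": "15.4.x-scala2.12",      # placeholder runtime version
    "node_type_id": "m5.xlarge",              # placeholder instance type
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,            # shut down after 30 idle minutes
}

# Create it via the Clusters API (workspace URL and token are placeholders).
resp = requests.post(
    "https://<your-workspace>.cloud.databricks.com/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=new_cluster,
)
print(resp.json())
```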

Ignoring architecture best practices and guidelines #


With powerful platforms like Databricks, you can run almost any workload without much hassle, even if you don’t implement the right data format, storage strategy, or data model for your use case. The catch is the cost.

Databricks’s Photon engine can accelerate and optimize your queries on the platform, but it’s still a good rule of thumb to follow the recommended guidance on file format choices, table design, Delta Live Tables, and so on.

Having database objects that are rarely used or unused #


Databricks’s predictive optimization feature can automatically find and remove unused data files within a table, though it comes with its own caveats.

However, that only covers the files that make up tables. Entire tables that go unused can silently drive up your costs. Engineers and analysts, busy with their day-to-day work, don’t necessarily look for unused assets; all the attention goes to the assets that are in use and providing value.

Rarely used or unused database objects can be external tables, Delta Live Tables, streaming tables, foreign tables, Hive tables, and sometimes whole schemas containing several tables, views, and other assets.

Removing these objects directly reduces the storage cost and the associated serverless compute cost for their maintenance and upkeep. It also helps keep your Databricks environment clean, which in turn keeps your catalog clean and prevents any confusion for consumers of these assets.

Unity Catalog provides a window into these assets’ usage, tags, and statistics. You’ll need elevated privileges to access these details, which are stored in the system schema; with that access, you can identify rarely used or unused database objects by querying the tables in that schema.
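
For example, one way to surface candidates is to join the tables registered in the information schema against recent read activity recorded in the lineage system table. The system table and column names below are assumptions based on the Databricks system tables documentation; verify them against your workspace before acting on the results:

```python
# Candidate "cold" tables: registered in Unity Catalog but with no recorded
# reads in the last 90 days. System table and column names are assumptions to
# verify against the Databricks system tables documentation for your account.
cold_tables = spark.sql("""
  WITH recent_reads AS (
    SELECT DISTINCT source_table_full_name
    FROM system.access.table_lineage
    WHERE event_time >= current_timestamp() - INTERVAL 90 DAYS
      AND source_table_full_name IS NOT NULL
  )
  SELECT concat_ws('.', t.table_catalog, t.table_schema, t.table_name) AS full_name
  FROM system.information_schema.tables AS t
  LEFT JOIN recent_reads AS r
    ON r.source_table_full_name =
       concat_ws('.', t.table_catalog, t.table_schema, t.table_name)
  WHERE r.source_table_full_name IS NULL
    AND t.table_schema <> 'information_schema'
""")
display(cold_tables)
```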

Using serverless compute features and services where they are not needed #


Databricks is a managed data platform that lets you offload some under-the-hood administration and optimization work, but that convenience isn’t free. It’s important to understand which serverless features you’re using, and to what extent, in the solution you’re deploying.

For instance, you don’t need streaming tables in Delta Live Tables if your data consumers have no real-time consumption use case. Old-school batch processing can still handle many data processing and consumption workloads.
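
For illustration, here’s a minimal Delta Live Tables sketch of a batch-refreshed dataset for cases where a streaming table isn’t warranted; the source path and names are hypothetical:

```python
import dlt
from pyspark.sql import functions as F

# A batch-refreshed dataset: recomputed on each pipeline update, with no
# always-on streaming infrastructure. Source path and names are hypothetical.
@dlt.table(name="daily_orders", comment="Batch-refreshed orders snapshot")
def daily_orders():
    return (
        spark.read.format("delta").load("/mnt/raw/orders")  # hypothetical source
        .where(F.col("order_date") >= F.date_sub(F.current_date(), 30))
    )

# Reach for a streaming table only when consumers genuinely need low latency:
# @dlt.table(name="orders_stream")
# def orders_stream():
#     return spark.readStream.format("delta").load("/mnt/raw/orders")
```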


Now, let’s look at the cost optimization options Databricks offers natively.

Databricks-native cost optimization options #

Databricks offers guidance and several tools for monitoring and controlling costs:

  • Views in the system schema expose your account’s billable usage data in detail, and Databricks provides several out-of-the-box SQL queries for tracking costs (an example query follows this list).
  • Resource tagging is another useful way to monitor and control costs in Databricks. Tags are automatically propagated to your cloud platform, so you can view Databricks-related costs from your cloud provider’s console.
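
As a sketch of that first option, the query below aggregates the last 30 days of billable DBUs by SKU and a cost-center tag. The column names are based on the system.billing.usage table, and the 'cost_center' tag key is a hypothetical example of your own tagging convention, so verify both against your environment:

```python
# Billable DBUs over the last 30 days, broken down by SKU and a cost-center
# tag. Column names are assumptions based on system.billing.usage; the
# 'cost_center' tag key is a hypothetical example of your tagging convention.
usage_by_tag = spark.sql("""
  SELECT
    sku_name,
    custom_tags['cost_center'] AS cost_center,
    SUM(usage_quantity)        AS dbus
  FROM system.billing.usage
  WHERE usage_date >= current_date() - INTERVAL 30 DAYS
  GROUP BY sku_name, custom_tags['cost_center']
  ORDER BY dbus DESC
""")
display(usage_by_tag)
```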

Although these features are handy, your team has to do a lot of legwork before getting actionable cost-control insights. You’ll need to grant some team members elevated access to query this data, and the work of tracking and actioning findings builds up over time.

A data governance platform like Atlan can help offload this work from the engineers in your team. Let’s see how.


How Atlan can help with Databricks cost optimization #

By connecting to your Databricks account, Atlan fetches the metadata needed to build and use an end-to-end lineage graph for data discovery and cost optimization. Atlan’s solution for Databricks cost optimization works at three levels:

  • Level 1: An end-to-end lineage graph (powered by Unity Catalog) identifies unused and duplicate data assets, which you can then certify as deprecated, ready to be deleted, or archived from your Databricks account.
  • Level 2: Asset popularity metrics leverage Unity Catalog to help identify low-usage assets (tables, views, columns), which can be certified as deprecated. Atlan uses four metrics to identify candidates for deprecation:
    • Number of queries using an asset
    • Number of users using an asset
    • Datetime of the last query run against the asset
    • Datetime of the last update made to the asset
  • Level 3: Atlan targets the most popular, most queried, and most expensive assets using the same popularity metrics, for example, assets with the most queries or users.

There are some limitations, primarily due to the Databricks APIs, which prevent Atlan from bringing in all cost-related metadata, such as expensive queries and their associated compute costs. You can find more information in the official documentation.

It’s also important to note that Unity Catalog must be enabled for your Databricks workspace for detailed lineage, usage, and popularity metrics to be calculated.

If Unity Catalog isn’t enabled, you may need to convert existing tables to Unity Catalog-compliant versions for these metrics to work. There’s a dedicated page in Atlan’s documentation to help you resolve these issues.
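
If you’re starting from the legacy Hive metastore, one simple pattern, sketched below with placeholder catalog, schema, and table names, is to copy a table into a Unity Catalog schema with CTAS; Databricks also provides guided upgrade tooling (and a SYNC command for external tables) that is worth checking first:

```python
# Copy a legacy Hive metastore table into a Unity Catalog schema so lineage
# and usage metrics can be computed for it. All names are placeholders.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.analytics.orders
  AS SELECT * FROM hive_metastore.analytics.orders
""")
```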

Also, read -> How to report on usage and cost using Atlan


Summary #

In this article, we discussed several aspects of Databricks costs, typical causes of spiraling costs, and methods you can use to monitor, control, and optimize them. You also learned how Databricks’s native tooling and external tools like Atlan can make cost optimization easier.

Atlan’s multi-pronged cost optimization solutions leverage end-to-end lineage and popularity metrics to identify and deprecate less frequently used and unused data assets. This offers significant and immediate cost savings for an enterprise.

Want to know more? Chat with our data strategy experts who’ve guided data teams at Nasdaq, Autodesk, and HelloFresh in their data governance journeys.


