Data Lakehouse 101: What It Is & How It Works? (2025)

Updated December 04th, 2024

A data lakehouse is a modern architecture that combines the scalability of a data lake with the robust data management features of a warehouse.

It supports structured, semi-structured, and unstructured data in a unified system, enabling businesses to perform advanced analytics, real-time processing, and machine learning.

Unlike traditional two-tier architectures, a lakehouse eliminates the need for complex ETL processes between separate systems, reducing data redundancy and improving reliability.

This streamlined approach ensures fresher data for analysis and lowers operational costs.

By integrating features like metadata management, data lakehouses provide businesses with a secure, scalable, and efficient data platform, empowering them to gain actionable insights faster and more effectively than ever before.

Table of contents #

What is a data lakehouse?
What are the components of data lakehouse architecture?
Why might a business use a data lakehouse?
When did the data lakehouse emerge?
Predictive analytics with data lakehouses
How the modern data lakehouse fits into the modern data stack
How organizations are making the most of their data using Atlan
FAQs about Data Lakehouse
Data Lakehouse: Related reads

What is a data lakehouse? #

A data lakehouse is an emerging system design that combines the data structures and management features from a data warehouse with the low-cost storage of a data lake.

What are the components of data lakehouse architecture? #

When businesses use both data warehouses and data lakes — without lakehouses — they must use different processes to capture data from operational systems and move this information into the desired storage tier. As a result, these organizations typically leverage a two-tier architecture in which data is extracted, transformed, and loaded (ETL) from an operational database into a data lake. Benefitting from the cost-effective storage of the data lake, the organization will eventually ETL certain portions of the data into a data warehouse for analytics purposes.

A data lakehouse, however, allows businesses to use the data management features of a warehouse within an open-format data lake. Pioneered by Databricks, the data lakehouse differs from other cloud data solutions because the data lake, not the data warehouse, sits at the center of everything.

To address the data storage aspect, a relatively new open source standard called Delta Lake brings the essential functionality of a data warehouse, such as structured tables, into a data lake.

Data lakehouse architecture is made up of 5 layers #

Ingestion layer: Data is pulled from different sources and delivered to the storage layer.
Storage layer: Various types of data (structured, semi-structured, and unstructured) are kept in a cost-effective object store, such as Amazon S3.
Metadata layer: A unified catalog that provides metadata about all objects in the data lake. This enables data indexing, quality enforcement, and ACID transactions, among other features. The metadata layer is the defining element of the data lakehouse.
API layer: Metadata APIs allow users to understand what data is required for a particular use case and how to retrieve it.
Consumption layer: The business tools and applications that leverage the data stored within the data lake for analytics, BI, and AI purposes.
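
To make the metadata layer more concrete, here is a minimal sketch in Python, assuming a local directory stands in for the object store. The `ToyTable` class, its `_log` directory, and the file naming are hypothetical and far simpler than what Delta Lake, Apache Hudi, or Apache Iceberg actually do. The sketch only illustrates the general idea that a data file becomes visible to readers once a log entry publishes it.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class ToyTable:
    """A toy table: data files plus an append-only commit log (hypothetical design)."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.log_dir = root / "_log"
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def _versions(self) -> list:
        # Each log entry is named after its version number, e.g. 00003.json.
        return sorted(int(p.stem) for p in self.log_dir.glob("*.json"))

    def commit(self, rows: list) -> int:
        """Write a data file, then publish it by adding one log entry."""
        version = (self._versions() or [-1])[-1] + 1
        data_file = self.root / f"part-{version:05d}.json"
        data_file.write_text(json.dumps(rows))
        entry = self.log_dir / f"{version:05d}.json"
        # Mode "x" fails if the entry already exists, so two writers cannot
        # both claim the same version number.
        with open(entry, "x") as fh:
            json.dump({"add": data_file.name}, fh)
        return version

    def read(self, as_of: Optional[int] = None) -> list:
        """Return rows from files listed in the log, optionally up to a version."""
        rows = []
        for version in self._versions():
            if as_of is not None and version > as_of:
                break
            entry = json.loads((self.log_dir / f"{version:05d}.json").read_text())
            rows.extend(json.loads((self.root / entry["add"]).read_text()))
        return rows


# Usage: a temporary directory plays the role of the storage layer.
table = ToyTable(Path(tempfile.mkdtemp()))
table.commit([{"id": 1, "event": "signup"}])
table.commit([{"id": 2, "event": "purchase"}])
```

In this layout, `table.read(as_of=0)` returns only the rows from the first commit, which is a simplified form of data versioning. A data file that was written but never logged stays invisible to readers.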

Why might a business use a data lakehouse? #

Combining data lakes and data warehouses into data lakehouses allows data teams to operate swiftly because they no longer need to access multiple systems to use the data.

This simplified data infrastructure solves several challenges that are inherent to the two-tier architecture mentioned above:

Improved reliability: Businesses don’t have to worry about engineering ETL transfers between fragile systems that may be disrupted due to quality issues.
Reduced data redundancy: The data lakehouse serves as a single repository for all data, eliminating redundancies and supporting more efficient data movement.
Fresher data: The issue of data staleness is addressed with a data lakehouse because data is available for analysis in a few hours rather than a few days.
Decreased cost: By streamlining ETL processes and moving to a single-tier architecture, businesses often save money after adopting the data lakehouse approach.

With increased agility and up-to-date data, data lakehouses are a good fit for organizations looking to fuel a wide variety of workloads that require advanced analytics capabilities. In fact, lakehouses let businesses use BI tools, such as Tableau and Power BI, directly on the source data, so both batch and real-time analytics can run on the same platform.

Data lakehouses also give businesses the ability to adopt AI and machine learning (ML), or take their existing technology to the next level, while still meeting compliance requirements. Although the unstructured data needed for AI and ML can be stored in a data lake, doing so creates data security and governance issues. A lakehouse addresses this problem by automating compliance processes and even anonymizing personal data if needed.
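
As a rough illustration of what masking personal data can look like, the sketch below replaces selected fields with a keyed hash. The field names in `PII_FIELDS`, the `pseudonymize` function, and the sample secret are all assumptions made for this example. Keyed hashing is strictly pseudonymization rather than full anonymization, so treat it as a simple stand-in for whatever a given platform provides.

```python
import hashlib
import hmac

# Assumption: these are the fields a governance team has marked as personal data.
PII_FIELDS = {"email", "phone"}


def pseudonymize(record: dict, secret: bytes) -> dict:
    """Return a copy of the record with PII fields replaced by a keyed hash."""
    masked = {}
    for key, value in record.items():
        if key in PII_FIELDS and value is not None:
            digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256)
            # A shortened hex digest keeps the value joinable but unreadable.
            masked[key] = digest.hexdigest()[:16]
        else:
            masked[key] = value
    return masked


# Usage: the same input and secret always map to the same token.
row = {"customer_id": 7, "email": "a@example.com", "phone": None}
masked_row = pseudonymize(row, secret=b"example-secret")
```

Because the hash is deterministic for a given secret, masked values can still be used as join keys across tables without exposing the original data.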

The 2024 State of Data Lakehouse Report found that 65% of enterprise IT professionals run most of their analytics on data lakehouses. Over half (56%) reported saving over 50% on analytics costs by transitioning to this architecture. Additionally, about 81% of organizations use data lakehouses to support data scientists in developing AI models and applications.

How do data lakehouses compare to data warehouses? #

Data warehouses are built for queryable analytics on structured data and certain types of semi-structured data. While business analytics teams are typically able to access the data stored in a data lake, there are limitations. Data lakes often require a data engineer to “wrangle” the data into a usable format.

A data lakehouse, however, has the data management functionality of a warehouse, such as ACID transactions and optimized performance for SQL queries. It also supports raw and unstructured data, like audio and video.

Also, read → Benefits and Challenges of Open Data Lakehouses | The Data Lakehouse Isn’t Sailing Smooth Yet

When did the data lakehouse emerge? #

According to S&P Global Market Intelligence, the first documented use of the term “data lakehouse” was in 2017 when software company Jellyvision began using Snowflake to combine schemaless and structured data processing. In a separate Q&A, Databricks CEO and Cofounder Ali Ghodsi noted that 2017 was a pivotal year for the data lakehouse:

“The big technological breakthrough came around 2017 when three projects simultaneously enabled building warehousing-like capabilities directly on the data lake: Delta Lake, (Apache) Hudi, and (Apache) Iceberg. They brought structure, reliability, and performance to these massive datasets sitting in data lakes.”

As cloud SaaS expert Jamin Ball points out, Snowflake has not embraced the data lakehouse in their product. The company’s cloud data warehouse and Databricks’ data lakehouse can be considered “two different entry points for the same ultimate vision: to be the data cloud platform.”

AWS Lake House Architecture #

AWS joined the fray and began talking about data lakehouses in relation to Amazon Redshift Spectrum in late 2019, later featuring their lakehouse architecture at re:Invent 2020. AWS actually prefers to use the nomenclature “lake house” to describe their combined portfolio of data and analytics services.

In the above-mentioned Q&A, Ghodsi emphasizes the data lakehouse’s support for AI and ML as a major differentiator with cloud data warehouses. Today’s data warehouses still don’t support the raw and unstructured data sets required for AI/ML. According to CIO, unstructured data makes up 80-90% of the digital data universe. Lakehouses allow businesses to clean up these “data swamps,” or the massive data sets in data lakes, so they can more strategically access and use the information to make smarter business decisions.

Bill Inmon, “father of the data warehouse,” further contextualizes the mounting interest in data lakehouses for AI/ML use cases: “Data management has evolved from analyzing structured data for historical analysis to making predictions using large volumes of unstructured data. There is an opportunity to leverage machine learning and a wider variety of datasets to unlock new value.”

Predictive analytics with data lakehouses #

In our blog exploring data warehouses, we mentioned that historical data is being increasingly used to support predictive analytics. However, data warehouses and data lakes on their own don’t have the same strengths as data lakehouses when it comes to supporting advanced, AI-powered analytics.

In a 2021 paper by data experts from Databricks, UC Berkeley, and Stanford University, the researchers note that today’s top ML systems, such as TensorFlow and PyTorch, don’t work well on top of highly structured data warehouses. While these systems can be used on open-format data lakes, those lakes lack crucial data management features, such as ACID transactions, data versioning, and indexing, that are needed to support BI workloads. By combining the best features of data warehouses and data lakes, data lakehouses are now empowering both business analytics and data science teams to extract valuable insights from businesses’ data.

Using data lakehouses for predictive analytics: An example #

An airline wants to determine which customers are most likely to churn based on their phone activity with the support team. If the company uses a data lakehouse as a central data repository, it could conduct sentiment analysis using natural language processing (NLP) to identify people who have had a frustrating customer experience. Based on those insights, the business might contact those customers to learn how things could be improved and provide offers that might incentivize them to stay.
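
The airline scenario can be sketched in a few lines of Python. This is a deliberately simple stand-in: a real pipeline would likely use a trained NLP model, whereas the word list in `NEGATIVE_WORDS`, the `frustration_score` and `flag_churn_risk` functions, and the 0.1 threshold are all made up for illustration.

```python
import re

# Assumption: a tiny hand-picked lexicon instead of a trained sentiment model.
NEGATIVE_WORDS = {"cancel", "refund", "frustrated", "delayed", "lost", "unacceptable"}


def frustration_score(transcript: str) -> float:
    """Share of words in a call transcript that appear in the negative lexicon."""
    words = re.findall(r"[a-z']+", transcript.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in NEGATIVE_WORDS)
    return hits / len(words)


def flag_churn_risk(calls: dict, threshold: float = 0.1) -> list:
    """Return customer IDs whose average call score meets the threshold."""
    flagged = []
    for customer_id, transcripts in calls.items():
        if not transcripts:
            continue
        average = sum(frustration_score(t) for t in transcripts) / len(transcripts)
        if average >= threshold:
            flagged.append(customer_id)
    return sorted(flagged)


# Usage: transcripts keyed by a hypothetical customer ID.
calls = {
    "c1": ["my flight was delayed and my bag was lost, I want a refund"],
    "c2": ["thanks for the quick help with my seat"],
}
at_risk = flag_churn_risk(calls)
```

The flagged IDs could then be joined against customer records in the same lakehouse to drive outreach or retention offers.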

How the modern data lakehouse fits into the modern data stack #

Cloud data warehousing has been one of the foundational components of the modern data stack for several years. Now, with the advent of the data lakehouse, businesses have a new way to separate compute from storage for advanced analytics.

At the Modern Data Stack Conference 2021, Ghodsi spoke to Fivetran CEO and Cofounder George Fraser about the pros and cons of the cloud data warehouse vs. data lakehouse approach. They expressed a belief that data lakehouses will become increasingly popular because having data stored in an open-source format that query engines can access allows businesses to extract maximum value from the data they already have. Cost-effectiveness is another area where the data lakehouse usually outperforms the data warehouse.

As per a 2024 survey by MarketResearch.biz on the data lakehouse market size, the Global Data Lakehouse Market is projected to reach approximately USD 66.4 billion by 2033, up from USD 8.9 billion in 2023, growing at a compound annual growth rate (CAGR) of 22.9% from 2024 to 2033.
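
For readers who want to check growth figures like these, the standard compound annual growth rate formula is easy to reproduce. The helper names below are our own. Applying the formula to the 2023 and 2033 endpoints over ten years gives roughly 22.3%, which is close to but not identical to the quoted 22.9%. The quoted rate covers 2024 to 2033, and the survey's 2024 base value is not given here.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values over a number of years."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def implied_start(end_value: float, rate: float, years: int) -> float:
    """Starting value implied by an end value, a yearly growth rate, and a period."""
    return end_value / (1 + rate) ** years


# Endpoints quoted above: USD 8.9 billion (2023) to USD 66.4 billion (2033).
endpoint_rate = cagr(8.9, 66.4, years=10)

# Working backwards from the quoted 22.9% over 2024 to 2033 (nine years).
implied_2024_base = implied_start(66.4, 0.229, years=9)

print(f"Endpoint CAGR 2023 to 2033: {endpoint_rate:.1%}")
print(f"Implied 2024 base (USD billion): {implied_2024_base:.1f}")
```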

It’s fair to mention that the data lakehouse is a relatively new concept compared to the data warehouse. Over the coming years, lakehouses are expected to mature and deliver on their fundamental promise of being more cost-efficient, simpler, and capable of serving diverse kinds of data usage and applications.

Also, read → Forrester’s Data Lakehouse Overview, Q4 2023 | Forrester Wave Data Lakehouses Q2 2024 Report | Experimental Study on Data Lakehouse

How organizations are making the most of their data using Atlan #

The recently published Forrester Wave report compared all the major enterprise data catalogs and positioned Atlan as the market leader ahead of all others. The comparison was based on 24 different aspects of cataloging, broadly across the following three criteria:

Automatic cataloging of the entire technology, data, and AI ecosystem
Enabling an AI- and automation-first data ecosystem
Prioritizing data democratization and self-service

These criteria made Atlan the ideal choice for a major audio content platform, where the data ecosystem was centered around Snowflake. The platform sought a “one-stop shop for governance and discovery,” and Atlan played a crucial role in ensuring their data was “understandable, reliable, high-quality, and discoverable.”

For another organization, Aliaxis, which also uses Snowflake as their core data platform, Atlan served as “a bridge” between various tools and technologies across the data ecosystem. With its organization-wide business glossary, Atlan became the go-to platform for finding, accessing, and using data. It also significantly reduced the time spent by data engineers and analysts on pipeline debugging and troubleshooting.

A key goal of Atlan is to help organizations maximize the use of their data for AI use cases. As generative AI capabilities have advanced in recent years, organizations can now do more with both structured and unstructured data—provided it is discoverable and trustworthy, or in other words, AI-ready.

Tide, a UK-based digital bank with nearly 500,000 small business customers, sought to improve their compliance with GDPR’s Right to Erasure, commonly known as the “Right to be forgotten”.
After adopting Atlan as their metadata platform, Tide’s data and legal teams collaborated to define personally identifiable information in order to propagate those definitions and tags across their data estate.
Tide used Atlan Playbooks (rule-based bulk automations) to automatically identify, tag, and secure personal data, turning a 50-day manual process into mere hours of work.
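
Rule-based tagging of this kind can be pictured with a small generic sketch. This is not Atlan's Playbooks API. The `Column` class, the patterns in `PII_RULES`, and the `pii:<kind>` tag format are hypothetical and only show the general idea of matching column names against agreed definitions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Column:
    """A column in a data catalog (hypothetical, simplified model)."""
    table: str
    name: str
    tags: set = field(default_factory=set)


# Assumption: name patterns a data and legal team might agree on.
PII_RULES = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "name": re.compile(r"(first|last|full)_?name", re.IGNORECASE),
}


def tag_pii(columns: list) -> list:
    """Add a 'pii:<kind>' tag in place to matching columns and return them."""
    tagged = []
    for column in columns:
        matched = False
        for kind, pattern in PII_RULES.items():
            if pattern.search(column.name):
                column.tags.add(f"pii:{kind}")
                matched = True
        if matched:
            tagged.append(column)
    return tagged


# Usage: scan a handful of catalog columns in bulk.
catalog = [
    Column("customers", "email_address"),
    Column("customers", "created_at"),
    Column("contacts", "MobileNumber"),
]
pii_columns = tag_pii(catalog)
```

Name-based rules are only a starting point. In practice they would likely be combined with content sampling and human review before any data is secured or erased.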

FAQs about Data Lakehouse #

1. What is a data lakehouse? #

A data lakehouse is a modern data architecture that combines the best features of data lakes and data warehouses. It provides the flexibility of data lakes for storing unstructured and semi-structured data while maintaining the data management, governance, and performance capabilities of data warehouses.

2. How does a data lakehouse differ from a data warehouse or data lake? #

A data lakehouse integrates the strengths of both data warehouses and data lakes. Unlike data lakes, which store raw, unstructured data, and data warehouses, which store structured data optimized for analytics, data lakehouses offer a unified approach. They enable efficient data storage, management, and real-time analytics in a single platform.

3. What are the advantages of using a data lakehouse? #

Key advantages include a unified platform that combines the flexibility of a data lake with the structured query capabilities of a data warehouse, cost efficiency by reducing the need for separate systems, improved analytics supporting real-time insights and machine learning, and enhanced governance with robust data management features.

4. How does a data lakehouse support real-time analytics? #

Data lakehouses are designed to handle real-time data ingestion and processing, making them ideal for scenarios that require immediate insights. They leverage advanced processing engines and optimized storage layers to enable fast queries on fresh data.

5. What industries benefit most from data lakehouse adoption? #

Industries like finance, healthcare, retail, and technology benefit significantly from data lakehouses. These sectors require real-time analytics, robust data governance, and the ability to handle diverse data types to drive innovation and efficiency.

6. How secure is a data lakehouse compared to traditional storage solutions? #

Data lakehouse platforms typically offer security features such as encryption, access controls, and data masking. These features can help organizations meet regulations and standards such as GDPR, HIPAA, and SOC 2, though compliance ultimately depends on how the platform is configured and governed.

Data Lakehouse: Related reads #

What is a data lake: Definition, architecture, and solutions.
Data mesh vs data lake: Understanding decentralized and centralized approaches to data management.
Data warehouse vs data lake vs data lakehouse: What are the key differences?
Data Catalog: Does Your Business Really Need One?
Why does a data lake need a data catalog?
Data Lake Metadata Management: Benefits, Examples, & Tools
Data Lake & Data Governance: Unifying Diverse Data Sources
Modern Data Stack: Components, Architecture & Tools
