What is a data lakehouse?
What are the components of data lakehouse architecture?
When businesses use both data warehouses and data lakes without a lakehouse, they must use different processes to capture data from operational systems and move this information into the desired storage tier. As a result, these organizations typically leverage a two-tier architecture in which data is extracted, transformed, and loaded (ETL) from an operational database into a data lake. Benefiting from the cost-effective storage of the data lake, the organization will eventually ETL certain portions of the data into a data warehouse for analytics purposes.
A data lakehouse, however, allows businesses to use the data management features of a warehouse within an open format data lake. Pioneered by Databricks, the data lakehouse is different from other data cloud solutions because the data lake, not the data warehouse, sits at the center of everything.
To address the data storage aspect, a relatively new open source standard called Delta Lake brings the essential functionality of a data warehouse, such as structured tables, into a data lake.
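The core idea can be illustrated with a toy sketch: a "table" is just a set of data files in an object store plus an append-only transaction log that records, in order, which files were added or removed. Readers replay the log to reconstruct the current table state, which is what makes atomic commits possible on plain file storage. This is a hypothetical simplification for illustration, not Delta Lake's actual protocol.

```python
import json
import os
import tempfile

class ToyDeltaTable:
    """Toy transaction-log-backed table (illustrative only)."""

    def __init__(self, path):
        self.path = path
        self.log_dir = os.path.join(path, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def commit(self, add=(), remove=()):
        # Each commit is a single atomic write of one numbered log entry.
        entry = {"add": list(add), "remove": list(remove)}
        version = len(os.listdir(self.log_dir))
        with open(os.path.join(self.log_dir, f"{version:020d}.json"), "w") as f:
            json.dump(entry, f)

    def current_files(self):
        # Replay log entries in order to reconstruct the table state.
        files = set()
        for name in sorted(os.listdir(self.log_dir)):
            with open(os.path.join(self.log_dir, name)) as f:
                entry = json.load(f)
            files |= set(entry["add"])
            files -= set(entry["remove"])
        return files

table = ToyDeltaTable(tempfile.mkdtemp())
table.commit(add=["part-000.parquet", "part-001.parquet"])
table.commit(add=["part-002.parquet"], remove=["part-000.parquet"])
print(sorted(table.current_files()))
# ['part-001.parquet', 'part-002.parquet']
```

Because the log, not the raw file listing, defines the table, a half-finished write never becomes visible to readers; that is the essence of how ACID guarantees land on top of an object store.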
Data lakehouse architecture is made up of five layers:
- Ingestion layer: Data is pulled from different sources and delivered to the storage layer.
- Storage layer: Various types of data (structured, semi-structured, and unstructured) are kept in a cost-effective object store, such as Amazon S3.
- Metadata layer: A unified catalog that provides metadata about all objects in the data lake. This enables data indexing, quality enforcement, and ACID transactions, among other features. The metadata layer is the defining element of the data lakehouse.
- API layer: Metadata APIs allow users to understand what data is required for a particular use case and how to retrieve it.
- Consumption layer: The business tools and applications that leverage the data stored within the data lake for analytics, BI, and AI purposes.
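The flow through these layers can be sketched as a minimal pipeline. All names here are hypothetical, and plain Python dicts stand in for real services (an object store such as Amazon S3, a catalog service, and a BI query engine):

```python
# Ingestion layer: pull records from a source system (hardcoded here).
def ingest():
    return [
        {"type": "order", "amount": 120},
        {"type": "order", "amount": 80},
        {"type": "clickstream", "raw": "<html>...</html>"},
    ]

# Storage layer: a cost-effective object store (a dict stands in for S3).
object_store = {}

# Metadata layer: a unified catalog describing every stored object.
catalog = {}

def store(records):
    for i, record in enumerate(records):
        key = f"landing/object-{i}.json"
        object_store[key] = record
        catalog[key] = {"schema": sorted(record), "kind": record["type"]}

# API layer: metadata-driven lookup of the objects a use case needs.
def find_objects(kind):
    return [key for key, meta in catalog.items() if meta["kind"] == kind]

# Consumption layer: a BI-style aggregate over the retrieved data.
store(ingest())
order_keys = find_objects("order")
total_revenue = sum(object_store[key]["amount"] for key in order_keys)
print(total_revenue)  # 200
```

Note that the consumption layer never scans the raw store directly: it asks the metadata catalog which objects are relevant, which is why the metadata layer is the defining element of the architecture.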
Why might a business use a data lakehouse?
Combining data lakes and data warehouses into data lakehouses allows data teams to operate swiftly because they no longer need to access multiple systems to use the data.
This simplified data infrastructure solves several challenges that are inherent to the two-tier architecture mentioned above:
- Improved reliability: Businesses don’t have to worry about engineering ETL transfers between fragile systems that may be disrupted due to quality issues.
- Reduced data redundancy: The data lakehouse serves as a single repository for all data, eliminating redundancies and supporting more efficient data movement.
- Fresher data: The issue of data staleness is addressed with a data lakehouse because data is available for analysis in a few hours rather than a few days.
- Decreased cost: By streamlining ETL processes and moving to a single-tier architecture, businesses often save money after adopting the data lakehouse approach.
With increased agility and up-to-date data, data lakehouses are a great fit for organizations looking to fuel a wide variety of workloads that require advanced analytics capabilities. In fact, lakehouses enable businesses to use BI tools, such as Tableau and Power BI, directly on the source data, making both batch and real-time analytics possible on the same platform.
Data lakehouses also give businesses the ability to adopt AI and machine learning (ML) or take their existing technology to the next level, while still meeting compliance requirements. Though the unstructured data needed for AI and ML can be stored in a data lake, keeping it there raises data security and governance issues. A lakehouse solves this problem by automating compliance processes and even anonymizing personal data if needed.
How do data lakehouses compare to data warehouses?
Data warehouses are built for queryable analytics on structured data and certain types of semi-structured data. While business analytics teams are typically able to access the data stored in a data lake, there are limitations. Data lakes often require a data engineer to “wrangle” the data into a usable format.
A data lakehouse, however, has the data management functionality of a warehouse, such as ACID transactions and optimized performance for SQL queries, while also supporting raw and unstructured data, like audio and video.
When did the data lakehouse emerge?
According to S&P Global Market Intelligence, the first documented use of the term “data lakehouse” was in 2017 when software company Jellyvision began using Snowflake to combine schemaless and structured data processing. In a separate Q&A, Databricks CEO and Cofounder Ali Ghodsi noted that 2017 was a pivotal year for the data lakehouse:
“The big technological breakthrough came around 2017 when three projects simultaneously enabled building warehousing-like capabilities directly on the data lake: Delta Lake, (Apache) Hudi, and (Apache) Iceberg. They brought structure, reliability, and performance to these massive datasets sitting in data lakes.”
As cloud SaaS expert Jamin Ball points out, Snowflake has not embraced the data lakehouse in their product. The company’s cloud data warehouse and Databricks’ data lakehouse can be considered “two different entry points for the same ultimate vision: to be the data cloud platform.”
AWS Lake House Architecture
AWS joined the fray and began talking about data lakehouses in relation to Amazon Redshift Spectrum in late 2019, later featuring their lakehouse architecture at re:Invent 2020. AWS actually prefers to use the nomenclature “lake house” to describe their combined portfolio of data and analytics services.
In the above-mentioned Q&A, Ghodsi emphasizes the data lakehouse’s support for AI and ML as a major differentiator from cloud data warehouses. Today’s data warehouses still don’t support the raw and unstructured data sets required for AI/ML. According to CIO, unstructured data makes up 80-90% of the digital data universe. Lakehouses allow businesses to clean up these “data swamps,” or the massive data sets in data lakes, so they can more strategically access and use the information to make smarter business decisions.
Bill Inmon, “father of the data warehouse,” further contextualizes the mounting interest in data lakehouses for AI/ML use cases: “Data management has evolved from analyzing structured data for historical analysis to making predictions using large volumes of unstructured data. There is an opportunity to leverage machine learning and a wider variety of datasets to unlock new value.”
Predictive analytics with data lakehouses
In our blog exploring data warehouses, we mentioned that historical data is being increasingly used to support predictive analytics. However, data warehouses and data lakes on their own don’t have the same strengths as data lakehouses when it comes to supporting advanced, AI-powered analytics.
In a 2021 paper created by data experts from Databricks, UC Berkeley, and Stanford University, the researchers note that today’s top ML systems, such as TensorFlow and PyTorch, don’t work well on top of highly structured data warehouses. While these systems can be used on open format data lakes, they lack crucial data management features, such as ACID transactions, data versioning, and indexing, needed to support BI workloads. By combining the best features of data warehouses and data lakes, data lakehouses are now empowering both business analytics and data science teams to extract valuable insights from businesses’ data.
Using data lakehouses for predictive analytics: An example
An airline wants to determine which customers are most likely to churn based on their phone activity with the support team. If the company uses a data lakehouse as a central data repository, they could conduct sentiment analysis using natural language processing (NLP) to identify people who have had a frustrating customer experience. Based on those insights, the business might contact the customers to learn more about how things could be improved as well as provide them with offers that might incentivize them to remain a customer.
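A stripped-down sketch of that workflow might look like the following. A production system would run a trained NLP model over transcripts queried from the lakehouse; here a hypothetical keyword lexicon and hardcoded transcripts stand in for both, purely to show the shape of the analysis:

```python
# Hypothetical negative-sentiment lexicon (a real system would use an
# NLP model, not keyword matching).
NEGATIVE_SIGNALS = {"frustrated", "cancel", "angry", "waiting", "broken"}

# Support-call transcripts, as they might be read from the lakehouse.
transcripts = {
    "cust-001": "I have been waiting an hour and I am frustrated",
    "cust-002": "Thanks, the agent resolved my seat change quickly",
    "cust-003": "This app is broken and I want to cancel my booking",
}

def negativity(text):
    # Count distinct negative signal words in the transcript.
    return len(set(text.lower().split()) & NEGATIVE_SIGNALS)

# Flag customers with two or more negative signals as churn risks,
# i.e. candidates for proactive outreach or a retention offer.
at_risk = [cid for cid, text in transcripts.items() if negativity(text) >= 2]
print(at_risk)  # ['cust-001', 'cust-003']
```

The point of running this on a lakehouse rather than a warehouse is that the raw transcripts (unstructured text) and the BI-ready customer records live in the same repository, so no separate ETL hop is needed before the analysis.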
How the modern data lakehouse fits into the modern data stack
Cloud data warehousing has been one of the foundational components of the modern data stack for several years. Now, with the advent of the data lakehouse, businesses have a new way to separate compute from storage for advanced analytics.
At the Modern Data Stack Conference 2021, Ghodsi spoke to Fivetran CEO and Cofounder George Fraser about the pros and cons of the cloud data warehouse vs. data lakehouse approach. They expressed a belief that data lakehouses will become increasingly popular because having data stored in an open-source format that query engines can access allows businesses to extract maximum value from the data they already have. Cost-effectiveness is another area where the data lakehouse usually outperforms the data warehouse.
It’s fair to mention that the data lakehouse is a relatively new concept compared to the data warehouse. In the coming years, lakehouses are expected to mature and deliver on their fundamental promise: being more cost-efficient, simpler, and capable of serving diverse kinds of data usage and applications.
Data Lakehouse: Related reads
- What is a data lake: Definition, architecture, and solutions.
- Data mesh vs data lake: Understanding decentralized and centralized approaches to data management.
- Data warehouse vs data lake vs data lakehouse: What are the key differences?
- Why does a data lake need a data catalog?