With so many data storage systems available, it’s easy to get confused as to how they differ. Fear not! In this blog, we’re comparing three common data storage architectures – a data warehouse vs a data lake vs a data lakehouse – so you can choose the best option to meet your organization’s needs.
What is a data warehouse?
A data warehouse is a repository for large amounts of structured data from various data sources – structured being the operative word. Data usually falls under three categories: structured, unstructured, and semi-structured.
- Structured data is quantitative and highly organized, such as names, birthdays, addresses, social security numbers, stock prices, and geolocation.
- Unstructured data is qualitative. It has no clearly defined framework and is not easily searchable, such as online reviews, videos, photos, or audio files.
- Semi-structured data combines elements of the other two. It has a loosely defined framework, such as emails that have addresses for sender/recipient, but a body that can contain anything.
Data that lives in a data warehouse is processed for validation, sorting, summarization, aggregation, analysis, reporting, or classification. A data warehouse is highly organized and is formatted for a specific purpose. It enables an organization to easily access and analyze relevant data in order to develop actionable insights.
Components of a data warehouse
In order to provide these insights, a data warehouse consists of four core components: a central database, data integrations tools, metadata, and data access tools.
- Central database: The backbone of a data warehouse, a central database houses data organized into tables that group together related objects.
- Data integration tools: Data integration tools are used to pull data from various sources and transform it so it fits within the data warehouse. The traditional approach used here is called extract, transform, and load (ETL), though extract, load, and transform (ELT) techniques have also become popular.
- Metadata: Metadata is data about your data and is used to create context and organization. For example, if a photo file is a data point, then its date, geolocation, and camera type is its metadata that helps to better contextualize and organize the file.
- Data access tools: Data access tools, such as query tools, application development tools, data mining tools, and online analytical processing (OLAP) tools, give users the ability to interact with the data stored in their data warehouses.
Data warehouses are a great solution for storing and obtaining insights from structured data. But the average person produces 2.5 quintillion data bytes per day, most of which is raw data that doesn't fit so neatly inside a data warehouse. So what is an organization to do when it comes to storing all this data? The answer – a data lake.
What is a data lake?
Popularized in 2010, a data lake is a centralized repository for virtually all types of raw data. Structured, unstructured, and semi-structured data can all be quickly dumped into a data lake before it's processed for validation, sorting, summarization, aggregation, analysis, reporting, or classification.
Components of a data lake
A data lake architecture has five major components that can be remembered with the acronym ISASA – Ingest, Store, Analyze, Surface, Act.
- Ingest: This refers to data migration, usually through APIs or batch processes.
- Store: Data ingested from various sources is stored in a single repository, without silos.
- Analyze: Users can then analyze the data to uncover relationships and even make forecasts.
- Surface: To surface means to present findings of the analysis in easily discernible ways – usually in the form of a chart, graph, or actionable insight.
- Act: With the data analyzed and surfaced, it can be acted upon to inform business decisions.
Data lakes are built on inexpensive object storage and provide organizations with simple, cost-effective, scalable storage. The problem with data lakes is that since they serve as repositories for virtually all types of data, they can very easily become disorganized and transform into a dreaded, inefficient data swamp where it’s hard to find or do anything useful.
What is a data lakehouse?
A data lakehouse is a more recent data management architecture pioneered by Databricks that combines the flexibility, open format, and cost-effectiveness of data lakes with the accessibility, management, and advanced analytics support of data warehouses.
5 Composite layers of a data lakehouse
There are typically five layers that make up a data lakehouse:
- Ingestion layer: Data is pulled from different sources and delivered to the storage layer.
- Storage layer: Various kinds of data (structured, semi-structured, and unstructured) are kept in a cost-effective object store, such as Amazon S3.
- Metadata layer: A unified catalog that provides metadata about all objects in the data lake and enables data indexing, quality enforcement, and ACID transactions, among other features. The metadata layer is the defining element of the data lakehouse.
- API layer: Metadata APIs allow users to understand what data is required for a particular use case and how to retrieve it.
- Consumption layer: The business tools and applications that leverage the data stored within the data lake for analytics, BI, and AI purposes.
Data warehouse vs data lake
So what’s the difference between a data warehouse and a data lake? The two are more dissimilar than they are alike.
- A data warehouse stores structured data that has been processed for a specific purpose. These systems are more organized than a data lake.
- A data lake is a free-for-all, housing structured, unstructured, and semi-structured data. Data lakes can also store unprocessed data for some unknown, future use.
Let’s take a look at how these would be used in the real world, and how they could work together. Consider an airline. The company could neatly store passenger information in a data warehouse, where data is structured and includes items such as names, birthdays, addresses, origination airports, destination airports, and frequency of travel, among others. The company may also have a separate data warehouse where it stores financial data for each passenger.
The airline may also choose to have a data lake where unstructured data, like customer support emails, photographs from IDs, and social media content culled from various social channels, could live until further analysis is needed. Insights from this data, when coupled with the structured data insights, could empower the airline to create a unique customer offering that may have otherwise gone undetected as a means to attract new customers.
Data warehouse vs data lakehouse
As we’ve discussed, data warehouses are rigid architectures that are well organized and provide fast discovery, query, and preparation of processed data. Worth noting is that the storage cost of a data warehouse can be quite expensive. This is why data retention is usually limited; historical data is generally removed to make way for new data.
A data lakehouse, however, is flexible (able to house all types of data) like a data lake, but it comes with the organization and management features of a data warehouse. They are more cost-effective than a data warehouse and enable both prescriptive analytics you’d find with a warehouse and predictive analytics usually performed in a data lake.
Using our previous airline example, the company might find that storing passenger info in a data lake is too messy and choose to house it in a data lakehouse instead. Under this architecture, the airline could store the data so that a passenger’s information, financial data (credit card numbers, ticket spend), customer support tickets, and social media profiles are all matched up to render a complete profile of any particular individual. It could then run predefined data analysis to find average ticket spend for passengers in Boston, or predictive analysis to forecast when college students from the midwest are likely to purchase tickets to Europe.
Data warehouse vs data lake vs data lakehouse – at a glance
|Data Warehouse||Data Lake||Data Lakehouse|
|Raw or processed data||Processed||Raw||Raw and processed|
|Data structure||Structured only||Structured, semi-structured, unstructured||Structured, semi-structured, unstructured|
|Organization||Highly organized||Little organization||Highly organized|
|Analytics||Good for prescriptive analytics||Good for predictive analytics||Good for prescriptive and predictive analytics|
As you have seen, when it comes to a data warehouse vs data lake vs data lakehouse, the key differentiator is organization. Think of it this way: A data lakehouse is a data lake with the organization of a data warehouse. Proper governance and management using metadata help organize the data so users can quickly and effectively perform analysis leading to actionable insights.
- What is a data lake: Definition, architecture, and solutions.
- What is a data lakehouse: The best of data lakes and data warehouses.
- What is a data warehouse: Purpose, components, and benefits
- Data mesh vs data lake: What are the differences in architecture, use cases, and benefits?
- Why does a data lake need a data catalog?