What Is a Data Lake and Why It Needs a Data Catalog
January 5th, 2021

Learn what a data lake is and how to get the most value from it with a data catalog.
Much like the term suggests, a data lake is literally a space full of data, free and unfiltered. Sounds peaceful, right? Spoiler: You never know what lies beneath!
Just as a lake stores all the water that rushes into it, a data lake stores all types of data, whether it's from internal or external data sources. And everyone's welcome, so the data in a data lake can be unstructured, semi-structured, or structured.
We're talking everything from messy data like audio files, emails, photos, or satellite imagery to neat and clean data like phone numbers, customer names, addresses, and zip codes.
Here's how James Dixon, the person who coined the term "data lake", describes it:
"If you think of a data mart as a store of bottled water - cleansed and packaged and structured for easy consumption - the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake and various users of the lake can come to examine, dive in, or take samples."
Here's an even simpler analogy from Alex Gorelik, author of the book "The Enterprise Big Data Lake":
"A data lake is sort of like a piggy bank. You often don't know what you are saving the data for, but you want it in case you need it one day."
Examples of popular piggy banks, sorry, data lakes, include Amazon S3 and Azure Blob Storage. 🐽
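To make the "everyone's welcome" idea concrete, here is a minimal sketch of landing structured, semi-structured, and unstructured data side by side, exactly as it arrives. It is an illustration only: a temporary local directory stands in for an object store such as Amazon S3 or Azure Blob Storage, and the `land` helper, the `raw/<source>/` layout, and the sample files are all assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path


def land(lake_root, source, name, payload):
    """Write payload bytes under <lake_root>/raw/<source>/<name>, unchanged.

    Hypothetical helper: the raw/<source>/ layout is an assumption for this
    sketch, not a convention required by any storage service.
    """
    target = Path(lake_root) / "raw" / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


with tempfile.TemporaryDirectory() as lake_root:
    # Structured: a CSV export with fixed columns.
    land(lake_root, "crm", "customers.csv", b"name,zip\nAda,94105\n")
    # Semi-structured: a JSON event whose fields may vary per record.
    land(lake_root, "web", "click.json", json.dumps({"page": "/home"}).encode("utf-8"))
    # Unstructured: opaque bytes standing in for a photo.
    land(lake_root, "field", "photo.jpg", bytes([0xFF, 0xD8, 0xFF, 0xE0]))

    # List everything in the "lake", relative to its root.
    landed = sorted(
        p.relative_to(lake_root).as_posix()
        for p in Path(lake_root).rglob("*")
        if p.is_file()
    )
    print(landed)
```

Notice that nothing here checks or reshapes the data on the way in. That is the piggy bank in action, and it is also why finding things later can get tricky.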
Who uses data lakes?
That's like asking who swims in the ocean: literally anyone!
Anyone can use a data lake, from data analysts and scientists to business users. However, to work with data lakes, you need to be familiar with data processing and analysis techniques. That's why it's usually data scientists and data engineers who work with data lakes.
But wait...
Isn't a data lake the same as a data warehouse?
Or is it data lake vs data warehouse?
This is a common confusion, just like the chicken and egg story! We get it. But a data lake is not data warehouse 2.0. They are completely different repositories, built for different purposes.
Let's dig a little deeper into this. Or swim a little deeper, if you will.
Here are some of the main differences between a data lake and a data warehouse.
The way you store data is different
Before storing data in a data warehouse, you need to model it, that is, give it a structure. This process is called schema-on-write.
For data lakes, you can store raw data just the way it is. You can model it whenever you need to use it for analysis. This process is called schema-on-read.
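Here is a minimal sketch of the difference, using only the Python standard library. It is an illustration under stated assumptions, not how any particular warehouse or lake product works: the sample events, the `SCHEMA` mapping, and the `warehouse_load`, `lake_store`, and `lake_read` helpers are all hypothetical names invented for this example.

```python
import json

# Hypothetical raw events, as they might arrive: no shared structure.
RAW_EVENTS = [
    '{"customer": "Ada", "zip": "94105", "amount": "19.99"}',
    '{"customer": "Grace", "amount": 5}',
    '{"customer": "Linus", "zip": 10001, "note": "phone order"}',
]

# Hypothetical target schema: field name -> type used to coerce values.
SCHEMA = {"customer": str, "zip": str, "amount": float}


def apply_schema(record, schema):
    """Coerce a dict to the schema; missing fields become None, extras are dropped."""
    return {
        field: (cast(record[field]) if record.get(field) is not None else None)
        for field, cast in schema.items()
    }


def warehouse_load(raw_lines, schema):
    """Schema-on-write: every record is modeled before it is stored."""
    return [apply_schema(json.loads(line), schema) for line in raw_lines]


def lake_store(raw_lines):
    """Schema-on-read, step 1: store the raw lines untouched."""
    return list(raw_lines)


def lake_read(stored_lines, schema):
    """Schema-on-read, step 2: apply a schema only when an analysis needs one."""
    return [apply_schema(json.loads(line), schema) for line in stored_lines]


warehouse_rows = warehouse_load(RAW_EVENTS, SCHEMA)
lake_objects = lake_store(RAW_EVENTS)

# The lake still holds the original text, including the "note" field that the
# warehouse schema dropped, so a later analysis can read it with a new schema.
notes_view = lake_read(lake_objects, {"customer": str, "note": str})
```

In this sketch the warehouse rows lose the `note` field at load time because the schema did not ask for it, while the lake copy can still answer a question nobody thought of when the data arrived.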
Data lakes are flexible; data warehouses aren't
A lot of effort and decision-making is involved before storing data in a warehouse. This includes defining the business questions to be answered, processing raw data and transforming raw data into a structured format. That's why reconfiguring a data warehouse is tedious and time-consuming.
On the other hand, data lakes can be configured and reconfigured easily since they don't have a predefined structure.
Scaling data warehouses is challenging
Traditionally, enterprises invested heavily in data warehouses to process and store data that answered specific business questions. However, scaling data warehouses is expensive, and changing the structure of the data stored is cumbersome. Certainly not your IT team's idea of fun!
Data lakes offer a solution to the challenges posed by data warehouses as they're cheap (for storing massive volumes of data), highly scalable and flexible.
Data warehouses are more secure than data lakes
Data warehouses have been around for a while and, as a result, they're well equipped to deal with data security and integrity. Data lakes are still evolving and aren't as secure as data warehouses yet.
Also, as data lakes store all kinds of data in a single repository, they might make your data more vulnerable. Just laying out all the cards here. So...
Why should you store data in a data lake?
For enterprises that work with big data, data lakes offer a low-cost storage alternative that helps overcome the challenges presented by data warehouses. Every dollar matters after all.
Also, data warehouses store historical data, so they help you understand what happened in the past and what conclusions you can draw from it.
With data lakes, you can use data to explain not just what happened in the past, but also what's likely to occur in the future. Like a pack of tarot cards, but logical!
Moreover, since real-time data can be directly streamed into data lakes, you can also do real-time analytics. And let's face it, everyone wanted the data, like, yesterday!
Oh, all ready to set up a fancy data lake?
Wait a minute, here's the thing about data lakes. They can degrade into veritable swamps almost overnight if you don't set them up the right way. Hear me out.
The problem with data lakes
The truth is that enterprises tend to think of data lakes as a magic pill when it comes to data because that's how they are sold.
In reality, no matter how much effort you put into setting up a data lake, it means nothing if the data lake is not usable, transparent or accessible.
When we first began to work with a leading Fortune 500 company to democratize their data, they had already spent eight whole years setting up a fancy SAP data warehouse. We're talking all the bells and whistles:
- A cloud infrastructure team to set up and manage the data lake 👩‍💻
- Fancy data access and governance policies to follow...
- ... and an army of MIS execs to extract the data and share insights 🧑‍🤝‍🧑
But millions of dollars and (well, what felt like) light years later, they realized that they weren't any closer to achieving their goals of digital transformation than when they had first started out.
Why? Because the lake was accessible to only a few, the burden of maintaining it was on a few, and it was unable to respond to the needs of the business in real time.
In fact, the product owner of their data lake remarked:
"We have built a data lake but it is still a black box to the business users. They just won't use S3."
One of the biggest challenges facing data teams is that of data silos.