What Is a Data Lake and Why It Needs a Data Catalog

January 5th, 2021

Learn what a data lake is (definition) and how to get the best value from it with a data catalog.

Much like the term suggests, a data lake is literally a space full of data, free and unfiltered. Sounds peaceful, right? Spoiler: You never know what lies beneath! 🔍

Just as a lake stores all the water that rushes into it, a data lake stores all types of data, whether it’s from internal or external data sources. And everyone’s welcome—so the data in a data lake can be unstructured, semi-structured or structured.

We’re talking everything from messy data like audio files, emails, photos, or satellite imagery to neater, cleaner data like phone numbers, customer names, addresses and zip codes.

Here’s how James Dixon, the person who created the term “data lake”, describes it:

"If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake and various users of the lake can come to examine, dive in, or take samples."

Here’s an even simpler analogy from Alex Gorelik, author of the book “The Enterprise Big Data Lake”:

"A data lake is sort of like a piggy bank. You often don’t know what you are saving the data for, but you want it in case you need it one day."

Examples of popular piggy banks, sorry data lakes, include Amazon S3 and Azure Blob Storage. 🐽

Who uses data lakes?

That’s like asking who swims in the ocean—literally anyone! 🏄

Anyone can use a data lake, from data analysts and scientists to business users. However, to work with data lakes you need to be familiar with data processing and analysis techniques. That’s why it’s usually data scientists and data engineers who work with data lakes.

But wait…

Isn’t a data lake the same as a data warehouse?

Or is it data lake vs data warehouse?

This is a common confusion—just like the chicken and egg story! We get it. But a data lake is not data warehouse 2.0. They are completely different repositories, built for different purposes.

Let’s dig a little deeper into this. Or swim a little deeper, if you will.

Here are some of the main differences between a data lake and a data warehouse.

The way you store data is different

Before storing data in a data warehouse, you need to model it—provide it with a structure. This process is called schema-on-write.

For data lakes, you can store raw data just the way it is. You can model it whenever you need to use it for analysis. This process is called schema-on-read.
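To make the difference concrete, here's a minimal Python sketch of both approaches. Everything in it is made up for illustration: the sample events, the WAREHOUSE_SCHEMA dictionary and the helper functions are hypothetical, and a real warehouse or lake would do this with far more machinery. The "warehouse" checks every record against a fixed structure before storing it, while the "lake" keeps the raw lines as they arrive and only applies a structure when someone reads them.

```python
import json

# Hypothetical raw events, as they might arrive from different sources.
RAW_EVENTS = [
    '{"customer": "Ada", "zip": "94105", "amount": "12.50"}',
    '{"customer": "Lin", "amount": "7"}',
    '{"customer": "Sam", "zip": "10001", "amount": "3.20", "note": "gift"}',
]

# Assumed warehouse structure: fixed columns, decided before any data is stored.
WAREHOUSE_SCHEMA = {"customer": str, "zip": str, "amount": float}


def write_to_warehouse(table, raw_line):
    """Schema-on-write: model the record first, reject it if it doesn't fit."""
    record = json.loads(raw_line)
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"record does not match schema: {sorted(record)}")
    table.append({name: cast(record[name]) for name, cast in WAREHOUSE_SCHEMA.items()})


def write_to_lake(lake, raw_line):
    """Data lake: store the raw data exactly as it arrived."""
    lake.append(raw_line)


def read_from_lake(lake, schema):
    """Schema-on-read: apply a structure only when someone asks a question."""
    rows = []
    for raw_line in lake:
        record = json.loads(raw_line)
        rows.append({name: cast(record[name]) if name in record else None
                     for name, cast in schema.items()})
    return rows


warehouse, lake, rejected = [], [], []
for line in RAW_EVENTS:
    write_to_lake(lake, line)
    try:
        write_to_warehouse(warehouse, line)
    except ValueError:
        rejected.append(line)

# The same raw data can be read with whatever structure today's question needs.
totals = read_from_lake(lake, {"customer": str, "amount": float})
notes = read_from_lake(lake, {"customer": str, "note": str})
```

In this toy run the lake keeps all three events, while the strict warehouse only accepts the one that matches its columns exactly. The two reads at the end show the flip side: the lake can be queried with two different structures without touching the stored data.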

Data lakes are flexible; data warehouses aren’t

A lot of effort and decision-making is involved before storing data in a warehouse. This includes defining the business questions to be answered, processing raw data and transforming raw data into a structured format. That’s why reconfiguring a data warehouse is tedious and time-consuming.

On the other hand, data lakes can be configured and reconfigured easily since they don’t have a predefined structure.

Scaling data warehouses is challenging

Traditionally, enterprises invested heavily in data warehouses to process and store data that answered specific business questions. However, scaling data warehouses is expensive, and changing the structure of the data stored is cumbersome. Certainly not your IT team’s idea of fun!

Data lakes offer a solution to the challenges posed by data warehouses as they’re cheap (for storing massive volumes of data), highly scalable and flexible.

Data warehouses are more secure than data lakes

Data warehouses have been around for a while, and as a result they’re well equipped to deal with data security and integrity. Data lakes are still evolving and aren’t as secure as data warehouses yet.

Also, as data lakes store all kinds of data in a single repository, they might make your data more vulnerable. Just laying out all the cards here. So…

Why should you store data in a data lake?

For enterprises that work with big data, data lakes offer a low-cost storage alternative that helps overcome the challenges presented by data warehouses. Every dollar matters after all.

Also, data warehouses store historic data, so they help you understand what happened in the past and what conclusions you can draw from it.

With data lakes, you can use data to explain not just what happened in the past, but also what’s likely to occur in the future. Like a pack of tarot cards, but logical!

Moreover, since real-time data can be directly streamed into data lakes, you can also do real-time analytics. And let’s face it, everyone wanted the data like yesterday!

For more on data warehouses, please read this article.

Oh, all ready to set up a fancy data lake?

Wait a minute—here’s the thing about data lakes. They can degrade into veritable swamps almost overnight if you don’t set them up the right way. Hear me out.

Problem with data lakes

The truth is that enterprises tend to think of data lakes as a magic pill when it comes to data because that’s how they are sold.

In reality, no matter how much effort you put into setting up a data lake, it means nothing if the data lake is not usable, transparent or accessible.

When we first began to work with a leading Fortune 500 company to democratize their data, they had already spent eight whole years setting up a fancy SAP data warehouse. We’re talking all the bells and whistles:
  • A cloud infrastructure team to set up and manage the data lake 👩‍💻
  • Fancy data access and governance policies to follow… 🛑
  • … and an army of MIS execs to extract the data and share insights 🧑‍🤝‍🧑

But millions of dollars and (well, what felt like) light years later, they realized that they weren’t any closer to achieving their goals of digital transformation than when they had first started out.

Why? Because the lake was accessible to only a few, the burden of maintaining it was on a few, and it was unable to respond to the needs of the business in real time.

In fact, the product owner of their data lake remarked:

"We have built a data lake but it is still a black box to the business users. They just won’t use S3."

One of the biggest challenges facing data teams is data silos. Read how you can break down data silos here.

Data Catalog Primer - Everything You Need to Know About Data Catalogs.

Adopting a data catalog is the first step towards data discovery. In this guide, we explore the evolution of the data management ecosystem, the challenges created by traditional data catalog solutions, and what an ideal, modern-day data catalog should look like. Download now!