Data Mesh and Data Lake: Understanding Use Cases & Reasons to Deploy

Published on: November 16th, 2022, Last Updated on: January 30th, 2023

Data mesh is a design strategy for enterprise data platform architecture. Meanwhile, a data lake is a central repository that stores data — structured and unstructured — in a raw format.

Here, we will further explore the concepts of data mesh and data lake by understanding their specific use cases and maturity stages at which organizations may choose to deploy them into their data platform.

Data mesh and data lake: The fundamentals

Let’s begin by quickly reviewing the fundamentals of data mesh and data lake.

What is a data mesh?

Data mesh is “a decentralized sociotechnical approach to share, access, and manage analytical data in complex and large-scale environments—within or across organizations,” as defined in Data Mesh: Delivering Data-Driven Value at Scale.

It combines the techniques of distributed domain-driven architecture, self-serve platform design, and product thinking with data to develop an enterprise data platform architecture.

The term was coined by Zhamak Dehghani, a former principal technology consultant at ThoughtWorks, in 2019.

Read more: Everything you need to know about data mesh

Key principles of a data mesh architecture. Source: Data Mesh by Zhamak Dehghani, O'Reilly Media, Inc

Example use cases: Data mesh for sales and marketing

Both sales and marketing departments are domains and, as such, have domain owners.

These teams, rather than a central data team, are responsible for collecting, storing, and managing their own data. So, if someone from the marketing team wants to access sales data, they work with the sales team directly.

As a result, business teams have complete control and ownership of their campaigns. They don’t have to wait for the IT department to help them find historical data on customers or campaign results.

What is a data lake?

A data lake is a central location for storing all kinds of data in raw and processed forms, collected from various sources.

Data lakes are simple, scalable, and cost-effective for storing large volumes of data. So, many organizations have invested in a central data lake and a data team responsible for managing it to get rid of data silos, capture all kinds of data, and drive business decision-making.

The term was popularized by James Dixon — founder and CTO of business intelligence software company Pentaho — in 2010.

Read more: Everything you need to know about data lakes

Example use cases: Data lake for big data storage

From oil and gas to life sciences, all data-heavy industries can use data lakes to store and process big data at scale.

For instance, an average oil and gas platform generates at least 2 TB of data daily. When that data is stored at a central, accessible location, data scientists and geologists can use it to optimize directional drilling, minimize unplanned downtime, improve safety, and more.
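To get a feel for that scale, here is a back-of-the-envelope sketch in Python. It assumes a constant 2 TB per day (the lower bound quoted above), decimal units (1 PB = 1,000 TB), and a 365-day year. The function name and its parameters are illustrative and not taken from any source.

```python
def yearly_volume_tb(daily_tb: float, platforms: int = 1, days: int = 365) -> float:
    """Raw data volume in TB accumulated over `days`, assuming a constant daily rate."""
    return daily_tb * platforms * days


# One platform at the quoted lower bound of 2 TB per day.
one_platform_tb = yearly_volume_tb(2.0)
print(f"One platform: {one_platform_tb:.0f} TB/year ({one_platform_tb / 1000:.2f} PB)")
```

Even at the lower bound, a single platform accumulates roughly 730 TB of raw data per year, which is why inexpensive, scalable storage matters for these industries.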

Data mesh vs. data lake: Why you shouldn’t be comparing them

Since a data lake is a centralized approach to managing data, and the data mesh is a decentralized design for enterprise data architecture, people tend to compare the two concepts.

However, a better way is to think about how they synergize. Let’s explore this possibility by looking at the following:

  • The shortcomings of a centralized approach to data architecture design
  • The promise of a decentralized data architecture
  • The merits of a design where the data mesh and data lake fit together

The shortcomings of a centralized data architecture

Currently, most data-driven organizations have set up a centralized, domain-agnostic data platform overseen by a central team of engineers.

The resulting data repository — usually a data lake — houses all organizational data and is treated as the single source of truth for the organization. For instance, Adidas used to rely on such an architecture where:

  • The data lake (AWS S3) centralized the storage of all kinds of data
  • A central data team was responsible for all the heavy lifting involved in ingesting data into the lake

Such an approach isn’t scalable, especially for large organizations like Adidas with several business domains and analytics use cases.

Here’s how Zhamak Dehghani describes this conundrum:

“The hyper-specialized data platform engineers need to provide data for a diverse set of needs, operational or analytical, without a clear understanding of the application of the data and access to the consuming domain’s experts.”

Data mesh: Key differentiating characteristics. Source: Data Mesh by Zhamak Dehghani, O'Reilly Media, Inc

The promise of a decentralized data infrastructure approach

The data mesh reimagines data architecture design around the domain-oriented decentralization of analytical and operational data.

Each data domain is owned end-to-end by the domain team, rather than a centralized data team. According to Dehghani:

“For a distributed data platform to be successful, domain data teams must apply product thinking with similar rigor to the datasets that they provide; considering their data assets as their products and the rest of the organization’s data scientists, ML, and data engineers as their customers.”

So, the domains are responsible for owning their data and implementing data pipelines that clean, prepare, aggregate, and send data to various applications.

The overall data mesh design comprises several data products, much as a microservice architecture comprises several services. These data products are interconnected and interoperable, which helps prevent data silos.
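A minimal Python sketch can make the idea of a domain-owned data product more concrete. Everything in it is a hypothetical illustration: the `DataProduct` class, its fields, and the sample sales records are assumptions made for this example, not part of any data mesh specification or tool. It shows one domain (sales) owning both its data and the pipeline that cleans and aggregates that data before serving it to consumers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]  # one row of domain data


@dataclass
class DataProduct:
    """A domain-owned dataset plus the pipeline that prepares it (hypothetical shape)."""
    name: str
    owner_domain: str
    source: Callable[[], Iterable[Record]]             # where the domain reads raw data
    steps: List[Callable[[List[Record]], List[Record]]]  # clean / prepare / aggregate

    def serve(self) -> List[Record]:
        """Run the domain's pipeline and return consumer-ready records."""
        records = list(self.source())
        for step in self.steps:
            records = step(records)
        return records


def raw_orders() -> List[Record]:
    # Made-up sample data standing in for the sales domain's raw records.
    return [
        {"region": "EU", "amount": 120.0},
        {"region": "EU", "amount": None},  # incomplete row, dropped by the clean step
        {"region": "US", "amount": 80.0},
        {"region": "EU", "amount": 30.0},
    ]


def drop_incomplete(rows: List[Record]) -> List[Record]:
    """Clean step: discard rows without an amount."""
    return [r for r in rows if r["amount"] is not None]


def revenue_by_region(rows: List[Record]) -> List[Record]:
    """Aggregate step: total revenue per region, sorted by region name."""
    totals: Dict[str, float] = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    return [{"region": k, "revenue": v} for k, v in sorted(totals.items())]


sales_revenue = DataProduct(
    name="sales.revenue_by_region",
    owner_domain="sales",
    source=raw_orders,
    steps=[drop_incomplete, revenue_by_region],
)

print(sales_revenue.serve())
```

A marketing analyst would consume `sales_revenue.serve()` without needing to know how the sales team cleans its orders, which is the same separation of concerns that a service interface provides in a microservice architecture.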

Data mesh: Key principles and their benefits. Source: Data Mesh by Zhamak Dehghani, O'Reilly Media, Inc

Introduction to Data Mesh with Zhamak Dehghani

Why does the data mesh architecture need a data lake?

Let’s start by understanding the three building blocks of the data mesh architecture:

  • Distributed data products oriented around domains
  • Independent cross-functional teams, with data engineers and data product owners, responsible for managing these products
  • A common data infrastructure as a platform to host, prep, and serve all data assets

The first point talks about how each domain is responsible for its own data products. To be able to do that, the individual domains still need to house their data in a cost-effective, cloud-based, scalable platform like a data lake.

For instance, when JPMorgan Chase implemented the data mesh design for its data ecosystem, each data product had a product-specific data lake. The data lakes were interconnected and exchanged data using a suite of AWS cloud services for cataloging, querying, and data sharing.
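The general pattern of one lake per data product, tied together by a shared catalog, can be sketched in a few lines of Python. This is only an illustration of the pattern: it does not model any AWS service or the actual JPMorgan Chase implementation, and the `MeshCatalog` class, the `LakeEntry` fields, and the storage URIs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LakeEntry:
    """Catalog record pointing at one data product's own lake."""
    product: str
    owner_domain: str
    location: str  # hypothetical storage URI of the product-specific lake


class MeshCatalog:
    """Shared catalog that lets consumers find lakes owned by other domains."""

    def __init__(self) -> None:
        self._entries: Dict[str, LakeEntry] = {}

    def register(self, entry: LakeEntry) -> None:
        # Each product name may be registered once, by its owning domain.
        if entry.product in self._entries:
            raise ValueError(f"{entry.product} is already registered")
        self._entries[entry.product] = entry

    def locate(self, product: str) -> LakeEntry:
        # Raises KeyError if the product was never registered.
        return self._entries[product]

    def products_of(self, domain: str) -> List[str]:
        return sorted(
            e.product for e in self._entries.values() if e.owner_domain == domain
        )


catalog = MeshCatalog()
catalog.register(LakeEntry("sales.orders", "sales", "s3://example-sales-lake/orders/"))
catalog.register(
    LakeEntry("marketing.campaigns", "marketing", "s3://example-marketing-lake/campaigns/")
)

# A marketing analyst discovers where the sales domain publishes its orders.
print(catalog.locate("sales.orders").location)
```

Storage stays decentralized (each domain keeps its own lake), while discovery is centralized in the catalog. In a real deployment, managed cataloging and access-control services would take the place of this in-memory dictionary.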

You can use the various data lake tools from AWS, Oracle, Databricks, or Azure to set up the overall data mesh infrastructure.

Data mesh and data lake: When do you choose to roll out either or both of them?

Choosing the data lake architecture

A centralized, monolithic architecture is good enough if you:

  • Are a small organization without numerous business domains
  • Don’t have several engineering teams
  • Can manage your analytical and operational use cases smoothly with a monolithic system

Choosing the data mesh architecture

The data mesh isn’t for everyone. You must have the following:

  • Data-oriented strategy: You consider AI, ML, and analytics to be your “strategic differentiator”
  • Data technology: You already have the ability and desire to build data-driven technology infrastructure and use it as a competitive advantage
  • Modern engineering: You already follow modern engineering practices such as CI/CD, DevOps, and DataOps, and have a modern data stack in place
  • Domain-oriented organization: Your organization design is based on your business domains
  • Organizational complexity: Existing data warehouse or lake solutions have become blockers in your quest to extract value from data at scale
  • Data-mature teams: Each domain has technical teams capable of maintaining the necessary data infrastructure and domain experts well-versed in data-driven decision-making
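The checklist above can also be expressed as a small self-assessment in Python. This is a hedged sketch: the key names are shorthand invented here for the six prerequisites, and treating an unanswered item as "not met" is an assumption of this example rather than a rule from the article.

```python
from typing import Dict, List

# Shorthand keys (hypothetical) for the six prerequisites listed above.
MESH_PREREQUISITES: List[str] = [
    "data_oriented_strategy",
    "data_technology",
    "modern_engineering",
    "domain_oriented_organization",
    "organizational_complexity",
    "data_mature_teams",
]


def missing_prerequisites(answers: Dict[str, bool]) -> List[str]:
    """Return the prerequisites not yet met, in checklist order."""
    return [p for p in MESH_PREREQUISITES if not answers.get(p, False)]


answers = {
    "data_oriented_strategy": True,
    "data_technology": True,
    "modern_engineering": False,
    "domain_oriented_organization": True,
    "organizational_complexity": True,
    # "data_mature_teams" is left unanswered, so it counts as not met
}

gaps = missing_prerequisites(answers)
print("Ready for a data mesh" if not gaps else f"Not yet; gaps: {gaps}")
```

Because all six items are prerequisites, a single gap is enough to suggest staying with a centralized architecture until that gap is closed.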

Bottom line

The key takeaways from this article are:

  • The data mesh is a cultural and technological shift in thinking
  • The data lake, besides being a fundamental component of data stacks, is a means of implementing many of the elements that make up the mesh infrastructure

If you’re considering the data mesh approach, your focus should be on redesigning how data flows through your organization. Tooling (data lakes, warehouses, pipelines) should be a secondary concern, i.e., “an implementation detail.”

Photo by Quang Nguyen Vinh from Pexels
