Data Mesh Vs. Data Lake: Definition, Principles, and Architecture

Apr 11, 2022


Data mesh vs data lake: What is the difference?

Data mesh and data lake are fundamentally different, yet business and tech users alike may confuse them as comparable concepts.

Data mesh is a strategy, a design concept, for setting up a distributed data architecture. It uses a product management approach to data, where each data domain is the product, and its users are the consumers.

A data lake, by contrast, is a central repository that stores structured and unstructured data in a raw format. Data lakes are part of the data tech stack offered by most cloud providers, such as Google Cloud Platform (GCP), Amazon Web Services (AWS), and Microsoft Azure.

Data mesh architectures can include data lakes to connect different data sources into a coherent infrastructure. So, data consumers with proper credentials can access the data they need when they need it. However, that doesn't mean that all the data gets stored in a single repository.

Read on to explore the intricate differences between data mesh and data lake and their role in an organization's data architecture. Let's begin by quickly recapping the concepts, starting with the data lake.

What is a data lake?

A data lake is a central place for storing all kinds of data in raw and processed forms. The data can be in any format and collected from numerous sources.

The concept of a data lake came about when people started using Apache Hadoop (and, alongside it, NoSQL databases) to dump unstructured or semi-structured data into a cluster and then retrieve what they needed as and when there was a demand.
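
The "dump now, structure later" idea is often called schema-on-read. Here is a minimal sketch of it, assuming a local directory stands in for the lake and that sources deliver newline-delimited JSON. The function names (`land_raw`, `read_on_demand`, `demo`) and the `raw/<source>/` layout are illustrative assumptions, not part of any particular product.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, payload: str) -> Path:
    """Write a payload into the lake exactly as received (no parsing)."""
    target_dir = lake_root / "raw" / source
    target_dir.mkdir(parents=True, exist_ok=True)
    # Name files by arrival order: 000000.jsonl, 000001.jsonl, ...
    target = target_dir / f"{len(list(target_dir.iterdir())):06d}.jsonl"
    target.write_text(payload, encoding="utf-8")
    return target


def read_on_demand(lake_root: Path, source: str, fields: list[str]) -> list[dict]:
    """Apply a schema only at read time: keep the requested fields and
    skip lines that are not valid JSON objects."""
    rows = []
    for path in sorted((lake_root / "raw" / source).glob("*.jsonl")):
        for line in path.read_text(encoding="utf-8").splitlines():
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # the raw zone tolerates malformed input
            if isinstance(record, dict):
                rows.append({name: record.get(name) for name in fields})
    return rows


def demo() -> list[dict]:
    """Land two raw payloads, then read only the fields a consumer asks for."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        land_raw(lake, "sensors", '{"id": 1, "temp": 21.5, "extra": "x"}\nnot json\n')
        land_raw(lake, "sensors", '{"id": 2}\n')
        return read_on_demand(lake, "sensors", ["id", "temp"])
```

Calling `demo()` returns one row per valid JSON line, with `None` for any requested field the record lacks. Nothing is validated at write time, which is what keeps ingestion cheap and pushes interpretation to the consumer.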

When do you need a data lake?

State-of-the-art applications such as visual recognition, self-driving cars, predictive maintenance using IoT signals, and detecting disease signals from genetic code have one thing in common - they all require massive volumes of data. Enterprise data lakes are one of the best places to get that data. That's why most data science and analytics projects warrant one or several data lakes.

Lakes are particularly helpful when the volume, velocity, and variety of data are too much for traditional models to handle. Besides, your organization may want to have all the data available in one place to make analytics and cross-team collaboration easier.

What does a data lake architecture look like?

There are two fundamental data lake architectures - on-premises and cloud-based. Both take a centralized approach to collecting, storing, and processing data. They support multiple programming languages, databases, sources (like sensor or surveillance data), and big data technologies (for example, Pig and Spark).

However, on-premises data lakes aren't fully equipped for real-time processing and analytics.

Alex Gorelik best summarized it in his book The Enterprise Big Data Lake:

Between data sovereignty regulations (e.g., you are not allowed to take data out of Germany) and organizational pressures, multiple data lakes typically proved to be a better solution. However, as companies realized the complexity of supporting a massively parallel cluster and experienced the frustration at their inability to find and hire experienced administrators for Hadoop and other big data platforms, they started opting for cloud-based data lakes where most hardware and platform components are managed by the experts that work for Amazon, Microsoft, Google, and others.

Cloud-based lakes are low-cost alternatives for organizations with terabytes or even petabytes of data. Moreover, the cloud is elastic, so you can provision and pay for large compute clusters only when you need them.

To explore these architectures in-depth, check out our comprehensive guide on data lakes.

Now let's look at data mesh.

What is data mesh?

Here's how Zhamak Dehghani defines data mesh:

The data mesh platform is an intentionally designed distributed data architecture, under centralized governance and standardization for interoperability, enabled by a shared and harmonized self-serve data infrastructure.

The data mesh paradigm stands for decentralized and domain-specific data ownership. In such a universe, data is easily discoverable and ready for consumption for everyone in the organization. In a way, data mesh is like product management, where each data domain is treated as a product and gets an owner.

When do you need data mesh?

The primary application is scaling data analytics. The centralized data lakes fall short for many organizations in two ways:

  • A central ETL pipeline gives teams less control over increasing volumes of data.
  • Different data use cases require various transformations, exacting a heavy toll from the central platform.

The easiest way to scale a data architecture, speed up analytics, and democratize data is by breaking it down into smaller domains. With its domain-oriented and decentralized architecture, data mesh makes each domain responsible for handling its own data pipelines.
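
To make the contrast with a single central ETL pipeline concrete, here is a small sketch in which each domain registers and owns its own transformation. The registry `DOMAIN_PIPELINES`, the `domain_pipeline` decorator, and the two example domains are hypothetical names chosen for illustration. A real mesh would run these pipelines on separate infrastructure rather than in one process.

```python
from typing import Callable

Record = dict
Pipeline = Callable[[list[Record]], list[Record]]

# Hypothetical registry: domain name -> pipeline owned by that domain's team.
DOMAIN_PIPELINES: dict[str, Pipeline] = {}


def domain_pipeline(domain: str) -> Callable[[Pipeline], Pipeline]:
    """Decorator that registers a pipeline under the domain that owns it."""
    def register(fn: Pipeline) -> Pipeline:
        DOMAIN_PIPELINES[domain] = fn
        return fn
    return register


@domain_pipeline("sales")
def sales_pipeline(rows: list[Record]) -> list[Record]:
    # Sales decides its own transformation: keep closed deals only.
    return [r for r in rows if r.get("status") == "closed"]


@domain_pipeline("marketing")
def marketing_pipeline(rows: list[Record]) -> list[Record]:
    # Marketing decides its own transformation: normalize channel names.
    return [{**r, "channel": r["channel"].lower()} for r in rows]


def run(domain: str, rows: list[Record]) -> list[Record]:
    """Run the pipeline owned by `domain`; no central team edits this logic."""
    return DOMAIN_PIPELINES[domain](rows)
```

For example, `run("marketing", [{"channel": "EMAIL"}])` applies only the marketing team's rule. Adding a new use case means adding or changing one domain's function, not queueing a change request against a shared pipeline.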

What does a data mesh architecture look like?

Data mesh can be a challenging concept to grasp. Here's an image offering a high-level overview of the mesh architecture.


High-level overview of a data mesh architecture. Image by Martin Fowler.

The architecture of data mesh is best described in its four key characteristics:

  1. Decentralized and domain-oriented: Data mesh stores data across different domains, which are maintained by domain experts.
  2. Data as a product: Each data domain is a product, and the users are its customers.
  3. Self-serve data platform: A data mesh advocates setting up a tech ecosystem that supports creating, using, and maintaining data products without needing specialized knowledge or expertise in sophisticated technologies.
  4. Federated computational governance: Decentralized data products can lead to data silos. A federated approach to governance standardizes rules, definitions, and procedures related to data.
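
The four characteristics above can be sketched in a few lines of code. This is an illustrative model only: the `DataProduct` fields, the three policy rules, and the in-memory `Catalog` are assumptions made for the example, not a standard data mesh API.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """A domain-owned dataset published for consumers (hypothetical model)."""
    name: str
    domain: str                                 # 1. domain-oriented ownership
    owner: str                                  # 2. data as a product: someone is accountable
    description: str
    schema: dict[str, str]                      # column name -> type
    pii_columns: frozenset[str] = frozenset()   # columns flagged as personal data


def check_policies(product: DataProduct) -> list[str]:
    """4. Federated governance: global rules that every domain must satisfy."""
    violations = []
    if not product.owner:
        violations.append("owner is required")
    if not product.description:
        violations.append("description is required")
    unknown = product.pii_columns - set(product.schema)
    if unknown:
        violations.append(f"PII columns not in schema: {sorted(unknown)}")
    return violations


@dataclass
class Catalog:
    """3. Self-serve platform: domains publish, consumers discover."""
    products: dict[str, DataProduct] = field(default_factory=dict)

    def publish(self, product: DataProduct) -> None:
        violations = check_policies(product)
        if violations:
            raise ValueError("; ".join(violations))
        self.products[product.name] = product

    def find(self, domain: str) -> list[str]:
        return sorted(p.name for p in self.products.values() if p.domain == domain)
```

The design choice worth noting is that the policy check is shared while the products are not: each domain builds its own `DataProduct`, but none can reach the catalog without passing the same global rules.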

Want to learn more about data mesh? Read our complete guide on understanding data mesh architecture.

Both data lake and data mesh come with several benefits. Let’s look at the most prominent ones.

Data mesh vs data lake: Comparing benefits

The top benefits of the decentralized data mesh include:

  • Business agility, scalability, and empowered business users: Business users have direct access to the decentralized data domains and complete data visibility. Moreover, self-service and domain-level ownership remove engineering bottlenecks and the need for technical support. So, average business users can find the data they want, run analytics, get insights, and prepare reports all by themselves.
  • Data with full context: In data lakes, knowledge of the data assets, and of the transformations they underwent, stays with a select few. Additionally, engineers might lack business context, or business users might lack the engineering skills needed to put data to use. The result is confusion about what the data means and how to use it. Data mesh mitigates these issues because domain owners handle their data end-to-end, so no context is lost.
  • Data quality and end-to-end compliance: Since domain owners handle their data and have end-to-end visibility of its lineage, they can ensure better data quality and integrity. Also, since a federated governance program means globally defined policies, domain owners can implement end-to-end compliance as per the regulatory requirements.

The top benefits of the centralized data lake include:

  • Central repository for versatile data: A data lake gets rid of silos as it brings all organizational data under one roof. As a result, working with data lakes is easier and more streamlined for data analysts and scientists. Moreover, everyone with the right access has end-to-end visibility of each data asset.
  • Metadata lakes for better context: As modern data stacks evolve, organizations can employ metadata lakes for faster data discovery with context and more effective lineage tracking.
  • Data security and governance: With cloud-based data lakes, providers well-versed in cloud computing handle the security of the underlying platform, offering enterprise-grade security. Also, the centralized architecture makes developing and enforcing governance programs easier and less complex.

To make things easier, we’ve also mapped the benefits in the table below.

| Data Mesh | Data Lake |
| --- | --- |
| The decentralized architecture makes it easier for businesses to deliver analytics faster and at scale. | The centralized repository brings versatile data collected from various sources under one roof. |
| A domain-oriented design ensures domain owners and users have access to data with full context. | A metadata lake compiles data with adequate context (glossaries, classifications, transformations, and more). |
| Decentralized domains with dedicated owners improve data quality and make it possible to ensure end-to-end compliance. | The centralized architecture also makes it easier to enforce governance. |

Having explored the concepts, architecture, and benefits, let’s take a look at the use cases of the data mesh and the data lake.

Data mesh vs data lake: Comparing use cases

Let’s explore some real-world use cases for the data mesh:

  1. Data science and machine learning: Domain teams are in charge of their data. So, they build and maintain datasets by tapping into years of domain experience and knowledge. This makes data easier to find, understand, and use, which is the key to succeeding with your analytics or ML algorithms.
  2. Compliance: Compliance can be considered a product and managed end-to-end by the domain owners. The domain owner could also work closely with the regulators to make sure their data standards satisfy all the regulatory requirements.
  3. Sales and marketing: Both sales and marketing departments (domains) have a domain owner. So, data is owned and managed by their respective teams. As a result, they have better control over their campaigns and don’t have to wait for the IT department to help them find historical data on customers or campaign results.

Now let’s look at some of the use cases for data lakes:

  1. Data analytics and machine learning: Both areas require vast amounts of data to drive insights and train their algorithms. So, storing all data in one location (i.e., the data lake) makes the life of a data scientist easier and further simplifies and speeds up model training.
  2. Big data storage: From oil and gas to life sciences, all data-heavy industries can use data lakes to store and process big data at scale. For instance, an average oil and gas company generates at least 1.5 TB of IoT data per day. That data needs to be stored at a central, accessible location, so that data scientists and geologists can use it to optimize directional drilling, minimize unplanned downtime, improve safety, and more.
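
To get a feel for what the 1.5 TB per day figure implies for storage planning, here is a back-of-the-envelope calculation. It assumes a constant daily rate, no compression, and no retention policy, all of which are simplifications. The function name is illustrative.

```python
DAILY_TB = 1.5  # lower bound quoted above for one oil and gas company


def accumulated_tb(days: int, daily_tb: float = DAILY_TB) -> float:
    """Raw data accumulated after `days`, assuming a constant daily rate,
    no compression, and nothing ever deleted."""
    return days * daily_tb


one_month = accumulated_tb(30)    # 45.0 TB
one_year = accumulated_tb(365)    # 547.5 TB, i.e. over half a petabyte
```

Even at the lower bound, a single year of raw signals passes half a petabyte, which is the scale at which a shared, elastic store becomes attractive.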

Bottom line: both concepts have diverse applications across industries when it comes to analytics.

Instead of data mesh vs data lake, think data mesh + data lake

If you take away just one point from this article, it should be this:

Data Mesh is a design concept. Data Lake is a data repository. Data mesh tells you how to implement a distributed data architecture. A data lake stores structured and unstructured data in a centralized fashion.

However, that doesn't mean they can't co-exist. For instance, your architecture could be based on the data mesh principles, while you use data lakes within domains to store and maintain data.
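
One way to picture this combination is a mesh-level lookup that points to a separate lake per domain. The sketch below is an assumption-laden illustration: the `DomainLake` class, the `s3://example-...` bucket names, and the `curated` zone are invented for the example and do not refer to real resources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainLake:
    """One lake per domain; the URI scheme and bucket names are hypothetical."""
    domain: str
    root_uri: str

    def location(self, product: str, zone: str = "curated") -> str:
        """Build the storage location of a data product inside this domain's lake."""
        return f"{self.root_uri.rstrip('/')}/{zone}/{product}/"


# Mesh-level view: which domain owns which lake.
LAKES = {
    "sales": DomainLake("sales", "s3://example-sales-lake"),
    "marketing": DomainLake("marketing", "s3://example-marketing-lake/"),
}


def resolve(domain: str, product: str) -> str:
    """Consumers ask the mesh for a product; storage stays inside the domain."""
    return LAKES[domain].location(product)
```

Here `resolve("sales", "orders")` yields `s3://example-sales-lake/curated/orders/`. The mesh supplies the ownership and lookup layer, while each lake remains an ordinary repository that its domain team operates.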

That's why, rather than comparing data mesh vs. data lake, a better way forward is to understand your organization's needs and then develop an approach to data management, possibly using the data mesh and several data lakes.

Photo by Quang Nguyen Vinh from Pexels
