What is a data mesh?
A data mesh is an architectural and organizational paradigm built on decentralized, domain-specific ownership of data, where that data is easily discoverable and ready for consumption by everyone in the organization. It moves away from the notion that large-scale analytical data must be centrally stored, transformed and processed before it is served to data consumers across different domains. Instead, each business domain is responsible for hosting, preparing and serving its data, both to its own domain and to the wider organization.
Before we can truly appreciate all of the above, we must ask ourselves one simple question:
Why data mesh?
In the modern world, organizations need to process and analyze ever larger amounts of diverse data. In response, large companies are investing in state-of-the-art data architectures to handle these workloads. Most of them use a centralized approach for storing and processing data. However, a centralized architecture, even one composed of modern tools, inherits problems from data platforms of past generations that hinder unlocking the true potential of data in organizations.
Recently, a new concept of “data mesh” has emerged, which is based on a distributed data architecture that avoids the problems that exist when working with a centralized architecture.
An understanding of data mesh is useful if you work with microservices, cloud applications, virtual data catalogs or machine learning applications, and serve a business with rich domains.
In this article, we will take a closer look at what data mesh is, the benefits of using it, and its architecture.
Data mesh definition
Data mesh is a new type of data platform architecture first proposed by Zhamak Dehghani of Thoughtworks. The main idea of the data mesh is that data should be stored in decentralized, independent domains (business areas whose teams generate, store, and process their data independently), not in one centralized, unified data platform. These domains are responsible for storing, managing, and maintaining their data.
Let's look at another definition, this one's from the Thoughtworks official website.
Data mesh is a decentralized sociotechnical approach to remove the dichotomy of analytical data and business operation. Its objective is to embed sharing and using analytical data into each operational business domain and close the gap between the operational and analytical planes. It's founded on four principles: domain data ownership, data as a product, self-serve data platform and computational federated governance.
It is important to note that data mesh doesn't advocate for siloed data stored in different pockets of the business, but rather for discoverable and reusable datasets that are owned by different domains of the business.
Data mesh principles
There are four basic principles that define data mesh:
Domain-oriented, decentralized data ownership : Data mesh stores data across different domains. The functional unit in this case is not a pipeline stage but the entire domain, which collects, manages, processes and serves its data. Independence from a central unit of experts and infrastructure helps speed up the process and also ensures richer data sets, as they are maintained close to the domain experts.
Data as a product : Each domain treats the data it shares as a product and the people who use it as customers. The data should be easy for users to find, read and understand. It also needs to follow product practices such as versioning, monitoring, and security.
Self-service infrastructure as a platform : A self-service platform gives both end users and developers the tools for self-service business intelligence and for developing analytic products.
Federated governance : General rules and regulations govern the operation of the distributed system, so that data is kept secure yet democratically available to everyone, thanks to global access control.
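To make the "data as a product" and domain ownership principles more concrete, here is a minimal Python sketch of how a domain might describe a dataset it publishes and how consumers might discover it. The `DataProduct` and `Catalog` classes, their fields, and the example domains are all hypothetical names chosen for illustration; they are not part of any data mesh standard or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor for a dataset that a domain publishes as a product."""
    name: str
    domain: str        # owning business domain
    owner: str         # accountable team
    version: str       # products are versioned, like software
    schema: dict       # column name -> type, so consumers can understand the data
    tags: tuple = ()   # keywords that aid discovery


class Catalog:
    """In-memory registry that makes published products discoverable."""

    def __init__(self):
        self._products = {}

    def publish(self, product: DataProduct) -> None:
        key = (product.domain, product.name, product.version)
        self._products[key] = product

    def search(self, tag: str) -> list:
        # Return matching products in a stable order.
        return sorted(
            (p for p in self._products.values() if tag in p.tags),
            key=lambda p: (p.domain, p.name, p.version),
        )


catalog = Catalog()
catalog.publish(DataProduct(
    name="orders_daily", domain="sales", owner="sales-data-team",
    version="1.2.0", schema={"order_id": "string", "amount": "decimal"},
    tags=("orders", "revenue"),
))
catalog.publish(DataProduct(
    name="shipments", domain="logistics", owner="logistics-data-team",
    version="0.9.0", schema={"shipment_id": "string", "order_id": "string"},
    tags=("orders", "delivery"),
))

found = catalog.search("orders")
print([f"{p.domain}.{p.name}" for p in found])
```

Each product carries its owner, version and schema with it, so a consumer in another domain can find it and judge whether it fits their need without asking a central team.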
Predecessors to data mesh
The modern data stack is a coalition of best-of-breed tools that are fast, flexible and scalable. At the heart of these stacks are data warehouses and data lakes / lakehouses that are centrally loaded with all the data collected by the organization across all functions and teams. This data is then transformed and served downstream according to the needs of each data consumer. While this solved the challenge of data silos and made all data owned by the organization centrally available to all, it still doesn't solve for context and governance.
Before we run through the problems that manifest in a centralized data architecture, let's refresh our understanding of the main components that form the core of centralized data architectures:
- Data Warehouses : Traditionally, data warehouses have been optimized for compute and processing speed. This is helpful for reporting and business intelligence, making warehouses the system of choice for analytics teams.
- Data Lakes : Cheap storage to store vast amounts of raw or even unstructured data. The data lake architecture is typically great for ad-hoc exploration and data science use cases.
- Data Lakehouses : An emerging system design that combines the data structures and management features from a data warehouse with the low-cost storage of a data lake.
Read more about Data Warehouses, Data Lakes and how they are converging into Data Lakehouses in this article.
Problems with centralized data architectures
Some problems with centralized data architectures are inherent, while others have emerged as data needs evolved. Let’s take a closer look at the main problems.
- Using a centralized architecture to process large amounts of data requires moving data from the edge to a central location, which is time-consuming and costly.
- As the volume of data grows, querying it in a centralized model requires changes throughout the data pipeline. As the number of data sources increases, so does the time it takes to integrate them. Slow data processing undermines a business's ability to benefit from data and respond to change.
- Data transfers are often subject to data residency and privacy policies that prohibit moving data out of certain regions. Compliance with data governance rules is time-consuming and can significantly delay the processing and analysis of data for business intelligence.
Data mesh architecture
The microservice architecture concept of “domain-driven design” has made it possible to divide information systems into distributed services that operate in different business domains. This allows for the formation of teams that can independently and autonomously own their microservices. Domain-driven design is also very useful in distributed data platform architectures like data mesh. Data mesh allows you to store data in the same domain in which it was generated. The data then remains under the care of the team of the domain in which it was created.
Fundamental components of the data mesh architecture
A data mesh has the following fundamental components:
Data Sources - The information systems that exist in each domain should provide data sets prepared for consumers. Raw data is the basic building block of the entire architecture.
Data Infrastructure - The infrastructure component enables you to build, deploy and run data product code, and to store and access big data and metadata.
Domain-oriented data pipelines - Data pipelines are responsible for consuming, transforming, and serving data received from the domain's operational system or an upstream data product. Implementing its own data pipeline becomes an internal task of the business domain team.
Governance - A governance model is required to be able to correlate different data, create associations, find intersections, or perform other functions. This model should describe global standardization, domain decentralization, and automated execution by the platform.
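The consume, transform and serve steps of a domain-oriented pipeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `OPERATIONAL_ORDERS` rows stand in for a sales domain's operational system, and the function names and the cleaning rule (dropping cancelled orders) are hypothetical.

```python
from typing import Iterable, Iterator, Optional

# Hypothetical rows from the sales domain's operational order system.
OPERATIONAL_ORDERS = [
    {"id": "A1", "status": "paid", "amount": 20.0},
    {"id": "A2", "status": "cancelled", "amount": 7.25},
    {"id": "A3", "status": "paid", "amount": 5.5},
]


def consume(rows: Iterable[dict]) -> Iterator[dict]:
    """Consume: read records from the domain's own operational source."""
    yield from rows


def transform(row: dict) -> Optional[dict]:
    """Transform: drop cancelled orders and express amounts in cents."""
    if row["status"] == "cancelled":
        return None
    return {"order_id": row["id"], "amount_cents": round(row["amount"] * 100)}


def serve(rows: Iterable[Optional[dict]]) -> list:
    """Serve: materialize the cleaned records as the published data product."""
    return [row for row in rows if row is not None]


orders_product = serve(transform(row) for row in consume(OPERATIONAL_ORDERS))
print(orders_product)
```

The whole pipeline lives inside the domain team's codebase, which is the point of the component: the team that knows what a "cancelled" order means is the one that decides how it is handled.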
More and more companies are moving to a distributed architecture and using data mesh. Let’s take a look at some of them.
- Intuit is a financial platform that provides a variety of financial services for businesses and individuals. It has moved from a centralized data management architecture to a data mesh that enables a wide range of streaming, machine learning, and analytic processing workloads.
- Online streaming platform Netflix has implemented a data mesh architecture to optimize costs, improve performance, and reduce operational risks. The use of data mesh allows Netflix to significantly reduce the time needed to create a pipeline, and to offer a self-service user interface and secure access to data.
Data mesh vs. data lakehouse
The main difference between data mesh and a data lake is that a data lake is a large data store that is physically located in one place, while a data mesh is physically and logically distributed. Let's take a closer look at the differences between a data lakehouse and a data mesh.
|Data Lakehouse|Data Mesh|
|Data is stored in a single central repository|Data is stored in distributed domains|
|Users are unaware of the original domain from which the data is being retrieved, which often leads to poor data quality|Domain owners are responsible for data quality|
|The query method in the centralized control model needs to change as the amount of data grows|A unified governance system works unchanged as data grows|
|Can be a separate node in a data mesh|Cannot be part of a data lakehouse|
Warehouses, data lakes, and data lakehouses can continue to function in conjunction with data mesh and be used as separate nodes in the data mesh.
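One way to picture an existing lakehouse acting as a node in a data mesh is to put it behind the same small interface as every other domain. The Python sketch below assumes a hypothetical `MeshNode` interface with a single `read` method; the class names and the in-memory stand-ins for real storage are illustrative only.

```python
from typing import Protocol


class MeshNode(Protocol):
    """Hypothetical interface every node in the mesh exposes."""
    domain: str

    def read(self, dataset: str) -> list: ...


class LakehouseNode:
    """Wraps an existing central store (simulated here by a dict of tables)."""

    def __init__(self, domain: str, tables: dict):
        self.domain = domain
        self._tables = tables

    def read(self, dataset: str) -> list:
        return list(self._tables[dataset])


class ServiceNode:
    """A domain that serves data from its own service (simulated by a callable)."""

    def __init__(self, domain: str, fetch):
        self.domain = domain
        self._fetch = fetch

    def read(self, dataset: str) -> list:
        return list(self._fetch(dataset))


def read_from_mesh(nodes, domain: str, dataset: str) -> list:
    """Route a request to whichever node owns the requested domain."""
    for node in nodes:
        if node.domain == domain:
            return node.read(dataset)
    raise KeyError(f"no node registered for domain {domain!r}")


nodes = [
    LakehouseNode("finance", {"invoices": [{"invoice_id": 1}]}),
    ServiceNode("marketing", lambda name: [{"dataset": name, "clicks": 42}]),
]
print(read_from_mesh(nodes, "finance", "invoices"))
print(read_from_mesh(nodes, "marketing", "campaigns"))
```

Consumers call `read_from_mesh` the same way in both cases, so a team can keep its lakehouse while still presenting its data as one node among many.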
Benefits of data mesh
Let’s take a look at the main benefits of using data mesh:
Democratization of data - Data mesh makes datasets from every domain accessible to users across the enterprise.
Scaling - Data mesh enables decentralized data operations, independent teamwork, and data infrastructure as a service, which provides scalability and reduces operational and storage costs.
Speed - By using data mesh, companies can access data from anywhere using SQL queries with much lower latency. The distributed architecture reduces the levels of processing and intervention that delay data delivery.
Security - The decentralized framework allows applications to connect to data that can be streamed in real-time or stored on devices. The data mesh queries data where it is stored, instead of requiring users to make a copy and send it over the public network to the data store. This reduces the risk of data leakage and information loss.
Accuracy - The data mesh enables highly accurate data delivery through the use of an easily managed, shared infrastructure based on a self-service model.
Innovation - The process of decentralizing and democratizing data is a big step in the evolution of enterprise data architecture, and empowers the test and learn cycle of innovation.
Challenges of a data mesh
While data mesh offers a number of benefits, it also comes with some challenges.
Enforcing governance - Data mesh involves storing data in different sources. However, for effective interaction and quick receipt of any data by users, it is necessary to establish rules and specifications that must be adhered to by all participants in the distributed system.
Implementing proper tool suite - When designing a data mesh, technological and implementation challenges can arise. That’s why it is very important to choose the right implementation tools.
Transitioning employee thinking from the old, centralized way - Centralized architectures have evolved rapidly over the past few decades and are now widely used. Technicians are used to this approach. Data mesh offers a new approach that requires employees to change the way they think about data processing.
Achieving requisite data maturity - Data mesh requires a higher level of maturity in an organization’s data management culture because it provides a completely different approach to data and team interactions.
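The governance challenge above, agreeing on rules that every participant must follow, is often easier to handle when the rules are written as code that the platform can run automatically. The Python sketch below assumes three hypothetical global rules (every product has an owner, uses a MAJOR.MINOR.PATCH version, and masks columns flagged as PII); the rule set and the metadata layout are illustrative, not a standard.

```python
import re

SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def violations(product: dict) -> list:
    """Check one domain's product metadata against the shared rules."""
    problems = []
    if not product.get("owner"):
        problems.append("missing owner")
    if not SEMVER.match(product.get("version", "")):
        problems.append("version is not MAJOR.MINOR.PATCH")
    for column, info in product.get("schema", {}).items():
        if info.get("pii") and not info.get("masked"):
            problems.append(f"PII column '{column}' is not masked")
    return problems


compliant = {
    "owner": "sales-data-team",
    "version": "1.2.0",
    "schema": {"email": {"pii": True, "masked": True}, "amount": {}},
}
non_compliant = {
    "owner": "",
    "version": "v2",
    "schema": {"email": {"pii": True}},
}

print(violations(compliant))
print(violations(non_compliant))
```

Domains stay free to build their products however they like, while a check like this, run by the platform on every publish, keeps the shared rules from depending on manual review.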
Who needs a data mesh?
A data mesh is useful for companies:
- That work with large amounts of diverse data.
- Where data sources reside in rich business domains.
- Where data changes frequently, and the organization quickly evolves to add new data consumers.
Decentralized teams can use data mesh to allow individual teams to store and manage their data, and provide it to other teams as a product.
In this article, we looked at existing centralized data architectures and their drawbacks. We also studied what data mesh is, its architecture, principles of implementation, and the advantages of using it.