What Is a Data Lake? Definition, Architecture, and Solutions
March 23rd, 2022
A data lake is a repository for raw data. Since any raw data can be quickly dumped into a data lake, these storage systems make it easy for organizations to adopt a ‘store now, analyze later’ approach.
Raw data is data that has not yet been processed for validation, sorting, summarization, aggregation, analysis, reporting, or classification. It’s simply values collected from a source. Raw data can be structured, semi-structured, or unstructured.
- Structured data is quantitative and highly organized, such as names, birthdays, addresses, social security numbers, stock prices, and geolocation.
- Unstructured data is qualitative. It has no clearly defined framework and is not easily searchable. Examples include online reviews, videos, photos, and audio files.
- Semi-structured data combines elements of the other two. It has a loosely defined framework, such as emails that have addresses for sender/recipient, but a body that can contain anything.
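The three categories above can be illustrated with a small Python sketch. The record shapes, the `CustomerRow` class, and the `queryable_fields` helper are hypothetical, chosen only to show how much structure each kind of data exposes to a query.

```python
import json
from dataclasses import dataclass, fields


# Structured: fixed fields with known types (hypothetical customer row).
@dataclass
class CustomerRow:
    name: str
    birthday: str  # ISO date string
    zip_code: str


structured = CustomerRow(name="Ada Example", birthday="1990-05-17", zip_code="94107")

# Semi-structured: an email has well-defined headers but a free-form body.
semi_structured = json.loads("""{
  "from": "ada@example.com",
  "to": "support@example.com",
  "body": "Hi, my flight was delayed and I would like a refund."
}""")

# Unstructured: opaque bytes with no internal fields to query (e.g. a photo).
unstructured = bytes([0xFF, 0xD8, 0xFF, 0xE0])  # leading bytes of a JPEG file


def queryable_fields(item) -> list:
    """Return the field names that can be filtered on directly."""
    if isinstance(item, CustomerRow):
        return [f.name for f in fields(item)]
    if isinstance(item, dict):
        return sorted(item)
    return []  # raw bytes expose no fields without further processing
```

The structured row exposes every field, the email exposes its headers and a body that still needs text processing, and the photo exposes nothing until something like image analysis is applied.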
Organizations are increasingly adopting data lakes for their simplicity, scalability, and cost-effective storage, which other storage systems (e.g., data marts or data warehouses, both of which require data to arrive in specific forms) do not offer.
Data lakes are relatively new storage architectures. James Dixon, founder and CTO of business intelligence software company Pentaho, popularized the term in 2010 when he wrote on his blog: “the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
What are the components of a data lake architecture?
There are five major components that support a data lake architecture. They can be remembered with the acronym ISASA – Ingest, Store, Analyze, Surface, Act.
- Ingest: Ingest involves collecting data, commonly through APIs or batch processes. Imagine unfettered streams of data flowing into a lake.
- Store: Store means housing disparate data from many streams in a central location where it exists without silos.
- Analyze: With all the data stored, it’s time to analyze it to reveal relationships.
- Surface: Surfacing means presenting the analysis in ways that can be understood (e.g., charts, graphs, actionable insights).
- Act: With actionable insights in hand, the data can be acted upon to form operational strategies, optimize processes, or drive positive business outcomes.
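As a rough illustration, the five ISASA stages can be sketched as five small Python functions chained together. This is a toy, in-memory example: the source names, the records, and the "prioritize the busiest source" rule are all assumptions made for illustration, not part of any real data lake product.

```python
from collections import Counter


def ingest(sources):
    """Ingest: pull raw records from every source, tagging each with its origin."""
    for source_name, records in sources.items():
        for record in records:
            yield {"source": source_name, "raw": record}


def store(lake, records):
    """Store: append everything to one central collection, unchanged."""
    lake.extend(records)
    return lake


def analyze(lake):
    """Analyze: count how many records came from each source."""
    return Counter(item["source"] for item in lake)


def surface(counts):
    """Surface: render the analysis as a readable text report."""
    return [f"{source}: {n} records" for source, n in sorted(counts.items())]


def act(counts):
    """Act: pick the busiest source as the one to prioritize."""
    return max(counts, key=counts.get)


# Hypothetical inputs: one API feed and one batch file with different shapes.
sources = {
    "api": [{"event": "click"}, {"event": "purchase"}, {"event": "click"}],
    "batch_file": ["2022-03-01,login", "2022-03-02,logout"],
}
lake = store([], ingest(sources))
counts = analyze(lake)
report = surface(counts)
decision = act(counts)
```

Note that `store` accepts dictionaries and CSV-like strings side by side without reshaping them, which is the "no silos, no required form" property described above.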
How does a data lake work?
In order to visualize how an organization might leverage a data lake, let’s take a look at a hypothetical use case.
An airline wants to increase travel during its off-peak season – the three weeks after Thanksgiving. To do so, the airline needs to know which type of customers to target with promotions.
The airline collects data from disparate sources, including its followers on social media and previous ticket buyers, into a single centralized location – the data lake. Structured and unstructured data that once existed separately on Facebook, Instagram, Twitter, LinkedIn, and an internal database now live together where data scientists can conduct analysis.
The analysis, once surfaced in an easy-to-digest format, reveals there are few business trips taken during this off-peak season, but leisure travelers can be enticed for short weekend getaways. Management takes action based on these insights and decides to offer cheap roundtrip tickets between select domestic travel hubs.
Popular data lake providers
It should come as no surprise that the biggest names in cloud-based solutions are also some of the most popular data lake providers. Microsoft Azure, Amazon Web Services (AWS), Google, and Oracle are all leaders in the space.
Azure describes its offering as a no-limits data lake that provides storage and analysis of petabyte-size files and trillions of objects. Users can easily debug and optimize big data while enjoying enterprise-grade security, auditing, and support. Designed for the cloud, the Azure data lake starts in seconds, scales instantly, and lets users pay per job.
AWS data lakes provide the scale, agility, and flexibility required to combine different data and analytics approaches. According to AWS, its data lakes deliver 3X better price-performance and 70% cost savings versus cloud data warehouses. Additionally, users can store 3 PB of data in a single cluster with Amazon OpenSearch Service. AWS reports that over 200,000 data lakes run on its platform.
Google Cloud’s data lake empowers users to securely and cost-effectively ingest, store, and analyze large volumes of diverse, full-fidelity data. It integrates with existing applications, including Dataflow and Cloud Data Fusion, for fast and serverless data ingestion, Cloud Storage for globally unified and scalable object storage, and Dataproc and BigQuery for easy and cost-effective analytics processing.
Oracle Big Data is an automated service based on Cloudera Enterprise that provides a cost-effective Hadoop data lake environment, Spark for processing, and analysis through Oracle Cloud SQL or the user’s preferred analytical tool. The data lakes can be deployed in Oracle Cloud data centers or within customer data centers.
How to build a data lake
Building a data lake usually involves five steps.
- Select storage. Users can either build their data lake on-premises (i.e., within an organization’s data centers) or by using a cloud-based provider, like Azure, AWS, Google Cloud, or Oracle Cloud Infrastructure.
- Transfer data. Your raw data might live in disparate locations, so step 2 is to migrate all of it into a central repository.
- Prep data. To make the raw data usable and useful, it must first be prepared, cleaned, and cataloged. This process could involve reformatting data, making corrections, removing outliers, and combining data sets.
- Configure security and compliance protocols. Administrators must create policies for handling data so it remains secure, and establish permissions for who can access which data sets.
- Provide access. With security protocols in place, administrators can provide relevant users with access to the data they need to run analyses and develop data-driven conclusions.
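The five steps above can be walked through in a minimal Python sketch. It assumes a local temporary directory as a stand-in for real on-premises or cloud object storage, and the file names, the `raw` and `prepared` zones, the role names, and the `read_dataset` helper are all hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path

# 1. Select storage: a temp directory stands in for real object storage.
lake_root = Path(tempfile.mkdtemp(prefix="toy_lake_"))
raw_zone = lake_root / "raw"
prepared_zone = lake_root / "prepared"
raw_zone.mkdir()
prepared_zone.mkdir()

# 2. Transfer data: land a raw file in the lake unchanged.
raw_file = raw_zone / "bookings.csv"
raw_file.write_text("customer,fare\nada, 120\nbob,\ncy,95 \n")

# 3. Prep data: drop rows with a missing fare, trim whitespace, catalog the result.
with raw_file.open(newline="") as f:
    rows = [
        {"customer": r["customer"].strip(), "fare": float(r["fare"])}
        for r in csv.DictReader(f)
        if r["fare"].strip()
    ]
prepared_file = prepared_zone / "bookings.json"
prepared_file.write_text(json.dumps(rows))
catalog = {"bookings": {"path": str(prepared_file), "columns": ["customer", "fare"]}}

# 4. Configure security: map each role to the data sets it may read.
permissions = {"analyst": {"bookings"}, "intern": set()}


# 5. Provide access: only return data the caller's role is allowed to see.
def read_dataset(role, name):
    if name not in permissions.get(role, set()):
        raise PermissionError(f"{role} may not read {name}")
    return json.loads(Path(catalog[name]["path"]).read_text())
```

Keeping the untouched file in `raw` and writing the cleaned copy to `prepared` means the original data stays available if the preparation rules change later.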
Why data lake projects fail
Some data lake projects fail simply because not every organization needs a data lake. Sometimes, a database or data warehouse could be the better solution. So let’s look at instances when a data lake would or would not make a good fit.
Data lakes are an ideal repository for organizations with large volumes of unstructured and semi-structured data.
Putting data to good use requires an ETL (extract-transform-load) process. If the process is complex, then it will likely cause bottlenecks when trying to fit semi-structured and unstructured data into a relational database. Here is an instance where a data lake would make sense.
Storing large volumes of data in a database can become cost-prohibitive. As a result, data might only be retained for a short duration, or historical data may be destroyed so users only have a limited historical window of, say, 10 years. A data lake alleviates these cost concerns because its architecture is built on inexpensive object storage.
Finally, the results you intend to discover through data analysis could determine whether a data lake is suitable for your project. Building reports or dashboards that rely on predetermined queries can easily be performed using a data warehouse. But if you intend to apply machine learning and predictive analytics to data to generate ‘experimental’ results, then applying the predefined schema to data housed in a data warehouse would be inefficient; a data lake would be better suited to the task.
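The difference can be sketched in Python. The two approaches are often called schema-on-write (warehouse-style) and schema-on-read (lake-style); those terms, the sample events, and the `WAREHOUSE_SCHEMA` tuple are assumptions introduced here for illustration.

```python
import json

# Hypothetical raw events whose fields vary from record to record.
raw_events = [
    '{"user": "ada", "fare": 120, "trip": "leisure"}',
    '{"user": "bob", "fare": 300}',
    '{"user": "cy", "review": "Great weekend getaway!"}',
]

WAREHOUSE_SCHEMA = ("user", "fare", "trip")  # hypothetical predefined schema


def load_schema_on_write(lines):
    """Warehouse-style: keep only rows that match the predefined columns exactly."""
    rows = [json.loads(line) for line in lines]
    return [r for r in rows if set(r) == set(WAREHOUSE_SCHEMA)]


def read_schema_on_read(lines, wanted_fields):
    """Lake-style: keep every raw line and project fields only when reading."""
    return [
        {f: row.get(f) for f in wanted_fields}
        for row in map(json.loads, lines)
    ]


warehouse_rows = load_schema_on_write(raw_events)
experiment = read_schema_on_read(raw_events, ["user", "review"])
```

Only one of the three events survives the predefined schema, while the lake-style read can still reach the free-text review that an experimental analysis might need.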
From data lake to data swamp
Without proper governance and parameters on data gathering, a data lake can transform into a data swamp where data retrieval becomes difficult and time-consuming due entirely to disorganization. Data assets, especially large volumes of them, require metadata to provide information on context, format, structure, and usage so data scientists can perform analysis with the fewest mouse clicks. Metadata also allows users across an organization to reuse data assets as needed.
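A minimal sketch of such metadata, assuming a simple in-memory catalog, might look like the following. The `CatalogEntry` fields mirror the context, format, structure, and usage attributes mentioned above; the entries, tags, and `find` helper are hypothetical and do not represent any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    context: str      # where the data came from
    format: str       # file or encoding format
    structure: list   # column or field names
    usage: str        # what the data is intended for
    tags: set = field(default_factory=set)


catalog = [
    CatalogEntry(
        name="ticket_sales",
        context="Internal booking database",
        format="csv",
        structure=["customer_id", "fare", "date"],
        usage="Revenue reporting",
        tags={"sales", "structured"},
    ),
    CatalogEntry(
        name="social_mentions",
        context="Social media followers",
        format="json",
        structure=["handle", "text"],
        usage="Sentiment analysis",
        tags={"social", "semi-structured"},
    ),
]


def find(entries, *, tag=None, column=None):
    """Return names of data sets matching an optional tag and/or column."""
    return [
        e.name
        for e in entries
        if (tag is None or tag in e.tags)
        and (column is None or column in e.structure)
    ]
```

With entries like these, a call such as `find(catalog, column="fare")` locates the relevant data set without anyone opening the underlying files.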
Atlan provides governance and cataloging tools, along with metadata management, so data teams can more efficiently discover, explore, understand, and work with data assets.