Data Ingestion 101: Using Big Data Starts Here
February 14, 2022
The modern data stack has many layers which can be challenging to navigate or even comprehend. Let’s break it down by starting where it all begins – data ingestion. In this blog, we’re detailing what you need to know about how data is gathered so it can be stored and used to derive business intelligence.
What is data ingestion?
Data ingestion is the process of moving raw data from a source to a destination where it can be stored or used for queries and analytics. Simply put, it’s transferring data from point A to point B so that it can be centralized and/or analyzed.
For example, let’s say you’re running business intelligence for an airline and your company has a presence across every social media platform (Facebook, Instagram, LinkedIn, TikTok, etc.). You want to analyze the cities where users are engaging with your brand. To do that, you need to get that data from those disparate platforms (the sources) to a staging area (the destination). That process is data ingestion.
With Big Data, data is generated at a multitude of sources, and each of them serves as a point of ingestion. These include:
- IoT devices: all those smart devices (thermostats, watches, TVs, etc.) we enjoy
- Click streams: a record of clicks users make while navigating webpages
- Files: audio, video, photos
- Documents: emails, spreadsheets, slides
- SaaS applications: Netflix, YouTube, Salesforce
There are also a growing number of destinations where data practitioners want to deposit data. These destinations include:
- Data warehouse: A data warehouse is a storage system designed to hold large amounts of structured data from various data sources. Data warehouses are usually well organized but can be costly.
- Data lake: A data lake is a storage system designed to hold large amounts of raw data. This data may be structured, unstructured, or semi-structured. While data lakes are cost-effective, they can quickly become disorganized because these repositories can hold mountains of data.
- Data lakehouse: Combining elements of both a data warehouse and a data lake, a data lakehouse is a relatively new storage solution that can hold both raw and processed data. Data lakehouses are cost-effective and provide the tidiness usually associated with a data warehouse.
- Data mart: Like a data warehouse, a data mart is a highly structured data repository. However, unlike a data warehouse that might store all the data for an organization, a mart holds data specific to a single subject (e.g., finance data, sales data, marketing data).
It’s worth noting that raw data storage locations, such as a data lake or data lakehouse, can be both a source and a destination.
Batch data ingestion vs streaming ingestion vs hybrid ingestion
So how exactly does the movement of data take place? It’s usually done through batch data ingestion, streaming ingestion, or a combination of the two.
Batch data ingestion
Batch data ingestion is the process of collecting and transferring data in batches at periodic intervals. Data might be collected and transferred at a scheduled time (e.g., every day at 11:59 p.m. or every other Friday at noon) or when triggered by an event (e.g., ingest when capacity reaches 10TB). This type of data ingestion can be lengthy, so it is typically used with large quantities of data that isn’t needed for real-time decision-making.
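To make the two trigger styles concrete, here is a minimal sketch in Python. The `BatchIngestor` class, its `max_batch_size` threshold, and the plain list standing in for a destination are all hypothetical names invented for illustration, not part of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class BatchIngestor:
    """Buffers records and moves them to the destination one batch at a time."""

    max_batch_size: int                              # hypothetical capacity trigger
    destination: list = field(default_factory=list)  # stand-in for a warehouse table
    _buffer: list = field(default_factory=list)
    batches_loaded: int = 0

    def collect(self, record: dict) -> None:
        """Hold the record locally; transfer only when the capacity trigger fires."""
        self._buffer.append(record)
        if len(self._buffer) >= self.max_batch_size:
            self.flush()

    def flush(self) -> None:
        """Transfer everything buffered so far as one batch.

        A scheduler (for example, a nightly job) would call this directly.
        """
        if not self._buffer:
            return
        self.destination.extend(self._buffer)
        self._buffer.clear()
        self.batches_loaded += 1


ingestor = BatchIngestor(max_batch_size=3)
for i in range(7):
    ingestor.collect({"event_id": i})
# Two full batches have been loaded; one record is still waiting in the buffer.
ingestor.flush()  # the scheduled run picks up the remainder
```

Nothing reaches the destination until a trigger fires, which is the trade-off batch ingestion makes: fewer, larger transfers in exchange for data that is not immediately available.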
Streaming data ingestion
Streaming data ingestion, sometimes called real-time ingestion, is the process of collecting and transferring data from sources to destinations at the moment the data is generated. Time-sensitive use cases (e.g., stock market trading, log monitoring, fraud detection) require real-time data that can be used to inform decision-making. Streaming data ingestion is fast and allows for the continuous flow of information from disparate sources to a target destination.
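By contrast with batching, a streaming pipeline hands each event to the destination as soon as it arrives. The sketch below assumes a toy fraud-detection sink; `event_source`, `stream_ingest`, `fraud_sink`, and the 1000 threshold are hypothetical and chosen only to illustrate the flow.

```python
from typing import Callable, Iterable, Iterator


def event_source(amounts: Iterable[float]) -> Iterator[dict]:
    """Hypothetical source: yields one transaction event at a time."""
    for txn_id, amount in enumerate(amounts):
        yield {"txn_id": txn_id, "amount": amount}


def stream_ingest(events: Iterable[dict], sink: Callable[[dict], None]) -> int:
    """Forward every event to the sink as soon as it arrives, with no buffering."""
    count = 0
    for event in events:
        sink(event)
        count += 1
    return count


flagged = []


def fraud_sink(event: dict) -> None:
    """Toy destination: flags large transactions the moment they are ingested."""
    if event["amount"] > 1000:  # arbitrary threshold, for illustration only
        flagged.append(event["txn_id"])


ingested = stream_ingest(event_source([20.0, 5000.0, 75.5, 1200.0]), fraud_sink)
```

Because the source is a generator, each event is produced and consumed one at a time, so the suspicious transaction is flagged before the next one is even read.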
Hybrid batch + streaming data ingestion
Sometimes called a lambda architecture, hybrid data ingestion employs both batch and streaming processes to get the job done. It’s well suited to IoT use cases. Consider, for example, a smart thermostat. The device needs streaming data ingestion so it can process temperature readings in real time and adjust the temperature of a location accordingly. And at scheduled intervals, say, nightly, it might ingest data related to the day’s humidity and precipitation in order to refine its climate control model.
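The thermostat example can be sketched with one object that exposes both paths. This is a toy model under stated assumptions: the `SmartThermostat` class, the half-degree tolerance, and the humidity rule are all invented for illustration and do not describe how any real device works.

```python
class SmartThermostat:
    """Toy model of a device fed by both a streaming and a batch path."""

    def __init__(self, target_c: float) -> None:
        self.target_c = target_c

    def on_reading(self, temp_c: float) -> str:
        """Streaming path: act on each temperature reading as it arrives."""
        if temp_c < self.target_c - 0.5:
            return "heat"
        if temp_c > self.target_c + 0.5:
            return "cool"
        return "idle"

    def nightly_batch(self, humidity_pct: list[float]) -> None:
        """Batch path: ingest the day's humidity readings at once and adjust the target."""
        if not humidity_pct:
            return
        average = sum(humidity_pct) / len(humidity_pct)
        if average > 60:  # invented rule, purely for illustration
            self.target_c -= 0.5


thermostat = SmartThermostat(target_c=21.0)
actions = [thermostat.on_reading(t) for t in (19.0, 21.2, 23.0)]
thermostat.nightly_batch([70.0, 80.0])
```

The streaming path decides what to do right now, while the batch path changes how future readings will be interpreted.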
Steps in the data ingestion process
Moving data from the source of generation to a destination is, admittedly, rarely straightforward. That’s because there are so many different types of data and it can exist in countless forms. But storage systems, like data warehouses and data marts, are designed to accept data in a standardized form to keep things organized. To accomplish this, a data ingestion framework is typically composed of three steps: extract, transform, load. This is also known as ETL.
- Extract means to grab the data at its source.
- Transform refers to cleansing and standardizing the data so it’s in a format the destination will accept.
- Load is placing the data in the target destination where it can be stored and/or analyzed.
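The three steps above can be sketched with nothing but Python's standard library. This is a minimal illustration that assumes a small CSV of social media engagement as the source and an in-memory SQLite database as the destination; the column names, table name, and sample rows are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: inconsistent casing, stray spaces, one bad value.
RAW_CSV = """city,platform,engagements
 Denver ,Facebook,120
chicago,Instagram,n/a
Boston,TikTok,45
"""


def extract(raw_text: str) -> list[dict]:
    """Extract: read the rows from the source exactly as they are."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows: list[dict]) -> list[tuple]:
    """Transform: standardize text fields and drop rows with non-numeric counts."""
    clean = []
    for row in rows:
        try:
            engagements = int(row["engagements"])
        except ValueError:
            continue  # skip rows the destination would reject
        clean.append((row["city"].strip().title(), row["platform"].strip(), engagements))
    return clean


def load(rows: list[tuple], conn: sqlite3.Connection) -> None:
    """Load: write the cleaned rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS engagement "
        "(city TEXT, platform TEXT, engagements INTEGER)"
    )
    conn.executemany("INSERT INTO engagement VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
result = conn.execute(
    "SELECT city, engagements FROM engagement ORDER BY city"
).fetchall()
```

Only rows that survive the transform step reach the destination, which is why the row with `n/a` never appears in the `engagement` table.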
Mitigating challenges in the data ingestion process
The ETL process can be very time-intensive, particularly the transform step of cleansing and standardizing the data. This can cause bottlenecks that inhibit data use. Ways to overcome these obstacles include:
- Automated workflows and AI tools that can automatically complete the ETL process. This enables the data ingestion to occur rapidly and free from human-induced errors.
- Delaying the time-intensive transform component of the process in a now-common technique known as ELT (extract, load, transform). Data is grabbed from the source and loaded into a destination that can accept raw data to be transformed at a later date.
- Using data lakes and data lakehouses, both of which can accept raw data and are becoming more popular as storage systems.
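To show how ELT reorders the work, here is a minimal sketch that loads raw values untouched and transforms them later with SQL inside the destination. It assumes an in-memory SQLite database as a stand-in for a store that accepts raw data; the table names and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw rows, exactly as extracted from the source.
raw_rows = [(" Denver ", "120"), ("chicago", "n/a"), ("Boston", "45")]

conn = sqlite3.connect(":memory:")

# Load first: every column is stored as text, with no cleansing at all.
conn.execute("CREATE TABLE raw_engagement (city TEXT, engagements TEXT)")
conn.executemany("INSERT INTO raw_engagement VALUES (?, ?)", raw_rows)

# Transform later, inside the destination, whenever the clean data is needed.
# The GLOB filter is a deliberately crude check: it keeps values that start with a digit.
conn.execute(
    """
    CREATE TABLE clean_engagement AS
    SELECT TRIM(city) AS city, CAST(engagements AS INTEGER) AS engagements
    FROM raw_engagement
    WHERE engagements GLOB '[0-9]*'
    """
)

raw_count = conn.execute("SELECT COUNT(*) FROM raw_engagement").fetchone()[0]
clean = conn.execute(
    "SELECT city, engagements FROM clean_engagement ORDER BY city"
).fetchall()
```

The raw table still holds every original row, so the transform can be rerun or revised later without going back to the source.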
The benefits of modern data ingestion tools
Manual data ingestion is impractical in the era of Big Data, where organizations are looking to work with vast amounts of information generated daily, hourly, and even by the second. Fortunately, modern data ingestion tools automate formerly human-led processes and alleviate many of the challenges associated with them.
- Speed: Automated data ingestion tools are fast, allowing for the rapid collection and use of data without the need for a human hovering over a computer clicking a mouse.
- Scalability: It’s hard to fully leverage Big Data with manual data ingestion, which relies on individual capabilities, but modern automated tools allow for its use at scale.
- Reliability: People are prone to error, but automated data ingestion tools can repeatedly perform data extraction, data transformation, and loading without introducing misspellings, duplication, invalid formats, outliers, and other unwanted elements into the process.
- Governance: Organizations are able to implement proper governance using modern tools, providing for confidentiality of sensitive data, more robust security, and regulatory compliance.
Making use of big data
In order to run an analysis that yields actionable insights, data must be collected from a host of sources and deposited into a target destination. This is why data ingestion is such an important part of the process to make sense of the volumes of data being generated today.
Creating a robust data ingestion process is an important part of the overall data pipeline. To make this possible, organizations are shifting away from arduous manual processes and instead are leveraging high-tech tools to make the job faster and less error-prone. By managing ingested data with proper cataloging and governance, users can prevent or alleviate further bottlenecks for maximum efficiency.
Go deeper in your understanding of the modern data stack with our blog on modern data culture.