A Comparison Guide to the Best Data Ingestion Tools in 2022

May 25th, 2022


We will explore the top data ingestion tools in 2022 with insights on their core offering, customers, case studies, and resources for further reading.

But before we proceed, here’s a brief explainer on data ingestion tools and their role in the modern data platform.

Data ingestion tools collect data from various sources and formats in a central repository. For example, bringing your CRM and customer service software data together in a data warehouse is a use case of data ingestion.

Data ingestion is a crucial step in building any data platform as it helps eliminate silos and compile all organizational data in one place. Once your ingestion tool collects all data successfully, you can start processing and analyzing that data to extract valuable insights.
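To make the idea concrete, here is a minimal sketch of the CRM and customer service example above. It assumes two hypothetical exports (a CRM CSV and a support-ticket JSON file, both inlined as strings) and uses an in-memory SQLite database as a stand-in for a real data warehouse. The table names, column names, and sample records are illustrative assumptions, not the output of any tool covered in this guide.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample exports; a real pipeline would pull these from the
# source systems' files or APIs instead of inline strings.
CRM_CSV = "customer_id,name\n1,Ada\n2,Grace\n"
SUPPORT_JSON = (
    '[{"customer_id": 1, "ticket": "Login issue"},'
    ' {"customer_id": 1, "ticket": "Billing question"}]'
)


def ingest(conn):
    """Load both sources into one central store (SQLite stands in for a warehouse)."""
    conn.execute("CREATE TABLE crm_customers (customer_id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE support_tickets (customer_id INTEGER, ticket TEXT)")

    # Source 1: CSV export from the CRM.
    crm_rows = [
        (int(row["customer_id"]), row["name"])
        for row in csv.DictReader(io.StringIO(CRM_CSV))
    ]
    # Source 2: JSON export from the customer service tool.
    ticket_rows = [(t["customer_id"], t["ticket"]) for t in json.loads(SUPPORT_JSON)]

    conn.executemany("INSERT INTO crm_customers VALUES (?, ?)", crm_rows)
    conn.executemany("INSERT INTO support_tickets VALUES (?, ?)", ticket_rows)
    conn.commit()


def tickets_per_customer(conn):
    """Once the data sits in one place, a single query can span both sources."""
    return conn.execute(
        "SELECT c.name, COUNT(t.ticket) "
        "FROM crm_customers c "
        "LEFT JOIN support_tickets t ON t.customer_id = c.customer_id "
        "GROUP BY c.customer_id, c.name "
        "ORDER BY c.customer_id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
ingest(conn)
print(tickets_per_customer(conn))
```

The tools below handle the same job at much larger scale, adding connectors, scheduling, and schema handling that this sketch leaves out.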

  1. Apache Kafka
  2. Apache NiFi
  3. Fivetran
  4. IBM DataStage
  5. Informatica Cloud Mass Ingestion
  6. Matillion Data Loader
  7. Stitch Data
  8. Wavefront

1. Apache Kafka

Overview

Apache Kafka is an open-source event streaming platform that captures data in real time.

LinkedIn’s Jay Kreps, Neha Narkhede, and Jun Rao collaborated to build Apache Kafka in 2008. In 2011, LinkedIn open-sourced the software by donating it to The Apache Software Foundation.

Later, the co-founders left LinkedIn in 2014 and founded Confluent, a company offering a comprehensive data platform to manage Kafka. That’s because even though 80% of the Fortune 100 companies use Apache Kafka, managing it is a complicated affair, according to Nasdaq.

Companies such as Adidas, Cisco, Grab, Goldman Sachs, Intuit, LinkedIn, Spotify and The New York Times use Apache Kafka for their real-time data ingestion needs.

To understand how Apache Kafka facilitates ingestion, let’s explore LinkedIn’s case study.

Case study: How LinkedIn customizes Apache Kafka for 7 trillion messages per day

LinkedIn uses Apache Kafka for tracking activities, exchanging messages, collecting metrics, and other such use cases requiring a stream processing platform.

According to LinkedIn, they maintain 100+ Kafka clusters with 4000+ brokers, serving 100,000+ topics and 7 million partitions. As a result, the most recent Kafka deployments can process over 7 trillion messages a day.
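For a rough sense of scale, the figures above can be turned into per-second and per-broker averages. This is back-of-the-envelope arithmetic only: the inputs are the rounded lower bounds quoted above ("100+", "4000+", and so on), and it assumes load is spread evenly across brokers, which a real deployment would not guarantee.

```python
# Rounded figures quoted in the case study (treated as lower bounds).
MESSAGES_PER_DAY = 7_000_000_000_000
CLUSTERS = 100
BROKERS = 4_000
TOPICS = 100_000
PARTITIONS = 7_000_000

SECONDS_PER_DAY = 24 * 60 * 60

# Average throughput, assuming an even spread over the day and across brokers.
messages_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY
messages_per_second_per_broker = messages_per_second / BROKERS

# Average fan-out of the deployment.
partitions_per_topic = PARTITIONS / TOPICS
brokers_per_cluster = BROKERS / CLUSTERS

print(f"~{messages_per_second:,.0f} messages per second overall")
print(f"~{messages_per_second_per_broker:,.0f} messages per second per broker")
print(f"~{partitions_per_topic:.0f} partitions per topic on average")
print(f"~{brokers_per_cluster:.0f} brokers per cluster on average")
```

That works out to roughly 81 million messages per second overall, or about 20,000 per second per broker on average.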

Check out the detailed case study here to learn how LinkedIn customizes Apache Kafka to process events at scale without facing operational constraints.

Resources

Apache Kafka | Videos | Docs | Github | Podcasts | Confluent Community on Slack


2. Apache NiFi

Overview

Apache NiFi (NiFi stands for NiagaraFiles and is pronounced Nigh-Figh) is an open-source platform to automate data flow across systems.

NiFi was initially built by Onyara, an early-stage startup, and was first used inside the US National Security Agency (NSA) to understand what was happening to sensor data collected from various sources.

In 2014, the NSA released the software under an open-source license, and in 2015, Hortonworks acquired Onyara, Inc. Hortonworks has since merged with Cloudera, and together they oversee commercial support for NiFi.

Companies such as GoDataDriven, Hastings Group, Looker, Micron, Ona, and Slovak Telecom use Apache NiFi to manage their data flow requirements.

Let’s look at the Schlumberger case study to understand how Apache NiFi helps visualize data flow at a granular level.

Case study: How Schlumberger solves its data flow issues with Apache NiFi

Schlumberger’s oil well drilling projects involve several contractors using various tools and systems. Schlumberger works on 3000+ wells each month, with ten files generated per minute and 480+ queries run each second.

Since they didn’t have a central data repository to get end-to-end visibility of the drilling projects across locations, they had an in-house system to extract, clean, and organize all that data manually.

Using Apache NiFi, they could automate data flow within hours and quickly answer questions on the origins and credibility of data.

To learn more about the specifics of the Apache NiFi implementation at Schlumberger, check out this case study.

Resources

Use cases | Docs | Github | NiFi backstory | Apache NiFi Slack


3. Fivetran

Overview

Fivetran, Inc. offers data ingestion for applications and is used by several companies such as Asos, Canva, Databricks, Docusign, Intercom, Lufthansa, and Square.

George Fraser and Taylor Brown launched Fivetran in 2013. The company is backed by Andreessen Horowitz, General Catalyst, Matrix Partners, and CEAS Investments.

Let’s explore the Condé Nast case study to understand how Fivetran solves their ingestion needs.

Case study: How Condé Nast connects and monetizes trillions of data points with Fivetran

Condé Nast’s readers consume content across several digital assets, generating trillions of data points and potential insights on reader interests and patterns.

However, Condé Nast didn’t have a system to centrally manage data collected from various sources for each of its brands.

So, they chose Fivetran to automatically pull data scattered across various applications into a central data lake. As a result, Condé Nast gets a “true 360-degree view” of their customers and how they engage with Condé Nast’s content.

To read the whole story, check out the original case study here.

Resources

Case studies | Demo Center | Documentation | Github


4. IBM® DataStage®

Overview

IBM® InfoSphere DataStage® is a data integration tool that is part of IBM Cloud Pak® for Data. Initially developed by Lee Scheffler at VMark Software Inc, DataStage was acquired by IBM in 2005.

Today, DataStage® helps several companies such as GasTerra, Integra Lifesciences, Rabobank, Wichita State University, HUK-Coburg, InfoTrellis, and TechD as part of IBM’s Cloud Pak® for Data offering.

Let’s look at the case study of a Toronto-area community hospital to understand how DataStage® works.

Case study: How North York General Hospital (NYGH) delivered data from 15+ sources to the data server in real time

NYGH receives most of its funding from the Ontario Ministry of Health and Long-Term Care, which measures Quality-Based Procedures (QBPs) to gauge the financing required.

Capturing QBP-related data means calculating metrics such as cost per case, length of stay, and patient age. Traditionally, the hospital developed a new report for each patient by pulling the required data on essential metrics.

However, they needed an analytics solution to assess these metrics and analyze the relationship between them on the go using rich visualizations.

NYGH started its journey towards analytics by setting up a warehouse and using the AI-powered DataStage® to compile data from 15+ sources in real time. The sources included clinical, non-clinical, external, and open-source systems.

To learn more about the full story, check out the complete case study here.

Resources

Solution brief | Docs | Top questions on modernizing DataStage to Cloud Pak for Data


5. Informatica Cloud Mass Ingestion

Overview

Informatica Cloud Mass Ingestion is part of Informatica Intelligent Cloud Services and ingests large volumes of data for on-premises and cloud-based repositories. Gaurav Dhillon and Diaz Nesamoney set up Informatica LLC in 1993. Informatica is known for pioneering the Intelligent Data Management Cloud (IDMC), according to Nasdaq.

Customers of Informatica Intelligent Cloud Services include Databricks, CVS Health, NYC Health+ Hospitals, HelloFresh, KPMG, and TELUS.

Let’s look at the KLA case study to understand the potential of Informatica Cloud Mass Ingestion.

Case study: KLA moves 12 years of data to the cloud in one weekend with Informatica

Semiconductor manufacturer KLA had several business units with their own systems, making it harder for the company to predict and meet demand in real time. As a result, they needed a central data repository along with mass ingestion capabilities.

KLA chose Informatica Cloud Mass Ingestion to ingest large volumes of data at scale. Combined with Snowflake Data Cloud, this setup empowered KLA to perform ETL, ELT, and mass ingestion at scale with a single platform.

Check out the complete case study here.

Resources

Data sheet | Overview and Demo | Website


6. Matillion Data Loader

Overview

Matillion Data Loader is part of the Cloud ETL products offered by Matillion Ltd, which was founded by Ed Thompson and Matthew Scullion in 2011.

The company’s backed by investors such as Scale Venture Partners, Sapphire Ventures, Lightspeed Venture Partners, General Atlantic, Battery Ventures, and YFM Equity Partners.

Its customers include Knak, Cimpress, Western Union, Cisco, LiveRamp, Slack, Pacific Life, and Duo.

Here’s a case study to explore the capabilities of Matillion Data Loader.

Case study: Always-on data loading for reporting and insights in minutes for Knak

Knak enables no-code campaign creation for marketing teams and relies on extracting insights from data to understand their customers, sales KPIs, and marketing ROI. While Excel worked for the company initially, it wasn’t feasible as the company grew exponentially.

Knak needed a solution that could automatically ingest data from platforms such as Marketo, Salesforce, Google Analytics, and MySQL and load that data into a central warehouse. That’s why they chose Matillion Data Loader, and as a result, their teams can deliver insightful reports within minutes.

Check out the complete case study here.

Resources

Documentation | Matillion Community | Overview and Demo | Website


7. Stitch Data

Overview

Stitch offers a suite of products to ingest and integrate data from various sources into the data warehouse. In 2016, Bob Moore and Jake Stein founded Stitch, and in 2018, Talend acquired the company.

Stitch’s customers include companies such as Calm, EZCater, Envoy, Heap, Invision, Indiegogo, Peloton, Third Love, and Wistia.

Case study: Growth Street saves time and money replicating data to Redshift

UK-based financial services company Growth Street was looking for a solution that integrated their third-party data with warehouse (Redshift) data quickly, without driving up the costs.

Since building an in-house tool would require extensive engineering and maintenance, the company decided to buy a solution that matched their needs.

With Stitch, Growth Street’s BI and analytics team could set everything up within days, and in just a few weeks, the company was able to start extracting value from their data.

Here’s the full case study.

Resources

Documentation | Github | FAQs | Customers | Getting started


8. Wavefront

Overview

Wavefront, previously known as Watford, can ingest time-series data from various sources into a central data repository.

The privately held company offering Watford was set up in 2013 and acquired in 2017 by VMware, which renamed the product Wavefront by VMware.

Workday, Zuora, Okta, Box, Reddit, IndusInd Bank, DoorDash, Hive, and BookMyShow are some of its customers.

Case study: VMware empowers IndusInd Bank to deploy applications faster at multiple user end points with centralized management

IndusInd Bank was experiencing a significant increase in customer transactions across all channels. Being a digital-first bank, IndusInd wanted to build an infrastructure that pulled data from all channels and made it accessible across all employee devices. The bank also wanted to extract insights on customers and reduce the overall time-to-market using data.

With VMware’s suite of products, including Tanzu Observability by Wavefront, the bank’s IT team can compile all data in one place and track data flow in real time, speeding up their response times.

Read the complete case study here.

Resources

Documentation | Getting data into Wavefront (video) | Github | Tanzu Talk (podcast)


How to evaluate ingestion tools

Each tool has different features and specifications, and as a result, picking the right tool depends on several factors such as:

  • Data sources, formats, volumes, and ingestion type as per your use cases
  • User-friendliness (opt for a low-code/no-code platform if you lack engineering resources)
  • Costs
  • Level of security offered
  • Feedback on community portals and third-party websites such as Gartner Peer Insights or G2
  • Interoperability with the rest of your data stack
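One way to weigh these factors against each other is a simple weighted scoring matrix. The sketch below is illustrative only: the criterion keys mirror the list above, but the weights, the 1-to-5 scores, and the names "Tool A" and "Tool B" are hypothetical placeholders to replace with your own assessment.

```python
# Hypothetical weights in percent (they sum to 100); adjust to your priorities.
WEIGHTS = {
    "source_fit": 30,        # data sources, formats, volumes, ingestion type
    "ease_of_use": 15,       # low-code/no-code friendliness
    "cost": 20,
    "security": 15,
    "peer_feedback": 5,      # community portals and review sites
    "interoperability": 15,  # fit with the rest of your data stack
}

# Hypothetical scores on a 1 (poor) to 5 (excellent) scale.
SCORES = {
    "Tool A": {"source_fit": 5, "ease_of_use": 2, "cost": 4,
               "security": 4, "peer_feedback": 4, "interoperability": 3},
    "Tool B": {"source_fit": 3, "ease_of_use": 5, "cost": 3,
               "security": 4, "peer_feedback": 4, "interoperability": 5},
}


def weighted_total(scores):
    """Sum of score x weight over all criteria (integer, out of a maximum of 500)."""
    return sum(WEIGHTS[criterion] * scores[criterion] for criterion in WEIGHTS)


# Rank tools from highest to lowest weighted total.
ranked = sorted(
    ((tool, weighted_total(scores)) for tool, scores in SCORES.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for tool, total in ranked:
    print(f"{tool}: {total / 100:.2f} out of 5")
```

With these placeholder numbers the two tools land within a few hundredths of each other, so a small change in weights flips the ranking. The matrix is a way to structure the discussion, not a substitute for a proof of concept.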

Conclusion

As we mentioned earlier, data ingestion is one of the first steps in data integration. Depending on your requirements, budget, and resources, you should set up a solid ingestion pipeline using a tool like those mentioned above.

To deepen your understanding of the components of the modern data stack, check out our blog, which elaborates on tooling choices and more.


