9 Popular Data Pipeline Orchestration Tools in 2022

May 26th, 2022


We will explore the top data orchestration tools in 2022 by reviewing the companies behind them, their customers, case studies, and more. Before we begin, let’s quickly look into data orchestration tools and their capabilities.

Data orchestration tools automate the process of bringing data together from multiple sources, standardizing it, and preparing it for data analysis.

According to Astasia Myers, author of “Data Orchestration — A Primer”, data orchestration tools can:

  1. Cleanse, organize, and publish data into a data warehouse
  2. Compute business metrics
  3. Maintain data infrastructure (like database scrapes)
  4. Run a TensorFlow task to train a machine learning model
  5. Apply rules to target and engage users through online marketing campaigns

Automating processes (like the ones mentioned above) is vital as companies handle millions of data points from apps, websites, databases, SDKs, and more — so, scheduling cron jobs manually is taxing and error-prone.
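To make the contrast with hand-scheduled cron jobs concrete, here is a minimal sketch of the core idea behind an orchestrator: tasks declare what they depend on, and the runner works out a safe execution order. It uses only Python's standard library (`graphlib`, available from Python 3.9). The task names and the `run_pipeline` helper are hypothetical and are not taken from any of the tools reviewed here.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_app_events": set(),
    "extract_database": set(),
    "cleanse": {"extract_app_events", "extract_database"},
    "compute_metrics": {"cleanse"},
    "publish_to_warehouse": {"cleanse"},
}


def run_pipeline(dependencies, run_task):
    """Run every task once, only after all of its upstream tasks have run."""
    order = []
    # static_order() yields tasks so that dependencies always come first;
    # it raises graphlib.CycleError if the dependencies form a loop.
    for task in TopologicalSorter(dependencies).static_order():
        run_task(task)
        order.append(task)
    return order


if __name__ == "__main__":
    run_pipeline(PIPELINE, lambda name: print(f"running {name}"))
```

Real orchestration tools add scheduling, retries, parallel execution, and monitoring on top of this dependency graph, which is where they go beyond what a set of independent cron entries can express.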

The nine tools covered in this article are:

  1. Astronomer
  2. AWS Step Functions
  3. Azure Data Factory
  4. Control-M
  5. Flyte
  6. Google Cloud Functions
  7. K2View
  8. Metaflow
  9. Prefect

1. Astronomer

Overview

Astronomer builds data orchestration tools like Astro using Apache Airflow™ — originally developed by Airbnb to automate its data engineering pipelines. Astro enables data teams to build, run, and observe pipelines-as-code.

In 2018, Greg Neiheisel, Ry Walker, and Tim Brunk founded Astronomer, Inc. Astronomer is backed by Meritech Capital Partners, Salesforce Ventures, Insight Partners, and Sierra Ventures. Its customers include Sonos, EA, Condé Nast, Credit Suisse, Rappi, StockX, BBC, Wise, and Societe Generale.

More recently, Astronomer acquired the data lineage company Datakin. Astronomer aims to set up end-to-end lineage so that its customers can observe and add context to disparate pipelines.

Let’s look at a case study to understand the challenges Astronomer solves with Apache Airflow™ for Wise.

Case study: How Wise uses data orchestration to further its ML initiatives

Wise uses machine learning algorithms for real-time transaction monitoring and KYC processes. Wise intends to enable stream machine learning, rather than using REST, to launch new products faster while collaborating across teams.

Wise started using Amazon SageMaker for data science, where scientists work in segregated team environments to collaborate without breaching privacy measures. They adopted Airflow to retrain machine learning workflows from SageMaker.

“By nature, working with ML models in production requires automation and orchestration for repeated model training, testing, evaluation, and likely integration with other services to acquire and prepare data. Airflow is the perfect orchestrator to pair with SageMaker.”

To understand how Airflow helps Wise with orchestration, check out the complete case study here.

Resources

Blog | Docs | Github | Community | The Airflow Podcast | Demo (video)


2. AWS Step Functions

Overview

AWS Step Functions is a low-code visual workflow service used to orchestrate AWS services. The low-code visual designer for Step Functions is called Workflow Studio.

Step Functions is part of the AWS ecosystem from Amazon Web Services, Inc., whose customers include Coinbase, Paessler AG, Trulia Rentals, DuPont Pioneer, MirrorWeb, Nasdaq, and ClearDATA.

Let’s check out the Zalora case study to see how Step Functions fit into the big data architecture for modern data teams.

Case study: How serverless automation with AWS Step Functions has reduced Zalora’s SAP system refresh time from 5 days to 2 days

Asia’s online fashion retailer Zalora uses SAP and AWS solutions to automate internal processes and reduce operational overhead. AWS Lambda serves as the heart of this setup where:

  • S3 is used for production database backup
  • EC2 is used to trigger events
  • SNS is used to send messages during errors or failures
  • Step Functions is the orchestration layer for the entire setup

Traditionally, Zalora’s engineering and DevOps teams handled this process manually, which took around 5 days and was prone to manual errors.

When Zalora switched to AWS Step Functions and Lambda to take care of their SAP systems, they could automate the entire process, reducing the system refresh time by 60%. The serverless setup also enabled Zalora to reduce their engineering investment considerably.
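As an illustration of what "Step Functions as the orchestration layer" can look like, the sketch below builds a small state machine definition in Amazon States Language: two Lambda tasks run in sequence, and any error is routed to an SNS notification before the execution is marked as failed. This is not Zalora's actual workflow. The state names, the function and topic ARNs, and the `build_state_machine` helper are all hypothetical placeholders.

```python
import json

# Hypothetical ARNs; replace with real resources in your own account.
RESTORE_FN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:restore-backup"
VALIDATE_FN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:validate-refresh"
ALERT_TOPIC = "arn:aws:sns:REGION:ACCOUNT_ID:refresh-alerts"


def catch_all(next_state):
    """Route any error raised by a task to the given state."""
    return [{"ErrorEquals": ["States.ALL"], "Next": next_state}]


def build_state_machine():
    """Return an Amazon States Language definition as a plain dict."""
    return {
        "Comment": "Illustrative system refresh workflow",
        "StartAt": "RestoreBackup",
        "States": {
            "RestoreBackup": {
                "Type": "Task",
                "Resource": RESTORE_FN,
                "Catch": catch_all("NotifyFailure"),
                "Next": "ValidateRefresh",
            },
            "ValidateRefresh": {
                "Type": "Task",
                "Resource": VALIDATE_FN,
                "Catch": catch_all("NotifyFailure"),
                "End": True,
            },
            # Service integration that publishes a message to an SNS topic.
            "NotifyFailure": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {
                    "TopicArn": ALERT_TOPIC,
                    "Message": "System refresh failed",
                },
                "Next": "RefreshFailed",
            },
            "RefreshFailed": {"Type": "Fail", "Error": "RefreshFailed"},
        },
    }


if __name__ == "__main__":
    # The printed JSON is what you would supply as the state machine definition.
    print(json.dumps(build_state_machine(), indent=2))
```

Keeping the error handling in the definition, rather than inside each Lambda function, means the workflow itself decides what happens when a step fails.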

Watch this video to learn about the complete setup.

Resources

AWS Step Functions Launch | AWS Step Functions | Getting started | Github


3. Azure Data Factory

Overview

Azure Data Factory is used for orchestrating data processing pipelines for Azure, a Microsoft Corporation solution. Adobe, Concentra, Milliman, Rockwell Automation, Lorven Technologies, and Hentsu are some of its customers.

Let’s explore the Milliman case study from the insurance industry to see Azure Data Factory in action.

Case study: Top actuarial firm Milliman transforms the insurance industry

Milliman found that actuarial firms were spending about 70% of their time managing data — updating models, creating files, and running reports. They wanted to build a solution to simplify actuarial modeling and reporting so that actuaries could spend more time analyzing the results, rather than handling infrastructure setup.

Milliman chose to build Integrate Data Management using Azure Data Factory and Azure HDInsight (a Hadoop-based cloud service). Azure Data Factory helps Milliman automate workflows for data integration and transformation using multiple data sources. As a result, Milliman could launch its platform at scale, while reducing IT and data management costs considerably.

To learn more about Azure Data Factory, check out the complete case study here.

Resources

Data Factory | Introduction | Tutorial | Case studies | Github | Microsoft annual report (2021)


4. Control-M

Overview

Control-M is a data workflow orchestration tool from BMC Software, Inc. It has two parts:

  1. Control-M Desktop: Sets and schedules jobs
  2. Control-M Enterprise Manager: Handles monitoring

Founded in 1980, BMC Software, Inc. has customers such as Carrefour, Sky Italia, Tampa General Hospital, ING Bank Slaski, SAP, and Ingram Micro.

To understand the business impact of Control-M, let’s explore the Carrefour case study.

Case study: Carrefour drives proximity store growth with Control-M

Carrefour Argentina aimed to expand its presence by opening 540 Express branches within five years. As new branches popped up, the number of required data exchanges grew exponentially because of more frequent stock replenishment, discounts, pricing updates, and so on.

Carrefour needed a solution that offered a single view of all the data exchanges and flagged issues in real time. It chose Control-M to orchestrate data and application workflows across platforms for more efficient business processes. As a result, Carrefour could minimize data exchange and quality issues while improving collaboration and communication across all stores.

Check out the full case study here.

Resources

Control-M | Customers | Demo | Datasheet | Github


5. Flyte

Overview

Flyte is a workflow automation platform built to help ML and data engineers build robust and reusable pipelines.

In 2020, Lyft open-sourced Flyte, after having used it to train production models for three years. Flyte helped Lyft’s engineering team manage 7000+ unique workflows, leading to 100,000+ executions every month.

Flyte’s customers include Spotify R&D, Gojek, Freenome, Striveworks, RunX, Convoy, and USU AI Services.

Flyte was developed by Ketan Umare at Lyft. Today, Ketan Umare heads Union Systems Inc., which is building a managed version of Flyte. In 2022, his company closed $10 million in seed funding led by New Enterprise Associates (NEA), a global venture capital firm.

Let’s explore the Spotify case study to know how Flyte solves their challenges.

Case study: How Spotify leverages Flyte to coordinate deep financial analytics company-wide

Spotify’s finance team must prepare P&L projections for two years into the future, which involves collaborating with 8+ teams and running 15+ models to analyze each business unit. The process was slow, manual, and took up 3-4 weeks each quarter.

Spotify’s finance team wanted to automate business casing and scenario analysis, which required a tool to automate the workflows (or pipelines). So, Spotify chose Flyte to be the main runtime engine for its forecasting models. Today, Spotify uses Flyte to build its financial reports automatically.

To delve further into the specifics, check out the Spotify case study here.

Resources

Introducing Flyte | Docs | Slack | Github | Flyte | Union AI (the team behind Flyte)


6. Google Cloud Functions

Overview

Cloud Functions is a pay-as-you-go functions as a service (FaaS) product from Google Cloud Platform by Alphabet Inc. It’s a serverless compute solution to run code in the cloud.

Cloud Functions is part of Google Cloud Platform (GCP), which launched in 2008, and has customers such as Home Away, Lucille Games, Smart Parking, and Semios.

Let’s delve into the Home Away case study to understand the problem Cloud Functions solves.

Case study: Home Away halves development time and lowers cost by over 66%

Vacation rental company Home Away was building apps for global travelers, complete with a real-time recommendation engine. The development team at Home Away wanted to offer this facility even in areas with no Internet connection, without complicating the tool architecture or the go-to-market time.

Traditionally, this process took Home Away 2-3 months and a team of three full-time developers. When they adopted Cloud Firestore, along with Firebase Authentication and Cloud Functions, they could:

  • Set up the infrastructure in minutes
  • Build core app features without any complex server-side logic
  • Write just a few lines of code
  • Ship apps in 4-6 weeks
  • Deliver real-time user experience from the get-go

Here’s the complete case study.

Resources

Cloud Functions | Introducing Cloud Functions (video) | Overview | Github


7. K2View

Overview

K2View Data Orchestration offers a no-code visual tool for charting out data movement, transformation and business-flow orchestration. It’s part of the K2View Data Product platform.

Achi Rotem and Rafi Cohen set up K2View in 2009, backed by investors such as Flashpoint, Forestay Capital, and Genesis Partners. Their customers include AT&T, VodafoneZiggo, Verizon, American Express, Hertz, IQVIA, Comcast, and Telefónica.

Here’s a case study to see K2View Data Orchestration in action.

Case study: AT&T slashes test data provisioning to minutes and time-to-market by 80%

AT&T struggled with reducing its time-to-market as it lacked quick, on-demand access to realistic test data. The company also wanted to cut its overall test data management operational costs, without compromising test data integrity and security.

Traditionally, furnishing test data involved manual requests, multiple teams, and tedious database backup and restore processes, taking several days or even weeks. As a result, their time-to-market cycle would be 3-6 months.

AT&T adopted K2View’s platform to speed up data sourcing, transformation, and masking, leading to an 80% decrease in time-to-market and a 30% reduction in manual processes. The time taken to create test data also went from weeks to mere minutes.

Check out the complete AT&T case study to learn more.

Resources

Data Orchestration | Resources | Blog | Customers


8. Metaflow

Overview

Metaflow is a framework for data science projects built by Netflix, Inc. Metaflow helps data scientists manage, deploy and run their code in a production environment.

Netflix built Metaflow to help its data scientists speed up the development process and track their projects in notebooks (like Jupyter). In 2019, Netflix open-sourced Metaflow.

Metaflow’s customers include Future Demand, Spike, FindHotel, LMS, and giffgaff.

Let’s look into a case study to gauge the impact of Metaflow.

Case study: How Netflix Metaflow helped Future Demand build real-world machine learning services

German event sales and marketing company Future Demand relied on its engineers to deploy and manage the models developed by the data scientists. The engineers had to extract the code from Jupyter notebooks and refactor the Python scripts manually.

So, Future Demand chose Metaflow to help its data scientists build, deploy, and manage their code with end-to-end visibility and zero engineering intervention. In addition to empowering Future Demand’s data scientists, Metaflow also freed up its engineers to focus on solving other engineering issues.

Read the full story on Medium.

Resources

Metaflow | Docs | Github | Slack | Open-sourcing Metaflow | Overview | Netflix annual report


9. Prefect

Overview

Prefect offers a data orchestration platform to set up, deploy, and manage pipelines at scale.

In 2018, Jeremiah Lowin set up Prefect Technologies, Inc. The company is backed by Tiger Global Management, Bessemer Venture Partners, and Atreides Management. Data Revenue, Quansight, Clearcover, and Actium are some of its customers.

Let’s explore a Prefect case study to understand the tool better.

Case study: What we (Data Revenue) love about Prefect

Developing the ML model code is a small part of ML projects. The more significant aspect is building and maintaining workflows and dataflows.

For Data Revenue, a key requirement (besides workflow orchestration) was native Kubernetes support. They chose Prefect to pull data from various sources, transform it as required, and monitor the jobs using the transformed data. Moreover, its data team could build tasks on Prefect using Python scripts.

Check out the complete case study on Prefect implementation for Data Revenue here.

Resources

Prefect | Docs | Github | Slack | Discourse | State of work


How to evaluate data orchestration tools

You must consider the following factors before choosing a data orchestration tool for your organization:

  • Check the size of resource allocation — memory and CPU sizes
  • Ensure that the tool enables multi-tenancy and accommodates several integrations
  • Analyze how the tool supports other dependencies and ensures streamlined data migration
  • Understand the tool’s infrastructure and its support for multi-cloud environments
  • Check whether the vendor offers good customer support
  • Evaluate its user-friendliness, documentation, knowledge base, and more to help you resolve issues quickly
  • Verify reviews and customer testimonials on third-party review portals such as Gartner Peer Insights, G2, and Capterra

Data orchestration and the modern data stack

Here’s how Astasia Myers highlights the importance of data orchestration to the modern data stack:

Historically, individuals wrote cron jobs to orchestrate data. However, as data teams began writing more cron jobs the growing number and complexity became hard to manage. Today, there are data orchestration frameworks that allow them to programmatically author, schedule, and monitor data pipelines. So, over the past few years, we have seen the emergence of numerous data orchestration frameworks and believe it is a core component of the modern data stack.

Would you like to deepen your understanding of the modern data stack? Then check out this blog on the modern data stack, which discusses its core components, capabilities, tooling choices, and more.
