9 Popular Data Pipeline Orchestration Tools in 2023
May 26th, 2022
We will explore the top data orchestration tools in 2023 by reviewing the companies behind them, their customers, case studies, and more. Before we begin, let’s quickly look into data orchestration tools and their capabilities.
Data orchestration tools automate the process of bringing data together from multiple sources, standardizing it, and preparing it for data analysis.
According to Astasia Myers, author of “Data Orchestration — A Primer”, data orchestration tools can:
- Cleanse, organize, and publish data into a data warehouse
- Compute business metrics
- Maintain data infrastructure (like database scrapes)
- Run a TensorFlow task to train a machine learning model
- Apply rules to target and engage users through online marketing campaigns
Automating processes (like the ones mentioned above) is vital as companies handle millions of data points from apps, websites, databases, SDKs, and more — so, scheduling cron jobs manually is taxing and error-prone.
The 9 most popular data orchestration tools in 2023
- Astronomer
- AWS Step Functions
- Azure Data Factory
- Control-M
- Flyte
- Google Cloud Functions
- K2View Data Orchestration
- Metaflow
- Prefect
1. Astronomer
Astronomer builds data orchestration tools like Astro on top of Apache Airflow™, which was originally developed by Airbnb to automate its data engineering pipelines. Astro enables data teams to build, run, and observe pipelines-as-code.
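The pipelines-as-code idea is easiest to see in miniature. The sketch below uses only the Python standard library (it is not Airflow's actual API) to show the pattern an Airflow DAG expresses: tasks are plain functions, and a declared dependency graph determines execution order. All task and function names are made up for illustration.

```python
from graphlib import TopologicalSorter

# A toy "pipeline as code": tasks are plain functions, and the
# dependency graph decides execution order -- the core idea behind
# Airflow-style orchestrators. (Conceptual sketch, not Airflow's API.)

def extract():
    return [3, 1, 2]

def transform(rows):
    return sorted(rows)

def load(rows):
    print(f"loaded {len(rows)} rows")
    return rows

# task name -> set of upstream task names
dag = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
tasks = {"extract": extract, "transform": transform, "load": load}

def run_pipeline():
    results = {}
    for name in TopologicalSorter(dag).static_order():
        upstream = [results[dep] for dep in sorted(dag[name])]
        results[name] = tasks[name](*upstream)
    return results

if __name__ == "__main__":
    run_pipeline()
```

In real Airflow the same shape is declared with operators inside a `DAG` object, and the scheduler (rather than a single loop) decides when and where each task runs.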
In 2018, Greg Neiheisel, Ry Walker, and Tim Brunk founded Astronomer, Inc. Astronomer is backed by Meritech Capital Partners, Salesforce Ventures, Insight Partners, and Sierra Ventures. Some of the customers include Sonos, EA, Condé Nast, Credit Suisse, Rappi, StockX, BBC, Wise, and Societe Generale.
More recently, Astronomer has acquired data lineage company Datakin. Astronomer aims to set up end-to-end lineage so that its customers can observe and add context to disparate pipelines.
Let’s look at a case study to understand the challenges Astronomer solves with Apache Airflow™ for Wise.
Case study: How Wise uses data orchestration to further its ML initiatives
Wise uses machine learning algorithms for real-time transaction monitoring and KYC processes. Wise intends to enable streaming machine learning, rather than relying on REST-based services, to launch new products faster while collaborating across teams.
Wise started using Amazon SageMaker for data science, where scientists work in segregated team environments to collaborate without breaching privacy measures. They adopted Airflow to orchestrate the retraining of machine learning models in SageMaker.
“By nature, working with ML models in production requires automation and orchestration for repeated model training, testing, evaluation, and likely integration with other services to acquire and prepare data. Airflow is the perfect orchestrator to pair with SageMaker.”
To understand how Airflow helps Wise with orchestration, check out the complete case study here.
Blog | Docs | GitHub | Community | The Airflow Podcast | Demo (video)
2. AWS Step Functions
AWS Step Functions is a low-code visual workflow service used to orchestrate AWS services. The low-code visual designer for Step Functions is called Workflow Studio.
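Under the visual designer, a Step Functions workflow is defined in Amazon States Language (ASL), a JSON format. Here is a minimal, hypothetical two-state definition built as a Python dict; the state names and the Lambda ARN are placeholders, not values from the case study:

```python
import json

# Minimal Amazon States Language (ASL) definition: run one task,
# then succeed. The Lambda ARN is a made-up placeholder.
state_machine = {
    "Comment": "Hypothetical refresh workflow",
    "StartAt": "RunBackup",
    "States": {
        "RunBackup": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:backup",
            "Next": "Done",
        },
        "Done": {"Type": "Succeed"},
    },
}

definition = json.dumps(state_machine, indent=2)

if __name__ == "__main__":
    print(definition)
```

In practice this JSON is what you pass when creating a state machine via the Step Functions API or console; Workflow Studio generates the same ASL from the visual canvas.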
Step Functions is a part of the AWS ecosystem from Amazon Web Services, Inc., which accounts for customers such as Coinbase, Paessler AG, Trulia Rentals, DuPont Pioneer, MirrorWeb, Nasdaq, and ClearDATA.
Let’s check out the Zalora case study to see how Step Functions fits into the big data architecture for modern data teams.
Case study: How serverless automation with AWS Step Functions has reduced Zalora’s SAP system refresh time from 5 days to 2 days
Asia’s online fashion retailer Zalora uses SAP and AWS solutions to automate internal processes and reduce operational overhead. AWS Lambda serves as the heart of this setup where:
- S3 is used for production database backup
- EC2 is used to trigger events
- SNS is used to send messages during errors or failures
- Step Functions is the orchestration layer for the entire setup
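The error-notification piece of a setup like this can be sketched as a Lambda handler that publishes to SNS on failure. This is a hypothetical sketch of the pattern, not Zalora's code; the SNS client is injected as a parameter so the sketch stays self-contained (a deployed function would create one with boto3):

```python
def refresh_handler(event, sns_client, topic_arn):
    """Run one refresh step; publish an SNS alert if it fails.

    Hypothetical sketch: the orchestration layer (Step Functions)
    invokes this handler, and SNS carries failure notifications.
    """
    try:
        step = event["step"]  # e.g. "restore_backup" -- made-up step name
        # ... real work would go here (restore the S3 backup, etc.) ...
        return {"status": "ok", "step": step}
    except Exception as exc:
        sns_client.publish(
            TopicArn=topic_arn,
            Subject="SAP refresh step failed",
            Message=str(exc),
        )
        return {"status": "error", "detail": str(exc)}
```

The design point: each Lambda stays small and stateless, while retries, branching, and sequencing live in the Step Functions state machine rather than in application code.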
Traditionally, Zalora’s engineering and DevOps teams handled this process manually, which took around 5 days and was prone to manual errors.
When Zalora switched to AWS Step Functions and Lambda to take care of their SAP systems, they could automate the entire process, reducing the system refresh time by 60%. The serverless setup also enabled Zalora to reduce their engineering investment considerably.
Watch this video to learn about the complete setup.
AWS Step Functions Launch | AWS Step Functions | Getting started | GitHub
3. Azure Data Factory
Azure Data Factory, a Microsoft Corporation solution, is used for orchestrating data processing pipelines on Azure. Adobe, Concentra, Milliman, Rockwell Automation, Lorven Technologies, and Hentsu are some of its customers.
Let’s explore the Milliman case study from the insurance industry to see Azure Data Factory in action.
Case study: Top actuarial firm Milliman transforms the insurance industry
Milliman found that actuarial firms were spending about 70% of their time managing data — updating models, creating files, and running reports. They wanted to build a solution to simplify actuarial modeling and reporting so that actuaries could spend more time analyzing the results, rather than handling infrastructure setup.
Milliman chose to build Integrate Data Management using Azure Data Factory and Azure HDInsight (a Hadoop-based cloud service). Azure Data Factory helps Milliman automate workflows for data integration and transformation using multiple data sources. As a result, Milliman could launch its platform at scale, while reducing IT and data management costs considerably.
To know more about Azure Data Factory, check out the complete case study here.
Data Factory | Introduction | Tutorial | Case studies | GitHub | Microsoft annual report (2021)
4. Control-M
Control-M is a data workflow orchestration tool from BMC Software, Inc. It has two parts:
- Control-M Desktop: Sets and schedules jobs
- Control-M Enterprise Manager: Handles monitoring
Founded in 1980, BMC Software, Inc. has customers such as Carrefour, Sky Italia, Tampa General Hospital, ING Bank Slaski, SAP, and Ingram Micro.
To understand the business impact of Control-M, let’s explore the Carrefour case study.
Case study: Carrefour drives proximity store growth with Control-M
Carrefour Argentina aimed to expand its presence by opening 540 Express branches within five years. As new branches popped up, the number of required data exchanges grew exponentially because of more frequent stock replenishment, discounts, pricing updates, and so on.
Carrefour needed a solution that offered a single view of all the data exchanges and flagged issues in real time. They chose Control-M to orchestrate data and application workflows across platforms for more efficient business processes. As a result, Carrefour could minimize data exchange and quality issues while improving collaboration and communication across all stores.
Check out the full case study here.
Control-M | Customers | Demo | Datasheet | GitHub
5. Flyte
Flyte is a workflow automation platform built to help ML and data engineers build robust and reusable pipelines.
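Flyte's model is small, typed tasks composed into workflows. The sketch below shows that shape with plain Python functions standing in for flytekit's `@task` and `@workflow` decorators, so it runs without the dependency; the function names and the toy "featurize" logic are made up for illustration:

```python
# Conceptual sketch of Flyte's model: typed, reusable tasks composed
# into a workflow. Real Flyte code would decorate these functions
# with flytekit's @task / @workflow; plain functions are used here
# so the sketch is self-contained.

def clean(raw: list[float]) -> list[float]:
    """Task: drop obviously bad (negative) readings."""
    return [x for x in raw if x >= 0]

def featurize(rows: list[float]) -> list[float]:
    """Task: a trivial 'feature' -- scale values into [0, 1]."""
    top = max(rows)
    return [x / top for x in rows]

def training_workflow(raw: list[float]) -> list[float]:
    """Workflow: chains the tasks; Flyte would version and track each step."""
    return featurize(clean(raw))

if __name__ == "__main__":
    print(training_workflow([4.0, -1.0, 2.0]))
```

Because each task is a typed function, Flyte can cache, retry, and reuse them across workflows, which is what makes the pipelines "robust and reusable".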
In 2020, Lyft open-sourced Flyte, after having used it to train production models for three years. Flyte helped Lyft’s engineering team manage 7000+ unique workflows, leading to 100,000+ executions every month.
Flyte’s customers include Spotify R&D, Gojek, Freenome, Striveworks, RunX, Convoy, and USU AI Services.
Flyte was developed by Ketan Umare at Lyft. Today, Ketan Umare heads Union Systems Inc., which is building a managed version of Flyte. In 2022, his company closed $10 million in seed funding led by New Enterprise Associates (NEA), a global venture capital firm.
Let’s explore the Spotify case study to know how Flyte solves their challenges.
Case study: How Spotify leverages Flyte to coordinate deep financial analytics company-wide
Spotify’s finance team must prepare P&L projections two years into the future, which involves collaborating with 8+ teams and running 15+ models to analyze each business unit. The process was slow, manual, and took 3-4 weeks each quarter.
Spotify’s finance team wanted to automate the business casing and scenario analysis, which required a tool to automate the workflows (or pipelines). So, Spotify chose Flyte as the main runtime engine for its forecasting models. Today, Spotify uses Flyte to build its financial reports automatically.
To delve further into the specifics, check out the Spotify case study here.
Introducing Flyte | Docs | Slack | GitHub | Flyte | Union AI (the team behind Flyte)
6. Google Cloud Functions
Cloud Functions is a pay-as-you-go functions as a service (FaaS) product from Google Cloud Platform by Alphabet Inc. It’s a serverless compute solution to run code in the cloud.
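An HTTP-triggered Cloud Function in Python is just a function that takes a request and returns a response. A minimal sketch follows; on GCP the `request` argument is a Flask request object, but this version only touches `.args`, so it also runs against a simple stand-in (the function name and greeting are made up for illustration):

```python
# Minimal shape of an HTTP-triggered Cloud Function. On GCP the
# `request` argument is a Flask request; only `.args` is used here,
# so the function also works with a simple stand-in object.

def hello_http(request):
    name = request.args.get("name", "world")
    return f"Hello, {name}!"
```

Deployment is a single command along the lines of `gcloud functions deploy hello_http --runtime python311 --trigger-http`; Google provisions and scales the serving infrastructure, and you pay only per invocation.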
Cloud Functions is part of GCP and has customers such as HomeAway, Lucille Games, Smart Parking, and Semios.
Let’s delve into the HomeAway case study to understand the problem Cloud Functions solves.
Case study: HomeAway halves development time and lowers cost by over 66%
Vacation rental company HomeAway was building apps for global travelers, complete with a real-time recommendation engine. The development team at HomeAway wanted to offer this facility even in areas with no Internet connection, without complicating the tool architecture or the go-to-market time.
Traditionally, this process took HomeAway 2-3 months and a team of three full-time developers. When they adopted Cloud Firestore, along with Firebase Authentication and Cloud Functions, they could:
- Set up the infrastructure in minutes
- Build core app features without any complex server-side logic
- Write just a few lines of code
- Ship apps in 4-6 weeks
- Deliver real-time user experience from the get-go
Here’s the complete case study.
Cloud Functions | Introducing Cloud Functions (video) | Overview | GitHub
7. K2View Data Orchestration
K2View Data Orchestration offers a no-code visual tool for charting out data movement, transformation, and business-flow orchestration. It’s part of the K2View Data Product platform.
Achi Rotem and Rafi Cohen set up K2View in 2009, backed by investors such as Flashpoint, Forestay Capital, and Genesis Partners. Their customers include AT&T, VodafoneZiggo, Verizon, American Express, Hertz, IQVIA, Comcast, and Telefónica.
Here’s a case study to see K2View Data Orchestration in action.
Case study: AT&T slashes test data provisioning to minutes and time-to-market by 80%
AT&T struggled with reducing its time-to-market as it lacked quick, on-demand access to realistic test data. The company also wanted to cut its overall test data management operational costs, without compromising test data integrity and security.
Traditionally, furnishing test data involved manual requests, multiple teams, and tedious database backup and restore processes, taking days or even weeks. As a result, their time-to-market cycle stretched to 3-6 months.
AT&T adopted K2View’s platform to speed up data sourcing, transformation, and masking, leading to an 80% decrease in time-to-market and a 30% reduction in manual processes. The time taken to create test data also went from weeks to mere minutes.
Check out the complete AT&T case study to know more.
Data Orchestration | Resources | Blog | Customers
8. Metaflow
Metaflow is a framework for data science projects built by Netflix, Inc. Metaflow helps data scientists manage, deploy, and run their code in a production environment.
Netflix built Metaflow to help its data scientists speed up the development process and track their projects in notebooks (like Jupyter). In 2019, Netflix open-sourced Metaflow.
Metaflow’s customers include Future Demand, Spike, FindHotel, LMS, and giffgaff.
Let’s look into a case study to gauge the impact of Metaflow.
Case study: How Netflix Metaflow helped Future Demand build real-world machine learning services
German event sales and marketing company Future Demand relied on its engineers to deploy and manage the models developed by the data scientists. The engineers had to extract the code from Jupyter notebooks and refactor the Python scripts manually.
So, Future Demand chose Metaflow to help its data scientists build, deploy, and manage their code with end-to-end visibility and zero engineering intervention. In addition to empowering Future Demand’s data scientists, Metaflow also freed up its engineers to focus on solving other engineering issues.
Read the full story on Medium.
Metaflow | Docs | GitHub | Slack | Open-sourcing Metaflow | Overview | Netflix annual report
9. Prefect
Prefect offers a data orchestration platform to set up, deploy, and manage pipelines at scale.
In 2018, Jeremiah Lowin set up Prefect Technologies, Inc. The company’s backed by Tiger Global Management, Bessemer Venture Partners, and Atreides Management. Data Revenue, Quansight, Clearcover, and Actium are some of its customers.
Let’s explore a Prefect case study to understand the tool better.
Case study: What we (Data Revenue) love about Prefect
Developing the ML model code is a small part of ML projects. The more significant aspect is building and maintaining workflows and dataflows.
For Data Revenue, a key requirement (besides workflow orchestration) was native Kubernetes support. They chose Prefect to pull data from various sources, transform it as required, and monitor the jobs using the transformed data. Moreover, its data team could build tasks on Prefect using Python scripts.
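The pull-transform-monitor pattern described above maps onto Prefect's task/flow model. The sketch below uses plain Python, with a small retry decorator standing in for Prefect-style task retries (this is a conceptual sketch, not Prefect's API; all names and the retry behavior are illustrative):

```python
import time

def retry(retries=2, delay=0.0):
    """Tiny decorator standing in for Prefect-style task retries."""
    def deco(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise
                    time.sleep(delay)
        return wrapper
    return deco

@retry(retries=2)
def pull(source: dict) -> list[int]:
    """'Task': pull rows from a (toy) source."""
    return source["rows"]

def transform(rows: list[int]) -> list[int]:
    """'Task': a trivial transformation."""
    return [r * 2 for r in rows]

def etl_flow(source: dict) -> list[int]:
    """'Flow': chains the tasks, as Prefect's @flow would."""
    return transform(pull(source))

if __name__ == "__main__":
    print(etl_flow({"rows": [1, 2, 3]}))
```

In real Prefect, the equivalent decorators also give each run scheduling, logging, and state tracking for free, which is where the "monitor the jobs" part comes from.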
Check out the complete case study on Prefect implementation for Data Revenue here.
Prefect | Docs | GitHub | Slack | Discourse | State of work
How to evaluate data orchestration tools
You must consider the following factors before choosing a data orchestration tool for your organization:
- Check the size of resource allocation — memory and CPU sizes
- Ensure that the tool enables multi-tenancy and accommodates several integrations
- Analyze how the tool supports other dependencies and ensures streamlined data migration
- Understand their infrastructure and support for multi-cloud environments
- Check whether they offer good customer support
- Evaluate its user-friendliness, documentation, knowledge base, and more to help you resolve issues quickly
- Verify reviews and customer testimonials on third-party review portals such as Gartner Peer Insights, G2, and Capterra
Data orchestration and the modern data stack
Here’s how Astasia Myers highlights the importance of data orchestration to the modern data stack:
“Historically, individuals wrote cron jobs to orchestrate data. However, as data teams began writing more cron jobs the growing number and complexity became hard to manage. Today, there are data orchestration frameworks that allow them to programmatically author, schedule, and monitor data pipelines. So, over the past few years, we have seen the emergence of numerous data orchestration frameworks and believe it is a core component of the modern data stack.”
Would you like to deepen your understanding of the modern data stack? Then check out this blog on the modern data stack that discusses its core components, capabilities, tooling choices, and more.
Related reads on data orchestration tools
- What is data orchestration: Definition, uses, examples, and tools
- 5 popular open source data pipeline orchestration tools in 2023
- What are data silos and how can you break them down?
Related deep dives on popular data tools
- Top 5 ETL Tools to Consider in 2023
- Top 6 ELT Tools to Consider in 2023
- 10 popular transformation tools in 2023
- 11 top data masking tools in 2023
- 9 best data discovery tools in 2023
- 5 popular open-source data catalog tools to consider in 2023
- Open-source data lineage tools: 5 best tools in 2023
- Open-source data observability tools: 7 popular picks in 2023