Data orchestration tools sit at the center of your data infrastructure, taking care of all your data pipelining and ETL workloads. Choosing an open-source data orchestration tool is difficult, as there are quite a few you can choose from, and every one of them has competitive features. You have to select the tool that sits well with your existing infrastructure and doesn’t require redesigning everything.
Here are five popular open-source data orchestration tools: Apache Airflow, Dagster, Argo, Prefect, and Luigi.
Maxime Beauchemin, who is also the creator of Superset, created Airflow while he was working at Airbnb back in 2014. Airflow saw widespread adoption as soon as it got open-sourced in 2015. Airbnb handed it over to the Apache Software Foundation for maintenance a year later. Since then, Airflow’s popularity has risen, with giants like PayPal, Twitter, Google, Square, and more using it at scale.
AWS and Google Cloud both offer Airflow as a managed service. Astronomer, a commercial offering built around Airflow, provides advanced deployment methods and priority support. This adoption suggests that businesses of all shapes and sizes choose Airflow over more traditional proprietary tools, such as Control-M and Talend.
Airflow Data Orchestration Features
Airflow popularized the directed acyclic graph (DAG) in data engineering. A DAG is a graph whose edges have a direction and never loop back on themselves, which fits the bill perfectly when creating complex workflows without circular dependencies.
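To make the DAG idea concrete, here is a minimal sketch in plain Python. It does not use Airflow at all; it relies only on the standard library's `graphlib` module (Python 3.9+), and the task names and the `execution_order` helper are hypothetical. It shows how dependencies determine a valid run order and how a circular dependency is rejected.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each key is a task, each value is the set of
# tasks that must finish before that task can start.
pipeline = {
    "extract_orders": set(),
    "extract_users": set(),
    "join": {"extract_orders", "extract_users"},
    "load_warehouse": {"join"},
}


def execution_order(dependencies):
    """Return one valid run order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dependencies).static_order())


print(execution_order(pipeline))

# Making the first task depend on the last one creates a loop,
# so the graph is no longer a DAG and cannot be ordered.
cyclic = dict(pipeline, extract_orders={"load_warehouse"})
try:
    execution_order(cyclic)
except CycleError as err:
    print("cycle detected:", err.args[1])
```

An orchestrator does much more than ordering (scheduling, retries, logging), but this ordering guarantee is the part the DAG structure provides.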
Airflow’s design took into consideration the perspective of data engineers and developers. The idea was to make Airflow flexible and extensible enough to work with a wide range of tools and technologies. Airflow Operators are one example: there are several built-in operators, but you can write your own if they don’t fulfill your requirements.
At the time of the release, a few advanced features were unique to Airflow, such as the ability of tasks to communicate with each other using XComs, although with certain limitations. Airflow’s interesting new features and updates might help you understand if it is the right data orchestration tool for you.
Nick Schrock, a co-creator of GraphQL, created Dagster. He is the founder of the data product company Elementl, which handled Dagster’s initial development before open-sourcing it in mid-2019.
Dagster’s mission is to enhance the testing, development, and overall collaboration experience for engineers, analysts, and business users while dealing with the plethora of tools and technologies they come across daily.
Dagster is relatively new to the market, but many companies, including Mapbox, VMware, and DoorDash, trust its capabilities enough to use it in production.
Dagster Data Orchestration Features
Dagster is heavily inspired by Airflow. It attempts to solve many of Airflow’s shortcomings, such as difficulties with local testing and development, dynamic workflows, and ad-hoc task runs.
Dagster took a different path by being more fluid and easier to integrate. It does so by abstracting both storage and compute away from the pipeline logic. To keep task dependencies clear, Dagster encourages typed inputs and outputs for Ops. This helps with efficient caching and makes it easy to swap the IO layer.
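The combination of typed steps and swappable IO can be sketched in plain Python. This is not Dagster’s API: `run_op`, `InMemoryIO`, and `JsonFileIO` are hypothetical stand-ins, and the type check is a simple `isinstance` test against the function annotations. The point is only that the step logic stays the same while the storage layer changes.

```python
import json
import tempfile
from pathlib import Path
from typing import get_type_hints


class InMemoryIO:
    """Keeps step outputs in a dict; handy for local tests."""

    def __init__(self):
        self._store = {}

    def save(self, key, value):
        self._store[key] = value

    def load(self, key):
        return self._store[key]


class JsonFileIO:
    """Persists step outputs as JSON files under a root directory."""

    def __init__(self, root):
        self.root = Path(root)

    def save(self, key, value):
        (self.root / f"{key}.json").write_text(json.dumps(value))

    def load(self, key):
        return json.loads((self.root / f"{key}.json").read_text())


def run_op(op, io, output_key, **input_keys):
    """Load inputs through `io`, check them against annotations, store the result."""
    hints = get_type_hints(op)
    inputs = {name: io.load(key) for name, key in input_keys.items()}
    for name, value in inputs.items():
        if not isinstance(value, hints[name]):
            raise TypeError(f"{op.__name__}: input {name!r} is not {hints[name].__name__}")
    result = op(**inputs)
    if not isinstance(result, hints["return"]):
        raise TypeError(f"{op.__name__}: output is not {hints['return'].__name__}")
    io.save(output_key, result)
    return result


def load_numbers() -> list:
    return [1, 2, 3]


def total(numbers: list) -> int:
    return sum(numbers)


def run_pipeline(io):
    # The same two steps run unchanged against any IO implementation.
    run_op(load_numbers, io, output_key="numbers")
    return run_op(total, io, output_key="total", numbers="numbers")


print(run_pipeline(InMemoryIO()))
with tempfile.TemporaryDirectory() as tmp:
    print(run_pipeline(JsonFileIO(tmp)))
```

Both runs produce the same total. Swapping `InMemoryIO` for `JsonFileIO` changes where intermediate results live without touching the step functions.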
A fresh take on the DAG-based workflow model, simple-to-use APIs, and a host of other features make Dagster a viable alternative. Dagster provides easy integration with the most popular tools, such as dbt, Great Expectations, Spark, Airflow, Pandas, and so on. It also offers a range of deployment options, including Docker, k8s, AWS, and Google Cloud. Take a look at the resources listed below to determine if Dagster is the data orchestration tool for you.
Argo is an open-source container-native take on data orchestration. It runs on Kubernetes, making it a great choice if a large portion of your infrastructure is cloud-native. Applatix (an Intuit company) created Argo in 2017 to make the Kubernetes ecosystem richer.
The Cloud Native Computing Foundation (CNCF) maintains the open-source project. Hundreds of companies, including GitHub, Tesla, WordPress, Ticketmaster, and Adobe, use Argo for their ML and data pipeline workloads.
Argo is unique as it is the only genuinely general-purpose, cloud-native data orchestrator. Similar to Airflow and Dagster, Argo also harnesses the power of the DAG. In Argo, each step of a DAG is a Kubernetes container. Using Kubernetes, Argo can orchestrate parallel jobs without any trouble while allowing you to scale up linearly as your business grows.
Another difference between Argo and the other data orchestration tools on this list is that Argo uses YAML to define tasks instead of Python. To make your job easier, Argo offers a JSON Schema that enables your IDE to validate the resources specified in the YAML file.
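To give a feel for what such a definition contains, the sketch below builds a small three-step workflow as a Python dictionary and prints it as JSON. Argo manifests are normally written by hand in YAML; JSON is used here only to stay within the standard library, and since YAML 1.2 is a superset of JSON, the structure is the same. The field names follow the Workflow spec as commonly documented, but treat them as an assumption and validate against Argo’s JSON Schema. The workflow name, task names, messages, container image, and the `dag_task` helper are all hypothetical.

```python
import json


def dag_task(name, message, dependencies=()):
    """Build one DAG task entry that calls the shared `echo` template."""
    task = {
        "name": name,
        "template": "echo",
        "arguments": {"parameters": [{"name": "message", "value": message}]},
    }
    if dependencies:
        task["dependencies"] = list(dependencies)
    return task


workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "etl-sketch-"},
    "spec": {
        "entrypoint": "main",
        "templates": [
            {
                # The DAG itself: which tasks exist and what they wait for.
                "name": "main",
                "dag": {
                    "tasks": [
                        dag_task("extract", "extracting"),
                        dag_task("transform", "transforming", ["extract"]),
                        dag_task("load", "loading", ["transform"]),
                    ]
                },
            },
            {
                # Each task runs this container with its own message.
                "name": "echo",
                "inputs": {"parameters": [{"name": "message"}]},
                "container": {
                    "image": "alpine:3.19",
                    "command": ["echo", "{{inputs.parameters.message}}"],
                },
            },
        ],
    },
}

manifest = json.dumps(workflow, indent=2)
print(manifest)
```

Note how dependencies are plain data (`dependencies: [extract]`) rather than Python expressions, which is what lets an IDE check the file against a schema.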
Argo also allows you to create CI/CD pipelines natively on Kubernetes without installing any other piece of software. Arthur Engineering’s post on why they chose Argo over Prefect and Airflow might help you find the right data orchestration tool.
Back in 2017, a group of engineers, some of whom had worked on Apache Airflow, created Prefect to address the pain points they had experienced. The core idea behind Prefect was to create a more flexible, customizable, and modular tool to suit the needs of the modern data stack.
Many companies, including Slate, Kaggle, Microsoft, PositiveSum, and ClearCover, use Prefect to handle their data pipelining and orchestration workloads.
Prefect Data Orchestration Features
One of the significant shortcomings of Airflow was its lack of support for parameterized or dynamic DAGs. Airflow also had difficulty handling complicated branching logic and ad-hoc task runs. While there were workarounds for these problems in Airflow, they weren’t clean or scalable. Prefect took a fresh approach to data orchestration and fixed these issues.
Running ad-hoc tasks is easy in Prefect, as Prefect treats every workflow as an invocable standalone object that is not tied to any predefined schedule. Prefect also allows tasks to receive inputs and send outputs, allowing more transparency between interdependent tasks.
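The idea of a workflow as a standalone, parameterized callable can be illustrated without Prefect itself. The sketch below is plain Python, not Prefect’s API: the `simple_flow` decorator, the `run_log` list, and the sample data are hypothetical. It shows a flow being invoked ad hoc with different parameters, with each task’s return value feeding the next task.

```python
import functools

run_log = []  # hypothetical in-memory record of flow runs


def simple_flow(fn):
    """Wrap a function so each call is recorded as one run with its parameters."""

    @functools.wraps(fn)
    def wrapper(**params):
        result = fn(**params)
        run_log.append({"flow": fn.__name__, "params": params, "result": result})
        return result

    return wrapper


def extract(source):
    # Stand-in for reading from a real system.
    samples = {"small": [1, 2, 3], "large": list(range(10))}
    return samples[source]


def transform(rows, multiplier):
    return [row * multiplier for row in rows]


@simple_flow
def etl_flow(source="small", multiplier=2):
    rows = extract(source)  # the output of one task ...
    return transform(rows, multiplier)  # ... is the input of the next


# Two ad-hoc runs with different parameters; no schedule is involved.
print(etl_flow())
print(etl_flow(source="large", multiplier=10))
```

Because the flow is just a callable, it can be triggered from a script, a test, or an event handler, and a schedule becomes one optional way of invoking it rather than part of its definition.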
Prefect has recently launched v2.0 with a host of exciting features like the support for perpetually running pipelines, event-based workflows, workflow versioning, and so on. As of now, this new version of Prefect is in Beta. You can still try v1.0 to see if it sits well with your current data architecture.
Developed at Spotify in 2011, Luigi improved upon existing orchestration engines such as Oozie and Azkaban, which were built explicitly for the Hadoop ecosystem. Luigi took a more general-purpose approach by enabling you to orchestrate different types of tasks, such as Spark jobs, Hive queries, and so on.
It’s not completely clear how many businesses actively use Luigi, but it is still under active development and sees a fair bit of use: its latest release, v1.20.0, came out as recently as December 2021.
Luigi Data Orchestration Features
Luigi predates Airflow, so it doesn’t have an explicit concept of a DAG. It instead uses task-and-target semantics to define dependencies. Each task writes its output to a target, and that target can in turn be the input of another task, creating a chain of tasks. Because of this design, developing highly complex pipelines with many dependencies and branches can be extremely difficult in Luigi.
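The task-and-target model can be sketched in a few lines of plain Python. The method names `requires`, `output`, `run`, and `complete` mirror the ones Luigi uses, but the classes below are simplified stand-ins rather than Luigi’s own, and the `build` helper and file names are hypothetical. A task counts as complete when its target exists, so finished work is skipped on the next run.

```python
import tempfile
from pathlib import Path


class FileTarget:
    """A target is just a place where a task's output lives."""

    def __init__(self, path):
        self.path = Path(path)

    def exists(self):
        return self.path.exists()


class Task:
    def __init__(self, workdir):
        self.workdir = Path(workdir)

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        # A task is done when its target already exists.
        return self.output().exists()


class WriteNumbers(Task):
    def output(self):
        return FileTarget(self.workdir / "numbers.txt")

    def run(self):
        self.output().path.write_text("1\n2\n3\n")


class SumNumbers(Task):
    def requires(self):
        return [WriteNumbers(self.workdir)]

    def output(self):
        return FileTarget(self.workdir / "sum.txt")

    def run(self):
        upstream = self.requires()[0].output().path
        numbers = [int(line) for line in upstream.read_text().split()]
        self.output().path.write_text(str(sum(numbers)))


def build(task, ran=None):
    """Run missing dependencies first, then the task; return names of tasks that ran."""
    ran = [] if ran is None else ran
    if task.complete():
        return ran
    for dependency in task.requires():
        build(dependency, ran)
    task.run()
    ran.append(type(task).__name__)
    return ran


with tempfile.TemporaryDirectory() as workdir:
    print(build(SumNumbers(workdir)))  # first run executes both tasks
    print(build(SumNumbers(workdir)))  # second run finds the target and does nothing
```

Dependencies are discovered by walking `requires()` from the final task backwards, which is why there is no separate graph object to declare up front.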
Luigi might be a good option for you if you want something lightweight that requires substantially less time to manage. It can also be considered when your pipelines don’t have complex dependencies that warrant using a DAG-based tool. To find out more about the tool and decide if it’s the right one for you, please visit their official blog and go through the release notes of some of the latest versions.
All the open-source data orchestration tools listed are good ones to choose from. The one best suited for you will largely depend on your existing data stack and your primary use cases.
Data orchestration tools: Related reads
- What is data orchestration: Definition, uses, examples, and tools
- Open source ETL tools: 7 popular tools to consider in 2022
- Open-source data observability tools: 7 popular picks in 2022
- ETL vs. ELT: Exploring definitions, origins, strengths, and weaknesses
- 10 popular transformation tools in 2022