Choosing any data engineering tool is hard. Choosing an ETL tool is harder still, because the ETL tool is what binds your different data sources and targets into a coherent, functional system.
Most ETL solutions were traditionally commercial enterprise products, so startups and companies built on open-source technologies had no viable alternatives.
Here are 7 popular open-source ETL tools
In this article, we evaluate and compare the most popular open-source ETL tools based on key features, data integration capabilities, architecture, support and community, documentation, and product updates.
- Singer
- Airbyte
- dbt
- PipelineWise
- Meltano
- Talend Open Studio for Data Integration
- Pentaho Data Integration
A comparative evaluation of open-source ETL tools
The earliest successful, if limited, open-source ETL tools that are still around are Pentaho Kettle and Talend Open Studio for Data Integration. The breakthrough came when Stitch open-sourced Singer, its ETL tool built on tap and target plugins. Other tools, like Airbyte and dbt, were developed around the same time and have seen tremendous adoption, especially in new data engineering projects. Let’s look deeper into the list.
1. Singer
As mentioned above, Singer was the first open-source ETL tool to tackle the problem of data integration at scale. It launched in the first quarter of 2017 and proved to be of great interest to engineers and business users alike. Singer has since inspired other tools, such as Wise’s PipelineWise and GitLab’s Meltano.
Singer ETL features
Singer introduced the tap and target architecture, in which taps are the producers of data and targets are the consumers. Taps and targets are designed to be pluggable components that you configure based on the tools and technologies your business uses. This architecture also lets you reduce failure points by loading data into multiple targets, especially in a multi-cloud or hybrid-cloud infrastructure.
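To make the producer and consumer roles concrete, here is a minimal sketch of a tap and a target in plain Python. It assumes the Singer convention of newline-delimited JSON messages of type SCHEMA, RECORD, and STATE. The `users` stream, the sample rows, and the function names are made up for illustration, and the code does not use any Singer library.

```python
import io
import json

# Hypothetical source rows; a real tap would read these from an API or database.
USERS = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
]


def run_tap(out):
    """Emit Singer-style messages, one JSON object per line."""
    schema = {
        "type": "SCHEMA",
        "stream": "users",
        "key_properties": ["id"],
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            },
        },
    }
    out.write(json.dumps(schema) + "\n")
    for row in USERS:
        record = {"type": "RECORD", "stream": "users", "record": row}
        out.write(json.dumps(record) + "\n")
    # STATE lets a later run resume from the last row that was emitted.
    state = {"type": "STATE", "value": {"users": {"last_id": USERS[-1]["id"]}}}
    out.write(json.dumps(state) + "\n")


def run_target(lines):
    """Consume Singer-style messages and collect records per stream."""
    loaded = {}
    state = None
    for line in lines:
        message = json.loads(line)
        if message["type"] == "SCHEMA":
            loaded.setdefault(message["stream"], [])
        elif message["type"] == "RECORD":
            loaded.setdefault(message["stream"], []).append(message["record"])
        elif message["type"] == "STATE":
            state = message["value"]
    return loaded, state


def sync_to_targets(n_targets):
    """Run the tap once and feed the same output to several targets."""
    buffer = io.StringIO()
    run_tap(buffer)
    lines = buffer.getvalue().splitlines()
    return [run_target(lines) for _ in range(n_targets)]
```

In practice the tap and the target are separate programs connected by a pipe, so the in-memory buffer here only stands in for that stream. Because the target sees nothing but messages, the same tap output can feed more than one destination.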
Hundreds of users trust Singer’s enterprise solution. The same is true for the open-source option.
2. Airbyte
Inspired by Singer but fundamentally quite different from it, Airbyte claims to be better in many respects. It standardizes the codebase for taps and targets and keeps ownership of that code centralized, which makes the codebase more reliable, the roadmap more predictable, and community support smoother.
Airbyte ETL features
Airbyte was launched in the second quarter of 2020 and attracted thousands of users within its first year and a half. Many of Singer’s strengths, such as extensibility and flexibility, are central to Airbyte’s design. On top of that, one of Airbyte’s major innovations is separating the transformation step from the extract and load steps, which lets Airbyte integrate with tools like dbt that specialize in data transformation.
Airbyte has also brought the concept of reverse ETL to the forefront by promising the feature on its official roadmap. Reverse ETL is becoming more important for businesses because it helps justify the huge costs of running ETL operations to populate data warehouses. It feeds the data in the warehouse back to operational systems, where it can actively support business insights and business operations.
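As a rough illustration of the reverse ETL idea, the sketch below reads an aggregate out of a warehouse and writes it back onto operational records. Everything in it is an assumption made for the example: an in-memory SQLite database stands in for the warehouse, a plain dictionary stands in for a CRM, and the `orders` table and `lifetime_value` field are hypothetical. It does not show how Airbyte or any other product implements the feature.

```python
import sqlite3


def build_warehouse():
    """Create an in-memory stand-in for a warehouse with an orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 40.0), (1, 60.0), (2, 15.0)],
    )
    return conn


def reverse_etl(conn, crm):
    """Push a warehouse aggregate back into an operational record store."""
    rows = conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders "
        "GROUP BY customer_id ORDER BY customer_id"
    )
    for customer_id, lifetime_value in rows:
        # Update only customers the operational system already knows about.
        if customer_id in crm:
            crm[customer_id]["lifetime_value"] = lifetime_value
    return crm


# Hypothetical CRM contents; a real sync would call the CRM's API instead.
crm = {1: {"email": "a@example.com"}, 2: {"email": "b@example.com"}}
synced = reverse_etl(build_warehouse(), crm)
```

The direction of flow is the point here: the warehouse is the source and the operational system is the destination, which is the opposite of a regular ETL run.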
3. dbt
dbt started as a project at RJMetrics in 2016 to add transformation capabilities to Stitch, and it was open source from the beginning. Fishtown Analytics (now dbt Labs) then took the core codebase and built its product around it. Since then, dbt has seen widespread adoption because of its ease of use and its ability to do SQL-based transformations very effectively by harnessing the power of the Jinja2 templating engine.
dbt ETL features
dbt integrates easily with orchestration tools like Prefect or Airflow, and it works well with any basic extract and load tool that wants to offload the transformation workload. Like many other current data engineering tools, dbt relies more on the command line than on a UI. It is also extremely lightweight: you can run it locally by installing it with Homebrew or pip, or by running it in a Docker container.
Recently, the long-awaited dbt Core v1.0.0 was released, with contributions in over 5000 commits from over 200 engineers. dbt clearly has an active and vibrant community. There are many YouTube tutorials and blog posts about dbt, and dbt Labs also makes a well-directed effort to create useful courses and tutorials for new learners. Learn more about dbt’s roadmap on the official blog.
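The templated SQL approach can be sketched without dbt itself. The example below is a simplified stand-in, not dbt’s implementation: a regular expression takes the place of the Jinja2 engine, SQLite takes the place of a warehouse, and the model names `stg_orders` and `customer_totals` are invented. It shows the general idea of writing each transformation as a SELECT statement, pointing at upstream models with a `ref()` call, and building the models in dependency order. It needs Python 3.9 or later for `graphlib`.

```python
import re
import sqlite3
from graphlib import TopologicalSorter

# Hypothetical models: name -> SELECT statement with dbt-style ref() calls.
MODELS = {
    "stg_orders": "SELECT customer_id, amount FROM raw_orders WHERE amount > 0",
    "customer_totals": (
        "SELECT customer_id, SUM(amount) AS total "
        "FROM {{ ref('stg_orders') }} GROUP BY customer_id"
    ),
}

# Stand-in for Jinja2: matches {{ ref('model_name') }} and captures the name.
REF_PATTERN = re.compile(r"\{\{\s*ref\('(\w+)'\)\s*\}\}")


def build_models(conn, models):
    """Materialize each model as a table, parents before children."""
    # Map each model to the set of models it references.
    deps = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        # Replace each ref() call with the referenced table name.
        sql = REF_PATTERN.sub(lambda m: m.group(1), models[name])
        conn.execute(f"CREATE TABLE {name} AS {sql}")
    return order


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(1, 40.0), (1, 60.0), (2, -5.0), (2, 15.0)],
)
build_order = build_models(conn, MODELS)
totals = conn.execute(
    "SELECT customer_id, total FROM customer_totals ORDER BY customer_id"
).fetchall()
```

Because the dependencies are read from the SQL itself, the build order never has to be maintained by hand, which is a large part of why this style of transformation is easy to work with.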
4. PipelineWise
Initially developed at Wise (formerly known as TransferWise), PipelineWise was open-sourced in the third quarter of 2019. After considering many ways to solve the data integration problem at scale, the engineers at Wise looked at the Singer.io specification and chose to extend it rather than buy an enterprise solution or build their own from scratch.
PipelineWise ETL features
Instead of Singer.io’s JSON configuration, PipelineWise uses YAML-based configuration files that can be version-controlled. Additional features include the out-of-the-box capability to obfuscate sensitive data to comply with data privacy and security regulations such as GDPR. Wise has also added advanced replication features, such as stream selection and logging.
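To give a sense of what obfuscating sensitive data in flight can look like, here is a small sketch that applies per-column rules to a record before it is loaded. The rule names `hash` and `set_null`, the `RULES` mapping, and the sample row are assumptions made for this example. They are not taken from PipelineWise’s configuration format.

```python
import hashlib

# Hypothetical rules: column name -> obfuscation type. These names are
# illustrative and are not taken from PipelineWise's configuration format.
RULES = {"email": "hash", "phone": "set_null"}


def obfuscate(record, rules):
    """Return a copy of the record with sensitive columns obfuscated."""
    cleaned = dict(record)
    for column, rule in rules.items():
        if column not in cleaned or cleaned[column] is None:
            continue
        if rule == "hash":
            # A one-way SHA-256 digest keeps the column joinable across tables.
            digest = hashlib.sha256(str(cleaned[column]).encode("utf-8"))
            cleaned[column] = digest.hexdigest()
        elif rule == "set_null":
            cleaned[column] = None
        else:
            raise ValueError(f"unknown rule: {rule}")
    return cleaned


row = {"id": 7, "email": "ada@example.com", "phone": "555-0100"}
safe_row = obfuscate(row, RULES)
```

Keeping the rules in a declarative mapping mirrors the configuration-file approach described above, since the mapping could just as easily be loaded from a version-controlled file. Note that an unsalted hash of a low-entropy value such as an email address may not be enough on its own to satisfy a privacy regulation.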
5. Meltano
Meltano started as an in-house open-source project at GitLab in 2018. It was created from a fresh perspective, using the principles of DevOps to help businesses derive the most value from their data at every point of the data lifecycle. Like PipelineWise, Meltano is based on the Singer specification. After its initial success, Meltano was spun out as a separate company towards the end of the second quarter of 2021.
Meltano ETL features
Of the other open-source ETL tools, Meltano stands closest to Airbyte. Like Airbyte, it lets you offload transformation workloads to a tool like dbt and orchestration workloads to a tool like Airflow. Meltano also lets you deploy your ETL tool using Docker, fulfilling its three main promises of providing an efficient solution for data integration, orchestration, and containerization.
Meltano community & resources
Meltano has taken the Singer specification to the next level. Its Singer Working Group comprises the leading Singer.io contributors, including the teams from Wise and StitchData (now a part of Talend). The group’s core focus is to improve Singer by adding features and making performance enhancements, while also keeping the Singer community active.
6. Talend Open Studio for Data Integration
Talend Open Studio for Data Integration overview
Talend Open Studio for Data Integration (TOS DI) is one of the most popular ETL tools. Its adoption has been shrinking in the last few years, which is one of the reasons Talend decided to buy StitchData. Talend Open Studio was first launched in 2006 and quickly became a force to be reckoned with, competing with Informatica PowerCenter, IBM DataStage, and others.
Talend Open Studio for Data Integration features
While many of today’s ETL tools focus on a specific step of the ETL pipeline, Talend covers it all with advanced ETL, orchestration, data privacy, and security features built in. dbt, for instance, handles only the transformation step, and Airflow focuses on handling orchestration well. Many tools, like Airbyte and Meltano, are built to integrate with other tools rather than do everything themselves.
Talend Open Studio for Data Integration resources
The ETL tool comprises two separate GitHub repositories: Common Code across Talend Products and TOS DI. Although both repositories are actively maintained, there is little centralized documentation on what is going on with the development of the product. Other than that, Talend is a very sophisticated ETL tool full of advanced data integration features.
7. Pentaho Data Integration
Pentaho Data Integration overview
Pentaho Data Integration (formerly known as Pentaho Kettle) was also developed around the same time as TOS DI. Architecturally and stylistically, Pentaho Kettle is pretty close to TOS DI; however, it is not as feature-rich. After consistent success, especially with enterprise users, Pentaho was acquired by Hitachi back in 2015.
Since the acquisition, the Pentaho suite has remained in active development for both in-house and commercial use, with a mix of closed-source and open-source components.
Deciding which ETL tool to use can be tricky, irrespective of where you are in your data engineering journey. To get the most out of your ETL, look at how well the tool fits with the rest of your stack, how much it costs to operate, and how much engineering expertise it requires to deploy, maintain, and develop for.