Data Transformation: Definition, Processes, and Use Cases
What is data transformation? #
Data transformation is the process of converting the format or structure of data so it’s compatible with the system where it’s stored. It is one of the steps in the Extract, Transform, Load (ETL) or ELT process that is essential for accessing data and using it to inform decisions.
The goal of data transformation is to take the information you have about customers and business processes and make it consumable by everyone in your organization. Because data rests in multiple sources, it's important to ensure each source conforms to the format required by the destination data warehouse.
Data transformation includes procedures such as filtering, summarizing, and formatting data. It can be done manually, automated with open-source and commercial tools, or handled with a combination of the two. A variety of products are available that streamline the process of transformation to make it more manageable and scalable.
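For instance, here's a minimal sketch of those three procedures (filtering, summarizing, and formatting) in Python using pandas; the orders dataset and its column names are hypothetical:

```python
import pandas as pd

# Hypothetical raw feed: amounts arrive as strings
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "region": ["EU", "US", "EU", "US"],
    "amount": ["19.99", "5.00", "42.50", "7.25"],
})

# Formatting: cast the amount column from string to a numeric type
orders["amount"] = pd.to_numeric(orders["amount"])

# Filtering: keep only EU orders
eu_orders = orders[orders["region"] == "EU"]

# Summarizing: total order value per region
summary = orders.groupby("region")["amount"].sum().reset_index()
print(summary)
```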
What are the steps of data transformation? #
The data transformation process consists of two overarching steps: Researching and planning the transformation, then executing it.
Let’s take a deeper dive into these steps.
Research and planning #
Raw data is not always usable in its original form. It must be transformed so it can be used for analytics. The first step toward deriving value from data is to understand the format and structure of the source data, then uncover what must be done to shape it into a usable format.
It’s critical to plan the transformation process to clarify exactly what types of transformations need to take place. This part of the process is known as “mapping.” The purpose here is to ensure data is compatible with the destination system and any data that already rests there.
For example, this might entail:
- Standardizing data so each data type has the same format and content
- Filtering, enriching, converting, or joining data
- Removing duplicate data
It’s worth noting that not all data will need to be transformed. Some will already be in a compatible format. This data is known as “direct move” or “pass-through” data.
Planning the transformation process step by step is necessary to uncover any pass-through data, identify data that needs to be transformed, and ensure the data mapping addresses relevant business or technical requirements.
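To make mapping concrete, here's a hedged Python sketch; the source column names, destination names, and types are all hypothetical, but the pattern of an explicit source-to-destination map is the core of this step:

```python
import pandas as pd

# Hypothetical source-to-destination column mapping
COLUMN_MAP = {
    "cust_nm": "customer_name",
    "sign_up": "signup_date",
    "amt": "amount",
}

def apply_mapping(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.rename(columns=COLUMN_MAP)
    df["signup_date"] = pd.to_datetime(df["signup_date"])  # standardize the date format
    df["amount"] = df["amount"].astype("float64")          # standardize the numeric type
    return df.drop_duplicates()  # remove duplicates; unmapped columns pass through as-is
```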
Executing the transformation #
Once you understand the format and structure of data and plan how it needs to be transformed, it’s time to execute the process of extracting, cleansing, transforming, and delivering data.
This can be done manually, but it's more efficient and scalable to write executable code (in SQL, Python, or R) to perform the transformation. This step is often completed using a transformation tool or platform.
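As an illustration, here's a minimal Python sketch of that flow: extract from a CSV file, cleanse, transform, and deliver the result. The file names and columns are hypothetical, and writing Parquet assumes the pyarrow package is installed; a production pipeline would typically deliver to a warehouse instead:

```python
import pandas as pd

def run_transformation(source_path: str, target_path: str) -> None:
    df = pd.read_csv(source_path)                # extract from the source file
    df = df.dropna(subset=["customer_id"])       # cleanse: drop rows missing a key field
    df["amount"] = df["amount"].round(2)         # transform: normalize numeric precision
    df.to_parquet(target_path, index=False)      # deliver to the target store

run_transformation("raw_events.csv", "clean_events.parquet")
```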
After the transformation is completed, the transformed data is ready to be loaded into a target warehouse. End users can then check the output data to ensure it meets their requirements and has been correctly formatted. Any errors they uncover are communicated back to data teams.
A quality data lineage tool comes in handy here since it helps trace the transformational steps a piece of data went through. By providing a transparent view of the entire data transformation process, data lineage makes it easier to track and audit compliance.
ETL vs. ELT vs. Reverse ETL processes #
| | ETL | ELT | Reverse ETL |
| --- | --- | --- | --- |
| Order of Operations | Data is extracted, transformed in the staging area, and then loaded into the target data system. | Data is extracted, loaded directly into the target system, and transformed within the system. | Data is extracted from the source system (the target in traditional ETL), transformed, and then loaded into a third-party system. |
| Primary Focus | Loading into data systems (typically data warehouses) where compute is a valuable resource. | Loading into flexible data systems (data warehouses, lakes, or lakehouses), mapping schemas directly. | Loading into third-party systems (SaaS applications or platforms), enabling real-time connectivity. |
| Analytics Flexibility | Use cases and reporting models must be defined at the beginning of the process. | Data can be selected at any time for transformation and analysis as new use cases emerge. | Real-time data is consistently made available to various teams, powering operational analytics. |
| Scalability | Bottlenecks can occur if there is no scalable, distributed processing system in place. | Highly scalable thanks to the cloud; compute and storage resources can be added as necessary. | As with ETL, bottlenecks can occur if the right tools are not in place. |
If your business uses on-premise data warehouses, the steps for transformation typically happen in the middle of the ETL process whereby you extract data from sources, transform it, then load it into a data repository.
Today most businesses use cloud-based data warehouses and data lakes, which means they can extract and load the data first, then transform it into a clean, analysis-ready format at the time of the actual query. With this model, known as ELT, users don’t have to depend on engineers and analysts to transform data before they can load it.
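Here's a hedged sketch of that ELT pattern, using Python's built-in SQLite module as a stand-in for a cloud warehouse; the table and column names are hypothetical. Raw data is loaded first, and the cleanup is deferred to query time via a view:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load first: raw, untyped data lands in the warehouse as-is
conn.execute("CREATE TABLE raw_events (ts TEXT, amount TEXT)")
conn.execute("INSERT INTO raw_events VALUES ('2024-01-05', '19.99')")

# Transform on read: casting and formatting are deferred to query time
conn.execute("""
    CREATE VIEW clean_events AS
    SELECT date(ts) AS event_date, CAST(amount AS REAL) AS amount
    FROM raw_events
""")
print(conn.execute("SELECT * FROM clean_events").fetchall())
```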
Data teams have evolved at light speed over the past few years and have pioneered a third approach known as Reverse ETL, one of the six big ideas we highlighted in a recent blog post on The Future of the Modern Data Stack. Reverse ETL brings data into third-party systems such as SaaS tools, allowing stakeholders to uncover insights within the tools they already use every day.
Regardless of whether you're using an ETL, ELT, or Reverse ETL process, data transformation is arguably the step that adds the most value, because it takes raw data that isn't usable and makes it possible to mine it for insights.
Data transformation challenges vs. benefits #
Despite its value for data analysis, data transformation is full of challenges:
- Data transformation tools can be expensive: Costs may include subscriptions, licensing, computing resources, and the staff needed to manage them.
- The process is resource-intensive: Transforming data requires heavy computational power and can slow down other programs.
- It requires domain expertise: Engineers may not understand the business context of data. There needs to be a match between business and data expertise in order to transform data so it’s ready for its intended analytics use.
- Mismatching across systems: You might need to change data to one format for one application, then to another format for a different application, as in the sketch below.
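For example, the following sketch formats the same record two different ways for two hypothetical destination applications:

```python
from datetime import datetime

event_time = datetime(2024, 3, 1, 14, 30)

# Hypothetical application A expects ISO 8601 timestamps
payload_a = {"timestamp": event_time.isoformat()}

# Hypothetical application B expects US-style date strings
payload_b = {"timestamp": event_time.strftime("%m/%d/%Y %I:%M %p")}

print(payload_a, payload_b)
```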
Businesses that are able to overcome such challenges stand to gain numerous benefits, such as:
- Higher quality data: Data transformation helps eliminate quality issues such as missing values and inconsistent formats.
- Increased compatibility between applications and systems: Accurately transformed data is easier for both humans and computers to access and utilize.
- Faster querying: Standardized data is easier to access, organize, and utilize.
- Greater value for business intelligence: Having data in the right format allows end users to understand it.
In short, data transformation may sound like a dull process, but it's central to curating data. Having reliable data transformation processes in place ensures that end users have access to data that is in the right format for use in daily activities.
Furthermore, a systematic approach to data transformation helps prepare for situations such as when data is transferred between systems, when information is added to data sets, or when data needs to be combined from multiple sets.
What is data transformation with an example? #
Suppose you have an event log that's delimited by commas and want to load it into a MySQL database so you can analyze the data using SQL. You'll need to transform the data. There are several ways to do that:
- Using the MySQL command line
- Using a tool for working with MySQL, such as Workbench or phpMyAdmin
- Using automation, such as a script written in Python, together with Python libraries and a touch of magic :)
The first two approaches require manual work each time you want to transform the data, while the third makes it possible to build an automated pipeline from the source into MySQL.
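Here's a hedged sketch of that third option, assuming the mysql-connector-python package and a hypothetical events table, log file layout, and credentials:

```python
import csv
import mysql.connector  # assumes the mysql-connector-python package

# Hypothetical connection details and schema
conn = mysql.connector.connect(
    host="localhost", user="etl_user", password="secret", database="analytics"
)
cursor = conn.cursor()

# Parse the comma-delimited event log (assumes a header row with these fields)
with open("events.log", newline="") as f:
    rows = [(r["event_time"], r["event_type"], r["user_id"])
            for r in csv.DictReader(f)]

cursor.executemany(
    "INSERT INTO events (event_time, event_type, user_id) VALUES (%s, %s, %s)",
    rows,
)
conn.commit()
conn.close()
```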
Data transformation tools: An overview #
In the past, much of the scripting and coding for data transformation was done by hand. This was error-prone and not scalable. Today’s data pros have numerous options (both commercial and open-source) for data transformation. These data transformation tools are some of the key building blocks for the modern data platform.
What are the types of data transformation tools? #
There are two types of data transformation layer implementations commonly seen in the modern enterprise: tools that streamline transformations for the data warehouse, and tools that enable custom transformations for data pipeline orchestration.
- For companies with data warehouse-first architectures, tools such as dbt and Matillion streamline data transformation so analysts and users can easily transform data sources.
- Companies that orchestrate their data pipelines with tools such as Apache Airflow often carry out custom transformations in a programming language like Python, SQL, or R.
These tools can often visually represent dataflows; incorporate parallelization, monitoring, and failover; and include the connectors needed to migrate data. By optimizing each stage, they reduce the time it takes to turn raw data into useful insights.
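As an example of the second pattern, here's a minimal sketch of a custom transformation task in an Apache Airflow DAG (Airflow 2.4+ syntax); the DAG id, schedule, and transformation logic are hypothetical placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def transform():
    # Placeholder for custom transformation logic (e.g., pandas or SQL)
    pass

with DAG(
    dag_id="daily_transform",          # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                 # Airflow 2.4+ keyword
    catchup=False,
) as dag:
    PythonOperator(task_id="transform_events", python_callable=transform)
```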
Furthermore, specific warehousing tools eliminate the need for certain types of transformation. Snowflake, for instance, has data-sharing functionalities that eliminate the need to transform data for use in different departments or geographies. IBM likewise has developed sophisticated tools running on Red Hat OpenShift for building ETL processes to move and transform data at scale.
Data transformation in data warehousing #
Today’s data leaders are looking for ways to bridge the gap between data and insights. Now you understand why data transformation is an important part of this process: It allows data teams to standardize data so it’s ready for analysis.