Data Integration - Part 1: ETL, Pushdown and Data Orchestrator

Author: Mai TIEU Khoi (@tieukhoimai)

Data integration is the process of combining data from multiple sources into a unified, consistent, and accessible format. This fundamental practice in data management involves collecting information from various databases, applications, and systems, then transforming and consolidating it to create a single source of truth for analysis and decision-making. The primary goal of data integration is to provide organizations with a comprehensive view of their data landscape, enabling better insights, improved operational efficiency, and more informed strategic decisions.
ETL vs ELT
Data Warehouse and Business Intelligence environments are built around data use cases, so it is essential to have the right approach and the right tools to manage data robustly and in a centralized manner.
| Characteristic | ETL (Extract-Transform-Load) | ELT (Extract-Load-Transform) |
| --- | --- | --- |
| Processing Location | Transformation occurs on a separate ETL server | Transformation occurs inside the target database |
| Scalability | Limited by ETL server capacity | Scales with database computing power |
| Data Volume Handling | Better for smaller data volumes | Optimal for large data volumes |
| Implementation Speed | Can be slower to implement | Faster implementation with SQL |
| Transformation Complexity | Handles complex transformations well | Limited to what SQL/database functions can do |
| Target System Load | Minimal processing load on target | Heavy processing on target system |
| Storage Requirements | Requires less storage in target | Requires more storage (raw + processed data) |
| Best Use Case | Complex transformations, smaller datasets | Big data, cloud data warehouses |
ETL
ETL refers to the processes that gather data and present it to the final user.
- E: Extract – data is extracted from homogeneous or heterogeneous data sources
  - Homogeneous data source: made up of data items that are similar to each other, usually coming from a single source.
  - Heterogeneous data sources: datasets with diverse types, structures, formats, or origins.
- T: Transform – data is transformed into the proper format or structure for the purposes of querying and analysis
- L: Load – data is loaded into the final target, such as a database, data warehouse, or data lake (a minimal sketch of the three steps follows the table below)
| Target System | Structure | Purpose | Data Type |
| --- | --- | --- | --- |
| Database | Highly structured | Day-to-day operations | Structured |
| Data Warehouse | Structured | Analytical processing & reporting | Mostly structured |
| Data Lake | Unstructured/mixed | Storing raw data for future use | All types (structured → unstructured) |
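To make the three steps concrete, here is a minimal ETL sketch in Python, with SQLite files standing in for two separate systems; the `source.db`/`warehouse.db` files, the `orders` table, and the currency-conversion transform are all hypothetical, chosen only to illustrate the extract-transform-load flow.

```python
import sqlite3

# Hypothetical source and target; in practice these would be separate
# systems, e.g. an operational database and a data warehouse.
source = sqlite3.connect("source.db")
target = sqlite3.connect("warehouse.db")

# Extract: read the raw rows out of the source system.
rows = source.execute("SELECT order_id, amount, currency FROM orders").fetchall()

# Transform: reshape the data in the ETL layer (here, converting amounts
# to EUR with assumed static rates, purely for illustration).
RATES = {"EUR": 1.0, "USD": 0.92}
transformed = [(oid, amt * RATES.get(cur, 1.0)) for oid, amt, cur in rows]

# Load: write the cleaned rows into the target table.
target.execute(
    "CREATE TABLE IF NOT EXISTS orders_eur (order_id INTEGER, amount_eur REAL)"
)
target.executemany("INSERT INTO orders_eur VALUES (?, ?)", transformed)
target.commit()
```

Note that the transformation runs entirely in the ETL layer's memory; the database on either side only reads or writes rows.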
ELT
- Extract-Load-Transform has become a more widely adopted practice in recent years.
- ELT changes the loading pattern to leverage the computing power of the database.
- ELT tools will generate SQL statements which are going to be directly executed by the database.
- SQL statements can be written by hand, shortening learning curves and speeding adoption across teams.
- No or limited GUI to learn.
- Sometimes the only way to process a large amount of data is to first load the data in a loosely defined and unconstrained table and then manipulate it once the data is already loaded.
- ELT components (Input, Mapper and Output) are useful for bulk loading the tables by pushing processing overhead to the database rather than iterating through the result set in memory (see the sketch below).
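A rough sketch of the ELT pattern, again with SQLite as a stand-in warehouse and hypothetical table names: the raw rows are first bulk-loaded into a loosely defined, unconstrained staging table, and the transformation is then expressed as SQL that the database itself executes.

```python
import sqlite3

db = sqlite3.connect("warehouse.db")  # hypothetical target warehouse

# Load: bulk-insert the raw rows into an unconstrained staging table first
# (SQLite allows untyped columns, matching the "loosely defined" idea).
db.execute("CREATE TABLE IF NOT EXISTS stg_orders (order_id, amount, currency)")
raw_rows = [(1, 100.0, "USD"), (2, 80.0, "EUR")]  # assumed extracted data
db.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", raw_rows)

# Transform: the ELT tool emits SQL that the database executes itself,
# so the heavy lifting happens in the warehouse, not on an ETL server.
db.execute("""
    CREATE TABLE IF NOT EXISTS orders_eur AS
    SELECT order_id,
           CASE currency WHEN 'USD' THEN amount * 0.92 ELSE amount END AS amount_eur
    FROM stg_orders
""")
db.commit()
```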
Pushdown
Query pushdown is a database optimization technique where the processing of a query, or parts of it, is pushed down to the data source (e.g., a database) for execution. This means that operations like filtering, aggregation, or other transformations are performed at the source instead of being done by the client application or a middle layer.
Pushdown is available when the source table and/or the target are on the same database instance.
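A minimal illustration of the idea, assuming a hypothetical `orders` table in an SQLite file: without pushdown the client fetches every row and filters in application code, while with pushdown the filter travels into the SQL and runs at the source, so only the reduced result set is transferred.

```python
import sqlite3

db = sqlite3.connect("source.db")  # hypothetical source database

# Without pushdown: every row crosses the wire; filtering happens client-side.
all_rows = db.execute("SELECT order_id, amount FROM orders").fetchall()
big_orders = [(oid, amount) for oid, amount in all_rows if amount > 1000]

# With pushdown: the filter runs inside the database, and only the
# (much smaller) result set is transferred to the client.
big_orders_pushed = db.execute(
    "SELECT order_id, amount FROM orders WHERE amount > 1000"
).fetchall()
```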
| Characteristic | No Pushdown | Partial Pushdown | Full Pushdown |
| --- | --- | --- | --- |
| Source DB Workload | Light, but the full read can be time-consuming for big tables when a filter or join could have strongly reduced the result volume | High (the source DB executes all the required transformations) | High (the source DB executes all the required transformations) |
| Target DB Workload | Light workload, but time-consuming (one insert statement per row) | Light workload, but time-consuming (one insert statement per row) | High (bulk load of all the data in one single step) |
| Network Traffic | Very high | High (source traffic reduced by applied filters) | None |
| ETL Server Workload | Very high | Low | None |
| Overall Time Taken | Very high | Medium | Very low |
No Pushdown
When pushdown is not available due to tool limitations, the functions applied, or data locations, the ETL tool will have to:
- Fully read the source table from the database server
- Transfer the data into the ETL application servers
- Execute the required transformations
- Send the result data to the target database

In this scenario, the ETL server bears the full burden of data processing, which can lead to performance bottlenecks for large datasets.
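A sketch of that path under the same assumptions as before (hypothetical `orders` table, SQLite files standing in for two separate servers):

```python
import sqlite3

source = sqlite3.connect("source.db")  # hypothetical source server
target = sqlite3.connect("target.db")  # hypothetical, separate target server

# Steps 1-2: fully read the source table and transfer it to the ETL server.
rows = source.execute("SELECT order_id, amount FROM orders").fetchall()

# Step 3: execute the required transformation in the ETL server's memory.
transformed = [(oid, round(amount * 0.92, 2)) for oid, amount in rows]

# Step 4: send the result to the target, one INSERT per row -- the
# row-by-row loading that makes this path so slow for big tables.
target.execute(
    "CREATE TABLE IF NOT EXISTS orders_eur (order_id INTEGER, amount_eur REAL)"
)
for row in transformed:
    target.execute("INSERT INTO orders_eur VALUES (?, ?)", row)
target.commit()
```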
Partial Pushdown
Partial pushdown can happen when the source tables are on the same database and the applied transformation is pushdown-compliant, but the target is a different instance.
In this case the ETL tool will have to:
- Ask the source database to execute the transformation through an SQL statement
- Get the result of the source table transformation from the database server
- Transfer it to the ETL application server
- Send it to the target database

In this scenario, a part of the workload is moved to the source database server, which can significantly improve performance by reducing the amount of data transferred and processed by the ETL server.
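A sketch of partial pushdown: a hypothetical aggregation runs as SQL on the source, and only the already-reduced result crosses over to the separate target.

```python
import sqlite3

source = sqlite3.connect("source.db")  # transformation is pushdown-compliant here
target = sqlite3.connect("target.db")  # but the target is a different instance

# Steps 1-3: the source database executes the transformation itself; the
# ETL server only fetches the already-transformed (and smaller) result.
totals = source.execute("""
    SELECT order_id, SUM(amount) AS total
    FROM orders
    GROUP BY order_id
""").fetchall()

# Step 4: the ETL server still has to ship the result to the separate target.
target.execute(
    "CREATE TABLE IF NOT EXISTS order_totals (order_id INTEGER, total REAL)"
)
target.executemany("INSERT INTO order_totals VALUES (?, ?)", totals)
target.commit()
```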
Full Pushdown
Full pushdown happens when the source and the target tables are on the same database and the applied transformations are pushdown-compliant.
In this case the ETL tool has to:
- Ask the source database to execute the required transformations
- Load the result into the target table through an SQL statement

In this scenario, the ETL tool is a simple orchestrator without any operative role in the process: all the workload is pushed to the database, which is usually far more efficient than the ETL server at performing massive workloads.
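Under full pushdown, the same hypothetical aggregation collapses to a single `INSERT ... SELECT` that the database executes end to end; the ETL tool merely submits the statement, and no data ever leaves the database server.

```python
import sqlite3

db = sqlite3.connect("warehouse.db")  # source and target share one instance

db.execute(
    "CREATE TABLE IF NOT EXISTS order_totals (order_id INTEGER, total REAL)"
)

# One statement: the database performs both the transformation and the
# bulk load in a single step, with no network transfer or ETL processing.
db.execute("""
    INSERT INTO order_totals (order_id, total)
    SELECT order_id, SUM(amount)
    FROM orders
    GROUP BY order_id
""")
db.commit()
```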
Data Orchestrator
A data orchestrator is a tool or system that manages and automates the flow of data between different systems and applications. It's essentially a central conductor that coordinates the movement, transformation, and integration of data from various sources into a usable format for analysis.
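As an example of what that coordination looks like in practice, here is a minimal DAG sketch assuming Apache Airflow 2.4+ (where the `schedule` argument is available); the DAG id, the daily schedule, and the three placeholder callables are hypothetical stand-ins for the pipeline steps discussed above.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; in a real pipeline these would contain the
# extract/transform/load logic from the earlier sections.
def extract(): ...
def transform(): ...
def load(): ...

with DAG(
    dag_id="orders_pipeline",        # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # run the flow once per day
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # The orchestrator enforces this ordering, retries failed tasks,
    # and triggers each run on schedule.
    t_extract >> t_transform >> t_load
```

The value of the orchestrator is not in the individual steps but in the dependencies, scheduling, retries, and monitoring it wraps around them.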

References
- Lecture notes from Business Analytics course, Prof. Paolo Menna, University of Verona 2024-2025