Data Integration - Part 1: ETL, Pushdown and Data Orchestrator

Data integration is the process of combining data from multiple sources into a unified, consistent, and accessible format. This fundamental practice in data management involves collecting information from various databases, applications, and systems, then transforming and consolidating it to create a single source of truth for analysis and decision-making. The primary goal of data integration is to provide organizations with a comprehensive view of their data landscape, enabling better insights, improved operational efficiency, and more informed strategic decisions.

ETL vs ELT

Data Warehouse and Business Intelligence environments are built around data use cases, so it is essential to adopt the right approach and the right tools to manage data robustly and in a centralized manner. The two main approaches, ETL and ELT, are compared below.

| Characteristic | ETL (Extract-Transform-Load) | ELT (Extract-Load-Transform) |
| --- | --- | --- |
| Processing Location | Transformation occurs in a separate ETL server | Transformation occurs inside the target database |
| Scalability | Limited by ETL server capacity | Scales with database computing power |
| Data Volume Handling | Better for smaller data volumes | Optimal for large data volumes |
| Implementation Speed | Can be slower to implement | Faster implementation with SQL |
| Transformation Complexity | Handles complex transformations well | Limited to what SQL/database functions can do |
| Target System Load | Minimal processing load on target | Heavy processing on target system |
| Storage Requirements | Requires less storage in target | Requires more storage (raw + processed data) |
| Best Use Case | Complex transformations, smaller datasets | Big data, cloud data warehouses |

ETL

ETL refers to the processes used to gather data and present it to the end user:

  • E: Extract – data is extracted from homogeneous or heterogeneous data sources
    • Homogeneous data source: a dataset made up of items that are similar to each other in type, structure, and format, usually coming from a single source.
    • Heterogeneous data sources: datasets with diverse types, structures, formats, or origins.
  • T: Transform – data is transformed into the proper format or structure for the purposes of querying and analysis
  • L: Load – data is loaded into the final target: a database, data warehouse, or data lake (compared below)
| Target System | Structure | Purpose | Data Type |
| --- | --- | --- | --- |
| Database | Highly structured | Day-to-day operations | Structured |
| Data Warehouse | Structured | Analytical processing & reporting | Mostly structured |
| Data Lake | Unstructured/mixed | Storing raw data for future use | All types (structured → unstructured) |
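
To ground the three steps, here is a minimal sketch of an ETL flow in Python: extract rows from a CSV source, transform them inside the ETL process, and load them into a target database. The file, table, and column names are made up for illustration.

```python
import csv
import sqlite3

# E: extract from a source system (a CSV export, in this toy example)
with open("orders.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# T: transform in the ETL process: cast types and drop invalid records
transformed = [
    (int(r["id"]), float(r["amount"]))
    for r in rows
    if r["amount"]  # skip rows with a missing amount
]

# L: load into the target (a database, warehouse, or lake; here SQLite)
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", transformed)
conn.commit()
```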

ELT

  • Extract-Load-Transform has become a more widely adopted practice in recent years.
  • ELT changes the loading pattern to leverage the computing power of the database.
  • ELT tools generate SQL statements that are executed directly by the database.
  • Because transformations are plain SQL, statements can also be written manually, which shortens the learning curve and speeds up adoption across teams.
  • There is no GUI, or only a limited one, to learn.
  • Sometimes the only way to process a large amount of data is to first load it into a loosely defined, unconstrained table and then manipulate it once it is already loaded.
  • ELT components (Input, Mapper and Output) are useful for bulk loading tables, pushing the processing overhead to the database rather than iterating through the result set in memory; a minimal sketch follows this list.
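
As a concrete illustration, here is a minimal sketch of the ELT pattern using Python's built-in sqlite3 module (table and column names are made up): raw data is bulk-loaded into a loosely defined staging table first, and the transformation then runs as a single SQL statement inside the database.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")  # hypothetical target database

# L: load raw, extracted rows into a loosely defined, unconstrained staging table
conn.execute("CREATE TABLE IF NOT EXISTS stg_orders (raw_id TEXT, raw_amount TEXT)")
extracted = [("1", "10.50"), ("2", "99.90"), ("3", "7.25")]  # E: extracted upstream
conn.executemany("INSERT INTO stg_orders VALUES (?, ?)", extracted)

# T: transform inside the database with a single SQL statement;
# the ELT tool only generates and submits the SQL, the engine does the work
conn.execute("""
    CREATE TABLE orders AS
    SELECT CAST(raw_id AS INTEGER) AS id,
           CAST(raw_amount AS REAL) AS amount
    FROM stg_orders
    WHERE raw_amount IS NOT NULL
""")
conn.commit()
```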

Pushdown

Query pushdown is a database optimization technique where the processing of a query, or parts of it, is pushed down to the data source (e.g., a database) for execution. This means that operations like filtering, aggregation, or other transformations are performed at the source instead of being done by the client application or a middle layer.

Pushdown is available when the source tables and/or the target reside on the same database instance.
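
To make the idea concrete, here is a minimal sketch (sqlite3 again, assuming an orders table like the one in the sketches above) contrasting a client-side aggregation with the same aggregation pushed down to the database:

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")

# Without pushdown: every row crosses the wire, the client aggregates in memory
rows = conn.execute("SELECT amount FROM orders").fetchall()
total = sum(amount for (amount,) in rows)

# With pushdown: the database computes the aggregate, one row comes back
(total,) = conn.execute("SELECT SUM(amount) FROM orders").fetchone()
```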

| Characteristic | No Pushdown | Partial Pushdown | Full Pushdown |
| --- | --- | --- | --- |
| Source DB Workload | Light workload; can still be time-consuming if tables are big and e.g. a filter or join could strongly reduce the result volume | High (the source DB executes all the required transformations) | High (the source DB executes all the required transformations) |
| Target DB Workload | Light workload, time-consuming (one insert statement for each row) | Light workload, time-consuming (one insert statement for each row) | High (bulk load of all the data in one single step) |
| Network Traffic | Very high | High (source traffic reduced by applied filters) | None |
| ETL Server Workload | Very high | Low | None |
| Overall Time Taken | Very high | Medium | Very low |

No Pushdown

When pushdown is not available, because of tool limitations, non-pushdown-compliant functions, or data residing on different systems, the ETL tool has to:

  1. Fully read the source table from the database server
  2. Transfer the data into the ETL application servers
  3. Execute the required transformations
  4. Send the result data to the target database

[Figure: No Pushdown scenario]

In this scenario, the ETL server bears the full burden of data processing, which can lead to performance bottlenecks for large datasets.
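
A minimal sketch of this flow (sqlite3, with hypothetical source and target databases): the ETL server pulls every row, transforms it in memory, and writes the result back one insert at a time.

```python
import sqlite3

source = sqlite3.connect("source.db")   # hypothetical source database
target = sqlite3.connect("target.db")   # hypothetical target database

# 1-2: fully read the source table and pull it onto the ETL server
rows = source.execute("SELECT id, amount FROM orders").fetchall()

# 3: execute the transformation on the ETL server, row by row, in memory
transformed = [(order_id, amount * 1.22) for order_id, amount in rows]

# 4: send the result to the target, one insert statement per row
target.executemany("INSERT INTO orders_gross (id, gross_amount) VALUES (?, ?)",
                   transformed)
target.commit()
```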

Partial Pushdown

Partial pushdown can happen when the source tables are on the same database and the applied transformation is pushdown-compliant, but the target is a different database.

In this case the ETL tool will have to:

  1. Ask the source database to execute the transformation through an SQL statement
  2. Get the result of the source table transformation from the database server
  3. Transfer it to the ETL application server
  4. Send it to the target database

[Figure: Partial Pushdown scenario]

In this scenario, a part of the workload is moved to the source database server, which can significantly improve performance by reducing the amount of data transferred and processed by the ETL server.
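
A sketch of the partial-pushdown flow under the same assumptions: the transformation runs inside the source database, but the result still travels through the ETL server to a different target.

```python
import sqlite3

source = sqlite3.connect("source.db")   # hypothetical source database
target = sqlite3.connect("target.db")   # different target database

# 1-3: the source database executes the transformation; the ETL server
# only fetches the (already reduced) result set
result = source.execute("""
    SELECT id, amount * 1.22 AS gross_amount
    FROM orders
    WHERE amount > 0
""").fetchall()

# 4: send the result to the target database
target.executemany("INSERT INTO orders_gross (id, gross_amount) VALUES (?, ?)",
                   result)
target.commit()
```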

Full Pushdown

Full pushdown happens when the source and the target tables are on the same database and the applied transformations are pushdown-compliant.

In this case the ETL tool has to:

  1. Ask the source database to execute the required transformations
  2. Load the result into the target table through an SQL statement

[Figure: Full Pushdown scenario]

In this scenario, the ETL tool is a simple orchestrator with no operative role in the process: all the workload is pushed to the database, which is usually far more efficient than the ETL server at performing massive workloads.
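
And a sketch of full pushdown under the same assumptions: with source and target on the same database, the whole job collapses into one SQL statement that the ETL tool merely submits.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")  # source and target share one database

# 1-2: transformation and load happen entirely inside the database;
# no data ever flows through the ETL server
conn.execute("""
    INSERT INTO orders_gross (id, gross_amount)
    SELECT id, amount * 1.22
    FROM orders
    WHERE amount > 0
""")
conn.commit()
```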

Data Orchestrator

A data orchestrator is a tool or system that manages and automates the flow of data between different systems and applications. It's essentially a central conductor that coordinates the movement, transformation, and integration of data from various sources into a usable format for analysis.

[Figure: Data Orchestrator overview]
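
As a deliberately tiny sketch of the core idea, dependency-ordered execution of pipeline steps, here is a toy task runner in Python. Production orchestrators such as Apache Airflow or Dagster model this as a DAG of tasks with scheduling, retries, and monitoring on top; all task names below are made up.

```python
# Toy dependency-ordered task runner; each task lists the tasks it depends on.
tasks = {
    "extract": [],
    "load_staging": ["extract"],
    "transform": ["load_staging"],
    "publish_report": ["transform"],
}

def run(task, done):
    for dep in tasks[task]:          # make sure upstream tasks ran first
        if dep not in done:
            run(dep, done)
    print(f"running {task}")         # a real orchestrator would launch the job here
    done.add(task)

done = set()
for task in tasks:
    if task not in done:
        run(task, done)
```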

References

  • Lecture notes from Business Analytics course, Prof. Paolo Menna, University of Verona 2024-2025