Data Integration - Part 2: Loading Strategies, Change Data Capture and Data Layers


Loading Methods

There are two primary methods for loading data into target systems: batch loading and stream loading. Each has distinct characteristics and use cases.

Batch Loading

Batch loading transfers large volumes of data at once, typically on a scheduled or periodic basis. It's suitable for large datasets with less frequent updates.

Batch loading refers to the process of grouping and processing multiple items (like shipments, data records, or files) together as a single unit, often for efficiency or automation. This approach is commonly used when large quantities of data are handled.

Stream Loading

Stream loading involves continuous or near real-time data transfer, enabling immediate access to the latest data. It's suitable for applications that require real-time data processing.

Data stream loading refers to the process of ingesting a continuous flow of data into a system, typically for immediate processing and analysis.

This contrasts with batch processing, where data is collected and processed in discrete chunks. Data streams are characterized by their real-time nature, high volume, and continuous generation, making them suitable for applications like real-time analytics, event-driven architectures, and streaming media.

| Characteristic | Batch Loading | Stream Loading |
| --- | --- | --- |
| Frequency | Scheduled intervals (hourly, daily, weekly) | Continuous, real-time |
| Latency | Higher (minutes to hours) | Lower (seconds to minutes) |
| Data Volume per Operation | Large volumes | Small chunks |
| Complexity | Simpler to implement | More complex architecture |
| Resource Usage | Intensive but predictable | Consistent but lower peaks |
| Use Cases | Historical reporting, overnight processing | Real-time dashboards, alerts, timely decisions |
| Error Handling | Easier to restart and recover | More complex error recovery |
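
The contrast can be illustrated with a minimal Python sketch. The `sales_staging` table, the SQLite target, and the in-process queue below are assumptions made for illustration: batch loading inserts a whole extract in one scheduled operation, while stream loading applies each change event as it arrives.

```python
import queue
import sqlite3

conn = sqlite3.connect("warehouse.db")  # hypothetical target database
conn.execute("CREATE TABLE IF NOT EXISTS sales_staging (order_id INTEGER, amount REAL)")

def batch_load(rows):
    """Batch loading: insert a full extract in one scheduled operation."""
    with conn:
        conn.executemany(
            "INSERT INTO sales_staging (order_id, amount) VALUES (?, ?)", rows
        )

def stream_load(events):
    """Stream loading: apply each change event as soon as it arrives."""
    while True:
        event = events.get()      # blocks until the next event arrives
        if event is None:         # sentinel used to stop this sketch
            break
        with conn:
            conn.execute(
                "INSERT INTO sales_staging (order_id, amount) VALUES (?, ?)",
                (event["order_id"], event["amount"]),
            )

# Batch: a whole day's extract loaded in one go.
batch_load([(1, 10.0), (2, 25.5)])

# Stream: events consumed one by one from a queue (e.g. fed by a message broker).
q = queue.Queue()
q.put({"order_id": 3, "amount": 7.25})
q.put(None)
stream_load(q)
```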

If the source system doesn't have a reliable audit mechanism, data has to be extracted and loaded into the data warehouse slice by slice. In this case, it's mandatory to define a time window of data to refresh on each run, under the assumption that data outside the window does not change. This process can also help in identifying hard deletions in the source system (see the sketch after these definitions):

  • Record Hard Delete – the record is physically deleted from the source system
  • Record Soft Delete – the record is flagged as invalid or deleted in a dedicated column of the source table
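
A minimal sketch of this slice-by-slice refresh, assuming a hypothetical `orders` table with an `order_date` column and SQLite on both sides: rows inside the time window are deleted from the target and reloaded, and any key present in the target slice but missing from the source slice is treated as a hard delete.

```python
import sqlite3

WINDOW_START, WINDOW_END = "2024-01-01", "2024-01-31"   # refresh window (assumption)

src = sqlite3.connect("source.db")      # hypothetical source system
tgt = sqlite3.connect("warehouse.db")   # hypothetical data warehouse

ddl = ("CREATE TABLE IF NOT EXISTS orders "
       "(order_id INTEGER PRIMARY KEY, amount REAL, order_date TEXT)")
src.execute(ddl)
tgt.execute(ddl)

# Extract the slice of source data falling inside the time window.
src_rows = src.execute(
    "SELECT order_id, amount, order_date FROM orders "
    "WHERE order_date BETWEEN ? AND ?", (WINDOW_START, WINDOW_END)
).fetchall()
src_keys = {row[0] for row in src_rows}

# Keys already present in the warehouse for the same window.
tgt_keys = {row[0] for row in tgt.execute(
    "SELECT order_id FROM orders WHERE order_date BETWEEN ? AND ?",
    (WINDOW_START, WINDOW_END))}

# Keys in the target slice but missing from the source slice were
# physically removed from the source system (hard deletes).
hard_deleted = tgt_keys - src_keys

with tgt:
    # Refresh the whole slice: delete the window in the target, then reload it.
    tgt.execute("DELETE FROM orders WHERE order_date BETWEEN ? AND ?",
                (WINDOW_START, WINDOW_END))
    tgt.executemany(
        "INSERT INTO orders (order_id, amount, order_date) VALUES (?, ?, ?)",
        src_rows)
```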

Loading Strategies

Full Load

This strategy involves transferring all data from the source to the target, typically replacing existing data in the target system. It is simple to implement but can be time-consuming and resource-intensive, especially for large datasets.

Full Load Strategy

The source table is written to the target database as is; no transformation logic is applied.
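
As a minimal sketch (the `customers` table and the SQLite connections are hypothetical), a full load simply empties the target table and copies every source row across:

```python
import sqlite3

src = sqlite3.connect("source.db")      # hypothetical source system
tgt = sqlite3.connect("warehouse.db")   # hypothetical data warehouse

# Read the entire source table.
rows = src.execute("SELECT customer_id, name, city FROM customers").fetchall()

with tgt:
    # Replace whatever is currently in the target, then reload everything.
    tgt.execute("DELETE FROM customers")
    tgt.executemany(
        "INSERT INTO customers (customer_id, name, city) VALUES (?, ?, ?)", rows
    )
```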

Incremental Load

This strategy focuses on transferring only the changes or new data since the last load, minimizing the amount of data transferred and processed. It is more efficient for handling large datasets with frequent updates but requires more logic to track changes and handle conflicts.

Incremental Load Strategy

With incremental loading, records that have not changed are not extracted: only new or updated records come through. After each execution, the date up to which the load has been performed is stored in a data warehouse control table, and the next batch extracts only records whose update date is greater than the stored one.

Once records are extracted incrementally based on their last update date, each one can be compared with the target on its natural key to determine whether it is a new or an updated record.

In this case, the source system must have an audit system.
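
A minimal sketch of this pattern, assuming a hypothetical `customers` source table with a `last_update` audit column and an `etl_watermark` control table already present in the warehouse: only rows updated after the stored watermark are extracted, and each one is inserted or updated in the target based on its natural key.

```python
import sqlite3

src = sqlite3.connect("source.db")      # hypothetical source system
tgt = sqlite3.connect("warehouse.db")   # hypothetical data warehouse

# Date up to which the previous load was performed, read from the control table.
last_load = tgt.execute(
    "SELECT last_loaded_at FROM etl_watermark WHERE table_name = 'customers'"
).fetchone()[0]

# Extract only rows changed since the last load (requires a source audit column).
changed = src.execute(
    "SELECT customer_id, name, city, last_update FROM customers WHERE last_update > ?",
    (last_load,),
).fetchall()

with tgt:
    for customer_id, name, city, last_update in changed:
        # Compare on the natural key: update if it already exists, otherwise insert.
        updated = tgt.execute(
            "UPDATE customers SET name = ?, city = ?, last_update = ? WHERE customer_id = ?",
            (name, city, last_update, customer_id),
        ).rowcount
        if updated == 0:
            tgt.execute(
                "INSERT INTO customers (customer_id, name, city, last_update) "
                "VALUES (?, ?, ?, ?)",
                (customer_id, name, city, last_update),
            )
    # Store the new watermark for the next batch.
    if changed:
        new_watermark = max(row[3] for row in changed)
        tgt.execute(
            "UPDATE etl_watermark SET last_loaded_at = ? WHERE table_name = 'customers'",
            (new_watermark,),
        )
```

Storing the maximum extracted `last_update` as the new watermark, rather than the wall-clock time of the run, keeps the watermark tied to the source's own audit column.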

Change Data Capture (CDC)

CDC is a generic term for techniques that monitor sources in order to detect and capture data changes.

The three most popular CDC techniques in DBMS environments are:

Timestamp-based Capture

This approach involves adding timestamp columns (such as created_date, modified_date) to source tables to track when records are created or updated.

  1. Each table includes timestamp columns that are automatically updated when records change
  2. ETL processes query for records where the timestamp is greater than the last extraction time
  3. Only records modified since the last run are captured and processed

However, if a record is updated multiple times between ETL runs, only the latest version is captured; the intermediate changes are lost.
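
As a minimal illustration (hypothetical `product` table, SQLite syntax), the source side only needs the timestamp columns, and the capture query is a simple filter on them, following the same watermark pattern shown for incremental loading above:

```python
import sqlite3

src = sqlite3.connect("source.db")   # hypothetical source system

# Source table with audit timestamp columns, maintained by the application
# or by database defaults on every insert and update.
src.execute("""
    CREATE TABLE IF NOT EXISTS product (
        product_id    INTEGER PRIMARY KEY,
        name          TEXT,
        created_date  TEXT DEFAULT CURRENT_TIMESTAMP,
        modified_date TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

last_extraction = "2024-06-01 00:00:00"   # watermark from the previous ETL run (assumption)

# Capture only rows modified since the last extraction; if a row changed
# several times in between, only its latest state is read here.
changes = src.execute(
    "SELECT product_id, name, modified_date FROM product WHERE modified_date > ?",
    (last_extraction,),
).fetchall()
```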

Trigger-based Capture

This technique uses database triggers to automatically capture changes as they occur in the source system; a sketch follows the steps below.

  1. A change (insert, update, or delete) occurs in a source table.
  2. A trigger writes a log entry, including a timestamp, to a dedicated log table.
  3. During the ETL process, the system reads the log table to find out which rows have changed and when.
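
A minimal sketch of this flow using SQLite triggers (the `customer` table, log table, and trigger names are hypothetical): every change to the source table is recorded in a dedicated log table with a timestamp, and the ETL later reads that log.

```python
import sqlite3

src = sqlite3.connect("source.db")   # hypothetical source system

src.executescript("""
    CREATE TABLE IF NOT EXISTS customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        city        TEXT
    );

    -- Dedicated change-log table populated by the triggers below.
    CREATE TABLE IF NOT EXISTS customer_change_log (
        customer_id INTEGER,
        operation   TEXT,
        changed_at  TEXT DEFAULT CURRENT_TIMESTAMP
    );

    CREATE TRIGGER IF NOT EXISTS trg_customer_insert AFTER INSERT ON customer
    BEGIN
        INSERT INTO customer_change_log (customer_id, operation) VALUES (NEW.customer_id, 'I');
    END;

    CREATE TRIGGER IF NOT EXISTS trg_customer_update AFTER UPDATE ON customer
    BEGIN
        INSERT INTO customer_change_log (customer_id, operation) VALUES (NEW.customer_id, 'U');
    END;

    CREATE TRIGGER IF NOT EXISTS trg_customer_delete AFTER DELETE ON customer
    BEGIN
        INSERT INTO customer_change_log (customer_id, operation) VALUES (OLD.customer_id, 'D');
    END;
""")

# The ETL reads the log table to find which rows changed and when.
changes = src.execute(
    "SELECT customer_id, operation, changed_at FROM customer_change_log ORDER BY changed_at"
).fetchall()
```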

Transaction Log Capture

This technique leverages the logging and recovery capabilities of the DBMS. Since transaction logs are used by the DBMS to store transaction information for logging and recovery, they already contain all the information required to capture changed data.

Transaction logs cannot simply be read manually: they are usually stored in a binary, proprietary format that is not human-readable. A dedicated tool, usually provided by the database vendor or a third party, must be used instead (for example, SQL Server's built-in Change Data Capture feature).

Many databases automatically reuse (overwrite) old parts of the transaction log to save space (log recycling), so changes must be captured before the corresponding log segments are overwritten.
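
Because reading the raw log requires vendor tooling, the sketch below uses SQL Server's built-in Change Data Capture as one possible example; the capture instance name `dbo_customer`, the captured columns, and the connection string are assumptions, and CDC must already be enabled on the database and table.

```python
import pyodbc  # assumes the Microsoft ODBC driver for SQL Server is installed

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=myserver;DATABASE=mydb;"
    "UID=etl_user;PWD=secret;TrustServerCertificate=yes"   # hypothetical connection
)

# Query the change functions that SQL Server CDC generates for the
# capture instance 'dbo_customer', between the oldest and newest available LSNs.
rows = conn.execute("""
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_customer');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();
    SELECT __$operation, customer_id, name, city
    FROM cdc.fn_cdc_get_all_changes_dbo_customer(@from_lsn, @to_lsn, N'all');
""").fetchall()

# __$operation: 1 = delete, 2 = insert, 4 = update (after image with the 'all' filter)
for operation, customer_id, name, city in rows:
    print(operation, customer_id, name, city)
```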

Data Warehouse Layers

Modern data warehouse architectures organize data into distinct layers, each serving specific purposes in the data processing pipeline. These layers follow Ralph Kimball's dimensional modeling principles for structured data warehouse design.

Staging Area

This is the initial layer where data is temporarily stored after extraction from source systems. The staging area serves as a work area for ETL processes, where data is held in its original format before transformation. According to Kimball's methodology:

  • Data is stored temporarily and typically purged after successful processing
  • No business users access the staging area directly
  • Data structures mirror the source systems with minimal modification
  • Primary purpose is to facilitate the ETL process and provide restart capabilities

Data Presentation Area

This layer contains the dimensional models designed for business analysis and reporting. Following Kimball's approach, this area includes:

  • Fact Tables: Contain measurable business events with foreign keys to dimensions
  • Dimension Tables: Provide descriptive context for facts, often denormalized for query performance
  • Conformed Dimensions: Shared dimensions that ensure consistency across multiple fact tables and data marts

The presentation area is organized into data marts - subject-oriented collections of dimensional models that serve specific business processes or departments.
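
A minimal dimensional model for a hypothetical sales business process, sketched as SQLite DDL: one fact table at the grain of an order line, with foreign keys to denormalized dimension tables.

```python
import sqlite3

dw = sqlite3.connect("warehouse.db")   # hypothetical data warehouse
dw.executescript("""
    -- Dimension tables: descriptive, denormalized context for the facts.
    CREATE TABLE IF NOT EXISTS dim_date (
        date_key     INTEGER PRIMARY KEY,   -- e.g. 20240131
        full_date    TEXT,
        month_name   TEXT,
        year         INTEGER
    );

    CREATE TABLE IF NOT EXISTS dim_customer (
        customer_key INTEGER PRIMARY KEY,   -- surrogate key
        customer_id  INTEGER,               -- natural key from the source system
        name         TEXT,
        city         TEXT
    );

    -- Fact table: one row per order line (the declared grain),
    -- with measurable amounts and foreign keys to the dimensions.
    CREATE TABLE IF NOT EXISTS fact_sales (
        date_key     INTEGER REFERENCES dim_date(date_key),
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        quantity     INTEGER,
        amount       REAL
    );
""")
```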

Kimball's methodology emphasizes:

  • Business Process Focus: Dimensional models are designed around specific business processes
  • Conformed Dimensions: Ensure consistency and enable drill-across analysis
  • Grain Definition: Each fact table represents events at a specific level of detail
  • Slowly Changing Dimensions: Handle changes in dimension attributes over time (see the sketch below)
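
For example, a Type 2 slowly changing dimension keeps history by closing the current row and inserting a new version whenever a tracked attribute changes. A minimal sketch, assuming the hypothetical `dim_customer` table above has been extended with `valid_from`, `valid_to`, and `is_current` columns:

```python
import sqlite3
from datetime import date

dw = sqlite3.connect("warehouse.db")   # hypothetical data warehouse

def scd2_update(customer_id, new_city):
    """Apply an SCD Type 2 change: expire the current row, insert a new version."""
    today = date.today().isoformat()
    current = dw.execute(
        "SELECT customer_key, name, city FROM dim_customer "
        "WHERE customer_id = ? AND is_current = 1", (customer_id,)
    ).fetchone()
    if current is None or current[2] == new_city:
        return  # unknown customer or no attribute change: nothing to version
    with dw:
        # Expire the old version by closing its validity interval...
        dw.execute(
            "UPDATE dim_customer SET valid_to = ?, is_current = 0 WHERE customer_key = ?",
            (today, current[0]),
        )
        # ...and insert the new attribute values as the current row.
        dw.execute(
            "INSERT INTO dim_customer (customer_id, name, city, valid_from, valid_to, is_current) "
            "VALUES (?, ?, ?, ?, NULL, 1)",
            (customer_id, current[1], new_city, today),
        )
```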

References

  • Lecture notes from Business Analytics course, Prof. Paolo Menna, University of Verona 2024-2025
  • Kimball, R., & Ross, M. (2013). The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling (3rd ed.). Wiley.