
ETL/ELT Core Concepts: A Complete Guide to Modern Data Engineering

Tags: ETL, ELT, Data Engineering, Data Pipeline, Data Quality

1. Architecture Overview: ETL vs ELT

The fundamental question: How do we move data from Source to Target?

ETL (Extract-Transform-Load)

┌─────────┐    ┌─────────────┐    ┌─────────┐
│ Source  │───▶│  Transform  │───▶│  Target │
└─────────┘    │  (External) │    │ (Clean) │
               └─────────────┘    └─────────┘
  • Flow: Extract → External Transform → Load
  • Result: Clean, production-ready data lands in target
  • Best For: On-premise systems, strict data privacy/security, limited target compute

ELT (Extract-Load-Transform)

┌─────────┐    ┌─────────┐    ┌─────────────┐
│ Source  │───▶│  Target │───▶│  Transform  │
└─────────┘    │  (Raw)  │    │(In-Database)│
               └─────────┘    └─────────────┘
  • Flow: Extract → Raw Load → In-Database Transform (SQL/dbt)
  • Result: Raw data lands first, transformed using target’s compute power
  • Best For: Cloud platforms (Snowflake/BigQuery/Databricks), rapid iteration, modern analytics

Quick Comparison

| Aspect | ETL | ELT |
|---|---|---|
| Transform Location | External (Spark/Python) | In-Database (SQL/dbt) |
| Data in Target | Clean/Processed | Raw + Processed |
| Flexibility | Lower | Higher |
| Compute Cost | External servers | Target database |
| Primary Use Case | Heavy compute, compliance, unstructured data | Cloud-native analytics |

ETL is Not Dead: While ELT dominates cloud data warehouses, ETL remains essential for:

  • PII Masking/Compliance: Anonymize sensitive data before it enters the lake
  • Heavy Transformations: Complex processing (ML preprocessing, image/video) using PySpark
  • Unstructured Data: Parse and structure complex JSON, logs, or binary files externally
  • Hybrid Architectures: Combine ETL for ingestion prep with ELT for analytics

2. E: Extraction (Data Sourcing)

Solving: How do we get the data? How much data do we pull?

By Data Scope

| Strategy | Description | Use Case |
|---|---|---|
| Full Extraction | SELECT * every time, replace entire dataset | Dimension tables, small reference data |
| Incremental Extraction | Filter by updated_at or ID, pull only changes | Large fact tables, event logs |
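Incremental extraction with a high-water mark can be sketched in a few lines of Python. This is a minimal in-memory model (the rows and column names are hypothetical); a real pipeline would query the source and persist the watermark between runs:

```python
from datetime import datetime, timezone

def extract_incremental(rows, last_watermark):
    """Return rows changed since last_watermark, plus the new watermark."""
    changed = [r for r in rows
               if datetime.fromisoformat(r["updated_at"]) > last_watermark]
    new_watermark = max(
        (datetime.fromisoformat(r["updated_at"]) for r in changed),
        default=last_watermark,
    )
    return changed, new_watermark

# Hypothetical source rows; only id=2 changed after the stored watermark.
source = [
    {"id": 1, "updated_at": "2024-06-01T00:00:00+00:00"},
    {"id": 2, "updated_at": "2024-06-20T00:00:00+00:00"},
]
watermark = datetime(2024, 6, 10, tzinfo=timezone.utc)
changed, watermark = extract_incremental(source, watermark)
print([r["id"] for r in changed])  # [2]
```

Persisting the returned watermark is what makes the next run pull only the delta.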

By Technical Mechanism

Pull / Request-Response (Active Fetching)

| Method | Description | Considerations |
|---|---|---|
| Database Querying | Direct SQL via JDBC/ODBC | Connection pooling, query optimization |
| API Calls | REST/GraphQL endpoints | Pagination, Rate Limiting, Authentication |
| Web Scraping | Selenium/BeautifulSoup | Legal concerns, fragile to UI changes |
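The pagination and rate-limiting considerations for API extraction can be sketched together in Python. Here `get_page`, the page shape, and `RateLimitError` are hypothetical stand-ins for a real REST client:

```python
import time

class RateLimitError(Exception):
    """Raised by the client when the API returns HTTP 429."""

def fetch_all_pages(get_page, max_retries=3):
    """Cursor pagination with exponential backoff on rate limits."""
    records, cursor, attempt = [], None, 0
    while True:
        try:
            page = get_page(cursor)
        except RateLimitError:
            if attempt >= max_retries:
                raise
            time.sleep(0.1 * 2 ** attempt)  # back off: 0.1s, 0.2s, 0.4s, ...
            attempt += 1
            continue
        attempt = 0                          # reset after a successful call
        records.extend(page["data"])
        cursor = page["next_cursor"]
        if cursor is None:
            return records

# Fake two-page API that rate-limits exactly one call.
pages = {None: {"data": [1, 2], "next_cursor": "c1"},
         "c1": {"data": [3], "next_cursor": None}}
calls = {"n": 0}
def get_page(cursor):
    calls["n"] += 1
    if calls["n"] == 2:          # simulate a single 429 response
        raise RateLimitError()
    return pages[cursor]

result = fetch_all_pages(get_page)
print(result)  # [1, 2, 3]
```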

Push / Event-Driven (Passive Receiving)

| Method | Description | Tools |
|---|---|---|
| CDC (Log-based) | Read database transaction logs (Binlog/WAL) | Debezium, AWS DMS, Fivetran |
| Event Streaming | Subscribe to message queues | Kafka, RabbitMQ, AWS Kinesis |
| Webhooks | SaaS sends HTTP POST on events | Stripe, GitHub, Shopify |

CDC Deep Dive: Change Data Capture has two main approaches:

  • Query-based CDC: Poll database with timestamp columns. Simple but can miss deletes.
  • Log-based CDC: Read transaction logs directly. Complete but requires DB permissions.
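The difference between the two approaches is easy to demonstrate with in-memory stand-ins for the table and the transaction log (a Python sketch; the row shapes are invented):

```python
# Query-based CDC: poll with a timestamp filter; a hard delete leaves no trace.
def poll_changes(table, last_seen):
    return [r for r in table if r["updated_at"] > last_seen]

before = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 12}]  # previous snapshot
after  = [{"id": 1, "updated_at": 10}]   # id=2 was hard-deleted at t=15

missed = poll_changes(after, last_seen=12)
print(missed)  # [] (the delete is invisible to the poller)

# Log-based CDC: the transaction log records every operation, including deletes.
txn_log = [
    {"op": "INSERT", "id": 1, "ts": 10},
    {"op": "INSERT", "id": 2, "ts": 12},
    {"op": "DELETE", "id": 2, "ts": 15},
]
captured = [e for e in txn_log if e["ts"] > 12]
print(captured)  # [{'op': 'DELETE', 'id': 2, 'ts': 15}]
```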

File-Based (Batch Exchange)

| Method | Description | Formats |
|---|---|---|
| File Parsing | Read files from FTP/S3/ADLS | CSV, JSON, Parquet, Avro |

Manual

  • Excel uploads, copy-paste operations (avoid in production!)

[Figure: ETL Architecture]


3. L: Loading (Write Strategies)

Solving: How do we write data to target? How do we handle historical changes?

By Write Strategy

Full Load (Complete Replacement)

| Strategy | Description | Risk Level |
|---|---|---|
| Truncate & Insert | TRUNCATE TABLE → INSERT | ⚠️ High (data loss window) |
| Drop & Create | DROP TABLE → CREATE → INSERT | ⚠️ High (schema reset) |

Best for: Small dimension tables, complete refreshes, schema migrations

Incremental Load (Partial Updates)

| Strategy | Description | Use Case |
|---|---|---|
| Append Only | INSERT new records only | Event logs, time-series, immutable data |
| Upsert | UPDATE if exists, else INSERT (by Primary Key) | Master data, slowly changing records |
| Merge | SQL MERGE statement combining INSERT/UPDATE/DELETE | Complex sync operations, SCD handling |

-- Upsert Example (PostgreSQL)
INSERT INTO target_table (id, name, updated_at)
VALUES (1, 'John', NOW())
ON CONFLICT (id) 
DO UPDATE SET name = EXCLUDED.name, updated_at = EXCLUDED.updated_at;

-- Merge Example (SQL Server / Snowflake)
MERGE INTO target_table t
USING source_table s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.name = s.name
WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name);

By History Retention (Slowly Changing Dimensions)

| SCD Type | Name | Behavior | History |
|---|---|---|---|
| Type 0 | Retain Original | Never update | ❌ No |
| Type 1 | Overwrite | Replace old value | ❌ No |
| Type 2 | Add New Row | New row with valid_from/valid_to | ✅ Full |
| Type 3 | Add New Column | current_value, previous_value columns | ⚠️ Limited |

-- SCD Type 2 Example
-- Before: id=1, name='John', valid_from='2024-01-01', valid_to='9999-12-31', is_current=true
-- After update:
UPDATE dim_customer SET valid_to = '2024-06-15', is_current = false WHERE id = 1 AND is_current = true;
INSERT INTO dim_customer (id, name, valid_from, valid_to, is_current) VALUES (1, 'Johnny', '2024-06-15', '9999-12-31', true);

Production Note: In high-concurrency or large-scale environments, the two-step approach (UPDATE + INSERT) can cause data inconsistency if not wrapped in a transaction. Modern Lakehouses (Delta Lake, Iceberg) support atomic MERGE for SCD Type 2:

-- SCD Type 2 with MERGE (Delta Lake / Databricks)
MERGE INTO dim_customer t
USING (
    SELECT id, name, current_timestamp() as effective_date
    FROM staging_customer
) s
ON t.id = s.id AND t.is_current = true
WHEN MATCHED AND t.name <> s.name THEN
    UPDATE SET t.valid_to = s.effective_date, t.is_current = false
WHEN NOT MATCHED THEN
    INSERT (id, name, valid_from, valid_to, is_current)
    VALUES (s.id, s.name, s.effective_date, '9999-12-31', true);

-- Then insert the new current record for customers whose row was just closed
INSERT INTO dim_customer
SELECT s.id, s.name, current_timestamp(), '9999-12-31', true
FROM staging_customer s
WHERE NOT EXISTS (
    SELECT 1 FROM dim_customer t WHERE t.id = s.id AND t.is_current = true
);
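To make the close-then-insert mechanics of SCD Type 2 explicit, here is the same logic as a plain-Python in-memory model (illustrative only, not warehouse code; column names follow the SQL examples above):

```python
HIGH_DATE = "9999-12-31"

def apply_scd2(dim, updates, today):
    """Close the current version of each changed row, then append a new one."""
    for upd in updates:
        current = next((r for r in dim
                        if r["id"] == upd["id"] and r["is_current"]), None)
        if current and current["name"] == upd["name"]:
            continue                       # unchanged: nothing to do
        if current:                        # expire the old version
            current["valid_to"] = today
            current["is_current"] = False
        dim.append({"id": upd["id"], "name": upd["name"],
                    "valid_from": today, "valid_to": HIGH_DATE,
                    "is_current": True})

dim = [{"id": 1, "name": "John", "valid_from": "2024-01-01",
        "valid_to": HIGH_DATE, "is_current": True}]
apply_scd2(dim, [{"id": 1, "name": "Johnny"}], today="2024-06-15")
print(len(dim), dim[0]["is_current"], dim[1]["name"])  # 2 False Johnny
```

Every historical state remains queryable: filter by `valid_from`/`valid_to` to see the dimension as of any date.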

By Timing

| Mode | Description | Latency | Throughput |
|---|---|---|---|
| Batch | Scheduled runs (hourly/daily) | High | High |
| Stream | Real-time processing | Low | Variable |
| Micro-batch | Small frequent batches (e.g., every 5 min) | Medium | Medium |

[Figure: ETL Architecture]


4. T: Transformation (Data Processing)

Solving: How do we clean, standardize, and create business value?

By Purpose (Functional)

Cleaning

| Operation | Description | Example |
|---|---|---|
| Remove Duplicates | Deduplication by key | ROW_NUMBER() OVER (PARTITION BY id) |
| Handle Missing Data | Impute or remove nulls | COALESCE(value, default) |
| Outlier Detection | Filter statistical anomalies | Z-score > 3 standard deviations |
| Trim Whitespace | Remove unwanted spaces | TRIM(column) |
| Filter Invalid | Remove records failing validation | WHERE email LIKE '%@%.%' |
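Several of these cleaning steps combine naturally into a single pass. A Python sketch over hypothetical rows (field names and rules are illustrative):

```python
def clean(rows):
    """One pass: trim, filter invalid, impute nulls, dedupe by key."""
    seen, out = set(), []
    for r in rows:
        email = (r.get("email") or "").strip()          # trim whitespace
        if "@" not in email or "." not in email:        # filter invalid
            continue
        amount = r["amount"] if r.get("amount") is not None else 0  # impute nulls
        if r["id"] in seen:                             # remove duplicates by key
            continue
        seen.add(r["id"])
        out.append({"id": r["id"], "email": email, "amount": amount})
    return out

raw = [
    {"id": 1, "email": " a@x.com ", "amount": None},
    {"id": 1, "email": "a@x.com", "amount": 5},      # duplicate key
    {"id": 2, "email": "not-an-email", "amount": 3}, # fails validation
]
cleaned = clean(raw)
print(cleaned)  # [{'id': 1, 'email': 'a@x.com', 'amount': 0}]
```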

Standardization

| Operation | Description | Example |
|---|---|---|
| Type Casting | Convert data types | CAST(date_string AS DATE) |
| Normalization | Unify units/codes | Convert all currencies to USD |
| Format Standardization | Consistent formatting | Phone: +1-555-123-4567 |

Enrichment

| Operation | Description | Example |
|---|---|---|
| Integration | Join external data | Join customer with demographics |
| Geocoding | Address to coordinates | Google Maps API |
| Lookup | Reference data enrichment | Country code → Country name |

Logic & Calculation

| Operation | Description | Example |
|---|---|---|
| Derived Columns | Calculate new fields | age = DATEDIFF(YEAR, birthdate, GETDATE()) |
| Business Rules | Apply business logic | customer_tier = CASE WHEN revenue > 1M THEN 'Enterprise'... |
| Aggregations | Summarize data | SUM(sales), COUNT(DISTINCT users) |

Anonymization

| Operation | Description | Example |
|---|---|---|
| Masking | Hide sensitive data | john.doe@email.com → j***@e***.com |
| Hashing | One-way encryption | SHA256(ssn) |
| Tokenization | Replace with tokens | SSN → Random UUID mapping |
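Masking and hashing can be sketched in a few lines of Python with the standard library. The exact mask format and the salt value are illustrative choices, not a prescribed scheme:

```python
import hashlib

def mask_email(email):
    """Masking: keep only the first character of the local part and domain."""
    local, domain = email.split("@")
    name, tld = domain.rsplit(".", 1)
    return f"{local[0]}***@{name[0]}***.{tld}"

def hash_value(value, salt="demo-salt"):
    """Hashing: one-way SHA-256; equal inputs always produce equal digests."""
    return hashlib.sha256((salt + value).encode()).hexdigest()

masked = mask_email("john.doe@email.com")
print(masked)  # j***@e***.com

# Hashing is deterministic, so hashed keys can still be joined across tables.
assert hash_value("123-45-6789") == hash_value("123-45-6789")
assert hash_value("123-45-6789") != hash_value("987-65-4321")
```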

By Location (Architectural)

| Location | When | Tools |
|---|---|---|
| Pre-Load (ETL) | Before entering database | Python, Spark, AWS Glue |
| In-Database (ELT) | After raw data lands | SQL, dbt, Stored Procedures |

[Figure: ETL Architecture]


5. Data Quality & Observability

Modern data engineering requires continuous monitoring and validation.

Data Quality Dimensions

| Dimension | Description | Example Check |
|---|---|---|
| Completeness | No unexpected nulls | NULL rate < 1% |
| Uniqueness | No duplicates | COUNT(*) = COUNT(DISTINCT pk) |
| Accuracy | Values are correct | price > 0 |
| Consistency | Cross-system alignment | SUM(orders) = SUM(order_items) |
| Timeliness | Data is fresh | max(updated_at) within 1 hour |
| Validity | Conforms to rules | email matches regex pattern |
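Several of these dimensions translate directly into assertions over a batch. A Python sketch (column names and thresholds are hypothetical):

```python
def run_quality_checks(rows, pk="id"):
    """Evaluate completeness, uniqueness, and accuracy on one batch of rows."""
    return {
        # Completeness: no nulls in the primary key
        "completeness": all(r.get(pk) is not None for r in rows),
        # Uniqueness: COUNT(*) equals COUNT(DISTINCT pk)
        "uniqueness": len(rows) == len({r[pk] for r in rows}),
        # Accuracy: domain rule, e.g. price must be positive
        "accuracy": all(r["price"] > 0 for r in rows),
    }

batch = [{"id": 1, "price": 9.5}, {"id": 2, "price": 9.5}, {"id": 2, "price": -1}]
report = run_quality_checks(batch)
print(report)  # {'completeness': True, 'uniqueness': False, 'accuracy': False}
```

Tools like Great Expectations and dbt tests formalize exactly this pattern: declarative checks evaluated against every load.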

Tools Ecosystem

┌─────────────────────────────────────────────────────────────────┐
│                     Data Quality & Observability                 │
├─────────────────────────────────────────────────────────────────┤
│  Testing          │  Great Expectations, dbt tests, Soda        │
│  Monitoring       │  Monte Carlo, Bigeye, Datafold              │
│  Lineage          │  OpenLineage, Atlan, DataHub                │
│  Alerting         │  PagerDuty, Slack integrations              │
└─────────────────────────────────────────────────────────────────┘

dbt Test Example

# schema.yml
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0

6. Tool Ecosystem

ETL/ELT Tools by Category

┌─────────────────────────────────────────────────────────────────┐
│                        Data Pipeline Ecosystem                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │   Extract    │    │  Transform   │    │     Load     │       │
│  ├──────────────┤    ├──────────────┤    ├──────────────┤       │
│  │ Fivetran     │    │ dbt          │    │ Snowflake    │       │
│  │ Airbyte      │    │ Spark        │    │ BigQuery     │       │
│  │ Stitch       │    │ Python       │    │ Redshift     │       │
│  │ Debezium     │    │ Dataflow     │    │ Databricks   │       │
│  └──────────────┘    └──────────────┘    └──────────────┘       │
│                                                                  │
│  ┌──────────────────────────────────────────────────────┐       │
│  │                   Orchestration                       │       │
│  ├──────────────────────────────────────────────────────┤       │
│  │  Airflow  │  Dagster  │  Prefect  │  Mage  │  Kestra │       │
│  └──────────────────────────────────────────────────────┘       │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Modern Data Stack (Typical Architecture)

Source Systems          Ingestion         Warehouse        Transform        BI
    │                      │                  │               │             │
    ▼                      ▼                  ▼               ▼             ▼
┌────────┐            ┌─────────┐        ┌─────────┐     ┌───────┐    ┌─────────┐
│Postgres│──┐         │         │        │         │     │       │    │         │
│ MySQL  │──┼────────▶│ Fivetran│───────▶│Snowflake│────▶│  dbt  │───▶│ Tableau │
│  APIs  │──┤         │ Airbyte │        │ BigQuery│     │       │    │Power BI │
│  Files │──┘         │         │        │         │     │       │    │         │
└────────┘            └─────────┘        └─────────┘     └───────┘    └─────────┘

Reverse ETL (Data Activation)

The Modern Data Stack isn’t complete without Reverse ETL – pushing transformed data back to operational systems.

              Traditional ETL/ELT                    Reverse ETL
                     │                                    │
                     ▼                                    ▼
┌────────────────────────────────────────┐   ┌──────────────────────────┐
│  Sources → Warehouse → BI Dashboards   │   │  Warehouse → SaaS Tools  │
└────────────────────────────────────────┘   └──────────────────────────┘

| Aspect | Description |
|---|---|
| Purpose | Activate warehouse data in operational tools |
| Direction | Warehouse → SaaS (Salesforce, HubSpot, Facebook Ads) |
| Use Cases | Enrich CRM with lead scores, sync segments to ad platforms, update customer success tools |
| Tools | Hightouch, Census, Polytomic, RudderStack |
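At its core, a Reverse ETL sync is a loop from warehouse rows to a SaaS write API. A minimal Python sketch, where `crm_update` is a hypothetical stand-in for a real connector call (e.g. a Salesforce field update):

```python
def reverse_etl_sync(warehouse_rows, crm_update):
    """Push modeled warehouse fields back into an operational tool."""
    for row in warehouse_rows:
        crm_update(account_id=row["account_id"], fields={"LTV": row["ltv"]})

# Stand-in for a real connector's write call, recording what was sent.
updated = {}
reverse_etl_sync(
    [{"account_id": "A1", "ltv": 1200}],
    crm_update=lambda account_id, fields: updated.update({account_id: fields}),
)
print(updated)  # {'A1': {'LTV': 1200}}
```

Real tools add the hard parts on top of this loop: field mapping, change detection (sync only diffs), retries, and API rate-limit handling.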

Example Flow:

Snowflake (Gold Layer)    →    Hightouch    →    Salesforce
    │                              │                  │
    ▼                              ▼                  ▼
Customer Lifetime Value    Sync daily at 6am    Update Account.LTV field
High-churn-risk flag       Map to SF fields     Trigger CS workflow

Why Reverse ETL Matters: Data teams spend effort building clean datasets. Reverse ETL closes the loop by making that data actionable in the tools sales, marketing, and support teams actually use.


7. Architecture Layers (Medallion Architecture)

Solving: How do we organize data pipelines?

🥉 Bronze Layer (Raw / Landing)

  • Content: Raw data exactly as received (JSON, CSV, API responses)
  • Purpose: Full traceability, debugging, replay capability
  • Transformations: None (or minimal: add ingestion timestamp)

🥈 Silver Layer (Clean / Staging)

  • Content: Cleaned, deduplicated, typed, standardized tables
  • Purpose: Single Source of Truth for analysts
  • Transformations: Cleaning, standardization, deduplication

🥇 Gold Layer (Serve / Mart)

  • Content: Aggregated, business-logic-enriched reporting tables (Star Schema)
  • Purpose: High-performance, BI-ready (Power BI, Tableau, Looker)
  • Transformations: Aggregations, business rules, dimensional modeling
┌─────────────────────────────────────────────────────────────────┐
│                     Medallion Architecture                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Source    ──▶    🥉 Bronze    ──▶    🥈 Silver    ──▶    🥇 Gold     │
│   Systems          (Raw)              (Clean)            (Mart)  │
│                                                                  │
│   • APIs          • JSON blobs       • Typed tables    • Star   │
│   • DBs           • CSV dumps        • Deduplicated      Schema │
│   • Files         • Full history     • Standardized    • KPIs   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
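The three layers can be miniaturized in a few lines of Python (the payloads are invented; each list stands in for a table in the respective layer):

```python
import json

# Bronze: raw payloads exactly as received, tagged with an ingestion timestamp.
bronze = [{"raw": '{"user": " Ann ", "amount": "30"}', "ingested_at": "2024-06-15"},
          {"raw": '{"user": " Ann ", "amount": "12"}', "ingested_at": "2024-06-15"}]

# Silver: parsed, trimmed, typed.
silver = [{"user": json.loads(r["raw"])["user"].strip(),
           "amount": int(json.loads(r["raw"])["amount"])} for r in bronze]

# Gold: aggregated, business-ready (revenue per user).
gold = {}
for row in silver:
    gold[row["user"]] = gold.get(row["user"], 0) + row["amount"]

print(gold)  # {'Ann': 42}
```

Because Bronze keeps the untouched payloads, the Silver and Gold layers can always be rebuilt from scratch if a transformation bug is found.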

8. Summary: The Kitchen Analogy 🍳

| Concept | Kitchen Equivalent |
|---|---|
| E (Extract) | Sourcing – Getting ingredients (whether you pick them up or suppliers deliver) |
| L (Load) | Storage/Plating – Putting ingredients in the fridge, or serving finished dishes |
| T (Transform) | Prep & Cooking – Washing, chopping, cooking |
| Architecture | Kitchen Layout – Receiving → Prep → Service areas (zone management) |
| Data Quality | Quality Control – Checking freshness, rejecting bad ingredients |
| Orchestration | Order Management – Coordinating dish preparation timing |

9. Complete Structure Overview 📋

A quick-reference outline of everything covered in this series.

  • 1. Core Architecture Paradigms – How data flows and when it’s processed

    • ETL (Extract-Transform-Load)
      • Flow: Extract → External Transform (Python/Spark) → Load
      • Characteristic: Compute-heavy, clean data lands in target
      • Use Case: On-premise, GDPR/HIPAA, complex unstructured data
    • ELT (Extract-Load-Transform)
      • Flow: Extract → Raw Load → In-Database Transform (SQL/dbt)
      • Characteristic: Storage-heavy, raw data first, leverage cloud compute (Snowflake/BigQuery)
      • Use Case: Cloud-native, fast iteration, preserve raw history
    • Reverse ETL
      • Flow: Warehouse → SaaS Tools (Salesforce/Slack)
      • Purpose: Operational Analytics — let business users see insights in their existing tools
  • 2. E: Extraction – How to acquire data from sources

    • 2.1 By Technical Mechanism
      • Pull / Request-Response
        • Database Querying: JDBC/ODBC SQL queries (SELECT *)
        • API Calls: REST/GraphQL
          • Challenges: Pagination (Offset/Cursor), Rate Limiting (Exponential Backoff)
        • Web Scraping: For sites without APIs
          • Tools: Selenium (dynamic rendering), BeautifulSoup (static parsing)
      • Push / Event-Driven
        • CDC (Change Data Capture): Log-based
          • Principle: Listen to DB Transaction Log (Binlog/WAL)
          • Advantages: Captures DELETEs, near real-time, low source DB load
        • Event Streaming: Message queues (Kafka/RabbitMQ)
        • Webhooks: SaaS systems push HTTP POST payload
      • File-Based
        • File Parsing: FTP/S3 files (CSV, JSON, Parquet, XML)
      • Manual
        • Manual Data Extraction: Excel input, copy-paste (Human-in-the-loop)
    • 2.2 By Data Scope
      • Full Extraction: Complete dataset each run
      • Incremental Extraction: Based on updated_at (Watermark) or Auto-increment ID
        • Risk: Cannot detect Hard Deletes at source
  • 3. L: Loading – How data enters the target and maintains state

    • 3.1 Write Strategy
      • Full Load
        • Truncate & Insert: Clear table → Write (fast, but table is empty during execution)
        • Drop & Create: Drop schema → Rebuild → Write (for frequent schema changes)
      • Incremental Load
        • Append Only: Add to end (logs, transactions — immutable data)
        • Upsert (Merge): Update + Insert
          • Logic: Match on Primary Key
          • If exists → Update, if not → Insert
          • Application: Sync member data, inventory status
    • 3.2 SCD (Slowly Changing Dimensions) – How the warehouse handles source changes (e.g., a user moves address)
      • Type 0 (No Historization): Ignore changes, keep original value forever
      • Type 1 (Overwrite): Direct overwrite, latest state only, no history
      • Type 2 (Row Versioning): Add new row (most standard approach)
        • Uses: Effective_Date, End_Date, Is_Current columns
        • Advantage: Can query state at any point in time
      • Type 3 (Previous Column): Add column
        • Keeps Current_Value and Previous_Value
    • 3.3 Processing Timing
      • Batch Processing: Accumulated processing (daily/hourly), high Throughput
      • Stream Processing: Event-driven (ms/s latency), low Latency
  • 4. T: Transformation – Data cleaning, standardization, enrichment

    • 4.1 Data Cleaning (Data Hygiene)
      • Remove Duplicates: Delete completely duplicate rows
      • Filtering: Exclude unwanted rows (test users, soft-deleted records)
      • Missing Data Handling:
        • Drop: Discard rows/columns
        • Imputation: Fill with 0, “Unknown”, Mean, or Forward Fill
      • Outlier Detection: Identify and isolate anomalies (e.g., age > 150)
      • String Manipulation:
        • Unwanted Spaces: Trim/Strip
        • Invalid Values: Fix format errors (e.g., Email missing @)
    • 4.2 Standardization
      • Type Casting: String → Integer/Date
      • Normalization:
        • Unit conversion (cm → m, TWD → USD)
        • Value scaling (0-1 normalization)
      • Formatting: Date format unification (ISO 8601)
    • 4.3 Integration & Enrichment
      • Data Integration: Join tables from different sources
      • Lookup/Enrichment: Link external reference tables
        • Examples: IP → Country, Coordinates → Administrative Region
    • 4.4 Logic & Aggregation
      • Derived Columns: Calculate new fields (Unit Price × Quantity = Total)
      • Business Rules: Apply business logic (Spend > 1000 → Mark as VIP)
      • Aggregations:
        • Roll-up: Day → Month
        • Summarize: SUM, COUNT, AVG, MAX, MIN

10. What is dbt? (The “T” in ELT)

A common question: Is dbt a database?

No. dbt (data build tool) is not a database. It doesn’t store any data.

The Kitchen Analogy

| Component | Kitchen Equivalent |
|---|---|
| Database / Data Warehouse (BigQuery, Snowflake) | The fridge & stove — stores ingredients, does the cooking |
| dbt | The recipe book & head chef — tells the stove how to cook, what order to prepare dishes |

Key Concept: dbt connects to your database, sends SQL instructions, and the results stay in your database. This is called push-down computing.

dbt’s Position in the Stack

Extract & Load (EL)    →    Transform (T)    →    Analyze
   Fivetran/Airbyte         dbt (SQL)             Tableau/PowerBI
        ↓                      ↓                       ↓
   Raw data enters       dbt transforms          BI reads
   (Bronze)              Bronze → Silver          Gold layer
                         Silver → Gold

Why Everyone Uses dbt

Before dbt, data engineers wrote:

  • CREATE TABLE ... AS SELECT ... statements
  • Stored procedures
  • Pain points: No version control, no dependency management, no testing

dbt brings software engineering best practices to SQL:

| Feature | Description |
|---|---|
| Modularity | Write only SELECT; dbt wraps it in CREATE TABLE/VIEW |
| Dependency Graph | Define dependencies; dbt runs in correct order automatically |
| Version Control | All code in Git — who changed what, when |
| Testing | Write tests like “This column should not be NULL” |
| Documentation | Auto-generates docs website from your models |
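The modularity and dependency-graph ideas can be sketched together in Python: each model is just a SELECT plus its upstream dependencies, and the runner wraps it in DDL and executes in topological order. The model definitions here are hypothetical, and real dbt does far more (refs, materializations, Jinja, state):

```python
from graphlib import TopologicalSorter

# Hypothetical models: name -> (SELECT body, upstream dependencies).
models = {
    "stg_orders":  ("SELECT * FROM raw.orders", []),
    "stg_users":   ("SELECT * FROM raw.users", []),
    "fct_revenue": ("SELECT user_id, SUM(amount) FROM stg_orders GROUP BY 1",
                    ["stg_orders", "stg_users"]),
}

def compile_run(models):
    """Wrap each SELECT in DDL and emit statements in dependency order."""
    graph = {name: deps for name, (_, deps) in models.items()}
    order = TopologicalSorter(graph).static_order()
    return [f"CREATE OR REPLACE VIEW {name} AS {models[name][0]}"
            for name in order]

statements = compile_run(models)
for stmt in statements:
    print(stmt)
# The staging views are emitted first; fct_revenue is always last.
```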

One-Liner Summary

dbt is not where your data lives. dbt is the tool that transforms your data inside your warehouse.