An ETL mental model treats data pipelines as staged, governed movement of data from sources to curated, consumable products. By combining clear layer boundaries, data contracts, and quality gates across accuracy, completeness, consistency, timeliness, validity, and uniqueness, teams can build pipelines that are repeatable, observable, and fit for defined business use cases.
Context: why an “ETL mental model” matters
Teams often treat data pipelines as a set of disconnected scripts: one job pulls files, another cleans them, a third updates dashboards. That approach tends to fail at scale because it hides key questions: What is the system of record? Where is quality enforced? Which transformations are reproducible? Which datasets are safe for self-service consumption?
A useful ETL mental model is to view the pipeline as a managed product flow: data moves through explicit stages with clear contracts, controls, and accountability. This aligns with DAMA-DMBOK’s view of data management as a set of disciplines (data integration/interoperability, data quality, metadata management, governance) that must be designed intentionally rather than emerging accidentally.
Core definition: ETL (and ELT) in one sentence
ETL is the pattern of extracting data from source systems, transforming it to meet analytical and operational requirements, and loading it into a target platform; ELT shifts most transformations to the target platform while keeping the same conceptual stages.
The “mental model” is less about whether you transform before or after load, and more about designing a pipeline that is:
Repeatable (idempotent and versioned)
Observable (measurable health and data quality)
Governed (clear ownership and access)
Fit for purpose (supports defined use cases)
The ETL pipeline as staged movement of data
A practical way to reason about ETL is to structure the platform into zones/layers and treat each handoff as a contract.
Stage 1: Extract (source → landing)
Goal: capture data reliably from operational systems (ERP/CRM/apps), files, APIs, event streams, or partner feeds.
Key design ideas:
Define the source of truth: each attribute should have an authoritative origin (important for master/reference data).
Prefer incremental extraction where possible (CDC, high-water marks, log-based capture) to reduce load and improve timeliness; a sketch follows this list.
Preserve raw facts: store the payload needed for replay (raw files, raw JSON, or equivalent) so failures can be reprocessed.
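To make incremental extraction concrete, here is a minimal sketch of a high-water-mark pull. It assumes a hypothetical `source_client.fetch_rows_since` interface, a `land_batch` callable that persists the raw payload, and a local JSON file as the watermark store; production pipelines would more often use CDC tooling or keep the watermark in the target platform.

```python
import json
from pathlib import Path

STATE_FILE = Path("extract_state.json")  # hypothetical watermark store

def load_watermark(default: str = "1970-01-01T00:00:00+00:00") -> str:
    """Return the last successfully extracted updated_at value."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["watermark"]
    return default

def save_watermark(value: str) -> None:
    STATE_FILE.write_text(json.dumps({"watermark": value}))

def extract_incremental(source_client, land_batch) -> list[dict]:
    """Pull only rows changed since the stored high-water mark.

    `source_client.fetch_rows_since(ts)` and `land_batch(rows)` are assumed
    interfaces: the first returns dicts carrying an ISO-8601 `updated_at`
    field, the second persists the raw payload so it can be replayed.
    """
    watermark = load_watermark()
    rows = source_client.fetch_rows_since(watermark)
    if rows:
        land_batch(rows)  # land first, so a failed run can be retried safely
        save_watermark(max(r["updated_at"] for r in rows))
    return rows
```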
A common pitfall: pulling “whatever is easiest” from the source rather than what is contractually stable.
Stage 2: Land and standardize (landing → raw/bronze)
Goal: make extracted data queryable and governed without changing its meaning.
Typical actions:
Standardize encodings, timestamps, and basic types.
Attach operational metadata (ingestion time, source system, batch id, file name, API version); a sketch of this step follows below.
Register the dataset in a catalog (metadata management) and track lineage.
This stage is where teams often decide “ELT vs ETL” in practice: in ELT, the landing zone may already be in the warehouse/lakehouse, but the conceptual purpose remains the same.
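As a sketch of the metadata-attachment step, the function below wraps each raw record with ingestion time, source system, and batch id before writing newline-delimited JSON to a hypothetical local landing directory; in an ELT setup the same columns would typically be added as the data is loaded into the warehouse/lakehouse instead.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

LANDING_DIR = Path("landing")  # hypothetical raw/bronze location

def land_batch(rows: list[dict], source_system: str, source_name: str) -> Path:
    """Write a raw batch with operational metadata, without altering payloads."""
    batch_id = str(uuid.uuid4())
    ingested_at = datetime.now(timezone.utc).isoformat()
    out_path = LANDING_DIR / source_system / f"{source_name}_{batch_id}.jsonl"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        for row in rows:
            record = {
                "_ingested_at": ingested_at,     # operational metadata
                "_source_system": source_system,
                "_batch_id": batch_id,
                "payload": row,                  # raw facts preserved as-is
            }
            f.write(json.dumps(record) + "\n")
    return out_path
```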
Stage 3: Transform (raw → curated)
Goal: create datasets that are consistent, reusable, and aligned to business semantics.
Common transformation categories:
Dimensional modeling (Kimball): facts and dimensions optimized for analytics and BI.
EDW integration concepts (Inmon): integrated enterprise data in normalized form feeding marts.
Data Vault 2.0: hubs/links/satellites to separate business keys, relationships, and descriptive history; marts built on top.
A robust mental model separates transformations into layers (a sketch of this separation follows the list):
Cleansing transformations (type fixes, standardization, deduplication; meaning is not changed)
Business/integration transformations (shared business rules and conformed entities applied once, not per report)
Presentation transformations (marts/semantic models for specific audiences)
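The sketch below illustrates that separation with plain Python over an assumed orders payload (field names such as order_id and order_total are illustrative): cleansing fixes types without changing meaning, business rules are applied once, and the presentation step aggregates to the grain a specific audience needs.

```python
from datetime import datetime

def cleanse(raw_rows: list[dict]) -> list[dict]:
    """Cleansing layer: fix types and standardize values, no business logic."""
    cleaned = []
    for r in raw_rows:
        cleaned.append({
            "order_id": str(r["order_id"]).strip(),
            "customer_id": str(r["customer_id"]).strip(),
            "order_total": float(r["order_total"]),
            "order_date": datetime.fromisoformat(r["order_date"]).date(),
        })
    return cleaned

def apply_business_rules(rows: list[dict]) -> list[dict]:
    """Business/integration layer: encode shared rules once (e.g. order tiers)."""
    for r in rows:
        r["order_tier"] = "large" if r["order_total"] >= 1000 else "standard"
    return rows

def build_daily_sales_mart(rows: list[dict]) -> dict:
    """Presentation layer: aggregate to the grain a specific audience needs."""
    mart: dict = {}
    for r in rows:
        key = (r["order_date"], r["order_tier"])
        mart[key] = mart.get(key, 0.0) + r["order_total"]
    return mart
```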
Stage 4: Load and serve (curated → consumption)
Goal: deliver data in forms that enable trustworthy decisions and operational use.
Consumption surfaces can include:
BI models and a semantic layer (consistent metric definitions across tools; see the sketch at the end of this stage)
Feature stores / ML training sets
Reverse ETL / operational analytics feeds that push data back into business systems
Data APIs and data sharing
This is also where access controls, privacy controls, and usage monitoring become central.
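One way to picture a semantic layer is as a single metric registry that every consumer calls. The sketch below is a minimal, assumed example: the revenue metric and the row shape are illustrative and do not reflect any specific tool's API.

```python
# A minimal sketch of a shared metric registry: every consumer (BI extract,
# notebook, reverse-ETL job) computes "revenue" from the same definition.
METRICS = {
    "revenue": {
        "expression": lambda row: row["order_total"],
        "filter": lambda row: row["status"] == "completed",
        "description": "Sum of completed order totals, in account currency.",
    },
}

def compute_metric(name: str, rows: list[dict]) -> float:
    """Compute a metric from its single, governed definition."""
    metric = METRICS[name]
    return sum(metric["expression"](r) for r in rows if metric["filter"](r))

# Usage: a dashboard extract and an operational feed call the same code.
orders = [
    {"order_total": 120.0, "status": "completed"},
    {"order_total": 80.0, "status": "cancelled"},
]
assert compute_metric("revenue", orders) == 120.0
```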
Data quality is not a one-time step; it is a set of control points across stages
A common failure mode is treating “data quality” as a post-hoc dashboard. A better ETL mental model uses quality gates at each stage, with explicit accountability.
Core data quality dimensions to operationalize
Across governance and data quality practices (including DAMA-DMBOK-aligned programs), the most commonly operationalized dimensions are accuracy, completeness, consistency, timeliness, validity, and uniqueness.
Typical quality gates by stage:
At extract: source availability checks, schema-drift detection, record counts compared to the source.
At land: schema validation, freshness checks, volume anomaly detection.
At transform: referential integrity checks, conformance checks (code sets), deduplication logic, reconciliation to source totals.
At serve: metric validation against semantic definitions, access/PII policy enforcement, consumer-facing SLAs.
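A minimal sketch of such gates, assuming illustrative field names (order_id, customer_id, status): each check maps to a quality dimension, and the gate fails the stage loudly rather than loading suspect data.

```python
def check_uniqueness(rows: list[dict], key: str) -> bool:
    """Uniqueness: no duplicate business keys."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))

def check_completeness(rows: list[dict], required: list[str]) -> bool:
    """Completeness: required fields are populated on every record."""
    return all(r.get(col) not in (None, "") for r in rows for col in required)

def check_validity(rows: list[dict], field: str, allowed: set) -> bool:
    """Validity: values conform to an agreed code set."""
    return all(r[field] in allowed for r in rows)

def quality_gate(rows: list[dict]) -> None:
    """Fail the pipeline stage instead of loading suspect data downstream."""
    if not check_uniqueness(rows, "order_id"):
        raise ValueError("quality gate failed: duplicate order_id values")
    if not check_completeness(rows, ["order_id", "customer_id"]):
        raise ValueError("quality gate failed: missing required fields")
    if not check_validity(rows, "status", {"completed", "cancelled", "pending"}):
        raise ValueError("quality gate failed: unknown status code")
```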
Contracts and metadata: the glue that makes ETL manageable
ETL pipelines scale when each dataset has a clear, documented interface.
Recommended contract elements:
Schema and semantics (field meaning, allowed values, units)
Grain (what a record represents; essential for avoiding double counting)
Freshness and latency expectations (SLA/SLO)
Quality rules (what is validated and what is tolerated)
Ownership (data steward/owner, on-call, escalation path)
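As a sketch, the contract elements above can be captured in a small, versionable structure. The dataclass and the orders_contract example below are illustrative assumptions, not a standard contract schema.

```python
from dataclasses import dataclass, field

@dataclass
class DataContract:
    """A dataset's documented interface; field names are illustrative."""
    dataset: str
    owner: str                       # accountable steward / on-call team
    grain: str                       # what one record represents
    schema: dict[str, str]           # column -> type and meaning
    freshness_slo_minutes: int       # maximum acceptable staleness
    quality_rules: list[str] = field(default_factory=list)

orders_contract = DataContract(
    dataset="curated.orders",
    owner="sales-analytics@company.example",
    grain="one row per order line",
    schema={
        "order_id": "string, unique business key",
        "order_total": "decimal, account currency",
    },
    freshness_slo_minutes=60,
    quality_rules=["order_id is unique", "order_total >= 0"],
)
```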
Metadata practices that support the mental model:
Dataset cataloging and business glossary alignment
Lineage from source → transformations → downstream products
Versioning of transformation code and backfill/replay procedures
Operational best practices (analytics engineering and ADLC alignment)
To make the ETL mental model real, pipeline work should follow a lifecycle similar to software delivery.
Practices that consistently reduce incidents:
Development lifecycle: dev/test/prod environments, code review, CI checks.
Testing strategy:
Unit-style tests for transformation logic
Data tests for constraints (uniqueness, accepted values, relationships)
Reconciliation tests against source-of-truth totals
Idempotent loads: reruns produce the same result (or a controlled correction), enabling safe recovery; a minimal sketch follows this list.
Incremental processing patterns: handle late-arriving data, backfills, and reruns deliberately.
Observability: monitor pipeline health (job success, duration) and data health (freshness, volume, distribution drift).
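To illustrate the idempotent-load practice referenced above, the sketch below overwrites a whole partition in a single transaction, using SQLite as a stand-in for the warehouse and illustrative table/column names; rerunning the same batch leaves the table unchanged.

```python
import sqlite3

def idempotent_load(conn: sqlite3.Connection, rows: list[dict], load_date: str) -> None:
    """Replace a whole partition so reruns produce the same result.

    Deleting the target partition before inserting makes the load safe to
    rerun after a failure or during a backfill.
    """
    with conn:  # one transaction: the partition swaps fully or not at all
        conn.execute("DELETE FROM daily_sales WHERE load_date = ?", (load_date,))
        conn.executemany(
            "INSERT INTO daily_sales (load_date, order_id, order_total) VALUES (?, ?, ?)",
            [(load_date, r["order_id"], r["order_total"]) for r in rows],
        )

# Usage: running the same batch twice leaves exactly one copy of the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (load_date TEXT, order_id TEXT, order_total REAL)")
batch = [{"order_id": "A-1", "order_total": 120.0}]
idempotent_load(conn, batch, "2024-01-01")
idempotent_load(conn, batch, "2024-01-01")  # rerun
assert conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0] == 1
```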
Common pitfalls and how the mental model prevents them
Mixing responsibilities (raw cleanup + business logic + BI formatting in one step)
Fix: separate stages; each stage has a stable purpose and contract.
Undefined grains and ambiguous metrics
Fix: document grain and align metrics through a semantic layer and glossary.
“It worked yesterday” pipelines (no replay, no lineage, no alerting)
Fix: store replayable raw inputs, implement lineage, add observability.
Context-free “quality scores”
Fix: tie quality thresholds to specific use cases (regulatory reporting vs near-real-time experimentation) and enforce them at the appropriate stage.
Summary: key takeaways
The ETL mental model treats pipelines as staged, governed movement of data, not ad-hoc scripts.
Separate extract, landing, transform, and serve layers so contracts, accountability, and recovery are clear.
Operationalize data quality as quality gates across stages using dimensions such as accuracy, completeness, consistency, timeliness, validity, and uniqueness.
Use metadata (catalog, lineage, glossary), modeling approaches (Kimball/Inmon/Data Vault), and disciplined delivery practices (testing, idempotency, observability) to make ETL reliable and scalable.