Data modeling defines how data is structured, related, and constrained so it can be stored, integrated, and used reliably. This article introduces core modeling concepts, the conceptual/logical/physical levels, and common approaches such as normalized modeling, dimensional modeling, and Data Vault, with practical guidance for building governable, analytics-ready datasets.
Introduction: why data modeling matters
Data modeling is the discipline of defining how an organization’s data is structured, related, and constrained so it can be stored, integrated, and used reliably. In DAMA-DMBOK terms, data modeling is a core activity within Data Architecture and Data Modeling & Design, and it directly supports downstream capabilities such as data integration, governance, metadata management, analytics, and data quality.
A “good” data model reduces ambiguity (shared definitions), improves interoperability (consistent keys and relationships), and enables scalable analytics (clear facts, dimensions, and grain). Poor modeling typically shows up later as reconciliation issues, duplicated metrics, brittle pipelines, and low trust in reporting.
Core concepts and definitions
A practical foundation starts with shared terminology.
Entity: a thing the business cares about (Customer, Order, Product).
Attribute: a property of an entity (Customer Email, Order Date).
Relationship: how entities connect (Customer places Order).
Key: an attribute (or set of attributes) that uniquely identifies a record.
Grain: what one row represents (one row per order line, per daily product snapshot, etc.).
Business definition: the agreed meaning of a concept (e.g., “Active Customer”).
Modeling is not only about database tables. It also includes the semantic meaning of data (definitions, allowable values, and rules) so that analytics and operational use cases interpret data consistently.
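To make these terms concrete, here is a minimal sketch in Python; the Customer and OrderLine names and fields are invented for illustration, not taken from any particular system:

```python
from dataclasses import dataclass
from datetime import date

# Entities: things the business cares about. Attributes are their properties.
@dataclass(frozen=True)
class Customer:
    customer_id: str          # key: uniquely identifies one customer
    email: str                # attribute

@dataclass(frozen=True)
class OrderLine:
    order_id: str             # part of the key
    line_number: int          # part of the key: grain is one row per order line
    customer_id: str          # relationship: Customer places Order
    order_date: date          # attribute
    quantity: int             # attribute
    net_amount: float         # attribute

# Grain statement made explicit: one OrderLine row represents exactly one
# product line on one order, identified by (order_id, line_number).
line = OrderLine("SO-1001", 1, "C-42", date(2024, 3, 1), 2, 59.90)
print(line)
```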
Levels of modeling: conceptual, logical, physical
Most established modeling practices separate work into levels, which helps stakeholders review the model at the right depth.
Conceptual data model: business-facing overview of major entities and relationships. It is used to align stakeholders and scope domains without implementation details.
Logical data model: detailed representation of entities, attributes, keys, and relationships independent of any specific technology. It introduces normalization decisions and business rules more explicitly.
Physical data model: implementation-specific design (tables, columns, data types, indexes, partitions, constraints). It reflects the chosen platform (e.g., Postgres, Snowflake, BigQuery) and performance considerations.
Keeping these levels distinct improves reviewability and governance: business stakeholders can validate meaning at the conceptual/logical levels, while data engineers and DBAs optimize the physical layer.
Data modeling approaches (and when to use them)
Modern organizations typically combine multiple modeling styles across the data lifecycle (source systems, integration layer, analytics layer). The key is choosing the right approach for the job.
Normalized modeling (3NF) for operational and integrated data
A normalized model (often associated with third normal form, 3NF) minimizes redundancy and supports transactional consistency.
Strengths: reduces update anomalies, supports integrity constraints, works well for operational workloads and master/reference data.
Common uses: operational databases, master data management (MDM) hubs, and some enterprise data warehouse (EDW) designs.
Risks for analytics: analytics queries can become complex due to many joins, and business-friendly reporting often requires additional structures.
In an Inmon-style EDW approach, an integrated enterprise warehouse is often modeled in a normalized form, with downstream data marts shaped for specific analytics needs.
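As a rough illustration of what normalization does, the following pandas sketch (with invented column names) decomposes a denormalized order feed into separate customer, product, order, and order-line tables so that each fact is stored once and referenced by key:

```python
import pandas as pd

# A denormalized feed that repeats customer and product details on every row.
flat = pd.DataFrame({
    "order_id":       ["SO-1", "SO-1", "SO-2"],
    "customer_id":    ["C-42", "C-42", "C-7"],
    "customer_email": ["a@x.com", "a@x.com", "b@y.com"],
    "product_id":     ["P-1", "P-2", "P-1"],
    "product_name":   ["Widget", "Gadget", "Widget"],
    "quantity":       [2, 1, 5],
})

# 3NF-style decomposition: redundancy is removed, which also removes the
# update anomalies that come with it.
customers   = flat[["customer_id", "customer_email"]].drop_duplicates()
products    = flat[["product_id", "product_name"]].drop_duplicates()
orders      = flat[["order_id", "customer_id"]].drop_duplicates()
order_lines = flat[["order_id", "product_id", "quantity"]]

print(customers, products, orders, order_lines, sep="\n\n")
```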
Dimensional modeling (Kimball) for analytics and BI
Dimensional modeling is designed for analytical workloads and decision support, organizing data into facts and dimensions.
Fact table: quantitative measurements/events at a defined grain (sales amount per order line, page views per session).
Dimension table: descriptive context used for filtering/grouping (customer, product, date, region).
Key practices in dimensional modeling include:
Declare the grain first: ensure everyone agrees what one fact row represents.
Conformed dimensions: reuse the same dimensions across subject areas to ensure consistent reporting.
Slowly Changing Dimensions (SCD): manage history in dimensions (e.g., Type 2 to preserve changes over time).
Dimensional modeling typically provides the best usability and performance for BI and self-service analytics because it matches how people ask questions (“sales by product by month”).
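The sketch below, using pandas and invented tables, shows why this shape maps directly onto a question like "sales by product by month": the fact table is joined to its conformed dimensions and grouped by the descriptive attributes.

```python
import pandas as pd

# Fact table: one row per order line (the declared grain), keyed by surrogate keys.
fact_sales = pd.DataFrame({
    "date_key":     [20240105, 20240212, 20240218],
    "product_key":  [1, 2, 1],
    "sales_amount": [59.90, 120.00, 29.95],
})

# Conformed dimensions: reusable descriptive context for filtering and grouping.
dim_date = pd.DataFrame({
    "date_key": [20240105, 20240212, 20240218],
    "month":    ["2024-01", "2024-02", "2024-02"],
})
dim_product = pd.DataFrame({
    "product_key":  [1, 2],
    "product_name": ["Widget", "Gadget"],
})

# "Sales by product by month" is a join to the dimensions plus a group-by,
# which is why the structure matches how business questions are asked.
report = (
    fact_sales
    .merge(dim_date, on="date_key")
    .merge(dim_product, on="product_key")
    .groupby(["product_name", "month"], as_index=False)["sales_amount"].sum()
)
print(report)
```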
Data Vault 2.0 for scalable, auditable integration
Data Vault is an integration modeling approach built around three core constructs:
Hubs: business keys (e.g., Customer ID from source).
Links: relationships between hubs (Customer–Order relationship).
Satellites: descriptive attributes and history (customer name changes, status).
Data Vault is often selected when integration must handle many sources, frequent change, and strong auditability/traceability requirements.
Strengths: handles evolving sources, supports historical tracking, can reduce rework during source changes.
Trade-offs: not inherently business-friendly for BI; it usually feeds dimensional marts or a semantic layer for consumption.
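A simplified sketch of the three constructs, assuming a hash-key convention and invented source fields (real Data Vault implementations vary in their hashing and load-metadata standards):

```python
import hashlib
from datetime import datetime, timezone

def hash_key(*parts: str) -> str:
    """Deterministic hash of business key parts (a common Data Vault convention)."""
    return hashlib.md5("||".join(p.strip().upper() for p in parts).encode()).hexdigest()

load_ts = datetime.now(timezone.utc).isoformat()
source = "crm"

# Hub: one row per business key.
hub_customer = {"customer_hk": hash_key("C-42"), "customer_id": "C-42",
                "load_ts": load_ts, "record_source": source}

# Link: relationship between hubs (Customer places Order).
link_customer_order = {"customer_order_hk": hash_key("C-42", "SO-1001"),
                       "customer_hk": hash_key("C-42"),
                       "order_hk": hash_key("SO-1001"),
                       "load_ts": load_ts, "record_source": source}

# Satellite: descriptive attributes with history; hash_diff detects changes.
attrs = {"customer_name": "Ada Lovelace", "status": "active"}
sat_customer = {"customer_hk": hub_customer["customer_hk"],
                "hash_diff": hash_key(*attrs.values()),
                **attrs, "load_ts": load_ts, "record_source": source}

print(hub_customer, link_customer_order, sat_customer, sep="\n")
```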
Choosing a modeling strategy: decision factors
Use these factors to guide selection (often per layer):
Primary workload: transactional integrity (normalized) vs analytics usability (dimensional).
Change and audit needs: high source volatility and lineage requirements (Data Vault) vs stable curated marts (dimensional).
Governance and reuse goals: conformed dimensions and shared metrics suggest dimensional and semantic modeling investments.
Skill and tooling: success depends on consistent implementation, documentation, and enforcement.
Data quality is enforced (or undermined) by the model
A data model is a control mechanism for data quality and governance, not just a diagram.
Accuracy is supported indirectly through source-of-truth decisions, reference data, and validation rules.
Completeness is supported through required fields, defaults, and monitoring expectations.
Consistency is supported by standardized data types, conformed dimensions, and shared keys.
Timeliness is influenced by pipeline design and update patterns (e.g., snapshots vs incremental loads).
Validity is enforced via domain constraints, allowable values, and referential integrity.
Uniqueness is enforced through primary keys, unique constraints, and deduplication rules.
If these rules are not represented somewhere (database constraints, transformation logic, or data quality tests), data quality becomes an informal expectation rather than an enforceable property.
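One lightweight way to make such rules executable, sketched here with pandas and invented tables, is to express each quality dimension as a check that returns a count of violations:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id":    ["SO-1", "SO-2", "SO-2"],    # duplicate key on purpose
    "customer_id": ["C-42", "C-7", None],       # missing foreign key
    "status":      ["shipped", "paid", "??"],   # invalid code
})
customers = pd.DataFrame({"customer_id": ["C-42", "C-7"]})
ALLOWED_STATUS = {"created", "paid", "shipped", "cancelled"}

issues = {
    # Uniqueness: the primary key must identify exactly one row.
    "duplicate_order_id": int(orders["order_id"].duplicated().sum()),
    # Completeness: required fields must be populated.
    "missing_customer_id": int(orders["customer_id"].isna().sum()),
    # Validity: values must come from the agreed domain.
    "invalid_status": int((~orders["status"].isin(ALLOWED_STATUS)).sum()),
    # Consistency / referential integrity: foreign keys must resolve.
    "orphan_customer_id": int(
        (~orders["customer_id"].isin(customers["customer_id"])
         & orders["customer_id"].notna()).sum()
    ),
}
print(issues)
```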
Practical workflow: from requirements to a maintainable model
A repeatable workflow improves consistency and reduces rework.
1) Start with use cases and definitions
Before drawing tables, define:
Key business questions (reporting, operational decisions, ML features).
Definitions of critical measures and dimensions (e.g., “Net Revenue,” “Active User”).
Required history behavior (do we need point-in-time accuracy?).
This step aligns to governance practices: definitions and ownership should be captured as metadata and reviewed.
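For example, definitions can be captured as simple, reviewable metadata; the entries below are illustrative only:

```python
# A lightweight registry of business definitions, captured as reviewable metadata.
definitions = {
    "Active Customer": {
        "definition": "Customer with at least one completed order in the last 90 days.",
        "owner": "Customer Analytics",        # accountable steward
        "history": "point-in-time required",  # drives SCD/snapshot decisions
    },
    "Net Revenue": {
        "definition": "Gross revenue minus discounts, refunds, and taxes.",
        "owner": "Finance",
        "history": "restated monthly",
    },
}

for name, meta in definitions.items():
    print(f"{name}: {meta['definition']} (owner: {meta['owner']})")
```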
2) Identify subject areas and bounded domains
Organize modeling work by domain (Customer, Product, Finance, Supply Chain) to avoid a monolithic design. In enterprise architecture terms (e.g., TOGAF), this is part of establishing a data architecture baseline and target, with clear boundaries and integration points.
3) Design the grain and keys
For each core dataset:
Declare the grain explicitly.
Choose stable keys and document key strategy:
Natural keys: meaningful business identifiers (can change and vary by source).
Surrogate keys: system-generated identifiers (common in dimensional models to manage history and performance).
Define how keys are generated, reconciled, and mapped across sources (critical for integration).
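The sketch below (with invented identifiers) illustrates one common pattern: a mapping table that assigns a shared surrogate key to natural keys arriving from different sources, so the same business entity resolves to one key downstream.

```python
import pandas as pd

# Natural keys arrive from sources and may differ in format between systems.
source_customers = pd.DataFrame({
    "source_system": ["crm", "billing"],
    "natural_key":   ["C-42", "0042"],
})

# A mapping table assigns one stable surrogate key per reconciled business entity;
# here both source records are known to represent the same customer.
key_map = pd.DataFrame({
    "source_system": ["crm", "billing"],
    "natural_key":   ["C-42", "0042"],
    "customer_sk":   [1001, 1001],   # surrogate key shared across sources
})

resolved = source_customers.merge(key_map, on=["source_system", "natural_key"], how="left")
print(resolved)
```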
4) Document transformations and lineage
For each dataset, capture source-to-target mappings and the transformation rules applied (standardization, deduplication, currency conversion, time zone rules).
Lineage supports auditability, root-cause analysis, and governance, and it is essential for reliable analytics.
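A minimal pandas sketch of typical transformation rules, with invented data and the assumption that source timestamps are already in UTC:

```python
import pandas as pd

raw = pd.DataFrame({
    "email":      [" Ada@Example.COM ", "ada@example.com", "grace@example.com"],
    "amount_usd": [10.0, 10.0, 25.0],
    "created_at": ["2024-03-01 23:30:00", "2024-03-01 23:30:00", "2024-03-02 08:00:00"],
})

cleaned = (
    raw
    # Standardization rule: trim and lower-case emails before matching.
    .assign(email=lambda df: df["email"].str.strip().str.lower())
    # Time zone rule: store timestamps as UTC (source assumed to send UTC here).
    .assign(created_at=lambda df: pd.to_datetime(df["created_at"], utc=True))
    # Deduplication rule: one row per (email, created_at) after standardization.
    .drop_duplicates(subset=["email", "created_at"])
)
print(cleaned)
```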
5) Implement with testing and documentation
In modern analytics engineering practices, modeling is paired with automated tests and documentation:
Schema tests: uniqueness and not-null checks on keys.
Reconciliation tests: totals vs source, row-count expectations.
Documentation should include grain, business definitions, owners, and refresh cadence.
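A bare-bones illustration of such tests in Python, using invented tables and plain assertions; in practice these checks would typically run in a test framework or a dedicated data testing tool:

```python
import pandas as pd

dim_customer = pd.DataFrame({"customer_sk": [1, 2, 3], "customer_id": ["C-1", "C-2", "C-3"]})
source_rows, warehouse_rows = 3, 3
source_total, warehouse_total = 1250.00, 1250.00

# Schema tests: key columns must be unique and not null.
assert dim_customer["customer_sk"].is_unique, "duplicate surrogate keys"
assert dim_customer["customer_sk"].notna().all(), "null surrogate keys"

# Reconciliation tests: row counts and totals must match the source (within tolerance).
assert source_rows == warehouse_rows, "row count drift vs source"
assert abs(source_total - warehouse_total) < 0.01, "amount total drift vs source"

print("all model tests passed")
```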
Modeling for self-service analytics: semantic layer considerations
Even with strong physical models, self-service analytics often fails without a consistent semantic layer. A well-governed semantic layer should:
Standardize metric definitions and calculation logic.
Provide business-friendly naming and descriptions.
Manage join paths intentionally to avoid double counting.
Control access and privacy through governed dimensions and role-based policies.
The goal is to separate “how data is stored” from “how data is understood and queried,” improving consistency across dashboards, notebooks, and downstream applications.
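As a sketch of the idea (the metric registry and query helper below are invented for illustration), a semantic layer can define each metric once and control how it may be sliced:

```python
import pandas as pd

# One governed definition per metric: name, calculation, and allowed dimensions.
METRICS = {
    "net_revenue": {"column": "net_amount", "agg": "sum",
                    "dimensions": {"product_name", "month"}},
}

fact = pd.DataFrame({
    "product_name": ["Widget", "Widget", "Gadget"],
    "month":        ["2024-01", "2024-02", "2024-02"],
    "net_amount":   [59.90, 29.95, 120.00],
})

def query(metric: str, by: list[str]) -> pd.DataFrame:
    """Resolve a metric through its single governed definition."""
    spec = METRICS[metric]
    unknown = set(by) - spec["dimensions"]
    if unknown:
        # Controlled join paths: disallow slicing that could double count.
        raise ValueError(f"{metric} cannot be sliced by {unknown}")
    return fact.groupby(by, as_index=False)[spec["column"]].agg(spec["agg"])

print(query("net_revenue", ["product_name", "month"]))
```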
Common pitfalls (and how to avoid them)
Unclear grain: leads to double counting and inconsistent metrics. Always document grain and enforce it with keys/tests.
Mixing operational and analytical patterns: operational schemas rarely satisfy BI usability without additional modeling; plan an analytics layer.
Inconsistent keys across systems: invest in master/reference data, mapping tables, and clear key management.
Over-modeling too early: start with a minimal viable model aligned to prioritized use cases, then evolve with governance controls.
No ownership or definitions: without data stewardship and a glossary, models become ambiguous and contested.
Ignoring history requirements: decide early where and how to track changes (SCD, snapshots, Data Vault satellites), and validate point-in-time reporting needs.
Summary: key takeaways
Data modeling defines structure and meaning so data can be integrated, governed, and used consistently.
Separate conceptual, logical, and physical modeling to improve stakeholder alignment and implementation quality.
Use the right modeling approach by layer: normalized for operational integrity, dimensional for analytics consumption, and Data Vault for scalable integration and auditability.
Treat the model as a data quality control surface: keys, constraints, and tests make quality measurable and enforceable.
Pair models with documentation, lineage, and a semantic layer to scale self-service and maintain metric consistency.