Data types span both technical storage formats (e.g., SQL types) and semantic domains that define meaning, rules, and valid operations in analytics. Defining reusable domains, mapping them consistently across platforms, and enforcing them with constraints and automated tests reduces type drift, improves data quality, and supports reliable reporting.
Introduction
Data types define how data is represented, stored, validated, and processed across databases, pipelines, and analytics tools. In practice, “data type” spans both technical types (e.g., DATE, DECIMAL, VARCHAR) and semantic types (e.g., “Customer ID”, “Order Amount”, “Consent Flag”). Clear and consistent data typing reduces integration friction, improves data quality controls, and makes metrics easier to interpret.
What “data type” means (technical vs. semantic)
A complete understanding of data types requires two complementary views:
Technical (physical) data types: How a system stores and computes values (SQL types, file encodings, precision/scale, collation, nullability).
Semantic (business) data types / domains: What a value means and the rules it must follow (allowed values, format constraints, reference data, sensitivity classification).
In DAMA-DMBOK terms, technical types are part of Data Modeling & Design and Data Architecture, while semantic definitions belong in Metadata Management (business glossary, data dictionary) and are enforced through Data Quality controls.
Core categories of data types
Numeric types
Numeric typing choices affect accuracy, performance, and downstream aggregations.
Exact decimals (DECIMAL(p,s) / NUMERIC): Currency and other values requiring exact arithmetic (avoid binary floating-point rounding).
Approximate numbers (FLOAT, DOUBLE): Scientific/measurement data where small rounding error is acceptable.
Common governance practice is to define standard numeric domains (e.g., amount_currency as DECIMAL(18,2)) and apply them consistently.
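The difference between exact and approximate numerics can be shown with Python's decimal module; the quantize call mimics a DECIMAL(18,2)-style domain (the values are illustrative):

```python
from decimal import Decimal

# Binary floating-point accumulates rounding error on currency arithmetic.
float_total = 0.1 + 0.2                           # 0.30000000000000004, not 0.3
exact_total = Decimal("0.10") + Decimal("0.20")   # exactly Decimal('0.30')

# A DECIMAL(18,2)-style amount domain: round to exactly two decimal places.
amount = (Decimal("19.99") * 3).quantize(Decimal("0.01"))
print(float_total)   # 0.30000000000000004
print(exact_total)   # 0.30
print(amount)        # 59.97
```

Passing the literal as a string (`Decimal("0.10")`, not `Decimal(0.10)`) matters: constructing from a float would import the binary rounding error the exact type is meant to avoid.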
Text and code types
Text types are often used for names, descriptions, and codes.
Free text: Customer names, comments; requires collation/encoding standards (e.g., UTF-8) and often needs normalization.
Codes/identifiers: Product codes, ISO country codes; should be modeled as domains with explicit allowed patterns and reference lists.
A frequent modeling improvement is to differentiate “identifier-like text” (stable, constrained) from “free text” (unconstrained) in the semantic layer and documentation.
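A constrained identifier domain can be made executable with a simple pattern check; the product-code rule below (two uppercase letters plus six digits) is an illustrative assumption, not a real standard:

```python
import re

# Hypothetical domain rule for product codes: two uppercase letters
# followed by six digits. Free text would carry no such pattern.
PRODUCT_CODE = re.compile(r"^[A-Z]{2}\d{6}$")

def is_valid_product_code(value: str) -> bool:
    return bool(PRODUCT_CODE.match(value))

print(is_valid_product_code("AB123456"))  # True
print(is_valid_product_code("ab123456"))  # False: lowercase violates the domain
```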
Date and time types
Time is a common source of defects when types and rules are unclear.
DATE: Calendar date without time.
TIME: Time of day without date.
TIMESTAMP / DATETIME: Point in time; requires a timezone strategy.
Best practice is to standardize:
A canonical timezone for storage (often UTC) and explicit conversion rules for reporting.
ISO 8601-compatible formats at ingestion boundaries.
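Both conventions can be applied at the ingestion boundary in a few lines; this sketch parses an ISO 8601 timestamp with an offset and normalizes it to UTC for storage:

```python
from datetime import datetime, timezone

# Parse an ISO 8601 timestamp with an explicit offset at the boundary...
raw = "2024-03-15T09:30:00+02:00"
parsed = datetime.fromisoformat(raw)

# ...then convert to the canonical storage timezone (UTC).
stored = parsed.astimezone(timezone.utc)
print(stored.isoformat())  # 2024-03-15T07:30:00+00:00
```

Reporting layers then convert from UTC to local time explicitly, so the conversion rule lives in one documented place rather than in each consumer.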
Boolean and enumerations
Booleans: True/false flags (e.g., is_active). Define whether null is permitted and what null means.
Enumerations (enums): Small, controlled sets of values (e.g., order status). These should be treated as reference data with governed definitions, not ad-hoc strings.
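Treating an enum as governed reference data rather than ad-hoc strings can look like this; the order-status values are a hypothetical reference set:

```python
from enum import Enum

# Hypothetical governed reference set for order status.
class OrderStatus(Enum):
    PLACED = "placed"
    SHIPPED = "shipped"
    DELIVERED = "delivered"
    CANCELLED = "cancelled"

def parse_status(raw: str) -> OrderStatus:
    # Fails fast on values outside the reference set instead of
    # letting unknown strings flow downstream.
    return OrderStatus(raw)

print(parse_status("shipped"))  # OrderStatus.SHIPPED
# parse_status("returned") would raise ValueError: not in the reference set
```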
Structured and semi-structured types
Modern platforms often store nested data.
Arrays/structs/JSON: Useful for ingestion and flexibility, but can reduce discoverability and consistent validation if overused.
A common pattern is to ingest semi-structured data first, then promote stable fields into typed relational/columnar structures with governed domains as the model matures.
Measurement scales (analytics semantics)
In analytics, “type” also includes the measurement scale, which informs valid operations:
Nominal: Categories without order (e.g., country). Valid operations: grouping, counts.
Ordinal: Ordered categories without meaningful differences between values (e.g., satisfaction ratings). Valid operations: ranking, medians.
Interval: Numeric scale with meaningful differences but no true zero (e.g., Celsius temperature). Valid operations: addition/subtraction.
Ratio: Numeric scale with true zero (e.g., revenue, quantity). Valid operations: all arithmetic, ratios.
Documenting measurement scale in the semantic layer prevents metric misuse (for example, averaging an ordinal rating as if it were ratio data).
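One way to make documented scales executable is a small lookup that gates aggregations; the column names, scales, and rules here are a minimal sketch:

```python
# Hypothetical semantic-layer metadata: each column's measurement scale.
SCALES = {"country": "nominal", "rating": "ordinal", "revenue": "ratio"}

# Means need interval or ratio data; medians also make sense for ordinal.
ALLOWED = {
    "mean": {"interval", "ratio"},
    "median": {"ordinal", "interval", "ratio"},
}

def aggregation_is_valid(column: str, agg: str) -> bool:
    return SCALES[column] in ALLOWED[agg]

print(aggregation_is_valid("revenue", "mean"))  # True
print(aggregation_is_valid("rating", "mean"))   # False: ordinal ratings should not be averaged
```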
How data types relate to data modeling frameworks
Dimensional modeling (Kimball)
In star schemas, types should support consistent BI behavior:
Conformed dimensions should use consistent key types and stable domain definitions across subject areas.
Facts should use numeric types that support correct aggregation (e.g., DECIMAL for amounts, integer for counts) and explicitly defined additive/semi-additive behavior.
Typing is part of making a metric “BI-ready,” along with grain definition and dimensional conformance.
Inmon-style EDW concepts
In EDW layers, domains and typing standards help maintain integration across source systems:
Standardize key domains, date/time conventions, and code sets.
Use constraints and reference data to reduce semantic drift across integrated datasets.
Data Vault 2.0
Data Vault patterns depend on consistent typing for keys and audit fields:
Hubs/links: Key types and hashing strategies must be standardized (including consistent canonicalization of business keys before hashing).
Satellites: Effective dates, load timestamps, and record source fields require consistent timestamp types and timezone rules.
Typing standards here directly affect historization and reproducibility.
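The effect of canonicalizing business keys before hashing can be shown in a few lines; the trim-and-uppercase rule and the choice of MD5 are assumptions for illustration:

```python
import hashlib

# Sketch of a hub hash-key computation, assuming an agreed canonicalization
# rule (trim whitespace, uppercase) applied to every business key.
def hash_key(*business_keys: str) -> str:
    canonical = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

# The same key in different raw forms hashes identically after canonicalization,
# which is what keeps hub loads from different sources reproducible.
print(hash_key(" ab123456 ") == hash_key("AB123456"))  # True
```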
Practical approach to defining and enforcing data types
1) Start with domains (semantic definitions)
Define a manageable set of reusable domains, for example:
customer_id: pattern, allowed length, leading zeros rule
currency_code: ISO 4217
country_code: ISO 3166-1 alpha-2
event_timestamp_utc: timezone, precision
amount_currency: precision/scale
These belong in a business glossary/data dictionary and should be referenced by models, pipelines, and tests.
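Even a plain data structure makes such a registry referenceable from pipelines and tests; the rules below are illustrative stand-ins for glossary entries:

```python
# Minimal sketch of a domain registry; rules are illustrative, and the
# currency set is a subset of ISO 4217 for brevity.
DOMAINS = {
    "customer_id":         {"pattern": r"^\d{10}$", "keep_leading_zeros": True},
    "currency_code":       {"allowed": {"USD", "EUR", "GBP"}},
    "country_code":        {"pattern": r"^[A-Z]{2}$"},  # ISO 3166-1 alpha-2 shape
    "event_timestamp_utc": {"timezone": "UTC", "precision": "milliseconds"},
    "amount_currency":     {"precision": 18, "scale": 2},
}
```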
2) Map domains to physical types per platform
For each domain, specify platform-specific implementations (warehouse, lakehouse, operational DB), including:
nullability
precision/scale
encoding/collation
partitioning/granularity implications (for time)
This aligns with enterprise architecture practices (e.g., TOGAF), where standards reduce variability across systems.
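A domain-to-platform mapping can be captured in the same registry style; the platform names and physical types below are assumed examples:

```python
# Hypothetical mapping of one semantic domain to physical types per platform.
TYPE_MAP = {
    "amount_currency": {
        "postgres":  {"type": "NUMERIC(18,2)", "nullable": False},
        "snowflake": {"type": "NUMBER(18,2)",  "nullable": False},
        "parquet":   {"type": "decimal(18,2)", "nullable": False},
    },
}

def physical_type(domain: str, platform: str) -> str:
    return TYPE_MAP[domain][platform]["type"]

print(physical_type("amount_currency", "postgres"))  # NUMERIC(18,2)
```

Generating DDL and pipeline schemas from one mapping like this, rather than retyping each column per platform, is what keeps the variability down.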
3) Enforce types through constraints and automated tests
Use layered enforcement:
Schema constraints: type declarations, check constraints, foreign keys (where feasible).
Pipeline validations: schema checks, allowed values, range checks, referential checks.
Semantic layer constraints: consistent typing and formatting for BI consumption.
This supports DAMA-DMBOK-aligned Data Quality practices by turning definitions into executable controls.
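At the pipeline layer, the validations listed above reduce to per-record checks; this sketch combines an allowed-values check with a type and range check (field names and rules are illustrative):

```python
from decimal import Decimal, InvalidOperation

# Sketch of pipeline-level validations for one record.
def validate_order(row: dict) -> list:
    errors = []
    # Allowed-values check: enum treated as reference data.
    if row.get("status") not in {"placed", "shipped", "delivered", "cancelled"}:
        errors.append(f"unexpected status: {row.get('status')!r}")
    # Type check plus range check for the amount.
    try:
        if Decimal(str(row.get("amount"))) < 0:
            errors.append("negative amount")
    except InvalidOperation:
        errors.append(f"unparseable amount: {row.get('amount')!r}")
    return errors

print(validate_order({"status": "shipped", "amount": "19.99"}))  # []
print(validate_order({"status": "returned", "amount": "oops"}))  # two errors
```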
4) Monitor type drift and contract violations
Implement monitoring for:
unexpected null rate changes
sudden increases in parsing failures (e.g., dates)
new unseen enum values
upstream schema changes
Type drift is a common root cause of downstream metric breaks.
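The first three checks above can be sketched as a batch-versus-baseline comparison; the thresholds, field names, and reference values are illustrative assumptions:

```python
# Sketch: compare a new batch against baseline expectations to flag drift.
def detect_drift(batch, baseline_null_rate, known_statuses):
    alerts = []
    # Unexpected null-rate change (tolerance of 5 percentage points).
    null_rate = sum(1 for r in batch if r.get("status") is None) / len(batch)
    if null_rate > baseline_null_rate + 0.05:
        alerts.append(f"null rate jumped to {null_rate:.0%}")
    # New, previously unseen enum values.
    unseen = {r["status"] for r in batch if r.get("status") is not None} - known_statuses
    if unseen:
        alerts.append(f"unseen enum values: {sorted(unseen)}")
    return alerts

batch = [{"status": "shipped"}, {"status": "returned"}, {"status": None}]
print(detect_drift(batch, baseline_null_rate=0.01,
                   known_statuses={"placed", "shipped"}))
```

In practice the same comparison runs per load against baselines kept in metadata, with alerts routed to the owning team before consumers see broken metrics.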
Common pitfalls and how to avoid them
Using FLOAT for money: Prefer exact decimals for financial amounts.
Inconsistent timezones: Store in a canonical timezone and document conversion logic.
Overloading strings: Treat identifiers, codes, and free text as separate semantic domains even if all are physically VARCHAR.
Silent coercions: Avoid implicit casts that mask quality issues; fail fast at boundaries.
Uncontrolled enums: Manage them as reference data with clear ownership and change control.
JSON everywhere: Use semi-structured types for ingestion and flexibility, but promote stable fields into typed models for governed analytics.
Summary
Data types are not only database column types; they include semantic domains and measurement scales that define how data should be interpreted and validated. Effective typing combines governed domain definitions, consistent platform mappings, automated enforcement, and monitoring for drift. When applied consistently across modeling layers (dimensional, EDW, Data Vault) and exposed through a semantic layer, strong typing improves data quality, usability, and trust in analytics.