Introduction: why “good data” is hard to define
Data quality is a core discipline within data management because analytics, operational processes, and regulatory reporting all depend on trustworthy data. “Good data” is best defined as fitness for use: quality is not a single score, but a set of measurable requirements that must be met for a specific business purpose.
What data quality is (and is not)
Data quality is the degree to which data meets defined requirements across relevant dimensions (for example, accuracy or completeness). It is not the same as:
- Data governance (decision rights, policies, accountability) even though governance sets the standards that quality must satisfy
- Data security/privacy (protection and compliant use), which can be excellent even when data is inaccurate
- Data availability (systems uptime), which can be high even when the content is wrong
Core dimensions of data quality
A practical way to specify requirements is to use common data quality dimensions (widely used in data governance programs and reflected in frameworks such as DAMA-DMBOK and ISO data quality models). The most common dimensions for analytics and reporting are:
- Accuracy: Values correctly represent the real-world entity or event (for example, an order total matches the source transaction).
- Completeness: Required attributes are present at the expected rate (for example, a customer must have a country code for tax reporting).
- Consistency: The same concept has compatible values across systems and over time (for example, “Active” status means the same thing in CRM and billing).
- Timeliness: Data is available within the needed latency and reflects the required point-in-time freshness (for example, intraday inventory vs. monthly finance close).
- Validity: Data conforms to defined formats and business rules (for example, date formats, allowable ranges, referential integrity).
- Uniqueness: No unintended duplicates exist for the entity definition (for example, one customer record per customer key, with controlled de-duplication rules).
When documenting these dimensions, state them as testable requirements (rule + threshold + scope) rather than abstract ideals, as in the sketch below.
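As a minimal sketch (Python with pandas, using a hypothetical orders extract), a few of these dimensions could be phrased as rule + threshold checks:

```python
import pandas as pd

# Hypothetical orders extract used only to illustrate testable dimension checks.
orders = pd.DataFrame({
    "order_id":     [1, 2, 2, 3],
    "country_code": ["US", "DE", "DE", None],
    "order_total":  [100.0, 55.5, 55.5, -10.0],
})

# Each requirement is stated as: dimension, check (pass rate over the data), threshold.
rules = [
    ("completeness", lambda df: df["country_code"].notna().mean(),      0.99),
    ("uniqueness",   lambda df: 1 - df["order_id"].duplicated().mean(), 1.00),
    ("validity",     lambda df: (df["order_total"] >= 0).mean(),        0.995),
]

for dimension, check, threshold in rules:
    pass_rate = check(orders)
    status = "PASS" if pass_rate >= threshold else "FAIL"
    print(f"{dimension:<12} pass_rate={pass_rate:.3f} threshold={threshold} -> {status}")
```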
Context matters: quality is driven by the use case
Quality expectations vary by purpose, risk, and decision horizon. A useful pattern is to classify use cases and then set explicit targets:
- Regulatory and financial reporting: typically strict accuracy, completeness, consistency, lineage, and controlled changes (high auditability).
- Operational analytics (for example, routing, fraud detection): often strict timeliness plus accuracy on critical fields; tolerances must be explicit.
- Experimentation and marketing: may accept higher uncertainty in some attributes if timeliness and consistency of definitions are maintained.
This is why data quality should be defined against critical data elements (CDEs) and business definitions rather than applied uniformly to every column. The configuration sketch below illustrates use-case-specific targets.
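As an illustration only, tiered targets could be captured in a simple configuration; the use-case names and thresholds below are assumptions, not recommendations:

```python
# Illustrative targets showing how the same dimension can carry different
# thresholds depending on the use case; all numbers here are assumptions.
quality_targets = {
    "regulatory_reporting": {
        "accuracy":      0.999,  # reconciled to the system of record
        "completeness":  1.000,  # every required CDE populated
        "timeliness_h":  24,     # available within 24 hours of period close
    },
    "operational_analytics": {
        "accuracy":      0.99,   # strict only on critical fields
        "completeness":  0.98,
        "timeliness_h":  0.25,   # 15-minute freshness for routing/fraud decisions
    },
    "marketing_experimentation": {
        "accuracy":      0.95,   # higher uncertainty tolerated
        "completeness":  0.90,
        "timeliness_h":  24,
    },
}
```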
How to specify data quality requirements (practical template)
Define requirements at the right level of granularity:
- Data element: attribute definition, domain/allowed values, nullability, precision
- Dataset/table: uniqueness keys, referential integrity rules, acceptable volume ranges
- Metric/semantic layer: calculation logic, filters, aggregation grain, time zone rules
A concise specification format that works well in governance and analytics engineering (illustrated in the sketch after this list):
- Rule: what must be true (for example, order_date <= ship_date)
- Scope: where it applies (table, partition, business unit, product line)
- Threshold: acceptable pass rate (for example, at least 99.5% of records pass per day)
- Severity: impact if breached (for example, block reporting vs. warn)
- Owner: accountable role (data owner/steward) and technical responder (data producer team)
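One possible way to capture this specification in code is a small record type; the field values below are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class DataQualityRule:
    """One testable requirement: rule + scope + threshold + severity + owner."""
    name: str
    rule: str         # what must be true, expressed as a predicate
    scope: str        # table, partition, business unit, product line
    threshold: float  # minimum acceptable daily pass rate
    severity: str     # impact if breached, e.g. "block" or "warn"
    owner: str        # accountable data owner/steward
    responder: str    # technical team that remediates breaches

# Example instance for the shipping rule above; names are illustrative.
ship_after_order = DataQualityRule(
    name="order_before_ship",
    rule="order_date <= ship_date",
    scope="sales.orders, daily partitions, EU business unit",
    threshold=0.995,
    severity="block",  # breaches block downstream reporting
    owner="Order-to-Cash data owner",
    responder="Orders pipeline team",
)
```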
Measuring and monitoring data quality
Data quality management requires measurement that is repeatable and auditable.
- Data profiling (baseline): quantify null rates, distinct counts, distribution shifts, outliers, and key constraints before setting thresholds
- Validation controls: implement checks for schema, ranges, referential integrity, duplicates, and business rules
- Ongoing monitoring: schedule checks aligned to refresh cadence; alert only when thresholds are breached
- Issue management workflow: log incidents, triage by severity, assign owners, track remediation and recurrence
Common metrics by dimension (a measurement sketch follows this list):
- Completeness: % non-null for required fields, coverage across expected entities
- Uniqueness: duplicate rate for defined natural/business keys
- Timeliness: data latency distribution (p50/p95), SLA compliance rate
- Validity: rule pass rate, domain conformance rate
- Consistency: cross-system reconciliation variance, definition conformance (semantic checks)
- Accuracy: sampled verification against authoritative sources, reconciliation to system-of-record totals
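A minimal measurement sketch, assuming a hypothetical deliveries table and an assumed 60-minute latency SLA, showing how several of these metrics can be computed with pandas:

```python
import pandas as pd

# Hypothetical deliveries table; column names and the SLA are assumptions.
deliveries = pd.DataFrame({
    "delivery_id": [10, 11, 11, 12, 13],
    "status":      ["DELIVERED", "PENDING", "PENDING", "UNKNOWN", "DELIVERED"],
    "latency_min": [12.0, 45.0, 45.0, 130.0, 8.0],  # minutes from event to availability
})

# Completeness: share of required fields populated
completeness = deliveries["status"].notna().mean()

# Uniqueness: duplicate rate on the business key
duplicate_rate = deliveries["delivery_id"].duplicated().mean()

# Timeliness: latency distribution (p50/p95) and SLA compliance rate
p50, p95 = deliveries["latency_min"].quantile([0.5, 0.95])
sla_compliance = (deliveries["latency_min"] <= 60).mean()

# Validity: conformance to an allowed status domain
allowed_statuses = {"DELIVERED", "PENDING", "CANCELLED"}
validity = deliveries["status"].isin(allowed_statuses).mean()

print(f"completeness={completeness:.2%} duplicate_rate={duplicate_rate:.2%}")
print(f"latency p50={p50:.0f}min p95={p95:.0f}min sla_compliance={sla_compliance:.2%}")
print(f"validity={validity:.2%}")
```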
Controls across the data lifecycle (prevention beats detection)
High-performing programs use controls at multiple layers, aligned with DAMA-style lifecycle thinking (create, store, integrate, deliver, use):
- At capture (source systems): input validation, controlled reference data, mandatory fields, standardized codes
- During integration (ETL/ELT and streaming): schema enforcement, deduplication logic, idempotency, late-arriving data handling (see the sketch after this list)
- In storage and modeling: clear keys and grain, conformed dimensions (Kimball), canonical definitions, and controlled historization patterns where needed
- At the semantic/metrics layer: centralized metric definitions and consistent business logic to reduce “multiple versions of truth”
- At consumption: documented caveats, certified datasets, and usage guidance for analysts and downstream applications
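As a sketch of the integration-layer controls above, assuming a customer batch with hypothetical columns and customer_id as the business key:

```python
import pandas as pd

# Minimal sketch of integration-layer controls (schema enforcement, type coercion,
# deduplication) before load. Column names and the business key are assumptions.
EXPECTED_COLUMNS = {"customer_id", "email", "signup_date"}
BUSINESS_KEY = ["customer_id"]

def enforce_and_dedupe(batch: pd.DataFrame) -> pd.DataFrame:
    # Schema enforcement: fail fast if the contract is broken.
    missing = EXPECTED_COLUMNS - set(batch.columns)
    if missing:
        raise ValueError(f"Schema check failed, missing columns: {missing}")

    # Explicit type coercion so downstream rules see consistent types.
    batch = batch.assign(
        customer_id=batch["customer_id"].astype("int64"),
        signup_date=pd.to_datetime(batch["signup_date"]),
    )

    # Controlled de-duplication: keep the most recent record per business key.
    return (batch.sort_values("signup_date")
                 .drop_duplicates(subset=BUSINESS_KEY, keep="last"))
```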
Roles and accountability (governance linkage)
Data quality improves when ownership is explicit:
- Data owner: accountable for quality targets for a domain/CDEs and for prioritizing fixes
- Data steward: maintains definitions, rules, and issue management; supports adoption and training
- Data producer team: implements preventative controls and remediations in pipelines and source processes
- Data consumers: report issues with evidence and validate whether fixes meet the intended use
Without these roles, monitoring turns into unmanaged alerts rather than sustained improvement.
Common pitfalls to avoid
- Treating quality as a one-time cleanup instead of an operating process with monitoring and root-cause fixes
- Measuring only “accuracy” while ignoring the completeness, timeliness, and consistency issues that often drive stakeholder trust
- Setting thresholds without profiling baselines or without aligning to decision risk
- Checking data in the warehouse only, while leaving root causes in upstream operational processes
- Allowing metric logic to drift across dashboards (no semantic layer or definition governance)
Summary: key takeaways
- Data quality is multidimensional and should be defined as fitness for use.
- Use standard dimensions (accuracy, completeness, consistency, timeliness, validity, uniqueness) to create testable rules with thresholds.
- Implement controls across the lifecycle and connect monitoring to an issue management process and clear accountability.
- Prioritize critical data elements and governed metric definitions to improve trust in analytics and decision-making.