Understanding Data Quality: Beyond Completeness and Accuracy
Tags: data-quality, data-governance, data-management
Data quality is best defined as fitness for use and must be expressed as measurable requirements, not a vague idea of “clean data.” Using common dimensions—accuracy, completeness, consistency, timeliness, validity, and uniqueness—organizations can implement governance, controls, and monitoring that make data reliable for reporting, operations, and analytics.
Why data quality is more than “clean data”
Data quality is the degree to which data is fit for its intended use. In DAMA-DMBOK terms, data quality management is a core data management discipline that defines, measures, monitors, and improves data to meet business expectations.
Poor quality data typically shows up as:
Incorrect decisions (e.g., wrong KPIs, biased model features)
Operational failures (e.g., failed order fulfillment due to invalid addresses)
Loss of trust in analytics and self-service
A practical definition of “good” data therefore must be measurable and explicitly tied to a use case (reporting, operational processing, ML, compliance), not assumed.
Core dimensions of data quality (and how to operationalize them)
Many organizations use a set of commonly accepted dimensions to express requirements and design controls. The six dimensions below are widely used in governance and data quality practices and map well to how rules and metrics are implemented in real systems.
Accuracy
Definition: Data correctly represents the real-world entity/event it describes.
How it fails: wrong amounts, wrong customer attributes, incorrect timestamps, incorrect mappings.
How to measure: compare to an authoritative source (system of record, external validation, reconciliation); calculate error rate and impact.
Common controls: reconciliations, reference data validation, controlled vocabularies, master data management (MDM) where appropriate.
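As an illustration, the sketch below (plain Python, with hypothetical customer_id keys and a balance field) reconciles a warehouse extract against a system-of-record extract and reports an error rate; a real reconciliation would add tolerance rules and impact weighting.

```python
# Minimal accuracy check: reconcile a dataset against a system-of-record extract.
# Field names (balance) and keys are hypothetical.

warehouse = {
    "C001": {"balance": 120.00},
    "C002": {"balance": 75.50},
    "C003": {"balance": 310.00},
}
system_of_record = {
    "C001": {"balance": 120.00},
    "C002": {"balance": 80.00},   # mismatch
    "C003": {"balance": 310.00},
}

mismatches = [
    key
    for key, row in warehouse.items()
    if key in system_of_record
    and abs(row["balance"] - system_of_record[key]["balance"]) > 0.01
]

error_rate = len(mismatches) / len(warehouse)
print(f"Mismatched records: {mismatches}, error rate: {error_rate:.1%}")
```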
Completeness
Definition: Required data is present at the right level of granularity for the use case.
How it fails: nulls in required fields, missing records, partial history after a pipeline outage.
How to measure: null rate for required fields; record counts vs. expected; completeness by segment/time window.
Common controls: required-field checks, ingestion expectations (e.g., “daily file must contain all regions”), backfills with auditable lineage.
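A minimal sketch of these measures, assuming hypothetical order fields and an expected row count taken from a control file or a prior load:

```python
# Completeness check: null rate per required field and row count vs. expectation.
# Field names and the expected count are illustrative assumptions.

rows = [
    {"order_id": "O1", "customer_id": "C1", "region": "EMEA"},
    {"order_id": "O2", "customer_id": None, "region": "AMER"},
    {"order_id": "O3", "customer_id": "C3", "region": None},
]
required_fields = ["order_id", "customer_id", "region"]
expected_row_count = 4  # e.g., from yesterday's load or a control file

null_rates = {
    field: sum(1 for r in rows if r.get(field) is None) / len(rows)
    for field in required_fields
}
print("Null rate per required field:", null_rates)
print("Row count vs. expected:", len(rows), "/", expected_row_count)
```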
Consistency
Definition: Data does not contradict itself across datasets, systems, or time.
How it fails: customer status differs between CRM and billing; metric definitions differ between dashboards; different currencies without conversion.
How to measure: cross-system reconciliation; referential integrity checks; “same business concept, same definition” checks.
Common controls: canonical definitions in a semantic layer/metrics layer; conformed dimensions (Kimball); standardized transformation logic.
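For example, a simple cross-system check (system names and status values are illustrative) that flags customers whose status differs between CRM and billing:

```python
# Consistency check: does customer status agree between CRM and billing?
# The statuses and customer keys are hypothetical.

crm = {"C1": "active", "C2": "churned", "C3": "active"}
billing = {"C1": "active", "C2": "active", "C3": "active"}

conflicts = {
    cust: (crm_status, billing[cust])
    for cust, crm_status in crm.items()
    if cust in billing and billing[cust] != crm_status
}
print("Status conflicts between CRM and billing:", conflicts)
```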
Timeliness
Definition: Data is available when needed and reflects the required recency for the use case.
How it fails: late-arriving feeds; pipelines succeed but deliver after reporting deadlines; operational actions happen on stale data.
How to measure: freshness/latency (event time → availability time); SLA/SLO compliance.
Common controls: pipeline SLAs, alerting on freshness, late-data handling patterns (watermarks, reprocessing windows).
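A minimal freshness check, assuming a hypothetical two-hour SLO and illustrative timestamps:

```python
# Freshness check: latency between event time and availability downstream,
# compared against an assumed 2-hour freshness SLO.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLO = timedelta(hours=2)  # hypothetical target

event_time = datetime(2024, 5, 1, 6, 0, tzinfo=timezone.utc)       # when the event occurred
available_time = datetime(2024, 5, 1, 9, 30, tzinfo=timezone.utc)  # when it landed downstream

latency = available_time - event_time
print(f"Latency: {latency}, within SLO: {latency <= FRESHNESS_SLO}")
```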
Validity
Definition: Data conforms to defined formats, types, ranges, and business rules.
How it fails: invalid dates, negative quantities where prohibited, invalid country codes, malformed emails.
How to measure: rule pass/fail rates; distribution checks (e.g., allowed values, ranges).
Common controls: schema enforcement, domain constraints, business-rule tests in transformation pipelines, reference data and code sets.
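A sketch of rule pass/fail measurement with a few illustrative format, domain, and range rules (the rule set and field names are assumptions, not a standard):

```python
# Validity checks: format, domain, and range rules with a pass rate per rule.
import re

ALLOWED_COUNTRIES = {"US", "DE", "JP"}  # stand-in for a reference code set
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

rows = [
    {"email": "a@example.com", "country": "US", "quantity": 3},
    {"email": "not-an-email",  "country": "XX", "quantity": -1},
]

rules = {
    "valid_email":      lambda r: bool(EMAIL_RE.match(r["email"])),
    "known_country":    lambda r: r["country"] in ALLOWED_COUNTRIES,
    "non_negative_qty": lambda r: r["quantity"] >= 0,
}

for name, rule in rules.items():
    pass_rate = sum(rule(r) for r in rows) / len(rows)
    print(f"{name}: {pass_rate:.0%} pass")
```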
Uniqueness
Definition: Each real-world entity/event is represented once where uniqueness is required.
How it fails: duplicate customers, repeated transactions due to retries, double-counted events.
How to measure: duplicate rate by business key; collision checks; idempotency validation.
Common controls: primary keys, deduplication logic, idempotent ingestion, survivorship rules (often connected to MDM).
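For instance, a duplicate-rate check by business key (here a hypothetical transaction_id):

```python
# Uniqueness check: duplicate rate by business key.
from collections import Counter

rows = [
    {"transaction_id": "T1", "amount": 10},
    {"transaction_id": "T2", "amount": 25},
    {"transaction_id": "T1", "amount": 10},  # retry produced a duplicate
]

counts = Counter(r["transaction_id"] for r in rows)
duplicates = {key: n for key, n in counts.items() if n > 1}
duplicate_rate = (len(rows) - len(counts)) / len(rows)
print(f"Duplicate keys: {duplicates}, duplicate rate: {duplicate_rate:.1%}")
```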
Context: “fit for use” drives the target thresholds
Data quality is not “maximum on all dimensions.” The required thresholds depend on risk, decision impact, and tolerance for delay.
Regulatory/financial reporting typically requires strict accuracy, completeness, auditability, and consistency; late delivery may be less acceptable near close.
Growth experiments and near-real-time decisioning may prioritize timeliness and tolerate limited late corrections, as long as bias and error are understood.
Operational workflows often require high validity (e.g., shipping address formats) even if some optional attributes are incomplete.
Operationalizing data quality
1) Define requirements explicitly (data contracts)
A good practice is to document, for each critical dataset and metric, which dimensions matter, the target thresholds, and the expected delivery time for its primary use case.
2) Implement layered controls
Apply checks at each stage of the pipeline, through to the serving/semantic layer: conformed definitions, metric governance, consistent filters and time logic.
Use severity levels:
Blocker: stop downstream publishing (e.g., primary key not unique)
Warning: publish with known limitations (e.g., slight timeliness breach) and notify consumers
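To make the documented requirements and rule severities concrete, the sketch below shows one possible machine-readable form; the dataset name, rules, and thresholds are hypothetical, not a standard contract format.

```python
# A sketch of documented quality requirements for one dataset, including the
# severity attached to each rule. Dataset, owner, rules, and thresholds are hypothetical.
orders_contract = {
    "dataset": "sales.orders",
    "owner": "order-management-team",
    "rules": [
        {"dimension": "uniqueness",   "check": "order_id is unique",            "severity": "blocker"},
        {"dimension": "completeness", "check": "customer_id null rate <= 0.5%", "severity": "blocker"},
        {"dimension": "timeliness",   "check": "available by 08:00 ET",         "severity": "warning"},
        {"dimension": "validity",     "check": "currency in ISO 4217 code set", "severity": "warning"},
    ],
}

# A blocker failure stops downstream publishing; a warning publishes with a notice.
blockers = [r for r in orders_contract["rules"] if r["severity"] == "blocker"]
print(f"{orders_contract['dataset']}: {len(blockers)} blocker rule(s) must pass before publishing")
```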
3) Measure, monitor, and alert (not just test)
Treat quality as observable system behavior:
Track trends (null rate, duplicate rate, freshness) over time
Set alert thresholds and escalation rules
Report on SLO attainment (e.g., “99% of days delivered by 8:00 AM ET”)
Monitoring is especially important because a test suite can pass while data is still unusable (e.g., distribution drift, upstream business process changes).
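As one example of SLO reporting, the sketch below computes attainment of a hypothetical "delivered by 8:00 AM" target from illustrative arrival times; in practice these timestamps would come from pipeline metadata.

```python
# SLO attainment sketch: share of days the daily load arrived by the 08:00 target.
from datetime import time

SLO_DEADLINE = time(8, 0)  # hypothetical delivery target

daily_arrivals = {
    "2024-05-01": time(7, 42),
    "2024-05-02": time(7, 55),
    "2024-05-03": time(9, 10),  # breach
}

on_time = sum(1 for t in daily_arrivals.values() if t <= SLO_DEADLINE)
attainment = on_time / len(daily_arrivals)
print(f"SLO attainment: {attainment:.1%} of days delivered by {SLO_DEADLINE}")
```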
4) Manage remediation with auditability
When issues occur, define how corrections are delivered:
Backfills and restatements (with clear time windows)
Versioned datasets or snapshotting for traceability
Communication of impact (which dashboards/models were affected)
This aligns with governance expectations for transparency and trust.
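One lightweight way to keep remediation auditable is to record each restatement with its affected window and downstream impact; the sketch below uses hypothetical names and is only illustrative.

```python
# An auditable restatement record: what was corrected, for which time window,
# and which downstream assets were affected. All names are hypothetical.
restatement = {
    "dataset": "sales.orders",
    "reason": "late-arriving EMEA file reprocessed",
    "affected_window": {"from": "2024-04-28", "to": "2024-04-30"},
    "corrected_at": "2024-05-02T14:05:00Z",
    "affected_consumers": ["revenue_dashboard", "demand_forecast_model"],
    "previous_version": "orders_v2024_05_01",  # snapshot kept for traceability
}
print(f"Restated {restatement['dataset']} for {restatement['affected_window']}")
```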
Common pitfalls to avoid
Treating data quality as a one-time cleanup instead of an ongoing capability
Using generic dimensions without specifying measurable rules and thresholds
Over-indexing on completeness (filling nulls) while ignoring validity/accuracy
Creating multiple conflicting metric definitions due to lack of a semantic layer or glossary
Ignoring “data at rest” quality (historical backfills) and only checking the latest load
Key takeaways
Data quality is best defined as fitness for use and operationalized through measurable rules.
The six dimensions—accuracy, completeness, consistency, timeliness, validity, uniqueness—provide a practical structure for requirements and controls.
Sustainable improvement requires governance, ownership, and monitoring, not only technical fixes.
Modern implementations benefit from explicit data contracts, layered controls, and SLO-based observability.