Understanding Data Quality: Beyond Completeness and Accuracy
Why data quality is more than “clean data”
Data quality is the degree to which data is fit for its intended use. In DAMA-DMBOK terms, data quality management is a core data management discipline that defines, measures, monitors, and improves data to meet business expectations. Poor quality data typically shows up as:
- Incorrect decisions (e.g., wrong KPIs, biased model features)
- Operational failures (e.g., failed order fulfillment due to invalid addresses)
- Higher costs (e.g., rework, manual reconciliation, duplicate outreach)
- Loss of trust in analytics and self-service
A practical definition of “good” data therefore must be measurable and explicitly tied to a use case (reporting, operational processing, ML, compliance), not assumed.
Core dimensions of data quality (and how to operationalize them)
Many organizations use a set of commonly accepted dimensions to express requirements and design controls. The six dimensions below are widely used in governance and data quality practices and map well to how rules and metrics are implemented in real systems.
Accuracy
- Definition: Data correctly represents the real-world entity or event it describes.
- How it fails: wrong amounts, wrong customer attributes, incorrect timestamps, incorrect mappings.
- How to measure: compare to an authoritative source (system of record, external validation, reconciliation); calculate error rate and impact.
- Common controls: reconciliations, reference data validation, controlled vocabularies, master data management (MDM) where appropriate.
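As a minimal sketch of the reconciliation approach above, the snippet below compares an analytics extract against a hypothetical system-of-record table and computes an error rate. All table and column names (`warehouse`, `source_of_record`, `customer_id`, `balance`) are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data: an analytics extract and the system of record it should match.
warehouse = pd.DataFrame({"customer_id": [1, 2, 3], "balance": [100.0, 250.0, 80.0]})
source_of_record = pd.DataFrame({"customer_id": [1, 2, 3], "balance": [100.0, 240.0, 80.0]})

# Reconcile on the business key and flag rows where the measured value disagrees.
merged = warehouse.merge(source_of_record, on="customer_id", suffixes=("_dw", "_sor"))
merged["mismatch"] = (merged["balance_dw"] - merged["balance_sor"]).abs() > 0.01

error_rate = merged["mismatch"].mean()
print(f"Accuracy error rate vs. system of record: {error_rate:.1%}")
```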
Completeness
- Definition: Required data is present at the right level of granularity for the use case.
- How it fails: nulls in required fields, missing records, partial history after a pipeline outage.
- How to measure: null rate for required fields; record counts vs. expected; completeness by segment/time window.
- Common controls: required-field checks, ingestion expectations (e.g., “daily file must contain all regions”), backfills with auditable lineage.
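A simple way to operationalize these measures, assuming a hypothetical daily order extract with the column names shown:

```python
import pandas as pd

# Hypothetical daily order extract; column names are illustrative only.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, None, 12, 13],
    "region": ["EU", "US", None, "US"],
})

required_fields = ["order_id", "customer_id", "region"]
null_rates = orders[required_fields].isna().mean()   # null rate per required field
expected_min_rows = 3                                 # assumed expectation for the daily load
row_count_ok = len(orders) >= expected_min_rows

print(null_rates.to_dict())
print("Record count meets expectation:", row_count_ok)
```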
Consistency
- Definition: Data does not contradict itself across datasets, systems, or time.
- How it fails: customer status differs between CRM and billing; metric definitions differ between dashboards; different currencies without conversion.
- How to measure: cross-system reconciliation; referential integrity checks; “same business concept, same definition” checks.
- Common controls: canonical definitions in a semantic layer/metrics layer; conformed dimensions (Kimball); standardized transformation logic.
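One of the measurements above, a referential integrity check, can be sketched as follows; the fact and dimension tables here are assumed names for illustration:

```python
import pandas as pd

# Hypothetical fact and dimension tables.
fact_payments = pd.DataFrame({"payment_id": [1, 2, 3], "customer_id": [10, 11, 99]})
dim_customers = pd.DataFrame({"customer_id": [10, 11, 12]})

# Referential integrity: every fact row should resolve to a known customer.
orphans = fact_payments[~fact_payments["customer_id"].isin(dim_customers["customer_id"])]
orphan_rate = len(orphans) / len(fact_payments)

print(f"Orphaned fact rows: {len(orphans)} ({orphan_rate:.1%})")
```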
Timeliness
- Definition: Data is available when needed and reflects the required recency for the use case.
- How it fails: late-arriving feeds; pipelines succeed but deliver after reporting deadlines; operational actions happen on stale data.
- How to measure: freshness/latency (event time → availability time); SLA/SLO compliance.
- Common controls: pipeline SLAs, alerting on freshness, late data handling patterns (watermarks, reprocessing windows).
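A freshness measurement can be as small as comparing the latest loaded event time to an agreed target; the two-hour SLO below is an assumed example, not a recommendation:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness check: latest loaded event timestamp vs. an assumed SLO.
latest_event_loaded = datetime.now(timezone.utc) - timedelta(hours=3)
freshness_slo = timedelta(hours=2)   # assumed "data no older than 2 hours" target

lag = datetime.now(timezone.utc) - latest_event_loaded
print(f"Freshness lag: {lag}, SLO met: {lag <= freshness_slo}")
```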
Validity
- Definition: Data conforms to defined formats, types, ranges, and business rules.
- How it fails: invalid dates, negative quantities where prohibited, invalid country codes, malformed emails.
- How to measure: rule pass/fail rates; distribution checks (e.g., allowed values, ranges).
- Common controls: schema enforcement, domain constraints, business-rule tests in transformation pipelines, reference data and code sets.
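Rule pass rates are straightforward to compute once rules are expressed as boolean checks; the email pattern, quantity range, and country code set below are illustrative assumptions:

```python
import pandas as pd

# Hypothetical records; the rules below illustrate format, range, and code-set checks.
df = pd.DataFrame({
    "email": ["a@example.com", "not-an-email", "b@example.org"],
    "quantity": [3, -1, 5],
    "country": ["US", "DE", "ZZ"],
})

rules = {
    "email_format": df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "quantity_non_negative": df["quantity"] >= 0,
    "country_in_code_set": df["country"].isin({"US", "DE", "FR", "GB"}),
}

# Pass rate per rule: the share of rows that satisfy the constraint.
for name, passed in rules.items():
    print(f"{name}: {passed.mean():.1%} pass")
```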
Uniqueness
- Definition: Each real-world entity/event is represented once where uniqueness is required.
- How it fails: duplicate customers, repeated transactions due to retries, double-counted events.
- How to measure: duplicate rate by business key; collision checks; idempotency validation.
- Common controls: primary keys, deduplication logic, idempotent ingestion, survivorship rules (often connected to MDM).
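Duplicate rate by business key is a one-line check once the key is defined; here `(source_system, transaction_id)` is an assumed composite key:

```python
import pandas as pd

# Hypothetical transactions; (source_system, transaction_id) is the assumed business key.
txns = pd.DataFrame({
    "source_system": ["pos", "pos", "web", "web"],
    "transaction_id": ["T1", "T1", "T2", "T3"],
})

business_key = ["source_system", "transaction_id"]
dupes = txns.duplicated(subset=business_key, keep="first")
print(f"Duplicate rate on business key: {dupes.mean():.1%}")
```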
Context: “fit for use” drives the target thresholds
Data quality is not “maximum on all dimensions.” The required thresholds depend on risk, decision impact, and tolerance for delay.
- Regulatory/financial reporting typically requires strict accuracy, completeness, auditability, and consistency; late delivery becomes even less acceptable near the financial close.
- Growth experiments and near-real-time decisioning may prioritize timeliness and tolerate limited late corrections, as long as bias and error are understood.
- Operational workflows often require high validity (e.g., shipping address formats) even if some optional attributes are incomplete.
A good practice is to document, for each critical dataset and metric (see the sketch after this list):
- Intended consumers and decisions supported
- Required quality dimensions and thresholds
- Data latency expectations (SLA/SLO)
- Acceptable remediation approach (hotfix, backfill, restatement)
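One lightweight way to capture this documentation is a structured record per dataset. The sketch below is only an illustration; the dataset name, thresholds, and SLO values are assumptions, and in practice this information often lives in a catalog or contract repository rather than code.

```python
from dataclasses import dataclass


@dataclass
class DatasetQualityExpectation:
    """Illustrative record of documented expectations for one critical dataset."""
    dataset: str
    consumers: list[str]
    required_dimensions: dict[str, str]   # dimension -> measurable threshold
    freshness_slo: str
    remediation: str


finance_orders = DatasetQualityExpectation(
    dataset="finance.orders_daily",                        # hypothetical name
    consumers=["month-end close", "revenue dashboard"],
    required_dimensions={
        "completeness": "null rate < 0.5% on required fields",
        "uniqueness": "0 duplicates on order_id",
        "accuracy": "reconciles to billing within 0.1%",
    },
    freshness_slo="available by 06:00 UTC daily",
    remediation="restatement with stakeholder notification",
)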
Connecting data quality to governance (roles and accountability)
Data quality improves sustainably when it is treated as a governance and operating-model concern, not only a technical problem.
- Data owners set expectations aligned to business outcomes and risk.
- Data stewards define rules, reference data, and issue triage processes.
- Data producers (source system teams) and data platform/analytics teams implement controls and remediation.
- Data consumers provide feedback on fitness for use and exceptions.
Common governance artifacts:
- Business glossary for definitions and naming
- Data dictionary and metadata (technical definitions, lineage)
- Data quality rule catalog (what is checked, where, severity)
- Issue management workflow (triage, root cause, resolution, prevention)
Implementing data quality in modern analytics platforms
A practical, scalable approach is to treat data quality as part of the delivery lifecycle (similar to software quality).
1) Define the “data product” contract
For key datasets (tables, views, metrics), document a contract that specifies:
- Schema and grain (what one row represents)
- Business keys and uniqueness constraints
- Required fields and valid value sets
- Freshness expectations and update cadence
- Ownership and support model
This makes expectations explicit and reduces the “silent failure” problem. A minimal sketch of such a contract follows.
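The structure below is an assumed example of what a contract entry might contain; every name, field, and value is illustrative, and real implementations often express this as YAML or catalog metadata rather than Python.

```python
# A minimal, illustrative data contract for one table; all names are assumptions.
orders_contract = {
    "dataset": "analytics.fct_orders",
    "grain": "one row per order line",
    "business_keys": ["order_id", "line_number"],
    "required_fields": ["order_id", "line_number", "order_date", "customer_id"],
    "valid_values": {"order_status": ["placed", "shipped", "cancelled", "returned"]},
    "freshness": {"cadence": "daily", "deadline_utc": "07:00"},
    "owner": "orders-data-product-team",
    "support": "#orders-data (assumed escalation channel)",
}
```

Keeping the contract machine-readable makes it possible to generate checks (uniqueness on business keys, required-field completeness, freshness) directly from it.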
2) Build rule-based checks where failures are cheapest
Place controls at multiple layers:
- Source/ingestion: schema checks, volume checks, basic validity
- Transformation: business-rule checks, referential integrity, deduplication, metric assertions
- Serving/semantic layer: conformed definitions, metric governance, consistent filters and time logic
Use severity levels:
- Blocker: stop downstream publishing (e.g., primary key not unique)
- Warning: publish with known limitations (e.g., slight timeliness breach) and notify consumers
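A minimal sketch of severity-aware checks, assuming hypothetical column names and a simple "raise on blocker, notify on warning" policy:

```python
import pandas as pd

BLOCKER, WARNING = "blocker", "warning"


def run_checks(df: pd.DataFrame) -> list[tuple[str, str, bool]]:
    """Run illustrative checks and return (name, severity, passed) tuples."""
    return [
        ("primary_key_unique", BLOCKER, not df["order_id"].duplicated().any()),
        ("amount_non_negative", BLOCKER, bool((df["amount"] >= 0).all())),
        ("region_populated", WARNING, bool(df["region"].notna().all())),
    ]


df = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 5.0], "region": ["EU", None]})
results = run_checks(df)

blockers_failed = [n for n, sev, ok in results if sev == BLOCKER and not ok]
warnings_failed = [n for n, sev, ok in results if sev == WARNING and not ok]

if blockers_failed:
    raise RuntimeError(f"Publishing stopped, blocker checks failed: {blockers_failed}")
if warnings_failed:
    print(f"Published with known limitations, notify consumers: {warnings_failed}")
```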
3) Measure, monitor, and alert (not just test)
Treat quality as observable system behavior:
- Track trends (null rate, duplicate rate, freshness) over time
- Set alert thresholds and escalation rules
- Report on SLO attainment (e.g., “99% of days delivered by 8:00 AM ET”)
Monitoring is especially important because a test suite can pass while data is still unusable (e.g., distribution drift, upstream business process changes).
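A sketch of trend tracking and SLO reporting over a history of daily observations; the metric values, threshold, and delivery flag below are fabricated for illustration only:

```python
import pandas as pd

# Hypothetical daily observations collected by a monitoring job.
history = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=5, freq="D"),
    "null_rate": [0.002, 0.003, 0.002, 0.015, 0.004],
    "delivered_by_8am_et": [True, True, False, True, True],
})

alert_threshold = 0.01   # assumed alerting threshold for the null-rate trend
breaches = history[history["null_rate"] > alert_threshold]
slo_attainment = history["delivered_by_8am_et"].mean()

print("Days breaching null-rate threshold:", list(breaches["date"].dt.date))
print(f"Delivery SLO attainment: {slo_attainment:.0%}")
```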
4) Manage remediation with auditability
When issues occur, define how corrections are delivered:
- Backfills and restatements (with clear time windows)
- Versioned datasets or snapshotting for traceability
- Communication of impact (which dashboards/models were affected)
This aligns with governance expectations for transparency and trust.
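One way to keep restatements auditable is to write corrections as a new version or snapshot rather than overwriting in place. The sketch below assumes a file-based layout and fabricated data; the paths, window, and version number are illustrative.

```python
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd

# Corrected data for an assumed backfill window; in practice this comes from reprocessing.
corrected = pd.DataFrame({"order_id": [101, 102], "amount": [20.0, 35.5]})

# Record the restatement with traceable metadata instead of silently replacing rows.
corrected["backfill_window"] = "2024-03-01/2024-03-07"
corrected["version"] = 2
corrected["restated_at"] = datetime.now(timezone.utc)

# Writing to a versioned path (or a snapshot table) keeps the prior state available.
out_dir = Path("fct_orders/version=2")
out_dir.mkdir(parents=True, exist_ok=True)
corrected.to_csv(out_dir / "restated.csv", index=False)
```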
Common pitfalls to avoid
- Treating data quality as a one-time cleanup instead of an ongoing capability
- Using generic dimensions without specifying measurable rules and thresholds
- Over-indexing on completeness (filling nulls) while ignoring validity/accuracy
- Creating multiple conflicting metric definitions due to lack of a semantic layer or glossary
- Ignoring “data at rest” quality (historical backfills) and only checking the latest load
Key takeaways
- Data quality is best defined as fitness for use and operationalized through measurable rules.
- The six dimensions—accuracy, completeness, consistency, timeliness, validity, uniqueness—provide a practical structure for requirements and controls.
- Sustainable improvement requires governance, ownership, and monitoring, not only technical fixes.
- Modern implementations benefit from explicit data contracts, layered controls, and SLO-based observability.