Feature engineering is the disciplined process of turning raw, time-dependent data into reliable model inputs that are correct at prediction time. Effective practice combines standard transformation patterns, domain-driven definitions, and strong controls for data quality, leakage prevention, and training/serving consistency. Operationalizing features requires governance, documentation, versioning, and monitoring—often supported by (but not replaced by) a feature store.
Context: why feature engineering is a discipline (not a notebook task)
Feature engineering sits at the boundary between data management and machine learning. In development, it can look like “create a few columns and train a model.” In production, it becomes a repeatable, governed pipeline that must be correct at prediction time, scalable, and auditable.
Common failure modes are rarely about algorithms; they are about data: inconsistent definitions between teams, training/serving skew, leakage, missing point-in-time logic, poor data quality, and undocumented transformations.
Core definitions and the minimum vocabulary
A consistent vocabulary prevents ambiguity and aligns work across data, analytics, and ML teams.
Feature: a measurable input used by a model at prediction time (numeric, categorical, boolean, embedding, etc.).
Label/target: the outcome the model is trained to predict.
Entity: the business object the feature describes (customer, account, device, order).
Feature value timestamp: when the feature value is considered known.
Observation window: the time span used to compute a feature (e.g., “last 30 days”).
Point-in-time correctness: computing features using only data available as of the prediction (or training) time.
Lineage and metadata: where the feature came from, how it was computed, and which upstream datasets and rules it depends on.
Principle 1: start from the decision and the data domain
Feature engineering should be traceable to a decision, not just a dataset.
Define the business/operational decision the model supports (approve a loan, forecast demand, detect fraud).
Identify the entity and prediction granularity (customer-level vs transaction-level).
Specify the prediction moment (“as-of time”) and what is truly known at that moment.
Use domain knowledge to propose signals that reflect mechanisms in the domain (behavioral recency, financial stability, product usage, seasonality), then validate empirically.
A practical rule: each feature should have a clear statement of intent (what it measures) and a plausible relationship to the target.
Principle 2: treat features as managed data assets (governance by design)
From a data management perspective (aligned with DAMA-style practices), features are reusable data products that require ownership and controls.
Ownership: assign a feature owner (or owning team) responsible for definition, quality, and changes.
Standard definitions: maintain a single definition of “active user,” “transaction,” “churn,” etc., and ensure features inherit those definitions.
Metadata: document business meaning, entity, unit of measure, computation logic, refresh frequency, and permissible use.
Access and privacy: classify features that include PII/PHI and apply least-privilege access, retention rules, and approved usage.
When features are treated like governed assets, reuse increases and “silent divergence” across models decreases.
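One lightweight way to make governance concrete is to keep a metadata record next to the transformation code. The sketch below is a hypothetical record expressed as a Python dictionary; the field names, classification scheme, and values are illustrative assumptions, not a prescribed standard.

    # A minimal sketch of a feature metadata record kept alongside the
    # transformation code. Field names and values are illustrative assumptions.
    feature_metadata = {
        "name": "txn_count_30d",
        "entity": "customer_id",
        "description": "Number of completed transactions in the last 30 days",
        "owner": "payments-data-team",
        "unit": "count",
        "computation": "COUNT(transactions) WHERE status = 'completed' AND ts >= as_of - 30 days",
        "refresh_frequency": "daily",
        "data_classification": "internal",   # e.g., internal / PII / PHI
        "permitted_use": ["fraud_scoring", "credit_risk"],
        "version": "1.2.0",
    }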
Principle 3: engineer for point-in-time correctness and leakage prevention
Leakage (using information that would not have been available at prediction time) inflates offline metrics and produces models that fail in production.
Use only data available at prediction time: avoid features that depend on events occurring after the as-of time (e.g., chargebacks, outcomes, post-decision actions).
Enforce point-in-time joins: when joining snapshots or slowly changing dimensions, join using an effective date/time and the correct version of the record.
Separate label windows from feature windows: define clearly what time range is used for features vs what time range defines the label.
Beware proxy leakage: fields like “case closed reason” or “refund issued flag” often encode the outcome.
A production-ready dataset is not just a table; it is an as-of correct reconstruction of what the system knew at that moment.
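A minimal sketch of a point-in-time join in pandas, assuming illustrative table and column names: for each labeled example, only the most recent feature snapshot at or before the as-of time is attached.

    import pandas as pd

    # Sketch of a point-in-time join: for each labeled example, attach the most
    # recent feature snapshot whose timestamp is at or before the as-of time.
    # Table and column names are illustrative assumptions.
    labels = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "as_of_ts": pd.to_datetime(["2024-03-01", "2024-06-01", "2024-06-01"]),
        "label": [0, 1, 0],
    })
    feature_snapshots = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "feature_ts": pd.to_datetime(["2024-02-15", "2024-05-20", "2024-05-31"]),
        "txn_count_30d": [4, 9, 2],
    })

    # merge_asof requires both frames to be sorted on the time key.
    labels = labels.sort_values("as_of_ts")
    feature_snapshots = feature_snapshots.sort_values("feature_ts")

    training_set = pd.merge_asof(
        labels,
        feature_snapshots,
        left_on="as_of_ts",
        right_on="feature_ts",
        by="customer_id",
        direction="backward",   # only snapshots at or before the as-of time
    )

The direction="backward" argument is what enforces the as-of constraint: a snapshot written after the prediction time can never be selected, regardless of how the upstream tables are refreshed.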
Principle 4: use repeatable transformation patterns (and know when they fit)
Many high-value features fall into a small set of patterns. Standardize them so they are easy to review and test.
Aggregations and windowed statistics: counts, sums, averages, min/max, standard deviation over windows (7/30/90 days), often grouped by entity.
Recency and frequency: time since last event, number of events in window, “days active in last N days.”
Ratios and rates: conversion rate, refunds/transactions, utilization/limit (ensure safe handling of zero denominators).
Time and calendar features: day-of-week, hour-of-day, holidays, seasonality indicators; ensure timezone correctness.
Interactions: limited, domain-motivated interactions (e.g., price × discount) rather than unconstrained polynomial expansion.
Keep feature logic deterministic and parameterized (window sizes, filters, entity keys). Avoid “one-off” transformations that are hard to reproduce.
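The sketch below combines several of these patterns in a single parameterized function in pandas (windowed aggregation, recency, and a ratio with a safe denominator). The event schema, column names, and window size are illustrative assumptions.

    import numpy as np
    import pandas as pd

    # Sketch of parameterized windowed features per entity; column names,
    # window sizes, and the event table layout are illustrative assumptions.
    def windowed_features(events: pd.DataFrame, as_of: pd.Timestamp,
                          window_days: int = 30) -> pd.DataFrame:
        start = as_of - pd.Timedelta(days=window_days)
        in_window = events[(events["event_ts"] > start) & (events["event_ts"] <= as_of)]

        agg = in_window.groupby("customer_id").agg(
            txn_count=("amount", "size"),
            txn_sum=("amount", "sum"),
            txn_mean=("amount", "mean"),
            refund_count=("is_refund", "sum"),
            last_event_ts=("event_ts", "max"),
        )
        # Recency: days since the last event inside the window.
        agg["days_since_last_txn"] = (as_of - agg["last_event_ts"]).dt.days
        # Ratio with a safe denominator (avoid division by zero).
        agg["refund_rate"] = np.where(
            agg["txn_count"] > 0, agg["refund_count"] / agg["txn_count"], 0.0
        )
        return agg.drop(columns=["last_event_ts"]).reset_index()

Because the window size, filters, and entity key are parameters rather than hard-coded values, the same function can be reviewed once and reused for 7/30/90-day variants.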
Principle 5: engineer with data quality dimensions in mind
Feature quality depends on upstream data quality. Apply explicit checks aligned with common data quality dimensions.
Completeness: missing rates by entity and segment; ensure missingness is handled intentionally (imputation vs “missing” category).
Validity: values in allowed ranges, correct types, consistent units (currency, time).
Accuracy: reconcile against trusted sources where possible (e.g., financial totals).
Consistency: same definition across systems; stable keys and join logic.
Uniqueness: no duplicate entity keys in snapshots where uniqueness is expected.
Timeliness/freshness: ensure the feature’s refresh schedule matches the use case (real-time decisions vs daily batch scoring).
Treat data tests as part of the analytics development lifecycle: changes should be detectable before they reach a model.
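As an illustration, the following sketch runs a few of these checks over a hypothetical feature snapshot before it is published; the thresholds, column names, and freshness budget are assumptions to adapt per use case, and snapshot_ts is assumed to be timezone-aware UTC.

    import pandas as pd

    # Sketch of explicit data quality checks on a feature snapshot before
    # publication; thresholds and columns are illustrative assumptions.
    def check_feature_table(df: pd.DataFrame) -> list[str]:
        failures = []
        # Completeness: missing rate per feature column.
        for col in ["txn_count_30d", "refund_rate"]:
            missing = df[col].isna().mean()
            if missing > 0.05:
                failures.append(f"{col}: missing rate {missing:.1%} exceeds 5%")
        # Validity: values within allowed ranges.
        if (df["refund_rate"] < 0).any() or (df["refund_rate"] > 1).any():
            failures.append("refund_rate outside [0, 1]")
        # Uniqueness: one row per entity in a snapshot.
        if df["customer_id"].duplicated().any():
            failures.append("duplicate customer_id in snapshot")
        # Timeliness/freshness: snapshot must not be stale.
        age = pd.Timestamp.now(tz="UTC") - df["snapshot_ts"].max()
        if age > pd.Timedelta(days=1):
            failures.append(f"snapshot is {age} old, expected < 1 day")
        return failures

    snapshot = pd.DataFrame({
        "customer_id": [1, 2],
        "txn_count_30d": [4, None],
        "refund_rate": [0.1, 0.0],
        "snapshot_ts": pd.Timestamp.now(tz="UTC"),
    })
    print(check_feature_table(snapshot))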
Principle 6: design for training/serving consistency (avoid skew)
A feature that is computed differently in training and production is a common root cause of model degradation.
Single source of computation: prefer one implementation used in both training and inference (same code, same logic), or a controlled contract if separation is unavoidable.
Consistent backfills: historical recomputation should use the same logic and the same “as-of” assumptions.
Deterministic feature snapshots: store feature values with timestamps so training can reproduce what would have been available.
Handle late-arriving data: define whether you accept backfill corrections or freeze features after a cutoff.
If a feature cannot be computed reliably at serving time, it should not be used (or it should be redesigned).
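One way to reduce skew is to put the feature logic in a single function that both the training pipeline and the serving path call. The sketch below assumes a simple utilization feature and illustrative column names.

    import numpy as np
    import pandas as pd

    # Sketch of a single feature implementation shared by training and serving.
    # Function and column names are illustrative assumptions.
    def compute_utilization(balance: pd.Series, credit_limit: pd.Series) -> pd.Series:
        # Identical logic everywhere: a zero limit yields a null value, which
        # downstream imputation handles consistently in both paths.
        return balance / credit_limit.replace(0, np.nan)

    # Training path (batch): applied to a historical snapshot.
    train_df = pd.DataFrame({"balance": [500.0, 0.0], "credit_limit": [1000.0, 0.0]})
    train_df["utilization"] = compute_utilization(train_df["balance"], train_df["credit_limit"])

    # Serving path (online): the same function applied to a one-row request frame.
    request_df = pd.DataFrame({"balance": [250.0], "credit_limit": [2000.0]})
    request_df["utilization"] = compute_utilization(request_df["balance"], request_df["credit_limit"])

If training and serving must be implemented separately (for example, SQL batch vs. a low-latency service), the shared function is replaced by a tested contract: identical inputs must produce identical outputs across both paths.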
Principle 7: operationalize features with clear architecture choices
Feature operationalization is an architecture concern (aligned with enterprise/data architecture practices).
Batch vs streaming: choose based on decision latency requirements, data availability, and cost.
Offline vs online needs: offline for training/analysis; online for low-latency inference. The “same definition, different serving layer” problem must be addressed explicitly.
Feature store (helpful, not mandatory): a feature store can standardize definitions, provide an offline/online interface, enable reuse, manage versions, and enforce governance. It does not replace data modeling, data quality engineering, or point-in-time logic.
Contracts and SLAs: define refresh frequency, latency, availability targets, and acceptable staleness.
A practical approach is to treat a feature set as a product: define consumers (models), interfaces (schemas), and operational expectations.
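A sketch of what such a feature-set contract might capture, using a hypothetical Python dataclass; the fields, targets, and names are assumptions rather than a standard schema.

    from dataclasses import dataclass, field

    # Sketch of a feature-set "product" contract capturing consumers, interface,
    # and operational expectations. Fields and values are illustrative assumptions.
    @dataclass
    class FeatureSetContract:
        name: str
        entity: str
        schema: dict[str, str]              # column -> type
        refresh_frequency: str              # e.g., "daily batch" or "streaming"
        max_staleness: str                  # acceptable age at read time
        online_latency_ms_p99: int          # latency target for online reads
        availability_target: str            # e.g., "99.9%"
        consumers: list[str] = field(default_factory=list)

    contract = FeatureSetContract(
        name="customer_payment_features",
        entity="customer_id",
        schema={"txn_count_30d": "int", "refund_rate": "float", "utilization": "float"},
        refresh_frequency="daily batch",
        max_staleness="36 hours",
        online_latency_ms_p99=50,
        availability_target="99.9%",
        consumers=["fraud_model_v3", "credit_risk_model_v7"],
    )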
Principle 8: document, version, and review features like code
Features are part of a model’s behavior and should be change-controlled.
Versioning: maintain versions of feature definitions and transformations; changes should be explicit and reviewable.
Documentation: include business meaning, calculation, entity grain, windowing, and known limitations.
Lineage: track upstream sources and dependencies so you can assess impact of schema or logic changes.
Peer review: review for leakage, correctness, privacy, and maintainability.
This reduces “tribal knowledge” and makes models auditable.
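A small sketch of explicit versioning, assuming a hypothetical in-repo registry: each change to a definition becomes a new version, and each model pins the versions it was trained with, so changes stay reviewable and auditable.

    # Sketch of explicit, reviewable feature versioning. Names and structure
    # are illustrative assumptions, not a prescribed format.
    FEATURE_DEFINITIONS = {
        ("active_days_30d", "1.0.0"): {
            "logic": "count of days with >= 1 login event in the last 30 days",
            "window_days": 30,
        },
        ("active_days_30d", "2.0.0"): {
            "logic": "count of days with >= 1 login OR API event in the last 30 days",
            "window_days": 30,
            "change_note": "added API events; reviewed for leakage and privacy",
        },
    }

    # Each model pins the exact definition versions it was trained on.
    MODEL_FEATURE_PINS = {
        "churn_model_v5": [("active_days_30d", "1.0.0")],
        "churn_model_v6": [("active_days_30d", "2.0.0")],
    }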
Principle 9: validate feature usefulness with disciplined experiments
Feature engineering is iterative, but it should be measurable.
Start with exploratory analysis to understand distributions, missingness, and correlations.
Evaluate incremental impact using stable experiment design (cross-validation, time-based splits for temporal problems).
Prefer simpler features that deliver comparable lift; complexity increases operational risk.
Watch for spurious patterns: high-cardinality IDs, unstable seasonality, and features that only work in a narrow time period.
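For temporal problems, incremental impact can be measured with time-ordered splits. The sketch below compares a baseline feature set against the same set plus a candidate feature on synthetic, time-ordered data using scikit-learn's TimeSeriesSplit; the data, model, and metric are illustrative assumptions.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import TimeSeriesSplit

    # Synthetic, time-ordered data standing in for a real training set.
    rng = np.random.default_rng(0)
    n = 2_000
    df = pd.DataFrame({
        "baseline_feat": rng.normal(size=n),
        "candidate_feat": rng.normal(size=n),
    })
    df["label"] = (0.8 * df["baseline_feat"] + 0.4 * df["candidate_feat"]
                   + rng.normal(size=n) > 0).astype(int)

    def cv_auc(feature_cols):
        # Time-ordered splits: each fold trains on the past and tests on the future.
        scores = []
        for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(df):
            model = LogisticRegression().fit(df.loc[train_idx, feature_cols],
                                             df.loc[train_idx, "label"])
            preds = model.predict_proba(df.loc[test_idx, feature_cols])[:, 1]
            scores.append(roc_auc_score(df.loc[test_idx, "label"], preds))
        return float(np.mean(scores))

    baseline_auc = cv_auc(["baseline_feat"])
    with_candidate_auc = cv_auc(["baseline_feat", "candidate_feat"])
    print(f"baseline AUC {baseline_auc:.3f} -> with candidate {with_candidate_auc:.3f}")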
Principle 10: monitor features in production
Even with correct logic, features can drift due to product changes, pipeline issues, or new user behavior.
Freshness monitoring: are features arriving on time?
Volume monitoring: record counts by entity; detect sudden drops.
Distribution monitoring: shifts in mean/variance, quantiles, category frequency.
Null and default rate monitoring: rising missingness or default-value rates are often an early signal of an upstream problem.
Model-performance linkage: correlate feature anomalies with model metric changes.
Monitoring closes the loop and supports reliable ML operations.
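As one example of distribution monitoring, the sketch below computes a population stability index (PSI) between a training-time reference and recent production values on synthetic data; the binning, epsilon, and the 0.2 alert level are common heuristics rather than fixed rules.

    import numpy as np

    # Sketch of a distribution-shift check using the population stability index
    # (PSI); bin count, epsilon, and alert threshold are illustrative assumptions.
    def psi(reference: np.ndarray, current: np.ndarray, n_bins: int = 10) -> float:
        # Bin edges come from the reference (training-time) distribution.
        edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
        edges[0], edges[-1] = -np.inf, np.inf
        ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
        cur_frac = np.histogram(current, bins=edges)[0] / len(current)
        # Small floor avoids log(0) for empty bins.
        ref_frac = np.clip(ref_frac, 1e-6, None)
        cur_frac = np.clip(cur_frac, 1e-6, None)
        return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

    rng = np.random.default_rng(1)
    reference = rng.normal(0, 1, 10_000)       # feature values at training time
    current = rng.normal(0.3, 1.2, 10_000)     # recent production values
    print(f"PSI = {psi(reference, current):.3f}")   # > 0.2 is a common alert heuristic

In practice a check like this runs per feature on a schedule, and its alerts are correlated with model metric changes rather than acted on in isolation.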
Common pitfalls (and how to avoid them)
Leakage masked as “great features”: enforce strict as-of timestamps and independent review.
Ambiguous definitions: define entity grain, filters, and window boundaries; align with canonical business definitions.
Overfitting through excessive interactions: keep interactions limited and justified.
Unstable features: avoid features that are highly volatile or depend on external systems without SLAs.
Ignoring privacy and ethics: classify sensitive attributes; restrict use and document allowable purposes.
Key takeaways
Feature engineering is data management plus ML: it requires governance, quality controls, and reproducible logic.
Point-in-time correctness and training/serving consistency are non-negotiable for production reliability.
Standard transformation patterns, strong metadata, and operational monitoring turn features into reusable, trustworthy assets.
Feature stores can help with reuse and consistency, but only when combined with disciplined definitions, testing, and governance.