What is the main role of data engineering in machine learning systems?
To provide reliable, well-documented pipelines that deliver feature and label data in the right shape, quality, and cadence for training and serving models.
What is a feature in ML?
An input variable used by a model to make predictions, typically derived from raw data through transformations and aggregations.
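As a minimal sketch of deriving a feature from raw data (all record and field names here are illustrative, not from any real system):

```python
# Derive one feature (mean transaction amount per user) from raw
# transaction records via a simple aggregation.
transactions = [
    {"user_id": 1, "amount": 20.0},
    {"user_id": 1, "amount": 35.0},
    {"user_id": 2, "amount": 5.0},
]

def avg_amount_per_user(rows):
    """Aggregate raw fields into a per-user feature value."""
    totals, counts = {}, {}
    for r in rows:
        totals[r["user_id"]] = totals.get(r["user_id"], 0.0) + r["amount"]
        counts[r["user_id"]] = counts.get(r["user_id"], 0) + 1
    return {uid: totals[uid] / counts[uid] for uid in totals}

features = avg_amount_per_user(transactions)
# features[1] == 27.5, features[2] == 5.0
```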
What is a label in supervised learning?
The target variable that the model is trained to predict, representing known outcomes for historical examples.
Why is feature engineering often more impactful than model tweaking?
Better features can reveal relevant structure in data and improve signal-to-noise, while model changes may offer smaller gains on poor features.
(Example: adding a "days since last purchase" feature to a churn model often helps more than swapping the model architecture.)
What is the difference between a feature and a raw field from a source system?
Features are cleaned, standardized, and engineered for modeling, while raw fields are direct outputs from source systems that may include noise and idiosyncrasies.
What is a feature pipeline?
A data pipeline that computes, stores, and serves features for training and inference, often from raw logs or transactional data.
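The compute-store-serve shape of a feature pipeline can be sketched as below; the in-memory dict standing in for a feature store and all field names are assumptions for illustration:

```python
# Minimal feature pipeline sketch: read raw events, compute an
# aggregate feature, and materialize it to a (toy, in-memory) store.
raw_events = [
    {"user_id": "u1", "clicks": 3},
    {"user_id": "u1", "clicks": 2},
    {"user_id": "u2", "clicks": 7},
]

def run_feature_pipeline(events, store):
    clicks = {}
    for e in events:
        clicks[e["user_id"]] = clicks.get(e["user_id"], 0) + e["clicks"]
    for uid, total in clicks.items():
        store[uid] = {"total_clicks": total}  # materialize per entity
    return store

feature_store = run_feature_pipeline(raw_events, {})
# feature_store["u1"]["total_clicks"] == 5
```

In a real system the store would be a warehouse table (offline) or key-value service (online), but the pipeline shape is the same.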
Why is it useful to separate feature pipelines from model code?
It decouples data preparation from modeling logic, enabling feature reuse across models and clearer ownership boundaries between data work and algorithm work.
What is an offline feature store conceptually?
A storage layer that holds historical feature values for training and batch scoring, usually on a warehouse or data lake.
What is an online feature store conceptually?
A low-latency store that serves the latest feature values for real-time inference, often based on key-value or cache technology.
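A toy sketch of the online-store pattern, keeping only the latest value per (entity, feature) key for fast reads (the class and method names are illustrative, not a real feature-store API):

```python
import time

class OnlineFeatureStore:
    """Toy key-value online store: retains only the most recent value
    per (entity_id, feature_name) key, mimicking low-latency serving."""

    def __init__(self):
        self._kv = {}

    def put(self, entity_id, feature_name, value):
        # Newer writes overwrite older ones; only "latest" is served.
        self._kv[(entity_id, feature_name)] = (value, time.time())

    def get(self, entity_id, feature_name):
        item = self._kv.get((entity_id, feature_name))
        return item[0] if item else None

store = OnlineFeatureStore()
store.put("u1", "total_clicks", 5)
store.put("u1", "total_clicks", 6)  # newer value wins
# store.get("u1", "total_clicks") == 6
```

Production systems typically back this with Redis, DynamoDB, or similar key-value technology.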
Why is consistency between offline and online features critical?
If features differ between training and serving, models will see a different input distribution in production, causing performance degradation (train–serve skew).
What is train–serve skew?
A mismatch between the data and features used during training and those seen at inference time in production.
What are common causes of train–serve skew?
Different feature code paths offline vs online, time-dependent features not computed the same way, or using future information in training that is unavailable at inference.
How can you reduce train–serve skew?
Reuse the same feature definitions in offline and online pipelines, avoid using future data, and thoroughly test end-to-end feature flows.
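The "reuse the same feature definitions" advice can be sketched as a single function called from both the batch and real-time paths (function and field names are hypothetical):

```python
# One feature definition shared by the offline (training) and online
# (serving) code paths, so both compute the feature identically.
def session_length_minutes(start_ts: float, end_ts: float) -> float:
    """Single source of truth for the feature logic."""
    return max(0.0, (end_ts - start_ts) / 60.0)

def offline_batch(rows):
    # Training path: feature computed over historical rows.
    return [session_length_minutes(r["start"], r["end"]) for r in rows]

def online_request(payload):
    # Serving path: same function, one request at a time.
    return session_length_minutes(payload["start"], payload["end"])

# Both paths agree by construction:
# offline_batch([{"start": 0, "end": 600}]) == [10.0]
# online_request({"start": 0, "end": 600}) == 10.0
```

Duplicating this logic in two codebases (e.g. SQL offline, Java online) is a classic source of skew.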
What is a snapshot training dataset?
A static extract of features and labels at a given time, used to train a model on a specific snapshot of history.
What is a point-in-time correct training dataset?
A dataset that ensures features for each training example are computed only from information available up to the event time, avoiding leakage from the future.
What is data leakage in ML datasets?
Using information in training that would not be available at prediction time, leading to overly optimistic metrics and poor performance in production.
What are examples of leakage from feature engineering?
Using post-outcome flags as features, aggregating over future periods, or joining labels back into features inadvertently.
(A concrete case: using a "chargeback_filed" flag as a fraud feature, when the flag is only set after the fraud outcome is known.)
Why is point-in-time correctness challenging?
It requires storing historical states of features and carefully designing joins and windows so future updates do not contaminate past examples.
What is a common pattern for building point-in-time features?
Store time-stamped events and compute features using window functions or aggregations restricted to times before the label event.
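This pattern can be sketched as a windowed aggregation that only admits events strictly before the label event time (field names and the 30-day window are illustrative assumptions):

```python
from datetime import datetime, timedelta

# Time-stamped raw events; the last one falls AFTER the label event.
events = [
    {"user": "u1", "amount": 10, "ts": datetime(2024, 1, 1)},
    {"user": "u1", "amount": 20, "ts": datetime(2024, 1, 5)},
    {"user": "u1", "amount": 99, "ts": datetime(2024, 1, 9)},  # future
]

def pit_sum(events, user, label_ts, window_days=30):
    """Sum amounts in a window ending strictly before the label event,
    so no post-outcome information leaks into the feature."""
    lo = label_ts - timedelta(days=window_days)
    return sum(e["amount"] for e in events
               if e["user"] == user and lo <= e["ts"] < label_ts)

label_ts = datetime(2024, 1, 8)
feature = pit_sum(events, "u1", label_ts)
# feature == 30: the Jan 9 event is excluded, preventing leakage
```

In SQL the same idea is a windowed join with an `event_ts < label_ts` predicate.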
What is target leakage via look-ahead windows?
When feature windows accidentally extend beyond the label event time, effectively including information about the outcome in inputs.
What is a label generation pipeline?
A pipeline that derives target values from raw events or transactional data, applying definitions and time windows consistently.
Why should label definitions be documented and versioned?
Changing label logic over time can alter what the model is learning; versioning makes experiments reproducible and auditable.
What is a positive-unlabeled (PU) situation in labels?
A setting where some positive cases are labeled while the remaining examples are unlabeled (a mix of true negatives and undetected positives), common in fraud and anomaly detection.
Why is sampling important for ML training datasets?
Full historical data may be too large or imbalanced; sampling can improve training efficiency and class balance while preserving key patterns.
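One common sampling pattern is to keep all positives and downsample the majority negative class to a fixed ratio; this sketch (with an assumed `ratio` parameter and illustrative data) shows the idea:

```python
import random

def downsample_negatives(examples, ratio=1.0, seed=42):
    """Keep every positive example; randomly keep ratio * n_pos
    negatives, reducing size and class imbalance at once."""
    rng = random.Random(seed)  # seeded for reproducible datasets
    pos = [e for e in examples if e["label"] == 1]
    neg = [e for e in examples if e["label"] == 0]
    k = min(len(neg), int(ratio * len(pos)))
    return pos + rng.sample(neg, k)

# 10 positives, 990 negatives: heavily imbalanced raw data.
data = ([{"id": i, "label": 1} for i in range(10)]
        + [{"id": i, "label": 0} for i in range(10, 1000)])
sample = downsample_negatives(data, ratio=2.0)
# 10 positives + 20 sampled negatives = 30 training examples
```

If the model outputs probabilities, downsampling shifts the base rate, so predicted scores may need recalibration afterwards.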