What is the main role of data engineering in machine learning systems?
To provide reliable, well-documented pipelines that deliver feature and label data in the right shape, quality, and cadence for training and serving models.
What is a feature in ML?
An input variable used by a model to make predictions, typically derived from raw data through transformations and aggregations.
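As a minimal sketch of deriving a feature from raw data (all record and field names here are illustrative, not from any real system):

```python
# Derive one feature (mean transaction amount per user) from raw
# transaction records via a simple aggregation.
transactions = [
    {"user_id": 1, "amount": 20.0},
    {"user_id": 1, "amount": 35.0},
    {"user_id": 2, "amount": 5.0},
]

def avg_amount_per_user(rows):
    """Aggregate raw fields into a per-user feature value."""
    totals, counts = {}, {}
    for r in rows:
        totals[r["user_id"]] = totals.get(r["user_id"], 0.0) + r["amount"]
        counts[r["user_id"]] = counts.get(r["user_id"], 0) + 1
    return {uid: totals[uid] / counts[uid] for uid in totals}

features = avg_amount_per_user(transactions)
# features[1] == 27.5, features[2] == 5.0
```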
What is a label in supervised learning?
The target variable that the model is trained to predict, representing known outcomes for historical examples.
Why is feature engineering often more impactful than model tweaking?
Better features can reveal relevant structure in data and improve signal-to-noise, while model changes may offer smaller gains on poor features.
(Example: adding a "days since last purchase" feature to a churn model often helps more than swapping the model architecture.)
What is the difference between a feature and a raw field from a source system?
Features are cleaned, standardized, and engineered for modeling, while raw fields are direct outputs from source systems that may include noise and idiosyncrasies.
What is a feature pipeline?
A data pipeline that computes, stores, and serves features for training and inference, often from raw logs or transactional data.
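The compute-store-serve shape of a feature pipeline can be sketched as below; the in-memory dict standing in for a feature store and all field names are assumptions for illustration:

```python
# Minimal feature pipeline sketch: read raw events, compute an
# aggregate feature, and materialize it to a (toy, in-memory) store.
raw_events = [
    {"user_id": "u1", "clicks": 3},
    {"user_id": "u1", "clicks": 2},
    {"user_id": "u2", "clicks": 7},
]

def run_feature_pipeline(events, store):
    clicks = {}
    for e in events:
        clicks[e["user_id"]] = clicks.get(e["user_id"], 0) + e["clicks"]
    for uid, total in clicks.items():
        store[uid] = {"total_clicks": total}  # materialize per entity
    return store

feature_store = run_feature_pipeline(raw_events, {})
# feature_store["u1"]["total_clicks"] == 5
```

In a real system the store would be a warehouse table (offline) or key-value service (online), but the pipeline shape is the same.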
Why is it useful to separate feature pipelines from model code?
It decouples data preparation from modeling logic, enabling feature reuse across models and clearer ownership boundaries between data work and algorithm work.
What is an offline feature store conceptually?
A storage layer that holds historical feature values for training and batch scoring, usually on a warehouse or data lake.
What is an online feature store conceptually?
A low-latency store that serves the latest feature values for real-time inference, often based on key-value or cache technology.
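A toy sketch of the online-store pattern, keeping only the latest value per (entity, feature) key for fast reads (the class and method names are illustrative, not a real feature-store API):

```python
import time

class OnlineFeatureStore:
    """Toy key-value online store: retains only the most recent value
    per (entity_id, feature_name) key, mimicking low-latency serving."""

    def __init__(self):
        self._kv = {}

    def put(self, entity_id, feature_name, value):
        # Newer writes overwrite older ones; only "latest" is served.
        self._kv[(entity_id, feature_name)] = (value, time.time())

    def get(self, entity_id, feature_name):
        item = self._kv.get((entity_id, feature_name))
        return item[0] if item else None

store = OnlineFeatureStore()
store.put("u1", "total_clicks", 5)
store.put("u1", "total_clicks", 6)  # newer value wins
# store.get("u1", "total_clicks") == 6
```

Production systems typically back this with Redis, DynamoDB, or similar key-value technology.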
Why is consistency between offline and online features critical?
If features differ between training and serving, models will see a different input distribution in production, causing performance degradation (train–serve skew).
What is train–serve skew?
A mismatch between the data and features used during training and those seen at inference time in production.
What are common causes of train–serve skew?
Different feature code paths offline vs online, time-dependent features not computed the same way, or using future information in training that is unavailable at inference.
How can you reduce train–serve skew?
Reuse the same feature definitions in offline and online pipelines, avoid using future data, and thoroughly test end-to-end feature flows.
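The "reuse the same feature definitions" advice can be sketched as a single function called from both the batch and real-time paths (function and field names are hypothetical):

```python
# One feature definition shared by the offline (training) and online
# (serving) code paths, so both compute the feature identically.
def session_length_minutes(start_ts: float, end_ts: float) -> float:
    """Single source of truth for the feature logic."""
    return max(0.0, (end_ts - start_ts) / 60.0)

def offline_batch(rows):
    # Training path: feature computed over historical rows.
    return [session_length_minutes(r["start"], r["end"]) for r in rows]

def online_request(payload):
    # Serving path: same function, one request at a time.
    return session_length_minutes(payload["start"], payload["end"])

# Both paths agree by construction:
# offline_batch([{"start": 0, "end": 600}]) == [10.0]
# online_request({"start": 0, "end": 600}) == 10.0
```

Duplicating this logic in two codebases (e.g. SQL offline, Java online) is a classic source of skew.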
What is a snapshot training dataset?
A static extract of features and labels at a given time, used to train a model on a specific snapshot of history.
What is a point-in-time correct training dataset?
A dataset that ensures features for each training example are computed only from information available up to the event time, avoiding leakage from the future.
What is data leakage in ML datasets?
Using information in training that would not be available at prediction time, leading to overly optimistic metrics and poor performance in production.
What are examples of leakage from feature engineering?
Using post-outcome flags as features, aggregating over future periods, or joining labels back into features inadvertently.
(A concrete case: using a "chargeback_filed" flag as a fraud feature, when the flag is only set after the fraud outcome is known.)
Why is point-in-time correctness challenging?
It requires storing historical states of features and carefully designing joins and windows so future updates do not contaminate past examples.
What is a common pattern for building point-in-time features?
Store time-stamped events and compute features using window functions or aggregations restricted to times before the label event.
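This pattern can be sketched as a windowed aggregation that only admits events strictly before the label event time (field names and the 30-day window are illustrative assumptions):

```python
from datetime import datetime, timedelta

# Time-stamped raw events; the last one falls AFTER the label event.
events = [
    {"user": "u1", "amount": 10, "ts": datetime(2024, 1, 1)},
    {"user": "u1", "amount": 20, "ts": datetime(2024, 1, 5)},
    {"user": "u1", "amount": 99, "ts": datetime(2024, 1, 9)},  # future
]

def pit_sum(events, user, label_ts, window_days=30):
    """Sum amounts in a window ending strictly before the label event,
    so no post-outcome information leaks into the feature."""
    lo = label_ts - timedelta(days=window_days)
    return sum(e["amount"] for e in events
               if e["user"] == user and lo <= e["ts"] < label_ts)

label_ts = datetime(2024, 1, 8)
feature = pit_sum(events, "u1", label_ts)
# feature == 30: the Jan 9 event is excluded, preventing leakage
```

In SQL the same idea is a windowed join with an `event_ts < label_ts` predicate.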
What is target leakage via look-ahead windows?
When feature windows accidentally extend beyond the label event time, effectively including information about the outcome in inputs.
What is a label generation pipeline?
A pipeline that derives target values from raw events or transactional data, applying definitions and time windows consistently.
Why should label definitions be documented and versioned?
Changing label logic over time can alter what the model is learning; versioning makes experiments reproducible and auditable.
What is a positive-unlabeled (PU) situation in labels?
A setting where some positive cases are labeled while the remaining examples are unlabeled (a mix of true negatives and undetected positives), common in fraud and anomaly detection.
Why is sampling important for ML training datasets?
Full historical data may be too large or imbalanced; sampling can improve training efficiency and class balance while preserving key patterns.
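One common sampling pattern is to keep all positives and downsample the majority negative class to a fixed ratio; this sketch (with an assumed `ratio` parameter and illustrative data) shows the idea:

```python
import random

def downsample_negatives(examples, ratio=1.0, seed=42):
    """Keep every positive example; randomly keep ratio * n_pos
    negatives, reducing size and class imbalance at once."""
    rng = random.Random(seed)  # seeded for reproducible datasets
    pos = [e for e in examples if e["label"] == 1]
    neg = [e for e in examples if e["label"] == 0]
    k = min(len(neg), int(ratio * len(pos)))
    return pos + rng.sample(neg, k)

# 10 positives, 990 negatives: heavily imbalanced raw data.
data = ([{"id": i, "label": 1} for i in range(10)]
        + [{"id": i, "label": 0} for i in range(10, 1000)])
sample = downsample_negatives(data, ratio=2.0)
# 10 positives + 20 sampled negatives = 30 training examples
```

If the model outputs probabilities, downsampling shifts the base rate, so predicted scores may need recalibration afterwards.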