What is the main role of Databricks in ML workflows?
To provide a scalable environment for feature engineering, model training, and batch scoring using Spark, Python, and integrated tooling like MLflow.
Why is Databricks well-suited for feature engineering at scale?
It uses Spark DataFrames to process large datasets on clusters, enabling joins, aggregations, and transformations over TB-scale data efficiently.
How can you decide between distributed training on Spark vs single-node training on Databricks?
Use distributed training when the data or model does not fit in a single machine’s memory; use single-node (e.g., scikit-learn, XGBoost) when data and model fit and distributed overhead is unnecessary.
What is MLflow at a high level?
An open-source platform integrated into Databricks that tracks experiments, manages models, and packages them for reproducible deployment.
What are the three main components of MLflow most relevant on Databricks?
MLflow Tracking, MLflow Models, and MLflow Model Registry.
What is MLflow Tracking used for?
Logging parameters, metrics, artifacts, and code versions for model runs so they can be compared and reproduced.
What is an MLflow run?
A single execution of training or evaluation code whose parameters, metrics, and artifacts are logged as a unit in the tracking system.
What is an MLflow experiment?
A logical grouping of related runs, often associated with a project, notebook, or modeling task.
Why is organizing runs into experiments important?
It makes it easier to compare models, search past runs, and maintain a tidy history of work by project or use case.
What kinds of information are typically logged in an MLflow run?
Hyperparameters, training and validation metrics, model artifacts (pickled models, pipelines), plots, and environment information.
What is MLflow autologging?
A feature that automatically records parameters, metrics, and models for certain ML libraries without requiring manual log calls.
When is MLflow autologging especially convenient?
During iterative experimentation with libraries like scikit-learn, XGBoost, or Spark MLlib, where basic logging can be automated.
Why might you still use explicit logging calls with MLflow even when autologging is enabled?
To log custom metrics, artifacts, or metadata that autologging does not capture by default.
What is an MLflow Model?
A packaged model format that can include code, environment specifications, and multiple ‘flavors’ (e.g., Python function, Spark, sklearn) for deployment.
What is MLflow Model Registry?
A centralized store for managing versions and lifecycle states of models, such as staging and production, with metadata and approvals.
Why use a Model Registry instead of just saving models to files or paths?
It provides versioning, stage transitions (e.g., staging → production), audit trails, and a single source of truth for deployment targets.
What are typical model stages in MLflow Model Registry?
None, Staging, Production, and Archived (in the workspace Model Registry; newer MLflow versions deprecate stages in favor of model version aliases).
Why is having a ‘Staging’ stage useful?
It allows testing models in non-production environments before promoting them to Production, enforcing a review workflow.
How do Databricks Jobs relate to MLflow and model training?
Jobs can run training notebooks or scripts that log runs to MLflow, making training pipelines repeatable and schedulable.
What is a typical pattern for a training job on Databricks?
Load features from Delta tables, train a model, evaluate it, log results and artifacts to MLflow, and optionally register or update a model in the registry.
How can you implement hyperparameter tuning on Databricks?
By looping over parameter configurations or using libraries like Hyperopt or Spark ML tuning APIs, and logging each configuration as an MLflow run.
Why is logging hyperparameter search results to MLflow valuable?
It allows comparison across configurations, visualizing metric trends, and re-running promising setups later.
What is batch scoring on Databricks?
Applying a trained model to a large dataset stored in Delta or Parquet to generate predictions, often as a scheduled job.
How is batch scoring commonly implemented with Spark on Databricks?
Load features as a DataFrame, apply a UDF or Pandas UDF that wraps the model, and write predictions back to a Delta table or other sink.