What is the main role of Databricks in ML workflows?
To provide a scalable environment for feature engineering, model training, and batch scoring using Spark, Python, and integrated tooling like MLflow.
Why is Databricks well-suited for feature engineering at scale?
It uses Spark DataFrames to process large datasets on clusters, enabling joins, aggregations, and transformations over TB-scale data efficiently.
How can you decide between distributed training on Spark vs single-node training on Databricks?
Use distributed training when the data or model does not fit in a single machine’s memory; use single-node (e.g., scikit-learn, XGBoost) when data and model fit and distributed overhead is unnecessary.
What is MLflow at a high level?
An open-source platform integrated into Databricks that tracks experiments, manages models, and packages them for reproducible deployment.
What are the three main components of MLflow most relevant on Databricks?
MLflow Tracking, MLflow Models, and MLflow Model Registry.
What is MLflow Tracking used for?
Logging parameters, metrics, artifacts, and code versions for model runs so they can be compared and reproduced.
What is an MLflow run?
A single execution of training or evaluation code whose parameters, metrics, and artifacts are logged as a unit in the tracking system.
What is an MLflow experiment?
A logical grouping of related runs, often associated with a project, notebook, or modeling task.
Why is organizing runs into experiments important?
It makes it easier to compare models, search past runs, and maintain a tidy history of work by project or use case.
What kinds of information are typically logged in an MLflow run?
Hyperparameters, training and validation metrics, model artifacts (pickled models, pipelines), plots, and environment information.
What is MLflow autologging?
A feature that automatically records parameters, metrics, and models for certain ML libraries without requiring manual log calls.
When is MLflow autologging especially convenient?
During iterative experimentation with libraries like scikit-learn, XGBoost, or Spark MLlib, where basic logging can be automated.
Why might you still use explicit logging calls with MLflow even when autologging is enabled?
To log custom metrics, artifacts, or metadata that autologging does not capture by default.
What is an MLflow Model?
A packaged model format that can include code, environment specifications, and multiple ‘flavors’ (e.g., Python function, Spark, sklearn) for deployment.
What is MLflow Model Registry?
A centralized store for managing versions and lifecycle states of models, such as staging and production, with metadata and approvals.
Why use a Model Registry instead of just saving models to files or paths?
It provides versioning, stage transitions (e.g., staging → production), audit trails, and a single source of truth for deployment targets.
What are typical model stages in MLflow Model Registry?
None, Staging, Production, and Archived (in the workspace Model Registry; newer MLflow versions deprecate stages in favor of model version aliases).
Why is having a ‘Staging’ stage useful?
It allows testing models in non-production environments before promoting them to Production, enforcing a review workflow.
How do Databricks Jobs relate to MLflow and model training?
Jobs can run training notebooks or scripts that log runs to MLflow, making training pipelines repeatable and schedulable.
What is a typical pattern for a training job on Databricks?
Load features from Delta tables, train a model, evaluate it, log results and artifacts to MLflow, and optionally register or update a model in the registry.
How can you implement hyperparameter tuning on Databricks?
By looping over parameter configurations or using libraries like Hyperopt or Spark ML tuning APIs, and logging each configuration as an MLflow run.
Why is logging hyperparameter search results to MLflow valuable?
It allows comparison across configurations, visualizing metric trends, and re-running promising setups later.
What is batch scoring on Databricks?
Applying a trained model to a large dataset stored in Delta or Parquet to generate predictions, often as a scheduled job.
How is batch scoring commonly implemented with Spark on Databricks?
Load features as a DataFrame, apply a UDF or Pandas UDF that wraps the model, and write predictions back to a Delta table or other sink.