Databricks Repos, CI/CD, and Dev Workflow Flashcards

(28 cards)

1
Q

What are Databricks Repos at a high level?

A

An integration that links a directory in the Databricks workspace to a Git repository, enabling version-controlled development inside Databricks.

2
Q

Why is using Repos preferable to ad hoc notebook copies for serious projects?

A

Repos keep code in Git with history, branching, and review, preventing drift between workspace copies and the source of truth.

3
Q

What kinds of files can be managed in a Databricks Repo?

A

Notebooks, Python modules, SQL files, configs, and other project artifacts that should be version controlled.

4
Q

How does a typical branch-based workflow look with Databricks Repos?

A

Developers work on feature branches in Repos, commit and push to Git, open pull/merge requests, and merge into main once the changes are reviewed.

5
Q

Why should production jobs reference code from a Repo rather than ad hoc workspace notebooks?

A

So production code is versioned, reviewed, and reproducible, and changes flow through the same CI/CD pipeline as other software.

6
Q

What is the benefit of separating notebooks into ‘development’ and ‘production’ directories?

A

It discourages editing live production notebooks, keeps experimental work distinct, and clarifies which notebooks/jobs are stable.

7
Q

How can you structure a Databricks ML or DE project repo?

A

With folders for data prep, feature pipelines, training, evaluation, utilities, and job entrypoints, plus configuration files for environments.

8
Q

Why is it useful to centralize shared utilities in Python modules within the repo?

A

It avoids copy-pasting code across notebooks and allows reuse and testing of common logic.

9
Q

What is the role of configuration files (e.g., YAML/JSON) in a Databricks project?

A

They externalize environment-specific settings like paths, table names, and parameters so the same code can run in dev, staging, and prod.
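A minimal sketch of this pattern, assuming JSON config files named per environment (the loader, file layout, and keys are illustrative, not a Databricks API):

```python
import json
from pathlib import Path

def load_config(env: str, config_dir: str = "conf") -> dict:
    """Load environment-specific settings (paths, table names, parameters).

    Assumes files like conf/dev.json and conf/prod.json; the names and
    structure here are illustrative.
    """
    path = Path(config_dir) / f"{env}.json"
    return json.loads(path.read_text())

# The same pipeline code can then read different tables per environment,
# e.g. conf/dev.json might contain: {"orders_table": "dev.sales.orders"}
```

The pipeline entrypoint receives the environment name (for example as a job parameter) and resolves everything else from the file, so dev, staging, and prod run identical code.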

10
Q

How does CI integrate with Databricks development?

A

CI pipelines can run unit tests, linting, and small integration tests against code in the repo whenever changes are pushed or merged.

11
Q

What tool can be used to run tests for Databricks code outside the workspace?

A

Standard Python testing frameworks like pytest, combined with mocks or local Spark sessions, and CLI tools or APIs to target Databricks when needed.

12
Q

Why is it valuable to have unit tests for transformation logic?

A

They catch regressions early, make refactoring safer, and ensure business logic is correct independently of the cluster.
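For example, transformation logic written as a pure function can be tested without any cluster (a pytest-style sketch; the function and field names are made up, and plain Python structures stand in for DataFrame logic to keep it cluster-free):

```python
def add_vat(order: dict, rate: float = 0.2) -> dict:
    """Business logic kept in a module, not a notebook, so it is testable."""
    return {**order, "total_inc_vat": round(order["total"] * (1 + rate), 2)}

# pytest-style test: runs locally and in CI with no Databricks cluster.
def test_add_vat():
    assert add_vat({"total": 100.0})["total_inc_vat"] == 120.0
```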

13
Q

What is an integration test in the context of Databricks projects?

A

A test that runs end-to-end or multi-step flows on small datasets, often against a dev cluster or temporary Delta locations.

14
Q

How can Databricks Jobs be connected to CI/CD?

A

CI/CD pipelines can update job definitions (via Terraform or APIs) after successful tests and then trigger or validate test runs in dev/staging workspaces.
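A hedged sketch of the API side (the `jobs/reset` endpoint is part of the Databricks Jobs API 2.1, which overwrites a job definition in place; the host, token, and settings below are placeholders supplied by the CI/CD pipeline, not real values):

```python
import json
import urllib.request

def build_reset_request(job_id: int, new_settings: dict) -> dict:
    """Payload for POST /api/2.1/jobs/reset, which replaces a job's
    definition (e.g. after tests pass in CI)."""
    return {"job_id": job_id, "new_settings": new_settings}

def reset_job(host: str, token: str, job_id: int, new_settings: dict) -> bytes:
    """Send the update; host and token come from CI/CD secrets, never code."""
    req = urllib.request.Request(
        f"{host}/api/2.1/jobs/reset",
        data=json.dumps(build_reset_request(job_id, new_settings)).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Terraform's Databricks provider achieves the same thing declaratively; the raw API call is shown only to make the mechanism concrete.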

15
Q

What is the advantage of defining Databricks resources with infrastructure-as-code (IaC) tools?

A

IaC provides versioned, reviewable definitions for clusters, jobs, and permissions, enabling reproducible environments and controlled changes.

16
Q

Which IaC tools are commonly used with Databricks?

A

Terraform with the Databricks provider, and sometimes ARM templates/CloudFormation wrappers or custom scripts using the REST API.

17
Q

Why should promotion from dev to prod be automated rather than manual UI edits?

A

Automation reduces human error, ensures changes are tracked and repeatable, and makes it clear which code version is running where.

18
Q

What metadata is important to record for each deployment?

A

Git commit SHA, Databricks Runtime (DBR) version, cluster config, job definitions, and the environment variables or config versions used in that release.

19
Q

How can feature flags be used in Databricks projects?

A

By controlling code paths with configuration or parameters, allowing gradual rollout or quick disabling of new logic without redeployment.
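A minimal sketch of a config-driven flag (the flag name, config shape, and dedup logic are all illustrative):

```python
def transform(rows: list[dict], flags: dict) -> list[dict]:
    """Branch on a flag read from config so behavior can change per run."""
    if flags.get("use_new_dedup", False):
        # New logic, rolled out gradually by flipping config, not redeploying.
        seen, out = set(), []
        for r in rows:
            if r["id"] not in seen:
                seen.add(r["id"])
                out.append(r)
        return out
    return rows  # old behavior: pass rows through unchanged
```

Flipping `use_new_dedup` in the job's configuration enables or disables the new path instantly, and keeps a fast rollback available if the new logic misbehaves.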

20
Q

What is a typical dev → stage → prod workflow on Databricks?

A

Develop and test on dev workspace, promote code and IaC changes to staging for realistic tests and backfills, then promote to prod once validated.

21
Q

Why is having separate Databricks workspaces/accounts for dev and prod beneficial?

A

It isolates experimental work from production data and jobs, reduces blast radius, and allows different access and cost controls.

22
Q

How can notebooks fit into a CI/CD pipeline?

A

By treating them as code files in a repo, testing underlying Python modules, and using job entrypoint notebooks that are deployed or referenced via IaC.

23
Q

What is a good practice for making notebooks more testable?

A

Minimize business logic in notebooks, moving most code into importable modules with tests, and let notebooks primarily orchestrate and visualize.
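Concretely, the split can look like this (module and function names are illustrative; the notebook cell is shown as comments since `display` and `spark` only exist in the workspace):

```python
# my_project/cleaning.py -- importable, unit-testable logic
def drop_null_ids(rows):
    """Pure transformation: testable without a notebook or a cluster."""
    return [r for r in rows if r.get("id") is not None]

# Notebook cell (orchestration and visualization only):
# from my_project.cleaning import drop_null_ids
# display(spark.createDataFrame(drop_null_ids(raw_rows)))
```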

24
Q

Why is code review important for Databricks repos?

A

It helps catch errors, enforce standards, and share knowledge across data and ML engineers working on shared pipelines.

25
Q

What is the benefit of tagging releases in Git for Databricks projects?

A

Tags identify code snapshots tied to deployments, making it easier to reproduce or roll back to a specific version if needed.

26
Q

How do you ensure local development and Databricks environments stay consistent?

A

Use requirements files or environment specs, test code with the same library versions locally and in DBR, and avoid relying on ad hoc workspace-only installs.

27
Q

What is a runbook and how does it relate to CI/CD?

A

A documented procedure for handling deployments, failures, and rollbacks that complements automated pipelines with clear manual steps when needed.

28
Q

In one sentence, what is the core mental model for Repos, CI/CD, and dev workflow on Databricks?

A

Treat Databricks projects like any other software: keep code in Git, test and review it, define jobs and clusters as code, and promote changes through dev→stage→prod with automated, observable pipelines.