Databricks Repos, CI/CD, and Dev Workflow Flashcards

(28 cards)

1
Q

What are Databricks Repos at a high level?

A

An integration that links a directory in the Databricks workspace to a Git repository, enabling version-controlled development inside Databricks.

2
Q

Why is using Repos preferable to ad hoc notebook copies for serious projects?

A

Repos keep code in Git with history, branching, and review, preventing drift between workspace copies and the source of truth.

3
Q

What kinds of files can be managed in a Databricks Repo?

A

Notebooks, Python modules, SQL files, configs, and other project artifacts that should be version controlled.

4
Q

How does a typical branch-based workflow look with Databricks Repos?

A

Developers work on feature branches in Repos, commit and push to Git, open pull/merge requests, and merge into main once the changes are reviewed.

5
Q

Why should production jobs reference code from a Repo rather than ad hoc workspace notebooks?

A

So production code is versioned, reviewed, and reproducible, and changes flow through the same CI/CD pipeline as other software.

6
Q

What is the benefit of separating notebooks into ‘development’ and ‘production’ directories?

A

It discourages editing live production notebooks, keeps experimental work distinct, and clarifies which notebooks/jobs are stable.

7
Q

How can you structure a Databricks ML or DE project repo?

A

With folders for data prep, feature pipelines, training, evaluation, utilities, and job entrypoints, plus configuration files for environments.

8
Q

Why is it useful to centralize shared utilities in Python modules within the repo?

A

It avoids copy-pasting code across notebooks and allows reuse and testing of common logic.

9
Q

What is the role of configuration files (e.g., YAML/JSON) in a Databricks project?

A

They externalize environment-specific settings like paths, table names, and parameters so the same code can run in dev, staging, and prod.
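A minimal sketch of this pattern, assuming JSON config files named per environment (the loader, file layout, and keys are illustrative, not a Databricks API):

```python
import json
from pathlib import Path

def load_config(env: str, config_dir: str = "conf") -> dict:
    """Load environment-specific settings (paths, table names, parameters).

    Assumes files like conf/dev.json and conf/prod.json; the names and
    structure here are illustrative.
    """
    path = Path(config_dir) / f"{env}.json"
    return json.loads(path.read_text())

# The same pipeline code can then read different tables per environment,
# e.g. conf/dev.json might contain: {"orders_table": "dev.sales.orders"}
```

The pipeline entrypoint receives the environment name (for example as a job parameter) and resolves everything else from the file, so dev, staging, and prod run identical code.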

10
Q

How does CI integrate with Databricks development?

A

CI pipelines can run unit tests, linting, and small integration tests against code in the repo whenever changes are pushed or merged.

11
Q

What tool can be used to run tests for Databricks code outside the workspace?

A

Standard Python testing frameworks like pytest, combined with mocks or local Spark sessions, and CLI tools or APIs to target Databricks when needed.

12
Q

Why is it valuable to have unit tests for transformation logic?

A

They catch regressions early, make refactoring safer, and ensure business logic is correct independently of the cluster.
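For example, transformation logic written as a pure function can be tested without any cluster (a pytest-style sketch; the function and field names are made up, and plain Python structures stand in for DataFrame logic to keep it cluster-free):

```python
def add_vat(order: dict, rate: float = 0.2) -> dict:
    """Business logic kept in a module, not a notebook, so it is testable."""
    return {**order, "total_inc_vat": round(order["total"] * (1 + rate), 2)}

# pytest-style test: runs locally and in CI with no Databricks cluster.
def test_add_vat():
    assert add_vat({"total": 100.0})["total_inc_vat"] == 120.0
```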

13
Q

What is an integration test in the context of Databricks projects?

A

A test that runs end-to-end or multi-step flows on small datasets, often against a dev cluster or temporary Delta locations.

14
Q

How can Databricks Jobs be connected to CI/CD?

A

CI/CD pipelines can update job definitions (via Terraform or APIs) after successful tests and then trigger or validate test runs in dev/staging workspaces.
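A hedged sketch of the API side (the `jobs/reset` endpoint is part of the Databricks Jobs API 2.1, which overwrites a job definition in place; the host, token, and settings below are placeholders supplied by the CI/CD pipeline, not real values):

```python
import json
import urllib.request

def build_reset_request(job_id: int, new_settings: dict) -> dict:
    """Payload for POST /api/2.1/jobs/reset, which replaces a job's
    definition (e.g. after tests pass in CI)."""
    return {"job_id": job_id, "new_settings": new_settings}

def reset_job(host: str, token: str, job_id: int, new_settings: dict) -> bytes:
    """Send the update; host and token come from CI/CD secrets, never code."""
    req = urllib.request.Request(
        f"{host}/api/2.1/jobs/reset",
        data=json.dumps(build_reset_request(job_id, new_settings)).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Terraform's Databricks provider achieves the same thing declaratively; the raw API call is shown only to make the mechanism concrete.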

15
Q

What is the advantage of defining Databricks resources with infrastructure-as-code (IaC) tools?

A

IaC provides versioned, reviewable definitions for clusters, jobs, and permissions, enabling reproducible environments and controlled changes.

16
Q

Which IaC tools are commonly used with Databricks?

A

Terraform with the Databricks provider, and sometimes ARM templates/CloudFormation wrappers or custom scripts using the REST API.

17
Q

Why should promotion from dev to prod be automated rather than manual UI edits?

A

Automation reduces human error, ensures changes are tracked and repeatable, and makes it clear which code version is running where.

18
Q

What metadata is important to record for each deployment?

A

Git commit SHA, Databricks Runtime (DBR) version, cluster config, job definitions, and the environment variables or config versions used in that release.

19
Q

How can feature flags be used in Databricks projects?

A

By controlling code paths with configuration or parameters, allowing gradual rollout or quick disabling of new logic without redeployment.
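A minimal sketch of a config-driven flag (the flag name, config shape, and dedup logic are all illustrative):

```python
def transform(rows: list[dict], flags: dict) -> list[dict]:
    """Branch on a flag read from config so behavior can change per run."""
    if flags.get("use_new_dedup", False):
        # New logic, rolled out gradually by flipping config, not redeploying.
        seen, out = set(), []
        for r in rows:
            if r["id"] not in seen:
                seen.add(r["id"])
                out.append(r)
        return out
    return rows  # old behavior: pass rows through unchanged
```

Flipping `use_new_dedup` in the job's configuration enables or disables the new path instantly, and keeps a fast rollback available if the new logic misbehaves.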

20
Q

What is a typical dev → stage → prod workflow on Databricks?

A

Develop and test on dev workspace, promote code and IaC changes to staging for realistic tests and backfills, then promote to prod once validated.

21
Q

Why is having separate Databricks workspaces/accounts for dev and prod beneficial?

A

It isolates experimental work from production data and jobs, reduces blast radius, and allows different access and cost controls.

22
Q

How can notebooks fit into a CI/CD pipeline?

A

By treating them as code files in a repo, testing underlying Python modules, and using job entrypoint notebooks that are deployed or referenced via IaC.

23
Q

What is a good practice for making notebooks more testable?

A

Minimize business logic in notebooks, moving most code into importable modules with tests, and let notebooks primarily orchestrate and visualize.
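Concretely, the split can look like this (module and function names are illustrative; the notebook cell is shown as comments since `display` and `spark` only exist in the workspace):

```python
# my_project/cleaning.py -- importable, unit-testable logic
def drop_null_ids(rows):
    """Pure transformation: testable without a notebook or a cluster."""
    return [r for r in rows if r.get("id") is not None]

# Notebook cell (orchestration and visualization only):
# from my_project.cleaning import drop_null_ids
# display(spark.createDataFrame(drop_null_ids(raw_rows)))
```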

24
Q

Why is code review important for Databricks repos?

A

It helps catch errors, enforce standards, and share knowledge across data and ML engineers working on shared pipelines.

25
Q

What is the benefit of tagging releases in Git for Databricks projects?

A

Tags identify code snapshots tied to deployments, making it easier to reproduce or roll back to a specific version if needed.

26
Q

How do you ensure local development and Databricks environments stay consistent?

A

Use requirements files or environment specs, test code with the same library versions locally and in DBR, and avoid relying on ad hoc workspace-only installs.

27
Q

What is a runbook and how does it relate to CI/CD?

A

A documented procedure for handling deployments, failures, and rollbacks that complements automated pipelines with clear manual steps when needed.

28
Q

In one sentence, what is the core mental model for Repos, CI/CD, and dev workflow on Databricks?

A

Treat Databricks projects like any other software: keep code in Git, test and review it, define jobs and clusters as code, and promote changes through dev→stage→prod with automated, observable pipelines.