Mental Models Flashcards

(38 cards)

1
Q

What is Databricks at a high level?

A

A unified analytics and data engineering platform built around Apache Spark that provides managed compute, collaborative notebooks, and a lakehouse-style data architecture.

2
Q

How does Databricks relate to Apache Spark?

A

Databricks is built by the creators of Spark and provides a managed, optimized Spark runtime along with additional services and tooling.

3
Q

Where does Databricks typically sit in a modern data platform?

A

As the primary compute and transformation layer on top of cloud object storage, supporting ETL, batch, streaming, analytics, and ML.

4
Q

What is the ‘lakehouse’ concept associated with Databricks?

A

A data architecture that combines the flexibility of data lakes with the reliability and performance of data warehouses using a single storage layer and table format like Delta Lake.

5
Q

What are the three core concerns Databricks tries to unify?

A

Data engineering, data science/ML, and analytics/BI on a shared platform and storage layer.

6
Q

Why is Databricks often used alongside cloud data warehouses like Snowflake or Redshift?

A

Databricks excels at heavy ETL, Spark-based processing, and ML, while warehouses often remain the primary SQL-serving and BI semantic layer.

7
Q

What is the primary storage layer underlying Databricks workloads?

A

Cloud object storage (e.g., S3, ADLS, GCS) accessed via the Databricks File System (DBFS) and table formats like Delta Lake.

8
Q

What is DBFS conceptually?

A

An abstraction that presents object storage (and some local storage) in a file system-like interface for Databricks clusters and notebooks.

9
Q

Why is it important to understand that DBFS ultimately sits on object storage?

A

Because it affects performance characteristics, file immutability, partitioning strategies, and cost patterns familiar from data lakes.

10
Q

What is a Databricks workspace?

A

A logical environment that organizes notebooks, clusters, jobs, repos, and permissions for a team or project.

11
Q

What is a Databricks cluster at a high level?

A

A set of compute resources (driver and workers) managed by Databricks, used to run Spark jobs, notebooks, and SQL queries.

12
Q

What is an all-purpose cluster?

A

A cluster intended for interactive use, such as notebooks and ad hoc development, often shared by multiple users.

13
Q

What is a job cluster?

A

A cluster that is created for a specific job or workflow run and typically terminates when the job completes, providing isolation and cost control.

14
Q

Why is the distinction between all-purpose and job clusters important?

A

It affects cost, reproducibility, isolation, and how you design dev vs production workflows.

15
Q

What is a notebook in Databricks?

A

An interactive environment for writing code (e.g., Python, SQL, Scala), running commands, and visualizing results, tied to a cluster.

16
Q

Why are notebooks popular for data engineering and ML work?

A

They support iterative exploration, visualization, and collaboration, while still allowing scheduling and productionization via jobs.

17
Q

What are Databricks Jobs?

A

Managed workflows that schedule and run notebooks, JARs, Python scripts, or multi-task DAGs on clusters, with retry and monitoring support.
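The multi-task DAG idea above can be sketched as a job definition. This is a hedged illustration in the style of a Databricks Jobs API payload, written as a Python dict; the notebook paths, cluster settings, and job name are hypothetical.

```python
# Hedged sketch of a two-task job in the style of a Databricks Jobs API
# payload; paths, cluster sizes, and names here are hypothetical.
job_definition = {
    "name": "nightly_medallion_pipeline",
    "tasks": [
        {
            "task_key": "ingest_bronze",
            "notebook_task": {"notebook_path": "/Repos/team/pipeline/ingest"},
            "new_cluster": {  # job cluster: created per run, terminated after
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
            "max_retries": 2,  # built-in retry support
        },
        {
            "task_key": "build_silver",
            "depends_on": [{"task_key": "ingest_bronze"}],  # DAG edge
            "notebook_task": {"notebook_path": "/Repos/team/pipeline/clean"},
        },
    ],
}

# Downstream tasks run only after their dependencies succeed.
dag_edges = [
    (dep["task_key"], task["task_key"])
    for task in job_definition["tasks"]
    for dep in task.get("depends_on", [])
]
print(dag_edges)  # [('ingest_bronze', 'build_silver')]
```

The same structure is what an external orchestrator like Airflow would submit or trigger when it manages Databricks runs from a larger workflow.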

18
Q

How do Databricks Jobs relate to external orchestrators like Airflow or Step Functions?

A

Jobs can be orchestrated by Databricks itself or triggered and managed from external orchestrators in larger workflows.

19
Q

What is a Databricks SQL warehouse (formerly SQL endpoint)?

A

A compute resource optimized for SQL workloads, allowing BI tools and users to run SQL queries against Delta tables via JDBC/ODBC.

20
Q

Why is Databricks SQL relevant for analytics teams?

A

It provides a warehouse-like experience on top of the lakehouse, enabling dashboards and ad hoc SQL using familiar BI tools.

21
Q

What is Delta Lake in the Databricks ecosystem?

A

An open table storage format that brings ACID transactions, schema evolution, and time travel to data stored on object storage.
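The versioning behind "time travel" can be illustrated with a toy in plain Python. This is a sketch only: real Delta Lake keeps a transaction log of commits on object storage and is read through Spark, and the `ToyVersionedTable` class here is purely hypothetical.

```python
# Toy illustration of Delta-style versioning ("time travel").
# Real Delta Lake stores a transaction log on object storage and is
# accessed via Spark; this hypothetical class only mimics the idea.
class ToyVersionedTable:
    def __init__(self):
        self._versions = []  # each commit appends a complete snapshot

    def commit(self, rows):
        # Commits are atomic: readers only ever see a finished snapshot.
        self._versions.append(list(rows))

    def read(self, version=None):
        # version=None reads the latest snapshot; an int reads history.
        if not self._versions:
            return []
        idx = len(self._versions) - 1 if version is None else version
        return self._versions[idx]

t = ToyVersionedTable()
t.commit([{"id": 1}])
t.commit([{"id": 1}, {"id": 2}])
print(t.read())           # latest snapshot: two rows
print(t.read(version=0))  # "time travel": the first snapshot
```

Schema evolution and ACID guarantees follow the same pattern: every change is a new committed version of the table, so concurrent readers never see a half-written state.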

22
Q

How does Delta Lake support the lakehouse idea?

A

By adding transactional semantics, reliability, and performance optimizations on top of raw data lake storage, allowing warehouse-like tables.

23
Q

What do the layers of a ‘bronze/silver/gold’ medallion architecture mean on Databricks?

A

Bronze holds raw ingested data, silver holds cleaned and standardized data, and gold holds curated, business-ready tables and aggregates.
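The bronze/silver/gold flow can be sketched with plain Python lists; on Databricks these layers would be Delta tables written by Spark jobs, so treat this as a toy illustration of the layering only.

```python
# Toy sketch of the medallion flow using plain Python;
# in practice each layer is a Delta table populated by Spark.
bronze = [  # bronze: raw ingested events, warts and all
    {"user": "a", "amount": "10.5"},
    {"user": "b", "amount": "bad"},
    {"user": "a", "amount": "4.5"},
]

# Silver: cleaned, typed records (unparseable rows are dropped here).
silver = []
for row in bronze:
    try:
        silver.append({"user": row["user"], "amount": float(row["amount"])})
    except ValueError:
        pass  # quarantine/skip bad records at the bronze→silver boundary

# Gold: business-ready aggregate (total amount per user).
gold = {}
for row in silver:
    gold[row["user"]] = gold.get(row["user"], 0.0) + row["amount"]

print(gold)  # {'a': 15.0}
```

The explicit hand-offs between layers are what make lineage and quality boundaries visible: each layer only reads from the one before it.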

24
Q

Why is the medallion architecture popular on Databricks?

A

It aligns well with Delta tables, Spark pipelines, and lakehouse principles, and makes lineage and quality boundaries explicit.

25
Q

What data engineering tasks is Databricks especially strong at?

A

Large-scale ETL/ELT, streaming ingestion and processing, complex transformations, and building unified batch/stream pipelines.

26
Q

What ML tasks are naturally supported by Databricks?

A

Feature engineering at scale, model training and tuning using Spark or single-node libraries, experiment tracking with MLflow, and batch scoring.

27
Q

What does Databricks add on top of raw Spark?

A

Managed clusters, optimized runtimes, Delta Lake integration, collaborative notebooks, job scheduling, MLflow, and governance features.

28
Q

Why is cluster lifecycle management a key benefit of Databricks?

A

It abstracts provisioning, scaling, and termination of Spark clusters, letting engineers focus on code instead of low-level infrastructure.

29
Q

How does Databricks help with performance compared to vanilla Spark?

A

Through optimized runtimes, auto-tuning features, Delta Lake optimizations, and cluster/configuration management.

30
Q

What is Unity Catalog at a conceptual level?

A

Databricks' governance and metadata layer for tables, views, and other assets, providing centralized access control and lineage.

31
Q

Why is centralized governance (Unity Catalog-style) important in a lakehouse?

A

It ensures consistent permissions, auditability, and lineage for data that might otherwise be spread across many object paths and clusters.

32
Q

How should you think about Databricks vs a traditional ETL tool?

A

Databricks is more of a programmable, Spark-based compute platform with notebooks and jobs, rather than a GUI-only pipeline designer.

33
Q

How should you think about Databricks vs a BI platform?

A

Databricks provides the data and compute engine; BI tools sit on top to visualize and explore data via SQL warehouses or connectors.

34
Q

Why is it helpful to separate dev, staging, and prod workspaces or environments in Databricks?

A

To test code and configuration safely before impacting production data and workloads, and to manage permissions and spend by environment.

35
Q

What roles typically interact with Databricks?

A

Data engineers, ML engineers/data scientists, analytics engineers, and platform engineers managing clusters and governance.

36
Q

Why is it important to keep Databricks projects repo-driven rather than notebook-only?

A

Repos and version control support CI/CD, code review, testing, and reproducible deployments, while notebooks alone can become brittle and ad hoc.

37
Q

What is the mental model for Databricks from a cost perspective?

A

You primarily pay for compute (clusters, SQL warehouses), while storage costs live in the underlying object store and can be optimized via Delta and lifecycle policies.

38
Q

In one sentence, what is the core mental model for Databricks in your stack?

A

Databricks is your programmable, managed Spark and Delta Lake engine on top of cloud object storage, where you do heavy ETL, streaming, and ML in a lakehouse architecture.