What is Databricks at a high level?
A unified analytics and data engineering platform built around Apache Spark that provides managed compute, collaborative notebooks, and a lakehouse-style data architecture.
How does Databricks relate to Apache Spark?
Databricks is built by the creators of Spark and provides a managed, optimized Spark runtime along with additional services and tooling.
Where does Databricks typically sit in a modern data platform?
As the primary compute and transformation layer on top of cloud object storage, supporting ETL, batch, streaming, analytics, and ML.
What is the ‘lakehouse’ concept associated with Databricks?
A data architecture that combines the flexibility of data lakes with the reliability and performance of data warehouses using a single storage layer and table format like Delta Lake.
What are the three core concerns Databricks tries to unify?
Data engineering, data science/ML, and analytics/BI on a shared platform and storage layer.
Why is Databricks often used alongside cloud data warehouses like Snowflake or Redshift?
Databricks excels at heavy ETL, Spark-based processing, and ML, while warehouses often remain the primary SQL-serving and BI semantic layer.
What is the primary storage layer underlying Databricks workloads?
Cloud object storage (e.g., S3, ADLS, GCS), surfaced through the Databricks File System (DBFS) and organized with table formats like Delta Lake.
What is DBFS conceptually?
An abstraction that presents object storage (and some local storage) in a file system-like interface for Databricks clusters and notebooks.
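As a rough illustration (bucket, mount point, and paths below are hypothetical), the same data can be addressed through the underlying object-store URI, a DBFS path, or the POSIX-style view cluster nodes expose:

```python
# Illustrative path forms only; no storage is accessed here.
# Bucket name, mount point, and directories are hypothetical.
object_store_uri = "s3://example-bucket/raw/events/"  # direct cloud object storage URI
dbfs_path = "dbfs:/mnt/raw/events/"                   # same data seen through a DBFS mount
local_view = "/dbfs/mnt/raw/events/"                  # POSIX-style view on cluster nodes

# In a notebook you might read either form with Spark, e.g.
# spark.read.format("delta").load(dbfs_path)  -- shown as a comment, not executed here.
```

Because all three forms resolve to the same object-store data, the immutability and listing costs of object storage apply regardless of which path style you use.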
Why is it important to understand that DBFS ultimately sits on object storage?
Because it affects performance characteristics, file immutability, partitioning strategies, and cost patterns familiar from data lakes.
What is a Databricks workspace?
A logical environment that organizes notebooks, clusters, jobs, repos, and permissions for a team or project.
What is a Databricks cluster at a high level?
A set of compute resources (driver and workers) managed by Databricks, used to run Spark jobs, notebooks, and SQL queries.
What is an all-purpose cluster?
A cluster intended for interactive use, such as notebooks and ad hoc development, often shared by multiple users.
What is a job cluster?
A cluster that is created for a specific job or workflow run and typically terminates when the job completes, providing isolation and cost control.
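A hedged sketch of what this looks like in practice: a Jobs-API-style payload (shown as a Python dict; the job name, notebook path, runtime label, and node type are all illustrative), where a `new_cluster` block requests an ephemeral job cluster instead of reusing an all-purpose one:

```python
# Hypothetical Jobs-API-style task definition (not submitted anywhere).
# The "new_cluster" block asks Databricks to create a cluster for this run
# and tear it down when the run finishes.
job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "transform",
            "notebook_task": {"notebook_path": "/Repos/team/etl/transform"},
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",  # example runtime label
                "node_type_id": "i3.xlarge",          # example AWS node type
                "num_workers": 4,
            },
        }
    ],
}
```

Pointing the task at an `existing_cluster_id` instead would run it on a shared all-purpose cluster, trading isolation and per-run cost control for faster startup.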
Why is the distinction between all-purpose and job clusters important?
It affects cost, reproducibility, isolation, and how you design dev vs production workflows.
What is a notebook in Databricks?
An interactive environment for writing code (e.g., Python, SQL, Scala), running commands, and visualizing results, tied to a cluster.
Why are notebooks popular for data engineering and ML work?
They support iterative exploration, visualization, and collaboration, while still allowing scheduling and productionization via jobs.
What are Databricks Jobs?
Managed workflows that schedule and run notebooks, JARs, Python scripts, or multi-task DAGs on clusters, with retry and monitoring support.
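As a sketch of a multi-task DAG (task keys, notebook paths, and the retry count are hypothetical), again written as a Python dict mirroring the Jobs API shape: the `depends_on` entries are the DAG edges the scheduler follows.

```python
# Hypothetical multi-task workflow: "load" must succeed before "aggregate" runs.
workflow = {
    "name": "daily-pipeline",
    "tasks": [
        {
            "task_key": "load",
            "notebook_task": {"notebook_path": "/Repos/team/etl/load"},
            "max_retries": 2,  # retries handled by the Jobs service, not your code
        },
        {
            "task_key": "aggregate",
            "depends_on": [{"task_key": "load"}],  # DAG edge: load -> aggregate
            "notebook_task": {"notebook_path": "/Repos/team/etl/aggregate"},
        },
    ],
}

# Collect the dependency edges that form the DAG.
edges = [(d["task_key"], t["task_key"])
         for t in workflow["tasks"]
         for d in t.get("depends_on", [])]
print(edges)  # [('load', 'aggregate')]
```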
How do Databricks Jobs relate to external orchestrators like Airflow or Step Functions?
Jobs can be orchestrated by Databricks itself or triggered and managed from external orchestrators in larger workflows.
What is a Databricks SQL warehouse (formerly SQL endpoint)?
A compute resource optimized for SQL workloads, allowing BI tools and users to run SQL queries against Delta tables via JDBC/ODBC.
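As a hedged sketch, a BI tool's JDBC connection to a SQL warehouse is driven by two pieces of information: the workspace hostname and the warehouse's HTTP path (both placeholders below). The URL shape is an assumption based on the Databricks JDBC driver; check the driver documentation for the exact options your version expects.

```python
# Placeholder hostname and warehouse HTTP path; nothing connects here.
host = "adb-1234567890123456.7.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/abc123def456"

# Assumed URL shape for the Databricks JDBC driver (verify against driver docs).
jdbc_url = f"jdbc:databricks://{host}:443;httpPath={http_path}"
print(jdbc_url)
```

ODBC connections use the same hostname and HTTP path, supplied as DSN fields rather than a URL.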
Why is Databricks SQL relevant for analytics teams?
It provides a warehouse-like experience on top of the lakehouse, enabling dashboards and ad hoc SQL using familiar BI tools.
What is Delta Lake in the Databricks ecosystem?
An open table format that brings ACID transactions, schema enforcement and evolution, and time travel to data stored on object storage.
How does Delta Lake support the lakehouse idea?
By adding transactional semantics, reliability, and performance optimizations on top of raw data lake storage, allowing warehouse-like tables.
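A minimal sketch of the features named above, written as Delta SQL statements held in Python strings (the `sales` table is hypothetical, and the statements are not executed here):

```python
# Hypothetical Delta table; these strings illustrate ACID tables, schema
# evolution, and time travel. They are not run anywhere in this sketch.
create_table = "CREATE TABLE sales (id BIGINT, amount DOUBLE) USING DELTA"
schema_evolution = "ALTER TABLE sales ADD COLUMNS (region STRING)"
time_travel = "SELECT * FROM sales VERSION AS OF 3"  # read an older table snapshot
time_travel_ts = "SELECT * FROM sales TIMESTAMP AS OF '2024-01-01'"
```

Each write to a Delta table produces a new version in its transaction log, which is what makes the `VERSION AS OF` / `TIMESTAMP AS OF` reads possible.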
What do the layers of a ‘bronze/silver/gold’ medallion architecture mean on Databricks?
Bronze holds raw ingested data, silver holds cleaned and standardized data, and gold holds curated, business-ready tables and aggregates.
Why is the medallion architecture popular on Databricks?
It aligns well with Delta tables, Spark pipelines, and lakehouse principles, and makes lineage and quality boundaries explicit.
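The bronze → silver → gold flow can be sketched in plain Python (in practice each layer would be a Delta table transformed with Spark; the records and field names here are illustrative):

```python
# Bronze: raw ingested records, possibly messy (untrimmed names, string numbers).
bronze = [
    {"user": " alice ", "amount": "10.5"},
    {"user": "bob", "amount": "7"},
    {"user": " alice ", "amount": "4.5"},
]

# Silver: cleaned and standardized (trim names, cast amounts to numbers).
silver = [{"user": r["user"].strip(), "amount": float(r["amount"])} for r in bronze]

# Gold: curated, business-ready aggregate (total spend per user).
gold = {}
for r in silver:
    gold[r["user"]] = gold.get(r["user"], 0.0) + r["amount"]

print(gold)  # {'alice': 15.0, 'bob': 7.0}
```

Keeping each layer as its own table makes the quality boundary explicit: bronze preserves the raw input for replay, while downstream consumers only ever read silver or gold.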