Databricks Flashcards

(9 cards)

1
Q

What are Databricks' main strengths that can also be weaknesses?

A

⚡ DATABRICKS MAIN STRENGTHS — AND THEIR SHADOW SIDES

1. Open-Source Foundation (Spark, Delta Lake)
    • Strength: Built on open, battle-tested frameworks; no vendor lock-in at the storage layer; strong community and ecosystem.
    • Weakness: Requires more technical expertise and can lead to complex configurations; open doesn't mean easy.
2. Unified Data Platform (Lakehouse Vision)
    • Strength: Combines data warehouse and data lake; handles BI, AI, and ML in one system.
    • Weakness: "All-in-one" can be overkill for teams that just need simple analytics; may lack the polish or ease of best-in-class BI/SQL tools.
3. First-Class Support for ML & AI
    • Strength: Built-in notebooks, MLflow, and feature stores; Python-native and supports the full ML lifecycle.
    • Weakness: Heavy on code-first workflows; requires strong engineering and data science skills.
4. Streaming and Real-Time Capabilities
    • Strength: Strong support for Structured Streaming, with event-time processing and exactly-once semantics.
    • Weakness: Streaming setups are complex to manage and not "plug and play" for most teams.
5. Deep Flexibility and Customization
    • Strength: Supports Python, Scala, SQL, R, and Java; total control over pipeline logic and environments.
    • Weakness: Flexibility leads to inconsistency if not governed; steeper learning curve and more maintenance.
6. Open Table Format (Delta Lake)
    • Strength: ACID transactions on data lakes; works with S3, ADLS, and GCS; interoperable with other engines.
    • Weakness: Delta Lake requires cluster compute to access efficiently; less abstracted and automatically optimized than Snowflake's proprietary columnar storage.
7. Fine-Grained Performance Tuning
    • Strength: Full visibility into Spark jobs, execution plans, and resource usage.
    • Weakness: Manual optimization is often required; harder to manage at scale without a mature team.
8. Collaborative Notebooks & Workflow Orchestration
    • Strength: Integrated development environment with real-time collaboration; Unity Catalog and Workflows streamline pipelines.
    • Weakness: Notebooks can lead to spaghetti code and poor reproducibility without discipline.
9. Strong Cloud-Native & DevOps Integration
    • Strength: Supports Terraform, GitOps, REST APIs, and CI/CD pipelines.
    • Weakness: Setup is not trivial and requires DevOps maturity; Snowflake remains easier for non-technical teams.

2
Q

What are Databricks' 5 main selling points?

A

Databricks has become one of the most popular platforms in data and AI, and its strength lies in how it combines multiple capabilities into a single unified platform.

  1. 🔥 Lakehouse Architecture
  2. ⚙️ Unified Platform for Data Engineering, Analytics, and AI
  3. 💾 Delta Lake for Reliable and Performant Storage
  4. 🤖 Best-in-Class AI & ML Support
  5. 🔓 Open Source + Multicloud Flexibility

Bonus: 🧠 Performance + Governance

3
Q

Explain Lakehouse Architecture as a main selling point of Databricks

A
  1. 🔥 Lakehouse Architecture
    • Combines the scalability of data lakes (like S3 or GCS) with the structure and performance of data warehouses.
    • Enables you to store all your data in open formats (like Parquet + Delta Lake) while supporting SQL analytics, BI, streaming, and ML/AI — all on the same platform.
    • Reduces the need for separate ETL pipelines or duplicate data silos (data warehouse + data lake).

4
Q

Explain Unified Platform for Data Engineering, Analytics, and AI as a main selling point of Databricks

A
  2. ⚙️ Unified Platform for Data Engineering, Analytics, and AI
    • One workspace for data engineers, data scientists, analysts, and ML engineers.
    • Supports SQL, Python, Scala, R, and Spark natively.
    • Fully integrates notebooks, jobs, pipelines, ML lifecycle, and governance in a collaborative environment.

5
Q

Explain Delta Lake for Reliable and Performant Storage as a main selling point of Databricks

A
  3. 💾 Delta Lake for Reliable and Performant Storage
    • Delta Lake, developed by Databricks, brings ACID transactions, schema evolution, time travel, and efficient data versioning to cloud storage.
    • It’s open source and widely adopted, but tightly integrated into Databricks with optimizations.
    • Makes streaming and batch workloads much easier to manage reliably.
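
The versioning and time-travel idea can be pictured with a toy, pure-Python sketch: an append-only log where every commit produces a new immutable snapshot you can read back. This is only an illustration of the concept, not Delta Lake's actual implementation; on Databricks you would read an old version with `spark.read.format("delta").option("versionAsOf", n)` or SQL's `VERSION AS OF`.

```python
# Toy illustration of Delta-style time travel: an append-only log of
# immutable snapshots. NOT Delta Lake's implementation, just the idea.

class ToyVersionedTable:
    def __init__(self):
        self._versions = [[]]  # version 0 is the empty table

    def append(self, rows):
        # Each commit copies the latest snapshot and adds rows atomically.
        new_snapshot = list(self._versions[-1]) + list(rows)
        self._versions.append(new_snapshot)
        return len(self._versions) - 1  # the new version number

    def read(self, version_as_of=None):
        # Default: latest snapshot; otherwise "time travel" to an old one.
        v = len(self._versions) - 1 if version_as_of is None else version_as_of
        return list(self._versions[v])

table = ToyVersionedTable()
v1 = table.append([{"id": 1}])
v2 = table.append([{"id": 2}])
print(table.read())                  # latest: [{'id': 1}, {'id': 2}]
print(table.read(version_as_of=v1))  # time travel: [{'id': 1}]
```

Because old snapshots are never mutated, readers always see a consistent version even while writers commit, which is the intuition behind ACID reads on object storage.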

6
Q

Explain Best-in-Class AI & ML Support as a main selling point of Databricks

A
  4. 🤖 Best-in-Class AI & ML Support
    • End-to-end support for machine learning: from data prep, feature engineering, model training (including AutoML), to model tracking and deployment (with MLflow, also created by Databricks).
    • GPU support, distributed training, and integration with popular ML/DL libraries like TensorFlow, PyTorch, and scikit-learn.
    • Recently introduced Databricks Model Serving and Foundation Model APIs (LLMs).

7
Q

Explain Open Source + Multicloud Flexibility as a main selling point of Databricks

A
  5. 🔓 Open Source + Multicloud Flexibility
    • Strong commitment to open standards: Delta Lake, MLflow, Apache Spark — all open source and widely supported.
    • Multicloud: Available on AWS, Azure, Google Cloud — customers aren’t locked into one cloud vendor.
    • You own your data; it’s stored in your object storage (not trapped in proprietary formats).
8
Q

Explain 🧠 Performance + Governance as a main selling point of Databricks

A

Bonus: 🧠 Performance + Governance
• Databricks' Photon engine (a vectorized, C++ execution engine) delivers significant performance gains for SQL workloads.
• Built-in Unity Catalog provides centralized data governance: access control, lineage, and discovery across all data and ML assets.

9
Q

How can costs snowball in Databricks?

A
  1. Cluster Sprawl
    Users spin up clusters for ad-hoc jobs or experimentation and forget to shut them down.
    Even idle clusters keep accruing DBU and VM charges, so dozens of "forgotten" clusters can quietly rack up thousands in costs.
  2. Autoscaling Gone Wild
    Autoscaling is a powerful feature, but if not tuned, a query or job may trigger a massive scale-up (hundreds of nodes).
    A job expected to cost $20 can suddenly run $2,000 if autoscaling isn’t capped.
  3. Spot Instance Evictions
    Teams often rely on cheaper spot instances. If spot capacity is revoked, workloads may fall back on expensive on-demand VMs without notice.
    Costs jump unexpectedly, especially in long-running jobs.
  4. Storage + Data Egress
    Databricks often uses external cloud storage (S3, ADLS, GCS). Staging huge intermediate files or transferring data across regions can silently add to bills.
  5. Orphaned Jobs & Artifacts
    Scheduled jobs or ML experiment runs keep executing in the background, or leave behind large logs and checkpoints.
    Storage + compute costs accumulate if no one cleans up.
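
The autoscaling blow-up in point 2 is simple node-hour arithmetic, and the usual guardrails are a capped autoscale range plus auto-termination in the cluster spec. The dollar rate and worker counts below are made up for illustration, but `autoscale.min_workers`, `autoscale.max_workers`, and `autotermination_minutes` are real fields in the Databricks Clusters API.

```python
# Back-of-the-envelope cost model: cost scales with nodes * hours * rate.
# The $0.50/node-hour blended rate (DBUs + cloud VM) is illustrative only.

def job_cost(nodes: int, hours: float, rate_per_node_hour: float) -> float:
    return nodes * hours * rate_per_node_hour

RATE = 0.50  # illustrative blended $ per node-hour

# A job sized for 10 nodes over 4 hours:
expected = job_cost(nodes=10, hours=4, rate_per_node_hour=RATE)    # $20
# The same job after an uncapped scale-up to 500 nodes for 8 hours:
runaway = job_cost(nodes=500, hours=8, rate_per_node_hour=RATE)    # $2,000

# Guardrails in a cluster spec (illustrative values; real API field names):
cluster_spec = {
    "autoscale": {"min_workers": 2, "max_workers": 8},  # cap the scale-up
    "autotermination_minutes": 30,  # shut down idle clusters automatically
}

print(f"expected ${expected:,.0f} vs runaway ${runaway:,.0f}")
```

Capping `max_workers` bounds the worst-case bill per cluster, and `autotermination_minutes` directly addresses the "forgotten cluster" problem in point 1.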