Databricks Flashcards

(9 cards)

1
Q

What are Databricks' main strengths that can also be weaknesses?

A

⚡ DATABRICKS MAIN STRENGTHS — AND THEIR SHADOW SIDES

1. Open-Source Foundation (Spark, Delta Lake)
    • Strength: Built on open, battle-tested frameworks; no vendor lock-in at the storage layer; strong community and ecosystem.
    • Weakness: Requires more technical expertise and can lead to complex configurations; open doesn't mean easy.
2. Unified Data Platform (Lakehouse Vision)
    • Strength: Combines data warehouse and data lake; handles BI, AI, and ML in one system.
    • Weakness: "All-in-one" can be overkill for teams that just need simple analytics; may lack the polish or ease of best-in-class BI/SQL tools.
3. First-Class Support for ML & AI
    • Strength: Built-in notebooks, MLflow, and feature stores; Python-native and supports the full ML lifecycle.
    • Weakness: Heavy on code-first workflows; requires strong engineering and data science skills.
4. Streaming and Real-Time Capabilities
    • Strength: Strong support for Structured Streaming, with event-time processing and exactly-once semantics.
    • Weakness: Streaming setups are complex to manage and not "plug and play" for most teams.
5. Deep Flexibility and Customization
    • Strength: Supports Python, Scala, SQL, R, and Java; total control over pipeline logic and environments.
    • Weakness: Flexibility leads to inconsistency if not governed; steeper learning curve and more maintenance.
6. Open Table Format (Delta Lake)
    • Strength: ACID transactions on data lakes; works with S3, ADLS, and GCS; interoperable with other engines.
    • Weakness: Delta Lake requires cluster compute to access efficiently; less abstracted and automatically optimized than Snowflake's proprietary columnar storage.
7. Fine-Grained Performance Tuning
    • Strength: Full visibility into Spark jobs, execution plans, and resource usage.
    • Weakness: Manual optimization is often required; harder to manage at scale without a mature team.
8. Collaborative Notebooks & Workflow Orchestration
    • Strength: Integrated development environment with real-time collaboration; Unity Catalog and Workflows streamline pipelines.
    • Weakness: Notebooks can lead to spaghetti code and poor reproducibility without discipline.
9. Strong Cloud-Native & DevOps Integration
    • Strength: Supports Terraform, GitOps, REST APIs, and CI/CD pipelines.
    • Weakness: Setup is not trivial and requires DevOps maturity; Snowflake remains easier for non-technical teams.

2
Q

What are Databricks' 5 main selling points?

A

Databricks has become one of the most popular platforms in data and AI, and its strength lies in how it combines multiple capabilities into a single unified platform.

  1. 🔥 Lakehouse Architecture
  2. ⚙️ Unified Platform for Data Engineering, Analytics, and AI
  3. 💾 Delta Lake for Reliable and Performant Storage
  4. 🤖 Best-in-Class AI & ML Support
  5. 🔓 Open Source + Multicloud Flexibility

Bonus: 🧠 Performance + Governance

3
Q

Explain Lakehouse Architecture as a main selling point of Databricks

A
  1. 🔥 Lakehouse Architecture
    • Combines the scalability of data lakes (like S3 or GCS) with the structure and performance of data warehouses.
    • Enables you to store all your data in open formats (like Parquet + Delta Lake) while supporting SQL analytics, BI, streaming, and ML/AI — all on the same platform.
    • Reduces the need for separate ETL pipelines or duplicate data silos (data warehouse + data lake).

4
Q

Explain Unified Platform for Data Engineering, Analytics, and AI as a main selling point of Databricks

A
  2. ⚙️ Unified Platform for Data Engineering, Analytics, and AI
    • One workspace for data engineers, data scientists, analysts, and ML engineers.
    • Supports SQL, Python, Scala, R, and Spark natively.
    • Fully integrates notebooks, jobs, pipelines, ML lifecycle, and governance in a collaborative environment.

5
Q

Explain Delta Lake for Reliable and Performant Storage as a main selling point of Databricks

A
  3. 💾 Delta Lake for Reliable and Performant Storage
    • Delta Lake, developed by Databricks, brings ACID transactions, schema evolution, time travel, and efficient data versioning to cloud storage.
    • It’s open source and widely adopted, but tightly integrated into Databricks with optimizations.
    • Makes streaming and batch workloads much easier to manage reliably.
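
The versioning and time-travel idea can be pictured with a toy, pure-Python sketch: an append-only log where every commit produces a new immutable snapshot you can read back. This is only an illustration of the concept, not Delta Lake's actual implementation; on Databricks you would read an old version with `spark.read.format("delta").option("versionAsOf", n)` or SQL's `VERSION AS OF`.

```python
# Toy illustration of Delta-style time travel: an append-only log of
# immutable snapshots. NOT Delta Lake's implementation, just the idea.

class ToyVersionedTable:
    def __init__(self):
        self._versions = [[]]  # version 0 is the empty table

    def append(self, rows):
        # Each commit copies the latest snapshot and adds rows atomically.
        new_snapshot = list(self._versions[-1]) + list(rows)
        self._versions.append(new_snapshot)
        return len(self._versions) - 1  # the new version number

    def read(self, version_as_of=None):
        # Default: latest snapshot; otherwise "time travel" to an old one.
        v = len(self._versions) - 1 if version_as_of is None else version_as_of
        return list(self._versions[v])

table = ToyVersionedTable()
v1 = table.append([{"id": 1}])
v2 = table.append([{"id": 2}])
print(table.read())                  # latest: [{'id': 1}, {'id': 2}]
print(table.read(version_as_of=v1))  # time travel: [{'id': 1}]
```

Because old snapshots are never mutated, readers always see a consistent version even while writers commit, which is the intuition behind ACID reads on object storage.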

6
Q

Explain Best-in-Class AI & ML Support as a main selling point of Databricks

A
  4. 🤖 Best-in-Class AI & ML Support
    • End-to-end support for machine learning: from data prep, feature engineering, model training (including AutoML), to model tracking and deployment (with MLflow, also created by Databricks).
    • GPU support, distributed training, and integration with popular ML/DL libraries like TensorFlow, PyTorch, and scikit-learn.
    • Recently introduced Databricks Model Serving and Foundation Model APIs (LLMs).

7
Q

Explain Open Source + Multicloud Flexibility as a main selling point of Databricks

A
  5. 🔓 Open Source + Multicloud Flexibility
    • Strong commitment to open standards: Delta Lake, MLflow, Apache Spark — all open source and widely supported.
    • Multicloud: Available on AWS, Azure, Google Cloud — customers aren’t locked into one cloud vendor.
    • You own your data; it’s stored in your object storage (not trapped in proprietary formats).
8
Q

Explain 🧠 Performance + Governance as a main selling point of Databricks

A

Bonus: 🧠 Performance + Governance
• Databricks' Photon engine (a vectorized, C++ execution engine) delivers significant performance gains for SQL workloads.
• Built-in Unity Catalog provides centralized data governance: access control, lineage, and discovery across all data and ML assets.

9
Q

How can costs snowball in Databricks?

A
  1. Cluster Sprawl
    Users spin up clusters for ad-hoc jobs or experimentation and forget to shut them down.
    Even idle clusters keep accruing DBU and VM charges, so dozens of "forgotten" clusters can quietly rack up thousands in costs.
  2. Autoscaling Gone Wild
    Autoscaling is a powerful feature, but if not tuned, a query or job may trigger a massive scale-up (hundreds of nodes).
    A job expected to cost $20 can suddenly run $2,000 if autoscaling isn’t capped.
  3. Spot Instance Evictions
    Teams often rely on cheaper spot instances. If spot capacity is revoked, workloads may fall back on expensive on-demand VMs without notice.
    Costs jump unexpectedly, especially in long-running jobs.
  4. Storage + Data Egress
    Databricks often uses external cloud storage (S3, ADLS, GCS). Staging huge intermediate files or transferring data across regions can silently add to bills.
  5. Orphaned Jobs & Artifacts
    Scheduled jobs or ML experiment runs keep executing in the background, or leave behind large logs and checkpoints.
    Storage + compute costs accumulate if no one cleans up.
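
The autoscaling blow-up in point 2 is simple node-hour arithmetic, and the usual guardrails are a capped autoscale range plus auto-termination in the cluster spec. The dollar rate and worker counts below are made up for illustration, but `autoscale.min_workers`, `autoscale.max_workers`, and `autotermination_minutes` are real fields in the Databricks Clusters API.

```python
# Back-of-the-envelope cost model: cost scales with nodes * hours * rate.
# The $0.50/node-hour blended rate (DBUs + cloud VM) is illustrative only.

def job_cost(nodes: int, hours: float, rate_per_node_hour: float) -> float:
    return nodes * hours * rate_per_node_hour

RATE = 0.50  # illustrative blended $ per node-hour

# A job sized for 10 nodes over 4 hours:
expected = job_cost(nodes=10, hours=4, rate_per_node_hour=RATE)    # $20
# The same job after an uncapped scale-up to 500 nodes for 8 hours:
runaway = job_cost(nodes=500, hours=8, rate_per_node_hour=RATE)    # $2,000

# Guardrails in a cluster spec (illustrative values; real API field names):
cluster_spec = {
    "autoscale": {"min_workers": 2, "max_workers": 8},  # cap the scale-up
    "autotermination_minutes": 30,  # shut down idle clusters automatically
}

print(f"expected ${expected:,.0f} vs runaway ${runaway:,.0f}")
```

Capping `max_workers` bounds the worst-case bill per cluster, and `autotermination_minutes` directly addresses the "forgotten cluster" problem in point 1.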