Snowflake Flashcards

(10 cards)

1
Q

What are Snowflake’s weaknesses?

A

🔹 1. Machine Learning & Advanced Analytics Maturity (vs. Databricks)
• Weakness: Snowflake is not a full-featured data science platform like Databricks.
• Perception: Limited native tools for model training, experimentation, and ML lifecycle management.
• Reality: While Snowpark and external ML integrations are growing, they’re still less mature than notebooks in Databricks or SageMaker.

🔹 2. Cost Predictability Can Be Challenging
• Weakness: Snowflake charges per-second for compute, and uncontrolled usage can lead to surprise bills.
• Perception: It can be hard to estimate or monitor costs, especially in large orgs with many users or warehouses.
• Reality: While features like resource monitors, auto-suspend, and budgets help, some customers still struggle with optimization and governance of spend.
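As a sketch of those guardrails in Snowflake SQL (the warehouse and monitor names are hypothetical):

```sql
-- Hypothetical names: cap a team's warehouse at 100 credits per month,
-- notify at 80% of the quota, and suspend the warehouse at 100%.
CREATE RESOURCE MONITOR analytics_monthly
  WITH CREDIT_QUOTA = 100
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
       TRIGGERS ON 80 PERCENT DO NOTIFY
                ON 100 PERCENT DO SUSPEND;

-- Attach the monitor and shorten the idle window before auto-suspend.
ALTER WAREHOUSE analytics_wh SET
  RESOURCE_MONITOR = analytics_monthly,
  AUTO_SUSPEND = 60;
```

Guardrails like these bound the damage, but someone still has to set them up and own them, which is exactly the governance gap described here.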

🔹 3. Vendor Lock-in to Proprietary Architecture
• Weakness: Snowflake’s architecture is proprietary and abstracted, making it difficult to move workloads out.
• Perception: Data engineers can’t fully control how things run under the hood (e.g., indexing, partitioning).
• Reality: Some enterprises value transparency and open-source flexibility more than Snowflake’s managed simplicity.

🔹 4. Limited Real-Time Streaming Support
• Weakness: While Snowpipe and Snowpipe Streaming exist, they aren’t fully real-time in all use cases.
• Perception: Compared to Kafka-native or Flink-based platforms, Snowflake is not ideal for sub-second latency ingestion or processing.
• Reality: Snowflake is improving here (with Snowpipe Streaming and Kafka connectors), but it’s not the first choice for event-driven microservices or ultra-low-latency analytics.
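For reference, a minimal Snowpipe definition (stage, pipe, and table names are hypothetical) shows the micro-batch model: files are picked up as they land in cloud storage, typically within a minute or so, not sub-second:

```sql
-- Hypothetical names: auto-ingest JSON files from a stage as they arrive.
CREATE PIPE events_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_events
  FROM @events_stage
  FILE_FORMAT = (TYPE = 'JSON');
```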

🔹 5. Complex Pricing Structure for Some Users
• Weakness: Snowflake pricing involves separate charges for compute, storage, data transfer, and features like replication or data sharing.
• Perception: Customers may struggle to understand or model pricing, especially in early adoption phases.
• Reality: Compared to BigQuery’s flat on-demand pricing, this can feel harder to manage without strong FinOps or IT ownership.

🔹 6. Limited On-Prem Support
• Weakness: Snowflake is cloud-native only, with no on-premises deployment.
• Perception: Hybrid cloud or regulated industries with strict data residency needs may feel limited.
• Reality: For industries like healthcare, defense, or finance, this can be a blocker unless they move to supported public clouds.

🔹 7. Ecosystem Lock-in and Tooling Gaps
• Weakness: Snowflake pushes its own ecosystem (Snowpark, Streams & Tasks, etc.), which may duplicate other investments in Spark, Airflow, or dbt.
• Perception: Customers may resist being nudged into “Snowflake-native everything.”
• Reality: Integration exists, but not all tools are as seamless or mature as cloud-native options from GCP, AWS, or Azure.

2
Q

Weaknesses related to Transformation

A

🔹 Key Weaknesses of Snowflake Related to Data Transformation:

  1. Limited Native Support for Complex Data Processing Patterns
    • Snowflake is optimized for SQL-based transformations, which can become cumbersome for complex logic like iterative algorithms, recursion, or graph traversal.
    • Lack of native support for imperative or procedural transformations (e.g., chaining complex logic like in Python or Spark).

Example: Complex sessionization, graph relationships, or ML feature engineering might be better suited to Spark.
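As an illustration of the verbosity (table and column names are hypothetical), even a plain hierarchy walk requires a recursive CTE:

```sql
-- Hypothetical employees(employee_id, manager_id) table.
WITH RECURSIVE org_chart AS (
  SELECT employee_id, manager_id, 1 AS depth
  FROM employees
  WHERE manager_id IS NULL            -- anchor: top of the hierarchy
  UNION ALL
  SELECT e.employee_id, e.manager_id, c.depth + 1
  FROM employees e
  JOIN org_chart c ON e.manager_id = c.employee_id
)
SELECT employee_id, depth FROM org_chart;
```

This works, but the same traversal written imperatively in Spark or Python is often easier to write, test, and extend.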

  2. No Built-in Data Orchestration
    • Snowflake doesn’t have native tools for orchestrating multi-step data pipelines (e.g., scheduling, conditional logic, dependencies).
    • Users must rely on external tools like Airflow, dbt, or Apache NiFi to coordinate complex workflows.

Snowflake Streams & Tasks help but are limited in scope and flexibility compared to full orchestration engines.
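A minimal sketch (all object names hypothetical) of what Tasks do offer, namely cron schedules and AFTER dependencies, without conditional branching, retries with backoff, or cross-system triggers:

```sql
-- Hypothetical two-step pipeline. Note: tasks are created suspended
-- and must be resumed (ALTER TASK ... RESUME) before they run.
CREATE TASK load_raw
  WAREHOUSE = etl_wh
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  COPY INTO raw_events FROM @events_stage;

CREATE TASK build_daily_summary
  WAREHOUSE = etl_wh
  AFTER load_raw
AS
  INSERT INTO daily_summary
  SELECT event_date, COUNT(*) AS events
  FROM raw_events
  GROUP BY event_date;
```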

  3. Transformation Debugging and Observability Are Basic
    • Snowflake SQL transformations can be hard to debug and version, especially in large-scale environments with many dependencies.
    • Lacks native visual lineage tracking, step-by-step logging, or real-time transformation diagnostics without third-party tools.

Teams often use tools like dbt Cloud, Monte Carlo, or Collibra to fill this gap.

  4. Lack of Native Support for Code-Based (Non-SQL) Transformations
    • While Snowpark introduces support for Python, Java, and Scala, it’s still maturing, and not as widely adopted or flexible as Spark for data engineers.
    • Writing non-SQL transformations can feel constrained compared to notebook-based, code-first platforms like Databricks.

Developers with heavy Python/Scala pipelines may find Snowflake less ergonomic for their workflows.

  5. No Direct Support for In-Memory Processing
    • Snowflake transformations are disk-backed: intermediate results are persisted rather than held in memory across steps.
    • There is no support for in-memory distributed processing like Spark, which can be a performance bottleneck for certain high-speed transformations or iterative workloads.

  6. JSON and Semi-Structured Data Processing Is Powerful, But Not Always Performant
    • While Snowflake handles semi-structured data (e.g., JSON, Avro) well via VARIANT and lateral flattening, it can get verbose and slow for deeply nested or large-scale transformations.
    • Users often need to write complex SQL to access or reshape nested data structures.
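A one-level example (table name and JSON shape are hypothetical) of the FLATTEN-plus-cast boilerplate involved:

```sql
-- Hypothetical orders table with a VARIANT column "raw".
SELECT
  o.raw:customer.id::STRING AS customer_id,
  item.value:sku::STRING    AS sku,
  item.value:qty::INT       AS qty
FROM orders o,
  LATERAL FLATTEN(input => o.raw:line_items) item;
```

Each additional nesting level adds another FLATTEN and another set of path expressions, which is where the verbosity and cost grow.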
3
Q

What are Snowflake’s main strengths, and how can they also be weaknesses?

A

🌟 SNOWFLAKE MAIN STRENGTHS — AND THEIR SHADOW SIDES
| Strength | Why It’s a Strength | When It Can Be a Weakness |
|---|---|---|
| 1. Separation of Compute & Storage | Scales storage and compute independently; enables concurrent workloads with isolated performance | Can lead to uncontrolled compute sprawl; users may unknowingly rack up costs with multiple virtual warehouses |
| 2. Fully Managed & Serverless | No infrastructure to manage; no tuning or provisioning needed; fast time to value | Less control for performance tuning; opaque internals make complex workloads hard to troubleshoot or optimize |
| 3. Exceptional SQL Experience | Powerful, familiar interface for analysts and engineers; handles structured and semi-structured data | SQL-centric model limits flexibility; less suited for code-heavy transformations or complex workflows |
| 4. Scalable ELT & Analytics | Designed for high-performance batch processing; efficient for dashboards, reporting, and BI | Streaming support is weak; not ideal for real-time or low-latency processing use cases |
| 5. Native Support for Semi-Structured Data (VARIANT) | Query JSON, Avro, XML, and Parquet natively; no need for external parsing or schema management | Querying deeply nested structures can be clunky; data stored in VARIANT may reduce performance and data governance clarity |
| 6. Multi-Cloud & Cross-Region Flexibility | Deploy on AWS, Azure, or GCP; replicate and share data across clouds and regions | Adds complexity to governance and cost tracking across environments; may lock you further into Snowflake’s ecosystem for cross-cloud data movement |
| 7. Native Data Sharing & Marketplace | Easily share data between orgs without data movement; supports monetization and collaboration | Only works within the Snowflake ecosystem; encourages platform lock-in and may limit flexibility for multi-platform sharing |
| 8. Automatic Scaling & Performance Optimization | Auto-scales based on query load; almost no performance tuning required | Less transparency into query performance; hard to optimize or understand cost/performance trade-offs in advanced use cases |
| 9. Governance & Security Built-in | Strong features for RBAC, masking, and row-level security; supports compliance and enterprise security | Security policies and lineage tracking are Snowflake-specific, not standards-based; may be less portable across data platforms |

4
Q

What are Snowflake’s AI/ML weaknesses compared to Databricks?

A
  1. Native Experience vs. Extension Approach

In Databricks, ML workflows are fluid and natively integrated. In Snowflake, they are bolted on, and you often feel the friction.

  2. Development Environment

Databricks feels like a data scientist’s lab. Snowflake feels like a data warehouse with add-ons.

  3. Model Training and Execution

In Databricks, training happens where your data is, at scale. In Snowflake, you’re often moving data out, which adds latency, cost, and friction.

  4. Model Lifecycle Management

Snowflake is still catching up in this area. While Snowpark and Cortex AI bring useful capabilities like in-database model development and deployment, Snowflake’s ML lifecycle support remains relatively shallow and fragmented: many stages of the ML workflow (experiment tracking, automated retraining, model monitoring, CI/CD integration) typically require external tools, custom plumbing, or unproven patterns. In short, it’s powerful for running and serving models close to data, but not yet a full-fledged MLOps platform.

  5. Tooling and Ecosystem

Snowflake restricts what packages can run inside it. Databricks lets you bring your own tools freely.

5
Q

Show a scenario where it is evident that Snowflake pricing is unpredictable

A

Scenario: Marketing Analytics Team
A retail company’s marketing team is running weekly customer segmentation analyses in Snowflake. They’ve budgeted $2,000 per month for Snowflake usage.
What happens:
1. Query Optimization Oversight
A data analyst runs a segmentation query that joins a large customer transactions table (5 TB) with a demographics dataset.
The query isn’t optimized and scans far more data than necessary. Instead of costing a few dollars, the single query racks up $150 in compute credits.
2. Warehouse Auto-Suspend Surprise
The team’s virtual warehouse is set to suspend after 10 minutes of inactivity.
Analysts keep running small queries throughout the day, meaning the warehouse keeps waking up, consuming full-minute credits each time.
By the end of the week, this idle time doubles compute costs compared to expectations.
3. Unexpected Storage Costs
Data scientists clone large tables for experimentation, not realizing that cloned tables consume storage once modified.
After two weeks, storage costs rise by 30%, surprising the finance team.
4. Monthly Invoice Shock
Instead of the expected $2,000 budget, the monthly bill comes in at $4,700.
No single decision seemed huge at the time, but the combination of query inefficiency, warehouse misconfiguration, and unnoticed storage creep made costs unpredictable.
👉 Why it feels unpredictable:
Costs aren’t tied directly to “users” or “projects” but to query patterns, warehouse behavior, and storage growth.
A few small missteps by non-technical users can dramatically shift monthly spend, making budgeting hard without strong governance.
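Spend like this can at least be audited after the fact. A sketch using Snowflake’s ACCOUNT_USAGE share (the view is real; the query itself is illustrative):

```sql
-- Credits consumed per warehouse over the last 30 days.
-- ACCOUNT_USAGE views lag real usage by up to a few hours.
SELECT warehouse_name,
       SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits DESC;
```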
👉 Note on zero-copy cloning: when you first clone a table, database, or schema, no new storage is consumed (just metadata). That’s one of Snowflake’s big advantages. But here’s the nuance where unexpected costs creep in:

Initial clone: ✅ Free (no extra storage).
After modifications: When you insert, update, or delete in the cloned object, Snowflake has to preserve the original data for time travel and maintain the modified data in the clone.
The more changes you make, the more incremental storage you actually consume.
If people keep cloning and experimenting, you can suddenly end up with a large hidden storage bill.
🔎 Example
You have a 5 TB table.
You create a clone — storage = still ~5 TB (no change).
A data scientist modifies 1 TB worth of rows in the clone.
Now you’re paying for 6 TB of storage (5 TB original + 1 TB new versions).
If multiple analysts do this in parallel and forget to clean up their clones, costs snowball invisibly until the invoice arrives.
👉 So the “unexpected storage cost” in the scenario isn’t from the cloning itself, but from modified clones and leftover historical data (which is easy to miss if no one monitors).
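In SQL terms (table names hypothetical), the lifecycle looks like this:

```sql
-- Clone is metadata-only: no extra storage at creation time.
CREATE TABLE transactions_dev CLONE transactions;

-- Modified rows diverge from the original and need their own storage.
UPDATE transactions_dev
SET amount = amount * 1.1
WHERE region = 'EU';

-- Dropping the clone releases its divergent storage
-- (subject to Time Travel retention).
DROP TABLE transactions_dev;
```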

6
Q

Explain Snowflake’s AI/ML weakness (compared to Databricks) “Native Experience vs. Extension Approach”

A
  1. Native Experience vs. Extension Approach

| Feature | Databricks | Snowflake + Snowpark |
|---|---|---|
| ML-native platform? | ✅ Yes — built on Apache Spark, designed for ML from the start | ❌ No — built for SQL + analytics, ML support added later |
| Languages | Python (notebooks), Scala, R — first-class support | Snowpark supports Python, Java, Scala — but it’s more limited |
| ML workflow integration | Seamless MLlib, scikit-learn, PyTorch, TensorFlow | External packages must be installed via Anaconda or manually loaded into UDFs |

Result: In Databricks, ML workflows are fluid and natively integrated. In Snowflake, they are bolted on, and you often feel the friction.
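As a sketch of the packaging model (function name and body are placeholders), third-party libraries are declared via PACKAGES and must come from Snowflake’s Anaconda channel:

```sql
-- Placeholder UDF: PACKAGES must name Anaconda-channel packages;
-- arbitrary pip installs are not supported inside the warehouse.
CREATE OR REPLACE FUNCTION word_count(txt STRING)
RETURNS INT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
PACKAGES = ('numpy')
HANDLER = 'run'
AS
$$
import numpy as np

def run(txt):
    # Trivial stand-in for a real model scoring function
    return int(np.char.count(txt, ' ')) + 1
$$;
```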

7
Q

Explain Snowflake’s AI/ML weakness (compared to Databricks) “Development Environment”

A
  2. Development Environment

| Task | Databricks | Snowflake |
|---|---|---|
| Interactive notebooks | Native (collaborative, real-time) | No native notebooks (must use Jupyter, Hex, or Streamlit separately) |
| Visualizations | Built into notebooks | Requires integration (e.g., Streamlit or BI tools) |
| Model experimentation | Easy model versioning and tracking (with MLflow) | Requires external tools or manual tracking |

🔎 Result: Databricks feels like a data scientist’s lab. Snowflake feels like a data warehouse with add-ons.

8
Q

Explain Snowflake’s AI/ML weakness (compared to Databricks) “Model Training and Execution”

A
  3. Model Training and Execution

| Step | Databricks | Snowflake |
|---|---|---|
| Train large ML models | Directly on Spark clusters | Must pull data into UDFs or use external compute (e.g., SageMaker, Vertex AI) |
| GPU/accelerated training | Native support | Requires external platforms |
| Feature engineering | Native with Delta Lake + notebooks | SQL-first approach via Snowpark (not ideal for complex pipelines) |

🔎 Result: In Databricks, training happens where your data is, at scale. In Snowflake, you’re often moving data out, which adds latency, cost, and friction.

9
Q

Explain Snowflake’s AI/ML weakness (compared to Databricks) “Model Lifecycle Management”

A
  4. Model Lifecycle Management

| Area | Databricks | Snowflake |
|---|---|---|
| Experiment tracking | MLflow (built-in) | Requires external tools |
| Model registry | Native | Not included (must use an external registry) |
| Inference | Native batch + real-time | Batch inference via UDFs; real-time via external APIs or Snowpark Container Services (early-stage) |
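For instance, batch inference reduces to calling a UDF over a table (the UDF and columns here are hypothetical):

```sql
-- Hypothetical scoring UDF applied in bulk; real-time serving would
-- need an external endpoint or Snowpark Container Services instead.
SELECT customer_id,
       churn_score(age, tenure_months, monthly_spend) AS score
FROM customer_features;
```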

10
Q

Explain Snowflake’s AI/ML weakness (compared to Databricks) “Tooling and Ecosystem”

A
  5. Tooling and Ecosystem

| Tooling | Databricks | Snowflake |
|---|---|---|
| Libraries | Any Python/R library, pip-installable | Must rely on Anaconda-provided packages in Snowpark |
| IDE integration | Jupyter, VS Code, native notebooks | Primarily SQL IDEs or Streamlit + Jupyter |
| ML frameworks | PyTorch, TensorFlow, XGBoost — natively supported | Limited in Snowpark; better via external integration (SageMaker, Azure ML) |
