1.1 Data Engineering with Snowflake Flashcards

Introduction to Data Engineering with Snowflake (8 cards)

1
Q

Snowflake stores data in cloud storage in a _ format.

A

compressed columnar

Columnar Storage means data is stored as key-value pairs, where key is column and all entries in it is stored as value. Compression is applied at the column (key) level. Since values in a column tend to be of the same type and often repetitive, the compression algorithms work exceptionally well.
Queries often scan specific columns (e.g., SELECT Name, Age FROM Users WHERE Country = ‘USA’)

It’s more efficient to read just the relevant columns than entire rows

Columnar + compression = massive performance gain and cost savings.

https://copilot.microsoft.com/shares/QHE6u23dSu3PPyyfxqG6F

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

To execute filtering in WHERE clause, instead of scanning all rows of the column, each compute node reads a different _ at the same time (parallely).

A

micro-partition

Each partition has metadata, having info about column values. Leveraging this info, micro-partition not having the clause value is skipped. So, effectively fewer scans are needed to execute filtering.

while columns are stored separately inside each micro-partition, the micro-partition itself contains all columns for a subset of rows.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

When storing data in a snowflake table, we can guide snowflake to keep rows with similar values in specific column(s) close together, i.e., grouped in a micro-partition using _

A

Clustering Keys

ALTER TABLE sales CLUSTER BY (region, order_date);
Snowflake tries to physically group rows with similar region and order_date values into the same micro-partitions. This improves the predictability of metadata, which is key for pruning.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

If you’re working with CDC or time-series data, Use _ as part of your clustering key to make incremental queries (e.g., last 24 hours) blazingly fast.

A

event_timestamp or modified_date

Usually, we tend to keep the most filtered column as clustering key. In time-series data, event_timestamp is the most common filtering criteria.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

SQL command to check clustering depth of sales column

A

SELECT SYSTEM$CLUSTERING_INFORMATION(‘sales’);

A high depth means:
Data is scattered across many partitions
Pruning is less effective
You might need to recluster the table

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Benefits of virtual warehouses instead of a single centralized warehouse, for query processing.

A
  • Separation of compute and storage: Storage is centralized, but compute (query processing) is isolated per warehouse
  • Concurrency scaling: Multiple users can run queries without stepping on each other’s toes.
  • Workload Isolation: A bad query cannot crash or slow down all other workloads.
  • You can scale different workloads independently
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

_ makes copies of data without incurring additional storage
costs

A

Zero-copy cloning

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

_ allows users to access data at a point in time in the past

A

Time travel

How well did you know this?
1
Not at all
2
3
4
5
Perfectly