1.1 Data Engineering with Snowflake Flashcards

Question 1

Q

Snowflake stores data in cloud storage in a _ format.

Answer

A

compressed columnar

Columnar Storage means data is stored as key-value pairs, where key is column and all entries in it is stored as value. Compression is applied at the column (key) level. Since values in a column tend to be of the same type and often repetitive, the compression algorithms work exceptionally well.
Queries often scan specific columns (e.g., SELECT Name, Age FROM Users WHERE Country = ‘USA’)

It’s more efficient to read just the relevant columns than entire rows

Columnar + compression = massive performance gain and cost savings.

https://copilot.microsoft.com/shares/QHE6u23dSu3PPyyfxqG6F

Question 2

Q

To execute filtering in WHERE clause, instead of scanning all rows of the column, each compute node reads a different _ at the same time (parallely).

Answer

A

micro-partition

Each partition has metadata, having info about column values. Leveraging this info, micro-partition not having the clause value is skipped. So, effectively fewer scans are needed to execute filtering.

while columns are stored separately inside each micro-partition, the micro-partition itself contains all columns for a subset of rows.

Question 3

Q

When storing data in a snowflake table, we can guide snowflake to keep rows with similar values in specific column(s) close together, i.e., grouped in a micro-partition using _

Answer

A

Clustering Keys

ALTER TABLE sales CLUSTER BY (region, order_date);
Snowflake tries to physically group rows with similar region and order_date values into the same micro-partitions. This improves the predictability of metadata, which is key for pruning.

Question 4

Q

If you’re working with CDC or time-series data, Use _ as part of your clustering key to make incremental queries (e.g., last 24 hours) blazingly fast.

Answer

A

event_timestamp or modified_date

Usually, we tend to keep the most filtered column as clustering key. In time-series data, event_timestamp is the most common filtering criteria.

Question 5

Q

SQL command to check clustering depth of sales column

Answer

A

SELECT SYSTEM$CLUSTERING_INFORMATION(‘sales’);

A high depth means:
Data is scattered across many partitions
Pruning is less effective
You might need to recluster the table

Question 6

Q

Benefits of virtual warehouses instead of a single centralized warehouse, for query processing.

Answer

A

Separation of compute and storage: Storage is centralized, but compute (query processing) is isolated per warehouse
Concurrency scaling: Multiple users can run queries without stepping on each other’s toes.
Workload Isolation: A bad query cannot crash or slow down all other workloads.
You can scale different workloads independently

Question 7

Q

_ makes copies of data without incurring additional storage
costs

Answer

A

Zero-copy cloning

Question 8

Q

_ allows users to access data at a point in time in the past

Answer

A

Time travel

1.1 Data Engineering with Snowflake Flashcards

Introduction to Data Engineering with Snowflake (8 cards)