Snowflake stores data in cloud storage in a _ format.
compressed columnar
Columnar Storage means data is stored as key-value pairs, where key is column and all entries in it is stored as value. Compression is applied at the column (key) level. Since values in a column tend to be of the same type and often repetitive, the compression algorithms work exceptionally well.
Queries often scan specific columns (e.g., SELECT Name, Age FROM Users WHERE Country = ‘USA’)
It’s more efficient to read just the relevant columns than entire rows
Columnar + compression = massive performance gain and cost savings.
https://copilot.microsoft.com/shares/QHE6u23dSu3PPyyfxqG6F
To execute filtering in WHERE clause, instead of scanning all rows of the column, each compute node reads a different _ at the same time (parallely).
micro-partition
Each partition has metadata, having info about column values. Leveraging this info, micro-partition not having the clause value is skipped. So, effectively fewer scans are needed to execute filtering.
while columns are stored separately inside each micro-partition, the micro-partition itself contains all columns for a subset of rows.
When storing data in a snowflake table, we can guide snowflake to keep rows with similar values in specific column(s) close together, i.e., grouped in a micro-partition using _
Clustering Keys
ALTER TABLE sales CLUSTER BY (region, order_date);
Snowflake tries to physically group rows with similar region and order_date values into the same micro-partitions. This improves the predictability of metadata, which is key for pruning.
If you’re working with CDC or time-series data, Use _ as part of your clustering key to make incremental queries (e.g., last 24 hours) blazingly fast.
event_timestamp or modified_date
Usually, we tend to keep the most filtered column as clustering key. In time-series data, event_timestamp is the most common filtering criteria.
SQL command to check clustering depth of sales column
SELECT SYSTEM$CLUSTERING_INFORMATION(‘sales’);
A high depth means:
Data is scattered across many partitions
Pruning is less effective
You might need to recluster the table
Benefits of virtual warehouses instead of a single centralized warehouse, for query processing.
_ makes copies of data without incurring additional storage
costs
Zero-copy cloning
_ allows users to access data at a point in time in the past
Time travel