What are the 4 “Out of the box” tenets of Spark Query Optimization?
1.File Format
2.Photon
3.AQE
4.Liquid Clustering/ Data Skipping
What are the 5 “Read” tenets of Spark Optimization?
1.Partition & Cluster Strategy
2.Minimizing Skew
3.Caching
4.Broadcast Joins
5.Filters
Explain the File Format tenant.
Data should be formatted as Delta or Parquet.
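For example (the table name is hypothetical; Delta is already the default table format on Databricks):

```sql
-- USING DELTA makes the storage format explicit
CREATE TABLE sales (
  order_id BIGINT,
  amount   DOUBLE
) USING DELTA;
```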
Explain the Photon tenant.
Photon is Databricks’ vectorized C++ execution engine that accelerates SQL and DataFrame processing. Make sure it is enabled on the cluster.
Explain the AQE tenant.
What are its 3 key features?
Adaptive Query Execution re-optimizes the query plan at runtime using shuffle statistics.
1.Dynamically coalescing shuffle partitions
2.Dynamically switching join strategies (e.g. to broadcast)
3.Dynamically optimizing skew joins
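AQE and its features are controlled by standard Spark configs (enabled by default in recent Spark/Databricks releases); a sketch:

```sql
SET spark.sql.adaptive.enabled = true;                    -- AQE itself
SET spark.sql.adaptive.coalescePartitions.enabled = true; -- merge small shuffle partitions
SET spark.sql.adaptive.skewJoin.enabled = true;           -- split skewed join partitions
```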
What is Liquid Clustering?
How is LC implemented?
What is the default # of partitions?
LC improves read/write performance without the manual tuning required by the ZORDER and partition methods.
Can be used at table/schema/catalog creation or ALTER.
Liquid Clustering is initiated for an object when the CLUSTER BY <column_name(s)> keyword is used in its creation syntax.
CLUSTER BY AUTO allows historical query data to be used predictively.
200 (the spark.sql.shuffle.partitions default)
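A sketch of both forms, using a hypothetical table:

```sql
-- Explicit clustering keys at creation time
CREATE TABLE events (
  event_id   BIGINT,
  event_date DATE,
  user_id    BIGINT
) CLUSTER BY (event_date, user_id);

-- Or let Databricks choose keys from historical query patterns
ALTER TABLE events CLUSTER BY AUTO;
```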
Explain Partitioning.
What is data skipping?
What are three use cases for Partitioning?
Partitions physically arrange data on the disk based on a specified column(s).
When well executed, partitions allow entire read directories to be skipped (i.e. data skipping)
1.Time series data (e.g. year, month, day)
2.Consistent filter on certain column(s)
3.Stable set of distinct values for partitioning (e.g. Sales Reps, Regions)
What are the 3 best practices for partitioning?
1.Cardinality: Low-to-medium; # of distinct values is manageable
2.Avoid Data Skew: Data should be evenly distributed b/t partitions
3.Write-Time Optimization: Use when writing data in order to organize it by partition
What is clustering?
How is a cluster implemented and what is an important option?
What is a best practice for its implementation?
What are its use cases?
Partition within an existing partition based on other column(s).
A cluster (bucket) is implemented by using the "CLUSTERED BY" keyword within the "CREATE TABLE" syntax. "CLUSTERED BY (column) INTO n BUCKETS" is an important option. In the DataFrame API, use df.write.bucketBy(n, "column").
Optimal size: same as a partition. Total data / # of clusters = 100 MB - 1 GB
1.Large tables that are frequently joined on a high cardinality field (e.g. user_id, product_id). Spark can perform a sort-merge join w/o a shuffle
2.Columns used in groupBy for aggregations
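A sketch of the bucketing syntax in Spark SQL (table and columns are hypothetical):

```sql
-- Rows are hashed on user_id into a fixed number of buckets, so a join
-- on user_id between two tables bucketed the same way avoids a shuffle
CREATE TABLE user_events (
  user_id    BIGINT,
  event_date DATE,
  payload    STRING
)
USING PARQUET
PARTITIONED BY (event_date)
CLUSTERED BY (user_id) INTO 32 BUCKETS;
```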
Explain the small files problem.
How can it be mitigated?
Small files from ingestion or over partitioning/clustering can hurt query performance by raising read/write overhead.
Ingestion: run the OPTIMIZE command on the table
Over Partitioning: rework partition strategy
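For a Delta table (name hypothetical):

```sql
-- Compacts many small files into fewer, larger ones
OPTIMIZE sales;
```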
Explain Caching.
When should it be used?
How is it implemented?
Caching stores specified data in the memory of each worker node instead of on the disk.
Use on small datasets that are used multiple times/iteratively in the same task (e.g. ML)
df.cache()
Explain the Broadcast Join.
What is the use case for them?
How is it implemented?
Broadcast joins copy a small, read-only dataset into the memory of every worker node, eliminating the shuffle the join would otherwise require.
Using immutable lookup data against a large/complex table(s)
df_big.join(F.broadcast(df_small), "join_key")
Explain Filters Tenant.
How is it implemented?
Filters should be "pushed down" to the source before data is read into Spark to reduce the query's I/O.
Run filter clauses as early as possible in the query.
Look for "PushedFilters" in the explain plan.
What is a repartition?
What is a use case for repartition?
How is a repartition initiated?
A full shuffle that redistributes the data into the specified number of partitions.
Skew Reduction, Improve Parallelism
df.repartition(# of partitions, "column_name")
What is a coalesce?
What is a use case for coalesce?
How is a coalesce initiated?
Merges an existing set of partitions into fewer partitions (<= # of existing partitions) without a full shuffle.
Reduce the # of partitions (e.g. after a heavy filter) without the cost of a full shuffle
df.coalesce(# of partitions)
What is a shuffle?
How is it mitigated?
Redistribution of data across the cluster's executors over the network
Optimal Partition/Cluster Strategy
What is skew?
What are two ways to mitigate it?
Unbalanced partition sizes across a data set
Liquid Clustering
Optimal Partition/Cluster Strategy
Explain the VACUUM command.
How does it affect deletes?
Cleans the disk of deleted, unreferenced, or obsolete data files.
Physical deletion of files only occurs after VACUUM is run on the table (default retention: 7 days)
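A sketch against a hypothetical Delta table:

```sql
-- Physically removes files no longer referenced by the Delta log and
-- older than the retention threshold (default 7 days = 168 hours)
VACUUM sales;

-- The retention window can also be stated explicitly
VACUUM sales RETAIN 168 HOURS;
```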
Explain the OPTIMIZE command.
When should it be invoked in reference to VACUUM?
Compacts small files into larger ones to minimize read overhead.
Before VACUUM.
What is Z-Order Indexing?
How is it implemented?
Groups data into optimized files without subfolders by co-locating related column values in the same set of files.
Implement by adding ZORDER BY to the OPTIMIZE command followed by a column name.
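For example, on a hypothetical Delta table:

```sql
-- Compacts files and co-locates rows with similar user_id values
OPTIMIZE events ZORDER BY (user_id);
```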