Spark Query Optimization Flashcards

(20 cards)

1
Q

What are the 4 “Out of the box” tenets of Spark Query Optimization?

A

1.File Format
2.Photon
3.AQE
4.Liquid Clustering/ Data Skipping

2
Q

What are the 5 “Read” tenets of Spark Optimization?

A

1.Partition & Cluster Strategy
2.Minimizing Skew
3.Caching
4.Broadcast Joins
5.Filters

3
Q

Explain the File Format tenet.

A

Data should be formatted as Delta or Parquet.
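
A minimal sketch (the sales table and its columns are hypothetical; Delta is already the default table format on Databricks, but USING DELTA makes it explicit):

```sql
-- hypothetical table; USING DELTA makes the format explicit
CREATE TABLE sales (
  order_id  BIGINT,
  amount    DOUBLE,
  sale_date DATE
) USING DELTA;
```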

4
Q

Explain the Photon tenet.

A

Photon is Databricks’ native C++ vectorized execution engine that speeds up SQL and DataFrame processing. Make sure it is enabled on the cluster.

5
Q

Explain the AQE tenet.

What are its 3 key features?

A

Adaptive Query Execution (AQE) reoptimizes the query plan at runtime.

  1. Coalesces shuffle partitions
  2. Converts sort-merge joins to broadcast joins
  3. Splits large partitions in joins to minimize skew
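
AQE is enabled by default in recent Spark and Databricks runtimes; these settings make the three behaviors above explicit:

```sql
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.coalescePartitions.enabled = true;
SET spark.sql.adaptive.skewJoin.enabled = true;
```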
6
Q

What is Liquid Clustering?

How is LC implemented?

What is the default # of partitions?

A

LC improves read/write performance without the manual tuning required by the ZORDER and partitioning methods.

Can be enabled when a table is created or later via ALTER TABLE.

Liquid Clustering is initiated for an object when the CLUSTER BY <column_name(s)> keyword is used in its creation syntax.

CLUSTER BY AUTO allows historical query data to be used predictively.

200 (the default value of spark.sql.shuffle.partitions)
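
A sketch of both forms (the events table and its columns are hypothetical):

```sql
-- CLUSTER BY at creation enables Liquid Clustering on the listed columns
CREATE TABLE events (
  event_id   BIGINT,
  event_date DATE,
  user_id    BIGINT
) CLUSTER BY (event_date, user_id);

-- or let historical query patterns choose the clustering keys
ALTER TABLE events CLUSTER BY AUTO;
```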

7
Q

Explain Partitioning.

What is data skipping?

What are three use cases for Partitioning?

A

Partitioning physically arranges data on disk into directories based on a specified column(s).

When well executed, partitions allow whole directories to be skipped at read time (i.e. data skipping).

1.Time series data (e.g. year, month, day)

2.Consistent filter on certain column(s)

3.Stable set of distinct values for partitioning (e.g. Sales Reps, Regions)
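
A sketch of the time-series use case (the logs table is hypothetical):

```sql
-- partition directories are laid out by year/month/day
CREATE TABLE logs (
  message STRING,
  year INT, month INT, day INT
) USING DELTA
PARTITIONED BY (year, month, day);

-- a filter on the partition columns skips every other directory
SELECT * FROM logs WHERE year = 2024 AND month = 6;
```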

8
Q

What are the 4 best practices for partitioning?

A

1.Cardinality: Low-to-medium; # of distinct values is manageable

2.Avoid Data Skew: Data should be evenly distributed b/t partitions

3.Size: Aim for roughly 1 GB per partition; too big: memory problems; too small: small-file overhead delays

4.Write-Time Optimization: Use when writing data in order to organize it by partition

9
Q

What is clustering?

How is a cluster implemented and what is an important option?

What is a best practice for its implementation?

What are its use cases?

A

Subdivides data (optionally within each partition) into a fixed number of buckets based on the hash of other column(s).

A cluster is implemented with the “CLUSTERED BY (column) INTO n BUCKETS” clause within the “CREATE TABLE” syntax; the number of buckets is the important option.

Optimal size: same as a partition. Total data / # of buckets = 100 MB - 1 GB

1.Large tables that are frequently joined on a high-cardinality field (e.g. user_id, product_id); Spark can perform a sort-merge join w/o a shuffle

2.Columns used in groupBy for aggregations
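
A sketch of the join use case (the clicks table is hypothetical; note that on Databricks, bucketing applies to Parquet/Hive-style tables, while Delta tables use Liquid Clustering instead):

```sql
-- both sides of a frequent user_id join bucketed the same way
-- can be joined without a shuffle
CREATE TABLE clicks (
  user_id BIGINT,
  url     STRING
) USING PARQUET
CLUSTERED BY (user_id) INTO 64 BUCKETS;
```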

10
Q

Explain the small files problem.

How can it be mitigated?

A

Small files from ingestion or over-partitioning/clustering can hurt query performance by raising read/write overhead.

Ingestion: run the OPTIMIZE command on the table to compact the files

Over-partitioning: rework the partition strategy
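
The ingestion fix, sketched against the hypothetical sales table:

```sql
-- compact the small files produced by frequent ingestion
OPTIMIZE sales;
```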

11
Q

Explain Caching.

When should it be used?

How is it implemented?

A

Caching stores specified data in the memory of each worker node instead of re-reading it from disk.

Use on small datasets that are used multiple times/iteratively in the same task (e.g. ML)

df.cache()

12
Q

Explain the Broadcast Join.

What is the use case for them?

How is it implemented?

A

Broadcast joins copy a small, read-only dataset into the memory of every worker node, so the large table does not need to be shuffled across the cluster.

Using immutable lookup data against a large/complex table(s)

df_big.join(F.broadcast(df_small), 'join_key')

13
Q

Explain the Filters tenet.

How is it implemented?

A

Filters should be “pushed down” to the source before data is read into Spark, reducing the query’s I/O.

Run filter clauses as early as possible in the query.
Look for “PushedFilters” in the explain plan.
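
A sketch against the hypothetical sales table; the predicate on sale_date can be pushed to the Parquet/Delta scan:

```sql
-- EXPLAIN shows the predicate under PushedFilters in the scan node
EXPLAIN SELECT order_id, amount
FROM sales
WHERE sale_date >= '2024-01-01';
```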

14
Q

What is a repartition?

What is a use case for repartition?

How is a repartition initiated?

A

A full shuffle of the data across the cluster into a new set of partitions.

Skew reduction; improve parallelism

df.repartition(num_partitions, "column_name")

15
Q

What is a coalesce?

What is a use case for coalesce?

How is a coalesce initiated?

A

Reduces an existing set of partitions to a smaller number without a full shuffle; the new count must be <= the # of existing partitions.

Reduce the number of small partitions (e.g. before writing output) with less overhead than a repartition

df.coalesce(num_partitions)

16
Q

What is a shuffle?

How is it mitigated?

A

Redistribution of data across executors over the network (e.g. for joins and aggregations)

Optimal Partition/Cluster Strategy

17
Q

What is skew?

What are two ways to mitigate it?

A

Unbalanced partition sizes across a data set

Liquid Clustering
Optimal Partition/Cluster Strategy
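
A toy illustration in plain Python (no Spark; the keys list is made up) of why hash-partitioning a skewed key column produces unbalanced partitions:

```python
import zlib
from collections import Counter

# 90% of rows share one key value -- a classic skewed column
keys = ["US"] * 90 + ["CA"] * 7 + ["MX"] * 3
num_partitions = 4

# hash-partition the rows the way a shuffle would
sizes = Counter(zlib.crc32(k.encode()) % num_partitions for k in keys)

# the partition holding "US" gets at least 90 of the 100 rows,
# while the others sit nearly idle
print(sorted(sizes.values(), reverse=True))
```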

18
Q

Explain the VACUUM command.

How does it affect Deletes?

A

Cleans the disk of deleted, unreferenced, or obsolete data files.

Files are only physically deleted after VACUUM is run on the table (default retention: 7 days)
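
A sketch against the hypothetical sales table:

```sql
-- remove files no longer referenced by the table
-- (default retention threshold: 7 days)
VACUUM sales;

-- or with an explicit retention window
VACUUM sales RETAIN 168 HOURS;
```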

19
Q

Explain the OPTIMIZE command.

When should it be invoked in reference to VACUUM?

A

Compacts small files into larger ones to minimize read overhead.

Before VACUUM.

20
Q

What is Z-Order Indexing?

How is it implemented?

A

Groups related data into the same set of optimized files without creating subfolders; rows with similar column values are co-located, so entire files can be skipped at read time.

Implemented by adding ZORDER BY to the OPTIMIZE command, followed by a column name.
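
A sketch against the hypothetical events table:

```sql
-- compact files and co-locate rows with similar user_id values
OPTIMIZE events ZORDER BY (user_id);
```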