Spark Query Optimization Flashcards

(20 cards)

1
Q

What are the 4 “Out of the box” tenets of Spark Query Optimization?

A

1.File Format
2.Photon
3.AQE
4.Liquid Clustering/ Data Skipping

2
Q

What are the 5 “Read” tenets of Spark Optimization?

A

1.Partition & Cluster Strategy
2.Minimizing Skew
3.Caching
4.Broadcast Joins
5.Filters

3
Q

Explain the File Format tenet.

A

Data should be formatted as Delta or Parquet.
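
A minimal sketch (the sales table and its columns are hypothetical; Delta is already the default table format on Databricks, but USING DELTA makes it explicit):

```sql
-- hypothetical table; USING DELTA makes the format explicit
CREATE TABLE sales (
  order_id  BIGINT,
  amount    DOUBLE,
  sale_date DATE
) USING DELTA;
```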

4
Q

Explain the Photon tenet.

A

Photon is Databricks’ native C++ vectorized execution engine that speeds up SQL and DataFrame processing. Make sure it is enabled on the cluster.

5
Q

Explain the AQE tenet.

What are its 3 key features?

A

Adaptive Query Execution (AQE) reoptimizes the query plan at runtime.

  1. Coalesces shuffle partitions
  2. Converts sort-merge joins to broadcast joins
  3. Splits large partitions in joins to minimize skew
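
AQE is enabled by default in recent Spark and Databricks runtimes; these settings make the three behaviors above explicit:

```sql
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.coalescePartitions.enabled = true;
SET spark.sql.adaptive.skewJoin.enabled = true;
```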
6
Q

What is Liquid Clustering?

How is LC implemented?

What is the default # of partitions?

A

LC improves read/write performance without the manual tuning required by the ZORDER and partitioning methods.

Can be enabled when a table is created or later via ALTER TABLE.

Liquid Clustering is initiated for an object when the CLUSTER BY <column_name(s)> keyword is used in its creation syntax.

CLUSTER BY AUTO allows historical query data to be used predictively.

200 (the default value of spark.sql.shuffle.partitions)
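
A sketch of both forms (the events table and its columns are hypothetical):

```sql
-- CLUSTER BY at creation enables Liquid Clustering on the listed columns
CREATE TABLE events (
  event_id   BIGINT,
  event_date DATE,
  user_id    BIGINT
) CLUSTER BY (event_date, user_id);

-- or let historical query patterns choose the clustering keys
ALTER TABLE events CLUSTER BY AUTO;
```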

7
Q

Explain Partitioning.

What is data skipping?

What are three use cases for Partitioning?

A

Partitioning physically arranges data on disk into directories based on a specified column(s).

When well executed, partitions allow whole directories to be skipped at read time (i.e. data skipping).

1.Time series data (e.g. year, month, day)

2.Consistent filter on certain column(s)

3.Stable set of distinct values for partitioning (e.g. Sales Reps, Regions)
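
A sketch of the time-series use case (the logs table is hypothetical):

```sql
-- partition directories are laid out by year/month/day
CREATE TABLE logs (
  message STRING,
  year INT, month INT, day INT
) USING DELTA
PARTITIONED BY (year, month, day);

-- a filter on the partition columns skips every other directory
SELECT * FROM logs WHERE year = 2024 AND month = 6;
```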

8
Q

What are the 4 best practices for partitioning?

A

1.Cardinality: Low-to-medium; # of distinct values is manageable

2.Avoid Data Skew: Data should be evenly distributed b/t partitions

3.Size: Aim for roughly 1 GB per partition; too big: memory problems; too small: small-file overhead delays

4.Write-Time Optimization: Use when writing data in order to organize it by partition

9
Q

What is clustering?

How is a cluster implemented and what is an important option?

What is a best practice for its implementation?

What are its use cases?

A

Subdivides data (optionally within each partition) into a fixed number of buckets based on the hash of other column(s).

A cluster is implemented with the “CLUSTERED BY (column) INTO n BUCKETS” clause within the “CREATE TABLE” syntax; the number of buckets is the important option.

Optimal size: same as a partition. Total data / # of buckets = 100 MB - 1 GB

1.Large tables that are frequently joined on a high-cardinality field (e.g. user_id, product_id); Spark can perform a sort-merge join w/o a shuffle

2.Columns used in groupBy for aggregations
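
A sketch of the join use case (the clicks table is hypothetical; note that on Databricks, bucketing applies to Parquet/Hive-style tables, while Delta tables use Liquid Clustering instead):

```sql
-- both sides of a frequent user_id join bucketed the same way
-- can be joined without a shuffle
CREATE TABLE clicks (
  user_id BIGINT,
  url     STRING
) USING PARQUET
CLUSTERED BY (user_id) INTO 64 BUCKETS;
```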

10
Q

Explain the small files problem.

How can it be mitigated?

A

Small files from ingestion or over-partitioning/clustering can hurt query performance by raising read/write overhead.

Ingestion: run the OPTIMIZE command on the table to compact the files

Over-partitioning: rework the partition strategy
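
The ingestion fix, sketched against the hypothetical sales table:

```sql
-- compact the small files produced by frequent ingestion
OPTIMIZE sales;
```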

11
Q

Explain Caching.

When should it be used?

How is it implemented?

A

Caching stores specified data in the memory of each worker node instead of re-reading it from disk.

Use on small datasets that are used multiple times/iteratively in the same task (e.g. ML)

df.cache()

12
Q

Explain the Broadcast Join.

What is the use case for them?

How is it implemented?

A

Broadcast joins copy a small, read-only dataset into the memory of every worker node, so the large table does not need to be shuffled across the cluster.

Using immutable lookup data against a large/complex table(s)

df_big.join(F.broadcast(df_small), 'join_key')

13
Q

Explain the Filters tenet.

How is it implemented?

A

Filters should be “pushed down” to the source before data is read into Spark, reducing the query’s I/O.

Run filter clauses as early as possible in the query.
Look for “PushedFilters” in the explain plan.
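
A sketch against the hypothetical sales table; the predicate on sale_date can be pushed to the Parquet/Delta scan:

```sql
-- EXPLAIN shows the predicate under PushedFilters in the scan node
EXPLAIN SELECT order_id, amount
FROM sales
WHERE sale_date >= '2024-01-01';
```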

14
Q

What is a repartition?

What is a use case for repartition?

How is a repartition initiated?

A

A full shuffle of the data across the cluster into a new set of partitions.

Skew reduction; improve parallelism

df.repartition(num_partitions, "column_name")

15
Q

What is a coalesce?

What is a use case for coalesce?

How is a coalesce initiated?

A

Reduces an existing set of partitions to a smaller number without a full shuffle; the new count must be <= the # of existing partitions.

Reduce the number of small partitions (e.g. before writing output) with less overhead than a repartition

df.coalesce(num_partitions)

16
Q

What is a shuffle?

How is it mitigated?

A

Redistribution of data across executors over the network (e.g. for joins and aggregations)

Optimal Partition/Cluster Strategy

17
Q

What is skew?

What are two ways to mitigate it?

A

Unbalanced partition sizes across a data set

Liquid Clustering
Optimal Partition/Cluster Strategy
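
A toy illustration in plain Python (no Spark; the keys list is made up) of why hash-partitioning a skewed key column produces unbalanced partitions:

```python
import zlib
from collections import Counter

# 90% of rows share one key value -- a classic skewed column
keys = ["US"] * 90 + ["CA"] * 7 + ["MX"] * 3
num_partitions = 4

# hash-partition the rows the way a shuffle would
sizes = Counter(zlib.crc32(k.encode()) % num_partitions for k in keys)

# the partition holding "US" gets at least 90 of the 100 rows,
# while the others sit nearly idle
print(sorted(sizes.values(), reverse=True))
```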

18
Q

Explain the VACUUM command.

How does it affect Deletes?

A

Cleans the disk of deleted, unreferenced, or obsolete data files.

Files are only physically deleted after VACUUM is run on the table (default retention: 7 days)
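
A sketch against the hypothetical sales table:

```sql
-- remove files no longer referenced by the table
-- (default retention threshold: 7 days)
VACUUM sales;

-- or with an explicit retention window
VACUUM sales RETAIN 168 HOURS;
```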

19
Q

Explain the OPTIMIZE command.

When should it be invoked in reference to VACUUM?

A

Compacts small files into larger ones to minimize read overhead.

Before VACUUM.

20
Q

What is Z-Order Indexing?

How is it implemented?

A

Groups related data into the same set of optimized files without creating subfolders; rows with similar column values are co-located, so entire files can be skipped at read time.

Implemented by adding ZORDER BY to the OPTIMIZE command, followed by a column name.
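
A sketch against the hypothetical events table:

```sql
-- compact files and co-locate rows with similar user_id values
OPTIMIZE events ZORDER BY (user_id);
```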