Essential PySpark: Data Engineering Flashcards

Sreeram Nudurupati - Essential PySpark for Data Analytics: A beginner's guide to harnessing the power and ease of PySpark 3.0 (2021, Packt Publishing) (71 cards)

1
Q

The performance gain in Distributed Computing lies in splitting the data into smaller chunks and processing them _ on _ .
(iteratively / in parallel)
(a single machine / several machines)

A
  1. in parallel
  2. several machines

Code (containing the business logic) for processing is brought to the place where the data chunk is actually stored. This technique is called “Data Parallel Processing”.

2
Q

The MapReduce paradigm breaks down a Data Parallel Processing problem into three kinds of stages.
Name them.

A
  1. Map stage
  2. Shuffle stage
  3. Reduce stage

There can be multiple Map stages followed by multiple Reduce stages. However, a Reduce
stage only starts after all of the Map stages have been completed.

3
Q

The Map stage takes the input dataset, splits it into _ .

A

(key, value) pairs

4
Q

The Shuffle stage takes the (key, value) pairs from the Map stage and _ them so that pairs with the same key end up together.

A

sorts/groups

Sorting/grouping is based on keys.
All pairs sharing a key are routed to the same Reduce task, though one Reduce task may handle several keys.

5
Q

The Reduce stage takes the resultant groups of (key, value) pairs from the Shuffle stage and _ them (groups) to produce the final result.

A

reduces or aggregates
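
The three stages can be sketched as a toy, single-machine word count in plain Python. This is only an illustration of the paradigm (no actual distribution happens here):

```python
from collections import defaultdict
from functools import reduce

lines = ["spark is fast", "spark is easy"]

# Map stage: split each line and emit (key, value) pairs with value 1
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle stage: group the pairs so that equal keys end up together
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce stage: aggregate each group's values into a final count
counts = {key: reduce(lambda a, b: a + b, values)
          for key, values in groups.items()}
print(counts)  # {'spark': 2, 'is': 2, 'fast': 1, 'easy': 1}
```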

6
Q

Commodity Hardware used to store data chunks is prone to failures.
MapReduce achieves resiliency to hardware failures by saving the results of every stage to _ , which offers
* replication
* checkpointing
* re-execution

as resiliency mechanisms.

A

HDFS
(Hadoop Distributed File System)

7
Q

How does “replication of blocks” in HDFS help make MapReduce resilient?

A

Input and intermediate data chunks are stored in HDFS, which replicates blocks (default 3 copies) across different machines. If one disk fails, another copy exists elsewhere.

8
Q

How does “checkpointing intermediate results” to HDFS help make MapReduce resilient?

A

Each stage’s output is written to disk so that if a node fails mid‑computation, the job can restart from the last completed stage instead of reprocessing everything.

9
Q

How does “task re-execution” help make MapReduce resilient?

A

If a Map or Reduce task fails, the MapReduce framework reassigns the task to another node that holds a replica of the data.

10
Q

The round-trip to HDFS (disk) after every stage makes MapReduce relatively _ at
processing data because of the _ I/O performance of physical disks in general.
(slow/fast)

A
  1. slow
  2. slow
11
Q

To overcome the slow performance of MapReduce, Spark keeps intermediate results in memory (memory access is _ than disk I/O), still falling back to disk when needed for resiliency.
Additionally, Spark offers much more _ APIs to express data transformations compared to MapReduce (Hadoop).

(slower/faster)
(low-level/high-level)

A
  1. faster
  2. high-level
12
Q

A _ is a group of computers (machines) all working together as a single unit to solve a distributed computing problem.

A

cluster

13
Q

The primary machine of a Cluster, called the _ , takes care of the orchestration and management of the Cluster.

A

Master Node

14
Q

The secondary machines in a cluster that actually perform the task (data processing code)
are called _ .

A

Worker Nodes

At any time, a Cluster has exactly one Master node and one or more Worker nodes.

15
Q

In Spark, a Resilient Distributed Dataset (RDD) is an _ data structure that is _ across the cluster, residing in the memory (RAM) of _ .

(mutable/immutable)
(localized/distributed)
(single machine / several machines)

A
  1. immutable
  2. distributed
  3. several machines
16
Q

An RDD consists of _ , which are logical divisions of an RDD, with a few of them residing on each machine/node of the cluster.

A

partitions

17
Q

_ are used to manipulate the data stored within the partitions of an RDD.

Hint: They operate on RDDs.

A

Higher-order Functions

map, flatMap, reduce, fold, filter, reduceByKey, join, and union to name a few.

18
Q

A higher-order function accepts a _ as a
parameter: an inner function that defines the actual business logic transforming the data, applied to each partition of the RDD in parallel.

A

lambda

Note: lambdas apply (transformations) to all partitions of the RDD at once (in parallel)
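
A minimal plain-Python sketch of the idea, with an RDD modeled as a list of partitions and a higher-order function applying a user lambda to every element. The `rdd_map` helper is hypothetical, not a Spark API:

```python
# Toy model: an "RDD" as a list of partitions (plain lists).
# Real Spark applies the lambda to each partition in parallel on the
# Executors; this single-machine sketch just loops.
partitions = [[1, 2, 3], [4, 5], [6]]

def rdd_map(f, parts):
    # Apply the user-supplied function to every element of every partition
    return [[f(x) for x in part] for part in parts]

doubled = rdd_map(lambda x: x * 2, partitions)
print(doubled)  # [[2, 4, 6], [8, 10], [12]]
```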

19
Q

Given SparkContext `sc`, write a PySpark program to read a text file as an RDD and split each line into words on a delimiter such as whitespace. Transform each word into a tuple with a default count of 1.

A
linesRDD = sc.textFile("/path/to/file")
wordsRDD = linesRDD.flatMap(lambda s: s.split())
tuplesRDD = wordsRDD.map(lambda s: (s, 1))
20
Q

At any given point in time, an RDD holds information about all the individual operations performed on it, going back all the way to the data source itself.
How is this “lineage information” useful for fault tolerance?

A

If any Executors are lost due to failures, and one or more partitions of the RDD are lost with them, the RDD can easily recreate the lost partitions from the source data by making use of the lineage information, thus making it Resilient to failures.
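
A toy sketch of lineage-based recovery in plain Python (names here are illustrative, not Spark APIs): the "RDD" is remembered as a source plus a chain of transformations, so a lost result can be rebuilt by replaying the chain.

```python
source = [1, 2, 3, 4]
lineage = [
    lambda xs: [x + 1 for x in xs],            # transformation 1
    lambda xs: [x for x in xs if x % 2 == 0],  # transformation 2
]

def recompute(src, ops):
    # Replay every transformation in order, starting from the source data
    data = src
    for op in ops:
        data = op(data)
    return data

materialized = recompute(source, lineage)  # [2, 4]
materialized = None                        # simulate a lost partition
recovered = recompute(source, lineage)     # rebuilt from the lineage
print(recovered)  # [2, 4]
```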

21
Q

List 3 major components of Apache Spark architecture.

A
  1. Driver (JVM process)
  2. Executors (JVM processes)
  3. Cluster Manager
22
Q

Each Executor process is launched on _ node of the Spark cluster.

(master/worker)

A

worker

23
Q
  1. In _ deploy mode, the Driver runs on a worker node chosen by the cluster manager.
  2. In _ deploy mode, the Driver runs on the machine from where you submitted the job.

In both modes, Executors still run on worker nodes of the cluster (close to distributed data)

A
  1. cluster
  2. client

The deploy mode is specified via the `--deploy-mode` flag when we submit a Spark job through the spark-submit command.

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my_app.py (or my_scala_app.jar)
24
Q

Spark supports a variety of cluster managers, depending on your scale, infrastructure, and operational needs.
The concept of a master node (hosting the Driver) is externalized when Spark runs on YARN or Kubernetes.

Name the cluster manager for each of the following clusters:
1. Standalone
2. YARN
3. Kubernetes

A
  1. Spark’s built-in manager
  2. YARN ResourceManager
  3. K8s control plane
25
Q

With `--deploy-mode cluster` and `--master spark://`, the Driver process runs on _ .

A

a worker node
26
Q

With `--deploy-mode cluster` and `--master yarn`, the Driver process runs inside _ .

A

the YARN ApplicationMaster
27
Q

With `--deploy-mode cluster` and `--master k8s://https://`, the Driver process runs in _ .

A

the Kubernetes cluster, as a pod
28
Q

The _ is the single point of failure for a Spark application: the entire application fails if this process fails. Therefore, different Cluster Managers implement different strategies to make it highly available.

A

Driver (JVM process)
29
Q

The _ is responsible for providing the resources requested by the Driver. It also monitors the Executors' task progress and status.

A

Cluster Manager

Resources include: compute, memory, network bandwidth, storage (disk).
30
Q

List the major sequence of events on submitting a Spark job.

A
  1. You submit the job (`spark-submit`).
  2. The Cluster Manager starts the Driver JVM.
  3. The Driver requests Executors (along with their respective resources).
  4. The Cluster Manager launches Executors on worker nodes.
  5. The Driver sends tasks to the Executors.
31
Q

To make Executors run your code, the Spark Driver must send your function (the lambda) and any data it needs across the network. These functions/lambdas are converted into a byte stream through the process called _ . The Python library used for this process is _ .

A
  1. serialization
  2. cloudpickle

Executors then deserialize that byte stream back into a Python function and apply it to their local RDD partitions.
32
Q

Which kind of functions are serializable?

A

Pure functions using only built-in Python types and operations.
33
Q
  1. If your lambda uses only basic Python data types, math, string ops, or simple functions, it's _ .
  2. If it touches external resources, system state, or complex objects, it's _ .

A
  1. serializable
  2. not serializable
34
Q

If your Spark functions/lambdas reference non-serializable state (files, sockets, threads), Spark will throw a _ or a runtime error.

A

PicklingError
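
The serializable vs. non-serializable distinction can be demonstrated with the standard-library `pickle` module. Note that Spark actually serializes functions with cloudpickle, which, unlike stdlib pickle, can also handle lambdas; `is_picklable` below is just a hypothetical helper for illustration:

```python
import pickle
import socket

def is_picklable(obj):
    # Hypothetical helper: True if the object survives pickle.dumps
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

# Plain data built from basic Python types serializes fine...
print(is_picklable(42), is_picklable("abc"), is_picklable([1, 2, 3]))

# ...but an object tied to OS state, such as a socket, does not.
sock = socket.socket()
print(is_picklable(sock))  # False
sock.close()
```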
35
Q

Name the schedulers in the Spark Driver that create tasks.

A
  1. TaskScheduler
  2. DAGScheduler
36
Q

The Spark SQL engine supports two types of APIs, namely, _ .

A
  1. DataFrame
  2. Spark SQL

The Spark SQL engine was added as a layer on top of the RDD API, offering an even easier-to-use and familiar API for developers.
37
Q

At the API level, the difference between a Spark DataFrame and a Pandas DataFrame is that a _ resides in the memory of several machines whereas a _ resides in the memory of a single machine.

A
  1. Spark DataFrame
  2. Pandas DataFrame
38
Q

To allow you to manipulate data, DataFrames come with operations that can be grouped into two main categories, namely, _ .

A
  1. transformations
  2. actions
39
Q

The Spark SQL engine comes with a built-in query optimizer called _ . This optimizer analyzes user code, along with any available statistics on the data, to generate the best possible execution plan for the query, called the _ . This execution plan is further converted into Java bytecode, which runs natively inside the Executor (JVM).

A
  1. Catalyst
  2. Query Plan
40
Q

_ are operations performed on DataFrames that manipulate the data in the DataFrame and result in another DataFrame.

A

Transformations

Some examples of transformations are read, select, where, filter, join, and groupBy.
41
Q

_ are operations that actually cause a result to be calculated and either printed onto the console or, more practically, written back to a storage location.

A

Actions

Some examples of actions include write, count, show, and collect.
42
Q

Spark transformations are not evaluated immediately as they are declared, and data is not manifested in memory until an _ is called.

A

action

This principle is called lazy evaluation. A Spark job is triggered in response to an action.
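
A rough single-machine analogy for lazy evaluation, using Python generators rather than Spark itself: "transformations" build a pipeline without touching any data, and only the final "action" pulls rows through it.

```python
evaluated = []

def source():
    for x in [1, 2, 3, 4]:
        evaluated.append(x)  # record when a row is actually processed
        yield x

# Declaring the pipeline runs nothing yet (like chaining transformations)
pipeline = (x * 10 for x in source() if x % 2 == 0)
print(evaluated)         # [] -- no data touched so far

# The "action" (materializing the result) triggers evaluation
result = list(pipeline)
print(result)            # [20, 40]
```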
43
Q

When a plain text file is read, its schema is:
column: _
rows: each line as a string

A

`value`

The default column name is `value`.
44
Q

The Executors apply functions in `pyspark.sql.functions` to operate
  1. _ within a partition
  2. _ across partitions of an RDD

A
  1. sequentially (row by row)
  2. in parallel
45
Q

If an RDD has 10 partitions, then _ Executors (or cores) can process them simultaneously (in parallel).

A

10
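
A loose analogy in plain Python, using a thread pool in place of Executor cores (illustrative only, not Spark): 10 partitions are handed to 10 workers, each computing a partial result.

```python
from concurrent.futures import ThreadPoolExecutor

# 10 "partitions" of 5 numbers each, covering range(50)
partitions = [list(range(i * 5, (i + 1) * 5)) for i in range(10)]

def process(part):
    # Per-partition work: a partial sum
    return sum(part)

# One worker per partition, much as 10 Executor cores could each
# take one partition of a 10-partition RDD
with ThreadPoolExecutor(max_workers=10) as pool:
    partials = list(pool.map(process, partitions))

total = sum(partials)
print(total)  # 1225, i.e. sum(range(50))
```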
46
Q

Write PySpark code to break up the physical data stored in Amazon S3 into eight partitions.

A

```
log_df = spark.read.text("/path/to/s3/bucket").repartition(8)
print(log_df.rdd.getNumPartitions())
```
47
Q

Write code to read a CSV file while retaining the CSV columns as dataframe columns.

A

```
df = spark.read \
    .option('header', 'true') \
    .csv("/path/to/file.csv")
```
48
Q

Rename the column of the table created by `spark.range()`.

Simple one-column rename right after creation.

A

```
df = spark.range(1, 10001, 1, 8).toDF("number")
df.show(5)
```

Here, `toDF("number")` renames the default column `id` to `number`.
49
Q

Rename columns `id`, `name` of a table to `CustomerId`, `Cname`.

Renaming multiple existing columns in a pipeline.

A

```
df = spark.read.table("/path/to/table") \
    .withColumnRenamed("id", "CustomerId") \
    .withColumnRenamed("name", "Cname")
```
50
Q

Display the schema of `df` in tree format.

A

```
df.printSchema()
```
51
Q

Create a dataframe with rows: `(14, "Tom"), (23, "Alice"), (16, "Bob")`

A

```
df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
```

The second argument (the list) supplies the column names.
52
Q

Write code to read a CSV file while retaining the CSV columns as dataframe columns and inferring datatypes from the values in the CSV file.

A

Option 1

```
df = spark.read \
    .option('header', 'true') \
    .csv('path/to/file', inferSchema=True)
```

Option 2

```
df = spark.read.csv("path/to/file", header=True, inferSchema=True)
```

Without inferSchema=True, numbers in the .csv file will also be read as strings.
53
Q

Return the first 5 rows of dataframe `df`.

A

```
df.head(5)
```
54
Q

```
df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", "name"])
```

Select column "name" and a new column "age" with `age_value + 10`.

A

```
df.select('name', (df.age + 10).alias('age'))
```

To select all columns, use `df.select('*')`
55
Q

The return type of `df.show(5)` is _ whereas the return type of `df.head(5)` is _ .

A
  1. NoneType
  2. List[Row]
56
Q

Return the datatypes of the columns of dataframe `df` as a list.

A

```
df.dtypes
```

Each element of the list is a 2-tuple: (name, type); both strings.
57
Q

Show a summary of a dataset, i.e., count, mean, stddev, min, max.

A

```
df.describe().show()
```
58
Q

Add a new column to

```
df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", "name"])
```

which shows future_age after 5 years.

A

Option 1:

```
df.select("name", "age", (df["age"] + 5).alias("age_after_5yrs"))
```

Option 2:

```
df.withColumn("age_after_5yrs", df["age"] + 5)
```

Option 2 is better if you need all columns + a new column.
59
Q

Remove column `age_after_5yrs` from `df`.

A

```
df.drop("age_after_5yrs")
```
60
Q

Remove all rows from `df` having one or more `null` entries.

one or more = any

A

`df.dropna()`
61
Q

Remove all rows from `df` having one or more `null` entries in columns age and name.

one or more = any

A

```
df.dropna(subset=['age', 'name'])
```
62
Q

Remove all rows from `df` having more than 2 `null` entries.

If N is the total number of columns, we don't want the count of non-null columns to drop below N-2.

A

```
df.dropna(thresh=len(df.columns) - 2)
```
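
The `thresh` logic can be checked with a plain-Python simulation (illustrative only, not the actual dropna implementation): `dropna(thresh=k)` keeps a row only if it has at least k non-null values, and disallowing "more than 2 nulls" over N columns means k = N - 2.

```python
rows = [
    (1,    "a",  10),    # 0 nulls -> kept
    (None, None, 20),    # 2 nulls -> kept
    (None, None, None),  # 3 nulls -> dropped
]
n_cols = 3
thresh = n_cols - 2  # at least 1 non-null value required

kept = [r for r in rows if sum(v is not None for v in r) >= thresh]
print(len(kept))  # 2
```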
63
Q

Fill all null entries in column "age" with 25.

A

```
df.fillna(25, subset='age')
```
64
Q

Fill all null entries in column "age" with 25 and in column "name" with "Unknown".

A

```
df.fillna({'age': 25, 'name': 'Unknown'})
```
65
Q

Return rows in `df` with `age > 15`.

A

Option 1

```
df.filter(df['age'] > 15)
```

Option 2

```
df.filter('age > 15')
```
66
Q

Return rows in `df` with `age > 15` and `name` either 'Alice' or 'Bob'.

A

Option 1

```
df.filter((df['age'] > 15) & (df['name'].isin('Alice', 'Bob')))
```

Option 2

```
df.filter("age > 15 AND name IN ('Alice', 'Bob')")
```
67
Q

Return rows in `df` where `name` is neither 'Alice' nor 'Bob'.

A

```
df.filter(~df['name'].isin('Alice', 'Bob'))
```
68
Q

Given `df` has 3 columns: `name`, `age`, `order`, write PySpark code equivalent to

```
SELECT AVG(age), AVG(order) FROM df;
```

A

Option 1

```
from pyspark.sql.functions import avg
df.select(avg('age').alias('avg_age'), avg('order').alias('avg_order')).show()
```

Option 2

```
from pyspark.sql.functions import avg
df.groupBy().agg(avg('age'), avg('order'))
```

Option 3

```
df.groupBy().agg({'age': 'avg', 'order': 'avg'})
```
69
Q

```
df = spark.createDataFrame([(14, "Tom", 94), (23, "Alice", 82), (23, "Alice", 102), (16, "Bob", 90)], ["age", "name", "order"])
```

Find the name of the person with the highest order, using PySpark code.

SELECT name FROM df ORDER BY order DESC LIMIT 1;

A

```
df.orderBy('order', ascending=False).select('name').limit(1).show()
```
70
Q

```
df = spark.createDataFrame([(14, "Tom", 94), (23, "Alice", 82), (23, "Alice", 102), (16, "Bob", 90)], ["age", "name", "order"])
```

Find the average order of each person (name) in `df`, using PySpark code.

select name, avg(order) from df group by name

A

```
from pyspark.sql.functions import avg
df.groupBy('name').agg(avg('order').alias('avg_order')).show()
```
71
Q

```
df = spark.createDataFrame([(14, "Tom", 94), (23, "Alice", 82), (23, "Alice", 102), (16, "Bob", 90)], ["age", "name", "order"])
```

Find the total number of orders made by each person (name) in `df`, using PySpark code.

A

```
from pyspark.sql.functions import count
df.groupBy('name').agg(count('*').alias('order_count')).show()
```