Ascendeum Interview Questions Flashcards

asked by Notebook LLM (3 cards)

1
Q

What is the fundamental difference between processing a 1TB dataset using standard Python/Pandas on a single machine versus processing it using PySpark on an EMR cluster?

A

The fundamental difference is distribution and parallelism:
1. Pandas (Single Machine): All processing is sequential and confined to the RAM of one machine. If the data is larger than the available RAM, processing fails or requires slow, manual, iterative chunking.
2. PySpark (EMR Cluster): PySpark is designed for this scale. It automatically splits the 1TB dataset into smaller chunks (partitions), which are then processed simultaneously and in parallel across the many worker nodes (executors) provisioned by your EMR cluster. This allows efficient execution of complex transformations that would be impossible on a single machine.
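The single-machine chunking workaround mentioned in point 1 can be sketched with pandas. This is a toy illustration, assuming an in-memory `StringIO` stands in for a file too large to load at once; the chunk size and column names are made up for the example:

```python
import io

import pandas as pd

# Toy stand-in for a file too large to fit in RAM: 10 rows of campaign revenue.
csv_data = io.StringIO(
    "campaign,revenue\n" + "\n".join(f"c{i % 3},{i}" for i in range(10))
)

# Manual, sequential chunking: each chunk is loaded, aggregated, and discarded.
# This is exactly the iteration that PySpark parallelizes across executors.
total = 0.0
for chunk in pd.read_csv(csv_data, chunksize=4):  # 4-row chunks stand in for GB-sized ones
    total += chunk["revenue"].sum()
```

Note that the loop itself is the bottleneck: every chunk is processed one after another on a single core, whereas Spark processes partitions concurrently.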

2
Q

Why is using a PySpark DataFrame API considered the optimized approach for writing complex transformations in Spark, compared to using the older, less abstract RDD API?

A

The core reason DataFrames are optimized is the Catalyst Optimizer. DataFrames contain rich metadata about the data (like column names and types—schema), which RDDs lack. This structure allows the Catalyst Optimizer to:
1. Generate an Optimized Execution Plan: Spark’s engine can analyze the entire DataFrame transformation plan and rewrite it for maximum efficiency before any computation starts.
2. Utilize Efficient Memory Management: DataFrames use Project Tungsten techniques (off-heap memory and code generation) for more efficient memory and CPU utilization than RDDs.

In short, DataFrames are not just a higher-level abstraction over RDDs; they provide the structured information (the schema) that enables Spark’s optimization engine to run your code faster and use fewer resources than an RDD equivalent.

3
Q

Imagine you have a large PySpark DataFrame named ad_events with columns CampaignID, UserID, and Revenue.
Your goal is to find the total revenue and the total number of distinct users for each unique CampaignID.

A