Spark Flashcards

(54 cards)

1
Q

Driver Program

A

The process that runs the main function of an application and creates the SparkContext.

2
Q

Apache Spark

A

An open-source, distributed data-processing engine designed to handle real-time, batch, and iterative workloads efficiently.

3
Q

In-memory computing

A

A technique where data is cached in RAM to reduce repeated reads, leading to faster processing.

4
Q

saveAsTextFile

A

An action that writes the contents of a dataset to a text file.

5
Q

Spark Shell

A

An interactive environment for experimenting with Spark code.

6
Q

Tungsten Execution Engine

A

A project focused on improving the efficiency of memory and CPU utilization for computations.

7
Q

Cluster Manager

A

An external service in a Spark setup responsible for allocating resources to applications.

8
Q

Immutability

A

The property of an object whose state cannot be modified after it is created.

9
Q

MLlib

A

A library within Spark for scalable machine learning algorithms.

10
Q

Broadcast Variables

A

Variables that are cached on each machine to avoid shipping a copy of large datasets with every task.

11
Q

flatMap

A

A data transformation that applies a function to each element of a data structure and flattens the results.
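
The semantics can be sketched in plain Python (this is not PySpark code; `flat_map` is a local helper used for illustration, not a Spark API call):

```python
def flat_map(func, data):
    # Apply func to each element, then flatten the per-element
    # results into one sequence -- the core idea behind flatMap.
    return [item for element in data for item in func(element)]

lines = ["hello world", "spark is fast"]
words = flat_map(lambda line: line.split(), lines)
# words == ["hello", "world", "spark", "is", "fast"]
```

Contrast with a plain map, which would return a list of lists instead of a single flattened list.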

12
Q

Intersection Operation

A

Creating a dataset containing only the elements that are present in both of the input datasets.
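
A minimal pure-Python sketch of the idea (`intersection` here is a local helper, not Spark's API; like Spark's RDD intersection, the result is deduplicated):

```python
def intersection(a, b):
    # Keep only elements present in both datasets, without duplicates.
    return list(set(a) & set(b))

result = intersection([1, 2, 3, 4], [3, 4, 5])
# sorted(result) == [3, 4]
```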

13
Q

Map Operation

A

Applying a function to each element in a dataset and returning a new dataset with the transformed elements.
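
A pure-Python sketch of the semantics (illustrative helper, not Spark code):

```python
def map_operation(func, data):
    # Transform each element independently; the input is left
    # unmodified and a new dataset is returned.
    return [func(element) for element in data]

squares = map_operation(lambda x: x * x, [1, 2, 3])
# squares == [1, 4, 9]
```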

14
Q

MapReduce

A

A batch-oriented processing model known for its limitations in real-time, OLTP, graph, and iterative processing scenarios.

15
Q

Graph Processing

A

Analyzing relationships between entities represented as nodes and edges.

16
Q

RDD Lineage

A

The recorded sequence of operations that allows for the reconstruction of lost data partitions in a distributed dataset.

17
Q

Spark Packages

A

An ecosystem of extensions that add functionality to Spark.

18
Q

OLTP Workloads

A

Workloads characterized by short, numerous transactions.

19
Q

DataFrame

A

A distributed collection of data organized into named columns, offering a structured approach to data processing.

20
Q

Spark Streaming

A

A module designed for processing real-time data streams.

21
Q

Dataset

A

A typed version of a DataFrame, available in Scala and Java, that provides compile-time type safety.

22
Q

GraphX

A

A component in the Spark ecosystem for graph-parallel computation.

23
Q

Shuffling

A

The process of redistributing data across partitions, often required for operations like joins and aggregations.

24
Q

Filter Operation

A

Selecting elements from a dataset based on a specified condition.
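
A pure-Python sketch of the semantics (illustrative helper, not Spark code):

```python
def filter_operation(predicate, data):
    # Keep only the elements for which the predicate is true.
    return [element for element in data if predicate(element)]

evens = filter_operation(lambda x: x % 2 == 0, [1, 2, 3, 4, 5])
# evens == [2, 4]
```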

25
Q

Join Operation

A

Combining related information from multiple datasets based on a common key.

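
An inner join over key-value pairs can be sketched in plain Python (illustrative helper, not Spark's pair-RDD API; the data below is made up):

```python
def join(left, right):
    # For every key present in both datasets, emit
    # (key, (left_value, right_value)).
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_by_key.get(key, [])]

ages = [("alice", 30), ("bob", 25)]
cities = [("alice", "Paris"), ("carol", "Oslo")]
joined = join(ages, cities)
# joined == [("alice", (30, "Paris"))]
```

Keys that appear in only one dataset ("bob", "carol") are dropped, as in an inner join.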
26
Q

WordCount

A

A classic example used to demonstrate distributed data processing, involving counting the occurrences of each word in a text.

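
The single-machine version of the logic can be sketched in plain Python; Spark distributes the same split-and-tally steps across a cluster:

```python
from collections import Counter

def word_count(lines):
    # Split each line into words and tally occurrences.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)

counts = word_count(["to be or", "not to be"])
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```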
27
Q

Data Replication

A

Creating multiple copies of data to ensure availability and fault tolerance.

28
Q

Resilient Distributed Dataset (RDD)

A

A fault-tolerant, immutable, distributed collection of objects that can be processed in parallel across a cluster.

29
Q

Lazy Evaluation

A

An evaluation strategy where expressions are not evaluated until their values are needed.

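
Python generators give a rough analogy (this is not Spark code; Spark's transformations are similarly only descriptions of work until an action forces execution):

```python
def transformed(data):
    # A generator describes the computation without running it;
    # nothing in the body executes until a value is requested.
    for x in data:
        yield x * 10

pipeline = transformed(range(3))   # no work done yet
results = list(pipeline)           # forcing evaluation, like an action
# results == [0, 10, 20]
```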
30
Q

Throughput

A

The amount of data that can be processed in a given unit of time.

31
Q

HDFS

A

A distributed file system designed for storing large datasets reliably.

32
Q

Accumulators

A

Variables that are only 'added' to through an associative and commutative operation and can be efficiently supported in parallel.

33
Q

High-level APIs

A

Programming interfaces that abstract away low-level details, making it easier to develop applications.

34
Q

Spark Core

A

The foundational module responsible for memory management, scheduling, fault recovery, and interaction with cluster managers.

35
Q

Executors

A

Processes that run computations and store data on worker nodes in a Spark cluster.

36
Q

Data Locality

A

The principle of moving computation close to the data to minimize network traffic.

37
Q

RDD Transformations

A

Operations on a dataset that are lazily evaluated and produce a new dataset, like map, filter, or union.

38
Q

SparkContext

A

The entry point to Spark functionality; it represents a connection to a Spark cluster and is used to create RDDs, accumulators, and broadcast variables.

39
Q

Fault Tolerance

A

The ability of a system to continue operating properly in the event of the failure of some of its components, achieved through mechanisms like lineage.

40
Q

Union Operation

A

Combining two datasets into a single dataset containing all elements from both.

41
Q

Spark SQL

A

A module for processing structured data, providing support for SQL queries and a DataFrame API.

42
Q

RDD Actions

A

Operations that trigger the execution of computations on a dataset and return results to the driver program, such as reduce, count, or first.

43
Q

Scalability

A

The ability of a system to handle increasing amounts of work or data without negatively impacting performance.

44
Q

Latency

A

The time elapsed between receiving a unit of data and producing its result.

45
Q

Batch Processing

A

Processing large volumes of accumulated data in a single job.

46
Q

Cartesian Product

A

An operation that combines each item of one RDD with each item of a second RDD.

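
A pure-Python sketch of the semantics (illustrative helper, not Spark's API):

```python
def cartesian(a, b):
    # Pair every element of the first dataset with every element
    # of the second; the result has len(a) * len(b) pairs.
    return [(x, y) for x in a for y in b]

pairs = cartesian([1, 2], ["a", "b"])
# pairs == [(1, "a"), (1, "b"), (2, "a"), (2, "b")]
```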
47
Q

Catalyst Optimizer

A

A query optimization framework within Spark SQL that improves query execution.

48
Q

Subtract Operation

A

Creating a dataset containing only the elements that are present in the first dataset but not in the second.

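
A pure-Python sketch of the idea (illustrative helper, not Spark's API):

```python
def subtract(a, b):
    # Keep elements of the first dataset that do not appear
    # in the second.
    exclude = set(b)
    return [x for x in a if x not in exclude]

result = subtract([1, 2, 3, 4], [2, 4])
# result == [1, 3]
```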
49
Q

Iterative Algorithms

A

Algorithms that repeatedly apply a set of operations to converge on a result.

50
Q

reduceByKey

A

A data transformation that merges the values for each key using an associative reduce function.

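
A pure-Python sketch of the semantics on (key, value) pairs (`reduce_by_key` is a local helper for illustration, not Spark code):

```python
def reduce_by_key(func, pairs):
    # Merge all values sharing a key with an associative function.
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

totals = reduce_by_key(lambda a, b: a + b,
                       [("a", 1), ("b", 2), ("a", 3)])
# dict(totals) == {"a": 4, "b": 2}
```

In Spark the same merge happens partition-locally first, which is why the function must be associative.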
51
Q

Real-time Processing

A

Processing data as it is generated, providing immediate insights.

52
Q

Low-level APIs

A

Programming interfaces that provide fine-grained control over system resources and operations.

53
Q

RDD Partition

A

A logical division of data in a Resilient Distributed Dataset, enabling parallel processing.

54
Q

Worker Node

A

A machine in a Spark cluster that runs executors for computations.