Spark Flashcards

(54 cards)

1
Q

Driver Program

A

The process that runs the main function of an application and creates the SparkContext.

2
Q

Apache Spark

A

An open-source, distributed data-processing engine designed to handle real-time, batch, and iterative workloads efficiently.

3
Q

In-memory computing

A

A technique where data is cached in RAM to reduce repeated reads, leading to faster processing.

4
Q

saveAsTextFile

A

An action that writes the contents of a dataset to a text file.

5
Q

Spark Shell

A

An interactive environment for experimenting with Spark code.

6
Q

Tungsten Execution Engine

A

A project focused on improving the efficiency of memory and CPU utilization for computations.

7
Q

Cluster Manager

A

An external service in a Spark setup responsible for allocating resources to applications.

8
Q

Immutability

A

The property of an object whose state cannot be modified after it is created.

9
Q

MLlib

A

A library within Spark for scalable machine learning algorithms.

10
Q

Broadcast Variables

A

Variables that are cached on each machine to avoid shipping a copy of large datasets with every task.

11
Q

flatMap

A

A data transformation that applies a function to each element of a data structure and flattens the results.
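
The semantics can be sketched in plain Python (this is not PySpark code; `flat_map` is a local helper used for illustration, not a Spark API call):

```python
def flat_map(func, data):
    # Apply func to each element, then flatten the per-element
    # results into one sequence -- the core idea behind flatMap.
    return [item for element in data for item in func(element)]

lines = ["hello world", "spark is fast"]
words = flat_map(lambda line: line.split(), lines)
# words == ["hello", "world", "spark", "is", "fast"]
```

Contrast with a plain map, which would return a list of lists instead of a single flattened list.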

12
Q

Intersection Operation

A

Creating a dataset containing only the elements that are present in both of the input datasets.
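
A minimal pure-Python sketch of the idea (`intersection` here is a local helper, not Spark's API; like Spark's RDD intersection, the result is deduplicated):

```python
def intersection(a, b):
    # Keep only elements present in both datasets, without duplicates.
    return list(set(a) & set(b))

result = intersection([1, 2, 3, 4], [3, 4, 5])
# sorted(result) == [3, 4]
```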

13
Q

Map Operation

A

Applying a function to each element in a dataset and returning a new dataset with the transformed elements.
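
A pure-Python sketch of the semantics (illustrative helper, not Spark code):

```python
def map_operation(func, data):
    # Transform each element independently; the input is left
    # unmodified and a new dataset is returned.
    return [func(element) for element in data]

squares = map_operation(lambda x: x * x, [1, 2, 3])
# squares == [1, 4, 9]
```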

14
Q

MapReduce

A

A batch-oriented processing model known for its limitations in real-time, OLTP, graph, and iterative processing scenarios.

15
Q

Graph Processing

A

Analyzing relationships between entities represented as nodes and edges.

16
Q

RDD Lineage

A

The recorded sequence of operations that allows for the reconstruction of lost data partitions in a distributed dataset.

17
Q

Spark Packages

A

An ecosystem of extensions that add functionality to Spark.

18
Q

OLTP Workloads

A

Workloads characterized by short, numerous transactions.

19
Q

DataFrame

A

A distributed collection of data organized into named columns, offering a structured approach to data processing.

20
Q

Spark Streaming

A

A module designed for processing real-time data streams.

21
Q

Dataset

A

A typed version of a DataFrame, available in Scala and Java, that provides compile-time type safety.

22
Q

GraphX

A

A component in the Spark ecosystem for graph-parallel computation.

23
Q

Shuffling

A

The process of redistributing data across partitions, often required for operations like joins and aggregations.

24
Q

Filter Operation

A

Selecting elements from a dataset based on a specified condition.
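
A pure-Python sketch of the semantics (illustrative helper, not Spark code):

```python
def filter_operation(predicate, data):
    # Keep only the elements for which the predicate is true.
    return [element for element in data if predicate(element)]

evens = filter_operation(lambda x: x % 2 == 0, [1, 2, 3, 4, 5])
# evens == [2, 4]
```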

25
Q

Join Operation

A

Combining related information from multiple datasets based on a common key.

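
An inner join over key-value pairs can be sketched in plain Python (illustrative helper, not Spark's pair-RDD API; the data below is made up):

```python
def join(left, right):
    # For every key present in both datasets, emit
    # (key, (left_value, right_value)).
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_by_key.get(key, [])]

ages = [("alice", 30), ("bob", 25)]
cities = [("alice", "Paris"), ("carol", "Oslo")]
joined = join(ages, cities)
# joined == [("alice", (30, "Paris"))]
```

Keys that appear in only one dataset ("bob", "carol") are dropped, as in an inner join.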
26
Q

WordCount

A

A classic example used to demonstrate distributed data processing, involving counting the occurrences of each word in a text.

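
The single-machine version of the logic can be sketched in plain Python; Spark distributes the same split-and-tally steps across a cluster:

```python
from collections import Counter

def word_count(lines):
    # Split each line into words and tally occurrences.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)

counts = word_count(["to be or", "not to be"])
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```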
27
Q

Data Replication

A

Creating multiple copies of data to ensure availability and fault tolerance.

28
Q

Resilient Distributed Dataset (RDD)

A

A fault-tolerant, immutable, distributed collection of objects that can be processed in parallel across a cluster.

29
Q

Lazy Evaluation

A

An evaluation strategy where expressions are not evaluated until their values are needed.

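
Python generators give a rough analogy (this is not Spark code; Spark's transformations are similarly only descriptions of work until an action forces execution):

```python
def transformed(data):
    # A generator describes the computation without running it;
    # nothing in the body executes until a value is requested.
    for x in data:
        yield x * 10

pipeline = transformed(range(3))   # no work done yet
results = list(pipeline)           # forcing evaluation, like an action
# results == [0, 10, 20]
```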
30
Q

Throughput

A

The amount of data that can be processed in a given unit of time.

31
Q

HDFS

A

A distributed file system designed for storing large datasets reliably.

32
Q

Accumulators

A

Variables that are only 'added' to through an associative and commutative operation and can be efficiently supported in parallel.

33
Q

High-level APIs

A

Programming interfaces that abstract away low-level details, making it easier to develop applications.

34
Q

Spark Core

A

The foundational module responsible for memory management, scheduling, fault recovery, and interaction with cluster managers.

35
Q

Executors

A

Processes that run computations and store data on worker nodes in a Spark cluster.

36
Q

Data Locality

A

The principle of moving computation close to the data to minimize network traffic.

37
Q

RDD Transformations

A

Operations on a dataset that are lazily evaluated and produce a new dataset, like map, filter, or union.

38
Q

SparkContext

A

The entry point to Spark functionality; it represents a connection to a Spark cluster and is used to create RDDs, accumulators, and broadcast variables.

39
Q

Fault Tolerance

A

The ability of a system to continue operating properly in the event of the failure of some of its components, achieved through mechanisms like lineage.

40
Q

Union Operation

A

Combining two datasets into a single dataset containing all elements from both.

41
Q

Spark SQL

A

A module for processing structured data, providing support for SQL queries and a DataFrame API.

42
Q

RDD Actions

A

Operations that trigger the execution of computations on a dataset and return results to the driver program, such as reduce, count, or first.

43
Q

Scalability

A

The ability of a system to handle increasing amounts of work or data without negatively impacting performance.

44
Q

Latency

A

The time elapsed between receiving a unit of data and producing its result.

45
Q

Batch Processing

A

Processing large volumes of accumulated data in a single job.

46
Q

Cartesian Product

A

An operation that combines each item of one RDD with each item of a second RDD.

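
A pure-Python sketch of the semantics (illustrative helper, not Spark's API):

```python
def cartesian(a, b):
    # Pair every element of the first dataset with every element
    # of the second; the result has len(a) * len(b) pairs.
    return [(x, y) for x in a for y in b]

pairs = cartesian([1, 2], ["a", "b"])
# pairs == [(1, "a"), (1, "b"), (2, "a"), (2, "b")]
```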
47
Q

Catalyst Optimizer

A

A query optimization framework within Spark SQL that improves query execution.

48
Q

Subtract Operation

A

Creating a dataset containing only the elements that are present in the first dataset but not in the second.

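
A pure-Python sketch of the idea (illustrative helper, not Spark's API):

```python
def subtract(a, b):
    # Keep elements of the first dataset that do not appear
    # in the second.
    exclude = set(b)
    return [x for x in a if x not in exclude]

result = subtract([1, 2, 3, 4], [2, 4])
# result == [1, 3]
```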
49
Q

Iterative Algorithms

A

Algorithms that repeatedly apply a set of operations to converge on a result.

50
Q

reduceByKey

A

A data transformation that merges the values for each key using an associative reduce function.

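
A pure-Python sketch of the semantics on (key, value) pairs (`reduce_by_key` is a local helper for illustration, not Spark code):

```python
def reduce_by_key(func, pairs):
    # Merge all values sharing a key with an associative function.
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

totals = reduce_by_key(lambda a, b: a + b,
                       [("a", 1), ("b", 2), ("a", 3)])
# dict(totals) == {"a": 4, "b": 2}
```

In Spark the same merge happens partition-locally first, which is why the function must be associative.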
51
Q

Real-time Processing

A

Processing data as it is generated, providing immediate insights.

52
Q

Low-level APIs

A

Programming interfaces that provide fine-grained control over system resources and operations.

53
Q

RDD Partition

A

A logical division of data in a Resilient Distributed Dataset, enabling parallel processing.

54
Q

Worker Node

A

A machine in a Spark cluster that runs executors for computations.