Driver Program
The process that runs the main function of an application and creates the SparkContext.
Apache Spark
An open-source, distributed data-processing engine designed to handle real-time, batch, and iterative workloads efficiently.
In-memory computing
A technique where data is cached in RAM to reduce repeated reads, leading to faster processing.
saveAsTextFile
An action that writes the elements of a dataset as text files in a given directory, typically one file per partition.
Spark Shell
An interactive environment for experimenting with Spark code.
Tungsten Execution Engine
A Spark project focused on improving memory and CPU efficiency for computations.
Cluster Manager
An external service in a Spark setup responsible for allocating resources to applications.
Immutability
The property of an object whose state cannot be modified after it is created.
MLlib
A library within Spark for scalable machine learning algorithms.
Broadcast Variables
Read-only variables that are cached once on each machine, which avoids shipping a copy of a large dataset with every task.
flatMap
A data transformation that applies a function to each element of a data structure and flattens the results.
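The difference between mapping and flat-mapping is easiest to see side by side. The sketch below uses plain Python lists, not Spark itself; `flat_map` is a hypothetical helper written for illustration.

```python
from itertools import chain


def flat_map(func, items):
    """Apply func to each item and concatenate the resulting iterables."""
    return list(chain.from_iterable(func(item) for item in items))


lines = ["to be", "or not"]

# A plain map keeps one result per input element (here: one list per line).
mapped = [line.split() for line in lines]

# flat_map applies the same function, then flattens the nested results.
flattened = flat_map(str.split, lines)

print(mapped)
print(flattened)
```

Splitting lines into words is the usual use case: one input element can yield zero, one, or many output elements.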
Intersection Operation
Creating a dataset containing only the elements that are present in both of the input datasets.
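A minimal sketch of the idea in plain Python, not Spark itself. The `intersection` helper is hypothetical; it drops duplicates from the result and keeps first-seen order as a simplification, whereas a distributed engine generally makes no ordering promise.

```python
def intersection(left, right):
    """Return the distinct elements of left that also appear in right."""
    right_set = set(right)
    seen = set()
    result = []
    for element in left:
        # Keep an element only if both inputs contain it and it is new.
        if element in right_set and element not in seen:
            seen.add(element)
            result.append(element)
    return result


common = intersection([1, 2, 2, 3, 4], [2, 3, 3, 5])
print(common)
```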
Map Operation
Applying a function to each element in a dataset and returning a new dataset with the transformed elements.
MapReduce
A batch-oriented processing model known for its limitations in real-time, OLTP, graph, and iterative processing scenarios.
Graph Processing
Analyzing relationships between entities represented as nodes and edges.
RDD Lineage
The recorded sequence of operations that allows for the reconstruction of lost data partitions in a distributed dataset.
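The recovery idea can be sketched with a toy class in plain Python. This is not Spark's implementation: `LineageDataset` and its methods are hypothetical names, and the "partitions" are ordinary lists. The point is that the dataset stores the recipe (source plus recorded steps), so any lost partition can be rebuilt by replaying that recipe.

```python
class LineageDataset:
    """Toy dataset that records its transformations instead of its results."""

    def __init__(self, source_partitions, steps=()):
        self._source = source_partitions  # list of lists, one per partition
        self._steps = tuple(steps)        # recorded lineage

    def map(self, func):
        return LineageDataset(self._source, self._steps + (("map", func),))

    def filter(self, predicate):
        return LineageDataset(self._source, self._steps + (("filter", predicate),))

    def compute_partition(self, index):
        """Replay the recorded steps over one source partition."""
        data = list(self._source[index])
        for kind, func in self._steps:
            if kind == "map":
                data = [func(x) for x in data]
            else:
                data = [x for x in data if func(x)]
        return data


source = [[1, 2, 3], [4, 5, 6]]
even_squares = LineageDataset(source).map(lambda x: x * x).filter(lambda x: x % 2 == 0)

computed = [even_squares.compute_partition(i) for i in range(len(source))]
computed[1] = None                                # simulate a lost partition
computed[1] = even_squares.compute_partition(1)   # rebuild it from lineage
print(computed)
```

Only the lost partition is recomputed; the other partition's result is left untouched.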
Spark Packages
An ecosystem of extensions that add functionality to Spark.
OLTP Workloads
Workloads characterized by short, numerous transactions.
DataFrame
A distributed collection of data organized into named columns, offering a structured approach to data processing.
Spark Streaming
A module designed for processing real-time data streams.
Dataset
A typed version of a DataFrame, available in Scala and Java, that provides compile-time type safety.
GraphX
A component in the Spark ecosystem for graph-parallel computation.
Shuffling
The process of redistributing data across partitions, often required for operations like joins and aggregations.
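A rough sketch of why aggregations need this step, in plain Python and not Spark itself. The helper names are hypothetical, and the CRC-based partitioner is an assumption chosen only because it is deterministic. Records with the same key start out scattered across input partitions; redistributing them by key puts each key in exactly one output partition, where it can be summed locally.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Pick a target partition for a key (deterministic hash, an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(input_partitions, num_partitions):
    """Redistribute (key, value) records so equal keys share a partition."""
    output = [[] for _ in range(num_partitions)]
    for partition in input_partitions:
        for key, value in partition:
            output[partition_for(key, num_partitions)].append((key, value))
    return output


def sum_by_key(partition):
    """Aggregate locally; valid only after equal keys are co-located."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


input_partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)], [("b", 5)]]
shuffled = shuffle(input_partitions, 2)

totals = {}
for partition in shuffled:
    totals.update(sum_by_key(partition))
print(totals)
```

In a real cluster the redistribution moves data over the network between machines, which is why shuffles tend to be among the more expensive steps in a job.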
Filter Operation
Selecting elements from a dataset based on a specified condition.