Threads vs. Processes
Definition:
Threads share memory within the same process, while processes run independently with separate memory spaces.
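The difference can be illustrated with a minimal sketch using Python's standard library. The names `state`, `bump`, and `demo` are hypothetical, and the example assumes a POSIX system where the "fork" start method is available. A thread's update to a module-level object is visible to the parent, while a child process only changes its own copy.

```python
import multiprocessing as mp
import threading

# Module-level state; hypothetical name used only for this illustration.
state = {"value": 0}

def bump():
    """Increment the counter in whatever memory space this runs in."""
    state["value"] += 1

def demo():
    """Return the parent's counter after a thread, then after a child process."""
    state["value"] = 0

    # A thread runs inside the same process and sees the same `state` object.
    t = threading.Thread(target=bump)
    t.start()
    t.join()
    after_thread = state["value"]

    # Assumption: the "fork" start method exists (POSIX only).
    # The child gets its own copy of memory, so its increment stays in the child.
    ctx = mp.get_context("fork")
    p = ctx.Process(target=bump)
    p.start()
    p.join()
    after_process = state["value"]

    return after_thread, after_process
```

Here the thread's increment shows up in the parent, but the child process's increment does not. Sharing data between processes would need an explicit mechanism such as a pipe, a queue, or shared memory.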
Mutex vs. Semaphore
Definition:
A mutex (mutual exclusion lock) allows only one thread at a time to hold it and access the protected resource, while a semaphore keeps a count of permits and can admit up to that many concurrent accesses.
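As a rough sketch of the distinction, the helper below (a hypothetical function named `peak_concurrency`) sends several threads through a guard and records how many were inside at the same time. It uses `threading.Lock` as the mutex and `threading.Semaphore` as the counting semaphore.

```python
import threading
import time

def peak_concurrency(guard, workers=6):
    """Run `workers` threads through `guard`; return the most that were inside at once."""
    state_lock = threading.Lock()  # protects the bookkeeping counters below
    inside = 0
    peak = 0

    def work():
        nonlocal inside, peak
        with guard:  # a Lock admits one thread; a Semaphore(n) admits up to n
            with state_lock:
                inside += 1
                peak = max(peak, inside)
            time.sleep(0.01)  # simulate work while holding the guard
            with state_lock:
                inside -= 1

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak

mutex_peak = peak_concurrency(threading.Lock())
semaphore_peak = peak_concurrency(threading.Semaphore(3))
```

With the lock, `mutex_peak` is always 1. With the semaphore, `semaphore_peak` never exceeds the 3 permits, though the exact value reached depends on thread scheduling.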
Deadlock
Definition:
Occurs when two or more threads or processes block each other indefinitely, each waiting for a resource that another one holds, so none of them can proceed.
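The circular wait can be reproduced safely with a small sketch. The function name `circular_wait_demo` and the timeout value are assumptions made for illustration. Two threads take two locks in opposite order, and a timeout on the second acquisition stands in for what would otherwise be an endless wait.

```python
import threading

def circular_wait_demo(timeout=0.2):
    """Two threads take two locks in opposite order; report who got the second lock."""
    lock_a = threading.Lock()
    lock_b = threading.Lock()
    both_hold_first = threading.Barrier(2)  # ensures each thread holds its first lock
    both_tried = threading.Barrier(2)       # keeps first locks held until both attempts end
    got_second = {}

    def worker(name, first, second):
        with first:
            both_hold_first.wait()
            # Without the timeout this call would block forever: a deadlock.
            acquired = second.acquire(timeout=timeout)
            got_second[name] = acquired
            both_tried.wait()
            if acquired:
                second.release()

    t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
    t2 = threading.Thread(target=worker, args=("t2", lock_b, lock_a))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return got_second
```

Neither thread obtains its second lock, because each is holding the lock the other needs. A common remedy is to have every thread acquire locks in the same global order, which removes the cycle.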
Race Condition
Definition:
Occurs when multiple threads or processes access and modify shared data without proper synchronization, so the outcome depends on the timing of their execution and may be unpredictable or incorrect.
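A minimal sketch of a lost-update race follows, with hypothetical names (`run`, `unsafe_increment`). The read and the write of a shared counter are deliberately separated by `time.sleep(0)` so that other threads get a chance to interleave. Wrapping the same steps in a lock makes the result deterministic.

```python
import threading
import time

def run(increments_per_thread=100, threads=4, use_lock=False):
    """Increment a shared counter from several threads; return the final value."""
    lock = threading.Lock()
    total = 0

    def unsafe_increment():
        nonlocal total
        current = total      # read the shared value
        time.sleep(0)        # yield so another thread may run in between
        total = current + 1  # write back; may overwrite another thread's update

    def worker():
        for _ in range(increments_per_thread):
            if use_lock:
                with lock:   # read-modify-write becomes one indivisible step
                    unsafe_increment()
            else:
                unsafe_increment()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return total
```

With `use_lock=True` the result is always `threads * increments_per_thread`. Without the lock the result can come out lower, and it may differ from run to run.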
Parallelism vs. Concurrency
Definition:
Concurrency is about making progress on multiple tasks over the same time period, not necessarily at the same instant; parallelism is about executing tasks at the same instant on multiple cores or CPUs.
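To illustrate concurrency without parallelism, here is a small sketch using `asyncio`. The names `task`, `main`, and `run_demo` are hypothetical. Three tasks overlap in time on a single thread by yielding control while they wait.

```python
import asyncio
import threading

async def task(name, log):
    """Record start and end events along with the thread that ran them."""
    log.append((name, "start", threading.get_ident()))
    await asyncio.sleep(0.01)  # yield control while "waiting"
    log.append((name, "end", threading.get_ident()))

async def main():
    log = []
    # All three tasks are in progress over the same period.
    await asyncio.gather(*(task(name, log) for name in "abc"))
    return log

def run_demo():
    return asyncio.run(main())
```

In the resulting log every task starts before any task ends, yet every entry carries the same thread identifier: the tasks are concurrent but never execute simultaneously. Parallel execution would additionally require multiple cores, for example through separate processes.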
Threading
Definition:
Creating multiple threads within a process to perform tasks concurrently.
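As a hedged sketch, a pool of threads can be created with `concurrent.futures.ThreadPoolExecutor`. The function names `describe` and `squares_in_threads` and the "worker" name prefix are assumptions chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

def describe(n):
    """Return the square of n and the name of the thread that computed it."""
    return n * n, threading.current_thread().name

def squares_in_threads(values, max_workers=4):
    """Compute squares on a pool of threads inside the current process."""
    with ThreadPoolExecutor(max_workers=max_workers,
                            thread_name_prefix="worker") as pool:
        # map() preserves input order even though threads may finish out of order.
        return list(pool.map(describe, values))
```

Threads tend to help most with I/O-bound work. In standard CPython builds the global interpreter lock prevents threads from running Python bytecode in parallel, so CPU-bound work usually gains little from threading alone.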
Apache Spark
Definition:
An open-source distributed computing framework for big data processing, with APIs in Scala, Java, Python, and R.
Dask
Definition:
A Python library for parallel computing that scales the NumPy, pandas, and scikit-learn APIs to larger-than-memory or distributed datasets.
In-Memory vs. Out-of-Memory Computations
Definition:
In-memory computations hold the whole dataset in RAM for faster processing, while out-of-memory computations (more commonly called out-of-core) handle data larger than RAM by streaming or chunking it from disk.
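The two approaches can be contrasted with a minimal sketch that computes a mean over a text file holding one number per line. The file layout and the function names `mean_in_memory` and `mean_streaming` are assumptions for illustration.

```python
def mean_in_memory(path):
    """Load every value into a list first; memory use grows with file size."""
    with open(path) as f:
        values = [float(line) for line in f]
    return sum(values) / len(values)

def mean_streaming(path):
    """Read one line at a time; memory use stays roughly constant."""
    total = 0.0
    count = 0
    with open(path) as f:
        for line in f:  # the file object yields lines lazily
            total += float(line)
            count += 1
    return total / count
```

Both functions return the same result for a non-empty file. The streaming version keeps only a running total and a count, so it still works when the file is far larger than available RAM.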
MapReduce (Concept)
Definition:
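A programming model for distributed processing of large data sets across clusters, in which a map step turns input records into key-value pairs and a reduce step aggregates the values for each key (introduced by Google and popularized by Hadoop).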
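The model can be sketched on a single machine as a word count in plain Python. The function names below are hypothetical, and a real framework would distribute each phase across cluster nodes rather than run it in one process.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))
```

For example, `word_count(["a b a", "b c"])` yields a count of 2 for "a", 2 for "b", and 1 for "c". Because each document is mapped independently and each key is reduced independently, both phases can be spread over many machines.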