Data Science at Scale Flashcards

(312 cards)

1
Q

How have companies changed in the big data era and what has enabled this?

A

They have become more social, customer-oriented, and dynamic by collecting data, learning from it, and improving and adapting in response. This was enabled by cheaper storage and processing, faster networks, and free open-source tools.

2
Q

What technological shift enabled widespread data analytics?

A

Cloud-based infrastructure (e.g., AWS, GCP) and Infrastructure as a Service solutions from internet giants like Google, Amazon, and Microsoft.

3
Q

How has big data transformed marketing?

A

Customer profiling, targeted ads, and personalised communication and recommendations.

4
Q

What are the 3 V’s of big data?

A

Volume, Velocity and Variety

5
Q

What are the four fundamental functionalities that Data-intensive applications are built from?

A

Database - Store data so it can be retrieved later.
Caching - Store the results of expensive operations to be used again soon.
Indexing - Allow users to efficiently search the data.
Batch Processing - Periodically run specific routines on large amounts of accumulated data.

6
Q

What are the three important things that a Data-intensive application needs to be?

A

Reliable, Scalable, Maintainable

7
Q

What does it mean for a system to be reliable?

A

It performs the function the user expected. It can tolerate the user making mistakes. Its performance is good enough for the required use case, under the expected load and data volume. It prevents any unauthorised access.

8
Q

What are faults and failures?

A

A fault is when a component (hardware/software) of the system works in an unexpected way, and a failure is when the entire system stops providing the service.

9
Q

What are Hardware Faults and what measures can be taken to stop them?

A

Usually when an HDD, memory module, or PSU stops working; this is common in large data centres. Mitigations include hardware redundancy such as RAID for HDDs, redundant PSUs, and hot-swappable CPUs.

10
Q

What are Software Faults?

A

When the software stops working. These are harder to anticipate, and can be present on many nodes of a system causing widespread failures.

11
Q

What is Scalability?

A

A system’s ability to cope with increased load.

12
Q

What is Load?

A

A measure of the amount of use of a system, for example: requests per second, number of players, read/write ratio.

13
Q

What is performance?

A

How well the system is responding to the load, for example: response time, or time taken to process a dataset. The average and distribution are both important.

14
Q

What is Vertical Scaling?

A

Upgrading the specs of the current machine; this does not scale linearly and offers limited fault tolerance.

15
Q

What is Horizontal Scaling?

A

Increasing the number of machines in the system, which scales better and has better fault tolerance.

16
Q

What is Maintainability?

A

The overall cost of keeping a system operational and up to date.

17
Q

What is Operability?

A

How easy it is for the operation team to keep it running. Includes good monitoring, automation, and predictable behaviour.

18
Q

What is Simplicity?

A

How easy it is for new people working on the system to understand it. Simplicity is achieved by removing accidental complexity, not functionality.

19
Q

What is Evolvability?

A

How easy it is to make changes and update the system, which is closely linked with simplicity.

20
Q

What is a data model in the context of data storage?

A

A structure that maps real-world entities (e.g., objects in code) to how they are stored (e.g., tables, JSON).

21
Q

What is an ORM (Object-Relational Mapping)?

A

A system that maps classes in code to a relational database schema.

22
Q

In the relational model, how are one-to-many relationships handled?

A

By using separate tables with foreign keys pointing to the parent table.

23
Q

What structured field types can be used within relational DBs for complex fields?

A

XML or JSON fields within a table.

24
Q

What is a document model in NoSQL?

A

A system where data is stored in documents (e.g., JSON, XML) representing semi-structured data.

25
Why are document models considered more flexible than relational models?
They don’t require a rigid schema; each document can be different.
26
What’s a key weakness of document models?
Difficulty handling many-to-many relationships efficiently.
27
What are graph models used for?
Representing complex many-to-many relationships using nodes and edges.
28
In graph models, what are nodes and edges?
Nodes are entities or objects; edges are the relationships between them.
29
What is a NoSQL database?
A database that doesn’t use traditional relational models; prioritizes scalability and availability.
30
What is the CAP Theorem?
In a distributed system, you can guarantee at most two of: Consistency, Availability, and Partition Tolerance.
31
What does consistency mean in CAP?
Every read returns the most recent write or an error.
32
What does availability mean in CAP?
Every request gets a response, even if it's not the most up-to-date.
33
What does partition tolerance mean in CAP?
The system continues functioning despite network partitions.
34
What are the four types of NoSQL databases?
Document, Key-value, Wide-column, and Graph databases.
35
What is a key-value store?
A NoSQL database that stores data as key-value pairs, like a dictionary.
36
What is a wide-column store?
A database model where data is stored in rows and columns, but columns can vary between rows.
37
How is data organized in wide-column stores?
By column families instead of rows.
38
What is a property graph in a graph database?
A graph where nodes and edges can have associated properties.
39
What is the main advantage of using a graph store over relational or document models?
Better handling of complex and interconnected data.
40
What does schema-on-read mean in document databases?
The structure of data is defined when it's read by the application, not enforced when written.
41
Why is schema-on-read useful?
It's good for heterogeneous data or when you can't control the data structure, like tweets or logs.
42
How does schema flexibility in document DBs compare to relational DBs?
Document DBs allow format changes without changing the schema; relational DBs often need downtime and schema updates.
43
What is "locality" in document databases?
It means storing a whole document as a continuous string (like JSON), keeping related data together.
44
What’s a drawback of locality in document databases?
Even if you need only part of the document, the DB loads the whole thing, which can be inefficient.
45
What are the two main jobs of a database?
Store data and retrieve data.
46
What is a transactional workload?
A write-heavy workload (e.g., bank transactions).
47
What is an analytics workload?
A read-heavy workload (e.g., dashboards, reports).
48
What is a storage engine?
The part of the database that handles how data is written to and read from disk.
49
What is a log-structured database?
A database that appends all writes to a log file.
50
Why are appends fast in a log file?
Because they avoid random disk access and just add to the end of the file.
51
What's the downside of using a plain log for reads?
You have to scan the whole file (O(n) time complexity).
52
What is a hash index?
A map of keys to byte offsets in a log file.
53
What is the benefit of a hash index?
Fast lookups for keys.
54
What is a limitation of a hash index?
It doesn’t support range queries and must fit in memory.
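The log-plus-hash-index idea from the cards above fits in a short Python sketch; the `LogWithHashIndex` class and its CSV-style record format are illustrative, not a real storage engine:

```python
import io

# In-memory hash index over an append-only log: key -> byte offset.
class LogWithHashIndex:
    def __init__(self):
        self.log = io.BytesIO()   # stands in for a log file on disk
        self.index = {}           # key -> byte offset of latest record

    def put(self, key: str, value: str) -> None:
        offset = self.log.seek(0, io.SEEK_END)  # appends avoid random I/O
        self.log.write(f"{key},{value}\n".encode())
        self.index[key] = offset                # latest write wins

    def get(self, key: str) -> str:
        offset = self.index[key]                # O(1) lookup, no full scan
        self.log.seek(offset)
        line = self.log.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

db = LogWithHashIndex()
db.put("user:1", "alice")
db.put("user:1", "alicia")   # old record stays in the log until compaction
print(db.get("user:1"))      # the index points at the latest record
```

Appends only ever go to the end of the log, the index maps each key to the byte offset of its newest record, and stale versions linger until compaction merges them away.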
55
Why are log files split into segments?
To prevent a single log file from growing too large.
56
What is compaction?
A process that merges segments and removes duplicates or deleted records.
57
What does SSTable stand for?
Sorted String Table.
58
What is the key property of SSTables?
They store keys in sorted order and only once per segment.
59
Why are SSTables efficient for reads?
Because they support binary search and smaller indexes.
60
Do SSTables still need an index?
Yes, but only for some keys (sparse index).
61
How is data written to an SSTable?
Data is first written to a memtable (e.g., an AVL tree) in memory. When full, the memtable is flushed to disk as a new SSTable file.
62
How does reading from an SSTable work?
First, check the memtable; if not found, search the newest SSTable, then older ones.
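The SSTable cards can be condensed into a toy LSM-style store; `MiniLSM`, its flush threshold, and the use of a plain dict (sorted at flush time, standing in for the AVL-tree memtable) are all illustrative:

```python
import bisect

# Minimal LSM-style store: a memtable that flushes to sorted SSTable segments.
class MiniLSM:
    def __init__(self, memtable_limit: int = 3):
        self.memtable = {}
        self.sstables = []            # newest segment last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # Flush: keys written in sorted order, once per segment.
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                 # 1) check the memtable first
            return self.memtable[key]
        for segment in reversed(self.sstables):  # 2) newest SSTable first
            keys = [k for k, _ in segment]
            i = bisect.bisect_left(keys, key)    # binary search on sorted keys
            if i < len(keys) and keys[i] == key:
                return segment[i][1]
        return None
```

Reads check the memtable first, then binary-search each segment from newest to oldest, matching the read path described above.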
63
What are the two main reasons for distributed storage?
Scalability and fault tolerance.
64
What is vertical scaling?
Adding more power (CPU, RAM) to a single machine.
65
What is horizontal scaling?
Adding more machines to the system.
66
What are the two main strategies for distributing data?
Replication and partitioning.
67
What is replication in distributed systems?
Each node holds a full copy of the data (a replica).
68
Name three benefits of replication.
Lower latency, higher availability, better read performance.
69
What is the downside of replication?
Writes must update all replicas.
70
What is the leader in a replicated system?
The node that receives all write operations.
71
What do followers do in replication?
Apply updates from the leader using a replication log.
72
What is synchronous replication?
Leader waits for confirmation from followers before acknowledging a write.
73
What is the difference between synchronous and asynchronous followers in replication?
A synchronous follower must confirm a write before it's considered successful, ensuring stronger consistency. An asynchronous follower receives updates later and doesn’t delay the leader’s response.
74
What happens if the synchronous follower fails in replication?
An asynchronous follower is promoted to be synchronous.
75
How do you add a new follower in replication?
Snapshot the leader, copy data, then replay changes since the snapshot.
76
What happens when a follower fails in replication?
It fetches missed changes from the leader using logs.
77
What happens when a leader fails in replication?
A follower is promoted to become the new leader (failover).
78
What is "split brain" in replication?
Multiple leaders making conflicting updates.
79
What is replication lag?
Followers being behind the leader in data updates.
80
What is "read-your-writes" consistency?
Users always see their own recent updates.
81
What is "monotonic reads" consistency?
Users never see older data than a previous read.
82
What is partitioning in distributed databases?
Breaking the dataset into subsets (partitions), each stored on a different node.
83
How many partitions does a piece of data belong to?
Exactly one.
83
Why is partitioning used?
To scale data storage and load across multiple nodes for large datasets.
83
Can a node store more than one partition?
Yes
84
What is a hot spot node?
A node that handles a disproportionately large amount of data or requests.
85
What is the simplest way to distribute partitions?
Randomly scatter data across nodes.
86
What is partitioning by key range?
Assign a continuous range of keys to each partition.
87
What’s a drawback of key range partitioning?
Can cause write hotspots if keys follow insertion order.
88
What is partitioning by hash of key?
Use a hash function to assign data to partitions.
89
What’s the benefit of hash partitioning?
Even distribution of data across partitions.
90
What’s the drawback of hash partitioning?
Poor support for range queries.
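A small Python sketch of the range-versus-hash trade-off; the four partitions and the alphabetical range boundaries are made up for illustration:

```python
import hashlib

N = 4  # number of partitions (illustrative)

def range_partition(key: str) -> int:
    # Contiguous key ranges, e.g. [..g), [g..m), [m..s), [s..]; boundaries
    # would be tuned to the real key distribution.
    for i, boundary in enumerate(["g", "m", "s"]):
        if key < boundary:
            return i
    return 3

def hash_partition(key: str) -> int:
    # A stable hash gives an even spread but destroys key ordering,
    # so range queries must hit every partition.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % N
```

Adjacent keys such as "apple" and "apricot" share a partition under range partitioning (good for range scans, risky for hotspots), while the hash spreads keys without regard to order.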
91
How do secondary indexes affect partitioning?
They don’t uniquely identify records and don’t map neatly to partitions.
91
What is rebalancing in partitioned systems?
Moving data between nodes to distribute load evenly.
92
When is rebalancing needed?
When nodes fail, data grows, or load increases.
93
What should happen during rebalancing?
Database continues operating, and minimal data is moved.
94
What is fixed partitioning?
A set number of partitions that get reassigned when nodes are added or removed.
95
What is dynamic partitioning?
Partitions split or merge depending on data size.
96
Which partitioning method supports range queries better?
Key range partitioning.
97
Which partitioning method avoids write hotspots better?
Hash partitioning.
98
According to the CAP theorem, what must a system choose between during a network partition?
Consistency and Availability.
99
What does "Consistency" mean in CAP?
Every read gets the most recent write or an error.
100
What does "Availability" mean in CAP?
Every request gets a (non-error) response.
101
What does "Partition Tolerance" mean in CAP?
The system still operates even if parts of the network are disconnected.
102
What are the ACID properties?
Atomicity, Consistency, Isolation, Durability.
103
What is eventual consistency?
All nodes will eventually hold the same data, but reads may return outdated values in the meantime.
104
What is linearizability?
All operations appear instant and atomic, as if there's only one copy of the data.
105
Why is linearizability important?
To guarantee strong consistency, such as unique IDs or avoiding double bookings.
106
What is the trade-off of linearizability in CAP?
It sacrifices availability during network partitions.
107
What happens to performance in linearizable systems with high network delays?
Response time increases with network uncertainty.
108
What is consensus in distributed systems?
Getting all nodes to agree on a value or action, like leader election or committing transactions.
109
What is the Two-Phase Commit protocol used for?
Ensuring atomic commit across multiple nodes.
110
What happens in the first phase of 2PC
The coordinator asks all nodes to prepare to commit.
111
What happens in the second phase of 2PC?
If all agree, the coordinator sends a commit. If any disagree, it sends an abort.
112
Why is 2PC difficult in distributed systems?
Nodes can crash, messages can be lost, and decisions can be left hanging.
113
What is a downside of eventual consistency?
Reads may return outdated or inconsistent data temporarily.
114
In a linearizable system, what happens once a client sees an update?
All other clients must see that update afterward.
115
What problem does the Paxos algorithm solve?
Reaching consensus in a distributed system, even with unreliable nodes.
116
What are the three roles in Paxos?
Proposer, Acceptor, Learner.
117
What is the role of a Proposer in Paxos?
Suggests a value to be agreed upon.
118
What is the role of an Acceptor in Paxos?
Votes to accept proposed values; consensus is reached when a majority accepts.
119
What is the role of a Learner in Paxos?
Learns the value that has been chosen but doesn't participate in voting.
120
What is a quorum in Paxos?
A majority of acceptors (more than half), required to make decisions.
121
What is a proposal in Paxos?
A value identified with a unique, increasing number.
122
What happens during Phase 1 of Paxos?
Proposer sends PREPARE(x); acceptors respond with PROMISE(x).
123
What does a PROMISE(x) mean in Paxos?
The acceptor won’t accept proposals with a number less than x.
124
What happens during Phase 2 of Paxos?
If proposer gets promises from a quorum, it sends ACCEPT(x, v) to acceptors.
125
When is a value considered "chosen" in Paxos?
When a majority of acceptors send ACCEPTED(x, v).
126
What does the Learner do in Paxos?
Learns the value chosen after the majority of acceptors accept it.
127
Why are unique proposal numbers important in Paxos?
They ensure that newer proposals can override older ones and prevent conflict.
128
What happens if multiple proposers send proposals?
Paxos ensures only one value is chosen by allowing only the highest-numbered proposal to proceed.
129
Why is consensus hard in distributed systems?
Because of node failures, network delays, and inconsistent message ordering.
130
What are the three types of data systems?
Services (online), Batch processing (offline), and Stream processing (near-real-time).
131
What is the main performance metric for batch processing?
Throughput.
132
What is batch processing?
Processing large amounts of data in scheduled jobs that produce output.
133
How does stream processing differ from batch processing?
Stream processing operates on data shortly after it is produced; batch processes large data sets periodically.
134
What is the Unix philosophy related to batch jobs?
Build small, simple, modular tools that do one thing well and chain them together.
135
What is MapReduce?
A batch processing framework for distributed computation on large datasets.
136
What is the role of the Mapper in MapReduce?
Extracts key-value pairs from input data; runs once per record.
137
Is the Mapper in MapReduce stateful?
No, it is stateless.
138
What is the role of the Reducer in MapReduce?
Aggregates values grouped by key and produces output.

139
What is the key strength of MapReduce?
Parallelism across distributed systems.
140
Can a single MapReduce job solve all batch problems?
No, jobs are often chained together into workflows.
141
What is a reduce-side join in MapReduce?
Joining datasets by emitting a shared key and processing all values for that key in the reducer.
142
Why are reduce-side joins used in batch processing?
To perform large-scale dataset joins efficiently and locally.
143
What kind of input/output model does MapReduce use?
Key-value pair model.
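The mapper/reducer division of labour can be simulated in a single Python process; word count stands in here for a real distributed job:

```python
from collections import defaultdict

# Single-process simulation of MapReduce: a stateless mapper emits
# key-value pairs, the framework groups them by key, and the reducer
# aggregates each group.

def mapper(record: str):
    for word in record.split():
        yield word.lower(), 1          # one pair per occurrence

def reducer(key, values):
    return key, sum(values)            # aggregate values grouped by key

def map_reduce(records):
    groups = defaultdict(list)
    for record in records:             # map phase: runs once per record
        for k, v in mapper(record):
            groups[k].append(v)        # shuffle: group by key
    return dict(reducer(k, vs) for k, vs in groups.items())

print(map_reduce(["to be or not to be"]))   # counts per distinct word
```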
144
What is stream processing?
A method of processing data in real-time, event-by-event, as it arrives.
145
What is an event in stream processing?
A small, immutable data record representing something that happened, usually with a timestamp.
146
What generates events in a stream system?
A producer (also called publisher or sender).
146
Who handles events in a stream system?
One or more consumers (also called subscribers).
147
Why not use databases or files for stream processing?
Polling them constantly adds overhead; they're not built for real-time notification.
148
What is a messaging system?
A service that allows producers to send messages and consumers to receive them in real time.
149
What happens if producers send messages faster than consumers can process?
Messages can be dropped, buffered in a queue, or flow-controlled (backpressure).
150
What happens to messages when a node crashes?
They can be lost or stored (if durability is implemented).
151
What is direct messaging?
Producers send messages directly to consumers, often using TCP or UDP.
152
What is a message broker?
A server that stores and forwards messages between producers and consumers.
153
What’s a benefit of using a message broker?
It can handle clients that disconnect or crash, improving reliability.
154
How do message brokers handle multiple consumers?
Via load balancing or fan-out.
155
What is load balancing in stream systems?
Each message is sent to one consumer in the group to share the load.
156
What is fan-out in stream systems?
Every consumer receives every message, useful for broadcasting.
157
Why are acknowledgements needed in stream systems?
To confirm that a message was processed; if not, it may be redelivered.
158
What can go wrong with message re-delivery?
It can affect message ordering.
159
How are databases different from message brokers?
Databases persist data and offer search; brokers focus on message delivery and often delete after consumption.
160
What is a stream operator?
A piece of code (or job) that consumes input streams and outputs a new derived stream.
161
How is a stream processor similar to MapReduce?
Both treat their inputs as read-only and write append-only outputs.
162
Why is fault tolerance harder in stream processing than batch?
Because streams never end — you can't just restart from the beginning.
163
What is Complex Event Processing (CEP)?
A system that detects patterns in streams using SQL-like queries and emits events when a match is found.
164
How is CEP different from databases?
CEP has persistent queries on transient data; databases have transient queries on persistent data.
165
What is stream analytics?
Real-time calculations over streams, like rates, rolling averages, and trend detection.
166
What is event time?
The time when the event actually occurred.
167
What is processing time?
The time when the system processed the event.
168
Why is confusing event time and processing time a problem?
It can lead to incorrect results and analysis.
169
Why is timestamp assignment tricky in streaming?
Because devices and servers may have different clocks and delays.
170
What are the three types of timestamps we might store for an event?
When it happened, when it was sent, when it was received.
171
What is a tumbling window?
A fixed-size, non-overlapping time window, each event belongs to one window.
172
What is a hopping window?
A fixed-size window that overlaps in time, allowing events to belong to multiple windows.
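The two window types in code: assigning an event timestamp to its window(s). Window size and hop values are illustrative, and windows are assumed to start at time 0 or later:

```python
def tumbling_window(ts: int, size: int) -> int:
    # Each event belongs to exactly one non-overlapping window;
    # return that window's start time.
    return (ts // size) * size

def hopping_windows(ts: int, size: int, hop: int):
    # Fixed-size windows start every `hop` units, so an event can fall
    # into several overlapping windows; return all their start times.
    first = ((ts - size) // hop + 1) * hop
    starts, start = [], max(0, first)
    while start <= ts:
        starts.append(start)
        start += hop
    return starts
```

With size 10 and hop 5, an event at t=12 falls in the single tumbling window starting at 10, but in the two hopping windows starting at 5 and 10.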
173
What is gradient descent?
An iterative algorithm that updates model parameters in the direction that decreases the loss, controlled by a step size called the learning rate.
174
In neural-network training, what two passes occur each epoch?
A forward pass (to compute predictions) and a backpropagation pass (to compute gradients and update weights).
175
What distinguishes full-batch learning from mini-batch learning?
Full batch uses all available data to compute the exact gradient, while mini-batch uses subsets to approximate it.
176
Name one advantage of full-batch learning.
It provides a smooth, consistent gradient each epoch, making convergence more stable.
177
Name one disadvantage of full-batch learning.
Retraining on large or growing datasets is computationally expensive and may not fit in memory.
178
What is concept drift?
When the relationship 𝑃(𝑦∣𝑥) itself changes over time in unanticipated ways.
179
What is mini-batch learning?
A method where the dataset is divided into small batches, each used to compute an approximate gradient.
180
Give one benefit of mini-batch learning.
Faster iterations since less data is processed per update.
181
Give one drawback of mini-batch learning.
Convergence can be noisy and unstable due to variability in each batch.
182
Why is the choice of batch size important in mini-batch learning?
It balances computational cost (smaller batch = faster) against convergence stability (larger batch = smoother).
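A minimal mini-batch gradient-descent loop for a one-parameter linear model; the data, learning rate, and batch size are illustrative:

```python
import random

# Mini-batch gradient descent for linear regression y = w*x (true w = 2.0).
random.seed(0)
data = [(x, 2.0 * x) for x in range(100)]

w, lr, batch_size = 0.0, 0.0001, 10
for epoch in range(50):
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Approximate gradient of the mean squared error on this batch.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad                 # step against the gradient

print(round(w, 2))   # w converges toward the true slope 2.0
```

Each update uses an approximate gradient from one batch, so individual steps are noisy, but the parameter still converges to the true slope.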
183
What is streaming data in the context of model training?
Data arriving in a continuous, high-speed flow where only one pass per sample is possible.
184
Why do streaming-data models risk becoming outdated?
Because the data distribution can evolve quickly, and the model can’t revisit past examples.
185
Define online learning.
Updating model parameters immediately after each new data point arrives.
186
How does online learning relate to stochastic gradient descent?
Online learning uses a single-sample update, which is exactly SGD.
187
What role does the learning rate play in online learning?
It controls how aggressively the model adapts to each new observation.
188
Why must online learning be closely monitored?
A single outlier or bad data point can disproportionately affect the model’s performance.
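Online learning as single-sample SGD, sketched with a hypothetical stream that follows y = 3x:

```python
# Online learning: the model updates immediately after every observation.
w = 0.0
lr = 0.05

def observe(x: float, y: float) -> None:
    """One step of SGD on a single (x, y) pair."""
    global w
    grad = 2 * (w * x - y) * x      # gradient of squared error on this sample
    w -= lr * grad                  # learning rate controls adaptation speed

for x in [1.0, 2.0, 1.5, 3.0, 2.5]:
    observe(x, 3.0 * x)             # stream follows y = 3x
# w has moved toward 3.0; a single outlier would pull it away just as
# fast, which is why online learners need close monitoring.
```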
189
What is a covariate shift?
When the input distribution 𝑃(𝑥) changes but the conditional 𝑃(𝑦∣𝑥) remains the same.
190
Provide an example of covariate shift.
A face-recognition system struggling with masked faces.
191
What is prior probability shift?
When the distribution of target variables changes but inputs stay the same.
192
Provide an example of prior probability shift.
A flu-prediction model during a pandemic sees 𝑃(𝑦 = flu) surge while the symptoms-to-flu mapping stays constant.
193
How often might you retrain a mini-batch model?
It depends on the application—daily, weekly, monthly, etc., based on data volatility.
194
What is incremental learning?
Continuously updating the model with new data without retraining from scratch.
195
Why is full-batch learning “simpler to reason about”?
Because each epoch uses the entire dataset, so the gradient represents the true direction of steepest descent.
196
What’s the main trade-off between full-batch and mini-batch learning?
Stability and smooth convergence versus computational efficiency and adaptability to new data.
197
Why must we design communication patterns carefully in distributed ML?
To avoid idle workers waiting on data or synchronization, thereby maximizing utilization.
198
What does the push primitive do?
A machine sends data proactively to another without a request.
199
How is pull different from push?
In pull, a machine requests data from another, rather than receiving it unprompted.
201
What’s the role of broadcast?
To send identical data from one node to all other workers simultaneously.
202
Describe reduce in distributed settings.
Aggregating partial results (e.g., sums) from multiple workers onto a single machine.
203
How does all-reduce extend reduce?
After aggregation, it distributes the final result to every worker.
204
What happens at a wait primitive?
One machine halts its work until it gets a signal from another machine.
205
Define a barrier synchronization.
All machines pause until every one of them reaches the barrier, then all resume together.
206
Why overlap computation with communication?
To hide network latency by doing useful work while data moves.
207
In distributed SGD, what is an epoch?
One full pass over the entire training dataset.
208
How is the effective batch size B related to worker count M and local batch B′?
B = M × B′.
209
Outline the three steps of SGD with all-reduce.
(1) Compute a local gradient on each worker; (2) all-reduce to sum the gradients; (3) apply a synchronized parameter update.
210
What’s one key benefit of using all-reduce for parallel SGD?
It’s statistically identical to standard minibatch SGD, so hyperparameters carry over.
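A single-process sketch of that equivalence: summing per-worker gradients over local batches of size B′ gives exactly the gradient of the global batch B = M × B′. Worker count and data are illustrative:

```python
# Simulated SGD with all-reduce: M workers, local batch size B'.
M, B_prime = 4, 2
data = [(float(x), 2.0 * float(x)) for x in range(M * B_prime)]  # B = M*B' = 8
w = 0.5

def local_gradient(w, shard):
    # Each worker sees only its shard of the global batch.
    return sum(2 * (w * x - y) * x for x, y in shard)

shards = [data[i * B_prime:(i + 1) * B_prime] for i in range(M)]
total = sum(local_gradient(w, s) for s in shards)   # all-reduce: sum partials
global_grad = sum(2 * (w * x - y) * x for x, y in data)
print(total == global_grad)   # identical to single-machine minibatch SGD
```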
211
What’s a major drawback of this all-reduce approach?
Workers idle during the global reduction, since there’s no overlap.
212
In k-means MapReduce, what do mappers emit?
Pairs of (cluster_id → point data) after assigning each point to its nearest centroid.
213
What’s the combiner’s job in parallel k-means?
Locally summing coordinates and counts per cluster to reduce network shuffle.
214
During the reduce phase of k-means, how are new centroids computed?
By dividing the total sum of point coordinates by the total count for each cluster.
215
How do you test for convergence in MapReduce k-means?
Compare old and new centroids; if they move less than a threshold, you stop.
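One MapReduce-style k-means iteration on made-up 1-D points; the combiner is folded into the reducer here for brevity:

```python
# One iteration of k-means expressed as a map and a reduce phase.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centroids = [0.0, 10.0]

def map_phase(point):
    # Mapper: emit (cluster_id -> point) for the nearest centroid.
    cid = min(range(len(centroids)), key=lambda c: abs(point - centroids[c]))
    return cid, point

def reduce_phase(pairs):
    # Reducer: new centroid = sum of coordinates / count, per cluster.
    sums, counts = {}, {}
    for cid, p in pairs:
        sums[cid] = sums.get(cid, 0.0) + p   # a combiner would pre-sum these
        counts[cid] = counts.get(cid, 0) + 1
    return {cid: sums[cid] / counts[cid] for cid in sums}

new_centroids = reduce_phase(map_phase(p) for p in points)
print(new_centroids)   # cluster 0 moves near 1.0, cluster 1 near 9.0
```

Comparing `new_centroids` against `centroids` with a small threshold gives the convergence test from the card above.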
216
Why might you include a combiner in MapReduce?
To cut down on data sent over the network by doing partial aggregation in the map stage.
217
When is a barrier sync useful?
When you need all workers to finish a phase before any can proceed (e.g., before starting a new epoch).
218
What would you overlap with communication to improve performance?
Local computation—e.g., computing the next gradient chunk while the previous one is being reduced.
219
In practice, why avoid naïve all-reduce across too many workers?
Because the communication overhead—and idle time—grows as the cluster scales.
220
What’s the difference between reduce and all-reduce in terms of result placement?
Reduce puts the aggregate on one node; all-reduce replicates it to all nodes.
221
What primitive would you use if one worker needed to pause for a signal from another?
Wait.
222
How does parallel k-means ensure every point is assigned before recomputing centroids?
A barrier-like synchronization happens naturally when MapReduce moves from the reduce phase back to the next map phase.
223
Name three forms of concept drift dynamics.
Gradual, abrupt, and recurring (cyclical) changes.
224
Why do batch learners become outdated under streaming data?
They can’t adapt to new distributions without expensive full retraining.
225
What does ADWIN stand for?
ADaptive WINdowing, a method for concept drift detection.
226
How does ADWIN detect drift?
By adaptively shrinking its window when statistical tests detect change in the stream.
227
What triggers ADWIN to grow its window?
When no significant change is detected, indicating stable data.
228
In DDM, what is P𝑖?
The observed error rate at time 𝑖.
229
How is S𝑖 computed in DDM?
S𝑖 = √(P𝑖(1 − P𝑖) / 𝑖)
230
When does DDM enter the “warning zone”?
P𝑖 + S𝑖 ≥ Pmin + 2*Smin
231
When does DDM declare concept drift?
P𝑖 + S𝑖 ≥ Pmin + 3*Smin
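A sketch of the DDM update rule from the last three cards; the 30-sample burn-in is a common implementation detail rather than part of the cards:

```python
import math

# DDM thresholds: p is the running error rate after i predictions;
# p_min/s_min are recorded at their joint minimum.
class DDM:
    def __init__(self):
        self.errors, self.i = 0, 0
        self.p_min, self.s_min = float("inf"), float("inf")

    def update(self, is_error: bool) -> str:
        self.i += 1
        self.errors += int(is_error)
        p = self.errors / self.i
        s = math.sqrt(p * (1 - p) / self.i)
        if self.i < 30:                       # ignore the unstable early stream
            return "stable"
        if p + s < self.p_min + self.s_min:   # track the best point seen so far
            self.p_min, self.s_min = p, s
        if p + s >= self.p_min + 3 * self.s_min:
            return "drift"
        if p + s >= self.p_min + 2 * self.s_min:
            return "warning"
        return "stable"
```

Feeding a stream with a stable 10% error rate keeps the detector quiet; when the error rate jumps, the drift threshold fires.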
232
What is the main idea behind EDDM?
Monitoring the average distance between classification errors to catch slow drift early.
233
Why might EDDM detect drift earlier than DDM?
It uses spacing of errors, which can show drift before error-rate spikes.
234
Give an example of abrupt drift.
A sudden market crash completely changing customer behavior.
235
What’s a recurring drift?
Seasonal or cyclical patterns, like weekend vs. weekday usage.
236
Why is it important to detect drift quickly?
To retrain or adapt the model before performance degrades significantly.
237
How do ADWIN and DDM differ in window handling?
ADWIN explicitly manages a data window; DDM tracks error statistics without a sliding window.
238
What is a 0-order tensor?
A scalar (single value).
239
What order tensor is a feature vector?
1-order tensor.
240
How many dimensions does a matrix have?
Two (rows × columns).
241
Give an example of a 3-order tensor in ML.
A colour image: height × width × channels. (A video, frames × height × width × channels, would be a 4-order tensor.)
242
Define a dense tensor.
One where most entries are non-zero.
243
Define a sparse tensor.
One where most entries are zero (O(n) non-zeros in an n×n tensor).
244
How is sparsity measured?
As the proportion of zero-valued elements.
245
Why use sparse representations?
To reduce storage and compute on data with many zeros.
246
What’s the dictionary-of-keys format?
A map from index tuples (i,j,…) to non-zero values.
247
What is COO (coordinate list) format?
A list of (row, col, value) triples for non-zeros.
248
Name the three arrays in CSR format.
V (values), COL_INDEX (columns), ROW_INDEX (row pointers).
249
What does ROW_INDEX[k] represent?
The start index in V/COL_INDEX for row k.
250
How do you reconstruct row i in CSR?
Slice V and COL_INDEX from ROW_INDEX[i] to ROW_INDEX[i+1].
251
In CSR, what does COL_INDEX store?
The column indices of each non-zero in V.
252
How many non-zeros does row 2 have in V=[5,8,3,6], COL_INDEX=[0,1,2,1], ROW_INDEX=[0,1,2,3,4]?
One non-zero (value 3 at column 2).
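The CSR arrays from the card above, reconstructed in Python:

```python
# CSR arrays from the example: V (values), COL_INDEX (columns),
# ROW_INDEX (row pointers into V/COL_INDEX).
V = [5, 8, 3, 6]
COL_INDEX = [0, 1, 2, 1]
ROW_INDEX = [0, 1, 2, 3, 4]   # ROW_INDEX[k] = start of row k

def row(i):
    # Slice V and COL_INDEX from ROW_INDEX[i] to ROW_INDEX[i+1].
    start, end = ROW_INDEX[i], ROW_INDEX[i + 1]
    return list(zip(COL_INDEX[start:end], V[start:end]))  # (column, value)

print(row(2))   # row 2 holds one non-zero: value 3 at column 2
```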
253
What is virtualisation?
Creating a software-based VM that behaves like a full computer, including hardware, OS, and peripherals.
254
What is a hypervisor?
The software (virtual machine monitor) that runs on a host to manage and isolate multiple guest VMs.
255
How does emulation differ from virtualisation?
Emulation mimics hardware without direct host-hardware interaction; virtualisation uses the real host hardware via a hypervisor.
256
What is a VM snapshot?
A point-in-time capture of a VM’s complete state, which can be restored later.
257
How can snapshots aid migrations?
By copying a snapshot to another host and restoring it, you move the VM seamlessly.
258
Give two drawbacks of virtualisation.
High overhead (each VM runs a full OS) and redundancy in duplicated system files.
259
What is containerisation?
OS-level virtualisation where containers share a host OS kernel but are isolated environments.
260
How do containers differ from VMs?
Containers share the OS kernel and have less overhead versus VMs, which each run a full guest OS.
261
What resources can a container use?
Only those explicitly allocated to it by the container runtime.
262
List three benefits of containers.
Portability, scalability, and ease of building/deploying/managing applications.
263
What does isolation mean in the container context?
Processes and files in one container cannot affect those in another.
264
What is container orchestration?
Automating deployment, scaling, management, and networking of containers across multiple hosts.
265
Why use PaaS in the context of containers?
To develop and run containerized applications without managing the infrastructure itself.
266
What is thread-level parallelism?
A programming model that splits work into threads, lightweight execution units within one process, to run concurrently on multiple CPU cores
267
Why are threads considered “lightweight” compared to processes?
Threads share the parent’s memory and resources, so they incur much less overhead to create and context-switch than full processes
268
What does SISD stand for?
Single Instruction, Single Data: one instruction stream on one data stream
269
How does SIMD differ from SISD?
SIMD applies the same instruction to multiple data streams in parallel, with each core working on its own data but fetching identical instructions
270
What characterizes MIMD architectures?
Multiple processors each fetch their own instructions and operate on their own data independently
271
Name two resources that threads share within a process
Global variables and file descriptors
272
What must threads share to run in a shared-memory system?
A common address space so they can directly access shared data
273
What is the risk when threads access shared data without synchronization?
Race conditions, leading to inconsistent or corrupted data
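A minimal Python sketch of that risk and its fix: the unguarded read-modify-write on `counter` is exactly the race condition described, and the lock serialises it.

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    # Without the lock, two threads could read the same value of
    # `counter` before either writes back, losing an update.
    global counter
    for _ in range(n):
        with lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter is exactly 40_000 on every run; remove the lock and it may fall short
```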
274
What API is commonly used for shared-memory threading in C/C++?
OpenMP
275
Describe the OpenMP fork–join model.
The master thread forks worker threads at a parallel region, they execute concurrently, then join back at a synchronization point
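OpenMP itself targets C/C++/Fortran, but the fork–join pattern can be sketched in Python, with `ThreadPoolExecutor` standing in for the forked worker team:

```python
from concurrent.futures import ThreadPoolExecutor

def work(chunk):
    """Work done by one worker thread in the parallel region."""
    return sum(chunk)

data = list(range(100))
chunks = [data[i:i + 25] for i in range(0, 100, 25)]

# Fork: the master thread hands one chunk to each worker.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(work, chunks))
# Join: leaving the `with` block waits for all workers, like the
# implicit barrier at the end of an OpenMP parallel region.
total = sum(partials)   # total == 4950
```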
276
What is MPI_COMM_WORLD?
The default MPI communicator that includes all processes in an MPI session
277
What attributes does an MPI communicator have?
Context/ID, Group (set of processes), Size (# processes), and Rank (each process’s integer ID)
278
How do MPI point-to-point communications work?
A sender posts a message with data and destination rank; the receiver must post a matching receive, making the exchange cooperative and two-sided
279
Give one advantage of shared‐memory threading versus distributed memory.
Threads can directly access shared data without explicit message passing, simplifying programming for on-node parallelism
280
What does OpenMP abstract away from the programmer?
Low-level thread creation, scheduling, and most synchronization details—letting you focus on marking parallel regions
281
What is software–hardware co-design?
A coupled development process where hardware is tailored to software requirements and software is tuned to exploit hardware features
282
Why are AI/ML workloads driving specialised hardware design?
They rely heavily on tensor operations, both dense (CNNs) and sparse (graph models), which general-purpose CPUs can’t efficiently handle
282
What’s the difference between bottom-up and top-down design?
Bottom-up builds hardware first then software; top-down derives hardware features from software workload demands
283
Name the four key steps in co-design.
Partitioning, Prototyping & Simulation, High-Level Synthesis, and Platform-Based Design
284
What is Partitioning in co-design?
Allocating which functions run in hardware (for performance) versus software (for flexibility/updates)
285
What is Prototyping and Simulation in co-design?
Modelling how the hardware and software will interact, using hardware description languages (HDLs) and software development tools to create accurate system models.
286
How does high-level synthesis aid co-design?
It automatically converts high-level language code into an HDL description, speeding up hardware implementation.
287
How does Platform-Based design aid co-design?
By using a predefined platform (a set of hardware and software components) as a starting point to reduce design time and complexity.
288
Give two advantages of co-design.
Enables thorough design-space exploration (power, cost, performance) and multi-level optimization (system, architectural, algorithmic)
289
What are IPUs and DPUs?
Infrastructure/Data Processing Units—specialised chips in data centres for networking, security, and management tasks
290
List three application domains of co-design.
Embedded systems, automotive electronics (autonomous vehicles), and 5G telecommunications
291
Why is edge computing important for AI/ML co-design?
Personalized models need on-device inference/training, requiring hardware/software tuned for low power and latency
292
What is High-Performance Computing (HPC)?
The use of supercomputers or large clusters of processors, combining parallel processing with specialised hardware, to solve complex computational problems.
292
How is HPC performance measured?
In FLOPS (Floating point operations per second): GFLOPS (10⁹), TFLOPS (10¹²), PFLOPS (10¹⁵), EFLOPS (10¹⁸).
293
What does Moore’s Law state?
Transistor counts on integrated circuits double roughly every two years, historically driving performance gains
294
What is Amdahl’s Law?
A formula predicting maximum speedup from parallelism, showing that the non-parallelizable portion limits overall speedup
295
What is the formula for Amdahl's law
S_latency = 1 / ((1 − p) + p/s), where p is the proportion of execution time that benefits from the enhancement and s is the speedup of that portion.
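The formula is easy to sanity-check in code; `amdahl_speedup` is an illustrative helper, and the numbers below follow directly from the definition.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the runtime is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# 95% of the work parallelised across 8 cores: 1/(0.05 + 0.95/8) ≈ 5.93
eight_cores = amdahl_speedup(0.95, 8)

# Even with effectively infinite cores, the serial 5% caps speedup at 1/0.05 = 20
limit = amdahl_speedup(0.95, 1e12)
```

This is the key lesson of the law: the non-parallelisable fraction, not the core count, sets the ceiling.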
296
What are the three main tasks of HPC resource management software?
Resource allocation, workload scheduling, and support for distributed execution & monitoring
297
How is an HPC “job” defined?
A self-contained work unit with input data that produces output, run interactively or in batch, and queued until resources are available
298
Why is HPC important for data science?
It enables big data handling, complex analytics, faster ML/DL training, and large-scale scientific simulations
299
Why is network performance critical in distributed HPC systems?
Because LAN bandwidth and latency determine how fast nodes can exchange data, preventing communication bottlenecks.
300
What does “cost of ownership” mean for an HPC facility?
The total expense of running the system, including admin staff, maintenance, and up to $10 million/year in electricity costs.
301
What’s the difference between distributed and non-distributed HPC?
Distributed HPC spans multiple networked nodes that communicate over an interconnect, whereas non-distributed (shared-memory) HPC runs entirely within one multi-core system.
302
What three factors chiefly drive an HPC cluster’s processing power?
The number of nodes, processors per node, and cores per processor
303
What is SLURM?
An open-source, modular, extensible, scalable resource manager and workload scheduler for clusters and supercomputers.
304
In SLURM, what is a “partition”?
A logical grouping of nodes that defines a job queue with its own constraints and priorities
305
Name three possible node states in SLURM.
Draining, Drained, Down (also Completing, Allocated, Idle, Unknown)
306
What are the main job end-states in SLURM?
Completed, TimeOut, NodeFail, Cancelled, and Failed (with intermediate states Pending, Running, Suspended, Completing)
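The SLURM concepts above (partitions, queued jobs, end-states) come together in a batch script. A minimal sketch, where the partition name, resource sizes, and program path are illustrative assumptions, not real site values:

```shell
#!/bin/bash
#SBATCH --job-name=example          # name shown in the queue
#SBATCH --partition=compute         # partition (job queue); name is site-specific
#SBATCH --nodes=2                   # number of nodes requested
#SBATCH --ntasks-per-node=4         # tasks (e.g. MPI ranks) per node
#SBATCH --time=00:30:00             # wall-clock limit; exceeding it gives TimeOut
#SBATCH --output=job_%j.out         # stdout file, %j expands to the job ID

# Launch the (hypothetical) program across all allocated tasks
srun ./my_program
```

Submitted with `sbatch script.sh`, the job sits in Pending until the partition has free resources, then moves through Running to one of the end-states listed above.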