Final Exam Flashcards

(200 cards)

1
Q

What is the key difference between prospective and retrospective studies?

A

Prospective studies collect data going forward in time, while retrospective studies use historical data already collected.

  • Retrospective: cheaper, faster, more noise
  • Prospective: expensive, time-consuming, less noise
  • Retrospective common for large EHR datasets

2
Q

What are the three main healthcare applications covered in this course?

A

Predictive Modeling, Computational Phenotyping, and Patient Similarity.

  • Predictive Modeling: predict future outcomes
  • Phenotyping: extract disease patterns from data
  • Patient Similarity: find similar patients for treatment

3
Q

What is a cohort study?

A

Selects patients exposed to a risk factor and follows them to observe outcomes.

  • Example: all HF patients discharged from hospital
  • Define inclusion/exclusion criteria
  • Follows natural disease progression

4
Q

What is a case-control study?

A

Matches cases (positive outcome) with controls (negative outcome) based on specific criteria like age, gender, clinic.

  • Useful when disease is rare
  • Requires careful matching criteria
  • Helps balance dataset

5
Q

What is the observation window in predictive modeling?

A

The historical time period before the index date used to extract features.

  • Too short: insufficient data
  • Too long: irrelevant old data
  • Typical: 6-12 months

6
Q

What is the prediction window?

A

The future time period after the index date where we predict the outcome.

  • Longer window: easier prediction, less actionable
  • Shorter window: harder prediction, more actionable
  • Trade-off between accuracy and utility

7
Q

What is the index date?

A

The reference point in time from which predictions are made.

  • Examples: admission date, diagnosis date, discharge date
  • Separates observation window from prediction window
  • Must be consistently defined

8
Q

What are common feature types in clinical prediction?

A

Demographics, diagnoses, medications, lab results, vitals, procedures.

  • Demographics: age, gender, race
  • Diagnoses: ICD codes
  • Medications: drug prescriptions

9
Q

What is feature selection and why is it important?

A

Choosing relevant features to include in the model to improve performance and interpretability.

  • Reduces dimensionality
  • Removes irrelevant/redundant features
  • Improves model generalization

10
Q

Define True Positive Rate (TPR).

A

TPR = TP / (TP + FN) = True Positives / Condition Positive. Also called Sensitivity or Recall.

  • Measures: what fraction of sick patients are identified
  • High TPR: catches most disease cases
  • Important for screening tests

11
Q

Define False Positive Rate (FPR).

A

FPR = FP / (FP + TN) = False Positives / Condition Negative.

  • Measures: what fraction of healthy patients are misclassified
  • Low FPR desired
  • Trade-off with TPR

12
Q

Define Positive Predictive Value (PPV).

A

PPV = TP / (TP + FP) = True Positives / Prediction Outcome Positive. Also called Precision.

  • Measures: what fraction of positive predictions are correct
  • Depends on prevalence
  • Important for confirmation tests

13
Q

Define Specificity.

A

Specificity = TN / (TN + FP) = True Negatives / Condition Negative. Also called True Negative Rate.

  • Measures: what fraction of healthy correctly identified
  • High specificity: few false alarms
  • Important for confirmatory tests

14
Q

What does the F1 score measure?

A

F1 = 2 × (Precision × Recall) / (Precision + Recall). Harmonic mean of precision and recall.

  • Ranges from 0 to 1
  • Balances precision and recall
  • Better than accuracy for imbalanced data

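The metric definitions on the cards above are easy to sanity-check in code. A minimal sketch with made-up confusion-matrix counts (the numbers are illustrative, not from the course):

```python
# Derive the card metrics from a hypothetical 2x2 confusion matrix.
tp, fp, fn, tn = 80, 10, 20, 890  # assumed counts for illustration

tpr = tp / (tp + fn)              # sensitivity / recall
fpr = fp / (fp + tn)              # false positive rate
ppv = tp / (tp + fp)              # precision
specificity = tn / (tn + fp)
f1 = 2 * ppv * tpr / (ppv + tpr)  # harmonic mean of precision and recall
prevalence = (tp + fn) / (tp + fp + fn + tn)

print(tpr, round(ppv, 3), round(f1, 3), prevalence)
```

Note how prevalence is only 0.1 here: even with high specificity, rare conditions drag PPV down, which is exactly the point of card 17.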
15
Q

What is the ROC curve?

A

Plot of True Positive Rate vs False Positive Rate at different classification thresholds.

  • Each point = different threshold
  • AUC measures overall performance
  • AUC = 0.5: random, AUC = 1.0: perfect

16
Q

What is the confusion matrix?

A

2×2 table showing predicted vs actual labels: TP, FP, FN, TN.

  • Diagonal: correct predictions
  • Off-diagonal: errors
  • All metrics derive from it

17
Q

What is prevalence in classification?

A

Prevalence = Condition Positive / Total Population. Fraction of population with disease.

  • Affects PPV interpretation
  • Low prevalence → low PPV even with high specificity
  • Important for understanding dataset

18
Q

Why is accuracy a poor metric for imbalanced datasets?

A

Can achieve high accuracy by always predicting majority class, missing minority class entirely.

  • Example: 95% healthy → predict all healthy = 95% accuracy
  • Better: F1, AUROC, per-class metrics
  • Use stratified sampling

19
Q

What is Mean Absolute Error (MAE)?

A

MAE = (1/n) × Σ|y_i - ŷ_i|. Average absolute difference between predicted and actual values.

  • For regression problems
  • Same units as target variable
  • Less sensitive to outliers than MSE

20
Q

What is Mean Squared Error (MSE)?

A

MSE = (1/n) × Σ(y_i - ŷ_i)². Average squared difference between predicted and actual values.

  • For regression problems
  • Penalizes large errors more
  • Not in original units (squared)

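To make the MAE/MSE contrast concrete, both metrics on the same made-up predictions — the single large error of 2 dominates MSE but not MAE:

```python
# MAE and MSE for a small regression example (numbers made up).
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

n = len(y_true)
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n   # avg |error|
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n  # avg squared error

print(mae, mse)  # → 0.875 1.3125
```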
21
Q

What is R² (R-squared)?

A

Coefficient of determination: proportion of variance in target explained by model. R² = 1 - SS_res/SS_tot.

  • Usually 0 to 1 (can be negative when the model fits worse than the mean)
  • R² = 1: perfect fit
  • R² = 0: model no better than mean

22
Q

What is Gradient Descent?

A

Iterative optimization algorithm that updates parameters in direction of negative gradient to minimize loss.

  • Update rule: w ← w - α∇L(w)
  • α = learning rate
  • Uses entire dataset per iteration

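The update rule w ← w - α∇L(w) can be watched converging on a toy one-weight fit. A minimal sketch (data and learning rate are my own choices) fitting y ≈ w·x with MSE loss, where the data is generated from w = 2:

```python
# Full-batch gradient descent on MSE for a 1-D linear fit y ≈ w*x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with w_true = 2

w, alpha = 0.0, 0.05  # initial weight, learning rate
for _ in range(200):
    # gradient of (1/n) * sum (w*x - y)^2 with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= alpha * grad  # update rule: w ← w - α∇L(w)

print(round(w, 4))  # → 2.0
```

With a much larger α the same loop diverges, which is the learning-rate trade-off on card 127.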
23
Q

What is Stochastic Gradient Descent (SGD)?

A

Gradient descent using one sample (or mini-batch) at a time instead of entire dataset.

  • Faster than full GD
  • Noisier updates
  • Better for large datasets

24
Q

What is the bias-variance tradeoff?

A

Bias = error from model assumptions. Variance = error from sensitivity to training data. Complex models: low bias, high variance.

  • Total error = bias² + variance + noise
  • Simple models: high bias (underfitting)
  • Complex models: high variance (overfitting)

25
What is an ensemble method?
Combines multiple models to improve performance and reduce overfitting.
  • Diversity among models is key
  • Two types: bagging and boosting
  • Often outperforms single models
26
What is bagging?
Bootstrap Aggregating: trains multiple models on bootstrap samples, averages predictions.
  • Models trained in parallel
  • Reduces variance
  • Example: Random Forest
27
What is Random Forest?
Ensemble of decision trees using bagging plus random feature selection at each split.
  • Each tree: bootstrap sample + random features
  • Predictions averaged or voted
  • Reduces correlation between trees
28
What is boosting?
Sequential ensemble where each model focuses on errors of previous models by reweighting samples.
  • Models trained sequentially
  • Reduces bias
  • Examples: AdaBoost, XGBoost, Gradient Boosting
29
Compare bagging vs boosting.
Bagging: parallel, simple average, reduces variance. Boosting: sequential, weighted average, reduces bias.
  • Bagging: less sensitive to noise
  • Boosting: better accuracy but can overfit
  • Bagging: easy to parallelize
30
What is MapReduce?
Programming model for processing large datasets by dividing work into Map and Reduce phases.
  • Map: transform input to (key, value) pairs
  • Reduce: aggregate values for each key
  • Shuffle: group by key between phases
31
What happens in the Map phase?
Processes input records independently, emitting (key, value) pairs.
  • Runs in parallel across nodes
  • No communication between mappers
  • Example: emit (disease, 1) for each diagnosis
32
What happens in the Reduce phase?
Aggregates all values for each key.
  • Receives (key, [list of values])
  • Emits (key, aggregated_value)
  • Example: sum all counts for each disease
33
What is the Shuffle phase in MapReduce?
Groups all values by key and routes them to appropriate reducers between Map and Reduce.
  • Network-intensive operation
  • Sorts and partitions data
  • Ensures all values for a key go to same reducer
34
How does MapReduce achieve fault tolerance?
Master tracks task completion, re-executes failed tasks on different nodes.
  • Worker failure detected via heartbeats
  • Map tasks: re-execute on different worker
  • Deterministic execution enables recomputation
35
Why is MapReduce good for linear regression?
Linear regression has closed-form solution using aggregation statistics: θ = (X^T X)^-1 X^T Y.
  • Can aggregate X^T X and X^T Y across nodes
  • Single MapReduce pass
  • No iteration needed
36
Why is MapReduce poor for logistic regression?
Logistic regression requires iterative gradient descent with multiple passes over data.
  • Each iteration: full MapReduce job
  • Writes to disk between iterations
  • Disk I/O dominates cost
37
What are the limitations of MapReduce?
Inefficient for iterative algorithms, acyclic data flow, disk I/O between jobs.
  • Must write to HDFS after each job
  • Poor for ML (10s-100s iterations)
  • Poor for interactive queries
38
What is K-means clustering?
Iteratively assigns points to nearest centroid, then updates centroids as cluster means.
  • Algorithm: initialize K centroids, assign, update, repeat
  • Converges to local optimum
  • Assumes spherical clusters
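The assign/update loop fits in a few lines. A minimal 1-D sketch (data, seed, and iteration count are my own; real k-means would use multi-dimensional Euclidean distance):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 1-D points: assign to nearest centroid, update means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # assignment step
            idx = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        for i, c in enumerate(clusters):                   # update step
            if c:
                centroids[i] = sum(c) / len(c)
    return sorted(centroids)

# Two well-separated toy clusters around 1 and 10.
data = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
print([round(c, 3) for c in kmeans(data, 2)])  # → [1.0, 10.0]
```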
39
What is the computational complexity of K-means?
O(i × n × K × d) where i=iterations, n=points, K=clusters, d=dimensions.
  • Linear in n, K, d
  • Number of iterations varies
  • Typically converges quickly
40
What is hierarchical clustering?
Builds a tree (dendrogram) of clusters by iteratively merging closest clusters.
  • Agglomerative: bottom-up merging
  • No need to specify K upfront
  • Provides cluster hierarchy
41
What is a Gaussian Mixture Model (GMM)?
Probabilistic clustering assuming data comes from a mixture of Gaussian distributions.
  • Soft clustering: probabilities for each cluster
  • Uses EM algorithm
  • More flexible than K-means
42
What is the EM algorithm?
Expectation-Maximization: iterative algorithm alternating E-step (compute expectations) and M-step (maximize parameters).
  • E-step: compute cluster membership probabilities
  • M-step: update Gaussian parameters
  • Converges to local optimum
43
What is Mini-Batch K-means?
K-means variant using small random batches instead of the full dataset.
  • Complexity: O(t × b × K × d), b << n
  • Much faster than standard K-means
  • Slight accuracy loss
44
What is DBSCAN?
Density-Based Spatial Clustering: finds clusters as high-density regions separated by low-density regions.
  • Parameters: ε (radius), MinPts (minimum density)
  • Finds arbitrary-shaped clusters
  • Identifies noise points
45
What are core points in DBSCAN?
Points with ≥ MinPts neighbors within ε radius.
  • Define high-density regions
  • Form cluster centers
  • Border points connect to cores
46
What are border points in DBSCAN?
Points within ε of a core point but not core themselves.
  • On cluster edges
  • Fewer than MinPts neighbors
  • Belong to one cluster
47
What are noise points in DBSCAN?
Points not within ε of any core point.
  • Low-density regions
  • Not assigned to any cluster
  • DBSCAN advantage: identifies outliers
48
What is the Rand Index?
RI = (a+b) / (total pairs). Measures clustering agreement with ground truth.
  • a: pairs in same cluster in both
  • b: pairs in different clusters in both
  • Ranges 0 to 1 (higher better)
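The pair-counting definition translates directly into code. A sketch over a hypothetical 4-point clustering (labels are my own example):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """RI = (a + b) / total pairs: fraction of point pairs on which the two
    clusterings agree (same-same pairs plus different-different pairs)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:   # both say "same" or both say "different"
            agree += 1
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 1]))           # identical → 1.0
print(round(rand_index([0, 0, 1, 1], [0, 1, 0, 1]), 3))  # → 0.333
```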
49
What is Mutual Information for clustering?
Measures shared information between clustering and ground truth. MI(X,Y) = ΣΣ p(x,y) log[p(x,y)/(p(x)p(y))].
  • Normalized MI: MI / sqrt(H(X)H(Y))
  • Ranges 0 to 1
  • Requires ground truth
50
What is the Silhouette Coefficient?
s(x) = (b-a) / max(a,b) where a=avg distance to same cluster, b=avg distance to nearest other cluster.
  • Ranges -1 to 1 (higher better)
  • No ground truth needed
  • Computed per point, averaged
51
What is computational phenotyping?
Extracting structured disease phenotypes from raw clinical data.
  • Transforms EHR into disease labels
  • Methods: expert rules, classification, clustering
  • Enables GWAS, prediction, trials
52
What is GWAS?
Genome-Wide Association Study: identifies genetic variants (SNPs) associated with diseases.
  • Compares SNP frequencies: cases vs controls
  • Needs accurate phenotype labels
  • Requires large sample sizes
53
What are the two approaches to phenotyping?
Supervised (expert rules, classification) and Unsupervised (dimensionality reduction, tensor factorization).
  • Supervised: needs labeled data
  • Unsupervised: discovers phenotypes automatically
  • Rules: interpretable but manual effort
54
What is SVD (Singular Value Decomposition)?
Factorizes matrix X = UΣV^T where U, V are orthogonal and Σ is diagonal with singular values.
  • U: left singular vectors
  • V: right singular vectors
  • Σ: singular values (ordered largest to smallest)
55
What is PCA (Principal Component Analysis)?
Finds orthogonal directions of maximum variance. Equivalent to SVD on centered data.
  • PCs = UΣ (scores)
  • Loadings = V (feature weights)
  • First PC has highest variance
56
What is the sparsity problem with SVD?
SVD produces dense factors (all non-zero), making interpretation difficult.
  • Factors are linear combinations of all features
  • Hard to interpret in clinical context
  • CUR decomposition addresses this
57
What is CUR decomposition?
Approximates A ≈ CUR using actual columns (C) and rows (R) from the data.
  • C: selected columns
  • R: selected rows
  • U: small matrix connecting C and R
  • Preserves sparsity and interpretability
58
What is a tensor?
Multi-dimensional array generalizing matrices (2D) to 3+ dimensions.
  • Matrix: 2D (patients × diagnoses)
  • Tensor: 3D+ (patients × diagnoses × medications × time)
  • Captures higher-order interactions
59
What is CP decomposition?
Canonical Polyadic: factorizes tensor into sum of rank-1 tensors.
  • Each rank-1 component = one phenotype
  • Factor matrices for each dimension
  • Unsupervised phenotype discovery
60
What is the difference between pragmatic trials and RCTs?
RCTs: controlled, randomized, one intervention. Pragmatic: real-world, no randomization, multiple interventions.
  • RCT: efficacy in ideal conditions
  • Pragmatic: effectiveness in practice
  • RCT: expensive, slow, causal
61
What is patient similarity search?
Finding past patients similar to current patient to inform treatment decisions.
  • Enables precision medicine
  • Methods: distance metrics, graphs
  • Learn what worked for similar patients
62
What is LSML (Locally Supervised Metric Learning)?
Learns distance metric by maximizing margin between same-outcome and different-outcome neighbors.
  • Context-specific for prediction task
  • Optimizes: max distance to heterogeneous, min to homogeneous
  • Uses eigenvectors of H = L^e - L^o
63
What is ICD?
International Classification of Diseases: standardized diagnosis codes for billing and epidemiology.
  • ICD-9: ~14K codes, 3-5 digits
  • ICD-10: ~70K codes, 7 alphanumeric
  • US transitioned October 2015
64
What are key differences between ICD-9 and ICD-10?
ICD-9: shorter codes, less detail. ICD-10: longer codes, more specificity.
  • ICD-9: 3-5 characters
  • ICD-10: 7 alphanumeric characters
  • Complex mapping: 1-to-many possible
65
What is CPT?
Current Procedural Terminology: codes for medical procedures and services for billing.
  • Maintained by AMA
  • Determines physician payment
  • Category I: main procedures (6 sections)
66
What is LOINC?
Logical Observation Identifiers Names and Codes: standardizes lab tests and observations.
  • 6 attributes: component, property, timing, system, scale, method
  • Enables interoperability
  • Critical for lab data exchange
67
What is SNOMED-CT?
Systematized Nomenclature of Medicine: comprehensive medical terminology with concepts and relationships.
  • >300K concepts
  • Hierarchical IS-A relationships
  • 18 top-level hierarchies
68
What is the IS-A relationship in SNOMED?
Hierarchical relationship indicating one concept is a subtype of another.
  • Example: Arthritis IS-A Joint Finding
  • Creates directed acyclic graph
  • Enables reasoning and inference
69
What is UMLS?
Unified Medical Language System: integrates multiple medical terminologies.
  • 3 components: Metathesaurus, Semantic Network, Lexicon
  • ~1.5M concepts (CUIs)
  • Links codes across vocabularies
70
What are the three UMLS components?
Metathesaurus (concepts from all sources), Semantic Network (types and relationships), SPECIALIST Lexicon (linguistic info).
  • Metathesaurus: concept mapping
  • Semantic Network: 135 types, 54 relationships
  • Lexicon: 330K biomedical terms
71
What is PageRank?
Algorithm computing node importance based on incoming links from important nodes.
  • q = cA^T q + (1-c)/N × e
  • c = damping factor (0.85)
  • Iterative until convergence
72
How is PageRank computed?
Iteratively: each node distributes rank to neighbors, nodes sum contributions.
  • Start with uniform distribution
  • Each iteration: distribute rank / out-degree
  • Apply damping: c×sum + (1-c)/N
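The distribute-and-damp iteration on these two cards can be sketched directly. A toy 3-page graph of my own invention (every node here has at least one out-link, so dangling nodes are not handled):

```python
def pagerank(links, c=0.85, iters=50):
    """Power iteration for PageRank. links[u] = list of pages u links to."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}       # start uniform
    for _ in range(iters):
        new = {u: (1 - c) / n for u in nodes}     # teleport term (1-c)/N
        for u in nodes:
            share = c * rank[u] / len(links[u])   # distribute rank / out-degree
            for v in links[u]:
                new[v] += share
        rank = new
    return rank

# A↔B, both link to C, C links back to A.
r = pagerank({"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]})
print({k: round(v, 3) for k, v in r.items()})
```

A ends up ranked highest: it receives all of C's rank plus half of B's, illustrating that importance flows from important in-links, not just their count.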
73
What is spectral clustering?
Uses eigenvectors of graph Laplacian for clustering.
  • Build similarity graph
  • Compute Laplacian L = D - W
  • Use top-k eigenvectors for k-means
74
What is the Graph Laplacian?
L = D - W where D=degree matrix, W=adjacency matrix.
  • Unnormalized: L = D - W
  • Normalized: L_sym = I - D^(-1/2) W D^(-1/2)
  • Eigenvectors reveal cluster structure
75
What are the three types of similarity graphs?
ε-neighborhood, k-NN, fully connected.
  • ε-neighborhood: connect if distance < ε
  • k-NN: connect k nearest neighbors
  • Fully connected: Gaussian kernel weights
76
What is Apache Spark?
In-memory distributed computing framework for big data processing.
  • Faster than MapReduce (10-100x)
  • Keeps data in memory
  • Built on RDDs
77
What is an RDD?
Resilient Distributed Dataset: immutable distributed collection with fault tolerance via lineage.
  • Immutable: transformations create new RDDs
  • Lazy evaluation
  • Fault tolerance: recompute from lineage
78
What is RDD lineage?
DAG of transformations used to build an RDD, enabling fault tolerance.
  • Records sequence of operations
  • Recompute lost partitions
  • No replication needed
79
What are RDD transformations?
Operations creating new RDDs: map, filter, join, etc. Lazily evaluated.
  • Lazy: not executed until action
  • Build computation DAG
  • Examples: map, flatMap, filter, distinct
80
What are RDD actions?
Operations returning results: collect, count, reduce, etc. Trigger execution.
  • Execute DAG of transformations
  • Return to driver or save to storage
  • Examples: collect, count, save, reduce
81
What is the difference between map and flatMap?
map: one output per input. flatMap: 0, 1, or many outputs per input.
  • map: one-to-one transformation
  • flatMap: one-to-many, flattens results
  • flatMap useful for tokenization
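Spark itself is not needed to see the difference; plain Python list comprehensions show the same semantics (this is an analogy, not the Spark API):

```python
lines = ["heart failure", "diabetes"]

# map-like: exactly one output element per input element
mapped = [line.split() for line in lines]

# flatMap-like: split each line, then flatten — zero or more outputs per input
flat_mapped = [tok for line in lines for tok in line.split()]

print(mapped)       # → [['heart', 'failure'], ['diabetes']]
print(flat_mapped)  # → ['heart', 'failure', 'diabetes']
```

In Spark the same tokenization would be `rdd.map(lambda l: l.split())` versus `rdd.flatMap(lambda l: l.split())`.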
82
What are broadcast variables in Spark?
Read-only variables efficiently sent once to all worker nodes.
  • Avoid sending data with each task
  • Cached locally on each node
  • Use for lookup tables, parameters
83
Why is Spark better than MapReduce for iterative algorithms?
Spark caches data in memory, avoiding disk I/O between iterations.
  • MapReduce: read/write HDFS each iteration
  • Spark: keep data in memory
  • 10-100x speedup for ML
84
What is lazy evaluation in Spark?
Transformations recorded but not executed until action called.
  • Enables DAG optimization
  • Pipelines operations
  • Minimizes shuffles
85
How does Spark achieve fault tolerance?
Recomputes lost partitions using lineage instead of replication.
  • Lineage: DAG of operations
  • Replay transformations only for lost partition
  • No storage cost
86
What is cross-validation?
Technique for evaluating model on unseen data by splitting dataset multiple ways.
  • Estimates test performance
  • Types: leave-one-out, k-fold, randomized
  • Prevents overfitting
87
What is k-fold cross-validation?
Split data into k folds, train on k-1, test on 1, rotate test fold.
  • Typically k=5 or 10
  • Each sample used for test once
  • Average performance across folds
88
What is leave-one-out CV?
Train on n-1 samples, test on 1 sample, repeat for all samples.
  • Most accurate estimate
  • Computationally expensive
  • n iterations needed
89
What is the difference between training, validation, and test sets?
Training set: fits model parameters. Validation set: tunes hyperparameters and selects models. Test set: estimates final generalization performance.
  • Never tune on the test set
  • Test set used only once, at the end
  • Validation can be replaced by cross-validation
90
What is regularization?
Adds penalty to loss function to discourage complex models and reduce overfitting.
  • L1 (Lasso): penalty = λ|w|, sparsity
  • L2 (Ridge): penalty = λw², shrinkage
  • λ controls strength
91
What is L1 regularization?
Adds sum of absolute weights to loss: L + λΣ|w_i|. Drives some weights to zero.
  • Also called Lasso
  • Feature selection: zeroes out features
  • Produces sparse models
92
What is L2 regularization?
Adds sum of squared weights to loss: L + λΣw_i². Shrinks weights toward zero.
  • Also called Ridge
  • Shrinks all weights
  • Prefers many small weights
93
What is overfitting?
Model learns training data too well, including noise, failing to generalize.
  • High training accuracy, low test accuracy
  • Solutions: regularization, more data, simpler model
  • Cross-validation helps detect
94
What is underfitting?
Model too simple to capture patterns in data.
  • Low training and test accuracy
  • High bias
  • Solution: more complex model, more features
95
What is bootstrap sampling?
Sampling with replacement to create multiple training sets.
  • Sample n from n with replacement
  • ~63% unique samples per bootstrap
  • Used in bagging methods
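The "~63% unique" figure follows from P(included) = 1 - (1 - 1/n)^n → 1 - 1/e ≈ 0.632, and it can be checked empirically. A quick simulation (sample size, trial count, and seed are my own choices):

```python
import random

# Empirically check the "~63% unique samples per bootstrap" claim.
rng = random.Random(42)
n, trials = 1000, 200
fractions = []
for _ in range(trials):
    bootstrap = [rng.randrange(n) for _ in range(n)]  # sample with replacement
    fractions.append(len(set(bootstrap)) / n)          # fraction of unique ids

avg_unique = sum(fractions) / trials
print(round(avg_unique, 3))  # close to 0.632
```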
96
What is the elbow method?
Plot metric vs k, find 'elbow' where improvement slows.
  • For choosing k in k-means
  • WCSS decreases with k
  • Elbow = optimal k
97
What is the curse of dimensionality?
In high dimensions, data becomes sparse and distances become less meaningful.
  • Distances concentrate
  • Need exponentially more data
  • Solution: dimensionality reduction
98
What is feature engineering?
Creating new features from raw data to improve model performance.
  • Aggregate temporal data
  • Create interaction features
  • Domain knowledge critical
99
What is dimensionality reduction?
Projecting high-dimensional data to lower dimensions while preserving structure.
  • Reduces noise and overfitting
  • Improves computation
  • Methods: PCA, SVD, tensor factorization
100
What is a rank-1 tensor?
Tensor that can be written as outer product of vectors: a ⊗ b ⊗ c.
  • CP decomposition: sum of rank-1 tensors
  • Each represents one phenotype
  • Factor vectors for each dimension
101
What is a phenotype?
Group of patients sharing common observable characteristics (diagnoses, medications, labs).
  • Can represent disease subtypes
  • Extracted from EHR data
  • Used for cohorts and features
102
What is the Mahalanobis distance?
Distance accounting for correlations: d² = (x-y)^T Σ^(-1) (x-y).
  • Σ = covariance matrix
  • Accounts for feature correlations
  • LSML learns generalized version
103
What is hard clustering?
Each point assigned to exactly one cluster.
  • Example: k-means
  • Discrete assignments
  • Clear boundaries
104
What is soft clustering?
Each point has probability of belonging to each cluster.
  • Example: GMM
  • Probabilistic assignments
  • Overlapping clusters
105
What is the difference between ICD and CPT codes?
ICD codes describe diagnoses (what condition the patient has); CPT codes describe procedures and services (what was done for the patient).
  • ICD: diagnosis codes
  • CPT: procedure/service codes, maintained by AMA
  • Both central to billing
106
What is a Gaussian distribution?
Normal distribution: bell-shaped curve characterized by mean μ and variance σ².
  • PDF: (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))
  • 68-95-99.7 rule
  • Central to GMM
107
What is the difference between classification and regression?
Classification: predict categorical labels. Regression: predict continuous values.
  • Classification: discrete output
  • Regression: continuous output
  • Different metrics: accuracy vs MSE
108
What is the difference between supervised and unsupervised learning?
Supervised: has labeled training data. Unsupervised: no labels, find structure.
  • Supervised: classification, regression
  • Unsupervised: clustering, dimensionality reduction
  • Supervised: predict, Unsupervised: discover
109
What is the MapReduce word count example?
Map: emit (word, 1) for each word. Reduce: sum counts for each word.
  • Classic MapReduce example
  • Map: tokenize and emit
  • Reduce: aggregate counts
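The three phases can be simulated on one machine in a few lines (a single-process sketch of the data flow, not actual Hadoop):

```python
from collections import defaultdict

docs = ["heart failure", "failure to thrive", "heart attack"]

# Map: emit (word, 1) for each word
pairs = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group all values by key
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce: sum the counts for each key
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # → {'heart': 2, 'failure': 2, 'to': 1, 'thrive': 1, 'attack': 1}
```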
110
What does it mean for RDD transformations to be lazy?
Not executed until an action is called, allowing optimization.
  • Build DAG of operations
  • Optimizer can pipeline and minimize shuffles
  • Action triggers execution
111
What is the difference between union and intersection in RDDs?
union: combines all elements from both RDDs. intersection: only elements in both.
  • union: cheap (no shuffle)
  • intersection: expensive (requires shuffle)
  • Both return a new RDD
112
What is the reduce action in Spark?
Combines elements using an associative function: reduce((x,y) => x+y).
  • Must be associative and commutative
  • Parallel aggregation
  • Returns single value to driver
113
What is the collect action in Spark?
Returns all RDD elements to the driver as an array.
  • Brings data to driver
  • Can cause out-of-memory
  • Use only for small results
114
What is a DataFrame in Spark?
Distributed collection with schema, similar to a database table.
  • Higher-level than RDD
  • Optimized execution (Catalyst)
  • SQL-like operations
115
What is missing data and how to handle it?
Data not recorded or lost. Handle by imputation, deletion, or modeling.
  • MCAR: missing completely at random
  • MAR: missing at random
  • MNAR: missing not at random
116
What is imputation?
Filling in missing values with estimates (mean, median, model-based).
  • Mean/median imputation
  • K-NN imputation
  • Multiple imputation
117
What is data normalization?
Scaling features to a common range (e.g., 0-1 or z-score).
  • Min-max: scale to [0,1]
  • Z-score: mean=0, std=1
  • Important for distance-based methods
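Both scalings on the card fit in a few lines of standard-library Python (the values are a made-up feature, and population standard deviation is assumed for the z-score):

```python
import statistics

values = [50.0, 60.0, 70.0, 80.0, 100.0]  # hypothetical raw feature values

# Min-max: scale to [0, 1]
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Z-score: subtract the mean, divide by the (population) standard deviation
mu = statistics.mean(values)
sigma = statistics.pstdev(values)
zscores = [(v - mu) / sigma for v in values]

print(minmax)  # → [0.0, 0.2, 0.4, 0.6, 1.0]
print([round(z, 2) for z in zscores])
```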
118
What is one-hot encoding?
Converts categorical variables into binary vectors.
  • Each category → binary feature
  • Example: {red, blue, green} → [1,0,0], [0,1,0], [0,0,1]
  • Increases dimensionality
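A minimal encoder sketch (my own helper; here categories are ordered alphabetically, so the exact vectors differ from the card's example ordering):

```python
def one_hot(categories):
    """Map each category to a binary indicator vector (alphabetical order)."""
    vocab = sorted(set(categories))
    index = {c: i for i, c in enumerate(vocab)}
    return [[1 if index[c] == i else 0 for i in range(len(vocab))]
            for c in categories]

# vocab order: blue, green, red
print(one_hot(["red", "blue", "green"]))  # → [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
```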
119
What is an outlier?
Data point significantly different from others, may be error or interesting case.
  • Detection: statistical tests, visualization
  • May remove or investigate
  • DBSCAN identifies as noise
120
What is model validation?
Assessing model performance on data not used for training.
  • Test set or cross-validation
  • Prevents overfitting
  • Estimates generalization
121
What is hyperparameter tuning?
Finding optimal hyperparameters (e.g., learning rate, regularization strength).
  • Use validation set, not test
  • Methods: grid search, random search
  • Tune before final evaluation
122
What is grid search?
Systematic search over hyperparameter combinations.
  • Tests all combinations
  • Computationally expensive
  • Finds optimal in grid
123
What is random search?
Random sampling of hyperparameter combinations.
  • Often as good as grid search
  • More efficient
  • Better for high dimensions
124
What is early stopping?
Stop training when validation performance stops improving.
  • Prevents overfitting
  • Monitor validation loss
  • Save best model
125
What is batch normalization?
Normalizes layer inputs during training to stabilize learning.
  • Reduces internal covariate shift
  • Enables higher learning rates
  • Commonly used in deep learning
126
What is dropout?
Regularization: randomly drop neurons during training.
  • Rate: fraction of neurons to drop (e.g., 0.5)
  • Prevents co-adaptation
  • Ensemble effect
127
What is a learning rate?
Step size for parameter updates in gradient descent.
  • Too high: divergence
  • Too low: slow convergence
  • Typically 0.001-0.1
128
What is momentum in optimization?
Accumulates gradient history to accelerate convergence.
  • Adds fraction of previous update
  • Helps escape local minima
  • Smooths optimization path
129
What is the Adam optimizer?
Adaptive learning rate optimizer combining momentum and RMSprop.
  • Adapts learning rate per parameter
  • Popular default choice
  • Combines benefits of momentum and adaptive rates
130
What is transfer learning?
Using a pre-trained model on a new task.
  • Reuse learned features
  • Fine-tune on new data
  • Reduces training time and data needs
131
What is a decision tree?
Tree structure where each node tests a feature, leaves predict outcome.
  • Interpretable
  • Handles non-linear relationships
  • Prone to overfitting
132
What is pruning in decision trees?
Removing branches to reduce complexity and overfitting.
  • Pre-pruning: stop growing early
  • Post-pruning: remove after building
  • Improves generalization
133
What is information gain?
Reduction in entropy from splitting on a feature.
  • Used to select split in decision trees
  • Higher gain = better split
  • ID3 algorithm uses information gain
134
What is Gini impurity?
Measure of how often a randomly chosen element would be incorrectly labeled.
  • Gini = 1 - Σp_i²
  • Used in CART algorithm
  • Lower = purer node
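Both split criteria on these two cards are one-liners over class proportions. A sketch with made-up labels — a pure node scores 0 on both measures, a 50/50 node scores the maximum:

```python
from math import log2

def gini(labels):
    """Gini = 1 - Σ p_i² over class proportions."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    """Shannon entropy, the quantity that information gain is defined on."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

pure = ["sick"] * 4
mixed = ["sick", "sick", "well", "well"]
print(gini(pure), gini(mixed))
print(entropy(pure), entropy(mixed))
```

Information gain for a candidate split is then the parent's entropy minus the weighted average entropy of the children.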
135
What is logistic regression?
Linear model for classification using sigmoid function: P(y=1) = 1/(1+exp(-w^T x)).
  • Outputs probabilities [0,1]
  • Optimized via gradient descent
  • Linear decision boundary
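The prediction side of the card is just a dot product pushed through the sigmoid. A sketch with made-up weights (the feature names in the comment are hypothetical):

```python
from math import exp

def sigmoid(z):
    return 1 / (1 + exp(-z))

def predict_proba(weights, bias, features):
    """P(y=1|x) = sigmoid(w·x + b): a linear score squashed into [0, 1]."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Assumed weights for two features (e.g. scaled age, prior admissions).
w, b = [1.5, 0.8], -2.0
print(round(predict_proba(w, b, [1.0, 2.0]), 2))  # → 0.75
```

The decision boundary is where w·x + b = 0, i.e. P(y=1) = 0.5 — hence the "linear decision boundary" bullet.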
136
What is a support vector machine (SVM)?
Finds optimal hyperplane maximizing margin between classes.
  • Maximizes distance to nearest points
  • Kernel trick for non-linear boundaries
  • Effective in high dimensions
137
What is the kernel trick?
Implicitly maps data to higher dimensions without computing transformation.
  • Common kernels: linear, polynomial, RBF
  • Enables non-linear SVM
  • Computationally efficient
138
What is k-nearest neighbors (k-NN)?
Classifies based on majority vote of k nearest neighbors.
  • Non-parametric
  • No training phase
  • Slow prediction (distance to all points)
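The whole algorithm is sort-by-distance plus a vote. A 1-D sketch with a hypothetical labelled lab value (real use would need multi-dimensional distances and feature scaling, per card 148):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k nearest training points (1-D distance here)."""
    neighbors = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up (lab value, outcome) pairs.
train = [(1.0, "healthy"), (1.5, "healthy"), (2.0, "healthy"),
         (8.0, "sick"), (8.5, "sick"), (9.0, "sick")]
print(knn_predict(train, 1.2))  # → healthy
print(knn_predict(train, 8.7))  # → sick
```

Note there is no training step: all cost is paid at prediction time, matching the "slow prediction" bullet.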
139
What is Naive Bayes?
Probabilistic classifier using Bayes' theorem with independence assumption. ## Footnote * Assumes features independent given class * Fast training and prediction * Works well for text classification
140
What is stratified sampling?
Sampling that maintains class proportions from original data. ## Footnote * Important for imbalanced data * Ensures representative train/test splits * Used in stratified k-fold CV
141
What is class imbalance?
When one class has many more samples than others. ## Footnote * Common in healthcare (rare diseases) * Leads to biased models * Solutions: resampling, weighted loss, different metrics
142
What is oversampling?
Increasing minority class samples to balance dataset. ## Footnote * Duplicate minority samples * SMOTE: synthetic minority oversampling * Risk: overfitting minority class
143
What is undersampling?
Decreasing majority class samples to balance dataset. ## Footnote * Randomly remove majority samples * Risk: losing information * Faster training
144
What is SMOTE?
Synthetic Minority Over-sampling Technique: creates synthetic minority samples. ## Footnote * Interpolates between minority neighbors * Reduces overfitting vs duplication * Popular for imbalanced data
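The core SMOTE idea — interpolate between a minority point and one of its minority-class neighbors — can be sketched as follows. This is a simplified illustration, not the full published algorithm (no feature scaling, one synthetic point at a time; names and the fixed seed are illustrative):

```python
import random

def smote_sample(minority, k=2, rng=random.Random(0)):
    """Create one synthetic point between a minority sample and one of
    its k nearest minority neighbors (simplified sketch)."""
    base = rng.choice(minority)
    neighbors = sorted((p for p in minority if p is not base),
                       key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbor))

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
synthetic = smote_sample(minority)
```

Because the new point lies on a segment between real minority samples, it stays inside the minority region instead of exactly duplicating a sample.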
145
What is a precision-recall curve?
Plot of precision vs recall at different thresholds. ## Footnote * Alternative to ROC for imbalanced data * Area under curve (AUC-PR) measures performance * Focuses on positive class
146
What is the difference between L1 and L2 regularization?
L1: penalty on |w|, creates sparsity. L2: penalty on w², shrinks weights. ## Footnote * L1: feature selection (some weights → 0) * L2: all weights shrink * Elastic Net: combines both
147
What is Elastic Net?
Regularization combining L1 and L2: penalty = α|w| + β w². ## Footnote * α controls L1 strength * β controls L2 strength * Balances sparsity and shrinkage
148
What is feature scaling?
Transforming features to similar ranges. ## Footnote * Important for distance-based methods * Methods: normalization, standardization * Not needed for tree-based methods
149
What is the difference between normalization and standardization?
Normalization: scale to [0,1]. Standardization: mean=0, std=1 (z-score). ## Footnote * Normalization: (x-min)/(max-min) * Standardization: (x-μ)/σ * Standardization better with outliers
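Both formulas in code, side by side (population standard deviation used for illustration):

```python
import statistics

def min_max_normalize(xs):
    """Scale to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Z-score: (x - mean) / std."""
    mu, sigma = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

data = [2.0, 4.0, 6.0, 8.0]
norm = min_max_normalize(data)  # endpoints map to exactly 0 and 1
std = standardize(data)         # result has mean 0
```

A single extreme outlier drags `max` (and thus every normalized value) with it, while it shifts the mean and std more gently, which is the intuition behind "standardization better with outliers."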
150
What is the damping factor in PageRank?
Parameter c (typically 0.85) representing probability of following links vs random jump. ## Footnote * Formula: q = cA^T q + (1-c)/N × e * c=0.85: 85% follow links, 15% random jump * Prevents dead ends and spider traps
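The formula q = cA^T q + (1-c)/N × e maps directly onto power iteration; a sketch for a tiny unweighted graph with no dangling nodes (adjacency-list representation and names are illustrative):

```python
def pagerank(adj, c=0.85, iters=50):
    """Power iteration for q = c * A^T q + (1 - c)/N * e.

    adj[i] = list of nodes that node i links to."""
    n = len(adj)
    q = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - c) / n] * n          # random-jump term, shared by all nodes
        for i, outs in enumerate(adj):
            for j in outs:
                new[j] += c * q[i] / len(outs)  # i spreads its score over out-links
        q = new
    return q

# 0 -> 1 -> 2 -> 0: a cycle, so ranks converge to uniform 1/3 each.
ranks = pagerank([[1], [2], [0]])
```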
151
What are the three CPT categories?
Category I: main procedures (6 sections). Category II: performance measurement (optional). Category III: emerging procedures (temporary). ## Footnote * Category I: 5-digit codes for established procedures * Category II: 4 digits + F, quality metrics * Category III: 4 digits + T, new technologies
152
What is the purpose of the combiner in MapReduce?
Optional function that performs local aggregation on mapper output before shuffle. ## Footnote * Reduces network traffic * Similar to reducer but runs on mapper nodes * Example: local sum before global sum
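The "local sum before global sum" pattern can be sketched with word count, the canonical MapReduce example. This is a single-process simulation, not Hadoop code; note that for word count the combiner and reducer share the same logic:

```python
from collections import Counter

def mapper(line):
    """Emit (word, 1) for each word in a line."""
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    """Local per-key sum on the mapper node, shrinking shuffle traffic."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

def reducer(all_pairs):
    """Global per-key sum after the shuffle."""
    return combiner(all_pairs)  # word count: reducer logic = combiner logic

lines = ["heart heart failure", "heart attack"]
shuffled = [pair for line in lines for pair in combiner(mapper(line))]
result = dict(reducer(shuffled))
```

In the first line the combiner collapses two `("heart", 1)` pairs into one `("heart", 2)` before the shuffle — that is the entire point of the combiner.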
153
What are the six LOINC attributes?
Component, Property, Timing, System, Scale, Method. ## Footnote * Component: what measured (e.g., glucose) * Property: type (e.g., mass concentration) * System: specimen (e.g., serum)
154
What is the difference between SNOMED concepts and descriptions?
Concepts: unique clinical meanings. Descriptions: terms/synonyms for concepts. ## Footnote * Each concept has unique ID * Multiple descriptions per concept * Enables synonym matching
155
What is an attribute relationship in SNOMED?
Relationship connecting concepts from different hierarchies (e.g., Appendicitis 'associated morphology' Inflammation). ## Footnote * Non-hierarchical relationships * Provides semantic meaning * Complements IS-A relationships
156
What is PheKB?
Phenotype KnowledgeBase: repository of validated phenotyping algorithms. ## Footnote * Community-contributed phenotype definitions * Includes ICD codes, medications, labs * Enables phenotype replication
157
What is OMOP CDM?
Observational Medical Outcomes Partnership Common Data Model: standardized EHR database schema. ## Footnote * Standardizes structure and content * Enables multi-site studies * Includes standard vocabularies
158
What happens during the E-step of GMM?
Computes posterior probability γ_nk that each point n belongs to each cluster k. ## Footnote * γ_nk = π_k N(x_n|μ_k, Σ_k) / Σ_j π_j N(x_n|μ_j, Σ_j) * Soft assignment to clusters * Uses current parameter estimates
159
What happens during the M-step of GMM?
Updates Gaussian parameters π_k, μ_k, Σ_k using weighted samples (weights = γ_nk). ## Footnote * μ_k = weighted mean of points * Σ_k = weighted covariance * π_k = average membership
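The E-step and M-step of cards 158–159 can be sketched together as one EM iteration for a 1-D two-component mixture (kept 1-D so the covariance is a scalar variance; data and names are illustrative):

```python
import math

def gaussian(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_step(xs, pis, mus, vars_):
    """One EM iteration for a 1-D Gaussian mixture (illustrative sketch)."""
    K = len(pis)
    # E-step: responsibilities gamma[n][k] proportional to pi_k * N(x_n | mu_k, var_k)
    gamma = []
    for x in xs:
        w = [p * gaussian(x, m, v) for p, m, v in zip(pis, mus, vars_)]
        z = sum(w)
        gamma.append([wi / z for wi in w])
    # M-step: re-estimate parameters from the weighted samples
    nk = [sum(g[k] for g in gamma) for k in range(K)]
    mus = [sum(g[k] * x for g, x in zip(gamma, xs)) / nk[k] for k in range(K)]
    vars_ = [sum(g[k] * (x - mus[k]) ** 2 for g, x in zip(gamma, xs)) / nk[k]
             for k in range(K)]
    pis = [nk[k] / len(xs) for k in range(K)]
    return pis, mus, vars_

# Two well-separated groups; one iteration already lands means near 0.1 and 5.1.
xs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
pis, mus, vars_ = em_step(xs, [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
```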
160
What is the key advantage of DBSCAN over K-means?
Can find arbitrary-shaped clusters and identify outliers, doesn't require specifying K. ## Footnote * K-means: spherical clusters, all points assigned * DBSCAN: any shape, identifies noise * Better for spatial data
161
What are the limitations of hierarchical clustering?
Computationally expensive O(n³), sensitive to noise and outliers, cannot undo merges. ## Footnote * Does not scale to large datasets * Once clusters merged, cannot split * Greedy algorithm
162
What is the degree matrix D in graph Laplacian?
Diagonal matrix where D_ii = sum of weights of edges incident to node i. ## Footnote * For unweighted graph: D_ii = degree of node i * Used to construct Laplacian: L = D - W * Captures node connectivity
163
What is a connected component in a graph?
Maximal set of nodes where every node is reachable from every other node. ## Footnote * Number of zero eigenvalues of Laplacian = number of components * Important for spectral clustering * Isolated subgraphs
164
What is the normalized Laplacian?
L_sym = D^(-1/2) L D^(-1/2) = I - D^(-1/2) W D^(-1/2). ## Footnote * Normalizes for varying node degrees * Eigenvalues in [0, 2] * Better for graphs with variable degree
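A sketch constructing both L = D - W and L_sym from a weighted adjacency matrix, using plain nested lists for a tiny graph (assumes every node has nonzero degree):

```python
import math

def laplacians(W):
    """Return unnormalized L = D - W and symmetric normalized
    L_sym = D^(-1/2) L D^(-1/2) for adjacency matrix W."""
    n = len(W)
    d = [sum(row) for row in W]  # degrees D_ii = sum of incident edge weights
    L = [[(d[i] if i == j else 0.0) - W[i][j] for j in range(n)]
         for i in range(n)]
    L_sym = [[L[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
             for i in range(n)]
    return L, L_sym

# Path graph 0 - 1 - 2, unweighted.
W = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
L, L_sym = laplacians(W)
```

Each row of L sums to zero (the all-ones vector is the eigenvector for eigenvalue 0), and the diagonal of L_sym is 1, reflecting the degree normalization.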
165
What is locality in patient similarity?
Focus on local neighborhood when learning similarity metric rather than global distance. ## Footnote * Different regions of feature space may need different metrics * LSML learns local metrics * Better captures clinical context
166
What are homogeneous neighbors in LSML?
Patients with same outcome as target patient. ## Footnote * Should be close in learned metric * Used to define similarity * Part of optimization objective
167
What are heterogeneous neighbors in LSML?
Patients with different outcome than target patient. ## Footnote * Should be far in learned metric * Used to define dissimilarity * Maximize distance to these
168
What is the margin in LSML?
Difference between distances to heterogeneous and homogeneous neighbors. ## Footnote * Large margin = good separation * Maximize total margin across all patients * Similar to SVM margin concept
169
What is fault tolerance in distributed systems?
System's ability to continue operating despite failures of components. ## Footnote * Critical for large clusters (failures common) * Achieved via replication or recomputation * MapReduce: recomputation, HDFS: replication
170
What is HDFS?
Hadoop Distributed File System: distributed storage for MapReduce. ## Footnote * Stores large files across cluster * Replicates blocks (typically 3 copies) * Master-slave architecture (NameNode, DataNodes)
171
What is data locality in MapReduce?
Principle of moving computation to data rather than data to computation. ## Footnote * Reduces network traffic * Scheduler assigns tasks to nodes with data * Key to MapReduce performance
172
What is a combiner function?
Mini-reducer that runs locally on the mapper node for preliminary aggregation. ## Footnote * Reduces shuffle data * Must be associative and commutative * Often reuses the reducer code
173
What is partition tolerance in distributed systems?
System continues operating despite network partitions. ## Footnote * Part of CAP theorem * Network failures split cluster * Must choose consistency or availability
174
What is the CAP theorem?
Distributed system can provide at most 2 of 3: Consistency, Availability, Partition tolerance. ## Footnote * Consistency: all nodes see same data * Availability: every request gets response * Partition tolerance: works despite network splits
175
What are the six sections of CPT Category I?
Evaluation/Management, Anesthesia, Surgery, Radiology, Pathology/Laboratory, Medicine. ## Footnote * Most commonly used CPT codes * 5-digit numeric codes * Determines reimbursement
176
What is an NDC code?
National Drug Code: unique identifier for drugs including labeler, product, and package. ## Footnote * 10-digit 3-segment code * Format: labeler-product-package * Maintained by FDA
177
What is ICD-9-CM vs ICD-9-PCS?
ICD-9-CM: Clinical Modification for diagnoses. ICD-9-PCS: Procedure Coding System for inpatient procedures. ## Footnote * CM: outpatient and diagnosis coding * PCS: hospital inpatient procedures only * Different code structures
178
What is the ICD-9 supplementary classification?
V codes: factors influencing health status. E codes: external causes of injury. ## Footnote * V codes: 'V01-V91' (e.g., vaccination) * E codes: 'E800-E999' (e.g., accident cause) * Supplement main diagnosis codes
179
What makes a good clustering result?
High intra-cluster similarity (points within cluster similar) and low inter-cluster similarity (points in different clusters dissimilar). ## Footnote * Compact clusters * Well-separated clusters * Matches domain expectations
180
What is within-cluster sum of squares (WCSS)?
Sum of squared distances from points to their cluster centroids. ## Footnote * Lower WCSS = tighter clusters * Used in elbow method * Minimized by K-means algorithm
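WCSS is a short computation given points, centroids, and a cluster assignment; a sketch (data illustrative):

```python
import math

def wcss(points, centroids, assignment):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[k]) ** 2
               for p, k in zip(points, assignment))

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids = [(0, 0.5), (10, 10.5)]
assignment = [0, 0, 1, 1]
total = wcss(points, centroids, assignment)  # 4 points * 0.5^2 = 1.0
```

For the elbow method you would compute this for K = 1, 2, 3, ... and look for the K where the decrease in WCSS levels off.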
181
What is the replication factor in HDFS?
Number of copies of each data block stored across cluster (default 3). ## Footnote * Provides fault tolerance * Trade-off: reliability vs storage * Configurable per file
182
What is a NameNode in HDFS?
Master server managing file system namespace and regulating access to files. ## Footnote * Stores metadata (file names, locations) * Single point of failure (uses backup) * Coordinates DataNode operations
183
What is a DataNode in HDFS?
Worker node storing actual data blocks and serving read/write requests. ## Footnote * Multiple DataNodes per cluster * Report to NameNode via heartbeats * Store and retrieve blocks
184
What is the InputFormat in MapReduce?
Defines how to split input data and create input key-value pairs for mappers. ## Footnote * TextInputFormat: line-by-line text * KeyValueInputFormat: tab-separated pairs * SequenceFileInputFormat: binary format
185
What is the OutputFormat in MapReduce?
Defines how to write reducer output to files. ## Footnote * TextOutputFormat: text files * SequenceFileOutputFormat: binary * NullOutputFormat: no output
186
What is speculative execution in MapReduce?
Launching backup copies of slow tasks to reduce job completion time. ## Footnote * Detects stragglers (slow tasks) * Runs duplicate task on different node * Uses result from first to finish
187
What is the difference between narrow and wide transformations in Spark?
Narrow: each parent partition used by at most one child partition. Wide: multiple child partitions depend on parent. ## Footnote * Narrow: map, filter (no shuffle) * Wide: groupBy, join (requires shuffle) * Narrow faster, can pipeline
188
What is a DAG in Spark?
Directed Acyclic Graph: representation of the computation's lineage, which the scheduler splits into stages and tasks. ## Footnote * Nodes = RDDs * Edges = transformations (dependencies) * No cycles (acyclic) * Optimized before execution
189
What is the Catalyst optimizer in Spark?
Query optimizer for Spark SQL and DataFrames. ## Footnote * Logical optimization: predicate pushdown, constant folding * Physical optimization: join reordering * Code generation
190
What is a stage in Spark?
Set of tasks that can be executed in parallel without shuffle. ## Footnote * Bounded by shuffle operations * Wide transformations create stage boundaries * Tasks within stage can pipeline
191
What is a task in Spark?
Unit of work sent to executor: applies transformations to one partition. ## Footnote * One task per partition per stage * Run in parallel across cluster * Result sent back to driver
192
What is an executor in Spark?
JVM process on worker node executing tasks and caching data. ## Footnote * Multiple executors per worker * Each has own memory and CPU cores * Runs for duration of application
193
What is the driver program in Spark?
Main program running SparkContext, coordinating execution. ## Footnote * Converts user program to tasks * Schedules tasks on executors * Collects results
194
What is a SparkContext?
Entry point to Spark functionality, coordinates execution on cluster. ## Footnote * Creates RDDs * Accesses Spark services * One per application
195
What is partitioning in Spark?
Dividing RDD into partitions distributed across cluster. ## Footnote * Default: based on input source * Can repartition or coalesce * Affects parallelism and performance
196
What is coalesce vs repartition in Spark?
coalesce: reduces partitions without shuffle. repartition: changes partitions with full shuffle. ## Footnote * coalesce: efficient for reducing * repartition: for increasing or rebalancing * coalesce can result in imbalance
197
What is persist vs cache in Spark?
cache: stores in memory (default). persist: configurable storage level (memory/disk/both). ## Footnote * cache() = persist(MEMORY_ONLY) * persist can specify MEMORY_AND_DISK, etc. * Improves performance for reused RDDs
198
What is an accumulator in Spark?
Variable that can only be added to, used for counters and sums. ## Footnote * Only the driver can read the value * Workers can only add * Updates in actions are applied exactly once; updates in transformations may repeat on task retries
199
What is a shuffle in Spark?
Data redistribution across partitions to group by key. ## Footnote * Expensive operation (network + disk I/O) * Triggered by: groupByKey, reduceByKey, join * Creates stage boundary
200
What is the difference between reduceByKey and groupByKey?
reduceByKey: aggregates locally before shuffle. groupByKey: shuffles all values first. ## Footnote * reduceByKey: more efficient (less network) * groupByKey: transfers all data * Use reduceByKey when possible