Final Exam Flashcards

(200 cards)

1
Q

What is the key difference between prospective and retrospective studies?

A

Prospective studies collect data going forward in time, while retrospective studies use historical data already collected.

  • Retrospective: cheaper, faster, more noise
  • Prospective: expensive, time-consuming, less noise
  • Retrospective common for large EHR datasets

2
Q

What are the three main healthcare applications covered in this course?

A

Predictive Modeling, Computational Phenotyping, and Patient Similarity.

  • Predictive Modeling: predict future outcomes
  • Phenotyping: extract disease patterns from data
  • Patient Similarity: find similar patients for treatment

3
Q

What is a cohort study?

A

Selects patients exposed to a risk factor and follows them to observe outcomes.

  • Example: all HF patients discharged from hospital
  • Define inclusion/exclusion criteria
  • Follows natural disease progression

4
Q

What is a case-control study?

A

Matches cases (positive outcome) with controls (negative outcome) based on specific criteria like age, gender, clinic.

  • Useful when disease is rare
  • Requires careful matching criteria
  • Helps balance dataset

5
Q

What is the observation window in predictive modeling?

A

The historical time period before the index date used to extract features.

  • Too short: insufficient data
  • Too long: irrelevant old data
  • Typical: 6-12 months

6
Q

What is the prediction window?

A

The future time period after the index date where we predict the outcome.

  • Longer window: easier prediction, less actionable
  • Shorter window: harder prediction, more actionable
  • Trade-off between accuracy and utility

7
Q

What is the index date?

A

The reference point in time from which predictions are made.

  • Examples: admission date, diagnosis date, discharge date
  • Separates observation window from prediction window
  • Must be consistently defined

8
Q

What are common feature types in clinical prediction?

A

Demographics, diagnoses, medications, lab results, vitals, procedures.

  • Demographics: age, gender, race
  • Diagnoses: ICD codes
  • Medications: drug prescriptions

9
Q

What is feature selection and why is it important?

A

Choosing relevant features to include in the model to improve performance and interpretability.

  • Reduces dimensionality
  • Removes irrelevant/redundant features
  • Improves model generalization

10
Q

Define True Positive Rate (TPR).

A

TPR = TP / (TP + FN) = True Positives / Condition Positive. Also called Sensitivity or Recall.

  • Measures: what fraction of sick patients are identified
  • High TPR: catches most disease cases
  • Important for screening tests

11
Q

Define False Positive Rate (FPR).

A

FPR = FP / (FP + TN) = False Positives / Condition Negative.

  • Measures: what fraction of healthy patients are misclassified
  • Low FPR desired
  • Trade-off with TPR

12
Q

Define Positive Predictive Value (PPV).

A

PPV = TP / (TP + FP) = True Positives / Prediction Outcome Positive. Also called Precision.

  • Measures: what fraction of positive predictions are correct
  • Depends on prevalence
  • Important for confirmation tests

13
Q

Define Specificity.

A

Specificity = TN / (TN + FP) = True Negatives / Condition Negative. Also called True Negative Rate.

  • Measures: what fraction of healthy correctly identified
  • High specificity: few false alarms
  • Important for confirmatory tests

14
Q

What does the F1 score measure?

A

F1 = 2 × (Precision × Recall) / (Precision + Recall). Harmonic mean of precision and recall.

  • Ranges from 0 to 1
  • Balances precision and recall
  • Better than accuracy for imbalanced data

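The metric definitions on the cards above are easy to sanity-check in code. A minimal sketch with made-up confusion-matrix counts (the numbers are illustrative, not from the course):

```python
# Derive the card metrics from a hypothetical 2x2 confusion matrix.
tp, fp, fn, tn = 80, 10, 20, 890  # assumed counts for illustration

tpr = tp / (tp + fn)              # sensitivity / recall
fpr = fp / (fp + tn)              # false positive rate
ppv = tp / (tp + fp)              # precision
specificity = tn / (tn + fp)
f1 = 2 * ppv * tpr / (ppv + tpr)  # harmonic mean of precision and recall
prevalence = (tp + fn) / (tp + fp + fn + tn)

print(tpr, round(ppv, 3), round(f1, 3), prevalence)
```

Note how prevalence is only 0.1 here: even with high specificity, rare conditions drag PPV down, which is exactly the point of card 17.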
15
Q

What is the ROC curve?

A

Plot of True Positive Rate vs False Positive Rate at different classification thresholds.

  • Each point = different threshold
  • AUC measures overall performance
  • AUC = 0.5: random, AUC = 1.0: perfect

16
Q

What is the confusion matrix?

A

2×2 table showing predicted vs actual labels: TP, FP, FN, TN.

  • Diagonal: correct predictions
  • Off-diagonal: errors
  • All metrics derive from it

17
Q

What is prevalence in classification?

A

Prevalence = Condition Positive / Total Population. Fraction of population with disease.

  • Affects PPV interpretation
  • Low prevalence → low PPV even with high specificity
  • Important for understanding dataset

18
Q

Why is accuracy a poor metric for imbalanced datasets?

A

Can achieve high accuracy by always predicting majority class, missing minority class entirely.

  • Example: 95% healthy → predict all healthy = 95% accuracy
  • Better: F1, AUROC, per-class metrics
  • Use stratified sampling

19
Q

What is Mean Absolute Error (MAE)?

A

MAE = (1/n) × Σ|y_i - ŷ_i|. Average absolute difference between predicted and actual values.

  • For regression problems
  • Same units as target variable
  • Less sensitive to outliers than MSE

20
Q

What is Mean Squared Error (MSE)?

A

MSE = (1/n) × Σ(y_i - ŷ_i)². Average squared difference between predicted and actual values.

  • For regression problems
  • Penalizes large errors more
  • Not in original units (squared)

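To make the MAE/MSE contrast concrete, both metrics on the same made-up predictions — the single large error of 2 dominates MSE but not MAE:

```python
# MAE and MSE for a small regression example (numbers made up).
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

n = len(y_true)
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n   # avg |error|
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n  # avg squared error

print(mae, mse)  # → 0.875 1.3125
```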
21
Q

What is R² (R-squared)?

A

Coefficient of determination: proportion of variance in target explained by model. R² = 1 - SS_res/SS_tot.

  • Usually 0 to 1 (can be negative when the model fits worse than the mean)
  • R² = 1: perfect fit
  • R² = 0: model no better than mean

22
Q

What is Gradient Descent?

A

Iterative optimization algorithm that updates parameters in direction of negative gradient to minimize loss.

  • Update rule: w ← w - α∇L(w)
  • α = learning rate
  • Uses entire dataset per iteration

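The update rule w ← w - α∇L(w) can be watched converging on a toy one-weight fit. A minimal sketch (data and learning rate are my own choices) fitting y ≈ w·x with MSE loss, where the data is generated from w = 2:

```python
# Full-batch gradient descent on MSE for a 1-D linear fit y ≈ w*x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with w_true = 2

w, alpha = 0.0, 0.05  # initial weight, learning rate
for _ in range(200):
    # gradient of (1/n) * sum (w*x - y)^2 with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= alpha * grad  # update rule: w ← w - α∇L(w)

print(round(w, 4))  # → 2.0
```

With a much larger α the same loop diverges, which is the learning-rate trade-off on card 127.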
23
Q

What is Stochastic Gradient Descent (SGD)?

A

Gradient descent using one sample (or mini-batch) at a time instead of entire dataset.

  • Faster than full GD
  • Noisier updates
  • Better for large datasets

24
Q

What is the bias-variance tradeoff?

A

Bias = error from model assumptions. Variance = error from sensitivity to training data. Complex models: low bias, high variance.

  • Total error = bias² + variance + noise
  • Simple models: high bias (underfitting)
  • Complex models: high variance (overfitting)

25
What is an ensemble method?
Combines multiple models to improve performance and reduce overfitting.
  • Diversity among models is key
  • Two types: bagging and boosting
  • Often outperforms single models
26
What is bagging?
Bootstrap Aggregating: trains multiple models on bootstrap samples, averages predictions.
  • Models trained in parallel
  • Reduces variance
  • Example: Random Forest
27
What is Random Forest?
Ensemble of decision trees using bagging plus random feature selection at each split.
  • Each tree: bootstrap sample + random features
  • Predictions averaged or voted
  • Reduces correlation between trees
28
What is boosting?
Sequential ensemble where each model focuses on errors of previous models by reweighting samples.
  • Models trained sequentially
  • Reduces bias
  • Examples: AdaBoost, XGBoost, Gradient Boosting
29
Compare bagging vs boosting.
Bagging: parallel, simple average, reduces variance. Boosting: sequential, weighted average, reduces bias.
  • Bagging: less sensitive to noise
  • Boosting: better accuracy but can overfit
  • Bagging: easy to parallelize
30
What is MapReduce?
Programming model for processing large datasets by dividing work into Map and Reduce phases.
  • Map: transform input to (key, value) pairs
  • Reduce: aggregate values for each key
  • Shuffle: group by key between phases
31
What happens in the Map phase?
Processes input records independently, emitting (key, value) pairs.
  • Runs in parallel across nodes
  • No communication between mappers
  • Example: emit (disease, 1) for each diagnosis
32
What happens in the Reduce phase?
Aggregates all values for each key.
  • Receives (key, [list of values])
  • Emits (key, aggregated_value)
  • Example: sum all counts for each disease
33
What is the Shuffle phase in MapReduce?
Groups all values by key and routes them to appropriate reducers between Map and Reduce.
  • Network-intensive operation
  • Sorts and partitions data
  • Ensures all values for a key go to same reducer
34
How does MapReduce achieve fault tolerance?
Master tracks task completion, re-executes failed tasks on different nodes.
  • Worker failure detected via heartbeats
  • Map tasks: re-execute on different worker
  • Deterministic execution enables recomputation
35
Why is MapReduce good for linear regression?
Linear regression has closed-form solution using aggregation statistics: θ = (X^T X)^-1 X^T Y.
  • Can aggregate X^T X and X^T Y across nodes
  • Single MapReduce pass
  • No iteration needed
36
Why is MapReduce poor for logistic regression?
Logistic regression requires iterative gradient descent with multiple passes over data.
  • Each iteration: full MapReduce job
  • Writes to disk between iterations
  • Disk I/O dominates cost
37
What are the limitations of MapReduce?
Inefficient for iterative algorithms, acyclic data flow, disk I/O between jobs.
  • Must write to HDFS after each job
  • Poor for ML (10s-100s iterations)
  • Poor for interactive queries
38
What is K-means clustering?
Iteratively assigns points to nearest centroid, then updates centroids as cluster means.
  • Algorithm: initialize K centroids, assign, update, repeat
  • Converges to local optimum
  • Assumes spherical clusters
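The assign/update loop fits in a few lines. A minimal 1-D sketch (data, seed, and iteration count are my own; real k-means would use multi-dimensional Euclidean distance):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 1-D points: assign to nearest centroid, update means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # assignment step
            idx = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        for i, c in enumerate(clusters):                   # update step
            if c:
                centroids[i] = sum(c) / len(c)
    return sorted(centroids)

# Two well-separated toy clusters around 1 and 10.
data = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
print([round(c, 3) for c in kmeans(data, 2)])  # → [1.0, 10.0]
```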
39
What is the computational complexity of K-means?
O(i × n × K × d) where i=iterations, n=points, K=clusters, d=dimensions.
  • Linear in n, K, d
  • Number of iterations varies
  • Typically converges quickly
40
What is hierarchical clustering?
Builds a tree (dendrogram) of clusters by iteratively merging closest clusters.
  • Agglomerative: bottom-up merging
  • No need to specify K upfront
  • Provides cluster hierarchy
41
What is a Gaussian Mixture Model (GMM)?
Probabilistic clustering assuming data comes from a mixture of Gaussian distributions.
  • Soft clustering: probabilities for each cluster
  • Uses EM algorithm
  • More flexible than K-means
42
What is the EM algorithm?
Expectation-Maximization: iterative algorithm alternating E-step (compute expectations) and M-step (maximize parameters).
  • E-step: compute cluster membership probabilities
  • M-step: update Gaussian parameters
  • Converges to local optimum
43
What is Mini-Batch K-means?
K-means variant using small random batches instead of the full dataset.
  • Complexity: O(t × b × K × d), b << n
  • Much faster than standard K-means
  • Slight accuracy loss
44
What is DBSCAN?
Density-Based Spatial Clustering: finds clusters as high-density regions separated by low-density regions.
  • Parameters: ε (radius), MinPts (minimum density)
  • Finds arbitrary-shaped clusters
  • Identifies noise points
45
What are core points in DBSCAN?
Points with ≥ MinPts neighbors within ε radius.
  • Define high-density regions
  • Form cluster centers
  • Border points connect to cores
46
What are border points in DBSCAN?
Points within ε of a core point but not core themselves.
  • On cluster edges
  • Fewer than MinPts neighbors
  • Belong to one cluster
47
What are noise points in DBSCAN?
Points not within ε of any core point.
  • Low-density regions
  • Not assigned to any cluster
  • DBSCAN advantage: identifies outliers
48
What is the Rand Index?
RI = (a+b) / (total pairs). Measures clustering agreement with ground truth.
  • a: pairs in same cluster in both
  • b: pairs in different clusters in both
  • Ranges 0 to 1 (higher better)
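The pair-counting definition translates directly into code. A sketch over a hypothetical 4-point clustering (labels are my own example):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """RI = (a + b) / total pairs: fraction of point pairs on which the two
    clusterings agree (same-same pairs plus different-different pairs)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:   # both say "same" or both say "different"
            agree += 1
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 1]))           # identical → 1.0
print(round(rand_index([0, 0, 1, 1], [0, 1, 0, 1]), 3))  # → 0.333
```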
49
What is Mutual Information for clustering?
Measures shared information between clustering and ground truth. MI(X,Y) = ΣΣ p(x,y) log[p(x,y)/(p(x)p(y))].
  • Normalized MI: MI / sqrt(H(X)H(Y))
  • Ranges 0 to 1
  • Requires ground truth
50
What is the Silhouette Coefficient?
s(x) = (b-a) / max(a,b) where a=avg distance to same cluster, b=avg distance to nearest other cluster.
  • Ranges -1 to 1 (higher better)
  • No ground truth needed
  • Computed per point, averaged
51
What is computational phenotyping?
Extracting structured disease phenotypes from raw clinical data.
  • Transforms EHR into disease labels
  • Methods: expert rules, classification, clustering
  • Enables GWAS, prediction, trials
52
What is GWAS?
Genome-Wide Association Study: identifies genetic variants (SNPs) associated with diseases.
  • Compares SNP frequencies: cases vs controls
  • Needs accurate phenotype labels
  • Requires large sample sizes
53
What are the two approaches to phenotyping?
Supervised (expert rules, classification) and Unsupervised (dimensionality reduction, tensor factorization).
  • Supervised: needs labeled data
  • Unsupervised: discovers phenotypes automatically
  • Rules: interpretable but manual effort
54
What is SVD (Singular Value Decomposition)?
Factorizes matrix X = UΣV^T where U, V are orthogonal and Σ is diagonal with singular values.
  • U: left singular vectors
  • V: right singular vectors
  • Σ: singular values (ordered largest to smallest)
55
What is PCA (Principal Component Analysis)?
Finds orthogonal directions of maximum variance. Equivalent to SVD on centered data.
  • PCs = UΣ (scores)
  • Loadings = V (feature weights)
  • First PC has highest variance
56
What is the sparsity problem with SVD?
SVD produces dense factors (all non-zero), making interpretation difficult.
  • Factors are linear combinations of all features
  • Hard to interpret in clinical context
  • CUR decomposition addresses this
57
What is CUR decomposition?
Approximates A ≈ CUR using actual columns (C) and rows (R) from the data.
  • C: selected columns
  • R: selected rows
  • U: small matrix connecting C and R
  • Preserves sparsity and interpretability
58
What is a tensor?
Multi-dimensional array generalizing matrices (2D) to 3+ dimensions.
  • Matrix: 2D (patients × diagnoses)
  • Tensor: 3D+ (patients × diagnoses × medications × time)
  • Captures higher-order interactions
59
What is CP decomposition?
Canonical Polyadic: factorizes tensor into sum of rank-1 tensors.
  • Each rank-1 component = one phenotype
  • Factor matrices for each dimension
  • Unsupervised phenotype discovery
60
What is the difference between pragmatic trials and RCTs?
RCTs: controlled, randomized, one intervention. Pragmatic: real-world, no randomization, multiple interventions.
  • RCT: efficacy in ideal conditions
  • Pragmatic: effectiveness in practice
  • RCT: expensive, slow, causal
61
What is patient similarity search?
Finding past patients similar to current patient to inform treatment decisions.
  • Enables precision medicine
  • Methods: distance metrics, graphs
  • Learn what worked for similar patients
62
What is LSML (Locally Supervised Metric Learning)?
Learns distance metric by maximizing margin between same-outcome and different-outcome neighbors.
  • Context-specific for prediction task
  • Optimizes: max distance to heterogeneous, min to homogeneous
  • Uses eigenvectors of H = L^e - L^o
63
What is ICD?
International Classification of Diseases: standardized diagnosis codes for billing and epidemiology.
  • ICD-9: ~14K codes, 3-5 digits
  • ICD-10: ~70K codes, 7 alphanumeric
  • US transitioned October 2015
64
What are key differences between ICD-9 and ICD-10?
ICD-9: shorter codes, less detail. ICD-10: longer codes, more specificity.
  • ICD-9: 3-5 characters
  • ICD-10: 7 alphanumeric characters
  • Complex mapping: 1-to-many possible
65
What is CPT?
Current Procedural Terminology: codes for medical procedures and services for billing.
  • Maintained by AMA
  • Determines physician payment
  • Category I: main procedures (6 sections)
66
What is LOINC?
Logical Observation Identifiers Names and Codes: standardizes lab tests and observations.
  • 6 attributes: component, property, timing, system, scale, method
  • Enables interoperability
  • Critical for lab data exchange
67
What is SNOMED-CT?
Systematized Nomenclature of Medicine: comprehensive medical terminology with concepts and relationships.
  • >300K concepts
  • Hierarchical IS-A relationships
  • 18 top-level hierarchies
68
What is the IS-A relationship in SNOMED?
Hierarchical relationship indicating one concept is a subtype of another.
  • Example: Arthritis IS-A Joint Finding
  • Creates directed acyclic graph
  • Enables reasoning and inference
69
What is UMLS?
Unified Medical Language System: integrates multiple medical terminologies.
  • 3 components: Metathesaurus, Semantic Network, Lexicon
  • ~1.5M concepts (CUIs)
  • Links codes across vocabularies
70
What are the three UMLS components?
Metathesaurus (concepts from all sources), Semantic Network (types and relationships), SPECIALIST Lexicon (linguistic info).
  • Metathesaurus: concept mapping
  • Semantic Network: 135 types, 54 relationships
  • Lexicon: 330K biomedical terms
71
What is PageRank?
Algorithm computing node importance based on incoming links from important nodes.
  • q = cA^T q + (1-c)/N × e
  • c = damping factor (0.85)
  • Iterative until convergence
72
How is PageRank computed?
Iteratively: each node distributes rank to neighbors, nodes sum contributions.
  • Start with uniform distribution
  • Each iteration: distribute rank / out-degree
  • Apply damping: c×sum + (1-c)/N
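The distribute-and-damp iteration on these two cards can be sketched directly. A toy 3-page graph of my own invention (every node here has at least one out-link, so dangling nodes are not handled):

```python
def pagerank(links, c=0.85, iters=50):
    """Power iteration for PageRank. links[u] = list of pages u links to."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}       # start uniform
    for _ in range(iters):
        new = {u: (1 - c) / n for u in nodes}     # teleport term (1-c)/N
        for u in nodes:
            share = c * rank[u] / len(links[u])   # distribute rank / out-degree
            for v in links[u]:
                new[v] += share
        rank = new
    return rank

# A↔B, both link to C, C links back to A.
r = pagerank({"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]})
print({k: round(v, 3) for k, v in r.items()})
```

A ends up ranked highest: it receives all of C's rank plus half of B's, illustrating that importance flows from important in-links, not just their count.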
73
What is spectral clustering?
Uses eigenvectors of graph Laplacian for clustering.
  • Build similarity graph
  • Compute Laplacian L = D - W
  • Use top-k eigenvectors for k-means
74
What is the Graph Laplacian?
L = D - W where D=degree matrix, W=adjacency matrix.
  • Unnormalized: L = D - W
  • Normalized: L_sym = I - D^(-1/2) W D^(-1/2)
  • Eigenvectors reveal cluster structure
75
What are the three types of similarity graphs?
ε-neighborhood, k-NN, fully connected.
  • ε-neighborhood: connect if distance < ε
  • k-NN: connect k nearest neighbors
  • Fully connected: Gaussian kernel weights
76
What is Apache Spark?
In-memory distributed computing framework for big data processing.
  • Faster than MapReduce (10-100x)
  • Keeps data in memory
  • Built on RDDs
77
What is an RDD?
Resilient Distributed Dataset: immutable distributed collection with fault tolerance via lineage.
  • Immutable: transformations create new RDDs
  • Lazy evaluation
  • Fault tolerance: recompute from lineage
78
What is RDD lineage?
DAG of transformations used to build an RDD, enabling fault tolerance.
  • Records sequence of operations
  • Recompute lost partitions
  • No replication needed
79
What are RDD transformations?
Operations creating new RDDs: map, filter, join, etc. Lazily evaluated.
  • Lazy: not executed until action
  • Build computation DAG
  • Examples: map, flatMap, filter, distinct
80
What are RDD actions?
Operations returning results: collect, count, reduce, etc. Trigger execution.
  • Execute DAG of transformations
  • Return to driver or save to storage
  • Examples: collect, count, save, reduce
81
What is the difference between map and flatMap?
map: one output per input. flatMap: 0, 1, or many outputs per input.
  • map: one-to-one transformation
  • flatMap: one-to-many, flattens results
  • flatMap useful for tokenization
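Spark itself is not needed to see the difference; plain Python list comprehensions show the same semantics (this is an analogy, not the Spark API):

```python
lines = ["heart failure", "diabetes"]

# map-like: exactly one output element per input element
mapped = [line.split() for line in lines]

# flatMap-like: split each line, then flatten — zero or more outputs per input
flat_mapped = [tok for line in lines for tok in line.split()]

print(mapped)       # → [['heart', 'failure'], ['diabetes']]
print(flat_mapped)  # → ['heart', 'failure', 'diabetes']
```

In Spark the same tokenization would be `rdd.map(lambda l: l.split())` versus `rdd.flatMap(lambda l: l.split())`.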
82
What are broadcast variables in Spark?
Read-only variables efficiently sent once to all worker nodes.
  • Avoid sending data with each task
  • Cached locally on each node
  • Use for lookup tables, parameters
83
Why is Spark better than MapReduce for iterative algorithms?
Spark caches data in memory, avoiding disk I/O between iterations.
  • MapReduce: read/write HDFS each iteration
  • Spark: keep data in memory
  • 10-100x speedup for ML
84
What is lazy evaluation in Spark?
Transformations recorded but not executed until action called.
  • Enables DAG optimization
  • Pipelines operations
  • Minimizes shuffles
85
How does Spark achieve fault tolerance?
Recomputes lost partitions using lineage instead of replication.
  • Lineage: DAG of operations
  • Replay transformations only for lost partition
  • No storage cost
86
What is cross-validation?
Technique for evaluating model on unseen data by splitting dataset multiple ways.
  • Estimates test performance
  • Types: leave-one-out, k-fold, randomized
  • Prevents overfitting
87
What is k-fold cross-validation?
Split data into k folds, train on k-1, test on 1, rotate test fold.
  • Typically k=5 or 10
  • Each sample used for test once
  • Average performance across folds
88
What is leave-one-out CV?
Train on n-1 samples, test on 1 sample, repeat for all samples.
  • Most accurate estimate
  • Computationally expensive
  • n iterations needed
89
What is the difference between training, validation, and test sets?
Training set: fits model parameters. Validation set: tunes hyperparameters and selects models. Test set: estimates final generalization performance.
  • Never tune on the test set
  • Test set used only once, at the end
  • Validation can be replaced by cross-validation
90
What is regularization?
Adds penalty to loss function to discourage complex models and reduce overfitting.
  • L1 (Lasso): penalty = λ|w|, sparsity
  • L2 (Ridge): penalty = λw², shrinkage
  • λ controls strength
91
What is L1 regularization?
Adds sum of absolute weights to loss: L + λΣ|w_i|. Drives some weights to zero.
  • Also called Lasso
  • Feature selection: zeroes out features
  • Produces sparse models
92
What is L2 regularization?
Adds sum of squared weights to loss: L + λΣw_i². Shrinks weights toward zero.
  • Also called Ridge
  • Shrinks all weights
  • Prefers many small weights
93
What is overfitting?
Model learns training data too well, including noise, failing to generalize.
  • High training accuracy, low test accuracy
  • Solutions: regularization, more data, simpler model
  • Cross-validation helps detect
94
What is underfitting?
Model too simple to capture patterns in data.
  • Low training and test accuracy
  • High bias
  • Solution: more complex model, more features
95
What is bootstrap sampling?
Sampling with replacement to create multiple training sets.
  • Sample n from n with replacement
  • ~63% unique samples per bootstrap
  • Used in bagging methods
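The "~63% unique" figure follows from P(included) = 1 - (1 - 1/n)^n → 1 - 1/e ≈ 0.632, and it can be checked empirically. A quick simulation (sample size, trial count, and seed are my own choices):

```python
import random

# Empirically check the "~63% unique samples per bootstrap" claim.
rng = random.Random(42)
n, trials = 1000, 200
fractions = []
for _ in range(trials):
    bootstrap = [rng.randrange(n) for _ in range(n)]  # sample with replacement
    fractions.append(len(set(bootstrap)) / n)          # fraction of unique ids

avg_unique = sum(fractions) / trials
print(round(avg_unique, 3))  # close to 0.632
```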
96
What is the elbow method?
Plot metric vs k, find 'elbow' where improvement slows.
  • For choosing k in k-means
  • WCSS decreases with k
  • Elbow = optimal k
97
What is the curse of dimensionality?
In high dimensions, data becomes sparse and distances become less meaningful.
  • Distances concentrate
  • Need exponentially more data
  • Solution: dimensionality reduction
98
What is feature engineering?
Creating new features from raw data to improve model performance.
  • Aggregate temporal data
  • Create interaction features
  • Domain knowledge critical
99
What is dimensionality reduction?
Projecting high-dimensional data to lower dimensions while preserving structure.
  • Reduces noise and overfitting
  • Improves computation
  • Methods: PCA, SVD, tensor factorization
100
What is a rank-1 tensor?
Tensor that can be written as outer product of vectors: a ⊗ b ⊗ c.
  • CP decomposition: sum of rank-1 tensors
  • Each represents one phenotype
  • Factor vectors for each dimension
101
What is a phenotype?
Group of patients sharing common observable characteristics (diagnoses, medications, labs).
  • Can represent disease subtypes
  • Extracted from EHR data
  • Used for cohorts and features
102
What is the Mahalanobis distance?
Distance accounting for correlations: d² = (x-y)^T Σ^(-1) (x-y).
  • Σ = covariance matrix
  • Accounts for feature correlations
  • LSML learns generalized version
103
What is hard clustering?
Each point assigned to exactly one cluster.
  • Example: k-means
  • Discrete assignments
  • Clear boundaries
104
What is soft clustering?
Each point has probability of belonging to each cluster.
  • Example: GMM
  • Probabilistic assignments
  • Overlapping clusters
105
What is the difference between ICD and CPT codes?
ICD codes describe diagnoses (what condition the patient has); CPT codes describe procedures and services (what was done for the patient).
  • ICD: diagnosis codes
  • CPT: procedure/service codes, maintained by AMA
  • Both central to billing
106
What is a Gaussian distribution?
Normal distribution: bell-shaped curve characterized by mean μ and variance σ².
  • PDF: (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))
  • 68-95-99.7 rule
  • Central to GMM
107
What is the difference between classification and regression?
Classification: predict categorical labels. Regression: predict continuous values.
  • Classification: discrete output
  • Regression: continuous output
  • Different metrics: accuracy vs MSE
108
What is the difference between supervised and unsupervised learning?
Supervised: has labeled training data. Unsupervised: no labels, find structure.
  • Supervised: classification, regression
  • Unsupervised: clustering, dimensionality reduction
  • Supervised: predict, Unsupervised: discover
109
What is the MapReduce word count example?
Map: emit (word, 1) for each word. Reduce: sum counts for each word.
  • Classic MapReduce example
  • Map: tokenize and emit
  • Reduce: aggregate counts
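The three phases can be simulated on one machine in a few lines (a single-process sketch of the data flow, not actual Hadoop):

```python
from collections import defaultdict

docs = ["heart failure", "failure to thrive", "heart attack"]

# Map: emit (word, 1) for each word
pairs = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group all values by key
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce: sum the counts for each key
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # → {'heart': 2, 'failure': 2, 'to': 1, 'thrive': 1, 'attack': 1}
```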
110
What does it mean for RDD transformations to be lazy?
Not executed until an action is called, allowing optimization.
  • Build DAG of operations
  • Optimizer can pipeline and minimize shuffles
  • Action triggers execution
111
What is the difference between union and intersection in RDDs?
union: combines all elements from both RDDs. intersection: only elements in both.
  • union: cheap (no shuffle)
  • intersection: expensive (requires shuffle)
  • Both return a new RDD
112
What is the reduce action in Spark?
Combines elements using an associative function: reduce((x,y) => x+y).
  • Must be associative and commutative
  • Parallel aggregation
  • Returns single value to driver
113
What is the collect action in Spark?
Returns all RDD elements to the driver as an array.
  • Brings data to driver
  • Can cause out-of-memory
  • Use only for small results
114
What is a DataFrame in Spark?
Distributed collection with schema, similar to a database table.
  • Higher-level than RDD
  • Optimized execution (Catalyst)
  • SQL-like operations
115
What is missing data and how to handle it?
Data not recorded or lost. Handle by imputation, deletion, or modeling.
  • MCAR: missing completely at random
  • MAR: missing at random
  • MNAR: missing not at random
116
What is imputation?
Filling in missing values with estimates (mean, median, model-based).
  • Mean/median imputation
  • K-NN imputation
  • Multiple imputation
117
What is data normalization?
Scaling features to a common range (e.g., 0-1 or z-score).
  • Min-max: scale to [0,1]
  • Z-score: mean=0, std=1
  • Important for distance-based methods
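Both scalings on the card fit in a few lines of standard-library Python (the values are a made-up feature, and population standard deviation is assumed for the z-score):

```python
import statistics

values = [50.0, 60.0, 70.0, 80.0, 100.0]  # hypothetical raw feature values

# Min-max: scale to [0, 1]
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Z-score: subtract the mean, divide by the (population) standard deviation
mu = statistics.mean(values)
sigma = statistics.pstdev(values)
zscores = [(v - mu) / sigma for v in values]

print(minmax)  # → [0.0, 0.2, 0.4, 0.6, 1.0]
print([round(z, 2) for z in zscores])
```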
118
What is one-hot encoding?
Converts categorical variables into binary vectors.
  • Each category → binary feature
  • Example: {red, blue, green} → [1,0,0], [0,1,0], [0,0,1]
  • Increases dimensionality
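A minimal encoder sketch (my own helper; here categories are ordered alphabetically, so the exact vectors differ from the card's example ordering):

```python
def one_hot(categories):
    """Map each category to a binary indicator vector (alphabetical order)."""
    vocab = sorted(set(categories))
    index = {c: i for i, c in enumerate(vocab)}
    return [[1 if index[c] == i else 0 for i in range(len(vocab))]
            for c in categories]

# vocab order: blue, green, red
print(one_hot(["red", "blue", "green"]))  # → [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
```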
119
What is an outlier?
Data point significantly different from others, may be error or interesting case.
  • Detection: statistical tests, visualization
  • May remove or investigate
  • DBSCAN identifies as noise
120
What is model validation?
Assessing model performance on data not used for training.
  • Test set or cross-validation
  • Prevents overfitting
  • Estimates generalization
121
What is hyperparameter tuning?
Finding optimal hyperparameters (e.g., learning rate, regularization strength).
  • Use validation set, not test
  • Methods: grid search, random search
  • Tune before final evaluation
122
What is grid search?
Systematic search over hyperparameter combinations.
  • Tests all combinations
  • Computationally expensive
  • Finds optimal in grid
123
What is random search?
Random sampling of hyperparameter combinations.
  • Often as good as grid search
  • More efficient
  • Better for high dimensions
124
What is early stopping?
Stop training when validation performance stops improving.
  • Prevents overfitting
  • Monitor validation loss
  • Save best model
125
What is batch normalization?
Normalizes layer inputs during training to stabilize learning.
  • Reduces internal covariate shift
  • Enables higher learning rates
  • Commonly used in deep learning
126
What is dropout?
Regularization: randomly drop neurons during training.
  • Rate: fraction of neurons to drop (e.g., 0.5)
  • Prevents co-adaptation
  • Ensemble effect
127
What is a learning rate?
Step size for parameter updates in gradient descent.
  • Too high: divergence
  • Too low: slow convergence
  • Typically 0.001-0.1
128
What is momentum in optimization?
Accumulates gradient history to accelerate convergence.
  • Adds fraction of previous update
  • Helps escape local minima
  • Smooths optimization path
129
What is the Adam optimizer?
Adaptive learning rate optimizer combining momentum and RMSprop.
  • Adapts learning rate per parameter
  • Popular default choice
  • Combines benefits of momentum and adaptive rates
130
What is transfer learning?
Using a pre-trained model on a new task.
  • Reuse learned features
  • Fine-tune on new data
  • Reduces training time and data needs
131
What is a decision tree?
Tree structure where each node tests a feature, leaves predict outcome.
  • Interpretable
  • Handles non-linear relationships
  • Prone to overfitting
132
What is pruning in decision trees?
Removing branches to reduce complexity and overfitting.
  • Pre-pruning: stop growing early
  • Post-pruning: remove after building
  • Improves generalization
133
What is information gain?
Reduction in entropy from splitting on a feature.
  • Used to select split in decision trees
  • Higher gain = better split
  • ID3 algorithm uses information gain
134
What is Gini impurity?
Measure of how often a randomly chosen element would be incorrectly labeled.
  • Gini = 1 - Σp_i²
  • Used in CART algorithm
  • Lower = purer node
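Both split criteria on these two cards are one-liners over class proportions. A sketch with made-up labels — a pure node scores 0 on both measures, a 50/50 node scores the maximum:

```python
from math import log2

def gini(labels):
    """Gini = 1 - Σ p_i² over class proportions."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    """Shannon entropy, the quantity that information gain is defined on."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

pure = ["sick"] * 4
mixed = ["sick", "sick", "well", "well"]
print(gini(pure), gini(mixed))
print(entropy(pure), entropy(mixed))
```

Information gain for a candidate split is then the parent's entropy minus the weighted average entropy of the children.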
135
What is logistic regression?
Linear model for classification using sigmoid function: P(y=1) = 1/(1+exp(-w^T x)).
  • Outputs probabilities [0,1]
  • Optimized via gradient descent
  • Linear decision boundary
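The prediction side of the card is just a dot product pushed through the sigmoid. A sketch with made-up weights (the feature names in the comment are hypothetical):

```python
from math import exp

def sigmoid(z):
    return 1 / (1 + exp(-z))

def predict_proba(weights, bias, features):
    """P(y=1|x) = sigmoid(w·x + b): a linear score squashed into [0, 1]."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Assumed weights for two features (e.g. scaled age, prior admissions).
w, b = [1.5, 0.8], -2.0
print(round(predict_proba(w, b, [1.0, 2.0]), 2))  # → 0.75
```

The decision boundary is where w·x + b = 0, i.e. P(y=1) = 0.5 — hence the "linear decision boundary" bullet.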
136
What is a support vector machine (SVM)?
Finds optimal hyperplane maximizing margin between classes.
  • Maximizes distance to nearest points
  • Kernel trick for non-linear boundaries
  • Effective in high dimensions
137
What is the kernel trick?
Implicitly maps data to higher dimensions without computing transformation.
  • Common kernels: linear, polynomial, RBF
  • Enables non-linear SVM
  • Computationally efficient
138
What is k-nearest neighbors (k-NN)?
Classifies based on majority vote of k nearest neighbors.
  • Non-parametric
  • No training phase
  • Slow prediction (distance to all points)
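The whole algorithm is sort-by-distance plus a vote. A 1-D sketch with a hypothetical labelled lab value (real use would need multi-dimensional distances and feature scaling, per card 148):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k nearest training points (1-D distance here)."""
    neighbors = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up (lab value, outcome) pairs.
train = [(1.0, "healthy"), (1.5, "healthy"), (2.0, "healthy"),
         (8.0, "sick"), (8.5, "sick"), (9.0, "sick")]
print(knn_predict(train, 1.2))  # → healthy
print(knn_predict(train, 8.7))  # → sick
```

Note there is no training step: all cost is paid at prediction time, matching the "slow prediction" bullet.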
139
What is Naive Bayes?
Probabilistic classifier using Bayes' theorem with independence assumption. ## Footnote * Assumes features independent given class * Fast training and prediction * Works well for text classification
140
What is stratified sampling?
Sampling that maintains class proportions from original data. ## Footnote * Important for imbalanced data * Ensures representative train/test splits * Used in stratified k-fold CV
141
What is class imbalance?
When one class has many more samples than others. ## Footnote * Common in healthcare (rare diseases) * Leads to biased models * Solutions: resampling, weighted loss, different metrics
142
What is oversampling?
Increasing minority class samples to balance dataset. ## Footnote * Duplicate minority samples * SMOTE: synthetic minority oversampling * Risk: overfitting minority class
143
What is undersampling?
Decreasing majority class samples to balance dataset. ## Footnote * Randomly remove majority samples * Risk: losing information * Faster training
144
What is SMOTE?
Synthetic Minority Over-sampling Technique: creates synthetic minority samples. ## Footnote * Interpolates between minority neighbors * Reduces overfitting vs duplication * Popular for imbalanced data
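The core SMOTE idea — interpolate between a minority point and one of its minority-class neighbors — can be sketched as follows. This is a simplified illustration, not the full published algorithm (no feature scaling, one synthetic point at a time; names and the fixed seed are illustrative):

```python
import random

def smote_sample(minority, k=2, rng=random.Random(0)):
    """Create one synthetic point between a minority sample and one of
    its k nearest minority neighbors (simplified sketch)."""
    base = rng.choice(minority)
    neighbors = sorted((p for p in minority if p is not base),
                       key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbor))

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
synthetic = smote_sample(minority)
```

Because the new point lies on a segment between real minority samples, it stays inside the minority region instead of exactly duplicating a sample.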
145
What is a precision-recall curve?
Plot of precision vs recall at different thresholds. ## Footnote * Alternative to ROC for imbalanced data * Area under curve (AUC-PR) measures performance * Focuses on positive class
146
What is the difference between L1 and L2 regularization?
L1: penalty on |w|, creates sparsity. L2: penalty on w², shrinks weights. ## Footnote * L1: feature selection (some weights → 0) * L2: all weights shrink * Elastic Net: combines both
147
What is Elastic Net?
Regularization combining L1 and L2: penalty = α|w| + β w². ## Footnote * α controls L1 strength * β controls L2 strength * Balances sparsity and shrinkage
148
What is feature scaling?
Transforming features to similar ranges. ## Footnote * Important for distance-based methods * Methods: normalization, standardization * Not needed for tree-based methods
149
What is the difference between normalization and standardization?
Normalization: scale to [0,1]. Standardization: mean=0, std=1 (z-score). ## Footnote * Normalization: (x-min)/(max-min) * Standardization: (x-μ)/σ * Standardization better with outliers
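Both formulas in code, side by side (population standard deviation used for illustration):

```python
import statistics

def min_max_normalize(xs):
    """Scale to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Z-score: (x - mean) / std."""
    mu, sigma = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

data = [2.0, 4.0, 6.0, 8.0]
norm = min_max_normalize(data)  # endpoints map to exactly 0 and 1
std = standardize(data)         # result has mean 0
```

A single extreme outlier drags `max` (and thus every normalized value) with it, while it shifts the mean and std more gently, which is the intuition behind "standardization better with outliers."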
150
What is the damping factor in PageRank?
Parameter c (typically 0.85) representing probability of following links vs random jump. ## Footnote * Formula: q = cA^T q + (1-c)/N × e * c=0.85: 85% follow links, 15% random jump * Prevents dead ends and spider traps
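The formula q = cA^T q + (1-c)/N × e maps directly onto power iteration; a sketch for a tiny unweighted graph with no dangling nodes (adjacency-list representation and names are illustrative):

```python
def pagerank(adj, c=0.85, iters=50):
    """Power iteration for q = c * A^T q + (1 - c)/N * e.

    adj[i] = list of nodes that node i links to."""
    n = len(adj)
    q = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - c) / n] * n          # random-jump term, shared by all nodes
        for i, outs in enumerate(adj):
            for j in outs:
                new[j] += c * q[i] / len(outs)  # i spreads its score over out-links
        q = new
    return q

# 0 -> 1 -> 2 -> 0: a cycle, so ranks converge to uniform 1/3 each.
ranks = pagerank([[1], [2], [0]])
```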
151
What are the three CPT categories?
Category I: main procedures (6 sections). Category II: performance measurement (optional). Category III: emerging procedures (temporary). ## Footnote * Category I: 5-digit codes for established procedures * Category II: 4 digits + F, quality metrics * Category III: 4 digits + T, new technologies
152
What is the purpose of the combiner in MapReduce?
Optional function that performs local aggregation on mapper output before shuffle. ## Footnote * Reduces network traffic * Similar to reducer but runs on mapper nodes * Example: local sum before global sum
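The "local sum before global sum" pattern can be sketched with word count, the canonical MapReduce example. This is a single-process simulation, not Hadoop code; note that for word count the combiner and reducer share the same logic:

```python
from collections import Counter

def mapper(line):
    """Emit (word, 1) for each word in a line."""
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    """Local per-key sum on the mapper node, shrinking shuffle traffic."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

def reducer(all_pairs):
    """Global per-key sum after the shuffle."""
    return combiner(all_pairs)  # word count: reducer logic = combiner logic

lines = ["heart heart failure", "heart attack"]
shuffled = [pair for line in lines for pair in combiner(mapper(line))]
result = dict(reducer(shuffled))
```

In the first line the combiner collapses two `("heart", 1)` pairs into one `("heart", 2)` before the shuffle — that is the entire point of the combiner.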
153
What are the six LOINC attributes?
Component, Property, Timing, System, Scale, Method. ## Footnote * Component: what measured (e.g., glucose) * Property: type (e.g., mass concentration) * System: specimen (e.g., serum)
154
What is the difference between SNOMED concepts and descriptions?
Concepts: unique clinical meanings. Descriptions: terms/synonyms for concepts. ## Footnote * Each concept has unique ID * Multiple descriptions per concept * Enables synonym matching
155
What is an attribute relationship in SNOMED?
Relationship connecting concepts from different hierarchies (e.g., Appendicitis 'associated morphology' Inflammation). ## Footnote * Non-hierarchical relationships * Provides semantic meaning * Complements IS-A relationships
156
What is PheKB?
Phenotype KnowledgeBase: repository of validated phenotyping algorithms. ## Footnote * Community-contributed phenotype definitions * Includes ICD codes, medications, labs * Enables phenotype replication
157
What is OMOP CDM?
Observational Medical Outcomes Partnership Common Data Model: standardized EHR database schema. ## Footnote * Standardizes structure and content * Enables multi-site studies * Includes standard vocabularies
158
What happens during the E-step of GMM?
Computes posterior probability γ_nk that each point n belongs to each cluster k. ## Footnote * γ_nk = π_k N(x_n|μ_k, Σ_k) / Σ_j π_j N(x_n|μ_j, Σ_j) * Soft assignment to clusters * Uses current parameter estimates
159
What happens during the M-step of GMM?
Updates Gaussian parameters π_k, μ_k, Σ_k using weighted samples (weights = γ_nk). ## Footnote * μ_k = weighted mean of points * Σ_k = weighted covariance * π_k = average membership
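The E-step and M-step of cards 158–159 can be sketched together as one EM iteration for a 1-D two-component mixture (kept 1-D so the covariance is a scalar variance; data and names are illustrative):

```python
import math

def gaussian(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_step(xs, pis, mus, vars_):
    """One EM iteration for a 1-D Gaussian mixture (illustrative sketch)."""
    K = len(pis)
    # E-step: responsibilities gamma[n][k] proportional to pi_k * N(x_n | mu_k, var_k)
    gamma = []
    for x in xs:
        w = [p * gaussian(x, m, v) for p, m, v in zip(pis, mus, vars_)]
        z = sum(w)
        gamma.append([wi / z for wi in w])
    # M-step: re-estimate parameters from the weighted samples
    nk = [sum(g[k] for g in gamma) for k in range(K)]
    mus = [sum(g[k] * x for g, x in zip(gamma, xs)) / nk[k] for k in range(K)]
    vars_ = [sum(g[k] * (x - mus[k]) ** 2 for g, x in zip(gamma, xs)) / nk[k]
             for k in range(K)]
    pis = [nk[k] / len(xs) for k in range(K)]
    return pis, mus, vars_

# Two well-separated groups; one iteration already lands means near 0.1 and 5.1.
xs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
pis, mus, vars_ = em_step(xs, [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
```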
160
What is the key advantage of DBSCAN over K-means?
Can find arbitrary-shaped clusters and identify outliers, doesn't require specifying K. ## Footnote * K-means: spherical clusters, all points assigned * DBSCAN: any shape, identifies noise * Better for spatial data
161
What are the limitations of hierarchical clustering?
Computationally expensive O(n³), sensitive to noise and outliers, cannot undo merges. ## Footnote * Does not scale to large datasets * Once clusters merged, cannot split * Greedy algorithm
162
What is the degree matrix D in graph Laplacian?
Diagonal matrix where D_ii = sum of weights of edges incident to node i. ## Footnote * For unweighted graph: D_ii = degree of node i * Used to construct Laplacian: L = D - W * Captures node connectivity
163
What is a connected component in a graph?
Maximal set of nodes where every node is reachable from every other node. ## Footnote * Number of zero eigenvalues of Laplacian = number of components * Important for spectral clustering * Isolated subgraphs
164
What is the normalized Laplacian?
L_sym = D^(-1/2) L D^(-1/2) = I - D^(-1/2) W D^(-1/2). ## Footnote * Normalizes for varying node degrees * Eigenvalues in [0, 2] * Better for graphs with variable degree
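A sketch constructing both L = D - W and L_sym from a weighted adjacency matrix, using plain nested lists for a tiny graph (assumes every node has nonzero degree):

```python
import math

def laplacians(W):
    """Return unnormalized L = D - W and symmetric normalized
    L_sym = D^(-1/2) L D^(-1/2) for adjacency matrix W."""
    n = len(W)
    d = [sum(row) for row in W]  # degrees D_ii = sum of incident edge weights
    L = [[(d[i] if i == j else 0.0) - W[i][j] for j in range(n)]
         for i in range(n)]
    L_sym = [[L[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
             for i in range(n)]
    return L, L_sym

# Path graph 0 - 1 - 2, unweighted.
W = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
L, L_sym = laplacians(W)
```

Each row of L sums to zero (the all-ones vector is the eigenvector for eigenvalue 0), and the diagonal of L_sym is 1, reflecting the degree normalization.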
165
What is locality in patient similarity?
Focus on local neighborhood when learning similarity metric rather than global distance. ## Footnote * Different regions of feature space may need different metrics * LSML learns local metrics * Better captures clinical context
166
What are homogeneous neighbors in LSML?
Patients with same outcome as target patient. ## Footnote * Should be close in learned metric * Used to define similarity * Part of optimization objective
167
What are heterogeneous neighbors in LSML?
Patients with different outcome than target patient. ## Footnote * Should be far in learned metric * Used to define dissimilarity * Maximize distance to these
168
What is the margin in LSML?
Difference between distances to heterogeneous and homogeneous neighbors. ## Footnote * Large margin = good separation * Maximize total margin across all patients * Similar to SVM margin concept
169
What is fault tolerance in distributed systems?
System's ability to continue operating despite failures of components. ## Footnote * Critical for large clusters (failures common) * Achieved via replication or recomputation * MapReduce: recomputation, HDFS: replication
170
What is HDFS?
Hadoop Distributed File System: distributed storage for MapReduce. ## Footnote * Stores large files across cluster * Replicates blocks (typically 3 copies) * Master-slave architecture (NameNode, DataNodes)
171
What is data locality in MapReduce?
Principle of moving computation to data rather than data to computation. ## Footnote * Reduces network traffic * Scheduler assigns tasks to nodes with data * Key to MapReduce performance
172
What is a combiner function?
Mini-reducer that runs locally on the mapper node for preliminary aggregation. ## Footnote * Reduces shuffle data * Must be associative and commutative * Often reuses the reducer code
173
What is partition tolerance in distributed systems?
System continues operating despite network partitions. ## Footnote * Part of CAP theorem * Network failures split cluster * Must choose consistency or availability
174
What is the CAP theorem?
Distributed system can provide at most 2 of 3: Consistency, Availability, Partition tolerance. ## Footnote * Consistency: all nodes see same data * Availability: every request gets response * Partition tolerance: works despite network splits
175
What are the six sections of CPT Category I?
Evaluation/Management, Anesthesia, Surgery, Radiology, Pathology/Laboratory, Medicine. ## Footnote * Most commonly used CPT codes * 5-digit numeric codes * Determines reimbursement
176
What is an NDC code?
National Drug Code: unique identifier for drugs including labeler, product, and package. ## Footnote * 10-digit 3-segment code * Format: labeler-product-package * Maintained by FDA
177
What is ICD-9-CM vs ICD-9-PCS?
ICD-9-CM: Clinical Modification for diagnoses. ICD-9-PCS: Procedure Coding System for inpatient procedures. ## Footnote * CM: outpatient and diagnosis coding * PCS: hospital inpatient procedures only * Different code structures
178
What is the ICD-9 supplementary classification?
V codes: factors influencing health status. E codes: external causes of injury. ## Footnote * V codes: 'V01-V91' (e.g., vaccination) * E codes: 'E800-E999' (e.g., accident cause) * Supplement main diagnosis codes
179
What makes a good clustering result?
High intra-cluster similarity (points within cluster similar) and low inter-cluster similarity (points in different clusters dissimilar). ## Footnote * Compact clusters * Well-separated clusters * Matches domain expectations
180
What is within-cluster sum of squares (WCSS)?
Sum of squared distances from points to their cluster centroids. ## Footnote * Lower WCSS = tighter clusters * Used in elbow method * Minimized by K-means algorithm
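WCSS is a short computation given points, centroids, and a cluster assignment; a sketch (data illustrative):

```python
import math

def wcss(points, centroids, assignment):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[k]) ** 2
               for p, k in zip(points, assignment))

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids = [(0, 0.5), (10, 10.5)]
assignment = [0, 0, 1, 1]
total = wcss(points, centroids, assignment)  # 4 points * 0.5^2 = 1.0
```

For the elbow method you would compute this for K = 1, 2, 3, ... and look for the K where the decrease in WCSS levels off.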
181
What is the replication factor in HDFS?
Number of copies of each data block stored across cluster (default 3). ## Footnote * Provides fault tolerance * Trade-off: reliability vs storage * Configurable per file
182
What is a NameNode in HDFS?
Master server managing file system namespace and regulating access to files. ## Footnote * Stores metadata (file names, locations) * Single point of failure (uses backup) * Coordinates DataNode operations
183
What is a DataNode in HDFS?
Worker node storing actual data blocks and serving read/write requests. ## Footnote * Multiple DataNodes per cluster * Report to NameNode via heartbeats * Store and retrieve blocks
184
What is the InputFormat in MapReduce?
Defines how to split input data and create input key-value pairs for mappers. ## Footnote * TextInputFormat: line-by-line text * KeyValueInputFormat: tab-separated pairs * SequenceFileInputFormat: binary format
185
What is the OutputFormat in MapReduce?
Defines how to write reducer output to files. ## Footnote * TextOutputFormat: text files * SequenceFileOutputFormat: binary * NullOutputFormat: no output
186
What is speculative execution in MapReduce?
Launching backup copies of slow tasks to reduce job completion time. ## Footnote * Detects stragglers (slow tasks) * Runs duplicate task on different node * Uses result from first to finish
187
What is the difference between narrow and wide transformations in Spark?
Narrow: each parent partition used by at most one child partition. Wide: multiple child partitions depend on parent. ## Footnote * Narrow: map, filter (no shuffle) * Wide: groupBy, join (requires shuffle) * Narrow faster, can pipeline
188
What is a DAG in Spark?
Directed Acyclic Graph: representation of the computation's lineage, which the scheduler splits into stages and tasks. ## Footnote * Nodes = RDDs * Edges = transformations (dependencies) * No cycles (acyclic) * Optimized before execution
189
What is the Catalyst optimizer in Spark?
Query optimizer for Spark SQL and DataFrames. ## Footnote * Logical optimization: predicate pushdown, constant folding * Physical optimization: join reordering * Code generation
190
What is a stage in Spark?
Set of tasks that can be executed in parallel without shuffle. ## Footnote * Bounded by shuffle operations * Wide transformations create stage boundaries * Tasks within stage can pipeline
191
What is a task in Spark?
Unit of work sent to executor: applies transformations to one partition. ## Footnote * One task per partition per stage * Run in parallel across cluster * Result sent back to driver
192
What is an executor in Spark?
JVM process on worker node executing tasks and caching data. ## Footnote * Multiple executors per worker * Each has own memory and CPU cores * Runs for duration of application
193
What is the driver program in Spark?
Main program running SparkContext, coordinating execution. ## Footnote * Converts user program to tasks * Schedules tasks on executors * Collects results
194
What is a SparkContext?
Entry point to Spark functionality, coordinates execution on cluster. ## Footnote * Creates RDDs * Accesses Spark services * One per application
195
What is partitioning in Spark?
Dividing RDD into partitions distributed across cluster. ## Footnote * Default: based on input source * Can repartition or coalesce * Affects parallelism and performance
196
What is coalesce vs repartition in Spark?
coalesce: reduces partitions without shuffle. repartition: changes partitions with full shuffle. ## Footnote * coalesce: efficient for reducing * repartition: for increasing or rebalancing * coalesce can result in imbalance
197
What is persist vs cache in Spark?
cache: stores in memory (default). persist: configurable storage level (memory/disk/both). ## Footnote * cache() = persist(MEMORY_ONLY) * persist can specify MEMORY_AND_DISK, etc. * Improves performance for reused RDDs
198
What is an accumulator in Spark?
Variable that can only be added to, used for counters and sums. ## Footnote * Only the driver can read the value * Workers can only add * Updates in actions are applied exactly once; updates in transformations may repeat on task retries
199
What is a shuffle in Spark?
Data redistribution across partitions to group by key. ## Footnote * Expensive operation (network + disk I/O) * Triggered by: groupByKey, reduceByKey, join * Creates stage boundary
200
What is the difference between reduceByKey and groupByKey?
reduceByKey: aggregates locally before shuffle. groupByKey: shuffles all values first. ## Footnote * reduceByKey: more efficient (less network) * groupByKey: transfers all data * Use reduceByKey when possible