Final_Exam Flashcards

(246 cards)

1
Q

What does reduceByKey do in Spark

A

It merges values per key with map side combine

  • More efficient than groupByKey
  • Reduces shuffle size
  • Common for aggregations in key value RDDs

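The semantics of reduceByKey can be modeled in a few lines of plain Python (this is an illustrative sketch, not the PySpark API; in Spark it would be `rdd.reduceByKey(add)`):

```python
from operator import add

def reduce_by_key(pairs, func):
    """Merge values per key, as reduceByKey does logically.
    Merging as pairs arrive mimics the map side combine."""
    merged = {}
    for k, v in pairs:
        merged[k] = func(merged[k], v) if k in merged else v
    return sorted(merged.items())

# Word-count style aggregation over (key, value) pairs
pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
print(reduce_by_key(pairs, add))  # [('a', 3), ('b', 1)]
```

Because values are merged locally before any grouping, far less data would need to cross the network than with groupByKey.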
2
Q

What is the purpose of flatMap in Spark

A

It expands one input element into zero or more outputs

  • Used for tokenization in text processing
  • Unlike map it flattens nested results
  • Produces variable length output per input

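A plain-Python model of flatMap's tokenization use case (illustrative sketch; in PySpark this would be `rdd.flatMap(lambda line: line.split())`):

```python
def flat_map(func, items):
    """Apply func to each element and flatten: zero or more outputs per input."""
    return [out for item in items for out in func(item)]

lines = ["to be or", "not to be"]
tokens = flat_map(str.split, lines)
print(tokens)  # ['to', 'be', 'or', 'not', 'to', 'be']
# a plain map(str.split, lines) would instead keep the nested lists
```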
3
Q

What does collect do in Spark

A

It returns the entire RDD to the driver

  • Dangerous on large datasets
  • Should be avoided in production
  • Often replaced with take or count

4
Q

What does take do in Spark

A

It returns the first N elements from the RDD

  • More memory safe than collect
  • Useful for debugging early data slices
  • Executes only until required partitions produce output

5
Q

How is a text file loaded into an RDD in Spark

A

By using sc.textFile

  • Reads file from HDFS or local FS
  • Produces RDD of strings one per line
  • Lazy until an action is executed

6
Q

What is a transformation in Spark

A

It defines a new RDD from an existing one lazily

  • Lazy evaluation builds lineage graph
  • Executes only when an action is called
  • Includes map filter flatMap distinct

7
Q

What is an action in Spark

A

It triggers execution of the DAG and returns a result

  • Examples include count take collect reduce
  • Forces materialization of transformations
  • Sends results back to driver or writes output

8
Q

What does repartition do in Spark

A

It reshuffles the data into a new number of partitions

  • Involves full shuffle which is expensive
  • Use coalesce for reducing partitions without full shuffle
  • Helps balance load across executors

9
Q

What does cache do in Spark

A

It stores the RDD in memory for faster reuse

  • Useful for iterative algorithms
  • Avoids recomputation of lineage
  • Storage level configurable with persist

10
Q

What is a shuffle in Spark

A

It is a data redistribution step between executors

  • Triggered by operations like join reduceByKey sortByKey
  • Expensive due to network IO and disk writes
  • Should be minimized for performance

11
Q

What is the purpose of broadcasting in Spark

A

It distributes read only data efficiently to all executors

  • Avoids repeated serialization
  • Useful for lookup tables
  • Reduces communication overhead

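The lookup-table use case can be sketched in plain Python (the table and records are invented examples; in PySpark the table would be wrapped with `sc.broadcast(...)` and read via `.value` on executors):

```python
# Small read-only lookup table shipped once to every executor in Spark
code_names = {"E11": "type 2 diabetes", "I10": "hypertension"}

def enrich(records, lookup):
    """Map-side join: each record is resolved locally, so the big side
    of the join is never shuffled."""
    return [(pid, lookup.get(code, "unknown")) for pid, code in records]

records = [(1, "I10"), (2, "E11"), (3, "Z99")]
print(enrich(records, code_names))
# [(1, 'hypertension'), (2, 'type 2 diabetes'), (3, 'unknown')]
```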
12
Q

What does mapValues do in Spark

A

It transforms only the values of key value pairs

  • Keys remain unchanged
  • Preserves partitioning structure
  • More efficient than full map on pair RDD

13
Q

What kind of join is performed by join in Spark on two pair RDDs

A

An inner join

  • Only keys present in both RDDs appear
  • Produces key and tuple of matching values
  • Causes shuffle unless partitioner matches

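The inner-join semantics can be modeled on plain lists of pairs (illustrative sketch; in PySpark this is simply `rdd1.join(rdd2)`):

```python
def inner_join(left, right):
    """Inner join on (key, value) pairs: only keys present in both sides
    appear, yielding (key, (left_value, right_value))."""
    right_by_key = {}
    for k, v in right:
        right_by_key.setdefault(k, []).append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in right_by_key.get(k, [])]

a = [("x", 1), ("y", 2)]
b = [("x", 10), ("z", 30)]
print(inner_join(a, b))  # [('x', (1, 10))] -- 'y' and 'z' are dropped
```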
14
Q

What is a broadcast join in Spark SQL

A

A join where a small table is broadcast to all executors

  • Avoids shuffle for small dimension tables
  • Triggered by broadcast hint or size threshold
  • Common in star schema joins

15
Q

What storage system does MapReduce depend on

A

HDFS distributed storage

  • Splits files into blocks
  • Replicates blocks across nodes
  • Enables data locality for mappers

16
Q

What does the mapper phase do in MapReduce

A

It emits intermediate key value pairs

  • Processes input splits independently
  • Prepares data for grouping by key
  • Often performs filtering or extraction

17
Q

What does the reducer phase do in MapReduce

A

It aggregates values for each key

  • Receives all values for a given key
  • Performs summation counting or custom logic
  • Produces final output

18
Q

What is the shuffle in MapReduce

A

It groups intermediate values by key between map and reduce

  • Includes sorting and network transfer
  • Often the bottleneck of the job
  • Critical to MapReduce efficiency

19
Q

Why is MapReduce slow for iterative algorithms

A

Because each iteration writes to disk

  • Logistic regression and k means require multiple passes
  • Spark solves this by caching in memory
  • MapReduce has high disk overhead

20
Q

What is HDFS block size significance

A

It defines how input is split for mappers

  • Typical block size is 128MB
  • Larger blocks reduce overhead
  • Determines data locality opportunities

21
Q

What is the key idea of PageRank

A

It assigns importance scores based on link structure

  • Uses random walk model
  • Iterative computation until convergence
  • Implemented using MapReduce or Spark

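The iterative computation can be sketched as a power iteration on a tiny graph (a minimal model assuming every node has at least one out-link, i.e. no dangling nodes; graph and names are invented):

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration for PageRank with damping factor d.
    links maps each node to the list of nodes it points to."""
    nodes = list(links)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iters):
        # each node splits its rank evenly among its out-links
        contribs = {node: 0.0 for node in nodes}
        for node, outs in links.items():
            for out in outs:
                contribs[out] += ranks[node] / len(outs)
        # random surfer: reset with probability 1 - d, follow links with d
        ranks = {node: (1 - d) / n + d * contribs[node] for node in nodes}
    return ranks

# c is pointed to by both a and b, so it ends up ranked highest
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(ranks)
```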
22
Q

What does adjacency list represent in graphs

A

It lists neighbors of each node

  • Enables efficient traversal
  • Foundation for BFS DFS PageRank
  • Memory efficient for sparse graphs

23
Q

What is BFS used for in graph analysis

A

It computes shortest paths in unweighted graphs

  • Layer by layer expansion
  • Common for social network analysis
  • Basis for connected components

24
Q

What is cosine similarity used for

A

It measures angle based similarity between vectors

  • Often used for patient similarity
  • Works well with sparse high dimensional data
  • Range is between minus one and one

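The definition above is short enough to compute directly (minimal sketch with invented vectors):

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||), in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal)
print(cosine_similarity([1, 2], [2, 4]))  # close to 1.0 (same direction)
```

Note that scaling a vector does not change the result, which is why it suits sparse count data where magnitudes vary widely.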
25
Q

What is Euclidean distance used for

A

It measures straight line distance between points

  • Standard metric in k means
  • Sensitive to scale
  • Requires normalization

26
Q

What does K in k means represent

A

The number of clusters

  • Must be chosen before training
  • Hard clustering assigns one cluster per point
  • Sensitive to initialization

27
Q

What is the k means objective function

A

Minimize within cluster sum of squares

  • Uses L2 distance
  • Non convex optimization
  • Lloyd algorithm alternates assign and update
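Lloyd's assign-and-update loop can be sketched on 1-D data (a minimal illustrative implementation; points and starting centroids are invented):

```python
def kmeans_1d(points, centroids, iters=10):
    """Lloyd's algorithm on 1-D data: alternate the assign step and the
    update step, reducing within-cluster sum of squares each pass."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            # assign step: nearest centroid by squared L2 distance
            nearest = min(range(len(centroids)),
                          key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # update step: each centroid becomes the mean of its assigned points
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers = kmeans_1d(points, centroids=[0.0, 10.0])
print(centers)  # converges near [1.0, 9.0]
```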
28
Q

What is a centroid in k means

A

It is the mean of points assigned to a cluster

  • Updated after assignment step
  • Represents cluster center
  • Sensitive to outliers

29
Q

What is the main weakness of k means

A

It assumes spherical clusters

  • Fails on non linear shapes like moons
  • Sensitive to initialization
  • Requires numeric data

30
Q

What does DBSCAN detect

A

Arbitrarily shaped clusters

  • Density based method
  • Identifies noise points
  • Requires eps and minPts parameters

31
Q

What is eps in DBSCAN

A

It defines neighborhood radius

  • Larger eps means broader clusters
  • Must be tuned with domain knowledge
  • Visualized using k distance plot

32
Q

What is minPts in DBSCAN

A

The minimum number of neighbors for a core point

  • Typical value is 4 or greater
  • Determines density threshold
  • Works with eps to define clusters

33
Q

What type of clustering does GMM perform

A

Soft clustering with probabilistic assignment

  • E step computes responsibilities
  • M step updates means covariances and weights
  • More flexible than k means due to covariance modeling

34
Q

What algorithm is used to fit GMM

A

Expectation maximization

  • Alternates between E and M steps
  • Maximizes likelihood
  • Converges to local optimum

35
Q

What is PCA used for

A

Dimensionality reduction via variance maximization

  • Uses eigen decomposition of covariance matrix
  • Projects data onto top principal components
  • Removes noise dimensions
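The eigen-decomposition route can be sketched with NumPy (assumed available; the correlated toy data is invented so the first component dominates):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: most variance lies along the y = x direction
x = rng.normal(size=200)
data = np.column_stack([x, x + 0.1 * rng.normal(size=200)])

centered = data - data.mean(axis=0)          # center before covariance
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending order
order = np.argsort(eigvals)[::-1]            # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()          # fraction of variance per PC
projected = centered @ eigvecs[:, :1]        # project onto top component
print(explained[0])  # close to 1: first PC captures nearly all variance
```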
36
Q

What does the first principal component capture

A

The direction of maximum variance

  • Orthogonal to other components
  • Represented by highest eigenvalue
  • Helps visualize high dimensional data

37
Q

What does t SNE accomplish

A

Non linear dimensionality reduction for visualization

  • Preserves local structure
  • Useful for clusters in high dimensional data
  • Not for global geometry

38
Q

What is the purpose of CUR decomposition

A

Low rank matrix approximation using actual rows and columns

  • More interpretable than SVD
  • Uses leverage scores to sample rows columns
  • Approximates original matrix

39
Q

What does ICD encode

A

Medical diagnoses

  • ICD10 widely used in billing
  • Hierarchical structure
  • Important for phenotyping

40
Q

What does CPT encode

A

Procedures performed on patients

  • Used for billing and resource tracking
  • Different from ICD diagnoses
  • Helps build feature vectors

41
Q

What does NDC encode

A

Drug products

  • Identifies manufacturer and formulation
  • Important for medication phenotyping
  • Used in pharmacy claims

42
Q

What does RxNorm provide

A

Normalized drug naming system

  • Maps brand and generic names
  • Provides RXCUI identifiers
  • Integrates with NDC

43
Q

What is UMLS

A

A metathesaurus integrating multiple ontologies

  • Contains CUIs for concepts
  • Maps across ICD SNOMED MeSH
  • Used for NLP and concept normalization

44
Q

What is SNOMED CT

A

A comprehensive clinical terminology

  • Contains granular clinical concepts
  • Supports hierarchical relationships
  • Used in EHR systems

45
Q

What is concept drift

A

Changes in data distribution over time

  • Streaming k means adapts using decay factor
  • Common in temporal clinical data
  • Requires online updating

46
Q

What does the streaming k means decay factor do

A

It controls weight of new data versus old data

  • High decay emphasizes recent data
  • Prevents outdated centroids
  • Supports adaptive clustering

47
Q

What is a phenotype in clinical data

A

A computable abstraction of disease state

  • Derived from diagnoses labs meds
  • Used for cohort selection
  • May use rule based or ML methods

48
Q

What is supervised phenotyping

A

Using labeled cases and controls to train models

  • Requires gold standard labels
  • Enables predictive performance metrics
  • Often uses logistic regression or random forest

49
Q

What is unsupervised phenotyping

A

Discovering latent patterns without labels

  • Methods include k means LDA tensor factorization
  • Useful for sub phenotype discovery
  • Sensitive to input feature design

50
Q

What is anchor phenotyping

A

Weak supervision using high precision features

  • Anchors derived from clinical heuristics
  • Trains generative model for labels
  • Bridges rule based and supervised phenotyping

51
Q

What is precision in classification

A

Fraction of predicted positives that are correct

  • TP divided by TP plus FP
  • Useful in imbalanced datasets
  • Measures reliability of positive predictions

52
Q

What is recall in classification

A

Fraction of actual positives that were retrieved

  • TP divided by TP plus FN
  • Also called sensitivity
  • Measures completeness of positive detection

53
Q

What is specificity

A

Fraction of actual negatives correctly identified

  • TN divided by TN plus FP
  • Complements sensitivity
  • Measures ability to avoid false positives

54
Q

What is F1 score

A

Harmonic mean of precision and recall

  • Balances two competing metrics
  • Useful when classes are imbalanced
  • Always <= arithmetic mean
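The four metric cards above reduce to a few lines over a confusion matrix (minimal sketch; the counts are invented):

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall (sensitivity), specificity, and F1
    computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, specificity, f1

p, r, s, f1 = classification_metrics(tp=8, fp=2, fn=4, tn=86)
print(p, r, s, f1)  # precision 0.8, recall 2/3, specificity 86/88, F1 8/11
```

Note the F1 here (8/11, about 0.727) is below the arithmetic mean of precision and recall (11/15, about 0.733), as the card states.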
55
Q

What does AUC measure

A

Ability of classifier to rank positives above negatives

  • Equivalent to probability positive ranked higher
  • Independent of threshold
  • Uses ROC curve area

56
Q

What is logistic regression used for

A

Binary classification via sigmoid transformation

  • Optimized using gradient descent
  • Outputs probability scores
  • Sensitive to feature scaling

57
Q

What is regularization for

A

Preventing overfitting by penalizing large weights

  • L1 yields sparsity
  • L2 yields shrinkage
  • Helps generalization

58
Q

What does bagging reduce

A

Variance of the model

  • Trains models on bootstrap samples
  • Averages predictions
  • Random forest uses bagging

59
Q

What does boosting reduce

A

Bias by sequentially correcting errors

  • Learners trained on residuals
  • Sensitive to noise
  • Includes AdaBoost gradient boosting

60
Q

What is random forest

A

An ensemble of decision trees trained on bootstrap samples

  • Adds feature randomness per split
  • Reduces variance
  • Robust to overfitting

61
Q

What is gradient descent

A

Optimization method moving opposite gradient direction

  • Requires differentiable loss
  • Step size controls convergence
  • Sensitive to feature scaling

62
Q

What is SGD

A

Gradient descent using one or few samples

  • Noisy updates
  • Faster on large data
  • Good for online learning

63
Q

What is the curse of dimensionality

A

Phenomenon where high dimension degrades learning

  • Distances become less meaningful
  • Requires dimensionality reduction
  • Impacts clustering and nearest neighbor

64
Q

What is cosine distance good for

A

Measuring similarity in sparse high dimensional vectors

  • Used in patient similarity
  • Invariant to magnitude
  • Good for bag of words

65
Q

What is mutual information used for

A

Measuring dependence between features and labels

  • Non linear metric
  • Useful for feature selection
  • Robust to distributional shape

66
Q

What is chi square test used for

A

Evaluating independence between categorical variables

  • Common in feature selection
  • Compares observed and expected counts
  • Assumes large sample size

67
Q

What is silhouette score

A

Measures cluster separation

  • Values from minus one to one
  • Higher means better defined clusters
  • Uses cohesion and separation

68
Q

What is elbow method

A

Choosing K by detecting diminishing returns in WCSS

  • Plot SSE versus K
  • Look for bend in curve
  • Heuristic for k means

69
Q

What is curse of large clusters in k means

A

Large clusters dominate centroid updates

  • Causes imbalance
  • May miss small dense clusters
  • Requires careful initialization

70
Q

What is slope of ROC curve at threshold

A

Ratio of densities of positive and negative scores

  • Derived from Neyman Pearson lemma
  • Relates to likelihood ratio
  • Used in optimal thresholding

71
Q

What does dimensionality reduction solve

A

Removes redundancy and noise

  • Helps visualization
  • Speeds up algorithms
  • Reduces overfitting risk

72
Q

What does eigenvalue represent in PCA

A

Variance explained by component

  • Sorted descending
  • Larger eigenvalue means more signal
  • Sum equals total variance

73
Q

What is a loading in PCA

A

Contribution of each feature to component

  • Elements of eigenvector
  • Helps interpret principal axes
  • Useful for feature interpretation

74
Q

What is a factor in NMF

A

Latent non negative component

  • Enables additive parts based representation
  • Good for interpretability
  • Used in text and EHR data

75
Q

What is adjacency matrix

A

Matrix representation of graph edges

  • Aij is 1 if edge exists
  • Supports matrix operations for graph algorithms
  • Used in spectral clustering

76
Q

What is Laplacian used for

A

Graph clustering and spectral methods

  • L equals D minus A
  • Eigenvectors give cluster structure
  • Used in normalized cuts

77
Q

What is assortativity in graphs

A

Preference for nodes to attach to similar nodes

  • High in social networks
  • Can be positive or negative
  • Measured by attribute correlation

78
Q

What is degree distribution

A

Distribution of node degrees in a graph

  • Scale free networks follow power law
  • Influences robustness
  • Affects spreading processes

79
Q

What is Markov chain assumption in PageRank

A

Random surfer moves with fixed probability to neighbors

  • Damping factor controls reset
  • Stationary distribution gives ranks
  • Computed by power iteration

80
Q

What is a pivot in CUR

A

Column or row selected based on leverage

  • Leverage scores from SVD
  • Ensures good reconstruction
  • CUR yields interpretable features

81
Q

What does leverage score indicate

A

Importance of row or column in SVD

  • High score means influential direction
  • Used for sampling in CUR
  • Helps reduce computation

82
Q

What is phenotype leakage

A

Using future information to label past outcomes

  • Creates label leakage
  • Inflates model accuracy falsely
  • Must restrict features by time

83
Q

What is medication count feature

A

Number of unique meds a patient receives

  • Proxy for disease severity
  • Often used in phenotyping
  • Requires mapping via NDC or RxNorm

84
Q

What is lab abnormality feature

A

Binary indicator of abnormal lab result

  • Useful in rule based phenotypes
  • Requires LOINC mapping
  • Must consider timing

85
Q

What is CPT grouping used for

A

Aggregating procedures into higher level categories

  • Helps reduce sparsity
  • Useful in cohort building
  • Mapped via healthcare ontologies

86
Q

What is ICD hierarchy used for

A

Grouping diagnoses at different specificity levels

  • Parent child relationships
  • Enables roll up features
  • Helps reduce dimensionality

87
Q

What does SNOMED hierarchy allow

A

Reasoning across clinical concepts

  • Supports ancestor queries
  • Enables feature expansion
  • More granular than ICD

88
Q

What does UMLS CUI represent

A

A normalized clinical concept identifier

  • Maps across vocabularies
  • Useful for NLP tasks
  • Stable canonical identifier

89
Q

What is co occurrence in EHR

A

Frequency of two codes appearing together

  • Basis for association mining
  • Helps create features for similarity
  • Often sparse and noisy

90
Q

What is temporal embedding

A

Representation of events with timing context

  • Captures trajectory of disease
  • Used in patient similarity models
  • Improves predictive performance

91
Q

What is Hamming distance

A

Count of mismatched binary positions

  • Used for comparing binary phenotypes
  • Simple and fast
  • Less effective for dense vectors

92
Q

What is Jaccard similarity

A

Intersection over union of sets

  • Good for sparse binary features
  • Range zero to one
  • Used for diagnosis overlap
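Intersection over union is a one-liner on Python sets (minimal sketch; the diagnosis codes are invented, and treating two empty sets as similarity 1.0 is a convention chosen here):

```python
def jaccard(a, b):
    """Intersection over union of two sets.
    Both-empty case returned as 1.0 by convention in this sketch."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Diagnosis-code overlap between two patients
p1 = {"E11", "I10", "N18"}
p2 = {"E11", "I10", "J45"}
print(jaccard(p1, p2))  # 0.5  (2 shared codes out of 4 distinct)
```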
93
Q

What is Manhattan distance

A

Sum of absolute differences

  • L1 metric
  • More robust to outliers
  • Used in clustering and KNN

94
Q

What is hierarchical clustering

A

Tree based merging or splitting of clusters

  • Dendrogram shows structure
  • Does not require K
  • Sensitive to linkage method

95
Q

What is average linkage

A

Mean distance between clusters

  • Balances complete and single linkage
  • Often produces smoother clusters
  • Used in agglomerative clustering

96
Q

What is single linkage

A

Minimum distance between clusters

  • Produces chaining effect
  • Sensitive to noise
  • Captures elongated clusters

97
Q

What is complete linkage

A

Maximum distance between clusters

  • Produces compact clusters
  • Sensitive to outliers
  • Less chaining than single linkage

98
Q

What is cluster purity

A

Fraction of points in cluster with majority label

  • Measures how well clusters match ground truth
  • Higher purity implies better performance
  • Does not penalize many small clusters

99
Q

What is Rand index

A

Measures agreement between two clusterings

  • Considers all pairwise decisions
  • Adjusted Rand corrects for chance
  • Range zero to one

100
Q

What is adjusted Rand index

A

Chance corrected Rand index

  • Zero means random
  • One means perfect match
  • Negative means worse than chance

101
Q

What is entropy of cluster

A

Measure of uncertainty within cluster labels

  • Lower is better
  • Zero means pure cluster
  • Used for evaluation

102
Q

What is mutual information of clustering

A

Measures shared information between assignments

  • Normalized MI helps compare across settings
  • Zero means independence
  • One means perfect match

103
Q

What is cross validation used for

A

Estimating generalization performance

  • Splits data into train and test folds
  • Avoids overfitting
  • Provides stable performance estimates

104
Q

What is train test leakage

A

Using information from test in training

  • Inflates metrics
  • Must isolate test set
  • Common pitfall in EHR

105
Q

What is A B testing

A

Comparing two variants experimentally

  • Random assignment
  • Measures causal effect
  • Uses statistical significance

106
Q

What is p value

A

Probability of observing result under null

  • Not probability null is true
  • Must be interpreted carefully
  • Sensitive to sample size

107
Q

What is confidence interval

A

Range of plausible values for parameter

  • Wider means more uncertainty
  • Depends on variance and sample size
  • Not 95 percent chance parameter lies inside

108
Q

What is odds ratio

A

Ratio of odds between two groups

  • OR > 1 means increased odds
  • Used in logistic regression
  • Common in association studies

109
Q

What is hazard ratio

A

Relative event rate in survival analysis

  • HR > 1 means greater risk
  • Based on Cox model
  • Time to event focused

110
Q

What is survival curve

A

Probability of surviving past time point

  • Kaplan Meier estimator
  • Step function
  • Handles censored data

111
Q

What is censoring

A

Incomplete observation of event time

  • Right censoring common
  • Occurs when study ends or patient lost
  • Handled by survival models

112
Q

What is log rank test

A

Tests equality of survival curves

  • Compares event timing distributions
  • Non parametric
  • Sensitive to proportional hazards

113
Q

What is proportional hazards assumption

A

Hazard ratios constant over time

  • Required for Cox model
  • Violations distort inference
  • Checked via Schoenfeld residuals

114
Q

What is distributed file system

A

Stores data across cluster nodes

  • Provides scalability
  • Handles replication
  • Example HDFS

115
Q

What is data locality

A

Executing computation near stored data

  • Reduces network IO
  • Boosts performance
  • Key idea in Hadoop

116
Q

What is executor in Spark

A

A worker process running tasks

  • Allocated by cluster manager
  • Holds cached RDDs
  • Executes transformations and actions

117
Q

What is driver in Spark

A

Main program controlling execution

  • Builds DAG
  • Schedules tasks
  • Collects results

118
Q

What is DAG in Spark

A

Directed acyclic graph of transformations

  • Represents lineage
  • Optimized before execution
  • Ensures fault tolerance

119
Q

What is narrow dependency

A

Parent RDD partitions map to child one to one

  • No shuffle required
  • Example map filter
  • Faster execution

120
Q

What is wide dependency

A

Child RDD depends on many parent partitions

  • Requires shuffle
  • Example reduceByKey join
  • More expensive

121
Q

What is accumulator in Spark

A

Shared write only variable for aggregation

  • Good for counters
  • Not for control flow
  • Updated by workers

122
Q

What is partitioner

A

Controls distribution of key value RDDs

  • Hash partitioner common
  • Affects join efficiency
  • Preserved by mapValues

123
Q

What is schema in DataFrame

A

Structured definition of columns

  • Enforces data types
  • Used for SQL queries
  • Validated at runtime

124
Q

What is catalyst optimizer

A

Optimizes DataFrame query plans

  • Performs rule based transformations
  • Pushes filters down
  • Rewrites logical plan

125
Q

What is tungsten

A

Spark execution engine using memory optimizations

  • Uses off heap storage
  • Improves serialization
  • Speeds up SQL workloads

126
Q

What is lazy evaluation

A

Delays execution until action

  • Reduces computation
  • Enables pipeline optimization
  • Core principle in Spark

127
Q

What is phenotype rule

A

Logical criteria to identify patients

  • Combines ICD CPT labs meds
  • High precision manually crafted
  • Used in rule based phenotyping

128
Q

What is feature vector in EHR

A

High dimensional representation of patient events

  • Sparse structure
  • Includes diagnoses procedures meds
  • Input to ML models

129
Q

What is temporal windowing

A

Restricting features to specific time period

  • Prevents future leakage
  • Aligns with clinical validity
  • Important in survival tasks

130
Q

What is embedder model

A

Transforms sparse codes into dense vectors

  • Learns latent representations
  • Used in patient similarity
  • Improves clustering

131
Q

What is phenotype portability issue

A

Rule created in one site does not transfer well

  • Coding patterns differ
  • Lab ranges differ
  • Requires site specific tuning

132
Q

What is SNOMED parent child relation

A

Hierarchical abstraction of concepts

  • Enables roll up of clinical codes
  • Supports semantic reasoning
  • Deeper hierarchy than ICD

133
Q

What is metathesaurus

A

Repository linking multiple vocabularies

  • Core of UMLS
  • Uses CUI as unified id
  • Resolves cross terminology mapping

134
Q

What is word embedding

A

Vector representation of tokens

  • Captures semantic similarity
  • Used for notes and codes
  • Learned via neural models

135
Q

What is bootstrap sample

A

Sample drawn with replacement from dataset

  • Size same as original
  • Used in bagging
  • Produces diverse trees

136
Q

What is impurity measure

A

Metric used to split decision trees

  • Includes Gini entropy
  • Lower impurity means better split
  • Guides tree growth

137
Q

What is out of bag error

A

Validation error from unused bootstrap samples

  • Internal validation method
  • Used in random forest
  • Avoids separate test set

138
Q

What is learning rate in boosting

A

Controls contribution of each weak learner

  • Lower rate improves stability
  • Requires more iterations
  • Key hyperparameter

139
Q

What is AdaBoost sensitivity

A

High sensitivity to noisy labels

  • Over weights misclassified points
  • Can overfit
  • Requires careful preprocessing

140
Q

What is gradient boosting

A

Sequentially fits models on residuals

  • Powerful ensemble method
  • Requires shrinkage regularization
  • Used in XGBoost LightGBM

141
Q

What is ROC curve

A

Plot of TPR versus FPR across thresholds

  • Shows tradeoffs
  • AUC summarizes performance
  • Useful for imbalanced classes

142
Q

What is calibration

A

Agreement between predicted and true probabilities

  • Reliability diagrams used
  • Important in clinical models
  • Affects decision making

143
Q

What is overfitting

A

Model performs well on train but poorly on test

  • Caused by high variance
  • Regularization and pruning help
  • Cross validation detects it

144
Q

What is underfitting

A

Model too simple to capture patterns

  • High bias
  • Increase model complexity
  • Add features

145
Q

What is confounding variable

A

Factor associated with both exposure and outcome

  • Biases observational associations
  • Must adjust via modeling
  • Common in EHR studies

146
Q

What is imputation

A

Replacing missing data with estimated values

  • Methods include mean median KNN
  • Affects model stability
  • MAR MCAR MNAR distinctions

147
Q

What is standardization

A

Scaling features to zero mean and unit variance

  • Improves optimization
  • Required for distance based models
  • Prevents feature dominance

148
Q

What is min max scaling

A

Transforms features to range zero to one

  • Sensitive to outliers
  • Useful for neural nets
  • Preserves shape

149
Q

What is softmax

A

Converts logits to probability distribution

  • Used in multi class classification
  • Differentiable
  • Normalizes exponentiated values

150
Q

What is cross entropy loss

A

Measures difference between predicted and true distribution

  • Computes negative log likelihood
  • Used in softmax classification
  • Sensitive to miscalibration
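The softmax and cross-entropy cards combine into a short sketch (minimal illustrative code; the logits are invented):

```python
import math

def softmax(logits):
    """Exponentiate and normalize; subtracting the max keeps exp from
    overflowing without changing the result."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Negative log likelihood of the true class."""
    return -math.log(probs[true_index])

probs = softmax([2.0, 1.0, 0.1])
print(probs)                     # sums to 1, largest for the largest logit
print(cross_entropy(probs, 0))   # small loss: class 0 is already most probable
```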
151
Q

What is regularization path

A

Trajectory of coefficients as penalty changes

  • Visualized in LASSO
  • Shows variable selection
  • Helps tuning lambda

152
Q

What is L1 penalty

A

Penalizes absolute weights

  • Produces sparse solutions
  • Good for feature selection
  • Hard thresholding effect

153
Q

What is L2 penalty

A

Penalizes squared weights

  • Produces smooth shrinkage
  • Prevents exploding weights
  • Improves generalization

154
Q

What is elastic net

A

Combines L1 and L2 penalties

  • Balances sparsity and stability
  • Useful for correlated features
  • Controlled by mixing parameter

155
Q

What is SGD advantage

A

Scales well with large data

  • Incremental updates
  • Enables online learning
  • Converges faster in practice

156
Q

What is batch gradient descent disadvantage

A

Requires full pass over data

  • Slow for large datasets
  • Memory intensive
  • Replaced by mini batch methods

157
Q

What is mini batch gradient descent

A

Updates using small random subset

  • Balances noise and stability
  • GPU efficient
  • Common in deep learning

158
Q

What is epoch

A

One full pass over training data

  • Used in iterative training
  • Multiple epochs for convergence
  • Count depends on model and data

159
Q

What is learning rate schedule

A

Adjusting learning rate over time

  • Includes decay warmup cosine schedules
  • Improves convergence
  • Prevents divergence

160
Q

What is early stopping

A

Halting training when validation stops improving

  • Prevents overfitting
  • Simple and effective
  • Requires validation set

161
Q

What is model drift

A

Performance degradation over time

  • Caused by data distribution shift
  • Requires monitoring
  • Common in streaming contexts

162
Q

What is reproducibility

A

Ability to replicate results

  • Requires fixed seeds
  • Data versioning
  • Transparent code

163
Q

What is hyperparameter tuning

A

Searching optimal model settings

  • Grid search random search Bayesian optimization
  • Impacts performance significantly
  • Requires validation scheme

164
Q

What is stratified sampling

A

Maintaining label proportions in splits

  • Important for imbalanced data
  • Reduces variance
  • Used in classification tasks

165
Q

What is label imbalance issue

A

One class dominates dataset

  • Affects threshold choice
  • Requires precision recall metrics
  • May require resampling

166
Q

What is SMOTE

A

Synthetic minority oversampling technique

  • Generates synthetic samples
  • Helps reduce imbalance
  • Beware of noise amplification

167
Q

What is class weighting

A

Adjusting loss to account for imbalance

  • Penalizes misclassification of minority class more
  • Built into many models
  • Alternative to resampling

168
Q

What is bag of words representation

A

Vector of token counts

  • Simple baseline for text
  • High dimensional and sparse
  • Used in clinical notes mining

169
Q

What is TF IDF

A

Term frequency inverse document frequency weighting

  • Reduces impact of common words
  • Highlights informative terms
  • Used in text classification
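The weighting can be sketched with raw term counts and idf = log(N / df), one common unsmoothed variant (minimal sketch; the tiny corpus is invented):

```python
import math

def tf_idf(docs):
    """TF-IDF over tokenized documents: weight = tf * log(N / df)."""
    n = len(docs)
    df = {}                          # document frequency per term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        tf = {t: doc.count(t) for t in set(doc)}
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["cough", "fever"], ["fever", "fever", "rash"], ["rash"]]
w = tf_idf(docs)
print(w[1])  # 'fever' counted twice but downweighted: it appears in 2 of 3 docs
```

A term appearing in every document gets weight zero, which is exactly the "reduces impact of common words" behavior the card describes.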
170
What is stop word removal
Removing common non informative words ## Footnote * Reduces noise * Language dependent * Standard NLP preprocessing
171
What is tokenization
Splitting text into tokens ## Footnote * Basis for NLP feature extraction * Can use whitespace or learned rules * Precursor to embeddings
172
What is lemmatization
Reducing words to dictionary form ## Footnote * More accurate than stemming * Language aware * Improves text consistency
173
What is stemming
Removing suffixes to approximate root form ## Footnote * Faster but cruder than lemmatization * Sometimes harms accuracy * Porter stemmer commonly used
174
What is EHR
Electronic health record system ## Footnote * Contains structured and unstructured data * Used for phenotyping and ML * Requires preprocessing
175
What is CPT modifier
Additional code describing procedural detail ## Footnote * Refines meaning of CPT * Affects billing interpretation * Appears in claims data
176
What is diagnosis code position
Placement of code within visit record ## Footnote * Primary code indicates main reason * Secondary codes add context * Used for feature weighting
177
What is readmission definition
Returning to hospital within short time window ## Footnote * Often 30 days * Quality metric * Requires careful cohort definition
178
What is cohort definition
Criteria selecting study population ## Footnote * Requires inclusion exclusion rules * Must align with phenotype logic * Impacts downstream model validity
179
What is feature leakage
Using information unavailable at prediction time ## Footnote * Common in EHR data * Must enforce temporal cutoffs * Leads to inflated metrics
180
What is upcoding
Assigning more severe codes than appropriate ## Footnote * Affects model training * Creates bias * Present in claims data
181
What is data normalization
Transforming features to common scale ## Footnote * Required for distance based models * Improves optimization * Prevents variable dominance
182
What is one hot encoding
Expanding categorical variables into binary columns ## Footnote * High dimensional for large vocabularies * Used for diagnosis and procedure codes * Sparse representation
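A minimal one-hot encoding sketch (the ICD-10 codes in the usage line are toy examples):

```python
def one_hot_encode(codes, vocab=None):
    """Map each categorical code to a binary indicator vector.
    Codes absent from the vocabulary map to the all-zeros vector."""
    if vocab is None:
        vocab = sorted(set(codes))
    index = {c: i for i, c in enumerate(vocab)}
    rows = []
    for c in codes:
        row = [0] * len(vocab)
        if c in index:
            row[index[c]] = 1
        rows.append(row)
    return vocab, rows

vocab, X = one_hot_encode(["E11.9", "I10", "E11.9"])  # toy ICD-10 codes
```

With tens of thousands of diagnosis codes the vocabulary, and hence each vector, becomes very wide, which is why these matrices are stored sparsely in practice.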
183
What is embedding lookup
Mapping IDs to dense vectors ## Footnote * Used in neural networks * Captures semantic similarity * Efficient memory usage
184
What is dropout
Regularization technique dropping random neurons ## Footnote * Prevents co adaptation * Improves generalization * Used in deep models
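A sketch of inverted dropout on a plain Python list (illustrative; frameworks apply this tensor-wide):

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p during training
    and scale survivors by 1/(1-p) so the expected activation is unchanged.
    At inference time activations pass through untouched."""
    if not training or p == 0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

out = dropout([1.0, 1.0, 1.0], p=0.5, rng=random.Random(0))
```

The 1/(1-p) rescaling during training is what lets inference skip any correction step.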
185
What is batch norm
Normalizing layer activations ## Footnote * Stabilizes training * Allows higher learning rates * Common in deep nets
186
What is vanishing gradient
When gradients become too small for updates ## Footnote * Affects deep networks * Mitigated by ReLU activations and residual connections * Slows training
187
What is exploding gradient
When gradients blow up during backprop ## Footnote * Causes unstable training * Mitigated by gradient clipping * Common in RNNs
188
What is softmax output
Vector of probabilities summing to one ## Footnote * Used in multi class tasks * Differentiable * Works with cross entropy
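As a minimal numerically stable softmax sketch (subtracting the max logit before exponentiating avoids overflow without changing the result):

```python
import math

def softmax(logits):
    """Map raw scores to probabilities that sum to one."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```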
189
What is sigmoid output
Maps a score to a probability between zero and one ## Footnote * Used in binary classification * Sensitive to extreme logits * Outputs are not necessarily calibrated
190
What is feature correlation issue
Highly correlated features distort models ## Footnote * Causes unstable coefficients * L2 regularization stabilizes estimates * PCA can decorrelate features
191
What is dimensionality curse
High dimensionality degrades distance based methods ## Footnote * Distances concentrate and lose discriminative power * Requires dimensionality reduction * Impacts clustering and nearest neighbor search
192
What is AUC advantage
Threshold free metric ## Footnote * Summarizes ranking performance * Robust to class imbalance * Equals the probability a random positive ranks above a random negative
193
What is NDC usefulness
Identifies drug exposure history ## Footnote * Key in medication phenotyping * Helps detect polypharmacy * Maps to RxNorm
194
What is LOINC use
Standardizes lab test identifiers ## Footnote * Needed for lab feature extraction * Enables cross institution mapping * Used in phenotyping
195
What is SNOMED advantage
Granular concept representation ## Footnote * Rich hierarchy * Better clinical expressiveness * Supports reasoning
196
What is UMLS advantage
Unified mapping across vocabularies ## Footnote * Resolves inconsistent coding * Useful in NLP * Provides CUIs
197
What is big data in healthcare
Large scale heterogeneous and high velocity health related data ## Footnote * Includes EHR claims imaging genomics sensors * Requires distributed systems for storage and analysis * Often high dimensional and sparse
198
What are the 5Vs of big data
Volume velocity variety veracity value ## Footnote * Defines challenges of big data systems * Healthcare strongly exhibits variety and veracity issues * Used to motivate scalable analytics
199
Why is healthcare data complex
It spans multiple modalities and coding systems ## Footnote * Structured unstructured temporal streaming * Requires normalization and mapping across vocabularies * Harder than typical business datasets
200
What is structured healthcare data
Tabular fields like labs vitals diagnoses ## Footnote * Found in EHR databases * Easier to process and model * Still noisy and missing
201
What is unstructured healthcare data
Free text notes and reports ## Footnote * Requires NLP for extraction * Largest portion of clinical information * Contains temporal and contextual cues
202
What is interoperability challenge
Different systems cannot easily exchange data ## Footnote * Lack of unified formats * Requires mapping across ICD CPT SNOMED LOINC * Limits multi site model training
203
What is observational bias
Systematic distortion due to non randomized data ## Footnote * Common in EHR where treatments are not assigned randomly * Affects causal interpretation * Requires adjustment for confounders
204
What is selection bias
Bias from non representative cohort ## Footnote * Example only sicker patients have frequent labs * Affects training and evaluation * Mitigated by careful cohort construction
205
What is missingness mechanism
Process by which data becomes missing ## Footnote * Distinguishes MCAR MAR and MNAR (missing completely at random, at random, not at random) * Impacts imputation strategy * Healthcare often MNAR
206
What is high dimensionality issue
Too many features relative to samples ## Footnote * Leads to overfitting * Requires feature selection or DR * Common in EHR code based features
207
What is data sparsity
Most feature entries are zero ## Footnote * Common with diagnosis procedure codes * Requires specialized distance metrics * Affects clustering and similarity
208
What is multimodal data
Combining multiple data types ## Footnote * EHR notes images genomics signals * Requires alignment and fusion * Powerful but challenging
209
What is predictive modeling pipeline
Sequence of defining outcome cohort features and model evaluation ## Footnote * Core of L03 Predictive Modeling * Requires careful temporal design * Avoids leakage and bias
210
What is prediction target
The variable the model aims to predict ## Footnote * Examples mortality readmission diagnosis * Must be clinically meaningful and feasible * Target definition drives entire pipeline
211
What is interesting but impossible target
Clinically interesting but not predictable from available data ## Footnote * Example undiagnosed cancer without signals * Leads to misleading models * Must confirm predictability
212
What is prediction and observation window
Period where features and outcomes are defined ## Footnote * Observation window forms features * Prediction window defines future to predict * Must avoid overlap to prevent leakage
213
What is leakage in predictive modeling
Using future information during feature construction ## Footnote * Example using post diagnosis codes to predict diagnosis * Inflates accuracy * Must enforce strict temporal cutoff
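The temporal cutoff can be sketched as a simple filter; here `events` is a hypothetical list of timestamped `(time, code)` feature events:

```python
from datetime import datetime

def filter_features(events, index_time):
    """Keep only events strictly before the index time, enforcing the
    temporal cutoff that prevents leakage."""
    return [(t, code) for t, code in events if t < index_time]

index = datetime(2024, 6, 1)
events = [
    (datetime(2024, 5, 20), "lab:creatinine"),   # pre-index: usable feature
    (datetime(2024, 6, 3), "dx:E11.9"),          # post-index: would leak the label
]
usable = filter_features(events, index)
```

Applying this per patient at feature-construction time is the mechanical version of "enforce strict temporal cutoff".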
214
What is cohort definition
Selecting appropriate patient population ## Footnote * Requires inclusion exclusion * Sequence matters for validity * Impacts generalizability
215
What is index time
Reference time point for prediction ## Footnote * Defines start of prediction horizon * Example discharge time or admission time * All features must precede index
216
What is case control design
Selecting cases with outcome and matched controls ## Footnote * Controls sampled to match characteristics * Helps balance data for modeling * Must avoid overmatching
217
What is prospective data design
Future outcomes relative to index time ## Footnote * Reflects real world prediction use * Reduces leakage * Preferred for clinical ML
218
What is retrospective design
Using historical data with known outcomes ## Footnote * Easier to implement * More prone to leakage * Common in EHR studies
219
What is temporal alignment
Ensuring all features precede outcome ## Footnote * Critical for validity * Prevents accidental leakage * Requires timestamp aware preprocessing
220
What is feature extraction
Transforming raw EHR into model inputs ## Footnote * Includes labs meds diagnoses notes * May include summary statistics * Requires careful scaling and encoding
221
What is feature aggregation
Summarizing repeated events into features ## Footnote * Examples counts averages last value * Helps reduce dimension * Must align with clinical logic
222
What is feature selection
Choosing most informative features ## Footnote * Methods include L1 mutual information tree based importance * Prevents overfitting * Speeds training
223
What is label prevalence effect
Outcome rarity impacts metric choice ## Footnote * Accuracy becomes meaningless * Precision recall more informative * Affects threshold tuning
224
What is temporal cross validation
Splitting data by time rather than random ## Footnote * Prevents look ahead bias * Reflects deployment setting * Critical for time dependent tasks
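A minimal expanding-window split sketch over time-ordered indices (illustrative; libraries offer equivalents):

```python
def temporal_splits(n, n_folds=3):
    """Expanding-window splits: train on everything up to a cut point,
    then test on the next chunk. Indices are assumed time-ordered."""
    fold = n // (n_folds + 1)
    splits = []
    for k in range(1, n_folds + 1):
        train = list(range(0, fold * k))
        test = list(range(fold * k, min(fold * (k + 1), n)))
        splits.append((train, test))
    return splits

splits = temporal_splits(8, n_folds=3)
```

Every test index comes strictly after every training index, so no fold can look ahead.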
225
What is random split weakness
Allows leakage through temporal correlation ## Footnote * Future events may appear in train set * Inflates performance * Not realistic for clinical deployment
226
What is calibration in clinical ML
Agreement between predicted and observed risk ## Footnote * Critical for decision support * Measured by Brier score and calibration curves * Independent of discrimination ability
227
What is model transportability
Ability of model to work at new sites ## Footnote * Limited by coding and population differences * Requires external validation * Often lower performance than internal validation
228
What is fairness in ML
Absence of systematic performance gaps across subgroups ## Footnote * Sensitive to protected attributes * Evaluated by group specific metrics * Healthcare models must ensure fairness
229
What is proxy variable problem
Variable indirectly encoding sensitive attribute ## Footnote * Example zipcode encoding socioeconomic status * Leads to unintended bias * Requires careful feature audit
230
What is sensitivity analysis
Testing robustness under perturbations ## Footnote * Vary cohort or features * Detects fragile models * Recommended for clinical ML validation
231
What is label noise
Inaccurate outcome labels ## Footnote * Common in claims and EHR * Reduces model performance * Requires noise robust methods
232
What is data shift
Distribution change between train and test ## Footnote * Includes covariate and concept shift * Common across hospitals * Requires drift monitoring
233
What is missing not at random challenge
Missingness depends on unobserved values ## Footnote * Hardest missingness type * Requires modeling the missing mechanism * Common in lab tests
234
What is encounter level prediction
Prediction tied to specific visit ## Footnote * Uses visit level context * Examples sepsis mortality * Must define index event consistently
235
What is patient level prediction
Prediction at individual scale ## Footnote * Aggregates longitudinal data * Examples chronic disease risk * Requires temporal modeling
236
What is censoring in prediction
Outcome not observed due to incomplete follow up ## Footnote * Common in long horizon predictions * Requires survival models or exclusion * May bias observed outcome rate
237
What is negative sampling
Selecting non outcome examples ## Footnote * Important for imbalanced targets * Controls class ratio * Must avoid future information
238
What is heuristic feature
Domain inspired feature rule ## Footnote * Example high creatinine for kidney injury * High precision features boost signal * Used in anchor learning
239
What is phenotyping window
Period used to derive phenotype labels ## Footnote * Must precede prediction window * Ensures causal ordering * Common source of leakage when misaligned
240
What is soft outcome definition
Outcome defined with uncertainty ## Footnote * Example probable disease based on partial evidence * Introduces label noise * Requires robust models
241
What is data pipeline reproducibility
Ensuring the same pipeline yields same output ## Footnote * Requires versioning seeds environment management * Critical for clinical deployment * Prevents silent drift
242
What is code provenance
Tracking origin and version of code ## Footnote * Supports debugging and auditing * Required for regulated settings * Aligns with ML Ops best practices
243
What is target leakage via medications
Prescribing pattern indirectly reveals outcome ## Footnote * Example using insulin to predict diabetes * Appears predictive but leaks label * Must restrict features to avoid
244
What is window overlap issue
Observation window overlaps prediction window ## Footnote * Introduces future information * Inflates performance * Must enforce non overlapping windows
245
What is positive test bias
Sicker patients get more tests leading to feature skew ## Footnote * Leads to confounding via care processes * Not directly disease signal * Requires adjustment or careful windowing
246
What is predictive horizon
Time gap between index and outcome ## Footnote * Defines clinical utility * Short horizon detects acute risk * Long horizon predicts chronic outcomes