Final_Exam Flashcards

(246 cards)

1
Q

What does reduceByKey do in Spark

A

It merges values per key with map side combine

  • More efficient than groupByKey
  • Reduces shuffle size
  • Common for aggregations in key value RDDs

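The semantics of reduceByKey can be modeled in a few lines of plain Python (this is an illustrative sketch, not the PySpark API; in Spark it would be `rdd.reduceByKey(add)`):

```python
from operator import add

def reduce_by_key(pairs, func):
    """Merge values per key, as reduceByKey does logically.
    Merging as pairs arrive mimics the map side combine."""
    merged = {}
    for k, v in pairs:
        merged[k] = func(merged[k], v) if k in merged else v
    return sorted(merged.items())

# Word-count style aggregation over (key, value) pairs
pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
print(reduce_by_key(pairs, add))  # [('a', 3), ('b', 1)]
```

Because values are merged locally before any grouping, far less data would need to cross the network than with groupByKey.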
2
Q

What is the purpose of flatMap in Spark

A

It expands one input element into zero or more outputs

  • Used for tokenization in text processing
  • Unlike map it flattens nested results
  • Produces variable length output per input

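A plain-Python model of flatMap's tokenization use case (illustrative sketch; in PySpark this would be `rdd.flatMap(lambda line: line.split())`):

```python
def flat_map(func, items):
    """Apply func to each element and flatten: zero or more outputs per input."""
    return [out for item in items for out in func(item)]

lines = ["to be or", "not to be"]
tokens = flat_map(str.split, lines)
print(tokens)  # ['to', 'be', 'or', 'not', 'to', 'be']
# a plain map(str.split, lines) would instead keep the nested lists
```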
3
Q

What does collect do in Spark

A

It returns the entire RDD to the driver

  • Dangerous on large datasets
  • Should be avoided in production
  • Often replaced with take or count

4
Q

What does take do in Spark

A

It returns the first N elements from the RDD

  • More memory safe than collect
  • Useful for debugging early data slices
  • Executes only until required partitions produce output

5
Q

How is a text file loaded into an RDD in Spark

A

By using sc.textFile

  • Reads file from HDFS or local FS
  • Produces RDD of strings one per line
  • Lazy until an action is executed

6
Q

What is a transformation in Spark

A

It defines a new RDD from an existing one lazily

  • Lazy evaluation builds lineage graph
  • Executes only when an action is called
  • Includes map filter flatMap distinct

7
Q

What is an action in Spark

A

It triggers execution of the DAG and returns a result

  • Examples include count take collect reduce
  • Forces materialization of transformations
  • Sends results back to driver or writes output

8
Q

What does repartition do in Spark

A

It reshuffles the data into a new number of partitions

  • Involves full shuffle which is expensive
  • Use coalesce for reducing partitions without full shuffle
  • Helps balance load across executors

9
Q

What does cache do in Spark

A

It stores the RDD in memory for faster reuse

  • Useful for iterative algorithms
  • Avoids recomputation of lineage
  • Storage level configurable with persist

10
Q

What is a shuffle in Spark

A

It is a data redistribution step between executors

  • Triggered by operations like join reduceByKey sortByKey
  • Expensive due to network IO and disk writes
  • Should be minimized for performance

11
Q

What is the purpose of broadcasting in Spark

A

It distributes read only data efficiently to all executors

  • Avoids repeated serialization
  • Useful for lookup tables
  • Reduces communication overhead

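The lookup-table use case can be sketched in plain Python (the table and records are invented examples; in PySpark the table would be wrapped with `sc.broadcast(...)` and read via `.value` on executors):

```python
# Small read-only lookup table shipped once to every executor in Spark
code_names = {"E11": "type 2 diabetes", "I10": "hypertension"}

def enrich(records, lookup):
    """Map-side join: each record is resolved locally, so the big side
    of the join is never shuffled."""
    return [(pid, lookup.get(code, "unknown")) for pid, code in records]

records = [(1, "I10"), (2, "E11"), (3, "Z99")]
print(enrich(records, code_names))
# [(1, 'hypertension'), (2, 'type 2 diabetes'), (3, 'unknown')]
```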
12
Q

What does mapValues do in Spark

A

It transforms only the values of key value pairs

  • Keys remain unchanged
  • Preserves partitioning structure
  • More efficient than full map on pair RDD

13
Q

What kind of join is performed by join in Spark on two pair RDDs

A

An inner join

  • Only keys present in both RDDs appear
  • Produces key and tuple of matching values
  • Causes shuffle unless partitioner matches

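The inner-join semantics can be modeled on plain lists of pairs (illustrative sketch; in PySpark this is simply `rdd1.join(rdd2)`):

```python
def inner_join(left, right):
    """Inner join on (key, value) pairs: only keys present in both sides
    appear, yielding (key, (left_value, right_value))."""
    right_by_key = {}
    for k, v in right:
        right_by_key.setdefault(k, []).append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in right_by_key.get(k, [])]

a = [("x", 1), ("y", 2)]
b = [("x", 10), ("z", 30)]
print(inner_join(a, b))  # [('x', (1, 10))] -- 'y' and 'z' are dropped
```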
14
Q

What is a broadcast join in Spark SQL

A

A join where a small table is broadcast to all executors

  • Avoids shuffle for small dimension tables
  • Triggered by broadcast hint or size threshold
  • Common in star schema joins

15
Q

What storage system does MapReduce depend on

A

HDFS distributed storage

  • Splits files into blocks
  • Replicates blocks across nodes
  • Enables data locality for mappers

16
Q

What does the mapper phase do in MapReduce

A

It emits intermediate key value pairs

  • Processes input splits independently
  • Prepares data for grouping by key
  • Often performs filtering or extraction

17
Q

What does the reducer phase do in MapReduce

A

It aggregates values for each key

  • Receives all values for a given key
  • Performs summation counting or custom logic
  • Produces final output

18
Q

What is the shuffle in MapReduce

A

It groups intermediate values by key between map and reduce

  • Includes sorting and network transfer
  • Often the bottleneck of the job
  • Critical to MapReduce efficiency

19
Q

Why is MapReduce slow for iterative algorithms

A

Because each iteration writes to disk

  • Logistic regression and k means require multiple passes
  • Spark solves this by caching in memory
  • MapReduce has high disk overhead

20
Q

What is HDFS block size significance

A

It defines how input is split for mappers

  • Typical block size is 128MB
  • Larger blocks reduce overhead
  • Determines data locality opportunities

21
Q

What is the key idea of PageRank

A

It assigns importance scores based on link structure

  • Uses random walk model
  • Iterative computation until convergence
  • Implemented using MapReduce or Spark

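The iterative computation can be sketched as a power iteration on a tiny graph (a minimal model assuming every node has at least one out-link, i.e. no dangling nodes; graph and names are invented):

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration for PageRank with damping factor d.
    links maps each node to the list of nodes it points to."""
    nodes = list(links)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iters):
        # each node splits its rank evenly among its out-links
        contribs = {node: 0.0 for node in nodes}
        for node, outs in links.items():
            for out in outs:
                contribs[out] += ranks[node] / len(outs)
        # random surfer: reset with probability 1 - d, follow links with d
        ranks = {node: (1 - d) / n + d * contribs[node] for node in nodes}
    return ranks

# c is pointed to by both a and b, so it ends up ranked highest
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(ranks)
```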
22
Q

What does adjacency list represent in graphs

A

It lists neighbors of each node

  • Enables efficient traversal
  • Foundation for BFS DFS PageRank
  • Memory efficient for sparse graphs

23
Q

What is BFS used for in graph analysis

A

It computes shortest paths in unweighted graphs

  • Layer by layer expansion
  • Common for social network analysis
  • Basis for connected components

24
Q

What is cosine similarity used for

A

It measures angle based similarity between vectors

  • Often used for patient similarity
  • Works well with sparse high dimensional data
  • Range is between minus one and one

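The definition above is short enough to compute directly (minimal sketch with invented vectors):

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||), in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal)
print(cosine_similarity([1, 2], [2, 4]))  # close to 1.0 (same direction)
```

Note that scaling a vector does not change the result, which is why it suits sparse count data where magnitudes vary widely.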
25
Q

What is Euclidean distance used for

A

It measures straight line distance between points

  • Standard metric in k means
  • Sensitive to scale
  • Requires normalization

26
Q

What does K in k means represent

A

The number of clusters

  • Must be chosen before training
  • Hard clustering assigns one cluster per point
  • Sensitive to initialization

27
Q

What is the k means objective function

A

Minimize within cluster sum of squares

  • Uses L2 distance
  • Non convex optimization
  • Lloyd algorithm alternates assign and update
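Lloyd's assign-and-update loop can be sketched on 1-D data (a minimal illustrative implementation; points and starting centroids are invented):

```python
def kmeans_1d(points, centroids, iters=10):
    """Lloyd's algorithm on 1-D data: alternate the assign step and the
    update step, reducing within-cluster sum of squares each pass."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            # assign step: nearest centroid by squared L2 distance
            nearest = min(range(len(centroids)),
                          key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # update step: each centroid becomes the mean of its assigned points
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers = kmeans_1d(points, centroids=[0.0, 10.0])
print(centers)  # converges near [1.0, 9.0]
```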
28
Q

What is a centroid in k means

A

It is the mean of points assigned to a cluster

  • Updated after assignment step
  • Represents cluster center
  • Sensitive to outliers

29
Q

What is the main weakness of k means

A

It assumes spherical clusters

  • Fails on non linear shapes like moons
  • Sensitive to initialization
  • Requires numeric data

30
Q

What does DBSCAN detect

A

Arbitrarily shaped clusters

  • Density based method
  • Identifies noise points
  • Requires eps and minPts parameters

31
Q

What is eps in DBSCAN

A

It defines neighborhood radius

  • Larger eps means broader clusters
  • Must be tuned with domain knowledge
  • Visualized using k distance plot

32
Q

What is minPts in DBSCAN

A

The minimum number of neighbors for a core point

  • Typical value is 4 or greater
  • Determines density threshold
  • Works with eps to define clusters

33
Q

What type of clustering does GMM perform

A

Soft clustering with probabilistic assignment

  • E step computes responsibilities
  • M step updates means covariances and weights
  • More flexible than k means due to covariance modeling

34
Q

What algorithm is used to fit GMM

A

Expectation maximization

  • Alternates between E and M steps
  • Maximizes likelihood
  • Converges to local optimum

35
Q

What is PCA used for

A

Dimensionality reduction via variance maximization

  • Uses eigen decomposition of covariance matrix
  • Projects data onto top principal components
  • Removes noise dimensions
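The eigen-decomposition route can be sketched with NumPy (assumed available; the correlated toy data is invented so the first component dominates):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: most variance lies along the y = x direction
x = rng.normal(size=200)
data = np.column_stack([x, x + 0.1 * rng.normal(size=200)])

centered = data - data.mean(axis=0)          # center before covariance
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending order
order = np.argsort(eigvals)[::-1]            # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()          # fraction of variance per PC
projected = centered @ eigvecs[:, :1]        # project onto top component
print(explained[0])  # close to 1: first PC captures nearly all variance
```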
36
Q

What does the first principal component capture

A

The direction of maximum variance

  • Orthogonal to other components
  • Represented by highest eigenvalue
  • Helps visualize high dimensional data

37
Q

What does t SNE accomplish

A

Non linear dimensionality reduction for visualization

  • Preserves local structure
  • Useful for clusters in high dimensional data
  • Not for global geometry

38
Q

What is the purpose of CUR decomposition

A

Low rank matrix approximation using actual rows and columns

  • More interpretable than SVD
  • Uses leverage scores to sample rows columns
  • Approximates original matrix

39
Q

What does ICD encode

A

Medical diagnoses

  • ICD10 widely used in billing
  • Hierarchical structure
  • Important for phenotyping

40
Q

What does CPT encode

A

Procedures performed on patients

  • Used for billing and resource tracking
  • Different from ICD diagnoses
  • Helps build feature vectors

41
Q

What does NDC encode

A

Drug products

  • Identifies manufacturer and formulation
  • Important for medication phenotyping
  • Used in pharmacy claims

42
Q

What does RxNorm provide

A

Normalized drug naming system

  • Maps brand and generic names
  • Provides RXCUI identifiers
  • Integrates with NDC

43
Q

What is UMLS

A

A metathesaurus integrating multiple ontologies

  • Contains CUIs for concepts
  • Maps across ICD SNOMED MeSH
  • Used for NLP and concept normalization

44
Q

What is SNOMED CT

A

A comprehensive clinical terminology

  • Contains granular clinical concepts
  • Supports hierarchical relationships
  • Used in EHR systems

45
Q

What is concept drift

A

Changes in data distribution over time

  • Streaming k means adapts using decay factor
  • Common in temporal clinical data
  • Requires online updating

46
Q

What does the streaming k means decay factor do

A

It controls weight of new data versus old data

  • High decay emphasizes recent data
  • Prevents outdated centroids
  • Supports adaptive clustering

47
Q

What is a phenotype in clinical data

A

A computable abstraction of disease state

  • Derived from diagnoses labs meds
  • Used for cohort selection
  • May use rule based or ML methods

48
Q

What is supervised phenotyping

A

Using labeled cases and controls to train models

  • Requires gold standard labels
  • Enables predictive performance metrics
  • Often uses logistic regression or random forest

49
Q

What is unsupervised phenotyping

A

Discovering latent patterns without labels

  • Methods include k means LDA tensor factorization
  • Useful for sub phenotype discovery
  • Sensitive to input feature design

50
Q

What is anchor phenotyping

A

Weak supervision using high precision features

  • Anchors derived from clinical heuristics
  • Trains generative model for labels
  • Bridges rule based and supervised phenotyping

51
Q

What is precision in classification

A

Fraction of predicted positives that are correct

  • TP divided by TP plus FP
  • Useful in imbalanced datasets
  • Measures reliability of positive predictions

52
Q

What is recall in classification

A

Fraction of actual positives that were retrieved

  • TP divided by TP plus FN
  • Also called sensitivity
  • Measures completeness of positive detection

53
Q

What is specificity

A

Fraction of actual negatives correctly identified

  • TN divided by TN plus FP
  • Complements sensitivity
  • Measures ability to avoid false positives

54
Q

What is F1 score

A

Harmonic mean of precision and recall

  • Balances two competing metrics
  • Useful when classes are imbalanced
  • Always <= arithmetic mean
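The four metric cards above reduce to a few lines over a confusion matrix (minimal sketch; the counts are invented):

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall (sensitivity), specificity, and F1
    computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, specificity, f1

p, r, s, f1 = classification_metrics(tp=8, fp=2, fn=4, tn=86)
print(p, r, s, f1)  # precision 0.8, recall 2/3, specificity 86/88, F1 8/11
```

Note the F1 here (8/11, about 0.727) is below the arithmetic mean of precision and recall (11/15, about 0.733), as the card states.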
55
Q

What does AUC measure

A

Ability of classifier to rank positives above negatives

  • Equivalent to probability positive ranked higher
  • Independent of threshold
  • Uses ROC curve area

56
Q

What is logistic regression used for

A

Binary classification via sigmoid transformation

  • Optimized using gradient descent
  • Outputs probability scores
  • Sensitive to feature scaling

57
Q

What is regularization for

A

Preventing overfitting by penalizing large weights

  • L1 yields sparsity
  • L2 yields shrinkage
  • Helps generalization

58
Q

What does bagging reduce

A

Variance of the model

  • Trains models on bootstrap samples
  • Averages predictions
  • Random forest uses bagging

59
Q

What does boosting reduce

A

Bias by sequentially correcting errors

  • Learners trained on residuals
  • Sensitive to noise
  • Includes AdaBoost gradient boosting

60
Q

What is random forest

A

An ensemble of decision trees trained on bootstrap samples

  • Adds feature randomness per split
  • Reduces variance
  • Robust to overfitting

61
Q

What is gradient descent

A

Optimization method moving opposite gradient direction

  • Requires differentiable loss
  • Step size controls convergence
  • Sensitive to feature scaling

62
Q

What is SGD

A

Gradient descent using one or few samples

  • Noisy updates
  • Faster on large data
  • Good for online learning

63
Q

What is the curse of dimensionality

A

Phenomenon where high dimension degrades learning

  • Distances become less meaningful
  • Requires dimensionality reduction
  • Impacts clustering and nearest neighbor

64
Q

What is cosine distance good for

A

Measuring similarity in sparse high dimensional vectors

  • Used in patient similarity
  • Invariant to magnitude
  • Good for bag of words

65
Q

What is mutual information used for

A

Measuring dependence between features and labels

  • Non linear metric
  • Useful for feature selection
  • Robust to distributional shape

66
Q

What is chi square test used for

A

Evaluating independence between categorical variables

  • Common in feature selection
  • Compares observed and expected counts
  • Assumes large sample size

67
Q

What is silhouette score

A

Measures cluster separation

  • Values from minus one to one
  • Higher means better defined clusters
  • Uses cohesion and separation

68
Q

What is elbow method

A

Choosing K by detecting diminishing returns in WCSS

  • Plot SSE versus K
  • Look for bend in curve
  • Heuristic for k means

69
Q

What is curse of large clusters in k means

A

Large clusters dominate centroid updates

  • Causes imbalance
  • May miss small dense clusters
  • Requires careful initialization

70
Q

What is slope of ROC curve at threshold

A

Ratio of densities of positive and negative scores

  • Derived from Neyman Pearson lemma
  • Relates to likelihood ratio
  • Used in optimal thresholding

71
Q

What does dimensionality reduction solve

A

Removes redundancy and noise

  • Helps visualization
  • Speeds up algorithms
  • Reduces overfitting risk

72
Q

What does eigenvalue represent in PCA

A

Variance explained by component

  • Sorted descending
  • Larger eigenvalue means more signal
  • Sum equals total variance

73
Q

What is a loading in PCA

A

Contribution of each feature to component

  • Elements of eigenvector
  • Helps interpret principal axes
  • Useful for feature interpretation

74
Q

What is a factor in NMF

A

Latent non negative component

  • Enables additive parts based representation
  • Good for interpretability
  • Used in text and EHR data

75
Q

What is adjacency matrix

A

Matrix representation of graph edges

  • Aij is 1 if edge exists
  • Supports matrix operations for graph algorithms
  • Used in spectral clustering

76
Q

What is Laplacian used for

A

Graph clustering and spectral methods

  • L equals D minus A
  • Eigenvectors give cluster structure
  • Used in normalized cuts

77
Q

What is assortativity in graphs

A

Preference for nodes to attach to similar nodes

  • High in social networks
  • Can be positive or negative
  • Measured by attribute correlation

78
Q

What is degree distribution

A

Distribution of node degrees in a graph

  • Scale free networks follow power law
  • Influences robustness
  • Affects spreading processes

79
Q

What is Markov chain assumption in PageRank

A

Random surfer moves with fixed probability to neighbors

  • Damping factor controls reset
  • Stationary distribution gives ranks
  • Computed by power iteration

80
Q

What is a pivot in CUR

A

Column or row selected based on leverage

  • Leverage scores from SVD
  • Ensures good reconstruction
  • CUR yields interpretable features

81
Q

What does leverage score indicate

A

Importance of row or column in SVD

  • High score means influential direction
  • Used for sampling in CUR
  • Helps reduce computation

82
Q

What is phenotype leakage

A

Using future information to label past outcomes

  • Creates label leakage
  • Inflates model accuracy falsely
  • Must restrict features by time

83
Q

What is medication count feature

A

Number of unique meds a patient receives

  • Proxy for disease severity
  • Often used in phenotyping
  • Requires mapping via NDC or RxNorm

84
Q

What is lab abnormality feature

A

Binary indicator of abnormal lab result

  • Useful in rule based phenotypes
  • Requires LOINC mapping
  • Must consider timing

85
Q

What is CPT grouping used for

A

Aggregating procedures into higher level categories

  • Helps reduce sparsity
  • Useful in cohort building
  • Mapped via healthcare ontologies

86
Q

What is ICD hierarchy used for

A

Grouping diagnoses at different specificity levels

  • Parent child relationships
  • Enables roll up features
  • Helps reduce dimensionality

87
Q

What does SNOMED hierarchy allow

A

Reasoning across clinical concepts

  • Supports ancestor queries
  • Enables feature expansion
  • More granular than ICD

88
Q

What does UMLS CUI represent

A

A normalized clinical concept identifier

  • Maps across vocabularies
  • Useful for NLP tasks
  • Stable canonical identifier

89
Q

What is co occurrence in EHR

A

Frequency of two codes appearing together

  • Basis for association mining
  • Helps create features for similarity
  • Often sparse and noisy

90
Q

What is temporal embedding

A

Representation of events with timing context

  • Captures trajectory of disease
  • Used in patient similarity models
  • Improves predictive performance

91
Q

What is Hamming distance

A

Count of mismatched binary positions

  • Used for comparing binary phenotypes
  • Simple and fast
  • Less effective for dense vectors

92
Q

What is Jaccard similarity

A

Intersection over union of sets

  • Good for sparse binary features
  • Range zero to one
  • Used for diagnosis overlap
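Intersection over union is a one-liner on Python sets (minimal sketch; the diagnosis codes are invented, and treating two empty sets as similarity 1.0 is a convention chosen here):

```python
def jaccard(a, b):
    """Intersection over union of two sets.
    Both-empty case returned as 1.0 by convention in this sketch."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Diagnosis-code overlap between two patients
p1 = {"E11", "I10", "N18"}
p2 = {"E11", "I10", "J45"}
print(jaccard(p1, p2))  # 0.5  (2 shared codes out of 4 distinct)
```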
93
Q

What is Manhattan distance

A

Sum of absolute differences

  • L1 metric
  • More robust to outliers
  • Used in clustering and KNN

94
Q

What is hierarchical clustering

A

Tree based merging or splitting of clusters

  • Dendrogram shows structure
  • Does not require K
  • Sensitive to linkage method

95
Q

What is average linkage

A

Mean distance between clusters

  • Balances complete and single linkage
  • Often produces smoother clusters
  • Used in agglomerative clustering

96
Q

What is single linkage

A

Minimum distance between clusters

  • Produces chaining effect
  • Sensitive to noise
  • Captures elongated clusters

97
Q

What is complete linkage

A

Maximum distance between clusters

  • Produces compact clusters
  • Sensitive to outliers
  • Less chaining than single linkage

98
Q

What is cluster purity

A

Fraction of points in cluster with majority label

  • Measures how well clusters match ground truth
  • Higher purity implies better performance
  • Does not penalize many small clusters

99
Q

What is Rand index

A

Measures agreement between two clusterings

  • Considers all pairwise decisions
  • Adjusted Rand corrects for chance
  • Range zero to one

100
Q

What is adjusted Rand index

A

Chance corrected Rand index

  • Zero means random
  • One means perfect match
  • Negative means worse than chance

101
Q

What is entropy of cluster

A

Measure of uncertainty within cluster labels

  • Lower is better
  • Zero means pure cluster
  • Used for evaluation

102
Q

What is mutual information of clustering

A

Measures shared information between assignments

  • Normalized MI helps compare across settings
  • Zero means independence
  • One means perfect match

103
Q

What is cross validation used for

A

Estimating generalization performance

  • Splits data into train and test folds
  • Avoids overfitting
  • Provides stable performance estimates

104
Q

What is train test leakage

A

Using information from test in training

  • Inflates metrics
  • Must isolate test set
  • Common pitfall in EHR

105
Q

What is A B testing

A

Comparing two variants experimentally

  • Random assignment
  • Measures causal effect
  • Uses statistical significance

106
Q

What is p value

A

Probability of observing result under null

  • Not probability null is true
  • Must be interpreted carefully
  • Sensitive to sample size

107
Q

What is confidence interval

A

Range of plausible values for parameter

  • Wider means more uncertainty
  • Depends on variance and sample size
  • Not 95 percent chance parameter lies inside

108
Q

What is odds ratio

A

Ratio of odds between two groups

  • OR > 1 means increased odds
  • Used in logistic regression
  • Common in association studies

109
Q

What is hazard ratio

A

Relative event rate in survival analysis

  • HR > 1 means greater risk
  • Based on Cox model
  • Time to event focused

110
Q

What is survival curve

A

Probability of surviving past time point

  • Kaplan Meier estimator
  • Step function
  • Handles censored data

111
Q

What is censoring

A

Incomplete observation of event time

  • Right censoring common
  • Occurs when study ends or patient lost
  • Handled by survival models

112
Q

What is log rank test

A

Tests equality of survival curves

  • Compares event timing distributions
  • Non parametric
  • Sensitive to proportional hazards

113
Q

What is proportional hazards assumption

A

Hazard ratios constant over time

  • Required for Cox model
  • Violations distort inference
  • Checked via Schoenfeld residuals

114
Q

What is distributed file system

A

Stores data across cluster nodes

  • Provides scalability
  • Handles replication
  • Example HDFS

115
Q

What is data locality

A

Executing computation near stored data

  • Reduces network IO
  • Boosts performance
  • Key idea in Hadoop

116
Q

What is executor in Spark

A

A worker process running tasks

  • Allocated by cluster manager
  • Holds cached RDDs
  • Executes transformations and actions

117
Q

What is driver in Spark

A

Main program controlling execution

  • Builds DAG
  • Schedules tasks
  • Collects results

118
Q

What is DAG in Spark

A

Directed acyclic graph of transformations

  • Represents lineage
  • Optimized before execution
  • Ensures fault tolerance

119
Q

What is narrow dependency

A

Parent RDD partitions map to child one to one

  • No shuffle required
  • Example map filter
  • Faster execution

120
Q

What is wide dependency

A

Child RDD depends on many parent partitions

  • Requires shuffle
  • Example reduceByKey join
  • More expensive

121
Q

What is accumulator in Spark

A

Shared write only variable for aggregation

  • Good for counters
  • Not for control flow
  • Updated by workers

122
Q

What is partitioner

A

Controls distribution of key value RDDs

  • Hash partitioner common
  • Affects join efficiency
  • Preserved by mapValues

123
Q

What is schema in DataFrame

A

Structured definition of columns

  • Enforces data types
  • Used for SQL queries
  • Validated at runtime

124
Q

What is catalyst optimizer

A

Optimizes DataFrame query plans

  • Performs rule based transformations
  • Pushes filters down
  • Rewrites logical plan

125
Q

What is tungsten

A

Spark execution engine using memory optimizations

  • Uses off heap storage
  • Improves serialization
  • Speeds up SQL workloads

126
Q

What is lazy evaluation

A

Delays execution until action

  • Reduces computation
  • Enables pipeline optimization
  • Core principle in Spark

127
Q

What is phenotype rule

A

Logical criteria to identify patients

  • Combines ICD CPT labs meds
  • High precision manually crafted
  • Used in rule based phenotyping

128
Q

What is feature vector in EHR

A

High dimensional representation of patient events

  • Sparse structure
  • Includes diagnoses procedures meds
  • Input to ML models

129
Q

What is temporal windowing

A

Restricting features to specific time period

  • Prevents future leakage
  • Aligns with clinical validity
  • Important in survival tasks

130
Q

What is embedder model

A

Transforms sparse codes into dense vectors

  • Learns latent representations
  • Used in patient similarity
  • Improves clustering

131
Q

What is phenotype portability issue

A

Rule created in one site does not transfer well

  • Coding patterns differ
  • Lab ranges differ
  • Requires site specific tuning

132
Q

What is SNOMED parent child relation

A

Hierarchical abstraction of concepts

  • Enables roll up of clinical codes
  • Supports semantic reasoning
  • Deeper hierarchy than ICD

133
Q

What is metathesaurus

A

Repository linking multiple vocabularies

  • Core of UMLS
  • Uses CUI as unified id
  • Resolves cross terminology mapping

134
Q

What is word embedding

A

Vector representation of tokens

  • Captures semantic similarity
  • Used for notes and codes
  • Learned via neural models

135
Q

What is bootstrap sample

A

Sample drawn with replacement from dataset

  • Size same as original
  • Used in bagging
  • Produces diverse trees

136
Q

What is impurity measure

A

Metric used to split decision trees

  • Includes Gini entropy
  • Lower impurity means better split
  • Guides tree growth

137
Q

What is out of bag error

A

Validation error from unused bootstrap samples

  • Internal validation method
  • Used in random forest
  • Avoids separate test set

138
Q

What is learning rate in boosting

A

Controls contribution of each weak learner

  • Lower rate improves stability
  • Requires more iterations
  • Key hyperparameter

139
Q

What is AdaBoost sensitivity

A

High sensitivity to noisy labels

  • Over weights misclassified points
  • Can overfit
  • Requires careful preprocessing

140
Q

What is gradient boosting

A

Sequentially fits models on residuals

  • Powerful ensemble method
  • Requires shrinkage regularization
  • Used in XGBoost LightGBM

141
Q

What is ROC curve

A

Plot of TPR versus FPR across thresholds

  • Shows tradeoffs
  • AUC summarizes performance
  • Useful for imbalanced classes

142
Q

What is calibration

A

Agreement between predicted and true probabilities

  • Reliability diagrams used
  • Important in clinical models
  • Affects decision making

143
Q

What is overfitting

A

Model performs well on train but poorly on test

  • Caused by high variance
  • Regularization and pruning help
  • Cross validation detects it

144
Q

What is underfitting

A

Model too simple to capture patterns

  • High bias
  • Increase model complexity
  • Add features

145
Q

What is confounding variable

A

Factor associated with both exposure and outcome

  • Biases observational associations
  • Must adjust via modeling
  • Common in EHR studies

146
Q

What is imputation

A

Replacing missing data with estimated values

  • Methods include mean median KNN
  • Affects model stability
  • MAR MCAR MNAR distinctions

147
Q

What is standardization

A

Scaling features to zero mean and unit variance

  • Improves optimization
  • Required for distance based models
  • Prevents feature dominance

148
Q

What is min max scaling

A

Transforms features to range zero to one

  • Sensitive to outliers
  • Useful for neural nets
  • Preserves shape

149
Q

What is softmax

A

Converts logits to probability distribution

  • Used in multi class classification
  • Differentiable
  • Normalizes exponentiated values

150
Q

What is cross entropy loss

A

Measures difference between predicted and true distribution

  • Computes negative log likelihood
  • Used in softmax classification
  • Sensitive to miscalibration
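The softmax and cross-entropy cards combine into a short sketch (minimal illustrative code; the logits are invented):

```python
import math

def softmax(logits):
    """Exponentiate and normalize; subtracting the max keeps exp from
    overflowing without changing the result."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Negative log likelihood of the true class."""
    return -math.log(probs[true_index])

probs = softmax([2.0, 1.0, 0.1])
print(probs)                     # sums to 1, largest for the largest logit
print(cross_entropy(probs, 0))   # small loss: class 0 is already most probable
```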
151
Q

What is regularization path

A

Trajectory of coefficients as penalty changes

  • Visualized in LASSO
  • Shows variable selection
  • Helps tuning lambda

152
Q

What is L1 penalty

A

Penalizes absolute weights

  • Produces sparse solutions
  • Good for feature selection
  • Hard thresholding effect

153
Q

What is L2 penalty

A

Penalizes squared weights

  • Produces smooth shrinkage
  • Prevents exploding weights
  • Improves generalization

154
Q

What is elastic net

A

Combines L1 and L2 penalties

  • Balances sparsity and stability
  • Useful for correlated features
  • Controlled by mixing parameter

155
Q

What is SGD advantage

A

Scales well with large data

  • Incremental updates
  • Enables online learning
  • Converges faster in practice

156
Q

What is batch gradient descent disadvantage

A

Requires full pass over data

  • Slow for large datasets
  • Memory intensive
  • Replaced by mini batch methods

157
Q

What is mini batch gradient descent

A

Updates using small random subset

  • Balances noise and stability
  • GPU efficient
  • Common in deep learning

158
Q

What is epoch

A

One full pass over training data

  • Used in iterative training
  • Multiple epochs for convergence
  • Count depends on model and data

159
Q

What is learning rate schedule

A

Adjusting learning rate over time

  • Includes decay warmup cosine schedules
  • Improves convergence
  • Prevents divergence

160
Q

What is early stopping

A

Halting training when validation stops improving

  • Prevents overfitting
  • Simple and effective
  • Requires validation set

161
Q

What is model drift

A

Performance degradation over time

  • Caused by data distribution shift
  • Requires monitoring
  • Common in streaming contexts

162
Q

What is reproducibility

A

Ability to replicate results

  • Requires fixed seeds
  • Data versioning
  • Transparent code

163
Q

What is hyperparameter tuning

A

Searching optimal model settings

  • Grid search random search Bayesian optimization
  • Impacts performance significantly
  • Requires validation scheme

164
Q

What is stratified sampling

A

Maintaining label proportions in splits

  • Important for imbalanced data
  • Reduces variance
  • Used in classification tasks

165
Q

What is label imbalance issue

A

One class dominates dataset

  • Affects threshold choice
  • Requires precision recall metrics
  • May require resampling

166
Q

What is SMOTE

A

Synthetic minority oversampling technique

  • Generates synthetic samples
  • Helps reduce imbalance
  • Beware of noise amplification

167
Q

What is class weighting

A

Adjusting loss to account for imbalance

  • Penalizes misclassification of minority class more
  • Built into many models
  • Alternative to resampling

168
Q

What is bag of words representation

A

Vector of token counts

  • Simple baseline for text
  • High dimensional and sparse
  • Used in clinical notes mining

169
Q

What is TF IDF

A

Term frequency inverse document frequency weighting

  • Reduces impact of common words
  • Highlights informative terms
  • Used in text classification
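The weighting can be sketched with raw term counts and idf = log(N / df), one common unsmoothed variant (minimal sketch; the tiny corpus is invented):

```python
import math

def tf_idf(docs):
    """TF-IDF over tokenized documents: weight = tf * log(N / df)."""
    n = len(docs)
    df = {}                          # document frequency per term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        tf = {t: doc.count(t) for t in set(doc)}
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["cough", "fever"], ["fever", "fever", "rash"], ["rash"]]
w = tf_idf(docs)
print(w[1])  # 'fever' counted twice but downweighted: it appears in 2 of 3 docs
```

A term appearing in every document gets weight zero, which is exactly the "reduces impact of common words" behavior the card describes.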
170
What is stop word removal
Removing common non informative words ## Footnote * Reduces noise * Language dependent * Standard NLP preprocessing
171
What is tokenization
Splitting text into tokens ## Footnote * Basis for NLP feature extraction * Can use whitespace or learned rules * Precursor to embeddings
172
What is lemmatization
Reducing words to dictionary form ## Footnote * More accurate than stemming * Language aware * Improves text consistency
173
What is stemming
Removing suffixes to approximate root form ## Footnote * Faster but cruder than lemmatization * Sometimes harms accuracy * Porter stemmer commonly used
174
What is EHR
Electronic health record system ## Footnote * Contains structured and unstructured data * Used for phenotyping and ML * Requires preprocessing
175
What is CPT modifier
Additional code describing procedural detail ## Footnote * Refines meaning of CPT * Affects billing interpretation * Appears in claims data
176
What is diagnosis code position
Placement of code within visit record ## Footnote * Primary code indicates main reason * Secondary codes add context * Used for feature weighting
177
What is readmission definition
Returning to hospital within short time window ## Footnote * Often 30 days * Quality metric * Requires careful cohort definition
178
What is cohort definition
Criteria selecting study population ## Footnote * Requires inclusion exclusion rules * Must align with phenotype logic * Impacts downstream model validity
179
What is feature leakage
Using information unavailable at prediction time ## Footnote * Common in EHR data * Must enforce temporal cutoffs * Leads to inflated metrics
180
What is upcoding
Assigning more severe codes than appropriate ## Footnote * Affects model training * Creates bias * Present in claims data
181
What is data normalization
Transforming features to common scale ## Footnote * Required for distance based models * Improves optimization * Prevents variable dominance
182
What is one hot encoding
Expanding categorical variables into binary columns ## Footnote * High dimensional for large vocabularies * Used for diagnosis and procedure codes * Sparse representation
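A minimal one-hot encoding sketch (the ICD-10 codes in the usage line are toy examples):

```python
def one_hot_encode(codes, vocab=None):
    """Map each categorical code to a binary indicator vector.
    Codes absent from the vocabulary map to the all-zeros vector."""
    if vocab is None:
        vocab = sorted(set(codes))
    index = {c: i for i, c in enumerate(vocab)}
    rows = []
    for c in codes:
        row = [0] * len(vocab)
        if c in index:
            row[index[c]] = 1
        rows.append(row)
    return vocab, rows

vocab, X = one_hot_encode(["E11.9", "I10", "E11.9"])  # toy ICD-10 codes
```

With tens of thousands of diagnosis codes the vocabulary, and hence each vector, becomes very wide, which is why these matrices are stored sparsely in practice.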
183
What is embedding lookup
Mapping IDs to dense vectors ## Footnote * Used in neural networks * Captures semantic similarity * Efficient memory usage
184
What is dropout
Regularization technique dropping random neurons ## Footnote * Prevents co adaptation * Improves generalization * Used in deep models
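A sketch of inverted dropout on a plain Python list (illustrative; frameworks apply this tensor-wide):

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p during training
    and scale survivors by 1/(1-p) so the expected activation is unchanged.
    At inference time activations pass through untouched."""
    if not training or p == 0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

out = dropout([1.0, 1.0, 1.0], p=0.5, rng=random.Random(0))
```

The 1/(1-p) rescaling during training is what lets inference skip any correction step.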
185
What is batch norm
Normalizing layer activations ## Footnote * Stabilizes training * Allows higher learning rates * Common in deep nets
186
What is vanishing gradient
When gradients become too small for updates ## Footnote * Affects deep networks * Mitigated by ReLU activations and residual connections * Slows training
187
What is exploding gradient
When gradients blow up during backprop ## Footnote * Causes unstable training * Mitigated by gradient clipping * Common in RNNs
188
What is softmax output
Vector of probabilities summing to one ## Footnote * Used in multi class tasks * Differentiable * Works with cross entropy
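As a minimal numerically stable softmax sketch (subtracting the max logit before exponentiating avoids overflow without changing the result):

```python
import math

def softmax(logits):
    """Map raw scores to probabilities that sum to one."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```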
189
What is sigmoid output
Maps a score to a probability between zero and one ## Footnote * Used in binary classification * Sensitive to extreme logits * Outputs are not necessarily calibrated
190
What is feature correlation issue
Highly correlated features distort models ## Footnote * Causes unstable coefficients * L2 regularization stabilizes estimates * PCA can decorrelate features
191
What is dimensionality curse
High dimensionality degrades distance based methods ## Footnote * Distances concentrate and lose discriminative power * Requires dimensionality reduction * Impacts clustering and nearest neighbor search
192
What is AUC advantage
Threshold free metric ## Footnote * Summarizes ranking performance * Robust to class imbalance * Equals the probability a random positive ranks above a random negative
193
What is NDC usefulness
Identifies drug exposure history ## Footnote * Key in medication phenotyping * Helps detect polypharmacy * Maps to RxNorm
194
What is LOINC use
Standardizes lab test identifiers ## Footnote * Needed for lab feature extraction * Enables cross institution mapping * Used in phenotyping
195
What is SNOMED advantage
Granular concept representation ## Footnote * Rich hierarchy * Better clinical expressiveness * Supports reasoning
196
What is UMLS advantage
Unified mapping across vocabularies ## Footnote * Resolves inconsistent coding * Useful in NLP * Provides CUIs
197
What is big data in healthcare
Large scale heterogeneous and high velocity health related data ## Footnote * Includes EHR claims imaging genomics sensors * Requires distributed systems for storage and analysis * Often high dimensional and sparse
198
What are the 5Vs of big data
Volume velocity variety veracity value ## Footnote * Defines challenges of big data systems * Healthcare strongly exhibits variety and veracity issues * Used to motivate scalable analytics
199
Why is healthcare data complex
It spans multiple modalities and coding systems ## Footnote * Structured unstructured temporal streaming * Requires normalization and mapping across vocabularies * Harder than typical business datasets
200
What is structured healthcare data
Tabular fields like labs vitals diagnoses ## Footnote * Found in EHR databases * Easier to process and model * Still noisy and missing
201
What is unstructured healthcare data
Free text notes and reports ## Footnote * Requires NLP for extraction * Largest portion of clinical information * Contains temporal and contextual cues
202
What is interoperability challenge
Different systems cannot easily exchange data ## Footnote * Lack of unified formats * Requires mapping across ICD CPT SNOMED LOINC * Limits multi site model training
203
What is observational bias
Systematic distortion due to non randomized data ## Footnote * Common in EHR where treatments are not assigned randomly * Affects causal interpretation * Requires adjustment for confounders
204
What is selection bias
Bias from non representative cohort ## Footnote * Example only sicker patients have frequent labs * Affects training and evaluation * Mitigated by careful cohort construction
205
What is missingness mechanism
Process by which data becomes missing ## Footnote * Distinguishes MCAR MAR and MNAR (missing completely at random, at random, not at random) * Impacts imputation strategy * Healthcare often MNAR
206
What is high dimensionality issue
Too many features relative to samples ## Footnote * Leads to overfitting * Requires feature selection or DR * Common in EHR code based features
207
What is data sparsity
Most feature entries are zero ## Footnote * Common with diagnosis procedure codes * Requires specialized distance metrics * Affects clustering and similarity
208
What is multimodal data
Combining multiple data types ## Footnote * EHR notes images genomics signals * Requires alignment and fusion * Powerful but challenging
209
What is predictive modeling pipeline
Sequence of defining outcome cohort features and model evaluation ## Footnote * Core of L03 Predictive Modeling * Requires careful temporal design * Avoids leakage and bias
210
What is prediction target
The variable the model aims to predict ## Footnote * Examples mortality readmission diagnosis * Must be clinically meaningful and feasible * Target definition drives entire pipeline
211
What is interesting but impossible target
Clinically interesting but not predictable from available data ## Footnote * Example undiagnosed cancer without signals * Leads to misleading models * Must confirm predictability
212
What is prediction and observation window
Period where features and outcomes are defined ## Footnote * Observation window forms features * Prediction window defines future to predict * Must avoid overlap to prevent leakage
213
What is leakage in predictive modeling
Using future information during feature construction ## Footnote * Example using post diagnosis codes to predict diagnosis * Inflates accuracy * Must enforce strict temporal cutoff
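The temporal cutoff can be sketched as a simple filter; here `events` is a hypothetical list of timestamped `(time, code)` feature events:

```python
from datetime import datetime

def filter_features(events, index_time):
    """Keep only events strictly before the index time, enforcing the
    temporal cutoff that prevents leakage."""
    return [(t, code) for t, code in events if t < index_time]

index = datetime(2024, 6, 1)
events = [
    (datetime(2024, 5, 20), "lab:creatinine"),   # pre-index: usable feature
    (datetime(2024, 6, 3), "dx:E11.9"),          # post-index: would leak the label
]
usable = filter_features(events, index)
```

Applying this per patient at feature-construction time is the mechanical version of "enforce strict temporal cutoff".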
214
What is cohort definition
Selecting appropriate patient population ## Footnote * Requires inclusion exclusion * Sequence matters for validity * Impacts generalizability
215
What is index time
Reference time point for prediction ## Footnote * Defines start of prediction horizon * Example discharge time or admission time * All features must precede index
216
What is case control design
Selecting cases with outcome and matched controls ## Footnote * Controls sampled to match characteristics * Helps balance data for modeling * Must avoid overmatching
217
What is prospective data design
Future outcomes relative to index time ## Footnote * Reflects real world prediction use * Reduces leakage * Preferred for clinical ML
218
What is retrospective design
Using historical data with known outcomes ## Footnote * Easier to implement * More prone to leakage * Common in EHR studies
219
What is temporal alignment
Ensuring all features precede outcome ## Footnote * Critical for validity * Prevents accidental leakage * Requires timestamp aware preprocessing
220
What is feature extraction
Transforming raw EHR into model inputs ## Footnote * Includes labs meds diagnoses notes * May include summary statistics * Requires careful scaling and encoding
221
What is feature aggregation
Summarizing repeated events into features ## Footnote * Examples counts averages last value * Helps reduce dimension * Must align with clinical logic
222
What is feature selection
Choosing most informative features ## Footnote * Methods include L1 mutual information tree based importance * Prevents overfitting * Speeds training
223
What is label prevalence effect
Outcome rarity impacts metric choice ## Footnote * Accuracy becomes meaningless * Precision recall more informative * Affects threshold tuning
224
What is temporal cross validation
Splitting data by time rather than random ## Footnote * Prevents look ahead bias * Reflects deployment setting * Critical for time dependent tasks
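A minimal expanding-window split sketch over time-ordered indices (illustrative; libraries offer equivalents):

```python
def temporal_splits(n, n_folds=3):
    """Expanding-window splits: train on everything up to a cut point,
    then test on the next chunk. Indices are assumed time-ordered."""
    fold = n // (n_folds + 1)
    splits = []
    for k in range(1, n_folds + 1):
        train = list(range(0, fold * k))
        test = list(range(fold * k, min(fold * (k + 1), n)))
        splits.append((train, test))
    return splits

splits = temporal_splits(8, n_folds=3)
```

Every test index comes strictly after every training index, so no fold can look ahead.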
225
What is random split weakness
Allows leakage through temporal correlation ## Footnote * Future events may appear in train set * Inflates performance * Not realistic for clinical deployment
226
What is calibration in clinical ML
Agreement between predicted and observed risk ## Footnote * Critical for decision support * Measured by Brier score and calibration curves * Independent of discrimination ability
227
What is model transportability
Ability of model to work at new sites ## Footnote * Limited by coding and population differences * Requires external validation * Often lower performance than internal validation
228
What is fairness in ML
Absence of systematic performance gaps across subgroups ## Footnote * Sensitive to protected attributes * Evaluated by group specific metrics * Healthcare models must ensure fairness
229
What is proxy variable problem
Variable indirectly encoding sensitive attribute ## Footnote * Example zipcode encoding socioeconomic status * Leads to unintended bias * Requires careful feature audit
230
What is sensitivity analysis
Testing robustness under perturbations ## Footnote * Vary cohort or features * Detects fragile models * Recommended for clinical ML validation
231
What is label noise
Inaccurate outcome labels ## Footnote * Common in claims and EHR * Reduces model performance * Requires noise robust methods
232
What is data shift
Distribution change between train and test ## Footnote * Includes covariate and concept shift * Common across hospitals * Requires drift monitoring
233
What is missing not at random challenge
Missingness depends on unobserved values ## Footnote * Hardest missingness type * Requires modeling the missing mechanism * Common in lab tests
234
What is encounter level prediction
Prediction tied to specific visit ## Footnote * Uses visit level context * Examples sepsis mortality * Must define index event consistently
235
What is patient level prediction
Prediction at individual scale ## Footnote * Aggregates longitudinal data * Examples chronic disease risk * Requires temporal modeling
236
What is censoring in prediction
Outcome not observed due to incomplete follow up ## Footnote * Common in long horizon predictions * Requires survival models or exclusion * May bias observed outcome rate
237
What is negative sampling
Selecting non outcome examples ## Footnote * Important for imbalanced targets * Controls class ratio * Must avoid future information
238
What is heuristic feature
Domain inspired feature rule ## Footnote * Example high creatinine for kidney injury * High precision features boost signal * Used in anchor learning
239
What is phenotyping window
Period used to derive phenotype labels ## Footnote * Must precede prediction window * Ensures causal ordering * Common source of leakage when misaligned
240
What is soft outcome definition
Outcome defined with uncertainty ## Footnote * Example probable disease based on partial evidence * Introduces label noise * Requires robust models
241
What is data pipeline reproducibility
Ensuring the same pipeline yields same output ## Footnote * Requires versioning seeds environment management * Critical for clinical deployment * Prevents silent drift
242
What is code provenance
Tracking origin and version of code ## Footnote * Supports debugging and auditing * Required for regulated settings * Aligns with ML Ops best practices
243
What is target leakage via medications
Prescribing pattern indirectly reveals outcome ## Footnote * Example using insulin to predict diabetes * Appears predictive but leaks label * Must restrict features to avoid
244
What is window overlap issue
Observation window overlaps prediction window ## Footnote * Introduces future information * Inflates performance * Must enforce non overlapping windows
245
What is positive test bias
Sicker patients get more tests leading to feature skew ## Footnote * Leads to confounding via care processes * Not directly disease signal * Requires adjustment or careful windowing
246
What is predictive horizon
Time gap between index and outcome ## Footnote * Defines clinical utility * Short horizon detects acute risk * Long horizon predicts chronic outcomes