What does Cronbach’s α (or KR-20) measure, and when is it most appropriate?
Cronbach’s α (or KR-20 for dichotomous items) measures internal consistency reliability: how strongly the items of a single test administration correlate with one another. Most appropriate when the test is intended to measure one homogeneous construct.
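A minimal Python sketch of the α computation, assuming a persons × items score matrix; the function name cronbach_alpha and the data layout are illustrative, not from any particular library:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_persons x n_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```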
What does test–retest reliability assess, and what kind of trait is it best for?
Test–retest reliability measures the temporal stability of a test by correlating scores from the same individuals on two administrations separated by time.
Best for: stable traits such as intelligence or personality.
Keyword: Temporal stability.
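As a small illustration with hypothetical data, the test–retest coefficient is simply the Pearson correlation between the two administrations:

```python
import numpy as np

# Hypothetical scores for the same five examinees, two administrations apart in time
time1 = np.array([98, 112, 105, 90, 120])
time2 = np.array([101, 110, 108, 92, 118])

r_tt = np.corrcoef(time1, time2)[0, 1]  # Pearson r = the test-retest coefficient
```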
What does alternate (parallel) forms reliability measure, and when is it useful?
Assesses content equivalence between two versions of the same test, controlling for content sampling error and, when the forms are given on different occasions, time sampling error. Useful when retesting is needed but practice and memory effects must be minimized.
What does the kappa coefficient (κ) measure and why is it superior to percent agreement?
Measures rater agreement corrected for chance. It is superior to simple percent agreement because percent agreement is inflated by agreements that would occur by chance alone.
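A short Python sketch of the κ computation from two raters' category labels; cohens_kappa is an illustrative helper, not a library call:

```python
import numpy as np

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    p_o = np.mean(rater1 == rater2)  # observed proportion of agreement
    # Chance agreement: for each category, the product of the two raters'
    # marginal proportions, summed over all categories.
    p_e = sum(np.mean(rater1 == c) * np.mean(rater2 == c)
              for c in np.union1d(rater1, rater2))
    return (p_o - p_e) / (1 - p_e)
```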
What is the difference between internal consistency error and content sampling error?
Internal consistency error arises from item heterogeneity; content sampling error reflects variation from different item sets.
Internal consistency is estimated by α or KR-20; content sampling is minimized using parallel forms.
What does the Spearman–Brown prophecy formula estimate?
Predicts how changing the length of a test (adding or removing items) will affect its reliability.
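The formula is r_new = (n × r_old) / (1 + (n − 1) × r_old), where n is the factor by which test length changes. A one-line sketch with a hypothetical reliability of .70:

```python
def spearman_brown(r_old, n):
    """Predicted reliability when test length is multiplied by n
    (n = 2.0 doubles the test; n = 0.5 halves it)."""
    return (n * r_old) / (1 + (n - 1) * r_old)

spearman_brown(0.70, 2.0)  # doubling a .70-reliable test -> ~.82
```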
What is the coefficient of stability another name for?
Another term for the test–retest reliability coefficient, indicating score consistency across time.
What does a low Cronbach’s α suggest about a test’s dimensionality?
A low α indicates the test likely measures multiple constructs, not one coherent dimension.
Keyword: α ↓ → multidimensional.
Why is equivalent forms reliability considered the most rigorous method?
Because it accounts for both content and time sampling errors, giving the most comprehensive estimate of reliability—though it is the hardest to implement.
Keyword: Max error control.
How is the magnitude of Cohen’s κ interpreted in reliability terms?
κ values range from −1.0 to +1.0, with 0 indicating chance-level agreement and negative values indicating worse-than-chance agreement; a κ of .90 indicates excellent inter-rater reliability.
Keyword: κ magnitude = reliability strength.
What does validity tell us about a test?
It tells us whether the test measures what it claims to measure. A test can be reliable without being valid, but not valid without being reliable.
What is the key difference between reliability and validity?
Reliability = consistency of scores; Validity = accuracy of what the test measures.
What does content validity assess?
Whether test items adequately represent the full range of the construct or skill being measured.
How is criterion-related validity evaluated?
By correlating test scores with an external criterion — concurrently (current performance) or predictively (future performance).
What is construct validity and how is it demonstrated?
It shows that a test measures the theoretical trait it claims to. Demonstrated through convergent and discriminant validity (e.g., multitrait–multimethod matrix).
What is the main difference between criterion-referenced and norm-referenced tests?
Criterion-referenced tests interpret scores by mastery standards (e.g., pass/fail). Norm-referenced tests compare performance to a reference group.
What is incremental validity?
The amount of additional predictive value a new test adds beyond existing predictors.
What is the purpose of the multitrait–multimethod matrix?
To assess construct validity by showing high correlations between similar traits (convergent) and low correlations between different traits (discriminant).
What are the three main parameters in Item Response Theory (IRT)?
Difficulty (how hard an item is), discrimination (how well it separates high vs. low ability), and guessing (chance of correct response by luck).
What does the slope of an Item Characteristic Curve (ICC) represent?
Item discrimination — the steeper the slope, the better the item distinguishes between high- and low-ability examinees.
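A sketch of the three-parameter logistic (3PL) ICC that ties both cards together; the function name and parameter values are illustrative, and some texts also include a scaling constant D ≈ 1.7 in the exponent:

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """P(correct | ability theta) under the 3PL model.
    a = discrimination (slope), b = difficulty, c = guessing asymptote."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 61)
p_steep = icc_3pl(theta, a=2.0, b=0.0, c=0.20)  # steep slope: discriminating item
p_flat  = icc_3pl(theta, a=0.5, b=0.0, c=0.20)  # flat slope: weakly discriminating
```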
What does the Standard Error of Measurement (SEM) represent?
It reflects the average amount a test score is expected to vary from a person’s true score due to measurement error.
Formula: SEM = SD × √(1 − r_xx).
Used to create confidence intervals around observed scores.
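A worked example with hypothetical values (an IQ-style scale with SD = 15 and reliability .91):

```python
sd, r_xx = 15, 0.91
sem = sd * (1 - r_xx) ** 0.5  # SEM = SD * sqrt(1 - r_xx) = 15 * 0.30 = 4.5
```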
What does the Standard Error of Estimate (SEE) indicate?
It estimates the accuracy of predicting a criterion score from a predictor.
Smaller SEE = more accurate prediction.
Formula: SEE = SD_y × √(1 − r²).
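A matching worked example with hypothetical values (criterion SD = 10, validity coefficient r = .60):

```python
sd_y, r = 10, 0.60
see = sd_y * (1 - r ** 2) ** 0.5  # SEE = SD_y * sqrt(1 - .36) = 10 * 0.80 = 8.0
```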
How are confidence intervals constructed around a test score?
Observed score ± (z × SEM).
Example: 95% CI = score ± (1.96 × SEM).
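Continuing the hypothetical SEM of 4.5 from the sketch above, for an observed score of 110:

```python
score, sem = 110, 4.5
ci_95 = (score - 1.96 * sem, score + 1.96 * sem)  # ~ (101.2, 118.8)
```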
Define sensitivity and specificity in test accuracy.
Sensitivity = true positives ÷ (true positives + false negatives) → ability to detect those with the condition.
Specificity = true negatives ÷ (true negatives + false positives) → ability to exclude those without the condition.
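A small sketch computing both from a hypothetical 2×2 screening table; the helper name sens_spec and the counts are made up for illustration:

```python
def sens_spec(tp, fp, fn, tn):
    """Sensitivity and specificity from a 2x2 classification table."""
    sensitivity = tp / (tp + fn)  # proportion of cases with the condition detected
    specificity = tn / (tn + fp)  # proportion of cases without the condition cleared
    return sensitivity, specificity

sens_spec(tp=45, fp=10, fn=5, tn=90)  # -> (0.9, 0.9)
```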