Test Construction Flashcards

(58 cards)

1
Q

What does Cronbach’s α (or KR-20) measure, and when is it most appropriate?

A

Cronbach’s α (or KR-20 for dichotomous items) measures internal consistency reliability: how well the items of a single administration hang together. It is most appropriate when the test is intended to measure a single, homogeneous construct.
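As an illustration (not part of the original cards), internal consistency can be computed directly from an examinee-by-item score matrix; a minimal NumPy sketch with toy data:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (examinees x items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy data: 5 examinees x 3 dichotomous items (for 0/1 items, alpha = KR-20)
scores = np.array([[1, 1, 1],
                   [1, 1, 0],
                   [1, 0, 0],
                   [0, 0, 0],
                   [1, 1, 1]])
print(round(cronbach_alpha(scores), 2))
```

Items that covary strongly push α toward 1; heterogeneous items pull it down.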

2
Q

What does test–retest reliability assess, and what kind of trait is it best for?

A

Test–retest reliability measures the temporal stability of a test by correlating scores from the same individuals on two administrations separated by time.
Best for: stable traits such as intelligence or personality.
Keyword: Temporal stability.

3
Q

What does alternate (parallel) forms reliability measure, and when is it useful?

A

Assesses content equivalence between two versions of the same test to control for content and time sampling errors.

4
Q

What does the kappa coefficient (κ) measure and why is it superior to percent agreement?

A

Measures rater agreement corrected for chance. κ provides a more accurate estimate of reliability than simple percent agreement.
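Chance correction can be seen in a short sketch (illustrative toy data, not from the cards): observed agreement is compared against the agreement two raters would reach by category base rates alone.

```python
import numpy as np

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    categories = np.union1d(r1, r2)
    p_observed = np.mean(r1 == r2)
    p_chance = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Two raters classify 8 cases into diagnoses 0/1; 75% raw agreement,
# but 50% agreement is expected by chance alone here
print(cohens_kappa([0, 0, 1, 1, 0, 1, 0, 1],
                   [0, 1, 1, 1, 0, 1, 0, 0]))   # kappa = 0.5
```

Note how 75% percent agreement shrinks to κ = .50 once chance is removed.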

5
Q

What is the difference between internal consistency error and content sampling error?

A

Internal consistency error arises from item heterogeneity; content sampling error reflects variation from different item sets.
Internal consistency is estimated by α or KR-20; content sampling is minimized using parallel forms.

6
Q

What does the Spearman–Brown prophecy formula estimate?

A

Predicts how changing the length of a test (adding or removing items) will affect its reliability.
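The standard Spearman–Brown formula, r_new = k·r / (1 + (k − 1)·r) where k is the factor by which length changes, can be sketched as:

```python
def spearman_brown(r_orig: float, length_factor: float) -> float:
    """Predicted reliability when test length is multiplied by length_factor."""
    return (length_factor * r_orig) / (1 + (length_factor - 1) * r_orig)

# Doubling a test with reliability .60 raises it to .75:
print(spearman_brown(0.60, 2))

# Halving it (as in split-half corrections) lowers it:
print(spearman_brown(0.60, 0.5))
```

Gains diminish as reliability approaches 1.0, which is why lengthening helps only "up to a point."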

7
Q

The coefficient of stability is another name for what?

A

Another term for the test–retest reliability coefficient, indicating score consistency across time.

8
Q

What does a low Cronbach’s α suggest about a test’s dimensionality?

A

A low α indicates the test likely measures multiple constructs, not one coherent dimension.
Keyword: α ↓ → multidimensional.

9
Q

Why is equivalent forms reliability considered the most rigorous method?

A

Because it accounts for both content and time sampling errors, giving the most comprehensive estimate of reliability—though it is the hardest to implement.
Keyword: Max error control.

10
Q

How is the magnitude of Cohen’s κ interpreted in reliability terms?

A

κ can range from −1.0 to +1.0, though in practice values fall between 0 and +1.0; a κ of .90 indicates excellent inter-rater reliability.
Keyword: κ magnitude = reliability strength.

11
Q

What does validity tell us about a test?

A

It tells us whether the test measures what it claims to measure. A test can be reliable without being valid, but not valid without being reliable.

12
Q

What is the key difference between reliability and validity?

A

Reliability = consistency of scores; Validity = accuracy of what the test measures.

13
Q

What does content validity assess?

A

Whether test items adequately represent the full range of the construct or skill being measured.

14
Q

How is criterion-related validity evaluated?

A

By correlating test scores with an external criterion — concurrently (current performance) or predictively (future performance).

15
Q

What is construct validity and how is it demonstrated?

A

It shows that a test measures the theoretical trait it claims to. Demonstrated through convergent and discriminant validity (e.g., multitrait–multimethod matrix).

16
Q

What is the main difference between criterion-referenced and norm-referenced tests?

A

Criterion-referenced tests interpret scores by mastery standards (e.g., pass/fail). Norm-referenced tests compare performance to a reference group.

17
Q

What is incremental validity?

A

The amount of additional predictive value a new test adds beyond existing predictors.

18
Q

What is the purpose of the multitrait–multimethod matrix?

A

To assess construct validity by showing high correlations between similar traits (convergent) and low correlations between different traits (discriminant).

19
Q

What are the three main parameters in Item Response Theory (IRT)?

A

Difficulty (how hard an item is), discrimination (how well it separates high vs. low ability), and guessing (chance of correct response by luck).

20
Q

What does the slope of an Item Characteristic Curve (ICC) represent?

A

Item discrimination — the steeper the slope, the better the item distinguishes between high- and low-ability examinees.

21
Q

What does the Standard Error of Measurement (SEM) represent?

A

It reflects the average amount a test score is expected to vary from a person’s true score due to measurement error.
Formula: SEM = SD × √(1 − rxx).
Used to create confidence intervals around observed scores.
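The SEM formula and the confidence-interval construction can be combined in a short sketch (the IQ-style numbers below are illustrative, not from the cards):

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - rxx)."""
    return sd * math.sqrt(1 - reliability)

def score_ci(score: float, sd: float, reliability: float, z: float = 1.96):
    """Confidence interval around an observed score: score +/- z * SEM."""
    margin = z * sem(sd, reliability)
    return score - margin, score + margin

# Scale with SD = 15 and reliability .91 -> SEM = 4.5
print(sem(15, 0.91))
print(score_ci(103, 15, 0.91))   # 95% CI around an observed score of 103
```

Higher reliability shrinks the SEM, so the interval around any observed score tightens.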

22
Q

What does the Standard Error of Estimate (SEE) indicate?

A

It estimates the accuracy of predicting a criterion score from a predictor.
Smaller SEE = more accurate prediction.
Formula: SEE = SDy × √(1 − r²).
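The SEE formula behaves as the card describes, shrinking as validity rises; a minimal sketch:

```python
import math

def see(sd_criterion: float, validity_r: float) -> float:
    """Standard error of estimate: SDy * sqrt(1 - r^2)."""
    return sd_criterion * math.sqrt(1 - validity_r ** 2)

# With criterion SDy = 10, stronger predictors shrink the SEE:
print(see(10, 0.0))   # no validity -> SEE equals SDy
print(see(10, 0.6))   # moderate validity -> SEE = 8
print(see(10, 1.0))   # perfect validity -> SEE = 0
```

At r = 0 the best prediction is just the criterion mean, so errors span the full SDy.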

23
Q

How are confidence intervals constructed around a test score?

A

Observed score ± (z × SEM).
Example: 95% CI = score ± (1.96 × SEM).

24
Q

Define sensitivity and specificity in test accuracy.

A

Sensitivity = true positives ÷ (true positives + false negatives) → ability to detect those with the condition.
Specificity = true negatives ÷ (true negatives + false positives) → ability to exclude those without the condition.
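The two ratios map directly onto confusion-matrix counts; a small sketch with hypothetical counts:

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positives / all who actually have the condition."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True negatives / all who actually lack the condition."""
    return tn / (tn + fp)

# Hypothetical screening results
print(sensitivity(tp=90, fn=10))   # 0.9
print(specificity(tn=80, fp=20))   # 0.8
```

Note the denominators: sensitivity conditions on having the disorder, specificity on not having it.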

25
Q

What do positive and negative predictive values represent?

A

PPV = probability a person with a positive test truly has the condition.
NPV = probability a person with a negative test truly does not.
Both depend on the base rate of the condition.

26
Q

What are true/false positives and negatives?

A

True positive = correctly identified case; false positive = incorrectly identified case; true negative = correctly excluded case; false negative = missed case.
Used to evaluate the decision accuracy of a test.

27
Q

What is an expectancy table used for?

A

Shows the likelihood of successful performance (criterion) for each score range on a predictor, aiding cutoff and hiring decisions.

28
Q

How does base rate affect test accuracy?

A

When the base rate of success is very high or very low, even valid tests can yield misleading rates of true vs. false positives.

29
Q

In Classical Test Theory, what is item difficulty (p)?

A

The proportion of examinees who answered correctly (0–1). p = .50 provides maximum discrimination between high and low scorers.

30
Q

What is item discrimination (D)?

A

How well an item distinguishes high scorers from low scorers. Typically computed as (upper group % correct − lower group % correct). Higher D = better item.

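The upper-minus-lower computation can be sketched directly (the 27% grouping fraction is a common convention, assumed here rather than stated in the cards):

```python
import numpy as np

def discrimination_index(item_correct, total_scores, frac=0.27):
    """D = % correct in top-scoring group minus % correct in bottom group."""
    item_correct = np.asarray(item_correct, dtype=float)
    order = np.argsort(total_scores)          # low scorers first
    n = max(1, int(len(order) * frac))
    low, high = order[:n], order[-n:]
    return item_correct[high].mean() - item_correct[low].mean()

# An item answered correctly only by the top half of 10 examinees
totals = np.arange(10)
item = (totals >= 5).astype(int)
print(discrimination_index(item, totals))   # D = 1.0 for this item
```

An item everyone answers identically would give D = 0, matching the "no discrimination" case in card 57.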
31
Q

What is the correction for guessing and how does it affect scores?

A

Adjusts scores to penalize random guessing. Reduces the mean and increases the SD of the distribution.

32
Q

What is communality in factor analysis?

A

The proportion of a variable’s variance explained by all extracted factors.
Communality = Σ(factor loadings²).

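The square-then-sum rule (for orthogonal factors) is a one-liner; the loadings below reproduce the worked example used later in the deck:

```python
def communality(loadings):
    """h^2 under orthogonal factors: sum of squared loadings."""
    return sum(l ** 2 for l in loadings)

# Loadings of .40 and .30 on two orthogonal factors:
print(round(communality([0.40, 0.30]), 2))   # 0.25 -> 25% common variance
```

Adding the raw loadings (.40 + .30 = .70) would be the classic mistake; squaring first gives the variance actually shared with the factors.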
33
Q

What is the purpose of factor rotation in factor analysis?

A

Simplifies interpretation by making factor loadings clearer.
Orthogonal rotation → uncorrelated factors.
Oblique rotation → correlated factors.

34
Q

What does a factor loading represent?

A

The correlation between a test item and an underlying factor. High loading = item strongly measures that factor.

35
Q

What is the difference between orthogonal and oblique rotation?

A

Orthogonal rotation assumes factors are independent (uncorrelated). Oblique rotation allows factors to correlate.

36
Q

What do z-scores represent?

A

Standardized scores showing distance from the mean in SD units.
Formula: z = (X − M)/SD. Mean = 0, SD = 1.

37
Q

What is a T-score and how does it differ from a z-score?

A

T-scores are transformed z-scores scaled to Mean = 50, SD = 10. Used for ease of interpretation and to avoid negative values.

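The two transformations chain together; a sketch using the M = 106, SD = 10 example that appears later in the deck:

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """Distance from the mean in SD units."""
    return (x - mean) / sd

def t_score(z: float) -> float:
    """T = 50 + 10z: rescales z to mean 50, SD 10 (no negatives in practice)."""
    return 50 + 10 * z

# Raw score 126 on a scale with M = 106, SD = 10:
z = z_score(126, 106, 10)
print(z)           # 2.0
print(t_score(z))  # 70.0
```

Because both are linear transforms of the raw score, converting everything to z first makes scores from different systems directly comparable (see card 56).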
38
Q

What does a percentile rank tell us?

A

The percentage of the norm group that scored below a given score. The distribution of percentile ranks is rectangular (flat), not normal.

39
Q

What are criterion cutoffs and how are they optimized?

A

Cutoffs determine pass/fail or hire/no-hire thresholds. Optimized by balancing false positives and false negatives to maximize correct decisions.

40
Q

What are utility analysis and the selection ratio in applied testing?

A

Utility analysis estimates the financial or practical value of using a selection test.
Selection ratio = number hired ÷ number of applicants; lower ratios improve decision accuracy.

41
Q

What is the Spearman–Brown prophecy formula used to estimate?

A

It estimates the effect of increasing or decreasing test length on a test’s reliability coefficient. Used in test construction and split-half reliability adjustments.
Key principle: longer tests generally yield higher reliability (up to a point).
Not to be confused with:
Correction for attenuation: estimates true validity if measures were perfectly reliable.
Standard error of measurement: estimates the range of true scores.

42
Q

In factor analysis, what does a test’s communality represent, and how is it calculated when factors are orthogonal?

A

Definition: communality (h²) = proportion of a test’s variance explained by the common factors (shared variance).
Formula (for orthogonal factors): h² = Σ(factor loadings)².
Example: if Test A loads .40 on Factor I and .30 on Factor II → h² = .40² + .30² = .16 + .09 = .25.
Interpretation: 25% of Test A’s variance is explained by the factors; 75% is unique or error variance.

43
Q

What does a criterion-related validity coefficient of .70 tell us about the relationship between predictor and criterion scores?

A

The coefficient of determination (r²) shows how much variance in the criterion is explained by the predictor.
r = .70 ⇒ r² = .49.
Therefore, 49% of the variability in criterion scores is explained by the predictor. The remaining 51% reflects other factors or error.

44
Q

Which reliability method is best for assessing a characteristic that fluctuates over time (a “state”)? Why?

A

The coefficient of internal consistency. It uses one administration to assess consistency across items, not across time, so it suits traits that vary (e.g., anxiety, mood).
Not appropriate:
Coefficient of stability (test–retest): assumes the trait is stable.
Coefficient of equivalence: uses alternate forms over time.
Coefficient of determination: measures shared variance, not reliability.

45
Q

A test has a mean of 100 and an SEM of 5. An examinee scores 103. What is the 68% confidence interval, and how is it calculated?

A

Formula: CI = obtained score ± (1 × SEM) for 68% confidence.
Calculation: 103 ± 5 → 98 to 108.
Interpretation: there is a 68% chance the examinee’s true score lies between 98 and 108.
Note: 95% CI = ±1.96 × SEM; 99% CI = ±2.58 × SEM.

46
Q

What does the coefficient of stability measure, and when is it used?

A

Another term for test–retest reliability. Measures the consistency of scores across time by administering the same test twice to the same group.
Appropriate for stable traits (e.g., intelligence); not for fluctuating states (use internal consistency instead).
Key distinction:
Coefficient of equivalence: compares different forms.
Coefficient of stability: compares the same test at different times.

47
Q

What is communality (h²), and how do you compute it when factors are orthogonal? Apply to loadings of .40 on Factor I and .30 on Factor II.

A

Meaning: the proportion of a test’s variance explained by the common factors.
Orthogonal formula: h² = Σ(loading)².
Compute: .40² + .30² = .16 + .09 = .25.
Interpretation: 25% of Test A’s variance is common; the rest is uniqueness, u² = 1 − h² = .75 (specific + error).
If factors were oblique: use the structure and factor intercorrelation (Φ) matrix, h² = l′Φl (not simple squaring).
Pitfall: don’t add raw loadings; always square, then sum, for orthogonal factors.

48
Q

If a test measures a state that fluctuates in intensity over time (e.g., mood, anxiety), what reliability method should be used, and why?

A

Use the coefficient of internal consistency (e.g., Cronbach’s alpha). It requires only one administration and assesses the consistency among items within a single testing session, so it is suitable when the construct changes over time.
Avoid:
Coefficient of stability (test–retest): assumes the trait is stable.
Coefficient of equivalence: requires two forms at different times.
Coefficient of determination: measures shared variance, not reliability.

49
Q

If a predictor’s criterion-related validity coefficient is .70, what percentage of the variability in criterion scores is explained by the predictor, and how is it calculated?

A

Formula: r² = (.70)² = .49.
Interpretation: 49% of the variability in criterion scores is explained by the predictor; the remaining 51% reflects other influences or measurement error.
Concept: r² is the coefficient of determination, representing shared variance between predictor and criterion.

50
Q

What information is needed to construct a 68% confidence interval around an examinee’s obtained test score?

A

The examinee’s score and the Standard Error of Measurement (SEM).
Formula: CI = obtained score ± (1 × SEM) for 68% confidence.
SEM is derived from the test’s standard deviation and reliability coefficient.
Note: the standard error of estimate is used for predicted criterion scores; the test mean and SD alone are insufficient without the SEM.

51
Q

In Item Response Theory (IRT), what does an examinee’s test score represent?

A

It reflects the examinee’s status on a latent trait or ability (θ). IRT models the relationship between item responses and the underlying trait being measured (e.g., intelligence, anxiety), focusing on item characteristics (difficulty, discrimination) rather than the total test score.
Contrast:
Classical Test Theory: interprets scores relative to total performance.
Norm-referenced: compares to others.
Criterion-referenced: compares to preset standards.

52
Q

In a multitrait–multimethod matrix, what type of coefficient supports divergent (discriminant) validity?

A

A small heterotrait–monomethod coefficient: a low correlation between different traits measured by the same method is evidence the test is not measuring unrelated constructs.
Contrast:
Large monotrait–heteromethod: evidence of convergent validity (same trait, different methods).
Large heterotrait–monomethod: signals poor divergent validity (too much overlap between different traits).

53
Q

A normal distribution has M = 106 and SD = 10. What is the percentile rank for a raw score of 126, and why?

A

z = (126 − 106) / 10 = +2.0, i.e., two standard deviations above the mean.
A z-score of +2.0 corresponds to the 98th percentile.
Interpretation: the examinee scored higher than about 98% of the population.
Reference values: +1 SD → 84th percentile; +2 SD → 98th percentile; +3 SD → 99.9th percentile.

54
Q

When a disorder has a very low base rate, even a highly accurate (98%) screening test will produce what type of error pattern? Why?

A

It will produce more false positives than false negatives.
Reason: when few people actually have the disorder, most tested individuals are healthy, so even a small false-positive rate applied to a large healthy group creates many false positives.
Example: with a 1% base rate in 10,000 people → 98 true positives, 2 false negatives, 9,702 true negatives, 198 false positives.
Takeaway: a low base rate means poor positive predictive value, even with high test accuracy.

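The worked example above can be verified numerically; a sketch that derives the expected counts from the base rate, sensitivity, and specificity:

```python
def screening_counts(n, base_rate, sens, spec):
    """Expected confusion-matrix counts and PPV for a screening test."""
    sick = n * base_rate
    healthy = n - sick
    tp, fn = sick * sens, sick * (1 - sens)        # among the sick
    tn, fp = healthy * spec, healthy * (1 - spec)  # among the healthy
    ppv = tp / (tp + fp)                           # positive predictive value
    return tp, fn, tn, fp, ppv

# 98%-accurate test (sens = spec = .98), 1% base rate, 10,000 people:
tp, fn, tn, fp, ppv = screening_counts(10_000, 0.01, 0.98, 0.98)
print(round(tp), round(fn), round(tn), round(fp))  # 98 TP, 2 FN, 9702 TN, 198 FP
print(round(ppv, 2))   # only about 1 in 3 positives is a true case
```

With FP (198) roughly double TP (98), a positive result is more often wrong than right, which is exactly the low-base-rate trap.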
55
Q

In a factor matrix, Test A has a factor loading of .70 on Factor II. What does this indicate?

A

A factor loading is the correlation between a variable (test) and a factor.
To find the shared variance, square the loading: .70² = .49.
Therefore, 49% of Test A’s variance is explained by Factor II; the remaining 51% reflects unique variance and measurement error.

56
Q

Given a z-score of +.75, a percentile rank of 84, and a T-score of 65, how do these rank from lowest to highest within a normal distribution?

A

Order (lowest → highest):
z = +.75 → 0.75 SD above the mean
Percentile rank = 84 → ≈ +1 SD above the mean
T = 65 → +1.5 SD above the mean
Key conversions: T = 50 + (z × 10); a percentile rank of 84 ≈ +1 SD.
Interpretation: converting all values to standard deviation units (z) allows valid comparison across scoring systems.

57
Q

What does an item discrimination index (D) = 0 indicate about an item’s performance on a test?

A

D = 0 means equal proportions of high- and low-achieving examinees answered the item correctly; the item does not discriminate between strong and weak test-takers.
Range: −1.0 to +1.0.
Positive D: high scorers outperform low scorers → good item.
Negative D: low scorers outperform high scorers → flawed item.
Goal in test construction: maximize positive D values (preferably ≥ .30).

58
Q

Which type of test format typically has the lowest reliability, and why?

A

True–false tests have the lowest reliability because of the high probability of guessing correctly (50%). Reliability decreases as chance success increases.
Guessing probabilities: true–false → 50%; 3-option multiple choice → 33%; 7-option multiple choice → 14%; free recall → ≈ 0%.
Principle: more response options → lower chance of random correctness → higher reliability.