Item difficulty refers to the proportion of examinees in the tryout sample who answered the item ___
correctly (important for determining examinees' knowledge/skill level)
According to classical test theory, total variability in obtained test scores is composed of true score variability plus [systematic/random] error
true score variability plus random error (X = T + E)
Item difficulty is represented by the letter [“p”/“q”] and is calculated by dividing the number of examinees who answered the item correctly by the ___ number of examinees. It ranges in value from [0 to 1/1 to 10/-1 to +1], with larger numbers indicating easier items
“p”; correctly by the total number of examinees; 0 to 1
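The calculation above can be sketched in Python (a minimal illustration, not from the source; `responses` is a hypothetical list of 0/1 item scores):

```python
# Item difficulty (p): proportion of examinees who answered the item correctly.

def item_difficulty(responses):
    """p = number correct / total number of examinees (0 = hardest, 1 = easiest)."""
    return sum(responses) / len(responses)

# 7 of 10 examinees answered this item correctly -> p = .7 (a relatively easy item)
print(item_difficulty([1, 1, 1, 0, 1, 0, 1, 1, 0, 1]))  # 0.7
```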
T/F: In examining item difficulty, if p is equal to 1 that means that none of the examinees got the question right.
F: it means they all did
For most tests, the optimal p value of test items is [.25/.5] which indicates moderate difficulty
.5 (this increases test score variability, helps ensure normal distribution, provides maximum discrimination between examinees, and helps maximize test’s reliability)
For most tests, an item difficulty index of .5 is desirable; however, if the goal of testing is to choose a certain number of top performers, the optimal p value corresponds to the ___ of examinees to be chosen.
The optimal value is also affected by the likelihood that examinees can select the correct answer by ___, with the preferred difficulty level being halfway between 100% of examinees answering the item correctly and the probability of answering correctly by ___.
proportion of examinees to be chosen; guessing (e.g. optimal value for T/F items is .75 (halfway between guess rate of .5 and 1.0))
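The guessing adjustment in this card reduces to simple arithmetic, sketched below (an illustration, not from the source):

```python
# Optimal item difficulty adjusted for guessing: halfway between the chance
# rate of answering correctly and 1.0.

def optimal_p(num_options):
    chance = 1 / num_options      # guess rate, e.g. .5 for a true/false item
    return (chance + 1.0) / 2     # halfway between the guess rate and 1.0

print(optimal_p(2))  # true/false item -> 0.75
print(optimal_p(4))  # four-option multiple-choice item -> 0.625
```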
(Item difficulty) adequate ceiling for a test can be ensured by including a large proportion of items with a [high/low] p value (opposite for adequate floor)
low p value (high p values for an adequate floor)
Item Discrimination refers to the extent to which an item ___ between examinees who obtain low or high scores on the test or an external criterion. It is symbolized with the letter [“D”/“S”]. It is calculated by subtracting the ___ of examinees in the lower-scoring group who answered the item correctly from the ___ of examinees in the upper-scoring group who answered the item correctly. It ranges from [0 to 1/-1 to +1]
discriminates; “D”; percent (often upper and lower 27%) (D is calculated for each item); -1 to +1
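The D calculation can be sketched as follows (an illustration, not from the source; it assumes the upper/lower groups, often the top and bottom 27% on total score, have already been formed):

```python
# Item discrimination (D): percent correct in the upper-scoring group minus
# percent correct in the lower-scoring group.

def discrimination_index(upper_correct, upper_n, lower_correct, lower_n):
    return upper_correct / upper_n - lower_correct / lower_n

# 18 of 20 upper-group and 6 of 20 lower-group examinees answered correctly:
print(round(discrimination_index(18, 20, 6, 20), 2))  # 0.6 -> acceptable (>= .35)
```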
Which of the following statements is false about the Item Discrimination Index (D)?
A. It ranges in value from -1 to +1
B. when D equals +1, all examinees in the upper-scoring group answered the item correctly while all examinees in the lower-scoring group answered the item incorrectly
C. when D equals 0, the same percent of examinees in both groups answered the item correctly
D. when D equals -1, all examinees in the lower-scoring group answered the item correctly and half of the examinees in the upper-scoring group answered the item correctly
D.
If an Item Discrimination Index for an item is [.35/.75] or higher, it is generally considered acceptable.
.35
Read: (Item Characteristic Curve (ICC)): When using item response theory to construct a test, an ICC is derived for each item by plotting the proportion of examinees in the tryout sample who answered the item correctly against either the total test score, performance on an external criterion, or a mathematically derived est. of a latent ability or trait.
The curve looks like a capital S on a graph with ability level on the X axis and probability of a correct response on the Y axis. Depending on which model is used, the curve provides info on 1, 2, or 3 parameters - difficulty (the ability level at which 50% of examinees answer the item correctly), discrimination (the slope of the curve), and probability of guessing correctly (where the curve intercepts the Y axis).
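The three-parameter curve described above is commonly written as a logistic function; a sketch (parameter values are hypothetical):

```python
# Three-parameter logistic (3PL) ICC: probability of a correct response as a
# function of ability (theta), with discrimination a (slope), difficulty b,
# and guessing parameter c (lower asymptote / Y-intercept region).
import math

def icc_3pl(theta, a=1.0, b=0.0, c=0.0):
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# With no guessing, 50% of examinees at ability level b answer correctly:
print(icc_3pl(theta=0.0, b=0.0))              # 0.5
# A nonzero guessing parameter raises the curve's lower asymptote toward c:
print(round(icc_3pl(theta=-10, c=0.25), 3))   # approaches 0.25
```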
In comparison to Classical Test theory (item difficulty (p), item discrimination (D)), Item R___ theory (item characteristic curves) supports ___ testing by using item characteristic curves (ICCs) to estimate an examinee’s ability level in real time. Subsequent items are selected based on the examinee’s previous responses, allowing for a personalized and efficient testing process.
Item Response theory; adaptive testing (IRT’s key strength is that its item parameters (difficulty, discrimination, and guessing) are considered sample-invariant, meaning they remain stable across different samples.)
According to classical test theory, variability in test scores reflects a combination of true score variability and variability due to ___ error
measurement (random) error (which leads to inconsistencies in test scores)
T/F: Most methods for estimating reliability produce a reliability coefficient, which is symbolized as rxx. Reliability coefficients range in value from 0 to +1, and they are always interpreted directly as a measure of random variability
All T except for last phrase; are always interpreted directly as a measure of true score variability
T/F: A reliability coefficient of .75 would be considered sufficient for use in most cases. In interpreting the coefficient, 75% of the variability in examinees' scores represents true variability, while the remaining 25% represents error.
F: .80 or higher is the standard; T: reliability coefficients are directly interpreted as the proportion of variability in a set of test scores that results from true score differences
(Criterion-related validity coefficients of .3 or higher are generally acceptable, as many are not above .6)
Which of these is NOT a method of evaluating reliability?
A. Test-Retest Reliability
B. Internal Consistency Reliability
C. Analytical Reliability
D. Alternative Forms Reliability
E. Inter-Rater Reliability
C.
For a speeded test, which would be the most appropriate and least appropriate for evaluating reliability?
A. Coefficient of internal consistency
B. Coefficient of equivalence/alternate forms
least - A (will overestimate)
most - B
(Speed tests are designed so that all items answered by an examinee are answered correctly and the examinee’s total score depends primarily on his/her speed of responding. Because of the nature of these tests, a measure of internal consistency will provide a spuriously high estimate of the test’s reliability, while alternate forms tests eliminate the problem of practice effects.)
Test-Retest Reliability is also known as the coefficient of [stability/equivalence/internal consistency]. It is appropriate for tests designed to measure a characteristic that is ___ over time, and is not appropriate for tests that measure characteristics that ___ over time.
stability; stable over time…that fluctuate over time (or are likely to be affected in a random way by taking the test more than once) (like alt forms)
___ forms reliability provides a measure of test score consistency over two forms of the test. It is appropriate for tests that measure a characteristic that is ___ over time but not for tests that measure characteristics that ___ over time or when exposure to one ___ is likely to affect performance on the other ___ in an unsystematic way. It is also known as the coefficient of [stability/equivalence/internal consistency].
Alternate, equivalent, or parallel forms; stable over time…fluctuate over time (like test-retest); one form…the other form; equivalence
Internal Consistency Reliability indicates the degree of consistency across different test items, i.e., indicates the degree to which items in the test measure the s___ c___. It is most appropriate for tests measuring [a single domain/multiple related domains]. It is useful for estimating the reliability of tests that measure characteristics that ___ over time or are susceptible to m___ or p___ effects. It is also known as the coefficient of [stability/equivalence/internal consistency]. (Split-half reliability and coefficient alpha are 2 methods)
same characteristic; a single domain; fluctuate over time; memory or practice effects; internal consistency
(Internal consistency reliability) As a function of cutting the number of test items in half, split-half reliability tends to [under/over]estimate the test’s reliability, and the S___-Brown prophecy formula is used to correct the coefficient. It is not appropriate for [tests using pictures/speeded tests] because it will overestimate the coefficient.
underestimate; Spearman-Brown prophecy formula; speeded tests (Spearman-Brown prophecy formula can also be used to estimate the effect of shortening or lengthening a test on its reliability coefficient in general)
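The Spearman-Brown prophecy formula from this card can be sketched in Python (an illustration with a hypothetical split-half value):

```python
# Spearman-Brown prophecy formula: predicted reliability when a test is
# lengthened or shortened by a factor k. Correcting a split-half coefficient
# uses k = 2, since the full test is twice the length of each half.

def spearman_brown(r, k):
    return (k * r) / (1 + (k - 1) * r)

# A split-half correlation of .60 corrected to full-test length:
print(round(spearman_brown(0.60, 2), 4))   # 0.75
# Halving the same test (k = 0.5) predicts a lower reliability:
print(round(spearman_brown(0.75, 0.5), 4))
```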
(Internal consistency reliability) Cronbach’s coefficient alpha is the “___ of all possible split-half correlation coefficients.” The [Spearman-Brown/Kuder-Richardson Formula 20] can be used as a substitute for coefficient alpha when multiple choice test items are scored d___ly
mean; Kuder-Richardson Formula 20 (KR-20); dichotomously
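For dichotomously scored items, coefficient alpha reduces to KR-20; a sketch with made-up 0/1 data (examinees as rows, items as columns):

```python
# KR-20 / coefficient alpha for dichotomous items:
# alpha = (k / (k - 1)) * (1 - sum(item variances) / total-score variance)

def kr20(data):
    k = len(data[0])                     # number of items
    def var(xs):                         # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_vars = sum(var([row[i] for row in data]) for i in range(k))
    total_var = var([sum(row) for row in data])
    return (k / (k - 1)) * (1 - item_vars / total_var)

data = [[1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0]]
print(round(kr20(data), 3))  # 0.8
```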
Inter-rater reliability is most important for measures that are ___ scored such as essay and projective tests. It can be evaluated through [Cohen’s kappa/Kendall’s coefficient/percent agreement], although this tends to overestimate inter-rater reliability. Between the other two options previously listed, which is for 3+ raters when scores are ranks and which is for 2 raters when scores are nominal variables?
subjectively scored; percent agreement; 3+ raters when scores are ranks - Kendall’s coefficient of concordance
2 raters when scores are nominal variables - Cohen’s kappa statistic (k)
(A disadvantage of percent agreement is that it does not take into account the amount of agreement that could have occurred among raters by chance alone, which can provide an inflated estimate of the measure’s reliability.)
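Cohen's kappa corrects percent agreement for chance, as the note above explains; a sketch with hypothetical ratings from two raters:

```python
# Cohen's kappa for two raters and nominal categories:
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
# p_e is the agreement expected by chance from each rater's marginals.
from collections import Counter

def cohens_kappa(rater1, rater2):
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n   # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[c] * c2[c] for c in c1) / n ** 2           # chance agreement
    return (p_o - p_e) / (1 - p_e)

r1 = ["yes", "yes", "no", "no",  "yes", "no", "yes", "no"]
r2 = ["yes", "yes", "no", "yes", "yes", "no", "no",  "no"]
# Percent agreement is .75, but chance-corrected kappa is lower:
print(round(cohens_kappa(r1, r2), 3))  # 0.5
```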
Consensual observer drift occurs when two or more observers working together influence ___ ___’s ratings on a behavioral rating scale so that they assign ratings in a ___ idiosyncratic way. Consensual observer drift makes the ratings of different raters more ___, which artificially [increases/decreases] inter-rater reliability, but it can be controlled for by [alternating/live-observing] raters.
influence each other’s ratings; similar idiosyncratic way; raters more similar, which artificially increases inter-rater reliability; alternating raters