A. 50-90
B. 55-85
C. 60-80
D. 65-75
The Correct Answer is “C”
C. A confidence interval indicates the range within which an examinee's true score is likely to fall, given his or her obtained score. The standard error of measurement indicates how much error an individual test score can be expected to have and is used to construct confidence intervals. To calculate the 68% confidence interval, add and subtract one standard error of measurement to and from the obtained score. To calculate the 95% confidence interval, add and subtract two standard errors of measurement. Two standard errors of measurement in this case equal 10. We're told that the examinee's obtained score is 70. 70 ± 10 results in a confidence interval of 60 to 80. In other words, we can be 95% confident that the examinee's true score falls between 60 and 80.
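The arithmetic above is simple enough to sketch in a few lines of Python (an illustration only, not part of the original answer):

```python
def confidence_interval(obtained_score, sem, n_sems=2):
    """Return (low, high): the obtained score plus/minus n_sems standard errors of measurement."""
    return obtained_score - n_sems * sem, obtained_score + n_sems * sem

# Worked example from the question: obtained score 70, SEM 5 (so two SEMs = 10).
print(confidence_interval(70, 5, n_sems=2))  # 95% interval -> (60, 80)
print(confidence_interval(70, 5, n_sems=1))  # 68% interval -> (65, 75)
```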
A. split-half reliability.
B. test-retest stability.
C. Likert scales.
D. tests with dichotomously scored questions.
The Correct Answer is “D”
The Kuder-Richardson formula is one of several statistical indices of a test’s internal consistency reliability. It is used to assess the inter-item consistency of tests that are dichotomously scored (e.g., scored as right or wrong).
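As a sketch of how the index is computed, KR-20 equals (k/(k−1))(1 − Σpq/σ²), where k is the number of items, p and q are the proportions passing and failing each item, and σ² is the variance of total scores. The response matrix below is made up for illustration:

```python
def kr20(item_scores):
    """KR-20 internal consistency for dichotomously scored (0/1) items.

    item_scores: list of examinee response vectors, one row per examinee.
    """
    k = len(item_scores[0])  # number of items
    n = len(item_scores)     # number of examinees
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in item_scores) / n  # proportion answering item i correctly
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)
```

With perfectly consistent responses (every examinee passes all items or fails all items), KR-20 works out to 1.0, the maximum.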
A. It is used to establish criterion-related validity.
B. It is appropriate for tests designed to assess a person's future status on a criterion.
C. It is obtained by collecting predictor and criterion scores at about the same time.
D. It indicates the extent to which a test yields the same results as other measures of the same phenomenon.
The Correct Answer is “B”
There are two ways to establish the criterion-related validity of a test: concurrent validation and predictive validation. In concurrent validation, predictor and criterion scores are collected at about the same time; by contrast, in predictive validation, predictor scores are collected first and criterion data are collected at some future point. Concurrent validity indicates the extent to which a test yields the same results as other measures of the same phenomenon. For example, if you developed a new test for depression, you might administer it along with the BDI and measure the concurrent validity of the two tests.
The Correct Answer is “A”
A. Perceptual speed tests are highly speeded and consist of very easy items that, it is assumed, every examinee could answer correctly given unlimited time. The best way to estimate the reliability of a speed test is to administer two separately timed forms and correlate the scores; therefore, a test-retest or alternate-forms coefficient would be the best way to assess the reliability of the test in this question. The other response choices are all methods for assessing internal consistency reliability. These are useful when a test is designed to measure a single characteristic, when the characteristic measured by the test fluctuates over time, or when scores are likely to be affected by repeated exposure to the test. However, they are not appropriate for assessing the reliability of speed tests because they tend to produce spuriously high coefficients.
The Correct Answer is “A”
Pure speed tests and pure power tests are opposite ends of a continuum. A speed test is one with a strict time limit and easy items that most or all examinees are expected to answer correctly. Speed tests measure examinees’ response speed. A power test is one with no or a generous time limit but with items ranging from easy to very difficult (usually ordered from least to most difficult). Power tests measure level of content mastered.
The Correct Answer is “B”
B. The kappa statistic is used to evaluate inter-rater reliability (the consistency of ratings assigned by two raters) when data are nominal or ordinal. Interval and ratio data are sometimes referred to by the term metric.
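Cohen's kappa compares observed agreement with the agreement expected by chance: κ = (p_o − p_e)/(1 − p_e). A minimal sketch, using hypothetical nominal ratings:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal categories to the same cases."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions, summed over categories.
    expected = sum(counts_a[c] * counts_b[c]
                   for c in set(rater_a) | set(rater_b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical ratings: the raters agree on 3 of 4 cases.
print(cohens_kappa(["x", "x", "y", "y"], ["x", "y", "y", "y"]))  # -> 0.5
```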
A. alters the factor loadings for each variable but not the eigenvalue for each factor
B. alters the eigenvalue for each factor but not the factor loadings for the variables
C. alters the factor loadings for each variable and the eigenvalue for each factor
D. does not alter the eigenvalue for each factor nor the factor loadings for the variables
The Correct Answer is “C”
C. In factor analysis, rotating the factors changes the factor loadings for the variables and the eigenvalue for each factor, although the total of the eigenvalues remains the same.
A. 10.0
B. 0.5
C. 1.0
D. 0.0
The Correct Answer is “B”
The item difficulty index ranges from 0 to 1 and indicates the proportion of examinees who answered the item correctly. Items with a moderate difficulty level, typically around 0.5, are preferred because they help maximize the test's reliability.
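The index itself is just a proportion; a one-line sketch with made-up 0/1 responses:

```python
def item_difficulty(responses):
    """Proportion of examinees answering the item correctly (1 = correct, 0 = incorrect)."""
    return sum(responses) / len(responses)

# Hypothetical item: 3 of 5 examinees answered correctly.
print(item_difficulty([1, 1, 1, 0, 0]))  # -> 0.6
```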
A. criterion-related validity coefficient
B. item discrimination index
C. item difficulty index
D. item characteristic curve
The Correct Answer is “D”
Item characteristic curves (ICCs), which are associated with item response theory, are graphs that depict individual test items in terms of the percentage of individuals in different ability groups who answered the item correctly. For example, an ICC for an individual test item might show that 80% of people in the highest ability group, 40% of people in the middle ability group, and 5% of people in the lowest ability group answered the item correctly. Although costly to derive, ICCs provide a great deal of information about individual test items, including their difficulty, their discriminability, and the probability that the item will be guessed correctly.
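In item response theory, ICCs are commonly modeled with a logistic function whose parameters correspond to the three properties just named. A minimal sketch of the standard three-parameter logistic (3PL) model, with illustrative parameter values:

```python
import math

def icc_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response at ability level theta under the 3PL model.

    a: discrimination (slope), b: difficulty (location),
    c: pseudo-guessing parameter (lower asymptote).
    """
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# An examinee whose ability equals the item's difficulty answers correctly
# half the time (when guessing is ruled out):
print(icc_3pl(0.0, a=1.0, b=0.0, c=0.0))  # -> 0.5
```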
The Correct Answer is “B”
According to classical test theory, the reliability of a test indicates the degree to which examinees' scores are free from error and reflect their "true" test score. Reliability is typically measured by obtaining the correlation between scores on the same test, such as by having examinees take then retake the test and correlating both sets of scores (test-retest reliability) or by dividing the test in half and correlating scores on both halves (split-half reliability).

Cronbach's alpha, like split-half reliability, is categorized as an internal consistency reliability coefficient. Its calculation is based on the average of all inter-item correlations, which are correlations between responses on two individual items. Mathematically, Cronbach's alpha works out to the average of all possible split-half correlations (there are many possible split-half correlations because there are many different ways of splitting the test in half).

Regarding the other choices, the Spearman-Brown formula is used to estimate the effects of lengthening a test on its reliability coefficient. Longer tests are typically more reliable. The Spearman-Brown formula is commonly used to adjust the split-half coefficient to estimate what reliability would have been if the halved tests had as many items as the full test. The chi-square test is used to test predictions about observed versus expected frequency distributions of nominal, or categorical, data; for example, if you flip a coin 100 times, you can use the chi-square test to determine if the distribution of heads versus tails outcomes falls into the expected range or if there is evidence that the coin toss was "fixed." And the point-biserial correlation coefficient is used to correlate dichotomously scaled variables with interval or ratio data; for example, it can be used to correlate responses on test items scored as correct or incorrect with scores on the test as a whole.
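Two of the quantities discussed above have simple closed forms: Cronbach's alpha is (k/(k−1))(1 − Σσ²ᵢ/σ²ₓ), and the Spearman-Brown prophecy formula is nr/(1 + (n−1)r) for a test lengthened by a factor of n. A minimal sketch using hypothetical score data:

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha from an examinee-by-item score matrix (rows = examinees)."""
    k = len(item_scores[0])
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_vars = sum(variance([row[i] for row in item_scores]) for i in range(k))
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_vars / total_var)

def spearman_brown(r, length_factor):
    """Projected reliability if the test is lengthened by length_factor."""
    return length_factor * r / (1 + (length_factor - 1) * r)

# Doubling a test whose split-half reliability is .50 projects a reliability of about .67:
print(spearman_brown(0.5, 2))
```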
The Correct Answer is “D”
D. Use of a multitrait-multimethod matrix is one method of assessing a test’s construct validity. The matrix contains correlations among different tests that measure both the same and different traits using similar and different methodologies. The heterotrait-monomethod coefficient, one of the correlation coefficients that would appear on this matrix, reflects the correlation between two tests that measure different traits using similar methods. An example might be the correlation between a test of depression based on self-report data and a test of anxiety also based on self-report data. If a test has good divergent validity, this correlation would be low. Divergent validity is the degree to which a test has a low correlation with other tests that do not measure the same construct. Using the above example, a test of depression would have poor divergent validity if it had a high correlation with other tests that purportedly measure different traits, such as anxiety. This would be evidence that the depression test is measuring traits that are unrelated to depression.
The Correct Answer is “A”
On a job selection test, a "false positive" is someone who is identified by the test as likely to succeed but who does not turn out to be successful, as measured by a performance criterion. If you raise the selection test cutoff score, you will reduce false positives: by making it harder to "pass" the test, you ensure that the people who do pass are more qualified and therefore more likely to be successful. Lowering the criterion score, in effect, makes your definition of success more lax. It therefore becomes easier to be considered successful, and many of the people who were false positives will now be counted as true positives.
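Both effects can be demonstrated by counting outcomes directly. The (test score, criterion score) pairs below are made up purely for illustration:

```python
def count_false_positives(applicants, test_cutoff, criterion_cutoff):
    """Count applicants selected by the test (score >= test_cutoff)
    who nevertheless fall short of the success criterion."""
    return sum(1 for test, criterion in applicants
               if test >= test_cutoff and criterion < criterion_cutoff)

# Hypothetical applicants as (test score, criterion score) pairs.
applicants = [(50, 40), (60, 70), (70, 30), (80, 80), (90, 50)]

print(count_false_positives(applicants, test_cutoff=60, criterion_cutoff=60))  # -> 2
print(count_false_positives(applicants, test_cutoff=80, criterion_cutoff=60))  # -> 1 (cutoff raised)
print(count_false_positives(applicants, test_cutoff=60, criterion_cutoff=45))  # -> 1 (criterion lowered)
```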
If you understand concepts in pictures better than in words, refer to the Test Construction section, where a graph is used to explain this idea.
The Correct Answer is “A”
There are many ways to assess the validity of a test. If we correlate our test with another test that is supposed to measure the same thing, we’ll expect the two to have a high correlation; if they do, the tests will be said to have convergent validity. If our test has a low correlation with other tests measuring something our test is not supposed to measure, it will be said to have discriminant (or divergent) validity. Convergent and divergent validity are both types of construct validity.
The Correct Answer is “D”
Use of a multitrait-multimethod matrix is one method of assessing a test’s construct validity. The matrix contains correlations among different tests that measure both the same and different traits using similar and different methodologies. The heterotrait-heteromethod coefficient, one of the correlation coefficients that would appear on this matrix, reflects the correlation between two tests that measure different (hetero) traits using different (hetero) methods. An example might be the correlation between vocabulary subtest scores on the WAIS-III for intelligence and scores on the Beck Depression Inventory for depression. Since these measures presumably measure different constructs, the correlation coefficient should be low, indicating high divergent or discriminant validity.
The Correct Answer is “D”
D. An oblique rotation is used when the variables included in the analysis are considered to be correlated. When the variables included in the analysis are believed to be uncorrelated (c.), an orthogonal rotation is used. Response choice “a.” describes semi-partial correlation and “b.” describes partial correlation.
A. nominal
B. ordinal
C. interval
D. ratio
The Correct Answer is “B”
An item difficulty index indicates the percentage of individuals who answer a particular item correctly. For example, if an item has a difficulty index of .80, it means that 80% of test-takers answered the item correctly. Although it appears that the item difficulty index is a ratio scale of measurement, according to Anastasi (1982) it is actually an ordinal scale because it does not necessarily indicate equivalent differences in difficulty.
A. 85 to 105
B. 90 to 100
C. 90 to 110
D. impossible to calculate without the reliability coefficient
The Correct Answer is “A”
The standard error of measurement indicates how much error an individual test score can be expected to have. A confidence interval indicates the range within which an examinee's true score is likely to fall, given his or her obtained score. To calculate the 68% confidence interval, we simply add and subtract one standard error of measurement to and from the obtained score. Choice D is incorrect because, although the reliability coefficient is needed to calculate the standard error of measurement, in this case we are given the standard error directly.
A. true score.
B. mean score.
C. error score.
D. criterion score.
The Correct Answer is “A”
The question is just a roundabout way of asking "What is the standard error of measurement?", though it does supply a practical application of the concept. According to classical test theory, an obtained test score consists of truth and error. The truth component reflects the degree to which the score reflects the actual characteristic the test measures, and the error component reflects random or chance factors affecting the score. For instance, a score on an IQ test will reflect to some degree the person's "true" IQ and to some degree chance factors, such as whether the person was tired the day he took the test, whether some of the questions happen to be a particularly good fit with the person's knowledge base, etc. The standard error of measurement of a test indicates the expected amount of error a score on that test will contain. It can be used to answer the question, "Given an obtained score, what is the likely true score?" For example, if the test referenced had a standard error of measurement of 5, there would be a 68% chance that the true score lies within one standard error of measurement of the obtained score (between 128 and 138 in this case) and a 95% chance that the true score lies within two standard errors of measurement (between 123 and 143). So, in the example, the parent would want to know the test's standard error of measurement because the higher it is, the greater the possibility that an obtained score of 133 actually reflects a true score of 135 or above.
A. brief
B. speed
C. state
D. trait
The Correct Answer is “D”
As the name implies, test-retest reliability involves administering a test to the same group of examinees at two different times and then correlating the two sets of scores. This would be most appropriate when evaluating a test that purports to measure a stable trait, since it should not be significantly affected by the passage of time between test administrations.