Test Construction Flashcards

(69 cards)

1
Q

Item difficulty refers to the proportion of examinees in the tryout sample who answered the item ___

A

correctly (important for determining examinees' knowledge/skill level)

2
Q

According to classical test theory, total variability in obtained test scores is composed of true score variability plus [systematic/random] error

A

true score variability plus random error (X = T + E)

3
Q

Item difficulty is represented by the letter [“p”/“q”] and is calculated by dividing the number of examinees who answered the item correctly by the ___ number of examinees. It ranges in value from [0 to 1/1 to 10/-1 to +1], with larger numbers indicating easier items

A

“p”; correctly by the total number of examinees; 0 to 1

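The p calculation from these cards can be sketched in Python (illustrative only; the function name and tryout data are made up):

```python
def item_difficulty(responses):
    """Item difficulty p: the proportion of examinees in the tryout
    sample who answered the item correctly. Ranges from 0 to 1;
    larger values mean an easier item."""
    return sum(responses) / len(responses)

# Hypothetical tryout data: 1 = correct, 0 = incorrect.
p = item_difficulty([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
print(p)  # -> 0.8 (a fairly easy item)
```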
4
Q

T/F: In examining item difficulty, if p is equal to 1 that means that none of the examinees got the question right.

A

F: it means they all did

5
Q

For most tests, the optimal p value of test items is [.25/.5] which indicates moderate difficulty

A

.5 (this increases test score variability, helps ensure normal distribution, provides maximum discrimination between examinees, and helps maximize test’s reliability)

6
Q

For most tests, an item difficulty index of .5 is desirable; however, if the goal of testing is to choose a certain number of top performers, the optimal p value corresponds to the ___ of examinees to be chosen.
The optimal value is also affected by the likelihood that examinees can select the correct answer by ___, with the preferred difficulty level being halfway between 100% of examinees answering the item correctly and the probability of answering correctly by __“__.

A

proportion of examinees to be chosen; guessing (e.g. optimal value for T/F items is .75 (halfway between guess rate of .5 and 1.0))

7
Q

(Item difficulty) adequate ceiling for a test can be ensured by including a large proportion of items with a [high/low] p value (opposite for adequate floor)

A

low p value (hi for floor)

8
Q

Item Discrimination refers to the extent to which an item ___ between examinees who obtain low or high scores on the test or an external criterion. It is symbolized with the letter [“D”/”S”]. It is calculated by subtracting the ___ of examinees in the lower-scoring group who answered the item correctly from the __“__ of
examinees in the upper-scoring group who answered the item correctly. It ranges from [0 to 1/-1 to +1]

A

discriminates; “D”; percent (often upper and lower 27%) (D is calculated for each item); -1 to +1

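The D calculation from this card can be sketched in Python (illustrative only; the function name and group counts are made up):

```python
def item_discrimination(upper_correct, upper_n, lower_correct, lower_n):
    """Item discrimination index D: percent correct in the
    upper-scoring group minus percent correct in the lower-scoring
    group (computed as proportions). Ranges from -1 to +1."""
    return upper_correct / upper_n - lower_correct / lower_n

# Hypothetical item: 18 of 20 upper-group examinees and 6 of 20
# lower-group examinees (e.g., the upper and lower 27%) answered
# the item correctly.
D = item_discrimination(18, 20, 6, 20)
print(round(D, 2))  # -> 0.6, above the common .35 cutoff
```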
9
Q

Which of the following statements is false about the Item Discrimination Index (D)?

A. It ranges in value from -1 to +1

B. when D equals +1, all examinees in the upper-scoring group answered the item correctly while all examinees in the lower-scoring group answered the item incorrectly

C. when D equals 0, the same percent of examinees in both groups answered the item correctly

D. when D equals -1, all examinees in the lower-scoring group answered the item correctly and half of the examinees in the upper-scoring group answered the item correctly

A

D.

10
Q

If an Item Discrimination Index for an item is [.35/.75] or higher, it is generally considered acceptable.

A

.35

11
Q

Read: (Item Characteristic Curve (ICC)): When using item response theory to construct a test, an ICC is derived for each item by plotting the proportion of examinees in the tryout sample who answered the item correctly against either the total test score, performance on an external criterion, or a mathematically derived estimate of a latent ability or trait.

A

The curve looks like a capital S on a graph with ability level on the X axis and probability of a correct response on the Y axis. Depending on which model is used, the curve provides info on 1, 2, or 3 parameters - difficulty (the ability level at which 50% of examinees answer correctly), discrimination (the slope of the curve), and the probability of guessing correctly (where the curve intercepts the Y axis).

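The three parameters described above can be illustrated with the standard three-parameter logistic (3PL) model; a minimal sketch with made-up parameter values:

```python
import math

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC: probability of a correct
    response at ability theta, given discrimination a (slope),
    difficulty b (location), and guessing parameter c (the lower
    asymptote, where the curve meets the Y axis)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# With no guessing (c = 0), exactly 50% of examinees whose ability
# equals the difficulty parameter (theta == b) answer correctly.
print(icc_3pl(theta=0.0, a=1.2, b=0.0, c=0.0))  # -> 0.5
```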
12
Q

In comparison to Classical Test theory (item difficulty (p), item discrimination (D)), Item R___ theory (item characteristic curves) supports ___ testing by using item characteristic curves (ICCs) to estimate an examinee’s ability level in real time. Subsequent items are selected based on the examinee’s previous responses, allowing for a personalized and efficient testing process.

A

Item Response theory; adaptive testing (IRT’s key strength is that its item parameters (difficulty, discrimination, and guessing) are considered sample-invariant, meaning they remain stable across different samples.)

13
Q

According to classical test theory, variability in test scores reflects a combination of true score variability and variability due to ___ error

A

measurement (random) error (which leads to inconsistencies in test scores)

14
Q

T/F: Most methods for estimating reliability produce a reliability coefficient, which is symbolized as rxx. Reliability coefficients range in value from 0 to +1, and they are always interpreted directly as a measure of random variability

A

All T except for last phrase; are always interpreted directly as a measure of true score variability

15
Q

T/F: A reliability coefficient of .75 would be considered sufficient for use in most cases. In interpreting the coefficient, 75% of the variability in examinees' scores represents true variability, while the remaining 25% represents error.

A

F: .80 or higher is the standard; T: reliability coefficients are directly interpreted as the proportion of variability in a set of test scores that results from true score differences

(Criterion-related validity coefficients of .3 or higher are generally acceptable, as many are not above .6)

16
Q

Which of these is NOT a method of evaluating reliability?

A. Test-Retest Reliability
B. Internal Consistency Reliability
C. Analytical Reliability
D. Alternative Forms Reliability
E. Inter-Rater Reliability

A

C.

17
Q

For a speeded test, which would be the most appropriate and least appropriate for evaluating reliability?

A. Coefficient of internal consistency
B. Coefficient of equivalence/alternate forms

A

least - A (will overestimate)
most - B

(Speed tests are designed so that all items answered by an examinee are answered correctly and the examinee’s total score depends primarily on his/her speed of responding. Because of the nature of these tests, a measure of internal consistency will provide a spuriously high estimate of the test’s reliability, while alternate forms tests eliminate the problem of practice effects.)

18
Q

Test-Retest Reliability is also known as the coefficient of [stability/equivalence/internal consistency]. It is appropriate for tests designed to measure a characteristic that is ___ over time, and is not appropriate for tests that measure characteristics that ___ over time.

A

stability; stable over time…that fluctuate over time (or are likely to be affected in a random way by taking the test more than once) (like alt forms)

19
Q

___ forms reliability provides a measure of test score consistency over two forms of the test. It is appropriate for tests that measure a characteristic that is ___ over time but not for tests that measure characteristics that ___ over time or when exposure to one ___ is likely to affect performance on the other __“__ in an unsystematic way. It is also known as the coefficient of [stability/equivalence/internal consistency].

A

Alternate, equivalent, or parallel forms; stable over time…fluctuate over time (like test-retest); one form…the other form; equivalence

20
Q

Internal Consistency Reliability indicates the degree of consistency across different test items, i.e., indicates the degree to which items in the test measure the s___ c___. It is most appropriate for tests measuring [a single domain/multiple related domains]. It is useful for estimating the reliability of tests that measure characteristics that ___ over time or are susceptible to m___ or p___ effects. It is also known as the coefficient of [stability/equivalence/internal consistency]. (Split-half reliability and coefficient alpha are 2 methods)

A

same characteristic; a single domain; fluctuate over time; memory or practice effects; internal consistency

21
Q

(Internal consistency reliability) As a function of cutting the number of test items in half, split-half reliability tends to [under/over]estimate the test’s reliability, and the S___-Brown prophecy formula is used to correct the coefficient. It is not appropriate for [tests using pictures/speeded tests] because it will overestimate the coefficient.

A

underestimate; Spearman-Brown prophecy formula; speeded tests (Spearman-Brown prophecy formula can also be used to estimate the effect of shortening or lengthening a test on its reliability coefficient in general)

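The Spearman-Brown correction can be sketched in Python (illustrative only; the function name and coefficients are made up):

```python
def spearman_brown(r, n):
    """Spearman-Brown prophecy formula: projected reliability when a
    test is lengthened by a factor of n. n = 2 corrects a split-half
    coefficient back up to the full-length test."""
    return n * r / (1 + (n - 1) * r)

# A split-half coefficient of .60 corrects to .75 for the full test:
print(round(spearman_brown(0.60, 2), 2))  # -> 0.75
# Halving a test (n = 0.5) lowers the projected reliability instead.
```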
22
Q

(Internal consistency reliability) Cronbach’s coefficient alpha is the “___ of all possible split-half correlation coefficients.” The [Spearman-Brown/Kuder-Richardson Formula 20] can be used as a substitute for coefficient alpha when multiple choice test items are scored d___ly

A

mean; Kuder-Richardson Formula 20 (KR-20); dichotomously

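KR-20 can be sketched in Python (illustrative only; this version uses the population variance of total scores, and the data are made up):

```python
def kr20(item_scores):
    """Kuder-Richardson Formula 20 for dichotomously scored items.
    `item_scores` is one list of 0/1 item scores per examinee."""
    n = len(item_scores)
    k = len(item_scores[0])
    totals = [sum(person) for person in item_scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    # Sum of p*q (item difficulty times its complement) across items.
    pq = 0.0
    for j in range(k):
        p = sum(person[j] for person in item_scores) / n
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)

# Hypothetical data: 4 examinees x 3 items.
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(kr20(data), 2))  # -> 0.75
```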
23
Q

Inter-rater reliability is most important for measures that are ___ scored such as essay and projective tests. It can be evaluated through [Cohen’s kappa/Kendall’s coefficient/percent agreement], although this tends to overestimate inter-rater reliability. Between the other two options previously listed, which is for 3+ raters when scores are ranks and which is for 2 raters when scores are nominal variables?

A

subjectively scored; percent agreement; 3+ raters when scores are ranks - Kendall’s coefficient of concordance
2 raters when scores are nominal variables - Cohen’s kappa statistic (k)

(A disadvantage of percent agreement is that it does not take into account the amount of agreement that could have occurred among raters by chance alone, which can provide an inflated estimate of the measure’s reliability.)

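The contrast between percent agreement and Cohen's kappa can be sketched in Python (illustrative only; the ratings are made up):

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters and nominal categories:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    p_obs = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    p_chance = sum((ratings_a.count(c) / n) * (ratings_b.count(c) / n)
                   for c in categories)
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical ratings: percent agreement is 6/8 = .75, but kappa
# corrects for chance agreement and comes out lower.
a = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
b = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
print(cohens_kappa(a, b))  # -> 0.5
```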
24
Q

Consensual observer drift occurs when two or more observers working together influence ___ ___’s ratings on a behavioral rating scale so that they assign ratings in a ___ idiosyncratic way. Consensual observer drift makes the ratings of different raters more ___, which artificially [increases/decreases] inter-rater reliability, but it can be controlled for by [alternating/live-observing] raters.

A

influence each other’s ratings; similar idiosyncratic way; raters more similar, which artificially increases inter-rater reliability; alternating raters

25
Q

Which 2 changes to a test or its scores would be the least likely to lead to an increase in reliability?

A. Decreasing its length
B. Having a wider range of scores
C. Making the content of the test more homogenous
D. Making the content of the test more heterogenous
E. Increasing the difficulty of guessing an item correctly

A

A. and D.

26
Q

The acronym SEM refers to ___ ___ of ___ in test construction, which is used to construct a confidence interval for what purpose? For the 68% confidence interval, how many SEMs are added to and subtracted from the obtained score? How many for 95%? 99%?

A

standard error of measurement; to construct a CI around an obtained test score to estimate the range within which the true score lies; one SEM; two SEMs; three SEMs (SEM = SD*sqrt(1-r)) (The standard error of estimate (SEE) has a slightly different formula (square the r), but using it to construct a CI around a predicted criterion score works the same way.) (SEM for obtained scores; SEE for predicted scores.)

27
Q

What is the formula for the Standard Error of Measurement? (used to obtain a CI around an obtained test score)

A

SD (of the measure) times the square root of 1 minus the reliability coefficient (of the measure) (SEM = SD*sqrt(1-r)) (https://www.statisticshowto.com/standard-error-of-measurement/)
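The SEM formula and its use in a confidence interval can be sketched in Python (illustrative only; the test values are made up):

```python
def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - rxx)."""
    return sd * (1 - reliability) ** 0.5

# Hypothetical test with SD = 15 and reliability rxx = .91:
s = sem(15, 0.91)
print(round(s, 1))  # -> 4.5
# 95% CI around an obtained score of 100 is +/- two SEMs:
print(round(100 - 2 * s), round(100 + 2 * s))  # -> 91 109
```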
28
Q

The maximum value for the standard error of measurement (or of estimate) is the value of the ___ ___ of the test scores.

A

standard deviation (the lowest it can be is 0, when the reliability coefficient is 1.0) (The standard error of estimate ranges from 0 (when the validity coefficient is 1.0) to the standard deviation of the criterion scores (when the validity coefficient is 0).)

29
Q

Validity refers to a test's ___ in terms of the extent to which the test measures what it was designed to measure.
- ___ validity is important for tests designed to measure a specific content or behavior domain (does it represent the __"__ domain accurately?)
- ___ validity is important for tests designed to measure a hypothetical trait or __"__ (does it evaluate the abstract __"__ well?)
- ___-___ validity is important for tests that will be used to predict or estimate an examinee's status on an external __"__ (does it predict scores on other measures well?)

A

accuracy; content; construct (also how a test "behaves" and relates to other variables); criterion-related

30
Q

Content validity is most often [evaluated through statistical procedures/evaluated by domain experts]

A

evaluated by domain experts (who determine if test items are an adequate and representative sample of the content or behavior domain)

31
Q

Assuming no constraints in terms of time, money, or other resources, the best way to demonstrate that a test has adequate reliability is by using which of the following techniques?

A. equivalent (alternate) forms
B. test-retest
C. Cronbach's alpha
D. Cohen's kappa

A

A. equivalent (alternate) forms (The most thorough method for assessing reliability is the one that takes into account the greatest number of potential sources of measurement error. Because equivalent forms reliability takes into account error due to both time and content sampling, it is the most rigorous method for establishing reliability and, consequently, is considered by some experts to be the best method. Alternate forms reliability is often not assessed due to the difficulty of developing forms that are truly equivalent.)

32
Q

[Content/Face/Convergent] validity refers to whether or not test items "look like" they're measuring what the test is designed to measure and is not actually a measure of validity.

A

Face validity (may be useful for helping test-takers trust that the items they are answering are relevant)

33
Q

The multit___-multim___ matrix is a table of correlation coefficients that provides information about a test's convergent and divergent (discriminant) validity, which is relevant for [construct/content/criterion-related] validity, with high correlations on related traits evidencing convergent validity and low correlations on unrelated traits evidencing divergent validity

A

multitrait-multimethod matrix; construct validity (factor analysis also provides information about convergent and divergent validity but is a more complex technique)

34
Q

(multitrait-multimethod matrix for construct validity) Which kind of coefficients indicate the correlation between a measure and itself? Which indicate the correlation between different traits measured by different methods? Which indicate the correlation between the same trait measured by different methods?

A. monotrait-monomethod coefficients
B. monotrait-heteromethod coefficients
C. heterotrait-monomethod coefficients
D. heterotrait-heteromethod coefficients

A

A = measure and itself
D = different traits, different methods (divergent validity)
B = same trait, different methods (convergent validity)
(C is also divergent validity) (Only the correlation coefficients that involve the test in question offer information about that test's convergent or divergent validity)

35
Q

Pick the false statement about criterion-related validity:

A. It is evaluated by correlating scores on the test (predictor) with scores on the criterion for a sample of examinees to obtain a criterion-related validity coefficient
B. May involve a concurrent validity study, which involves obtaining scores on the predictor and criterion at about the same time
C. May involve a predictive validity study, which involves obtaining predictor scores prior to obtaining criterion scores
D. Makes use of a multitrait-multimethod matrix to evaluate convergent and divergent validity

A

D. (this one is for construct validity) (There are two types of criterion-related validity -- predictive and concurrent.)

36
Q

Read: A correlation coefficient can be squared to interpret it only when it represents the correlation between 2 different tests or other variables, providing a measure of shared variability.

A

Terms that suggest shared variability include "accounted for by" and "explained by." A validity coefficient would be interpretable by squaring, but not a reliability coefficient.

37
Q

The procedure for constructing a confidence interval around a predicted criterion score is the same as the procedure for constructing a confidence interval around an obtained test score. While the standard error of ___ is associated with a CI constructed around a predicted criterion score, the standard error of ___ is associated with a CI constructed around an obtained test score.

A

Standard error of estimate (SEE) -> predicted criterion score
Standard error of measurement (SEM) -> obtained test score

38
Q

What is the formula for the Standard Error of Estimate? (used to obtain a CI around a predicted criterion score)

A

SD (of the criterion scores) times the square root of 1 minus the validity coefficient (of the two measures) squared (SEE = SD*sqrt(1-r^2))
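The SEE formula can be sketched in Python for contrast with the SEM (illustrative only; the values are made up):

```python
def see(sd_criterion, validity):
    """Standard error of estimate: SEE = SD * sqrt(1 - rxy**2).
    Unlike the SEM formula, the coefficient is squared here."""
    return sd_criterion * (1 - validity ** 2) ** 0.5

# Hypothetical criterion with SD = 10 and validity rxy = .60:
print(round(see(10, 0.60), 1))  # -> 8.0
```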
39
Q

Between reliability and validity, which is a necessary but not sufficient condition for the other? That is, it places an upper limit on the other.

A

reliability is a necessary but not sufficient condition for validity (i.e., a test can be reliable without being valid, but cannot be valid without being reliable) (rxy <= sqrt(rxx)) (a validity coefficient is always less than or equal to the square root of the test's reliability coefficient)

40
Q

I___ validity refers to the increase in ___-making accuracy that use of a predictor provides. Even when a predictor has a large validity coefficient, it may not increase decision-making accuracy beyond the current level. It is evaluated by comparing the number of ___ decisions made with and without the new predictor.

A

Incremental validity; increase in decision-making accuracy; number of correct decisions made with and without the new predictor

41
Q

In an incremental validity graph (or chart), people who score above the cutoff for the predictor are considered [positives/negatives/trues/falses] while those who score above the criterion cutoff are considered [positives or negatives/trues or falses] depending on their quadrant.

A

above/below predictor cutoff - positives/negatives; above/below criterion cutoff - trues or falses
(Often on a graph, the predictor is on the X axis and the criterion is on the Y axis. Top left = false negatives; top right = true positives; bottom left = true negatives; bottom right = false positives.) (A vertical line on the graph divides positives and negatives for the predictor; a horizontal line divides trues and falses for the criterion.)

42
Q

In an incremental validity graph (or chart), the number of true positives can be increased by [lowering/raising] the predictor and/or criterion cutoff score

A

lowering the predictor and/or criterion cutoff score
(Often on a graph, the predictor is on the X axis and the criterion is on the Y axis. Top left = false negatives; top right = true positives; bottom left = true negatives; bottom right = false positives.) (The number of false negatives increases as the predictor cutoff score is raised (moved to the right in a scatterplot) and when the criterion cutoff score is lowered (moved toward the bottom of the scatterplot). True negatives decrease when the predictor and criterion cutoff scores are both lowered. False positives increase when the predictor cutoff score is lowered and/or the criterion cutoff score is raised.)

43
Q

Read: Incremental validity is calculated by subtracting the base rate from the positive hit rate (i.e., the proportion of people selected using the new predictor who are successful on the criterion minus the proportion of all people who are successful on the criterion without use of the new predictor): true positives/total positives - (true positives + false negatives)/total people

A

Top right divided by the right side, minus the upper half divided by the total
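The incremental validity calculation from this card can be sketched in Python (illustrative only; the quadrant counts are made up):

```python
def incremental_validity(tp, fp, tn, fn):
    """Positive hit rate minus base rate, from the four quadrant
    counts of a predictor/criterion chart."""
    positive_hit_rate = tp / (tp + fp)           # successes among those selected
    base_rate = (tp + fn) / (tp + fp + tn + fn)  # successes among everyone
    return positive_hit_rate - base_rate

# Hypothetical quadrant counts:
print(incremental_validity(tp=30, fp=10, tn=40, fn=20))
# -> .75 hit rate - .50 base rate = 0.25
```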
44
Q

When determining a predictor's incremental validity, the positive hit rate is calculated by dividing the number of true positives by the...

A

total number of positives

45
Q

When determining a predictor's incremental validity, the base rate is calculated by dividing the ___ + ___ by the total number of people

A

true positives + false negatives (the upper 2 quadrants) by the total number of people

46
Q

Bob and Rob each scored well on an assertiveness scale prior to being hired as salespersons. Bob's performance numbers are very high, while Rob's are some of the lowest on company record. Bob has been a [false positive/true negative/true positive] while Rob has been a [false positive/true negative/true positive].

A

Bob - true positive; Rob - false positive

47
Q

When creating a screening test to detect a current diagnosis or other mental health condition, what kind of validity is of most concern?

A. predictive validity
B. concurrent validity
C. content validity
D. construct validity

A

B. concurrent validity (the screener is the predictor while a more thorough evaluation is the criterion, and the screener is for detecting a current status) (predictive validity is closely related but more appropriate for predicting a future development)

48
Q

A scientist develops a predictor of job performance using a particular sample of employees and obtains an adequate validity coefficient. But she knows that she will need to cross-validate with another sample because...

A

shrinkage (a decrease in the validity coefficient) tends to occur between samples because the first test was tailor-made for that sample, and chance factors in that sample are likely not to be present in the second one

49
Q

[Sensitivity/Specificity] refers to the probability that a predictor will correctly identify people with the disorder from the pool of people with the disorder. It is calculated by dividing the number of true positives by the number of true positives plus false negatives. [Sensitivity/Specificity] refers to the probability that a predictor will correctly identify people who do not have the disorder. It is calculated by dividing the number of true negatives by the number of true negatives plus false positives.

A

Sensitivity; Specificity
(The positive predictive value (PPV) indicates the probability that people who test positive have the disorder. It is calculated by dividing the number of true positives by the number of true and false positives. The negative predictive value (NPV) indicates the probability that people who test negative do not have the disorder. It is calculated by dividing the number of true negatives by the number of true and false negatives.)
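The four indices from this card can be sketched in Python (illustrative only; the cell counts are made up):

```python
def diagnostic_indices(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, and NPV from the four cells
    of a screening outcome table."""
    return {
        "sensitivity": tp / (tp + fn),  # hit rate among the disordered
        "specificity": tn / (tn + fp),  # hit rate among the healthy
        "ppv": tp / (tp + fp),          # P(disorder | positive test)
        "npv": tn / (tn + fn),          # P(no disorder | negative test)
    }

# Hypothetical screener: 40 TP, 10 FN, 80 TN, 20 FP.
idx = diagnostic_indices(tp=40, fp=20, tn=80, fn=10)
print(idx["sensitivity"], idx["specificity"])  # -> 0.8 0.8
```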
50
Q

In a criterion-related validity study, when a cutoff score on a test is lowered, more applicants will be selected, and the primary goal would usually be to [increase true positives/decrease false negatives]

A

decrease false negatives (This means more true positives may be selected (good performers who previously fell just below the old cutoff), but also more false positives (poor performers who now slip through). False negatives are individuals who would have succeeded on the job but were rejected under the old, higher cutoff; lowering the cutoff gives these people a chance, reducing the false negative rate.)

51
Q

(Test score interpretation) an examinee's raw score is often difficult to interpret unless it is anchored to the performance of other examinees, like a standardized group sample [___-referenced interpretation], or to a predefined standard of performance [___-referenced interpretation]

A

norm-referenced interpretation; criterion-referenced interpretation

52
Q

Percentile ranks (PR) and standard scores are associated with [criterion-referenced score interpretation/norm-referenced score interpretation] as opposed to percentage scores, regression equations, and expectancy tables.

A

norm-referenced score interpretation (criterion-referenced for the latter methods)

53
Q

(Test score interpretation) One unique thing about the distribution of percentile scores is that it is always ___ (rectangular) regardless of the shape of the raw score distribution, because percentile ranks (PR) are ___ly distributed throughout the range of scores, meaning that the number of scores falling between two percentile ranks with a certain difference is the same as between any other two percentile ranks with the same difference (i.e., 10% of scores between PRs of 50 and 60 and 10% between PRs of 85 and 95). Because the transformation changes the shape of the original raw score distribution, it is categorized as a [linear/nonlinear] transformation.

A

flat; evenly distributed; nonlinear
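A percentile rank calculation can be sketched in Python (illustrative only; this uses one common definition of PR, and the norm sample is made up):

```python
def percentile_rank(norm_scores, score):
    """Percentile rank under one common definition: the percent of
    scores in the norm sample falling below the given score."""
    below = sum(s < score for s in norm_scores)
    return 100 * below / len(norm_scores)

# Hypothetical norm sample (note the pile-up in the middle):
sample = [2, 4, 4, 5, 5, 5, 6, 6, 8, 9]
print(percentile_rank(sample, 6))  # -> 60.0
```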
54
Q

(Test score interpretation) A limitation of percentile ranks (PR) is that they indicate an examinee's relative position in a distribution but do not provide information about ___ differences between examinees in terms of their raw scores, i.e., percentile ranks provide limited information about meaningful differences in raw scores between examinees

A

absolute

55
Q

Percentile ranks (PR) maximize differences in the ___ of the raw score distribution and minimize differences at the ___s. This general rule means that percentile ranks near the __"__ will be affected more by a given change in raw scores than percentile ranks near the __"__s.

A

middle of the raw score distribution...differences at the extremes; near the middle...near the extremes (Since most of the scores are "piled up" near the center of the distribution, an increase in raw score will position an examinee above a larger number of examinees than the same increase would for someone at the extremes) ("more change in the middle")

56
Q

The ___-score distribution has a mean of 0 and an SD of 1. How is a __"__-score calculated using the mean of the distribution, its SD, and the raw score? Meanwhile, a ___-score distribution has a mean of 50 and an SD of 10

A

Z-score; a Z-score is calculated by subtracting the mean of the distribution from the examinee's score to obtain a deviation score and dividing the deviation score by the distribution's standard deviation; T-score distribution (deviation IQ scores have a mean of 100 and an SD of 15) (Stanine scores divide a distribution of scores into nine parts and have a mean of 5 and a standard deviation of approximately 2.)
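The Z- and T-score conversions can be sketched in Python (illustrative only; the raw score and distribution values are made up):

```python
def z_score(raw, mean, sd):
    """Z-score: the deviation score divided by the SD (mean 0, SD 1)."""
    return (raw - mean) / sd

def t_score(raw, mean, sd):
    """T-score: a Z-score rescaled to mean 50, SD 10."""
    return 50 + 10 * z_score(raw, mean, sd)

# Hypothetical raw score of 65 on a test with mean 50 and SD 10:
print(z_score(65, 50, 10))  # -> 1.5
print(t_score(65, 50, 10))  # -> 65.0
```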
57
Q

Which 2 of the following methods for test score interpretation are associated with criterion-referenced interpretation?

A. percentage
B. percentile rank (PR)
C. percent correct

A

A. and C. (which are synonymous) (B. is associated with norm-referenced interpretation) (a certain % correct is usually set as the cutoff for interpretation as the criterion for "success")

58
Q

Which of the following is not a common purpose for factor analysis?

A. Grouping a large number of test items into subtests
B. Testing hypotheses about how test items or subtests relate to one another
C. Evaluating test construct validity (divergent and convergent validity)
D. Estimating the appropriate length of a test

A

D.

59
Q

Read: Steps in a factor analysis (used for both a group of multiple test items and an array of different tests)

A

1. Administer tests to a sample of examinees (along with other tests of the construct)
2. Derive and interpret the correlation matrix (which tests are related to one another?)
3. Extract the initial factor matrix (hard to interpret, so do step 4)
4. Rotate the factor matrix
5. Name the factors (determine subtests/subscales)

60
Q

(Factor Analysis) On a factor matrix, the inside boxes are correlation coefficients called factor ___s that indicate the degree of association between each item (or test) and each factor. When you [square/directly interpret] the factor __"__, you determine the amount of variability in item scores that is accounted for (explained) by the factor, e.g., a correlation of .5 would be interpreted as...

A

factor loadings; square the factor loading; .5 squared is .25, and so 25% of the variability in the item is explained by the given factor

61
Q

(Factor Analysis) On a factor matrix (assuming an orthogonal rotation as opposed to an oblique one), communality is the far right column, which is the total amount of variability in the scores on a given ___ that is accounted for by all the identified factors. Explain the process for calculating communality for a given __"__.

A

given item/test/variable; sum the squares of each factor loading
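The communality calculation from this card can be sketched in Python (illustrative only; the loadings are made up):

```python
def communality(loadings):
    """Communality under an orthogonal rotation: the sum of the
    variable's squared factor loadings across all factors."""
    return sum(l ** 2 for l in loadings)

# Hypothetical test loading .5 on Factor I and .3 on Factor II:
print(round(communality([0.5, 0.3]), 2))  # -> 0.34
```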
62
Q

(Factor Analysis) On a factor matrix, the rotation of a factor matrix can be orthogonal or oblique. Which means the factors are correlated and which means uncorrelated? A researcher decides which is appropriate based on their ___ about the characteristics measured by the tests included in the analysis.

A

orthogonal means uncorrelated, while oblique means correlated; theory/prior research (e.g., if two factors are assumed to be related, then use an oblique rotation) (Communalities can be calculated from orthogonally rotated matrices but not from obliquely rotated ones, since with correlated factors the squared loadings no longer sum to the communality)

63
Q

In a factor analysis, between communality and specificity, which is the amount of true score variability that has been explained by the factor analysis, and which is the amount of true score variability that has NOT been explained by the factor analysis?

A

Communality is the former (amount explained) and specificity is the latter (amount NOT explained)

64
Q

Eigenvalues are associated with which 2 of the following:

A. internal consistency reliability
B. criterion-referenced interpretation
C. the multitrait-multimethod matrix
D. principal component analysis
E. factor analysis

A

D. and E. (An eigenvalue indicates the total amount of variability in a set of tests or other variables that is explained by an identified component or factor. Eigenvalues can be calculated for each component "extracted" in a principal component analysis.)

65
Q

___ refers to the extent to which individual test items contribute to the overall purpose of the test (not validity or reliability)

A

Relevance (Relevance is determined by judging the extent to which each test item assesses the target content or behavior domain and does so at the appropriate ability level.)

66
Q

The [correction for attenuation formula/Spearman-Brown prophecy formula] is used to estimate the predictor's validity coefficient if the predictor and/or criterion were perfectly reliable. It might be used to determine the impact of increasing the reliability of a test on the test's validity.

A

correction for attenuation formula

67
Q

To maximize the inter-rater reliability of a behavior observation scale, coding categories must be d___ and mutually ___.

A

discrete and mutually exclusive (this allows for better operationalization of variables)

68
Q

Content ___ing refers to the extent to which test scores depend on factors specific to the particular items included in the test (i.e., to its content). It is not a potential source of error for which of the following methods of evaluating a test's reliability?

A. Coefficient alpha
B. Test-retest
C. Split-half

A

Sampling; B. (Because test-retest reliability involves administering the same test (i.e., the same content) twice, content sampling is not a source of error.)

69
Q

When a test user uses a correction for guessing formula that involves subtracting points from each examinee's score, the resulting distribution of scores will have a [smaller/larger] mean and [smaller/larger] SD.

A

smaller mean and larger SD
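Why the mean shrinks and the SD grows can be sketched in Python (illustrative only; this assumes the common R - W/(k-1) penalty, and the scores are made up):

```python
def corrected(right, wrong, options=4):
    """One common correction-for-guessing formula: R - W/(k - 1),
    where k is the number of answer options."""
    return right - wrong / (options - 1)

# Hypothetical (right, wrong) counts for three examinees: the
# penalty subtracts more from low scorers (who have more wrong
# answers), so the mean drops and the scores spread out.
raw = [(20, 0), (15, 5), (10, 10)]
print([round(corrected(r, w), 2) for r, w in raw])  # -> [20.0, 13.33, 6.67]
```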