Chapter 5 - Reliability Flashcards

(45 cards)

1
Q

What are some cultural considerations in test construction and standardization?

A

1- make sure test norms are appropriate for the targeted test taker population
2- understand the culture and time period of the test taker when interpreting results
3- context should be taken into account

2
Q

What is reliability?

A

The consistency in measurement
I.e. how much of the observed variance is due to actual variance among the true scores

3
Q

What is a reliability coefficient?

A

An index of reliability
Ratio between the true score variance and the total variance
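In symbols, a minimal illustration with made-up variance values, assuming the classical decomposition observed variance = true variance + error variance:

true_score_variance = 8.0
error_variance = 2.0

observed_variance = true_score_variance + error_variance  # var(X) = var(T) + var(E)
reliability = true_score_variance / observed_variance     # ratio of true to total variance
print(reliability)  # 0.8 -> 80% of the observed variance reflects real differences among people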

4
Q

What is an observed score comprised of?

A

The “true” score plus error

5
Q

What is error?

A

A component of the observed score that doesn’t have to do with the trait being measured

6
Q

What is measurement error?

A

Factors associated with the process of measurement that don’t have to do with the variable being measured (i.e. interference)

7
Q

What are the types of measurement error?

A

Random: unpredictable fluctuations or inconsistencies (i.e. noise)
Systematic: a constant or proportional source of error that skews scores in the same direction every time

8
Q

Where might error be introduced into the assessment process?

A

1 - Construction of a test
2 - Test administration
3 - Interpretation
4 - Sampling

9
Q

How might error be introduced when designing a test?

A

Variance existing within the items on the test
I.e. issues with the phrasing, item sampling, content sampling, examples given, etc.

10
Q

How might error be introduced during test administration?

A

Setting, materials, environment
Test-taker: sleep, physical discomfort, personal situations
Test examiner: appearance, demeanor, level of training

11
Q

What is methodological error?

A

Issues with training, ambiguous wording, biased framing of questions, etc.
Tends to be more systematic than random

12
Q

How might error be introduced during interpretation?

A

Subjectivity by assessors (especially during behavioral assessment)

13
Q

What is sampling error?

A

When the sample isn’t actually representative of the population

14
Q

What is Test-Retest reliability?

A

Having the same test-takers take the same test under two different administrations (some time has passed)
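A minimal Python sketch, assuming hypothetical scores from two administrations of the same test; the test-retest coefficient is simply the Pearson correlation between the two sets of scores:

import numpy as np

# Hypothetical scores for the same five test-takers at time 1 and time 2
time1 = np.array([12, 18, 25, 30, 22])
time2 = np.array([14, 17, 27, 29, 21])

r_test_retest = np.corrcoef(time1, time2)[0, 1]  # Pearson r between the two administrations
print(round(r_test_retest, 2))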

15
Q

When is it appropriate to use test-retest as an estimate of reliability? When is it NOT appropriate?

A

Appropriate when the variable we’re testing is supposed to be stable over time (ex: personality)
NOT appropriate when the variable is expected to change over time (ex: mood)

16
Q

What is the coefficient of stability?

A

A type of test-retest in which the time interval between administrations is 6 months or more

17
Q

What is Parallel-Forms or Alternate-Forms reliability?

A

Having the same group of test-takers take two different versions of a test (typically form A and form B, one right after the other)

18
Q

What is the purpose of Parallel-Forms and Alternate-Forms?

A

To find the Coefficient of Equivalence, or the degree to which two versions/forms of a test (meant to measure the same construct) are similar or equivalent to each other

19
Q

When is Parallel-Forms or Alternate-Forms reliability appropriate to use?

A

When we have two versions of a test created to measure the same construct

20
Q

What is a Parallel Form? What does this mean for the coefficient of equivalence?

A

Each version of the test produces EQUAL means and variances
This results in a higher coefficient of equivalence

21
Q

What is an Alternate Form? What does this mean for the coefficient of equivalence?

A

Each version of the test has similar item content and difficulty, but they don’t meet the strict requirements for a parallel form (same mean and variance)
This results in a lower coefficient of equivalence

22
Q

What is Split-Half Reliability?

A

Divide a single test administration into two halves (by items), score each half, then look at the correlation (Pearson r) between the two sets of half-test scores

23
Q

What is the Spearman-Brown formula?

A

Used to adjust the split-half correlation upward to estimate the reliability of the full-length test (each half is only half as long as the whole test, so the uncorrected correlation underestimates reliability)
Also lets a test developer estimate how the internal consistency of a test would change if it were lengthened or shortened
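A rough sketch tying the last two cards together, using made-up item scores: split the items into odd and even halves, correlate the half scores, then apply the Spearman-Brown correction r_full = 2 * r_half / (1 + r_half) to estimate full-length reliability:

import numpy as np

# Hypothetical Likert-style item scores (1-5): rows = test-takers, columns = items
items = np.array([
    [4, 3, 5, 4, 2, 4],
    [2, 2, 3, 1, 2, 2],
    [5, 4, 4, 5, 4, 5],
    [3, 3, 2, 3, 3, 2],
    [4, 5, 4, 4, 5, 5],
])

odd_half = items[:, 0::2].sum(axis=1)   # score on odd-numbered items
even_half = items[:, 1::2].sum(axis=1)  # score on even-numbered items

r_half = np.corrcoef(odd_half, even_half)[0, 1]  # split-half correlation
r_full = (2 * r_half) / (1 + r_half)             # Spearman-Brown corrected estimate
print(round(r_half, 2), round(r_full, 2))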

24
Q

When is Split-Half reliability appropriate to use?

A

When the items on the test are linear and continuous (dichotomous items require a different formula)

25
Q

What is inter-item consistency? How is it affected by the operational definition of the construct?

A

The degree of relatedness between items within a test
Gauges how homogeneous the test is, and thus how narrow the construct is
Define a construct more broadly = more heterogeneous test items = less inter-item consistency (this is NOT necessarily a bad thing)

26
Q

What is the Kuder-Richardson formula 20?

A

Used to determine inter-item consistency for tests that contain dichotomous items such as true/false or yes/no
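A hedged sketch of the KR-20 computation on made-up dichotomous (0/1) item data; k is the number of items, p and q are the proportions passing and failing each item, and the total-score variance is taken across test-takers:

import numpy as np

# Hypothetical right/wrong (1/0) responses: rows = test-takers, columns = items
responses = np.array([
    [1, 1, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 1, 0, 1],
])

k = responses.shape[1]                   # number of items
p = responses.mean(axis=0)               # proportion passing each item
q = 1 - p                                # proportion failing each item
total_var = responses.sum(axis=1).var()  # variance of total scores (population form)

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(round(kr20, 2))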
27
Q

What is the coefficient alpha?

A

The mean of all possible split-half correlations, corrected using the Spearman-Brown formula
Most popular approach to measuring internal consistency

28
Q

What are the measures for a coefficient alpha? How is this affected by the homogeneity of the construct?

A

Values range from 0 to 1, with 0.7 and above generally considered acceptable
Homogeneous (narrow) constructs have higher coefficient alphas, while heterogeneous (broad) constructs have lower ones
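A rough sketch of coefficient alpha on made-up item scores; alpha generalizes KR-20 to non-dichotomous items, using item variances and total-score variance:

import numpy as np

# Hypothetical Likert-style item scores: rows = test-takers, columns = items
scores = np.array([
    [4, 5, 3, 4],
    [2, 2, 3, 2],
    [5, 4, 4, 5],
    [3, 3, 2, 3],
    [4, 4, 5, 4],
])

k = scores.shape[1]
item_vars = scores.var(axis=0)        # variance of each item across test-takers
total_var = scores.sum(axis=1).var()  # variance of total scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(round(alpha, 2))  # values of roughly 0.7+ are usually treated as acceptable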
29
Q

Why might a coefficient alpha over 0.9 not be a good thing?

A

It might mean that test items are too redundant

30
Q

When does internal consistency need to be established?

A

EACH time that a test is administered

31
Q

What is inter-rater reliability? When is it appropriate to determine?

A

The degree of agreement or consistency between two or more scorers, judges, or raters
Often determined for behavioral measures (but is not appropriate for self-reports)
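A minimal sketch of one common way to quantify inter-rater reliability, using made-up categorical codes from two raters; simple percent agreement is shown here, and chance-corrected indices such as Cohen's kappa are also widely used:

# Hypothetical behavioral codes assigned by two raters to the same ten observations
rater_a = ["on-task", "off-task", "on-task", "on-task", "off-task",
           "on-task", "on-task", "off-task", "on-task", "on-task"]
rater_b = ["on-task", "off-task", "on-task", "off-task", "off-task",
           "on-task", "on-task", "off-task", "on-task", "on-task"]

agreements = sum(a == b for a, b in zip(rater_a, rater_b))
percent_agreement = agreements / len(rater_a)
print(percent_agreement)  # 0.9 -> the raters agreed on 9 of 10 observations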
32
Q

How does the nature of a test determine the reliability metric that we use? (5 ways)

A

1- Homogeneous or heterogeneous
2- Dynamic or static
3- Range restricted or not
4- Speed or power test
5- Criterion-referenced or not

33
Q

What is a speed test used for? How does this affect internal consistency?

A

Meant to determine how many items you can get done in a particular amount of time
Since some items won't even be answered, internal consistency is not appropriate

34
Q

What is a power test used for? How does this affect internal consistency?

A

Designed to get harder and harder until you cannot answer the questions accurately anymore
Since you WILL eventually fail it, internal consistency is not appropriate

35
Q

What does Classical Test Theory (CTT) say about test scores?

A

There is a true score that genuinely reflects an individual's ability level as measured by a particular test

36
Q

What are potential advantages and disadvantages of CTT?

A

Advantage: CTT assumptions are more readily met than other models
Disadvantages:
1- the assumption of CTT that test items are equivalent in their ability to measure the construct may be problematic
2- may yield very long tests

37
Q

What is Generalizability Theory?

A

States that a person's test scores may vary depending on variables in the testing situation and sampling (i.e. the testing environment)

38
Q

What is Item-Response Theory (IRT)?

A

Provides a way to model the probability that a person with X ability will be able to perform at a level of Y
In other words, it views someone's ability as a prediction of how they'll perform on a test

39
Q

What are the potential advantages and disadvantages of IRT?

A

Advantages:
1- can provide a lot of information about the usefulness of certain items (via discrimination and difficulty)
2- can be more powerful in its predictions
Disadvantage: it uses more advanced statistical methods with more stringent requirements

40
Q

What is item discrimination (a)?

A

The degree to which an item differentiates between people with higher and lower levels of a trait
In other words, it's a measure of how good an item is at distinguishing between two groups

41
Q

What is item difficulty (b)?

A

How difficult an item is to accomplish, solve, or comprehend
Indicates where on the continuum of the trait the item gives the best information
In other words, where on the scale of the trait does this item differentiate
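A hedged sketch of how discrimination (a) and difficulty (b) fit into one common IRT model, the two-parameter logistic, with illustrative parameter values; it gives the probability that a person at ability level theta answers the item correctly:

import math

def prob_correct(theta, a, b):
    """Two-parameter logistic (2PL) IRT model: P(correct | ability theta)."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# Hypothetical item: moderately discriminating (a = 1.2), average difficulty (b = 0.0)
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(prob_correct(theta, a=1.2, b=0.0), 2))
# Probability rises with ability; a larger a makes the curve steeper around b,
# so the item separates people above and below that trait level more sharply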
42
Q

What is the Standard Error of Measurement? What is it used for?

A

A measure of the precision of an observed test score
In other words, an estimate of the amount of error inherent in the observed score
Can be used to estimate the extent to which an observed score deviates from the true score in CTT
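The usual CTT formula, sketched with hypothetical numbers: SEM = SD x sqrt(1 - reliability).

import math

sd = 15.0           # hypothetical standard deviation of scores on the test
reliability = 0.91  # hypothetical reliability coefficient for the test

sem = sd * math.sqrt(1 - reliability)  # standard error of measurement
print(round(sem, 1))  # 4.5 -> typical size of the gap between observed and true scores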
43
Q

What is a Confidence Interval? How do we calculate one?

A

A range of test scores around the observed score that is likely to contain the true score
Observed score +/- (z-score for the chosen confidence level x SEM)
As a general rule in psych, we use a confidence level of 95%, or a z-score of about 2 (1.96)
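Continuing the sketch above with the same hypothetical numbers: a 95% confidence interval around an observed score of 110, using z ≈ 1.96.

observed = 110.0
z_95 = 1.96   # z-score for a 95% confidence level
sem = 4.5     # hypothetical standard error of measurement from the previous sketch

lower = observed - z_95 * sem
upper = observed + z_95 * sem
print(round(lower, 1), round(upper, 1))  # about 101.2 to 118.8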
44
Q

What is Standard Error of the Difference?

A

Used to determine how large a difference between test scores should be before it's considered statistically significant
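An illustrative sketch of the standard formula, assuming hypothetical SEMs for the two scores being compared: SED = sqrt(SEM1^2 + SEM2^2); a difference needs to exceed roughly z x SED to be treated as significant.

import math

sem_1 = 4.5  # hypothetical SEM of the first score
sem_2 = 3.0  # hypothetical SEM of the second score

sed = math.sqrt(sem_1 ** 2 + sem_2 ** 2)
print(round(sed, 2))         # about 5.41
print(round(1.96 * sed, 2))  # difference needed for significance at the 95% level: about 10.6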
45
Q

What can standard error of the difference be used to measure? (3 things)

A

1- one person's performance on test 1 vs test 2
2- one person's performance on test 1 vs another person's performance on test 1
3- one person's performance on test 1 vs another person's performance on test 2