Construct
Hypothetical factor that cannot be observed directly; its existence is inferred from certain behaviors and assumed to follow from certain circumstances
Operational Definition
A way of attaching a system of measurement that can be replicated and that serves as a faithful PROXY of the construct.
We use different kinds of measures (e.g. questionnaires, tests) to act as faithful proxies of what we really want to measure
Reliability
Reproducibility of a measurement; the extent to which measures of the same phenomenon are consistent and repeatable; measures that are high in reliability contain minimal measurement error.
Validity
The extent to which a measure of a construct truly measures that construct and not something else.
Classical Test Theory
AKA “Classical Reliability Theory”
1. True score is constant
2. Error is random
3. Correlation between the true scores and error is 0
4. Correlation between errors on different measurement occasions is also zero (because error is assumed to be random)
X(observed score) = TRUE SCORE + ERROR
Random Error and True Score
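A minimal sketch in R (simulated data; all variable names and values are assumed for illustration) of the X = TRUE SCORE + ERROR decomposition and of the assumptions that error is random and uncorrelated with the true scores:
set.seed(1)
n_people   <- 200
true_score <- rnorm(n_people, mean = 70, sd = 10)   # each person's constant true score
error      <- rnorm(n_people, mean = 0, sd = 5)     # random error, unrelated to the true scores
observed   <- true_score + error                    # X (observed score) = TRUE SCORE + ERROR
mean(error)              # near 0: error is random
cor(true_score, error)   # near 0: assumption 3
var(observed)            # roughly var(true_score) + var(error), since the errors are random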
Generalizability Theory
Error is separated into pieces, each of which can be estimated (if we collect the data properly)
1. To better understand which aspect of the measurement is producing the most error, so we can understand it better.
2. Quickly identify where the error is coming from and determine how good the measure is
3. Explicitly connects measurement operations to the purpose of measurement
Random Error, True Score, Test-Retest Error, Rater Error, and Other identifiable sources of Error.
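A minimal sketch in R (simulated data; names and variance values are assumed) of the core idea that, unlike in classical test theory, the error is split into separately estimable pieces such as rater error and test-retest error:
set.seed(2)
n <- 500
true      <- rnorm(n, mean = 50, sd = 10)   # stable true scores
rater_err <- rnorm(n, sd = 3)               # error attributable to raters
time_err  <- rnorm(n, sd = 2)               # error attributable to test-retest occasions
other_err <- rnorm(n, sd = 4)               # other identifiable / leftover random error
observed  <- true + rater_err + time_err + other_err
# Total error variance decomposes (approximately) into its separate pieces
var(observed - true)
var(rater_err) + var(time_err) + var(other_err)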
Temporal Reliability
Reproducibility of values of a variable when you measure the same subjects two or more times. A form of reliability in which a test is administered on two separate occasions and the correlation between the scores is calculated.
1. Test-Retest
2. Parallel Test
Test-Retest
Administer the same test two or more different times and calculate the correlation between the scores. This helps us understand how stable the test is over time.
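A minimal sketch in R (simulated scores; names and values assumed): the same test given twice to the same people, with the correlation between the two administrations serving as the test-retest reliability coefficient:
set.seed(3)
time1 <- rnorm(50, mean = 100, sd = 15)        # scores at the first administration
time2 <- time1 + rnorm(50, mean = 0, sd = 5)   # the same people retested, plus some error
cor(time1, time2)                              # test-retest reliability coefficient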
Parallel Test
Giving two different versions of a test that are supposed to assess the same construct (or constructs), then finding the correlation between scores on the two versions.
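The analogous sketch in R for parallel forms (again simulated; names and values assumed): two versions of a test built to assess the same construct, given to the same people and correlated:
set.seed(4)
ability <- rnorm(50)                               # the construct both forms are meant to assess
form_a  <- 100 + 15 * ability + rnorm(50, sd = 5)  # version A of the test
form_b  <- 100 + 15 * ability + rnorm(50, sd = 5)  # version B of the test
cor(form_a, form_b)                                # parallel-forms reliability coefficient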
Internal Consistency Reliability
Measures the extent to which the items within a measure yield consistent scores each time it is administered, all else being equal.
(The reliability of a 10-item test would be higher than that of 5 similar items.)
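The parenthetical above can be made concrete with the Spearman-Brown prophecy formula, a standard result predicting how reliability changes when a test is lengthened (the starting reliability of 0.70 below is an assumed value):
# Spearman-Brown: reliability of a test lengthened by a factor k
spearman_brown <- function(r, k) (k * r) / (1 + (k - 1) * r)
r_5items <- 0.70                  # assumed reliability of a 5-item test
spearman_brown(r_5items, k = 2)   # predicted reliability of 10 similar items (about 0.82)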
Split-half reliability
A form of reliability in which one half of the items on a test are correlated with the remaining items.
For example, items drawn from one already established tool can be combined with items from another established tool, and the reliability of the merged set determined in this way.
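A minimal sketch in R (simulated item responses; names assumed): the odd-numbered items are correlated with the even-numbered items, and the Spearman-Brown correction is applied because each half is only half the length of the full test:
set.seed(5)
n_people <- 100; n_items <- 10
ability <- rnorm(n_people)
items   <- sapply(1:n_items, function(i) ability + rnorm(n_people, sd = 1))  # item scores
odd_half  <- rowSums(items[, seq(1, n_items, by = 2)])   # score on odd items
even_half <- rowSums(items[, seq(2, n_items, by = 2)])   # score on even items
r_half <- cor(odd_half, even_half)                       # split-half correlation
(2 * r_half) / (1 + r_half)                              # Spearman-Brown corrected full-test reliability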
Cronbach’s Alpha
Measures the internal reliability within a test by comparing each item to all the other items.
Each item is compared with every other item to ensure that the rating scale is consistent.
Used with Likert-scale-type items (e.g., ratings of severity ranging from 1-10).
See formula in notes (can be calculated in R)
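A minimal sketch of that calculation in R (simulated Likert-type items scored 1-10; all names assumed), using the standard formula alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores); the psych package's alpha() function computes the same statistic on real data:
set.seed(6)
n_people <- 200; k <- 10
ability <- rnorm(n_people)
# simulated Likert-type items, clamped to the 1-10 range
items <- sapply(1:k, function(i) pmax(1, pmin(10, round(5.5 + 2 * ability + rnorm(n_people, sd = 1.5)))))
item_vars <- apply(items, 2, var)   # variance of each item
total_var <- var(rowSums(items))    # variance of the total scores
alpha <- (k / (k - 1)) * (1 - sum(item_vars) / total_var)
alpha
# psych::alpha(items)               # the psych package computes the same statistic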
Kuder-Richardson coefficient (KR-20)
Variables that are dichotomous in nature (yes/no or true/false)
Use the KR-20 equation when the items are dichotomous rather than continuous in nature
See notes for formula
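A minimal sketch of KR-20 in R (simulated true/false items; names assumed), using the standard formula KR-20 = (k / (k - 1)) * (1 - sum(p * q) / variance of total scores), where p is the proportion answering each item yes/true and q = 1 - p:
set.seed(7)
n_people <- 200; k <- 20
ability <- rnorm(n_people)
items <- sapply(1:k, function(i) as.integer(ability + rnorm(n_people) > 0))  # dichotomous 0/1 items
p <- colMeans(items)               # proportion endorsing each item
q <- 1 - p
total_var <- var(rowSums(items))   # variance of the total scores
kr20 <- (k / (k - 1)) * (1 - sum(p * q) / total_var)
kr20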
Interrater Reliability
The reliability of a test’s results across multiple test administrators (that is, the stability of test results for the same test takers but across different test administrators or scorers)
Can be determined by:
(number of agreements) / (number of agreements + disagreements)
Intraobserver or Interrater Reliability
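A minimal sketch in R (assumed ratings from two hypothetical raters) of the agreement calculation above:
rater1 <- c("yes", "no", "yes", "yes", "no", "yes", "no",  "yes")
rater2 <- c("yes", "no", "no",  "yes", "no", "yes", "yes", "yes")
agreements    <- sum(rater1 == rater2)
disagreements <- sum(rater1 != rater2)
agreements / (agreements + disagreements)   # proportion of agreement (here 6/8 = 0.75)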
Cohen’s Kappa
Used, for example, to determine agreement between diagnoses made by two doctors
K = (Po - Pe) / (1 - Pe)
WHERE
Po = proportion of observed agreements between the raters
Pe = probability of agreement occurring by chance
How do you calculate the probability of agreement by chance?
Probability that the raters would agree simply by random chance:
Calculated by considering the marginal totals (row and column totals) in a contingency table: for each category, multiply the two raters’ marginal proportions (marginal total divided by the total N) and sum these products across categories
Cohen’s Kappa is used for nominal or other categorical data
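A minimal sketch in R (hypothetical diagnoses from two doctors) putting the pieces together: Po from the diagonal of the contingency table, Pe from the products of the marginal totals, and K = (Po - Pe) / (1 - Pe):
doctor1 <- c("depressed", "depressed", "healthy", "healthy", "depressed", "healthy", "healthy",   "depressed")
doctor2 <- c("depressed", "healthy",   "healthy", "healthy", "depressed", "healthy", "depressed", "depressed")
tab <- table(doctor1, doctor2)                        # contingency table of the two raters
po  <- sum(diag(tab)) / sum(tab)                      # Po: observed agreement (the diagonal)
pe  <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # Pe: chance agreement from the marginal totals
(po - pe) / (1 - pe)                                  # Cohen's Kappa (here 0.5)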
Standards of Reliability
Factors that affect reliability
Item Response Theory (IRT)
A framework for evaluating test takers’ responses to specific test items and, through those responses, understanding their underlying abilities or traits. LATENT TRAIT
Examples:
Adaptive Testing: the computer-administered SAT, for example, is adaptive in nature, with questions becoming progressively harder or easier based on responses to the previous questions
Clinical Assessment: Often focused on finding solid diagnostic items which could include the following questions from PHQ-9:
-PHQ-9, Item5, Poor appetite or overeating
-PHQ-9, Item9, Thoughts you would be better off dead, or of hurting yourself
Item Response Theory parameters on which items are characterized include: item discrimination (a), item difficulty (b), and guessing (c).
With all three of these parameters, IRT can reliably estimate ability at low, middle, and high levels of ability.
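A minimal sketch in R (assumed parameter values) of the three-parameter logistic model behind these ideas: the probability of answering or endorsing an item given ability theta, discrimination a, difficulty b, and guessing c:
# 3PL item response function: P(correct | theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))
p_3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))
theta <- seq(-3, 3, by = 1)              # test takers of low, middle, and high ability
p_3pl(theta, a = 1.5, b = 0.0, c = 0.2)  # an item of moderate difficulty
p_3pl(theta, a = 1.5, b = 1.5, c = 0.2)  # a harder item: only high-ability takers are likely to pass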