Week 4: Reliability Flashcards by avryl b

Give an example of a norm-referenced test.

An IQ test, because each score is interpreted relative to the average of all test-takers.

Give an example of a criterion-referenced test.

A quiz where the answers are compared to a set of correct answers.

What are the advantages and disadvantages of using a norm-referenced or criterion-referenced test?

Norm-referenced tests have two advantages. Scores are protected from the test being easier or harder than intended (if an examiner in a bad mood writes a very hard examination, a lower proportion of students will fail it than would under fixed pass marks), and the test yields a good distribution of scores, which makes it possible to discriminate between good and bad performers. The disadvantage of norm-referenced tests is that the norm can change, and scores change with it (if the norm group is very able, a student is less likely to get the grade they deserve). The advantages of criterion-referenced tests are that scores do not change if the norm changes, and that absolute standards can be set based on what people can actually do. The disadvantages of criterion-referenced tests are that scores are affected by test difficulty, and that there is a larger risk of floor or ceiling effects.

In a criterion-referenced course, the proportion of higher grades is found to increase from a previous semester. What three reasons could account for this?

The students got smarter, the assessment was easier than previously, or the teaching of the course was better.

What is classical test theory?

Classical test theory is the traditional conceptual basis of psychometrics. It is the idea that every measurement we take can be decomposed into two parts: a true score (the underlying thing we are trying to measure) and measurement error.

What is true score theory?

Another name for classical test theory

What is reliability in terms of the relationship between true and total test score variability?

Reliability is the ratio of true score variability to total (observed) score variability.
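
As a minimal sketch of this ratio, assume the classical test theory decomposition X = T + E, with error uncorrelated with the true score so that the variances add. The function name and the example variances below are illustrative and not taken from the course.

```python
def reliability(true_variance: float, error_variance: float) -> float:
    """Reliability = true score variance / total (observed) score variance.

    Assumes X = T + E with error uncorrelated with the true score,
    so total variance = true variance + error variance.
    """
    total_variance = true_variance + error_variance
    if total_variance <= 0:
        raise ValueError("total variance must be positive")
    return true_variance / total_variance


# Hypothetical example: true score variance 9, error variance 1
print(reliability(9.0, 1.0))  # 0.9
```

The more of the total variance that is error, the lower the ratio; with no error variance at all it would equal 1.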

List the various sources of measurement error.

Test construction, test administration, test scoring, and other influences.

What is item sampling?

Item sampling is where a small selection of questions is asked out of a great range of possibilities. It is a source of measurement error that arises from test construction.

What is content sampling?

A source of measurement error that arises from test construction: the test covers only a sample of the content that could have been assessed.

What is domain sampling?

A source of measurement error that arises from test construction: the items are only a sample from the whole domain of possible items.

Why can we only estimate the reliability of a test and not measure it directly?

Reliability is defined in terms of the true score, which is a hypothetical construct that can never be observed directly. We therefore cannot say for certain what the reliability is; we can only estimate it.

Describe four methods available to us to help estimate the reliability of a test.

Estimating the internal consistency, test-retest reliability, parallel-forms reliability and inter-rater reliability.

What is internal consistency?

How much the item scores in a test correlate with one another on average (the average correlation between the items on a scale).

If someone describes a psychological scale as having “high internal coherence”, what are they talking about?

They are saying that the psychological scale has high internal consistency – there is a high average correlation between the items on the scale.

What is inter-item consistency?

An alternative label for internal consistency.

How do you calculate Cronbach’s alpha using JAMOVI?

Select FACTOR then RELIABILITY ANALYSIS. Then, select all the items in the scale and move them to the ‘items’ box.

Describe the steps involved in calculating Cronbach’s alpha by hand.

1. Split the questionnaire in half.
2. Calculate the total score for each half.
3. Work out the correlation between the total scores for the two halves.
4. Repeat steps 1-3 for all possible two-way splits of the questionnaire.
5. Calculate the average of all the split-half correlations.
6. Adjust that average with the Spearman-Brown formula, to correct for the fact that each half is only half the length of the full test.
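
The steps above can be sketched in code. This is a minimal illustration of the split-half route described on this card, assuming an even number of items and a small, made-up data set. The function names are hypothetical. Statistical packages usually compute Cronbach's alpha from item and total variances instead, so their value may not match this one exactly.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    return length_factor * r / (1 + (length_factor - 1) * r)


def split_half_estimate(scores):
    """scores: one row per person, each row a list of item scores."""
    k = len(scores[0])
    if k % 2 != 0:
        raise ValueError("this sketch assumes an even number of items")
    items = range(k)
    correlations = []
    for half in combinations(items, k // 2):
        if 0 not in half:
            continue  # each split appears twice (A|B and B|A); keep one copy
        other = [i for i in items if i not in half]
        totals_a = [sum(row[i] for i in half) for row in scores]    # step 2
        totals_b = [sum(row[i] for i in other) for row in scores]
        correlations.append(pearson_r(totals_a, totals_b))          # step 3
    mean_r = sum(correlations) / len(correlations)                  # step 5
    return spearman_brown(mean_r, 2)                                # step 6


# Hypothetical responses: 5 people x 4 items
responses = [
    [3, 4, 3, 5],
    [2, 2, 3, 2],
    [5, 4, 4, 5],
    [1, 2, 1, 2],
    [4, 3, 5, 4],
]
print(round(split_half_estimate(responses), 2))
```

With 4 items there are only 3 distinct two-way splits, but the number grows very quickly with test length, which is one reason the calculation is normally left to software.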

What is test-retest reliability?

The correlation between scores on the same test by the same people done at two different times.

What statistic can we use to evaluate test-retest reliability?

Pearson's r, the correlation coefficient.
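
As a small illustration, the correlation between two testing occasions can be computed directly. The scores below are invented for the example, and the helper name `pearson_r` is just a label chosen here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for the same five people at two different times
time_1 = [12, 15, 9, 20, 17]
time_2 = [13, 14, 10, 19, 18]

print(round(pearson_r(time_1, time_2), 2))  # 0.97
```

Here people keep roughly the same rank order across the two occasions, so the test-retest estimate is high.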

Test-retest reliability involves giving the same test twice. Give an example of a situation where giving the same test with the same items twice might be a problem.

When a learning effect is likely or possible, particularly in tests of competency.

What is parallel forms reliability?

The correlation between scores on two versions of the same test by the same people done at the same time.

What statistic can we use to evaluate parallel forms reliability?

Pearson's r

Imagine you create an ability test which involves an examiner making a number of ratings of an individual’s thumb-rolling skill. You test the inter-rater reliability using two examiners and obtain a correlation of 0.87. What does this mean?

The correlation between the two examiners' ratings is good; however, the means and SDs would also need to be checked for similarity.

For inter-rater reliability, when might we want to examine the means and SD of two raters’ ratings in addition to the correlation between them?

If it is a criterion-referenced test.
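
A quick sketch shows why the correlation alone is not enough here. The ratings and the pass mark of 7 are invented for illustration: rater B always scores 2 points higher than rater A, so the correlation is perfect, yet the two raters would pass different numbers of people against a fixed criterion.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical ratings of the same five people by two raters
rater_a = [4, 5, 6, 7, 8]
rater_b = [6, 7, 8, 9, 10]  # always 2 points more generous than rater A

PASS_MARK = 7  # assumed criterion, for illustration only

r = pearson_r(rater_a, rater_b)
passes_a = sum(score >= PASS_MARK for score in rater_a)
passes_b = sum(score >= PASS_MARK for score in rater_b)

print("correlation:", r)
print("means:", mean(rater_a), mean(rater_b))
print("SDs:", round(stdev(rater_a), 2), round(stdev(rater_b), 2))
print("passes:", passes_a, passes_b)
```

The SDs match and the correlation is perfect, but the means differ by 2, which is enough to change who meets the criterion.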

List five issues/situations that might affect which reliability estimate you can use. Give an example of each.

Homogeneity/heterogeneity of the test, static v dynamic characteristics, restriction of range/variance, speed tests v power tests, and criterion-referenced tests.

What is a homogeneous test and a heterogeneous test?

A homogeneous test is where all the test items measure the same construct. A heterogeneous test is a test which has subscales which measure different constructs.

If you had a heterogeneous test, which estimate of reliability should you AVOID?

Internal consistency – although you could look at the internal consistency of each subscale separately.

If you had a test measuring a dynamic characteristic, which estimate of reliability should you AVOID?

Test-retest reliability

If you had a dataset of test scores where the range of scores was substantially restricted, which estimate of reliability would be affected (assuming the data could in principle yield statistics on all reliability estimates)?

All of them. Every reliability estimate is based on correlation, and a restricted range affects the correlation, so all reliability estimates would be affected.
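
The effect of restricting the range can be seen in a small example. The paired scores are made up, and the cut-offs of 4 and 7 are arbitrary choices for the sketch: the same data give a lower correlation once only the middle of the score range is kept.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical paired scores (for example, two forms of a test)
form_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
form_b = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

r_full = pearson_r(form_a, form_b)

# Keep only people scoring 4 to 7 on form A (a restricted range)
kept = [(a, b) for a, b in zip(form_a, form_b) if 4 <= a <= 7]
r_restricted = pearson_r([a for a, _ in kept], [b for _, b in kept])

print(round(r_full, 2), round(r_restricted, 2))  # 0.94 0.87
```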

If you had a speed test (as opposed to a power test), which estimate(s) of reliability should you AVOID?

Measuring internal consistency

What special considerations do we need to take into account when analysing the psychometric properties of a speed test compared with a power test?

Not all questions will have been answered, which can produce a spurious correlation if an internal consistency estimate is used.

What possible feature of some criterion-referenced tests might create problems for estimating reliability?

There may be very little variation in responses. Because of this restriction of range, using any reliability estimate can be problematic.

What’s the relationship between reliability and the number of items in a test?

Reliability tends to increase when you have more items and decrease when you have fewer items.

Why should having more items in your test lead to better reliability?

It is the effect of domain sampling: with more items there are more samples of the domain of interest, so the test score becomes a better representation of the total domain score.

How can the Law of Large Numbers apply when trying to predict the behaviour of an individual?

A large and diverse sample of evidence of past behaviour is going to better predict future behaviour than a sample consisting of a single past behavioural event.

When might you want to use the Spearman-Brown formula?

To estimate how reliability would change if a test was shortened or lengthened.

Imagine you had a questionnaire with 20 items and you were disappointed that its internal consistency was .59. What effect would adding another 30 equivalent items be predicted to have on the reliability?

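It would be predicted to increase the reliability. Going from 20 to 50 items makes the test 2.5 times as long, and the Spearman-Brown formula then predicts a reliability of about .78.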
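
The prediction can be checked with the Spearman-Brown formula, predicted reliability = n r / (1 + (n - 1) r), where r is the current reliability and n is the factor by which the test length changes. The function name below is just a label for this sketch, and the formula assumes the added items are equivalent to the existing ones.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when test length is multiplied by length_factor.

    Assumes any added items are equivalent to the existing ones.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# 20 items with reliability .59, lengthened to 50 items (factor 50 / 20 = 2.5)
predicted = spearman_brown(0.59, 50 / 20)
print(round(predicted, 2))  # 0.78

# Doubling the test does not double the reliability
doubled = spearman_brown(0.59, 2)
print(round(doubled, 2))  # 0.74
```

A length factor below 1 gives the predicted reliability of a shortened test in the same way.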

True or False: A test designed to yield information about whether or not a student has mastered the ability to multiply two-digit numbers to a specified level of competency could be described as norm-referenced.

False

True or False: If an examination is criterion-referenced, then an individual's score will not be affected if the norm for the course changes.

True

True or False: If an examination is norm-referenced, then an individual’s score will be less likely to be affected by changes in the quality of teaching.

True

True or False: If an examination is criterion-referenced, then marks are protected from changes in examination difficulty.

False

True or False: If the proportion of higher grades changes substantially between semesters for a course, this probably means the course scoring is criterion-referenced.

True

True or False: When I convert your percentage score to your grade score at the end of this course, this is an example of a linear transformation.

False

True or False: True Score Theory involves conceptualising test score variability as comprising true test score variation and total test score variation.

False

True or False: Classical Test Theory can be described by the formula X = T / E (where X is the observed score, T is the true score, and E is measurement error).

False

True or False: According to Classical Test Theory, if a test has very high reliability then the measurement error must be very high.

False

True or False: Reliability is conceptualised in Classical Test Theory as being total test score variation minus measurement error.

False

True or False: According to Classical Test Theory, if a test has very high reliability then the measurement error must be very low.

True

True or False: If a test was unreliable then the true test score variability would only be a small proportion of the actual test score variability.

True

True or False: The fact that we cannot ask students everything about the course in the quizzes will decrease the proportion of the observed score that can be attributed to the true score (assuming the quiz marks are supposed to reflect students’ overall PSYC3020 knowledge).

True

True or False: As part of the process of calculating Cronbach’s alpha, you have to multiply the correlations derived from all possible ways of splitting the test into two.

False

True or False: As part of the process of calculating Cronbach’s alpha, you have to average the correlations derived from all possible ways of splitting the test into two.

True

True or False: As part of the process of calculating Cronbach’s alpha, you have to adjust for the fact you’ve halved the test by applying the Spearman-Brown formula.

True

True or False: As part of the process of calculating Cronbach’s alpha, you have to adjust for the homogeneity of the test by applying the Spearman-Brown formula.

False

True or False: Imagine you create an ability test which involves an examiner making a number of ratings of an individual’s acrobatic skill. You test the inter-rater reliability using two examiners and obtain a reliability of -.98 (negative point nine eight). This means one examiner is giving the opposite rating to the other.

True

True or False: Imagine you create an ability test which involves an examiner making a number of ratings of an individual’s acrobatic skill. You test the inter-rater reliability using two examiners and obtain a reliability of .98 (point nine eight). This means one examiner’s rating is independent of the other’s rating.

False: the two examiners' ratings are strongly related to one another (high correlation).

True or False: If we double the number of items in a test, then the Spearman Brown formula predicts that we should double its reliability.

False

True or False: 0.70 is usually considered the typical minimum threshold for decent reliability.

True

True or False: 0.90 is usually considered the typical minimum threshold for decent reliability.

False

(60 cards)