CRT Dependability
Consistency for criterion-referenced decisions
Challenges for CRT dependability
• Raw scores may not show much variation
(skewed distributions)
• CRT decisions are based on acceptable
performance rather than relative position
• A measure of the dependability of the
classification (i.e., master / non-master) is
needed
Approaches using cut-score
• Threshold loss agreement
– In a test-retest situation, how consistently are
the students classified as master / non-master
– All misclassifications are considered equally
serious
• Squared error loss agreement
– How consistent are the classifications
– The consequences of misclassifying students
far above or far below cut-point are
considered more serious
Berk, R. A. (1984). Selecting the index of reliability. In R. A. Berk (Ed.), A guide to criterion-referenced test
construction (pp. 231-266). Baltimore, MD: The Johns Hopkins University Press.
Issues with cut-scores
• “The validity of the final classification decisions
will depend as much upon the validity of the
standard as upon the validity of the test content”
(Shepard, 1984, p. 169)
• “Just because excellence can be distinguished
from incompetence at the extremes does not
mean excellence and incompetence can be
unambiguously separated at the cut-off.” (p. 171)
Shepard, L. A. (1984). Setting performance standards. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 169-198). Baltimore, MD: The Johns Hopkins University Press.
Methods for determining cut-scores
• Method 1: expert judgments about
performance of hypothetical students on
test
• Method 2: test performance of actual
students
Setting cut-scores (Brown, 1996, p. 257)
Institutional decisions (Brown, 1996, p. 260)
Agreement coefficient (Po) and kappa
Po = (A + D) / N
Pchance = [(A + B)(A + C) + (C + D)(B + D)] / N²
K = (Po – Pchance) / (1 – Pchance)
Classification table (test–retest):
                      Retest: master   Retest: non-master   Total
Test: master                77 (A)             6 (B)           83
Test: non-master             6 (C)            21 (D)           27
Total                       83                27              110
Worked example:
Po = (77 + 21) / 110 = .89
Pchance = [(83)(83) + (27)(27)] / 110² = .63
K = (.89 – .63) / (1 – .63) = .70
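As a concrete check on the worked example, here is a minimal Python sketch of both indices using the counts from the table above:

```python
def agreement_and_kappa(a, b, c, d):
    """Threshold-loss agreement indices from a 2 x 2 classification table.

    a = master/master, b = master/non-master,
    c = non-master/master, d = non-master/non-master.
    """
    n = a + b + c + d
    p_o = (a + d) / n                                            # observed agreement
    p_chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # chance agreement
    kappa = (p_o - p_chance) / (1 - p_chance)
    return p_o, kappa

p_o, k = agreement_and_kappa(77, 6, 6, 21)
print(round(p_o, 2), round(k, 2))  # 0.89 0.71 (the slide's .70 comes from rounding Po and Pchance first)
```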
Short-cut methods for one administration
• Calculate an NRT reliability coefficient
– Split-half, KR-20, Cronbach's alpha
• Convert the cut-score to a standardized score
– z = (cut-score - .5 - mean) / SD
• Use Table 7.9 to estimate the agreement coefficient
• Use Table 7.10 to estimate the kappa coefficient
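To make the conversion step concrete, here is a minimal Python sketch; the mean and SD shown are hypothetical placeholders (not the HELP statistics), and the table lookup itself is not reproduced here:

```python
def standardized_cut_score(cut_score, mean, sd):
    """Convert a raw cut-score to a standardized score using the
    continuity-corrected formula from the slide:
    z = (cut-score - .5 - mean) / SD."""
    return (cut_score - 0.5 - mean) / sd

# Hypothetical mean and SD, chosen only for illustration:
print(round(standardized_cut_score(cut_score=27, mean=30.1, sd=10.0), 2))  # -0.36
```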
Estimate the dependability for the HELP Reading test
• Assume a cut-point of 60%. What is the raw score? (27)
• z = -0.36
• Look at Table 9.1. What is the approximate value of the agreement coefficient?
• Look at Table 9.2. What is the approximate value of the kappa coefficient?
Squared-error loss agreement
• Sensitive to degrees of mastery / non-mastery
• Short-cut form of generalizability study
• Classical Test Theory
– OS = TS + E
• Generalizability Theory
– OS = TS + (E1 + E2 + . . . + Ek)
Brennan, R. (1995). Handout from generalizability theory workshop.
Phi (lambda) dependability index
• Components of the formula:
– # of items
– Cut-point
– Mean of proportion scores
– Standard deviation of proportion scores
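The formula itself appeared only as a figure on the original slide; only the component labels above survive. The sketch below assumes the slide intended the standard single-administration short-cut estimate of phi(lambda) (Brennan's KR-21-like formula built from exactly these components); the example values are hypothetical.

```python
def phi_lambda(n_items, cut_proportion, mean_p, sd_p):
    """Short-cut single-administration estimate of the phi(lambda)
    dependability index.

    n_items        : number of items on the test
    cut_proportion : cut-point expressed as a proportion score (lambda)
    mean_p         : mean of examinees' proportion scores
    sd_p           : standard deviation of examinees' proportion scores
    """
    var_p = sd_p ** 2
    error_ratio = (mean_p * (1 - mean_p) - var_p) / ((mean_p - cut_proportion) ** 2 + var_p)
    return 1 - error_ratio / (n_items - 1)

# Hypothetical example values (not taken from the HELP data):
print(round(phi_lambda(n_items=30, cut_proportion=0.60, mean_p=0.55, sd_p=0.15), 2))  # 0.69
```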
Domain score dependability
• Does not depend on cut-point for
calculation
• “estimates the stability of an individual’s
score or proportion correct in the item
domain, independent of any mastery
standard” (Berk, 1984, p. 252)
• Assumes a well-defined domain of
behaviors
Phi dependability index
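The Φ formula on this slide is likewise not reproduced in the transcript. Under the same assumption as the phi(lambda) sketch above, Φ can be sketched as phi(lambda) evaluated at lambda = the mean proportion score, which drops the cut-point from the calculation (consistent with the previous slide's point that Φ does not depend on the cut-point). This is an assumed stand-in for the slide's own formula.

```python
def phi(n_items, mean_p, sd_p):
    """Short-cut estimate of the domain-score (phi) dependability index,
    taken here as phi(lambda) evaluated at lambda = mean_p (an assumption;
    the slide's own formula is not reproduced in this transcript)."""
    var_p = sd_p ** 2
    return 1 - (mean_p * (1 - mean_p) - var_p) / ((n_items - 1) * var_p)

# Hypothetical example values:
print(round(phi(n_items=30, mean_p=0.55, sd_p=0.15), 2))  # 0.66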
Confidence intervals
• Analogous to SEM for NRTs
• Interpreted in terms of proportion-correct scores rather than raw scores
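No formula is given on this slide. A plausible sketch, assuming the SEM-analogue sometimes paired with Φ (CI = SD of proportion scores × sqrt(1 − Φ), reported in proportion-score units); this specific formula is an assumption, not taken from the slide:

```python
import math

def crt_confidence_band(prop_score, sd_p, phi_value, z=1.96):
    """Confidence band around a proportion-correct score.

    Assumes CI = sd_p * sqrt(1 - phi), an SEM-like quantity in
    proportion-score units; the slide itself gives no formula.
    """
    ci = sd_p * math.sqrt(1 - phi_value)
    return prop_score - z * ci, prop_score + z * ci

# Hypothetical example: a student at 62% correct, sd_p = .15, phi = .66
low, high = crt_confidence_band(0.62, 0.15, 0.66)
print(round(low, 2), round(high, 2))  # 0.45 0.79
```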
Reliability Recap
• Longer tests are better than short tests
• Well-written items are better than poorly written
items
• Items with high discrimination (ID for NRT, B-index for CRT) are better
• A test made up of similar items is better
• CRTs – a test that is related to the objectives is
better
• NRTs – a test that is well-centered and spreads
out students is better