CRT Dependability
• Consistency for criterion-referenced decisions

Challenges for CRT dependability
• Raw scores may not show much variation (skewed distributions)
• CRT decisions are based on acceptable performance rather than relative position
• A measure of the dependability of the classification (i.e., master / non-master) is needed

Approaches using cut-scores
• Threshold loss agreement
  – In a test-retest situation, how consistently are the students classified as master / non-master?
  – All misclassifications are considered equally serious
• Squared-error loss agreement
  – How consistent are the classifications?
  – Misclassifying students far above or far below the cut-point is considered more serious

Berk, R. A. (1984). Selecting the index of reliability. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 231-266). Baltimore, MD: The Johns Hopkins University Press.

Issues with cut-scores
• "The validity of the final classification decisions will depend as much upon the validity of the standard as upon the validity of the test content" (Shepard, 1984, p. 169)
• "Just because excellence can be distinguished from incompetence at the extremes does not mean excellence and incompetence can be unambiguously separated at the cut-off." (p. 171)

Shepard, L. A. (1984). Setting performance standards. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 169-198). Baltimore, MD: The Johns Hopkins University Press.

Methods for determining cut-scores
• Method 1: expert judgments about the performance of hypothetical students on the test
• Method 2: test performance of actual students

Setting cut-scores (Brown, 1996, p. 257)

Institutional decisions (Brown, 1996, p. 260)

Agreement coefficient (po) and kappa

Test-retest classification table:

                      Retest: master   Retest: non-master   Total
  Test: master             77 (A)             6 (B)            83
  Test: non-master          6 (C)            21 (D)            27
  Total                    83                 27              110

po = (A + D) / N
pchance = [(A + B)(A + C) + (C + D)(B + D)] / N²
K = (po - pchance) / (1 - pchance)

po = (77 + 21) / 110 = .89
pchance = [(83)(83) + (27)(27)] / 110² = .63
K = (.89 - .63) / (1 - .63) = .70
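For readers who want to check the arithmetic, here is a minimal Python sketch of the threshold loss calculation above. The function name threshold_agreement is just illustrative; the formulas and cell values are the ones given on the slide.

```python
def threshold_agreement(a, b, c, d):
    """Agreement coefficient (po) and kappa for a 2x2 master/non-master
    classification table: a and d are the cells where the two administrations
    agree, b and c are the disagreement cells."""
    n = a + b + c + d
    po = (a + d) / n
    p_chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    kappa = (po - p_chance) / (1 - p_chance)
    return po, p_chance, kappa

# Values from the worked example above
po, p_chance, kappa = threshold_agreement(a=77, b=6, c=6, d=21)
print(round(po, 2), round(p_chance, 2), round(kappa, 2))
# -> 0.89 0.63 0.71 (the slide's .70 comes from using the rounded .89 and .63)
```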
Short-cut methods for one administration
• Calculate an NRT reliability coefficient
  – split-half, KR-20, or Cronbach's alpha
• Convert the cut-score to a standardized score
  – z = (cut-score - .5 - mean) / SD
• Use Table 7.9 to estimate the agreement coefficient
• Use Table 7.10 to estimate kappa

Estimate the dependability for the HELP Reading test
• Assume a cut-point of 60%. What is the raw score? (27)
• z = -0.36
• Look at Table 9.1. What is the approximate value of the agreement coefficient?
• Look at Table 9.2. What is the approximate value of the kappa coefficient?

Squared-error loss agreement
• Sensitive to degrees of mastery / non-mastery
• Short-cut form of a generalizability study
• Classical Test Theory
  – OS = TS + E
• Generalizability Theory
  – OS = TS + (E1 + E2 + . . . + Ek)

Brennan, R. (1995). Handout from generalizability theory workshop.

Phi (lambda) dependability index
• Computed from four quantities (see the worked sketch after the recap below):
  – number of items
  – cut-point (lambda, expressed as a proportion)
  – mean of the proportion scores
  – standard deviation of the proportion scores

Domain score dependability
• Does not depend on the cut-point for its calculation
• "estimates the stability of an individual's score or proportion correct in the item domain, independent of any mastery standard" (Berk, 1984, p. 252)
• Assumes a well-defined domain of behaviors

Phi dependability index

Confidence intervals
• Analogous to the SEM for NRTs (see the sketch at the end of this handout)
• Interpreted as a proportion-correct score rather than a raw score

Reliability Recap
• Longer tests are better than shorter tests
• Well-written items are better than poorly written items
• Items with high discrimination (ID for NRTs, B-index for CRTs) are better
• A test made up of similar items is better
• CRTs: a test that is related to the objectives is better
• NRTs: a test that is well-centered and spreads students out is better
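The Phi (lambda) slide above lists only the ingredients of the index; this is the worked sketch referred to there. It uses a single-administration shortcut formula built from exactly those four quantities (number of items, cut-point as a proportion, and the mean and standard deviation of examinees' proportion scores). The exact algebraic form and the sample mean and standard deviation below are assumptions to verify against Brennan (1995) and Brown (1996), not a transcription of the slide; the 45 items and 60% cut-point are taken from the HELP exercise (a raw cut-score of 27 at 60% implies 45 items).

```python
def phi_lambda(n_items, cut_prop, mean_prop, sd_prop):
    """Phi(lambda) dependability index from a single administration.
    n_items   : number of items on the test
    cut_prop  : cut-point expressed as a proportion (lambda)
    mean_prop : mean of examinees' proportion scores
    sd_prop   : standard deviation of examinees' proportion scores
    Formula assumed here: 1 - [1/(n-1)] * [M(1-M) - S^2] / [S^2 + (M - lambda)^2]
    """
    m, s = mean_prop, sd_prop
    error_term = (m * (1 - m) - s ** 2) / (s ** 2 + (m - cut_prop) ** 2)
    return 1 - error_term / (n_items - 1)

def z_cut(cut_score, mean, sd):
    """Standardized cut-score from the short-cut method: z = (cut - .5 - mean) / SD."""
    return (cut_score - 0.5 - mean) / sd

# Hypothetical illustration: 45 items, cut at 60%, with an assumed mean
# proportion score of .65 and SD of proportion scores of .15
print(round(phi_lambda(45, 0.60, 0.65, 0.15), 2))  # -> 0.81 with these assumed numbers
```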
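The confidence-interval slide says only that the CRT interval is analogous to the NRT SEM and is read in proportion-correct terms. The sketch below assumes the interval is formed exactly like an SEM band, with the domain-score Phi standing in for the NRT reliability coefficient; treat that formula, and the numbers in the example, as assumptions to check against Brown (1996) rather than the handout's own worked values.

```python
import math

def crt_confidence_interval(prop_score, sd_prop, phi, z=1.0):
    """Band around an examinee's proportion score, formed like an NRT SEM band:
    half-width = z * Sp * sqrt(1 - Phi).  (Assumed form; verify before using.)
    z = 1.0 gives the analogue of a +/- 1 SEM (roughly 68%) band."""
    half_width = z * sd_prop * math.sqrt(1 - phi)
    return prop_score - half_width, prop_score + half_width

# Hypothetical: proportion score .70, Sp = .15, Phi = .81
print(crt_confidence_interval(0.70, 0.15, 0.81))  # approx. (0.635, 0.765)
```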