Evaluating Model Fit with IRT
American Board of Internal Medicine Item Response Theory Course

How Good Is Our Model?
• Assessment of model-data fit
  – Model assumptions
  – Model predictions vs. observed data
  – Examples of goodness-of-fit

Model-data Fit
• George Box (1979): "All models are wrong but some are useful."
• A "model" is something we use to approximate reality for the purpose of making predictions, explaining data, etc.
• Strictly speaking, no model will fit the data perfectly; the question is, "How much model-data misfit is too much?"

Model-data Fit
• "How much misfit is too much?"
• Generally, there are no absolute criteria for model-data fit.
• "Let your conscience be your guide"
  – Conduct a variety of analyses, consider the full set of results, and be mindful of these with regard to the intended application of the IRT model.

Model-data Fit
• Evaluate model assumptions.
• Assess residuals and standardized residuals, and examine the consequences of model misfit.

Types of Evidence
• Are model assumptions met to a "reasonable degree" by the test data?
  – Determine if model features, like parameter invariance, are present.
• Assess the accuracy of model predictions versus actual data.
  – Do expected probabilities match the empirical relative frequencies?

Model Assumptions
• Are model assumptions met to a reasonable degree by the test data?
  – Unidimensionality
  – Local independence
  – Nature of the ICC
    • Minimal guessing (for 1- and 2-PL)
    • Equal discrimination (for 1-PL)
  – Parameter invariance

Unidimensionality
• Dimensionality is often analyzed through a Principal Components Analysis (PCA) to determine how many "components" or "dimensions" are necessary to explain a significant amount of the total variance.

PCA
• PCA is a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components.
• The analysis is performed on the correlation matrix.
• The first principal component accounts for as much variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.

PCA
• Each principal component receives an "eigenvalue" (λj) which indirectly represents the amount of variance that component explains.
• In IRT, we look for a "dominant" first factor in order to infer the unidimensionality of the test.

Unidimensionality
• A scree plot is often used to evaluate the dominance of the first factor.
• Rules of thumb for "essential unidimensionality":
  – First eigenvalue: λ1 / n ≥ 0.20.
  – The ratio of λ1 to λ2 should be "large."
  – Other eigenvalues: λj ≤ 1.

[Figure: scree plot for a 40-item test, eigenvalue by component. Annotation: because the eigenvalues of a correlation matrix sum to n (Σ λj = n over j = 1, …, n), if λ1 / n ≈ 0.20 we say the test is "essentially unidimensional."]

[Figure: the same scree plot, with the worked example λ1 / n = 8.45 / 40 = 0.21 and λ1 / λ2 = 8.45 / 1.5 = 5.63. The first component accounts for 21% of the total variance, and for 5.63 times as much variance as the second component.]
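These rules of thumb are easy to compute directly. Below is a minimal sketch, assuming NumPy and a 0/1 item-response matrix; the function and variable names are illustrative, not part of any particular IRT package.

```python
import numpy as np

def unidimensionality_indices(responses):
    """Eigenvalue-based checks for essential unidimensionality.

    responses: (N examinees x n items) array of 0/1 item scores.
    Returns (lambda1 / n, lambda1 / lambda2, all eigenvalues).
    """
    corr = np.corrcoef(responses, rowvar=False)        # n x n inter-item correlation matrix
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # eigenvalues, largest first
    n = corr.shape[0]
    # The eigenvalues of a correlation matrix sum to n, so lambda1 / n is
    # the proportion of total variance explained by the first component.
    return eigvals[0] / n, eigvals[0] / eigvals[1], eigvals
```

With the values from the scree-plot example (λ1 = 8.45, λ2 = 1.5, n = 40), these indices come out to 0.21 and 5.63, matching the slide's conclusion that the test is essentially unidimensional.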
Local Independence
• Item responses should be independent given ability.
  – Omits or unexpectedly low response probabilities for the last few items may indicate speededness of the test, which violates local independence.
  – Response probabilities should not be related to group membership (DIF).

Nature of the ICC
• The probability of success should be monotonically increasing with θ.
• No systematic errors of prediction should be apparent.

Nature of the ICC
• Minimal guessing (1- or 2-PL)
  – Check the performance levels of low-ability examinees.
  – Conditional p-values for low-ability examinees should be close to zero when fitting a 1- or 2-PL model.

Nature of the ICC
• Equal discrimination (1-PL)
  – Essentially homogeneous distribution of the item-total correlations.
  – All r values should be approximately equal when fitting a 1-PL model.
  – When examining the model-data fit, is there any pattern of errors that might be accounted for by a different slope?

Parameter Invariance
• Determine if model parameter invariance is present.
  – Invariance of ability (θ).
  – Invariance of item parameters (a, b, c).

Parameter Invariance
• Invariance of ability (θ)
  – Compare examinee ability estimates based on different sets of items from a pool of items calibrated on a common scale.
  – If the model fits, different item samples should still produce close to the same ability estimates (the SE will change for shorter tests, though).

Parameter Invariance
• Invariance of ability (θ)
  – Scatterplots of examinee ability based on one sample of items versus the other should be strongly linearly related.
  – The relationship won't be perfect: sampling error does enter the picture.
  – Estimates far from the best-fit line represent a violation of invariance.

Parameter Invariance
• Invariance of item parameters (a, b, c)
  – Compare item statistics obtained in two or more groups (e.g., high- and low-performing groups, ethnicity, gender).
  – If the model fits, different examinee samples should still produce close to the same item parameter estimates.

Parameter Invariance
• Invariance of item parameters (a, b, c)
  – Scatterplots of the b-b, a-a, and c-c estimates based on one sample of examinees versus the other should be strongly linearly related.
  – The relationship won't be perfect: sampling error does enter the picture.
  – Estimates far from the best-fit line represent a violation of invariance.

Check Item Invariance
• Differential Item Functioning (DIF)
  – Details on Thursday.
• A test item is labeled with "DIF" when examinees with equal ability, but from different groups, have an unequal probability of answering the item correctly.
• This violates parameter invariance because the item is not unidimensional: it measures something in addition to θ.

DIF Issues
• Uniform DIF: the item is systematically more difficult for members of one group, even after matching examinees on ability (θ).
• Cause: a shift in the b-parameter.

[Figure: ICCs for two groups, P(u = 1 | θ) against ability (θ); the curves are parallel, with one shifted along θ.]

DIF Issues
• Non-uniform DIF: the shift in item difficulty is not consistent across the ability continuum.
• An increase/decrease in P for low-ability examinees is offset by the converse for high-ability examinees.
• Cause: a shift in the a-parameter.

[Figure: ICCs for two groups, P(u = 1 | θ) against ability (θ); the curves cross.]
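To make the two patterns concrete, here is a small sketch under the 2-PL model with made-up parameter values (illustrative only, not from the course data): a shift in b alone yields a between-group gap of constant sign at every θ (uniform DIF), while a shift in a makes the two ICCs cross (non-uniform DIF).

```python
import numpy as np

def icc_2pl(theta, a, b):
    """2-PL item characteristic curve: P(u = 1 | theta) = 1 / (1 + exp(-a(theta - b)))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)

# Uniform DIF: equal slopes, the focal group's b is shifted upward,
# so the item is harder for the focal group at every theta.
gap_uniform = icc_2pl(theta, a=1.0, b=0.0) - icc_2pl(theta, a=1.0, b=0.5)

# Non-uniform DIF: equal b, different slopes, so the ICCs cross at theta = b
# and the sign of the gap flips between low- and high-ability examinees.
gap_nonuniform = icc_2pl(theta, a=1.5, b=0.0) - icc_2pl(theta, a=0.8, b=0.0)

for t, gu, gn in zip(theta, gap_uniform, gap_nonuniform):
    print(f"theta = {t:+.1f}: uniform gap = {gu:+.3f}, non-uniform gap = {gn:+.3f}")
```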
Residual Analysis
• Assess the accuracy of model predictions versus actual data.
  – Residual = the difference between the observed proportion and the predicted probability:

  r_j(θ) = P_j(θ) − E[P_j(θ)]

Residual Analysis
• r_j(θ) = P_j(θ) − E[P_j(θ)]
• In this notation:
  – P_j(θ) = observed proportion correct for a given θ level.
  – E[P_j(θ)] = expected proportion correct (i.e., the probability from the IRT model).

[Figure: probability of correct response against ability (θ); white points = P_j(θ), black curve = E[P_j(θ)].]

Standardized Residuals
• Raw residuals do not take into account the error associated with the expected proportion correct, so we standardize each one by dividing by its standard error:

  SE(E[P_j(θ)]) = √( E[P_j(θ)] · E[Q_j(θ)] / N(θ) )

Standardized Residuals
• SR_j(θ) = ( P_j(θ) − E[P_j(θ)] ) / √( E[P_j(θ)] · E[Q_j(θ)] / N(θ) )
• SR values should be homoscedastic for each item and follow an approximately standard normal distribution across all items of the test.

[Figure: the same item fit by the 1-PL, 2-PL, and 3-PL models, probability of correct response against ability (θ), with the corresponding standardized residuals. The SR values are homoscedastic for this item when fit by a 3-PL model, but systematic errors are present for the 1- and 2-PL models.]

Test-level Fit
• Similar to the comparison done for individual items, but instead we compare the expected proportion correct (from the test characteristic curve, TCC) to the observed proportion correct (raw score / N).
• SRs across the test should be homoscedastic and follow an approximately normal distribution.

[Figure: relative frequency of standardized residuals under the three models. Across all items, the SR values are approximately normally distributed when fit by a 3-PL model, but more uniform for the 1- and 2-PL models.]

Significance Testing
• Q1 chi-square (Yen, 1981):

  Q1_j = Σ_{i=1}^{m} SR_{ij}²,   Q1 ~ χ² with df = m − p

  where m = number of quadrature points and p = number of item parameters.

Chi-square Test
• Standardized residuals are essentially prediction errors that have been turned into z-scores.
• The sum of squared z-scores follows a chi-square distribution.
  – Much like sums of squares and variances follow a chi-square distribution in ANOVA.

[Figure: the standard normal distribution of z-scores, and chi-square densities for df = 1, 5, 10, 15, 20. As the degrees of freedom increase, the sum of squared standardized residuals is less likely to be near zero; the expected value of Σ SR² moves to the right as the distribution approaches normality.]

Goodness-of-Fit
• This is what we call "goodness-of-fit."
• We hope that the chi-square test will NOT be significant:
  – A non-significant result indicates that the differences between observed and expected values are small.
  – Significant differences would mean that the observed proportions are far from what the model predicted… and that's bad.
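Putting the residual, standardized-residual, and Q1 steps together: below is a minimal sketch, assuming NumPy/SciPy, where obs_prop, exp_prob, and n_at_theta are illustrative names for the observed proportions correct, the model-expected probabilities, and the examinee counts at each of the m quadrature points.

```python
import numpy as np
from scipy.stats import chi2

def q1_statistic(obs_prop, exp_prob, n_at_theta, n_params):
    """Yen's (1981) Q1 item-fit statistic for a single item.

    obs_prop:   P_j(theta), observed proportion correct at each theta level
    exp_prob:   E[P_j(theta)], model-expected probability at each level
    n_at_theta: N(theta), number of examinees at each level
    n_params:   p, number of estimated item parameters (1, 2, or 3)
    """
    # SE of the expected proportion: sqrt(E[P] * E[Q] / N), with Q = 1 - P
    se = np.sqrt(exp_prob * (1.0 - exp_prob) / n_at_theta)
    sr = (obs_prop - exp_prob) / se          # standardized residuals
    q1 = np.sum(sr ** 2)                     # sum of squared z-scores
    df = len(obs_prop) - n_params            # df = m - p
    p_value = chi2.sf(q1, df)                # we hope this is NOT significant
    return q1, df, p_value
```

Note that N(θ) sits inside the standard error: quadrupling the sample size halves each SE and doubles each SR, which is exactly the sample-size problem discussed below.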
Significance Testing in BILOG and PARSCALE
• The goodness-of-fit information in BILOG and PARSCALE output uses the chi-square test described in the previous slides.
  – These values can be found for all items.

SUBTEST PRETEST; ITEM PARAMETERS AFTER CYCLE 30

ITEM    | INTERCEPT | SLOPE  | THRESHOLD | LOADING | ASYMPTOTE | CHISQ  DF  (PROB)
        | S.E.      | S.E.   | S.E.      | S.E.    | S.E.      |
--------+-----------+--------+-----------+---------+-----------+------------------
MATH01  |  1.041    | 0.651  | -1.599    | 0.545   | 0.186     | 29.0  9.0 (0.0007)
        |  0.107*   | 0.082* |  0.242*   | 0.069*  | 0.084*    |
MATH02  |  2.230    | 0.600  | -3.717    | 0.514   | 0.199     |  9.5  5.0 (0.0920)
        |  0.165*   | 0.114* |  0.610*   | 0.098*  | 0.089*    |
MATH03  |  0.428    | 0.693  | -0.618    | 0.569   | 0.159     | 63.2  7.0 (0.0000)
        |  0.106*   | 0.084* |  0.190*   | 0.069*  | 0.071*    |
MATH04  | -0.601    | 1.391  |  0.432    | 0.812   | 0.217     | 10.7  8.0 (0.2204)
        |  0.216*   | 0.268* |  0.095*   | 0.156*  | 0.040*    |

The Problem of Sample Size
• Statistical tests of model-data fit present an interesting duality:
  1) Due to sensitivity to sample size, almost any departure of the data from the model results in rejecting H0.
  2) For small samples, model-data misfit can be overlooked, and the SEs for the item parameters are large.

Standardized Residuals
• SR_j(θ) = ( P_j(θ) − E[P_j(θ)] ) / √( E[P_j(θ)] · E[Q_j(θ)] / N(θ) )
• As N increases, the SE decreases and the SRs increase! Often, when you have a large enough sample size to estimate the parameters, model misfit is a foregone conclusion!

Conclusion
• This leads us back to the beginning… recall what George Box said!
• Generally, there are no absolute criteria for model-data fit.
• "Let your conscience be your guide"
  – Conduct a variety of analyses, consider the full set of results, and be mindful of these with regard to the intended application of the IRT model.

Next…
• Test Reliability and Development with IRT
• Test Score Equating with IRT