Applications of IRT Models: DIF and CAT

Which of these is the situation of a biased test?
• The average score for males and females on an item is not the same.
• The correlation between males' scores on an item is stronger than that for females' scores.
• A group of males and females with exactly the same ability achieve different scores on an item.

Disentangling the Terminology
• DIF: the differential probability of a correct response for examinees at the same trait level but from different groups. DIF occurs when examinees from different groups show differing probabilities of success on (or endorsing) the item after matching on the underlying ability that the item is intended to measure.
• Item impact: evident when examinees from different groups have differing probabilities of responding correctly to (or endorsing) an item because there are true differences between the groups in the underlying ability being measured by the item.
• Item bias: occurs when examinees of one group are less likely to answer an item correctly (or endorse an item) than examinees of another group because of some characteristic of the test item or testing situation that is not relevant to the test purpose.

Adverse Impact
• Adverse impact is a legal term describing the situation in which group differences in test performance result in disproportionate examinee selection or related decisions (e.g., promotion). This is not, by itself, evidence of test bias.

There are two types of DIF:
• Uniform DIF: the reference group always has a higher probability of a correct response than the focal group, across the entire ability scale.
• Non-uniform DIF: the direction of one group's advantage in the likelihood of a correct response changes in different regions of the ability scale.

[Figure: item characteristic curves for the reference and focal groups under no DIF, uniform DIF, and non-uniform DIF.]

Relationship Between IRT and CTST Models
• It has been shown that the 2PL normal-ogive IRT model is related to the single-factor FA model (Lord & Novick, 1968).
• The b-parameter is related to the threshold parameter divided by the item's factor loading: $b = \tau / \lambda$.
• The discrimination parameter equals the factor loading divided by the square root of the item's uniqueness, $a = \lambda / \sqrt{1 - \lambda^2}$, so highly discriminating items will have high factor loadings.

Examining Measurement Invariance in CTST
Examining factorial invariance:
• Configural invariance: the zero and non-zero loading patterns are the same across groups.
• Pattern (metric) invariance: the factor loadings are equal across groups.
• Scalar (strong) invariance: the factor loadings and intercepts are equal across groups. Any group differences in means can then be attributed to the common factors, which allows meaningful group mean comparisons.
• Strict invariance: the factor loadings, intercepts, and unique variances are equal across groups. Any systematic differences in group means, variances, or covariances are due to the common factors.

Examining DIF in IRT
• IRT tests of DIF examine whether the item response curve (IRC) is the same for the reference group as for the focal group.
• The focal group is the smaller group in question (the minority group); the reference group is the larger group that generally has the established parameters.
• If the curves differ, then the probability that an individual in one group with ability x responds correctly is different from the probability that an individual with the same ability x in the other group responds correctly. (A small code sketch of this comparison follows.)
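To make the distinction concrete, here is a minimal Python sketch (all parameter values invented for illustration) that builds 2PL IRCs for a reference and a focal group: equal discriminations with shifted difficulties give uniform DIF, while unequal discriminations give crossing curves, i.e., non-uniform DIF.

```python
import numpy as np

def icc_2pl(theta, a, b, D=1.7):
    """2PL item characteristic curve: P(correct | theta)."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

theta = np.linspace(-3, 3, 121)

# Uniform DIF: same discrimination, harder item for the focal group,
# so the reference curve lies above the focal curve everywhere.
p_ref_uni = icc_2pl(theta, a=1.2, b=0.0)
p_foc_uni = icc_2pl(theta, a=1.2, b=0.5)

# Non-uniform DIF: different discriminations make the curves cross,
# so which group is advantaged changes along the ability scale.
p_ref_non = icc_2pl(theta, a=1.5, b=0.0)
p_foc_non = icc_2pl(theta, a=0.7, b=0.0)

# Crossing curves show up as sign transitions in the group difference.
diff = p_ref_non - p_foc_non
print("sign transitions in P_ref - P_foc:",
      int(np.sum(np.diff(np.sign(diff)) != 0)))
```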
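And to connect this back to the IRT-CTST relationship above, a small sketch of the parameter conversion, assuming the standard single-factor normal-ogive relation ($a = \lambda/\sqrt{1-\lambda^2}$, $b = \tau/\lambda$); the function name and example values are hypothetical.

```python
import math

def fa_to_irt(loading, threshold):
    """Convert single-factor FA parameters (loading lambda, threshold tau)
    to normal-ogive IRT parameters, assuming the standard relation:
      a = lambda / sqrt(1 - lambda^2)   (loading over root uniqueness)
      b = tau / lambda                  (threshold over loading)
    """
    a = loading / math.sqrt(1.0 - loading ** 2)
    b = threshold / loading
    return a, b

# A highly loading item maps to a highly discriminating item.
print(fa_to_irt(loading=0.8, threshold=0.4))  # high a
print(fa_to_irt(loading=0.3, threshold=0.4))  # low a
```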
Differential Test Functioning (DTF)

[Figure: DTF against the reference group: proportion-correct true score plotted against theta from −3.0 to 3.0 for the focal and reference groups.]

• DTF refers to a difference in the test characteristic curves, obtained by summing the item response functions for each group.
• DTF is perhaps more important for selection because decisions are made based on test scores, not individual item responses.

Procedures for Detecting DIF/DTF: Parametric Procedures
• Compare item parameters from two groups of examinees: Lord's chi-square, likelihood ratio test.
• Compare IRFs from two groups of examinees by measuring the area between them: Raju's area measures.

Likelihood Ratio Test
$G^2_j = -2\log L(\text{compact model}) - [-2\log L(\text{augmented model})]$
• Distributed as a chi-square with degrees of freedom equal to the difference in the number of parameters estimated in the compact and the augmented model.
• The compact model assumes item parameters are the same for both groups.
• The augmented model constrains anchor items to be equal, but allows the items of interest to have parameters that vary across groups.

Raju's Area Measures
• Signed and unsigned areas indicate the area between two IRCs.
• Requires separate calibration of the item parameters in each group, then a linear transformation to put them on the same scale.
• For the 2PL model:
Signed area: $SA = b_2 - b_1$
Unsigned area: $UA = \left| \frac{2(a_2 - a_1)}{D a_1 a_2} \ln\!\left(1 + \exp\!\left(\frac{D a_1 a_2 (b_2 - b_1)}{a_2 - a_1}\right)\right) - (b_2 - b_1) \right|$
(A computational sketch appears at the end of this section.)

Procedures for Detecting DIF/DTF: Non-Parametric Procedures
• Based on bivariate frequencies between item responses and group membership, conditional on levels of the ability or trait estimate: logistic regression, the Simultaneous Item Bias Test (SIBTEST), and Mantel-Haenszel (MH).

Simultaneous Item Bias Test (SIBTEST)
• Examinees are matched on a true-score estimate of ability.
• Creates a weighted mean difference between the reference and focal groups, which is then tested statistically.
• The means are adjusted with a regression correction procedure to correct for differences in the ability distributions.
• Some examination of this procedure has been conducted on changes in Type I error rates when the percentage of DIF items is large.

$H_0: \beta_{UNI} = 0 \qquad H_1: \beta_{UNI} \neq 0$
$\beta_{UNI} = \int B(\theta)\, f_F(\theta)\, d\theta, \qquad B(\theta) = P(\theta, R) - P(\theta, F)$
where $f_F$ is the density function for $\theta$ in the focal group and $d\theta$ is the differential of theta.

Mantel-Haenszel (MH)
• Compares the item performance of two groups who were previously matched on the ability scale; the total test score can be used.
• K 2×2 contingency tables are made for each item, one for each of the K ability levels.
• DIF is shown if the odds of correctly answering the item at a given score level differ for the two groups.

Response to the suspect item at score level j:

Group | Right (1) | Wrong (0)
Reference group | $A_j$ | $B_j$
Focal group | $C_j$ | $D_j$

Odds ratio at level j: $\alpha_j = \dfrac{p_{Rj}/(1 - p_{Rj})}{p_{Fj}/(1 - p_{Fj})}$

The statistic for detecting DIF in an item is
$\chi^2_{MH} = \dfrac{\left(\left|\sum_{j=1}^{K} A_j - \sum_{j=1}^{K} E(A_j)\right| - 0.5\right)^2}{\sum_{j=1}^{K} \operatorname{Var}(A_j)}$
with the common odds ratio estimated as
$\alpha_{MH} = \dfrac{\sum_{j=1}^{K} A_j D_j / N_{..j}}{\sum_{j=1}^{K} B_j C_j / N_{..j}}$
and rescaled to the ETS delta metric:
$\Delta_{MH} = -2.35 \ln(\alpha_{MH})$
• Type A items: negligible DIF, with $|\Delta_{MH}| < 1$.
• Type B items: moderate DIF, with $1 \le |\Delta_{MH}| \le 1.5$ and a statistically significant MH test.
• Type C items: large DIF, with $|\Delta_{MH}| > 1.5$.

Logistic Regression
$p(u = 1 \mid X) = \dfrac{e^{f(x)}}{1 + e^{f(x)}}$
is the conditional probability of obtaining a correct answer given the independent variables $X$, where
$f(x) = \beta_0 + \beta_1 \theta + \beta_2 G + \beta_3 (\theta G)$
with $G$ the independent (group) variable and $\theta$ the matching criterion (normally the test score).
• If the group effect is significant and the interaction is not, then there is uniform DIF.
• If the interaction is significant, then there is non-uniform DIF.
• Conduct model comparisons by adding each successive model term (a worked sketch follows below).
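The Raju area measures above reduce to a few lines once the two group calibrations are linked to a common scale; this is a sketch for the 2PL case, with invented parameter values.

```python
import math

def raju_areas(a1, b1, a2, b2, D=1.7):
    """Raju's signed and unsigned areas between two 2PL IRCs.
    Parameters must already be on a common scale (separate group
    calibrations linked by a linear transformation)."""
    signed = b2 - b1
    if a1 == a2:
        # Equal discriminations: the curves never cross, so the
        # unsigned area is just the absolute signed area.
        return signed, abs(signed)
    k = D * a1 * a2 * (b2 - b1) / (a2 - a1)
    unsigned = abs(2.0 * (a2 - a1) / (D * a1 * a2)
                   * math.log(1.0 + math.exp(k)) - (b2 - b1))
    return signed, unsigned

print(raju_areas(a1=1.0, b1=0.0, a2=1.0, b2=0.5))  # uniform DIF
print(raju_areas(a1=1.5, b1=0.0, a2=0.7, b2=0.0))  # crossing IRCs
```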
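Likewise, the MH computation is mechanical once the K 2×2 tables are in hand; this sketch follows the formulas above, using the hypergeometric mean and variance for $E(A_j)$ and $\operatorname{Var}(A_j)$, and returns the ETS delta as well. The example tables are fabricated.

```python
import math

def mantel_haenszel(tables):
    """MH DIF statistic from K 2x2 tables, each given as
    (A, B, C, D): reference right/wrong, focal right/wrong."""
    sum_a = sum_ea = sum_var = num = den = 0.0
    for A, B, C, D in tables:
        n = A + B + C + D
        n_ref, n_foc = A + B, C + D      # group totals
        m1, m0 = A + C, B + D            # right/wrong totals
        sum_a += A
        sum_ea += n_ref * m1 / n         # E(A_j) under no DIF
        sum_var += n_ref * n_foc * m1 * m0 / (n ** 2 * (n - 1))
        num += A * D / n
        den += B * C / n
    chi2 = (abs(sum_a - sum_ea) - 0.5) ** 2 / sum_var
    alpha = num / den                    # common odds ratio
    delta = -2.35 * math.log(alpha)      # ETS delta metric
    return chi2, alpha, delta

# Three fabricated score levels favoring the reference group.
tables = [(30, 10, 20, 20), (40, 10, 30, 20), (45, 5, 35, 15)]
chi2, alpha, delta = mantel_haenszel(tables)
print(f"chi2={chi2:.2f}  alpha={alpha:.2f}  delta={delta:.2f}")
```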
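And a worked sketch of the logistic regression model comparisons, run on simulated data with a built-in uniform DIF effect; it assumes statsmodels is available, and each G² is referred to a chi-square with one degree of freedom.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)        # G: 0 = reference, 1 = focal
score = rng.normal(0.0, 1.0, n)      # matching criterion (test score)

# Simulate uniform DIF: the focal group is disadvantaged at every score.
eta = 0.5 + 1.2 * score - 0.6 * group
u = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(int)

# Successive models: score only, + group, + score x group interaction.
X1 = sm.add_constant(np.column_stack([score]))
X2 = sm.add_constant(np.column_stack([score, group]))
X3 = sm.add_constant(np.column_stack([score, group, score * group]))

ll = [sm.Logit(u, X).fit(disp=0).llf for X in (X1, X2, X3)]
print("G2 for group effect (uniform DIF):   ", 2 * (ll[1] - ll[0]))
print("G2 for interaction (non-uniform DIF):", 2 * (ll[2] - ll[1]))
```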
Computerized Adaptive Testing (CAT)
• Goal: obtain precision of measurement equal to that of a linear test, but with greater efficiency.
• Give people only the items that are informative about them.
• Reduce testing time and the opportunity for error.

CAT System
• Initial ability estimate (mean, prior).
• Select the first item (most discriminating vs. least discriminating).
• Estimate ability (MLE, Bayesian methods).
• Select items (max info, exposure control, content specs).
• Estimate ability.
• Check the stopping rule (SE stopping rule, max # of items).
(A minimal simulation of this loop appears at the end of this section.)

Issues of Research in a CAT System
Early issues:
• Precision of measurement.
• Equivalence: reliability of the estimate, test form equivalence (test information), testing mode.
• Efficiency: estimation procedure, prior estimates, item selection methods, test length.
Newer issues:
• Security.
• Item exposure.
• Testlet models.

Item Exposure and Item Selection Methods

Sympson-Hetter
• Directly controls item exposure probabilistically.
• P(S): the probability that an item is selected as the best item. P(A): the probability that an item is administered. P(A|S): the conditional probability that an item is administered given that it is selected.
• Places a filter between item selection and item administration, so items are administered below a prespecified maximum exposure rate.
• Item exposure parameter: $P(A) = P(A \mid S) \cdot P(S) \le r_{\max}$.
• P(A|S) is easy to determine if P(S) is known, but P(S) must be determined through an iterative process. (A sketch of the filter step follows at the end of this section.)

Conditional Sympson-Hetter, or SLC (Stocking & Lewis, 1998)
• SH controls item exposure for a population, but at particular ability levels the exposure rates can be quite high.
• P(A|S) is determined at specific trait levels rather than across the population.

a-Stratified Design (STR CAT; Chang & Ying, 1996, 1999)
• Partition the item pool into multiple levels and stages according to the discrimination parameters.
• Start with the less discriminating items, then use a b-matching item selection procedure.
• It is less computationally complex, and no other restrictions on item exposure are imposed.
• This approach seems to improve item pool utilization and balance item exposure rates.
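As promised above, a minimal simulation of the CAT loop: maximum-information item selection with an SE stopping rule. To keep the sketch short, theta is re-estimated by a coarse grid search over the log-likelihood rather than a full MLE or Bayesian update, and the item pool is randomly generated.

```python
import numpy as np

D = 1.7

def p_2pl(theta, a, b):
    """2PL response probability."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = p_2pl(theta, a, b)
    return (D * a) ** 2 * p * (1.0 - p)

def run_cat(a, b, true_theta, se_stop=0.3, max_items=30, seed=0):
    rng = np.random.default_rng(seed)
    grid = np.linspace(-4, 4, 161)       # grid for theta estimation
    loglik = np.zeros_like(grid)
    theta_hat, used = 0.0, []            # start at the prior mean
    for _ in range(max_items):
        # Select the unused item with maximum information at theta_hat.
        info = info_2pl(theta_hat, a, b)
        info[used] = -np.inf
        j = int(np.argmax(info))
        used.append(j)
        # Administer the item against the simulee's true theta.
        u = rng.random() < p_2pl(true_theta, a[j], b[j])
        # Update the log-likelihood and the grid estimate of theta.
        pg = p_2pl(grid, a[j], b[j])
        loglik += np.log(pg if u else 1.0 - pg)
        theta_hat = grid[np.argmax(loglik)]
        # SE from the test information at the current estimate.
        se = 1.0 / np.sqrt(sum(info_2pl(theta_hat, a[k], b[k]) for k in used))
        if se < se_stop:                 # SE stopping rule
            break
    return theta_hat, len(used), se

pool_rng = np.random.default_rng(1)
a = pool_rng.uniform(0.5, 2.0, 200)      # discriminations
b = pool_rng.uniform(-2.5, 2.5, 200)     # difficulties
print(run_cat(a, b, true_theta=1.0))
```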
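Finally, a rough sketch of the Sympson-Hetter filter step. The exposure parameters P(A|S) are taken as given here (in practice they come from the iterative simulation noted above, so that P(A) = P(A|S)·P(S) ≤ r_max), and the fall-back-to-the-next-best-item behavior is an assumption for illustration.

```python
import random

def sh_filter(ranked_items, p_admin_given_select):
    """Sympson-Hetter exposure filter: walk down the information-ranked
    list and administer an item only if it passes its P(A|S) lottery.
    Items without an exposure parameter pass with probability 1."""
    for item in ranked_items:
        if random.random() <= p_admin_given_select.get(item, 1.0):
            return item
    return ranked_items[-1]   # everything failed: give the last candidate

# Hypothetical exposure parameters for an information-ranked shortlist.
exposure = {"item_17": 0.4, "item_03": 0.9}
print(sh_filter(["item_17", "item_03", "item_88"], exposure))
```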