Can subject matter experts' ratings of statement extremity be used to streamline the development of unidimensional pairwise preference scales?
Stephen Stark

Introduction: applications of noncognitive CAT
• For the 3PL model, a sample size of at least 1,000 is recommended to attain adequate classification accuracy.
• Item pools for noncognitive tests rarely contain more than 50 items per dimension, because it is difficult to generate a large number of descriptors.
• Pairwise preference items
  – Just 20 statements can generate 190 pairwise items (20 × 19 / 2).

Using subject matter experts (SMEs) to develop adaptive UPP scales
• Marginal maximum likelihood (MML)
  – Pretesting is usually a lengthy and expensive endeavor.
• Subject matter experts (SMEs)
  – Already a part of test development and validation
  – Specify the test blueprint: the numbers, types, and difficulties of items for each content area
  – History:
    • Calibrating statements for behaviorally anchored rating scales (BARS) and behavioral summary scales (BSS)
    • Scoring situational judgment tests

Using subject matter experts (SMEs) to develop adaptive UPP scales (continued)
• Sources of MML error:
  – Sampling error
  – Priors used during estimation
• SME error
  – Hard to detect (true value unknown)
• SMEs in an IRT-based UPP testing framework
  – The 2 statements in an item measure the same dimension.
  – Each statement is characterized by one location parameter μ.
  – SMEs could rate the statements, and the ratings could be used in scoring.
• An empirical study compares the standard errors of scores when statement parameters come from MML estimation or from SME ratings.
• A simulation study examines the recovery of known trait scores when scoring and CAT are based on SME or MML locations.

The ZG IRT model for UPPs
• Parameters: θ and μs − μt
• Example: μs = 2.0, μt = −1.4, so μs − μt = 3.4
• Example: μs = 0.6, μt = 2.2, so μs − μt = −1.6
• The item response function is monotonically increasing or decreasing.
• When μs = μt, the slope is 0 (Pst = 0.5).
• As the distance between the statements increases, the slope increases, because the choice probability changes rapidly over a narrow range of θ.
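The model equation itself did not survive on the slides above. As a minimal sketch, the block below assumes the form commonly given for the Zinnes-Griggs (ZG) model, Pst(θ) = 1 − Φ(a) − Φ(b) + 2Φ(a)Φ(b) with a = (2θ − μs − μt)/√3 and b = μs − μt. That formula and the function names `zg_prob` and `zg_slope` are assumptions for illustration, not taken from the slides. Under that assumption the code reproduces the listed properties: Pst = 0.5 with zero slope when μs = μt, a monotone curve whose direction follows the sign of μs − μt, and a steeper slope at larger distances.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def norm_pdf(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def zg_prob(theta: float, mu_s: float, mu_t: float) -> float:
    """P(statement s preferred to t | theta) under the ASSUMED ZG form."""
    a = (2.0 * theta - mu_s - mu_t) / math.sqrt(3.0)
    b = mu_s - mu_t
    return 1.0 - norm_cdf(a) - norm_cdf(b) + 2.0 * norm_cdf(a) * norm_cdf(b)


def zg_slope(theta: float, mu_s: float, mu_t: float) -> float:
    """Derivative of zg_prob with respect to theta (same assumed form)."""
    a = (2.0 * theta - mu_s - mu_t) / math.sqrt(3.0)
    b = mu_s - mu_t
    return (2.0 / math.sqrt(3.0)) * norm_pdf(a) * (2.0 * norm_cdf(b) - 1.0)


if __name__ == "__main__":
    # Number of distinct pairs that 20 statements can form.
    print("pairs from 20 statements:", math.comb(20, 2))
    # The two example items from the slides, evaluated on a coarse theta grid.
    for mu_s, mu_t in [(2.0, -1.4), (0.6, 2.2)]:
        curve = [round(zg_prob(th, mu_s, mu_t), 3) for th in (-3, -1, 1, 3)]
        mid = (mu_s + mu_t) / 2.0  # slope is evaluated at the midpoint
        print(mu_s, mu_t, curve, "slope at midpoint:", zg_slope(mid, mu_s, mu_t))
```

Evaluating the slope at the midpoint (μs + μt)/2 is a convenience chosen here because the density term peaks there; the slides do not say where the slope was read off.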
• Contrary to the intuitive notion that two similar alternatives provide more information:
  – As the distance between the statements increases, information increases.
  – Information attains its maximum when |μs − μt| = 2.0 [figure annotation: ˆ 1.6]

CAT with the ZG UPP model
• Generate a pool of pairwise items from the statements' location parameters
  – Constraints on the minimum distance (not used)
• Prior ~ N(0,1)
• Initial score = 0
• EAP estimation
• Randomly select the next item from the subset of items with near-maximum information (at least 90% of the maximum)

CAT with the ZG UPP model (continued)
• Termination:
  – maximum test length reached,
  – or no satisfactory items remain
• Availability constraints
  – Limit the number of times a statement appears (repeat only once and not within the last 2 or 3 items)
  – Strict information and availability constraints might result in premature test termination

Study 1
• Effect of using SME estimates on scoring accuracy and criterion validity in an empirical sample
• Choose a realistic correlation between SME and MML locations for Study 2
• External validity:
  – Preventative health behavior
    • Health checklist (8 items, reliability = .76)
  – Study behavior
    • Study behavior questionnaire (SBQ, 10 items, reliability = .76)

Participants and Measures
• 602 freshmen and sophomores
• Female: 77%, male: 23%
• 2 dimensions: Order and Self-control
• 20 to 25 personality statements for each dimension
• 2 SMEs rated μs and μt on a 7-point scale
  – The locations were transformed to a −3 to 3 scale.
• Average distance = 2.95; distances for the 24 items range from 0.5 to 5.0.

Analysis
• Software: MODFIT 2.0
• Fit statistics
  – For single items, pairs, and groups of three items
• Correlation between MML locations and SME locations
• EAP trait score estimates based on MML and SME locations
• Marginal reliabilities
• Correlation with external criteria

Results
• Fit statistics < 3.0
• Item 3 showed a problem but was retained.
• Inter-rater correlation: .95 and .91
• Range of locations: MML < SME, because of the prior used in MML estimation
• For Order: SME > MML
• Correlation(MML, SME): .83 and .62
• Trait scores correlated highly for both scales
  – SME ratings approximate the MML estimates
• As expected, Order scores correlate more highly with the criteria than Self-control scores
• MML-based and SME-based scores correlate with the criteria similarly
  – Error in the SME locations did not affect the rank order of θ or the criterion validities

Study 2: Scoring Accuracy in CAT
• CAT aims at increasing accuracy or reducing testing time.
• The UPP method can yield relatively large item pools.
• Statement parameters affect both item selection and scoring, so the detrimental effect of parameter error could propagate in CAT.

Method
• Condition 1: Uniform[−3, +3]
• Conditions 2 and 3: generate responses from the Condition 1 parameters and estimate the locations with MML.
  – θ ~ N(0,1)
• Conditions 4 to 8: SME locations with intended correlations of .9, .8, .7, .6, and .0 with the TRUE locations

Method (continued)
• Item selection: CAT and non-CAT
• Test length: 8 items and 15 items
• 100 examinees at each θ point: [−3.0, −2.8, …, +3.0]
  – Bias
  – AbsBias
  – RMSE
  – Correlation

Results
• TRUE locations correlated highly with MML locations.
• SME locations correlated with TRUE near the intended values of .9, .8, .7, .6, and .0.
• SME8 and SME6 correspond reasonably well to the empirical results (.83 and .62) in Study 1.
• The TRUE results are mimicked by MML1000 and MML500.
• The SME conditions show larger AbsBias and RMSE.
• With 8 items, the correlations for the SME conditions are high (.70 to .90).
• The correlations for SME6 and SME8 are very high, exceeding those for non-CAT with 15 items.
• For CAT, the correlations for SME6 and SME7 are high. Even with 8 items, they are still higher than for non-CAT with 15 items.

Non-CAT, 15 items
• Effect of regression to the mean (EAP): SME9 < SME6
• With SME0, responses behaved as if random and earned scores averaging near the prior mean (0).

CAT, 15 items
[figure not recovered]

Discussion
• There is interest in using other IRT models.
• The correlations between SME-based and MML-based trait scores were above .90, so the approach can be applied to actual personnel decisions. Further, criterion validity was adequate.
• Score accuracy would improve with a longer test or by implementing a CAT procedure.

Discussion (continued)
• Only two SMEs were used; a larger number should yield more precision.
• Trained raters can estimate locations and perform standard setting well in cognitive (ability) tests.
• Explore classification accuracy (further simulation).
• Pairwise preference items pairing statements from different domains.
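The CAT procedure and the Study 2 accuracy indices described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code. It assumes the commonly cited Zinnes-Griggs response function, generates responses from "true" locations, and scores with a second set of locations. Here those are true values plus normal noise with hypothetical standard deviation `noise_sd`, a simple stand-in for imperfect SME ratings rather than the slides' exact way of hitting a target correlation. Statement-level availability constraints are omitted and each pair is used at most once. All function names are hypothetical.

```python
import itertools
import math
import random


def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def zg_prob(theta, mu_s, mu_t):
    """P(s preferred to t | theta); ASSUMED Zinnes-Griggs form."""
    a = (2.0 * theta - mu_s - mu_t) / math.sqrt(3.0)
    b = mu_s - mu_t
    return 1.0 - norm_cdf(a) - norm_cdf(b) + 2.0 * norm_cdf(a) * norm_cdf(b)


def zg_info(theta, mu_s, mu_t):
    """Fisher information of a binary item: P'(theta)^2 / (P * (1 - P))."""
    a = (2.0 * theta - mu_s - mu_t) / math.sqrt(3.0)
    slope = (2.0 / math.sqrt(3.0)) * norm_pdf(a) * (2.0 * norm_cdf(mu_s - mu_t) - 1.0)
    p = zg_prob(theta, mu_s, mu_t)
    return slope * slope / max(p * (1.0 - p), 1e-12)


GRID = [-4.0 + 0.1 * k for k in range(81)]  # quadrature points for EAP


def eap(responses, locs):
    """Posterior mean of theta on GRID under a N(0,1) prior.

    responses: list of ((s, t), u), with u = 1 if statement s was preferred.
    """
    weights = []
    for q in GRID:
        w = norm_pdf(q)
        for (s, t), u in responses:
            p = zg_prob(q, locs[s], locs[t])
            w *= p if u == 1 else 1.0 - p
        weights.append(w)
    return sum(q * w for q, w in zip(GRID, weights)) / sum(weights)


def run_cat(theta_true, true_locs, scoring_locs, test_len, rng):
    """One adaptive test: select and score with scoring_locs, respond with true_locs."""
    pool = list(itertools.combinations(range(len(true_locs)), 2))
    responses, theta_hat = [], 0.0  # initial score = 0
    for _ in range(test_len):
        if not pool:  # no items left: terminate early
            break
        infos = [zg_info(theta_hat, scoring_locs[s], scoring_locs[t]) for s, t in pool]
        cutoff = 0.9 * max(infos)  # keep items with >= 90% of the maximum information
        s, t = rng.choice([item for item, i in zip(pool, infos) if i >= cutoff])
        pool.remove((s, t))
        u = 1 if rng.random() < zg_prob(theta_true, true_locs[s], true_locs[t]) else 0
        responses.append(((s, t), u))
        theta_hat = eap(responses, scoring_locs)
    return theta_hat


def simulate(noise_sd, n_statements=20, test_len=15, reps=5, seed=1):
    """Return (bias, rmse) over theta = -3.0, -2.8, ..., 3.0 with reps examinees each."""
    rng = random.Random(seed)
    true_locs = [rng.uniform(-3.0, 3.0) for _ in range(n_statements)]
    scoring_locs = [m + rng.gauss(0.0, noise_sd) for m in true_locs]
    errors = []
    for k in range(31):
        theta = -3.0 + 0.2 * k
        for _ in range(reps):
            errors.append(run_cat(theta, true_locs, scoring_locs, test_len, rng) - theta)
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return bias, rmse
```

Calling `simulate(0.0)` scores with the true locations, while a call such as `simulate(1.3)` scores with noisy ones; comparing the returned bias and RMSE gives a rough feel for how location error propagates through item selection and scoring. The default of 5 examinees per θ point keeps the run short; the slides used 100.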
