Can subject matter experts’ ratings of statement extremity be used to streamline the development of unidimensional pairwise preference scales?
Stephen Stark
Introduction: applications of noncognitive CAT
• For the 3PL model, a sample size of at least 1,000 is recommended to attain adequate classification accuracy.
• Pools for noncognitive tests rarely contain more than 50 items per dimension, because it is difficult to generate a large number of descriptors.
• Pairwise preference items
– Just 20 statements can generate 190 pairwise
items
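The count of 190 follows from taking every unordered pair of 20 statements, i.e. 20 × 19 / 2. A minimal sketch of that arithmetic; the statement labels are hypothetical placeholders:

```python
from itertools import combinations
from math import comb

# 20 hypothetical statement labels (placeholders, not from the study)
statements = [f"s{i}" for i in range(1, 21)]

# Every unordered pair (s, t) is one candidate pairwise preference item
pairs = list(combinations(statements, 2))

print(len(pairs))   # number of candidate pairwise items
print(comb(20, 2))  # same count from the binomial coefficient
```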
Using subject matter experts (SMEs) to
develop adaptive UPP scales
• Marginal maximum likelihood (MML)
– Pretesting is usually a lengthy and expensive endeavor.
• Subject Matter Experts (SMEs)
– Already a part of test development and validation
– Specify the test blueprint: the number, types, and difficulties of items for each content area.
– History:
• Calibrating statements for behaviorally anchored rating scales (BARS) and behavioral summary scales (BSS)
• Scoring situational judgment tests
• Sources of MML error:
– Sampling error
– Priors used during estimation
• SME error
– Hard to detect (true value unknown)
• SMEs in an IRT-based UPP testing framework
– The 2 statements in an item measure the same dimension
– Each statement is characterized by one location parameter μ
– SMEs could rate the statements, and the resulting locations could be used for scoring
• Use empirical data to compare the standard errors of scores computed with MML or SME parameters
• A simulation study examines the recovery of known trait scores when scoring and CAT are based on SME or MML locations
The ZG IRT model for UPPs
• [Figure: response probability as a function of θ and μs-μt]
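For reference, the Zinnes and Griggs (ZG) model is usually written as below. The notation comes from the standard statement of the model rather than from the slide itself: Φ is the standard normal cumulative distribution function, θ is the respondent's trait level, and μs and μt are the locations of statements s and t.

```latex
P_{st}(\theta) = 1 - \Phi(a_{st}) - \Phi(b_{st}) + 2\,\Phi(a_{st})\,\Phi(b_{st}),
\qquad
a_{st} = \frac{2\theta - \mu_s - \mu_t}{\sqrt{3}},
\qquad
b_{st} = \mu_s - \mu_t
```

When μs = μt, b_st = 0 and Φ(b_st) = 0.5, so P_st(θ) = 0.5 for every θ.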
• Example 1:
– μs = 2.0, μt = -1.4
– μs - μt = 3.4
• Example 2:
– μs = 0.6, μt = 2.2
– μs - μt = -1.6
• The response function is monotonically increasing (↑) or decreasing (↓)
• Distance ↑, slope ↑
• μs = μt: slope = 0 (Pst = 0.5)
• Distance ↑, slope ↑, because the choice probability changes rapidly over a narrow range.
• Contrary to the intuitive notion that two similar alternatives provide more information:
– Distance ↑, information ↑
– Information attains its maximum when |μs - μt| = 2.0
[Figure label, partly lost in extraction: θ̂, 1.6]
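The properties listed above can be checked numerically. The following is a minimal sketch, assuming the usual ZG form P = 1 - Φ(a) - Φ(b) + 2Φ(a)Φ(b) with a = (2θ - μs - μt)/√3 and b = μs - μt, and the standard Fisher information for a dichotomous item, I(θ) = P′(θ)² / [P(θ)(1 - P(θ))]. The function names are illustrative, not taken from any software mentioned in the slides.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, sd 1

def zg_prob(theta, mu_s, mu_t):
    """Probability of preferring statement s to statement t (assumed ZG form)."""
    a = (2.0 * theta - mu_s - mu_t) / sqrt(3.0)
    b = mu_s - mu_t
    cdf_a, cdf_b = STD_NORMAL.cdf(a), STD_NORMAL.cdf(b)
    return 1.0 - cdf_a - cdf_b + 2.0 * cdf_a * cdf_b

def zg_slope(theta, mu_s, mu_t):
    """Derivative of zg_prob with respect to theta."""
    a = (2.0 * theta - mu_s - mu_t) / sqrt(3.0)
    b = mu_s - mu_t
    return (2.0 / sqrt(3.0)) * STD_NORMAL.pdf(a) * (2.0 * STD_NORMAL.cdf(b) - 1.0)

def zg_info(theta, mu_s, mu_t):
    """Fisher information of one pairwise item at theta."""
    p = zg_prob(theta, mu_s, mu_t)
    if p <= 0.0 or p >= 1.0:  # guard against floating-point saturation
        return 0.0
    return zg_slope(theta, mu_s, mu_t) ** 2 / (p * (1.0 - p))

# The two example items from the slide, evaluated at a few trait levels
for theta in (-2.0, 0.0, 2.0):
    print(theta, zg_prob(theta, 2.0, -1.4), zg_prob(theta, 0.6, 2.2))
```

In this form the sign of μs - μt decides whether the curve rises or falls with θ, and the slope vanishes when the two locations coincide.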
CAT with the ZG UPP model
• Generate a pool of pairwise items from the statements’ location parameters
– Constraints on the minimum distance (not used)
• Prior ~ N(0,1)
• Initial score = 0
• EAP estimation
• Randomly select an item from the subset of items with near-maximum information (at least 90% of the maximum)
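The scoring and selection steps above can be sketched as follows. This is an illustration under stated assumptions, not the presenter's implementation: the ZG response function is taken in its usual form, EAP is computed on a fixed quadrature grid from -4 to 4 (the grid is an assumption), and all function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

N01 = NormalDist()  # standard normal, used for the prior and the ZG model

def zg_prob(theta, mu_s, mu_t):
    a = (2.0 * theta - mu_s - mu_t) / sqrt(3.0)
    b = mu_s - mu_t
    return 1.0 - N01.cdf(a) - N01.cdf(b) + 2.0 * N01.cdf(a) * N01.cdf(b)

def zg_info(theta, mu_s, mu_t):
    a = (2.0 * theta - mu_s - mu_t) / sqrt(3.0)
    slope = (2.0 / sqrt(3.0)) * N01.pdf(a) * (2.0 * N01.cdf(mu_s - mu_t) - 1.0)
    p = zg_prob(theta, mu_s, mu_t)
    return 0.0 if p <= 0.0 or p >= 1.0 else slope ** 2 / (p * (1.0 - p))

# Assumed quadrature grid for EAP: 81 points from -4.0 to 4.0
GRID = [-4.0 + 0.1 * k for k in range(81)]

def eap(responses):
    """EAP score with a N(0,1) prior.

    responses: list of (mu_s, mu_t, u), where u = 1 if statement s was preferred.
    With no responses the estimate is the prior mean, i.e. the initial score of 0.
    """
    weights = []
    for theta in GRID:
        w = N01.pdf(theta)
        for mu_s, mu_t, u in responses:
            p = zg_prob(theta, mu_s, mu_t)
            w *= p if u == 1 else 1.0 - p
        weights.append(w)
    return sum(t * w for t, w in zip(GRID, weights)) / sum(weights)

def select_item(theta_hat, pool, rng, fraction=0.9):
    """Pick at random among items whose information is >= fraction * maximum."""
    infos = [zg_info(theta_hat, mu_s, mu_t) for mu_s, mu_t in pool]
    cutoff = fraction * max(infos)
    candidates = [item for item, info in zip(pool, infos) if info >= cutoff]
    return rng.choice(candidates)

# Usage with a small hypothetical pool of (mu_s, mu_t) pairs
rng = random.Random(0)
pool = [(1.0, -1.0), (0.2, 0.1), (2.0, -2.0)]
theta_hat = eap([])                       # initial score
item = select_item(theta_hat, pool, rng)  # first item administered
theta_hat = eap([(item[0], item[1], 1)])  # update after one response
```

Choosing at random within the near-maximum subset, rather than always taking the single best item, spreads exposure across the pool.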
• Termination:
– maximum test length reached,
– or no satisfactory items remain
• Availability constraints
– Limit the number of times a statement appears (it may repeat only once, and not within the last 2 or 3 items)
– Strict information and availability constraints together might result in premature test termination
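One way to read the availability constraints is as a filter applied to the pool before item selection. The sketch below assumes that "repeat only once" means each statement may appear at most twice in total, and that the lockout window is counted in items; both readings, and the function and parameter names, are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def available_items(pool, administered, max_uses=2, lockout=2):
    """Return the pairs in pool that satisfy the availability constraints.

    pool, administered: lists of (statement_id, statement_id) pairs;
    administered is in order of presentation.
    """
    uses = Counter(stmt for pair in administered for stmt in pair)
    recent = set()
    if lockout > 0:
        recent = {stmt for pair in administered[-lockout:] for stmt in pair}
    seen = {frozenset(pair) for pair in administered}

    eligible = []
    for s, t in pool:
        if frozenset((s, t)) in seen:
            continue  # this exact pair was already administered
        if uses[s] >= max_uses or uses[t] >= max_uses:
            continue  # a statement has used up its appearances
        if s in recent or t in recent:
            continue  # a statement appeared within the lockout window
        eligible.append((s, t))
    return eligible

# Six hypothetical statements; two items already administered
pool = list(combinations("ABCDEF", 2))
print(available_items(pool, [("A", "B"), ("C", "D")]))
```

An empty result corresponds to the "no satisfactory items" stopping rule, which is how strict constraints can end a test early.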
Study 1
• Effect of using SME estimates on scoring accuracy and criterion validity in an empirical sample
• Choose realistic correlations between SME and MML locations for Study 2
• External validity:
– Preventative health behavior
• Health checklist (8 items, reliability = .76)
– Study behavior
• Study Behavior Questionnaire (SBQ, 10 items, reliability = .76)
Participants and Measures
• 602 freshmen and sophomores
• Female: 77%; male: 23%
• 2 dimensions: Order and Self-control
• 20 to 25 personality statements for each dimension
• 2 SMEs rated μs and μt on a 7-point scale
– The locations were transformed to a -3 to 3 scale.
• Average distance = 2.95; distances for the 24 items range from 0.5 to 5.0.
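The slides do not say how the 7-point ratings were mapped onto the -3 to 3 scale or how the two SMEs' ratings were combined. A minimal sketch, assuming a linear rescaling (1 maps to -3, 7 maps to 3) and a simple average across raters; the ratings below are made-up illustrations, not study data:

```python
def to_location(rating, low=1, high=7, new_low=-3.0, new_high=3.0):
    """Linearly rescale a rating from [low, high] to [new_low, new_high]."""
    return new_low + (rating - low) * (new_high - new_low) / (high - low)

# Hypothetical ratings from two SMEs for two statements
sme_a = {"stmt_1": 6, "stmt_2": 2}
sme_b = {"stmt_1": 7, "stmt_2": 3}

# Assumed combination rule: average the two rescaled ratings
locations = {
    name: (to_location(sme_a[name]) + to_location(sme_b[name])) / 2.0
    for name in sme_a
}

# Distance between the two statements of a pairwise item
distance = abs(locations["stmt_1"] - locations["stmt_2"])
print(locations, distance)
```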
Analysis
• Software: MODFIT 2.0
• χ²/df statistics
– Single items, pairs, and groups of three items.
• Correlation between MML locations and SME locations
• EAP trait score estimates based on MML and SME locations
• Marginal reliabilities
• Correlation with external criterion
Result
• Fit statistics < 3.0
• Item 3 showed a problem but was retained.
• Inter-rater correlations: .95 and .91
• Range of locations: MML < SME, because of the prior
• For Order: SME > MML
• Correlation(MML, SME): 0.83 and 0.62
• Trait scores correlated highly for both scales
– SME ratings approximate the MML estimates
• As expected, Order scores correlated more highly with the criteria than Self-control scores
• MML and SME scores correlated with the criteria similarly
– Error in the SME locations did not affect the rank order of θ or the criterion validities
Study 2: Scoring Accuracy in CAT
• CAT aims to increase accuracy or reduce testing time.
• The UPP method can yield relatively large pools
• Statement parameters affect both item selection and scoring, so the detrimental effect of parameter error could propagate in CAT.
Method
• Condition 1: statement locations generated from Uniform[-3, +3]
• Conditions 2 & 3: generate responses from the condition 1 parameters and estimate locations with MML; θ ~ N(0,1)
• Condition 4~8:
• item selection: CAT and non-CAT
• Test length: 8 items and 15 items
• 100 examinees at each θ point: [-3.0, -2.8, …, +3.0]
– Bias
– AbsBias
– RMSE
– Correlation
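The four accuracy indices can be computed from paired true and estimated trait scores. The sketch below uses common definitions; in particular, AbsBias is taken here as the mean absolute error, which is one plausible reading since the slides do not define it. The function names are illustrative.

```python
from math import sqrt

def bias(estimates, truths):
    """Mean signed error: average of (estimate - true)."""
    return sum(e - t for e, t in zip(estimates, truths)) / len(truths)

def abs_bias(estimates, truths):
    """Mean absolute error (assumed definition of AbsBias)."""
    return sum(abs(e - t) for e, t in zip(estimates, truths)) / len(truths)

def rmse(estimates, truths):
    """Root mean squared error."""
    return sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(truths))

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical true and estimated scores for three simulated examinees
true_theta = [0.0, 0.0, 1.0]
est_theta = [0.5, -0.5, 1.5]
print(bias(est_theta, true_theta), abs_bias(est_theta, true_theta),
      rmse(est_theta, true_theta), pearson(est_theta, true_theta))
```

Bias lets positive and negative errors cancel, so it can be near zero even when AbsBias and RMSE are large.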
Result
• TRUE locations correlated highly with MML locations
• SME locations correlated with TRUE near the intended values of .9, .8, .7, .6, .0
• SME8 & SME6 correspond reasonably to the empirical results (.83 & .62) in Study 1
• Results for TRUE are mimicked by MML1000 & MML500
• SME conditions show larger AbsBias and RMSE
• With 8 items, correlations for the SME conditions are high (.70 to .90)
• Correlations for SME6 and SME8 are very high, exceeding those for the non-CAT 15-item test
• For CAT, correlations for SME6 and SME7 are high; even with 8 items they are still higher than for the non-CAT 15-item test
[Figure: non-CAT, 15 items]
• Effect of regression to the mean (EAP): SME9 < SME6
• With SME0, responses are effectively random, and scores average near the prior mean (0)
[Figure: CAT, 15 items]
Discussion
• There is interest in using other IRT models
• Correlations between SME-based and MML-based trait scores were above .90, so SME-based scoring could be applied to actual personnel decisions. Further, criterion validity was acceptable.
• Score accuracy would improve with a longer test or with a CAT procedure.
• Only two SMEs were used; a larger panel should yield more precise locations
• Trained raters can estimate item locations and set standards well in cognitive (ability) tests.
• Explore classification accuracy (advanced simulation)
• Pairwise preferences between different domains.