Evaluating Model Fit with IRT
American Board of Internal Medicine
Item Response Theory Course
How good is our model?
• Assessment of Model-data Fit
  – Model Assumptions
  – Model Predictions vs. observed data
  – Examples of Goodness-of-Fit
Model-data Fit
• George Box (1979):
“All models are wrong but some are useful.”
• A “model” is something we use to
approximate reality for the purpose of
making predictions, explaining data, etc.
• Strictly speaking, no model will fit the data perfectly…the question is, "how much model-data misfit is too much?"
Model-data Fit
• "How much misfit is too much?"
• Generally, there are no absolute criteria for model-data fit.
• "Let your conscience be your guide"
  – Conduct a variety of analyses, consider the full set of results, and be mindful of these with regard to the intended application of the IRT model.
Model-Data Fit
• Evaluate model assumptions.
• Assess residuals and standardized residuals and examine consequences of model misfit.
Types of Evidence
• Are model assumptions met to a
‘reasonable degree’ by test data?
– Determine if model features, like parameter invariance, are present.
• Assess the accuracy of model
predictions versus actual data.
– Do expected probabilities match with empirical relative frequencies?
Model Assumptions
• Are model assumptions met to a
reasonable degree by test data?
– Unidimensionality
– Local Independence
– Nature of the ICC
  • Minimal Guessing (for 1- & 2-PL)
  • Equal Discrimination (for 1-PL)
– Parameter Invariance
Unidimensionality
• Dimensionality is often analyzed through a Principal Components Analysis (PCA) to determine how many "components" or "dimensions" are necessary in order to explain a significant amount of total variance.
PCA
• PCA is a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components.
• Analysis is performed on the correlation matrix.
• The first principal component accounts for as much variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
PCA
• Each principal component will receive an 'eigenvalue' (λj) which indirectly represents the amount of variance that component explains.
• In IRT, we look for a "dominant" first factor in order to infer the unidimensionality of the test.
Unidimensionality
• A scree plot is often used to evaluate
the dominance of the first factor.
• Rules of thumb for "essential unidimensionality":
  – First eigenvalue: λ1 / n ≥ 0.20.
  – Ratio of λ1 to λ2 should be "large."
  – Other eigenvalues: λj ≤ 1.
Unidimensionality
[Scree plot for a 40-item test: eigenvalue (λj) by component. Because the analysis is performed on the correlation matrix, the eigenvalues sum to the number of items n, so if λ1 / n ≈ 0.20 we say the test is "essentially unidimensional."]
Unidimensionality
[Scree plot for the same 40-item test: λ1 / n = 8.45 / 40 = 0.21, and λ1 / λ2 = 8.45 / 1.5 = 5.63. The first component accounts for 21% of the total variance, and it accounts for 5.63 times the amount of variance as the second component.]
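To make these rules of thumb concrete, here is a minimal Python sketch of the eigenvalue check (not part of the original slides). It assumes a scored 0/1 response matrix; the simulated `responses` data are placeholders only.

```python
# A minimal sketch of the unidimensionality check, assuming `responses`
# is an (examinees x items) 0/1 score matrix; the 500x40 simulated data
# below are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
responses = (rng.random((500, 40)) < 0.6).astype(int)  # placeholder data

R = np.corrcoef(responses, rowvar=False)   # item inter-correlation matrix
eigenvalues = np.linalg.eigvalsh(R)[::-1]  # sorted largest-first; they sum to n

n = R.shape[0]
print(f"lambda1 / n       = {eigenvalues[0] / n:.2f}   (rule of thumb: >= 0.20)")
print(f"lambda1 / lambda2 = {eigenvalues[0] / eigenvalues[1]:.2f}   (should be 'large')")
```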
Local Independence
• Item responses should be
independent given ability.
– Omits or unexpectedly low response probabilities for the last few items may indicate speededness of the test and will violate local independence.
– Response probabilities should not be
related to group membership (DIF).
Nature of the ICC
• The probability of success
should be monotonically increasing with θ.
• No systematic errors of prediction should be apparent.
Nature of the ICC
• Minimal Guessing (1- or 2-PL)
  – Check performance levels of low ability examinees.
  – Conditional p-values for low ability examinees should be close to zero when fitting a 1- or 2-PL model.
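A hedged sketch of this check, assuming a scored 0/1 response matrix; the `low_ability_p_values` helper and the bottom-decile cutoff are illustrative choices, not a prescribed procedure.

```python
# Compute each item's conditional p-value (proportion correct) among the
# lowest-ability examinees, here approximated by the bottom decile of
# total scores.
import numpy as np

def low_ability_p_values(responses: np.ndarray, quantile: float = 0.10) -> np.ndarray:
    """responses: (examinees x items) 0/1 matrix. Returns per-item p-values
    for examinees whose total score falls in the lowest `quantile`."""
    totals = responses.sum(axis=1)
    cutoff = np.quantile(totals, quantile)
    low_group = responses[totals <= cutoff]
    return low_group.mean(axis=0)

# p-values well above zero for the low group suggest guessing, i.e. a
# 1-PL or 2-PL model (which assume no guessing) is unlikely to fit.
```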
Nature of the ICC
• Equal Discrimination (1-PL)
– Essentially homogeneous distribution for the item-total correlations.
– All r values should be approximately
equal when fitting a 1-PL model.
– When examining the model-data fit, is there any pattern of errors that might be accounted for by a different slope?
Parameter Invariance
• Determine if model parameter invariance is present.
  – Invariance of ability (θ).
  – Invariance of item parameters (a, b, c).
Parameter Invariance
• Invariance of ability (θ)
– Compare examinee ability estimates
based on different sets of items from the
pool of items calibrated on a common
scale.
– If the model fits, different item samples should still produce close to the same ability estimates (SE will change for shorter tests, though).
Parameter Invariance
• Invariance of ability (θ)
– Scatterplots of examinee ability based on one sample of items versus the other should be strongly linearly related.
– Relationship won’t be perfect: sampling
error does enter the picture.
– Those estimates far from the best-fit line represent a violation of invariance.
Parameter Invariance
• Invariance of item parameters (a, b, c)
  – Compare item statistics obtained in two or more groups (e.g., high and low performing groups; ethnicity; gender).
– If the model fits, different examinee
samples should still produce close to the
same item parameter estimates.
Parameter Invariance
• Invariance of item parameters (a, b, c)
  – Scatterplots of b-b, a-a, and c-c based on one sample of examinees versus the other should be strongly linearly related.
– Relationship won’t be perfect: sampling
error does enter the picture.
– Those estimates far from the best-fit line represent a violation of invariance.
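As an illustration of the scatterplot check, here is a minimal sketch; the `b_group1`/`b_group2` arrays are hypothetical b-parameter estimates already linked to a common scale, and the 2-SD flagging rule is just one plausible choice.

```python
# Check item-parameter invariance: compare b estimates for the same items
# calibrated in two examinee groups (hypothetical placeholder values).
import numpy as np

b_group1 = np.array([-1.2, -0.5, 0.1, 0.8, 1.5])   # e.g., high performers
b_group2 = np.array([-1.1, -0.6, 0.2, 0.7, 1.6])   # e.g., low performers

# The two sets should be strongly linearly related; a high correlation
# supports invariance.
r = np.corrcoef(b_group1, b_group2)[0, 1]
print(f"b-b correlation: {r:.3f}")

# Flag items far from the best-fit line as possible invariance violations.
slope, intercept = np.polyfit(b_group1, b_group2, 1)
residuals = b_group2 - (slope * b_group1 + intercept)
suspect = np.where(np.abs(residuals) > 2 * residuals.std())[0]
print("items to inspect:", suspect)
```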
Check Item Invariance
• Differential Item Functioning (DIF)
– Details on Thursday
• A test item is labeled with "DIF" when examinees with equal ability, but from different groups, have an unequal probability of answering the item correctly.
• This violates parameter invariance
because items aren’t unidimensional.
DIF issues
• Uniform DIF: item is
systematically more difficult for
members of one group, even after
matching examinees on ability (θ).
• Cause: shift in b-parameter.
[Figure: item characteristic curves for two groups illustrating uniform DIF; P(u = 1 | θ) plotted against ability (θ).]
DIF issues
• Non-Uniform DIF: shift in item difficulty is not consistent across the ability continuum.
• Increase/decrease in P for low-ability examinees is offset by the converse for high-ability examinees.
• Cause: shift in a-parameter.
[Figure: item characteristic curves for two groups illustrating non-uniform DIF; P(u = 1 | θ) plotted against ability (θ).]
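The two DIF patterns can be reproduced numerically with the standard 3-PL ICC, P(θ) = c + (1 − c) / (1 + exp(−a(θ − b))); the parameter values below are made up for illustration.

```python
# Illustrate uniform vs. non-uniform DIF with the standard 3-PL ICC.
import numpy as np

def icc(theta, a, b, c=0.0):
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)

# Uniform DIF: same slope, shifted b; one group's curve sits below the
# other's at every theta.
ref, focal = icc(theta, a=1.0, b=0.0), icc(theta, a=1.0, b=0.5)

# Non-uniform DIF: shifted a; the curves cross, so the difference changes
# sign across the ability continuum.
ref2, focal2 = icc(theta, a=1.0, b=0.0), icc(theta, a=2.0, b=0.0)

print("uniform DIF gap:    ", np.round(ref - focal, 2))
print("non-uniform DIF gap:", np.round(ref2 - focal2, 2))
```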
Residual Analysis
• Assess the accuracy of model
predictions versus actual data
– Residual = difference between observed proportion and predicted probability:
rj(θ) = Pj(θ) − E[Pj(θ)]
Residual Analysis
rj(θ) = Pj(θ) − E[Pj(θ)]
In this notation,
Pj(θ) = observed proportion correct for a given θ level
E[Pj(θ)] = expected proportion correct (i.e., the probability from the IRT model)
Residual Analysis
[Figure: probability of correct response vs. ability (θ) for one item. White points = observed Pj(θ); black curve = expected E[Pj(θ)]. The residual rj(θ) = Pj(θ) − E[Pj(θ)] is the gap between them at each θ.]
Standardized Residuals
• Raw residuals do not take into account the error associated with the expected proportion correct, so we standardize each by dividing by its standard error:
SE(E[Pj(θ)]) = √( E[Pj(θ)] · E[Qj(θ)] / N(θ) )
where E[Qj(θ)] = 1 − E[Pj(θ)] and N(θ) is the number of examinees at that θ level.
Standardized Residuals
SRj(θ) = ( Pj(θ) − E[Pj(θ)] ) / √( E[Pj(θ)] · E[Qj(θ)] / N(θ) )
SR values should be homoscedastic for each item and follow an approximately standard normal distribution across all items of the test.
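A minimal numeric sketch of the standardized-residual computation; the `observed`, `expected`, and `n_theta` arrays are hypothetical values for one item across five θ levels.

```python
# Standardized residuals: (observed - expected) / SE, where
# SE = sqrt(E[P] * E[Q] / N(theta)) at each theta quadrature point.
import numpy as np

observed = np.array([0.22, 0.35, 0.55, 0.74, 0.90])  # empirical proportions
expected = np.array([0.20, 0.38, 0.50, 0.72, 0.93])  # model E[P_j(theta)]
n_theta  = np.array([40, 85, 120, 90, 45])           # examinees per point

raw_residuals = observed - expected
se = np.sqrt(expected * (1.0 - expected) / n_theta)  # SE of E[P_j(theta)]
standardized = raw_residuals / se
print(np.round(standardized, 2))
```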
[Three figures: observed proportions correct vs. ability (θ) for the same item, overlaid with the fitted 1-PL, 2-PL, and 3-PL item characteristic curves.]
[Figure: standardized residuals vs. ability (θ). SR values are homoscedastic for this item when fit by a 3-PL model, but systematic errors are present for the 1- and 2-PL models.]
Test-level Fit
• Similar to the comparison done for individual items, but instead we compare Expected Proportion Correct (TCC) to observed proportion correct (raw score/N).
• SRs across the test should be
homoscedastic and follow an
approximate normal distribution.
Across all items, SR values are approximately
normally distributed when fit by a 3-PL model,
but more uniform for the 1- and 2-PL models.
[Figure: relative frequency distributions of standardized residuals for the 1-PL, 2-PL, and 3-PL models.]
Significance Testing
Q1 chi-square (Yen, 1981):
Q1j = Σ(i = 1..m) SRij²
Q1j ~ χ² with df = m − p
where m = # of quadrature points and p = # of item parameters.
Chi-square test
• Standardized residuals are essentially prediction errors that have been turned into z-scores.
• The sum of squared z-scores follows a Chi-square distribution.
  – Much like Sums of Squares and variances follow a Chi-square distribution in ANOVA.
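A short sketch of computing Q1 for one item under these definitions, assuming scipy is available; the SR values and the 3-PL parameter count (p = 3) are illustrative.

```python
# Yen's Q1 for one item: sum of squared standardized residuals across
# m quadrature points, referred to a chi-square with df = m - p.
import numpy as np
from scipy.stats import chi2

standardized = np.array([0.32, -0.57, 1.07, 0.43, -0.79])  # hypothetical SRs
m, p = len(standardized), 3                                # 3-PL: a, b, c

q1 = np.sum(standardized ** 2)   # sum of squared z-scores
df = m - p
p_value = chi2.sf(q1, df)        # we hope this is NOT significant
print(f"Q1 = {q1:.2f}, df = {df}, p = {p_value:.3f}")
```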
[Figure: standard normal distribution of z-scores.]
[Figure: χ² distributions for df = 1, 5, 10, 15, and 20, plotted over the sum of squared z-scores. As the degrees of freedom increase, the sum of squared standardized residuals is less likely to be near zero, and the expected value of Σ SR² moves to the right as the distribution approaches normality.]
Goodness-of-Fit
• This is what we call "goodness-of-fit":
• We hope that the chi-square test will NOT be significant:
  – This indicates that the differences between observed and expected are small.
– Significant differences would mean that
observed proportions are far from what the
model predicted…and that’s bad.
Significance Testing in BILOG and PARSCALE
• The goodness-of-fit information in BILOG and PARSCALE uses the Chi-square test described in the previous slides.
  – These values can be found for all items.
SUBTEST PRETEST;  ITEM PARAMETERS AFTER CYCLE 30

ITEM   | INTERCEPT | SLOPE  | THRESHOLD | LOADING | ASYMPTOTE | CHISQ   DF
       | S.E.      | S.E.   | S.E.      | S.E.    | S.E.      | (PROB)
---------------------------------------------------------------------------
MATH01 |  1.041    | 0.651  | -1.599    | 0.545   | 0.186     | 29.0    9.0
       |  0.107*   | 0.082* |  0.242*   | 0.069*  | 0.084*    | (0.0007)
MATH02 |  2.230    | 0.600  | -3.717    | 0.514   | 0.199     |  9.5    5.0
       |  0.165*   | 0.114* |  0.610*   | 0.098*  | 0.089*    | (0.0920)
MATH03 |  0.428    | 0.693  | -0.618    | 0.569   | 0.159     | 63.2    7.0
       |  0.106*   | 0.084* |  0.190*   | 0.069*  | 0.071*    | (0.0000)
MATH04 | -0.601    | 1.391  |  0.432    | 0.812   | 0.217     | 10.7    8.0
       |  0.216*   | 0.268* |  0.095*   | 0.156*  | 0.040*    | (0.2204)
The problem of sample size
• Statistical tests of model-data fit present an interesting duality:
  1) Due to sensitivity to sample size, almost any departure of data from the model results in rejecting H0.
  2) For small samples, model-data misfit can be overlooked, and SEs for item parameters are large.
Standardized Residuals
SRj(θ) = ( Pj(θ) − E[Pj(θ)] ) / √( E[Pj(θ)] · E[Qj(θ)] / N(θ) )
As N increases, SE decreases, so SR increases! Often, when you have a large enough sample size to estimate parameters, model misfit is a foregone conclusion!
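A quick numeric illustration of this point (the 0.03 gap and the expected proportion of 0.50 are chosen arbitrarily): holding the observed-minus-expected gap fixed, SR grows with √N.

```python
# The same small observed-minus-expected gap becomes a "significant"
# standardized residual once N(theta) is large enough.
import numpy as np

expected, gap = 0.50, 0.03
for n in (50, 200, 1000, 5000):
    se = np.sqrt(expected * (1 - expected) / n)  # SE shrinks as 1/sqrt(N)
    print(f"N = {n:>5}: SR = {gap / se:.2f}")
# SR grows with sqrt(N): roughly 0.42, 0.85, 1.90, 4.24 under these numbers.
```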
Conclusion
• This leads us back to the beginning… recall what George Box said!
• Generally, there are no absolute criteria for model-data fit.
• "Let your conscience be your guide"
  – Conduct a variety of analyses, consider the full set of results, and be mindful of these with regard to the intended application of the IRT model.
Next…
• Test Reliability and Development with IRT
• Test Score Equating with IRT