Test Scores
Prof Gavin T L Brown
Quantitative Data Analysis & Research Unit
[email protected]

Scores—How we measure success or learning
◦ Observed—What you actually get on a test
◦ True—What you would get if the test were perfect, bearing in mind the test is a sample of the domain (latent)
◦ Ability—What you really are able to do or know of a domain, independent of what's in any one test (latent)
[Figure: real ability (independent of the test) shown on a Less–More continuum, with the true score range marked as the spread of scores expected if the student were tested again after brain washing.]
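
As a rough sketch of the observed/true/ability distinction (hypothetical numbers: a student whose real ability gives a 70% chance of answering any item sampled from the domain, re-tested with fresh 20-item samples):

import random

# Hypothetical example: the student's "real ability" is a 70% chance of
# answering any item sampled from the domain correctly.
random.seed(1)
true_proportion = 0.7
n_items = 20

# Each re-test draws a fresh sample of items from the domain, so the
# observed score varies around the true score of 14/20.
observed_scores = [
    sum(random.random() < true_proportion for _ in range(n_items))
    for _ in range(10)
]
print("true score:", true_proportion * n_items)
print("observed scores across re-tests:", observed_scores)
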





Students A, B and C all sit the same test
Test has ten items
All items are dichotomous (score 0 or 1)
All three students score 6 out of 10
What can we say about the ability of these three students?
Item:       1   2   3   4   5   6   7   8   9  10
(Ticks showing which items A, B and C answered correctly are not reproduced; each student answers 6 of the 10 items correctly.)
% correct:  A = 60, B = 60, C = 60
Assumptions: each item is equally difficult and has equal weight towards the total score; the total score based on the sum of items correct is a good estimate of true ability.
Inference: these students are equally able.
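
A minimal sketch of that inference (the individual response patterns are hypothetical, since the tick marks are not reproduced here): under CTT the total is just the sum of item scores, so any pattern with six correct items yields the same 60%.

# Hypothetical response patterns for A, B and C (1 = correct, 0 = wrong);
# each pattern contains six 1s, as in the slide.
responses = {
    "A": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    "B": [1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    "C": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
}

# CTT: every item carries equal weight, so only the sum matters.
for student, pattern in responses.items():
    total = sum(pattern)
    print(student, total, f"{100 * total / len(pattern):.0f}%")  # all 6, 60%
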
But what if the items aren’t equally difficult?


Rethink items and tests because of CTT weaknesses
Often called "latent trait theory" –
◦ assumes existence of an unobserved (latent) trait or ability which leads to consistent performance
◦ Traits are:
  - Person ability
  - Item difficulty
  - Item discrimination
  - Item pseudo-chance
Focus at the item level, not at the level of the test
Calculates parameters as estimates of population characteristics, not sample statistics



◦ Generate items that provide maximum information about the proficiency/ability of interest
◦ Give examinees items tailored to their proficiency/ability
◦ Reduce the number of items required to pinpoint an examinee on the proficiency/ability continuum (without loss of reliability)
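
One way to see why tailored items pinpoint ability with fewer questions (a sketch, not from the lecture: for a 2PL item the Fisher information is a²·P(θ)·(1−P(θ)), ignoring the D scaling factor, and it peaks where item difficulty matches the examinee's ability):

import math

def p_correct(theta, a, b):
    """2PL probability of a correct response (plain logistic, no D scaling)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

# A hypothetical item with difficulty b = 1.0: it tells us most about
# examinees whose ability is near 1.0, and very little about others.
for theta in [-2.0, 0.0, 1.0, 3.0]:
    print(theta, round(item_information(theta, a=1.5, b=1.0), 3))
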

All items are like money in a bank
◦ Different denominations have different purchasing power ($5 < $10 < $20 …)
◦ All coins & bills are on the same scale
Items can be assembled into a test flexibly tailored to the ability or difficulty required AND reported on a common scale
All items in the test are a systematic sample of the domain
All items in the test have different weight in making up the test score

Probability of getting an item correct is a function of ability—P_i(θ)
◦ as ability increases, the chance of getting the item correct increases
People and items can be located on one scale
Item statistics are invariant across groups of examinees
S-shaped curves (ogives) plot the relationship of the parameters
More than one model for accounting for error components
[Figure: items and people on the same scale—items of varying difficulty and people of varying ability located on one continuum running from −∞ to +∞.]

Ogive: a term used to describe a round tapering shape.
[Figure: ICC for item x, with the chance of getting it right increasing as ability and difficulty increase; the probability does NOT increase in a linear fashion.]
Probability increases in relation to ability, difficulty, discrimination, and chance factors in an ogive fashion.

Difficulty:
◦ the ability point at which the probability of getting it right is 50% (b)
◦ Note: in 1PL/Rasch this is the ONLY parameter
Discrimination:
◦ the slope of the curve at the difficulty point (a)
◦ in 2PL these are the 2 parameters
Pseudo-chance:
◦ the probability of getting it right when no TRUE ability exists (c)
◦ in 3PL, all three parameters are estimated

1PL (Rasch)
◦ only uses the b parameter; no c; and a set equal for all items
2PL
◦ uses b and a, ignores c
3PL
◦ uses a, b, and c
Choice depends on theory and fit statistics
◦ note many objectively scored tests use 1PL to calculate test scores
◦ but many developers use 2PL and 3PL to evaluate items before tests are created
P_g(θ) = c_g + (1 − c_g) / (1 + e^(−D·a_g·(θ − b_g)))

Where
g = an item on the test
a_g = gradient of the ICC at the point θ = b_g (item discrimination)
b_g = the ability level at which the gradient is maximised (item difficulty)
c_g = probability of low-ability candidates correctly answering question g
e = base of the natural log, ≈ 2.718
D = scaling factor to make the logistic function close to the normal ogive, = 1.7
The Rasch model removes c, D, and a (a is set equal for all items).
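
A minimal sketch of that formula in code (hypothetical parameter values; 2PL is obtained by setting c = 0, and 1PL/Rasch by also fixing a = 1 and dropping D):

import math

def p_3pl(theta, a, b, c, D=1.7):
    """3PL: P_g(theta) = c + (1 - c) / (1 + e^(-D*a*(theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def p_2pl(theta, a, b, D=1.7):
    return p_3pl(theta, a, b, c=0.0, D=D)

def p_rasch(theta, b):
    # Rasch/1PL: c removed, D dropped, a fixed (equal) for all items.
    return p_3pl(theta, a=1.0, b=b, c=0.0, D=1.0)

# At theta = b the Rasch/2PL probability is exactly 0.5 (the "50%" point);
# with a pseudo-chance parameter it rises to c + (1 - c)/2.
print(p_rasch(theta=0.0, b=0.0))              # 0.5
print(p_3pl(theta=0.0, a=1.0, b=0.0, c=0.2))  # 0.6
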

IRT creates an estimate of ability based on how hard the items were that the student got right
◦ i.e., an estimate of the ability level at which the student has a 50% chance of getting items correct
Note: answering more items correctly at the same difficulty point only increases the accuracy of the estimate, not the estimate of ability itself
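
A toy sketch of such an estimate (grid-search maximum likelihood under the Rasch model, with made-up difficulties and responses): doubling the number of items at the same difficulty, while keeping the proportion correct the same, leaves the estimate unchanged but shrinks its standard error.

import math

def p_rasch(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def mle_ability(responses, difficulties):
    """Grid-search maximum-likelihood ability estimate under the Rasch model."""
    grid = [i / 100.0 for i in range(-400, 401)]
    def log_lik(theta):
        return sum(
            math.log(p_rasch(theta, b)) if x == 1 else math.log(1.0 - p_rasch(theta, b))
            for x, b in zip(responses, difficulties)
        )
    return max(grid, key=log_lik)

def standard_error(theta, difficulties):
    """SE of the ability estimate = 1 / sqrt(test information at theta)."""
    info = sum(p_rasch(theta, b) * (1.0 - p_rasch(theta, b)) for b in difficulties)
    return 1.0 / math.sqrt(info)

# Hypothetical: 3 of 6 items correct, all items of difficulty 0 ...
short = ([1, 1, 1, 0, 0, 0], [0.0] * 6)
# ... versus 6 of 12 correct at the same difficulty: same estimate, smaller SE.
long_ = ([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], [0.0] * 12)

for responses, difficulties in (short, long_):
    theta_hat = mle_ability(responses, difficulties)
    print(theta_hat, round(standard_error(theta_hat, difficulties), 2))
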
[Figure: 1PL Item Characteristic Curve (ICC). Difficulty is found at 50% probability; there is NO chance parameter and the slopes of all items are identical. Think of it as choosing a shoe based only on length (Bond & Fox, 2007).]
[Figure: distribution of the population into ability groups relative to the 1PL fixed-discrimination curve; the fit of responses at 4 points in the distribution to the model is good.]
[Figure: 2PL/3PL ICCs. The SLOPES are not the same; difficulty is still found at 50% probability; CHANCE is possible, especially with MCQ. Now width and arch support are being added as conditions in choosing a shoe. Simple is not always better.]

We seek items that:
◦ have robust estimates of item parameters
◦ have good fit to the underlying model (see the fit sketch after this list)
◦ provide information at & around the ability level of interest
◦ are at various points of difficulty
◦ have high discrimination
◦ tap into different aspects of the attribute.
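
As a sketch of one common way "fit to the underlying model" is checked (Rasch residual-based infit and outfit mean squares; the abilities, difficulty, and response patterns below are made up): values near 1.0 indicate good fit, values well above 1.0 indicate misfit, and values well below 1.0 indicate overly predictable responses.

import math

def p_rasch(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_fit(responses, abilities, difficulty):
    """Rasch infit/outfit mean squares for one item.

    responses: 0/1 answers to this item, one per person.
    abilities: the persons' (already estimated) ability values.
    """
    expected = [p_rasch(theta, difficulty) for theta in abilities]
    variance = [p * (1.0 - p) for p in expected]
    sq_resid = [(x - p) ** 2 for x, p in zip(responses, expected)]
    outfit = sum(r / w for r, w in zip(sq_resid, variance)) / len(responses)
    infit = sum(sq_resid) / sum(variance)
    return infit, outfit

# Hypothetical: six persons of increasing ability answer one item of difficulty 0.
abilities = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
guttman_like = [0, 0, 0, 1, 1, 1]   # very predictable: mean squares below 1
reversed_pat = [1, 1, 0, 0, 0, 1]   # low-ability right, high-ability wrong: well above 1

for responses in (guttman_like, reversed_pat):
    infit, outfit = item_fit(responses, abilities, difficulty=0.0)
    print(round(infit, 2), round(outfit, 2))
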
[Figure: response curves for the right answer and the wrong answers of one item. The chance of getting the answer correct increases as ability increases; that the correct answer fits the Rasch assumptions is a bonus.]
[Figure: an item with low fit of people to the ICC and a low probability of anyone getting the item right.]
Item:        1   2   3   4   5   6   7   8   9  10
Difficulty: -3  -2  -1  -1   0   0   1   1   2   3
(Ticks showing which items A, B and C answered correctly are not reproduced; each student answers 6 of the 10 items correctly.)
% correct:   A = 60, B = 60, C = 60
Item:        1   2   3   4   5   6   7   8   9  10
Difficulty: -3  -2  -1  -1   0   0   1   1   2   3
(Ticks showing which items A, B and C answered correctly are not reproduced; each student answers 6 of the 10 items correctly.)
% correct:   A = 60,  B = 60,  C = 60
asTTle v4:   A = 530, B = 545, C = 593
Conclusions: C > A, B and B ≈ A, because C answered all the hardest items correctly—no penalty for skipping or getting easy items wrong.
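
asTTle's actual scale scores come from its own calibration, but the principle behind that conclusion can be sketched with a toy Rasch calculation (the split of the table's difficulties into an easier and a harder half is hypothetical): the same number of items correct, earned on harder items, corresponds to a higher ability estimate.

import math

def p_rasch(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def mle_ability(responses, difficulties):
    """Toy grid-search maximum-likelihood ability estimate (Rasch)."""
    grid = [i / 100.0 for i in range(-500, 501)]
    def log_lik(theta):
        return sum(
            math.log(p_rasch(theta, b)) if x else math.log(1.0 - p_rasch(theta, b))
            for x, b in zip(responses, difficulties)
        )
    return max(grid, key=log_lik)

# Difficulties from the table, split (hypothetically) into an easier and a harder half.
easier_half = [-3, -2, -1, -1, 0]
harder_half = [0, 1, 1, 2, 3]

# Three of five correct in each case: same raw score, very different ability estimates.
print(mle_ability([1, 1, 1, 0, 0], easier_half))   # low estimate
print(mle_ability([1, 1, 1, 0, 0], harder_half))   # much higher estimate
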



Items are not equally difficult.
Difficulty is anchored to curriculum levels.
Test is designed to have:
◦ most items between 3P and 4B
◦ fewer at 3B and 4A, and none outside Levels 3–4
[Figure labels: Strength, Need, Gap, Mastery.]



Student 1 gets only one 3B item correct.
Best estimate of 50% ability is below 3B, because only one of many 3B items was correct.
Solution: give an easier test to reduce the error in estimating ability.

Student 2 gets many 3B–3P items correct, some 3A right, and few Level 4 items correct.
Best estimate of 50% ability is 3A, with good confidence because there are lots of items around that level.

Student 3 gets almost all items in the test correct.
Best estimate of 50% ability lies above Level 4A.
Solution: administer a harder test to reduce the error in estimating ability.
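
A sketch of why a better-targeted test reduces error (Rasch test information; the two hypothetical 10-item tests below are made up): the standard error of the ability estimate is 1/√(test information), and information is greatest where item difficulties sit near the student's ability.

import math

def p_rasch(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def standard_error(theta, difficulties):
    """SE of the Rasch ability estimate = 1 / sqrt(test information at theta)."""
    info = sum(p_rasch(theta, b) * (1.0 - p_rasch(theta, b)) for b in difficulties)
    return 1.0 / math.sqrt(info)

# A hypothetical low-ability student (theta = -2) measured with two 10-item tests.
too_hard_test = [0.0, 0.5, 0.5, 1.0, 1.0, 1.5, 1.5, 2.0, 2.0, 2.5]
well_targeted = [-3.0, -2.5, -2.5, -2.0, -2.0, -2.0, -1.5, -1.5, -1.0, -1.0]

print(round(standard_error(-2.0, too_hard_test), 2))   # larger error
print(round(standard_error(-2.0, well_targeted), 2))   # smaller error
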
Key Properties of IRT
◦ Probabilistic - determines the probability that an examinee with ability θ correctly answers the item
◦ Estimates item parameters and person ability
◦ Sample independent
◦ Places items and students on the same scale
BUT
◦ Requires all items to be calibrated against each other
◦ LOTS of people & items needed (150 people per item)
◦ Representative sampling required
IRT
1. Short tests can be more reliable than longer tests
2. Can compare test scores, even if tests differ in difficulty
3. Mixed item formats can yield optimal test scores
4. Change scores can be meaningfully compared when initial score levels differ

CTT
1. Longer tests are more reliable than shorter tests
2. Can only compare test scores if parallel tests are used
3. Mixed item formats lead to unbalanced impact on total test scores
4. Change scores can't be meaningfully compared when initial score levels differ



IRT is too hard to calculate without special software… BUT:
Use an IRT-developed test (if available) to get more accurate information.
Use CTT to check test questions before creating a total score and making decisions:
◦ Difficulty (% correct)
  - remove 0% and 100% items
◦ Discrimination (correlation to total)
  - remove those with r < .10
◦ Then calculate the total score and judge the cut score
  - what score % = pass, mastery, excellence, etc.
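
A minimal sketch of those CTT checks in Python (the response matrix is made up; real data would come from your own test):

import numpy as np

# Rows = students, columns = items (1 = correct, 0 = wrong). Hypothetical data.
X = np.array([
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 1],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 0, 0, 1],
])

difficulty = X.mean(axis=0)                 # proportion correct per item
total = X.sum(axis=1)                       # raw total per student

keep = []
for j in range(X.shape[1]):
    p = difficulty[j]
    if p == 0.0 or p == 1.0:                # remove 0% and 100% items
        continue
    r = np.corrcoef(X[:, j], total)[0, 1]   # discrimination: correlation with total
    if r < 0.10:                            # remove weakly discriminating items
        continue
    keep.append(j)

score = X[:, keep].sum(axis=1)
print("items kept:", keep)
print("% scores:", 100 * score / len(keep))
# Next step: judge the cut score (what % = pass, mastery, excellence, etc.).
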
http://dx.doi.org/10.17608/k6.auckland.3827082.v2