Prof Gavin T L Brown
Quantitative Data Analysis & Research Unit
[email protected]

Scores: how we measure success or learning
◦ Observed: what you actually get on a test
◦ True: what you should get if the test were perfect, bearing in mind that a test is a sample of a domain (latent)
◦ Ability: what you really are able to do or know of a domain, independent of what is in any one test (latent)
[Figure: the real-ability continuum (less to more), independent of any test, with the true score range you would see if the student were tested again after brainwashing]

A worked example
Students A, B, and C all sit the same test. The test has ten items, all dichotomous (scored 0 or 1). All three students score 6 out of 10. What can we say about the ability of these three students?
[Table: items 1-10 by students A, B, and C; each student answers a different mix of items correctly, but the bottom row reads 60% correct for all three]
Assumptions (classical test theory, CTT):
◦ each item is equally difficult and has equal weight towards the total score
◦ the total score, based on the sum of items correct, is a good estimate of true ability
Inference: these students are equally able. But what if the items aren't equally difficult?

Item response theory (IRT)
A rethink of items and tests prompted by CTT's weaknesses. Often called "latent trait theory":
◦ it assumes the existence of an unobserved (latent) trait or ability which leads to consistent performance
◦ the traits are: person ability, item difficulty, item discrimination, and item pseudo-chance
The focus is at the item level, not the level of the test, and parameters are calculated as estimates of population characteristics, not sample statistics. The aims are to:
◦ generate items that provide maximum information about the proficiency/ability of interest
◦ give examinees items tailored to their proficiency/ability
◦ reduce the number of items required to pinpoint an examinee on the proficiency/ability continuum (without loss of reliability)

The bank analogy: all items are like money in a bank
◦ different denominations have different purchasing power ($5 < $10 < $20 …)
◦ all coins and bills sit on the same scale
◦ items can be assembled into a test flexibly tailored to the ability or difficulty required AND reported on a common scale

IRT assumptions
◦ all items in a test are a systematic sample of the domain
◦ items in a test have different weights in making up the test score
◦ the probability of getting an item correct is a function of ability, P_i(θ): as ability increases, the chance of getting the item correct increases
◦ people and items can be located on one scale
◦ item statistics are invariant across groups of examinees
◦ S-shaped curves (ogives) plot the relationship among the parameters
◦ there is more than one model for accounting for error components
[Figure: items of varying difficulty and people of varying ability located on the same scale, running from -∞ to +∞]

The ogive
"Ogive" is a term used to describe a round tapering shape. The chance of getting an item right does NOT increase in a linear fashion: the probability increases in relation to ability, difficulty, discrimination, and chance factors in an ogive fashion.
[Figure: an item characteristic curve for a single item, with the chance of getting it right rising as ability and difficulty increase]

The item parameters
Difficulty (b):
◦ the ability point at which the probability of getting the item right is 50%
◦ in the 1PL/Rasch model this is the ONLY parameter
Discrimination (a):
◦ the slope of the curve at the difficulty point
◦ in the 2PL model these are the two parameters
Pseudo-chance (c):
◦ the probability of getting the item right when no TRUE ability exists
◦ in the 3PL model, all three parameters are estimated

Choosing among 1PL (Rasch), 2PL, and 3PL
The choice depends on theory and fit statistics:
◦ 1PL uses only the b parameter: no c, and a set equal for all items
◦ 2PL uses b and a, and ignores c
◦ 3PL uses a, b, and c
◦ note that many objectively scored tests use 1PL to calculate test scores, but many developers use 2PL and 3PL to evaluate items before tests are created

The 3PL model:

$$P_g(\theta) = c_g + \frac{1 - c_g}{1 + e^{-D a_g (\theta - b_g)}}$$

where
g = an item on the test
a_g = the gradient of the ICC at the point b_g (item discrimination)
b_g = the ability level at which a_g is maximised (item difficulty)
c_g = the probability of low-ability candidates correctly answering question g
e = the base of natural logarithms, ≈ 2.718
D = a scaling factor that brings the logistic function close to the normal ogive, = 1.7
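A minimal Python sketch of this curve may help (mine, not the slides'; the function name and example values are illustrative). With c = 0 it reduces to the 2PL, and with a also fixed at one common value across items it becomes the 1PL/Rasch curve:

```python
import math

def p_correct(theta, b, a=1.0, c=0.0, D=1.7):
    """3PL probability that a person of ability theta answers the item correctly.

    b -- item difficulty: the ability point where the curve is steepest
    a -- item discrimination: the slope at b (held equal across items in 1PL/Rasch)
    c -- pseudo-chance: the lower asymptote (0 in 1PL and 2PL)
    D -- scaling factor (1.7) bringing the logistic close to the normal ogive
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# With c = 0, the probability at theta == b is exactly 0.5: the 50% definition of difficulty.
print(p_correct(theta=0.0, b=0.0))            # 0.5
# A pseudo-chance parameter raises the floor: low-ability examinees can still guess an MCQ.
print(p_correct(theta=-4.0, b=0.0, c=0.25))   # ~0.25
```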
The Rasch model removes c, D, and a (equivalently, c is set to 0 and a is fixed equal across all items).

IRT creates an estimate of ability based on how hard the items were that the student got right:
◦ it creates an estimate of where the student has a 50% chance of getting items correct
◦ note that more items answered correctly at the same difficulty point only increases the accuracy of the estimate, not the estimate of ability itself
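A short sketch (again mine, with hypothetical item difficulties) shows both points under the Rasch model: the estimate is the ability at which the expected number correct equals the observed score, and doubling the items at the same difficulties leaves the estimate unchanged while shrinking its standard error:

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct answer: 1 / (1 + e^-(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(score, difficulties, lo=-6.0, hi=6.0):
    """Ability estimate: the theta at which the expected number correct equals
    the observed raw score (the ML estimate under the Rasch model).
    Simple bisection over [lo, hi]; assumes 0 < score < number of items."""
    for _ in range(60):
        mid = (lo + hi) / 2.0
        expected = sum(rasch_p(mid, b) for b in difficulties)
        if expected < score:
            lo = mid   # expected score too low: ability must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0

def standard_error(theta, difficulties):
    """SE of the ability estimate: 1 / sqrt(test information)."""
    info = sum(rasch_p(theta, b) * (1.0 - rasch_p(theta, b)) for b in difficulties)
    return 1.0 / math.sqrt(info)

items = [-1.0, -0.5, 0.0, 0.5, 1.0]        # hypothetical difficulties (logits)
theta_short = estimate_theta(3, items)      # 3 of 5 correct
theta_long = estimate_theta(6, items * 2)   # 6 of 10 correct on a doubled test
print(round(theta_short, 2), round(theta_long, 2))      # identical ability estimates
print(round(standard_error(theta_short, items), 2),
      round(standard_error(theta_long, items * 2), 2))  # but a smaller error on the longer test
```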
The 1PL Item Characteristic Curve (ICC)
◦ difficulty is found at 50% probability
◦ NO chance parameter
◦ slopes are identical for all items
Think of it as choosing a shoe based only on length (Bond & Fox, 2007).
[Figure: the distribution of the population into ability groups relative to the fixed-discrimination 1PL curve; the fit of responses at four points in the distribution to the model is good]

The 2PL and 3PL ICCs
◦ slopes are NOT the same
◦ difficulty is still found at 50% probability
◦ CHANCE is possible, especially with multiple-choice questions
Now width and arch support are being added as conditions in choosing a shoe. Simple is not always better.

Choosing items: we seek items that
◦ have robust estimates of item parameters
◦ fit the underlying model well
◦ provide information at and around the ability level of interest
◦ sit at various points of difficulty
◦ have high discrimination
◦ tap into different aspects of the attribute

Distractor analysis
The chance of getting the answer correct should increase as ability increases; that the correct answer fits the Rasch assumptions is a bonus.
[Figure: ICCs for the wrong answers and the right answer of one item, showing low fit of people to the ICC and a low probability of anyone getting the item right]

The example revisited: the items are not equally difficult

Item        1    2    3    4    5    6    7    8    9    10
Difficulty  -3   -2   -1   -1   0    0    1    1    2    3
% correct: A = 60, B = 60, C = 60

Rescored on asTTle v4 (an IRT-scaled test), the same 60% yields different scores:

Student       A     B     C
% correct     60    60    60
asTTle v4     530   545   593

Conclusions: C > A and B, while B ≈ A, because C answered all the hardest items correctly; there is no penalty for skipping easy items or getting them wrong.

Items are not equally difficult, and difficulty is anchored to curriculum levels. The test is designed to have:
◦ most items between 3P and 4B
◦ fewer at 3B and 4A, and none outside Levels 3-4
[Figure: asTTle reporting categories: Strength, Need, Gap, Mastery]

Student 1 gets only one 3B item correct. The best estimate of the 50% ability point is below 3B, because only one of many 3B items was correct. Solution: give an easier test to reduce the error in estimating ability.
Student 2 gets many 3B-3P items correct, some 3A items right, and few Level 4 items correct. The best estimate of the 50% ability point is 3A, with good confidence, because lots of items sit there.
Student 3 gets almost all items in the test correct. The best estimate of the 50% ability point lies above Level 4A. Solution: administer a harder test to reduce the error in estimating ability.

Key properties of IRT
◦ Probabilistic: determines the probability that an examinee with a given ability correctly answers the item
◦ Estimates item parameters and person ability
◦ Sample independent
◦ Places items and students on the same scale
BUT:
◦ it requires all items to be calibrated against each other
◦ LOTS of people and items are needed (150 people per item)
◦ representative sampling is required

IRT versus CTT
1. IRT: short tests can be more reliable than longer tests. CTT: longer tests are more reliable than shorter tests.
2. IRT: test scores can be compared even if tests differ in difficulty. CTT: test scores can only be compared if parallel tests are used.
3. IRT: mixed item formats can yield optimal test scores. CTT: mixed item formats lead to an unbalanced impact on total test scores.
4. IRT: change scores can be meaningfully compared when initial score levels differ. CTT: change scores can't be meaningfully compared when initial score levels differ.

IRT is too hard to calculate without special software… BUT:
◦ use an IRT-developed test (if available) to get more accurate information
◦ use CTT to check test questions before creating a total score and making decisions (a sketch follows the reference below):
  ◦ difficulty (% correct): remove items at 0% and 100% correct
  ◦ discrimination (correlation with the total score): remove items with r < .10
◦ then calculate the total score and judge the cut score: what score % equals pass, mastery, excellence, etc.?

http://dx.doi.org/10.17608/k6.auckland.3827082.v2
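A closing sketch of that CTT screening step in Python (hypothetical data and function names; the 0%/100% and r < .10 thresholds come from the advice above):

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def screen_items(matrix, min_r=0.10):
    """CTT item screening on a persons-by-items 0/1 response matrix.

    Keeps an item only if (a) its difficulty (% correct) is strictly between
    0 and 1 and (b) its correlation with the total score is at least min_r,
    following the thresholds on the slide. Illustrative only: the item-total
    correlation here includes the item itself, and degenerate totals with
    zero variance are not handled.
    """
    totals = [sum(row) for row in matrix]
    kept = []
    for j in range(len(matrix[0])):
        col = [row[j] for row in matrix]
        difficulty = sum(col) / len(col)
        if difficulty in (0.0, 1.0):
            continue          # remove items everyone or no one got right
        if pearson(col, totals) < min_r:
            continue          # remove items with r < .10 to the total
        kept.append(j)
    return kept

# Hypothetical data: 4 examinees x 5 items.
data = [
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
]
print(screen_items(data))  # [1, 2, 3]: item 0 (100% correct) and item 4 (0%) are removed
```

Only after this screening would you total the surviving items and judge the cut score.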