Test Worthiness: The Curve, Reliability, Validity

Review of Concepts

•  Psychological Measurement vs. Other Types of Assessment
•  Correlation Coefficient
   –  Positive/Direct
   –  Negative/Inverse
   –  Pearson's r

Pithy Points about r

•  The degree of relationship between two variables is indicated by the number; the direction is indicated by the sign.
•  Correlation, even if high, does not imply causation.
•  High correlations allow us to make predictions.
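To ground these points, here is a minimal sketch (not from the slides) of computing Pearson's r in plain Python; the paired data, hours studied versus test score, are invented for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson's r: sum of deviation cross-products over the product
    of the root sums of squared deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

# Invented paired scores: hours studied vs. test score.
hours = [1, 2, 3, 4, 5]
scores = [55, 62, 71, 75, 88]
print(round(pearson_r(hours, scores), 3))  # 0.988 -> strong positive/direct relationship
```

The sign gives the direction (positive/direct here); even an r this high supports prediction, not a causal claim.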
All standardized assessment scores are based on the NORMAL CURVE.

Overview of The Normal Curve

Properties of Normal Distributions

The most important probability distribution in statistics is the normal distribution. A normal distribution is a continuous probability distribution for a random variable, x. The graph of a normal distribution is called the normal curve.

1.  The mean, median, and mode are equal.
2.  The normal curve is bell-shaped and symmetric about the mean.
3.  The total area under the curve is equal to one.
4.  The normal curve approaches, but never touches, the x-axis as it extends farther and farther away from the mean.
5.  Between µ - σ and µ + σ (in the center of the curve), the graph curves downward. The graph curves upward to the left of µ - σ and to the right of µ + σ. The points at which the curve changes from curving upward to curving downward are called the inflection points.
Properties of Normal Distributions

[Figure: normal curve with inflection points at µ - σ and µ + σ, total area = 1, and the horizontal axis marked from µ - 3σ to µ + 3σ.]

If x is a continuous random variable having a normal distribution with mean µ and standard deviation σ, you can graph a normal curve with the equation

y = (1 / (σ√(2π))) · e^(-(x - µ)² / (2σ²)),   where e ≈ 2.718 and π ≈ 3.14
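A small sketch (standard library only; the choice of µ = 0, σ = 1 is just for illustration) that evaluates this equation and numerically checks that the total area under the curve is one:

```python
import math

def normal_pdf(x, mu, sigma):
    """y = 1 / (sigma * sqrt(2 * pi)) * e^(-(x - mu)^2 / (2 * sigma^2))"""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 0.0, 1.0
print(round(normal_pdf(mu, mu, sigma), 4))  # 0.3989 -- the curve peaks at the mean

# Riemann-sum approximation of the area from mu - 5 to mu + 5 (sigma = 1 here).
step = 0.001
area = step * sum(normal_pdf(mu - 5 + i * step, mu, sigma) for i in range(10000))
print(round(area, 3))  # ~1.0 -- the total area under the curve is one
```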
Means and Standard Deviations

A normal distribution can have any mean and any positive standard deviation.

[Figure: two normal curves with marked inflection points, one with mean µ = 3.5 and standard deviation σ ≈ 1.3, the other with mean µ = 6 and standard deviation σ ≈ 1.9.]

The mean gives the location of the line of symmetry. The standard deviation describes the spread of the data.

Means and Standard Deviations Example:
1.  Which curve has the greater mean?
2.  Which curve has the greater standard deviation?
[Figure: curves A and B plotted on an x-axis running from 1 to 13.]

The line of symmetry of curve A occurs at x = 5. The line of symmetry of curve B occurs at x = 9. Curve B has the greater mean. Curve B is more spread out than curve A, so curve B has the greater standard deviation.

The Standard Normal Distribution

The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. The horizontal scale corresponds to z-scores, running from z = -3 to z = 3. Any value can be transformed into a z-score by using the formula

z = (Value - Mean) / (Standard deviation) = (x - µ) / σ
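For example (numbers invented; mean 100 and SD 15 follow a common standard-score convention):

```python
def z_score(x, mean, sd):
    """z = (value - mean) / standard deviation"""
    return (x - mean) / sd

# A raw score of 115 on a scale with mean 100 and SD 15:
print(z_score(115, 100, 15))  # 1.0 -> one standard deviation above the mean
```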
Know the Definition of These Scores

•  Standard Score
•  Scaled Score
•  T-score
•  Percentile Rank
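The slides ask for the definitions but give no formulas. The sketch below uses conventions that are common in psychometrics but are an assumption here, so check the manual of any specific test: standard scores with mean 100 and SD 15, T-scores with mean 50 and SD 10, scaled scores with mean 10 and SD 3, and percentile rank read from the normal curve.

```python
import math

def from_z(z, mean, sd):
    """Re-express a z-score on a scale with the given mean and SD."""
    return mean + z * sd

def percentile_rank(z):
    """Percentage of the normal curve that falls below z (standard normal CDF)."""
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = 1.0
print(from_z(z, 100, 15))         # 115.0  standard score (assumed mean 100, SD 15)
print(from_z(z, 50, 10))          # 60.0   T-score (assumed mean 50, SD 10)
print(from_z(z, 10, 3))           # 13.0   scaled score (assumed mean 10, SD 3)
print(round(percentile_rank(z)))  # 84     percentile rank
```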
Overview of The Normal Curve

Validity & Reliability

What is Reliability?

–  Reliability means consistency.
–  Reliability refers to the scores obtained with a test and not to the instrument itself.

The Classical Model of Reliability

–  A person's true score is the score the individual would have received if the test and testing conditions were free from error.
–  Systematic error remains constant from one measurement to another and leads to consistency.
–  Random error, which does affect reliability, is due to:
   •  Fluctuations in the mood or alertness of persons taking the test due to fatigue, illness, or other recent experiences.
   •  Incidental variation in the measurement conditions due, for example, to outside noise or inconsistency in the administration of the instrument.
   •  Differences in scoring due to factors such as scoring errors, subjectivity, or clerical errors.
   •  Random guessing on response alternatives in tests or questionnaire items.

The Classical Model of Reliability, Continued…

–  The classical assumption is that any observed score consists of both the true score and error of measurement.
–  Reliability indicates what proportion of the observed score variance is true score variance.
–  Reliability coefficients in the .80s are desirable for screening tests, .90s for diagnostic decisions.

The Classical Model of Reliability, Continued…

–  Classical test theory also proposes two additional assumptions:
   •  The distribution of observed scores that a person may obtain under repeated independent testing with the same test is normal.
   •  The standard deviation of this normal distribution, referred to as the standard error of measurement (SEM), is the same for all persons of a given group taking the test.

Overview: Types of Reliability

•  Internal consistency
•  Test-retest reliability
•  Alternate forms reliability
•  Interscorer and interrater reliability

Internal Consistency

–  Internal consistency estimates are based on the average correlation among items within a test or scale. There are various ways to obtain internal consistency (see the sketch below):
   •  Split-half method:
      –  Odd-even method or matched random subsets method
      –  The Spearman-Brown prophecy formula must be applied
   •  Cronbach's coefficient alpha, used for multiscaled item response formats
   •  Kuder-Richardson formula 20 (KR-20), used for dichotomous item response formats
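Here is the promised sketch of these internal-consistency estimates in plain Python; the three-item, four-respondent data set is invented. KR-20 is the special case of coefficient alpha for dichotomously scored (0/1) items, so it is not coded separately.

```python
def spearman_brown(r_half):
    """Step up a split-half correlation to the reliability of the full-length test."""
    return 2 * r_half / (1 + r_half)

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores),
    where items is one list of respondent scores per item."""
    k, n = len(items), len(items[0])

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(item[i] for item in items) for i in range(n)]
    return (k / (k - 1)) * (1 - sum(var(item) for item in items) / var(totals))

print(round(spearman_brown(0.70), 3))  # 0.824 -- a longer test is more reliable

# Invented 3-item scale answered by 4 respondents (multiscaled format).
items = [[2, 4, 3, 5], [3, 5, 3, 4], [2, 5, 4, 5]]
print(round(cronbach_alpha(items), 3))  # 0.906
```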
Test-Retest Reliability

–  Test-retest reliability, also known as temporal stability, is the extent to which the same persons consistently respond to the same test administered on different occasions.
–  The major problem is the potential for carryover effects between the two administrations.
–  Thus, it is most appropriate for measurements of traits that are stable across time.

Alternate Forms Reliability

•  Alternate forms reliability counteracts the practice effects that occur in test-retest reliability by measuring the consistency of scores on alternate test forms administered to the same group of individuals.

Reliability of Criterion-Referenced Tests

•  Classification consistency shows the consistency with which classifications are made, either by the same test administered on two occasions or by alternate test forms. There are two forms:
   –  Mastery versus nonmastery
   –  Cohen's kappa (κ): the proportion of nonrandom consistent classifications

Interscorer and Interrater Reliability

•  Interscorer and interrater reliability are influenced by subjectivity of scoring. The higher the correlation, the lower the error variance due to scorer differences, and the higher the interrater agreement.

Researchers and Test Users Endeavor to Reduce Measurement Error and Improve Reliability by:

–  Writing items clearly.
–  Providing complete and understandable test instructions.
–  Administering the instrument under prescribed conditions.
–  Reducing subjectivity in scoring.
–  Training raters and providing them with clear scoring instructions.
–  Using heterogeneous respondent samples to increase the variance of observed scores.
–  Increasing the length of the test by adding items that are ideally parallel to those that are already in the test.
–  The general principle behind improving reliability is to maximize the variance of relevant individual differences and minimize the error variance.

The Importance of Reliability

–  Reliability is a necessary, but not sufficient, condition in the validation process.

Social Justice and Multicultural Issues

•  Test authors should determine whether the reliabilities of scores from different groups vary substantially, and report those variations for each population for which the test has been recommended.

Validity

We say an instrument is valid to the extent that it measures what it is designed to measure and accurately performs the functions it is purported to perform.

Validity (I.O.W.)

Validity indicates the degree to which test scores measure what the test claims to measure.

Examples of the Types of Validity

•  Face Validity
•  Content-Related Validity
•  Criterion-Related Validity
   –  Predictive criterion-related validity
   –  Concurrent criterion-related validity
•  Construct Validity

Face Validity

–  Face validity is derived from the obvious appearance of the measure itself and its test items, but it is not an empirically demonstrated type of validity.
–  Face validity is commonly referred to, but it is not really a type of validity.
–  Self-report tests with high face validity can face problems when the trait or behavior in question is one that many people will not want to reveal about themselves.

Content-Related Validity

–  The main focus is on how the instrument was constructed and how well the test items reflect the domain of the material being tested.
–  This type of validity is widely used in educational testing and in tests of aptitude or achievement.
–  Determining the content validity of a test requires a systematic evaluation of the test items to determine whether adequate coverage of a representative sample of the content domain was measured.

Criterion-Related Validity

–  Criterion-related validity is derived from comparing scores on the test to scores on a selected criterion.
–  Sources of criterion scores include:
   •  Academic achievement
   •  Task performance
   •  Psychiatric diagnosis
   •  Observations
   •  Ratings
   •  Correlations with previous tests

Criterion-Related Validity, Continued…

–  Two forms of criterion-related validity:
   •  Predictive criterion-related validity
   •  Concurrent criterion-related validity
–  The standard error of estimate (SEE) is derived from examining the difference between our predicted value of the criterion and the person's actual score on the criterion.

Construct Validity

–  Evidence for construct validity is established by defining the construct being measured and by gradually collecting information over time to demonstrate or confirm what the test measures.
–  Construct validity evidence is gathered by:
   •  Convergent validity evidence
   •  Discriminant validity evidence
   •  Factor analysis, which uses statistics to determine the degree to which the items contained in two separate instruments tend to group together along factors that mathematically indicate similarity, and thus a common meaning.
   •  Developmental changes
   •  Distinct groups

Controversies in Assessment: "Teaching to the Test" Inflates Scores

–  "Teaching to the test" means that the focus of instruction becomes so prescribed that only content that is sure to appear on an exam is addressed in instruction. If this occurs, test scores should rise.
–  Whether test scores are inflated in this instance is a matter of content mastery.
–  Test publishers, state education departments, and local educators must work collaboratively to develop test items that adequately sample the broad content domain and standards.

The Interaction of Reliability and Validity

•  A test can never be more valid than it is reliable.

Decision Making Using a Single Score

•  Decision theory involves the collection of a screening test score and a criterion score, either at the same point in time or at some point in the future.
   –  You want to maximize hits (valid acceptances and valid rejections) and minimize misses (false rejections and false positives).

Decision Making Using Multiple Tests

–  Multiple regression allows for several variables to be weighted in order to predict some criterion score.
–  The multiple cutoff method means that the counselor must establish a minimally acceptable score on each measure under consideration, then analyze the scores of a given client or student and determine whether each of the scores meets the given criterion (see the sketch below).
–  Clinical judgment and diagnosis using a test battery rely on the experiences, information processing capability, theoretical frameworks, and reasoning ability of the professional counselor.
–  Combining decision-making models can lead to greater accuracy.
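A minimal sketch of the multiple cutoff method just described; the measure names and cutoff values are invented.

```python
def multiple_cutoff(scores, cutoffs):
    """Every measure must meet its minimally acceptable score; unlike multiple
    regression, a high score on one measure cannot compensate for a low one."""
    return all(scores[measure] >= minimum for measure, minimum in cutoffs.items())

# Invented screening battery and cutoffs.
cutoffs = {"aptitude": 85, "interview_rating": 3, "work_sample": 70}
client = {"aptitude": 92, "interview_rating": 4, "work_sample": 65}
print(multiple_cutoff(client, cutoffs))  # False -- the work sample misses its cutoff
```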
Multicultural and Social Justice Issues

–  Test developers must evaluate diverse subgroups for potential score differences not related to the skills, attitudes, or abilities being assessed.
–  Tests must be used for verifiable purposes and with individuals with appropriate characteristics.

Jigsaw – Describe for Clients

•  Team 1: Normal Curve
•  Team 2: Raw Scores & Standard Scores
•  Team 3: Age-Equivalent Scores vs. Grade-Equivalent Scores
•  Team 4: Confidence Intervals for Obtained Scores
•  Team 5: Validity and Reliability
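For Team 4's topic, here is a sketch of confidence intervals for obtained scores, built from the standard error of measurement defined earlier. The formula SEM = SD * sqrt(1 - reliability) is the usual classical-test-theory estimate; it is not stated in the slides, so treat it as an assumption, and the score values are invented.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(obtained, sd, reliability, z=1.96):
    """Band around an obtained score; z = 1.96 gives roughly a 95% interval."""
    half_width = z * sem(sd, reliability)
    return obtained - half_width, obtained + half_width

# Invented example: obtained score 110, test SD 15, reliability .91.
print(sem(15, 0.91))                       # 4.5
print(confidence_interval(110, 15, 0.91))  # (101.18, 118.82)
```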
Other Psychometric Concepts

Be aware of the following concepts:

•  Floor effect
•  Ceiling effect
•  Item gradient
•  Norm table layout
•  Age or grade equivalent scores
•  Reliability
•  Skill areas assessed
•  Test content
•  Publication date
•  Sampling