Explaining Educational Concepts
Dr. Julie Esparza Brown
SPED 512/Diagnostic Assessment
Portland State University
Measures of Central Tendency
Mean (Most useful)
 Mean – the average of all the scores in the distribution.
 Appropriate for Equal Interval and Ratio Scales.
 Not appropriate for skewed distributions.
Median (Next most useful)
Median – the middle score of a distribution.
Appropriate for Ordinal, Equal Interval, and Ratio Scales.
Most appropriate when the distribution is skewed.
50% of scores are above the median, 50% of scores are below the
median.
 Example (see the code sketch after the Mode section):
 Arrange scores in order from largest to smallest (or vice versa).
 If N is odd, the middle score is the median.
 If N is even, the average of the two middle scores is the median.
Mode (Least useful)
 The Mode is the most frequently occurring score.
 Appropriate for Nominal, Ordinal, Equal Interval, and Ratio
Scales.
 Generally used in a very rough sense to get a feel for “the
peak of the mountain.”
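To make these three measures concrete, here is a minimal Python sketch using the standard statistics module; the push-up counts are hypothetical example data, not figures from these slides.

    import statistics

    # Hypothetical push-up counts for 7 children (N is odd).
    scores = [7, 9, 10, 10, 10, 11, 13]

    print(statistics.mean(scores))    # 10.0 -- sum of all scores divided by N
    print(statistics.median(scores))  # 10   -- the middle score after sorting
    print(statistics.mode(scores))    # 10   -- the most frequently occurring score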
Fifth Graders and Push-up Test
 Half of the children completed 10 or more.
 Half of the children completed 10 or fewer.
 The average child completed 10.
 The average or mean number completed by this class of 100 5th graders is 10.
 Half of the children scored above the mean score of 10.
 Half of the children scored below the mean or average score of 10.
 50 percent of the children scored 10 or above.
 50 percent of the children scored 10 or below.
Fifth Graders and Push-up Test (cont.)
 One-third of the children scored between 7 and 10.
 One-third of the children completed between 10 and 13.
 Two-thirds of the children scored between 7 and 13.
 Half of the children (50 percent) completed between 8 and
12.
 The lowest scoring child completed 1.
 The highest scoring child completed 19.
[Figure: bell curve of push-up scores; X axis = number of push-ups, Y axis = number of children]
 The highest point of the bell curve falls at 10 push-ups on the X axis.
 The next most frequently obtained scores were 9 and 11,
followed by 8 and 12.
 This pattern continues out towards the far ends of the bell
curve with the ends occurring at 1 and 19 push-ups.
 Amy’s score of 10 places her at the 50% level; her percentile rank is 50 (PR = 50).
 Erik’s score of 13 places him at the 84th position out of the 100 fifth-grade children tested, or the 84th %ile.
 Sam’s raw score of 7 placed him at the 16th %ile; 84 children earned a higher score than Sam.
Percentiles (Relative Standing)
 The percent of people in the comparison group who
scored at or below the score of interest.
 Example:
 Billy obtained a percentile rank of 42.
This means that Billy performed as well as or better than 42% of children his age on the test.
 Or, 42% of children Billy’s age scored at or below Billy’s score.
Advantages of Percentile Ranks
 Percentile ranks are one of the best types of scores for reporting to consumers a child’s relative standing compared to other children.
 Scores indicate how well a student performed compared to the performance of some reference group.
 Percentile ranks are Ordinal Scale data (values are ordered from worst to best, but the differences between adjacent values are unknown).
 It is not meaningful to calculate the mean or standard deviation of percentiles.
Converting Raw Scores
 Now let’s develop a weighting system to convert each raw
score to a scale score so that we can compare different scores
(number of push-ups, number of sit-ups, seconds to complete the 50-yard dash).
 One way is to develop a rank order system.
 The child who scores highest in an event receives a scale
score of 100; the lowest receives a score of 1. The other 98
children receive their respective “rank” as their scale score.
 After all raw scores are converted to scale scores, we can
easily compare an individual child to the group and to all
children who are the same age or in the same grade.
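A minimal sketch of this rank-order conversion, using a hypothetical group of five children rather than 100; ties would need an explicit rule and are ignored here.

    # Hypothetical raw scores (push-ups) for five children.
    raw_scores = [19, 10, 7, 13, 1]

    # Sort ascending: the lowest raw score gets scale score 1,
    # the highest gets the top scale score (5 here, 100 for 100 children).
    ranked = sorted(raw_scores)
    scale_scores = {score: rank + 1 for rank, score in enumerate(ranked)}

    for score in raw_scores:
        print(score, "->", scale_scores[score])  # 19 -> 5, 10 -> 3, 7 -> 2, ...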
Composite Scores
 There are difficulties with composite scores.
 For example, John has good muscular strength and scored at the 70%ile in push-ups and the 78%ile in sit-ups.
 But he is slow and uncoordinated and finished 2nd from last out of the 100 children, at the 2%ile.
 If we average John’s scores, they average 50 (an “average” score); however, he was not “average” in all events.
 NOTE: You cannot average percentile rank scores (WHY? see the sketch below) – you can average standard scores or scale/subtest scores.
 Moral of this story: make sure the subtest scores that create a cluster or composite score do not mislead us into believing there are no weaknesses present.
 Cluster scores must be considered with caution when there is a significant difference between individual subtest scores.
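One way to see WHY, sketched in Python with John’s three percentiles from above: averaging the percentile ranks directly gives 50, but converting each percentile to an equal-interval z score first (assuming a normal distribution; scipy is used here for the conversion) gives a noticeably lower result.

    from scipy.stats import norm

    percentiles = [70, 78, 2]  # push-ups, sit-ups, 50-yard dash

    print(sum(percentiles) / 3)  # 50.0 -- the misleading "average" percentile

    # Convert to z scores (equal interval), average, then convert back.
    z_scores = [norm.ppf(p / 100) for p in percentiles]
    mean_z = sum(z_scores) / 3
    print(round(norm.cdf(mean_z) * 100, 1))  # ~40.0 -- the defensible average

Because percentile units are compressed near the middle of the distribution and stretched at the tails, the two answers disagree.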
Standard Deviation
 Percentile ranks are computed by determining the mean score and the amount of variation of all scores around the mean.
 Are the scores bunched around the number 10 in a tight
uniform distribution?
 Are the scores evenly distributed?
 Do they peak and taper slowly?
 Do they bunch at the ends with few or no scores in the middle?
 Is there great variance, with the scores spread over a wide
range, with two or more peaks?
 Is there a normal bell curve distribution of scores?
Standard Deviation
 On our push-up test, most of the 5th graders scored around
10 push-ups, with an even distribution above and below 10
push-ups.
 If one-half of the children completed 5 push-ups, one-fourth
completed 14 push-ups, and one-fourth completed 16, the
average or mean would still be 10 – half the children scored
above 10 and half below 10.
 A low SD means the data points are close together.
 A high SD means the data points are spread out.
SD
 The standard deviation measures how much, on average, individual scores of a given group vary from the average (mean) for this same group.
 The SD measures the spread of individual results around the mean of all results.
 Let’s take a class of 40 people taking an exam.
 Once it’s graded, the instructor calculates the mean.
 To determine the SD, we split the total score range, which is 100 points, into smaller, even values.
 It is up to the researcher how to split it; for example, we’ll use 10 value units of 10 points each, from 10 to 100.
SD
 The mean score is 50.
 16 students scored between 40 and 60, which means they scored within 10 points, either higher or lower, of the average score.
 This means that 40% of the entire class (16 divided by 40 and multiplied by 100) scored within one value unit of the mean score.
 Another 12 students scored between 30 and 70, which means they scored within 20 points, either higher or lower, of the average score.
 These students account for 30% of the class (12 divided by 40 and multiplied by 100).
 Together with the students who scored within 10 points of the average score, they make up a group of 28 students, or 70% of the entire class, who scored within two value units of the mean score.
 We know that approximately 68% of scores in any group fall within one SD of the mean.
 Based on this, 20 points is the approximate value of one SD.
 We know that approximately 95% of scores fall within two SDs of the mean.
 What’s the range for 2 SDs in this example? (See the sketch below.)
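A short sketch answering that question with the figures from this example (mean of 50, one SD of roughly 20 points):

    mean, sd = 50, 20  # values estimated in the example above

    print(mean - 1 * sd, "to", mean + 1 * sd)  # 30 to 70 -- ~68% of scores (1 SD)
    print(mean - 2 * sd, "to", mean + 2 * sd)  # 10 to 90 -- ~95% of scores (2 SDs)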
Standard Scores
 The average score or mean is 100.
 The standard deviation is 15.
 If a child had a standard score of 68 on a writing sample, about 2 SDs below the mean, this means they scored at roughly the 2nd percentile.
 You can convert standard scores into percentile ranks.
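A minimal sketch of that conversion, assuming a normal distribution (scipy supplies the cumulative normal curve):

    from scipy.stats import norm

    def percentile_rank(standard_score, mean=100, sd=15):
        z = (standard_score - mean) / sd  # distance from the mean in SD units
        return norm.cdf(z) * 100          # percent scoring at or below

    print(round(percentile_rank(68), 1))  # ~1.6 -- near the 2nd %ile
    print(round(percentile_rank(85), 1))  # ~15.9 -- one SD below the mean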
Educational and Psychological Tests
 These tests are designed to present normal bell curve distributions with predictable patterns of scores.
 We need to know the mean and standard deviation of the test.
 In most educational and psychological tests, the mean is 100 and the standard deviation is 15.
 On some tests, the mean is 10 and the standard deviation is 3.
 Average scores do not deviate far from the mean.
 When a score falls significantly above or below the mean, it is referred to as being a distance from the mean, e.g., 1 or 2 standard deviations from the mean.
 To interpret test scores, you need to know the mean and standard deviation.
Educational and Psychological Tests
 One standard deviation
above the mean always falls
at the 84%ile.
 One standard deviation
below the mean always falls
at the 16%ile.
 Two SDs above the mean is always at the 98%ile.
 Two SDs below the mean is always at the 2%ile.
Understanding Test Data
 Sometimes, test scores are reported differently.
 For example, test scores may be reported as “z scores.”
 Z scores have a mean of 0 (zero) and a standard deviation of
1.
 If you know a child earned a z score of -1, you know that the child scored one standard deviation below the mean.
 One SD below the mean is at the 16%ile.
 If you convert this score into the standard score format, with
a mean of 100 and a standard deviation of 15, a z score of -1
is the same as a standard score of 85.
Standard Scores (Relative Standing)
Standard scores are scores of relative standing with a set, fixed, predetermined mean and standard deviation.

Standard Score      Mean   Standard Deviation
Z                   0      1
T                   50     10
IQ                  100    15
SB Subtest          50     8
WISC-III Subtest    10     3
Understanding Test Data
 Other tests report results as T scores.
 T scores have a mean of 50 and an SD of 10.
 A T score of 60 is the same as a Z score of +1. A child who
has a T score of 60 or a Z score of +1 scored at the 84%ile. A
T score of 70 is the same as a Z score of +2, a standard score
of 130 and a percentile rank of 98.
 A few tests report results in Stanines. In Stanine tests, the mean is 5 and the SD is 2.
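Because every one of these scales is just a mean and an SD, any score can be moved from one scale to another through its z score. A sketch (the convert function is illustrative, not from any test manual):

    # Each scale: (mean, standard deviation), as in the table above.
    SCALES = {"z": (0, 1), "T": (50, 10), "IQ": (100, 15), "stanine": (5, 2)}

    def convert(score, from_scale, to_scale):
        from_mean, from_sd = SCALES[from_scale]
        to_mean, to_sd = SCALES[to_scale]
        z = (score - from_mean) / from_sd  # position in SD units is preserved
        return to_mean + z * to_sd

    print(convert(60, "T", "z"))   # 1.0   -- a T score of 60 is z = +1
    print(convert(70, "T", "IQ"))  # 130.0 -- a T score of 70 is a standard score of 130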
WISC-IV Scores
 What do these mean?

WISC-IV Full Scale IQ: 101

Verbal Comprehension Index: 124
  Similarities              11
  Vocabulary                14
  Comprehension             12
  Information               (13)
  Word Reasoning            (12)

Perceptual Reasoning Index: 88
  Block Design              16
  Picture Concepts          7
  Matrix Reasoning          6
  Picture Completion        (8)

Working Memory Index: 110
  Digit Span                14
  Letter-Number Sequencing  10
  Arithmetic                (8)

Processing Speed Index: 75
  Coding                    4
  Symbol Search             7
  Cancellation              (8)

NOTE: Scores in (parentheses) are supplementary subtests and are not used to calculate the Full Scale IQ or Index Scores.
Age & Grade Equivalents
(Developmental Scale)
 There are problems with using these scores.
 Identical age equivalents can mean different task
performance.
Problems with Grade and Age Equivalent Scores
1. Systematic misinterpretation: a student who earns an AE of 12.0 has answered as many questions correctly as the average for children of age 12. They have not necessarily performed as a 12-year-old would.
2. Implication of a false standard of performance: equivalent scores are constructed so that 50% of any age or grade group will perform below age or grade level and 50% above it.
3. Tendency for scales to be ordinal, not equal interval: age and grade equivalent scores are ordinal, not equal interval; they should not be added or multiplied.
Source: Salvia, Ysseldyke & Bolt (2009)
Age & Grade Equivalents
(Developmental Scale)
 Maria’s age equivalent of 2-0 on a test means:
 Maria obtained the same number correct as the estimated mean of children 2 years and 0 months of age.
 It does NOT mean:
 Maria performed like an average 2-year-old on the test.
Age & Grade Equivalents
(Developmental Scale)
 John’s grade equivalent of 3.5 on a test means:
 John obtained the same number correct as the estimated mean of children in the fifth month of third grade.
 It does NOT mean:
 John is able to do 3.5 grade-level work.
Bottom Line – do not use age or grade equivalent scores.
Technical Adequacy of Instruments
The Reliability Coefficient
 An index of the extent to which observations can be generalized; the square of the correlation between obtained scores and true scores on a measure.
 The proportion of variability in a set of scores that reflects true differences among individuals.
 If there is relatively little error, the ratio of true-score variance to obtained-score variance approaches a reliability index of 1.0 (perfect reliability).
 If there is a relatively large amount of error, the ratio of true-score variance to obtained-score variance approaches .00 (total unreliability).
 We want to use the most reliable tests available.
 The greater the number of items, the higher the reliability coefficient.
 The greater the range of test scores, the higher the reliability.
 A moderate level of difficulty increases test reliability.
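In code form, the variance ratio described above might be sketched like this (the variance figures are hypothetical):

    def reliability(true_variance, error_variance):
        # Obtained-score variance = true-score variance + error variance.
        return true_variance / (true_variance + error_variance)

    print(reliability(90, 10))  # 0.9 -- little error: approaches 1.0
    print(reliability(10, 90))  # 0.1 -- mostly error: approaches .00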
Standards for Reliability
 If test scores are to be used for administrative purposes and are
reported for groups of individuals, a reliability of .60 should be
the minimum. The relatively low standard is acceptable because
group means are not affected by a test’s lack of reliability.
 If weekly (or more frequent) testing is used to monitor pupil
progress, a reliability of .70 should be the minimum. This
relatively low standard is acceptable because random fluctuations
can be taken into account when a behavior or skill is measured
often.
Standards for Reliability
 If the decision being made is a screening decision, there is a need for still higher reliability. For screening devices, a standard of .80 is recommended.
 If a test score is to be used to make an important
decision concerning an individual student (such as
special education placement), the minimum standard
should be .90.
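These four standards can be summarized as a small lookup table; a sketch (the decision labels are paraphrases of these slides, not official terms):

    MINIMUM_RELIABILITY = {
        "group/administrative": 0.60,
        "weekly progress monitoring": 0.70,
        "screening": 0.80,
        "important individual decision": 0.90,
    }

    def meets_standard(coefficient, decision):
        return coefficient >= MINIMUM_RELIABILITY[decision]

    print(meets_standard(0.85, "screening"))                      # True
    print(meets_standard(0.85, "important individual decision"))  # False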
Standard Error of Measurement
 SEM is another index of test error.
 It is the standard deviation of the error distribution around a person’s true score.
 It estimates how far a student’s obtained score is likely to fall from his or her true score.
 We generally assess a student once on a norm-referenced test, so we do not know the test taker’s true score or the variance of the measurement error that forms the distribution around that person’s true score.
 We estimate the error distribution by calculating the SEM.
 The general formula: the SEM equals the standard deviation of the obtained scores multiplied by the square root of 1 minus the reliability coefficient (see the sketch below).
 When the SEM is relatively large, the uncertainty about where the student’s true score falls is large; when the SEM is relatively small, the uncertainty is small.
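The formula in code, with two hypothetical reliability coefficients for a test whose scores have an SD of 15:

    import math

    def sem(sd, reliability):
        # SEM = SD of obtained scores * sqrt(1 - reliability coefficient)
        return sd * math.sqrt(1 - reliability)

    print(round(sem(15, 0.90), 2))  # 4.74 -- high reliability, small error
    print(round(sem(15, 0.60), 2))  # 9.49 -- lower reliability, larger error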
Confidence Interval
 The range of scores within which a person’s true score
will fall with a given probability.
 Since we can never know a person’s true score, we can
estimate the likelihood that a person’s true score will be
found within a specified range of scores called the
confidence interval.
 Confidence intervals have two components:
 Score range
 Level of confidence
Confidence Interval
 Score range: the range within which a true score is likely to be
found
 A range of 80 – 90 tells us that a person’s true score is likely to be within
that range
 Level of confidence: tells us how certain we can be that the true
score will be contained within the interval
 If a 90% confidence interval for an IQ is 106 – 112, we can be 90% sure
that the true score will be contained within that interval.
 It also means that there is a 5% chance the true score is higher than 112 and
a 5% chance the true score is lower than 106.
 To have greater confidence would require a wider confidence interval.
 You will have a choice of confidence intervals on Compuscore.
You can choose the 90 percent option but the default is set at
68%.
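A sketch of how such an interval can be built from an obtained score and the SEM, assuming normally distributed error; the SEM value of 1.8 is hypothetical, chosen to roughly reproduce the 106 – 112 example above.

    from scipy.stats import norm

    def confidence_interval(obtained, sem, level=0.90):
        z = norm.ppf(1 - (1 - level) / 2)  # two-tailed multiplier (~1.645 at 90%)
        return obtained - z * sem, obtained + z * sem

    low, high = confidence_interval(109, sem=1.8, level=0.90)
    print(round(low), "-", round(high))  # 106 - 112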
Validity
 “The degree to which evidence and theory support the interpretation of test scores entailed by proposed uses of tests” (APA Standards, 1999, p. 9)
 Validity is the most fundamental consideration in evaluating
and using tests.
Validity
 “A test that leads to valid inferences in general or about most
students may not yield valid inferences about a specific
student…First, unless a student has been systematically
acculturated in the values, behavior, and knowledge found in the
public culture of the United States, a test that assumes such
cultural information is unlikely to lead to appropriate inferences
about that student…
Validity
 Second, unless a student has been systematically instructed in the
content of an achievement test, a test assuming such academic
instruction is unlikely to lead to appropriate inferences about the
student’s ability to profit from instruction. It would be inappropriate
to administer a standardized test of written language (which counts
misspelled words as errors) to a student who has been encouraged to
use inventive spelling and reinforced for doing so. It is unlikely that the test results would lead to correct inferences about that student’s ability to profit from systematic instruction in spelling” (Salvia, Ysseldyke, & Bolt, 2009, p. 63).
Types of Validity
Content validity
Criterion-related validity
Construct validity
Content Validity
 A measure of the extent to which a test is an adequate measure of the content it is designed to cover; content validity is established by examining three factors:
 Appropriateness of the types of items included
 Comprehensiveness of the item sample
 The way in which the items assess the content
 It is assessed by a review of the items by trained individuals who make judgments about the relevancy of the items and the unambiguity of their formulation.
 This is especially important in achievement testing and is an area under debate.
 There is an emerging consensus that the methods used to assess student knowledge should closely parallel those used in instruction.
Criterion-related Validity
 The extent to which performance on a test predicts
performance in a real-life situation.
 Usually expressed as a correlation coefficient called a validity
coefficient.
 Two types of criterion-related validity:
 Concurrent validity
 Predictive validity
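A validity coefficient is an ordinary correlation; a minimal sketch with hypothetical test and criterion scores:

    import statistics  # statistics.correlation requires Python 3.10+

    test_scores      = [12, 15, 9, 20, 14, 17]
    criterion_scores = [60, 55, 50, 70, 48, 68]  # e.g., later reading performance

    # Pearson correlation between test and criterion = validity coefficient.
    print(round(statistics.correlation(test_scores, criterion_scores), 2))  # 0.76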
Concurrent Validity
 A measure of how accurately a person’s current test score can
be used to estimate a score on a criterion measure.
 If the test presents evidence of content validity and elicits test scores corresponding closely (correlating significantly) to judgments and scores from other achievement tests that are presumed to be valid, we can conclude that there is evidence for the test’s criterion-related validity.
Predictive Criterion-related Validity
 A measure of the extent to which a person’s current test scores can be
used to estimate accurately what that person’s criterion scores will be
at a later time.
 Concurrent and predictive validity differ in the time at which scores
on the criterion measure are obtained.
 If we are developing a test to assess reading readiness, we can ask:
Does knowledge of a student’s score on the reading test allow an
accurate estimation of the student’s actual readiness for instruction?
How do we know that our test really assesses reading readiness?
 The first step is to find a valid criterion measure; if an assessment has content validity and corresponds to that measure, we can conclude the test is valid.
Construct Validity
 The extent to which a procedure or test measures a theoretical trait or characteristic.
 Especially important for measures of process, such as intelligence/cognition.
 To provide evidence of construct validity, an author must rely on indirect evidence and inference.
 To gauge construct validity, a test developer accumulates evidence that the test acts in the way it would if it were a valid measure of the construct.
 As the research evidence accumulates, the developer can make a stronger claim to construct validity.
The Bottom Line…
“Test users are expected to ensure that
the test is appropriate for the specific
students being assessed.”
Salvia, Ysseldyke & Bolt, 2009, p. 71