Lecture 3: Descriptive Statistics
I. Describing the Distribution of Test Scores
8, 10, 6, 5, 9, 10, 8, 9, 7, 9, 7, 6, 10, 8, 9, 6, 10, 3, 8, 9, 7, 9, 7, 8, 10
A list of numbers by itself is not very interesting or informative.
What is an easy thing I could do to help make sense of this long list of quiz scores?
A. Frequency Distribution Tables – a way of presenting data (e.g., scores on a
test) that shows the number of times each value occurs, making it easier to
see any patterns in the data. We can also estimate the percentile rank of a
score from the cumulative percent frequency column: C%F = (cf / N) * 100,
where cf is the cumulative frequency (the number of scores at or below that value).
Remember: N = the number of scores in the set or distribution.
Frequency Distribution Table of Quiz Scores

Raw Scores (X)   Frequency   Cumulative Frequency   Cumulative Percent Frequency
10               5           25 = N                 100%
9                6           20                     80%
8                5           14                     56%
7                4           9                      36%
6                3           5                      20%
5                1           2                      8%
4                0           1                      4%
3                1           1                      4%
Now it’s easier to see that most of the scores were very good!
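The same table can be built by a short program. What follows is a minimal Python sketch, not part of the original lecture; the name frequency_table is only an illustrative choice. It counts each score, accumulates the counts from the lowest score upward, and applies C%F = (cf / N) * 100.

```python
from collections import Counter

scores = [8, 10, 6, 5, 9, 10, 8, 9, 7, 9, 7, 6, 10,
          8, 9, 6, 10, 3, 8, 9, 7, 9, 7, 8, 10]


def frequency_table(data):
    """Return rows of (score, frequency, cf, C%F), highest score first."""
    counts = Counter(data)
    n = len(data)
    rows = []
    cf = 0
    # Accumulate from the lowest score up, so cf counts scores at or below X.
    for x in range(min(data), max(data) + 1):
        cf += counts[x]  # a Counter gives 0 for values that never occur
        rows.append((x, counts[x], cf, 100 * cf / n))
    return rows[::-1]


for x, f, cf, cpf in frequency_table(scores):
    print(f"{x:>3} {f:>3} {cf:>3} {cpf:>5.0f}%")
```

Scores that never occur (such as 4) still get a row, which matches the table above.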
Instead of a table, some people prefer pictures, which can also help us interpret
our results.
B. Frequency Polygons and Histograms: graphs that provide the same
information as a frequency table, where each point (or bar if it’s a
histogram) on the horizontal axis represents a raw score, and each point on
the vertical axis represents the frequency (or number of times) each score
occurred in the distribution.
[Figure: frequency polygon of the quiz scores. Vertical axis: Freq. (0 to 7). Horizontal axis: Quiz Scores (1 to 10).]
If you superimposed a frequency histogram over a frequency polygon of the same
distribution of scores, they would look about the same. The only difference is
that one (the polygon) uses points as markers and the other (the histogram) uses
bars. Pictured above is a frequency polygon.
Notice that the line for the polygon begins and ends on the X-axis.
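To see where the points of a frequency polygon come from, here is a small Python sketch (an illustration, not the lecture's own method). It assumes the usual convention that the line is brought down to the X-axis by adding a zero-frequency point one score below the lowest score and one score above the highest score. The name polygon_points is hypothetical.

```python
from collections import Counter

scores = [8, 10, 6, 5, 9, 10, 8, 9, 7, 9, 7, 6, 10,
          8, 9, 6, 10, 3, 8, 9, 7, 9, 7, 8, 10]

counts = Counter(scores)

# One extra score on each side, so the polygon starts and ends at frequency 0.
xs = range(min(scores) - 1, max(scores) + 2)
polygon_points = [(x, counts[x]) for x in xs]

# A rough text picture: one row per score, one X per occurrence.
for x, f in polygon_points:
    print(f"{x:>2} | {'X' * f}")
```

Connecting the (score, frequency) pairs with straight lines gives the polygon; drawing a bar of the same height at each score gives the histogram.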
C. Possible Shapes of Score Distributions
Symmetrical, Bell Shaped, Unimodal
Distribution with two modes (indicates heterogeneity of the group)
Asymmetrical (could result from a test being too hard or lacking content validity)
Asymmetrical (could result from a test being easy or from mastery of objectives)
Rectangular or Uniform (frequency is constant for all values of X)
II. Using Measures of Central Tendency
Identifies how scores tend to cluster or identifies the center of a score distribution
A. Mode – the most frequently occurring score; the one(s) you have the most of.
1, 1, 1, 1, 2, 3, 4, 5    “Unimodal”
1, 1, 1, 2, 3, 3, 3, 5    “Bimodal”
1, 1, 2, 2, 3, 3, 4, 5    “Multimodal”
1, 2, 3, 4, 5, 6, 7, 8    No Mode! (A uniform distribution)
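The four cases above can be checked with a few lines of Python. This is only a sketch: the function name modes is hypothetical, and it assumes a non-empty list and that "no mode" means every value occurs equally often.

```python
from collections import Counter


def modes(data):
    """Return every most-frequent value, or [] when no value stands out."""
    counts = Counter(data)
    highest = max(counts.values())
    # If all distinct values share the same frequency, report no mode.
    if len(counts) > 1 and all(f == highest for f in counts.values()):
        return []
    return sorted(x for x, f in counts.items() if f == highest)


print(modes([1, 1, 1, 1, 2, 3, 4, 5]))  # one mode
print(modes([1, 1, 1, 2, 3, 3, 3, 5]))  # two modes
print(modes([1, 1, 2, 2, 3, 3, 4, 5]))  # three modes
print(modes([1, 2, 3, 4, 5, 6, 7, 8]))  # no mode
```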
B. Median – when scores are put in numeric order, the estimated median is the
middle score. The median is also the 50th percentile.
1. If you have an odd number of scores, then the median is the middle score.
1, 2, 3, 4, 5
Median = 3
2. If you have an even number of scores, then the median is the average of the middle
two scores.
1, 2, 3, 4, 5, 6
Median = (3 + 4) / 2 = 3.5
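The two rules can be written as one short function. This is a minimal Python sketch of the odd and even cases described above; it assumes the list is not empty.

```python
def median(data):
    """Middle score for an odd count, average of the two middle scores for an even count."""
    ordered = sorted(data)  # the scores must be in numeric order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([1, 2, 3, 4, 5]))     # odd number of scores
print(median([1, 2, 3, 4, 5, 6]))  # even number of scores
```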
C. Mean – symbolized as $\bar{X}$, it is the arithmetic average. To find the mean, add all
of the scores together and divide the total sum by the number of scores in
the set. The mean is defined as
$\bar{X} = \frac{\sum X}{N}$
1, 2, 3, 4, 5
Mean = (1 + 2 + 3 + 4 + 5) / 5 = 15 / 5 = 3
1, 3, 3, 5, 8
Mean = (1 + 3 + 3 + 5 + 8) / 5 = 20 / 5 = 4
The mean is the most commonly used measure of central tendency. However, its
value is greatly influenced by the presence of extreme scores (scores that are far
away from the rest of the distribution).
1, 2, 3, 4, 5
Median = 3
Mean = 3
*1, 2, 3, 4, 25
Median = 3
Mean = 7
*For skewed distributions, the median is often the preferred measure of central
tendency because the mean is not a good indicator of ‘average’ performance.
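The effect of a single extreme score is easy to reproduce. The sketch below (illustrative only; the variable names are made up) computes the mean as the sum divided by N and takes the median from Python's standard statistics module.

```python
from statistics import median


def mean(data):
    """Arithmetic average: the sum of the scores divided by N."""
    return sum(data) / len(data)


typical = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 25]  # same scores, but the last one is extreme

print(mean(typical), median(typical))
print(mean(with_outlier), median(with_outlier))
```

Changing 5 to 25 pulls the mean from 3 up to 7, while the median stays at 3.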
Practice Exercise
Using the following data:
3, 9, 8, 8, 4, 5, 5, 5, 3, 9, 9, 3, 10, 9, 4, 10
(1) Develop a frequency table and sketch the frequency polygon.
(2) Calculate the mean, estimate the median, and identify the mode of the
distribution.
Frequency Distribution Table of Quiz Scores

X     Frequency   Cumulative Frequency   Cumulative Percent Frequency
10
9
8
7
6
5
4
3
Frequency Polygon
[Blank grid for sketching the polygon. Vertical axis: Freq. (0 to 7). Horizontal axis: Scores (1 to 10).]
N=
Mean =
so the Median =
Mode =
III. The Relationship between Skewness and Measures of Central Tendency
The mode is the score that occurs most frequently.
It is possible for there to be no mode for a distribution.
It is possible for there to be more than one mode for a single distribution.
It is relatively easy to locate and is useful in preliminary description.
It is the only measure of central tendency appropriate for qualitative data.
The mean is the sum of all the scores divided by the number of observations (N).
The mean is the most often used measure of central tendency.
The mean is the balancing point of the distribution.
Extreme scores will affect the position of the mean.
The median is the score point below which 50% of the cases fall.
The median is also known as the 50th percentile.
The median is often used for reporting central tendency for skewed
distributions.
Extreme scores will not affect the value of the median.
The inclusive range is the difference between the highest and lowest scores plus one.
The range is the simplest measure of variability and gives us a quick estimate.
The range is found using only the two most extreme scores in the distribution.
The range is greatly influenced by extreme scores (outliers).
A deviation score indicates the distance of a score above or below the mean. It is used to find the
variance and standard deviation of distributions.
IV. Using Measures of Variability
When describing a set of scores, people often present only a measure of central tendency.
However, a measure of variability, which tells us how much the scores spread out, is also
needed. Consider the two following sets of quiz scores:
Quiz 1 Scores: 2, 5, 4, 8, 5, 5, 6, 7, 3, 5, 8, 2
Mean = 60 / 12 = 5
Quiz 2 Scores: 0, 0, 10, 8, 3, 12, 0, 1, 8, 15, 2, 1
Mean = 60 / 12 = 5
A. Inclusive Range = (Highest Score – Lowest Score) + 1
Quiz 1 Range = (8 – 2) + 1 = 7
Quiz 2 Range = (15 – 0) + 1 = 16
*Caution: The range is based only on the 2 most extreme scores.
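As a quick check of the two ranges, here is a minimal Python sketch (the function name inclusive_range is just illustrative). It also confirms that both quizzes have the same mean.

```python
quiz_1 = [2, 5, 4, 8, 5, 5, 6, 7, 3, 5, 8, 2]
quiz_2 = [0, 0, 10, 8, 3, 12, 0, 1, 8, 15, 2, 1]


def inclusive_range(data):
    """(Highest score - lowest score) + 1."""
    return (max(data) - min(data)) + 1


for name, quiz in (("Quiz 1", quiz_1), ("Quiz 2", quiz_2)):
    print(name, "mean =", sum(quiz) / len(quiz), "range =", inclusive_range(quiz))
```

Notice that only max(data) and min(data) enter the calculation; the other ten scores in each quiz have no effect on the result.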
What we would really like is a measure that tells us “on average, how do scores
differ from the mean”.
We need a measure of dispersion that is based on all of the scores, not just two.
Score    (Score – Mean)    (Score – Mean)^2
1        1 – 3 = -2        (-2)^2 = 4
2        2 – 3 = -1        (-1)^2 = 1
3        3 – 3 = 0         0^2 = 0
4        4 – 3 = 1         1^2 = 1
5        5 – 3 = 2         2^2 = 4
Mean = 3    Sum of deviations = 0 (this sum always = 0)    Sum of squared deviations = 10
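The columns of this table can be reproduced in a few lines. This Python sketch is only an illustration, with made-up variable names. For these whole-number scores the sum of the deviations comes out to exactly 0; with other data, floating-point rounding may leave a tiny non-zero remainder.

```python
scores = [1, 2, 3, 4, 5]
mean = sum(scores) / len(scores)

deviations = [x - mean for x in scores]  # (Score - Mean)
squared = [d ** 2 for d in deviations]   # (Score - Mean)^2

for x, d, sq in zip(scores, deviations, squared):
    print(x, d, sq)

print("Sum of deviations:", sum(deviations))
print("Sum of squared deviations:", sum(squared))
```

Squaring removes the minus signs, so the positive and negative deviations no longer cancel each other out.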
B. Variance = s^2 = Sum of (each score - mean)^2, divided by N:
$s^2 = \frac{\sum (X - \bar{X})^2}{N}$
Steps:
1. Find the mean of the distribution.
2. For each score
a. Subtract the mean from the score
b. Square the resulting difference.
3. Sum the column of squared differences.
4. Divide the sum of squared differences by the number of
scores, N.
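The four steps translate directly into code. The following is a minimal Python sketch with a hypothetical function name. It divides by N, as these notes do, which is what the standard library calls the population variance (statistics.pvariance); statistics.variance divides by N - 1 instead and would give a different answer.

```python
from statistics import pvariance


def variance(data):
    """Variance as defined in the steps above (divides by N)."""
    n = len(data)
    mean = sum(data) / n                             # step 1
    squared_diffs = [(x - mean) ** 2 for x in data]  # steps 2a and 2b
    total = sum(squared_diffs)                       # step 3
    return total / n                                 # step 4


scores = [1, 2, 3, 4, 5]
print(variance(scores))   # sum of squared deviations (10) divided by N (5)
print(pvariance(scores))  # library result, for comparison
```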
C. Standard Deviation = s = the square root of the variance: $s = \sqrt{s^2}$
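Applying this to the two quizzes from the start of this section shows why a measure of variability is needed. The sketch below is illustrative only (the function name standard_deviation is hypothetical) and divides by N, as in the variance formula above.

```python
import math

quiz_1 = [2, 5, 4, 8, 5, 5, 6, 7, 3, 5, 8, 2]
quiz_2 = [0, 0, 10, 8, 3, 12, 0, 1, 8, 15, 2, 1]


def standard_deviation(data):
    """Square root of the variance (variance computed with N in the denominator)."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / n
    return math.sqrt(variance)


print(round(standard_deviation(quiz_1), 2))
print(round(standard_deviation(quiz_2), 2))
```

Both quizzes have a mean of 5, yet s is about 1.96 for Quiz 1 and about 5.10 for Quiz 2, so the Quiz 2 scores are spread out much more.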
More about Measures of Dispersion
A. Cautions concerning use of the range:
i. The range is based only on the 2 most extreme scores, which
makes it less representative of the group in the presence of
outliers. The more extreme those outliers are, the less
representative the range is. (The same is true of the mean.)
ii. The range is greatly influenced by N, the size of the group.
Generally, the larger the group is, the larger the range will be.
What we really need is a measure that tells us “on average” how much scores
differ from the mean. We need an indicator of variability that is less influenced
by outliers and is calculated using every score in the distribution.
The deviation score is another measure of how much scores differ from the mean of
the distribution. We use it to find the variance. Specifically, it is the
distance of a score from the mean.
The sum of deviations from the mean always equals zero. That is why
we have to square each deviation.
Variance is the average of the squared deviations of a set of scores
from their mean. It is represented by $s^2$.
i. $s^2 = \frac{\sum (X - \bar{X})^2}{N}$
The variance cannot be easily interpreted in the context of the original
scores because it is on a different scale of measurement.
III. Standard Deviation is the square root of the variance and is interpreted as
the average difference between the scores in a distribution and their
mean. It is represented by s. The standard deviation is on the same
scale of measurement as the original scores and is, therefore,
interpretable in that context.
$s = \sqrt{s^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$
The largest possible standard deviation for any given range of scores is
the range divided by two.
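A small example makes this limit concrete. The Python sketch below is an illustration under two assumptions that the notes do not state explicitly: "range" is taken as highest minus lowest (without the +1 of the inclusive range), and s is computed with N in the denominator (statistics.pstdev). The limit is reached when half of the scores sit at the lowest value and half at the highest.

```python
from statistics import pstdev

# Half of the group scores 0 and half scores 10: as spread out as possible.
extreme = [0] * 6 + [10] * 6
# One score at every value from 0 to 10: evenly spread over the same range.
even = list(range(0, 11))

half_range = (max(extreme) - min(extreme)) / 2

print(half_range)
print(pstdev(extreme))
print(pstdev(even))
```

With the inclusive range (11 here) half the range is 5.5, so s = 5 for the extreme group still stays below it.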
s is easily interpretable
Rules to help determine the variability of a group’s performance:
When s is close to ½ of the range, the group’s performance is
said to be very diverse within the range and heterogeneous.
When s is close to 1/3 of the range, scores are considered
dispersed throughout the range.
When s is ¼ or less of the range, scores are said to be clustered
within an area and to be homogeneous.
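To apply these rules, compare s with the range. The sketch below computes that ratio for the two quizzes used earlier in these notes. It is an illustration only: the name spread_ratio is hypothetical, and it assumes the rules refer to the inclusive range and to s computed with N in the denominator, since the notes do not say which.

```python
from statistics import pstdev

quiz_1 = [2, 5, 4, 8, 5, 5, 6, 7, 3, 5, 8, 2]
quiz_2 = [0, 0, 10, 8, 3, 12, 0, 1, 8, 15, 2, 1]


def spread_ratio(data):
    """Standard deviation expressed as a fraction of the inclusive range."""
    inclusive_range = (max(data) - min(data)) + 1
    return pstdev(data) / inclusive_range


print(round(spread_ratio(quiz_1), 2))
print(round(spread_ratio(quiz_2), 2))
```

The ratios come out to roughly 0.28 for Quiz 1 and 0.32 for Quiz 2. The Quiz 2 value is close to 1/3, and the Quiz 1 value falls between ¼ and 1/3. How near counts as "close" is a judgment call that the rules leave open.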
iv. Factors that influence the size of the standard deviation:
1. The range: the wider the range, the larger s will be.
2. The distribution of scores within the range: the less
homogeneous the group, the larger s will be.
v. Standard deviations are also used to interpret standardized test
scores and to evaluate student performances within classes and
between them.