Measuring Spread (Variability)

4.2 Describing Variability
A. Range
Definition - The range of a data set is defined as range = largest observation – smallest observation
Is this measure of spread affected by extreme scores?
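One way to explore the question above is to compute the range twice, once with and once without an extreme score. The sketch below is only an illustration: it uses the values 1, 2, 3, 4 that appear in this handout, and the second list (with 40 swapped in) is a hypothetical data set made up for the comparison.

```python
def data_range(values):
    """Range = largest observation - smallest observation."""
    return max(values) - min(values)

miles = [1, 2, 3, 4]                 # values used in this handout
miles_with_extreme = [1, 2, 3, 40]   # hypothetical: one extreme score swapped in

print(data_range(miles))               # 3
print(data_range(miles_with_extreme))  # 39
```

Because the range uses only the largest and smallest observations, changing a single extreme value changes the range by the same amount.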
B. Deviations from the Mean
Definition - The n deviations from the sample mean are the differences
$(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_n - \bar{x})$
Miles from school: 1, 2, 3, 4
$\bar{X}$ = ____

X      $(X - \bar{X})$
1      ____
2      ____
3      ____
4      ____
Σ =    ____
The sum of the deviations from the mean is always _______
How can we “fix” this? __________________
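The deviations for the miles-from-school data can be checked with a few lines of Python. This is a minimal sketch; the variable names are chosen here for illustration only.

```python
miles = [1, 2, 3, 4]                       # miles from school

mean = sum(miles) / len(miles)             # sample mean: 2.5
deviations = [x - mean for x in miles]     # [-1.5, -0.5, 0.5, 1.5]
total = sum(deviations)                    # negatives cancel positives

print(mean)
print(deviations)
print(total)                               # 0.0
```

For other data sets, floating-point rounding can leave a total that is extremely close to zero rather than exactly zero.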
C. The Variance and Standard Deviation
The sample variance is denoted by $s^2$.
The variance is the average of the squared deviations from the MEAN (using n – 1 as the divisor):
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
The sample standard deviation is the positive square root of the sample variance and is denoted by s.
The standard deviation describes the typical distance of the observations from the MEAN:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Variance is the average of the squared deviations from the mean

X      $(X - \bar{X})$      $(X - \bar{X})^2$
1      ____                 ____
2      ____                 ____
3      ____                 ____
4      ____                 ____
The sum of $(X - \bar{X})^2$ keeps growing as observations are added and doesn’t describe the
“typical” spread of the data set. How do we “fix” this
problem? _____________________
Our value ($s^2$) describes the average spread of the SQUARED deviations from the mean. This value is
not in the same unit of measurement as our original data. How do we “fix” this problem?
__________________
Standard Deviation is the square root of the average of the squared deviations from the mean
Is this measure of spread affected by extreme scores?
Standard Deviation
1st – the sum of the deviations (differences) from the mean is ALWAYS ZERO
2nd – so, we square each of the differences (negatives become positives).
3rd – the sum of these squares grows with the number of observations and doesn’t reflect the “typical” squared difference
4th – so, we find the average of the squared deviations (dividing by n – 1) – this is $s^2$, the variance.
5th – this value is in squared units and doesn’t describe the spread of our original data, which is not in
squared units
6th – so, we “remove” the square by taking the square root of the variance – this is s, the standard
deviation. Again (because it is important) the standard deviation is the square root of the average of the
squared deviations from the mean. **The derivation of the standard deviation formula is on your test.
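The six steps above can be followed line by line in code. The sketch below is one possible way to write them out, using the 1, 2, 3, 4 data from this handout. The function name is made up for illustration, and the last lines cross-check the result against Python's standard `statistics` module.

```python
import math
import statistics

def sample_variance_and_sd(values):
    """Follow the six steps: deviations, squares, sum, average, square root."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]   # 1st: these always sum to zero
    squared = [d ** 2 for d in deviations]    # 2nd: negatives become positives
    sum_of_squares = sum(squared)             # 3rd: grows with every observation
    variance = sum_of_squares / (n - 1)       # 4th: s^2, the sample variance
    sd = math.sqrt(variance)                  # 5th and 6th: back to original units
    return variance, sd

data = [1, 2, 3, 4]
variance, sd = sample_variance_and_sd(data)
print(variance, sd)

# Cross-check with the standard library (both use the n - 1 divisor).
print(statistics.variance(data), statistics.stdev(data))
```

For this data set the sum of squared deviations is 5, so $s^2 = 5/3 \approx 1.67$ and $s \approx 1.29$.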
Sample statistics estimate population parameters. These estimates should be unbiased.

statistic      relationship                                     parameter
$\bar{X}$      unbiased estimator of                            μ
$s^2$          unbiased estimator of                            $\sigma^2$
s              estimator of (only approximately unbiased)       σ
$\hat{p}$      unbiased estimator of                            p
Why divide by (n – 1)
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
To find the average of a set of data, we add the data values then divide by the number of data values
(n). But when averaging the squared deviations from the mean (for the variance and standard deviation) we
divide by the number of data values minus 1 (n – 1).
In a nutshell, dividing by
n – 1 provides a sample variance that is an unbiased estimator of the population variance $\sigma^2$ and
dividing by n does not. When we calculate a SAMPLE variance by dividing by n,
the value tends to UNDERESTIMATE the population variance, because the deviations are measured from the
sample mean rather than from μ. So, dividing by (n – 1) slightly inflates
$s^2$ so that, on average, it matches $\sigma^2$. The same divisor carries over to the SAMPLE standard
deviation, since s is the square root of $s^2$.
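The claim can be checked exactly on a very small case. The sketch below assumes a hypothetical population of just four values (1, 2, 3, 4) and lists every possible sample of size 2 drawn with replacement. It then averages the sample variance over all of those samples, once dividing by n and once by n – 1. The population, sample size, and names are illustrative choices, not part of the handout.

```python
from itertools import product
from statistics import pvariance

population = [1, 2, 3, 4]   # hypothetical tiny population
n = 2                       # sample size

def avg_squared_deviation(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / divisor

# Every ordered sample of size n, drawn with replacement (4 * 4 = 16 samples).
samples = list(product(population, repeat=n))

avg_div_n = sum(avg_squared_deviation(s, n) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(avg_squared_deviation(s, n - 1) for s in samples) / len(samples)
sigma_sq = pvariance(population)   # population variance (divides by N)

print(sigma_sq)            # 1.25
print(avg_div_n)           # about 0.625: too small on average
print(avg_div_n_minus_1)   # about 1.25: matches the population variance
```

Averaged over all possible samples, the n – 1 version lands on $\sigma^2$, while the n version comes out too small.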
D. The Interquartile Range
Definition - The interquartile range (iqr) is a measure of variability that is not as sensitive to the
presence of outliers as the standard deviation is. Specifically, the iqr is the spread of the middle 50% of
the data set. iqr = upper quartile (Q3) – lower quartile (Q1)
(Q1) lower quartile = median of the lower half of the sample
(Q3) upper quartile = median of the upper half of the sample
(If n is odd, the median of the entire sample is excluded from both halves)
2, 4, 4, 5, 7, 8, 10, 11, 12, 15
2, 4, 4, 5, 7, 8, 10, 11, 12, 15, 16
2, 4, 4, 5, 7, 8, 10, 11, 12, 15, 1600
Is this measure of spread affected by extreme scores?
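The three data sets above can be run through the "median of each half" rule. The sketch below follows the definition as given in this section, including the exclusion of the overall median when n is odd. The function name is hypothetical, and statistical software may use slightly different quartile rules and give slightly different answers.

```python
from statistics import median

def iqr_by_halves(values):
    """iqr = median of upper half - median of lower half.

    When n is odd, the overall median is left out of both halves.
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    half = len(data) // 2
    lower_half = data[:half]     # smallest `half` observations
    upper_half = data[-half:]    # largest `half` observations
    return median(upper_half) - median(lower_half)

set_a = [2, 4, 4, 5, 7, 8, 10, 11, 12, 15]
set_b = [2, 4, 4, 5, 7, 8, 10, 11, 12, 15, 16]
set_c = [2, 4, 4, 5, 7, 8, 10, 11, 12, 15, 1600]

print(iqr_by_halves(set_a))   # 7
print(iqr_by_halves(set_b))   # 8
print(iqr_by_halves(set_c))   # 8
```

Replacing 16 with 1600 leaves the iqr unchanged, because the extreme value never becomes the median of the upper half.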
Resistant statistics are those measures that are not affected by extreme values. That is, an extremely
large or small value in a data set does not pull the statistic toward that extreme value.
resistant measures vs. non-resistant measures

Data set: 1, 2, 3, 4, 5
shape:
center:
mean =
median =
iqr =
standard deviation = 1.6

Data set: 1, 2, 3, 4, 500
shape:
center:
mean =
median =
iqr =
standard deviation = 222.5
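The effect of the single extreme value (500) on the mean, median, and standard deviation can be seen by computing each statistic for both data sets. This is a minimal sketch using Python's standard `statistics` module; the `summarize` helper is a name made up here.

```python
from statistics import mean, median, stdev

def summarize(values):
    """Return (mean, median, sample standard deviation rounded to 1 decimal)."""
    return mean(values), median(values), round(stdev(values), 1)

without_extreme = [1, 2, 3, 4, 5]
with_extreme = [1, 2, 3, 4, 500]

print(summarize(without_extreme))   # mean 3, median 3, std dev 1.6
print(summarize(with_extreme))      # mean 102, median 3, std dev 222.5
```

The median stays at 3 in both data sets, while the mean and standard deviation are pulled strongly toward the extreme value.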
Remember, when describing distributions, start with the shape. The shape of the distribution tells you
which measures of center and spread are appropriate for the data set. Skewed distributions and
distributions with extreme values are best described by the median and iqr, since the median and iqr are
resistant to extreme values (long tails). For symmetric distributions, the mean and standard deviation
are the best way to describe the center and spread. (Mode? - Most of our distributions will be unimodal,
so a mention of this at the beginning of the description is appropriate. Range? – Even though the range
is non-resistant, a mention of the range in your description is a good idea.) All descriptions must be
in context, in complete sentence format, and include the unit of measurement from the problem.
Cost-to-Charge Shape?
So, which measure of center is appropriate?
Which measure of spread is appropriate?
MINITAB
Descriptive Statistics: acrylamide

Variable      N    Mean     Median   TrMean   StDev    SE Mean
acrylamide    7    287.7    270.0    287.7    112.3    42.4

Variable      Minimum   Maximum   Q1       Q3
acrylamide    155.0     497.0     193.0    328.0
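The measures of spread from this section can be read off or derived from the MINITAB output. The sketch below copies the printed values and computes the range and iqr. As an added check not stated in the handout, it assumes the usual relationship SE Mean = StDev / √n.

```python
import math

# Values copied from the MINITAB output for acrylamide.
n = 7
st_dev = 112.3
minimum, maximum = 155.0, 497.0
q1, q3 = 193.0, 328.0

data_range = maximum - minimum      # range = largest - smallest
iqr = q3 - q1                       # iqr = Q3 - Q1
se_mean = st_dev / math.sqrt(n)     # assumed formula: StDev / sqrt(n)

print(data_range)          # 342.0
print(iqr)                 # 135.0
print(round(se_mean, 1))   # 42.4, matching the SE Mean column
```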
Introduction to Statistics and Data Analysis
Measuring Spread (Variability)
Part I. For each of the four pairs of histograms below, choose the statement from the box at the right that
best describes the situation. HINT: Mark the value of the mean on the x-axis and consider where the data
lie relative to the mean.
Pair #1: [Histograms A1 and B1; vertical axis: Frequency. Means shown on the graphs: 0.4 and 4.4.]
A has larger std dev / B has larger std dev / Both graphs have the same std dev

Pair #2: [Histograms A2 and B2; vertical axis: Frequency. Mean = 2.5 for both graphs.]
A has larger std dev / B has larger std dev / Both graphs have the same std dev

Pair #3: [Histograms A3 and B3; vertical axis: Frequency. Mean = 2.0 for both graphs.]
A has larger std dev / B has larger std dev / Both graphs have the same std dev

Pair #4: [Histograms A4 and B4; vertical axis: Frequency. Means shown on the graphs: 3.33 and 2.57.]
A has larger std dev / B has larger std dev / Both graphs have the same std dev