Variance and Standard Deviation

Lesson 1 - 2
Describing Distributions
with Numbers
adapted from Mr. Molesky’s Statmonkey website
Measures of Spread
Variability is the key to Statistics. Without
variability, there would be no need for the subject.
When describing data, never rely on center alone.
Measures of Spread:
Range - {rarely used ... why?}
Quartiles - InterQuartile Range {IQR=Q3-Q1}
Variance and Standard Deviation {var and sx}
Like Measures of Center, you must choose the most
appropriate measure of spread.
Standard Deviation
Another common measure of spread is the Standard
Deviation: a measure of the “average” deviation of
all observations from the mean.
To calculate Standard Deviation:
1. Calculate the mean.
2. Determine each observation’s deviation (x - xbar).
3. Square each deviation.
4. “Average” the squared deviations by dividing the total squared deviation by (n-1). This quantity is the Variance.
5. Square root the result to determine the Standard Deviation.
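The steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the small data set are made up for the example and are not part of the lesson.

```python
from math import sqrt

def sample_variance(data):
    """Sample variance: total squared deviation divided by (n - 1)."""
    n = len(data)
    mean = sum(data) / n                          # calculate the mean
    deviations = [x - mean for x in data]         # each observation's deviation
    total_squared = sum(d ** 2 for d in deviations)  # square and total
    return total_squared / (n - 1)                # "average" by n - 1

def sample_std(data):
    """Sample standard deviation: square root of the sample variance."""
    return sqrt(sample_variance(data))

values = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up illustrative data, mean = 5
print(sample_variance(values))       # 32 / 7
print(sample_std(values))
```

Here the squared deviations total 32, so the variance is 32/7 and the standard deviation is its square root.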
Standard Deviation Properties
s measures spread about the mean and should be
used only when the mean is used as the measure of
center
s = 0 only when there is no spread/variability. This
happens only when all observations have the same
value. Otherwise, s > 0. As the observations
become more spread out about their mean, s gets
larger
s, like the mean x-bar, is not resistant. A few
outliers can make s very large
Standard Deviation
Variance:
$$\text{var} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
Standard Deviation:
$$s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
Example 1.16 (p.85): Metabolic Rates
1792  1666  1362  1614  1460  1867  1439
Standard Deviation
Metabolic Rates: mean=1600
x         (x - x̄)    (x - x̄)²
1792      192        36864
1666      66         4356
1362      -238       56644
1614      14         196
1460      -140       19600
1867      267        71289
1439      -161       25921
Totals:   0          214870
Total Squared Deviation: 214870
Variance: var = 214870/6 = 35811.67
Standard Deviation: s = √35811.67 = 189.24 cal
What does this value, s, mean?
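The hand calculation for the metabolic rates can be cross-checked with Python's standard `statistics` module, whose `variance` and `stdev` functions divide by (n - 1). The variable names are just for this sketch.

```python
import statistics

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]  # metabolic rates (cal)

mean = statistics.mean(rates)
var = statistics.variance(rates)   # sample variance, divides by n - 1
s = statistics.stdev(rates)        # sample standard deviation

print(mean, round(var, 2), round(s, 2))
```

One way to read s: the metabolic rates typically fall roughly 189 cal away from their mean of 1600 cal.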
Example 1
Which of the following measures of spread are
resistant?
1. Range
Not Resistant
2. Variance
Not Resistant
3. Standard Deviation
Not Resistant
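A small experiment shows what "not resistant" means in practice. The sketch below takes the metabolic rates and replaces one value with a hypothetical data-entry error (1867 typed as 18670), then compares each measure before and after. The IQR is included for contrast; the `iqr` helper uses the "middle of each half" quartile method and is an assumption of this sketch.

```python
import statistics

def iqr(data):
    """IQR using medians of the bottom and top halves (median left out when n is odd)."""
    xs = sorted(data)
    half = len(xs) // 2
    return statistics.median(xs[-half:]) - statistics.median(xs[:half])

def summarize(data):
    return {
        "range": max(data) - min(data),
        "variance": statistics.variance(data),
        "std_dev": statistics.stdev(data),
        "iqr": iqr(data),
    }

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]
with_outlier = [18670 if x == 1867 else x for x in rates]  # hypothetical typo

before = summarize(rates)
after = summarize(with_outlier)
for name in before:
    print(f"{name:>8}: {before[name]:.2f} -> {after[name]:.2f}")
```

The range, variance, and standard deviation all jump, while the IQR stays where it was.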
Example 2
Given the following set of data:
70, 28, 56, 63, 56, 35, 51, 50, 48, 58,
46, 46, 48, 62, 39, 69, 53, 45, 56, 53,
52, 60, 32, 70, 66, 38, 44, 33, 48, 73,
60, 54, 36, 45, 51, 55, 49, 51, 44, 52
What is the range?
73-28 = 45
What is the variance?
117.958
What is the standard deviation?
10.861
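These three answers can be checked with a short script. This is a sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

data = [70, 28, 56, 63, 56, 35, 51, 50, 48, 58,
        46, 46, 48, 62, 39, 69, 53, 45, 56, 53,
        52, 60, 32, 70, 66, 38, 44, 33, 48, 73,
        60, 54, 36, 45, 51, 55, 49, 51, 44, 52]

data_range = max(data) - min(data)
variance = statistics.variance(data)   # divides by n - 1
std_dev = statistics.stdev(data)

print(data_range, round(variance, 3), round(std_dev, 3))
```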
Quartiles
Quartiles Q1 and Q3 represent the 25th and 75th
percentiles.
To find them, order data from min to max.
Determine the median - average if necessary.
The first quartile is the middle of the ‘bottom half’.
The third quartile is the middle of the ‘top half’.
Example (odd number of observations):
19  22  23  23  23  26  26  27  28  29  30  31  32
Q1=23, med=26, Q3=29.5
Example (even number of observations):
45  68  74  75  76  82  82  91  93  98
Q1=74, med=79, Q3=91
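The "middle of each half" method can be sketched as a small function. The function name and the decision to leave the median out of both halves when n is odd are assumptions of this sketch; they match the two data sets on this slide.

```python
import statistics

def quartiles(data):
    """Return (Q1, median, Q3) using the medians of the bottom and top halves."""
    xs = sorted(data)                       # order data from min to max
    half = len(xs) // 2                     # size of each half
    q1 = statistics.median(xs[:half])       # middle of the bottom half
    med = statistics.median(xs)             # averages the middle pair if needed
    q3 = statistics.median(xs[-half:])      # middle of the top half
    return q1, med, q3

first = [19, 22, 23, 23, 23, 26, 26, 27, 28, 29, 30, 31, 32]
quiz_scores = [45, 68, 74, 75, 76, 82, 82, 91, 93, 98]

print(quartiles(first))
print(quartiles(quiz_scores))
```

Software packages use several different quartile conventions, so a library routine may return slightly different values of Q1 and Q3 for the same data.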
Using the TI-83
• Enter the test data into List, L1
– STAT, EDIT enter data into L1
• Calculate 5 Number Summary
– Hit STAT, go over to CALC, select 1-Var Stats, and hit 2nd 1 (L1)
• Use 2nd Y= (STAT PLOT) to graph the box plot
– Turn plot1 ON
– Select BOX PLOT (4th option, first in second row)
– Xlist: L1
– Freq: 1
– Hit ZOOM 9:ZoomStat to graph the box plot
• Copy graph with appropriate labels and titles
5-Number Summary, Boxplots
The 5 Number Summary provides a reasonably
complete description of the center and spread of
a distribution
MIN
Q1
MED
Q3
MAX
We can visualize the 5 Number Summary with a
boxplot.
[Boxplot of Quiz Scores, axis from 45 to 100: min=45, Q1=74, med=79, Q3=91, max=98; the plot carries the label “Outlier?”]
Determining Outliers
“1.5 • IQR Rule”
InterQuartile Range “IQR”: Distance between Q1 and
Q3. Resistant measure of spread...only measures
middle 50% of data.
IQR = Q3 - Q1 {width of the “box” in a boxplot}
1.5 IQR Rule: If an observation falls more than 1.5
IQRs above Q3 or below Q1, it is an outlier.
Why 1.5? According to John Tukey, 1 IQR seemed like too little and 2 IQRs
seemed like too much...
Outliers: 1.5 • IQR Rule
To determine outliers:
1. Find 5 Number Summary
2. Determine IQR
3. Multiply 1.5xIQR
4. Set up “fences”
A. Lower Fence: Q1-(1.5∙IQR)
B. Upper Fence: Q3+(1.5∙IQR)
5. Observations “outside” the fences are outliers.
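The fence procedure can be sketched as two small functions. The function names are made up for this sketch, and the data are the quiz scores from the boxplot slide, with Q1=74 and Q3=91 taken from its 5 Number Summary.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(data, q1, q3):
    """Observations that fall outside the 1.5 * IQR fences."""
    lower, upper = tukey_fences(q1, q3)
    return [x for x in data if x < lower or x > upper]

quiz_scores = [45, 68, 74, 75, 76, 82, 82, 91, 93, 98]
lower, upper = tukey_fences(q1=74, q3=91)

print(lower, upper)
print(find_outliers(quiz_scores, q1=74, q3=91))
```

With IQR = 17 the fences sit at 48.5 and 116.5, so the score of 45 is flagged as an outlier.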
Outlier Example
All data on pg 48, #1.6
IQR = 45.72 - 19.06 = 26.66
1.5 IQR = 1.5(26.66) = 39.99
Lower fence: 19.06 - 39.99 = -20.93
Upper fence: 45.72 + 39.99 = 85.71
[Boxplot of Spending ($), axis from 0 to 100; the points beyond the upper fence are labeled as outliers]
Example 4
Consumer Reports did a study of ice cream bars (sigh, only
vanilla flavored) in their August 1989 issue. Twenty-seven bars
having a taste-test rating of at least “fair” were listed, and
calories per bar was included. Calories vary quite a bit partly
because bars are not of uniform size. Just how many calories
should an ice cream bar contain?
342  377  319  353  295  234  294  286  377  182  310
439  111  201  182  197  209  147  190  151  131  151
Construct a boxplot for the data above.
Example 4 - Answer
Min = 111, Q1 = 182, Q2 = 221.5, Q3 = 319, Max = 439
IQR = 137, Range = 328
LF = -23.5, UF = 524.5
[Boxplot of Calories, axis from 100 to 500]
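The numbers behind this boxplot can be reproduced with a short script. It is a sketch: `five_number_summary` is a made-up helper that uses the "middle of each half" quartile method.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    xs = sorted(data)
    half = len(xs) // 2
    return (xs[0], statistics.median(xs[:half]), statistics.median(xs),
            statistics.median(xs[-half:]), xs[-1])

calories = [342, 377, 319, 353, 295, 234, 294, 286, 377, 182, 310,
            439, 111, 201, 182, 197, 209, 147, 190, 151, 131, 151]

minimum, q1, med, q3, maximum = five_number_summary(calories)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in calories if x < lower_fence or x > upper_fence]

print(minimum, q1, med, q3, maximum)
print(iqr, lower_fence, upper_fence, outliers)
```

No value falls outside the fences, so the whiskers of the boxplot run all the way to the minimum and maximum.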
Example 5
The weights of 20 randomly selected juniors at MSHS are
recorded below:
121  126  130  132  143  137  141  144  148  205
125  128  131  133  135  139  141  147  153  213
a) Construct a boxplot of the data
b) Determine if there are any mild or extreme outliers
c) Comment on the distribution
Example 5 - Answer
Min = 121, Q1 = 130.5, Q2 = 138, Q3 = 145.5, Max = 213
IQR = 15, Range = 92
Mean = 143.6, StDev = 23.91
LF = 108, UF = 168
Extreme Outliers ( > 3 IQR from Q3)
[Boxplot of Weight (lbs), axis from 100 to 220; the two extreme outliers are marked with *]
Shape: somewhat symmetric
Center: Median = 138
Outliers: 2 extreme outliers
Spread: IQR = 15
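Part (b) can be checked with a sketch that sorts observations into mild and extreme outliers. The slide defines extreme outliers as more than 3 IQRs beyond a quartile; treating "mild" as beyond 1.5 IQRs but within 3 IQRs is the usual convention and is an assumption here, as is the `classify` helper.

```python
weights = [121, 126, 130, 132, 143, 137, 141, 144, 148, 205,
           125, 128, 131, 133, 135, 139, 141, 147, 153, 213]

q1, q3 = 130.5, 145.5            # quartiles from the answer above
iqr = q3 - q1

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # LF and UF
outer = (q1 - 3 * iqr, q3 + 3 * iqr)       # 3 IQR fences

def classify(x):
    """Label one observation relative to the inner and outer fences."""
    if x < outer[0] or x > outer[1]:
        return "extreme"
    if x < inner[0] or x > inner[1]:
        return "mild"
    return "not an outlier"

extreme = [x for x in weights if classify(x) == "extreme"]
mild = [x for x in weights if classify(x) == "mild"]

print(inner, outer)
print("mild:", mild, "extreme:", extreme)
```

Both 205 and 213 lie beyond the upper 3 IQR fence of 190.5, which agrees with the two extreme outliers marked on the boxplot.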
Linear Transformations
Variables can be measured in different units
(feet vs meters, pounds vs kilograms, etc)
When converting units, the measures of center and
spread will change
Linear Transformations (xnew = a+bx) do not change
the overall shape of a distribution
Multiplying each observation by b multiplies measures of center by b
and measures of spread (IQR, standard deviation) by |b|
Adding a to each observation adds a to the measure of
center, but does not affect spread
If the distribution was symmetric, its transformation is
symmetric. If the distribution was skewed, its
transformation maintains the same skewness
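These rules can be checked numerically. The sketch below uses made-up Celsius temperatures converted to Fahrenheit (a = 32, b = 1.8), plus a second transformation with a negative b to show that spread scales by the size of b; all names and data are illustrative assumptions.

```python
import math
import statistics

def linear_transform(data, a, b):
    """Apply xnew = a + b*x to every observation."""
    return [a + b * x for x in data]

celsius = [10, 12, 15, 20, 23]                      # made-up temperatures
fahrenheit = linear_transform(celsius, a=32, b=1.8)
flipped = linear_transform(celsius, a=0, b=-2)      # negative multiplier

mean_c, sd_c = statistics.mean(celsius), statistics.stdev(celsius)
mean_f, sd_f = statistics.mean(fahrenheit), statistics.stdev(fahrenheit)
sd_flipped = statistics.stdev(flipped)

print(mean_f, 32 + 1.8 * mean_c)   # center: multiplied by b, then shifted by a
print(sd_f, 1.8 * sd_c)            # spread: multiplied by b, unaffected by a
print(sd_flipped, 2 * sd_c)        # spread stays positive when b is negative
```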
Transformation Example 6
• Using the data from example #5
– a) Change the weight from pounds to kilograms
and add 2 kg (for a special band uniform)
– b) Get summary statistics and compare with example 5
– c) Draw a box plot
Example 6 - Answer
• Convert pounds to kg (multiply by 0.4536) and add 2
Weight (lbs)   Weight (kg) + 2
121            56.89
126            59.15
130            60.97
132            61.88
143            66.86
137            64.14
141            65.96
144            67.32
148            69.13
205            94.99
125            58.70
128            60.06
131            61.42
133            62.33
135            63.24
139            65.05
141            65.96
147            68.68
153            71.40
213            98.62
Min = 56.89, Q1 = 61.19, Q2 = 64.60, Q3 = 68.00, Max = 98.62
IQR = 6.81, Range = 41.73
LF = 50.98, UF = 78.22
Mean = 67.14 (143.6 × 0.4536 + 2)
StDev = 10.84 (23.91 × 0.4536)
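The conversion and the comparison with Example 5 can be reproduced with a short script. This is a sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

pounds = [121, 126, 130, 132, 143, 137, 141, 144, 148, 205,
          125, 128, 131, 133, 135, 139, 141, 147, 153, 213]

# xnew = a + b*x with a = 2 (uniform, kg) and b = 0.4536 (kg per pound)
kilograms = [2 + 0.4536 * w for w in pounds]

mean_lb, sd_lb = statistics.mean(pounds), statistics.stdev(pounds)
mean_kg, sd_kg = statistics.mean(kilograms), statistics.stdev(kilograms)

print(round(mean_lb, 2), round(sd_lb, 2))
print(round(mean_kg, 2), round(sd_kg, 2))
```

The mean picks up both the multiplication and the added 2 kg, while the standard deviation is only multiplied by 0.4536.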
Example 6 – Answer cont
Extreme Outliers ( > 3 IQR from Q3)
[Boxplot of Weight (in Kg), axis from 45 to 105; the two extreme outliers are marked with *]
Transformation follows what we expect:
Multiplying each observation by b multiplies both the measure of
center and spread by b
Adding a to each observation adds a to the measure of center, but
does not affect spread
If the distribution was symmetric, its transformation is symmetric.
If the distribution was skewed, its transformation maintains the
same skewness
Day 2 Summary and Homework
• Summary
– Sample variance is found by dividing by (n – 1) rather than n. Because
deviations are measured from the sample mean, x‾, instead of the unknown
population mean, μ, dividing by (n – 1) keeps it an unbiased estimator of
population variance
– The larger the standard deviation, the more dispersion the
distribution has
– Boxplots can be used to check outliers and distributions
– Use comparative boxplots for two datasets
– Identifying a distribution from boxplots or histograms is
subjective!
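The first summary point can be illustrated exactly, without simulation. The sketch below assumes a tiny hypothetical population (the six faces of a fair die), lists every possible ordered sample of size 3 drawn with replacement, and averages the variance estimate over all of them, once dividing by (n - 1) and once by n. The population, sample size, and function name are all assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]          # hypothetical population
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

def average_estimate(n, divisor):
    """Average of sum((x - xbar)^2) / divisor over every ordered sample of size n."""
    samples = list(product(population, repeat=n))
    total = Fraction(0)
    for sample in samples:
        xbar = Fraction(sum(sample), n)   # sample mean stands in for mu
        total += sum((x - xbar) ** 2 for x in sample) / divisor
    return total / len(samples)

n = 3
print("population variance:", sigma2)
print("dividing by n - 1:  ", average_estimate(n, n - 1))
print("dividing by n:      ", average_estimate(n, n))
```

Averaged over all samples, dividing by (n - 1) recovers the population variance exactly, while dividing by n comes out too small.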
• Homework
– pg 82: prob 33; pg 89 probs 40, 41;
pg 97 probs 45, 46