STAT 1060 - Chapter 6 Notes
Standard Deviation and Normal Model
September 26, 2011
Exploring and Understanding Data
Distribution of final grades of those completing final exam
Figure: 1. Probability histogram, and overlaid best normal model. The relative frequency histogram of grades (mean = 68.3, standard deviation = 13.9) is shown with the best fitting normal curve N(68.3, 13.9); the horizontal axis is grade, from 30 to 90.
The histogram in figure 1 is a type of relative frequency histogram
in which the total area of the bars sums to 1.
Similarly, the total area under the overlaid “bell shaped curve”
equals 1.
The proportion of students with grades of 70 or less is the sum of
the areas of the bars to the left of 70.
This can be fairly closely approximated by the area under the
curve to the left of 70.
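The approximation described above can be checked in software. The sketch below is not part of the original notes: it assumes Python's standard-library `statistics.NormalDist`, and the variable names are illustrative.

```python
from statistics import NormalDist

# Normal model fitted to the grades in Figure 1: N(68.3, 13.9).
grades_model = NormalDist(mu=68.3, sigma=13.9)

# Area under the curve to the left of 70, which approximates the
# proportion of students with grades of 70 or less.
area_left_of_70 = grades_model.cdf(70)
print(round(area_left_of_70, 3))
```

The result is a model-based approximation; the exact proportion would come from adding the areas of the histogram bars to the left of 70.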
The standard deviation as a ruler
How good was a grade of 77 in this class?
Depending on how others scored, a 77 might have been quite
good, or it might have been quite average.
It is better to replace the 77 by a standardized value or z-score.
Where y denotes a grade, ȳ the average grade in the class, and s
the standard deviation of the grades, the standardized value of y is
z = (y − ȳ)/s
A grade of 77 corresponds to a z-score of
z = (77 − 68.3)/13.9 ≈ .625.
z-scores have no units, so the interpretation of a z-score is the
same regardless of what is being measured. In this case, a grade
of 77 is .625 standard deviations above the mean.
In the process of standardizing, we first shift, and then we
re-scale.
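The shift-then-re-scale steps of standardizing can be written as a short function. This is a minimal sketch in Python (not from the original notes); the function name `z_score` is hypothetical.

```python
def z_score(y, mean, sd):
    """Standardize y: shift by the mean, then re-scale by the standard deviation."""
    shifted = y - mean      # step 1: shift so the mean sits at 0
    return shifted / sd     # step 2: re-scale so one unit is one standard deviation

# Grade of 77 in a class with mean 68.3 and standard deviation 13.9.
z = z_score(77, 68.3, 13.9)
print(round(z, 3))
```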
Normal probability models
If the distribution (in this case, of grades) is approximately
symmetric and unimodal, then we can have a better interpretation
of what a standardized score of .625 means.
A distribution which is approximately symmetric and unimodal is
well represented by a theoretical distribution known as a normal
model, a normal distribution, or more colloquially, a bell-shaped
curve.
A normal model is specified by two characteristics - its mean µ
and its standard deviation σ. µ is a real number, and σ > 0 is a
positive real number. These are called parameters of the normal
model. Parameters have nothing to do with data. They are
properties of the model.
On the other hand, the sample mean ȳ , and sample standard
deviation s are known as statistics, which are numbers
calculated from the data.
The normal model with mean µ and standard deviation σ is
denoted by N(µ, σ).
Aside - a general outline of statistical inference
Generally we have data in the form of a sample which is drawn
randomly from a population.
A primary goal of statistical inference is to make estimates of the
unknown parameters. We use properties of a sample to infer
properties of the population.
It is often convenient to describe the population using a
probability model, also known as a probability distribution. For
example, we might assume that the population of incomes of
Canadians follows a symmetric, unimodal distribution. We might
then want to estimate the point of symmetry, which would
correspond to the mean income in the population.
It is often reasonable to assume a symmetric, unimodal
distribution. More generally, there is a mathematical argument
which suggests that many distributions can be accurately
approximated by a normal probability model. Later in the course
you will see this stated as the central limit theorem.
Normal probability models are symmetric and unimodal
Examples of some normal probability models
Figure: 2. Normal models N(µ, σ) with µ = 0, σ = 1; µ = 0, σ = 3; µ = −3, σ = 3; and µ = 0, σ = .5, plotted over the range −10 to 10.
Standard normal model
Normal models can be standardized in the same way as data.
If y follows a normal model N(µ, σ), then the standardized version
of y is
z = (y − µ)/σ
in which case z has mean 0 and standard deviation 1. That is, z
has the standard normal model N(0, 1), which is the black curve
(µ = 0, σ = 1) in Figure 2.
The N(µ, σ) curve is symmetric about µ.
The N(0, 1) curve is symmetric about 0.
the mean, median and mode (highest point) of this distribution are all
at µ
the standard deviation is σ, and is also the distance from µ to the
points where the curve changes from concave-down to
concave-up
why is this distribution so important?
  - good descriptions for some distributions of real data
      - scores on tests, biological characteristics such as lengths, yields, etc.
  - good approximations to results of many chance outcomes
      - number of heads in 40 tosses of a coin
      - see Central Limit Theorem (CLT)
  - many statistical inference procedures developed for the normal
    work well for other approximately symmetric distributions
many variables do not have normal distributions
  - time until first goal in a hockey game
  - incomes of Canadians
  - (both of these have distributions which are skewed to the right)
Areas under the normal curve to the left of c, or between c and d,
are known as probabilities. They cannot be calculated by hand, so
tables or software are used instead.
Evaluating areas under the normal curve using normal
tables.
extensive tables have been prepared for the standard normal
which has µ = 0 and σ = 1
these are in Appendix D, Table Z (pg. A-60-61) of the text
the side margin gives the first decimal place, and the top margin
gives the second
Useful fact: Under the standard normal curve, the area to the right of c
equals the area to the left of −c.
Examples: For a standard normal model, what is the area (probability)
in each of the following intervals?
1. z < .33 (.6293)
2. z > .33 (1 − .6293 = .3707)
3. z < −1.63 (.0516)
4. −1.3 < z < .9 (.8159 − .0968 = .7191)
5. |z| < 2 (.9772 − (1 − .9772) = .9544)
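The table lookups in these examples can be reproduced in software. The following is a sketch only, using Python's standard-library `statistics.NormalDist` rather than Table Z; the helper names `area_left`, `area_right`, and `area_between` are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal model: mean 0, standard deviation 1

def area_left(c):
    """Area under N(0, 1) to the left of c (what the table reports)."""
    return std_normal.cdf(c)

def area_right(c):
    """Area to the right of c: total area 1 minus the area to the left."""
    return 1 - std_normal.cdf(c)

def area_between(c, d):
    """Area between c and d: area left of d minus area left of c."""
    return std_normal.cdf(d) - std_normal.cdf(c)

areas = {
    "left of .33": area_left(0.33),
    "right of .33": area_right(0.33),
    "left of -1.63": area_left(-1.63),
    "between -1.3 and .9": area_between(-1.3, 0.9),
    "between -2 and 2": area_between(-2, 2),
}
for label, area in areas.items():
    print(label, round(area, 4))
```

The symmetry fact can be checked the same way: `area_right(c)` and `area_left(-c)` agree for any c, up to floating-point rounding.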
other normal areas (probabilities) can be obtained from this table
after standardizing
  - subtract mean, divide by standard deviation: Z = (X − µ)/σ
  - sometimes called the Z score
  - gives the number of standard deviations X is from its mean
Useful fact: The area under the N(µ, σ) curve to the left (right) of
c is equal to the area under the N(0, 1) curve to the left (right) of
(c − µ)/σ.
Questions - area under the normal curve
Find the area under the standard normal curve to the left of 1.28.
(.90)
Find the area under the standard normal curve to the right of 1.28.
(.10)
Find the area under the standard normal curve to the right of
-1.28. (.90)
Find the area under the standard normal curve to the right of
-2.05. (.98)
Find the area under the standard normal curve between -2.05 and
1.28. (.88)
Find the area under N(1, 3) to the left of 4.84. (.9)
Find the area under N(1, 3) between -5.15 and 4.84. (.88)
Find the area under N(−1, 5) between -11.25 and 5.4 (.88)
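One way to check answers like these is to compute the area twice: directly under N(µ, σ), and under N(0, 1) after standardizing the endpoints. The sketch below assumes Python's standard-library `statistics.NormalDist`; both function names are hypothetical.

```python
from statistics import NormalDist

def area_between(mu, sigma, lo, hi):
    """Area under N(mu, sigma) between lo and hi, computed directly."""
    model = NormalDist(mu, sigma)
    return model.cdf(hi) - model.cdf(lo)

def standardized_area_between(mu, sigma, lo, hi):
    """Same area, found under N(0, 1) after standardizing both endpoints."""
    z = NormalDist()
    return z.cdf((hi - mu) / sigma) - z.cdf((lo - mu) / sigma)

# Area under N(1, 3) between -5.15 and 4.84.
direct = area_between(1, 3, -5.15, 4.84)
via_z = standardized_area_between(1, 3, -5.15, 4.84)
print(round(direct, 2), round(via_z, 2))
```

The two routes agree because standardizing maps −5.15 and 4.84 to −2.05 and 1.28, the endpoints used in the standard normal question.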
Evaluating standard normal areas using minitab.
The area to the left of c under a probability model is referred to as
the cumulative distribution or cumulative distribution function
evaluated at c.
In minitab, the area to the left of c under the standard normal
distribution is evaluated as “cdf c”, where c is a real number.
Find the area to the left of .33 and to the left of -1.63 under the
standard normal curve.
MTB > cdf .33
Cumulative Distribution Function
Normal with mean = 0 and standard deviation = 1
    x  P( X <= x )
 0.33     0.629300
MTB > cdf -1.63
    x  P( X <= x )
-1.63    0.0515507
Some other minitab cdf examples
Find the areas to the left of 1.28, -1.28, -2.05 under N(0,1)
MTB > set c1
DATA> 1.28 -1.28 -2.05
DATA> end
MTB > cdf c1
Cumulative Distribution Function
Normal with mean = 0 and standard deviation = 1
    x  P( X <= x )
 1.28     0.899727
-1.28     0.100273
-2.05     0.020182
Minitab - area under other normal curves
To get probabilities under the normal curve with mean µ and standard
deviation σ, use the subcommand normal, followed by the mean, the
standard deviation, and a ".".
For example, the following gives probabilities to the left of 4.84 and
-5.15 under the normal model with mean 1 and standard deviation 3.
MTB > set c2
DATA> 4.84 -5.15
DATA> end
MTB > cdf c2;
SUBC> normal 1 3.
Cumulative Distribution Function
Normal with mean = 1 and standard deviation = 3
    x  P( X <= x )
 4.84     0.899727
-5.15     0.020182
Percentiles of the standard normal distribution
Sometimes we are given a probability and need to find the
corresponding percentile of the distribution
the 100 p’th percentile of the standard normal curve is that
number which cuts off an area p to its left.
For a standard normal, we find the probability in the table and then
the corresponding Z score from the margins.
For other normal distributions, we first get the Z score and then
‘untransform’ it using X = µ + σZ
Example: For a standard normal random variable, find
1. the 80th percentile.
  - the answer z satisfies P(Z ≤ z) = .8
  - from the table we find the closest probability .7995
  - from the margins of the table we get the corresponding z = .84
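The reverse table lookup can be sketched in software with an inverse cumulative distribution function. This example is an illustration only, assuming Python's standard-library `statistics.NormalDist`; it is not the method used in the notes.

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1)

# 80th percentile: the value z with P(Z <= z) = .8.
z_80 = std_normal.inv_cdf(0.80)

# Going the other way, the table entry for the rounded value z = .84.
table_prob = std_normal.cdf(0.84)

print(round(z_80, 4), round(table_prob, 4))
```

The table gives .84 because it only lists z to two decimal places; the software value carries more digits.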
Find the values under N(0,1) containing the middle 50% of the
area
we want z such that
P(−z < Z < z) = .50
there must be .25 probability in the left tail below −z, or
P(Z < −z) = .25
we find the closest probability in the table .2514, and then the
corresponding value -.67 from the margins
we have found −z = −.67 so z = .67 and conclude that the
interval containing the middle 50% goes from -.67 to .67
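The same tail argument works for any central area. The sketch below, assuming Python's standard-library `statistics.NormalDist`, wraps it in a hypothetical helper `central_interval`.

```python
from statistics import NormalDist

def central_interval(coverage, mu=0.0, sigma=1.0):
    """Endpoints of the interval holding the middle `coverage` of N(mu, sigma)."""
    tail = (1 - coverage) / 2          # area left over in each tail
    model = NormalDist(mu, sigma)
    return model.inv_cdf(tail), model.inv_cdf(1 - tail)

# Middle 50% of N(0, 1): each tail holds .25.
low, high = central_interval(0.50)
print(round(low, 2), round(high, 2))
```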
Some examples - percentiles of normal models
Find the 2.5’th percentile of the standard normal curve. (-1.96)
Find the 97.5th percentile of the standard normal curve. (1.96)
Find the 50’th percentile of the N(0,1) distribution. (0)
Find the 35’th percentile of the standard normal. (-.38)
Find the 97.5’th percentile of the N(1, 3) distribution.
  - The area under N(1, 3) to the left of c is the same as the area
    under N(0, 1) to the left of (c − 1)/3.
  - The 97.5’th percentile of N(0, 1) is 1.96.
  - Let 1.96 = (c − 1)/3, which means c = 1 + 3(1.96) = 6.88.
Find the 50’th percentile of the N(1, 3) distribution. (1)
Find the 35’th percentile of the N(−2, 7) distribution.
  - Using the fact that the 35’th percentile of N(0, 1) is -.38, we get
    −2 + (−.38)7 = −4.66 ≈ −4.7
Useful fact: 100p’th percentile of N(µ, σ) is
µ + σ(100 p’th percentile of N(0,1))
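The percentile rule above translates directly into code. This is a minimal sketch, assuming Python's standard-library `statistics.NormalDist`; the function name `normal_percentile` is hypothetical.

```python
from statistics import NormalDist

def normal_percentile(p, mu, sigma):
    """100p'th percentile of N(mu, sigma) via mu + sigma * (percentile of N(0, 1))."""
    z_p = NormalDist().inv_cdf(p)   # 100p'th percentile of the standard normal
    return mu + sigma * z_p         # 'untransform' back to the original scale

c_975 = normal_percentile(0.975, 1, 3)    # 97.5'th percentile of N(1, 3)
c_35 = normal_percentile(0.35, -2, 7)     # 35'th percentile of N(-2, 7)

# Cross-check against asking the N(-2, 7) model for its percentile directly.
direct_35 = NormalDist(-2, 7).inv_cdf(0.35)
print(round(c_975, 2), round(c_35, 2), round(direct_35, 2))
```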
Normal model percentiles in minitab using invcdf
The invcdf command in minitab finds the inverse cumulative
distribution function. For example “invcdf .8” finds the 80’th percentile
of N(0,1). To get percentiles of other normal models, you need to
specify the mean and standard deviation.
Find the 80’th, 25’th, 75’th, 2.5’th, 97.5’th, 50’th, 35’th percentiles of
the standard normal, and then the 35’th percentile of N(-2,7).
MTB > invcdf .8
P( X <= x )          x
        0.8   0.841621
MTB > set c3
DATA> .25 .75
DATA> end
MTB > invcdf c3
P( X <= x )          x
       0.25  -0.674490
       0.75   0.674490
some other minitab percentile examples
MTB > set c4
DATA> .025 .975 .5 .35
DATA> end
MTB > invcdf c4;
SUBC> end
Inverse Cumulative Distribution Function
Normal with mean = 0 and standard deviation = 1
P( X <= x )         x
      0.025  -1.95996
      0.975   1.95996
      0.500   0.00000
      0.350  -0.38532
more minitab percentiles
MTB > invcdf .35;
SUBC> normal -2 7.
Inverse Cumulative Distribution Function
Normal with mean = -2 and standard deviation = 7
P( X <= x )         x
       0.35  -4.69724
Next lines transform the 35’th percentile of N(0,1)
to the 35’th percentile of N(-2,7).
MTB > let k1=-.38532
MTB > let k2=k1*7-2
MTB > print k2
Data Display
K2    -4.69724
Example: Scores on the SAT verbal test, X , follow approximately the
N(505, 110) distribution.
How high must a student score to place in the top 10% of all
students taking the test?
  - we want x for which P(X > x) = .10
  - standardizing, we want
    P(Z > (x − 505)/110) = .10
    or
    P(Z < (x − 505)/110) = .90
  - from the tables, we find the z-score 1.28 gives P(Z < z) ≈ .9
  - solving
    z = (x − 505)/110 = 1.28
    gives
    x = 505 + 110(1.28) = 645.8
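The cut-off for the top 10% can also be found without rounding z to two decimals. The sketch below assumes Python's standard-library `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

# SAT verbal scores modelled as N(505, 110).
sat = NormalDist(mu=505, sigma=110)

# Top 10% means 90% of the area lies to the left of the cut-off.
cutoff = sat.inv_cdf(0.90)

# Check: the share of scores above the cut-off should be .10.
share_above = 1 - sat.cdf(cutoff)
print(round(cutoff, 1), round(share_above, 2))
```

The software value is slightly larger than 645.8 because the table value 1.28 is rounded.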
using Minitab
MTB > invcdf .9;
SUBC> normal 505 110.
Inverse Cumulative Distribution Function
Normal with mean = 505 and standard deviation = 110
P( X <= x )        x
        0.9  645.971
Below what mark are the lowest 20% of the students?
  - the z score corresponding to probability .2 (the 20’th percentile of
    the standard normal model) is -.84
  - transforming gives the 20’th percentile of N(505,110) as
    X = µ + zσ = 505 + (−.84)110 = 412.6
  - a common mistake is to ignore the sign of z and produce an answer
    greater than the mean, when the answer should be less than the
    mean for probabilities less than .5
The 68-95-99.7 rule
In a normal model
  - about 68% of the observations fall within 1 standard deviation of the
    mean
  - about 95% of the observations fall within 2 standard deviations of the
    mean
  - about 99.7% of the observations fall within 3 standard deviations of
    the mean
  - If an observation is within 1 standard deviation of the mean, then
    the associated standardized score is in (-1,1)
  - If an observation is within 2 standard deviations of the mean, then
    the associated standardized score is in (-2,2)
  - If an observation is within 3 standard deviations of the mean, then
    the associated standardized score is in (-3,3)
It is reasonable to assume a normal model for a data set if the
shape of the data’s distribution is approximately unimodal and
symmetric. This can be checked by making a histogram or a
normal probability plot.
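The percentages in the 68-95-99.7 rule are rounded areas under the standard normal curve. As a quick check (a sketch assuming Python's standard-library `statistics.NormalDist`, with a hypothetical helper `within_k_sd`):

```python
from statistics import NormalDist

def within_k_sd(k):
    """Area under N(0, 1) between -k and k, i.e. within k standard deviations."""
    z = NormalDist()
    return z.cdf(k) - z.cdf(-k)

rule = {k: within_k_sd(k) for k in (1, 2, 3)}
for k, area in rule.items():
    print(k, round(100 * area, 1))
```

Because standardized scores have the same interpretation for every normal model, the same areas apply to any N(µ, σ).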
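68-95-99.7 Rule
Figure: 3. 68-95-99.7 rule. A normal curve centred at mu, with mu + sigma marked on the x axis, showing 68%, 95%, and 99.7% of the area within 1, 2, and 3 standard deviations of the mean; the band areas read, from left to right, .15%, 2.35%, 13.5%, 68%, 13.5%, 2.35%, and .15%.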
Example: The length of white pine needles is approximately normally
distributed with mean 8 cm and standard deviation 2.5 cm. What is the
probability that a needle is less than 5cm?
with X the length as before
P(X < 5) = P(Z < (5 − 8)/2.5) = P(Z < −1.2)
from the tables of the standard normal (Table Z)
P(Z < −1.2) = .1151
this is approximately what we would get from the 68-95-99.7 rule
we can also get the probability using Minitab
MTB > cdf 5;
SUBC> normal 8 2.5.
Cumulative Distribution Function
Normal with mean = 8 and standard deviation = 2.5
a  P( X <= a )
5     0.115070
Example: The distribution of cholesterol in 14 year old boys is
approximately normal with µ = 170 mg/dl and σ = 30 mg/dl.
What proportion of boys have a cholesterol value of more than
240 mg/dl?
  - the level X ∼ N(170, 30), so
    P(X > 240) = P(Z > (240 − 170)/30) = P(Z > 2.33)
  - use symmetry to get the value from the table
    P(Z > 2.33) = P(Z < −2.33) = .0099
What is the probability that a 14 year old boy will have a
cholesterol level between 160 and 230 mg/dl?
  - using Minitab
MTB > cdf 230 k1;
SUBC> normal 170 30.
MTB > cdf 160 k2;
SUBC> normal 170 30.
MTB > print k1 k2
Data Display
K1    0.977250
K2    0.369441
MTB > let k3 = k1-k2
MTB > print k3
Data Display
K3    0.607809
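For readers without Minitab, the two cholesterol calculations can be sketched in Python. This assumes the standard-library `statistics.NormalDist` and is not part of the original notes; the variable names are illustrative.

```python
from statistics import NormalDist

# Cholesterol in 14 year old boys modelled as N(170, 30), in mg/dl.
chol = NormalDist(mu=170, sigma=30)

# Proportion above 240: total area 1 minus the area to the left of 240.
p_above_240 = 1 - chol.cdf(240)

# Probability between 160 and 230: difference of two areas to the left.
p_160_to_230 = chol.cdf(230) - chol.cdf(160)

print(round(p_above_240, 4), round(p_160_to_230, 4))
```

The first value differs slightly from the table answer because the table rounds the z-score to 2.33.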
Assessing Normality of a Sample
a normal probability plot can be used to assess whether the
data could have come from a normal distribution
these plots are also called normal QQ plots, normal scores plots, or
normal quantile plots
the sorted values are plotted against the values we would expect
to get if the sample came from a normal distribution
a roughly straight line in this plot indicates that the data are
approximately normally distributed
outliers show up as values distant from the overall pattern
curvature indicates departure from normality e.g. skewness
the NSCORES command in MINITAB produces the values to be
plotted against the data
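The idea behind normal scores can be sketched as follows. This is an illustration only: the plotting positions (i − 3/8)/(n + 1/4) are an assumption, and Minitab's NSCORES command may use a different convention. The function name `normal_scores` and the sample values are made up for the example.

```python
from statistics import NormalDist

def normal_scores(values):
    """Pair each sorted value with the z-score expected at its rank.

    Assumption: plotting positions (i - 3/8) / (n + 1/4) for ranks i = 1..n.
    """
    n = len(values)
    z = NormalDist()
    ordered = sorted(values)
    return [
        (z.inv_cdf((i - 0.375) / (n + 0.25)), y)
        for i, y in enumerate(ordered, start=1)
    ]

sample = [8.3, 6.1, 9.9, 7.4, 8.0]  # hypothetical lengths, for illustration only
pairs = normal_scores(sample)
for score, y in pairs:
    print(round(score, 2), y)
```

Plotting the second element of each pair against the first gives a normal scores plot; points close to a straight line are consistent with a normal model.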
Example: Pine needles were collected by DISP students in Point
Pleasant Park. The histogram and normal scores plot show that their
lengths are approximately normally distributed.
[Figure: histogram of needle length (horizontal axis: length) beside a normal scores plot of length against Quantiles of Standard Normal.]
Example: Shown below are histograms and QQ plots for data sets
which are skewed right, skewed left and with a flat peak.
Figure: 5. For each data set, a "Histogram of x" (Frequency against x) beside a "Normal Q−Q Plot" (Sample Quantiles against Theoretical Quantiles).
for the pine needle data
MTB > nscor c1 c2
MTB > plot c1 c2
[Character plot of C1 (vertical axis, 5.0 to 15.0) against C2 (horizontal axis, −2.0 to 2.0).]
apart from the top right, the line is pretty straight, confirming that
the values could have come from a normal distribution
the curvature in the normal scores plot can reveal the shape of
distribution
if the distribution is skewed to the right, the nscores plot curves up
at both the left and the right
MTB > hist c12
Histogram of C12   N = 200
Midpoint   Count
       0      45  ***********************
       1      67  **********************************
       2      38  *******************
       3      16  ********
       4      18  *********
       5       7  ****
       6       4  **
       7       2  *
       8       1  *
       9       0
      10       1  *
      11       0
      12       0
      13       0
      14       1  *
MTB > nscor c12 c13
MTB > plot c12 c13
[Character plot of C12 (vertical axis, 0.0 to 15.0) against C13 (horizontal axis, −2.0 to 2.0).]
if the distribution is skewed to the left, the nscores plot curves
down at each end
MTB > hist c14
Histogram of C14   N = 300
Each * represents 5 obs.
Midpoint   Count
       4       1  *
       6       1  *
       8       2  *
      10      11  ***
      12      13  ***
      14      36  ********
      16      60  ************
      18     117  ************************
      20      59  ************
MTB > nscor c14 c15
MTB > plot c14 c15
[Character plot of C14 (vertical axis, 5.0 to 20.0) against C15 (horizontal axis, −2.4 to 2.4).]
if the distribution has a flatter peak than the normal, the normal
scores plot curves up at the left and down at the right
MTB > hist c16
Histogram of C16   N = 200
Midpoint   Count
     0.0      15  ***************
     0.1      18  ******************
     0.2      21  *********************
     0.3      17  *****************
     0.4      18  ******************
     0.5      23  ***********************
     0.6      24  ************************
     0.7      24  ************************
     0.8      16  ****************
     0.9      18  ******************
     1.0       6  ******
MTB > nscor c16 c17
MTB > plot c16 c17
[Character plot of C16 (vertical axis, 0.00 to 1.05) against C17 (horizontal axis, −2.0 to 2.0).]
Using the pull down menus in minitab
In minitab, use the following sequence of pulldown menus
graph -> probability plot -> single
then select the column name with the data, and click OK
If almost all of the data points are within the outside blue lines, the
assumption of a normal model is appropriate. In this case the right
hand tail is a bit short as compared to a normal distribution
(because there is an upper limit of 100 for the grades) but the fit
isn’t too bad.