The normal distribution - UC Davis Plant Sciences

Grading
• Homework: 25 %
• In-class quiz: 5 % (Jan. 29, 9:00 a.m.)
• First exam: 35 % (Feb. 12 and due Feb. 17, 9:00 a.m.)
• Second exam: 35 % (March 12 and due March 17, 5:00 p.m.)
Measures of Central Tendency and Dispersion [ST&D p. 16-27]
Individual values of a population are designated $Y_i$, $i = 1, \dots, N$, where N is the size of the population.
Individual values of a sample are also denoted $Y_i$, $i = 1, \dots, n$, where n is the size of the sample.
Greek letters are used for population parameters (µ = pop. mean; σ² = pop. variance).
Mean or average (measure of central tendency)
• Pop. mean: $\mu = \dfrac{\sum_{i=1}^{N} Y_i}{N}$
* Sample mean: $\bar{Y} = \dfrac{\sum_{i=1}^{n} Y_i}{n}$
Variance (measure of dispersion of the individuals about the mean)
• Pop. variance: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (Y_i - \mu)^2}{N}$
* Sample variance: $s^2 = \dfrac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$
The quantities $(Y_i - \bar{Y})$ are called deviations.
To express these measures of dispersion in the original units of observation:
• Pop. standard deviation: $\sigma = \sqrt{\sigma^2}$
* Sample standard deviation: $s = \sqrt{s^2}$
To express the standard deviation in units of the mean (or %):
• Pop. coeff. of variation: $CV = \dfrac{\sigma}{\mu}$
* Sample coeff. of variation: $CV = \dfrac{s}{\bar{Y}}$
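As a minimal sketch, the sample formulas above can be applied in Python (standard library only) to the 14 malt extract values quoted later in these notes. The variable names are illustrative and do not come from ST&D.

```python
import math

# 14 malt extract values quoted later in these notes (ST&D p. 30).
values = [77.7, 76.0, 76.9, 74.6, 74.7, 76.5, 74.2, 75.4, 76.0, 76.0,
          73.9, 77.4, 76.6, 77.3]

n = len(values)
sample_mean = sum(values) / n                                 # Y-bar
deviations = [y - sample_mean for y in values]                # (Y_i - Y-bar)
sample_variance = sum(d ** 2 for d in deviations) / (n - 1)   # s^2, divisor n - 1
sample_sd = math.sqrt(sample_variance)                        # s, in original units
sample_cv = sample_sd / sample_mean                           # CV, in units of the mean

print(f"mean = {sample_mean:.3f}, s^2 = {sample_variance:.3f}, "
      f"s = {sample_sd:.3f}, CV = {100 * sample_cv:.2f}%")
```

The mean and standard deviation agree with the values 75.943 and 1.227 used for the normal probability plot at the end of these notes.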
Visualization of central tendency and dispersion using boxplots
[Figure: box plot marking the median, the mean, the interquartile (IQ) range, whiskers of 1.5 IQ range, and outliers (0 = more than 1.5 IQ and less than 3 IQ; * = more than 3 IQ).]
• Review ST&D p. 58 Estimation and inference, p. 53: 3.8 Distribution of means
Measures of dispersion of sample means
An important population parameter is the variance of the mean ($\sigma^2_{\bar{Y}}$).
If you repeatedly sample a population by taking samples of size n, the variance of those sample means is what we call the variance of the mean.
It relates very simply to the population variance:
Variance of the mean: $\sigma^2_{\bar{Y}} = \dfrac{\sigma^2}{n}$
We can estimate $\sigma^2_{\bar{Y}}$ for a population by taking r independent, random samples of size n from that population, calculating the sample means $\bar{Y}_i$, and then calculating the variance of those sample means:
$s^2_{\bar{Y}} = \dfrac{\sum_{i=1}^{r} (\bar{Y}_i - \bar{\bar{Y}})^2}{r - 1} \approx \sigma^2_{\bar{Y}}$
where $\bar{\bar{Y}}$ is the mean of the r sample means.
The square root of $s^2_{\bar{Y}}$ is called the standard error (or standard deviation of a mean).
Standard error: $s_{\bar{Y}} = \sqrt{s^2_{\bar{Y}}} = \dfrac{s}{\sqrt{n}}$
• As with the standard deviation, this is a quantity in the original units of observation.
• The SE is important in determining confidence intervals and the power of tests.
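The relation between the variance of the mean and the population variance can be checked by simulation. The sketch below assumes a normal population with µ = 17.2 and σ = 6 (borrowed from the finch example in these notes). The seed, the number of repeated samples r, and all variable names are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

MU, SIGMA = 17.2, 6.0   # assumed normal population (finch example)
n, r = 20, 5000         # sample size and number of repeated samples

# Draw r independent samples of size n and keep each sample mean.
sample_means = [
    statistics.fmean([random.gauss(MU, SIGMA) for _ in range(n)])
    for _ in range(r)
]

var_of_means = statistics.variance(sample_means)  # s^2 of the r sample means
theoretical = SIGMA ** 2 / n                      # sigma^2 / n

# In practice only one sample is available: estimate the SE as s / sqrt(n).
one_sample = [random.gauss(MU, SIGMA) for _ in range(n)]
se_one_sample = statistics.stdev(one_sample) / math.sqrt(n)

print(f"variance of sample means: {var_of_means:.3f} (theory: {theoretical:.3f})")
print(f"SE from a single sample:  {se_one_sample:.3f} (theory: {math.sqrt(theoretical):.3f})")
```

The simulated variance of the means should land close to σ²/n, while the single-sample standard error varies more from run to run because it rests on only n observations.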
The Normal distribution (~N)
If you measure a quantitative trait, most of the measurements will cluster near the
population mean (µ); as you consider values farther and farther from µ, individuals
exhibiting those values become rarer.
[Figure: bell-shaped curve of frequency of observation against observed value, centered on µ.]
• Some basic characteristics of this kind of distribution are:
1) The maximum value occurs at µ;
2) The dispersion is symmetric about µ (i.e. the mean, median, and mode of
the population are equal); and
3) The “tails” asymptotically approach zero.
A distribution which meets these basic criteria is known as a normal distribution.
• The following conditions tend to result in a normal distribution:
1) There are many factors which contribute to the observed value of the trait;
2) These many factors act independently of one another; and
3) The individual effects of the factors are additive and of comparable magnitude.
• Many biological and ecological variables are approximately normally distributed.
• The bell-shaped normal distribution is also known as a Gaussian curve, named after
Carl Friedrich Gauss, who worked out the formal mathematics:
$Z(Y) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{Y - \mu}{\sigma}\right)^2}$
Z(Y) is the height of the curve at a given observed value Y.
• The location and shape are uniquely determined by only two parameters, µ and σ².
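For readers who want to evaluate the curve numerically, here is a small sketch of Z(Y) in Python. The function name `normal_height` is hypothetical. `statistics.NormalDist` from the standard library is used only as a cross-check.

```python
import math
from statistics import NormalDist

def normal_height(y, mu=0.0, sigma=1.0):
    """Height Z(Y) of the normal curve with mean mu and standard deviation sigma."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z ** 2) / (sigma * math.sqrt(2 * math.pi))

peak_standard = normal_height(0.0)            # N(0,1) evaluated at its mean
peak_wide = normal_height(0.0, sigma=2.0)     # doubling sigma halves the peak height

# Cross-check against the standard library for an arbitrary mu, sigma, and y.
library_value = NormalDist(mu=1.0, sigma=2.0).pdf(2.5)
own_value = normal_height(2.5, mu=1.0, sigma=2.0)

print(peak_standard, peak_wide, own_value, library_value)
```

Note that the function takes the standard deviation σ, not the variance σ², as its second parameter.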
• If we set µ = 0 and σ² = 1, we obtain a standard normal curve [N(0,1)].
• By varying the value of µ, one can center Z(Y) anywhere on the x-axis.
• By varying σ², one can freely adjust the width of the central hump.
[Figure: three panels, Normal(0,1), Normal(1,1), and Normal(0,2), each plotting frequency against values from -5 to 5 (axis labeled Sigma).]
To convert any ~N into a standard N curve (µ = 0, σ = 1):
$Z_i = \dfrac{Y_i - \mu}{\sigma}$
where subtracting µ centers the curve at 0, and dividing by σ puts the variation in units of σ.
Location and scale transformation (when µ ≠ 0 and/or σ ≠ 1)
[Figure: two pairs of panels. Subtracting µ shifts N(1,1) to N(0,1); dividing by σ rescales N(0,2) to N(0,1). Together: Z = (Y - µ)/σ.]
The following percentages of items lie within the indicated limits:
µ ± σ contains 68.27% of the items
µ ± 2σ contains 95.45% of the items
µ ± 3σ contains 99.73% of the items
Conversely:
50% of the items fall between µ ± 0.674σ
95% of the items fall between µ ± 1.960σ
99% of the items fall between µ ± 2.576σ
[Figure: normal curve with the central 68.27%, 95.45%, and 99.73% areas marked.]
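The percentages and multipliers listed above can be reproduced from the standard normal curve. This is a sketch using `statistics.NormalDist` from Python's standard library. The dictionary names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1)

# Share of items within mu +/- k*sigma, for k = 1, 2, 3.
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

# Conversely: the multiplier of sigma whose central interval holds a given share.
half_width = {p: std_normal.inv_cdf(0.5 + p / 2) for p in (0.50, 0.95, 0.99)}

for k, share in coverage.items():
    print(f"mu +/- {k} sigma contains {100 * share:.2f}% of the items")
for p, k in half_width.items():
    print(f"{100 * p:.0f}% of the items fall within mu +/- {k:.3f} sigma")
```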
Q1: From a ~N population of finches with mean weight µ = 17.2 g and variance σ² = 36 g²,
what is the probability of randomly selecting an individual finch weighing more than 22 g?
Solution: To answer this, first convert the value 22 g to its corresponding normal score:
$Z_i = \dfrac{Y_i - \mu}{\sigma} = \dfrac{22\ \text{g} - 17.2\ \text{g}}{6\ \text{g}} = 0.8$
Table A14: 21.19% of the area lies to the right of Z = 0.8. Thus, 22 g is not an unusual
weight for a finch in this population (less than 1 SD from the mean).
Question: What is this area?
Or: P(Y≥22) = X
Answer:
P(Y≥22) = P(Z≥0.8) = 0.2119
[Figure: normal curve with a Y axis marked at 17.2 and 22.0 and the corresponding Z axis marked at 0 and 0.8.]
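The Q1 calculation can be checked without the table. The sketch below uses `statistics.NormalDist` in place of the table lookup and computes the upper-tail area both from the Z score and directly on the original scale.

```python
from statistics import NormalDist

mu, sigma = 17.2, 6.0   # sigma = sqrt(36 g^2) = 6 g
y = 22.0

z = (y - mu) / sigma                          # normal score of 22 g
p_upper = 1 - NormalDist().cdf(z)             # area to the right of z under N(0,1)
p_direct = 1 - NormalDist(mu, sigma).cdf(y)   # same area, without standardizing

print(f"Z = {z:.2f}, P(Y >= 22) = {p_upper:.4f}")
```

Both routes give the same probability, which is the point of the standardization.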
Q2: From the same population, what is the probability of randomly selecting a sample of 20
finches with an average weight of more than 22 g?
This question asks for the probability of selecting a sample with a certain average value.
For a sample of size n = 20, the appropriate distribution to consider is the normal distribution
of sample means for n = 20, with µ = 17.2 g and
$\sigma^2_{\bar{Y}(n=20)} = \dfrac{\sigma^2}{n} = \dfrac{36\ \text{g}^2}{20} = 1.8\ \text{g}^2$
so that $\sigma_{\bar{Y}(n=20)} = \sqrt{1.8\ \text{g}^2} \approx 1.34\ \text{g}$.
With this in mind, we proceed as before:
$Z = \dfrac{\bar{Y} - \mu}{\sigma_{\bar{Y}(n=20)}} = \dfrac{22\ \text{g} - 17.2\ \text{g}}{1.34\ \text{g}} \approx 3.6$
Table A14: only 0.02% of the area lies to the right of Z = 3.6 (only a 0.02% chance).
22 g is an extremely unusual mean weight for a sample of twenty finches in this
population (it is >3 SE from the mean!).
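The same check for Q2, as a sketch: the only change from the single-finch case is that the standard deviation of the mean, sqrt(σ²/n), replaces σ. Variable names are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma2, n = 17.2, 36.0, 20
ybar = 22.0

var_mean = sigma2 / n          # variance of the mean, in g^2
se = math.sqrt(var_mean)       # standard deviation of the mean, in g
z = (ybar - mu) / se           # normal score of a sample mean of 22 g
p_upper = 1 - NormalDist().cdf(z)

print(f"variance of mean = {var_mean:.1f}, SE = {se:.2f}, Z = {z:.2f}")
print(f"P(mean of 20 finches >= 22 g) = {100 * p_upper:.3f}%")
```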
One final word about the wide applicability of the normal distribution:
The central limit theorem states that, as sample size increases, the
distribution of sample means drawn from a population of any
distribution will approach a normal distribution with mean µ and
variance σ2/n.
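A quick simulation illustrates the theorem with a clearly non-normal population. The sketch below assumes an exponential population with rate 1 (mean 1, variance 1). The sample size, number of samples, and seed are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

n, r = 30, 5000
POP_MEAN, POP_VAR = 1.0, 1.0   # exponential(rate=1): strongly right-skewed

sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(r)
]

mean_of_means = statistics.fmean(sample_means)     # should be near mu
var_of_means = statistics.variance(sample_means)   # should be near sigma^2 / n

# If the means are roughly normal, about 95% fall within mu +/- 1.96 SE.
se = (POP_VAR / n) ** 0.5
within = sum(abs(m - POP_MEAN) <= 1.96 * se for m in sample_means) / r

print(f"mean of means = {mean_of_means:.3f}, variance of means = {var_of_means:.4f}")
print(f"share within mu +/- 1.96 SE = {within:.3f}")
```

With n = 30 the sample means of this skewed population already behave much like a normal distribution, although the approximation is slower for heavier-tailed populations.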
Use of the normal distribution table (page 612, Appendix A4)
For any value of Z, the table reports the area under the curve to the right of Z.
This area to the right of Z is the theoretical probability of randomly picking an
individual from N(0,1) whose value is greater than Z.
From the table:
P(Z ≥ 1.17) = 0.121 (probability inside the table)
If asked:
P(Z ≤ 1.17) = 1 - P(Z ≥ 1.17) = 0.879
P(0.42 ≤ Z ≤ 1.61) = P(Z ≥ 0.42) - P(Z ≥ 1.61) = 0.3372 - 0.0537 = 0.2835
P(-1.61 ≤ Z ≤ 0.42) = P(Z ≥ -1.61) - P(Z ≥ 0.42) = [1 - P(Z ≥ 1.61)] - P(Z ≥ 0.42) = [1 - 0.0537] - 0.3372 = 0.9463 - 0.3372 = 0.6091
P(|Z| ≥ 1.05) = 2 * P(Z ≥ 1.05) = 2 * 0.1469 = 0.2938
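The table manipulations above can be mirrored in code. In this sketch the hypothetical helper `upper_tail` plays the role of the table (area to the right of Z), computed with `statistics.NormalDist` instead of being looked up.

```python
from statistics import NormalDist

def upper_tail(z):
    """Area under N(0,1) to the right of z, which is what the table reports."""
    return 1 - NormalDist().cdf(z)

p_right = upper_tail(1.17)                                # P(Z >= 1.17)
p_left = 1 - upper_tail(1.17)                             # P(Z <= 1.17)
p_between_pos = upper_tail(0.42) - upper_tail(1.61)       # P(0.42 <= Z <= 1.61)
p_between_mixed = upper_tail(-1.61) - upper_tail(0.42)    # P(-1.61 <= Z <= 0.42)
p_two_sided = 2 * upper_tail(1.05)                        # P(|Z| >= 1.05)

for name, p in [("P(Z >= 1.17)", p_right), ("P(Z <= 1.17)", p_left),
                ("P(0.42 <= Z <= 1.61)", p_between_pos),
                ("P(-1.61 <= Z <= 0.42)", p_between_mixed),
                ("P(|Z| >= 1.05)", p_two_sided)]:
    print(f"{name} = {p:.4f}")
```

Small differences in the fourth decimal place relative to the table come from the table entries being rounded before they are subtracted.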
Normal probability plot (Q-Q plot) ST&D p. 566
A graphic tool for assessing normality.
14 malt extract values: 77.7, 76.0, 76.9, 74.6, 74.7, 76.5, 74.2, 75.4, 76.0, 76.0,
73.9, 77.4, 76.6, 77.3 (ST&D p. 30, Lab1). N = 14.
Divide the ~N curve into 14 intervals of equal area.
Normal line: y = a + bx, with slope b = s = 1.227 and intercept a = $\bar{Y}$ = 75.943.
$z = \dfrac{Y - \bar{Y}}{s} \;\Rightarrow\; Y = (z \cdot s) + \bar{Y}$; for z = 2: $(2 \times 1.227) + 75.943 = 78.4$
Shapiro-Wilk test for ~N
Correlation coefficient between the data and the normal scores.
W = 1: perfect ~N
W = 0.8: ~N?
SAS: PROC UNIVARIATE NORMAL;
Pr<W should be lower than 0.05 to reject normality.
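The ingredients of the normal probability plot can be computed by hand, as in the Python sketch below. Two assumptions are made here that the notes do not specify: the normal scores are taken at the midpoints (i - 0.5)/n of the 14 equal-area intervals, and the plain Pearson correlation is used as a rough stand-in for W. It is not the Shapiro-Wilk statistic that SAS reports.

```python
import math
from statistics import NormalDist, fmean, stdev

values = [77.7, 76.0, 76.9, 74.6, 74.7, 76.5, 74.2, 75.4, 76.0, 76.0,
          73.9, 77.4, 76.6, 77.3]
n = len(values)
ordered = sorted(values)

# Assumed plotting positions: midpoints of n equal-area intervals of N(0,1).
normal_scores = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

ybar, s = fmean(values), stdev(values)

def normal_line(z):
    """Expected Y at normal score z: intercept Y-bar, slope s."""
    return ybar + z * s

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(normal_scores, ordered)

print(f"normal line at z = 2: {normal_line(2):.1f}")
print(f"correlation between ordered data and normal scores: {r:.3f}")
```

Plotting `ordered` against `normal_scores` and overlaying `normal_line` gives the Q-Q plot: points close to the line, and a correlation close to 1, are consistent with normality.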