Week 1 Lecture: The Normal Distribution (Chapter 6)
This bell-shaped curve is probably the most common and well-recognized distribution in all of
biology. However, not all bell-shaped curves are normal. When a random variable is said to be
normally distributed, it has a mean µ and a standard deviation σ; i.e., X ~ N(µ,σ).
When X ~ N(µ,σ), then Z = (X − µ)/σ ~ N(0,1), which is called the "standard normal" curve.
The Z-value gives the number of standard deviations away from µ = 0. Table B.2 gives the
proportions of a normal distribution which lie beyond a specified Z-value. Because the normal
distribution is symmetrical, we can use the table to find different proportions for a random
variable.
For any normal distribution,
µ ± σ contains 68.3% of the population
µ ± 2σ contains 95.5% of the population
µ ± 3σ contains 99.7% of the population.
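These proportions can be verified numerically; a minimal sketch using Python's standard-library `statistics.NormalDist` (any N(µ,σ) gives the same proportions):

```python
# Check the empirical rule with the standard normal CDF.
from statistics import NormalDist

Z = NormalDist()  # standard normal, N(0, 1)
for k in (1, 2, 3):
    # Proportion within mu +/- k*sigma equals Phi(k) - Phi(-k).
    within = Z.cdf(k) - Z.cdf(-k)
    print(f"mu +/- {k}*sigma contains {within * 100:.2f}%")
```

The printed values (68.27%, 95.45%, 99.73%) match the rounded figures above.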
Example: X = a normal distribution of weights, with µ = 70 kg and σ = 10 kg. What if we want
to find the proportion of the distribution that is greater than 80 kg; i.e., P(X > 80)? First, we
need to compute Z = (X − µ)/σ = (80 − 70)/10 = 1.0. Then, we have
P(X > 80 kg) = P(Z > 1.0) = 0.1587 or 15.87%. This can also be thought of as the probability
of randomly sampling a weight, X, greater than 80 kg from a population with µ = 70 kg and
σ = 10 kg.
[Figure: normal curve with the axis marked at 60, 70, and 80 kg (Z = −1, 0, and 1) and the area beyond 80 kg shaded.]
What if we want to determine P(X > 60 kg)? Again, calculate Z = (60 − 70)/10 = −1.0. Then,
you have: P(X > 60 kg) = P(Z > −1.0) = 1 − 0.1587 = 0.8413 (you can do this because the
distribution is symmetrical).
[Figure: normal curve with X = 60 kg (Z = −1) marked and the area above it shaded.]
You can also find the proportion between two points:
P(55 < X < 65) = P(−1.5 < Z < −0.5) = (0.5 − 0.0668) − (0.5 − 0.3085) = 0.2417
[Figure: normal curve with the region between 55 and 65 kg shaded.]
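The three worked probabilities can be checked with `statistics.NormalDist` for the weight distribution X ~ N(70, 10); a minimal sketch:

```python
# Reproduce the three probability examples for X ~ N(70, 10).
from statistics import NormalDist

X = NormalDist(mu=70, sigma=10)

p_above_80 = 1 - X.cdf(80)          # P(X > 80) = P(Z > 1.0)
p_above_60 = 1 - X.cdf(60)          # P(X > 60) = P(Z > -1.0)
p_between = X.cdf(65) - X.cdf(55)   # P(55 < X < 65)

print(round(p_above_80, 4))  # ~0.1587
print(round(p_above_60, 4))  # ~0.8413
print(round(p_between, 4))   # ~0.2417
```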
What if we want the value of X that represents a specified percentage of the distribution? This
interval can be expressed as: µ ± Z*σ. Remember that the Z-value represents the number of
standard deviations from a mean of zero.
So, what is the value of X that encompasses the middle 50% (Z = ±0.67; look up the Z-value
that cuts off 25% in each tail)?
Xu = 70 + 0.67*10 = 76.7 kg
Xl = 70 − 0.67*10 = 63.3 kg.
What is the value of X that cuts off the upper 5% (Z = 1.645)?
Xu = 70 + 1.645*10 = 86.5 kg
What is the value of X that cuts off the lower 10% (Z = -1.28)?
Xl = 70 - 1.28*10 = 57.2 kg
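These table lookups can be inverted directly in code: `inv_cdf` returns the X that cuts off a given lower-tail proportion, so no separate Z step is needed. A sketch for the same distribution:

```python
# Invert the normal CDF to find cutoff values of X for X ~ N(70, 10).
from statistics import NormalDist

X = NormalDist(mu=70, sigma=10)

middle_50 = (X.inv_cdf(0.25), X.inv_cdf(0.75))  # ~63.3 to ~76.7 kg
upper_5 = X.inv_cdf(0.95)                       # ~86.4 kg cuts off the upper 5%
lower_10 = X.inv_cdf(0.10)                      # ~57.2 kg cuts off the lower 10%
print(middle_50, upper_5, lower_10)
```

The small differences from the hand calculations (e.g., 86.45 vs. 86.5) come from using the exact Z rather than the rounded table value.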
Central Limit Theorem
The CLT is concerned with a sample mean, X̄ = (Σxᵢ)/n (summed over i = 1, …, n). For any n
drawn from a normal population, X̄ has exactly a normal distribution with a mean = µ and a
standard deviation = σ/√n; i.e., X̄ ~ N(µ, σ/√n). If the parent population is not normally
distributed and n is large, then X̄ ~ N(µ, σ/√n).
Thus, for any X̄:
• E(X̄) = µ
• Var(X̄) = σ²/n, so √(σ²/n) = σ/√n = standard error of the sample mean = SE.
If X is normally distributed (X ~ N(µ,σ)), then Z = (X̄ − µ)/(σ/√n) ~ N(0,1), for any n.
If X is some other distribution, then Z = (X̄ − µ)/(σ/√n) tends towards a N(0,1) as n gets large.
Note the difference between the individual X and the sample mean X̄:
1. X ~ N(µ,σ) implies: Z = (X − µ)/σ ~ N(0,1).
2. For X̄: Z = (X̄ − µ)/(σ/√n) is ~N(0,1) as n gets large, and is exactly ~N(0,1) if X ~ N(µ,σ).
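The CLT claims above can be checked by simulation; a sketch using a uniform (non-normal) parent population, where the sample means should still center on µ with spread close to σ/√n:

```python
# Simulate sample means from a Uniform(0, 1) parent to illustrate the CLT.
import math
import random

random.seed(1)
n, reps = 30, 20000
# Uniform(0, 1) parent: mu = 0.5, sigma = sqrt(1/12)
mu, sigma = 0.5, math.sqrt(1 / 12)

means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]
grand_mean = sum(means) / reps
se_observed = math.sqrt(sum((m - grand_mean) ** 2 for m in means) / reps)

print(grand_mean)   # close to mu = 0.5
print(se_observed)  # close to sigma / sqrt(n), about 0.0527
```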
Introducing Statistical Hypothesis Testing
Statistics is largely concerned with making inferences about a population from a sample
collected from that population. Most often, the inferences are made on one or more population
means. Statistical inferences are constructed on a framework that includes a null hypothesis and
an alternative or research hypothesis. The null hypothesis represents a condition of "no change,"
"no difference," or "equality," while the alternative hypothesis represents the condition that is
true if the null hypothesis is false. The null and alternative hypotheses should be stated a priori.

When testing hypotheses concerning population means, sample means are calculated from
randomly sampled data collected within the population. Then, the probability of that sample
mean occurring, given that the null hypothesis is true, is calculated. This probability can be
graphically represented by the proportion of the area under the curve, like we calculated earlier.
This calculated probability is then compared to an objective criterion for drawing a statistical
conclusion about our sample mean. This comparison is stated in a "decision rule." Thus, our
criterion is used to "reject" or "not reject" the null hypothesis. This is an important concept, in
that the calculated probability represents the sample data given that the null hypothesis is true,
and NOT that the null hypothesis is true given the data. Thus, we never "accept" the null
hypothesis as being true; we only "fail to reject" or "do not reject" the null hypothesis. This is a
philosophy of "falsification," largely codified by Karl Popper. The reason we can never "accept"
a null hypothesis is that in statistics we never actually have a complete accounting of a
population.

In some instances, our sample mean may actually be an extreme occurrence that creates an error
in our conclusion of the statistical test. There are two types of errors that occur in statistical
hypothesis testing: 1) Type I Error and 2) Type II Error. A Type I Error occurs when we reject a
null hypothesis that is in fact true, and a Type II Error occurs when we do not reject a null
hypothesis that is in fact false. As the researcher, you set the levels of Type I and Type II Error
that you are willing to accept. The Type I Error rate, or "significance level" or "alpha level," is
often set at 0.05 (5%), while the Type II Error rate, or "beta level" (1 − power), is often set at
0.10. These values are arbitrary and can be changed. The idea of "statistical power" will be
discussed in more detail later, but in summary it tells you about the reliability of your statistical
test for making a conclusion about the sample mean. You typically increase the power of a test
by increasing the sample size.

Another point about hypotheses is that they can be directional; you can specify which tail of the
distribution (or both) you are testing. This is called "one-tailed" versus "two-tailed" testing, and
it is important that you correctly state your hypotheses to reflect which tail (or both tails) you are
considering. Zar provides an excellent treatment of this topic in section 6.3; it may very well be
the best section in the book, in my opinion!
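The decision rule can be sketched as a simple one-tailed z-test on the earlier weight example; the sample mean and sample size below are hypothetical, chosen only to illustrate the mechanics:

```python
# A minimal sketch of a one-tailed z-test decision rule.
# Ho: mu = 70 vs Ha: mu > 70, alpha = 0.05, known sigma.
from statistics import NormalDist

mu0, sigma = 70, 10   # hypothesized mean and (known) population SD
xbar, n = 74.0, 25    # hypothetical sample mean and sample size

z = (xbar - mu0) / (sigma / n ** 0.5)  # Z = (X-bar - mu) / (sigma / sqrt(n))
p_value = 1 - NormalDist().cdf(z)      # one-tailed upper-tail probability

alpha = 0.05
decision = "reject Ho" if p_value < alpha else "do not reject Ho"
print(z, p_value, decision)
```

Here z = 2.0 and the p-value is about 0.023, so at α = 0.05 the decision rule says reject Ho; note the p-value is the probability of data this extreme given Ho, not the probability that Ho is true.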
Testing for Departures from Normality
The Chi-square and Kolmogorov-Smirnov (KS) goodness-of-fit tests can be used to assess
whether or not a sample came from a normal population, though Zar does not recommend these
methods because they have low power. Graphical methods can be used to visually assess
whether the sample appears to be from
a normal population. Shapiro and Wilk developed a test for normality by calculating a “W”
statistic. This is one of the normality tests provided in SAS. Let’s do an example in SAS to show
how you can go about testing for normality.
Example: We want to test if total height (in feet) in two samples of Appalachian oaks are
normally distributed. We will use the PROC UNIVARIATE routine in SAS to calculate the
Shapiro-Wilk “W” statistic and test the hypotheses:
Ho: The white oak sample comes from a normal distribution.
Ho: The red oak sample comes from a normal distribution.
Ha’s: not Ho.
α = 0.05
From SAS, we get:
• W (white oaks) = 0.96 (p = 0.24) >>> do not reject Ho.
• W (red oaks) = 0.97 (p = 0.44) >>> do not reject Ho.
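Outside SAS, the same Shapiro-Wilk W test is available in Python via scipy.stats.shapiro; a sketch using simulated heights (the oak data themselves are not reproduced here):

```python
# Shapiro-Wilk normality test in Python; the heights are simulated,
# not the actual oak sample from the SAS example.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
heights = rng.normal(loc=80, scale=12, size=40)  # hypothetical tree heights (ft)

w, p = shapiro(heights)
print(f"W = {w:.3f}, p = {p:.3f}")
# Decision rule at alpha = 0.05: reject Ho (normality) only if p < 0.05.
```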