Stat 3000 – Statistics for Scientists and Engineers
Dr. Corcoran, Fall 2005
IV. The Normal Distribution
The normal distribution (a.k.a. the Gaussian distribution or “bell
curve”) is by far the best-known probability distribution. Its
discovery has had such a far-reaching impact on modeling
quantitative phenomena across the physical, social, and biological
sciences that its namesake, Carl Friedrich Gauss, even appeared on a major
currency before the euro (the German ten-mark note).
Utility of the Normal Distribution
The normal distribution has such broad applicability in part
because
• phenomena in the natural world that result from the
interaction of many environmental and genetic factors tend
to follow the normal distribution (e.g., height, weight,
measurable intelligence).
• the sums and averages of random samples have distributions
that look roughly normal – as the sample size gets larger,
the normal approximation gets better. This result is known
as the Central Limit Theorem. As we will discuss later,
this applies even to samples of categorical variables!
Characteristics of the Normal Distribution
[Figure: normal density curve centered at µ, with the intervals within σ, 2σ, and 3σ of the mean marked.]
• Symmetric (about the mean), unimodal, bell-shaped. If X ~ N(µ, σ²), where µ is
the mean and σ² is the variance, then the density function of X is given by
$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}, \quad -\infty < x < \infty.$$
• Of the subjects in a normally distributed population, 68.3% lie within one standard
deviation of the mean, 95.4% lie within 2 s.d.’s, and 99.7% lie within 3 s.d.’s.
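These percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from Python's standard library; the choice of Python and the helper name `within` are ours, not part of the course materials.

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)  # standard normal distribution


def within(k: float) -> float:
    """Return P(|Z| < k): the proportion within k standard deviations of the mean."""
    return Z.cdf(k) - Z.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} s.d.: {within(k):.3f}")
```

Because any normal variable can be converted to standard units, the same proportions hold for every choice of µ and σ.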
Computing Probabilities using the Normal Distribution
Recall that for a continuous random variable X with probability density function
(pdf) f(x), the value f(x) is not the probability P(X = x). That is, the pdf does not yield
probabilities as a discrete probability mass function does. In fact, for a
continuous random variable X, P(X = x) = 0 for every x.
However, we can compute probabilities over intervals of X – that is, the
probability that X lies between two numbers a and b is equal to the area under the
density curve between a and b, for example:
[Figure: density curve with the area between a and b shaded.]
Computing Probabilities using the Normal Distribution
To this point, we have computed areas under a density curve by using
integration. However, since the normal density (i) cannot be
integrated in closed form and (ii) is used by researchers with access to
modern computing tools, probabilities based on the normal
distribution can be obtained using tables or computer software.
A normal probability table looks something like what is shown on the
last pages of this handout (reproduced from pages 881-882 of your
text). Such a table is based on the standard normal distribution, or
the normal distribution with zero mean and variance of 1.
Using this table, what is the probability that a randomly sampled
N(0,1) variable is less than 1.34? Less than –0.28? Between –2.54
and 1.68? For what x does P(Z < x) = 0.975?
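If software is used in place of the table, the four questions above might be answered as follows. This is a minimal sketch with Python's standard-library `statistics.NormalDist`; the variable names are our own.

```python
from statistics import NormalDist

Z = NormalDist()  # defaults to mean 0 and standard deviation 1

p_below_134 = Z.cdf(1.34)                # P(Z < 1.34)
p_below_neg028 = Z.cdf(-0.28)            # P(Z < -0.28)
p_between = Z.cdf(1.68) - Z.cdf(-2.54)   # P(-2.54 < Z < 1.68)
x_975 = Z.inv_cdf(0.975)                 # the x with P(Z < x) = 0.975

print(round(p_below_134, 4), round(p_below_neg028, 4),
      round(p_between, 4), round(x_975, 2))
```

Reading the table by hand should give the same values up to rounding in the last digit.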
Standardizing a N(µ, σ²) Random Variable
How do we compute probabilities for a normal distribution with
arbitrary mean and variance using a standard normal table?
Another useful property of the normal distribution is that if
X ~ N(µ, σ²), then any linear function of X is also normally
distributed. That is, if Y = a + bX for arbitrary constants a
and b (with b ≠ 0), then Y ~ N(a + bµ, b²σ²).
If we define Z = (X – µ)/σ, then (using the notation above)
a = –µ/σ, and b = 1/σ, so that Z ~ N(0,1). Computing Z is called
“standardizing” X. Once we’ve converted X into standard units,
we can compute probabilities over intervals of X by using the
standard normal – or “Z” – distribution.
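Written out with the linear-function rule, the standardization works as follows. Here Φ denotes the standard normal CDF and c < d are arbitrary interval endpoints; both symbols are our additions, not notation from the slide.

```latex
\begin{aligned}
E[Z] &= a + b\mu = -\frac{\mu}{\sigma} + \frac{\mu}{\sigma} = 0, \\
\operatorname{Var}(Z) &= b^2\sigma^2 = \frac{\sigma^2}{\sigma^2} = 1, \\
P(c < X < d) &= P\!\left(\frac{c-\mu}{\sigma} < Z < \frac{d-\mu}{\sigma}\right)
  = \Phi\!\left(\frac{d-\mu}{\sigma}\right) - \Phi\!\left(\frac{c-\mu}{\sigma}\right).
\end{aligned}
```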
Example IV.A
Data from a study of king crabs on Kodiak Island, AK, (carried
out by the Alaska Department of Fish and Game) show that male
crab length is normally distributed with a mean of 134.7 mm and a
standard deviation of 25.5 mm.
What proportion of the male crab population on Kodiak Island is
less than 140 mm? What proportion is between 100 and 140 mm?
What is the probability that a randomly selected male crab will
measure at least 170 mm?
What is the 75th percentile of this population? The 99th percentile?
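One way to work this example in software is to standardize each length and then use the standard normal distribution. The sketch below assumes Python's standard-library `statistics.NormalDist`; the helper `z_score` and the variable names are ours.

```python
from statistics import NormalDist

MU, SIGMA = 134.7, 25.5   # mean and standard deviation in mm, from the example
Z = NormalDist()          # standard normal distribution


def z_score(x: float) -> float:
    """Convert a crab length in mm to standard units."""
    return (x - MU) / SIGMA


p_less_140 = Z.cdf(z_score(140))
p_100_to_140 = Z.cdf(z_score(140)) - Z.cdf(z_score(100))
p_at_least_170 = 1 - Z.cdf(z_score(170))

# Percentiles: find the standard-normal quantile, then convert back to mm.
pct_75 = MU + SIGMA * Z.inv_cdf(0.75)
pct_99 = MU + SIGMA * Z.inv_cdf(0.99)

print(f"P(X < 140)       = {p_less_140:.3f}")
print(f"P(100 < X < 140) = {p_100_to_140:.3f}")
print(f"P(X >= 170)      = {p_at_least_170:.3f}")
print(f"75th percentile  = {pct_75:.1f} mm")
print(f"99th percentile  = {pct_99:.1f} mm")
```

Table-based answers may differ slightly, because a table requires rounding each z value to two decimal places.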
Sums of Normally Distributed Random Variables
Yet another interesting feature of the normal distribution is that
sums of normally distributed independent variables are also
normally distributed.
Suppose we have two independent random variables X₁ and X₂
such that X₁ ~ N(µ₁, σ₁²) and X₂ ~ N(µ₂, σ₂²), and we define
Y = c₁X₁ + c₂X₂, where c₁ and c₂ are constants. Then
Y ~ N(c₁µ₁ + c₂µ₂, c₁²σ₁² + c₂²σ₂²).
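The rule can be illustrated with `statistics.NormalDist`, whose arithmetic operators treat the two operands as independent normal variables. The numbers below (means 10 and 4, standard deviations 3 and 4, constants 2 and 3) are made up for illustration.

```python
from statistics import NormalDist

X1 = NormalDist(mu=10, sigma=3)   # hypothetical X1 ~ N(10, 3^2)
X2 = NormalDist(mu=4, sigma=4)    # hypothetical X2 ~ N(4, 4^2)
c1, c2 = 2, 3                     # hypothetical constants

# NormalDist supports scaling by a constant and adding independent normals.
Y = c1 * X1 + c2 * X2

# The same parameters computed directly from the formula on the slide.
expected_mean = c1 * X1.mean + c2 * X2.mean
expected_var = c1**2 * X1.variance + c2**2 * X2.variance

print(Y.mean, Y.variance)
print(expected_mean, expected_var)
```

Both lines should report a mean of 32 and a variance of 180, up to floating-point rounding.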
Distribution of the Sample Mean
Recall from our discussion of random variables that if we sample n
subjects X₁,…,Xₙ at random from a population with an underlying
expected value of µ and variance of σ², then the expectation of the
distribution of the sample mean X̄ is µ, and the variance of X̄
is σ²/n.
From the previous slide, we can see further that if the sample
comes from a normally distributed population then
X̄ ~ N(µ, σ²/n).
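The step can be made explicit by viewing the sample mean as a linear combination in which every coefficient equals 1/n. This extends the two-variable rule for sums to n independent variables, an extension we assume here rather than prove.

```latex
\bar{X} = \sum_{i=1}^{n} \frac{1}{n} X_i,
\qquad X_i \sim N(\mu, \sigma^2) \text{ independent}
\;\Longrightarrow\;
\bar{X} \sim N\!\left(\sum_{i=1}^{n} \frac{\mu}{n},\ \sum_{i=1}^{n} \frac{\sigma^2}{n^2}\right)
= N\!\left(\mu, \frac{\sigma^2}{n}\right).
```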
Example IV.B
Consider again the population of Kodiak crabs discussed in
Example IV.A. Suppose that we randomly sample 20 specimens
from this population. What is the probability that the sample mean
will lie between 124.7 and 144.7 mm?
Compute an interval centered at the mean µ such that a sample
average of 20 male crabs will lie within that interval with 95%
probability. What sample size is required to reduce the total width
of this interval to 20 mm?
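A possible software solution treats the sample mean as a normal variable with standard deviation σ/√n. The sketch assumes Python's standard-library `statistics.NormalDist`; the variable names are ours.

```python
import math
from statistics import NormalDist

MU, SIGMA, N = 134.7, 25.5, 20   # population mean and s.d. in mm, sample size

# Distribution of the sample mean of N crabs: N(MU, SIGMA^2 / N).
xbar = NormalDist(MU, SIGMA / math.sqrt(N))
p_within_10 = xbar.cdf(144.7) - xbar.cdf(124.7)

# Interval centered at MU that contains the sample mean with 95% probability.
z = NormalDist().inv_cdf(0.975)
half_width = z * xbar.stdev
interval = (MU - half_width, MU + half_width)

# Total width is 2 * z * SIGMA / sqrt(n); solve for the smallest n with width <= 20 mm.
n_required = math.ceil((2 * z * SIGMA / 20) ** 2)

print(f"P(124.7 < mean < 144.7) = {p_within_10:.3f}")
print(f"95% interval: ({interval[0]:.1f}, {interval[1]:.1f}) mm")
print(f"sample size for total width 20 mm: {n_required}")
```

The required sample size is rounded up because n must be a whole number of crabs.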
The Central Limit Theorem
Suppose that we have a sample X₁, X₂,…, Xₙ from some distribution
with mean µ and variance σ². If n is sufficiently large, then the
sample mean X̄ is approximately N(µ, σ²/n). This is true even if the underlying
population is not normal, and the approximation improves as n grows.
We refer to this result as the Central Limit
Theorem, or CLT. It is one of the most remarkable
results in mathematical statistics.
The CLT applies even to samples from some categorical
distributions, including the binomial and Poisson distributions.
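A small simulation can make the claim concrete. The sketch below draws repeated samples from an exponential population, which is strongly right-skewed and has mean 1 and variance 1. The population, the sample size of 40, the number of repetitions, and the seed are all our own illustrative choices.

```python
import random
from statistics import fmean, pvariance

random.seed(2005)       # fixed seed so the run is reproducible
n, reps = 40, 5000      # sample size and number of simulated samples

# Each entry is the mean of one sample of n exponential(1) draws.
sample_means = [fmean(random.expovariate(1.0) for _ in range(n))
                for _ in range(reps)]

mean_of_means = fmean(sample_means)     # CLT predicts about mu = 1
var_of_means = pvariance(sample_means)  # CLT predicts about sigma^2 / n = 0.025

# If the sample mean is roughly normal, about 95% of the simulated means
# should fall within 1.96 standard errors of mu.
se = (1.0 / n) ** 0.5
share_within = sum(abs(m - 1.0) < 1.96 * se for m in sample_means) / reps

print(f"mean of sample means:     {mean_of_means:.3f}")
print(f"variance of sample means: {var_of_means:.4f}")
print(f"share within 1.96 s.e.:   {share_within:.3f}")
```

A histogram of `sample_means` would look far closer to a bell curve than the exponential population it came from.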
Example IV.C
Let’s carry out an experiment in order to demonstrate the power of
the Central Limit Theorem. Suppose we have a random variable
Xᵢ that represents the ith flip of a coin, for i = 1,…,n. We will
assume Xᵢ = 1 for heads and Xᵢ = 0 for tails. The mass function for
Xᵢ is given by
x            0    1
P(Xᵢ = x)    ½    ½
Hence, in this case µ = ½ and σ² = ½(1 – ½) = ¼. Note that the
sample mean X̄ is just the proportion of flips out of n tries that
turn up heads.
Example IV.C, cont’d
Note further that the underlying distribution is not at all normal:
it’s a binary distribution with just two points of probability mass at
zero and one. The density of the normal distribution is continuous,
unimodal, bell-shaped, symmetric, and has a domain over the entire
real line.
However, the CLT claims that if n is sufficiently large, the
distribution of X̄ will be approximately normal.
What will the mean and variance of this distribution be?
Example IV.C, cont’d
Take a coin, and flip it 30 times. Record the results of the flips
in order!
What is the proportion of your first 10 flips that were heads?
What is the proportion of the first 20 flips (the first 10 combined
with the second 10) that were heads?
What is the proportion of all 30 that were heads?
Example IV.C, cont’d
Plot below the distribution of sample proportions for the whole
class, with n = 10:
[Blank axis for plotting: sample proportion, from 0.0 to 1.0 in steps of 0.1.]
Example IV.C, cont’d
Plot below the distribution of sample proportions for the whole
class, with n = 20:
[Blank axis for plotting: sample proportion, bins centered at 0.05, 0.15, …, 0.95.]
Example IV.C, cont’d
Plot below the distribution of sample proportions for the whole
class, with n = 30:
[Blank axis for plotting: sample proportion, bins centered at 0.05, 0.15, …, 0.95.]
Example IV.C, cont’d
Answer the following questions:
How does the shape of the distribution of the sample proportion
change as n gets larger?
What is the estimated mean (for the whole class) of the distribution
of the sample proportion X̄ when n = 10? When n = 20? When
n = 30?
What is the estimated variance (for the whole class) of the
distribution of the sample proportion X̄ when n = 10? When
n = 20? When n = 30?
Compare the estimated means and variances to our predictions.
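For readers without a classroom at hand, the experiment can be imitated in software. The sketch below simulates a hypothetical class of 40 students (the class size and seed are our assumptions) and compares the estimated mean and variance of the sample proportion with the predicted values µ = ½ and σ²/n = ¼/n.

```python
import random
from statistics import fmean, pvariance

random.seed(3000)   # fixed seed so the run is reproducible
CLASS_SIZE = 40     # hypothetical number of students

# Each student flips a fair coin 30 times, recorded in order (1 = heads, 0 = tails).
flips = [[random.randint(0, 1) for _ in range(30)] for _ in range(CLASS_SIZE)]

results = {}
for n in (10, 20, 30):
    # Proportion of heads among each student's first n flips.
    proportions = [fmean(student[:n]) for student in flips]
    estimated_mean = fmean(proportions)
    estimated_var = pvariance(proportions)
    predicted_var = 0.25 / n          # sigma^2 / n with sigma^2 = 1/4
    results[n] = (estimated_mean, estimated_var, predicted_var)
    print(f"n = {n}: mean {estimated_mean:.3f} (predicted 0.500), "
          f"variance {estimated_var:.4f} (predicted {predicted_var:.4f})")
```

With only 40 students the estimated variances are noisy, so they will track the predictions only roughly.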
Approximating the Binomial and Poisson Distributions
In light of the CLT, it’s not surprising that the normal distribution can provide a
fairly accurate approximation – under certain (not necessarily uncommon)
circumstances – of binomial and Poisson probabilities. Can you explain why?
For example, the plots below superimpose normal curves on binomial
distributions with different values of n and p. For what sorts of binomial
distributions will the normal distribution prove more accurate?
[Figure: three panels, each with a normal curve superimposed on a binomial distribution: n = 10, p = 0.50; n = 10, p = 0.10; n = 100, p = 0.10.]
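The accuracy question can also be explored numerically. The sketch below compares the exact binomial CDF with a normal CDF that has the same mean np and variance np(1 – p), for the three cases in the plots. Evaluating the normal CDF at k + 0.5 (a continuity correction) is our assumption; it is not mentioned on the slide.

```python
import math
from statistics import NormalDist


def max_cdf_error(n: int, p: float) -> float:
    """Largest gap between the exact binomial CDF and its normal approximation."""
    approx = NormalDist(n * p, math.sqrt(n * p * (1 - p)))
    exact_cdf = 0.0
    worst = 0.0
    for k in range(n + 1):
        # Accumulate P(X <= k) from the binomial mass function.
        exact_cdf += math.comb(n, k) * p**k * (1 - p) ** (n - k)
        # Continuity correction: compare with the normal CDF at k + 0.5.
        worst = max(worst, abs(exact_cdf - approx.cdf(k + 0.5)))
    return worst


cases = [(10, 0.50), (10, 0.10), (100, 0.10)]
errors = {case: max_cdf_error(*case) for case in cases}
for (n, p), err in errors.items():
    print(f"n = {n:3d}, p = {p:.2f}: max CDF error = {err:.4f}")
```

The error is smallest when the binomial distribution is symmetric (p near ½) and shrinks as n grows for a fixed p.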
Example IV.D
What are the mean and variance of the number of sunbathing
lizards in Example III.E? Use a normal approximation to
compute P(X ≥ 10), where X represents the number of lizards
observed in the sun out of the 60 lizards sampled.
Example IV.E
What are the mean and variance of the number of ship damage
incidents over a period of 5 years in Example III.H? Use a
normal approximation to compute the probability that a given ship
is damaged at least once during its next 5 years of service.
How accurate is the normal approximation in these last two
examples?
Resources on the Web
There are many applets available via the internet that demonstrate the Central
Limit Theorem. A simple example using dice is found at
http://www.amstat.org/publications/jse/v6n3/applets/CLT.html.
You can find a more interesting example at
http://www.ruf.rice.edu/~lane/stat_sim/index.html.
You can also easily access web-based CDF calculators for the normal
distribution, as well as for other distributions related to the normal (more on
those later). For example, this website computes probabilities for a variety of
common distributions:
http://www.stat.berkeley.edu/~stark/Java/Html/ProbCalc.htm.
The Chi-Square Distribution
A related distribution that we will use later this semester is the
chi-square distribution. If Z is a standard normal random variable, then
$Z^2$ is a chi-square random variable with 1 degree of freedom, denoted $\chi^2_1$.
The sum of n independent $\chi^2_1$ random variables follows a chi-square
distribution with n degrees of freedom, denoted by $\chi^2_n$.
A chi-square random variable has a range that is nonnegative, and its
distribution is positively skewed. For example, the pdf for the chi-square
distribution with five degrees of freedom looks something like this:
[Figure: pdf of the chi-square distribution with five degrees of freedom.]
The t Distribution
Another distribution related to the normal, and one upon which we
will rely heavily, is the t distribution. If Z is a standard normal
random variable, and $X^2$ is an independent $\chi^2_n$ random variable,
then the random variable
$$T = \frac{Z}{\sqrt{X^2/n}}$$
follows a t distribution with n degrees of freedom.
A t distribution actually looks quite similar to the standard normal
distribution: its mean is zero, and it is unimodal, bell-shaped, and
symmetric. One distinction is that the variability of the t
distribution is slightly greater than that of the Z distribution. As n gets very
large, however, the t distribution converges to (i.e., is nearly
indistinguishable from) a Z distribution.
The F Distribution
The third related distribution that we will use is the F distribution. If U
and V are independent $\chi^2_n$ and $\chi^2_m$ random variables, respectively, then
the variable
$$F = \frac{U/n}{V/m}$$
follows an F distribution with n and m degrees of freedom. We denote
this distribution by $F_{n,m}$. The F distribution has a range that is
nonnegative, and its distribution is positively skewed. For example, the
pdf for the $F_{5,10}$ distribution looks something like this:
[Figure: pdf of the $F_{5,10}$ distribution.]
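The chi-square, t, and F constructions can all be imitated by building each variable directly from standard normal draws. The sketch below is a simulation under our own choices of degrees of freedom, repetition count, and seed; the function names are hypothetical.

```python
import math
import random
from statistics import fmean, pvariance

random.seed(4)   # fixed seed so the run is reproducible
REPS = 20000     # number of simulated values per distribution


def chi_square(df: int) -> float:
    """Sum of df squared independent standard normal draws."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))


def t_draw(df: int) -> float:
    """Standard normal divided by the root of an independent chi-square over its df."""
    return random.gauss(0.0, 1.0) / math.sqrt(chi_square(df) / df)


def f_draw(n: int, m: int) -> float:
    """Ratio of two independent chi-squares, each divided by its df."""
    return (chi_square(n) / n) / (chi_square(m) / m)


chi5 = [chi_square(5) for _ in range(REPS)]
t10 = [t_draw(10) for _ in range(REPS)]
f5_10 = [f_draw(5, 10) for _ in range(REPS)]

print(f"chi-square(5): min {min(chi5):.3f}, mean {fmean(chi5):.3f}")
print(f"t(10):         mean {fmean(t10):.3f}, variance {pvariance(t10):.3f}")
print(f"F(5,10):       min {min(f5_10):.3f}, mean {fmean(f5_10):.3f}")
```

The simulated chi-square and F values are never negative, and the t values have a variance somewhat above 1, in line with the remark that the t distribution is more variable than Z.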
Example IV.F
Note that we cannot tabulate the $\chi^2$, t, and F distributions in the
same way that we do for the Z distribution: there are an infinite
number of distributions in each of these families (as many as there
are values for the degrees of freedom).
Instead of areas under the curve, then, you are given tables in your
textbook that contain quantiles from a given $\chi^2$, t, or F distribution.
For example, the $\chi^2$ table in the back of your book looks something
like what you see on the following slide. Each row corresponds to
a value for the degrees of freedom, and each column corresponds to
a right tail area. Hence, the upper 95% quantile from the $\chi^2_6$
distribution (the value with 95% of the area to its right) is 1.635.
We denote this by $\chi^2_{0.05,6}$.
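The tabled value 1.635 can be checked without a table. For an even number of degrees of freedom the chi-square CDF has a closed form, which the sketch below uses; the function name is hypothetical, and the value 12.592 is included only as a second check point that we supply.

```python
import math


def chi2_cdf_even_df(x: float, df: int) -> float:
    """P(X <= x) for a chi-square variable with a positive even number of df.

    Uses the closed form 1 - exp(-x/2) * sum_{j < df/2} (x/2)^j / j!.
    """
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    if x <= 0:
        return 0.0
    half = x / 2
    terms = sum(half**j / math.factorial(j) for j in range(df // 2))
    return 1 - math.exp(-half) * terms


left_area = chi2_cdf_even_df(1.635, 6)
print(f"area to the left of 1.635:  {left_area:.4f}")
print(f"area to the right of 1.635: {1 - left_area:.4f}")
print(f"area to the right of 12.592: {1 - chi2_cdf_even_df(12.592, 6):.4f}")
```

For six degrees of freedom, about 5% of the area lies to the left of 1.635 and about 95% to its right, while 12.592 cuts off about 5% in the right tail.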
Chi-square Table
Example IV.F, cont’d
Find the values of $\chi^2_{0.025,20}$, $\chi^2_{0.10,11}$.
Find the values of $t_{0.05,15}$, $t_{0.025,30}$.
Find the values of $F_{0.05,5,10}$, $F_{0.10,9,20}$.
In Review…
Let’s take a breath and summarize some very important points.
We now have laid a foundation that allows us to describe and
analyze data.
From this point forward, we will focus on sampling data, and
making inferences about the underlying population based on that
sample. We denote our sample by X₁, X₂,…, Xₙ. The underlying
mean for this population is µ and the variance is σ².
In practice, we are often interested in µ and σ², although we don’t
know what they are. That’s why we’re gathering the data. We
will therefore focus much attention on inferring something about
population quantities (such as µ, for example) based on the
sampled data.
TAKE THESE FACTS WITH YOU!
1. X̄ is a random variable: its distribution has a mean of µ
and a variance of σ²/n.
2. If the underlying population is normally distributed, then X̄
is normally distributed.
3. Even if the underlying population is not normally
distributed, the Central Limit Theorem tells us that for
sufficiently large sample size n, X̄ will be approximately
normally distributed.
Standard Normal Table