Stat 5100 – Linear Regression and Time Series
Dr. Corcoran, Spring 2012
II. The Normal Distribution
The normal distribution (a.k.a. the Gaussian distribution or “bell curve”) is by far the best-known random distribution. Its discovery has had such a far-reaching impact on modeling quantitative phenomena across the physical, social, and biological sciences that its founder even found his way onto a major currency (before the Euro).
Utility of the Normal Distribution
The normal distribution has such broad applicability in part
because
• phenomena in the natural world that result from the interaction of many environmental and genetic factors tend to follow the normal distribution (e.g., height, weight, measurable intelligence).
• the sums and averages of random samples have distributions that look roughly normal – as the sample size gets larger, the normal approximation gets better. This result is known as the Central Limit Theorem. As we will discuss later, this applies even to samples of categorical variables!
Characteristics of the Normal Distribution
[Figure: the normal density curve, with the horizontal axis marked in units of σ about the mean.]
• Symmetric (about the mean), unimodal, bell-shaped. If X ~ N(μ, σ²) – where μ is the mean and σ² is the variance – then the density function of X is given by
f(x | μ, σ) = [1 / (σ√(2π))] exp{−(x − μ)² / (2σ²)}, for −∞ < x < ∞.
• Of the subjects in a normally distributed population, 68.3% lie within one standard
deviation of the mean, 95.4% lie within 2 s.d.’s, and 99.7% lie within 3 s.d.’s.
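These percentages can be verified with software; here is a minimal sketch using Python’s scipy.stats (not part of the original slides):

```python
from scipy.stats import norm

# P(|X - μ| < kσ) for a normal random variable equals, after standardizing,
# norm.cdf(k) - norm.cdf(-k) under the standard normal distribution.
for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))
# prints ≈ 0.683, 0.954, 0.997 for k = 1, 2, 3
```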
Computing Probabilities using the Normal Distribution
Recall that for a continuous random variable X with probability density function
(pdf) f(x), one cannot compute P(X = x). That is, the pdf does not yield
probabilities as does a discrete probability mass function. Technically, for a
continuous random variable X, P(X = x) = 0.
However, we can compute probabilities over intervals of X – that is, the
probability that X lies between two numbers a and b is equal to the area under the
density curve between a and b, for example:
[Figure: a density curve with the area under the curve between points a and b shaded.]
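In symbols, this is the area
P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx.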
Computing Probabilities using the Normal Distribution
To this point, we have computed areas under a density curve by using integration. However, the normal density (i) cannot be integrated in closed form and (ii) is used by researchers who have ready access to modern computing tools, so probabilities based on the normal distribution are obtained from tables or computer software.
A normal probability table looks something like what is shown in the handout (reproduced from the Kutner text for this course). Such a table is based on the standard normal distribution, or the normal distribution with zero mean and variance of 1.
Using this table, what is the probability that a randomly sampled
N(0,1) variable is less than 1.34? Less than –0.28? Between –2.54
and 1.68? For what x does P(Z < x) = 0.975?
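These table look-ups can also be checked with software; a minimal sketch using Python’s scipy.stats (not part of the original slides):

```python
from scipy.stats import norm

print(norm.cdf(1.34))                    # P(Z < 1.34)         ≈ 0.910
print(norm.cdf(-0.28))                   # P(Z < -0.28)        ≈ 0.390
print(norm.cdf(1.68) - norm.cdf(-2.54))  # P(-2.54 < Z < 1.68) ≈ 0.948
print(norm.ppf(0.975))                   # x with P(Z < x) = 0.975, ≈ 1.96
```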
Standardizing a N(μ, σ²) Random Variable
How do we compute probabilities for a normal distribution with
arbitrary mean and variance using a standard normal table?
Another unique aspect of the normal distribution is that if we have X ~ N(μ, σ²), then any linear function of X is also normally distributed. That is, if we have Y = a + bX for arbitrary constants a and b, then Y ~ N(a + bμ, b²σ²).
If we define Z = (X – μ)/σ, then (using the notation above) a = –μ/σ and b = 1/σ, so that Z ~ N(0,1). Computing Z is called “standardizing” X. Once we’ve converted X into standard units, we can compute probabilities over intervals of X by using the standard normal – or “Z” – distribution.
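As a quick illustration (the numbers below are arbitrary, not from the slides), standardizing by hand and letting software handle the mean and standard deviation give the same probability; a sketch in Python’s scipy.stats:

```python
from scipy.stats import norm

mu, sigma = 50.0, 10.0   # arbitrary illustrative values
x = 63.0

# Standardize, then use the standard normal CDF ...
z = (x - mu) / sigma
print(norm.cdf(z))                       # P(Z < 1.3) ≈ 0.903

# ... or work with the N(mu, sigma^2) distribution directly.
print(norm.cdf(x, loc=mu, scale=sigma))  # same value
```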
Example II.A
Data from a study of king crabs on Kodiak Island, AK (carried out by the Alaska Department of Fish and Game) show that male crab length is normally distributed with a mean of 134.7 mm and a standard deviation of 25.5 mm.
What proportion of the male crab population on Kodiak Island is less than 140 mm? What proportion is between 100 and 140 mm? What is the probability that a randomly selected male crab will measure at least 170 mm?
What is the 75th percentile of this population? The 99th percentile?
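For reference, these quantities could be computed in software as well as from the Z table; a sketch in Python’s scipy.stats (not part of the original slides):

```python
from scipy.stats import norm

mu, sigma = 134.7, 25.5  # male crab length, mm

print(norm.cdf(140, mu, sigma))                             # P(X < 140)       ≈ 0.58
print(norm.cdf(140, mu, sigma) - norm.cdf(100, mu, sigma))  # P(100 < X < 140) ≈ 0.50
print(1 - norm.cdf(170, mu, sigma))                         # P(X ≥ 170)       ≈ 0.08
print(norm.ppf(0.75, mu, sigma))                            # 75th percentile  ≈ 151.9 mm
print(norm.ppf(0.99, mu, sigma))                            # 99th percentile  ≈ 194.0 mm
```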
Sums of Normally Distributed Random Variables
Yet another interesting feature of the normal distribution is that
sums of normally distributed independent variables are also
normally distributed.
Suppose we have two independent random variables X₁ and X₂ such that X₁ ~ N(μ₁, σ₁²) and X₂ ~ N(μ₂, σ₂²), and we define Y such that Y = c₁X₁ + c₂X₂, where c₁ and c₂ are constants. Then
Y ~ N(c₁μ₁ + c₂μ₂, c₁²σ₁² + c₂²σ₂²).
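A quick simulation check of this fact (the particular constants below are arbitrary, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

mu1, sig1, mu2, sig2 = 2.0, 3.0, -1.0, 4.0   # arbitrary illustrative parameters
c1, c2 = 0.5, 2.0

y = c1 * rng.normal(mu1, sig1, n) + c2 * rng.normal(mu2, sig2, n)

print(y.mean())  # ≈ c1*mu1 + c2*mu2 = -1.0
print(y.var())   # ≈ c1²*sig1² + c2²*sig2² = 66.25
```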
Distribution of the Sample Mean
Recall from our discussion of random variables that if we sample n subjects X₁,…,Xₙ at random from a population with an underlying expected value of μ and variance of σ², then the expectation of the distribution of the sample mean X̄ is μ, and the variance of X̄ is σ²/n.
From the previous slide, we can see further that if the sample comes from a normally distributed population, then (since X̄ = (X₁ + ⋯ + Xₙ)/n is a linear combination of independent normal variables)
X̄ ~ N(μ, σ²/n).
Example II.B
Consider again the population of Kodiak crabs discussed in Example II.A. Suppose that we randomly sample 20 specimens from this population. What is the probability that the sample mean will lie between 124.7 and 144.7 mm?
Compute an interval centered at the mean μ such that a sample average of 20 male crabs will lie within that interval with 95% probability. What sample size is required to reduce the total width of this interval to 20 mm?
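A sketch of these calculations in Python’s scipy.stats (not part of the original slides):

```python
import numpy as np
from scipy.stats import norm

mu, sigma, n = 134.7, 25.5, 20
se = sigma / np.sqrt(n)   # standard deviation of the sample mean, ≈ 5.70 mm

# P(124.7 < sample mean < 144.7)
print(norm.cdf(144.7, mu, se) - norm.cdf(124.7, mu, se))  # ≈ 0.92

# Interval centered at mu that contains the sample mean with 95% probability
half = norm.ppf(0.975) * se
print(mu - half, mu + half)                               # ≈ (123.5, 145.9) mm

# Sample size giving a total interval width of 20 mm (half-width 10 mm)
print(np.ceil((norm.ppf(0.975) * sigma / 10) ** 2))       # ≈ 25
```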
The Central Limit Theorem
Suppose that we have a sample X₁, X₂, …, Xₙ from some distribution with mean μ and variance σ². If n is sufficiently large, then the sample mean X̄ ~ N(μ, σ²/n). This is true even if the underlying population is not normal – the approximation improves for relatively larger n. We refer to this result as the Central Limit Theorem, or CLT. It represents one of the most remarkable results in mathematical statistics.
The CLT applies even to samples from some categorical distributions (e.g., the binomial and Poisson distributions).
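A small simulation sketch (not from the slides) illustrating the CLT with a clearly non-normal population:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 30, 100_000

# Population: exponential with mean 1 (strongly right-skewed).
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print(means.mean())  # ≈ μ = 1
print(means.var())   # ≈ σ²/n = 1/30 ≈ 0.033
# A histogram of `means` is roughly bell-shaped even though the
# exponential population itself is far from normal.
```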
Resources on the Web
There are many applets available via the internet that demonstrate the Central
Limit Theorem. A simple example using dice is found at
http://www.amstat.org/publications/jse/v6n3/applets/CLT.html
You can find a more interesting example at
http://www.ruf.rice.edu/~lane/stat_sim/index.html
You can also easily access web-based CDF calculators for the normal distribution, as well as for other distributions related to the normal (more on those in a moment). For example, this website computes probabilities for a variety of common distributions:
http://www.stat.berkeley.edu/users/stark/Java/Html/ProbCalc.htm
The Chi-Square Distribution
A related distribution that we will use later on during this semester is the chi-square distribution. If Z is a standard normal random variable, then Z² is a chi-square random variable with 1 degree of freedom, denoted χ²₁.
The sum of n independent χ²₁ random variables follows a chi-square distribution with n degrees of freedom, denoted by χ²ₙ.
A chi-square random variable has a range that is nonnegative, and its distribution is positively skewed. For example, the pdf for the chi-square distribution with five degrees of freedom looks something like this:
[Figure: pdf of the χ²₅ distribution.]
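A simulation sketch (not from the slides) of the two facts above, using Python:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

# Z² for a standard normal Z behaves like a χ²₁ variable (mean 1).
z = rng.standard_normal(100_000)
print((z ** 2).mean())      # ≈ 1

# A sum of 5 independent χ²₁ variables behaves like χ²₅ (mean 5).
sums = (rng.standard_normal((100_000, 5)) ** 2).sum(axis=1)
print(sums.mean())          # ≈ 5

# scipy's chi2 object gives the same distribution directly, e.g. its median:
print(chi2.ppf(0.5, 5))     # ≈ 4.35
```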
The t Distribution
Another distribution related to the normal – and one upon which we will rely heavily – is the t distribution. If Z is a standard normal random variable, and X² is an independent χ²ₙ random variable, then the random variable
T = Z / √(X²/n)
follows a t distribution with n degrees of freedom.
A t distribution actually looks quite similar to the standard normal distribution: its mean is zero, and it is unimodal, bell-shaped, and symmetric. One distinction is that the variability of the t distribution is slightly greater than that of the Z distribution. As n gets very large, however, the t distribution converges to (i.e., is nearly indistinguishable from) a Z distribution.
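The convergence to the Z distribution can be seen numerically; a sketch using Python’s scipy.stats:

```python
from scipy.stats import norm, t

# The 97.5th percentile of the t distribution approaches the
# standard normal value (≈ 1.96) as the degrees of freedom grow.
for df in (5, 30, 100, 1000):
    print(df, t.ppf(0.975, df))   # 2.571, 2.042, 1.984, 1.962
print("Z", norm.ppf(0.975))       # 1.960
```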
The F Distribution
The third related distribution that we will use is the F distribution. If U and V are independent χ²ₙ and χ²ₘ random variables, respectively, then the variable
F = (U/n) / (V/m)
follows an F distribution with n and m degrees of freedom. We denote this distribution by Fₙ,ₘ. The F distribution has a range that is nonnegative, and its distribution is positively skewed. For example, the pdf for the F₅,₁₀ distribution looks something like this:
[Figure: pdf of the F₅,₁₀ distribution.]
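A simulation sketch (not from the slides) of building an F₅,₁₀ variable from independent chi-squares:

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(0)
n, m, reps = 5, 10, 100_000

u = rng.chisquare(n, reps)        # χ²ₙ draws
v = rng.chisquare(m, reps)        # independent χ²ₘ draws
ratios = (u / n) / (v / m)        # should behave like F with (n, m) df

print(np.quantile(ratios, 0.95))  # empirical 95th percentile
print(f.ppf(0.95, n, m))          # exact value, ≈ 3.33
```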
Example II.F
Note that we cannot tabulate the χ², t, and F distributions in the same way that we do for the Z distribution – there are an infinite number of distributions in each of these families (as many as there are values for the degrees of freedom).
Instead of areas under the curve, then, you are given tables in your textbook that contain quantiles from a given χ², t, or F distribution. For example, examine the χ² table in the back of your book (on page 663). Each row corresponds to a value for the degrees of freedom, and each column corresponds to a right tail area. Hence, the upper 95% quantile from the χ²₆ distribution is 1.635. We will denote this – consistent with the text – by χ²(0.05; 6).
Example II.F, cont’d
Find the values of χ²(0.975; 20) and χ²(0.90; 11).
Find the values of t(0.95;15) and t(0.975;30).
Find the values of F(0.95;5,10) and F(0.90;9,20).
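A sketch of these look-ups in Python’s scipy.stats, assuming (as in the χ²(0.05; 6) = 1.635 example above) that the first argument of each symbol is a cumulative probability:

```python
from scipy.stats import chi2, t, f

print(chi2.ppf(0.975, 20))  # χ²(0.975; 20) ≈ 34.17
print(chi2.ppf(0.90, 11))   # χ²(0.90; 11)  ≈ 17.28
print(t.ppf(0.95, 15))      # t(0.95; 15)   ≈ 1.753
print(t.ppf(0.975, 30))     # t(0.975; 30)  ≈ 2.042
print(f.ppf(0.95, 5, 10))   # F(0.95; 5,10) ≈ 3.33
print(f.ppf(0.90, 9, 20))   # F(0.90; 9,20) ≈ 1.96
```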
A few points of review:
Given a random sample X₁, …, Xₙ, with E(Xᵢ) = μ and Var(Xᵢ) = σ²:
1. X̄ is a random variable: its distribution has a mean of μ, and a variance of σ²/n.
2. If the underlying population is normally distributed, then X̄ is normally distributed.
3. Even if the underlying population is not normally distributed, the Central Limit Theorem tells us that for sufficiently large sample size n, X̄ will be approximately normally distributed.