§2.5 Probability Distributions
The topics in this section are interrelated and necessary for both course
objectives.
A probability distribution indicates how the probabilities are distributed
for outcomes (sample points).
There are billions of different distributions. However, as we learned in the
last section, there exist only two kinds of probability distributions: discrete
and continuous probability distributions. Discrete probability distributions
are for discrete outcomes, and continuous probability distributions are for
continuous outcomes.
Discrete Distributions
Examples of discrete probability distributions are discrete uniform
distributions, Bernoulli distributions, and binomial distributions.
A discrete uniform distribution is a probability distribution with equal
probability for every possible discrete outcome.
There is no restriction on the number of possible outcomes (other than that
there should be a finite number of them). If there are k outcomes, then
each outcome has the same probability 1/k. All outcomes have the same
probability, so they are uniform in terms of probability. Thus, the name
“uniform distribution” is used for this kind of probability distribution.
All the probabilities must be equal, but there is no restriction on the
outcomes other than that they should be discrete (and that there be only a
finite number of them).
They can be numerical or non-numerical. As a result, this distribution has a
sample space of the form, S = {s1, s2, …, sk}. The notation for a discrete
uniform distribution is Uniform{s1, s2, …, sk}. Note that the sample space is
used as part of the notation.
If X is a discrete uniform variable, then we write X~Uniform{s1, s2,…, sk},
which reads as “X is distributed as a discrete uniform distribution.”
Generally, the symbol tilde ~ means “distributed as” in Statistics.
An example of a discrete uniform variable is the number of dots up on a
balanced die after it is rolled. Let X be the number of dots up on a balanced
die. Then, X~Uniform{1, 2, 3, 4, 5, 6}. See the diagram of the uniform
probability distribution below.
[Diagram: the probability distribution of X~Uniform{1, 2, 3, 4, 5, 6}: six bars of height 1/6 at X = 1, 2, 3, 4, 5, and 6.]
In a graph like this one, generally, the probabilities are indicated by the
vertical bars on the sample points along the X-axis. The length of a bar
indicates the amount of probability (weight) for its sample point.
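If you want to see this distribution at work, here is a minimal Python sketch (an illustration only, using nothing but the standard library; it is not part of the course websites). It simulates many rolls of a balanced die and checks that each outcome's relative frequency settles near 1/6.

    import random
    from collections import Counter

    # Simulate rolls of a balanced die, X ~ Uniform{1, 2, 3, 4, 5, 6}.
    random.seed(1)  # fixed seed so the illustration is repeatable
    rolls = [random.randint(1, 6) for _ in range(60_000)]
    counts = Counter(rolls)
    for outcome in sorted(counts):
        # each relative frequency should be near 1/6 ≈ 0.1667
        print(outcome, counts[outcome] / len(rolls))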
Random variables in equally likely situations have uniform distributions. If
you are interested in equally likely situations and their related topics, such
as combinations and permutations, please read the optional appendix to this
chapter, Appendix: Counting & Probabilities (Optional).
A Bernoulli distribution is a probability distribution for only two possible
outcomes.
The probability distribution in the Coin Toss Example is an example of a
Bernoulli distribution. The name “Bernoulli” is from the Swiss mathematician
James Bernoulli, who established this probability distribution. For more on
James Bernoulli, check out the following website.
http://www.mhhe.com/math/calc/smithminton2e/cd/tools/timeline/bernoulli.html
If the two outcomes of a Bernoulli distribution have equal probabilities
(50% each), then it is also a discrete uniform distribution. There is no
restriction on the outcomes other than that there should be two and only two
outcomes. As a result, this distribution has a sample space of the form S =
{s1, s2}. The notation for a Bernoulli distribution is Bernoulli(p), where p is
the probability of one of the two outcomes (the other then has probability 1 − p).
Let X be a Bernoulli variable. Then, we write X~Bernoulli(p). See the
diagram of a Bernoulli probability distribution given below.
[Diagram: two bars along the X-axis, height p at s1 and height 1 − p at s2.]
As an example, if we let X be the number of T on one toss of a crooked coin
whose probability of landing T up is 66%, then X~Bernoulli(0.66). This X
has a sample space of S = {0, 1}. See its probability distribution given
below.
[Diagram: two bars, height 0.34 at X = 0 and height 0.66 at X = 1.]
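A quick way to convince yourself of this probability distribution is to simulate it. The following is a minimal Python sketch (an illustration, standard library only), where 1 stands for T up on the crooked coin.

    import random

    # 1 means the crooked coin lands T up (probability 0.66), 0 means H up.
    random.seed(1)  # fixed seed so the illustration is repeatable
    tosses = [1 if random.random() < 0.66 else 0 for _ in range(100_000)]
    print(sum(tosses) / len(tosses))  # relative frequency of 1, near 0.66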
A binomial distribution is a probability distribution for outcomes of whole
numbers with some fixed maximum (whole) number.
That is, a binomial distribution must be for the outcomes 0, 1, 2, …, k, where
k is the fixed maximum (whole) number, and k indicates the number of trials
for the binomial distribution. The outcomes are typically the number of
‘successes’ among k independent trials. If there is no success among the k
independent trials, then the outcome is 0. If there is one success among them,
then it is 1. If they are all successes, then it is k. A binomial
distribution gives the probability for each case (each number of successes).
As a result, this distribution has the sample space S = {0, 1, 2, …, k}.
There are three conditions (besides a fixed number of trials) for a binomial
distribution.
1. The k trials must be all independent of one another. That is, the
outcome of one trial does not affect the outcomes of the other trials.
2. Each trial must have only two possibilities, “success” and “failure.”
3. The probabilities for “success” must be the same for all the trials.
The success or failure of each trial can be anything. For instance, if the trial
is a coin toss, then the success could be getting H, which means the failure is
getting T. If the trial is rolling a die, then the success could be getting an
ace or a 5, which means that the failure is getting 2, 3, 4, or 6.
In fact, each of these trials is called a Bernoulli trial. If k = 1, then the
binomial distribution is a Bernoulli distribution for the outcomes 0 and 1.
Of course, if the probabilities for 0 and 1 are equal to each other, then a
Bernoulli distribution is a discrete uniform distribution. The binomial
distribution is an extremely important discrete distribution. As k (the
number of trials) increases, under certain conditions the binomial distribution
becomes very close to a Normal distribution, which is perhaps the most
important distribution.
The notation for a binomial distribution is Binomial(k, p) where k is the
number of trials, and p is the probability for the “success” on each trial. Let
X be the number of “successes” among k independent trials with probability
p for the “success” on every trial. Then, we write X~Binomial(k, p). The
probability distribution of Binomial(20, 0.3) is given below. The horizontal
axis is for X as for all other distributions.
This graph of the Binomial distribution is found at
http://planetmath.org/encyclopedia/BernoulliDistribution2.html
The following websites,
http://sites.csn.edu/mgreenwich/stat/binomial.htm
or
http://surfstat.anu.edu.au/surfstat-home/tables/binomial.php
compute probabilities and such for Binomial(k, p), where n is used for
k. They also compute µ and σ of binomial distributions, which are the next
topics.
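If you prefer to compute binomial probabilities directly rather than through a website, the standard binomial formula P(X = x) = C(k, x)·p^x·(1 − p)^(k − x), which the text does not state explicitly, can be evaluated in a few lines of Python (a sketch, standard library only):

    from math import comb

    # P(X = x) = C(k, x) * p**x * (1 - p)**(k - x) for X ~ Binomial(k, p)
    k, p = 20, 0.3
    pmf = {x: comb(k, x) * p**x * (1 - p) ** (k - x) for x in range(k + 1)}
    print(round(sum(pmf.values()), 6))  # total probability: 1.0
    print(max(pmf, key=pmf.get))        # most probable count of successes: 6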
Mean µ
Generally, the center of a probability distribution (for outcomes) is
expressed by µ because
µ is the pivotal point for the outcomes and their probability distribution.
When you have a bunch of weights (probabilities) on a horizontal rod (axis),
the pivotal point is the point on the rod where all the weights balance. That
is, it is the location where you can put a finger under the rod and balance the
rod with all the weights. In other words, it is the gravitational center of all
the weights along the rod. Here, the weights are probabilities and the rod is
the horizontal axis (the real number line). Namely,
µ is the gravitational center of the probabilities along the horizontal axis.
For instance, if the outcomes 0 and 1 have a discrete uniform distribution,
then a probability (weight) of 0.5 is at 0 on the real line (rod) and a
probability (weight) of 0.5 is at 1. Where is the pivotal point on the real line
(rod)? It must be 0.5, the midpoint of 0 and 1. It is the point where you can
balance these weights (probabilities) with one finger.
Suppose 0 and 1 have probabilities P({0}) = 0.3 and P({1}) = 0.7. Then, µ
cannot be 0.5 anymore since it has a heavier weight at 1. So, the pivotal
point µ must be closer to 1. In fact, it is 0.7. These two weights are
balanced at 0.7.
[Diagrams: two balance pictures along the X-axis. Left: weights P({0}) = 0.5 at 0 and P({1}) = 0.5 at 1 balance at µ = 0.5. Right: weights P({0}) = 0.3 at 0 and P({1}) = 0.7 at 1 balance at µ = 0.7.]
This mean µ can be computed mathematically as

    µ = Σ s·P(s), summing over every s ∈ S,

which means that µ is the sum of all the products of the sample points s and
their corresponding probabilities P(s), taken over every sample point s in
the sample space S. Simply put,
µ is the sum of all the products of each outcome and its probability.
For instance, if 0 and 1 have a discrete uniform distribution, then µ is
computed as
0*P({0}) + 1*P({1}) = 0*(0.5) + 1*(0.5) = 0.5.
If 0 and 1 have a Bernoulli distribution of P({0}) = 0.3 and P({1}) = 0.7,
then µ is computed as
0*P({0}) + 1*P({1}) = 0*(0.3) + 1*(0.7) = 0.7.
If there are four outcomes of 2, 4, 6, and 12 with the discrete uniform
distribution; that is, P({2}) = P({4}) = P({6}) = P({12}) = 1/4; then µ is
computed as
2*(1/4) + 4*(1/4) + 6*(1/4) + 12*(1/4) = 24/4 = 6.
If X~Uniform{1, 2, 3, 4, 5, 6}, then µ of this probability distribution is
computed as
1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6) = 3.5.
That is, µ = 3.5. Let us look at this point with a diagram of the probability
distribution below.
[Diagram: the Uniform{1, 2, 3, 4, 5, 6} distribution again, with the balance point µ = 3.5 marked on the x-axis between 3 and 4.]
Now, you see, by putting your finger at 3.5, you should be able to balance all
the weights (probabilities) along the x-axis.
Notice that µ must be between the minimum and maximum outcomes
(regardless of how their probabilities are distributed); that is, s1 ≤ µ ≤ sk,
since µ is a pivotal point on the real line with probabilities (weights) on the
outcomes. So, for instance, if you come up with 8 for µ of the outcomes 3,
4, 5, 6, and 7, then you know immediately that µ = 8 is incorrect.
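Since µ is just the sum of each outcome times its probability, the worked examples above can be checked with a few lines of Python (a sketch, standard library only; the dictionary-based mean function is an illustration, not anything from the text):

    # µ = Σ s·P(s) for a discrete distribution stored as {outcome: probability}
    def mean(dist):
        return sum(s * p for s, p in dist.items())

    print(mean({0: 0.5, 1: 0.5}))                       # 0.5
    print(mean({0: 0.3, 1: 0.7}))                       # 0.7
    print(mean({2: 0.25, 4: 0.25, 6: 0.25, 12: 0.25}))  # 6.0
    print(mean({s: 1 / 6 for s in range(1, 7)}))        # ≈ 3.5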
Again, µ is estimated by the sample average (sample mean) x̄. You can see
how this works with the example of the Bernoulli distribution with P({0}) =
0.3 and P({1}) = 0.7. The value of µ is computed as

    µ = 0*P({0}) + 1*P({1}) = 0*(0.3) + 1*(0.7) = 0.7.    (1)

Suppose that data of size 10 are taken from this random system (more
specifically, random variable) and that the data are

    {1, 0, 0, 1, 1, 0, 1, 1, 1, 1}

with three 0’s and seven 1’s since P({0}) = 0.3 and P({1}) = 0.7. The
sample average is computed, using its formula, as

    x̄ = (1+0+0+1+1+0+1+1+1+1)/10
      = (0+0+0+1+1+1+1+1+1+1)/10
      = (0+0+0)/10 + (1+1+1+1+1+1+1)/10
      = 3(0)/10 + 7(1)/10
      = 0*(3/10) + 1*(7/10)
      = 0*(0.3) + 1*(0.7) = 0.7, which estimates µ = 0.7.    (2)
Compare (1) for µ and (2) for x̄. The 3/10 and 7/10 in (2) are the
estimates for 0.3 = P({0}) and 0.7 = P({1}) in (1), respectively. I hope,
comparing (1) and (2), you see why µ can be and should be estimated by x̄.
This is true for any probability distribution.
With this particular example, the estimates of the probabilities are
identical to the true probabilities. As a result, the estimate of µ happens
to be identical to the true value of µ. With other data, they might not
be identical to each other since, in practice, you might not get exactly three
0’s and seven 1’s out of data of size 10.
However, the numbers of 0’s and 1’s should be close to 3 and 7 if data of size
10 are taken. If data of size 100 are taken, then the number of 0’s should be
close to 30 and that of 1’s close to 70. As a result, 0 is multiplied by a
number close to 3/10 (= 30/100) and 1 is multiplied by a number close to
7/10 (= 70/100) in the computation of the sample average. This results in a
good (close) estimate of µ by x̄.
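Here is a small Python sketch of this point (an illustration only; the seed is fixed just to make the run repeatable). It draws Bernoulli(0.7) data of increasing size and prints the sample average, which settles near µ = 0.7:

    import random

    random.seed(1)  # fixed seed so the illustration is repeatable
    for n in (10, 100, 10_000):
        data = [1 if random.random() < 0.7 else 0 for _ in range(n)]
        print(n, sum(data) / n)  # each sample average estimates µ = 0.7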
Standard Deviation σ
The standard deviation σ indicates the variation (spread) of outcomes, if you
remember. The probability distribution affects the value of σ as well. If
outcomes have more probabilities toward the center µ of the outcomes, then
the value of σ is small.
For instance, σ for the outcomes 3, 4, 5, 6 with P({3}) = 0.1, P({4}) = 0.4,
P({5}) = 0.4, and P({6}) = 0.1 is smaller than σ for the same outcomes with
P({3}) = 0.4, P({4}) = 0.1, P({5}) = 0.1, and P({6}) = 0.4 because, in the
first probability distribution, more probability is on the outcomes 4 and 5,
which are close to the center of the outcomes.
[Diagrams: two distributions on the outcomes 3, 4, 5, and 6. Left: P({3}) = 0.1, P({4}) = 0.4, P({5}) = 0.4, and P({6}) = 0.1; the same µ = 4.5, but σ for this distribution is smaller. Right: P({3}) = 0.4, P({4}) = 0.1, P({5}) = 0.1, and P({6}) = 0.4; the same µ = 4.5, but σ for this distribution is greater.]
This can be reflected by data which come from these outcomes according to
their probability distributions. Large data from the outcomes with the first
probability distribution have a lot of observations around the center of the
outcomes. They should have about 80% of the data with values 4 and 5
close to the center µ = 4.5. It produces data with less variation.
On the other hand, large data from the same outcomes, with the second
probability distribution, have a lot of observations away from the center of
the outcomes. They should have about 80% of the data with values 3 and 6
away from the center µ = 4.5. It produces data with more variation.
When a probability distribution has more probabilities around the center of
the outcomes, the distribution is said to be “tight.” This is clear from the
formula of σ for a discrete distribution with k sample points,
    σ = √( Σ (si − µ)²·P(si) ), summing over i = 1, …, k,

which, by the way, is very similar to the formula for s, and it should be,
since σ is estimated by s. So, a tight probability distribution tends to have a
small σ. This holds true for both discrete and continuous distributions.
Let us have numerical examples for computing σ with the probability
distribution of P({3}) = 0.1, P({4}) = 0.4, P({5}) = 0.4, and P({6}) = 0.1.
We already have µ = 4.5. So,

    σ = √( (3 − 4.5)²(0.1) + (4 − 4.5)²(0.4) + (5 − 4.5)²(0.4) + (6 − 4.5)²(0.1) )
      ≈ 0.806 (keeping three places after the decimal).

Let us compute σ with the probability function of P({3}) = 0.4, P({4}) = 0.1,
P({5}) = 0.1, and P({6}) = 0.4.

    σ = √( (3 − 4.5)²(0.4) + (4 − 4.5)²(0.1) + (5 − 4.5)²(0.1) + (6 − 4.5)²(0.4) )
      ≈ 1.360 (keeping three places after the decimal).
These two probability distributions have the same outcomes (sample points)
and the same mean (4.5), but the first one, with more probability close to
the center µ of the outcomes, has a smaller σ (0.806 against 1.360). In fact,
the second probability distribution has more variation by about 0.55 (= 1.360 −
0.806), or about 69% (= 0.554/0.806) more variation than the first probability
distribution.
These two numerical examples illustrate the following point. As stated
earlier, a tighter probability distribution (with more probabilities toward the
center) has less variation and a small value of σ. In practice, we generally
prefer a random system with less variation (one with a smaller σ).
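The two σ computations above can be replicated with a short Python sketch (an illustration, standard library only):

    from math import sqrt

    # σ = sqrt( Σ (s - µ)² P(s) ) for a discrete distribution
    def sd(dist):
        mu = sum(s * p for s, p in dist.items())
        return sqrt(sum((s - mu) ** 2 * p for s, p in dist.items()))

    tight = {3: 0.1, 4: 0.4, 5: 0.4, 6: 0.1}
    spread = {3: 0.4, 4: 0.1, 5: 0.1, 6: 0.4}
    print(round(sd(tight), 3))   # 0.806
    print(round(sd(spread), 3))  # 1.36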
Again, σ is estimated by the sample standard deviation s. You see how this
works with the example of the Bernoulli distribution with P({0}) = 0.3 and
P({1}) = 0.7. For this, let us use the variance σ2 and the sample variance
s2, which are the squares of the standard deviation σ and the sample standard
deviation s, respectively.
The reason for using the (sample) variance is that, with it, we do not
carry square root signs, and the point to be made remains the same. In fact,
the point is clearer without all the square root signs. The variance of
the Bernoulli(0.7) is

    σ² = (0 − µ)²(0.3) + (1 − µ)²(0.7)    (3)
       = 0.21 (with µ = 0.7).

That is, σ ≈ 0.458 (≈ √0.21, keeping three places after the decimal).
Suppose that data of size 10 are taken from this random system (more
specifically, random variable) and that the data are
{1, 0, 0, 1, 1, 0, 1, 1, 1, 1}
with three 0’s and seven 1’s since P({0}) = 0.3 and P({1}) = 0.7. The
sample variance is computed, using its formula, as
    s² = ((1 − x̄)² + (0 − x̄)² + (0 − x̄)² + (1 − x̄)² + (1 − x̄)² + (0 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)²)/(10 − 1)
       = ((0 − x̄)² + (0 − x̄)² + (0 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)²)/9
       = ((0 − x̄)² + (0 − x̄)² + (0 − x̄)²)/9 + ((1 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)²)/9
       = (0 − x̄)²(3/9) + (1 − x̄)²(7/9)    (4)
       ≈ 0.233 with x̄ = 0.7 (keeping three places after the decimal).

    s ≈ 0.483 (≈ √0.233, keeping three places after the decimal).
Compare (3) for σ² and (4) for s². The µ in (3) is estimated by x̄ in (4), and
the 3/9 and 7/9 in (4) are the estimates for 0.3 = P({0}) and 0.7 = P({1}) in
(3), respectively. I hope, comparing (3) and (4), you see why σ² can be
estimated by s² so that σ can be estimated by s. This is true for any
probability distribution.
Also, note that the estimates, 0.233 and 0.483, are close respectively to σ² =
0.21 and σ = 0.458. If n = 10, instead of n − 1 = 9, were used with the data,
the estimates would be identical to σ² = 0.21 and σ = 0.458. Again, they are
estimates and, depending on the data, they are often close to, but not exactly
the same as, the parameters they estimate.
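Here is a Python sketch of this comparison (an illustration, standard library only), computing the true σ² of Bernoulli(0.7) and the sample variance s² from the ten observations used above:

    # True variance σ² of Bernoulli(0.7) versus the sample variance s²
    # (divisor n - 1) from the ten observations used in the text.
    mu = 0.7
    var = (0 - mu) ** 2 * 0.3 + (1 - mu) ** 2 * 0.7
    print(round(var, 3))  # σ² = 0.21

    data = [1, 0, 0, 1, 1, 0, 1, 1, 1, 1]
    xbar = sum(data) / len(data)
    s2 = sum((x - xbar) ** 2 for x in data) / (len(data) - 1)
    print(round(s2, 3))  # s² ≈ 0.233, which estimates σ² = 0.21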
Continuous Distributions
Generally, for continuous outcomes, probability is given by an area just like
the area of a bar (with a relative frequency for the height) in a histogram.
The sample space of a continuous distribution is an interval (or intervals) of
the real number line (could be the entire real number line). All real numbers
in the interval are possible outcomes (sample points). See the diagrams of a
discrete probability distribution and a continuous probability distribution
below.
[Diagrams: A discrete distribution like the one shown has each probability given by the length (height) of a bar on an outcome; the heights must all add up to one (100%). Its sample space is S = {s1, s2, s3, s4, s5}. A continuous distribution like the one shown has probability given as an area between the curve and the horizontal axis; the total area between the curve and the horizontal axis must be one (100%). Its sample space is S = [a, b].]
With continuous sample points (outcomes), an event (a set) is given by an
interval like [c, d] since you cannot list all the elements (sample points) in
the set explicitly one by one to indicate the event (set). The probability for
an event such as [c, d] is given as the (size of) area over [c, d] between the
curve and the horizontal axis.
If a continuous random variable X is distributed over its sample space S =
[a, b] as given above, then the probability for the event [c, d], P(c ≤ X ≤ d),
is given as the area indicated in the diagram below.
[Diagrams: With a discrete distribution, the probability of an event is computed by adding the probabilities of all the sample points that constitute the event (set); for instance, if A = {s3, s4}, then P(A) = P(s3) + P(s4). With a continuous distribution, the probability for an event (set) B = [c, d] is given as the area over the event between the curve and the horizontal axis; P(B) = P(c ≤ X ≤ d) is that area.]
Note, with the continuous distribution given above, that
P(-∞ < X ≤ d) = P(a ≤ X ≤ d)
since P(-∞ < X ≤ d) = P(-∞ < X < a) + P(a ≤ X ≤ d) and P(-∞ < X < a) = 0.
Can you tell me why P(-∞ < X < a) = 0? Similarly,
P(c ≤ X < ∞) = P(c ≤ X ≤ b)
since P(c ≤ X < ∞) = P(c ≤ X ≤ b) + P(b < X < ∞ ) and P(b < X < ∞ ) = 0.
If you understand these, then it is obvious that
P(-∞ < X < ∞) = P(a ≤ X ≤ b) = 1.
The computations of µ and σ with continuous probability distributions
generally require integration and are beyond the scope of this book, although,
for some continuous distributions, they can be computed without integration.
Nonetheless, µ and σ of continuous probability distributions mean the same
things as those of discrete probability distributions.
The mean µ with a continuous probability distribution is the point on the
horizontal axis (line) where the entire area (probability), between the curve
(or line) and the horizontal line, balances. It is the pivotal point or the
gravitational center of the distribution on the horizontal line. It indicates the
center of the (continuous) probability distribution. It looks like µ is just a
little bit left of d in the above diagram.
The standard deviation σ of a continuous probability distribution indicates
how spread out the distribution is. A tighter distribution has a small value of
σ, indicating small variation among outcomes. Even for the same sample
space, if more probability is located away from the center (say µ), the value
of σ for the distribution is larger. See the diagrams given below.
[Diagrams: Distribution A (with σA) and Distribution B (with σB) have the same µ, marked by an arrow; σB < σA since the sample space of Distribution B is narrower than that of Distribution A. Distribution C (with σC) and Distribution D (with σD) have the same sample space and µ, also marked by an arrow; σD < σC since Distribution D has more probability toward the center.]
Let us have four examples of continuous probability distributions.
A continuous uniform distribution is a probability distribution whose
outcomes are the continuous real numbers in an interval, with equal
probability for subintervals of equal length.
The probability exists only in the interval, in an equal (uniform) manner, so
that any subintervals of equal length within the interval have the same
probability. For instance, if the distribution has equal probability between 3
and 19, then any subinterval of, say, length 2 has the same probability of
12.5%. That is, when you take the next measurement, the chance of the
measurement value falling between 5 and 7 is 12.5%, and the chance of it
falling between 12 and 14 is also 12.5%.
This distribution is denoted as Uniform[a, b] where [a, b] is the interval with
equal (uniform) probability. As a result, this distribution has a sample space
of the form S = [a, b], an interval on the real number line. The continuous
uniform distribution discussed as an example in the last paragraph is denoted
as Uniform[3, 19] with a = 3 and b = 19. Its sample space is S = [3, 19]. See
the diagram given below.
[Diagram: X~Uniform[a, b] with a = 3 and b = 19, drawn as a rectangle over [3, 19]. The height must be 1/(b − a) = 1/(19 − 3) = 1/16 since the base has length b − a = 19 − 3 = 16 and the total area must be one (100%). The area over [5, 7] is P(5 ≤ X ≤ 7) = 2*(1/16) = 1/8 = 0.125, and the area over [12, 14] is P(12 ≤ X ≤ 14) = 2*(1/16) = 1/8 = 0.125, the same probability. The mean µ = 11 is marked on the X-axis.]
The mean µ is always the midpoint of the point where the probability starts
and the point where the probability ends; that is, µ = (a + b)/2. In the
example of the uniform distribution above, the mean is µ = 11, which is
computed as (3 + 19)/2. Generally, the value of σ is smaller for a continuous
uniform distribution with a short interval for probability. For instance, the
example continuous uniform distribution, Uniform[3, 19], has a σ of about 4.6,
while Uniform[56, 58] has a σ of about 0.58.
Generally, a tighter uniform distribution has a smaller σ, which means less
variation. This makes sense since the outcomes from Uniform[56, 58] have
the maximum variation (without considering the probability) of only 2 = 58 − 56,
while the other one has the maximum variation (without considering the
probability) of 16 = 19 − 3.
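These uniform probabilities, and the σ values just quoted, can be checked with a short Python sketch (an illustration; the formula σ = (b − a)/√12 for a continuous uniform distribution is a standard result that the text does not derive):

    from math import sqrt

    # For X ~ Uniform[a, b], P(c <= X <= d) = (d - c)/(b - a).
    def uniform_prob(a, b, c, d):
        return (d - c) / (b - a)

    print(uniform_prob(3, 19, 5, 7))    # 0.125
    print(uniform_prob(3, 19, 12, 14))  # 0.125

    # σ of Uniform[a, b] is (b - a)/sqrt(12) (standard formula, not derived here).
    print(round((19 - 3) / sqrt(12), 2))   # ≈ 4.62 for Uniform[3, 19]
    print(round((58 - 56) / sqrt(12), 2))  # ≈ 0.58 for Uniform[56, 58]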
A histogram for data from outcomes with a uniform distribution looks like a
rectangle, with the tops of the bars all at about the same height. For the first
example uniform distribution, a histogram looks like a rectangle starting at 3
and ending at 19 with a height of 6.25% (in relative frequency). If relative
frequencies are used, the total area of the rectangle should be close to 1,
since the size of an area estimates a probability, and the total probability
(area) must be one.
A Normal distribution is a continuous distribution with a perfect bell shape
with one mode (bump) right at µ, the center of the outcomes. The outcomes
are all possible real numbers. Its sample space is the entire real number line.
A ‘bell shape’ means a perfectly symmetric, unimodal (one bump at the
center) shape tapering off to the sides. It is the most commonly found
distribution in nature. Hence, it is called a ‘Normal’ distribution. It is also
known as the Gaussian distribution (especially in physics), named after one
of the two discoverers of Normal distributions, the German
mathematician/physicist Johann Gauss. The other discoverer is the
French/English mathematician Abraham de Moivre.
A Normal distribution is denoted by N(µ,σ) where µ and σ are the mean
and standard deviation of the Normal distribution, respectively, and they are
parameters. The first position is for µ and the second position is for σ; an m
comes before an s. See the graph of N(µ,σ) given below.
This diagram is found at
http://www.stat.yale.edu/Courses/1997-98/101/normal.htm
The mean µ is the center of the outcomes and their probabilities (that is, the
center of the Normal distribution, where the bump is located), and σ
indicates the spread of the outcomes with their probabilities (thus, the spread
of the Normal distribution). The smaller the value of σ is, the tighter the
probability distribution is. That is, a Normal distribution with a smaller σ
has a tighter bell shape.
The graphs of four different Normal distributions are plotted in the diagram
given below. They are graphs of N(0, 0.2), N(0, 1.0), N(0, 5.0) and N(-2,
0.5). As you can see, µ indicates the center of the distribution and σ
indicates the spread of the distribution.
The green line is the standard Normal distribution
This diagram is found at
http://en.wikipedia.org/wiki/Normal_distribution
As you remember, [µ − 3σ, µ + 3σ] catches most of the outcomes and
probabilities. In the case of a Normal distribution, this six-sigma interval
catches 99.73% of the probability for the outcomes. If σ is small, then the
interval is short, but it still catches 99.73% since the bell shape is tighter.
The sample space of a Normal distribution is the entire real number line, but
most of the probability is found in [µ − 3σ, µ + 3σ].
The standard Normal distribution is a Normal distribution with µ = 0 and
σ = 1.
That is, the standard Normal distribution is denoted by N(0, 1). It is
customary to use Z for the standard Normal variable. That is, Z is a
random variable with the standard Normal distribution (recall what a random
variable is). In other words, Z has the standard Normal distribution, which
means the outcomes of Z have the standard Normal distribution. We also
write
Z~N(0, 1)
which stands for “Z is distributed as the standard Normal distribution.”
If X~N(µ, σ), then
P(µ − 3σ < X < µ + 3σ) = 0.9973,
P(µ − 2σ < X < µ + 2σ) = 0.9545,
and
P(µ − σ < X < µ + σ) = 0.6827.
In terms of the standard Normal distribution,
P(-3 < Z < 3) = 0.9973,
P(-2 < Z < 2) = 0.9545,
and
P(-1 < Z < 1) = 0.6827.
See the graph of the standard Normal distribution given below.
This diagram is found at http://www.westgard.com/images/ls36f4.gif
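If you are curious, these three probabilities can be reproduced without a table, using the identity P(−z < Z < z) = erf(z/√2) (a standard fact about the Normal distribution, not stated in the text). A minimal Python sketch:

    from math import erf, sqrt

    # P(-z < Z < z) = erf(z / sqrt(2)) for the standard Normal variable Z
    for z in (1, 2, 3):
        print(z, round(erf(z / sqrt(2)), 4))
    # prints 1 0.6827, 2 0.9545, 3 0.9973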
You do not have to memorize these numbers since Normal tables, computer
Normal probability functions, and their inverse functions are all available.
For instance, the following websites have programmes that find probabilities
from given points (z values) and find points (z values) from given
probabilities.
http://surfstat.anu.edu.au/surfstat-home/tables/normal.php
or
http://sites.csn.edu/mgreenwich/stat/normal.htm
Another such website is
http://www-stat.stanford.edu/~naras/jsm/FindProbability.html
Please check out the Related Websites for more websites on Normal and
other probability distributions.
The Normal standardization is used to change any Normal distribution to the
standard Normal distribution by subtracting its µ and, then, dividing the
difference by its σ. If X~N(µ,σ), then
(X - µ)/σ
has the standard Normal distribution; that is, (X - µ)/σ~N(0, 1) and, hence,
Z = (X - µ)/σ.
It is not difficult to remember. For a Normal distribution with µ and σ to be
the standard Normal distribution, µ must be zero and σ must be 1. So, how
do you get 0 from µ? By subtracting µ, of course. Now, σ must be 1. So,
how do you get 1 from σ? By dividing by σ. You just apply this
subtraction of µ and division by σ to a Normal random variable X with µ
and σ, and that is the Normal standardization. Also, these operations must
be done in that order, just as µ shows up in the first position and σ in the
second position in N(µ, σ).
Because of this Normal standardization, only the standard Normal table is
necessary to find probabilities for any Normal distribution. That is, the
standard Normal distribution is all that is needed to find any probability of
any Normal distribution.
For example, we can find the probability of X less than 5.2 where X is
distributed as the Normal distribution with the mean 1.9 and the standard
deviation 2.5, using the Normal standardization, as follows.
First, the statement “the probability of X less than 5.2” needs to be translated
into the probability statement P(X < 5.2) so that the Normal standardization
can be performed. The Normal standardization gives the same probability in
terms of the standard Normal variable Z as
P(X < 5.2) = P( (X – 1.9)/2.5 < (5.2 – 1.9)/2.5 ) = P(Z < 1.32)
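The website lookup described next can also be reproduced in Python, using the standard Normal CDF Φ(z) = (1 + erf(z/√2))/2 (a standard identity, not from the text). A minimal sketch:

    from math import erf, sqrt

    def phi(z):
        # standard Normal CDF: Φ(z) = (1 + erf(z / sqrt(2))) / 2
        return (1 + erf(z / sqrt(2))) / 2

    z = (5.2 - 1.9) / 2.5  # the Normal standardization gives z = 1.32
    print(round(phi(z), 4))      # P(Z < 1.32) ≈ 0.9066
    print(round(1 - phi(z), 4))  # P(Z > 1.32) ≈ 0.0934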
That is, finding P(Z < 1.32) is finding the probability of X less than 5.2
where X~N(1.9, 2.5). You could use a standard Normal table to find this
probability. However, in this course, we use internet websites to find this
kind of probability. Go to the following website.
http://surfstat.anu.edu.au/surfstat-home/tables/normal.php
or
http://sites.csn.edu/mgreenwich/stat/normal.htm
Click on the circle under the diagram with the red area on the left of z, and
enter 1.32 under “z value.” Then, click on the button “→.” This should give
you 0.9066 under “probability.” That is, the probability of X less than 5.2 is
0.9066.
Note that, if the probability to find were P(Z > 1.32), then you would click on
the circle under the diagram with the red area on the right of z and do the
same. You would get 0.0934 under “probability” this time, for P(Z > 1.32).
By the way, this makes sense: 0.0934 + 0.9066 = 1, the total probability of
one.
You need to understand that the little z under the distribution is the “z
value,” and the red area in the distribution is the probability on the left or on
the right of the z value, which is determined by the inequality in the
probability statement.
For instance, the z value in the probability statement P(Z < 1.32) is 1.32
(that is, z = 1.32) and the probability is on the left of z = 1.32 (so the red area
is on the left of z).
Here are some exercises with this website. Can you find P(Z > -0.67), P(Z <
-1.38), P(2.14 < Z), and P(-1.44 > Z)? They are respectively 0.7486, 0.0838,
0.0162, and 0.0749. I hope you can get them.
By the way, the little “z” and “z value” are used with the distribution
because it is the standard Normal distribution. You can find these z values
by using this website if probabilities are given. That is, you can find z (z
value) such that P(Z < z) = 0.3526 and P(Z > z) = 0.9684, and P(z > Z) =
0.9372. They are respectively -0.3783, -1.858, and 1.532. I hope you can
find them. Hint: What could the “←” button in the website be for?
The third continuous distribution is the Student’s t distribution. This
distribution was introduced, in 1908, by an applied mathematician William
Gosset who was employed by Guinness Breweries in Ireland.
A Student’s t distribution is a probability distribution very similar to the
standard Normal distribution with more probability away from the center.
Its sample space is the entire real number line, just like that of a Normal
distribution.
It has a symmetric bell shape centered at 0 like the standard Normal
distribution. However, it has more probability away from the center (“on the
tails”). That is, the tails of the bell are thicker, and the bell shape is not as
tight as that of the standard Normal distribution.
A Student’s t distribution comes with the degree of freedom (this is the
parameter for this distribution), and its notation is Student(m) where m > 0;
this m is the degree of freedom. The smaller the degree of freedom is, the
thicker the tails of the bell are (the more probability away from the center).
That is, as the degree of freedom increases, the Student’s t distribution has
less probability away from the center and more probability toward the
center. In fact, as the degree of freedom approaches infinity, the
Student’s t distribution approaches the standard Normal distribution.
This is called the asymptotic Normality of the Student’s t distribution, which
is the relation between the standard Normal and Student’s t distributions.
See the graphs of five Student’s t distributions that are plotted in the diagram
given below. They are Student(1), Student(2), Student(5), Student(10), and
Student(∞) = N(0, 1).
This diagram is found at http://en.wikipedia.org/wiki/Student's_t-distribution
In fact, if the Student’s t distribution’s degree of freedom exceeds 30, it is
very close to the standard Normal distribution. In practice, the standard
Normal distribution is used for a Student’s t distribution if its degree of
freedom is greater than 30.
A probability or a point (t value) can be found by giving one of them along
with the degree of freedom at the following website.
http://surfstat.anu.edu.au/surfstat-home/tables/t.php
or
http://sites.csn.edu/mgreenwich/stat/t.htm
A Chi-square distribution is a continuous probability distribution with the
positive real numbers for its outcomes; that is S = (0, ∞). A Chi-square
distribution comes with the degree of freedom (this is the parameter for this
distribution), and its notation is Chi-square(m) where m > 0; this m is the
degree of freedom.
Depending on the degree of freedom, it has a skewed bell shape starting at
the origin, has one bump (unimodal), and tapers down (to the horizontal
axis) toward ∞. However, this is not the case for a Chi-square distribution
with a small degree of freedom.
See the graphs of five Chi-square distributions that are plotted in the
diagram given below. They are Chi-square(1), Chi-square(2), Chi-square(3),
Chi-square(4), and Chi-square(5).
This diagram is found at
http://en.wikipedia.org/wiki/Chi-square_distribution
Probabilities and points (χ² values), along with the degree of freedom, can
be found by giving one of them at the following website.
http://surfstat.anu.edu.au/surfstat-home/tables/chi.php
or
http://sites.csn.edu/mgreenwich/stat/chi.htm
Sampling Probability Distributions (Sampling Distributions)
The probability distribution of an estimator is called the sampling
probability distribution since an estimator cannot do its estimation without
data which are taken from the objects sampled (sampling is involved). On
the other hand, the probability distribution for outcomes is called the
underlying probability distribution. That is,
a sampling (probability) distribution is a probability distribution of an
estimator
and
an underlying (probability) distribution is a probability distribution for
outcomes of a random variable or system which is not an estimator.
When the probability distributions of both the outcomes and an estimator are
discussed, it gets confusing, so ‘sampling’ or ‘underlying’ is attached to a
probability distribution to clearly indicate which probability distribution
you are referring to. When only one kind of probability distribution is
involved, you do not need to use ‘sampling’ or ‘underlying’ with the
probability distribution.
Let us have an example of underlying and sampling probability distributions
with a sample average. Suppose there are only three outcomes of 0, 4, and 8
with the equal probability of 1/3 for each outcome. That is, P({0}) = P({4})
= P({8}) = 1/3 which is the underlying probability distribution (with S = {0,
4, 8}) because these 1/3’s are the probabilities for the outcomes (no
estimator is involved at this point). By the way, it is a discrete uniform
distribution. What is µ? It must be 4; that is, the true value of the parameter
µ is 4 (µ = 4) for this underlying distribution. By the way, what is σ? It
must be √(32/3). In practice, we do not know the true value of µ, but suppose
that we need it to make a correct/good decision. Then, it must be estimated by
the sample average x̄. Let us take a sample of size 2 with replacement.
Then, we have nine possible samples of size 2 which are
    Sample   Probability   Sample average x̄
    {0,0}    1/9           0
    {0,4}    1/9           2
    {4,0}    1/9           2
    {0,8}    1/9           4
    {8,0}    1/9           4
    {4,4}    1/9           4
    {4,8}    1/9           6
    {8,4}    1/9           6
    {8,8}    1/9           8
The middle column gives the probability of getting each sample (data).
Remember, in practice, you take only one sample, not all of them; this is a
theoretical exercise to find the sampling distribution.
The probability of each sample is determined by multiplying the chance of
getting the first outcome for the sample (data), which is 1/3, by the
probability of getting the second outcome for the sample (data), which is
also 1/3: (1/3)*(1/3) = 1/9. The last number in each row is the estimate for
µ based on the sample (data) using the sample average.
The possible estimates for µ are 0, 2, 4, 6, and 8. So, this estimator, the
sample average, has the probability distribution
{P({0}) = 1/9, P({2}) = 2/9, P({4}) = 3/9, P({6}) = 2/9, P({8}) = 1/9},
which is the probability distribution of the estimator (the sample average) for
µ. This probability distribution is a sampling distribution since it is the
probability distribution of an estimator for µ. Note that this sampling
probability distribution has a sample space of S = {0, 2, 4, 6, 8} while the
underlying distribution has a sample space of S = {0, 4, 8}.
The mean of both the underlying and sampling distributions is identical at 4,
but the standard deviation of the sampling distribution, √(16/3), is less than
that of the underlying distribution, √(32/3), because there is less probability
on the end sample points 0 and 8 (1/3 each for the underlying distribution
versus 1/9 each for the sampling distribution). See the diagram given below.
[Diagrams: Underlying distribution of X~Uniform{0, 4, 8}, with σ = √(32/3): bars of height 1/3 at 0, 4, and 8, centered at µ = 4. Sampling distribution of x̄, with standard deviation √(32/6) = √(16/3): bars of heights 1/9, 2/9, 3/9, 2/9, and 1/9 at 0, 2, 4, 6, and 8, centered at the same µ = 4.]
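The whole table and sampling distribution above can be generated with a short Python sketch (an illustration, standard library only):

    from collections import Counter
    from itertools import product

    # Enumerate all nine samples of size 2 taken with replacement
    # from Uniform{0, 4, 8}; each sample has probability (1/3)(1/3) = 1/9.
    outcomes = [0, 4, 8]
    xbars = Counter((a + b) / 2 for a, b in product(outcomes, repeat=2))
    for xbar in sorted(xbars):
        print(xbar, f"{xbars[xbar]}/9")
    # prints 0.0 1/9, 2.0 2/9, 4.0 3/9, 6.0 2/9, 8.0 1/9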
There are more sample points between 0 and 8 for the sampling distribution
than for the underlying distribution. When the sample size is large, say 30 or
greater, the sampling distribution of a sample average is very close to a
Normal distribution.
Generally, a sampling probability distribution of a sample average
approaches a Normal distribution, and this is true regardless of the
underlying probability distribution (even if it is a discrete underlying
probability distribution).
Let us look at the last example. It started with a discrete probability
distribution (underlying distribution) of three outcomes. It is a discrete
uniform distribution, which is flat. However, the sampling probability
distribution has more estimates (outcomes), increasing to five from three.
Also, the probability gets more concentrated toward the center, and the
distribution is getting bell-shaped. This is with a sample size of only two.
If the sample size is increased to 30, 50, 100, and 200, then the sampling
distribution becomes a tighter bell shape, with more sample points, similar
to the shape of a Normal distribution.
Also, notice that µ is 4 for both the underlying and sampling probability
distributions. That is, µ does not change from the underlying probability
distribution to the sampling probability distribution, which means that the
sample average is an unbiased estimator for µ, since the mean of the sampling
distribution of x̄ equals µ.
Central Limit Theorem
There is a theorem called the Central Limit Theorem that states that
the sample average’s sampling probability distribution is approximately a
Normal distribution, N(µ, σ/√n), for a large sample size n,
where µ and σ are those of the original underlying distribution and σ/√n is
the standard deviation (standard error) of the sampling probability
distribution. That is, the sampling probability distribution gets tighter by
the factor √n. So, as the sample size n increases, the sampling probability
distribution gets tighter.
The Normal standardization can be applied to the Central Limit Theorem:

    (x̄ − µ)/(σ/√n)

is approximately distributed as the standard Normal distribution.
This becomes very important in the next chapter.
It should be noted that, if the underlying probability distribution is a Normal
distribution, then the sampling probability distribution of a sample average is
exactly a Normal distribution, with a tighter bell shape by the factor √n.
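Here is a Python sketch of the Central Limit Theorem in action for the Uniform{0, 4, 8} example (an illustration only; the seed is fixed for repeatability). For each sample size n, it simulates many sample averages and prints their mean, their standard deviation, and the theoretical standard error σ/√n = √(32/3)/√n:

    import random
    from math import sqrt

    random.seed(1)  # fixed seed so the illustration is repeatable
    sigma = sqrt(32 / 3)  # σ of the underlying Uniform{0, 4, 8}
    for n in (2, 30, 200):
        xbars = [sum(random.choice([0, 4, 8]) for _ in range(n)) / n
                 for _ in range(10_000)]
        m = sum(xbars) / len(xbars)
        sd = sqrt(sum((x - m) ** 2 for x in xbars) / len(xbars))
        # the simulated sd of the sample averages should track σ/√n
        print(n, round(m, 2), round(sd, 3), round(sigma / sqrt(n), 3))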
Let us look at this from the standpoint of the accuracy and precision of
estimators. The sample average has the same µ as that of the original
outcomes, and this original µ is what the sample average estimates. This
means that the sample average is an unbiased estimator, and it is accurate.
The standard deviation of the estimator is the standard error σ/√n. It
represents the imprecision of the estimator. As the sample size increases,
the standard error (imprecision) decreases, and the precision increases, as
stated in a previous section. That is, the sample average, as an estimator for
µ, is accurate (unbiased), and its precision increases as data get larger.
© Copyrighted by Michael Greenwich, 08/2011
☺