1 Probability Distributions
The chapter on descriptive statistics discussed sample data and introduced tools
for describing samples with numbers as well as with graphs.
In this chapter, models for the population are introduced. We will see how the properties
of a population can be described in mathematical terms. Later we will see how samples can
be used to draw conclusions about those properties. That step is called statistical inference.
Definition
A random variable (rv) X is a variable whose value is determined by the outcome of a random
experiment. As discussed for variables in samples, rvs can be categorical or numerical, and if
they are numerical they can be either discrete or continuous.
random variable
    categorical
    numerical
        discrete
        continuous
In data description we observed that the proper methods depend on the type of the variable.
This is similar for rvs. The choice of model depends on their type. The models for continuous
rvs will be different than those for categorical or discrete rvs.
All random variables are described by their distribution.
Definition 1
The distribution of a random variable gives the values the random variable
can have and the probabilities for these to occur.
1.1 Categorical Random Variables
Definition
The distribution of a categorical rv is a table giving all possible values (categories) of the rv
and the associated probabilities.
The distribution of a categorical rv can be shown in the form of a bar graph.
Example:
The population investigated consists of the students of a selected college. The random variable of
interest is the residence status. It can be either resident or nonresident, so it is categorical.
The probability distribution is:
residence status   probability
resident           0.73
nonresident        0.27
If a student is chosen randomly from this college, the probability that the student is a
resident is 0.73.
If X is the random variable residence status, we write P(X = resident) = 0.73.
1.2 Numerical Random Variables
1.2.1 Discrete Random Variables
Remember: A discrete rv is a random variable whose possible values are isolated points along
the number line.
Example 1
1. number of teeth in a patient
2. number of houses in a certain block
3. number of heads when tossing 3 coins
The probability distribution for a discrete rv X can be given as a formula, table, or graph
that gives the possible values x of X and their corresponding probabilities P(X = x).
Example:
Toss two unbiased coins and let X equal the number of heads observed.
The simple events of this experiment are:
coin 1   coin 2   x   probability of the simple event
H        H        2   1/4
H        T        1   1/4
T        H        1   1/4
T        T        0   1/4
This gives the following distribution for X = number of heads observed:
x   P(X = x)
0   1/4
1   1/2
2   1/4
With the help of this distribution we can calculate that P(X ≤ 1) = P(X = 0) + P(X = 1) =
1/4 + 1/2 = 3/4.
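The step from simple events to the distribution of X can also be sketched in code. The following Python snippet is only an illustration, not part of the original notes; the function name heads_distribution is made up for this example. It lists all equally likely simple events, counts the heads in each, and uses exact fractions so the results match the table.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def heads_distribution(n_coins):
    """Distribution of X = number of heads when tossing n_coins fair coins."""
    # All simple events, e.g. ('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T').
    outcomes = list(product("HT", repeat=n_coins))
    # Count how many simple events lead to each value of X.
    counts = Counter(outcome.count("H") for outcome in outcomes)
    # Every simple event is equally likely, so P(X = x) = count / total.
    return {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}


dist = heads_distribution(2)
p_at_most_one = dist[0] + dist[1]  # P(X <= 1)

for x, p in dist.items():
    print(f"P(X = {x}) = {p}")
print(f"P(X <= 1) = {p_at_most_one}")
```

Calling heads_distribution(3) gives the distribution for the "number of heads when tossing 3 coins" mentioned earlier.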
Properties of discrete probability distributions:
• 0 ≤ P(X = x) ≤ 1 for every value x
• \sum_x P(X = x) = 1
Example 2 Consider the distribution of the variable X=number of vehicles owned per
family. Suppose the following table gives the distribution of the variable:
x   P(X = x)
0   0.015
1   0.235
2   0.425
3   0.245
4   ?
What is the value of P (X = 4), if no family owns more than 4 vehicles?
P (X = 4) = 1 − (0.015 + 0.235 + 0.425 + 0.245) = 1 − 0.92 = 0.08, because the total of the
probabilities must be 1.
What is the probability that a family has more than 2 vehicles?
P (X > 2) = P (X = 3) + P (X = 4) = 0.245 + 0.08 = 0.325.
The expected value or population mean µ (mu) of an rv X is the value that you would expect
to observe on average if the experiment is repeated over and over again. It is the center of the
distribution.
Definition:
Let X be a discrete rv with probability distribution P (X = x). The population mean µ or
expected value of X is given as
\[ \mu = E(X) = \sum_x x \, P(X = x). \]
Example:
The expected value of the distribution of X = the number of heads observed when tossing two coins
is calculated by
\[ \mu = 0 \cdot \frac{1}{4} + 1 \cdot \frac{1}{2} + 2 \cdot \frac{1}{4} = 1 \]
Example 3
The mean µ of vehicles owned per family is
µ = 0 × 0.015 + 1 × 0.235 + 2 × 0.425 + 3 × 0.245 + 4 × 0.08 = 2.14
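As a rough sketch of the same calculation in Python (an illustration only; the helper name expected_value and the dictionary layout are assumptions made here, not something from the notes), the distribution can be stored as a mapping from values to probabilities:

```python
def expected_value(dist):
    """Population mean of a discrete rv given as {value: probability}."""
    total = sum(dist.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # mu = sum over x of x * P(X = x)
    return sum(x * p for x, p in dist.items())


# Vehicles owned per family; P(X = 4) is the remaining probability.
vehicles = {0: 0.015, 1: 0.235, 2: 0.425, 3: 0.245}
vehicles[4] = 1 - sum(vehicles.values())

mu = expected_value(vehicles)
p_more_than_two = sum(p for x, p in vehicles.items() if x > 2)

print(f"P(X = 4) = {vehicles[4]:.3f}")
print(f"P(X > 2) = {p_more_than_two:.3f}")
print(f"mu = {mu:.2f}")
```

The check on the sum of the probabilities mirrors the second property of discrete distributions: a table whose probabilities do not add up to 1 is rejected.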
The standard deviation of a distribution measures the spread of the distribution. It is denoted
by the Greek letter σ.
Let X be a discrete rv with probability distribution P(X = x). The standard deviation σ of
the rv X is
\[ \sigma = \sqrt{\sum_x (x - \mu)^2 \, P(X = x)} \]
Example (continued):
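The population standard deviation of X = number of heads observed when tossing two coins is calculated by
\[ \sigma = \sqrt{(0 - 1)^2 \cdot \frac{1}{4} + (1 - 1)^2 \cdot \frac{1}{2} + (2 - 1)^2 \cdot \frac{1}{4}} = \sqrt{\frac{1}{4} + \frac{1}{4}} = \sqrt{\frac{1}{2}} = \frac{1}{\sqrt{2}} \]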
The alternative formula for the standard deviation is usually quicker to evaluate:
\[ \sigma = \sqrt{\sum_x x^2 \, P(X = x) - \mu^2} \]
Example:
Donations have been collected. Every person in a population has been asked for a donation.
The following table gives the distribution of the donations given.
x          $0     $10    $20    $50
P(X = x)   0.45   0.30   0.20   0.05
Interpretation: If one person is randomly selected from the population the probability the
person donated $50 is equal to 0.05.
The mean of this distribution is
\[ \mu = \sum_x x \, P(X = x) = 0 \cdot 0.45 + 10 \cdot 0.30 + 20 \cdot 0.20 + 50 \cdot 0.05 = 0 + 3 + 4 + 2.5 = 9.5 \]
On average, this population donated $9.50 per person.
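Calculating the standard deviation:
\[
\begin{aligned}
\sigma^2 &= \sum_x (x - \mu)^2 \, P(X = x) \\
&= (0 - 9.5)^2 \cdot 0.45 + (10 - 9.5)^2 \cdot 0.30 + (20 - 9.5)^2 \cdot 0.20 + (50 - 9.5)^2 \cdot 0.05 \\
&= 40.61 + 0.075 + 22.05 + 82.01 = 144.75
\end{aligned}
\]
so that
\[ \sigma = \sqrt{\sigma^2} = \sqrt{144.75} = 12.03 \]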
The standard deviation of this distribution equals $12.03.
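The calculation can be checked numerically. The Python sketch below is an illustration under the assumption that the distribution is stored as a {value: probability} dictionary; the function name mean_and_sd is hypothetical. It evaluates the variance both from the definition and from the alternative formula stated earlier and confirms that the two agree.

```python
import math


def mean_and_sd(dist):
    """Mean and standard deviation of a discrete rv given as {value: probability}."""
    mu = sum(x * p for x, p in dist.items())
    # Definition: sum of squared deviations from mu, weighted by probability.
    var_definition = sum((x - mu) ** 2 * p for x, p in dist.items())
    # Alternative formula: sum of x^2 * P(X = x), minus mu^2.
    var_shortcut = sum(x ** 2 * p for x, p in dist.items()) - mu ** 2
    # Both formulas describe the same quantity (up to rounding error).
    assert math.isclose(var_definition, var_shortcut)
    return mu, math.sqrt(var_definition)


donations = {0: 0.45, 10: 0.30, 20: 0.20, 50: 0.05}
mu, sigma = mean_and_sd(donations)
print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
```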
Using the alternative formula you get:
\[
\begin{aligned}
\sigma^2 &= \sum_x x^2 \, P(X = x) - \mu^2 \\
&= 0^2 \cdot 0.45 + 10^2 \cdot 0.30 + 20^2 \cdot 0.20 + 50^2 \cdot 0.05 - 9.5^2 \\
&= 0 + 30 + 80 + 125 - 90.25 = 144.75
\end{aligned}
\]
and σ = √144.75 = 12.03.
1.2.2 Continuous Random Variables
Continuous data variables are described by histograms. For histograms the measurement
scale is divided into class intervals, and the area of the rectangle above each interval is
proportional to the relative frequency of the data falling into this interval.
The relative frequency can be interpreted as an estimate of the probability of falling into the
associated interval.
With this interpretation the histogram becomes an "estimate" of the probability distribution
of the continuous random variable.
The probability distribution of a continuous rv is described by a smooth curve, called a density
curve, with the following properties:
1. The total area under the curve is equal to 1.
2. The area under the curve and above any particular interval gives the probability of
observing a value of X in the corresponding interval when an experimental unit is selected
at random from the population.
[Figure: density curve]
For this density, the probability of falling in the interval [−2; 0] equals 0.46.
Example:
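The density of a uniform distribution on the interval [0; 5] is constant: it has height 1/5 = 0.2
over [0; 5] and height 0 elsewhere, so that the total area is 5 · 0.2 = 1.
[Figure: density of the uniform distribution on [0; 5]]
Use the density function to calculate probabilities for a random variable X with a uniform
distribution on [0; 5]: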
• P (X ≤ 3) = area under the curve from − ∞ to 3 = 3 · 0.2 = 0.6
• P (1 ≤ X ≤ 2) = area under the curve from 1 to 2 = 1 · 0.2 = 0.2
• P (X > 3.5) = area under the curve from 3.5 to ∞ = 1.5 · 0.2 = 0.3
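These area calculations can be written as a small Python sketch. It is an illustration only; the function names uniform_cdf and uniform_prob are made up here, and the defaults a = 0, b = 5 correspond to the example. For a uniform density every probability is the area of a rectangle: width times the constant height 1/(b − a).

```python
def uniform_cdf(x, a=0.0, b=5.0):
    """P(X <= x) for X uniform on [a, b]: area under the flat density left of x."""
    if x <= a:
        return 0.0  # no area to the left of the interval
    if x >= b:
        return 1.0  # the whole area lies to the left
    return (x - a) / (b - a)  # width (x - a) times height 1 / (b - a)


def uniform_prob(lo, hi, a=0.0, b=5.0):
    """P(lo <= X <= hi) as a difference of two cumulative areas."""
    return uniform_cdf(hi, a, b) - uniform_cdf(lo, a, b)


p_at_most_3 = uniform_cdf(3)
p_between_1_and_2 = uniform_prob(1, 2)
p_above_3_5 = 1 - uniform_cdf(3.5)

print(round(p_at_most_3, 4), round(p_between_1_and_2, 4), round(p_above_3_5, 4))
```

Note that uniform_prob(2, 2) returns 0: a single value has zero width and therefore zero area.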
Remark: Since there is zero area under the curve above a single value, the definition implies
for continuous random variables and numbers a and b:
• P (X = a) = 0
• P (X ≤ a) = P (X < a)
• P (X ≥ b) = P (X > b)
• P (a < X < b) = P (a ≤ X ≤ b)
This is generally not true for discrete random variables.
How to choose a model for a given variable in a sample?
The model (density function) should resemble the histogram for the given variable.
Fortunately, many continuous data variables have bell-shaped histograms. The normal probability distribution provides a good model for this type of data.
1.2.3 Normal Probability Distribution
The density function of a normal distribution is unimodal, mound shaped, and symmetric.
There are many different normal distributions. They are distinguished from one another by
their population mean µ and their population standard deviation σ.
µ is the center of the distribution, right at the highest point of the density function.
At the values µ − σ and µ + σ the density curve has inflection points. Coming from −∞, the
curve changes from bending upward to bending downward at µ − σ and back to bending upward at µ + σ.
The function describing the density curve for a given mean µ and a given standard deviation
σ is
\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \]
Example:
If the normal distribution is used as a model for a specific situation, the mean and the standard
deviation have to be chosen for that situation. For example, the heights of students at a certain
university follow a normal distribution with µ = 178 cm and σ = 8 cm.
Given that we know the mean and the standard deviation of a normal distribution we can
locate intervals telling us where most of the values of the population are located.
The 68-95-99.7 Rule
In the Normal Distribution with mean µ and standard deviation σ:
• Approximately 68% of the observations fall within one standard deviation of the mean,
within [µ − σ, µ + σ].
• Approximately 95% of the observations fall within two standard deviations of the mean,
within [µ − 2σ, µ + 2σ].
• Approximately 99.7% of the observations fall within three standard deviations of the
mean, within [µ − 3σ, µ + 3σ].
Example:
Continuing the example above, this rule tells us to expect about
• 68% of the heights of students at the university to fall within [178 − 8, 178 + 8] = [170,
186] cm.
• 95% of the heights of students at the university to fall within [178 − 2(8), 178 + 2(8)]
= [162, 194] cm.
• 99.7% of the heights of students at the university to fall within [178 − 3(8), 178 + 3(8)]
= [154, 202] cm.
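The rule can be checked numerically. The following Python sketch is an illustration and assumes Python 3.8 or later, where the standard library provides statistics.NormalDist; the helper name within_k_sd is made up for this example. It computes the intervals for the height model and the exact area above each of them.

```python
from statistics import NormalDist

# Height model from the example: mu = 178 cm, sigma = 8 cm.
heights = NormalDist(mu=178, sigma=8)


def within_k_sd(dist, k):
    """Interval [mu - k*sigma, mu + k*sigma] and the probability of falling in it."""
    lo = dist.mean - k * dist.stdev
    hi = dist.mean + k * dist.stdev
    # Area between lo and hi = cumulative area up to hi minus cumulative area up to lo.
    return lo, hi, dist.cdf(hi) - dist.cdf(lo)


for k in (1, 2, 3):
    lo, hi, p = within_k_sd(heights, k)
    print(f"[{lo:.0f}, {hi:.0f}] cm contains {100 * p:.2f}% of the heights")
```

The percentages do not depend on µ and σ; replacing the height model by any other normal distribution gives the same three areas.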
Definition:
The normal distribution with µ = 0 and σ = 1 is called the Standard Normal Distribution.
In order to work with the normal distribution, we need to be able to calculate the following:
1. We must be able to use the normal distribution to compute probabilities, which are areas
under the normal curve.
2. We must be able to describe extreme values in the distribution, such as the largest 5%, the
smallest 1%, the most extreme 10% (which would include the largest 5% and the smallest
5%), that is we have to be able to calculate percentiles of any normal distribution.
We first look at how to compute these for a Standard Normal Distribution.
Since the normal distribution is a continuous distribution, the following holds for every normally
distributed random variable X and every number z:
P(X < z) = P(X ≤ z) = area under the curve from −∞ to z.
The area under the curve of a normally distributed random variable is hard to calculate. There
is no simple formula that can be used to calculate the area.
Appendix Table A (in the textbook) tabulates, for a standard normally distributed random variable and many different values z*, the area under the curve from −∞ to z*. These are
values of the so-called cumulative distribution function.
From now on use Z to indicate a standard normal distributed random variable (µ = 0 and
σ = 1). Using the table you find that,
• P(Z < −1.75) = P(Z ≤ −1.75) = 0.0401,
• P(Z > 1.34) = 1 − P(Z ≤ 1.34) = 1 − 0.9099 = 0.0901
[Figure: standard normal curve, shaded area equals 0.0901], and
• P(−1 ≤ Z ≤ 1) = P(Z ≤ 1) − P(Z ≤ −1) = 0.8413 − 0.1587 = 0.6826
[Figure: standard normal curve, shaded area equals 0.6826].
The first probability can be interpreted as meaning that, in a long sequence of observations
from a Standard Normal distribution, about 4.01% of the observed values will be smaller than
-1.75.
Try this for different values!
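If no table is at hand, the cumulative area can be computed numerically. The sketch below is an illustration, not part of the notes; it relies on the standard identity that expresses the standard normal cumulative distribution function through the error function, which Python provides as math.erf. The function name phi is chosen here for convenience.

```python
import math


def phi(z):
    """Area under the standard normal curve from minus infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


p_left = phi(-1.75)           # P(Z < -1.75)
p_right = 1 - phi(1.34)       # P(Z > 1.34), using the complement
p_between = phi(1) - phi(-1)  # P(-1 <= Z <= 1)

print(f"{p_left:.4f} {p_right:.4f} {p_between:.4f}")
```

Small differences in the last digit compared with table-based results are expected, because a table result that combines two rounded entries carries the rounding of both.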
Now we will look at how to identify extreme values.
Definition:
For any particular number r between 0 and 1, the rth percentile x_r of a distribution is a value
such that the cumulative area from −∞ to x_r is equal to r.
If X is a random variable, the rth percentile x_r satisfies the following equation:
P(X ≤ x_r) = r
To determine the percentiles for a standard normal distribution (denote them by z_r), we can
use Table A again.
• Suppose we want to describe the values that make up the smallest 2%. So we are looking
for the 0.02th percentile z_{0.02}, with
P(Z ≤ z_{0.02}) = 0.02.
So look in the body of Table A for the cumulative area 0.0200. The closest value you will
find is 0.0202, for z = −2.05. This is the best approximation you can find from the table.
The result is that the smallest 2% of the values of a standard normal distribution fall
within the interval (−∞, −2.05].
• Suppose now we are interested in the largest 5%. So we are looking for z*, with
P(Z > z*) = 0.05
Since Table A only gives areas to the left of a given value, the first step is to determine
the area to the left of z*:
P(Z ≤ z*) = 1 − 0.05 = 0.95
That tells us that in fact z* = z_{0.95}, the 0.95th percentile.
Checking the table we find the values 0.9495 and 0.9505, with 0.95 exactly in the middle, so
we take the average of the corresponding numbers and get
\[ z_{0.95} = \frac{1.64 + 1.65}{2} = 1.645 \]
• And now we are interested in the most extreme 5%. That means we need the interval that
contains the middle 95%. Since the normal distribution is symmetric, the most extreme 5%
can be split up into the lower 2.5% and the upper 2.5%. Symmetry about 0 implies that
−z_{0.025} = z_{0.975}.
In Table A we find z_{0.025} = −1.96, so that z_{0.975} = 1.96.
The result is that the 5% most extreme values lie outside the interval [−1.96, 1.96].
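Percentiles of the standard normal distribution can also be obtained numerically instead of by searching the body of the table. The sketch below is an illustration and assumes Python 3.8 or later, whose standard library offers statistics.NormalDist with an inv_cdf method (the inverse of the cumulative distribution function); the variable names are chosen here.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution: mu = 0, sigma = 1

z_02 = Z.inv_cdf(0.02)    # smallest 2% lie below this value
z_95 = Z.inv_cdf(0.95)    # largest 5% lie above this value
z_025 = Z.inv_cdf(0.025)  # lower bound of the middle 95%
z_975 = Z.inv_cdf(0.975)  # upper bound of the middle 95%

print(f"z_0.02  = {z_02:.2f}")
print(f"z_0.95  = {z_95:.3f}")
print(f"middle 95%: [{z_025:.2f}, {z_975:.2f}]")
```

Because inv_cdf is not restricted to a grid of two-decimal z values, no averaging of neighbouring table entries is needed.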
What remains is to determine those areas for any normal distribution using the results
for the standard normal distribution.
Lemma: If X is normally distributed with population mean µ and population standard deviation
σ, then the standardized random variable
\[ Z = \frac{X - \mu}{\sigma} \]
is normally distributed with µ = 0 and σ = 1.
Example: Let X be normally distributed with µ = 100 and σ = 5.
1. Calculate the area under the curve between 98 and 107 for the distribution chosen above.
\[ P(98 < X < 107) = P\left(\frac{98 - 100}{5} < \frac{X - 100}{5} < \frac{107 - 100}{5}\right) = P\left(-\frac{2}{5} < Z < \frac{7}{5}\right) = P(-0.4 < Z < 1.4) \]
This can be calculated using Table A: P(−0.4 < Z < 1.4) = P(Z < 1.4) − P(Z < −0.4) =
0.9192 − 0.3446 = 0.5746.
• The first step you have to take is to standardize the rv, so that the result is
standard normally distributed. The Lemma above tells you how it is done: subtract
the mean µ and divide by the standard deviation σ.
• In a second step you use the table for the standard normal distribution to find the
probability.
2. To find the 0.3th percentile for this distribution, that is x_{0.3}, use
\[ 0.3 = P(X \le x_{0.3}) = P\left(\frac{X - 100}{5} \le \frac{x_{0.3} - 100}{5}\right) = P\left(Z \le \frac{x_{0.3} - 100}{5}\right) \]
But then (x_{0.3} − 100)/5 equals the 0.3th percentile of a standard normal distribution, which
we can find in Table A. The cumulative area closest to 0.3000 in the table is 0.3015, for z = −0.52, so
\[ \frac{x_{0.3} - 100}{5} = -0.52 \]
This is equivalent to x_{0.3} = −0.52 · 5 + 100 = 100 − 2.60 = 97.4.
So the lower 30% of the values of a normally distributed random variable with mean µ = 100 and
σ = 5 fall into the interval (−∞, 97.4].
• Again, in a first step standardize the rv, so that the result is standard normally
distributed. This is done by subtracting the mean µ and dividing by the standard
deviation σ.
• In a second step use the table for the standard normal distribution to find the
percentile.
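The two-step procedure for probabilities (standardize, then look up the standard normal area) can be sketched in Python as follows. This is an illustration only: the function names standardize and normal_prob_between are hypothetical, and statistics.NormalDist from the standard library (Python 3.8 or later) stands in for the table.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # plays the role of the table


def standardize(x, mu, sigma):
    """Step 1: subtract the mean and divide by the standard deviation."""
    return (x - mu) / sigma


def normal_prob_between(a, b, mu, sigma):
    """P(a < X < b) for X normal with mean mu and standard deviation sigma."""
    z_a = standardize(a, mu, sigma)
    z_b = standardize(b, mu, sigma)
    # Step 2: look up the cumulative areas for the standardized bounds.
    return STANDARD_NORMAL.cdf(z_b) - STANDARD_NORMAL.cdf(z_a)


p = normal_prob_between(98, 107, mu=100, sigma=5)
print(f"z bounds: {standardize(98, 100, 5)}, {standardize(107, 100, 5)}")
print(f"P(98 < X < 107) = {p:.4f}")
```

The result can differ from a table-based answer in the last digit, since the table rounds each cumulative area to four decimals before the subtraction.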
Examples for Calculating Probabilities of a Standard Normal distribution
Assumption: The random variable Z is standard normally distributed, that is, population mean
µ = 0 and standard deviation σ = 1.
1. Calculate P(Z < 0.53):
In order to find the probability use Table A:
(a) Find 0.5 in the left-hand column; this determines the row.
(b) Then find 0.03 in the top row; this determines the column.
(c) Now read off the value where the row and the column intersect. In this example it is
0.7019.
Result: P(Z < 0.53) = 0.7019.
2. Calculate P(Z > −0.79):
Rule for Complements: P(Z > −0.79) = 1 − P(Z ≤ −0.79).
Now use Table A again:
(a) Find −0.7 in the left-hand column; this determines the row.
(b) Then find 0.09 in the top row; this determines the column.
(c) Now read off the value where the row and the column intersect. In this example it is
0.2148.
Result: P(Z > −0.79) = 1 − P(Z ≤ −0.79) = 1 − 0.2148 = 0.7852.
3. Calculate P(2.1 < Z < 4.79):
We have P(2.1 < Z < 4.79) = P(Z ≤ 4.79) − P(Z < 2.1).
Use Table A:
(a) You find that 4.79 is larger than the largest value in the table. That means that,
to the accuracy of the table, P(Z < 4.79) = 1.
(b) Find 2.1 in the left-hand column; this determines the row.
(c) Then find 0.00 in the top row; this determines the column.
(d) Now read off the value where the row and the column intersect. In this example it is
0.9821.
Result: P(2.1 < Z < 4.79) = P(Z ≤ 4.79) − P(Z < 2.1) = 1 − 0.9821 = 0.0179.
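The row-and-column lookup used in these examples can be mimicked in code. The Python sketch below is an illustration under stated assumptions: the function name table_a_lookup is made up, the table is assumed to be indexed by the first decimal of z (row) and the second decimal (column) with areas rounded to four decimals, and the areas are computed with statistics.NormalDist (Python 3.8 or later) rather than read from the textbook's Table A.

```python
from statistics import NormalDist


def table_a_lookup(z):
    """Split z into table row and column and return the rounded cumulative area."""
    z = round(z, 2)  # a printed table only has two decimals for z
    sign = -1 if z < 0 else 1
    hundredths = round(abs(z) * 100)       # e.g. 0.53 -> 53
    row = sign * (hundredths // 10) / 10   # e.g. 53 -> 0.5 (left-hand column)
    column = (hundredths % 10) / 100       # e.g. 53 -> 0.03 (top row)
    area = round(NormalDist().cdf(z), 4)   # value where row and column intersect
    return row, column, area


for z in (0.53, -0.79, 2.1, 4.79):
    row, column, area = table_a_lookup(z)
    print(f"z = {z}: row {row}, column {column:.2f}, area {area:.4f}")
```

For a value beyond the range of a printed table, such as 4.79, the computed area rounds to 1.0000, which is consistent with treating that probability as 1.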
Examples for Calculating Probabilities of a Normal distribution (not necessarily
standard)
Assumption: The random variable X is normally distributed with mean µ = 80 and standard
deviation σ = 10.
1. Calculate P(X < 100):
First standardize:
\[ P(X < 100) = P\left(\frac{X - \mu}{\sigma} < \frac{100 - \mu}{\sigma}\right) = P\left(Z < \frac{100 - 80}{10}\right) = P(Z < 2) \]
Now find this from Table A applying the method from above and find P(Z < 2) =
0.9772.
2. Calculate P(X > 79):
Rule for Complements: P(X > 79) = 1 − P(X ≤ 79).
First standardize:
\[ P(X \le 79) = P\left(\frac{X - \mu}{\sigma} \le \frac{79 - \mu}{\sigma}\right) = P\left(Z \le \frac{79 - 80}{10}\right) = P(Z \le -0.1) \]
Now use Table A again:
(a) Find −0.1 in the left-hand column; this determines the row.
(b) Then find 0.00 in the top row; this determines the column.
(c) Now read off the value where the row and the column intersect. In this example it is
0.4602.
Result: P(X > 79) = 1 − P(X ≤ 79) = 1 − 0.4602 = 0.5398.
3. Calculate P(70 < X < 80):
We have P(70 < X < 80) = P(X ≤ 80) − P(X < 70).
First standardize:
\[ P(X \le 80) - P(X < 70) = P\left(\frac{X - \mu}{\sigma} < \frac{80 - \mu}{\sigma}\right) - P\left(\frac{X - \mu}{\sigma} < \frac{70 - \mu}{\sigma}\right) = P(Z < 0) - P(Z < -1) = 0.5 - 0.1587 = 0.3413 \]
Examples for Calculating Percentiles of a Standard Normal distribution
Assumption: The random variable Z is standard normally distributed, that is, population mean
µ = 0 and standard deviation σ = 1.
1. Calculate the 0.9th percentile z_{0.9}, with P(Z < z_{0.9}) = 0.9:
In order to find the percentile use Table A:
(a) Find 0.9 in the body of the table; the closest value you find is 0.8997.
(b) Go to the left and find 1.2 in the left-hand column.
(c) Go to the top and find 0.08 in the top row.
Result: The 0.9th percentile equals z_{0.9} = 1.28.
2. Find the interval which contains the middle 50%:
The middle 50% of the standard normal distribution can be found between the 0.25th
percentile and the 0.75th percentile.
Now use Table A again. For the 0.25th percentile:
(a) Find 0.25 in the body of the table; the closest value you find is 0.2514.
(b) Go to the left and find −0.6 in the left-hand column.
(c) Go to the top and find 0.07 in the top row.
For the 0.75th percentile:
(a) Find 0.75 in the body of the table; the closest value you find is 0.7486.
(b) Go to the left and find 0.6 in the left-hand column.
(c) Go to the top and find 0.07 in the top row.
Result: The middle 50% of a standard normal distribution can be found between −0.67
and 0.67.
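Searching the body of the table for the closest area can be imitated in code. The sketch below is an illustration, not the textbook's table: it assumes a grid of z values from −3.49 to 3.49 in steps of 0.01 (a common layout, but an assumption here), computes the areas with statistics.NormalDist (Python 3.8 or later), and uses a made-up function name closest_table_entry.

```python
from statistics import NormalDist


def closest_table_entry(target_area):
    """Return the grid value z whose cumulative area is closest to target_area,
    together with that area rounded to four decimals as in a printed table."""
    std_normal = NormalDist()
    grid = [i / 100 for i in range(-349, 350)]  # -3.49, -3.48, ..., 3.49
    best_z = min(grid, key=lambda z: abs(std_normal.cdf(z) - target_area))
    return best_z, round(std_normal.cdf(best_z), 4)


for r in (0.9, 0.25, 0.75):
    z, area = closest_table_entry(r)
    print(f"target {r}: closest area {area:.4f} at z = {z:.2f}")
```

By the symmetry of the standard normal distribution, the entries found for r and 1 − r are mirror images of each other.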
Examples for Calculating Percentiles of ANY Normal distribution
Assumption: The random variable X is normally distributed, with mean µ = 50 and σ = 20.
1. Calculate the 0.1th percentile x_{0.1}, with P(X < x_{0.1}) = 0.1:
First standardize:
\[ P(X < x_{0.1}) = P\left(\frac{X - \mu}{\sigma} < \frac{x_{0.1} - \mu}{\sigma}\right) = P\left(Z < \frac{x_{0.1} - 50}{20}\right) = 0.1 \]
For this equation to hold, (x_{0.1} − 50)/20 has to be the 0.1th percentile of the standard normal
distribution. So
\[ z_{0.1} = \frac{x_{0.1} - 50}{20} \iff x_{0.1} = 50 + 20 z_{0.1} \]
In order to find the percentile z_{0.1} use Table A:
(a) Find 0.1 in the body of the table; the closest value you find is 0.1003.
(b) Go to the left and find −1.2 in the left-hand column.
(c) Go to the top and find 0.08 in the top row.
Result: The 0.1th percentile of the standard normal distribution equals z_{0.1} = −1.28, so that
x_{0.1} = 50 + 20(−1.28) = 50 − 25.6 = 24.4.
2. Find the interval which contains the middle 50% of this distribution:
The middle 50% of the normal distribution can be found between the 0.25th percentile
and the 0.75th percentile.
First standardize:
\[ P(X < x_{0.25}) = P\left(\frac{X - \mu}{\sigma} < \frac{x_{0.25} - \mu}{\sigma}\right) = P\left(Z < \frac{x_{0.25} - 50}{20}\right) = 0.25 \]
and
\[ P(X < x_{0.75}) = P\left(\frac{X - \mu}{\sigma} < \frac{x_{0.75} - \mu}{\sigma}\right) = P\left(Z < \frac{x_{0.75} - 50}{20}\right) = 0.75 \]
For these equations to hold, (x_{0.25} − 50)/20 has to be the 0.25th percentile of the standard normal distribution and (x_{0.75} − 50)/20 has to be the 0.75th percentile of the standard normal
distribution.
So
\[ z_{0.25} = \frac{x_{0.25} - 50}{20} \iff x_{0.25} = 50 + 20 z_{0.25} = 50 + 20(-0.67) = 36.6 \]
and
\[ z_{0.75} = \frac{x_{0.75} - 50}{20} \iff x_{0.75} = 50 + 20 z_{0.75} = 50 + 20(0.67) = 63.4 \]
Result: The middle 50% of this normal distribution can be found between 36.6 and 63.4.
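The relation x_r = µ + σ z_r used in both parts can be turned into a short Python sketch. It is an illustration only; the function names normal_percentile and middle_interval are hypothetical, and statistics.NormalDist (Python 3.8 or later) supplies the standard normal percentile z_r instead of Table A.

```python
from statistics import NormalDist


def normal_percentile(r, mu, sigma):
    """rth percentile of a normal distribution via x_r = mu + sigma * z_r."""
    z_r = NormalDist().inv_cdf(r)  # percentile of the standard normal distribution
    return mu + sigma * z_r


def middle_interval(fraction, mu, sigma):
    """Interval containing the middle `fraction` of the distribution."""
    tail = (1 - fraction) / 2  # area cut off on each side
    return normal_percentile(tail, mu, sigma), normal_percentile(1 - tail, mu, sigma)


x_10 = normal_percentile(0.10, mu=50, sigma=20)
lo, hi = middle_interval(0.50, mu=50, sigma=20)

print(f"x_0.1 = {x_10:.2f}")
print(f"middle 50%: [{lo:.2f}, {hi:.2f}]")
```

The printed values differ slightly from hand calculations based on the table, because the table only provides z_r rounded to two decimals and that rounding is multiplied by σ = 20.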