5
Joint Probability
Distributions and
Random Samples
Copyright © Cengage Learning. All rights reserved.
5.1
Jointly Distributed
Random Variables
Two Discrete Random Variables
The probability mass function (pmf) of a single discrete rv X
specifies how much probability mass is placed on each
possible X value.
The joint pmf of two discrete rv’s X and Y describes how
much probability mass is placed on each possible pair of
values (x, y).
3
Two Discrete Random Variables
Definition
4
Example 5.1
Anyone who purchases an insurance policy for a home or
automobile must specify a deductible amount, the amount
of loss to be absorbed by the policyholder before the
insurance company begins paying out.
Suppose that a particular company offers auto deductible
amounts of $100, $500, and $1000, and homeowner
deductible amounts of $500, $1000, and $5000. Consider
randomly selecting someone who has both auto and
homeowner insurance with this company, and let X = the
amount of the auto policy deductible and Y = the amount of
the homeowner policy deductible.
5
Example 5.1
cont’d
The joint pmf of these two variables appears in the
accompanying joint probability table:
According to this joint pmf, there are nine possible (X, Y)
pairs: (100, 500), (100, 1000), … , and finally (1000, 5000).
The probability of (100, 500) is p(100, 500) = P(X = 100, Y
= 500) = .30. Clearly p(x, y) ≥ 0, and it is easily confirmed
that the sum of the nine displayed probabilities is 1.
6
Example 5.1
cont’d
The probability P(X = Y) is computed by summing p(x, y)
over the two (x, y) pairs for which the two deductible
amounts are identical:
P(X = Y) = p(500, 500) + p(1000, 1000) = .15 + .10 = .25
Similarly, the probability that the auto deductible amount is
at least $500 is the sum of all probabilities corresponding to
(x, y) pairs for which x ≥ 500; this is the sum of the
probabilities in the bottom two rows of the joint probability
table:
P(X ≥ 500) = .15 + .20 + .05 + .10 + .10 + .05 = .65
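The full joint probability table is shown only as an image in the slides, but all nine entries can be pieced together from the values quoted in Examples 5.1, 5.2, and 5.15–5.17. The short Python sketch below uses that reconstructed table (an assumption, not a verbatim copy of the slide) to re-derive the two probabilities computed above.

# Joint pmf p(x, y) for (X, Y) = (auto deductible, homeowner deductible),
# reconstructed from values quoted in the text (the table itself is an image).
joint_pmf = {
    (100, 500): .30, (100, 1000): .05, (100, 5000): .00,
    (500, 500): .15, (500, 1000): .20, (500, 5000): .05,
    (1000, 500): .10, (1000, 1000): .10, (1000, 5000): .05,
}

# The nine probabilities must sum to 1.
assert abs(sum(joint_pmf.values()) - 1.0) < 1e-9

# P(X = Y): pairs with identical deductible amounts.
print(sum(p for (x, y), p in joint_pmf.items() if x == y))    # 0.25

# P(X >= 500): pairs whose auto deductible is at least 500.
print(sum(p for (x, y), p in joint_pmf.items() if x >= 500))  # 0.65 (up to rounding)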
7
Two Discrete Random Variables
Definition
8
Example 5.2
Example 5.1 continued…
The possible X values are x = 100, x = 500, and x = 1000, so
computing row totals in the joint probability table yields
9
Example 5.2
cont’d
The marginal pmf of X is then
From this pmf, P(X ≥ 500) = .40 + .25 = .65, which we
already calculated in Example 5.1. Similarly, the marginal
pmf of Y is obtained from the column totals as
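A minimal continuation of the sketch above (its joint_pmf dictionary is repeated here so this snippet runs on its own): the marginal pmfs are just the row totals for X and the column totals for Y.

from collections import defaultdict

joint_pmf = {(100, 500): .30, (100, 1000): .05, (100, 5000): .00,
             (500, 500): .15, (500, 1000): .20, (500, 5000): .05,
             (1000, 500): .10, (1000, 1000): .10, (1000, 5000): .05}

# Row totals give pX, column totals give pY.
px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint_pmf.items():
    px[x] += p
    py[y] += p

print(dict(px))              # pX: .35, .40, .25 at x = 100, 500, 1000 (up to rounding)
print(dict(py))              # pY: .55, .35, .10 at y = 500, 1000, 5000 (up to rounding)
print(px[500] + px[1000])    # P(X >= 500) = .65, as computed in Example 5.1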
10
Two Continuous Random Variables
The probability that the observed value of a continuous rv X
lies in a one-dimensional set A (such as an interval) is
obtained by integrating the pdf f(x) over the set A.
Similarly, the probability that the pair (X, Y) of continuous
rv’s falls in a two-dimensional set A (such as a rectangle) is
obtained by integrating a function called the joint density
function.
11
Two Continuous Random Variables
Definition
12
Two Continuous Random Variables
We can think of f(x, y) as specifying a surface at height
f(x, y) above the point (x, y) in a three-dimensional
coordinate system.
Then P[(X, Y) ∈ A] is the volume underneath this surface
and above the region A, analogous to the area under a
curve in the case of a single rv.
13
Two Continuous Random Variables
This is illustrated in Figure 5.1.
P[(X, Y) ∈ A] = volume under density surface above A
Figure 5.1
14
Example 5.3
A bank operates both a drive-up facility and a walk-up
window. On a randomly selected day, let X = the proportion
of time that the drive-up facility is in use (at least one
customer is being served or waiting to be served) and
Y = the proportion of time that the walk-up window is in
use.
Then the set of possible values for (X, Y) is the rectangle
D = {(x, y): 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}.
15
Example 5.3
cont’d
Suppose the joint pdf of (X, Y) is given by
To verify that this is a legitimate pdf, note that f(x, y) ≥ 0
and
16
Example 5.3
cont’d
The probability that neither facility is busy more than
one-quarter of the time is
17
Example 5.3
cont’d
18
Two Continuous Random Variables
The marginal pdf of each variable can be obtained in a
manner analogous to what we did in the case of two
discrete variables.
The marginal pdf of X at the value x results from holding x
fixed in the pair (x, y) and integrating the joint pdf over y.
Integrating the joint pdf with respect to x gives the marginal
pdf of Y.
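The joint pdf used in Examples 5.3 and 5.4 appears only as an image in these slides, so the sketch below uses a stand-in density, f(x, y) = x + y on the unit square, purely to illustrate the mechanics: checking that a candidate joint pdf integrates to 1, computing a rectangle probability, and integrating out one variable to get a marginal pdf, all with scipy's numerical integrators.

from scipy.integrate import dblquad, quad

# Illustrative density (NOT the one from Example 5.3): f(x, y) = x + y on [0,1]^2.
def f(x, y):
    return x + y

# dblquad integrates func(y, x) with y as the inner variable.
total, _ = dblquad(lambda y, x: f(x, y), 0, 1, lambda x: 0, lambda x: 1)
print(total)    # 1.0, so this f is a legitimate joint pdf

# P(X <= 1/4, Y <= 1/4): integrate over the sub-rectangle.
p, _ = dblquad(lambda y, x: f(x, y), 0, 0.25, lambda x: 0, lambda x: 0.25)
print(p)        # 0.015625 for this illustrative density

# Marginal pdf of X at a point: hold x fixed and integrate out y.
def f_X(x):
    value, _ = quad(lambda y: f(x, y), 0, 1)
    return value

print(f_X(0.5))  # x + 1/2 evaluated at x = 0.5, i.e. 1.0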
19
Two Continuous Random Variables
Definition
20
Example 5.4
The marginal pdf of X, which gives the probability
distribution of busy time for the drive-up facility without
reference to the walk-up window, is
for 0 ≤ x ≤ 1 and 0 otherwise. The marginal pdf of Y is
21
Example 5.4
Then
22
Independent Random Variables
In many situations, information about the observed value of
one of the two variables X and Y gives information about
the value of the other variable.
In Example 5.1, the marginal probability of X at x = 100
is .35 and at x = 1000 is .25. However, if we learn that
Y = 5000, the last column of the joint probability table tells
us that X cannot be 100 and that the other two possibilities,
500 and 1000, are now equally likely. Thus knowing the
value of Y changes the distribution of X; there is dependence
between the two variables.
In Chapter 2, we pointed out that one way of defining
independence of two events is via the condition
P(A ∩ B) = P(A) ∙ P(B).
23
Independent Random Variables
Here is an analogous definition for the independence of two
rv’s.
Definition
24
Independent Random Variables
The definition says that two variables are independent if
their joint pmf or pdf is the product of the two marginal
pmf’s or pdf’s.
Intuitively, independence says that knowing the value of
one of the variables does not provide additional information
about what the value of the other variable might be.
25
Example 5.1
cont’d
The joint pmf of these two variables appears in the
accompanying joint probability table:
p(1000, 5000) = .05 ≠ (.25)(.10) = pX(1000) ∙ pY(5000)
so X and Y are not independent.
Independence of X and Y requires that every entry in the
joint probability table be the product of the corresponding
row and column marginal probabilities.
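A one-line check of this criterion, assuming the reconstructed joint_pmf and the marginals px and py from the sketches in Section 5.1: independence requires every cell to equal the product of its row and column marginals.

# Independence holds only if p(x, y) = pX(x) * pY(y) for every cell.
independent = all(abs(p - px[x] * py[y]) < 1e-9
                  for (x, y), p in joint_pmf.items())
print(independent)   # False: p(1000, 5000) = .05 but pX(1000)*pY(5000) = .025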
26
Independent Random Variables
Independence of two random variables is most useful when
the description of the experiment under study suggests that
X and Y have no effect on one another.
Then once the marginal pmf’s or pdf’s have been specified,
the joint pmf or pdf is simply the product of the two
marginal functions. It follows that
P(a ≤ X ≤ b, c ≤ Y ≤ d) = P(a ≤ X ≤ b) ∙ P(c ≤ Y ≤ d)
27
More Than Two Random Variables
To model the joint behavior of more than two random
variables, we extend the concept of a joint distribution of
two variables.
Definition
28
Example 5.9
A binomial experiment consists of n dichotomous
(success–failure), homogeneous (constant success
probability), independent trials.
Now consider a trinomial experiment in which each of the n
trials can result in one of three possible outcomes. For
example, each successive customer at a store might pay
with cash, a credit card, or a debit card. The trials are
assumed independent.
Let 𝑝1 = P(trial results in a type 1 outcome) and define 𝑝2
and 𝑝3 analogously for type 2 and type 3 outcomes. The
random variables of interest here are 𝑋𝑖 = the number of
trials that result in a type i outcome for i = 1, 2, 3.
29
Example 5.9
In n = 10 trials, the probability that the first five are type 1
outcomes, the next three are type 2, and the last two are
type 3—that is, the probability of the experimental outcome
1111122233—is p1^5 ∙ p2^3 ∙ p3^2.
This is also the probability of the outcome 1122311123, and
in fact the probability of any outcome that has exactly five
1’s, three 2’s, and two 3’s.
Now to determine the probability P(𝑋1 = 5, 𝑋2 = 3, and 𝑋3
= 2), we have to count the number of outcomes that have
exactly five 1’s, three 2’s, and two 3’s.
30
Example 5.9
First, there are (10 choose 5) ways to choose five of the
trials to be the type 1 outcomes. Now from the remaining
five trials, we choose three to be the type 2 outcomes,
which can be done in (5 choose 3) ways.
This determines the remaining two trials, which consist of
type 3 outcomes. So the total number of ways of choosing
five 1's, three 2's, and two 3's is
(10 choose 5) ∙ (5 choose 3) = 252 ∙ 10 = 2520.
31
Example 5.9
Thus we see that
P(X1 = 5, X2 = 3, X3 = 2) = [10!/(5! 3! 2!)] ∙ p1^5 ∙ p2^3 ∙ p3^2 = 2520 ∙ p1^5 ∙ p2^3 ∙ p3^2.
Generalizing this to n trials gives
p(x1, x2, x3) = [n!/(x1! x2! x3!)] ∙ p1^x1 ∙ p2^x2 ∙ p3^x3
for x1 = 0, 1, 2, …; x2 = 0, 1, 2, …; x3 = 0, 1, 2, … such that x1 + x2 + x3 = n.
Notice that whereas there are three random variables here,
the third variable 𝑥3 is actually redundant. For example, in
the case n = 10, having 𝑥1 = 5 and 𝑥2 = 3 implies that 𝑥3 =
2 (just as in a binomial experiment there are actually two
rv’s—the number of successes and number of failures—but
the latter is redundant).
32
Example 5.9
As a specific example, the genotype of a pea section
can be either AA, Aa, or aa.
A simple genetic model specifies P(AA) = .25, P(Aa) = .50,
and P(aa) = .25.
If the alleles of 10 independently obtained sections are
determined, the probability that exactly five of these are Aa
and two are AA is
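A quick check of this trinomial probability using only the standard library; the function below is the pmf derived earlier in this example, and the arguments encode two AA, five Aa, and three aa sections out of n = 10.

from math import factorial

def trinomial_pmf(x1, x2, x3, p1, p2, p3):
    # n!/(x1! x2! x3!) * p1^x1 * p2^x2 * p3^x3, with n = x1 + x2 + x3
    n = x1 + x2 + x3
    ways = factorial(n) // (factorial(x1) * factorial(x2) * factorial(x3))
    return ways * p1**x1 * p2**x2 * p3**x3

# P(2 sections AA, 5 sections Aa, 3 sections aa) with P(AA)=.25, P(Aa)=.50, P(aa)=.25
print(trinomial_pmf(2, 5, 3, .25, .50, .25))   # approximately .0769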
33
Example 5.9
A natural extension of the trinomial scenario is an
experiment consisting of n independent and identical trials,
in which each trial can result in any one of r possible
outcomes.
Let 𝑝𝑖 = P(outcome i on any particular trial), and define
random variables by 𝑋𝑖 = the number of trials resulting in
outcome i (i = 1, … , r).
34
Example 5.9
This is called a multinomial experiment, and the joint pmf
of X1, …, Xr is called the multinomial distribution. An
argument analogous to the one used to derive the trinomial
pmf gives the multinomial pmf as
p(x1, …, xr) = [n!/(x1! ⋯ xr!)] ∙ p1^x1 ⋯ pr^xr
for integers xi ≥ 0 with x1 + ⋯ + xr = n.
35
More Than Two Random Variables
The notion of independence of more than two random
variables is similar to the notion of independence of more
than two events.
Definition
36
More Than Two Random Variables
Thus if the variables are independent with n = 4, then the
joint pmf or pdf of any two variables is the product of the
two marginals, and similarly for any three variables and all
four variables together.
Intuitively, independence means that learning the values of
some variables doesn’t change the distribution of the
remaining variables.
Most importantly, once we are told that n variables are
independent, then the joint pmf or pdf is the product of the
n marginals.
37
Conditional Distributions
Suppose X = the number of major defects in a randomly
selected new automobile and Y = the number of minor
defects in that same auto.
If we learn that the selected car has one major defect, what
now is the probability that the car has at most three minor
defects—that is, what is P(Y ≤ 3 | X = 1)?
38
Conditional Distributions
Similarly, if X and Y denote the lifetimes of the front and
rear tires on a motorcycle, and it happens that X = 10,000
miles, what now is the probability that Y is at most 15,000
miles, and what is the expected lifetime of the rear tire
“conditional on” this value of X?
Questions of this sort can be answered by studying
conditional probability distributions.
39
Conditional Distributions
Definition
40
Conditional Distributions
Notice that the definition of fY | X(y | x) parallels that of
P(B | A), the conditional probability that B will occur, given
that A has occurred.
Once the conditional pdf or pmf has been determined,
questions of the type posed at the outset of this subsection
can be answered by integrating or summing over an
appropriate set of Y values.
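For a discrete joint distribution the conditional pmf is just a rescaled row or column of the joint table. The sketch below, reusing the reconstructed insurance joint_pmf and marginal py from Section 5.1, reproduces the earlier observation that, given Y = 5000, X cannot be 100 and is equally likely to be 500 or 1000.

# Conditional pmf of X given Y = 5000: p(x | y) = p(x, y) / pY(y).
y0 = 5000
cond_x = {x: joint_pmf[(x, y0)] / py[y0] for x in (100, 500, 1000)}
print(cond_x)   # {100: 0.0, 500: 0.5, 1000: 0.5}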
41
Example 5.12
Reconsider the situation of Examples 5.3 and 5.4 involving
X = the proportion of time that a bank’s drive-up facility is
busy and Y = the analogous proportion for the walk-up
window.
The conditional pdf of Y given that X = .8 is
42
Example 5.12
The probability that the walk-up facility is busy at most half
the time given that X = .8 is then
43
Example 5.12
cont’d
Using the marginal pdf of Y gives P(Y ≤ .5) = .350. Also
E(Y) = .6, whereas the expected proportion of time that the
walk-up facility is busy given that X = .8 (a conditional
expectation) is
44
5.2
Expected Values,
Covariance, and Correlation
45
Expected Values, Covariance, and Correlation
Any function h(X) of a single rv X is itself a random
variable.
However, to compute E[h(X)], it is not necessary to obtain
the probability distribution of h(X); instead, E[h(X)] is
computed as a weighted average of h(x) values, where the
weight function is the pmf p(x) or pdf f(x) of X.
A similar result holds for a function h(X, Y) of two jointly
distributed random variables.
46
Expected Values, Covariance, and Correlation
Proposition
47
Example 5.13
Five friends have purchased tickets to a certain concert. If
the tickets are for seats 1–5 in a particular row and the
tickets are randomly distributed among the five, what
is the expected number of seats separating any particular
two of the five?
Let X and Y denote the seat numbers of the first and
second individuals, respectively. Possible (X, Y) pairs are
{(1, 2), (1, 3), . . . , (5, 4)}, and the joint pmf of (X, Y) is
p(x, y) = 1/20 for x = 1, . . . , 5; y = 1, . . . , 5; x ≠ y
and p(x, y) = 0 otherwise.
48
Example 5.13
cont’d
The number of seats separating the two individuals is
h(X, Y) = |X – Y| – 1.
The accompanying table gives h(x, y) for each possible
(x, y) pair.
49
Example 5.13
cont’d
Thus E[h(X, Y)] = Σ Σ h(x, y) ∙ p(x, y) = (1/20) Σ (|x – y| – 1) = 1.
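A brute-force verification of this expected value: enumerate the twenty equally likely ordered seat pairs, weight h(x, y) = |x − y| − 1 by 1/20, and sum.

from itertools import permutations

# All ordered pairs of distinct seats; each has probability 1/20.
pairs = list(permutations(range(1, 6), 2))
expected_h = sum((abs(x - y) - 1) / len(pairs) for x, y in pairs)
print(expected_h)   # 1.0: on average one seat separates the two individuals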
50
Covariance
When two random variables X and Y are not independent,
it is frequently of interest to assess how strongly they are
related to one another.
Definition
51
Covariance
That is, since X – μX and Y – μY are the deviations of the
two variables from their respective mean values, the
covariance is the expected product of deviations. Note
that Cov(X, X) = E[(X – μX)²] = V(X).
The rationale for the definition is as follows.
Suppose X and Y have a strong positive relationship to one
another, by which we mean that large values of X tend to
occur with large values of Y and small values of X with
small values of Y.
52
Covariance
Then most of the probability mass or density will be
associated with (x – μX) and (y – μY), either both positive
(both X and Y above their respective means) or both
negative, so the product (x – μX)(y – μY) will tend to be
positive.
Thus for a strong positive relationship, Cov(X, Y) should be
quite positive.
For a strong negative relationship, the signs of (x – μX) and
(y – μY) will tend to be opposite, yielding a negative
product.
53
Covariance
Thus for a strong negative relationship, Cov(X, Y) should
be quite negative.
If X and Y are not strongly related, positive and negative
products will tend to cancel one another, yielding a
covariance near 0.
54
Covariance
Figure 5.4 illustrates the different possibilities. The
covariance depends on both the set of possible pairs and
the probabilities. In Figure 5.4, the probabilities could be
changed without altering the set of possible pairs, and this
could drastically change the value of Cov(X, Y).
p(x, y) = 1/10 for each of ten pairs corresponding to indicated points:
(a) positive covariance;
(b) negative covariance;
Figure 5.4
(c) covariance near zero
55
Example 5.15
The joint and marginal pmf’s for
X = automobile policy deductible amount and
Y = homeowner policy deductible amount in Example 5.1
were
56
Example 5.15
cont’d
Therefore,
57
Covariance
The following shortcut formula for Cov(X, Y) simplifies the
computations.
Proposition
According to this formula, no intermediate subtractions are
necessary; only at the end of the computation is μX ∙ μY
subtracted from E(XY). The proof involves expanding
(X – μX)(Y – μY) and then carrying the summation or integration
through to each individual term.
58
Correlation
Definition
59
Example 5.17
It is easily verified that in the insurance scenario of
Example 5.15, E(X²) = 353,500, so σX² = 353,500 – (485)² =
118,275 and σX = 343.911; similarly E(Y²) = 2,987,500,
σY² = 2,987,500 – (1125)² = 1,721,875, and σY = 1312.202.
This gives ρ = Cov(X, Y)/(σX ∙ σY) ≈ .30.
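A sketch of the whole calculation, assuming the reconstructed joint_pmf and the marginals px and py from Section 5.1: it evaluates the shortcut formula for Cov(X, Y) and then the correlation coefficient, matching the moments listed above.

from math import sqrt

mu_x = sum(x * p for x, p in px.items())                      # 485
mu_y = sum(y * p for y, p in py.items())                      # 1125
e_xy = sum(x * y * p for (x, y), p in joint_pmf.items())      # E(XY)

cov = e_xy - mu_x * mu_y                                      # shortcut formula
sd_x = sqrt(sum(x**2 * p for x, p in px.items()) - mu_x**2)   # about 343.9
sd_y = sqrt(sum(y**2 * p for y, p in py.items()) - mu_y**2)   # about 1312.2

print(cov, cov / (sd_x * sd_y))   # about 136875 and 0.30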
60
Correlation
The following proposition shows that ρ remedies the defect
of Cov(X, Y) and also suggests how to recognize the
existence of a strong (linear) relationship.
Proposition
61
Correlation
If we think of p(x, y) or f(x, y) as prescribing a mathematical
model for how the two numerical variables X and Y are
distributed in some population (height and weight, verbal
SAT score and quantitative SAT score, etc.), then ρ is a
population characteristic or parameter that measures how
strongly X and Y are related in the population.
In Chapter 12, we will consider taking a sample of pairs (x1,
y1), . . . , (xn, yn) from the population.
The sample correlation coefficient r will then be defined and
used to make inferences about ρ.
62
Correlation
The correlation coefficient ρ is actually not a completely
general measure of the strength of a relationship.
Proposition
63
Correlation
This proposition says that ρ is a measure of the degree of
linear relationship between X and Y, and only when the
two variables are perfectly related in a linear manner will
ρ be as positive or negative as it can be.
However, if |ρ| is much less than 1, there may still be a strong
relationship between the two variables, just one that is not
linear.
And even if |ρ| is close to 1, it may be that the relationship
is really nonlinear but can be well approximated by a
straight line.
64
Example 5.18
Let X and Y be discrete rv’s with joint pmf
The points that receive positive
probability mass are identified
on the (x, y) coordinate system
in Figure 5.5.
The population of pairs for Example 18
Figure 5.5
65
Example 5.18
cont’d
It is evident from the figure that the value of X is completely
determined by the value of Y and vice versa, so the two
variables are completely dependent. However, by
symmetry μX = μY = 0 and
E(XY) = 0.
The covariance is then Cov(X, Y) = E(XY) – μX ∙ μY = 0 and
thus ρX,Y = 0. Although there is perfect dependence, there
is also complete absence of any linear relationship!
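The exact pmf of this example is shown only in the figure, so the sketch below uses an analogous construction (X uniform on {−2, −1, 1, 2} and Y = X², an assumption made purely for illustration): Y is a deterministic function of X, yet the covariance, and hence the correlation, is exactly 0.

# Perfect (nonlinear) dependence with zero correlation.
pmf = {(x, x**2): 0.25 for x in (-2, -1, 1, 2)}    # Y = X^2, X uniform

mu_x = sum(x * p for (x, y), p in pmf.items())     # 0 by symmetry
mu_y = sum(y * p for (x, y), p in pmf.items())     # 2.5
e_xy = sum(x * y * p for (x, y), p in pmf.items()) # E(X^3) = 0 by symmetry

print(e_xy - mu_x * mu_y)   # Cov(X, Y) = 0.0, so rho = 0 despite full dependence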
66
Correlation
A value of 𝜌 near 1 does not necessarily imply that
increasing the value of X causes Y to increase. It implies
only that large X values are associated with large Y values.
For example, in the population of children, vocabulary size
and number of cavities are quite positively correlated, but it
is certainly not true that cavities cause vocabulary to grow.
Instead, the values of both these variables tend to increase
as the value of age, a third variable, increases. For children
of a fixed age, there is probably a low correlation between
number of cavities and vocabulary size.
In summary, association (a high correlation) is not the
same as causation.
67
The Bivariate Normal Distribution
Just as the most useful univariate distribution in statistical
practice is the normal distribution, the most useful joint
distribution for two rv’s X and Y is the bivariate normal
distribution. The pdf is somewhat complicated:
68
The Bivariate Normal Distribution
A graph of this pdf, the density surface, appears in Figure
5.6. It follows (after some tricky integration) that the
marginal distribution of X is normal with mean value 𝜇1 and
standard deviation 𝜎1 , and similarly the marginal
distribution of Y is normal with mean 𝜇2 and standard
deviation 𝜎2 . The fifth parameter of the distribution
is 𝜌, which can be shown to be the correlation coefficient
between X and Y.
69
The Bivariate Normal Distribution
It is not at all straightforward to integrate the bivariate
normal pdf in order to calculate probabilities. Instead,
selected software packages employ numerical integration
techniques for this purpose.
Many students applying for college take the SAT, which for
a few years consisted of three components: Critical
Reading, Mathematics, and Writing. While some colleges
used all three components to determine admission, many
only looked at the first two (reading and math).
70
The Bivariate Normal Distribution
Let X and Y denote the Critical Reading and Mathematics
scores, respectively, for a randomly selected student.
According to the College Board website, the population of
students taking the exam in Fall 2012 had the following
characteristics:
Suppose that X and Y have (approximately, since both
variables are discrete) a bivariate normal distribution with
correlation coefficient 𝜌 = .25. The Matlab software
package gives P(X ≤ 650, Y ≤ 650) = P(both scores are at
most 650) = .8097.
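The College Board summary table referred to above is not reproduced in this transcript, so the means and standard deviations in the sketch below are placeholders (assumptions to be replaced by the values from that table); only ρ = .25 comes from the text. The point is the mechanics: P(X ≤ 650, Y ≤ 650) is the bivariate normal cdf evaluated at (650, 650), which scipy computes by numerical integration.

from scipy.stats import multivariate_normal

mu1, sigma1 = 500.0, 115.0   # Critical Reading (placeholder values)
mu2, sigma2 = 515.0, 115.0   # Mathematics      (placeholder values)
rho = 0.25                   # correlation stated in the text

cov = [[sigma1**2,             rho * sigma1 * sigma2],
       [rho * sigma1 * sigma2, sigma2**2]]
bvn = multivariate_normal(mean=[mu1, mu2], cov=cov)

print(bvn.cdf([650, 650]))   # P(both scores at most 650) for these parameters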
71
The Bivariate Normal Distribution
It can also be shown that the conditional distribution of Y
given that X = x is normal. This can be seen geometrically
by slicing the density surface with a plane perpendicular to
the (x, y) plane and passing through the value x on the x axis;
the result is a normal curve sketched out on the slicing plane.
The conditional mean value is
μY|X=x = μ2 + ρ σ2 (x – μ1)/σ1,
a linear function of x, and the conditional variance is
σ²Y|X=x = σ2²(1 – ρ²).
The closer the correlation coefficient is to 1 or –1, the less
variability there is in the conditional distribution. Analogous
results hold for the conditional distribution of X given that
Y = y.
72
The Bivariate Normal Distribution
The bivariate normal distribution can be generalized to the
multivariate normal distribution. Its density function is quite
complicated, and the only way to write it compactly is to
employ matrix notation.
If a collection of variables has this distribution, then the
marginal distribution of any single variable is normal, the
conditional distribution of any single variable given values
of the other variables is normal, the joint marginal
distribution of any pair of variables is bivariate normal, and
the joint marginal distribution of any subset of three or more
of the variables is again multivariate normal.
73
5.3
Statistics and Their
Distributions
74
Statistics and Their Distributions
Before the data become available, there is uncertainty about
what the value of any statistic will be. We therefore view each
observation as a random variable and denote the sample by
X1, X2, . . . , Xn (uppercase letters for random variables).
This variation in turn implies that the value of any function
of the sample observations—such as the sample mean,
sample standard deviation, or sample fourth spread—also
varies from sample to sample. That is, prior to obtaining x1,
. . . , xn, there is uncertainty as to the value of x̄, the value
of s, and so on.
75
Example 5.20
cont’d
Samples from the Weibull Distribution of Example 19
Table 5.1
76
Statistics and Their Distributions
Definition
77
Statistics and Their Distributions
Any statistic, being a random variable, has a probability
distribution. In particular, the sample mean has a
probability distribution.
The probability distribution of a statistic is sometimes
referred to as its sampling distribution to emphasize that
it describes how the statistic varies in value across all
samples that might be selected.
78
Random Samples
The probability distribution of any particular statistic
depends not only on the population distribution (normal,
uniform, etc.) and the sample size n but also on the method
of sampling.
Consider selecting a sample of size n = 2 from a population
consisting of just the three values 1, 5, and 10, and
suppose that the statistic of interest is the sample variance.
If sampling is done “with replacement,” then S2 = 0 will
result if X1 = X2.
79
Random Samples
However, S2 cannot equal 0 if sampling is “without
replacement.” So P(S2 = 0) = 0 for one sampling method,
and this probability is positive for the other method.
Our next definition describes a sampling method often
encountered (at least approximately) in practice.
80
Random Samples
Definition
81
Random Samples
Conditions 1 and 2 can be paraphrased by saying that the
Xi’s are independent and identically distributed (iid).
If sampling is either with replacement or from an infinite
(conceptual) population, Conditions 1 and 2 are satisfied
exactly.
These conditions will be approximately satisfied if sampling
is without replacement and the sample size n is much smaller
than the population size N. In practice, if n/N ≤ .05 (at most
5% of the population is sampled), we can proceed as if the
Xi's form a random sample.
82
Deriving a Sampling Distribution
Probability rules can be used to obtain the distribution of a
statistic provided that it is a “fairly simple” function of the
Xi’s and either there are relatively few different X values in
the population or else the population distribution has a
“nice” form.
Our next example illustrates such a situation.
83
Example 5.21
A certain brand of MP3 player comes in three
configurations: a model with 2 GB of memory, costing $80,
a 4 GB model priced at $100, and an 8 GB version with a
price tag of $120.
If 20% of all purchasers choose the 2 GB model, 30%
choose the 4 GB model, and 50% choose the 8 GB model,
then the probability distribution of the cost X of a single
randomly selected MP3 player purchase is given by
x        80     100    120
p(x)     .20    .30    .50
with μ = 106, σ² = 244
(5.2)
84
Example 5.21
cont’d
Suppose on a particular day only two MP3 players are sold.
Let X1 = the revenue from the first sale and X2 the revenue
from the second.
Suppose that X1 and X2 are independent, each with the
probability distribution shown in (5.2) [so that X1 and X2
constitute a random sample from the distribution (5.2)].
85
Example 5.21
cont’d
Table 5.2 lists possible (x1, x2) pairs, the probability of each
[computed using (5.2) and the assumption of
independence], and the resulting x̄ and s² values. [Note
that when n = 2, s² = (x1 – x̄)² + (x2 – x̄)².]
Outcomes, Probabilities, and Values of x̄ and s² for Example 20
Table 5.2
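Since the distribution (5.2) is fully specified above, the entries of Table 5.2, and hence the sampling distributions (5.3) and (5.4) that follow, can be reproduced by direct enumeration; a short sketch:

from collections import defaultdict
from itertools import product

pmf = {80: 0.2, 100: 0.3, 120: 0.5}      # distribution (5.2)

xbar_dist, s2_dist = defaultdict(float), defaultdict(float)
for x1, x2 in product(pmf, repeat=2):    # independent purchases
    p = pmf[x1] * pmf[x2]
    xbar = (x1 + x2) / 2
    s2 = (x1 - xbar) ** 2 + (x2 - xbar) ** 2    # sample variance when n = 2
    xbar_dist[xbar] += p
    s2_dist[s2] += p

print(dict(xbar_dist))   # x-bar: 80, 90, 100, 110, 120 with probs .04, .12, .29, .30, .25
print(dict(s2_dist))     # s^2: 0, 200, 800 with probs .38, .42, .20
print(sum(v * p for v, p in xbar_dist.items()))   # E(X-bar) = 106 = mu
print(sum(v * p for v, p in s2_dist.items()))     # E(S^2)   = 244 = sigma^2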
86
Example 5.21
cont'd
The complete sampling distributions of X̄ and S² appear in
(5.3) and (5.4):
x̄          80      90      100     110     120
pX̄(x̄)      .04     .12     .29     .30     .25        (5.3)
s²          0       200     800
pS²(s²)     .38     .42     .20                        (5.4)
87
Example 5.21
cont’d
Figure 5.8 pictures a probability histogram for both the
original distribution (5.2) and the x̄ distribution (5.3). The
figure suggests first that the mean (expected value) of the
x̄ distribution is equal to the mean 106 of the original
distribution, since both histograms appear to be centered at
the same place.
Probability histograms for the underlying distribution and x̄ distribution in Example 20
Figure 5.8
88
Example 5.21
cont’d
From (5.3),
μX̄ = E(X̄) = (80)(.04) + . . . + (120)(.25) = 106 = μ
Second, it appears that the x̄ distribution has smaller
spread (variability) than the original distribution, since
probability mass has moved in toward the mean. Again
from (5.3),
σ²X̄ = V(X̄) = (80²)(.04) + ∙ ∙ ∙ + (120²)(.25) – (106)² = 122
89
Example 5.21
cont’d
The variance of X̄ is precisely half that of the original
variance (because n = 2). Using (5.4), the mean value of
S² is
μS² = E(S²) = Σ s² ∙ pS²(s²)
= (0)(.38) + (200)(.42) + (800)(.20) = 244 = σ²
That is, the X̄ sampling distribution is centered at the
population mean μ, and the S² sampling distribution is
centered at the population variance σ².
90
Example 5.21
cont’d
If there had been four purchases on the day of interest, the
sample average revenue X̄ would be based on a random
sample of four Xi's, each having the distribution (5.2).
More calculation eventually yields the pmf of X̄ for n = 4 as
91
Example 5.21
cont’d
From this, μX̄ = 106 = μ and σ²X̄ = 61 = σ²/4. Figure 5.9 is a
probability histogram of this pmf.
Probability histogram for
based on n = 4 in Example 20
Figure 5.9
92
Example 5.21
cont’d
Example 5.21 should suggest first of all that the
computation of pX̄(x̄) and pS²(s²) can be tedious.
If the original distribution (5.2) had allowed for more than
three possible values, then even for n = 2 the computations
would have been more involved.
The example should also suggest, however, that there are
some general relationships between E(X̄), V(X̄), E(S²),
and the mean μ and variance σ² of the original distribution.
93
Simulation Experiments
94
Simulation Experiments
The second method of obtaining information about a
statistic’s sampling distribution is to perform a simulation
experiment.
This method is usually used when a derivation via
probability rules is too difficult or complicated to be carried
out. Such an experiment is virtually always done with the
aid of a computer.
95
Simulation Experiments
The following characteristics of an experiment must be
specified:
1. The statistic of interest (X̄, S, a particular trimmed
mean, etc.)
2. The population distribution (normal with μ = 100 and
σ = 15, uniform with lower limit A = 5 and upper limit
B = 10, etc.)
3. The sample size n (e.g., n = 10 or n = 50)
4. The number of replications k (number of samples to be
obtained)
96
Simulation Experiments
Then use appropriate software to obtain k different random
samples, each of size n, from the designated population
distribution.
For each sample, calculate the value of the statistic and
construct a histogram of the k values. This histogram gives
the approximate sampling distribution of the statistic.
The larger the value of k, the better the approximation will
tend to be (the actual sampling distribution emerges as
k → ∞). In practice, k = 500 or 1000 is usually sufficient if
the statistic is “fairly simple.”
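A minimal numpy sketch of such an experiment. The population in Example 5.20 was a Weibull distribution whose parameters are not given in this transcript, so a skewed lognormal population is used here purely as a stand-in; the statistic is the sample mean, with n = 10 and k = 1000.

import numpy as np

rng = np.random.default_rng(1)

n, k = 10, 1000                                              # sample size, replications
samples = rng.lognormal(mean=3.0, sigma=0.4, size=(k, n))    # stand-in population
xbars = samples.mean(axis=1)                                 # k observed values of X-bar

# The histogram of the k sample means approximates the sampling distribution.
counts, edges = np.histogram(xbars, bins=20)
print(xbars.mean(), xbars.std())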
97
Simulation Experiments
The final aspect of the histograms to note is their spread
relative to one another.
The larger the value of n, the more concentrated is the
sampling distribution about the mean value. This is why the
histograms for n = 20 and n = 30 are based on narrower
class intervals than those for the two smaller sample sizes.
For the larger sample sizes, most of the x̄ values are quite
close to 8.25. This is the effect of averaging. When n is
small, a single unusual x value can result in an x̄ value far
from the center.
98
Simulation Experiments
With a larger sample size, any unusual x values, when
averaged in with the other sample values, still tend to yield
an x̄ value close to μ.
Combining these insights yields a result that should appeal
to your intuition:
X̄ based on a large n tends to be closer to μ than does X̄
based on a small n.
99
5.4
The Distribution of the
Sample Mean
100
The Distribution of the Sample Mean
The importance of the sample mean X̄ springs from its use
in drawing conclusions about the population mean μ. Some
of the most frequently used inferential procedures are
based on properties of the sampling distribution of X̄.
A preview of these properties appeared in the calculations
and simulation experiments of the previous section, where
we noted relationships between E(X̄) and μ and also
among V(X̄), σ², and n.
101
The Distribution of the Sample Mean
Proposition
102
The Distribution of the Sample Mean
According to Result 1, the sampling (i.e., probability)
distribution of X̄ is centered precisely at the mean of the
population from which the sample has been selected.
Result 2 shows that the X̄ distribution becomes more
concentrated about μ as the sample size n increases.
In marked contrast, the distribution of To becomes more
spread out as n increases.
Averaging moves probability in toward the middle, whereas
totaling spreads probability out over a wider and wider
range of values.
103
The Distribution of the Sample Mean
The standard deviation σX̄ = σ/√n is often called the
standard error of the mean; it describes the magnitude of a
typical or representative deviation of the sample mean from
the population mean.
104
Example 5.25
In a notched tensile fatigue test on a titanium specimen, the
expected number of cycles to first acoustic emission (used
to indicate crack initiation) is μ = 28,000, and the standard
deviation of the number of cycles is σ = 5000.
Let X1, X2, . . . , X25 be a random sample of size 25, where
each Xi is the number of cycles on a different randomly
selected specimen.
Then the expected value of the sample mean number of
cycles until first emission is E(X̄) = 28,000, and the
expected total number of cycles for the 25 specimens is
E(To) = nμ = 25(28,000) = 700,000.
105
Example 5.25
cont'd
The standard deviations of X̄ and of To are
σX̄ = σ/√n = 5000/√25 = 1000   (standard error of the mean)
σTo = √n ∙ σ = √25 (5000) = 25,000
If the sample size increases to n = 100, E(X̄) is unchanged,
but σX̄ = 500, half of its previous value (the sample size
must be quadrupled to halve the standard deviation of X̄).
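A two-line check of the standard errors quoted in this example, for n = 25 and n = 100:

from math import sqrt

mu, sigma = 28_000, 5_000
for n in (25, 100):
    # standard error of the mean and standard deviation of the total
    print(n, sigma / sqrt(n), sqrt(n) * sigma)   # 25: 1000, 25000; 100: 500, 50000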
106
The Case of a Normal Population
Distribution
107
The Case of a Normal Population Distribution
Proposition
We know everything there is to know about the X̄ and To
distributions when the population distribution is normal. In
particular, probabilities such as P(a ≤ X̄ ≤ b) and
P(c ≤ To ≤ d) can be obtained simply by standardizing.
108
The Case of a Normal Population Distribution
Figure 5.15 illustrates the proposition.
A normal population distribution and sampling distributions
Figure 5.15
109
Example 5.26
The distribution of egg weights (g) of a certain type is normal
with mean value 53 and standard deviation .3 (consistent
with data in the article “Evaluation of Egg Quality Traits of
Chickens Reared under Backyard System in Western Uttar
Pradesh” (Indian J. of Poultry Sci., 2009: 261–262)).
Let 𝑋1 , 𝑋2 , … , 𝑋12 denote the weights of a dozen randomly
selected eggs; these 𝑋𝑖 ’s constitute a random sample of size
12 from the specified normal distribution
110
Example 5.26
cont’d
The total weight of the 12 eggs is T0 = X1 + . . . + X12; it is
normally distributed with mean value E(T0) = nμ = 12(53) =
636 and variance V(T0) = nσ² = 12(.3)² = 1.08. The
probability that the total weight is between 635 and 640 is
now obtained by standardizing and referring to Appendix
Table A.3:
111
Example 5.26
cont’d
If cartons containing a dozen eggs are repeatedly selected,
in the long run slightly more than 83% of the cartons will
have a total egg weight between 635 g and 640 g.
Notice that 635 < T0 < 640 is equivalent to 52.9167 < X̄ <
53.3333 (divide each term in the original system of
inequalities by 12).
Thus P(52.9167 < X̄ < 53.3333) ≈ .8315. This latter
probability can also be obtained by standardizing X̄ directly.
112
Example 5.26
Now consider randomly selecting just four of these eggs.
The sample mean weight X̄ is then normally distributed with
mean value μX̄ = μ = 53 and standard deviation σX̄ = σ/√n
= .3/√4 = .15. The probability that the sample mean
weight exceeds 53.5 g is then
P(X̄ > 53.5) = P(Z > (53.5 – 53)/.15) = P(Z > 3.33) ≈ .0004.
Because 53.5 is 3.33 standard deviations (of X̄) larger than
the mean value 53, it is exceedingly unlikely that the
sample mean will exceed 53.5.
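Both probabilities in this example follow directly from the stated normal distributions; a quick scipy check:

from math import sqrt
from scipy.stats import norm

mu, sigma, n = 53, 0.3, 12

# Total weight T0 ~ N(n*mu, n*sigma^2); P(635 < T0 < 640).
mean_T, sd_T = n * mu, sqrt(n) * sigma
print(norm.cdf(640, mean_T, sd_T) - norm.cdf(635, mean_T, sd_T))   # about .83

# Sample mean of 4 eggs ~ N(mu, sigma/sqrt(4)); P(X-bar > 53.5).
print(1 - norm.cdf(53.5, mu, sigma / sqrt(4)))                     # about .0004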
113
The Central Limit Theorem
114
The Central Limit Theorem
When the Xi's are normally distributed, so is X̄, for every
sample size n.
The derivations in Example 5.21 and the simulation experiment
of Example 5.24 suggest that even when the population
distribution is highly nonnormal, averaging produces a
distribution more bell-shaped than the one being sampled.
A reasonable conjecture is that if n is large, a suitable
normal curve will approximate the actual distribution of X̄.
The formal statement of this result is the most important
theorem of probability.
115
The Central Limit Theorem
Theorem
116
The Central Limit Theorem
Figure 5.16 illustrates the Central Limit Theorem.
The Central Limit Theorem illustrated
Figure 5.16
117
The Central Limit Theorem
According to the CLT, when n is large and we wish to
calculate a probability such as P(a ≤ X̄ ≤ b), we need only
“pretend” that X̄ is normal, standardize it, and use the
normal table.
The resulting answer will be approximately correct. The
exact answer could be obtained only by first finding the
distribution of X̄, so the CLT provides a truly impressive
shortcut.
118
Example 5.27
The amount of a particular impurity in a batch of a certain
chemical product is a random variable with mean value 4.0 g
and standard deviation 1.5 g.
If 50 batches are independently prepared, what is the
(approximate) probability that the sample average amount of
impurity is between 3.5 and 3.8 g?
According to the rule of thumb to be stated shortly, n = 50 is
large enough for the CLT to be applicable.
119
Example 5.27
cont’d
X̄ then has approximately a normal distribution with mean
value μX̄ = 4.0 and standard deviation σX̄ = 1.5/√50 = .2121,
so
120
Example 5.27
Now consider randomly selecting 100 batches, and let 𝑇0
represent the total amount of impurity in these batches.
Then the mean value and standard deviation of T0 are
100(4) = 400 and √100 (1.5) = 15, respectively, and the
CLT implies that T0 has approximately a normal distribution.
The probability that this total is at most 425 g is
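A scipy check of both CLT approximations in this example (the sample mean for n = 50 batches and the total for n = 100 batches):

from math import sqrt
from scipy.stats import norm

mu, sigma = 4.0, 1.5

# n = 50: X-bar is approximately N(4.0, 1.5/sqrt(50)).
se = sigma / sqrt(50)
print(norm.cdf(3.8, mu, se) - norm.cdf(3.5, mu, se))   # about .16

# n = 100: T0 is approximately N(400, 15).
print(norm.cdf(425, 400, 15))                          # about .95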
121
The Central Limit Theorem
The CLT provides insight into why many random variables
have probability distributions that are approximately
normal.
For example, the measurement error in a scientific
experiment can be thought of as the sum of a number of
underlying perturbations and errors of small magnitude.
A practical difficulty in applying the CLT is in knowing when
n is sufficiently large. The problem is that the accuracy of
the approximation for a particular n depends on the shape
of the original underlying distribution being sampled.
122
The Central Limit Theorem
If the underlying distribution is close to a normal density
curve, then the approximation will be good even for a small
n, whereas if it is far from being normal, then a large n will
be required.
There are population distributions for which even an n of 40
or 50 does not suffice, but such distributions are rarely
encountered in practice.
123
The Central Limit Theorem
On the other hand, the rule of thumb is often conservative;
for many population distributions, an n much less than 30
would suffice.
For example, in the case of a uniform population
distribution, the CLT gives a good approximation for n ≥ 12.
124
5.5
The Distribution of a
Linear Combination
125
The Distribution of a Linear Combination
The sample mean X̄ and sample total To are special cases
of a type of random variable that arises very frequently in
statistical applications.
Definition
126
The Distribution of a Linear Combination
For example, consider someone who owns 100 shares of
stock A, 200 shares of stock B, and 500 shares of stock C.
Denote the share prices of these three stocks at some
particular time by 𝑋1, 𝑋2 , and 𝑋3 , respectively. Then the value
of this individual’s stock holdings is the linear combination
Y = 100𝑋1 + 200𝑋2 + 500𝑋3 .
Taking a1 = a2 = . . . = an = 1 gives Y = X1 + . . . + Xn = To,
and a1 = a2 = . . . = an = 1/n yields Y = X̄.
127
The Distribution of a Linear Combination
Notice that we are not requiring the Xi’s to be independent
or identically distributed. All the Xi’s could have different
distributions and therefore different mean values and
variances. We first consider the expected value and
variance of a linear combination.
128
The Distribution of a Linear Combination
Proposition
129
The Distribution of a Linear Combination
Proofs are sketched out at the end of the section. A
paraphrase of (5.8) is that the expected value of a linear
combination is the same as the linear combination of the
expected values—for example, E(2X1 + 5X2) = 2μ1 + 5μ2.
The result (5.9) in Statement 2 is a special case of (5.11) in
Statement 3; when the Xi's are independent, Cov(Xi, Xj) = 0
for i ≠ j and Cov(Xi, Xi) = V(Xi) for i = j (this simplification
actually occurs when the Xi's are uncorrelated, a weaker
condition than independence).
Specializing to the case of a random sample (Xi's iid) with
ai = 1/n for every i gives E(X̄) = μ and V(X̄) = σ²/n. A similar
comment applies to the rules for To.
130
Example 5.30
A gas station sells three grades of gasoline: regular, extra,
and super.
These are priced at $3.00, $3.20, and $3.40 per gallon,
respectively.
Let X1, X2, and X3 denote the amounts of these grades
purchased (gallons) on a particular day.
Suppose the Xi's are independent with μ1 = 1000, μ2 = 500,
μ3 = 300, σ1 = 100, σ2 = 80, and σ3 = 50.
131
Example 5.30
cont’d
The revenue from sales is Y = 3.0X1 + 3.2X2 + 3.4X3, and
E(Y) = 3.0μ1 + 3.2μ2 + 3.4μ3
= $5620
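A short check of E(Y) and, under the stated independence, of V(Y) and σY for this example; the standard deviation computed here is the 429.46 used again in Example 5.32.

from math import sqrt

prices = [3.0, 3.2, 3.4]     # $/gallon: regular, extra, super
mus    = [1000, 500, 300]    # mean gallons sold
sigmas = [100, 80, 50]       # standard deviations (independent grades)

mean_rev = sum(a * m for a, m in zip(prices, mus))
var_rev  = sum(a**2 * s**2 for a, s in zip(prices, sigmas))
print(mean_rev, sqrt(var_rev))   # 5620 and about 429.46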
132
The Difference Between Two
Random Variables
133
The Difference Between Two Random Variables
An important special case of a linear combination results
from taking n = 2, a1 = 1, and a2 = –1:
Y = a1X1 + a2X2 = X1 – X2
We then have the following corollary to the proposition.
Corollary
134
The Difference Between Two Random Variables
The expected value of a difference is the difference of the
two expected values, but the variance of a difference
between two independent variables is the sum, not the
difference, of the two variances.
There is just as much variability in X1 – X2 as in X1 + X2
[writing X1 – X2 = X1 + (– 1)X2, (–1)X2 has the same amount
of variability as X2 itself].
135
Example 5.31
A certain automobile manufacturer equips a particular
model with either a six-cylinder engine or a four-cylinder
engine.
Let X1 and X2 be fuel efficiencies for independently and
randomly selected six-cylinder and four-cylinder cars,
respectively. With μ1 = 22, μ2 = 26, σ1 = 1.2, and σ2 = 1.5,
E(X1 – X2) = μ1 – μ2
= 22 – 26
= –4
V(X1 – X2) = σ1² + σ2² = (1.2)² + (1.5)² = 3.69
136
Example 5.31
cont’d
If we relabel so that X1 refers to the four-cylinder car, then
E(X1 – X2) = 4, but the variance of the difference is
still 3.69.
137
The Case of Normal Random
Variables
138
The Case of Normal Random Variables
When the Xi's form a random sample from a normal
distribution, X̄ and To are both normally distributed. Here is
a more general result concerning linear combinations.
Proposition
139
Example 5.32
The total revenue from the sale of the three grades of
gasoline on a particular day was Y = 3.0X1 + 3.2X2 + 3.4X3,
and we calculated μY = 5620 and (assuming independence)
σY = 429.46. If the Xi's are normally distributed, the
probability that revenue exceeds 4500 is
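A one-line scipy evaluation of this probability, using the mean and standard deviation computed in Example 5.30:

from scipy.stats import norm

# Y ~ N(5620, 429.46) when the Xi's are independent and normal.
print(1 - norm.cdf(4500, loc=5620, scale=429.46))   # about .995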
140
The Case of Normal Random Variables
The CLT can also be generalized so it applies to certain
linear combinations. Roughly speaking, if n is large and no
individual term is likely to contribute too much to the overall
value, then Y has approximately a normal distribution.
141