3. Joint Distributions of Random Variables
3.1 Joint Distributions of Two Discrete Random Variables
Suppose the discrete random variables X and Y have supports SX and SY, respectively.
The joint distribution of X and Y is given by the set {P(X = x, Y = y)}_{x∈SX, y∈SY}.
This joint distribution satisfies

P(X = x, Y = y) ≥ 0,    Σ_{x∈SX} Σ_{y∈SY} P(X = x, Y = y) = 1.
When SX and SY are small sets, the joint distribution of X and Y
is usually presented in a table, e.g.
        Y = −1   Y = 0   Y = 1
X = 0      0      1/3      0
X = 1     1/3      0      1/3

Here, P(X = 0, Y = −1) = 0, P(X = 0, Y = 0) = 1/3, etc.
3.1.1 Marginal Distributions of Discrete Random Variables
Note that by summing along a row (i.e. summing over all the possible values of Y for a particular realisation of X) we obtain the probability that X takes the appropriate value, i.e.

P(X = x) = Σ_{y∈SY} P(X = x, Y = y).

Similarly, by summing along a column (i.e. summing over all the possible values of X for a particular realisation of Y) we obtain the probability that Y takes the appropriate value, i.e.

P(Y = y) = Σ_{x∈SX} P(X = x, Y = y).
The distributions of X and Y obtained in this way are called the marginal distributions of X and Y, respectively.
These marginal distributions satisfy the standard conditions for a discrete distribution, i.e.

P(X = x) ≥ 0 for all x ∈ SX,    Σ_{x∈SX} P(X = x) = 1.
3.1.2 Independence of Two Discrete Random Variables
Two discrete random variables X and Y are independent if and only if

P(X = x, Y = y) = P(X = x)P(Y = y),    ∀x ∈ SX, y ∈ SY.
3.1.3 The Expected Value of a Function of Two Discrete Random Variables
The expected value of a function, g(X, Y), of two discrete random variables is defined as

E[g(X, Y)] = Σ_{x∈SX} Σ_{y∈SY} g(x, y)P(X = x, Y = y).

In particular, the expected value of X is given by

E[X] = Σ_{x∈SX} Σ_{y∈SY} x P(X = x, Y = y).

It should be noted that if we have already calculated the marginal distribution of X, then it is simpler to calculate E[X] from it, as E[X] = Σ_{x∈SX} x P(X = x).
3.1.4 Covariance and the Correlation Coefficient
The covariance between X and Y, Cov(X, Y), is given by

Cov(X, Y) = E(XY) − E(X)E(Y).

Note that by definition Cov(X, X) = E(X²) − E(X)² = Var(X).
Also, for any constants a, b,

Cov(X, aY + b) = a Cov(X, Y).

The coefficient of correlation between X and Y is given by

ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)).
Properties of the Correlation Coefficient
1. −1 ≤ ρ(X, Y) ≤ 1.
2. If |ρ(X, Y)| = 1, then there is a deterministic linear relationship between X and Y, Y = aX + b. When ρ(X, Y) = 1, then a > 0. When ρ(X, Y) = −1, a < 0.
3. If the random variables X and Y are independent, then ρ(X, Y) = 0. Note that the condition ρ(X, Y) = 0 is not sufficient for X and Y to be independent (see Example 4.1). However, when ρ(X, Y) ≠ 0, then X and Y are dependent.
4. If a > 0, ρ(X, aY + b) = ρ(X, Y), i.e. the correlation coefficient is independent of the units a variable is measured in. If a < 0, ρ(X, aY + b) = −ρ(X, Y).
Example 3.1
Discrete random variables X and Y have the following joint
distribution:
        Y = −1   Y = 0   Y = 1
X = 0      0      1/3      0
X = 1     1/3      0      1/3
a) Calculate i) P(X > Y), ii) the marginal distributions of X and Y, iii) the expected values and variances of X and Y, iv) the coefficient of correlation between X and Y.
b) Are X and Y independent?
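The calculations in a) and b) can be checked with a short Python sketch (not part of the original slides); the joint distribution is entered as a dictionary and everything else follows from the definitions above:

```python
from itertools import product

# Joint distribution of Example 3.1: {(x, y): P(X = x, Y = y)}.
joint = {(0, -1): 0, (0, 0): 1/3, (0, 1): 0,
         (1, -1): 1/3, (1, 0): 0, (1, 1): 1/3}
xs, ys = [0, 1], [-1, 0, 1]

# i) P(X > Y): sum the joint probabilities over the pairs with x > y.
p_x_gt_y = sum(joint[x, y] for x, y in product(xs, ys) if x > y)

# ii) Marginal distributions: sum over the other variable.
p_x = {x: sum(joint[x, y] for y in ys) for x in xs}
p_y = {y: sum(joint[x, y] for x in xs) for y in ys}

# iii) Expected values and variances from the marginals.
e_x = sum(x * p for x, p in p_x.items())
e_y = sum(y * p for y, p in p_y.items())
var_x = sum(x ** 2 * p for x, p in p_x.items()) - e_x ** 2
var_y = sum(y ** 2 * p for y, p in p_y.items()) - e_y ** 2

# iv) Correlation coefficient: Cov(X, Y) = E[XY] - E[X]E[Y].
e_xy = sum(x * y * joint[x, y] for x, y in product(xs, ys))
rho = (e_xy - e_x * e_y) / (var_x * var_y) ** 0.5

# b) Independent iff P(X = x, Y = y) = P(X = x)P(Y = y) for all pairs.
independent = all(abs(joint[x, y] - p_x[x] * p_y[y]) < 1e-12
                  for x, y in product(xs, ys))

print(p_x_gt_y, p_x, p_y, e_x, e_y, var_x, var_y, rho, independent)
```

The sketch gives ρ(X, Y) = 0 even though the independence check fails, illustrating property 3 of the correlation coefficient.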
3.1.5 Conditional Distributions and Conditional Expected Values
Define the conditional distribution of X given Y = y, where y ∈ SY, by {P(X = x|Y = y)}_{x∈SX}, where

P(X = x|Y = y) = P(X = x, Y = y) / P(Y = y).

Analogously, the conditional distribution of Y given X = x, where x ∈ SX, is given by {P(Y = y|X = x)}_{y∈SY}, where

P(Y = y|X = x) = P(X = x, Y = y) / P(X = x).

Note: This definition is analogous to the definition of conditional probability given in Chapter 1.
Conditional Expected Value
The expected value of the function g(X) given Y = y is denoted E[g(X)|Y = y]. We have

E[g(X)|Y = y] = Σ_{x∈SX} g(x)P(X = x|Y = y).

In particular,

E[X|Y = y] = Σ_{x∈SX} x P(X = x|Y = y).

The expected value of the function g(Y) given X = x is defined analogously.
Probabilistic Regression
The graph of the probabilistic regression of X on Y is given by a scatter plot of the points {(y, E[X|Y = y])}_{y∈SY}.
Note that the variable we are conditioning on appears on the x-axis.
Analogously, the graph of the probabilistic regression of Y on X is given by a scatter plot of the points {(x, E[Y|X = x])}_{x∈SX}.
These graphs illustrate the nature of the (probabilistic) relation between the variables X and Y.
The graph of the probabilistic regression of Y on X, E[Y|X = x], may be
1. Monotonically increasing in x. This indicates a positive dependency between X and Y. ρ(X, Y) will be positive.
2. Monotonically decreasing in x. This indicates a negative dependency between X and Y. ρ(X, Y) will be negative.
3. Flat. This indicates that ρ(X, Y) = 0. However, it does not always follow that X and Y are independent.
4. None of the above (e.g. oscillatory). In this case, X and Y are dependent, but this dependency cannot be described in simple terms such as positive or negative. The correlation coefficient can be 0, positive or negative in such a case (but cannot be −1 or 1).
Example 3.2
Discrete random variables X and Y have the following joint
distribution:
        Y = −1   Y = 0   Y = 1
X = 0      0      1/3      0
X = 1     1/3      0      1/3
a) Calculate the conditional distributions of X given i) Y = −1, ii) Y = 0, iii) Y = 1.
b) Calculate the conditional distributions of Y given i) X = 0, ii) X = 1.
c) Draw the graph of the probabilistic regression of i) Y on X, ii) X on Y.
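A minimal Python sketch of parts a)-c) (not from the slides); the conditional distributions come straight from the definition, and each regression is the set of points (x, E[Y|X = x]) or (y, E[X|Y = y]):

```python
# Joint distribution of Example 3.2 (the same table as in Example 3.1).
joint = {(0, -1): 0, (0, 0): 1/3, (0, 1): 0,
         (1, -1): 1/3, (1, 0): 0, (1, 1): 1/3}
xs, ys = [0, 1], [-1, 0, 1]

p_x = {x: sum(joint[x, y] for y in ys) for x in xs}
p_y = {y: sum(joint[x, y] for x in xs) for y in ys}

# a) P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y).
cond_x = {y: {x: joint[x, y] / p_y[y] for x in xs} for y in ys}

# b) P(Y = y | X = x) = P(X = x, Y = y) / P(X = x).
cond_y = {x: {y: joint[x, y] / p_x[x] for y in ys} for x in xs}

# c) Points of the probabilistic regressions.
reg_y_on_x = {x: sum(y * p for y, p in cond_y[x].items()) for x in xs}
reg_x_on_y = {y: sum(x * p for x, p in cond_x[y].items()) for y in ys}

print(cond_x, cond_y)
print(reg_y_on_x)  # {0: 0.0, 1: 0.0}: a flat regression
print(reg_x_on_y)  # {-1: 1.0, 0: 0.0, 1: 1.0}: an oscillating regression
```

Here the regression of Y on X is flat (case 3 above), while the regression of X on Y oscillates (case 4).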
Example 3.3
The random variables X and Y may sometimes be defined on the
basis of an experiment.
In this case, each elementary event will correspond to a pair of values taken by X and Y (see also Example 2.2.1).
Suppose a coin is thrown 3 times. Let X be the number of heads
thrown in the first two throws and Y be the number of heads
thrown in the last two throws.
Define the joint distribution of X and Y .
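One way to obtain the joint distribution is simply to enumerate the eight equally likely elementary events; a Python sketch (not from the slides):

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# The 8 equally likely elementary events of three coin throws.
outcomes = list(product("HT", repeat=3))

counts = Counter()
for w in outcomes:
    x = w[:2].count("H")   # X: heads in the first two throws
    y = w[1:].count("H")   # Y: heads in the last two throws
    counts[x, y] += 1

# Each elementary event has probability 1/8.
joint = {xy: Fraction(c, len(outcomes)) for xy, c in counts.items()}
for (x, y), p in sorted(joint.items()):
    print(f"P(X={x}, Y={y}) = {p}")
```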
3.2 The joint distribution of n independent random variables
Recall that if two discrete random variables X and Y are independent, then

P(X = x, Y = y) = P(X = x)P(Y = y).

This can be generalised as follows: if X1, X2, . . . , Xn are independent discrete random variables, then

P(X1 = x1, X2 = x2, . . . , Xn = xn) = P(X1 = x1)P(X2 = x2) . . . P(Xn = xn).
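For instance, here is a small sketch of this product rule (the fair die and coin are my own illustrative choices, not from the slides):

```python
from fractions import Fraction

# For independent discrete random variables, the probability of observing a
# particular sequence of values is the product of the marginal probabilities.
def joint_prob(marginals, values):
    p = Fraction(1)
    for pmf, v in zip(marginals, values):
        p *= pmf[v]   # the factor P(Xi = v)
    return p

die = {v: Fraction(1, 6) for v in range(1, 7)}   # fair die
coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}    # fair coin (1 = heads)

# P(X1 = 3, X2 = 1, X3 = 1) for two die throws and one coin toss.
print(joint_prob([die, die, coin], [3, 1, 1]))   # 1/72
```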
Joint distribution of n independent continuous random variables
A single continuous random variable X can be described by its density function fX(x).
If X1, X2, . . . , Xn are independent continuous random variables, then their joint density function, denoted fX1,X2,...,Xn(x1, x2, . . . , xn), is given by

fX1,X2,...,Xn(x1, x2, . . . , xn) = fX1(x1)fX2(x2) . . . fXn(xn).

These results will be used in deriving estimators of parameters using the maximum likelihood method.
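A sketch of this factorisation in code, under the illustrative assumption that the observations are i.i.d. N(µ, σ²); the product computed here is exactly the likelihood that the maximum likelihood method maximises:

```python
import math

# Density of a single N(mu, sigma^2) observation.
def normal_pdf(x, mu, sigma):
    return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
            / (sigma * math.sqrt(2 * math.pi)))

# Joint density of independent observations: the product of the marginals.
def joint_density(xs, mu, sigma):
    prod = 1.0
    for x in xs:
        prod *= normal_pdf(x, mu, sigma)
    return prod

print(joint_density([1.2, 0.7, 1.9], mu=1.0, sigma=1.0))
```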
3.3 Simple Samples
A simple sample is obtained by observing n independent, identically distributed (i.i.d.) random variables, X1, X2, . . . , Xn.
Suppose X1, X2, . . . , Xn have the same distribution as X, where E[X] = µ and Var(X) = σ².
If these random variables are discrete, then the probability of observing a sample [set of n realisations] (x1, x2, . . . , xn) is given by

P(X1 = x1, X2 = x2, . . . , Xn = xn) = P(X = x1)P(X = x2) . . . P(X = xn).

If these random variables are continuous, then the joint density function is given by

fX1,X2,...,Xn(x1, x2, . . . , xn) = fX(x1)fX(x2) . . . fX(xn).

In addition,

P(X1 ≤ x1, X2 ≤ x2, . . . , Xn ≤ xn) = P(X ≤ x1)P(X ≤ x2) . . . P(X ≤ xn).
The Sample Mean
The sample mean is defined as X̄ = (1/n) Σ_{i=1}^n Xi.
Recall that

E[Σ_{i=1}^n Xi] = Σ_{i=1}^n E[Xi]

and

E[aX + b] = aE[X] + b.
It follows that

E[X̄] = E[(1/n) Σ_{i=1}^n Xi] = (1/n) E[Σ_{i=1}^n Xi]
      = (1/n) Σ_{i=1}^n E[Xi] = (1/n)(µ + µ + . . . + µ)    (n terms in the sum)
      = (1/n)·nµ = µ.

It follows that the expected value of the sample mean is equal to µ (the population/theoretical mean).
Unbiased Estimators
An estimator is a function of the observations in a sample that is
used to estimate the parameters of a distribution.
For example, the sample mean is an obvious estimator of µ.
An estimator of a parameter is unbiased if its expected value is
equal to the parameter itself, i.e. there is no systematic tendency
to underestimate or overestimate the parameter.
It follows that the sample mean is an unbiased estimator of µ, the
theoretical (population) mean.
Estimators of the theoretical (population) variance, σ²
Since σ² is a measure of the average square distance of a random variable from its expected value, one might expect that the estimator s_n², where

s_n² = (1/n) Σ_{i=1}^n [Xi − X̄]²,

is an unbiased estimator of σ².
However, it is not. It can be shown that s_{n−1}², where

s_{n−1}² = (1/(n − 1)) Σ_{i=1}^n [Xi − X̄]²,

is an unbiased estimator of σ².
The estimators s_n² and s_{n−1}² are often labelled s² and ŝ², respectively.
An estimator of the parameter α is generally labelled α̂.
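A Monte Carlo sketch of the bias of s_n² and the unbiasedness of s_{n−1}² (the normal population, n = 5 and the number of repetitions are arbitrary illustrative choices, not from the slides):

```python
import random

random.seed(1)
mu, sigma, n, reps = 0.0, 2.0, 5, 200_000   # so sigma^2 = 4

avg_sn2 = avg_sn1 = 0.0
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    avg_sn2 += ss / n / reps         # s_n^2 (biased)
    avg_sn1 += ss / (n - 1) / reps   # s_{n-1}^2 (unbiased)

print(avg_sn2)   # close to sigma^2 (n - 1)/n = 3.2, not 4
print(avg_sn1)   # close to sigma^2 = 4
```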
The variance of an estimator
Although an estimator might not show any systematic error, there
will be a random error depending on the sample observed. The
degree of this random error may be described by the variance of an
estimator.
Recall that
Var(aX + b) = a² Var(X)

Also, if X1, X2, . . . , Xn are independent, then

Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi).
The variance of the sample mean
The variance of the sample mean is given by

Var(X̄) = Var((1/n) Σ_{i=1}^n Xi) = (1/n²) Var(Σ_{i=1}^n Xi)
        = (1/n²) Σ_{i=1}^n Var(Xi) = (1/n²)(σ² + σ² + . . . + σ²)    (n terms in the sum)
        = (1/n²)·nσ² = σ²/n.
It follows that as the sample size increases, the variance of the sample mean decreases, i.e. the sample mean becomes a more replicable predictor of the theoretical (population) mean [as intuition would suggest].
If an estimator is unbiased and its variance decreases as the sample size grows, then for a large sample it will be an accurate estimator of the appropriate parameter.
The distribution of the sample mean
Note that the sample mean is simply the sum of n random
variables divided by a constant.
If n is sufficiently large (generally, n > 30), then this sum will have
approximately a normal distribution.
Since the sample mean is obtained by dividing this sum by a
constant, the sample mean will also have approximately a normal
distribution.
Since any normal distribution can be defined by its expected value and variance, we have X̄ ∼approx N(µ, σ²/n).
It should be noted that if the Xi have a normal distribution, then the sample mean will have (exactly) a normal distribution.
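A simulation sketch of this result (not from the slides), drawing samples from a deliberately non-normal Uniform(0, 1) population as an illustrative choice:

```python
import random

random.seed(2)
n, reps = 40, 100_000
mu, var = 0.5, 1 / 12   # mean and variance of the Uniform(0, 1) population

means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]
m = sum(means) / reps
v = sum((xb - m) ** 2 for xb in means) / reps

print(m, mu)        # the average of the sample means is close to mu = 0.5
print(v, var / n)   # their variance is close to sigma^2 / n, about 0.00208
```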
The sample proportion - estimator of a probability [or population proportion]
Public opinion polls are interested, for example, in the proportion
of the electorate who are going to vote for a certain party.
Suppose the proportion of the entire electorate who wish to vote
for party A is p (this is the population proportion).
Assume that the population is large in comparison to the sample
we observe and hence we may assume that the probability that
each individual in the sample votes for party A is p.
Thus the number of individuals, say X, in the sample who state that they want to vote for party A has a binomial distribution with parameters n and p.
Note that it is assumed here that no one hides their preferences or lies.
The sample proportion, denoted p̂, is the proportion of individuals
in a sample who have the trait of interest, here those who want to
vote for party A.
Hence, p̂ = X/n, where X is the number in the sample wanting to vote for party A and n is the sample size.
The sample proportion is an obvious estimator of the population
proportion p.
The distribution of the sample proportion
We have X ∼ Bin(n, p). When n is large and p is not close to 0 or 1, then X ∼approx N(np, np(1 − p)). This approximation works well when 0.1 ≤ p ≤ 0.9.
Since the sample proportion is simply X (the number in the sample
wishing to vote for party A) divided by n, the distribution of the
sample proportion is also approximately normal. We have
E[p̂] = E[X/n] = (1/n)E[X] = (1/n)·np = p,

Var[p̂] = Var[X/n] = (1/n²)Var[X] = (1/n²)·np(1 − p) = p(1 − p)/n.
It follows that the sample proportion is an unbiased estimator of the population proportion.
From the above derivations, p̂ ∼approx N(p, p(1 − p)/n).
It follows that as the sample size increases, the variance of the sample proportion decreases.
Also, for a given sample size, the maximum variance is always obtained when the population proportion is 0.5.
Example 3.6
a) Suppose that in reality 15% of the electorate wish to vote for Nowoczesna. Estimate the probability that in a sample of 500 individuals the proportion of those wishing to vote for Nowoczesna is between 12% and 18%.
b) Estimate the sample size required such that the proportion of individuals wishing to vote for any party can be estimated to within ±3% with probability 0.95.
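A sketch of the corresponding normal-approximation calculations in Python (phi below is the standard normal CDF, computed via math.erf; 1.96 is the usual 97.5th percentile of N(0, 1)):

```python
from math import sqrt, erf, ceil

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# a) p_hat is approximately N(p, p(1 - p)/n) with p = 0.15, n = 500.
p, n = 0.15, 500
se = sqrt(p * (1 - p) / n)
print(phi((0.18 - p) / se) - phi((0.12 - p) / se))   # approx. 0.94

# b) The worst case is p = 0.5, so require 1.96 * sqrt(0.25 / n) <= 0.03.
print(ceil((1.96 / 0.03) ** 2 * 0.25))               # 1068
```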
The Required Sample Size to Estimate a Population Mean to the Required Accuracy
From the calculations given above, we can calculate the sample
size needed to estimate any proportion to the required accuracy.
In order to calculate the sample size needed to calculate a
population mean to the required accuracy, we need information
regarding the population variance (or, equivalently, the standard
deviation).
This information can be in the form of a) a prior estimate of the population variance, or b) an estimate based on an initial sample of moderate size (the variance from this sample is used to estimate the required sample size).
Note that the distribution of the sample mean is (at least approximately) N(µ, σ²/n).
Example 3.7
Suppose the standard deviation of the monthly earnings of Poles is 2000 zl/month.
Calculate the sample size required to estimate the mean earnings of Poles to within 100 zl/month with a probability of 0.99.
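A sketch of the corresponding calculation (2.5758 is the 99.5th percentile of the standard normal distribution, giving a two-sided probability of 0.99):

```python
from math import ceil

sigma, d = 2000, 100   # population standard deviation and target accuracy
z = 2.5758             # 99.5th percentile of N(0, 1)

# Xbar is approximately N(mu, sigma^2 / n), so we need z * sigma / sqrt(n) <= d.
print(ceil((z * sigma / d) ** 2))   # 2654
```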