3. Joint Distributions of Random Variables
3.1 Joint Distributions of Two Discrete Random Variables

Suppose the discrete random variables X and Y have supports S_X and S_Y, respectively. The joint distribution of X and Y is given by the set \{P(X = x, Y = y)\}_{x \in S_X, y \in S_Y}. This joint distribution satisfies

    P(X = x, Y = y) \ge 0;    \sum_{x \in S_X} \sum_{y \in S_Y} P(X = x, Y = y) = 1.

When S_X and S_Y are small sets, the joint distribution of X and Y is usually presented in a table, e.g.

              X = 0    X = 1
    Y = -1      0       1/3
    Y =  0     1/3       0
    Y =  1      0       1/3

Here, P(X = 0, Y = -1) = 0, P(X = 0, Y = 0) = 1/3, etc.

3.1.1 Marginal Distributions of Discrete Random Variables

Note that by summing along a row (i.e. summing over all the possible values of Y for a particular realisation of X) we obtain the probability that X takes the appropriate value, i.e.

    P(X = x) = \sum_{y \in S_Y} P(X = x, Y = y).

Similarly, by summing along a column (i.e. summing over all the possible values of X for a particular realisation of Y) we obtain the probability that Y takes the appropriate value, i.e.

    P(Y = y) = \sum_{x \in S_X} P(X = x, Y = y).

The distributions of X and Y obtained in this way are called the marginal distributions of X and Y, respectively. These marginal distributions satisfy the standard conditions for a discrete distribution, i.e.

    P(X = x) \ge 0;    \sum_{x \in S_X} P(X = x) = 1.

3.1.2 Independence of Two Discrete Random Variables

Two discrete random variables X and Y are independent if and only if

    P(X = x, Y = y) = P(X = x)P(Y = y), for all x \in S_X, y \in S_Y.

3.1.3 The Expected Value of a Function of Two Discrete Random Variables

The expected value of a function g(X, Y) of two discrete random variables is defined as

    E[g(X, Y)] = \sum_{x \in S_X} \sum_{y \in S_Y} g(x, y) P(X = x, Y = y).

In particular, the expected value of X is given by

    E[X] = \sum_{x \in S_X} \sum_{y \in S_Y} x P(X = x, Y = y).

It should be noted that if we have already calculated the marginal distribution of X, then it is simpler to calculate E[X] from that.

3.1.4 Covariance and the Correlation Coefficient

The covariance between X and Y, Cov(X, Y), is given by

    Cov(X, Y) = E(XY) - E(X)E(Y).

Note that by definition Cov(X, X) = E(X^2) - E(X)^2 = Var(X). Also, for any constants a, b,

    Cov(X, aY + b) = a Cov(X, Y).

The coefficient of correlation between X and Y is given by

    \rho(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X) Var(Y)}}.

Properties of the Correlation Coefficient

1. -1 \le \rho(X, Y) \le 1.
2. If |\rho(X, Y)| = 1, then there is a deterministic linear relationship between X and Y, Y = aX + b. When \rho(X, Y) = 1, a > 0; when \rho(X, Y) = -1, a < 0.
3. If the random variables X and Y are independent, then \rho(X, Y) = 0. Note that the condition \rho(X, Y) = 0 is not sufficient for X and Y to be independent (see Example 3.1). However, when \rho(X, Y) \ne 0, X and Y are dependent.
4. If a > 0, \rho(X, aY + b) = \rho(X, Y), i.e. the correlation coefficient is independent of the units a variable is measured in. If a < 0, \rho(X, aY + b) = -\rho(X, Y).

Example 3.1

Discrete random variables X and Y have the following joint distribution:

              X = 0    X = 1
    Y = -1      0       1/3
    Y =  0     1/3       0
    Y =  1      0       1/3

a) Calculate i) P(X > Y), ii) the marginal distributions of X and Y, iii) the expected values and variances of X and Y, iv) the coefficient of correlation between X and Y.
b) Are X and Y independent?
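The quantities asked for in Example 3.1 can be computed mechanically from the table. The following Python sketch is illustrative only (it is not part of the original lecture); it computes the marginals, moments, correlation and the independence check directly from the definitions above.

```python
from itertools import product

# Joint distribution from Example 3.1: joint[(x, y)] = P(X = x, Y = y)
joint = {(0, -1): 0,   (1, -1): 1/3,
         (0,  0): 1/3, (1,  0): 0,
         (0,  1): 0,   (1,  1): 1/3}

# ii) Marginal distributions: sum over the other variable
p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (-1, 0, 1)}

# i) P(X > Y): sum the joint probabilities over all pairs with x > y
p_x_gt_y = sum(p for (x, y), p in joint.items() if x > y)

# iii) Expected values and variances from the marginals
e_x = sum(x * p for x, p in p_x.items())
e_y = sum(y * p for y, p in p_y.items())
var_x = sum(x * x * p for x, p in p_x.items()) - e_x ** 2
var_y = sum(y * y * p for y, p in p_y.items()) - e_y ** 2

# iv) Cov(X, Y) = E(XY) - E(X)E(Y), then the correlation coefficient
e_xy = sum(x * y * p for (x, y), p in joint.items())
rho = (e_xy - e_x * e_y) / (var_x * var_y) ** 0.5

# b) Independence requires P(X = x, Y = y) = P(X = x)P(Y = y) in every cell
independent = all(abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-12
                  for x, y in product(p_x, p_y))

print(p_x_gt_y, e_x, var_x, e_y, var_y, rho, independent)
# P(X > Y) = 1/3, E[X] = 2/3, Var(X) = 2/9, E[Y] = 0, Var(Y) = 2/3,
# rho = 0, independent = False
```

Here \rho(X, Y) = 0, yet the cell (0, -1) shows P(X = 0, Y = -1) = 0 \ne P(X = 0)P(Y = -1) = 1/9, so X and Y are dependent. This is exactly the situation described in property 3 above: zero correlation does not imply independence.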
3.1.5 Conditional Distributions and Conditional Expected Values

The conditional distribution of X given Y = y, where y \in S_Y, is defined by \{P(X = x | Y = y)\}_{x \in S_X}, where

    P(X = x | Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}.

Analogously, the conditional distribution of Y given X = x, where x \in S_X, is given by \{P(Y = y | X = x)\}_{y \in S_Y}, where

    P(Y = y | X = x) = \frac{P(X = x, Y = y)}{P(X = x)}.

Note: This definition is analogous to the definition of conditional probability given in Chapter 1.

Conditional Expected Value

The expected value of the function g(X) given Y = y is denoted E[g(X) | Y = y]. We have

    E[g(X) | Y = y] = \sum_{x \in S_X} g(x) P(X = x | Y = y).

In particular,

    E[X | Y = y] = \sum_{x \in S_X} x P(X = x | Y = y).

The expected value of the function g(Y) given X = x is defined analogously.

Probabilistic Regression

The graph of the probabilistic regression of X on Y is given by a scatter plot of the points \{(y, E[X | Y = y])\}_{y \in S_Y}. Note that the variable we are conditioning on appears on the x-axis. Analogously, the graph of the probabilistic regression of Y on X is given by a scatter plot of the points \{(x, E[Y | X = x])\}_{x \in S_X}. These graphs illustrate the nature of the (probabilistic) relation between the variables X and Y.

The graph of the probabilistic regression of Y on X, E[Y | X = x], may be:

1. Monotonically increasing in x. This indicates a positive dependency between X and Y; \rho(X, Y) will be positive.
2. Monotonically decreasing in x. This indicates a negative dependency between X and Y; \rho(X, Y) will be negative.
3. Flat. This indicates that \rho(X, Y) = 0. However, it does not always follow that X and Y are independent.
4. None of the above (e.g. oscillatory). In this case, X and Y are dependent, but this dependency cannot be described in simple terms such as positive or negative. The correlation coefficient can be 0, positive or negative in such a case (but cannot be -1 or 1).

Example 3.2

Discrete random variables X and Y have the following joint distribution (the same as in Example 3.1):

              X = 0    X = 1
    Y = -1      0       1/3
    Y =  0     1/3       0
    Y =  1      0       1/3

a) Calculate the conditional distributions of X given i) Y = -1, ii) Y = 0, iii) Y = 1.
b) Calculate the conditional distributions of Y given i) X = 0, ii) X = 1.
c) Draw the graph of the probabilistic regression of i) Y on X, ii) X on Y.

Example 3.3

The random variables X and Y may sometimes be defined on the basis of an experiment. In this case, each elementary event will correspond to a pair of values taken by X and Y (see also Example 2.2.1).

Suppose a coin is thrown 3 times. Let X be the number of heads thrown in the first two throws and Y be the number of heads thrown in the last two throws. Define the joint distribution of X and Y.
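Below is a minimal Python sketch (again illustrative, not part of the lecture) that builds the joint distribution of Example 3.3 by enumerating the eight equally likely outcomes, and then applies the machinery of Section 3.1.5 to obtain the conditional distributions and the probabilistic regression of Y on X.

```python
from itertools import product
from collections import defaultdict
from fractions import Fraction

# Enumerate the 8 equally likely outcomes of 3 coin throws (H = head, T = tail)
joint = defaultdict(Fraction)
for outcome in product("HT", repeat=3):
    x = sum(1 for c in outcome[:2] if c == "H")  # heads in the first two throws
    y = sum(1 for c in outcome[1:] if c == "H")  # heads in the last two throws
    joint[(x, y)] += Fraction(1, 8)

# Marginal of X, then the conditional distribution of Y given X = x
p_x = defaultdict(Fraction)
for (x, y), p in joint.items():
    p_x[x] += p

for x in sorted(p_x):
    cond = {y: joint[(x, y)] / p_x[x] for y in range(3) if joint[(x, y)] > 0}
    e_y_given_x = sum(y * p for y, p in cond.items())
    print(f"P(Y = y | X = {x}) = {cond},  E[Y | X = {x}] = {e_y_given_x}")
```

The regression E[Y | X = x] takes the values 1/2, 1 and 3/2 at x = 0, 1, 2: it is monotonically increasing, consistent with the positive dependence one would expect given that X and Y share the second throw.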
3.2 The Joint Distribution of n Independent Random Variables

Recall that if two discrete random variables X and Y are independent, then P(X = x, Y = y) = P(X = x)P(Y = y). This generalises as follows: if X_1, X_2, \ldots, X_n are independent discrete random variables, then

    P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = P(X_1 = x_1) P(X_2 = x_2) \cdots P(X_n = x_n).

A single continuous random variable X can be described by its density function f_X(x). If X_1, X_2, \ldots, X_n are independent continuous random variables, then their joint density function, denoted f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n), is given by

    f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) = f_{X_1}(x_1) f_{X_2}(x_2) \cdots f_{X_n}(x_n).

These results will be used in deriving estimators of parameters using the maximum likelihood method.

3.3 Simple Samples

A simple sample is obtained by observing n independent, identically distributed (i.i.d.) random variables X_1, X_2, \ldots, X_n. Suppose X_1, X_2, \ldots, X_n have the same distribution as X, where E[X] = \mu and Var(X) = \sigma^2. If these random variables are discrete, then the probability of observing a sample [a set of n realisations] (x_1, x_2, \ldots, x_n) is given by

    P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = P(X = x_1) P(X = x_2) \cdots P(X = x_n).

If these random variables are continuous, then the joint density function is given by

    f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) = f_X(x_1) f_X(x_2) \cdots f_X(x_n).

In addition,

    P(X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n) = P(X \le x_1) P(X \le x_2) \cdots P(X \le x_n).

The Sample Mean

The sample mean is defined as

    \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i.

Recall that

    E\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n E[X_i]    and    E[aX + b] = aE[X] + b.

It follows that

    E[\bar{X}] = E\left[\frac{1}{n} \sum_{i=1}^n X_i\right] = \frac{1}{n} \sum_{i=1}^n E[X_i] = \frac{1}{n} (\mu + \mu + \ldots + \mu) = \frac{1}{n} \cdot n\mu = \mu,

since there are n terms in the sum. Hence the expected value of the sample mean is equal to \mu (the population/theoretical mean).

Unbiased Estimators

An estimator is a function of the observations in a sample that is used to estimate the parameters of a distribution. For example, the sample mean is an obvious estimator of \mu. An estimator of a parameter is unbiased if its expected value is equal to the parameter itself, i.e. there is no systematic tendency to underestimate or overestimate the parameter. It follows that the sample mean is an unbiased estimator of \mu, the theoretical (population) mean.

Estimators of the Theoretical (Population) Variance \sigma^2

Since \sigma^2 is a measure of the average square distance of a random variable from its expected value, one might expect that the estimator s_n^2, where

    s_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2,

is an unbiased estimator of \sigma^2. However, it is not. It can be shown that s_{n-1}^2, where

    s_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2,

is an unbiased estimator of \sigma^2. The estimators s_n^2 and s_{n-1}^2 are often labelled s^2 and \hat{s}^2, respectively. In general, an estimator of a parameter \alpha is labelled \hat{\alpha}.

The Variance of an Estimator

Although an estimator might not show any systematic error, there will be a random error depending on the sample observed. The size of this random error may be described by the variance of the estimator. Recall that

    Var(aX + b) = a^2 Var(X),

and, if X_1, X_2, \ldots, X_n are independent,

    Var\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n Var(X_i).

The Variance of the Sample Mean

The variance of the sample mean is given by

    Var(\bar{X}) = Var\left(\frac{1}{n} \sum_{i=1}^n X_i\right) = \frac{1}{n^2} \sum_{i=1}^n Var(X_i) = \frac{1}{n^2} (\sigma^2 + \sigma^2 + \ldots + \sigma^2) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n},

since there are n terms in the sum.
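These statements about E[\bar{X}], Var(\bar{X}) and the two variance estimators can be checked by simulation. The sketch below is illustrative and not from the lecture; the Exponential(1) population (for which \mu = 1 and \sigma^2 = 1) is an arbitrary choice, as are the sample size n = 10 and the number of repetitions.

```python
import random

random.seed(0)
n, reps = 10, 100_000
means, s2_n, s2_n1 = [], [], []
for _ in range(reps):
    sample = [random.expovariate(1.0) for _ in range(n)]  # mu = 1, sigma^2 = 1
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)
    means.append(m)
    s2_n.append(ss / n)         # s_n^2: divisor n
    s2_n1.append(ss / (n - 1))  # s_{n-1}^2: divisor n - 1

def avg(v):
    return sum(v) / len(v)

mu_hat = avg(means)
print("E[X_bar]    ~", round(mu_hat, 3))                                   # ~1.0 = mu
print("Var(X_bar)  ~", round(avg([(m - mu_hat) ** 2 for m in means]), 4))  # ~0.1 = sigma^2/n
print("E[s_n^2]    ~", round(avg(s2_n), 3))                                # ~0.9: biased
print("E[s_n-1^2]  ~", round(avg(s2_n1), 3))                               # ~1.0: unbiased
```

The estimator with divisor n systematically underestimates \sigma^2, here by the factor (n-1)/n = 0.9, while the divisor n - 1 removes this bias, in line with the claim above.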
From the formula Var(\bar{X}) = \sigma^2/n, it follows that as the sample size increases, the variance of the sample mean decreases, i.e. the sample mean becomes an increasingly precise estimator of the theoretical (population) mean [as intuition would suggest]. Hence, if an estimator is unbiased, then when the sample size is large it will be an accurate estimator of the appropriate parameter.

The Distribution of the Sample Mean

Note that the sample mean is simply the sum of n random variables divided by a constant. If n is sufficiently large (generally, n > 30), then this sum will have approximately a normal distribution. Since the sample mean is obtained by dividing this sum by a constant, the sample mean will also have approximately a normal distribution. Since any normal distribution is defined by its expected value and variance, we have

    \bar{X} \sim_{approx} N(\mu, \sigma^2/n).

It should be noted that if the X_i themselves have a normal distribution, then the sample mean has exactly a normal distribution.

The Sample Proportion: Estimator of a Probability [or Population Proportion]

Public opinion polls are interested, for example, in the proportion of the electorate who are going to vote for a certain party. Suppose the proportion of the entire electorate who wish to vote for party A is p (this is the population proportion). Assume that the population is large in comparison to the sample we observe; hence we may assume that the probability that each individual in the sample votes for party A is p. Thus the number of individuals in the sample, say X, who state that they want to vote for party A has a binomial distribution with parameters n and p. Note: it is assumed here that no-one hides their preferences or lies.

The sample proportion, denoted \hat{p}, is the proportion of individuals in a sample who have the trait of interest, here those who want to vote for party A. Hence, \hat{p} = X/n, where X is the number in the sample wanting to vote for party A and n is the sample size. The sample proportion is an obvious estimator of the population proportion p.

The Distribution of the Sample Proportion

We have X \sim Bin(n, p). When n is large and p is not close to 0 or 1, then X \sim_{approx} N(np, np(1-p)). This approximation works well when 0.1 \le p \le 0.9. Since the sample proportion is simply X (the number in the sample wishing to vote for party A) divided by n, the distribution of the sample proportion is also approximately normal. We have

    E[\hat{p}] = E[X/n] = \frac{1}{n} E[X] = \frac{1}{n} np = p,

    Var(\hat{p}) = Var(X/n) = \frac{1}{n^2} Var(X) = \frac{1}{n^2} np(1-p) = \frac{p(1-p)}{n}.

It follows that the sample proportion is an unbiased estimator of the population proportion. From the above derivations,

    \hat{p} \sim_{approx} N\left(p, \frac{p(1-p)}{n}\right).

It follows that as the sample size increases, the variance of the sample proportion decreases. Also, for a given sample size, the maximum variance is always obtained when the population proportion is 0.5.

Example 3.6

a) Suppose that in reality 15% of the electorate wish to vote for Nowoczesna. Estimate the probability that in a sample of 500 individuals the proportion of those wishing to vote for Nowoczesna is between 12% and 18%.
b) Estimate the sample size required such that the proportion of individuals wishing to vote for any party can be estimated to within ±3% with probability 0.95.
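One way parts a) and b) can be computed with the normal approximation above is sketched below (the use of scipy is a convenience, not something prescribed by the lecture; for part b) the worst case p = 0.5 is taken, following the remark that this maximises the variance).

```python
from math import sqrt, ceil
from scipy.stats import norm

# a) p_hat ~approx N(p, p(1-p)/n) with p = 0.15, n = 500
p, n = 0.15, 500
se = sqrt(p * (1 - p) / n)        # standard error, ~0.0160
prob = norm.cdf(0.18, p, se) - norm.cdf(0.12, p, se)
print(round(prob, 3))             # ~0.94

# b) Worst case p = 0.5: require z_{0.975} * sqrt(0.25/n) <= 0.03
z = norm.ppf(0.975)               # ~1.96 for two-sided probability 0.95
n_required = ceil((z * 0.5 / 0.03) ** 2)
print(n_required)                 # ~1068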
The Required Sample Size to Estimate a Population Mean to the Required Accuracy

From the calculations given above, we can calculate the sample size needed to estimate any proportion to the required accuracy. In order to calculate the sample size needed to estimate a population mean to the required accuracy, we need information regarding the population variance (or, equivalently, the standard deviation). This information can come in the form of a) a prior estimate of the population variance, or b) an initial sample of moderate size (the variance from this sample is used to estimate the sample size required). Note that the distribution of the sample mean is approximately N(\mu, \sigma^2/n).

Example 3.7

Suppose the standard deviation of the monthly earnings of Poles is 2000 zl/month. Calculate the sample size required to estimate the mean earnings of Poles to within 100 zl/month with a probability of 0.99.
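A sketch of the corresponding calculation for Example 3.7, assuming the same normal-approximation reasoning as in Example 3.6 (the quantile z_{0.995} ~ 2.576 corresponds to a two-sided probability of 0.99):

```python
from math import ceil
from scipy.stats import norm

# X_bar ~approx N(mu, sigma^2/n); require z_{0.995} * sigma / sqrt(n) <= 100
sigma, accuracy = 2000, 100
z = norm.ppf(0.995)                          # ~2.576 for probability 0.99
n_required = ceil((z * sigma / accuracy) ** 2)
print(n_required)                            # ~2654
```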