MTH 202 : Probability and Statistics
Lecture S1 :
10. Statistics
10.1 : Sampling
Statistics deals with studying samples collected from various experiments or observations. Ideally we would collect a large amount of such data, which would make the statistical conclusions most reliable. However, this is often practically impossible. We therefore study a reasonably convenient number of observations, called a sample, chosen at random.
Definition 10.1.1 : A Population is a larger set of values of observations or measurements of some experiment.
Definition 10.1.2 : A smaller subset of values from the population is called a sample.
Suppose we have a population of size N . Corresponding to each member of the population, we associate a numerical value, denoted by x1 , x2 , . . . , xN . We now introduce some parameters :

Population mean µ := (1/N) Σ_{i=1}^{N} xi

Population total τ := Σ_{i=1}^{N} xi

Population variance σ² := (1/N) Σ_{i=1}^{N} (xi − µ)²

We would often use the following identity : σ² = (1/N) Σ_{i=1}^{N} xi² − µ²
There is a special case in which the values x1 , x2 , . . . , xN are all either 0 or 1 (often referred to as the dichotomous case), simply representing the presence or absence of a certain characteristic. In this case the population mean represents the proportion of members having the characteristic, denoted say by p, and the population variance is p(1 − p). (Why?)
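The claim that the dichotomous variance is p(1 − p) can be checked numerically. The following sketch (Python, illustrative and not part of the original notes) computes the population variance of a 0/1 population directly :

```python
# Numerical check (illustrative) that a 0/1 population with proportion p
# of ones has population variance p * (1 - p).
def population_variance(values):
    """Population variance: (1/N) * sum((x_i - mu)^2)."""
    N = len(values)
    mu = sum(values) / N
    return sum((x - mu) ** 2 for x in values) / N

population = [1] * 30 + [0] * 70        # dichotomous population, p = 0.3
p = sum(population) / len(population)
print(abs(population_variance(population) - p * (1 - p)) < 1e-9)  # True
```

The key point is that xi² = xi for 0/1 values, so the mean of the squares equals p and hence σ² = p − p² = p(1 − p).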
We choose a set of n random samples (the choices are determined by a random number generator) from the total population. We call it simple random sampling (SRS in short) if each particular sample of size n has the same probability of occurrence. We will also assume that the sample of n members is chosen from the population without replacement (unless otherwise mentioned). In this case, there are C(N, n) = N !/(n!(N − n)!) such choices. Although it is convenient to study a sample of small size n compared to N , the number C(N, n) is often very large, and it is thus practically impossible to study all C(N, n) samples.
Since the samples of size n are chosen randomly, it is important to
realize them through a set X1 , X2 , . . . , Xn of random variables (mostly
with unknown distribution). These are obviously dependent on each
other. The following random variables would be important to us :
Sample mean X := (1/n) Σ_{i=1}^{n} Xi

Sample variance S² := (1/(n − 1)) Σ_{i=1}^{n} (Xi − X)²

Estimate of the population total T := N X
The probability distributions of X, X1 , X2 , . . . , Xn are referred to as sampling distributions. The i-th sample member is equally likely to be any of the N population members, i.e., P (Xi = xj ) = 1/N . However, in general there are repetitions of a particular value in the sample set :

Lemma 10.1.3 : Let ζ1 , ζ2 , . . . , ζm denote the distinct population values, and suppose nj is the number of population members that have value ζj . Then Xi is a discrete RV with PMF :

P (Xi = ζj ) = nj /N    (1 ≤ i ≤ n, 1 ≤ j ≤ m)

Also, E(Xi ) = µ, Var(Xi ) = σ².
Proof : See [RI, Page-205, Lemma A, Section 7.3]
Using this it can be easily verified that :
Theorem 10.1.4 : With SRS, E(X) = µ and E(T ) = τ .
Lemma 10.1.5 : For SRS

Cov(Xi , Xj ) = −σ²/(N − 1)    if i ≠ j
Proof : See [RI, Page-207, Lemma B, section 7.3]
Theorem 10.1.6 : With SRS

Var(X) = (σ²/n) (1 − (n − 1)/(N − 1))
Proof : See [RI, Page-208, Theorem B, section 7.3]
In case the sampling is done with replacement, it is easy to calculate that Var(X) = σ²/n. The extra factor

1 − (n − 1)/(N − 1) = (N − n)/(N − 1)

measures how much this sampling differs from the ideal scenario, namely sampling with replacement. The ideal scenario occurs when the population size is infinite. The above quantity is therefore called the finite population correction. The number n/N is called the sampling fraction.
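The effect of the finite population correction can be seen in a small simulation. This sketch (illustrative; the population and sizes are assumed, not from the notes) draws many SRS samples without replacement and compares the empirical variance of X with the formula of Theorem 10.1.6 :

```python
# Monte Carlo check of Var(Xbar) = (sigma^2/n) * (1 - (n-1)/(N-1))
# for simple random sampling without replacement.
import random

random.seed(0)
population = [random.gauss(10, 3) for _ in range(200)]   # N = 200 (assumed)
N, n = len(population), 25
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N

trials = 20000
means = [sum(random.sample(population, n)) / n for _ in range(trials)]
m = sum(means) / trials
empirical = sum((xb - m) ** 2 for xb in means) / trials

theoretical = (sigma2 / n) * (1 - (n - 1) / (N - 1))
# empirical and theoretical agree up to Monte Carlo error (a few percent).
```

Note that `random.sample` draws without replacement, which is exactly the SRS setup of this section.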
Corollary 10.1.7 : With SRS

Var(T ) = (N²σ²/n) (1 − (n − 1)/(N − 1))
10.2 : Estimation of Bias
Definition 10.2.1 : A statistic is a random variable which is a function of a set of random variables X1 , X2 , . . . , Xn constituting a random
sample.
e.g., the sample mean X and the sample variance S² are statistics.
We have come across certain parameters which arise while studying certain distributions. For example, n and p are the parameters for the binomial distribution Bin(n, p); we have discussed the parameter λ for the Poisson(λ) distribution.
Definition 10.2.2 : A statistic Θ̂ is called an unbiased estimator of
the parameter θ if E(Θ̂) = θ; otherwise Θ̂ is called biased.
For example, if X ∼ Bin(n, p), then E(X) = np, which means that X is a biased estimator of the parameter p. However, E(X/n) = p and hence the scaled random variable X/n is an unbiased estimator of p. In the previous section we noticed that X and T are unbiased estimators of the (population) parameters µ and τ respectively.
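The binomial example can be illustrated by simulation. The sketch below (assumed values n = 50, p = 0.3, not from the notes) averages many realizations: X concentrates near np, while X/n concentrates near p :

```python
# Simulation: X ~ Bin(n, p) has E(X) = np (biased as an estimator of p),
# while E(X/n) = p (unbiased).
import random

random.seed(1)
n, p, trials = 50, 0.3, 20000
xs = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]
mean_X = sum(xs) / trials        # close to np = 15, far from p = 0.3
mean_X_over_n = mean_X / n       # close to p = 0.3
```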
Let σ̂² be defined by

σ̂² = (1/n) Σ_{i=1}^{n} (Xi − X)²

Theorem 10.2.3 : With SRS

E(σ̂²) = σ² · ((n − 1)/n) · (N/(N − 1))
Proof : See [RI, Page-211, Theorem A, section 7.3.2]
Corollary 10.2.4 : With SRS, an unbiased estimator of Var(X) is

s_X² = (σ̂²/n) · (n/(n − 1)) · ((N − 1)/N) · ((N − n)/(N − 1)) = (s²/n) (1 − n/N)

where

s² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X)²
10.3 : Confidence Interval
Let us presume the ideal scenario, in which we have infinitely many random variables X1 , X2 , . . . which are independent and identically distributed with common mean µ and variance σ². We have the n-th average

X_n = (1/n)(X1 + X2 + . . . + Xn)

The Central Limit Theorem states that

P( (X_n − µ) / (σ/√n) ≤ z ) −→ Φ(z) as n → ∞

However, in practice, the variables are neither independent, nor is there an infinite supply of them. For this reason we have earlier used a kind of approximation method. The confidence interval is another way to estimate the error: it captures the error within an interval with a specified probability.
Definition 10.3.1 : An interval estimate of a parameter θ is an interval of the form θ̂1 < θ < θ̂2 where θ̂1 , θ̂2 are values of appropriate RVs
Θ̂1 and Θ̂2 respectively. By "appropriate" we mean
P (Θ̂1 < θ < Θ̂2 ) = 1 − α
for some specified probability (1 − α) and we say that (θ̂1 , θ̂2 ) is a
(1 − α)100% confidence interval.
10.3.2 Confidence interval of the mean µ :
Let z(α) (where 0 ≤ α ≤ 1) denote the point on the x-axis such that the area under the standard normal density curve over the interval [z(α), ∞) is α. Since the rest of the area is 1 − α and the curve is symmetric around the y-axis, we have z(1 − α) = −z(α).
If a random variable Z follows a standard normal distribution, then
P (−z(α/2) < Z < z(α/2)) = 1 − α
From the Central Limit Theorem we have learned that (X − µ)/σX is approximately standard normal. This means

P( −z(α/2) < (X − µ)/σX < z(α/2) ) ≈ 1 − α
In other words
P (X − z(α/2)σX < µ < X + z(α/2)σX ) ≈ 1 − α
Hence the probability that the mean µ lies in the interval

(x0 − z(α/2)σX , x0 + z(α/2)σX )

is approximately 1 − α (for the appropriate observed value x0 ). We call this interval the 100(1 − α)% confidence interval for µ.
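In practice z(α/2) is read off from the standard normal quantile function. A minimal sketch (with made-up sample data, and σX estimated by s/√n) :

```python
# 100(1 - alpha)% confidence interval for the mean, using the normal quantile.
from statistics import NormalDist, mean, stdev
import math

sample = [12.1, 9.8, 11.4, 10.2, 10.9, 11.7, 9.5, 10.8, 11.1, 10.4]
n = len(sample)
xbar = mean(sample)
se = stdev(sample) / math.sqrt(n)          # estimate of sigma_Xbar
alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)    # z(alpha/2), about 1.96
lo, hi = xbar - z * se, xbar + z * se      # the 95% confidence interval
```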
Exercise 10.3.3 : Suppose that a simple random sample is used to
estimate the proportion of families in a certain area that are living
below the poverty level. If this proportion is roughly 0.15, what
sample size is necessary so that the standard error of the estimate is
0.02? (Page 240, Ex-7, Chapter-7, Rice)
Solution : Counting the proportion here is a dichotomous case. Hence p = 0.15 and therefore σ² = p(1 − p) = 0.15 × 0.85.

Next, the standard error of the estimate is σX = 0.02 = σ/√n (ignoring the finite population correction). Hence

n = σ²/(0.02)² = (0.15 × 0.85)/0.0004 ≈ 319
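The arithmetic of the solution, spelled out (the numbers come from the exercise itself) :

```python
# Solve 0.02 = sqrt(p(1-p)/n) for n, ignoring the finite population correction.
p, se = 0.15, 0.02
n = p * (1 - p) / se ** 2     # 0.1275 / 0.0004 = 318.75, so n ≈ 319
```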
Exercise 10.3.4 : In a simple random sample of 1, 500 voters, 55%
said they planned to vote for a particular proposition, and 45% said
they planned to vote against it. The estimated margin of victory for
the proposition is thus 10%. What is the standard error of this estimated margin? What is an approximate 95% confidence interval for
the margin? (Page 240, Ex-9, Chapter-7, Rice)
Solution : The sample size is n = 1500. Let p = 0.55 denote the proportion of votes for the particular proposition, say Q. Then 1 − p = 0.45 is the proportion of votes for ¬Q. Since the estimator of p is

X = (1/n)(X1 + . . . + Xn ),

the estimator of the margin of victory 0.1 = p − (1 − p) = 2p − 1 is Y = 2X − 1, i.e., EY = 2p − 1.

The variance of the estimated margin Y is

σY² = Var(Y ) = E(Y − EY )² = 4 Var(X)

But Var(X) = p(1 − p)/n = 0.55 × 0.45/1500 (ignoring the finite population correction). Hence the standard error is

σY = √(4 × 0.55 × 0.45 / 1500) = √0.00066 ≈ 0.026

An approximate 95% confidence interval for the margin is given by

(EY − 1.96σY , EY + 1.96σY ) = (0.1 − 1.96 × 0.026, 0.1 + 1.96 × 0.026) = (0.049, 0.151)
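The same computation in code (numbers taken from the exercise) :

```python
# Standard error and approximate 95% CI for the margin Y = 2*Xbar - 1.
import math

n, p = 1500, 0.55
margin = 2 * p - 1                               # 0.10
se = math.sqrt(4 * p * (1 - p) / n)              # about 0.026
lo, hi = margin - 1.96 * se, margin + 1.96 * se  # about (0.049, 0.151)
```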
10.4 : Approximation and Estimation of Ratio
In a few statistical problems it is important to estimate certain ratios;
e.g. ratio of adults who has high school degrees, or ratio of children in
a family who are not affected by polio disease. Thus if there are two
sets of values corresponding to the population members :
x1 , x2 , . . . , xN ; y1 , y2 , . . . , yN
and the ratio of interest would be :
y1 + y2 + . . . + yN
µy
r=
=
x1 + x 2 + . . . + xN
µx
where µy , µx are the population means corresponding to the first and
second set of values respectively. The suffixes x and y are naturally
picked.
We introduce the variable R defined by

R := Y /X

and we will be estimating E(R) and Var(R).

The population covariance σxy of x and y is defined by

σxy = (1/N) Σ_{i=1}^{N} (xi − µx )(yi − µy )
Exercise 10.4.1 : With SRS show that :

Cov(X, Y ) = (σxy /n) (1 − (n − 1)/(N − 1))
We have already seen it is not so easy to deal with Y /X. Before we discuss how to estimate E(R), we need to establish some approximation formulae. Let g be a twice differentiable function. If U is a continuous random variable, we can use Taylor's approximation to establish :

V = g(U ) ≈ g(µU ) + (U − µU ) g′(µU )

and a little more accurately

V = g(U ) ≈ g(µU ) + (U − µU ) g′(µU ) + (1/2)(U − µU )² g″(µU )

Applying the formulations of expectation and variance we deduce :

µV ≈ g(µU ),  Var(V ) = σV² ≈ [g′(µU )]² σU²

This can be improved to include the second-order term, e.g.

µV ≈ g(µU ) + (1/2) σU² g″(µU )
If Z = g(U, V ) where g is differentiable up to second order and µ = (µU , µV ) (where µU , µV are the marginal expectations of U, V ), then

Z = g(U, V ) ≈ g(µ) + (U − µU ) ∂g/∂u (µ) + (V − µV ) ∂g/∂v (µ)

which implies E(Z) ≈ g(µ) and

Var(Z) ≈ [∂g/∂u (µ)]² σU² + [∂g/∂v (µ)]² σV² + 2σUV (∂g/∂u (µ))(∂g/∂v (µ))

where σUV = Cov(U, V ).
For the function Z = g(U, V ) = V /U (i.e., g(u, v) = v/u) we have

∂g/∂u = −v/u² ,  ∂g/∂v = 1/u ,

∂²g/∂u² = 2v/u³ ,  ∂²g/∂v² = 0 ,  ∂²g/∂u∂v = −1/u²

Hence if µU ≠ 0, we have

E(Z) ≈ µV /µU + (1/µU²) ( (µV /µU ) σU² − ρσU σV )

where ρ is the correlation coefficient given by

σUV = ρσU σV
Similarly

Var(Z) ≈ (1/µU²) ( (µV²/µU²) σU² + σV² − 2ρσU σV (µV /µU ) )
For more details see Page-161, Sec-4.6, Chapter-4, Rice. Now we apply
this to R = Y /X.
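As a sanity check on these approximation formulae, the following simulation is an illustrative sketch (U and V are taken independent normal here, so ρ = 0 and the correction term reduces to (µV /µU³)σU²); it compares a Monte Carlo estimate of E(V /U ) with the second-order approximation :

```python
# Monte Carlo check of E(V/U) ≈ muV/muU + (muV/muU**3) * sdU**2
# when U, V are independent (rho = 0). Parameters are assumed for the demo.
import random

random.seed(2)
muU, muV, sdU, sdV = 10.0, 4.0, 1.0, 0.5
trials = 200000
total = 0.0
for _ in range(trials):
    u = random.gauss(muU, sdU)
    v = random.gauss(muV, sdV)
    total += v / u
mc_mean = total / trials

approx = muV / muU + (muV / muU ** 3) * sdU ** 2   # 0.4 + 0.004
# mc_mean and approx agree to about three decimal places.
```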
Theorem 10.4.2 : With SRS, the approximate variance of R is

Var(R) ≈ (1/µx²) ( r²σX² + σY² − 2rσXY ) = (1/n) (1 − (n − 1)/(N − 1)) · (1/µx²) ( r²σx² + σy² − 2rσxy )
Proof : Since µX = µx , µY = µy , the first statement follows from the form of Var(Z) as above. The second part follows from Theorem 10.1.6.
Using the definition of the correlation coefficient ρ we can also express :

Var(R) ≈ (1/n) (1 − (n − 1)/(N − 1)) · (1/µx²) ( r²σx² + σy² − 2rρσx σy )
Theorem 10.4.3 : With SRS,

E(R) ≈ r + (1/n) (1 − (n − 1)/(N − 1)) · (1/µx²) ( rσx² − ρσx σy )
10.4.4 : Estimating the standard error of R
The population variances of x and y are estimated by s_x² and s_y². The population covariance is estimated by :

sxy = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X)(Yi − Y )

The population correlation is estimated by ρ̂ = sxy /(sx sy ). Hence the variance of R is estimated by

s_R² = (1/n) (1 − (n − 1)/(N − 1)) · (1/X²) ( R²s_x² + s_y² − 2Rsxy )
Note 10.4.5 : The following statements can be verified by the properties discussed above :
a. An approximate 100(1 − α)% confidence interval for r is R ± z(α/2) sR .
b. The ratio estimate of µy is

Y_R = µx R = µx (Y /X)

c. The approximate variance of the ratio estimate of µy is :

Var(Y_R ) ≈ (1/n) (1 − (n − 1)/(N − 1)) ( r²σx² + σy² − 2rρσx σy )

d. The approximate bias of the ratio estimate of µy is :

E(Y_R ) − µy ≈ (1/n) (1 − (n − 1)/(N − 1)) · (1/µx ) ( rσx² − ρσx σy )
Corollary 10.4.6 : The variance of Y_R can be estimated by

s_{Y_R}² = (1/n) (1 − (n − 1)/(N − 1)) ( R²s_x² + s_y² − 2Rsxy )

and an approximate 100(1 − α)% confidence interval for µy is

Y_R ± z(α/2) s_{Y_R}
10.5 : Stratified Random Sampling
Often studying the raw data as a whole is of little use, since it gives a very naive view. For instance, if we are trying to understand the ratio of a certain education level, or of people living below the poverty level, we need to keep in mind that there are certain regions where the expected values are considerably larger, whereas certain states marked as backward would have lower expected values of the population parameters.

For this reason it is often necessary to divide the total population into smaller groups, called "strata" (singular "stratum"), depending on, say, geographical locations, or on certain natural properties, which are independent of each other.
We first introduce formal notation. Suppose that there are L strata and denote by Nl the number of population members in the l-th stratum. Hence the number of members in the total population is N = N1 + N2 + . . . + NL .

Let the values corresponding to the population members of the l-th stratum be

x1l , x2l , . . . , xNl l

and let the mean and the variance in the l-th stratum be µl , σl² respectively (1 ≤ l ≤ L). The l-th population ratio is Wl = Nl /N .
If the population mean is µ, then

µ = (1/N) Σ_{l=1}^{L} Σ_{i=1}^{Nl} xil = (1/N) Σ_{l=1}^{L} Nl µl = Σ_{l=1}^{L} Wl µl
Assume that, within each stratum, an SRS of sample size nl is taken. We realize these by the random variables

X1l , X2l , . . . , Xnl l

which means that Xil1 and Xjl2 are independent of each other if l1 ≠ l2 (independence condition).

We define the sample mean X l as :

X l := (1/nl ) Σ_{i=1}^{nl} Xil    (1 ≤ l ≤ L)
It is easy to see that X l1 and X l2 are independent if l1 ≠ l2 . The stratified estimate X s of the mean is defined by

X s = (1/N) Σ_{l=1}^{L} Nl X l = Σ_{l=1}^{L} Wl X l

where the suffix s simply indicates stratification and is not a numeral.
Theorem 10.5.1 : The stratified estimate X s of the population mean is unbiased.

Proof :

E(X s ) = Σ_{l=1}^{L} Wl E(X l ) = Σ_{l=1}^{L} Wl µl = µ
From now on, we will describe this setup by calling it stratified SRS.
Theorem 10.5.2 : With stratified SRS

Var(X s ) = Σ_{l=1}^{L} Wl² · (σl²/nl ) (1 − (nl − 1)/(Nl − 1))
Proof : See Page-229, Theorem B, Section-7.5.2, Chapter-7, Rice.
If the sampling fractions within all strata are small, so that each factor (nl − 1)/(Nl − 1) is negligible, then from the above theorem we have

Var(X s ) ≈ Σ_{l=1}^{L} Wl² · σl²/nl
See Example A, Page-230, Section-7.5.2, Chapter-7. Now if Ts = N X s denotes the stratified estimate of the population total, we have

Corollary 10.5.3 : With stratified SRS

E(Ts ) = τ

and

Var(Ts ) = N² Var(X s ) = Σ_{l=1}^{L} Nl² · (σl²/nl ) (1 − (nl − 1)/(Nl − 1))
The estimate of σl² is given by :

sl² = (1/(nl − 1)) Σ_{i=1}^{nl} (Xil − X l )²

Var(X s ) is estimated by

s_{X s}² = Σ_{l=1}^{L} Wl² · (sl²/nl ) (1 − nl /Nl )

(Compare this with Corollary 10.2.4 above.)
10.5.4 : Methods of allocation
It is important to regulate the allocation by appropriate methods. There are essentially two methods we will discuss here. We have seen that, neglecting the finite population correction,

Var(X s ) = Σ_{l=1}^{L} Wl² σl² / nl

The first method of allocation is called Neyman allocation, which tries to minimize Var(X s ) subject to the condition n1 + n2 + . . . + nL = n.
Theorem 10.5.5 : The sample sizes n1 , n2 , . . . , nL that minimize Var(X s ) subject to the constraint n1 + n2 + . . . + nL = n are given by

nl = n · Wl σl / (W1 σ1 + . . . + WL σL )

where l = 1, 2, . . . , L.
Proof : See Page-232, Section 7.5.3, Chapter-7, Rice.
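A direct computation of the Neyman allocation (the strata weights and standard deviations below are hypothetical; in practice the nl must also be rounded to integers) :

```python
# Neyman allocation: n_l proportional to W_l * sigma_l.
def neyman_allocation(n, weights, sigmas):
    total = sum(w * s for w, s in zip(weights, sigmas))
    return [n * w * s / total for w, s in zip(weights, sigmas)]

W = [0.5, 0.3, 0.2]        # W_l = N_l / N (hypothetical)
sigma = [2.0, 6.0, 4.0]    # within-stratum standard deviations (hypothetical)
alloc = neyman_allocation(100, W, sigma)
# The high-variance stratum receives a disproportionately large sample.
```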
Substituting the optimal value of nl from the above theorem in the expression of Var(X s ) we obtain :

Corollary 10.5.6 : Denoting by X so the stratified estimate using the optimal allocation as given in the previous theorem, and neglecting the finite population correction,

Var(X so ) = (1/n) ( Σ_{l=1}^{L} Wl σl )²
The previous method is technically difficult to employ in sampling. A rather simpler method, called proportional allocation, is easier to compute. Suppose we assume the sampling fraction is constant in each stratum, i.e.,

n1 /N1 = n2 /N2 = . . . = nL /NL = (n1 + . . . + nL )/(N1 + . . . + NL ) = n/N
The estimate of the population mean based on proportional allocation is

X sp = Σ_{l=1}^{L} Wl X l = (1/n) Σ_{l=1}^{L} Σ_{i=1}^{nl} Xil
Theorem 10.5.7 : For the stratified estimate using the proportional allocation, neglecting the finite population correction,

Var(X sp ) = (1/n) Σ_{l=1}^{L} Wl σl²
Finally, a comparison between the two variance estimates is necessary
to judge which one is more suitable for a particular case of study.
Theorem 10.5.8 : With SRS, the difference between the above two variances, ignoring the finite population correction, is given by

Var(X sp ) − Var(X so ) = (1/n) Σ_{l=1}^{L} Wl (σl − σ̄)²

where

σ̄ = Σ_{l=1}^{L} Wl σl
Proof : See Page-235, Section 7.5.3, Chapter-7, Rice.
Theorem 10.5.9 : With SRS, neglecting the finite population correction,

Var(X) − Var(X sp ) = (1/n) Σ_{l=1}^{L} Wl (µl − µ)²
Proof : See Page-236-237, Section 7.5.3, Chapter-7, Rice.
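The gain from stratification can be illustrated by simulation. The sketch below (hypothetical strata; sampling is done with replacement so the finite population correction drops out) compares the variability of the plain SRS mean with the proportionally stratified estimate :

```python
# Monte Carlo comparison of Var(Xbar) under plain SRS (with replacement)
# versus Var(X_sp) under proportional stratified sampling.
import random

random.seed(3)
strata = [
    [random.gauss(10, 2) for _ in range(500)],   # stratum 1 (hypothetical)
    [random.gauss(20, 3) for _ in range(300)],   # stratum 2
    [random.gauss(40, 4) for _ in range(200)],   # stratum 3
]
population = [x for s in strata for x in s]      # N = 1000
n = 100
n_l = [50, 30, 20]                               # proportional: n_l/N_l = n/N

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

trials = 5000
srs_means = [sum(random.choices(population, k=n)) / n for _ in range(trials)]
strat_means = [
    sum(sum(random.choices(s, k=k)) for s, k in zip(strata, n_l)) / n
    for _ in range(trials)
]
# Stratification removes the (large) between-strata component of the variance,
# so the stratified means fluctuate far less than the plain SRS means.
```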
References :
[RS] V.K. Rohatgi and A.K. Saleh, An Introduction to Probability and Statistics, Second Edition, Wiley Student Edition.
[RI] John A. Rice, Mathematical Statistics and Data Analysis, Cengage Learning, 2013.