Introduction to Statistics
Emilie Devijver
July 2016
Introduction
These notes are an introduction to the course that will be taught during the summer school. Students are required to read and understand these notes and to do the exercises before the summer school starts.
Questions may be sent to [email protected].
Every exercise should be done, and the solutions (at least one set per group of 3 people) should be made available by Sunday, July 10th, before 5 p.m. (either by e-mail, photos are accepted if you do not want to typeset in LaTeX, or through an organized meeting on Sunday). This is necessary for working in good conditions during the summer school: you will have the basic ideas and notations, and I will know what you know and what you don't.
This course will be illustrated by exercises in the Python language. You can find an introduction (if needed) on any website, e.g. https://www.kevinsheppard.com/images/0/09/Python_introduction.pdf.
You should know how to generate random data (with specific distributions), plot whatever you want (histograms, curves, ...), use loops (for loops, while loops, if conditions), and do standard computations.
1 Classical probability distributions
1.1 Discrete probability distributions
A discrete probability distribution is a probability distribution characterized by a probability mass function. Thus, the distribution of a random variable X is discrete, and X is called a discrete random variable, if
\[ \sum_{x \in \Omega(X)} P(X = x) = 1, \]
where Ω(X) is finite or countably infinite.
Definition 1 (Discrete uniform distribution). The discrete uniform distribution is a symmetric probability distribution whereby a finite number of values are equally likely to be observed.
More formally, let (a_1, ..., a_k) ∈ R^k, and X ∼ U({a_1, ..., a_k}). Then
\[ \Omega(X) = \{a_1, \dots, a_k\}, \qquad P(X = x) = \frac{1}{k} \text{ for all } x \in \Omega(X). \]
Example. Roll a fair die.
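As a quick illustration (my own sketch, using NumPy, which the notes do not prescribe), we can simulate die rolls and check that every face has empirical frequency close to 1/6:

```python
import numpy as np

# Simulate rolls of a fair die, i.e. the discrete uniform distribution
# on {1, ..., 6}; each face should have empirical frequency near 1/6.
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)
freqs = np.bincount(rolls, minlength=7)[1:] / len(rolls)
print(freqs)
```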
Definition 2 (Bernoulli distribution). The Bernoulli distribution is the probability distribution of a
random variable which takes the value 1 with success probability of p and the value 0 with failure probability
of q = 1 − p.
More formally, let p ∈ [0, 1], and X ∼ B(p). Then
\[ \Omega(X) = \{0, 1\}, \qquad P(X = 1) = p, \quad P(X = 0) = 1 - p. \]
Example. Flip a fair coin (p = 0.5).
Definition 3 (Binomial distribution). The binomial distribution with parameters n and p is the discrete
probability distribution of the number of successes in a sequence of n independent yes/no experiments,
each of which yields success with probability p.
More formally, let n ∈ N, p ∈ [0, 1], and X ∼ B(n, p). Then
\[ \Omega(X) = \{0, 1, \dots, n\}, \qquad P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \text{ for } k \in \Omega(X). \]
Example. Suppose a biased coin comes up heads with probability 0.3 when tossed. What is the probability
of achieving 3 heads after six tosses?
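The worked example can be checked numerically (a small sketch of mine, using only the standard library):

```python
from math import comb

# Worked example: n = 6 tosses, success probability p = 0.3,
# probability of exactly k = 3 heads.
n, p, k = 6, 0.3, 3
prob = comb(n, k) * p ** k * (1 - p) ** (n - k)
print(prob)  # about 0.185
```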
Definition 4 (Geometric distribution). The geometric distribution is the distribution of the number of Bernoulli trials needed to get one success.
More formally, let p ∈ (0, 1], and X ∼ G(p). Then
\[ \Omega(X) = \mathbb{N}^* = \{1, 2, \dots\}, \qquad P(X = k) = (1-p)^{k-1} p \text{ for } k \in \Omega(X). \]
Example. For example, suppose an ordinary die is thrown repeatedly until the first time a 1 appears.
The probability distribution of the number of times it is thrown is supported on the infinite set {1, 2, 3, . . .}
and is a geometric distribution with p = 1/6.
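This example can be simulated as well (a sketch of mine; NumPy's `Generator.geometric` returns the number of trials up to and including the first success, matching the support {1, 2, 3, ...}):

```python
import numpy as np

# Number of die throws until the first 1: geometric with p = 1/6.
rng = np.random.default_rng(1)
samples = rng.geometric(p=1 / 6, size=100_000)
print(samples.mean())  # close to 1/p = 6
```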
2
Definition 5 (Poisson distribution). Formally, for λ > 0, X ∼ P(λ) satisfies
\[ \Omega(X) = \mathbb{N} = \{0, 1, 2, \dots\}, \qquad P(X = k) = \frac{\lambda^k}{k!} e^{-\lambda} \text{ for } k \in \Omega(X). \]
Example. The Poisson distribution is useful to model events such as the number of goals scored in a
Euro 2016 football match.
1.2 Continuous probability distributions
A continuous probability distribution is a probability distribution that has a cumulative distribution
function that is continuous. Most often they are generated by having a probability density function.
Definition 6 (Continuous uniform distribution). The continuous uniform distribution is a family of
probability distributions such that for each member of the family, all intervals of the same length on the
distribution’s support are equally probable.
More formally, let (a, b) ∈ R², a < b, and X ∼ U([a, b]). The probability density function of the continuous uniform distribution is:
\[ f_X(x; a, b) = \frac{1}{b-a} \text{ for } a \le x \le b, \qquad 0 \text{ elsewhere.} \]
Definition 7 (Exponential distribution). The exponential distribution is the probability distribution that
describes the time between events in a Poisson process, i.e. a process in which events occur continuously
and independently at a constant average rate.
More formally, let λ > 0, and X ∼ E(λ). The probability density function of the exponential distribution is:
\[ f_X(x; \lambda) = \lambda e^{-\lambda x} \text{ for } x \ge 0, \qquad 0 \text{ elsewhere.} \]
Definition 8 (Gaussian distribution). Formally, let (µ, σ²) ∈ R × R₊*, and X ∼ N(µ, σ²). The probability density function of the Gaussian distribution is:
\[ f_X(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right). \]
Definition 9 (Multivariate Gaussian distribution). Formally, let (µ, Σ) ∈ R^p × S_p^{++}(R), where S_p^{++}(R) denotes the set of symmetric positive definite matrices of size p. Let X ∼ N_p(µ, Σ). The probability density function of the multivariate Gaussian distribution is:
\[ f_X(x; \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} \det(\Sigma)^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)^t \Sigma^{-1} (x-\mu) \right). \]
Definition 10 (Gamma distribution). Formally, let (α, β) ∈ (R₊*)², and X ∼ Gamma(α, β). The probability density function of the gamma distribution is:
\[ f(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} \exp(-\beta x) \text{ for } x > 0, \qquad 0 \text{ elsewhere,} \]
where
\[ \Gamma(\alpha) = \int_0^\infty x^{\alpha-1} \exp(-x)\,dx. \]
Definition 11 (Pareto distribution). Formally, let (α, β) ∈ (R₊*)², and X ∼ Pareto(α, β). The probability density function of the Pareto distribution is:
\[ f(x; \alpha, \beta) = \frac{\alpha \beta^\alpha}{x^{\alpha+1}} \text{ for } \beta \le x < \infty, \qquad 0 \text{ elsewhere.} \]
1.3 Useful probability tools
Definition 12 (Probability space). A probability space consists of three parts:
• A sample space, Ω, which is the set of all possible outcomes.
• A set of events F, where each event is a set containing zero or more outcomes.
• The assignment of probabilities to the events; that is, a function P from events to probabilities.
Definition 13 (Expectation). If X is a random variable defined on a probability space (Ω, Σ, P), then the expected value of X, denoted by E(X), is defined as the Lebesgue integral:
\[ E(X) = \int_\Omega X \, dP. \]
If X is a discrete random variable,
\[ E(X) = \sum_{x \in \Omega(X)} x \, P(X = x). \]
If X is a univariate continuous random variable, and if X admits a probability density function f(x),
\[ E(X) = \int_{-\infty}^{\infty} x f(x)\,dx. \]
Definition 14 (Variance). The variance is the expectation of the squared deviation of a random variable from its mean; informally, it measures how far a set of (random) numbers are spread out from their mean.
More formally, for a random variable X,
\[ \mathrm{Var}(X) = E\big( (X - E(X))^2 \big). \]
Exercise 1. Compute, for every distribution introduced in the previous subsections, the mean and the
variance associated.
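Not a solution to the exercise, but a possible way to sanity-check your hand computations by simulation (my own sketch, using NumPy, which the notes do not prescribe); shown here for the Poisson distribution, whose mean and variance both equal λ:

```python
import numpy as np

# Sanity-check by simulation (not a solution): for the Poisson
# distribution, the mean and the variance should both equal lambda.
rng = np.random.default_rng(2)
lam = 3.5
x = rng.poisson(lam, size=200_000)
print(x.mean(), x.var())  # both close to 3.5
```

The same pattern (sample, then compare empirical mean and variance to your formulas) works for every distribution above.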
2 Statistics: first definitions
2.1 Sample and estimator
In statistics and quantitative research methodology, a data sample is a set of data collected and/or selected from a statistical population by a defined procedure. More formally, let us give some definitions.
Definition 15 (Sample). Given a random variable X with distribution F , a random sample of length
n denoted by (X1 , . . . , Xn ) is a set of n independent, identically distributed (iid) random variables with
distribution F . A realization of this sample is an n-tuple (x1 , . . . , xn ) of values.
A sample concretely represents n experiments in which the same quantity is measured.
Definition 16 (Statistic). Let (X, A) and (Y, C) be two measurable spaces. A statistic is a measurable map
\[ T : (X^n, A^{\otimes n}) \to (Y, C), \]
where X^n is the Cartesian product and A^{⊗n} the product σ-algebra.
In this course, we assume that we observe a sample (X1 , . . . , Xn ) and a realization of it (x1 , . . . , xn )
generated from a distribution Pθ0 with parameter θ0 ∈ Θ. Our goal is, from this realization (x1 , . . . , xn ),
to estimate θ0 .
Definition 17 (Estimator). An estimator θ̂ for θ is a statistic with values in Θ. An estimation is the value of the estimator corresponding to a realization of the sample: θ̂(x_1, ..., x_n).
Example. When you compute your grade average, you assume that the grades are uniformly distributed (but everybody knows that a teacher has some difficulty giving a 0 or a 20, and tends to give a lot of grades belonging to {8, ..., 12}, so this is an approximation), and you want to estimate the expectation of this law according to the realization of the sample you have in hand.
2.2 Performance of an estimator
In the following, we will say that a statistic θ̂ is of order p if θ̂ ∈ L^p(P_θ).
Definition 18 (Bias). Let θ̂ be an estimator of order 1, i.e.
\[ E_{\theta_0}(\hat\theta) = \int_\Omega \hat\theta \, dP_{\theta_0} \]
is finite.
• The bias of θ̂ at θ_0 is E_{θ_0}(θ̂ − θ_0).
• θ̂ is unbiased if its bias is 0 for every θ_0 ∈ Θ.
• θ̂ is asymptotically unbiased if for every θ_0 ∈ Θ, lim_{n→∞} E_{θ_0}(θ̂) = θ_0.
Definition 19 (Mean squared error). Let θ̂ be an estimator of order 2.
The mean squared error (MSE) of an estimator measures the average of the squares of the errors, that is, of the differences between the estimator and what is estimated.
More formally, the MSE of an estimator θ̂ of a parameter θ_0 is defined by
\[ \mathrm{MSE}(\hat\theta) = E\big( (\hat\theta - \theta_0)^2 \big). \]
Let θ̂′ be another estimator of order 2. We say that θ̂ is better than θ̂′ if for every θ_0 ∈ Θ,
\[ \mathrm{MSE}(\hat\theta) \le \mathrm{MSE}(\hat\theta'). \]
Proposition 2.
\[ \mathrm{MSE}(\hat\theta) = \mathrm{Bias}(\hat\theta)^2 + \mathrm{Var}(\hat\theta). \]
Exercise 3. Prove it.
Definition 20 (MVUE). Among unbiased estimators, there often exists one with the lowest variance,
called the minimum variance unbiased estimator (MVUE).
Definition 21 (Consistency). A consistent sequence of estimators is a sequence of estimators that converges in probability to the quantity being estimated as the index (usually the sample size) grows without bound. In other words, increasing the sample size increases the probability of the estimator being close to the population parameter.
More formally, a sequence of estimators θ̂_n is a consistent estimator for the parameter θ_0 if and only if, for all ε > 0, we have
\[ P(|\hat\theta_n - \theta_0| < \varepsilon) \to_{n\to\infty} 1. \]
Exercise 4. Let (X_1, ..., X_n) be a sample generated by a continuous uniform distribution on [0, θ_0]. We propose two estimators for θ_0:
\[ \hat\theta_1 = \max_{1 \le i \le n} X_i, \qquad \hat\theta_2 = \frac{2}{n} \sum_{i=1}^{n} X_i. \]
Compare the performances of those estimators.
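The analytical comparison is what the exercise asks for, but an empirical estimate of both MSEs (a sketch of mine, with θ_0, n, and the number of repetitions chosen arbitrarily) can guide your intuition:

```python
import numpy as np

# Empirical MSE of both estimators by simulation (this guides the
# intuition; the analytical comparison is still expected).
rng = np.random.default_rng(3)
theta0, n, reps = 2.0, 50, 10_000

x = rng.uniform(0, theta0, size=(reps, n))
theta1 = x.max(axis=1)        # maximum of the sample
theta2 = 2 * x.mean(axis=1)   # twice the empirical mean

mse1 = np.mean((theta1 - theta0) ** 2)
mse2 = np.mean((theta2 - theta0) ** 2)
print(mse1, mse2)
```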
2.3 Classical estimators
2.3.1 Method of moments
The method of moments is a method of estimation of population parameters. One starts by deriving equations that relate the population moments (i.e., the expected values of powers of the random variable under consideration) to the parameters of interest. Then a sample is drawn and the population moments are estimated from the sample. The equations are then solved for the parameters of interest, using the sample moments in place of the (unknown) population moments. This results in estimates of those parameters.
More formally, let (X_1, ..., X_n) be a sample generated from a distribution described by θ. Let G = (m_1(θ), m_2(θ), ..., m_s(θ)) be the vector of the s first moments, and let Ĝ be its empirical version:
\[ \hat G = (\hat m_1, \dots, \hat m_s) = \left( \frac{1}{n}\sum_{k=1}^{n} x_k, \ \frac{1}{n}\sum_{k=1}^{n} x_k^2, \ \dots, \ \frac{1}{n}\sum_{k=1}^{n} x_k^s \right). \]
The moment estimator θ̂_mom is defined by
\[ \hat G = G(\hat\theta_{\mathrm{mom}}). \]
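As an illustration of the method (my example, not from the notes): for the exponential distribution, m_1(λ) = 1/λ, so matching the first moment gives λ̂ = 1/x̄:

```python
import numpy as np

# Method of moments for the exponential distribution: since
# m1(lambda) = E(X) = 1/lambda, matching the first empirical moment
# gives lambda_hat = 1 / sample_mean.
rng = np.random.default_rng(4)
lam0 = 2.0
x = rng.exponential(scale=1 / lam0, size=100_000)

lam_mom = 1 / x.mean()
print(lam_mom)  # close to lam0 = 2.0
```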
Exercise 5. We observe (x1 , . . . , xn ) a realization of a sample generated from a geometric distribution
with parameter p0 unknown. Compute the moment estimator of p0 .
Compute the bias and the mean square error for this estimator.
Exercise 6. We observe (x1 , . . . , xn ) a realization of a sample generated from a Gaussian distribution
with parameters µ0 and σ02 unknown. Compute the moment estimators of µ0 and σ02 .
Compute the bias and the mean square error for those estimators.
Exercise 7. Let (x1 , . . . , xn ) a sample generated from a Gamma distribution with parameters α0 and β0
unknown. Compute the moment estimators of α0 and β0 .
Assume then that you know α0 , and compute the moment estimator of β0 .
Assume then that you know β0 , and compute the moment estimator of α0 .
Exercise 8. Run the Gamma estimation experiment 1000 times for several different values of the sample size n and the parameters α_0 and β_0. Note the empirical bias and mean square error of every estimator introduced in the previous exercise. One would think that the estimators when one of the parameters is known should work better than the corresponding estimators when both parameters are unknown; investigate this question empirically.
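A possible skeleton for this experiment (a sketch of mine; the estimator formulas below assume that m_1 = α/β and that the centered second moment is α/β², which you should verify in Exercise 7 before trusting them):

```python
import numpy as np

# Skeleton for the Gamma experiment; assumed moment estimators:
# alpha_hat = m1^2 / variance, beta_hat = m1 / variance.
rng = np.random.default_rng(5)
alpha0, beta0, n, reps = 2.0, 3.0, 500, 1000

alphas, betas = [], []
for _ in range(reps):
    x = rng.gamma(shape=alpha0, scale=1 / beta0, size=n)
    m1, v = x.mean(), x.var()   # empirical mean and variance
    alphas.append(m1 ** 2 / v)
    betas.append(m1 / v)

print(np.mean(alphas) - alpha0, np.mean(betas) - beta0)  # empirical biases
```

Repeat with other values of n, α_0, β_0, and with the estimators where one parameter is known.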
Exercise 9. Let (x1 , . . . , xn ) be a realization of a sample generated from a Pareto distribution with
parameters α0 and β0 unknown. Compute the moment estimators of α0 and β0 .
Assume then that you know α0 , and compute the moment estimator of β0 .
Assume then that you know β0 , and compute the moment estimator of α0 .
Exercise 10. Run the Pareto estimation experiment 1000 times for several different values of the sample size n and the parameters α_0 and β_0. Note the empirical bias and mean square error of every estimator introduced in the previous exercise. One would think that the estimators when one of the parameters is known should work better than the corresponding estimators when both parameters are unknown; investigate this question empirically.
2.3.2 Likelihood and maximum likelihood estimator
A likelihood function is a function of the parameters of a statistical model given data.
Definition 22 (Likelihood).
• Let X be a random variable with a discrete probability distribution P_{θ_0} depending on a parameter θ_0. Then the likelihood function is defined by
\[ L(\theta_0; x) = P_{\theta_0}(X = x). \]
• Let X be a random variable with an absolutely continuous probability distribution with density function f_{θ_0} depending on a parameter θ_0. Then the likelihood function is defined by
\[ L(\theta_0; x) = f_{\theta_0}(x). \]
Remark 11. As we consider a sample, the random variables are independent and identically distributed. If we consider a sample (X_1, ..., X_n) and observe (x_1, ..., x_n), the likelihood is defined by
\[ L(\theta_0; (x_1, \dots, x_n)) = L(\theta_0; x_1) \times \dots \times L(\theta_0; x_n). \]
Definition 23. The maximum likelihood estimator is the parameter θbMLE which maximizes the likelihood
for a realization of a sample (x1 , . . . , xn ).
Remark 12. It is often useful to notice that the maximum likelihood estimator is also the parameter which maximizes the log-likelihood function, i.e. the logarithm of the likelihood function.
Exercise 13. For every distribution introduced in Section 1.1 and in Section 1.2 (at least, you should do it for the binomial distribution, the Poisson distribution, the continuous uniform distribution, and the Gaussian distribution), where the vector of unknown parameters is denoted θ_0, do the following.
Consider a sample (X_1, ..., X_n) generated from a distribution P_{θ_0}.
Compute the MLE of θ_0, denoted by θ̂_MLE.
Run this experiment 1000 times for n = 200 and parameters chosen by yourself (indicate the true parameters in the plots).
Plot a histogram of the estimators.
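A possible skeleton for the Gaussian case (a sketch of mine; the formula µ̂_MLE = x̄ is assumed here and should be derived in the exercise first, and plotting the histogram with matplotlib is left to you):

```python
import numpy as np

# Skeleton for the Gaussian case; mu_hat = x_bar is an assumption to be
# verified in the exercise. Plot mu_hats with matplotlib.pyplot.hist to
# get the requested histogram.
rng = np.random.default_rng(6)
mu0, sigma0, n, reps = 1.0, 2.0, 200, 1000

mu_hats = np.array([rng.normal(mu0, sigma0, n).mean() for _ in range(reps)])
counts, edges = np.histogram(mu_hats, bins=30)
print(mu_hats.mean())  # concentrates around mu0 = 1.0
```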
3 Central limit theorem
3.1 Reminder about the law of large numbers
Theorem 14 (Weak version). Let (X_n)_{n∈N} be a sequence of independent, identically distributed random variables such that E(X_1) < +∞ and Var(X_1) < +∞. Then, for all ε > 0,
\[ P\left( \left| \frac{X_1 + \dots + X_n}{n} - E(X) \right| \ge \varepsilon \right) \to_{n\to\infty} 0, \]
i.e.
\[ \frac{1}{n} \sum_{i=1}^{n} X_i \to^{P} E(X). \]
Theorem 15 (Strong version).
\[ \frac{1}{n} \sum_{i=1}^{n} X_i \to^{a.s.} E(X). \]
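A quick numerical illustration of the law of large numbers (my sketch, with an arbitrary Bernoulli example):

```python
import numpy as np

# Running mean of iid Bernoulli(0.3) variables: it should approach
# E(X) = 0.3 as n grows.
rng = np.random.default_rng(7)
x = rng.binomial(1, 0.3, size=100_000)
running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)
print(running_mean[-1])  # close to 0.3
```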
But what is the distribution of this estimator? With what error do we reach this limit?
3.2 First definitions and properties
Let X be a real random variable.
Definition 24. The cumulative distribution function is defined by
\[ F_X(x) = E(\mathbf{1}_{X \le x}) \quad \text{for } x \in \mathbb{R}. \]
Definition 25. The characteristic function is defined by
\[ \varphi_X(t) = E(e^{itX}). \]
If X has a density function f_X, it can be written as
\[ \varphi_X(t) = \int_{\mathbb{R}} e^{itx} f_X(x)\,dx. \]
Proposition 16. For f, g ∈ L¹(R),
\[ \hat f(t) = \int_{\mathbb{R}} f(x) e^{-itx}\,dx, \]
\[ \widehat{f'}(t) = it \hat f(t), \]
\[ h(x) = (f \star g)(x) = \int_{\mathbb{R}} f(y) g(x - y)\,dy \implies \hat h(t) = \hat f(t)\, \hat g(t), \]
\[ f(x) = \frac{1}{2\pi} \int_{\mathbb{R}} \hat f(t) e^{ixt}\,dt. \]
Proposition 17.
• The characteristic function always exists (as the integral of a bounded continuous function over a space whose measure is finite),
• it is uniformly continuous,
• it is bounded (by 1),
• X_1 ∼ X_2 ⇔ ϕ_{X_1} = ϕ_{X_2},
• E(X^k) = (−i)^k ϕ_X^{(k)}(0),
• if X_1 and X_2 are independent, ϕ_{X_1+X_2} = ϕ_{X_1} × ϕ_{X_2}.
Proposition 18.
\[ \varphi_X(t) = \sum_{k \ge 0} \frac{i^k m_k(X)}{k!} t^k, \]
where m_k(X) is the kth moment of X.
Proof.
\[ \varphi_X(t) = E(e^{itX}) = E\left( \sum_{k \ge 0} \frac{(itX)^k}{k!} \right) = \sum_{k \ge 0} \frac{i^k t^k}{k!} E(X^k). \]
Remark that we have interchanged the sum and the integral, which is allowed inside the disk of convergence.
Exercise 19. Show that if X ∼ N(m_X, σ_X²), then
\[ \varphi_X(t) = e^{i m_X t - t^2 \sigma_X^2 / 2}. \]
Exercise 20. Show that if X and Y are independent, with X ∼ N(m_X, σ_X²) and Y ∼ N(m_Y, σ_Y²), then
\[ X + Y \sim N(m_X + m_Y, \sigma_X^2 + \sigma_Y^2). \]
3.3 Lévy's theorem
Definition 26. Let (X_n)_{n∈N} be a sequence of random variables and X a random variable. Let F_n be the cumulative distribution function of X_n, and F the one of X. Then X_n →^d X if and only if for all x ∈ C(F), lim_{n→∞} F_n(x) = F(x), where C(F) denotes the set of continuity points of F.
Remark 21. By the Portmanteau lemma, it is equivalent to prove that for all f bounded and continuous, E f(X_n) →_{n→∞} E f(X).
Theorem 22 (Lévy's theorem). Let (X_n)_n be a sequence of random variables with characteristic functions ϕ_{X_n}. Let X be a random variable with characteristic function ϕ_X. Then,
\[ X_n \to^d X \iff \forall t \in \mathbb{R}, \ \lim_{n\to\infty} \varphi_{X_n}(t) = \varphi_X(t). \]
Remark 23. ϕX should be a characteristic function!
Proof. First assume that X_n →^d X. For all t ∈ R, as g_t : x ↦ e^{itx} is continuous and bounded,
\[ E(g_t(X_n)) \to_{n\to\infty} E(g_t(X)), \quad \text{i.e.} \quad \varphi_{X_n}(t) \to_{n\to\infty} \varphi_X(t). \]
Now, we assume that for all t ∈ R, ϕ_{X_n}(t) → ϕ_X(t). We should prove that for all g continuous and bounded, E(g(X_n)) → E(g(X)). We prove it for g ∈ C² with compact support.
As g has compact support, g′ and g″ ∈ L¹(R). We can then use the Fourier inversion formula:
\[ g(x) = \frac{1}{2\pi} \int_{\mathbb{R}} \hat g(t) e^{ixt}\,dt. \]
Then,
\[ E g(X_n) = E\left( \frac{1}{2\pi} \int_{\mathbb{R}} \hat g(t) e^{i X_n t}\,dt \right) = \frac{1}{2\pi} \int_{\mathbb{R}} \hat g(t) E(e^{i X_n t})\,dt = \frac{1}{2\pi} \int_{\mathbb{R}} \hat g(t) \varphi_{X_n}(t)\,dt \to \frac{1}{2\pi} \int_{\mathbb{R}} \hat g(t) \varphi_X(t)\,dt = E(g(X)), \]
the convergence holding by the dominated convergence theorem (check the hypotheses), and the final equality following from the same computation in the other direction.
3.4 Central limit theorem
Theorem 24. Let (X_1, ..., X_n) be a random sample of size n drawn from a distribution such that E(X_1) = µ and Var(X_1) = σ² < ∞. Let S_n = (X_1 + ... + X_n)/n. Then,
\[ \frac{S_n - \mu}{\sqrt{\sigma^2 / n}} \to^d N(0, 1). \]
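A numerical illustration of the theorem (my sketch): standardized means of iid Exponential(1) samples, for which µ = 1 and σ² = 1, behave like N(0, 1) for moderate n:

```python
import numpy as np

# Standardized means of iid Exponential(1) samples (mu = 1, sigma^2 = 1).
rng = np.random.default_rng(8)
n, reps = 100, 20_000
x = rng.exponential(scale=1.0, size=(reps, n))

# z = (S_n - mu) / sqrt(sigma^2 / n) for each of the `reps` samples
z = (x.mean(axis=1) - 1.0) / np.sqrt(1.0 / n)
print(z.mean(), z.std())  # should be close to 0 and 1
```

A histogram of z against the N(0, 1) density makes the convergence visible.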
Proof. Let Y be a random variable with E(Y) = 0 and Var(Y) = 1. We know that
\[ \varphi_Y(t) = 1 - \frac{t^2}{2} + o(t^2) \]
for t → 0. Let X be a random variable with E(X) = µ and Var(X) = σ², and set Y = (X − µ)/σ. Let X_1, ..., X_n be a random sample distributed as X, with Y_i = (X_i − µ)/σ. Let
\[ Z_n = \frac{S_n - \mu}{\sqrt{\sigma^2 / n}} = \sum_{i=1}^{n} \frac{Y_i}{\sqrt{n}}. \]
Then,
\[ \varphi_{Z_n}(t) = \varphi_{\frac{Y_1}{\sqrt n} + \dots + \frac{Y_n}{\sqrt n}}(t) = \left( \varphi_{Y_1}\!\left( \frac{t}{\sqrt n} \right) \right)^n = \left( 1 - \frac{t^2}{2n} + o\!\left( \frac{t^2}{n} \right) \right)^n \to_{n\to\infty} e^{-t^2/2} = \varphi_{N(0,1)}(t). \]
3.5 Application
We can now construct confidence intervals.
For example, we want to know the percentage of people having a red car. We ask 1000 people: 150 have a red car, 850 do not. Let p be the true proportion. With the previous notations, S_n = 150/1000 = 0.15 = p̂. According to the central limit theorem, if we let X_i = 1 if the ith person has a red car, and 0 otherwise, then X_i ∼ B(p), with E(X_i) = p and Var(X_i) = p(1 − p). Then,
\[ \frac{S_n - p}{\sqrt{p(1-p)/n}} \to^d_{n\to\infty} N(0, 1). \]
We know that P(−1.96 ≤ Z ≤ 1.96) = 0.95 for Z ∼ N(0, 1). Hence
\[ P\left( -1.96 \le \frac{S_n - p}{\sqrt{p(1-p)/n}} \le 1.96 \right) \approx 0.95 \]
(remark that this approximation is controlled by theoretical results, e.g. the Berry–Esseen inequality). As p(1 − p) ≤ 1/4 for p ∈ [0, 1],
\[ p \in \left[ 0.15 - \frac{1.96}{2\sqrt{1000}}, \ 0.15 + \frac{1.96}{2\sqrt{1000}} \right], \]
i.e. p ∈ [0.12, 0.18] with probability 0.95.
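The interval from this example can be reproduced numerically (a sketch of mine):

```python
import math

# Confidence interval from the example, with the conservative bound
# p(1 - p) <= 1/4.
n, p_hat = 1000, 150 / 1000
half_width = 1.96 / (2 * math.sqrt(n))
lower, upper = p_hat - half_width, p_hat + half_width
print(round(lower, 2), round(upper, 2))  # -> 0.12 0.18
```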
4 Consistency of the maximum likelihood estimator
4.1 A counter-example
Let f_θ(x) = (1/π) θ/(θ² + x²) for all x ∈ R. Let (X_1, ..., X_n) be a random sample with density f_θ, for θ > 0. Let us compute the maximum likelihood estimator:
\[ \hat\theta = \operatorname{argmax}_\theta L(\theta; x_1, \dots, x_n), \qquad L(\theta; x_1, \dots, x_n) = \sum_{i=1}^{n} \log(f_\theta(x_i)) = \sum_{i=1}^{n} \log\left( \frac{1}{\pi} \frac{\theta}{\theta^2 + x_i^2} \right). \]
Then
\[ \frac{\partial L}{\partial \theta} = \sum_{i=1}^{n} \left( \frac{1}{\theta} - \frac{2\theta}{\theta^2 + x_i^2} \right) = \frac{n}{\theta} - 2\theta \sum_{i=1}^{n} \frac{1}{\theta^2 + x_i^2}, \]
so
\[ \frac{\partial L}{\partial \theta} = 0 \iff \frac{n}{\theta} = 2\theta \sum_{i=1}^{n} \frac{1}{\theta^2 + x_i^2} \iff \frac{1}{2} = \frac{1}{n} \sum_{i=1}^{n} \frac{\theta^2}{\theta^2 + x_i^2} \iff \frac{1}{2} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{1 + (x_i/\theta)^2} = \phi_n(\theta). \]
Assume that θ̂ is consistent. By the law of large numbers,
\[ \phi_n(\theta) \to E\left( \frac{1}{1 + (X/\theta)^2} \right). \]
However,
\[ E\left( \frac{1}{1 + (X/\theta)^2} \right) \to_{\theta\to\infty} 1 \ne \frac{1}{2}, \]
which contradicts the consistency.
4.2 Kullback–Leibler divergence
Let P and Q be two probability measures over a set X. Assume that P is absolutely continuous w.r.t. Q (Q(A) = 0 ⇒ P(A) = 0 for every measurable set A).
Definition 27.
\[ KL(P, Q) = \int_X \log\left( \frac{dP}{dQ} \right) dP. \]
Remark 25.
• If P and Q are discrete probability distributions,
\[ KL(P, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}. \]
• If P and Q are distributions of a random variable, with densities p and q respectively,
\[ KL(P, Q) = \int_{\mathbb{R}} p(x) \log \frac{p(x)}{q(x)}\,dx. \]
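As an illustration (my example, not from the notes): the discrete formula from Remark 25, computed for two small distributions on {0, 1, 2}:

```python
import numpy as np

# KL divergence between two discrete distributions on {0, 1, 2},
# using the formula KL(P, Q) = sum_i P(i) log(P(i) / Q(i)).
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

kl = float(np.sum(P * np.log(P / Q)))
print(kl)  # nonnegative, and 0 only when P == Q
```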
Proposition 26.
1. KL(P, Q) ≥ 0.
2. KL(P, Q) = 0 ⇒ P = Q a.e.
Proof. We recall Jensen's inequality: if g ∈ L¹(µ) with µ a probability measure, and φ : R → R is continuous and convex, then
\[ \varphi\left( \int_\Omega g\,d\mu \right) \le \int_\Omega \varphi \circ g\,d\mu. \]
1. As t ↦ − log t is convex,
\[ KL(P, Q) = -\int_{\mathbb{R}} p(x) \log \frac{q(x)}{p(x)}\,dx \ge -\log \int_{\mathbb{R}} p(x) \frac{q(x)}{p(x)}\,dx = -\log \int_{\mathbb{R}} q(x)\,dx = 0. \]
2. Since t ↦ − log(t) is a strictly convex function for t > 0, equality holds exactly when p(x) = q(x) a.e.