Statistical Methods
”Never trust a statistic you didn’t forge yourself.”
Winston Churchill
Florian Herzog
2013
Independent and identically distributed random variables
Definition 1. The random variables X1, ..., Xn are called a random sample
of size n from the population f(x) if X1, ..., Xn are mutually independent
random variables and the marginal pdf of each Xi is the same function f(x).
Alternatively, X1, ..., Xn are called independent and identically distributed
random variables with pdf f(x).
The joint pdf of X1, ..., Xn is given as:
f(x_1, x_2, \ldots, x_n) = f(x_1) f(x_2) \cdots f(x_n) = \prod_{i=1}^{n} f(x_i)
The slides of this section closely follow Chapters 5 and 7 of the book
”G. Casella and R. Berger, Statistical Inference, Duxbury Press, 2002”.
Identically and independently distributed random variables
Often in statistics (especially in estimation) we assume identically and independently distributed (i.i.d.) random variables (r.v.). This means that the random variables Xk, where k = 1, 2, ... indexes the realizations of the r.v., have the following properties:
• Each Xk ∼ f(x) is drawn from the same density f(x).
• Xk is independent of Xk−1, Xk−2, ..., X1.
• Each Xk is uncorrelated with each Xj, i.e. Cov[Xk, Xj] = 0 for all j ≠ k.
Sample mean and variance
Definition 2. The sample mean is the arithmetic average of the values in a
random sample and is denoted as

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i

Definition 3. The sample variance is the statistic defined as

S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2

The sample standard deviation is the statistic defined as S = \sqrt{S^2}.
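As a quick illustration, the sketch below computes these statistics with NumPy; the data values are purely hypothetical.

```python
import numpy as np

# Hypothetical i.i.d. sample (illustration only)
x = np.array([2.1, 1.7, 3.4, 2.9, 2.2, 1.8, 2.6])
n = len(x)

sample_mean = x.sum() / n                               # (1/n) * sum(X_i)
sample_var = ((x - sample_mean) ** 2).sum() / (n - 1)   # S^2 with 1/(n-1)
sample_std = np.sqrt(sample_var)                        # S = sqrt(S^2)

# NumPy equivalent: ddof=1 gives the 1/(n-1) normalization
assert np.isclose(sample_var, np.var(x, ddof=1))
print(sample_mean, sample_var, sample_std)
```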
Properties of i.i.d. random variables
Theorem 1. Let X1, X2, ..., Xn be independent and identically distributed
(i.i.d.) random variables with mean µ = E[Xi] and variance σ² = Var[Xi]. Then
E[\bar{X}] = \mu, \quad Var[\bar{X}] = \frac{\sigma^2}{n}, \quad E[S^2] = \sigma^2
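A small simulation sketch (assuming NumPy and, purely for illustration, a normal population) that checks these three identities by averaging over many repeated samples:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 3.0, 10, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))    # reps i.i.d. samples of size n
xbar = samples.mean(axis=1)                        # sample means
s2 = samples.var(axis=1, ddof=1)                   # unbiased sample variances

print(xbar.mean())   # ~ mu            (E[X-bar] = mu)
print(xbar.var())    # ~ sigma^2 / n   (Var[X-bar] = sigma^2 / n)
print(s2.mean())     # ~ sigma^2       (E[S^2] = sigma^2)
```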
Properties of i.i.d. random variables
Theorem 2. When we have X1, X2, ..., Xn i.i.d. from a normal distribution
with mean µ and variance σ², then
1. \bar{X} and S^2 are independent random variables,
2. \bar{X} is distributed N(\mu, \frac{\sigma^2}{n}),
3. (n-1)\frac{S^2}{\sigma^2} has a chi-square distribution with n-1 degrees of freedom.
Convergence of a sequence of r.v. {Xn}
1. The sequence {Xn} converges to X with probability one (or almost surely), X_n \xrightarrow{a.s.} X, if

P\left(\{\omega \in \Omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\}\right) = 1.

2. The sequence {Xn} converges to X in probability, X_n \xrightarrow{p} X, if

\lim_{n \to \infty} P\left(\{\omega \in \Omega : |X_n(\omega) - X(\omega)| > \varepsilon\}\right) = 0, \quad \text{for all } \varepsilon > 0.

3. The sequence {Xn} converges to X in L^p, X_n \xrightarrow{L^p} X, if

\lim_{n \to \infty} E\left(|X_n(\omega) - X(\omega)|^p\right) = 0.
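The law of large numbers is the classic example of convergence in probability: the sample mean of i.i.d. variables converges to µ. A minimal simulation sketch (assuming NumPy; the exponential population and the numbers are hypothetical choices) estimates P(|X̄n − µ| > ε) for growing n:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, eps, reps = 1.0, 0.1, 2_000

# P(|X-bar_n - mu| > eps) should shrink toward 0 as n grows
for n in (10, 100, 1_000, 10_000):
    xbar = rng.exponential(mu, size=(reps, n)).mean(axis=1)
    print(n, np.mean(np.abs(xbar - mu) > eps))
```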
Convergence concepts interrelations
• X_n \xrightarrow{L^p} X implies X_n \xrightarrow{L^q} X for q < p.
• X_n \xrightarrow{L^p} X implies X_n \xrightarrow{p} X (convergence in probability).
• X_n \xrightarrow{a.s.} X (almost sure) implies X_n \xrightarrow{p} X (convergence in probability).
• X_n \xrightarrow{p} X (convergence in probability) implies X_n \xrightarrow{d} X (convergence in distribution).
Parameter estimation (Point estimation)
In stochastic systems modeling, we often build models from data observation
(and not from physical first principles).
We need statistically motivated methods to identify the stochastic systems
under consideration. The identification of the stochastic systems requires the
following:
• Identification of the distribution
• Identification of the dynamics
• Identification of the system parameters
• Analysis of the parameter significance
In this section we only focus on the parameter estimation and assume that
the distribution is known. We will come back to this topic after the theoretical
introduction of stochastic processes.
Parameter estimation (Point estimation)
Definition 1. A point estimator is any function W(X1, X2, ..., Xn) of a
sample of random variables.
There are several ways of finding point estimators; the main ones are:
• Method of moments (MM)
• Maximum likelihood estimators (MLE)
• Expectation maximization (EM)
• Bayes estimators
Besides the methods of finding a point estimator, the evaluation (quality) of the
estimator also matters. In the following slides, we will introduce the method of moments
and the maximum likelihood estimator.
Method of moments
We have X1, X2, ..., Xn, a sample from a population with pdf
f(x|θ1, θ2, ..., θk). The parameters θi are the distribution parameters, e.g.
µ and σ in the case of a normal distribution.
Definition 2. The method of moments is the matching of the first k moments
of the data with the first k theoretical moments of the distribution. The
theoretical moments are a function of the parameters, and the parameter estimation
problem is reduced to solving k equations.
Method of moments
We have

m_1 = \frac{1}{n} \sum_{i=1}^{n} X_i, \quad \mu'_1 = E[X],

m_2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2, \quad \mu'_2 = E[X^2],

...

m_k = \frac{1}{n} \sum_{i=1}^{n} X_i^k, \quad \mu'_k = E[X^k],

where m_i denotes the sample moments and \mu'_i the theoretical moments.
Method of moments
Since \mu'_i is a function of (θ1, ..., θk), we get the following system of equations:

m_1 = \mu'_1(\theta_1, \ldots, \theta_k),

m_2 = \mu'_2(\theta_1, \ldots, \theta_k),

...

m_k = \mu'_k(\theta_1, \ldots, \theta_k),

where m_i denotes the sample moments and \mu'_i the theoretical moments.
The parameters are found by solving this system of k equations.
Method of moments
As the main example, we assume that the data is generated by a normal distribution
with mean µ and variance σ². We denote θ1 = µ and θ2 = σ². The first and
second moments of the normal distribution are given as
\mu'_1 = \mu = \frac{1}{n} \sum_{i=1}^{n} X_i

\mu'_2 = \mu^2 + \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2
Method of moments
Solving for µ and σ² we get:

\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i

\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \left(\frac{1}{n} \sum_{i=1}^{n} X_i\right)^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\mu})^2

The solutions are the sample moments of mean and variance and are of course
the ”natural” way of estimating the mean and variance of the normal distribution.
(Note that the method-of-moments variance uses the 1/n normalization rather than
the 1/(n−1) of the sample variance S².)
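A minimal sketch of this example (assuming NumPy; the function name mom_normal and the data are purely illustrative):

```python
import numpy as np

def mom_normal(x):
    """Method-of-moments estimates (mu_hat, sigma2_hat) for a normal sample.

    Matches the first two sample moments m1, m2 to the theoretical moments
    mu and mu^2 + sigma^2 and solves for the parameters.
    """
    x = np.asarray(x, dtype=float)
    m1 = x.mean()              # m1 = (1/n) sum X_i
    m2 = (x ** 2).mean()       # m2 = (1/n) sum X_i^2
    mu_hat = m1
    sigma2_hat = m2 - m1 ** 2  # equals (1/n) sum (X_i - mu_hat)^2
    return mu_hat, sigma2_hat

# Hypothetical data for illustration
rng = np.random.default_rng(2)
x = rng.normal(loc=1.5, scale=2.0, size=1_000)
print(mom_normal(x))           # approximately (1.5, 4.0)
```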
Maximum Likelihood Estimation
The likelihood is the joint pdf of X1, ..., Xn and is given as:

L(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f(x_i | \theta_1, \ldots, \theta_k).
We denote by x = [x1, x2, ...]^T and by θ = [θ1, θ2, ...]^T.
Definition 3. For each sample x, let \hat{\theta}(x) be a parameter value at which
L(θ|x) attains its maximum as a function of θ. A maximum likelihood estimator
(MLE) of the parameter θ based on the sample X is \hat{\theta}(X).
If the likelihood function is C², then possible candidates for the MLE are the
values of θ which solve

\frac{\partial}{\partial \theta_i} L(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) = 0.
Maximum log-Likelihood Estimation
Theorem 1. Maximum likelihood estimation is equivalent to maximum log-likelihood estimation. The log-likelihood is defined as

l(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} \log\left(f(x_i | \theta_1, \ldots, \theta_k)\right).

Example: We want to derive the maximum likelihood estimator for the mean
(µ) of the normal distribution under the assumption of known variance σ². The
log-pdf of the normal distribution is given as:

\log(f(x_i | \mu)) = -\frac{1}{2}\left(\log(2\pi\sigma^2) + \frac{(x_i - \mu)^2}{\sigma^2}\right)
Maximum log-Likelihood Estimation
The log-likelihood function is given as:

l(\mu | x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} -\frac{1}{2}\left(\log(2\pi\sigma^2) + \frac{(x_i - \mu)^2}{\sigma^2}\right)

Since σ² is known, the maximization problem is reduced to a least-squares problem:

\min_{\mu} \sum_{i=1}^{n} (x_i - \mu)^2

which has the solution

\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i
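A short numerical check of this example (assuming NumPy and SciPy; the data and the helper name neg_log_likelihood are hypothetical): maximizing the log-likelihood in µ gives the same answer as the closed-form sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
sigma2 = 4.0                                               # known variance
x = rng.normal(loc=0.7, scale=np.sqrt(sigma2), size=500)   # hypothetical data

def neg_log_likelihood(mu):
    # l(mu|x) = sum_i -0.5 * (log(2*pi*sigma2) + (x_i - mu)^2 / sigma2)
    return -np.sum(-0.5 * (np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2))

res = minimize_scalar(neg_log_likelihood)
print(res.x, x.mean())   # numerical MLE matches the sample mean
```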
Invariance of Maximum Likelihood Estimation
Theorem 2. The invariance property of MLEs states that if \hat{\theta} is the MLE of θ,
then for any function τ(θ), the MLE of τ(θ) is τ(\hat{\theta}).
Suppose that a distribution is parameterized by a parameter θ, but we are
interested in finding an estimator for some function of θ, say τ(θ); then we can
still use the MLE for θ. An example is as follows: if θ is the mean of a normal
distribution, the MLE of sin(θ) is sin(\hat{\mu}).
Quality of estimators: MSE
Definition 4. The mean squared error (MSE) of an estimator W of the parameter θ
is the function defined by E_\theta[(W - \theta)^2].
The MSE of W measures the average squared distance between the estimator
and the true value of the parameter. The MSE has the following interpretation:

E_\theta[(W - \theta)^2] = Var_\theta[W] + (E_\theta[W] - \theta)^2

The first term is the variance of the estimator W and the second term is the
squared bias.
Definition 5. The bias of an estimator W of the parameter θ is the difference
between the expected value of W and the true value of θ. An estimator whose
bias is zero is called an unbiased estimator.
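A simulation sketch of the decomposition MSE = variance + squared bias (assuming NumPy; the setup is illustrative), comparing the unbiased 1/(n−1) variance estimator with the biased 1/n version:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, reps = 0.0, 1.0, 20, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
s2_unbiased = x.var(axis=1, ddof=1)   # 1/(n-1) normalization
s2_biased = x.var(axis=1, ddof=0)     # 1/n normalization (biased, lower variance)

for w in (s2_unbiased, s2_biased):
    mse = np.mean((w - sigma**2) ** 2)
    bias = w.mean() - sigma**2
    print(mse, w.var() + bias**2)     # the two columns agree: MSE = Var + bias^2
```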
Quality of estimators: Bias and variance
In the multivariate case where θ is a vector of parameters, the variance of the
estimator is a covariance matrix Cov(\hat{\theta}). An estimator with low variance (covariance)
is called an efficient estimator (in the sense that little data is needed). The MSE
is often a trade-off between an unbiased estimator with higher variance and a biased but
efficient estimator.
The true value of the MSE can often not be determined since the true value of
θ is not known. Therefore, we focus on the variance of the estimator in order
to describe the quality of the estimator.
Definition 6. An estimator is called consistent when \hat{\theta} \xrightarrow{p} \theta, where \xrightarrow{p}
denotes convergence in probability. An unbiased estimator whose variance tends to zero is also consistent.
Quality of estimators: Normal distribution example
When we have X1, X2, ... i.i.d. data from a N(µ, σ²) distribution and use the
sample mean \bar{X} and sample variance S² as estimators:
• E[\bar{X}] = \mu and therefore \bar{X} is unbiased
• E[S^2] = \sigma^2 and therefore S^2 is unbiased
• E[(\bar{X} - \mu)^2] = Var[\bar{X}] = \frac{\sigma^2}{n}
• E[(S^2 - \sigma^2)^2] = Var[S^2] = \frac{2\sigma^4}{n-1}
The MSE of \bar{X} is still \frac{\sigma^2}{n} when the data is not normal, but this does not hold
for the MSE of S² when the data is not normally distributed.
Quality of estimators: Cramer-Rao bound for the variance
Definition 7. The Fisher information matrix J is defined as

J_{i,j} = \frac{1}{n} E\left[\frac{\partial}{\partial \theta_i} l(\theta_1, \ldots, \theta_k | x_1, \ldots, x_n) \cdot \frac{\partial}{\partial \theta_j} l(\theta_1, \ldots, \theta_k | x_1, \ldots, x_n)\right],

which is known as the outer product form. Under certain regularity conditions
and when the log-likelihood function is C², it can be calculated as:

J_{i,j} = -\frac{1}{n} E\left[\frac{\partial^2}{\partial \theta_i \partial \theta_j} l(\theta_1, \ldots, \theta_k | x_1, \ldots, x_n)\right],

which is called the inner product form. Note that the expectation is conditional
on θ.
Quality of estimators: Cramer-Rao bound for the variance
The Cramer-Rao bound states the following:
Theorem 3. The covariance of an unbiased estimator W is bounded by

Cov(W) \geq \frac{J^{-1}}{N},

where N is the number of observations. This bound also allows us to make a worst-case
approximation of the efficiency of an estimator.
The Fisher information matrix allows us to compute the uncertainty and thus
the quality of an estimator.
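A short numerical illustration of the bound for the normal-mean case (assuming NumPy; the numbers are hypothetical): with known variance, the per-observation Fisher information for µ is J = 1/σ², so the bound is σ²/N, and the sample mean attains it.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma2, N, reps = 1.0, 2.0, 50, 100_000

# Cramer-Rao bound for mu in N(mu, sigma2) with known variance:
# J = 1/sigma2 per observation, so Var[W] >= J^{-1} / N = sigma2 / N.
crb = sigma2 / N

xbar = rng.normal(mu, np.sqrt(sigma2), size=(reps, N)).mean(axis=1)
print(crb, xbar.var())   # the sample mean attains the bound (up to MC error)
```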
Quality of estimators
MLE is the main method for finding estimators, since it has the following
properties:
• Consistency: the estimator converges in probability to the value being
estimated.
• Asymptotic normality: as the sample size increases, the distribution of the
MLE tends to the Gaussian distribution with mean θ and covariance matrix
equal to the inverse of the Fisher information matrix.
• Efficiency: it achieves the Cramer-Rao lower bound when the sample
size tends to infinity. This means that no asymptotically unbiased estimator
has lower asymptotic mean squared error than the MLE.
• The estimate of θ is distributed as \hat{\theta} \sim N(\theta_{ML}, J_N^{-1}).
• Barrett's Theorem: ”The maximum-likelihood procedure in any problem is
what you are most likely to do if you don’t know any statistics.”