Statistics, Bayes' Rule, Limit Theorems
CS 5960/6960: Nonparametric Methods
Tom Fletcher
January 26, 2009
Independent, Identically Distributed RVs
Definition
The random variables X1 , X2 , . . . , Xn are said to be
independent, identically distributed (iid) if they share
the same probability distribution and are independent of
each other.
Recall that independence means
F_{X_1,\dots,X_n}(x_1, \dots, x_n) = \prod_{i=1}^{n} F_{X_i}(x_i).
Random Samples
Definition
A random sample of length n from a distribution F is a set (X1 , . . . , Xn ) of iid random variables with distribution F . The length n is called the sample size.
- A random sample represents n experiments in which the same quantity is measured.
- A realization of a random sample, denoted (x1 , . . . , xn ), is the set of values we get when we take the measurements.
Statistics
Definition
A statistic on a random sample (X1 , . . . , Xn ) is a
function T(X1 , . . . , Xn ).
Examples:
- Sample Mean:
  \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
- Sample Variance:
  S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2
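As a quick illustration, a minimal sketch in Python (the data values here are hypothetical):

import numpy as np

# Hypothetical realization of a random sample of size n = 5
x = np.array([2.1, 3.4, 1.9, 2.8, 3.0])
n = len(x)

x_bar = x.sum() / n                      # sample mean
s2 = ((x - x_bar) ** 2).sum() / (n - 1)  # sample variance, 1/(n-1) convention

print(x_bar, s2)
# np.mean(x) and np.var(x, ddof=1) compute the same quantities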
Estimation
Given data x1 , . . . , xn , we want to define a probability model that describes the data as a realization of a random sample (X1 , . . . , Xn ).
This involves defining a probability distribution for the Xi with parameters θ = (θ1 , . . . , θm ) and then estimating these parameters from the data.
Example: If we decide to model our population with a
Gaussian distribution, we would need to estimate the
parameters (µ, σ), the mean and standard deviation.
Likelihood
Definition
The likelihood function, L(θ; {xi }), is the probability of observing the data xi given that they came from the density f (x|θ), where θ is a parameter. That is,

L(\theta; \{x_i\}) = \prod_{i=1}^{n} f(x_i \mid \theta).
Maximum Likelihood Estimation (MLE)
Definition
The maximum likelihood estimate of a parameter θ
given data xi is the parameter that maximizes the
likelihood function:
\hat{\theta} = \arg\max_{\theta} L(\theta; \{x_i\}).
Log-Likelihood
Typically, it is easier to maximize the log-likelihood
function, l(θ; {xi }) = log L(θ; {xi }), because it
converts products into sums:
\hat{\theta} = \arg\max_{\theta} \, l(\theta; \{x_i\})
             = \arg\max_{\theta} \, \log \left( \prod_{i=1}^{n} f(x_i \mid \theta) \right)
             = \arg\max_{\theta} \, \sum_{i=1}^{n} \log f(x_i \mid \theta).
Example: MLE of Gaussian Mean
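A standard sketch of this example: for iid Xi ∼ N(µ, σ²) with σ² fixed, the log-likelihood is

l(\mu; \{x_i\}) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2.

Setting the derivative \frac{dl}{d\mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) to zero gives \hat{\mu} = \bar{x}, the sample mean.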
Exercise: MLE of Exponential Distribution
The exponential distribution with rate parameter λ is
defined by the pdf
f(x) = \lambda e^{-\lambda x}, \quad x \geq 0.
What is the MLE for λ given data x1 , . . . , xn ?
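For a numerical check, here is a minimal Python sketch that maximizes the exponential log-likelihood by grid search (the simulated data, true rate, and grid are hypothetical choices, not part of the exercise):

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)  # hypothetical data, true rate lambda = 0.5

def log_likelihood(lam, x):
    # l(lambda) = n log(lambda) - lambda * sum(x), from f(x) = lambda exp(-lambda x)
    return len(x) * np.log(lam) - lam * x.sum()

lams = np.linspace(0.01, 2.0, 2000)        # grid of candidate rates
ll = np.array([log_likelihood(l, x) for l in lams])
lam_hat = lams[np.argmax(ll)]
print(lam_hat)                             # should land close to 0.5 for this sample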
Error of an Estimator
Definition
The mean-squared error (MSE) of an estimator θ̂ for a
parameter θ is
MSE(θ̂) = E[(θ̂ − θ)²].
Note that the estimator θ̂ is considered as a function of
the random sample {Xi }, and these random variables
are what the expectation is taken over.
Bias and Variance of an Estimator
Definition
The bias of an estimator θ̂ is given by
Bias(θ̂) = E[θ̂ − θ] = E[θ̂] − θ.
Definition
The variance of an estimator θ̂ is given by
Var(θ̂) = E[(θ̂ − E[θ̂])²].
Decomposition of MSE
The MSE of an estimator θ̂ decomposes into the bias
and variance as follows:
MSE(θ̂) = Var(θ̂) + Bias(θ̂)².
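This follows by adding and subtracting E[θ̂] inside the square:

\mathrm{MSE}(\hat{\theta}) = E\big[ (\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta)^2 \big] = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2,

where the cross term vanishes because E[θ̂ − E[θ̂]] = 0 and E[θ̂] − θ is a constant.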
Example: Bias of Sample Mean
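A standard sketch of this example: by linearity of expectation,

E[\bar{X}] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n} \, n\mu = \mu,

so Bias(X̄) = E[X̄] − µ = 0, i.e., the sample mean is an unbiased estimator of the population mean.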
Exercise: Variance of Sample Mean
Let X1 , . . . , Xn be continuous iid random variables with mean µ and variance σ² < ∞. What is the variance of the sample mean X̄?
Bayes’ Rule
Bayes’ rule is the following formula for the probability of
an event A conditioned on the event B.
P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}.
This formula is easily derived from the definition of conditional probability:
P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A);
dividing both sides of the second equality by P(B) gives the rule.
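As a concrete illustration, a minimal Python sketch for a diagnostic test, where B = "test is positive" and A = "condition is present" (all numbers are hypothetical):

# Hypothetical numbers: P(A) = prevalence, P(B|A) = test sensitivity,
# P(B|not A) = false-positive rate.
p_a = 0.01
p_b_given_a = 0.95
p_b_given_not_a = 0.05

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' rule: P(A|B) = P(B|A)P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # about 0.16: a positive test is far from conclusive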
Bayesian Statistics
The posterior distribution combines the likelihood with
a prior on the parameters, which represents our existing
beliefs or knowledge about the parameters.
Likelihood:  f (x1 , . . . , xn |θ)
Prior:       fθ (θ)
Posterior:   fθ|x (θ)

According to Bayes' Rule:

f_{\theta \mid x}(\theta) = \frac{f(x_1, \dots, x_n \mid \theta) \, f_\theta(\theta)}{\int f(x_1, \dots, x_n \mid \theta) \, f_\theta(\theta) \, d\theta}.
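A concrete sketch of this update in Python, assuming a Bernoulli (coin-flip) likelihood with a Beta prior, a standard conjugate pair; the flips and prior parameters below are hypothetical:

import numpy as np
from scipy import stats

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # hypothetical flips, theta = P(heads)
a, b = 2.0, 2.0                             # Beta(2, 2) prior on theta

# For this conjugate pair the normalizing integral has a closed form,
# and the posterior is again a Beta distribution:
a_post = a + data.sum()                     # prior "successes" + observed heads
b_post = b + len(data) - data.sum()         # prior "failures" + observed tails

posterior = stats.beta(a_post, b_post)
print(posterior.mean())                     # posterior mean of theta, here 8/12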
Frequentist vs. Bayesian
A Frequentist thinks of data coming from a fixed “true”
distribution of the population. The parameters of this
distribution are unchanging.
A Bayesian thinks of the parameters of the population
distribution as random variables. The prior specifies a
belief about the nature of the parameters.
Limit Theorems
Limit theorems state some property of a statistic
T(X1 , . . . , Xn ) in the limit as the sample size goes to
infinity, i.e., as n → ∞.
Typically, a limit theorem will state that a statistic converges to the corresponding quantity for the population; for example, that the sample mean converges to the population mean.
Weak vs. Strong Convergence
Given iid random variables X1 , X2 , . . . and a statistic Tn (X1 , . . . , Xn ), we can define two types of convergence for the statistic: weak or strong.
Weak Convergence
We say that the statistic Tn (X1 , . . . , Xn ) converges
weakly, or converges in probability, to τ ∈ R if for every ε > 0,

\lim_{n \to \infty} P(|T_n - \tau| < \varepsilon) = 1.
Strong Convergence
We say that the statistic Tn (X1 , . . . , Xn ) converges
strongly, or converges almost surely, to τ ∈ R if for every ε > 0,

P\left( \lim_{n \to \infty} |T_n - \tau| < \varepsilon \right) = 1.
Weak Law of Large Numbers
Theorem
Let X1 , X2 , . . . be iid random variables, with E|Xi | < ∞
and population mean µ = E[Xi ]. Then the sample mean
X̄n converges weakly to µ as n → ∞.
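A minimal simulation sketch in Python (the Uniform(0, 1) population and the sample sizes are hypothetical choices):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=100_000)  # iid Uniform(0, 1), population mean 0.5

# Running sample mean X̄_n for increasing n
running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)

for n in (10, 1_000, 100_000):
    print(n, running_mean[n - 1])        # settles toward 0.5 as n grows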
Chebyshev’s Inequality
Theorem
Let X be a random variable with mean µ < ∞ and
variance σ² < ∞. Then for any real number k > 0,

P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}.
This tells us that “most” values of a random variable are
“close” to the mean.
Example (k = 2): no more than 1/4 of the probability mass lies more than two standard deviations from the mean.
Proof of Chebyshev’s Inequality
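A standard sketch: whenever |X − µ| ≥ kσ, we have (X − µ)² ≥ k²σ², so

\sigma^2 = E[(X - \mu)^2] \geq E\big[ (X - \mu)^2 \, \mathbf{1}\{ |X - \mu| \geq k\sigma \} \big] \geq k^2 \sigma^2 \, P(|X - \mu| \geq k\sigma).

Dividing through by k²σ² gives the inequality.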
Proof of Weak Law of Large Numbers
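A standard sketch, under the extra assumption Var(Xi ) = σ² < ∞ (the theorem as stated needs only E|Xi | < ∞, but that proof is harder): E[X̄n ] = µ and Var(X̄n ) = σ²/n, so Chebyshev's inequality applied to X̄n gives, for every ε > 0,

P(|\bar{X}_n - \mu| \geq \varepsilon) \leq \frac{\sigma^2}{n \varepsilon^2} \to 0 \quad \text{as } n \to \infty,

which is exactly weak convergence of X̄n to µ.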
Strong Law of Large Numbers
Theorem
Let X1 , X2 , . . . be iid random variables, with E|Xi | < ∞
and population mean µ = E[Xi ]. Then the sample mean
X̄n converges strongly to µ as n → ∞.
The proof is much more difficult than in the weak case.
Central Limit Theorem
Theorem
Let X1 , X2 , . . . be iid random variables from a distribution
with mean µ and variance σ² < ∞. Then in the limit as n → ∞, the statistic

Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}

has a standard normal distribution.
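A minimal simulation sketch in Python (the exponential summands, sample size, and repetition count are hypothetical choices), showing that Zn behaves approximately like a standard normal:

import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.0, 1.0        # mean and std of Exponential(rate = 1) summands
n, reps = 50, 100_000       # sample size and number of simulated Z_n values

x = rng.exponential(scale=1.0, size=(reps, n))
z = (x.mean(axis=1) - mu) / (sigma / np.sqrt(n))

# Compare the simulated upper-tail probability with the standard
# normal value P(Z > 1.96) ≈ 0.025
print((z > 1.96).mean())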
Importance of the Central Limit Theorem
- Applies to real-world data when the measured quantity comes from the average of many small effects.
- Examples include electronic noise, interaction of molecules, exam grades, etc.
- This is why a Gaussian distribution model is often used for real-world data.