CHAPTER 4:
Parametric Methods
Parametric Methods
A statistic is any value that is calculated from a given
sample. In statistical inference, we make a decision using
the information provided by a sample.
Our first approach is parametric where we assume that
the sample is drawn from some distribution that obeys a
known model, for example, Gaussian.
The advantage of the parametric approach is that the
model is defined with a small number of parameters—for
example, mean, variance—the sufficient statistics of the
distribution. Once those parameters are estimated from
the sample, the whole distribution is known.



Parametric Methods
We estimate the parameters of the distribution from the
given sample, plug in these estimates to the assumed
model, and get an estimated distribution, which we then
use to make a decision.
The method we use to estimate the parameters of a
distribution is maximum likelihood estimation. We also
introduce Bayesian estimation, which we will continue
discussing in chapter 14.


Parametric Estimation
We have an independent and identically distributed (iid) sample

X = { x^t }t

We assume that the x^t are instances drawn from some known
probability density family, p(x|θ), defined with parameters θ, such that

x^t ~ p(x|θ)

We want to find θ that makes sampling the x^t from p(x|θ)
as likely as possible.

For example, we find θ = { μ, σ² } for the normal (Gaussian)
distribution N(μ, σ²).
Normal distribution
X is normal or Gaussian distributed with mean μ and variance σ², denoted
N(μ, σ²), if its density function is

p(x) = (1/(√(2π) σ)) exp[−(x − μ)² / (2σ²)]

The parameter μ in this definition is the mean or expectation of the
distribution (and also its median and mode). The parameter σ is its
standard deviation; its variance is therefore σ². A random variable
with a Gaussian distribution is said to be normally distributed and is
called a normal deviate.
If μ = 0 and σ = 1, the distribution is called the standard normal
distribution or the unit normal distribution, and a random variable
with that distribution is a standard normal deviate.
Normal distribution
In probability theory, the normal (or Gaussian) distribution is a very
commonly occurring continuous probability distribution—a function that
tells the probability that an observation in some context will fall between
any two real numbers. For example, the distribution of grades on a test
administered to many people is often approximately normal. Normal distributions
are extremely important in statistics and are often used in the natural and
social sciences for real-valued random variables whose distributions are
not known.
The normal distribution is immensely useful because of the central limit
theorem, which states that, under mild conditions, the mean of many
random variables independently drawn from the same distribution is
distributed approximately normally, irrespective of the form of the original
distribution: physical quantities that are expected to be the sum of many
independent processes (such as measurement errors) often have a
distribution very close to the normal.


http://www.mathsisfun.com/data/standard-normal-distribution.html

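The central limit theorem can be checked empirically. The Python sketch below is not part of the slides; the Uniform(0, 1) choice and the sample sizes are arbitrary assumptions. It averages many uniform draws and compares the spread of the averages with what the theorem predicts.

```python
# Sketch (not from the slides): checking the central limit theorem empirically.
# The mean of 50 Uniform(0, 1) draws should be approximately N(0.5, (1/12)/50).
import numpy as np

rng = np.random.default_rng(0)

# Each row is one experiment: the mean of 50 uniform draws.
means = rng.uniform(0.0, 1.0, size=(100_000, 50)).mean(axis=1)

print("empirical mean :", means.mean())            # close to 0.5
print("empirical std  :", means.std())             # close to the value below
print("theoretical std:", np.sqrt((1 / 12) / 50))  # about 0.0408
```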
Maximum Likelihood Estimation

Given an independent and identically distributed (iid)
sample X, the likelihood of the parameter θ is the
product of the likelihoods of the individual points:

l(θ|X) = p(X|θ) = ∏t p(x^t|θ)

In maximum likelihood estimation, we are interested
in finding θ that makes X the most likely to be drawn.
We thus search for θ that maximizes the likelihood,
which we denote by l(θ|X).
Maximum likelihood estimator (MLE):
θ* = argmaxθ L(θ|X)
Maximum Likelihood Estimation
We can maximize the log of the likelihood
without changing the value where it takes its
maximum. log(·) converts the product into a sum
and leads to further computational simplification
when certain densities are assumed, for
example, those containing exponentials. The log
likelihood is defined as

L(θ|X) = log l(θ|X) = ∑t log p(x^t|θ)

Maximum Likelihood Estimation

Likelihood of θ given the sample X:
l(θ|X) = p(X|θ) = ∏t p(x^t|θ)

Log likelihood:
L(θ|X) = log l(θ|X) = ∑t log p(x^t|θ)

Maximum likelihood estimator (MLE):
θ* = argmaxθ L(θ|X)
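As a concrete illustration of these definitions, the Python sketch below (not from the slides) evaluates the log likelihood L(θ|X) on a grid for a Gaussian with known σ = 1 and unknown mean, and checks that the grid-search maximizer agrees with the closed-form MLE, the sample mean. The true parameters, sample size, and grid are arbitrary assumptions.

```python
# Sketch: maximizing the log likelihood L(theta|X) = sum_t log p(x^t|theta)
# over a grid, for a Gaussian with known sigma = 1 and unknown mean theta.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=2.0, scale=1.0, size=200)   # iid sample, assumed true mean 2.0

def log_likelihood(theta, X, sigma=1.0):
    # sum over the sample of log N(x^t; theta, sigma^2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (X - theta) ** 2 / (2 * sigma**2))

grid = np.linspace(0.0, 4.0, 2001)
L = np.array([log_likelihood(t, X) for t in grid])

print("grid-search MLE:", grid[np.argmax(L)])
print("sample mean    :", X.mean())   # closed-form MLE of the Gaussian mean
```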
Maximum Likelihood Estimators

Let us consider some distributions that arise in the
applications we are interested in.




If we have a two-class problem, the distribution we use is
Bernoulli.
When there are K > 2 classes, its generalization is the
multinomial.
Gaussian (normal) density is the one most frequently used for
modeling class-conditional input densities with numeric input.
For these three distributions, we discuss the maximum
likelihood estimators (MLE) of their parameters.
Bernoulli Density
In a Bernoulli distribution, there are two outcomes: an event
either occurs or it does not.
The Bernoulli random variable X takes the value 1 with probability
p₀ when the event occurs, and the value 0 with probability 1 − p₀
when it does not. This is written as

P(x) = p₀^x (1 − p₀)^(1 − x), x ∈ {0, 1}

The expected value and variance can be calculated as

E[X] = p₀ and Var(X) = p₀(1 − p₀)
Multinomial Density

P(x₁, x₂, ..., x_K) = ∏i p_i^(x_i), where x_i ∈ {0, 1} and ∑i x_i = 1

Gaussian (Normal) Density

p(x) = (1/(√(2π) σ)) exp[−(x − μ)² / (2σ²)], denoted N(μ, σ²)
Examples: Bernoulli/Multinomial

Bernoulli: two states, failure/success, x ∈ {0, 1}
P(x) = p₀^x (1 − p₀)^(1 − x)
L(p₀|X) = log ∏t p₀^(x^t) (1 − p₀)^(1 − x^t)
MLE: p̂₀ = ∑t x^t / N

Multinomial: K > 2 states, x_i ∈ {0, 1}
P(x₁, x₂, ..., x_K) = ∏i p_i^(x_i)
L(p₁, p₂, ..., p_K|X) = log ∏t ∏i p_i^(x_i^t)
MLE: p̂_i = ∑t x_i^t / N
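A minimal Python sketch of these two closed-form MLEs on synthetic data; the true probabilities and sample sizes below are arbitrary assumptions.

```python
# Sketch: the closed-form MLEs above, on synthetic data (true p's are assumed).
import numpy as np

rng = np.random.default_rng(2)

# Bernoulli: p0_hat = sum_t x^t / N
x = rng.binomial(1, 0.3, size=1000)      # N binary draws, assumed true p0 = 0.3
print("Bernoulli MLE:", x.sum() / len(x))

# Multinomial: p_i_hat = sum_t x_i^t / N, with x^t a 1-of-K (one-hot) vector
K, N = 4, 1000
labels = rng.choice(K, size=N, p=[0.1, 0.2, 0.3, 0.4])
X = np.eye(K)[labels]                    # one-hot encode: x_i^t in {0, 1}
print("Multinomial MLE:", X.sum(axis=0) / N)
```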
Gaussian (Normal) Distribution

p(x) = (1/(√(2π) σ)) exp[−(x − μ)² / (2σ²)],  p(x) = N(μ, σ²)

MLE for μ and σ²:

m = ∑t x^t / N

s² = ∑t (x^t − m)² / N
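A minimal Python sketch of the Gaussian MLEs m and s² above; note that s² divides by N (numpy's ddof=0), and the true μ and σ used below are arbitrary assumptions.

```python
# Sketch: the Gaussian MLEs m and s^2 (s^2 divides by N, i.e. ddof=0).
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=500)   # assumed true mu = 5, sigma = 2

N = len(x)
m = x.sum() / N                      # m   = sum_t x^t / N
s2 = ((x - m) ** 2).sum() / N        # s^2 = sum_t (x^t - m)^2 / N

print(m, s2)
print(np.isclose(s2, x.var(ddof=0)))   # same as numpy's ddof=0 variance: True
```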
Evaluating an Estimator: Bias and Variance


Let X be a sample from a population specified up to a parameter θ,
and let d = d(X) be an estimator of θ. To evaluate the quality of this
estimator, we can measure how much it differs from θ, that is,
(d(X) − θ)².
But since this is a random variable (it depends on the sample), we need
to average it over possible X and consider r(d, θ), the mean
square error of the estimator d, defined as

r(d, θ) = E[(d(X) − θ)²]

The bias of an estimator is given as

b_θ(d) = E[d(X)] − θ

If b_θ(d) = 0 for all θ values, then we say that d
is an unbiased estimator of θ.
Bias and Variance
Unknown parameter θ
Estimator d_i = d(X_i) on sample X_i
Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]
Mean square error:
r(d, θ) = E[(d − θ)²]
= (E[d] − θ)² + E[(d − E[d])²]
= Bias² + Variance
One term is the variance, which measures how much, on average, the estimates d_i
vary around the expected value (going from one dataset to another); the other is
the squared bias, which measures how much the expected value deviates from the
correct value θ.
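The decomposition can be checked by simulation. The Python sketch below (not from the slides) takes d(X) = s², the ML variance estimate that divides by N, which is biased since E[d] = ((N − 1)/N) σ², and verifies numerically that the mean square error equals bias² plus variance; the choices of σ², N, and the number of trials are arbitrary assumptions.

```python
# Sketch: bias, variance, and mean square error of d(X) = s^2 (divide by N),
# estimated by repeatedly drawing samples X_i of size N from N(0, sigma^2).
import numpy as np

rng = np.random.default_rng(4)
sigma2, N, trials = 4.0, 10, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
d = samples.var(axis=1, ddof=0)          # one estimate d_i per sample X_i

bias = d.mean() - sigma2                 # E[d] - theta, about -sigma2/N = -0.4
variance = d.var()                       # E[(d - E[d])^2]
mse = ((d - sigma2) ** 2).mean()         # r(d, theta) = E[(d - theta)^2]

print(bias, variance, mse, bias**2 + variance)   # mse ~ bias^2 + variance
```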
Bayes’ Estimator




Sometimes, before looking at a sample, we (or experts of the application) may have
some prior information on the possible value range that a parameter, θ, may take. This
information is quite useful and should be used, especially when the sample is small.
The prior information does not tell us exactly what the parameter value is (otherwise
we would not need the sample).
We model this uncertainty by viewing θ as a random variable and by defining a prior
density for it, p(θ).
The prior density, p(θ), tells us the likely values that θ may take before looking at the
sample.
We combine this with what the sample data tells us, namely the likelihood density,
p(X|θ), using Bayes' rule, and get the posterior density of θ, which tells us the likely θ
values after looking at the sample:

p(θ|X) = p(X|θ) p(θ) / p(X)
Bayes’ Estimator
When predicting for a new x, we average over all θ values using the posterior:

p(x|X) = ∫ p(x|θ) p(θ|X) dθ

where p(x|θ, X) = p(x|θ) because once we know θ, the sufficient statistics, we
know everything about the distribution. Thus we are taking an average over
predictions using all values of θ, weighted by their probabilities. If we are
making a prediction of the form y = g(x|θ), as in regression, then we have

y = ∫ g(x|θ) p(θ|X) dθ
Maximum A Posteriori (MAP) Estimate

Evaluating the integrals may be quite difficult, except in cases where
the posterior has a nice form. When the full integration is not feasible,
we reduce it to a single point.

If we can assume that p(θ|X) has a narrow peak around its
mode, then using the maximum a posteriori (MAP) estimate
will make the calculation easier:

θ_MAP = argmaxθ p(θ|X)

By replacing a whole density with a single point and getting rid
of the integral, we can estimate the density at x with

p(x|X) ≈ p(x|θ_MAP)
Maximum Likelihood Estimate

If we have no prior reason (information) to favor some
values of θ, then the prior density is flat, the posterior
has the same form as the likelihood, p(X|θ), and the
MAP estimate is equivalent to the maximum
likelihood estimate (section 4.2), where we have

θ_ML = argmaxθ p(X|θ)
Bayes’ Estimator

Another possibility is the Bayes' estimator, which is
defined as the expected value of the posterior density:

θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ

The reason for taking the expected value is that the best
estimate of a random variable (under squared error) is its mean.
In the case of a normal density, the mode is the expected value,
so if p(θ|X) is normal, then θ_Bayes = θ_MAP.
Bayes’ Estimator


Treat θ as a random variable with prior p(θ)
Bayes' rule: p(θ|X) = p(X|θ) p(θ) / p(X)

Full: p(x|X) = ∫ p(x|θ) p(θ|X) dθ
Maximum a Posteriori (MAP): θ_MAP = argmaxθ p(θ|X)
Maximum Likelihood (ML): θ_ML = argmaxθ p(X|θ)
Bayes': θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ
Bayes’ Estimator: Example
Thus the Bayes’ estimator is a weighted average of the prior mean μ0 and the sample mean m, with
weights being inversely proportional to their variances. As the sample size N increases, the Bayes’
estimator gets closer to the sample average, using more the information provided by the sample.
When σ0 is small, that is, when we have little prior uncertainty regarding the correct value of θ, or when
N is small, our prior guess μ0 has a higher effect.
25
Bayes’ Estimator: Example



x^t ~ N(θ, σ²) and θ ~ N(μ₀, σ₀²)

θ_ML = m

θ_MAP = θ_Bayes = E[θ|X] = [ (N/σ²) / (N/σ² + 1/σ₀²) ] m + [ (1/σ₀²) / (N/σ² + 1/σ₀²) ] μ₀
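A minimal Python sketch of this weighted-average estimator, assuming σ² and the prior N(μ₀, σ₀²) are known; the particular values used below are arbitrary assumptions.

```python
# Sketch: the Bayes' estimator above, assuming sigma^2 and the prior
# N(mu0, sigma0^2) are known; the values below are arbitrary.
import numpy as np

rng = np.random.default_rng(5)
theta_true, sigma, mu0, sigma0 = 3.0, 1.0, 0.0, 2.0

x = rng.normal(theta_true, sigma, size=20)
N, m = len(x), x.mean()

w = (N / sigma**2) / (N / sigma**2 + 1 / sigma0**2)
theta_bayes = w * m + (1 - w) * mu0      # E[theta | X]

print("sample mean m :", m)
print("Bayes estimate:", theta_bayes)    # pulled from m toward mu0; less so as N grows
```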
Parametric Classification
We saw in chapter 3 that, using Bayes' rule, we can write the posterior
probability of class C_i as

P(C_i|x) = p(x|C_i) P(C_i) / p(x)

We use the numerator as the discriminant function, or its logarithm:

g_i(x) = p(x|C_i) P(C_i)

or

g_i(x) = log p(x|C_i) + log P(C_i)
Parametric Classification

If we can assume that the p(x|C_i) are Gaussian, then

p(x|C_i) = (1/(√(2π) σ_i)) exp[−(x − μ_i)² / (2σ_i²)]

and the discriminant becomes

g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)² / (2σ_i²) + log P(C_i)
Example


Assume we are a car company selling K different cars, and for simplicity,
let us say that the sole factor that affects a customer’s choice is his or
her yearly income, which we denote by x.
Then P(C_i) is the proportion of customers who buy car type i. If the yearly
income distribution of such customers can be approximated by a
Gaussian, then p(x|C_i), the probability that a customer who bought car
type i has income x, can be taken to be N(μ_i, σ_i²), where μ_i is the mean
income of such customers and σ_i² is their income variance.
Estimates for the Means and Variances

When we do not know P(C_i) and p(x|C_i), we estimate them from a sample
and plug in their estimates to get the estimate for the discriminant function.
Given a sample

X = { x^t, r^t }, t = 1, ..., N

where r_i^t = 1 if x^t ∈ C_i and r_i^t = 0 if x^t ∈ C_j, j ≠ i,

for each class separately, the ML estimates for the means and variances are

m_i = ∑t x^t r_i^t / ∑t r_i^t

s_i² = ∑t (x^t − m_i)² r_i^t / ∑t r_i^t
Priors and Discriminant Function

The estimates for the priors are (relying on equation 4.6)

P̂(C_i) = ∑t r_i^t / N

Plugging these estimates into equation 4.22, we get the discriminant function:

g_i(x) = −(1/2) log 2π − log s_i − (x − m_i)² / (2s_i²) + log P̂(C_i)
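Putting the pieces together, the Python sketch below (not from the slides) computes the per-class ML estimates m_i, s_i², and P̂(C_i) from a labeled sample and evaluates the discriminants g_i(x) for a one-dimensional input; the synthetic two-class data and the integer label encoding are assumptions.

```python
# Sketch: plug-in Gaussian classifier for 1D input x, labels y in {0, ..., K-1}
# (so r_i^t = 1 exactly when y^t == i). The synthetic data below is an assumption.
import numpy as np

def fit(x, y, K):
    """Per-class ML estimates m_i, s_i^2 and prior estimates P_hat(C_i)."""
    m  = np.array([x[y == i].mean() for i in range(K)])
    s2 = np.array([x[y == i].var(ddof=0) for i in range(K)])   # divide by N_i
    P  = np.array([(y == i).mean() for i in range(K)])
    return m, s2, P

def g(x0, m, s2, P):
    """Discriminants g_i(x0); the constant -0.5*log(2*pi) is dropped."""
    return -0.5 * np.log(s2) - (x0 - m) ** 2 / (2 * s2) + np.log(P)

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

m, s2, P = fit(x, y, K=2)
print("estimates:", m, s2, P)
print("predicted class for x = 1.0:", np.argmax(g(1.0, m, s2, P)))
```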
With equal priors and equal variances, we assign x to the class with the nearest mean.
Figure 4.2 (a) Likelihood functions and (b) posteriors with equal priors for
two classes when the input is one-dimensional. Variances are equal and
the posteriors intersect at a single point, halfway between the two means,
which is the threshold of decision.
Figure 4.3 (a) Likelihood functions and (b) posteriors with equal priors for two
classes when the input is one-dimensional. Variances are unequal and the
posteriors intersect at two points. In (c), the expected risks are shown for the
two classes and for reject with λ = 0.2 (section 3.3).