Likelihood Methods in Ecology
November 16th – 20th, 2009
Millbrook, NY
Instructors: Charles Canham and María Uriarte
Teaching Assistant: Liza Comita
Daily Schedule
Morning
  8:30 – 9:20     Lecture
  9:20 – 10:10    Case Study or Discussion
  10:30 – 12:00   Lab
Lunch 12:00 – 1:30 (in this room)
Afternoon
  1:30 – 2:20     Lecture
  2:20 – 3:10     Lab
  3:30 – 5:00     Lab
Course Outline
Statistical Inference using Likelihood
- Principles and practice of maximum likelihood estimation
- Know your data – choosing appropriate likelihood functions
- Formulate statistical models as alternate hypotheses
- Find the ML estimates of the parameters of your models
- Compare alternate models and choose the most parsimonious
- Evaluate individual models
- Advanced topics
Likelihood is much more than a statistical method...
(it can completely change the way you ask and answer questions…)
Lecture 1
An Introduction to Likelihood Estimation

- Probability and probability density functions
- Maximum likelihood estimates (versus traditional “method of moment” estimates)
- Statistical inference
- Classical “frequentist” statistics: limitations and mental gyrations...
- The “likelihood” alternative: basic principles and definitions
- Model comparison as a generalization of hypothesis testing
A simple definition of probability for discrete events...
“...the ratio of the number of events of type A to the total number of all possible events (outcomes)...”
The enumeration of all possible outcomes is called the sample space (S).
If there are n possible outcomes in a sample space, S, and m of those are favorable for event A, then the probability of event A is given as
P{A} = m/n
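As a quick illustration of P{A} = m/n, the sketch below enumerates the sample space for one roll of a fair six-sided die and counts the outcomes favorable to the event "the roll is even". The die and the event are assumptions chosen for illustration, not part of the lecture.

```r
# Sample space S for one roll of a fair six-sided die (illustrative assumption)
S <- 1:6
# Event A: the roll is even
favorable <- S[S %% 2 == 0]
n <- length(S)          # number of possible outcomes
m <- length(favorable)  # number of outcomes favorable to A
p_A <- m / n
p_A
```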
Probability defined more generally...

Consider an outcome X from some process that has a set of possible outcomes S:
- If X and S are discrete, then P{X} = (number of outcomes of type X) / (number of outcomes in S)
- If X is continuous, then the probability has to be defined in the limit:
$$P\{x_a \le X \le x_b\} = \int_{x_a}^{x_b} g(x)\,dx$$
where g(x) is a probability density function (PDF)
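To make the integral concrete, here is a minimal sketch that takes g(x) to be the standard normal density (an assumption for illustration) and computes the probability of an outcome between two hypothetical limits, once by numerical integration and once from the cumulative distribution function.

```r
# Hypothetical interval limits
xa <- -1
xb <- 1
# Numerical integration of the density g(x) = dnorm(x) between xa and xb
p_integral <- integrate(dnorm, lower = xa, upper = xb)$value
# The same probability from the cumulative distribution function
p_cdf <- pnorm(xb) - pnorm(xa)
c(p_integral, p_cdf)
```

The two numbers should agree to within the tolerance of the numerical integrator.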
The Normal Probability Density Function (PDF)
$$\mathrm{prob}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
μ = mean
σ² = variance
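[Figure: Normal PDF with mean = 0, plotted for Var = 0.25, 0.5, 1, 2, 5, and 10; x-axis: X from -5 to 5; y-axis: Prob(x) from 0 to 1]
Properties of a PDF:
(1) 0 < g(x) < 1 for the curves shown (in general a density must be non-negative, and a continuous density can exceed 1 when the variance is small)
(2) $\int g(x)\,dx = 1$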
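The normal PDF formula can be checked directly against R's built-in `dnorm()`. In this sketch the function name `normal_pdf` and the chosen variance are illustrative assumptions; the second half checks numerically that the density integrates to 1.

```r
# Normal PDF written out from the formula; mu = mean, sigma2 = variance
normal_pdf <- function(x, mu, sigma2) {
  1 / sqrt(2 * pi * sigma2) * exp(-(x - mu)^2 / (2 * sigma2))
}
x <- seq(-5, 5, by = 0.5)
# dnorm() takes the standard deviation, so pass sqrt(variance)
max_diff <- max(abs(normal_pdf(x, 0, 2) - dnorm(x, mean = 0, sd = sqrt(2))))
# Area under the density over the whole real line
area <- integrate(normal_pdf, -Inf, Inf, mu = 0, sigma2 = 2)$value
c(max_diff, area)
```

Note that `dnorm()` is parameterized by the standard deviation rather than the variance, which is an easy source of mistakes when moving between the formula and the code.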
Common PDFs...
For continuous data:
- Normal
- Lognormal
- Gamma
For discrete data:
- Poisson
- Binomial
- Multinomial
- Negative Binomial
[Figure: Poisson PDF for m = 2.5, 5, and 10; x-axis: x from 0 to 30; y-axis: Prob(x) from 0 to 0.3]
See McLaughlin (1993) “A compendium of common probability distributions” in the reading list
Why are PDFs important?
Answer: because they are used to calculate likelihood…
(And in that case, they are called “likelihood functions”)
Statistical “Estimators”
A statistical estimator is a function applied to a sample of data used
to estimate an unknown population parameter (and an “estimate” is
just the result of applying an “estimator” to a sample)
A common estimator for the population mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Properties of Estimators
Some desirable properties of “point estimators” (functions to estimate a fixed parameter):
- Bias: if the average error is zero, the estimator is unbiased
- Efficiency: the estimator with the minimum variance is the most efficient (note: the most efficient estimator is often biased)
- Consistency: as sample size increases, the probability of the estimate being close to the parameter increases
- Asymptotically normal: a consistent estimator whose distribution around the true parameter θ approaches a normal distribution with standard deviation shrinking in proportion to $1/\sqrt{n}$ as the sample size n grows
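The 1/√n behavior can be seen in a small simulation. This is a sketch under assumed values (a normal population with mean 10 and standard deviation 2, and the hypothetical helper `sd_of_mean`); it estimates the spread of the sample mean at two sample sizes.

```r
set.seed(42)
# Standard deviation of the sample mean across repeated samples of size n
sd_of_mean <- function(n, n_sims = 5000) {
  sd(replicate(n_sims, mean(rnorm(n, mean = 10, sd = 2))))
}
sd_25 <- sd_of_mean(25)    # theory: 2 / sqrt(25)  = 0.4
sd_100 <- sd_of_mean(100)  # theory: 2 / sqrt(100) = 0.2
c(sd_25, sd_100, sd_25 / sd_100)
```

Quadrupling the sample size should roughly halve the standard deviation of the estimator.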
Maximum likelihood (ML) estimates
versus
Method of moment (MOM) estimates
Bottom line:
MOM was born in the time before computers, and was OK,
ML needs computing power, but has more desirable properties…
Doing it MOM’s way: Central Moments
If the sample (arithmetic) mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, then:
First central moment: $\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^1 = 0$
Second moment: $\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2$ = sample variance ($s^2$)
Third moment: $\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3$, skew $= \frac{1}{n s^3}\sum_{i=1}^{n}(x_i-\bar{x})^3$
Fourth moment: $\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^4$, kurtosis $= \left[\frac{1}{n s^4}\sum_{i=1}^{n}(x_i-\bar{x})^4\right] - 3$
What’s wrong with MOM’s way?
- Nothing, if all you are interested in is calculating properties of your sample…
- But MOM’s formulas are generally not the best way [1] to infer estimates of the statistical properties of the population from which the sample was drawn…
For example:
Population variance: $\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$
(because the second central moment is a biased underestimate of the population variance)
[1] … in the formal terms of bias, efficiency, consistency, and asymptotic normality
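The bias of the second central moment can be illustrated by simulation. The sketch below assumes a normal population with a known variance of 4 and samples of size 10 (both arbitrary choices) and averages the two variance estimators over many repeated samples.

```r
set.seed(1)
true_var <- 4     # assumed population variance
n <- 10           # sample size
n_sims <- 20000   # number of repeated samples
second_moment <- numeric(n_sims)
n_minus_1 <- numeric(n_sims)
for (k in seq_len(n_sims)) {
  s <- rnorm(n, mean = 0, sd = sqrt(true_var))
  second_moment[k] <- mean((s - mean(s))^2)  # divides by n
  n_minus_1[k] <- var(s)                     # divides by n - 1
}
mean(second_moment)  # tends toward true_var * (n - 1) / n
mean(n_minus_1)      # tends toward true_var
```

R's `var()` already uses the n - 1 denominator, so it returns the population variance estimate rather than the second central moment.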
The Maximum Likelihood alternative…
Going back to PDF’s: in plain language, a PDF allows you to
calculate the probability that an observation will take on a
value (x), given the underlying (true?) parameters of the
population
Poisson PDF: $P(x) = \frac{e^{-a}\,a^{x}}{x!}$
where the mean (and variance) = a
[Figure: Poisson PDF for means m = 2.5, 5, and 10; x-axis: x from 0 to 30; y-axis: Prob(x) from 0 to 0.3]
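As a check on the Poisson formula, this sketch writes it out by hand (the function name `poisson_pdf` is an illustrative assumption) and compares it with R's built-in `dpois()`, then confirms numerically that the mean and variance both equal a.

```r
# Poisson PDF written out from the formula; a = mean (and variance)
poisson_pdf <- function(x, a) exp(-a) * a^x / factorial(x)
x <- 0:30
a <- 2.5
max_diff <- max(abs(poisson_pdf(x, a) - dpois(x, lambda = a)))
# Mean and variance computed from the probabilities (tail beyond 30 is negligible here)
pois_mean <- sum(x * dpois(x, a))
pois_var <- sum((x - pois_mean)^2 * dpois(x, a))
c(max_diff, pois_mean, pois_var)
```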
But there’s a problem…

The PDF defines the probability of observing an outcome (x),
given that you already know the true population parameter
(θ)

But we want to generate an estimate of θ, given our data (x)

And, unfortunately, the two are not identical:
$$P(\theta \mid x) \neq P(x \mid \theta)$$
Fisher and the concept of “Likelihood”...
The “Likelihood Principle”
$$L(\theta \mid x) \propto P(x \mid \theta)$$
In plain English: “The likelihood (L) of the parameter
estimates (θ), given a sample (x) is proportional to the
probability of observing the data, given the parameters...”
{and this probability is something we can calculate, using the
appropriate underlying probability model (i.e. a PDF)}
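The difference between a probability and a likelihood shows up when the same PDF is read in two directions. The sketch below uses hypothetical binomial data (7 successes in 10 trials): with p fixed, the probabilities over all outcomes sum to 1, but with the data fixed, the same function of p does not integrate to 1, so it is not a probability distribution for p.

```r
# Hypothetical data: 7 successes in 10 trials
x_obs <- 7
N <- 10
# Fixed p, varying x: a probability distribution over outcomes
sum_over_x <- sum(dbinom(0:N, N, prob = 0.7))
# Fixed x, varying p: the likelihood function of p
likelihood <- function(p) dbinom(x_obs, N, prob = p)
area_over_p <- integrate(likelihood, 0, 1)$value  # equals 1 / (N + 1), not 1
# The likelihood still identifies the best-supported value of p
p_grid <- seq(0, 1, by = 0.01)
p_hat <- p_grid[which.max(likelihood(p_grid))]
c(sum_over_x, area_over_p, p_hat)
```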
R.A. Fisher (1890–1962) [photo: age 22]
“Likelihood and Probability in R. A. Fisher’s Statistical Methods for Research Workers” (John Aldrich)
http://www.economics.soton.ac.uk/staff/aldrich/fisherguide/prob+lik.htm
A good summary of the evolution of Fisher’s ideas on probability, likelihood, and inference… Contains links to PDFs of Fisher’s early papers…
A second page shows the evolution of his ideas through changes in successive editions of Fisher’s books…
Calculating Likelihood and Log-Likelihood for Datasets
From basic probability theory:
If two events (A and B) are independent, then P(A,B) = P(A)P(B)
More generally, for i = 1..n independent observations, and a vector X of observations ($x_i$):
$$\text{Likelihood} = L(\theta \mid X) \propto P(X \mid \theta) = \prod_{i=1}^{n} g(x_i \mid \theta)$$
where $g(x_i \mid \theta)$ is the appropriate PDF
But logarithms are easier to work with, so...
$$\text{Log-likelihood} = \ln L(\theta \mid X) = \sum_{i=1}^{n} \ln g(x_i \mid \theta)$$
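One practical reason logarithms are easier to work with is numerical: the product of many small probabilities underflows to zero in floating-point arithmetic, while the sum of their logs stays finite. A minimal sketch, assuming a hypothetical sample of 2000 Poisson counts:

```r
set.seed(7)
# Hypothetical sample: 2000 counts from a Poisson with mean 5
x <- rpois(2000, lambda = 5)
# Direct product of probabilities underflows to zero for a sample this large
lik <- prod(dpois(x, lambda = 5))
# Summing log-probabilities stays finite
loglik <- sum(dpois(x, lambda = 5, log = TRUE))
c(lik, loglik)
# For a small subset, the two routes agree
small <- x[1:10]
c(log(prod(dpois(small, 5))), sum(dpois(small, 5, log = TRUE)))
```

Most of R's density functions accept `log = TRUE`, which is generally preferable to wrapping the density in `log()`.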
Likelihood “Surfaces”
The variation in likelihood for any given set of parameter values defines a likelihood “surface”...
For a model with just 1 parameter, the surface is simply a curve (aka a “likelihood profile”):
[Figure: likelihood profile; x-axis: Parameter Estimate from 2 to 2.8; y-axis: Log-Likelihood from -155 to -147]
“Support” and “Support Limits”
Log-likelihood = “Support” (Edwards 1992)
[Figure: likelihood profile marking the maximum likelihood estimate and the 2-unit support interval; x-axis: Parameter Estimate from 2 to 2.8; y-axis: Log-Likelihood from -155 to -147]
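A 2-unit support interval is the range of parameter values whose log-likelihood lies within 2 units of the maximum. The sketch below computes one on a grid for a hypothetical set of counts, assuming a Poisson likelihood; the data and the grid spacing are illustrative choices, not the example plotted in the lecture.

```r
# Hypothetical counts, assumed Poisson-distributed
counts <- c(3, 1, 4, 2, 2, 5, 0, 3, 2, 4)
# Grid of candidate values for the Poisson mean
lambda <- seq(0.5, 6, by = 0.01)
# Support (log-likelihood) at each candidate value
support <- sapply(lambda, function(l) sum(dpois(counts, l, log = TRUE)))
mle <- lambda[which.max(support)]
# Parameter values whose support is within 2 units of the maximum
inside <- lambda[support >= max(support) - 2]
support_limits <- range(inside)
c(mle, support_limits)
```

The resolution of the limits is set by the grid spacing, so a finer grid (or a root finder) would be needed for more precise limits.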
A (somewhat trivial) example
MOM vs ML estimates of the probability of survival for a population:
Data: a quadrat in which 16 of 20 seedlings survived during a census interval. (Note that in this case, the quadrat is the unit of observation…, so sample size = 1)
Binomial PDF: $P(x) = \binom{N}{x}\,p^{x}(1-p)^{N-x}$
where $\binom{N}{x} = \frac{N!}{x!\,(N-x)!}$ is the binomial coefficient
i.e. Given N = 20, x = 16, what is p?
# Candidate values of the survival probability p
x <- seq(0,1,0.005)
# Likelihood of each candidate value given 16 survivors out of 20
y <- dbinom(16,20,x)
plot(x,y)
# Candidate value with the highest likelihood
x[which.max(y)]
[Figure: Binomial PDF with 16 successes out of 20 trials; x-axis: x (candidate values of p) from 0 to 1; y-axis: P(x) from 0 to 0.20]
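Instead of scanning a grid, the maximum can also be found with a one-dimensional optimizer. This sketch uses R's `optimize()` on the log-likelihood for the 16-of-20 quadrat; the search interval is an assumption chosen to avoid the endpoints 0 and 1, where the log-likelihood is minus infinity.

```r
# Log-likelihood of p for 16 survivors out of 20 seedlings
loglik <- function(p) dbinom(16, 20, prob = p, log = TRUE)
# optimize() searches the interval for the maximum
fit <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)
fit$maximum   # numerical ML estimate
16 / 20       # analytical ML estimate for a single binomial sample, x / N
```

With a single quadrat, the ML estimate and the simple proportion coincide, which is why the lecture calls this example somewhat trivial.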
A more realistic example
# Create some data (5 quadrats)
N <- c(11,14,8,22,50)
x <- c(8,7,5,17,35)
# Calculate the log-likelihood for each
# probability of survival
p <- seq(0,1,0.005)
log_likelihood <- rep(0,length(p))
for (i in seq_along(p))
{ log_likelihood[i] <- sum(log(dbinom(x,N,p[i]))) }
# Plot the likelihood profile
plot(p,log_likelihood)
# What probability of survival maximizes log likelihood?
p[which.max(log_likelihood)]
[1] 0.685
# How does this compare to the average across the 5 quadrats
mean(x/N)
[1] 0.665
[Figure: log-likelihood profile; x-axis: p from 0 to 1; y-axis: log likelihood from -300 to 0]
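The grid search for the five quadrats can be cross-checked in two ways. For a common survival probability across binomial samples, the ML estimate has a closed form, the pooled proportion sum(x)/sum(N), and `optimize()` should find the same value. The search interval below is an assumption chosen to avoid the endpoints.

```r
# The five quadrats from the example
N <- c(11, 14, 8, 22, 50)
x <- c(8, 7, 5, 17, 35)
# Log-likelihood of a single survival probability shared by all quadrats
loglik <- function(p) sum(dbinom(x, N, prob = p, log = TRUE))
fit <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)
p_ml <- fit$maximum          # numerical ML estimate
p_pooled <- sum(x) / sum(N)  # closed-form ML estimate
p_mean <- mean(x / N)        # unweighted average of quadrat proportions
c(p_ml, p_pooled, p_mean)
```

The ML estimate weights each quadrat by its number of seedlings, whereas the simple average gives the 8-seedling quadrat the same weight as the 50-seedling one.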
Things to note about log-likelihoods:
• The absolute magnitude of the log-likelihood increases as sample size increases
• For discrete data like these they should always be negative, because every probability is at most 1 (if not, you have a problem with your likelihood function); with continuous data a density can exceed 1, so positive values can occur
$$\text{Log-likelihood} = \sum_{i=1}^{n} \ln g(x_i \mid \theta)$$
[Figure: log-likelihood profile; x-axis: p from 0 to 1; y-axis: log likelihood from -300 to 0]
Focus in on the MLE…
# what is the log-likelihood of the MLE?
max(log_likelihood)
[1] -9.46812
[Figure: log-likelihood profile near the MLE; x-axis: p from 0.55 to 0.80; y-axis: log likelihood from -15 to -9]
An example with continuous data…
The normal PDF:
$$\mathrm{prob}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
x = observed value
μ = mean
σ² = variance
In R:
dnorm(x, mean = 0, sd = 1, log = FALSE)
> dnorm(2,2.5,1)
[1] 0.3520653
> dnorm(2,2.5,1,log=TRUE)
[1] -1.043939
Problem: Now there are TWO unknowns needed to
calculate likelihood (the mean and the variance)!
Solution: treat the variance just like another parameter in
the model, and find the ML estimate of the variance just
like you would any other parameter…
(this is exactly what you’ll do in the lab this morning…)
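As one possible way to estimate both unknowns at once, the sketch below minimizes the negative log-likelihood with R's general-purpose `optim()`. The simulated data, the starting values, and the choice to search over log(variance) so that the variance stays positive are all assumptions for illustration; the lab may use a different optimizer.

```r
set.seed(123)
# Hypothetical data: 200 draws from a normal population
obs <- rnorm(200, mean = 2.5, sd = 1)
# Negative log-likelihood; par[1] = mean, par[2] = log(variance)
neg_loglik <- function(par) {
  mu <- par[1]
  sigma2 <- exp(par[2])  # back-transform so the variance is always positive
  -sum(dnorm(obs, mean = mu, sd = sqrt(sigma2), log = TRUE))
}
# optim() minimizes, so it is given the negative log-likelihood
fit <- optim(c(mean = 0, log_var = 0), neg_loglik)
mu_hat <- unname(fit$par[1])
var_hat <- unname(exp(fit$par[2]))
c(mu_hat, var_hat)
# Closed-form ML estimates for comparison: sample mean and second central moment
c(mean(obs), mean((obs - mean(obs))^2))
```

For the normal distribution the ML estimate of the variance is the second central moment (denominator n), so the numerical result should match that rather than `var(obs)`.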