Bayesian methods for parameter estimation and data
assimilation for crop models
PART 2. Likelihood function and prior distribution
SLIDE 1
Welcome back for part 2 of our mini-course on Bayesian methods for crop models
SLIDE 2
Previously on “mini-course”:
We explained Bayes’ theorem and wrote it in a form appropriate for use with model
parameter estimation. That formula is reproduced on the slide.
In this part we’ll explain in detail the different parts of this equation, apply it in a
simple case and analyze and interpret the results.
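The slide itself is not reproduced in this transcript; the formula in question is the standard form of Bayes’ theorem for parameter estimation, with θ the parameters and y the observations:

```latex
P(\theta \mid y) =
  \frac{P(y \mid \theta)\, P(\theta)}
       {\int P(y \mid \theta)\, P(\theta)\, d\theta}
```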
SLIDE 4
There are just two quantities that appear in Bayes’ theorem, θ and y.
θ represents the parameters of interest. It can be a vector or just a single parameter. For crop models, which have many parameters, θ will be a vector. If we want to apply Bayes’ Theorem to θ, it must be a random variable; the theorem makes no sense otherwise. For a frequentist, θ is a fixed (but unknown) quantity, so Bayes’ Theorem is not applicable. For a Bayesian as well, a parameter has some true value, but θ does not represent that value. Rather, θ is a random variable whose distribution represents our knowledge about the parameter. Then Bayes’ Theorem can be applied.
y represents the observations. It can be a vector (a whole list of measured quantities) or
just a single quantity. y is a random variable in the sense that until the measurement is done,
we don’t know the value.
SLIDES 7, 8 EXAMPLE
Our model is y = θ, where y is the measurement of yield in a subplot of a field and θ, the unknown parameter, is the mean yield in the field. Despite its extreme simplicity, this has the basic features of a crop model. Basically, a crop model provides a relationship between measurable variables on the one hand and input variables and parameters on the other hand. Here we have simplified this to the extreme. We have no input variables, we have only a single parameter, and we have the simplest possible function relating y and θ.
Our objective is to obtain the posterior distribution for the parameter. This will be
based on prior information, which might reflect our knowledge about yield in fields like this,
and on a single measurement in the field in question.
SLIDES 9, 10 PRIOR DISTRIBUTION
P ( ) is the prior distribution for the parameter value. Note that it is not sufficient to
give a value for  that represents our best guess. One must specify a probability distribution.
In this way one specifies not only one’s prior idea, but also the level of uncertainty about this
idea.
For the example we suppose that the prior distribution is a normal distribution with expectation 5 t/ha and standard deviation 2 t/ha.
Note that Bayes’ theorem can be used and often is used even if there is no prior
information about a parameter. In that case, one uses a “non-informative” prior. For example,
one could suppose that the parameter has a uniform distribution between -A and +A where A
is some huge number. This effectively says that as far as you know, the parameter is equally
likely to be anywhere.
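As a minimal sketch (in Python with scipy; the numbers are those of the example, and A = 10^6 is an arbitrary illustration of a “huge number”):

```python
from scipy.stats import norm, uniform

# Informative prior from the example: theta ~ N(5, 2^2), in t/ha
prior = norm(loc=5.0, scale=2.0)
print(prior.pdf(5.0))       # density at the prior expectation

# A "non-informative" alternative: uniform on [-A, A] with A huge
A = 1e6
flat_prior = uniform(loc=-A, scale=2 * A)   # uniform(loc, loc + scale)
print(flat_prior.pdf(0.0))  # a tiny constant density everywhere in [-A, A]
```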
SLIDE 12 LIKELIHOOD FUNCTION
The second distribution in Bayes’ Theorem is P(y | θ). This notation is shorthand for P(y = y_obs | θ = θ_fixed). This is the probability that y has some particular value, say y_obs, given that θ has some particular value, say θ_fixed.
In our example there is a single measured value, y_obs = 9 t/ha.
SLIDE 13 STATISTICAL MODEL
To evaluate P(y = y_obs | θ = θ_fixed) we require a statistical model for y. For our example we use the statistical model y = θ + ε. We assume for our model that ε is a random variable with a normal distribution with expectation 0 and variance σ², written ε ~ N(0, σ²). We will assume that σ² = 1.
Crop models are deterministic; they do not include any random variables. So to calculate a likelihood you need to add a random error and specify the properties of that random variable.
SLIDE 14 DEFINITION OF A LIKELIHOOD FUNCTION
We can use our statistical model to determine the likelihood P(y = y_obs | θ = θ_fixed). Suppose for example that θ_fixed = 8. That implies that the random variable ε = y_obs − θ_fixed = 9 − 8 = 1. According to the distribution of ε, this gives
P(y = 9 | θ = 8) = P(ε = 1) = 0.242,
where 0.242 is the value of the N(0, 1) density at ε = 1. The probability of measuring y = 9 clearly depends on the value of θ_fixed. A different value of θ_fixed would lead to a different probability. In general, using the explicit equation for the normal density,
P(y = y_obs | θ = θ_fixed) = (1/√(2π)) exp[−(9 − θ_fixed)²/2],
which is a function of θ_fixed. The function P(y | θ) is called the likelihood function. The common notation is L(θ | y), but the inversion between y and θ should not be misunderstood. This is still the probability of getting a particular value of y, considered as a function of θ.
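A small sketch of this computation in Python (using scipy; y_obs = 9 and σ² = 1 as in the example):

```python
from scipy.stats import norm

y_obs = 9.0   # measured yield, t/ha
sigma = 1.0   # standard deviation of the measurement error (sigma^2 = 1)

def likelihood(theta):
    # L(theta | y_obs): density of y_obs under the model y = theta + eps,
    # with eps ~ N(0, sigma^2)
    return norm(loc=theta, scale=sigma).pdf(y_obs)

print(likelihood(8.0))  # ~0.242, the value computed above
```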
SLIDE 16 MAXIMUM LIKELIHOOD
Note that the observation y enters the posterior distribution only through the likelihood
function. That is, the information conveyed by the measurement is all captured by the
likelihood function. Obviously the likelihood function plays an extremely important role.
The likelihood function is central to both frequentist and Bayesian statistics. For a frequentist, as we have said numerous times, a parameter is just a constant. Then L(θ | y) is a function of the value of that constant. For a Bayesian, θ is a random variable, and L(θ | y) is the probability of y as a function of the value taken by the random variable θ. At the end of the day, there is no difference here.
There is an important difference in the way the likelihood is used. For a frequentist,
parameter estimation is based solely on the likelihood. A common procedure is maximum
likelihood estimation. The estimator of θ is then the value that maximizes the probability of observing y. In the example, the maximum likelihood estimator of θ is easily seen to be θ̂ = 9. A Bayesian, on the other hand, bases estimation on both the likelihood and the prior information.
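As a quick numerical check (a crude grid search is enough here; this reuses the likelihood sketched above):

```python
import numpy as np
from scipy.stats import norm

y_obs, sigma = 9.0, 1.0
grid = np.linspace(0.0, 20.0, 2001)            # candidate values of theta
lik = norm(loc=grid, scale=sigma).pdf(y_obs)   # likelihood at each candidate
print(grid[np.argmax(lik)])                    # 9.0, the maximum likelihood estimate
```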
SLIDE 18 POSTERIOR DISTRIBUTION
In this simple case we can calculate the posterior distribution analytically.
To derive the result shown on the slide, one could do all the calculations, including the
integration in the denominator of Bayes’ Theorem. An easier method is to note that the posterior distribution for θ is a normal distribution, since it has the form d·exp(aθ² + bθ + c). For a normal distribution the coefficient of the quadratic term is a = −1/(2σ_post²) and the coefficient of the linear term is b = μ_post/σ_post², which allows us to calculate μ_post and σ_post².
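Spelling the calculation out (a standard completion of the square; μ_0 = 5 and τ² = 4 are the prior expectation and variance, and σ² = 1, y_obs = 9):

```latex
P(\theta \mid y) \propto
  \exp\!\Big(-\frac{(y_{obs}-\theta)^2}{2\sigma^2}\Big)
  \exp\!\Big(-\frac{(\theta-\mu_0)^2}{2\tau^2}\Big)
\;\Longrightarrow\;
\sigma_{post}^2 = \Big(\frac{1}{\sigma^2}+\frac{1}{\tau^2}\Big)^{\!-1},
\quad
\mu_{post} = \sigma_{post}^2\Big(\frac{y_{obs}}{\sigma^2}+\frac{\mu_0}{\tau^2}\Big).
```

With the numbers of the example, σ_post² = (1 + 1/4)⁻¹ = 0.8 and μ_post = 0.8 × (9 + 5/4) = 8.2.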
SLIDE 20 DISCUSSION OF POSTERIOR DISTRIBUTION
The posterior distribution is a probability distribution for the parameter  which
represents our uncertainty about the parameter. All Bayesian analyses will then be based on
this distribution. For example, suppose we want to make a prediction that involves  . The
predicted quantity would have a distribution, resulting from the distribution for  . In this way
the uncertainty in  is propagated to all calculated quantities that depend on  .
The expectation is a weighted average between the prior and maximum likelihood
values. Note that normally in a Bayesian analysis, one is interested in the full posterior
distribution and not just in its expectation. However, the expectation can be useful as a
reference value.
The weighting of the prior and maximum likelihood values to give the posterior expectation depends on the uncertainty in each result. Writing τ² for the variance of the prior and σ² for the variance of the measurement error, the weight given to the data is B = τ²/(τ² + σ²). If τ² >> σ² (lots of uncertainty in the prior compared to uncertainty due to the data), B is close to 1 and the expectation is close to y. If on the contrary the uncertainty in the data is large compared to the uncertainty in the prior information, then B is close to 0 and the expectation is close to the expectation of the prior density. This seems intuitively reasonable: we give more weight to the source of information that is better known.
The variance of the posterior, σ_post², is smaller than either τ² or σ². Combining two sources of information has allowed us to reduce our uncertainty. For example, if τ² = σ² = a, then B = 1/2 and the variance of the posterior distribution is a/2.
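These weights are easy to check numerically; a minimal sketch (the function name is ours, not from the slides):

```python
def posterior_params(mu_prior, tau2, y_obs, sigma2):
    """Posterior mean and variance for a normal prior and a normal likelihood."""
    B = tau2 / (tau2 + sigma2)                 # weight given to the data
    mu_post = B * y_obs + (1 - B) * mu_prior   # weighted average
    var_post = B * sigma2                      # equals 1 / (1/tau2 + 1/sigma2)
    return mu_post, var_post

print(posterior_params(5.0, 4.0, 9.0, 1.0))    # (8.2, 0.8): the example above
print(posterior_params(5.0, 1.0, 9.0, 1.0))    # equal variances a = 1: B = 1/2, variance a/2
```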
SLIDE 21 FREQUENTIST versus BAYESIAN
We can now better understand some of the debate between frequentists and Bayesians.
A major concern of frequentists is that Bayesian analysis introduces an element of
subjectivity into statistical analysis. This is well expressed by D. R. Cox and D. V. Hinkley in
their book Theoretical Statistics: “In the statistical analysis of scientific and technological
data, there will nearly always be some external information, such as that based on previous
similar investigations, fully or half-formed theoretical predictions, and arguments by vague
analogy. Such information is important in reaching conclusions about the data at hand,
including consistency of the data with external information, and in planning future
investigations. Nevertheless, it is not at all obvious that this is usually best done
quantitatively, expressing vague prior knowledge by a probability distribution, rather than
qualitatively.”
One major criticism expressed by Bayesians concerns the way that frequentists assess
uncertainty in results. For a frequentist, uncertainty refers to the uncertainty in measured
values. For example, consider a survey where 10 people are chosen at random. If the survey
were repeated many times, the people chosen would be different in each case and that would
lead to a distribution of results. It is this distribution that underlies frequentist analyses of
uncertainty. Bayesians point out that this is not always very pertinent, since in general one
does not envision repeating the experiment. In a Bayesian approach, on the other hand, all conclusions are based on P(θ | y), which is conditional on the specific measured values in the experiment that was actually carried out.
Bayesians argue in addition that uncertainty assessments in frequentist statistics are
complex, difficult to obtain and difficult to explain. In Bayesian analyses on the other hand all
uncertainties are based on the posterior distribution. Once this distribution is available all
calculations are relatively straightforward. The explanations are also simple. The distributions
that one calculates based on the posterior distribution represent directly the uncertainty in the
results.
SLIDE 22 WHICH IS BETTER?
Statistics is supposed to help in predicting things about the real world. Which school
of statistics does this better?
As you might suspect, the answer is not simple. All statistical analyses involve various
assumptions, which are often simplifications compared to the real world. The quality of the
results will depend on the accuracy of the assumptions and this may be more important than
the choice between a Bayesian and a frequentist approach.
It does seem clear in general that when there is prior information, it is worthwhile to
take it into account. We combine prior information and data all the time in real life. When you
read the weather forecast and then look out at the sky in the morning to assess the probability
of rain, you are combining prior information (the forecast) and data (the sky). However, this
still leaves open the question as to whether it is a good idea to adopt a Bayesian approach in
order to take account of prior information.
One statistics professor describes an experiment he performs in his statistics class to
compare Bayesian and frequentist estimation quantitatively. (My apologies to the professor
who wrote about this but whose name I don’t have). You could easily repeat this with your
colleagues.
He chooses a book; I don’t remember which one. Let’s say The Constant Gardener by John le Carré. The objective is to predict the average number of letters of the first word on each page. In my edition there are 558 pages, so we’re looking for the average length of 558 words. We’ll call this value θ. The professor explains to his class that they will estimate θ
using a frequentist and a Bayesian method. The first method will be based on a random
sample of words from the book. He will choose 5 pages at random and calculate the average
word length of the first word on each of those pages. That average, noted θ_5, is the frequentist estimator of θ. It is based solely on the data. The only assumption here is that the sample is really a random sample, which is fairly easy to ensure. Thus there is no problem of using unfounded assumptions. The second method will be a Bayesian method, combining the student’s prior estimate of θ, noted θ_prior, with the data. The prior estimate can be based on the student’s general knowledge of word length in the English language or, even better, on his knowledge about this particular writer and book.
After hearing how estimation will be done, each student is requested to provide his value of θ_prior. Each student has to provide his value before the sample is taken, because the prior information is not supposed to be influenced by the measurement. Each student is also told to provide a second number, which we’ll call w_trust, between 0 and 1, to quantify how much trust he puts in his prior value compared to the value θ_5. His estimator will be
θ_Bayes = (w_trust)(θ_prior) + (1 − w_trust)(θ_5).
Thus a value of 0 means that he will ignore θ_prior and just use θ_5 alone as his estimate of θ. A value of 1 means that he will ignore θ_5 and just use θ_prior as his estimate of θ. Looking back at the example with normal distributions, we see that w_trust has a role exactly equivalent to 1 − B.
Once each student has provided his values of θ_prior and w_trust, the professor looks at five pages at random in the book and gives the class the value of θ_5. Each student then calculates his personal θ_Bayes and the absolute errors |θ − θ_5| and |θ − θ_Bayes|. Finally, the professor counts the number of students for whom |θ − θ_5| is the smaller error and those for whom |θ − θ_Bayes| is the smaller error.
In the large majority of cases, the Bayes estimator has the smaller error. That is, the
Bayes approach is in most cases better able to provide predictions about the real world.
The professor then goes one step further. He asks the students for whom the frequentist error is smaller to announce their values of θ_prior and w_trust. He finds (no surprise) that these students combine poor values of θ_prior with large values of w_trust. The conclusion? Using prior information is usually a good idea, but if your prior is poor and you put a lot of trust in it, you’re in trouble!
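The experiment is easy to mimic in simulation. In the sketch below, the book is replaced by synthetic word lengths and the students by random priors, and each simulated student draws his own five pages; every number (the word-length distribution, the spread of the prior guesses, the w_trust values) is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "book": first-word lengths for 558 pages (a stand-in for real data)
lengths = rng.poisson(lam=4.0, size=558) + 1
theta = lengths.mean()                       # the quantity to be estimated

n_students, bayes_wins = 200, 0
for _ in range(n_students):
    pages = rng.choice(lengths, size=5, replace=False)
    theta_5 = pages.mean()                            # frequentist estimate
    theta_prior = theta + rng.normal(0.0, 1.0)        # student's prior guess
    w_trust = rng.uniform(0.0, 1.0)                   # trust in the prior
    theta_bayes = w_trust * theta_prior + (1 - w_trust) * theta_5
    bayes_wins += abs(theta - theta_bayes) < abs(theta - theta_5)

print(f"Bayes estimator closer in {bayes_wins}/{n_students} cases")
```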
SLIDE 23 DIFFICULTIES FOR ESTIMATING CROP MODEL PARAMETERS
As we have said, the quality of the results of a statistical analysis will depend on the
quality of the assumptions that are made (and also of course on the data).
A first assumption required for a Bayesian analysis (also for a frequentist analysis)
concerns the statistical model for the data. The problem is often particularly difficult for crop
models. Can errors reasonably be assumed to have expectation zero? Constant variance? What
about correlations between different measurements in the same field?
A second assumption concerns prior information. What exactly is known, and what is
a reasonable way to represent that information? Lurking behind this question is a very
fundamental question about crop models. What real physical or biological constants do the
parameters really represent?
SLIDE 24 PRACTICAL CONSIDERATIONS
In some cases, like our simple example, the posterior distribution is a known
distribution that we can work with.
What do we do in all those cases (including non-linear models like crop models)
where the posterior distribution is not a known distribution? First of all, note that we can live
without an analytical expression for the posterior distribution. It will be sufficient to have a
large sample from this distribution. Such a sample can then be used to represent the
uncertainty in any calculations involving the parameters. We will simply repeat all
calculations for each parameter value in the large sample.
What do we know? Given a statistical model, it is usually straightforward to write P(y | θ) even if the crop model is complex. Given prior information, it is usually straightforward to write P(θ). On the other hand, the integral in the denominator of Bayes’ Theorem is very difficult to calculate if the parameter is of even moderate dimension. So what we usually know is that the posterior density is proportional to the product of P(y | θ) and P(θ), which can be calculated, but the proportionality constant is unknown.
In fact, there are powerful but simple algorithms for obtaining a sample from a posterior distribution when all one knows is P(y | θ) and P(θ). We will present one of these algorithms in Part 3 of this mini-course.
David Makowski and Daniel Wallach, INRA, France. 12/05/2017