Bayesian methods for parameter estimation and data assimilation for crop models
PART 2. Likelihood function and prior distribution
David Makowski and Daniel Wallach, INRA, France. 12/05/2017

SLIDE 1
Welcome back for part 2 of our mini-course on Bayesian methods for crop models.

SLIDE 2
Previously on "mini-course": we explained Bayes' theorem and wrote it in a form appropriate for use with model parameter estimation. That formula is reproduced on the slide:

P(θ | y) = P(y | θ) P(θ) / ∫ P(y | θ) P(θ) dθ

In this part we'll explain in detail the different parts of this equation, apply it in a simple case, and analyze and interpret the results.

SLIDE 4
There are just two quantities that appear in Bayes' theorem: θ and y. θ represents the parameters of interest. It can be a vector or just a single parameter; for crop models, which have many parameters, θ will be a vector. If we want to apply Bayes' theorem to θ, it must be a random variable; the theorem makes no sense otherwise. For a frequentist, θ is a fixed (but unknown) quantity, and Bayes' theorem is not applicable. For a Bayesian as well, a parameter has some true value, but θ does not represent that value. Rather, θ is a random variable whose distribution represents our knowledge about the parameter. Then Bayes' theorem can be applied. y represents the observations. It can be a vector (a whole list of measured quantities) or just a single quantity. y is a random variable in the sense that until the measurement is made, we don't know its value.

SLIDES 7, 8 EXAMPLE
Our model is y = θ, where y is the measurement of yield in a subplot of a field and θ, the unknown parameter, is the mean yield in the field. Despite its extreme simplicity, this has the basic features of a crop model. Basically, a crop model provides a relationship between measurable variables on the one hand and input variables and parameters on the other hand. Here we have simplified this to the extreme: we have no input variables, we have only a single parameter, and we have the simplest possible function relating y and θ. Our objective is to obtain the posterior distribution for the parameter. This will be based on prior information, which might reflect our knowledge about yield in fields like this one, and on a single measurement in the field in question.

SLIDES 9, 10 PRIOR DISTRIBUTION
P(θ) is the prior distribution for the parameter value. Note that it is not sufficient to give a value for θ that represents our best guess; one must specify a probability distribution. In this way one specifies not only one's prior idea, but also the level of uncertainty about that idea. For the example we suppose that the prior distribution is a normal distribution with expectation 5 t/ha and standard deviation 2 t/ha. Note that Bayes' theorem can be used, and often is used, even if there is no prior information about a parameter. In that case, one uses a "non-informative" prior. For example, one could suppose that the parameter has a uniform distribution between −A and +A, where A is some huge number. This effectively says that, as far as you know, the parameter is equally likely to be anywhere.

SLIDE 12 LIKELIHOOD FUNCTION
The second distribution in Bayes' theorem is P(y | θ). This notation is shorthand for P(y = y_obs | θ = θ_fixed): the probability that y has some particular value, say y_obs, given that θ has some particular value, say θ_fixed. In our example there is a single measured value, y_obs = 9 t/ha.
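To make the prior concrete, here is a minimal Python sketch (our illustration, not part of the slides; the grid of candidate values is arbitrary) that evaluates this prior density:

```python
import numpy as np
from scipy.stats import norm

# Prior for theta (mean yield in t/ha): normal, expectation 5, standard deviation 2
prior = norm(loc=5.0, scale=2.0)

# Evaluate the prior density for a few candidate values of theta
for theta in np.linspace(1.0, 9.0, 5):
    print(f"theta = {theta:3.1f} t/ha, prior density = {prior.pdf(theta):.4f}")
```

A non-informative prior could be sketched the same way, for example with scipy.stats.uniform over a very wide interval.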
SLIDE 13 STATISTICAL MODEL
To evaluate P(y = y_obs | θ = θ_fixed) we require a statistical model for y. For our example we use the statistical model y = θ + ε. We assume that ε is a random variable with a normal distribution with expectation 0 and variance σ², written ε ~ N(0, σ²). We will assume that σ² = 1. Crop models are deterministic; they do not include any random variables. So to calculate a likelihood you need to add a random error, and specify the properties of that random variable.

SLIDE 14 DEFINITION OF A LIKELIHOOD FUNCTION
We can use our statistical model to determine the likelihood P(y = y_obs | θ = θ_fixed). Suppose for example that θ_fixed = 8. That implies that the random variable ε = y_obs − θ_fixed = 9 − 8 = 1. According to the distribution of ε, this gives P(y = 9 | θ = 8) = P(ε = 1), which is the N(0,1) density evaluated at 1, i.e., 0.242. The probability of measuring y = 9 clearly depends on the value of θ_fixed; a different value of θ_fixed would lead to a different probability. In general, using the explicit equation for the normal distribution,

P(y = y_obs | θ = θ_fixed) = (1/√(2π)) exp(−(9 − θ_fixed)²/2),

which is a function of θ_fixed. The function P(y | θ) is called the likelihood function. The common notation is L(θ; y), but the inversion between y and θ should not be misunderstood: this is still the probability of getting a particular value of y, considered as a function of θ.

SLIDE 16 MAXIMUM LIKELIHOOD
Note that the observation y enters the posterior distribution only through the likelihood function. That is, the information conveyed by the measurement is entirely captured by the likelihood function. Obviously the likelihood function plays an extremely important role. It is central to both frequentist and Bayesian statistics. For a frequentist, as we have said numerous times, a parameter is just a constant; then L(θ; y) is a function of the value of that constant. For a Bayesian, θ is a random variable, and L(θ; y) is the probability of y as a function of the value taken by the random variable θ. At the end of the day, there is no difference here. There is, however, an important difference in the way the likelihood is used. For a frequentist, parameter estimation is based solely on the likelihood. A common procedure is maximum likelihood estimation: the estimator of θ is then the value that maximizes the probability of observing y. In the example, the maximum likelihood estimator of θ is easily seen to be θ̂ = 9. A Bayesian, on the other hand, bases estimation on both the likelihood and the prior information.

SLIDE 18 POSTERIOR DISTRIBUTION
In this simple case we can calculate the posterior distribution analytically. To derive the result shown on the slide, one could do all the calculations, including the integration in the denominator of Bayes' theorem. An easier method is to note that the posterior distribution of θ is a normal distribution, since it has the form d·exp(aθ² + bθ + c). For a normal distribution with expectation μ and variance σ², the coefficient of the quadratic term is a = −1/(2σ²) and the coefficient of the linear term is b = μ/σ², which allows us to calculate μ and σ² for the posterior.
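These calculations are easy to check numerically. The sketch below (ours, not the slide's) evaluates the likelihood at θ_fixed = 8 and then applies the standard conjugate-normal formulas obtained by completing the square, writing B = σ²_prior/(σ²_prior + σ²) for the weight given to the data (B is discussed further under SLIDE 20):

```python
import numpy as np
from scipy.stats import norm

y_obs = 9.0                 # single yield measurement, t/ha
sigma2 = 1.0                # variance of the measurement error epsilon
mu_prior, tau2 = 5.0, 4.0   # prior N(5, 2^2)

# Likelihood: probability density of observing y_obs given theta
def likelihood(theta):
    return norm.pdf(y_obs, loc=theta, scale=np.sqrt(sigma2))

print(likelihood(8.0))                      # 0.2420, the value computed above

# Conjugate-normal posterior obtained by completing the square
B = tau2 / (tau2 + sigma2)                  # weight given to the data (here 0.8)
mu_post = B * y_obs + (1 - B) * mu_prior    # posterior expectation: 8.2
var_post = B * sigma2                       # posterior variance: 0.8
print(mu_post, var_post)
```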
SLIDE 20 DISCUSSION OF POSTERIOR DISTRIBUTION
The posterior distribution is a probability distribution for the parameter which represents our uncertainty about the parameter. All Bayesian analyses are then based on this distribution. For example, suppose we want to make a prediction that involves θ. The predicted quantity would have a distribution, resulting from the distribution of θ. In this way the uncertainty in θ is propagated to all calculated quantities that depend on θ.

The posterior expectation is a weighted average of the prior expectation and the maximum likelihood value: μ_post = B·y + (1 − B)·μ_prior. Note that normally in a Bayesian analysis one is interested in the full posterior distribution and not just in its expectation; however, the expectation can be useful as a reference value. The weighting of the prior and maximum likelihood values depends on the uncertainty in each. If σ²_prior >> σ² (lots of uncertainty in the prior compared to the uncertainty due to the data), B is close to 1 and the expectation is close to y. If, on the contrary, the uncertainty in the data is large compared to the uncertainty in the prior information, then B is close to 0 and the expectation is close to the expectation of the prior density. This seems intuitively reasonable: we give more weight to the source of information that is better known.

The variance of the posterior, σ²_post = B·σ², is smaller than either σ²_prior or σ². Combining two sources of information has allowed us to reduce our uncertainty. For example, if σ²_prior = σ² = a, then B = 1/2 and the variance of the posterior distribution is a/2.
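To illustrate how uncertainty in θ propagates to quantities computed from it, here is a minimal sketch: a large sample is drawn from the posterior N(8.2, 0.8) found above and pushed through a prediction function. The prediction itself (gross revenue at an assumed price of 150 euros per tonne) is purely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample from the posterior for theta computed above: N(8.2, 0.8)
theta_sample = rng.normal(loc=8.2, scale=np.sqrt(0.8), size=100_000)

# Any quantity computed from theta inherits a distribution from the posterior.
# Illustrative prediction: revenue in euros/ha at a hypothetical 150 euros/t.
revenue = 150.0 * theta_sample

print(f"mean revenue: {revenue.mean():.0f} euros/ha")
print(f"95% interval: {np.percentile(revenue, 2.5):.0f} "
      f"to {np.percentile(revenue, 97.5):.0f} euros/ha")
```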
SLIDE 21 FREQUENTIST versus BAYESIAN
We can now better understand some of the debate between frequentists and Bayesians. A major concern of frequentists is that Bayesian analysis introduces an element of subjectivity into statistical analysis. This is well expressed by D. R. Cox and D. V. Hinkley in their book Theoretical Statistics: "In the statistical analysis of scientific and technological data, there will nearly always be some external information, such as that based on previous similar investigations, fully or half-formed theoretical predictions, and arguments by vague analogy. Such information is important in reaching conclusions about the data at hand, including consistency of the data with external information, and in planning future investigations. Nevertheless, it is not at all obvious that this is usually best done quantitatively, expressing vague prior knowledge by a probability distribution, rather than qualitatively."

One major criticism expressed by Bayesians concerns the way that frequentists assess uncertainty in results. For a frequentist, uncertainty refers to the uncertainty in measured values. For example, consider a survey where 10 people are chosen at random. If the survey were repeated many times, the people chosen would be different in each case, and that would lead to a distribution of results. It is this distribution that underlies frequentist analyses of uncertainty. Bayesians point out that this is not always very pertinent, since in general one does not envision repeating the experiment. In a Bayesian approach, on the other hand, all conclusions are based on P(θ | y), which is conditional on the specific measured values in the experiment that was actually carried out. Bayesians argue in addition that uncertainty assessments in frequentist statistics are complex, difficult to obtain and difficult to explain. In Bayesian analyses, on the other hand, all uncertainties are based on the posterior distribution. Once this distribution is available, all calculations are relatively straightforward. The explanations are also simple: the distributions that one calculates based on the posterior distribution directly represent the uncertainty in the results.

SLIDE 22 WHICH IS BETTER?
Statistics is supposed to help in predicting things about the real world. Which school of statistics does this better? As you might suspect, the answer is not simple. All statistical analyses involve various assumptions, which are often simplifications compared to the real world. The quality of the results will depend on the accuracy of the assumptions, and this may be more important than the choice between a Bayesian and a frequentist approach. It does seem clear in general that when there is prior information, it is worthwhile to take it into account. We combine prior information and data all the time in real life. When you read the weather forecast and then look out at the sky in the morning to assess the probability of rain, you are combining prior information (the forecast) and data (the sky). However, this still leaves open the question as to whether it is a good idea to adopt a Bayesian approach in order to take account of prior information.

One statistics professor describes an experiment he performs in his statistics class to compare Bayesian and frequentist estimation quantitatively. (My apologies to the professor who wrote about this but whose name I don't have.) You could easily repeat this with your colleagues. He chooses a book, I don't remember which one. Let's say The Constant Gardener by John le Carré. The objective is to predict the average number of letters of the first word on each page. In my edition there are 558 pages, so we're looking for the average length of 558 words. We'll call this value θ. The professor explains to his class that they will estimate θ using a frequentist and a Bayesian method. The first method will be based on a random sample of words from the book: he will choose 5 pages at random and calculate the average length of the first word on each page. That average, noted θ̂₅, is the frequentist estimator of θ. It is based solely on the data. The only assumption here is that the sample is really a random sample, which is fairly easy to ensure. Thus there is no problem of using unfounded assumptions.

The second method will be a Bayesian method, combining each student's prior estimate of θ, noted θ_prior, with the data. The prior estimate can be based on the student's general knowledge of word length in the English language or, even better, on his knowledge about this particular writer and book. After hearing how estimation will be done, each student is requested to provide his value of θ_prior. Each student has to provide his value before the sample is taken, because the prior information is not supposed to be influenced by the measurement. Each student is also told to provide a second number, we'll call it w_trust, between 0 and 1, to quantify how much trust he puts in his prior value compared to the value θ̂₅. His estimator will be θ̂_Bayes = w_trust·θ_prior + (1 − w_trust)·θ̂₅. Thus a value of 0 means that he will ignore θ_prior and just use θ̂₅ alone as his estimate of θ; a value of 1 means that he will ignore θ̂₅ and just use θ_prior. Looking back at the example with normal distributions, we see that w_trust has a role exactly equivalent to 1 − B. Once each student has provided values of θ_prior and w_trust, the professor looks at five pages at random in the book and gives the class the value of θ̂₅. Each student then calculates his personal θ̂_Bayes and the absolute errors |θ − θ̂₅| and |θ − θ̂_Bayes|. Finally, the professor counts the number of students for whom |θ − θ̂₅| is the smaller error and those for whom |θ − θ̂_Bayes| is the smaller error. In the large majority of cases, the Bayes estimator has the smaller error. That is, the Bayes approach is in most cases better able to provide predictions about the real world.

The professor then goes one step further. He asks the students for whom the frequentist error is smaller to announce their values of θ_prior and w_trust. He finds (no surprise) that these students combine poor values of θ_prior with large values of w_trust. The conclusion? Using prior information is usually a good idea, but if your prior is poor and you put a lot of trust in it, you're in trouble!
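If you want to try this with colleagues, the experiment is also easy to simulate. In the sketch below every number (the word-length distribution, the class size, the quality of the priors) is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "book": first-word lengths for 558 pages (invented distribution)
word_lengths = rng.integers(1, 12, size=558)
theta = word_lengths.mean()                 # the true value to be estimated

# Frequentist estimate: mean first-word length on 5 randomly chosen pages
theta_5 = rng.choice(word_lengths, size=5, replace=False).mean()

# A class of 30 students: each provides a prior guess and a trust weight
n_students = 30
theta_prior = rng.normal(loc=theta, scale=1.5, size=n_students)  # guesses of varying quality
w_trust = rng.uniform(0.0, 1.0, size=n_students)

theta_bayes = w_trust * theta_prior + (1 - w_trust) * theta_5

bayes_wins = np.abs(theta - theta_bayes) < np.abs(theta - theta_5)
print(f"Bayes estimator closer for {bayes_wins.sum()} of {n_students} students")
```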
SLIDE 23 DIFFICULTIES FOR ESTIMATING CROP MODEL PARAMETERS
As we have said, the quality of the results of a statistical analysis will depend on the quality of the assumptions that are made (and also, of course, on the data). A first assumption required for a Bayesian analysis (and also for a frequentist analysis) concerns the statistical model for the data. The problem is often particularly difficult for crop models. Can errors reasonably be assumed to have expectation zero? Constant variance? What about correlations between different measurements in the same field? A second assumption concerns prior information. What exactly is known, and what is a reasonable way to represent that information? Lurking behind this question is a very fundamental question about crop models: what real physical or biological constants do the parameters really represent?

SLIDE 24 PRACTICAL CONSIDERATIONS
In some cases, like our simple example, the posterior distribution is a known distribution that we can work with. What do we do in all those cases (including non-linear models like crop models) where the posterior distribution is not a known distribution? First of all, note that we can live without an analytical expression for the posterior distribution. It is sufficient to have a large sample from this distribution. Such a sample can then be used to represent the uncertainty in any calculations involving the parameters: we simply repeat all calculations for each parameter value in the large sample.

What do we know? Given a statistical model, it is usually straightforward to write P(y | θ), even if the crop model is complex. Given prior information, it is usually straightforward to write P(θ). On the other hand, the integral in the denominator of Bayes' theorem is very difficult to calculate if the parameter is of even moderate dimension. So what we usually know is that the posterior density is proportional to the product of P(y | θ) and P(θ), which can be calculated, but the proportionality constant is unknown. In fact, there are powerful but simple algorithms for obtaining a sample from a posterior distribution when all one knows is P(y | θ) and P(θ). We will present one of these algorithms in Part 3 of this mini-course.
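As a taste of what is coming, here is a minimal sketch of one such algorithm, a random-walk Metropolis sampler, applied to our yield example (our illustration, not the implementation presented in Part 3). It uses only the unnormalized product P(y | θ)P(θ); the proposal step size and chain length are arbitrary choices:

```python
import numpy as np
from scipy.stats import norm

y_obs, sigma = 9.0, 1.0          # observation and error standard deviation
mu_prior, sd_prior = 5.0, 2.0    # prior N(5, 2^2)

def unnormalized_posterior(theta):
    # Product P(y|theta) * P(theta); the normalizing constant is not needed
    return norm.pdf(y_obs, theta, sigma) * norm.pdf(theta, mu_prior, sd_prior)

rng = np.random.default_rng(2)
theta, sample = 5.0, []
for _ in range(50_000):
    proposal = theta + rng.normal(0.0, 1.0)   # symmetric random-walk proposal
    # Accept the proposal with probability min(1, posterior ratio)
    if rng.uniform() < unnormalized_posterior(proposal) / unnormalized_posterior(theta):
        theta = proposal
    sample.append(theta)

sample = np.array(sample[5_000:])             # discard burn-in
print(sample.mean(), sample.var())            # close to 8.2 and 0.8, as expected
```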