Bayesian Signal Processing
Ercan Engin Kuruoglu, ISTI-CNR
[email protected]

Why do we study Bayesian theory?
- It allows us to formulate our prior knowledge, or our belief, about the data. Classical techniques ignore any prior information we might have about the problem.
- Many real-world applications require processing vast amounts of data. We have to be selective and look for solutions in the right areas, and we have to be able to learn from the changing nature of data.
- It is a very good mimic of how our brain itself learns and thinks.

Why now and not before? Our computational power can now handle the computational complexity of numerical Bayesian techniques.

Applications
- Telecommunications
- Radar, sonar
- Image processing
- Machine vision
- Biomedical signal processing
- Bioinformatics
- etc.

Our approach
- Bayesian theory does not involve difficult mathematics or hard-to-decide strategies; it is a unified theory and approach.
- Our discussion will be more heuristic than rigorous.
- We will first try to adopt the Bayesian way of thinking.
- We will obtain some basis of the theory through analytical examples.
- Numerical techniques will be emphasized, since they are the hot area in research.
- We will discuss various applications and see how your own research could adopt a Bayesian formulation.

Our aims
- To learn the basics of Bayesian theory
- To understand its potential
- To see a picture of current research on applications of Bayesian theory
- To see whether we can use it in our own research

History: Thomas Bayes
(Slide image: Bayes' tomb.)
Rev. Thomas Bayes (1701-1761) was born to an English family from Sheffield (Yorkshire). Bayes was a Presbyterian priest, and a non-conformist one (Arian)! He therefore could not study at Cambridge or Oxford, and studied at Edinburgh instead.
In Edinburgh he took mathematics lessons from James Gregory.

Work
- Religious writings: "Divine Benevolence".
- Royal Society member; critic of many mathematical works; close contact with John Canton and Richard Price.
- Natural sciences: strong defender of Isaac Newton's "Doctrine of Fluxions"; his interest in infinite series advanced the work of Maclaurin.
- Probability: Bayes' theorem.

Publication of Bayes' theorem
The famous paper on Bayes' theorem was published after his death by Price, who communicated it to the Royal Society in 1763: "An Essay Towards Solving a Problem in the Doctrine of Chances", Philosophical Transactions of the Royal Society of London, vol. 53, pp. 370-418.

Reference: D.R. Bellhouse, "The reverend Thomas Bayes: a biography to celebrate the tercentenary of his birth," Statistical Science, 2004, vol. 19, no. 1, pp. 3-43.

Bayes' theory attracted much attention after its publication but also received much criticism because of its philosophical nature; the initial reaction was due to its non-orthodox implications for religion. The theory was given a formal framework by Laplace (1774). During the 19th century it was almost completely abandoned, although starting with the work of Jeffreys in the 1930s it was placed again at the center of controversy. Cox's work in 1946 helped clarify several issues. These debates still continue, and the theory is still resisted by the school of frequentists, but since the 1980s the Bayesian stand has become ever more accepted.

Frequentist vs Bayesian view of probability
There are two main interpretations of probability:
- Frequentist, also called objective or empirical (classical)
- Bayesian, also called subjective or evidential

Frequentist probability
The frequentist views probability as a long-run relative frequency, imagining that events are outcomes of experiments that can be run indefinitely under identical conditions. The frequentist probability of an event occurring is the limit, as the number of experiments goes to infinity, of the proportion of times that the event occurs.
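The long-run relative frequency idea can be sketched with a short simulation. This is an illustrative Python sketch, not part of the slides; the function name and the use of a simulated fair coin are my own choices.

```python
import random

# Illustrative sketch (not from the slides): estimate the frequentist
# probability of "tails" as the relative frequency over n tosses of a
# simulated fair coin.
random.seed(0)  # for reproducibility

def relative_frequency_of_tails(n_tosses):
    # Each call to random.random() < 0.5 simulates one toss landing tails.
    tails = sum(1 for _ in range(n_tosses) if random.random() < 0.5)
    return tails / n_tosses

# The relative frequency approaches 1/2 as the number of tosses grows.
for n in (100, 10_000, 1_000_000):
    print(n, relative_frequency_of_tails(n))
```

Of course, the frequentist definition requires the limit as the number of tosses goes to infinity; any finite simulation only approximates it.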
Frequentists are outside observers!

Frequentist example
Experiment: tossing a fair coin. Possible events: {H, T}. A frequentist would assign probability to the event "tails" by imagining the experiment being run infinitely many times and measuring the proportion of times a T comes up. This concept of probability was first proposed by Venn in 1886, and it leads to the classical, frequentist approach to statistical inference.

Subjective probability
Another way to view probability is as a personal, but rational, measure of certainty or uncertainty based on available evidence. This view of probability is called subjective. When the available evidence is empirical, however, the frequentist and subjective probabilities coincide: the subjectivist does not need to imagine tossing the coin infinitely many times; the evidence tells us that the coin has two sides and is fair, and therefore the probability of throwing T on a single toss is one half. Subjectivists are part of the experiment, since they act on their belief based on evidence.

Example 2
The probability of passing this course this September: the subjectivist can look at how you performed in previous statistics courses, at your skills and study habits, and guess that it is 0.9. The frequentist, however, cannot make you take the exam infinitely many times, and cannot rationally and systematically conceptualize this process to provide a meaningful probability statement about this event. Thus subjective probabilities are more general than frequentist ones, since they can be used to assign uncertainty to single, unique events.

Parameter uncertainty
Frequentist (classical) statistics assumes that population parameters are unknown but deterministic constants. Estimation proceeds by randomly sampling from the population and using the data to estimate the parameters of interest. So the truth is out there; you just collect data, fit the model to it, and find the fixed parameters. Subjectivist statistics views the parameters as unknown and random.
To estimate the parameter, we sample data; but since it is only sampled data, we will never be able to know the exact value of our quantity. We will, however, know more after the sample has been studied than we did before.

Subjectivist view of learning
A subjectivist is uncertain about parameter values before sample data are collected; the new evidence in the sample reduces this uncertainty, and the subjectivist measures uncertainty using probability. The method of statistical inference that starts with initial uncertainty about parameter values and then modifies this uncertainty using sample information is known as Bayesian statistical inference, or Bayesian learning. Bayesian learning naturally reflects the way we have learned since childhood, and also the way scientific knowledge advances.

Introduction to Bayesian inference
(Slide diagram: a theory, hypothesis, or model leads, by deduction and creativity, to predictions; observations and data lead, by induction, back to inference, verification, or falsification.) Two questions drive inference: what do hypotheses predict about potential data, and how does data support (or undermine) hypotheses?

Deduction: deduce outcomes from hypotheses.
  A implies B; A is true; therefore B.
Induction: infer hypotheses from outcomes.
  If A, then we are likely to observe B and C; B and C are observed; therefore A is supported.

How new data support hypotheses: suppose hypotheses H1, H2, H3 each assign different probabilities to possible data D1, D2, D3. What do we infer if we observe D1? D2? D3? In the slide's example, observing D1 refutes H1, supports H2 a little, and supports H3 strongly; observing D2 supports H1 and H3 a little and H2 moderately.

Statistical inference is not magic: it cannot extract information that is not present in the data. Statistical inference is also easily misused; watch out for "garbage in, garbage out". Often the right solution is a better experiment or better observations, not slick statistical procedures!
Probability
For a Bayesian, probability is conditioned on what each individual knows, and so it can vary from individual to individual. Probability is not "out there"; it is in your head. "Probability does not exist" (Phil Dawid). Probability is about epistemology, not ontology.

Probability as belief
In the Bayesian view, probability describes your degree of belief, given what you know. R.T. Cox (and, independently, I.J. Good) proposed the following reasonable assumptions about plausibilities, or degrees of belief:
- Plausibility should be transitive: if A is more plausible than B, and B is more plausible than C, then A is more plausible than C. This means it should be possible to attach a real number P(A) to each proposition and rank plausibilities by those numbers.
- The plausibility of ~A (not-A, the negation of A) should be some function of the plausibility of A: P(~A) = f(P(A)).
- The plausibility of (A and B) should be some function of the plausibility of A given that B is true, and of the plausibility of B: P(A&B) = g(P(A|B), P(B)).

The conditional nature of probability
Probability is always conditional: the probability that You assign to something always depends on the information that You have. Someone else, with different information, will usually, and legitimately, assign a different probability to a proposition than You will. Background information is information that is assumed but not always stated; it could include things like how to do mathematics, basic physical and astronomical knowledge that You may have, things You learned while growing up, and so on. There is always background information, which we write H and often include as an explicit reminder of this fact. Failure to include H can lead to apparent paradoxes in probability theory.

Probability axioms
We write P(A|H) to represent "the probability that A is true, given that H is true". It is a real number.
P(A|H) satisfies the following axioms of probability:
- 0 <= P(A|H) <= 1
- P(A|A,H) = 1
- P(A|H) + P(~A|H) = 1, where ~A is the negation of A
- P(A&B|H) = P(A|B&H) P(B|H)   (product law, and the definition of conditional probability)
- P(A|B&H) = P(A&B|H) / P(B|H), if P(B|H) ≠ 0
- If A and B are mutually exclusive propositions, then P(A or B|H) = P(A|H) + P(B|H). [Derivable from the product law, hence not an independent axiom.]

The Bayes' theorem
Suppose we have two events, A and B. Then

  P(A & B) = P(A|B) P(B)

and similarly

  P(A & B) = P(B|A) P(A).

Equating the two right-hand sides gives Bayes' theorem. Bayes' theorem is thus a trivial result of the definition of conditional probability: when P(D) ≠ 0,

  P(θ|D) = P(θ & D) / P(D) = P(D|θ) P(θ) / P(D).

Note that the denominator P(D) is nothing but a normalization constant required to make the total probability on the left sum to 1. Often we can dispense with the denominator, leaving its calculation until last, or even leave it out altogether!

Bayes' theorem is a model for learning. Suppose we have an initial or prior belief about the truth of θ, and suppose we observe some data D.
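A quick numerical sketch of Bayes' theorem may help. The numbers below (a prior of 0.01 and likelihoods of 0.95 and 0.05) are hypothetical, chosen only for illustration; they are not from the slides.

```python
# Hypothetical setup: a state of nature A with prior P(A) = 0.01,
# and data D with likelihoods P(D|A) = 0.95 and P(D|~A) = 0.05.
prior_A = 0.01
like_D_given_A = 0.95
like_D_given_notA = 0.05

# Evidence (normalization constant): P(D) = P(D|A)P(A) + P(D|~A)P(~A)
p_D = like_D_given_A * prior_A + like_D_given_notA * (1 - prior_A)

# Bayes' theorem: P(A|D) = P(D|A) P(A) / P(D)
posterior_A = like_D_given_A * prior_A / p_D
print(round(posterior_A, 3))  # ≈ 0.161
```

Note how the small prior keeps the posterior modest even though the data strongly favor A: the posterior balances prior belief against the likelihood, exactly as the theorem prescribes.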
Then we can calculate our revised or posterior belief about the truth of θ, in the light of the new data D, using Bayes' theorem:

  P(θ|D) = P(θ & D) / P(D) = P(D|θ) P(θ) / P(D)

where

  P(D|θ) : the likelihood
  P(θ)   : the prior distribution (quantification of our belief)
  P(D)   : the evidence (marginal distribution of the data)
  P(θ|D) : the posterior distribution

and the evidence is computed as

  P(D) = Σ_i P(D & θ_i) = Σ_i P(D|θ_i) P(θ_i).

The Bayesian mantra: posterior ∝ likelihood × prior.

The odds form of Bayes' theorem
In the special case that there are only two states of nature, A1 and A2 = ~A1, we can bypass the calculation of the marginal likelihood by using the odds ratio, the ratio of the probabilities of the two hypotheses:

  Prior odds = P(A1) / P(A2)
  Posterior odds = [P(D|A1) / P(D|A2)] × [P(A1) / P(A2)] = Likelihood ratio × Prior odds

The marginal probability of the data, P(D), is the same in each case and cancels out. The likelihood ratio is also known as the Bayes factor.

In this case, since A1 and A2 are mutually exclusive and exhaustive, we can calculate P(A1|D) from the posterior odds, and P(A1) from the prior odds, and vice versa:

  Odds = Probability / (1 − Probability)
  Probability = Odds / (1 + Odds)

The Bayesian program
The entire program of Bayesian inference can be encapsulated as follows:
- Enumerate all the possible states of nature and choose a prior distribution on them that reflects your honest belief about the probability that each state of nature happens to be the case, given what you know.
- Establish the likelihood function, which tells you how well the data we actually observed are predicted by each hypothetical state of nature.
- Compute the posterior distribution by Bayes' theorem.
- Summarize the results in the form of marginal distributions, (posterior) means of interesting quantities, Bayesian credible intervals, or other useful statistics.

That's it!
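The odds form can be sketched directly in code. The prior and likelihoods below are made-up illustrative numbers, and the helper names are my own; the point is that the marginal P(D) never needs to be computed.

```python
# Odds form of Bayes' theorem for two exhaustive hypotheses A1, A2 = ~A1.
# Hypothetical numbers: P(A1) = 0.01, P(D|A1) = 0.95, P(D|A2) = 0.05.

def to_odds(p):
    # Odds = Probability / (1 - Probability)
    return p / (1 - p)

def to_prob(odds):
    # Probability = Odds / (1 + Odds)
    return odds / (1 + odds)

prior_A1 = 0.01
likelihood_ratio = 0.95 / 0.05   # the Bayes factor, P(D|A1) / P(D|A2)

# Posterior odds = likelihood ratio × prior odds; P(D) has cancelled out.
posterior_odds = likelihood_ratio * to_odds(prior_A1)
posterior_A1 = to_prob(posterior_odds)
print(round(posterior_A1, 3))  # ≈ 0.161
```

Converting the posterior odds back to a probability gives the same answer as applying Bayes' theorem directly with the explicit normalization, which is a useful consistency check.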
In Bayesian inference there is one uniform way of approaching every possible problem in inference. There is not a collection of arbitrary, disparate "tests" or "methods": everything is handled in the same way. So, once you have internalized the basic idea, you can address problems of great complexity using the same uniform approach. Of course, this means there are no black boxes. You have to think about the problem at hand: establish the model, think carefully about the priors, and decide what summaries of the results are appropriate. It also requires clear thinking about what answers you really want, so that you know what questions to ask.