Advanced probability: notes
Aleksandar Makelov

1. History

1.1. Introduction. Kolmogorov set the mathematical and philosophical foundations of modern probability in his 1933 book ‘Grundbegriffe der Wahrscheinlichkeitsrechnung’. He synthesized previous ideas, among them Émile Borel’s addition of countable additivity to classical probability, and related the mathematics to the world of experience. To appreciate these two aspects of his work, we need the context of measure theory and axiomatic probability in 1900-1930. Kolmogorov was a frequentist. In chapter 2, we look at the foundation Kolmogorov replaced - equally likely cases - how it relates to the real world, and some paradoxes. In chapter 3, we look at the parallel development of measure theory and how it became entangled with probability. In chapter 4, we look at alternative formalizations of probability that appeared in 1900-1930. In chapter 5, we turn to the ‘Grundbegriffe’.

1.2. The classical foundation. This was the notion that the probability of an event A is the ratio of the number of equally likely cases in which A occurs to the number of all equally likely cases. It had been around for 200 years by the start of the 20th century, and is still used today to teach probability in an elementary way.

1.2.1. The classical calculus. From the axiom above one easily gets the rules of finite additivity and conditional probability:

Pr [A ∪ B] = Pr [A] + Pr [B]   (for disjoint A and B)
Pr [A ∩ B] = Pr [A] Pr [B | A]

These arguments were still the standard in textbooks in the 1900s. Only the British were doing it differently, emphasizing the combinatorics without dwelling on formalities. The Americans adopted the classical rules, but wanted to avoid the arguments based on equally likely cases, and switched to ‘limiting frequencies’ [1], but this approach didn’t last.

Geometric probability. This is the translation of the ideas of classical probability to a geometric setting, but probabilities are still ratios (of volumes, areas, etc).
With the introduction of measure theory, this did not change significantly; we had more sets to work with, but the rules were the same.

Relative probability. This term had no clear meaning at first, and was later replaced by conditional probability. Hausdorff defined relative probability in 1901 as something that resembles today’s conditional probability; other people had other definitions for what ‘relative’ meant, and it was a mess. Curiously, George Boole wrote about ‘conditional events’ and their probabilities in 1854.

[1] I’m guessing this means that probability is the frequency of the event as the number of samples goes to infinity?

1.2.2. Cournot’s principle. OK, this section is sort of important. ‘An event with very small probability is morally impossible: it will not happen; equivalently, an event with very high probability is morally certain: it will happen.’ The principle was first formulated by Bernoulli in 1713. Maybe this sounds stupid/obvious at first, but here’s the justification. One can prove (as Bernoulli did) that in a long series of samples, the frequency of an event will be close to its probability with high probability, so we can use that frequency as an estimate for the probability. This later became known as the law of large numbers. Of course, people argued about this kind of moral certainty a lot. For example, Jean d’Alembert pondered the possibility of a hundred successive coin flips coming up heads, and deemed it ‘metaphysically possible, but physically impossible’. Perhaps a better example, coming from geometric probability (which has a natural notion of ‘vanishingly small’ probabilities), is due to Cournot (1843): a heavy cone balancing on its tip is mathematically possible, but physically impossible. And similarly, he argued, it’s not physically possible for the law of large numbers to be violated.
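The Bernoulli/law-of-large-numbers claim is easy to see numerically. A quick sketch (my own illustration, not from the text; the function name and the probability 0.3 are made up for the demo):

```python
import random

def relative_frequency(p, n, seed=0):
    """Run n independent trials of an event with probability p and
    return the fraction of trials in which the event happened."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if rng.random() < p)
    return hits / n

# The frequency concentrates around p as n grows -- this is what lets
# us treat probability as a measurable quantity, like weight or volume.
for n in (100, 10_000, 1_000_000):
    print(n, relative_frequency(0.3, n))
```

With high probability the printed frequencies get closer to 0.3 as n grows; Cournot’s principle is the extra step that says the small-probability event ‘the frequency is far from 0.3 for large n’ simply won’t happen.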
The idea of ‘morally impossible’ events, or events with ‘vanishingly small’ probability, became important in physics in the second half of the 19th century, with the discussion of thermodynamic entropy (where it’s morally impossible for a system to reverse the increase of entropy), and with Poincaré’s recurrence theorem about the three-body problem, which works for all initial configurations except a vanishingly small set.

Saying small-probability events won’t happen is one thing, but saying that probability theory gains its connection to reality only by ruling out the happening of such events is a whole different thing. That’s what Cournot was smart enough to realize:

. . . The physically impossible event is therefore the one that has infinitely small probability, and only this remark gives substance - objective and phenomenal value - to the theory of mathematical probability (1843, p. 78).

[Note: This took me a while to process, but I think the moral is this: if we don’t rule out these ‘physically impossible’ events, we can’t reliably measure probability by repeated samples (as in e.g. the law of large numbers). So when we adopt this principle, probability becomes a measurable quantity, much like weight and volume. ]

The viewpoint of the French probabilists. At the beginning of the 20th century, probability was becoming pure math, and people often asked themselves how it relates to the real world. These French guys, namely Borel, Hadamard, Fréchet, Lévy, etc., made the connection by basically invoking Cournot’s principle in their own words. Lévy was an especially strong proponent of the principle, calling it the only bridge between probability and reality, and describing how it, combined with Bernoulli’s law of large numbers, makes probability a measurable physical quantity [Note: like I said above. ].
Kolmogorov knew about Lévy’s views when he was starting his book. Markov did too; he said: ‘...Consequently, one of the most important tasks of probability theory is to identify those events whose probabilities come close to one or zero.’

Strong and weak forms of Cournot’s principle. An important distinction is between
• the strong form, asserting that an event with very small or zero probability, fixed in advance of a given trial, will not happen during that trial;
• the weak form, asserting that an event with very small probability will happen very rarely in repeated trials.
The strong form was used by Borel, Lévy and Kolmogorov to argue, as above, that this justifies the empirical validity of probability. The weak form leads to a different conclusion: that probability is usually approximated by the relative frequency.

British indifference and German skepticism. The French took a very philosophical view of probability, likely influenced by Poincaré. On the contrary, the British were much more pragmatic and worked only with frequencies; they had little interest in the mathematical theory of probability. Germany did see activity on the mathematical side of probability, but few German mathematicians considered themselves philosophers, because division of labor had already started taking place in Germany at the time. Then there was this German guy Johannes von Kries, who thought Cournot’s principle was false; he argued that the idea that an event with very small probability is impossible is wrong. He gave some conditions under which objective probabilities exist, but nobody knew how to use them, so that was that. However influential von Kries was at the time, this did not prevent Germans from embracing the assumption that zero probability equals impossibility. Therefore, for Germans (as opposed to the French), there was a substantial difference between probability-zero events and very-small-probability events.

1.2.3. Bertrand’s paradoxes.
Bertrand came up with several paradoxes that play with the idea of equally likely cases; we’ll look at two.

The paradox of the three jewelry boxes. Ok, that’s silly... it’s about how naively treating all ‘possibilities’ as equally likely can go wrong in the discrete case. The problem is that we don’t explicitly say what the atomic events that are equally likely are.

The paradox of the great circle. Suppose we want to find the probability that two random points on a sphere are within 10′ (ten arcminutes) of one another. By symmetry, we can fix the first point. Then we can proceed in two ways:
• Just calculate the area of the circular cap of angular radius 10′ around it and divide by the full area of the sphere.
• By symmetry we can fix the great circle on which the second point lies; then the probability of the second point landing within 10′ of the first is given by the ratio of the length of the 10′ arcs in each direction to the entire length of the great circle; this comes out to a different number.
Bertrand thought the paradox was due to the problem being ill-posed, i.e. that the two solutions were using different notions of ‘random’. However, Borel proved him wrong - the first solution is alright, but in the second we’re splitting the cases into ones that should have probability zero and conditioning on them (the second point landing on a specific great circle).

Appraisal. Poincaré, Borel and the bunch fixed the paradoxes because they understood the principles classical probability was supposed to represent. Two principles emerged from their solutions:
• Equally likely cases must be detailed enough to represent new information in complete detail [Note: I feel this moves towards state spaces a bit ]
• [Note: I don’t get this ] ‘We may need to consider the real observed event of non-zero probability that is represented in an idealized way by an event of zero probability (e.g., a randomly chosen point falls on a particular meridian).
We should pass to the limit only after absorbing the new information.’

1.3. Measure-theoretic probability before the Grundbegriffe.

1.3.1. The invention of measure theory by Borel and Lebesgue. Borel is considered the father of measure theory. His main insight seems to be closure under countable unions, which he was motivated to adopt by complex analysis. He studied series which diverged on a dense set which could, however, be covered by a countable collection of intervals of arbitrarily small total length. This led Borel to make a new theory of measure on [0, 1]. Lebesgue took that and used it to define his integral.

1.3.2. Abstract measure theory from Radon to Saks. Radon unified Lebesgue and Stieltjes integration by replacing Lebesgue measure with any countably additive set function defined on the Borel sets. Then Fréchet went beyond Euclidean space, realizing it’s enough to look at countably additive functions on any σ-algebra. This is the abstract theory of integration Kolmogorov used:

. . . After Lebesgue’s investigations, the analogy between the measure of a set and the probability of an event, as well as between the integral of a function and the mathematical expectation of a random variable, was clear. This analogy could be extended further; for example, many properties of independent random variables are completely analogous to corresponding properties of orthogonal functions. But in order to base probability theory on this analogy, one still needed to liberate the theory of measure and integration from the geometric elements still in the foreground with Lebesgue. This liberation was accomplished by Fréchet.

1.3.3. Fréchet’s integral. Quoting from the book,

When Fréchet generalized Radon’s integral in 1915, he was explicit about what he had in mind: he wanted to integrate over function space. In some sense, therefore, he was already thinking about probability. An integral is a mean value.
In a Euclidean space this might be a mean value with respect to a distribution of mass or electrical charge, but we cannot distribute mass or charge over a space of functions. The only thing we can imagine distributing over such a space is probability or frequency.

Why did Fréchet fail at the time to elaborate his ideas on abstract integration, connecting them explicitly with probability? One reason was the war. Mobilized on August 4, 1914, Fréchet was at or near the front for about two and a half years, and he was still in uniform in 1919. Thirty years later, he still had the notes on abstract integration that he had prepared, in English, for the course he had planned to teach at the University of Illinois in 1914-1915.

But the real problem was that the time had not yet come for probability in function space: there were no nontrivial examples of distributing probability over it. The laws of classical physics, for example, are deterministic, and probability can only be placed on the initial conditions, which usually live in some finite-dimensional space. The first such examples came from Daniell and Wiener.

1.3.4. Daniell’s integral and Wiener’s differential space. Daniell, working in Texas, took a slightly different approach from Fréchet, and showed how different kinds of integrals are extensions of one another. He used his theory of integration to study the motion of a particle whose infinitesimal motions are independently normally distributed; he called it ‘dynamic probability’, unaware of the earlier (less rigorous) work on Brownian motion. Wiener was in a better position to pick things up from there, because he was aware of the connection. Quoting the book,

Because of his work on cybernetics after the second world war, Wiener is now the best known of the twentieth-century mathematicians in our story - far better known to the general intellectual public than Kolmogorov.
But he was not well known in the early 1920s, and though the most literary of mathematicians (he published fiction and social commentary), his mathematics was never easy to read, and the early articles attracted hardly any immediate readers or followers. Only after he became known for his work on Tauberian theorems in the later 1920s, and only after he returned to Brownian motion in collaboration with R. E. A. C. Paley and Antoni Zygmund in the early 1930s (Paley, Wiener, and Zygmund 1933), do we see the work recognized as central and seminal for the emerging theory of continuous stochastic processes.

Wiener’s basic idea was to consider Brownian motion on an interval [0, 1]. A realized path is simply a function [0, 1] → R. So random variables are functionals, i.e. functions of functions: ([0, 1] → R) → R. Then to do probability we need to define mean values for these functionals. Wiener started with functionals that only depend on the values at finitely many points in [0, 1], which can be calculated using ordinary Gaussian probabilities. Extending this by Daniell’s method, he obtained probabilities for certain sets of paths: for example, he showed the continuous functions have probability 1, but the differentiable functions have probability zero.

People find it surprising that Wiener did all that long before Kolmogorov’s axioms; but he had Daniell’s integral and functional analysis, his approach was markedly different, and it did not require an abstract notion of probability, or divorcing measure from geometry.

1.3.5. Borel’s denumerable probability. Wiener’s impressive work didn’t influence Kolmogorov much; it was Borel who did. Already in his work on complex analysis, there was some hint of probabilistic reasoning. For example, Borel showed in 1897 that a Taylor series would usually diverge on the boundary of its circle of convergence.
The argument was that successive groups of coefficients of a Taylor series are independent, each group determines an arc on the circle, and if a point belongs to infinitely many such arcs, the series diverges at that point. Then by independence it follows that, usually, the series will diverge; today, we call this the Borel-Cantelli lemma.

Later on, Borel published his strong law of large numbers, stating that the proportion of 1s in the binary expansion of a random real number in [0, 1] goes to 1/2 with probability 1. He didn’t use measure theory to prove that, for philosophical reasons; he gave an ‘imperfect’ proof using denumerable versions of the rules of classical probability. Others, however, gave rigorous measure-theoretic proofs. Borel was unwilling to assume countable additivity in probability:

He saw no logical absurdity in a countably infinite number of zero probabilities adding to a nonzero probability, and so instead of general appeals to countable additivity, he preferred arguments that derive probabilities as limits as the number of trials increases [...] But he saw even more fundamental problems in the idea that Lebesgue measure can model a random choice (von Plato 1994, pp. 36-56; Knobloch 2001). How can we choose a real number at random when most real numbers are not even definable in any constructive sense?

It was finally Hugo Steinhaus who gave axioms for Borel’s denumerable probability for infinite sequences of coin flips, along the lines of Sierpiński’s axiomatization of Lebesgue measure, and proved the two axiomatizations are equivalent, i.e. Lebesgue measure on [0, 1] and probability on countably many coin flips are isomorphic.

1.3.6. Kolmogorov enters the stage. He wrote a paper, co-authored with Khinchin (1925), extending the analogy of Steinhaus beyond sequences of coin flips.
But Khinchin transported the argument to the world of Lebesgue measure on [0, 1], because that was well understood at the time, while Borel’s probability remained murky. In his own articles later on, Kolmogorov worked with finite additivity and passed to ad hoc limits.

1.4. Hilbert’s sixth problem. Roughly, it said ‘Make probability rigorous!’. Integration gave a partial answer, but there was still disagreement about what the actual axioms should be, especially about conditional (‘compound’) probability.

1.4.1. Bernstein’s qualitative axioms. Bernstein gave some qualitative axioms for probability, to the effect that the probability of m disjoint events out of n equally likely ones is a function of m/n, and the conditional probability of A with respect to B (for A ⊆ B) is the ratio Pr [A] / Pr [B].

1.4.2. Von Mises’s Kollektivs. This was an alternative formalization (1919) of probability that relied on ensembles (‘Kollektiv’-s), infinite sequences of outcomes satisfying two properties:
• the relative frequency of each outcome converges;
• and it converges to the same limit no matter which subsequence we pick in advance (even if we pick the next term based on the ones so far).
So this is reminiscent of formalizing what a fair game is. This theory could derive the classical laws of probability and such, and it was somewhat popular, but in 1936 Jean Ville showed Kollektivs are vulnerable to a more clever betting strategy, which varies the amount of the bet and the outcome on which to bet; Ville called it a ‘martingale’. This in fact inspired Doob to consider measure-theoretic martingales!

1.4.3. Slutsky’s calculus of valences. This was another attempt to define probability fully mathematically. Slutsky abandoned the idea of ‘equally likely cases’ and even of ‘probability’, and instead assigned numbers to cases and insisted that when a case is subdivided, the corresponding numbers add up to the value of the case. He called this new probability ‘valence’.
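Back to von Mises’s Kollektivs for a moment: the invariance-under-selection property can be probed numerically. A sketch (mine, not from the text; the selection rule is a toy one I made up: look only at terms that immediately follow a 1):

```python
import random

def overall_and_selected_frequency(n, seed=0):
    """Generate n fair coin flips; return the frequency of 1s overall,
    and along the subsequence picked out by a rule that uses only the
    past: 'keep a term iff the previous term was a 1'."""
    rng = random.Random(seed)
    seq = [rng.randint(0, 1) for _ in range(n)]
    selected = [seq[i] for i in range(1, n) if seq[i - 1] == 1]
    return sum(seq) / n, sum(selected) / len(selected)

overall, selected = overall_and_selected_frequency(1_000_000)
print(overall, selected)  # both near 1/2
```

Both frequencies come out near 1/2, as a Kollektiv demands of every selection rule fixed in advance; Ville’s point was that a gambler who also varies the stakes can still exploit some Kollektivs, which is what led to martingales.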
Slutsky did not think probability could be reduced to limiting frequency, because sequences of independent trials have properties that go beyond their possessing limiting frequencies. He gave two examples. First, initial segments of the sequences have properties that are not imposed by the eventual convergence of the relative frequency. Second, the sequences must be irregular in a way that resists the kind of selection discussed by von Mises (he cites an author named Umov, not von Mises). So for Slutsky, frequencies are more like an alternative interpretation of a more general and abstract calculus.

1.4.4. Kolmogorov’s general theory of measure. In a note dated 1927 and published in 1929, he gave a philosophical sketch of the program for what would later become probability. It has the fundamental ingredients in place: countable additivity, and the requirement that the class of subsets to which it assigns values be a σ-field, because only then can we uniquely define probabilities for countable unions and intersections, and this seems necessary to justify arguments involving events such as the convergence of random variables.

1.4.5. The axioms of Steinhaus and Ulam. idk

1.4.6. Cantelli’s abstract theory. He was trying to do something like what Kolmogorov did - an abstract theory founded on empirically tested frequentist assumptions that can then be applied to concrete cases to make predictions.

1.5. The Grundbegriffe. ‘The Grundbegriffe was an exposition, not another research contribution.’ Kolmogorov showed how nothing more than the classical rules together with countable additivity was needed to get probability going, as evidenced by the plethora of foundational results in the previous 20 years (many of them due to Kolmogorov himself). The other major contribution was the rhetorical move to call this entire abstract theory ‘probability’.
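A tiny illustration (my own, not from the book) of what countable additivity buys beyond the classical rules: on Ω = {1, 2, 3, . . .} with Pr [{k}] = 2^(-k), the singletons are disjoint and their union is Ω, and countable additivity is exactly the license to say the probabilities sum to 1:

```python
from fractions import Fraction

def partial_sum(n):
    """Sum of Pr[{k}] = 2**-k over k = 1..n, computed exactly."""
    return sum(Fraction(1, 2**k) for k in range(1, n + 1))

# Finite additivity only ever justifies the partial sums; countable
# additivity lets us pass to the limit and conclude Pr[Omega] = 1.
print(partial_sum(10))      # 1023/1024
print(1 - partial_sum(50))  # the gap is 1/2**50 and vanishes in the limit
```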
As it turned out, more was needed, especially for conditional probability, but people were excited enough to fill the gaps rather than abandon the theory.

1.5.1. Overview.

1.5.2. The mathematical framework. Here’s Kolmogorov on countable additivity:

. . . Since the new axiom is essential only for infinite fields of probability, it is hardly possible to explain its empirical meaning. . . . In describing any actual observable random process, we can obtain only finite fields of probability. Infinite fields of probability occur only as idealized models of real random processes. This understood, we limit ourselves arbitrarily to models that satisfy Axiom VI. So far this limitation has been found expedient in the most diverse investigations.

Then there’s the deal with conditioning... it seems weird to condition on probability-zero events, because they don’t happen in the real world, and because we get nonsense formally when we try to do it.

1.5.3. The empirical origin of the axioms. The section ‘In Kolmogorov’s own words’ is valuable: it is a translation of the two pages of empirical motivation of probability, and a succinct account of Kolmogorov’s frequentist philosophy about how probability relates to the real world. He gives the motivation for a space of outcomes as the set of phenomena that can happen in a given repeatable experiment, and for events being subsets of that space. He then defines the probability of an event A as the limiting frequency with which A happens, and posits a strong version of Cournot’s principle. From that, he deduces the axioms empirically.

1.6. Reception.

1.6.1. Acceptance of the axioms. Not as instant as some sources make it sound... many people were turned away by its abstractness.

1.6.2. The evolution of the philosophy of probability. Turns out, Kolmogorov’s two principles - the empirical limiting frequency and the strong version of Cournot’s principle - faded from memory.
This may have been due to lack of interest in philosophy altogether.

1.6.3. The new philosophy of probability. Cournot’s principle faded; maybe it just didn’t appear to make sense, but there were sociological factors: by the time WWII ended, philosophers and mathematicians were pretty much disjoint subsets, and the abstractness of Kolmogorov’s work made it hard for philosophers to engage with.

1.7. Conclusion. Here it is verbatim:

The great achievement of the Grundbegriffe was to seize the notion of probability from the classical probabilists. In doing so, Kolmogorov made space for a mathematical theory that has flourished for seven decades, driving ever more sophisticated statistical methodology and feeding back into other areas of pure mathematics. He also made space, within the Soviet Union, for a theory of mathematical statistics that could prosper without entanglement in the vagaries of political or philosophical controversy. Kolmogorov’s way of relating his axioms to the world has received much less attention than the axioms themselves. But Cournot’s principle has re-emerged in new guises, as the information-theoretic and game-theoretic aspects of probability have come into their own. We will be able to see this evolution as growth (rather than mere decay or foolishness) only if we can see the Grundbegriffe as a product of its own time.

2. Motivation

First, give several examples of situations where a notion of ‘probability’ arises, taken from real life...

2.1. Examples.
• Flipping a coin: finitely, and infinitely many times.
• The St Petersburg gambling system.
• Scientific experiments.
• Thermodynamics.
• Quantum mechanics.
• A falling prism: will it stand on its tip?
• Spaces of functions and Brownian motion.

2.2. Motivation.
• Besides the obvious things, probability is useful in many settings as a new way of looking at a problem that makes many things more intuitive (maybe Daniel Kahneman showed that we humans are bad with probabilities, but let me tell you, we’re even worse with other things). An example is combinatorics, where restating things in terms of probabilities is often illuminating and makes things simpler.

In this section we talk about the modern formalization of probability and why it came to be this way.

2.3. A first model. Let’s imagine all possible outcomes of a system/process/whatever R that involves randomness as points living inside some space of outcomes Ω. We can think of each point ω ∈ Ω as a list of the values of a set of parameters which describe R completely. It makes sense that when we’re dealing with ‘discrete’ randomness, whatever that means, Ω will be some discrete space; and if we’re dealing with ‘continuous’ randomness, that is, each random thing can vary continuously, then Ω will have some notion of continuity, something like a ‘product’, and that’s why it makes sense to call it a ‘space’, and to expect it to have some topological and hopefully metric properties (because we actually want to compute things).

What we are able to observe about R may not be everything: that is, we may not be able to distinguish between every two points ω and ω′. The events E encode the observable aspects of R; i.e., they are the atomic things about which we can say whether they happened or not. As such, it makes sense to impose the following conditions on E:
• If E ∈ E, then E^c ∈ E.
• If E1, E2 ∈ E, then E1 ∪ E2 ∈ E.
• If E1, E2 ∈ E, then E1 ∩ E2 ∈ E.
That is,
• if we can tell whether something happened, we can tell whether it didn’t happen too;
• if we can tell whether two things happened individually, we can tell whether either happened;
• if we can tell whether two things happened individually, we can tell whether both happened.
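These closure conditions are easy to check mechanically on finite examples. A toy sketch (my own; the parity example is made up) where an observer of a die roll can only see whether the outcome is odd or even:

```python
def is_event_algebra(omega, events):
    """Check the three closure conditions from the text: complements,
    pairwise unions, and pairwise intersections."""
    E = set(events)
    return all(omega - A in E for A in E) and \
           all(A | B in E and A & B in E for A in E for B in E)

omega = frozenset({1, 2, 3, 4, 5, 6})
# A coarse observer who only sees parity:
parity = [frozenset(), frozenset({1, 3, 5}), frozenset({2, 4, 6}), omega]
print(is_event_algebra(omega, parity))            # True
print(is_event_algebra(omega, [frozenset({1})]))  # False: missing the complement
```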
It’s easy to see our conditions are equivalent to
• If E ∈ E, then E^c ∈ E.
• If E1, . . . , En ∈ E, then E1 ∪ · · · ∪ En ∈ E.

Then the probability should be some way to assign numbers to events: how likely, on a scale from 0 to 1, is E to happen? What fraction of the time does E happen? (haha, there’s some deep bayesianist-frequentist thing going on here, but let’s not care about it). As such, it should satisfy some obvious things:
• Pr [Ω] = 1 - something’s gotta happen!
• If A and B can never happen simultaneously, Pr [A ∪ B] = Pr [A] + Pr [B].

Let’s see what this formalization can give us:

Example 2.1. Consider a single roll of a symmetric die. Here it’s natural to let Ω = {1, 2, 3, 4, 5, 6} and to let E contain the singletons {1}, . . . , {6}, and by symmetry it’s reasonable to assume that Pr [{i}] = Pr [{j}] for all i, j. Then since the outcome can’t be both i and j for i ≠ j, the singletons are disjoint, so by additivity 6 Pr [{i}] = Pr [Ω] = 1, i.e. Pr [{i}] = 1/6 for all i.

2.4. Adjusting the model: limits and infinities. All this is very reasonable, but it’s not how we actually do probability. Why? Right now I don’t have a good answer, because I haven’t tried using the above formalism to get anywhere, but I’m guessing the two reasons are that we want to deal with infinity and limits in some way. So let’s see what goes wrong. One simple example where the need to deal with infinities should arise is Ω = R, i.e. a random real number. [Note: It seems very plausible that the probability of picking a rational number is 0. First, there’s only countably many rationals... next, if we pick the digits one by one, we have to pick infinitely many zeroes in succession, which has probability zero. ]
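The first argument in the note can be made concrete with the standard covering trick (my sketch; this is the usual ε/2^n argument, not something from the text): list the rationals in [0, 1] as q1, q2, . . . and cover qn with an interval of length ε/2^n. The total length covered is at most ε, for every ε > 0, so under any countably additive ‘uniform’ probability the rationals get probability 0:

```python
from fractions import Fraction

def total_cover_length(n, eps):
    """Total length of the first n covering intervals, where the k-th
    rational is covered by an interval of length eps / 2**k."""
    return sum(Fraction(eps) / 2**k for k in range(1, n + 1))

eps = Fraction(1, 100)
# No matter how many rationals we cover, the total length stays below eps:
for n in (10, 100, 1000):
    assert total_cover_length(n, eps) < eps
print(total_cover_length(50, eps))  # just under 1/100
```

Since ε was arbitrary, the probability of the set of all rationals is below every positive number, i.e. zero; and note it takes countable additivity to even make sense of this.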