Advanced probability: notes
Aleksandar Makelov
1. History
1.1. Introduction. Kolmogorov set the mathematical and philosophical foundations of modern
probability in his 1933 book ‘Grundbegriffe der Wahrscheinlichkeitsrechnung’. He synthesized
previous ideas, among them Émile Borel’s addition of countable additivity to classical probability,
and related the mathematics to the world of experience.
To appreciate these two aspects of his work, we need the context of measure theory and
axiomatic probability in 1900-1930.
Kolmogorov was a frequentist.
In chapter 2, we look at the foundation Kolmogorov replaced - equally likely cases, how it
relates to the real world, and some paradoxes.
In chapter 3, we look at the parallel development of measure theory and how it became
entangled with probability.
In chapter 4, we look at alternative formalizations of probability that appeared in 1900-1930.
In chapter 5, we turn to the ‘Grundbegriffe’.
1.2. The classical foundation. This was the notion that the probability of an event A is the ratio
of the number of equally likely cases in which A occurs to the number of all equally likely cases.
It had been around for 200 years by the start of the 20th century, and is still used today to teach
probability in an elementary way.
1.2.1. The classical calculus. From this definition one easily gets the rules of finite additivity
and conditional probability: for disjoint A and B,
Pr [A ∪ B] = Pr [A] + Pr [B], and in general Pr [A ∩ B] = Pr [A] Pr [B | A].
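Here’s a quick sketch of why these follow, writing N for the total number of equally likely cases and |A| for the number of cases in which A occurs:
\[
\Pr[A] = \frac{|A|}{N}, \qquad \Pr[A \cup B] = \frac{|A| + |B|}{N} = \Pr[A] + \Pr[B] \quad (A \cap B = \emptyset),
\]
\[
\Pr[B \mid A] = \frac{|A \cap B|}{|A|} = \frac{|A \cap B|/N}{|A|/N} = \frac{\Pr[A \cap B]}{\Pr[A]}, \qquad \text{i.e.} \qquad \Pr[A \cap B] = \Pr[A]\,\Pr[B \mid A].
\]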
These arguments were still the standard in textbooks in the 1900s. Only the British were doing
it differently, emphasizing the combinatorics without dwelling on formalities. The Americans
adopted the classical rules, but wanted to avoid the arguments based on equally likely cases, and
switched to ‘limiting frequencies’1, but this approach didn’t last.
Geometric probability
This is the translation of the ideas of classical probability to a geometric setting, but probabilities are still ratios (of volumes, areas, etc). With the introduction of measure theory, this did
not change significantly; we had more sets to work with, but the rules were the same.
Relative probability
This term had no clear meaning at first, and was later replaced by conditional probability.
Hausdorff defined relative probability in 1901 as something that resembles today’s conditional
probability; other people had other definitions for what ‘relative’ meant, and it was a mess.
Curiously, George Boole wrote about ‘conditional events’ and their probabilities in 1854.
1 I’m guessing this means that probability is the frequency of the event as the number of samples goes to infinity?
1.2.2. Cournot’s principle. OK, this section is sort of important.
‘An event with very small probability is morally impossible: it will not happen;
equivalently, an event with very high probability is morally certain: it will happen.’
The principle was first formulated by Bernoulli in 1713. Maybe this sounds stupid/obvious at
first, but here’s the justification. One can prove (as Bernoulli did) that in a long series of trials,
the frequency of an event will, with high probability, be close to its probability, so we can use
that frequency as an estimate for the probability. This later became known as the law of large
numbers.
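In modern notation (a restatement, not Bernoulli’s original argument): if S_n counts the occurrences of the event in n independent trials and p is its probability, Chebyshev’s inequality already gives
\[
\Pr\left[\,\left|\frac{S_n}{n} - p\right| \ge \varepsilon\,\right] \le \frac{p(1-p)}{n\varepsilon^2} \longrightarrow 0 \quad (n \to \infty),
\]
so for large n the observed frequency S_n/n is, with high probability, within ε of p.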
Of course, people argued about this kind of moral certainty a lot. For example, Jean d’Alembert
pondered the possibility of a hundred successive coin flips coming up heads, and deemed it ‘metaphysically possible, but physically impossible’.
Perhaps a better example, coming from geometric probability (which has a natural notion of
‘vanishingly small’ probabilities), is due to Cournot (1843): a heavy cone balancing on its tip is
mathematically possible, but physically impossible. And similarly, he argued, it’s not physically
possible for the law of large numbers to be violated.
The idea of ‘morally impossible’ events, or events with ‘vanishingly small’ probability, became
important in physics in the second half of the 19th century, with the discussion of thermodynamic entropy (where it’s morally impossible for a system to reverse the increase of entropy), and
Poincaré’s recurrence theorem about the three-body problem, which works for all initial configurations except a vanishingly small set.
Saying small probability events won’t happen is one thing, but saying that probability
theory gains connection to reality only by ruling out the happening of such events is
a whole different thing. That’s what Cournot was smart enough to realize:
. . . The physically impossible event is therefore the one that has infinitely small
probability, and only this remark gives substance – objective and phenomenal
value – to the theory of mathematical probability (1843, p. 78).
In other words, it is only by ruling out the happening of such events that probability theory
gains its connection to reality.
[Note: This took me a while to process, but I think the moral is this: if we don’t rule out
these ‘physically impossible’ events, we can’t reliably measure probability by repeated samples
(as in e.g. the law of large numbers). So when we adopt this principle, probability becomes a
measurable quantity, much like weight and volume. ]
The viewpoint of the French probabilists
In the beginning of the 20th century, probability was becoming pure math, and people often
asked themselves how it relates to the real world. These French guys, namely Borel, Hadamard,
Fréchet, Lévy, etc, made the connection by basically invoking Cournot’s principle in their own
words.
Lévy was an especially strong proponent of the principle, calling it the only bridge between
probability and reality, and describing how it, combined with Bernoulli’s law of large numbers,
makes probability a measurable physical quantity [Note: like I said above. ].
Kolmogorov knew about Lévy’s views when he was starting his book. Markov did too; he said
...Consequently, one of the most important tasks of probability theory is to identify those events whose probabilities come close to one or zero.
Strong and weak forms of Cournot’s principle
An important distinction is between
• the strong form, asserting that an event with very small or zero probability fixed in
advance of a given trial will not happen during that trial
• the weak form, asserting that an event with very small probability will happen very rarely
in repeated trials.
The strong form was used by Borel, Lévy and Kolmogorov to argue, as above, that it justifies the
empirical validity of probability. The weak form leads to a different conclusion: that probability
is usually well approximated by the relative frequency in repeated trials.
British indifference and German skepticism
The French took a very philosophical view of probability, likely influenced by Poincaré. By
contrast, the British were much more pragmatic and worked only with frequencies; they had little
interest in the mathematical theory of probability. Germany did see activity on the mathematical
side of probability, but few German mathematicians considered themselves philosophers, because
division of labor had already started taking place in Germany at the time.
Then there was this German guy Johannes von Kries, who thought Cournot’s principle was
false: to him, the idea that an event with very small probability is impossible was simply wrong.
He gave some conditions under which objective probabilities exist, but nobody knew how to use
them, so that was that. However influential von Kries was at the time, this did not prevent Germans
from embracing the assumption that zero probability equals impossibility. So for Germans
(as opposed to the French), there was a substantial difference between probability-zero events
and very-small-probability events.
1.2.3. Bertrand’s paradoxes. Bertrand came up with several paradoxes that play with the idea of
equally likely cases; we’ll look at two.
The paradox of the three jewelry boxes
Ok that’s silly... it’s about how naively treating all ‘possibilities’ as equally likely can go wrong
in the discrete case. The problem is that we never say explicitly which atomic events are supposed
to be the equally likely ones.
The paradox of the great circle
Suppose we want to find the probability that two random points on a sphere are within 10’
of one another. By symmetry, we can fix the first point. Then we can proceed in two ways:
• Just calculate the area of the spherical cap of radius 10′ around it and divide by the full
area of the sphere.
• By symmetry we can fix the great circle on which the second point lies; then the probability
of the second point landing within 10′ of the first is the ratio of the length of
the 10′ arc in each direction to the entire length of the great circle. This comes out to a
different number (see the computation below).
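Here are the two numbers, with θ = 10′ expressed in radians (a quick computation the notes leave implicit): the first method gives the ratio of the cap area to the area of the sphere, the second the ratio of arc lengths,
\[
\frac{2\pi(1-\cos\theta)}{4\pi} = \frac{1-\cos\theta}{2} \approx \frac{\theta^2}{4}
\qquad \text{versus} \qquad
\frac{2\theta}{2\pi} = \frac{\theta}{\pi},
\]
and for small θ these are wildly different.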
Bertrand thought the paradox was due to the problem being ill-posed, that the two solutions
were using a different notion of ‘random’. However, Borel proved him wrong - the first solution
is alright but in the second we’re splitting the cases into ones that should have probability zero
and conditioning on them (the second point landing on a specific great circle).
Appraisal
Poincaré, Borel and the bunch fixed the paradoxes because they understood the principles
classical probability was supposed to represent. Two principles emerged from their solutions:
• Equally likely cases must be detailed enough to represent new information in complete
detail [Note: I feel this moves towards state spaces a bit ]
• [Note: I don’t get this ] ‘We may need to consider the real observed event of non-zero
probability that is represented in an idealized way by an event of zero probability (e.g., a
randomly chosen point falls on a particular meridian). We should pass to the limit only
after absorbing the new information’
1.3. Measure-theoretic probability before the Grundbegriffe.
1.3.1. The invention of measure theory by Borel and Lebesgue. Borel is considered the father
of measure theory. His main insight seems to be closure under countable unions, which he was
motivated to adopt by complex analysis. He studied series that diverged on a dense set which
could nevertheless be covered by a countable collection of intervals of arbitrarily small total length. This
led Borel to develop a new theory of measure on [0, 1].
Lebesgue took that and used it to define his integral.
1.3.2. Abstract measure theory from Radon to Saks. Radon unified Lebesgue and Stieltjes integration by replacing Lebesgue measure with any countably additive set function defined on the
Borel sets. Then Fréchet went beyond Euclidean space, realizing it’s enough to look at countably
additive functions on any σ-algebra. This is the abstract theory of integration Kolmogorov used:
. . . After Lebesgue’s investigations, the analogy between the measure of a set and
the probability of an event, as well as between the integral of a function and the
mathematical expectation of a random variable, was clear. This analogy could be
extended further; for example, many properties of independent random variables
are completely analogous to corresponding properties of orthogonal functions. But
in order to base probability theory on this analogy, one still needed to liberate
the theory of measure and integration from the geometric elements still in the
foreground with Lebesgue. This liberation was accomplished by Fréchet.
1.3.3. Fréchet’s integral. Quoting from the book,
When Fréchet generalized Radon’s integral in 1915, he was explicit about what he
had in mind: he wanted to integrate over function space. In some sense, therefore,
he was already thinking about probability. An integral is a mean value. In a
Euclidean space this might be a mean value with respect to a distribution of
mass or electrical charge, but we cannot distribute mass or charge over a space
of functions. The only thing we can imagine distributing over such a space is
probability or frequency.
Why did Fréchet fail at the time to elaborate his ideas on abstract integration,
connecting them explicitly with probability? One reason was the war. Mobilized
on August 4, 1914, Fréchet was at or near the front for about two and a half
years, and he was still in uniform in 1919. Thirty years later, he still had the
notes on abstract integration that he had prepared, in English, for the course he
had planned to teach at the University of Illinois in 1914-1915.
But the real problem was that the time had not yet come for probability in function space: there
were no nontrivial examples of distributing probability over it. The laws of classical physics for
example are deterministic, and probability can only be placed on the initial conditions, which are
usually living in some finite-dimensional space. The first such examples came from Daniell and
Wiener.
1.3.4. Daniell’s integral and Wiener’s differential space. Daniell, working in Texas, took a slightly
different approach from Fréchet, and showed how different kinds of integrals are extensions of
one another.
He used his theory of integration to study the motion of a particle whose infinitesimal motions
are independently normally distributed; he called this ‘dynamic probability’, unaware of the earlier
(less rigorous) work on Brownian motion.
Wiener was in a better position to pick things up from there, because he was aware of the
connection. Quoting the book,
Because of his work on cybernetics after the second world war, Wiener is now
the best known of the twentieth-century mathematicians in our story - far better
known to the general intellectual public than Kolmogorov. But he was not well
known in the early 1920s, and though the most literary of mathematicians (he
published fiction and social commentary), his mathematics was never easy to read,
and the early articles attracted hardly any immediate readers or followers. Only
after he became known for his work on Tauberian theorems in the later 1920s, and
only after he returned to Brownian motion in collaboration with R. E. A. C. Paley
and Antoni Zygmund in the early 1930s (Paley, Wiener, and Zygmund 1933) do
we see the work recognized as central and seminal for the emerging theory of
continuous stochastic processes.
Wiener’s basic idea was to consider Brownian motion on an interval [0, 1]. A realized path is
simply a function [0, 1] → R. So random variables are functionals, i.e. functions of functions:
([0, 1] → R) → R. Then to do probability we need to define mean values for these functionals. Wiener started with functionals that depend only on the values at finitely many points of
[0, 1], whose mean values can be calculated using ordinary Gaussian probabilities. Extending this by Daniell’s
method, he obtained probabilities for certain sets of paths: for example, he showed the continuous
functions have probability 1, but the differentiable functions have probability zero.
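Just to make ‘functionals depending on finitely many points’ concrete, here is a rough Monte Carlo sketch in Python (mine, not Wiener’s; the setup is purely illustrative). It estimates the mean of the functional F(w) = w(1/2)^2, whose exact value under Wiener measure is E[W(1/2)^2] = 1/2.

import math
import random

def brownian_path(n_steps=200):
    # Approximate a Brownian path on [0, 1] by a random walk whose
    # increments are independent Gaussians of variance dt = 1/n_steps.
    dt = 1.0 / n_steps
    w = [0.0]
    for _ in range(n_steps):
        w.append(w[-1] + random.gauss(0.0, math.sqrt(dt)))
    return w

def functional(w):
    # A functional that depends on the path only through its value at t = 1/2.
    return w[len(w) // 2] ** 2

n_paths = 5000
estimate = sum(functional(brownian_path()) for _ in range(n_paths)) / n_paths
print(estimate)  # should come out close to the exact Gaussian answer 1/2

Wiener’s actual achievement, of course, was extending such finite-dimensional means, via Daniell’s method, to functionals like sup |w| or ‘w is continuous’, which no finite set of time points determines.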
People find it surprising that Wiener did all that long before Kolmogorov’s axioms; but he had
Daniell’s integral and functional analysis; his approach was markedly different, and did not require
an abstract notion of probability, or divorcing measure from geometry.
1.3.5. Borel’s denumerable probability. Wiener’s impressive work didn’t influence Kolmogorov
much; it was Borel who did. Already in his work on complex analysis, there was some hint of
probabilistic reasoning. For example, Borel showed in 1897 that a Taylor series would usually
diverge on the boundary of its circle of convergence. The argument was that successive groups
of coefficients of a Taylor series are independent, each group determines an arc on the circle,
and if a point belongs to infinitely many such arcs, the series diverges at that point. Then by
independence it follows that, usually, the series will diverge; today, we call this the Borel-Cantelli
lemma.
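For reference, the modern statement of the lemma (the second half, which needs independence, is the one doing the work in the divergence argument):
\[
\sum_{n} \Pr[A_n] < \infty \;\Longrightarrow\; \Pr\Big[\limsup_{n} A_n\Big] = 0,
\qquad
\sum_{n} \Pr[A_n] = \infty \text{ and the } A_n \text{ independent} \;\Longrightarrow\; \Pr\Big[\limsup_{n} A_n\Big] = 1,
\]
where \(\limsup_n A_n = \bigcap_m \bigcup_{n \ge m} A_n\) is the event that infinitely many of the A_n occur.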
Later on Borel published his strong law of large numbers, stating that the proportion of 1s among
the first n binary digits of a random real number in [0, 1] tends to 1/2 with probability 1. He didn’t
use measure theory to prove that for philosophical reasons though; he gave an ‘imperfect’ proof
using denumerable versions of the rules of classical probability. Others, however, gave rigorous
measure-theoretic proofs.
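In modern terms: writing b_1(x), b_2(x), . . . for the binary digits of x and λ for Lebesgue measure,
\[
\lambda\left\{\, x \in [0,1] : \frac{1}{n}\sum_{i=1}^{n} b_i(x) \to \frac{1}{2} \text{ as } n \to \infty \right\} = 1.
\]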
Borel was unwilling to assume countable additivity in probability:
He saw no logical absurdity in a countably infinite number of zero probabilities
adding to a nonzero probability, and so instead of general appeals to countable
additivity, he preferred arguments that derive probabilities as limits as the number
of trials increases [...] But he saw even more fundamental problems in the idea
that Lebesgue measure can model a random choice (von Plato 1994, pp. 36-56;
Knobloch 2001). How can we choose a real number at random when most real
numbers are not even definable in any constructive sense?
It was finally Hugo Steinhaus who gave axioms for Borel’s denumerable probability for infinite
sequences of coin flips, along the lines of Sierpiński’s axiomatization of Lebesgue measure, and
proved the two axiomatizations are equivalent, i.e. Lebesgue measure on [0, 1] and probability on
countably many coin flips are isomorphic.
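The isomorphism is essentially the binary-expansion map (up to the countable, hence negligible, set of points with two expansions):
\[
x \;\longmapsto\; (b_1(x), b_2(x), \dots) \in \{0,1\}^{\mathbb{N}}, \qquad x = \sum_{i \ge 1} b_i(x)\, 2^{-i},
\]
under which Lebesgue measure on [0, 1] corresponds to the law of an infinite sequence of independent fair coin flips.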
1.3.6. Kolmogorov enters the stage. He wrote a paper co-authored with Khinchin (1925),
extending the analogy of Steinhaus beyond sequences of coin flips. But Khinchin transported the
argument to the world of Lebesgue measure on [0, 1] because that was well-understood at the
time while Borel’s probability remained murky. In his own articles later on, Kolmogorov worked
with finite additivity and passed to ad-hoc limits.
1.4. Hilbert’s sixth problem. Roughly, it said ‘Make probability rigorous!’. Integration gave a
partial answer, but there was still disagreement about what the actual axioms should be, especially
about conditional (‘compound’) probability.
1.4.1. Bernstein’s qualitative axioms. Bernstein gave some qualitative axioms for probability, to
the effect that the probability of a union of m out of n equally likely and mutually exclusive events
is a function of m/n, and that the conditional probability of A with respect to B is the ratio Pr [A] / Pr [B].
1.4.2. Von Mises’s Kollektivs. This was an alternative formalization (1919) of probability that
relied on ensembles (‘Kollektiv’-s), infinite sequences of outcomes satisfying two properties:
• The relative frequency of each outcome converges
• and converges to the same limit along any subsequence selected by a rule that decides
whether to include the next term based only on the terms seen so far.
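A rough formalization of the two requirements (my paraphrase, not von Mises’s notation): for a sequence x_1, x_2, . . . of outcomes in a finite set E and each a ∈ E,
\[
\lim_{n \to \infty} \frac{1}{n}\, \#\{\, i \le n : x_i = a \,\} = p_a,
\]
and the same limits p_a must hold along every subsequence chosen by an admissible rule, i.e. one whose decision to include x_{n+1} may depend only on x_1, . . . , x_n.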
So this is reminiscent of formalizing what a fair game is. This theory could derive the classical
laws of probability and stuff like that, and it was somewhat popular, but in 1936 Jean Ville
showed Kollektivs are vulnerable to a more clever betting strategy which varies the amount of
the bet and the outcome on which to bet; Ville called it a ‘martingale’. This in fact inspired Doob
to consider measure-theoretic martingales!
1.4.3. Slutsky’s calculus of valences. This was another attempt to define probability fully mathematically. Slutsky abandoned the idea of ‘equally likely cases’ and even of ‘probability’, and
instead assigned numbers to cases, insisting that when a case is subdivided, the corresponding
numbers add up to the value of the case. He called this new probability ‘valence’.
Slutsky did not think probability could be reduced to limiting frequency, because
sequences of independent trials have properties that go beyond their possessing
limiting frequencies. He gave two examples. First, initial segments of the sequences
have properties that are not imposed by the eventual convergence of the relative
frequency. Second, the sequences must be irregular in a way that resists the kind
of selection discussed by von Mises (he cites an author named Umov, not von
Mises).
So for Slutsky, frequencies are more like an alternative interpretation of a more general and
abstract calculus.
1.4.4. Kolmogorov’s general theory of measure. In a note dated 1927 and published 1929, he
gave a philosophical sketch of the program for what would later become his theory of probability. It has the
fundamental ingredients in place: countable additivity, and the class of subsets to which it assigns
values should be a σ-field, because
only then can we uniquely define probabilities for countable unions and intersections, and this seems necessary to justify arguments involving events such as the
convergence of random variables.
1.4.5. The axioms of Steinhaus and Ulam. idk
1.4.6. Cantelli’s abstract theory. He was trying to do something like what Kolmogorov did - an
abstract theory founded on empirically tested frequentist assumptions that can then be applied
to concrete cases to make predictions.
1.5. The Grundbegriffe. ‘The Grundbegriffe was an exposition, not another research contribution.’
Kolmogorov showed how nothing more than the classical rules together with countable additivity was needed to get probability going, as evidenced by the plethora of foundational results of
the preceding 20 years (many of which were due to Kolmogorov himself). The other major contribution was the rhetorical
move to call this entire abstract theory ‘probability’.
As it turned out, more was needed, especially for conditional probability, but people were
excited enough to fill the gaps rather than abandon the theory.
1.5.1. Overview.
1.5.2. The mathematical framework. Here’s Kolmogorov on countable additivity:
. . . Since the new axiom is essential only for infinite fields of probability, it is
hardly possible to explain its empirical meaning. . . . In describing any actual
observable random process, we can obtain only finite fields of probability. Infinite
fields of probability occur only as idealized models of real random processes. This
understood, we limit ourselves arbitrarily to models that satisfy Axiom VI. So
far this limitation has been found expedient in the most diverse investigations.
Then there’s the deal with conditioning... it seems weird to condition on probability-zero
events, because they don’t happen in the real world, and because the naive formula
Pr [A | B] = Pr [A ∩ B] / Pr [B] breaks down when Pr [B] = 0.
1.5.3. The empirical origin of the axioms. The section ‘In Kolmogorov’s own words’ is valuable:
it is a translation of the two pages of empirical motivation of probability, and a succinct account of
Kolmogorov’s frequentist philosophy about how probability relates to the real world. He gives the
motivation for a space of outcomes as the set of phenomena that can happen in a given repeatable
experiment, and for events being subsets of that space.
He then defines the probability of an event A as the limiting frequency with which A happens,
and posits a strong version of Cournot’s principle.
From that, he deduces the axioms empirically.
1.6. Reception.
1.6.1. Acceptance of the axioms. Not as instant as some sources make it sound... many people
were turned away by its abstractness.
1.6.2. The evolution of the philosophy of probability. Turns out, Kolmogorov’s two principles – the
empirical limiting frequency and the strong version of Cournot’s principle – faded from memory.
This may have been due to lack of interest in philosophy altogether.
1.6.3. The new philosophy of probability. Cournot’s principle faded; maybe it just didn’t appear
to make sense, but there were sociological factors: by the time WWII ended, philosophers and
mathematicians were pretty much disjoint subsets, and the abstractness of Kolmogorov’s work
made it hard for philosophers to engage with.
1.7. Conclusion. Here it is verbatim:
The great achievement of the Grundbegriffe was to seize the notion of probability from the classical probabilists. In doing so, Kolmogorov made space for
a mathematical theory that has flourished for seven decades, driving ever more
sophisticated statistical methodology and feeding back into other areas of pure
mathematics. He also made space, within the Soviet Union, for a theory of mathematical statistics that could prosper without entanglement in the vagaries of
political or philosophical controversy.
Kolmogorov’s way of relating his axioms to the world has received much less
attention than the axioms themselves. But Cournot’s principle has re-emerged in
new guises, as the information-theoretic and game-theoretic aspects of probability
have come into their own. We will be able to see this evolution as growth – rather
than mere decay or foolishness – only if we can see the Grundbegriffe as a product
of its own time.
2. Motivation
First, let’s give several examples of situations where a notion of ‘probability’ arises, taken from real
life...
2.1. Examples.
• Flipping a coin: finitely, and infinitely many times.
• The St Petersburg gambling system.
• Scientific experiments.
• Thermodynamics.
• Quantum mechanics.
• A falling prism: will it stand on its tip?
• Spaces of functions and Brownian motion.
2.2. Motivation.
• Besides the obvious things, probability is useful in many settings as a new way of looking
at a problem that makes many things more intuitive (maybe Daniel Kahneman showed
that people are bad with probabilities, but let me tell you, we’re even worse with other
things). An example is combinatorics, where restating things in terms of probabilities is
often illuminating and makes things simpler.
In this section we talk about the modern formalization of probability and why it came to be
this way.
2.3. A first model. Let’s imagine all possible outcomes of a system/process/whatever R that
involves randomness as points living inside some space of outcomes Ω. We can think of each
point ω ∈ Ω as a list of the values of a set of parameters which describe R completely.
It makes sense that when we’re dealing with ‘discrete’ randomness, whatever that means,
Ω will be some discrete space; and if we’re dealing with ‘continuous’ randomness, that is, each
random thing can vary continuously, then Ω will have some notion of continuity, something like
a ‘product’ structure. That’s why it makes sense to call it a ‘space’, and to expect it to have some
topological and hopefully metric properties (because we actually want to compute things).
What we are able to observe about R may not be everything: that is, we may not be able to
distinguish between every two points ω and ω′. The events E encode the observable aspects of
R; i.e., they are the atomic things about which we can say whether they happened or not. As
such, it makes sense to impose the following conditions on E:
• If E ∈ E, then Eᶜ ∈ E.
• If E₁, E₂ ∈ E, then E₁ ∪ E₂ ∈ E.
• If E₁, E₂ ∈ E, then E₁ ∩ E₂ ∈ E.
That is,
• if we can tell something happened, we can tell if it didn’t happen too.
• if we can tell whether two things happened individually, we can tell whether either happened.
• if we can tell whether two things happened individually, we can tell whether both happened.
It’s easy to see our conditions are equivalent to
• If E ∈ E, then Eᶜ ∈ E.
• If E₁, . . . , Eₙ ∈ E, then E₁ ∪ · · · ∪ Eₙ ∈ E.
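The only slightly non-obvious direction is recovering closure under intersection, which is De Morgan plus closure under complements and unions:
\[
E_1 \cap E_2 = \big( E_1^{\,c} \cup E_2^{\,c} \big)^{c},
\]
and closure under pairwise unions gives closure under finite unions by induction.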
Then the probability should be some way to assign numbers to events: how likely, on a scale
from 0 to 1, is E to happen? What fraction of the time does E happen? (haha there’s some deep
bayesianist-frequentist thing going on here but let’s not care about it). As such, it should satisfy
some obvious things:
• Pr [Ω] = 1 – something’s gotta happen!
• If A and B can never happen simultaneously, Pr [A ∪ B] = Pr [A] + Pr [B].
Let’s see what this formalization can give us:
Example 2.1. Consider a single roll of a symmetric die. Here it’s natural to let Ω = {1, 2, 3, 4, 5, 6}
and E = 2^Ω (all subsets of Ω, generated by the atomic events {1}, . . . , {6}), and by symmetry it’s
reasonable to assume that Pr [{i}] = Pr [{j}] for all i, j. Since the outcome can’t be both i and j
for i ≠ j, finite additivity gives 1 = Pr [Ω] = Pr [{1}] + · · · + Pr [{6}] = 6 Pr [{1}], so Pr [{i}] = 1/6
for all i.
2.4. Adjusting the model: limits and infinities. All this is very reasonable, but it’s not how
we actually do probability. Why? Right now I don’t have a good answer because I haven’t tried
using the above formalism to get anywhere, but I’m guessing the two reasons are that we want to
deal with infinities and with limits in some way.
So let’s see what goes wrong. One simple example where the need to deal with infinities should
arise is Ω = R, i.e. a random real number.
[Note: It seems very plausible that the probability of picking a rational number is 0. First,
there are only countably many rationals... next, to land on any particular number, if we pick the
digits one by one we have to get infinitely many specific digits right in succession (e.g. all zeroes
from some point on), which has probability zero. ]
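One way to make the ‘countably many rationals’ argument precise, using countable additivity (this is exactly Borel’s covering idea from 1.3.1): enumerate the rationals as q_1, q_2, . . . and cover them by tiny intervals,
\[
\mathbb{Q} \subseteq \bigcup_{i \ge 1}\left( q_i - \frac{\varepsilon}{2^{i+1}},\; q_i + \frac{\varepsilon}{2^{i+1}} \right), \qquad \text{total length} \;\le\; \sum_{i \ge 1} \frac{\varepsilon}{2^{i}} = \varepsilon,
\]
so whatever probability we assign to ‘the chosen number is rational’ is at most ε for every ε > 0, hence zero.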