An Introduction to Laws of Large Numbers
John, CVGMI Group

Contents
1. Introduction
2. Laws of Large Numbers: Weak Law, Strong Law, Strongest Law
3. Examples: Information Theory, Statistical Learning
4. Appendix: Random Variables, Working with R.V.'s, Independence, Limits of Random Variables, Modes of Convergence, Chebyshev
Intuition

We're working with random variables. What could we observe? A sequence of random variables {X_n}_{n=1}^∞. More specifically, consider a Bernoulli sequence, where we look at the probability that the average of the sum of N random events lands in an interval:

    P(x₁ < (1/N) Σ_{i=1}^N X_i < x₂).

Coin flipping, anybody?
As N increases, we see that the observed proportions of 0s and 1s become approximately equal. This is intuitive: as the number of samples increases, the average observation should tend toward the theoretical mean.

Weak Law

Let's try to work out this most basic fundamental theorem of random variables arising from repeatedly observing a random event. How do we build such a theorem? In a black-box scenario we want to be able to use independence.
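This intuition is easy to check numerically. A minimal sketch (the helper name `sample_mean` and the choice p = 1/2 are ours, not from the slides):

```python
import random

def sample_mean(n, p=0.5, seed=0):
    """Average of n Bernoulli(p) draws: should approach p as n grows."""
    rng = random.Random(seed)
    return sum(rng.random() < p for _ in range(n)) / n

# the running average settles near the theoretical mean 1/2
for n in (10, 100, 10000):
    print(n, sample_mean(n))
```

With a fixed seed the output is reproducible; the deviation from 1/2 shrinks visibly as n grows.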
We want to be able to use variance and expectation, so let X_n ∈ L² for all n, and associate with each X_n its mean, denoted µ_n, and its variance, σ_n². We will create new random variables from the sequence and work with these, aiming for results stated in terms of them.
This seems like a start, so let's try to prove a theorem.

Theorem. Let {X_n} be a sequence of independent L² random variables with means µ_n and variances σ_n². Then Σ_{i=1}^n (X_i − µ_i) → 0.

So a sequence of functions on Ω, the sample space, goes toward the zero function. Can we prove this? We haven't used the σ_n: we'll clearly need them, since otherwise the X_i are wildly unpredictable.
IDEA: Constrain lim_n Σ_{i=1}^n σ_i² = 0.

Counterexample

[figure]

Weak Law: Pitfalls

We haven't specified a mode of convergence. This is a technical point, but if you recall uniform, pointwise, a.e., and normed convergence from analysis, the goal of the proof can change radically. IDEA: We want our rule to hold with high probability, that is, it should hold for essentially all values of ω ∈ Ω. Convergence in the normed space is pretty ambitious at this point.
We haven't specified a rate of convergence. The rate of convergence of the σ_n should imply something about the rate of convergence of the X_n − µ_n; the X_n − µ_n should converge more slowly than the σ_n.
IDEA: Weaken the constraint to lim_n n⁻² Σ_{i=1}^n σ_i² = 0. The X_n − µ_n should converge on average.

Law of Large Numbers 2.0

Theorem. Let {X_n} be a sequence of independent L² random variables with means µ_n and variances σ_n². Then (1/n) Σ_{i=1}^n (X_i − µ_i) → 0 in measure if lim_n (1/n²) Σ_{i=1}^n σ_i² = 0.

Proof sketch: let Y_n(ω) = (1/n) Σ_{i=1}^n (X_i(ω) − µ_i). Then E(Y_n) = 0 and σ²(Y_n) = (1/n²) Σ_{i=1}^n σ_i² by independence.
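Written out, the two facts used in the proof sketch follow from linearity of expectation and, for the variance, from independence (the cross terms vanish):

```latex
\begin{aligned}
\mathbb{E}(Y_n) &= \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}(X_i-\mu_i) = 0,\\
\sigma^2(Y_n) &= \mathbb{E}\Big(\frac{1}{n}\sum_{i=1}^{n}(X_i-\mu_i)\Big)^{2}
= \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}(X_i-\mu_i)^2
= \frac{1}{n^2}\sum_{i=1}^{n}\sigma_i^{2}.
\end{aligned}
```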
Now we can use the limit of σ²(Y_n) from the hypothesis; it remains to prove that

    P({ω : |Y_n(ω)| > ε}) ≤ σ²(Y_n)/ε² → 0 as n → ∞.

Aside: Lᵖ bounds

What does P({ω : |Y_n(ω)| > ε}) ≤ σ²(Y_n)/ε² actually say? The measure of the set where |Y_n| > ε is bounded by the integral of Y_n² divided by ε². Thinking about this makes it obvious. Better bounds can be obtained: Chernoff bounds.

Law of Large Numbers 2.1

Theorem. Let {X_n} be a sequence of independent L² random variables with means µ_n and variances σ_n². Then (1/n) Σ_{i=1}^n (X_i − µ_i) → 0 in measure if lim_n n⁻² Σ_{i=1}^n σ_i² = 0.

⇓ (weakened; use P(max_k |Σ_{i=1}^k (X_i − µ_i)| ≥ ε) ≤ ε⁻² Σ_{i=1}^n σ_i²)

Theorem. Let {X_n} be a sequence of independent L² random variables with means µ_n and variances σ_n². Then (1/n) Σ_{i=1}^n (X_i − µ_i) → 0 a.e. if Σ_{i=1}^∞ σ_i²/i² < ∞.

Khinchine's Law

Theorem. Let {X_n} be a sequence of independent, identically distributed L¹ random variables with mean µ. Then (1/n) Σ_{i=1}^n (X_i − µ) → 0 a.e. as n → ∞.
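The Chebyshev step can be sanity-checked by simulation; a sketch using centered Bernoulli(1/2) variables, for which σ_i² = 1/4 (the helper name and parameter choices are ours):

```python
import random

def chebyshev_check(n=2000, trials=500, eps=0.1, seed=1):
    """Compare the empirical P(|Y_n| > eps) with the Chebyshev bound
    sigma^2(Y_n) / eps^2, for Y_n the mean of n centered fair coin flips."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y = sum((rng.random() < 0.5) - 0.5 for _ in range(n)) / n
        hits += abs(y) > eps
    bound = (0.25 / n) / eps ** 2  # sigma^2(Y_n) = (1/n^2) * n * (1/4)
    return hits / trials, bound
```

The empirical tail probability sits well below the bound: Chebyshev is loose, which is why the Chernoff bounds mentioned above are worth the extra work.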
Large Sample Size in Coding

Now we need to use the laws. Start with an example in source coding.

- C : X(Ω) → Σ* encodes events.
- It is interesting to know how complex C must be to encode (Y_i)_{i=1}^N.
- Entropy: H(X) = E[−lg p(X)], a representation of how uncertain a random variable is.
- Problem: p(·) is unknown to the encoder, and the distribution needs to be learned. The WLLN can be used to answer: how uncertain is the distribution?

Asymptotic Equipartition Property

Theorem (AEP). If X₁, ..., X_n are i.i.d. ∼ p(x), then −(1/n) lg p(X₁, ..., X_n) → H(X) in probability.

The X_i are i.i.d. ∼ p(x), so the lg p(X_i) are i.i.d. The weak law of large numbers says that

    P(|−(1/n) lg Π_i p(X_i) − E[−lg p(X)]| > ε) → 0,
    i.e. P(|−(1/n) Σ_i lg p(X_i) − E[−lg p(X)]| > ε) → 0,

so the sample entropy approaches the true entropy in probability as the sample size increases.

Statistical Learning

We want to minimize the risk R(f) = E(l(x, y, f(x))), e.g. l = (x, y, f) ↦ ½|f(x) − y|. We are stuck minimizing over all f, under a distribution we don't know... hopeless...
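Returning to the AEP above: the convergence of the sample entropy is easy to watch numerically for a Bernoulli source (the value p = 0.3 and the helper name are ours, not from the slides):

```python
import math
import random

def sample_entropy(n, p=0.3, seed=2):
    """-(1/n) lg p(x_1, ..., x_n) for n i.i.d. Bernoulli(p) draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.random() < p
        total -= math.log2(p if x else 1 - p)
    return total / n

# true entropy H(X) = -(p lg p + (1-p) lg(1-p))
H = -(0.3 * math.log2(0.3) + 0.7 * math.log2(0.7))
```

For large n the sample entropy hugs H, which is the AEP's convergence in probability made visible.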
IDEA: Take the Law of Large Numbers and apply it in this framework, so that hopefully the empirical risk R_emp(f) = (1/m) Σ_i l(x_i, y_i, f(x_i)) → R(f). Use Lᵖ bounds to prove convergence results on the testing error.

Mini Appendix: Measure Theory

I won't go into too much detail in this regard. If we've made it this far, then great.

- (X, A) is a measurable space if A ⊂ 2^X is closed under countable unions and complements. These correspond to 'measurable events'.
- (X, A) with µ is a measure space if µ : A → [0, ∞) is countably additive over disjoint sets: µ(∪_i A_i) = Σ_i µ(A_i) if A_i ∩ A_j = ∅ for i ≠ j.
- More great properties fall out of this quickly: the measure of the empty set is zero, measures are monotonic under the containment (partial) ordering, and even dominated and monotone convergence (of sets) follow.
- Chebyshev's Inequality: let f ∈ Lᵖ(ℝ) and denote E_α = {x : |f| > α}. Then ||f||_p^p ≥ ∫_{E_α} |f|ᵖ ≥ αᵖ µ(E_α) by monotonicity and positivity of measures.

L² R.V.'s: Definition

We say that a function f : X → ℝ belongs to the function space Lᵖ(X) if ||f||_p = (∫ |f(x)|ᵖ dx)^{1/p} < ∞, and say f has finite Lᵖ norm.

Definition. The fundamental object considered in statistics is the random variable. Given a measure space (Ω, B, µ), X : Ω → ℝ is an L² random variable if

- {X < r} ∈ B for all r ∈ ℝ,
- µ(Ω) = 1, and
- (∫ X(ω)² dµ(ω))^{1/2} < ∞.

We may also say that such a random variable has finite 2nd moment.

L² R.V.'s: Questions

Why B?
- Physical paradoxes, e.g. Banach-Tarski.
- We want to talk about physical events ∼ measurable sets.
- B is as descriptive as we'd like most of the time.

What does µ do here?

- µ weights the events ∼ measurable sets.
- It puts values to {ω ∈ Ω : X(ω) < r}; Ω is the sample space.
- X maps random events (patterns occurring in data) to the reals, which have a good standard notion of measure. So X induces the distribution P(B) = µ(X⁻¹(B)), which gives P({X(ω) ∈ B}) = µ(X⁻¹(B)) as expected. This is usually written without the sample variable.

For more sophisticated questions and answers see MAA 6616 or STA 7466, but discussion is encouraged.

Statistics: Functions of Measured Events

Now we can start building.

Theorem. If f : ℝ → ℝ is measurable, then

    ∫ f(x) dP = ∫ f(X(ω)) dµ,

i.e. the random variable pushes forward a measure onto ℝ, and the integrals of measurable functions of random variables are therefore ultimately determined by real integration. This may be proved using Dirac measures and measure properties. Following these lines we may develop the basics of the theory of probability from measure theory. (OK, homework.)

Statistics: Moments

Definition. Define the first moment as the linear functional

    E(X) = ∫ t dP.

Then the pth central moment is a functional given by

    m_p(X) = ∫ (t − E(X))ᵖ dP.
Note that when X ∈ L² these are well defined by the theorem above. Pullbacks are omitted. Since we're talking about L², let's define the second central moment as σ² for convenience. If we need higher moments, sadly, mathematics says no.

Statistics: Independence

Now that we know what random variables are, let's try to define how a collection of them interacts in the most basic way possible.

Definition. Let X = {X_α} be a collection of random variables. Then X is independent if

    P(∩_α X_α⁻¹(B_α)) = P((X₁, ..., X_n)⁻¹(B₁ × ... × B_n)) = (µ(X₁⁻¹) × ... × µ(X_n⁻¹))(B₁ × ... × B_n).

(Head explodes.) Really this just means that the X_α are independent if their joint distributions are given by the product measure over the induced measures.

Limits: Repeating Independent Trials

The significance of the belabored definition of independence is that a joint distribution (just a distribution over several random variables) of independent variables contains zero information about how the variables are related. If we have an infinite collection of random variables, independent of each other, then, in spite of independence, we would like to be able to infer information about the lower order moments from the higher order moments and vice versa.
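The product-measure characterization can be seen empirically: for two independent fair coins, the joint frequency of an event factors into the product of the marginal frequencies (the helper name and sample sizes here are ours):

```python
import random

def joint_vs_product(trials=20000, seed=3):
    """Empirical joint P(A=1, B=1) vs the product of marginals for
    two independent fair coins; the two numbers should nearly agree."""
    rng = random.Random(seed)
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for _ in range(trials):
        counts[(rng.randint(0, 1), rng.randint(0, 1))] += 1
    joint = counts[(1, 1)] / trials
    p_a = (counts[(1, 0)] + counts[(1, 1)]) / trials
    p_b = (counts[(0, 1)] + counts[(1, 1)]) / trials
    return joint, p_a * p_b
```

Both numbers land near 1/4, and their difference (an empirical covariance) shrinks as the number of trials grows.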
If we have ideal conditions on the random variables, then we should be able to deduce information about moments from a very large sequence. The type of convergence we get should be sensitive to the hypotheses.

Modes of Convergence

We clearly need to specify how the means are converging, right? Why? So how could Σ_{i=1}^N (X_i − µ_i)/N → 0?
- In measure: lim_N P({|Σ_{i=1}^N (X_i − µ_i)/N| ≥ ε}) = 0.
- In norm: lim_N (∫ |Σ_{i=1}^N (X_i − µ_i)/N|ᵖ dµ(ω))^{1/p} = 0.
- A.E.: P({ω : lim_N |Σ_{i=1}^N (X_i − µ_i)/N| > 0}) = 0.

[diagram relating the modes: a.e. convergence and Lᵖ convergence each imply convergence in measure µ, and a sequence f_n converging in measure has a subsequence f_{n_k} converging a.e.]

Chebyshev
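Convergence in measure in the first bullet can be estimated by Monte Carlo: fix ε and watch the tail probability shrink with N (a sketch; the fair-coin choice and helper name are ours):

```python
import random

def tail_prob(n, eps=0.05, trials=400, seed=4):
    """Monte Carlo estimate of P(|sum_{i<=n}(X_i - mu)/n| >= eps)
    for fair-coin X_i with mu = 1/2."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        m = sum(rng.randint(0, 1) for _ in range(n)) / n
        bad += abs(m - 0.5) >= eps
    return bad / trials
```

For small n the tail probability is substantial; by n in the thousands it is essentially zero, which is exactly convergence in measure of the averages.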