E0 370 Statistical Learning Theory                    Lecture 10 (Sep 15, 2011)

Excess Error, Approximation Error, and Estimation Error

Lecturer: Shivani Agarwal                             Scribe: Shivani Agarwal

1  Introduction

So far, we have considered the finite sample setting: given a finite sample S ∈ (X × Y)^m drawn according to D^m, we have seen how to obtain (high-confidence) bounds on the generalization error of a function learned from S, usually in terms of some empirical quantity that measures the performance of the function on S. Another question of interest concerns the behaviour of a learning algorithm in the infinite sample limit: as it receives more and more data, does the algorithm converge to an optimal prediction rule, i.e. does the generalization error of the learned function approach the optimal error?

Recall that for a distribution D on X × Y and a loss ℓ : Y × Y → [0, ∞), the optimal error w.r.t. D and ℓ is the lowest possible error achievable by any function h : X → Y:

    er_D^{ℓ,*} = inf_{h : X → Y} er_D^ℓ[h] .                                      (1)

For the 0-1 loss, the optimal error is known as the Bayes error. To formalize the above, for any function h : X → Y, define its excess error (w.r.t. D and ℓ) as

    er_D^ℓ[h] − er_D^{ℓ,*} .                                                      (2)

We would like to study the behaviour of the excess error of the function learned by an algorithm from a training sample S ∼ D^m as m → ∞.

As we have seen, since minimizing the error over all possible functions in Y^X can be difficult, most learning algorithms select a function from some fixed function class H ⊆ Y^X. In such cases, we can only hope to achieve generalization error close to the lowest possible within the class; we refer to this as the optimal error within H (w.r.t. D and ℓ):

    er_D^ℓ[H] = inf_{h ∈ H} er_D^ℓ[h] .                                           (3)

It is then useful to view the excess error of functions h ∈ H as a sum of the following two terms:

    er_D^ℓ[h] − er_D^{ℓ,*} = ( er_D^ℓ[h] − er_D^ℓ[H] ) + ( er_D^ℓ[H] − er_D^{ℓ,*} ) .   (4)

The first term is called the estimation error, and measures how far h is from the optimal within H.
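As a concrete illustration of the decomposition in Eq. (4), consider a hypothetical one-dimensional problem in which all three quantities can be computed in closed form. The distribution and function class below are assumptions chosen for illustration, not part of the lecture: X ∼ Uniform[0, 1], P(y = +1 | x) = 0.8 if x > 0.5 and 0.2 otherwise, and H a small class of threshold classifiers.

```python
# Hypothetical example (an assumption for illustration, not from the notes):
#   X ~ Uniform[0, 1],  P(y = +1 | x) = 0.8 if x > 0.5 else 0.2,
#   H = threshold classifiers h_t(x) = +1 iff x > t, with t in {0.25, 0.4, 0.75}.

def error_of_threshold(t):
    """0-1 error of h_t(x) = +1 iff x > t, in closed form for this D."""
    if t <= 0.5:
        # x < t: predict -1, error P(y=+1|x) = 0.2
        # x in [t, 0.5]: predict +1, error P(y=-1|x) = 0.8
        # x > 0.5: predict +1, error P(y=-1|x) = 0.2
        return t * 0.2 + (0.5 - t) * 0.8 + 0.5 * 0.2
    # x < 0.5: predict -1, error 0.2 ; x in [0.5, t]: predict -1, error 0.8
    # x > t: predict +1, error 0.2
    return 0.5 * 0.2 + (t - 0.5) * 0.8 + (1 - t) * 0.2

bayes_error = error_of_threshold(0.5)   # the Bayes classifier thresholds at 0.5
H = [0.25, 0.4, 0.75]
opt_in_H = min(error_of_threshold(t) for t in H)   # optimal error within H

h = 0.25                                 # some function chosen from H
excess        = error_of_threshold(h) - bayes_error
estimation    = error_of_threshold(h) - opt_in_H
approximation = opt_in_H - bayes_error
assert abs(excess - (estimation + approximation)) < 1e-9   # Eq. (4)
```

Here the Bayes error is 0.20; the best threshold in H is t = 0.4 with error 0.26, so the approximation error is 0.06, an inherent property of H. Choosing t = 0.25 (error 0.35) adds an estimation error of 0.09, and the two sum to the excess error 0.15.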
The second term, called the approximation error, measures how close one can get to the optimal error using functions in H; this is an inherent property of the function class, and forms a lower bound on the excess error of any function learned from H.

In the following we will focus on the estimation error, which is what a learning algorithm learning from a function class H can hope to minimize. We first give a couple of definitions.

2  Statistical Consistency

Definition. Let H ⊆ Y^X. Let A : ∪_{m=1}^∞ (X × Y)^m → H be a learning algorithm that, given a training sample S ∈ ∪_{m=1}^∞ (X × Y)^m, returns a function h_S ∈ H. Let D be a probability distribution on X × Y and ℓ : Y × Y → [0, ∞). We say A is (statistically) consistent in H w.r.t. D and ℓ if the estimation error of the function learned by A from S ∼ D^m converges in probability to zero, i.e. if for all ε > 0,

    P_{S∼D^m}( er_D^ℓ[h_S] − er_D^ℓ[H] ≥ ε ) → 0   as m → ∞ .

If A is consistent in H w.r.t. D and ℓ for all distributions D on X × Y, we say A is universally consistent in H w.r.t. ℓ.¹

Definition. Let A : ∪_{m=1}^∞ (X × Y)^m → Y^X be a learning algorithm that, given a training sample S ∈ ∪_{m=1}^∞ (X × Y)^m, returns a function h_S : X → Y. Let D be a probability distribution on X × Y and ℓ : Y × Y → [0, ∞). We say A is Bayes consistent w.r.t. D and ℓ if the excess error of the function learned by A from S ∼ D^m converges in probability to zero, i.e. if for all ε > 0,

    P_{S∼D^m}( er_D^ℓ[h_S] − er_D^{ℓ,*} ≥ ε ) → 0   as m → ∞ .

If A is Bayes consistent w.r.t. D and ℓ for all distributions D on X × Y, we say A is universally Bayes consistent w.r.t. ℓ.²

One can also define analogous notions of strong consistency, which require almost sure convergence instead of convergence in probability.

3  Consistency of Empirical Risk Minimization in H

Let H ⊆ Y^X and ℓ : Y × Y → [0, ∞). Consider the empirical risk minimization (ERM) algorithm in H, which given a training sample S ∈ ∪_{m=1}^∞ (X × Y)^m returns³

    h_S ∈ argmin_{h ∈ H} er_S^ℓ[h] .

Then for any distribution D on X × Y, we can write the estimation error of h_S as

    er_D^ℓ[h_S] − er_D^ℓ[H]
      = ( er_D^ℓ[h_S] − er_S^ℓ[h_S] ) + ( er_S^ℓ[h_S] − er_D^ℓ[H] )                        (5)
      ≤ ( er_D^ℓ[h_S] − er_S^ℓ[h_S] ) + sup_{h∈H} ( er_S^ℓ[h] − er_D^ℓ[h] )                (6)
      ≤ sup_{h∈H} | er_D^ℓ[h] − er_S^ℓ[h] | + sup_{h∈H} | er_S^ℓ[h] − er_D^ℓ[h] |          (7)
      ≤ 2 sup_{h∈H} | er_S^ℓ[h] − er_D^ℓ[h] | .                                            (8)

Therefore, uniform convergence of empirical errors in H implies consistency of ERM in H! In particular, for binary classification, we immediately have the following:

Theorem 3.1. Let H ⊆ {±1}^X and ℓ = ℓ_{0-1}. If VCdim(H) = d < ∞, then ERM in H is universally consistent in H w.r.t. ℓ_{0-1}.

Proof. Let D be any probability distribution on X × {±1}. Let ε > 0. Then

    P_{S∼D^m}( er_D^{0-1}[h_S] − er_D^{0-1}[H] ≥ ε )
      ≤ P_{S∼D^m}( sup_{h∈H} | er_S^{0-1}[h] − er_D^{0-1}[h] | ≥ ε/2 )    (by Eq. (8))       (9)
      ≤ 4 ( 2em/d )^d e^{−mε²/32}                                          (by previous results)  (10)
      → 0   as m → ∞ .                                                                        (11)

---
¹ Note that one could also define a notion of consistency in terms of convergence in expectation, which would require that E_{S∼D^m}[ er_D^ℓ[h_S] − er_D^ℓ[H] ] → 0 as m → ∞. It is easy to show that a sequence of bounded, non-negative random variables converges in probability to zero if and only if it converges in expectation to zero (show this!), and therefore when the loss function ℓ is bounded, consistency in terms of convergence in probability is equivalent to consistency in terms of convergence in expectation.
² Note that the term 'Bayes' consistency is usually used to refer to convergence to the optimal error for binary classification with the 0-1 loss; we will use the term for any learning problem/loss function to distinguish it from consistency within H.
³ We assume for simplicity that the minimum is achieved in H; the results we discuss continue to hold if h_S is selected to be any function in H whose empirical error is within an appropriately small precision of inf_{h∈H} er_S^ℓ[h].
---

Several remarks are in order:

1. As we have noted before, for binary classification, ERM is typically not computationally efficient, except for some simple classes H.
We will later discuss consistency of algorithms that minimize a convex upper bound on ℓ_{0-1}.

2. Note that for any 0 < δ ≤ 1, we have with probability at least 1 − δ (over S ∼ D^m),

    er_D^{0-1}[h_S] − er_D^{0-1}[H] ≤ c sqrt( ( d ln m + ln(1/δ) ) / m ) .

As a function of the sample size m, this gives a rate of convergence of O( sqrt( ln m / m ) ) for the estimation error. For distributions D for which er_D^{0-1}[H] = 0 (so that there is a 'target function' t ∈ H such that with probability 1, the true label y of any instance x under D is given by t(x), i.e. P_{(x,y)∼D}( y = t(x) ) = 1), one can actually show a faster rate of convergence of O( ln m / m ). This follows from a better uniform convergence bound for such distributions (with an e^{−cmε} term in the bound rather than e^{−cmε²}); we probably will not show this for the general case, but will show this for finite H in a later lecture. A derivation for the general case can be found for example in [1].

3. It is important to note that the above result applies only to classes of finite VC-dimension. Since no such class can have zero approximation error for all distributions D, ERM in such a class cannot achieve (universal) Bayes consistency.

4. For classes H of finite VC-dimension, the above result actually establishes that ERM in H is strongly universally consistent in H, by virtue of the Borel-Cantelli lemma (see [1]).

4  Consistency of Structural Risk Minimization in H = ∪_i H_i

Let H_1 ⊂ H_2 ⊂ …, where H_i ⊆ Y^X. Let ℓ : Y × Y → [0, ∞). Given a training sample S ∈ ∪_{m=1}^∞ (X × Y)^m, the structural risk minimization (SRM) algorithm in (H_i)_{i=1}^∞ returns

    h_S ∈ argmin_i { er_S^ℓ[h_S^i] + penalty(i, m) } ,                                        (12)

where h_S^i ∈ H_i is the function returned by ERM in H_i, and penalty(i, m) is a penalty term that increases with the complexity of H_i.
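The selection rule in Eq. (12) can be sketched in a few lines. This is a minimal sketch using the specific penalty from Theorem 4.1 below; the VC dimensions and empirical errors are made-up values for illustration, not computed from data, and `srm_select` is a hypothetical helper name.

```python
import math

def penalty(d_i, i, m):
    """penalty(i, m) = sqrt((8 d_i ln(2em) + i) / m), where VCdim(H_i) = d_i
    (the penalty of Theorem 4.1, Lugosi and Zeger, 1996)."""
    return math.sqrt((8 * d_i * math.log(2 * math.e * m) + i) / m)

def srm_select(emp_errors, vc_dims, m):
    """Return the 1-based index i minimizing er_S[h_S^i] + penalty(i, m),
    given the empirical errors of ERM in each H_i."""
    scores = [err + penalty(d, i, m)
              for i, (err, d) in enumerate(zip(emp_errors, vc_dims), start=1)]
    return 1 + min(range(len(scores)), key=scores.__getitem__)

# Hypothetical nested classes H_1 ⊂ H_2 ⊂ H_3 with growing VC dimension;
# richer classes fit the sample better but pay a larger penalty.
emp_errors = [0.40, 0.12, 0.10]
vc_dims    = [1, 5, 50]

i_small = srm_select(emp_errors, vc_dims, m=1000)     # penalty term dominates
i_large = srm_select(emp_errors, vc_dims, m=100000)   # data now supports a richer class
```

With these made-up numbers, the small sample selects H_1 (the penalties of the richer classes outweigh their lower empirical errors), while the larger sample selects H_2: as m grows, the penalty shrinks and SRM moves toward richer classes, which is what drives its consistency in H = ∪_i H_i.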
Under certain conditions, one can show that SRM in (H_i)_{i=1}^∞ is consistent in H = ∪_{i=1}^∞ H_i; if in addition the sequence (H_i)_{i=1}^∞ is such that H = ∪_{i=1}^∞ H_i has zero approximation error, then SRM in (H_i)_{i=1}^∞ can also be Bayes consistent. For example, for binary classification, we have the following result:

Theorem 4.1 (Lugosi and Zeger, 1996). Let H_1 ⊂ H_2 ⊂ …, where H_i ⊆ {±1}^X, VCdim(H_i) = d_i < ∞ for all i, and d_i < d_{i+1} for all i. Let ℓ = ℓ_{0-1}. Then SRM with penalties given by

    penalty(i, m) = sqrt( ( 8 d_i ln(2em) + i ) / m )

is universally consistent in H = ∪_{i=1}^∞ H_i w.r.t. ℓ_{0-1}.

Proof. Let D be any probability distribution on X × {±1}. Let ε > 0. We can write the estimation error of h_S as

    er_D^{0-1}[h_S] − er_D^{0-1}[H]
      = ( er_D^{0-1}[h_S] − inf_i { er_S^{0-1}[h_S^i] + penalty(i, m) } )
        + ( inf_i { er_S^{0-1}[h_S^i] + penalty(i, m) } − er_D^{0-1}[H] ) .                   (13)

Therefore we have

    P_{S∼D^m}( er_D^{0-1}[h_S] − er_D^{0-1}[H] ≥ ε )
      ≤ P_{S∼D^m}( er_D^{0-1}[h_S] − inf_i { er_S^{0-1}[h_S^i] + penalty(i, m) } ≥ ε/2 )
        + P_{S∼D^m}( inf_i { er_S^{0-1}[h_S^i] + penalty(i, m) } − er_D^{0-1}[H] ≥ ε/2 ) .    (14)

We will bound each probability in turn. For the first probability, we have

    P_{S∼D^m}( er_D^{0-1}[h_S] − inf_i { er_S^{0-1}[h_S^i] + penalty(i, m) } ≥ ε/2 )
      ≤ P_{S∼D^m}( sup_i { er_D^{0-1}[h_S^i] − er_S^{0-1}[h_S^i] − penalty(i, m) } ≥ ε/2 )    (15)
      ≤ Σ_{i=1}^∞ P_{S∼D^m}( er_D^{0-1}[h_S^i] − er_S^{0-1}[h_S^i] ≥ ε/2 + penalty(i, m) )    (by union bound)  (16)
      ≤ Σ_{i=1}^∞ 4 ( 2em/d_i )^{d_i} e^{ −m( ε/2 + penalty(i,m) )² / 8 }                     (17)
      ≤ Σ_{i=1}^∞ 4 (2em)^{d_i} e^{−mε²/32} e^{ −m (penalty(i,m))² / 8 }                      (18)
      = 4 e^{−mε²/32} Σ_{i=1}^∞ (2em)^{d_i} e^{ −( 8 d_i ln(2em) + i )/8 }                    (20)
      = 4 e^{−mε²/32} Σ_{i=1}^∞ e^{−i/8}                                                      (21)
      = ( 4 / (1 − e^{−1/8}) ) e^{−mε²/32} .                                                  (22)

For the second probability, let i* be such that

    er_D^{0-1}[H_{i*}] ≤ er_D^{0-1}[H] + ε/4 ,                                                (23)

and let m* be such that for all m ≥ m*,

    penalty(i*, m) ≤ ε/8 .                                                                    (24)

Then we have

    P_{S∼D^m}( inf_i { er_S^{0-1}[h_S^i] + penalty(i, m) } − er_D^{0-1}[H] ≥ ε/2 )            (25)
      ≤ P_{S∼D^m}( inf_i { er_S^{0-1}[h_S^i] + penalty(i, m) } − er_D^{0-1}[H_{i*}] ≥ ε/4 )   (26)
      ≤ P_{S∼D^m}( er_S^{0-1}[h_S^{i*}] + penalty(i*, m) − er_D^{0-1}[H_{i*}] ≥ ε/4 )         (27)
      ≤ P_{S∼D^m}( er_S^{0-1}[h_S^{i*}] − er_D^{0-1}[H_{i*}] ≥ ε/8 ) ,  for m ≥ m*            (28)
      ≤ P_{S∼D^m}( sup_{h∈H_{i*}} | er_S^{0-1}[h] − er_D^{0-1}[h] | ≥ ε/8 )                   (29)
      ≤ 4 ( 2em/d_{i*} )^{d_{i*}} e^{−mε²/512} .                                              (30)

Thus we have

    P_{S∼D^m}( er_D^{0-1}[h_S] − er_D^{0-1}[H] ≥ ε )
      ≤ 4 ( 2em/d_{i*} )^{d_{i*}} e^{−mε²/512} + ( 4 / (1 − e^{−1/8}) ) e^{−mε²/32} ,  for m ≥ m*   (31)
      → 0   as m → ∞ .                                                                        (32)

A couple of remarks:

1. As noted above, if the sequence (H_i)_{i=1}^∞ is such that inf_i inf_{h∈H_i} er_D^{0-1}[h] = er_D^{0-1,*} for all distributions D on X × {±1} (i.e. if the approximation error of H = ∪_{i=1}^∞ H_i is zero for all D), then SRM in (H_i)_{i=1}^∞ as above is universally Bayes consistent w.r.t. ℓ_{0-1}.

2. Again, except for the simplest problems, SRM (particularly for binary classification) is often not computationally feasible; however it is useful as a theoretical tool for understanding model selection techniques and Bayes consistency, and can also serve as a guide for the development of approximate algorithms.

5  Consistency and Learnability: Two Sides of the Same Coin

In the next few lectures we will turn to learnability, and then return to a more detailed discussion of statistical consistency. As we will see, the two notions are closely related, although they arose in different communities and tend to emphasize somewhat different aspects:

Statistical Consistency:
• Origins in statistics
• Starts with a learning algorithm; asks if it is statistically consistent
• Both consistency within H and Bayes consistency of interest
• Mostly distribution-free; also interested in 'low-noise' settings
• Focus on convergence rates ε(m, δ)

Learnability:
• Origins in theoretical computer science
• Starts with a function class H; asks if there is a learning algorithm that is statistically consistent in H (with an additional requirement we will see next time)
• By definition, interest is in consistency w.r.t. H
• Often assume er_D^ℓ[H] = 0 ('target function' setting); mostly distribution-free otherwise, but sometimes interested in specific distributions (such as the uniform distribution over the Boolean cube X = {0, 1}^n)
• Focus on sample complexity m(ε, δ) and computational complexity

6  Next Lecture

In the next lecture we will introduce the notion of learnability, and will give a few basic results and examples to illustrate the concept. The next few lectures after that will discuss more results and examples related to learnability, before we return to talk more about statistical consistency.

References

[1] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.