Distribution Specific Learnability – an Open Problem in Statistical Learning Theory
M. Hassan Zokaei-Ashtiani
December 2013

Overview
• Statistical Learning Theory
• No-Free-Lunch Theorem
• Uniform Convergence
• Fundamental Theorem of SLT
• Learnability w.r.t. a distribution
• Learnability w.r.t. a set of distributions
• Semi-supervised learning, a case study
• Summary

Statistical Learning Theory (SLT)
• Machine learning: the automated detection of meaningful patterns in data
• Analysis of learning methods
  – Obtaining guarantees on the performance of learning methods
  – Overfitting…
• Statistical assumptions on the data
  – E.g., the iid assumption
• Our focus: classification

Preliminaries and Notation
• Domain set: X
• Label set: Y = {+1, −1}
• A labeling function f is a function f: X → Y
• Samples are generated iid from an unknown distribution D and labeled by an unknown labeling function f
• Training set: S = ((x_1, y_1), …, (x_N, y_N))
• An algorithm A takes the training sample S and outputs a predictor

Preliminaries and Notation (cont.)
• The expected error rate of a predictor h is
  – L_{D,f}(h) = Pr_{x~D}[h(x) ≠ f(x)]
• For an algorithm A,
  – L_{D,f}(A) = L_{D,f}(A(S))

Empirical Risk Minimization (ERM)
• Given a training set S, the training error is defined by
  – L_S(h) = |{i ∈ [N] : h(x_i) ≠ f(x_i)}| / N
• ERM minimizes the training error over a predefined set of predictors H
  – argmin_{h ∈ H} L_S(h)

PAC Model (Leslie Valiant)
• Given a training set S, can we guarantee anything about the performance of A(S) on unseen data?
• For example, can we say
  – L_{D,f}(A(S)) < ε?
  – Even if we see that L_S(A(S)) = 0?
• Usually impossible!

PAC Model
• We may be able to guarantee something like this:
  – With probability at least 1 − δ, L_{D,f}(A(S)) < ε
• Probably Approximately Correct (PAC)
• How can we do this?

Concentration Inequalities
• Provide bounds on how a random variable deviates from some value (such as its expectation)
• Hoeffding's inequality
  – For independent random variables x_1, …, x_N ∈ [0, 1] with common mean E[X]:
  – Pr[ X̄_N − E[X] > ε ] < exp(−2Nε²)
  – where X̄_N = (x_1 + x_2 + ⋯ + x_N) / N

An Example: Cross-Validation
• Assume that, using an algorithm A and a training set S, we have obtained h = A(S)
• Now we are given a new test set S_T (with |S_T| = N) that is independent of S
• Using (the two-sided form of) Hoeffding's inequality:
  – Pr[ |L_{S_T}(A(S)) − L_{D,f}(A(S))| > ε ] < 2·exp(−2Nε²)
  – Or, in other words, with probability at least 1 − δ:
    L_{D,f}(A(S)) < L_{S_T}(A(S)) + √(log(2/δ) / (2N))
  – (a numeric illustration of this bound appears after the No-Free-Lunch slide below)

PAC Setting
• So if we have enough samples, we can estimate the error rate of a specific predictor
• The bound seems very useful, with its exponential decay
• Can we test all of the possible predictors and select the best one?
  – ERM?

No-Free-Lunch Theorem (simplified version)
• Let X be an infinite domain set
• Let A be "any" learning algorithm
• Then there exists a pair (D, f) such that
  – L_{D,f}(f) = 0
  – with probability at least 1/7 (over the choice of S): L_{D,f}(A(S)) > 1/8
• Independent of the number of training samples!
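As a numeric illustration of the cross-validation bound above, here is a minimal Python sketch. The helper name hoeffding_radius and all the numbers (N, δ, and a true error of 0.3) are illustrative choices of ours, not from the talk. It computes the radius √(log(2/δ)/(2N)) and then checks by simulation that the empirical test error of a fixed predictor lands within that radius of its true error at least a 1 − δ fraction of the time.

```python
import math
import random

def hoeffding_radius(n, delta):
    """Two-sided Hoeffding radius: with probability >= 1 - delta,
    |empirical error - true error| <= sqrt(log(2/delta) / (2n))."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

n, delta = 2_000, 0.05
eps = hoeffding_radius(n, delta)
print(f"N = {n}, delta = {delta}: radius = {eps:.4f}")

# A fixed predictor with true error p = 0.3: each of the n test points is
# misclassified independently with probability p. Count how often the
# empirical test error falls within eps of p.
p, trials, hits = 0.3, 1_000, 0
for _ in range(trials):
    errors = sum(random.random() < p for _ in range(n))
    hits += abs(errors / n - p) <= eps
print(f"empirical coverage: {hits / trials:.3f} (guaranteed >= {1 - delta})")
```

Hoeffding is not tight, so the observed coverage will typically sit well above the guaranteed 1 − δ.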
No-Free-Lunch Theorem
• NFL suggests that learning without some form of prior knowledge is impossible
• The most natural way to incorporate prior knowledge is to use a hypothesis class
• We assume that we have a hypothesis class H for which there exists some h* ∈ H with low error

PAC Learnability
• Let X be a domain and H a set of predictors over it (a hypothesis class)
• We say that H is PAC learnable if and only if
  – there exists an algorithm A such that
  – for every ε, δ ∈ (0, 1) and every unknown (D, f),
  – there exists m_H(ε, δ) < ∞ such that, if we generate m > m_H(ε, δ) samples S according to D and label them by f,
  – then Pr[ L_{D,f}(A(S)) − L_{D,f}(h*) > ε ] < δ

Uniform Convergence
• H has the uniform convergence property if
  – there exists m_UC(ε, δ) such that, for all ε, δ ∈ (0, 1),
  – if we generate a sample S with |S| > m_UC(ε, δ), then with probability at least 1 − δ, for all h ∈ H simultaneously:
    • |L_S(h) − L_{D,f}(h)| ≤ ε
• Uniform convergence is sufficient for learning
  – using ERM
• It can be proved that it is also necessary!

Fundamental Theorem of ML
• The following statements about H are equivalent:
  – H is PAC learnable
  – H has the uniform convergence property
  – the ERM algorithm on H "works"
  – H has finite VC-dimension

Prior Knowledge
• In classic learning theory, our prior knowledge is mostly used in selecting H
• But sometimes we have some knowledge about the probability distribution
  – For example, we may know the marginal distribution D a priori

Learnability w.r.t. a Single Distribution
• Assume we know D a priori
• PAC learnability of (H, D)?
  – H should have a finite ε-cover w.r.t. D (a toy sketch of this condition appears at the end)
• Not realistic

Learnability w.r.t. a Set of Distributions
• Assume that we know D ∈ Ω
• What is the necessary and sufficient condition for learnability of (H, Ω)?
• Still an open problem!
• Conjecture: H should have an ε-cover w.r.t. Ω with finite VC-dimension
• More general version: (h, D) ∈ R

Learnability w.r.t. a Set of Distributions (cont.)
• Why is it important?
  – Theoretical aspects
  – Unexplained phenomena in ML
    • Semi-supervised learning
    • Domain adaptation
    • Deep learning methods
    • …

Semi-Supervised Learning
• We are given both some labeled samples and many unlabeled samples
• In practice, unlabeled samples can help a lot
• In theory, worst-case analysis shows that unlabeled data is not helpful [1]!
• A theory-practice gap!
• Researchers are looking for useful and realistic assumptions to explain this

[1] Ben-David, Shai, Tyler Lu, and Dávid Pál. "Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning." In: 21st Annual Conference on Learning Theory (COLT), 2008.

Summary
• In classic ML theory, prior knowledge is used only in selecting H
  – For this setting, necessary and sufficient conditions for learnability are known
• Other forms of prior knowledge, including knowledge of the distribution, may be available
• Learnability w.r.t. a set of distributions is an open problem
  – It may help in understanding semi-supervised learning, domain adaptation, etc.
• It may drastically change our current models for learning
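To make the ε-cover condition from the "Learnability w.r.t. a Single Distribution" slide concrete, here is a minimal Python sketch under assumptions of our own: a toy class of threshold classifiers on [0, 1], D uniform, and distances estimated from a sample rather than computed exactly; none of the names come from the talk. The distance between two hypotheses is their probability of disagreement under D, and the cover is built greedily.

```python
import random

def disagreement(h1, h2, sample):
    """Estimate Pr_{x~D}[h1(x) != h2(x)] from an iid sample drawn from D."""
    return sum(h1(x) != h2(x) for x in sample) / len(sample)

def greedy_epsilon_cover(H, sample, eps):
    """Greedily pick representatives so that every h in H is within eps
    (estimated disagreement) of some member of the returned cover."""
    cover = []
    for h in H:
        if all(disagreement(h, c, sample) > eps for c in cover):
            cover.append(h)
    return cover

# Toy class: threshold classifiers h_t(x) = +1 iff x >= t, with D = U[0, 1].
thresholds = [i / 100 for i in range(101)]
H = [(lambda x, t=t: +1 if x >= t else -1) for t in thresholds]
sample = [random.random() for _ in range(5_000)]

cover = greedy_epsilon_cover(H, sample, eps=0.1)
print(f"|H| = {len(H)}, size of the 0.1-cover = {len(cover)}")
```

For this class the disagreement between thresholds t_1 and t_2 is |t_1 − t_2| under the uniform distribution, so the 0.1-cover has roughly ten members; the conjectured condition for a set Ω would ask for one such cover, itself of finite VC-dimension, that works for every D ∈ Ω.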