To PAC and Beyond
Thesis submitted in partial fulfillment of the degree of
Doctor of Philosophy
by
Ran Gilad-Bachrach
Submitted to the Senate of the Hebrew University
February 2006
This work was carried out under the supervision of Prof. Naftali Tishby.
Acknowledgements
The work presented here is the result of a period of 10 years I spent at the Hebrew University in
Jerusalem. During this time I got married to the adorable Efrat Gilad and together we gave birth
to our two incredible sons Dvir and Yinon. It sounds like a cliché, but without the support of my
closest family, especially my parents Yehudit (Dita) and Daniel (Dani) Bachrach, my brother and
sisters Yuval, Yael and Nurit, my grandparents and of course Efrat, Dvir and Yinon, I would not
have had any chance of succeeding. I do not think there is any way to express gratitude to them in words.
I have been lucky enough to be surrounded by extremely talented people who were my teachers,
peers and friends. I take this opportunity to thank them. Ran El-Yaniv opened the doors for me
to the fantastic world of scientific research. I am glad to be able to call him both a mentor and
a friend. I had the opportunity to collaborate with Eli Shamir who is a role model for me as a
person who is extremely smart, yet modest. Shai Fine is my “older scientific brother” and I thank
him for that (Shai - we still have an outstanding bet ...). Amir Navot has been a great peer to
work with and a great friend as well. We spent plenty of time together discussing everything.
Without his bike riding enthusiasm I would never have ridden a bike down the Kicking Horse ski
slopes without brakes. My other roommates, Amir Globerson and Gal Chechik, also shared with
us great moments of inspiring discussions.
While at the Hebrew U. I spent most of my time at the machine learning lab. I owe something
to each person in the lab: Yoseph Barash, Gil Bejerano, Koby Crammer, Ofer Dekel, Gal Elidan,
Michael Fink, Nir Friedman, Eyal Krupka, Shahar Mendelson, Elad Schneidman, Shai Shalev-Shwartz, Lavi Shpigelman, Yoram Singer, Yair Weiss, Golan Yona, and all the people at that
incredible place. Of course, a great thank you goes to my advisor, Naftali Tishby, who taught me
a lot about the scientific method, the beauty of science and how to conduct scientific research.
Although we debated a lot, he had a greater influence on my scientific approach than he might
think.
I had the best possible teachers, from whom I learned how to teach (which is important
for someone who is learning how to learn). I would especially like to mention Ehud de Shalit, Nati
Linial, Israel (Robert) Aumann, Saharon Shelah and Hillel Furstenberg.
I thank the Clore foundation for the generous funding they provided me as a Ph.D.
student. I also thank the Chorfas foundation, Vatat and the Amirim program for additional
support. I thank Eyal Rosenman and the ultimate Frisbee team for the great fun. Special thanks
go to Esther Singer for the English editing of this dissertation and to Nitsa Movshovitz-Hadar
for inspiring discussions.
There are so many people who deserve to be mentioned here, I thank each and every one of
you. Finally, I would like to thank again Efrat who motivated me, challenged me and supported
me. Your name should appear first on the title page of this work.
Abstract
The greatest advantage of learning algorithms is their ability to adjust themselves to the required
task. While “traditional” algorithms are tailored to perform a single task, learning algorithms
acquire the ability to perform a task from data. Learning techniques have been applied successfully in various domains such
as optical character recognition, information retrieval, medical diagnostics and fraud detection.
There are different frameworks for learning. In all of them, the learner is presented
with data and attempts to find some underlying structure in it. Later, this structure is used to make
inferences about unseen cases or about unobserved properties of the data. The frameworks differ
in:
1. The type of data the learner is presented with
2. The way it interacts with its teacher
3. The way the learner’s performance is evaluated
In many cases, the learner needs to be trained for a lengthy period before it reaches an acceptable
performance level. Generating these long training sequences is a hard and expensive process, since
in most cases it requires a great deal of human labor. For these reasons, a considerable part of
research in machine learning has dealt with different ways of shortening the training process. This
dissertation contributes to this field by examining ways to truncate the learning process using
active learning.
Active learning is a setting in which the learner has control over the learning processes. The
learner is allowed to direct the teacher to present the more informative data and, by so doing,
reducing the length and the cost of the training process. The rationale behind active learning is
that the value of any additional data is a function of the data seen so far and the internal structure
of the learning algorithm. Thus, for the learning process to be efficient, it needs to be tailored
to the learning algorithm. This is true when the learner is a machine and when the learner is a
human being¹. In a sense, this work is a demonstration of at least the first part of an ancient
Hebrew saying:

“לא הביישן למד, ולא הקפדן מלמד” (פרקי אבות)

“He who is shy does not learn, and he who is a pedant shall not teach” (Ethics of the
Fathers: teachings forming a tractate of the Mishnah, from the 3rd century BCE).

¹ See Chapter 12 for a short discussion of active learning in humans.
Active learning draws its power from several frameworks in machine learning: supervised learning,
unsupervised learning and reinforcement learning. In this work we take something from each of
these frameworks. However, we assess the performance of active learners when compared to passive
learners in the supervised learning framework and particularly the Probably Approximately Correct
(PAC) [126] framework. We study several active learning models: membership queries, selective
sampling and label efficient learning.
After a short introduction we begin our discussion with the membership queries framework
[3]. In this model the learner can direct questions to the teacher. It does so by crafting instances
and asking the teacher to label these instances. We study the sensitivity of this model to noise.
We show that learning can take place even if the teacher is inaccurate in some of the answers it
provides to the queries issued by the learner. We show that the noise can be filtered out if the
learning problem is well structured; i.e. when the dual learning problem is dense and has moderate
complexity.
In the third part of this work we study the selective sampling model [29, 30]. In this framework,
the learner observes unlabeled instances and selects the instances to be labeled by the teacher. This
model is useful in many real world scenarios in which the raw data are plentiful but labels are scarce.
We focus on the Query By Committee algorithm [112] and address theoretical, algorithmic and
practical issues. We begin by extending the theoretical foundation of the Query By Committee
(QBC) algorithm and the selective sampling framework. We show that QBC learns exponentially
faster than passive learners under milder assumptions than were previously known. We also prove
the efficiency of this algorithm in the label efficient setting; the “online” version of the selective
sampling model.
We continue by lifting some of the assumptions made in the theoretical analysis. First we lift
the Bayesian assumption. By lifting this assumption we show that when certain conditions apply,
the QBC is almost certain to be efficient whereas under the Bayesian assumption we can only
guarantee average performance. Next we address the issue of learning with Query By Committee
in the presence of noise. We show that QBC can be modified to tolerate noise. While analyzing
this scenario we develop the notion of information from observations, which is of interest in its own
right.
A key requirement of the QBC algorithm is the ability to sample a random hypothesis. The
difficulties involved in fulfilling this requirement have prevented most researchers from using QBC.
We show, for the first time, that when learning linear classifiers, sampling can be done in polynomial
time. We show that the problem of sampling a random hypothesis is equivalent to the problem of
sampling a random point from a convex body. The latter has been studied extensively [41, 84, 85]
and the results can be applied to our case. While the algorithms suggested for sampling from
convex bodies are polynomial, they are far from being efficient. We address this issue in two ways:
first, we show that the sampling can be done in a low-dimensional space. This enables us
to use QBC in high dimensional spaces without sacrificing computational complexity; thus kernels
can be used to augment the QBC algorithm. We also suggest using the hit-and-run [85] random
walk and show empirically that the number of iterations needed when sampling using hit-and-run
is small. These tools allow us to conduct experiments using the QBC algorithm and demonstrate
the benefits of using active learning over passive learning.
The study of active learning is still in its infancy. There are only a few theoretical results
(see e.g. [27, 36, 37, 48]) and empirical findings (see e.g. [35, 124, 122]). The distinction of this
dissertation lies in its holistic approach. We look at the same objects from different points of view.
We study active learning using analytic tools and at the same time discuss practical concerns and
provide empirical evidence to assess our findings. In this way, we reduce the gap between the
theoretical study of active learning and its practical use.
We believe that active learning is significant in many applications. This work is another step
towards the enhanced maturity of this field.
Contents

Acknowledgements
Abstract

Part I: Introduction

1 Learning
  1.1 Learning from Examples
  1.2 Machine Learning and Artificial Intelligence
  1.3 Introduction to Machine Learning
  1.4 Probably Approximately Correct (PAC)
  1.5 On-line Learning
  1.6 Active Learning
  1.7 Other Learning Models
    1.7.1 Unsupervised Learning
    1.7.2 Reinforcement Learning

Part II: Membership Queries

2 Preliminaries
  2.1 The Power of Membership Queries
    2.1.1 Constant Depth Circuits
    2.1.2 Decision Trees
    2.1.3 Intersections of Halfspaces
  2.2 The Limitations of Membership Queries
  2.3 Summary

3 Noise Tolerant Learnability using Dual Representation
  3.1 Learning in the presence of noise
  3.2 The Dual Learning Problem
  3.3 Dense Learning Problems
  3.4 Noise Immunity Scheme
  3.5 A Few Examples
    3.5.1 Monotone Monomials
    3.5.2 Geometric Concepts
    3.5.3 Periodic Functions
  3.6 Regression
    3.6.1 Estimating VC_ǫ(C, X**)
  3.7 VC Dimension of Dual Learning Problems
  3.8 Banach Spaces
  3.9 Summary

Part III: Selective Sampling

4 Preliminaries
  4.1 Empirical Studies of Selective Sampling
    4.1.1 Committee-Based Scores
      4.1.1.1 Part-of-Speech Tagging
      4.1.1.2 Spoken Language Understanding
      4.1.1.3 Ensemble of Active Learners
      4.1.1.4 Other Committee-Based Approaches
    4.1.2 Confidence-Based Scores
      4.1.2.1 Margin Based Confidence
      4.1.2.2 Probability Based Confidence
    4.1.3 Look-ahead Principles
  4.2 Theoretical Studies of Selective Sampling
  4.3 Label Efficient Learning
  4.4 Summary

5 The Query By Committee Algorithm
  5.1 Termination Procedures
    5.1.1 The “Optimal” Procedure
    5.1.2 Random Gibbs Hypothesis
    5.1.3 Bayes Point Machine
    5.1.4 Avoiding the Termination Rule
  5.2 Summary

6 Theoretical Analysis of Query By Committee
  6.1 The Information Gain
  6.2 The Fundamental Theory of QBC
  6.3 Proofs
  6.4 Lower Bound on the Expected Information Gain for Linear Classifiers
    6.4.1 The Class of Parallel Planes
    6.4.2 Concave Measures
    6.4.3 The Function G(ρ)
  6.5 Proof of Theorem 6.4
  6.6 Summary

7 The Bayes Model Revisited
  7.1 PAC-Bayesian Techniques
  7.2 Symmetry
  7.3 Incorrect Priors and Distributions
  7.4 Summary

8 Noise Tolerance
  8.1 “Soft” QBC
    8.1.1 The Case of Learning with Noise
    8.1.2 The Case of Stochastic Concepts
    8.1.3 A Variant of the QBC Algorithm
  8.2 Information Gain Revisited
    8.2.1 Observations of the State of a Random Variable
    8.2.2 Information Processing Inequality
  8.3 SQBC Sample Complexity
  8.4 Summary

9 Efficient Implementation Using Random Walks
  9.1 Linear Classifiers
  9.2 Sampling from Convex Bodies
  9.3 A Polynomial Implementation of QBC
  9.4 A Geometric Lemma
  9.5 Summary

10 Kernelizing the QBC
  10.1 Kernels
    10.1.1 Commonly Used Kernel Functions
    10.1.2 The Gram Matrix
    10.1.3 Mercer’s Conditions
    10.1.4 The Special Case of the Ridge Kernel
  10.2 A New Method for Sampling the Version-Space
  10.3 Sampling with Kernels
  10.4 Hit and Run
  10.5 Generalizing to Unseen Instances
  10.6 Summary and Further Study

11 Empirical Evidence
  11.1 Empirical Study
    11.1.1 Synthetic Data
    11.1.2 Label Efficient Learning over Synthetic Data
    11.1.3 Face Image Classification
  11.2 Summary

Part IV: Epilog

12 Epilog
  12.1 Active Learning in Humans

List of Publications
Bibliography
Summary in Hebrew
  Abstract (in Hebrew)
  Introduction (in Hebrew)
Part I
Introduction
Chapter 1
Learning
Learning is the processing of information we encounter, which leads to
changes or increase in our knowledge and abilities [74].
The high capacity to learn is one of the key features of human beings, and one that distinguishes us
from most of the animal kingdom. It is learning that allows us to adapt to changing environments
and to solve complicated problems. The quote at the top of this page captures some of the
fundamental aspects of learning. Learning is a process. It is a process that converts information
into knowledge and abilities. Thus learning is a process that enriches our capabilities.
Machine learning is the art of designing computerized learning processes; i.e., computer processes that are capable of turning the information we encounter into knowledge and abilities.
Machine learning uses the following flow to solve problems (a toy instance is sketched after the list):
1. Acquire data
2. Find concise representations of the data
3. Find patterns
4. Turn patterns into knowledge
5. Turn knowledge into actions (optional)
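As a toy illustration of this flow, consider learning a one-dimensional threshold classifier. The sketch below is ours and purely illustrative; none of the function names come from a library, and real systems replace each step with far more sophisticated machinery.

    # A toy instance of the five-step flow above; everything here is an
    # illustrative sketch, not a method proposed in this dissertation.
    def acquire_data():
        # step 1: acquire labeled data (hard-coded toy sample)
        return [(0.1, -1), (0.4, -1), (0.6, +1), (0.9, +1)]

    def find_pattern(sample):
        # steps 2-3: a concise representation of this data is a single
        # threshold separating the negative from the positive instances
        rightmost_neg = max(x for x, y in sample if y == -1)
        leftmost_pos = min(x for x, y in sample if y == +1)
        return (rightmost_neg + leftmost_pos) / 2.0

    def to_knowledge(threshold):
        # step 4: turn the pattern into a predictor
        return lambda x: +1 if x >= threshold else -1

    predict = to_knowledge(find_pattern(acquire_data()))
    print(predict(0.7))  # step 5: act on the knowledge; prints 1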
It has been shown that this process is successful in many domains such as text classification, speech
recognition, machine vision, the study of the genome, medicine etc.
1.1 Learning from Examples
There are many scenarios that incorporate learning. We are interested in learning from examples
and more specifically in supervised learning. Two players are involved here, the teacher and the
learner. The teacher has some knowledge that the learner is interested in. Thus, the learner
observes the teacher in action. It is hoped that the learner will be able to collect enough
information, and will be wise enough to convert this information into knowledge. There are therefore
two sources of complexity in this process: the amount of time the learner needs to observe
the teacher and the amount of “wisdom” the learner is required to have. In the machine learning
literature these complexities are referred to as the sample complexity and the computational complexity
of the learning process.
We consider the “learning from examples” framework. In the learning processes, the learner
has access to examples. Each example is a pair (x, y) where x is an instance from the sample space
X and y is the label of this instance taken from the output space (or label space) Y. We assume
that there exists an underlying target concept c that maps inputs to outputs. The target concept
can be a deterministic map or a stochastic one. The goal of the learner is to approximate this
target concept.
Much of the research in machine learning focuses on ways to reduce both sample and computational complexity. In a sense, this work deals with the trade-off between the two. We are
interested in ways of reducing the sample complexity without overly sacrificing the computational
complexity. In other words, we are interested in transferring some of the workload from the teacher
to the learner.
In order to achieve this acceleration in learning we will go beyond the traditional learning
models, e.g. PAC (see definition 1.1) and allow active learning. By active learning we mean that
the learner plays an active role in the learning process. While a passive learner only observes the
teacher, an active learner can guide the teacher by asking questions.
In this work we study active learning as an extension to passive learning. We show that in
many cases active learners significantly outperform passive learners. We study these frameworks
both with analytical tools (theory) and with experiments (empirical evidence). As opposed to
most of the active learning literature, we use the same algorithm in our theoretical study as in our
empirical study. This enables us to bridge the gap between theory and practice in active learning.
In the remainder of this chapter we briefly discuss fundamental definitions and theorems in
machine learning. A reader who is familiar with this field may wish to skim through it simply for
the notation we will be using in this dissertation (see Table 1.1 for a summary of the notation).

Table 1.1: Summary of the notation used in this dissertation

Symbol   Description                                    Remarks
log      base 2 logarithm
ln       natural logarithm
X        the sample space
x        an instance in the sample space
D        a distribution over a sample space
Y        the label space                                typically Y = {±1}
y        a label in the label space
S        a sample                                       S ∈ (X × Y)^m
C        a concept class
c        a concept in the concept class
ν        a prior (or posterior) over a concept class
d        dimension
m        the size of a training sample
H(·)     Shannon’s binary entropy                       H(p) = −p log p − (1 − p) log(1 − p)
ǫ        error rate
δ        failure probability
η        noise rate
1.2 Machine Learning and Artificial Intelligence
Machine learning is a sub-division of Artificial Intelligence (AI). Both AI and machine learning try
to mimic the way the human brain solves complicated problems. Traditional artificial intelligence
defines knowledge as a set of logical rules. These rules are used to infer new unseen cases. Machine
learning diverges from this approach in two ways: first, machine learning puts greater emphasis
on the learning process; i.e. the process by which we acquire knowledge. Second, machine learning
typically uses statistical and probabilistic properties whereas traditional artificial intelligence uses
logical deductions. In a sense, machine learning is an evolutionary phase of AI [105].
The difference between the two approaches can be seen in the following example. Assume you
would like to build a machine to perform a certain task, say a medical diagnosis machine. The
“logical” approach towards building such a machine would be to contact an expert (a physician in
the example of a medical diagnosis machine) and ask for a set of rules that differentiates sick from
healthy people. These rules are hard coded in the machine and used to diagnose patients.
This approach has several flaws: first, it is usually impossible to define these logical rules.
Second, it is difficult to maintain and debug such a set of rules: in a system with thousands of
rules, how do you find the one rule that leads to the wrong prediction? How do you correct it
without destroying the whole system? Finally, how do you adjust such a system to a changing
environment or to a new diagnosis task?
The flaws presented above primarily affect the acquisition process. Machine learning uses a
different approach. In the acquisition stage, here called learning, the learner observes an expert at
work and collects statistics about various correlations in the data. Once the learner has collected
enough information, it can be used to generate insights and make predictions.
Learning from examples is useful in a variety of domains such as medical diagnostics, speech
recognition, information retrieval, etc. It has the advantage that the training process only requires
observing an expert at work. It is easier to maintain and is more reusable than “logical” machines.
Machine learning, as its name suggests, focuses on the acquisition stage. Machine learning
can be broken down into various sub-fields based upon the nature of the acquisition stage (e.g.
supervised, semi-supervised, unsupervised) and the task the machine has to perform (e.g. batch,
on-line, classification, regression). In this dissertation we focus on the task of supervised learning.
1.3 Introduction to Machine Learning
Machine learning has been studied under various names for more than half a century. A comprehensive review of the field is beyond the scope of this work. Here we present the key principles
that will be used in the rest of this document.
Machine learning attracts researchers from different disciplines: mathematics, computer science,
neuroscience, biology, etc. There are three main motivations for research in this field:
1. The study of the brain
2. Learning as a way to solve difficult problems
3. The study of “learning” as an abstract concept
Brain researchers have found that the brain is made up of atomic building blocks, the neurons.
These neurons are connected together in a network. The ability of our brain to solve complicated
problems and adjust itself to changing environments and new tasks prompted researchers to believe
that by building artificial neural networks we would be able to learn to solve complex problems.
It would also allow us to better understand the way our brain works. This line of research began
during the 1940s. McCulloch and Pitts [91] and later Hebb [53] suggested ways in which neural
networks could work. These were the opening chapters in a very fruitful and inspiring line of
research that has generated hundreds of books.
The main building block of the neural network is the neuron. A neuron has many inputs
(synapses) and a single output (axon). The artificial neuron is the perceptron or the linear classifier
[103]. Like the neuron, it has many inputs and a single output. To simplify the notation we assume
that there is a single input which is a vector x ∈ IR^d, such that each component of this vector
corresponds to a synapse. The perceptron calculates a linear threshold function over its input. Each
perceptron holds a vector of weights w ∈ IR^d and a threshold θ ∈ IR. It computes the function
c_{w,θ}(x) = sign(w · x − θ), where w · x is the inner product between the vectors w and x.
Although the perceptron was defined almost 50 years ago, it is probably the most commonly
used tool in machine learning. A large part of this work is devoted to learning perceptrons. We
usually refer to perceptrons as linear classifiers. Whenever the threshold θ is set to zero, i.e. the
classification rule is c_w(x) = sign(w · x), we call the classifier a homogeneous linear classifier.
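As a concrete sketch of the function c_{w,θ}(x) = sign(w · x − θ), a minimal implementation might read as follows (we adopt the common convention that ties at zero are resolved to +1; the code is illustrative, not taken from [103]):

    import numpy as np

    def perceptron_predict(w, theta, x):
        # compute c_{w,theta}(x) = sign(w . x - theta); ties at zero
        # are broken in favor of +1 by convention
        return 1 if np.dot(w, x) - theta >= 0 else -1

    w = np.array([1.0, -2.0])
    print(perceptron_predict(w, theta=0.5, x=np.array([2.0, 0.25])))  # -> 1

Setting theta to zero gives the homogeneous linear classifier mentioned above.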
Artificial neural networks serve both as a tool to study the brain and as a method to solve
problems that are otherwise considered to be hard. To solve problems, other approaches have been
suggested as well. Two representative approaches in this category are nearest neighbor rules and
window based rules [33, 46, 120]. Both methods assume some metric over the input space, and
the prediction of the label of a new instance is based on its proximity to some of the points
in the training set. In nearest neighbor rules, the predicted label of a new instance is chosen by
holding a majority vote among the k nearest neighbors of the instance at hand. In window based
approaches, the label is chosen by holding a majority vote among all training instances which are
close to the instance at hand. Both approaches have been analyzed and proved to be consistent,
i.e. asymptotically optimal, provided that the right choice of parameters is made.
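A minimal sketch of the k-nearest-neighbor rule just described, assuming a Euclidean metric for illustration (any metric would do):

    import numpy as np

    def knn_predict(train_x, train_y, x, k=3):
        # majority vote among the k training instances nearest to x
        dists = np.linalg.norm(train_x - x, axis=1)
        nearest = np.argsort(dists)[:k]
        return 1 if train_y[nearest].sum() >= 0 else -1

    X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
    y = np.array([-1, -1, 1, 1])
    print(knn_predict(X, y, np.array([0.95, 1.0])))  # -> 1

The window-based rule is obtained by replacing the k nearest neighbors with all training instances within a fixed radius of x.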
The most recent direction in machine learning has been the attempt to study “learning” as
an autonomous concept. The most significant work in this domain was done by Vapnik and
Chervonenkis [128] who studied the concept of uniform convergence of empirical means. This result
was kept a “secret” until Blumer et al. [17] discovered its relation to the Probably Approximately
Correct (PAC) model [126].
The Probably Approximately Correct (PAC) model [126] is an attempt to define learning as
a mathematical object. By defining learning in a way which makes no assumptions as to how
learning is accomplished, Valiant was able to raise issues which had never been formulated before
such as “Is it possible to learn?”, “Is it possible to learn everything?” or, more generally, “What
can be learned?”. In the following sections we review the PAC model and other definitions of
learning, and some of the important findings in this field.
1.4 Probably Approximately Correct (PAC)
Valiant made several important observations which are fundamental ingredients of the PAC model
[126]. First of all, learning is a finite process in the sense that we should be able to benefit from
learning after a finite time. Therefore, learning should be possible after seeing only a finite set of
examples. Valiant also made a distinction between the inaccuracy of the learner and a failure of
the learning process. Inaccuracy is caused by the fact that the learner sees only a finite sample.
However, sometimes the learning process can fail when the training sequence is atypical. Valiant
claims that as long as we have reasonable accuracy with high confidence we can learn.
The PAC model defines learnable concept classes. A class C is learnable if the number of
examples needed to learn a concept in this class is finite. Valiant was primarily interested in
the binary setting where Y = {±1}. The assumption that the labels can take only two possible
values seems to be restrictive; however, there are canonical ways to convert multi-class learning
problems into a set of binary learning problems. In most cases this is done by breaking the multiclass learning problem into a collection of binary classification problems. For example, in the
one-against-all method, we generate a binary classification problem for every possible value y ∈ Y.
We train a binary classifier to distinguish between the instances labeled by y and the instances
for which the label is different from y. In the all-pairs method, as its name suggests, we train a
classifier for every pair of labels. The more general approach uses error-correcting codes to generate
a set of binary classification problems and combine them into a multi-class classifier (see e.g. [34]).
Hence, the assumption that the labels are binary is not too restrictive and we will assume this
unless otherwise specified¹.
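The one-against-all reduction described above can be sketched as follows. The code is a hypothetical illustration: binary_learner stands for any binary learning algorithm that returns a real-valued scoring function whose sign and magnitude reflect confidence in the positive class.

    def one_against_all(train, labels, binary_learner):
        # train one binary classifier per label: +1 for that label, -1 otherwise
        scorers = {}
        for y0 in labels:
            relabeled = [(x, 1 if y == y0 else -1) for x, y in train]
            scorers[y0] = binary_learner(relabeled)
        # predict the label whose binary classifier is the most confident
        return lambda x: max(labels, key=lambda y0: scorers[y0](x))

    def centroid_learner(sample):
        # a toy binary learner for 1-D data: score by distance to the
        # negative centroid minus distance to the positive centroid
        pos = [x for x, y in sample if y == 1]
        neg = [x for x, y in sample if y == -1]
        p, n = sum(pos) / len(pos), sum(neg) / len(neg)
        return lambda x: abs(x - n) - abs(x - p)

    train = [(0.0, "a"), (0.1, "a"), (1.0, "b"), (1.1, "b"), (2.0, "c")]
    predict = one_against_all(train, ["a", "b", "c"], centroid_learner)
    print(predict(1.05))  # -> b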
Definition 1.1 Probably Approximately Correct [126]
Let X be a sample space and let C be a binary concept class over X. Let the label space Y be
{±1}. We say that C is PAC learnable if for any ǫ, δ > 0 there exist m < ∞ and an algorithm
L : (X × Y)^m → C such that for any probability measure µ on X × Y

    Pr_{S∼µ^m} [ error_µ(L(S)) > ǫ + inf_{c∈C} error_µ(c) ] < δ

where error_µ(c) = µ{(x, y) : c(x) ≠ y}.

¹ Multi-class active learning is more complicated since the learner needs to decide when to query the teacher for more information. We discuss this issue in Chapter 8.
A concept class is PAC learnable if a finite sample suffices to learn a hypothesis from the class
that is almost the best possible concept in this class in terms of generalization error. In a seminal
paper, Vapnik and Chervonenkis [128] showed that PAC learnable classes have a unique geometric
property: a concept class C is PAC learnable if and only if it has a finite Vapnik-Chervonenkis
(VC) dimension. The connection to the PAC learnability problem was found in [17].
In order to define the VC dimension we need to define the shatter coefficient (or growth function)
of a class C.
Definition 1.2 Let C be a concept class. The m’th shatter coefficient of C is

    Π_C(m) = max_{x_1,...,x_m ∈ X} |{(c(x_1), . . . , c(x_m)) : c ∈ C}|

The shatter coefficient measures the number of different ways in which a concept class can assign
labels to m instances. The function Π_C is called the growth function of C. Clearly, since |Y| =
|{±1}| = 2, we have Π_C(m) ≤ 2^m. This is the rationale behind the definition of the VC dimension:

Definition 1.3 A concept class C has VC dimension d if

    d = max {m : Π_C(m) = 2^m}

The VC dimension is infinite if Π_C(m) = 2^m for all m.
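For a finite sample space and a finite concept class, Definitions 1.2 and 1.3 can be checked by brute force. The following sketch is ours and exponential in m, so it is purely illustrative; it computes Π_C(m) and the VC dimension for threshold functions on the line:

    from itertools import combinations

    def shatter_coefficient(concepts, points, m):
        # brute-force Pi_C(m) over a finite sample space (Definition 1.2)
        best = 0
        for xs in combinations(points, m):
            labelings = {tuple(c(x) for x in xs) for c in concepts}
            best = max(best, len(labelings))
        return best

    def vc_dimension(concepts, points):
        # largest m with Pi_C(m) = 2^m (Definition 1.3)
        d = 0
        for m in range(1, len(points) + 1):
            if shatter_coefficient(concepts, points, m) == 2 ** m:
                d = m
        return d

    # threshold functions on the line shatter single points but no pair,
    # so the reported VC dimension is 1
    points = [0.0, 1.0, 2.0]
    thresholds = [-1.0, 0.5, 1.5, 3.0]
    concepts = [lambda x, t=t: 1 if x >= t else -1 for t in thresholds]
    print(vc_dimension(concepts, points))  # -> 1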
The VC dimension gives an exact classification of PAC learnable concept classes. Vapnik and
Chervonenkis [128] proved the following seminal theorem (this is a rephrased version of the original
result).
Theorem 1.1 [6, Theorem 4.2 and Theorem 4.8]
Let C be a concept class of VC dimension d. Let L be an algorithm which, given a sample
S ∈ (X × Y)^m of labeled instances, returns a hypothesis c = L(S) ∈ C which minimizes the
empirical error

    |{(x, y) ∈ S : c(x) ≠ y}|

Then for any δ > 0 and any probability measure µ over X × Y the following holds:

1. If d is finite then

    Pr_{S∼µ^m} [ error_µ(L(S)) > ǫ + inf_{c∈C} error_µ(c) ] ≤ δ

   as long as

    ǫ ≥ √( (32/m) ( d ln(2em/d) + ln(4/δ) ) )

   where error_·(·) is as defined in the definition of the PAC model (Definition 1.1).

2. If d is finite and inf_{c∈C} error_µ(c) = 0, i.e. the target concept is in the concept class C, then

    Pr_{S∼µ^m} [ error_µ(L(S)) > ǫ ] ≤ δ

   as long as

    ǫ ≥ (2/m) ( d ln(2em/d) + ln(2/δ) )

3. If d = ∞ then C is not PAC learnable.

Theorem 1.1 shows that the learning rates we can expect are O*(√(d/m)) in the general case
and O*(d/m) when the target concept is in the concept class². Note that only the constants in the
bounds we presented can be improved. We will see in Chapter 6 that when active learning is used
we obtain significantly better results.

² We use the notation O*(·) to indicate that we neglect logarithmic factors.
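For concreteness, the bound in part 1 of Theorem 1.1 can be inverted numerically to obtain a sufficient sample size m for given d, ǫ and δ. This is an illustrative computation of the stated bound, not a tight sample-complexity result:

    import math

    def pac_sample_size(d, eps, delta):
        # smallest m (up to bisection accuracy) for which
        # eps >= sqrt((32/m) * (d*ln(2em/d) + ln(4/delta))) holds
        def ok(m):
            rhs = math.sqrt((32.0 / m) * (d * math.log(2 * math.e * m / d)
                                          + math.log(4.0 / delta)))
            return eps >= rhs
        m = d
        while not ok(m):        # exponential search for a feasible m
            m *= 2
        lo, hi = m // 2, m      # then bisect; ok(hi) always holds
        while lo + 1 < hi:
            mid = (lo + hi) // 2
            if ok(mid):
                hi = mid
            else:
                lo = mid
        return hi

    print(pac_sample_size(d=10, eps=0.1, delta=0.05))  # roughly 4 * 10**5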
1.5 On-line Learning

The on-line learning model [83] is another attempt to define learning. Littlestone tried to capture
the fact that learning is a continuous process. In the PAC model there are two phases: the learning
phase and the inferring phase (or generalizing phase). In the on-line learning model, these two
phases are interleaved. Learning takes place in rounds. In round t, the teacher presents the instance
x_t and the learner suggests that the label of x_t is ŷ_t. After this, the teacher presents the label y_t and the
learner suffers a loss of L(y_t, ŷ_t), where L(·,·) is some non-negative loss function. See Figure 1.1 for
an illustration of a single round in the on-line learning model.

Figure 1.1: On-line learning. An illustration of a single round in the on-line learning model: the teacher presents x_t, the learner responds with ŷ_t, and the teacher then reveals y_t, at which point the learner suffers the loss L(y_t, ŷ_t).
The goal of the learner in this setting is to minimize Σ_{t=1}^∞ L(y_t, ŷ_t) under the mildest possible
assumptions. In most cases one of the following assumptions is made:
1. There exists an underlying target concept c, chosen from a class C, such that y_t = c(x_t).
Under this assumption we can sometimes prove that

    Σ_{t=1}^∞ L(y_t, ŷ_t) ≤ M < ∞

We call M the mistake bound since it provides an upper bound on the number of mistakes
for any single sequence x_1, x_2, . . . and any concept c ∈ C.
2. There is no restriction on the target concept, but the learner is compared only to a limited
reference class C. In this case we seek a function f(·) such that for any sequence
(x_1, y_1), (x_2, y_2), . . . we have

    Σ_{t=1}^∞ L(y_t, ŷ_t) ≤ inf_{c∈C} Σ_{t=1}^∞ f( L(c(x_t), y_t) )

These bounds are called regret bounds. They provide a bound on the difference between the
cumulative loss of the algorithm studied and that of the best concept in the reference class.
Many of the on-line learning algorithms are very simple, fast and use a small amount of memory.
For example, the perceptron algorithm [103] when applied to a d dimensional problem, uses O (d)
memory cells and each prediction is made in O(d) operations³. It is important to note, though,
that the constraints on memory and CPU usage are not a part of the definition of the on-line
learning model.
Since in most of the cases we study here the labels are either +1 or −1, the natural loss function
is the 0-1 loss, which has the value 0 whenever y_t and ŷ_t are equal and the value 1 otherwise:

    L_{0-1}(y_t, ŷ_t) = (1 − y_t ŷ_t)/2 = { 0 if y_t = ŷ_t ; 1 if y_t ≠ ŷ_t }

In this case Σ_{t=1}^∞ L_{0-1}(y_t, ŷ_t) is a count of the number of prediction mistakes the learning algorithm
made. It is known that if the perceptron algorithm is used on a sequence (x_1, y_1), (x_2, y_2), . . . then
Σ_{t=1}^∞ L_{0-1}(y_t, ŷ_t) ≤ R²/θ², provided that ||x_t||_2 ≤ R for every t, and there exists w ∈ IR^d such that
||w||_2 = 1 and y_t(w · x_t) ≥ θ. This is the mistake bound for the perceptron algorithm that was
proved in [97].
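A minimal sketch of the homogeneous perceptron run in this on-line protocol, counting its 0-1 loss; the update on each mistake is the classical rule of [103], but the surrounding harness and the synthetic data are our own illustration:

    import numpy as np

    def perceptron_online(stream, dim):
        # run the homogeneous perceptron through an on-line sequence,
        # predicting, suffering 0-1 loss, and updating on every mistake
        w, mistakes = np.zeros(dim), 0
        for x, y in stream:
            y_hat = 1 if np.dot(w, x) >= 0 else -1   # predict
            if y_hat != y:                           # 0-1 loss of 1
                mistakes += 1
                w = w + y * x                        # perceptron update
        return w, mistakes

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.sign(X @ np.array([1.0, 2.0, -1.0, 0.5, 0.1]))
    w, mistakes = perceptron_online(zip(X, y), dim=5)
    print(mistakes)  # finite, per the mistake bound above (the data is separable)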
³ We assume here that the perceptron is represented in the primal space. When the perceptron is used with kernels, the hypothesis must be represented in the dual space. In this case, the memory usage and CPU usage change dramatically. See [40] for more about this issue.
1.6 Active Learning
Stone’s celebrated theorem proves that given a large enough training sequence, even naive algorithms such as the k-nearest neighbors can be optimal [120]. However, collecting large training
sequences runs up against two main obstacles. First, collecting these sequences is a lengthy and
costly task. Second, processing large data-sets requires enormous resources. Obviously we need to
process the data while training. However, in most cases, the complexity of inferring the labels of
new data items is affected by the size of the training data. This is the case for the commonly used
Support Vector Machines [20] and Adaboost [47] algorithms. Therefore, reducing the size of the
training sequence is of major concern.
Active learning suggests that the size of the training sequence can be reduced considerably if
we allow ourselves to go beyond the standard definitions of learning, e.g. PAC and on-line learning,
and allow the learner some control over the learning process. In the learning frameworks we have
discussed so far, the teacher selected the instances to be presented to the learner. Therefore we call
these frameworks passive learning. In active learning frameworks, the learner has some influence
on the selection of data points. Having control over the learning process allows the learner to focus
on the more informative data points and thus increase the learning rate.
In many cases, active learning can indeed accelerate the learning rate. We will show that the
speedup can be exponential. However, in some cases there is a price to be paid. Since the learner
has control over the learning process, it needs to make decisions that passive learners do not make.
Therefore, in some cases the computational complexity of learning can increase when moving from
passive learning to active learning. At the same time however, the sample complexity of learning
reduces considerably. This means that we shift the workload from the teacher to the learner and
from the generalization (inference) phase to the training phase. This makes perfect sense since
the teacher is typically a human while the learner is a machine; thus, active learners require less
human labor but may require more computing effort while training.
We discuss two active learning frameworks in this work. In Part II we discuss the membership
queries framework and in Part III we discuss selective sampling. The difference between these
frameworks is in the type of control the learner is assumed to have over the learning process. In
the Membership Queries framework [3] the learner is allowed to pose questions to the teacher. These
questions are presented as instances and the teacher is queried for the labels of these instances.
The selective sampling framework [29] is more restrictive. The learner is presented with unlabeled
instances and may query for the labels of a subset of these instances. This framework subdivides
into two varieties: the batch framework, which we call selective sampling whenever this is not
confusing, and the on-line framework, which is called label efficient learning [54].
Alternative active learning frameworks do exist. In the Equivalence Query model [3] the learner
can present a hypothesis to the teacher. The teacher can either accept this hypothesis as a good
one, or reject it while presenting an instance on which it deviates from the target concept. Another
model, experiment design (see e.g. [7]) is being studied extensively by statisticians. In this model,
the problem at hand is a regression problem, and the learner is allowed to select the experiment to
run. Although this can be viewed as an active learning framework, it is not proactive learning as
the learner does not refine the selection of experiments based on the results of previous experiments.
1.7 Other Learning Models
As mentioned earlier, this essay focuses on the supervised learning framework. However, this is
not the only way in which learning occurs.
1.7.1 Unsupervised Learning
Unsupervised learning is an important type of learning. The goal in unsupervised learning is to
find structure in data. The learner is given data x1 , . . . , xm and is required to find a concise
representation of the data. A good representation is a small representation that captures the
significant properties of the data. Two popular ways of finding these representations are clustering
and dimensionality reduction.
In clustering, the learner groups the instances into clusters. The goal here is to find a small
number of clusters such that instances within the same cluster are closer, or more similar, to
each other than to instances from different clusters. There are many ways of achieving this
goal (see e.g. [14, 96, 121]), but the problem is ill-posed [67] since similarity between points can be
measured in many different ways. Nevertheless, clustering is a powerful tool.
Another unsupervised learning method is dimensionality reduction. In dimensionality reduction, the learner finds a new representation of the data that is low dimensional but at the same time
close to the original representation. For every instance x_i the learner retains only a few properties
φ_1(x_i), . . . , φ_d(x_i) such that the φs capture most of the important attributes of x_i. Principal
Component Analysis (PCA) is a representative of this family of learning techniques [61].
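A minimal PCA sketch of this idea, in which the retained properties φ_1(x_i), . . . , φ_d(x_i) are the projections of x_i onto the d directions of largest variance (illustrative code, not a method developed in this dissertation):

    import numpy as np

    def pca_reduce(X, d):
        # project the instances (rows of X) onto the d directions of
        # largest variance
        Xc = X - X.mean(axis=0)                 # center the data
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        phi = Vt[:d]                            # top-d principal directions
        return Xc @ phi.T                       # phi_1(x_i), ..., phi_d(x_i)

    X = np.random.default_rng(1).normal(size=(100, 10))
    print(pca_reduce(X, d=2).shape)  # -> (100, 2)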
1.7.2 Reinforcement Learning
Another important type of learning is reinforcement learning [62]. Consider for example the problem of navigating a robot in a maze. At each junction, the robot has to decide the direction to
take. The decision made by the robot has long term consequences, since the parts of the maze the
robot will see depend on the decisions it makes.
In the more general setting, reinforcement learning assumes an underlying state machine. At
each point in time, the learner decides on the action it takes. As a result of this action, the
learner receives a reward, and the state of the machine changes. The reward and the new state are
stochastic functions of the current state and the action taken by the player.
Active learning is a combination of supervised learning, unsupervised learning, semi-supervised
learning and reinforcement learning. We use both unlabeled and labeled data, and we make decisions that
affect future events, since the queries the learner issues affect the learning process. However, the
main focus of this work is in viewing active learning as an extension of supervised learning.
Part II
Membership Queries
Chapter 2
Preliminaries
Active learners have some control over the learning process. Passive learners can observe the
training data but cannot alter it, whereas active learners can direct the teacher to what the
learner considers to be the most interesting cases. The capability to play an active role in the
training process gives the learner much more latitude for action than passive learners have. It also more
closely resembles the way humans learn. Human learning is a bi-directional process [18, 93, 118].
A good teacher needs to adjust his or her mode of instruction to the student’s prior knowledge
and state of mind. It is a well established fact that a teacher who is not tuned to feedback from
the students will not be able to teach effectively [18, 93, 118].
When trying to design a framework for computerized active learning, we need to define the
way bi-directional communication between learner and teacher takes place. The first active learner
framework explored here is the Membership Queries (MQ) framework [3, 4]. In this framework,
the learner is allowed to direct questions to the teacher.
Definition 2.1 Let X be a sample space. A membership query is an instance x ∈ X . The
teacher’s response to such a query is the label y associated with x.
A learning algorithm makes a membership query, much like humans ask their teachers questions.
The membership query oracle is very powerful since it allows the learning algorithm to query for
the label of any instance.
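Definition 2.1 amounts to a very simple interface between learner and teacher. A hypothetical sketch, in which the teacher is simulated by a target concept (in practice the oracle could be a human labeler or another program):

    class MembershipOracle:
        # a teacher answering membership queries (Definition 2.1);
        # here it is simulated by a target concept it holds
        def __init__(self, target):
            self._target = target
            self.queries = 0

        def query(self, x):
            # the learner may ask for the label of ANY instance x
            self.queries += 1
            return self._target(x)

    oracle = MembershipOracle(lambda x: 1 if x >= 0.5 else -1)
    print(oracle.query(0.7), oracle.queries)  # -> 1 1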
Figure 2.1: An illustration of a Boolean circuit. This circuit has three inputs: I1 , I2 and I3
and a single output marked by O. The circuit contains two AND gates and a single OR gate.
The circuit has a depth of two.
2.1 The Power of Membership Queries
The power of membership queries has been demonstrated in hundreds of papers. This section
looks at a few key articles, all of which show that membership queries can solve problems that
are difficult to tackle. Enumerating all the tasks in which membership queries have been used to
provide solutions is beyond the scope of this dissertation. The tasks examined here illustrate the
variety and diversity of applications of membership queries.
2.1.1 Constant Depth Circuits
There are many ways to represent Boolean functions: truth tables, logical formulas, Karnaugh maps
and others. A Boolean circuit is another representation of a Boolean function which captures the
engineering point of view. A circuit is made of gates that are wired together. The gates are the
atoms of this structure. A gate can perform a simple task: it receives one or more inputs and
generates an output. The output of the gate can be wired to the input of other gates. Each gate
can be connected to any other gate provided that the directed graph, which describes the circuit, is
acyclic¹. Through the right choice of gates and wiring, a circuit can perform sophisticated Boolean
functions. See figure 2.1 for an illustration of a Boolean circuit.
¹ Such a graph is called a DAG, which stands for Directed Acyclic Graph.
Boolean circuits play an important role in electronics and in theoretical computer science.
Linial, Mansour and Nisan proved the following about learning such circuits:
Theorem 2.1 [81]
Let c : {−1, 1}^n → {−1, 1} be a Boolean function which is computable by a circuit of size S
and depth d using AND and OR gates. Let ǫ, δ > 0. Then there exists a learning algorithm which
generates a hypothesis h such that, with probability 1 − δ, the hypothesis h is ǫ-close to c. The
algorithm works in time poly( n^{(14 log(S/ǫ))^{d−1}}, log(1/δ) ) and uses membership queries.
In this celebrated result, the Fourier transform of the concept c is analyzed and used to generate
the hypothesis h. Membership queries are necessary for this algorithm to work. It is not known how
such circuits can be learned in poly (n) time without membership queries.
2.1.2 Decision Trees
Decision trees are an important tool in artificial intelligence. The main advantage of decision trees
over many other models used in artificial intelligence is their lucidity: a concept that
is presented as a decision tree is easily understood by humans.
A decision tree has a condition term on each of its internal nodes. Each leaf of the tree contains
one of the possible outputs. Once an input is presented to a tree, it is matched against the condition
at the root of the tree. If the condition is satisfied, we move to the left sub-tree otherwise we move
to the right sub-tree. We match the input against the root of the chosen sub-tree and depending on
the outcome we move either left or right. This process continues until we reach a leaf of the tree.
In that case we report the value at the leaf as the outcome of the tree calculation. See figure 2.2
for an illustration of a decision tree. For more about this subject see [106].
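The evaluation walk just described is naturally recursive. A minimal sketch (the Node layout is our own illustrative choice, not a structure from the algorithms cited below):

    class Node:
        # an internal node holds a condition and two sub-trees;
        # a leaf holds only an output value
        def __init__(self, condition=None, left=None, right=None, value=None):
            self.condition = condition
            self.left, self.right = left, right
            self.value = value

    def evaluate(tree, x):
        if tree.value is not None:       # reached a leaf: report its value
            return tree.value
        if tree.condition(x):            # condition satisfied: go left
            return evaluate(tree.left, x)
        return evaluate(tree.right, x)   # otherwise: go right

    # the function (x1 >= 0) AND (x2 >= 0) as a depth-two decision tree
    tree = Node(condition=lambda x: x[0] >= 0,
                left=Node(condition=lambda x: x[1] >= 0,
                          left=Node(value=+1), right=Node(value=-1)),
                right=Node(value=-1))
    print(evaluate(tree, (0.5, -2.0)))   # -> -1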
The problem of learning a decision tree is fundamental in artificial intelligence and machine
learning. The main algorithms for learning decision trees include ID3 [101], C4.5 [102], and CART
[21]. However, all these algorithms fail to meet PAC requirements. Even in a noise free environment, these algorithms can learn a tree which is exponentially bigger than the smallest possible
tree [64]. This is a major bottleneck in the theoretical analysis of these algorithms. Alternative
algorithms have been designed for learning decision trees with accompanying theoretical analysis.
These algorithms include an algorithm by Kushilevitz and Mansour [71] which is based on Fourier
analysis, and an algorithm by Bshouty [23] which is based on monotonicity. These algorithms use
membership queries, which enables them to have performance guarantees.
Figure 2.2: An illustration of a decision tree. The decision tree in this illustration has six
internal nodes, each associated with a condition C1, . . . , C6. It has seven leaves, each
associated with either TRUE (+1) or FALSE (−1).
2.1.3 Intersections of Halfspaces
Intersections of halfspaces, or polytopes, are very interesting geometric objects. Learning such
objects is interesting both with regard to their geometric representation and because of the reduction of
learning DNF formulas to this problem (see e.g. [66] for the reduction of Boolean formulas to
geometric concepts).
State of the art algorithms for learning intersections of halfspaces [68] require O( n (t/ρ)^{t log(t log(1/ρ))} )
instances, where n is the dimension, t is the number of halfspaces and ρ is a margin term. However,
once membership queries are allowed, Kwek and Pitt [72] showed that poly(n, t) instances suffice
for learning in this setting.
2.2 The Limitations of Membership Queries
Membership queries have provided machine learning theorists with a phenomenal tool for proposing
algorithms and for analysis. However, when it comes to most real world problems,
membership queries fall short. This failure has been demonstrated by Lang and Baum in their
paper entitled “Query Learning can Work Poorly When a Human Oracle is Used ” [73]. In this
paper, the authors tried to apply an algorithm for learning 2 layer neural networks presented earlier
by Baum [12]. The main idea behind this algorithm is to take two instances with alternating labels
and use membership queries on instances that are on a path connecting these instances. By doing
so, the algorithm can find the exact transition point where the label changes.

Figure 2.3: Handwritten character recognition using membership queries [73]. The
lower left and right corners are images of the figures “7” and “5”. The rest of the images
represent combinations of these two figures. Note that some of these images are neither “7” nor
“5”. Some of them do not look like any figure.
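The core step of this idea can be sketched as a binary search along the segment between a negatively and a positively labeled instance, querying the membership oracle at the midpoint. This is a hedged sketch of the idea, not Baum's exact procedure:

    import numpy as np

    def find_transition(x_neg, x_pos, oracle, tol=1e-6):
        # binary-search the segment [x_neg, x_pos] for the point where
        # the teacher's label flips, using membership queries
        lo, hi = 0.0, 1.0                       # parametrize the segment
        while hi - lo > tol:
            mid = (lo + hi) / 2
            x_mid = (1 - mid) * x_neg + mid * x_pos
            if oracle(x_mid) == 1:              # query the teacher
                hi = mid                        # boundary lies below mid
            else:
                lo = mid
        return (1 - hi) * x_neg + hi * x_pos

    oracle = lambda x: 1 if x.sum() >= 1 else -1
    x = find_transition(np.zeros(2), np.ones(2), oracle)
    print(x)  # close to the boundary x1 + x2 = 1, i.e. about [0.5, 0.5]

Note that the midpoints x_mid are exactly the synthetic queries which, when the instances are bitmaps of digits, produced the unanswerable images of Figure 2.3.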
Lang and Baum [73] tried to apply Baum’s algorithm [12] to the task of recognizing handwritten
digits. In this task, a bitmap that is a digital representation of a handwritten character needs to be
identified as one of the digits 0-9. The authors expected that the novel learning algorithm would
generate extremely accurate hypotheses by identifying the exact boundaries between the different
digits. Unexpectedly, the experiment failed. The cause of this failure was that for many of the
queries the algorithm generated, the teacher could not provide any answer. Figure 2.3 presents a
demonstration of this problem. Two images of the figures “7” and “5” were used to generate a
handful of queries for images which are combinations of the original images. However, many of
these queries are neither “7” nor “5”. Some do not resemble any figure at all. This led Lang and
Baum to the conclusion that query learning can work poorly when a human oracle is used, as the
title of their paper suggests.
The reason for this failure lies in the fact that not all images are valid representations of handwritten
figures. The computer views such an image as an array of numbers which represents gray
levels. However, most of these arrays do not represent any figure at all. This phenomenon is not
unique to the problem of handwritten digit recognition. On the contrary, we expect this to occur in
most applications in which the oracle is human. Consider for example the task of medical diagnosis.
In this case, when a computer generates medical files and lab results, it will most likely lack the
consistency of medical files and lab results of human beings. Moreover, a physician who needs to
“label” these instances may need to see the patient to conduct other medical examinations, but
since such a patient does not exist, the whole process will fail.
Another limitation of membership queries is the fact that there are problems in which even
membership queries will not allow us to learn in a reasonable time. [5] showed that under common
cryptographic assumptions² there are problems with a finite VC dimension but no polynomial learning
algorithm.
2.3 Summary
Membership queries are a powerful tool for the analysis and development of machine learning.
They have inspired many authors and provide a way to evaluate the limits of learning algorithms.
However, when it comes to real world applications, membership queries usually fall short.
This said, there are vital tasks to which membership queries can be applied. Recently, learning
algorithms which use membership queries have been applied successfully to verification problems
[44]. In these cases, the teacher (oracle) is not a human being but rather a machine itself. This
makes it possible to overcome the problem of using membership queries with human oracles. In
the next chapter we present a method of overcoming noise while learning. This method uses
membership queries in its core. It enables us to study fundamental problems in learning in the
presence of noise.
² It suffices to assume that there is a one-way function, i.e. a one-to-one function that is easy to
compute but hard to invert.
Chapter 3
Noise Tolerant Learnability using Dual Representation
Much of the research in machine learning and neural computation assumes the existence of a perfect
teacher, one who gives the correct answers to the learning algorithm. However, in many cases this
assumption is faulty, since different sorts of noise may prevent the teacher from providing the correct
answers. This noise can be caused by noisy communication, human errors, measuring equipment
and many other distortion sources. In some cases, problems which are efficiently learnable without
noise become hard to learn when noise is introduced [11]. In other cases, it is possible to learn
efficiently even in the presence of noise (see e.g. [63]). However, no simple parameters are known
to distinguish between classes that are learnable in the presence of noise and those which become
hard to learn.
In this chapter we introduce a noise cleaning procedure. Our procedure is capable of generating
a clean sample even when the data source is corrupted with noise. In order to generate the noise
free sample we exploit the structure of the dual learning problem. In the dual learning problem the
teacher has an instance in mind and the goal of the learner is to approximate it by having access
to the labels several classifiers assign to it. For any instance whose label we would like to query,
we generate an approximation set consisting of many instances which are close to it. We query
for the labels of the instances in the approximation set, assuming we have access to a Membership
Query oracle, and use a majority vote to label the instance we are interested in.
In the study below we show that the noise cleaning procedure works as long as the dual
learning problem is learnable and dense. Thus any learning problem, for which these criteria hold,
can be learned efficiently in the presence of noise. We show that these assumptions are valid
for a variety of learning problems, such as smooth functions, general geometrical concepts, and
monotone monomials. We are particularly interested in the analysis of smooth function classes.
We show that there is a uniform upper bound on the fat-shattering dimension of both the primal
and dual learning problems, derived from a geometric property of the class called type. We
also show how the dual learning problem is related to the dual Banach space which is an important
tool in functional analysis.
The work presented in this chapter is based on joint research with Shai Fine, Shahar Mendelson
and Naftali Tishby.
3.1 Learning in the presence of noise
In many learning models (e.g. PAC) it is assumed that the learner has access to an oracle (a teacher)
which picks an instance from the sample space and returns this instance and its correct label. In
real world problems, the existence of such an oracle is doubtful: human mistakes, communication
errors and various other problems make such an oracle unfeasible. In these cases, a realistic
assumption is that the learner has access to some sort of a noisy oracle. This oracle may make
internal mistakes which prevent it from consistently generating the correct labels, or which even
influence its ability to randomly select points from the sample space.
The VC dimension [128, 17] completely characterizes PAC learnability of Boolean functions in
terms of the size of the sample needed. A class with a finite VC dimension can be learned from a
finite sample whose size depends on the VC dimension, the accuracy and the required confidence
level. However, the computational complexity of learning is not controlled by the VC dimension.
In fact there are classes with a finite VC dimension for which learning is NP-complete [65].
Things become even more complicated when we no longer assume the existence of a perfect
oracle, i.e. a noise free oracle. We weaken this assumption: the true label is obtained only with a
probability of 1 − η. We further assume that the oracle is consistent in the sense that if the label
of x was requested twice, then the oracle will produce the same result. This model is called the
“persistent random classification noise” model [58].
Classes with a finite VC dimension are learnable in the presence of persistent random classification noise in terms of the sample size needed (see e.g. [6] chapter 4). However, the computational
complexity of the learning task can change dramatically¹. If learning in the noise free case is
unfeasible, then it will remain so in the noisy case. However, there are cases in which the noise
free problem is efficiently learnable, while learning in the noisy environment is unfeasible [11, 13].
The gap between the noise-free case and the noisy case appears not only in the PAC model, but
also occurs in other models such as the online learning model [82].
Here we present a procedure which converts noisy oracles to noise-free oracles. In order for
our procedure to work, the dual learning problem needs to be learnable and dense. These criteria
characterize learning problems which are efficiently learnable in the presence of noise. The
procedure we introduce is fairly simple. Given an instance for which the learner would like to know its
label, we generate an approximation set. This set consists of instances for which we have reason
to believe the target concept assigns the same label as it assigns to the instance we are interested
in. We use the majority vote among the labels of the instances in the approximation set to deduce
the label of the instance we are interested in.
The approximation set is generated by using the dual learning problem. In the dual learning
problem the instances and hypotheses switch roles. We learn an instance by looking at the labels
different hypotheses assign to it. We need to be able to learn efficiently in the dual learning
problem, and we need the dual learning problem to be dense for this scheme to work. For this
purpose, we need to work in a Bayesian setting in which there is a known probability measure over
the concept class from which the target is chosen. When these conditions are present, noise can
be filtered out by a simple procedure, which makes learning in presence of noise possible.
More formally, the main result is the following: Let C be a concept class endowed with a
probability measure ν. Assume that the target concept c∗ ∈ C was selected according to ν. Further
assume that both the primal and dual learning problems are efficiently learnable in the noise free
model and that the dual learning algorithm is dense. Then the noisy oracle can be converted to a
noise-free oracle and learning in the presence of noise can take place.
3.2 The Dual Learning Problem
A learning problem may be characterized by a tuple ⟨X, C⟩ where X is the sample space and C is the
concept class. Learning can be viewed as a two player game: one player, the teacher, picks a target
concept, while his counterpart, the learner, tries to identify this concept. Different learning models
differ in the way the learner interacts with the teacher (PAC Sampling Oracle, Statistical Queries,
Membership Queries, Equivalence Queries, etc.) and the method used to evaluate performance.
¹ When the noise is non-persistent, there is no complexity gap between learning with or without
noise if membership queries are allowed [107]. However, in many cases the noise is indeed persistent.
Every learning problem has a dual learning problem [99] which may be characterized by the
tuple ⟨C, X⟩. In this representation the learning game is reversed: first the teacher chooses an
instance x ∈ X and then the learner tries to approximate this instance by querying the value x
assigns to different concepts. We view an instance x as the evaluation function δ_x on C such that
δ_x(c) = c(x). We denote by X** the set of these evaluation functions:

X** = {δ_x : x ∈ X}
To clarify this notion we present two dual learning problems:
• Let X be the interval [0, 1] and set C to be the class of all intervals [0, a] where a ∈ [0, 1]. If
x is an instance and ca is the hypothesis [0, a] then ca (x) = 1 if and only if a ≥ x. Turning
to the dual learning problem, note that δx (ca ) = 1 if and only if x ≤ a. Hence, the dual
learning problem is equivalent to learning intervals of the form [x, 1] where x ∈ [0, 1].
• Let X = IRⁿ and let C be the class of linear separators, i.e., c_w ∈ C is the concept which
assigns to each x ∈ IRⁿ the label sign(x · w). The dual learning problem thus becomes a
problem of learning linear separators and hence this problem is dual to itself.
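To make the duality concrete, the following is a minimal sketch in Python of the dual game for
the interval class above (all names here are hypothetical, chosen for illustration): the teacher
fixes an instance x, the learner draws concepts c_a from a uniform prior over C, observes the
labels δ_x(c_a) = c_a(x), and returns a point in the remaining consistent bracket.

    import random

    def dual_learn_interval(x, num_concepts=1000, seed=0):
        """Approximate a hidden instance x in [0, 1] from the labels it assigns
        to randomly drawn concepts c_a = [0, a], where delta_x(c_a) = 1 iff x <= a."""
        rng = random.Random(seed)
        lo, hi = 0.0, 1.0                # current bracket on x
        for _ in range(num_concepts):
            a = rng.random()             # draw a concept [0, a] from the uniform prior
            if x <= a:                   # the label delta_x assigns to c_a is 1
                hi = min(hi, a)          # x lies below a: shrink the bracket from above
            else:                        # the label is 0
                lo = max(lo, a)          # x lies above a: shrink the bracket from below
        return (lo + hi) / 2.0           # any point in the bracket is consistent

    # Example: recover x = 0.3 up to the resolution of the sampled concepts.
    print(dual_learn_interval(0.3))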
The VC-dimension of the dual learning problem (also called the co-VC dimension) obeys the
inequalities

⌊log₂ d⌋ ≤ d* ≤ 2^{d+1}        (3.1)

where d is the VC-dimension of the primal problem and d* is the co-VC dimension (see lemma 3.1 on page 35).
As can be seen, the gap between the complexities of the primal and dual learning problem can
be exponential. However, in both examples presented here and in most realistic cases this gap is
only polynomial. Therefore, our assumption that the dual learning problem is efficiently learnable
holds in many if not most of the interesting cases (see Troyansky’s thesis [123] for a survey of dual
representation). In section 3.6 we broaden the discussion to handle regression problems in which
the concepts assign a real value to each instance rather than a Boolean value as in the classification
case. Thus, we replace the notion of VC-dimension with the fat-shattering dimension. We show
that for classes which consist of sufficiently smooth functions, both the fat-shattering and the
co-fat-shattering dimensions have an upper bound which is polynomial in the learning parameters,
which enables efficient learning of the dual problem.
3.3 Dense Learning Problems
A learning problem is dense [42, 113] if every hypothesis has many hypotheses which are close but
not equal to it:
Definition 3.1 Let X be a sample space, C a concept class and D a distribution over the instances.
The learning problem defined by the triplet ⟨X, C, D⟩ is dense if for every c ∈ C and every ε > 0
there exists c′ ∈ C such that

0 < Pr_{x∼D}[c(x) ≠ c′(x)] < ε
The density property is distribution dependent: for every learning problem there exists a distribution
under which the resulting learning problem is not dense. In fact, if the distribution is supported
on a finite set, the problem cannot be dense. Indeed, if a learning problem is dense, every hypothesis
can be approximated by an infinite number of distinct hypotheses; thus finite concept classes are
not dense according to definition 3.1.
We would like to extend the definition of a dense learning problem to finite hypothesis classes
as well. We replace the demand of approximating c for every ε > 0 by the same demand for every
ε that is inverse-polynomial in n. In the definition above, the requirement is that each hypothesis
can be approximated by an infinite number of hypotheses; in the finite case we replace the infinity
assumption by a super-polynomial number of approximating hypotheses.
Definition 3.2 Let X_n be a sample space, C_n a concept class and D_n a distribution over the
instances. The sequence of learning problems {⟨X_n, C_n, D_n⟩}_{n=1}^∞ is dense if for every
polynomial p(n) there exists N such that for every n > N and every c ∈ C_n there exists c′ ∈ C_n
such that

0 < Pr_{x∼D_n}[c(x) ≠ c′(x)] < 1/p(n)
In a dense class, every hypothesis can be approximated. Nevertheless, even in dense classes,
a learning algorithm might not use the diversity of the class. Therefore, the definition of density
should be extended to include properties of the algorithm being used:
Definition 3.3 Let X be a sample space, C a concept class and D a distribution over the instances.
The learning algorithm L : (X × Y)* → C is dense with respect to D if for every m > 0 and c ∼ C,

Pr_{S_1∼µ^m, S_2∼µ^m}[ Pr_{x∼D}[L(S_1)(x) ≠ L(S_2)(x)] = 0 ] = 0

where µ = µ(D, c) is the distribution induced by D and c on X × Y.
Definition 3.3 applies to infinite learning problems. As before, we extend it to finite cases:
Definition 3.4 Let X_n be a sample space, C_n a concept class and D_n a distribution over the
instances. The sequence of learning algorithms L_n : (X_n × Y)* → C_n is dense with respect to
{D_n}_{n=1}^∞ if for every polynomial p(n) and every m there exists N such that for every n > N
and every c ∈ C_n,

Pr_{S_1∼µ^m, S_2∼µ^m}[ Pr_{x∼D_n}[L_n(S_1)(x) ≠ L_n(S_2)(x)] = 0 ] < 1/p(n)

where µ = µ(D_n, c) is the distribution induced by D_n and c on X_n × Y.
3.4 Noise Immunity Scheme
We now present our noise immunity scheme. This scheme immunizes any learning problem against
noise, provided that its dual learning problem is learnable and dense. The main idea is to generate
a noise free oracle and use it for learning. Let x be an
instance whose label we would like to know. Since we have access to a noisy oracle, querying for
the label of x does not produce good enough results. Furthermore, since the noise is persistent,
repeated sampling of the label will not provide any additional information. However, if there are
enough instances in the sample space to which the target concept assigns the same label as it
assigns to x, we can sample the labels of these instances and use majority vote to deduce the
label of x. The problem is to identify these instances. For this purpose, we use the dual learning
problem. The requirements of learnability and density ensure that with high probability, for any
instance x, the dual learning algorithm will find many instances x′ such that almost all concepts
assign them the same label. Since the learner of the primal learning problem knows the
dual target x and the probability measure on C (the Bayesian assumption), it can provide a clean
sample to the dual learning problem. Hence, the dual learning problem is noise free and therefore
far easier. The algorithm is detailed in algorithm 1.
In the following theorem we prove the efficiency of the noise cleaning algorithm.
Theorem 3.1 Assume that the dual learning algorithm is dense and has PAC guarantees. With
probability 1 − δ the noise cleaning algorithm (algorithm 1) returns the correct label of x. The
computational complexity of the algorithm is poly(d*, log(1/δ), 1/|1 − 2η|).
As stated in theorem 3.1, the noise cleaning algorithm is polynomial in its parameters, and with
high probability it returns the true label, which can then be used to learn the original learning
problem.
Algorithm 1 Noise Cleaning Algorithm

Inputs:
• Confidence parameter 1 − δ.
• The VC dimension d* of the dual learning problem.
• A bound η on the noise level.
• An instance x.

Output:
• A label y of x.

Algorithm:
1. By simulating the dual learning problem k = (2/(1 − 2η)²) ln(2/δ) times, generate an ensemble
S = {x_1, ..., x_k} by applying the approx function (see below) to the point x with accuracy 1/4
and confidence δ/(2k).
2. Use MQ to get a label y_i for each x_i.
3. Let y be the majority vote over the y_i's.

Function approx

Inputs:
• A point x.
• Required accuracy ε̂.
• Required confidence 1 − δ̂.

Output:
• A point x′.

Algorithm:
1. Let m = poly(1/ε̂, log(1/δ̂), d*).
2. Using the prior ν over the concept class C, generate a sample c_1, ..., c_m.
3. Assign to every c_i the label c_i(x).
4. Apply the learning algorithm of the dual learning problem to the labeled sample generated in
steps 2 and 3 to generate x′.
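The control flow of Algorithm 1 can be rendered as a short Python sketch. This is only an
illustration under stated assumptions: the dual learner is abstracted as a callable approx (as in
the pseudocode above), noisy_mq stands for a persistent noisy membership oracle, and all names
are hypothetical.

    import math
    import random

    def noise_cleaning(x, approx, noisy_mq, eta, delta):
        """Return a denoised label for x: build an approximation set with the
        dual learner, query each point once, and take a majority vote."""
        k = max(1, math.ceil(2.0 / (1.0 - 2.0 * eta) ** 2 * math.log(2.0 / delta)))
        ensemble = [approx(x, 0.25, delta / (2.0 * k)) for _ in range(k)]
        votes = [noisy_mq(z) for z in ensemble]   # one (possibly corrupted) label each
        return 1 if 2 * sum(votes) > len(votes) else 0

    # Toy usage: the target concept is the interval [0, 0.5]; labels are flipped
    # with rate 0.2, persistently (repeated queries give the same answer).
    rng = random.Random(0)
    target = lambda z: 1 if z <= 0.5 else 0
    flips = {}
    def noisy_mq(z):
        if z not in flips:
            flips[z] = target(z) if rng.random() > 0.2 else 1 - target(z)
        return flips[z]
    approx = lambda x, eps, conf: min(1.0, max(0.0, x + rng.uniform(-0.01, 0.01)))
    print(noise_cleaning(0.3, approx, noisy_mq, eta=0.2, delta=0.05))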
Proof (of theorem 3.1). We begin by showing that the function approx, when called with ε̂, δ̂
and x, returns x′ such that with probability 1 − δ̂, Pr_{c∼ν}[c(x) ≠ c(x′)] < ε̂. This is simply
the definition of learning in the dual learning problem. Note also that by the assumption that the
dual learning algorithm is dense (see Definitions 3.3 and 3.4)², it follows that x′ ≠ x with
probability 1.
By the choice of parameters made by the noise cleaning algorithm, it follows that with probability
1 − δ/2, for every choice of 1 ≤ i ≤ k the instance x_i is indeed a good approximation of x, in the
sense that

Pr_{c∼ν}[c(x_i) ≠ c(x)] < 1/4        (3.2)
Assume that (3.2) holds for all i. When we query for the label of x_i the correct label is returned
with probability 1 − η, independently of whether c(x_i) = c(x) or c(x_i) ≠ c(x). Hence, with
probability greater than

(3/4)(1 − η) + (1/4)η = 3/4 − η/2

we will obtain the correct label of x.
Using Hoeffding's inequality and the fact that all the x_i are chosen independently, we see that the
probability that the majority vote fails to predict the label of x is smaller than e^{−2(1/4 − η/2)² k},
and by the choice of k this is smaller than δ/2.
The procedure presented can fail in two cases: the x_i's generated do not form a good approximation
set or, alternatively, too many of the labels of the x_i's are corrupted. Each of these events occurs
with probability less than δ/2, and thus the whole process succeeds with probability 1 − δ.
The computational complexity of the algorithm follows easily from the definition of the algorithm.
In the following sections we present a variety of problems to which the paradigm just presented
is applicable. We demonstrate it both on continuous classes such as neural networks and finite
classes such as monotone monomials. However, there are classes to which the paradigm cannot be
applied. For example, consider the class of parity functions. Although this class is dual to itself,
and thus has a moderate co-VC dimension, it is not dense and therefore fails to meet the requirements.
² If the learning problem is finite, then this holds for large enough n as defined in definition 3.4.
3.5 A Few Examples
In this section we discuss a few classes which have the properties required by theorem 3.1, so that
the noise cleaning algorithm can be applied to them.
3.5.1 Monotone Monomials
The first problem we present is a Boolean learning problem of a discrete nature: the problem
of learning monotone monomials. In this problem the sample space is {0, 1}^n and the concepts
are conjunctions (e.g. x(1) ∧ x(3) ∧ x(4)). The hypothesis class is C = {c_I : I ⊆ {1, ..., n}}
where c_I(x) = ∧_{i∈I} x(i). To simplify the notation we identify x with a subset of {1, ..., n}
and c with a subset of {1, ..., n}, so that c(x) = 1 ⟺ c ⊆ x. The dual learning problem for this
class is learning monotone monomials with the reverse order, i.e., x(c) = 1 ⟺ x ⊇ c. Both the
primal and the dual learning problems have the same VC-dimension, n.
Instead of showing that the dual class is dense, we give a direct argument showing that the label
of each instance can be approximated. Let Z_x = {z : x ⊆ z ⊆ {1, ..., n}}. Since the concept
class is monotone, if c(x) = 1 then for every z ∈ Z_x, c(z) = 1. On the other hand, if c(x) = 0
then there exists some i ∈ c \ x; half of the instances in Z_x have i ∉ z, implying that c(z) = 0
for each such z. Thus, Pr_{z∈Z_x}[c(z) = 0] ≥ 1/2 with respect to the uniform distribution on Z_x.
Hence, if c(x) = 1 then Pr_{z∈Z_x}[c(z) = 0] = 0, whereas if c(x) = 0 then Pr_{z∈Z_x}[c(z) = 0] ≥ 1/2.
This allows us to distinguish between the two cases. In order to do the same in the presence of
noise we have to require that Z_x is big enough. From the definition of Z_x it follows that
|Z_x| = 2^{n−|x|}. It suffices to require that |x| ≤ pn for some p < 1 with high probability, since
in this case |Z_x| is exponentially large. This condition holds with high probability under the
uniform distribution and many other distributions.
Note that in this case there is no need for a Bayesian assumption, i.e., we do not assume the
existence of a distribution on the concept class. Moreover, the dual learning problem reduces in
this case to a simple sampling procedure over Z_x. However, we have used a slightly relaxed definition
of density in which for most of the instances there exists a sufficient number of approximating
instances.
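This argument translates directly into code. A minimal Python sketch (hypothetical names; the
threshold assumes a noise rate well below 1/4): sample random supersets z ⊇ x and test whether a
noticeable fraction of them is labeled 0.

    import random

    def denoise_monomial_label(x, n, noisy_mq, num_samples=200, seed=0):
        """x is a subset of {1,...,n}; noisy_mq(z) returns a noisy value of
        c(z), where c(z) = 1 iff the hidden monomial c is a subset of z."""
        rng = random.Random(seed)
        free = [i for i in range(1, n + 1) if i not in x]
        zeros = 0
        for _ in range(num_samples):
            z = frozenset(x) | {i for i in free if rng.random() < 0.5}  # uniform over Z_x
            zeros += 1 - noisy_mq(z)
        # If c(x) = 1, supersets of x are labeled 0 only by noise (rate ~ eta);
        # if c(x) = 0, at least ~1/2 of them are truly labeled 0. Threshold between.
        return 0 if zeros / num_samples > 0.3 else 1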
3.5.2 Geometric Concepts
In contrast to the previous example, when dealing with geometric concepts, the sample space and
the concept class are both infinite. For the sake of simplicity let the sample space be IR2 and assume
that the concept class consists of axis aligned rectangles. In this case, the VC-dimension of the
primal problem is 4 and the dual problem has VC-dimension 3. Moreover if a “smooth” probability
measure is defined on the concept class, it is easily seen that each instance is approximated by all
the instances within distance r of it (where r depends on the defined measure and the learning
parameters). Therefore, this class is dense.
This example can be extended to a large variety of problems, such as neural networks, general
geometric concepts [24] and high dimensional rectangles. We describe two methods of doing so
below.
Using the Bayesian Assumption: The first method uses the Bayesian assumption. Each
geometric concept divides the instance space into two sets. The edge of these sets is the decision
boundary of the concept. Assume that for every instance there is a ball around it which does not
intersect the decision boundary of “most” of the concepts. Denote by ν a probability measure on
C, and assume that for every δ > 0 there exists r = r(δ, x) > 0 such that

Pr_{c∼ν}[B(x, r) ∩ ∂c ≠ ∅] < δ        (3.3)

(B(x, r) is the ball of radius r centered at x and ∂c is the decision boundary of c). If (3.3) holds
then all the points in B(x, r) can be used to predict the label of x, and therefore to verify the
correct label of x.
Geometric Concepts without Bayesian Assumption: A slightly different approach can be
used when there is no Bayesian assumption but the distribution over the sample space is
non-singular. Given δ > 0, for every concept c there exists a distance r_c > 0 such that the
measure of all points within distance r_c of the decision boundary of c does not exceed δ. If
0 < r = inf_{c∈C} r_c, then with high probability (over the instance space) a random point x is at
distance greater than r from the decision boundary of the target concept c. Hence, the ball of
radius r around x can be used to select approximating instances.
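Both variants reduce to the same sampling step, sketched below in Python under the assumption
that a suitable radius r is already known (from (3.3) or from the infimum above; the names are
hypothetical):

    import math
    import random

    def ball_vote(x, r, noisy_mq, k=101, seed=0):
        """Denoise the label of x in the plane by majority vote over k points
        drawn uniformly from the ball B(x, r), which is assumed to avoid the
        decision boundary of most concepts."""
        rng = random.Random(seed)
        votes = 0
        for _ in range(k):
            theta = rng.uniform(0.0, 2.0 * math.pi)
            rho = r * math.sqrt(rng.random())   # sqrt makes the point uniform in the disk
            z = (x[0] + rho * math.cos(theta), x[1] + rho * math.sin(theta))
            votes += noisy_mq(z)
        return 1 if 2 * votes > k else 0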
3.5.3 Periodic Functions
In this example we present a case where the approximating instances are not in the near
neighborhood of x. Let X = IR and set

C = { x ↦ sign(sin(2πx/p)) : p is prime }
Since the number of primes is countable, the probability measure on C is induced via a measure
on IN. Note that C consists of periodic functions, but for each function, the period is different.
Given a point x ∈ IR, and a confidence parameter δ, there is a finite set of concepts A, such that
ν (A) ≥ 1 − δ. Since the set A is finite, the elements of A have a common period. Therefore, there
is some t, such that for every c ∈ A and every m ∈ IN, c (x) = c (x + mt). It is reasonable to
assume that the noise in the primal learning problem is not periodic (because the elements of the
class do not have a common period), therefore, it is possible to find many points which agree with
x with high probability, but are far away from a metric point of view. Moreover, using the same
idea, given any sample c_1, ..., c_k ∈ C, it is possible to construct an infinite number of points x_i
which agree with x on the given sample.
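A small sketch of this construction in Python, under the assumptions of the example (the set A of
primes carrying prior mass at least 1 − δ is taken as given; names are hypothetical):

    from math import prod

    def approximating_points(x, primes_A, num_points=10):
        """Points far from x that nevertheless agree with x on every concept
        sign(sin(2*pi*x/p)) with p in the finite, high-prior-mass set A."""
        t = prod(primes_A)              # a common period of all concepts in A
        return [x + m * t for m in range(1, num_points + 1)]

    print(approximating_points(0.7, [2, 3, 5]))   # 30.7, 60.7, 90.7, ...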
3.6 Regression
So far in this discussion, we have focused on binary classification problems. In this section, we
extend our discussion to regression, where the target concept is a continuous function.
For every given x we attempt to find “many” instances xi , such that with a high probability
f (xi ) is “almost” f (x). When the concepts are continuous functions, it is natural to search for
the desired xi near x. However, if there is no a-priori bound on the modulus of continuity of the
concepts, it is not obvious when xi is “close enough” to x. Moreover, in certain examples the
desired instances are not necessarily close to x, but may be found “far away” from it (e.g. periodic
functions as presented in section 3.5.3).
Algorithm 1 needs to be adjusted for the regression setting. We present this modified algorithm
in algorithm 2. The following theorem proves the correctness of this algorithm.
Theorem 3.2 Assume that the dual learning problem is dense. Assume also that the learning
problem is bounded, i.e. ∀c, x : |c(x)| ≤ 1. With probability 1 − δ the noise cleaning algorithm for
regression (algorithm 2) will return y such that |y − c(x)| < ε.
Proof. The proof is very similar to the proof of theorem 3.1. We begin by showing that the
function approx, when called with ε̂, δ̂ and x, returns x′ such that with probability 1 − δ̂ the
L1(ν) distance between δ_x and δ_{x′} is smaller than ε̂. This is simply the definition of learning
in the dual learning problem. Note also that by the assumption that the dual learning problem is
dense, it follows that x′ ≠ x with probability 1.
By the choice of parameters made by the noise cleaning algorithm, it follows that with probability
1 − δ/3, for every choice of 1 ≤ i ≤ k the instance x_i is indeed a good approximation of x, in the
sense that

‖δ_x − δ_{x_i}‖_{L1(ν)} < ε̂        (3.4)
Assume that (3.4) holds for all i. Therefore, for every γ > 0,

Pr_{c′∼ν}[ |c′(x) − c′(x_i)| > ε̂/γ ] < γ
Thus, using the parameters in the algorithm and applying the union bound³, we obtain that with
probability 1 − (2/3)δ (over the choice of the target concept and the internal randomization of the
function approx), the following property holds:

∀i : |c(x) − c(x_i)| ≤ ε        (3.5)
Assuming that (3.5) holds, we obtain that for each i,

Pr[y_i = c(x_i)] ≥ 1 − η

despite the noise. It suffices that more than half of the values y_i are correct, as this guarantees
that the median is not more than ε away from the true value (due to (3.5)). Using Hoeffding's
inequality and the fact that all the x_i are chosen independently, we see that the probability that
the median fails to be ε-close to c(x) is smaller than e^{−2((1−2η)/2)² k}, and by the choice of k
this is smaller than δ/3. This completes the proof.
3.6.1 Estimating VC_ε(C, X**)
In general, the question of learnability of the dual problem may be divided into two parts. The
first is to construct a learning algorithm L which assigns to each sample S_m = {f_1, ..., f_m} a
point x′ such that for every f_i, |f_i(x) − f_i(x′)| < ε. The second part is to show that the class
of functions X** = {δ_x : x ∈ X} on C satisfies some compactness condition (e.g., a finite
fat-shattering dimension VC_ε(C, X**)). We provided an answer to the first problem in the previous
section; we now address the second.

³ The union bound is very loose in this case. However, for the sake of brevity we use it here.

Algorithm 2 Noise Cleaning Algorithm for Regression

Inputs:
• Confidence parameter 1 − δ.
• The fat-shattering dimension d* of the dual learning problem.
• A bound η on the noise level.
• An instance x.
• Required accuracy ε.

Output:
• An approximation y of c(x).

Algorithm:
1. By simulating the dual learning problem k = (2/(1 − 2η)²) ln(3/δ) times, generate an ensemble
S = {x_1, ..., x_k} by applying the approx function (see below) to the point x with accuracy εδ/(3k)
and confidence δ/(3k).
2. Use MQ to get the value y_i of each x_i.
3. Let y be the median of the y_i's.

Function approx

Inputs:
• A point x.
• Required accuracy ε̂.
• Required confidence 1 − δ̂.

Output:
• A point x′.

Algorithm:
1. Let m = poly(1/ε̂, log(1/δ̂), d*).
2. Using the prior ν over the concept class C, generate a sample c_1, ..., c_m.
3. Assign to every c_i the value c_i(x).
4. Apply the learning algorithm of the dual learning problem to the labeled sample generated in
steps 2 and 3 to generate x′. When learning in the dual learning problem we require that with
confidence 1 − δ̂ the returned instance x′ is ε̂-close to x in the L1(ν) norm.
Let X ⊆ IR^d be infinite, and let (C, ‖·‖) be a subset of a Banach space (see section 3.8)
consisting of functions on X. Furthermore, assume that C has a reproducing kernel (see
definition 3.6 on page 36). In this case, the dual learning problem is always a linear learning
problem, since for any x ∈ X the functional δ_x(c) = c(x) is in C* and thus X** ⊆ C*.

We will show that if C is a bounded subset of a Banach space with a reproducing kernel and if
X** is a bounded subset of C*, then the fat-shattering dimension VC_ε(C, X**) is finite for every
ε > 0, provided that C has a non-trivial type, i.e. a type greater than 1.
Classical representatives of spaces with non-trivial type are the Sobolev spaces W^{k,p} (cf. [50]
for basic information regarding Sobolev spaces, or [1] for a comprehensive survey). For example,
W^{1,2}(0, 1) is the space of continuous functions f : [0, 1] → IR for which the derivative f′ belongs
to L2 with respect to the Lebesgue measure. The inner product in this space is defined by

⟨f, g⟩ = ∫₀¹ (fg + f′g′) dx
Mendelson [92] explored the relation between the type of a Banach space and the fat-shattering
dimension:
Theorem 3.3 (Theorem 1.5 in [92]) Let X be an infinite dimensional Banach space with type p.
The fat-shattering dimension VC_ε(B(X), B(X*)) is finite if and only if the type of X is greater
than 1. Furthermore, if p′ < p then there are constants K and κ such that

κ (1/ε)^{p/(p−1)} ≤ VC_ε(B(X), B(X*)) ≤ K (1/ε)^{p′/(p′−1)}
The following is a simplified version of theorem 3.3:
Corollary 3.1 Let C be a bounded subset of a Banach space of functions over an infinite set X .
Assume the Banach space has a non-trivial type, i.e. greater than 1. Assume further that the
evaluation functionals δ_x ∈ X** are uniformly bounded. Then VC_ε(X, C) < ∞ for every ε > 0.
Proof. X is bounded; hence w.l.o.g. we assume it is a subset of the unit ball of a Banach space X.
C is a bounded subset of the dual space; hence w.l.o.g. we assume C ⊆ B(X*). By our assumptions,
for every ε > 0,

VC_ε(X, C) ≤ VC_ε(B(X), B(X*)) < ∞
Corollary 3.1 provides a bound on the sample complexity of learning a problem based on the
type of the Banach space. If the type of the Banach space is non-trivial then the sample needed
for learning the problem is polynomial in the learning parameters. Moreover, if the Banach space
X has non-trivial type, then X* has a non-trivial type as well (see [98]); thus the dual learning
problem has polynomial
sample complexity as well. Note that in both cases, i.e. the complexity of the primal and the dual
learning problems, the fact that the spaces are bounded is essential.
The computational complexity of these learning problems is domain specific. However, Mendelson [92] showed that learning subsets of Hilbert spaces with reproducing kernels can be done
efficiently.
Finally, we turn to an examination of the density of the dual learning problem. Let x ∈ X
be an instance and c1 , . . . , cm be any finite set of concepts. In most cases of interest, there are
infinitely many x′ ∈ X such that ∀1 ≤ i ≤ m ci (x) = ci (x′ ). Hence, the problem is naturally
dense.
3.7 VC Dimension of Dual Learning Problems
For the sake of completeness we present here the following lemma:
Lemma 3.1 Let C be a concept class defined over the sample space X. For every x ∈ X define the
function δ_x by δ_x(c) = c(x), and let X* = {δ_x : x ∈ X} be a concept class defined over C.
Finally, let d be the VC dimension of C and d* the VC dimension of X*. Then

⌊log₂ d⌋ ≤ d* ≤ 2^{d+1}
Proof. Let x_0, ..., x_{m−1} be a sample shattered by C and let c_0, ..., c_{2^m−1} be concepts in
C which shatter this sample: for every choice of ȳ ∈ {0, 1}^m there is 0 ≤ i < 2^m such that c_i
assigns the labels ȳ to x_0, ..., x_{m−1}.

Consider the m × ⌊log₂ m⌋ table T whose j'th row is simply the binary representation⁴ of j. For
each column ȳ of T there exists 0 ≤ i < 2^m such that c_i assigns the labels ȳ to x_0, ..., x_{m−1};
w.l.o.g. assume that c_0, ..., c_{⌊log₂ m⌋−1} generate the labelings described by the columns of T.
Since every binary string of length ⌊log₂ m⌋ appears as a row of T, the instances x_0, ..., x_{m−1}
shatter c_0, ..., c_{⌊log₂ m⌋−1}, and hence d* ≥ ⌊log₂ m⌋ for any m ≤ d; thus d* ≥ ⌊log₂ d⌋.

⁴ For the sake of this lemma it simplifies the notation to assume that the concepts assign the
values {0, 1} rather than {±1} to the instances.
By switching the roles of the instances and the concepts and applying the same argument we obtain
d ≥ ⌊log₂ d*⌋, and thus d* ≤ 2^{d+1}.
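The bounds of Lemma 3.1 can be verified by brute force on small finite classes. The following
Python sketch computes the VC dimension by exhaustive shattering (feasible only for tiny X and C)
and applies it to both a primal class and its dual:

    from itertools import combinations

    def vc_dim(points, functions):
        """Size of the largest subset of `points` shattered by `functions`
        (each function maps a point to 0 or 1)."""
        best = 0
        for size in range(1, len(points) + 1):
            for subset in combinations(points, size):
                patterns = {tuple(f(p) for p in subset) for f in functions}
                if len(patterns) == 2 ** size:
                    best = size
                    break               # some subset of this size is shattered
            else:
                return best             # no subset of this size is shattered
        return best

    # Example: threshold concepts on {0, 1, 2, 3}; the dual swaps the roles.
    X = [0, 1, 2, 3]
    C = [lambda x, a=a: 1 if x >= a else 0 for a in range(5)]
    d = vc_dim(X, C)                                       # primal VC dimension
    d_star = vc_dim(C, [lambda c, x=x: c(x) for x in X])   # co-VC dimension
    print(d, d_star)   # the lemma guarantees log2(d) <= d_star <= 2**(d + 1)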
3.8 Banach Spaces
This section provides a brief introduction to Banach spaces.
Definition 3.5 A space X endowed with a norm ‖·‖ is a Banach space if it is complete with respect
to the distance measure d(x_1, x_2) = ‖x_1 − x_2‖.
A Banach space has a dual space which consists of all continuous linear functionals over X. The
dual space is denoted by X* and is itself a Banach space under the norm ‖x*‖ = sup_{‖x‖=1} |x*(x)|.
Any Banach space is naturally embedded in its second dual space via the duality map x ↦ δ_x
given by δ_x(x*) = x*(x).
Definition 3.6 Let X be a Banach space consisting of functions over some space Ω. We say that
X has a reproducing kernel if for every ω ∈ Ω the evaluation functional δ_ω is norm continuous,
i.e. for every ω ∈ Ω there exists some κ_ω such that

|δ_ω(f)| = |f(ω)| ≤ κ_ω ‖f‖
Another important property of Banach spaces is the type of the Banach space.
Definition 3.7 A Banach space X has type p if there is some constant κ such that for every
x_1, ..., x_n ∈ X,

E_{σ_1,...,σ_n}[ ‖ Σ_i σ_i x_i ‖ ] ≤ κ ( Σ_i ‖x_i‖^p )^{1/p}        (3.6)

where the σ_i are i.i.d. random variables taking the values +1 and −1 with probability 1/2 each.
It follows that the type of a Banach space is always in the range [1, 2]. If the space is a
Hilbert space, then its type is exactly 2. The basic facts concerning the concept of type may be
found, for example, in [80] or in [98].
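As a concrete instance of definition 3.7, the claim that a Hilbert space has type 2 (with κ = 1)
follows from a one-line computation using the orthogonality of the random signs:

E_σ ‖ Σ_i σ_i x_i ‖² = Σ_{i,j} E[σ_i σ_j] ⟨x_i, x_j⟩ = Σ_i ‖x_i‖²

since E[σ_i σ_j] = 1 if i = j and 0 otherwise; Jensen's inequality then gives
E_σ ‖Σ_i σ_i x_i‖ ≤ (Σ_i ‖x_i‖²)^{1/2}, which is (3.6) with p = 2 and κ = 1.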
3.9 Summary
In this chapter we have presented a noise immunizing scheme for learning. Our scheme utilizes the
structure of the learning problem, mainly by exploiting properties of the dual learning problem.
Having access to a membership query oracle, we were able to devise a conversion scheme which
adds noise robustness to many learning problems. In this presentation we focused on random
classification noise; however, our method appears to apply to many other noise models as well
(e.g. malicious noise).
In section 3.6 we generalized our scheme to handle real valued functions. In this setting, the
dual learning problem is related to the dual Banach space. Hence, the study of the dual learning
problem is very natural. We used the type of Banach space as a measure of the complexity of the
learning problem and showed that if the type is non-trivial then both primal and dual learning
problems are feasible.
Our construction provides a set of sufficient conditions for noise tolerant learnability. However,
we believe that the essence of these conditions reflects fundamental principles which may turn out
to be a step towards a complete characterization of noise tolerant learnability.
Part III
Selective Sampling
Chapter 4
Preliminaries
In the selective sampling framework [29, 30], the learner is presented with a large set of unlabeled
instances from which it can choose the instances for which the teacher will be asked to provide
labels. The selective sampling framework differentiates two features of the process: the complexity
of obtaining a random unlabeled instance and the complexity of labeling it. In the PAC framework
[126] these two features are merged; however, in many applications collecting unlabeled instances is
an easy task which is almost cost free, while labeling these instances is costly and lengthy. Consider
for example the task of text classification. In many cases, collecting random instances can be done
automatically without human involvement in the process, for instance by retrieving documents
from the Internet. However, labeling these texts may be lengthy (each document needs to be read)
and may require experts. This situation is not unique to text classification, and applies to a variety
of tasks including medical diagnostics, speech recognition, natural language processing and others.
Selective sampling (sometimes called query filtering) is an active learning framework in which
the learner sees unlabeled instances and selects those instances for which the teacher will be
asked to provide labels. This framework has several advantages over membership queries. First
and foremost, the selective sampling framework is applicable in many cases where membership
queries are not (see section 2.2 on page 18). Furthermore, selective samplers are tuned to the
underlying distribution of the instances. This is significant, as the learner can focus on the more
probable instances.
There are two types of selective sampling settings: batch and online. In the batch setting,
a large set of unlabeled instances is provided and the learner selects the instances to be labeled
by repeatedly searching for informative instances in this set. By contrast, in the online setting,
unlabeled instances are presented in a sequential manner. Whenever an unlabeled instance is
presented, the learner needs to decide whether to query for its label or not. In this online setting,
the learner cannot rewind the stream of unlabeled instances and hence cannot defer querying for
the label of an instance to a later point in the process (unless this instance is presented again).
To further understand the difference between the batch and online settings consider for example
the greedy algorithm, presented by Dasgupta [36] (see algorithm 3 on page 50). At each round,
this algorithm searches the entire batch of unlabeled instances for the most informative instance
and queries for its label; it therefore needs to scan the whole batch for every query it makes.
While this is reasonable when the size of the batch is moderate, in other cases it may be unfeasible.
Note that in some cases there is a constant
Whereas the membership queries framework does not have significant implications as regards
real world applications, the selective sampling framework has been applied in many domains with
great success. Several key examples are presented below.
4.1 Empirical Studies of Selective Sampling
Many algorithms operate in the selective sampling framework [29]. These algorithms have been
applied in many domains including text classification, part of speech tagging, etc. The core of these
algorithms is typically a scoring function. This function assigns a score to unlabeled instances based
on the labels seen so far. The score is designed to measure the benefits from labeling this instance;
i.e. the additional information or reduction in uncertainty a label will provide. The score is used
in two ways. In the batch setting, all the unlabeled instances are scored and the next query point
is chosen as the one with the highest score (a greedy strategy). In the online setting, the instances
are scored one at a time and the next query point is selected by thresholding the score or by a
randomized criterion.
Score functions take different forms, but most of them can be assigned to three categories:
committee-based scores, scores based on the confidence level of a single classifier, and look-ahead
principles.
4.1.1 Committee-Based Scores
Committee-based scores use several learners who learn in parallel. In most cases, the committee
consists of different learning algorithms, and thus when seeing the same training sequence, each
learner generates a different hypothesis. Another possibility is to use the same learning algorithm
for all learners but in this case, each learner only sees a subset of the training data collected so far.
The goal here is to have broad diversity in the committee, much like in Bagging [22]. When a new
instance is introduced, each learner in the committee predicts its label. If all committee members
agree on the predicted labels, then most likely, labeling this instance will not provide any additional
information. However, if there is a considerable disagreement among committee members, labeling
this instance is guaranteed to provide new information, at least for those learners who made a
wrong prediction. The committee principle has led to many active learning algorithms. The
leading algorithm using this principle is the Query By Committee algorithm [112] discussed in
chapter 5 on page 54. Several other algorithms are summarized below.
4.1.1.1 Part-of-Speech Tagging
Dagan and Engelson [35] used active learning in the domain of Natural Language Processing (NLP).
Specifically, they were interested in the task of part-of-speech tagging. In this task, a sentence in
a natural language is presented to the machine, and the machine needs to assign the grammatical
role to each word in the sentence. Typically, algorithms for tackling this problem use either expert
knowledge, which is coded in the algorithm, or a large annotated training sequence which is used
to train a learning algorithm. Both methods require a vast amount of work from human experts
and thus it is hard to adapt these algorithms for many languages. Dagan and Engelson suggested
the use of active learning for this task. They argue that obtaining the raw data (texts in this case)
is almost cost free whereas annotating it is costly and lengthy and thus selective sampling is a
natural match for this task.
The part-of-speech tagging task is complicated. One reason for this is the inherent ambiguity of
natural languages. A sentence such as “We saw the park with the binocular” has more than a single
valid interpretation. Moreover, words can have multiple meanings,
for instance “this is my head ”, “he is the head of the group” and “we should head south” all use
the word “head ” but with different meanings. Therefore, it is common for part of speech taggers
to use probabilistic models that assign probabilistic scores to possible grammatical analyses of a
sentence rather than trying to find the “correct” grammatical structure.
Figure 4.1: Active vs. Passive Learning for Part-of-Speech Tagging [35]. Accuracy appears on the
x-axis and the number of tagged words used for training on the y-axis.
The complexity of the task forced Dagan and Engelson [35] to suggest a heuristic based on the
committee principle. The base learners (which form the committee) are a special kind of a Hidden
Markov Model (HMM). A committee is constructed on the basis of the training sequence seen so
far, and a random choice of the free parameters. Since this is a multi-class problem (there are
many possible tags for each word), an entropy-based criterion is used to measure the disagreement
between committee members. Figure 4.1 presents some of the results obtained. On the
x-axis, the accuracy achieved by the algorithm of [35] is presented and on the y-axis, the number
of tagged words is shown. The accuracy of an active and a passive learning algorithm is compared.
The difference between the two is apparent. For example, reaching an accuracy level of 90%
required only ∼ 4000 words in the active algorithm whereas the passive algorithm needed ∼ 12500
words. Obtaining an accuracy level of 91% requires ∼ 7000 words in the active algorithm and
∼ 25000 in the passive algorithm.
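The entropy-based disagreement measure can be rendered, in simplified form, as follows (a
plausible Python sketch, not the exact formula of [35]): compute the entropy of the empirical
distribution of the tags the committee proposes for a word.

    import math
    from collections import Counter

    def vote_entropy(tags):
        """Entropy (in bits) of the committee's votes on one word; higher
        entropy means more disagreement and hence a more informative query."""
        counts = Counter(tags)
        total = len(tags)
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        return max(0.0, h)   # clamp the -0.0 edge case when all votes agree

    print(vote_entropy(["NOUN", "NOUN", "VERB", "ADJ"]))   # mixed votes: 1.5
    print(vote_entropy(["NOUN", "NOUN", "NOUN", "NOUN"]))  # full agreement: 0.0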
4.1.1.2 Spoken Language Understanding
Tur et al [125] used active learning as part of a spoken language understanding system. The system
is a part of an automatic operator. People can call the operator and ask for a variety of services.
For example, a user can call and ask “What is my balance?” The operator needs to react to these
requests. This is done by applying an automatic speech recognizer that identifies the spoken words.
Figure 4.2: Active vs. Passive Learning for Spoken Language Understanding [125]. The
x-axis shows the number of labeled instances while the y-axis shows accuracy. In this figure three
filtering methods are compared: a random selection of instances to be labeled, committee-based
active learning and confidence based active learning.
The transcribed words are fed into an “understanding” unit that assigns the task the operator will
perform. As in many of the tasks presented here, there is a constant feed of data as people
keep calling the operator. However, labeling these requests is a labor-intensive task.
Tur et al [125] used a committee-based approach to accelerate learning in the understanding
component of this system. The committee used consisted of two learners, an SVM [20] and AdaBoost [47]. These two classifiers were trained over the same training sequence. Both SVM and
AdaBoost are able to provide a confidence parameter together with their predictions, i.e. margin.
Tur et al [125] used this confidence and selected the next instances to be labeled as instances for
which SVM and AdaBoost disagreed and gave low confidence to their predictions. The results are
presented in figure 4.2. The experiment compares the committee-based approach to a confidence
based approach and to random sampling (i.e. passive learning). For most training set sizes, the
gain of using committee-based active learning is 1-2% over passive learning. They noted that the
committee-based approach seems preferable to the confidence based approach.
4.1.1.3 Ensemble of Active Learners
Ensemble methods, such as Bagging [22] and Boosting [47] are successful tools for passive learning.
Baram, El-Yaniv and Luz [10] presented an ensemble method for active learning. Their novel
approach combines different active learning algorithms using a master algorithm. The assumption
behind this master algorithm is that any active learning algorithm will fail on some data sets. The
goal of the master algorithm is to find the best performing active learning algorithm on the specific
data set at hand and use it for training. In order to do so, the authors had to come up with a
method to evaluate the performance of active learners. In the passive learning model this can be
achieved by using leave-one-out estimates or a hold-out set, but in the active learning model these
approaches cannot be used as the labeled training set is heavily biased towards difficult instances.
Therefore the error estimates are typically much worse than actual performance. Baram et al [10]
use an entropy criterion as a scoring function. Given a training sequence, each learner is asked
to label a set of unseen instances. These labels are viewed as groupings of these instances, where
each group consists of the instances to which the learner assigns the same label. The score of the
learner is the entropy of the partition¹.
Once the algorithm uses a certain active learner to query for the next query point, the
entropy-based scoring function is used to evaluate the benefit obtained from the label. Therefore, at any
point we can evaluate the active learners based on previous decisions they made. However, we
still need to decide which learner will make the next query. The fundamental problem here is the
exploration vs. exploitation dilemma. On one hand, we would like to give a fair chance to all
learners, but on the other hand, giving a poor learner too many opportunities to make queries
might undermine the performance of the whole ensemble. To resolve this trade-off, Baram et al [10]
used the analogy to the multi-armed bandit problem and applied algorithms suggested by Auer et
al [8].
The approach suggested by Baram et al [10] proved successful in the many experiments the authors conducted. This again demonstrates the effectiveness of committee-based approaches.
However, note that this time, active learners are used in the ensemble, whereas in all other algorithms presented here (and all other algorithms we are familiar with), the committee consists of
passive learners.
¹ This criterion appears to assume that the classes are equally sized. However, empirically it works
well even when this is not the case [10].
4.1.1.4 Other Committee-Based Approaches
Many committee-based approaches have been devised. McCallum and Nigam [90] used a probabilistic
model to sample a committee. An interesting property of the approach presented in [90] is the
combination of semi-supervised models with active learning: in their algorithm, an expectation
maximization (EM) algorithm is used to label the instances which are not yet labeled, so the
learner can train over a larger training sequence. Liere [78] used a straightforward committee-based
approach where the core classifiers are linear threshold functions (Winnow [82] and Perceptron [97]).
Krogh and Vedelsby [69] used committee-based methods with neural networks. Muslea et al. [95]
introduced a committee-based active learning algorithm which uses multiple views of the data [16].
Another interesting approach was presented by Mamitsuka and Abe [87], who generated a committee
by training the same learning algorithm over random subsets of the training data.
4.1.2 Confidence-Based Scores
Confidence-based scores use a single base classifier which is able not only to make predictions for
the labels of unseen instances, but also to assign confidence levels to its predictions. Many of the
classifiers used today have this capability of reporting their confidence. In SVM [20] and AdaBoost
[47, 109] the margin can be used as a measure of confidence. Other classifiers, such as Bayesian
networks [59], have internal probabilistic structure which can be used to measure confidence. These
and other confidence scores have been used to devise active learning algorithms. An overview of
some of these is presented below.
4.1.2.1 Margin Based Confidence
Tong and Koller [122], Campbell et al [25] and Schohn and Cohn [110] introduced a simple active
learning scheme based on large margin principles. They suggested training an SVM [20] over the
training sequence seen so far and choosing the next query point to be a point with the smallest
possible margin. Such a point is close to the decision boundary induced by the SVM; thus, its
label will shift the decision boundary considerably, making it an informative instance. See
figure 4.3 for a comparison of this simple scheme with various other active
learning schemes and passive learning algorithms. This example shows that this simple approach
significantly outperforms the passive learning SVM algorithm on a text classification task while
performing comparably to other more sophisticated active learners.
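A hedged sketch of the Simple scheme using scikit-learn (any classifier exposing a signed margin
would do; the data here is illustrative): fit an SVM on the labels collected so far and query the
unlabeled point with the smallest absolute margin.

    import numpy as np
    from sklearn.svm import LinearSVC

    def smallest_margin_query(X_labeled, y_labeled, X_unlabeled):
        """Index of the unlabeled point closest to the current decision
        boundary, i.e. the point with the smallest |margin|."""
        clf = LinearSVC().fit(X_labeled, y_labeled)
        margins = np.abs(clf.decision_function(X_unlabeled))
        return int(np.argmin(margins))

    # Toy usage: the point nearest the boundary between the two labels wins.
    X_lab = np.array([[0.0, 1.0], [1.0, 1.0]])
    y_lab = np.array([0, 1])
    X_unl = np.array([[0.1, 1.0], [0.5, 1.0], [0.9, 1.0]])
    print(smallest_margin_query(X_lab, y_lab, X_unl))   # expected: 1 (the middle point)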
Figure 4.3: Margin Based Active Learners [122]. This figure presents several active learning
methods based on SVM large margin principles. The different algorithms were applied to a text
classification task. The three different active learning algorithms (Ratio, MaxMin and Simple)
perform similarly, all outperforming the passive learning algorithm (Random). Note that the active
learners used 100 labels to obtain the same accuracy as the passive learner, which was trained over
a set of 1000 instances (full).
Schohn and Cohn [110] reported that in some cases when the active learner simply uses a subset
of a training sequence, it outperforms the passive learner that uses the fully labeled set. The same
surprising result was reported by Tur et al [125]. This is usually explained by the tendency of
active learners to avoid querying the labels of outliers.
4.1.2.2 Probability Based Confidence
Lewis and Gale [77] used a probability-based confidence active learner. They used a logistic
regression based classifier and queried for instances for which the probability of the leading class
was the smallest. They report that in some cases using active learning reduced the number of
labels needed 500-fold relative to passive learning.
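In the same spirit, the criterion of [77] can be sketched with scikit-learn's logistic regression
(a sketch, with illustrative names): query the instance whose most probable class has the lowest
probability.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def least_confident_query(X_labeled, y_labeled, X_unlabeled):
        """Index of the unlabeled instance whose leading-class probability
        is smallest (uncertainty sampling)."""
        clf = LogisticRegression().fit(X_labeled, y_labeled)
        top_prob = clf.predict_proba(X_unlabeled).max(axis=1)
        return int(np.argmin(top_prob))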
4.1.3 Look-ahead Principles
The ultimate criterion for selecting query points is the reduction in generalization error. However,
we do not have access to this quantity, and thus estimates of it must be used. Several methods
have been suggested to utilize such principles.
Cohn, Ghahramani and Jordan [31] designed an active learning algorithm for parametric models
such as neural networks and mixtures of Gaussians. Their algorithm minimizes the variance of the
estimates of the parameters. For any instance x, the expected reduction in variance is calculated
and used as a score for choosing the next query point. Cohn et al. [31] showed that in certain
models such as mixtures of Gaussians, this parameter can be calculated efficiently.
Roy and McCallum [104] designed an algorithm which estimates the future error based on a
sampling technique. The future error is calculated over the set of unlabeled instances available
to the learner. The learner uses a probabilistic model through which it can estimate the log-loss
or 0-1 loss (assuming that the current probabilistic model is accurate). The next query point is
selected to be the one that will reduce this loss the most.
Tong and Koller [122] introduced a look-ahead algorithm called MaxMin. Similar to Query By
Committee (see chapter 5 on page 54), MaxMin tries to estimate the reduction in the size of the
version space. Given an instance x, the algorithm calculates rx+ and rx− which are the radii of the
largest balls in the version space when x is used for training with the labels +1 or −1 respectively.
The radius of the largest ball in the version space assuming x is labeled gives a lower bound on
the volume of the version space. This gives an estimate of the reduction of the volume of the
version space. The next query point is selected to be the point which maximizes min (rx+ , rx− ). A
point for which min(r_x⁺, r_x⁻) is large is expected to bisect the version space most evenly and
thus will reduce its volume. Another algorithm, called Ratio in [122], queries the label of the
instance x for which min(r_x⁺/r_x⁻, r_x⁻/r_x⁺) is maximized. See figure 4.3 for the results of
applying both MaxMin and Ratio to a text classification task. It is clear that both algorithms
significantly outperform passive learning in this task. Note that in the experiment reported in [122]
the Simple Active Learning Algorithm (subsection 4.1.2.1 on page 45) is equally good. However, in
other experiments reported by Tong and Koller [122], Ratio and MaxMin were better than Simple.
Another approach using a look-ahead principle was introduced by Zhang and Chen [129] who used
it for information retrieval of visual objects.
4.2 Theoretical Studies of Selective Sampling
The theoretical study of selective sampling is still in its infancy. Only a few authors have studied
the theoretical aspect of selective sampling. Freund et al [48] were the first to show that selective
sampling can reduce the number of needed labels exponentially. Their results are discussed and
extended in chapter 6 on page 62. Recently, Dasgupta [36, 37] proved some positive and negative
results about selective sampling. The negative results show that there are cases where selective
sampling cannot reduce the number of labels needed significantly. Consider for example the class
of indicator functions. In this case, the sample space X is a finite set and the concept class is
C = {c_x : x ∈ X} where

c_x(x′) = 1 if x′ = x, and c_x(x′) = −1 otherwise.
It is easy to see that in this case on the order of |X| labels are needed on average in order to
achieve an accuracy of O(1/|X|). This is similar to the number of labels a passive learner would use.
Dasgupta [36] showed that the situation we have just demonstrated is not unique to the class
of indicator functions. He proved the following lemma:
Lemma 4.1 (Claim 1 in [36])
Let C be the class of linear separators in IR². For any set of m distinct instances on the unit
sphere there are hypotheses in the concept class which cannot be identified without querying all m
labels.
Lemma 4.1 shows that if we take m points on the unit sphere and assume the uniform distribution over these points, we may need m labels in order to have an accuracy of O (1/m). This is not
a great saving over passive learning. Moreover, this is not unique to the case discussed in lemma
4.1:
Lemma 4.2 (Claim 2 in [36])
For any d ≥ 2 and m ≥ 2d there is a sample space X of size m and a concept class C of VC-dimension d over the domain X, with the following property: if a concept c is chosen uniformly
from C then the average number of labels needed in order to identify c is greater than m/8.
Lemmas 4.1 and 4.2 show that a small VC dimension does not guarantee that the concept class
can be learned with few labels. Dasgupta [36] showed that even when we restrict ourselves to
the class of non-homogeneous linear classifiers, no selective sampling algorithm can guarantee a
significant reduction in the number of needed labels.
Alongside the negative results, Dasgupta provided an encouraging positive result. He studied
the Greedy Strategy for selective sampling (see algorithm 3). This algorithm receives a batch of
unlabeled instances and finds the next query point in a greedy fashion. Whenever it needs to
decide on the next query point, it goes over all the instances that have not yet been labeled. For
each such instance, it calculates the measure of hypotheses which label it with +1 and the measure
of hypotheses which label it with −1 and chooses to query for the label of the instance for which
these two measures are most equal. The target concept remains in the version space. Any other
concept, which disagrees with the target concept on an instance in the sample, will be removed
from the version space during the process. Thus it is guaranteed that the concept returned by the
greedy strategy is consistent with the target concept on the sample.
Dasgupta proved the following property of the greedy strategy:
Theorem 4.1 (Theorem 3 in [36])
Let π be any distribution over C. Let Q be any query strategy. Let

µ_greedy = E_{c∼π}[number of queries needed by the greedy strategy to identify c]

and

µ_Q = E_{c∼π}[number of queries needed by Q to identify c]

then

µ_greedy ≤ 4 µ_Q ln( 1 / min_{c∈C} π(c) )    (4.1)
Algorithm 3 Greedy Strategy for Selective Sampling [36]
Inputs:
• A sample S.
• A concept class C defined over S.
• A distribution π over C.
Output:
• An hypothesis h.
Algorithm:
1. Let V1 = C.
2. For t = 1, . . . , |S|
   (a) For every x ∈ S let V_t^+(x) = {c ∈ V_t : c(x) = 1} and V_t^−(x) = {c ∈ V_t : c(x) = −1}.
   (b) If max_x min(π(V_t^+(x)), π(V_t^−(x))) = 0 then break the loop.
   (c) Query for the label y of the x for which min(π(V_t^+(x)), π(V_t^−(x))) is maximized.
   (d) Let V_{t+1} = {c ∈ V_t : c(x) = y}.
3. Endfor
4. Return any c ∈ V_t.
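For a finite concept class the greedy strategy can be written down directly. A toy sketch, assuming the concepts are given as a matrix of label vectors over the sample S and the prior π as a weight vector (this representation is our illustrative assumption):

```python
import numpy as np

def greedy_selective_sampling(labels, prior, oracle):
    """labels: (num_concepts, m) matrix, labels[c, j] = c's label on instance j.
    prior: weight vector pi over the concepts. oracle(j): true label of instance j."""
    alive = np.ones(len(prior), dtype=bool)          # current version space V_t
    for _ in range(labels.shape[1]):
        pi = prior * alive
        mass_pos = (labels == +1).T @ pi             # pi(V_t^+(x)) for every x
        mass_neg = (labels == -1).T @ pi             # pi(V_t^-(x)) for every x
        split = np.minimum(mass_pos, mass_neg)
        if split.max() == 0:                         # every remaining label is determined
            break
        j = int(np.argmax(split))                    # most even split of the version space
        alive &= (labels[:, j] == oracle(j))         # query and shrink
    return int(np.flatnonzero(alive)[0])             # any consistent concept
```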
Theorem 4.1 proves that the average number of queries for labels the greedy strategy will make
is comparable to the best possible strategy. To see this, assume that C has VC dimension d < ∞
and let m = |S|. From Sauer's lemma [108] we have that the number of different hypotheses in C
when we restrict it to S is at most (em)^d. Assume that π is uniform over these hypotheses; then
from (4.1) we have that

µ_greedy ≤ 4 µ_Q ln((em)^d) = 4 µ_Q d ln(em)

hence the average number of queries needed by the greedy strategy exceeds the best possible
strategy by a factor of O(ln m) at most.
Another significant theoretical result was proven by Dasgupta, Kalai and Monteleoni [38]. They
suggested a modification to the well-known Perceptron algorithm [97] (see algorithm 4). This very
simple modification learns homogeneous linear classifiers. It has several advantages over most
of the algorithms we have discussed so far. First, it is a very simple algorithm and second, it
works in a streaming (online) fashion as opposed to the batch fashion that is used in the greedy
algorithm. Both these properties make it very attractive to use even with extremely large data
sets. However, since the dimensionality of the data d is explicitly used in this algorithm, it is not
possible to use it efficiently with kernels, especially kernels which map the data into infinite
dimensional spaces (see chapter 10 for more on kernels).
Dasgupta et al [38] proved the following property of the Perceptron based active learning
algorithm:
Theorem 4.2 (theorem 3 in [38])
Let ǫ, δ > 0. Let L = O( d log(1/(ǫδ)) (log(d/δ) + log log(1/ǫ)) ) and R = O( d/δ + log log(1/ǫ) ). Assume that
the underlying distribution over the sample space is the uniform distribution over the unit sphere
in IR^d and that the target concept is a homogeneous linear classifier. With probability 1 − δ, the
Perceptron based active learning algorithm will use L labels, will make O(L) errors while learning
and will return a hypothesis which is ǫ close to the target concept.
Theorem 4.2 proves that under the assumption of uniform distribution, the Perceptron based
active learning algorithm will use O(log(1/ǫ)) labels in order to return a hypothesis which is ǫ close
to the target concept. This is an exponential improvement over any passive learner, which will need
O(1/ǫ) labels in the same setting.
Algorithm 4 Perceptron Based Active Learning [38]
Inputs:
• Dimension d.
• Maximum number of labels L.
• A patience parameter R.
Output:
• A homogeneous linear classifier v.
Algorithm:
1. Let v1 = y1 x1 for the first example (x1, y1).
2. Let s1 = 1/√d.
3. For t = 1 . . . L
   (a) Wait for the next instance x such that |x · vt| ≤ st and query for its label. Call this example (xt, yt).
   (b) If yt (xt · vt) < 0 then
      i. vt+1 = vt − 2 (xt · vt) xt.
      ii. st+1 = st.
   (c) else
      i. vt+1 = vt.
      ii. If no prediction mistakes were made for the last R instances for which a query for label was made then
         A. st+1 = st/2.
      iii. else
         A. st+1 = st.
4. Endfor
5. Return vt.
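A sketch of the algorithm in code, assuming an endless stream of unit vectors and a label oracle (resetting the clean-run counter after each halving of s is our reading of the patience rule):

```python
import numpy as np

def perceptron_active(instances, label_of, L, R):
    """instances: an endless iterator of unit vectors in R^d.
    label_of(x): query the teacher for the label of x."""
    it = iter(instances)
    x = next(it)
    v = label_of(x) * x                      # v_1 = y_1 x_1
    s = 1.0 / np.sqrt(len(x))                # s_1 = 1 / sqrt(d)
    clean = 0                                # queried points since the last mistake
    for _ in range(L - 1):
        x = next(it)
        while abs(x @ v) > s:                # wait for a point in the margin band
            x = next(it)
        y = label_of(x)
        if y * (x @ v) < 0:                  # prediction mistake
            v = v - 2 * (x @ v) * x          # the reflection update of [38]
            clean = 0
        else:
            clean += 1
            if clean >= R:                   # R clean queries in a row
                s /= 2.0                     # shrink the band
                clean = 0
    return v
```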
4.3 Label Efficient Learning
The online selective sampling framework is called label efficient learning [54]. In this setting, the
learner sees a stream of unlabeled instances. When a new instance is introduced, the learner can
either predict its label or query for it. The learner tries to minimize the number of prediction
mistakes and at the same time minimize the number of queries. Note that unlike the passive
online learning framework [83], the true label is not revealed to the learner unless a query for label
was made.
This model has been studied by several authors. Cesa-Bianchi et al. [28], following Helmbold
and Panizza [54], studied label efficient learning in the framework of prediction with expert advice.
In this model it is assumed that there are many experts, one of which makes very few or even no
prediction mistakes. The task of the learner is to find this expert. They showed that if the learner
has a limited budget of queries for labels that it is allowed to make in any given time frame, then
it is still possible to predict almost as well as the best expert, as long as the budget for queries
grows to infinity at a rate that is not too slow.
Cesa-Bianchi, Conconi and Gentile [27] studied learning linear classifiers in the label efficient
setting. The algorithm presented by the authors uses the margin of the linear classifier with respect
to the prediction it makes as a criterion to select the right instances to query for their label. In
addition, they were able to analyze a slightly modified version of this algorithm in which, if the
label of an instance is predicted with too small a margin, a query is made for the label of the next
instance in the sequence. They were able to show that when certain conditions apply,
the number of prediction mistakes can be logarithmic with respect to the number of instances in
the sequence.
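A minimal sketch of this modified rule (the perceptron-style update and the fixed margin threshold b are simplifying assumptions; the analysis in [27] uses a more refined variant):

```python
import numpy as np

def label_efficient_linear(stream, label_of, b=1.0):
    """Predict with a linear classifier; after a prediction with margin
    below b, query the label of the next instance in the sequence."""
    w, query_next, queries, mistakes = None, True, 0, 0
    for x in stream:
        if w is None:
            w = np.zeros_like(x, dtype=float)
        margin = w @ x
        y_hat = 1 if margin >= 0 else -1
        if query_next:
            y = label_of(x)
            queries += 1
            if y != y_hat:                 # mistake: perceptron-style update
                w = w + y * x
                mistakes += 1
        query_next = abs(margin) <= b      # too small a margin: query the next one
    return w, queries, mistakes
```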
4.4 Summary
As we saw, there are many selective sampling algorithms but only a few have theoretical grounding. Unfortunately, the algorithms that do have theoretical grounding lack practical implementation.
In the next chapter we introduce the Query By Committee (QBC) algorithm. Much of the presentation from this point on is about providing both theoretical grounding and practical
implementation for the QBC algorithm.
Chapter 5
The Query By Committee Algorithm
The Query By Committee (QBC) algorithm was presented by Seung et al. [112] and analyzed in
[48, 119]. The algorithm assumes the existence of some underlying probability measure over the
hypotheses class. At each stage, the algorithm operates on the version space: the set of hypotheses
that were correct so far. Upon receiving a new instance the algorithm has to decide whether to
query for its label or not. This is done by randomly selecting two hypotheses from the version
space. A query for label is made only if these two hypotheses predict different labels for the
instance under consideration. The algorithm is presented as Algorithm 5 on page 55.
The QBC algorithm is, as its name suggests, a committee-based algorithm (see section 4.1.1
on page 41). The committee is formed of all the possible classifiers in the sense that any classifier
which may be the target concept is considered. Alternatively, QBC can be viewed as though it
used a look-ahead principle (see section 4.1.3 on page 47). To see this, let V be the current version
space and let x be an instance. Denote by V^+ the set of hypotheses in the version space which
predict the label +1 for x. Similarly, let V^− be the set of hypotheses in the version space which
predict the label −1 for x. It follows that QBC will query for the label of x with probability
2ν(V^+)ν(V^−).
QBC tends to query for instances which split the version space evenly. QBC works in an
online fashion; it sees an instance once and makes its decision whether to query for its label or
not. Although this limits the algorithm, the probabilistic way in which QBC makes its decision
utilizes the online setting to remain tuned to the underlying distribution of the inputs.
Algorithm 5 Query By Committee [112]
Inputs:
• Required accuracy ǫ.
• Required confidence 1 − δ.
• A prior ν over the concept class.
Output:
• A hypothesis h.
Algorithm:
1. Let V1 = C.
2. Let k ← 0.
3. Let l ← 0.
4. For t = 1, . . .
   (a) Receive an unlabeled instance xt.
   (b) Let l ← l + 1.
   (c) Select c1 and c2 randomly and independently from the restriction of ν to Vt.
   (d) If c1(xt) ≠ c2(xt) then
      i. Query for the label yt = c(xt).
      ii. Let k ← k + 1.
      iii. Let l ← 0.
      iv. Let Vt+1 ← {c ∈ Vt : c(xt) = yt}.
   (e) else
      i. Let Vt+1 ← Vt.
   (f) If‡ l ≥ tk then
      i. Choose a hypothesis h according to the termination rule.
      ii. Return h.

‡ Step 4f is the termination procedure of QBC. The exact choices of tk and the returned hypothesis
are discussed in section 5.1. For a short summary, see table 5.1.
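A sketch of the main loop for a finite concept class, with the prior ν given as a weight vector, the concepts as a matrix of labels, and the termination threshold t_k passed in as a function (this finite representation is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def qbc(instance_stream, label_of, labels, prior, t_k, pick_hypothesis):
    """labels: (num_concepts, num_instances) matrix, labels[c, x] = c's label on x.
    instance_stream yields column indices; label_of(x) queries the teacher."""
    alive = np.ones(len(prior), dtype=bool)     # current version space V_t
    k, l = 0, 0                                 # queries made, quiet streak
    for x in instance_stream:
        l += 1
        p = prior * alive
        c1, c2 = rng.choice(len(prior), size=2, p=p / p.sum())
        if labels[c1, x] != labels[c2, x]:      # the two committee members disagree
            y = label_of(x)
            k, l = k + 1, 0
            alive &= (labels[:, x] == y)        # restrict nu to the version space
        if l >= t_k(k):                         # steady version space: terminate
            return pick_hypothesis(alive, prior)
    return pick_hypothesis(alive, prior)
```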
Thus, the probability that QBC will query for the label of an instance depends on two factors: the “evenness”
of the split induced on the version space, and the probability of observing this instance.
5.1 Termination Procedures
According to the definition of the QBC algorithm by Seung et al [112], once the algorithm reaches
a steady version space, i.e. once the algorithm has not queried for a label for a long consecutive
sequence of instances, QBC terminates and returns a random hypothesis from the version space. Below we
study this rule as well as some alternative procedures (step 4f in algorithm 5). We also prove the
correctness of the algorithm in the sense that the hypothesis returned by QBC is indeed a good
one. To simplify the presentation we discuss the “original” termination procedure later.
5.1.1 The “Optimal” Procedure
Assume that the QBC algorithm queries for the labels of k instances. At this point the learner
has a posterior over the hypotheses class which is the restriction of the prior to the version space
V . Given this information, the optimal classifier that the learner can return is the Bayes classifier
which is defined:

c_Bayes(x) = { +1 if Pr_{c∼ν|V}[c(x) = +1] ≥ 1/2 ; −1 if Pr_{c∼ν|V}[c(x) = −1] > 1/2 }
where c ∼ ν|V means that c is chosen according to the restriction of ν to V . The first optimal
procedure we suggest works as follows: if the QBC algorithm did not query for a label for the last
t_k consecutive instances after making the k'th query, then QBC terminates and returns the Bayes
classifier c_Bayes as its hypothesis. The following proves the correctness of this procedure, together
with the right choice for tk :
Theorem 5.1 Assume that the termination rule of QBC (algorithm 5) is specified by

t_k = (2/ǫ) ln( π²(k+1)² / (6δ) )

Let the Bayes classifier c_Bayes be defined using the version space V used by QBC
when terminating. Then with a probability of 1 − δ over the sample and the internal randomness
of QBC,

E_{c∼ν|V}[ Pr_x[ c_Bayes(x) ≠ c(x) ] ] ≤ ǫ
Proof. Assume that QBC made k queries for labels to generate the version space V. Assume that
QBC did not query for any additional label for t_k consecutive instances after making the k'th
query. Let c_Bayes be the Bayes classifier; then

c_Bayes(x) = { +1 if Pr_{c∼ν|V}[c(x) = +1] ≥ 1/2 ; −1 if Pr_{c∼ν|V}[c(x) = −1] > 1/2 }
Take x and c such that c_Bayes(x) ≠ c(x). From the definition of the Bayes classifier it
follows that if we pick a random hypothesis c′ from the distribution ν|V then with a probability
of at least 1/2 we will have c′(x) ≠ c(x). Therefore, if we denote by [c_Bayes(x) ≠ c(x)] the
indicating function, then

E_{c′∼ν|V}[ c′(x) ≠ c(x) ] ≥ (1/2) [ c_Bayes(x) ≠ c(x) ]

for any c and x.
Assume that E_{c∼ν|V, x}[ c_Bayes(x) ≠ c(x) ] > ǫ. Thus,

E_{c,c′∼ν|V, x}[ c′(x) ≠ c(x) ] > ǫ/2

This means that the probability that QBC will not query for the label of the next instance is at
most 1 − ǫ/2. Hence, if E_{c∼ν|V, x}[ c_Bayes(x) ≠ c(x) ] > ǫ, the probability that QBC will not query
for a label for the next t_k consecutive instances is at most

(1 − ǫ/2)^{t_k} ≤ e^{−(ǫ/2) t_k}

By choosing t_k = (2/ǫ) ln(π²(k+1)²/(6δ)) we get that the probability that QBC will not query for t_k consecutive
labels when the Bayes classifier is not “good enough” is 6δ/(π²(k+1)²). By summing over k the proof is
completed.
Corollary 5.1 Assume that the termination rule of QBC (algorithm 5) is specified by

t_k = (2/ǫ) ln( π²(k+1)² / (6δ) )

Let the Bayes classifier c_Bayes be defined using the version space V used by QBC
when terminating. Then with a probability of 1 − δ over the sample and the internal randomness
of QBC,

E_{c∼ν}[ Pr_x[ c_Bayes(x) ≠ c(x) ] ] ≤ ǫ

Proof. From theorem 5.1 we have that

E_{c∼ν|V}[ Pr_x[ c_Bayes(x) ≠ c(x) ] ] ≤ ǫ

therefore

E_{c∼ν}[ Pr_x[ c_Bayes(x) ≠ c(x) ] ] = E_{V, c∼ν|V}[ Pr_x[ c_Bayes(x) ≠ c(x) ] ]
 = E_V[ E_{c∼ν|V}[ Pr_x[ c_Bayes(x) ≠ c(x) ] ] ]
 ≤ ǫ
5.1.2 Random Gibbs Hypothesis
Another possible solution to the generalization phase is to use a random Gibbs hypothesis. In this
procedure, whenever the QBC decides to terminate the learning process, a random hypothesis is
drawn out of the version space and is used for making further predictions. This is the “original”
termination procedure suggested in [48]. We suggest two possible analyses for this procedure: an
average-case analysis and an analysis of the “typical” case.
Theorem 5.2 Assume that the termination rule of QBC (algorithm 5) is specified by

t_k = (4/ǫ) ln( π²(k+1)² / (6δ) )

Let the Gibbs classifier c_Gibbs be defined using the version space V used by QBC
when terminating. Then with a probability of 1 − δ over the sample and the internal randomness
of QBC,

E_{Gibbs, c∼ν}[ Pr_x[ c_Gibbs(x) ≠ c(x) ] ] ≤ ǫ
Note that since the Gibbs hypothesis is a random hypothesis, the error in theorem 5.2 is
averaged over this randomness.
Proof. From corollary 5.1 we have, using this choice of t_k, that

E_{c∼ν}[ Pr_x[ c_Bayes(x) ≠ c(x) ] ] ≤ ǫ/2
Since Haussler, Kearns and Schapire [51] proved that the average error of the Gibbs classifier is at
most twice as large as the error of the Bayes classifier, the statement of the theorem follows.
Theorem 5.2 shows that the average error of the Gibbs hypothesis is not large. In the next
theorem we show that this is also the typical case.
Theorem 5.3 Assume that the termination rule of QBC (algorithm 5) is specified by

t_k = (8/(ǫδ)) ln( π²(k+1)² / (3ǫδ) )

Let the Gibbs classifier c_Gibbs be defined using the version space V used by QBC
when terminating. Then with a probability of 1 − δ over the choice of the sample, the target
hypothesis and the internal randomness of QBC,

Pr_x[ c_Gibbs(x) ≠ c(x) ] ≤ ǫ
Proof. This follows immediately from theorem 5.2 and the Markov inequality. From the choice of
t_k we have that with a probability of 1 − δ/2,

E_{Gibbs, c∼ν}[ Pr_x[ c_Gibbs(x) ≠ c(x) ] ] ≤ ǫδ/2    (5.1)
Therefore, from the Markov inequality, if (5.1) holds, we have with a probability of 1 − δ/2 that

Pr_x[ c_Gibbs(x) ≠ c(x) ] ≤ ǫ
Note that using a direct argument (instead of using the previous theorems as building blocks)
we can get t_k = (2/(ǫδ)) ln(π²(k+1)²/(3ǫδ)), which is better by a factor of 4. Since this is of minor significance
we do not dwell on this argument.
5.1.3 Bayes Point Machine
The Gibbs sampler does not use a single classifier to make predictions. Rather, it randomly selects
a hypothesis. Still another alternative is to use the Bayes Point Machine (BPM) [56, 49] to generate
future predictions. The BPM uses a hypothesis in the version space that is the closest to the Bayes
optimal classifier. When the concept class is the class of linear classifiers, then this is just the
center of gravity of the version space. Gilad-Bachrach, Navot and Tishby [49] proved the following
theorem:
Theorem 5.4 (Theorem 1 in [49])
Let the concept class be the class of linear classifiers and let the prior ν be log-concave. If V
is a version space and c_BPM and c_Bayes are the Bayes Point Machine and Bayes classifiers
respectively, then

Pr_{c∼ν|V}[ c(x) ≠ c_BPM(x) ] ≤ (e − 1) Pr_{c∼ν|V}[ c(x) ≠ c_Bayes(x) ]
From this, we derive the following theorem:
Theorem 5.5 Assume that the termination rule of QBC (algorithm 5) is specified by

t_k = (2(e−1)/ǫ) ln( π²(k+1)² / (6δ) )

Let the concept class be the class of linear classifiers and let the prior ν be
log-concave. Let the Bayes Point Machine classifier c_BPM be defined using the version space V
used by QBC when terminating. Then with a probability of 1 − δ over the sample and the internal
randomness of QBC,

E_{c∼ν}[ Pr_x[ c_BPM(x) ≠ c(x) ] ] ≤ ǫ

Proof. The proof follows immediately from corollary 5.1 and theorem 5.4.
5.1.4 Avoiding the Termination Rule
In many applications, the training process need not terminate. We can assume that there is a
constant stream of instances, and for each instance QBC makes one of two possible actions: it can
either decide to query for the label of the instance or decide not to query for its label. Note that if
QBC did not query for the label, this is because the two random hypotheses QBC drew predicted the
same label for the instance. Therefore, we have a natural way to predict the label of this instance
using the two random hypotheses. Using this “non-stop” rule makes QBC closer in spirit to label
efficient algorithms (see section 4.3 on page 53).
In the following theorem we show that the number of prediction mistakes QBC makes is proportional to the number of queries it makes. Therefore, if we can guarantee the number of queries
to be small, we immediately obtain a bound on the number of prediction mistakes. Indeed, in
chapter 6 we show that when certain conditions apply, the number of queries is small.
Theorem 5.6 For any instance x and at any stage of the learning, the probability that QBC will
make a prediction mistake on x is exactly half the probability it will query for the label of x.
Proof. Let V be the current version space. Let x be an instance for which QBC needs to decide
whether to predict its label or query for it. The probability that QBC will query for the label is

2 Pr_{c∼ν|V}[c(x) = 1] Pr_{c∼ν|V}[c(x) = −1]

whereas the probability that it will make a prediction mistake is

Pr_{c∼ν|V}[c(x) = 1]² Pr_{c∼ν|V}[c(x) = −1] + Pr_{c∼ν|V}[c(x) = −1]² Pr_{c∼ν|V}[c(x) = 1]

which is

Pr_{c∼ν|V}[c(x) = 1] Pr_{c∼ν|V}[c(x) = −1]

Therefore, the probability that QBC will query for a label for a given instance is exactly twice
the probability it will make a prediction mistake.
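The identity is easy to check numerically. In the Bayesian setting the committee members and the target are all drawn from ν|V, so for p = Pr_{c∼ν|V}[c(x) = 1] the query probability is 2p(1 − p) and the mistake probability is p²(1 − p) + (1 − p)²p = p(1 − p); a quick Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 0.3, 200_000                        # p = Pr_{c ~ nu|V}[c(x) = 1]
c1, c2, target = rng.random((3, n)) < p    # two committee members and the target
query = c1 != c2                           # QBC queries when they disagree
mistake = ~query & (c1 != target)          # they agree, but differ from the target
print(query.mean())                        # ~ 2p(1-p) = 0.42
print(2 * mistake.mean())                  # ~ 2p(1-p) = 0.42 as well
```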
5.2 Summary

In this chapter we have presented the Query By Committee algorithm. We have shown several
possible termination rules for this algorithm. In table 5.1 the different options are listed.
Procedure | t_k | Guarantee (with prob. 1 − δ)
Bayes classifier | (2/ǫ) ln(π²(k+1)²/(6δ)) | E_{c∼ν}[Pr_x[c_Bayes(x) ≠ c(x)]] ≤ ǫ
Gibbs classifier | (4/ǫ) ln(π²(k+1)²/(6δ)) | E_{Gibbs,c∼ν}[Pr_x[c_Gibbs(x) ≠ c(x)]] ≤ ǫ
Gibbs classifier | (8/(ǫδ)) ln(π²(k+1)²/(3ǫδ)) | Pr_x[c_Gibbs(x) ≠ c(x)] ≤ ǫ
BPM (linear classifiers) | (2(e−1)/ǫ) ln(π²(k+1)²/(6δ)) | E_{c∼ν}[Pr_x[c_BPM(x) ≠ c(x)]] ≤ ǫ
no termination | - | (with probability 1) E_{c∼ν}[number of queries] = 2 E_{c∼ν}[number of prediction mistakes]

Table 5.1: Possible Termination Procedures for QBC. The possible termination procedures
for QBC are listed, together with the guarantee they provide.
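For concreteness, the thresholds in table 5.1 are straightforward to evaluate; a small sketch printing them for the (arbitrary) choice ǫ = 0.1, δ = 0.05, k = 10:

```python
import math

def t_k_bayes(eps, delta, k):
    return (2 / eps) * math.log(math.pi**2 * (k + 1)**2 / (6 * delta))

def t_k_gibbs_average(eps, delta, k):
    return (4 / eps) * math.log(math.pi**2 * (k + 1)**2 / (6 * delta))

def t_k_gibbs_typical(eps, delta, k):
    return (8 / (eps * delta)) * math.log(math.pi**2 * (k + 1)**2 / (3 * eps * delta))

def t_k_bpm(eps, delta, k):
    return (2 * (math.e - 1) / eps) * math.log(math.pi**2 * (k + 1)**2 / (6 * delta))

for rule in (t_k_bayes, t_k_gibbs_average, t_k_gibbs_typical, t_k_bpm):
    print(rule.__name__, round(rule(0.1, 0.05, 10)))
```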
Chapter 6
Theoretical Analysis of Query By Committee
In chapter 5 we introduced the QBC algorithm and studied its basic properties. We now turn to
the fundamental properties of this algorithm. We follow the guidelines of Freund et al [47] while
introducing some corrections and extensions.
The QBC algorithm assumes a Bayesian setting. It assumes the existence of a prior distribution
ν over the concept class. It assumes that the teacher selects the target concept from the concept
class using this prior. Furthermore, it is assumed that ν is known to the learner; however the
learner does not have access to the random choices made by the teacher. In chapter 7 on page 88
and chapter 8 we lift some of these assumptions.
We begin this chapter by introducing and discussing the information gain in section 6.1. In
section 6.2 we present theorems about the fundamental properties of QBC. These theorems show
that when there is a lower bound on the expected information gain, the error rate of the hypotheses
learned by QBC decreases exponentially with respect to the number of queries for labels it makes.
The proofs of these theorems are provided in section 6.3. In section 6.4 we study the class of
parallel planes and argue that there is a lower bound on the expected information gain for this
class once the prior over the concept class is concave. We prove this argument in section 6.5. We
wrap up with a short discussion in section 6.6.
The theorems presented in this chapter are significant for an understanding of the Query By
Committee algorithm. However, the proofs of these theorems are long and involve non-trivial
techniques. Therefore, some readers may wish to skip these proofs (i.e. sections 6.3 and 6.5).
Doing so will not prevent the reader from following the rest of this essay.
6.1 The Information Gain
The key tool in analyzing the QBC algorithm is what is known as instantaneous information gain.
It was introduced in [51] as a tool for analyzing the progress of learning algorithms. Let V be a
version space and x be an instance. x induces a split of the version space such that V + consists
of all the concepts which label x with the label +1 and V − consists of all concepts which label x
with the label −1. Assume that there exists some prior ν over C. If the observed label for x is +1
we know that the concept we are learning is in V + and thus we say that the information we have
gained from the instance x and its label is log(ν(V)/ν(V^+)). Similarly, if the label of x is −1 we
say that the information gained is log(ν(V)/ν(V^−)).
Definition 6.1 Let V be a version space and ν be some probability measure defined over V. Let
x be an instance and y be the label of this instance. The instantaneous information gain from the
pair (x, y) is

log( ν({c ∈ V}) / ν({c ∈ V : c(x) = y}) )
In the setting of selective sampling, we have an instance x but we do not have its label. At this
point we can look at the expected information gain. The probability that the label of x will be +1
is exactly ν(V^+)/ν(V), in which case the instantaneous information gain will be log(ν(V)/ν(V^+)).
Equally, the probability that the label of x will be −1 is exactly ν(V^−)/ν(V), in which case the
instantaneous information gain will be log(ν(V)/ν(V^−)), and thus the expected information gain
from the label of x is

(ν(V^+)/ν(V)) log(ν(V)/ν(V^+)) + (ν(V^−)/ν(V)) log(ν(V)/ν(V^−))

which is exactly H(ν(V^+)/ν(V)), where H(·) is Shannon's binary entropy [114].
Definition 6.2 Let V be a version space and ν be some probability measure over C. Let x be an
instance; then the expected information gain from the label of x is

Σ_y ( ν({c ∈ V : c(x) = y}) / ν(V) ) · log( ν(V) / ν({c ∈ V : c(x) = y}) )
The expected information gain from an instance x is the entropy of the split it induces on the
version space. We see that the most informative instances are the ones for which the split they
induce is even. On the other hand, an instance for which the label is known a-priori, in the sense
that all the hypotheses in the version space agree on its label, has zero expected information gain.
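In the finite Bayesian setting the expected information gain of definition 6.2 amounts to one line of code; a toy sketch (the array representation is our assumption):

```python
import numpy as np

def expected_information_gain(weights, labels_on_x):
    """weights: prior mass of each hypothesis in the version space V.
    labels_on_x: their predicted labels (+1/-1) on the instance x."""
    p = weights[labels_on_x == +1].sum() / weights.sum()   # nu(V+)/nu(V)
    if p == 0.0 or p == 1.0:
        return 0.0                        # label known a priori: zero gain
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))    # H(nu(V+)/nu(V))
```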
If the instances for which QBC queries for labels all have high expected information gain, then
the volume (in the probabilistic sense) of the version space is guaranteed to shrink fast. Therefore,
we are guaranteed that QBC will focus on the target concept and its close neighbors. In the next
section we prove this intuition, but before doing so we need to define the expected information
gain from the next query QBC will make. We have defined the expected information gain from
an instance (definition 6.2); the expected information gain from the next QBC query takes into
account both the information of an instance, the probability of observing this instance, and the
probability that QBC will query for its label.
Definition 6.3 Let V be a version space and ν be a probability measure over C. Let D be a
distribution over the sample space X. Let ν^+(x) = ν({c ∈ V : c(x) = 1})/ν(V) and ν^−(x) =
ν({c ∈ V : c(x) = −1})/ν(V). The expected information gain from the next QBC query for a label is

G(ν, D) = ∫ 2ν^+(x)ν^−(x) H(ν^+(x)) dD(x) / ∫ 2ν^+(x)ν^−(x) dD(x)
To explain definition 6.3, note that for an instance x the value 2ν^+(x)ν^−(x) is the probability
that QBC will query for the label of x. Note also that the expected information gain from the
next QBC query is a function of both the prior ν over the concept class and the distribution D
over the sample space.

Finally, we use the definitions we have introduced here to define “good” concept classes and
distributions, i.e. the concept classes and distributions for which we will be able to prove the
properties of QBC.
Definition 6.4 The concept class C endowed with the prior ν and distribution D over the sample
space has a uniform lower bound g on the expected information gain if for any set of instances
x1, . . . , xm and any concept c ∈ C

G(ν|V, D) ≥ g

where V is the version space induced by x1, . . . , xm and the labels c(x1), . . . , c(xm).
6.2 The Fundamental Theory of QBC
In this section we prove the basic properties of the Query By Committee algorithm. We show that
if we can lower bound the expected information gain from the next query QBC will make, then we
can guarantee that the QBC algorithm will make very few queries while learning. The following
theorem shows this for various termination rules of the QBC algorithm.
Theorem 6.1 Let C be a concept class with VC-dimension d. Let ν be a prior over C and let
D be a distribution over the sample space. Let g > 0 be a uniform lower bound on the expected
information gain and let

g̃ = (g/4) log( 1 + g/(16 log(16/g) − g) )

Let δ > 0 and let

k ≥ max( (8/g²) ln(2/δ), ((d+1)/g̃) log(2/δ) )

Then with a probability of 1 − 2δ, QBC will use at most k queries for labels and

m0 = (d/e) 2^{g̃k/(d+1)}

unlabeled instances when learning, and will return a hypothesis with the following properties (depending on the termination rule used):

1. If QBC is used with the Bayes optimal classification rule, it will return a hypothesis (the
Bayes optimal hypothesis) such that

E_{c∼ν}[ Pr_x[ c_Bayes(x) ≠ c(x) ] ] ≤ ǫ

for any

ǫ > (2ek/d) 2^{−g̃k/(d+1)} ln( π²(k+1)² / (6δ) )

2. If QBC is used with the Gibbs average termination rule, it will return a hypothesis such that

E_{Gibbs, c∼ν}[ Pr_x[ c_Gibbs(x) ≠ c(x) ] ] ≤ ǫ

for any

ǫ > (4ek/d) 2^{−g̃k/(d+1)} ln( π²(k+1)² / (6δ) )

3. If QBC is used with the Gibbs “typical” error termination rule, it will return a hypothesis
such that

Pr_x[ c_Gibbs(x) ≠ c(x) ] ≤ ǫ

for any

ǫ > (8ek/(δd)) ( ln( dπ²(k+1)² / (24ek) ) + (g̃k/(d+1)) ln 2 ) 2^{−g̃k/(d+1)}

4. If QBC is used with the Bayes point machine termination rule, it will return a hypothesis
such that

E_{c∼ν}[ Pr_x[ c_BPM(x) ≠ c(x) ] ] ≤ ǫ

for any

ǫ > (2(e−1)ek/d) 2^{−g̃k/(d+1)} ln( π²(k+1)² / (6δ) )
Theorem 6.1 shows that the error rate of the hypothesis returned by QBC decreases exponentially with respect to the number of queries allowed. This is true for all the termination rules considered
here. Note that in passive learning models, the error rate decreases only as one over a polynomial
in the size of the sample (see e.g. [6] theorem 4.2). Hence, when there is a lower bound on the
expected information gain, learning is exponentially faster when using QBC. It is interesting to
note that if we look at the error rate achieved as a function of the size of the sample used, m0,
the error rate decreases as one over a polynomial, and thus QBC does not “waste” much on the instances
for which it did not query the label. The proof of the theorem is provided in section 6.3.
The following theorem deals with the situation when QBC is being used without any termination
rule. It shows that when there is a uniform lower bound on the expected information gain, both
the number of queries and the number of prediction mistakes is logarithmic with respect to the
length of the sequence processed.
Theorem 6.2 Let C be a concept class with VC-dimension d, together with a prior ν and a
distribution D. Let g > 0 be a uniform lower bound on the expected information gain and let

g̃ = (g/4) log( 1 + g/(16 log(16/g) − g) )

Denote by k(x1, . . . , xm) the number of queries QBC makes when processing x1, . . . , xm (note that
this also depends on the target concept and the internal randomness of QBC), and let

B(m) = max( ((d+1)/g̃) log(em/d), (8/g²) ln(em/d) ) + 2d/e

Then for m > 0

E_{x1,...,xm, c, QBC}[ k(x1, . . . , xm) ] ≤ B(m)

while the expected number of prediction mistakes is bounded by (1/2)B(m).

Moreover, for δ > 0

Pr_{QBC, c, x1, x2, ...}[ ∃m : k(x1, . . . , xm) ≥ B(m²) ⌈log log m⌉ (⌈log log m⌉ + 1) / δ ] ≤ δ

and

Pr_{QBC, c, x1, x2, ...}[ ∃m : mistakes(x1, . . . , xm) ≥ (1/2) B(m²) ⌈log log m⌉ (⌈log log m⌉ + 1) / δ ] ≤ δ

where mistakes(x1, . . . , xm) is the number of prediction mistakes QBC makes on x1, . . . , xm.
It is important to note that B(m²) ≤ 2B(m) and thus B(m²) = O(log m). This shows
that both the number of queries and the number of mistakes QBC makes grow as a function of the
logarithm of the number of instances processed (disregarding a term of order (log log m)²).
The following theorem is key in proving the properties of the QBC algorithm.
Theorem 6.3 Let C be a concept class with VC-dimension d, together with a prior ν and a
distribution D. Let g > 0 be a uniform lower bound on the expected information gain and let

g̃ = (g/4) log( 1 + g/(16 log(16/g) − g) )

Denote by k = k(x1, . . . , xm) the number of queries QBC makes when processing x1, . . . , xm (note
that this also depends on the target concept and the internal randomness of QBC). Then for m > 0

Pr_{x1,...,xm ∼ D^m, c∼ν, QBC}[ k ≥ max( ((d+1)/g̃) log(em/d), (8/g²) ln(em/d) ) ] < 2d/(em)
6.3 Proofs
As was previously mentioned, the analysis of the QBC algorithm uses the information gain as its
core. Below, we note a few properties of the information gain. Let x1, x2, . . . be a sample, let c be
a classifier and let ν be a prior. The instantaneous information gain from x1 and its label c(x1) is

log( ν(c′ ∈ V) / ν(c′ ∈ V : c′(x1) = c(x1)) )

If we have already seen the labels of x1, . . . , x_{i−1}, then the instantaneous information gain from xi
and its label c(xi) is

log( ν(c′ : ∀j < i, c′(xj) = c(xj)) / ν(c′ : ∀j ≤ i, c′(xj) = c(xj)) )

and thus the information gained from x1, . . . , xm and their labels c(x1), . . . , c(xm) is simply the
sum of the instantaneous information gains, which is exactly

log( 1 / ν(c′ : ∀j ≤ m, c′(xj) = c(xj)) )

i.e. the volume of the version space left after observing this labeled sample. The first lemma shows
that for any sample of size m, the information from having the labels of all points in the sample is
of order log(m).
Lemma 6.1 (lemma 3 in [48])
Let C be of VC-dimension d, and let S = {x1, . . . , xm} be a sample such that m ≥ d. Let

I(S, c) = log( 1 / ν({c′ : ∀x ∈ S, c′(x) = c(x)}) )

then

Pr_{c∼ν}[ I(S, c) > (d + 1) log(em/d) ] < d/(em)

Proof. From Sauer's lemma [108] it follows that S breaks C into at most r ≤ (em/d)^d equivalence
classes, where we say that c ∼ c′ if c(S) = c′(S). Let E1, . . . , Er be the different equivalence
classes; then I(S, c) is simply log(1/ν(Ei)) for the i such that c ∈ Ei. Using this notation we can write,
for any α > 0,

Pr_{c∼ν}[ I(S, c) > α ] = Σ_{i : log(1/ν(Ei)) > α} ν(Ei) ≤ Σ_{i : log(1/ν(Ei)) > α} 2^{−α} ≤ (em/d)^d 2^{−α}    (6.1)

Let α = (d + 1) log(em/d) and plug it into (6.1) to get the stated result.
Lemma 6.1 shows that if we get a fully labeled sample, then the information we have collected
from this sample is typically only logarithmic with respect to the size of the sample. Clearly,
the information from a partly labeled sample cannot exceed this bound (this follows, for example,
from the information processing inequality [32]). Next we show that the information from the
sub-sample that QBC collects grows linearly with respect to the number of queries QBC makes.
Lemma 6.2 Assume that the expected information gain for the next query of QBC is lower
bounded by g > 0. Let k > 0, and let V(k) be the version space induced by the first k queries
of QBC. Then

Pr_{c, QBC, x1, x2, ...}[ log(1/ν(V(k))) < (kg/4) log( 1 + g/(16 log(16/g) − g) ) ] ≤ e^{−kg²/8}

This lemma amends lemma 1 in [48]. It shows that the information gained by QBC grows
linearly with respect to the number of queries it makes.
Proof. Let gi be the expected information gain from the i'th instance for which QBC queried for
a label. From the definition of the expected information gain we have that 0 ≤ gi ≤ 1. Since there
is a lower bound on the expected information gain, we have that E_{c,QBC,x1,x2,...}[gi] ≥ g. The gi's
form a martingale; thus using the martingale method (see e.g. [75] lemma 4.1) we have

Pr_{c, QBC, x1, x2, ...}[ Σ_{i=1}^{k} gi < kg/2 ] ≤ e^{−kg²/8}
Assume that Σ_{i=1}^{k} gi ≥ kg/2. Since 0 ≤ gi ≤ 1, at least kg/4 of the gi's have gi ≥ g/4. Recall that gi is the expected
information gain, and thus gi = H(pi), where pi is the probability of observing the label +1 for the
corresponding instance, given the previously queried instances and their labels. From lemma 6.3
on page 73 we have, for every i such that gi ≥ g/4, that

pi ∈ [ g/(16 log(16/g)), 1 − g/(16 log(16/g)) ] = [ g/(16 log(16/g)), 1/(1 + g/(16 log(16/g) − g)) ]

This means that for each i such that gi ≥ g/4, the instantaneous information gained from the
instance and its label is at least log(1 + g/(16 log(16/g) − g)). Since there are at least kg/4 instances for
which gi ≥ g/4, the sum of the information gained from all query instances is at least

(kg/4) log( 1 + g/(16 log(16/g) − g) )

which is linear with respect to k.
We are now ready to prove theorem 6.3.

Proof of theorem 6.3. From lemma 6.1 we know that with a probability of 1 − d/(em) the information gained from
querying the labels of all m instances is at most (d + 1) log(em/d). Let k be the number of queries
QBC made. From lemma 6.2 we know that with a probability of 1 − e^{−kg²/8}, the information
gained from the queries QBC made is at least (kg/4) log(1 + g/(16 log(16/g) − g)). If k ≥ (8/g²) ln(em/d) we have
that 1 − e^{−kg²/8} ≥ 1 − d/(em). Since the information gained from labeling all the instances is greater
than the information of any subset of the instances, it follows that if k ≥ (8/g²) ln(em/d), then with a probability of 1 − 2d/(em)

(kg/4) log( 1 + g/(16 log(16/g) − g) ) ≤ (d + 1) log(em/d)

and thus

k ≤ (d + 1) log(em/d) · 4/( g log(1 + g/(16 log(16/g) − g)) ) = ((d+1)/g̃) log(em/d)

Hence, with a probability of 1 − 2d/(em), either k < (8/g²) ln(em/d) or k ≤ ((d+1)/g̃) log(em/d), which proves the theorem.
We are now ready to prove theorem 6.1.

Proof of theorem 6.1.
The correctness of the algorithm, i.e. the fact that the hypothesis returned is indeed close to
the target concept with high probability, was already proved in chapter 5. We need only to analyze
the number of labeled and unlabeled instances used.
The condition k ≥ (8/g²) ln(2/δ) implies that e^{−kg²/8} ≤ δ/2. From the choice of m0 and the assumption that
k ≥ ((d+1)/g̃) log(2/δ) we have that d/(em0) ≤ δ/2, and thus from theorem 6.3 we have that with a probability
of 1 − δ, the number of queries QBC will make on a sample of size m0 is at most

((d+1)/g̃) log(em0/d)

By the choice of m0 = (d/e) 2^{g̃k/(d+1)} we have that with a probability of 1 − δ, the number of queries
QBC will make on m0 instances is at most k.
Assume that QBC did not query for more than k labels out of the m0 instances.
Then for any t < m0/k there must have been a sequence of t consecutive instances for which
QBC did not query for labels. From this point we look at each termination condition separately.
1. Let ǫ be such that

ǫ > (2k/m0) ln( π²(k+1)² / (6δ) )    (6.2)

Then from corollary 5.1 we have that if QBC did not make any query for labels for a sequence
of t_k = (2/ǫ) ln(π²(k+1)²/(6δ)) consecutive instances, then

E_{c∼ν}[ Pr_x[ c_Bayes(x) ≠ c(x) ] ] ≤ ǫ

with a probability of 1 − δ. However, by the choice of ǫ in (6.2) we have that t_k ≤ m0/k, and
thus we are guaranteed with a probability of 1 − δ that indeed there will be such a sequence
(note that for any j < k we have that t_j < t_k and thus QBC may terminate before the k'th
query for labels). Therefore, with a probability of 1 − 2δ, the number of queries that QBC
will make will be smaller than k and the algorithm will terminate before seeing m0 unlabeled
instances, returning a Bayes classifier with an error rate smaller than ǫ as defined in (6.2). By
the choice of m0 we have that this is true for any ǫ such that

ǫ > (2ek/d) 2^{−g̃k/(d+1)} ln( π²(k+1)² / (6δ) )
2. Let ǫ be such that

ǫ > (4k/m0) ln( π²(k+1)² / (6δ) )    (6.3)

Then from theorem 5.2 we have that if QBC did not make any query for labels for a sequence
of t_k = (4/ǫ) ln(π²(k+1)²/(6δ)) consecutive instances, then

E_{Gibbs, c∼ν}[ Pr_x[ c_Gibbs(x) ≠ c(x) ] ] ≤ ǫ

with a probability of 1 − δ. However, by the choice of ǫ in (6.3) we have that t_k ≤ m0/k, and
thus we are guaranteed with a probability of 1 − δ that indeed there will be such a sequence
(note that for any j < k we have that t_j < t_k and thus QBC may terminate before the k'th
query for labels). Therefore, with a probability of 1 − 2δ, the number of queries that QBC
will make will be smaller than k and the algorithm will terminate before seeing m0 unlabeled
instances, returning a Gibbs classifier with an error rate smaller than ǫ as defined in (6.3).
Through the choice of m0 we have that this is true for any ǫ such that

ǫ > (4ek/d) 2^{−g̃k/(d+1)} ln( π²(k+1)² / (6δ) )
3. Let ǫ be such that

ǫ > (8k/(δ m0)) ln( m0 π²(k+1)² / (24k) )    (6.4)

Then from theorem 5.3 we have that if QBC did not make any query for labels for a sequence
of t_k = (8/(ǫδ)) ln(π²(k+1)²/(3ǫδ)) consecutive instances, then

Pr_x[ c_Gibbs(x) ≠ c(x) ] ≤ ǫ

with a probability of 1 − δ. However, by the choice of ǫ in (6.4) we have that t_k ≤ m0/k, and
thus we are guaranteed with a probability of 1 − δ that indeed there will be such a sequence
(note that for any j < k we have that t_j < t_k and thus QBC may terminate before the k'th
query for labels). Therefore, with a probability of 1 − 2δ, the number of queries that QBC
will make will be smaller than k and the algorithm will terminate before seeing m0 unlabeled
instances, returning a Gibbs classifier with an error rate smaller than ǫ as defined in (6.4).
By the choice of m0 we have that this is true for any ǫ such that

ǫ > (8ek/(δd)) ( ln( dπ²(k+1)² / (24ek) ) + (g̃k/(d+1)) ln 2 ) 2^{−g̃k/(d+1)}
4. Let ǫ be such that

ǫ > (2(e−1)k/m0) ln( π²(k+1)² / (6δ) )    (6.5)

Then from theorem 5.5 we have that if QBC did not make any query for labels for a sequence
of t_k = (2(e−1)/ǫ) ln(π²(k+1)²/(6δ)) consecutive instances, then

E_{c∼ν}[ Pr_x[ c_BPM(x) ≠ c(x) ] ] ≤ ǫ

with a probability of 1 − δ. However, by the choice of ǫ in (6.5) we have that t_k ≤ m0/k, and
thus we are guaranteed with a probability of 1 − δ that indeed there will be such a sequence
(note that for any j < k we have that t_j < t_k and thus QBC may terminate before the k'th
query for labels). Therefore, with a probability of 1 − 2δ, the number of queries that QBC
will make will be smaller than k and the algorithm will terminate before seeing m0 unlabeled
instances, returning a Bayes Point Machine classifier with an error rate smaller than ǫ as defined in (6.5). By
the choice of m0 we have that this is true for any ǫ such that

ǫ > (2e(e−1)k/d) 2^{−g̃k/(d+1)} ln( π²(k+1)² / (6δ) )
Proof of theorem 6.2. From theorem 6.3 we have that

Pr_{x1,...,xm ∼ D^m, c∼ν, QBC}[ k ≥ max( ((d+1)/g̃) log(em/d), (8/g²) ln(em/d) ) ] < 2d/(em)

Since the number of queries is at most m, we have that the expected number of queries is at most

max( ((d+1)/g̃) log(em/d), (8/g²) ln(em/d) ) + 2d/e

Using theorem 5.6 we conclude that the expected number of prediction mistakes is at most

(1/2) max( ((d+1)/g̃) log(em/d), (8/g²) ln(em/d) ) + d/e
We now show that this is not only the average case but also the typical case. Let δ > 0. For
any t = 1, 2, . . . we have, using the Markov inequality, that

Pr_{c, QBC, x1, x2, ...}[ k(x1, . . . , x_{2^{2^t}}) > B(2^{2^t}) t(t+1)/δ ] ≤ δ/(t(t+1))

By summing over t and using the fact that Σ_{t=1}^{∞} 1/(t(t+1)) = 1, we have that

Pr_{c, QBC, x1, x2, ...}[ ∃t : k(x1, . . . , x_{2^{2^t}}) > B(2^{2^t}) t(t+1)/δ ] ≤ δ

Let m > 0, and let t = ⌈log log m⌉. Clearly m ≤ 2^{2^t}, and thus for any sequence of instances

k(x1, . . . , xm) ≤ k(x1, . . . , x_{2^{2^t}})

Using this fact and the fact that m² ≥ 2^{2^t}, we have that

Pr_{c, QBC, x1, x2, ...}[ ∃m : k(x1, . . . , xm) > B(m²) ⌈log log m⌉ (⌈log log m⌉ + 1)/δ ] ≤ δ

Analyzing the probability of having too many prediction mistakes can be done in the same way
as we analyzed the probability of having too many queries for labels.
Lemma 6.3 Let H(p) be the binary entropy of p. If H(p) ≥ α > 0 then

p ≥ α / (4 log(4/α))

Proof. Let p* = α/(4 log(4/α)). Since p* < 1/2, we have for any p < p* that H(p) < H(p*); thus
it suffices to show that H(p*) ≤ α. Using the fact that p* < 1/2 once again, we have that

−p* log p* ≥ −(1 − p*) log(1 − p*)

and thus

H(p*) = −p* log p* − (1 − p*) log(1 − p*)
 ≤ −2p* log p*
 = (α/(2 log(4/α))) log( 4 log(4/α) / α )
 = (α/2) ( 1 + log log(4/α) / log(4/α) )
 ≤ α

where the last inequality follows since (log log z)/(log z) ≤ 1 for z ≥ 2.
6.4 Lower Bound on the Expected Information Gain for Linear Classifiers
In previous sections we showed that whenever there is a uniform lower bound on the expected
information gain, QBC learns fast: the error rate of the hypotheses
generated by QBC decreases exponentially with respect to the number of queries made. In order
to make these results meaningful, we now provide such uniform lower bounds for the concept class
of parallel planes [48]. The following is the main result of this section.
Theorem 6.4 Let C be the class of d dimensional parallel planes. Let ν be a prior distribution
over C which is ρ-concave for ρ > −1. Let D be a distribution over the sample space X = IR^d × IR
such that for any x0 ∈ IR^d there is a section [b1(x0), b2(x0)] (which may be empty) such that
the conditional distribution of b given x = x0 is uniform over [b1(x0), b2(x0)].

The expected information gain from the next query of the QBC algorithm is uniformly lower
bounded by

G(ρ) = ( 2^{2+ρ}(1+ρ)(2+ρ) / ((3+ρ) ln 2) ) · [ 2^{−2−ρ}/(2+ρ)² − 2^{−3−ρ}/(3+ρ)²
 + (2^{−2−ρ}/(2+ρ)) ln 2 − (2^{−3−ρ}/(3+ρ)) ln 2
 + Σ_{n=0}^{∞} ( Γ(ρ+1)(−1)^n / (Γ(ρ−n+1) n!) ) ( 1/(n+3)² − 2^{−n−3}/(n+3)² − (2^{−n−3}/(n+3)) ln 2 ) ]

and this bound is tight.
The theorem we have just stated needs some explanation. We need to define the class of
parallel planes, define ρ-concave measures and understand the function G(ρ). Note, however, that
this theorem is an extension of theorem 2 in [48], where Freund et al. proved the same lower bound
on the expected information gain for the class of parallel planes; that theorem assumed a
uniform distribution over the class of classifiers, which is a special case of the theorem presented
here, as any uniform distribution over a convex body is log-concave, i.e. 0-concave.
6.4.1 The Class of Parallel Planes
The class of parallel planes [48] is a close relative of the class of linear separators. Each concept
in this class is represented by a d dimensional vector w. An instance is a pair (x, b) and the
classification rule is

c_w(x, b) = sign(w · x + b)

Note that this is different from the class of non-homogeneous linear separators, as in the latter
the bias b is a part of the concept. As is the case with linear classifiers, the vector w can be scaled
down. To see this, note that if w ∈ IR^d and (x, b) ∈ IR^d × IR then c_w(x, b) = c_{w/λ}(λx, b) for any
λ > 0. Therefore, we will always assume that the w's are in the d-dimensional unit ball.
6.4.2 Concave Measures

We provide a brief introduction to concave measures; see [9, 26, 100, 19] for more information.
We begin by defining concave measures.
Definition 6.5 A probability measure ν over IR^d is ρ-concave if for any measurable sets A and B
and every 0 ≤ λ ≤ 1 the following holds:

ν(λA + (1 − λ)B) ≥ [ λ ν(A)^ρ + (1 − λ) ν(B)^ρ ]^{1/ρ}
A few facts about ρ-concave measures:
• If ν is ρ-concave with ρ = ∞ then ν(λA + (1 − λ)B) ≥ max(ν(A), ν(B)).
• If ν is ρ-concave with ρ = −∞ then ν(λA + (1 − λ)B) ≥ min(ν(A), ν(B)), in this case ν is
called quasi-concave.
• If ν is ρ-concave with ρ = 0 then ν(λA + (1 − λ)B) ≥ ν(A)^λ ν(B)^{1−λ}; in this case ν is called
log-concave.
Many common probability measures are log-concave, for example uniform measures over compact
convex sets, normal distributions, chi-square and others. ρ-concave measures are always uni-modal. Moreover, any uni-modal measure is ρ-concave, at least for ρ = −∞. The parameter ρ
provides a hierarchy for the class of uni-modal measures, since if ν is ρ-concave and ρ′ < ρ then ν
is ρ′-concave as well. Thus, the larger ρ is, the more restrictive the assumption of being ρ-concave.
The following lemma shows that if a measure ν is ρ-concave then any restriction of ν to a
convex body is ρ-concave as well. This makes the parameter ρ suitable for our discussion, since
after the QBC has made several queries for labels we will be looking at the posterior, which is
the restriction of the original prior to the version-space. The lemma shows that if the prior was
ρ-concave then the posterior will be ρ-concave as well.
Lemma 6.4 Let ν be a ρ-concave probability measure and let K be a convex body such that ν(K) > 0.
Let νK be the restriction of ν to K, i.e. νK(A) = ν(A ∩ K)/ν(K); then νK is ρ-concave.

Proof. Let A and B be measurable sets and let 0 ≤ λ ≤ 1. Let x ∈ λ(A ∩ K) + (1 − λ)(B ∩ K). It
follows that x ∈ λA + (1 − λ)B, and since K is convex we have that x ∈ K; thus we conclude
that

λ(A ∩ K) + (1 − λ)(B ∩ K) ⊆ (λA + (1 − λ)B) ∩ K

and hence

ν(K) νK(λA + (1 − λ)B) = ν((λA + (1 − λ)B) ∩ K)
 ≥ ν(λ(A ∩ K) + (1 − λ)(B ∩ K))
 ≥ [ λ ν(A ∩ K)^ρ + (1 − λ) ν(B ∩ K)^ρ ]^{1/ρ}
 = ν(K) [ λ νK(A)^ρ + (1 − λ) νK(B)^ρ ]^{1/ρ}
6.4.3 The Function G(ρ)
The function G(ρ) as defined in theorem 6.4 might look frightening. Recall that G(ρ) is defined to
be

G(ρ) = ( 2^{2+ρ}(1+ρ)(2+ρ) / ((3+ρ) ln 2) ) · [ 2^{−2−ρ}/(2+ρ)² − 2^{−3−ρ}/(3+ρ)²
 + (2^{−2−ρ}/(2+ρ)) ln 2 − (2^{−3−ρ}/(3+ρ)) ln 2
 + Σ_{n=0}^{∞} ( Γ(ρ+1)(−1)^n / (Γ(ρ−n+1) n!) ) ( 1/(n+3)² − 2^{−n−3}/(n+3)² − (2^{−n−3}/(n+3)) ln 2 ) ]
[Figure 6.1 here: a plot of G(ρ) against ρ ∈ [−1, 5], with G(ρ) ranging over [0, 1].]

Figure 6.1: The function G(ρ) from theorem 6.4 on page 73.
Figure 6.1 shows a plot of this function. When ρ = −1 we have that G(ρ) = 0; however, it
climbs fast and reaches 1/9 + 7/(18 ln 2) ≈ 0.67 when ρ = 0. The function is monotone increasing and
approaches 1 as ρ grows to infinity.
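The expression is easy to evaluate numerically. The sketch below implements G(ρ) as reconstructed above, computing the coefficient Γ(ρ+1)(−1)^n/(Γ(ρ−n+1) n!) by the stable recurrence c_{n+1} = c_n (n − ρ)/(n + 1); it reproduces G(0) = 1/9 + 7/(18 ln 2) ≈ 0.672:

```python
import math

def G(rho, terms=200):
    ln2 = math.log(2)
    pref = 2**(2 + rho) * (1 + rho) * (2 + rho) / ((3 + rho) * ln2)
    main = (2**(-2 - rho) / (2 + rho)**2 - 2**(-3 - rho) / (3 + rho)**2
            + ln2 * 2**(-2 - rho) / (2 + rho) - ln2 * 2**(-3 - rho) / (3 + rho))
    series, c = 0.0, 1.0        # c = Gamma(rho+1)(-1)^n / (Gamma(rho-n+1) n!)
    for n in range(terms):
        series += c * (1 / (n + 3)**2 - 2**(-n - 3) / (n + 3)**2
                       - ln2 * 2**(-n - 3) / (n + 3))
        c *= (n - rho) / (n + 1)
    return pref * (main + series)

print(G(0.0))                   # ~0.672 = 1/9 + 7/(18 ln 2)
```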
6.5 Proof of Theorem 6.4
We now turn to the information gain of ρ-concave measures.

Proof of theorem 6.4. The first step is to come up with a simplified notation for the expected information
gain. Recall that the expected information gain is

G(ν, D) = E_{(x,b)∼D}[ 2 ν{w : w·x + b > 0} ν{w : w·x + b < 0} H(ν{w : w·x + b < 0}) ] / E_{(x,b)∼D}[ 2 ν{w : w·x + b > 0} ν{w : w·x + b < 0} ]    (6.6)

We will show that for any x0 in the support of D

E_{b∼Dx0}[ 2 ν{w : w·x0 + b > 0} ν{w : w·x0 + b < 0} H(ν{w : w·x0 + b < 0}) ] / E_{b∼Dx0}[ 2 ν{w : w·x0 + b > 0} ν{w : w·x0 + b < 0} ] ≥ G(ρ)    (6.7)
where b ∼ Dx0 means that (x, b) is sampled from the distribution D conditioned on x = x0 . Once
we prove (6.7), we have that G (ν, D) ≥ G (ρ) as well. This follows since for any two positive
functions f and g and any probability measure over x,

E_x[f(x)] / E_x[g(x)] ≥ min_x f(x)/g(x)
Therefore, from here on we will be trying to prove (6.7). Fix x0 and denote F(b) = ν{w : w·x0 + b < 0};
we can then rewrite the left-hand side of (6.7) as

E_{b∼Dx0}[ F(b)(1 − F(b)) H(F(b)) ] / E_{b∼Dx0}[ F(b)(1 − F(b)) ]

Note that F(b) is the cumulative distribution function (CDF) of ν when projected along the vector
x0. Since ν is ρ-concave, F is ρ-convex (see e.g. [100]). Furthermore, according to the assumptions
of this theorem, the distribution Dx0 is uniform over the segment (b1, b2); thus we write

G(F) = ∫_{b1}^{b2} F(b)(1 − F(b)) H(F(b)) db / ∫_{b1}^{b2} F(b)(1 − F(b)) db
From now on we will study G(F) for ρ-convex functions. W.l.o.g. assume that 0 ∈ (b1, b2) and that
0 is the median of the CDF F, i.e. F(0) = 1/2. Note that

G(F) = [ ∫_{b1}^{0} F(b)(1−F(b))H(F(b)) db + ∫_{0}^{b2} F(b)(1−F(b))H(F(b)) db ] / [ ∫_{b1}^{0} F(b)(1−F(b)) db + ∫_{0}^{b2} F(b)(1−F(b)) db ]
 ≥ min( ∫_{b1}^{0} F(b)(1−F(b))H(F(b)) db / ∫_{b1}^{0} F(b)(1−F(b)) db , ∫_{0}^{b2} F(b)(1−F(b))H(F(b)) db / ∫_{0}^{b2} F(b)(1−F(b)) db )    (6.8)

Due to the symmetry around zero of G(F), we conclude from (6.8) that it is enough to look at one
“tail” of F. Hence our study will focus on F functions with the following properties:

1. F is defined over [−∞, 0].
2. F is monotone increasing.
3. F(−∞) = 0 and F(0) = 1/2.
4. F is ρ-convex.

We call such an F a ρ-admissible CDF function.
We begin by showing that for any ρ there exists a ρ-admissible CDF function Fρ such that
G(Fρ) = G(ρ). We break down our discussion into three cases, depending on the value of ρ:

1. When ρ < 0 we use Fρ(b) = (1/2)(1 + b)^{1/ρ}. Trivially, Fρ is ρ-admissible. In Lemma 6.5 we
show that indeed G(Fρ) = G(ρ).

2. When ρ > 0 we use Fρ(b) = (1/2)(1 + b)^{1/ρ}, which is defined on the range [−1, 0]. Again,
proving that Fρ is ρ-admissible is trivial. Lemma 6.6 shows that G(Fρ) = G(ρ).

3. When ρ = 0 we use F0(b) = (1/2)e^b. Clearly, this is a 0-admissible function (recall that a
0-concave function is a log-concave function). Showing that G(F0) = G(0) is done in
Lemma 6.7.
We have shown that for any ρ there is a ρ-concave measure for which the expected information
gain is G(ρ). Thus, if we prove that G(F) ≥ G(ρ) for any F which is ρ-admissible, we will have
that G(ρ) is a tight bound.

For a ρ-admissible CDF F, let f = F^ρ (we use the convention here that if ρ = 0 we use f = ln F).
Since F is ρ-convex, we have that f is convex, i.e. f(λb1 + (1 − λ)b2) ≥ λf(b1) + (1 − λ)f(b2).
Note that for Fρ as defined above, if we set fρ = (Fρ)^ρ we get a linear function. We will now claim
that this is the worst case.

Note that if f is convex and f + ψ is convex, then for any ǫ ∈ [0, 1] we have that f + ǫψ is
convex as well (lemma 6.8).
We use the notation G(f) = G(f^{1/ρ}). Taking the Fréchet derivative of G(·) we have that

G(f + ǫψ) = G(f) + ǫ ∫_{−∞}^{0} ∇f G(b) ψ(b) db + ǫ² O( ∫_{−∞}^{∞} ψ²(b) db )    (6.9)
We now turn to ∇f G(b). Before we do so, we rewrite G(f). Recall that

G(f) = ∫_{−∞}^{∞} f^{1/ρ}(b)(1 − f^{1/ρ}(b)) H(f^{1/ρ}(b)) db / ∫_{−∞}^{∞} f^{1/ρ}(b)(1 − f^{1/ρ}(b)) db

Denote K(b) = f^{1/ρ}(b)(1 − f^{1/ρ}(b)). Using this notation, f^{1/ρ}(b) = 1/2 − √(1 − 4K(b))/2. We
introduce (yet) another notation,

Q(b) = H( (1 − √(1 − 4b)) / 2 )

and thus we have that Q(K(b)) = H(f^{1/ρ}(b)). Finally we write

G(f) = ∫ K(b) Q(K(b)) db / ∫ K(b) db

Now we have that

∇f G(b) = ∇K G(b) ∇f K(b)    (6.10)
where

∇K G(b) = ( 1 / ∫ K(b) db ) ( Q(K(b)) + K(b) ∂Q(K(b))/∂K(b) − G(f) )    (6.11)

∇f K(b) = (1/ρ) f^{1/ρ−1} (1 − 2f^{1/ρ})    (6.12)
We are interested in studying the behavior of ∇f G(b). By considering the places in which
∇f G(b) = 0 we will be able to tell where it is positive and where it is negative. By (6.10) we can
study the terms ∇K G(b) and ∇f K(b) separately. First consider ∇f K(b). From (6.12) we see that
∇f K(b) = 0 only when f^{1/ρ} = 1/2, i.e. F = 1/2. The behavior of this term is determined by
the value of ρ: for positive ρ's, ∇f K(b) > 0 unless f^{1/ρ} = 1/2; on the other hand, if ρ < 0 then
∇f K(b) < 0 whenever f^{1/ρ} ≠ 1/2.

Now we consider ∇K G(b). Looking at (6.11) we note that

Q(K(b)) + K(b) ∂Q(K(b))/∂K(b)

is monotone increasing. Thus there is a point b0, which Freund et al [48] referred to as the pivot
point, such that for any b < b0 we have ∇K G(b) < 0 while for b > b0 we have ∇K G(b) > 0.

To summarize: ∇K G(b) < 0 for b < b0 and ∇K G(b) > 0 for b > b0, while ∇f K(b) < 0
when ρ < 0 and ∇f K(b) > 0 when ρ > 0. We now treat three cases separately: the
first case is ρ < 0, the second is ρ > 0 and last we consider the case ρ = 0.
1. Assume ρ < 0. In this case ∇f K(b) < 0 and thus

∇f G(b) = ∇K G(b) ∇f K(b) is { > 0 when b < b0 ; < 0 when b > b0 }

Let F be a ρ-admissible function and let f = F^ρ. We will show that unless f is linear, there
is some ψ and ǫ > 0 such that F̂ = (f + ǫψ)^{1/ρ} is ρ-admissible and G(F) > G(F̂). Assume
that f is non-linear; we consider two cases: the first is when the non-linearity occurs in
the range [−∞, b0], and the second is when f is linear on [−∞, b0].
(a) Assume that f is non-linear on [−∞, b0]. Let −∞ < b1 < b0 be such that f is non-linear
on [b1, b0]. Since f is convex we have that for any b ∈ [b1, b0]

f(b) ≥ ((b − b1)/(b0 − b1)) f(b0) + ((b0 − b)/(b0 − b1)) f(b1)

Let

ψ(b) = { 0 when b ∉ [b1, b0] ; ((b − b1)/(b0 − b1)) f(b0) + ((b0 − b)/(b0 − b1)) f(b1) − f(b) when b ∈ [b1, b0] }

We note the following: f + ψ is convex and monotone, so that (f + ψ)^{1/ρ} is ρ-admissible.
Furthermore, ψ(b) = 0 for b > b0, while for b < b0 we have ψ(b) ≤ 0
and this inequality is strict at least on some parts of the range [b1, b0] (see sub-figure
1(a) in figure 6.2 on page 81 for an illustration). Finally, since ψ has finite support,
∫_{−∞}^{0} ψ²(b) db < ∞. Using all these facts we conclude that

∫_{−∞}^{0} ∇f G(b) ψ(b) db < 0

and since

G(f + ǫψ) = G(f) + ǫ ∫_{−∞}^{0} ∇f G(b) ψ(b) db + ǫ² O( ∫_{−∞}^{0} ψ²(b) db )

we have that for small enough ǫ

G(f + ǫψ) < G(f)
(b) Assume that f is linear for b < b0 but f is non-linear. Therefore for b < b0 we have
that f(b) = βb + α for some α and β. Consider the following ψ:

ψ(b) = { 0 when b < b0 ; min( (1/2)^ρ − f(b), βb + α − f(b) ) when b0 ≤ b < 0 ; (1/2)^ρ − f(b) when b = 0 }

We note the following: f + ψ is monotone and convex². Since f + ψ ≤ (1/2)^ρ we have
that (f + ψ)^{1/ρ} is ρ-admissible, and for any 0 ≤ ǫ ≤ 1 the same holds, i.e. (f + ǫψ)^{1/ρ}
is ρ-admissible (see sub-figure 1(b) in figure 6.2 on page 81 for an illustration). ψ has
the following properties: clearly ψ(b) = 0 when b < b0, while ψ(b) ≥ 0 when b > b0
and this inequality is somewhere strict (since f is non-linear). Since ψ has finite support,
∫_{−∞}^{0} ψ²(b) db < ∞, and thus using the same argument as we used in the previous scenario,

G(f + ǫψ) < G(f)

for small enough ǫ.
The two cases we considered here show that unless f is linear, G(f) > G(f + ǫψ) for some
ǫ and ψ such that (f + ǫψ)^{1/ρ} is ρ-admissible. This shows that the minimum of G(·) is
achieved for linear f's.

² Note that (f + ψ)^{1/ρ} may be a singular CDF; it may have a positive mass at the point b = 0.
6.5. Proof of Theorem 6.4
4.5
81
b
4.5
1
4
f
f+ψ
3.5
3.5
b
0
3
2.5
2
2
1.5
1.5
−80
−60
−40
1(a)
−20
0
0.5
0.4
0
1
−100
−80
−60
−40
1(b)
−20
0
0.5
0.4
b
0
0.3
b
0
0.3
0.2
0.2
0.1
0
−100
b
3
2.5
1
−100
f
f+ψ
4
0.1
f
f+ψ
−80
−60
−40
2(a)
−20
0
0
−100
f
f+ψ
−80
−60
−40
2(b)
−20
0
Figure 6.2: Illustrations for the proof of theorem 6.4 The four different cases considered in
the proof. Sub figure 1(a) and 1(b) demonstrate the cases where ρ < 0 while sub-figures 2(a) and
2(b) demonstrate positive ρ’s. In each figure a non-linear f is presented together with the
modified f + ψ as described in the proof of theorem 6.4.
that f (b) =
1 ρ
2
(1 − b) but by a simple change of argument the same result holds for any
admissible linear f ). And thus we conclude that for any F which is ρ-admissible for ρ < 0
G (F ) ≥ G (ρ)
2. We now consider the case when ρ > 0. Let F be ρ-admissible and assume that F is defined
over the range [b1 , 0] for some finite b1 . Let f = F ρ and assume that f is non-linear. Again
we will consider two cases: the first case we will consider is when f is non linear on [b0 , 0]
and the second case is when f is linear on [b0 , 0] but still not linear.
82
Chapter 6. Theoretical Analysis of Query By Committee
(a) Assume that f is non linear on [b0 , 0]. Let ψ be as follows



0
when b < b0
ψ (b) =

 b−b0 f (0) + −b f (b0 ) when b ≥ b0
−b0
−b0
We note the following: f + ψ is monotone increasing and convex. Since (f + ψ) ≤
1/ρ
1 1/ρ
we have that (f + ψ)
is ρ-admissible, and for any 0 ≤ ǫ ≤ 1 the same holds,
2
1/ρ
i.e. (f + ǫψ)
is ρ-admissible (See sub-figure 2(a) in figure 6.2 on page 81 for an
illustration). ψ has the following properties: clearly ψ (b) = 0 when b < b0 while
ψ (b) ≤ 0 when b > b0 and this inequality is somewhat strict (since f is non-linear).
R0
Since ψ has final support −∞ ψ 2 (b) db < ∞ and thus using the same argument as we
used in previous scenarios, recalling that ▽f G (b) > 0 when b > b0 and thus
G (f + ǫψ) < G (f )
for small enough ǫ.
(b) Assume that f is linear on [b0 , 0] but non linear on [b1 , b0 ]. Assume that f (b) = βb + α
for b ∈ [b0 , 0]. We define
ψ (b) = βb + α − f (b)
1/ρ
We note the following: f + ψ is monotone increasing and convex and (f + ǫψ)
is
ρ-admissible for any 0 ≤ ǫ ≤ 1 (See sub-figure 2(b) in figure 6.2 on page 81 for an
illustration). ψ has the following properties: ψ (b) = 0 when b > b0 while ψ (b) ≥ 0
when b < b0 and this inequality is somewhat strict (since f is non-linear). Since ψ
R0
has final support −x1 ψ 2 (b) db < ∞ and thus using the same argument as we used in
previous scenarios, recalling that ▽f G (b) < 0 when b > b0 and thus
G (f + ǫψ) < G (f )
for small enough ǫ.
The two cases we considered here show that unless f is linear G (f ) > G (f + ǫψ) for some
1/ρ
ǫ and ψ such that (f + ǫψ)
is ρ-admissible. This shows that the minimum of G (·) is
achieved for linear f ’s. We already computed G (f ) for linear f in Lemma 6.5 (we assumed
ρ
that f (b) = 12 (1 + b) for b ∈ [0, 1] but by a simple change of argument the same result
holds for any admissible linear f ). We also assumed that f has finite support, but since this
holds for any finite range and from the continuity of G (·) this result holds for any f . Thus
6.5. Proof of Theorem 6.4
83
we conclude that for any F which is ρ-admissible for ρ < 0
G (F ) ≥ G (ρ)
3. The case where ρ = 0 can be treated in the same way as we treated the cases when ρ > 0 or
ρ < 0. However, this argument can be avoided since if F is 0-concave (i.e. log-concave), it is
also ρ-concave for any ρ < 0. Therefore
G (F ) ≥ sup G (ρ)
ρ<0
on the other hand, in Lemma 6.7 we show a log-concave F for which G (F ) = G (0). Combining
these facts together completes the proof.
Lemma 6.5 Let Fρ (b) =
1
2
1/ρ
(1 − b)
for b ≤ 0 and −1 < ρ < 0. Then G (Fρ ) = G (ρ) where G (ρ)
is as defined in theorem 6.4.
The CDF
1
2
1/ρ
(1 − b)
is the “typical” ρ-concave function when ρ is negative. In Lemma 6.5 we
calculate the information gain for these CDFs.
Proof. This is a pure calculation. Let F = Fρ then
R0
F (b) (1 − F (b)) H (F (b)) db
−∞
G (F ) =
R0
−∞ F (b) (1 − F (b)) db
(6.13)
We treat the numerator and denominator separately.
Z 0
Z 0
1
1
1
1/ρ
1/ρ
1/ρ
F (b) (1 − F (b)) H (F (b)) db =
1 − (1 − b)
H
db
(1 − b)
(1 − b)
2
2
−∞
−∞ 2
Z ∞
1 1/ρ
1
1 1/ρ
=
1 − b1/ρ H
db
x
b
2
2
2
1
Z 1/2
= −
b (1 − b) H (b) H (x) 2ρ bρ−1 ρdb
0
Z 1/2
= −ρ2ρ
bρ (1 − b) H (b) db
"Z0
1/2
ρ2ρ
b1+ρ (1 − b) ln (b) db
=
ln 2 0
#
Z 1/2
2
+
bρ (1 − b) ln (1 − b) db
(6.14)
0
Now we look at the two integral terms in the last expression:
Z 1/2
Z 1/2
1+ρ
b
(1 − b) ln bdb =
b1+ρ − b2+ρ ln bdb
0
0
=
b3+ρ
b2+ρ
−
2+ρ 3+ρ
1/2 Z
ln b −
0
0
1/2
b1+ρ
b2+ρ
−
2+ρ 3+ρ
db
84
Chapter 6. Theoretical Analysis of Query By Committee
1/2
b3+ρ 2 −
2
(2 + ρ)
(3 + ρ) 0
=
2−3−ρ
2−2−ρ
ln 2 −
ln 2 −
3+ρ
2+ρ
=
2−3−ρ
2−2−ρ
2−2−ρ
2−3−ρ
−
ln 2 −
ln 2 +
2
2
3+ρ
2+ρ
(3 + ρ)
(2 + ρ)
b2+ρ
Looking at the second term in (6.14)
Z
0
1/2
2
bρ (1 − b) ln (1 − b) db
=
Z
1
1/2
ρ
b2 (1 − b) ln bdb
Using Taylor expansion, we can write
ρ
(1 − b) =
∞
X
Γ (ρ + 1) (−1)n n
b
Γ (ρ − n + 1) n!
n=0
where Γ (·) is the gamma function. Using the Taylor expansion we have
Z 1 X
Z 1
∞
n
Γ (ρ + 1) (−1) n+2
ρ
2
b
ln (b) db
b (1 − b) ln (b) db =
1/2 n=0 Γ (ρ − n + 1) n!
1/2
Z
∞
X
Γ (ρ + 1) (−1)n 1 n+2
b
ln (b) db
=
Γ (ρ − n + 1) n! 1/2
n=0
1
∞
n n+3 X
Γ (ρ + 1) (−1)
1
b
=
ln b −
Γ (ρ − n + 1) n! n + 3
n + 3 1/2
n=0
=
∞
X
Γ (ρ + 1) (−1)n
Γ (ρ − n + 1) n!
n=0
2−n−3
−
+
ln 2
2
2
n+3
(n + 3)
(n + 3)
2−n−3
1
Looking at the denominator of (6.13) we have
Z
F (b) (1 − F (b)) db
=
=
=
=
1
1
1/ρ
1/ρ
1 − (1 − b)
db
(1 − b)
2
−∞ 2
Z ∞
1 1/ρ
1 1/ρ
−
1− b
db
b
2
2
1
Z ∞
1 2/ρ 1 1/ρ
db
b − b
4
2
1
∞
b2/ρ+1
b1/ρ+1 −
4 (2/ρ + 1) 2 (1/ρ + 1) Z
0
1
=
=
=
1
1
− 8
2
ρ +2
ρ +4
1
1
ρ
−
2 + 2ρ 8 + 4ρ
3+ρ
ρ
4 (1 + ρ) (2 + ρ)
Finally we can write
G (F ) =
ρ2ρ
ln 2
2−2−ρ
2−3−ρ
2−2−ρ
2−3−ρ
−
ln 2 −
ln 2 +
2
2
3+ρ
2+ρ
(3 + ρ)
(2 + ρ)
!
6.5. Proof of Theorem 6.4
85
∞
n
X
Γ (ρ + 1) (−1)
+
Γ (ρ − n + 1) n!
n=0
2−n−3
2−n−3
2 −
2 + n + 3 ln 2
(n + 3)
(n + 3)
1
!!
/ρ
3+ρ
4 (1 + ρ) (2 + ρ)
which is, by simple algebra G (ρ).
Lemma 6.6 Let Fρ (b) =
1
2
1/ρ
(1 + b)
for b ∈ [−1, 0] and ρ > 0. Then
G (Fρ ) = G (ρ)
where G (ρ) is as defined in theorem 6.4.
Proof. Recall that
G (Fρ ) =
R0
−∞
This is a pure calculation
Z
0
−1
Fρ (b) (1 − Fρ (b)) H (Fρ (b)) db
=
=
=
Fρ (b) (1 − Fρ (b)) H (Fρ (b)) db
R0
F (b) (1 − Fρ (b)) db
−∞ ρ
1
1
1
1/ρ
1/ρ
1/ρ
1 − (1 + b)
H
db
(1 + b)
(1 + b)
2
2
−1 2
Z 1
1 1/ρ
1
1 1/ρ
1 − b1/ρ H
db
b
b
2
2
0 2
Z 1/2
−
b (1 − b) H (b) 2ρ bρ−1 ρdb
Z
0
0
=
=
Z 1/2
−ρ2ρ
bρ (1 − b) H (b) db
0
"Z
#
Z 1/2
1/2
ρ2ρ
2
1+ρ
ρ
b
(1 − b) ln (b) db +
b (1 − b) ln (1 − b) db
ln 2 0
0
From (6.14) in Lemma 6.5 we know that this equals
ρ2ρ
ln 2
2−3−ρ
2−2−ρ
2−2−ρ
2−3−ρ
−
ln 2 −
ln 2 +
2
2
3+ρ
2+ρ
(3 + ρ)
(2 + ρ)
∞
n
X
Γ (ρ + 1) (−1)
+
Γ (ρ − n + 1) n!
n=0
2−n−3
2−n−3
2 −
2 + n + 3 ln 2
(n + 3)
(n + 3)
1
!!
Looking at the denominator in the definition of information gain we have
Z 0
Z 0
1
1
1/ρ
1/ρ
1 − (1 + b)
db
(1 + b)
Fρ (b) (1 − Fρ (b)) db =
2
−1
−1 2
Z 1
1 1/ρ
1 1/ρ
=
1− b
db
b
2
0 2
Z
Z
1 1 1/ρ 1 1 2/ρ
=
b −
b db
2 0
4 0
1
1
1
1
=
b1+1/ρ −
b1+2/ρ 2 (1 + 1/ρ)
4 (1 + 2/ρ)
0
0
1
1
−
=
2 + 2/ρ 4 + 8/ρ
3+ρ
= ρ
4 (1 + ρ) (2 + ρ)
By simple algebra the result stated in the lemma is obtained.
86
Chapter 6. Theoretical Analysis of Query By Committee
Lemma 6.7 Let F0 (b) = 21 eb for b ≤ 0 and −1 < ρ < 0. Then
G (F0 ) = G (0) =
1
7
+
9 18 ln 2
where G (ρ) is as defined in theorem 6.4.
Proof. Recall that
G (F0 ) =
R0
−∞
This is a pure calculation
Z
0
−∞
F0 (b) (1 − F0 (b)) H (F0 (b)) db
=
F0 (b) (1 − F0 (b)) H (F0 (b)) db
R0
F (b) (1 − F0 (b)) db
−∞ 0
Z
0
−∞
1/2
=
Z
1
1 b
1 b
e 1 − eb H
e db
2
2
2
(1 − b) H (b) db
0
= −
= −
=
=
Z
1/2
0
Z
2
(1 − b) log (1 − b) + b (1 − b) log (b) db
1
1/2
2
b log (b) db −
7
1
−
72 ln 2 24
1
7
+
24 48 ln 2
Z
1/2
b log (b) db +
0
Z
1/2
b2 log (b) db
0
1
1
1
1
− − −
+ − −
8 16 ln 2
24 72 ln 2
Looking at the denominator we have
Z
0
−∞
F0 (x) (1 − F0 (x)) dx
=
=
=
=
=
1 x
1 x
e 1 − e dx
2
−∞ 2
Z 0
Z
1 0 2x
1
ex dx −
e dx
2 −∞
4 −∞
0
1 x0
1 1 2x (e |−∞ −
e
2
4 2 −∞
1 1
1
(1 − 0) −
−0
2
4 2
3
8
Z
0
Thus we conclude
G (F0 ) =
1
24
+
7
48 ln 2
3
8
=
1
7
+
9 18 ln 2
Lemma 6.8 Let f be a concave function such that f +Ψ is concave as well. Then for any ǫ ∈ [0, 1]
the function f + ǫΨ is concave.
Let fˆ be a convex function such that fˆ+ Ψ̂ is convex as well. Then for any ǫ ∈ [0, 1] the function
fˆ + ǫΨ̂ is convex.
6.6. Summary
87
Proof. Let f be concave and Ψ be such that f + Ψ is concave as well. Let x1 and x2 be two points,
let γ ∈ [0, 1] and let ǫ ∈ [0, 1].
(f + ǫΨ) (λx1 + (1 − λ) x2 ) =
ǫ (f + Ψ) (λx1 + (1 − λ) x2 ) + (1 − ǫ) f (λx1 + (1 − λ) x2 )
≤
ǫ (λ (f + Ψ) (x1 ) + (1 − λ) (f + Ψ) (x2 )) + (1 − ǫ) (λf (x1 ) + (1 − λ) f (x2 ))
=
λ (f + ǫΨ) (x1 ) + (1 − λ) (f + ǫΨ) (x2 )
This proves the first part of the lemma. To see that the same works for convex functions, let fˆ
be convex and let fˆ + Ψ̂ be convex as well. We apply to the first part of the lemma with f = −fˆ
and Ψ = −fˆ + −Ψ̂ to get the stated result.
6.6
Summary
In this chapter we have studied the fundamental properties of Query By Committee. First we
defined information gain which is a function of a concept class, a prior over this class and a
distribution over the sample space. We showed that when there is a lower bound on the expected
information gain, QBC learns exponentially fast with respect to the number of queries it makes.
This pace is exponentially better than any passive learner, as these learners learn in a polynomial
rate. Next we demonstrated cases in which there is a lower bound on the expected information
gain. We studied the class of parallel planes. We showed that the expected information gain is
lower bounded when there is a prior which is ρ-concave, and the distribution over the sample space
is of a special type. Freund et al [48] showed that this lower bound on the class of parallel planes
can be translated into a lower bound on the expected information gain when learning homogeneous
linear classifiers when the prior is uniform and the distribution over the sample space is uniform
(theorem 4 in [48]).
Chapter 7
The Bayes Model Revisited
The QBC algorithm and its analysis as presented in chapter 6 assumed that there is a known
prior over the concept class. This assumption is usually referred to as the Bayesian assumption.
However, in many cases, the knowledge of this prior is not present. In this chapter we show how
this assumption can be weakened and in some cases lifted. We look at three different scenarios
and use different tools in each one of them.
7.1
PAC-Bayesian Techniques
McAllester [89] presented the PAC-Bayesian theory. In his work, the Bayesian assumption is
regarded as a way to present prior knowledge or preferences. We use the same technique here to
show how the Bayesian assumption can be lifted in some cases.
Theorem 7.1 Let C = {c1 , c2 , . . .} be a countable concept class with VC-dimension d. Let w1 , w2 , . . .
P
be a set of positive weights such that wi ≥ 0 and
wi = 1. Let D be a distribution over the sample
space such that there exists a lower bound g > 0 on the expected information gain of QBC when
P
learning with the prior ν such that ν (S) = i∈S wi , and the distribution D.
Assume that the termination rule of the QBC (algorithm 5) is specified by tk =
8
ǫδ
ln π
2
(k+1)2
.
3ǫδ
Let the Gibbs classifier cGibbs be defined using the version space V used by QBC when terminating.
g
g log 1+
16 log 16 −g
g
2
2 d+1
8
Let δ > 0, let g̃ =
and let
,
let
k
≥
max
,
log
ln
2
4
g
δ
g̃
δ
ǫ>
2
(k+1)2
8ek ln dπ 24ek
+
δd
88
g̃k
d+1
ln 2
2−g̃k/(d+1)
7.2. Symmetry
89
Then for any ci ∈ C, with a probability of 1 − wδi over the choice of the sample and the internal
randomness of QBC; the algorithm will use at most k queries and will return a hypothesis cGibbs
such that
Pr cGibbs (x) 6= c (x) ≤ ǫ
x
Proof. From theorem 5.3 it follows that in the conditions as described above
h h
ii
Pr EQBC,x1 ,x2 ,... Pr cGibbs (x) 6= c (x) > ǫ < δ
x
c∼ν
(7.1)
Let ci be the target concept and define pi such that
pi
=
=
i
h EQBC,x1 ,x2 ,... Pr cGibbs (x) 6= ci (x) > ǫ
x
i
h Pr
Pr cGibbs (x) 6= ci (x) > ǫ
QBC,x1 ,x2 ,... x
i.e. pi is the probability that QBC will fail when learning the target concept ci . Using this definition
in (7.1) we have that
X
i
wi pi ≤ δ
and therefore, for any i we have that pi ≤ δ/wi . The number of queries used follows employing
the same argument.
Theorem 7.1 shows how the Bayesian assumption can be lifted and converted into a weight or
significance assigned to each concept in the class. Although we assumed that the concept class is
finite, it is possible to extend this result to general classes using the same techniques as presented
in [89].
7.2
Symmetry
In this section we lift the Bayesian assumption when learning linear classifiers with a uniform
distribution over the sample space. This is based on the perfect symmetry in this class. Assume
that QBC is learning homogeneous d dimensional linear classifiers. Each concept is represented as
a unit vector w ∈ IRd and each instance is a unit vector x ∈ IRd where the classification rule is
cw (x) = sign (w · x). Freund et al [48] showed that there is a uniform lower bound on the expected
information gain of QBC when learning this class once there is a uniform distribution over the
sample space and a uniform prior over the concept class. Using the results presented in Chapter 6
this implies fast learning rates for the QBC algorithm in this setting. Here we show that the Bayes
assumption can be lifted in this case. This is due to the symmetry of this problem.
90
Chapter 7. The Bayes Model Revisited
In Theorems 6.1 and 6.2 we showed that the error rate of QBC decreases exponentially fast
when there is a lower bound on the expected information gain. We showed it for several variant
of the QBC algorithm and several methods for evaluating success. The argument presented here
applies to all these cases. Instead of repeating these theorems we will state the following theorem
in general terms.
Theorem 7.2 Assume that C is the class of d-dimensional homogeneous linear classifiers. Let the
sample space X be the unit sphere in IRd and assume that D is the uniform distribution over X .
When QBC is applied in this setting, all the results presented in theorems 6.1 and 6.2 apply for any
concept in the class and not only on average (or with a probability) over the choice of the concept.
Proof. Let cw and cw′ be two homogeneous linear classifiers such that w and w′ are unit vectors.
Let T be the rotation transformation such that T (w) = w′ . We will use the fact that the uniform
distribution over the unit sphere is rotation invariant and thus if S is a set in the unit sphere then
the measure of S equals the measure of T (S).
The QBC algorithm is a random algorithm. We assume that it gets 3 inputs: a sequence of
unlabeled instances, an oracle that is capable of providing the labels of instances and a sequence
of random bits. By providing the algorithm with a sequence of random bits as an input, we can
look at the QBC algorithm as a deterministic algorithm.
∗
For a concept cw let ∆ (cw ) ⊆ X ∗ × {0, 1} be the set of inputs on which QBC fails when
learning the concept cw . Note that the definition of “failure” varies, as shown in theorems 6.1 and
6.2 however the result we present here applies to all these definitions.
∗
Let T be a rotation. For {(x1 , x2 , . . .) , (r1 , r2 , . . .)} ∈ X ∗ × {0, 1} we define
T ({(x1 , x2 , . . .) , (r1 , r2 , . . .)}) = {(T (x1 ) , T (x2 ) , . . .) , (r1 , r2 , . . .)}
and extend this definition such that
T (∆ (cw )) = {T ({(x1 , x2 , . . .) , (r1 , r2 , . . .)}) : {(x1 , x2 , . . .) , (r1 , r2 , . . .)} ∈ ∆ (cw )}
The main observation is that if w′ = T (w) then ∆ (cw′ ) = T (∆ (cw )). We define the measure
∞
∗
µ over X ∗ × {0, 1} to be the product measure of D∞ with B 21
where B (·) here stands for the
Bernoulli measure. Since µ is rotation invariant, because D is rotation invariant, then
µ (∆ (cw′ )) = µ (T (∆ (cw ))) = µ (∆ (cw ))
this implies that the probability of failing to learn the concept cw equals the probability of failing
to learn the concept cw′ .
7.3. Incorrect Priors and Distributions
91
Assume that the probability of failure of the QBC algorithm when averaging over the target
concept is bounded by δ. Recall that the probability of failure is
Pr
cw ∼Uniform,X ∗ ∼D ∗ ,{0,1}∗
[QBC fails]
= Ecw ∼Uniform
Pr
[QBC fails]
∗
X ∗ ∼D ∗ ,{0,1}
= Ecw ∼Uniform [µ (∆ (cw ))]
Z
=
u (w) µ (∆ (cw )) dw
where u (w) is the density of the uniform distribution. Since u (w) is constant, and as we saw
µ (∆ (cw )) is constant as well it follows that
∀w
Pr
cw ∼Uniform,X ∗ ∼D ∗ ,{0,1}∗
[QBC fails] = µ (∆ (cw ))
Finally, since
Pr
cw ∼Uniform,X ∗ ∼D ∗ ,{0,1}∗
[QBC fails] ≤ δ
we have that
∀w µ (∆ (cw )) ≤ δ
7.3
Incorrect Priors and Distributions
The QBC algorithm assumes the knowledge of both a prior over the concept class and a distribution
over the target concept. In this section we discuss the case in which we have an estimate of the
prior and distribution, but these estimates need not be accurate. We show that if these estimates
are reasonably close to the true priors, then the QBC algorithm will tolerate the incorrect priors.
Thus, the exponential learning rates of QBC which were demonstrated in the fundamental theorem
of QBC (Theorem 6.1) remain1 .
First we define a measure of proximity between probabilities.
Definition 7.1 A probability measure µ is λ far from a probability measure µ′ if for any measurable
set A,
λ−1 µ′ (A) ≤ µ (A) ≤ λµ′ (A)
Using this definition we note that if QBC was used with the assumption that the prior over the
concept class is ν which is λc far from the true prior ν ′ and the distribution over the sample space
is assumed to be D which is λx far from the true distribution D′ , then the performance of QBC
1 In
this section we revisit theorem 5 in [48].
92
Chapter 7. The Bayes Model Revisited
does not degrade by much. The following theorem is the equivalent of the fundamental theorem
of the QBC algorithm (Theorem 6.1). It shows that even if QBC is used with incorrect priors, it
still has exponential learning rates.
Theorem 7.3 Let there be a prior ν over C which is λc far from the prior ν ′ over C. Let D be
a distribution over the sample space X which is λx far from D′ . Let g = G (ν, D) > 0 be a lower
!
b
g
b
g log 1+
g
16 log 16 −b
bg
−4 −2
bound on the expected information gain of QBC. Let gb = λc λx g and g̃ =
4
and
k ≥ max
8
2 d+1
2
ln ,
log
gb2 δ
g̃
δ
Let δ > 0 then if the true prior and distribution are ν ′ and D′ , while QBC assumes that the
prior is ν then with a probability of 1 − 2δ, QBC will use at most k queries for labels and
m0 =
d g̃k/(d+1)
2
e
unlabeled instances when learning and will return a hypothesis with the following properties (depending on the termination rule used):
1. If QBC is used with the Bayes optimal classification rule, it will return a hypothesis (the
Bayes optimal hypothesis) such that
h h
ii
Ec∼ν Pr cBayes (x) 6= c (x) ≤ ǫ
x
for any
2
ǫ>
2ek −g̃k/(d+1) π 2 (k + 1)
2
ln
dλ2c
6δ
2. If QBC is used with the Gibbs average termination rule, it will return an hypothesis such that
for any
h i
EGibbs,c∼ν Pr cGibbs (x) 6= c (x) ≤ ǫ
x
ǫ>
4ek −g̃k/(d+1) π 2 (k + 1)2
2
ln
dλ2c
6δ
3. If QBC is used with the Gibbs “typical” error termination rule, it will return a hypothesis
such that
Pr cGibbs (x) 6= c (x) ≤ ǫ
x
for any
ǫ>
2
(k+1)2
+
8ek ln dπ 24ek
δdλ2c
g̃k
d+1
ln 2
2−g̃k/(d+1)
7.3. Incorrect Priors and Distributions
93
4. If QBC is used with Bayes point machine termination rule, it will return a hypothesis such
that
h i
Ec∼ν Pr cBPM (x) 6= c (x) ≤ ǫ
x
for any
ǫ>
2 (e − 1) ek −g̃k/(d+1) π 2 (k + 1)2
2
ln
dλ2c
6δ
Proof. In lemma 7.1 we show that even though the QBC uses wrong priors, the expected informa−2
tion gain from the next query is uniformly lower bounded by λ−4
c λx g. In lemma 7.3 we show that
if QBC did not query for labels for a while, then the hypothesis it will use will be a good approximation of the target concept. Using these two lemmas, and following the proof technique of the
fundamental theorem of the QBC algorithm (theorem 6.1 on page 65) the proof is completed.
Lemma 7.1 Let there be a prior ν over C which is λc far from the prior ν ′ over C. Let D be a
distribution over the sample space X which is λx far from D′ . Let g = G (ν, D) > 0 be lower bound
on the expected information gain of QBC. If the true prior and distribution are ν ′ and D′ , while
−2
QBC assumes that the prior is ν then the expected information gain is at least λ−4
c λx g.
Proof. First we apply to lemma 7.2 to obtain the following
R ν (V + (x)) ν (V − (x))
ν (V + (x))
dD (x)
ν(V )
ν(V ) H
ν(V )
g =
R ν(V + (x)) ν(V − (x))
ν(V )
ν(V ) dD (x)
R ν ′ (V + (x)) ν ′ (V − (x))
ν (V + (x))
λ2c
dD (x)
H
ν ′ (V )
ν ′ (V )
ν(V )
≤
R ′ + (x)) ν ′ (V − (x))
λc−2 ν (V
ν ′ (V )
ν ′ (V ) dD (x)
+
′
−
′
R ν (V (x)) ν (V (x))
ν (V + (x))
dD (x)
ν ′ (V )
ν ′ (V ) H
ν(V )
= λ4c
R ν ′ (V + (x)) ν ′ (V − (x))
ν ′ (V )
ν ′ (V ) dD (x)
Since D is λx far from D′ we have that with a probability of 1
′
λ−1
x dD (x) ≤ dD (x) ≤ λx dD (x)
and thus
g
≤ λ4c
R
ν ′ (V + (x)) ν ′ (V − (x))
ν ′ (V )
ν ′ (V ) H
λx
≤
λ4c
R
R
ν (V + (x))
ν(V )
ν ′ (V + (x))
ν ′ (V − (x))
ν ′ (V )
ν ′ (V )
ν ′ (V + (x)) ν ′ (V − (x))
ν ′ (V )
ν ′ (V ) H
λ−1
x
R
dD (x)
dD (x)
ν (V + (x))
ν(V )
ν ′ (V + (x)) ν ′ (V − (x))
′
ν ′ (V )
ν ′ (V ) dD
dD′ (x)
(x)
94
Chapter 7. The Bayes Model Revisited
=
λ4c λ2x
R
ν ′ (V + (x)) ν ′ (V − (x))
ν ′ (V )
ν ′ (V ) H
R
which completes the proof.
ν (V + (x))
ν(V )
ν ′ (V + (x)) ν ′ (V − (x))
′
ν ′ (V )
ν ′ (V ) dD
dD′ (x)
(x)
Lemma 7.2 Let ν be λc far from γ ′ . Let x be an instance and V be a version space. Denote
by V + (x) (and V − (x)) the concepts in the version space that assign x with the label +1 (or −1)
respectively. Then
λ−2
c
′
+
ν ′ (V + (x)) ν ′ (V − (x))
ν (V + (x)) ν (V − (x))
(x)) ν ′ (V − (x))
2 ν (V
≤
≤
λ
c
ν ′ (V )
ν ′ (V )
ν (V )
ν (V )
ν ′ (V )
ν ′ (V )
′
Proof. First note that ν (V + (x)) ≤ λc ν ′ (V + (x)) and ν (V ) ≥ λ−1
c ν (V ) and thus
λc−2
′
+
ν ′ (V + (x))
ν (V + (x))
2 ν (V (x))
≤
≤
λ
c
ν ′ (V )
ν (V )
ν ′ (V )
′
2
Let z ∈ [0, 1] and let z ′ be such that λ−2
c ≤ z /z ≤ λc . It is easy to verify that
λc−2 z ′ (1 − z ′ ) ≤ z (1 − z) ≤ λ2c z ′ (1 − z ′ )
By setting z =
ν (V + (x))
ν(V )
and z ′ =
ν ′ (V + (x))
ν ′ (V )
we complete the proof.
Lemma 7.3 Let there be a prior ν over C which is λc far from the prior ν ′ over C. Let D be
a distribution over the sample space X which is λx far from D′ . Assume the true prior and
distribution are ν ′ and D′ , while QBC assumes that the prior is ν and the distribution is D then
1. Assume that QBC is used with tk =
2
ǫ
ln π
2
(k+1)2
6δ
instead of the value defined in algorithm 5.
Let the Bayes classifier cBayes be defined using the version space V used by QBC when
terminating and the prior ν. Then with a probability of 1 − δ over the sample and the
internal randomness of QBC,
h h
ii
Ec∼ν ′ |V Pr cBayes (x) 6= c (x) ≤ λ2c ǫ
x
2. Assume that QBC is used with tk =
4
ǫ
ln π
2
(k+1)2
6δ
instead of the value defined in algorithm 5.
Let the Gibbs classifier cGibbs be defined using the version space V used by QBC when
terminating and the prior ν. Then with a probability of 1 − δ over the sample and the
internal randomness of QBC,
h i
EGibbs,c∼ν Pr cGibbs (x) 6= c (x) ≤ λ2c ǫ
x
7.3. Incorrect Priors and Distributions
3. Assume that QBC is used with tk =
8
ǫδ
95
ln π
2
(k+1)2
3ǫδ
instead of the value defined in algorithm 5.
Let the Gibbs classifier cGibbs be defined using the version space V used by QBC when
terminating and the prior ν. Then with a probability of 1 − δ over the choice of the sample,
the target hypothesis and the internal randomness of QBC,
Pr cGibbs (x) 6= c (x) ≤ γ 2 ǫ
x
4. Assume that QBC is used with tk =
2(e−1)
ǫ
ln π
2
(k+1)2
6δ
instead of the value defined in al-
gorithm 5. Let the concept class be the class of linear classifiers and let the prior ν be
log-concave. Let the Bayes Point Machine classifier cBPM be defined using the version space
V used by QBC when terminating. Then with a probability of 1 − δ over the sample and the
internal randomness of QBC,
h i
Ec∼ν Pr cBPM (x) 6= c (x) ≤ λ2c ǫ
x
Proof.
1. Assume that QBC made k queries for labels to generate the version space V . Assume
that QBC did not query for any additional label for tk consecutive instances after making
the k’th query. Let cBayes be the Bayes classifier, then


 +1 if Prc∼ν|V [c (x) = +1] ≥ 1/2
cBayes (x) =

 −1 if Prc∼ν|V [c (x) = −1] > 1/2
Arrange x and c such that cBayes (x) 6= c (x). From the definition of the Bayes classifier it
follows that if we pick a random hypothesis c′ from the distribution ν|V then with a probabilh
i
ity of at least 1/2 we will have c′ (x) 6= c (x). Therefore, if we denote by cBayes (x) 6= c (x)
the indicating function then
Ec′ ∼ν|V [c′ (x) 6= c (x)] ≥
i
1h
cBayes (x) 6= c (x)
2
for any c and x.
h
i
Assume that Ec∼ν ′ |V,x cBayes (x) 6= c (x) > ǫλ2c . Thus,
ǫλ2c
<
=
=
h
i
Ec∼ν ′ |V,x cBayes (x) 6= c (x)
h
i
Ex Pr′
cBayes (x) 6= c (x)
c∼ν |V
h
i

Prc∼ν ′ |V cBayes (x) 6= c (x) ∩ (c ∈ V )

Ex 
Prc∼ν ′ |V [c ∈ V ]
96
Chapter 7. The Bayes Model Revisited
i
cBayes (x) 6= c (x) ∩ (c ∈ V )

Ex λ2c
Prc∼ν|V [c ∈ V ]
h
i
λ2c Ec∼ν|V,x cBayes (x) 6= c (x)

≤
=
Prc∼ν|V
h
and therefore
Ec,c′ ∼ν|V,x [c′ (x) 6= c (x)] >
ǫ
2
this means that the probability that QBC will not query for the label of the next instance
h
i
is at most 1 − 2ǫ . Hence, if Ec∼ν ′ |V,x cBayes (x) 6= c (x) > ǫλ2c the probability that QBC
will not query for a label for the next tk consecutive instance is at most
by choosing tk =
2
ǫ
ln π
2
(k+1)2
6δ
1−
ǫ
ǫ tk
≤ e− 2 tk
2
we get that the probability that QBC will not query for tk
consecutive labels when the Bayes classifier is not “good enough” is
6δ
.
π 2 (k+1)2
By summing
over k the proof is completed.
2. The proof for the Gibbs classifier follows the same pattern as theorem 5.2. From item 1 in
lemma 7.3 we have that using the choice of tk that
h h
ii
Ec∼ν Pr cBayes (x) 6= c (x) ≤ λ2c ǫ/2
x
Since Haussler, Kearns and Schapire [51] proved that the average error of the Gibbs classifier
is at most twice as large as the error of the Bayes classifier, the statement of the theorem
follows.
3. This follows immediately from the previous item and the Markov inequality. From the choice
of tk we have that with a probability of 1 − δ/2
h i
EGibbs,c∼ν Pr cGibbs (x) 6= c (x) ≤ λ2c ǫδ/2
x
(7.2)
Therefore, from the Markov inequality, if (7.2) holds, we have with a probability of 1 − δ/2
that
Pr cGibbs (x) 6= c (x) ≤ λ2c ǫ
x
4. The proof follows immediately from item 1 in lemma 7.3 and theorem 7.1 on page 88.
7.4. Summary
7.4
97
Summary
In this chapter we revisited the Bayesian assumption underlying the QBC algorithm and its analysis. We showed that this assumption is not as strong as it might appear at first glance. In many
cases it may be lifted, as we showed in section 7.1 and section 7.2 or weakened, as we showed
in section 7.3. We conclude that the knowledge of the prior from which the target concept was
chosen, or even the existence of such a prior, is not critical for the QBC to exhibit fast learning
rates.
Chapter 8
Noise Tolerance
In the discussion of the Query By Committee algorithm so far, we have made several assumptions.
We studied the Bayesian assumption in Chapter 7. In this chapter we revise yet another assumption
we made; namely that we learn in a noise free environment. We assumed that there is a target
concept which is a deterministic function of the inputs; i.e. we assumed that there exists a concept
c such that for any given input x, the concept c assigns the “true” label c (x) in a deterministic
way. However, this assumption is doubtful for various reasons. First, many concepts we may wish
to learn are non-deterministic by nature. Moreover, noise that can be caused by human errors or
communication problems might corrupt the labels we see (see the more extensive discussion about
noise on Chapter 3).
Version-space based algorithms such as QBC are sensitive to noise. A single misclassified
instance will cause the target concept to be eliminated from the version-space and thus lead to
poor results. Therefore, if QBC is ever to be used on real data, it must be made less sensitive
to such effects. In this dissertation we provide two methods for coping with this problem. In the
current chapter we present a “soft” version of the QBC algorithm and analyze it. We show that
√
k)
when certain conditions apply, the error of the hypothesis return decreases as e−O(
where k is
the number of queries for label made. A more practical approach is presented in Chapter 10 where
kernels are used to overcome noise as well as some other practical problems. The advantage of the
method we present here is its theoretical soundness. However it does not (yet) have any practical
implementation.
In section 8.1 we introduce the “soft” version of the QBC algorithm. In section 8.2 we revise
the notation of Information Gain to suit the new setting. In section 8.3 we use the newly proposed
98
8.1. “Soft” QBC
99
way of measuring information to analyze the “soft” QBC. We wrap-up in section 8.4.
Note that Sollitch and Saad [119] conducted a preliminary study of the impact of noise on
active learning, although their work mainly focused on the behavior of the algorithm when the
sample size grows to infinity and less on the practical scenario.
The work presented in this chapter is based on collaboration with Scott Axelrod, Shai Fine,
Shahar Mendelson and Naftali Tishby.
8.1
“Soft” QBC
We begin our discussion by defining the model in which we are working. Noise and uncertainty can
come in different forms. The first model we consider is when the target concept is deterministic.
The noise in this case corrupts the communication channel between the learner and the oracle that
answers the learner’s queries. In this case, the source of noise is external. The second case we
consider is when the target concept is non-deterministic in itself. In this case the noise is internal.
8.1.1
The Case of Learning with Noise
In many cases, the concepts to be learned are deterministic, but noise corrupts our observations.
Noise can differ in nature; it can be random classification noise, where the noise is equal over all
the sample space and independent of the target concept. In other cases the noise may tend to have
greater impact near the decision boundary (see [39] for a comprehensive survey about learning in
the presence of noise).
We use the following notation: Let the set of labels Y be finite and let W be a parameterization
of the concept class. For each w ∈ W and x ∈ X a distribution p (y|w, x) is defined where the
underlying concept to be learned cw is such that
cw (x) = arg max p (y|w, x)
y∈Y
Therefore, given any ǫ > 0, the objective of our learning process is to find some w ∈ W such
that
Pr [cw (x) 6= c∗ (x)] < ǫ
x∼D
where c∗ is the target concept.
100
8.1.2
Chapter 8. Noise Tolerance
The case of stochastic concepts
A scenario we shall not pursue any further is when the concepts we are trying to learn are stochastic.
Thus, there is no perfect mapping between instances and labels. In this case one might think of
various criteria for generalization. Some of the possibilities are to minimize the loss with respect to
different Lp norms or to apply the Kullback Leibler divergence. For the sake of simplicity we will
not present all the possibilities in this direction. However, we note that the algorithm we present
below can be easily adjusted to include various such criteria, and the proofs follow the same path
as those presented here.
8.1.3
A variant of the QBC algorithm
Here, we focus on the noise model as described in 8.1.1, i.e. the noise is external to the system and
corrupts the communication channel between the teacher and the learner. Let ǫ > 0 and δ > 0 be
the accuracy and confidence parameters specified by the user. The version of the QBC algorithm
which is capable of managing noise is presented in Algorithm 6.
We will present two facts about this “soft” version of the QBC algorithm (for brevity, henceforth
the abbreviation SQBC). First, we show that the hypothesis returned by the algorithm is indeed
a good approximation of the target concept. Second, we show that if SQBC is allowed to issue k
√
queries for labels then the generalization error of the hypothesis it returns is e−O( k) . Recall that
√ a passive learner using k queries will have a generalization error of O 1/ k in the same setting
(see e.g. [6] Theorem 5.2).
Theorem 8.1 Let ǫ > 0 and δ > 0. Assume that cw∗ is the target concept and that cw is the
concept the SQBC returned. Then the probability that cw is not a good approximation for cw∗ , i.e.,
that
Pr
x∼D
arg max p (y|w, x) 6= arg max p (y|w∗ , x) > ǫ
y∈Y
y∈Y
is less than δ, when the probability is with respect to the internal randomness of SQBC, the random
sample used for learning, the random labels and the random target concept (Bayesian assumption).
The proof is similar to the ones presented in Chapter 5 where we studied the QBC algorithm.
Proof. Define the set of “bad” pairs of parameters
W = (w1 , w2 ) s.t. Pr arg max p (y|w1 , x) 6= arg max p (y|w2 , x) > ǫ
x∼D
y∈Y
y∈Y
8.1. “Soft” QBC
101
Algorithm 6 “Soft” Query By Committee (SQBC)
Inputs:
• Required accuracy ǫ.
• Required confidence 1 − δ.
• A prior ν over the parameter class W .
Output:
• A hypothesis cw .
Algorithm:
1. Let ν1 = ν.
2. Let k ← 0.
3. Let l ← 0.
4. For t = 1, . . .
(a) Receive an unlabeled instance xt .
(b) Let l ← l + 1.
(c) Select w1 ∼ νt and w2 ∼ νt .
(d) If arg maxy∈Y p (y|w1 , x) 6= arg maxy∈Y p (y|w2 , x) then
i.
ii.
iii.
iv.
Query for the label y of x.
Let k ← k + 1.
Let l ← 0.
Let νt+1 be the posterior over W given all the labels seen so far; i.e. for U ⊆ W
using Bayes rule we have
νt+1 (U ) =
ν (U ) Pr [y1 , . . . , yt+1 are the observed labels given that w∗ ∈ U ]
Pr [y1 , . . . , yt+1 are the observed labels]
(e) else
i. Let νt+1 ← νt .
(f) If l ≥ tk where tk =
i. Select w ∼ νt+1
ii. Return cw .
2
ǫδ
log 2k(k+1)
then
δ
102
Chapter 8. Noise Tolerance
The algorithm fails if the target concept cw∗ and the hypothesis cw returned by SQBC are such
that (w∗ , w) form a “bad” pair. Recall that w∗ is randomly picked from the prior ν. We allow the
teacher (“the adversary”) extra power and allow it to choose the target concept only at the end of
the learning process with the only restriction being that the concept is chosen using the posterior
over the labels it presented while the QBC was learning.
Hence, we may assume that the selection of w∗ was made using the posterior defined by the
given labels. Therefore, both the algorithm and the teacher use the same probability measure
which is the posterior to select w and w∗ respectively. There are two possible sources for failure in
this case. First, SQBC may terminate when W is too big in a probabilistic sense. The second case
of failure is when W is small but nevertheless, the target concept and the hypothesis returned by
SQBC form a “bad” pair. We show that the probability of any of these cases is less than δ/2.
Let νt be the posterior. If νt2 (W ) ≤ δ/2 then we are done, since the probability that w and
w∗ form a bad pair is bounded by δ/2. On the other hand, assuming that νt2 (W ) > δ/2, then
we argue that the probability of observing a long sequence of instances for which SQBC will not
issue a query for label is small. Under this assumption, the probability of selecting a triplet
x, w1 , w2 ∼ D × νt × νt such that arg maxy∈Y p (y|w1 , x) 6= arg maxy∈Y p (y|w2 , x) is greater than
ǫδ/2. Hence, the probability of tk consecutive instances without a query after seeing k labels is
bounded by
t
ǫδ k
1−
≤ e−ǫδtk /2 ,
2
and by setting tk =
2
ǫδ
log 2k(k+1)
it follows that the probability that the QBC does not query for
δ
tk consecutive instances is less than δ/2k(k + 1). Summing over the possible values of k we get
that the probability of failure is bounded by δ/2.
The theorem above shows that the SQBC algorithm is sound. We now explore the number
of queries used in the learning processes. We use the same technique as Freund et. al. [48] and
analyze the information gained by the algorithm during the process. However, before we do this
we need to adapt the notion of information gain to the new setting we are dealing with. In the
next section we introduce the refined information gain and study some of its properties. Later in
this chapter we use these tools to study the sample complexity of SQBC.
8.2. Information Gain Revisited
8.2
103
Information Gain Revisited
A fundamental problem in learning theory is bounding the information gained by an example about
the unknown target concept. This problem is most critical in the context of active learning, when
the learner has to select the most informative examples to be labeled in order to minimize the
number of labels required.
The Mutual Information allows one to measure the average knowledge one gains about another
random variable, B, by knowing the value of one random variable A. However, in concrete learning
cases one is interested in a more precise measure; namely, how much does a specific value a tells
us about B. Here we present an information measure which quantifies the amount of information
an observation a of the random variable A gives about the state of the random variable B. We
show that with high probability this specific mutual information is bounded by the logarithm of
the covering number of B (see definitions 8.1 and 8.2), and establish a version of the Information
Processing Inequality suitable for this quantity. Later we will use the information a label contains
about the target concept to measure the information gain by the SQBC algorithm.
The mutual information measures the amount of information one random variable contains
about another random variable [32]. If a random variable A takes values in the set A and the
random variable B takes values in the set B, the mutual information is defined by the following
formula:
I (A; B) =
Z
p (a, b) log
A×B
p (a, b)
d (a × b)
p (a) p (b)
and it can be rewritten as1
I (A; B) =
Z
A
p (a)
Z
p (b|a)
db da
p (b|a) log
p (b)
B
(8.1)
A reasonable question is what a specific observation a ∈ A can tell us about the other variable
B.
Let’s consider for example that one is interested in knowing whether it rained over night. The
observation one gets can be the moisture on the ground in the morning. If the ground is dry
then we can be pretty sure that it wasn’t raining. If, however, the ground is wet, then it might
have rained, but it is also possible that the sprinklers were working during the night and caused
the ground to be wet. Clearly, different observations, or values of the same variable, can provide
different amounts of information.
1 We
assume that p(a|b) belongs to an appropriate L1 space.
104
Chapter 8. Noise Tolerance
There exists a natural definition for the specific information value we are after. Indeed, by
looking at (8.1) we come up with the definition:
I (a; B) =
Z
p (b|a) log
B
p (b|a)
db
p (b)
(8.2)
This should be read as the information that the observation a ∈ A gives about the random
variable B. This quantity has some nice properties:
1. It is non-negative, since from (8.2) one can see that I (a; B) is a Kullback Leibler divergence
[70] between two distributions.
2. I (a; B) is a measurable function due to Fubini’s theorem.
3. The expected value of the information from an observation is the mutual information, i.e.,
EA [I (a; B)] = I (A; B).
Before proceeding we need to define some notations. We begin by defining a distance measure
between two instances of a random variable B.
Definition 8.1 The distance2 between two instances b1 and b2 of a random variable B with respect
to the random variables A1 , . . . , Am over A1 , . . . , Am is
ρm (b1 , b2 ) = max sup |p (ai |b1 ) − p (ai |a2 )|
1≤i≤m ai ∈Ai
Given a distance measure, one can define the covering number which counts the number of
balls of radius ǫ needed to cover the space:
Definition 8.2 If B is a random variable over B and ρ is a (pseudo)-metric on B, then for any
ǫ > 0 the ǫ-covering number is the smallest number of balls of radius ǫ (with respect to the distance
measure ρ) needed to cover B. We denote this value by N (B, ǫ, ρ).
Note that in the deterministic case, when p (ai |b) is either zero or one, this definition takes a
simple form: ρ (b1 , b2 ) is zero if the two states assign the same values to the observations and it
is 1 otherwise. Here, for every radius ǫ < 1, the ǫ covering numbers are simply the number of
equivalence classes. Hence, if the observations are labels assigned to different sample points and if
d
B has a VC-dimension d, then by Sauer’s Lemma it follows that N (B, ǫ, ρm ) ≤ em
.
d
2 Actually ρ
m is a semi-distance since it is possible that b1 6= b2 while ρm (b1 , b2 ) = 0. This has no significance
throughout the paper.
8.2. Information Gain Revisited
8.2.1
105
Observations of the State of a Random Variable
Let us assume that we are interested in the random variable B which takes values b ∈ B. We have
some observations of the random variables Ai . Each random variable Ai receives values ai ∈ Ai .
Q
We assume that the Ai ’s are mutually independent given B, i.e. the p (a1 , . . . , am |b) = p (ai |b).
This is often the case in learning from examples. To see this, let W be a parameterization of a
concept class. Let x1 , . . . , xm be a fixed set of instances, then for any w ∈ W, we have that
p (y1 , . . . , ym |w, x1 , . . . , xm ) =
Y
p (yi |w, x1 , . . . , xm ) =
Y
p (yi |w, xi )
We are interested in measuring the contribution of the labels y1 , . . . , ym to our knowledge about
the random variable W .
Haussler and Opper [52] have studied this question and presented the relationship between the
information and metric entropy. However, they studied the average case; i.e. “what is the amount
of information regarding the state of the world that a general set of observations captures?”. The
question we are interested in is “how much information does a specific set of observations capture
on the state of a random variable?”. Another difference between Haussler and Opper results and
the result presented here is the distance measure used. Haussler and Opper used the Hellinger
distance measure whereas we use an infinity norm. This allows us to use the results of Alon et.
al. [2] which bound the metric entropy with respect to this norm using the Pollard dimension and
the Fat-Shattering dimension of the space.
The first result we present shows that the information from a set of observations is essentially
bounded in the sense that with high probability it is bounded by the covering number.
Theorem 8.2 Let m > 2 and let A1 , . . . , Am be a set of observed random variables. Let B be
a random variable. Assume that there is some γ > 0 such that for any ai ∈ Ai and any b ∈ B,
p (ai |b) ≥ γ. Denote by a(m) = (a1 , . . . , am ) then
2γ
1
(m)
Pr
I a ; B ≤ log N B, 2 , ρm + 2 + log
≥1−δ
m
δ
a(m) ∼A(m)
where ρm is as defined in Definition 8.1.
Note that in the deterministic case, the assumption that there is a positive lower bound on
p (ai |θ) is not necessary. In fact, if B has VC-dimension d then with a probability of at least 1 − δ,
1
I a(m) ; B ≤ d log em
d + 2 + log δ which is similar to the bound presented in [48, Lemma 3].
Proof. of Theorem 8.2
106
Recall that
Chapter 8. Noise Tolerance
Z p a(m) |b
(m)
(m)
db,
log
I a ; B = p b|a
p a(m)
hence, by Jensen’s inequality (or annealed approximation )
Z (m)
|b
(m)
(m) p a
db.
I a ; B ≤ log p b|a
p a(m)
(8.3)
Taking the expected value of the integral in (8.3) with respect to the observations and applying
Fubini’s Theorem, it follows that
"Z
"
#
#
(m)
|b
p a(m) |b
(m) p a
db = Eb∼B Ea(m) ∼A(m) |b
Ea(m) ∼A(m)
p b|a
(8.4)
p a(m)
p a(m)
T
Let B1 , . . . , Br be a disjoint cover of B (i.e., B = ∪Bi and if i 6= j then Bi Bj = ∅), such that
each Bi has diameter smaller than
2γ
m2
with respect to the metric ρm . Thus,
b, b′ ∈ Bi =⇒ ∀j, aj ∈ Aj
|p (aj |b) − p (aj |b′ )| ≤
2γ
m2
Using this definition we rewrite the expected value in (8.4) as
"
"
#
#
Z
r
X
p a(m) |b
p a(m) |b
=
db
EB EA(m) |b
p (b|Bi ) EA(m) |b
P (B i )
(m)
p a(m)
p
a
B
i
i=1
(8.5)
(8.6)
We shall bound the integral on Bi for each 1 ≤ i ≤ r separately. Let i be such that P (Bi ) > 0.
Note that for each b, b′ ∈ Bi we have that
=
p a(m) |b′
≥
Y
p (ai |b′ )
Y
2γ
p (ai |b) − 2 ,
m
thus,
p a(m)
=
≥
≥
=
Z
Z
p (b′ ) p a(m) |b′ db′
p (b′ ) p a(m) |b′ db′
B
Z i
Y
2γ
p (ai |b) − 2 db′
p (b′ )
m
Bi
Y
2γ
p (ai |b) − 2 .
P (Bi )
m
Q
Since p a(m) |b = p (ai |b) it follows that
p a(m) |b
1 Y
p (ai |b)
≤
2γ .
(m)
P (Bi )
p a
p (ai |b) − m
2
(8.7)
Recall that p (ai |b) ≥ γ, hence
γ
1
p (ai |b)
2γ ≤
2γ =
1 − m22
p (ai |b) − m2
γ − m2
(8.8)
8.2. Information Gain Revisited
107
Clearly, for m > 2,
2
e− m ≤ 1 −
2
2
2
+ 2 ≤ 1 − 2.
m m
m
(8.9)
Hence,
2
1
p (ai |b)
1
m
2γ ≤
2 ≤ −2 =e ,
m
1
−
p (ai |b) − m
e
2
2
m
and using (8.7)
p a(m) |b
e2
≤
.
P (Bi )
p a(m)
Therefore,
EB EA(m) |b
"
#
p a(m) |b
p a(m)
≤
X
P (Bi )
i : P (Bi )>0
e2
P (Bi )
≤ re2
(8.10)
(8.11)
Now, recall the definition of r in (8.5) and conclude that
EA(m) EB|a(m)
"
#
p a(m) |b
p a(m)
"
#
p a(m) |b
= EB EA(m) |b
p a(m)
2γ
≤ N B, 2 , ρ e2
m
(8.12)
By Markov’s inequality,
p(a(m) |b)
2γ
1
PA(m) EB|a(m)
≥ N B, 2 , ρm e2 ≤ δ,
δ
m
p(a(m) )
thus, by (8.3)
2γ
1
PA(m) I a(m) ; B ≥ log N B, 2 , ρm + 2 + log
≤ δ,
m
δ
as claimed.
An immediate consequence of the proof of theorem 8.2 is a bound on the mutual information
as presented in the following corollary:
Corollary 8.1 Assume that the conditions of Theorem 8.2 hold. Then
I (A1 , . . . , Am ; B) ≤ log N
2γ
B, 2 , ρm + 2.
m
108
8.2.2
Chapter 8. Noise Tolerance
Information Processing Inequality
A fundamental property of mutual information is the Information Processing Inequality. The
information processing inequality asserts that when data are processed, the mutual information
can only decrease. More formally, for any function g the following holds
I (A; B) ≥ I (g (A) ; B)
(8.13)
As a corollary, if A1 , . . . , Am , B are random variables then for any J ⊆ [1, m]
I (A1 , . . . , Am ; B) ≥ I (AJ ; B)
where AJ = {Aj }j∈J . Nevertheless, as we move to the setting of information from observations,
the situation is more complex. A subset of the observation could contain more information on the
target variable than all the observations. However it is possible to prove a slightly weaker version
of the information processing inequality.
Theorem 8.3 Information Processing Inequality
Let m > 2 and put A1 , . . . , Am to be a set of observed random variables which are mutually
independent given the random variable B. Assume further that each Ai can take only a finite set
of values, and that there is some γ > 0 such that for any ai ∈ Ai and any b ∈ B, p (ai |b) ≥ γ.
Then, for any τ
Pr
a(m) ∼A(m)
[∃J s.t. I (aJ ; B) ≥ τ + 1] ≤
h i
1
I a(m) ; B ≥ τ m log
(m)
(m)
γ
a
∼A
Pr
where aJ = {aj }j∈J .
Theorem 8.3 shows that in a sense, the information processing inequality is valid for the setting
described here with high probability. In the proof of this theorem we make a specific use of the fact
that γ > 0. However, in the deterministic case this assumption is superfluous since the information
is monotonic, thus
∀J ⊆ {1, . . . , m}
I (a1 , . . . , am ; B) ≥ I (aJ ; B)
Before we prove this theorem we derive an immediate corollary
Corollary 8.2 In the setting of theorem 8.3, Let δ > 0 then
#
"
m log γ1
2γ
<δ
Pr
∃J s.t. I (aJ ; B) ≥ N B, 2 , ρm + 3 + log
m
δ
a(m) ∼A(m)
8.2. Information Gain Revisited
109
Corollary 8.2 follows from Theorem 8.2 and Theorem 8.3 by choosing
τ =N
m log γ1
2γ
B, 2 , ρm + 2 + log
m
δ
We now turn to prove Theorem 8.3.
Proof. of Theorem 8.3.
Assume there exists J ⊆ {1, . . . , m} such that
I (aJ ; Θ) > τ + 1
(8.14)
and let J = {1, . . . , m} \ J. We will examine all the possible values of aJ .
Note that by Fubini’s Theorem
EAJ |aJ
At the same time,
"
"
##
h i
(m)
p
a
|b
I a(m) ; B
= EAJ |aJ EB|a(m) log
p a(m)
"
"
##
p a(m) |b
= EB|aJ EAJ |b log
p a(m)
log
p a(m) |b
p a(m)
hence
= log
p (aJ |b) p aJ |b
p (aJ ) p aJ
p aJ |b
p (aJ |b)
= log
+ log
p (aJ )
p aJ
h i
EAJ |aj I a(m) ; B
= I (aJ ; B) + I AJ ; B|aJ
(8.15)
≥ I (aJ ; B)
Note that the second term in (8.15) is a mutual information and thus non-negative. Define
QaJ = Pr
AJ |aJ
h i
I a(m) ; B ≥ τ
For every a(m) we have that I a(m) ; B ≤ m log γ1 since Ai is finite and hence p a(m) |b ≤ 1
and on the other hand p a(m) ≥ γ m . Therefore, from (8.15) it follows that
I (aJ |B)
h i
≤ EAJ |aJ I a(m) ; B
1
≤ τ + m log
QaJ
γ
110
Chapter 8. Noise Tolerance
1
Thus, if I (aJ ; B) ≥ τ + 1 then QaJ ≥ m log
1 . Therefore,
γ
h i
h i
Pr I a(m) ; B ≥ τ
≥ Pr I a(m) ; B ≥ τ and ∃J I (aJ ; B) ≥ τ + 1
A(m)
A(m)
h i
= Pr I a(m) ; B ≥ τ | ∃J s.t. I (aJ ; B) ≥ τ + 1 ×
A(m)
Pr [∃J s.t. I (aJ ; B) ≥ τ + 1]
A(m)
≥
1
Pr [∃J s.t. I (aJ ; B) ≥ τ + 1]
m log γ1 A(m)
Thus we obtain
h i
1
Pr [∃J s.t. I (aJ ; B) ≥ τ + 1] ≤ Pr I a(m) ; B ≥ τ m log
γ
A(m)
A(m)
8.3
SQBC Sample Complexity
In order to analyze the SQBC algorithm we are about to use the information of observation as
a replacement for the information gain used in the analysis of QBC. Note that this is a natural
extension as though the concepts were deterministic, i.e. no noise in the system, in which case the
information gain is equivalent to the information of observation.
Let x̄ = {x1 , x2 , . . .} be a sequence of instances. For the sake of our discussion we will assume
that this sequence is fixed. The label yi of the instance xi tells us something about the target
concept c. Using the terminology of the previous section, yi is an observation of the state of the
target concept which is the random variable3 C. We apply the same technique as in Chapter 6. We
will show that with high probability, for any subset J ⊆ {1, . . . , m}, the information from {yj }j∈J
is not too high. We will argue that when certain conditions apply, SQBC queries for labels with
high information content and thus it will not issue too many queries. This will lead to large gaps
between consecutive queries for labels which will lead SQBC to terminate successfully as proved
in Theorem 8.1 on page 100.
In the following we rework the definition of information gain and its derivatives.
Definition 8.3 For a sequence of instances x̄ = {x1 , x2 , . . .} ∈ X ∞ , the Information Gain from a
set of labels yJ = {yj }j∈J (where yj is the label of the instance xj ) is I (yj ; C | xJ ). The Expected
Information Gain from the next query for a label is
Ej ∗ ,yj∗ ,{xj }j>max J I yJ∪{j ∗ } ; C xJ∪{j ∗ } − I (yJ ; C | xJ )
where the expectation is taken with respect to the sequence of instances {xj }j>max J , the choice of
the next query point (i.e. the choice of j ∗ ) and the label yj ∗ of xj ∗ .
3 There
is a slight abuse of notation here, since C is the concept class and not a random variable.
8.3. SQBC Sample Complexity
111
Unlike the deterministic case, the information gain is not guaranteed to be non-negative. This,
and other properties of the noisy setting make the analysis more involved than the deterministic noise free case.
The main result we would have liked to establish is presented in Theorem 8.4. However,
we encountered a technical difficulty in the course of the proof, when trying to show that the
information gain is, with high probability, linear in the number of queries. A close analysis of the
proof of the analogous result in [48] reveals a similar gap which was overlooked by the authors.
Though it is possible to close the gap in the noise-free case (as we did in chapter 6) we are still
in the process of adjusting the proof to our setup. Hence, the proof of Theorem 8.4 is presented
under the assumption that conjecture 8.1 holds.
Conjecture 8.1 Assume there exist a lower bound g > 0 on the expected information gain from
the query the SQBC algorithm makes at any step. Then, for any δ > 0 there exist constants Kδ
and g̃ > 0 such that if k > Kδ and if J is the set of size k of indexes of the queries that the SQBC
algorithm made, then
Pr
x̄,ȳ,SQBC
[I (yJ ; C | xJ ) < kg̃] ≤ δ
Conjecture 8.1 is the equivalent of lemma 6.2 on page 68.
The next theorem is the main result in this section. It proves that when certain conditions
√
k)
apply, if SQBC is allowed to issue k queries for label, then it will reach an accuracy of e−O(
.
Theorem 8.4 Assume that Conjecture 8.1 holds. Let W be a set of parameters of a concept
class, such that for w ∈ W the probability of observing the label y for the instance x when the
target concept is parameterized by w is p (y | w, x ). Assume that there exists γ > 0 such that
p (y | w, x ) ≥ γ for all y, w, x. Assume that {p (y | w, x ) | w ∈ W } has a Pollard dimension d. Let
ν be a prior over W and let D be a distribution over X . Assume that there is g > 0 such that for
∗
any finite sample S ∈ (X × Y) the expected information gain of SQBC from the next query given
the sample S, is lower bounded by g. Let δ > 0 and let
k ≥ Kδ
(Kδ and g̃ are as defined in Conjecture 8.1).
Then with a probability of 1 − 3δ, SQBC will issue at most k queries for labels and use
m0 =
γdδ (kg̃)/(36/|Y|d)
e
e log γ1
112
Chapter 8. Noise Tolerance
unlabeled instances when learning and will return a hypothesis with
Pr
SQBC,w∗ ,x1 ,x2 ,...
Pr arg max p y|wSQBC , x 6= arg max p (y|w∗ , x) > ǫ < δ
y
x
for any
ǫ>
y
2ke log γ1
γδ 2 d
log
2k (k + 1) −√(kg̃)/(18|Y|)
e
δ
In the statement of theorem 8.4 we used the Pollard dimension. Here is a definition of the
Pollard dimension (see e.g. [2]).
Definition 8.4 Let F be a set of functions from some space Z to IR. F has a Pollard-dimension
d if the class C = {sign (f ) : f ∈ F } has a VC-dimension d.
An alternative definition for the Pollard dimension is to say that if F has a Pollard-dimension
d, if d is maximal such that there exist z1 , . . . , zd ∈ Z such that for any y1 , . . . , yd ∈ {±1} there
exists f ∈ F with yi f (zi ) > 0 for all i.
Alon et al. [2] showed that if F has a Pollard-dimension d then
N (F , ǫ, ρm ) ≤ 2
4m
ǫ2
d log(2em/(dǫ))
where N (·, ·, ·) is the covering number and ρm is the l∞ distance measure when F is restricted
to m points. Note that it is possible to use the Fat-Shattering-Dimension of Alon et al. [2] here,
instead of the Pollard-dimension to obtained slightly better bounds. For the sake of clarity we
avoid using the Fat-Shattering-Dimension here.
Proof. of Theorem 8.4
Assume that SQBC made k queries. Given that Conjecture 8.1 holds, then with a probability
of 1 − δ, the information SQBC gained is at least kg̃. Let J be the indexes of the queries QBC
made. From the information processing inequality, Theorem 8.3 and Corollary 8.2, we know that
with a probability of 1 − δ
m log γ1
2γ
I (yJ ; W | xJ ) ≤ log N W, 2 , ρm + 3 + log
m
δ
Alon et al. [2] proved that
log N
3
em
2γ
W, 2 , ρm ≤ |Y| d log2
+ log 2
m
γd
and therefore
|Y| d log
2
em3
γd
+ log 2 + 3 + log
m log γ1
δ
≥ kg̃
8.4. Summary
113
Applying a coarse upper bound on the left hand side of the above inequality we have
!
em log γ1
2
≥ kg̃
18 |Y| d log
γdδ
and therefore, with a probability of 1 − 2δ, if SQBC made k queries for labels, then
m≥
If
m
k
γδd √(kg̃)/(18|Y|)
e
e log γ1
> tk is as defined in the SQBC algorithm then the SQBC algorithm will bail out and
as we proved in theorem 8.1 on page 100, when this happens, the returned hypothesis is a good
approximation of the target concept with high probability. Therefore it suffices to require that
2
2k (k + 1)
γδd √(kg̃)/(18|Y|) m
> tk =
log
e
≥
k
ǫδ
δ
ke log γ1
which holds whenever
ǫ>
8.4
2ke log γ1
γδ 2 d
log
2k (k + 1) −√(kg̃)/(18|Y|)
e
δ
Summary
The main question we have attempted to address in this chapter is whether active learning in
general and QBC in particular can be applied in the presence of noise and uncertainty. Although
the discussion presented here is incomplete, there is a reason to believe that active learning can
be applied in the realistic setting where noise and uncertainty exist. Nevertheless, we would like
to mention several key issues that are lacking in the discussion in this chapter. First, we were
not able to prove Conjecture 8.1 on page 111. We believe that this conjecture, with some minor
amendments, is true. However, at this point we were not able to prove it. Second, we do not
show here any concept class which has a lower bound on the expected information gain as required
in Theorem 8.4. Finally, we do not have any practical implementation of the SQBC algorithm.
Nevertheless, in Chapter 10 we present an alternative method of overcoming noise using kernels.
Kernels provide a practical method of applying QBC for real world applications. However, the
theoretical justification of this method is weaker.
The revised concepts of information gain and information from observations presented in Section 8.2 are of interest in their own right. Measuring the information an observation carries about a target random variable can play an important role in diverse applications.
Chapter 9
Efficient Implementation Using Random Walks
The Query By Committee algorithm (Algorithm 5 on page 55) is a very simple and straightforward
algorithm. Whenever a new instance is presented, it draws two random hypotheses from the version
space. If these two hypotheses predict different labels for the instance, then the algorithm queries
for the true label. However, this description belies the difficulty of implementing this algorithm
since drawing random hypotheses from the version space is indeed a non-trivial task.
In this chapter we show how QBC can be implemented in polynomial time when learning linear
classifiers. The main ingredient in our implementation is a reduction of the problem of sampling the
version space to the problem of sampling convex bodies. We show that the sophisticated techniques
developed for sampling from convex bodies provide a solution to the missing components in the
QBC algorithm.
The work presented in this chapter is based on a collaboration with Shai Fine and Eli Shamir.
9.1 Linear Classifiers
The question we address in this chapter is “how can the QBC algorithm be used to learn linear
classifiers?”. We assume that the concept class we are interested in is the class of homogeneous
linear classifiers. The sample space is X = IRd and the concept class is C = cw : w ∈ IRd such
that cw (x) = sign (w · x). The class of linear classifiers is frequently used in modern machine
learning. This class is very powerful once the inputs are mapped from the input space to some
feature space using a non-linear map. In some cases, the inputs are mapped to an infinite dimension
Hilbert space, without affecting the computational complexity of learning in this class (see more
about this in Chapter 10). An important property of homogeneous linear classifiers is that they
are scale free in the sense that if w ∈ IRd and λ > 0 then cw is equivalent to cλw . This is due to
the fact that
cw (x) = sign (w · x) = sign (λw · x) = cλw (x)
Therefore, we may assume that the concept class C is defined solely on the unit ball, i.e. C = {c_w : w ∈ IR^d and ‖w‖ ≤ 1}.
A key observation is that when learning homogeneous linear classifiers, the Version Space is a
bounded convex body at all stages as the following lemma shows.
Lemma 9.1 Let C be the class of homogeneous linear classifiers and let S = {(x_i, y_i)}_{i=1}^m be a finite sample (possibly empty). Then the version space induced by S is a bounded convex body.
Proof. Recall that the class of homogeneous linear classifiers is defined as C = {c_w : w ∈ IR^d and ‖w‖ ≤ 1}.
Therefore, C is isomorphic to the unit ball and thus bounded and convex. The concept cw is in the
version space if
∀i yi (w · xi ) ≥ 0
and thus the version space is the intersection of the unit ball with m linear constraints. Since all these constraints are convex, the version space is convex. Furthermore, since the version
space is a subset of the unit ball, it is bounded.
Therefore, the problem of sampling the version space uniformly at random is reduced to the problem of sampling from convex bodies. In the following section we discuss methods of solving the latter problem.
9.2 Sampling from Convex Bodies
The problem of sampling from convex bodies has been studied for the last two decades in the field
of computational geometry. Given a convex body K, the task is to return a point x ∈ K sampled
from the uniform distribution over K. Any efficient sampling algorithm has many applications.
For example, Bertsimas and Vempala [15] showed how convex optimization problems can be solved
efficiently given such a sampling algorithm.
Elekes [43]¹ proved that it is impossible to sample uniformly from convex bodies. Soon after, Dyer, Frieze and Kannan [41] showed that it is possible to sample approximately uniformly from convex bodies. They showed that given a bounded convex body K and an accuracy parameter ǫ > 0, it is possible to sample x from K such that for any set A ⊆ K

$$\left|\Pr_x[x \in A] - U_K(A)\right| < \epsilon$$

where U_K is the uniform measure over K. We use the notation Pr_x[A] to denote the probability that the sampling algorithm will return a point in the set A. The algorithm presented by Dyer, Frieze and Kannan was polynomial, but its running time was O(d^{>20}), where d is the dimension of the body. Nevertheless, in a series of improvements the efficiency of sampling algorithms was significantly improved, and the most recent algorithm performs the sampling task in O^*(d^3) operations² [85]. Although clear advances have been made, current algorithms are still not practical, as the constants involved are too high. Nevertheless, this is an active research field and we expect better algorithms to follow.
Describing these sampling algorithms is well beyond the scope of this dissertation. Although
many different algorithms have been suggested, all use Markov Chain Monte Carlo (MCMC) at their core. For these MCMCs to work, the convex body must be well rounded. Therefore, not only
should K be bounded, i.e. be contained in a ball of radius R, it must also contain a ball of radius
r. The following theorem summarizes the essentials of sampling from convex bodies.
Theorem 9.1 Let K ⊆ IR^d be a convex body such that there exists a ball of radius R which contains K and there exists a ball of radius r which is contained in K. Then there exists a sampling algorithm such that for any ǫ > 0 the algorithm returns x ∈ K such that for any measurable subset S of K

$$\left|\Pr[x \in S] - U_K(S)\right| < \epsilon$$

and the algorithm works in $\mathrm{poly}\left(d, \log\frac{R}{r}, \log\frac{1}{\epsilon}\right)$ time.
The proof for this theorem can be found in [85] for example. We note that the convex body K
is assumed to be given via a separating oracle. In other words, given a point x the oracle either
returns the answer “x is in K” or returns a hyperplane w such that
$$w \cdot x > \max_{z \in K}\,(w \cdot z)$$
This oracle must be able to compute its answer in polynomial time.
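To make this concrete, here is a minimal sketch, in Python, of such a separating oracle for the version space of homogeneous linear classifiers (Lemma 9.1); the function name and the array-based representation are our own illustrative choices, not part of the sampling literature.

```python
import numpy as np

def separation_oracle(w0, X, y):
    """Separating oracle for V = {w : ||w|| <= 1 and y_i (w . x_i) >= 0}.

    Returns (True, None) if w0 lies in V; otherwise returns (False, h)
    where h is a hyperplane with h . w0 > max_{z in V} h . z.
    """
    if np.linalg.norm(w0) > 1.0:
        # The unit-ball constraint is violated; w0 itself separates, since
        # w0 . w0 = ||w0||^2 > ||w0|| >= w0 . z for every z with ||z|| <= 1.
        return False, w0
    for xi, yi in zip(X, y):
        if yi * np.dot(w0, xi) < 0.0:
            # A label constraint is violated; h = -y_i x_i satisfies
            # h . w0 > 0 while h . z <= 0 for every z in V.
            return False, -yi * xi
    return True, None
```

Each call costs O(md) arithmetic operations for m constraints in d dimensions, so such an oracle is clearly polynomial.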
¹ Elekes was interested in the problem of computing the volume of a convex body. However, the problems of sampling from convex bodies and computing their volumes are closely related.
² The notation O^* indicates that logarithmic factors are ignored.
9.3 A Polynomial Implementation of QBC
Freund et al. [48] showed that QBC learns homogeneous linear classifiers exponentially faster
than passive learners (see also Chapter 6). In Section 7.3 on page 91 we showed that we do
not need to sample exactly from the correct prior and distribution. However, we required that
the approximation be close in a multiplicative sense whereas the sampling algorithm discussed in
Theorem 9.1 has an additive discrepancy. Furthermore, the complexity of the sampling algorithm
depends on the ratio between the radii of a bounding ball and a bounded ball. In this section we
show how these ingredients can be used to support a polynomial time implementation of QBC. The
polynomial implementation is presented as Algorithm 7. It has the same structure as the original
QBC algorithm (Algorithm 5 on page 55) when using the sampling techniques presented above.
Next we would like to prove the efficiency of this algorithm. Efficiency here means two things.
The first is the computational complexity for which we show that the algorithm runs in polynomial
time. The second measure of efficiency is the sample complexity for which we show that the
implementation enjoys exponential learning rates, similar to those of the original QBC.
Theorem 9.2 Let the target concept be a uniformly distributed homogeneous linear classifier. Assume that the distribution over the sample space is uniform, and let 1 − δ be a confidence parameter. Then with a probability of 1 − δ, the following holds:

1. The expected information gain of the queries the polynomial QBC makes is at least g/2, where g is the expected information gain of the original QBC algorithm.
2. There exists

$$\tilde{g} = \frac{g\log\left(1+\frac{g}{8}\right)}{32\log\frac{32}{g}}\, 2^{-g/2} > 0$$

such that for any $k = \Omega\left(\tilde{g}^{-2}\log(1/\delta)\right)$ and

$$\epsilon = \Omega\left(\frac{\tilde{g}\,k\log(dk)}{\delta d^2}\, 2^{-gk/d}\right)$$

the polynomial QBC implementation will return a hypothesis h such that

$$\Pr_x\left[h(x) \neq c(x)\right] \le \epsilon$$
3. It will use k labels and $m_0 = d\,2^{O(gk/d)}$ unlabeled instances.

4. Each iteration of the algorithm will run in $\mathrm{poly}\left(k, \frac{1}{\epsilon}, \frac{1}{\delta}\right)$ time.
Proof. We begin by analyzing the computational complexity of the proposed algorithm. We showed in Theorem 9.1 that sampling from convex bodies can be done in $\mathrm{poly}\left(d, \log\frac{R}{r}, \log\frac{1}{\epsilon}\right)$ time.
Algorithm 7 Polynomial Implementation of QBC
Inputs:
• Required accuracy ǫ.
• Required confidence 1 − δ.
• The dimension of the problem d.
Output:
• A hypothesis h.
Algorithm:
1. Let V1 = C.
2. Let k ← 0.
3. Let l ← 0.
4. For t = 1, . . .
(a) Receive an unlabeled instance xt.
(b) Let l ← l + 1.
(c) Select c1 and c2 uniformly from Vt using a sampling algorithm with additive accuracy ǫt, where ǫt = gδ/(240k(k + 1)tk) and g is a lower bound on the expected information gain of QBC when learning linear separators with the correct priors.
(d) If c1(xt) ≠ c2(xt) then
   i. Query for the label yt = c(xt).
   ii. Let k ← k + 1.
   iii. Let l ← 0.
   iv. Let Vt+1 ← {c ∈ Vt : c(xt) = yt}.
(e) Else let Vt+1 ← Vt.
(f) If l ≥ tk, where tk = (80/(ǫδ²)) ln(10k(k + 1)/δ):
   i. Choose a hypothesis h uniformly from Vt using a sampling algorithm with additive accuracy δ/40.
   ii. Return h.
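For concreteness, the following Python sketch mirrors the control flow of Algorithm 7 under stated assumptions: `sampler(S, accuracy)` stands for any approximately uniform sampler over the version space induced by the labeled set S (e.g. a convex-body sampler as in Theorem 9.1), `stream` and `query_label` are hypothetical interfaces to the unlabeled source and the teacher, and the tk formula is guarded at k = 0 since no query has been made yet.

```python
import numpy as np

def poly_qbc(stream, query_label, g, eps, delta, sampler):
    """Schematic rendering of Algorithm 7; returns a hypothesis, or None
    if the stream is exhausted before the stopping rule fires."""
    S = []          # labeled examples; they induce the version space V_t
    k, l = 0, 0     # queries made so far / instances since the last query
    for x in stream:
        l += 1
        kk = max(k, 1)  # guard: the formulas below assume k >= 1
        t_k = 80.0 / (eps * delta ** 2) * np.log(10 * kk * (kk + 1) / delta)
        eps_t = g * delta / (240 * kk * (kk + 1) * t_k)
        c1, c2 = sampler(S, eps_t), sampler(S, eps_t)
        if np.sign(c1 @ x) != np.sign(c2 @ x):   # the committee disagrees
            y = query_label(x)                   # query the teacher
            S.append((x, y))                     # V_{t+1} = {c : c(x_t) = y_t}
            k, l = k + 1, 0
        elif l >= t_k:                           # t_k quiet rounds: stop
            return sampler(S, delta / 40.0)      # random (Gibbs) hypothesis
    return None
```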
Here, R is the radius of a ball containing Vt and r is the radius of a ball contained in Vt. Clearly, Vt is a
subset of the unit ball and thus we may assume that R = 1. We would like to show that r is not
too small. Let V ∗ be the version space induced by the labels of all the m0 instances. It is clear
that ∀t, V* ⊆ Vt. Therefore, if there is a ball of radius r in V* then the same ball is contained in Vt as well. Thus we will study V*. In Lemma 6.1 we show that for any sequence of m0 instances, the probability that the target concept is such that the probabilistic volume of the version space it induces is smaller than (em_0/d)^{-(d+1)} is at most d/(em_0). Therefore, if m_0 > 10d/(eδ) then with a probability of 1 − δ/10, the measure of the version space is at least (em_0/d)^{-(d+1)} at all times, and thus its volume is at least Vol(B_d)(em_0/d)^{-(d+1)}, where B_d is the d-dimensional unit ball and Vol(B_d) is its volume. In Lemma 9.2 on page 122 we show that for any compact convex body K, such as the version space, there exists a ball of radius r inside the convex body with

$$r \ge \frac{\mathrm{Vol}(K)}{\mathrm{Vol}(B_d)\, d^d R^{d-1}}$$

where R is the radius of a ball containing K. Since the version space is a subset of the unit ball, we can use R = 1 in our case. Since the volume of the version space is at least Vol(B_d)(em_0/d)^{-(d+1)}, we conclude that with probability 1 − δ the version space contains a ball of radius r such that

$$r \ge \frac{d}{(em_0)^{d+1}}$$
Using the bound on r and the bound on m_0, we obtain that the complexity of each iteration is

$$\mathrm{poly}\left(d, gk, \log\frac{1}{\epsilon}, \log\frac{1}{\delta}\right)$$
We now turn to prove that the hypothesis returned by this implementation of the QBC is
indeed a good approximation of the target concept. The proof is very similar to the proof of
Theorem 7.3 on page 92 where we considered the QBC algorithm with incorrect priors. For the
sake of completeness we will show the two main ingredients needed for the proof. We begin by
showing that there is a lower bound on the expected information gain from the next query. Next we
will show that if the algorithm terminated, then the hypothesis returned is a good approximation
of the target concept with high probability.
We begin by analyzing the expected information gain. Let V be the current version space and
let γ be the additive accuracy we require from the sampling algorithm. We have that
$$g \le \frac{\int U_V(V^+(x))\, U_V(V^-(x))\, H(U_V(V^+(x)))\, dD(x)}{\int U_V(V^+(x))\, U_V(V^-(x))\, dD(x)}$$
where UV (·) is the uniform distribution restricted to V . When sampling c from V we are guaranteed
that for any measurable set A:

$$\left| U_V(A) - \Pr_c[c \in A] \right| \le \gamma$$

and thus we have

$$\Pr_c\left[c \in V^+(x)\right] \Pr_c\left[c \in V^-(x)\right] \le \left(U_V(V^+(x)) + \gamma\right)\left(U_V(V^-(x)) + \gamma\right) = U_V(V^+(x))\, U_V(V^-(x)) + \gamma\left(U_V(V^+(x)) + U_V(V^-(x)) + \gamma\right) \le U_V(V^+(x))\, U_V(V^-(x)) + \gamma(1 + \gamma)$$

and since γ < 1 we have

$$\Pr_c\left[c \in V^+(x)\right] \Pr_c\left[c \in V^-(x)\right] \le U_V(V^+(x))\, U_V(V^-(x)) + 2\gamma$$

Repeating the same argument we have that

$$\Pr_c\left[c \in V^+(x)\right] \Pr_c\left[c \in V^-(x)\right] \ge U_V(V^+(x))\, U_V(V^-(x)) - \gamma$$
Let q be the probability that the polynomial QBC will query for the label of the next instance it sees. Clearly

$$q = 2\int \Pr_c\left[c \in V^+(x)\right] \Pr_c\left[c \in V^-(x)\right] dD(x)$$

If q is very small then the algorithm will most likely terminate. Recall that it terminates when no query is made for tk consecutive instances; the probability of this is exactly (1 − q)^{tk}. Assume that

$$q \le \frac{\delta}{20k(k+1)t_k}$$

Since 0 < q ≤ 1/2, we have e^{−2q} ≤ 1 − 2q + 2q² ≤ 1 − q, and therefore

$$(1 - q)^{t_k} \ge e^{-2qt_k} \ge e^{-\delta/10k(k+1)} \ge 1 - \frac{\delta}{10k(k+1)}$$

Hence the probability that the algorithm fails to terminate once q is this small is at most δ/10k(k + 1). By summing over k we get that, with a probability of 1 − δ/10, the algorithm will not make another query after it reaches a state where q ≤ δ/20k(k + 1)tk.
Assume now that q > δ/20k(k + 1)tk. It follows that the probability that QBC, when sampling from the true posterior, would query for the label of the next instance is at least q − 4γ. Since the expected information gain of QBC is at least g, we have that

$$2\int U_V\left(V^+(x)\right) U_V\left(V^-(x)\right) H\left(U_V\left(V^+(x)\right)\right) dD(x) \ge g(q - 4\gamma)$$
and thus

$$2\int \Pr_c\left[c \in V^+(x)\right] \Pr_c\left[c \in V^-(x)\right] H\left(U_V\left(V^+(x)\right)\right) dD(x) \ge g(q - 4\gamma) - 2\gamma \ge gq - 6\gamma$$

Thus the expected information gain of the polynomial QBC is at least

$$\frac{gq - 6\gamma}{q} = g - \frac{6\gamma}{q}$$

By choosing $\gamma = \frac{g\delta}{240k(k+1)t_k}$ and using the fact that $q > \frac{\delta}{20k(k+1)t_k}$, we conclude that the expected information gain is at least g/2.
The lower bound on the expected information gain proves that the number of queries that the polynomial QBC algorithm will make on the sample of size m_0 is $O\left(\frac{d}{\tilde{g}}\log m_0\right)$ (see the arguments in the proof of the fundamental properties of the QBC algorithm, Theorem 6.1). Therefore, the
algorithm will reach, with high probability, a sequence of tk consecutive instances for which it did
not query for a label. We now argue that when this happens, if the algorithm returns a random
hypothesis then it is likely to be a good approximation of the target concept.
Let W ⊆ C × C be the set

$$W = \left\{(c_1, c_2) : D\left(x : c_1(x) \neq c_2(x)\right) > \epsilon\right\}$$

Let p be the probability that (c1, c2) ∈ W when c1 is chosen using the sampling algorithm while c2 is chosen using the true prior. If p ≤ δ/10, then if the polynomial QBC terminates and returns a random hypothesis, it will be a good approximation of the target concept with high probability. Now assume that p > δ/10. We will show that with high probability the QBC algorithm will not terminate.
Since p > δ/10,

$$\Pr_{c_1}\left[U_V\left(c_2 : (c_1, c_2) \in W\right) > \delta/20\right] > \delta/20$$

For each c1 such that U_V(c2 : (c1, c2) ∈ W) > δ/20 we have that if we sample c2 from V, we will hit the set W with high probability, since

$$\Pr_{c_2}\left[(c_1, c_2) \in W\right] > \frac{\delta}{20} - \epsilon^* = \frac{\delta}{40}$$

and thus

$$\Pr_{c_1, c_2}\left[(c_1, c_2) \in W\right] > \frac{\delta^2}{80}$$
By the definition of the set W, the probability that the polynomial QBC will query for the label of the next instance is at least ǫδ²/80. Therefore the probability of tk consecutive instances without a query, assuming that p > δ/10, is less than

$$\left(1 - \frac{\epsilon\delta^2}{80}\right)^{t_k} \le e^{-\epsilon\delta^2 t_k/80}$$

which by the choice of tk is δ/10k(k + 1). By summing over k we get that the probability that QBC will terminate when p ≥ δ/10 is at most δ/10.
The remaining specifics of this proof are identical to the proof of Theorem 6.1 which explores
the original QBC algorithm, and have thus been omitted. Note that there are several possible
causes for failure. However, we showed that the probability for each of these causes is less than
δ/10. Using the union bound we get that the probability of failure is less than δ.
9.4 A Geometric Lemma
While sampling from convex bodies can be done in polynomial time, for this to happen we need to prevent the body from becoming singular, i.e. we need the ratio between the radii of a bounding ball and a bounded ball to be moderate. We use the following lemma to bound this ratio.
Lemma 9.2 Let K be a compact convex body in IR^d which is bounded by a ball of radius R, and let Vol(K) be the volume of this body. Then there exists a ball of radius r inside K such that

$$r \ge \frac{\mathrm{Vol}(K)}{\mathrm{Vol}(B_d)\, d^d R^{d-1}}$$
Proof. Recall that John's theorem [60] states that there exists an ellipsoid E ⊆ K such that K ⊆ dE, where dE is the ellipsoid E blown up by a factor d around its origin. Let λ_1 ≥ λ_2 ≥ · · · ≥ λ_d be the lengths of the principal axes of E. Since λ_d is the smallest, we can place a ball of radius λ_d inside E, centered at E's origin. Thus there is a ball of radius r = λ_d inside K.
The lengths of the principal axes of dE are dλ_1, . . . , dλ_d, and thus the volume of dE is $\mathrm{Vol}(B_d)\prod_{i=1}^{d} d\lambda_i$ where B_d is the d-dimensional unit ball. Since K ⊆ dE we have

$$\mathrm{Vol}(K) \le \mathrm{Vol}(dE) = \mathrm{Vol}(B_d)\prod_{i=1}^{d} d\lambda_i$$

and therefore

$$r \ge \lambda_d \ge \frac{\mathrm{Vol}(K)}{\mathrm{Vol}(B_d)\, d^d \prod_{i=1}^{d-1}\lambda_i}$$
Finally, since K is contained in a ball of radius R, and E is contained in K, we have λ_1, . . . , λ_d ≤ R and thus

$$r \ge \frac{\mathrm{Vol}(K)}{\mathrm{Vol}(B_d)\, d^d R^{d-1}}$$
The significance of this lemma is that it shows that

$$\log\frac{R}{r} = O\left(d\log d + d\log R - \log \mathrm{Vol}(K)\right)$$

Therefore, as expected, log(R/r) remains moderate as long as K occupies a non-negligible portion of the bounding ball of radius R. While the constants in Lemma 9.2 are not tight, it is clear that any bound on r must be $O\left(\mathrm{Vol}(K)/R^{d-1}\right)$. To see this, let R > r > 0 and let o_1, . . . , o_d be a set of orthogonal vectors such that the lengths of o_1, . . . , o_{d−1} are R and the length of o_d is r. Let K be an ellipsoid with o_1, . . . , o_d as its principal directions. Clearly, the minimal ball bounding K is of radius R and the maximal ball bounded in K is of radius r. Furthermore, the volume of K is R^{d−1} r Vol(B_d). Therefore

$$r = \frac{\mathrm{Vol}(K)}{\mathrm{Vol}(B_d)\, R^{d-1}}$$

which is identical to the bound of Lemma 9.2 up to a factor of d^d.
9.5 Summary
In this chapter we showed that the QBC algorithm can be implemented in polynomial time for learning homogeneous linear classifiers. We reduced the problem of implementing the QBC algorithm to the problem of sampling from convex bodies, and used polynomial algorithms for sampling from such bodies. While these algorithms are polynomial, they are still far from practical. We discuss this issue further in the next chapter.
Chapter 10
Kernelizing the QBC
In this chapter we take another step towards making it possible to use QBC for real world tasks.
In chapter 9 we saw that algorithms for sampling from convex bodies can be used to sample from
the version space when learning linear classifiers. While this provided us with a polynomial time
algorithm it is not sufficient because it assumes that the task at hand can be carried out by a linear
classifier. This problem is not unique to active learning and the QBC algorithm. The same problem
is found in classical models such as the Perceptron [97] and more generally in Neural Networks.
The universal way of overcoming this problem is to add a non-linear phase to the model. This is
typically done by mapping the input data through a non-linear activation function. The idea is to map the data to a new space in which it is more likely to be linearly separable.
Further improvement on the idea of mapping the data to a new space and learning in the
new space was made by Vapnik and others [127, 20]. They showed that in many cases, it is not
necessary to explicitly map the data. Rather, this can be done implicitly by using kernels (see
section 10.1 for more about kernels). This observation led to a wave of algorithms utilizing kernels: SVM [20], kernel PCA [111] and others (see e.g. [116] and references therein). Kernels have proved successful in many applications, ranging from speaker identification [45] to predicting arm
movements of monkeys [117].
In this chapter we show how kernels can be used together with the QBC algorithm. To this end, we need to modify the algorithm to enable the use of kernels. The algorithm we present in this
chapter uses the same skeleton as QBC, but replaces sampling from the high dimensional version
space by sampling from a low dimensional projection of it. By doing so, we obtain an algorithm
which can cope with large-scale problems and at the same time permits the use of kernels.
Although the algorithm uses linear classifiers at its core, the use of kernels makes it much broader
in scope. This new sampling method is presented in section 10.2. Section 10.3 gives a detailed
description of the kernelized version, the Kernel Query By Committee (KQBC) algorithm. The
last building block is a method for sampling from convex bodies. We suggest the hit and run [85]
random walk for this purpose in section 10.4. A Matlab implementation of KQBC is available at
http://www.cs.huji.ac.il/labs/learning/code/qbc.
Other algorithms have been suggested for sampling from the version space. Most notable is the Billiard-walk based sampling of Herbrich et al. [55]. Herbrich and his coauthors considered the
problem of sampling the version space when kernels are used. The added value of our method is
two-fold. First, we extend the theoretical reasoning behind the sampling approach. Second, we
suggest using “hit and run” (see section 10.4) instead of the Billiard walk since “hit and run” is
easier to use and is guaranteed to mix fast to the right, i.e. uniform, distribution.
10.1 Kernels

We begin with a brief introduction to kernels. The reader who is familiar with this subject may
wish to skip this section.
Kernels are widely used in modern machine learning. They make it possible to use a unified
learning algorithm for solving a diversity of problems by plugging in different kernels. In this
section we give a brief introduction to the main definitions and properties of kernels.
Definition 10.1 A function K : X × X → IR is a kernel function if there exist a Hilbert space H
and a function ϕ : X → H such that K (x1 , x2 ) = ϕ (x1 ) · ϕ (x2 ).
10.1.1 Commonly Used Kernel Functions
Here is a list of some commonly used kernel functions:
1. The polynomial kernel: For X = IR^d we define the kernel function

$$K(x_1, x_2) = (x_1 \cdot x_2 + c)^p$$

for c ≥ 0 and p ≥ 1.

2. The Gaussian/radial kernel: For X = IR^d we define the kernel function

$$K(x_1, x_2) = e^{-\|x_1 - x_2\|^2/2\sigma^2}$$

for σ ≠ 0.
3. The sigmoid kernel: For X = IRd we define the kernel function
K (x1 , x2 ) = tanh (κx1 · x2 + θ)
for a variety of choices of κ and θ.
4. The ridge kernel [115]: an extension that can be applied to any kernel. Let K be a kernel function; then we define the kernel function

$$\hat{K}(x_1, x_2) = K(x_1, x_2) + \Delta\, \delta_{x_1, x_2}$$

where ∆ ≥ 0 and δ is the Kronecker delta, i.e.

$$\delta_{x_1, x_2} = \begin{cases} 1 & \text{if } x_1 = x_2 \\ 0 & \text{otherwise} \end{cases}$$
Other kernels exist for a variety of sample spaces: string kernels [76], spike kernels [117], Fisher kernels [57] and many others.
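As an illustration, here is a minimal Python sketch of the polynomial, Gaussian and ridge kernels listed above; the function names and default parameter values are our own.

```python
import numpy as np

def polynomial_kernel(x1, x2, c=1.0, p=2):
    # (x1 . x2 + c)^p, with c >= 0 and p >= 1
    return (np.dot(x1, x2) + c) ** p

def gaussian_kernel(x1, x2, sigma=1.0):
    # exp(-||x1 - x2||^2 / (2 sigma^2))
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2.0 * sigma ** 2))

def ridge_kernel(x1, x2, base=gaussian_kernel, ridge=0.1):
    # K(x1, x2) + Delta * delta_{x1,x2}, with delta the Kronecker delta
    return base(x1, x2) + (ridge if np.array_equal(x1, x2) else 0.0)
```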
10.1.2 The Gram Matrix
Many learning algorithms, e.g. SVM [20], need only inner products between instances for training
and generalization. In these cases, it suffices to provide the algorithm with the Gram matrix, which
contains all the inner products between instances:
Definition 10.2 Let x_1, . . . , x_m be instances in a sample space X, and let K be a kernel function over this space. Then the Gram matrix is a symmetric real-valued m × m matrix whose entry in position i, j is K(x_i, x_j).
It follows that any Gram matrix must be positive semi-definite. In other words, if G is a Gram matrix then for any vector w ∈ IR^m:

$$wGw^{\top} \ge 0$$
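A quick numerical illustration of this fact, assuming the linear kernel K(x_1, x_2) = x_1 · x_2 (the data and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                            # five instances in R^3
G = np.array([[np.dot(a, b) for b in X] for a in X])   # Gram matrix G_ij = K(x_i, x_j)
print(np.all(np.linalg.eigvalsh(G) >= -1e-12))         # True: no negative eigenvalues
```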
10.1.3 Mercer’s Conditions
Mercer’s conditions provide necessary and sufficient conditions for a function K : X × X → IR to
be a valid kernel function.
10.1. Kernels
127
Theorem 10.1 A function K : X × X → IR is a kernel function iff for any g(x) such that

$$\int g(x)^2\, dx < \infty$$

it holds that

$$\int K(x_1, x_2)\, g(x_1)\, g(x_2)\, dx_1\, dx_2 \ge 0$$

10.1.4 The Special Case of the Ridge Kernel
The ridge kernel has a unique property: it is generic in the sense that using this kernel every sample
becomes linearly separable. This property is very important since the QBC algorithm assumes that
the learning problem is linearly separable. Therefore, using the ridge kernel we can guarantee the
separability even if noise exists in the system.¹
Lemma 10.1 Let K be a kernel and let $\hat{K}$ be the ridge kernel $\hat{K}(x_1, x_2) = K(x_1, x_2) + \Delta\delta_{x_1,x_2}$ for ∆ > 0. Let $\hat{\varphi}$ be the map associated with $\hat{K}$, i.e.

$$\hat{K}(x_1, x_2) = \hat{\varphi}(x_1) \cdot \hat{\varphi}(x_2)$$

Let S = (x_1, . . . , x_m) be a sample of m unique instances. Then for any labels vector y = (y_1, . . . , y_m) there exists a separator $w \in \mathrm{span}\left(\hat{\varphi}(x_1), \ldots, \hat{\varphi}(x_m)\right)$ such that $w \cdot \hat{\varphi}(x_i) = y_i$ for every i = 1, . . . , m.

Proof. Let G be the Gram matrix associated with K. The Gram matrix $\hat{G}$ associated with the kernel $\hat{K}$ is simply

$$\hat{G} = G + \Delta I$$

where I is the identity matrix. Since G is positive semi-definite and ∆ > 0, it follows that $\hat{G}$ is positive definite and thus invertible. Therefore, for any set of labels y = (y_1, . . . , y_m) it is possible to find a vector α such that $y = \hat{G}\alpha$. Let $w = \sum_j \alpha_j \hat{\varphi}(x_j)$; then

$$w \cdot \hat{\varphi}(x_i) = \sum_j \alpha_j \hat{K}(x_j, x_i) = \hat{G}_i \alpha = y_i$$

where $\hat{G}_i$ is the i'th row of the matrix $\hat{G}$.
¹ The “Soft” QBC uses a direct approach to tackle the noisy case. See Chapter 8 for more details.
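A short numerical check of Lemma 10.1, with the linear kernel playing the role of the base kernel (the data, labels and ∆ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))                  # six unique instances
y = np.array([1., -1., 1., 1., -1., -1.])    # an arbitrary labeling
Delta = 0.5

G = X @ X.T                                  # Gram matrix of the base (linear) kernel
G_hat = G + Delta * np.eye(len(X))           # ridge kernel: positive definite, invertible
alpha = np.linalg.solve(G_hat, y)            # solve y = G_hat alpha

# w = sum_j alpha_j phi_hat(x_j) then satisfies w . phi_hat(x_i) = (G_hat alpha)_i = y_i
print(np.allclose(G_hat @ alpha, y))         # True
```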
10.2 A New Method for Sampling the Version-Space
The Query By Committee algorithm [112] provides a general framework that can be used with any
concept class. Whenever a new instance is presented, QBC generates two independent predictions
for its label by sampling two hypotheses from the version space. If the two predictions differ, QBC
queries for the label of the instance at hand (see algorithm 5 on page 55). The main obstacle
in implementing QBC is the need to sample from the version space (step 4c). It is not clear
how to do this with reasonable computational complexity. As is the case for most research in
machine learning, we first focus on the class of linear classifiers and then extend the discussion
by using kernels. In the linear case, the dimension of the version space is the input dimension
which is typically large for real world problems. Thus direct sampling is practically impossible.
We overcome this obstacle by projecting the version space onto a low dimensional subspace.
Assume that the learner has seen the labeled sample S = {(x_i, y_i)}_{i=1}^k, where x_i ∈ IR^d and y_i ∈ {±1}. The version space is defined to be the set of all classifiers which correctly classify all the instances seen so far:

$$V = \{w : \|w\| \le 1 \text{ and } \forall i\ y_i(w \cdot x_i) > 0\} \qquad (10.1)$$
QBC assumes a prior ν over the class of linear classifiers. The sample S induces a posterior
over the class of linear classifiers which is the restriction of ν to V . Thus, the probability that
QBC will query for the label of an instance x is exactly
$$2\Pr_{w\sim\nu|V}[w \cdot x > 0]\, \Pr_{w\sim\nu|V}[w \cdot x < 0] \qquad (10.2)$$
where ν|V is the restriction of ν to V .
From (10.2) we see that there is no need to explicitly select two random hypotheses. Instead,
we can use any stochastic approach that will query for the label with the same probability as in
(10.2). Furthermore, if we can sample ŷ ∈ {±1} such that
$$\Pr[\hat{y} = 1] = \Pr_{w\sim\nu|V}[w \cdot x > 0] \qquad (10.3)$$

and

$$\Pr[\hat{y} = -1] = \Pr_{w\sim\nu|V}[w \cdot x < 0] \qquad (10.4)$$
we can use it instead, by querying the label of x with a probability of 2 Pr [ŷ = 1] Pr [ŷ = −1].
Based on this observation, we introduce a stochastic algorithm which returns ŷ with probabilities as specified in (10.3) and (10.4). This procedure can replace the sampling step in the QBC
algorithm.
Let S = {(x_i, y_i)}_{i=1}^k be a labeled sample. Let x be an instance for which we need to decide
whether to query for its label or not. We denote by V the version space as defined in (10.1) and
denote by T the space spanned by x1 , . . . , xk and x. QBC asks for two random hypotheses from
V and queries for the label of x only if these two hypotheses predict different labels for x. Our
procedure does the same thing, but instead of sampling the hypotheses from V we sample them
from V ∩ T . One main advantage of this new procedure over the original QBC is that it samples
from a space of low dimension and therefore its computational complexity is much lower. This
is true since T is a space of dimension at most k + 1, where k is the number of label queries QBC has made so far. Hence, the body V ∩ T is a low-dimensional convex body² and thus sampling
from it can be done efficiently. The input dimension plays a minor role in the sampling algorithm.
Another important advantage is that it allows us to use kernels, and therefore gives a systematic
way to extend QBC to the non-linear scenario. The use of kernels is described in detail in section
10.3.
The following theorem proves that indeed sampling from V ∩ T produces the desired results.
It shows that if the prior ν (see algorithm 5 on page 55) is uniform, then sampling hypotheses
uniformly from V or from V ∩ T generates the same results.
Theorem 10.2 Let S = {(xi , yi )}ki=1 be a labeled sample and x an instance. Let V be the version
space
$$V = \{w : \|w\| \le 1 \text{ and } \forall i\ y_i(w \cdot x_i) > 0\}$$
and let T = span(x, x_1, . . . , x_k). Then

$$\Pr_{w\sim U(V)}[w \cdot x > 0] = \Pr_{w\sim U(V\cap T)}[w \cdot x > 0]$$

and

$$\Pr_{w\sim U(V)}[w \cdot x < 0] = \Pr_{w\sim U(V\cap T)}[w \cdot x < 0]$$

where U(·) is the uniform distribution.
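Theorem 10.2 can be checked numerically. The sketch below uses naive rejection sampling in place of a proper convex-body sampler (the dimensions, data and seed are arbitrary choices); the two estimated probabilities should agree up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
X = rng.normal(size=(2, d))           # two labeled instances x_1, x_2
y = np.array([1.0, -1.0])
x = rng.normal(size=d)                # the new instance

def in_V(w):                          # membership in the version space V
    return np.linalg.norm(w) <= 1 and np.all(y * (X @ w) > 0)

def ball_point(m):                    # uniform point in the unit ball of R^m
    v = rng.normal(size=m)
    return v / np.linalg.norm(v) * rng.uniform() ** (1.0 / m)

# Rejection sampling from the full d-dimensional body V.
ws = [w for w in (ball_point(d) for _ in range(100000)) if in_V(w)]
p_full = np.mean([w @ x > 0 for w in ws])

# Rejection sampling from V ∩ T, T = span(x, x_1, x_2): since Q is an
# isometry onto T, a uniform coefficient vector gives a uniform point.
Q, _ = np.linalg.qr(np.column_stack([x, *X]))   # orthonormal basis of T
vs = [Q @ a for a in (ball_point(Q.shape[1]) for _ in range(100000)) if in_V(Q @ a)]
p_proj = np.mean([w @ x > 0 for w in vs])

print(p_full, p_proj)                 # approximately equal
```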
Before we prove this theorem, we prove a couple of lemmas:
Lemma 10.2 Let V and T be as defined in Theorem 10.2. Let PT be the orthogonal projection to
T then
PT (V ) = V ∩ T
² From the definition of the version space V it follows that it is a convex body. See Lemma 9.1 on page 115.
Proof. Let w ∈ V ∩ T. Since w ∈ T, we have w = P_T(w); combined with the fact that w ∈ V, we conclude that w ∈ P_T(V) and thus V ∩ T ⊆ P_T(V).
On the other hand, let w ∈ P_T(V). It suffices to show that w ∈ V to complete the proof. Let ŵ ∈ V be such that P_T(ŵ) = w. Since P_T is a projection, ‖w‖ ≤ ‖ŵ‖ ≤ 1. Moreover, since ŵ − w ∈ T^⊥ and x_i ∈ T, we have that

$$\hat{w} \cdot x_i = w \cdot x_i + (\hat{w} - w)\cdot x_i = w \cdot x_i$$

and thus

$$y_i\, w \cdot x_i = y_i\, \hat{w} \cdot x_i > 0$$

hence w ∈ V, which completes the proof.
Next we show that V is almost a product space.

Lemma 10.3 Let w ∈ V ∩ T. Then

$$P_T^{-1}(w) \cap V = \left\{ w + v : v \in T^{\perp},\ \|v\|^2 \le 1 - \|w\|^2 \right\}$$

where $P_T^{-1}(w) = \{v : P_T(v) = w\}$.
Proof. Let v ∈ T^⊥ be such that ‖v‖² ≤ 1 − ‖w‖². Since v ⊥ w, we have ‖v + w‖² = ‖v‖² + ‖w‖² ≤ 1. For any (x_i, y_i), we have that

$$y_i (w + v) \cdot x_i = y_i\, w \cdot x_i$$

since v ⊥ x_i, and therefore (v + w) ∈ V. Furthermore, P_T(w + v) = w since v ∈ T^⊥, and thus (v + w) ∈ P_T^{-1}(w) ∩ V. Therefore

$$P_T^{-1}(w) \cap V \supseteq \left\{ w + v : v \in T^{\perp},\ \|v\|^2 \le 1 - \|w\|^2 \right\}$$

On the other hand, let u ∈ P_T^{-1}(w) ∩ V. Clearly, P_T(u) = w and therefore u = w + v with v ∈ T^⊥. Since w ⊥ v and ‖u‖ ≤ 1, it follows that ‖v‖² = ‖u‖² − ‖w‖² ≤ 1 − ‖w‖², and thus

$$P_T^{-1}(w) \cap V \subseteq \left\{ w + v : v \in T^{\perp},\ \|v\|^2 \le 1 - \|w\|^2 \right\}$$

This completes the proof.
We are now ready to present the proof of the main theorem.

Proof of Theorem 10.2.
First note that for any u ∈ V

$$\mathrm{sign}(u \cdot x) = \mathrm{sign}(P_T(u) \cdot x) \qquad (10.5)$$

Let ν be the push-forward probability measure P_T(U(V)); i.e. if A is a measurable set then ν(A) is the measure under U(V) of P_T^{-1}(A). From (10.5) it follows that

$$\Pr_{w\sim\nu}[w \cdot x > 0] = \Pr_{w\sim U(V)}[w \cdot x > 0]
\qquad\text{and}\qquad
\Pr_{w\sim\nu}[w \cdot x < 0] = \Pr_{w\sim U(V)}[w \cdot x < 0]$$
Clearly, ν is continuous with respect to the Lebesgue measure and hence has a density; let dν denote it. From Lemma 10.2 it follows that for any w ∉ V ∩ T the density dν(w) is zero. From Lemma 10.3 it follows that for any w ∈ V ∩ T the density dν(w) depends solely on ‖w‖. Finally, since for any λ > 0

$$\mathrm{sign}(w \cdot x) = \mathrm{sign}(\lambda w \cdot x)$$

it follows that

$$\Pr_{w\sim\nu}[w \cdot x > 0] = \Pr_{U(V\cap T)}[w \cdot x > 0]
\qquad\text{and}\qquad
\Pr_{w\sim\nu}[w \cdot x < 0] = \Pr_{U(V\cap T)}[w \cdot x < 0]$$

This completes the proof.
Theorem 10.2 establishes the soundness of the sampling procedure presented: although we sample from a low-dimensional projection of the version space, the resulting predictions are distributed identically.
10.3 Sampling with Kernels
In this section we show how the new sampling method presented in section 10.2 can be used
together with kernels. QBC uses the random hypotheses for one purpose alone: to check the labels
they predict for instances. In our new sampling method the hypotheses are sampled from V ∩ T ,
where T = span (x, x1 , . . . , xk ). Hence, any hypothesis is represented by w ∈ V ∩ T , that has the
form
$$w = \alpha_0 x + \sum_{j=1}^{k} \alpha_j x_j \qquad (10.6)$$
The label w assigns to an instance x′ is

$$w \cdot x' = \left(\alpha_0 x + \sum_{j=1}^{k} \alpha_j x_j\right) \cdot x' = \alpha_0\, x \cdot x' + \sum_{j=1}^{k} \alpha_j\, x_j \cdot x' \qquad (10.7)$$
Note that in (10.7) only inner products are used, hence we can use kernels. Using these observations, we can sample a hypothesis by sampling α_0, . . . , α_k and defining w as in (10.6). However, since the
xi ’s do not form an orthonormal basis of T , sampling the α’s uniformly is not equivalent to sampling
the w’s uniformly. We overcome this problem by using an orthonormal basis of T . The following
lemma shows how an orthonormal basis for T can be computed when only inner products are used.
Lemma 10.4 Let x_0, . . . , x_k be a set of vectors, let T = span(x_0, . . . , x_k) and let G = (g_{i,j}) be the Gram matrix such that g_{i,j} = x_i · x_j. Let λ_1, . . . , λ_r be the non-zero eigenvalues of G with the corresponding eigenvectors γ_1, . . . , γ_r. Then the vectors t_1, . . . , t_r such that

$$t_i = \sum_{l=0}^{k} \frac{\gamma_i(l)}{\sqrt{\lambda_i}}\, x_l$$

form an orthonormal basis of the space T.
This lemma is significant since the basis t_1, . . . , t_r enables us to sample from V ∩ T using simple techniques. Note that a vector w ∈ T can be expressed as $\sum_{i=1}^{r}\alpha(i)\, t_i$. Since the t_i's form an orthonormal basis, ‖w‖ = ‖α‖. Furthermore, we can check the label w assigns to x_j by

$$w \cdot x_j = \sum_i \alpha(i)\, t_i \cdot x_j = \sum_{i,l} \alpha(i)\, \frac{\gamma_i(l)}{\sqrt{\lambda_i}}\, x_l \cdot x_j$$

which is a function of the Gram matrix. Therefore, sampling from V ∩ T boils down to the problem of sampling from convex bodies, where instead of sampling a vector directly we sample the coefficients of the orthonormal basis t_1, . . . , t_r. Keep in mind that we do not need to recalculate this basis for every new instance whose label we consider querying. Instead, if we have the basis t_1, . . . , t_r for span(x_1, . . . , x_k) and we encounter a new instance x_0, we can simply do the following calculation:

$$t^{\perp} = x_0 - \sum_{i=1}^{r} (x_0 \cdot t_i)\, t_i$$

If t^⊥ is zero then x_0 ∈ span(x_1, . . . , x_k) and thus we do not need to extend the basis. Otherwise we can extend the basis with the vector $t_{r+1} = t^{\perp}/\|t^{\perp}\|$. The computational complexity of this process is O(r²), which is O(k²) at most.
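As an illustration, here is a Python sketch of the construction of Lemma 10.4 together with the orthonormality check; note that only the Gram matrix is touched, never the x_i's themselves (the function name and numerical tolerance are our own).

```python
import numpy as np

def orthonormal_basis_coeffs(G, tol=1e-10):
    """Coefficients of an orthonormal basis of span(x_0, ..., x_k), following
    Lemma 10.4: t_i = sum_l gamma_i(l) / sqrt(lambda_i) x_l.

    G is the Gram matrix g_ij = x_i . x_j; the returned r x (k+1) matrix C
    holds in its i-th row the coefficients of t_i in terms of the x_l's."""
    lam, gamma = np.linalg.eigh(G)    # eigenvalues / eigenvectors of G
    keep = lam > tol                  # drop the (numerically) zero eigenvalues
    return (gamma[:, keep] / np.sqrt(lam[keep])).T

# Sanity check: t_i . t_j = (C G C^T)_ij, which should be the identity.
rng = np.random.default_rng(3)
X = rng.normal(size=(5, 8))
G = X @ X.T
C = orthonormal_basis_coeffs(G)
print(np.allclose(C @ G @ C.T, np.eye(C.shape[0])))   # True
```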
We now go back to prove Lemma 10.4.

Proof of Lemma 10.4.
First note that t_1, . . . , t_r ∈ T and thus span(t_1, . . . , t_r) ⊆ T. Also note that the dimension of T is r. Indeed, if the dimension of T were greater than r, there would exist an orthonormal basis τ_1, . . . , τ_k for T with k > r. We can express the vectors τ_1, . . . , τ_k in terms of the x_i's such that $\tau_i = \sum_j \tau_i(j)\, x_j$. Let Θ = (θ_{i,j}) be the matrix such that θ_{i,j} = τ_i(j); then

$$(\Theta G \Theta')_{i,j} = \left(\sum_l \tau_i(l)\, x_l \cdot x_0,\ \ldots,\ \sum_l \tau_i(l)\, x_l \cdot x_k\right)(\tau_j(0), \ldots, \tau_j(k))' = \sum_{s,l} \tau_i(l)\, \tau_j(s)\, x_l \cdot x_s = \tau_i \cdot \tau_j = \delta_{ij}$$

where the last equality follows since τ_1, . . . , τ_k are orthonormal. It follows that ΘGΘ′ = I_{k×k}. Since k > r, this contradicts the assumption that rank(G) = r. Therefore we conclude that the dimension of T is at most r.
To complete the proof, it suffices to show that t_1, . . . , t_r are indeed orthonormal; that is, t_i · t_j = δ_{i,j}:

$$t_i \cdot t_j = \left(\sum_{l=0}^{k} \frac{\gamma_i(l)}{\sqrt{\lambda_i}} x_l\right) \cdot \left(\sum_{l=0}^{k} \frac{\gamma_j(l)}{\sqrt{\lambda_j}} x_l\right) = \frac{1}{\sqrt{\lambda_i\lambda_j}} \sum_{l,s} \gamma_i(l)\, \gamma_j(s)\, x_l \cdot x_s = \frac{1}{\sqrt{\lambda_i\lambda_j}}\, \gamma_i' G \gamma_j = \frac{\lambda_j}{\sqrt{\lambda_i\lambda_j}} (\gamma_i \cdot \gamma_j) = \delta_{i,j}$$

where the last equality follows since the eigenvectors γ_1, . . . , γ_r are orthonormal.
In the next section we discuss one possible method of sampling from this convex body.
10.4 Hit and Run
Hit and run [85] is a method of sampling from a convex body K using a random walk. Let z ∈ K. A
single step of the hit and run begins by choosing a random point u from the unit sphere. Afterwards
the algorithm moves to a random point selected uniformly from l ∩ K, where l is the line passing
through z and z + u.
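A single hit-and-run step is easy to code. The sketch below assumes the body is given as K = {v : ‖v‖ ≤ 1 and Av ≤ b}, which is exactly the form V ∩ T takes in the coefficients of an orthonormal basis of T, each labeled instance contributing one homogeneous halfspace constraint (row −y_i times the instance's coefficient vector, with b_i = 0).

```python
import numpy as np

def hit_and_run_step(z, A, b, rng):
    """One hit-and-run step inside K = {v : ||v|| <= 1 and A v <= b};
    z is assumed to lie in the interior of K."""
    u = rng.normal(size=len(z))
    u /= np.linalg.norm(u)                # random direction on the unit sphere
    # The chord l ∩ K is {z + t u : lo <= t <= hi}.
    # Intersect l with the unit ball: ||z + t u||^2 <= 1.
    zu = np.dot(z, u)
    disc = np.sqrt(zu ** 2 - (np.dot(z, z) - 1.0))
    lo, hi = -zu - disc, -zu + disc
    # Intersect l with every halfspace a . (z + t u) <= b_i.
    for a, bi in zip(A, b):
        au, az = np.dot(a, u), np.dot(a, z)
        if au > 0:
            hi = min(hi, (bi - az) / au)
        elif au < 0:
            lo = max(lo, (bi - az) / au)
    return z + rng.uniform(lo, hi) * u    # uniform point on the chord
```

Iterating this step from an interior starting point yields the approximately uniform samples KQBC needs.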
Hit and run has several advantages over other random walks for sampling from convex bodies. First, its stationary distribution is indeed the uniform distribution; moreover, it mixes fast [85] and it does not require a “warm” starting point [86]. What makes it especially suitable for practical use is the
fact that it does not require any parameter tuning other than the number of random steps. It is
also very easy to implement.
Current proofs [85, 86] show that O^*(d³) steps are needed for the random walk to mix. However, the constants in these bounds are very large. Nevertheless, our experiments show that in practice hit and run mixes much faster than that (see Chapter 11 on page 136). We have used it to sample from the body V ∩ T. The number of steps we used was very small, ranging from a couple of hundred to a couple of thousand. Our empirical study shows that this suffices to obtain impressive results.
10.5 Generalizing to Unseen Instances
We saw how the QBC learning process can be conducted efficiently even when kernels are being
used. We now look at the generalization phase. In Chapter 5, where the QBC algorithm is
presented, we discussed several options for the generalization phase of QBC. One option is to
work in an online fashion in which there is no clear distinction between the learning and the generalization phases (see Theorem 5.6 on page 60). In this setting, the learner predicts the label of
an instance he sees and at the same time decides whether to query for the label or not. As we saw
in previous sections, this does not introduce any difficulty when kernels are being used.
In other settings presented in Chapter 5, the learning phase stops once a certain stopping
criterion is met. At this point QBC returns a hypothesis. We have discussed several options for
the choice of the returned hypothesis. We would like to verify which of these hypotheses can be
used together with kernels.
The first hypothesis we consider is the Bayes optimal hypothesis. This hypothesis is not necessarily a linear classifier and thus, in general, does not have a simple representation. Since this
is a problem even when kernels are not being used, we will definitely have the same problem once
kernels are used.
The second kind of hypothesis we consider is the Gibbs hypothesis. There are two possibilities here. First, we can draw a random hypothesis whenever we would like to label an instance. Using the techniques presented in the previous sections of this chapter, this can be done in combination with kernels.
An alternative way to use the Gibbs hypothesis is to draw a single hypothesis from the version space and use it for all future predictions. This cannot be done when kernels are used, because the random hypothesis needs to be sampled from the full version space. Note that when we
projected the version space onto the subspace T, we used T which is the span of x, x_1, . . . , x_k; we assumed that we know the instance x whose label we would like to predict. However, when x is not known, it is not clear which subspace to project onto.
The final option we considered in Chapter 5 was to use the Bayes Point Machine (BPM) classifier, which in our case is the center of gravity of the version space. It is easy to verify that under the assumption that the prior is uniform, the center of gravity always lies in the span of the instances for which we queried for labels. Furthermore, using the same arguments as we used throughout this chapter, it is easy to show that if V is the version space and T is the span of the instances for which we queried for labels, then the center of gravity of V is exactly the same point as the center of gravity of V ∩ T. Thus the BPM classifier can be used even in the kernelized setting.
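Since the two centers of gravity coincide, the BPM classifier can be estimated by simply averaging approximately uniform samples from the low-dimensional body. A minimal sketch, where `sample` stands for any sampler over V ∩ T, e.g. hit and run (section 10.4):

```python
import numpy as np

def bayes_point(sample, n=1000):
    # The average of approximately uniform samples from V ∩ T estimates
    # its center of gravity, i.e. the BPM classifier.
    return np.mean([sample() for _ in range(n)], axis=0)
```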
10.6 Summary and Further Study
In this chapter we presented two main ideas. First, we showed how kernels can be used to enhance the ability of QBC to deal with tasks where the target classifier is not necessarily linear. Kernels can also be used to overcome noise via the ridge trick, since every problem becomes separable when the ridge kernel is used (see Section 10.1.4).
Another issue that we dealt with in this chapter is practical methods for sampling the version
space. We suggested the use of the hit-and-run algorithm for this purpose. We discussed the
adequacy of this sampling algorithm for our purposes. In the following chapter we present the
empirical results obtained when using the techniques presented here for several learning tasks.
Chapter 11
Empirical Evidence

11.1 Empirical Study
In this chapter we present the results of applying the kernelized version of the Query by Committee
(KQBC) algorithm with the Hit-and-Run random walk (see Chapter 10), to two learning tasks.
The first task requires classification of synthetic data whereas the second is a real world problem.
11.1.1 Synthetic Data
In our first experiment we study the task of learning a linear classifier in a d-dimensional space.
The target classifier is the vector w∗ = (1, 0, . . . , 0) thus the label of an instance x ∈ IRd is the sign
of its first coordinate. The instances are normally distributed N (µ = 0, Σ = Id ). In each trial we
use 10000 unlabeled instances and let KQBC select the instances to query for the labels. We also
apply Support Vector Machine (SVM) to the same data. The linear kernel is used for both KQBC
and SVM. Since SVM is a passive learner, SVM is trained on prefixes of the training data. We use
different sizes for these prefixes. The results are presented in figure 11.1.
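For reference, a minimal sketch of the data-generating process used in this experiment (the seed is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 15, 10000
X = rng.normal(size=(n, d))     # instances drawn from N(0, I_d)
w_star = np.eye(d)[0]           # target classifier w* = (1, 0, ..., 0)
y = np.sign(X @ w_star)         # label = sign of the first coordinate
```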
The difference between KQBC and SVM is notable. When both are applied to a 15-dimensional
linear discrimination problem (figure 11.1b), SVM and KQBC have an error rate of ∼ 6% and
∼ 0.7% respectively after 120 labels. After such a short training sequence the difference attains an
order of magnitude. The same qualitative results emerge for all problem sizes.
As expected, the generalization error of KQBC decreases exponentially fast as the number of queries increases, whereas the generalization error of SVM decreases only at an inverse-polynomial rate (the rate is O^*(1/k), where k is the number of labels). This should not come as a surprise, since the fundamental theorem of the QBC algorithm (Theorem 6.1 on page 65) proved that this is the expected behavior.
[Figure 11.1 appears here: three log-scale plots of generalization error versus number of queries, for (a) 5, (b) 15 and (c) 45 dimensions, each comparing Kernel Query By Committee with Support Vector Machine; the fitted curves in the legends are 48 · 2^{−0.9k/5}, 53 · 2^{−0.76k/15} and 50 · 2^{−0.67k/45} respectively.]
Figure 11.1: Results on the synthetic data. The generalization error (y-axis) in percent (in
logarithmic scale) versus the number of queries (x-axis). Plots (a), (b) and (c) represent the
synthetic task in 5, 15 and 45 dimensional spaces respectively. The generalization error of KQBC
is compared to the generalization error of SVM. The results presented here are averaged over 50
trials. Note that the error rate of KQBC decreases exponentially fast as was proved in the
fundamental theorem of the QBC algorithm (Theorem 6.1 on page 65).
11.1.2 Label Efficient Learning over Synthetic Data
We conducted another experiment using the same synthetic setting as presented in section 11.1.1.
The sample space is IR5 with uniform distribution and the target concept is the vector (1, 0, 0, 0, 0).
In this experiment we tested KQBC in the label efficient setting (see section 4.3 on page 53). We
generated 2500 instances and presented them to KQBC one by one. For each of these instances
KQBC either queried for the label of the instance or predicted its label. We counted both the
number of queries and the number of prediction mistakes. This process was repeated 50 times.
The results are presented in figure 11.2a.
As predicted in Theorem 5.6 on page 60, the number of queries for labels is exactly twice the
number of prediction mistakes. Also, following the theoretical analysis presented in Chapter 6,
both parameters grow logarithmically with respect to the number of instances.
We use this setting to check the effect of the number of Hit and Run steps on the performance
of KQBC. The results of KQBC with 1000, 100, 50, 10, 5 and 2 Hit and Run steps for generating
a random hypothesis are presented in sub-figures a-f of figure 11.2. When the number of random
steps drops, KQBC tends to query for fewer instances, which causes an increase in the number of
prediction mistakes. However, the results of using 50, 100 and 1000 random steps are practically
equivalent and match our predictions for uniformly sampled hypotheses. We conclude that Hit
and Run mixes very fast: much faster than the bounds in [85].
11.1.3 Face Image Classification
The setting of the second experiment is more realistic. In the second task we used the AR face
dataset [88] which is a collection of face images. The people in these images are wearing different
accessories, have different facial expressions and the faces are lit from different directions. We
selected a subset of 1456 images from this dataset. Each image was converted into gray-scale
and re-sized to 85 × 60 pixels; i.e. each image was represented as a 5100-dimensional vector. See figure 11.3 for sample images. The task was to distinguish male and female images. For this
purpose we split the data into a training sequence of 1000 images and a test sequence of 456 images.
To test statistical significance we repeated this process 20 times, each time splitting the dataset
into training and testing sequences.
Figure 11.2: KQBC for label efficient learning. The results of applying KQBC to the
synthetic data are presented. The number of instances is located on the x-axis and the y-axis
shows the average number of queries and prediction errors. Each subplot corresponds to a different number of Hit and Run steps used to generate a new hypothesis from the version space.
Figure 11.3: Examples of face images used for the face recognition task.
Figure 11.4: The generalization error of KQBC and SVM for the faces dataset
(averaged over 20 trials). The generalization error (y-axis) vs. number of queries (x-axis) for
KQBC (solid) and SVM (dashed) are compared. When SVM was applied solely to the instances
selected by KQBC (dotted line) the results are better than SVM but worse than KQBC.
We applied both KQBC and SVM to this dataset. We used the Gaussian kernel, such that the inner product between two images was defined to be

$$K(x_1, x_2) = \exp\left(-\|x_1 - x_2\|^2 / 2\sigma^2\right)$$

where σ = 3500, the value favored by SVM. The results are presented in figure 11.4.
It is apparent from figure 11.4 that KQBC outperforms SVM. When the budget allows for
100 − 140 labels, KQBC has an error rate 2 − 3 percent lower than that of SVM. When
140 labels are used, KQBC outperforms SVM by 3.6% on average. This difference is significant as
in 90% of the trials KQBC outperformed SVM by more than 1%. In one of the cases, KQBC was
11% better.
We also used KQBC as an active selection method for SVM. We trained SVM over the instances
selected by KQBC. The generalization error obtained by this combined scheme was better than that of the passive SVM but worse than that of KQBC.
Another interesting way to view these results is to look at the images for which KQBC queried for labels; figure 11.5 shows the last such images. It is apparent that the selection made by KQBC is non-trivial: all the images are either highly saturated or partly covered by scarves or sunglasses. We conclude that KQBC indeed performs well even when kernels are used.
Figure 11.5: Images selected by KQBC. The last six faces for which KQBC queried for a
label. Note that three of the images are saturated and that two of these are wearing a scarf that
covers half of their faces.
11.2 Summary

In this chapter we demonstrated the kernelized version of the QBC algorithm in several experiments. In all of them, KQBC outperformed SVM significantly. We also tested KQBC in the label efficient setting, i.e. the online setting, and showed that it performs well in that setting too.
Part IV
Epilog
Chapter 12
Epilog
“No learning occurs if the learner is not active” [18, pg. 110]
The title of this work, “To PAC and Beyond”, represents the main theme of this dissertation. The
PAC [126] model is a very successful one. Once learning is defined in mathematical language it
enables the scientific community to study this concept using tools from different scientific fields.
It allows us to articulate questions such as
• Is everything learnable?
• Is anything learnable?
• What can we learn?
Many important results in the field of machine learning were obtained prior to the definition of the PAC model. Most notable is the seminal work of Vapnik and Chervonenkis [128], but many other results predated the definition of the PAC model (see e.g. [46, 97, 103, 108, 120]). The significance of Valiant's work is in placing all these findings in the context of learning, thus marking the beginning of machine learning as a field of research.
Nevertheless, the PAC model has its limitations, which led many researchers to go beyond it.
In this work we went beyond PAC by allowing learners to be active. We see learning as a game
played between the learner and the teacher. We showed that the assumption that the learner is
passive is restrictive. When we allow the learner the freedom to actively participate in the learning
process it learns much faster.
After a short introduction, we studied the Membership Queries framework in Part II. In Chapter 3 we presented a novel method of tolerating noise in the learning process using membership queries and the dual representation of the learning problem. In Part III we studied active learning in the selective sampling framework. In Chapter 5 we presented the Query By Committee
algorithm of Seung et al. [112] and discussed possible termination rules for this algorithm, which correspond to different modes of use. In Chapter 6 we presented a theoretical analysis of the QBC algorithm. We showed that active learners can enjoy an exponential speedup in their learning rates when certain conditions apply. In Chapter 7 we showed that QBC can tolerate incorrect assumptions on priors. In Chapter 8 we presented a method which makes QBC more resistant to noise. We discussed efficient implementations of QBC in Chapter 9 and extended it to enable the use of kernels in Chapter 10. An empirical study of QBC was presented in Chapter 11. These constitute an encouraging step forward in the ability to study active learning from various points of view, and (almost) close the gap between theory and practice in this field.
12.1 Active Learning in Humans
Our prime focus in this work is machine learning. Nevertheless, the findings can be connected to
learning in humans, since active learning is as important to humans as it is for machines.
Research on human learning and machine learning is conducted from very different points of
view. Investigators studying human learning primarily try to teach teachers how to teach, whereas
researchers in the field of machine learning attempt to teach learners how to learn. Indeed, “learning
to learn” is not the title of any class in school or university. In an introduction to his course on
computer organization, Charles Lin tries to address this issue [79]. Lin entitled his essay “Active Learning” to convey the idea that you know you have learned something when you are able to teach it. Thus a student should convince himself that he is able to teach what he has learned, and whenever he is not confident that he can do so, he should ask the teacher or a peer, or seek the answer somewhere else.
Researchers in the field of human learning and early childhood development use “active learning” slightly differently from the way we have used it in this dissertation: any learning process in which the learner takes part is considered active. According to Piaget, a child plays a very active role in the growth of intelligence [118, pg. 13]. Examples include game playing, counting, etc. Furthermore, for Piaget intelligence meant exploring the environment [118, pg. 27]; thus intelligence is about actively extending knowledge. Both Piaget and Vygotsky explicitly argue that the child plays an active role in the acquisition of knowledge, as opposed to Behaviorism, which suggests that learning is determined by external variables (stimulus and reinforcement) [18, pg. 27]. The constructivist theory argues that learning is an internal process that external stimuli can trigger [94].
The active role of children in learning takes on several forms. According to leading theories,
the child constructs a hypothesis and revises it when needed. Constructing a theory is an active
process [18, pg. 8]. While this is an internal process, active learning has an external manifestation
as well [18, pg. 9]: a child needs to be able to manipulate objects in order to understand what
these objects are and what can they do. It is also necessary to have children actively involved in
the learning process to motivate them and cause them to engage in the learning process.
The type of “active learning” we are interested in is different. For us, a child is considered active if his behavior and questions cause a change in the learning process itself. Therefore, a natural question would be how much a child can gain (in knowledge) by actively directing the teacher. To the best of my knowledge, no study addresses this issue explicitly. Nevertheless, implicitly there is no doubt in our mind that many theories of early childhood development and human learning
see the child as a “director” of the learning process. For instance, Montessori, Erikson, Piaget
and Vygotsky place great emphasis on the significance of observing the students when planning a
curriculum [93]. For example, according to Montessori, teachers should be trained to “teach little
and observe much” [93, pg. 31] because observation is the key to determining what children are
interested in and need to learn [93, pg. 33].
The study of active learning in machines is taking its first steps. In this work we attempted to
contribute to the growth of this field. We studied both empirical and theoretical aspects of this
domain. At the same time we argue that active learning is important for human learning. Those of
us who are involved in learning should keep this in mind and use this powerful tool while learning.
List of Publications
In order to keep this document reasonably sized, only a subset of the work I have done during my
studies is presented in this dissertation. Here is a complete list of my publications.
Journals
• R. Bachrach, R. El-Yaniv, and M. Reinstadtler, On the competitive theory and practice of
online list accessing algorithms, Algorithmica, vol. 32, no. 2, pp. 201-245, 2002.
An extended abstract of this paper appeared in a conference:
R. Bachrach and R. El-Yaniv, Online list accessing algorithms and their applications, recent
empirical evidence, in Proceedings of the 8th Symposium on Discrete Algorithms (SODA),
pp 53-62, 1997.
• R. Bachrach, S. Fine, and E. Shamir, Query by committee, linear separation and random
walks, Theoretical Computer Science, vol. 284, no. 1, 2002.
An extended abstract of this paper appeared in a conference:
R. Bachrach, S. Fine, and E. Shamir, Query by committee, linear separation and random
walks, in Proceedings of the 4th European Conference on Learning Theory (EUROCOLT),
pp. 34-49, 2001.
Refereed Conferences
• R. Gilad-Bachrach, A. Navot and N. Tishby, Query by Committee made real, in Proceedings
of the 19th Conference on Neural Information Processing Systems (NIPS), 2005.
• R. Gilad-Bachrach, A. Navot and N. Tishby, Bayes and Tukey meet at the center point, in
Proceedings of the 17th Conference on Learning Theory (COLT), 2004.
• R. Gilad-Bachrach, A. Navot and N. Tishby, Margin based feature selection - theory and
algorithms, in Proceedings of the 21st International Conference on Machine Learning (ICML),
2004.
• R. Gilad-Bachrach, A. Navot, and N. Tishby, An information theoretic tradeoff between complexity and accuracy, in Proceedings of the 16th Conference on Learning Theory (COLT),
pp. 595-609, 2003.
• K. Crammer, R. Gilad-Bachrach, A. Navot, and N. Tishby, Margin analysis of the LVQ algorithm, in Proceedings of the 16th Conference on Neural Information Processing Systems
(NIPS), 2002.
Book Chapters
• R. Gilad-Bachrach, A. Navot and N. Tishby, Connections with some classic IT problems. In
Information Bottlenecks and Distortions: The Emergence of Relevant Structure from Data, N.
Tishby and T. Gideon (eds.), MIT Press (in preparation).
• R. Gilad-Bachrach, A. Navot and N. Tishby, Large margin principles for feature selection. In
Feature Extraction: Foundations and Applications, I. Guyon, S. Gunn, M. Nikravesh and L.
Zadeh (eds.), Springer (2006).
Technical Reports
• S. Fine, R. Gilad-Bachrach, E. Shamir, and N. Tishby, Noise tolerant learning using early
predictors, technical report 1999-22, Leibniz Center, the Hebrew University, 1999.
• S. Fine, R. Gilad-Bachrach, S. Mendelson, and N. Tishby, Noise tolerant learning via the dual
learning problem, technical report 2000-14, Leibniz Center, the Hebrew University, 2000.
Presented at NCST99.
• S. Axelrod, S. Fine, R. Gilad-Bachrach, S. Mendelson, and N. Tishby, The information of
observations and applications for active learning with uncertainty, technical report 2001-81,
Leibniz Center, the Hebrew University, 2001.
• R. Gilad-Bachrach, A. Navot, and N. Tishby, Kernel query by committee (KQBC), technical
report 2003-88, Leibniz Center, the Hebrew University, 2003.
• R. Gilad-Bachrach, Dimensionality reduction for online learning algorithms using random
projections, technical report 2005, Leibniz Center, the Hebrew University, 2005.
Bibliography
[1] R.A. Adams. Sobolev Spaces, volume 69 of Pure and Applied Mathematics series.
Academic Press, 1975.
[2] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions,
uniform convergence, and learnability. Journal of the ACM, 44(4):615–631, 1997.
[3] D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.
[4] D. Angluin. Queries revisited. Theoretical Computer Science, 313(2):175–194, 2004.
[5] D. Angluin and M. Kharitonov. When won’t membership queries help? In Proceedings
of the 23rd annual ACM symposium on Theory of computing, 1991.
[6] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations.
Cambridge University Press, 2001.
[7] A. C. Atkinson and A. N. Donev. Optimum Experimental Designs. Oxford University
Press, 1992.
[8] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[9] M. Bagnoli and T. Bergstrom. Log-concave probability and its applications.
http://www.econ.ucsb.edu/~tedb/Theory/logconc.ps, 1989.
[10] Y. Baram, R. El-Yaniv, and K. Luz. Online choice of active learning algorithms. Journal
of Machine Learning Research (JMLR), 5:255–291, March 2004.
[11] P. Bartlett and S. Ben-David. Hardness results for neural network approximation problems. In Proceedings of the 4th European Conference on Computational Learning
Theory, 1999.
[12] E. B. Baum. Neural net algorithms that learn in polynomial time from examples and
queries. IEEE Transactions on Neural Networks, 2(1), 1991.
[13] S. Ben-David, N. Eiron, and P. Long. On the difficulty of approximately maximizing
agreements. Journal of Computer and System Sciences, 66(3):496–514, May 2003.
[14] P. Berkhin. Survey of clustering data mining techniques. Technical report, Accrue
Software, San Jose, CA, 2002.
[15] D. Bertsimas and S. Vempala. Solving convex programs by random walks. In STOC,
pages 109–115, 2002.
[16] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In the
11th Annual Conference on Computational Learning Theory, pages 92–100, 1998.
[17] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the
Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
[18] E. Bodrova and D. J. Leong. Tools of the Mind: The Vygotskian Approach to Early
Childhood Education. Prentice-Hall, 1996.
[19] C. Borell. Convex set functions in d-space. Periodica Mathematica Hungarica, 6:111–
136, 1975.
[20] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Fifth Annual
Workshop on Computational Learning Theory, pages 144–152, 1992.
[21] L. Breiman, J. Friedman, R. A. Olshen, and C. Stone. Classification and Regression
Trees. Chapman & Hall, 1984.
[22] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[23] N. Bshouty. Exact learning via the monotone theory. In Proceedings of the 34th Annual
Symposium on Foundations of Computer Science, 1993.
[24] N. H. Bshouty, S. A. Goldman, H. D. Mathias, S. Suri, and H. Tamaki. Noise-tolerant
distribution-free learning of general geometric concepts. In the proceedings of the 28th
Annual ACM Symposium on Theory of Computing, 1996.
[25] C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers.
In Proceedings of the 17th International Conference on Machine Learning (ICML),
2000.
[26] A. Caplin and B. Nalebuff. Aggregation and social choice: A mean voter theorem.
Econometrica, 59(1):1–23, 1991.
[27] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Learning probabilistic linear-threshold
classifiers via selective sampling. In Proceedings of the 16th annual Conference on
Learning Theory (COLT), pages 373–387, 2003.
[28] N. Cesa-Bianchi, G. Lugosi, and G. Stoltz. Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory, 51(6):2152–2162, 2005.
[29] D. Cohn, L. Atlas, and R. Ladner. Training connectionist networks with queries and
selective sampling. Advances in Neural Information Processing Systems 2, 1990.
[30] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning.
Machine Learning, 15(2):201–221, 1994.
[31] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models.
Journal of Artificial Intelligence Research, 4:129–145, 1996.
[32] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Interscience,
1991.
[33] T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on
Information Theory, 13:21–27, 1967.
[34] K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass
problems. Machine Learning, 47, 2002.
[35] I. Dagan and S. Engelson. Committee-based sampling for training probabilistic classifiers. Proc. 12th International Conference on Machine Learning, 1995.
[36] S. Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural
Information Processing Systems, 2004.
[37] S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in
Neural Information Processing Systems, 2005.
[38] S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of perceptron-based active
learning. In Proceeding of the 18th Annual Conference on Learning Theory (COLT),
2005.
[39] S. E. Decatur. Efficient Learning from Faulty Data. PhD thesis, Harvard University,
1995.
[40] O. Dekel, S. Shalev-Shwartz, and Y. Singer. The forgetron: A kernel-based perceptron
on a fixed budget. In Neural Information Processing Systems (NIPS), 2005.
[41] M. Dyer, A. Frieze, and R. Kannan. A random polynomial time algorithm for approximating the volume of convex bodies. Journal of the ACM, 38(1):1–17, 1991.
[42] B. Eisenberg and R. L. Rivest. On the sample complexity of PAC-learning using random
and chosen examples. In Proceedings of the Third Annual Conference on Computational Learning Theory, pages 154–162. Morgan-Kaufmann, 1990.
[43] G. Elekes. A geometric inequality and the complexity of computing volume. Discrete
and Computational Geometry, 1986.
[44] S. Fine, A. Freund, I. Jaeger, Y. Mansour, Y. Naveh, and A. Ziv. Harnessing machine learning to improve success rate of stimuli generation. IEEE Transactions on
Computers, to appear 2006.
[45] S. Fine, J. Navratil, and R. Gopinath. A hybrid GMM/SVM approach to speaker identification. In The International Conference on Acoustics, Speech, and Signal Processing
(ICASSP), 2001.
[46] E. Fix and J. Hodges. Discriminatory analysis, nonparametric discrimination: Consistency properties. Technical Report 4, USAF School of Aviation Medicine, 1951.
[47] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning
and an application to boosting. Journal of Computer and System Sciences, 55(1):119–
139, 1997.
[48] Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by
committee algorithm. Machine Learning, 28:133–168, 1997.
[49] R. Gilad-Bachrach, A. Navot, and N. Tishby. Bayes and Tukey meet at the center
point. In Proceedings of the 17th Conference on Learning Theory (COLT), pages 549–
563, 2004.
[50] D. Gilbarg and N. S. Trudinger. Elliptic Partial Differential Equations of Second Order.
Springer Verlag, 1998.
[51] D. Haussler, M. Kearns, and R. E. Schapire. Bounds on the sample complexity of
Bayesian learning using information theory and the VC dimension. Machine Learning,
14:83–113, 1994.
[52] D. Haussler and M. Opper. Mutual information, metric entropy, and cumulative relative
entropy risk. Annals of Statistics, 25(6), Dec 1997.
[53] D. O. Hebb. The Organization of Behavior. John Wiley, New York, 1949.
[54] D. Helmbold and S. Panizza. Some label efficient learning results. In Proceedings of the
10th Annual Conference on Computational Learning Theory (COLT), pages 218–230,
1997.
[55] R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines: Estimating the
Bayes point in kernel space. In Proceedings of IJCAI Workshop on Support Vector
Machines, pages 23–27, 1999.
[56] R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines. Journal of Machine
Learning Research, 2001.
[57] T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting
remote protein homologies. Journal of Computational Biology, 7:95–114, 2000.
[58] J. Jackson, E. Shamir, and C. Shwartzman. Learning with queries corrupted by classification noise. Discrete Applied Mathematics, 92(2-3):157–175, 1999.
[59] F. V. Jensen. Bayesian Networks and Decision Graphs. Springer, 2001.
[60] F. John. Extremum problems with inequalities as subsidiary conditions. In Studies
and Essays Presented to R. Courant on his 60th Birthday, pages 187–204. Interscience
Publishers, Inc., New York, N. Y., 1948.
[61] I. T. Jolliffe. Principal Component Analysis. Springer Verlag, 1986.
[62] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey.
Journal of Artificial Intelligence Research, 4:237–285, 1996.
[63] M. Kearns. Efficient noise-tolerant learning from statistical queries. In Proceedings of
the 25th ACM Symposium on the Theory of Computing, pages 392–401, 1993.
[64] M. Kearns. Boosting theory towards practice: Recent developments in decision tree
induction and the weak learning framework. Abstract accompanying invited talk given
at AAAI 1996, 1996.
[65] M. Kearns and L. Valiant. Cryptographic limitations on learning Boolean formulae
and finite automata. Journal of the ACM, 41(1):67–95, 1994.
[66] M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. The
MIT Press, 1994.
[67] J. Kleinberg. An impossibility theorem for clustering. In NIPS, 2002.
[68] A. R. Klivans and R. Servedio. Learning intersections of halfspaces with a margin. In
Proceedings of the 17th Annual Conference on Learning Theory (COLT), 2004.
[69] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and active
learning. In Advances in Neural Information Processing Systems (NIPS), pages 231–
238, 1995.
[70] S. Kullback. Information Theory and Statistics. Wiley, 1959.
[71] E. Kushilevitz and Y. Mansour. Learning decision trees using the Fourier spectrum.
In Proceedings of the twenty-third annual ACM symposium on Theory of computing,
pages 455–464, 1991.
[72] S. Kwek and L. Pitt. Intersections of halfspaces with membership queries. Algorithmica,
1998.
[73] K. J. Lang and E. B. Baum. Query learning can work poorly when a human oracle is
used. In Proceedings of the International Joint Conference on Neural Networks, pages
335–340, 1992.
[74] BBC Learning. How we learn - definition of learning.
http://www.bbc.co.uk/learning/returning/betterlearner/learningstyle/a_whatislearning_01.shtml, 2004.
[75] M. Ledoux. The Concentration of Measure Phenomenon. American Mathematical
Society, 2001.
[76] C. Leslie, E. Eskin, A. Cohen, J. Weston, and W. S. Noble. Mismatch string kernels
for discriminative protein classification. Bioinformatics, 20(4):467–476, 2004.
[77] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. In
W. Bruce Croft and Cornelis J. van Rijsbergen, editors, Proceedings of 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR),
pages 3–12, Dublin, IE, 1994. Springer Verlag, Heidelberg, DE.
[78] R. Liere. Active Learning with Committees: An approach to Efficient Learning in
Text Categorization Using Linear Threshold Algorithms. PhD thesis, Oregon State
University, 1999.
[79] C. Lin. Active learning.
http://www.cs.umd.edu/class/spring2003/cmsc311/Notes/Learn/active.html, 2003.
[80] J. Lindenstrauss and L. Tzafriri. Classical Banach Spaces, volume 2. Springer Verlag,
1979.
[81] N. Linial, Y. Mansour, and N. Nisan. Constant-depth circuits, Fourier transform and
learnability. Journal of the ACM, 40:607–620, 1993.
[82] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. In 28th Annual Symposium on Foundations of Computer Science,
pages 68–77, 1987.
[83] N. Littlestone. Mistake Bounds and Logarithmic Linear-threshold Learning Algorithms.
PhD thesis, University of California Santa Cruz, 1989.
[84] L. Lovász and M. Simonovits. Random walks in a convex body and an improved volume
algorithm. Random Structures and Algorithms, 4(4):359–412, 1993.
[85] L. Lovász and S. Vempala. Hit and run is fast and fun. Technical Report MSR-TR-2003-05, Microsoft Research, 2003.
[86] L. Lovász and S. Vempala. Hit-and-run from a corner. In Proc. of the 36th ACM
Symposium on the Theory of Computing (STOC), 2004.
[87] H. Mamitsuka and N. Abe. Efficient data mining by active learning. In S. Arikawa
and A. Shinohara, editors, Progress in Discovery Science: Final Report of the Japanese
Discovery Science Project. Springer-Verlag GmbH, 2002.
[88] A. M. Martinez and R. Benavente. The AR face database. Technical report, CVC Tech.
Rep. #24, 1998.
[89] D. A. McAllester. Some PAC-Bayesian theorems. Proc. of the Eleventh Annual Conference on Computational Learning Theory, pages 230–234, 1998.
[90] A. K. McCallum and K. Nigam. Employing EM in pool-based active learning for text
classification. In Jude W. Shavlik, editor, Proceedings of the 15th International Conference on Machine Learning (ICML), pages 350–358, Madison, US, 1998. Morgan
Kaufmann Publishers, San Francisco, US.
[91] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity.
Bulletin of Mathematical Biophysics, 5:115–133, 1943.
[92] S. Mendelson. Learnability in Hilbert spaces with reproducing kernels. Journal of
Complexity, 18(1):152–170, 2002.
[93] C. G. Mooney. Theories of Childhood: An introduction to Dewey, Montessori, Erikson,
Piaget & Vygotsky. Redleaf Press, 2000.
[94] N. Movshovitz-Hadar. Personal communication, 2006.
[95] I. Muslea, S. Minton, and C. Knoblock. Active + semi-supervised learning = robust
multi-view learning. In Proceedings of the 19th International Conference on Machine
Learning (ICML), pages 435–442, 2002.
[96] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm.
In Advances in Neural Information Processing Systems, 2001.
[97] A. B. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium
on the Mathematical Theory of Automata, volume 12, pages 615–622, 1962.
[98] G. Pisier. Probabilistic methods in the geometry of Banach spaces. In Probability
and Analysis, number 1206 in Lecture Notes in Mathematics, pages 167–241. Springer
Verlag, 1986.
[99] L. Pitt and M. K. Warmuth. Prediction-preserving reducibility. Journal of Computer
and System Sciences, 41:430–467, 1990.
[100] A. Prekopa. Logarithmic concave measures with applications to stochastic programming. Acta Sci. Math. (Szeged), 32:301–315, 1971.
[101] J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106,
1986.
[102] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[103] F. Rosenblatt. The perceptron: A probabilistic model for information storage and
organization in the brain. Psychological Review, 65(6):386–408, 1958.
[104] N. Roy and A. McCallum. Toward optimal active learning through sampling estimation
of error reduction. In Proceedings of the 18th International Conference on Machine
Learning (ICML), pages 441–448. Morgan Kaufmann, San Francisco, CA, 2001.
[105] S. Russell. Stuart Russell on the future of artificial intelligence. Ubiquity, 4(43), 2004.
http://www.acm.org/ubiquity/interviews/v4i43_russell.html.
[106] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall,
2nd edition, 2002.
[107] Y. Sakakibara. On learning from queries and counterexamples in the presence of noise.
Information Processing Letters, 37(5):279–284, 1991.
[108] N. Sauer. On densities of families of sets. Journal of Combinatorial Theory, Series A, 13:145–147,
1972.
[109] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new
explanation for the effectiveness of voting methods. Annals of Statistics, 1998.
[110] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In
Proceedings of the 17th International Conference on Machine Learning (ICML), pages
839–846. Morgan Kaufmann, San Francisco, CA, 2000.
[111] B. Schölkopf, A. J. Smola, and K. R. Müller. Kernel principal component analysis. In
B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods -
Support Vector Learning, pages 327–352. MIT Press, 1999.
[112] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. Proc. of the Fifth
Workshop on Computational Learning Theory, pages 287–294, 1992.
[113] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from
examples. Physical Review A, 45(8), 1992.
[114] C. E. Shannon. A mathematical theory of communication. Bell System Technical
Journal, 27, July and October 1948.
[115] J. Shawe-Taylor and N. Cristianini. Further results on the margin distribution. In
Proceedings of the 12th Annual Conference on Learning Theory (COLT), pages 278–
285, 1999.
[116] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge
University Press, 2004.
[117] L. Shpigelman, Y. Singer, R. Paz, and E. Vaadia. Spikernels: Predicting arm movements by embedding population spike rate patterns in inner-product spaces. Neural
Computation, 17(3):671–690, March 2005.
[118] D. G. Singer and T. A. Revenson. A Piaget primer: how a child thinks. The Penguin
Group, revised edition 1996.
[119] P. Sollich and D. Saad. Learning from queries for maximum information gain in imperfectly learnable problems. Advances in Neural Information Processing Systems, 7:287–294, 1995.
[120] C. J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595–
620, 1977.
[121] N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. In Proc.
37th Annual Allerton Conf. on Communication, Control and Computing, pages 368–
377, 1999.
[122] S. Tong and D. Koller. Support vector machine active learning with applications to
text classification. Journal of Machine Learning Research (JMLR), 2:45–66, Nov 2001.
[123] L. Troyansky. Faithful Representations and Moments of Satisfaction: Probabilistic
Methods in Learning and Logic. PhD thesis, Hebrew University, Jerusalem, 1998.
[124] G. Tur, D. Hakkani-Tür, and R. E. Schapire. Combining active and semi-supervised
learning for spoken language understanding. Speech Communication, 45(2):171–186,
2005.
[125] G. Tur, R. E. Schapire, and D. Hakkani-Tür. Active learning for spoken language
understanding. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2003.
[126] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–
1142, 1984.
[127] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
[128] V. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of
events to their probabilities. Theory of Probability and its Applications, 16(2):264–280,
1971.
[129] C. Zhang and T. Chen. An active learning framework for content-based information
retrieval. IEEE Transactions on Multimedia, 4(2):260–268, 2002.
Hebrew Abstract

The great advantage of learning algorithms is their ability to adapt themselves to the task at hand. Whereas "traditional" algorithms are designed to solve one specific problem, learning algorithms acquire by themselves the ability to solve the problem. Learning-based methods have been applied successfully in various domains, such as handwriting recognition, information retrieval, medical diagnosis and fraud detection.

There are different modes of learning processes. In all of them, the learner is exposed to data and tries to find patterns in it; later on, these patterns are used for drawing conclusions or for discovering properties of the data. The different modes of learning are distinguished in several ways:
1. the type of data the learner sees;
2. the way the learner and the teacher communicate with each other;
3. the way the learner's performance is evaluated.

In many cases, a long training process is needed before the learner reaches an adequate level of performance. Generating long training sequences is a hard and expensive process, since in most cases it requires a great deal of work by professionals. A significant part of machine learning research is therefore devoted to different ways of shortening the training process. In this research we examine ways of shortening the learning process by means of active learning.

Active learning gives the learner a certain amount of control over the learning process: the learner is given the possibility of directing the teacher to present information of greater value, thereby shortening the learning process and reducing its cost. The logic at the basis of this approach is that the value of additional information depends on the existing information and on the internal structure of the learning algorithm; hence, for the learning process to be efficient, it has to be adapted to the learning algorithm. This is true both when the learner is a computer and when the learner is a human being¹. In a certain sense, this work proves at least the first part of the ancient Hebrew saying:

"The bashful cannot learn, nor can the impatient teach" (Ethics of the Fathers, Chapter 2, Mishnah 5)

¹ A short discussion of active learning in human beings is presented in Chapter 12 of this work.
Active learning draws its power from several modes of learning: supervised learning, unsupervised learning and reinforcement learning, and in this work we take something from each of these modes. Nevertheless, we compare the performance of active learners with that of passive learners in supervised learning, and especially within the "Probably Approximately Correct" (PAC) framework [126]. We study several frameworks of active learning: "membership queries" (Membership Queries), "selective sampling" (Selective Sampling) and "label efficient learning" (Label Efficient Learning).

The work opens with a short introduction, after which we discuss membership queries [3]. In this framework the learner can pose questions to the teacher: it does so by constructing observations and asking the teacher to label them. We show that learning can take place even when the teacher is inaccurate in the answers it provides to the learner's questions. We show that the "noise" can be filtered out when the problem being learned is well structured, namely when the dual learning problem is dense and of reasonable complexity.

The third part of this work deals with selective sampling [29, 30]. In this framework the student sees unlabeled observations and chooses the observations that the teacher will label. This framework is useful in the many cases where raw data is abundant while labels are scarce. We focus on the Query By Committee (QBC) algorithm [112] and address theoretical, algorithmic and practical issues. We begin by extending the theoretical understanding of the QBC algorithm and of selective sampling. Our study shows that QBC learns exponentially faster than passive learners in many more cases than was previously known, and that the algorithm is also efficient in the label efficient framework, which is the online variant of selective sampling.

We continue by removing several assumptions that were made in the theoretical analysis. First we remove the Bayesian assumption: by removing this assumption we show that when certain conditions hold, QBC is efficient almost surely, whereas under the Bayesian assumption only the average performance of the algorithm can be guaranteed. We then address learning with QBC in the presence of noise, and show that the QBC algorithm can be modified so as to be robust against noise. In order to analyze this case we develop the notion of the information of observations, which is of interest in its own right.

A central requirement of the QBC algorithm is the ability to sample a random hypothesis. The difficulty of fulfilling this requirement has kept people from using the algorithm. We show, for the first time, that when linear separators are being learned, the sampling can be performed in polynomial time. We prove that the problem of sampling a random hypothesis is identical to the problem of sampling a random point from a convex body. This problem has been studied extensively [41, 84, 85], and the results can be used in the case before us. Although these algorithms are polynomial, they are far from being efficient; we attack this issue in two ways. First we show that the sampling can be performed in a low dimension, which allows us to use QBC in high dimension without sacrificing the computational complexity, so that kernels can be used to strengthen the QBC algorithm. We also suggest using a random walk of the hit-and-run type [85], and we show empirically that the number of steps this random walk needs to perform is small. These tools allow us to conduct experiments with the QBC algorithm and to demonstrate the advantages of active learning over passive learning.

The research of active learning is still in its infancy. There are several theoretical results (for example on the QBC algorithm [27, 36, 37, 48]) and experimental results (for example [35, 124]). The uniqueness of this work is in its comprehensive approach: we look at the same objects from different points of view, studying active learning by means of analytical tools while at the same time addressing practical issues and providing results of experiments. By doing so we reduce the gap between theory and practice in the field. We believe that active learning has considerable importance in many domains; this work is one more step toward its realization.
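Since the hit-and-run walk is what makes the hypothesis sampling above practical, a minimal sketch of one step of such a walk may help. This is an illustrative implementation, not the exact procedure used in the thesis; the membership oracle interface and the doubling-and-bisection chord search with its tolerance are assumptions of the sketch.

    import numpy as np

    def hit_and_run_step(x, inside, rng, tol=1e-6):
        """One step of the hit-and-run random walk inside a convex body.

        x      -- current point (must lie inside the body)
        inside -- membership oracle: inside(y) is True iff y is in the body
        rng    -- numpy random generator
        """
        # Pick a uniformly random direction on the unit sphere.
        d = rng.standard_normal(x.shape[0])
        d /= np.linalg.norm(d)

        # Find the chord {x + t*d} inside the body: double outward until
        # we leave the body, then bisect to the boundary in each direction.
        def boundary(sign):
            lo, hi = 0.0, 1.0
            while inside(x + sign * hi * d):
                lo, hi = hi, 2.0 * hi
            while hi - lo > tol:
                mid = (lo + hi) / 2.0
                lo, hi = (mid, hi) if inside(x + sign * mid * d) else (lo, mid)
            return lo

        t_plus, t_minus = boundary(+1.0), boundary(-1.0)
        # Move to a uniformly random point on the chord.
        return x + rng.uniform(-t_minus, t_plus) * d

    # Example: sampling from the unit ball in R^5.
    rng = np.random.default_rng(0)
    x = np.zeros(5)
    for _ in range(1000):
        x = hit_and_run_step(x, lambda y: np.linalg.norm(y) <= 1.0, rng)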
Introduction

Learning is "a process through which we process information, which leads to a change in or an enhancement of knowledge and skills" [74]. This remarkable learning ability is one of the most prominent traits of human beings, and one that distinguishes us from most of the animal kingdom: learning allows us to cope with a changing environment and to solve complex problems. The definition quoted above points to several of the essential components of learning. Learning is a process that converts information into knowledge and skills; it is therefore a process that enriches our capabilities.

Machine learning is the art of designing computerized learning mechanisms; that is to say, machine learning is the art of designing computerized processes that are capable of converting the information we encounter into knowledge and capabilities. Machine learning uses the following process in order to solve problems:
1. collecting information;
2. finding a concise representation of the information;
3. finding regularities in the information;
4. converting these regularities into knowledge;
5. converting the knowledge into actions (sometimes).
This process has proved effective in a variety of domains such as document classification, speech recognition, computer vision, genome research, medicine and more.
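As a toy illustration of the five steps above (an invented example, not taken from the thesis), the following sketch collects synthetic labeled measurements, represents the regularity in them by a single learned threshold, and converts that knowledge into predictions:

    import numpy as np

    # 1. Collect information: labeled measurements (synthetic data in which
    #    examples above an unknown threshold of 3.0 are labeled +1).
    rng = np.random.default_rng(1)
    x = rng.uniform(0.0, 10.0, size=200)
    y = np.where(x > 3.0, 1, -1)

    # 2-3. Represent the data concisely and find the regularity: a single
    #      threshold that best separates the two labels on the sample.
    candidates = np.sort(x)
    errors = [np.sum(np.where(x > t, 1, -1) != y) for t in candidates]
    theta = candidates[int(np.argmin(errors))]

    # 4-5. The learned threshold is the acquired "knowledge"; applying it to
    #      new observations converts that knowledge into predictions.
    def predict(x_new):
        return np.where(x_new > theta, 1, -1)

    print(theta, predict(np.array([2.0, 5.0])))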
Learning from Examples

There is a variety of situations in which learning is required. We are interested in learning from examples and, more precisely, in supervised learning from examples. In learning of this kind two players participate, the teacher and the learner. The teacher possesses some knowledge, which the learner wishes to acquire; the learner therefore watches the teacher's moves, in the hope of collecting enough information and of being clever enough to convert this information into knowledge. In this process there are thus two sources of complexity: (a) the length of time the learner has to watch the teacher, and (b) the "cleverness" required of the learner. In the machine learning literature these complexity factors are called sample complexity and computational complexity.

We will be dealing with learning from examples. During the learning process, the learner is given access to examples. Each example is a pair (x, y), where x is an observation taken from the observation space X and y is the label of this observation, taken from the label space Y. We assume that in the background there exists a target concept c that maps inputs (observations) to outputs (labels); the goal of the learner is to approximate this target concept.

A considerable part of machine learning research concentrates on different ways of reducing the sample complexity and the computational complexity. In a certain sense, this work is concerned with the balance between the sample complexity and the computational complexity. We are interested in ways of reducing the sample complexity without increasing the computational complexity by too much; in other words, we are interested in finding ways of transferring part of the workload from the teacher to the learner.

In order to achieve this improvement in learning we will go beyond the traditional learning frameworks, such as PAC (see Definition 1), and require active learning. By active learning we mean that the learner has an active role in the learning process: while a passive learner only watches the teacher, an active learner can guide the teacher by means of questions.

In this work we study active learning as an extension of passive learning. We show that, in many cases, active learners achieve far better results than passive learners. We study these frameworks both by analysis and by experiment; in contrast with most of the literature in the field of active learning, we use the same algorithms both in the analysis and in the experiments, and thereby build a bridge over the gaping chasm between the theory and the application of active learning.

In the remainder of this chapter we present basic definitions and results in machine learning. A reader well versed in the field may leaf through these pages in order to become acquainted with the notation we use (the notation is presented in Table 1.1 on page 4).
Machine Learning and Artificial Intelligence

Machine learning is a branch of artificial intelligence. Machine learning and artificial intelligence each try, in their own way, to imitate the way a human being solves complex problems. "Traditional" artificial intelligence defines knowledge as a system of logical rules; with the aid of these rules, conclusions can be drawn about cases that have not yet been observed. Machine learning differs in two senses. First, machine learning puts great emphasis on the learning process, the process in which the knowledge is acquired. Second, machine learning generally uses statistical and probabilistic inference rules, as opposed to the logical rules customary in artificial intelligence. In a certain sense, machine learning is a stage in the evolution of artificial intelligence [105].

The difference between the approaches can be illustrated with the following example. Suppose we wish to build a machine that performs a certain task, say medical diagnosis. According to the "logical" approach, to build such systems one has to contact an expert in the domain (a physician in this case) and obtain from him the system of rules that distinguishes the sick from the healthy; once the system of rules has been collected, it is built into the machine and used for diagnosing patients. This approach has several drawbacks. First, in most cases defining the rules is impossible. Second, it is hard to maintain such a system of rules and to weed out its errors: in a system with thousands of rules, how do we locate a rule that leads to a wrong diagnosis, and how do we fix it without harming the whole system? Finally, how do we adapt such a system to changes in the environment or to additional diagnostic tasks?

The difficulties presented above affect the process of knowledge acquisition. Machine learning takes a different approach. In the knowledge acquisition phase, namely the learning phase, the learner follows an expert at work and collects statistical data about the correlations in the information; once the learner has acquired enough knowledge, this knowledge can be used for making predictions and diagnoses. Learning from examples is useful in a range of domains such as medical diagnosis, information retrieval and more. The advantage of this approach is that the learning process only requires watching the expert; machines based on it are easy to maintain, and it allows the reuse of system components. Machine learning, as the name suggests, concentrates on the knowledge acquisition process, and it can be divided into subfields according to the character of the knowledge acquisition process (for example supervised learning, semi-supervised learning, unsupervised learning) and according to the task for which one learns (for example batch prediction tasks, classification, regression). In this work we deal with supervised learning.

An Introduction to Machine Learning

Machine learning has been studied for more than half a century under different names. A complete survey of the field is not within the scope of this work; we present here the foundations that we will use later on. Machine learning attracts researchers from different fields: mathematicians, computer scientists, brain researchers, biologists and others. There are three main motives for research in the field:
1. the study of the brain;
2. learning as a way of solving complex problems;
3. the study of "learning" as an abstract notion.

Brain researchers have found that the brain is built of elementary building blocks, the neurons, which are connected to one another in a network. The ability of the human brain to solve complex problems and to adapt itself to changes led researchers to believe that by building artificial neural networks we would become able to learn to solve complex problems; in addition, this would allow a better understanding of the operation of the brain. This research direction began in the 1940's, when McCulloch and Pitts [91], and later Hebb [53], suggested ways in which networks of neurons could operate. These works are the seeds of a fruitful and inspiring line of research, in whose wake hundreds of books have been written.

The neuron is the main building block of the neural network. A neuron has a large number of inputs (synapses) and a single output unit (axon). The artificial version of the neuron is the perceptron, or the linear separator [103]. Like the neuron, the perceptron has a large number of inputs and a single output. To simplify notation we denote the input by x ∈ R^d, so that each component of this vector corresponds to a synapse. The perceptron computes a linear threshold function over the inputs. Every perceptron has a weight vector w ∈ R^d and a threshold θ ∈ R, and the function it computes is

    c_{w,θ}(x) = sign(w · x − θ)

where w · x is the inner product of the vectors w and x. Although the perceptron was defined some 50 years ago, it is still the most common tool in machine learning, and a considerable part of this work is dedicated to it. The perceptron is usually called a "linear separator"; when the threshold θ vanishes, that is, when the decision rule is c_w(x) = sign(w · x), a classifier of this kind is called a "homogeneous linear separator".
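The text defines only the function a perceptron computes. As a minimal sketch of how such a separator can be trained, here is the classic mistake-driven online update rule of [103], filled in for illustration; the synthetic data stream is an assumption of the sketch.

    import numpy as np

    class Perceptron:
        """Homogeneous linear separator trained with the classic online rule."""

        def __init__(self, dim):
            self.w = np.zeros(dim)          # O(d) memory cells

        def predict(self, x):
            return 1 if np.dot(self.w, x) >= 0 else -1   # O(d) per prediction

        def update(self, x, y):
            """Observe (x, y); change w only on a mistake. Returns 1 on a mistake."""
            if self.predict(x) != y:
                self.w += y * x             # Rosenblatt's correction step
                return 1
            return 0

    # Example rounds of learning: the learner predicts, the true label is
    # revealed, and the learner counts its mistakes.
    rng = np.random.default_rng(0)
    w_star = np.array([1.0, -2.0, 0.5])     # hidden target separator
    p, mistakes = Perceptron(3), 0
    for _ in range(500):
        x = rng.standard_normal(3)
        y = 1 if np.dot(w_star, x) >= 0 else -1
        mistakes += p.update(x, y)
    print("mistakes:", mistakes)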
Artificial neural networks serve both for the study of the brain and as a tool for solving hard problems. There are also other methods for solving these problems; further representatives of such methods are the nearest neighbor rules and the window-based methods [33, 46, 120]. Both kinds of methods assume the existence of a distance measure on the input space, and therefore classify new examples according to their proximity to the examples observed during learning. In the nearest neighbor method, the predicted classification of a new example is obtained by a majority vote among the k nearest neighbors of the example in question. In the window methods, the predicted classification is obtained by a majority vote among all the training examples whose distance from the example we are interested in is small enough. Both methods have been analyzed and proved to be consistent; that is, they are optimal in the asymptotic sense, given a correct choice of parameters [120].
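A minimal sketch of the two distance-based rules just described, assuming a Euclidean metric and toy data (both are illustrative choices, not specified in the text):

    import numpy as np

    def knn_predict(train_x, train_y, x, k=3):
        """Majority vote among the k training examples nearest to x."""
        d = np.linalg.norm(train_x - x, axis=1)
        vote = np.sum(train_y[np.argsort(d)[:k]])
        return 1 if vote >= 0 else -1

    def window_predict(train_x, train_y, x, radius=1.0):
        """Majority vote among all training examples within a fixed window."""
        d = np.linalg.norm(train_x - x, axis=1)
        vote = np.sum(train_y[d <= radius])
        return 1 if vote >= 0 else -1

    # Example: ten labeled points in the plane, one test point.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((10, 2))
    y = np.where(X[:, 0] > 0, 1, -1)
    q = np.array([0.5, 0.0])
    print(knn_predict(X, y, q), window_predict(X, y, q))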
A more recent motive for research in machine learning is the attempt to understand the notion of "learning". The most important research in this field was done by Vapnik and Chervonenkis, who studied the phenomenon of uniform convergence of empirical means [128]. This result was kept as a "secret" until Blumer et al. [17] discovered its connection to the Probably Approximately Correct (PAC) model [126]. Valiant's model is an attempt to define learning as a mathematical notion. Since the definition of learning was made in isolation from the way it is carried out, it became possible to ask questions that had not been well defined before, such as "can one learn?", "can one learn everything?", or, more generally, "what can be learned?". In the following sections we survey the PAC model [126], additional definitions of learning, and the important findings in the field.

The PAC Model

Several important observations stand at the basis of the Probably Approximately Correct (PAC) model [126]. First, learning is a finite process, in the sense that the benefits of the learning can be discerned after a finite time; hence, a finite collection of examples should suffice for learning. Valiant also distinguished between inaccuracy and failure of the learner. Inaccuracy is caused by the finite sample that the learner sees; but, in certain cases, the learning process fails completely if the sample is not representative. Valiant argued that learning is possible if a reasonable accuracy can be achieved with a high level of confidence.

The PAC model defines which concept classes can be learned: a class C can be learned if the number of examples required in order to learn a concept in this class is finite. Valiant was interested mainly in binary problems, in which Y = {±1}. The assumption that the labels can take only two values looks restrictive, but there are standard ways of translating multi-label problems into collections of binary problems, in most cases by splitting the multi-label problem into a number of binary problems. For example, in the one-against-all method one generates a binary learning problem for every possible value y ∈ Y, and trains the binary learner to distinguish between the observations labeled y and the other observations. In the all-pairs method, as its name hints, a classifier is trained for every possible pair of labels. A more general method uses error-correction mechanisms to create a collection of binary problems and to combine them into a multi-label classifier (see, e.g., [34]). Therefore, the assumption that the learning problem is binary is not that restrictive, and we will make this assumption unless stated otherwise².

² Multi-label learning problems are more involved in our setting, since the learner has to decide when it should ask the teacher for additional information. We discuss this subject in Chapter 8.
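A minimal sketch of the one-against-all reduction described above, with a simple perceptron as the binary base learner (the base learner, the data and the hidden labeling rule are illustrative assumptions):

    import numpy as np

    class BinaryPerceptron:
        def __init__(self, dim):
            self.w = np.zeros(dim)

        def update(self, x, y):
            if (1 if np.dot(self.w, x) >= 0 else -1) != y:
                self.w += y * x

    class OneAgainstAll:
        """Reduce a multi-label problem to one binary problem per label."""

        def __init__(self, labels, dim):
            self.learners = {y: BinaryPerceptron(dim) for y in labels}

        def update(self, x, y):
            # Each binary learner separates its own label (+1) from the rest (-1).
            for label, learner in self.learners.items():
                learner.update(x, 1 if y == label else -1)

        def predict(self, x):
            # Predict the label whose binary separator scores x the highest.
            return max(self.learners, key=lambda l: np.dot(self.learners[l].w, x))

    # Example: three labels in the plane.
    rng = np.random.default_rng(0)
    model = OneAgainstAll(labels=[0, 1, 2], dim=2)
    for _ in range(300):
        x = rng.standard_normal(2)
        y = int(np.argmax([x[0], x[1], -x[0] - x[1]]))   # hidden multi-label rule
        model.update(x, y)
    print(model.predict(np.array([1.0, 0.0])))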
Definition 1 (PAC learning [126]) Let X be the sample space, let C be a binary concept class over X, and let Y = {±1} be the label space. We say that C is PAC learnable if for every ε, δ > 0 there exist m < ∞ and an algorithm L : (X × Y)^m → C such that for every probability measure µ over X × Y,

    Pr_{S~µ^m} [ error_µ(L(S)) > ε + inf_{c∈C} error_µ(c) ] < δ

where error_µ(c) = µ{(x, y) : c(x) ≠ y}.

A concept class is thus PAC learnable if a finite sample is enough in order to find a hypothesis in the class that is almost the best hypothesis in the class, in the sense of the generalization error. Vapnik and Chervonenkis [128] showed that PAC learnable classes have a unique geometric property: a concept class C is PAC learnable if and only if it has a finite VC dimension. In order to define the VC dimension we must first define the shattering coefficients of the class C.

Definition 2 Let C be a concept class. The m'th shattering coefficient of C is

    Π_C(m) = max_{x_1,...,x_m ∈ X} |{(c(x_1), ..., c(x_m)) : c ∈ C}|

The shattering coefficient measures the number of different ways in which the concept class can label m examples. Since |Y| = |{±1}| = 2, we have Π_C(m) ≤ 2^m. This is the logic behind the definition of the VC dimension:

Definition 3 The VC dimension of a concept class C is

    d = max {m : Π_C(m) = 2^m}

The dimension is infinite if Π_C(m) = 2^m for every m. The VC dimension characterizes the PAC learnable classes precisely. Vapnik and Chervonenkis proved the following highly influential theorem (we present a formulation slightly different from the original):

Theorem 1 [6, Theorem 4.2 and Theorem 4.8] Let C be a class whose VC dimension is d. Assume that L is an algorithm that, given a sample S ∈ (X × Y)^m of labeled examples, returns a hypothesis c = L(S) ∈ C which minimizes the training error |{(x, y) ∈ S : c(x) ≠ y}|. Then for every δ > 0 and every probability measure µ over X × Y,

    Pr_{S~µ^m} [ error_µ(L(S)) > ε ] ≤ δ

for every

    ε ≥ (2/m) ( d ln(2em/d) + ln(2/δ) )

If the VC dimension of C is infinite, then C is not PAC learnable. Theorem 1 shows that the expected rate is O*(d/m) when the target concept lies inside the class from which we learn, and O*(√(d/m)) in the general case³. It should be noted that the above bounds can be improved only in the constants. In Chapter 6 we will see that when active learning is used we achieve far better results.

³ O*(·): we neglect logarithmic factors.
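As a worked illustration of Definitions 2 and 3 (a standard textbook example, not taken from this chapter), consider threshold functions on the real line:

    Let $\mathcal{C} = \{c_\theta : \theta \in \mathbb{R}\}$ with
    $c_\theta(x) = \mathrm{sign}(x - \theta)$ over $\mathcal{X} = \mathbb{R}$.
    A single point $x_1$ can receive both labels ($\theta < x_1$ gives $+1$,
    $\theta > x_1$ gives $-1$), so $\Pi_{\mathcal{C}}(1) = 2 = 2^1$.
    For two points $x_1 < x_2$ the labeling $(+1, -1)$ is unrealizable, since
    $c_\theta(x_1) = +1$ forces $\theta < x_1 < x_2$ and hence
    $c_\theta(x_2) = +1$; thus $\Pi_{\mathcal{C}}(2) = 3 < 2^2$, and by
    Definition 3 the VC dimension of $\mathcal{C}$ is $1$.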
Online Learning

Online learning is a further definition of learning. Littlestone [83] emphasized the fact that learning is a continuous process. In the PAC model there are two phases, the learning phase and the generalization phase; in online learning these two phases are interwoven. Learning takes place in rounds. In round t the teacher presents an observation x_t, and the learner proposes a label ŷ_t for x_t. The teacher then presents the label y_t, and the learner suffers a loss of L(y_t, ŷ_t), where L(·,·) is a non-negative loss function. Figure 1 presents a flow diagram of a single round of online learning.

[Figure 1: A flow diagram of a single round of online learning. The teacher sends the observation x_t, the learner answers with the prediction ŷ_t, and the teacher reveals the true label y_t, inflicting the loss L(y_t, ŷ_t) on the learner.]

The goal of the learner in this setting is to minimize Σ_{t=1}^∞ L(y_t, ŷ_t) under the mildest possible assumptions. In most cases, one of the following assumptions is made:

1. There exists a target concept c inside the class C such that y_t = c(x_t). In this case it is sometimes possible to show that

    Σ_{t=1}^∞ L(y_t, ŷ_t) ≤ M < ∞

M is then called the mistake bound, since it is an absolute bound on the number of mistakes for every sequence of examples x_1, x_2, ... and every concept c ∈ C.

2. The target concept is not restricted, but the learner is measured against a restricted class of competitors. In this case one looks for a function f(·) such that for every sequence (x_1, y_1), (x_2, y_2), ...

    Σ_{t=1}^∞ L(y_t, ŷ_t) ≤ inf_{c∈C} Σ_{t=1}^∞ f( L(c(x_t), y_t) )

Many of the online learning algorithms are simple, fast, and use little memory. For example, the perceptron algorithm [103], operated in a d-dimensional space, requires O(d) memory cells, and each prediction requires O(d) operations⁴. Nevertheless, the definition of online learning does not restrict the use of memory and computation.

Since in most of the cases discussed the labels are +1 or −1, the natural loss function is the 0−1 loss,

    L_{0−1}(y_t, ŷ_t) = (1 − y_t ŷ_t)/2 = 1 if y_t ≠ ŷ_t, and 0 if y_t = ŷ_t,

which vanishes when y_t and ŷ_t are identical and takes the value 1 otherwise. In this case Σ_{t=1}^∞ L_{0−1}(y_t, ŷ_t) is exactly the number of mistakes of the learning algorithm. For example, when the perceptron algorithm is operated on a sequence (x_1, y_1), (x_2, y_2), ... such that there exists w ∈ R^d with ||w||_2 = 1 and y_t (w · x_t) ≥ θ for every t, and in addition ||x_t||_2 ≤ R, then

    Σ_{t=1}^∞ L_{0−1}(y_t, ŷ_t) ≤ R²/θ²

This is a mistake bound for the perceptron algorithm, proved in [97].

⁴ We assume that the perceptron is represented in the primal space. When reproducing kernels are used, one has to use the dual representation; in this case the memory and computation requirements change completely. Further details can be found in [40].
Active Learning

Stone's impressive theorem shows that, given a long enough training sample, even naive algorithms such as nearest neighbors can provide the best results possible [120]. However, collecting a large training sample creates two problems. First, data collection is a long and expensive process. Second, processing large samples requires many resources: the information clearly has to be processed during training and, although in certain cases the complexity of generalization does not depend on the length of the training process, in most algorithms, including Support Vector Machines [20] and AdaBoost [47], the computational complexity of generalization is a function of the length of the training process. Shortening the training process is therefore of great importance.

Active learning holds that the size of the training sample can be reduced if we allow ourselves to go beyond the standard definitions of learning, such as PAC or online learning, and grant the student a certain amount of control over the learning process. In the learning frameworks described so far, the teacher chooses the examples that the learner will see; such settings are therefore called passive learning. In active learning, the learner has a certain influence over the choice of the training examples. This control over the learning process allows it to concentrate on examples that will enrich its knowledge considerably, and the learning process is thereby accelerated.

In many cases active learning indeed accelerates the learning process; we will show that the acceleration can be exponential. In many cases, however, this has a price: since the learner has control over the learning process, it has to make decisions from which a passive learner is exempt. Consequently, in certain cases the computational complexity of the learning process grows when one moves from passive to active learning, while at the same time the sample complexity of the learning shrinks significantly. That is, we transfer the workload from the teacher to the student, and from the generalization phase to the training phase. This makes good sense, since the teacher is usually a human being while the learner is a machine; active learners therefore require less "manual labor" but more computing power in the training phase.

We discuss two frameworks of active learning: in Part II we discuss membership queries, and in Part III we discuss the filtering approach. The difference between these frameworks is the kind of control over the learning process that is given to the learner. In the membership queries framework [3], the learner can ask the teacher questions: the questions are presented to the teacher as observations, and the teacher is required to label these observations. The filtering framework [29] is more restricted: a collection of unlabeled observations is presented to the learner, which may request the labels of a subset of these observations. This framework subdivides further into batch learning and an online variant called "label efficient learning" [54].

Additional frameworks of active learning exist. In the equivalence queries framework [3] the learner can present the teacher with a hypothesis; the teacher can confirm that the hypothesis is good, or supply an example on which the hypothesis fails. A further framework is experimental design (see, e.g., [7]), which has been studied extensively by statisticians. In this framework the task is usually the approximation of a real-valued function, and the learner can choose the experiments to perform. Despite the close relation to active learning, in most cases the learner has to choose the experiments in advance, and it does not update its choices according to the results of previous experiments.
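A minimal sketch of selective sampling in the spirit of the QBC algorithm this work focuses on: two hypotheses are drawn from the set of separators consistent with the labels seen so far, and the teacher is queried only where they disagree. The rejection sampler used here is a crude stand-in for the hit-and-run procedure discussed earlier, and the data stream and teacher are invented for illustration.

    import numpy as np

    def sample_hypothesis(data, rng, dim):
        """Rejection-sample a homogeneous linear separator consistent with data."""
        while True:
            w = rng.standard_normal(dim)
            w /= np.linalg.norm(w)
            if all(y * np.dot(w, x) > 0 for x, y in data):
                return w

    def qbc_selective_sampling(stream, teacher, rng, dim):
        """Ask for a label only when two committee members disagree."""
        labeled, queries = [], 0
        for x in stream:
            w1 = sample_hypothesis(labeled, rng, dim)
            w2 = sample_hypothesis(labeled, rng, dim)
            if np.sign(np.dot(w1, x)) != np.sign(np.dot(w2, x)):
                labeled.append((x, teacher(x)))   # disagreement: query the teacher
                queries += 1
        return labeled, queries

    # Example: a hidden separator labels the stream; count the labels requested.
    rng = np.random.default_rng(0)
    w_star = np.array([1.0, -1.0])
    stream = [rng.standard_normal(2) for _ in range(200)]
    _, queries = qbc_selective_sampling(
        stream, lambda x: 1 if np.dot(w_star, x) >= 0 else -1, rng, 2)
    print("labels requested:", queries, "out of", len(stream))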
Additional Learning Frameworks

As we noted above, this research concentrates on supervised learning, although this is not the only framework for learning.

Unsupervised Learning

Unsupervised learning is an important branch of learning. The goal in unsupervised learning is to find structure in the data. The learner receives observations x_1, ..., x_m and is required to find an efficient representation of this data; a good representation is a small one that captures the central properties of the data. Two common ways of finding such representations are clustering and dimensionality reduction.

In clustering, the learner groups the observations into clusters. The goal is to find a small number of clusters such that all the examples in the same cluster are close or similar to one another, in contrast to examples from other clusters. There are many ways of finding clusters (e.g., [14, 96, 121]), but the problem is not well defined [67], since the similarity between points can be measured in different ways. Nevertheless, clustering has had many successes.

Another unsupervised method is dimensionality reduction, in which the learner finds a representation of the data in a low dimension. For every observation x_i, the learner keeps only a small number of features φ_1(x_i), ..., φ_d(x_i), which form a representation that is much smaller than, but similar to, the original one, so that the φ's preserve most of the properties of x_i. Principal Component Analysis (PCA) [61] is a representative of this family of learning methods.
oeibidd .ezelr z` mvnvle enild jildz z` xvwl jk ii-lre ,xzei ax jxr lra rin bivdl dxend
iniptd dpand lye mewd rind ly dlez `ed rind ztqez ly jxrdy `ed ef dhiy ly dqiqaa
.enild mzixebl`l m`zen zeidl eilr ,liri didi enild jildzy zpn lr ,okl .enild mzixebl` ly
1
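A minimal sketch of the interaction protocol just described, using tabular Q-learning as the learner; the learner, the toy chain environment and all parameters are invented for illustration and are not taken from the thesis.

    import numpy as np

    def run(env_step, n_states, n_actions, rounds=5000, eps=0.1, lr=0.1, gamma=0.9):
        """Tabular Q-learning against a stochastic state machine."""
        Q = np.zeros((n_states, n_actions))
        rng, s = np.random.default_rng(0), 0
        for _ in range(rounds):
            # The learner decides on an action (mostly greedily).
            a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
            # The machine returns a reward and a new state, both stochastic
            # functions of the current state and the chosen action.
            r, s_next = env_step(s, a, rng)
            Q[s, a] += lr * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
        return Q

    # Toy chain: action 1 tries to move right (succeeding 80% of the time);
    # reaching the last state pays 1 and restarts the walk.
    def chain(s, a, rng):
        s_next = min(s + 1, 4) if a == 1 and rng.random() < 0.8 else max(s - 1, 0)
        return (1.0, 0) if s_next == 4 else (0.0, s_next)

    print(run(chain, n_states=5, n_actions=2))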
Active learning combines supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning: we use unlabeled observations and labeled observations, and we make decisions that affect future events, since the student's queries affect the learning process. Nevertheless, the central point of view of this research is supervised learning, with active learning as an extension of supervised learning.