To PAC and Beyond
Thesis submitted in partial fulfillment of the degree of
Doctor of Philosophy
by
Ran Gilad-Bachrach
Submitted to the Senate of the Hebrew University
February 2006
This work was carried out under the supervision of Prof. Naftali Tishby.
Acknowledgements
The work presented here is the result of a period of 10 years I spent at the Hebrew University in
Jerusalem. During this time I got married to the adorable Efrat Gilad and together we gave birth
to our two incredible sons Dvir and Yinon. It sounds like a cliché, but without the support of my
closest family, especially my parents Yehudit (Dita) and Daniel (Dani) Bachrach, my brother and
sisters Yuval, Yael and Nurit, my grandparents and of course Efrat, Dvir and Yinon, I would not
have had any chance of succeeding. I do not think there is any way to express gratitude to them in words.
I have been lucky enough to be surrounded by extremely talented people who were my teachers,
peers and friends. I take this opportunity to thank them. Ran El-Yaniv opened the doors for me
to the fantastic world of scientific research. I am glad to be able to call him both a mentor and
a friend. I had the opportunity to collaborate with Eli Shamir who is a role model for me as a
person who is extremely smart, yet modest. Shai Fine is my “older scientific brother” and I thank
him for that (Shai - we still have an outstanding bet ...). Amir Navot has been a great peer to
work with and a great friend as well. We spent plenty of time together discussing everything.
Without his bike riding enthusiasm I would never have ridden a bike down the Kicking Horse ski
slopes without brakes. My other roommates, Amir Globerson and Gal Chechik, also shared with
us great moments of inspiring discussions.
While at the Hebrew U. I spent most of my time at the machine learning lab. I owe something
to each person in the lab: Yoseph Barash, Gil Bejerano, Koby Crammer, Ofer Dekel, Gal Elidan,
Michael Fink, Nir Friedman, Eyal Krupka, Shahar Mendelson, Elad Schneidman, Shai Shalev-Shwartz, Lavi Shpigelman, Yoram Singer, Yair Weiss, Golan Yona, and all the people at that
incredible place. Of course, a great thank you goes to my advisor, Naftali Tishby, who taught me
a lot about the scientific method, the beauty of science and how to conduct scientific research.
Although we debated a lot, he had a greater influence on my scientific approach than he might
think.
I had the best possible teachers, from whom I learned how to teach (which is important
for someone who is learning how to learn). I would especially like to mention Ehud de Shalit, Nati
Linial, Israel (Robert) Aumann, Saharon Shelah and Hillel Furstenberg.
I thank the Clore foundation for the generous funding they provided me as a Ph.D.
student. I also thank the Chorfas foundation, Vatat and the Amirim program for additional
support. I thank Eyal Rosenman and the ultimate Frisbee team for the great fun. Special thanks
go to Esther Singer for the English editing of this dissertation and to Nitsa Movshovitz-Hadar
for inspiring discussions.
There are so many people who deserve to be mentioned here, I thank each and every one of
you. Finally, I would like to thank again Efrat who motivated me, challenged me and supported
me. Your name should appear first on the title page of this work.
Abstract
The greatest advantage of learning algorithms is their ability to adjust themselves to the required
task. While “traditional” algorithms are tailored to perform a single task, learning algorithms
acquire the ability to perform a task from data. Learning techniques have been applied successfully in various domains such
as optical character recognition, information retrieval, medical diagnostics and fraud detection.
There are different frameworks for learning. In all of them, the learner is presented
with data and attempts to find some underlying structure in it. Later, this structure is used to make
inferences about unseen cases or about unobserved properties of the data. The frameworks differ
in:
1. The type of data the learner is presented with
2. The way it interacts with its teacher
3. The way the learner’s performance is evaluated
In many cases, the learner needs to be trained for a lengthy period before it reaches an acceptable
performance level. Generating these long training sequences is a hard and expensive process, since
in most cases it requires a great deal of human labor. For these reasons, a considerable part of
research in machine learning has dealt with different ways of shortening the training process. This
dissertation contributes to this field by examining ways to truncate the learning process using
active learning.
Active learning is a setting in which the learner has control over the learning processes. The
learner is allowed to direct the teacher to present the more informative data and, by so doing,
reducing the length and the cost of the training process. The rationale behind active learning is
that the value of any additional data is a function of the data seen so far and the internal structure
of the learning algorithm. Thus, for the learning process to be efficient, it needs to be tailored
to the learning algorithm. This is true when the learner is a machine and when the learner is a
human being¹. In a sense, this work is a demonstration of at least the first part of an ancient
Hebrew saying:

“לא הביישן למד, ולא הקפדן מלמד” (פרקי אבות)

“He who is shy does not learn, and he who is a pedant shall not teach” (Ethics of the
Fathers: teachings forming a tractate of the Mishnah, from the 3rd century BCE).

¹ See Chapter 12 for a short discussion of active learning in humans.
Active learning draws its power from several frameworks in machine learning: supervised learning,
unsupervised learning and reinforcement learning. In this work we take something from each of
these frameworks. However, we assess the performance of active learners when compared to passive
learners in the supervised learning framework and particularly the Probably Approximately Correct
(PAC) [126] framework. We study several active learning models: membership queries, selective
sampling and label efficient learning.
After a short introduction we begin our discussion with the membership queries framework
[3]. In this model the learner can direct questions to the teacher. It does so by crafting instances
and asking the teacher to label these instances. We study the sensitivity of this model to noise.
We show that learning can take place even if the teacher is inaccurate in some of the answers it
provides to the queries issued by the learner. We show that the noise can be filtered out if the
learning problem is well structured; i.e. when the dual learning problem is dense and has moderate
complexity.
In the third part of this work we study the selective sampling model [29, 30]. In this framework,
the learner observes unlabeled instances and selects the instances to be labeled by the teacher. This
model is useful in many real world scenarios in which the raw data are plentiful but labels are scarce.
We focus on the Query By Committee algorithm [112] and address theoretical, algorithmic and
practical issues. We begin by extending the theoretical foundation of the Query By Committee
(QBC) algorithm and the selective sampling framework. We show that QBC learns exponentially
faster than passive learners under milder assumptions than were previously known. We also prove
the efficiency of this algorithm in the label efficient setting; the “online” version of the selective
sampling model.
We continue by lifting some of the assumptions made in the theoretical analysis. First we lift
the Bayesian assumption. By lifting this assumption we show that when certain conditions apply,
the QBC is almost certain to be efficient whereas under the Bayesian assumption we can only
guarantee average performance. Next we address the issue of learning with Query By Committee
in the presence of noise. We show that QBC can be modified to tolerate noise. While analyzing
this scenario we develop the notion of information from observations, which is of interest in its own
right.
A key requirement of the QBC algorithm is the ability to sample a random hypothesis. The
difficulties involved in fulfilling this requirement have prevented most researchers from using QBC.
We show, for the first time, that when learning linear classifiers, sampling can be done in polynomial
time. We show that the problem of sampling a random hypothesis is equivalent to the problem of
sampling a random point from a convex body. The latter has been studied extensively [41, 84, 85]
and the results can be applied to our case. While the algorithms suggested for sampling from
convex bodies are polynomial, they are far from being efficient. We address this issue in two ways:
first, we show that the sampling can be done in a low-dimensional space. This enables us
to use QBC in high dimensional spaces without sacrificing computational complexity; thus kernels
can be used to augment the QBC algorithm. We also suggest using the hit-and-run [85] random
walk and show empirically that the number of iterations needed when sampling using hit-and-run
is small. These tools allow us to conduct experiments using the QBC algorithm and demonstrate
the benefits of using active learning over passive learning.
The study of active learning is still in its infancy. There are only a few theoretical results
(see e.g. [27, 36, 37, 48]) and empirical findings (see e.g. [35, 124, 122]). The distinction of this
dissertation lies in its holistic approach. We look at the same objects from different points of view.
We study active learning using analytic tools and at the same time discuss practical concerns and
provide empirical evidence to assess our findings. In this way, we reduce the gap between the
theoretical study of active learning and its practical use.
We believe that active learning is significant in many applications. This work is another step
towards the enhanced maturity of this field.
Contents

Acknowledgements
Abstract

Part I: Introduction

1 Learning
  1.1 Learning from Examples
  1.2 Machine Learning and Artificial Intelligence
  1.3 Introduction to Machine Learning
  1.4 Probably Approximately Correct (PAC)
  1.5 On-line Learning
  1.6 Active Learning
  1.7 Other Learning Models
    1.7.1 Unsupervised Learning
    1.7.2 Reinforcement Learning

Part II: Membership Queries

2 Preliminaries
  2.1 The Power of Membership Queries
    2.1.1 Constant Depth Circuits
    2.1.2 Decision Trees
    2.1.3 Intersections of Halfspaces
  2.2 The Limitations of Membership Queries
  2.3 Summary

3 Noise Tolerant Learnability using Dual Representation
  3.1 Learning in the presence of noise
  3.2 The Dual Learning Problem
  3.3 Dense Learning Problems
  3.4 Noise Immunity Scheme
  3.5 A Few Examples
    3.5.1 Monotone Monomials
    3.5.2 Geometric Concepts
    3.5.3 Periodic Functions
  3.6 Regression
    3.6.1 Estimating VC_ǫ(C, X**)
  3.7 VC Dimension of Dual Learning Problems
  3.8 Banach Spaces
  3.9 Summary

Part III: Selective Sampling

4 Preliminaries
  4.1 Empirical Studies of Selective Sampling
    4.1.1 Committee-Based Scores
      4.1.1.1 Part-of-Speech Tagging
      4.1.1.2 Spoken Language Understanding
      4.1.1.3 Ensemble of Active Learners
      4.1.1.4 Other Committee-Based Approaches
    4.1.2 Confidence-Based Scores
      4.1.2.1 Margin Based Confidence
      4.1.2.2 Probability Based Confidence
    4.1.3 Look-ahead Principles
  4.2 Theoretical Studies of Selective Sampling
  4.3 Label Efficient Learning
  4.4 Summary

5 The Query By Committee Algorithm
  5.1 Termination Procedures
    5.1.1 The “Optimal” Procedure
    5.1.2 Random Gibbs Hypothesis
    5.1.3 Bayes Point Machine
    5.1.4 Avoiding the Termination Rule
  5.2 Summary

6 Theoretical Analysis of Query By Committee
  6.1 The Information Gain
  6.2 The Fundamental Theory of QBC
  6.3 Proofs
  6.4 Lower Bound on the Expected Information Gain for Linear Classifiers
    6.4.1 The Class of Parallel Planes
    6.4.2 Concave Measures
    6.4.3 The Function G(ρ)
  6.5 Proof of Theorem 6.4
  6.6 Summary

7 The Bayes Model Revisited
  7.1 PAC-Bayesian Techniques
  7.2 Symmetry
  7.3 Incorrect Priors and Distributions
  7.4 Summary

8 Noise Tolerance
  8.1 “Soft” QBC
    8.1.1 The Case of Learning with Noise
    8.1.2 The Case of Stochastic Concepts
    8.1.3 A Variant of the QBC Algorithm
  8.2 Information Gain Revisited
    8.2.1 Observations of the State of a Random Variable
    8.2.2 Information Processing Inequality
  8.3 SQBC Sample Complexity
  8.4 Summary

9 Efficient Implementation Using Random Walks
  9.1 Linear Classifiers
  9.2 Sampling from Convex Bodies
  9.3 A Polynomial Implementation of QBC
  9.4 A Geometric Lemma
  9.5 Summary

10 Kernelizing the QBC
  10.1 Kernels
    10.1.1 Commonly Used Kernel Functions
    10.1.2 The Gram Matrix
    10.1.3 Mercer’s Conditions
    10.1.4 The Special Case of the Ridge Kernel
  10.2 A New Method for Sampling the Version-Space
  10.3 Sampling with Kernels
  10.4 Hit and Run
  10.5 Generalizing to Unseen Instances
  10.6 Summary and Further Study

11 Empirical Evidence
  11.1 Empirical Study
    11.1.1 Synthetic Data
    11.1.2 Label Efficient Learning over Synthetic Data
    11.1.3 Face Image Classification
  11.2 Summary

Part IV: Epilog

12 Epilog
  12.1 Active Learning in Humans

List of Publications
Bibliography
Summary in Hebrew
  Abstract (in Hebrew)
  Introduction (in Hebrew)
Part I
Introduction
Chapter 1
Learning
Learning is the processing of information we encounter, which leads to
changes or increase in our knowledge and abilities [74].
The high capacity to learn is one of the key features of human beings, and one that distinguishes us
from most of the animal kingdom. It is learning that allows us to adapt to changing environments
and to solve complicated problems. The quote at the top of this page captures some of the
fundamental aspects of learning. Learning is a process. It is a process that converts information
into knowledge and abilities. Thus learning is a process that enriches our capabilities.
Machine learning is the art of designing computerized learning processes; i.e., computer processes that are capable of turning the information we encounter into knowledge and abilities.
Machine learning uses the following flow to solve problems (a toy instance is sketched after the list):
1. Acquire data
2. Find concise representations of the data
3. Find patterns
4. Turn patterns into knowledge
5. Turn knowledge into actions (optional)
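As a toy illustration of this flow, consider learning a one-dimensional threshold classifier. The sketch below is ours and purely illustrative; none of the function names come from a library, and real systems replace each step with far more sophisticated machinery.

    # A toy instance of the five-step flow above; everything here is an
    # illustrative sketch, not a method proposed in this dissertation.
    def acquire_data():
        # step 1: acquire labeled data (hard-coded toy sample)
        return [(0.1, -1), (0.4, -1), (0.6, +1), (0.9, +1)]

    def find_pattern(sample):
        # steps 2-3: a concise representation of this data is a single
        # threshold separating the negative from the positive instances
        rightmost_neg = max(x for x, y in sample if y == -1)
        leftmost_pos = min(x for x, y in sample if y == +1)
        return (rightmost_neg + leftmost_pos) / 2.0

    def to_knowledge(threshold):
        # step 4: turn the pattern into a predictor
        return lambda x: +1 if x >= threshold else -1

    predict = to_knowledge(find_pattern(acquire_data()))
    print(predict(0.7))  # step 5: act on the knowledge; prints 1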
It has been shown that this process is successful in many domains such as text classification, speech
recognition, machine vision, the study of the genome, medicine etc.
1.1 Learning from Examples
There are many scenarios that incorporate learning. We are interested in learning from examples
and more specifically in supervised learning. Two players are involved here, the teacher and the
learner. The teacher has some knowledge that the learner is interested in. Thus, the learner
observes the teacher in action. It is hoped that the learner will be able to collect enough
information, and will be wise enough to convert this information into knowledge. There are therefore
two sources of complexity in this process: the amount of time the learner needs to observe
the teacher and the amount of “wisdom” the learner is required to have. In the machine learning
literature these complexities are referred to as the sample complexity and the computational complexity
of the learning process.
We consider the “learning from examples” framework. In the learning processes, the learner
has access to examples. Each example is a pair (x, y) where x is an instance from the sample space
X and y is the label of this instance taken from the output space (or label space) Y. We assume
that there exists an underlying target concept c that maps inputs to outputs. The target concept
can be a deterministic map or a stochastic one. The goal of the learner is to approximate this
target concept.
Much of the research in machine learning focuses on ways to reduce both sample and computational complexity. In a sense, this work deals with the trade-off between the two. We are
interested in ways of reducing the sample complexity without overly sacrificing the computational
complexity. In other words, we are interested in transferring some of the workload from the teacher
to the learner.
In order to achieve this acceleration in learning we will go beyond the traditional learning
models, e.g. PAC (see definition 1.1) and allow active learning. By active learning we mean that
the learner plays an active role in the learning process. While a passive learner only observes the
teacher, an active learner can guide the teacher by asking questions.
In this work we study active learning as an extension to passive learning. We show that in
many cases active learners significantly outperform passive learners. We study these frameworks
both with analytical tools (theory) and with experiments (empirical evidence). As opposed to
most of the active learning literature, we use the same algorithm in our theoretical study as in our
empirical study. This enables us to bridge the gap between theory and practice in active learning.
In the remainder of this chapter we briefly discuss fundamental definitions and theorems in
machine learning. A reader who is familiar with this field may wish to skim through it simply for
the notation we will be using in this dissertation (see Table 1.1 for a summary of the notation).

Table 1.1: Summary of the notation used in this dissertation

Symbol   Description                                    Remarks
log      base 2 logarithm
ln       natural logarithm
X        the sample space
x        an instance in the sample space
D        a distribution over a sample space
Y        the label space                                typically Y = {±1}
y        a label in the label space
S        a sample                                       S ∈ (X × Y)^m
C        a concept class
c        a concept in the concept class
ν        a prior (or posterior) over a concept class
d        dimension
m        the size of a training sample
H(·)     Shannon’s binary entropy                       H(p) = −p log p − (1 − p) log(1 − p)
ǫ        error rate
δ        failure probability
η        noise rate
1.2 Machine Learning and Artificial Intelligence
Machine learning is a sub-division of Artificial Intelligence (AI). Both AI and machine learning try
to mimic the way the human brain solves complicated problems. Traditional artificial intelligence
defines knowledge as a set of logical rules. These rules are used to infer new unseen cases. Machine
learning diverges from this approach in two ways: first, machine learning puts greater emphasis
on the learning process; i.e. the process by which we acquire knowledge. Second, machine learning
typically uses statistical and probabilistic properties whereas traditional artificial intelligence uses
logical deductions. In a sense, machine learning is an evolutionary phase of AI [105].
The difference between the two approaches can be seen in the following example. Assume you
would like to build a machine to perform a certain task, say a medical diagnosis machine. The
“logical” approach towards building such a machine would be to contact an expert (a physician in
the example of a medical diagnosis machine) and ask for a set of rules that differentiates sick from
healthy people. These rules are hard coded in the machine and used to diagnose patients.
This approach has several flaws: first, it is usually impossible to define these logical rules.
Second, it is difficult to maintain and debug such a set of rules: in a system with thousands of
rules, how do you find the one rule that leads to the wrong prediction? How do you correct it
without destroying the whole system? Finally, how do you adjust such a system to a changing
environment or to a new diagnosis task?
The flaws presented above primarily affect the acquisition process. Machine learning uses a
different approach. In the acquisition stage, here called learning, the learner observes an expert at
work and collects statistics about various correlations in the data. Once the learner has collected
enough information, it can be used to generate insights and make predictions.
Learning from examples is useful in a variety of domains such as medical diagnostics, speech
recognition, information retrieval, etc. It has the advantage that the training process only requires
observing an expert at work. It is easier to maintain and is more reusable than “logical” machines.
Machine learning, as its name suggests, focuses on the acquisition stage. Machine learning
can be broken down into various sub-fields based upon the nature of the acquisition stage (e.g.
supervised, semi-supervised, unsupervised) and the task the machine has to perform (e.g. batch,
on-line, classification, regression). In this dissertation we focus on the task of supervised learning.
1.3 Introduction to Machine Learning
Machine learning has been studied under various names for more than half a century. A comprehensive review of the field is beyond the scope of this work. Here we present the key principles
that will be used in the rest of this document.
Machine learning attracts researchers from different disciplines: mathematics, computer science,
neuroscience, biology, etc. There are three main motivations for research in this field:
1. The study of the brain
2. Learning as a way to solve difficult problems
3. The study of “learning” as an abstract concept
Brain researchers have found that the brain is made up of atomic building blocks, the neurons.
These neurons are connected together in a network. The ability of our brain to solve complicated
problems and adjust itself to changing environments and new tasks prompted researchers to believe
that by building artificial neural networks we would be able to learn to solve complex problems.
It would also allow us to better understand the way our brain works. This line of research began
during the 1940s. McCulloch and Pitts [91] and later Hebb [53] suggested ways in which neural
networks could work. These were the opening chapters in a very fruitful and inspiring line of
research that has generated hundreds of books.
The main building block of the neural network is the neuron. A neuron has many inputs
(synapses) and a single output (axon). The artificial neuron is the perceptron or the linear classifier
[103]. Like the neuron, it has many inputs and a single output. To simplify the notation we assume
that there is a single input which is a vector x ∈ IR^d, such that each component of this vector
corresponds to a synapse. The perceptron calculates a linear threshold function over its input. Each
perceptron holds a vector of weights w ∈ IR^d and a threshold θ ∈ IR. It computes the function
c_{w,θ}(x) = sign(w · x − θ), where w · x is the inner product between the vectors w and x.
Although the perceptron was defined almost 50 years ago, it is probably the most commonly
used tool in machine learning. A large part of this work is devoted to learning perceptrons. We
usually refer to perceptrons as linear classifiers. Whenever the threshold θ is set to zero, i.e. the
classification rule is c_w(x) = sign(w · x), we call the classifier a homogeneous linear classifier.
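As a concrete sketch of the function c_{w,θ}(x) = sign(w · x − θ), a minimal implementation might read as follows (we adopt the common convention that ties at zero are resolved to +1; the code is illustrative, not taken from [103]):

    import numpy as np

    def perceptron_predict(w, theta, x):
        # compute c_{w,theta}(x) = sign(w . x - theta); ties at zero
        # are broken in favor of +1 by convention
        return 1 if np.dot(w, x) - theta >= 0 else -1

    w = np.array([1.0, -2.0])
    print(perceptron_predict(w, theta=0.5, x=np.array([2.0, 0.25])))  # -> 1

Setting theta to zero gives the homogeneous linear classifier mentioned above.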
Artificial neural networks serve both as a tool to study the brain and as a method to solve
problems that are otherwise considered to be hard. To solve problems, other approaches have been
suggested as well. Two representative approaches in this category are nearest neighbor rules and
window based rules [33, 46, 120]. Both methods assume some metric over the input space, and
the prediction of the label of a new instance is based on its proximity to some of the points
in the training set. In nearest neighbor rules, the predicted label of a new instance is chosen by
holding a majority vote among the k nearest neighbors of the instance at hand. In window based
approaches, the label is chosen by holding a majority vote among all training instances which are
close to the instance at hand. Both approaches have been analyzed and proved to be consistent,
i.e. asymptotically optimal, provided that the right choice of parameters is made.
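A minimal sketch of the k-nearest-neighbor rule just described, assuming a Euclidean metric for illustration (any metric would do):

    import numpy as np

    def knn_predict(train_x, train_y, x, k=3):
        # majority vote among the k training instances nearest to x
        dists = np.linalg.norm(train_x - x, axis=1)
        nearest = np.argsort(dists)[:k]
        return 1 if train_y[nearest].sum() >= 0 else -1

    X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
    y = np.array([-1, -1, 1, 1])
    print(knn_predict(X, y, np.array([0.95, 1.0])))  # -> 1

The window-based rule is obtained by replacing the k nearest neighbors with all training instances within a fixed radius of x.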
The most recent direction in machine learning has been the attempt to study “learning” as
an autonomous concept. The most significant work in this domain was done by Vapnik and
Chervonenkis [128] who studied the concept of uniform convergence of empirical means. This result
was kept a “secret” until Blumer et al. [17] discovered its relation to the Probably Approximately
Correct (PAC) model [126].
The Probably Approximately Correct (PAC) model [126] is an attempt to define learning as
a mathematical object. By defining learning in a way which makes no assumptions as to how
learning is accomplished, Valiant was able to raise issues which had never been formulated before
such as “Is it possible to learn?”, “Is it possible to learn everything?” or, more generally, “What
can be learned?”. In the following sections we review the PAC model and other definitions of
learning, and some of the important findings in this field.
1.4 Probably Approximately Correct (PAC)
Valiant made several important observations which are fundamental ingredients of the PAC model
[126]. First of all, learning is a finite process in the sense that we should be able to benefit from
learning after a finite time. Therefore, learning should be possible after seeing only a finite set of
examples. Valiant also made a distinction between the inaccuracy of the learner and a failure of
the learning process. Inaccuracy is caused by the fact that the learner sees only a finite sample.
However, sometimes the learning process can fail when the training sequence is atypical. Valiant
claims that as long as we have reasonable accuracy with high confidence we can learn.
The PAC model defines learnable concept classes. A class C is learnable if the number of
examples needed to learn a concept in this class is finite. Valiant was primarily interested in
the binary setting where Y = {±1}. The assumption that the labels can take only two possible
values seems to be restrictive; however, there are canonical ways to convert multi-class learning
problems into a set of binary learning problems. In most cases this is done by breaking the multiclass learning problem into a collection of binary classification problems. For example, in the
one-against-all method, we generate a binary classification problem for every possible value y ∈ Y.
We train a binary classifier to distinguish between the instances labeled by y and the instances
for which the label is different from y. In the all-pairs method, as its name suggests, we train a
classifier for every pair of labels. The more general approach uses error-correcting codes to generate
a set of binary classification problems and combine them into a multi-class classifier (see e.g. [34]).
Hence, the assumption that the labels are binary is not too restrictive and we will assume this
unless otherwise specified¹.
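The one-against-all reduction described above can be sketched as follows. The code is a hypothetical illustration: binary_learner stands for any binary learning algorithm that returns a real-valued scoring function whose sign and magnitude reflect confidence in the positive class.

    def one_against_all(train, labels, binary_learner):
        # train one binary classifier per label: +1 for that label, -1 otherwise
        scorers = {}
        for y0 in labels:
            relabeled = [(x, 1 if y == y0 else -1) for x, y in train]
            scorers[y0] = binary_learner(relabeled)
        # predict the label whose binary classifier is the most confident
        return lambda x: max(labels, key=lambda y0: scorers[y0](x))

    def centroid_learner(sample):
        # a toy binary learner for 1-D data: score by distance to the
        # negative centroid minus distance to the positive centroid
        pos = [x for x, y in sample if y == 1]
        neg = [x for x, y in sample if y == -1]
        p, n = sum(pos) / len(pos), sum(neg) / len(neg)
        return lambda x: abs(x - n) - abs(x - p)

    train = [(0.0, "a"), (0.1, "a"), (1.0, "b"), (1.1, "b"), (2.0, "c")]
    predict = one_against_all(train, ["a", "b", "c"], centroid_learner)
    print(predict(1.05))  # -> b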
Definition 1.1 Probably Approximately Correct [126]
Let X be a sample space and let C be a binary concept class over X. Let the label space Y be
{±1}. We say that C is PAC learnable if for any ǫ, δ > 0 there exist m < ∞ and an algorithm
L : (X × Y)^m → C such that for any probability measure µ on X × Y

    Pr_{S∼µ^m} [ error_µ(L(S)) > ǫ + inf_{c∈C} error_µ(c) ] < δ

where error_µ(c) = µ{(x, y) : c(x) ≠ y}.

¹ Multi-class active learning is more complicated since the learner needs to decide when to query the teacher for more information. We discuss this issue in Chapter 8.
A concept class is PAC learnable if a finite sample suffices to learn a hypothesis from the class
that is almost the best possible concept in this class in terms of generalization error. In a seminal
paper, Vapnik and Chervonenkis [128] showed that PAC learnable classes have a unique geometric
property: a concept class C is PAC learnable if and only if it has a finite Vapnik-Chervonenkis
(VC) dimension. The connection to the PAC learnability problem was found in [17].
In order to define the VC dimension we need to define the shatter coefficient (or growth function)
of a class C.
Definition 1.2 Let C be a concept class. The m’th shatter coefficient of C is

    Π_C(m) = max_{x_1,...,x_m ∈ X} |{(c(x_1), . . . , c(x_m)) : c ∈ C}|

The shatter coefficient measures the number of different ways in which a concept class can assign
labels to m instances. The function Π_C is called the growth function of C. Clearly, since |Y| =
|{±1}| = 2, we have Π_C(m) ≤ 2^m. This is the rationale behind the definition of the VC dimension:

Definition 1.3 A concept class C has VC dimension d if

    d = max {m : Π_C(m) = 2^m}

The VC dimension is infinite if Π_C(m) = 2^m for all m.
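For a finite sample space and a finite concept class, Definitions 1.2 and 1.3 can be checked by brute force. The following sketch is ours and exponential in m, so it is purely illustrative; it computes Π_C(m) and the VC dimension for threshold functions on the line:

    from itertools import combinations

    def shatter_coefficient(concepts, points, m):
        # brute-force Pi_C(m) over a finite sample space (Definition 1.2)
        best = 0
        for xs in combinations(points, m):
            labelings = {tuple(c(x) for x in xs) for c in concepts}
            best = max(best, len(labelings))
        return best

    def vc_dimension(concepts, points):
        # largest m with Pi_C(m) = 2^m (Definition 1.3)
        d = 0
        for m in range(1, len(points) + 1):
            if shatter_coefficient(concepts, points, m) == 2 ** m:
                d = m
        return d

    # threshold functions on the line shatter single points but no pair,
    # so the reported VC dimension is 1
    points = [0.0, 1.0, 2.0]
    thresholds = [-1.0, 0.5, 1.5, 3.0]
    concepts = [lambda x, t=t: 1 if x >= t else -1 for t in thresholds]
    print(vc_dimension(concepts, points))  # -> 1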
The VC dimension gives an exact classification of PAC learnable concept classes. Vapnik and
Chervonenkis [128] proved the following seminal theorem (this is a rephrased version of the original
result).
Theorem 1.1 [6, Theorem 4.2 and Theorem 4.8]
Let C be a concept class of VC dimension d. Let L be an algorithm which, given a sample
S ∈ (X × Y)^m of labeled instances, returns a hypothesis c = L(S) ∈ C which minimizes the
empirical error

    |{(x, y) ∈ S : c(x) ≠ y}|

Then for any δ > 0 and any probability measure µ over X × Y the following holds:

1. If d is finite then

    Pr_{S∼µ^m} [ error_µ(L(S)) > ǫ + inf_{c∈C} error_µ(c) ] ≤ δ

   as long as

    ǫ ≥ √( (32/m) ( d ln(2em/d) + ln(4/δ) ) )

   where error_·(·) is as defined in the definition of the PAC model (Definition 1.1).

2. If d is finite and inf_{c∈C} error_µ(c) = 0, i.e. the target concept is in the concept class C, then

    Pr_{S∼µ^m} [ error_µ(L(S)) > ǫ ] ≤ δ

   as long as

    ǫ ≥ (2/m) ( d ln(2em/d) + ln(2/δ) )

3. If d = ∞ then C is not PAC learnable.

Theorem 1.1 shows that the learning rates we can expect are O*(√(d/m)) in the general case
and O*(d/m) when the target concept is in the concept class². Note that only the constants in the
bounds we presented can be improved. We will see in Chapter 6 that when active learning is used
we obtain significantly better results.

² We use the notation O*(·) to indicate that we neglect logarithmic factors.
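For concreteness, the bound in part 1 of Theorem 1.1 can be inverted numerically to obtain a sufficient sample size m for given d, ǫ and δ. This is an illustrative computation of the stated bound, not a tight sample-complexity result:

    import math

    def pac_sample_size(d, eps, delta):
        # smallest m (up to bisection accuracy) for which
        # eps >= sqrt((32/m) * (d*ln(2em/d) + ln(4/delta))) holds
        def ok(m):
            rhs = math.sqrt((32.0 / m) * (d * math.log(2 * math.e * m / d)
                                          + math.log(4.0 / delta)))
            return eps >= rhs
        m = d
        while not ok(m):        # exponential search for a feasible m
            m *= 2
        lo, hi = m // 2, m      # then bisect; ok(hi) always holds
        while lo + 1 < hi:
            mid = (lo + hi) // 2
            if ok(mid):
                hi = mid
            else:
                lo = mid
        return hi

    print(pac_sample_size(d=10, eps=0.1, delta=0.05))  # roughly 4 * 10**5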
1.5 On-line Learning

The on-line learning model [83] is another attempt to define learning. Littlestone tried to capture
the fact that learning is a continuous process. In the PAC model there are two phases: the learning
phase and the inferring phase (or generalizing phase). In the on-line learning model, these two
phases are interleaved. Learning takes place in rounds. In round t, the teacher presents the instance
x_t and the learner suggests that the label of x_t is ŷ_t. After this, the teacher presents the label y_t and the
learner suffers a loss of L(y_t, ŷ_t), where L(·,·) is some non-negative loss function. See Figure 1.1 for
an illustration of a single round in the on-line learning model.

Figure 1.1: On-line learning. An illustration of a single round in the on-line learning model: the teacher presents x_t, the learner responds with ŷ_t, and the teacher then reveals y_t, at which point the learner suffers the loss L(y_t, ŷ_t).
The goal of the learner in this setting is to minimize Σ_{t=1}^∞ L(y_t, ŷ_t) under the mildest possible
assumptions. In most cases one of the following assumptions is made:
1. There exists an underlying target concept c, chosen from a class C, such that y_t = c(x_t).
Under this assumption we can sometimes prove that

    Σ_{t=1}^∞ L(y_t, ŷ_t) ≤ M < ∞

We call M the mistake bound since it provides an upper bound on the number of mistakes
for any single sequence x_1, x_2, . . . and any concept c ∈ C.
2. There is no restriction on the target concept, but the learner is compared only to a limited
reference class C. In this case we seek a function f(·) such that for any sequence
(x_1, y_1), (x_2, y_2), . . . we have

    Σ_{t=1}^∞ L(y_t, ŷ_t) ≤ inf_{c∈C} Σ_{t=1}^∞ f( L(c(x_t), y_t) )

These bounds are called regret bounds. They provide a bound on the difference between the
cumulative loss of the algorithm studied and that of the best concept in the reference class.
Many of the on-line learning algorithms are very simple, fast and use a small amount of memory.
For example, the perceptron algorithm [103] when applied to a d dimensional problem, uses O (d)
memory cells and each prediction is made in O(d) operations³. It is important to note, though,
that the constraints on memory and CPU usage are not a part of the definition of the on-line
learning model.
Since in most of the cases we study here the labels are either +1 or −1, the natural loss function
is the 0-1 loss, which has the value 0 whenever y_t and ŷ_t are equal and the value 1 otherwise:

    L_{0-1}(y_t, ŷ_t) = (1 − y_t ŷ_t)/2 = { 0 if y_t = ŷ_t ; 1 if y_t ≠ ŷ_t }

In this case Σ_{t=1}^∞ L_{0-1}(y_t, ŷ_t) is a count of the number of prediction mistakes the learning algorithm
made. It is known that if the perceptron algorithm is used on a sequence (x_1, y_1), (x_2, y_2), . . . then
Σ_{t=1}^∞ L_{0-1}(y_t, ŷ_t) ≤ R²/θ², provided that ||x_t||_2 ≤ R for every t, and there exists w ∈ IR^d such that
||w||_2 = 1 and y_t(w · x_t) ≥ θ. This is the mistake bound for the perceptron algorithm that was
proved in [97].
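A minimal sketch of the homogeneous perceptron run in this on-line protocol, counting its 0-1 loss; the update on each mistake is the classical rule of [103], but the surrounding harness and the synthetic data are our own illustration:

    import numpy as np

    def perceptron_online(stream, dim):
        # run the homogeneous perceptron through an on-line sequence,
        # predicting, suffering 0-1 loss, and updating on every mistake
        w, mistakes = np.zeros(dim), 0
        for x, y in stream:
            y_hat = 1 if np.dot(w, x) >= 0 else -1   # predict
            if y_hat != y:                           # 0-1 loss of 1
                mistakes += 1
                w = w + y * x                        # perceptron update
        return w, mistakes

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.sign(X @ np.array([1.0, 2.0, -1.0, 0.5, 0.1]))
    w, mistakes = perceptron_online(zip(X, y), dim=5)
    print(mistakes)  # finite, per the mistake bound above (the data is separable)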
³ We assume here that the perceptron is represented in the primal space. When the perceptron is used with kernels, the hypothesis must be represented in the dual space. In this case, the memory usage and CPU usage change dramatically. See [40] for more about this issue.
1.6 Active Learning
Stone’s celebrated theorem proves that given a large enough training sequence, even naive algorithms such as the k-nearest neighbors can be optimal [120]. However, collecting large training
sequences runs up against two main obstacles. First, collecting these sequences is a lengthy and
costly task. Second, processing large data-sets requires enormous resources. Obviously we need to
process the data while training. However, in most cases, the complexity of inferring the labels of
new data items is affected by the size of the training data. This is the case for the commonly used
Support Vector Machines [20] and Adaboost [47] algorithms. Therefore, reducing the size of the
training sequence is of major concern.
Active learning suggests that the size of the training sequence can be reduced considerably if
we allow ourselves to go beyond the standard definitions of learning, e.g. PAC and on-line learning,
and allow the learner some control over the learning process. In the learning frameworks we have
discussed so far, the teacher selected the instances to be presented to the learner. Therefore we call
these frameworks passive learning. In active learning frameworks, the learner has some influence
on the selection of data points. Having control over the learning process allows the learner to focus
on the more informative data points and thus increase the learning rate.
In many cases, active learning can indeed accelerate the learning rate. We will show that the
speedup can be exponential. However, in some cases there is a price to be paid. Since the learner
has control over the learning process, it needs to make decisions that passive learners do not make.
Therefore, in some cases the computational complexity of learning can increase when moving from
passive learning to active learning. At the same time however, the sample complexity of learning
reduces considerably. This means that we shift the workload from the teacher to the learner and
from the generalization (inference) phase to the training phase. This makes perfect sense since
the teacher is typically a human while the learner is a machine; thus, active learners require less
human labor but may require more computing effort while training.
We discuss two active learning frameworks in this work. In Part II we discuss the membership
queries framework and in Part III we discuss selective sampling. The difference between these
frameworks is in the type of control the learner is assumed to have over the learning process. In
the Membership Queries framework [3] the learner is allowed to pose questions to the teacher. These
questions are presented as instances and the teacher is queried for the labels of these instances.
The selective sampling framework [29] is more restrictive. The learner is presented with unlabeled
instances and may query for the labels of a subset of these instances. This framework subdivides
into two varieties: the batch framework, which we call selective sampling whenever this is not
confusing, and the on-line framework, which is called label efficient learning [54].
Alternative active learning frameworks do exist. In the Equivalence Query model [3] the learner
can present a hypothesis to the teacher. The teacher can either accept this hypothesis as a good
one, or reject it while presenting an instance on which it deviates from the target concept. Another
model, experiment design (see e.g. [7]) is being studied extensively by statisticians. In this model,
the problem at hand is a regression problem, and the learner is allowed to select the experiment to
run. Although this can be viewed as an active learning framework, it is not proactive learning as
the learner does not refine the selection of experiments based on the results of previous experiments.
1.7 Other Learning Models
As mentioned earlier, this essay focuses on the supervised learning framework. However, this is
not the only way in which learning occurs.
1.7.1 Unsupervised Learning
Unsupervised learning is an important type of learning. The goal in unsupervised learning is to
find structure in data. The learner is given data x1 , . . . , xm and is required to find a concise
representation of the data. A good representation is a small representation that captures the
significant properties of the data. Two popular ways of finding these representations are clustering
and dimensionality reduction.
In clustering, the learner groups the instances into clusters. The goal here is to find a small
number of clusters such that instances within the same cluster are closer, or more similar, to
each other than to instances from different clusters. There are many ways of achieving this
goal (see e.g. [14, 96, 121]), but the problem is ill-posed [67] since similarity between points can be
measured in many different ways. Nevertheless, clustering is a powerful tool.
Another unsupervised learning method is dimensionality reduction. In dimensionality reduction, the learner finds a new representation of the data that is low dimensional but at the same time
close to the original representation. For every instance x_i the learner retains only a few properties
φ_1(x_i), . . . , φ_d(x_i) such that the φs capture most of the important attributes of x_i. Principal
Component Analysis (PCA) is a representative of this family of learning techniques [61].
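A minimal PCA sketch of this idea, in which the retained properties φ_1(x_i), . . . , φ_d(x_i) are the projections of x_i onto the d directions of largest variance (illustrative code, not a method developed in this dissertation):

    import numpy as np

    def pca_reduce(X, d):
        # project the instances (rows of X) onto the d directions of
        # largest variance
        Xc = X - X.mean(axis=0)                 # center the data
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        phi = Vt[:d]                            # top-d principal directions
        return Xc @ phi.T                       # phi_1(x_i), ..., phi_d(x_i)

    X = np.random.default_rng(1).normal(size=(100, 10))
    print(pca_reduce(X, d=2).shape)  # -> (100, 2)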
1.7.2 Reinforcement Learning
Another important type of learning is reinforcement learning [62]. Consider for example the problem of navigating a robot in a maze. At each junction, the robot has to decide the direction to
take. The decision made by the robot has long term consequences, since the parts of the maze the
robot will see depend on the decisions it makes.
In the more general setting, reinforcement learning assumes an underlying state machine. At
each point in time, the learner decides on the action it takes. As a result of this action, the
learner receives a reward, and the state of the machine changes. The reward and the new state are
stochastic functions of the current state and the action taken by the player.
Active learning is a combination of supervised learning, unsupervised learning, semi-supervised
learning and reinforcement learning. We use both unlabeled and labeled data, and we make decisions that
affect future events, since the queries the learner issues affect the learning process. However, the
main focus of this work is in viewing active learning as an extension of supervised learning.
Part II
Membership Queries
Chapter 2
Preliminaries
Active learners have some control over the learning process. Passive learners can observe the
training data but cannot alter it, whereas active learners can direct the teacher to what the
learner considers to be the most interesting cases. The capability to play an active role in the
training process gives the learner much more latitude for action than passive learners have. It also more
closely resembles the way humans learn. Human learning is a bi-directional process [18, 93, 118].
A good teacher needs to adjust his or her mode of instruction to the student’s prior knowledge
and state of mind. It is a well established fact that a teacher who is not tuned to feedback from
the students will not be able to teach effectively [18, 93, 118].
When trying to design a framework for computerized active learning, we need to define the
way bi-directional communication between learner and teacher takes place. The first active learner
framework explored here is the Membership Queries (MQ) framework [3, 4]. In this framework,
the learner is allowed to direct questions to the teacher.
Definition 2.1 Let X be a sample space. A membership query is an instance x ∈ X . The
teacher’s response to such a query is the label y associated with x.
A learning algorithm makes a membership query, much like humans ask their teachers questions.
The membership query oracle is very powerful since it allows the learning algorithm to query for
the label of any instance.
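Definition 2.1 amounts to a very simple interface between learner and teacher. A hypothetical sketch, in which the teacher is simulated by a target concept (in practice the oracle could be a human labeler or another program):

    class MembershipOracle:
        # a teacher answering membership queries (Definition 2.1);
        # here it is simulated by a target concept it holds
        def __init__(self, target):
            self._target = target
            self.queries = 0

        def query(self, x):
            # the learner may ask for the label of ANY instance x
            self.queries += 1
            return self._target(x)

    oracle = MembershipOracle(lambda x: 1 if x >= 0.5 else -1)
    print(oracle.query(0.7), oracle.queries)  # -> 1 1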
Figure 2.1: An illustration of a Boolean circuit. This circuit has three inputs: I1 , I2 and I3
and a single output marked by O. The circuit contains two AND gates and a single OR gate.
The circuit has a depth of two.
2.1 The Power of Membership Queries
The power of membership queries has been demonstrated in hundreds of papers. This section
looks at a few key articles, all of which show that membership queries can solve problems that
are difficult to tackle. Enumerating all the tasks in which membership queries have been used to
provide solutions is beyond the scope of this dissertation. The tasks examined here illustrate the
variety and diversity of applications of membership queries.
2.1.1 Constant Depth Circuits
There are many ways to represent Boolean functions: truth tables, logical formulas, Karnaugh maps
and others. A Boolean circuit is another representation of a Boolean function which captures the
engineering point of view. A circuit is made of gates that are wired together. The gates are the
atoms of this structure. A gate can perform a simple task: it receives one or more inputs and
generates an output. The output of the gate can be wired to the input of other gates. Each gate
can be connected to any other gate provided that the directed graph, which describes the circuit, is
acyclic¹. Through the right choice of gates and wiring, a circuit can perform sophisticated Boolean
functions. See figure 2.1 for an illustration of a Boolean circuit.
¹ Such a graph is called a DAG, which stands for Directed Acyclic Graph.
Boolean circuits play an important role in electronics and in theoretical computer science.
Linial, Mansour and Nisan proved the following about learning such circuits:
Theorem 2.1 [81]
Let c : {−1, 1}^n → {−1, 1} be a Boolean function which is computable by a circuit of size S
and depth d using AND and OR gates. Let ǫ, δ > 0. Then there exists a learning algorithm which
generates a hypothesis h such that, with probability 1 − δ, the hypothesis h is ǫ-close to c. The
algorithm works in time poly( n^{(14 log(S/ǫ))^{d−1}}, log(1/δ) ) and uses membership queries.
In this celebrated result, the Fourier transform of the concept c is analyzed and used to generate
the hypothesis h. Membership queries are necessary for this algorithm to work. It is not known how
such circuits can be learned in poly (n) time without membership queries.
2.1.2 Decision Trees
Decision trees are an important tool in artificial intelligence. The main advantage of decision trees
over many other models used in artificial intelligence is their lucidity: a concept that
is presented as a decision tree is easily understood by humans.
A decision tree has a condition term on each of its internal nodes. Each leaf of the tree contains
one of the possible outputs. Once an input is presented to a tree, it is matched against the condition
at the root of the tree. If the condition is satisfied, we move to the left sub-tree otherwise we move
to the right sub-tree. We match the input against the root of the chosen sub-tree and depending on
the outcome we move either left or right. This process continues until we reach a leaf of the tree.
In that case we report the value at the leaf as the outcome of the tree calculation. See figure 2.2
for an illustration of a decision tree. For more about this subject see [106].
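The evaluation walk just described is naturally recursive. A minimal sketch (the Node layout is our own illustrative choice, not a structure from the algorithms cited below):

    class Node:
        # an internal node holds a condition and two sub-trees;
        # a leaf holds only an output value
        def __init__(self, condition=None, left=None, right=None, value=None):
            self.condition = condition
            self.left, self.right = left, right
            self.value = value

    def evaluate(tree, x):
        if tree.value is not None:       # reached a leaf: report its value
            return tree.value
        if tree.condition(x):            # condition satisfied: go left
            return evaluate(tree.left, x)
        return evaluate(tree.right, x)   # otherwise: go right

    # the function (x1 >= 0) AND (x2 >= 0) as a depth-two decision tree
    tree = Node(condition=lambda x: x[0] >= 0,
                left=Node(condition=lambda x: x[1] >= 0,
                          left=Node(value=+1), right=Node(value=-1)),
                right=Node(value=-1))
    print(evaluate(tree, (0.5, -2.0)))   # -> -1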
The problem of learning a decision tree is fundamental in artificial intelligence and machine
learning. The main algorithms for learning decision trees include ID3 [101], C4.5 [102], and CART
[21]. However, all these algorithms fail to meet PAC requirements. Even in a noise free environment, these algorithms can learn a tree which is exponentially bigger than the smallest possible
tree [64]. This is a major bottleneck in the theoretical analysis of these algorithms. Alternative
algorithms have been designed for learning decision trees with accompanying theoretical analysis.
These algorithms include an algorithm by Kushilevitz and Mansour [71] which is based on Fourier
analysis, and an algorithm by Bshouty [23] which is based on monotonicity. These algorithms use
membership queries, which enables them to have performance guarantees.
Figure 2.2: An illustration of a decision tree. The decision tree in this illustration has six
internal nodes, each associated with a condition C1, . . . , C6. It has seven leaves, each
associated with either TRUE (+1) or FALSE (−1).
2.1.3 Intersections of Halfspaces
Intersections of halfspaces, or polytopes, are very interesting geometric objects. Learning such
objects is interesting both with regard to their geometric representation and because of the reduction of
learning DNF formulas to this problem (see e.g. [66] for the reduction of Boolean formulas to
geometric concepts).
State of the art algorithms for learning intersections of halfspaces [68] require O( n (t/ρ)^{t log(t log(1/ρ))} )
instances, where n is the dimension, t is the number of halfspaces and ρ is a margin term. However,
once membership queries are allowed, Kwek and Pitt [72] showed that poly(n, t) instances suffice
for learning in this setting.
2.2 The Limitations of Membership Queries
Membership queries have provided machine learning theorists with a phenomenal tool for proposing
algorithms and for analysis. However, when it comes to most real world problems,
membership queries fall short. This failure has been demonstrated by Lang and Baum in their
paper entitled “Query Learning can Work Poorly When a Human Oracle is Used ” [73]. In this
paper, the authors tried to apply an algorithm for learning 2 layer neural networks presented earlier
by Baum [12]. The main idea behind this algorithm is to take two instances with alternating labels
and use membership queries on instances that are on a path connecting these instances. By doing
so, the algorithm can find the exact transition point where the label changes.

Figure 2.3: Handwritten character recognition using membership queries [73]. The
lower left and right corners are images of the figures “7” and “5”. The rest of the images
represent combinations of these two figures. Note that some of these images are neither “7” nor
“5”. Some of them do not look like any figure.
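The core step of this idea can be sketched as a binary search along the segment between a negatively and a positively labeled instance, querying the membership oracle at the midpoint. This is a hedged sketch of the idea, not Baum's exact procedure:

    import numpy as np

    def find_transition(x_neg, x_pos, oracle, tol=1e-6):
        # binary-search the segment [x_neg, x_pos] for the point where
        # the teacher's label flips, using membership queries
        lo, hi = 0.0, 1.0                       # parametrize the segment
        while hi - lo > tol:
            mid = (lo + hi) / 2
            x_mid = (1 - mid) * x_neg + mid * x_pos
            if oracle(x_mid) == 1:              # query the teacher
                hi = mid                        # boundary lies below mid
            else:
                lo = mid
        return (1 - hi) * x_neg + hi * x_pos

    oracle = lambda x: 1 if x.sum() >= 1 else -1
    x = find_transition(np.zeros(2), np.ones(2), oracle)
    print(x)  # close to the boundary x1 + x2 = 1, i.e. about [0.5, 0.5]

Note that the midpoints x_mid are exactly the synthetic queries which, when the instances are bitmaps of digits, produced the unanswerable images of Figure 2.3.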
Lang and Baum [73] tried to apply Baum’s algorithm [12] to the task of recognizing handwritten
digits. In this task, a bitmap that is a digital representation of a handwritten character needs to be
identified as one of the digits 0-9. The authors expected that the novel learning algorithm would
generate extremely accurate hypotheses by identifying the exact boundaries between the different
digits. Unexpectedly, the experiment failed. The cause of this failure was that for many of the
queries the algorithm generated, the teacher could not provide any answer. Figure 2.3 presents a
demonstration of this problem. Two images of the figures “7” and “5” were used to generate a
handful of queries for images which are combinations of the original images. However, many of
these queries are neither “7” nor “5”. Some do not resemble any figure at all. This led Lang and
Baum to the conclusion that query learning can work poorly when a human oracle is used, as the
title of their paper suggests.
The reason for this failure lies in the fact that not all images are valid representations of handwritten
figures. The computer views such an image as an array of numbers which represents gray
levels. However, most of these arrays do not represent any figure at all. This phenomenon is not
unique to the problem of handwritten digit recognition. On the contrary, we expect this to occur in
most applications in which the oracle is human. Consider for example the task of medical diagnosis.
In this case, when a computer generates medical files and lab results, it will most likely lack the
consistency of medical files and lab results of human beings. Moreover, a physician who needs to
“label” these instances may need to see the patient to conduct other medical examinations, but
since such a patient does not exist, the whole process will fail.
Another limitation of membership queries is the fact that there are problems in which even
membership queries will not allow us to learn in a reasonable time. [5] showed that under common
cryptographic assumptions² there are problems with a finite VC dimension but no polynomial learning
algorithm.
2.3 Summary
Membership queries are a powerful tool for the analysis and development of machine learning.
They have inspired many authors and provide a way to evaluate the limits of learning algorithms.
However, when it comes to real world applications, membership queries usually fall short.
This said, there are vital tasks to which membership queries can be applied. Recently, learning
algorithms which use membership queries have been applied successfully to verification problems
[44]. In these cases, the teacher (oracle) is not a human being but rather a machine itself. This
makes it possible to overcome the problem of using membership queries with human oracles. In
the next chapter we present a method of overcoming noise while learning. This method uses
membership queries in its core. It enables us to study fundamental problems in learning in the
presence of noise.
² It suffices to assume that there is a one-way function, i.e. a one-to-one function that is easy to
compute but hard to invert.
Chapter 3
Noise Tolerant Learnability using Dual Representation
Much of the research in machine learning and neural computation assumes the existence of a perfect
teacher, one who gives the correct answers to the learning algorithm. However, in many cases this
assumption is faulty, since different sorts of noise may prevent the teacher from providing the correct
answers. This noise can be caused by noisy communication, human errors, measuring equipment
and many other distortion sources. In some cases, problems which are efficiently learnable without
noise become hard to learn when noise is introduced [11]. In other cases, it is possible to learn
efficiently even in the presence of noise (see e.g. [63]). However, no simple parameters are known
to distinguish between classes that are learnable in the presence of noise and those which become
hard to learn.
In this chapter we introduce a noise cleaning procedure. Our procedure is capable of generating
a clean sample even when the data source is corrupted with noise. In order to generate the noise
free sample we exploit the structure of the dual learning problem. In the dual learning problem the
teacher has an instance in mind and the goal of the learner is to approximate it by having access
to the labels several classifiers assign to it. For any instance whose label we would like to query,
we generate an approximation set consisting of many instances which are close to it. We query
for the labels of the instances in the approximation set, assuming we have access to a Membership
Query oracle, and use a majority vote to label the instance we are interested in.
In the study below we show that the noise cleaning procedure works as long as the dual
learning problem is learnable and dense. Thus any learning problem, for which these criteria hold,
can be learned efficiently in the presence of noise. We show that these assumptions are valid
for a variety of learning problems, such as smooth functions, general geometrical concepts, and
monotone monomials. We are particularly interested in the analysis of smooth function classes.
We show that there is a uniform upper bound on the fat-shattering dimension of both the primal
and dual learning problems, derived from a geometric property of the class called type. We
also show how the dual learning problem is related to the dual Banach space which is an important
tool in functional analysis.
The work presented in this chapter is based on joint research with Shai Fine, Shahar Mendelson
and Naftali Tishby.
3.1 Learning in the presence of noise
In many learning models (e.g. PAC) it is assumed that the learner has access to an oracle (a teacher)
which picks an instance from the sample space and returns this instance and its correct label. In
real world problems, the existence of such an oracle is doubtful: human mistakes, communication
errors and various other problems make such an oracle unfeasible. In these cases, a realistic
assumption is that the learner has access to some sort of a noisy oracle. This oracle may make
internal mistakes which prevent it from consistently generating the correct labels, or which even
influence its ability to randomly select points from the sample space.
The VC dimension [128, 17] completely characterizes PAC learnability of Boolean functions in
terms of the size of the sample needed. A class with a finite VC dimension can be learned from a
finite sample whose size depends on the VC dimension, the accuracy and the required confidence
level. However, the computational complexity of learning is not controlled by the VC dimension.
In fact there are classes with a finite VC dimension for which learning is NP-complete [65].
Things become even more complicated when we no longer assume the existence of a perfect
oracle, i.e. a noise free oracle. We weaken this assumption: the true label is obtained only with a
probability of 1 − η. We further assume that the oracle is consistent in the sense that if the label
of x was requested twice, then the oracle will produce the same result. This model is called the
“persistent random classification noise” model [58].
Classes with a finite VC dimension are learnable in the presence of persistent random classification noise in terms of the sample size needed (see e.g. [6] chapter 4). However, the computational
complexity of the learning task can change dramatically¹. If learning in the noise free case is
unfeasible, then it will remain so in the noisy case. However, there are cases in which the noise
free problem is efficiently learnable, while learning in the noisy environment is unfeasible [11, 13].
The gap between the noise-free case and the noisy case appears not only in the PAC model, but
also occurs in other models such as the online learning model [82].
Here we present a procedure which converts noisy oracles to noise-free oracles. In order for
our procedure to work, the dual learning problem needs to be learnable and dense. These criteria
characterize learning problems which are efficiently learnable in the presence of noise. The
procedure we introduce is fairly simple. Given an instance for which the learner would like to know its
label, we generate an approximation set. This set consists of instances for which we have reason
to believe the target concept assigns the same label as it assigns to the instance we are interested
in. We use the majority vote among the labels of the instances in the approximation set to deduce
the label of the instance we are interested in.
The approximation set is generated by using the dual learning problem. In the dual learning
problem the instances and hypotheses switch roles. We learn an instance by looking at the labels
different hypotheses assign to it. We need to be able to learn efficiently in the dual learning
problem, and we need the dual learning problem to be dense for this scheme to work. For this
purpose, we need to work in a Bayesian setting in which there is a known probability measure over
the concept class from which the target is chosen. When these conditions are present, noise can
be filtered out by a simple procedure, which makes learning in presence of noise possible.
More formally, the main result is the following: Let C be a concept class endowed with a
probability measure ν. Assume that the target concept c∗ ∈ C was selected according to ν. Further
assume that both the primal and dual learning problems are efficiently learnable in the noise free
model and that the dual learning algorithm is dense. Then the noisy oracle can be converted to a
noise-free oracle and learning in the presence of noise can take place.
3.2 The Dual Learning Problem
A learning problem may be characterized by a tuple ⟨X, C⟩ where X is the sample space and C is the
concept class. Learning can be viewed as a two player game: one player, the teacher, picks a target
concept, while his counterpart, the learner, tries to identify this concept. Different learning models
differ in the way the learner interacts with the teacher (PAC Sampling Oracle, Statistical Queries,
Membership Queries, Equivalence Queries, etc.) and the method used to evaluate performance.
¹ When the noise is non-persistent, there is no complexity gap between learning with or without
noise if membership queries are allowed [107]. However, in many cases the noise is indeed persistent.
Every learning problem has a dual learning problem [99] which may be characterized by the
tuple ⟨C, X⟩. In this representation the learning game is reversed: first the teacher chooses an
instance x ∈ X and then the learner tries to approximate this instance by querying the value x
assigns to different concepts. We view an instance x as the evaluation function δ_x on C such that
δ_x(c) = c(x). We denote by X** the set of these evaluation functions:

X** = {δ_x : x ∈ X}
To clarify this notion we present two dual learning problems:
• Let X be the interval [0, 1] and set C to be the class of all intervals [0, a] where a ∈ [0, 1]. If
x is an instance and ca is the hypothesis [0, a] then ca (x) = 1 if and only if a ≥ x. Turning
to the dual learning problem, note that δx (ca ) = 1 if and only if x ≤ a. Hence, the dual
learning problem is equivalent to learning intervals of the form [x, 1] where x ∈ [0, 1].
• Let X = IRⁿ and let C be the class of linear separators, i.e., c_w ∈ C is the concept which
assigns to each x ∈ IRⁿ the label sign(x · w). The dual learning problem thus becomes a
problem of learning linear separators and hence this problem is dual to itself.
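To make the duality concrete, the following is a minimal sketch in Python of the dual game for
the interval class above (all names here are hypothetical, chosen for illustration): the teacher
fixes an instance x, the learner draws concepts c_a from a uniform prior over C, observes the
labels δ_x(c_a) = c_a(x), and returns a point in the remaining consistent bracket.

    import random

    def dual_learn_interval(x, num_concepts=1000, seed=0):
        """Approximate a hidden instance x in [0, 1] from the labels it assigns
        to randomly drawn concepts c_a = [0, a], where delta_x(c_a) = 1 iff x <= a."""
        rng = random.Random(seed)
        lo, hi = 0.0, 1.0                # current bracket on x
        for _ in range(num_concepts):
            a = rng.random()             # draw a concept [0, a] from the uniform prior
            if x <= a:                   # the label delta_x assigns to c_a is 1
                hi = min(hi, a)          # x lies below a: shrink the bracket from above
            else:                        # the label is 0
                lo = max(lo, a)          # x lies above a: shrink the bracket from below
        return (lo + hi) / 2.0           # any point in the bracket is consistent

    # Example: recover x = 0.3 up to the resolution of the sampled concepts.
    print(dual_learn_interval(0.3))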
The VC-dimension of the dual learning problem (also called the co-VC dimension) obeys the
inequalities

⌊log₂ d⌋ ≤ d* ≤ 2^{d+1}        (3.1)

where d is the VC-dimension of the primal problem and d* is the co-VC dimension (see lemma 3.1 on page 35).
As can be seen, the gap between the complexities of the primal and dual learning problem can
be exponential. However, in both examples presented here and in most realistic cases this gap is
only polynomial. Therefore, our assumption that the dual learning problem is efficiently learnable
holds in many if not most of the interesting cases (see Troyansky’s thesis [123] for a survey of dual
representation). In section 3.6 we broaden the discussion to handle regression problems in which
the concepts assign a real value to each instance rather than a Boolean value as in the classification
case. Thus, we replace the notion of VC-dimension with the fat-shattering dimension. We show
that for classes which consist of sufficiently smooth functions, both the fat-shattering and the
co-fat-shattering dimensions have an upper bound which is polynomial in the learning parameters,
which enables efficient learning of the dual problem.
3.3 Dense Learning Problems
A learning problem is dense [42, 113] if every hypothesis has many hypotheses which are close but
not equal to it:
Definition 3.1 Let X be a sample space, C a concept class and D a distribution over the instances.
The learning problem defined by the triplet ⟨X, C, D⟩ is dense if for every c ∈ C and every ε > 0
there exists c′ ∈ C such that

0 < Pr_{x∼D}[c(x) ≠ c′(x)] < ε
The density property is distribution dependent: for every learning problem there exists a distribution
under which the resulting learning problem is not dense. In fact, if the distribution is supported
on a finite set, the problem cannot be dense. Indeed, if a learning problem is dense, every hypothesis
can be approximated by an infinite number of distinct hypotheses; thus finite concept classes are
not dense according to definition 3.1.
We would like to extend the definition of a dense learning problem to finite hypothesis classes
as well. We replace the demand of approximating c for every ε > 0 by the same demand for every
ε that is inverse-polynomial in n. In the definition above, the requirement is that each hypothesis
can be approximated by an infinite number of hypotheses; in the finite case we replace the infinity
assumption by a super-polynomial number of approximating hypotheses.
Definition 3.2 Let X_n be a sample space, C_n a concept class and D_n a distribution over the
instances. The sequence of learning problems {⟨X_n, C_n, D_n⟩}_{n=1}^∞ is dense if for every
polynomial p(n) there exists N such that for every n > N and every c ∈ C_n there exists c′ ∈ C_n
such that

0 < Pr_{x∼D_n}[c(x) ≠ c′(x)] < 1/p(n)
In a dense class, every hypothesis can be approximated. Nevertheless, even in dense classes,
a learning algorithm might not use the diversity of the class. Therefore, the definition of density
should be extended to include properties of the algorithm being used:
Definition 3.3 Let X be a sample space, C a concept class and D a distribution over the instances.
The learning algorithm L : (X × Y)* → C is dense with respect to D if for every m > 0 and c ∼ C,

Pr_{S_1∼µ^m, S_2∼µ^m}[ Pr_{x∼D}[L(S_1)(x) ≠ L(S_2)(x)] = 0 ] = 0

where µ = µ(D, c) is the distribution induced by D and c on X × Y.
Definition 3.3 applies to infinite learning problems. As before, we extend it to finite cases:
Definition 3.4 Let X_n be a sample space, C_n a concept class and D_n a distribution over the
instances. The sequence of learning algorithms L_n : (X_n × Y)* → C_n is dense with respect to
{D_n}_{n=1}^∞ if for every polynomial p(n) and every m there exists N such that for every n > N
and every c ∈ C_n,

Pr_{S_1∼µ^m, S_2∼µ^m}[ Pr_{x∼D_n}[L_n(S_1)(x) ≠ L_n(S_2)(x)] = 0 ] < 1/p(n)

where µ = µ(D_n, c) is the distribution induced by D_n and c on X_n × Y.
3.4 Noise Immunity Scheme
We now present our noise immunity scheme. This scheme immunizes any learning problem against
noise, provided that its dual learning problem is learnable and dense. The main idea is to generate
a noise free oracle and use it for learning. Let x be an
instance whose label we would like to know. Since we have access to a noisy oracle, querying for
the label of x does not produce good enough results. Furthermore, since the noise is persistent,
repeated sampling of the label will not provide any additional information. However, if there are
enough instances in the sample space to which the target concept assigns the same label as it
assigns to x, we can sample the labels of these instances and use majority vote to deduce the
label of x. The problem is to identify these instances. For this purpose, we use the dual learning
problem. The requirements of learnability and density ensure that with high probability, for any
instance x, the dual learning algorithm will find many instances x′ such that almost all concepts
assign them the same label. Since the learner of the primal learning problem knows the
dual target x and the probability measure on C (the Bayesian assumption), it can provide a clean
sample to the dual learning problem. Hence, the dual learning problem is noise free and therefore
far easier. The algorithm is detailed in algorithm 1.
In the following theorem we prove the efficiency of the noise cleaning algorithm.
Theorem 3.1 Assume that the dual learning algorithm is dense and has PAC guarantees. With
probability 1 − δ the noise cleaning algorithm (algorithm 1) returns the correct label of x. The
computational complexity of the algorithm is poly(d*, log(1/δ), 1/|1 − 2η|).
As stated in theorem 3.1, the noise cleaning algorithm is polynomial in its parameters, and with
high probability it returns the true label, which can then be used to learn the original learning
problem.
Algorithm 1 Noise Cleaning Algorithm

Inputs:
• Confidence parameter 1 − δ.
• The VC dimension d* of the dual learning problem.
• A bound η on the noise level.
• An instance x.

Output:
• A label y of x.

Algorithm:
1. By simulating the dual learning problem k = (2/(1 − 2η)²) ln(2/δ) times, generate an ensemble
S = {x_1, ..., x_k} by applying the approx function (see below) to the point x with accuracy 1/4
and confidence δ/(2k).
2. Use MQ to get a label y_i for each x_i.
3. Let y be the majority vote over the y_i's.

Function approx

Inputs:
• A point x.
• Required accuracy ε̂.
• Required confidence 1 − δ̂.

Output:
• A point x′.

Algorithm:
1. Let m = poly(1/ε̂, log(1/δ̂), d*).
2. Using the prior ν over the concept class C, generate a sample c_1, ..., c_m.
3. Assign to every c_i the label c_i(x).
4. Apply the learning algorithm of the dual learning problem to the labeled sample generated in
steps 2 and 3 to generate x′.
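The control flow of Algorithm 1 can be rendered as a short Python sketch. This is only an
illustration under stated assumptions: the dual learner is abstracted as a callable approx (as in
the pseudocode above), noisy_mq stands for a persistent noisy membership oracle, and all names
are hypothetical.

    import math
    import random

    def noise_cleaning(x, approx, noisy_mq, eta, delta):
        """Return a denoised label for x: build an approximation set with the
        dual learner, query each point once, and take a majority vote."""
        k = max(1, math.ceil(2.0 / (1.0 - 2.0 * eta) ** 2 * math.log(2.0 / delta)))
        ensemble = [approx(x, 0.25, delta / (2.0 * k)) for _ in range(k)]
        votes = [noisy_mq(z) for z in ensemble]   # one (possibly corrupted) label each
        return 1 if 2 * sum(votes) > len(votes) else 0

    # Toy usage: the target concept is the interval [0, 0.5]; labels are flipped
    # with rate 0.2, persistently (repeated queries give the same answer).
    rng = random.Random(0)
    target = lambda z: 1 if z <= 0.5 else 0
    flips = {}
    def noisy_mq(z):
        if z not in flips:
            flips[z] = target(z) if rng.random() > 0.2 else 1 - target(z)
        return flips[z]
    approx = lambda x, eps, conf: min(1.0, max(0.0, x + rng.uniform(-0.01, 0.01)))
    print(noise_cleaning(0.3, approx, noisy_mq, eta=0.2, delta=0.05))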
Proof (of theorem 3.1). We begin by showing that the function approx, when called with ε̂, δ̂
and x, returns x′ such that with probability 1 − δ̂, Pr_{c∼ν}[c(x) ≠ c(x′)] < ε̂. This is simply
the definition of learning in the dual learning problem. Note also that by the assumption that the
dual learning algorithm is dense (see Definitions 3.3 and 3.4)², it follows that x′ ≠ x with
probability 1.
By the choice of parameters made by the noise cleaning algorithm, it follows that with probability
1 − δ/2, for every choice of 1 ≤ i ≤ k the instance x_i is indeed a good approximation of x, in the
sense that

Pr_{c∼ν}[c(x_i) ≠ c(x)] < 1/4        (3.2)
Assume that (3.2) holds for all i. When we query for the label of x_i the correct label is returned
with probability 1 − η, independently of whether c(x_i) = c(x) or c(x_i) ≠ c(x). Hence, with
probability greater than

(3/4)(1 − η) + (1/4)η = 3/4 − η/2

we will obtain the correct label of x.
Using Hoeffding's inequality and the fact that all the x_i are chosen independently, we see that the
probability that the majority vote fails to predict the label of x is smaller than e^{−2(1/4 − η/2)² k},
and by the choice of k this is smaller than δ/2.
The procedure presented can fail in two cases: the x_i's generated do not form a good approximation
set or, alternatively, too many of the labels of the x_i's are corrupted. Each of these events occurs
with probability less than δ/2, and thus the whole process succeeds with probability 1 − δ.
The computational complexity of the algorithm follows easily from the definition of the algorithm.
In the following sections we present a variety of problems to which the paradigm just presented
is applicable. We demonstrate it both on continuous classes such as neural networks and finite
classes such as monotone monomials. However, there are classes to which the paradigm cannot be
applied. For example, consider the class of parity functions. Although this class is dual to itself,
and thus has a moderate co-VC dimension, it is not dense and therefore fails to meet the requirements.
² If the learning problem is finite, then this holds for large enough n as defined in definition 3.4.
3.5 A Few Examples
In this section we discuss a few classes which have the properties required by theorem 3.1, so that
the noise cleaning algorithm can be applied to them.
3.5.1 Monotone Monomials
The first problem we present is a Boolean learning problem of a discrete nature: the problem
of learning monotone monomials. In this problem the sample space is {0, 1}^n and the concepts
are conjunctions (e.g. x(1) ∧ x(3) ∧ x(4)). The hypothesis class is C = {c_I : I ⊆ {1, ..., n}}
where c_I(x) = ∧_{i∈I} x(i). To simplify the notation we identify x with a subset of {1, ..., n}
and c with a subset of {1, ..., n}, so that c(x) = 1 ⟺ c ⊆ x. The dual learning problem for this
class is learning monotone monomials with the reverse order, i.e., x(c) = 1 ⟺ x ⊇ c. Both the
primal and the dual learning problems have the same VC-dimension, n.
Instead of showing that the dual class is dense, we give a direct argument showing that the label
of each instance can be approximated. Let Z_x = {z : x ⊆ z ⊆ {1, ..., n}}. Since the concept
class is monotone, if c(x) = 1 then for every z ∈ Z_x, c(z) = 1. On the other hand, if c(x) = 0
then there exists some i ∈ c \ x; half of the instances in Z_x have i ∉ z, implying that c(z) = 0
for each such z. Thus, Pr_{z∈Z_x}[c(z) = 0] ≥ 1/2 with respect to the uniform distribution on Z_x.
Hence, if c(x) = 1 then Pr_{z∈Z_x}[c(z) = 0] = 0, whereas if c(x) = 0 then Pr_{z∈Z_x}[c(z) = 0] ≥ 1/2.
This allows us to distinguish between the two cases. In order to do the same in the presence of
noise we have to require that Z_x is big enough. From the definition of Z_x it follows that
|Z_x| = 2^{n−|x|}. It suffices to require that |x| ≤ pn for some p < 1 with high probability, since
in this case |Z_x| is exponentially large. This condition holds with high probability under the
uniform distribution and many other distributions.
Note that in this case there is no need for a Bayesian assumption, i.e., we do not assume the
existence of a distribution on the concept class. Moreover, the dual learning problem reduces in
this case to a simple sampling procedure over Z_x. However, we have used a slightly relaxed definition
of density in which for most of the instances there exists a sufficient number of approximating
instances.
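This argument translates directly into code. A minimal Python sketch (hypothetical names; the
threshold assumes a noise rate well below 1/4): sample random supersets z ⊇ x and test whether a
noticeable fraction of them is labeled 0.

    import random

    def denoise_monomial_label(x, n, noisy_mq, num_samples=200, seed=0):
        """x is a subset of {1,...,n}; noisy_mq(z) returns a noisy value of
        c(z), where c(z) = 1 iff the hidden monomial c is a subset of z."""
        rng = random.Random(seed)
        free = [i for i in range(1, n + 1) if i not in x]
        zeros = 0
        for _ in range(num_samples):
            z = frozenset(x) | {i for i in free if rng.random() < 0.5}  # uniform over Z_x
            zeros += 1 - noisy_mq(z)
        # If c(x) = 1, supersets of x are labeled 0 only by noise (rate ~ eta);
        # if c(x) = 0, at least ~1/2 of them are truly labeled 0. Threshold between.
        return 0 if zeros / num_samples > 0.3 else 1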
3.5.2 Geometric Concepts
In contrast to the previous example, when dealing with geometric concepts, the sample space and
the concept class are both infinite. For the sake of simplicity let the sample space be IR2 and assume
that the concept class consists of axis aligned rectangles. In this case, the VC-dimension of the
primal problem is 4 and the dual problem has VC-dimension 3. Moreover if a “smooth” probability
measure is defined on the concept class, it is easily seen that each instance is approximated by all
the instances within distance r of it (where r depends on the defined measure and the learning
parameters). Therefore, this class is dense.
This example can be extended to a large variety of problems, such as neural networks, general
geometric concepts [24] and high dimensional rectangles. We describe two methods of doing so
below.
Using the Bayesian Assumption: The first method uses the Bayesian assumption. Each
geometric concept divides the instance space into two sets. The edge of these sets is the decision
boundary of the concept. Assume that for every instance there is a ball around it which does not
intersect the decision boundary of “most” of the concepts. Denote by ν a probability measure on
C, and assume that for every δ > 0 there exists r = r(δ, x) > 0 such that

Pr_{c∼ν}[B(x, r) ∩ ∂c ≠ ∅] < δ        (3.3)

(B(x, r) is the ball of radius r centered at x and ∂c is the decision boundary of c). If (3.3) holds
then all the points in B(x, r) can be used to predict the label of x, and therefore to verify the
correct label of x.
Geometric Concepts without Bayesian Assumption: A slightly different approach can be
used when there is no Bayesian assumption but the distribution over the sample space is
non-singular. Given δ > 0, for every concept c there exists a distance r_c > 0 such that the
measure of all points within distance r_c of the decision boundary of c does not exceed δ. If
0 < r = inf_{c∈C} r_c, then with high probability (over the instance space) a random point x is at
distance greater than r from the decision boundary of the target concept c. Hence, the ball of
radius r around x can be used to select approximating instances.
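Both variants reduce to the same sampling step, sketched below in Python under the assumption
that a suitable radius r is already known (from (3.3) or from the infimum above; the names are
hypothetical):

    import math
    import random

    def ball_vote(x, r, noisy_mq, k=101, seed=0):
        """Denoise the label of x in the plane by majority vote over k points
        drawn uniformly from the ball B(x, r), which is assumed to avoid the
        decision boundary of most concepts."""
        rng = random.Random(seed)
        votes = 0
        for _ in range(k):
            theta = rng.uniform(0.0, 2.0 * math.pi)
            rho = r * math.sqrt(rng.random())   # sqrt makes the point uniform in the disk
            z = (x[0] + rho * math.cos(theta), x[1] + rho * math.sin(theta))
            votes += noisy_mq(z)
        return 1 if 2 * votes > k else 0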
3.5.3 Periodic Functions
In this example we present a case where the approximating instances are not in the near
neighborhood of x. Let X = IR and set

C = { x ↦ sign(sin(2πx/p)) : p is prime }
Since the number of primes is countable, the probability measure on C is induced via a measure
on IN. Note that C consists of periodic functions, but for each function, the period is different.
Given a point x ∈ IR, and a confidence parameter δ, there is a finite set of concepts A, such that
ν (A) ≥ 1 − δ. Since the set A is finite, the elements of A have a common period. Therefore, there
is some t, such that for every c ∈ A and every m ∈ IN, c (x) = c (x + mt). It is reasonable to
assume that the noise in the primal learning problem is not periodic (because the elements of the
class do not have a common period), therefore, it is possible to find many points which agree with
x with high probability, but are far away from a metric point of view. Moreover, using the same
idea, given any sample c_1, ..., c_k ∈ C, it is possible to construct an infinite number of points x_i
which agree with x on the given sample.
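A small sketch of this construction in Python, under the assumptions of the example (the set A of
primes carrying prior mass at least 1 − δ is taken as given; names are hypothetical):

    from math import prod

    def approximating_points(x, primes_A, num_points=10):
        """Points far from x that nevertheless agree with x on every concept
        sign(sin(2*pi*x/p)) with p in the finite, high-prior-mass set A."""
        t = prod(primes_A)              # a common period of all concepts in A
        return [x + m * t for m in range(1, num_points + 1)]

    print(approximating_points(0.7, [2, 3, 5]))   # 30.7, 60.7, 90.7, ...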
3.6 Regression
So far in this discussion, we have focused on binary classification problems. In this section, we
extend our discussion to regression, where the target concept is a continuous function.
For every given x we attempt to find “many” instances xi , such that with a high probability
f (xi ) is “almost” f (x). When the concepts are continuous functions, it is natural to search for
the desired xi near x. However, if there is no a-priori bound on the modulus of continuity of the
concepts, it is not obvious when xi is “close enough” to x. Moreover, in certain examples the
desired instances are not necessarily close to x, but may be found “far away” from it (e.g. periodic
functions as presented in section 3.5.3).
Algorithm 1 needs to be adjusted for the regression setting. We present this modified algorithm
in algorithm 2. The following theorem proves the correctness of this algorithm.
Theorem 3.2 Assume that the dual learning problem is dense. Assume also that the learning
problem is bounded, i.e. ∀c, x : |c(x)| ≤ 1. With probability 1 − δ the noise cleaning algorithm for
regression (algorithm 2) will return y such that |y − c(x)| < ε.
Proof. The proof is very similar to the proof of theorem 3.1. We begin by showing that the
function approx, when called with ε̂, δ̂ and x, returns x′ such that with probability 1 − δ̂ the
L1(ν) distance between δ_x and δ_{x′} is smaller than ε̂. This is simply the definition of learning
in the dual learning problem. Note also that by the assumption that the dual learning problem is
dense, it follows that x′ ≠ x with probability 1.
By the choice of parameters made by the noise cleaning algorithm, it follows that with probability
1 − δ/3, for every choice of 1 ≤ i ≤ k the instance x_i is indeed a good approximation of x, in the
sense that

‖δ_x − δ_{x_i}‖_{L1(ν)} < ε̂        (3.4)
Assume that (3.4) holds for all i. Therefore, for every γ > 0,

Pr_{c′∼ν}[ |c′(x) − c′(x_i)| > ε̂/γ ] < γ
Thus, using the parameters in the algorithm and applying the union bound³, we obtain that with
probability 1 − (2/3)δ (over the choice of the target concept and the internal randomization of the
function approx), the following property holds:

∀i : |c(x) − c(x_i)| ≤ ε        (3.5)
Assuming that (3.5) holds, we obtain that for each i,

Pr[y_i = c(x_i)] ≥ 1 − η

despite the noise. It suffices that more than half of the values y_i are correct, as this guarantees
that the median is not more than ε away from the true value (due to (3.5)). Using Hoeffding's
inequality and the fact that all the x_i are chosen independently, we see that the probability that
the median fails to be ε-close to c(x) is smaller than e^{−2((1−2η)/2)² k}, and by the choice of k
this is smaller than δ/3. This completes the proof.
3.6.1 Estimating VC_ε(C, X**)
In general, the question of learnability of the dual problem may be divided into two parts. The
first is to construct a learning algorithm L which assigns to each sample S_m = {f_1, ..., f_m} a
point x′ such that for every f_i, |f_i(x) − f_i(x′)| < ε. The second part is to show that the class
of functions X** = {δ_x : x ∈ X} on C satisfies some compactness condition (e.g., a finite
fat-shattering dimension VC_ε(C, X**)). We provided an answer to the first problem in the previous
section; we now address the second.

³ The union bound is very loose in this case. However, for the sake of brevity we use it here.

Algorithm 2 Noise Cleaning Algorithm for Regression

Inputs:
• Confidence parameter 1 − δ.
• The fat-shattering dimension d* of the dual learning problem.
• A bound η on the noise level.
• An instance x.
• Required accuracy ε.

Output:
• An approximation y of c(x).

Algorithm:
1. By simulating the dual learning problem k = (2/(1 − 2η)²) ln(3/δ) times, generate an ensemble
S = {x_1, ..., x_k} by applying the approx function (see below) to the point x with accuracy εδ/(3k)
and confidence δ/(3k).
2. Use MQ to get the value y_i of each x_i.
3. Let y be the median of the y_i's.

Function approx

Inputs:
• A point x.
• Required accuracy ε̂.
• Required confidence 1 − δ̂.

Output:
• A point x′.

Algorithm:
1. Let m = poly(1/ε̂, log(1/δ̂), d*).
2. Using the prior ν over the concept class C, generate a sample c_1, ..., c_m.
3. Assign to every c_i the value c_i(x).
4. Apply the learning algorithm of the dual learning problem to the labeled sample generated in
steps 2 and 3 to generate x′. When learning in the dual learning problem we require that with
confidence 1 − δ̂ the returned instance x′ is ε̂-close to x in the L1(ν) norm.
Let X ⊆ IR^d be infinite, and let (C, ‖·‖) be a subset of a Banach space (see section 3.8)
consisting of functions on X. Furthermore, assume that C has a reproducing kernel (see
definition 3.6 on page 36). In this case, the dual learning problem is always a linear learning
problem, since for any x ∈ X the functional δ_x(c) = c(x) is in C* and thus X** ⊆ C*.

We will show that if C is a bounded subset of a Banach space with a reproducing kernel and if
X** is a bounded subset of C*, then the fat-shattering dimension VC_ε(C, X**) is finite for every
ε > 0, provided that C has a non-trivial type, i.e. a type greater than 1.
Classical representatives of spaces with non-trivial type are the Sobolev spaces W^{k,p} (cf. [50]
for basic information regarding Sobolev spaces, or [1] for a comprehensive survey). For example,
W^{1,2}(0, 1) is the space of continuous functions f : [0, 1] → IR for which the derivative f′ belongs
to L2 with respect to the Lebesgue measure. The inner product in this space is defined by

⟨f, g⟩ = ∫₀¹ (fg + f′g′) dx
Mendelson [92] explored the relation between the type of a Banach space and the fat-shattering
dimension:
Theorem 3.3 (Theorem 1.5 in [92]) Let X be an infinite dimensional Banach space with type p.
The fat-shattering dimension VC_ε(B(X), B(X*)) is finite if and only if the type of X is greater
than 1. Furthermore, if p′ < p then there are constants K and κ such that

κ (1/ε)^{p/(p−1)} ≤ VC_ε(B(X), B(X*)) ≤ K (1/ε)^{p′/(p′−1)}
The following is a simplified version of theorem 3.3:
Corollary 3.1 Let C be a bounded subset of a Banach space of functions over an infinite set X .
Assume the Banach space has a non-trivial type, i.e. greater than 1. Assume further that the
evaluation functionals δ_x ∈ X** are uniformly bounded. Then VC_ε(X, C) < ∞ for every ε > 0.
Proof. X is bounded; hence w.l.o.g. we assume it is a subset of the unit ball of a Banach space X.
C is a bounded subset of the dual space; hence w.l.o.g. we assume C ⊆ B(X*). By our assumptions,
for every ε > 0,

VC_ε(X, C) ≤ VC_ε(B(X), B(X*)) < ∞
Corollary 3.1 provides a bound on the sample complexity of learning a problem based on the
type of the Banach space. If the type of the Banach space is non-trivial then the sample needed
for learning the problem is polynomial in the learning parameters. Moreover, if the Banach space
X has non-trivial type, then X* has a non-trivial type as well (see [98]); thus the dual learning
problem has polynomial
sample complexity as well. Note that in both cases, i.e. the complexity of the primal and the dual
learning problems, the fact that the spaces are bounded is essential.
The computational complexity of these learning problems is domain specific. However, Mendelson [92] showed that learning subsets of Hilbert spaces with reproducing kernels can be done
efficiently.
Finally, we turn to an examination of the density of the dual learning problem. Let x ∈ X
be an instance and c1 , . . . , cm be any finite set of concepts. In most cases of interest, there are
infinitely many x′ ∈ X such that ∀1 ≤ i ≤ m ci (x) = ci (x′ ). Hence, the problem is naturally
dense.
3.7 VC Dimension of Dual Learning Problems
For the sake of completeness we present here the following lemma:
Lemma 3.1 Let C be a concept class defined over the sample space X. For every x ∈ X define the
function δ_x by δ_x(c) = c(x), and let X* = {δ_x : x ∈ X} be a concept class defined over C.
Finally, let d be the VC dimension of C and d* the VC dimension of X*. Then

⌊log₂ d⌋ ≤ d* ≤ 2^{d+1}
Proof. Let x_0, ..., x_{m−1} be a sample shattered by C and let c_0, ..., c_{2^m−1} be concepts in
C which shatter this sample: for every choice of ȳ ∈ {0, 1}^m there is 0 ≤ i < 2^m such that c_i
assigns the labels ȳ to x_0, ..., x_{m−1}.

Consider the m × ⌊log₂ m⌋ table T whose j'th row is simply the binary representation⁴ of j. For
each column ȳ of T there exists 0 ≤ i < 2^m such that c_i assigns the labels ȳ to x_0, ..., x_{m−1};
w.l.o.g. assume that c_0, ..., c_{⌊log₂ m⌋−1} generate the labelings described by the columns of T.
Since every binary string of length ⌊log₂ m⌋ appears as a row of T, the instances x_0, ..., x_{m−1}
shatter c_0, ..., c_{⌊log₂ m⌋−1}, and hence d* ≥ ⌊log₂ m⌋ for any m ≤ d; thus d* ≥ ⌊log₂ d⌋.

⁴ For the sake of this lemma it simplifies the notation to assume that the concepts assign the
values {0, 1} rather than {±1} to the instances.
By switching the roles of the instances and the concepts and applying the same argument we obtain
d ≥ ⌊log₂ d*⌋, and thus d* ≤ 2^{d+1}.
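The bounds of Lemma 3.1 can be verified by brute force on small finite classes. The following
Python sketch computes the VC dimension by exhaustive shattering (feasible only for tiny X and C)
and applies it to both a primal class and its dual:

    from itertools import combinations

    def vc_dim(points, functions):
        """Size of the largest subset of `points` shattered by `functions`
        (each function maps a point to 0 or 1)."""
        best = 0
        for size in range(1, len(points) + 1):
            for subset in combinations(points, size):
                patterns = {tuple(f(p) for p in subset) for f in functions}
                if len(patterns) == 2 ** size:
                    best = size
                    break               # some subset of this size is shattered
            else:
                return best             # no subset of this size is shattered
        return best

    # Example: threshold concepts on {0, 1, 2, 3}; the dual swaps the roles.
    X = [0, 1, 2, 3]
    C = [lambda x, a=a: 1 if x >= a else 0 for a in range(5)]
    d = vc_dim(X, C)                                       # primal VC dimension
    d_star = vc_dim(C, [lambda c, x=x: c(x) for x in X])   # co-VC dimension
    print(d, d_star)   # the lemma guarantees log2(d) <= d_star <= 2**(d + 1)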
3.8 Banach Spaces
This section provides a brief introduction to Banach spaces.
Definition 3.5 A space X endowed with a norm ‖·‖ is a Banach space if it is complete with respect
to the distance measure d(x_1, x_2) = ‖x_1 − x_2‖.
A Banach space has a dual space which consists of all continuous linear functionals over X. The
dual space is denoted by X* and is itself a Banach space under the norm ‖x*‖ = sup_{‖x‖=1} |x*(x)|.
Any Banach space is naturally embedded in its second dual space via the duality map x ↦ δ_x
given by δ_x(x*) = x*(x).
Definition 3.6 Let X be a Banach space consisting of functions over some space Ω. We say that
X has a reproducing kernel if for every ω ∈ Ω the evaluation functional δ_ω is norm continuous,
i.e. for every ω ∈ Ω there exists some κ_ω such that

|δ_ω(f)| = |f(ω)| ≤ κ_ω ‖f‖
Another important property of Banach spaces is the type of the Banach space.
Definition 3.7 A Banach space X has type p if there is some constant κ such that for every
x_1, ..., x_n ∈ X,

E_{σ_1,...,σ_n}[ ‖ Σ_i σ_i x_i ‖ ] ≤ κ ( Σ_i ‖x_i‖^p )^{1/p}        (3.6)

where the σ_i are i.i.d. random variables taking the values +1 and −1 with probability 1/2 each.
It follows that the type of a Banach space is always in the range [1, 2]. If the space is a
Hilbert space, then its type is exactly 2. The basic facts concerning the concept of type may be
found, for example, in [80] or in [98].
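As a concrete instance of definition 3.7, the claim that a Hilbert space has type 2 (with κ = 1)
follows from a one-line computation using the orthogonality of the random signs:

E_σ ‖ Σ_i σ_i x_i ‖² = Σ_{i,j} E[σ_i σ_j] ⟨x_i, x_j⟩ = Σ_i ‖x_i‖²

since E[σ_i σ_j] = 1 if i = j and 0 otherwise; Jensen's inequality then gives
E_σ ‖Σ_i σ_i x_i‖ ≤ (Σ_i ‖x_i‖²)^{1/2}, which is (3.6) with p = 2 and κ = 1.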
3.9 Summary
In this chapter we have presented a noise immunizing scheme for learning. Our scheme utilizes the
structure of the learning problem, mainly by exploiting properties of the dual learning problem.
Having access to a membership query oracle, we were able to devise a conversion scheme which
adds noise robustness to many learning problems. In this presentation we focused on random
classification noise; however, our method appears to apply to many other noise models as well
(e.g. malicious noise).
In section 3.6 we generalized our scheme to handle real valued functions. In this setting, the
dual learning problem is related to the dual Banach space. Hence, the study of the dual learning
problem is very natural. We used the type of Banach space as a measure of the complexity of the
learning problem and showed that if the type is non-trivial then both primal and dual learning
problems are feasible.
Our construction provides a set of sufficient conditions for noise tolerant learnability. However,
we believe that the essence of these conditions reflects fundamental principles which may turn out
to be a step towards a complete characterization of noise tolerant learnability.
Part III
Selective Sampling
Chapter 4
Preliminaries
In the selective sampling framework [29, 30], the learner is presented with a large set of unlabeled
instances from which it can choose the instances for which the teacher will be asked to provide
labels. The selective sampling framework differentiates two features of the process: the complexity
of obtaining a random unlabeled instance and the complexity of labeling it. In the PAC framework
[126] these two features are merged; however, in many applications collecting unlabeled instances is
an easy task which is almost cost free, while labeling these instances is costly and lengthy. Consider
for example the task of text classification. In many cases, collecting random instances can be done
automatically without human involvement in the process, for instance by retrieving documents
from the Internet. However, labeling these texts may be lengthy (each document needs to be read)
and may require experts. This situation is not unique to text classification, and applies to a variety
of tasks including medical diagnostics, speech recognition, natural language processing and others.
Selective sampling (sometimes called query filtering) is an active learning framework in which
the learner sees unlabeled instances and selects those instances for which the teacher will be
asked to provide labels. This framework has several advantages over membership queries. First
and foremost, the selective sampling framework is applicable in many cases where membership
queries are not (see section 2.2 on page 18). Furthermore, selective samplers are tuned to the
underlying distribution of the instances. This is significant, as the learner can focus on the more
probable instances.
There are two types of selective sampling settings: batch and online. In the batch setting,
a large set of unlabeled instances is provided and the learner selects the instances to be labeled
by repeatedly searching for informative instances in this set. By contrast, in the online setting,
unlabeled instances are presented in a sequential manner. Whenever an unlabeled instance is
presented, the learner needs to decide whether to query for its label or not. In this online setting,
the learner cannot rewind the stream of unlabeled instances and hence cannot defer querying for
the label of an instance to a later point in the process (unless this instance is presented again).
To further understand the difference between the batch and online settings consider for example
the greedy algorithm, presented by Dasgupta [36] (see algorithm 3 on page 50). At each round,
this algorithm searches the entire batch of unlabeled instances for the most informative instance
and queries for its label; it therefore needs to scan the whole batch for every query it makes.
While this is reasonable when the size of the batch is moderate, in other cases it may be unfeasible.
Note that in some cases there is a constant
Whereas the membership queries framework does not have significant implications as regards
real world applications, the selective sampling framework has been applied in many domains with
great success. Several key examples are presented below.
4.1 Empirical Studies of Selective Sampling
Many algorithms operate in the selective sampling framework [29]. These algorithms have been
applied in many domains including text classification, part of speech tagging, etc. The core of these
algorithms is typically a scoring function. This function assigns a score to unlabeled instances based
on the labels seen so far. The score is designed to measure the benefits from labeling this instance;
i.e. the additional information or reduction in uncertainty a label will provide. The score is used
in two ways. In the batch setting, all the unlabeled instances are scored and the next query point
is chosen as the one with the highest score (a greedy strategy). In the online setting, the instances
are scored one at a time and the next query point is selected by thresholding the score or by a
randomized criterion.
Score functions take different forms, but most of them can be assigned to three categories:
committee-based scores, scores based on the confidence level of a single classifier, and look-ahead
principles.
4.1.1 Committee-Based Scores
Committee-based scores use several learners who learn in parallel. In most cases, the committee
consists of different learning algorithms, and thus when seeing the same training sequence, each
learner generates a different hypothesis. Another possibility is to use the same learning algorithm
for all learners but in this case, each learner only sees a subset of the training data collected so far.
The goal here is to have broad diversity in the committee, much like in Bagging [22]. When a new
instance is introduced, each learner in the committee predicts its label. If all committee members
agree on the predicted labels, then most likely, labeling this instance will not provide any additional
information. However, if there is a considerable disagreement among committee members, labeling
this instance is guaranteed to provide new information, at least for those learners who made a
wrong prediction. The committee principle has led to many active learning algorithms. The
leading algorithm using this principle is the Query By Committee algorithm [112] discussed in
chapter 5 on page 54. Several other algorithms are summarized below.
4.1.1.1 Part-of-Speech Tagging
Dagan and Engelson [35] used active learning in the domain of Natural Language Processing (NLP).
Specifically, they were interested in the task of part-of-speech tagging. In this task, a sentence in
a natural language is presented to the machine, and the machine needs to assign the grammatical
role to each word in the sentence. Typically, algorithms for tackling this problem use either expert
knowledge, which is coded in the algorithm, or a large annotated training sequence which is used
to train a learning algorithm. Both methods require a vast amount of work from human experts
and thus it is hard to adapt these algorithms for many languages. Dagan and Engelson suggested
the use of active learning for this task. They argue that obtaining the raw data (texts in this case)
is almost cost free whereas annotating it is costly and lengthy and thus selective sampling is a
natural match for this task.
The part-of-speech tagging task is complicated. One reason for this is the inherent ambiguity of
natural languages. A sentence such as “We saw the park with the binocular” has more than a single
valid interpretation. Moreover, words can have multiple meanings,
for instance “this is my head ”, “he is the head of the group” and “we should head south” all use
the word “head ” but with different meanings. Therefore, it is common for part of speech taggers
to use probabilistic models that assign probabilistic scores to possible grammatical analyses of a
sentence rather than trying to find the “correct” grammatical structure.
Figure 4.1: Active vs. Passive Learning for Part-of-Speech Tagging [35]. Accuracy appears on the
x-axis and the number of tagged words used for training on the y-axis.
The complexity of the task forced Dagan and Engelson [35] to suggest a heuristic based on the
committee principle. The base learners (which form the committee) are a special kind of a Hidden
Markov Model (HMM). A committee is constructed on the basis of the training sequence seen so
far, and a random choice of the free parameters. Since this is a multi-class problem (there are
many possible tags for each word), an entropy-based criterion is used to measure the disagreement
between committee members. Figure 4.1 presents some of the results obtained. On the
x-axis, the accuracy achieved by the algorithm of [35] is presented and on the y-axis, the number
of tagged words is shown. The accuracy of an active and a passive learning algorithm is compared.
The difference between the two is apparent. For example, reaching an accuracy level of 90%
required only ∼ 4000 words in the active algorithm whereas the passive algorithm needed ∼ 12500
words. Obtaining an accuracy level of 91% requires ∼ 7000 words in the active algorithm and
∼ 25000 in the passive algorithm.
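The entropy-based disagreement measure can be rendered, in simplified form, as follows (a
plausible Python sketch, not the exact formula of [35]): compute the entropy of the empirical
distribution of the tags the committee proposes for a word.

    import math
    from collections import Counter

    def vote_entropy(tags):
        """Entropy (in bits) of the committee's votes on one word; higher
        entropy means more disagreement and hence a more informative query."""
        counts = Counter(tags)
        total = len(tags)
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        return max(0.0, h)   # clamp the -0.0 edge case when all votes agree

    print(vote_entropy(["NOUN", "NOUN", "VERB", "ADJ"]))   # mixed votes: 1.5
    print(vote_entropy(["NOUN", "NOUN", "NOUN", "NOUN"]))  # full agreement: 0.0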
4.1.1.2 Spoken Language Understanding
Tur et al [125] used active learning as part of a spoken language understanding system. The system
is a part of an automatic operator. People can call the operator and ask for a variety of services.
For example, a user can call and ask “What is my balance?” The operator needs to react to these
requests. This is done by applying an automatic speech recognizer that identifies the spoken words.
Figure 4.2: Active vs. Passive Learning for Spoken Language Understanding [125]. The
x-axis shows the number of labeled instances while the y-axis shows accuracy. In this figure three
filtering methods are compared: a random selection of instances to be labeled, committee-based
active learning and confidence based active learning.
The transcribed words are fed into an “understanding” unit that assigns the task the operator will
perform. As in many of the tasks presented here, there is a constant feed of data as people
keep calling the operator. However, labeling these requests is a labor-intensive task.
Tur et al [125] used a committee-based approach to accelerate learning in the understanding
component of this system. The committee used consisted of two learners, an SVM [20] and AdaBoost [47]. These two classifiers were trained over the same training sequence. Both SVM and
AdaBoost are able to provide a confidence parameter together with their predictions, i.e. margin.
Tur et al [125] used this confidence and selected the next instances to be labeled as instances for
which SVM and AdaBoost disagreed and gave low confidence to their predictions. The results are
presented in figure 4.2. The experiment compares the committee-based approach to a confidence
based approach and to random sampling (i.e. passive learning). For most training set sizes, the
gain of using committee-based active learning is 1-2% over passive learning. They noted that the
committee-based approach seems preferable to the confidence based approach.
4.1.1.3 Ensemble of Active Learners
Ensemble methods, such as Bagging [22] and Boosting [47] are successful tools for passive learning.
Baram, El-Yaniv and Luz [10] presented an ensemble method for active learning. Their novel
approach combines different active learning algorithms using a master algorithm. The assumption
behind this master algorithm is that any active learning algorithm will fail on some data sets. The
goal of the master algorithm is to find the best performing active learning algorithm on the specific
data set at hand and use it for training. In order to do so, the authors had to come up with a
method to evaluate the performance of active learners. In the passive learning model this can be
achieved by using leave-one-out estimates or a hold-out set, but in the active learning model these
approaches cannot be used as the labeled training set is heavily biased towards difficult instances.
Therefore the error estimates are typically much worse than actual performance. Baram et al [10]
use an entropy criterion as a scoring function. Given a training sequence, each learner is asked
to label a set of unseen instances. These labels are viewed as groupings of these instances, where
each group consists of the instances to which the learner assigns the same label. The score of the
learner is the entropy of the partition¹.
Once the algorithm uses a certain active learner to query for the next query point, the
entropy-based scoring function is used to evaluate the benefit obtained from the label. Therefore, at any
point we can evaluate the active learners based on previous decisions they made. However, we
still need to decide which learner will make the next query. The fundamental problem here is the
exploration vs. exploitation dilemma. On one hand, we would like to give a fair chance to all
learners, but on the other hand, giving a poor learner too many opportunities to make queries
might undermine the performance of the whole ensemble. To resolve this trade-off, Baram et al [10]
used the analogy to the multi-armed bandit problem and applied algorithms suggested by Auer et
al [8].
The approach suggested by Baram et al [10] proved successful in the many experiments the authors conducted. This again demonstrates the effectiveness of committee-based approaches.
However, note that this time, active learners are used in the ensemble, whereas in all other algorithms presented here (and all other algorithms we are familiar with), the committee consists of
passive learners.
¹ This criterion appears to assume that the classes are equally sized. However, empirically it works
well even when this is not the case [10].
4.1.1.4 Other Committee-Based Approaches
Many committee-based approaches have been devised. McCallum and Nigam [90] used a probabilistic
model to sample a committee. An interesting property of the approach presented in [90] is the
combination of semi-supervised models with active learning: in their algorithm, an expectation
maximization (EM) algorithm is used to label the instances which are not yet labeled, so the
learner can train over a larger training sequence. Liere [78] used a straightforward committee-based
approach where the core classifiers are linear threshold functions (Winnow [82] and Perceptron [97]).
Krogh and Vedelsby [69] used committee-based methods with neural networks. Muslea et al. [95]
introduced a committee-based active learning algorithm which uses multiple views of the data [16].
Another interesting approach was presented by Mamitsuka and Abe [87], who generated a committee
by training the same learning algorithm over random subsets of the training data.
4.1.2 Confidence-Based Scores
Confidence-based scores use a single base classifier which is able not only to make predictions for
the labels of unseen instances, but also to assign confidence levels to its predictions. Many of the
classifiers used today have this capability of reporting their confidence. In SVM [20] and AdaBoost
[47, 109] the margin can be used as a measure of confidence. Other classifiers, such as Bayesian
networks [59], have internal probabilistic structure which can be used to measure confidence. These
and other confidence scores have been used to devise active learning algorithms. An overview of
some of these is presented below.
4.1.2.1 Margin Based Confidence
Tong and Koller [122], Campbell et al [25] and Schohn and Cohn [110] introduced a simple active
learning scheme based on large margin principles. They suggested training an SVM [20] over the
training sequence seen so far and choosing the next query point to be a point with the smallest
possible margin. Such a point is close to the decision boundary induced by the SVM; thus, its
label will shift the decision boundary considerably, making it an informative instance. See
figure 4.3 for a comparison of this simple scheme with various other active
learning schemes and passive learning algorithms. This example shows that this simple approach
significantly outperforms the passive learning SVM algorithm on a text classification task while
performing comparably to other more sophisticated active learners.
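A hedged sketch of the Simple scheme using scikit-learn (any classifier exposing a signed margin
would do; the data here is illustrative): fit an SVM on the labels collected so far and query the
unlabeled point with the smallest absolute margin.

    import numpy as np
    from sklearn.svm import LinearSVC

    def smallest_margin_query(X_labeled, y_labeled, X_unlabeled):
        """Index of the unlabeled point closest to the current decision
        boundary, i.e. the point with the smallest |margin|."""
        clf = LinearSVC().fit(X_labeled, y_labeled)
        margins = np.abs(clf.decision_function(X_unlabeled))
        return int(np.argmin(margins))

    # Toy usage: the point nearest the boundary between the two labels wins.
    X_lab = np.array([[0.0, 1.0], [1.0, 1.0]])
    y_lab = np.array([0, 1])
    X_unl = np.array([[0.1, 1.0], [0.5, 1.0], [0.9, 1.0]])
    print(smallest_margin_query(X_lab, y_lab, X_unl))   # expected: 1 (the middle point)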
Figure 4.3: Margin Based Active Learners [122]. This figure presents several active learning
methods based on SVM large margin principles. The different algorithms were applied to a text
classification task. The three different active learning algorithms (Ratio, MaxMin and Simple)
perform similarly, all outperforming the passive learning algorithm (Random). Note that the active
learners used 100 labels to obtain the same accuracy as the passive learner, which was trained over
a set of 1000 instances (full).
Schohn and Cohn [110] reported that in some cases when the active learner simply uses a subset
of a training sequence, it outperforms the passive learner that uses the fully labeled set. The same
surprising result was reported by Tur et al [125]. This is usually explained by the tendency of
active learners to avoid querying the labels of outliers.
4.1.2.2 Probability Based Confidence
Lewis and Gale [77] used a probability-based confidence active learner. They used a logistic
regression based classifier and queried for instances for which the probability of the leading class
was the smallest. They report that in some cases using active learning reduced the number of
labels needed 500-fold relative to passive learning.
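In the same spirit, the criterion of [77] can be sketched with scikit-learn's logistic regression
(a sketch, with illustrative names): query the instance whose most probable class has the lowest
probability.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def least_confident_query(X_labeled, y_labeled, X_unlabeled):
        """Index of the unlabeled instance whose leading-class probability
        is smallest (uncertainty sampling)."""
        clf = LogisticRegression().fit(X_labeled, y_labeled)
        top_prob = clf.predict_proba(X_unlabeled).max(axis=1)
        return int(np.argmin(top_prob))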
4.1.3 Look-ahead Principles
The ultimate criterion for selecting query points is the reduction in generalization error. However,
we do not have access to this quantity, and thus estimates of it must be used. Several methods
have been suggested to utilize such principles.
Cohn, Ghahramani and Jordan [31] designed an active learning algorithm for parametric models
such as neural networks and mixtures of Gaussians. Their algorithm minimizes the variance of the
estimates of the parameters. For any instance x, the expected reduction in variance is calculated
and used as a score for choosing the next query point. Cohn et al. [31] showed that in certain
models such as mixtures of Gaussians, this parameter can be calculated efficiently.
Roy and McCallum [104] designed an algorithm which estimates the future error based on a
sampling technique. The future error is calculated over the set of unlabeled instances available
to the learner. The learner uses a probabilistic model through which it can estimate the log-loss
or 0-1 loss (assuming that the current probabilistic model is accurate). The next query point is
selected to be the one that will reduce this loss the most.
Tong and Koller [122] introduced a look-ahead algorithm called MaxMin. Similar to Query By
Committee (see chapter 5 on page 54), MaxMin tries to estimate the reduction in the size of the
version space. Given an instance x, the algorithm calculates rx+ and rx− which are the radii of the
largest balls in the version space when x is used for training with the labels +1 or −1 respectively.
The radius of the largest ball in the version space assuming x is labeled gives a lower bound on
the volume of the version space. This gives an estimate of the reduction of the volume of the
version space. The next query point is selected to be the point which maximizes min (rx+ , rx− ). A
point for which min(r_x⁺, r_x⁻) is large is expected to bisect the version space most evenly and
thus will reduce its volume. Another algorithm, called Ratio in [122], queries the label of the
instance x for which min(r_x⁺/r_x⁻, r_x⁻/r_x⁺) is maximized. See figure 4.3 for the results of
applying both MaxMin and Ratio to a text classification task. It is clear that both algorithms
significantly outperform passive learning in this task. Note that in the experiment reported in [122]
the Simple Active Learning Algorithm (subsection 4.1.2.1 on page 45) is equally good. However, in
other experiments reported by Tong and Koller [122], Ratio and MaxMin were better than Simple.
Another approach using a look-ahead principle was introduced by Zhang and Chen [129] who used
it for information retrieval of visual objects.
4.2 Theoretical Studies of Selective Sampling
The theoretical study of selective sampling is still in its infancy. Only a few authors have studied
the theoretical aspect of selective sampling. Freund et al [48] were the first to show that selective
sampling can reduce the number of needed labels exponentially. Their results are discussed and
extended in chapter 6 on page 62. Recently, Dasgupta [36, 37] proved some positive and negative
results about selective sampling. The negative results show that there are cases where selective
sampling cannot reduce the number of labels needed significantly. Consider for example the class
of indicator functions. In this case, the sample space X is a finite set and the concept class is
C = {c_x : x ∈ X} where

c_x(x′) = 1 if x′ = x, and c_x(x′) = −1 otherwise.
It is easy to see that in this case on the order of |X| labels are needed on average in order to
achieve an accuracy of O(1/|X|). This is similar to the number of labels a passive learner would use.
Dasgupta [36] showed that the situation we have just demonstrated is not unique to the class
of indicator functions. He proved the following lemma:
Lemma 4.1 (Claim 1 in [36])
Let C be the class of linear separators in IR². For any set of m distinct instances on the unit
sphere there are hypotheses in the concept class which cannot be identified without querying all m
labels.
Lemma 4.1 shows that if we take m points on the unit sphere and assume the uniform distribution over these points, we may need m labels in order to have an accuracy of O (1/m). This is not
a great saving over passive learning. Moreover, this is not unique to the case discussed in lemma
4.1:
Lemma 4.2 (Claim 2 in [36])
For any d ≥ 2 and m ≥ 2d there is a sample space X of size m and a concept class C of VC-dimension d over the domain X, with the following property: if a concept c is chosen uniformly
from C then the average number of labels needed in order to identify c is greater than m/8.
Lemmas 4.1 and 4.2 show that a small VC dimension does not guarantee that the concept class
can be learned with few labels. Dasgupta [36] showed that even when we restrict ourselves to
the class of non-homogeneous linear classifiers, no selective sampling algorithm can guarantee a
significant reduction in the number of needed labels.
Alongside the negative results, Dasgupta provided an encouraging positive result. He studied
the Greedy Strategy for selective sampling (see algorithm 3). This algorithm receives a batch of
unlabeled instances and finds the next query point in a greedy fashion. Whenever it needs to
decide on the next query point, it goes over all the instances that have not yet been labeled. For
each such instance, it calculates the measure of hypotheses which label it with +1 and the measure
of hypotheses which label it with −1 and chooses to query for the label of the instance for which
these two measures are most equal. The target concept remains in the version space. Any other
concept, which disagrees with the target concept on an instance in the sample, will be removed
from the version space during the process. Thus it is guaranteed that the concept returned by the
greedy strategy is consistent with the target concept on the sample.
Dasgupta proved the following property of the greedy strategy:
Theorem 4.1 (Theorem 3 in [36])
Let π be any distribution over C. Let Q be any query strategy. Let

µ_greedy = E_{c∼π}[number of queries needed by the greedy strategy to identify c]

and

µ_Q = E_{c∼π}[number of queries needed by Q to identify c]

then

µ_greedy ≤ 4 µ_Q ln( 1 / min_{c∈C} π(c) )    (4.1)
Algorithm 3 Greedy Strategy for Selective Sampling [36]
Inputs:
• A sample S.
• A concept class C defined over S.
• A distribution π over C.
Output:
• An hypothesis h.
Algorithm:
1. Let V1 = C.
2. For t = 1, . . . , |S|
   (a) For every x ∈ S let V_t^+(x) = {c ∈ V_t : c(x) = 1} and V_t^−(x) = {c ∈ V_t : c(x) = −1}.
   (b) If max_x min(π(V_t^+(x)), π(V_t^−(x))) = 0 then break the loop.
   (c) Query for the label y of the x for which min(π(V_t^+(x)), π(V_t^−(x))) is maximized.
   (d) Let V_{t+1} = {c ∈ V_t : c(x) = y}.
3. Endfor
4. Return any c ∈ V_t.
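For a finite concept class the greedy strategy can be written down directly. A toy sketch, assuming the concepts are given as a matrix of label vectors over the sample S and the prior π as a weight vector (this representation is our illustrative assumption):

```python
import numpy as np

def greedy_selective_sampling(labels, prior, oracle):
    """labels: (num_concepts, m) matrix, labels[c, j] = c's label on instance j.
    prior: weight vector pi over the concepts. oracle(j): true label of instance j."""
    alive = np.ones(len(prior), dtype=bool)          # current version space V_t
    for _ in range(labels.shape[1]):
        pi = prior * alive
        mass_pos = (labels == +1).T @ pi             # pi(V_t^+(x)) for every x
        mass_neg = (labels == -1).T @ pi             # pi(V_t^-(x)) for every x
        split = np.minimum(mass_pos, mass_neg)
        if split.max() == 0:                         # every remaining label is determined
            break
        j = int(np.argmax(split))                    # most even split of the version space
        alive &= (labels[:, j] == oracle(j))         # query and shrink
    return int(np.flatnonzero(alive)[0])             # any consistent concept
```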
Theorem 4.1 proves that the average number of queries for labels the greedy strategy will make
is comparable to the best possible strategy. To see this, assume that C has VC dimension d < ∞
and let m = |S|. From Sauer's lemma [108] we have that the number of different hypotheses in C
when we restrict it to S is at most (em)^d. Assume that π is uniform over these hypotheses; then
from (4.1) we have that

µ_greedy ≤ 4 µ_Q ln((em)^d) = 4 µ_Q d ln(em)

hence the average number of queries needed by the greedy strategy exceeds the best possible
strategy by a factor of O(ln m) at most.
Another significant theoretical result was proven by Dasgupta, Kalai and Monteleoni [38]. They
suggested a modification to the well-known Perceptron algorithm [97] (see algorithm 4). This very
simple modification learns homogeneous linear classifiers. It has several advantages over most
of the algorithms we have discussed so far. First, it is a very simple algorithm and second, it
works in a streaming (online) fashion as opposed to the batch fashion that is used in the greedy
algorithm. Both these properties make it very attractive to use even with extremely large data
sets. However, since the dimensionality of the data d is explicitly used in this algorithm, it is not
possible to use it efficiently with kernels, especially kernels which map the data into infinite
dimensional spaces (see chapter 10 for more on kernels).
Dasgupta et al [38] proved the following property of the Perceptron based active learning
algorithm:
Theorem 4.2 (theorem 3 in [38])
Let ǫ, δ > 0. Let L = O( d log(1/(ǫδ)) (log(d/δ) + log log(1/ǫ)) ) and R = O( d/δ + log log(1/ǫ) ). Assume that
the underlying distribution over the sample space is the uniform distribution over the unit sphere
in IR^d and that the target concept is a homogeneous linear classifier. With probability 1 − δ, the
Perceptron based active learning algorithm will use L labels, will make O(L) errors while learning
and will return a hypothesis which is ǫ close to the target concept.
Theorem 4.2 proves that under the assumption of uniform distribution, the Perceptron based
active learning algorithm will use O(log(1/ǫ)) labels in order to return a hypothesis which is ǫ close
to the target concept. This is an exponential improvement over any passive learner, which will need
O(1/ǫ) labels in the same setting.
Algorithm 4 Perceptron Based Active Learning [38]
Inputs:
• Dimension d.
• Maximum number of labels L.
• A patience parameter R.
Output:
• A homogeneous linear classifier v.
Algorithm:
1. Let v1 = y1 x1 for the first example (x1, y1).
2. Let s1 = 1/√d.
3. For t = 1 . . . L
   (a) Wait for the next instance x such that |x · vt| ≤ st and query for its label. Call this example (xt, yt).
   (b) If yt (xt · vt) < 0 then
      i. vt+1 = vt − 2 (xt · vt) xt.
      ii. st+1 = st.
   (c) else
      i. vt+1 = vt.
      ii. If no prediction mistakes were made for the last R instances for which a query for label was made then
         A. st+1 = st/2.
      iii. else
         A. st+1 = st.
4. Endfor
5. Return vt.
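A sketch of the algorithm in code, assuming an endless stream of unit vectors and a label oracle (resetting the clean-run counter after each halving of s is our reading of the patience rule):

```python
import numpy as np

def perceptron_active(instances, label_of, L, R):
    """instances: an endless iterator of unit vectors in R^d.
    label_of(x): query the teacher for the label of x."""
    it = iter(instances)
    x = next(it)
    v = label_of(x) * x                      # v_1 = y_1 x_1
    s = 1.0 / np.sqrt(len(x))                # s_1 = 1 / sqrt(d)
    clean = 0                                # queried points since the last mistake
    for _ in range(L - 1):
        x = next(it)
        while abs(x @ v) > s:                # wait for a point in the margin band
            x = next(it)
        y = label_of(x)
        if y * (x @ v) < 0:                  # prediction mistake
            v = v - 2 * (x @ v) * x          # the reflection update of [38]
            clean = 0
        else:
            clean += 1
            if clean >= R:                   # R clean queries in a row
                s /= 2.0                     # shrink the band
                clean = 0
    return v
```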
4.3 Label Efficient Learning
The online selective sampling framework is called label efficient learning [54]. In this setting, the
learner sees a stream of unlabeled instances. When a new instance is introduced, the learner can
either predict its label or query for it. The learner tries to minimize the number of prediction
mistakes and at the same time minimize the number of queries. Note that unlike the passive
online learning framework [83], the true label is not revealed to the learner unless a query for label
was made.
This model has been studied by several authors. Cesa-Bianchi et al. [28], following Helmbold
and Panizza [54], studied label efficient learning in the framework of prediction with expert advice.
In this model it is assumed that there are many experts, one of which makes very few or even no
prediction mistakes. The task of the learner is to find this expert. They showed that if the learner
has a limited budget of queries for labels that it is allowed to make in any given time frame, then
it is still possible to predict almost as well as the best expert, as long as the budget for queries
grows to infinity at a rate that is not too slow.
Cesa-Bianchi, Conconi and Gentile [27] studied learning linear classifiers in the label efficient
setting. The algorithm presented by the authors uses the margin of the linear classifier with respect
to the prediction it makes as a criterion to select the right instances to query for their label. In
addition, they were able to analyze a slightly modified version of this algorithm in which, if the
label of an instance is predicted with too small a margin, a query is made for the label of the next
instance in the sequence. They were able to show that when certain conditions apply,
the number of prediction mistakes can be logarithmic with respect to the number of instances in
the sequence.
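A minimal sketch of this modified rule (the perceptron-style update and the fixed margin threshold b are simplifying assumptions; the analysis in [27] uses a more refined variant):

```python
import numpy as np

def label_efficient_linear(stream, label_of, b=1.0):
    """Predict with a linear classifier; after a prediction with margin
    below b, query the label of the next instance in the sequence."""
    w, query_next, queries, mistakes = None, True, 0, 0
    for x in stream:
        if w is None:
            w = np.zeros_like(x, dtype=float)
        margin = w @ x
        y_hat = 1 if margin >= 0 else -1
        if query_next:
            y = label_of(x)
            queries += 1
            if y != y_hat:                 # mistake: perceptron-style update
                w = w + y * x
                mistakes += 1
        query_next = abs(margin) <= b      # too small a margin: query the next one
    return w, queries, mistakes
```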
4.4 Summary
As we saw, there are many selective sampling algorithms but only a few have theoretical grounding. Unfortunately, the algorithms that do have theoretical grounding lack practical implementation.
In the next chapter we introduce the Query By Committee (QBC) algorithm. Much of the presentation from this point on is about providing both theoretical grounding and practical
implementation for the QBC algorithm.
Chapter 5
The Query By Committee Algorithm
The Query By Committee (QBC) algorithm was presented by Seung et al. [112] and analyzed in
[48, 119]. The algorithm assumes the existence of some underlying probability measure over the
hypotheses class. At each stage, the algorithm operates on the version space: the set of hypotheses
that were correct so far. Upon receiving a new instance the algorithm has to decide whether to
query for its label or not. This is done by randomly selecting two hypotheses from the version
space. A query for label is made only if these two hypotheses predict different labels for the
instance under consideration. The algorithm is presented as Algorithm 5 on page 55.
The QBC algorithm is, as its name suggests, a committee-based algorithm (see section 4.1.1
on page 41). The committee is formed of all the possible classifiers in the sense that any classifier
which may be the target concept is considered. Alternatively, QBC can be viewed as though it
used a look-ahead principle (see section 4.1.3 on page 47). To see this, let V be the current version
space and let x be an instance. Denote by V^+ the set of hypotheses in the version space which
predict the label +1 for x. Similarly, let V^− be the set of hypotheses in the version space which
predict the label −1 for x. It follows that QBC will query for the label of x with probability
2ν(V^+)ν(V^−).
QBC tends to query for instances which split the version space evenly. QBC works in an
online fashion; it sees an instance once and makes its decision whether to query for its label or
not. Although this limits the algorithm, the probabilistic way in which QBC makes its decision
utilizes the online setting to remain tuned to the underlying distribution of the inputs.
Algorithm 5 Query By Committee [112]
Inputs:
• Required accuracy ǫ.
• Required confidence 1 − δ.
• A prior ν over the concept class.
Output:
• A hypothesis h.
Algorithm:
1. Let V1 = C.
2. Let k ← 0.
3. Let l ← 0.
4. For t = 1, . . .
   (a) Receive an unlabeled instance xt.
   (b) Let l ← l + 1.
   (c) Select c1 and c2 randomly and independently from the restriction of ν to Vt.
   (d) If c1(xt) ≠ c2(xt) then
      i. Query for the label yt = c(xt).
      ii. Let k ← k + 1.
      iii. Let l ← 0.
      iv. Let Vt+1 ← {c ∈ Vt : c(xt) = yt}.
   (e) else
      i. Let Vt+1 ← Vt.
   (f) If‡ l ≥ tk then
      i. Choose a hypothesis h according to the termination rule.
      ii. Return h.

‡ Step 4f is the termination procedure of QBC. The exact choices of tk and the returned hypothesis
are discussed in section 5.1. For a short summary, see table 5.1.
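A sketch of the main loop for a finite concept class, with the prior ν given as a weight vector, the concepts as a matrix of labels, and the termination threshold t_k passed in as a function (this finite representation is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def qbc(instance_stream, label_of, labels, prior, t_k, pick_hypothesis):
    """labels: (num_concepts, num_instances) matrix, labels[c, x] = c's label on x.
    instance_stream yields column indices; label_of(x) queries the teacher."""
    alive = np.ones(len(prior), dtype=bool)     # current version space V_t
    k, l = 0, 0                                 # queries made, quiet streak
    for x in instance_stream:
        l += 1
        p = prior * alive
        c1, c2 = rng.choice(len(prior), size=2, p=p / p.sum())
        if labels[c1, x] != labels[c2, x]:      # the two committee members disagree
            y = label_of(x)
            k, l = k + 1, 0
            alive &= (labels[:, x] == y)        # restrict nu to the version space
        if l >= t_k(k):                         # steady version space: terminate
            return pick_hypothesis(alive, prior)
    return pick_hypothesis(alive, prior)
```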
Thus, the probability that QBC will query for the label of an instance depends on two factors: the “evenness”
of the split induced on the version space, and the probability of observing this instance.
5.1 Termination Procedures
According to the definition of the QBC algorithm by Seung et al [112], once the algorithm reaches
a steady version space, i.e. once the algorithm has not queried for a label for a long consecutive
sequence of instances, QBC terminates and returns a random hypothesis from the version space. Below we
study this rule as well as some alternative procedures (step 4f in algorithm 5). We also prove the
correctness of the algorithm in the sense that the hypothesis returned by QBC is indeed a good
one. To simplify the presentation we discuss the “original” termination procedure later.
5.1.1 The “Optimal” Procedure
Assume that the QBC algorithm queries for the labels of k instances. At this point the learner
has a posterior over the hypotheses class which is the restriction of the prior to the version space
V . Given this information, the optimal classifier that the learner can return is the Bayes classifier
which is defined:

c_Bayes(x) = { +1 if Pr_{c∼ν|V}[c(x) = +1] ≥ 1/2 ; −1 if Pr_{c∼ν|V}[c(x) = −1] > 1/2 }
where c ∼ ν|V means that c is chosen according to the restriction of ν to V . The first optimal
procedure we suggest works as follows: if the QBC algorithm did not query for a label for the last
t_k consecutive instances after making the k'th query, then QBC terminates and returns the Bayes
classifier c_Bayes as its hypothesis. The following proves the correctness of this procedure, together
with the right choice for tk :
Theorem 5.1 Assume that the termination rule of QBC (algorithm 5) is specified by

t_k = (2/ǫ) ln( π²(k+1)² / (6δ) )

Let the Bayes classifier c_Bayes be defined using the version space V used by QBC
when terminating. Then with a probability of 1 − δ over the sample and the internal randomness
of QBC,

E_{c∼ν|V}[ Pr_x[ c_Bayes(x) ≠ c(x) ] ] ≤ ǫ
Proof. Assume that QBC made k queries for labels to generate the version space V. Assume that
QBC did not query for any additional label for t_k consecutive instances after making the k'th
query. Let c_Bayes be the Bayes classifier; then

c_Bayes(x) = { +1 if Pr_{c∼ν|V}[c(x) = +1] ≥ 1/2 ; −1 if Pr_{c∼ν|V}[c(x) = −1] > 1/2 }
Take x and c such that c_Bayes(x) ≠ c(x). From the definition of the Bayes classifier it
follows that if we pick a random hypothesis c′ from the distribution ν|V then with a probability
of at least 1/2 we will have c′(x) ≠ c(x). Therefore, if we denote by [c_Bayes(x) ≠ c(x)] the
indicating function, then

E_{c′∼ν|V}[ c′(x) ≠ c(x) ] ≥ (1/2) [ c_Bayes(x) ≠ c(x) ]

for any c and x.
Assume that E_{c∼ν|V, x}[ c_Bayes(x) ≠ c(x) ] > ǫ. Thus,

E_{c,c′∼ν|V, x}[ c′(x) ≠ c(x) ] > ǫ/2

This means that the probability that QBC will not query for the label of the next instance is at
most 1 − ǫ/2. Hence, if E_{c∼ν|V, x}[ c_Bayes(x) ≠ c(x) ] > ǫ, the probability that QBC will not query
for a label for the next t_k consecutive instances is at most

(1 − ǫ/2)^{t_k} ≤ e^{−(ǫ/2) t_k}

By choosing t_k = (2/ǫ) ln(π²(k+1)²/(6δ)) we get that the probability that QBC will not query for t_k consecutive
labels when the Bayes classifier is not “good enough” is 6δ/(π²(k+1)²). By summing over k the proof is
completed.
Corollary 5.1 Assume that the termination rule of QBC (algorithm 5) is specified by

t_k = (2/ǫ) ln( π²(k+1)² / (6δ) )

Let the Bayes classifier c_Bayes be defined using the version space V used by QBC
when terminating. Then with a probability of 1 − δ over the sample and the internal randomness
of QBC,

E_{c∼ν}[ Pr_x[ c_Bayes(x) ≠ c(x) ] ] ≤ ǫ

Proof. From theorem 5.1 we have that

E_{c∼ν|V}[ Pr_x[ c_Bayes(x) ≠ c(x) ] ] ≤ ǫ

therefore

E_{c∼ν}[ Pr_x[ c_Bayes(x) ≠ c(x) ] ] = E_{V, c∼ν|V}[ Pr_x[ c_Bayes(x) ≠ c(x) ] ]
 = E_V[ E_{c∼ν|V}[ Pr_x[ c_Bayes(x) ≠ c(x) ] ] ]
 ≤ ǫ
5.1.2 Random Gibbs Hypothesis
Another possible solution to the generalization phase is to use a random Gibbs hypothesis. In this
procedure, whenever the QBC decides to terminate the learning process, a random hypothesis is
drawn out of the version space and is used for making further predictions. This is the “original”
termination procedure suggested in [48]. We suggest two possible analyses for this procedure: an
average-case analysis and an analysis of the “typical” case.
Theorem 5.2 Assume that the termination rule of QBC (algorithm 5) is specified by

t_k = (4/ǫ) ln( π²(k+1)² / (6δ) )

Let the Gibbs classifier c_Gibbs be defined using the version space V used by QBC
when terminating. Then with a probability of 1 − δ over the sample and the internal randomness
of QBC,

E_{Gibbs, c∼ν}[ Pr_x[ c_Gibbs(x) ≠ c(x) ] ] ≤ ǫ
Note that since the Gibbs hypothesis is a random hypothesis, the error in theorem 5.2 is
averaged over this randomness.
Proof. From corollary 5.1 we have, using this choice of t_k, that

E_{c∼ν}[ Pr_x[ c_Bayes(x) ≠ c(x) ] ] ≤ ǫ/2
Since Haussler, Kearns and Schapire [51] proved that the average error of the Gibbs classifier is at
most twice as large as the error of the Bayes classifier, the statement of the theorem follows.
Theorem 5.2 shows that the average error of the Gibbs hypothesis is not large. In the next
theorem we show that this is also the typical case.
Theorem 5.3 Assume that the termination rule of QBC (algorithm 5) is specified by

t_k = (8/(ǫδ)) ln( π²(k+1)² / (3ǫδ) )

Let the Gibbs classifier c_Gibbs be defined using the version space V used by QBC
when terminating. Then with a probability of 1 − δ over the choice of the sample, the target
hypothesis and the internal randomness of QBC,

Pr_x[ c_Gibbs(x) ≠ c(x) ] ≤ ǫ
Proof. This follows immediately from theorem 5.2 and the Markov inequality. From the choice of
t_k we have that with a probability of 1 − δ/2,

E_{Gibbs, c∼ν}[ Pr_x[ c_Gibbs(x) ≠ c(x) ] ] ≤ ǫδ/2    (5.1)
Therefore, from the Markov inequality, if (5.1) holds, we have with a probability of 1 − δ/2 that

Pr_x[ c_Gibbs(x) ≠ c(x) ] ≤ ǫ
Note that using a direct argument (instead of using the previous theorems as building blocks)
we can get t_k = (2/(ǫδ)) ln(π²(k+1)²/(3ǫδ)), which is better by a factor of 4. Since this is of minor significance
we do not dwell on this argument.
5.1.3 Bayes Point Machine
The Gibbs sampler does not use a single classifier to make predictions. Rather, it randomly selects
a hypothesis. Still another alternative is to use the Bayes Point Machine (BPM) [56, 49] to generate
future predictions. The BPM uses a hypothesis in the version space that is the closest to the Bayes
optimal classifier. When the concept class is the class of linear classifiers, then this is just the
center of gravity of the version space. Gilad-Bachrach, Navot and Tishby [49] proved the following
theorem:
Theorem 5.4 (Theorem 1 in [49])
Let the concept class be the class of linear classifiers and let the prior ν be log-concave. If V
is a version space and c_BPM and c_Bayes are the Bayes Point Machine and Bayes classifiers
respectively, then

Pr_{c∼ν|V}[ c(x) ≠ c_BPM(x) ] ≤ (e − 1) Pr_{c∼ν|V}[ c(x) ≠ c_Bayes(x) ]
From this, we derive the following theorem:
Theorem 5.5 Assume that the termination rule of QBC (algorithm 5) is specified by

t_k = (2(e−1)/ǫ) ln( π²(k+1)² / (6δ) )

Let the concept class be the class of linear classifiers and let the prior ν be
log-concave. Let the Bayes Point Machine classifier c_BPM be defined using the version space V
used by QBC when terminating. Then with a probability of 1 − δ over the sample and the internal
randomness of QBC,

E_{c∼ν}[ Pr_x[ c_BPM(x) ≠ c(x) ] ] ≤ ǫ

Proof. The proof follows immediately from corollary 5.1 and theorem 5.4.
5.1.4 Avoiding the Termination Rule
In many applications, the training process need not terminate. We can assume that there is a
constant stream of instances, and for each instance QBC makes one of two possible actions: it can
either decide to query for the label of the instance or decide not to query for its label. Note that if
QBC did not query for the label, this is because the two random hypotheses QBC drew predicted the
same label for the instance. Therefore, we have a natural way to predict the label of this instance
using the two random hypotheses. Using this “non-stop” rule makes QBC closer in spirit to label
efficient algorithms (see section 4.3 on page 53).
In the following theorem we show that the number of prediction mistakes QBC makes is proportional to the number of queries it makes. Therefore, if we can guarantee the number of queries
to be small, we immediately obtain a bound on the number of prediction mistakes. Indeed, in
chapter 6 we show that when certain conditions apply, the number of queries is small.
Theorem 5.6 For any instance x and at any stage of the learning, the probability that QBC will
make a prediction mistake on x is exactly half the probability it will query for the label of x.
Proof. Let V be the current version space. Let x be an instance for which QBC needs to decide
whether to predict its label or query for it. The probability that QBC will query for the label is

2 Pr_{c∼ν|V}[c(x) = 1] Pr_{c∼ν|V}[c(x) = −1]

whereas the probability that it will make a prediction mistake is

Pr_{c∼ν|V}[c(x) = 1]² Pr_{c∼ν|V}[c(x) = −1] + Pr_{c∼ν|V}[c(x) = −1]² Pr_{c∼ν|V}[c(x) = 1]

which is

Pr_{c∼ν|V}[c(x) = 1] Pr_{c∼ν|V}[c(x) = −1]

Therefore, the probability that QBC will query for a label for a given instance is exactly twice
the probability it will make a prediction mistake.
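The identity is easy to check numerically. In the Bayesian setting the committee members and the target are all drawn from ν|V, so for p = Pr_{c∼ν|V}[c(x) = 1] the query probability is 2p(1 − p) and the mistake probability is p²(1 − p) + (1 − p)²p = p(1 − p); a quick Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 0.3, 200_000                        # p = Pr_{c ~ nu|V}[c(x) = 1]
c1, c2, target = rng.random((3, n)) < p    # two committee members and the target
query = c1 != c2                           # QBC queries when they disagree
mistake = ~query & (c1 != target)          # they agree, but differ from the target
print(query.mean())                        # ~ 2p(1-p) = 0.42
print(2 * mistake.mean())                  # ~ 2p(1-p) = 0.42 as well
```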
5.2 Summary

In this chapter we have presented the Query By Committee algorithm. We have shown several
possible termination rules for this algorithm. In table 5.1 the different options are listed.
Procedure | t_k | Guarantee (with prob. 1 − δ)
Bayes classifier | (2/ǫ) ln(π²(k+1)²/(6δ)) | E_{c∼ν}[Pr_x[c_Bayes(x) ≠ c(x)]] ≤ ǫ
Gibbs classifier | (4/ǫ) ln(π²(k+1)²/(6δ)) | E_{Gibbs,c∼ν}[Pr_x[c_Gibbs(x) ≠ c(x)]] ≤ ǫ
Gibbs classifier | (8/(ǫδ)) ln(π²(k+1)²/(3ǫδ)) | Pr_x[c_Gibbs(x) ≠ c(x)] ≤ ǫ
BPM (linear classifiers) | (2(e−1)/ǫ) ln(π²(k+1)²/(6δ)) | E_{c∼ν}[Pr_x[c_BPM(x) ≠ c(x)]] ≤ ǫ
no termination | - | (with probability 1) E_{c∼ν}[number of queries] = 2 E_{c∼ν}[number of prediction mistakes]

Table 5.1: Possible Termination Procedures for QBC. The possible termination procedures
for QBC are listed, together with the guarantee they provide.
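For concreteness, the thresholds in table 5.1 are straightforward to evaluate; a small sketch printing them for the (arbitrary) choice ǫ = 0.1, δ = 0.05, k = 10:

```python
import math

def t_k_bayes(eps, delta, k):
    return (2 / eps) * math.log(math.pi**2 * (k + 1)**2 / (6 * delta))

def t_k_gibbs_average(eps, delta, k):
    return (4 / eps) * math.log(math.pi**2 * (k + 1)**2 / (6 * delta))

def t_k_gibbs_typical(eps, delta, k):
    return (8 / (eps * delta)) * math.log(math.pi**2 * (k + 1)**2 / (3 * eps * delta))

def t_k_bpm(eps, delta, k):
    return (2 * (math.e - 1) / eps) * math.log(math.pi**2 * (k + 1)**2 / (6 * delta))

for rule in (t_k_bayes, t_k_gibbs_average, t_k_gibbs_typical, t_k_bpm):
    print(rule.__name__, round(rule(0.1, 0.05, 10)))
```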
Chapter 6
Theoretical Analysis of Query By Committee
In chapter 5 we introduced the QBC algorithm and studied its basic properties. We now turn to
the fundamental properties of this algorithm. We follow the guidelines of Freund et al [47] while
introducing some corrections and extensions.
The QBC algorithm assumes a Bayesian setting. It assumes the existence of a prior distribution
ν over the concept class. It assumes that the teacher selects the target concept from the concept
class using this prior. Furthermore, it is assumed that ν is known to the learner; however the
learner does not have access to the random choices made by the teacher. In chapter 7 on page 88
and chapter 8 we lift some of these assumptions.
We begin this chapter by introducing and discussing the information gain in section 6.1. In
section 6.2 we present theorems about the fundamental properties of QBC. These theorems show
that when there is a lower bound on the expected information gain, the error rate of the hypotheses
learned by QBC decreases exponentially with respect to the number of queries for labels it makes.
The proofs of these theorems are provided in section 6.3. In section 6.4 we study the class of
parallel planes and argue that there is a lower bound on the expected information gain for this
class once the prior over the concept class is concave. We prove this argument in section 6.5. We
wrap up with a short discussion in section 6.6.
The theorems presented in this chapter are significant for an understanding of the Query By
Committee algorithm. However, the proofs of these theorems are long and involve non-trivial
techniques. Therefore, some readers may wish to skip these proofs (i.e. sections 6.3 and 6.5).
Doing so will not prevent the reader from following the rest of this essay.
6.1 The Information Gain
The key tool in analyzing the QBC algorithm is what is known as instantaneous information gain.
It was introduced in [51] as a tool for analyzing the progress of learning algorithms. Let V be a
version space and x be an instance. x induces a split of the version space such that V + consists
of all the concepts which label x with the label +1 and V − consists of all concepts which label x
with the label −1. Assume that there exists some prior ν over C. If the observed label for x is +1
we know that the concept we are learning is in V + and thus we say that the information we have
gained from the instance x and its label is log(ν(V)/ν(V^+)). Similarly, if the label of x is −1 we
say that the information gained is log(ν(V)/ν(V^−)).
Definition 6.1 Let V be a version space and ν be some probability measure defined over V. Let
x be an instance and y be the label of this instance. The instantaneous information gain from the
pair (x, y) is

log( ν({c ∈ V}) / ν({c ∈ V : c(x) = y}) )
In the setting of selective sampling, we have an instance x but we do not have its label. At this
point we can look at the expected information gain. The probability that the label of x will be +1
is exactly ν(V^+)/ν(V), in which case the instantaneous information gain will be log(ν(V)/ν(V^+)).
Equally, the probability that the label of x will be −1 is exactly ν(V^−)/ν(V), in which case the
instantaneous information gain will be log(ν(V)/ν(V^−)), and thus the expected information gain
from the label of x is

(ν(V^+)/ν(V)) log(ν(V)/ν(V^+)) + (ν(V^−)/ν(V)) log(ν(V)/ν(V^−))

which is exactly H(ν(V^+)/ν(V)), where H(·) is Shannon's binary entropy [114].
Definition 6.2 Let V be a version space and ν be some probability measure over C. Let x be an
instance; then the expected information gain from the label of x is

Σ_y ( ν({c ∈ V : c(x) = y}) / ν(V) ) · log( ν(V) / ν({c ∈ V : c(x) = y}) )
The expected information gain from an instance x is the entropy of the split it induces on the
version space. We see that the most informative instances are the ones for which the split they
induce is even. On the other hand, an instance for which the label is known a-priori, in the sense
that all the hypotheses in the version space agree on its label, has zero expected information gain.
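In the finite Bayesian setting the expected information gain of definition 6.2 amounts to one line of code; a toy sketch (the array representation is our assumption):

```python
import numpy as np

def expected_information_gain(weights, labels_on_x):
    """weights: prior mass of each hypothesis in the version space V.
    labels_on_x: their predicted labels (+1/-1) on the instance x."""
    p = weights[labels_on_x == +1].sum() / weights.sum()   # nu(V+)/nu(V)
    if p == 0.0 or p == 1.0:
        return 0.0                        # label known a priori: zero gain
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))    # H(nu(V+)/nu(V))
```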
If the instances for which QBC queries for labels all have high expected information gain, then
the volume (in the probabilistic sense) of the version space is guaranteed to shrink fast. Therefore,
we are guaranteed that QBC will focus on the target concept and its close neighbors. In the next
section we prove this intuition, but before doing so we need to define the expected information
gain from the next query QBC will make. We have defined the expected information gain from
an instance (definition 6.2); the expected information gain from the next QBC query takes into
account both the information of an instance, the probability of observing this instance, and the
probability that QBC will query for its label.
Definition 6.3 Let V be a version space and ν be a probability measure over C. Let D be a
distribution over the sample space X. Let ν^+(x) = ν({c ∈ V : c(x) = 1})/ν(V) and ν^−(x) =
ν({c ∈ V : c(x) = −1})/ν(V). The expected information gain from the next QBC query for a label is

G(ν, D) = ∫ 2ν^+(x)ν^−(x) H(ν^+(x)) dD(x) / ∫ 2ν^+(x)ν^−(x) dD(x)
To explain definition 6.3, note that for an instance x the value 2ν^+(x)ν^−(x) is the probability
that QBC will query for the label of x. Note also that the expected information gain from the
next QBC query is a function of both the prior ν over the concept class and the distribution D
over the sample space.

Finally, we use the definitions we have introduced here to define “good” concept classes and
distributions, i.e. the concept classes and distributions for which we will be able to prove the
properties of QBC.
Definition 6.4 The concept class C endowed with the prior ν and distribution D over the sample
space has a uniform lower bound g on the expected information gain if for any set of instances
x1, . . . , xm and any concept c ∈ C

G(ν|V, D) ≥ g

where V is the version space induced by x1, . . . , xm and the labels c(x1), . . . , c(xm).
6.2 The Fundamental Theory of QBC
In this section we prove the basic properties of the Query By Committee algorithm. We show that
if we can lower bound the expected information gain from the next query QBC will make, then we
can guarantee that the QBC algorithm will make very few queries while learning. The following
theorem shows this for various termination rules of the QBC algorithm.
Theorem 6.1 Let C be a concept class with VC-dimension d. Let ν be a prior over C and let
D be a distribution over the sample space. Let g > 0 be a uniform lower bound on the expected
information gain and let

g̃ = (g/4) log( 1 + g/(16 log(16/g) − g) )

Let δ > 0 and let

k ≥ max( (8/g²) ln(2/δ), ((d+1)/g̃) log(2/δ) )

Then with a probability of 1 − 2δ, QBC will use at most k queries for labels and

m0 = (d/e) 2^{g̃k/(d+1)}

unlabeled instances when learning, and will return a hypothesis with the following properties (depending on the termination rule used):

1. If QBC is used with the Bayes optimal classification rule, it will return a hypothesis (the
Bayes optimal hypothesis) such that

E_{c∼ν}[ Pr_x[ c_Bayes(x) ≠ c(x) ] ] ≤ ǫ

for any

ǫ > (2ek/d) 2^{−g̃k/(d+1)} ln( π²(k+1)² / (6δ) )

2. If QBC is used with the Gibbs average termination rule, it will return a hypothesis such that

E_{Gibbs, c∼ν}[ Pr_x[ c_Gibbs(x) ≠ c(x) ] ] ≤ ǫ

for any

ǫ > (4ek/d) 2^{−g̃k/(d+1)} ln( π²(k+1)² / (6δ) )

3. If QBC is used with the Gibbs “typical” error termination rule, it will return a hypothesis
such that

Pr_x[ c_Gibbs(x) ≠ c(x) ] ≤ ǫ

for any

ǫ > (8ek/(δd)) ( ln( dπ²(k+1)² / (24ek) ) + (g̃k/(d+1)) ln 2 ) 2^{−g̃k/(d+1)}

4. If QBC is used with the Bayes point machine termination rule, it will return a hypothesis
such that

E_{c∼ν}[ Pr_x[ c_BPM(x) ≠ c(x) ] ] ≤ ǫ

for any

ǫ > (2(e−1)ek/d) 2^{−g̃k/(d+1)} ln( π²(k+1)² / (6δ) )
Theorem 6.1 shows that the error rate of the hypothesis returned by QBC decreases exponentially with respect to the number of queries allowed. This is true for all the termination rules considered
here. Note that in passive learning models, the error rate decreases only as one over a polynomial
in the size of the sample (see e.g. [6] theorem 4.2). Hence, when there is a lower bound on the
expected information gain, learning is exponentially faster when using QBC. It is interesting to
note that if we look at the error rate achieved as a function of the size of the sample used, m0,
the error rate decreases as one over a polynomial, and thus QBC does not “waste” much on the instances
for which it did not query the label. The proof of the theorem is provided in section 6.3.
The following theorem deals with the situation when QBC is being used without any termination
rule. It shows that when there is a uniform lower bound on the expected information gain, both
the number of queries and the number of prediction mistakes is logarithmic with respect to the
length of the sequence processed.
Theorem 6.2 Let C be a concept class with VC-dimension d, together with a prior ν and a
distribution D. Let g > 0 be a uniform lower bound on the expected information gain and let

g̃ = (g/4) log( 1 + g/(16 log(16/g) − g) )

Denote by k(x1, . . . , xm) the number of queries QBC makes when processing x1, . . . , xm (note that
this also depends on the target concept and the internal randomness of QBC), and let

B(m) = max( ((d+1)/g̃) log(em/d), (8/g²) ln(em/d) ) + 2d/e

Then for m > 0

E_{x1,...,xm, c, QBC}[ k(x1, . . . , xm) ] ≤ B(m)

while the expected number of prediction mistakes is bounded by (1/2)B(m).

Moreover, for δ > 0

Pr_{QBC, c, x1, x2, ...}[ ∃m : k(x1, . . . , xm) ≥ B(m²) ⌈log log m⌉ (⌈log log m⌉ + 1) / δ ] ≤ δ

and

Pr_{QBC, c, x1, x2, ...}[ ∃m : mistakes(x1, . . . , xm) ≥ (1/2) B(m²) ⌈log log m⌉ (⌈log log m⌉ + 1) / δ ] ≤ δ

where mistakes(x1, . . . , xm) is the number of prediction mistakes QBC makes on x1, . . . , xm.
It is important to note that B(m²) ≤ 2B(m) and thus B(m²) = O(log m). This shows
that both the number of queries and the number of mistakes QBC makes grow as a function of the
logarithm of the number of instances processed (disregarding a term of order (log log m)²).
The following theorem is key in proving the properties of the QBC algorithm.
Theorem 6.3 Let C be a concept class with VC-dimension d, together with a prior ν and a
distribution D. Let g > 0 be a uniform lower bound on the expected information gain and let

g̃ = (g/4) log( 1 + g/(16 log(16/g) − g) )

Denote by k = k(x1, . . . , xm) the number of queries QBC makes when processing x1, . . . , xm (note
that this also depends on the target concept and the internal randomness of QBC). Then for m > 0

Pr_{x1,...,xm ∼ D^m, c∼ν, QBC}[ k ≥ max( ((d+1)/g̃) log(em/d), (8/g²) ln(em/d) ) ] < 2d/(em)
6.3 Proofs
As was previously mentioned, the analysis of the QBC algorithm uses the information gain as its
core. Below, we note a few properties of the information gain. Let x1, x2, . . . be a sample, let c be
a classifier and let ν be a prior. The instantaneous information gain from x1 and its label c(x1) is

log( ν(c′ ∈ V) / ν(c′ ∈ V : c′(x1) = c(x1)) )

If we have already seen the labels of x1, . . . , x_{i−1}, then the instantaneous information gain from xi
and its label c(xi) is

log( ν(c′ : ∀j < i, c′(xj) = c(xj)) / ν(c′ : ∀j ≤ i, c′(xj) = c(xj)) )

and thus the information gained from x1, . . . , xm and their labels c(x1), . . . , c(xm) is simply the
sum of the instantaneous information gains, which is exactly

log( 1 / ν(c′ : ∀j ≤ m, c′(xj) = c(xj)) )

i.e. the volume of the version space left after observing this labeled sample. The first lemma shows
that for any sample of size m, the information from having the labels of all points in the sample is
of order log(m).
Lemma 6.1 (lemma 3 in [48])
Let C be of VC-dimension d, and let S = {x1, . . . , xm} be a sample such that m ≥ d. Let

I(S, c) = log( 1 / ν({c′ : ∀x ∈ S, c′(x) = c(x)}) )

then

Pr_{c∼ν}[ I(S, c) > (d + 1) log(em/d) ] < d/(em)

Proof. From Sauer's lemma [108] it follows that S breaks C into at most r ≤ (em/d)^d equivalence
classes, where we say that c ∼ c′ if c(S) = c′(S). Let E1, . . . , Er be the different equivalence
classes; then I(S, c) is simply log(1/ν(Ei)) for the i such that c ∈ Ei. Using this notation we can write,
for any α > 0,

Pr_{c∼ν}[ I(S, c) > α ] = Σ_{i : log(1/ν(Ei)) > α} ν(Ei) ≤ Σ_{i : log(1/ν(Ei)) > α} 2^{−α} ≤ (em/d)^d 2^{−α}    (6.1)

Let α = (d + 1) log(em/d) and plug it into (6.1) to get the stated result.
Lemma 6.1 shows that if we get a fully labeled sample, then the information we have collected
from this sample is typically only logarithmic with respect to the size of the sample. Clearly,
the information from a partly labeled sample cannot exceed this bound (this follows, for example,
from the information processing inequality [32]). Next we show that the information from the
sub-sample that QBC collects grows linearly with respect to the number of queries QBC makes.
Lemma 6.2 Assume that the expected information gain for the next query of QBC is lower
bounded by g > 0. Let k > 0, and let V(k) be the version space induced by the first k queries
of QBC. Then

Pr_{c, QBC, x1, x2, ...}[ log(1/ν(V(k))) < (kg/4) log( 1 + g/(16 log(16/g) − g) ) ] ≤ e^{−kg²/8}

This lemma amends lemma 1 in [48]. It shows that the information gained by QBC grows
linearly with respect to the number of queries it makes.
Proof. Let gi be the expected information gain from the i'th instance for which QBC queried for
a label. From the definition of the expected information gain we have that 0 ≤ gi ≤ 1. Since there
is a lower bound on the expected information gain, we have that E_{c,QBC,x1,x2,...}[gi] ≥ g. The gi's
form a martingale; thus using the martingale method (see e.g. [75] lemma 4.1) we have

Pr_{c, QBC, x1, x2, ...}[ Σ_{i=1}^{k} gi < kg/2 ] ≤ e^{−kg²/8}
Assume that Σ_{i=1}^{k} gi ≥ kg/2. Since 0 ≤ gi ≤ 1, at least kg/4 of the gi's have gi ≥ g/4. Recall that gi is the expected
information gain, and thus gi = H(pi), where pi is the probability of observing the label +1 for the
corresponding instance, given the previously queried instances and their labels. From lemma 6.3
on page 73 we have, for every i such that gi ≥ g/4, that

pi ∈ [ g/(16 log(16/g)), 1 − g/(16 log(16/g)) ] = [ g/(16 log(16/g)), 1/(1 + g/(16 log(16/g) − g)) ]

This means that for each i such that gi ≥ g/4, the instantaneous information gained from the
instance and its label is at least log(1 + g/(16 log(16/g) − g)). Since there are at least kg/4 instances for
which gi ≥ g/4, the sum of the information gained from all query instances is at least

(kg/4) log( 1 + g/(16 log(16/g) − g) )

which is linear with respect to k.
We are now ready to prove theorem 6.3.

Proof of theorem 6.3. From lemma 6.1 we know that with a probability of 1 − d/(em) the information gained from
querying the labels of all m instances is at most (d + 1) log(em/d). Let k be the number of queries
QBC made. From lemma 6.2 we know that with a probability of 1 − e^{−kg²/8}, the information
gained from the queries QBC made is at least (kg/4) log(1 + g/(16 log(16/g) − g)). If k ≥ (8/g²) ln(em/d) we have
that 1 − e^{−kg²/8} ≥ 1 − d/(em). Since the information gained from labeling all the instances is greater
than the information of any subset of the instances, it follows that if k ≥ (8/g²) ln(em/d), then with a probability of 1 − 2d/(em)

(kg/4) log( 1 + g/(16 log(16/g) − g) ) ≤ (d + 1) log(em/d)

and thus

k ≤ (d + 1) log(em/d) · 4/( g log(1 + g/(16 log(16/g) − g)) ) = ((d+1)/g̃) log(em/d)

Hence, with a probability of 1 − 2d/(em), either k < (8/g²) ln(em/d) or k ≤ ((d+1)/g̃) log(em/d), which proves the theorem.
We are now ready to prove theorem 6.1.

Proof of theorem 6.1.
The correctness of the algorithm, i.e. the fact that the hypothesis returned is indeed close to
the target concept with high probability, was already proved in chapter 5. We need only to analyze
the number of labeled and unlabeled instances used.
The condition k ≥ (8/g²) ln(2/δ) implies that e^{−kg²/8} ≤ δ/2. From the choice of m0 and the assumption that
k ≥ ((d+1)/g̃) log(2/δ) we have that d/(em0) ≤ δ/2, and thus from theorem 6.3 we have that with a probability
of 1 − δ, the number of queries QBC will make on a sample of size m0 is at most

((d+1)/g̃) log(em0/d)

By the choice of m0 = (d/e) 2^{g̃k/(d+1)} we have that with a probability of 1 − δ, the number of queries
QBC will make on m0 instances is at most k.
Assume that QBC did not query for more than k labels out of the m0 instances.
Then for any t < m0/k there must have been a sequence of t consecutive instances for which
QBC did not query for labels. From this point we look at each termination condition separately.
1. Let ǫ be such that

ǫ > (2k/m0) ln( π²(k+1)² / (6δ) )    (6.2)

Then from corollary 5.1 we have that if QBC did not make any query for labels for a sequence
of t_k = (2/ǫ) ln(π²(k+1)²/(6δ)) consecutive instances, then

E_{c∼ν}[ Pr_x[ c_Bayes(x) ≠ c(x) ] ] ≤ ǫ

with a probability of 1 − δ. However, by the choice of ǫ in (6.2) we have that t_k ≤ m0/k, and
thus we are guaranteed with a probability of 1 − δ that indeed there will be such a sequence
(note that for any j < k we have that t_j < t_k and thus QBC may terminate before the k'th
query for labels). Therefore, with a probability of 1 − 2δ, the number of queries that QBC
will make will be smaller than k and the algorithm will terminate before seeing m0 unlabeled
instances, returning a Bayes classifier with an error rate smaller than ǫ as defined in (6.2). By
the choice of m0 we have that this is true for any ǫ such that

ǫ > (2ek/d) 2^{−g̃k/(d+1)} ln( π²(k+1)² / (6δ) )
2. Let ǫ be such that

ǫ > (4k/m0) ln( π²(k+1)² / (6δ) )    (6.3)

Then from theorem 5.2 we have that if QBC did not make any query for labels for a sequence
of t_k = (4/ǫ) ln(π²(k+1)²/(6δ)) consecutive instances, then

E_{Gibbs, c∼ν}[ Pr_x[ c_Gibbs(x) ≠ c(x) ] ] ≤ ǫ

with a probability of 1 − δ. However, by the choice of ǫ in (6.3) we have that t_k ≤ m0/k, and
thus we are guaranteed with a probability of 1 − δ that indeed there will be such a sequence
(note that for any j < k we have that t_j < t_k and thus QBC may terminate before the k'th
query for labels). Therefore, with a probability of 1 − 2δ, the number of queries that QBC
will make will be smaller than k and the algorithm will terminate before seeing m0 unlabeled
instances, returning a Gibbs classifier with an error rate smaller than ǫ as defined in (6.3).
Through the choice of m0 we have that this is true for any ǫ such that

ǫ > (4ek/d) 2^{−g̃k/(d+1)} ln( π²(k+1)² / (6δ) )
3. Let ǫ be such that

ǫ > (8k/(δ m0)) ln( m0 π²(k+1)² / (24k) )    (6.4)

Then from theorem 5.3 we have that if QBC did not make any query for labels for a sequence
of t_k = (8/(ǫδ)) ln(π²(k+1)²/(3ǫδ)) consecutive instances, then

Pr_x[ c_Gibbs(x) ≠ c(x) ] ≤ ǫ

with a probability of 1 − δ. However, by the choice of ǫ in (6.4) we have that t_k ≤ m0/k, and
thus we are guaranteed with a probability of 1 − δ that indeed there will be such a sequence
(note that for any j < k we have that t_j < t_k and thus QBC may terminate before the k'th
query for labels). Therefore, with a probability of 1 − 2δ, the number of queries that QBC
will make will be smaller than k and the algorithm will terminate before seeing m0 unlabeled
instances, returning a Gibbs classifier with an error rate smaller than ǫ as defined in (6.4).
By the choice of m0 we have that this is true for any ǫ such that

ǫ > (8ek/(δd)) ( ln( dπ²(k+1)² / (24ek) ) + (g̃k/(d+1)) ln 2 ) 2^{−g̃k/(d+1)}
4. Let ǫ be such that

ǫ > (2(e−1)k/m0) ln( π²(k+1)² / (6δ) )    (6.5)

Then from theorem 5.5 we have that if QBC did not make any query for labels for a sequence
of t_k = (2(e−1)/ǫ) ln(π²(k+1)²/(6δ)) consecutive instances, then

E_{c∼ν}[ Pr_x[ c_BPM(x) ≠ c(x) ] ] ≤ ǫ

with a probability of 1 − δ. However, by the choice of ǫ in (6.5) we have that t_k ≤ m0/k, and
thus we are guaranteed with a probability of 1 − δ that indeed there will be such a sequence
(note that for any j < k we have that t_j < t_k and thus QBC may terminate before the k'th
query for labels). Therefore, with a probability of 1 − 2δ, the number of queries that QBC
will make will be smaller than k and the algorithm will terminate before seeing m0 unlabeled
instances, returning a Bayes Point Machine classifier with an error rate smaller than ǫ as defined in (6.5). By
the choice of m0 we have that this is true for any ǫ such that

ǫ > (2e(e−1)k/d) 2^{−g̃k/(d+1)} ln( π²(k+1)² / (6δ) )
Proof of theorem 6.2. From theorem 6.3 we have that

Pr_{x1,...,xm ∼ D^m, c∼ν, QBC}[ k ≥ max( ((d+1)/g̃) log(em/d), (8/g²) ln(em/d) ) ] < 2d/(em)

Since the number of queries is at most m, we have that the expected number of queries is at most

max( ((d+1)/g̃) log(em/d), (8/g²) ln(em/d) ) + 2d/e

Using theorem 5.6 we conclude that the expected number of prediction mistakes is at most

(1/2) max( ((d+1)/g̃) log(em/d), (8/g²) ln(em/d) ) + d/e
We now show that this is not only the average case but also the typical case. Let δ > 0. For
any t = 1, 2, . . . we have, using the Markov inequality, that

Pr_{c, QBC, x1, x2, ...}[ k(x1, . . . , x_{2^{2^t}}) > B(2^{2^t}) t(t+1)/δ ] ≤ δ/(t(t+1))

By summing over t and using the fact that Σ_{t=1}^{∞} 1/(t(t+1)) = 1, we have that

Pr_{c, QBC, x1, x2, ...}[ ∃t : k(x1, . . . , x_{2^{2^t}}) > B(2^{2^t}) t(t+1)/δ ] ≤ δ

Let m > 0, and let t = ⌈log log m⌉. Clearly m ≤ 2^{2^t}, and thus for any sequence of instances

k(x1, . . . , xm) ≤ k(x1, . . . , x_{2^{2^t}})

Using this fact and the fact that m² ≥ 2^{2^t}, we have that

Pr_{c, QBC, x1, x2, ...}[ ∃m : k(x1, . . . , xm) > B(m²) ⌈log log m⌉ (⌈log log m⌉ + 1)/δ ] ≤ δ

Analyzing the probability of having too many prediction mistakes can be done in the same way
as we analyzed the probability of having too many queries for labels.
Lemma 6.3 Let H(p) be the binary entropy of p. If H(p) ≥ α > 0 then

p ≥ α / (4 log(4/α))

Proof. Let p* = α/(4 log(4/α)). Since p* < 1/2, we have for any p < p* that H(p) < H(p*); thus
it suffices to show that H(p*) ≤ α. Using the fact that p* < 1/2 once again, we have that

−p* log p* ≥ −(1 − p*) log(1 − p*)

and thus

H(p*) = −p* log p* − (1 − p*) log(1 − p*)
 ≤ −2p* log p*
 = (α/(2 log(4/α))) log( 4 log(4/α) / α )
 = (α/2) ( 1 + log log(4/α) / log(4/α) )
 ≤ α

where the last inequality follows since (log log z)/(log z) ≤ 1 for z ≥ 2.
6.4 Lower Bound on the Expected Information Gain for Linear Classifiers
In previous sections we showed that whenever there is a uniform lower bound on the expected
information gain, QBC learns fast: the error rate of the hypotheses
generated by QBC decreases exponentially with respect to the number of queries made. In order
to make these results meaningful, we now provide such uniform lower bounds for the concept class
of parallel planes [48]. The following is the main result of this section.
Theorem 6.4 Let C be the class of d dimensional parallel planes. Let ν be a prior distribution
over C which is ρ-concave for ρ > −1. Let D be a distribution over the sample space X = IR^d × IR
such that for any x0 ∈ IR^d there is a section [b1(x0), b2(x0)] (which may be empty) such that
the conditional distribution of b given x = x0 is uniform over [b1(x0), b2(x0)].

The expected information gain from the next query of the QBC algorithm is uniformly lower
bounded by

G(ρ) = ( 2^{2+ρ}(1+ρ)(2+ρ) / ((3+ρ) ln 2) ) · [ 2^{−2−ρ}/(2+ρ)² − 2^{−3−ρ}/(3+ρ)²
 + (2^{−2−ρ}/(2+ρ)) ln 2 − (2^{−3−ρ}/(3+ρ)) ln 2
 + Σ_{n=0}^{∞} ( Γ(ρ+1)(−1)^n / (Γ(ρ−n+1) n!) ) ( 1/(n+3)² − 2^{−n−3}/(n+3)² − (2^{−n−3}/(n+3)) ln 2 ) ]

and this bound is tight.
The theorem we have just stated needs some explanation. We need to define the class of
parallel planes, define ρ-concave measures and understand the function G(ρ). Note, however, that
this theorem is an extension of theorem 2 in [48], where Freund et al. proved the same lower bound
on the expected information gain for the class of parallel planes; that theorem assumed a
uniform distribution over the class of classifiers, which is a special case of the theorem presented
here, as any uniform distribution over a convex body is log-concave, i.e. 0-concave.
6.4.1 The Class of Parallel Planes
The class of parallel planes [48] is a close relative of the class of linear separators. Each concept
in this class is represented by a d dimensional vector w. An instance is a pair (x, b) and the
classification rule is

c_w(x, b) = sign(w · x + b)

Note that this is different from the class of non-homogeneous linear separators, as in the latter
the bias b is a part of the concept. As is the case with linear classifiers, the vector w can be scaled
down. To see this, note that if w ∈ IR^d and (x, b) ∈ IR^d × IR then c_w(x, b) = c_{w/λ}(λx, b) for any
λ > 0. Therefore, we will always assume that the w's are in the d-dimensional unit ball.
6.4.2 Concave Measures

We provide a brief introduction to concave measures; see [9, 26, 100, 19] for more information.
We begin by defining concave measures.
Definition 6.5 A probability measure ν over IR^d is ρ-concave if for any measurable sets A and B
and every 0 ≤ λ ≤ 1 the following holds:

ν(λA + (1 − λ)B) ≥ [ λ ν(A)^ρ + (1 − λ) ν(B)^ρ ]^{1/ρ}
A few facts about ρ-concave measures:
• If ν is ρ-concave with ρ = ∞ then ν(λA + (1 − λ)B) ≥ max(ν(A), ν(B)).
• If ν is ρ-concave with ρ = −∞ then ν(λA + (1 − λ)B) ≥ min(ν(A), ν(B)), in this case ν is
called quasi-concave.
• If ν is ρ-concave with ρ = 0 then ν(λA + (1 − λ)B) ≥ ν(A)^λ ν(B)^{1−λ}; in this case ν is called
log-concave.
Many common probability measures are log-concave, for example uniform measures over compact
convex sets, normal distributions, chi-square and others. ρ-concave measures are always uni-modal. Moreover, any uni-modal measure is ρ-concave, at least for ρ = −∞. The parameter ρ
provides a hierarchy for the class of uni-modal measures, since if ν is ρ-concave and ρ′ < ρ then ν
is ρ′-concave as well. Thus, the larger ρ is, the more restrictive the assumption of being ρ-concave.
The following lemma shows that if a measure ν is ρ-concave then any restriction of ν to a
convex body is ρ-concave as well. This makes the parameter ρ suitable for our discussion, since
after the QBC has made several queries for labels we will be looking at the posterior, which is
the restriction of the original prior to the version-space. The lemma shows that if the prior was
ρ-concave then the posterior will be ρ-concave as well.
Lemma 6.4 Let ν be a ρ-concave probability measure and let K be a convex body such that ν(K) > 0.
Let νK be the restriction of ν to K, i.e. νK(A) = ν(A ∩ K)/ν(K); then νK is ρ-concave.

Proof. Let A and B be measurable sets and let 0 ≤ λ ≤ 1. Let x ∈ λ(A ∩ K) + (1 − λ)(B ∩ K). It
follows that x ∈ λA + (1 − λ)B, and since K is convex we have that x ∈ K; thus we conclude
that

λ(A ∩ K) + (1 − λ)(B ∩ K) ⊆ (λA + (1 − λ)B) ∩ K

and hence

ν(K) νK(λA + (1 − λ)B) = ν((λA + (1 − λ)B) ∩ K)
 ≥ ν(λ(A ∩ K) + (1 − λ)(B ∩ K))
 ≥ [ λ ν(A ∩ K)^ρ + (1 − λ) ν(B ∩ K)^ρ ]^{1/ρ}
 = ν(K) [ λ νK(A)^ρ + (1 − λ) νK(B)^ρ ]^{1/ρ}
6.4.3 The Function G(ρ)
The function G(ρ) as defined in theorem 6.4 might look frightening. Recall that G(ρ) is defined to
be

G(ρ) = ( 2^{2+ρ}(1+ρ)(2+ρ) / ((3+ρ) ln 2) ) · [ 2^{−2−ρ}/(2+ρ)² − 2^{−3−ρ}/(3+ρ)²
 + (2^{−2−ρ}/(2+ρ)) ln 2 − (2^{−3−ρ}/(3+ρ)) ln 2
 + Σ_{n=0}^{∞} ( Γ(ρ+1)(−1)^n / (Γ(ρ−n+1) n!) ) ( 1/(n+3)² − 2^{−n−3}/(n+3)² − (2^{−n−3}/(n+3)) ln 2 ) ]
[Figure 6.1 here: a plot of G(ρ) against ρ ∈ [−1, 5], with G(ρ) ranging over [0, 1].]

Figure 6.1: The function G(ρ) from theorem 6.4 on page 73.
Figure 6.1 shows a plot of this function. When ρ = −1 we have that G(ρ) = 0; however, it
climbs fast and reaches 1/9 + 7/(18 ln 2) ≈ 0.67 when ρ = 0. The function is monotone increasing and
approaches 1 as ρ grows to infinity.
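The expression is easy to evaluate numerically. The sketch below implements G(ρ) as reconstructed above, computing the coefficient Γ(ρ+1)(−1)^n/(Γ(ρ−n+1) n!) by the stable recurrence c_{n+1} = c_n (n − ρ)/(n + 1); it reproduces G(0) = 1/9 + 7/(18 ln 2) ≈ 0.672:

```python
import math

def G(rho, terms=200):
    ln2 = math.log(2)
    pref = 2**(2 + rho) * (1 + rho) * (2 + rho) / ((3 + rho) * ln2)
    main = (2**(-2 - rho) / (2 + rho)**2 - 2**(-3 - rho) / (3 + rho)**2
            + ln2 * 2**(-2 - rho) / (2 + rho) - ln2 * 2**(-3 - rho) / (3 + rho))
    series, c = 0.0, 1.0        # c = Gamma(rho+1)(-1)^n / (Gamma(rho-n+1) n!)
    for n in range(terms):
        series += c * (1 / (n + 3)**2 - 2**(-n - 3) / (n + 3)**2
                       - ln2 * 2**(-n - 3) / (n + 3))
        c *= (n - rho) / (n + 1)
    return pref * (main + series)

print(G(0.0))                   # ~0.672 = 1/9 + 7/(18 ln 2)
```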
6.5 Proof of Theorem 6.4
We now turn to the information gain of ρ-concave measures.

Proof of theorem 6.4. The first step is to come up with a simplified notation for the expected information
gain. Recall that the expected information gain is

G(ν, D) = E_{(x,b)∼D}[ 2 ν{w : w·x + b > 0} ν{w : w·x + b < 0} H(ν{w : w·x + b < 0}) ] / E_{(x,b)∼D}[ 2 ν{w : w·x + b > 0} ν{w : w·x + b < 0} ]    (6.6)

We will show that for any x0 in the support of D

E_{b∼Dx0}[ 2 ν{w : w·x0 + b > 0} ν{w : w·x0 + b < 0} H(ν{w : w·x0 + b < 0}) ] / E_{b∼Dx0}[ 2 ν{w : w·x0 + b > 0} ν{w : w·x0 + b < 0} ] ≥ G(ρ)    (6.7)
where b ∼ Dx0 means that (x, b) is sampled from the distribution D conditioned on x = x0 . Once
we prove (6.7), we have that G (ν, D) ≥ G (ρ) as well. This follows since for any two positive
functions f and g and any probability measure over x,

E_x[f(x)] / E_x[g(x)] ≥ min_x f(x)/g(x)
Therefore, from here on we will be trying to prove (6.7). Fix x0 and denote F(b) = ν{w : w·x0 + b < 0};
we can then rewrite the left-hand side of (6.7) as

E_{b∼Dx0}[ F(b)(1 − F(b)) H(F(b)) ] / E_{b∼Dx0}[ F(b)(1 − F(b)) ]

Note that F(b) is the cumulative distribution function (CDF) of ν when projected along the vector
x0. Since ν is ρ-concave, F is ρ-convex (see e.g. [100]). Furthermore, according to the assumptions
of this theorem, the distribution Dx0 is uniform over the segment (b1, b2); thus we write

G(F) = ∫_{b1}^{b2} F(b)(1 − F(b)) H(F(b)) db / ∫_{b1}^{b2} F(b)(1 − F(b)) db
From now on we will study G(F) for ρ-convex functions. W.l.o.g. assume that 0 ∈ (b1, b2) and that
0 is the median of the CDF F, i.e. F(0) = 1/2. Note that

G(F) = [ ∫_{b1}^{0} F(b)(1−F(b))H(F(b)) db + ∫_{0}^{b2} F(b)(1−F(b))H(F(b)) db ] / [ ∫_{b1}^{0} F(b)(1−F(b)) db + ∫_{0}^{b2} F(b)(1−F(b)) db ]
 ≥ min( ∫_{b1}^{0} F(b)(1−F(b))H(F(b)) db / ∫_{b1}^{0} F(b)(1−F(b)) db , ∫_{0}^{b2} F(b)(1−F(b))H(F(b)) db / ∫_{0}^{b2} F(b)(1−F(b)) db )    (6.8)

Due to the symmetry around zero of G(F), we conclude from (6.8) that it is enough to look at one
“tail” of F. Hence our study will focus on F functions with the following properties:

1. F is defined over [−∞, 0].
2. F is monotone increasing.
3. F(−∞) = 0 and F(0) = 1/2.
4. F is ρ-convex.

We call such an F a ρ-admissible CDF function.
We begin by showing that for any ρ there exists a ρ-admissible CDF function Fρ such that
G(Fρ) = G(ρ). We break down our discussion into three cases, depending on the value of ρ:

1. When ρ < 0 we use Fρ(b) = (1/2)(1 + b)^{1/ρ}. Trivially, Fρ is ρ-admissible. In Lemma 6.5 we
show that indeed G(Fρ) = G(ρ).

2. When ρ > 0 we use Fρ(b) = (1/2)(1 + b)^{1/ρ}, which is defined on the range [−1, 0]. Again,
proving that Fρ is ρ-admissible is trivial. Lemma 6.6 shows that G(Fρ) = G(ρ).

3. When ρ = 0 we use F0(b) = (1/2)e^b. Clearly, this is a 0-admissible function (recall that a
0-concave function is a log-concave function). Showing that G(F0) = G(0) is done in
Lemma 6.7.
We have shown that for any ρ there is a ρ-concave measure for which the expected information
gain is G(ρ). Thus, if we prove that G(F) ≥ G(ρ) for any F which is ρ-admissible, we will have
that G(ρ) is a tight bound.

For a ρ-admissible CDF F, let f = F^ρ (we use the convention here that if ρ = 0 we use f = ln F).
Since F is ρ-convex, we have that f is convex, i.e. f(λb1 + (1 − λ)b2) ≥ λf(b1) + (1 − λ)f(b2).
Note that for Fρ as defined above, if we set fρ = (Fρ)^ρ we get a linear function. We will now claim
that this is the worst case.

Note that if f is convex and f + ψ is convex, then for any ǫ ∈ [0, 1] we have that f + ǫψ is
convex as well (lemma 6.8).
We use the notation G(f) = G(f^{1/ρ}). Taking the Fréchet derivative of G(·) we have that

G(f + ǫψ) = G(f) + ǫ ∫_{−∞}^{0} ∇f G(b) ψ(b) db + ǫ² O( ∫_{−∞}^{∞} ψ²(b) db )    (6.9)
We now turn to ∇f G(b). Before we do so, we rewrite G(f). Recall that

G(f) = ∫_{−∞}^{∞} f^{1/ρ}(b)(1 − f^{1/ρ}(b)) H(f^{1/ρ}(b)) db / ∫_{−∞}^{∞} f^{1/ρ}(b)(1 − f^{1/ρ}(b)) db

Denote K(b) = f^{1/ρ}(b)(1 − f^{1/ρ}(b)). Using this notation, f^{1/ρ}(b) = 1/2 − √(1 − 4K(b))/2. We
introduce (yet) another notation,

Q(b) = H( (1 − √(1 − 4b)) / 2 )

and thus we have that Q(K(b)) = H(f^{1/ρ}(b)). Finally we write

G(f) = ∫ K(b) Q(K(b)) db / ∫ K(b) db

Now we have that

∇f G(b) = ∇K G(b) ∇f K(b)    (6.10)
where

∇K G(b) = ( 1 / ∫ K(b) db ) ( Q(K(b)) + K(b) ∂Q(K(b))/∂K(b) − G(f) )    (6.11)

∇f K(b) = (1/ρ) f^{1/ρ−1} (1 − 2f^{1/ρ})    (6.12)
We are interested in studying the behavior of ∇f G(b). By considering the places in which
∇f G(b) = 0 we will be able to tell where it is positive and where it is negative. By (6.10) we can
study the terms ∇K G(b) and ∇f K(b) separately. First consider ∇f K(b). From (6.12) we see that
∇f K(b) = 0 only when f^{1/ρ} = 1/2, i.e. F = 1/2. The behavior of this term is determined by
the value of ρ: for positive ρ's, ∇f K(b) > 0 unless f^{1/ρ} = 1/2; on the other hand, if ρ < 0 then
∇f K(b) < 0 whenever f^{1/ρ} ≠ 1/2.

Now we consider ∇K G(b). Looking at (6.11) we note that

Q(K(b)) + K(b) ∂Q(K(b))/∂K(b)

is monotone increasing. Thus there is a point b0, which Freund et al [48] referred to as the pivot
point, such that for any b < b0 we have ∇K G(b) < 0 while for b > b0 we have ∇K G(b) > 0.

To summarize: ∇K G(b) < 0 for b < b0 and ∇K G(b) > 0 for b > b0, while ∇f K(b) < 0
when ρ < 0 and ∇f K(b) > 0 when ρ > 0. We now treat three cases separately: the
first case is ρ < 0, the second is ρ > 0 and last we consider the case ρ = 0.
1. Assume ρ < 0. In this case ∇f K(b) < 0 and thus

∇f G(b) = ∇K G(b) ∇f K(b) is { > 0 when b < b0 ; < 0 when b > b0 }

Let F be a ρ-admissible function and let f = F^ρ. We will show that unless f is linear, there
is some ψ and ǫ > 0 such that F̂ = (f + ǫψ)^{1/ρ} is ρ-admissible and G(F) > G(F̂). Assume
that f is non-linear; we consider two cases: the first is when the non-linearity occurs in
the range [−∞, b0], and the second is when f is linear on [−∞, b0].
(a) Assume that f is non-linear on [−∞, b0]. Let −∞ < b1 < b0 be such that f is non-linear
on [b1, b0]. Since f is convex we have that for any b ∈ [b1, b0]

f(b) ≥ ((b − b1)/(b0 − b1)) f(b0) + ((b0 − b)/(b0 − b1)) f(b1)

Let

ψ(b) = { 0 when b ∉ [b1, b0] ; ((b − b1)/(b0 − b1)) f(b0) + ((b0 − b)/(b0 − b1)) f(b1) − f(b) when b ∈ [b1, b0] }

We note the following: f + ψ is convex and monotone, so that (f + ψ)^{1/ρ} is ρ-admissible.
Furthermore, ψ(b) = 0 for b > b0, while for b < b0 we have ψ(b) ≤ 0
and this inequality is strict at least on some parts of the range [b1, b0] (see sub-figure
1(a) in figure 6.2 on page 81 for an illustration). Finally, since ψ has finite support,
∫_{−∞}^{0} ψ²(b) db < ∞. Using all these facts we conclude that

∫_{−∞}^{0} ∇f G(b) ψ(b) db < 0

and since

G(f + ǫψ) = G(f) + ǫ ∫_{−∞}^{0} ∇f G(b) ψ(b) db + ǫ² O( ∫_{−∞}^{0} ψ²(b) db )

we have that for small enough ǫ

G(f + ǫψ) < G(f)
(b) Assume that f is linear for b < b0 but f is non-linear. Therefore for b < b0 we have
that f(b) = βb + α for some α and β. Consider the following ψ:

ψ(b) = { 0 when b < b0 ; min( (1/2)^ρ − f(b), βb + α − f(b) ) when b0 ≤ b < 0 ; (1/2)^ρ − f(b) when b = 0 }

We note the following: f + ψ is monotone and convex². Since f + ψ ≤ (1/2)^ρ we have
that (f + ψ)^{1/ρ} is ρ-admissible, and for any 0 ≤ ǫ ≤ 1 the same holds, i.e. (f + ǫψ)^{1/ρ}
is ρ-admissible (see sub-figure 1(b) in figure 6.2 on page 81 for an illustration). ψ has
the following properties: clearly ψ(b) = 0 when b < b0, while ψ(b) ≥ 0 when b > b0
and this inequality is somewhere strict (since f is non-linear). Since ψ has finite support,
∫_{−∞}^{0} ψ²(b) db < ∞, and thus using the same argument as we used in the previous scenario,

G(f + ǫψ) < G(f)

for small enough ǫ.
The two cases we considered here show that unless f is linear, G(f) > G(f + ǫψ) for some
ǫ and ψ such that (f + ǫψ)^{1/ρ} is ρ-admissible. This shows that the minimum of G(·) is
achieved for linear f's.

² Note that (f + ψ)^{1/ρ} may be a singular CDF; it may have a positive mass at the point b = 0.
6.5. Proof of Theorem 6.4
4.5
81
b
4.5
1
4
f
f+ψ
3.5
3.5
b
0
3
2.5
2
2
1.5
1.5
−80
−60
−40
1(a)
−20
0
0.5
0.4
0
1
−100
−80
−60
−40
1(b)
−20
0
0.5
0.4
b
0
0.3
b
0
0.3
0.2
0.2
0.1
0
−100
b
3
2.5
1
−100
f
f+ψ
4
0.1
f
f+ψ
−80
−60
−40
2(a)
−20
0
0
−100
f
f+ψ
−80
−60
−40
2(b)
−20
0
Figure 6.2: Illustrations for the proof of theorem 6.4 The four different cases considered in
the proof. Sub figure 1(a) and 1(b) demonstrate the cases where ρ < 0 while sub-figures 2(a) and
2(b) demonstrate positive ρ’s. In each figure a non-linear f is presented together with the
modified f + ψ as described in the proof of theorem 6.4.
that f (b) =
1 ρ
2
(1 − b) but by a simple change of argument the same result holds for any
admissible linear f ). And thus we conclude that for any F which is ρ-admissible for ρ < 0
G (F ) ≥ G (ρ)
2. We now consider the case when ρ > 0. Let F be ρ-admissible and assume that F is defined
over the range [b1 , 0] for some finite b1 . Let f = F ρ and assume that f is non-linear. Again
we will consider two cases: the first case we will consider is when f is non linear on [b0 , 0]
and the second case is when f is linear on [b0 , 0] but still not linear.
82
Chapter 6. Theoretical Analysis of Query By Committee
(a) Assume that f is non linear on [b0 , 0]. Let ψ be as follows



0
when b < b0
ψ (b) =

 b−b0 f (0) + −b f (b0 ) when b ≥ b0
−b0
−b0
We note the following: f + ψ is monotone increasing and convex. Since (f + ψ) ≤
1/ρ
1 1/ρ
we have that (f + ψ)
is ρ-admissible, and for any 0 ≤ ǫ ≤ 1 the same holds,
2
1/ρ
i.e. (f + ǫψ)
is ρ-admissible (See sub-figure 2(a) in figure 6.2 on page 81 for an
illustration). ψ has the following properties: clearly ψ (b) = 0 when b < b0 while
ψ (b) ≤ 0 when b > b0 and this inequality is somewhat strict (since f is non-linear).
R0
Since ψ has final support −∞ ψ 2 (b) db < ∞ and thus using the same argument as we
used in previous scenarios, recalling that ▽f G (b) > 0 when b > b0 and thus
G (f + ǫψ) < G (f )
for small enough ǫ.
(b) Assume that f is linear on [b0 , 0] but non linear on [b1 , b0 ]. Assume that f (b) = βb + α
for b ∈ [b0 , 0]. We define
ψ (b) = βb + α − f (b)
1/ρ
We note the following: f + ψ is monotone increasing and convex and (f + ǫψ)
is
ρ-admissible for any 0 ≤ ǫ ≤ 1 (See sub-figure 2(b) in figure 6.2 on page 81 for an
illustration). ψ has the following properties: ψ (b) = 0 when b > b0 while ψ (b) ≥ 0
when b < b0 and this inequality is somewhat strict (since f is non-linear). Since ψ
R0
has final support −x1 ψ 2 (b) db < ∞ and thus using the same argument as we used in
previous scenarios, recalling that ▽f G (b) < 0 when b > b0 and thus
G (f + ǫψ) < G (f )
for small enough ǫ.
The two cases we considered here show that unless f is linear G (f ) > G (f + ǫψ) for some
1/ρ
ǫ and ψ such that (f + ǫψ)
is ρ-admissible. This shows that the minimum of G (·) is
achieved for linear f ’s. We already computed G (f ) for linear f in Lemma 6.5 (we assumed
ρ
that f (b) = 12 (1 + b) for b ∈ [0, 1] but by a simple change of argument the same result
holds for any admissible linear f ). We also assumed that f has finite support, but since this
holds for any finite range and from the continuity of G (·) this result holds for any f . Thus
6.5. Proof of Theorem 6.4
83
we conclude that for any F which is ρ-admissible for ρ < 0
G (F ) ≥ G (ρ)
3. The case where ρ = 0 can be treated in the same way as we treated the cases when ρ > 0 or
ρ < 0. However, this argument can be avoided since if F is 0-concave (i.e. log-concave), it is
also ρ-concave for any ρ < 0. Therefore
G (F ) ≥ sup G (ρ)
ρ<0
on the other hand, in Lemma 6.7 we show a log-concave F for which G (F ) = G (0). Combining
these facts together completes the proof.
Lemma 6.5 Let Fρ (b) =
1
2
1/ρ
(1 − b)
for b ≤ 0 and −1 < ρ < 0. Then G (Fρ ) = G (ρ) where G (ρ)
is as defined in theorem 6.4.
The CDF
1
2
1/ρ
(1 − b)
is the “typical” ρ-concave function when ρ is negative. In Lemma 6.5 we
calculate the information gain for these CDFs.
Proof. This is a pure calculation. Let F = Fρ then
R0
F (b) (1 − F (b)) H (F (b)) db
−∞
G (F ) =
R0
−∞ F (b) (1 − F (b)) db
(6.13)
We treat the numerator and denominator separately.
Z 0
Z 0
1
1
1
1/ρ
1/ρ
1/ρ
F (b) (1 − F (b)) H (F (b)) db =
1 − (1 − b)
H
db
(1 − b)
(1 − b)
2
2
−∞
−∞ 2
Z ∞
1 1/ρ
1
1 1/ρ
=
1 − b1/ρ H
db
x
b
2
2
2
1
Z 1/2
= −
b (1 − b) H (b) H (x) 2ρ bρ−1 ρdb
0
Z 1/2
= −ρ2ρ
bρ (1 − b) H (b) db
"Z0
1/2
ρ2ρ
b1+ρ (1 − b) ln (b) db
=
ln 2 0
#
Z 1/2
2
+
bρ (1 − b) ln (1 − b) db
(6.14)
0
Now we look at the two integral terms in the last expression:
Z 1/2
Z 1/2
1+ρ
b
(1 − b) ln bdb =
b1+ρ − b2+ρ ln bdb
0
0
=
b3+ρ
b2+ρ
−
2+ρ 3+ρ
1/2 Z
ln b −
0
0
1/2
b1+ρ
b2+ρ
−
2+ρ 3+ρ
db
84
Chapter 6. Theoretical Analysis of Query By Committee
1/2
b3+ρ 2 −
2
(2 + ρ)
(3 + ρ) 0
=
2−3−ρ
2−2−ρ
ln 2 −
ln 2 −
3+ρ
2+ρ
=
2−3−ρ
2−2−ρ
2−2−ρ
2−3−ρ
−
ln 2 −
ln 2 +
2
2
3+ρ
2+ρ
(3 + ρ)
(2 + ρ)
b2+ρ
Looking at the second term in (6.14)
Z
0
1/2
2
bρ (1 − b) ln (1 − b) db
=
Z
1
1/2
ρ
b2 (1 − b) ln bdb
Using Taylor expansion, we can write
ρ
(1 − b) =
∞
X
Γ (ρ + 1) (−1)n n
b
Γ (ρ − n + 1) n!
n=0
where Γ (·) is the gamma function. Using the Taylor expansion we have
Z 1 X
Z 1
∞
n
Γ (ρ + 1) (−1) n+2
ρ
2
b
ln (b) db
b (1 − b) ln (b) db =
1/2 n=0 Γ (ρ − n + 1) n!
1/2
Z
∞
X
Γ (ρ + 1) (−1)n 1 n+2
b
ln (b) db
=
Γ (ρ − n + 1) n! 1/2
n=0
1
∞
n n+3 X
Γ (ρ + 1) (−1)
1
b
=
ln b −
Γ (ρ − n + 1) n! n + 3
n + 3 1/2
n=0
=
∞
X
Γ (ρ + 1) (−1)n
Γ (ρ − n + 1) n!
n=0
2−n−3
−
+
ln 2
2
2
n+3
(n + 3)
(n + 3)
2−n−3
1
Looking at the denominator of (6.13) we have
Z
F (b) (1 − F (b)) db
=
=
=
=
1
1
1/ρ
1/ρ
1 − (1 − b)
db
(1 − b)
2
−∞ 2
Z ∞
1 1/ρ
1 1/ρ
−
1− b
db
b
2
2
1
Z ∞
1 2/ρ 1 1/ρ
db
b − b
4
2
1
∞
b2/ρ+1
b1/ρ+1 −
4 (2/ρ + 1) 2 (1/ρ + 1) Z
0
1
=
=
=
1
1
− 8
2
ρ +2
ρ +4
1
1
ρ
−
2 + 2ρ 8 + 4ρ
3+ρ
ρ
4 (1 + ρ) (2 + ρ)
Finally we can write
G (F ) =
ρ2ρ
ln 2
2−2−ρ
2−3−ρ
2−2−ρ
2−3−ρ
−
ln 2 −
ln 2 +
2
2
3+ρ
2+ρ
(3 + ρ)
(2 + ρ)
!
6.5. Proof of Theorem 6.4
85
∞
n
X
Γ (ρ + 1) (−1)
+
Γ (ρ − n + 1) n!
n=0
2−n−3
2−n−3
2 −
2 + n + 3 ln 2
(n + 3)
(n + 3)
1
!!
/ρ
3+ρ
4 (1 + ρ) (2 + ρ)
which is, by simple algebra G (ρ).
Lemma 6.6 Let Fρ (b) =
1
2
1/ρ
(1 + b)
for b ∈ [−1, 0] and ρ > 0. Then
G (Fρ ) = G (ρ)
where G (ρ) is as defined in theorem 6.4.
Proof. Recall that
G (Fρ ) =
R0
−∞
This is a pure calculation
Z
0
−1
Fρ (b) (1 − Fρ (b)) H (Fρ (b)) db
=
=
=
Fρ (b) (1 − Fρ (b)) H (Fρ (b)) db
R0
F (b) (1 − Fρ (b)) db
−∞ ρ
1
1
1
1/ρ
1/ρ
1/ρ
1 − (1 + b)
H
db
(1 + b)
(1 + b)
2
2
−1 2
Z 1
1 1/ρ
1
1 1/ρ
1 − b1/ρ H
db
b
b
2
2
0 2
Z 1/2
−
b (1 − b) H (b) 2ρ bρ−1 ρdb
Z
0
0
=
=
Z 1/2
−ρ2ρ
bρ (1 − b) H (b) db
0
"Z
#
Z 1/2
1/2
ρ2ρ
2
1+ρ
ρ
b
(1 − b) ln (b) db +
b (1 − b) ln (1 − b) db
ln 2 0
0
From (6.14) in Lemma 6.5 we know that this equals
ρ2ρ
ln 2
2−3−ρ
2−2−ρ
2−2−ρ
2−3−ρ
−
ln 2 −
ln 2 +
2
2
3+ρ
2+ρ
(3 + ρ)
(2 + ρ)
∞
n
X
Γ (ρ + 1) (−1)
+
Γ (ρ − n + 1) n!
n=0
2−n−3
2−n−3
2 −
2 + n + 3 ln 2
(n + 3)
(n + 3)
1
!!
Looking at the denominator in the definition of information gain we have
Z 0
Z 0
1
1
1/ρ
1/ρ
1 − (1 + b)
db
(1 + b)
Fρ (b) (1 − Fρ (b)) db =
2
−1
−1 2
Z 1
1 1/ρ
1 1/ρ
=
1− b
db
b
2
0 2
Z
Z
1 1 1/ρ 1 1 2/ρ
=
b −
b db
2 0
4 0
1
1
1
1
=
b1+1/ρ −
b1+2/ρ 2 (1 + 1/ρ)
4 (1 + 2/ρ)
0
0
1
1
−
=
2 + 2/ρ 4 + 8/ρ
3+ρ
= ρ
4 (1 + ρ) (2 + ρ)
By simple algebra the result stated in the lemma is obtained.
86
Chapter 6. Theoretical Analysis of Query By Committee
Lemma 6.7 Let F0 (b) = 21 eb for b ≤ 0 and −1 < ρ < 0. Then
G (F0 ) = G (0) =
1
7
+
9 18 ln 2
where G (ρ) is as defined in theorem 6.4.
Proof. Recall that
G (F0 ) =
R0
−∞
This is a pure calculation
Z
0
−∞
F0 (b) (1 − F0 (b)) H (F0 (b)) db
=
F0 (b) (1 − F0 (b)) H (F0 (b)) db
R0
F (b) (1 − F0 (b)) db
−∞ 0
Z
0
−∞
1/2
=
Z
1
1 b
1 b
e 1 − eb H
e db
2
2
2
(1 − b) H (b) db
0
= −
= −
=
=
Z
1/2
0
Z
2
(1 − b) log (1 − b) + b (1 − b) log (b) db
1
1/2
2
b log (b) db −
7
1
−
72 ln 2 24
1
7
+
24 48 ln 2
Z
1/2
b log (b) db +
0
Z
1/2
b2 log (b) db
0
1
1
1
1
− − −
+ − −
8 16 ln 2
24 72 ln 2
Looking at the denominator we have
Z
0
−∞
F0 (x) (1 − F0 (x)) dx
=
=
=
=
=
1 x
1 x
e 1 − e dx
2
−∞ 2
Z 0
Z
1 0 2x
1
ex dx −
e dx
2 −∞
4 −∞
0
1 x0
1 1 2x (e |−∞ −
e
2
4 2 −∞
1 1
1
(1 − 0) −
−0
2
4 2
3
8
Z
0
Thus we conclude
G (F0 ) =
1
24
+
7
48 ln 2
3
8
=
1
7
+
9 18 ln 2
Lemma 6.8 Let f be a concave function such that f +Ψ is concave as well. Then for any ǫ ∈ [0, 1]
the function f + ǫΨ is concave.
Let fˆ be a convex function such that fˆ+ Ψ̂ is convex as well. Then for any ǫ ∈ [0, 1] the function
fˆ + ǫΨ̂ is convex.
6.6. Summary
87
Proof. Let f be concave and Ψ be such that f + Ψ is concave as well. Let x1 and x2 be two points,
let γ ∈ [0, 1] and let ǫ ∈ [0, 1].
(f + ǫΨ) (λx1 + (1 − λ) x2 ) =
ǫ (f + Ψ) (λx1 + (1 − λ) x2 ) + (1 − ǫ) f (λx1 + (1 − λ) x2 )
≤
ǫ (λ (f + Ψ) (x1 ) + (1 − λ) (f + Ψ) (x2 )) + (1 − ǫ) (λf (x1 ) + (1 − λ) f (x2 ))
=
λ (f + ǫΨ) (x1 ) + (1 − λ) (f + ǫΨ) (x2 )
This proves the first part of the lemma. To see that the same works for convex functions, let fˆ
be convex and let fˆ + Ψ̂ be convex as well. We apply to the first part of the lemma with f = −fˆ
and Ψ = −fˆ + −Ψ̂ to get the stated result.
6.6
Summary
In this chapter we have studied the fundamental properties of Query By Committee. First we
defined information gain which is a function of a concept class, a prior over this class and a
distribution over the sample space. We showed that when there is a lower bound on the expected
information gain, QBC learns exponentially fast with respect to the number of queries it makes.
This pace is exponentially better than any passive learner, as these learners learn in a polynomial
rate. Next we demonstrated cases in which there is a lower bound on the expected information
gain. We studied the class of parallel planes. We showed that the expected information gain is
lower bounded when there is a prior which is ρ-concave, and the distribution over the sample space
is of a special type. Freund et al [48] showed that this lower bound on the class of parallel planes
can be translated into a lower bound on the expected information gain when learning homogeneous
linear classifiers when the prior is uniform and the distribution over the sample space is uniform
(theorem 4 in [48]).
Chapter 7
The Bayes Model Revisited
The QBC algorithm and its analysis as presented in chapter 6 assumed that there is a known
prior over the concept class. This assumption is usually referred to as the Bayesian assumption.
However, in many cases, the knowledge of this prior is not present. In this chapter we show how
this assumption can be weakened and in some cases lifted. We look at three different scenarios
and use different tools in each one of them.
7.1
PAC-Bayesian Techniques
McAllester [89] presented the PAC-Bayesian theory. In his work, the Bayesian assumption is
regarded as a way to present prior knowledge or preferences. We use the same technique here to
show how the Bayesian assumption can be lifted in some cases.
Theorem 7.1 Let C = {c1 , c2 , . . .} be a countable concept class with VC-dimension d. Let w1 , w2 , . . .
P
be a set of positive weights such that wi ≥ 0 and
wi = 1. Let D be a distribution over the sample
space such that there exists a lower bound g > 0 on the expected information gain of QBC when
P
learning with the prior ν such that ν (S) = i∈S wi , and the distribution D.
Assume that the termination rule of the QBC (algorithm 5) is specified by tk =
8
ǫδ
ln π
2
(k+1)2
.
3ǫδ
Let the Gibbs classifier cGibbs be defined using the version space V used by QBC when terminating.
g
g log 1+
16 log 16 −g
g
2
2 d+1
8
Let δ > 0, let g̃ =
and let
,
let
k
≥
max
,
log
ln
2
4
g
δ
g̃
δ
ǫ>
2
(k+1)2
8ek ln dπ 24ek
+
δd
88
g̃k
d+1
ln 2
2−g̃k/(d+1)
7.2. Symmetry
89
Then for any ci ∈ C, with a probability of 1 − wδi over the choice of the sample and the internal
randomness of QBC; the algorithm will use at most k queries and will return a hypothesis cGibbs
such that
Pr cGibbs (x) 6= c (x) ≤ ǫ
x
Proof. From theorem 5.3 it follows that in the conditions as described above
h h
ii
Pr EQBC,x1 ,x2 ,... Pr cGibbs (x) 6= c (x) > ǫ < δ
x
c∼ν
(7.1)
Let ci be the target concept and define pi such that
pi
=
=
i
h EQBC,x1 ,x2 ,... Pr cGibbs (x) 6= ci (x) > ǫ
x
i
h Pr
Pr cGibbs (x) 6= ci (x) > ǫ
QBC,x1 ,x2 ,... x
i.e. pi is the probability that QBC will fail when learning the target concept ci . Using this definition
in (7.1) we have that
X
i
wi pi ≤ δ
and therefore, for any i we have that pi ≤ δ/wi . The number of queries used follows employing
the same argument.
Theorem 7.1 shows how the Bayesian assumption can be lifted and converted into a weight or
significance assigned to each concept in the class. Although we assumed that the concept class is
finite, it is possible to extend this result to general classes using the same techniques as presented
in [89].
7.2
Symmetry
In this section we lift the Bayesian assumption when learning linear classifiers with a uniform
distribution over the sample space. This is based on the perfect symmetry in this class. Assume
that QBC is learning homogeneous d dimensional linear classifiers. Each concept is represented as
a unit vector w ∈ IRd and each instance is a unit vector x ∈ IRd where the classification rule is
cw (x) = sign (w · x). Freund et al [48] showed that there is a uniform lower bound on the expected
information gain of QBC when learning this class once there is a uniform distribution over the
sample space and a uniform prior over the concept class. Using the results presented in Chapter 6
this implies fast learning rates for the QBC algorithm in this setting. Here we show that the Bayes
assumption can be lifted in this case. This is due to the symmetry of this problem.
90
Chapter 7. The Bayes Model Revisited
In Theorems 6.1 and 6.2 we showed that the error rate of QBC decreases exponentially fast
when there is a lower bound on the expected information gain. We showed it for several variant
of the QBC algorithm and several methods for evaluating success. The argument presented here
applies to all these cases. Instead of repeating these theorems we will state the following theorem
in general terms.
Theorem 7.2 Assume that C is the class of d-dimensional homogeneous linear classifiers. Let the
sample space X be the unit sphere in IRd and assume that D is the uniform distribution over X .
When QBC is applied in this setting, all the results presented in theorems 6.1 and 6.2 apply for any
concept in the class and not only on average (or with a probability) over the choice of the concept.
Proof. Let cw and cw′ be two homogeneous linear classifiers such that w and w′ are unit vectors.
Let T be the rotation transformation such that T (w) = w′ . We will use the fact that the uniform
distribution over the unit sphere is rotation invariant and thus if S is a set in the unit sphere then
the measure of S equals the measure of T (S).
The QBC algorithm is a random algorithm. We assume that it gets 3 inputs: a sequence of
unlabeled instances, an oracle that is capable of providing the labels of instances and a sequence
of random bits. By providing the algorithm with a sequence of random bits as an input, we can
look at the QBC algorithm as a deterministic algorithm.
∗
For a concept cw let ∆ (cw ) ⊆ X ∗ × {0, 1} be the set of inputs on which QBC fails when
learning the concept cw . Note that the definition of “failure” varies, as shown in theorems 6.1 and
6.2 however the result we present here applies to all these definitions.
∗
Let T be a rotation. For {(x1 , x2 , . . .) , (r1 , r2 , . . .)} ∈ X ∗ × {0, 1} we define
T ({(x1 , x2 , . . .) , (r1 , r2 , . . .)}) = {(T (x1 ) , T (x2 ) , . . .) , (r1 , r2 , . . .)}
and extend this definition such that
T (∆ (cw )) = {T ({(x1 , x2 , . . .) , (r1 , r2 , . . .)}) : {(x1 , x2 , . . .) , (r1 , r2 , . . .)} ∈ ∆ (cw )}
The main observation is that if w′ = T (w) then ∆ (cw′ ) = T (∆ (cw )). We define the measure
∞
∗
µ over X ∗ × {0, 1} to be the product measure of D∞ with B 21
where B (·) here stands for the
Bernoulli measure. Since µ is rotation invariant, because D is rotation invariant, then
µ (∆ (cw′ )) = µ (T (∆ (cw ))) = µ (∆ (cw ))
this implies that the probability of failing to learn the concept cw equals the probability of failing
to learn the concept cw′ .
7.3. Incorrect Priors and Distributions
91
Assume that the probability of failure of the QBC algorithm when averaging over the target
concept is bounded by δ. Recall that the probability of failure is
Pr
cw ∼Uniform,X ∗ ∼D ∗ ,{0,1}∗
[QBC fails]
= Ecw ∼Uniform
Pr
[QBC fails]
∗
X ∗ ∼D ∗ ,{0,1}
= Ecw ∼Uniform [µ (∆ (cw ))]
Z
=
u (w) µ (∆ (cw )) dw
where u (w) is the density of the uniform distribution. Since u (w) is constant, and as we saw
µ (∆ (cw )) is constant as well it follows that
∀w
Pr
cw ∼Uniform,X ∗ ∼D ∗ ,{0,1}∗
[QBC fails] = µ (∆ (cw ))
Finally, since
Pr
cw ∼Uniform,X ∗ ∼D ∗ ,{0,1}∗
[QBC fails] ≤ δ
we have that
∀w µ (∆ (cw )) ≤ δ
7.3
Incorrect Priors and Distributions
The QBC algorithm assumes the knowledge of both a prior over the concept class and a distribution
over the target concept. In this section we discuss the case in which we have an estimate of the
prior and distribution, but these estimates need not be accurate. We show that if these estimates
are reasonably close to the true priors, then the QBC algorithm will tolerate the incorrect priors.
Thus, the exponential learning rates of QBC which were demonstrated in the fundamental theorem
of QBC (Theorem 6.1) remain1 .
First we define a measure of proximity between probabilities.
Definition 7.1 A probability measure µ is λ far from a probability measure µ′ if for any measurable
set A,
λ−1 µ′ (A) ≤ µ (A) ≤ λµ′ (A)
Using this definition we note that if QBC was used with the assumption that the prior over the
concept class is ν which is λc far from the true prior ν ′ and the distribution over the sample space
is assumed to be D which is λx far from the true distribution D′ , then the performance of QBC
1 In
this section we revisit theorem 5 in [48].
92
Chapter 7. The Bayes Model Revisited
does not degrade by much. The following theorem is the equivalent of the fundamental theorem
of the QBC algorithm (Theorem 6.1). It shows that even if QBC is used with incorrect priors, it
still has exponential learning rates.
Theorem 7.3 Let there be a prior ν over C which is λc far from the prior ν ′ over C. Let D be
a distribution over the sample space X which is λx far from D′ . Let g = G (ν, D) > 0 be a lower
!
b
g
b
g log 1+
g
16 log 16 −b
bg
−4 −2
bound on the expected information gain of QBC. Let gb = λc λx g and g̃ =
4
and
k ≥ max
8
2 d+1
2
ln ,
log
gb2 δ
g̃
δ
Let δ > 0 then if the true prior and distribution are ν ′ and D′ , while QBC assumes that the
prior is ν then with a probability of 1 − 2δ, QBC will use at most k queries for labels and
m0 =
d g̃k/(d+1)
2
e
unlabeled instances when learning and will return a hypothesis with the following properties (depending on the termination rule used):
1. If QBC is used with the Bayes optimal classification rule, it will return a hypothesis (the
Bayes optimal hypothesis) such that
h h
ii
Ec∼ν Pr cBayes (x) 6= c (x) ≤ ǫ
x
for any
2
ǫ>
2ek −g̃k/(d+1) π 2 (k + 1)
2
ln
dλ2c
6δ
2. If QBC is used with the Gibbs average termination rule, it will return an hypothesis such that
for any
h i
EGibbs,c∼ν Pr cGibbs (x) 6= c (x) ≤ ǫ
x
ǫ>
4ek −g̃k/(d+1) π 2 (k + 1)2
2
ln
dλ2c
6δ
3. If QBC is used with the Gibbs “typical” error termination rule, it will return a hypothesis
such that
Pr cGibbs (x) 6= c (x) ≤ ǫ
x
for any
ǫ>
2
(k+1)2
+
8ek ln dπ 24ek
δdλ2c
g̃k
d+1
ln 2
2−g̃k/(d+1)
7.3. Incorrect Priors and Distributions
93
4. If QBC is used with Bayes point machine termination rule, it will return a hypothesis such
that
h i
Ec∼ν Pr cBPM (x) 6= c (x) ≤ ǫ
x
for any
ǫ>
2 (e − 1) ek −g̃k/(d+1) π 2 (k + 1)2
2
ln
dλ2c
6δ
Proof. In lemma 7.1 we show that even though the QBC uses wrong priors, the expected informa−2
tion gain from the next query is uniformly lower bounded by λ−4
c λx g. In lemma 7.3 we show that
if QBC did not query for labels for a while, then the hypothesis it will use will be a good approximation of the target concept. Using these two lemmas, and following the proof technique of the
fundamental theorem of the QBC algorithm (theorem 6.1 on page 65) the proof is completed.
Lemma 7.1 Let there be a prior ν over C which is λc far from the prior ν ′ over C. Let D be a
distribution over the sample space X which is λx far from D′ . Let g = G (ν, D) > 0 be lower bound
on the expected information gain of QBC. If the true prior and distribution are ν ′ and D′ , while
−2
QBC assumes that the prior is ν then the expected information gain is at least λ−4
c λx g.
Proof. First we apply to lemma 7.2 to obtain the following
R ν (V + (x)) ν (V − (x))
ν (V + (x))
dD (x)
ν(V )
ν(V ) H
ν(V )
g =
R ν(V + (x)) ν(V − (x))
ν(V )
ν(V ) dD (x)
R ν ′ (V + (x)) ν ′ (V − (x))
ν (V + (x))
λ2c
dD (x)
H
ν ′ (V )
ν ′ (V )
ν(V )
≤
R ′ + (x)) ν ′ (V − (x))
λc−2 ν (V
ν ′ (V )
ν ′ (V ) dD (x)
+
′
−
′
R ν (V (x)) ν (V (x))
ν (V + (x))
dD (x)
ν ′ (V )
ν ′ (V ) H
ν(V )
= λ4c
R ν ′ (V + (x)) ν ′ (V − (x))
ν ′ (V )
ν ′ (V ) dD (x)
Since D is λx far from D′ we have that with a probability of 1
′
λ−1
x dD (x) ≤ dD (x) ≤ λx dD (x)
and thus
g
≤ λ4c
R
ν ′ (V + (x)) ν ′ (V − (x))
ν ′ (V )
ν ′ (V ) H
λx
≤
λ4c
R
R
ν (V + (x))
ν(V )
ν ′ (V + (x))
ν ′ (V − (x))
ν ′ (V )
ν ′ (V )
ν ′ (V + (x)) ν ′ (V − (x))
ν ′ (V )
ν ′ (V ) H
λ−1
x
R
dD (x)
dD (x)
ν (V + (x))
ν(V )
ν ′ (V + (x)) ν ′ (V − (x))
′
ν ′ (V )
ν ′ (V ) dD
dD′ (x)
(x)
94
Chapter 7. The Bayes Model Revisited
=
λ4c λ2x
R
ν ′ (V + (x)) ν ′ (V − (x))
ν ′ (V )
ν ′ (V ) H
R
which completes the proof.
ν (V + (x))
ν(V )
ν ′ (V + (x)) ν ′ (V − (x))
′
ν ′ (V )
ν ′ (V ) dD
dD′ (x)
(x)
Lemma 7.2 Let ν be λc far from γ ′ . Let x be an instance and V be a version space. Denote
by V + (x) (and V − (x)) the concepts in the version space that assign x with the label +1 (or −1)
respectively. Then
λ−2
c
′
+
ν ′ (V + (x)) ν ′ (V − (x))
ν (V + (x)) ν (V − (x))
(x)) ν ′ (V − (x))
2 ν (V
≤
≤
λ
c
ν ′ (V )
ν ′ (V )
ν (V )
ν (V )
ν ′ (V )
ν ′ (V )
′
Proof. First note that ν (V + (x)) ≤ λc ν ′ (V + (x)) and ν (V ) ≥ λ−1
c ν (V ) and thus
λc−2
′
+
ν ′ (V + (x))
ν (V + (x))
2 ν (V (x))
≤
≤
λ
c
ν ′ (V )
ν (V )
ν ′ (V )
′
2
Let z ∈ [0, 1] and let z ′ be such that λ−2
c ≤ z /z ≤ λc . It is easy to verify that
λc−2 z ′ (1 − z ′ ) ≤ z (1 − z) ≤ λ2c z ′ (1 − z ′ )
By setting z =
ν (V + (x))
ν(V )
and z ′ =
ν ′ (V + (x))
ν ′ (V )
we complete the proof.
Lemma 7.3 Let there be a prior ν over C which is λc far from the prior ν ′ over C. Let D be
a distribution over the sample space X which is λx far from D′ . Assume the true prior and
distribution are ν ′ and D′ , while QBC assumes that the prior is ν and the distribution is D then
1. Assume that QBC is used with tk =
2
ǫ
ln π
2
(k+1)2
6δ
instead of the value defined in algorithm 5.
Let the Bayes classifier cBayes be defined using the version space V used by QBC when
terminating and the prior ν. Then with a probability of 1 − δ over the sample and the
internal randomness of QBC,
h h
ii
Ec∼ν ′ |V Pr cBayes (x) 6= c (x) ≤ λ2c ǫ
x
2. Assume that QBC is used with tk =
4
ǫ
ln π
2
(k+1)2
6δ
instead of the value defined in algorithm 5.
Let the Gibbs classifier cGibbs be defined using the version space V used by QBC when
terminating and the prior ν. Then with a probability of 1 − δ over the sample and the
internal randomness of QBC,
h i
EGibbs,c∼ν Pr cGibbs (x) 6= c (x) ≤ λ2c ǫ
x
7.3. Incorrect Priors and Distributions
3. Assume that QBC is used with tk =
8
ǫδ
95
ln π
2
(k+1)2
3ǫδ
instead of the value defined in algorithm 5.
Let the Gibbs classifier cGibbs be defined using the version space V used by QBC when
terminating and the prior ν. Then with a probability of 1 − δ over the choice of the sample,
the target hypothesis and the internal randomness of QBC,
Pr cGibbs (x) 6= c (x) ≤ γ 2 ǫ
x
4. Assume that QBC is used with tk =
2(e−1)
ǫ
ln π
2
(k+1)2
6δ
instead of the value defined in al-
gorithm 5. Let the concept class be the class of linear classifiers and let the prior ν be
log-concave. Let the Bayes Point Machine classifier cBPM be defined using the version space
V used by QBC when terminating. Then with a probability of 1 − δ over the sample and the
internal randomness of QBC,
h i
Ec∼ν Pr cBPM (x) 6= c (x) ≤ λ2c ǫ
x
Proof.
1. Assume that QBC made k queries for labels to generate the version space V . Assume
that QBC did not query for any additional label for tk consecutive instances after making
the k’th query. Let cBayes be the Bayes classifier, then


 +1 if Prc∼ν|V [c (x) = +1] ≥ 1/2
cBayes (x) =

 −1 if Prc∼ν|V [c (x) = −1] > 1/2
Arrange x and c such that cBayes (x) 6= c (x). From the definition of the Bayes classifier it
follows that if we pick a random hypothesis c′ from the distribution ν|V then with a probabilh
i
ity of at least 1/2 we will have c′ (x) 6= c (x). Therefore, if we denote by cBayes (x) 6= c (x)
the indicating function then
Ec′ ∼ν|V [c′ (x) 6= c (x)] ≥
i
1h
cBayes (x) 6= c (x)
2
for any c and x.
h
i
Assume that Ec∼ν ′ |V,x cBayes (x) 6= c (x) > ǫλ2c . Thus,
ǫλ2c
<
=
=
h
i
Ec∼ν ′ |V,x cBayes (x) 6= c (x)
h
i
Ex Pr′
cBayes (x) 6= c (x)
c∼ν |V
h
i

Prc∼ν ′ |V cBayes (x) 6= c (x) ∩ (c ∈ V )

Ex 
Prc∼ν ′ |V [c ∈ V ]
96
Chapter 7. The Bayes Model Revisited
i
cBayes (x) 6= c (x) ∩ (c ∈ V )

Ex λ2c
Prc∼ν|V [c ∈ V ]
h
i
λ2c Ec∼ν|V,x cBayes (x) 6= c (x)

≤
=
Prc∼ν|V
h
and therefore
Ec,c′ ∼ν|V,x [c′ (x) 6= c (x)] >
ǫ
2
this means that the probability that QBC will not query for the label of the next instance
h
i
is at most 1 − 2ǫ . Hence, if Ec∼ν ′ |V,x cBayes (x) 6= c (x) > ǫλ2c the probability that QBC
will not query for a label for the next tk consecutive instance is at most
by choosing tk =
2
ǫ
ln π
2
(k+1)2
6δ
1−
ǫ
ǫ tk
≤ e− 2 tk
2
we get that the probability that QBC will not query for tk
consecutive labels when the Bayes classifier is not “good enough” is
6δ
.
π 2 (k+1)2
By summing
over k the proof is completed.
2. The proof for the Gibbs classifier follows the same pattern as theorem 5.2. From item 1 in
lemma 7.3 we have that using the choice of tk that
h h
ii
Ec∼ν Pr cBayes (x) 6= c (x) ≤ λ2c ǫ/2
x
Since Haussler, Kearns and Schapire [51] proved that the average error of the Gibbs classifier
is at most twice as large as the error of the Bayes classifier, the statement of the theorem
follows.
3. This follows immediately from the previous item and the Markov inequality. From the choice
of tk we have that with a probability of 1 − δ/2
h i
EGibbs,c∼ν Pr cGibbs (x) 6= c (x) ≤ λ2c ǫδ/2
x
(7.2)
Therefore, from the Markov inequality, if (7.2) holds, we have with a probability of 1 − δ/2
that
Pr cGibbs (x) 6= c (x) ≤ λ2c ǫ
x
4. The proof follows immediately from item 1 in lemma 7.3 and theorem 7.1 on page 88.
7.4. Summary
7.4
97
Summary
In this chapter we revisited the Bayesian assumption underlying the QBC algorithm and its analysis. We showed that this assumption is not as strong as it might appear at first glance. In many
cases it may be lifted, as we showed in section 7.1 and section 7.2 or weakened, as we showed
in section 7.3. We conclude that the knowledge of the prior from which the target concept was
chosen, or even the existence of such a prior, is not critical for the QBC to exhibit fast learning
rates.
Chapter 8
Noise Tolerance
In the discussion of the Query By Committee algorithm so far, we have made several assumptions.
We studied the Bayesian assumption in Chapter 7. In this chapter we revise yet another assumption
we made; namely that we learn in a noise free environment. We assumed that there is a target
concept which is a deterministic function of the inputs; i.e. we assumed that there exists a concept
c such that for any given input x, the concept c assigns the “true” label c (x) in a deterministic
way. However, this assumption is doubtful for various reasons. First, many concepts we may wish
to learn are non-deterministic by nature. Moreover, noise that can be caused by human errors or
communication problems might corrupt the labels we see (see the more extensive discussion about
noise on Chapter 3).
Version-space based algorithms such as QBC are sensitive to noise. A single misclassified
instance will cause the target concept to be eliminated from the version-space and thus lead to
poor results. Therefore, if QBC is ever to be used on real data, it must be made less sensitive
to such effects. In this dissertation we provide two methods for coping with this problem. In the
current chapter we present a “soft” version of the QBC algorithm and analyze it. We show that
√
k)
when certain conditions apply, the error of the hypothesis return decreases as e−O(
where k is
the number of queries for label made. A more practical approach is presented in Chapter 10 where
kernels are used to overcome noise as well as some other practical problems. The advantage of the
method we present here is its theoretical soundness. However it does not (yet) have any practical
implementation.
In section 8.1 we introduce the “soft” version of the QBC algorithm. In section 8.2 we revise
the notation of Information Gain to suit the new setting. In section 8.3 we use the newly proposed
98
8.1. “Soft” QBC
99
way of measuring information to analyze the “soft” QBC. We wrap-up in section 8.4.
Note that Sollitch and Saad [119] conducted a preliminary study of the impact of noise on
active learning, although their work mainly focused on the behavior of the algorithm when the
sample size grows to infinity and less on the practical scenario.
The work presented in this chapter is based on collaboration with Scott Axelrod, Shai Fine,
Shahar Mendelson and Naftali Tishby.
8.1
“Soft” QBC
We begin our discussion by defining the model in which we are working. Noise and uncertainty can
come in different forms. The first model we consider is when the target concept is deterministic.
The noise in this case corrupts the communication channel between the learner and the oracle that
answers the learner’s queries. In this case, the source of noise is external. The second case we
consider is when the target concept is non-deterministic in itself. In this case the noise is internal.
8.1.1
The Case of Learning with Noise
In many cases, the concepts to be learned are deterministic, but noise corrupts our observations.
Noise can differ in nature; it can be random classification noise, where the noise is equal over all
the sample space and independent of the target concept. In other cases the noise may tend to have
greater impact near the decision boundary (see [39] for a comprehensive survey about learning in
the presence of noise).
We use the following notation: Let the set of labels Y be finite and let W be a parameterization
of the concept class. For each w ∈ W and x ∈ X a distribution p (y|w, x) is defined where the
underlying concept to be learned cw is such that
cw (x) = arg max p (y|w, x)
y∈Y
Therefore, given any ǫ > 0, the objective of our learning process is to find some w ∈ W such
that
Pr [cw (x) 6= c∗ (x)] < ǫ
x∼D
where c∗ is the target concept.
100
8.1.2
Chapter 8. Noise Tolerance
The case of stochastic concepts
A scenario we shall not pursue any further is when the concepts we are trying to learn are stochastic.
Thus, there is no perfect mapping between instances and labels. In this case one might think of
various criteria for generalization. Some of the possibilities are to minimize the loss with respect to
different Lp norms or to apply the Kullback Leibler divergence. For the sake of simplicity we will
not present all the possibilities in this direction. However, we note that the algorithm we present
below can be easily adjusted to include various such criteria, and the proofs follow the same path
as those presented here.
8.1.3
A variant of the QBC algorithm
Here, we focus on the noise model as described in 8.1.1, i.e. the noise is external to the system and
corrupts the communication channel between the teacher and the learner. Let ǫ > 0 and δ > 0 be
the accuracy and confidence parameters specified by the user. The version of the QBC algorithm
which is capable of managing noise is presented in Algorithm 6.
We will present two facts about this “soft” version of the QBC algorithm (for brevity, henceforth
the abbreviation SQBC). First, we show that the hypothesis returned by the algorithm is indeed
a good approximation of the target concept. Second, we show that if SQBC is allowed to issue k
√
queries for labels then the generalization error of the hypothesis it returns is e−O( k) . Recall that
√ a passive learner using k queries will have a generalization error of O 1/ k in the same setting
(see e.g. [6] Theorem 5.2).
Theorem 8.1 Let ǫ > 0 and δ > 0. Assume that cw∗ is the target concept and that cw is the
concept the SQBC returned. Then the probability that cw is not a good approximation for cw∗ , i.e.,
that
Pr
x∼D
arg max p (y|w, x) 6= arg max p (y|w∗ , x) > ǫ
y∈Y
y∈Y
is less than δ, when the probability is with respect to the internal randomness of SQBC, the random
sample used for learning, the random labels and the random target concept (Bayesian assumption).
The proof is similar to the ones presented in Chapter 5 where we studied the QBC algorithm.
Proof. Define the set of “bad” pairs of parameters
W = (w1 , w2 ) s.t. Pr arg max p (y|w1 , x) 6= arg max p (y|w2 , x) > ǫ
x∼D
y∈Y
y∈Y
8.1. “Soft” QBC
101
Algorithm 6 “Soft” Query By Committee (SQBC)
Inputs:
• Required accuracy ǫ.
• Required confidence 1 − δ.
• A prior ν over the parameter class W .
Output:
• A hypothesis cw .
Algorithm:
1. Let ν1 = ν.
2. Let k ← 0.
3. Let l ← 0.
4. For t = 1, . . .
(a) Receive an unlabeled instance xt .
(b) Let l ← l + 1.
(c) Select w1 ∼ νt and w2 ∼ νt .
(d) If arg maxy∈Y p (y|w1 , x) 6= arg maxy∈Y p (y|w2 , x) then
i.
ii.
iii.
iv.
Query for the label y of x.
Let k ← k + 1.
Let l ← 0.
Let νt+1 be the posterior over W given all the labels seen so far; i.e. for U ⊆ W
using Bayes rule we have
νt+1 (U ) =
ν (U ) Pr [y1 , . . . , yt+1 are the observed labels given that w∗ ∈ U ]
Pr [y1 , . . . , yt+1 are the observed labels]
(e) else
i. Let νt+1 ← νt .
(f) If l ≥ tk where tk =
i. Select w ∼ νt+1
ii. Return cw .
2
ǫδ
log 2k(k+1)
then
δ
102
Chapter 8. Noise Tolerance
The algorithm fails if the target concept cw∗ and the hypothesis cw returned by SQBC are such
that (w∗ , w) form a “bad” pair. Recall that w∗ is randomly picked from the prior ν. We allow the
teacher (“the adversary”) extra power and allow it to choose the target concept only at the end of
the learning process with the only restriction being that the concept is chosen using the posterior
over the labels it presented while the QBC was learning.
Hence, we may assume that the selection of w∗ was made using the posterior defined by the
given labels. Therefore, both the algorithm and the teacher use the same probability measure
which is the posterior to select w and w∗ respectively. There are two possible sources for failure in
this case. First, SQBC may terminate when W is too big in a probabilistic sense. The second case
of failure is when W is small but nevertheless, the target concept and the hypothesis returned by
SQBC form a “bad” pair. We show that the probability of any of these cases is less than δ/2.
Let νt be the posterior. If νt2 (W ) ≤ δ/2 then we are done, since the probability that w and
w∗ form a bad pair is bounded by δ/2. On the other hand, assuming that νt2 (W ) > δ/2, then
we argue that the probability of observing a long sequence of instances for which SQBC will not
issue a query for label is small. Under this assumption, the probability of selecting a triplet
x, w1 , w2 ∼ D × νt × νt such that arg maxy∈Y p (y|w1 , x) 6= arg maxy∈Y p (y|w2 , x) is greater than
ǫδ/2. Hence, the probability of tk consecutive instances without a query after seeing k labels is
bounded by
t
ǫδ k
1−
≤ e−ǫδtk /2 ,
2
and by setting tk =
2
ǫδ
log 2k(k+1)
it follows that the probability that the QBC does not query for
δ
tk consecutive instances is less than δ/2k(k + 1). Summing over the possible values of k we get
that the probability of failure is bounded by δ/2.
The theorem above shows that the SQBC algorithm is sound. We now explore the number
of queries used in the learning processes. We use the same technique as Freund et. al. [48] and
analyze the information gained by the algorithm during the process. However, before we do this
we need to adapt the notion of information gain to the new setting we are dealing with. In the
next section we introduce the refined information gain and study some of its properties. Later in
this chapter we use these tools to study the sample complexity of SQBC.
8.2. Information Gain Revisited
8.2
103
Information Gain Revisited
A fundamental problem in learning theory is bounding the information gained by an example about
the unknown target concept. This problem is most critical in the context of active learning, when
the learner has to select the most informative examples to be labeled in order to minimize the
number of labels required.
The Mutual Information allows one to measure the average knowledge one gains about another
random variable, B, by knowing the value of one random variable A. However, in concrete learning
cases one is interested in a more precise measure; namely, how much does a specific value a tells
us about B. Here we present an information measure which quantifies the amount of information
an observation a of the random variable A gives about the state of the random variable B. We
show that with high probability this specific mutual information is bounded by the logarithm of
the covering number of B (see definitions 8.1 and 8.2), and establish a version of the Information
Processing Inequality suitable for this quantity. Later we will use the information a label contains
about the target concept to measure the information gain by the SQBC algorithm.
The mutual information measures the amount of information one random variable contains
about another random variable [32]. If a random variable A takes values in the set A and the
random variable B takes values in the set B, the mutual information is defined by the following
formula:
I (A; B) =
Z
p (a, b) log
A×B
p (a, b)
d (a × b)
p (a) p (b)
and it can be rewritten as1
I (A; B) =
Z
A
p (a)
Z
p (b|a)
db da
p (b|a) log
p (b)
B
(8.1)
A reasonable question is what a specific observation a ∈ A can tell us about the other variable
B.
Let’s consider for example that one is interested in knowing whether it rained over night. The
observation one gets can be the moisture on the ground in the morning. If the ground is dry
then we can be pretty sure that it wasn’t raining. If, however, the ground is wet, then it might
have rained, but it is also possible that the sprinklers were working during the night and caused
the ground to be wet. Clearly, different observations, or values of the same variable, can provide
different amounts of information.
1 We
assume that p(a|b) belongs to an appropriate L1 space.
104
Chapter 8. Noise Tolerance
There exists a natural definition for the specific information value we are after. Indeed, by
looking at (8.1) we come up with the definition:
I (a; B) =
Z
p (b|a) log
B
p (b|a)
db
p (b)
(8.2)
This should be read as the information that the observation a ∈ A gives about the random
variable B. This quantity has some nice properties:
1. It is non-negative, since from (8.2) one can see that I (a; B) is a Kullback Leibler divergence
[70] between two distributions.
2. I (a; B) is a measurable function due to Fubini’s theorem.
3. The expected value of the information from an observation is the mutual information, i.e.,
EA [I (a; B)] = I (A; B).
Before proceeding we need to define some notations. We begin by defining a distance measure
between two instances of a random variable B.
Definition 8.1 The distance2 between two instances b1 and b2 of a random variable B with respect
to the random variables A1 , . . . , Am over A1 , . . . , Am is
ρm (b1 , b2 ) = max sup |p (ai |b1 ) − p (ai |a2 )|
1≤i≤m ai ∈Ai
Given a distance measure, one can define the covering number which counts the number of
balls of radius ǫ needed to cover the space:
Definition 8.2 If B is a random variable over B and ρ is a (pseudo)-metric on B, then for any
ǫ > 0 the ǫ-covering number is the smallest number of balls of radius ǫ (with respect to the distance
measure ρ) needed to cover B. We denote this value by N (B, ǫ, ρ).
Note that in the deterministic case, when p (ai |b) is either zero or one, this definition takes a
simple form: ρ (b1 , b2 ) is zero if the two states assign the same values to the observations and it
is 1 otherwise. Here, for every radius ǫ < 1, the ǫ covering numbers are simply the number of
equivalence classes. Hence, if the observations are labels assigned to different sample points and if
d
B has a VC-dimension d, then by Sauer’s Lemma it follows that N (B, ǫ, ρm ) ≤ em
.
d
2 Actually ρ
m is a semi-distance since it is possible that b1 6= b2 while ρm (b1 , b2 ) = 0. This has no significance
throughout the paper.
8.2. Information Gain Revisited
8.2.1
105
Observations of the State of a Random Variable
Let us assume that we are interested in the random variable B which takes values b ∈ B. We have
some observations of the random variables Ai . Each random variable Ai receives values ai ∈ Ai .
Q
We assume that the Ai ’s are mutually independent given B, i.e. the p (a1 , . . . , am |b) = p (ai |b).
This is often the case in learning from examples. To see this, let W be a parameterization of a
concept class. Let x1 , . . . , xm be a fixed set of instances, then for any w ∈ W, we have that
p (y1 , . . . , ym |w, x1 , . . . , xm ) =
Y
p (yi |w, x1 , . . . , xm ) =
Y
p (yi |w, xi )
We are interested in measuring the contribution of the labels y1 , . . . , ym to our knowledge about
the random variable W .
Haussler and Opper [52] have studied this question and presented the relationship between the
information and metric entropy. However, they studied the average case; i.e. “what is the amount
of information regarding the state of the world that a general set of observations captures?”. The
question we are interested in is “how much information does a specific set of observations capture
on the state of a random variable?”. Another difference between Haussler and Opper results and
the result presented here is the distance measure used. Haussler and Opper used the Hellinger
distance measure whereas we use an infinity norm. This allows us to use the results of Alon et.
al. [2] which bound the metric entropy with respect to this norm using the Pollard dimension and
the Fat-Shattering dimension of the space.
The first result we present shows that the information from a set of observations is essentially
bounded in the sense that with high probability it is bounded by the covering number.
Theorem 8.2 Let m > 2 and let A1 , . . . , Am be a set of observed random variables. Let B be
a random variable. Assume that there is some γ > 0 such that for any ai ∈ Ai and any b ∈ B,
p (ai |b) ≥ γ. Denote by a(m) = (a1 , . . . , am ) then
2γ
1
(m)
Pr
I a ; B ≤ log N B, 2 , ρm + 2 + log
≥1−δ
m
δ
a(m) ∼A(m)
where ρm is as defined in Definition 8.1.
Note that in the deterministic case, the assumption that there is a positive lower bound on
p (ai |θ) is not necessary. In fact, if B has VC-dimension d then with a probability of at least 1 − δ,
1
I a(m) ; B ≤ d log em
d + 2 + log δ which is similar to the bound presented in [48, Lemma 3].
Proof. of Theorem 8.2
106
Recall that
Chapter 8. Noise Tolerance
Z p a(m) |b
(m)
(m)
db,
log
I a ; B = p b|a
p a(m)
hence, by Jensen’s inequality (or annealed approximation )
Z (m)
|b
(m)
(m) p a
db.
I a ; B ≤ log p b|a
p a(m)
(8.3)
Taking the expected value of the integral in (8.3) with respect to the observations and applying
Fubini’s Theorem, it follows that
"Z
"
#
#
(m)
|b
p a(m) |b
(m) p a
db = Eb∼B Ea(m) ∼A(m) |b
Ea(m) ∼A(m)
p b|a
(8.4)
p a(m)
p a(m)
T
Let B1 , . . . , Br be a disjoint cover of B (i.e., B = ∪Bi and if i 6= j then Bi Bj = ∅), such that
each Bi has diameter smaller than
2γ
m2
with respect to the metric ρm . Thus,
b, b′ ∈ Bi =⇒ ∀j, aj ∈ Aj
|p (aj |b) − p (aj |b′ )| ≤
2γ
m2
Using this definition we rewrite the expected value in (8.4) as
"
"
#
#
Z
r
X
p a(m) |b
p a(m) |b
=
db
EB EA(m) |b
p (b|Bi ) EA(m) |b
P (B i )
(m)
p a(m)
p
a
B
i
i=1
(8.5)
(8.6)
We shall bound the integral on Bi for each 1 ≤ i ≤ r separately. Let i be such that P (Bi ) > 0.
Note that for each b, b′ ∈ Bi we have that
=
p a(m) |b′
≥
Y
p (ai |b′ )
Y
2γ
p (ai |b) − 2 ,
m
thus,
p a(m)
=
≥
≥
=
Z
Z
p (b′ ) p a(m) |b′ db′
p (b′ ) p a(m) |b′ db′
B
Z i
Y
2γ
p (ai |b) − 2 db′
p (b′ )
m
Bi
Y
2γ
p (ai |b) − 2 .
P (Bi )
m
Q
Since p a(m) |b = p (ai |b) it follows that
p a(m) |b
1 Y
p (ai |b)
≤
2γ .
(m)
P (Bi )
p a
p (ai |b) − m
2
(8.7)
Recall that p (ai |b) ≥ γ, hence
γ
1
p (ai |b)
2γ ≤
2γ =
1 − m22
p (ai |b) − m2
γ − m2
(8.8)
8.2. Information Gain Revisited
107
Clearly, for m > 2,
2
e− m ≤ 1 −
2
2
2
+ 2 ≤ 1 − 2.
m m
m
(8.9)
Hence,
2
1
p (ai |b)
1
m
2γ ≤
2 ≤ −2 =e ,
m
1
−
p (ai |b) − m
e
2
2
m
and using (8.7)
p a(m) |b
e2
≤
.
P (Bi )
p a(m)
Therefore,
EB EA(m) |b
"
#
p a(m) |b
p a(m)
≤
X
P (Bi )
i : P (Bi )>0
e2
P (Bi )
≤ re2
(8.10)
(8.11)
Now, recall the definition of r in (8.5) and conclude that
EA(m) EB|a(m)
"
#
p a(m) |b
p a(m)
"
#
p a(m) |b
= EB EA(m) |b
p a(m)
2γ
≤ N B, 2 , ρ e2
m
(8.12)
By Markov’s inequality,
p(a(m) |b)
2γ
1
PA(m) EB|a(m)
≥ N B, 2 , ρm e2 ≤ δ,
δ
m
p(a(m) )
thus, by (8.3)
2γ
1
PA(m) I a(m) ; B ≥ log N B, 2 , ρm + 2 + log
≤ δ,
m
δ
as claimed.
An immediate consequence of the proof of theorem 8.2 is a bound on the mutual information
as presented in the following corollary:
Corollary 8.1 Assume that the conditions of Theorem 8.2 hold. Then
I (A1 , . . . , Am ; B) ≤ log N
2γ
B, 2 , ρm + 2.
m
108
8.2.2
Chapter 8. Noise Tolerance
Information Processing Inequality
A fundamental property of mutual information is the Information Processing Inequality. The
information processing inequality asserts that when data are processed, the mutual information
can only decrease. More formally, for any function g the following holds
I (A; B) ≥ I (g (A) ; B)
(8.13)
As a corollary, if A1 , . . . , Am , B are random variables then for any J ⊆ [1, m]
I (A1 , . . . , Am ; B) ≥ I (AJ ; B)
where AJ = {Aj }j∈J . Nevertheless, as we move to the setting of information from observations,
the situation is more complex. A subset of the observation could contain more information on the
target variable than all the observations. However it is possible to prove a slightly weaker version
of the information processing inequality.
Theorem 8.3 Information Processing Inequality
Let m > 2 and put A1 , . . . , Am to be a set of observed random variables which are mutually
independent given the random variable B. Assume further that each Ai can take only a finite set
of values, and that there is some γ > 0 such that for any ai ∈ Ai and any b ∈ B, p (ai |b) ≥ γ.
Then, for any τ
Pr
a(m) ∼A(m)
[∃J s.t. I (aJ ; B) ≥ τ + 1] ≤
h i
1
I a(m) ; B ≥ τ m log
(m)
(m)
γ
a
∼A
Pr
where aJ = {aj }j∈J .
Theorem 8.3 shows that in a sense, the information processing inequality is valid for the setting
described here with high probability. In the proof of this theorem we make a specific use of the fact
that γ > 0. However, in the deterministic case this assumption is superfluous since the information
is monotonic, thus
∀J ⊆ {1, . . . , m}
I (a1 , . . . , am ; B) ≥ I (aJ ; B)
Before we prove this theorem we derive an immediate corollary
Corollary 8.2 In the setting of theorem 8.3, Let δ > 0 then
#
"
m log γ1
2γ
<δ
Pr
∃J s.t. I (aJ ; B) ≥ N B, 2 , ρm + 3 + log
m
δ
a(m) ∼A(m)
8.2. Information Gain Revisited
109
Corollary 8.2 follows from Theorem 8.2 and Theorem 8.3 by choosing
τ =N
m log γ1
2γ
B, 2 , ρm + 2 + log
m
δ
We now turn to prove Theorem 8.3.
Proof. of Theorem 8.3.
Assume there exists J ⊆ {1, . . . , m} such that
I (aJ ; Θ) > τ + 1
(8.14)
and let J = {1, . . . , m} \ J. We will examine all the possible values of aJ .
Note that by Fubini’s Theorem
EAJ |aJ
At the same time,
"
"
##
h i
(m)
p
a
|b
I a(m) ; B
= EAJ |aJ EB|a(m) log
p a(m)
"
"
##
p a(m) |b
= EB|aJ EAJ |b log
p a(m)
log
p a(m) |b
p a(m)
hence
= log
p (aJ |b) p aJ |b
p (aJ ) p aJ
p aJ |b
p (aJ |b)
= log
+ log
p (aJ )
p aJ
h i
EAJ |aj I a(m) ; B
= I (aJ ; B) + I AJ ; B|aJ
(8.15)
≥ I (aJ ; B)
Note that the second term in (8.15) is a mutual information and thus non-negative. Define
QaJ = Pr
AJ |aJ
h i
I a(m) ; B ≥ τ
For every a(m) we have that I a(m) ; B ≤ m log γ1 since Ai is finite and hence p a(m) |b ≤ 1
and on the other hand p a(m) ≥ γ m . Therefore, from (8.15) it follows that
I (aJ |B)
h i
≤ EAJ |aJ I a(m) ; B
1
≤ τ + m log
QaJ
γ
110
Chapter 8. Noise Tolerance
1
Thus, if I (aJ ; B) ≥ τ + 1 then QaJ ≥ m log
1 . Therefore,
γ
h i
h i
Pr I a(m) ; B ≥ τ
≥ Pr I a(m) ; B ≥ τ and ∃J I (aJ ; B) ≥ τ + 1
A(m)
A(m)
h i
= Pr I a(m) ; B ≥ τ | ∃J s.t. I (aJ ; B) ≥ τ + 1 ×
A(m)
Pr [∃J s.t. I (aJ ; B) ≥ τ + 1]
A(m)
≥
1
Pr [∃J s.t. I (aJ ; B) ≥ τ + 1]
m log γ1 A(m)
Thus we obtain
h i
1
Pr [∃J s.t. I (aJ ; B) ≥ τ + 1] ≤ Pr I a(m) ; B ≥ τ m log
γ
A(m)
A(m)
8.3
SQBC Sample Complexity
In order to analyze the SQBC algorithm we are about to use the information of observation as
a replacement for the information gain used in the analysis of QBC. Note that this is a natural
extension as though the concepts were deterministic, i.e. no noise in the system, in which case the
information gain is equivalent to the information of observation.
Let x̄ = {x1 , x2 , . . .} be a sequence of instances. For the sake of our discussion we will assume
that this sequence is fixed. The label yi of the instance xi tells us something about the target
concept c. Using the terminology of the previous section, yi is an observation of the state of the
target concept which is the random variable3 C. We apply the same technique as in Chapter 6. We
will show that with high probability, for any subset J ⊆ {1, . . . , m}, the information from {yj }j∈J
is not too high. We will argue that when certain conditions apply, SQBC queries for labels with
high information content and thus it will not issue too many queries. This will lead to large gaps
between consecutive queries for labels which will lead SQBC to terminate successfully as proved
in Theorem 8.1 on page 100.
In the following we rework the definition of information gain and its derivatives.
Definition 8.3 For a sequence of instances x̄ = {x1 , x2 , . . .} ∈ X ∞ , the Information Gain from a
set of labels yJ = {yj }j∈J (where yj is the label of the instance xj ) is I (yj ; C | xJ ). The Expected
Information Gain from the next query for a label is
Ej ∗ ,yj∗ ,{xj }j>max J I yJ∪{j ∗ } ; C xJ∪{j ∗ } − I (yJ ; C | xJ )
where the expectation is taken with respect to the sequence of instances {xj }j>max J , the choice of
the next query point (i.e. the choice of j ∗ ) and the label yj ∗ of xj ∗ .
3 There
is a slight abuse of notation here, since C is the concept class and not a random variable.
8.3. SQBC Sample Complexity
111
Unlike the deterministic case, the information gain is not guaranteed to be non-negative. This,
and other properties of the noisy setting make the analysis more involved than the deterministic noise free case.
The main result we would have liked to establish is presented in Theorem 8.4. However,
we encountered a technical difficulty in the course of the proof, when trying to show that the
information gain is, with high probability, linear in the number of queries. A close analysis of the
proof of the analogous result in [48] reveals a similar gap which was overlooked by the authors.
Though it is possible to close the gap in the noise-free case (as we did in chapter 6) we are still
in the process of adjusting the proof to our setup. Hence, the proof of Theorem 8.4 is presented
under the assumption that conjecture 8.1 holds.
Conjecture 8.1 Assume there exist a lower bound g > 0 on the expected information gain from
the query the SQBC algorithm makes at any step. Then, for any δ > 0 there exist constants Kδ
and g̃ > 0 such that if k > Kδ and if J is the set of size k of indexes of the queries that the SQBC
algorithm made, then
Pr
x̄,ȳ,SQBC
[I (yJ ; C | xJ ) < kg̃] ≤ δ
Conjecture 8.1 is the equivalent of lemma 6.2 on page 68.
The next theorem is the main result in this section. It proves that when certain conditions
√
k)
apply, if SQBC is allowed to issue k queries for label, then it will reach an accuracy of e−O(
.
Theorem 8.4 Assume that Conjecture 8.1 holds. Let W be a set of parameters of a concept
class, such that for w ∈ W the probability of observing the label y for the instance x when the
target concept is parameterized by w is p (y | w, x ). Assume that there exists γ > 0 such that
p (y | w, x ) ≥ γ for all y, w, x. Assume that {p (y | w, x ) | w ∈ W } has a Pollard dimension d. Let
ν be a prior over W and let D be a distribution over X . Assume that there is g > 0 such that for
∗
any finite sample S ∈ (X × Y) the expected information gain of SQBC from the next query given
the sample S, is lower bounded by g. Let δ > 0 and let
k ≥ Kδ
(Kδ and g̃ are as defined in Conjecture 8.1).
Then with a probability of 1 − 3δ, SQBC will issue at most k queries for labels and use
m0 =
γdδ (kg̃)/(36/|Y|d)
e
e log γ1
112
Chapter 8. Noise Tolerance
unlabeled instances when learning and will return a hypothesis with
Pr
SQBC,w∗ ,x1 ,x2 ,...
Pr arg max p y|wSQBC , x 6= arg max p (y|w∗ , x) > ǫ < δ
y
x
for any
ǫ>
y
2ke log γ1
γδ 2 d
log
2k (k + 1) −√(kg̃)/(18|Y|)
e
δ
In the statement of theorem 8.4 we used the Pollard dimension. Here is a definition of the
Pollard dimension (see e.g. [2]).
Definition 8.4 Let F be a set of functions from some space Z to IR. F has a Pollard-dimension
d if the class C = {sign (f ) : f ∈ F } has a VC-dimension d.
An alternative definition for the Pollard dimension is to say that if F has a Pollard-dimension
d, if d is maximal such that there exist z1 , . . . , zd ∈ Z such that for any y1 , . . . , yd ∈ {±1} there
exists f ∈ F with yi f (zi ) > 0 for all i.
Alon et al. [2] showed that if F has a Pollard-dimension d then
N (F , ǫ, ρm ) ≤ 2
4m
ǫ2
d log(2em/(dǫ))
where N (·, ·, ·) is the covering number and ρm is the l∞ distance measure when F is restricted
to m points. Note that it is possible to use the Fat-Shattering-Dimension of Alon et al. [2] here,
instead of the Pollard-dimension to obtained slightly better bounds. For the sake of clarity we
avoid using the Fat-Shattering-Dimension here.
Proof. of Theorem 8.4
Assume that SQBC made k queries. Given that Conjecture 8.1 holds, then with a probability
of 1 − δ, the information SQBC gained is at least kg̃. Let J be the indexes of the queries QBC
made. From the information processing inequality, Theorem 8.3 and Corollary 8.2, we know that
with a probability of 1 − δ
m log γ1
2γ
I (yJ ; W | xJ ) ≤ log N W, 2 , ρm + 3 + log
m
δ
Alon et al. [2] proved that
log N
3
em
2γ
W, 2 , ρm ≤ |Y| d log2
+ log 2
m
γd
and therefore
|Y| d log
2
em3
γd
+ log 2 + 3 + log
m log γ1
δ
≥ kg̃
8.4. Summary
113
Applying a coarse upper bound on the left hand side of the above inequality we have
!
em log γ1
2
≥ kg̃
18 |Y| d log
γdδ
and therefore, with a probability of 1 − 2δ, if SQBC made k queries for labels, then
m≥
If
m
k
γδd √(kg̃)/(18|Y|)
e
e log γ1
> tk is as defined in the SQBC algorithm then the SQBC algorithm will bail out and
as we proved in theorem 8.1 on page 100, when this happens, the returned hypothesis is a good
approximation of the target concept with high probability. Therefore it suffices to require that
2
2k (k + 1)
γδd √(kg̃)/(18|Y|) m
> tk =
log
e
≥
k
ǫδ
δ
ke log γ1
which holds whenever
ǫ>
8.4
2ke log γ1
γδ 2 d
log
2k (k + 1) −√(kg̃)/(18|Y|)
e
δ
Summary
The main question we have attempted to address in this chapter is whether active learning in
general and QBC in particular can be applied in the presence of noise and uncertainty. Although
the discussion presented here is incomplete, there is a reason to believe that active learning can
be applied in the realistic setting where noise and uncertainty exist. Nevertheless, we would like
to mention several key issues that are lacking in the discussion in this chapter. First, we were
not able to prove Conjecture 8.1 on page 111. We believe that this conjecture, with some minor
amendments, is true. However, at this point we were not able to prove it. Second, we do not
show here any concept class which has a lower bound on the expected information gain as required
in Theorem 8.4. Finally, we do not have any practical implementation of the SQBC algorithm.
Nevertheless, in Chapter 10 we present an alternative method of overcoming noise using kernels.
Kernels provide a practical method of applying QBC for real world applications. However, the
theoretical justification of this method is weaker.
The revised concepts of information gain and information from observations presented in Section 8.2 are of interest in their own right. Measuring the information an observation carries about a target random variable can play an important role in diverse applications.
Chapter 9
Efficient Implementation Using Random Walks
The Query By Committee algorithm (Algorithm 5 on page 55) is a very simple and straightforward
algorithm. Whenever a new instance is presented, it draws two random hypotheses from the version
space. If these two hypotheses predict different labels for the instance, then the algorithm queries
for the true label. However, this description belies the difficulty of implementing this algorithm
since drawing random hypotheses from the version space is indeed a non-trivial task.
In this chapter we show how QBC can be implemented in polynomial time when learning linear
classifiers. The main ingredient in our implementation is a reduction of the problem of sampling the
version space to the problem of sampling convex bodies. We show that the sophisticated techniques
developed for sampling from convex bodies provide a solution to the missing components in the
QBC algorithm.
The work presented in this chapter is based on a collaboration with Shai Fine and Eli Shamir.
9.1 Linear Classifiers
The question we address in this chapter is “how can the QBC algorithm be used to learn linear
classifiers?”. We assume that the concept class we are interested in is the class of homogeneous
linear classifiers. The sample space is X = IRd and the concept class is C = cw : w ∈ IRd such
that cw (x) = sign (w · x). The class of linear classifiers is frequently used in modern machine
learning. This class is very powerful once the inputs are mapped from the input space to some
feature space using a non-linear map. In some cases, the inputs are mapped to an infinite dimension
Hilbert space, without affecting the computational complexity of learning in this class (see more
about this in Chapter 10). An important property of homogeneous linear classifiers is that they
are scale free in the sense that if w ∈ IRd and λ > 0 then cw is equivalent to cλw . This is due to
the fact that
cw (x) = sign (w · x) = sign (λw · x) = cλw (x)
Therefore, we may assume that the concept class C is defined solely on the unit ball, i.e. C = {c_w : w ∈ IR^d and ‖w‖ ≤ 1}.
A key observation is that when learning homogeneous linear classifiers, the Version Space is a
bounded convex body at all stages as the following lemma shows.
Lemma 9.1 Let C be the class of homogeneous linear classifiers and let S = {(x_i, y_i)}_{i=1}^m be a finite sample (possibly empty). Then the version space induced by S is a bounded convex body.
Proof. Recall that the class of homogeneous linear classifiers is defined as C = {c_w : w ∈ IR^d and ‖w‖ ≤ 1}.
Therefore, C is isomorphic to the unit ball and thus bounded and convex. The concept cw is in the
version space if
∀i yi (w · xi ) ≥ 0
and thus the version space is the intersection of the unit ball with m linear constraints. Since all these constraints are convex, the version space is convex. Furthermore, since the version
space is a subset of the unit ball, it is bounded.
Therefore, the problem of sampling the version space uniformly at random is reduced to the problem of sampling from convex bodies. In the following section we discuss methods of solving the latter problem.
9.2 Sampling from Convex Bodies
The problem of sampling from convex bodies has been studied for the last two decades in the field
of computational geometry. Given a convex body K, the task is to return a point x ∈ K sampled
from the uniform distribution over K. Any efficient sampling algorithm has many applications.
For example, Bertsimas and Vempala [15] showed how convex optimization problems can be solved
efficiently given such a sampling algorithm.
Elekes [43]¹ proved that it is impossible to sample uniformly from convex bodies. Soon after, Dyer, Frieze and Kannan [41] showed that it is possible to sample approximately uniformly from convex bodies. They showed that given a bounded convex body K and an accuracy parameter ǫ > 0, it is possible to sample x from K such that for any set A ⊆ K

$$\left|\Pr_x[x \in A] - U_K(A)\right| < \epsilon$$

where U_K is the uniform measure over K. We use the notation Pr_x[A] to denote the probability that the sampling algorithm will return a point in the set A. The algorithm presented by Dyer, Frieze and Kannan was polynomial, but its running time was O(d^{>20}), where d is the dimension of the body. Nevertheless, in a series of improvements the efficiency of sampling algorithms was significantly improved, and the most recent algorithm performs the sampling task in O^*(d^3) operations² [85]. Although clear advances have been made, current algorithms are still not practical, as the constants involved are too high. Nevertheless, this is an active research field and we expect better algorithms to follow.
Describing these sampling algorithms is well beyond the scope of this dissertation. Although
many different algorithms have been suggested, all use Markov Chain Monte Carlo (MCMC) at their core. For these MCMCs to work, the convex body must be well rounded. Therefore, not only
should K be bounded, i.e. be contained in a ball of radius R, it must also contain a ball of radius
r. The following theorem summarizes the essentials of sampling from convex bodies.
Theorem 9.1 Let K ⊆ IR^d be a convex body such that there exists a ball of radius R which contains K and there exists a ball of radius r which is contained in K. Then there exists a sampling algorithm such that for any ǫ > 0 the algorithm returns x ∈ K such that for any measurable subset S of K

$$\left|\Pr[x \in S] - U_K(S)\right| < \epsilon$$

and the algorithm works in $\mathrm{poly}\left(d, \log\frac{R}{r}, \log\frac{1}{\epsilon}\right)$ time.
The proof for this theorem can be found in [85] for example. We note that the convex body K
is assumed to be given via a separating oracle. In other words, given a point x the oracle either
returns the answer “x is in K” or returns a hyperplane w such that
$$w \cdot x > \max_{z \in K}\,(w \cdot z)$$
This oracle must be able to compute its answer in polynomial time.
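To make this concrete, here is a minimal sketch, in Python, of such a separating oracle for the version space of homogeneous linear classifiers (Lemma 9.1); the function name and the array-based representation are our own illustrative choices, not part of the sampling literature.

```python
import numpy as np

def separation_oracle(w0, X, y):
    """Separating oracle for V = {w : ||w|| <= 1 and y_i (w . x_i) >= 0}.

    Returns (True, None) if w0 lies in V; otherwise returns (False, h)
    where h is a hyperplane with h . w0 > max_{z in V} h . z.
    """
    if np.linalg.norm(w0) > 1.0:
        # The unit-ball constraint is violated; w0 itself separates, since
        # w0 . w0 = ||w0||^2 > ||w0|| >= w0 . z for every z with ||z|| <= 1.
        return False, w0
    for xi, yi in zip(X, y):
        if yi * np.dot(w0, xi) < 0.0:
            # A label constraint is violated; h = -y_i x_i satisfies
            # h . w0 > 0 while h . z <= 0 for every z in V.
            return False, -yi * xi
    return True, None
```

Each call costs O(md) arithmetic operations for m constraints in d dimensions, so such an oracle is clearly polynomial.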
¹ Elekes was interested in the problem of computing the volume of a convex body. However, the problems of sampling from convex bodies and computing their volumes are closely related.
² The notation O^* indicates that logarithmic factors are ignored.
9.3 A Polynomial Implementation of QBC
Freund et al. [48] showed that QBC learns homogeneous linear classifiers exponentially faster
than passive learners (see also Chapter 6). In Section 7.3 on page 91 we showed that we do
not need to sample exactly from the correct prior and distribution. However, we required that
the approximation be close in a multiplicative sense whereas the sampling algorithm discussed in
Theorem 9.1 has an additive discrepancy. Furthermore, the complexity of the sampling algorithm
depends on the ratio between the radii of a bounding ball and a bounded ball. In this section we
show how these ingredients can be used to support a polynomial time implementation of QBC. The
polynomial implementation is presented as Algorithm 7. It has the same structure as the original
QBC algorithm (Algorithm 5 on page 55) when using the sampling techniques presented above.
Next we would like to prove the efficiency of this algorithm. Efficiency here means two things.
The first is the computational complexity for which we show that the algorithm runs in polynomial
time. The second measure of efficiency is the sample complexity for which we show that the
implementation enjoys exponential learning rates, similar to those of the original QBC.
Theorem 9.2 Let the target concept be a uniformly distributed homogeneous linear classifier. Assume that the distribution over the sample space is uniform, and let 1 − δ be a confidence parameter. Then with a probability of 1 − δ, the following holds:

1. The expected information gain of the queries the polynomial QBC makes is at least g/2, where g is the expected information gain of the original QBC algorithm.
2. There exists

$$\tilde{g} = \frac{g\log\left(1+\frac{g}{8}\right)}{32\log\frac{32}{g}}\, 2^{-g/2} > 0$$

such that for any $k = \Omega\left(\tilde{g}^{-2}\log(1/\delta)\right)$ and

$$\epsilon = \Omega\left(\frac{\tilde{g}\,k\log(dk)}{\delta d^2}\, 2^{-gk/d}\right)$$

the polynomial QBC implementation will return a hypothesis h such that

$$\Pr_x\left[h(x) \neq c(x)\right] \le \epsilon$$
3. It will use k labels and $m_0 = d\,2^{O(gk/d)}$ unlabeled instances.

4. Each iteration of the algorithm will run in $\mathrm{poly}\left(k, \frac{1}{\epsilon}, \frac{1}{\delta}\right)$ time.
Proof. We begin by analyzing the computational complexity of the proposed algorithm. We showed in Theorem 9.1 that sampling from convex bodies can be done in $\mathrm{poly}\left(d, \log\frac{R}{r}, \log\frac{1}{\epsilon}\right)$ time.
Algorithm 7 Polynomial Implementation of QBC
Inputs:
• Required accuracy ǫ.
• Required confidence 1 − δ.
• The dimension of the problem d.
Output:
• A hypothesis h.
Algorithm:
1. Let V1 = C.
2. Let k ← 0.
3. Let l ← 0.
4. For t = 1, . . .
(a) Receive an unlabeled instance xt.
(b) Let l ← l + 1.
(c) Select c1 and c2 uniformly from Vt using a sampling algorithm with additive accuracy ǫt, where ǫt = gδ/(240k(k + 1)tk) and g is a lower bound on the expected information gain of QBC when learning linear separators with the correct priors.
(d) If c1(xt) ≠ c2(xt) then
   i. Query for the label yt = c(xt).
   ii. Let k ← k + 1.
   iii. Let l ← 0.
   iv. Let Vt+1 ← {c ∈ Vt : c(xt) = yt}.
(e) Else let Vt+1 ← Vt.
(f) If l ≥ tk, where tk = (80/(ǫδ²)) ln(10k(k + 1)/δ):
   i. Choose a hypothesis h uniformly from Vt using a sampling algorithm with additive accuracy δ/40.
   ii. Return h.
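For concreteness, the following Python sketch mirrors the control flow of Algorithm 7 under stated assumptions: `sampler(S, accuracy)` stands for any approximately uniform sampler over the version space induced by the labeled set S (e.g. a convex-body sampler as in Theorem 9.1), `stream` and `query_label` are hypothetical interfaces to the unlabeled source and the teacher, and the tk formula is guarded at k = 0 since no query has been made yet.

```python
import numpy as np

def poly_qbc(stream, query_label, g, eps, delta, sampler):
    """Schematic rendering of Algorithm 7; returns a hypothesis, or None
    if the stream is exhausted before the stopping rule fires."""
    S = []          # labeled examples; they induce the version space V_t
    k, l = 0, 0     # queries made so far / instances since the last query
    for x in stream:
        l += 1
        kk = max(k, 1)  # guard: the formulas below assume k >= 1
        t_k = 80.0 / (eps * delta ** 2) * np.log(10 * kk * (kk + 1) / delta)
        eps_t = g * delta / (240 * kk * (kk + 1) * t_k)
        c1, c2 = sampler(S, eps_t), sampler(S, eps_t)
        if np.sign(c1 @ x) != np.sign(c2 @ x):   # the committee disagrees
            y = query_label(x)                   # query the teacher
            S.append((x, y))                     # V_{t+1} = {c : c(x_t) = y_t}
            k, l = k + 1, 0
        elif l >= t_k:                           # t_k quiet rounds: stop
            return sampler(S, delta / 40.0)      # random (Gibbs) hypothesis
    return None
```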
Here, R is the radius of a ball containing Vt and r is the radius of a ball contained in Vt. Clearly, Vt is a
subset of the unit ball and thus we may assume that R = 1. We would like to show that r is not
too small. Let V ∗ be the version space induced by the labels of all the m0 instances. It is clear
that ∀t, V* ⊆ Vt. Therefore, if there is a ball of radius r in V* then the same ball is contained in Vt as well. Thus we will study V*. In Lemma 6.1 we show that for any sequence of m0 instances, the probability that the target concept is such that the probabilistic volume of the version space it induces is smaller than (em_0/d)^{-(d+1)} is at most d/(em_0). Therefore, if m_0 > 10d/(eδ) then with a probability of 1 − δ/10, the measure of the version space is at least (em_0/d)^{-(d+1)} at all times, and thus its volume is at least Vol(B_d)(em_0/d)^{-(d+1)}, where B_d is the d-dimensional unit ball and Vol(B_d) is its volume. In Lemma 9.2 on page 122 we show that for any compact convex body K, such as the version space, there exists a ball of radius r inside the convex body with

$$r \ge \frac{\mathrm{Vol}(K)}{\mathrm{Vol}(B_d)\, d^d R^{d-1}}$$

where R is the radius of a ball containing K. Since the version space is a subset of the unit ball, we can use R = 1 in our case. Since the volume of the version space is at least Vol(B_d)(em_0/d)^{-(d+1)}, we conclude that with probability 1 − δ the version space contains a ball of radius r such that

$$r \ge \frac{d}{(em_0)^{d+1}}$$
Using the bound on r and the bound on m_0, we obtain that the complexity of each iteration is

$$\mathrm{poly}\left(d, gk, \log\frac{1}{\epsilon}, \log\frac{1}{\delta}\right)$$
We now turn to prove that the hypothesis returned by this implementation of the QBC is
indeed a good approximation of the target concept. The proof is very similar to the proof of
Theorem 7.3 on page 92 where we considered the QBC algorithm with incorrect priors. For the
sake of completeness we will show the two main ingredients needed for the proof. We begin by
showing that there is a lower bound on the expected information gain from the next query. Next we
will show that if the algorithm terminated, then the hypothesis returned is a good approximation
of the target concept with high probability.
We begin by analyzing the expected information gain. Let V be the current version space and
let γ be the additive accuracy we require from the sampling algorithm. We have that
$$g \le \frac{\int U_V(V^+(x))\, U_V(V^-(x))\, H(U_V(V^+(x)))\, dD(x)}{\int U_V(V^+(x))\, U_V(V^-(x))\, dD(x)}$$
where UV (·) is the uniform distribution restricted to V . When sampling c from V we are guaranteed
that for any measurable set A:

$$\left| U_V(A) - \Pr_c[c \in A] \right| \le \gamma$$

and thus we have

$$\Pr_c\left[c \in V^+(x)\right] \Pr_c\left[c \in V^-(x)\right] \le \left(U_V(V^+(x)) + \gamma\right)\left(U_V(V^-(x)) + \gamma\right) = U_V(V^+(x))\, U_V(V^-(x)) + \gamma\left(U_V(V^+(x)) + U_V(V^-(x)) + \gamma\right) \le U_V(V^+(x))\, U_V(V^-(x)) + \gamma(1 + \gamma)$$

and since γ < 1 we have

$$\Pr_c\left[c \in V^+(x)\right] \Pr_c\left[c \in V^-(x)\right] \le U_V(V^+(x))\, U_V(V^-(x)) + 2\gamma$$

Repeating the same argument we have that

$$\Pr_c\left[c \in V^+(x)\right] \Pr_c\left[c \in V^-(x)\right] \ge U_V(V^+(x))\, U_V(V^-(x)) - \gamma$$
Let q be the probability that the polynomial QBC will query for the label of the next instance it sees. Clearly

$$q = 2\int \Pr_c\left[c \in V^+(x)\right] \Pr_c\left[c \in V^-(x)\right] dD(x)$$

If q is very small then the algorithm will most likely terminate. Recall that it terminates when no query is made for tk consecutive instances; the probability of this is exactly (1 − q)^{tk}. Assume that

$$q \le \frac{\delta}{20k(k+1)t_k}$$

Since 0 < q ≤ 1/2, we have e^{−2q} ≤ 1 − 2q + 2q² ≤ 1 − q, and therefore

$$(1 - q)^{t_k} \ge e^{-2qt_k} \ge e^{-\delta/10k(k+1)} \ge 1 - \frac{\delta}{10k(k+1)}$$

Hence the probability that the algorithm fails to terminate once q is this small is at most δ/10k(k + 1). By summing over k we get that, with a probability of 1 − δ/10, the algorithm will not make another query after it reaches a state where q ≤ δ/20k(k + 1)tk.
Assume now that q > δ/20k(k + 1)tk. It follows that the probability that QBC, when sampling from the true posterior, would query for the label of the next instance is at least q − 4γ. Since the expected information gain of QBC is at least g, we have that

$$2\int U_V\left(V^+(x)\right) U_V\left(V^-(x)\right) H\left(U_V\left(V^+(x)\right)\right) dD(x) \ge g(q - 4\gamma)$$
and thus

$$2\int \Pr_c\left[c \in V^+(x)\right] \Pr_c\left[c \in V^-(x)\right] H\left(U_V\left(V^+(x)\right)\right) dD(x) \ge g(q - 4\gamma) - 2\gamma \ge gq - 6\gamma$$

Thus the expected information gain of the polynomial QBC is at least

$$\frac{gq - 6\gamma}{q} = g - \frac{6\gamma}{q}$$

By choosing $\gamma = \frac{g\delta}{240k(k+1)t_k}$ and using the fact that $q > \frac{\delta}{20k(k+1)t_k}$, we conclude that the expected information gain is at least g/2.
The lower bound on the expected information gain proves that the number of queries that the polynomial QBC algorithm will make on the sample of size m_0 is $O\left(\frac{d}{\tilde{g}}\log m_0\right)$ (see the arguments in the proof of the fundamental properties of the QBC algorithm, Theorem 6.1). Therefore, the
algorithm will reach, with high probability, a sequence of tk consecutive instances for which it did
not query for a label. We now argue that when this happens, if the algorithm returns a random
hypothesis then it is likely to be a good approximation of the target concept.
Let W ⊆ C × C be the set

$$W = \left\{(c_1, c_2) : D\left(x : c_1(x) \neq c_2(x)\right) > \epsilon\right\}$$

Let p be the probability that (c1, c2) ∈ W when c1 is chosen using the sampling algorithm while c2 is chosen using the true prior. If p ≤ δ/10, then if the polynomial QBC terminates and returns a random hypothesis, it will be a good approximation of the target concept with high probability. Now assume that p > δ/10. We will show that with high probability the QBC algorithm will not terminate.
Since p > δ/10,

$$\Pr_{c_1}\left[U_V\left(c_2 : (c_1, c_2) \in W\right) > \delta/20\right] > \delta/20$$

For each c1 such that U_V(c2 : (c1, c2) ∈ W) > δ/20 we have that if we sample c2 from V, we will hit the set W with high probability, since

$$\Pr_{c_2}\left[(c_1, c_2) \in W\right] > \frac{\delta}{20} - \epsilon^* = \frac{\delta}{40}$$

and thus

$$\Pr_{c_1, c_2}\left[(c_1, c_2) \in W\right] > \frac{\delta^2}{80}$$
By the definition of the set W, the probability that the polynomial QBC will query for the label of the next instance is at least ǫδ²/80. Therefore the probability of tk consecutive instances without a query, assuming that p > δ/10, is less than

$$\left(1 - \frac{\epsilon\delta^2}{80}\right)^{t_k} \le e^{-\epsilon\delta^2 t_k/80}$$

which by the choice of tk is δ/10k(k + 1). By summing over k we get that the probability that QBC will terminate when p ≥ δ/10 is at most δ/10.
The remaining specifics of this proof are identical to the proof of Theorem 6.1 which explores
the original QBC algorithm, and have thus been omitted. Note that there are several possible
causes for failure. However, we showed that the probability for each of these causes is less than
δ/10. Using the union bound we get that the probability of failure is less than δ.
9.4 A Geometric Lemma
While sampling from convex bodies can be done in polynomial time, for this to happen we need to prevent the body from becoming singular, i.e. we need the ratio between the radii of a bounding ball and a bounded ball to be moderate. We use the following lemma to bound this ratio.
Lemma 9.2 Let K be a compact convex body in IR^d which is bounded by a ball of radius R, and let Vol(K) be the volume of this body. Then there exists a ball of radius r inside K such that

$$r \ge \frac{\mathrm{Vol}(K)}{\mathrm{Vol}(B_d)\, d^d R^{d-1}}$$
Proof. Recall that John's theorem [60] states that there exists an ellipsoid E ⊆ K such that K ⊆ dE, where dE is the ellipsoid E blown up by a factor d around its origin. Let λ_1 ≥ λ_2 ≥ · · · ≥ λ_d be the lengths of the principal axes of E. Since λ_d is the smallest, we can place a ball of radius λ_d inside E, centered at E's origin. Thus there is a ball of radius r = λ_d inside K.
The lengths of the principal axes of dE are dλ_1, . . . , dλ_d, and thus the volume of dE is $\mathrm{Vol}(B_d)\prod_{i=1}^{d} d\lambda_i$ where B_d is the d-dimensional unit ball. Since K ⊆ dE we have

$$\mathrm{Vol}(K) \le \mathrm{Vol}(dE) = \mathrm{Vol}(B_d)\prod_{i=1}^{d} d\lambda_i$$

and therefore

$$r \ge \lambda_d \ge \frac{\mathrm{Vol}(K)}{\mathrm{Vol}(B_d)\, d^d \prod_{i=1}^{d-1}\lambda_i}$$
Finally, since K is contained in a ball of radius R, and E is contained in K, we have λ_1, . . . , λ_d ≤ R and thus

$$r \ge \frac{\mathrm{Vol}(K)}{\mathrm{Vol}(B_d)\, d^d R^{d-1}}$$
The significance of this lemma is that it shows that

$$\log\frac{R}{r} = O\left(d\log d + d\log R - \log \mathrm{Vol}(K)\right)$$

Therefore, as expected, log(R/r) remains moderate as long as K occupies a non-negligible portion of the bounding ball of radius R. While the constants in Lemma 9.2 are not tight, it is clear that any bound on r must be $O\left(\mathrm{Vol}(K)/R^{d-1}\right)$. To see this, let R > r > 0 and let o_1, . . . , o_d be a set of orthogonal vectors such that the lengths of o_1, . . . , o_{d−1} are R and the length of o_d is r. Let K be an ellipsoid with o_1, . . . , o_d as its principal directions. Clearly, the minimal ball bounding K is of radius R and the maximal ball bounded in K is of radius r. Furthermore, the volume of K is R^{d−1} r Vol(B_d). Therefore

$$r = \frac{\mathrm{Vol}(K)}{\mathrm{Vol}(B_d)\, R^{d-1}}$$

which is identical to the bound of Lemma 9.2 up to a factor of d^d.
9.5 Summary
In this chapter we showed that the QBC algorithm can be implemented in polynomial time for learning homogeneous linear classifiers. We reduced the problem of implementing the QBC algorithm to the problem of sampling from convex bodies, and used polynomial algorithms for sampling from such bodies. While these algorithms are polynomial, they are still far from practical. We discuss this issue further in the next chapter.
Chapter 10
Kernelizing the QBC
In this chapter we take another step towards making it possible to use QBC for real world tasks.
In chapter 9 we saw that algorithms for sampling from convex bodies can be used to sample from
the version space when learning linear classifiers. While this provided us with a polynomial time
algorithm it is not sufficient because it assumes that the task at hand can be carried out by a linear
classifier. This problem is not unique to active learning and the QBC algorithm. The same problem
is found in classical models such as the Perceptron [97] and more generally in Neural Networks.
The universal way of overcoming this problem is to add a non-linear phase to the model. This is
typically done by mapping the input data through a non-linear activation function. The idea is to map the data to a new space in which it is more likely to be linearly separable.
Further improvement on the idea of mapping the data to a new space and learning in the
new space was made by Vapnik and others [127, 20]. They showed that in many cases, it is not
necessary to explicitly map the data. Rather, this can be done implicitly by using kernels (see
section 10.1 for more about kernels). This observation led to a wave of algorithms utilizing kernels: SVM [20], kernel PCA [111] and others (see e.g. [116] and references therein). Kernels have proved successful in many applications, ranging from speaker identification [45] to predicting arm
movements of monkeys [117].
In this chapter we show how kernels can be used together with the QBC algorithm. To this end, we need to modify the algorithm to enable the use of kernels. The algorithm we present in this
chapter uses the same skeleton as QBC, but replaces sampling from the high dimensional version
space by sampling from a low dimensional projection of it. By doing so, we obtain an algorithm
which can cope with large-scale problems and at the same time permits the use of kernels.
Although the algorithm uses linear classifiers at its core, the use of kernels makes it much broader
in scope. This new sampling method is presented in section 10.2. Section 10.3 gives a detailed
description of the kernelized version, the Kernel Query By Committee (KQBC) algorithm. The
last building block is a method for sampling from convex bodies. We suggest the hit and run [85]
random walk for this purpose in section 10.4. A Matlab implementation of KQBC is available at
http://www.cs.huji.ac.il/labs/learning/code/qbc.
Other algorithms have been suggested for sampling from the version space. Most notable is the Billiard-walk based sampling of Herbrich et al. [55]. Herbrich and his coauthors considered the
problem of sampling the version space when kernels are used. The added value of our method is
two-fold. First, we extend the theoretical reasoning behind the sampling approach. Second, we
suggest using “hit and run” (see section 10.4) instead of the Billiard walk since “hit and run” is
easier to use and is guaranteed to mix fast to the right, i.e. uniform, distribution.
10.1 Kernels

We begin with a brief introduction to kernels. The reader who is familiar with this subject may
wish to skip this section.
Kernels are widely used in modern machine learning. They make it possible to use a unified
learning algorithm for solving a diversity of problems by plugging in different kernels. In this
section we give a brief introduction to the main definitions and properties of kernels.
Definition 10.1 A function K : X × X → IR is a kernel function if there exist a Hilbert space H
and a function ϕ : X → H such that K (x1 , x2 ) = ϕ (x1 ) · ϕ (x2 ).
10.1.1 Commonly Used Kernel Functions
Here is a list of some commonly used kernel functions:
1. The polynomial kernel: For X = IR^d we define the kernel function

$$K(x_1, x_2) = (x_1 \cdot x_2 + c)^p$$

for c ≥ 0 and p ≥ 1.

2. The Gaussian/radial kernel: For X = IR^d we define the kernel function

$$K(x_1, x_2) = e^{-\|x_1 - x_2\|^2/2\sigma^2}$$

for σ ≠ 0.
3. The sigmoid kernel: For X = IRd we define the kernel function
K (x1 , x2 ) = tanh (κx1 · x2 + θ)
for a variety of choices of κ and θ.
4. The ridge kernel [115]: an extension that can be applied to any kernel. Let K be a kernel function; then we define the kernel function

$$\hat{K}(x_1, x_2) = K(x_1, x_2) + \Delta\, \delta_{x_1, x_2}$$

where ∆ ≥ 0 and δ is the Kronecker delta, i.e.

$$\delta_{x_1, x_2} = \begin{cases} 1 & \text{if } x_1 = x_2 \\ 0 & \text{otherwise} \end{cases}$$
Other kernels exist for a variety of sample spaces: string kernels [76], spike kernels [117], Fisher kernels [57] and many others.
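As an illustration, here is a minimal Python sketch of the polynomial, Gaussian and ridge kernels listed above; the function names and default parameter values are our own.

```python
import numpy as np

def polynomial_kernel(x1, x2, c=1.0, p=2):
    # (x1 . x2 + c)^p, with c >= 0 and p >= 1
    return (np.dot(x1, x2) + c) ** p

def gaussian_kernel(x1, x2, sigma=1.0):
    # exp(-||x1 - x2||^2 / (2 sigma^2))
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2.0 * sigma ** 2))

def ridge_kernel(x1, x2, base=gaussian_kernel, ridge=0.1):
    # K(x1, x2) + Delta * delta_{x1,x2}, with delta the Kronecker delta
    return base(x1, x2) + (ridge if np.array_equal(x1, x2) else 0.0)
```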
10.1.2 The Gram Matrix
Many learning algorithms, e.g. SVM [20], need only inner products between instances for training
and generalization. In these cases, it suffices to provide the algorithm with the Gram matrix, which
contains all the inner products between instances:
Definition 10.2 Let x_1, . . . , x_m be instances in a sample space X, and let K be a kernel function over this space. Then the Gram matrix is a symmetric real-valued m × m matrix whose entry in position i, j is K(x_i, x_j).
It follows that any Gram matrix must be positive semi-definite. In other words, if G is a Gram matrix then for any vector w ∈ IR^m:

$$wGw^{\top} \ge 0$$
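A quick numerical illustration of this fact, assuming the linear kernel K(x_1, x_2) = x_1 · x_2 (the data and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                            # five instances in R^3
G = np.array([[np.dot(a, b) for b in X] for a in X])   # Gram matrix G_ij = K(x_i, x_j)
print(np.all(np.linalg.eigvalsh(G) >= -1e-12))         # True: no negative eigenvalues
```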
10.1.3 Mercer’s Conditions
Mercer’s conditions provide necessary and sufficient conditions for a function K : X × X → IR to
be a valid kernel function.
10.1. Kernels
127
Theorem 10.1 A function K : X × X → IR is a kernel function iff for any g(x) such that

$$\int g(x)^2\, dx < \infty$$

it holds that

$$\int K(x_1, x_2)\, g(x_1)\, g(x_2)\, dx_1\, dx_2 \ge 0$$

10.1.4 The Special Case of the Ridge Kernel
The ridge kernel has a unique property: it is generic in the sense that using this kernel every sample
becomes linearly separable. This property is very important since the QBC algorithm assumes that
the learning problem is linearly separable. Therefore, using the ridge kernel we can guarantee the
separability even if noise exists in the system.¹
Lemma 10.1 Let K be a kernel and let $\hat{K}$ be the ridge kernel $\hat{K}(x_1, x_2) = K(x_1, x_2) + \Delta\delta_{x_1,x_2}$ for ∆ > 0. Let $\hat{\varphi}$ be the map associated with $\hat{K}$, i.e.

$$\hat{K}(x_1, x_2) = \hat{\varphi}(x_1) \cdot \hat{\varphi}(x_2)$$

Let S = (x_1, . . . , x_m) be a sample of m unique instances. Then for any labels vector y = (y_1, . . . , y_m) there exists a separator $w \in \mathrm{span}\left(\hat{\varphi}(x_1), \ldots, \hat{\varphi}(x_m)\right)$ such that $w \cdot \hat{\varphi}(x_i) = y_i$ for every i = 1, . . . , m.

Proof. Let G be the Gram matrix associated with K. The Gram matrix $\hat{G}$ associated with the kernel $\hat{K}$ is simply

$$\hat{G} = G + \Delta I$$

where I is the identity matrix. Since G is positive semi-definite and ∆ > 0, it follows that $\hat{G}$ is positive definite and thus invertible. Therefore, for any set of labels y = (y_1, . . . , y_m) it is possible to find a vector α such that $y = \hat{G}\alpha$. Let $w = \sum_j \alpha_j \hat{\varphi}(x_j)$; then

$$w \cdot \hat{\varphi}(x_i) = \sum_j \alpha_j \hat{K}(x_j, x_i) = \hat{G}_i \alpha = y_i$$

where $\hat{G}_i$ is the i'th row of the matrix $\hat{G}$.
¹ The “Soft” QBC uses a direct approach to tackle the noisy case. See Chapter 8 for more details.
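A short numerical check of Lemma 10.1, with the linear kernel playing the role of the base kernel (the data, labels and ∆ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))                  # six unique instances
y = np.array([1., -1., 1., 1., -1., -1.])    # an arbitrary labeling
Delta = 0.5

G = X @ X.T                                  # Gram matrix of the base (linear) kernel
G_hat = G + Delta * np.eye(len(X))           # ridge kernel: positive definite, invertible
alpha = np.linalg.solve(G_hat, y)            # solve y = G_hat alpha

# w = sum_j alpha_j phi_hat(x_j) then satisfies w . phi_hat(x_i) = (G_hat alpha)_i = y_i
print(np.allclose(G_hat @ alpha, y))         # True
```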
10.2 A New Method for Sampling the Version-Space
The Query By Committee algorithm [112] provides a general framework that can be used with any
concept class. Whenever a new instance is presented, QBC generates two independent predictions
for its label by sampling two hypotheses from the version space. If the two predictions differ, QBC
queries for the label of the instance at hand (see algorithm 5 on page 55). The main obstacle
in implementing QBC is the need to sample from the version space (step 4c). It is not clear
how to do this with reasonable computational complexity. As is the case for most research in
machine learning, we first focus on the class of linear classifiers and then extend the discussion
by using kernels. In the linear case, the dimension of the version space is the input dimension
which is typically large for real world problems. Thus direct sampling is practically impossible.
We overcome this obstacle by projecting the version space onto a low dimensional subspace.
Assume that the learner has seen the labeled sample S = {(x_i, y_i)}_{i=1}^k, where x_i ∈ IR^d and y_i ∈ {±1}. The version space is defined to be the set of all classifiers which correctly classify all the instances seen so far:

$$V = \{w : \|w\| \le 1 \text{ and } \forall i\ y_i(w \cdot x_i) > 0\} \qquad (10.1)$$
QBC assumes a prior ν over the class of linear classifiers. The sample S induces a posterior
over the class of linear classifiers which is the restriction of ν to V . Thus, the probability that
QBC will query for the label of an instance x is exactly
$$2\Pr_{w\sim\nu|V}[w \cdot x > 0]\, \Pr_{w\sim\nu|V}[w \cdot x < 0] \qquad (10.2)$$
where ν|V is the restriction of ν to V .
From (10.2) we see that there is no need to explicitly select two random hypotheses. Instead,
we can use any stochastic approach that will query for the label with the same probability as in
(10.2). Furthermore, if we can sample ŷ ∈ {±1} such that
$$\Pr[\hat{y} = 1] = \Pr_{w\sim\nu|V}[w \cdot x > 0] \qquad (10.3)$$

and

$$\Pr[\hat{y} = -1] = \Pr_{w\sim\nu|V}[w \cdot x < 0] \qquad (10.4)$$
we can use it instead, by querying the label of x with a probability of 2 Pr [ŷ = 1] Pr [ŷ = −1].
Based on this observation, we introduce a stochastic algorithm which returns ŷ with probabilities as specified in (10.3) and (10.4). This procedure can replace the sampling step in the QBC
algorithm.
Let S = {(x_i, y_i)}_{i=1}^k be a labeled sample. Let x be an instance for which we need to decide
whether to query for its label or not. We denote by V the version space as defined in (10.1) and
denote by T the space spanned by x1 , . . . , xk and x. QBC asks for two random hypotheses from
V and queries for the label of x only if these two hypotheses predict different labels for x. Our
procedure does the same thing, but instead of sampling the hypotheses from V we sample them
from V ∩ T . One main advantage of this new procedure over the original QBC is that it samples
from a space of low dimension and therefore its computational complexity is much lower. This
is true since T is a space of dimension at most k + 1, where k is the number of label queries QBC has made so far. Hence, the body V ∩ T is a low-dimensional convex body² and thus sampling
from it can be done efficiently. The input dimension plays a minor role in the sampling algorithm.
Another important advantage is that it allows us to use kernels, and therefore gives a systematic
way to extend QBC to the non-linear scenario. The use of kernels is described in detail in section
10.3.
The following theorem proves that indeed sampling from V ∩ T produces the desired results.
It shows that if the prior ν (see algorithm 5 on page 55) is uniform, then sampling hypotheses
uniformly from V or from V ∩ T generates the same results.
Theorem 10.2 Let S = {(xi , yi )}ki=1 be a labeled sample and x an instance. Let V be the version
space
$$V = \{w : \|w\| \le 1 \text{ and } \forall i\ y_i(w \cdot x_i) > 0\}$$
and let T = span(x, x_1, . . . , x_k). Then

$$\Pr_{w\sim U(V)}[w \cdot x > 0] = \Pr_{w\sim U(V\cap T)}[w \cdot x > 0]$$

and

$$\Pr_{w\sim U(V)}[w \cdot x < 0] = \Pr_{w\sim U(V\cap T)}[w \cdot x < 0]$$

where U(·) is the uniform distribution.
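Theorem 10.2 can be checked numerically. The sketch below uses naive rejection sampling in place of a proper convex-body sampler (the dimensions, data and seed are arbitrary choices); the two estimated probabilities should agree up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
X = rng.normal(size=(2, d))           # two labeled instances x_1, x_2
y = np.array([1.0, -1.0])
x = rng.normal(size=d)                # the new instance

def in_V(w):                          # membership in the version space V
    return np.linalg.norm(w) <= 1 and np.all(y * (X @ w) > 0)

def ball_point(m):                    # uniform point in the unit ball of R^m
    v = rng.normal(size=m)
    return v / np.linalg.norm(v) * rng.uniform() ** (1.0 / m)

# Rejection sampling from the full d-dimensional body V.
ws = [w for w in (ball_point(d) for _ in range(100000)) if in_V(w)]
p_full = np.mean([w @ x > 0 for w in ws])

# Rejection sampling from V ∩ T, T = span(x, x_1, x_2): since Q is an
# isometry onto T, a uniform coefficient vector gives a uniform point.
Q, _ = np.linalg.qr(np.column_stack([x, *X]))   # orthonormal basis of T
vs = [Q @ a for a in (ball_point(Q.shape[1]) for _ in range(100000)) if in_V(Q @ a)]
p_proj = np.mean([w @ x > 0 for w in vs])

print(p_full, p_proj)                 # approximately equal
```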
Before we prove this theorem, we prove a couple of lemmas:
Lemma 10.2 Let V and T be as defined in Theorem 10.2. Let PT be the orthogonal projection to
T then
PT (V ) = V ∩ T
² From the definition of the version space V it follows that it is a convex body. See Lemma 9.1 on page 115.
Proof. Let w ∈ V ∩ T. Since w ∈ T, we have w = P_T(w); combined with the fact that w ∈ V, we conclude that w ∈ P_T(V) and thus V ∩ T ⊆ P_T(V).
On the other hand, let w ∈ P_T(V). It suffices to show that w ∈ V to complete the proof. Let ŵ ∈ V be such that P_T(ŵ) = w. Since P_T is a projection, ‖w‖ ≤ ‖ŵ‖ ≤ 1. Moreover, since ŵ − w ∈ T^⊥ and x_i ∈ T, we have that

$$\hat{w} \cdot x_i = w \cdot x_i + (\hat{w} - w)\cdot x_i = w \cdot x_i$$

and thus

$$y_i\, w \cdot x_i = y_i\, \hat{w} \cdot x_i > 0$$

hence w ∈ V, which completes the proof.
Next we show that V is almost a product space.

Lemma 10.3 Let w ∈ V ∩ T. Then

$$P_T^{-1}(w) \cap V = \left\{ w + v : v \in T^{\perp},\ \|v\|^2 \le 1 - \|w\|^2 \right\}$$

where $P_T^{-1}(w) = \{v : P_T(v) = w\}$.
Proof. Let v ∈ T^⊥ be such that ‖v‖² ≤ 1 − ‖w‖². Since v ⊥ w, we have ‖v + w‖² = ‖v‖² + ‖w‖² ≤ 1. For any (x_i, y_i), we have that

$$y_i (w + v) \cdot x_i = y_i\, w \cdot x_i$$

since v ⊥ x_i, and therefore (v + w) ∈ V. Furthermore, P_T(w + v) = w since v ∈ T^⊥, and thus (v + w) ∈ P_T^{-1}(w) ∩ V. Therefore

$$P_T^{-1}(w) \cap V \supseteq \left\{ w + v : v \in T^{\perp},\ \|v\|^2 \le 1 - \|w\|^2 \right\}$$

On the other hand, let u ∈ P_T^{-1}(w) ∩ V. Clearly, P_T(u) = w and therefore u = w + v with v ∈ T^⊥. Since w ⊥ v and ‖u‖ ≤ 1, it follows that ‖v‖² = ‖u‖² − ‖w‖² ≤ 1 − ‖w‖², and thus

$$P_T^{-1}(w) \cap V \subseteq \left\{ w + v : v \in T^{\perp},\ \|v\|^2 \le 1 - \|w\|^2 \right\}$$

This completes the proof.
We are now ready to present the proof of the main theorem.

Proof of Theorem 10.2.
First note that for any u ∈ V

$$\mathrm{sign}(u \cdot x) = \mathrm{sign}(P_T(u) \cdot x) \qquad (10.5)$$

Let ν be the push-forward probability measure P_T(U(V)); i.e. if A is a measurable set then ν(A) is the measure under U(V) of P_T^{-1}(A). From (10.5) it follows that

$$\Pr_{w\sim\nu}[w \cdot x > 0] = \Pr_{w\sim U(V)}[w \cdot x > 0]
\qquad\text{and}\qquad
\Pr_{w\sim\nu}[w \cdot x < 0] = \Pr_{w\sim U(V)}[w \cdot x < 0]$$
Clearly, ν is continuous with respect to the Lebesgue measure and hence has a density; let dν denote it. From Lemma 10.2 it follows that for any w ∉ V ∩ T the density dν(w) is zero. From Lemma 10.3 it follows that for any w ∈ V ∩ T the density dν(w) depends solely on ‖w‖. Finally, since for any λ > 0

$$\mathrm{sign}(w \cdot x) = \mathrm{sign}(\lambda w \cdot x)$$

it follows that

$$\Pr_{w\sim\nu}[w \cdot x > 0] = \Pr_{U(V\cap T)}[w \cdot x > 0]
\qquad\text{and}\qquad
\Pr_{w\sim\nu}[w \cdot x < 0] = \Pr_{U(V\cap T)}[w \cdot x < 0]$$

This completes the proof.
Theorem 10.2 establishes the soundness of the sampling procedure presented: although we sample from a low-dimensional projection of the version space, the resulting predictions are distributed identically.
10.3 Sampling with Kernels
In this section we show how the new sampling method presented in section 10.2 can be used
together with kernels. QBC uses the random hypotheses for one purpose alone: to check the labels
they predict for instances. In our new sampling method the hypotheses are sampled from V ∩ T ,
where T = span (x, x1 , . . . , xk ). Hence, any hypothesis is represented by w ∈ V ∩ T , that has the
form
$$w = \alpha_0 x + \sum_{j=1}^{k} \alpha_j x_j \qquad (10.6)$$
The label w assigns to an instance x′ is

$$w \cdot x' = \left(\alpha_0 x + \sum_{j=1}^{k} \alpha_j x_j\right) \cdot x' = \alpha_0\, x \cdot x' + \sum_{j=1}^{k} \alpha_j\, x_j \cdot x' \qquad (10.7)$$
Note that in (10.7) only inner products are used, hence we can use kernels. Using these observations, we can sample a hypothesis by sampling α_0, . . . , α_k and defining w as in (10.6). However, since the
xi ’s do not form an orthonormal basis of T , sampling the α’s uniformly is not equivalent to sampling
the w’s uniformly. We overcome this problem by using an orthonormal basis of T . The following
lemma shows how an orthonormal basis for T can be computed when only inner products are used.
Lemma 10.4 Let x_0, . . . , x_k be a set of vectors, let T = span(x_0, . . . , x_k) and let G = (g_{i,j}) be the Gram matrix such that g_{i,j} = x_i · x_j. Let λ_1, . . . , λ_r be the non-zero eigenvalues of G with the corresponding eigenvectors γ_1, . . . , γ_r. Then the vectors t_1, . . . , t_r such that

$$t_i = \sum_{l=0}^{k} \frac{\gamma_i(l)}{\sqrt{\lambda_i}}\, x_l$$

form an orthonormal basis of the space T.
This lemma is significant since the basis t_1, . . . , t_r enables us to sample from V ∩ T using simple techniques. Note that a vector w ∈ T can be expressed as $\sum_{i=1}^{r}\alpha(i)\, t_i$. Since the t_i's form an orthonormal basis, ‖w‖ = ‖α‖. Furthermore, we can check the label w assigns to x_j by

$$w \cdot x_j = \sum_i \alpha(i)\, t_i \cdot x_j = \sum_{i,l} \alpha(i)\, \frac{\gamma_i(l)}{\sqrt{\lambda_i}}\, x_l \cdot x_j$$

which is a function of the Gram matrix. Therefore, sampling from V ∩ T boils down to the problem of sampling from convex bodies, where instead of sampling a vector directly we sample the coefficients of the orthonormal basis t_1, . . . , t_r. Keep in mind that we do not need to recalculate this basis for every new instance whose label we consider querying. Instead, if we have the basis t_1, . . . , t_r for span(x_1, . . . , x_k) and we encounter a new instance x_0, we can simply do the following calculation:

$$t^{\perp} = x_0 - \sum_{i=1}^{r} (x_0 \cdot t_i)\, t_i$$

If t^⊥ is zero then x_0 ∈ span(x_1, . . . , x_k) and thus we do not need to extend the basis. Otherwise we can extend the basis with the vector $t_{r+1} = t^{\perp}/\|t^{\perp}\|$. The computational complexity of this process is O(r²), which is O(k²) at most.
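As an illustration, here is a Python sketch of the construction of Lemma 10.4 together with the orthonormality check; note that only the Gram matrix is touched, never the x_i's themselves (the function name and numerical tolerance are our own).

```python
import numpy as np

def orthonormal_basis_coeffs(G, tol=1e-10):
    """Coefficients of an orthonormal basis of span(x_0, ..., x_k), following
    Lemma 10.4: t_i = sum_l gamma_i(l) / sqrt(lambda_i) x_l.

    G is the Gram matrix g_ij = x_i . x_j; the returned r x (k+1) matrix C
    holds in its i-th row the coefficients of t_i in terms of the x_l's."""
    lam, gamma = np.linalg.eigh(G)    # eigenvalues / eigenvectors of G
    keep = lam > tol                  # drop the (numerically) zero eigenvalues
    return (gamma[:, keep] / np.sqrt(lam[keep])).T

# Sanity check: t_i . t_j = (C G C^T)_ij, which should be the identity.
rng = np.random.default_rng(3)
X = rng.normal(size=(5, 8))
G = X @ X.T
C = orthonormal_basis_coeffs(G)
print(np.allclose(C @ G @ C.T, np.eye(C.shape[0])))   # True
```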
We now go back to prove Lemma 10.4.

Proof of Lemma 10.4.
First note that t_1, . . . , t_r ∈ T and thus span(t_1, . . . , t_r) ⊆ T. Also note that the dimension of T is r. Indeed, if the dimension of T were greater than r, there would exist an orthonormal basis τ_1, . . . , τ_k for T with k > r. We can express the vectors τ_1, . . . , τ_k in terms of the x_i's such that $\tau_i = \sum_j \tau_i(j)\, x_j$. Let Θ = (θ_{i,j}) be the matrix such that θ_{i,j} = τ_i(j); then

$$(\Theta G \Theta')_{i,j} = \left(\sum_l \tau_i(l)\, x_l \cdot x_0,\ \ldots,\ \sum_l \tau_i(l)\, x_l \cdot x_k\right)(\tau_j(0), \ldots, \tau_j(k))' = \sum_{s,l} \tau_i(l)\, \tau_j(s)\, x_l \cdot x_s = \tau_i \cdot \tau_j = \delta_{ij}$$

where the last equality follows since τ_1, . . . , τ_k are orthonormal. It follows that ΘGΘ′ = I_{k×k}. Since k > r, this contradicts the assumption that rank(G) = r. Therefore we conclude that the dimension of T is at most r.
To complete the proof, it suffices to show that t_1, . . . , t_r are indeed orthonormal; that is, t_i · t_j = δ_{i,j}:

$$t_i \cdot t_j = \left(\sum_{l=0}^{k} \frac{\gamma_i(l)}{\sqrt{\lambda_i}} x_l\right) \cdot \left(\sum_{l=0}^{k} \frac{\gamma_j(l)}{\sqrt{\lambda_j}} x_l\right) = \frac{1}{\sqrt{\lambda_i\lambda_j}} \sum_{l,s} \gamma_i(l)\, \gamma_j(s)\, x_l \cdot x_s = \frac{1}{\sqrt{\lambda_i\lambda_j}}\, \gamma_i' G \gamma_j = \frac{\lambda_j}{\sqrt{\lambda_i\lambda_j}} (\gamma_i \cdot \gamma_j) = \delta_{i,j}$$

where the last equality follows since the eigenvectors γ_1, . . . , γ_r are orthonormal.
In the next section we discuss one possible method of sampling from this convex body.
10.4 Hit and Run
Hit and run [85] is a method of sampling from a convex body K using a random walk. Let z ∈ K. A
single step of the hit and run begins by choosing a random point u from the unit sphere. Afterwards
the algorithm moves to a random point selected uniformly from l ∩ K, where l is the line passing
through z and z + u.
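A single hit-and-run step is easy to code. The sketch below assumes the body is given as K = {v : ‖v‖ ≤ 1 and Av ≤ b}, which is exactly the form V ∩ T takes in the coefficients of an orthonormal basis of T, each labeled instance contributing one homogeneous halfspace constraint (row −y_i times the instance's coefficient vector, with b_i = 0).

```python
import numpy as np

def hit_and_run_step(z, A, b, rng):
    """One hit-and-run step inside K = {v : ||v|| <= 1 and A v <= b};
    z is assumed to lie in the interior of K."""
    u = rng.normal(size=len(z))
    u /= np.linalg.norm(u)                # random direction on the unit sphere
    # The chord l ∩ K is {z + t u : lo <= t <= hi}.
    # Intersect l with the unit ball: ||z + t u||^2 <= 1.
    zu = np.dot(z, u)
    disc = np.sqrt(zu ** 2 - (np.dot(z, z) - 1.0))
    lo, hi = -zu - disc, -zu + disc
    # Intersect l with every halfspace a . (z + t u) <= b_i.
    for a, bi in zip(A, b):
        au, az = np.dot(a, u), np.dot(a, z)
        if au > 0:
            hi = min(hi, (bi - az) / au)
        elif au < 0:
            lo = max(lo, (bi - az) / au)
    return z + rng.uniform(lo, hi) * u    # uniform point on the chord
```

Iterating this step from an interior starting point yields the approximately uniform samples KQBC needs.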
Hit and run has several advantages over other random walks for sampling from convex bodies. First, its stationary distribution is indeed the uniform distribution; moreover, it mixes fast [85] and it does not require a “warm” starting point [86]. What makes it especially suitable for practical use is the
fact that it does not require any parameter tuning other than the number of random steps. It is
also very easy to implement.
Current proofs [85, 86] show that O^*(d³) steps are needed for the random walk to mix. However, the constants in these bounds are very large. Nevertheless, our experiments show that in practice hit and run mixes much faster than that (see Chapter 11 on page 136). We have used it to sample from the body V ∩ T. The number of steps we used was very small, ranging from a couple of hundred to a couple of thousand. Our empirical study shows that this suffices to obtain impressive results.
10.5 Generalizing to Unseen Instances
We saw how the QBC learning process can be conducted efficiently even when kernels are being
used. We now look at the generalization phase. In Chapter 5, where the QBC algorithm is
presented, we discussed several options for the generalization phase of QBC. One option is to
work in an online fashion in which there is no clear distinction between the learning and the generalization phases (see Theorem 5.6 on page 60). In this setting, the learner predicts the label of
an instance he sees and at the same time decides whether to query for the label or not. As we saw
in previous sections, this does not introduce any difficulty when kernels are being used.
In other settings presented in Chapter 5, the learning phase stops once a certain stopping
criterion is met. At this point QBC returns a hypothesis. We have discussed several options for
the choice of the returned hypothesis. We would like to verify which of these hypotheses can be
used together with kernels.
The first hypothesis we consider is the Bayes optimal hypothesis. This hypothesis is not necessarily a linear classifier and thus, in general, does not have a simple representation. Since this
is a problem even when kernels are not being used, we will definitely have the same problem once
kernels are used.
The second kind of hypothesis we consider is the Gibbs hypothesis. There are two possibilities here. First, we can draw a random hypothesis whenever we would like to label an instance. Using the techniques presented in the previous sections of this chapter, this can be done in combination with kernels.
An alternative way to use the Gibbs hypothesis is to draw a single hypothesis from the version space and use it for all future predictions. This cannot be done when kernels are used, because the random hypothesis needs to be sampled from the full version space. Note that when we
projected the version space onto the subspace T, we used T which is the span of x, x_1, . . . , x_k; we assumed that we know the instance x whose label we would like to predict. However, when x is not known, it is not clear which subspace to project onto.
The final option we considered in Chapter 5 was to use the Bayes Point Machine (BPM) classifier, which in our case is the center of gravity of the version space. It is easy to verify that under the assumption that the prior is uniform, the center of gravity always lies in the span of the instances for which we queried for labels. Furthermore, using the same arguments as we used throughout this chapter, it is easy to show that if V is the version space and T is the span of the instances for which we queried for labels, then the center of gravity of V is exactly the same point as the center of gravity of V ∩ T. Thus the BPM classifier can be used even in the kernelized setting.
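Since the two centers of gravity coincide, the BPM classifier can be estimated by simply averaging approximately uniform samples from the low-dimensional body. A minimal sketch, where `sample` stands for any sampler over V ∩ T, e.g. hit and run (section 10.4):

```python
import numpy as np

def bayes_point(sample, n=1000):
    # The average of approximately uniform samples from V ∩ T estimates
    # its center of gravity, i.e. the BPM classifier.
    return np.mean([sample() for _ in range(n)], axis=0)
```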
10.6 Summary and Further Study
In this chapter we presented two main ideas. First, we showed how kernels can be used to enhance the ability of QBC to deal with tasks where the target classifier is not necessarily linear. Kernels can also be used to overcome noise via the ridge trick, since every problem becomes separable when the ridge kernel is used (see Section 10.1.4).
Another issue that we dealt with in this chapter is practical methods for sampling the version
space. We suggested the use of the hit-and-run algorithm for this purpose. We discussed the
adequacy of this sampling algorithm for our purposes. In the following chapter we present the
empirical results obtained when using the techniques presented here for several learning tasks.
Chapter 11
Empirical Evidence

11.1 Empirical Study
In this chapter we present the results of applying the kernelized version of the Query by Committee
(KQBC) algorithm with the Hit-and-Run random walk (see Chapter 10), to two learning tasks.
The first task requires classification of synthetic data whereas the second is a real world problem.
11.1.1 Synthetic Data
In our first experiment we study the task of learning a linear classifier in a d-dimensional space.
The target classifier is the vector w∗ = (1, 0, . . . , 0) thus the label of an instance x ∈ IRd is the sign
of its first coordinate. The instances are normally distributed N (µ = 0, Σ = Id ). In each trial we
use 10000 unlabeled instances and let KQBC select the instances to query for the labels. We also
apply Support Vector Machine (SVM) to the same data. The linear kernel is used for both KQBC
and SVM. Since SVM is a passive learner, SVM is trained on prefixes of the training data. We use
different sizes for these prefixes. The results are presented in figure 11.1.
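For reference, a minimal sketch of the data-generating process used in this experiment (the seed is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 15, 10000
X = rng.normal(size=(n, d))     # instances drawn from N(0, I_d)
w_star = np.eye(d)[0]           # target classifier w* = (1, 0, ..., 0)
y = np.sign(X @ w_star)         # label = sign of the first coordinate
```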
The difference between KQBC and SVM is notable. When both are applied to a 15-dimensional
linear discrimination problem (figure 11.1b), SVM and KQBC have an error rate of ∼ 6% and
∼ 0.7% respectively after 120 labels. After such a short training sequence the difference attains an
order of magnitude. The same qualitative results emerge for all problem sizes.
As expected, the generalization error of KQBC decreases exponentially fast as the number of queries increases, whereas the generalization error of SVM decreases only at an inverse-polynomial rate (the rate is O^*(1/k), where k is the number of labels). This should not come as a surprise, since the fundamental theorem of the QBC algorithm (Theorem 6.1 on page 65) proved that this is the expected behavior.
[Figure 11.1 appears here: three log-scale plots of generalization error versus number of queries, for (a) 5, (b) 15 and (c) 45 dimensions, each comparing Kernel Query By Committee with Support Vector Machine; the fitted curves in the legends are 48 · 2^{−0.9k/5}, 53 · 2^{−0.76k/15} and 50 · 2^{−0.67k/45} respectively.]
Figure 11.1: Results on the synthetic data. The generalization error (y-axis) in percent (in
logarithmic scale) versus the number of queries (x-axis). Plots (a), (b) and (c) represent the
synthetic task in 5, 15 and 45 dimensional spaces respectively. The generalization error of KQBC
is compared to the generalization error of SVM. The results presented here are averaged over 50
trials. Note that the error rate of KQBC decreases exponentially fast as was proved in the
fundamental theorem of the QBC algorithm (Theorem 6.1 on page 65).
11.1.2 Label Efficient Learning over Synthetic Data
We conducted another experiment using the same synthetic setting as presented in section 11.1.1.
The sample space is IR5 with uniform distribution and the target concept is the vector (1, 0, 0, 0, 0).
In this experiment we tested KQBC in the label efficient setting (see section 4.3 on page 53). We
generated 2500 instances and presented them to KQBC one by one. For each of these instances
KQBC either queried for the label of the instance or predicted its label. We counted both the
number of queries and the number of prediction mistakes. This process was repeated 50 times.
The results are presented in figure 11.2a.
As predicted in Theorem 5.6 on page 60, the number of queries for labels is exactly twice the
number of prediction mistakes. Also, following the theoretical analysis presented in Chapter 6,
both parameters grow logarithmically with respect to the number of instances.
We use this setting to check the effect of the number of Hit and Run steps on the performance
of KQBC. The results of KQBC with 1000, 100, 50, 10, 5 and 2 Hit and Run steps for generating
a random hypothesis are presented in sub-figures a-f of figure 11.2. When the number of random
steps drops, KQBC tends to query for fewer instances, which causes an increase in the number of
prediction mistakes. However, the results of using 50, 100 and 1000 random steps are practically
equivalent and match our predictions for uniformly sampled hypotheses. We conclude that Hit
and Run mixes very fast: much faster than the bounds in [85].
11.1.3 Face Image Classification
The setting of the second experiment is more realistic. In the second task we used the AR face
dataset [88] which is a collection of face images. The people in these images are wearing different
accessories, have different facial expressions and the faces are lit from different directions. We
selected a subset of 1456 images from this dataset. Each image was converted into gray-scale
and re-sized to 85 × 60 pixels; i.e. each image was represented as a 5100-dimensional vector. See figure 11.3 for sample images. The task was to distinguish male and female images. For this
purpose we split the data into a training sequence of 1000 images and a test sequence of 456 images.
To test statistical significance we repeated this process 20 times, each time splitting the dataset
into training and testing sequences.
Figure 11.2: KQBC for label efficient learning. The results of applying KQBC to the
synthetic data are presented. The number of instances is located on the x-axis and the y-axis
shows the average number of queries and prediction errors. Each subplot corresponds to a different number of Hit and Run steps used to generate a new hypothesis from the version space.
Figure 11.3: Examples of face images used for the face recognition task.
Figure 11.4: The generalization error of KQBC and SVM for the faces dataset
(averaged over 20 trials). The generalization error (y-axis) vs. number of queries (x-axis) for
KQBC (solid) and SVM (dashed) are compared. When SVM was applied solely to the instances
selected by KQBC (dotted line) the results are better than SVM but worse than KQBC.
We applied both KQBC and SVM to this dataset. We used the Gaussian kernel, such that the inner product between two images was defined to be

$$K(x_1, x_2) = \exp\left(-\|x_1 - x_2\|^2 / 2\sigma^2\right)$$

where σ = 3500, the value favored by SVM. The results are presented in figure 11.4.
It is apparent from figure 11.4 that KQBC outperforms SVM. When the budget allows for
100 − 140 labels, KQBC has an error rate 2 − 3 percent lower than that of SVM. When
140 labels are used, KQBC outperforms SVM by 3.6% on average. This difference is significant as
in 90% of the trials KQBC outperformed SVM by more than 1%. In one of the cases, KQBC was
11% better.
We also used KQBC as an active selection method for SVM. We trained SVM over the instances
selected by KQBC. The generalization error obtained by this combined scheme was better than that of the passive SVM but worse than that of KQBC.
Another interesting way to view these results is to look at the images for which KQBC queried for labels; figure 11.5 shows the last such images. It is apparent that the selection made by KQBC is non-trivial: all the images are either highly saturated or partly covered by scarves or sunglasses. We conclude that KQBC indeed performs well even when kernels are used.
Figure 11.5: Images selected by KQBC. The last six faces for which KQBC queried for a
label. Note that three of the images are saturated and that two of these are wearing a scarf that
covers half of their faces.
11.2 Summary

In this chapter we demonstrated the kernelized version of the QBC algorithm in several experiments. In all of them, KQBC outperformed SVM significantly. We also tested KQBC in the label efficient setting, i.e. the online setting, and showed that it performs well in that setting too.
Part IV
Epilog
Chapter 12
Epilog
“No learning occurs if the learner is not active” [18, pg. 110]
The title of this work, “To PAC and Beyond”, represents the main theme of this dissertation. The
PAC [126] model is a very successful one. Once learning is defined in mathematical language it
enables the scientific community to study this concept using tools from different scientific fields.
It allows us to articulate questions such as
• Is everything learnable?
• Is anything learnable?
• What can we learn?
Many important results in the field of machine learning were obtained prior to the definition of the PAC model. Most notable is the seminal work of Vapnik and Chervonenkis [128], but many other results predated the definition of the PAC model (see e.g. [46, 97, 103, 108, 120]). The significance of Valiant's work is in placing all these findings in the context of learning, thus marking the beginning of machine learning as a field of research.
Nevertheless, the PAC model has its limitations, which led many researchers to go beyond it.
In this work we went beyond PAC by allowing learners to be active. We see learning as a game
played between the learner and the teacher. We showed that the assumption that the learner is
passive is restrictive. When we allow the learner the freedom to actively participate in the learning
process it learns much faster.
After a short introduction, we studied the Membership Queries framework in Part II. In Chapter 3 we presented a novel method of tolerating noise in the learning process using membership queries and the dual representation of the learning problem. In Part III we studied active learning in the selective sampling framework. In Chapter 5 we presented the Query By Committee
algorithm of Seung et al. [112] and discussed possible termination rules for this algorithm, which correspond to different modes of use. In Chapter 6 we presented a theoretical analysis of the QBC algorithm. We showed that active learners can enjoy an exponential speedup in their learning rates when certain conditions apply. In Chapter 7 we showed that QBC can tolerate incorrect assumptions on priors. In Chapter 8 we presented a method which makes QBC more resistant to noise. We discussed efficient implementations of QBC in Chapter 9 and extended it to enable the use of kernels in Chapter 10. An empirical study of QBC was presented in Chapter 11. These constitute an encouraging step forward in the ability to study active learning from various points of view, and (almost) close the gap between theory and practice in this field.
12.1 Active Learning in Humans
Our prime focus in this work is machine learning. Nevertheless, the findings can be connected to
learning in humans, since active learning is as important to humans as it is for machines.
Research on human learning and machine learning is conducted from very different points of
view. Investigators studying human learning primarily try to teach teachers how to teach, whereas
researchers in the field of machine learning attempt to teach learners how to learn. Indeed, “learning
to learn” is not the title of any class in school or university. In an introduction to his course on
computer organization, Charles Lin tries to address this issue [79]. Lin entitled his essay “Active Learning” to convey the idea that you know you have learned something when you are able to teach it. Thus a student should convince himself that he is able to teach what he has learned, and whenever he is not confident that he can do so, he should ask the teacher or a peer, or seek the answer somewhere else.
Researchers in the field of human learning and early childhood development use “active learning” slightly differently from the way we have used it in this dissertation: any learning process in which the learner takes part is considered active. According to Piaget, a child plays a very active role in the growth of intelligence [118, pg. 13]. Examples include game playing, counting, etc. Furthermore, for Piaget intelligence meant exploring the environment [118, pg. 27]; thus intelligence is about actively extending knowledge. Both Piaget and Vygotsky explicitly argue that the child plays an active role in the acquisition of knowledge, as opposed to Behaviorism, which suggests that learning is determined by external variables (stimulus and reinforcement) [18, pg. 27]. The constructivist theory argues that learning is an internal process that external stimuli can trigger [94].
The active role of children in learning takes on several forms. According to leading theories,
the child constructs a hypothesis and revises it when needed. Constructing a theory is an active
process [18, pg. 8]. While this is an internal process, active learning has an external manifestation
as well [18, pg. 9]: a child needs to be able to manipulate objects in order to understand what
these objects are and what can they do. It is also necessary to have children actively involved in
the learning process to motivate them and cause them to engage in the learning process.
The type of “active learning” we are interested in is different. For us, a child is considered active if his behavior and questions cause a change in the learning process itself. Therefore, a natural question would be how much a child can gain (in knowledge) by actively directing the teacher. To the best of my knowledge, no study addresses this issue explicitly. Nevertheless, implicitly there is no doubt in our mind that many theories of early childhood development and human learning
see the child as a “director” of the learning process. For instance, Montessori, Erikson, Piaget
and Vygotsky place great emphasis on the significance of observing the students when planning a
curriculum [93]. For example, according to Montessori, teachers should be trained to “teach little
and observe much” [93, pg. 31] because observation is the key to determining what children are
interested in and need to learn [93, pg. 33].
The study of active learning in machines is taking its first steps. In this work we attempted to
contribute to the growth of this field. We studied both empirical and theoretical aspects of this
domain. At the same time we argue that active learning is important for human learning. Those of
us who are involved in learning should keep this in mind and use this powerful tool while learning.
List of Publications
In order to keep this document reasonably sized, only a subset of the work I have done during my
studies is presented in this dissertation. Here is a complete list of my publications.
Journals
• R. Bachrach, R. El-Yaniv, and M. Reinstadtler, On the competitive theory and practice of
online list accessing algorithms, Algorithmica, vol. 32, no. 2, pp. 201-245, 2002.
An extended abstract of this paper appeared in a conference:
R. Bachrach and R. El-Yaniv, Online list accessing algorithms and their applications, recent
empirical evidence, in Proceedings of the 8th Symposium on Discrete Algorithms (SODA),
pp 53-62, 1997.
• R. Bachrach, S. Fine, and E. Shamir, Query by committee, linear separation and random
walks, Theoretical Computer Science, vol. 284, no. 1, 2002.
An extended abstract of this paper appeared in a conference:
R. Bachrach, S. Fine, and E. Shamir, Query by committee, linear separation and random
walks, in Proceedings of the 4th European Conference on Learning Theory (EUROCOLT),
pp. 34-49, 2001.
Refereed Conferences
• R. Gilad-Bachrach, A. Navot and N. Tishby, Query by Committee made real, in Proceedings
of the 19th Conference on Neural Information Processing Systems (NIPS), 2005.
• R. Gilad-Bachrach, A. Navot and N. Tishby, Bayes and Tukey meet at the center point, in
Proceedings of the 17th Conference on Learning Theory (COLT), 2004.
• R. Gilad-Bachrach, A. Navot and N. Tishby, Margin based feature selection - theory and
algorithms, in Proceedings of the 21st International Conference on Machine Learning (ICML),
2004.
• R. Gilad-Bachrach, A. Navot, and N. Tishby, An information theoretic tradeoff between complexity and accuracy, in Proceedings of the 16th Conference on Learning Theory (COLT),
pp. 595-609, 2003.
• K. Crammer, R. Gilad-Bachrach, A. Navot, and N. Tishby, Margin analysis of the LVQ algorithm, in Proceedings of the 16th Conference on Neural Information Processing Systems
(NIPS), 2002.
Book Chapters
• R. Gilad-Bachrach, A. Navot and N. Tishby, Connections with some classic IT problems. In
Information Bottlenecks and Distortions: The Emergence of Relevant Structure from Data, N.
Tishby and T. Gideon (eds.), MIT Press (in preparation).
• R. Gilad-Bachrach, A. Navot and N. Tishby, Large margin principles for feature selection. In
Feature Extraction: Foundations and Applications, I. Guyon, S. Gunn, M. Nikravesh and L.
Zadeh (eds.), Springer (2006).
Technical Reports
• S. Fine, R. Gilad-Bachrach, E. Shamir, and N. Tishby, Noise tolerant learning using early
predictors, technical report 1999-22, Leibniz Center, the Hebrew University, 1999.
• S. Fine, R. Gilad-Bachrach, S. Mendelson, and N. Tishby, Noise tolerant learning via the dual
learning problem, technical report 2000-14, Leibniz Center, the Hebrew University, 2000.
Presented at NCST99.
• S. Axelrod, S. Fine, R. Gilad-Bachrach, S. Mendelson, and N. Tishby, The information of
observations and applications for active learning with uncertainty, technical report 2001-81,
Leibniz Center, the Hebrew University, 2001.
• R. Gilad-Bachrach, A. Navot, and N. Tishby, Kernel query by committee (KQBC), technical
report 2003-88, Leibniz Center, the Hebrew University, 2003.
• R. Gilad-Bachrach, Dimensionality reduction for online learning algorithms using random
projections, technical report 2005, Leibniz Center, the Hebrew University, 2005.
Bibliography
[1] R.A. Adams. Sobolev Spaces, volume 69 of Pure and Applied Mathematics series.
Academic Press, 1975.
[2] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions,
uniform convergence, and learnability. Journal of the ACM, 44(4):615–631, 1997.
[3] D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.
[4] D. Angluin. Queries revisited. Theoretical Computer Science, 313(2):175–194, 2004.
[5] D. Angluin and M. Kharitonov. When won’t membership queries help? In Proceedings
of the 23rd annual ACM symposium on Theory of computing, 1991.
[6] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations.
Cambridge University Press, 2001.
[7] A. C. Atkinson and A. N. Donev. Optimum Experimental Designs. Oxford University
Press, 1992.
[8] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[9] M. Bagnoli and T. Bergstrom. Log-concave probability and its applications.
http://www.econ.ucsb.edu/~tedb/Theory/logconc.ps, 1989.
[10] Y. Baram, R. El-Yaniv, and K. Luz. Online choice of active learning algorithms. Journal
of Machine Learning Research (JMLR), 5:255–291, March 2004.
[11] P. Bartlett and S. Ben-David. Hardness results for neural network approximation problems. In Proceedings of the 4th European Conference on Computational Learning
Theory, 1999.
[12] E. B. Baum. Neural net algorithms that learn in polynomial time from examples and
queries. IEEE Transactions on Neural Networks, 2(1), 1991.
[13] S. Ben-David, N. Eiron, and P. Long. On the difficulty of approximately maximizing
agreements. Journal of Computer and System Sciences, 66(3):496–514, May 2003.
[14] P. Berkhin. Survey of clustering data mining techniques. Technical report, Accrue
Software, San Jose, CA, 2002.
[15] D. Bertsimas and S. Vempala. Solving convex programs by random walks. In STOC,
pages 109–115, 2002.
[16] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In the
11th Annual Conference on Computational Learning Theory, pages 92–100, 1998.
[17] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the
Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
[18] E. Bodrova and D. J. Leong. Tools of the Mind: The Vygotskian Approach to Early
Childhood Education. Prentice-Hall, 1996.
[19] C. Borell. Convex set functions in d-space. Periodica Mathematica Hungarica, 6:111–
136, 1975.
[20] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Fifth Annual
Workshop on Computational Learning Theory, pages 144–152, 1992.
[21] L. Breiman, J. Friedman, R. A. Olshen, and C. Stone. Classification and Regression
Trees. Chapman & Hall, 1984.
[22] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[23] N. Bshouty. Exact learning via the monotone theory. In Proceedings of the 34th Annual
Symposium on Foundations of Computer Science, 1993.
[24] N. H. Bshouty, S. A. Goldman, H. D. Mathias, S. Suri, and H. Tamaki. Noise-tolerant
distribution-free learning of general geometric concepts. In the proceedings of the 28th
Annual ACM Symposium on Theory of Computing, 1996.
[25] C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers.
In Proceedings of the 17th International Conference on Machine Learning (ICML),
2000.
[26] A. Caplin and B. Nalebuff. Aggregation and social choice: A mean voter theorem.
Econometrica, 59(1):1–23, 1991.
[27] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Learning probabilistic linear-threshold
classifiers via selective sampling. In Proceedings of the 16th annual Conference on
Learning Theory (COLT), pages 373–387, 2003.
[28] N. Cesa-Bianchi, G. Lugosi, and G. Stoltz. Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory, 51(6):2152–2162, 2005.
[29] D. Cohn, L. Atlas, and R. Ladner. Training connectionist networks with queries and
selective sampling. Advances in Neural Information Processing Systems 2, 1990.
[30] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning.
Machine Learning, 15(2):201–221, 1994.
[31] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models.
Journal of Artificial Intelligence Research, 4:129–145, 1996.
[32] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Interscience,
1991.
[33] T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on
Information Theory, 13:21–27, 1967.
[34] K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass
problems. Machine Learning, 47, 2002.
[35] I. Dagan and S. Engelson. Committee-based sampling for training probabilistic classifiers. Proc. 12th International Conference on Machine Learning, 1995.
[36] S. Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural
Information Processing Systems, 2004.
[37] S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in
Neural Information Processing Systems, 2005.
[38] S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of perceptron-based active
learning. In Proceeding of the 18th Annual Conference on Learning Theory (COLT),
2005.
[39] S. E. Decatur. Efficient Learning from Faulty Data. PhD thesis, Harvard University,
1995.
[40] O. Dekel, S. Shalev-Shwartz, and Y. Singer. The forgetron: A kernel-based perceptron
on a fixed budget. In Neural Information Processing Systems (NIPS), 2005.
[41] M. Dyer, A. Frieze, and R. Kannan. A random polynomial time algorithm for approximating the volume of convex bodies. Journal of the ACM, 38(1):1–17, 1991.
[42] B. Eisenberg and R. L. Rivest. On the sample complexity of PAC-learning using random
and chosen examples. In Proceedings of the Third Annual Conference on Computational Learning Theory, pages 154–162. Morgan-Kaufmann, 1990.
[43] G. Elekes. A geometric inequality and the complexity of computing volume. Discrete
and Computational Geometry, 1986.
[44] S. Fine, A. Freund, I. Jaeger, Y. Mansour, Y. Naveh, and A. Ziv. Harnessing machine learning to improve success rate of stimuli generation. IEEE Transactions on
Computers, to appear 2006.
[45] S. Fine, J. Navratil, and R. Gopinath. A hybrid GMM/SVM approach to speaker identification. In The International Conference on Acoustics, Speech, and Signal Processing
(ICASSP), 2001.
[46] E. Fix and J. Hodges. Discriminatory analysis, nonparametric discrimination: Consistency properties. Technical Report 4, USAF School of Aviation Medicine, 1951.
[47] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning
and an application to boosting. Journal of Computer and System Sciences, 55(1):119–
139, 1997.
[48] Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by
committee algorithm. Machine Learning, 28:133–168, 1997.
[49] R. Gilad-Bachrach, A. Navot, and N. Tishby. Bayes and Tukey meet at the center
point. In Proceedings of the 17th Conference on Learning Theory (COLT), pages 549–
563, 2004.
[50] D. Gilbarg and N. S. Trudinger. Elliptic Partial Differential Equations of Second Order.
Springer Verlag, 1998.
[51] D. Haussler, M. Kearns, and R. E. Schapire. Bounds on the sample complexity of
Bayesian learning using information theory and the VC dimension. Machine Learning,
14:83–113, 1994.
[52] D. Haussler and M. Opper. Mutual information, metric entropy, and cumulative relative
entropy risk. Annals of Statistics, 25(6), Dec 1997.
[53] D. O. Hebb. The Organization of Behavior. John Wiley, New York, 1949.
[54] D. Helmbold and S. Panizza. Some label efficient learning results. In Proceedings of the
10th Annual Conference on Computational Learning Theory (COLT), pages 218–230,
1997.
[55] R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines: Estimating the
Bayes point in kernel space. In Proceedings of IJCAI Workshop on Support Vector
Machines, pages 23–27, 1999.
[56] R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines. Journal of Machine
Learning Research, 2001.
[57] T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting
remote protein homologies. Journal of Computational Biology, 7:95–114, 2000.
[58] J. Jackson, E. Shamir, and C. Shwartzman. Learning with queries corrupted by classification noise. Discrete Applied Mathematics, 92(2-3):157–175, 1999.
[59] F. V. Jensen. Bayesian Networks and Decision Graphs. Springer, 2001.
[60] F. John. Extremum problems with inequalities as subsidiary conditions. In Studies
and Essays Presented to R. Courant on his 60th Birthday, pages 187–204. Interscience
Publishers, Inc., New York, N. Y., 1948.
[61] I. T. Jolliffe. Principal Component Analysis. Springer Verlag, 1986.
[62] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey.
Journal of Artificial Intelligence Research, 4:237–285, 1996.
[63] M. Kearns. Efficient noise-tolerant learning from statistical queries. In Proceedings of
the 25th ACM Symposium on the Theory of Computing, pages 392–401, 1993.
[64] M. Kearns. Boosting theory towards practice: Recent developments in decision tree
induction and the weak learning framework. Abstract accompanying invited talk given
at AAAI 1996, 1996.
[65] M. Kearns and L. Valiant. Cryptographic limitations on learning Boolean formulae
and finite automata. Journal of the ACM, 41(1):67–95, 1994.
[66] M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. The
MIT Press, 1994.
[67] J. Kleinberg. An impossibility theorem for clustering. In NIPS, 2002.
[68] A. R. Klivans and R. Servedio. Learning intersections of halfspaces with a margin. In
Proceedings of the 17th Annual Conference on Learning Theory (COLT), 2004.
[69] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and active
learning. In Advances in Neural Information Processing Systems (NIPS), pages 231–
238, 1995.
[70] S. Kullback. Information Theory and Statistics. Wiley, 1959.
[71] E. Kushilevitz and Y. Mansour. Learning decision trees using the Fourier spectrum.
In Proceedings of the twenty-third annual ACM symposium on Theory of computing,
pages 455–464, 1991.
[72] S. Kwek and L. Pitt. Intersections of halfspaces with membership queries. Algorithmica,
1998.
[73] K. J. Lang and E. B. Baum. Query learning can work poorly when a human oracle is
used. In Proceedings of the International Joint Conference on Neural Networks, pages
335–340, 1992.
[74] BBC Learning. How we learn - definition of learning.
http://www.bbc.co.uk/learning/returning/betterlearner/learningstyle/a_whatislearning_01.shtml, 2004.
[75] M. Ledoux. The Concentration of Measure Phenomenon. American Mathematical
Society, 2001.
[76] C. Leslie, E. Eskin, A. Cohen, J. Weston, and W. S. Noble. Mismatch string kernels
for discriminative protein classification. Bioinformatics, 20(4):467–476, 2004.
[77] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. In
W. Bruce Croft and Cornelis J. van Rijsbergen, editors, Proceedings of 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR),
pages 3–12, Dublin, IE, 1994. Springer Verlag, Heidelberg, DE.
[78] R. Liere. Active Learning with Committees: An approach to Efficient Learning in
Text Categorization Using Linear Threshold Algorithms. PhD thesis, Oregon State
University, 1999.
[79] C. Lin. Active learning.
http://www.cs.umd.edu/class/spring2003/cmsc311/Notes/Learn/active.html, 2003.
[80] J. Lindenstrauss and L. Tzafriri. Classical Banach Spaces, volume 2. Springer Verlag,
1979.
[81] N. Linial, Y. Mansour, and N. Nisan. Constant-depth circuits, Fourier transform and
learnability. Journal of the ACM, 40:607–620, 1993.
[82] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. In 28th Annual Symposium on Foundations of Computer Science,
pages 68–77, 1987.
[83] N. Littlestone. Mistake Bounds and Logarithmic Linear-threshold Learning Algorithms.
PhD thesis, University of California Santa Cruz, 1989.
[84] L. Lovász and M. Simonovits. Random walks in a convex body and an improved volume
algorithm. Random Structures and Algorithms, 4(4):359–412, 1993.
[85] L. Lovász and S. Vempala. Hit and run is fast and fun. Technical Report MSR-TR-2003-05, Microsoft Research, 2003.
[86] L. Lovász and S. Vempala. Hit-and-run from a corner. In Proc. of the 36th ACM
Symposium on the Theory of Computing (STOC), 2004.
[87] H. Mamitsuka and N. Abe. Efficient data mining by active learning. In S. Arikawa
and A. Shinohara, editors, Progress in Discovery Science: Final Report of the Japanese
Discovery Science Project. Springer-Verlag GmbH, 2002.
[88] A. M. Martinez and R. Benavente. The AR face database. Technical report, CVC Tech.
Rep. #24, 1998.
[89] D. A. McAllester. Some PAC-Bayesian theorems. Proc. of the Eleventh Annual Conference on Computational Learning Theory, pages 230–234, 1998.
[90] A. K. McCallum and K. Nigam. Employing EM in pool-based active learning for text
classification. In Jude W. Shavlik, editor, Proceedings of the 15th International Conference on Machine Learning (ICML), pages 350–358, Madison, US, 1998. Morgan
Kaufmann Publishers, San Francisco, US.
[91] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity.
Bulletin of Mathematical Biophysics, 5:115–133, 1943.
[92] S. Mendelson. Learnability in Hilbert spaces with reproducing kernels. Journal of
Complexity, 18(1):152–170, 2002.
[93] C. G. Mooney. Theories of Childhood: An introduction to Dewey, Montessori, Erikson,
Piaget & Vygotsky. Redleaf Press, 2000.
[94] N. Movshovitz-Hadar. Personal communication, 2006.
[95] I. Muslea, S. Minton, and C. Knoblock. Active + semi-supervised learning = robust
multi-view learning. In Proceedings of the 19th International Conference on Machine
Learning (ICML), pages 435–442, 2002.
[96] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm.
In Advances in Neural Information Processing Systems, 2001.
[97] A. B. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium
on the Mathematical Theory of Automata, volume 12, pages 615–622, 1962.
[98] G. Pisier. Probabilistic methods in the geometry of Banach spaces. In Probability
and Analysis, number 1206 in Lecture Notes in Mathematics, pages 167–241. Springer
Verlag, 1986.
[99] L. Pitt and M. K. Warmuth. Prediction-preserving reducibility. Journal of Computer
and System Sciences, 41:430–467, 1990.
[100] A. Prekopa. Logarithmic concave measures with applications to stochastic programming. Acta Sci. Math. (Szeged), 32:301–315, 1971.
[101] J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106,
1986.
[102] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[103] F. Rosenblatt. The perceptron: A probabilistic model for information storage and
organization in the brain. Psychological Review, 65(6):386–408, 1958.
[104] N. Roy and A. McCallum. Toward optimal active learning through sampling estimation
of error reduction. In Proceedings of the 18th International Conference on Machine
Learning (ICML), pages 441–448. Morgan Kaufmann, San Francisco, CA, 2001.
[105] S. Russell. Stuart Russell on the future of artificial intelligence. Ubiquity, 4(43), 2004.
http://www.acm.org/ubiquity/interviews/v4i43_russell.html.
[106] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall,
2nd edition, 2002.
[107] Y. Sakakibara. On learning from queries and counterexamples in the presence of noise.
Information Processing Letters, 37(5):279–284, 1991.
[108] N. Sauer. On densities of families of sets. Journal of Combinatorial Theory, Series A, 13:145–147,
1972.
[109] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new
explanation for the effectiveness of voting methods. Annals of Statistics, 1998.
[110] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In
Proceedings of the 17th International Conference on Machine Learning (ICML), pages
839–846. Morgan Kaufmann, San Francisco, CA, 2000.
[111] B. Schölkopf, A. J. Smola, and K. R. Müller. Kernel principal component analysis. In
B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods -
Support Vector Learning, pages 327–352. MIT Press, 1999.
[112] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. Proc. of the Fifth
Workshop on Computational Learning Theory, pages 287–294, 1992.
[113] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from
examples. Physical Review A, 45(8), 1992.
[114] C. E. Shannon. A mathematical theory of communication. Bell System Technical
Journal, 27, July and October 1948.
[115] J. Shawe-Taylor and N. Cristianini. Further results on the margin distribution. In
Proceedings of the 12th Annual Conference on Learning Theory (COLT), pages 278–
285, 1999.
[116] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge
University Press, 2004.
[117] L. Shpigelman, Y. Singer, R. Paz, and E. Vaadia. Spikernels: Predicting arm movements by embedding population spike rate patterns in inner-product spaces. Neural
Computation, 17(3):671–690, March 2005.
[118] D. G. Singer and T. A. Revenson. A Piaget primer: how a child thinks. The Penguin
Group, revised edition 1996.
[119] P. Sollich and D. Saad. Learning from queries for maximum information gain in imperfectly learnable problems. Advances in Neural Information Processing Systems, 7:287–294, 1995.
[120] C. J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595–
620, 1977.
[121] N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. In Proc.
37th Annual Allerton Conf. on Communication, Control and Computing, pages 368–
377, 1999.
[122] S. Tong and D. Koller. Support vector machine active learning with applications to
text classification. Journal of Machine Learning Research (JMLR), 2:45–66, Nov 2001.
[123] L. Troyansky. Faithful Representations and Moments of Satisfaction: Probabilistic
Methods in Learning and Logic. PhD thesis, Hebrew University, Jerusalem, 1998.
[124] G. Tur, D. Hakkani-Tür, and R. E. Schapire. Combining active and semi-supervised
learning for spoken language understanding. Speech Communication, 45(2):171–186,
2005.
[125] G. Tur, R. E. Schapire, and D. Hakkani-Tür. Active learning for spoken language
understanding. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2003.
[126] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–
1142, 1984.
[127] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
[128] V. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of
events to their probabilities. Theory of Probability and its Applications, 16(2):264–280,
1971.
[129] C. Zhang and T. Chen. An active learning framework for content-based information
retrieval. IEEE Transactions on Multimedia, 4(2):260–268, 2002.
Hebrew Abstract

The great advantage of learning algorithms is their ability to adapt themselves to the task at hand. Whereas "traditional" algorithms are designed to solve one specific problem, learning algorithms acquire by themselves the ability to solve the problem. Learning-based methods have been applied successfully in various domains, such as handwriting recognition, information retrieval, medical diagnosis and fraud detection.

There are different modes of learning processes. In all of them, the learner is exposed to data and tries to find patterns in it; later on, these patterns are used for drawing conclusions or for discovering properties of the data. The different modes of learning are distinguished in several ways:
1. the type of data the learner sees;
2. the way the learner and the teacher communicate with each other;
3. the way the learner's performance is evaluated.

In many cases, a long training process is needed before the learner reaches an adequate level of performance. Generating long training sequences is a hard and expensive process, since in most cases it requires a great deal of work by professionals. A significant part of machine learning research is therefore devoted to different ways of shortening the training process. In this research we examine ways of shortening the learning process by means of active learning.

Active learning gives the learner a certain amount of control over the learning process: the learner is given the possibility of directing the teacher to present information of greater value, thereby shortening the learning process and reducing its cost. The logic at the basis of this approach is that the value of additional information depends on the existing information and on the internal structure of the learning algorithm; hence, for the learning process to be efficient, it has to be adapted to the learning algorithm. This is true both when the learner is a computer and when the learner is a human being¹. In a certain sense, this work proves at least the first part of the ancient Hebrew saying:

"The bashful cannot learn, nor can the impatient teach" (Ethics of the Fathers, Chapter 2, Mishnah 5)

¹ A short discussion of active learning in human beings is presented in Chapter 12 of this work.
Active learning draws its power from several modes of learning: supervised learning, unsupervised learning and reinforcement learning, and in this work we take something from each of these modes. Nevertheless, we compare the performance of active learners with that of passive learners in supervised learning, and especially within the "Probably Approximately Correct" (PAC) framework [126]. We study several frameworks of active learning: "membership queries" (Membership Queries), "selective sampling" (Selective Sampling) and "label efficient learning" (Label Efficient Learning).

The work opens with a short introduction, after which we discuss membership queries [3]. In this framework the learner can pose questions to the teacher: it does so by constructing observations and asking the teacher to label them. We show that learning can take place even when the teacher is inaccurate in the answers it provides to the learner's questions. We show that the "noise" can be filtered out when the problem being learned is well structured, namely when the dual learning problem is dense and of reasonable complexity.

The third part of this work deals with selective sampling [29, 30]. In this framework the student sees unlabeled observations and chooses the observations that the teacher will label. This framework is useful in the many cases where raw data is abundant while labels are scarce. We focus on the Query By Committee (QBC) algorithm [112] and address theoretical, algorithmic and practical issues. We begin by extending the theoretical understanding of the QBC algorithm and of selective sampling. Our study shows that QBC learns exponentially faster than passive learners in many more cases than was previously known, and that the algorithm is also efficient in the label efficient framework, which is the online variant of selective sampling.

We continue by removing several assumptions that were made in the theoretical analysis. First we remove the Bayesian assumption: by removing this assumption we show that when certain conditions hold, QBC is efficient almost surely, whereas under the Bayesian assumption only the average performance of the algorithm can be guaranteed. We then address learning with QBC in the presence of noise, and show that the QBC algorithm can be modified so as to be robust against noise. In order to analyze this case we develop the notion of the information of observations, which is of interest in its own right.

A central requirement of the QBC algorithm is the ability to sample a random hypothesis. The difficulty of fulfilling this requirement has kept people from using the algorithm. We show, for the first time, that when linear separators are being learned, the sampling can be performed in polynomial time. We prove that the problem of sampling a random hypothesis is identical to the problem of sampling a random point from a convex body. This problem has been studied extensively [41, 84, 85], and the results can be used in the case before us. Although these algorithms are polynomial, they are far from being efficient; we attack this issue in two ways. First we show that the sampling can be performed in a low dimension, which allows us to use QBC in high dimension without sacrificing the computational complexity, so that kernels can be used to strengthen the QBC algorithm. We also suggest using a random walk of the hit-and-run type [85], and we show empirically that the number of steps this random walk needs to perform is small. These tools allow us to conduct experiments with the QBC algorithm and to demonstrate the advantages of active learning over passive learning.

The research of active learning is still in its infancy. There are several theoretical results (for example on the QBC algorithm [27, 36, 37, 48]) and experimental results (for example [35, 124]). The uniqueness of this work is in its comprehensive approach: we look at the same objects from different points of view, studying active learning by means of analytical tools while at the same time addressing practical issues and providing results of experiments. By doing so we reduce the gap between theory and practice in the field. We believe that active learning has considerable importance in many domains; this work is one more step toward its realization.
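Since the hit-and-run walk is what makes the hypothesis sampling above practical, a minimal sketch of one step of such a walk may help. This is an illustrative implementation, not the exact procedure used in the thesis; the membership oracle interface and the doubling-and-bisection chord search with its tolerance are assumptions of the sketch.

    import numpy as np

    def hit_and_run_step(x, inside, rng, tol=1e-6):
        """One step of the hit-and-run random walk inside a convex body.

        x      -- current point (must lie inside the body)
        inside -- membership oracle: inside(y) is True iff y is in the body
        rng    -- numpy random generator
        """
        # Pick a uniformly random direction on the unit sphere.
        d = rng.standard_normal(x.shape[0])
        d /= np.linalg.norm(d)

        # Find the chord {x + t*d} inside the body: double outward until
        # we leave the body, then bisect to the boundary in each direction.
        def boundary(sign):
            lo, hi = 0.0, 1.0
            while inside(x + sign * hi * d):
                lo, hi = hi, 2.0 * hi
            while hi - lo > tol:
                mid = (lo + hi) / 2.0
                lo, hi = (mid, hi) if inside(x + sign * mid * d) else (lo, mid)
            return lo

        t_plus, t_minus = boundary(+1.0), boundary(-1.0)
        # Move to a uniformly random point on the chord.
        return x + rng.uniform(-t_minus, t_plus) * d

    # Example: sampling from the unit ball in R^5.
    rng = np.random.default_rng(0)
    x = np.zeros(5)
    for _ in range(1000):
        x = hit_and_run_step(x, lambda y: np.linalg.norm(y) <= 1.0, rng)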
Introduction

Learning is "a process through which we process information, which leads to a change in or an enhancement of knowledge and skills" [74]. This remarkable learning ability is one of the most prominent traits of human beings, and one that distinguishes us from most of the animal kingdom: learning allows us to cope with a changing environment and to solve complex problems. The definition quoted above points to several of the essential components of learning. Learning is a process that converts information into knowledge and skills; it is therefore a process that enriches our capabilities.

Machine learning is the art of designing computerized learning mechanisms; that is to say, machine learning is the art of designing computerized processes that are capable of converting the information we encounter into knowledge and capabilities. Machine learning uses the following process in order to solve problems:
1. collecting information;
2. finding a concise representation of the information;
3. finding regularities in the information;
4. converting these regularities into knowledge;
5. converting the knowledge into actions (sometimes).
This process has proved effective in a variety of domains such as document classification, speech recognition, computer vision, genome research, medicine and more.
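As a toy illustration of the five steps above (an invented example, not taken from the thesis), the following sketch collects synthetic labeled measurements, represents the regularity in them by a single learned threshold, and converts that knowledge into predictions:

    import numpy as np

    # 1. Collect information: labeled measurements (synthetic data in which
    #    examples above an unknown threshold of 3.0 are labeled +1).
    rng = np.random.default_rng(1)
    x = rng.uniform(0.0, 10.0, size=200)
    y = np.where(x > 3.0, 1, -1)

    # 2-3. Represent the data concisely and find the regularity: a single
    #      threshold that best separates the two labels on the sample.
    candidates = np.sort(x)
    errors = [np.sum(np.where(x > t, 1, -1) != y) for t in candidates]
    theta = candidates[int(np.argmin(errors))]

    # 4-5. The learned threshold is the acquired "knowledge"; applying it to
    #      new observations converts that knowledge into predictions.
    def predict(x_new):
        return np.where(x_new > theta, 1, -1)

    print(theta, predict(np.array([2.0, 5.0])))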
Learning from Examples

There is a variety of situations in which learning is required. We are interested in learning from examples and, more precisely, in supervised learning from examples. In learning of this kind two players participate, the teacher and the learner. The teacher possesses some knowledge, which the learner wishes to acquire; the learner therefore watches the teacher's moves, in the hope of collecting enough information and of being clever enough to convert this information into knowledge. In this process there are thus two sources of complexity: (a) the length of time the learner has to watch the teacher, and (b) the "cleverness" required of the learner. In the machine learning literature these complexity factors are called sample complexity and computational complexity.

We will be dealing with learning from examples. During the learning process, the learner is given access to examples. Each example is a pair (x, y), where x is an observation taken from the observation space X and y is the label of this observation, taken from the label space Y. We assume that in the background there exists a target concept c that maps inputs (observations) to outputs (labels); the goal of the learner is to approximate this target concept.

A considerable part of machine learning research concentrates on different ways of reducing the sample complexity and the computational complexity. In a certain sense, this work is concerned with the balance between the sample complexity and the computational complexity. We are interested in ways of reducing the sample complexity without increasing the computational complexity by too much; in other words, we are interested in finding ways of transferring part of the workload from the teacher to the learner.

In order to achieve this improvement in learning we will go beyond the traditional learning frameworks, such as PAC (see Definition 1), and require active learning. By active learning we mean that the learner has an active role in the learning process: while a passive learner only watches the teacher, an active learner can guide the teacher by means of questions.

In this work we study active learning as an extension of passive learning. We show that, in many cases, active learners achieve far better results than passive learners. We study these frameworks both by analysis and by experiment; in contrast with most of the literature in the field of active learning, we use the same algorithms both in the analysis and in the experiments, and thereby build a bridge over the gaping chasm between the theory and the application of active learning.

In the remainder of this chapter we present basic definitions and results in machine learning. A reader well versed in the field may leaf through these pages in order to become acquainted with the notation we use (the notation is presented in Table 1.1 on page 4).
Machine Learning and Artificial Intelligence

Machine learning is a branch of artificial intelligence. Machine learning and artificial intelligence each try, in their own way, to imitate the way a human being solves complex problems. "Traditional" artificial intelligence defines knowledge as a system of logical rules; with the aid of these rules, conclusions can be drawn about cases that have not yet been observed. Machine learning differs in two senses. First, machine learning puts great emphasis on the learning process, the process in which the knowledge is acquired. Second, machine learning generally uses statistical and probabilistic inference rules, as opposed to the logical rules customary in artificial intelligence. In a certain sense, machine learning is a stage in the evolution of artificial intelligence [105].

The difference between the approaches can be illustrated with the following example. Suppose we wish to build a machine that performs a certain task, say medical diagnosis. According to the "logical" approach, to build such systems one has to contact an expert in the domain (a physician in this case) and obtain from him the system of rules that distinguishes the sick from the healthy; once the system of rules has been collected, it is built into the machine and used for diagnosing patients. This approach has several drawbacks. First, in most cases defining the rules is impossible. Second, it is hard to maintain such a system of rules and to weed out its errors: in a system with thousands of rules, how do we locate a rule that leads to a wrong diagnosis, and how do we fix it without harming the whole system? Finally, how do we adapt such a system to changes in the environment or to additional diagnostic tasks?

The difficulties presented above affect the process of knowledge acquisition. Machine learning takes a different approach. In the knowledge acquisition phase, namely the learning phase, the learner follows an expert at work and collects statistical data about the correlations in the information; once the learner has acquired enough knowledge, this knowledge can be used for making predictions and diagnoses. Learning from examples is useful in a range of domains such as medical diagnosis, information retrieval and more. The advantage of this approach is that the learning process only requires watching the expert; machines based on it are easy to maintain, and it allows the reuse of system components. Machine learning, as the name suggests, concentrates on the knowledge acquisition process, and it can be divided into subfields according to the character of the knowledge acquisition process (for example supervised learning, semi-supervised learning, unsupervised learning) and according to the task for which one learns (for example batch prediction tasks, classification, regression). In this work we deal with supervised learning.

An Introduction to Machine Learning

Machine learning has been studied for more than half a century under different names. A complete survey of the field is not within the scope of this work; we present here the foundations that we will use later on. Machine learning attracts researchers from different fields: mathematicians, computer scientists, brain researchers, biologists and others. There are three main motives for research in the field:
1. the study of the brain;
2. learning as a way of solving complex problems;
3. the study of "learning" as an abstract notion.

Brain researchers have found that the brain is built of elementary building blocks, the neurons, which are connected to one another in a network. The ability of the human brain to solve complex problems and to adapt itself to changes led researchers to believe that by building artificial neural networks we would become able to learn to solve complex problems; in addition, this would allow a better understanding of the operation of the brain. This research direction began in the 1940's, when McCulloch and Pitts [91], and later Hebb [53], suggested ways in which networks of neurons could operate. These works are the seeds of a fruitful and inspiring line of research, in whose wake hundreds of books have been written.

The neuron is the main building block of the neural network. A neuron has a large number of inputs (synapses) and a single output unit (axon). The artificial version of the neuron is the perceptron, or the linear separator [103]. Like the neuron, the perceptron has a large number of inputs and a single output. To simplify notation we denote the input by x ∈ R^d, so that each component of this vector corresponds to a synapse. The perceptron computes a linear threshold function over the inputs. Every perceptron has a weight vector w ∈ R^d and a threshold θ ∈ R, and the function it computes is

    c_{w,θ}(x) = sign(w · x − θ)

where w · x is the inner product of the vectors w and x. Although the perceptron was defined some 50 years ago, it is still the most common tool in machine learning, and a considerable part of this work is dedicated to it. The perceptron is usually called a "linear separator"; when the threshold θ vanishes, that is, when the decision rule is c_w(x) = sign(w · x), a classifier of this kind is called a "homogeneous linear separator".
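The text defines only the function a perceptron computes. As a minimal sketch of how such a separator can be trained, here is the classic mistake-driven online update rule of [103], filled in for illustration; the synthetic data stream is an assumption of the sketch.

    import numpy as np

    class Perceptron:
        """Homogeneous linear separator trained with the classic online rule."""

        def __init__(self, dim):
            self.w = np.zeros(dim)          # O(d) memory cells

        def predict(self, x):
            return 1 if np.dot(self.w, x) >= 0 else -1   # O(d) per prediction

        def update(self, x, y):
            """Observe (x, y); change w only on a mistake. Returns 1 on a mistake."""
            if self.predict(x) != y:
                self.w += y * x             # Rosenblatt's correction step
                return 1
            return 0

    # Example rounds of learning: the learner predicts, the true label is
    # revealed, and the learner counts its mistakes.
    rng = np.random.default_rng(0)
    w_star = np.array([1.0, -2.0, 0.5])     # hidden target separator
    p, mistakes = Perceptron(3), 0
    for _ in range(500):
        x = rng.standard_normal(3)
        y = 1 if np.dot(w_star, x) >= 0 else -1
        mistakes += p.update(x, y)
    print("mistakes:", mistakes)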
Artificial neural networks serve both for the study of the brain and as a tool for solving hard problems. There are also other methods for solving these problems; further representatives of such methods are the nearest neighbor rules and the window-based methods [33, 46, 120]. Both kinds of methods assume the existence of a distance measure on the input space, and therefore classify new examples according to their proximity to the examples observed during learning. In the nearest neighbor method, the predicted classification of a new example is obtained by a majority vote among the k nearest neighbors of the example in question. In the window methods, the predicted classification is obtained by a majority vote among all the training examples whose distance from the example we are interested in is small enough. Both methods have been analyzed and proved to be consistent; that is, they are optimal in the asymptotic sense, given a correct choice of parameters [120].
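A minimal sketch of the two distance-based rules just described, assuming a Euclidean metric and toy data (both are illustrative choices, not specified in the text):

    import numpy as np

    def knn_predict(train_x, train_y, x, k=3):
        """Majority vote among the k training examples nearest to x."""
        d = np.linalg.norm(train_x - x, axis=1)
        vote = np.sum(train_y[np.argsort(d)[:k]])
        return 1 if vote >= 0 else -1

    def window_predict(train_x, train_y, x, radius=1.0):
        """Majority vote among all training examples within a fixed window."""
        d = np.linalg.norm(train_x - x, axis=1)
        vote = np.sum(train_y[d <= radius])
        return 1 if vote >= 0 else -1

    # Example: ten labeled points in the plane, one test point.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((10, 2))
    y = np.where(X[:, 0] > 0, 1, -1)
    q = np.array([0.5, 0.0])
    print(knn_predict(X, y, q), window_predict(X, y, q))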
A more recent motive for research in machine learning is the attempt to understand the notion of "learning". The most important research in this field was done by Vapnik and Chervonenkis, who studied the phenomenon of uniform convergence of empirical means [128]. This result was kept as a "secret" until Blumer et al. [17] discovered its connection to the Probably Approximately Correct (PAC) model [126]. Valiant's model is an attempt to define learning as a mathematical notion. Since the definition of learning was made in isolation from the way it is carried out, it became possible to ask questions that had not been well defined before, such as "can one learn?", "can one learn everything?", or, more generally, "what can be learned?". In the following sections we survey the PAC model [126], additional definitions of learning, and the important findings in the field.

The PAC Model

Several important observations stand at the basis of the Probably Approximately Correct (PAC) model [126]. First, learning is a finite process, in the sense that the benefits of the learning can be discerned after a finite time; hence, a finite collection of examples should suffice for learning. Valiant also distinguished between inaccuracy and failure of the learner. Inaccuracy is caused by the finite sample that the learner sees; but, in certain cases, the learning process fails completely if the sample is not representative. Valiant argued that learning is possible if a reasonable accuracy can be achieved with a high level of confidence.

The PAC model defines which concept classes can be learned: a class C can be learned if the number of examples required in order to learn a concept in this class is finite. Valiant was interested mainly in binary problems, in which Y = {±1}. The assumption that the labels can take only two values looks restrictive, but there are standard ways of translating multi-label problems into collections of binary problems, in most cases by splitting the multi-label problem into a number of binary problems. For example, in the one-against-all method one generates a binary learning problem for every possible value y ∈ Y, and trains the binary learner to distinguish between the observations labeled y and the other observations. In the all-pairs method, as its name hints, a classifier is trained for every possible pair of labels. A more general method uses error-correction mechanisms to create a collection of binary problems and to combine them into a multi-label classifier (see, e.g., [34]). Therefore, the assumption that the learning problem is binary is not that restrictive, and we will make this assumption unless stated otherwise².

² Multi-label learning problems are more involved in our setting, since the learner has to decide when it should ask the teacher for additional information. We discuss this subject in Chapter 8.
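A minimal sketch of the one-against-all reduction described above, with a simple perceptron as the binary base learner (the base learner, the data and the hidden labeling rule are illustrative assumptions):

    import numpy as np

    class BinaryPerceptron:
        def __init__(self, dim):
            self.w = np.zeros(dim)

        def update(self, x, y):
            if (1 if np.dot(self.w, x) >= 0 else -1) != y:
                self.w += y * x

    class OneAgainstAll:
        """Reduce a multi-label problem to one binary problem per label."""

        def __init__(self, labels, dim):
            self.learners = {y: BinaryPerceptron(dim) for y in labels}

        def update(self, x, y):
            # Each binary learner separates its own label (+1) from the rest (-1).
            for label, learner in self.learners.items():
                learner.update(x, 1 if y == label else -1)

        def predict(self, x):
            # Predict the label whose binary separator scores x the highest.
            return max(self.learners, key=lambda l: np.dot(self.learners[l].w, x))

    # Example: three labels in the plane.
    rng = np.random.default_rng(0)
    model = OneAgainstAll(labels=[0, 1, 2], dim=2)
    for _ in range(300):
        x = rng.standard_normal(2)
        y = int(np.argmax([x[0], x[1], -x[0] - x[1]]))   # hidden multi-label rule
        model.update(x, y)
    print(model.predict(np.array([1.0, 0.0])))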
Definition 1 (PAC learning [126]) Let X be the sample space, let C be a binary concept class over X, and let Y = {±1} be the label space. We say that C is PAC learnable if for every ε, δ > 0 there exist m < ∞ and an algorithm L : (X × Y)^m → C such that for every probability measure µ over X × Y,

    Pr_{S~µ^m} [ error_µ(L(S)) > ε + inf_{c∈C} error_µ(c) ] < δ

where error_µ(c) = µ{(x, y) : c(x) ≠ y}.

A concept class is thus PAC learnable if a finite sample is enough in order to find a hypothesis in the class that is almost the best hypothesis in the class, in the sense of the generalization error. Vapnik and Chervonenkis [128] showed that PAC learnable classes have a unique geometric property: a concept class C is PAC learnable if and only if it has a finite VC dimension. In order to define the VC dimension we must first define the shattering coefficients of the class C.

Definition 2 Let C be a concept class. The m'th shattering coefficient of C is

    Π_C(m) = max_{x_1,...,x_m ∈ X} |{(c(x_1), ..., c(x_m)) : c ∈ C}|

The shattering coefficient measures the number of different ways in which the concept class can label m examples. Since |Y| = |{±1}| = 2, we have Π_C(m) ≤ 2^m. This is the logic behind the definition of the VC dimension:

Definition 3 The VC dimension of a concept class C is

    d = max {m : Π_C(m) = 2^m}

The dimension is infinite if Π_C(m) = 2^m for every m. The VC dimension characterizes the PAC learnable classes precisely. Vapnik and Chervonenkis proved the following highly influential theorem (we present a formulation slightly different from the original):

Theorem 1 [6, Theorem 4.2 and Theorem 4.8] Let C be a class whose VC dimension is d. Assume that L is an algorithm that, given a sample S ∈ (X × Y)^m of labeled examples, returns a hypothesis c = L(S) ∈ C which minimizes the training error |{(x, y) ∈ S : c(x) ≠ y}|. Then for every δ > 0 and every probability measure µ over X × Y,

    Pr_{S~µ^m} [ error_µ(L(S)) > ε ] ≤ δ

for every

    ε ≥ (2/m) ( d ln(2em/d) + ln(2/δ) )

If the VC dimension of C is infinite, then C is not PAC learnable. Theorem 1 shows that the expected rate is O*(d/m) when the target concept lies inside the class from which we learn, and O*(√(d/m)) in the general case³. It should be noted that the above bounds can be improved only in the constants. In Chapter 6 we will see that when active learning is used we achieve far better results.

³ O*(·): we neglect logarithmic factors.
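As a worked illustration of Definitions 2 and 3 (a standard textbook example, not taken from this chapter), consider threshold functions on the real line:

    Let $\mathcal{C} = \{c_\theta : \theta \in \mathbb{R}\}$ with
    $c_\theta(x) = \mathrm{sign}(x - \theta)$ over $\mathcal{X} = \mathbb{R}$.
    A single point $x_1$ can receive both labels ($\theta < x_1$ gives $+1$,
    $\theta > x_1$ gives $-1$), so $\Pi_{\mathcal{C}}(1) = 2 = 2^1$.
    For two points $x_1 < x_2$ the labeling $(+1, -1)$ is unrealizable, since
    $c_\theta(x_1) = +1$ forces $\theta < x_1 < x_2$ and hence
    $c_\theta(x_2) = +1$; thus $\Pi_{\mathcal{C}}(2) = 3 < 2^2$, and by
    Definition 3 the VC dimension of $\mathcal{C}$ is $1$.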
Online Learning

Online learning is a further definition of learning. Littlestone [83] emphasized the fact that learning is a continuous process. In the PAC model there are two phases, the learning phase and the generalization phase; in online learning these two phases are interwoven. Learning takes place in rounds. In round t the teacher presents an observation x_t, and the learner proposes a label ŷ_t for x_t. The teacher then presents the label y_t, and the learner suffers a loss of L(y_t, ŷ_t), where L(·,·) is a non-negative loss function. Figure 1 presents a flow diagram of a single round of online learning.

[Figure 1: A flow diagram of a single round of online learning. The teacher sends the observation x_t, the learner answers with the prediction ŷ_t, and the teacher reveals the true label y_t, inflicting the loss L(y_t, ŷ_t) on the learner.]

The goal of the learner in this setting is to minimize Σ_{t=1}^∞ L(y_t, ŷ_t) under the mildest possible assumptions. In most cases, one of the following assumptions is made:

1. There exists a target concept c inside the class C such that y_t = c(x_t). In this case it is sometimes possible to show that

    Σ_{t=1}^∞ L(y_t, ŷ_t) ≤ M < ∞

M is then called the mistake bound, since it is an absolute bound on the number of mistakes for every sequence of examples x_1, x_2, ... and every concept c ∈ C.

2. The target concept is not restricted, but the learner is measured against a restricted class of competitors. In this case one looks for a function f(·) such that for every sequence (x_1, y_1), (x_2, y_2), ...

    Σ_{t=1}^∞ L(y_t, ŷ_t) ≤ inf_{c∈C} Σ_{t=1}^∞ f( L(c(x_t), y_t) )

Many of the online learning algorithms are simple, fast, and use little memory. For example, the perceptron algorithm [103], operated in a d-dimensional space, requires O(d) memory cells, and each prediction requires O(d) operations⁴. Nevertheless, the definition of online learning does not restrict the use of memory and computation.

Since in most of the cases discussed the labels are +1 or −1, the natural loss function is the 0−1 loss,

    L_{0−1}(y_t, ŷ_t) = (1 − y_t ŷ_t)/2 = 1 if y_t ≠ ŷ_t, and 0 if y_t = ŷ_t,

which vanishes when y_t and ŷ_t are identical and takes the value 1 otherwise. In this case Σ_{t=1}^∞ L_{0−1}(y_t, ŷ_t) is exactly the number of mistakes of the learning algorithm. For example, when the perceptron algorithm is operated on a sequence (x_1, y_1), (x_2, y_2), ... such that there exists w ∈ R^d with ||w||_2 = 1 and y_t (w · x_t) ≥ θ for every t, and in addition ||x_t||_2 ≤ R, then

    Σ_{t=1}^∞ L_{0−1}(y_t, ŷ_t) ≤ R²/θ²

This is a mistake bound for the perceptron algorithm, proved in [97].

⁴ We assume that the perceptron is represented in the primal space. When reproducing kernels are used, one has to use the dual representation; in this case the memory and computation requirements change completely. Further details can be found in [40].
Active Learning

Stone's impressive theorem shows that, given a long enough training sample, even naive algorithms such as nearest neighbors can provide the best results possible [120]. However, collecting a large training sample creates two problems. First, data collection is a long and expensive process. Second, processing large samples requires many resources: the information clearly has to be processed during training and, although in certain cases the complexity of generalization does not depend on the length of the training process, in most algorithms, including Support Vector Machines [20] and AdaBoost [47], the computational complexity of generalization is a function of the length of the training process. Shortening the training process is therefore of great importance.

Active learning holds that the size of the training sample can be reduced if we allow ourselves to go beyond the standard definitions of learning, such as PAC or online learning, and grant the student a certain amount of control over the learning process. In the learning frameworks described so far, the teacher chooses the examples that the learner will see; such settings are therefore called passive learning. In active learning, the learner has a certain influence over the choice of the training examples. This control over the learning process allows it to concentrate on examples that will enrich its knowledge considerably, and the learning process is thereby accelerated.

In many cases active learning indeed accelerates the learning process; we will show that the acceleration can be exponential. In many cases, however, this has a price: since the learner has control over the learning process, it has to make decisions from which a passive learner is exempt. Consequently, in certain cases the computational complexity of the learning process grows when one moves from passive to active learning, while at the same time the sample complexity of the learning shrinks significantly. That is, we transfer the workload from the teacher to the student, and from the generalization phase to the training phase. This makes good sense, since the teacher is usually a human being while the learner is a machine; active learners therefore require less "manual labor" but more computing power in the training phase.

We discuss two frameworks of active learning: in Part II we discuss membership queries, and in Part III we discuss the filtering approach. The difference between these frameworks is the kind of control over the learning process that is given to the learner. In the membership queries framework [3], the learner can ask the teacher questions: the questions are presented to the teacher as observations, and the teacher is required to label these observations. The filtering framework [29] is more restricted: a collection of unlabeled observations is presented to the learner, which may request the labels of a subset of these observations. This framework subdivides further into batch learning and an online variant called "label efficient learning" [54].

Additional frameworks of active learning exist. In the equivalence queries framework [3] the learner can present the teacher with a hypothesis; the teacher can confirm that the hypothesis is good, or supply an example on which the hypothesis fails. A further framework is experimental design (see, e.g., [7]), which has been studied extensively by statisticians. In this framework the task is usually the approximation of a real-valued function, and the learner can choose the experiments to perform. Despite the close relation to active learning, in most cases the learner has to choose the experiments in advance, and it does not update its choices according to the results of previous experiments.
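A minimal sketch of selective sampling in the spirit of the QBC algorithm this work focuses on: two hypotheses are drawn from the set of separators consistent with the labels seen so far, and the teacher is queried only where they disagree. The rejection sampler used here is a crude stand-in for the hit-and-run procedure discussed earlier, and the data stream and teacher are invented for illustration.

    import numpy as np

    def sample_hypothesis(data, rng, dim):
        """Rejection-sample a homogeneous linear separator consistent with data."""
        while True:
            w = rng.standard_normal(dim)
            w /= np.linalg.norm(w)
            if all(y * np.dot(w, x) > 0 for x, y in data):
                return w

    def qbc_selective_sampling(stream, teacher, rng, dim):
        """Ask for a label only when two committee members disagree."""
        labeled, queries = [], 0
        for x in stream:
            w1 = sample_hypothesis(labeled, rng, dim)
            w2 = sample_hypothesis(labeled, rng, dim)
            if np.sign(np.dot(w1, x)) != np.sign(np.dot(w2, x)):
                labeled.append((x, teacher(x)))   # disagreement: query the teacher
                queries += 1
        return labeled, queries

    # Example: a hidden separator labels the stream; count the labels requested.
    rng = np.random.default_rng(0)
    w_star = np.array([1.0, -1.0])
    stream = [rng.standard_normal(2) for _ in range(200)]
    _, queries = qbc_selective_sampling(
        stream, lambda x: 1 if np.dot(w_star, x) >= 0 else -1, rng, 2)
    print("labels requested:", queries, "out of", len(stream))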
Additional Learning Frameworks

As we noted above, this research concentrates on supervised learning, although this is not the only framework for learning.

Unsupervised Learning

Unsupervised learning is an important branch of learning. The goal in unsupervised learning is to find structure in the data. The learner receives observations x_1, ..., x_m and is required to find an efficient representation of this data; a good representation is a small one that captures the central properties of the data. Two common ways of finding such representations are clustering and dimensionality reduction.

In clustering, the learner groups the observations into clusters. The goal is to find a small number of clusters such that all the examples in the same cluster are close or similar to one another, in contrast to examples from other clusters. There are many ways of finding clusters (e.g., [14, 96, 121]), but the problem is not well defined [67], since the similarity between points can be measured in different ways. Nevertheless, clustering has had many successes.

Another unsupervised method is dimensionality reduction, in which the learner finds a representation of the data in a low dimension. For every observation x_i, the learner keeps only a small number of features φ_1(x_i), ..., φ_d(x_i), which form a representation that is much smaller than, but similar to, the original one, so that the φ's preserve most of the properties of x_i. Principal Component Analysis (PCA) [61] is a representative of this family of learning methods.
oeibidd .ezelr z` mvnvle enild jildz z` xvwl jk ii-lre ,xzei ax jxr lra rin bivdl dxend
iniptd dpand lye mewd rind ly dlez `ed rind ztqez ly jxrdy `ed ef dhiy ly dqiqaa
.enild mzixebl`l m`zen zeidl eilr ,liri didi enild jildzy zpn lr ,okl .enild mzixebl` ly
1
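A minimal sketch of the interaction protocol just described, using tabular Q-learning as the learner; the learner, the toy chain environment and all parameters are invented for illustration and are not taken from the thesis.

    import numpy as np

    def run(env_step, n_states, n_actions, rounds=5000, eps=0.1, lr=0.1, gamma=0.9):
        """Tabular Q-learning against a stochastic state machine."""
        Q = np.zeros((n_states, n_actions))
        rng, s = np.random.default_rng(0), 0
        for _ in range(rounds):
            # The learner decides on an action (mostly greedily).
            a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
            # The machine returns a reward and a new state, both stochastic
            # functions of the current state and the chosen action.
            r, s_next = env_step(s, a, rng)
            Q[s, a] += lr * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
        return Q

    # Toy chain: action 1 tries to move right (succeeding 80% of the time);
    # reaching the last state pays 1 and restarts the walk.
    def chain(s, a, rng):
        s_next = min(s + 1, 4) if a == 1 and rng.random() < 0.8 else max(s - 1, 0)
        return (1.0, 0) if s_next == 4 else (0.0, s_next)

    print(run(chain, n_states=5, n_actions=2))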
Active learning combines supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning: we use unlabeled observations and labeled observations, and we make decisions that affect future events, since the student's queries affect the learning process. Nevertheless, the central point of view of this research is supervised learning, with active learning as an extension of supervised learning.