More data speeds up training time in learning
halfspaces over sparse vectors
Amit Daniely
Department of Mathematics
The Hebrew University
Jerusalem, Israel
Nati Linial
School of CS and Eng.
The Hebrew University
Jerusalem, Israel
Shai Shalev-Shwartz
School of CS and Eng.
The Hebrew University
Jerusalem, Israel
Abstract
The increased availability of data in recent years has led several authors to ask
whether it is possible to use data as a computational resource. That is, if more
data is available, beyond the sample complexity limit, is it possible to use the
extra examples to speed up the computation time required to perform the learning
task?
We give the first positive answer to this question for a natural supervised learning problem — we consider agnostic PAC learning of halfspaces over 3-sparse vectors in {−1, 1, 0}^n. This class is inefficiently learnable using O(n/ε²) examples. Our main contribution is a novel, non-cryptographic, methodology for establishing computational-statistical gaps, which allows us to show that, under the widely believed assumption that refuting random 3CNF formulas is hard, it is impossible to efficiently learn this class using only O(n/ε²) examples. We further show that, under stronger hardness assumptions, even O(n^1.499/ε²) examples do not suffice. On the other hand, we show a new algorithm that learns this class efficiently using Ω̃(n²/ε²) examples. This formally establishes the tradeoff between sample and computational complexity for a natural supervised learning problem.
1 Introduction
In the modern digital period, we are facing a rapid growth of available datasets in science and
technology. In most computing tasks (e.g. storing and searching in such datasets), large datasets
are a burden and require more computation. However, for learning tasks the situation is radically
different. A simple observation is that more data can never hinder you from performing a task. If
you have more data than you need, just ignore it!
A basic question is how to learn from “big data”. The statistical learning literature classically studies
questions like “how much data is needed to perform a learning task?” or “how does accuracy improve
as the amount of data grows?” etc. In the modern, “data revolution era”, it is often the case that the
amount of data available far exceeds the information theoretic requirements. We can wonder whether
this, seemingly redundant data, can be used for other purposes. An intriguing question in this vein,
studied recently by several researchers ([Decatur et al., 1998, Servedio, 2000, Shalev-Shwartz et al., 2012, Berthet and Rigollet, 2013, Chandrasekaran and Jordan, 2013]), is the following:
Question 1: Are there any learning tasks in which more data, beyond the information theoretic barrier, can provably be leveraged to speed up computation
time?
The main contributions of this work are:
• Conditioning on the hardness of refuting random 3CNF formulas, we give the first example
of a natural supervised learning problem for which the answer to Question 1 is positive.
• To prove this, we present a novel technique to establish computational-statistical tradeoffs
in supervised learning problems. To the best of our knowledge, this is the first such result
that is not based on cryptographic primitives.
Additional contributions are non-trivial efficient algorithms for learning halfspaces over 2-sparse and 3-sparse vectors, using Õ(n/ε²) and Õ(n²/ε²) examples respectively.
The natural learning problem we consider is the task of learning the class of halfspaces over k-sparse
vectors. Here, the instance space is the space of k-sparse vectors,
Cn,k = {x ∈ {−1, 1, 0}^n | |{i | xi ≠ 0}| ≤ k},
and the hypothesis class is halfspaces over k-sparse vectors, namely
Hn,k = {hw,b : Cn,k → {±1} | hw,b(x) = sign(⟨w, x⟩ + b), w ∈ R^n, b ∈ R},
where ⟨·, ·⟩ denotes the standard inner product in R^n.
We consider the standard setting of agnostic PAC learning, which models the realistic scenario
where the labels are not necessarily fully determined by some hypothesis from Hn,k . Note that in
the realizable case, i.e. when some hypothesis from Hn,k has zero error, the problem of learning
halfspaces is easy even over Rn .
In addition, we allow improper learning (a.k.a. representation independent learning), namely, the
learning algorithm is not restricted to output a hypothesis from Hn,k, but is only required to output a hypothesis whose error is not much larger than the error of the best hypothesis in Hn,k. This gives the
learner a lot of flexibility in choosing an appropriate representation of the problem. This additional
freedom to the learner makes it much harder to prove lower bounds in this model. Concretely, it is
not clear how to use standard reductions from NP-hard problems in order to establish lower bounds
for improper learning (moreover, Applebaum et al. [2008] give evidence that such simple reductions
do not exist).
The classes Hn,k and similar classes have been studied by several authors (e.g. Long and Servedio
[2013]). They naturally arise in learning scenarios in which the set of all possible features is very
large, but each example has only a small number of active features. For example:
• Predicting an advertisement based on a search query: Here, the possible features of each
instance are all English words, whereas the active features are only the set of words given
in the query.
• Learning Preferences [Hazan et al., 2012]: Here, we have n players. A ranking of the
players is a permutation σ : [n] → [n] (think of σ(i) as the rank of the i’th player). Each
ranking induces a preference hσ over the ordered pairs, such that hσ(i, j) = 1 iff i is ranked higher than j. Namely,
hσ(i, j) = 1 if σ(i) > σ(j), and hσ(i, j) = −1 if σ(i) < σ(j).
The objective here is to learn the class, Pn , of all possible preferences. The problem of
learning preferences is related to the problem of learning Hn,2 : if we associate each pair
(i, j) with the vector in Cn,2 whose i’th coordinate is 1 and whose j’th coordinate is −1,
it is not hard to see that Pn ⊂ Hn,2 : for every σ, hσ = hw,0 for the vector w ∈ Rn ,
given by wi = σ(i). Therefore, every upper bound for Hn,2 implies an upper bound for
Pn, while every lower bound for Pn implies a lower bound for Hn,2. Since VC(Pn) = n and VC(Hn,2) = n + 1, the information-theoretic barrier to learning these classes is Θ(n/ε²). In Hazan et al. [2012] it was shown that Pn can be efficiently learnt using O(n log³(n)/ε²) examples. In section 4, we extend this result to Hn,2 (see the sketch below).
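The following Python sketch (an illustration added here, not part of the original text) spells out the embedding Pn ⊂ Hn,2 from the bullet above: identifying the pair (i, j) with the 2-sparse vector ei − ej, the preference hσ coincides with the halfspace h_{w,0} whose weights are w_i = σ(i).

```python
import numpy as np

# Sketch of the embedding P_n ⊂ H_{n,2}: a ranking sigma induces the preference
# h_sigma(i, j) = 1 iff sigma(i) > sigma(j); on the 2-sparse vector e_i - e_j this
# is exactly the homogeneous halfspace with weights w_i = sigma(i).

n = 5
rng = np.random.default_rng(0)
sigma = rng.permutation(n) + 1            # sigma[i] = rank of the i'th player (1..n)
w = sigma.astype(float)                   # halfspace weights w_i = sigma(i)

def pair_to_sparse_vector(i, j, n):
    """The vector in C_{n,2} whose i'th coordinate is 1 and whose j'th is -1."""
    x = np.zeros(n)
    x[i], x[j] = 1.0, -1.0
    return x

for i in range(n):
    for j in range(n):
        if i == j:
            continue
        x = pair_to_sparse_vector(i, j, n)
        assert np.sign(w @ x) == (1 if sigma[i] > sigma[j] else -1)
```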
We will show a positive answer to Question 1 for the class Hn,3. To do so, we show¹ the following:
1. Ignoring computational issues, it is possible to learn the class Hn,3 using O(n/ε²) examples.
2. It is also possible to efficiently learn Hn,3 if we are provided with a larger training set (of size Ω̃(n²/ε²)). This is formalized in Theorem 4.1.
3. It is impossible to efficiently learn Hn,3 if we are only provided with a training set of size O(n/ε²), under Feige's assumption regarding the hardness of refuting random 3CNF formulas [Feige, 2002]. Furthermore, for every α ∈ [0, 0.5), it is impossible to learn efficiently with a training set of size O(n^{1+α}/ε²) under a stronger hardness assumption. This is formalized in Theorem 3.1.

¹ In fact, similar results hold for every constant k ≥ 3. Indeed, since Hn,3 ⊂ Hn,k for every k ≥ 3, it is trivial that item 3 holds for every k ≥ 3. The upper bound given in item 1 holds for every k. For item 2, it is not hard to show that Hn,k can be learnt using a sample of Ω(n^k/ε²) examples by a naive improper learning algorithm, similar to the algorithm we describe in this section for k = 3.
A graphical illustration of our main results is given below:

[Figure: runtime (ranging over n^O(1), super-polynomial, and 2^O(n)) as a function of the number of examples (n, n^1.5, n²).]
The proof of item 1 above is easy – simply note that Hn,3 has VC dimension n + 1.
Item 2 is proved in section 4, relying on the results of Hazan et al. [2012]. We note, however, that
a weaker result, that still suffices for answering Question 1 in the affirmative, can be proven using
a naive improper learning
algorithm. In particular, we show below how to learn Hn,3 efficiently with a sample of Ω(n³/ε²) examples. The idea is to replace the class Hn,3 with the class {±1}^Cn,3
containing all functions from Cn,3 to {±1}. Clearly, this class contains Hn,3 . In addition, we
can efficiently find a function f that minimizes the empirical training error over a training set S
as follows: For every x ∈ Cn,k , if x does not appear at all in the training set we will set f (x)
arbitrarily to 1. Otherwise, we will set f (x) to be the majority of the labels in the training set
that correspond to x. Finally, note that the VC dimension of {±1}^Cn,3 is smaller than n³ (since |Cn,3| < n³). Hence, standard generalization results (e.g. Vapnik [1995]) imply that a training set of size Ω(n³/ε²) suffices for learning this class.
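To make the naive improper learner above concrete, here is a minimal Python sketch (added for illustration; the function names are ours, not the paper's). It builds the lookup table f over Cn,3 by majority vote and defaults to +1 on unseen instances.

```python
from collections import Counter, defaultdict

def train_majority_table(sample):
    """sample: iterable of (x, y) with x a sparse instance (e.g. a tuple in {-1,0,1}^n)
    and y in {-1,+1}. Returns a table mapping each seen x to its majority label."""
    votes = defaultdict(Counter)
    for x, y in sample:
        votes[x][y] += 1
    return {x: c.most_common(1)[0][0] for x, c in votes.items()}

def predict(table, x):
    """Unseen instances get the arbitrary default label +1, as in the text."""
    return table.get(x, 1)

# Usage with n = 4:
sample = [((1, -1, 0, 1), 1), ((1, -1, 0, 1), 1), ((0, 1, 1, -1), -1)]
table = train_majority_table(sample)
print(predict(table, (1, -1, 0, 1)), predict(table, (0, 0, 1, -1)))  # -> 1 1
```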
Item 3 is shown in section 3 by presenting a novel technique for establishing statistical-computational tradeoffs.
The class Hn,2. Our main result gives a positive answer to Question 1 for the task of improperly learning Hn,k for k ≥ 3. A natural question is what happens for k = 2 and k = 1. Since VC(Hn,1) = VC(Hn,2) = n + 1, the information-theoretic barrier for learning these classes is Θ(n/ε²). In section 4, we prove that Hn,2 (and, consequently, Hn,1 ⊂ Hn,2) can be learnt using O(n log³(n)/ε²) examples, indicating that significant computational-statistical tradeoffs start to manifest themselves only for k ≥ 3.
1.1 Previous approaches, difficulties, and our techniques
[Decatur et al., 1998] and [Servedio, 2000] gave positive answers to Question 1 in the realizable PAC learning model. Under cryptographic assumptions, they showed that there exist binary learning problems in which more data can provably be used to speed up training time. [Shalev-Shwartz
et al., 2012] showed a similar result for the agnostic PAC learning model. In all of these papers, the
main idea is to construct a hypothesis class based on a one-way function. However, the constructed
classes are of a very synthetic nature, and are of almost no practical interest. This is mainly due
to the construction technique, which is based on one-way functions. In this work, instead of using
cryptographic assumptions, we rely on the hardness of refuting random 3CNF formulas. The simplicity and flexibility of 3CNF formulas enable us to derive lower bounds for natural classes such
as halfspaces.
Recently, [Berthet and Rigollet, 2013] gave a positive answer to Question 1 in the context of unsupervised learning. Concretely, they studied the problem of sparse PCA, namely, finding a sparse vector that maximizes the variance of unsupervised data. Conditioning on the hardness of the planted clique problem, they gave a positive answer to Question 1 for sparse PCA. Our work, as well as the previous work of Decatur et al. [1998], Servedio [2000], Shalev-Shwartz et al. [2012], studies Question 1 in the supervised learning setup. We emphasize that unsupervised learning problems are radically different from supervised learning problems in the context of deriving lower bounds.
The main reason for the difference is that in supervised learning problems, the learner is allowed
to employ improper learning, which gives it a lot of power in choosing an adequate representation of the data. For example, the upper bound we have derived for the class of sparse halfspaces
switched from representing hypotheses as halfspaces to representing them as tables over
Cn,3 , which made the learning problem easy from the computational perspective. The crux of the
difficulty in constructing lower bounds is due to this freedom of the learner in choosing a convenient
representation. This difficulty does not arise in the problem of sparse PCA detection, since there
the learner must output a good sparse vector. Therefore, it is not clear whether the approach given
in [Berthet and Rigollet, 2013] can be used to establish computational-statistical gaps in supervised
learning problems.
2 Background and notation
For a hypothesis class H ⊂ {±1}^X and a set Y ⊂ X, we define the restriction of H to Y by
H|Y = {h|Y | h ∈ H}. We denote by J = Jn the all-ones n × n matrix. We denote the j’th vector
in the standard basis of Rn by ej .
2.1 Learning Algorithms
For h : Cn,3 → {±1} and a distribution D on Cn,3 × {±1} we denote the error of h w.r.t. D by ErrD(h) = Pr_{(x,y)∼D}(h(x) ≠ y). For H ⊂ {±1}^Cn,3 we denote the error of H w.r.t. D by ErrD(H) = min_{h∈H} ErrD(h). For a sample S ∈ (Cn,3 × {±1})^m we denote by ErrS(h) (resp. ErrS(H)) the error of h (resp. H) w.r.t. the empirical distribution induced by the sample S.
A learning algorithm, L, receives a sample S ∈ (Cn,3 × {±1})^m and returns a hypothesis L(S) : Cn,3 → {±1}. We say that L learns Hn,3 using m(n, ε) examples if,² for every distribution D on Cn,3 × {±1} and a sample S of more than m(n, ε) i.i.d. examples drawn from D,

Pr_S (ErrD(L(S)) > ErrD(Hn,3) + ε) < 1/10.

The algorithm L is efficient if it runs in polynomial time in the sample size and returns a hypothesis that can be evaluated in polynomial time.

² For simplicity, we require the algorithm to succeed with probability of at least 9/10. This can be easily amplified to probability of at least 1 − δ, as in the usual definition of agnostic PAC learning, while increasing the sample complexity by a factor of log(1/δ).
2.2 Refuting random 3SAT formulas
We frequently view a boolean assignment to variables x1 , . . . , xn as a vector in Rn . It is convenient,
therefore, to assume that boolean variables take values in {±1} and to denote negation by “ − ”
(instead of the usual "¬"). An n-variable 3CNF clause is a boolean formula of the form

C(x) = (−1)^{j1} x_{i1} ∨ (−1)^{j2} x_{i2} ∨ (−1)^{j3} x_{i3},   x ∈ {±1}^n.

An n-variable 3CNF formula is a boolean formula of the form

φ(x) = ∧_{i=1}^{m} Ci(x),
where every Ci is a 3CNF clause. Define the value, Val(φ), of φ as the maximal fraction of clauses
that can be simultaneously satisfied. If Val(φ) = 1, we say that φ is satisfiable. By 3CNFn,m we denote the set of 3CNF formulas with n variables and m clauses.
Refuting random 3CNF formulas has been studied extensively (see, e.g., a special issue of TCS, Dubois et al. [2001]). It is known that for large enough ∆ (∆ = 6 will suffice) a random formula in 3CNFn,∆n is not satisfiable with probability 1 − o(1). Moreover, for every 0 ≤ ε < 1/4 and a large enough ∆ = ∆(ε), the value of a random formula in 3CNFn,∆n is ≤ 1 − ε with probability 1 − o(1).
The problem of refuting random 3CNF formulas concerns efficient algorithms that provide a proof that a random 3CNF formula is not satisfiable, or is far from being satisfiable. This can be thought of as a game between an adversary and an algorithm. The adversary should produce a 3CNF formula. It can either produce a satisfiable formula or produce a formula uniformly at random. The algorithm should identify whether the produced formula is random or satisfiable.
Formally, let ∆ : N → N and 0 ≤ ε < 1/4. We say that an efficient algorithm, A, ε-refutes random 3CNF with ratio ∆ if its input is φ ∈ 3CNFn,n∆(n), its output is either "typical" or "exceptional", and it satisfies:

• Soundness: If Val(φ) ≥ 1 − ε, then Pr_{rand. coins of A} (A(φ) = "exceptional") ≥ 3/4.

• Completeness: For every n, Pr_{rand. coins of A, φ∼Uni(3CNFn,n∆(n))} (A(φ) = "typical") ≥ 1 − o(1).
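To make the objects in this definition concrete, here is a small illustrative Python sketch (ours, not the paper's, and not a refutation algorithm): it draws a formula uniformly at random with ∆n clauses, encoding each clause as three signed literals, and scores a candidate assignment ψ by the fraction of clauses it satisfies. Computing Val(φ) itself is intractable in general; the sketch only evaluates a single assignment.

```python
import random

def random_3cnf(n, delta, rng):
    """A formula with delta*n clauses; each clause is three (index, sign) pairs."""
    clauses = []
    for _ in range(delta * n):
        idxs = rng.sample(range(n), 3)
        clauses.append([(i, rng.choice((+1, -1))) for i in idxs])
    return clauses

def satisfied_fraction(clauses, psi):
    """Fraction of clauses with at least one true literal (a literal sign*psi[i] == +1)."""
    sat = sum(any(sign * psi[i] == 1 for i, sign in clause) for clause in clauses)
    return sat / len(clauses)

rng = random.Random(0)
n, delta = 200, 6
phi = random_3cnf(n, delta, rng)
psi = [rng.choice((+1, -1)) for _ in range(n)]
# For a random assignment this concentrates around 7/8, which is why random
# formulas with large enough delta have value bounded away from 1.
print(satisfied_fraction(phi, psi))
```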
By a standard repetition argument, the probability of 3/4 can be amplified to 1 − 2^{−n}, while efficiency is preserved. Thus, given such an (amplified) algorithm, if A(φ) = "typical", then with confidence 1 − 2^{−n} we know that Val(φ) < 1 − ε. Since for random φ ∈ 3CNFn,n∆(n), A(φ) = "typical" with probability 1 − o(1), such an algorithm provides, for most 3CNF formulas, a proof that their value is less than 1 − ε.
Note that an algorithm that ε-refutes random 3CNF with ratio ∆ also ε′-refutes random 3CNF with ratio ∆ for every 0 ≤ ε′ ≤ ε. Thus, the task of refuting random 3CNFs gets easier as ε gets smaller. Most of the research concerns the case ε = 0. Here, it is not hard to see that the task gets easier as ∆ grows. The best known algorithm [Feige and Ofek, 2007] 0-refutes random 3CNF with ratio ∆(n) = Ω(√n). In Feige [2002] it was conjectured that for constant ∆ no efficient algorithm can provide a proof that a random 3CNF formula is not satisfiable:
Conjecture 2.1 (R3SAT hardness assumption – [Feige, 2002]). For every ε > 0 and for every large enough integer ∆ > ∆0(ε) there exists no efficient algorithm that ε-refutes random 3CNF formulas with ratio ∆.
In fact, for all we know, the following conjecture may be true for every 0 ≤ µ ≤ 0.5.
Conjecture 2.2 (µ-R3SAT hardness assumption). For every ε > 0 and for every integer ∆ > ∆0(ε) there exists no efficient algorithm that ε-refutes random 3CNF with ratio ∆ · n^µ.
Note that Feige’s conjecture is equivalent to the 0-R3SAT hardness assumption.
3 Lower bounds for learning Hn,3
Theorem 3.1 (main). Let 0 ≤ µ ≤ 0.5. If the µ-R3SAT hardness assumption (conjecture 2.2) is true, then there exists no efficient learning algorithm that learns the class Hn,3 using O(n^{1+µ}/ε²) examples.
In the proof of Theorem 3.1 we rely on the validity of a conjecture, similar to conjecture 2.2, for 3-variable majority formulas. Following an argument from [Feige, 2002] (Theorem 3.2), the validity of the conjecture on which we rely for majority formulas follows from the validity of conjecture 2.2.
Define
∀(x1, x2, x3) ∈ {±1}³, MAJ(x1, x2, x3) := sign(x1 + x2 + x3).
An n-variable 3MAJ clause is a boolean formula of the form
C(x) = MAJ((−1)^{j1} x_{i1}, (−1)^{j2} x_{i2}, (−1)^{j3} x_{i3}),   x ∈ {±1}^n.
An n-variable 3MAJ formula is a boolean formula of the form
φ(x) = ∧_{i=1}^{m} Ci(x),
where the Ci's are 3MAJ clauses. By 3MAJn,m we denote the set of 3MAJ formulas with n variables and m clauses.
Theorem 3.2 ([Feige, 2002]). Let 0 ≤ µ ≤ 0.5. If the µ-R3SAT hardness assumption is true, then for every ε > 0 and for every large enough integer ∆ > ∆0(ε) there exists no efficient algorithm A with the following properties.

• Its input is φ ∈ 3MAJn,∆n^{1+µ}, and its output is either "typical" or "exceptional".

• If Val(φ) ≥ 3/4 − ε, then Pr_{rand. coins of A} (A(φ) = "exceptional") ≥ 3/4.

• For every n, Pr_{rand. coins of A, φ∼Uni(3MAJn,∆n^{1+µ})} (A(φ) = "typical") ≥ 1 − o(1).
Next, we prove Theorem 3.1. In fact, we will prove a slightly stronger result. Namely, define the subclass Hᵈn,3 ⊂ Hn,3 of homogeneous halfspaces with binary weights, given by Hᵈn,3 = {hw,0 | w ∈ {±1}^n}. As we show, under the µ-R3SAT hardness assumption, it is impossible to efficiently learn this subclass using only O(n^{1+µ}/ε²) examples.
Proof idea: We will reduce the task of refuting random 3MAJ formulas with a linear number of clauses to the task of (improperly) learning Hᵈn,3 with a linear number of samples. The first step will be to construct a transformation that associates every 3MAJ clause with two examples in Cn,3 × {±1}, and every assignment with a hypothesis in Hᵈn,3. As we will show, the hypothesis corresponding to an assignment ψ is correct on the two examples corresponding to a clause C if and only if ψ satisfies C. With that interpretation at hand, every 3MAJ formula φ can be thought of as a distribution Dφ on Cn,3 × {±1}, which is the empirical distribution induced by φ's clauses. It holds furthermore that ErrDφ(Hᵈn,3) = 1 − Val(φ).
Suppose now that we are given an efficient learning algorithm for Hᵈn,3 that uses κ·n/ε² examples, for some κ > 0. To construct an efficient algorithm for refuting 3MAJ formulas, we simply feed the learning algorithm with κ·n/0.01² examples drawn from Dφ and answer "exceptional" if the error of the hypothesis returned by the algorithm is small. If φ is (almost) satisfiable, the algorithm is guaranteed to return a hypothesis with a small error. On the other hand, if φ is far from being satisfiable, ErrDφ(Hᵈn,3) is large. If the learning algorithm is proper, then it must return a hypothesis from Hᵈn,3 and therefore it would necessarily return a hypothesis with a large error. This argument can be used to show that, unless NP = RP, learning Hᵈn,3 with a proper efficient algorithm is impossible. However, here we want to rule out improper algorithms as well.
The crux of the construction is that if φ is random, no algorithm (even improper and even inefficient) can return a hypothesis with a small error. The reason is that since the sample provided to the algorithm consists of only κ·n/0.01² samples, the algorithm won't see most of φ's clauses, and, consequently, the produced hypothesis h will be independent of them. Since these clauses are random, h is likely to err on about half of them, so that ErrDφ(h) will be close to half!
To summarize, we constructed an efficient algorithm with the following properties: if φ is almost satisfiable, the algorithm will return a hypothesis with a small error, and then we will declare "exceptional", while for random φ, the algorithm will return a hypothesis with a large error, and we will declare "typical".
Our construction crucially relies on the restriction to learning algorithms with a small sample complexity. Indeed, if the learning algorithm obtains more than n^{1+µ} examples, then it will see most of φ's clauses, and therefore it might succeed in "learning" even when the source of the formula is random. Therefore, we will declare "exceptional" even when the source is random.
Proof. (of Theorem 3.1) Assume by way of contradiction that the µ-R3SAT hardness assumption is true and yet there exists an efficient learning algorithm that learns the class Hn,3 using O(n^{1+µ}/ε²) examples. Setting ε = 1/100, we conclude that there exists an efficient algorithm L and a constant κ > 0 such that, given a sample S of more than κ·n^{1+µ} examples drawn from a distribution D on Cn,3 × {±1}, L returns a classifier L(S) : Cn,3 → {±1} such that

• L(S) can be evaluated efficiently.

• W.p. ≥ 3/4 over the choice of S, ErrD(L(S)) ≤ ErrD(Hn,3) + 1/100.

Fix ∆ large enough such that ∆ > 100κ and the conclusion of Theorem 3.2 holds with ε = 1/100. We will construct an algorithm, A, contradicting Theorem 3.2. On input φ ∈ 3MAJn,∆n^{1+µ}, consisting of the 3MAJ clauses C1, . . . , C∆n^{1+µ}, the algorithm A proceeds as follows:
1. Generate a sample S consisting of ∆n^{1+µ} examples as follows. For every clause, Ck = MAJ((−1)^{j1} x_{i1}, (−1)^{j2} x_{i2}, (−1)^{j3} x_{i3}), generate an example (xk, yk) ∈ Cn,3 × {±1} by choosing b ∈ {±1} at random and letting

(xk, yk) = b · ( Σ_{l=1}^{3} (−1)^{jl} e_{il} , 1 ) ∈ Cn,3 × {±1}.

For example, if n = 6, the clause is MAJ(−x2, x3, x6) and b = −1, we generate the example ((0, 1, −1, 0, 0, −1), −1).

2. Choose a sample S1 consisting of ∆n^{1+µ}/100 ≥ κ·n^{1+µ} examples by choosing examples at random (with repetitions) from S.

3. Let h = L(S1). If ErrS(h) ≤ 3/8, return "exceptional". Otherwise, return "typical".
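The following Python sketch (added for illustration) mirrors algorithm A under the clause encoding of step 1; `learner` stands for the assumed efficient learning algorithm L and is a hypothetical callable, not something defined in the paper.

```python
import random

def clause_to_example(clause, n, rng):
    """clause: three (index, sign) pairs for MAJ((-1)^{j1} x_{i1}, ...).
    Returns one example b * (sum_l sign_l * e_{i_l}, 1) in C_{n,3} x {+1,-1}."""
    b = rng.choice((+1, -1))
    x = [0] * n
    for i, sign in clause:
        x[i] = b * sign
    return tuple(x), b

def empirical_error(h, sample):
    return sum(h(x) != y for x, y in sample) / len(sample)

def refute_3maj(clauses, n, learner, rng):
    """Algorithm A: build S from the clauses, train the learner on a 1/100
    subsample S1 (with repetitions), then threshold the empirical error on S."""
    S = [clause_to_example(c, n, rng) for c in clauses]
    S1 = rng.choices(S, k=max(1, len(S) // 100))
    h = learner(S1)
    return "exceptional" if empirical_error(h, S) <= 3 / 8 else "typical"
```

For instance, the clause MAJ(−x2, x3, x6) is encoded as [(1, −1), (2, +1), (5, +1)] (0-based indices), and with b = −1 the sketch reproduces the example ((0, 1, −1, 0, 0, −1), −1) from step 1.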
We claim that A contradicts Theorem 3.2. Clearly, A runs in polynomial time. It remains to show
that
• If Val(φ) ≥ 3/4 − 1/100, then Pr_{rand. coins of A} (A(φ) = "exceptional") ≥ 3/4.

• For every n, Pr_{rand. coins of A, φ∼Uni(3MAJn,∆n^{1+µ})} (A(φ) = "typical") ≥ 1 − o(1).
Assume first that φ ∈ 3MAJn,∆n^{1+µ} is chosen at random. Given the sample S1, the sample S2 := S \ S1 is a sample of |S2| i.i.d. examples which are independent of the sample S1, and hence also of h = L(S1). Moreover, for every example (xk, yk) ∈ S2, yk is a Bernoulli random variable with parameter 1/2 which is independent of xk. To see that, note that an example whose instance is xk can be generated by exactly two clauses – one corresponds to yk = 1, while the other corresponds to yk = −1 (e.g., the instance (1, −1, 0, 1) can be generated from the clause MAJ(x1, −x2, x4) with b = 1 or from the clause MAJ(−x1, x2, −x4) with b = −1). Thus, given the instance xk, the probability that yk = 1 is 1/2, independent of xk.
It follows that ErrS2(h) is an average of at least (1 − 1/100)∆n^{1+µ} independent Bernoulli random variables. By Chernoff's bound, with probability ≥ 1 − o(1), ErrS2(h) > 1/2 − 1/100. Thus,

ErrS(h) ≥ (1 − 1/100) · ErrS2(h) ≥ (1 − 1/100) · (1/2 − 1/100) > 3/8,

and the algorithm will output "typical".
Assume now that Val(φ) ≥ 3/4 − 1/100 and let ψ ∈ {±1}^n be an assignment attaining this value. Let Ψ ∈ Hn,3 be the hypothesis Ψ(x) = sign(⟨ψ, x⟩). It can be easily checked that Ψ(xk) = yk if and only if ψ satisfies Ck. Since Val(φ) ≥ 3/4 − 1/100, it follows that

ErrS(Ψ) ≤ 1/4 + 1/100.

Thus,

ErrS(Hn,3) ≤ 1/4 + 1/100.

By the choice of L, with probability ≥ 1 − 1/4 = 3/4,

ErrS(h) ≤ 1/4 + 1/100 + 1/100 < 3/8,

and the algorithm will return "exceptional".
4 Upper bounds for learning Hn,2 and Hn,3
The following theorem derives upper bounds for learning Hn,2 and Hn,3. Its proof relies on results from Hazan et al. [2012] about learning β-decomposable matrices, and due to the lack of space is given in the appendix.

Theorem 4.1.
• There exists an efficient algorithm that learns Hn,2 using O(n log³(n)/ε²) examples.
• There exists an efficient algorithm that learns Hn,3 using O(n² log³(n)/ε²) examples.

5 Discussion
We formally established a computational-sample complexity tradeoff for the task of (agnostically
and improperly) PAC learning of halfspaces over 3-sparse vectors. Our proof of the lower bound
relies on a novel, non-cryptographic, technique for establishing such tradeoffs. We also derive a new
non-trivial upper bound for this task.
Open questions. An obvious open question is to close the gap between the lower and upper bounds. We conjecture that Hn,3 can be learnt efficiently using a sample of Õ(n^1.5/ε²) examples. Also, we
believe that our new proof technique can be used for establishing computational-sample complexity
tradeoffs for other natural learning problems.
Acknowledgements: Amit Daniely is a recipient of the Google Europe Fellowship in Learning
Theory, and this research is supported in part by this Google Fellowship. Nati Linial is supported
by grants from ISF, BSF and I-Core. Shai Shalev-Shwartz is supported by the Israeli Science Foundation grant number 590-10.
References
Benny Applebaum, Boaz Barak, and David Xiao. On basing lower-bounds for learning on worst-case assumptions. In 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 211–220. IEEE, 2008.
Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal
component detection. In COLT, 2013.
Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line
learning algorithms. IEEE Transactions on Information Theory, 50:2050–2057, 2001.
Venkat Chandrasekaran and Michael I. Jordan. Computational and statistical tradeoffs via convex
relaxation. Proceedings of the National Academy of Sciences, 2013.
S. Decatur, O. Goldreich, and D. Ron. Computational sample complexity. SIAM Journal on Computing, 29, 1998.
O. Dubois, R. Monasson, B. Selman, and R. Zecchina (Guest Editors). Phase Transitions in Combinatorial Problems. Theoretical Computer Science, Volume 265, Numbers 1–2, 2001.
U. Feige. Relations between average case complexity and approximation complexity. In STOC,
pages 534–543, 2002.
Uriel Feige and Eran Ofek. Easily refutable subformulas of large random 3cnf formulas. Theory of
Computing, 3(1):25–43, 2007.
E. Hazan, S. Kale, and S. Shalev-Shwartz. Near-optimal algorithms for online matrix prediction. In
COLT, 2012.
P. Long and R. Servedio. Low-weight halfspaces for sparse boolean vectors. In ITCS, 2013.
R. Servedio. Computational sample complexity and attribute-efficient learning. J. of Comput. Syst.
Sci., 60(1):161–178, 2000.
Shai Shalev-Shwartz, Ohad Shamir, and Eran Tromer. Using more data to speed-up training time.
In AISTATS, 2012.
V.N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
A Proof of Theorem 4.1
The proof of the theorem relies on results from Hazan et al. [2012] about learning β-decomposable
matrices. Let W be an n × m matrix. We define the symmetrization of W to be the (n + m) × (n + m) matrix

sym(W) = [ 0  W ; W^T  0 ].

We say that W is β-decomposable if there exist positive semi-definite matrices P, N for which

sym(W) = P − N,   ∀i, Pii, Nii ≤ β.
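As a small numerical illustration of this definition (ours, not the paper's; it assumes numpy), the sketch below builds sym(W) and checks whether a proposed pair (P, N) is a β-decomposition. The example verifies the fact, used later in the appendix, that the all-ones matrix J is 1-decomposable.

```python
import numpy as np

def sym(W):
    """The (n+m) x (n+m) symmetrization [[0, W], [W^T, 0]] of an n x m matrix W."""
    n, m = W.shape
    return np.block([[np.zeros((n, n)), W], [W.T, np.zeros((m, m))]])

def is_psd(A, tol=1e-9):
    return np.min(np.linalg.eigvalsh((A + A.T) / 2)) >= -tol

def is_beta_decomposition(W, P, N, beta, tol=1e-9):
    """Check sym(W) = P - N with P, N PSD and all diagonal entries at most beta."""
    return (np.allclose(sym(W), P - N, atol=1e-8)
            and is_psd(P, tol) and is_psd(N, tol)
            and max(np.max(np.diag(P)), np.max(np.diag(N))) <= beta + tol)

# Example: the n x n all-ones matrix J is 1-decomposable.
n = 4
J = np.ones((n, n))
a = np.ones((n, 1))
p, q = np.vstack([a, a]), np.vstack([a, -a])
P, N = (p @ p.T) / 2, (q @ q.T) / 2      # rank-1 PSD matrices with diagonal 1/2
print(is_beta_decomposition(J, P, N, beta=1.0))   # True
```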
Each matrix in {±1}^{n×m} can be naturally interpreted as a hypothesis on [n] × [m].
We say that a learning algorithm L learns a class Hn ⊂ {±1}^Xn using m(n, ε, δ) examples if, for every distribution D on Xn × {±1} and a sample S of more than m(n, ε, δ) i.i.d. examples drawn from D,

Pr_S (ErrD(L(S)) > ErrD(Hn) + ε) < δ.

Hazan et al. [2012] have proved³ that:

Theorem A.1 (Hazan et al. [2012]). The hypothesis class of β-decomposable n × m matrices with ±1 entries can be efficiently learnt using a sample of O(β²((n + m) log(n + m) + log(1/δ))/ε²) examples.

³ The result of Hazan et al. [2012] is more general than what is stated here. Also, Hazan et al. [2012] considered the online scenario. The result for the statistical scenario, as stated here, can be derived by applying standard online-to-batch conversions (see for example Cesa-Bianchi et al. [2001]).
We start with a generic reduction from a problem of learning a class Gn over an instance space
Xn ⊂ {−1, 1, 0}n to the problem of learning β(n)-decomposable matrices. We say that Gn is
realized by mn × mn matrices that are β(n)-decomposable if there exists a mapping ψn : Xn →
[mn ] × [mn ] such that for every h ∈ Gn there exists a β(n)-decomposable mn × mn matrix W for
which ∀x ∈ Xn , h(x) = Wψn (x) . The mapping ψn is called a realization of Gn . In the case that
the mapping ψn can be computed in time polynomial in n, we say that Gn is efficiently realized and
ψn is an efficient realization. It follows from Theorem A.1 that:
Corollary A.2. If Gn is efficiently realized by mn × mn matrices that are β(n)-decomposable, then Gn can be efficiently learnt using a sample of O(β(n)²(mn log(mn) + log(1/δ))/ε²) examples.
We now turn to the proof of Theorem 4.1. We start with the first assertion, about learning Hn,2 .
The idea will be to partition the instance space into a disjoint union of subsets and show that the
restriction of the hypothesis class to each subset can be efficiently realized by β(n)-decomposable matrices. Concretely, we decompose Cn,2 into a disjoint union of five sets

Cn,2 = ∪_{r=−2}^{2} Arn,   where   Arn = { x ∈ Cn,2 | Σ_{i=1}^{n} xi = r }.
In section A.1 we will prove that
Lemma A.3. For every −2 ≤ r ≤ 2, Hn,2 |Arn can be efficiently realized by n × n matrices that are
O(log(n))-decomposable.
To glue together the five restrictions, we will rely on the following Lemma, whose proof is given in
section A.1.
Lemma A.4. Let X1, . . . , Xk be a partition of a domain X and let H be a hypothesis class over X. Define Hi = H|Xi. Suppose that for every Hi there exists a learning algorithm that learns Hi using at most C(d + log(1/δ))/ε² examples, for some constant C ≥ 8. Consider the algorithm A which receives an i.i.d. training set S of m examples from X × {0, 1} and applies the learning algorithm for each Hi on the examples in S that belong to Xi. Then, A learns H using at most 2Ck(d + log(2k/δ))/ε² examples.
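A minimal Python sketch of the combining algorithm A from Lemma A.4 (illustrative; `part` and `learners` are hypothetical inputs introduced here): `part` maps an instance x to the index i of its cell Xi, and `learners[i]` is the learning algorithm assumed for Hi.

```python
from collections import defaultdict

def glue_learners(sample, part, learners, default_label=1):
    """Split the sample by cell, run the i'th learner on the examples that land in
    X_i, and return a hypothesis that dispatches each x to its cell's hypothesis."""
    by_cell = defaultdict(list)
    for x, y in sample:
        by_cell[part(x)].append((x, y))
    hyps = {i: learners[i](examples) for i, examples in by_cell.items()}

    def h(x):
        i = part(x)
        return hyps[i](x) if i in hyps else default_label
    return h

# For C_{n,2}, part(x) = sum(x) in {-2,...,2} recovers the partition into the sets Arn above.
```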
The first part of Theorem 4.1 therefore follows from Lemma A.3, Lemma A.4, and Corollary A.2.
Having the first part of Theorem 4.1 and Lemma A.4 at hand, it is not hard to prove the second part
of Theorem 4.1:
For 1 ≤ i ≤ n − 2 and b ∈ {±1} define
Dn,i,b = {x ∈ Cn,3 | xi = b and ∀j < i, xj = 0}.
Let ψn : Cn,3 → Cn,2 be the mapping that zeros the first non-zero coordinate. It is not hard to see that Hn,3|Dn,i,b = {h ◦ ψn|Dn,i,b | h ∈ Hn,2}. Therefore Hn,3|Dn,i,b can be identified with Hn,2 using the mapping ψn, and therefore can be efficiently learnt using O((n log³(n) + log(1/δ))/ε²) examples (the dependency on δ does not appear in the statement, but can be easily inferred from the proof). The second part of Theorem 4.1 therefore follows from the first part of the Theorem and Lemma A.4.
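The mapping in this reduction is easy to spell out; the following short Python sketch (ours, for illustration, using 0-based indices) computes ψn and the index (i, b) of the cell Dn,i,b containing a 3-sparse instance.

```python
def first_nonzero(x):
    """Index and value of the first non-zero coordinate (x is assumed not all zeros)."""
    for i, v in enumerate(x):
        if v != 0:
            return i, v
    raise ValueError("all-zero vector")

def psi(x):
    """psi_n : C_{n,3} -> C_{n,2}, zeroing out the first non-zero coordinate."""
    i, _ = first_nonzero(x)
    return x[:i] + (0,) + x[i + 1:]

def cell(x):
    """The pair (i, b) identifying the cell D_{n,i,b} that contains x."""
    return first_nonzero(x)

x = (0, -1, 0, 1, 1, 0)        # a 3-sparse instance with n = 6
print(cell(x), psi(x))         # -> (1, -1) (0, 0, 0, 1, 1, 0)
```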
A.1 Proofs of Lemma A.3 and Lemma A.4
In the proof, we will rely on the following facts. The tensor product of two matrices A ∈ Mn×m and B ∈ Mk×l is defined as the (n · k) × (m · l) matrix

A ⊗ B = [ A_{1,1}·B  · · ·  A_{1,m}·B ;  ⋮  ⋱  ⋮ ;  A_{n,1}·B  · · ·  A_{n,m}·B ].
Proposition A.5. Let W be a β-decomposable matrix and let A be a PSD matrix whose diagonal
entries are upper bounded by α. Then W ⊗ A is (α · β)-decomposable.
Proof. It is not hard to see that for every matrix W and a symmetric matrix A,
sym(W) ⊗ A = sym(W ⊗ A).
Moreover, since the tensor product of two PSD matrices is PSD, if sym(W) = P − N is a β-decomposition of W, then
sym(W ⊗ A) = P ⊗ A − N ⊗ A
is an (α · β)-decomposition of W ⊗ A.
Proposition A.6. If W is a β-decomposable matrix, then so is every matrix obtained from W by
iteratively deleting rows and columns.
Proof. It is enough to show that deleting one row or column leaves W β-decomposable. Suppose that W′ is obtained from W ∈ Mn×m by deleting the i'th row (the proof for deleting columns is similar). It is not hard to see that sym(W′) is the i'th principal minor of sym(W). Therefore, since principal minors of PSD matrices are PSD matrices as well, if sym(W) = P − N is a β-decomposition of W then sym(W′) = [P]i,i − [N]i,i is a β-decomposition of W′.
Proposition A.7 (Hazan et al. [2012]). Let Tn be the n × n matrix whose entries on the diagonal and above are all 1, and whose entries beneath the diagonal are all −1. Then Tn is O(log(n))-decomposable.
Lastly, we will also need the following generalization of Proposition A.7.
Proposition A.8. Let W be an n × n ±1 matrix. Assume that there exists a sequence 0 ≤ j(1), . . . , j(n) ≤ n such that Wij = −1 if j ≤ j(i) and Wij = 1 if j > j(i). Then, W is O(log(n))-decomposable.
Proof. Since switching rows of a β-decomposable matrix leaves a β-decomposable matrix, we can
assume without loss of generality that j(1) ≤ j(2) ≤ . . . ≤ j(n). Let J be the n × n all-ones matrix.
It is not hard to see that W can be obtained from Tn ⊗ J by iteratively deleting rows and columns.
Combining propositions A.5, A.6 and A.7, we conclude that W is O(log(n))-decomposable, as
required.
We are now ready to prove Lemma A.3.
Proof. (of Lemma A.3) Denote Arn = Hn,2 |Arn . We split into cases.
Case 1, r=0: Note that A0n = {ei − ej | i, j ∈ [n]}. Define ψn : A0n → [n] × [n] by ψn (ei − ej ) =
(i, j). We claim that ψn is an efficient realization of A0n by n × n matrices that are O(log(n))-decomposable. Indeed, let h = hw,b ∈ A0n, and let W be the n × n matrix Wij = Wψn(ei−ej) =
h(ei − ej ). It is enough to show that W is O(log(n))-decomposable.
We can rename the coordinates so that

w1 ≥ w2 ≥ . . . ≥ wn.    (1)

From equation (1), it is not hard to see that there exist numbers 0 ≤ j(1) ≤ j(2) ≤ . . . ≤ j(n) ≤ n for which Wij = −1 if j ≤ j(i) and Wij = 1 if j > j(i). The conclusion follows from Proposition A.8.
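As a quick numerical check of this staircase structure (an illustration we add here, assuming no exact ties so that sign never returns 0), the matrix Wij = sign(wi − wj + b) with w sorted in decreasing order indeed has every row consisting of a block of −1s followed by a block of +1s:

```python
import numpy as np

rng = np.random.default_rng(1)
n, b = 6, 0.3
w = np.sort(rng.normal(size=n))[::-1]        # renamed coordinates: w_1 >= ... >= w_n

W = np.sign(w[:, None] - w[None, :] + b)     # W[i, j] = h_{w,b}(e_i - e_j)

for row in W:
    flips = int(np.sum(row[1:] != row[:-1])) # a (-1,...,-1,+1,...,+1) row flips at most once
    assert flips <= 1 and (flips == 0 or (row[0] == -1 and row[-1] == 1))
```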
Case 2, r=2 and r=-2: We confine ourselves to the case r = 2. The case r = −2 is similar. Note
that A2n = {ei + ej | i ≠ j ∈ [n]}. Define ψn : A2n → [n] × [n] by ψn(ei + ej) = (i, j). We claim that ψn is an efficient realization of A2n by n × n matrices that are O(log(n))-decomposable. Indeed, let h = hw,b ∈ A2n, and let W be the n × n matrix Wij = Wψn(ei+ej) = h(ei + ej). It is enough to show that W is O(log(n))-decomposable.
We can rename the coordinates so that

w1 ≤ w2 ≤ . . . ≤ wn.    (2)

From equation (2), it is not hard to see that there exist numbers n ≥ j(1) ≥ j(2) ≥ . . . ≥ j(n) ≥ 0 for which Wij = −1 if j ≤ j(i) and Wij = 1 if j > j(i). The conclusion follows from Proposition A.8.
Case 3, r=1 and r=-1: We confine ourselves to the case r = 1. The case r = −1 is similar. Note that A1n = {ei | i ∈ [n]}. Define ψn : A1n → [n] × [n] by ψn(ei) = (i, i). We claim that ψn is an efficient realization of A1n by n × n matrices that are 3-decomposable (let alone O(log(n))-decomposable). Indeed, let h = hw,b ∈ A1n, and let W be the n × n matrix with Wii = Wψn(ei) = h(ei) and −1 outside the diagonal. It is enough to show that W is 3-decomposable. Since J is 1-decomposable, it is enough to show that W + J is 2-decomposable. However, it is not hard to see that every diagonal matrix D is (maxi |Dii|)-decomposable.
Proof. (of Lemma A.4) Let S = (x1, y1), . . . , (xm, ym) be a training set and let m̂i be the number of examples in S that belong to Xi. Given that the values of the random variables m̂1, . . . , m̂k are determined, we have that w.p. of at least 1 − δ,

∀i, ErrDi(hi) − ErrDi(h*) ≤ √( C(d + log(k/δ)) / m̂i ),

where Di is the induced distribution over Xi, hi is the output of the i'th algorithm, and h* is the optimal hypothesis w.r.t. the original distribution D. Define

mi = max{C(d + log(k/δ)), m̂i}.

It follows from the above that we also have, w.p. at least 1 − δ, for every i,

ErrDi(hi) − ErrDi(h*) ≤ √( C(d + log(k/δ)) / mi ) =: εi.

Let αi = D{(x, y) : x ∈ Xi}, and note that Σi αi = 1. Therefore,

ErrD(hS) − ErrD(h*) ≤ Σi αi εi = Σi √αi · √(αi εi²) ≤ √(Σi αi) · √(Σi αi εi²) = √(Σi αi εi²) = √( C(d + log(k/δ)) / m ) · √( Σi αi m / mi ).

Next note that if αi m < C(d + log(k/δ)) then αi m/mi ≤ 1. Otherwise, using Chernoff's inequality, for every i we have

Pr[mi < 0.5 αi m] ≤ e^{−αi m / 8} ≤ e^{−(d + log(k/δ))} = e^{−d} · (δ/k) ≤ δ/k.

Therefore, by the union bound, Pr[∃i : mi < 0.5 αi m] ≤ δ. It follows that with probability of at least 1 − δ,

√( Σi αi m / mi ) ≤ √(2k).

All in all, we have shown that with probability of at least 1 − 2δ it holds that

ErrD(hS) − ErrD(h*) ≤ √( 2Ck(d + log(k/δ)) / m ).

Therefore, the algorithm learns H using at most 2Ck(d + log(2k/δ))/ε² examples.