E0 370 Statistical Learning Theory
Lecture 10 (Sep 15, 2011)
Excess Error, Approximation Error, and Estimation Error
Lecturer: Shivani Agarwal
Scribe: Shivani Agarwal

1 Introduction
So far, we have considered the finite sample setting: given a finite sample S ∈ (X × Y)m drawn according to
Dm , we have seen how to obtain (high confidence) bounds on the generalization error of a function learned
from S, usually in terms of some empirical quantity that measures the performance of the function on S.
Another question of interest concerns the behaviour of a learning algorithm in the infinite sample limit: as
it receives more and more data, does the algorithm converge to an optimal prediction rule, i.e. does the
generalization error of the learned function approach the optimal error? Recall that for a distribution D on
X × Y and a loss ` : Y × Y→[0, ∞), the optimal error w.r.t. D and ` is the lowest possible error achievable
by any function h : X →Y:
    er`,∗_D = inf_{h:X→Y} er`_D[h] .    (1)
For the 0-1 loss, the optimal error is known as the Bayes error.
To formalize the above, for any function h : X →Y, define its excess error (w.r.t. D and `) as
    er`_D[h] − er`,∗_D .    (2)
We would like to study the behaviour of the excess error of the function learned by an algorithm from a
training sample S ∼ Dm as m→∞.
As we have seen, since minimizing the error over all possible functions in Y X can be difficult, most learning
algorithms select a function from some fixed function class H ⊆ Y X . In such cases, we can only hope to
achieve generalization error close to the lowest possible within the class; we refer to this as the optimal error
within H (w.r.t. D and `):
    er`_D[H] = inf_{h∈H} er`_D[h] .    (3)
It is then useful to view the excess error of functions h ∈ H as a sum of the following two terms:
    er`_D[h] − er`,∗_D = ( er`_D[h] − er`_D[H] ) + ( er`_D[H] − er`,∗_D ) .    (4)
The first term is called the estimation error, and measures how far h is from the optimal within H. The
second term, called the approximation error, measures how close one can get to the optimal error using
functions in H; this is an inherent property of the function class, and forms a lower bound on the excess
error of any function learned from H.
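To make the decomposition (4) concrete, here is a small numerical sketch in Python. The three-point instance space, the distribution, and the class H of constant classifiers are all illustrative assumptions, not from the lecture:

```python
import itertools

# Toy setup (illustrative numbers): X = {0, 1, 2}, Y = {-1, +1}.
p_x = {0: 0.5, 1: 0.3, 2: 0.2}   # marginal P(x)
eta = {0: 0.9, 1: 0.4, 2: 0.1}   # eta(x) = P(y = +1 | x)

def err(h):
    """0-1 generalization error er_D[h] = P(h(x) != y)."""
    return sum(p_x[x] * (eta[x] if h[x] == -1 else 1 - eta[x]) for x in p_x)

# Optimal (Bayes) error: infimum over ALL functions h : X -> Y,
# here just a minimum over the 2^3 labelings of X.
bayes_err = min(err(dict(zip(p_x, ys)))
                for ys in itertools.product([-1, +1], repeat=3))

# A restricted class H: the two constant classifiers.
H = [{0: +1, 1: +1, 2: +1}, {0: -1, 1: -1, 2: -1}]
err_H = min(err(h) for h in H)            # er_D[H], optimal within H

h = H[1]                                  # some learned h in H
estimation = err(h) - err_H               # how far h is from optimal in H
approximation = err_H - bayes_err         # inherent limitation of H
excess = err(h) - bayes_err               # excess error, Eq. (2)

# Decomposition (4): excess = estimation + approximation.
assert abs(excess - (estimation + approximation)) < 1e-12
```

For these numbers the approximation error works out to 0.22; it is a property of H alone, so no algorithm selecting from H can have excess error below it.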
In the following we will focus on the estimation error, which is what a learning algorithm learning from a
function class H can hope to minimize. We first give a couple of definitions.
2 Statistical Consistency
Definition. Let H ⊆ Y X . Let A : ∪_{m=1}^∞ (X × Y)^m → H be a learning algorithm that, given a training
sample S ∈ ∪_{m=1}^∞ (X × Y)^m , returns a function hS ∈ H. Let D be a probability distribution on X × Y and
` : Y × Y→[0, ∞). We say A is (statistically) consistent in H w.r.t. D and ` if the estimation error of the
function learned by A from S ∼ Dm converges in probability to zero, i.e. if for all ε > 0,

    P_{S∼D^m}( er`_D[hS ] − er`_D[H] ≥ ε ) −→ 0 as m→∞ .
If A is consistent in H w.r.t. ` for all distributions D on X × Y, we say A is universally consistent in H
w.r.t. `.1
Definition. Let A : ∪_{m=1}^∞ (X × Y)^m → Y X be a learning algorithm that, given a training sample
S ∈ ∪_{m=1}^∞ (X × Y)^m , returns a function hS : X →Y. Let D be a probability distribution on X × Y and
` : Y × Y→[0, ∞).
We say A is Bayes consistent w.r.t. D and ` if the excess error of the function learned by A from S ∼ Dm
converges in probability to zero, i.e. if for all ε > 0,

    P_{S∼D^m}( er`_D[hS ] − er`,∗_D ≥ ε ) −→ 0 as m→∞ .
If A is Bayes consistent w.r.t. ` for all distributions D on X × Y, we say A is universally Bayes consistent
w.r.t. `.2
One can also define analogous notions of strong consistency, which require almost sure convergence instead
of convergence in probability.
3 Consistency of Empirical Risk Minimization in H
Let H ⊆ Y X and ` : Y × Y→[0, ∞). Consider the empirical risk minimization (ERM) algorithm in H, which
given a training sample S ∈ ∪_{m=1}^∞ (X × Y)^m , returns^3

    hS ∈ arg min_{h∈H} er`S [h] .
Then for any distribution D on X × Y, we can write the estimation error of hS as

    er`_D[hS ] − er`_D[H] = ( er`_D[hS ] − er`_S[hS ] ) + ( er`_S[hS ] − er`_D[H] )    (5)
                          ≤ ( er`_D[hS ] − er`_S[hS ] ) + sup_{h∈H} ( er`_S[h] − er`_D[h] )    (6)
                          ≤ sup_{h∈H} ( er`_D[h] − er`_S[h] ) + sup_{h∈H} ( er`_S[h] − er`_D[h] )    (7)
                          ≤ 2 sup_{h∈H} | er`_S[h] − er`_D[h] | .    (8)
Therefore, uniform convergence of empirical errors in H implies consistency of ERM in H! In particular, for
binary classification, we immediately have the following:
Theorem 3.1. Let H ⊆ {±1}X and ` = `0-1 . If VCdim(H) = d < ∞, then ERM in H is universally
consistent in H w.r.t. `0-1 .
Proof. Let D be any probability distribution on X × {±1}. Let ε > 0. Then

    P_{S∼D^m}( er0-1_D[hS ] − er0-1_D[H] ≥ ε )
        ≤ P_{S∼D^m}( sup_{h∈H} | er0-1_S[h] − er0-1_D[h] | ≥ ε/2 )    (by Eq. (8))    (9)
        ≤ 4 (2em/d)^d e^{−mε²/32}    (by previous results)    (10)
        −→ 0 as m→∞ .    (11)
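To get a feel for the bound in (10), one can simply evaluate it numerically. A quick sketch (d = 3 and ε = 0.1 are arbitrary illustrative choices):

```python
import math

# The uniform-convergence bound from Eq. (10):
#   P(estimation error >= eps)  <=  4 (2em/d)^d exp(-m eps^2 / 32).
def vc_bound(m, d=3, eps=0.1):
    return 4 * (2 * math.e * m / d) ** d * math.exp(-m * eps * eps / 32)

# The polynomial factor dominates at first, so the bound is vacuous
# (greater than 1) for moderate m; the exponential wins eventually.
for m in (10**3, 10**5, 10**6):
    print(m, vc_bound(m))

assert vc_bound(10**5) > 1        # still vacuous at m = 100,000
assert vc_bound(10**6) < 1e-100   # overwhelmingly small at m = 1,000,000
```

Constants such as the 32 in the exponent are loose; the point is the shape of the bound, not the particular numbers.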
1 Note that one could also define a notion of consistency in terms of convergence in expectation, which would require that
E_{S∼D^m}[ er`_D[hS ] − er`_D[H] ] −→ 0 as m→∞. It is easy to show that a sequence of bounded, non-negative random variables
converges in probability to zero if and only if it converges in expectation to zero (show this!), and therefore when the loss function ` is bounded,
consistency in terms of convergence in probability is equivalent to consistency in terms of convergence in expectation.
2 Note that the term ‘Bayes’ consistency is usually used to refer to convergence to the optimal error for binary classification
with the 0-1 loss; we will use the term for any learning problem/loss function to distinguish it from consistency within H.
3 We assume for simplicity that the minimum is achieved in H; the results we discuss continue to hold if hS is selected to be
any function in H whose empirical error is within an appropriately small precision of inf_{h∈H} er`S [h].
Several remarks are in order:
1. As we have noted before, for binary classification, ERM is typically not computationally efficient,
except for some simple classes H. We will later discuss consistency of algorithms that minimize a
convex upper bound on `0-1 .
2. Note that for any 0 < δ ≤ 1, we have with probability at least 1 − δ (over S ∼ Dm ),
    er0-1_D[hS ] − er0-1_D[H] ≤ c sqrt( ( d ln m + ln(1/δ) ) / m ) .
As a function of the sample size m, this gives a rate of convergence of O( sqrt( ln m / m ) ) for the estimation
error. For distributions D for which er0-1_D[H] = 0 (so that there is a ‘target function’ t ∈ H such that with
probability 1, the true label y of any instance x under D is given by t(x), i.e. P_{(x,y)∼D}( y = t(x) ) = 1),
one can actually show a faster rate of convergence of O( ln m / m ). This follows from a better uniform
convergence bound for such distributions (with an e^{−cmε} term in the bound rather than e^{−cmε²}); we
will probably not show this for the general case, but will show it for finite H in a later lecture. A
derivation for the general case can be found for example in [1].
3. It is important to note that the above result applies only to classes of finite VC-dimension. Since
no such class can have zero approximation error for all distributions D, ERM in such a class cannot
achieve (universal) Bayes consistency.
4. For classes H of finite VC-dimension, the above result actually establishes that ERM in H is strongly
universally consistent in H, by virtue of the Borel-Cantelli lemma (see [1]).
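The convergence promised by Theorem 3.1 can also be watched empirically. The following sketch runs ERM over one-dimensional threshold classifiers (a class of VC-dimension 1) on a made-up noisy distribution; the threshold grid, the noise level, and the sample sizes are all illustrative assumptions:

```python
import random

random.seed(0)

# Illustrative setup: X = [0, 1], H = threshold classifiers
# h_t(x) = +1 if x >= t else -1, with t restricted to a fine grid.
# Labels: y = sign(x - 0.3), flipped with probability 0.1, so the
# optimal error within H is er_D[H] = 0.1, attained at t = 0.3.
NOISE, TRUE_T = 0.1, 0.3
GRID = [i / 100 for i in range(101)]

def sample(m):
    data = []
    for _ in range(m):
        x = random.random()
        y = 1 if x >= TRUE_T else -1
        if random.random() < NOISE:
            y = -y
        data.append((x, y))
    return data

def emp_err(t, data):
    """Empirical 0-1 error er_S[h_t]."""
    return sum((1 if x >= t else -1) != y for x, y in data) / len(data)

def true_err(t):
    """Exact er_D[h_t]: noise everywhere, plus the strip between t and TRUE_T."""
    return NOISE + (1 - 2 * NOISE) * abs(t - TRUE_T)

# ERM in H for increasing m: the estimation error er_D[h_S] - er_D[H] shrinks.
for m in (100, 1000, 10000):
    S = sample(m)
    t_hat = min(GRID, key=lambda t: emp_err(t, S))
    est_err = true_err(t_hat) - NOISE
    print(m, round(est_err, 4))
```

Because the label distribution is known in closed form here, the true error of each candidate threshold can be computed exactly rather than estimated on a test set.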
4 Consistency of Structural Risk Minimization in H = ∪i Hi
Let H1 ⊂ H2 ⊂ . . ., where Hi ⊆ Y X . Let ` : Y × Y→[0, ∞). Given a training sample S ∈ ∪_{m=1}^∞ (X × Y)^m ,
the structural risk minimization (SRM) algorithm in (Hi )_{i=1}^∞ returns

    hS ∈ arg min_i ( er`S [h^i_S ] + penalty(i, m) ) ,    (12)

where h^i_S ∈ Hi is the function returned by ERM in Hi , and penalty(i, m) is a penalty term that increases
with the complexity of Hi . Under certain conditions, one can show that SRM in (Hi )_{i=1}^∞ is consistent in
H = ∪_{i=1}^∞ Hi ; if in addition the sequence (Hi )_{i=1}^∞ is such that H = ∪_{i=1}^∞ Hi has zero approximation
error, then SRM in (Hi )_{i=1}^∞ can also be Bayes consistent. For example, for binary classification, we have
the following result:
Theorem 4.1 (Lugosi and Zeger, 1996). Let H1 ⊂ H2 ⊂ . . ., where Hi ⊆ {±1}X , VCdim(Hi ) = di < ∞ ∀i,
and di < di+1 ∀i. Let ` = `0-1 . Then SRM with penalties given by
    penalty(i, m) = sqrt( ( 8 di ln(2em) + i ) / m )

is universally consistent in H = ∪_{i=1}^∞ Hi w.r.t. `0-1 .
Proof. Let D be any probability distribution on X × {±1}. Let ε > 0. We can write the estimation error of
hS as

    er0-1_D[hS ] − er0-1_D[H] = [ er0-1_D[hS ] − inf_i ( er0-1_S[h^i_S ] + penalty(i, m) ) ]
                              + [ inf_i ( er0-1_S[h^i_S ] + penalty(i, m) ) − er0-1_D[H] ] .    (13)
Therefore we have

    P_{S∼D^m}( er0-1_D[hS ] − er0-1_D[H] ≥ ε )
        ≤ P_{S∼D^m}( er0-1_D[hS ] − inf_i ( er0-1_S[h^i_S ] + penalty(i, m) ) ≥ ε/2 )
        + P_{S∼D^m}( inf_i ( er0-1_S[h^i_S ] + penalty(i, m) ) − er0-1_D[H] ≥ ε/2 ) .    (14)
We will bound each probability in turn. For the first probability, we have

    P_{S∼D^m}( er0-1_D[hS ] − inf_i ( er0-1_S[h^i_S ] + penalty(i, m) ) ≥ ε/2 )
        ≤ P_{S∼D^m}( sup_i ( er0-1_D[h^i_S ] − er0-1_S[h^i_S ] − penalty(i, m) ) ≥ ε/2 )    (15)
        ≤ Σ_{i=1}^∞ P_{S∼D^m}( er0-1_D[h^i_S ] − er0-1_S[h^i_S ] ≥ ε/2 + penalty(i, m) )    (by union bound)    (16)
        ≤ Σ_{i=1}^∞ P_{S∼D^m}( sup_{h∈Hi} ( er0-1_D[h] − er0-1_S[h] ) ≥ ε/2 + penalty(i, m) )    (17)
        ≤ Σ_{i=1}^∞ 4 (2em/di )^{di} e^{−m(ε/2 + penalty(i,m))²/8}    (18)
        ≤ Σ_{i=1}^∞ 4 (2em)^{di} e^{−mε²/32} e^{−m(penalty(i,m))²/8}    (19)
        = 4 e^{−mε²/32} Σ_{i=1}^∞ (2em)^{di} e^{−(8 di ln(2em) + i)/8}    (20)
        = 4 e^{−mε²/32} Σ_{i=1}^∞ e^{−i/8}    (21)
        ≤ ( 4 / (1 − e^{−1/8}) ) e^{−mε²/32} .    (22)
For the second probability, let i∗ be such that

    er0-1_D[Hi∗ ] ≤ er0-1_D[H] + ε/4 ,    (23)

and let m∗ be such that for all m ≥ m∗ ,

    penalty(i∗ , m) ≤ ε/8 .    (24)
Then we have

    P_{S∼D^m}( inf_i ( er0-1_S[h^i_S ] + penalty(i, m) ) − er0-1_D[H] ≥ ε/2 )
        ≤ P_{S∼D^m}( inf_i ( er0-1_S[h^i_S ] + penalty(i, m) ) − er0-1_D[Hi∗ ] ≥ ε/4 )    (25)
        ≤ P_{S∼D^m}( er0-1_S[h^{i∗}_S ] + penalty(i∗ , m) − er0-1_D[Hi∗ ] ≥ ε/4 )    (26)
        ≤ P_{S∼D^m}( er0-1_S[h^{i∗}_S ] − er0-1_D[Hi∗ ] ≥ ε/8 ) , for m ≥ m∗    (27)
        ≤ P_{S∼D^m}( sup_{h∈Hi∗} ( er0-1_S[h] − er0-1_D[h] ) ≥ ε/8 )    (28)
        ≤ 4 (2em/di∗ )^{di∗} e^{−mε²/512} .    (29)

Thus we have, for m ≥ m∗ ,

    P_{S∼D^m}( er0-1_D[hS ] − er0-1_D[H] ≥ ε )
        ≤ 4 (2em/di∗ )^{di∗} e^{−mε²/512} + ( 4 / (1 − e^{−1/8}) ) e^{−mε²/32}    (30)
        −→ 0 as m→∞ .    (31)
A couple of remarks:
1. As noted above, if the sequence (Hi )_{i=1}^∞ is such that inf_i inf_{h∈Hi} er0-1_D[h] = er0-1,∗_D for all distributions
D on X × {±1} (i.e. if the approximation error of H = ∪_{i=1}^∞ Hi is zero for all D), then SRM in (Hi )_{i=1}^∞
as above is universally Bayes consistent w.r.t. `0-1 .
2. Again, except for the simplest problems, SRM (particularly for binary classification) is often not
computationally feasible; however it is useful as a theoretical tool for understanding model selection
techniques and Bayes consistency, and can also serve as a guide for the development of approximate
algorithms.
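As a small illustration of SRM as a theoretical tool, the rule (12) with the penalties of Theorem 4.1 can be sketched in a few lines. The nested classes below (classifiers constant on 2^i equal bins of [0, 1], so di = 2^i), the target, and the noise level are all hypothetical choices made for this sketch:

```python
import math
import random

random.seed(1)

def erm(i, data):
    """ERM in H_i: majority label within each of 2**i equal bins of [0, 1]."""
    bins = 2 ** i
    votes = [0] * bins
    for x, y in data:
        votes[min(int(x * bins), bins - 1)] += y
    labels = [1 if v >= 0 else -1 for v in votes]
    return lambda x: labels[min(int(x * bins), bins - 1)]

def emp_err(h, data):
    """Empirical 0-1 error of h on the sample."""
    return sum(h(x) != y for x, y in data) / len(data)

def penalty(i, m):
    """Penalty from Theorem 4.1, with d_i = VCdim(H_i) = 2**i."""
    return math.sqrt((8 * 2 ** i * math.log(2 * math.e * m) + i) / m)

def srm(data, max_i=8):
    """Rule (12): pick the index i minimizing empirical error + penalty."""
    m = len(data)
    return min(range(1, max_i + 1),
               key=lambda i: emp_err(erm(i, data), data) + penalty(i, m))

# Noisy data from a target that is +1 exactly on [0.25, 0.75),
# i.e. realizable in H_2 (4 bins) but not in H_1 (2 bins).
data = []
for _ in range(5000):
    x = random.random()
    y = 1 if 0.25 <= x < 0.75 else -1
    if random.random() < 0.1:
        y = -y
    data.append((x, y))

best_i = srm(data)
print("SRM selects i =", best_i)
```

With these numbers the penalty grows quickly with i, so the rule settles on the smallest class that fits the data well, trading empirical fit against complexity exactly as (12) prescribes.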
5 Consistency and Learnability: Two Sides of the Same Coin
In the next few lectures we will turn to learnability, and then return to a more detailed discussion of statistical
consistency. As we will see, the two notions are closely related, although they arose in different communities
and tend to emphasize somewhat different aspects:
Statistical Consistency

• Origins in statistics
• Starts with a learning algorithm; asks if it is statistically consistent
• Both consistency within H and Bayes consistency of interest
• Mostly distribution-free; also interested in ‘low-noise’ settings
• Focus on convergence rates ε(m, δ)

Learnability

• Origins in theoretical computer science
• Starts with a function class H; asks if there is a learning algorithm that is statistically consistent in H
  (with an additional requirement we will see next time)
• By definition, interest is in consistency w.r.t. H
• Often assume er`D [H] = 0 (‘target function’ setting); mostly distribution-free otherwise, but sometimes
  interested in specific distributions (such as the uniform distribution over the Boolean cube X = {0, 1}^n )
• Focus on sample complexity m(ε, δ) and computational complexity
6 Next Lecture
In the next lecture we will introduce the notion of learnability, and will give a few basic results and examples
to illustrate the concept. The next few lectures after that will discuss more results and examples related to
learnability, before we return to talk more about statistical consistency.
References
[1] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.