Download 14.1 Covering and Packing - Department of Statistics, Yale

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Wiles's proof of Fermat's Last Theorem wikipedia , lookup

Theorem wikipedia , lookup

Brouwer fixed-point theorem wikipedia , lookup

Fundamental theorem of algebra wikipedia , lookup

Proofs of Fermat's little theorem wikipedia , lookup

Transcript
ECE598: Information-theoretic methods in high-dimensional statistics
Spring 2016
Lecture 14: Packing, covering, and consequences on minimax risk
Lecturer: Yihong Wu
Scribe: Yingxiang Yang, Mar 10, 2016 [Ed. Mar 12]
Last lecture, we lower bounded minkθ−θ̂k I(θ; θ̂) using Shannon lower bound, and we saw that for
the p dimensional n sample GLM,
R∗ (Rp ) &
1
2
p
nvol (Bk·k )
with respect to a loss `(θ, θ̂) = kθ − θ̂k2 and an arbitrary norm k · k.
To understand why some sort of volume shows up, we further extend the lower bound obtained
using Fano’s method. We first introduce the concept of packing, covering, relate them to the
notion of volume, and then plug them into the lower bound obtained using the Fano’s inequality.
When applied to GLM, this alternative method gives the same dependence on the dimension and
the sample size for `q norms with q < ∞.
14.1
Covering and Packing
Definition 14.1 (-covering). Let (V, k · k) be a normed space, and Θ ⊂ V . {V1 , ..., VN } is an
-covering of Θ if Θ ⊂ ∪N
i=1 B(Vi , ), or equivalently, ∀θ ∈ Θ, ∃i such that kθ − Vi k ≤ .
Definition 14.2 (-packing). Let (V, k·k) be a normed space, and Θ ⊂ V . {θ1 , ..., θM } is an -packing
of Θ if mini6=j kθi − θj k > (notice the inequality is strict), or equivalently ∩M
i=1 B(θi , /2) = ∅.
Upon defining -covering and -packing, one naturally asks what is the minimal number of -balls
one needs in order to cover Θ, and what is the maximal number of /2-balls one can pack in Θ.
Those numbers are defined as covering and packing numbers.
Definition 14.3 (Covering number). N (Θ, k · k, ) := min{n : ∃-covering over Θ of size n}.
Definition 14.4 (Packing number). M (Θ, k · k, ) := max{m : ∃-packing of Θ of size m}.
Remark 14.1. Some basic remarks.
• M (Θ, k · k, ) and N (Θ, k · k, ) are often abbreviated as M (), N ().
• For -covering, the balls need not be disjoint.
• N (Θ, k · k, ) is a decreasing function of when the norm and Θ are fixed. That is, if 0 < 1 ,
N
and {V1 , ..., VN } is an -covering of Θ, then Θ ⊂ ∪N
i=1 B(Vi , 0 ) ⊂ ∪i=1 B(Vi , 1 ).
• Metric entropy: log M () and log N ().
• N () < ∞ ∀ > 0 ⇔ Θ is totally bounded (In topology, a metric space is said to be totally
bounded if for every > 0 there is a finite covering of the space by -balls). For example, a
metric space is compact iff it is complete and totally bounded. Hence a compact metric space
is totally bounded.
1
The relation between the packing number and the covering number is described in the following
theorem.
Theorem 14.1. Let (V, k · k) be a normed space, and Θ ⊂ V . Then
(a)
(b)
M (Θ, k · k, 2) ≤ N (Θ, k · k, ) ≤ M (Θ, k · k, ).
Proof. First prove part (b). Suppose E = {θ1 , ..., θM } is a maximal packing. Then ∀θ ∈ Θ\E, ∃i
such that kθ − θi k ≤ (if this does not hold for θ then we can construct a bigger packing with
θM +1 = θ). Hence E is automatically an -covering. Since N (Θ, k · k, ) is the minimal size of all
possible coverings, we have M (Θ, k · k, ) ≥ N (Θ, k · k, ).
We next prove part (a) by contradiction. Suppose there exists a 2-packing {θ1 , ..., θM } and an
-covering {x1 , ..., xN } such that M ≥ N + 1. Then by pigeonhole, we must have θi and θj belonging
to the same -ball B(xk , ) for some i =
6 j and k. This means that the distance between θi and θj
cannot be more than the diameter of the ball, i.e., kθi − θj k ≤ 2, which leads to a contradiction
since kθi − θj k > 2 for a 2-packing. Hence the size of any 2-packing is less or equal to the size of
any -covering. Hence M (Θ, k · k, 2), the maximal size of a 2-packing is at most N (Θ, k · k, ), the
minimal size of an -covering.
When (V, k · k) is the d-dimensional Euclidean space, we can extend the previous theorem by further
lower/upper bounding the covering/packing numbers. The result is given as follows.
Theorem 14.2. Let Θ ⊂ V = Rd . Then
d
(b) vol(Θ + B)
1
vol(Θ) (a)
2
≤ N (Θ, k · k, ) ≤ M (Θ, k · k, ) ≤
vol(B)
vol( 2 B)
(c)
≤
Θ convex
B⊂Θ
vol( 23 Θ)
=
vol( 2 B)
d
3
vol(Θ)
.
vol(B)
where + is the Minkovski sum, and B() is the norm ball with radius and B is the unit norm ball.
Proof. First prove (a). For a covering of minimal size, Θ ⊂ ∪ni=1 B(Xi , ). Hence
N ()
N ()
vol(Θ) ≤ vol(∪i=1 B(Xi , )) ≤
X
vol(B(Xi , )).
i=1
Since vol(B(Xi , )) = d vol(B), we have vol(Θ) ≤ N ()d vol(B). Hence (a) is proved.
M ()
Next we prove (b). For an -packing, the balls B(θi , /2) are disjoint, and ∪i=1 B(θi , /2) ⊂ Θ + 2 B.
Taking the volume on both sides, we have
M ()
vol(Θ + B) ≥ vol(∪i=1 B(θi , /2)) = M ()vol( B).
2
2
This proves (b).
To prove (c), we prove two statements. (1) When B ⊂ Θ, Θ + 2 B ⊂ Θ + 12 Θ, and (2) when Θ is
convex, Θ + 12 Θ = 32 Θ.
To prove (1), notice for any z ∈ Θ + 2 B, we have z = x + y where x ∈ 2 B and y ∈ Θ. Since
x ∈ 2 B ⇒ x ∈ Θ, we immediately have z ∈ Θ + 21 Θ.
2
To prove (2), first notice that ∀θ ∈ 32 Θ, θ = 13 θ + 23 θ. Since 13 θ ∈ 12 Θ, and 23 θ ∈ Θ, 32 Θ ⊆ Θ + 12 Θ.
On the other hand, for any x ∈ Θ + 12 Θ, we have x = y + 12 z with y, z ∈ Θ. When Θ is convex,
2
2
1
3
1
3
3 x = 3 y + 3 z ∈ Θ. Hence x ∈ 2 Θ, implying Θ + 2 Θ ⊆ 2 Θ.
With (1) and (2), (c) follows immediately.
Remark 14.2. Why is Theorem 14.1?
• (a) is a converse, saying that the minimal covering size cannot be too small. When combined
with N () ≤ M (), this turns into a existential statement: It is possible to construct a packing
of size at least vol(Θ)/vol(B()). From the proof we see that this corresponds to a greedy
construction. Furthermore, for Hamming space and Hamming distance, this is exactly the
Gilbert-Varshanov bound.
• (b) is a converse, saying that the maximal packing size cannot be too large. When combined
with N () ≤ M (), this turns into a existence statement: there exists a small covering.
Example 14.1 (Euclidean norm ball). Consider N (B2 (1), k·k2 , ).1 When ≥ 1, N (B2 (1), k·k2 , ) =
1. When < 1, activating the previous theorem we have (noticing Θ = B2 )
d
d
vol((1 + 2 )B)
2 d
1
3
vol(Θ)
= 1+
=
≤ N () ≤
≤
.
vol(B2 )
vol( 2 B)
Hence d log 1 ≤ log N () ≤ d log 3 . This relationship holds for all metric norms (those such that
vol(B)d = vol(B)). If we fix the dimension and drive → 0, then because all norms in Euclidean
space are equivalent, whenever Θ has interior, log N () = (d + o(1)) log 1 .
14.2
Applying metric entropy + Fano’s inequality to minimax risk
We now apply metric entropy and Fano’s inequality to lower bound the minimax risk. The key idea
is to reduce estimation over Θ to testing between a packing E = {θ1 , ..., θM } within T ⊂ Θ. Then
R∗ (Θ) ≥ R∗ (T ) ≥ Rπ∗ where π is equi-probable over E.
Let E = {θ1 , ..., θM } be an -packing on T ⊂ Θ. Let θ̃ be the quantized version of θ̂ restricted to E
(θ − X − θ̂ − θ̃), and consider a quadratic loss function `(θ, θ̂) = kθ − θ̂k2 . Recall that
radKL (T ) = inf sup D(Pθ kQ),
Q θ∈T
and
diamKL (T ) = sup D(Pθ0 kPθ ).
θ,θ0 ∈T
1
For unit ball B2 (1) := {x ∈ Rd : kxk2 ≤ 1} we abbreviate it as B2 .
3
We immediately have
i
h
i M arkov 2 h
i 2 h
P kθ − θ̂k ≥
P θ 6= θ̃
E kθ − θ̂k2
≥
≥
2
2 2
F ano I(θ; X) + log 2
≥
1−
2
log M ()
2
radKL (T ) + log 2
1−
≥
4
log M ()


2 
diamKL (T ) + log 2 
≥ sup
,
1−
T ⊂Θ,>0 4
log vol(T )
(14.1)
vol(B)
where in the last step, the inequality holds true for all choices of T and , and the supremum is
placed to obtain a better bound.
For GLM, we can use the above method (Fano+packing) to obtain the same result (up to constant
factor) by Shannon Lower Bound.
Example 14.2 (p-dimensional n-sample GLM). Let Θ = Rp , and T = B2 (s). Then diamKL (T ) =
supθ,θ0 ∈T D(Pθ kPθ0 ) = supθ,θ0 ∈T D(N (θ, Ip )⊗n kN (θ0 , Ip )⊗n ) = supθ,θ0 n2 kθ − θ0 k2 = n2 diam2 (T ) =
n 2
2 s . By (14.1), we have




n 2
2
2
s
+
log
2
diam
(T
)
+
log
2
KL
1 −
=
1 − 2 p
.
R∗ ≥
vol(T )
4
4
log
log sp vol(B2 )
vol(B)
vol(Bk·k )
We now choose and s. Denote vol(Bk·k ) = V , and recall that vol1/p (B2 ) √1 ,
p
i.e., c1 √1p <
vol1/p (B2 ) < c2 √1p . If we choose
r
s = c3
then
c24
R∗ ≥
4nV 2/p
p
1
, = c4 √ 1/p ,
n
nV
c23 p
2
+ log 2
1−
p log c1c4c3
!
c24
≥
4nV 2/p
c23
2
+ log 2
1−
log c1c4c3
!
.
c2
As long as we choose c1 , c2 , c3 , c4 such that ( 23 + log 2)/ log c1c4c3 < c < 1, we have
1
c24 (1 − c)
&
.
(14.2)
2/p
4nV
nV 2/p
Remark 14.3. When the specified norm is k · k∞ , the norm ball becomes a cube, and the volume
is (for fixed values of p)
V = 2p .
R∗ ≥
Hence R∗ & n1 ; however, we know R∗ log p
n
and we lose the dependence on the dimension p.
So what is to be blamed? It turns out our mutual information method, and in fact, its further
relaxation via packing and Fano’s inequality is tight in this case. What is loose is the volume ratio
bound on packing number in Theorem 14.1. In the next lecture, we will prove
(
p log √1 p ,
. √1p
log N (B, k · k, ) .
1
log(p2 ), & √1p
2
4
This will lead to the tight result R∗ log p
n .
5