Compression for Data Mining
Advanced Data Mining 2010
Matthijs van Leeuwen

Patterns are Everywhere
(image slides)

Simple Patterns lead to Complex Data
(image slides)
Given complex data, how do we find these simple patterns?

Compression!
Compression for Data Mining
- Compression algorithms exploit structure in data to achieve optimal compression
- I.e. the more structure, the easier to compress
- In other words, patterns are used to compress
- But, how do we compress?
- We need a way to quantify structure / information

Information in Data
Which piece of art contains more information?
(two images: a JPEG file of 67 Kb and a JPEG file of 87 Kb)
Information in Data
Which piece of art contains more information?
(two images: a JPEG file of 115 Kb and a JPEG file of 87 Kb)

Information in Data
Which string contains more information?
1. AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
   Very structured; compressed: 30xA
2. ABBABAAABBABABBBAABAABAABAABAA
   Seems random; impossible to compress (?)
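As a quick illustration (not part of the slides), a general-purpose compressor such as zlib indeed shrinks the structured string far more than the random-looking one:

```python
import zlib

structured = b"A" * 30
random_looking = b"ABBABAAABBABABBBAABAABAABAABAA"

print(len(zlib.compress(structured, 9)))      # much smaller than the 30 input bytes
print(len(zlib.compress(random_looking, 9)))  # not smaller; header overhead may even make it larger
```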
Kolmogorov complexity
- Assume a universal computer / programming system
  - E.g. a general computer language like C++
  - Or a universal Turing machine
- The Kolmogorov complexity of a string x is the length, in bits, of the shortest computer program that produces x as output.
- Denote this by K(x)
- K(x) can be regarded as the length of the ultimate compressed version from which x can be recovered by a general decompression program.
- K(x) < rar(x) < bzip(x) < zip(x)

Kolmogorov complexity
- x is incompressible iff K(x) > |x|
- x incompressible implies that x is random
- The choice of computing system changes K(x) only with an additive constant.
Conditional Kolmogorov complexity
- The conditional Kolmogorov complexity K(x | y) of a string x under condition y is the length, in bits, of the shortest computer program that produces x as output when given y as input.
- For example: K(ABA | AAA) < K(ABA | BBB)
- We can reformulate the unconditional Kolmogorov complexity like this:
    K(x) = K(x | e)
  where e is the empty string.

Kolmogorov complexity
- Sounds like we are done? We can find patterns using Kolmogorov complexity?
- Good: ultimate compression!
- Bad: K(x) is non-computable
Information Theory
- Formalise how information can be quantified
  - How much information does some data/event contain?
  - How many bits are needed to describe something?
- Computers always count in bits
  - Hence, information theory does the same
  - In these lectures, we only use the logarithm of base 2!
- Useful for compression schemes
- Computable!

Information Theory: Equally Likely Outcomes
- Consider a random experiment with n equally likely outcomes {1, 2, ..., n}
  - E.g. a roll of a die (n = 6) or a coin flip (n = 2)
- What is the amount of information gained by learning which outcome occurred?
- Denote a quantitative measure of this information by H(n)
Information Theory: Equally Likely Outcomes
- Reducing 8 possibilities to 1:
  (binary decision tree figure; a sequence of three yes/no answers, e.g. 1 0 0, singles out one of the 8 possibilities = 3 bits)
- How much information does this give us?
- 8 possibilities require 3 binary decisions,
  n possibilities require log n binary decisions.

Information Theory: Equally Likely Outcomes
- H(n) = log n = -log(1/n)
- Properties
  - No information: H(1) = 0
  - Monotonicity: H(n) <= H(n+1)
  - Additivity: H(nm) = H(n) + H(m)
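Since H(n) is computable, here is a tiny Python check of the definition and its properties (illustrative, not from the slides):

```python
import math

def H(n: int) -> float:
    """Information, in bits, gained by learning which of n
    equally likely outcomes occurred: H(n) = log2(n)."""
    return math.log2(n)

print(H(8))                                  # 3.0 bits: 8 possibilities, 3 binary decisions
print(H(1))                                  # 0.0 bits: no information
print(H(6) <= H(6 + 1))                      # True: monotonicity
print(math.isclose(H(6 * 2), H(6) + H(2)))   # True: additivity, H(nm) = H(n) + H(m)
```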
Information Theory: Unequally Likely Outcomes
- What if not all outcomes are equally likely?
  - E.g. an unfair die, where the probability of throwing 6 is higher than throwing 1
- Denote the probability of each possible outcome by p(x)
- What is the amount of information gained by learning the outcome of this experiment?

Information Theory: Unequally Likely Outcomes
- With equiprobable outcomes the `surprise value' of the i-th outcome is
    log n = -log(1/n)
- We could argue analogously that the surprise value of outcome x is
    log(1/p(x)) = -log p(x)
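A tiny illustration in Python (the die probabilities are made up for the example): the rarer the outcome, the larger its surprise value.

```python
import math

def surprise(p: float) -> float:
    """Surprise value, in bits, of an outcome with probability p: -log2(p)."""
    return -math.log2(p)

# Hypothetical unfair die: throwing 6 is much more likely than throwing 1.
probs = {1: 0.05, 2: 0.15, 3: 0.15, 4: 0.15, 5: 0.15, 6: 0.35}
for outcome, p in probs.items():
    print(outcome, round(surprise(p), 2))
# The rare outcome 1 carries about 4.32 bits of surprise,
# the common outcome 6 only about 1.51 bits.
```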
Information Theory: Shannon's Entropy
- In general we have
    H(x) = -Σ_x p(x) log p(x)
  for a discrete random variable x with probability distribution p(x).

Information Theory: Shannon's Source Coding Theorem
- We can use this result for compression.
- Suppose a symbol alphabet S and a word of symbols W (∀a ∈ W : a ∈ S)
- By viewing W as a random variable, we obtain a probability distribution P(a).
- Shannon proved that an optimal lossless encoding satisfies
    L(a) = -log P(a)
  where L(a) is the length, in bits, of the code assigned to a.
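A small Python sketch of these ideal code lengths (not from the slides), assuming P(a) is taken to be each symbol's relative frequency in the word. Applied to the two example strings from earlier, the structured string needs 0 bits per symbol while the random-looking one needs almost 1 bit per symbol:

```python
import math
from collections import Counter

def ideal_code_lengths(word: str) -> dict:
    """Ideal code length L(a) = -log2 P(a) per symbol, with P(a)
    estimated as the symbol's relative frequency in the word."""
    counts = Counter(word)
    n = len(word)
    return {a: -math.log2(c / n) for a, c in counts.items()}

def encoded_size(word: str) -> float:
    """Total 'ideal' encoded size of the word, in bits."""
    lengths = ideal_code_lengths(word)
    return sum(lengths[a] for a in word)

print(encoded_size("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"))  # 0.0: a symbol with P(a) = 1 carries no surprise
print(encoded_size("ABBABAAABBABABBBAABAABAABAABAA"))  # ~29.6 bits, i.e. almost 1 bit per symbol
```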
Information Theory: Shannon's Source Coding Theorem
- Note: we are computing `ideal' code lengths.
- Obviously, if we actually needed to materialise the codes, codes of real-valued length would be impossible.
Using Compression for Data Mining Tasks
- All (?!) data mining tasks can be described in terms of compression / Kolmogorov complexity:
  - Classification
  - Clustering
  - Outlier detection
  - …
- In the remainder of this lecture and the practical assignment, we focus on the task of clustering images using compression.

Clustering Images
- Suppose we have a set of images and, without any additional information, we would like to find groups of images that are similar.
- In other words, we would like to cluster the images.
- For this, we need a way to quantify image similarity; a distance measure.
Towards a Distance Measure for Images
- Using Kolmogorov complexity, we can easily define a distance measure on strings.
- The distance d_k between two strings x and y can be defined as:
    d_k(x, y) = (K(x|y) + K(y|x)) / K(xy)
  where xy is the concatenation of x and y.

Towards a Distance Measure for Images
- K is non-computable, but we can approximate it using a compression algorithm C.
- This gives us the distance d_c between two strings x and y, defined as:
    d_c(x, y) = (C(x|y) + C(y|x)) / C(xy)
Towards a Distance Measure for Images
- We have a problem with d_c:
  - In this case, we prefer to regard C as a black box.
  - However, it is not trivial how we should compute C(x | y) without diving into the specifics of C.

A Distance Measure for Images
- To avoid these problems, we use the following:
    d_i(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}
  where xy is the concatenation of x and y.
- As compressor C, we can use any image compression scheme we like, e.g. JPEG.
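To make the formula concrete, here is a minimal Python sketch of d_i. It uses zlib on byte strings as a stand-in black-box compressor; the lecture and practical assignment use an image compressor such as JPEG instead, but the formula is the same:

```python
import zlib

def C(data: bytes) -> int:
    """Compressed size in bytes; zlib serves as a stand-in black-box
    compressor (for images the slides suggest e.g. JPEG)."""
    return len(zlib.compress(data, 9))

def d_i(x: bytes, y: bytes) -> float:
    """d_i(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)},
    where xy is the concatenation of x and y."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

x = b"the quick brown fox jumps over the lazy dog. " * 20
y = b"the quick brown fox leaps over the lazy dog. " * 20   # shares most of its structure with x
z = b"lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 20

print(d_i(x, y))  # small: the concatenation xy compresses almost as well as x alone
print(d_i(x, z))  # much larger: x and z share little structure
```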
Clustering
- Using d_i, we can obtain (pairwise) distance values for all image pairs.
- The larger the distance, the less similar the images.
- Clustering can now be done by `linking' those image pairs that have the smallest distances.
- Doing this iteratively is called agglomerative hierarchical clustering.

Distance matrix

        | 0                              |
        | d(2,1)  0                      |
  D  =  | d(3,1)  d(3,2)  0              |
        | ...                            |
        | d(n,1)  d(n,2)  d(n,3)  ...  0 |
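A sketch of this pipeline with SciPy (one possible implementation, not prescribed by the slides), starting from a precomputed distance matrix; the values below are those of the example matrix on the following slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Pairwise distances for 5 images (the example matrix from the next slides).
D = np.array([
    [0.0, 7.4, 0.7, 7.3, 0.5],
    [7.4, 0.0, 7.1, 0.3, 6.9],
    [0.7, 7.1, 0.0, 7.0, 0.6],
    [7.3, 0.3, 7.0, 0.0, 6.8],
    [0.5, 6.9, 0.6, 6.8, 0.0],
])

condensed = squareform(D)                      # condensed form expected by linkage()
Z = linkage(condensed, method="single")        # single-linkage agglomerative clustering
print(fcluster(Z, t=2, criterion="maxclust"))  # two clusters: images {1, 3, 5} and {2, 4}
```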
Example distance matrix

       1    2    3    4    5
  1    0
  2    7.4  0
  3    0.7  7.1  0
  4    7.3  0.3  7.0  0
  5    0.5  6.9  0.6  6.8  0

Example distance matrix
(the same matrix; the smallest entry, d(4,2) = 0.3, marks images 2 and 4 as the first pair to link)
Example distance matrix

         1    (2,4)  3    5
  1      0
  (2,4)  ?    0
  3      0.7  ?      0
  5      0.5  ?      0.6  0

Distance between clusters
Ways to measure the distance between clusters A and B:
- Single linkage (nearest neighbour)
    d(A, B) = min{ d(a, b) | a ∈ A, b ∈ B }
- Complete linkage (furthest neighbour)
    d(A, B) = max{ d(a, b) | a ∈ A, b ∈ B }
- Average linkage
    d(A, B) = mean{ d(a, b) | a ∈ A, b ∈ B }
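The three linkage rules are simple to express in code. A small illustrative Python sketch (not from the slides); applied to the example matrix, it reproduces for cluster (2,4) versus image 1 the values 7.3 (single), 7.4 (complete) and 7.35 (average) worked out on the next slide:

```python
from statistics import mean

def cluster_distance(d, A, B, method="single"):
    """Distance between clusters A and B, given a pairwise
    distance function d(a, b) on individual items."""
    pair_distances = [d(a, b) for a in A for b in B]
    if method == "single":       # nearest neighbour
        return min(pair_distances)
    if method == "complete":     # furthest neighbour
        return max(pair_distances)
    if method == "average":
        return mean(pair_distances)
    raise ValueError(f"unknown method: {method}")

# Pairwise distances from the example matrix (image IDs as on the slides).
dist = {(1, 2): 7.4, (1, 3): 0.7, (1, 4): 7.3, (1, 5): 0.5,
        (2, 3): 7.1, (2, 4): 0.3, (2, 5): 6.9,
        (3, 4): 7.0, (3, 5): 0.6, (4, 5): 6.8}
d = lambda a, b: dist[(min(a, b), max(a, b))]

for method in ("single", "complete", "average"):
    print(method, cluster_distance(d, {2, 4}, {1}, method))  # 7.3, 7.4, 7.35
```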
Linkage example
(distances from the new cluster (2,4) to images 1, 3 and 5)

       2    4    Single  Complete  Average
  1    7.4  7.3  7.3     7.4       7.35
  3    7.1  7.0  7.0     7.1       7.05
  5    6.9  6.8  6.8     6.9       6.85
                 min     max       mean

Example distance matrix – single linkage

         1    (2,4)  3    5
  1      0
  (2,4)  7.3  0
  3      0.7  7.0    0
  5      0.5  6.8    0.6  0
Example cluster result
(figure)

Practical Assignment 2
- The task and approach just outlined will be the subject of the second practical assignment.
- More information will be posted on the website on Wednesday.
Conclusions
- Compression is useful for many data mining tasks
- Advantages
  - Generic, applicable to many tasks and applications
  - Well-principled
  - Parameter-free data mining
- Disadvantages
  - K(x) is non-computable
  - There is no single recipe
  - Quality of results depends on the encoding

Literature
- C. Faloutsos & V. Megalooikonomou. On data mining, compression, and Kolmogorov complexity. Data Min. Knowl. Discov. 15(1): 3-20 (2007). (Optional)
- B.J.L. Campana & E.J. Keogh. A Compression-Based Distance Measure for Texture. In: Proceedings of SIAM Data Mining (2010). (Optional, recommended for the practical assignment.)