Patterns are Everywhere: Compression for Data Mining
Advanced Data Mining 2010
Matthijs van Leeuwen

Patterns are Everywhere / Simple Patterns lead to Complex Data
Simple patterns can generate complex-looking data. Given complex data, how do we find these simple patterns? Compression!

Compression for Data Mining
- Compression algorithms exploit structure in data to achieve optimal compression; i.e., the more structure, the easier it is to compress.
- In other words, patterns are used to compress.
- But how do we compress? We need a way to quantify structure / information.

Information in Data
Which piece of art contains more information? (The slides compare pieces of art whose JPEG files take 67 KB, 87 KB and 115 KB: more detail is harder to compress.)

Which string contains more information?
1. AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  - very structured; compressed: 30xA
2. ABBABAAABBABABBBAABAABAABAABAA  - seems random; impossible to compress (?)

Kolmogorov complexity
- Assume a universal computer / programming system, e.g. a general programming language like C++, or a universal Turing machine.
- The Kolmogorov complexity of a string x is the length, in bits, of the shortest computer program that produces x as output. Denote this by K(x).
- K(x) can be regarded as the length of the ultimate compressed version from which x can be recovered by a general decompression program:
  K(x) <= rar(x) <= bzip(x) <= zip(x)
- x is incompressible iff K(x) >= |x|; x incompressible implies that x is random.
- The choice of computing system changes K(x) only by an additive constant.

Conditional Kolmogorov complexity
- The conditional Kolmogorov complexity K(x | y) of a string x under condition y is the length, in bits, of the shortest computer program that produces x as output when given y as input.
- For example: K(ABA | AAA) < K(ABA | BBB).
- We can reformulate the unconditional Kolmogorov complexity as K(x) = K(x | e), where e is the empty string.

Kolmogorov complexity
- Sounds like we are done? Can we find patterns using Kolmogorov complexity?
- Good: ultimate compression!
- Bad: K(x) is non-computable.
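Although K(x) itself cannot be computed, any real compressor gives a computable upper bound on it, which already separates the two example strings above. The sketch below is not part of the slides: zlib and bz2 are chosen only because they ship with Python's standard library, and the 10,000-character analogues are our addition, since fixed header overhead dominates on 30-character inputs.

```python
import bz2
import random
import zlib

# The two example strings from the slides (30 characters each).
structured = b"A" * 30
random_ish = b"ABBABAAABBABABBBAABAABAABAABAA"

# Longer analogues: 10,000 'A's versus 10,000 random choices from {A, B}.
rng = random.Random(0)
long_structured = b"A" * 10_000
long_random = bytes(rng.choice(b"AB") for _ in range(10_000))

for name, s in [("structured (30)", structured),
                ("random-ish (30)", random_ish),
                ("structured (10k)", long_structured),
                ("random (10k)", long_random)]:
    # Compressed sizes are computable upper bounds on K(s), up to additive
    # constants: the more structure, the shorter the compressed output.
    print(f"{name:18s} raw={len(s):6d} "
          f"zlib={len(zlib.compress(s, 9)):6d} bz2={len(bz2.compress(s)):6d}")
```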
Information Theory
- Formalises how information can be quantified: how much information does some data or event contain? How many bits are needed to describe something?
- Computers always count in bits; hence, information theory does the same. In these lectures we only use the logarithm of base 2.
- Useful for compression schemes, and computable!

Information Theory: Equally Likely Outcomes
- Consider a random experiment with n equally likely outcomes {1, 2, ..., n}, e.g. a roll of a die (n = 6) or a coin flip (n = 2).
- What is the amount of information gained by learning which outcome occurred? Denote a quantitative measure of this information by H(n).
- Reducing 8 possibilities to 1: each binary decision (one bit) halves the number of possibilities, so 8 possibilities require 3 binary decisions, i.e. 3 bits. In general, n possibilities require log n binary decisions:
  H(n) = log n = -log(1/n)
- Properties:
  - No information: H(1) = 0
  - Monotonicity: H(n) <= H(n+1)
  - Additivity: H(nm) = H(n) + H(m)

Information Theory: Unequally Likely Outcomes
- What if not all outcomes are equally likely? E.g. an unfair die, where the probability of throwing 6 is higher than that of throwing 1. What is the amount of information gained by learning the outcome of this experiment?
- With equiprobable outcomes, the 'surprise value' of the i-th outcome is log n = -log(1/n).
- Denoting the probability of each possible outcome by p(x), we can argue analogously that the surprise value of outcome x is log(1/p(x)) = -log p(x).

Information Theory: Shannon's Entropy
In general we have
  H(X) = - sum_x p(x) log p(x)
for a discrete random variable X with probability distribution p(x).

Information Theory: Shannon's Source Coding Theorem
- We can use this result for compression. Suppose a symbol alphabet S and a word W over these symbols (every symbol a in W belongs to S). By viewing W as a random variable, we obtain a probability distribution P(a).
- Shannon proved that an optimal lossless encoding satisfies
  L(a) = -log P(a)
  where L(a) is the length, in bits, of the code assigned to symbol a.
- Note: we are computing 'ideal' code lengths. Obviously, if we needed to materialise the codes, codes of real-valued lengths would be impossible.

Using Compression for Data Mining Tasks
- All (?!) data mining tasks can be described in terms of compression / Kolmogorov complexity: classification, clustering, outlier detection, ...
- In the remainder of this lecture and the practical assignment, we focus on the task of clustering images using compression.

Clustering Images
- Suppose we have a set of images and, without any additional information, we would like to find groups of images that are similar. In other words, we would like to cluster the images.
- For this, we need a way to quantify image similarity: a distance measure.

Towards a Distance Measure for Images
- Using Kolmogorov complexity, we can easily define a distance measure on strings. The distance d_K between two strings x and y can be defined as
  d_K(x, y) = (K(x|y) + K(y|x)) / K(xy)
  where xy is the concatenation of x and y.
- K is non-computable, but we can approximate it using a compression algorithm C. This gives us the distance d_C between two strings x and y, defined as
  d_C(x, y) = (C(x|y) + C(y|x)) / C(xy)
- We have a problem with d_C: we prefer to regard C as a black box, but it is not trivial how to compute C(x | y) without diving into the specifics of C.

A Distance Measure for Images
- To avoid these problems, we use the following:
  d_i(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}
  where xy is the concatenation of x and y.
- As compressor C, we can use any image compression scheme we like, e.g. JPEG.
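A minimal sketch of d_i, assuming C may be any compressor: here zlib on raw byte strings stands in for an image compressor such as JPEG, and the helper names c and d_i are ours, not from the slides. (This expression has the same form as the normalized compression distance known from the literature.)

```python
import zlib


def c(x: bytes) -> int:
    """Compressed size in bytes; stands in for C(x). zlib is used purely as an
    illustration -- for images, the slides suggest a compressor such as JPEG."""
    return len(zlib.compress(x, 9))


def d_i(x: bytes, y: bytes) -> float:
    """d_i(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)},
    where xy is the concatenation of x and y."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    x = b"ABAB" * 500                 # repetitive
    y = b"ABAB" * 480 + b"CDCD" * 20  # mostly the same pattern as x
    z = bytes(range(256)) * 16        # very different content
    print("d_i(x, y) =", round(d_i(x, y), 3))  # relatively small
    print("d_i(x, z) =", round(d_i(x, z), 3))  # relatively large
```

For the practical assignment, C(x) would instead be the size of image x after compression with the chosen image compression scheme.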
Clustering
- Using d_i, we can obtain pairwise distance values for all image pairs. The larger the distance, the less similar the images.
- Clustering can now be done by 'linking' those image pairs that have the smallest distances. Doing this iteratively is called agglomerative hierarchical clustering.

Distance matrix
The pairwise distances are collected in a lower-triangular distance matrix:

  D = | 0                            |
      | d(2,1)  0                    |
      | d(3,1)  d(3,2)  0            |
      | ...                          |
      | d(n,1)  d(n,2)  d(n,3) ... 0 |

Example distance matrix

         1     2     3     4     5
  1      0
  2      7.4   0
  3      0.7   7.1   0
  4      7.3   0.3   7.0   0
  5      0.5   6.9   0.6   6.8   0

The smallest distance is d(4,2) = 0.3, so objects 2 and 4 are merged into cluster (2,4). The distances from (2,4) to the remaining objects then have to be determined:

         1     (2,4)  3     5
  1      0
  (2,4)  ?     0
  3      0.7   ?      0
  5      0.5   ?      0.6   0

Distance between clusters
Ways to measure the distance between clusters A and B:
- Single linkage (nearest neighbour):    d(A, B) = min{ d(a, b) | a in A, b in B }
- Complete linkage (furthest neighbour): d(A, B) = max{ d(a, b) | a in A, b in B }
- Average linkage:                       d(A, B) = mean{ d(a, b) | a in A, b in B }

Linkage example
Distances from cluster (2,4) to the remaining objects (a runnable sketch of the full procedure is given after the literature list):

        to 2   to 4   single (min)  complete (max)  average (mean)
  1     7.4    7.3    7.3           7.4             7.35
  3     7.1    7.0    7.0           7.1             7.05
  5     6.9    6.8    6.8           6.9             6.85

Example distance matrix - single linkage

         1     (2,4)  3     5
  1      0
  (2,4)  7.3   0
  3      0.7   7.0    0
  5      0.5   6.8    0.6   0

Example cluster result
Repeating these merge steps until a single cluster remains yields a hierarchical clustering of the images (shown as a dendrogram on the slide).

Practical Assignment 2
- The task and approach just outlined will be the subject of the second practical assignment.
- More information will be posted on the website on Wednesday.

Conclusions
Compression is useful for many data mining tasks.
- Advantages: generic, applicable to many tasks and applications; well-principled; parameter-free data mining.
- Disadvantages: K(x) is non-computable; there is no single recipe; the quality of the results depends on the encoding.

Literature
- C. Faloutsos & V. Megalooikonomou. On data mining, compression, and Kolmogorov complexity. Data Mining and Knowledge Discovery 15(1): 3-20 (2007). (Optional.)
- B.J.L. Campana & E.J. Keogh. A Compression-Based Distance Measure for Texture. In: Proceedings of SIAM Data Mining (2010). (Optional, recommended for the practical assignment.)
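As referenced in the linkage example above, here is a minimal sketch of agglomerative hierarchical clustering on the five-object example matrix. The distance values are the ones from the slides; the function names, the linkage options, and the printed trace are our own scaffolding, not part of the lecture material.

```python
import itertools

# Lower-triangular example distances from the slides, stored symmetrically:
# d[(i, j)] = d[(j, i)] = distance between objects i and j.
pairs = {(2, 1): 7.4, (3, 1): 0.7, (3, 2): 7.1, (4, 1): 7.3, (4, 2): 0.3,
         (4, 3): 7.0, (5, 1): 0.5, (5, 2): 6.9, (5, 3): 0.6, (5, 4): 6.8}
d = {}
for (i, j), v in pairs.items():
    d[(i, j)] = d[(j, i)] = v


def linkage(a, b, dist, mode="single"):
    """Distance between clusters a and b (tuples of object ids) under the
    chosen linkage: single = min, complete = max, average = mean."""
    values = [dist[(x, y)] for x in a for y in b]
    if mode == "single":
        return min(values)
    if mode == "complete":
        return max(values)
    return sum(values) / len(values)  # average linkage


def agglomerative(objects, dist, mode="single"):
    """Repeatedly merge the two closest clusters until one cluster remains."""
    clusters = [(o,) for o in objects]
    while len(clusters) > 1:
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda ab: linkage(ab[0], ab[1], dist, mode))
        print(f"merge {a} and {b} at distance {linkage(a, b, dist, mode):.2f}")
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return clusters[0]


agglomerative([1, 2, 3, 4, 5], d, mode="single")
# Expected merge order on this matrix: (2,4) at 0.3, then (1,5) at 0.5,
# then 3 joins (1,5) at 0.6, and finally everything merges at 6.8.
```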