Patterns are Everywhere: Compression for Data Mining
Advanced Data Mining 2010
Matthijs van Leeuwen

Patterns are Everywhere / Simple Patterns lead to Complex Data
Simple patterns can generate complex-looking data. Given complex data, how do we find these simple patterns? Compression!

Compression for Data Mining
- Compression algorithms exploit structure in data to achieve optimal compression; i.e., the more structure, the easier it is to compress.
- In other words, patterns are used to compress.
- But how do we compress? We need a way to quantify structure / information.

Information in Data
Which piece of art contains more information? (The slides compare pieces of art whose JPEG files take 67 KB, 87 KB and 115 KB: more detail is harder to compress.)

Which string contains more information?
1. AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  - very structured; compressed: 30xA
2. ABBABAAABBABABBBAABAABAABAABAA  - seems random; impossible to compress (?)

Kolmogorov complexity
- Assume a universal computer / programming system, e.g. a general programming language like C++, or a universal Turing machine.
- The Kolmogorov complexity of a string x is the length, in bits, of the shortest computer program that produces x as output. Denote this by K(x).
- K(x) can be regarded as the length of the ultimate compressed version from which x can be recovered by a general decompression program:
  K(x) <= rar(x) <= bzip(x) <= zip(x)
- x is incompressible iff K(x) >= |x|; x incompressible implies that x is random.
- The choice of computing system changes K(x) only by an additive constant.

Conditional Kolmogorov complexity
- The conditional Kolmogorov complexity K(x | y) of a string x under condition y is the length, in bits, of the shortest computer program that produces x as output when given y as input.
- For example: K(ABA | AAA) < K(ABA | BBB).
- We can reformulate the unconditional Kolmogorov complexity as K(x) = K(x | e), where e is the empty string.

Kolmogorov complexity
- Sounds like we are done? Can we find patterns using Kolmogorov complexity?
- Good: ultimate compression!
- Bad: K(x) is non-computable.
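Although K(x) itself cannot be computed, any real compressor gives a computable upper bound on it, which already separates the two example strings above. The sketch below is not part of the slides: zlib and bz2 are chosen only because they ship with Python's standard library, and the 10,000-character analogues are our addition, since fixed header overhead dominates on 30-character inputs.

```python
import bz2
import random
import zlib

# The two example strings from the slides (30 characters each).
structured = b"A" * 30
random_ish = b"ABBABAAABBABABBBAABAABAABAABAA"

# Longer analogues: 10,000 'A's versus 10,000 random choices from {A, B}.
rng = random.Random(0)
long_structured = b"A" * 10_000
long_random = bytes(rng.choice(b"AB") for _ in range(10_000))

for name, s in [("structured (30)", structured),
                ("random-ish (30)", random_ish),
                ("structured (10k)", long_structured),
                ("random (10k)", long_random)]:
    # Compressed sizes are computable upper bounds on K(s), up to additive
    # constants: the more structure, the shorter the compressed output.
    print(f"{name:18s} raw={len(s):6d} "
          f"zlib={len(zlib.compress(s, 9)):6d} bz2={len(bz2.compress(s)):6d}")
```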
Information Theory
- Formalises how information can be quantified: how much information does some data or event contain? How many bits are needed to describe something?
- Computers always count in bits; hence, information theory does the same. In these lectures we only use the logarithm of base 2.
- Useful for compression schemes, and computable!

Information Theory: Equally Likely Outcomes
- Consider a random experiment with n equally likely outcomes {1, 2, ..., n}, e.g. a roll of a die (n = 6) or a coin flip (n = 2).
- What is the amount of information gained by learning which outcome occurred? Denote a quantitative measure of this information by H(n).
- Reducing 8 possibilities to 1: each binary decision (one bit) halves the number of possibilities, so 8 possibilities require 3 binary decisions, i.e. 3 bits. In general, n possibilities require log n binary decisions:
  H(n) = log n = -log(1/n)
- Properties:
  - No information: H(1) = 0
  - Monotonicity: H(n) <= H(n+1)
  - Additivity: H(nm) = H(n) + H(m)

Information Theory: Unequally Likely Outcomes
- What if not all outcomes are equally likely? E.g. an unfair die, where the probability of throwing 6 is higher than that of throwing 1. What is the amount of information gained by learning the outcome of this experiment?
- With equiprobable outcomes, the 'surprise value' of the i-th outcome is log n = -log(1/n).
- Denoting the probability of each possible outcome by p(x), we can argue analogously that the surprise value of outcome x is log(1/p(x)) = -log p(x).

Information Theory: Shannon's Entropy
In general we have
  H(X) = - sum_x p(x) log p(x)
for a discrete random variable X with probability distribution p(x).

Information Theory: Shannon's Source Coding Theorem
- We can use this result for compression. Suppose a symbol alphabet S and a word W over these symbols (every symbol a in W belongs to S). By viewing W as a random variable, we obtain a probability distribution P(a).
- Shannon proved that an optimal lossless encoding satisfies
  L(a) = -log P(a)
  where L(a) is the length, in bits, of the code assigned to symbol a.
- Note: we are computing 'ideal' code lengths. Obviously, if we needed to materialise the codes, codes of real-valued lengths would be impossible.

Using Compression for Data Mining Tasks
- All (?!) data mining tasks can be described in terms of compression / Kolmogorov complexity: classification, clustering, outlier detection, ...
- In the remainder of this lecture and the practical assignment, we focus on the task of clustering images using compression.

Clustering Images
- Suppose we have a set of images and, without any additional information, we would like to find groups of images that are similar. In other words, we would like to cluster the images.
- For this, we need a way to quantify image similarity: a distance measure.

Towards a Distance Measure for Images
- Using Kolmogorov complexity, we can easily define a distance measure on strings. The distance d_K between two strings x and y can be defined as
  d_K(x, y) = (K(x|y) + K(y|x)) / K(xy)
  where xy is the concatenation of x and y.
- K is non-computable, but we can approximate it using a compression algorithm C. This gives us the distance d_C between two strings x and y, defined as
  d_C(x, y) = (C(x|y) + C(y|x)) / C(xy)
- We have a problem with d_C: we prefer to regard C as a black box, but it is not trivial how to compute C(x | y) without diving into the specifics of C.

A Distance Measure for Images
- To avoid these problems, we use the following:
  d_i(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}
  where xy is the concatenation of x and y.
- As compressor C, we can use any image compression scheme we like, e.g. JPEG.
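A minimal sketch of d_i, assuming C may be any compressor: here zlib on raw byte strings stands in for an image compressor such as JPEG, and the helper names c and d_i are ours, not from the slides. (This expression has the same form as the normalized compression distance known from the literature.)

```python
import zlib


def c(x: bytes) -> int:
    """Compressed size in bytes; stands in for C(x). zlib is used purely as an
    illustration -- for images, the slides suggest a compressor such as JPEG."""
    return len(zlib.compress(x, 9))


def d_i(x: bytes, y: bytes) -> float:
    """d_i(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)},
    where xy is the concatenation of x and y."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    x = b"ABAB" * 500                 # repetitive
    y = b"ABAB" * 480 + b"CDCD" * 20  # mostly the same pattern as x
    z = bytes(range(256)) * 16        # very different content
    print("d_i(x, y) =", round(d_i(x, y), 3))  # relatively small
    print("d_i(x, z) =", round(d_i(x, z), 3))  # relatively large
```

For the practical assignment, C(x) would instead be the size of image x after compression with the chosen image compression scheme.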
Clustering
- Using d_i, we can obtain pairwise distance values for all image pairs. The larger the distance, the less similar the images.
- Clustering can now be done by 'linking' those image pairs that have the smallest distances. Doing this iteratively is called agglomerative hierarchical clustering.

Distance matrix
The pairwise distances are collected in a lower-triangular distance matrix:

  D = | 0                            |
      | d(2,1)  0                    |
      | d(3,1)  d(3,2)  0            |
      | ...                          |
      | d(n,1)  d(n,2)  d(n,3) ... 0 |

Example distance matrix

         1     2     3     4     5
  1      0
  2      7.4   0
  3      0.7   7.1   0
  4      7.3   0.3   7.0   0
  5      0.5   6.9   0.6   6.8   0

The smallest distance is d(4,2) = 0.3, so objects 2 and 4 are merged into cluster (2,4). The distances from (2,4) to the remaining objects then have to be determined:

         1     (2,4)  3     5
  1      0
  (2,4)  ?     0
  3      0.7   ?      0
  5      0.5   ?      0.6   0

Distance between clusters
Ways to measure the distance between clusters A and B:
- Single linkage (nearest neighbour):    d(A, B) = min{ d(a, b) | a in A, b in B }
- Complete linkage (furthest neighbour): d(A, B) = max{ d(a, b) | a in A, b in B }
- Average linkage:                       d(A, B) = mean{ d(a, b) | a in A, b in B }

Linkage example
Distances from cluster (2,4) to the remaining objects (a runnable sketch of the full procedure is given after the literature list):

        to 2   to 4   single (min)  complete (max)  average (mean)
  1     7.4    7.3    7.3           7.4             7.35
  3     7.1    7.0    7.0           7.1             7.05
  5     6.9    6.8    6.8           6.9             6.85

Example distance matrix - single linkage

         1     (2,4)  3     5
  1      0
  (2,4)  7.3   0
  3      0.7   7.0    0
  5      0.5   6.8    0.6   0

Example cluster result
Repeating these merge steps until a single cluster remains yields a hierarchical clustering of the images (shown as a dendrogram on the slide).

Practical Assignment 2
- The task and approach just outlined will be the subject of the second practical assignment.
- More information will be posted on the website on Wednesday.

Conclusions
Compression is useful for many data mining tasks.
- Advantages: generic, applicable to many tasks and applications; well-principled; parameter-free data mining.
- Disadvantages: K(x) is non-computable; there is no single recipe; the quality of the results depends on the encoding.

Literature
- C. Faloutsos & V. Megalooikonomou. On data mining, compression, and Kolmogorov complexity. Data Mining and Knowledge Discovery 15(1): 3-20 (2007). (Optional.)
- B.J.L. Campana & E.J. Keogh. A Compression-Based Distance Measure for Texture. In: Proceedings of SIAM Data Mining (2010). (Optional, recommended for the practical assignment.)
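As referenced in the linkage example above, here is a minimal sketch of agglomerative hierarchical clustering on the five-object example matrix. The distance values are the ones from the slides; the function names, the linkage options, and the printed trace are our own scaffolding, not part of the lecture material.

```python
import itertools

# Lower-triangular example distances from the slides, stored symmetrically:
# d[(i, j)] = d[(j, i)] = distance between objects i and j.
pairs = {(2, 1): 7.4, (3, 1): 0.7, (3, 2): 7.1, (4, 1): 7.3, (4, 2): 0.3,
         (4, 3): 7.0, (5, 1): 0.5, (5, 2): 6.9, (5, 3): 0.6, (5, 4): 6.8}
d = {}
for (i, j), v in pairs.items():
    d[(i, j)] = d[(j, i)] = v


def linkage(a, b, dist, mode="single"):
    """Distance between clusters a and b (tuples of object ids) under the
    chosen linkage: single = min, complete = max, average = mean."""
    values = [dist[(x, y)] for x in a for y in b]
    if mode == "single":
        return min(values)
    if mode == "complete":
        return max(values)
    return sum(values) / len(values)  # average linkage


def agglomerative(objects, dist, mode="single"):
    """Repeatedly merge the two closest clusters until one cluster remains."""
    clusters = [(o,) for o in objects]
    while len(clusters) > 1:
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda ab: linkage(ab[0], ab[1], dist, mode))
        print(f"merge {a} and {b} at distance {linkage(a, b, dist, mode):.2f}")
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return clusters[0]


agglomerative([1, 2, 3, 4, 5], d, mode="single")
# Expected merge order on this matrix: (2,4) at 0.3, then (1,5) at 0.5,
# then 3 joins (1,5) at 0.6, and finally everything merges at 6.8.
```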