Frequent Pattern Mining
Toon Calders, Bart Goethals
ADReM research group

Outline
• What is data mining?
  - Definition
  - Local patterns vs global models
  - Supervised vs unsupervised
  - What do we do?
• Frequent set mining
• More complex data types

What is data mining?
“the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets.”
[Figure: Data, Information, $ $ $]

Supervised vs Unsupervised
• Supervised:
  - data has been annotated
  - well-defined task: learn to annotate new data
  E.g.: examples of good/bad customers
• Unsupervised:
  - only data has been given
  - no annotation
  - « find knowledge »
[Figure: data points marked x, y, n]

Local vs Global
• Local pattern:
  - tells something about a small subset of the data
  E.g.: « 90% of the customers that purchase beer also buy chips »
• Global model:
  - fits a global model to the data, a summary
  E.g.: there is a linear relationship between $ spent and the income of the customers

What do we do?
• Pattern mining
  - Local
  - Unsupervised
• Useful for
  - large datasets
  - exploration: « what is this data like? »
• Less suitable for
  - well-studied and understood problem domains

Outline
• What is data mining?
• Frequent set mining
  - Market basket analysis
  - Association rules
  - Interestingness measures
  - Numerical attributes
• More complex data types

Market Basket Analysis
• Data: collection of transactions of customers
• Goal: find sets of products frequently occurring together

Applications
• Supermarket
  - product placement
  - special promotions
• Websearch
  - which keywords often occur together in webpages?
• Health care
  - frequent sets of symptoms for a disease

Applications
• Basically works for all data that can be represented as a set of examples/objects having certain properties
  - patient / symptoms
  - movies / ratings
  - web pages / keywords
  - basket / products
  - …

Algorithms
• Computationally a very hard problem
  - with n products, 2^n sets of products
• Hundreds of algorithms have been proposed
  - for sparse/dense data
  - many rows/columns
  - data fits/does not fit in memory
  - …

Association Rules
• Conditional probabilities
  X → Y (c%): if X is in the transaction, then there is a probability of c% that Y is in it as well.
• Based on the frequent sets, associations can be computed easily:
  { Beer, Chips } → { Snack nuts } (75%)
  { adrem.html, cnts.html } → { islab.html } (80%)
  { rain } → { overcast } (100%)

Interestingness Measures
• Not all association rules are interesting
  - Domain knowledge: pregnant → female, rain → overcast
  - Redundancy: if A → B (100%), then AC → B, AD → B, …
  - Independence: 70% buy product A: X → A (70%), Y → A (70%)
• Too many rules

Interestingness Measures
• Incorporating background knowledge
  - e.g., via Bayesian network
  - only produce rules that deviate from background knowledge
• Redundancies
  - Condensed representations: produce only a non-redundant subset of patterns

Interestingness Measures
• Independence
  - statistical significance tests
    • χ²
    • Careful with conclusions! 1000 tests with significance level 0.05 … (Bonferroni correction)
• Too many rules
  - Constraints
  - Top-k mining

Numerical Attributes
• Association rule mining is also possible for numerical attributes
  - discretization: make continuous attributes ordinal
    • information loss
    • not appropriate if the order between the values is important
  - other methods:
    • a recent method based on rank correlation measures

Complex Patterns
• Sets
• Sequences
• Graphs
• Relational Structures
• Generation and counting of such patterns becomes much more complex too!
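Before moving on to complex patterns, the frequent-set and association-rule ideas above can be made concrete on plain market-basket data. The following is a minimal sketch, not one of the algorithms referred to in the slides: it does a simple level-wise search that only extends sets that are already frequent, and computes the confidence of a rule X → Y as support(X ∪ Y) / support(X). The toy transactions, the function names and the threshold `min_support` are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical toy transactions (illustrative only, not from the slides).
transactions = [
    {"beer", "chips", "nuts"},
    {"beer", "chips", "nuts"},
    {"beer", "chips", "nuts"},
    {"beer", "chips"},
    {"bread", "milk"},
]


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: size-(k+1) candidates are built from frequent size-k sets."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items]
    result = {}
    while level:
        k = len(level[0])
        counts = {s: support(s, transactions) for s in level}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        # A candidate is kept only if all of its size-k subsets are frequent.
        candidates = set()
        for a, b in combinations(frequent, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in frequent for sub in combinations(union, k)
            ):
                candidates.add(union)
        level = list(candidates)
    return result


def confidence(x, y, transactions):
    """Confidence of the rule X -> Y: support(X u Y) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)


frequent = frequent_itemsets(transactions, min_support=3)
rule_conf = confidence(frozenset({"beer", "chips"}), frozenset({"nuts"}), transactions)
```

On this toy data, three of the four transactions containing beer and chips also contain nuts, so the rule { beer, chips } → { nuts } has confidence 0.75. The pruning step avoids counting most of the 2^n possible sets, but the worst case remains exponential.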
Sequences
CGATGGGCCAGTCGATACGTCGATGCCGATGTCACGA

Patterns in Sequences
• Substrings
• Regular expressions: (bb|[^b]{2})
• Partial orders
• Directed Acyclic Graphs

Graphs

Patterns in Graphs

Rules
[Figure: patterns labelled with frequencies f: 5, f: 4, f: 7, f: 8, f: 4, f: 4 and values 0.5, 0.8, 0.57]

Relational Databases

Patterns in RDBs
• Queries
• Query 1:
    Select L.drinker, V.bar
    From Likes L, Visits V
    Where V.drinker = L.drinker
      And L.beer = 'Duvel'

Patterns in RDBs
• Query 2:
    Select L.drinker, V.bar
    From Likes L, Visits V, Serves S
    Where V.drinker = L.drinker
      And L.beer = 'Duvel'
      And S.bar = V.bar
      And S.beer = 'Duvel'

Patterns in RDBs
• Association Rule: Query 1 => Query 2
  If a person who likes Duvel visits a bar, then that bar serves Duvel
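The rule Query 1 => Query 2 can be checked on a small database. The sketch below uses Python's built-in `sqlite3` module with hypothetical rows for the Likes, Visits and Serves tables. It assumes, by analogy with association rules on sets, that the confidence of the rule is the fraction of distinct Query 1 answers that are also Query 2 answers. The slides do not spell out this definition, so treat it as an illustration only.

```python
import sqlite3

# In-memory database with hypothetical rows (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Likes(drinker TEXT, beer TEXT);
CREATE TABLE Visits(drinker TEXT, bar TEXT);
CREATE TABLE Serves(bar TEXT, beer TEXT);
INSERT INTO Likes VALUES ('ann', 'Duvel'), ('bob', 'Duvel'), ('cid', 'Pils');
INSERT INTO Visits VALUES ('ann', 'bar1'), ('ann', 'bar2'),
                          ('bob', 'bar1'), ('bob', 'bar3'), ('cid', 'bar3');
INSERT INTO Serves VALUES ('bar1', 'Duvel'), ('bar2', 'Duvel'), ('bar3', 'Pils');
""")

# Query 1: (drinker, bar) pairs where the drinker likes Duvel and visits the bar.
QUERY_1 = """
SELECT DISTINCT L.drinker, V.bar
FROM Likes L, Visits V
WHERE V.drinker = L.drinker AND L.beer = 'Duvel'
"""

# Query 2: the same pairs, restricted to bars that serve Duvel.
QUERY_2 = """
SELECT DISTINCT L.drinker, V.bar
FROM Likes L, Visits V, Serves S
WHERE V.drinker = L.drinker AND L.beer = 'Duvel'
  AND S.bar = V.bar AND S.beer = 'Duvel'
"""

answers_1 = set(conn.execute(QUERY_1).fetchall())
answers_2 = set(conn.execute(QUERY_2).fetchall())

# Assumed definition: share of Query 1 answers that also satisfy Query 2.
rule_confidence = len(answers_2) / len(answers_1)
```

Because Query 2 only adds conditions to Query 1, its answers are a subset of the Query 1 answers, so the ratio lies between 0 and 1. With these rows it is 3 / 4 = 0.75: bob visits bar3, which does not serve Duvel.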