Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Relational Data Mining Donato Malerba Dipartimento di Informatica Università degli studi di Bari [email protected] http://www.di.uniba.it/~malerba/ Overview • Single-table assumption • (Multi-)relational data mining and ILP • FO representations • Upgrading propositional DM systems to FOL • A case study: Mining Association rules • Conclusions MRDM – Prof. D. Malerba 2 Standard Data Mining Approach • Most existing data mining approaches look for patterns in a single table of data (or DB relation) ID Name First Name 3478 Smith John 3479 Doe … … Street City Sex Social Status Income Age Resp onse 38 Lake St Seattle M single 160k 32 Y Jane 45 Sea St Venice F married 180k 45 N … … … … … … … … •Each row represents an object and columns represent properties of objects. Single table assumption MRDM – Prof. D. Malerba 3 Standard Data Mining Approach • In the customer table we can add as many attributes about our customers as we like. A person’s number of children • For other kinds of information the single-table assumption turns out to be a significant limitation Add information about orders placed by a customer, in particular Delivery and payment modes With which kind of store the order was placed (size, ownership, location) For simplicity, no information on the goods ordered ID Name First Name … Resp onse Delivery mode Payment mode Store size Store type Locat ion 3478 Smith John … Y regular cash small franchis city 3479 Doe Jane … N express credit large indep rural … … … … … … … … … … MRDM – Prof. D. Malerba 4 Standard Data Mining Approach • • • 1. This solution works fine for once-only customers What if our business has repeat customers? Under the single-table assumption we can Make one entry for each order in our customer table ID Name First Name … Resp onse Delivery mode Payment mode Store size Store type Locat ion 3478 Smith John … Y regular cash small franchis city 3478 Smith John … Y express check small franchis city … … … … … … … … … … • We have usual problems of non-normalized tables • Redundancy, anomalies, … MRDM – Prof. D. Malerba 5 Standard Data Mining Approach • one line per order analysis results will really be about orders, not customers, which is not what we might want! 2. Aggregate order data into a single tuple per customer. • • ID Name First Name … Response No. of orders No. of stores 3478 Smith John … Y 3 2 3479 Doe Jane … N 2 2 … … … … … … … No redundancy. Standard DM methods work fine, but There is a lot less information in the new table • What if the payment mode and the store type are important? MRDM – Prof. D. Malerba 6 Relational Data • A database designer would represent the information in our problem as a set of tables (or relations) ID Name First Name Street City Sex Social Status Income Age Resp onse 3478 Smith John 38 Lake St Seattle M single 160k 32 Y 3479 Doe Jane 45 Sea St Venice F married 180k 45 N … … … … … … … … … … Cust ID Order ID Store ID Delivery mode Payment mode Store ID size Type Location 3478 213444 12 regular cash 12 small franchis city 3478 372347 19 regular cash 3478 334555 12 express check 19 large indep rural … … … … … … … MRDM – Prof. D. Malerba … 7 Relational Data Mining • (Multi-)Relational data mining algorithms can analyze data distributed in multiple relations, as they are available in relational database systems. • These algorithms come from the field of inductive logic programming (ILP) • ILP has been concerned with finding patterns expressed as logic programs • Initially, ILP focussed on automated program synthesis from examples • In recent years, the scope of ILP has broadened to cover the whole spectrum of data mining tasks (association rules, regression, clustering, …) MRDM – Prof. D. Malerba 8 ILP successes in scientific fields • In the field of chemistry/biology Toxicology Prediction of Dipertene classes from nuclear magnetic resonance (NMR) spectra • Analysis of traffic accident data • Analysis of survey data in medicine • Prediction of ecological biodegradation rates The first commercial data mining systems with ILP technology are becoming available. MRDM – Prof. D. Malerba 9 Relational patterns • Relational patterns involve multiple relations from a relational database. • They are typically stated in a more expressive language than patterns defined on a single data table. Relational classification rules Relational regression trees Relational association rules IF Customer(C1,N1,FN1,Str1,City1,Zip1,Sex1,SoSt1, In1,Age1,Resp1) AND order(C1,O1,S1,Deliv1, Pay1) AND Pay1 = credit_card AND In1 108000 THEN Resp1 = Yes MRDM – Prof. D. Malerba 10 Relational patterns IF Customer(C1,N1,FN1,Str1,City1,Zip1,Sex1,SoSt1, In1,Age1,Resp1) AND order(C1,O1,S1,Deliv1, Pay1) AND Pay1 = credit_card AND In1 108000 THEN Resp1 = Yes good_customer(C1) customer(C1, N1,FN1,Str1,City1,Zip1,Sex1,SoSt1, In1,Age1,Resp1) order(C1,O1,S1,Deliv1, credit_card) In1 108000 This relational pattern is expressed in a subset of first-order logic! A relation in a relational database corresponds to a predicate in predicate logic (see deductive databases) MRDM – Prof. D. Malerba 11 Relational decision tree Equivalent Prolog program: class(sendback) :- worn(X), not_replaceable(X), !. class(fix) :- worn(X), !. class(keep). MRDM – Prof. D. Malerba 12 Relational regression rule Background knowledge Induced model MRDM – Prof. D. Malerba 13 Relational association rule Relational database LIKES KID OBJECT Joni ice-cream Joni dolphin Elliot piglet Elliot gnu Elliot lion KID Joni Joni Elliot HAS OBJECT ice-cream piglet ice-cream KID Joni Joni Joni Elliot Elliot PREFERS OBJECT TO ice-cream pudding pudding raisins giraffe gnu lion ice-cream piglet dolphin likes(KID, piglet), likes(KID, ice-cream) likes (KID, dolphin) (9%, 85%) likes(KID, A), has(KID, B) prefers (KID, A, B) (70%, 98%) MRDM – Prof. D. Malerba 14 First-order representations • • • • An example is a set of ground facts, that is a set of tuples in a relational database From the logical point of view this is called a (Herbrand) interpretation because the facts represent all atoms which are true for the example, thus all facts not in the example are assumed to be false. From the computational point of view each example is a small relational database or a Prolog knowledge base A Prolog interpreter can be used for querying an example. MRDM – Prof. D. Malerba 15 FO representation (ground clauses) • Example: eastbound(t1):car(t1,c1),rectangle(c1),short(c1),none(c1),two_wheels(c1 ), load(c1,l1),circle(l1),one_load(l1), car(t1,c2),rectangle(c2),long(c2),none(c2),three_wheels(c 2), load(c2,l2),hexagon(l2),one_load(l2), car(t1,c3),rectangle(c3),short(c3),peaked(c3),two_wheels( c3), load(c3,l3),triangle(l3),one_load(l3), car(t1,c4),rectangle(c4),long(c4),none(c4),two_wheels(c4) , load(c4,l4),rectangle(l4),three_load(l4). • Background theory: polygon(X) :- rectangle(X) polygon(X) :- triangle(X) MRDM – Prof. D. Malerba • Hypothesis: eastbound(T):-car(T,C),short(C),not none(C). 16 Background knowledge • As background knowledge is visible for each example, all the facts that can be derived from the background knowledge and an example are part of the extended example. • Formally, an extended example is the minimal Herbrand model of the example and the background theory. • When querying an example, it suffices to assert the background knowledge and the example; the Prolog interpreter will do the necessary derivations. MRDM – Prof. D. Malerba 17 Learning from interpretations • The ground-clause representation is peculiar of an ILP setting denoted as learning from interpretations. • Similar to older work on structural matching. • It is common to several relational data mining systems, such as CLAUDIEN: searches for a set of clausal regularities that hold on the set of examples TILDE: top-down induction of logical decision trees ICL: Inductive classification logic (upgrade of CN2) • It contrasts with the classical ILP setting employed by the systems PROGOL and FOIL. MRDM – Prof. D. Malerba 18 FO representation (flattened) • Example: eastbound(t1). • Background theory: car(t1,c1). car(t1,c4). rectangle(c1). rectangle(c4). short(c1). long(c4). none(c1). none(c4). two_wheels(c1). two_wheels(c4). load(c1,l1). load(c4,l4). circle(l1). rectangle(l4). MRDM – Prof. D. Malerba one_load(l1). car(t1,c2). car(t1,c3). rectangle(c2). rectangle(c3). long(c2). short(c3). none(c2). peaked(c3). three_wheels(c2). two_wheels(c3). load(c2,l2). load(c3,l3). hexagon(l2). triangle(l3). one_load(l2). one_load(l3). 19 FO representation (terms) • Example: eastbound([c(rectangle,short,none,2,l(circle,1)), c(rectangle,long,none,3,l(hexagon,1)), c(rectangle,short,peaked,2,l(triangle,1)), c(rectangle,long,none,2,l(rectangle,3))]). • Background theory: empty • Hypothesis: eastbound(T):-member(C,T),arg(2,C,short), not arg(3,C,none). MRDM – Prof. D. Malerba 20 FO representation (strongly typed) • Type signature: data Shape Short; data Roof | …; = Rectangle | Hexagon | …; = None | Peaked | …; data Length = Long | data Object = Circle | Hexagon type Wheels = Int; type Load = (Object,Number); Int type Car = (Shape,Length,Roof,Wheels,Load); Train = [Car]; type Number = type eastbound::Train->Bool; • Example: eastbound([(Rectangle,Short,None,2,(Circle,1)), (Rectangle,Long,None,3,(Hexagon,1)), (Rectangle,Short,Peaked,2,(Triangle,1)), (Rectangle,Long,None,2,(Rectangle,3))]) = True • Hypothesis: eastbound(t) = (exists \c -> MRDM – Prof. D. Malerba member(c,t) && 21 FO representation (database) TRAIN_TABLE LOAD_TABLE LOAD CAR OBJECT NUMBER TRAIN EASTBOUND l1 c1 circle 1 t1 TRUE l2 c2 hexagon 1 t2 TRUE l3 c3 t riangle 1 … … l4 c4 rect angle 3 t6 FALSE … … … … … CAR_TABLE CAR TRAIN SHAPE LENGTH ROOF WHEELS c1 t1 rect angle short none 2 c2 t1 rect angle long none 3 c3 t1 rect angle short peaked 2 c4 t1 rect angle long none 2 … … … … SELECT DISTINCT TRAIN_TABLE.TRAIN FROM TRAIN_TABLE, CAR_TABLE WHERE TRAIN_TABLE.TRAIN = CAR_TABLE.TRAIN AND CAR_TABLE.LENGTH = ‘short’ AND CAR_TABLE.ROOF != 'none' MRDM – Prof. D. Malerba 22 Individual-centered representation • The database contains information on a number of trains. • Each train is an individual. • The database can be partitioned according to individual to obtain a ground-clause representation • Problem: sometime individuals share common parts. • Example: we want to discriminate black and white figures on the basis of their position. Each geom. figure is an individual MRDM – Prof. D. Malerba 23 Object-centered representation The whole sequence is an object, which can be represented by a multiple-head ground clause: black(x11) black(x12) white(x13) black(x14) :first(x11), crl(x11), next(x12,x11), crl(x12), sqr(x13), crl(x14), next(x14,x13), next(x13,x12) This is the representation adopted in ATRE. MRDM – Prof. D. Malerba 24 How to upgrade propositional DM algorithms to first-order 1. 2. 3. 4. 5. 6. 7. 8. 9. Identify the propositional DM system that best matches the DM task Use interpretations to represent examples Upgrade the representation of propositional hypotheses attributevalue tests by first-order literals and modify the coverage test accordingly. Structure the search-space by a more-general-than relation that works on first-order representations • -subsumption Adapt the search operators for searching the corresponding rule space Use a declarative bias mechanism to limit the search space Implement Evaluate your (first-order) implementation on propositional and relational data Add interesting extra features MRDM – Prof. D. Malerba 25 Mining association rules: a case study A set I of literals called items. A set D of transactions t’s such that t I. X Y (s%, c%) Association rule "IF a pattern X appears in a transaction t, THEN the pattern Y tends to hold in the same transaction t" • X I, Y I, XY= • s% = p(XY) support • c% = p(Y|X) = p(XY) / p(X) confidence Agrawal, Imielinsky & Swami. Mining association rules between sets of items in large databases. Proc. SIGMOD 1993 MRDM – Prof. D. Malerba 26 What is an association rule? Example: market basket analysis. Each transaction is the list of items bought by a customer on a single visit to a store. It is represented as a row in a table 1 2 3 Bread yes yes … Butter yes no … Cheese yes yes … Beer no Yes … IF a customer buys bread and butter THEN he also buys cheese (20%, 66%) = Given that 20% of customers buy bread, cheese and butter, 66% of customers who buy bread and butter also buy cheese MRDM – Prof. D. Malerba 27 Mining association rules The propositional approach Problem statement Given: • a set of transactions D • a couple of thresholds, minsup and minconf Find all association rules that have support and confidence greater than minsup and minconf respectively. MRDM – Prof. D. Malerba 28 Mining association rules The propositional approach Problem decomposition • Find large (or frequent) itemsets • Generate highly-confident association rules Representation issues • The transaction set D may be a data file, a relational table or the result of a relational expression • Each transaction is a binary vector MRDM – Prof. D. Malerba 29 Mining association rules The propositional approach Solution to the first sub-problem The APRIORI algorithm (Agrawal & Srikant, 1999) Find large 1-itemsets Cycle on the size (k>1) of the itemsets APRIORI-gen Generate candidate k-itemsets from large (k-1)-itemsets Generate large k-itemsets from candidate k-itemsets (cycle on the transactions in D) until no more large itemsets are found. MRDM – Prof. D. Malerba 30 Mining association rules The propositional approach Solution to the second sub-problem • For every large itemset Z, find all non-empty subsets X’s of Z • For every subset X, output a rule of the form X (Z-X) if support(Z)/support(X) minconf. Relevant work Agrawal & Srikant (1999). Fast Algorithms for Mining Association Rules, in Readings in Database Systems, Morgan Kaufmann Publishers. Han & Fu (1995). Discovery of Multiple-Level Association Rules from Large Databases, in Proc. 21st VLDB Conference MRDM – Prof. D. Malerba 31 Mining association rules The ILP approach Problem statement Given: • a deductive relational database D • a couple of thresholds, minsup and minconf Find all association rules that have support and confidence greater than minsup and minconf respectively. MRDM – Prof. D. Malerba 32 Mining association rules The ILP approach Problem decomposition • Find large (or frequent) atomsets • Generate highly-confident association rules Representation issues A deductive relational database is a relational database which may be represented in first-order logic as follows: • Relation Set of ground facts (EDB) • View Set of rules (IDB) MRDM – Prof. D. Malerba 33 Mining association rules The ILP approach Example Relational database LIKES KID OBJECT Joni ice-cream Joni dolphin Elliot piglet Elliot gnu Elliot lion KID Joni Joni Elliot HAS OBJECT ice-cream piglet ice-cream KID Joni Joni Joni Elliot Elliot PREFERS OBJECT TO ice-cream pudding pudding raisins giraffe gnu lion ice-cream piglet dolphin likes(joni, ice-cream) atom likes(KID, piglet), likes(KID, ice-cream) atomset likes (KID, dolphin) (9%, 85%) likes(KID, A), has(KID, B) prefers (KID, A, B) (70%, 98%) MRDM – Prof. D. Malerba 34 Mining association rules The ILP approach Solution to the first sub-problem The WARMR algorithm (Dehaspe & De Raedt, 1997) L. Dehaspe & L. De Raedt (1997). Mining Association Rules in Multiple Relations, Proc. Conf. Inductive Logic Programming Compute large 1-atomsets Cycle on the size (k>1) of the atomsets WARMR-gen Generate candidate k-atomsets from large (k-1)-atomsets Generate large k-atomsets from candidate k-atomsets (cycle on the observations loaded from D) until no more large atomsets are found. MRDM – Prof. D. Malerba 35 Mining association rules The ILP approach WARMR APRIORI • Breadth-first search on the atomset lattice • Loading of an observation o from D (query result) • Largeness of candidate atomsets computed by a coverage test • Breadth-first search on the itemset lattice • Loading of a transaction t from D (tuple) MRDM – Prof. D. Malerba • Largeness of candidate itemsets computed by a subset check 36 Mining association rules The ILP approach Pattern Space false Q1 is_a(X, large_town) intersects(X, R) is_a(R, road) Q2 is_a(X, large_town) intersects(X,Y) Q3 is_a(X, large_town) true false Q1 Q2 Q3 true MRDM – Prof. D. Malerba 37 Mining association rules The ILP approach Candidate generation is_a(X, large_town), intersects(X,R), is_a(R, road) Operator under -subsumption is_a(X,large_town), intersects(X,R), is_a(R,road), adjacent_to(X,W), is_a(W, water) Refinement step yes Does it -subsume infrequent patterns? no Pruning step MRDM – Prof. D. Malerba 38 Mining association rules The ILP approach Candidate evaluation is_a(X, large_town), intersects(X,R), is_a(R, road), adjacent_to(X,W), is_a(W, water) ?- is_a(X, large_town), intersects(X,R), is_a(R, road), adjacent_to(X,W), is_a(W, water) Large? D <X=barletta,R=a14,W=adriatico> <X=bari,R=ss16bis,W=adriatico> ... yes MRDM – Prof. D. Malerba 39 Mining association rules The ILP approach is_a(X, large_town), intersects(X,R), is_a(R, road), adjacent_to(X,W), is_a(W, water) Rule generation is_a(X, large_town), intersects(X,R), is_a(R, road), is_a(W, water) adjacent_to(X,W) (62%, 86%) yes MRDM – Prof. D. Malerba High confidence? no 40 Conclusions and future work • Multi-relational data mining: more data mining than logic program synthesis choice of representation formalisms input format more important than output format data modelling — e.g. object-oriented data mining new learning tasks and evaluation measures Reference Saso Dzeroski and Nada Lavrac, editors, Relational Data Mining, Springer-Verlag, September 2001 MRDM – Prof. D. Malerba 41