A Multi-Relational Decision Tree Learning
Algorithm – Implementation and Experiments
Anna Atramentov
Major: Computer Science
Program of Study Committee:
Vasant Honavar, Major Professor
Drena Leigh Dobbs
Yan-Bin Jia
Iowa State University,
Ames, Iowa
2003
KDD and Relational Data Mining
The term KDD stands for Knowledge Discovery in Databases.
Traditional techniques in KDD work with instances represented by a single
table:
Day | Outlook  | Temp-re | Humidity | Wind   | Play Tennis
d1  | Sunny    | Hot     | High     | Weak   | No
d2  | Sunny    | Hot     | High     | Strong | No
d3  | Overcast | Hot     | High     | Weak   | Yes
d4  | Overcast | Cold    | Normal   | Weak   | No
Relational Data Mining is a subfield of KDD where the instances are
represented by several tables:

Staff (ID, Name, Department, Position, Salary)
p1 | Dale   | d1 | Professor         | 70-80k
p2 | Martin | d3 | Postdoc           | 30-40k
p3 | Victor | d2 | Visitor Scientist | 40-50k
p4 | David  | d3 | Professor         | 80-100k

Department (ID, Specialization, #Students)
d1 | Math             | 1000
d2 | Physics          | 300
d3 | Computer Science | 400

Graduate Student (ID, Name, GPA, #Publications, Advisor, Department)
s1 | John   | 2.0 | 4  | p1 | d3
s2 | Lisa   | 3.5 | 10 | p4 | d3
s3 | Michel | 3.9 | 3  | p4 | d4
Motivation
Importance of relational learning:
Growth of data stored in multi-relational databases (MRDBs)
Techniques for learning from unstructured data often extract the data into
an MRDB
Promising approach to relational learning:
MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al.
(1999)
MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by
Leiva (2002)
Goals
Speed up the MRDM framework, and in particular the MRDTL algorithm
Incorporate handling of missing values
Perform more extensive experimental evaluation of the algorithm
Relational Learning Literature
Inductive Logic Programming (Dzeroski and Lavrac, 2001; Dzeroski et al.,
2001; Blockeel, 1998; De Raedt, 1997)
First-order extensions of probabilistic models:
Relational Bayesian Networks (Jaeger, 1997)
Probabilistic Relational Models (Getoor, 2001; Koller, 1999)
Bayesian Logic Programs (Kersting et al., 2000)
Combining first-order logic and probability theory
Multi-Relational Data Mining (Knobbe et al., 1999)
Approaches for mining data in the form of graphs (Holder and Cook, 2000;
Gonzalez et al., 2000)
Propositionalization methods (Krogel and Wrobel, 2001)
PRM extensions for cumulative learning, for learning and reasoning as agents
interact with the world (Pfeffer, 2000)
Problem Formulation
Given: data stored in a relational database
Goal: build a decision tree for predicting a target attribute in the target
table
Example of multi-relational database (schema and instances):

Department (ID, Specialization, #Students)
d1 | Math             | 1000
d2 | Physics          | 300
d3 | Computer Science | 400

Graduate Student (ID, Name, GPA, #Publications, Advisor, Department)
s1 | John   | 2.0 | 4  | p1 | d3
s2 | Lisa   | 3.5 | 10 | p4 | d3
s3 | Michel | 3.9 | 3  | p4 | d4

Staff (ID, Name, Department, Position, Salary)
p1 | Dale   | d1 | Professor         | 70-80k
p2 | Martin | d3 | Postdoc           | 30-40k
p3 | Victor | d2 | Visitor Scientist | 40-50k
p4 | David  | d3 | Professor         | 80-100k
Propositional decision tree algorithm. Construction
phase

[Figure: the PlayTennis table is recursively partitioned. Splitting on
Outlook separates {d1, d2} (Sunny, both No) from {d3, d4} (Overcast); the
Overcast partition is then split again, separating d3 (Yes) from d4 (No).]
Tree_induction(D: data)
  if stopping_criterion(D)
    return leaf(D)
  else
    A := optimal_attribute(D)
    Dleft := split(D, A)
    Dright := splitcomplement(D, A)
    childleft := Tree_induction(Dleft)
    childright := Tree_induction(Dright)
    return node(A, childleft, childright)
[Figure: the resulting tree. The root {d1, d2, d3, d4} tests Outlook; the
{d1, d2} branch is a "No" leaf; the {d3, d4} branch tests Temperature,
giving the leaves {d3} = Yes and {d4} = No.]
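The induction procedure above can be made concrete on the PlayTennis table. This is a minimal sketch, not the thesis implementation: optimal_attribute is realized as an exhaustive search over binary (attribute = value) tests scored by information gain, and the stopping criterion is simply label purity.

```python
import math
from collections import Counter

# PlayTennis examples from the table above: (day, attributes, label)
DATA = [
    ("d1", {"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Weak"}, "No"),
    ("d2", {"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Strong"}, "No"),
    ("d3", {"Outlook": "Overcast", "Temp": "Hot", "Humidity": "High", "Wind": "Weak"}, "Yes"),
    ("d4", {"Outlook": "Overcast", "Temp": "Cold", "Humidity": "Normal", "Wind": "Weak"}, "No"),
]

def entropy(rows):
    counts = Counter(label for _, _, label in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_split(rows):
    """Pick the binary test (attribute = value) with the highest information gain."""
    base = entropy(rows)
    best = None
    for attr in rows[0][1]:
        for value in {r[1][attr] for r in rows}:
            left = [r for r in rows if r[1][attr] == value]
            right = [r for r in rows if r[1][attr] != value]
            if not left or not right:
                continue  # test does not actually split the data
            gain = base - (len(left) / len(rows)) * entropy(left) \
                        - (len(right) / len(rows)) * entropy(right)
            if best is None or gain > best[0]:
                best = (gain, attr, value)
    return best

def tree_induction(rows):
    labels = {label for _, _, label in rows}
    if len(labels) == 1:           # stopping criterion: pure node
        return labels.pop()
    split = best_split(rows)
    if split is None:              # no useful split left: majority leaf
        return Counter(l for _, _, l in rows).most_common(1)[0][0]
    _, attr, value = split
    left = [r for r in rows if r[1][attr] == value]
    right = [r for r in rows if r[1][attr] != value]
    return (f"{attr} = {value}", tree_induction(left), tree_induction(right))

tree = tree_induction(DATA)
```

On this data the root test is on Outlook, and the Overcast branch splits further on Temperature, matching the figure.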
MR setting. Splitting data with Selection Graphs

[Figure: the Department, Graduate Student, and Staff tables from the
example database. A selection graph over the target table Staff with an
open edge to Grad.Student splits the Staff table into the staff members
who advise at least one graduate student ({p1 Dale, p4 David}) and, via
the complement selection graph, those who advise none ({p2 Martin,
p3 Victor}); adding the condition GPA > 2.0 to the Grad.Student node
refines the split further.]
What is a selection graph?
It corresponds to a subset of the
instances from the target table
Nodes correspond to tables
from the database
Edges correspond to the
associations between tables
Open edge = "has at least one"
Closed edge = "has none of"

[Figure: example selection graphs over Staff, Grad.Student, and
Department, with conditions such as GPA > 3.9 and Specialization = math]
Transforming selection graphs into SQL queries

Staff [Position = Professor]:
select distinct T0.ID
from Staff T0
where T0.Position = 'Professor'

Staff with an open edge to Grad.Student:
select distinct T0.ID
from Staff T0, Graduate_Student T1
where T0.ID = T1.Advisor

Staff with a closed edge to Grad.Student:
select distinct T0.ID
from Staff T0
where T0.ID not in
  (select T1.Advisor
   from Graduate_Student T1)

Staff with an open edge to Grad.Student and a closed edge to
Grad.Student [GPA > 3.9]:
select distinct T0.ID
from Staff T0, Graduate_Student T1
where T0.ID = T1.Advisor
  and T0.ID not in
  (select T1.Advisor
   from Graduate_Student T1
   where T1.GPA > 3.9)

Generic query:
select distinct T0.primary_key
from table_list
where join_list
and condition_list
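The general translation can be sketched as a small query generator. The graph encoding below (alias lists and closed-edge tuples) is my own simplification, not the thesis's data structure; it just assembles the generic query above, assuming each node i is aliased Ti and the target table's key column is named id.

```python
def selection_graph_to_sql(open_nodes, joins, conditions, closed_edges):
    """Emit the generic query:
    select distinct T0.id from table_list where join_list and condition_list.

    open_nodes:   table names; node i gets alias Ti (T0 is the target table)
    joins:        "Ti.col = Tj.col" strings for the open (present) edges
    conditions:   attribute conditions such as "T1.GPA > 3.9"
    closed_edges: (alias_col, table, col, cond) tuples, each emitted as a
                  NOT IN subquery over the closed-edge node's table
    """
    table_list = ", ".join(f"{t} T{i}" for i, t in enumerate(open_nodes))
    where = list(joins) + list(conditions)
    for alias_col, table, col, cond in closed_edges:
        sub = f"select S.{col} from {table} S" + (f" where {cond}" if cond else "")
        where.append(f"{alias_col} not in ({sub})")
    sql = f"select distinct T0.id from {table_list}"
    if where:
        sql += " where " + " and ".join(where)
    return sql

# The last example above: staff who advise a graduate student,
# but none with GPA > 3.9
q = selection_graph_to_sql(
    ["Staff", "Graduate_Student"],
    ["T0.id = T1.Advisor"],
    [],
    [("T0.id", "Graduate_Student", "Advisor", "S.GPA > 3.9")],
)
```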
MR decision tree
Each node contains a selection graph
Each child's selection graph is a supergraph
of the parent's selection graph

[Figure: a decision tree whose root holds the selection graph for Staff;
its children refine it, e.g. Staff with an edge to Grad.Student, with and
without the condition GPA > 3.9, and so on.]
How to choose selection graphs in nodes?
Problem: there are too many supergraph selection graphs to choose from
in each node
Solution:
start with the initial selection graph
use a greedy heuristic to choose among supergraph selection graphs:
refinements
use binary splits for simplicity
for each refinement, get the complement refinement
choose the best refinement based on the information gain criterion
Problem: some potentially good refinements may give no immediate benefit
Solution: look-ahead capability

[Figure: the tree of refined selection graphs over Staff and
Grad.Student, as above.]
Refinements of selection graph
Two kinds of refinement:
add condition to a node: explore attribute information in the tables
add present edge and open node: explore relational properties between
the tables

[Figure: a selection graph over Staff, Grad.Student, and Department with
the conditions GPA > 3.9 and Specialization = math.]
Refinements of selection graph (examples)

[Figures: a sequence of slides showing refinements of the running
selection graph over Staff, Grad.Student [GPA > 3.9], and Department
[Specialization = math], each paired with its complement refinement:
add condition Position = Professor to the Staff node (complement:
Position != Professor); add condition GPA > 2.0 to a Grad.Student node;
add condition #Students > 200 to the Department node; and several
variants of adding a present edge and open node (for one of these the
information gain is 0).]
Look ahead capability

[Figures: a look-ahead refinement adds a present edge and open node
(e.g. Grad.Student to Department) together with a condition on the new
node (e.g. #Students > 200) in a single step, again paired with its
complement refinement.]
MRDTL algorithm. Construction phase
For each non-leaf node:
consider all possible refinements of the node's selection graph and
their complements
choose the best one based on the information gain criterion
create children nodes

[Figure: the growing tree of selection graphs over Staff and
Grad.Student, refined with conditions such as GPA > 3.9.]
MRDTL algorithm. Classification phase
For each leaf:
apply the selection graph of the leaf to the test data
classify the resulting instances with the classification of the leaf

[Figure: leaves of the tree with their selection graphs and predicted
salary classes, e.g. Staff advising a Grad.Student [GPA > 3.9] in a
Department with Spec = math predicts 70-80k; Staff with Position =
Professor predicts 80-100k; a Spec = physics leaf; and so on.]
The most time consuming operations of MRDTL

Entropy associated with this selection graph:
E = - Σi (ni / N) log (ni / N)

[Figure: the selection graph over Staff, Grad.Student [GPA > 3.9], and
Department [Specialization = math], with the Staff instances labeled by
their salary classes (the c1 rows are counted as n1, the c2 rows as n2):

ID | Name   | Dep | Position  | Salary class
p1 | Dale   | d1  | Postdoc   | c1
p2 | Martin | d1  | Postdoc   | c1
p3 | David  | d4  | Postdoc   | c1
p4 | Peter  | d3  | Postdoc   | c1
p5 | Adrian | d2  | Professor | c2
p6 | Doina  | d3  | Professor | c2
...]

Query associated with the counts ni:
select distinct Staff.Salary,
count(distinct Staff.ID)
from Staff, Graduate_Student,
Department
where join_list and condition_list
group by Staff.Salary

Result of the query is the list of pairs (ci, ni).
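The entropy E can be computed directly from the (ci, ni) list the query returns. A minimal sketch, using log base 2 (the slide does not fix the base, so base 2 is an assumption), with counts matching the labeled Staff table above (four instances in class c1, two in c2):

```python
import math

def entropy_from_counts(counts):
    """E = -sum_i (n_i / N) log2(n_i / N), where counts maps class c_i -> n_i."""
    n_total = sum(counts.values())
    return -sum((n / n_total) * math.log2(n / n_total) for n in counts.values())

# counts as the query would return them: [(c1, 4), (c2, 2)]
e = entropy_from_counts({"c1": 4, "c2": 2})
```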
The most time consuming operations of MRDTL

Entropy associated with each of the refinements requires a query of the
same shape:
select distinct Staff.Salary,
count(distinct Staff.ID)
from table_list
where join_list and condition_list
group by Staff.Salary

[Figure: the refinements of the running selection graph, e.g. adding
GPA > 2.0 to a Grad.Student node, each needing such a query.]
A way to speed up: eliminate redundant calculations
Problem: for a selection graph with 162 nodes, the time to execute a
query is more than 3 minutes!
Redundancy in calculation: for this selection graph, the tables Staff
and Grad.Student are joined over and over for all the children
refinements of the tree.
A way to fix: compute the join only once and save it for all further
calculations.

[Figure: the running selection graph over Staff, Grad.Student
[GPA > 3.9], and Department [Specialization = math].]
Speed Up Method. Sufficient tables

For the selection graph over Staff, Grad.Student [GPA > 3.9], and
Department [Specialization = math], the sufficient table S stores the
join once:

Staff_ID | Grad.Student_ID | Dep_ID | Salary
p1 | s1 | d1 | c1
p2 | s1 | d1 | c1
p3 | s6 | d4 | c1
p4 | s3 | d3 | c1
p5 | s1 | d2 | c2
p6 | s9 | d3 | c2
...
Speed Up Method. Sufficient tables

Entropy associated with this selection graph:
E = - Σi (ni / N) log (ni / N)

Query associated with the counts ni, now run against the sufficient
table S alone (the table shown above):
select S.Salary,
count(distinct S.Staff_ID)
from S
group by S.Salary

Result of the query is the list of pairs (ci, ni).
Speed Up Method. Sufficient tables

Queries associated with the add condition refinement (X is the table of
the refined node, A the attribute being tested):
select S.Salary, X.A,
count(distinct S.Staff_ID)
from S, X
where S.X_ID = X.ID
group by S.Salary, X.A

Calculations for the complement refinement:
count(ci, Rcomp(S)) = count(ci, S) - count(ci, R(S))
Speed Up Method. Sufficient tables

Queries associated with the add edge refinement (X and Y are the tables
joined by the new edge e):
select S.Salary, count(distinct S.Staff_ID)
from S, X, Y
where S.X_ID = X.ID and e.cond
group by S.Salary

Calculations for the complement refinement:
count(ci, Rcomp(S)) = count(ci, S) - count(ci, R(S))
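The complement-refinement formula means no extra query is needed for the complement: its class counts follow by subtraction from the counts already computed for S and for R(S). A tiny numeric check with hypothetical counts:

```python
def complement_counts(total, refined):
    """count(ci, Rcomp(S)) = count(ci, S) - count(ci, R(S)), per class ci."""
    return {c: total[c] - refined.get(c, 0) for c in total}

# hypothetical class counts over the sufficient table S and a refinement R(S)
total = {"c1": 4, "c2": 2}
refined = {"c1": 1, "c2": 2}
comp = complement_counts(total, refined)  # counts for the complement refinement
```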
Speed Up Method
Significant speed-up in obtaining the counts needed for the
calculations of the entropy and information gain
The speed-up is achieved at the cost of the additional space used by
the algorithm
Handling Missing Values

Department (ID, Specialization, #Students)
d1 | Math             | 1000
d2 | Physics          | 300
d3 | Computer Science | 400

Graduate Student (ID, Name, GPA, #Public., Advisor, Department)
s1 | John   | 2.0 | 4  | p1 | d3
s2 | Lisa   | 3.5 | 10 | p1 | d3
s3 | Michel | 3.9 | 3  | p4 | d4

Staff (ID, Name, Department, Position, Salary)
p1 | Dale   | d1 | ?                 | 70-80k
p2 | Martin | d3 | ?                 | 30-40k
p3 | Victor | d2 | Visitor Scientist | 40-50k
p4 | David  | d3 | ?                 | 80-100k

For each attribute which has missing values we build a Naïve Bayes
model: for a value b of the target attribute (e.g. Staff.Position) we
estimate P(a | b) for the values a of the other attributes, such as
Staff.Name, Staff.Dep, Department.Spec, ...
Handling Missing Values

[Tables: the Staff row p1 (Dale, d1, ?, 70-80k) together with the
related Department and Graduate Student rows used as evidence.]

Then the most probable value for the missing attribute is calculated by
the formula (the last step uses the Naïve Bayes independence
assumption):
P(vi | X1.A1, X2.A2, X3.A3, ...) =
P(X1.A1, X2.A2, X3.A3, ... | vi) P(vi) / P(X1.A1, X2.A2, X3.A3, ...) =
P(X1.A1 | vi) P(X2.A2 | vi) P(X3.A3 | vi) ... P(vi) / P(X1.A1, X2.A2, X3.A3, ...)
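The argmax over candidate values vi can be sketched as follows. The rows and the add-one (Laplace) smoothing are illustrative assumptions (the slide specifies neither the data nor the estimator), and the evidence denominator is dropped since it does not affect the argmax.

```python
from collections import Counter

def most_probable_value(rows, target, evidence_attrs, query):
    """Naive Bayes imputation: argmax over v of P(v) * prod_a P(query[a] | v),
    with probabilities estimated from the rows where `target` is observed.
    The denominator P(X1.A1, X2.A2, ...) is constant in v, so it is dropped."""
    observed = [r for r in rows if r[target] is not None]
    prior = Counter(r[target] for r in observed)
    best_value, best_score = None, float("-inf")
    for v, nv in prior.items():
        score = nv / len(observed)  # P(v)
        for a in evidence_attrs:
            matches = sum(1 for r in observed if r[target] == v and r[a] == query[a])
            n_distinct = len({r[a] for r in observed})
            # P(query[a] | v) with add-one smoothing (an assumption)
            score *= (matches + 1) / (nv + n_distinct)
        if score > best_score:
            best_value, best_score = v, score
    return best_value

# hypothetical Staff rows; Position is missing (None) in the last one
staff = [
    {"Dep": "d3", "Position": "Professor"},
    {"Dep": "d3", "Position": "Professor"},
    {"Dep": "d2", "Position": "Postdoc"},
    {"Dep": "d3", "Position": None},
]
guess = most_probable_value(staff, "Position", ["Dep"], staff[3])
```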
Experimental results. Mutagenesis
The most widely used database in ILP.
Describes molecules of certain nitroaromatic compounds.
Goal: predict their mutagenic activity (label attribute), i.e. the ability to
cause DNA to mutate. High mutagenic activity can cause cancer.
Two subsets: regression friendly (188 molecules) and regression
unfriendly (42 molecules). We used only the regression friendly subset.
5 levels of background knowledge: B0, B1, B2, B3, B4. They provide
increasingly richer descriptions of the examples. We used the B2 level.
Experimental results. Mutagenesis
[Figure: schema of the mutagenesis database]
Results of 10-fold cross-validation for the regression friendly set:

Data Set    | Accuracy | Sel graph size (max) | Tree size | Time with speed up | Time without speed up
mutagenesis | 87.5%    | 3                    | 9         | 28.45              | 52.15

Best-known reported accuracy is 86%.
Experimental results. KDD Cup 2001
Consists of a variety of details about the various genes of one particular type
of organism.
Genes code for proteins, and these proteins tend to localize in various parts
of cells and interact with one another in order to perform crucial functions.
2 tasks: prediction of gene/protein LOCALIZATION and FUNCTION
862 training genes, 381 test genes.
Many attribute values are missing: 70% of the CLASS attribute, 50% of
COMPLEX, and 50% of MOTIF in the composition table
Experimental results. KDD Cup 2001

localization                    | Accuracy | Sel graph size (max) | Tree size | Time with speed up | Time without speed up
With handling missing values    | 76.11%   | 19                   | 213       | 202.9 secs         | 1256.38 secs
Without handling missing values | 50.14%   | 33                   | 575       | 550.76 secs        | 2257.20 secs

Best-known reported accuracy is 72.1%.

function                        | Accuracy | Sel graph size (max) | Tree size (max) | Time with speed up | Time without speed up
With handling missing values    | 91.44%   | 9                    | 63              | 151.19 secs        | 307.83 secs
Without handling missing values | 88.56%   | 9                    | 19              | 61.29 secs         | 118.41 secs

Best-known reported accuracy is 93.6%.
Experimental results. PKDD 2001 Discovery
Challenge
Consists of 5 tables: PATIENT_INFO, DIAGNOSIS, ANA_PATTERN,
THROMBOSIS, and ANTIBODY_EXAM
Target table consists of 1239 records
The task is to predict the degree of the thrombosis attribute from the
ANTIBODY_EXAM table
The results for 5:2 cross validation:

Data Set   | Accuracy | Sel Graph size (max) | Tree size | Time with speed up | Time without speed up
thrombosis | 98.1%    | 31                   | 71        | 127.75             | 198.22

Best-known reported accuracy is 99.28%.
Summary
The new implementation significantly outperforms the original MRDTL
implementation in terms of running time
The accuracy results are comparable with the best reported results
obtained using different data-mining algorithms
Future work
Incorporation of more sophisticated techniques for handling missing
values
Incorporation of more sophisticated pruning techniques or complexity
regularizations
More extensive evaluation of MRDTL on real-world data sets
Development of ontology-guided multi-relational decision tree learning
algorithms to generate classifiers at multiple levels of abstraction [Zhang
et al., 2002]
Development of variants of MRDTL that can learn from heterogeneous,
distributed, autonomous data sources, based on recently developed
techniques for distributed learning and ontology-based data integration
Thanks to
Dr. Honavar for providing guidance, help and support throughout this
research
Colleagues from the Artificial Intelligence Lab for various helpful discussions
My committee members: Drena Dobbs and Yan-Bin Jia for their help
Professors and lecturers of the Computer Science department for the
knowledge that they gave me through lectures and discussions
Iowa State University and Computer Science department for funding in
part this research