Download Experiments with MRDTL – A Multi

Speeding Up Multi-Relational Data Mining

Anna Atramentov and Vasant Honavar*
Artificial Intelligence Laboratory
Department of Computer Science
Iowa State University
Ames, IA 50011, USA
www.cs.iastate.edu/~honavar/aigroup.html

* Support provided in part by National Science Foundation, Carver Foundation, and Pioneer Hi-Bred, Inc.

Motivation

Importance of relational learning:
- Growth of data stored in multi-relational databases (MRDB)
- Techniques for learning from unstructured data often extract the data into MRDB

One of the promising approaches to relational learning:
- MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al. (1999)
- MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by Leiva et al. (2002)

Goal:
- Speed up the MRDM framework and in particular the MRDTL algorithm

Problem Formulation

Given: data stored in a relational database
Goal: learn a predictive model for the instances in the target table

Example of multi-relational database

[Figure: schema and instances of the Department, Grad.Student and Staff tables]

Department
ID   Specialization     #Students
d1   Math               1000
d2   Physics            300
d3   Computer Science   400

Grad.Student
ID   Name     GPA   #Publications   Advisor   Department
s1   John     2.0   4               p1        d3
s2   Lisa     3.5   10              p4        d3
s3   Michel   3.9   3               p4        d4

Staff
ID   Name     Department   Position            Salary
p1   Dale     d1           Professor           70-80k
p2   Martin   d3           Postdoc             30-40k
p3   Victor   d2           Visitor Scientist   40-50k
p4   David    d3           Professor           80-100k
MRDM overview: selection graphs

[Figure: selection graph with nodes Grad.Student, Department and Staff and the conditions Specialization = math and GPA > 3.9]

- Nodes correspond to the tables from the database
- Edges correspond to the associations between tables
- A selection graph corresponds to the subset of the instances from the target table having some property
- It is a way of specifying attributes in the relational setting

Staff
ID   Name     Department   Position            Salary
p1   Dale     d1           Professor           70-80k
p2   Martin   d3           Postdoc             30-40k
p3   Victor   d2           Visitor Scientist   40-50k
p4   David    d3           Professor           80-100k

MRDM overview: transforming selection graphs into SQL queries

Selection graph: Staff with an associated Grad.Student

    select distinct T0.id
    from Staff T0, Graduate_Student T1
    where T0.id = T1.Advisor

Selection graph: Staff with no associated Grad.Student

    select distinct T0.id
    from Staff T0
    where T0.id not in (
        select T1.Advisor
        from Graduate_Student T1)

Selection graph: Staff with an associated Grad.Student and no associated Grad.Student having GPA > 3.9

    select distinct T0.id
    from Staff T0, Graduate_Student T1
    where T0.id = T1.Advisor
    and T0.id not in (
        select T1.Advisor
        from Graduate_Student T1
        where T1.GPA > 3.9)

Generic query:

    select distinct T0.primary_key
    from table_list
    where join_list and condition_list
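The translation of selection graphs into SQL can be tried out on the example database with Python's standard sqlite3 module. The sketch below is an illustration under assumptions, not the authors' implementation: the names build_example_db, to_sql and run, the dictionary layout used for edges, and the restriction to a target table with directly attached child tables are all hypothetical. The subquery for an absent edge selects the child's foreign key (Advisor), so that it is compared with the staff ID.

```python
import sqlite3


def build_example_db():
    """Load the Staff and Graduate_Student rows of the example database."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table Staff (id text primary key, Name text, Department text,
                            Position text, Salary text);
        create table Graduate_Student (id text primary key, Name text, GPA real,
                                       Publications integer, Advisor text,
                                       Department text);
        insert into Staff values
            ('p1', 'Dale',   'd1', 'Professor',         '70-80k'),
            ('p2', 'Martin', 'd3', 'Postdoc',           '30-40k'),
            ('p3', 'Victor', 'd2', 'Visitor Scientist', '40-50k'),
            ('p4', 'David',  'd3', 'Professor',         '80-100k');
        insert into Graduate_Student values
            ('s1', 'John',   2.0, 4,  'p1', 'd3'),
            ('s2', 'Lisa',   3.5, 10, 'p4', 'd3'),
            ('s3', 'Michel', 3.9, 3,  'p4', 'd4');
    """)
    return con


def to_sql(target, edges):
    """Build the query for a target table with directly attached child nodes.

    Each edge is a dict (hypothetical layout) with keys:
      table      - child table name
      fk         - column in the child that references target.id
      present    - True for a normal edge, False for an absent edge
      conditions - optional list of conditions on the child, e.g. "GPA > 3.6"
    """
    tables = [f"{target} T0"]
    where = []
    for i, edge in enumerate(edges, start=1):
        alias = f"T{i}"
        conds = [f"{alias}.{c}" for c in edge.get("conditions", [])]
        if edge["present"]:
            # Present edge: join the child table and keep its conditions.
            tables.append(f"{edge['table']} {alias}")
            where.append(f"T0.id = {alias}.{edge['fk']}")
            where.extend(conds)
        else:
            # Absent edge: exclude targets that appear in the child subquery.
            # Assumes the foreign key column contains no NULL values.
            sub = f"select {alias}.{edge['fk']} from {edge['table']} {alias}"
            if conds:
                sub += " where " + " and ".join(conds)
            where.append(f"T0.id not in ({sub})")
    sql = "select distinct T0.id from " + ", ".join(tables)
    if where:
        sql += " where " + " and ".join(where)
    return sql


def run(con, sql):
    """Execute a query and return the selected IDs in sorted order."""
    return sorted(row[0] for row in con.execute(sql))
```

For example, `to_sql("Staff", [{"table": "Graduate_Student", "fk": "Advisor", "present": False}])` produces the query for staff members who advise no student. In the example data no student has a GPA above 3.9, so a lower threshold such as 3.6 is more useful for experimenting with absent edges that carry a condition.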
MRDM overview: refinements of selection graphs

[Figure: a selection graph and its two children, labeled "refinement" and "complement refinement". Node labels: Grad.Student, Department, Staff. Conditions shown: GPA > 2.0, Specialization = math, GPA > 3.9]

The most time consuming operations of MRDTL

[Figure: selection graph with nodes Grad.Student, Department and Staff and the conditions Specialization = math and GPA > 3.9]

Staff
ID   Name     Dep   Position
p1   Dale     d1    Postdoc
p2   Martin   d1    Postdoc
p3   David    d4    Postdoc
p4   Peter    d3    Postdoc
p5   John     d2    Professor
p6   Susan    d3    Professor
…    …        …     …

Query associated with the selection graph:

    select distinct Staff.Salary, count(distinct Staff.ID)
    from Staff, Grad.Student, Department
    where join_list and condition_list
    group by Staff.Salary

A way to speed up: eliminate redundant calculations

Problem: for a selection graph with 160 nodes, the time to execute a query is more than 3 minutes!

[Figure: selection graph with nodes Grad.Student, Department and Staff and the conditions GPA > 3.9 and Specialization = math]

Redundancy in calculation: the tables Staff and Grad.Student will be joined for all the children refinements.

A way to fix: make the join only once and save the necessary information for all further calculations.

Speed up method: sufficient tables

[Figure: selection graph with nodes Grad.Student, Department and Staff and the conditions GPA > 3.9 and Specialization = math]

Staff_ID   Grad.Student_ID   Dep_ID   …
p1         s1                d1
p2         s1                d1
p3         s6                d4
p4         s3                d3
p5         s1                d2
p6         s9                d3
…          …                 …        …
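The count query and the redundancy it causes can be seen in a small sketch. It assumes a reduced version of the example database (only the columns needed here), treats Staff.Salary as the class attribute as in the query above, and uses an illustrative threshold of GPA > 3.5 because no student in the toy data exceeds 3.9. The names build_db, class_counts, PARENT and HIGH_GPA are hypothetical. Every call to class_counts repeats the same join of Staff and Graduate_Student, which is the repeated work the slides point out.

```python
import sqlite3


def build_db():
    """Reduced example database: only the columns used by the count query."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table Staff (id text primary key, Salary text);
        create table Graduate_Student (id text primary key, GPA real,
                                       Advisor text);
        insert into Staff values ('p1', '70-80k'), ('p2', '30-40k'),
                                 ('p3', '40-50k'), ('p4', '80-100k');
        insert into Graduate_Student values ('s1', 2.0, 'p1'),
                                            ('s2', 3.5, 'p4'),
                                            ('s3', 3.9, 'p4');
    """)
    return con


# Parent selection graph: Staff with an associated Grad.Student.
# Every candidate query below repeats this join.
PARENT = "from Staff T0, Graduate_Student T1 where T0.id = T1.Advisor"

# Advisors of at least one student with GPA > 3.5 (illustrative threshold).
HIGH_GPA = "select T2.Advisor from Graduate_Student T2 where T2.GPA > 3.5"


def class_counts(con, extra=""):
    """Count distinct staff per Salary value for the parent graph plus `extra`."""
    sql = (f"select T0.Salary, count(distinct T0.id) {PARENT} {extra} "
           "group by T0.Salary")
    return dict(con.execute(sql).fetchall())


con = build_db()
parent = class_counts(con)
# Refinement: add the condition GPA > 3.5 to the Grad.Student node.
refinement = class_counts(con, "and T1.GPA > 3.5")
# Complement: staff covered by the parent but with no student above 3.5.
complement = class_counts(con, f"and T0.id not in ({HIGH_GPA})")
```

On this data the refinement and its complement split the staff covered by the parent graph between them, so their per-class counts add up to the parent's counts.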
Speed up method: sufficient tables

[Figure: selection graph with nodes Grad.Student, Department and Staff and the conditions Specialization = math and GPA > 3.9]

Sufficient table S:

Staff_ID   Grad.Student_ID   Dep_ID   …
p1         s1                d1
p2         s1                d1
p3         s6                d4
p4         s3                d3
p5         s1                d2
p6         s9                d3
…          …                 …        …

Query associated with the selection graph:

    select S.Salary, count(distinct S.Staff_ID)
    from S
    group by S.Salary

Experimental results

Data set                        Accuracy   Time with speed up   Time w/o speed up   Best-known accuracy
Mutagenesis                     87.5%      28.45 secs           52.15 secs          86%
KDD Cup 2001, localization      76.11%     202.9 secs           1256.38 secs        72%
KDD Cup 2001, function          91.44%     151.19 secs          307.83 secs         93.6%
PKDD 2001 Discovery Challenge   98.1%      127.75 secs          198.22 secs         99.28%

Summary

- A general approach for speeding up the MRDM framework
- The MRDTL algorithm is competitive for learning from relational databases in terms of both accuracy and time

Future work

- Techniques for handling missing values
- Pruning techniques or complexity regularizations
- Use of aggregates for the attribute values
- More extensive evaluation of MRDTL on real-world data sets
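The sufficient-table idea can be sketched with sqlite3 as follows. This is an illustration under assumptions rather than the authors' implementation: the database is a reduced version of the example (only the columns needed), the function names are hypothetical, and the choice to store Salary and GPA in S alongside the IDs is an assumption about what "necessary information" means for this particular query. The join of Staff and Graduate_Student is performed once when S is built, and later counts read only from S.

```python
import sqlite3


def build_db():
    """Reduced example database: only the columns used below."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table Staff (id text primary key, Salary text);
        create table Graduate_Student (id text primary key, GPA real,
                                       Advisor text);
        insert into Staff values ('p1', '70-80k'), ('p2', '30-40k'),
                                 ('p3', '40-50k'), ('p4', '80-100k');
        insert into Graduate_Student values ('s1', 2.0, 'p1'),
                                            ('s2', 3.5, 'p4'),
                                            ('s3', 3.9, 'p4');
    """)
    return con


def counts_direct(con, min_gpa):
    """Class counts computed by joining the base tables on every call."""
    sql = """select T0.Salary, count(distinct T0.id)
             from Staff T0, Graduate_Student T1
             where T0.id = T1.Advisor and T1.GPA > ?
             group by T0.Salary"""
    return dict(con.execute(sql, (min_gpa,)).fetchall())


def build_sufficient_table(con):
    """Join once and keep the IDs plus the attributes later queries need."""
    con.execute("""create table S as
                   select T0.id as Staff_ID, T1.id as Student_ID,
                          T0.Salary as Salary, T1.GPA as GPA
                   from Staff T0, Graduate_Student T1
                   where T0.id = T1.Advisor""")


def counts_from_s(con, min_gpa):
    """Class counts computed from S alone, with no further joins."""
    sql = """select S.Salary, count(distinct S.Staff_ID)
             from S
             where S.GPA > ?
             group by S.Salary"""
    return dict(con.execute(sql, (min_gpa,)).fetchall())
```

After `con = build_db()` and `build_sufficient_table(con)`, calling `counts_from_s(con, t)` for any threshold t should return the same dictionary as `counts_direct(con, t)`, while touching a single table.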
