Speeding Up Multi-Relational Data Mining

Anna Atramentov and Vasant Honavar*
Artificial Intelligence Laboratory
Department of Computer Science
Iowa State University
Ames, IA 50011, USA
www.cs.iastate.edu/~honavar/aigroup.html

* Support provided in part by National Science Foundation, Carver Foundation, and Pioneer Hi-Bred, Inc.

Motivation

Importance of relational learning:
- Growth of data stored in multi-relational databases (MRDB)
- Techniques for learning from unstructured data often extract the data into MRDB

One of the promising approaches to relational learning:
- MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al. (1999)
- MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by Leiva et al. (2002)

Goal: Speed up the MRDM framework and, in particular, the MRDTL algorithm

Problem Formulation

Given: Data stored in a relational database
Goal: Learn a predictive model for the instances in the target table

Example of multi-relational database (schema and instances)

Department
ID   Specialization     #Students
d1   Math               1000
d2   Physics            300
d3   Computer Science   400

Grad.Student
ID   Name     GPA   #Publications   Advisor   Department
s1   John     2.0   4               p1        d3
s2   Lisa     3.5   10              p4        d3
s3   Michel   3.9   3               p4        d4

Staff
ID   Name     Department   Position            Salary
p1   Dale     d1           Professor           70-80k
p2   Martin   d3           Postdoc             30-40k
p3   Victor   d2           Visitor Scientist   40-50k
p4   David    d3           Professor           80-100k

MRDM overview.
Selection graphs

- Nodes correspond to the tables from the database
- Edges correspond to the associations between tables
- A selection graph corresponds to the subset of the instances from the target table that have some property
- It is a way of specifying attributes in the relational setting

[Figure: selection graph over Staff, Grad.Student, and Department with the conditions Specialization = math and GPA > 3.9, shown next to the Staff table from the example database.]

MRDM overview. Transforming selection graphs into SQL queries

Selection graph: Staff with an associated Grad.Student

    select distinct T0.id
    from Staff T0, Graduate_Student T1
    where T0.id = T1.Advisor

Selection graph: Staff without any associated Grad.Student

    select distinct T0.id
    from Staff T0
    where T0.id not in (
        select T1.Advisor
        from Graduate_Student T1)

Selection graph: Staff with an associated Grad.Student, but without any associated Grad.Student having GPA > 3.9

    select distinct T0.id
    from Staff T0, Graduate_Student T1
    where T0.id = T1.Advisor
      and T0.id not in (
        select T2.Advisor
        from Graduate_Student T2
        where T2.GPA > 3.9)

Generic query:

    select distinct T0.primary_key
    from table_list
    where join_list and condition_list
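The translation from a selection graph to the generic query can be sketched in a few lines of Python on top of the standard sqlite3 module. This is only an illustration, not the MRDM or MRDTL implementation: the `Node` and `Edge` classes and the `to_sql` function are hypothetical names, the sketch handles only absent edges that lead to a single node, and the threshold GPA > 3.0 is an assumption chosen so that the example rows give different answers for the three graphs.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Node:
    table: str          # table name in the database
    alias: str          # alias used in the query, e.g. "T0"
    conditions: list = field(default_factory=list)  # unqualified, e.g. "GPA > 3.0"


@dataclass
class Edge:
    parent: Node
    parent_key: str     # column on the parent side of the association
    child: Node
    child_key: str      # column on the child side of the association
    present: bool = True  # False marks an absent (complement) edge


def to_sql(target, edges, primary_key="ID"):
    """Build: select distinct T0.pk from table_list where join_list and condition_list."""
    tables = [f"{target.table} {target.alias}"]
    where = [f"{target.alias}.{c}" for c in target.conditions]
    for e in edges:
        child_conds = [f"{e.child.alias}.{c}" for c in e.child.conditions]
        if e.present:
            # Present edge: the child joins the table list and the join list.
            tables.append(f"{e.child.table} {e.child.alias}")
            where.append(f"{e.parent.alias}.{e.parent_key} = {e.child.alias}.{e.child_key}")
            where.extend(child_conds)
        else:
            # Absent edge: exclude parents that have a matching child.
            sub = f"select {e.child.alias}.{e.child_key} from {e.child.table} {e.child.alias}"
            if child_conds:
                sub += " where " + " and ".join(child_conds)
            where.append(f"{e.parent.alias}.{e.parent_key} not in ({sub})")
    sql = f"select distinct {target.alias}.{primary_key} from {', '.join(tables)}"
    if where:
        sql += " where " + " and ".join(where)
    return sql


def demo_db():
    """In-memory copy of the Staff and Grad.Student example tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table Staff (ID text, Name text, Department text, Position text, Salary text);
        create table Graduate_Student (ID text, Name text, GPA real, Publications integer,
                                       Advisor text, Department text);
        insert into Staff values
            ('p1', 'Dale', 'd1', 'Professor', '70-80k'),
            ('p2', 'Martin', 'd3', 'Postdoc', '30-40k'),
            ('p3', 'Victor', 'd2', 'Visitor Scientist', '40-50k'),
            ('p4', 'David', 'd3', 'Professor', '80-100k');
        insert into Graduate_Student values
            ('s1', 'John', 2.0, 4, 'p1', 'd3'),
            ('s2', 'Lisa', 3.5, 10, 'p4', 'd3'),
            ('s3', 'Michel', 3.9, 3, 'p4', 'd4');
    """)
    return conn


def run(conn, sql):
    """Execute a generated query and return the selected keys in sorted order."""
    return sorted(row[0] for row in conn.execute(sql))


staff = Node("Staff", "T0")
student = Node("Graduate_Student", "T1")
strong_student = Node("Graduate_Student", "T2", ["GPA > 3.0"])  # assumed threshold

has_student = to_sql(staff, [Edge(staff, "ID", student, "Advisor")])
no_student = to_sql(staff, [Edge(staff, "ID", student, "Advisor", present=False)])
no_strong_student = to_sql(staff, [
    Edge(staff, "ID", student, "Advisor"),
    Edge(staff, "ID", strong_student, "Advisor", present=False),
])

conn = demo_db()
for sql in (has_student, no_student, no_strong_student):
    print(sql)
    print(run(conn, sql))
```

On the example rows, the three generated queries select staff p1 and p4, staff p2 and p3, and staff p1, respectively. Note that `not in` behaves unexpectedly when the subquery returns NULL values, so a fuller version would need to filter those out.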
Refinements of selection graphs

[Figure: a selection graph over Staff, Grad.Student, and Department with the conditions Specialization = math and GPA > 3.9, together with a refinement and the corresponding complement refinement, illustrated with a Grad.Student node carrying the condition GPA > 2.0.]

The most time-consuming operations of MRDTL

[Figure: selection graph over Staff, Grad.Student, and Department with the conditions Specialization = math and GPA > 3.9.]

ID   Name     Dep   Position
p1   Dale     d1    Postdoc
p2   Martin   d1    Postdoc
p3   David    d4    Postdoc
p4   Peter    d3    Postdoc
p5   John     d2    Professor
p6   Susan    d3    Professor
…    …        …     …

Query associated with the selection graph:

    select distinct Staff.Salary, count(distinct Staff.ID)
    from Staff, Graduate_Student, Department
    where join_list and condition_list
    group by Staff.Salary

A way to speed up: eliminate redundant calculations

Problem: For a selection graph with 160 nodes, executing a query takes more than 3 minutes!

Redundancy in calculation: The tables Staff and Grad.Student will be joined again for every child refinement.

A way to fix: Make the join only once and save the information needed for all further calculations.

Speed Up Method. Sufficient tables

[Figure: selection graph over Staff, Grad.Student, and Department with the conditions GPA > 3.9 and Specialization = math.]

Staff_ID   Grad.Student_ID   Dep_ID
p1         s1                d1
p2         s1                d1
p3         s6                d4
p4         s3                d3
p5         s1                d2
p6         s9                d3
…          …                 …
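The idea of making the join once and reusing it can be sketched with an in-memory SQLite database. This is a minimal illustration under stated assumptions, not the authors' implementation: the associations used for the join (Staff.ID = Graduate_Student.Advisor and Staff.Department = Department.ID), the decision to store the class attribute Salary in the sufficient table, the column name `Student_ID`, and the refinement condition GPA > 3.0 are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Department (ID text, Specialization text, Students integer);
    create table Staff (ID text, Name text, Department text, Position text, Salary text);
    create table Graduate_Student (ID text, Name text, GPA real, Publications integer,
                                   Advisor text, Department text);
    insert into Department values
        ('d1', 'Math', 1000), ('d2', 'Physics', 300), ('d3', 'Computer Science', 400);
    insert into Staff values
        ('p1', 'Dale', 'd1', 'Professor', '70-80k'),
        ('p2', 'Martin', 'd3', 'Postdoc', '30-40k'),
        ('p3', 'Victor', 'd2', 'Visitor Scientist', '40-50k'),
        ('p4', 'David', 'd3', 'Professor', '80-100k');
    insert into Graduate_Student values
        ('s1', 'John', 2.0, 4, 'p1', 'd3'),
        ('s2', 'Lisa', 3.5, 10, 'p4', 'd3'),
        ('s3', 'Michel', 3.9, 3, 'p4', 'd4');

    -- Join once. Keep one key per node of the selection graph plus the
    -- class attribute (Salary); storing the class attribute is an assumption.
    create table S as
        select T0.ID as Staff_ID, T1.ID as Student_ID, T2.ID as Dep_ID, T0.Salary as Salary
        from Staff T0, Graduate_Student T1, Department T2
        where T0.ID = T1.Advisor and T0.Department = T2.ID;
""")


def rows(sql):
    """Run a query and return its rows in sorted order."""
    return sorted(conn.execute(sql).fetchall())


# Class counts for the current selection graph: no join is needed at all.
parent_counts = rows("select Salary, count(distinct Staff_ID) from S group by Salary")

# Class counts for a refinement (GPA > 3.0): only the refined table is joined to S.
refined_via_s = rows("""
    select S.Salary, count(distinct S.Staff_ID)
    from S, Graduate_Student G
    where S.Student_ID = G.ID and G.GPA > 3.0
    group by S.Salary
""")

# The same counts computed from scratch, repeating the full three-table join.
refined_direct = rows("""
    select T0.Salary, count(distinct T0.ID)
    from Staff T0, Graduate_Student T1, Department T2
    where T0.ID = T1.Advisor and T0.Department = T2.ID and T1.GPA > 3.0
    group by T0.Salary
""")

print(parent_counts)
print(refined_via_s)
print(refined_via_s == refined_direct)
```

Both ways of evaluating the refinement return the same counts, but the version based on S touches one base table instead of three. The saving grows with the number of nodes in the selection graph, since every refinement would otherwise repeat the full join.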
Sufficient tables

[Figure: selection graph over Staff, Grad.Student, and Department with the conditions Specialization = math and GPA > 3.9.]

Sufficient table S:

Staff_ID   Grad.Student_ID   Dep_ID
p1         s1                d1
p2         s1                d1
p3         s6                d4
p4         s3                d3
p5         s1                d2
p6         s9                d3
…          …                 …

Query associated with the selection graph:

    select S.Salary, count(distinct S.Staff_ID)
    from S
    group by S.Salary

Experimental results

Data set                        Accuracy   Time with speed up   Time w/o speed up   Best-known accuracy
Mutagenesis                     87.5%      28.45 secs           52.15 secs          86%
KDD Cup 2001, localization      76.11%     202.9 secs           1256.38 secs        72%
KDD Cup 2001, function          91.44%     151.19 secs          307.83 secs         93.6%
PKDD 2001 Discovery Challenge   98.1%      127.75 secs          198.22 secs         99.28%

Summary

- A general approach for speeding up the MRDM framework
- MRDTL is a competitive algorithm for learning from relational databases (RDB) in terms of both accuracy and time

Future work

- Techniques for handling missing values
- Pruning techniques or complexity regularization
- Use of aggregates for the attribute values
- More extensive evaluation of MRDTL on real-world data sets