Speeding Up Multi-Relational Data Mining
Anna Atramentov and Vasant Honavar*
Artificial Intelligence Laboratory
Department of Computer Science
Iowa State University
Ames, IA 50011, USA
www.cs.iastate.edu/~honavar/aigroup.html
* Support provided in part by National Science Foundation, Carver Foundation,
and Pioneer Hi-Bred, Inc.
Motivation
Importance of relational learning:
  Growth of data stored in multi-relational databases (MRDB)
  Techniques for learning from unstructured data often extract the data into an MRDB
One of the promising approaches to relational learning:
  MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al. (1999)
  MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by Leiva et al. (2002)

Goal
Speed up the MRDM framework and, in particular, the MRDTL algorithm
Problem Formulation
Given: data stored in a relational database
Goal: learn a predictive model for the instances in the target table
Example of a multi-relational database
Department
ID   Specialization     #Students
d1   Math               1000
d2   Physics            300
d3   Computer Science   400

Graduate Student
ID   Name     GPA   #Publications   Advisor   Department
s1   John     2.0   4               p1        d3
s2   Lisa     3.5   10              p4        d3
s3   Michel   3.9   3               p4        d4

Staff
ID   Name     Department   Position            Salary
p1   Dale     d1           Professor           70-80k
p2   Martin   d3           Postdoc             30-40k
p3   Victor   d2           Visitor Scientist   40-50k
p4   David    d3           Professor           80-100k
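The example database can be loaded into an in-memory SQLite database to experiment with it. This is only a sketch: the SQL-safe names (Graduate_Student, num_students, num_publications) and the column types are assumptions made here, not part of the slides.

```python
import sqlite3

# Table contents are copied from the slide; names and types are assumptions.
SCHEMA = """
CREATE TABLE Department (id TEXT PRIMARY KEY, specialization TEXT, num_students INTEGER);
CREATE TABLE Staff (id TEXT PRIMARY KEY, name TEXT, department TEXT,
                    position TEXT, salary TEXT);
CREATE TABLE Graduate_Student (id TEXT PRIMARY KEY, name TEXT, gpa REAL,
                               num_publications INTEGER, advisor TEXT, department TEXT);
"""

def build_example_db():
    """Create the three example tables in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO Department VALUES (?, ?, ?)", [
        ("d1", "Math", 1000), ("d2", "Physics", 300), ("d3", "Computer Science", 400)])
    conn.executemany("INSERT INTO Staff VALUES (?, ?, ?, ?, ?)", [
        ("p1", "Dale", "d1", "Professor", "70-80k"),
        ("p2", "Martin", "d3", "Postdoc", "30-40k"),
        ("p3", "Victor", "d2", "Visitor Scientist", "40-50k"),
        ("p4", "David", "d3", "Professor", "80-100k")])
    conn.executemany("INSERT INTO Graduate_Student VALUES (?, ?, ?, ?, ?, ?)", [
        ("s1", "John", 2.0, 4, "p1", "d3"),
        ("s2", "Lisa", 3.5, 10, "p4", "d3"),
        ("s3", "Michel", 3.9, 3, "p4", "d4")])
    return conn

conn = build_example_db()
# Follow the Advisor association from each student to a staff member.
rows = conn.execute(
    "SELECT g.name, s.name FROM Graduate_Student g "
    "JOIN Staff s ON g.advisor = s.id ORDER BY g.id"
).fetchall()
print(rows)
```

The Advisor and Department columns act as the associations between tables that the later slides rely on.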
MRDM overview. Selection graphs
Nodes correspond to the tables from the database
Edges correspond to the associations between tables
A selection graph corresponds to the subset of the instances from the target table having some property
A selection graph is a way of specifying attributes in the relational setting
[Figure: selection graph with nodes Grad.Student, Department, Staff and Grad.Student, carrying the conditions Specialization = math and GPA > 3.9, shown next to the Staff table (columns ID, Name, Department, Position, Salary)]
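One way to picture a selection graph in code is as a small data structure that can be evaluated against the tables. The sketch below is an illustration only: the class names, the `present` flag for absent edges, and the choice of Staff as target table with a single edge to Grad.Student (condition GPA > 2.0) are assumptions, not the authors' implementation.

```python
import operator
from dataclasses import dataclass, field

OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}

@dataclass
class Node:
    table: str
    conditions: list = field(default_factory=list)  # (column, op, value) triples

@dataclass
class Edge:
    parent: int            # index of the parent node
    child: int             # index of the child node
    parent_key: str        # column of the parent table used by the association
    child_key: str         # column of the child table used by the association
    present: bool = True   # False models an "absent" edge (assumed encoding)

@dataclass
class SelectionGraph:
    nodes: list            # nodes[0] is the target table
    edges: list = field(default_factory=list)

def matches(graph, db, idx, row):
    """True if `row` satisfies node `idx` and every subgraph hanging below it."""
    if not all(OPS[op](row[col], val) for col, op, val in graph.nodes[idx].conditions):
        return False
    for edge in (e for e in graph.edges if e.parent == idx):
        child_rows = db[graph.nodes[edge.child].table]
        found = any(c[edge.child_key] == row[edge.parent_key]
                    and matches(graph, db, edge.child, c) for c in child_rows)
        if found != edge.present:
            return False
    return True

def covered(graph, db):
    """IDs of the target-table instances selected by the graph."""
    return sorted(r["ID"] for r in db[graph.nodes[0].table] if matches(graph, db, 0, r))

db = {
    "Staff": [{"ID": p} for p in ("p1", "p2", "p3", "p4")],
    "Grad.Student": [
        {"ID": "s1", "GPA": 2.0, "Advisor": "p1"},
        {"ID": "s2", "GPA": 3.5, "Advisor": "p4"},
        {"ID": "s3", "GPA": 3.9, "Advisor": "p4"},
    ],
}
nodes = [Node("Staff"), Node("Grad.Student", [("GPA", ">", 2.0)])]
with_student = SelectionGraph(nodes, [Edge(0, 1, "ID", "Advisor")])
without_student = SelectionGraph(nodes, [Edge(0, 1, "ID", "Advisor", present=False)])
selected = covered(with_student, db)
unselected = covered(without_student, db)
print(selected, unselected)
```

With the example data, the first graph selects the staff members who advise at least one student with GPA > 2.0, and the second selects everyone else.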
MRDM overview. Transforming selection graphs into SQL queries

Selection graph: Staff, Grad. Student
select distinct T0.id
from Staff T0, Graduate_Student T1
where T0.id = T1.Advisor

Selection graph: Staff, Grad. Student (absent)
select distinct T0.id
from Staff T0
where T0.id not in
  (select T1.Advisor
   from Graduate_Student T1)

Selection graph: Staff, Grad. Student, Grad. Student with GPA > 3.9 (absent)
select distinct T0.id
from Staff T0, Graduate_Student T1
where T0.id = T1.Advisor
  and T0.id not in
  (select T1.Advisor
   from Graduate_Student T1
   where T1.GPA > 3.9)

Generic query:
select distinct T0.primary_key
from table_list
where join_list
  and condition_list
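The generic query can be assembled mechanically from a selection graph. The following is a minimal sketch under stated assumptions: `to_sql` is a hypothetical helper, it only handles nodes attached directly to the target table, and it assumes that an absent node is translated into a `not in` subquery over the foreign-key column.

```python
import sqlite3

def to_sql(target, key, present=(), absent=()):
    """Build 'select distinct T0.key from table_list where join_list and condition_list'.

    present / absent: (table, foreign_key, condition or None) triples for nodes
    attached to the target table. Hypothetical helper, not the MRDM code.
    """
    tables = [f"{target} T0"]
    where = []
    for i, (table, fk, cond) in enumerate(present, start=1):
        tables.append(f"{table} T{i}")
        where.append(f"T0.{key} = T{i}.{fk}")      # join_list
        if cond:
            where.append(f"T{i}.{cond}")            # condition_list
    for table, fk, cond in absent:
        sub = f"select A.{fk} from {table} A"
        if cond:
            sub += f" where A.{cond}"
        where.append(f"T0.{key} not in ({sub})")
    sql = f"select distinct T0.{key} from {', '.join(tables)}"
    if where:
        sql += " where " + " and ".join(where)
    return sql

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Staff (id TEXT PRIMARY KEY);
CREATE TABLE Graduate_Student (id TEXT PRIMARY KEY, gpa REAL, advisor TEXT);
INSERT INTO Staff VALUES ('p1'), ('p2'), ('p3'), ('p4');
INSERT INTO Graduate_Student VALUES ('s1', 2.0, 'p1'), ('s2', 3.5, 'p4'), ('s3', 3.9, 'p4');
""")

def run(sql):
    return sorted(row[0] for row in conn.execute(sql))

student = ("Graduate_Student", "advisor", None)
has_student = run(to_sql("Staff", "id", present=[student]))
no_student = run(to_sql("Staff", "id", absent=[student]))
# Threshold from the slide (3.9) and a lower, illustrative one (3.0).
none_above_39 = run(to_sql("Staff", "id", present=[student],
                           absent=[("Graduate_Student", "advisor", "gpa > 3.9")]))
none_above_30 = run(to_sql("Staff", "id", present=[student],
                           absent=[("Graduate_Student", "advisor", "gpa > 3.0")]))
print(has_student, no_student, none_above_39, none_above_30)
```

On the example data no student has a GPA strictly above 3.9, so the third graph selects the same staff as the first; lowering the threshold to 3.0 excludes the advisor of Lisa and Michel.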
MRDM overview. Refinements of selection graphs
[Figure: a selection graph with nodes Grad.Student, Department, Staff and Grad.Student (conditions Specialization = math and GPA > 3.9) and two graphs derived from it. The refinement adds the condition GPA > 2.0 to a Grad.Student node; the complement refinement adds a further Grad.Student node with the condition GPA > 2.0.]
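A refinement and its complement are meant to split the instances covered by the parent graph into two disjoint groups. The sketch below checks this on the example data with a simplified parent graph (Staff linked to Grad.Student). The simplified graph, the lowercase column names and the use of SQLite are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Staff (id TEXT PRIMARY KEY);
CREATE TABLE Graduate_Student (id TEXT PRIMARY KEY, gpa REAL, advisor TEXT);
INSERT INTO Staff VALUES ('p1'), ('p2'), ('p3'), ('p4');
INSERT INTO Graduate_Student VALUES ('s1', 2.0, 'p1'), ('s2', 3.5, 'p4'), ('s3', 3.9, 'p4');
""")

def ids(sql):
    """Set of target-table IDs returned by a selection-graph query."""
    return {row[0] for row in conn.execute(sql)}

# Parent graph: staff members who advise at least one student.
PARENT = ("select distinct T0.id from Staff T0, Graduate_Student T1 "
          "where T0.id = T1.advisor")
# Refinement: add the condition GPA > 2.0 to the existing student node.
REFINED = PARENT + " and T1.gpa > 2.0"
# Complement: add an absent student node carrying the same condition.
COMPLEMENT = PARENT + (" and T0.id not in "
                       "(select T2.advisor from Graduate_Student T2 where T2.gpa > 2.0)")

parent, refined, complement = ids(PARENT), ids(REFINED), ids(COMPLEMENT)
print(sorted(parent), sorted(refined), sorted(complement))
```

The two child sets are disjoint and their union is the parent set, which is the property a decision tree split needs.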
The most time consuming operations of MRDTL
[Figure: selection graph with nodes Grad.Student, Department, Staff and Grad.Student, with conditions Specialization = math and GPA > 3.9]

ID   Name     Dep   Position
p1   Dale     d1    Postdoc
p2   Martin   d1    Postdoc
p3   David    d4    Postdoc
p4   Peter    d3    Postdoc
p5   John     d2    Professor
p6   Susan    d3    Professor
…    …        …     …

Query associated with the selection graph:
select distinct Staff.Salary, count(distinct Staff.ID)
from Staff, Graduate_Student, Department
where join_list and condition_list
group by Staff.Salary
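The counting query can be tried on the small example database. This sketch leaves out the Department table for brevity and uses Salary as the class attribute, as in the query above; the lowercase column names are assumptions. It also shows why `count(distinct ...)` is needed: a staff member with several students appears once per student in the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Staff (id TEXT PRIMARY KEY, salary TEXT);
CREATE TABLE Graduate_Student (id TEXT PRIMARY KEY, gpa REAL, advisor TEXT);
INSERT INTO Staff VALUES ('p1', '70-80k'), ('p2', '30-40k'),
                         ('p3', '40-50k'), ('p4', '80-100k');
INSERT INTO Graduate_Student VALUES ('s1', 2.0, 'p1'), ('s2', 3.5, 'p4'), ('s3', 3.9, 'p4');
""")

COUNTS = """
select T0.salary, count(distinct T0.id)
from Staff T0, Graduate_Student T1
where T0.id = T1.advisor
group by T0.salary
order by T0.salary
"""

# Class counts for the staff members covered by the selection graph.
counts = conn.execute(COUNTS).fetchall()
# Same query counting join rows instead of distinct target instances.
naive = conn.execute(COUNTS.replace("count(distinct T0.id)", "count(*)")).fetchall()
print(counts)
print(naive)
```

Counting join rows would count p4 twice, once for each advised student, which would distort the class distribution.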
A way to speed up: eliminate redundant calculations
Problem: for a selection graph with 160 nodes, executing a query takes more than 3 minutes!
[Figure: selection graph with nodes Grad.Student, Department, Staff and Grad.Student, with conditions GPA > 3.9 and Specialization = math]
Redundancy in calculation: the tables Staff and Grad.Student are joined again for every child refinement
A way to fix: perform the join only once and save the information needed for all further calculations
Speed Up Method. Sufficient tables
[Figure: selection graph with nodes Grad.Student, Department, Staff and Grad.Student, with conditions Specialization = math and GPA > 3.9]

Sufficient table S:
Staff_ID   Grad.Student_ID   Dep_ID
p1         s1                d1
p2         s1                d1
p3         s6                d4
p4         s3                d3
p5         s1                d2
p6         s9                d3
…          …                 …

Query associated with the selection graph:
select S.Salary, count(distinct S.Staff_ID)
from S
group by S.Salary
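The idea of a sufficient table can be sketched by materializing the join once and answering later counting queries from the stored result. The code below is an assumption-laden illustration: the join conditions, the column set of S (keys plus the Salary class attribute) and the way a refinement is evaluated against S are choices made here, not the authors' implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Department (id TEXT PRIMARY KEY, specialization TEXT);
CREATE TABLE Staff (id TEXT PRIMARY KEY, department TEXT, salary TEXT);
CREATE TABLE Graduate_Student (id TEXT PRIMARY KEY, gpa REAL, advisor TEXT);
INSERT INTO Department VALUES ('d1', 'Math'), ('d2', 'Physics'), ('d3', 'Computer Science');
INSERT INTO Staff VALUES ('p1', 'd1', '70-80k'), ('p2', 'd3', '30-40k'),
                         ('p3', 'd2', '40-50k'), ('p4', 'd3', '80-100k');
INSERT INTO Graduate_Student VALUES ('s1', 2.0, 'p1'), ('s2', 3.5, 'p4'), ('s3', 3.9, 'p4');
""")

FULL_JOIN = """
from Staff T0, Graduate_Student T1, Department T2
where T0.id = T1.advisor and T0.department = T2.id
"""

# Pay for the join once; keep the primary keys and the class attribute.
conn.execute("create table S as select T0.id as Staff_ID, T0.salary as Salary, "
             "T1.id as Grad_Student_ID, T2.id as Dep_ID " + FULL_JOIN)

# Class counts answered from S alone ...
from_s = conn.execute(
    "select S.Salary, count(distinct S.Staff_ID) from S "
    "group by S.Salary order by S.Salary").fetchall()
# ... and, for comparison, by repeating the full join.
from_join = conn.execute(
    "select T0.salary, count(distinct T0.id) " + FULL_JOIN +
    " group by T0.salary order by T0.salary").fetchall()

# A refinement (GPA > 2.0 on the student node) needs only S plus one table.
refined = conn.execute(
    "select S.Salary, count(distinct S.Staff_ID) from S, Graduate_Student G "
    "where S.Grad_Student_ID = G.id and G.gpa > 2.0 "
    "group by S.Salary order by S.Salary").fetchall()
print(from_s, from_join, refined)
```

Both routes give the same class counts, while each child refinement now touches S and a single extra table instead of re-joining every table in the graph.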
Experimental results
Data set                        Accuracy   Time with speed up   Time w/o speed up   Best-known accuracy
Mutagenesis                     87.5%      28.45 secs           52.15 secs          86%
KDD Cup 2001, localization      76.11%     202.9 secs           1256.38 secs        72%
KDD Cup 2001, function          91.44%     151.19 secs          307.83 secs         93.6%
PKDD 2001 Discovery Challenge   98.1%      127.75 secs          198.22 secs         99.28%
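The speed-up factor for each data set follows directly from the two timing columns (time without speed up divided by time with speed up). A short calculation, using only the numbers reported above:

```python
# (time with speed up, time without speed up) in seconds, from the results table.
timings = {
    "Mutagenesis": (28.45, 52.15),
    "KDD Cup 2001, localization": (202.9, 1256.38),
    "KDD Cup 2001, function": (151.19, 307.83),
    "PKDD 2001 Discovery Challenge": (127.75, 198.22),
}

# Speed-up factor = time without speed up / time with speed up.
speedup = {name: round(slow / fast, 2) for name, (fast, slow) in timings.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor}x")
```

The factors range from roughly 1.5x to roughly 6x, with the largest gain on the KDD Cup 2001 localization task.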
Summary

A general approach for speeding up the MRDM framework
The MRDTL algorithm is competitive for learning from relational databases (RDB) in terms of both accuracy and time
Future work

Techniques for handling missing values
Pruning techniques or complexity regularization
Use of aggregates over attribute values
More extensive evaluation of MRDTL on real-world data sets