A Multi-Relational Decision Tree Learning
Algorithm – Implementation and Experiments
Anna Atramentov
Major: Computer Science
Program of Study Committee:
Vasant Honavar, Major Professor
Drena Leigh Dobbs
Yan-Bin Jia
Iowa State University,
Ames, Iowa
2003
KDD and Relational Data Mining
The term KDD stands for Knowledge Discovery in Databases.
Traditional techniques in KDD work with instances represented by a single
table:
Day | Outlook  | Temp-re | Humidity | Wind   | Play Tennis
d1  | Sunny    | Hot     | High     | Weak   | No
d2  | Sunny    | Hot     | High     | Strong | No
d3  | Overcast | Hot     | High     | Weak   | Yes
d4  | Overcast | Cold    | Normal   | Weak   | No
Relational Data Mining is a subfield of KDD where the instances are
represented by several tables:

Staff (ID, Name, Department, Position, Salary)
p1 | Dale   | d1 | Professor         | 70-80k
p2 | Martin | d3 | Postdoc           | 30-40k
p3 | Victor | d2 | Visitor Scientist | 40-50k
p4 | David  | d3 | Professor         | 80-100k

Department (ID, Specialization, #Students)
d1 | Math             | 1000
d2 | Physics          | 300
d3 | Computer Science | 400

Graduate Student (ID, Name, GPA, #Publications, Advisor, Department)
s1 | John   | 2.0 | 4  | p1 | d3
s2 | Lisa   | 3.5 | 10 | p4 | d3
s3 | Michel | 3.9 | 3  | p4 | d4
Motivation
Importance of relational learning:
Growth of data stored in multi-relational databases (MRDBs)
Techniques for learning from unstructured data often extract the data into
an MRDB
Promising approach to relational learning:
MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al.
(1999)
MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by
Leiva (2002)
Goals
Speed up the MRDM framework, and in particular the MRDTL algorithm
Incorporate handling of missing values
Perform more extensive experimental evaluation of the algorithm
Relational Learning Literature
Inductive Logic Programming (Dzeroski and Lavrac, 2001; Dzeroski et al.,
2001; Blockeel, 1998; De Raedt, 1997)
First-order extensions of probabilistic models:
Relational Bayesian Networks (Jaeger, 1997)
Probabilistic Relational Models (Getoor, 2001; Koller, 1999)
Bayesian Logic Programs (Kersting et al., 2000)
Combining first-order logic and probability theory
Multi-Relational Data Mining (Knobbe et al., 1999)
Approaches for mining data in the form of graphs (Holder and Cook, 2000;
Gonzalez et al., 2000)
Propositionalization methods (Krogel and Wrobel, 2001)
PRM extensions for cumulative learning, for learning and reasoning as agents
interact with the world (Pfeffer, 2000)
Problem Formulation
Given: data stored in a relational database
Goal: build a decision tree for predicting a target attribute in the target
table
Example of multi-relational database (schema and instances):

Department (ID, Specialization, #Students)
d1 | Math             | 1000
d2 | Physics          | 300
d3 | Computer Science | 400

Graduate Student (ID, Name, GPA, #Publications, Advisor, Department)
s1 | John   | 2.0 | 4  | p1 | d3
s2 | Lisa   | 3.5 | 10 | p4 | d3
s3 | Michel | 3.9 | 3  | p4 | d4

Staff (ID, Name, Department, Position, Salary)
p1 | Dale   | d1 | Professor         | 70-80k
p2 | Martin | d3 | Postdoc           | 30-40k
p3 | Victor | d2 | Visitor Scientist | 40-50k
p4 | David  | d3 | Professor         | 80-100k
Propositional decision tree algorithm. Construction
phase

[Figure: the PlayTennis table is recursively partitioned. Splitting on
Outlook separates {d1, d2} (Sunny, both No) from {d3, d4} (Overcast); the
Overcast partition is then split again, separating d3 (Yes) from d4 (No).]
Tree_induction(D: data)
  if stopping_criterion(D)
    return leaf(D)
  else
    A := optimal_attribute(D)
    Dleft := split(D, A)
    Dright := splitcomplement(D, A)
    childleft := Tree_induction(Dleft)
    childright := Tree_induction(Dright)
    return node(A, childleft, childright)
[Figure: the resulting tree. The root {d1, d2, d3, d4} tests Outlook; the
{d1, d2} branch is a "No" leaf; the {d3, d4} branch tests Temperature,
giving the leaves {d3} = Yes and {d4} = No.]
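The induction procedure above can be made concrete on the PlayTennis table. This is a minimal sketch, not the thesis implementation: optimal_attribute is realized as an exhaustive search over binary (attribute = value) tests scored by information gain, and the stopping criterion is simply label purity.

```python
import math
from collections import Counter

# PlayTennis examples from the table above: (day, attributes, label)
DATA = [
    ("d1", {"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Weak"}, "No"),
    ("d2", {"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Strong"}, "No"),
    ("d3", {"Outlook": "Overcast", "Temp": "Hot", "Humidity": "High", "Wind": "Weak"}, "Yes"),
    ("d4", {"Outlook": "Overcast", "Temp": "Cold", "Humidity": "Normal", "Wind": "Weak"}, "No"),
]

def entropy(rows):
    counts = Counter(label for _, _, label in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_split(rows):
    """Pick the binary test (attribute = value) with the highest information gain."""
    base = entropy(rows)
    best = None
    for attr in rows[0][1]:
        for value in {r[1][attr] for r in rows}:
            left = [r for r in rows if r[1][attr] == value]
            right = [r for r in rows if r[1][attr] != value]
            if not left or not right:
                continue  # test does not actually split the data
            gain = base - (len(left) / len(rows)) * entropy(left) \
                        - (len(right) / len(rows)) * entropy(right)
            if best is None or gain > best[0]:
                best = (gain, attr, value)
    return best

def tree_induction(rows):
    labels = {label for _, _, label in rows}
    if len(labels) == 1:           # stopping criterion: pure node
        return labels.pop()
    split = best_split(rows)
    if split is None:              # no useful split left: majority leaf
        return Counter(l for _, _, l in rows).most_common(1)[0][0]
    _, attr, value = split
    left = [r for r in rows if r[1][attr] == value]
    right = [r for r in rows if r[1][attr] != value]
    return (f"{attr} = {value}", tree_induction(left), tree_induction(right))

tree = tree_induction(DATA)
```

On this data the root test is on Outlook, and the Overcast branch splits further on Temperature, matching the figure.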
MR setting. Splitting data with Selection Graphs

[Figure: the Department, Graduate Student, and Staff tables from the
example database. A selection graph over the target table Staff with an
open edge to Grad.Student splits the Staff table into the staff members
who advise at least one graduate student ({p1 Dale, p4 David}) and, via
the complement selection graph, those who advise none ({p2 Martin,
p3 Victor}); adding the condition GPA > 2.0 to the Grad.Student node
refines the split further.]
What is a selection graph?
It corresponds to a subset of the
instances from the target table
Nodes correspond to tables
from the database
Edges correspond to the
associations between tables
Open edge = "has at least one"
Closed edge = "has none of"

[Figure: example selection graphs over Staff, Grad.Student, and
Department, with conditions such as GPA > 3.9 and Specialization = math]
Transforming selection graphs into SQL queries

Staff [Position = Professor]:
select distinct T0.ID
from Staff T0
where T0.Position = 'Professor'

Staff with an open edge to Grad.Student:
select distinct T0.ID
from Staff T0, Graduate_Student T1
where T0.ID = T1.Advisor

Staff with a closed edge to Grad.Student:
select distinct T0.ID
from Staff T0
where T0.ID not in
  (select T1.Advisor
   from Graduate_Student T1)

Staff with an open edge to Grad.Student and a closed edge to
Grad.Student [GPA > 3.9]:
select distinct T0.ID
from Staff T0, Graduate_Student T1
where T0.ID = T1.Advisor
  and T0.ID not in
  (select T1.Advisor
   from Graduate_Student T1
   where T1.GPA > 3.9)

Generic query:
select distinct T0.primary_key
from table_list
where join_list
and condition_list
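The general translation can be sketched as a small query generator. The graph encoding below (alias lists and closed-edge tuples) is my own simplification, not the thesis's data structure; it just assembles the generic query above, assuming each node i is aliased Ti and the target table's key column is named id.

```python
def selection_graph_to_sql(open_nodes, joins, conditions, closed_edges):
    """Emit the generic query:
    select distinct T0.id from table_list where join_list and condition_list.

    open_nodes:   table names; node i gets alias Ti (T0 is the target table)
    joins:        "Ti.col = Tj.col" strings for the open (present) edges
    conditions:   attribute conditions such as "T1.GPA > 3.9"
    closed_edges: (alias_col, table, col, cond) tuples, each emitted as a
                  NOT IN subquery over the closed-edge node's table
    """
    table_list = ", ".join(f"{t} T{i}" for i, t in enumerate(open_nodes))
    where = list(joins) + list(conditions)
    for alias_col, table, col, cond in closed_edges:
        sub = f"select S.{col} from {table} S" + (f" where {cond}" if cond else "")
        where.append(f"{alias_col} not in ({sub})")
    sql = f"select distinct T0.id from {table_list}"
    if where:
        sql += " where " + " and ".join(where)
    return sql

# The last example above: staff who advise a graduate student,
# but none with GPA > 3.9
q = selection_graph_to_sql(
    ["Staff", "Graduate_Student"],
    ["T0.id = T1.Advisor"],
    [],
    [("T0.id", "Graduate_Student", "Advisor", "S.GPA > 3.9")],
)
```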
MR decision tree
Each node contains a selection graph
Each child's selection graph is a supergraph
of the parent's selection graph

[Figure: a decision tree whose root holds the selection graph for Staff;
its children refine it, e.g. Staff with an edge to Grad.Student, with and
without the condition GPA > 3.9, and so on.]
How to choose selection graphs in nodes?
Problem: there are too many supergraph selection graphs to choose from
in each node
Solution:
start with the initial selection graph
use a greedy heuristic to choose among supergraph selection graphs:
refinements
use binary splits for simplicity
for each refinement, get the complement refinement
choose the best refinement based on the information gain criterion
Problem: some potentially good refinements may give no immediate benefit
Solution: look-ahead capability

[Figure: the tree of refined selection graphs over Staff and
Grad.Student, as above.]
Refinements of selection graph
Two kinds of refinement:
add condition to a node: explore attribute information in the tables
add present edge and open node: explore relational properties between
the tables

[Figure: a selection graph over Staff, Grad.Student, and Department with
the conditions GPA > 3.9 and Specialization = math.]
Refinements of selection graph (examples)

[Figures: a sequence of slides showing refinements of the running
selection graph over Staff, Grad.Student [GPA > 3.9], and Department
[Specialization = math], each paired with its complement refinement:
add condition Position = Professor to the Staff node (complement:
Position != Professor); add condition GPA > 2.0 to a Grad.Student node;
add condition #Students > 200 to the Department node; and several
variants of adding a present edge and open node (for one of these the
information gain is 0).]
Look ahead capability

[Figures: a look-ahead refinement adds a present edge and open node
(e.g. Grad.Student to Department) together with a condition on the new
node (e.g. #Students > 200) in a single step, again paired with its
complement refinement.]
MRDTL algorithm. Construction phase
For each non-leaf node:
consider all possible refinements of the node's selection graph and
their complements
choose the best one based on the information gain criterion
create children nodes

[Figure: the growing tree of selection graphs over Staff and
Grad.Student, refined with conditions such as GPA > 3.9.]
MRDTL algorithm. Classification phase
For each leaf:
apply the selection graph of the leaf to the test data
classify the resulting instances with the classification of the leaf

[Figure: leaves of the tree with their selection graphs and predicted
salary classes, e.g. Staff advising a Grad.Student [GPA > 3.9] in a
Department with Spec = math predicts 70-80k; Staff with Position =
Professor predicts 80-100k; a Spec = physics leaf; and so on.]
The most time consuming operations of MRDTL

Entropy associated with this selection graph:
E = - Σi (ni / N) log (ni / N)

[Figure: the selection graph over Staff, Grad.Student [GPA > 3.9], and
Department [Specialization = math], with the Staff instances labeled by
their salary classes (the c1 rows are counted as n1, the c2 rows as n2):

ID | Name   | Dep | Position  | Salary class
p1 | Dale   | d1  | Postdoc   | c1
p2 | Martin | d1  | Postdoc   | c1
p3 | David  | d4  | Postdoc   | c1
p4 | Peter  | d3  | Postdoc   | c1
p5 | Adrian | d2  | Professor | c2
p6 | Doina  | d3  | Professor | c2
...]

Query associated with the counts ni:
select distinct Staff.Salary,
count(distinct Staff.ID)
from Staff, Graduate_Student,
Department
where join_list and condition_list
group by Staff.Salary

Result of the query is the list of pairs (ci, ni).
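The entropy E can be computed directly from the (ci, ni) list the query returns. A minimal sketch, using log base 2 (the slide does not fix the base, so base 2 is an assumption), with counts matching the labeled Staff table above (four instances in class c1, two in c2):

```python
import math

def entropy_from_counts(counts):
    """E = -sum_i (n_i / N) log2(n_i / N), where counts maps class c_i -> n_i."""
    n_total = sum(counts.values())
    return -sum((n / n_total) * math.log2(n / n_total) for n in counts.values())

# counts as the query would return them: [(c1, 4), (c2, 2)]
e = entropy_from_counts({"c1": 4, "c2": 2})
```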
The most time consuming operations of MRDTL

Entropy associated with each of the refinements requires a query of the
same shape:
select distinct Staff.Salary,
count(distinct Staff.ID)
from table_list
where join_list and condition_list
group by Staff.Salary

[Figure: the refinements of the running selection graph, e.g. adding
GPA > 2.0 to a Grad.Student node, each needing such a query.]
A way to speed up: eliminate redundant calculations
Problem: for a selection graph with 162 nodes, the time to execute a
query is more than 3 minutes!
Redundancy in calculation: for this selection graph, the tables Staff
and Grad.Student are joined over and over for all the children
refinements of the tree.
A way to fix: compute the join only once and save it for all further
calculations.

[Figure: the running selection graph over Staff, Grad.Student
[GPA > 3.9], and Department [Specialization = math].]
Speed Up Method. Sufficient tables

For the selection graph over Staff, Grad.Student [GPA > 3.9], and
Department [Specialization = math], the sufficient table S stores the
join once:

Staff_ID | Grad.Student_ID | Dep_ID | Salary
p1 | s1 | d1 | c1
p2 | s1 | d1 | c1
p3 | s6 | d4 | c1
p4 | s3 | d3 | c1
p5 | s1 | d2 | c2
p6 | s9 | d3 | c2
...
Speed Up Method. Sufficient tables

Entropy associated with this selection graph:
E = - Σi (ni / N) log (ni / N)

Query associated with the counts ni, now run against the sufficient
table S alone (the table shown above):
select S.Salary,
count(distinct S.Staff_ID)
from S
group by S.Salary

Result of the query is the list of pairs (ci, ni).
Speed Up Method. Sufficient tables

Queries associated with the add condition refinement (X is the table of
the refined node, A the attribute being tested):
select S.Salary, X.A,
count(distinct S.Staff_ID)
from S, X
where S.X_ID = X.ID
group by S.Salary, X.A

Calculations for the complement refinement:
count(ci, Rcomp(S)) = count(ci, S) - count(ci, R(S))
Speed Up Method. Sufficient tables

Queries associated with the add edge refinement (X and Y are the tables
joined by the new edge e):
select S.Salary, count(distinct S.Staff_ID)
from S, X, Y
where S.X_ID = X.ID and e.cond
group by S.Salary

Calculations for the complement refinement:
count(ci, Rcomp(S)) = count(ci, S) - count(ci, R(S))
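The complement-refinement formula means no extra query is needed for the complement: its class counts follow by subtraction from the counts already computed for S and for R(S). A tiny numeric check with hypothetical counts:

```python
def complement_counts(total, refined):
    """count(ci, Rcomp(S)) = count(ci, S) - count(ci, R(S)), per class ci."""
    return {c: total[c] - refined.get(c, 0) for c in total}

# hypothetical class counts over the sufficient table S and a refinement R(S)
total = {"c1": 4, "c2": 2}
refined = {"c1": 1, "c2": 2}
comp = complement_counts(total, refined)  # counts for the complement refinement
```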
Speed Up Method
Significant speed-up in obtaining the counts needed for the
calculations of the entropy and information gain
The speed-up is achieved at the cost of the additional space used by
the algorithm
Handling Missing Values

Department (ID, Specialization, #Students)
d1 | Math             | 1000
d2 | Physics          | 300
d3 | Computer Science | 400

Graduate Student (ID, Name, GPA, #Public., Advisor, Department)
s1 | John   | 2.0 | 4  | p1 | d3
s2 | Lisa   | 3.5 | 10 | p1 | d3
s3 | Michel | 3.9 | 3  | p4 | d4

Staff (ID, Name, Department, Position, Salary)
p1 | Dale   | d1 | ?                 | 70-80k
p2 | Martin | d3 | ?                 | 30-40k
p3 | Victor | d2 | Visitor Scientist | 40-50k
p4 | David  | d3 | ?                 | 80-100k

For each attribute which has missing values we build a Naïve Bayes
model: for a value b of the target attribute (e.g. Staff.Position) we
estimate P(a | b) for the values a of the other attributes, such as
Staff.Name, Staff.Dep, Department.Spec, ...
Handling Missing Values

[Tables: the Staff row p1 (Dale, d1, ?, 70-80k) together with the
related Department and Graduate Student rows used as evidence.]

Then the most probable value for the missing attribute is calculated by
the formula (the last step uses the Naïve Bayes independence
assumption):
P(vi | X1.A1, X2.A2, X3.A3, ...) =
P(X1.A1, X2.A2, X3.A3, ... | vi) P(vi) / P(X1.A1, X2.A2, X3.A3, ...) =
P(X1.A1 | vi) P(X2.A2 | vi) P(X3.A3 | vi) ... P(vi) / P(X1.A1, X2.A2, X3.A3, ...)
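The argmax over candidate values vi can be sketched as follows. The rows and the add-one (Laplace) smoothing are illustrative assumptions (the slide specifies neither the data nor the estimator), and the evidence denominator is dropped since it does not affect the argmax.

```python
from collections import Counter

def most_probable_value(rows, target, evidence_attrs, query):
    """Naive Bayes imputation: argmax over v of P(v) * prod_a P(query[a] | v),
    with probabilities estimated from the rows where `target` is observed.
    The denominator P(X1.A1, X2.A2, ...) is constant in v, so it is dropped."""
    observed = [r for r in rows if r[target] is not None]
    prior = Counter(r[target] for r in observed)
    best_value, best_score = None, float("-inf")
    for v, nv in prior.items():
        score = nv / len(observed)  # P(v)
        for a in evidence_attrs:
            matches = sum(1 for r in observed if r[target] == v and r[a] == query[a])
            n_distinct = len({r[a] for r in observed})
            # P(query[a] | v) with add-one smoothing (an assumption)
            score *= (matches + 1) / (nv + n_distinct)
        if score > best_score:
            best_value, best_score = v, score
    return best_value

# hypothetical Staff rows; Position is missing (None) in the last one
staff = [
    {"Dep": "d3", "Position": "Professor"},
    {"Dep": "d3", "Position": "Professor"},
    {"Dep": "d2", "Position": "Postdoc"},
    {"Dep": "d3", "Position": None},
]
guess = most_probable_value(staff, "Position", ["Dep"], staff[3])
```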
Experimental results. Mutagenesis
The most widely used database in ILP.
Describes molecules of certain nitroaromatic compounds.
Goal: predict their mutagenic activity (label attribute), i.e. the ability to
cause DNA to mutate. High mutagenic activity can cause cancer.
Two subsets: regression friendly (188 molecules) and regression
unfriendly (42 molecules). We used only the regression friendly subset.
5 levels of background knowledge: B0, B1, B2, B3, B4. They provide
increasingly richer descriptions of the examples. We used the B2 level.
Experimental results. Mutagenesis
[Figure: schema of the mutagenesis database]
Results of 10-fold cross-validation for the regression friendly set:

Data Set    | Accuracy | Sel graph size (max) | Tree size | Time with speed up | Time without speed up
mutagenesis | 87.5%    | 3                    | 9         | 28.45              | 52.15

Best-known reported accuracy is 86%.
Experimental results. KDD Cup 2001
Consists of a variety of details about the various genes of one particular type
of organism.
Genes code for proteins, and these proteins tend to localize in various parts
of cells and interact with one another in order to perform crucial functions.
2 tasks: prediction of gene/protein LOCALIZATION and FUNCTION
862 training genes, 381 test genes.
Many attribute values are missing: 70% of the CLASS attribute, 50% of
COMPLEX, and 50% of MOTIF in the composition table
Experimental results. KDD Cup 2001

localization                    | Accuracy | Sel graph size (max) | Tree size | Time with speed up | Time without speed up
With handling missing values    | 76.11%   | 19                   | 213       | 202.9 secs         | 1256.38 secs
Without handling missing values | 50.14%   | 33                   | 575       | 550.76 secs        | 2257.20 secs

Best-known reported accuracy is 72.1%.

function                        | Accuracy | Sel graph size (max) | Tree size (max) | Time with speed up | Time without speed up
With handling missing values    | 91.44%   | 9                    | 63              | 151.19 secs        | 307.83 secs
Without handling missing values | 88.56%   | 9                    | 19              | 61.29 secs         | 118.41 secs

Best-known reported accuracy is 93.6%.
Experimental results. PKDD 2001 Discovery
Challenge
Consists of 5 tables: PATIENT_INFO, DIAGNOSIS, ANA_PATTERN,
THROMBOSIS, and ANTIBODY_EXAM
Target table consists of 1239 records
The task is to predict the degree of the thrombosis attribute from the
ANTIBODY_EXAM table
The results for 5:2 cross validation:

Data Set   | Accuracy | Sel Graph size (max) | Tree size | Time with speed up | Time without speed up
thrombosis | 98.1%    | 31                   | 71        | 127.75             | 198.22

Best-known reported accuracy is 99.28%.
Summary
The new implementation significantly outperforms the original MRDTL
implementation in terms of running time
The accuracy results are comparable with the best reported results
obtained using different data-mining algorithms
Future work
Incorporation of more sophisticated techniques for handling missing
values
Incorporation of more sophisticated pruning techniques or complexity
regularizations
More extensive evaluation of MRDTL on real-world data sets
Development of ontology-guided multi-relational decision tree learning
algorithms to generate classifiers at multiple levels of abstraction [Zhang
et al., 2002]
Development of variants of MRDTL that can learn from heterogeneous,
distributed, autonomous data sources, based on recently developed
techniques for distributed learning and ontology-based data integration
Thanks to
Dr. Honavar for providing guidance, help and support throughout this
research
Colleagues from the Artificial Intelligence Lab for various helpful discussions
My committee members: Drena Dobbs and Yan-Bin Jia for their help
Professors and lecturers of the Computer Science department for the
knowledge that they gave me through lectures and discussions
Iowa State University and Computer Science department for funding in
part this research