Scalable Mining For Classification Rules in Relational Databases
Min Wang
Bala Iyer
Jeffrey Scott Vitter
Presented by: Nadav Grossaug
Abstract
• Problem: increase in the size of the training set
• MIND (MINing in Database) classifier
• Can be implemented easily over SQL
• Other classifiers need O(N) space in memory
• MIND scales well over:
• I/O
• # of processors
Overview
• Introduction
• Algorithm
• Database Implementation
• Performance
• Experimental Results
• Conclusions
Introduction - Classification Problem
[Figure: a classifier (decision tree) built from the DETAIL table]
Age <= 30?
  yes: Salary <= 62K?
    yes: safe
    no:  risky
  no:  safe
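As an illustration only (not from the slides), the decision tree above corresponds to a classification rule that could be applied in SQL as a CASE expression over hypothetical age and salary columns:

SELECT age, salary,
       CASE
         WHEN age <= 30 AND salary <= 62000 THEN 'safe'   -- left branch, then left leaf
         WHEN age <= 30                     THEN 'risky'  -- left branch, salary > 62K
         ELSE 'safe'                                      -- right branch of the root
       END AS predicted_class
FROM DETAIL;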
Introduction - Scalability In Classification
Importance of scalability:
• Use a very large training set – the data is not memory resident.
• Scale with the number of CPUs – better usage of resources.
Introduction - Scalability In Classification
Properties of MIND:
• Scalable in memory
• Scalable in CPU
• Uses SQL
• Easy to implement
Assumptions
• Attribute values are discrete
• We focus on the growth stage (no pruning)
The Algorithm - Data Structure
Data is stored in the DETAIL table:
DETAIL(attr1, attr2, ..., class, leaf_num)
attri = the i-th attribute
class = the class label
leaf_num = the leaf of the current tree that the example belongs to
(this value can be computed from the known tree)
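A minimal sketch of a DETAIL table for the age/salary example above; the column names and types are illustrative assumptions, not taken from the slides:

CREATE TABLE DETAIL (
  age      INTEGER,   -- attr1
  salary   INTEGER,   -- attr2
  class    CHAR(5),   -- 'safe' or 'risky'
  leaf_num INTEGER    -- leaf of the current tree this record belongs to
);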
The Algorithm - gini index
S - data set
C - number of classes
pi - relative frequency of class i in S
gini index: gini(S) = 1 - Σi pi²
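For reference, the gini-based split criterion used during tree growth (implied but not written out on this slide) is the weighted sum of the gini values of the two partitions produced by a candidate split:

\[
\mathrm{gini}(S) = 1 - \sum_{i=1}^{C} p_i^2,
\qquad
\mathrm{gini}_{split}(S) = \frac{n_1}{n}\,\mathrm{gini}(S_1) + \frac{n_2}{n}\,\mathrm{gini}(S_2)
\]

where the split partitions the n records of S into S1 (n1 records) and S2 (n2 records); the best split minimizes gini_split.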
The Algorithm
GrowTree(DETAIL TABLE)
  Initialize tree T and put all records of DETAIL in the root
  while (some leaf in T is not a STOP node)
    for each attribute i do
      evaluate the gini index for each non-STOP leaf at each
      split value with respect to attribute i
    for each non-STOP leaf do
      get the overall best split for it;
    partition the records and grow the tree for one more level
    according to the best splits;
    mark all small or pure leaves as STOP nodes;
  return T;
Database Implementation - Dimension Table
• For each attribute i and each level of the tree:
INSERT INTO DIMi
SELECT leaf_num, class, attri, count(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attri
Size of DIMi = #leaves * #distinct values of attri * #classes
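As an illustration with hypothetical numbers: for 100 non-STOP leaves, 50 distinct values of attri, and 2 classes, DIMi has at most 100 * 50 * 2 = 10,000 rows, which easily fits in memory even when DETAIL does not.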
Database Implementation - Dimension Table SQL
One such statement is issued per attribute, all reading from DETAIL:
INSERT INTO DIM1
SELECT leaf_num, class, attr1, count(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr1

INSERT INTO DIM2
SELECT leaf_num, class, attr2, count(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr2
...
Database Implementation - UP/DOWN Split
For each attribute we find all possible split places:
INSERT INTO UP
SELECT d1.leaf_num, d1.attri, d1.class, SUM(d2.count)
FROM DIMi d1 FULL OUTER JOIN DIMi d2
  ON d1.leaf_num = d2.leaf_num
 AND d2.attri <= d1.attri
 AND d1.class = d2.class
GROUP BY d1.leaf_num, d1.attri, d1.class
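The slide shows only the UP table; a sketch of the analogous DOWN insert, assuming DOWN accumulates the counts on the other side of each candidate split (d2.attri > d1.attri):

INSERT INTO DOWN
SELECT d1.leaf_num, d1.attri, d1.class, SUM(d2.count)
FROM DIMi d1 FULL OUTER JOIN DIMi d2
  ON d1.leaf_num = d2.leaf_num
 AND d2.attri > d1.attri
 AND d1.class = d2.class
GROUP BY d1.leaf_num, d1.attri, d1.class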
Database Implementation - Class View
Create a view for each class k and attribute i:
CREATE VIEW Ck_UP(leaf_num, attri, count) AS
SELECT leaf_num, attri, count
FROM UP
WHERE class = k
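The DOWN counterpart, assumed to mirror the UP view over the DOWN table:

CREATE VIEW Ck_DOWN(leaf_num, attri, count) AS
SELECT leaf_num, attri, count
FROM DOWN
WHERE class = k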
Database Implementation - GINI VALUE
Create a view for all gini values:
CREATE VIEW GINI_VALUE(leaf_num, attri, gini) AS
SELECT u1.leaf_num, u1.attri, <gini_split expression over the UP/DOWN counts>
FROM C1_UP u1, .., Cc_UP uc, C1_DOWN d1, .., Cc_DOWN dc
WHERE u1.attri = .. = uc.attri = .. = dc.attri
AND u1.leaf_num = .. = uc.leaf_num = .. = dc.leaf_num
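Purely as an illustration (this exact expression does not appear on the slides): for c = 2 classes, the elided gini expression could be spelled out using gini_split = (n_up/n)*gini(UP) + (n_down/n)*gini(DOWN):

CREATE VIEW GINI_VALUE (leaf_num, attri, gini) AS
SELECT u1.leaf_num, u1.attri,
       -- weight of the "up" partition (attri <= split) times its gini impurity
       ((u1.count + u2.count) * 1.0 /
        (u1.count + u2.count + d1.count + d2.count)) *
       (1.0 - (u1.count * u1.count + u2.count * u2.count) * 1.0 /
              ((u1.count + u2.count) * (u1.count + u2.count)))
       +
       -- weight of the "down" partition (attri > split) times its gini impurity
       ((d1.count + d2.count) * 1.0 /
        (u1.count + u2.count + d1.count + d2.count)) *
       (1.0 - (d1.count * d1.count + d2.count * d2.count) * 1.0 /
              ((d1.count + d2.count) * (d1.count + d2.count)))
FROM C1_UP u1, C2_UP u2, C1_DOWN d1, C2_DOWN d2
WHERE u1.attri = u2.attri AND u2.attri = d1.attri AND d1.attri = d2.attri
  AND u1.leaf_num = u2.leaf_num AND u2.leaf_num = d1.leaf_num
  AND d1.leaf_num = d2.leaf_num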
Database Implementation - MIN GINI VALUE
Create a table of minimum gini values for attribute i:
INSERT INTO MIN_GINI
SELECT leaf_num, i, attri, gini
FROM GINI_VALUE a
WHERE a.gini =
  (SELECT MIN(gini)
   FROM GINI_VALUE b
   WHERE a.leaf_num = b.leaf_num)
Database Implementation - BEST SPLIT
Create a view over MIN_GINI for the best split:
CREATE VIEW BEST_SPLIT
  (leaf_num, attr_name, attr_value) AS
SELECT leaf_num, attr_name, attr_value
FROM MIN_GINI a
WHERE a.gini =
  (SELECT MIN(gini)
   FROM MIN_GINI b
   WHERE a.leaf_num = b.leaf_num)
Database Implementation - Partitioning
Build new nodes by splitting the old nodes according to the BEST_SPLIT values.
Assign the correct node to each record:
• updating leaf_num is done by a user function
• no need to UPDATE the data or the DB
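A sketch of the idea; the function name leaf_of and its argument list are assumptions, not from the slides. Instead of updating leaf_num in DETAIL, a user-defined function computes the current leaf from the attribute values and the tree built so far, so the next round of DIMi inserts never modifies DETAIL:

INSERT INTO DIM1
SELECT leaf_of(attr1, attr2 /* , ... */), class, attr1, count(*)
FROM DETAIL
WHERE leaf_of(attr1, attr2 /* , ... */) <> STOP
GROUP BY leaf_of(attr1, attr2 /* , ... */), class, attr1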
Performance
I/O cost of MIND: [formula not transcribed]
I/O cost of SPRINT: [formula not transcribed]
Experimental Results
[Figure: normalized time to finish building the tree]
[Figure: normalized time to build the tree per example]
Experimental Results
[Figure: normalized time to build the tree per number of processors]
[Figure: time to build the tree by training set size]
Conclusions
• MIND works over a DB
• MIND works well because:
– MIND rephrases the classification task as a DB problem
– MIND avoids UPDATEs to the DETAIL table
– parallelism and scaling are achieved through the RDBMS
– MIND uses a user function to gain performance in the DIMi creation