Scalable Mining For Classification Rules in Relational Databases
Min Wang
Bala Iyer
Jeffrey Scott Vitter
Presented by: Nadav Grossaug
Abstract
• Problem: increase in the size of the training set
• MIND (MINing in Database) classifier
• Can be implemented easily over SQL
• Other classifiers need O(N) space in memory
• MIND scales well over:
• I/O
• # of processors
Overview
• Introduction
• Algorithm
• Database Implementation
• Performance
• Experimental Results
• Conclusions
Introduction - Classification Problem
[Figure: a classifier (decision tree) built from the DETAIL table]
Age <= 30?
  yes: Salary <= 62K?
    yes: safe
    no:  risky
  no:  safe
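As an illustration only (not from the slides), the decision tree above corresponds to a classification rule that could be applied in SQL as a CASE expression over hypothetical age and salary columns:

SELECT age, salary,
       CASE
         WHEN age <= 30 AND salary <= 62000 THEN 'safe'   -- left branch, then left leaf
         WHEN age <= 30                     THEN 'risky'  -- left branch, salary > 62K
         ELSE 'safe'                                      -- right branch of the root
       END AS predicted_class
FROM DETAIL;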
Introduction - Scalability In Classification
Importance of scalability:
• Use a very large training set – the data is not memory resident.
• Scale with the number of CPUs – better usage of resources.
Introduction - Scalability In Classification
Properties of MIND:
• Scalable in memory
• Scalable in CPU
• Uses SQL
• Easy to implement
Assumptions
• Attribute values are discrete
• We focus on the growth stage (no pruning)
The Algorithm - Data Structure
Data is stored in the DETAIL table:
DETAIL(attr1, attr2, ..., class, leaf_num)
attri = the i-th attribute
class = the class label
leaf_num = the leaf of the current tree that the example belongs to
(this value can be computed from the known tree)
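A minimal sketch of a DETAIL table for the age/salary example above; the column names and types are illustrative assumptions, not taken from the slides:

CREATE TABLE DETAIL (
  age      INTEGER,   -- attr1
  salary   INTEGER,   -- attr2
  class    CHAR(5),   -- 'safe' or 'risky'
  leaf_num INTEGER    -- leaf of the current tree this record belongs to
);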
The Algorithm - gini index
S - data set
C - number of classes
pi - relative frequency of class i in S
gini index: gini(S) = 1 - Σi pi²
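For reference, the gini-based split criterion used during tree growth (implied but not written out on this slide) is the weighted sum of the gini values of the two partitions produced by a candidate split:

\[
\mathrm{gini}(S) = 1 - \sum_{i=1}^{C} p_i^2,
\qquad
\mathrm{gini}_{split}(S) = \frac{n_1}{n}\,\mathrm{gini}(S_1) + \frac{n_2}{n}\,\mathrm{gini}(S_2)
\]

where the split partitions the n records of S into S1 (n1 records) and S2 (n2 records); the best split minimizes gini_split.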
The Algorithm
GrowTree(DETAIL TABLE)
  Initialize tree T and put all records of DETAIL in the root
  while (some leaf in T is not a STOP node)
    for each attribute i do
      evaluate the gini index for each non-STOP leaf at each
      split value with respect to attribute i
    for each non-STOP leaf do
      get the overall best split for it;
    partition the records and grow the tree for one more level
    according to the best splits;
    mark all small or pure leaves as STOP nodes;
  return T;
Database Implementation - Dimension Table
• For each attribute i and each level of the tree:
INSERT INTO DIMi
SELECT leaf_num, class, attri, count(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attri
Size of DIMi = #leaves * #distinct values of attri * #classes
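As an illustration with hypothetical numbers: for 100 non-STOP leaves, 50 distinct values of attri, and 2 classes, DIMi has at most 100 * 50 * 2 = 10,000 rows, which easily fits in memory even when DETAIL does not.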
Database Implementation - Dimension Table SQL
One such statement is issued per attribute, all reading from DETAIL:
INSERT INTO DIM1
SELECT leaf_num, class, attr1, count(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr1

INSERT INTO DIM2
SELECT leaf_num, class, attr2, count(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr2
...
Database Implementation - UP/DOWN Split
For each attribute we find all possible split places:
INSERT INTO UP
SELECT d1.leaf_num, d1.attri, d1.class, SUM(d2.count)
FROM DIMi d1 FULL OUTER JOIN DIMi d2
  ON d1.leaf_num = d2.leaf_num
 AND d2.attri <= d1.attri
 AND d1.class = d2.class
GROUP BY d1.leaf_num, d1.attri, d1.class
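The slide shows only the UP table; a sketch of the analogous DOWN insert, assuming DOWN accumulates the counts on the other side of each candidate split (d2.attri > d1.attri):

INSERT INTO DOWN
SELECT d1.leaf_num, d1.attri, d1.class, SUM(d2.count)
FROM DIMi d1 FULL OUTER JOIN DIMi d2
  ON d1.leaf_num = d2.leaf_num
 AND d2.attri > d1.attri
 AND d1.class = d2.class
GROUP BY d1.leaf_num, d1.attri, d1.class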
Database Implementation - Class View
Create a view for each class k and attribute i:
CREATE VIEW Ck_UP(leaf_num, attri, count) AS
SELECT leaf_num, attri, count
FROM UP
WHERE class = k
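The DOWN counterpart, assumed to mirror the UP view over the DOWN table:

CREATE VIEW Ck_DOWN(leaf_num, attri, count) AS
SELECT leaf_num, attri, count
FROM DOWN
WHERE class = k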
Database Implementation - GINI VALUE
Create a view for all gini values:
CREATE VIEW GINI_VALUE(leaf_num, attri, gini) AS
SELECT u1.leaf_num, u1.attri, <gini_split expression over the UP/DOWN counts>
FROM C1_UP u1, .., Cc_UP uc, C1_DOWN d1, .., Cc_DOWN dc
WHERE u1.attri = .. = uc.attri = .. = dc.attri
AND u1.leaf_num = .. = uc.leaf_num = .. = dc.leaf_num
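Purely as an illustration (this exact expression does not appear on the slides): for c = 2 classes, the elided gini expression could be spelled out using gini_split = (n_up/n)*gini(UP) + (n_down/n)*gini(DOWN):

CREATE VIEW GINI_VALUE (leaf_num, attri, gini) AS
SELECT u1.leaf_num, u1.attri,
       -- weight of the "up" partition (attri <= split) times its gini impurity
       ((u1.count + u2.count) * 1.0 /
        (u1.count + u2.count + d1.count + d2.count)) *
       (1.0 - (u1.count * u1.count + u2.count * u2.count) * 1.0 /
              ((u1.count + u2.count) * (u1.count + u2.count)))
       +
       -- weight of the "down" partition (attri > split) times its gini impurity
       ((d1.count + d2.count) * 1.0 /
        (u1.count + u2.count + d1.count + d2.count)) *
       (1.0 - (d1.count * d1.count + d2.count * d2.count) * 1.0 /
              ((d1.count + d2.count) * (d1.count + d2.count)))
FROM C1_UP u1, C2_UP u2, C1_DOWN d1, C2_DOWN d2
WHERE u1.attri = u2.attri AND u2.attri = d1.attri AND d1.attri = d2.attri
  AND u1.leaf_num = u2.leaf_num AND u2.leaf_num = d1.leaf_num
  AND d1.leaf_num = d2.leaf_num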
Database Implementation - MIN GINI VALUE
Create a table of minimum gini values for attribute i:
INSERT INTO MIN_GINI
SELECT leaf_num, i, attri, gini
FROM GINI_VALUE a
WHERE a.gini =
  (SELECT MIN(gini)
   FROM GINI_VALUE b
   WHERE a.leaf_num = b.leaf_num)
Database Implementation - BEST SPLIT
Create a view over MIN_GINI for the best split:
CREATE VIEW BEST_SPLIT
  (leaf_num, attr_name, attr_value) AS
SELECT leaf_num, attr_name, attr_value
FROM MIN_GINI a
WHERE a.gini =
  (SELECT MIN(gini)
   FROM MIN_GINI b
   WHERE a.leaf_num = b.leaf_num)
Database Implementation - Partitioning
Build new nodes by splitting the old nodes according to the BEST_SPLIT values.
Assign the correct node to each record:
• updating leaf_num is done by a user function
• no need to UPDATE the data or the DB
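A sketch of the idea; the function name leaf_of and its argument list are assumptions, not from the slides. Instead of updating leaf_num in DETAIL, a user-defined function computes the current leaf from the attribute values and the tree built so far, so the next round of DIMi inserts never modifies DETAIL:

INSERT INTO DIM1
SELECT leaf_of(attr1, attr2 /* , ... */), class, attr1, count(*)
FROM DETAIL
WHERE leaf_of(attr1, attr2 /* , ... */) <> STOP
GROUP BY leaf_of(attr1, attr2 /* , ... */), class, attr1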
Performance
I/O cost of MIND: [formula not transcribed]
I/O cost of SPRINT: [formula not transcribed]
Experimental Results
[Figure: normalized time to finish building the tree]
[Figure: normalized time to build the tree per example]
Experimental Results
[Figure: normalized time to build the tree per number of processors]
[Figure: time to build the tree by training set size]
Conclusions
• MIND works over a DB
• MIND works well because:
– MIND rephrases the classification task as a DB problem
– MIND avoids UPDATEs to the DETAIL table
– parallelism and scaling are achieved through the RDBMS
– MIND uses a user function to gain performance in the DIMi creation