Metodologie per Sistemi
Intelligenti
Prof. Pier Luca Lanzi
SLIQ
A fast Scalable Classifier for Data Mining
Alessandro Turcarelli
matr. 674408
Abstract
SLIQ is a decision-tree classifier that can handle both numerical and categorical attributes
It builds compact and accurate trees
It uses a pre-sorting technique in the tree-growing phase and an inexpensive pruning algorithm
It is suitable for classifying large disk-resident datasets, independently of the number of classes, attributes and records
Decision-Tree Classification
Tree Building
MakeTree(Training Data T)
    Partition(T)

Partition(Data S)
    if (all points in S are in the same class) then
        return
    Evaluate splits for each attribute A
    Use best split to partition S into S1 and S2
    Partition(S1)
    Partition(S2)
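The recursion above can be sketched in Python on a toy single-attribute data set (a minimal illustration using the gini index as the splitting criterion; attribute lists, pre-sorting and disk residency, which are SLIQ's actual contributions, are omitted here):

```python
def gini(rows):
    """gini(T) = 1 - sum of squared relative class frequencies."""
    counts = {}
    for _, label in rows:
        counts[label] = counts.get(label, 0) + 1
    n = len(rows)
    return 1 - sum((c / n) ** 2 for c in counts.values())

def best_split(rows):
    """Try every test (A <= v) on the single numeric attribute and
    keep the split with the lowest weighted gini index."""
    best = None
    for v, _ in rows:
        s1 = [r for r in rows if r[0] <= v]
        s2 = [r for r in rows if r[0] > v]
        if not s1 or not s2:
            continue  # degenerate split, skip it
        score = (len(s1) * gini(s1) + len(s2) * gini(s2)) / len(rows)
        if best is None or score < best[0]:
            best = (score, v, s1, s2)
    return best[1], best[2], best[3]

def partition(rows):
    """Recursive tree growing, mirroring the Partition pseudocode."""
    if gini(rows) == 0:                  # all points are in the same class
        return {"leaf": rows[0][1]}
    v, s1, s2 = best_split(rows)         # use best split to partition rows
    return {"test": v, "left": partition(s1), "right": partition(s2)}
```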
Splitting Index
The gini index is used to evaluate the “goodness” of
the alternative splits for an attribute
If a data set T contains examples from n classes, gini(T) is defined as
    gini(T) = 1 − Σ_j p_j²
where p_j is the relative frequency of class j in T
After splitting T into two subsets T1 and T2, the gini index of the split data is defined as
    gini_split(T) = (|T1| / |T|) · gini(T1) + (|T2| / |T|) · gini(T2)
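The two definitions can be checked with a small worked example (the class counts are illustrative):

```python
def gini(class_counts):
    """gini(T) = 1 - sum of squared relative class frequencies."""
    n = sum(class_counts)
    return 1 - sum((c / n) ** 2 for c in class_counts)

def gini_split(counts1, counts2):
    """Weighted gini index of the two subsets T1 and T2 after a split."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

# A pure subset has gini 0; an even two-class subset has gini 0.5
print(gini([4, 0]))                   # 0.0
print(gini([2, 2]))                   # 0.5
print(gini_split([4, 0], [2, 2]))     # 0.25
```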
Tree Building Details:
Pre-Sorting
The first technique implemented by SLIQ is a scheme that eliminates the need to sort the data at each node
It creates a separate, pre-sorted list for each attribute of the training data
A separate list, called the class list, is created for the class labels attached to the examples
SLIQ only requires that the class list and a single attribute list fit in memory at any one time
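These data structures can be sketched as follows (the field names and toy data are illustrative, not the on-disk layout used by SLIQ):

```python
# Toy training examples: (age, class label) — purely illustrative
examples = [(30, "B"), (23, "B"), (40, "G"), (55, "G"), (55, "B")]

# Class list: one entry per example, holding its class label and the
# leaf node it currently belongs to (every example starts at the root, 0)
class_list = [{"label": label, "leaf": 0} for _, label in examples]

# Attribute list for "age": (value, index into the class list),
# sorted ONCE by value — this is the pre-sorting step
age_list = sorted((value, i) for i, (value, _) in enumerate(examples))

print(age_list)   # [(23, 1), (30, 0), (40, 2), (55, 3), (55, 4)]
```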
Example of Pre-Sorting
Tree Building Details:
Evaluating Splits
This is the algorithm used for split evaluation:

EvaluateSplits()
    for each attribute A do
        traverse attribute list of A
        for each value v in the attribute list do
            find the corresponding entry in the class list, and hence
                the corresponding class and the leaf node l
            update the class histogram in the leaf l
            if A is a numeric attribute then
                compute splitting index for test (A ≤ v) for l
        if A is a categorical attribute then
            for each leaf of the tree do
                find subset of A with best split
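For a numeric attribute, the traversal can be sketched as a single pass over the pre-sorted attribute list, moving each value's class count from an "above" histogram to a "below" histogram and scoring the test A ≤ v (a simplified sketch that assumes all entries belong to one leaf; SLIQ keeps one histogram pair per leaf):

```python
def gini(counts):
    """Gini index from a class histogram (label -> count)."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1 - sum((c / n) ** 2 for c in counts.values())

def best_numeric_split(attr_list, class_list):
    """One pass over a pre-sorted attribute list: for each value v,
    move its class count from 'above' to 'below' and score A <= v."""
    above = {}
    for _, idx in attr_list:
        label = class_list[idx]["label"]
        above[label] = above.get(label, 0) + 1
    below = {}
    n = len(attr_list)
    best_v, best_score = None, float("inf")
    for value, idx in attr_list:
        label = class_list[idx]["label"]
        below[label] = below.get(label, 0) + 1
        above[label] -= 1
        n_below = sum(below.values())
        if n_below == n:
            break                    # empty right side: not a valid split
        score = (n_below * gini(below) + (n - n_below) * gini(above)) / n
        if score < best_score:
            best_v, best_score = value, score
    return best_v, best_score
```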
Evaluating Splits: Example
Tree Building Details:
Updating the Class List
This is the algorithm used for updating the class list:

UpdateLabels()
    for each attribute A used in a split do
        traverse attribute list of A
        for each value v in the attribute list do
            find the corresponding entry e in the class list
            find the new class c to which v belongs by applying
                the splitting test at the node referenced from e
            update the class label of e to c
            update the node referenced in e to the child
                corresponding to the class c
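A simplified in-memory sketch of this update (the `nodes` table and the field names are hypothetical; only the leaf-pointer update for a numeric split is shown):

```python
def update_labels(attr_list, class_list, nodes):
    """After splitting, re-point each class-list entry to the child
    its attribute value falls into. `nodes` maps a node id to its
    split threshold and (left, right) child ids."""
    for value, idx in attr_list:
        entry = class_list[idx]
        node = nodes[entry["leaf"]]
        # Apply the split test A <= v at the node referenced from the entry
        if value <= node["threshold"]:
            entry["leaf"] = node["left"]
        else:
            entry["leaf"] = node["right"]

# Root node 0 was split on age <= 30 into children 1 and 2 (toy data)
nodes = {0: {"threshold": 30, "left": 1, "right": 2}}
class_list = [{"label": "B", "leaf": 0}, {"label": "G", "leaf": 0}]
update_labels([(23, 0), (40, 1)], class_list, nodes)
print([e["leaf"] for e in class_list])   # [1, 2]
```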
Class List Update Example
Optimizations
The best split for a large-cardinality categorical attribute is computed by a greedy algorithm
When a node becomes pure, no further splits are required. Note that some nodes become pure earlier than others, so it is better to condense the attribute lists by discarding entries that correspond to examples belonging to pure nodes
Thanks to pre-sorting, SLIQ is able to scale to large data sets with no loss in accuracy. This is because the set of splits evaluated with or without pre-sorting is identical
Tree Pruning
SLIQ uses a post-pruning algorithm based on the Minimum Description Length (MDL) principle
The MDL principle states that the best model for describing the data is the one that minimizes the total description cost in bits
cost(M, D) = cost(D|M) + cost(M)
where cost(M) is the cost of encoding the model and cost(D|M) is the cost of encoding the data using the model M
Encoding Costs
Data Encoding: the cost of encoding a training set with a decision tree is defined as the sum of the classification errors. This count is collected during the building phase
Model Encoding: a node in a decision tree can be encoded in three ways:
A node can have zero or two children. Node cost is 1 bit
A node can have zero, one (left or right) or two children. Node cost is 2 bits
Only internal nodes are examined, so a node can have one or two children. Node cost is log(3) bits
Encoding Costs
Model Encoding: the cost of a split in a decision tree depends on the type of the attribute tested in the split:
Numeric Attributes: the split is of the form A ≤ v, where A is a numeric attribute and v is a real number, so the cost of encoding the split is the overhead of encoding v. A constant value of 1 is assumed throughout the tree
Categorical Attributes: the cost of the split depends on the cardinality of the attribute. If A is the set of possible values of the attribute, then C_split = ln |A|
Pruning
MDL pruning evaluates the code length at each node to determine whether to prune one child, prune both children, or leave the node intact
Costs are calculated as follows
C_leaf(t) = L(t) + Errors_t
C_both(t) = L(t) + L_split + C(t1) + C(t2)
C_left(t) = L(t) + L_split + C(t1) + C'(t2)
C_right(t) = L(t) + L_split + C'(t1) + C(t2)
where C'(ti) represents the cost of encoding the child ti using the parent's statistics
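Choosing among these options amounts to taking the minimum code length (a toy sketch; the cost values are made up for illustration):

```python
def prune_choice(c_leaf, c_both, c_left, c_right):
    """Pick the pruning option with the shortest code length."""
    options = {"leaf": c_leaf, "both": c_both,
               "left": c_left, "right": c_right}
    return min(options, key=options.get)

# Hypothetical costs in bits: converting the node to a leaf is cheapest
print(prune_choice(c_leaf=12.0, c_both=15.5, c_left=13.2, c_right=14.1))  # leaf
```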
Pruning
There are three pruning strategies
Full: if C_leaf < C_both then the node is pruned and converted into a leaf
Partial: this strategy uses all four costs described above and chooses the option with the shortest code length
Hybrid: first uses the full method to obtain a smaller tree, then considers only C_both, C_left and C_right to further prune the tree
Benchmarks
SLIQ has been tested on these data sets
Benchmarks
These tables show pruning strategy comparison
Benchmarks: comparison with other methods
Scalability
These graphs show the scalability property of SLIQ
Conclusions
As its authors state, SLIQ proves to be a fast, low-cost and scalable classifier that builds accurate trees
An empirical performance evaluation shows that, compared to other classifiers, SLIQ achieves comparable accuracy but produces smaller decision trees and has shorter classification times
C5.0 was not compared with SLIQ because it had not been developed yet (the paper dates from 1996)
References
Manish Mehta, Rakesh Agrawal, Jorma Rissanen: “SLIQ: A Fast
Scalable Classifier for Data Mining”
Pier Luca Lanzi: “Classification: Decision Trees”