Metodologie per Sistemi
Intelligenti
Prof. Pier Luca Lanzi
SLIQ
A fast Scalable Classifier for Data Mining
Alessandro Turcarelli
matr. 674408
Abstract
SLIQ is a decision-tree classifier that can handle both numerical and categorical attributes
It builds compact and accurate trees
It uses a pre-sorting technique in the tree-growing phase and an inexpensive pruning algorithm
It is suitable for classifying large disk-resident datasets, independently of the number of classes, attributes and records
Decision-Tree Classification
Tree Building
MakeTree(Training Data T)
    Partition(T)

Partition(Data S)
    if (all points in S are in the same class) then
        return
    Evaluate splits for each attribute A
    Use best split to partition S into S1 and S2
    Partition(S1)
    Partition(S2)
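The recursion above can be sketched in Python on a toy single-attribute data set (a minimal illustration using the gini index as the splitting criterion; attribute lists, pre-sorting and disk residency, which are SLIQ's actual contributions, are omitted here):

```python
def gini(rows):
    """gini(T) = 1 - sum of squared relative class frequencies."""
    counts = {}
    for _, label in rows:
        counts[label] = counts.get(label, 0) + 1
    n = len(rows)
    return 1 - sum((c / n) ** 2 for c in counts.values())

def best_split(rows):
    """Try every test (A <= v) on the single numeric attribute and
    keep the split with the lowest weighted gini index."""
    best = None
    for v, _ in rows:
        s1 = [r for r in rows if r[0] <= v]
        s2 = [r for r in rows if r[0] > v]
        if not s1 or not s2:
            continue  # degenerate split, skip it
        score = (len(s1) * gini(s1) + len(s2) * gini(s2)) / len(rows)
        if best is None or score < best[0]:
            best = (score, v, s1, s2)
    return best[1], best[2], best[3]

def partition(rows):
    """Recursive tree growing, mirroring the Partition pseudocode."""
    if gini(rows) == 0:                  # all points are in the same class
        return {"leaf": rows[0][1]}
    v, s1, s2 = best_split(rows)         # use best split to partition rows
    return {"test": v, "left": partition(s1), "right": partition(s2)}
```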
Splitting Index
The gini index is used to evaluate the “goodness” of
the alternative splits for an attribute
If a data set T contains examples from n classes, gini(T) is defined as
    gini(T) = 1 − Σ_j p_j²
where p_j is the relative frequency of class j in T
After splitting T into two subsets T1 and T2, the gini index of the split data is defined as
    gini_split(T) = (|T1| / |T|) · gini(T1) + (|T2| / |T|) · gini(T2)
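The two definitions can be checked with a small worked example (the class counts are illustrative):

```python
def gini(class_counts):
    """gini(T) = 1 - sum of squared relative class frequencies."""
    n = sum(class_counts)
    return 1 - sum((c / n) ** 2 for c in class_counts)

def gini_split(counts1, counts2):
    """Weighted gini index of the two subsets T1 and T2 after a split."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

# A pure subset has gini 0; an even two-class subset has gini 0.5
print(gini([4, 0]))                   # 0.0
print(gini([2, 2]))                   # 0.5
print(gini_split([4, 0], [2, 2]))     # 0.25
```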
Tree Building Details:
Pre-Sorting
The first technique implemented by SLIQ is a scheme that eliminates the need to sort the data at each node
It creates a separate, pre-sorted list for each attribute of the training data
A separate list, called the class list, is created for the class labels attached to the examples
SLIQ only requires that the class list and a single attribute list fit in memory at any one time
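These data structures can be sketched as follows (the field names and toy data are illustrative, not the on-disk layout used by SLIQ):

```python
# Toy training examples: (age, class label) — purely illustrative
examples = [(30, "B"), (23, "B"), (40, "G"), (55, "G"), (55, "B")]

# Class list: one entry per example, holding its class label and the
# leaf node it currently belongs to (every example starts at the root, 0)
class_list = [{"label": label, "leaf": 0} for _, label in examples]

# Attribute list for "age": (value, index into the class list),
# sorted ONCE by value — this is the pre-sorting step
age_list = sorted((value, i) for i, (value, _) in enumerate(examples))

print(age_list)   # [(23, 1), (30, 0), (40, 2), (55, 3), (55, 4)]
```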
Example of Pre-Sorting
Tree Building Details:
Evaluating Splits
This is the algorithm used for split evaluation:

EvaluateSplits()
    for each attribute A do
        traverse attribute list of A
        for each value v in the attribute list do
            find the corresponding entry in the class list, and hence
                the corresponding class and the leaf node l
            update the class histogram in the leaf l
            if A is a numeric attribute then
                compute splitting index for test (A ≤ v) for l
        if A is a categorical attribute then
            for each leaf of the tree do
                find subset of A with best split
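For a numeric attribute, the traversal can be sketched as a single pass over the pre-sorted attribute list, moving each value's class count from an "above" histogram to a "below" histogram and scoring the test A ≤ v (a simplified sketch that assumes all entries belong to one leaf; SLIQ keeps one histogram pair per leaf):

```python
def gini(counts):
    """Gini index from a class histogram (label -> count)."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1 - sum((c / n) ** 2 for c in counts.values())

def best_numeric_split(attr_list, class_list):
    """One pass over a pre-sorted attribute list: for each value v,
    move its class count from 'above' to 'below' and score A <= v."""
    above = {}
    for _, idx in attr_list:
        label = class_list[idx]["label"]
        above[label] = above.get(label, 0) + 1
    below = {}
    n = len(attr_list)
    best_v, best_score = None, float("inf")
    for value, idx in attr_list:
        label = class_list[idx]["label"]
        below[label] = below.get(label, 0) + 1
        above[label] -= 1
        n_below = sum(below.values())
        if n_below == n:
            break                    # empty right side: not a valid split
        score = (n_below * gini(below) + (n - n_below) * gini(above)) / n
        if score < best_score:
            best_v, best_score = value, score
    return best_v, best_score
```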
Evaluating Splits: Example
Tree Building Details:
Updating the Class List
This is the algorithm used for updating the class list:

UpdateLabels()
    for each attribute A used in a split do
        traverse attribute list of A
        for each value v in the attribute list do
            find the corresponding entry e in the class list
            find the new class c to which v belongs by applying
                the splitting test at the node referenced from e
            update the class label of e to c
            update the node referenced in e to the child
                corresponding to the class c
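A simplified in-memory sketch of this update (the `nodes` table and the field names are hypothetical; only the leaf-pointer update for a numeric split is shown):

```python
def update_labels(attr_list, class_list, nodes):
    """After splitting, re-point each class-list entry to the child
    its attribute value falls into. `nodes` maps a node id to its
    split threshold and (left, right) child ids."""
    for value, idx in attr_list:
        entry = class_list[idx]
        node = nodes[entry["leaf"]]
        # Apply the split test A <= v at the node referenced from the entry
        if value <= node["threshold"]:
            entry["leaf"] = node["left"]
        else:
            entry["leaf"] = node["right"]

# Root node 0 was split on age <= 30 into children 1 and 2 (toy data)
nodes = {0: {"threshold": 30, "left": 1, "right": 2}}
class_list = [{"label": "B", "leaf": 0}, {"label": "G", "leaf": 0}]
update_labels([(23, 0), (40, 1)], class_list, nodes)
print([e["leaf"] for e in class_list])   # [1, 2]
```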
Class List Update Example
Optimizations
The best split for a large-cardinality categorical attribute is computed by a greedy algorithm
When a node becomes pure, no further splits are required. Note that some nodes become pure earlier than others, so it is better to condense the attribute lists by discarding entries that correspond to examples belonging to pure nodes
Thanks to pre-sorting, SLIQ is able to scale to large data sets with no loss in accuracy. This is because the set of splits evaluated with or without pre-sorting is identical
Tree Pruning
SLIQ uses a post-pruning algorithm based on the Minimum Description Length (MDL) principle
The MDL principle states that the best model for describing the data is the one that minimizes the total description cost in bits
cost(M, D) = cost(D|M) + cost(M)
where cost(M) is the cost of encoding the model and cost(D|M) is the cost of encoding the data using the model M
Encoding Costs
Data Encoding: the cost of encoding a training set with a decision tree is defined as the sum of the classification errors. This count is collected during the building phase
Model Encoding: a node in a decision tree can be encoded in three ways:
A node can have zero or two children. Node cost is 1 bit
A node can have zero, one (left or right) or two children. Node cost is 2 bits
Only internal nodes are examined, so a node can have one or two children. Node cost is log(3) bits
Encoding Costs
Model Encoding: the cost of a split in a decision tree depends on the type of the attribute tested in the split:
Numeric Attributes: the split is of the form A ≤ v, where A is a numeric attribute and v is a real number, so the cost of encoding the split is the overhead of encoding v. A constant value of 1 is assumed throughout the tree
Categorical Attributes: the cost of the split depends on the cardinality of the attribute. If A is the set of possible values of the attribute, then C_split = ln |A|
Pruning
MDL pruning evaluates the code length at each node to determine whether to prune one child, prune both children, or leave the node intact
Costs are calculated as follows
C_leaf(t) = L(t) + Errors_t
C_both(t) = L(t) + L_split + C(t1) + C(t2)
C_left(t) = L(t) + L_split + C(t1) + C'(t2)
C_right(t) = L(t) + L_split + C'(t1) + C(t2)
where C'(ti) represents the cost of encoding the child ti using the parent's statistics
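Choosing among these options amounts to taking the minimum code length (a toy sketch; the cost values are made up for illustration):

```python
def prune_choice(c_leaf, c_both, c_left, c_right):
    """Pick the pruning option with the shortest code length."""
    options = {"leaf": c_leaf, "both": c_both,
               "left": c_left, "right": c_right}
    return min(options, key=options.get)

# Hypothetical costs in bits: converting the node to a leaf is cheapest
print(prune_choice(c_leaf=12.0, c_both=15.5, c_left=13.2, c_right=14.1))  # leaf
```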
Pruning
There are three pruning strategies
Full: if C_leaf < C_both then the node is pruned and converted into a leaf
Partial: this strategy uses all four costs described above and chooses the option with the shortest code length
Hybrid: first uses the full method to obtain a smaller tree, then considers only C_both, C_left and C_right to further prune the tree
Benchmarks
SLIQ has been tested on these data sets
Benchmarks
These tables show pruning strategy comparison
Benchmarks: comparison with other methods
Scalability
These graphs show the scalability property of SLIQ
Conclusions
As its authors state, SLIQ proves to be a fast, low-cost and scalable classifier that builds accurate trees
An empirical performance evaluation shows that, compared to other classifiers, SLIQ achieves comparable accuracy but produces smaller decision trees and has shorter classification times
C5.0 was not compared with SLIQ because it had not been developed yet (the paper dates from 1996)
References
Manish Mehta, Rakesh Agrawal, Jorma Rissanen: “SLIQ: A Fast
Scalable Classifier for Data Mining”
Pier Luca Lanzi: “Classification: Decision Trees”