Lecture 7. Data Stream Mining: Building Decision Trees and Evaluation
1 / 44
Background: Learning and mining
Finding in what sense the data is not random
For example: frequently repeated patterns, correlations among attributes, one attribute being predictable from others, ...
Fix a description language a priori, and show that the data has a description more concise than the data itself.
2 / 44
Data stream classification cycle
1. Process one example at a time, and inspect it at most once
2. Use a limited amount of memory
3. Work in a limited amount of time
4. Be ready to predict at any point
3 / 44
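These four requirements suggest a per-example interface: the learner consumes one labelled example at a time and must be able to answer queries between any two of them. A minimal sketch in Python (the names StreamClassifier, learn_one, and predict_one are illustrative, not from any specific library):

    class StreamClassifier:
        """Interface for a one-pass, bounded-memory stream learner."""

        def learn_one(self, x, y):
            """Update the model with a single labelled example, then discard it."""
            raise NotImplementedError

        def predict_one(self, x):
            """Return a prediction; must be available at any point in the stream."""
            raise NotImplementedError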
Background: Learning and mining
[Diagram: DATA flows into a PROCESS, which produces a MODEL / PATTERNS, which drive ACTIONS]
4 / 44
Mining in Data Streams: What’s new?
Make one pass over the data
Use low memory:
  certainly sublinear in the data size
  in practice, low enough to fit in main memory, with no disk accesses
Use low processing time per item
Data evolves over time: nonstationary distribution
5 / 44
Two main approaches
Approach 1: the learner builds a model, perhaps in batch style
When change is detected, revise the model or rebuild it from scratch
6 / 44
Two main approaches (cont.)
Approach 2: keep accurate statistics of recent / relevant data,
e.g. with "intelligent counters"
The learner keeps the model in sync with these statistics
7 / 44
Decision Tree Learning
8 / 44
Background: Decision Trees
Powerful, intuitive classifiers
Good induction algorithms since the mid-1980s (C4.5, CART)
Many algorithms and variants now
9 / 44
Decision Trees
A decision tree is a classification model whose basic structure is a general tree:
Internal node: a test on one of the example's attribute values
Leaf node: a class label
Key idea:
1) pick an attribute to test at the root
2) divide the training data into subsets Di, one for each value the attribute can take
3) build a tree for each Di and splice it in under the appropriate branch at the root
10 / 44
Decision Trees
11 / 44
Top-down Induction, C4.5 Style
• Algorithm to build a decision tree (a minimal sketch follows below)
12 / 44
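The algorithm figure from this slide did not survive extraction. As a stand-in, here is a minimal sketch of C4.5-style top-down induction for discrete attributes, assuming in-memory data; it uses plain information gain rather than C4.5's gain ratio and omits pruning (function names are illustrative):

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a sequence of class labels."""
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        """Entropy reduction obtained by splitting on attribute attr."""
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[attr], []).append(y)
        remainder = sum(len(ys) / len(labels) * entropy(ys) for ys in groups.values())
        return entropy(labels) - remainder

    def build_tree(rows, labels, attrs):
        """Top-down induction: pick the best attribute, split the data, recurse."""
        if len(set(labels)) == 1 or not attrs:           # pure node, or no tests left
            return Counter(labels).most_common(1)[0][0]  # leaf: majority class
        best = max(attrs, key=lambda a: info_gain(rows, labels, a))
        branches = {}
        for value in {row[best] for row in rows}:
            idx = [i for i, row in enumerate(rows) if row[best] == value]
            branches[value] = build_tree([rows[i] for i in idx],
                                         [labels[i] for i in idx],
                                         [a for a in attrs if a != best])
        return (best, branches)                          # internal node: test + subtrees

Note that build_tree revisits the training data at every level of recursion, which is exactly what a streaming learner cannot afford.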
Top-down Induction, C4.5 Style
• Limitations
  • Classic decision tree learners assume all training data can be stored simultaneously in main memory.
  • Disk-based decision tree learners repeatedly read the training data from disk, sequentially.
• Goal
  • Design decision tree learners that read each example at most once, and use a small constant time to process it.
13 / 44
VFDT
P. Domingos and G. Hulten: "Mining high-speed data streams", KDD 2000.
A very influential paper
Very Fast induction of Decision Trees, a.k.a. Hoeffding trees
An algorithm for inducing decision trees in a streaming fashion
Does not deal with change over time
Does not store examples: memory use is independent of the data size
14 / 44
VFDT
In order to find the best attribute at a node, it may be sufficient to consider only a small subset of the training examples that pass through that node:
Given a stream of examples, use the first ones to choose the root attribute.
Once the root attribute is chosen, successive examples are passed down to the corresponding leaves and used to choose the attributes there, and so on, recursively.
Use the Hoeffding bound to decide how many examples are enough at each node.
15 / 44
VFDT
Crucial observation [DH00]
An almost-best attribute can be pinpointed quickly:
evaluate the gain function G(A, S) on the examples S seen so far, then use the Hoeffding bound.
Criterion
If Ai satisfies
  G(Ai, S) > G(Aj, S) + ε(|S|, δ) for every j ≠ i,
conclude "Ai is best" with probability 1 − δ.
16 / 44
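Here ε(|S|, δ) is the Hoeffding bound. For a gain function with range R (e.g. R = log2(number of classes) for information gain) and n = |S| examples seen so far,

    ε(n, δ) = sqrt( R² ln(1/δ) / (2n) )

so ε shrinks as more examples arrive, and the attribute declared best is wrong with probability at most δ.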
VFDT
• Calculate the information gain for the attributes and determine the best two, Xa and Xb
• At each node, check the condition
    ΔG = G(Xa) − G(Xb) > ε
• If the condition is satisfied, create child nodes based on the test at the node
• If not, stream in more examples and repeat the calculations until the condition is satisfied
17 / 44
VFDT
18 / 44
VFDT-like Algorithm
T := leaf with empty statistics
for t = 1, 2, ... do VFDT-Grow(T, xt)

VFDT-Grow(tree T, example x):
  run x from the root of T to a leaf L
  update the statistics on attribute values at L using x
  evaluate G(Ai, SL) for all i from the statistics at L
  if there is an i such that, for all j ≠ i,
    G(Ai, SL) > G(Aj, SL) + ε(|SL|, δ)
  then
    turn leaf L into a node labelled with Ai
    create a child of L for each value of Ai
    make each child a leaf with empty statistics
19 / 44
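A minimal runnable sketch of this loop in Python, for discrete attributes, keeping only counters at each leaf (no stored examples). It is simplified relative to real VFDT: no tie-breaking threshold, no grace period, and no comparison against the no-split option; all names are illustrative:

    import math
    from collections import defaultdict

    class Node:
        """A Hoeffding-tree node; it is a leaf until split_attr is set.
        Stores only sufficient statistics, never the examples themselves."""
        def __init__(self, attrs):
            self.attrs = list(attrs)
            self.split_attr = None
            self.children = {}
            self.n = 0
            self.class_counts = defaultdict(int)
            # counts[attr][value][label] -> number of examples seen at this leaf
            self.counts = {a: defaultdict(lambda: defaultdict(int)) for a in self.attrs}

    def entropy(dist):
        total = sum(dist.values())
        return -sum(c / total * math.log2(c / total) for c in dist.values() if c)

    def gain(node, attr):
        """Information gain G(attr, S_L), computed from the leaf's counters alone."""
        rem = sum(sum(d.values()) / node.n * entropy(d)
                  for d in node.counts[attr].values())
        return entropy(node.class_counts) - rem

    def vfdt_grow(root, x, y, delta=1e-6, R=1.0):
        """One step of the loop above: route x to a leaf, update stats, maybe split."""
        node = root
        while node.split_attr is not None:           # run x down to a leaf L
            v = x[node.split_attr]
            if v not in node.children:               # unseen value: fresh empty leaf
                node.children[v] = Node([a for a in node.attrs if a != node.split_attr])
            node = node.children[v]
        node.n += 1                                  # update statistics at L using x
        node.class_counts[y] += 1
        for a in node.attrs:
            node.counts[a][x[a]][y] += 1
        if node.n < 2 or len(node.attrs) < 2:
            return
        gains = sorted(((gain(node, a), a) for a in node.attrs), reverse=True)
        eps = math.sqrt(R * R * math.log(1 / delta) / (2 * node.n))
        if gains[0][0] - gains[1][0] > eps:          # Hoeffding split criterion
            best = gains[0][1]
            node.split_attr = best                   # leaf L becomes an inner node
            for v in node.counts[best]:              # a child per value seen so far
                node.children[v] = Node([a for a in node.attrs if a != best])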
Extensions of VFDT
IADEM [G. Ramos, J. del Campo, R. Morales-Bueno 2006]
  Better splitting and expanding criteria
  Margin-driven growth
VFDTc [J. Gama, R. Fernandes, R. Rocha 2006], UFFT [J. Gama, P. Medas 2005]
  Continuous attributes
  Naive Bayes at inner nodes and leaves
  Short-term memory window for detecting concept drift
  Converts inner nodes back to leaves and fills them with the window data
  Different splitting and expanding criteria
CVFDT [G. Hulten, L. Spencer, P. Domingos 2001]
20 / 44
CVFDT
G. Hulten, L. Spencer, P. Domingos: "Mining time-changing data streams", KDD 2001
• Concept-adapting VFDT
• Updates statistics at leaves and at inner nodes
• Main idea: when change is detected at a subtree, grow a candidate subtree
• Eventually, either the current subtree or the candidate subtree is dropped
• Classification at leaves is based on the most frequent class in a window of recent examples
21 / 44
CVFDT
CVFDT (Concept-adapting Very Fast Decision Tree learner)
Extends VFDT
Maintains VFDT's speed and accuracy
Detects and responds to changes in the example-generating process
22 / 44
CVFDT
With a time-changing concept, the current splitting attribute at some nodes may no longer be the best.
An outdated subtree may still be better than the best single leaf, particularly if it is near the root.
Grow an alternate subtree with the new best attribute at its root when the old attribute seems out of date.
Periodically use a batch of examples to evaluate the quality of the trees.
Replace the old subtree when the alternate one becomes more accurate.
23 / 44
CVFDT
24 / 44
CVFDT
25 / 44
CVFDT
No theoretical guarantees on the error rate of CVFDT
CVFDT parameters:
  W: the example window size
  T0: the number of examples used to check, at each node, whether the splitting attribute is still the best
  T1: the number of examples used to build the alternate tree
  T2: the number of examples used to test the accuracy of the alternate tree
26 / 44
CVFDT vs VFDT
27 / 44
CVFDT
VFDT:
  No concept drift
  No example memory
  No parameters but δ
  Rigorous performance guarantees
CVFDT:
  Concept drift
  Window of examples
  Several parameters besides δ
  No performance guarantees
28 / 44
Remember ADWIN
[Bifet, Gavaldà 2009] Adaptive Hoeffding Trees
Recall: the ADWIN algorithm
  detects changes in the mean of a data stream of numbers
  keeps a window W whose mean approximates the current mean
  uses memory and time O(log W)
29 / 44
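A much-simplified sketch of the idea, assuming values in [0, 1]: keep an explicit window, try every split point, and drop the oldest elements while the two halves' means differ by more than a Hoeffding-style threshold. The real ADWIN achieves the O(log W) bounds above with exponential bucket summaries; this toy version pays O(W²) per update:

    import math
    from collections import deque

    class SimpleAdwin:
        """Toy ADWIN-style change detector over an explicit window."""

        def __init__(self, delta=0.01):
            self.delta = delta
            self.window = deque()

        def update(self, x):
            """Add one value; shrink the window while some split looks inconsistent.
            Returns True if a change was detected (i.e. the window shrank)."""
            self.window.append(x)
            change = False
            while self._cut_found():
                self.window.popleft()
                change = True
            return change

        def _cut_found(self):
            w = list(self.window)
            n = len(w)
            for i in range(1, n):
                n0, n1 = i, n - i
                mu0, mu1 = sum(w[:i]) / n0, sum(w[i:]) / n1
                m = 1 / (1 / n0 + 1 / n1)        # harmonic mean of the two sizes
                eps = math.sqrt(math.log(4 * n / self.delta) / (2 * m))
                if abs(mu0 - mu1) > eps:         # the two halves disagree: change
                    return True
            return False

    # The current mean estimate is simply the mean of the kept window:
    # d = SimpleAdwin(); d.update(x); est = sum(d.window) / len(d.window)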
Hoeffding Adaptive Trees
Hoeffding Adaptive Tree:
• replace the frequency-statistics counters by estimators: no window of stored examples is needed, since the estimators maintain the statistics required
• change the way the substitution of alternate subtrees is checked, using a change detector with theoretical guarantees
• Theoretical guarantees
• No parameters
30 / 44
HAT has no parameters!
When to start growing alternate trees?
• When ADWIN says "the error rate is increasing", or
• When an ADWIN for a counter says "the attribute statistics are changing"
How to start growing the new tree?
• Use the accurate estimates from the ADWINs at the parent: no window needed
When to tell that the alternate tree is better?
• Use ADWIN's error estimates to decide
How to answer at leaves?
• Use the accurate estimates from the ADWINs at the leaf: no window needed
31 / 44
Experiments
32 / 44
Hoeffding Adaptive Trees: Summary
No “magic” parameters. Self-adapts to change
Always as accurate as CVFDT, and sometimes much
better
Less memory - no example window
Moderate overhead in time (<50%). Working on it
Rigorous guarantees possible
33 / 44
Evaluation
1. Error estimation: Hold-out or Prequential
2. Evaluation performance measures: Accuracy or κ-statistic
3. Statistical significance validation: McNemar or Nemenyi test
34 / 44
Error Estimation (Holdout)
When data is available for testing:
Hold out an independent test set
Apply the current decision model to the test set at regular time intervals
The loss estimated on the holdout set is an unbiased estimator of the true loss
35 / 44
Error Estimation (Prequential)
When no separate data is available for testing:
The error of a model is computed from the sequence of examples itself.
For each example in the stream, the current model first makes a prediction, and the example is then used to update the model.
36 / 44
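A minimal sketch of this interleaved test-then-train loop, reusing the illustrative learn_one/predict_one interface sketched earlier:

    def prequential_accuracy(model, stream):
        """Test-then-train: predict on each example before learning from it."""
        correct = total = 0
        for x, y in stream:
            if model.predict_one(x) == y:   # test first, while (x, y) is still unseen
                correct += 1
            model.learn_one(x, y)           # then use the example to update the model
            total += 1
        return correct / total if total else 0.0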
Error Estimation
Hold-out or Prequential?
Hold-out is more accurate, but needs data for testing.
37 / 44
Evaluation performance measures
                   Correct Class+   Correct Class-   Total
Predicted Class+         75                7            82
Predicted Class-          8               10            18
Total                    83               17           100

Table: Simple confusion matrix example
38 / 44
Performance Measures with Unbalanced Classes
                   Correct Class+   Correct Class-   Total
Predicted Class+         75                7            82
Predicted Class-          8               10            18
Total                    83               17           100

Table: Simple confusion matrix example

                   Correct Class+   Correct Class-   Total
Predicted Class+       68.06            13.94           82
Predicted Class-       14.94             3.06           18
Total                    83               17           100

Table: Confusion matrix for chance predictor
39 / 44
Performance Measures with Unbalanced Classes
Kappa Statistic
p0: the classifier's prequential accuracy
pc: the probability that a chance classifier makes a correct prediction
κ statistic:
  κ = (p0 − pc) / (1 − pc)
κ = 1 if the classifier is always correct
κ = 0 if the predictions coincide with the correct ones as often as those of the chance classifier
Forgetting mechanism for estimating prequential kappa:
  a sliding window of size w with the most recent observations
40 / 44
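Worked through on the confusion matrix from the previous slides (a small Python check; all numbers come from the tables above):

    # Rows = predicted class, columns = correct class, as in the tables
    tp, fp = 75, 7        # predicted +: 75 actually +, 7 actually -
    fn, tn = 8, 10        # predicted -: 8 actually +, 10 actually -
    n = tp + fp + fn + tn                     # 100 examples

    p0 = (tp + tn) / n                        # observed accuracy: 0.85
    # Chance classifier: keeps the prediction rates (82%, 18%) but guesses
    pc = (tp + fp) / n * (tp + fn) / n + (fn + tn) / n * (fp + tn) / n
    # pc = 0.82 * 0.83 + 0.18 * 0.17 = 0.7112, matching the chance-predictor table
    kappa = (p0 - pc) / (1 - pc)              # (0.85 - 0.7112) / 0.2888 ≈ 0.48

So despite 85% accuracy, κ ≈ 0.48: much of the apparent accuracy is also available to a chance predictor on these unbalanced classes.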
Cost Evaluation Example
               Accuracy   Time   Memory
Classifier A      70%      100      20
Classifier B      80%       20      40
Which classifier is performing better?
41 / 44
RAM-Hours
RAM-Hour:
  every GB of RAM deployed for 1 hour
  (cf. cloud computing rental cost options)
42 / 44
Cost Evaluation Example
               Accuracy   Time   Memory   RAM-Hours
Classifier A      70%      100      20       2,000
Classifier B      80%       20      40         800
Which classifier is performing better?
43 / 44
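Assuming time is measured in hours and memory in GB, the RAM-Hours column is just their product:

Classifier A: 100 hours × 20 GB = 2,000 RAM-Hours
Classifier B: 20 hours × 40 GB = 800 RAM-Hours

So under this measure, Classifier B is both more accurate and cheaper to run.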
Evaluation
• Error estimation: Hold-out or Prequential
• Evaluation performance measures: Accuracy or κ-statistic
• Resources needed: time and memory, or RAM-Hours
44 / 44