Data Stream Classifiers
CS240B Notes
by
Carlo Zaniolo
UCLA CSD
With slides from an ICDE 2005 tutorial by
Haixun Wang, Jian Pei & Philip Yu
Classifiers
• The batch classification problem:
– Given a finite training set D = {(x, y)}, where y ∈ {y1, y2, …, yk} and |D| = n, find a
function y = f(x) that can predict the y value for an unseen instance x
• The data stream classification problem:
– Given an infinite sequence of pairs of the form (x, y), where y ∈ {y1, y2, …, yk},
find a function y = f(x) that can predict the y value for an unseen instance x
(a minimal sketch of the two settings follows the list of example applications below)
• Example applications:
– Fraud detection in credit card transactions
– Churn prediction in a telecommunication company
– Sentiment classification in the Twitter stream
– Topic classification in a news aggregation site, e.g., Google News
– …
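To make the contrast concrete, here is a minimal, illustrative sketch of the two interfaces in plain Python. BatchClassifier and StreamClassifier are hypothetical names used only for illustration, not part of any library.

```python
class BatchClassifier:
    def fit(self, D):
        """Train on the complete, finite training set D = [(x, y), ...]."""
        raise NotImplementedError

    def predict(self, x):
        raise NotImplementedError


class StreamClassifier:
    def learn_one(self, x, y):
        """Update the model with a single labeled example from the stream."""
        raise NotImplementedError

    def predict_one(self, x):
        raise NotImplementedError


# Batch setting: all n examples are available before training starts.
#   model = BatchClassifier(); model.fit(D); y_hat = model.predict(x_new)
#
# Stream setting: examples arrive forever; the model must stay usable at any
# point in time and cannot store or rescan the whole stream.
#   model = StreamClassifier()
#   for x, y in stream:                  # potentially infinite
#       y_hat = model.predict_one(x)     # predict before seeing the label
#       model.learn_one(x, y)            # then update incrementally
```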
Batch Classifiers: Splitting Attributes
• Basic algorithm (ID3, Quinlan 1986)
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At the start, all the training examples are at the root node
– Then select the best splitting attribute using an impurity measure (e.g.,
information gain or the Gini index), and recurse until the node is sufficiently pure
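As an illustration of the splitting step, the sketch below scores candidate categorical attributes by information gain on a small in-memory dataset. It is a simplified fragment, not the full ID3 recursion, and the helper names and the tiny dataset are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on categorical attribute `attr`."""
    n = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attr] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def best_split(rows, labels, attributes):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Tiny illustrative dataset: "outlook" perfectly separates the classes.
rows = [{"outlook": "sunny", "windy": "yes"}, {"outlook": "rain", "windy": "yes"},
        {"outlook": "sunny", "windy": "no"},  {"outlook": "rain", "windy": "no"}]
labels = ["no", "yes", "no", "yes"]
print(best_split(rows, labels, ["outlook", "windy"]))   # -> "outlook"
```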
Decision Tree Classifiers
A divide-and-conquer approach
Simple algorithm, intuitive model
Typically a decision tree grows one level for each scan of the data
Multiple scans are therefore required
But if small samples suffice, this problem disappears
However, the tree structure is not ‘stable’:
subtle changes in the data can cause global changes in the tree
Challenge #1 in on-line classifiers
How many samples do we need to build, in constant time, a tree that is
nearly identical to the tree a batch learner (C4.5, SPRINT, ...) would produce?
Nearly identical?
Categorical attributes:
with high probability, the attribute we choose for the split is
the same attribute as would be chosen by a batch learner,
hence an identical decision tree
Continuous attributes:
discretize them into categorical ones
...Forget concept shift/drift for now
Hoeffding Bound I
Also known as the additive Chernoff bound
Given
– r : a real-valued random variable with range R
– n : the number of independent observations of r
– r_avg : the observed mean of those n observations
The true mean of r does not differ from r_avg by more than ε,
with probability 1 - δ, where:
    ε = sqrt( R² ln(1/δ) / (2n) )
• This bound holds regardless of the distribution generating the values,
and depends only on the range of the values, the number of observations,
and the desired confidence.
– A disadvantage of being so general is that it is
more conservative than a distribution-dependent bound
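A small numerical sketch of the bound in plain Python (the function name is mine):

```python
import math

def hoeffding_epsilon(value_range, n, delta):
    """Hoeffding bound: with probability 1 - delta, the observed mean of n
    independent observations of a variable with range `value_range` is within
    epsilon of the true mean."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Example: information gain lies in [0, 1] for two classes, so R = 1.
for n in (100, 1_000, 10_000):
    print(n, round(hoeffding_epsilon(1.0, n, delta=1e-7), 4))
# epsilon shrinks as n grows, so waiting for more samples tightens the estimate
```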
Hoeffding Bound
Properties:
The Hoeffding bound is independent of the data distribution
The error ε decreases as n (the number of samples) increases
At each node, we therefore accumulate enough samples (n) before we make a split
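The sketch below shows how this is typically used to decide when a leaf has seen enough samples to split. The structure follows the standard Hoeffding-tree split test; the tie-breaking threshold tau is a common addition, and all names and numbers are illustrative.

```python
import math

def should_split(gain_best, gain_second, n, delta=1e-7, value_range=1.0, tau=0.05):
    """Split a leaf only when the observed gap between the best and the
    second-best attribute exceeds the Hoeffding epsilon, or when epsilon
    itself is below a small tie-breaking threshold tau."""
    eps = math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))
    gap = gain_best - gain_second
    return gap > eps or eps < tau

# e.g. after 800 samples at a leaf, with observed gains 0.32 vs 0.21:
print(should_split(0.32, 0.21, n=800))   # True: the gap (0.11) exceeds epsilon (~0.10)
```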
Hoeffding Trees
Scale better than traditional decision-tree algorithms
Incremental: nodes are created incrementally as new samples stream in
Sub-linear with sampling
Small memory requirement
Cons:
Only the top two attributes are compared at each leaf
Tie breaking takes time
Growing a deep tree takes time
Discrete attributes only
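A hedged usage sketch with the open-source river library, assuming its tree.HoeffdingTreeClassifier and the learn_one/predict_one API; the synthetic stream and its concept are invented for illustration only.

```python
import random
from river import tree   # assumes the `river` package is installed

model = tree.HoeffdingTreeClassifier()

def synthetic_stream(n):
    """Toy stream: label is 1 when x1 + x2 > 1.0 (an invented concept)."""
    for _ in range(n):
        x = {"x1": random.random(), "x2": random.random()}
        yield x, int(x["x1"] + x["x2"] > 1.0)

correct, seen = 0, 0
for x, y in synthetic_stream(20_000):
    y_pred = model.predict_one(x)      # predict before seeing the label...
    correct += int(y_pred == y)
    seen += 1
    model.learn_one(x, y)              # ...then update the tree incrementally

print(f"prequential accuracy: {correct / seen:.3f}")
```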
Nearly Identical?
Categorical attributes:
with high probability, the attribute we choose for the split
is the same attribute as would be chosen by a batch learner,
thus we obtain an identical decision tree
Continuous attributes: discretize them into categorical ones.
Concept Drift/Shift
Time-changing data streams:
incorporate new samples and eliminate the effect of old samples
Naïve approach:
place a sliding window on the stream and
reapply C4.5 or VFDT whenever the window moves
Time consuming!
[VFDT] Very Fast Decision Tree [Domingos, Hulten 2000]
Several improvements: faster and less memory
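A sketch of the naïve sliding-window approach described above, in plain Python; train_batch_tree stands in for any batch learner such as C4.5 and is purely hypothetical.

```python
from collections import deque

WINDOW_SIZE = 10_000
SLIDE = 1_000          # retrain after this many new examples

def train_batch_tree(window):
    """Placeholder for a batch learner (e.g., C4.5) retrained from scratch."""
    raise NotImplementedError

def naive_sliding_window(stream):
    window = deque(maxlen=WINDOW_SIZE)   # old examples fall off automatically
    model, new_since_retrain = None, 0
    for x, y in stream:
        window.append((x, y))
        new_since_retrain += 1
        if new_since_retrain >= SLIDE:
            model = train_batch_tree(list(window))   # full retrain: the costly part
            new_since_retrain = 0
        yield model
```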
CVFDT
Concept-adapting VFDT
Hulten, Spencer, Domingos, 2001
Goal
Classifying concept-drifting data streams
Approach
Make use of Hoeffding bound
Incorporate “windowing”
Monitor changes in the information gain of the attributes.
If the change reaches a threshold, grow an alternate subtree
with the new “best” attribute, but keep it in the background.
Replace the old subtree if the alternate becomes more accurate.
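A highly simplified sketch of the alternate-subtree bookkeeping (not the actual CVFDT code): both subtrees are scored on recent examples and the background one is promoted when it proves more accurate. The class, the predict(x) method of the subtrees, and the evaluation window are illustrative.

```python
class AlternateSubtreeMonitor:
    """Compares a current subtree with a background alternate on recent
    examples and promotes the alternate when it becomes more accurate."""

    def __init__(self, current, alternate, eval_window=1_000):
        self.current = current
        self.alternate = alternate
        self.eval_window = eval_window
        self.err_current = 0
        self.err_alternate = 0
        self.seen = 0

    def observe(self, x, y):
        """Score both subtrees on one labeled example; swap if warranted."""
        self.err_current += int(self.current.predict(x) != y)
        self.err_alternate += int(self.alternate.predict(x) != y)
        self.seen += 1
        if self.seen == self.eval_window:
            if self.err_alternate < self.err_current:
                self.current, self.alternate = self.alternate, self.current
            self.err_current = self.err_alternate = 0
            self.seen = 0
        return self.current
```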
Classifier Ensembles for Data Streams
Fast and light classifiers:
Naïve Bayesian: one pass to count occurrences (see the sketch below)
Sliding windows, tumbles and slides
Adaptive Nearest Neighbor Classification Algorithm (ANNCAD)
Ensembles of classifiers (decision trees or others):
Bagging ensembles and
Boosting ensembles
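As referenced above, here is a minimal sketch of a count-based streaming Naïve Bayes classifier for categorical features. The counters and Laplace smoothing are the standard construction; the class name is mine.

```python
import math
from collections import defaultdict

class StreamingNaiveBayes:
    """Categorical Naïve Bayes maintained with simple counters, so each
    example is processed in one pass and in constant time."""

    def __init__(self):
        self.class_counts = defaultdict(int)      # y -> count
        self.feature_counts = defaultdict(int)    # (y, attr, value) -> count
        self.total = 0

    def learn_one(self, x, y):
        self.class_counts[y] += 1
        self.total += 1
        for attr, value in x.items():
            self.feature_counts[(y, attr, value)] += 1

    def predict_one(self, x):
        best, best_score = None, float("-inf")
        for y, cy in self.class_counts.items():
            score = math.log(cy / self.total)     # log prior
            for attr, value in x.items():
                # Add-one smoothing; the "+ 2" assumes roughly binary attributes
                # (use the attribute's domain size in general).
                score += math.log((self.feature_counts[(y, attr, value)] + 1) / (cy + 2))
            if score > best_score:
                best, best_score = y, score
        return best
```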
Basic Ideas
The stream is partitioned into sequential chunks
Train a classifier from each chunk
The accuracy of voting ensembles is normally better than that of a single classifier
Method 1: Bagging
Weighted voting: weights are assigned to classifiers based
on their recent performance on the current test examples
Only the top K classifiers are used
Method 2: Boosting
Majority voting
Classifiers retired by age
Boosting used in training
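A sketch of the chunk-based, weighted-voting (bagging-style) scheme, assuming a generic batch base learner with a predict method. Weighting by recent accuracy and keeping the top K follow the description above; the helper names and the self-scoring of the newest member are simplifications of mine.

```python
def accuracy(model, chunk):
    """Fraction of examples in `chunk` that `model` predicts correctly."""
    return sum(model.predict(x) == y for x, y in chunk) / len(chunk)

def chunk_ensemble(stream_of_chunks, train_base_learner, k=10):
    """Train one classifier per chunk; weight each by its accuracy on the
    most recent chunk and keep only the top-k. Yields (classifiers, weights)."""
    classifiers = []
    for chunk in stream_of_chunks:
        # Score the existing members on the newest data before training on it.
        weights = [accuracy(c, chunk) for c in classifiers]
        classifiers.append(train_base_learner(chunk))
        weights.append(accuracy(classifiers[-1], chunk))   # optimistic self-score
        ranked = sorted(zip(weights, classifiers), key=lambda p: p[0], reverse=True)[:k]
        weights, classifiers = [w for w, _ in ranked], [c for _, c in ranked]
        yield classifiers, weights

def weighted_vote(classifiers, weights, x):
    """Predict the class with the largest total weight among the members' votes."""
    totals = {}
    for c, w in zip(classifiers, weights):
        label = c.predict(x)
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)
```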
Bagging Ensemble Method
Mining Streams with Concept Changes
Changes are detected by a drop in accuracy or by other methods
Build new classifiers on the new windows
Search among the old classifiers for those that have now become accurate again
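A short sketch of the "revive old classifiers" idea: retired members are re-scored on the newest window and brought back if they have become accurate again. The threshold and helper names are illustrative.

```python
def revive_old_classifiers(retired, recent_window, min_accuracy=0.75):
    """Return the retired classifiers whose accuracy on the most recent
    window is above `min_accuracy`; useful when an old concept recurs."""
    def acc(model):
        return sum(model.predict(x) == y for x, y in recent_window) / len(recent_window)
    return [m for m in retired if acc(m) >= min_accuracy]
```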
Boosting Ensembles for
Adaptive Mining of Data Streams
Andrea Fang Chu, Carlo Zaniolo
[PAKDD2004]
Mining Data Streams: Desiderata
Fast learning (preferably in one pass over the data)
Light requirements (low time complexity, low memory requirements)
Adaptation (the model always reflects the time-changing concept)
Adaptive Boosting Ensembles
• The training stream is split into blocks (i.e., windows)
• Each individual classifier is learned from a block
• A boosting ensemble of 7 to 19 members is maintained over time
• Decisions are taken by simple majority
• As the (N+1)-th classifier is built, boost the weight of the
tuples misclassified by the first N
• Change detection is used to achieve adaptation
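A simplified sketch of this training scheme (not the exact algorithm of the PAKDD 2004 paper): for each new block, the examples misclassified by the current ensemble get their sample weights boosted before the next weak learner is trained. The weak-learner factory, the boost factor, and the ensemble size are placeholders.

```python
def majority_vote(ensemble, x):
    """Unweighted majority vote of the ensemble members."""
    votes = {}
    for member in ensemble:
        label = member.predict(x)
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get) if votes else None

def train_adaptive_boosting(blocks, train_weak_learner, max_members=11, boost_factor=2.0):
    """Train one weak learner (e.g., a very shallow tree) per block, boosting
    the weights of the tuples the current ensemble gets wrong."""
    ensemble = []
    for block in blocks:                       # block = list of (x, y) pairs
        weights = [1.0] * len(block)
        for i, (x, y) in enumerate(block):
            if ensemble and majority_vote(ensemble, x) != y:
                weights[i] *= boost_factor     # emphasize the hard tuples
        ensemble.append(train_weak_learner(block, weights))
        if len(ensemble) > max_members:        # retire the oldest member
            ensemble.pop(0)
        yield ensemble
```

Keeping each weak learner shallow and each block small is what makes the scheme fast and light, as the next slide notes.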
Fast and Light
Experiments show that boosting ensembles of
“weak learners” provide accurate predictions
Weak learners:
• An aggressively pruned decision tree, e.g., a shallow tree (this means fast!)
• Trained on a small set of examples (this means light memory requirements!)
Adaptation: detect changes that cause significant drops
in ensemble performance
gradual changes: concept drift
abrupt changes: concept shift
Adaptability
The error rate is viewed as a random variable
When it deviates significantly from its recent average,
the whole ensemble is dropped
and a new one is quickly re-learned
(see the change-detection sketch below)
The cost/performance of boosting ensembles is better
than that of bagging ensembles [KDD04]
BUT ???
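A sketch of this kind of change detector: the per-block error rate is tracked as a running mean and variance, and a change is flagged when the newest error deviates from the recent average by more than a few standard deviations. The class name, the history length, and the 3-sigma threshold are illustrative choices, not the paper's exact test.

```python
import math
from collections import deque

class ErrorRateChangeDetector:
    """Flags a concept change when the latest block error rate deviates from
    the mean of the recent blocks by more than `k` standard deviations."""

    def __init__(self, history=20, k=3.0):
        self.errors = deque(maxlen=history)
        self.k = k

    def update(self, block_error_rate):
        """Return True if a change is detected on this block."""
        if len(self.errors) >= 5:                      # need a little history first
            mean = sum(self.errors) / len(self.errors)
            var = sum((e - mean) ** 2 for e in self.errors) / len(self.errors)
            if abs(block_error_rate - mean) > self.k * math.sqrt(var) + 1e-9:
                self.errors.clear()                    # restart statistics after a change
                self.errors.append(block_error_rate)
                return True
        self.errors.append(block_error_rate)
        return False

# Usage: when update(...) returns True, drop the ensemble and re-learn it
# from the next few blocks, as described above.
```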
References
Haixun Wang, Wei Fan, Philip S. Yu, Jiawei Han. Mining Concept-Drifting Data Streams
Using Ensemble Classifiers. ACM International Conference on Knowledge Discovery and
Data Mining (SIGKDD), 2003.
Pedro Domingos, Geoff Hulten. Mining High-Speed Data Streams. ACM International
Conference on Knowledge Discovery and Data Mining (SIGKDD), 2000.
Geoff Hulten, Laurie Spencer, Pedro Domingos. Mining Time-Changing Data Streams.
ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2001.
Wei Fan, Yi-an Huang, Haixun Wang, Philip S. Yu. Active Mining of Data Streams.
SIAM International Conference on Data Mining (SIAM DM), 2004.
Fang Chu, Yizhou Wang, Carlo Zaniolo. An Adaptive Learning Approach for Noisy Data
Streams. 4th IEEE International Conference on Data Mining (ICDM), 2004.
Fang Chu, Carlo Zaniolo. Fast and Light Boosting for Adaptive Mining of Data Streams.
PAKDD 2004: 282-292.
Yan-Nei Law, Carlo Zaniolo. An Adaptive Nearest Neighbor Classification Algorithm for
Data Streams. ECML/PKDD 2005, Porto, Portugal, October 3-7, 2005.