Data Streams Classifiers
CS240B Notes
by
Carlo Zaniolo
UCLA CSD
With slides from an ICDE 2005 tutorial by
Haixun Wang, Jian Pei & Philip Yu
1
Classifiers
• The batch classification problem:
– Given a finite training set D = {(x, y)}, where y ∈ {y1, y2, …, yk} and |D| = n,
find a function y = f(x) that can predict the y value for an unseen instance x
• The data stream classification problem:
– Given an infinite sequence of pairs of the form (x, y), where y ∈ {y1, y2, …, yk},
find a function y = f(x) that can predict the y value for an unseen instance x
(a code sketch of this setting follows this slide)
• Example applications:
– Fraud detection in credit card transactions
– Churn prediction in a telecommunication company
– Sentiment classification in the Twitter stream
– Topic classification in a news aggregation site, e.g., Google News
– …
2
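To make the streaming setting above concrete, here is a minimal sketch of the test-then-train loop, assuming a hypothetical incremental classifier with `predict_one`/`learn_one` methods (the interface names are illustrative, not from a specific library):

```python
def process_stream(stream, model):
    """Prequential (interleaved test-then-train) evaluation of an incremental
    classifier over an unbounded stream of (x, y) pairs."""
    correct = total = 0
    for x, y in stream:                 # pairs arrive one at a time, potentially forever
        y_pred = model.predict_one(x)   # predict before the true label is revealed
        correct += (y_pred == y)
        total += 1
        model.learn_one(x, y)           # update the model incrementally; no second pass
    return correct / max(total, 1)
```

Unlike the batch setting, the model must be usable at any point and can never revisit past data unless it is explicitly buffered.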
Batch Classifiers: Splitting Attributes
• Basic algorithm (ID3, Quinlan 1986)
– The tree is constructed in a top-down, recursive,
divide-and-conquer manner
– At the start, all the training examples are at the root node
– Then the best attribute for splitting is selected using a purity criterion
such as information gain (or the Gini index), and splitting continues
until the node is sufficiently pure (see the sketch after this slide)
3
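As a concrete reference for the split criterion mentioned above, the sketch below computes entropy-based information gain and the Gini index for a categorical attribute from (attribute value, class label) pairs; it restates the textbook formulas and is not the ID3 implementation itself:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain obtained by splitting `labels` on the categorical attribute `values`."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Example: a perfectly separating attribute yields the maximum gain of 1.0 bit.
# information_gain(['a', 'a', 'b', 'b'], ['+', '+', '-', '-'])  # -> 1.0
```

ID3 picks the attribute with the highest information gain at each node; Gini-based learners (e.g., CART, SPRINT) minimize the weighted Gini index of the children instead.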
Decision Tree Classifiers
• A divide-and-conquer approach
– Simple algorithm, intuitive model
• Typically a decision tree grows one level for each scan of the data
– Multiple scans are required
– But if we can use small samples, this problem disappears
• But the data structure is not ‘stable’
– Subtle changes in the data can cause global changes in the tree structure
8
Challenge #1 in on-line classifiers
How many samples do we need to build, in constant time, a tree that is
nearly identical to the one a batch learner (C4.5, SPRINT, ...) would build?
Nearly identical?
• Categorical attributes:
– with high probability, the attribute we choose for the split is the
same attribute as would be chosen by the batch learner
→ identical decision tree
• Continuous attributes:
– discretize them into categorical ones
...Forget concept shift/drift for now
9
Hoeffding Bound I
• Also known as the additive Chernoff bound
• Given
– r : a real-valued random variable with range R
– n : the number of independent observations of r
• With probability 1−δ, the true mean of r does not differ from the
sample mean r_avg by more than ε, where:
ε = sqrt( R² ln(1/δ) / (2n) )
• This bound holds regardless of the distribution generating the values;
it depends only on the range of the values, the number of observations,
and the desired confidence (see the sketch after this slide).
– A disadvantage of being so general is that it is more conservative
than a distribution-dependent bound
10
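A minimal sketch of the bound as code, using the standard form ε = sqrt(R² ln(1/δ) / (2n)) given above:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the mean of n independent observations of a
    random variable with range `value_range` is within epsilon of the true mean."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Example: information gain on binary class labels lies in [0, 1], so R = 1.
# hoeffding_bound(1.0, delta=1e-7, n=1000)  # -> about 0.09
```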
Hoeffding Bound
Properties:
• The Hoeffding bound is independent of the data distribution
• The error ε decreases as n (the number of samples) increases
At each node, we accumulate enough samples (n) before we make a split
(the bound can be inverted to compute the n required for a target ε; see below).
11
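As an added worked example (not from the slides), the same formula can be inverted to estimate how many samples a node must accumulate before a split decision with error at most ε is justified:

```python
import math

def samples_needed(value_range, delta, epsilon):
    """Smallest n for which the Hoeffding bound guarantees error <= epsilon."""
    return math.ceil((value_range ** 2) * math.log(1.0 / delta) / (2.0 * epsilon ** 2))

# samples_needed(1.0, delta=1e-7, epsilon=0.05)  # -> 3224 observations
```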
Hoeffding Trees
Pros:
• Scales better than traditional decision-tree algorithms
– Incremental: the nodes are created incrementally as new samples stream in
– Sub-linear with sampling
– Small memory requirement
Cons:
• Only considers the top two attributes when testing a split (see the
sketch after this slide)
• Tie breaking takes time
• Growing a deep tree takes time
• Discrete attributes only
12
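The first con above refers to the way Hoeffding trees decide splits: at each leaf only the best and second-best attributes are compared, and the leaf is split once their gain difference exceeds the Hoeffding ε (with a tie-breaking threshold for near-equal attributes). Below is a simplified sketch of that decision rule alone, assuming the gains have already been computed from the statistics held at the leaf; the parameter names (`delta`, `tau`) are illustrative:

```python
import math

def should_split(gains, n, delta=1e-7, tau=0.05, value_range=1.0):
    """Hoeffding-tree style split test for a leaf that has seen n examples.

    gains : dict mapping attribute -> heuristic gain (e.g., information gain)
    Returns the attribute to split on, or None to keep accumulating samples.
    """
    if len(gains) < 2:
        return None
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    (best_attr, g_best), (_, g_second) = ranked[0], ranked[1]
    eps = math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))
    if g_best - g_second > eps:   # best beats second best with probability 1 - delta
        return best_attr
    if eps < tau:                 # the two are nearly tied; waiting longer will not help
        return best_attr
    return None                   # not enough evidence yet
```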
Nearly Identical?
• Categorical attributes:
– with high probability, the attribute we choose for the split is the
same attribute as would be chosen by a batch learner
– thus we obtain an identical decision tree
• Continuous attributes:
– discretize them into categorical ones.
13
Concept Drift/Shift
• Time-changing data streams
– Incorporate new samples and eliminate the effect of old samples
• Naïve approach (sketched after this slide)
– Place a sliding window on the stream
– Reapply C4.5 or VFDT whenever the window moves
– Time consuming!
• [VFDT] Very Fast Decision Tree [Domingos, Hulten 2000]
– Several improvements: faster and less memory
14
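A minimal sketch of the naïve sliding-window approach described above, with a hypothetical batch `train_tree` function standing in for C4.5 or a batch-run VFDT; the full retraining at every window move is exactly what makes it time consuming:

```python
from collections import deque

def naive_window_classifier(stream, train_tree, window_size=10_000, slide=1_000):
    """Keep only the most recent `window_size` examples and rebuild the tree
    from scratch every `slide` arrivals. `train_tree(examples)` is a placeholder
    for any batch learner."""
    window = deque(maxlen=window_size)
    model, seen = None, 0
    for x, y in stream:
        window.append((x, y))
        seen += 1
        if seen % slide == 0:
            model = train_tree(list(window))   # full batch retraining: the costly step
        yield model                            # the currently deployed model
```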
CVFDT
• Concept-adapting VFDT
– Hulten, Spencer, Domingos, 2001
• Goal
– Classifying concept-drifting data streams
• Approach
– Make use of the Hoeffding bound
– Incorporate “windowing”
– Monitor changes in the information gain of the attributes.
– If the change reaches a threshold, grow an alternate subtree with the
new “best” attribute, but keep it in the background.
– Replace the old subtree if the new one becomes more accurate
(see the sketch after this slide).
16
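A highly simplified sketch of the alternate-subtree idea only (not the actual CVFDT algorithm; the node fields, counters, and threshold below are illustrative assumptions):

```python
class Node:
    """Toy internal node: remembers its split attribute and may carry one
    background 'alternate' subtree grown on a newly best attribute."""
    def __init__(self, split_attr):
        self.split_attr = split_attr
        self.alternate = None      # background subtree; not used for prediction yet
        self.err_main = 0          # recent errors of the current subtree
        self.err_alt = 0           # recent errors of the alternate subtree

def maybe_swap(node, current_best_attr, grow_subtree, min_evidence=200):
    """Grow an alternate subtree when another attribute becomes best, and swap
    it in once it has proven more accurate on recent examples."""
    if current_best_attr != node.split_attr and node.alternate is None:
        node.alternate = grow_subtree(current_best_attr)
    if node.alternate is not None and node.err_main + node.err_alt >= min_evidence:
        if node.err_alt < node.err_main:
            return node.alternate              # replace the old subtree
        node.alternate = None                  # discard an alternate that did not pan out
        node.err_main = node.err_alt = 0
    return node
```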
Classifier Ensembles for Data Streams
• Fast and light classifiers:
– Naïve Bayesian: one pass to count occurrences
– Sliding windows, tumbles and slides
– Adaptive Nearest Neighbor Classification Algorithm (ANNCAD)
• Ensembles of classifiers (decision trees or others):
– Bagging ensembles
– Boosting ensembles
17
Basic Ideas
• The stream is partitioned into sequential chunks
• A classifier is trained from each chunk
• The accuracy of a voting ensemble is normally better than that of a
single classifier.
• Method 1. Bagging (sketched after this slide)
– Weighted voting: weights are assigned to classifiers based on their
recent performance on the current test examples
– Only the top K classifiers are used
• Method 2. Boosting
– Majority voting
– Classifiers retired by age
– Boosting used in training
18
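A minimal sketch of the chunk-based, weighted-voting scheme of Method 1, assuming scikit-learn-style base classifiers with `fit`/`predict`; the accuracy-based weights and the top-K rule are illustrative choices rather than the exact scheme of any one paper:

```python
from collections import Counter

class ChunkEnsemble:
    """Train one base classifier per chunk; keep the K best and let them vote,
    each weighted by its accuracy on the most recent chunk."""

    def __init__(self, make_classifier, k=10):
        self.make_classifier = make_classifier   # factory for a base learner
        self.k = k
        self.members = []                        # list of (classifier, weight)

    def update(self, X_chunk, y_chunk):
        # Re-weight existing members by their accuracy on the newest chunk.
        self.members = [(clf, self._accuracy(clf, X_chunk, y_chunk))
                        for clf, _ in self.members]
        # Train a new member on this chunk and give it full weight.
        clf = self.make_classifier()
        clf.fit(X_chunk, y_chunk)
        self.members.append((clf, 1.0))
        # Keep only the top K classifiers.
        self.members = sorted(self.members, key=lambda m: m[1], reverse=True)[: self.k]

    def predict_one(self, x):
        votes = Counter()
        for clf, weight in self.members:
            votes[clf.predict([x])[0]] += weight          # weighted voting
        return votes.most_common(1)[0][0]

    @staticmethod
    def _accuracy(clf, X, y):
        preds = clf.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)
```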
Bagging Ensemble Method
19
Mining Streams with Concept Changes
• Changes are detected by a drop in accuracy or by other methods
• Build new classifiers on new windows
• Search among the old classifiers for those that have now become accurate
20
Boosting Ensembles for
Adaptive Mining of Data Streams
Andrea Fang Chu, Carlo Zaniolo
[PAKDD2004]
21
Mining Data Streams: Desiderata
• Fast learning (preferably in one pass over the data)
• Light requirements (low time complexity, low memory requirements)
• Adaptation (the model always reflects the time-changing concept)
22
Adaptive Boosting Ensembles
• The training stream is split into blocks (i.e., windows)
• Each individual classifier is learned from a block.
• A boosting ensemble of 7–19 members is maintained over time
• Decisions are taken by simple majority
• As the (N+1)-th classifier is built, the weights of the tuples
misclassified by the first N classifiers are boosted (sketched after this slide)
• Change detection is used to achieve adaptation.
23
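A minimal sketch of the boosting step described above (an illustration of the stated idea, not the exact PAKDD 2004 algorithm): each new block trains a weak learner on examples whose weights are boosted when the existing ensemble misclassifies them, and decisions are taken by simple majority.

```python
class AdaptiveBoostingEnsemble:
    """Majority-vote ensemble of weak learners, one trained per block.
    Base classifiers must expose fit(X, y, sample_weight=...) and predict(X),
    e.g. shallow scikit-learn decision trees."""

    def __init__(self, make_weak_learner, max_members=11):
        self.make_weak_learner = make_weak_learner
        self.max_members = max_members
        self.members = []

    def predict_one(self, x):
        votes = [clf.predict([x])[0] for clf in self.members]
        return max(set(votes), key=votes.count)            # simple majority

    def train_on_block(self, X_block, y_block, boost_factor=2.0):
        # Boost the weight of the tuples the current ensemble gets wrong.
        if self.members:
            weights = [boost_factor if self.predict_one(x) != y else 1.0
                       for x, y in zip(X_block, y_block)]
        else:
            weights = [1.0] * len(y_block)
        clf = self.make_weak_learner()
        clf.fit(X_block, y_block, sample_weight=weights)
        self.members.append(clf)
        if len(self.members) > self.max_members:           # retire the oldest member
            self.members.pop(0)
```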
Fast and Light
Experiments show that boosting ensembles of “weak learners” provide
accurate predictions
Weak learners:
• An aggressively pruned decision tree, e.g., a shallow tree (this means fast!)
• Trained on a small set of examples (this means light in memory requirements!)
24
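For illustration, such a weak learner could be a depth-limited tree trained on one small block, e.g. with scikit-learn (a tooling assumption; the slides do not prescribe a library). A factory like this could be passed as `make_weak_learner` in the ensemble sketch above:

```python
from sklearn.tree import DecisionTreeClassifier

def make_weak_learner():
    """A shallow, aggressively depth-limited tree: fast to build, light on memory."""
    return DecisionTreeClassifier(max_depth=2)
```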
Adaptation: Detect changes that cause significant drops in ensemble performance
• gradual changes: concept drift
• abrupt changes: concept shift
25
Adaptability
• The error rate is viewed as a random variable
• When it deviates significantly from its recent average, the whole
ensemble is dropped
• And a new one is quickly re-learned (see the sketch after this slide)
• The cost/performance of boosting ensembles is better than that of
bagging ensembles [KDD04]
BUT ???
26
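A minimal sketch of this change-detection idea, assuming we simply flag the newest block's error rate when it deviates from the running mean by several standard deviations (the test and thresholds are illustrative assumptions, not the exact method of the cited papers):

```python
import math

class ErrorRateChangeDetector:
    """Track per-block error rates; signal a change when the newest block's error
    deviates from the running mean by more than `z` standard deviations."""

    def __init__(self, z=3.0):
        self.z = z
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # sum of squared deviations (Welford's online algorithm)

    def add_block_error(self, error_rate):
        change = False
        if self.n >= 5:        # wait for a few blocks before trusting the statistics
            std = math.sqrt(self.m2 / (self.n - 1))
            if abs(error_rate - self.mean) > self.z * max(std, 1e-6):
                change = True  # drop the ensemble and re-learn
        # Update the running mean and variance.
        self.n += 1
        delta = error_rate - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (error_rate - self.mean)
        return change
```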
References
• Haixun Wang, Wei Fan, Philip S. Yu, Jiawei Han. Mining Concept-Drifting
Data Streams using Ensemble Classifiers. In the ACM International Conference
on Knowledge Discovery and Data Mining (SIGKDD), 2003.
• Pedro Domingos, Geoff Hulten. Mining High-Speed Data Streams. In the ACM
International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2000.
• Geoff Hulten, Laurie Spencer, Pedro Domingos. Mining Time-Changing Data
Streams. In the ACM International Conference on Knowledge Discovery and
Data Mining (SIGKDD), 2001.
• Wei Fan, Yi-an Huang, Haixun Wang, Philip S. Yu. Active Mining of Data
Streams. In the SIAM International Conference on Data Mining (SIAM DM), 2004.
• Fang Chu, Yizhou Wang, Carlo Zaniolo. An Adaptive Learning Approach for
Noisy Data Streams. In the 4th IEEE International Conference on Data Mining
(ICDM), 2004.
• Fang Chu, Carlo Zaniolo. Fast and Light Boosting for Adaptive Mining of Data
Streams. PAKDD 2004: 282-292.
• Yan-Nei Law, Carlo Zaniolo. An Adaptive Nearest Neighbor Classification
Algorithm for Data Streams. In the 2005 ECML/PKDD Conference, Porto, Portugal,
October 3-7, 2005.
27