Ensemble Methods with Data Streams
Jungbeom Lee
• Ensembles in machine learning
• Online ensemble algorithms
• Future work
Previous class: Data Stream Classifiers
• Ensemble methods
• Online algorithms
The batch classification problem:
– Given a finite training set D = {(x, y)} with |D| = n, where y ∈ {y1, y2, …, yk}, find
a function y = f(x) that can predict the y value for an unseen instance x
The data stream classification problem:
– Given a potentially infinite sequence of pairs (x, y), where y ∈ {y1, y2, …, yk},
find a function y = f(x) that can predict the y value for an unseen
instance x
Example applications:
– Fraud detection in credit card transactions
– Topic classification in a news aggregation site, e.g. Google News
– Translator for foreign languages
• Online mining differs from static mining
Data volume
◦ impossible to mine the entire data set at once
◦ can only afford constant memory per data sample
Changing data characteristics
◦ previously learned models may become invalid
Cost of learning
◦ model updates can be costly
◦ can only afford constant time per data sample
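The constant-memory, constant-time constraints can be illustrated with a small test-then-train loop. This is only a sketch: `OnlineNearestCentroid` and `prequential_accuracy` are hypothetical names chosen for illustration, and the nearest-centroid learner is an assumed stand-in for any incremental classifier.

```python
from collections import defaultdict


class OnlineNearestCentroid:
    """Keeps one running mean per class, so memory does not grow with the stream."""

    def __init__(self):
        self.counts = defaultdict(int)  # examples seen per class
        self.means = {}                 # running mean vector per class

    def predict(self, x):
        if not self.means:
            return None  # nothing learned yet

        def sq_dist(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.means[label]))

        return min(self.means, key=sq_dist)

    def learn_one(self, x, y):
        # Incremental mean update: a fixed amount of work per example.
        self.counts[y] += 1
        n = self.counts[y]
        if y not in self.means:
            self.means[y] = list(x)
        else:
            mean = self.means[y]
            for i, xi in enumerate(x):
                mean[i] += (xi - mean[i]) / n


def prequential_accuracy(stream, model):
    """Test-then-train: predict each example first, then learn from it."""
    correct = total = 0
    for x, y in stream:
        y_hat = model.predict(x)
        if y_hat is not None:
            total += 1
            correct += (y_hat == y)
        model.learn_one(x, y)  # the example can be discarded afterwards
    return correct / total if total else 0.0
```

Each example is touched once and then dropped, so the loop works on a stream of any length.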
An ensemble is a set of classifiers whose individual
decisions are combined in some way to
classify new examples
• An ensemble of classifiers is often more
accurate than any of its individual members
• One key to success is to use individual
classifiers with error rates below 0.5
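Why the 0.5 threshold matters can be checked with a short calculation. The sketch below assumes a two-class problem, a simple majority vote, and classifiers whose errors are independent (an idealization that real ensembles only approximate); `majority_vote_error` is a hypothetical helper name.

```python
from math import comb


def majority_vote_error(n_classifiers, error_rate):
    """Probability that more than half of n independent classifiers are wrong.

    Assumes a two-class problem and independent errors.
    """
    n, p = n_classifiers, error_rate
    k_min = n // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


if __name__ == "__main__":
    for p in (0.3, 0.5, 0.6):
        print(p, majority_vote_error(21, p))
```

Under these assumptions, members with error below 0.5 give an ensemble error lower than the individual error, while members with error above 0.5 make the vote worse than any single member.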
Ensemble methods
Manipulating the Training Examples
◦ Bagging
◦ AdaBoost
Injecting Randomness
◦ C4.5 decision tree algorithm
Bagging algorithm
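A minimal sketch of batch bagging, assuming the standard formulation: each base model is trained on a bootstrap sample (drawn with replacement, same size as the training set) and predictions are combined by majority vote. The 1-nearest-neighbour base learner `fit_1nn` is a hypothetical placeholder; any classifier could be plugged in.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) examples from data with replacement."""
    return [rng.choice(data) for _ in data]


def bagging_fit(data, fit_base, n_models, seed=0):
    """Train n_models base models, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_base(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Combine the base models' predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


def fit_1nn(sample):
    """Hypothetical base learner: 1-nearest neighbour on scalar features."""
    def predict(x):
        nearest = min(sample, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict
```

Note that the batch version needs the whole training set in memory to resample it, which is what a stream setting cannot afford.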
Online bagging algorithm
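A minimal sketch of online bagging in the commonly cited Oza and Russell formulation, assuming that is the variant meant here: instead of resampling a stored training set, each arriving example is shown to each base model k times, with k drawn from a Poisson(1) distribution. `RunningCentroid` is a hypothetical incremental base learner used only to keep the example self-contained.

```python
import math
import random
from collections import Counter


def poisson(lam, rng):
    """Sample from Poisson(lam) with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


class RunningCentroid:
    """Hypothetical incremental base learner: one running mean per class (scalar x)."""

    def __init__(self):
        self.n = {}
        self.mean = {}

    def learn_one(self, x, y):
        self.n[y] = self.n.get(y, 0) + 1
        m = self.mean.get(y, x)
        self.mean[y] = m + (x - m) / self.n[y]

    def predict(self, x):
        if not self.mean:
            return None
        return min(self.mean, key=lambda label: abs(self.mean[label] - x))


class OnlineBagging:
    def __init__(self, make_model, n_models, seed=0):
        self.rng = random.Random(seed)
        self.models = [make_model() for _ in range(n_models)]

    def learn_one(self, x, y):
        # Each model sees the example k ~ Poisson(1) times (possibly zero).
        for model in self.models:
            for _ in range(poisson(1.0, self.rng)):
                model.learn_one(x, y)

    def predict(self, x):
        # Majority vote over the models that have learned something.
        preds = (model.predict(x) for model in self.models)
        votes = Counter(p for p in preds if p is not None)
        return votes.most_common(1)[0][0] if votes else None
```

The Poisson(1) draw approximates how often an example would appear in a bootstrap sample of a large data set, so no example needs to be stored.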
Online weighted bagging algorithm
AdaBoost algorithm
Adaptive boosting algorithm
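A minimal sketch of batch AdaBoost, assuming the standard two-class formulation with labels +1 and -1 and threshold stumps on a single scalar feature as weak learners. The helper names (`fit_stump`, `stump_predict`, `adaboost_fit`, `adaboost_predict`) are hypothetical.

```python
import math


def stump_predict(x, threshold, polarity):
    """Threshold rule: predict polarity when x >= threshold, else -polarity."""
    return polarity if x >= threshold else -polarity


def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for t in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, t, polarity) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best


def adaboost_fit(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n  # start with uniform example weights
    ensemble = []
    for _ in range(n_rounds):
        err, t, pol = fit_stump(xs, ys, weights)
        if err >= 0.5:
            break  # weak learner is no better than chance
        err = max(err, 1e-10)  # avoid division by zero on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, pol))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, pol))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, pol) for alpha, t, pol in ensemble)
    return 1 if score >= 0 else -1
```

For example, with `xs = list(range(10))` and `ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]`, no single stump separates the classes, but the weighted vote of a few stumps can.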
Experimental Results
Type of Data
Future work
• A better online algorithm for bagging
• Dealing with multiple data types