Learning from Data Streams
CS240B Notes
by
Carlo Zaniolo
UCLA CSD
With slides from an ICDE 2005 tutorial by
Haixun Wang, Jian Pei & Philip Yu
1
What are the Challenges?
 Data Volume
  impossible to mine the entire data set at one time
  can only afford constant memory per data sample
 Concept Drifts
  previously learned models become invalid
 Cost of Learning
  model updates can be costly
  can only afford constant time per data sample
2
On-Line Learning
 Learning (Training):
  Input: a data set of pairs (a, b), where a is a feature vector and b a class label
  Output: a model (e.g., a decision tree)
 Testing:
  Input: a test sample (x, ?)
  Output: a class label prediction for x
 When mining data streams, the two phases are combined, as in the sketch below
3
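A minimal sketch of this combined, predict-then-train loop (the stream format and the toy model are illustrative assumptions, not part of the slides):

```python
from collections import defaultdict

class MajorityClassModel:
    """Trivial incremental model: always predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = defaultdict(int)

    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

    def update(self, x, y):
        self.counts[y] += 1          # constant time and memory per sample

def mine_stream(stream, model):
    """Interleave testing and training: predict on each sample, then learn from it."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        model.update(x, y)
        total += 1
    return correct / total if total else 0.0

# Example with a tiny synthetic stream of (feature_vector, label) pairs.
stream = [([0.1, 1.0], "a"), ([0.3, 0.8], "a"), ([0.9, 0.2], "b")]
print(mine_stream(stream, MajorityClassModel()))
```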
Mining Data Streams: Challenges
 On-line response (NB), limited memory, most recent windows only
 Fast & Light algorithms needed that
  minimize usage of memory and CPU
  require only one (or a few) passes through the data
 Concept shift/drift: changes in the statistics of the mined data
  render previously learned models inaccurate or invalid
 Robustness and Adaptability: quickly recover/adjust after concept changes
 Popular machine learning algorithms no longer effective:
  Neural nets: slow learners requiring many passes
  Support Vector Machines (SVM): computationally too expensive
4
Classifier Algorithms:
from databases to data streams
 New algorithms have emerged:
  Bloom Filters
  ANNCAD
 Existing algorithms have been adapted:
  NBC survives with only minor changes
  Decision Trees require significant adaptation
  Classifier ensembles remain effective, after significant changes
 Popular algorithms no longer effective:
  Neural nets: slow learners requiring many passes
  Support Vector Machines (SVM): computationally too expensive
5
Decision Tree Classifiers
 A divide-and-conquer approach
  Simple algorithm, intuitive model
 Typically a decision tree grows one level for each scan of the data
  Multiple scans are required
  But if we can use small samples, this problem disappears
 However, the data structure is not 'stable'
  Subtle changes in the data can cause global changes in the tree structure
6
Challenge #1
How many samples do we need to build, in constant time, a tree that is nearly identical to the tree a batch learner (C4.5, SPRINT, ...) would build?
Nearly identical?
 Categorical attributes:
  with high probability, the attribute we choose for the split is the same attribute as would be chosen by a batch learner
  hence an identical decision tree
 Continuous attributes:
  discretize them into categorical ones
...Forget concept shift/drift for now
7
Hoeffding Bound
 Also known as the additive Chernoff Bound
 Given
– r : real-valued random variable
– n : # of independent observations of r
– R : range of r
 The true mean of r is at least r_avg − ε, with probability 1 − δ
 Or: P(μ_r ≥ r_avg − ε) = 1 − δ, where ε = √( R² ln(1/δ) / (2n) ) (see the sketch below)
8
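To make the bound concrete, here is a minimal sketch (variable names are ours, not from the slides) showing how ε is computed and how it shrinks as n grows:

```python
import math

def hoeffding_epsilon(R, delta, n):
    """With probability 1 - delta, the true mean of a random variable with
    range R lies within epsilon of its average over n independent observations."""
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Example: information gain over two classes has range R = log2(2) = 1 bit.
for n in (100, 1_000, 10_000):
    print(n, round(hoeffding_epsilon(R=1.0, delta=1e-7, n=n), 4))
# epsilon decreases as n increases, so larger samples give tighter estimates.
```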
Hoeffding Bound
Properties:
 The Hoeffding bound is independent of the data distribution
 The error ε decreases as n (# of samples) increases
At each node, we accumulate enough samples (n) before we make a split, as sketched below.
9
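A minimal sketch of how such a split test is typically applied in Hoeffding-tree style learners (the gain values, delta, and range R below are illustrative assumptions, not the slides' algorithm):

```python
import math

def hoeffding_epsilon(R, delta, n):
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(gains, n, R=1.0, delta=1e-7):
    """Split a leaf only when the best attribute beats the runner-up by more
    than epsilon, so that with probability 1 - delta it is the same attribute
    a batch learner would choose on the full data.

    gains: observed information gain for each candidate attribute at this leaf.
    n: number of samples accumulated at this leaf.
    """
    ranked = sorted(gains.values(), reverse=True)
    if len(ranked) < 2:
        return False
    return (ranked[0] - ranked[1]) > hoeffding_epsilon(R, delta, n)

# Example: after 2000 samples, attribute A clearly dominates attribute B.
print(should_split({"A": 0.30, "B": 0.12}, n=2000))   # True
```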
Nearly Identical?
 Categorical attributes:
  with high probability, the attribute we choose for the split is the same attribute as would be chosen by a batch learner
  thus we obtain an identical decision tree
 Continuous attributes:
  discretize them into categorical ones, as in the binning sketch below
10
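A minimal sketch of one common way to discretize a continuous attribute (equal-width binning; the bin count is an arbitrary choice for illustration):

```python
def equal_width_bins(values, num_bins=4):
    """Map each continuous value to a categorical bin label 0..num_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0   # guard against a constant attribute
    return [min(int((v - lo) / width), num_bins - 1) for v in values]

# Example: a continuous attribute becomes a small set of categories.
print(equal_width_bins([0.1, 0.4, 0.45, 0.9, 1.3, 2.0]))   # [0, 0, 0, 1, 2, 3]
```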