Systematic Data Selection to Mine Concept-Drifting Data Streams
Wei Fan
SIGKDD 2004

Introduction

- A common assumption: additional old data always helps produce a more accurate hypothesis than using the most recent data only.
- Using old data blindly is no better than "gambling".
- Which situations, and what kind of old data, will actually help?
- Selecting suitable examples is time-consuming (cross-validation).

Concept Drift

- Data received at time stamp i: S_i
- Optimal model at time stamp i: FO_i(x)
- Older optimal model at time stamp i-1: FO_{i-1}(x)
- Concept drift: FO_i(x) ≠ FO_{i-1}(x)
- Rate of concept change
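
The slide does not spell out how the rate of concept change is measured. One simple way to make the idea concrete is the fraction of examples on which the two optimal models disagree; this is only an illustrative proxy, not the paper's definition, and the names FO_prev, FO_cur, drift_rate are made up for the sketch.

def drift_rate(FO_prev, FO_cur, examples):
    """Fraction of examples x with FO_prev(x) != FO_cur(x)."""
    changed = sum(1 for x in examples if FO_prev(x) != FO_cur(x))
    return changed / len(examples)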

Data Sufficiency

- In statistics, a data sample is sufficient if the observed statistics, such as the sample mean, have a variance smaller than predefined limits with high confidence.
- In machine learning and data mining, a data set is considered sufficient if adding more data to it does not increase the generalization accuracy.
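
A minimal sketch of the machine-learning notion of sufficiency above, assuming hypothetical helpers train(data) -> classifier and accuracy(model, holdout) -> generalization accuracy; the step size and tolerance are arbitrary illustration values.

def is_sufficient(data, holdout, train, accuracy, step=500, tol=0.005):
    """Treat `data` as sufficient once adding `step` more examples stops improving accuracy."""
    prev = accuracy(train(data[:step]), holdout)
    for n in range(2 * step, len(data) + 1, step):
        cur = accuracy(train(data[:n]), holdout)
        if cur - prev <= tol:          # no meaningful gain from more data
            return True
        prev = cur
    return False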

Some Situations

- S^P = S_1 ∪ S_2 ∪ ... ∪ S_{i-1}: all previously received data.
- 1. The underlying model does not change: FO_i(x) = FO_{i-1}(x).
  - New data insufficient.
  - Old data insufficient.
- 2. The underlying model does change: FO_i(x) ≠ FO_{i-1}(x) for some x.
  - An old example (x, y) with FO_{i-1}(x) = y may now have FO_i(x) ≠ y, so it misleads the new model.
  - The only old data that will help is data still consistent under the evolved model: FO_i(x) = FO_{i-1}(x) = y.
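
A minimal sketch of this selection rule: from the old data, keep only examples whose labels remain consistent under both the old and the evolved model. The models are assumed to be plain callables, and the function name is illustrative.

def consistent_old_examples(old_data, FO_prev, FO_cur):
    """Old (x, y) pairs with FO_cur(x) == FO_prev(x) == y still help after drift."""
    return [(x, y) for (x, y) in old_data if FO_cur(x) == y and FO_prev(x) == y]

In the algorithm described on the following slides, the new model FN_i(x) trained on the latest chunk stands in for the unknown evolved optimal model FO_i(x).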

Optimal Models

- Assumption: training data is collected without any known prior bias.
- Concept drift and data insufficiency: 4 situations.

Situations

- New data is sufficient by itself and there is no concept drift.
- New data is sufficient by itself and there is concept drift.
- New data is insufficient by itself and there is no concept drift.
- New data is insufficient by itself and there is concept drift.

Computing Optimal Models

- FN(x): new model trained from recent data.
- FO(x): the optimal model finally chosen after some statistical significance tests.
- i: the sequence number of each sequentially received data chunk.
- D_{i-1}: the data set that trained the most recent optimal model FO_{i-1}(x).

Steps

1. Train FN_i(x) on the new data chunk S_i.
2. Select examples s_{i-1} from D_{i-1}, where s_{i-1} = {(x, y) ∈ D_{i-1} : FN_i(x) = y and FO_{i-1}(x) = y}.
3. Train a second candidate model from the combined data S_i + s_{i-1}.
4. Train a third candidate model by updating FO_{i-1}(x) with S_i.
5. Compare the accuracy of the candidates above by cross-validation, and choose the best one as FO_i(x).
6. S_i + s_{i-1}: the training set that computes FO_i(x).
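
A minimal sketch of steps 1-6, assuming hypothetical helpers train(data) -> callable classifier and cv_accuracy(data) -> cross-validated accuracy of a model trained on that data; step 4's incremental update of FO_{i-1}(x) is approximated here by retraining on D_{i-1} plus S_i.

def update_optimal_model(S_i, D_prev, FO_prev, train, cv_accuracy):
    # 1. New model from the most recent chunk only.
    FN_i = train(S_i)
    # 2. Old examples consistent with both the new and the previous optimal model.
    s_prev = [(x, y) for (x, y) in D_prev if FN_i(x) == y and FO_prev(x) == y]
    # 3. / 4. Candidate training sets to compare.
    candidates = [
        S_i,                  # new data only
        S_i + s_prev,         # new data plus selected old examples (step 3)
        D_prev + S_i,         # previous optimal model refreshed with S_i (step 4, approximated)
    ]
    # 5. Keep the candidate with the best cross-validated accuracy.
    best = max(candidates, key=cv_accuracy)
    FO_i = train(best)
    # 6. Per the slide, S_i plus s_{i-1} is recorded as the training set behind FO_i.
    D_i = S_i + s_prev
    return FO_i, D_i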

Situations

- New data is sufficient by itself and there is no concept drift.
- New data is sufficient by itself and there is concept drift.
- New data is insufficient by itself and there is no concept drift.
- New data is insufficient by itself and there is concept drift.

Cross Validation Decision Tree Ensemble

- Basic idea: train a number of random and uncorrelated decision trees.
- Each decision tree is constructed by randomly selecting among the available features.
- The only correlation among the trees is through the training data itself.

Training

- A cut-off value on information gain (the gain threshold).
- f features.
- N decision trees (N: a random number).
- At each node, randomly pick one qualifying feature to split on (see the sketch below):
  - Discrete features: used one time per path.
  - Continuous features: may be used multiple times, with randomly chosen thresholds.
- Stop a branch if there are no more examples passing through it.
- Missing values: handled by weights; initial weight = 1.0.

[Diagram: among attributes 1 ... f, rather than always splitting on the highest-gain attribute, any attribute with gain above the cut-off may be chosen at random; for a continuous attribute the split threshold is drawn at random between its min and max values.]
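
A minimal sketch of growing one of the N random trees under the rules above. The helper names (entropy, info_gain, grow_random_tree) and the dict-based tree layout are illustrative, a discrete split is simplified to a binary equality test, and missing-value weights are omitted here (they reappear in the classification sketch).

import math
import random
from collections import Counter

def entropy(rows):
    """Entropy of the class labels in rows of (x, y) pairs."""
    counts = Counter(y for _, y in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def info_gain(rows, test):
    """Information gain of splitting rows by the boolean test(x)."""
    left = [r for r in rows if test(r[0])]
    right = [r for r in rows if not test(r[0])]
    if not left or not right:
        return 0.0
    p = len(left) / len(rows)
    return entropy(rows) - p * entropy(left) - (1 - p) * entropy(right)

def grow_random_tree(rows, features, cutoff, is_continuous):
    """rows: (x, y) with x a dict feature -> value; is_continuous: feature -> bool."""
    labels = Counter(y for _, y in rows)
    if len(labels) <= 1 or not features:
        return {"leaf": labels}                       # keep raw class counts at the leaf
    candidates = []
    for f in features:
        if is_continuous[f]:
            lo = min(x[f] for x, _ in rows)
            hi = max(x[f] for x, _ in rows)
            t = random.uniform(lo, hi)                # random threshold between min and max
            test = lambda x, f=f, t=t: x[f] <= t
        else:
            v = random.choice([x[f] for x, _ in rows])
            test = lambda x, f=f, v=v: x[f] == v
        if info_gain(rows, test) > cutoff:            # only gains above the cut-off qualify
            candidates.append((f, test))
    if not candidates:
        return {"leaf": labels}
    f, test = random.choice(candidates)               # pick one qualifying feature at random
    remaining = features if is_continuous[f] else [g for g in features if g != f]
    left = [r for r in rows if test(r[0])]
    right = [r for r in rows if not test(r[0])]
    if not left or not right:                         # stop a branch with no examples
        return {"leaf": labels}
    return {"feature": f, "test": test,
            "left": grow_random_tree(left, remaining, cutoff, is_continuous),
            "right": grow_random_tree(right, remaining, cutoff, is_continuous)}

An ensemble is then just N independent calls to grow_random_tree on the same training data, matching the slide's "N decision trees".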

Classification

- Raw posterior probability: if there are n_c examples out of n in the leaf node with class label c, then P(c|x) = n_c / n; this is averaged over all paths (trees).
- For an example with missing values, the weighted counts from each path are pooled, e.g. with path weights 0.3 and 0.7:
  P(c|x) = (0.3 × 100 + 0.7 × 200) / (0.3 × 2000 + 0.7 × 1000)
- Worked example: see the whiteboard.
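
A minimal sketch of scoring one example with the dict-shaped trees from the training sketch above. A missing feature value sends the example down both branches with split weights (the fixed 0.5/0.5 split here is a simplification; the slide's example uses 0.3/0.7), and leaf counts are pooled exactly as in the weighted n_c / n formula.

def leaf_counts(tree, x, weight=1.0):
    """All (weight, class-count Counter) leaves that example x reaches in one tree."""
    if "leaf" in tree:
        return [(weight, tree["leaf"])]
    f = tree["feature"]
    if f not in x or x[f] is None:                    # missing value: follow both branches
        return (leaf_counts(tree["left"], x, weight * 0.5) +
                leaf_counts(tree["right"], x, weight * 0.5))
    branch = tree["left"] if tree["test"](x) else tree["right"]
    return leaf_counts(branch, x, weight)

def posterior(trees, x, c):
    """P(c|x) = (sum of weighted n_c) / (sum of weighted n), pooled over all trees and paths."""
    num = den = 0.0
    for tree in trees:
        for w, counts in leaf_counts(tree, x):
            num += w * counts[c]
            den += w * sum(counts.values())
    return num / den if den else 0.0

With two paths weighted 0.3 and 0.7 whose leaves hold 100 examples of class c out of 2000 and 200 out of 1000, this gives (0.3 × 100 + 0.7 × 200) / (0.3 × 2000 + 0.7 × 1000) ≈ 0.13, the form of the slide's example.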

Testing: Cross Validation

- Training size: n.
- n-fold (leave-one-out) cross-validation leaves one example x out and uses the remaining n - 1 examples to train a model, then classifies the left-out example x.
- The exclusion of x is very unlikely to change the subset of features having information gain.
- A random number function generates the same sequence of numbers, so the same trees are built.
- Hence the trees can be reused for every fold: to classify x, its own contribution is simply removed from the counts of its leaf.

Example

- Suppose x is classified as fraud.
- If a leaf node has 10 examples with 7 frauds and 3 non-frauds, removing x's own contribution leaves 6 frauds out of 9, so P(fraud|x) = 6/9 ≈ 0.67 for the left-out x.
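
A minimal sketch of this leave-one-out shortcut: the trees are not retrained; the left-out example's own contribution is subtracted from the counts of the leaf it falls into. The class name "legit" for non-fraud is made up for the illustration.

from collections import Counter

def loo_posterior(leaf, x_label, target):
    """P(target | x) in x's leaf after removing x's own contribution from the counts."""
    counts = Counter(leaf)
    counts[x_label] -= 1                 # x itself no longer counts as training data
    total = sum(counts.values())
    return counts[target] / total if total else 0.0

# The slide's leaf: 7 frauds and 3 non-frauds; x itself is a fraud.
print(loo_posterior(Counter(fraud=7, legit=3), "fraud", "fraud"))   # 6/9 ≈ 0.667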

Experiment

- Synthetic data
- Credit card fraud data
- Donation data set
- Real-data results

Size of Incremental Training Set

- d = 5
- chunk size = 250

Conclusion

- Streams:
  - An example's label is obtained through statistics.
  - Memory: only the CV trees and the example database are kept.
  - Efficient: information gains are computed only once.
- Questions and comments:
  - What if the statistics of each CV tree do not have similar properties?
  - How should the gain threshold be chosen?
  - The theoretical assumptions should be supported with experimental data.