Systematic Data Selection to Mine Concept-Drifting Data Streams
Wei Fan, SIGKDD 2004

Introduction
- A common assumption: additional old data always helps produce a more accurate hypothesis than using only the most recent data.
- In fact, using old data blindly is no better than "gambling".
- So: in which situations, and with what kind of old data, does old data actually help?
- Selecting the appropriate examples is time-consuming (it relies on cross-validation).

Concept Drift
- Data received at time stamp i: S_i
- Optimal model at time stamp i: FO_i(x)
- Older optimal model at time stamp i-1: FO_{i-1}(x)
- Concept drift: FO_i(x) ≠ FO_{i-1}(x)
- Rate of concept change: the proportion of examples x for which FO_i(x) ≠ FO_{i-1}(x)

Data Sufficiency
- In statistics, a data sample is sufficient if the observed statistics (e.g., the sample mean) have a variance smaller than predefined limits with high confidence.
- In machine learning and data mining, a data set is considered sufficient if adding more data to it does not increase the generalization accuracy.

Some Situations
- The data arrives as a stream of chunks S_1, S_2, ..., S_{i-1}, S_i.
- 1. The underlying model does not change: the question is whether the new data, or the old data, is insufficient by itself.
- 2. The underlying model does change, i.e., FO_i(x) ≠ FO_{i-1}(x) for some x. An old example (x, y) may still satisfy FO_i(x) = FO_{i-1}(x) = y, or it may not. The only old data that will help is data that remains consistent under the evolved model: examples with FO_i(x) = FO_{i-1}(x) = y.

Optimal Models
- Assumption: the training data is collected without any known prior bias.
- Concept drift and data sufficiency together give four situations.

Situations
- New data is sufficient by itself and there is no concept drift.
- New data is sufficient by itself and there is concept drift.
- New data is insufficient by itself and there is no concept drift.
- New data is insufficient by itself and there is concept drift.

Computing Optimal Models
- FN(x): the new model trained from the recent data.
- FO(x): the optimal model finally chosen after some statistical significance tests.
- i: the sequence number of each sequentially received data chunk.
- D_{i-1}: the data set that trained the most recent optimal model FO_{i-1}(x).

Steps (a Python sketch of these steps follows the Classification slide below)
1. Train FN_i(x) from the new data chunk S_i.
2. Select some examples from D_{i-1}: s_{i-1} = {(x, y) ∈ D_{i-1} such that FN_i(x) = y and FO_{i-1}(x) = y}.
3. Train FN_i^+(x) from the new data S_i plus s_{i-1}.
4. Train FO_{i-1}^+(x) by updating FO_{i-1}(x) with S_i.
5. Compare the accuracy of the models above by cross-validation and choose FO_i(x).
6. S_i ∪ s_{i-1}: the training set that computes FO_i(x).

Situations (revisited)
- The same four combinations of data sufficiency and concept drift listed above.

Cross-Validation Decision Tree Ensemble
- Basic idea: train a number of random, uncorrelated decision trees.
- Each decision tree is constructed by randomly selecting among the available features.
- The only correlation between the trees is through the training data itself.

Training
- A cut-off value (threshold) on the information gain.
- f features and N decision trees (N: a random number).
- At each node, randomly pick a feature to split on.
- Discrete features: used only once along a path.
- Continuous features: may be used multiple times, each time with a randomly chosen threshold.
- Stop a branch if no more examples pass through it.
- Missing values: each example carries a weight, with initial value 1.0.
- [Slide diagram: attributes 1 … f; a feature is picked at random and the split is kept when its gain exceeds the cut-off (highest gain); for continuous attributes, thresholds are drawn at random between the min and max values.]

Classification
- Raw posterior probability: if there are n_c examples out of n in the leaf node with class label c, then P(c|x) = n_c / n.
- Average the posteriors over all paths (trees).
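Returning to the "Computing Optimal Models" steps above, here is a minimal Python sketch of the selection-and-comparison loop. It is an illustration under stated assumptions, not the paper's implementation: `train`, `cross_val_accuracy`, and the `predict(x)` interface are hypothetical stand-ins for any classifier and any k-fold accuracy estimate.

```python
# Minimal sketch of Steps 1-6 ("Computing Optimal Models").
# `train(data)` returns a classifier exposing predict(x) for a single example,
# and `cross_val_accuracy(train, data)` estimates accuracy by cross-validation;
# both are hypothetical helpers, not the paper's actual code.

def select_consistent(D_prev, FN_i, FO_prev):
    """Step 2: keep old examples (x, y) that both the new model FN_i and the
    previous optimal model FO_{i-1} still classify correctly."""
    return [(x, y) for (x, y) in D_prev
            if FN_i.predict(x) == y and FO_prev.predict(x) == y]

def compute_optimal_model(S_i, D_prev, FO_prev, train, cross_val_accuracy):
    FN_i = train(S_i)                                   # Step 1: model from the new chunk only
    s_prev = select_consistent(D_prev, FN_i, FO_prev)   # Step 2: consistent old examples
    candidate_sets = {
        "FN_i (S_i only)":             list(S_i),
        "FN_i+ (S_i + s_{i-1})":       list(S_i) + s_prev,         # Step 3
        "FO_{i-1}+ (D_{i-1} + S_i)":   list(D_prev) + list(S_i),   # Step 4 (retrained here for simplicity)
        "FO_{i-1} (unchanged)":        list(D_prev),
    }
    # Step 5: compare the candidates by cross-validation accuracy
    best = max(candidate_sets,
               key=lambda name: cross_val_accuracy(train, candidate_sets[name]))
    D_i = candidate_sets[best]                          # Step 6: the training set behind FO_i
    FO_i = train(D_i)
    return FO_i, D_i
```

In the paper, the comparison in step 5 is made cheap by the cross-validation decision-tree ensemble described above, which yields leave-one-out estimates without retraining (see the sketch at the end of these notes).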
- For an example with a missing value, the weight is split across branches and the weighted n_c / n values are combined. [Whiteboard example with weights 0.3 and 0.7 and leaf counts 100, 200, 2000, 1000; worked out on the whiteboard.]

Testing: Cross-Validation
- Training size: n. n-fold cross-validation (leave-one-out) leaves one example x out, uses the remaining n-1 examples to train a model, and classifies the left-out example x.
- The exclusion of x is very unlikely to change the subset of features having (sufficient) information gain, so the same trees can be reused for every fold; only the leaf statistics need adjusting.
- A random number function generates the same sequence of numbers, so the random feature choices are reproducible.

Example (a sketch of this leave-one-out adjustment appears at the end of these notes)
- Suppose x is classified as fraud and the leaf node it falls into has 10 training examples: 7 frauds and 3 non-frauds. Leaving x out removes its own contribution from the counts, e.g., the fraud estimate becomes (7-1)/(10-1) = 6/9 instead of 7/10.

Experiment
- Synthetic data, credit-card fraud data, donation data set.
- Real-data results: size of the incremental training set, d = 5, chunk size = 250.

Conclusion
- Stream setting: an example's label is obtained through statistics.
- Memory: the CV trees and the example database must be kept.
- Efficiency: the information gains are computed only once.
- What if the statistics of the individual CV trees do not share similar properties?
- How should the gain threshold be chosen?
- The theoretical assumptions are supported with experimental data.
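A minimal sketch of the leave-one-out adjustment from the "Testing: Cross-Validation" slide: because excluding one example rarely changes which features clear the gain cut-off, each fold can reuse the same trees and simply subtract the left-out example's contribution from its leaf counts. The function below is illustrative only; `leaf_counts_per_tree`, the path weights, and the class names are assumptions, not the paper's code.

```python
# Illustrative leave-one-out posterior from CV-tree leaf counts.
# For each tree, `leaf_counts_per_tree` holds the class counts of the leaf
# that the left-out example x reaches; `weights` are the path weights
# (1.0 unless a missing value split x across branches). Both are assumed inputs.

def loo_posterior(leaf_counts_per_tree, true_label, weights=None):
    if weights is None:
        weights = [1.0] * len(leaf_counts_per_tree)
    per_tree = []
    for counts, w in zip(leaf_counts_per_tree, weights):
        counts = dict(counts)
        counts[true_label] -= 1          # exclude x's own contribution (leave-one-out)
        n = sum(counts.values())
        per_tree.append(({c: (k / n if n else 0.0) for c, k in counts.items()}, w))
    # Average the per-tree posteriors, weighted by path weight
    total_w = sum(w for _, w in per_tree)
    classes = set().union(*(p.keys() for p, _ in per_tree))
    return {c: sum(p.get(c, 0.0) * w for p, w in per_tree) / total_w for c in classes}

# The slide's example: one leaf with 7 frauds and 3 non-frauds, x itself a fraud.
print(loo_posterior([{"fraud": 7, "non-fraud": 3}], "fraud"))
# fraud: 6/9 ≈ 0.667, non-fraud: 3/9 ≈ 0.333 (instead of 7/10 and 3/10)
```

The same subtraction would apply to weighted (missing-value) examples by removing the fractional weight instead of a whole count; that refinement is omitted here.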