Download KDD04stream

Systematic Data Selection to Mine Concept Drifting Data Streams Wei Fan IBM T.J.Watson About … …  Data streams: continuous stream of new data, generated either in real time or periodically.     Credit card transactions Stock trades. Insurance claim data. Phone call records  Our notations. Data Streams t1 t2 t3 Old data t4 t5 New data Data Stream Mining  Characteristics: may change over time.  Main goal of stream mining:  make sure that the constructed model is the most accurate and up-to-date. Data Sufficiency  Definition:  A dataset is considered “sufficient” if adding more data items will not increase the final accuracy of a trained model significantly.  We normally do not know if a dataset is sufficient or not.  Sufficiency detection:  Expensive “progressive sampling” experiment.  Keep on adding data and stop when accuracy doesn’t increase significantly.  Dependent on both dataset and algorithm  Difficult to make a general claim Possible changes of data streams  Possible “concept drift”.  For the same feature vector, different class labels are generated at some later time  Or stochastically, with different probabilities.  Possible “data sufficiency”.  Other possible changes not addressed in our paper.  Most important of all:  These are “possibilities”.  No “Oracle” out there to tell us the truth!  Dangerous to make assumptions. How many combinations?  Four combinations:     Sufficient and no drift. Insufficient and no drift. Sufficient and drift. Insufficient and drift  Question: Does the “most accurate model” remain the same under all four situations? Case 1: Sufficient and no drift  Solution one:  Throw away old models and data.  Re-train a new model from new data.  By definitions of data sufficiency.  Solution two:  If old model is trained from “sufficient data”, just use the old model Case 2: Sufficient and drift  Solution one:  Train a new model from new data  Same “sufficiency definition”. Case 3: Insufficient and no drift  Possibility I: if old model is trained from sufficient data, keep the old model.  Possibility II: otherwise, combine new data and old data, and train a new model. Case 4: Insufficient and drift  Obviously, new data is not enough by definition.  What are our options.  Use old data?  But how? A moving hyper plane A moving hyper plane See any problems?  Which old data items can we use? We need to be picky Inconsistent Examples Consistent examples See more problems?  We normally never know which of the four cases a real data stream belongs to.  It may change over time from case to case.  Normally, no truth is known apriori or even later. Solution  Requirements:  The right solution should not be “one size fits all.”  Should not make any assumptions. Any assumptions can be wrong.  It should be adaptive.  Let the data speak for itself.  We prefer model A over model B if the accuracy of A on the evolving data stream is likely to be more accurate than B.  No assumptions! An “Un-biased” Selection framework  Train FN from new data.  Train FN+ from new data and selected consistent old data.  Assume FO is the previous most accurate model. Update FO using the new data. Call it FO+.  Use cross-validation to choose among the four candidate models {FN, FN+, FO, and FO+}. Consistent old data  Theoretically, if we know the true models, we can use the true models to choose consistent data. But we don’t  Practically, we have to rely on “optimal models.”  Go back to the hyper plane example A moving hyper plane Their optimal models True model and optimal models True model. Perfect model: never makes mistakes. Not always possible due to: Stochastic nature of the problem Noise in training data Data is insufficient Optimal model: defined over a given loss function. Optimal Model  Loss function L(t,y) to evaluate performance.  t is true label and y is prediction  Optimal decision decision y* is the label that minimizes the expected loss when x is sampled many times:  0-1 loss: y* is the label that appears the most often, i.e., if P(fraud|x) > 0.5, predict fraud  cost-sensitive loss: the label that minimizes the “empirical risk”. If P(fraud|x) * $1000 > $90 or p(fraud|x) > 0.09, predict fraud Random decision trees  Train multiple trees. Details to follow.  Each tree outputs posterior probability when classifying an example x.  The probability outputs of many trees are averaged as the final probability estimation.  Loss function and probability are used to make the best prediction. Training At each node, an un-used feature is chosen randomly  A discrete feature is un-used if it has never been chosen previously on a given decision path starting from the root to the current node.  A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen Example Gender? M F Age>30 P: 1 N: 9 y …… Age> 25 n P: 100 N: 150 Training: Continued  We stop when one of the following happens:  A node becomes empty.  Or the total height of the tree exceeds a threshold, currently set as the total number of features.  Each node of the tree keeps the number of examples belonging to each class. Classification  Each tree outputs membership probability  p(fraud|x) = n_fraud/(n_fraud + n_normal)  If a leaf node is empty (very likely for when discrete feature is tested at the end): Use the parent nodes’ probability estimate but do not output 0 or NaN  The membership probability from multiple random trees are averaged to approximate as the final output  Loss function is required to make a decision  0-1 loss: p(fraud|x) > 0.5, predict fraud  cost-sensitive loss: p(fraud|x) $1000 > $90 N-fold Cross-validation with Random Decision Trees  Tree structure is independent from the data.  Compensation when computing probability Key advantage  n-fold cross validation comes easy.  Same cost as testing the model once on the training data.  Training is efficient since we do not compute information gain.  It is actually also very accurate. Experiments  I have a demo available to show. Please contact me.  In the paper. I have the following experiments.  Synthetic datasets.  Credit card fraud datasets.  Donation datasets. Compare  This new selective framework proposed in this paper.  Our last year’s hard coded ensemble framework.  Use k number of weighted ensembles.  K=1. Only train on new data.  K=8.  Use new data and previous 7 periods of model.  Classifier is weighted against new data.  Sufficient and insufficient. Always drift. Data insufficient: new method Last year’s method Avg Result Data sufficient: new method Data sufficient: last year’s method Avg Result Independent study and implementation of random decision tree  Kai Ming Ting and Tony Liu from U of Monash, Australia on UCI datasets  Edward Greengrass from DOD on their data sets.      100 to 300 features. Both categorical and continuous features. Some features have a lot of values. 2000 to 3000 examples. Both binary and multiple class problem (16 and 25) Related publications on random trees  “Is random model better? On its accuracy and efficiency” ICDM 2003  “On the optimality of probability estimation by random decision trees” AAAI 04.  “Mining concept-drifting data streams with ensemble classifiers” SIGKDD2003

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download KDD04stream