Data Stream Classification and Novel Class Detection

Mehedy Masud, Latifur Khan, Qing Chen, and Bhavani Thuraisingham, Department of Computer Science, University of Texas at Dallas
Jing Gao and Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign
Charu Aggarwal, IBM T. J. Watson
This work was funded in part by
Masud et al., University of Texas at Dallas, Aug 10, 2011

Outline of the Presentation
• Background
• Data Stream Classification
• Novel Class Detection

Introduction
Characteristics of data streams:
◦ Continuous flow of data
◦ Examples: network traffic, sensor data, call center records

Data Stream Classification
• Uses past labeled data to build a classification model
• Predicts the labels of future instances using the model
• Helps decision making
[Figure: network traffic flows through a classification model at the firewall; attack traffic is blocked, quarantined, and sent for expert analysis and labeling, while benign traffic proceeds to the server.]

Data Stream Classification (cont.)
What are the applications?
◦ Security monitoring
◦ Network monitoring and traffic engineering
◦ Business: credit card transaction flows
◦ Telecommunication: calling records
◦ Web logs and web page click streams

Challenges
• Infinite length
• Concept-drift
• Concept-evolution
• Feature-evolution

Infinite Length
It is impractical to store and use all historical data, since that would require infinite storage and running time.

Concept-Drift
[Figure: a data chunk of positive and negative instances; the current hyperplane has moved away from the previous hyperplane, and instances lying between the two hyperplanes are victims of concept-drift.]

Concept-Evolution
[Figure: a feature space partitioned into regions A, B, C, D by thresholds x1, y1, y2, populated by + and - instances; a novel class (denoted by X) later arrives inside the space already partitioned among the existing classes.]
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -
Existing classification models misclassify the novel class instances.

Dynamic Features
Why do new features evolve?
◦ Infinite data stream: the global feature set is normally unknown, so new features may appear.
◦ Concept-drift: as concepts drift, new features may appear.
◦ Concept-evolution: a new class normally brings a new set of features.
Different chunks may therefore have different feature sets.

Dynamic Features
[Figure: the i-th chunk contains the features (runway, climb) and, after feature extraction and selection, the feature set (runway, clear, ramp); the (i+1)-st chunk contains (runway, ground, ramp). The current model and the incoming chunk pass through feature-space conversion before classification, novel class detection, and training of a new model.]
The i-th chunk, the (i+1)-st chunk, and the models all have different feature sets, while existing classification models need a complete, fixed feature set that applies to all chunks. Global features are difficult to predict; one solution is to use all English words and generate a vector, but the dimensionality of such a vector would be far too high. (A toy illustration follows below.)
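To make the feature-evolution problem concrete, here is a small illustrative sketch (not from the original slides; the chunk texts and the top-k frequency selection rule are assumptions) showing how per-chunk feature selection yields different feature sets:

```python
from collections import Counter

def select_features(chunk_docs, k=3):
    # Keep the k most frequent tokens of the chunk as its feature set.
    counts = Counter(tok for doc in chunk_docs for tok in doc.split())
    return {tok for tok, _ in counts.most_common(k)}

# Two consecutive chunks of a toy text stream (illustrative data).
chunk_i  = ["runway climb runway", "runway clear ramp"]
chunk_i1 = ["runway ground ramp", "ground ramp ramp"]

fs_i, fs_i1 = select_features(chunk_i), select_features(chunk_i1)
print(fs_i == fs_i1)   # False: the two chunks select different feature sets
print(fs_i1 - fs_i)    # features that appear only in the newer chunk
```

Any classifier trained with a fixed feature set would have to ignore the features that first appear in the newer chunk, which is exactly the problem the feature-space conversion strategies later in this deck address.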
Data Stream Classification (cont.)
• Single-model incremental classification
• Ensemble-based classification:
◦ Supervised
◦ Semi-supervised
◦ Active learning

Overview
• Single-model incremental classification
• Ensemble-based classification:
◦ Data selection
◦ Semi-supervised
◦ Skewed data

Ensemble of Classifiers
[Figure: an unlabeled input (x, ?) is fed to classifiers C1, C2, C3, which output +, +, and -; voting over the individual outputs yields the ensemble output +.]

Ensemble Classification of Data Streams
Divide the data stream into equal-sized chunks:
◦ train a classifier from each data chunk;
◦ keep the best L such classifiers as the ensemble (example: L = 3).
Note that a chunk Di may contain data points from different classes.
[Figure: labeled chunks train classifiers; the ensemble of the best L classifiers predicts the labels of the newest unlabeled chunk.]
This addresses infinite length and concept-drift. (A sketch of the scheme follows below.)
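A minimal sketch of this chunk-based ensemble (the base learner, the accuracy-on-newest-chunk selection criterion, and simple majority voting are assumptions; the slides do not pin these details down):

```python
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

L = 3          # ensemble size, as in the slides' example
ensemble = []  # current list of classifiers

def update_ensemble(X_chunk, y_chunk):
    """Train a classifier on the newest labeled chunk, then keep only the
    L best classifiers as measured on that chunk."""
    candidates = ensemble + [DecisionTreeClassifier().fit(X_chunk, y_chunk)]
    candidates.sort(key=lambda m: m.score(X_chunk, y_chunk), reverse=True)
    ensemble[:] = candidates[:L]

def predict(x):
    """Classify one unlabeled instance by majority vote of the ensemble."""
    votes = Counter(m.predict([x])[0] for m in ensemble)
    return votes.most_common(1)[0][0]
```

Because old models are retired whenever newer chunks produce better ones, the ensemble tracks concept-drift while storing only L models rather than the whole stream.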
ECSMiner: Concept-Evolution Problem
A completely new class of data arrives in the stream.
[Figure: (a) a decision tree splitting on x < x1, then y < y1 and y < y2, with +/- leaves for regions A, B, C, D; (b) the corresponding feature-space partitioning; (c) a novel class (denoted by x) arrives in the stream inside the existing partitions.]

ECSMiner: Overview
[Figure: older labeled instances train a new model from the last labeled chunk, which updates the ensemble of L models M1, ..., ML; each newly arrived unlabeled instance xnow undergoes outlier detection and is either buffered for novel class detection or classified by the ensemble.]
Based on: Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. "Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams." In Proc. ECML/PKDD '09, Bled, Slovenia, 7-11 Sept 2009, pp. 79-94 (extended version in IEEE TKDE).

ECSMiner: Algorithm
• Training
• Novel class detection and classification

ECSMiner: Novel Class Detection
Non-parametric: it does not assume any underlying model of the existing classes.
Steps:
1. Creating and saving a decision boundary during training
2. Detecting and filtering outliers
3. Measuring cohesion and separation among test and training instances

ECSMiner: Training - Creating Decision Boundary
[Figure: raw training data in regions A, B, C, D are clustered, and each cluster is summarized as a pseudopoint; the union of the pseudopoint hyperspheres forms the decision boundary.]
Summarizing chunks as pseudopoints addresses the infinite-length problem.

ECSMiner: Outlier Detection and Filtering
• A test instance inside the decision boundary is not an outlier.
• A test instance outside the decision boundary is a raw outlier (Routlier).
• If x is an Routlier with respect to all L models M1, ..., ML of the ensemble, then x is a filtered outlier (Foutlier), a potential novel class instance; otherwise x is classified as an existing class instance.
Routliers may appear as a result of a novel class, concept-drift, or noise; they are therefore filtered to reduce noise as much as possible.

ECSMiner: Novel Class Detection
1. Test instance x against the ensemble of L models M1, ..., ML.
2. If x is an Routlier with respect to all models, it becomes an Foutlier (potential novel class instance); otherwise it is treated as an existing class instance.
3. Compute q-NSC with all models over the Foutliers.
4. If q-NSC > 0 for more than q Foutliers (q' > q) with all models, a novel class is found; otherwise treat the Foutliers as existing class instances.

ECSMiner: Computing Cohesion and Separation
◦ a(x) = mean distance from an Foutlier x to the instances in λo,q(x), its q nearest Foutlier neighbors.
◦ bc(x) = mean distance from x to its q nearest neighbors in existing class c (e.g., b+(x)).
◦ bmin(x) = minimum among all bc(x).
The q-neighborhood silhouette coefficient (q-NSC) is

q-NSC(x) = (bmin(x) - a(x)) / max(bmin(x), a(x)).

If q-NSC(x) is positive, x is closer to the Foutliers than to any other class.

Speeding Up
Computing the q-NSC for every Foutlier instance takes time quadratic in the number of Foutliers. To make the computation faster, we create Ko pseudopoints (Fpseudopoints) from the Foutliers using K-means clustering, where Ko = (No/S) · K, S is the chunk size, and No is the number of Foutliers, and we perform the computations on the Fpseudopoints. The time complexity of computing the q-NSC of all Fpseudopoints is O(Ko(Ko + K)), which is constant, since both Ko and K are independent of the input size. By gaining speed we lose some precision, although the loss is negligible (analyzed shortly).

ECSMiner: Algorithm to Detect Novel Class
[Figure: pseudocode of the novel class detection algorithm.]

"Speedup" Penalty
By speeding up the computation in step 3 we lose some precision, since the result deviates from the exact result; the following analysis shows that the deviation is negligible.
[Figure 6: illustrating the computation of the deviation via the squared distances (x - φi)², (x - φj)², and (φi - φj)², where φi is an Fpseudopoint, i.e., a cluster of Foutliers, and φj is an existing-class pseudopoint, i.e., a cluster of existing class instances; in this example, all instances in φi belong to a novel class.]
[Slide: the approximate and exact q-NSC expressions and their deviation, shown as equations on the original slide.]
(A sketch of the detection test, including the Fpseudopoint speed-up, follows below.)
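A sketch of steps 2-3 for a single model, under the definitions just given. NumPy arrays and the `class_points` layout (one array of stored training instances per existing class) are assumptions, and the slides additionally require the outlier and q-NSC tests to agree across all L models; the Fpseudopoint speed-up follows the slides:

```python
import numpy as np
from sklearn.cluster import KMeans

def mean_qnn_dist(x, points, q):
    """Mean distance from x to its q nearest neighbors among `points`."""
    d = np.sort(np.linalg.norm(points - x, axis=1))
    return d[:q].mean()

def q_nsc(x, fout_points, class_points, q):
    """q-NSC(x) = (b_min(x) - a(x)) / max(b_min(x), a(x))."""
    a = mean_qnn_dist(x, fout_points, q)                 # cohesion
    b_min = min(mean_qnn_dist(x, pts, q)                 # separation
                for pts in class_points.values())
    return (b_min - a) / max(b_min, a)

def novel_class_found(foutliers, class_points, q, K, S):
    """Declare a novel class if more than q Foutlier instances have a
    positive q-NSC.  For speed, the Foutliers are first summarized into
    Ko = (No / S) * K Fpseudopoints, as the slides describe."""
    No = len(foutliers)
    Ko = max(2, int(No / S * K))
    km = KMeans(n_clusters=Ko, n_init=10).fit(foutliers)
    sizes = np.bincount(km.labels_, minlength=Ko)
    positives = 0
    for i, c in enumerate(km.cluster_centers_):
        others = np.delete(km.cluster_centers_, i, axis=0)
        if q_nsc(c, others, class_points, q) > 0:
            positives += sizes[i]     # a center stands for its whole cluster
    return positives > q
```

The inner loop runs over the Ko cluster centers instead of the No individual Foutliers, which is where the O(Ko(Ko + K)) cost in the slides comes from.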
Experiments - Datasets
We evaluated our approach on two synthetic and two real datasets:
• SynC: synthetic data with only concept-drift, generated using a hyperplane equation; 2 classes, 10 attributes, 250K instances.
• SynCN: synthetic data with concept-drift and novel classes, generated using Gaussian distributions; 20 classes, 40 attributes, 400K instances.
• KDD Cup 1999 intrusion detection (10% version): real dataset; 23 classes, 34 attributes, 490K instances.
• Forest Cover: real dataset; 7 classes, 54 attributes, 581K instances.

Experiments - Setup
Development: Java. Hardware: Intel P-IV with 2 GB memory and a 3 GHz dual-processor CPU.
Parameter settings:
◦ K (number of pseudopoints per chunk) = 50
◦ N (minimum number of instances required to declare a novel class) = 50
◦ M (ensemble size) = 6
◦ S (chunk size) = 2,000

Experiments - Baseline
Competing approaches:
◦ i) MineClass (MC): our approach.
◦ ii) WCE-OLINDDA_Parallel (W-OP).
◦ iii) WCE-OLINDDA_Single (W-OS).
WCE-OLINDDA is a combination of the Weighted Classifier Ensemble (WCE) and the novel class detector OLINDDA, with default parameter settings for both. We use this combination since, to the best of our knowledge, there is no other approach that can classify and detect novel classes simultaneously. OLINDDA assumes there is only one normal class and that all other classes are novel, so we apply two variations:
◦ W-OP keeps parallel OLINDDA models, one for each class;
◦ W-OS keeps a single model that absorbs a novel class when one is encountered.

Experiments - Results
Evaluation metrics:
◦ Mnew = % of novel class instances misclassified as existing class = Fn · 100 / Nc
◦ Fnew = % of existing class instances falsely identified as novel class = Fp · 100 / (N - Nc)
◦ ERR = total misclassification error (%), including Mnew and Fnew = (Fp + Fn + Fe) · 100 / N
where Fn = total novel class instances misclassified as existing class, Fp = total existing class instances misclassified as novel class, Fe = total existing class instances misclassified otherwise, Nc = total novel class instances in the stream, and N = total instances in the stream. (These metrics are sketched in code below.)

Experiments - Results
[Figures: results on Forest Cover, KDD Cup, and SynCN, plus additional result curves.]

Experiments - Parameter Sensitivity
[Figures: sensitivity of the results to the parameter settings.]

Experiments - Runtime
[Figures: running time comparison.]
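The evaluation metrics defined above translate directly into code; a minimal sketch (the example counts are made up for illustration):

```python
def stream_metrics(Fn, Fp, Fe, Nc, N):
    """ERR, Mnew, and Fnew (in percent) as defined on the slides."""
    Mnew = 100.0 * Fn / Nc
    Fnew = 100.0 * Fp / (N - Nc)
    ERR  = 100.0 * (Fp + Fn + Fe) / N
    return ERR, Mnew, Fnew

# Example: 10,000 instances, 500 of them novel; 40 novel instances missed,
# 40 existing instances flagged as novel, 120 other misclassifications.
print(stream_metrics(Fn=40, Fp=40, Fe=120, Nc=500, N=10_000))
# -> ERR = 2.0%, Mnew = 8.0%, Fnew ~= 0.42%
```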
Dynamic Features
Solution:
◦ Global features
◦ Local features
◦ Union
Mohammad Masud, Qing Chen, Latifur Khan, Jing Gao, Jiawei Han, and Bhavani Thuraisingham. "Classification and Novel Class Detection of Data Streams in a Dynamic Feature Space." In Proc. ECML PKDD 2010, Barcelona, Spain, Sept 2010, Springer, pp. 337-352.

Feature Mapping Across Models and Test Data Points
The feature set varies across chunks; in particular, when a new class appears, new features should be selected and added to the feature set.
• Strategy 1 - Lossy fixed (Lossy-F) conversion / Global: use the same fixed feature set for the entire stream. We call this a lossy conversion because future models and instances may lose important features due to this mapping.
• Strategy 2 - Lossy local (Lossy-L) conversion / Local: also a lossy conversion, because feature values may be lost during the mapping.
• Strategy 3 - Dimension-preserving (D-Preserving) mapping / Union.

Feature Space Conversion - Lossy-L Mapping (Local)
• Assume that each data chunk has a different feature vector.
• When a classification model is trained, we save its feature vector with the model.
• When an instance is tested, its feature vector is mapped (i.e., projected) onto the model's feature vector.
For example, suppose the model has two features (x, y) and the instance has two features (y, z). When testing, the instance is treated as having the two features (x, y), where x = 0 and the y value is kept as it is.
[Figure: the Lossy-L mapping shown graphically.]

Conversion Strategy III - D-Preserving Mapping
When an instance is tested, both the model's feature vector and the instance's feature vector are mapped (i.e., projected) onto the union of their feature vectors:
◦ the feature dimension is increased;
◦ both the features of the testing instance and those of the model are preserved, and the extra features are filled with 0s.
For example, suppose the model has three features (a, b, c) and the instance has four features (b, c, d, e). When testing, we project both the model's feature vector and the instance's feature vector onto (a, b, c, d, e); in the model, d and e are considered 0s, and in the instance, a is considered 0. (Both mappings are sketched below.)
[Figure: the D-Preserving mapping applied to the previous example.]
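A sketch of the two conversions with instances represented as sparse feature dicts (the dict representation is an assumption; the example mirrors the slides' (a, b, c) vs. (b, c, d, e) case):

```python
def lossy_local(instance, model_features):
    """Project the instance onto the model's feature set: shared features
    keep their values, model-only features become 0, and features the
    model never saw are dropped (their values are lost)."""
    return {f: instance.get(f, 0.0) for f in model_features}

def d_preserving(instance, model_vector):
    """Project both vectors onto the union of their feature sets,
    filling the extra dimensions with 0s, so no values are lost."""
    union = set(instance) | set(model_vector)
    return ({f: instance.get(f, 0.0) for f in union},
            {f: model_vector.get(f, 0.0) for f in union})

model = {"a": 5.0, "b": 6.0, "c": 7.0}            # model features (a, b, c)
inst  = {"b": 1.0, "c": 2.0, "d": 3.0, "e": 4.0}  # instance features (b, c, d, e)
print(lossy_local(inst, model))   # {'a': 0.0, 'b': 1.0, 'c': 2.0} -- d, e lost
print(d_preserving(inst, model))  # both vectors now span {a, b, c, d, e}
```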
Discussion
• Local does not favor the novel class; it favors the existing classes, since local features are enough to model the existing classes.
• Union favors the novel class: new features may be discriminating for a novel class, hence Union works.

Comparison
Which strategy is better?
Assumption: the lossless conversion (Union) preserves the properties of a novel class; in other words, if an instance belongs to a novel class, it remains outside the decision boundary of every model Mi of the ensemble M in the converted feature space.
Lemma: if a test point x belongs to a novel class, it will be misclassified by the ensemble M as an existing class instance under certain conditions when the Lossy-L conversion is used.
Proof: Let X1, ..., XL, XL+1, ..., XM be the dimensions of the model and X1, ..., XL, XM+1, ..., XN be the dimensions of the test point, so that the combined feature space is X1, ..., XL, XL+1, ..., XM, XM+1, ..., XN. Let R be the radius of the closest cluster (in the higher-dimensional space), and let the test point be a novel class instance.
◦ Centroid of the cluster: (x1, ..., xL, xL+1, ..., xM) in the original space; (x1, ..., xL, xL+1, ..., xM, 0, ..., 0) in the combined space.
◦ Test point: (x'1, ..., x'L, x'M+1, ..., x'N) in the original space; (x'1, ..., x'L, 0, ..., 0, x'M+1, ..., x'N) in the combined space.
Since the test point is a novel class instance, by the assumption it lies outside the cluster in the combined space:

R² < ((x1 - x'1)² + ... + (xL - x'L)² + x²L+1 + ... + x²M) + (x'²M+1 + ... + x'²N) = a² + b²,

where a² (the first group of terms) is the squared distance between the test point and the centroid in the Lossy-L (model's) feature space, and b² = x'²M+1 + ... + x'²N is the contribution of the test point's extra dimensions, which the Lossy-L conversion discards. Writing a² + b² = R² + e² with e² > 0 gives a² = R² + (e² - b²), and therefore a² < R², provided that e² < b². Hence, under the Lossy-L conversion the test point falls inside the cluster radius and is not detected as an outlier: it is misclassified as an existing class instance. (A numeric illustration follows below.)
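A tiny numeric instance of the lemma (all values are made up for illustration): the model's features are (x, y), the test point's are (y, z), and the combined space is (x, y, z). The novel test point is outside the cluster in the combined space but inside it after the Lossy-L projection:

```python
import numpy as np

R = 3.2                                # radius of the closest cluster (assumed)
centroid = np.array([3.0, 0.0, 0.0])   # model dims (x, y), padded with z = 0
test     = np.array([0.0, 0.0, 2.0])   # test dims (y, z), padded with x = 0

d2_union  = np.sum((test - centroid) ** 2)            # 13.0 > R^2 = 10.24
d2_lossyL = np.sum((test[:2] - centroid[:2]) ** 2)    # 9.0  < R^2
print(d2_union > R**2, d2_lossyL > R**2)              # True False
# Here b^2 = 4 (the discarded z-dimension) and e^2 = 13 - 10.24 = 2.76 < b^2,
# so under Lossy-L the novel point is swallowed by the existing cluster.
```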
Baseline Approaches
• WCE is the Weighted Classifier Ensemble [1], a multi-class ensemble classifier.
• OLINDDA [2] is a novel class detector that works only for binary classes.
• FAE [3] is an ensemble classifier that addresses feature-evolution and concept-drift.
• ECSMiner [4] is a multi-class ensemble classifier that addresses concept-drift and concept-evolution.

Approaches Comparison
[Table: which of the challenges (infinite length, concept-drift, concept-evolution) each technique (OLINDDA, WCE, FAE, ECSMiner, DXMiner) addresses.]

Dynamic Features - Experiments: Datasets
We evaluated our approach on different datasets:

Data Set | # of Instances | # of Classes
KDD | 492K | 7
Forest Cover | 387K | 7
NASA | 140K | 21
Twitter | 335K | 21

[The original table also marks which of concept-drift, concept-evolution, and dynamic features each dataset exhibits.]

Experiments: Results
Evaluation metrics: let Fn = total novel class instances misclassified as existing class, Fp = total existing class instances misclassified as novel class, Fe = total existing class instances misclassified otherwise, Nc = total novel class instances in the stream, and N = total instances in the stream. We use:
◦ Mnew = % of novel class instances misclassified as existing class = Fn · 100 / Nc
◦ Fnew = % of existing class instances falsely identified as novel class = Fp · 100 / (N - Nc)
◦ ERR = total misclassification error (%), including Mnew and Fnew = (Fp + Fn + Fe) · 100 / N

Experiments: Setup
Development: Java. Hardware: Intel P-IV with 3 GB memory and a 3 GHz dual-processor CPU.
Parameter settings:
◦ K (number of pseudopoints per chunk) = 50
◦ q (minimum number of instances required to declare a novel class) = 50
◦ L (ensemble size) = 6
◦ S (chunk size) = 1,000

Experiments: Baseline
Competing approaches:
◦ i) DXMiner (DXM): our approach, with the conversion variations Lossy-F, Lossy-L, and D-Preserving.
◦ ii) FAE-WCE-OLINDDA_Parallel (W-OP): assumes there is only one normal class and that all other classes are novel; W-OP keeps parallel OLINDDA models, one for each class. We use this combination since, to the best of our knowledge, there is no other approach that can classify and detect novel classes simultaneously under feature-evolution.
◦ iii) FAE-ECSMiner.

Twitter Results
[Figures: error and novel class detection over the stream.]
AUC: D-Preserving 0.88; Lossy-Local 0.83; Lossy-Global 0.76; O-F 0.56.

NASA Dataset
AUC: Deviation 0.996; Info Gain 0.967; O-F 0.876.

Forest Cover Results
[Figures.]
AUC: D-Preserving 0.97; O-F 0.74.

KDD Results
[Figures.]
AUC: D-Preserving 0.98; FAE-Olindda 0.96.

Summary Results
[Table/figures: summary of the results.]

Proposed Methods: Improved Outlier Detection and Multiple Novel Class Detection
Challenges:
◦ high false positive (FP) rates (existing classes detected as novel) and false negative (FN) rates (missed novel classes);
◦ two or more novel classes arriving at a time.
Solutions [5]:
◦ A dynamic decision boundary, based on previous mistakes: inflate the decision boundary if FP is high, deflate it if FN is high.
◦ A statistical model to filter noisy data and concept-drift out of the outliers.
◦ Multiple novel classes are detected by constructing a graph in which each outlier cluster is a vertex, merging the vertices based on the silhouette coefficient, and counting the number of connected components in the resultant (merged) graph.
[5] Mohammad M. Masud, Qing Chen, Jing Gao, Latifur Khan, Charu Aggarwal, Jiawei Han, and Bhavani Thuraisingham. "Addressing Concept-Evolution in Concept-Drifting Data Streams." In Proc. ICDM '10, Sydney, Australia, Dec 14-17, 2010.

Proposed Methods: Outlier Threshold (OUTTH)
To declare a test instance an outlier, using the cluster radius r alone is not enough because of data noise; so, beyond the radius r, a threshold (OUTTH) is set up so that most noisy data around a model cluster is classified immediately.
[Figure: an instance x outside a cluster, with cohesion distance a(x) and separation distance b+(x).]
Every instance outside the cluster range has a weight

wt(x) = exp(-(bx - r)),

where bx is the distance from x to the cluster.
◦ If wt(x) ≥ OUTTH, the instance is considered an existing class instance.
◦ If wt(x) < OUTTH, the instance is an outlier.
Pros: noisy data are classified immediately.
Cons: OUTTH is hard to determine, because noisy data and novel class instances may occur simultaneously and different datasets may need different OUTTH values.

Proposed Methods: Outlier Threshold (OUTTH)
OUTTH = ?
◦ If the threshold is too high, noisy data may become outliers, and the FP rate goes up.
◦ If the threshold is too low, novel class instances are labeled as existing class, and the FN rate goes up.
We need to balance these two. (A sketch of the weight test follows below.)
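A sketch of the weight test (the OUTTH value and the example distances are illustrative assumptions; the next slides discuss setting OUTTH dynamically):

```python
import math

def outlier_by_weight(b_x, r, outth):
    """wt(x) = exp(-(b_x - r)) for an instance at distance b_x from a
    cluster of radius r; instances whose weight falls below OUTTH are
    treated as outliers, the rest are classified immediately."""
    wt = math.exp(-(b_x - r))
    return wt < outth

OUTTH = 0.7  # illustrative value only; hard to choose, as noted above
print(outlier_by_weight(1.05, 1.0, OUTTH))  # wt ~= 0.95 -> existing class
print(outlier_by_weight(2.00, 1.0, OUTTH))  # wt ~= 0.37 -> outlier
```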
Proposed Methods: Dynamic Threshold Setting
[Figure: instances just outside the decision boundary illustrating a marginal FN and a marginal FP.]
◦ Defer approach: after a test chunk has been labeled, update OUTTH based on the marginal FP and FN rates of that chunk, and then apply the new OUTTH to the next test chunk.
◦ Eager approach: once a marginal FP or marginal FN instance is detected, update OUTTH with a step function and apply the updated OUTTH to the next test instance.
[Figure: the dynamic threshold update.]

Proposed Methods: Defer Approach and Eager Approach Comparison
In the defer approach, OUTTH is updated only after a data chunk has been labeled:
◦ Too late: many marginal FP or FN instances may occur within the test chunk due to an improper OUTTH threshold.
◦ Overreact: if there are many marginal FP or FN instances in the labeled test chunk, the OUTTH update may overreact for the next test chunk.
In the eager approach, OUTTH is updated aggressively whenever a marginal FP or FN happens:
◦ the model is more tolerant to noisy data and concept-drift;
◦ the model is more sensitive to novel class instances.

Proposed Methods: Outlier Statistics
For each outlier instance we calculate the novelty probability

Pnov(x) = ((1 - wt(x)) / (1 - min(wt(x)))) · Psc.

If Pnov is large (close to 1), the outlier has a high probability of being a novel class instance. Pnov contains two parts:
◦ the first part measures how far the outlier lies from the model cluster;
◦ the second part, Psc, is the silhouette coefficient, which measures the cohesion and separation of the outlier's q-neighbors with respect to the model cluster.

Proposed Methods: Outlier Statistics
Noisy data, concept-drift, and a novel class: all three scenarios may occur simultaneously.

Proposed Methods: Outlier Statistics - Gini Analysis
The Gini coefficient is a measure of statistical inequality. If we divide [0, 1] into n equal-sized bins, place every outlier's Pnov into its corresponding bin, and take the cumulative distribution yi, the discrete Gini coefficient is

G(s) = (1/n) · (n + 1 - 2 · Σi=1..n (n + 1 - i) · yi / Σi=1..n yi).

◦ If all Pnov are very low, in the extreme yi = 1 for all i, and G(s) = (1/n) · (n + 1 - (n + 1)) = 0.
◦ If all Pnov are very high, in the extreme yi = 0 except yn = 1, and G(s) = (1/n) · (n + 1 - 2) = (n - 1)/n.
◦ If the Pnov values are distributed evenly, yi = i/n, and G(s) = (n - 1)/(3n).
After obtaining the Pnov distribution of the outliers, we calculate G(s):
◦ if G(s) > (n - 1)/(3n), declare a novel class;
◦ if G(s) ≤ (n - 1)/(3n), classify the outliers as existing class instances.
As n → ∞, (n - 1)/(3n) → 1/3 ≈ 0.33. (A sketch of this test follows below.)

Proposed Methods: Gini Analysis - Limitation
[Figure: cdfs of Pnov under concept-drift, severe concept-drift, and a novel class.]
In the extreme, it is impossible to differentiate concept-drift from concept-evolution using the Gini coefficient when the concept-drift merely "looks like" concept-evolution.
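A sketch of the Gini test over the outliers' Pnov values (the bin count n = 10 and the synthetic Pnov values are assumptions; the formula is the one given above):

```python
import numpy as np

def gini_over_pnov(pnov, n=10):
    """Bin the Pnov values into n equal-width bins over [0, 1], take the
    cdf y_i, and compute
    G(s) = (1/n) * (n + 1 - 2 * sum((n+1-i) * y_i) / sum(y_i))."""
    hist, _ = np.histogram(pnov, bins=n, range=(0.0, 1.0))
    y = np.cumsum(hist) / len(pnov)          # cdf values y_1 .. y_n
    i = np.arange(1, n + 1)
    return (n + 1 - 2 * np.sum((n + 1 - i) * y) / np.sum(y)) / n

n = 10
pnov = np.random.uniform(0.8, 1.0, size=200)   # outliers that look novel
G = gini_over_pnov(pnov, n)
print(G > (n - 1) / (3 * n))   # True -> declare a novel class
```

With all Pnov values high, the cdf stays near 0 until the last bins, so G(s) is close to (n - 1)/n and well above the (n - 1)/(3n) threshold, exactly as the extremes above predict.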
Proposed Methods: Multi Novel Class Detection
[Figure: a data stream containing positive and negative instances together with instances of two novel classes A and B.]
If we always assume that novel instances belong to one novel type, one type of novel instance, either A or B, will be misclassified.
The main idea in detecting multiple novel classes is to construct a graph and identify the connected components in it; the number of connected components determines the number of novel classes.
Two phases:
◦ Building the connected graph: build a directed nearest-neighbor graph, adding an edge from each vertex (outlier cluster) to its nearest neighbor; if the silhouette coefficient from a vertex to its nearest neighbor is larger than some threshold, the edge is removed. Problem: linkage circles.
◦ Component merging phase: a Gaussian-distribution-centric decision.

Proposed Methods: Multi Novel Class Detection - Component Merging
In probability theory, "the normal (or Gaussian) distribution is a continuous probability distribution that is often used as a first approximation to describe real-valued random variables that tend to cluster around a single mean value" [6]. Let d = centroid_dist(g1, g2) be the distance between the centroids of two Gaussian components g1 and g2. If the two components can be separated, the following condition holds:

d = centroid_dist(g1, g2) > c · (σ1 + σ2).

The mean distance μ of a component's members from its centroid is proportional to σ:

μ = (2 / √(2πσ²)) · ∫0..∞ x · e^(-x²/(2σ²)) dx = √(2/π) · σ.

Since μ is proportional to σ, if d > c · (σ1 + σ2) the two components remain separated; otherwise, the two components are merged. (A sketch of the component-counting step follows below.)
[6] Amari Shun'ichi and Nagaoka Hiroshi. Methods of Information Geometry. Oxford University Press, ISBN 0-8218-0531-2, 2000.
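A sketch of the counting step, with the outlier-cluster centroids as vertices. The distance-based edge filter here is a simple stand-in for the slides' silhouette-coefficient test and Gaussian merging rule, and the example points are made up:

```python
import numpy as np

def count_novel_classes(centroids):
    """centroids: (n, d) array, one row per outlier cluster (graph vertex).
    Add a directed edge from each vertex to its nearest neighbor, drop
    edges between well-separated vertices, and return the number of
    connected components, i.e. the number of distinct novel classes."""
    n = len(centroids)
    parent = list(range(n))

    def find(u):                      # union-find with path compression
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    dist = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    cutoff = 2.0 * np.median(dist.min(axis=1))   # "well separated" heuristic
    for u in range(n):
        v = int(np.argmin(dist[u]))              # u's nearest neighbor
        if dist[u, v] <= cutoff:
            parent[find(u)] = find(v)            # merge the two components
    return len({find(u) for u in range(n)})

# Two well-separated groups of outlier clusters -> two novel classes.
pts = np.array([[0, 0], [0.5, 0], [0, 0.5], [9, 9], [9.5, 9], [9, 9.5]])
print(count_novel_classes(pts))   # 2
```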
Experiment Results - Datasets
We evaluated our approach on different datasets:

Data Set | # of Instances | # of Classes
KDD | 492K | 7
Forest Cover | 387K | 7
NASA | 140K | 21
Twitter | 335K | 21
SynED | 400K | 20

[The original table also marks which of concept-drift, concept-evolution, and dynamic features each dataset exhibits.]

Experiment Results - Setup
Development: Java. Hardware: Intel P-IV with 3 GB memory and a 3 GHz dual-processor CPU.
Parameter settings:
◦ K (number of pseudopoints per chunk) = 50
◦ q (minimum number of instances required to declare a novel class) = 50
◦ L (ensemble size) = 6
◦ S (chunk size) = 1,000

Experiments: Baseline
Competing approaches:
◦ i) DEMminer, our approach, with five variations: Lossy-F conversion; Lossy-L conversion; lossless conversion (DEMminer); dynamic OUTTH + lossless conversion (DEMminer-Ex without Gini); dynamic OUTTH + Gini + lossless conversion (DEMminer-Ex).
◦ ii) WCE-OLINDDA (O-W).
◦ iii) FAE-WCE-OLINDDA_Parallel (O-F).
We use this combination since, to the best of our knowledge, there is no other approach that can classify and detect novel classes simultaneously under feature-evolution.

Experiment Results - Metrics
Evaluation metrics: Fn = total novel class instances misclassified as existing class; Fp = total existing class instances misclassified as novel class; Fe = total existing class instances misclassified otherwise; Nc = total novel class instances in the stream; N = total instances in the stream.

Experiment Results - Twitter
[Figures: total error (%) and cumulative novel class instances over the stream for DEMminer, Lossy-F, Lossy-L, and O-F.]
[Figure: ROC curves.] AUC: DEMminer 0.88; Lossy-L 0.83; Lossy-F 0.76; O-F 0.56.
[Figures: total error and novel class instances over the stream for DEMminer, DEMminer-Ex, and OW.]
[Figure: ROC curves.] AUC: DEMminer-Ex 0.94; DEMminer 0.88; OW 0.56.

Experiment Results - Forest Cover
[Figures: novel class instances and total error over the stream for DEMminer, DEMminer-Ex (without Gini), DEMminer-Ex, and OW.]
[Figure: ROC curves.] AUC: DEMminer 0.97; DEMminer-Ex (without Gini) 0.99; DEMminer-Ex 0.97; OW 0.74.

Experiment Results - NASA Dataset
[Figures.] AUC: Deviation 0.996; Info Gain 0.967; FAE 0.876.

Experiment Results - KDD
[Figure: cumulative novel-class error over the stream for DEMminer and O-F.]
[Figure: ROC curves.] AUC: DEMminer 0.98; O-F 0.96.

Experiment Results - Result Summary

Dataset | Method | ERR | Mnew | Fnew | AUC
Twitter | DEMminer | 4.2 | 30.5 | 0.8 | 0.877
Twitter | Lossy-F | 32.5 | 0.0 | 32.6 | 0.764
Twitter | Lossy-L | 1.6 | 82.0 | 0.0 | 0.834
Twitter | O-F | 3.4 | 96.7 | 1.6 | 0.557
ASRS | DEMminer | 0.02 | 0.00 | 0.1 | 0.996
ASRS | DEMminer (info-gain) | 1.4 | 10.3 | 0.04 | 0.967
ASRS | O-F | - | 24.7 | - | 0.876
Forest Cover | DEMminer | 3.6 | 8.4 | 1.3 | 0.973
Forest Cover | O-F | 5.9 | 20.6 | 1.1 | 0.743
KDD | DEMminer | 1.2 | 5.9 | 0.9 | 0.986
KDD | O-F | 4.7 | 9.6 | 4.4 | 0.967
Experiment Results - Result Summary (with dynamic OUTTH and Gini)

Dataset | Method | ERR | Mnew | Fnew | AUC
Twitter | DEMminer | 4.2 | 30.5 | 0.8 | 0.877
Twitter | DEMminer-Ex | 1.8 | 0.7 | 0.6 | 0.944
Twitter | OW | 3.4 | 96.7 | 1.6 | 0.557
Forest Cover | DEMminer | 3.6 | 8.4 | 1.3 | 0.974
Forest Cover | DEMminer-Ex | 3.1 | 4.0 | 0.68 | 0.990
Forest Cover | OW | 5.9 | 20.6 | 1.1 | 0.743

Experiment Results - Running Time Comparison

Dataset | Time (sec)/1K pts: DEMminer / Lossy-F / O-F | Points/sec: DEMminer / Lossy-F / O-F | Speed gain (DEMminer over O-F)
Twitter | 23 / 3.5 / 66.7 | 43 / 289 / 15 | 2.9
ASRS | 21 / 4.3 / 38.5 | 47 / 233 / 26 | 1.8
Forest Cover | 1.0 / 1.0 / 4.7 | 967 / 1003 / 212 | 4.7
KDD | 1.2 / 1.2 / 3.3 | 858 / 812 / 334 | 2.5

Experiment Results - Multi Novel Class Detection
[Figures: multiple novel class detection results.]

Conclusion
• Our data stream classification technique addresses infinite length, concept-drift, concept-evolution, and feature-evolution; existing approaches address only the first two issues.
• It is applicable to many domains, such as intrusion/malware detection, text categorization, and fault detection.

References
• J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Loh. "BOAT-Optimistic Decision Tree Construction." In Proc. SIGMOD, 1999.
• P. Domingos and G. Hulten. "Mining High-Speed Data Streams." In Proc. SIGKDD, pages 71-80, 2000.
• B. Wenerstrom and C. Giraud-Carrier. "Temporal Data Mining in Dynamic Feature Spaces." In P. Perner (ed.), ICDM 2006, LNCS (LNAI) vol. 4065, pp. 1141-1145, Springer, Heidelberg, 2006.
• E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama. "Cluster-Based Novel Concept Detection in Data Streams Applied to Intrusion Detection in Computer Networks." In Proc. 2008 ACM Symposium on Applied Computing, pages 976-980, 2008.
• M. Scholz and R. Klinkenberg. "An Ensemble Classifier for Drifting Concepts." In Proc. ICML/PKDD Workshop on Knowledge Discovery in Data Streams, 2005.
• J. Brutlag. "Aberrant Behavior Detection in Time Series for Network Monitoring." In Proc. Usenix Fourteenth System Administration Conference (LISA XIV), New Orleans, LA, Dec 2000.
• E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. "A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data." In Applications of Data Mining in Computer Security, Kluwer, 2002.
• W. Fan. "Systematic Data Selection to Mine Concept-Drifting Data Streams." In Proc. KDD '04.
• J. Gao, W. Fan, and J. Han. "On Appropriate Assumptions to Mine Data Streams." 2007.
• J. Gao, W. Fan, J. Han, and P. S. Yu. "A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions." In Proc. SDM 2007.
• J. Goebel and T. Holz. "Rishi: Identify Bot Contaminated Hosts by IRC Nickname Evaluation." In Usenix/Hotbots '07 Workshop, 2007.
• J. B. Grizzard, V. Sharma, C. Nunnery, B. B. Kang, and D. Dagon. "Peer-to-Peer Botnets: Overview and Case Study." In Usenix/Hotbots '07 Workshop, 2007.
• E. J. Keogh and M. J. Pazzani. "Scaling Up Dynamic Time Warping for Data Mining Applications." In Proc. ACM SIGKDD, 2000.
• R. Lemos. "Bot Software Looks to Improve Peerage." SecurityFocus, http://www.securityfocus.com/news/11390, 2006.
• C. Livadas, B. Walsh, D. Lapsley, and T. Strayer. "Using Machine Learning Techniques to Identify Botnet Traffic." In 2nd IEEE LCN Workshop on Network Security (WoNS 2006), November 2006.
• LURHQ Threat Intelligence Group. "Sinit P2P Trojan Analysis." http://www.lurhq.com/sinit.html, 2004.
• M. A. Rajab, J. Zarfoss, F. Monrose, and A. Terzis. "A Multifaceted Approach to Understanding the Botnet Phenomenon." In Proc. 6th ACM SIGCOMM Internet Measurement Conference (IMC), 2006.
• K. Tumer and J. Ghosh. "Error Correlation and Error Reduction in Ensemble Classifiers." Connection Science, 8(3-4):385-403, 1996.
• Mohammad Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. "A Multi-Partition Multi-Chunk Ensemble Technique to Classify Concept-Drifting Data Streams." In Proc. 13th PAKDD, pages 363-375, Bangkok, Thailand, April 2009.
• Mohammad Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. "A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data." In Proc. ICDM 2008, Pisa, Italy, pages 929-934, December 2008.
• Clay Woolam, Mohammed Masud, and Latifur Khan. "Lacking Labels in the Stream: Classifying Evolving Stream Data with Few Labels." In Proc. 18th International Symposium on Methodologies for Intelligent Systems (ISMIS), pages 552-562, Prague, Czech Republic, September 2009.
• Mohammad Masud, Qing Chen, Latifur Khan, Charu Aggarwal, Jing Gao, Jiawei Han, and Bhavani Thuraisingham. "Addressing Concept-Evolution in Concept-Drifting Data Streams." In Proc. ICDM 2010, Sydney, Australia, Dec 2010.
• Mohammad M. Masud, Qing Chen, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. "Classification and Novel Class Detection of Data Streams in a Dynamic Feature Space." In Proc. ECML PKDD 2010, Barcelona, Spain, September 20-24, 2010, Springer, ISBN 978-3-642-15882-7, pages 337-352.
• Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. "Classification and Novel Class Detection in Data Streams with Active Mining." In Proc. 14th PAKDD, pages 311-324, Hyderabad, India, 21-24 June 2010.
• Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. "Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints." IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 23, no. 6, pages 859-874, June 2011.
• Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. "A Framework for Clustering Evolving Data Streams." In Proc. VLDB '03, 29th International Conference on Very Large Data Bases, 2003.
• H. Wang, W. Fan, P. S. Yu, and J. Han. "Mining Concept-Drifting Data Streams Using Ensemble Classifiers." In Proc. Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226-235, Washington, DC, USA, Aug 2003.
• Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. "Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams." In Proc. ECML/PKDD '09, Bled, Slovenia, 7-11 Sept 2009.

Questions?