Data Mining: Anomaly Detection
Based on Tan, Steinbach, Kumar; Andrew Moore; Arindam Banerjee; Varun Chandola; Vipin Kumar; Jaideep Srivastava; Aleksandar Lazarevic
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Anomaly/Outlier Detection
- What are anomalies/outliers? The set of data points that are considerably different from the remainder of the data
- Applications: credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection

Related Problems
- Rare class mining, chance discovery, novelty detection, exception mining, noise removal, black swans*
* N. Taleb, The Black Swan: The Impact of the Highly Improbable, 2007

Importance of Anomaly Detection: Ozone Depletion History
- In 1985, three researchers (Farman, Gardiner, and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
- Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
- The ozone concentrations recorded by the satellite were so low that they were being treated as outliers by a computer program and discarded!
Sources: http://exploringdata.cqu.edu.au/ozone.html, http://www.epa.gov/ozone/science/hole/size.html

Why Anomaly Detection?
- Anomalies are the focus: anomaly detection is the goal
- Anomalies distort the data analysis: anomaly removal as data preprocessing

Causes of Anomalies
- Data from different classes
- Natural variation
- Data measurement and collection errors
- A combination of different reasons

Data Labels
- Supervised anomaly detection: labels available for both normal data and anomalies; similar to rare class mining
- Semi-supervised anomaly detection: labels available only for normal data
- Unsupervised anomaly detection: no labels assumed; based on the assumption that anomalies are very rare compared to normal data

Types of Anomaly
- Point anomalies
- Contextual anomalies
- Collective anomalies

Point Anomalies
- An individual data instance is anomalous w.r.t. the data
[Figure: 2-D scatter plot with two normal regions N1 and N2, point anomalies o1 and o2, and a small anomalous group O3.]

Contextual Anomalies
- An individual data instance is anomalous within a context
- Requires a notion of context
- Also referred to as conditional anomalies*
[Figure: time series in which the same value is marked "Normal" in one context and "Anomaly" in another.]
* Xiuyao Song, Mingxi Wu, Christopher Jermaine, Sanjay Ranka. Conditional Anomaly Detection. IEEE Transactions on Knowledge and Data Engineering, 2006.
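The contextual idea can be sketched with a toy example: the same temperature reading is normal in July but anomalous in December. Everything here (the `history` data, the z-score rule, the threshold) is an illustrative assumption, not something from the slides.

```python
from statistics import mean, stdev

# Toy "normal" history per context (month); invented illustrative data.
history = {
    "dec": [0.0, 1.0, -1.0, 2.0, -2.0],    # winter temperatures
    "jul": [24.0, 26.0, 25.0, 23.0, 27.0], # summer temperatures
}

def is_contextual_anomaly(month, temp, z_threshold=3.0):
    """Flag temp as anomalous w.r.t. its context (the month):
    the contextual attribute is the month, the behavioral attribute
    is the temperature."""
    values = history[month]
    mu, sigma = mean(values), stdev(values)
    return abs(temp - mu) / sigma > z_threshold

print(is_contextual_anomaly("jul", 25.0))  # → False (normal in July)
print(is_contextual_anomaly("dec", 25.0))  # → True (anomalous in December)
```

The same behavioral value gets opposite verdicts because the profile of "normal" is built per context rather than globally.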
Collective Anomalies
- A collection of related data instances is anomalous
- Requires a relationship among data instances: sequential data, spatial data, graph data
- The individual instances within a collective anomaly are not anomalous by themselves
[Figure: time series with an anomalous subsequence highlighted.]

Evaluation of Anomaly Detection: F-value
- Accuracy is not a sufficient metric for evaluation
- Example: a network traffic data set with 99.9% normal data and 0.1% intrusions; a trivial classifier that labels everything as normal achieves 99.9% accuracy!
- Confusion matrix (anomaly class C, normal class NC):

                   Actual NC   Actual C
    Predicted NC      TN          FN
    Predicted C       FP          TP

- Focus on both recall and precision:
  - Recall R = TP / (TP + FN)
  - Precision P = TP / (TP + FP)
  - F-measure = 2RP / (R + P)

Evaluation of Outlier Detection: ROC & AUC
- Standard measures for evaluating anomaly detection problems:
  - Recall (detection rate): the ratio between the number of correctly detected anomalies and the total number of anomalies
  - False alarm (false positive) rate: the ratio between the number of normal records misclassified as anomalies and the total number of normal records
  - The ROC curve is a trade-off between detection rate and false alarm rate
  - The area under the ROC curve (AUC) is computed using a trapezoid rule
[Figure: ROC curves for different outlier detection techniques, detection rate vs. false alarm rate; the ideal ROC curve hugs the top-left corner, and the shaded area under a curve is its AUC.]

Applications of Anomaly Detection
- Network intrusion detection
- Insurance / credit card fraud detection
- Healthcare Informatics /
Medical diagnostics
- Industrial damage detection
- Image processing / video surveillance
- Novel topic detection in text mining
- …

Unsupervised Anomaly Detection
- Challenges:
  - How many outliers are there in the data?
  - The method is unsupervised, so validation can be quite challenging (just like for clustering)
- Working assumption: there are considerably more "normal" observations than "abnormal" observations (outliers/anomalies) in the data

Anomaly Detection Schemes
- General steps:
  - Build a profile of the "normal" behavior; the profile can be patterns or summary statistics for the overall population
  - Use the "normal" profile to detect anomalies: observations whose characteristics differ significantly from the normal profile
- Types of anomaly detection schemes: graphical & statistical-based, distance-based, model-based

Graphical/Visualization Approaches
- Boxplot (1-D), scatter plot (2-D)
- Limitations: time consuming, subjective

Visualization Based (2)
- Detecting telecommunication fraud
- Display telephone call patterns as a graph
- Use colors to identify fraudulent telephone calls (anomalies)
* Cox et al. 1997. Visual data mining: Recognizing telephone calling fraud.
Journal of Data Mining and Knowledge Discovery.

Statistical Approaches
- Assume a parametric model describing the distribution of the data (e.g., normal distribution)
- Apply a statistical test that depends on:
  - The data distribution
  - The parameters of the distribution (e.g., mean, variance)
  - The number of expected outliers (confidence limit)

Grubbs' Test
- Detects outliers in univariate data
- Assumes the data come from a normal distribution
- Detects one outlier at a time: remove the outlier and repeat
  - H0: there is no outlier in the data
  - HA: there is at least one outlier
- Grubbs' test statistic: G = max_i |X_i − X̄| / s
- Reject H0 if: G > ((N − 1)/√N) · √( t²_(α/(2N), N−2) / (N − 2 + t²_(α/(2N), N−2)) ), where t_(α/(2N), N−2) is the critical value of the t-distribution with N − 2 degrees of freedom at significance level α/(2N)

Statistical-based: Likelihood Approach
- Assume the data set D contains samples from a mixture of two probability distributions: M (the majority distribution) and A (the anomalous distribution)
- General approach:
  - Initially, assume all the data points belong to M
  - Let Lt(D) be the log likelihood of D at time t
  - For each point xt that belongs to M, move it to A:
    - Let Lt+1(D) be the new log likelihood.
    - Compute the difference Δ = Lt(D) − Lt+1(D)
    - If Δ > c (some threshold), then xt is declared an anomaly and moved permanently from M to A

Statistical-based: Likelihood Approach (cont.)
- Data distribution: D = (1 − λ) M + λ A
- M is a probability distribution estimated from the data; it can be based on any modeling method (naïve Bayes, maximum entropy, etc.)
- A is initially assumed to be a uniform distribution
- Likelihood at time t:
  Lt(D) = Π_(i=1..N) P_D(x_i) = [ (1 − λ)^|Mt| · Π_(x_i ∈ Mt) P_Mt(x_i) ] · [ λ^|At| · Π_(x_i ∈ At) P_At(x_i) ]
  LLt(D) = |Mt| log(1 − λ) + Σ_(x_i ∈ Mt) log P_Mt(x_i) + |At| log λ + Σ_(x_i ∈ At) log P_At(x_i)

Limitations of Statistical Approaches
- Most of the tests are for a single attribute
- In many cases, the data distribution may not be known
- For high-dimensional data, it may be difficult to estimate the true distribution

Distance-based Approaches
- Data is represented as a vector of features
- Three major approaches: nearest-neighbor based, density based, clustering based

Nearest-Neighbor Based Approach
- Compute the distance between every pair of data points
- There are various ways to define outliers:
  - Data points for which there are fewer than p neighboring points within a distance D
  - The top n data points whose distance to the k-th nearest neighbor is greatest
  - The top n data points whose average distance to the k nearest neighbors is greatest

Outliers in Lower-Dimensional Projections
- Divide each attribute into φ equal-depth intervals; each interval contains a fraction f = 1/φ of the records
- Consider a k-dimensional cube created by picking grid ranges from k different dimensions
- If attributes are independent, we expect the region to contain a fraction f^k of the records
- If there are N points, we
can measure the sparsity of a cube D as
  S(D) = (n(D) − N·f^k) / √( N·f^k·(1 − f^k) ),
  where n(D) is the number of points that fall in D
- Negative sparsity indicates that the cube contains fewer points than expected

Example
- N = 100, φ = 5, f = 1/5 = 0.2, so the expected count per 2-D cell is N·f² = 4

Density-based: LOF Approach
- For each point, compute the density of its local neighborhood
- Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors
- Outliers are points with the largest LOF values
[Figure: 2-D example with points p1 and p2.]

Clustering-Based
- Cluster the data
- Points in small clusters are candidate outliers
- If candidate points are far from all other non-candidate points, they are outliers

Detecting Geographic Hotspots
- Entire area being scanned (Philadelphia metro)

One Step of Spatial Scan
- Entire area being scanned; one current region being considered at a time
- The current region has a population of 5,300, of whom 53 are sick (1%); everywhere else has a population of 2,200,000, of whom 20,000 are sick (0.9%)
- So... is that a big deal? The region is evaluated with a score function (e.g., Kulldorff's score): [Score = 1.4]

Many Steps of Spatial Scan
- The search keeps track of the highest-scoring region found so far ([Score = 9.3]) while evaluating the current region ([Score = 1.4])

Scan Statistics
- Standard scan statistic question: given the geographical locations of occurrences of a phenomenon, is there a region with an unusually high (or low) rate of these occurrences?
- Standard approach:
  1. Compute the likelihood of the data given the hypothesis that the rate of occurrence is uniform everywhere, L0
  2. For some geographical region W, compute the likelihood that the rate of occurrence is uniform at one level inside the region and uniform at another level outside the region, L(W)
  3. Compute the likelihood ratio, L(W)/L0
  4. Repeat for all regions, and find the largest likelihood ratio; this is the scan statistic, S*_W
  5. Report the region W that yielded the maximum, S*_W
- See [Glaz and Balakrishnan, 99] for details

Significance Testing
- Given that region W is the most likely to be abnormal, is it significantly abnormal? Standard approach:
  1. Generate many randomized versions of the data set by shuffling the labels (positive instance of the phenomenon or not)
  2. Compute S*_W for each randomized data set; this forms a baseline distribution for S*_W if the null hypothesis holds
  3. Compare the observed value of S*_W against the baseline distribution to determine a p-value.
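The three randomization steps can be sketched as a permutation test. The score used here (disease rate inside a fixed region) is a deliberate simplification of the full scan statistic, which would rerun the likelihood-ratio search over all regions for every shuffle; the region and labels are made-up illustrative data.

```python
import random

def region_rate(labels, region):
    """Score of a fixed region: fraction of positive labels inside it."""
    return sum(labels[i] for i in region) / len(region)

def permutation_p_value(labels, region, n_shuffles=999, seed=0):
    """Steps 1-3 of the significance test for a fixed candidate region."""
    rng = random.Random(seed)
    observed = region_rate(labels, region)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)                          # step 1: randomize labels
        if region_rate(shuffled, region) >= observed:  # step 2: null statistic
            hits += 1
    return (hits + 1) / (n_shuffles + 1)               # step 3: p-value

# Toy data: 100 cases, and all 10 positives fall inside the candidate region,
# so almost no shuffled data set matches the observed score.
labels = [1] * 10 + [0] * 90
region = list(range(10))
print(permutation_p_value(labels, region))
```

The `(hits + 1) / (n_shuffles + 1)` correction keeps the p-value strictly positive even when no shuffle reaches the observed score, which is the usual convention for Monte Carlo permutation tests.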
Fast Squares Speedup
- Theoretical complexity of fast squares: O(N²) (as opposed to the naïve O(N³)), if the maximum-density region is sufficiently dense; if not, several other speedup tricks can be used
- In practice: 10-200x speedups on real and artificially generated data sets
- Emergency Dept. data set (600K records): 20 minutes, versus 66 hours with the naïve approach

Base Rate Fallacy
- Bayes theorem: P(B|A) = P(A|B)·P(B) / P(A)
- More generally: P(B_i|A) = P(A|B_i)·P(B_i) / Σ_j P(A|B_j)·P(B_j)

Base Rate Fallacy (Axelsson, 1999)
- Even though the test is 99% certain, your chance of having the disease is 1/100, because the population of healthy people is much larger than that of sick people

Base Rate Fallacy in Intrusion Detection
- I: intrusive behavior; ¬I: non-intrusive behavior; A: alarm; ¬A: no alarm
- Detection rate (true positive rate): P(A|I)
- False alarm rate: P(A|¬I)
- The goal is to maximize both the Bayesian detection rate P(I|A) and P(¬I|¬A)

Detection Rate vs. False Alarm Rate
- By Bayes theorem, P(I|A) = P(A|I)·P(I) / [ P(A|I)·P(I) + P(A|¬I)·P(¬I) ]
- The false alarm rate term becomes more dominant if P(I) is very low
- Axelsson: we need a very low false alarm rate to achieve a reasonable Bayesian detection rate

Contextual Anomaly Detection
- Detect contextual anomalies
- General approach:
  - Identify a context around a data instance (using a set of contextual attributes)
  - Determine if the data instance is anomalous w.r.t.
the context (using a set of behavioral attributes)
- Assumption: all normal instances within a context will be similar (in terms of the behavioral attributes), while the anomalies will be different

Conclusions
- Anomaly detection can uncover critical information in data
- It is highly applicable in various application domains
- The nature of the anomaly detection problem depends on the application domain
- Different approaches are needed for different problem formulations

References
- Ling, C., Li, C. Data mining for direct marketing: Problems and solutions. KDD, 1998.
- Kubat, M., Matwin, S. Addressing the curse of imbalanced training sets: One-sided selection. ICML, 1997.
- Chawla, N. et al. SMOTE: Synthetic minority over-sampling technique. JAIR, 2002.
- Fan, W. et al. Using artificial anomalies to detect unknown and known network intrusions. ICDM, 2001.
- Abe, N. et al. Outlier detection by active learning. KDD, 2006.
- Cardie, C., Howe, N. Improving minority class prediction using case-specific feature weighting. ICML, 1997.
- Grzymala, J. et al. An approach to imbalanced data sets based on changing rule strength. AAAI Workshop on Learning from Imbalanced Data Sets, 2000.
- John, G. H. Robust linear discriminant trees. AI & Statistics, 1995.
- Barbara, D., Couto, J., Jajodia, S., Wu, N. ADAM: A testbed for exploring the use of data mining in intrusion detection. SIGMOD Record, 2001.
- Otey, M., Parthasarathy, S., Ghoting, A., Li, G., Narravula, S., Panda, D. Towards NIC-based intrusion detection. KDD, 2003.
- He, Z., Xu, X., Huang, J. Z., Deng, S. A frequent pattern discovery method for outlier detection. Web-Age Information Management, 726-732, 2004.
- Lee, W., Stolfo, S. J., Mok, K. W. Adaptive intrusion detection: A data mining approach. Artificial Intelligence Review, 2000.
- Qin, M., Hwang, K. Frequent episode rules for internet anomaly detection.
In Proceedings of the 3rd IEEE International Symposium on Network Computing and Applications, 2004.
- Ide, T., Kashima, H. Eigenspace-based anomaly detection in computer systems. KDD, 2004.
- Sun, J. et al. Less is more: Compact matrix representation of large sparse graphs. ICDM, 2007.

References (cont.)
- Lee, W., Xiang, D. Information-theoretic measures for anomaly detection. IEEE Symposium on Security and Privacy, 2001.
- Ratsch, G., Mika, S., Scholkopf, B., Muller, K.-R. Constructing boosting algorithms from SVMs: An application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.
- Tax, D. M. J. One-class classification: Concept-learning in the absence of counter-examples. Ph.D. thesis, Delft University of Technology, 2001.
- Eskin, E., Arnold, A., Prerau, M., Portnoy, L., Stolfo, S. A geometric framework for unsupervised anomaly detection. Applications of Data Mining in Computer Security, 2002.
- Lazarevic, A. et al. A comparative study of anomaly detection schemes in network intrusion detection. SDM, 2003.
- Scholkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., Williamson, R. Estimating the support of a high-dimensional distribution. Tech. Rep. 99-87, Microsoft Research, 1999.
- Baker, D. et al. A hierarchical probabilistic model for novelty detection in text. ICML, 1999.
- Das, K., Schneider, J. Detecting anomalous records in categorical datasets. KDD, 2007.
- Augusteijn, M., Folkert, B. Neural network classification and novelty detection. International Journal of Remote Sensing, 2002.
- Sykacek, P. Equivalent error bars for neural network classifiers trained by Bayesian inference. European Symposium on Artificial Neural Networks, 121-126, 1997.
- Vasconcelos, G. C., Fairhurst, M. C., Bisset, D. L. Investigating feedforward neural networks with respect to the rejection of spurious patterns.
Pattern Recognition Letters, 1995.

References (cont.)
- Hawkins, S. et al. Outlier detection using replicator neural networks. DaWaK, 2002.
- Jagota, A. Novelty detection on a very large number of memories stored in a Hopfield-style network. International Joint Conference on Neural Networks, 1991.
- Crook, P., Hayes, G. A robot implementation of a biologically inspired method for novelty detection. Towards Intelligent Mobile Robots Conference, 2001.
- Dasgupta, D., Nino, F. A comparison of negative and positive selection algorithms in novel pattern detection. IEEE International Conference on Systems, Man, and Cybernetics, 2000.
- Caudell, T., Newman, D. An adaptive resonance architecture to define normality and detect novelties in time series and databases. World Congress on Neural Networks, 1993.
- Albrecht, S. et al. Generalized radial basis function networks for classification and novelty detection: Self-organization of optimal Bayesian decision. Neural Networks, 2000.
- Steinwart, I., Hush, D., Scovel, C. A classification framework for anomaly detection. JMLR, 2005.
- Mukkamala, S. et al. Intrusion detection systems using adaptive regression splines. ICEIS, 2004.
- Li, Y., Pont et al. Improving the performance of radial basis function classifiers in condition monitoring and fault diagnosis applications where unknown faults may occur. Pattern Recognition Letters, 2002.
- Borisyuk, R. et al. An oscillatory neural network model of sparse distributed memory and novelty detection. Biosystems, 2000.
- Ho, T. V., Rouat, J. Novelty detection based on relaxation time of a network of integrate-and-fire neurons. Second IEEE World Congress on Computational Intelligence, 1998.
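As a closing illustration of the deck's statistical approach, one round of Grubbs' test can be sketched in Python, following the G statistic and t-based critical value given in the Grubbs' Test slide. The sample data are invented, and SciPy is assumed to be available for the t-distribution quantile.

```python
from statistics import mean, stdev
from scipy.stats import t

def grubbs_outlier(data, alpha=0.05):
    """One round of Grubbs' test: return the outlier value, or None.

    G = max|x - mean| / s; reject H0 when
    G > ((N-1)/sqrt(N)) * sqrt(t2 / (N - 2 + t2)),
    where t is the upper alpha/(2N) critical value with N-2 df.
    """
    n = len(data)
    xbar, s = mean(data), stdev(data)
    candidate = max(data, key=lambda x: abs(x - xbar))
    g = abs(candidate - xbar) / s
    t_crit = t.ppf(1 - alpha / (2 * n), n - 2)
    threshold = ((n - 1) / n**0.5) * (t_crit**2 / (n - 2 + t_crit**2))**0.5
    return candidate if g > threshold else None

print(grubbs_outlier([1.0, 2.0, 2.0, 3.0, 2.0, 100.0]))  # flags 100.0
print(grubbs_outlier([1.0, 2.0, 2.0, 3.0, 2.0]))         # no outlier: None
```

Per the slides, the full procedure would remove a detected outlier and repeat the test on the remaining data until no further rejection occurs.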