Survey

Graph preprocessing
• Introduction
• Noise removal and data enhancement problem
• Noise removal and data enhancement on binary data
• Noise removal and data enhancement on graph data
• Noise removal and data enhancement tools
• Current research problems
• Future directions

Data cleaning techniques at the data analysis stage
• Distance-based
• Local Outlier Factor (LOF) based approaches
• Clustering-based
• HCleaner, a hyperclique-based data cleaner

Clustering Based Techniques
• Key assumption: normal data records belong to large and dense clusters, while anomalies do not belong to any of the clusters or form very small clusters
• Categorization according to labels
o Semi-supervised – cluster normal data to create modes of normal behaviour. If a new instance does not belong to any of the clusters, or is not close to any cluster, it is an anomaly
o Unsupervised – post-processing is needed after the clustering step to determine the size of the clusters and how far from the clusters a point must be to count as an anomaly
• Anomalies detected using clustering based methods can be:
o Data records that do not fit into any cluster (residuals from clustering)
o Small clusters
o Low density clusters or local anomalies (far from other points within the same cluster)

Clustering based outlier detection method for noise removal
• Clustering algorithms can detect outliers as a by-product of the clustering process
o Small clusters that are far away from other major clusters can be outliers
o This method is sensitive to the choice of clustering algorithm
o It has difficulties in deciding which clusters should be classified as outliers

Simple Example
• N1 and N2 are regions of normal behavior
• Points o1 and o2 are anomalies
• Points in region O3 are anomalies
[Figure: scatter plot on X and Y axes showing regions N1 and N2, points o1 and o2, and region O3]

Clustering based outlier detection method for noise removal
• Calculate the centroid of each cluster
o Noise objects are the ones that are farthest from their corresponding cluster centroids
o Data is clustered using the K-means algorithm available in the CLUTO clustering package
o The cosine similarity (distance) of each object from its corresponding cluster centroid is recorded
o The objects are sorted in ascending (descending) order with respect to this similarity (distance), and the top E% are selected
o These constitute the noise objects in the data

Clustering based outlier detection method for noise removal
• The complexity of the algorithm is the same as that of an execution of K-means, O(kn)
o where k is the number of clusters and n is the number of points
• If there is only one cluster, the cluster based approach becomes very similar to the distance based approach
• If every object is a separate cluster, the cluster based approach degenerates to randomly selecting objects as outliers
• Performs well only when the number of clusters is close to the ‘actual’ number of clusters (classes) in the data set

Clustering Based Techniques
• Advantages:
o No need to be supervised
o Easily adaptable to on-line / incremental mode, suitable for anomaly detection from temporal data
• Drawbacks:
o Computationally expensive; using indexing structures (k-d tree, R* tree) may alleviate this problem
o If normal points do not create any clusters, the techniques may fail
o In high dimensional spaces, data is sparse and distances between any two data records may become quite similar.
o Clustering algorithms may not give any meaningful clusters

Data cleaning techniques at the data analysis stage
• Distance-based
• Local Outlier Factor (LOF) based approaches
• Clustering-based
• HCleaner, a hyperclique-based data cleaner

The H-confidence Measure
• The h-confidence of a pattern $P = \{i_1, i_2, \ldots, i_m\}$ is written $hconf(P)$
• A pattern $P$ is a hyperclique pattern if $hconf(P) \ge h_c$, where $h_c$ is a user specified minimum h-confidence threshold

Alternate Equivalent Definitions of h-confidence
Given a pattern $P = \{i_1, i_2, \ldots, i_m\}$
• Definition: $hconf(P) = \min\{\, conf(\{x\} \rightarrow P - \{x\}) \mid x \in \{i_1, i_2, \ldots, i_m\} \,\}$
• Definition: $hconf(P) = \min\{\, conf(X \rightarrow Y) \mid X, Y \subset \{i_1, i_2, \ldots, i_m\} \;\&\; X \cup Y = P \,\}$
• All-Confidence Measure (Omiecinski – TKDE 2003)

Properties of Hyperclique Pattern
• Anti-monotone: if $P' \subseteq P$, then $hconf(P') \ge hconf(P)$
• High Affinity Property
o High h-confidence implies tight coupling amongst all items in the pattern
o Magnitude of relationship is consistent with many other measures: Jaccard, Correlation, Cosine
• Cross support property
o Eliminates patterns involving items that have very different support levels

Cross Support Property of h-confidence
• At high support, all patterns that involve low support items are eliminated
• At low support, too many spurious patterns are generated that involve one high support item and one low support item
• Given a pattern $P = \{i_1, i_2, \ldots, i_m\}$, for any two itemsets $X, Y \subset P$ with $X \cup Y = P$ and $X \cap Y = \emptyset$: $hconf(P) \le \dfrac{supp(X)}{supp(Y)}$
[Figure: Support distribution of the pumsb dataset]

Hyper clique based data cleaner
• The idea is to eliminate data objects that are not tightly connected to other data objects in the data set
• Every pair of objects within a pattern is guaranteed to have cosine similarity above a certain level
• The h-confidence measure has three important properties
o Anti-monotone property
o Cross-support property
o Strong affinity property
• HCleaner generally leads to better performance than the outlier based data cleaning alternatives

Impact of noise removal on clustering analysis
• Clustering performance is not affected by the elimination of random objects
• As the percentage of noise objects removed by LOF, CCleaner, and HCleaner increases, the entropy generally goes down
• Clustering performance improves as more and more noise or weakly-relevant objects are removed
• HCleaner provides the best clustering results compared to other noise removal techniques across all experimental cases

Impact of noise removal on clustering analysis
• When the percentage of noise objects is lower than 30%
o HCleaner yields significantly better clustering performance
o As the percentage of objects being removed is increased, HCleaner tends to have better (higher) F-measure values than other noise removal techniques in most experimental cases
• HCleaner tends to be the best or close to the best technique for improving clustering performance for binary data

Impact of noise removal on association analysis
• HCleaner provides the best association results compared to other noise removal techniques when the percentage of noise objects is above 25%
• HCleaner provides the best performance for all ranges of noise percentages considered
• HCleaner can achieve better performance when a large portion of noise has been removed

Conclusions
• Performance of the clustering-based approach is very sensitive to the specified number of clusters
• If the number of clusters is very small, this approach has performance similar to that of the distance based approach
• If the number of clusters is very large, this approach becomes similar to the random approach for removing noise
• Best performance is obtained when size-3 hyperclique patterns are used as filters

Conclusions
• HCleaner tends to provide better clustering performance and higher quality associations than other data cleaning alternatives for binary data
• Better noise removal results in better data analysis
• A framework for evaluating the effectiveness of noise removal techniques for enhancing data analysis is presented

Conclusions
• Data preprocessing can enhance critical information in data
• Highly applicable in various application domains
• The nature of the data enhancement and noise removal problem is dependent on the application domain
• Different approaches are needed to solve a particular problem formulation

Directions for future work
• The study was restricted to unsupervised data mining techniques at the data analysis stage
• Since HCleaner, CCleaner, and the LOF based method were each the best in different situations, it could be useful to consider a voting scheme that combines these three techniques
• Investigate the impact of these noise removal techniques on classification performance
• The same analysis can be done on different data sets, such as graph data sets
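
The centroid-based noise removal steps from the clustering slides (compute each cluster's centroid, record every object's cosine similarity to it, flag the E% least similar objects) might be sketched as follows. This is only an illustrative sketch, not the CLUTO-based implementation: it assumes cluster labels have already been produced by some K-means run, and the names `cosine`, `centroid_noise`, `points`, `labels`, and `e_percent` are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid_noise(points, labels, e_percent):
    """Return indices of the e_percent% of points least similar to their centroid.

    Assumption: `labels[i]` is the cluster already assigned to `points[i]`
    by an earlier clustering step (for example a K-means run).
    """
    # Group point indices by cluster label.
    groups = defaultdict(list)
    for idx, lab in enumerate(labels):
        groups[lab].append(idx)

    # Centroid of each cluster = per-dimension mean of its members.
    centroids = {}
    for lab, members in groups.items():
        dim = len(points[members[0]])
        centroids[lab] = [
            sum(points[i][d] for i in members) / len(members) for d in range(dim)
        ]

    # Record each object's similarity to its own cluster centroid.
    sims = [
        (cosine(p, centroids[lab]), i)
        for i, (p, lab) in enumerate(zip(points, labels))
    ]
    sims.sort()  # ascending similarity: least similar objects come first

    n_noise = int(len(points) * e_percent / 100)
    return sorted(i for _, i in sims[:n_noise])
```

Calling `centroid_noise(points, labels, 20)` on a data set of five objects would flag the single object that sits farthest, in cosine terms, from its own cluster centroid. Because the result depends entirely on the supplied labels, the sensitivity to the number of clusters noted in the slides carries over directly.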
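
For the h-confidence measure, the first alternate definition takes the minimum of conf({x} → P − {x}) over the items x of P. Since each such confidence equals supp(P) / supp({x}), the minimum is reached at the item with the largest support. A small sketch under that reading is shown below; the function names `support`, `h_confidence`, and `is_hyperclique`, and the representation of transactions as Python sets, are assumptions made for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`.

    Assumes `transactions` is a non-empty list of item collections.
    """
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def h_confidence(pattern, transactions):
    """h-confidence as supp(P) divided by the largest single-item support in P."""
    pattern = set(pattern)
    if not pattern:
        raise ValueError("pattern must contain at least one item")
    max_item_support = max(support({item}, transactions) for item in pattern)
    if max_item_support == 0:
        return 0.0  # no item of the pattern occurs in any transaction
    return support(pattern, transactions) / max_item_support


def is_hyperclique(pattern, transactions, hc):
    """True if the pattern meets the user specified h-confidence threshold hc."""
    return h_confidence(pattern, transactions) >= hc
```

With transactions `{a, b}`, `{a, b}`, `{a}`, `{c}`, the pattern `{a, b}` has support 2/4 and its most frequent item `a` has support 3/4, so its h-confidence is 2/3. A pattern such as `{a, c}`, whose items never co-occur, gets h-confidence 0 and is filtered out at any positive threshold.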