
Graph preprocessing
 Introduction
 Noise removal and data enhancement problem
 Noise removal and data enhancement on binary data
 Noise removal and data enhancement on graph data
 Noise removal and data enhancement tools
 Current research problems
 Future directions
Data cleaning techniques at the data analysis stage
 Distance-based
 Local Outlier Factor (LOF) based approaches
 Clustering-based
 HCleaner, a hyperclique-based data cleaner
Clustering Based Techniques
• Key assumption: normal data records belong to large and dense clusters, while
anomalies do not belong to any cluster or form very small clusters
• Categorization according to labels
o Semi-supervised – cluster normal data to create modes of normal behaviour. If a
new instance does not belong to any of the clusters, or is not close to any cluster,
it is an anomaly
o Unsupervised – post-processing is needed after the clustering step to determine
the size of each cluster and the distance from a point to the clusters for that point
to be declared an anomaly
• Anomalies detected using clustering based methods can be:
o Data records that do not fit into any cluster (residuals from clustering)
o Small clusters
o Low density clusters or local anomalies (far from other points within the same cluster)
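The semi-supervised rule above can be sketched in a few lines, assuming cluster centroids have already been learned from normal data only; the centroid values and threshold below are hypothetical:

```python
import math

def nearest_centroid_distance(point, centroids):
    """Euclidean distance from point to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)

def is_anomaly(point, centroids, threshold):
    """Semi-supervised rule: a point that is not close to any mode
    of normal behaviour is flagged as an anomaly."""
    return nearest_centroid_distance(point, centroids) > threshold

# Centroids learned beforehand from normal data only (hypothetical values).
centroids = [(0.0, 0.0), (10.0, 10.0)]
print(is_anomaly((0.5, -0.3), centroids, threshold=2.0))  # False: near a cluster
print(is_anomaly((5.0, 5.0), centroids, threshold=2.0))   # True: far from both
```

The threshold plays the role of the "not close to any cluster" test; in practice it would be calibrated from the spread of the normal training data.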
Clustering based outlier detection method for noise removal
 Clustering algorithms can detect outliers as a by-product of the clustering process
o Small clusters that are far away from the major clusters can be outliers
o This method is sensitive to the choice of clustering algorithm
o It has difficulties in deciding which clusters should be classified as outliers
Simple Example
• N1 and N2 are regions of normal behavior
• Points o1 and o2 are anomalies
• Points in region O3 are anomalies forming a small cluster
 Calculate the centroid of each cluster
o Noise objects are the ones that are farthest from their corresponding cluster centroids
o Data is clustered using the K-means algorithm available in the CLUTO clustering package
o The cosine similarity (distance) of each object to its corresponding cluster centroid is recorded
o The top E% of objects, after sorting in ascending (descending) order with respect to this similarity (distance), are selected
o These constitute the noise objects in the data
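The steps above can be sketched without CLUTO, assuming the K-means cluster labels are already available; the object vectors, labels, and E% value below are hypothetical toy values:

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def noise_objects(objects, labels, pct):
    """Flag the pct fraction of objects least similar to their
    cluster centroid as noise; returns their indices."""
    by_cluster = defaultdict(list)
    for i, lbl in enumerate(labels):
        by_cluster[lbl].append(i)
    cents = {lbl: centroid([objects[i] for i in ids])
             for lbl, ids in by_cluster.items()}
    sims = [(cosine(objects[i], cents[labels[i]]), i)
            for i in range(len(objects))]
    sims.sort()                          # ascending similarity
    k = max(1, int(pct * len(objects)))
    return sorted(i for _, i in sims[:k])

# Hypothetical toy data: two clusters plus one object pointing away.
objs = [(1, 0), (0.9, 0.1), (0, 1), (0.1, 0.9), (-1, 0.2)]
lbls = [0, 0, 1, 1, 0]
print(noise_objects(objs, lbls, pct=0.2))  # [4]
```

Object 4 points in the opposite direction from its cluster's centroid, so it has the lowest cosine similarity and is selected as noise.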
 The complexity of the algorithm is the same as that of one execution of
K-means, O(kn)
o where k is the number of clusters and n is the number of objects
 If there is only one cluster, the cluster based approach becomes very
similar to the distance based approach
 If every object is a separate cluster, the cluster based approach
degenerates to randomly selecting objects as outliers
 It performs well only when the number of clusters is close to the
‘actual’ number of clusters (classes) in the data set
Clustering Based Techniques
• Advantages:
 No supervision needed
 Easily adaptable to an on-line / incremental mode, suitable for
anomaly detection from temporal data
• Drawbacks
 Computationally expensive
o Using indexing structures (k-d tree, R* tree) may alleviate this
 If normal points do not form any clusters, the techniques may fail
 In high dimensional spaces, data is sparse and the distances between
any two data records may become quite similar
o Clustering algorithms may not give any meaningful clusters
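The distance-concentration effect behind this drawback is easy to demonstrate: for random points in the unit cube, the gap between the largest and smallest pairwise distance shrinks relative to the smallest distance as the dimension grows. A minimal sketch:

```python
import math
import random

random.seed(0)

def relative_contrast(dim, n=200):
    """(max - min) / min over all pairwise distances of n random
    points in [0, 1]^dim; this ratio shrinks as dim grows."""
    pts = [[random.random() for _ in range(dim)] for _ in range(n)]
    d = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return (max(d) - min(d)) / min(d)

for dim in (2, 10, 100):
    print(dim, round(relative_contrast(dim), 2))  # ratio drops with dimension
```

With nearly uniform pairwise distances, "dense cluster" and "far from a cluster" lose their discriminating power, which is why distance- and clustering-based detectors degrade in high dimensions.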
The H-confidence Measure
• The h-confidence of a pattern P = {i1, i2, …, im} is
hconf(P) = supp(P) / max{ supp({i1}), supp({i2}), …, supp({im}) }
• A pattern P is a hyperclique pattern if hconf(P) ≥ hc, where hc is a
user specified minimum h-confidence threshold
Alternate Equivalent Definitions of h-confidence
 Given a pattern P = {i1, i2, …, im}
• Definition:
hconf(P) = min{ conf({x} → P − {x}) | x ∈ {i1, i2, …, im} }
• Definition:
hconf(P) = min{ conf(X → Y) | X, Y ⊂ {i1, i2, …, im}, X ∪ Y = P, X ∩ Y = ∅ }
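Both definitions reduce to dividing the pattern's support by its largest single-item support, which is straightforward to compute. A minimal sketch on hypothetical toy transactions:

```python
def supp(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def hconf(pattern, transactions):
    """h-confidence: supp(P) divided by the largest single-item
    support among the items of P."""
    return supp(pattern, transactions) / max(
        supp({i}, transactions) for i in pattern)

# Hypothetical toy transactions.
T = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'a'}, {'b', 'c'}]
print(round(hconf({'b', 'c'}, T), 2))                  # 0.67
# Matches the min-confidence definition: conf({b} -> {c}) = 0.4 / 0.6
print(round(supp({'b', 'c'}, T) / supp({'b'}, T), 2))  # 0.67
```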
All-Confidence Measure
• The h-confidence measure is equivalent to the all-confidence measure
(Omiecinski – TKDE 2003)
Properties of Hyperclique Patterns
 Anti-monotone property: if P′ ⊆ P, then hconf(P′) ≥ hconf(P)
High Affinity Property
• High h-confidence implies tight coupling amongst all items in the pattern
o The magnitude of the relationship is consistent with many other measures:
Jaccard, Correlation, Cosine
Cross Support Property
• Eliminates patterns involving items that have very different support levels
Cross Support Property of h-confidence
 At high support thresholds, all patterns that involve low support items
are eliminated
 At low support thresholds, too many spurious patterns are generated that
involve one high support item and one low support item
 Given a pattern P = {i1, i2, …, im}, since supp(P) ≤ min{supp({ik})},
hconf(P) ≤ min{supp({ik})} / max{supp({ik})},
so a high h-confidence threshold eliminates patterns that mix items of
very different support levels
(Figure: support distribution of the pumsb dataset)
Hyperclique based data cleaner
 The idea is to eliminate data objects that are not tightly
connected to other data objects in the data set
 Every pair of objects within a hyperclique pattern is guaranteed to
have cosine similarity above a certain level
 The h-confidence measure has three important properties
o Anti-monotone property
o Cross-support property
o Strong affinity property
 HCleaner generally leads to better performance as
compared to the outlier based data cleaning alternatives
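A minimal sketch of the HCleaner idea on binary data, treating each object as an item over the transposed feature "transactions" and using only size-2 hyperclique patterns as filters; the data matrix and threshold below are hypothetical:

```python
from itertools import combinations

def hcleaner(data, hc):
    """Sketch of HCleaner on a binary matrix: each object is treated
    as an item, each feature as a transaction; objects not covered by
    any size-2 hyperclique pattern are flagged as noise."""
    n_obj, n_feat = len(data), len(data[0])
    # Feature set of each object (its "transactions" in transposed view).
    cols = [{f for f in range(n_feat) if data[o][f]} for o in range(n_obj)]
    supp = [len(c) / n_feat for c in cols]
    kept = set()
    for i, j in combinations(range(n_obj), 2):
        s_pair = len(cols[i] & cols[j]) / n_feat
        # Pair forms a hyperclique if its h-confidence reaches hc.
        if supp[i] and supp[j] and s_pair / max(supp[i], supp[j]) >= hc:
            kept |= {i, j}
    return sorted(set(range(n_obj)) - kept)   # noise object ids

# Hypothetical binary data: objects 0-2 share features, object 3 does not.
D = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 0, 1]]
print(hcleaner(D, hc=0.5))  # [3]
```

Object 3 shares no features with the others, so it joins no hyperclique pattern and is removed as noise; the real HCleaner mines patterns of larger sizes as well.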
Impact of noise removal on clustering analysis
 Clustering performance is not affected by the
elimination of random objects
 As the percentage of noise objects removed by LOF,
CCleaner, and HCleaner increases, the entropy
generally goes down
 Clustering performance improves as more and more
noise or weakly-relevant objects are removed
 HCleaner provides the best clustering results
compared to other noise removal techniques across all
experimental cases
 When the percentage of noise objects removed is low,
HCleaner yields significantly better clustering
 As the percentage of objects being removed is increased,
HCleaner tends to have better (higher) F-measure
values than other noise removal techniques in
most experimental cases
 HCleaner tends to be the best or close to the best
technique for improving clustering performance for
binary data
Impact of noise removal on association analysis
 HCleaner provides the best association results
compared to other noise removal techniques when
the percentage of noise objects is above 25%
 HCleaner provides the best performance for all
ranges of noise percentages considered
 HCleaner can achieve better performance when a
large portion of noise has been removed
 The performance of clustering based noise removal is very
sensitive to the specified number of clusters
 If the number of clusters is very small, this approach
has performance similar to that of the distance based approach
 If the number of clusters is very large, this approach
becomes similar to the random approach for
removing noise
 The best performance is obtained when size-3
hyperclique patterns are used as filters
 HCleaner tends to provide better clustering
performance and higher quality associations than other
data cleaning alternatives for binary data
 Better noise removal results in better data analysis
 A framework for evaluating the effectiveness of noise
removal techniques for enhancing data analysis is presented
• Data preprocessing can enhance critical information in data
• It is highly applicable in various application domains
• The nature of the data enhancement and noise removal problem is
dependent on the application domain
• Different approaches are needed to solve a particular problem
Directions for future work
 The study was restricted to unsupervised data mining
techniques at the data analysis stage
 Since HCleaner, CCleaner, and the LOF based method
were each the best in different situations, it could be
useful to consider a voting scheme that combines
these three techniques
 Investigate the impact of these noise removal
techniques on classification performance
 The same analysis can be done on different data sets,
such as graph data sets