Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Geographic information system wikipedia , lookup
Neuroinformatics wikipedia , lookup
Theoretical computer science wikipedia , lookup
Inverse problem wikipedia , lookup
Multidimensional empirical mode decomposition wikipedia , lookup
Pattern recognition wikipedia , lookup
Data analysis wikipedia , lookup
Graph preprocessing Introduction Noise removal and data enhancement problem Noise removal and data enhancement on binary data Noise removal and data enhancement on graph data Noise removal and data enhancement tools Current research problems Future directions Framework for validating data cleaning techniques on binary data Data cleaning techniques to enhance data analysis in the presence of high noise levels Strengths and weakness of techniques Experimental results on several real-world data sets Impact on clustering and association analysis Conclusions and future work Motivation & Problem Statement • Several preprocessing techniques for real world data have been proposed – Each technique proposed for a single application domain – No comparative evaluation of techniques across different domains – Such evaluation is essential to identify relative strengths and weaknesses of the techniques • Problem Statement: Given a set of n points S, and a query point Sq, find an anomaly score for Sq with respect to S – Points in S are assumed to be (mostly) normal • This definition is applicable in multiple domains such as – Flight safety – System call intrusion detection – Proteomics Data cleaning techniques at the data analysis stage Distance-based Local Outlier Factor (LOF) based approaches Clustering-based HCleaner, is a hyper clique-based data cleaner Distance based Outlier Detection Nearest Neighbour (NN) approach For each data point d compute the distance to the k-th nearest neighbour dk Sort all data points according to the distance dk Outliers are points that have the largest distance dk and therefore are located in the more sparse neighbourhoods Usually data points that have top n% distance dk are identified as outliers o n – user parameter Not suitable for datasets that have modes with varying density * Knorr, Ng,Algorithms for Mining Distance-Based Outliers in Large Datasets, VLDB98 ** S. Ramaswamy, R. Rastogi, S. Kyuseok: Efficient Algorithms for Mining Outliers from Large Data Sets, ACM SIGMOD Conf. On Management of Data, 2000. Nearest Neighbour Based Techniques • Key assumption: normal points have close neighbours while anomalies are located far from other points • General two-step approach – Compute neighbourhood for each data record – Analyze the neighbourhood to determine whether data record is anomaly or not • Categories: – Distance based methods • Anomalies are data points most distant from other points – Density based methods • Anomalies are data points in low density regions Nearest Neighbour Based Techniques • Advantage Can be used in unsupervised or semi-supervised setting (do not make any assumptions about data distribution) • Drawbacks If normal points do not have sufficient number of neighbors the techniques may fail Computationally expensive In high dimensional spaces, data is sparse and the concept of similarity may not be meaningful anymore. Due to the sparseness, distances between any two data records may become quite similar => Each data record may be considered as potential outlier! Nearest Neighbour Based Techniques • Distance based approaches • A point O in a dataset is an DB(p, d) outlier if at least fraction p of the points in the data set lies greater than distance d from the point O* • Density based approaches • Compute local densities of particular regions and declare instances in low density regions as potential anomalies • Approaches o Local Outlier Factor (LOF) *Knorr, Ng,Algorithms for Mining Distance-Based Outliers in Large Datasets, VLDB98 Distance based approaches o Complexity of this algorithm is O(n2) o Nearest neighbor sets have to be constructed o It has a problems when a data set has regions of varying density Data cleaning techniques at the data analysis stage Distance-based Local Outlier Factor (LOF) based approaches Clustering-based HCleaner, is a hyper clique-based data cleaner Local Outlier Factor (LOF)* • For each data point q compute the distance to the k-th nearest neighbor (kdistance) •Compute reachability distance (reach-dist) for each data example q with respect to data example p as: oreach-dist(q, p) = max{k-distance(p), d(q,p)} •Compute local reachability density (lrd) of data example q as inverse of the average reachabaility distance based on the MinPts nearest neighbors of data example q o lrd(q) = ∑ p MinPts reach dist MinPts q,p •Compaute LOF(q) as ratio of average local reachability density of q’s k-nearest neighbors and local reachability density of the data record q o LOF(q) = lrd p 1 ⋅∑ MinPts p lrd q * - Breunig, et al, LOF: Identifying Density-Based Local Outliers, KDD 2000. Advantages of Density based Techniques • Local Outlier Factor (LOF) approach • Example: Distance from p3 to nearest neighbor In the NN approach, p2 is not considered as outlier, while the LOF approach find both p1 and p2 as outliers p3 Distance from p2 to nearest neighbor p2 p1 NN approach may consider p3 as outlier, but LOF approach does not Local Outlier Factor (LOF)* • For each data point q compute the distance to the k-th nearest neighbor (kdistance) •Compute reachability distance (reach-dist) for each data example q with respect to data example p as: reach-dist(q, p) = max{k-distance(p), d(q,p)} •Compute local reachability density (lrd) of data example q as inverse of the average reachabaility distance based on the MinPts nearest neighbors of data example q MinPts lrd(q) = reach _ dist MinPts(q, p) p •Compaute LOF(q) as ratio of average local reachability density of q’s k-nearest neighbors and local reachability density of the data record q LOF(q) = 1 lrd ( p) MinPts p lrd (q) * - Breunig, et al, LOF: Identifying Density-Based Local Outliers, KDD 2000.