Amol Ghoting, Srinivasan Parthasarathy, Matthew Eric Otey
Data Mining and Knowledge Discovery, Vol. 16, No. 3, 2008
Reporter: CHENG-WEI, CHOU
Jan. 13, 2010
Group members: 89721002 周政緯, 陳永洲
2017/5/5

Outline
Introduction
Distance-based outlier detection
Outlier detection algorithm
Experiment results
Conclusion

Introduction
A common problem: automatically finding outliers.
Outliers: points that are highly unlikely to occur.
A measure of unusualness: a point's distance to the other points in the dataset.
On high-dimensional data, existing algorithms do not perform well.
This paper further improves the scaling behavior of distance-based outlier detection on large, high-dimensional datasets.

Distance-based outlier detection
Three popular definitions of distance-based outliers:
1. Outliers are the data points for which there are fewer than p other data points within distance d.
2. Outliers are the top n data points whose distance to their kth nearest neighbor is greatest.
3. Outliers are the top n data points whose average distance to their k nearest neighbors is greatest.

NL (nested loop) algorithm: the best-performing approach in high-dimensional spaces.
For each data point in D, scan the dataset and keep track of its k closest neighbors.
Maintain a cutoff threshold, c.
If the distance from a data point to its current kth closest neighbor is less than c, the data point can no longer be an outlier.

Outlier detection algorithm
RBRP (Recursive Binning and Re-Projection)
A two-phase algorithm for fast mining of distance-based outliers in high-dimensional datasets.
Finds the top n outliers in the dataset, that is, the points whose distance to their kth nearest neighbor is greatest.
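The nested-loop search with a cutoff threshold, which the second phase of RBRP extends, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation. It assumes Euclidean distance and assumes that the cutoff c is the score of the weakest outlier among the current top n (the slides do not define c). The function name `top_n_outliers` is hypothetical.

```python
import heapq
import math


def top_n_outliers(points, k, n):
    """Return up to n (score, index) pairs, highest score first.

    The score of a point is the distance to its k-th nearest neighbor.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")

    top = []      # min-heap of (score, index): best n candidates so far
    cutoff = 0.0  # assumption: score of the weakest outlier currently in `top`

    for i, p in enumerate(points):
        nearest = []  # max-heap (negated distances) of the k closest neighbors seen
        pruned = False
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # With k neighbors known, -nearest[0] is an upper bound on the
            # final score; below the cutoff, the point cannot be a top-n outlier.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned or len(nearest) < k:
            continue

        score = -nearest[0]
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]  # the cutoff only grows as better outliers appear

    return sorted(top, reverse=True)
```

The inner scan stops early for most non-outliers once the cutoff has grown, which is why the order in which neighbors are visited matters: finding close neighbors early tightens the bound sooner.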
First phase of RBRP
Goal: partition the dataset into bins.
Points that are close to each other in space are likely to be assigned to the same bin.
A recursive procedure similar to divisive hierarchical clustering.

Second phase of RBRP
Uses an extension of the NL algorithm to find outliers in the dataset.

Time complexity of Phase 1
$T(N) = T(N - m) + T(m) + \Theta(N)$
Worst case, best case, and average case: [expressions not recovered]

Experiment results
[Slide content not recovered]

Conclusion
Presented RBRP.
RBRP improves upon the scaling behavior of the state of the art.
Theoretical arguments are provided.
Its scaling behavior is validated.
Empirical results on real data back the above claim, realizing a significant speedup over ORCA.

Thank you!!