Unsupervised Outlier Detection
Seminar of Machine Learning for Text Mining
UPC, 5/11/2004
Mihai Surdeanu

Definition: What is an outlier?
* Hawkins outlier: "An outlier is an observation that deviates so much from the other observations as to arouse suspicion that it was generated by a different mechanism."
* Clustering outlier: an object not located in any of the clusters of a dataset. Usually called "noise".

Applications (1)
* "One person's noise is another person's signal."
* Outlier detection is useful as a standalone application:
  * Detection of credit card fraud.
  * Detection of attacks on Internet servers.
  * Finding the best/worst basketball/hockey/football players.

Applications (2)
* Outlier detection can be used to remove noise for clustering applications.
* Some of the best clustering algorithms (EM, K-Means) require an initial model (informally, "seeds") to work. If the initial points are outliers, the final model is junk.

Example: K-Means with Bad Initial Model

Paper 1: Algorithms for Mining Distance-Based Outliers in Large Datasets
Edwin M. Knorr, Raymond T. Ng

What is a Distance-Based Outlier?
* An object O in a dataset T is a DB(p,D) outlier if at least a fraction p of the objects in T lie at a distance greater than D from O. (A minimal sketch of this definition appears right after the Paper 1 conclusions below.)
* The distance metric is not fixed here; it could be Euclidean, 1 - cosine, etc.

Outliers in Statistics: Normal Distributions

Properties of DB Outliers
* Similar lemmas exist for Poisson distributions and regression models.

Efficient Algorithms
* Efficient algorithms for the detection of DB outliers exist, with complexities:
  * O(kN^2): index-based; uses k-d or R trees to index all objects, for efficient search of neighborhood objects.
* The other algorithms presented are exponential in the number of attributes k, hence not applicable to real text collections (k > 10,000).

Conclusions
* Advantages:
  * Clean and simple to implement.
  * Equivalent to other formal definitions for well-behaved distributions.
* Disadvantages:
  * Depends on too many parameters (D and p). What are good values for real-world collections?
  * The decision is (almost) binary: a data point is or is not an outlier. In real life it is not so simple.
  * The approach was evaluated only on toy or synthetic data with few attributes (< 50). Does it work on big real-world collections?
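To make the DB(p,D) definition above concrete, here is a minimal sketch: the naive O(N^2) nested-loop reading of the definition, not Knorr and Ng's index-based algorithm. The data X and the values of p and D are illustrative assumptions.

```python
import numpy as np

def db_outliers(X, p, D):
    """Naive O(N^2) check of the DB(p,D) definition: an object is an
    outlier if at least a fraction p of all objects lies at (Euclidean)
    distance greater than D from it."""
    n = len(X)
    # Pairwise Euclidean distances, shape (N, N).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Fraction of the dataset farther than D from each object.
    far_fraction = (dists > D).sum(axis=1) / n
    return far_fraction >= p

# Toy usage: a tight cluster plus one far-away point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
print(db_outliers(X, p=0.7, D=1.0))  # -> [False False False  True]
```

The index-based O(kN^2) variant replaces the pairwise matrix with a spatial index; for example, counting neighbors within radius D via sklearn.neighbors.KDTree.query_radius(X, r=D, count_only=True) yields the same fractions as 1 - counts/n.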
Paper 2: LOF: Identifying Density-Based Local Outliers
Markus Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jörg Sander

Motivation
* DB outliers can handle only "nice" distributions. Many examples in real-world data (e.g., mixtures of distributions) cannot be handled.
* DB outliers give a binary classification of objects: an object is or is not an outlier.

Example of Local Outliers

Goal
* Define a Local Outlier Factor (LOF) that indicates the degree of outlier-ness of an object, using only the object's neighborhood.

Definitions (1)
* Informally: the k-distance of an object is the smallest radius that includes at least k objects.

Definitions (2)

Definitions (3)

Example of reach-dist

Definitions (4)

Definitions (5)
* Informally: LOF(p) is high when p's density is low and the densities of its neighbors are high. (A from-scratch sketch of these definitions appears at the end of these notes.)

Lemma 1
* The LOF of objects "deep" inside a cluster is bounded as follows: 1/(1 + ε) <= LOF <= (1 + ε), for a small ε. Hence the LOF of objects inside a cluster is practically 1!

Theorem 1
* Applies to outlier objects that are in the vicinity of a single cluster.

Illustration of Theorem 1

Theorem 2
* Applies to outlier objects that are in the vicinity of multiple clusters.

Illustration of Theorem 2
* LOF >= (0.5 d1min + 0.5 d2min) / (0.5 i1max + 0.5 i2max)

How to choose the best MinPts?
LOF Values when MinPts Varies
* MinPts > 10, to remove statistical fluctuations.
* MinPts < the minimum number of objects in a cluster (?), to avoid including outliers in the cluster densities.

How to choose the best MinPts? Solution
* Compute the LOF values for a range of MinPts values.
* Pick the maximum LOF for each object over this range. (See the sketch at the end of these notes.)

Evaluation on Synthetic Data Set

Conclusions
* Advantages:
  * Addresses real-world data better.
  * Formal proofs that LOF behaves well for both outlier and non-outlier objects.
  * Gives a degree of outlier-ness, not a binary decision.
* Disadvantages:
  * Evaluated on toy (from our point of view) collections.
  * MinPts is a sensitive parameter.

Paper 3: Unsupervised Outlier Detection and Semi-Supervised Learning
Adam Vinueza and Gregory Grudic

Cost Function for Supervised Training
* Q(F) maintains local consistency by constraining the classifications of nearby objects not to differ too much (Wij encodes the nearness of xi and xj).
* … is optimized to maximize distances between points in different classes. Calinski?

Outlier Detection
* Outlier = any object classified with low confidence. (Sketched at the end of these notes.)

Global Conclusions
* The approach based on density outliers (LOF) seems to be the best for real-world data.
* But it was not tested on real-world collections (thousands of documents, tens of thousands of attributes). Also, some factors are ad hoc (e.g., MinPts > 10).
* If supervised information is available, we can do a lot better (duh).
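A from-scratch sketch of the LOF definitions from the Paper 2 slides (k-distance, reachability distance, local reachability density, and LOF itself), combined with the max-over-a-range-of-MinPts heuristic from the Solution slide. This is a minimal O(N^2) reading of the paper's formulas that ignores ties in the k-distance; the function names and toy data are illustrative.

```python
import numpy as np

def lof_scores(X, k):
    """LOF for a single MinPts value k. Ties in the k-distance are
    ignored: exactly k neighbors are used per object."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)           # a point is not its own neighbor
    knn = np.argsort(dists, axis=1)[:, :k]    # indices of the k nearest neighbors
    k_dist = dists[np.arange(n), knn[:, -1]]  # k-distance of each object

    # reach-dist_k(p, o) = max(k-distance(o), d(p, o)), for o in N_k(p).
    reach = np.maximum(k_dist[knn], dists[np.arange(n)[:, None], knn])
    lrd = 1.0 / reach.mean(axis=1)            # local reachability density
    # LOF(p) = average of lrd(o) / lrd(p) over p's neighbors o.
    return (lrd[knn] / lrd[:, None]).mean(axis=1)

def max_lof(X, k_values):
    """The Solution heuristic: each object's maximum LOF over a range
    of MinPts values."""
    return np.max([lof_scores(X, k) for k in k_values], axis=0)

# Toy usage: LOF stays near 1 inside the cluster (Lemma 1) and is
# large for the isolated point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
print(max_lof(X, k_values=[2, 3]))
```

For real collections, scikit-learn's sklearn.neighbors.LocalOutlierFactor implements the same definitions; after fitting, its negative_outlier_factor_ attribute holds -LOF.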
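Last, a sketch of Paper 3's rule that an outlier is any object classified with low confidence. The slides only summarize Q(F), so this uses a generic label-spreading iteration from the local-and-global-consistency family rather than Vinueza and Grudic's exact cost function; the Gaussian affinity W, alpha, n_iter, and the confidence threshold tau are all assumptions made for illustration.

```python
import numpy as np

def low_confidence_outliers(X, Y, alpha=0.9, n_iter=50, tau=0.6):
    """Spread the few available labels over a similarity graph, then
    flag objects whose winning class score is below threshold tau.
    Y has shape (n, n_classes): one-hot rows for labeled points,
    all-zero rows for unlabeled ones."""
    # W encodes the nearness of xi and xj (Gaussian affinity; an assumption).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2)
    np.fill_diagonal(W, 0.0)
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Local/global-consistency iteration: F <- alpha*S*F + (1 - alpha)*Y.
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * Y
    # Row-normalized winning-class score as the classification confidence.
    conf = F.max(axis=1) / (F.sum(axis=1) + 1e-12)
    return conf < tau

# Toy usage: two labeled seeds, one point near each class, one point
# far from everything; only the far point has low confidence.
X = np.array([[0.0, 0.0], [0.2, 0.0], [3.0, 3.0], [3.2, 3.0], [10.0, 10.0]])
Y = np.zeros((5, 2)); Y[0, 0] = 1.0; Y[2, 1] = 1.0
print(low_confidence_outliers(X, Y))  # -> [False False False False  True]
```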