School of Computer and Information Science
University of South Australia

An investigation into subspace outlier detection

Research Proposal
by Alex Wiegand
Student ID: 100029537
Program: LHCP
Supervisor: Associate Professor Jiuyong Li
June 2009

Disclaimer

I declare all the following to be my own work, unless otherwise referenced, as defined by the University of South Australia's policy on plagiarism. The University of South Australia's policy on plagiarism can be found at http://www.unisa.edu.au/policies/manual/default.asp.

Alex Wiegand
Date: 16th June 2008

Table of Contents

Disclaimer
Abstract
1. Introduction
1.1. Background
1.1.1. Outliers
1.1.2. Outlier Detection
1.1.3. The Curse of Dimensionality
1.1.4. Subspace Outlier Detection
1.2. Motivation
1.2.3. Research Question
2. Literature Review
2.1. Subspace Outlier Detection Algorithms
2.1.1. Aggarwal Evolutionary Search
2.1.2. Lazarevic Feature Bagging Technique
2.1.3. Subspace Outlier Degree outlier detection
2.1.4. Mining Top N Outliers in Most Interesting Subspaces
2.1.5. Summary
2.2. Benchmark Algorithms
2.2.1. Distance Based Outlier Detection
2.2.2. Local Outlier Factor
2.3. Evaluation Metrics
2.4. Summary
3. Research Design
3.1. Computational Tools
3.2. Methodology
3.2.1. Steps
3.3. Expected Outcomes
3.3.1. Known Outcomes
3.3.2. Further Outcomes
4. Timeline
5. Summary
6. References
7. Bibliography

Abstract

Finding outliers in high dimensional datasets is difficult due to the "curse of dimensionality". The new field of subspace outlier detection addresses this problem by considering projections of the dataset onto lower dimensional subspaces, and looking for outliers in those projections. This research project will survey the existing literature on subspace outlier detection, and attempt to optimise, or otherwise contribute substantially to, the techniques of subspace outlier detection. The research question is "What is the best technique to find outliers in high dimensional datasets?"
The key challenges to be addressed by this thesis are (1) the choice of subspace projections to search for outliers, (2) the choice of distance metrics, and (3) minimisation of the false positive rate.

Key Words: Data Mining, Outlier Detection, Subspace Outlier Detection, Curse of Dimensionality.

1. Introduction

Outlier detection is a widely used and important part of data mining. High-dimensional datasets are also common in data mining problems. However, most existing outlier detection techniques fail at sufficiently high dimensionality because of the Curse of Dimensionality, which undermines the very definitions on which those techniques rest.

1.1. Background

1.1.1. Outliers

An outlier is an observation that is very different from the other observations in a dataset. The concept of outliers is used to identify observations that deserve special treatment. These ideas are put together in the Hawkins definition of outliers (Hawkins 1980): an outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. The idea here is that a different mechanism generated the outliers, so they must be treated differently. This can be used when the main data-generating mechanism is the only phenomenon of concern, and it is advantageous to ignore the outliers. Often, however, the outlier-generating mechanism(s) represent important phenomena (Chandola et al. 2007). By returning the set of points that might be caused by an unusual phenomenon, outlier detection narrows down the search for phenomena that rarely occur. This research project will focus on this latter case, in which the outliers are interesting. Problems in which the outliers are interesting include (Chandola et al. 2007):

1. Intrusion detection, e.g. phenomena in computer networks that point to network attacks.
2. Fraud detection, e.g. credit card fraud, mobile phone fraud, insider trading. Databases of transactions are usually automatically collected, and fraudulent activities can be identified from an investigation of unusual transactions.
3. Medical and public health anomaly detection, e.g. detection of recording errors, detection of disease outbreaks.
4. Industrial damage detection, e.g. detecting faults in industrial machines, whether they are worn out or faulty to begin with. An important case is the detection of shorted turns in electrical turbines, a kind of machine wear.
5. Image processing, e.g. novelty detection.
6. Text processing, e.g. detecting novel topics or new events in collections of documents.

1.1.2. Outlier Detection

Outlier detection is the set of techniques used to find the outliers in a dataset. A great deal of work has been done on outlier detection (Chandola et al. 2007, Hodge et al. 2004). Outlier detection techniques may be categorised as follows (Chandola et al. 2007):

1. Classification
2. Nearest Neighbour
3. Clustering
4. Statistical
5. Information Theoretic
6. Subspace

A basic assumption in outlier detection is that the dataset is modelled as a set of points in space. Each observation is considered a point, called a data point. Usually each attribute is considered a dimension, or else dimensions are derived from the attributes in a more complicated way. Thus the set of attributes (the schema) of the dataset constitutes a space. This mapping applies directly to numeric attributes, and can be applied to categorical attributes by first converting their values to numbers. The outliers can then be viewed not only as points that are different from the other points, but as points that are distant from them. When distance is a poor measure of difference, the space or the distance metric can be changed to make the distance more appropriate. The default distance metric is Euclidean distance.

1.1.3. The Curse of Dimensionality

As shown by Beyer et al. (1998), a problem arises from high dimensionality: for most common distributions of data, as the dimensionality increases, the contrast between the distances from any point to its nearest and farthest neighbours approaches zero. This is known as the curse of dimensionality, due to its repercussions. A consequence of the curse is that in high dimensional data, no points are very distant from the rest of the dataset, so there are no outliers by distance. There may still be outliers in the sense that some points have differences that point to important rare phenomena. Suppose a set of observations with four attributes contains some very clear outliers, and the same observations are then taken with an extra five attributes included. Suppose also that the previously outlying points are not distant from the other points on the five new attributes. Then those points do not appear to be outliers in the 9-dimensional dataset, but the objects they represent are no less unusual. Traditional outlier detection methods cannot find these outliers: when there are no outliers by distance, their very definitions of outliers become meaningless. This is the curse of dimensionality for outlier detection.

1.1.4. Subspace Outlier Detection

It is possible to reduce a dataset from its original space (the full space) to a subspace. This is done by removing dimensions, or by mapping the original dimensions onto a new, smaller set of dimensions. A subspace outlier is a point that is outlying in a subspace projection of the original dataset. In the earlier example of the outlier on four attributes, when the other five attributes are included, the original four attributes form a subspace of the full 9-dimensional space. Therefore, the unusual points are subspace outliers in the 9-dimensional dataset.
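The four-versus-nine attribute example can be illustrated numerically. The sketch below (illustrative only, not part of the proposal's methodology; all data is synthetic) plants an outlier that is far away on four informative attributes, then appends five noisy attributes, and compares how strongly the point stands out by mean distance in each space:

```python
# Sketch: a clear distance outlier on 4 attributes is hidden once
# 5 uninformative, high-variance attributes are appended.
import math
import random

random.seed(0)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_distance(points, p):
    others = [q for q in points if q is not p]
    return sum(euclidean(p, q) for q in others) / len(others)

# 50 inliers clustered near the origin on 4 informative attributes.
inliers = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
outlier = [8.0, 8.0, 8.0, 8.0]          # far away in the 4-D subspace
data4 = inliers + [outlier]

# Separation: outlier's mean distance relative to the most distant inlier.
sep4 = mean_distance(data4, outlier) / max(mean_distance(data4, p) for p in inliers)

# Append 5 noisy attributes on which the outlier is unremarkable.
data9 = [p + [random.gauss(0.0, 20.0) for _ in range(5)] for p in data4]
sep9 = mean_distance(data9, data9[-1]) / max(mean_distance(data9, p) for p in data9[:-1])

print(f"separation in 4-D subspace: {sep4:.2f}, in full 9-D space: {sep9:.2f}")
```

In the 4-D subspace the planted point is several times more distant than any inlier; in the full 9-D space the noisy attributes dominate every pairwise distance and its separation collapses towards 1, which is exactly why a full-space distance method misses it.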
Subspace outlier detection is the process of finding the subspace outliers in a dataset. It is essentially dimensionality reduction (projection onto a subspace) combined with outlier detection. The output of a subspace outlier detection algorithm is not just a set of points declared to be outliers, but a set of ordered pairs, each containing a declared outlier and the subspace to which it belongs.

1.2. Motivation

The Curse of Dimensionality in outlier detection is addressed by subspace outlier detection, a relatively new sub-field of outlier detection. The literature on subspace outlier detection is not comprehensive. A few algorithms (see the Literature Review) have been designed specifically to perform subspace outlier detection. However, these algorithms have not been assessed in comparison with each other; in this author's reading, no papers have given feedback on their suitability or effectiveness. All of these algorithms are proposed to satisfy the same goal – detecting outliers in high dimensional data – yet they all use different approaches. This raises the questions:

1. Which algorithm is better?
2. Do the strengths and weaknesses of the algorithms vary according to the target dataset? For example, is one algorithm the most effective for one type of data, while another algorithm is the most effective for a different type of data?
3. Can a new algorithm, inspired by the concepts of these algorithms, be devised that performs significantly better than any of them?

More information about these algorithms is given in the literature review.

1.2.3. Research Question

The research question for this paper is:

Research question: What is the best way to detect outliers in high dimensional datasets?

The question has been relevant for a while, but is increasingly relevant with the increase in database sizes. At the same time, the recent dedicated subspace outlier detection algorithms provide new hope of an answer through their study. As such, this research project will focus on the study of these algorithms and the concepts involved in them. The research project is based on the following hypothesis:

Proposed hypothesis: Improvements can be brought to the methods of subspace outlier detection through a combination of the ideas applied in existing subspace outlier detection algorithms.

The next section, the literature review, describes the current subspace outlier detection algorithms.

2. Literature Review

This research project will look at several subspace outlier detection algorithms and evaluate their effectiveness. They will be evaluated not only relative to each other, but also against two benchmark algorithms. In the evaluation, the quality metrics used are important. This literature review covers the following:

1. Subspace Outlier Detection Algorithms
   1. Aggarwal Evolutionary Search* (Aggarwal et al. 2001)
   2. Lazarevic Feature Bagging Technique* (Lazarevic et al. 2005)
   3. Subspace Outlier Degree (SOD) (Kriegel et al. 2009)
   4. Mining Top N Outliers in Most Interesting Subspaces (MOIS) (Leng et al. 2009)
2. Benchmark Outlier Detection Algorithms
   1. Distance-Based Outlier (DB-Outlier) detection (Knorr et al. 1998)
   2. Local Outlier Factor (LOF) outlier detection (Breunig et al. 2000)
3. Quality Metrics
4. Fractional Distance Metrics

*These algorithms were not named in the papers that defined them, so they have been given original names for this proposal.

2.1. Subspace Outlier Detection Algorithms

2.1.1. Aggarwal Evolutionary Search

This algorithm was defined by Charu C. Aggarwal and Philip S. Yu in 2001 (Aggarwal et al. 2001). Charu Aggarwal has written extensively on the topic of data mining under high dimensionality since the year 2000. This algorithm is the earliest subspace outlier detection algorithm the author of this proposal has seen.
Aggarwal Evolutionary Search starts by partitioning each attribute into equal-depth parts, i.e. ranges that contain an equal number of data points. Then the blocks formed by combining parts from two or more attributes are considered. A sparsity metric is defined to compare the blocks, and the sparsest blocks according to the metric are declared to be full of outliers. The points within each sparse block are returned, along with the set of attributes corresponding to the block. However, the number of blocks increases exponentially with the number of attributes, so a comparison of all the blocks is infeasible for even modestly complicated problems. Instead, an evolutionary algorithm is used to select blocks to evaluate. Each block is modelled as a string of the attribute parts that define it. The strings undergo Selection, Crossover and Mutation to create new blocks, including combinations of existing blocks. The sparsity metric is used as the fitness function for the evolutionary algorithm.

Although Aggarwal et al. (2001) test the Aggarwal Evolutionary Search and give results, they do not give results that convey the accuracy of the algorithm with respect to a set of true outliers: no accuracy rate, false positive rate or ROC curve is stated. One dataset Aggarwal et al. test their algorithm on is the UCI Arrhythmia dataset, which has a set of rare classes that constitute "true" outliers. For this dataset, 43 out of the 85 members of these rare classes were detected, as well as an unspecified number of data points that were in dominant classes but could be seen to contain erroneous information. So, as a measure of quality, this algorithm can tentatively be said to have a little better than a 43/85 ≈ 0.51 true positive rate. A more thorough test is required to get a stronger result.

2.1.2. Lazarevic Feature Bagging Technique

Lazarevic et al. (2005) propose an algorithm that performs subspace outlier detection by running one or more conventional outlier detection algorithms on a dataset many times, with a different set of features each time. The feature sets are random samples of the full set of attributes of the dataset. After the conventional outlier detection algorithms have run, their results are aggregated, and the strongest outliers from all the executions are returned along with the feature sets from which they were found. Lazarevic et al. call this approach "feature bagging". This algorithm's test results contain ROC curves, and show a modest improvement over an LOF approach.

2.1.3. Subspace Outlier Degree outlier detection

This algorithm is defined in a paper called Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data by Hans-Peter Kriegel et al., which was published this year (Kriegel et al. 2009). The algorithm is based on the idea that if a point's neighbours are modelled as lying in one hyperplane, the distance from the point to the hyperplane is a suitable outlier score. For each point in the dataset, a set of nearby points, called a reference set, is chosen. The reference set for a point p is the set of the top l points whose k-neighbourhoods share a maximum number of points with the k-neighbourhood of p. The reference set is used to define a hyperplane, which is expressed as a mean point and a set of attributes. The distance from p to the hyperplane is the distance to the hyperplane's mean point under its set of attributes. By defining a hyperplane and measuring the distance to it for each point, an outlier score is assigned to each data point, and those with the highest scores are outliers. This algorithm's test results show some extremely good ROC curves (all with a 0.00 false positive rate for any true positive rate below 0.80).
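The hyperplane-distance idea can be sketched in a few lines. The sketch below is a loose simplification, not the construction of Kriegel et al.: the reference set is taken as the plain k nearest neighbours rather than the shared-neighbourhood construction, and the hyperplane's attribute set as the attributes on which the reference set varies least. The function name `sod_like_score` and the `alpha` cut-off are illustrative inventions:

```python
# Simplified SOD-style score: distance from a point to the mean of its
# reference set, measured only on the attributes where that set is tight.
import math

def sod_like_score(p, data, k=5, alpha=0.8):
    # Reference set: the k nearest neighbours of p (a simplification of
    # the paper's shared-neighbourhood reference set).
    neighbours = sorted((q for q in data if q is not p),
                        key=lambda q: math.dist(p, q))[:k]
    d = len(p)
    mean = [sum(q[i] for q in neighbours) / k for i in range(d)]
    var = [sum((q[i] - mean[i]) ** 2 for q in neighbours) / k for i in range(d)]
    # The hyperplane's attribute set: attributes where the reference set
    # varies clearly less than average (fall back to all attributes).
    avg_var = sum(var) / d
    attrs = [i for i in range(d) if var[i] < alpha * avg_var] or list(range(d))
    # Distance from p to the hyperplane's mean point on those attributes,
    # normalised by the number of attributes kept.
    return math.sqrt(sum((p[i] - mean[i]) ** 2 for i in attrs) / len(attrs))

# A cluster that is tight on the first attribute but stretched along the
# second; the planted point far out on the first attribute scores highest.
data = [(0.01 * i, float(i)) for i in range(10)] + [(5.0, 4.5)]
scores = [sod_like_score(p, data) for p in data]
```

The cluster points score near zero because they sit close to their neighbours' mean on the tight attribute, while the planted point's deviation on that attribute is fully exposed even though it is inconspicuous on the stretched one.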
The good ROC curves apply to a synthetic dataset containing 430 points, tested three times, with 37, 67 and 97 irrelevant attributes respectively. The reason for these good results is not known. This algorithm may require more testing than the others to find, or to show the absence of, weaknesses.

2.1.4. Mining Top N Outliers in Most Interesting Subspaces

This algorithm is by Jinsong Leng, Jiuyong Li et al. and is awaiting publication (Leng et al. 2009). Mining Top N Outliers in Most Interesting Subspaces (MOIS) follows a three-phase approach. First, the set of feature sets with entropy below a threshold and interest gain above a threshold is found. Then a shape factor is calculated for all the listed feature sets, and the feature sets with shape factor above a threshold are added to a shortlist. Finally, all the feature sets in the shortlist are searched for outliers, and the N points with the highest outlier scores are returned. The shape factor value for a feature set is the excess kurtosis of the data under that feature set, divided by the variance of the data under that feature set. The outlier score metric is a modified form of k-distance that is normalised with respect to the information energy of the distances on the feature set. This normalisation of distances is used to reduce the effect of disproportionate attributes.

The paper plots the ROC curves of the algorithm against an outlier detection technique that uses the same outlier score, but operates on the full space instead of subspaces. The ROC curves are for three datasets. They show that the accuracy of the algorithm is reasonable, but strongly depends on the dataset. Also, under some conditions, the precision of the algorithm is decreased by the two rounds of feature set selection. An interesting characteristic of this algorithm is that it uses a combination of techniques to reduce the set of subspaces to search.

2.1.5. Summary

We have looked at the subspace outlier detection methods that have been created so far. Due to the nature of the problem of subspace outlier detection, some characteristics of the algorithms are equivalent, despite their different approaches. In Table 1, each of the algorithms is listed along with its version of each characteristic common to all the algorithms. The common characteristics looked at are:

1. Subspace: the way in which the algorithm models the subspaces.
2. Search space reduction strategy: the set of all (data point, subspace) pairs increases exponentially with dimensionality (Aggarwal et al. 2001), so the search space is too large to search exhaustively, and each algorithm applies a strategy to reduce it.
3. Outlier score: the metric used to measure a point's outlier score once a subspace has been chosen.

The accuracy from each algorithm's testing is listed on the right for convenience.

Algorithm | Subspace | Search space reduction strategy | Outlier score | Accuracy
Aggarwal Evolutionary Search | hyperblock, formed by partition ranges of attributes | evolutionary algorithm on the hyperblocks | sparsity of hyperblocks | apparently reasonable, limited information
Lazarevic Feature Bagging | feature set | random sampling of features | LOF | reasonable
Subspace Outlier Degree (SOD) | hyperplane | l, the number of points used to define each hyperplane | distance from hyperplane | very good
MOIS | feature set | two shortlistings of feature sets, based on thresholds | normalised k-distance | reasonable, depends on dataset

Table 1: The common characteristics of the algorithms

One thing to note about the accuracy of the algorithms is that they have not been tested on the same datasets, so direct comparison is not possible. Overall, the different approaches used by the algorithms constitute a set of starting points for solving the problem of subspace outlier detection.
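Of the reduction strategies in Table 1, random feature sampling is the easiest to sketch. The following illustrative feature-bagging loop makes two hypothetical simplifications: a plain k-distance score stands in for LOF, and each point simply keeps its strongest (score, subspace) pair rather than the aggregation Lazarevic et al. use. The names `k_distance` and `feature_bagging` are this sketch's own:

```python
# Minimal feature-bagging sketch: run a base outlier scorer on random
# feature subsets and keep each point's strongest (score, subspace) pair.
import math
import random

def k_distance(p, data, attrs, k=3):
    # Distance to the k-th nearest neighbour, projected onto `attrs`.
    dists = sorted(math.sqrt(sum((p[i] - q[i]) ** 2 for i in attrs))
                   for q in data if q is not p)
    return dists[k - 1]

def feature_bagging(data, rounds=10, k=3, seed=1):
    rng = random.Random(seed)
    d = len(data[0])
    best = [(0.0, None)] * len(data)     # (score, subspace) per point
    for _ in range(rounds):
        # Sample a feature subset of size between d/2 and d-1, as in
        # Lazarevic et al.'s sampling scheme.
        size = rng.randint(d // 2, d - 1) or 1
        attrs = tuple(sorted(rng.sample(range(d), size)))
        for j, p in enumerate(data):
            s = k_distance(p, data, attrs, k)
            if s > best[j][0]:
                best[j] = (s, attrs)
    return best

# A 5x5 grid of inliers, flat on attribute 2, plus one point far out on
# attribute 2 only: a subspace outlier invisible on attributes 0 and 1.
data = [(float(i % 5), float(i // 5), 0.0) for i in range(25)] + [(2.0, 2.0, 9.0)]
best = feature_bagging(data, rounds=20)
```

Whenever a sampled subset contains attribute 2, the planted point's score dwarfs every inlier's, and the returned (score, subspace) pair identifies both the outlier and the subspace that reveals it.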
What is needed now is for these approaches to be put into context with each other, so that a consistent framework for subspace outlier detection techniques can be created. Listing the concepts is not enough; they must also be evaluated.

2.2. Benchmark Algorithms

In this project, the subspace outlier detection algorithms will be compared to two traditional algorithms, the benchmark algorithms. They are chosen as effective algorithms for low dimensional outlier detection, and will be used to measure the absolute degree of improvement the subspace outlier detection algorithms make over traditional techniques.

2.2.1. Distance Based Outlier Detection

This algorithm was designed by Edwin Knorr and Raymond T. Ng (1998). A distance based outlier (DB-outlier) algorithm is any algorithm that finds all the points in a dataset that are DB(p, D)-outliers, for certain values of p and D, according to the definition (Knorr et al. 1998): an object O in a dataset T is a DB(p, D)-outlier if at least fraction p of the objects in T lies greater than distance D from O. This type of algorithm is often used (Chandola et al. 2007) and returns good results for low dimensional datasets. One of its strengths is that it generalises some important traditional statistical distribution-based definitions of outlier. For example, a traditional definition of outlier is that every point further than three standard deviations (3σ) from the mean is an outlier; this definition is equivalent to the definition of a DB(0.9988, 0.13σ)-outlier. This algorithm is an example of a global outlier definition, where a point's distance from another point is always treated with the same significance, regardless of the data distribution.

2.2.2. Local Outlier Factor

Another definition of outlier is the Local Outlier Factor (LOF) definition, designed by Markus Breunig, Hans-Peter Kriegel et al. (2000). Any algorithm that returns the set of points with a Local Outlier Factor above a certain threshold is a Local Outlier Factor outlier detection algorithm. The following concepts are used in the definition of LOF:

1. k-distance: the distance from a point to its k-th nearest neighbour. For example, the distance to the nearest distinct point is the 1-distance of a point. This is denoted dk(p) for a positive integer k and a point p.
2. k-neighbourhood: the set of the k nearest points to some point p, excluding p itself. This is denoted Nk(p) for a positive integer k and point p.
3. Reachability distance: the reachability distance from one point to a second point is the maximum of the distance between the two points and the k-distance of the second point. This concept is used to measure the distance between points because it smooths out statistical fluctuations in the distances between points within clusters. This is denoted reach-distancek(p, o) for a positive integer k and two points p and o.
4. Local reachability density: the local reachability density of a point p is the inverse of the mean reachability distance from p to the points in the k-neighbourhood of p:

   lrdk(p) = 1 / ( (1/|Nk(p)|) · Σ o∈Nk(p) reach-distancek(p, o) )

   This metric is used as a measure of the density around a point.

Using these concepts, the Local Outlier Factor of any point p can be stated as the mean local reachability density of all the points in p's k-neighbourhood, divided by the local reachability density of p. Points with a low density compared to their neighbours receive a high score and are considered outliers. LOF is a local definition of outliers: the relevance of a point's deviation depends on the density of the points close to it. This allows clusters of different densities to be treated equally, and reduces mislabelling.

2.3. Evaluation Metrics

This section introduces the Receiver Operating Characteristic curve (ROC curve), and some related concepts which will be used in the evaluation of the algorithms. In measuring the accuracy of outlier detection, points declared to be outliers are considered "positives" and other points are "negatives". Using knowledge of which data points are generated by rare mechanisms, it is possible to say which points are "true" outliers. A true outlier that is declared to be an outlier is a true positive. A non-outlier that is declared to be an outlier is a false positive. The same terminology applies to negatives, as shown in Table 2.

                 | Declared outlier | Declared non-outlier
True outlier     | true positive    | false negative
True non-outlier | false positive   | true negative

Table 2: Terminology in outlier evaluation (Fawcett 2006)

The true positive rate is the number of true positives divided by the number of true outliers. The false positive rate is the number of false positives divided by the number of actual non-outliers. The precision is the number of true positives divided by the total number of positives (Fawcett 2006). However, to get an accurate idea of the effectiveness of an outlier detection algorithm, the values of these accuracy metrics for a single test run are not enough (Fawcett 2006). A way to solve this problem is to plot an ROC curve. ROC curves are 2D plots of the true positive rate against the false positive rate.
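When an algorithm outputs a numeric outlier score, the (false positive rate, true positive rate) pairs of an ROC curve can be traced from a single run by sweeping the declaration threshold from strictest to loosest. A small sketch with made-up scores and labels (the function name `roc_points` is this sketch's own):

```python
# Trace ROC curve points by sweeping a threshold over outlier scores.
def roc_points(scores, is_outlier):
    pos = sum(is_outlier)                # number of true outliers
    neg = len(is_outlier) - pos          # number of true non-outliers
    # Visit points from highest (most outlying) score to lowest,
    # which is equivalent to loosening the threshold step by step.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if is_outlier[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))   # (false pos rate, true pos rate)
    return points

# Hypothetical scores: the three true outliers score highest, so the
# curve rises straight to a true positive rate of 1 at zero false positives.
scores = [9.1, 0.4, 7.5, 0.2, 0.9, 8.0]
labels = [True, False, True, False, False, True]
curve = roc_points(scores, labels)
```

Because the three true outliers here outrank every non-outlier, the traced curve passes through (0.0, 1.0), the ideal corner; interleaved scores would bend the curve towards the diagonal.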
In order to fill the curve for all numbers of positives, the algorithm in question needs to be run many times with a parameter, such as data size, being changed that increases the positive rates. A good ROC curve is a straight line where the true positive rate is always one and the false positive rate is always zero. The worst possible ROC curve is a straight diagonal line where the true positive rate equals the false positive rate. This is worse than a flat true positve equals zero line because it means the algorithm does even not return the correct information in a mislabelled manner. 2.4. Summary We have surveyed the literature relevant to the project. 3. Research Design 3.1. Computational Tools The test platform is as follows: 1. Intel Pentium IV 2.8GHz, 1 GiB random access memory, 500GB hard disk space. 2. Windows XP 3. Java Development Kit 1.6 4. Weka data mining software (Witten et al. 2002) 3.2. Methodology The research project will follow a positivist quantitative methodology. The focus of this methodology will be 3.2.1. Steps 1. Collect data This research project will use public datasets from the UCI Machine Learning Repository (Asuncion et al. 2007). First, the data sets will be downloaded. This has already been done. 2. Obtain implementations of algorithms The algorithms for which implementations are available will be collected for use. The algorithms for which no implementation is available will be implemented as part of this project. The preferred programming platform is Java Standard Edition, using the Weka API for data representation and utility functions. Algorithms implemented on other platforms will be acceptable if their input and output data can be transmitted to and from the preferred platform. 3. Test the algorithms The algorithms will be tested and the following values collected for each (algorithm, dataset) pair: 1. True positive rate 2. False positive rate 3. ROC curve 4. 
Area under ROC curve (AUC) Some algorithms may reveal important information from more extensive testing, for example, an exploration of weaknesses and strengths via carefully prepared datasets. If this information seems important, and time permits, further testing will be done on some algorithms. 4. Analysis Using the results of the tests, an assessment of the algorithms will be made. If it is straightforward, a ranking of the algorithms by quality will be found. Any relative strengths or weaknesses of the algorithms should be found. The concepts in the algorithms will be assessed for relevance based on the quality of the algorithms in which they are applied. 5. Create modified algorithm Based on the analysis, one or more improvements to the existing techniques may become apparent. If this occurs, and time permits, a new algorithm will be devised which demonstrates the improvements. 6. Test modified algorithm The improved algorithm will then be evaluated on the same basis as the original subspace outlier detection algorithms. The modifications should be confirmed to be improvements here. 3.3. Expected Outcomes The output of this research project will be a written minor thesis. 3.3.1. Known Outcomes The expectation is: 1. Survey and Analysis of Existing Methods: This research project will result in a clear assessment of the existing dedicated subspace outlier detection technqiues that 1. States the strengths and weaknesses of the algorithms in comparison with each other; 2. States the strengths and weaknesses of the algorithms in comparison with the benchmark algorithms, DB-outlier, and LOF; 3. States all the differences in the set of situations for which each algorithm is suitable. 2. This research project will state the importance of the various concepts used in subspace outlier detection, using the measures of relevance and applicability. 3.3.2. Further Outcomes A possible outcome is that: 1. 
New or Modified Method: This research project might result in the creation of a new subspace outlier detection algorithm that combines ideas from the existing techniques.

4. Timeline

The timeline for the research project is summarised as follows:

Task                              Duration              Start date   Expected finish date   Comments
1  Data collection                1 week                24/May       30/May                 Done
2  Research proposal              4 weeks               2/Jun        28/Jun                 Done
3  Implementation procurement     4 weeks               5/Jul        25/Jul
4  Testing                        2 weeks               2/Aug        15/Aug
5  Analysis                       2 weeks               16/Aug       29/Aug
6  Modified algorithm creation    3 weeks               29/Aug       19/Sep
7  Modified algorithm testing     2 weeks               20/Sep       3/Oct
8  Thesis writing                 3 weeks (dedicated)   4/Oct        26/Oct                 Parts of the thesis may be written earlier

The workload assumed is 20 hours/week. This timetable may be adjusted during the project.

5. Summary

In conclusion, the problem of subspace outlier detection has received attention in the literature, but further work is needed before it can be considered solved. This research project will examine recent algorithms for the problem, and attempt to identify the most effective algorithms and the most important concepts for subspace outlier detection.

6. References

Aggarwal, CC & Yu, PS 2001, 'Outlier detection for high dimensional data', SIGMOD Rec., vol. 30, no. 2, pp. 37-46.

Asuncion, A & Newman, DJ 2007, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, <http://www.ics.uci.edu/~mlearn/MLRepository.html>.

Beyer, K, Goldstein, J, Ramakrishnan, R & Shaft, U 1998, 'When Is "Nearest Neighbour" Meaningful?', pp. 217-235.

Breunig, MM, Kriegel, H-P, Ng, RT & Sander, J 2000, 'LOF: Identifying Density-Based Local Outliers', paper presented at the Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data.

Chandola, V, Banerjee, A & Kumar, V 2007, 'Anomaly Detection: A Survey', Technical Report TR 07-017.

Fawcett, T 2006, 'An introduction to ROC analysis', Pattern Recognition Letters, vol. 27, no. 8, pp. 861-874.
Hawkins, DM 1980, Identification of Outliers, Chapman and Hall, London and New York.

Hodge, V & Austin, J 2004, 'A Survey of Outlier Detection Methodologies', Artificial Intelligence Review, vol. 22, no. 2, pp. 85-126.

Knorr, EM & Ng, RT 1998, 'Algorithms for Mining Distance-Based Outliers in Large Datasets', paper presented at the 24th VLDB Conference, New York, USA.

Kriegel, H-P, Kröger, P, Schubert, E & Zimek, A 2009, 'Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data', in Advances in Knowledge Discovery and Data Mining, pp. 831-838.

Lazarevic, A & Kumar, V 2005, 'Feature Bagging for Outlier Detection', in Knowledge Discovery and Data Mining, pp. 157-166.

Leng, J, Li, J & Fu, AW-C 2009, Exploring Most Interesting Subspaces for Effective Top N Outlier Detection, Edith Cowan University, University of South Australia and The Chinese University of Hong Kong, pp. 1-9.

Witten, IH & Frank, E 2005, Data Mining: Practical machine learning tools and techniques, 2nd edn, Morgan Kaufmann, San Francisco.

7. Bibliography

Achtert, E, Kriegel, H-P & Zimek, A 2008, 'ELKI: A Software System for Evaluation of Subspace Clustering Algorithms', in Scientific and Statistical Database Management, pp. 580-585.

Aggarwal, C, Hinneburg, A & Keim, D 2001, 'On the Surprising Behavior of Distance Metrics in High Dimensional Space', in Database Theory — ICDT 2001, pp. 420-434.

Aggarwal, CC 2002, Hierarchical subspace sampling: a unified framework for high dimensional data reduction, selectivity estimation and nearest neighbor search, ACM, Madison, Wisconsin.

Aggarwal, CC 2001, A human-computer cooperative system for effective high dimensional clustering, ACM, San Francisco, California.

Aggarwal, CC 2005, On k-anonymity and the curse of dimensionality, VLDB Endowment, Trondheim, Norway.

Aggarwal, CC 2001, 'Re-designing distance functions and distance-based applications for high dimensional data', SIGMOD Rec., vol. 30, no. 1, pp. 13-18.
Aggarwal, CC 2003, Towards systematic design of distance functions for data mining applications, ACM, Washington, D.C.

Aggarwal, CC & Yu, PS 2000, Finding generalized projected clusters in high dimensional spaces, ACM, Dallas, Texas, United States.

Aggarwal, CC & Yu, PS 2001, 'Outlier detection for high dimensional data', SIGMOD Rec., vol. 30, no. 2, pp. 37-46.

Agovic, A, Banerjee, A, Ganguly, A & Protopopescu, V 2007, Anomaly Detection in Transportation Corridors using Manifold Embedding, ACM, San Jose, California, USA.

Ahmed, T, Oreshkin, B & Coates, M 2007, Machine learning approaches to network anomaly detection, USENIX Association, Cambridge, MA.

Asuncion, A & Newman, DJ 2007, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, <http://www.ics.uci.edu/~mlearn/MLRepository.html>.

Bay, SD & Schwabacher, M 2003, Mining distance-based outliers in near linear time with randomization and a simple pruning rule, ACM, Washington, D.C.

Bellman, R & Kalaba, R 1959, On adaptive control processes.

Bellman, R & Lee, E 1984, 'History and development of dynamic programming', Control Systems Magazine, IEEE, vol. 4, no. 4, pp. 24-28.

Beyer, K, Goldstein, J, Ramakrishnan, R & Shaft, U 1998, 'When Is "Nearest Neighbour" Meaningful?', pp. 217-235.

Breunig, MM, Kriegel, H-P, Ng, RT & Sander, J 2000, 'LOF: Identifying Density-Based Local Outliers', paper presented at the Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data.

Chaloner, K & Brant, R 1988, 'A Bayesian Approach to Outlier Detection and Residual Analysis', Biometrika, vol. 75, no. 4, pp. 651-659.

Chan, PK, Mahoney, MV & Arshad, MH 2003, A Machine Learning Approach to Anomaly Detection, Florida Institute of Technology.

Chandola, V, Banerjee, A & Kumar, V 2007, 'Anomaly Detection: A Survey', Technical Report TR 07-017.
Cheng, C-H, Fu, AW & Zhang, Y 1999, Entropy-based subspace clustering for mining numerical data, ACM, San Diego, California, United States.

Eskin, E, Arnold, A, Prerau, M, Portnoy, L & Stolfo, S 2002, 'A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data', Data Mining for Security Applications.

García Adeva, JJ & Pikatza Atxa, JM 2007, 'Intrusion detection in web applications using text mining', Engineering Applications of Artificial Intelligence, vol. 20, no. 4, pp. 555-566.

Hangal, S & Lam, MS 2002, Tracking down software bugs using automatic anomaly detection, ACM, Orlando, Florida.

Hartigan, JA & Wong, MA 1979, 'Algorithm AS 136: A K-Means Clustering Algorithm', Applied Statistics, vol. 28, no. 1, pp. 100-108.

Hawkins, DM 1980, Identification of Outliers, Chapman and Hall, London and New York.

He, Z, Deng, S & Xu, X 2005, 'An Optimization Model for Outlier Detection in Categorical Data', in Advances in Intelligent Computing, pp. 400-409.

Hinneburg, A, Aggarwal, CC & Keim, DA 2000, What Is the Nearest Neighbor in High Dimensional Spaces?, Morgan Kaufmann Publishers Inc.

Hodge, V & Austin, J 2004, 'A Survey of Outlier Detection Methodologies', Artificial Intelligence Review, vol. 22, no. 2, pp. 85-126.

Indyk, P & Motwani, R 1998, Approximate nearest neighbors: towards removing the curse of dimensionality, ACM, Dallas, Texas, United States.

Jin, W, Tung, AKH & Han, J 2001, Mining top-n local outliers in large databases, ACM, San Francisco, California.

Joksimovic, GM & Penman, J 2000, 'The detection of inter-turn short circuits in the stator windings of operating motors', Industrial Electronics, IEEE Transactions on, vol. 47, no. 5, pp. 1078-1084.

Jolliffe, IT 1986, Principal component analysis, Springer, Berlin.
Kearns, MJ 1990, Computational Complexity of Machine Learning, MIT Press.

Knorr, EM & Ng, RT 1998, 'Algorithms for Mining Distance-Based Outliers in Large Datasets', paper presented at the 24th VLDB Conference, New York, USA.

Kollios, G, Gunopulos, D, Koudas, N & Berchtold, S 2003, 'Efficient biased sampling for approximate clustering and outlier detection in large data sets', Knowledge and Data Engineering, IEEE Transactions on, vol. 15, no. 5, pp. 1170-1187.

Kriegel, H-P, Kröger, P, Schubert, E & Zimek, A 2009, 'Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data', in Advances in Knowledge Discovery and Data Mining, pp. 831-838.

Kruegel, C & Vigna, G 2003, Anomaly detection of web-based attacks, ACM, Washington D.C., USA.

Leng, J, Li, J & Fu, AW-C 2009, Exploring Most Interesting Subspaces for Effective Top N Outlier Detection, Edith Cowan University, University of South Australia and The Chinese University of Hong Kong, pp. 1-9.

Li, X & Han, J 2007, Mining approximate top-k subspace anomalies in multi-dimensional time-series data, VLDB Endowment, Vienna, Austria.

Liu, J & Chen, D-S 2009, 'Fault Detection and Identification Using Modified Bayesian Classification on PCA Subspace', Industrial & Engineering Chemistry Research, vol. 48, no. 6, pp. 3059-3077.

Moonesinghe, HDK & Tan, P-N 2006, 'Outlier Detection using Random Walks', paper presented at the Proceedings of the 18th IEEE International Conference on Tools with Artificial Intelligence.

Moore, B 1981, 'Principal component analysis in linear systems: Controllability, observability, and model reduction', Automatic Control, IEEE Transactions on, vol. 26, no. 1, pp. 17-32.

Parsons, L, Haque, E & Liu, H 2004, 'Evaluating Subspace Clustering Algorithms', paper presented at the Workshop on Clustering High Dimensional Data and its Applications, SIAM International Conference on Data Mining (SDM 2004).
Parsons, L, Haque, E & Liu, H 2004, 'Subspace clustering for high dimensional data: a review', SIGKDD Explor. Newsl., vol. 6, no. 1, pp. 90-105.

Prastawa, M, Bullitt, E, Ho, S & Gerig, G 2004, 'A brain tumor segmentation framework based on outlier detection', Medical Image Analysis, vol. 8, no. 3, pp. 275-283.

Provost, F & Fawcett, T 2001, 'Robust Classification for Imprecise Environments', Machine Learning, vol. 42, no. 3, pp. 203-231.

Riedewald, M, Agrawal, D, Abbadi, A & Korn, F 2003, 'Accessing Scientific Data: Simpler is Better', in Advances in Spatial and Temporal Databases, pp. 214-232.

Rust, J 1997, 'Using Randomization to Break the Curse of Dimensionality', Econometrica, vol. 65, no. 3, pp. 487-516.

Scholkopf, B, Smola, A & Muller, K-R 1998, 'Nonlinear Component Analysis as a Kernel Eigenvalue Problem', Neural Computation, vol. 10, no. 5, p. 1299.

Smith, R, Bivens, A, Embrechts, M, Palagiri, C & Szymanski, B 2002, 'Clustering approaches for anomaly based intrusion detection', paper presented at the Proceedings of Intelligent Engineering Systems through Artificial Neural Networks.

Steinwart, I, Hush, D & Scovel, C 2005, 'A Classification Framework for Anomaly Detection', J. Mach. Learn. Res., vol. 6, pp. 211-232.

Streifel, RJ, Marks II, RJ, El-Sharkawi, MA & Kerszenbaum, I 1996, 'Detection of shorted-turns in the field winding of turbine-generator rotors using novelty detectors - development and field test', IEEE Transactions on Energy Conversion, vol. 11, no. 2, pp. 312-317.

Tallam, RM, Sang Bin, L, Stone, GC, Kliman, GB, Jiyoon, Y, Habetler, TG & Harley, RG 2007, 'A Survey of Methods for Detection of Stator-Related Faults in Induction Machines', Industry Applications, IEEE Transactions on, vol. 43, no. 4, pp. 920-933.

Tang, J, Chen, Z, Fu, A & Cheung, D 2002, 'Enhancing Effectiveness of Outlier Detections for Low Density Patterns', in Advances in Knowledge Discovery and Data Mining, pp. 535-548.
Vaidya, J 2004, 'Privacy-Preserving Outlier Detection', paper presented at the Proceedings of the Fourth IEEE International Conference on Data Mining.

Wang, Y, Tetko, IV, Hall, MA, Frank, E, Facius, A, Mayer, KFX & Mewes, HW 2005, 'Gene selection from microarray data for cancer classification: a machine learning approach', Computational Biology and Chemistry, vol. 29, no. 1, pp. 37-46.

Wei, L, Qian, W, Zhou, A, Jin, W & Yu, J 2003, 'HOT: Hypergraph-Based Outlier Test for Categorical Data', in Advances in Knowledge Discovery and Data Mining, pp. 562-562.

Wenke, L & Dong, X 2001, 'Information-theoretic measures for anomaly detection', paper presented at the Proceedings of the 2001 IEEE Symposium on Security and Privacy (S&P 2001).

Witten, IH & Frank, E 2005, Data Mining: Practical machine learning tools and techniques, 2nd edn, Morgan Kaufmann, San Francisco.

Yianilos, PN 2000, Locally lifting the curse of dimensionality for nearest neighbor search (extended abstract), Society for Industrial and Applied Mathematics, San Francisco, California, United States.

Zhang, K, Shi, S, Gao, H & Li, J 2007, 'Unsupervised Outlier Detection in Sensor Networks Using Aggregation Tree', in Advanced Data Mining and Applications, pp. 158-169.