School of Computer and Information Science
University of South Australia
An investigation into subspace outlier detection
Research Proposal
by
Alex Wiegand
Student ID: 100029537
Program: LHCP
Supervisor: Associate Professor Jiuyong Li
June 2009
Disclaimer
I declare all the following to be my own work, unless otherwise referenced, as defined by the
University of South Australia's policy on plagiarism. The University of South Australia's
policy on plagiarism can be found at http://www.unisa.edu.au/policies/manual/default.asp.
Alex Wiegand
Date: 16th June 2009
Table of Contents
Disclaimer
Abstract
1. Introduction
1.1. Background
1.1.1. Outliers
1.1.2. Outlier Detection
1.1.3. The Curse of Dimensionality
1.1.4. Subspace Outlier Detection
1.2. Motivation
1.2.1. Research Question
2. Literature Review
2.1. Subspace Outlier Detection Algorithms
2.1.1. Aggarwal Evolutionary Search
2.1.2. Lazarevic Feature Bagging Technique
2.1.3. Subspace Outlier Degree outlier detection
2.1.4. Mining Top N Outliers in Most Interesting Subspaces
2.1.5. Summary
2.2. Benchmark Algorithms
2.2.1. Distance Based Outlier Detection
2.2.2. Local Outlier Factor
2.3. Evaluation Metrics
2.4. Summary
3. Research Design
3.1. Computational Tools
3.2. Methodology
3.2.1. Steps
3.3. Expected Outcomes
3.3.1. Known Outcomes
3.3.2. Further Outcomes
4. Timeline
5. Summary
6. References
7. Bibliography
Abstract
Finding outliers in high dimensional datasets is difficult due to the "curse of dimensionality".
The new field of subspace outlier detection addresses this problem by considering projections
of the dataset onto lower dimensional subspaces, and looking for outliers in those projections.
This research project will survey the existing literature on subspace outlier detection, and
attempt to optimise, or otherwise make a substantial contribution to, the techniques of subspace
outlier detection. The research question is "What is the best technique to find outliers in high
dimensional datasets?" The key challenges to be addressed by this thesis are (1) the choice of
subspace projections to search for outliers, (2) the choice of distance metrics, and (3)
minimisation of the false positive rate.
Key Words: Data Mining, Outlier Detection, Subspace Outlier Detection, Curse of
Dimensionality.
1. Introduction
Outlier detection is a widely used and important part of data mining. Datasets with many
dimensions are also common in data mining problems. However, most existing outlier
detection techniques fail at sufficiently high dimensionality because of the Curse of
Dimensionality, which destroys the meaningfulness of the techniques themselves.
1.1. Background
1.1.1. Outliers
An outlier is an observation that is very different from the other observations in a dataset. The
concept of outliers is used to identify observations that deserve special treatment. These ideas
are put together in the Hawkins definition of outliers (Hawkins 1980):
an outlier is an observation that deviates so much from other observations as to
arouse suspicion that it was generated by a different mechanism.
The idea here is that a different mechanism generated the outliers, so they must be treated
differently. This can be used when the main data generating mechanism is the only
phenomenon of concern, and it is advantageous to ignore the outliers. Often however, the
outlier-generating mechanism(s) represent important phenomena (Chandola et al. 2007). By
returning the set of points that might be caused by an unusual phenomenon, outliers are used
as a way of narrowing down a search for phenomena that rarely occur. This research project will
focus on this latter case, in which the outliers are interesting.
Here is a list of problems in which the outliers are interesting (Chandola et al. 2007):
1. Intrusion Detection
e.g. phenomena in computer networks that point to network attacks
2. Fraud Detection
e.g. credit card fraud, mobile phone fraud, insider trading
Databases of transactions are usually automatically collected, and fraudulent activities
can be identified from an investigation of unusual transactions.
3. Medical and Public Health Anomaly Detection
e.g. detection of recording errors, detection of disease outbreaks
4. Industrial Damage Detection
e.g. detecting faults in industrial machines, whether they are worn out, or faulty to
begin with.
An important case is the detection of shorted turns in electrical turbines, a kind of
machine wear.
5. Image Processing
e.g. novelty detection
6. Text Processing
e.g. detecting novel topics, or new events in collections of documents
1.1.2. Outlier Detection
Outlier detection is the set of techniques to find the outliers in a dataset. A lot of work has
been done regarding outlier detection (Chandola et al. 2007, Hodge et al. 2004). The outlier
detection techniques may be categorised as follows (Chandola et al. 2007):
1. Classification
2. Nearest Neighbour
3. Clustering
4. Statistical
5. Information Theoretic
6. Subspace
A basic assumption in outlier detection is that the dataset is modelled as a set of points in
space. Each observation is considered a point, called a data point. Usually, each attribute is
considered a dimension, or else dimensions are derived from the attributes in a more
complicated way. Thus the set of attributes (the schema) of the dataset constitutes a space.
This mapping applies directly to numeric attributes, and can be applied to categorical
attributes by first converting their values to numbers.
The outliers then can be viewed not only as points that are different to the other points, but as
points that are distant from them. When the distance is a poor measure of difference, the
space or the distance metric can be changed to make the distance more appropriate. The
default distance metric is Euclidean distance.
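As a minimal illustration of this model (a hypothetical sketch, not taken from any of the algorithms discussed here), observations can be represented as numeric arrays and compared by Euclidean distance:

    // Hypothetical sketch: observations as data points, compared by Euclidean distance.
    public final class Points {
        // Euclidean distance between two data points of equal dimensionality.
        static double euclidean(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                sum += diff * diff;
            }
            return Math.sqrt(sum);
        }

        public static void main(String[] args) {
            double[] p = {1.0, 2.0, 3.0};
            double[] q = {4.0, 6.0, 3.0};
            System.out.println(euclidean(p, q)); // prints 5.0
        }
    }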
1.1.3. The Curse of Dimensionality
As shown by Beyer et al. (1998), a problem arises from high dimensionality: for most
common distributions of data, as the dimensionality increases, the contrast between the
distances from any point to the nearest and farthest other points approaches zero.
This is known as the curse of dimensionality, due to its repercussions.
A result of the curse is that in high dimensional data, no points are very distant from the rest
of the dataset. This means that there are no outliers by distance.
In this case, there may still be outliers in the sense that there are points with differences that
point to important rare phenomena.
Suppose a set of observations with four attributes contains some very clear outliers, and then
the same observations were taken, but with an extra five attributes included. Suppose also
that the previously outlying points are not distant from the other points on the five new
attributes. Then those points do not appear to be outliers in the 9-dimensional dataset,
but the objects they represent are no less unusual.
Traditional outlier detection methods cannot find these outliers because when there are no
outliers by distance, their very definitions of outliers have become meaningless.
This is the curse of dimensionality for outlier detection.
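This loss of contrast is easy to observe directly. The following hypothetical sketch draws uniformly random points and prints the relative contrast (max − min)/min of the distances from a query point to the rest of the data; as the dimensionality grows, the contrast shrinks, in line with Beyer et al. (1998):

    import java.util.Random;

    // Sketch of distance concentration: as dimensionality d grows, the contrast
    // between the nearest and farthest distances from a query point collapses.
    public final class CurseDemo {
        public static void main(String[] args) {
            Random rng = new Random(42);
            int n = 1000; // points per trial
            for (int d : new int[] {2, 10, 100, 1000}) {
                double[] query = randomPoint(rng, d);
                double min = Double.MAX_VALUE, max = 0.0;
                for (int i = 0; i < n; i++) {
                    double dist = euclidean(query, randomPoint(rng, d));
                    min = Math.min(min, dist);
                    max = Math.max(max, dist);
                }
                System.out.printf("d=%4d  contrast (max-min)/min = %.3f%n", d, (max - min) / min);
            }
        }

        static double[] randomPoint(Random rng, int d) {
            double[] p = new double[d];
            for (int i = 0; i < d; i++) p[i] = rng.nextDouble();
            return p;
        }

        static double euclidean(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) { double t = a[i] - b[i]; s += t * t; }
            return Math.sqrt(s);
        }
    }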
1.1.4. Subspace Outlier Detection
It is possible to reduce a dataset from its original space (the full space) to a subspace. This is
done by removing dimensions, or mapping the original dimensions onto a new, smaller set of
dimensions.
A subspace outlier is a point that is outlying in a subspace projection of the original dataset.
In the earlier example of the outlier on four attributes, when the other five attributes are
included, the original four attributes form a subspace of the full 9-dimensional space.
Therefore, the unusual points are subspace outliers in the 9-dimensional dataset.
Subspace outlier detection is the process of finding the subspace outliers in a dataset. It is
essentially dimensionality reduction (project onto a subspace) combined with outlier
detection. The output of a subspace outlier detection algorithm is not just a set of points
declared to be outliers, but a set of ordered pairs, each containing a declared outlier and the
subspace in which it was found.
1.2. Motivation
The way to solve the Curse of Dimensionality in outlier detection is via subspace outlier
detection. Subspace outlier detection is a relatively new sub-field of outlier detection, and
its literature is not yet comprehensive.
A few algorithms (see Literature Review) have been designed specifically to perform
subspace outlier detection. However, these algorithms have not been assessed in comparison
with each other. In this author's reading, there have been no papers giving feedback on the
suitability or effectiveness of these algorithms. All of these algorithms are proposed to satisfy
the same goal – detecting outliers in high dimensional data – yet they all use different
approaches. This raises the questions:
1. Which algorithm is better?
2. Do the strengths and weaknesses of the algorithms vary according to the target
dataset? For example, is one algorithm the most effective for one type of data,
while another algorithm is the most effective for a different type of data?
3. Can a new algorithm, inspired by the concepts of these algorithms, be devised
that performs significantly better than any of them?
More information about these algorithms will be given in the literature review.
1.2.1. Research Question
The research question for this paper is:
Research question:
What is the best way to detect outliers in high dimensional datasets?
The question has long been relevant, and becomes more so as database sizes increase. At the
same time, the recent dedicated subspace outlier detection algorithms provide new hope of an
answer through their study.
As such, this research project will focus on the study of these algorithms, and the concepts
involved in them. The research project is based on the hypothesis:
Proposed hypothesis:
Improvements can be brought to the methods of subspace outlier detection through a
combination of the ideas applied in existing subspace outlier detection algorithms.
In the next section, the literature review will state the current subspace outlier detection
algorithms.
2. Literature Review
This research project will look at some subspace outlier detection algorithms, and evaluate
their effectiveness. They will be evaluated not only relative to each other, but also against two
benchmark algorithms. In the evaluation, the quality metrics used are important.
This literature review will cover the following:
1. Subspace Outlier Detection Algorithms
1. Aggarwal Evolutionary Search* (Aggarwal et al. 2001)
2. Lazarevic Feature Bagging Technique* (Lazarevic et al. 2005)
3. Subspace Outlier Degree (SOD) (Kriegel et al. 2009)
4. Mining Top N Outliers in Most Interesting Subspaces (MOIS) (Leng et al. 2009)
2. Benchmark Outlier Detection Algorithms:
1. Distance-Based Outlier (DB-Outlier) detection (Knorr et al. 1998)
2. Local Outlier Factor (LOF) outlier detection (Breunig et al. 2000)
3. Quality Metrics
4. Fractional Distance Metrics
*These algorithms were not named in the papers that defined them, so they have been given
original names for this proposal.
2.1. Subspace Outlier Detection Algorithms
2.1.1. Aggarwal Evolutionary Search
This algorithm was defined by Charu C. Aggarwal and Philip S. Yu in 2001 (Aggarwal et al.
2001). Charu Aggarwal has written extensively on the topic of data mining under high
dimensionality since the year 2000. This algorithm is the earliest subspace outlier detection
algorithm the author of this proposal has seen.
Aggarwal Evolutionary Search starts by partitioning the attributes into equal-depth parts.
Equal-depth parts are ranges that contain an equal number of data points. Then the blocks
formed by multiplying parts from two or more attributes are considered. A sparsity metric is
defined to compare the blocks. The sparsest blocks, according to the metric, are declared to
be full of outliers. The points within each sparse block are returned, along with the set of
attributes corresponding to the block.
However, the set of blocks increases exponentially with the number of attributes. Therefore, a
comparison of all the blocks is infeasible for even modestly complicated problems. Instead,
an evolutionary algorithm is used to select blocks to evaluate.
Each block is modelled as a string of the attribute parts that define it. The strings undergo
Selection, Crossover and Mutation, to create new blocks, including combinations of existing
blocks. The sparsity metric is used as the fitness function for the evolutionary algorithm.
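To make the sparsity metric concrete: with φ equal-depth ranges per attribute, a block over k attributes is expected to hold a fraction f^k of the N points, where f = 1/φ, and a block is scored by how many binomial standard deviations its actual count falls below that expectation. The following is a minimal sketch of this coefficient as described by Aggarwal et al. (2001); the class and method names are illustrative only:

    // Sketch of the sparsity coefficient of Aggarwal et al. (2001). A block built
    // from k equal-depth ranges (phi ranges per attribute) is expected to contain
    // N * f^k points, with f = 1/phi; strongly negative scores mark sparse blocks.
    public final class Sparsity {
        static double sparsity(int blockCount, int n, int k, int phi) {
            double fk = Math.pow(1.0 / phi, k);             // expected fraction of points
            double expected = n * fk;                       // expected count in the block
            double stdDev = Math.sqrt(n * fk * (1.0 - fk)); // binomial standard deviation
            return (blockCount - expected) / stdDev;
        }

        public static void main(String[] args) {
            // A 3-attribute block over 10 equal-depth ranges in a 10,000-point set
            // is expected to hold 10 points; a block holding 1 point is very sparse.
            System.out.println(sparsity(1, 10_000, 3, 10)); // about -2.85
        }
    }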
Although Aggarwal et al. (2001) test the Aggarwal Evolutionary Search, and give results,
they do not give results that convey the accuracy of the algorithm with respect to a set of true
outliers. For example, there is no accuracy rate, false positive rate or ROC curve stated.
One data set Aggarwal et al. test their algorithm on is the UCI Arrhythmia dataset. The
Arrhythmia dataset has a set of rare classes that constitute “true” outliers. For this dataset, 43
out of the 85 members of these rare classes were detected, as well as an unspecified number
of data points that were in dominant classes, but could be seen to contain erroneous
information. So, as a measure of quality, this algorithm can tentatively be said to have
slightly better than a 43/85 ≈ 0.506 true positive rate. A more thorough test is required to get
a stronger result.
2.1.2. Lazarevic Feature Bagging Technique
Lazarevic et al. (2005) propose an algorithm to perform subspace outlier detection by running
one or more conventional outlier detection algorithms on a dataset many times, with a
different set of features each time. The feature sets are random samples of the full set of
attributes of the dataset.
After running the conventional outlier detection algorithms, their results are aggregated and
the strongest outliers from all the executions are returned, along with the feature sets from
which they were found.
Lazarevic et al. call this approach “feature bagging”.
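A minimal sketch of the feature bagging loop follows. The base detector is left abstract, the cumulative-sum score combination shown is only one of the combination schemes Lazarevic et al. (2005) consider, and the subset-size rule is an assumption for illustration:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;
    import java.util.function.Function;

    // Sketch of feature bagging: score the data on random feature subsets with a
    // base outlier detector, then aggregate the per-point scores across rounds.
    public final class FeatureBagging {
        static double[] featureBagging(double[][] data,
                                       Function<double[][], double[]> baseDetector,
                                       int rounds, Random rng) {
            int n = data.length, d = data[0].length;
            double[] combined = new double[n];
            for (int r = 0; r < rounds; r++) {
                // Sample a random feature subset (between d/2 and d-1 features).
                List<Integer> features = new ArrayList<>();
                for (int j = 0; j < d; j++) features.add(j);
                Collections.shuffle(features, rng);
                int size = Math.max(1, d / 2 + rng.nextInt(Math.max(1, d - d / 2)));
                List<Integer> subset = features.subList(0, size);

                // Project the data onto the subset and score the projection.
                double[][] projected = new double[n][subset.size()];
                for (int i = 0; i < n; i++)
                    for (int j = 0; j < subset.size(); j++)
                        projected[i][j] = data[i][subset.get(j)];
                double[] scores = baseDetector.apply(projected);

                // Cumulative-sum combination of the scores.
                for (int i = 0; i < n; i++) combined[i] += scores[i];
            }
            return combined; // highest combined scores = strongest outlier candidates
        }
    }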
This algorithm's test results include ROC curves, and show a modest improvement over an
LOF approach.
2.1.3. Subspace Outlier Degree outlier detection
This algorithm is defined in the paper 'Outlier Detection in Axis-Parallel Subspaces of
High Dimensional Data' by Hans-Peter Kriegel et al., which was published this year (Kriegel et
al. 2009).
This algorithm is based on the idea that if a point's neighbours are modelled as being in one
hyperplane, the distance from the point to the hyperplane is a suitable outlier score.
For each point in the dataset, a set of nearby points, called a reference set, is chosen. The
reference set for a point p is the set of the top l points whose k-neighbourhoods share a
maximum number of points with the k-neighbourhood of p.
The reference set is used to define a hyperplane. The hyperplane is expressed as a mean point
and a set of attributes. The distance from p to the hyperplane is the distance to the
hyperplane's mean point under its set of attributes.
By defining and measuring the distance from the hyperplane for each point, an outlier score is
assigned to each data point, and those with the highest scores are outliers.
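A rough sketch of this scoring idea is given below. It is not the authors' exact formulation: in particular, the rule used here for selecting the hyperplane's attributes (variance below a fraction α of the average attribute variance) and the normalisation by the number of selected attributes are simplifying assumptions:

    // Rough sketch of the Subspace Outlier Degree idea (Kriegel et al. 2009):
    // model a point's reference set by its mean and its low-variance attributes,
    // then score the point by its distance to that hyperplane.
    public final class SodSketch {
        static double sodScore(double[] p, double[][] referenceSet, double alpha) {
            int d = p.length, m = referenceSet.length;
            double[] mean = new double[d], var = new double[d];
            for (double[] r : referenceSet)
                for (int j = 0; j < d; j++) mean[j] += r[j] / m;
            double totalVar = 0.0;
            for (double[] r : referenceSet)
                for (int j = 0; j < d; j++) var[j] += (r[j] - mean[j]) * (r[j] - mean[j]) / m;
            for (int j = 0; j < d; j++) totalVar += var[j];
            double avgVar = totalVar / d;

            // Hyperplane attributes: those on which the reference set is "thin".
            double distSq = 0.0; int dims = 0;
            for (int j = 0; j < d; j++) {
                if (var[j] < alpha * avgVar) {
                    distSq += (p[j] - mean[j]) * (p[j] - mean[j]);
                    dims++;
                }
            }
            return dims == 0 ? 0.0 : Math.sqrt(distSq) / dims; // normalised distance
        }
    }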
This algorithm's test results show some extremely good ROC curves (all with a 0.00 false
positive rate for any true positive rate below 0.80). The good ROC curves apply to a synthetic
dataset containing 430 points, tested three times with 37, 67 and 97 irrelevant attributes
respectively. The reason for these good results is not known. This algorithm may require
more testing than the others to find, or show the absence of, weaknesses.
2.1.4. Mining Top N Outliers in Most Interesting Subspaces
This algorithm was written by Jinsong Leng, Jiuyong Li et al. and is awaiting publication
(Leng et al. 2009).
Mining Top N Outliers in Most Interesting Subspaces (MOIS) follows a three-phase approach.
First, the set of feature sets with entropy below a threshold and interest gain above a threshold
is found. Then a shape factor is calculated for each of these feature sets, and the feature sets
with a shape factor above a threshold are added to a shortlist. Finally, all the feature sets in the
shortlist are searched for outliers, and the N points with the highest outlier scores are returned.
The shape factor value for a feature set is the excess kurtosis of the data under that feature
set, divided by the variance of the data under that feature set. The outlier score metric is a
modified form of k-distance that is normalised with respect to the information energy of the
distances on the feature set. This normalisation of distances is used to reduce the effect of
disproportionate attributes.
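A minimal sketch of the shape factor computation follows, shown for a single column of values; how the statistic is aggregated over a multi-attribute feature set is simplified away here:

    // Sketch of the MOIS shape factor (Leng et al. 2009): excess kurtosis of the
    // data under a feature set divided by its variance, shown for one column.
    public final class ShapeFactor {
        static double shapeFactor(double[] values) {
            int n = values.length;
            double mean = 0.0;
            for (double v : values) mean += v / n;
            double m2 = 0.0, m4 = 0.0; // second and fourth central moments
            for (double v : values) {
                double dev = v - mean;
                m2 += dev * dev / n;
                m4 += dev * dev * dev * dev / n;
            }
            double variance = m2;
            double excessKurtosis = m4 / (m2 * m2) - 3.0; // zero for a normal distribution
            return excessKurtosis / variance;
        }
    }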
The paper plots the ROC curves of the algorithm against an outlier detection technique that
uses the same outlier score, but operates on the full space instead of subspaces. The ROC
curves are for three datasets. They show that the accuracy of the algorithm is reasonable, but
strongly depends on the dataset. Also, under some conditions, the precision of the algorithm
is decreased by the two rounds of feature set selection.
An interesting characteristic of this algorithm is that it uses a combination of techniques to
reduce the set of subspaces to search.
2.1.5. Summary
We have looked at the subspace outlier detection methods that have been created so far. Due
to the nature of the problem of subspace outlier detection, some characteristics of the
algorithms are equivalent, despite their different approaches.
In Table 1, each of the algorithms is listed, along with its version of a common characteristic
across all the algorithms.
The common characteristics looked at are:
1. Subspace
The way in which the algorithm models the subspaces.
2. Search space reduction strategy
The set of all (data point, subspace) pairs increases exponentially with dimensionality
(Aggarwal et al. 2001). This means that the search space is too large to exhaustively
search. Each algorithm applies a strategy to reduce the search space.
3. Outlier score
The metric used to measure a point's outlier score once a subspace has been chosen.
The accuracy from the algorithm's testing is listed on the right for convenience.
Algorithm | Subspace | Search space reduction strategy | Outlier score | Apparent accuracy
Aggarwal Evolutionary Search | hyperblock, formed by partition ranges of attributes | evolutionary algorithm on the hyperblocks | hyperblock sparsity | reasonable, limited information
Lazarevic Feature Bagging | feature set | random sampling of features | LOF | reasonable
Subspace Outlier Degree (SOD) | hyperplane | l, the number of points used to define each hyperplane | distance from hyperplane | very good
MOIS | feature set | two shortlistings of feature sets, based on thresholds | z-score | reasonable, depends on dataset

Table 1: The common characteristics of the algorithms
One thing to note about the accuracy of the algorithms is that they have not been tested on the
same datasets, so direct comparison is not possible.
Overall, the different approaches used by the algorithms constitute a set of starting points for
solving the problem of subspace outlier detection. What is needed now is for these
approaches to be put into context with each other, so that a consistent framework for
subspace outlier detection techniques can be created. Listing the concepts is not enough;
they must also be evaluated.
2.2. Benchmark Algorithms
In this project, the subspace outlier detection algorithms will be compared to two traditional
algorithms. These two algorithms are the benchmark algorithms. They are chosen to be
effective algorithms for low dimensional outlier detection, and will be used to measure the
absolute degree of improvement the subspace outlier detection algorithms make over
traditional techniques.
2.2.1. Distance Based Outlier Detection
This algorithm was designed by Edwin Knorr and Raymond T. Ng (1998).
A distance based outlier (DB-outlier) algorithm is any algorithm that finds all the points in a
dataset that are DB(p, D)-outliers, for certain values of p and D, according to the definition
(Knorr et al. 1998):
An object O in a dataset T is a DB(p, D)-outlier if at least fraction p of the objects
in T lies greater than distance D from O.
This type of algorithm is often used (Chandola et al. 2007) and returns good results for low
dimensional datasets.
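A naive nested-loop sketch of the definition follows; Knorr and Ng also give cell-based algorithms with better scaling, so the O(n²) form below is for illustration only:

    import java.util.ArrayList;
    import java.util.List;

    // Naive DB(p, D)-outlier detection (Knorr & Ng 1998): a point is an outlier
    // if at least fraction p of the other points lie further than distance D away.
    public final class DbOutliers {
        static List<Integer> dbOutliers(double[][] data, double p, double dCap) {
            List<Integer> outliers = new ArrayList<>();
            int n = data.length;
            for (int i = 0; i < n; i++) {
                int far = 0;
                for (int j = 0; j < n; j++)
                    if (i != j && euclidean(data[i], data[j]) > dCap) far++;
                if (far >= p * (n - 1)) outliers.add(i);
            }
            return outliers;
        }

        static double euclidean(double[] a, double[] b) {
            double s = 0.0;
            for (int k = 0; k < a.length; k++) { double t = a[k] - b[k]; s += t * t; }
            return Math.sqrt(s);
        }
    }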
One of its strengths is that it generalises some important traditional statistical distribution
based definitions of outlier. For example, a traditional definition of outlier for normally
distributed data is that every point further than three standard deviations (3σ) from the mean
is an outlier. This definition is equivalent to the definition of a DB(0.9988, 0.13σ)-outlier.
This algorithm is an example of a global outlier definition, where a point's distance from
another point is always treated with the same significance, regardless of the data distribution.
2.2.2. Local Outlier Factor
Another definition of outlier is the Local Outlier Factor (LOF) definition, designed by
Markus Breunig, Hans-Peter Kriegel et al. (2000). Any algorithm that returns the set of points
with a Local Outlier Factor above a certain threshold is a Local Outlier Factor outlier
detection algorithm.
The following concepts are used in the definition of LOF:
1. k-distance:
The k-distance of a point is the distance from that point to its k-th nearest
neighbour; for example, a point's 1-distance is the distance to its nearest distinct
point. This is denoted d_k(p) for a positive integer k and a point p.
2. k-neighbourhood:
The set of the k nearest points to some point p, excluding p itself. This is denoted
N_k(p) for a positive integer k and a point p.
3. Reachability distance:
The reachability distance from one point to a second point is the maximum of
the distance between the two points and the k-distance of the second point. This
concept is used to measure the distance between points because it smooths out
statistical fluctuations in the distances between points within clusters. It is
denoted reach-dist_k(p, o) for a positive integer k and two points p and o.
4. Local reachability density:
The local reachability density of a point p is the inverse of the mean reachability
distance from p to the points in its k-neighbourhood. This is expressed
mathematically as

lrd_k(p) = \left( \frac{\sum_{o \in N_k(p)} \text{reach-dist}_k(p, o)}{|N_k(p)|} \right)^{-1}

This metric is used as a measure of the density around a point.
Using these concepts, the Local Outlier Factor of any point p can be stated as the mean local
reachability density of all the points in p’s k-neighbourhood, divided by the local reachability
density of p. Points with a low density compared to their neighbours receive a high score and
are considered outliers.
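The following sketch traces these definitions directly. For simplicity it assumes no tied distances, so each k-neighbourhood has exactly k members, and it recomputes distances for clarity rather than speed:

    import java.util.Arrays;

    // Sketch of the Local Outlier Factor (Breunig et al. 2000), following the
    // definitions above: k-distance, k-neighbourhood, reachability distance,
    // local reachability density, then LOF itself.
    public final class LofSketch {
        static double[] lof(double[][] data, int k) {
            int n = data.length;
            int[][] neighbours = new int[n][k];
            double[] kDist = new double[n];
            for (int i = 0; i < n; i++) {
                Integer[] idx = new Integer[n];
                for (int j = 0; j < n; j++) idx[j] = j;
                final int p = i;
                Arrays.sort(idx, (a, b) -> Double.compare(dist(data[p], data[a]),
                                                          dist(data[p], data[b])));
                for (int j = 0; j < k; j++) neighbours[i][j] = idx[j + 1]; // idx[0] is the point itself
                kDist[i] = dist(data[i], data[neighbours[i][k - 1]]);
            }
            double[] lrd = new double[n];
            for (int i = 0; i < n; i++) {
                double sumReach = 0.0;
                for (int o : neighbours[i])
                    sumReach += Math.max(dist(data[i], data[o]), kDist[o]); // reach-dist_k(i, o)
                lrd[i] = k / sumReach; // inverse of the mean reachability distance
            }
            double[] lof = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int o : neighbours[i]) sum += lrd[o];
                lof[i] = (sum / k) / lrd[i]; // mean neighbour lrd over own lrd
            }
            return lof; // values well above 1 indicate outliers
        }

        static double dist(double[] a, double[] b) {
            double s = 0.0;
            for (int j = 0; j < a.length; j++) { double t = a[j] - b[j]; s += t * t; }
            return Math.sqrt(s);
        }
    }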
LOF is a local definition of outliers. This means that the relevance of a point's deviation
depends on the density of the points close to it. This allows clusters of different densities to
be treated equally, and reduces mislabelling.
2.3. Evaluation Metrics
This section will introduce the concept of the Receiver Operating Characteristic curve (ROC
curve), and some related concepts which will be used in evaluation of algorithms.
In measuring the accuracy of outlier detection, points declared to be outliers are considered
“positives” and other points are “negatives”. Using knowledge of which data points are
generated by rare mechanisms, it is possible to say which points are “true” outliers.
A true outlier that is declared to be an outlier is a true positive. A non-outlier that is declared
to be an outlier is a false positive. The same terminology applies to negatives as shown in
Table 2.
Declared class | True class: Outlier | True class: Non-outlier
Outlier | true positive | false positive
Non-outlier | false negative | true negative

Table 2: Terminology in outlier evaluation (Fawcett 2005)
The true positive rate is the number of true positives divided by the number of true outliers.
The false positive rate is the number of false positives divided by the number of actual
non-outliers. The precision is the number of true positives divided by the total number of
positives. (Fawcett 2005)
However, to get an accurate idea of the effectiveness of an outlier detection algorithm, the
values of these accuracy metrics for a single test run are not enough (Fawcett 2005). A way to
solve this problem is by plotting an ROC curve.
ROC curves are 2D plots of the true positive rate against the false positive rate. In order to fill
the curve for all numbers of positives, the algorithm in question needs to be run many times
while varying a parameter that controls the number of declared positives, such as the outlier
score threshold or the number of points returned as outliers.
A good ROC curve is a straight line where the true positive rate is always one and the false
positive rate is always zero.
The worst possible ROC curve is a straight diagonal line where the true positive rate equals
the false positive rate. This is worse than a flat line at a true positive rate of zero, because a
consistently wrong algorithm at least carries the correct information in mislabelled form,
whereas the diagonal carries no information at all.
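As a sketch of how such a curve can be traced in practice, the following sweeps a threshold down a ranked list of outlier scores, emitting one (false positive rate, true positive rate) point per step; the class and method names are illustrative:

    import java.util.Arrays;

    // Sketch: trace an ROC curve by sweeping a threshold over outlier scores.
    // scores[i] is point i's outlier score; isTrueOutlier[i] marks the "true" outliers.
    public final class RocSketch {
        static double[][] rocPoints(double[] scores, boolean[] isTrueOutlier) {
            int n = scores.length;
            Integer[] order = new Integer[n];
            for (int i = 0; i < n; i++) order[i] = i;
            // Rank by score, most outlying first.
            Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
            int positives = 0;
            for (boolean b : isTrueOutlier) if (b) positives++;
            int negatives = n - positives;

            double[][] points = new double[n + 1][2];
            int tp = 0, fp = 0;
            for (int i = 0; i < n; i++) {
                if (isTrueOutlier[order[i]]) tp++; else fp++;
                points[i + 1][0] = (double) fp / negatives; // false positive rate
                points[i + 1][1] = (double) tp / positives; // true positive rate
            }
            return points; // starts at (0, 0), ends at (1, 1)
        }
    }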
2.4. Summary
We have surveyed the literature relevant to the project.
3. Research Design
3.1. Computational Tools
The test platform is as follows:
1. Intel Pentium IV 2.8GHz, 1 GiB random access memory, 500GB hard disk space.
2. Windows XP
3. Java Development Kit 1.6
4. Weka data mining software (Witten & Frank 2005)
3.2. Methodology
The research project will follow a positivist quantitative methodology. The focus of this
methodology will be the controlled, quantitative comparison of the algorithms identified in
the literature review.
3.2.1. Steps
1. Collect data
This research project will use public datasets from the UCI Machine Learning
Repository (Asuncion et al. 2007). First, the data sets will be downloaded. This has
already been done.
2. Obtain implementations of algorithms
The algorithms for which implementations are available will be collected for use. The
algorithms for which no implementation is available will be implemented as part of
this project. The preferred programming platform is Java Standard Edition, using the
Weka API for data representation and utility functions. Algorithms implemented on
other platforms will be acceptable if their input and output data can be transmitted to
and from the preferred platform.
3. Test the algorithms
The algorithms will be tested and the following values collected for each (algorithm,
dataset) pair:
1. True positive rate
2. False positive rate
3. ROC curve
4. Area under ROC curve (AUC)
Some algorithms may reveal important information from more extensive testing, for
example, an exploration of weaknesses and strengths via carefully prepared datasets.
If this information seems important, and time permits, further testing will be done on
some algorithms.
4. Analysis
Using the results of the tests, an assessment of the algorithms will be made. If it is
straightforward, a ranking of the algorithms by quality will be found. Any relative
strengths or weaknesses of the algorithms should be found. The concepts in the
algorithms will be assessed for relevance based on the quality of the algorithms in
which they are applied.
5. Create modified algorithm
Based on the analysis, one or more improvements to the existing techniques may
become apparent. If this occurs, and time permits, a new algorithm will be devised
which demonstrates the improvements.
6. Test modified algorithm
The improved algorithm will then be evaluated on the same basis as the original
subspace outlier detection algorithms. The modifications should be confirmed to be
improvements here.
3.3. Expected Outcomes
The output of this research project will be a written minor thesis.
3.3.1. Known Outcomes
The expectation is:
1. Survey and Analysis of Existing Methods:
This research project will result in a clear assessment of the existing dedicated
subspace outlier detection techniques that
1. States the strengths and weaknesses of the algorithms in comparison with each
other;
2. States the strengths and weaknesses of the algorithms in comparison with the
benchmark algorithms, DB-outlier, and LOF;
3. States all the differences in the set of situations for which each algorithm is
suitable.
2. This research project will state the importance of the various concepts used in
subspace outlier detection, using the measures of relevance and applicability.
3.3.2. Further Outcomes
A possible outcome is that:
1. New or Modified Method:
This research project might result in the creation of a new subspace outlier detection
algorithm that combines ideas from the existing techniques.
4. Timeline
The timeline for the research project is summarised as follows:
Task | Duration | Start date | Expected finish date | Comments
1. Data collection | 1 week | 24/May | 30/May | Done
2. Research proposal | 4 weeks | 2/Jun | 28/Jun | Done
3. Implementation procurement | 4 weeks | 5/Jul | 25/Jul |
4. Testing | 2 weeks | 2/Aug | 15/Aug |
5. Analysis | 2 weeks | 16/Aug | 29/Aug |
6. Modified algorithm creation | 3 weeks | 29/Aug | 19/Sep |
7. Modified algorithm testing | 2 weeks | 20/Sep | 3/Oct |
8. Thesis writing (dedicated) | 3 weeks | 4/Oct | 26/Oct | parts of the thesis may be written earlier
The workload assumed is 20 hours/week.
This timetable may be adjusted during the project.
5. Summary
In conclusion, the problem of subspace outlier detection has received attention, but more
work needs to be done before it can be considered solved. This research project will examine
recent algorithms for the problem, and attempt to identify the most effective algorithms
and the most important concepts for subspace outlier detection.
6. References
Aggarwal, CC & Yu, PS 2001, 'Outlier detection for high dimensional data', SIGMOD Rec.,
vol. 30, no. 2, pp. 37-46.
Asuncion, A & Newman, DJ 2007, UCI Machine Learning Repository, University of
California, Irvine, School of Information and Computer Sciences,
<http://www.ics.uci.edu/~mlearn/MLRepository.html>.
Beyer, K, Goldstein, J, Ramakrishnan, R & Shaft, U 1998, 'When Is "Nearest Neighbour"
Meaningful?', pp. 217-235.
Breunig, MM, Kriegel, H-P, Ng, RT & Sander, J 2000, 'LOF: Identifying Density-Based
Local Outliers', paper presented at the Proceedings of the 2000 ACM SIGMOD International
Conference on Management of Data.
Chandola, V, Banerjee, A & Kumar, V 2007, 'Anomaly Detection: A Survey', Technical
Report TR 07-017.
Fawcett, T, 'An introduction to ROC analysis', Pattern Recognition Letters, vol. 27, no. 8,
pp. 861-874.
Hawkins, DM 1980, Identification of Outliers, Chapman and Hall, London, New York.
Hodge, V & Austin, J 2004, 'A Survey of Outlier Detection Methodologies', Artificial
Intelligence Review, vol. 22, no. 2, pp. 85-126.
Knorr, EM & Ng, RT 1998, 'Algorithms for Mining Distance-Based Outliers in Large
Datasets', paper presented at the 24th VLDB Conference, New York, USA.
Kriegel, H-P, Kröger, P, Schubert, E & Zimek, A 2009, 'Outlier Detection in Axis-Parallel
Subspaces of High Dimensional Data', in Advances in Knowledge Discovery and Data
Mining, pp. 831-838.
Lazarevic, A & Kumar, V 2005, 'Feature Bagging for Outlier Detection', in Knowledge and
Data Discovery, pp. 157-166.
Leng, J, Li, J & Fu, AW-C 2009, Exploring Most Interesting Subspaces for Effective Top N
Outlier Detection, Edith Cowan University, University of South Australia & The Chinese
University of Hong Kong, pp. 1-9.
Witten, IH & Frank, E 2005, Data Mining: Practical machine learning tools and techniques,
2nd edn, Morgan Kaufmann, San Francisco.
7. Bibliography
Achtert, E, Kriegel, H-P & Zimek, A 2008, 'ELKI: A Software System for Evaluation of
Subspace Clustering Algorithms', in Scientific and Statistical Database Management, pp.
580-585.
Aggarwal, C, Hinneburg, A & Keim, D 2001, 'On the Surprising Behavior of Distance
Metrics in High Dimensional Space', in Database Theory — ICDT 2001, pp. 420-434.
Aggarwal, CC 2002, Hierarchical subspace sampling: a unified framework for high
dimensional data reduction, selectivity estimation and nearest neighbor search, ACM,
Madison, Wisconsin.
Aggarwal, CC 2001, A human-computer cooperative system for effective high dimensional
clustering, ACM, San Francisco, California.
Aggarwal, CC 2005, On k-anonymity and the curse of dimensionality, VLDB Endowment,
Trondheim, Norway.
Aggarwal, CC 2001, 'Re-designing distance functions and distance-based applications for
high dimensional data', SIGMOD Rec., vol. 30, no. 1, pp. 13-18.
Aggarwal, CC 2003, Towards systematic design of distance functions for data mining
applications, ACM, Washington, D.C.
Aggarwal, CC & Yu, PS 2000, Finding generalized projected clusters in high dimensional
spaces, ACM, Dallas, Texas, United States.
Aggarwal, CC & Yu, PS 2001, 'Outlier detection for high dimensional data', SIGMOD Rec.,
vol. 30, no. 2, pp. 37-46.
Agovic, A, Banerjee, A, Ganguly, A & Protopopescu, V 2007, Anomaly Detection in
Transportation Corridors using Manifold Embedding, ACM, San Jose, California, USA.
Ahmed, T, Oreshkin, B & Coates, M 2007, Machine learning approaches to network
anomaly detection, USENIX Association, Cambridge, MA.
Asuncion, A & Newman, DJ 2007, UCI Machine Learning Repository, University of
California, Irvine, School of Information and Computer Sciences,
<http://www.ics.uci.edu/~mlearn/MLRepository.html>.
Bay, SD & Schwabacher, M 2003, Mining distance-based outliers in near linear time with
randomization and a simple pruning rule, ACM, Washington, D.C.
Bellman, R & Kalaba, R 1959, On adaptive control processes.
Bellman, R & Lee, E 1984, 'History and development of dynamic programming', Control
Systems Magazine, IEEE, vol. 4, no. 4, pp. 24-28.
Beyer, K, Goldstein, J, Ramakrishnan, R & Shaft, U 1998, 'When Is "Nearest Neighbour"
Meaningful?', pp. 217-235.
Breunig, MM, Kriegel, H-P, Ng, RT & Sander, J 2000, 'LOF: Identifying Density-Based
Local Outliers', paper presented at the Proceedings of the 2000 ACM SIGMOD International
Conference on Management of Data.
Chaloner, K & Brant, R 1988, 'A Bayesian Approach to Outlier Detection and Residual
Analysis', Biometrika, vol. 75, no. 4, pp. 651-659.
Chan, PK, Mahoney, MV & Arshad, MH 2003, A Machine Learning Approach to Anomaly
Detection, Florida Institute of Technology.
Chandola, V, Banerjee, A & Kumar, V 2007, 'Anomaly Detection: A Survey', Technical
Report TR 07-017.
Cheng, C-H, Fu, AW & Zhang, Y 1999, Entropy-based subspace clustering for mining
numerical data, ACM, San Diego, California, United States.
Eskin, E, Arnold, A, Prerau, M, Portnoy, L & Stolfo, S 2002, 'A geometric framework for
unsupervised anomaly detection: Detecting intrusions in unlabeled data', Data Mining for
Security Applications.
García Adeva, JJ & Pikatza Atxa, JM 2007, 'Intrusion detection in web applications using
text mining', Engineering Applications of Artificial Intelligence, vol. 20, no. 4, pp. 555-566.
Hangal, S & Lam, MS 2002, Tracking down software bugs using automatic anomaly
detection, ACM, Orlando, Florida.
Hartigan, JA & Wong, MA 1979, 'Algorithm AS 136: A K-Means Clustering Algorithm',
Applied Statistics, vol. 28, no. 1, pp. 100-108.
Hawkins, DM 1980, Identification of Outliers, Chapman and Hall, London, New York.
He, Z, Deng, S & Xu, X 2005, 'An Optimization Model for Outlier Detection in Categorical
Data', in Advances in Intelligent Computing, pp. 400-409.
Hinneburg, A, Aggarwal, CC & Keim, DA 2000, What Is the Nearest Neighbor in High
Dimensional Spaces?, Morgan Kaufmann Publishers Inc.
Hodge, V & Austin, J 2004, 'A Survey of Outlier Detection Methodologies', Artificial
Intelligence Review, vol. 22, no. 2, pp. 85-126.
Indyk, P & Motwani, R 1998, Approximate nearest neighbors: towards removing the curse of
dimensionality, ACM, Dallas, Texas, United States.
Jin, W, Tung, AKH & Han, J 2001, Mining top-n local outliers in large databases, ACM,
San Francisco, California.
Joksimovic, GM & Penman, J 2000, 'The detection of inter-turn short circuits in the stator
windings of operating motors', Industrial Electronics, IEEE Transactions on, vol. 47, no. 5,
pp. 1078-1084.
Jolliffe, IT 1986, Principal component analysis, Springer, Berlin.
Kearns, MJ 1990, Computational Complexity of Machine Learning, MIT Press.
Knorr, EM & Ng, RT 1998, 'Algorithms for Mining Distance-Based Outliers in Large
Datasets', paper presented at the 24th VLDB Conference, New York, USA.
Kollios, G, Gunopulos, D, Koudas, N & Berchtold, S 2003, 'Efficient biased sampling for
approximate clustering and outlier detection in large data sets', Knowledge and Data
Engineering, IEEE Transactions on, vol. 15, no. 5, pp. 1170-1187.
Kriegel, H-P, Kröger, P, Schubert, E & Zimek, A 2009, 'Outlier Detection in Axis-Parallel
Subspaces of High Dimensional Data', in Advances in Knowledge Discovery and Data
Mining, pp. 831-838.
Kruegel, C & Vigna, G 2003, Anomaly detection of web-based attacks, ACM, Washington
D.C., USA.
Leng, J, Li, J & Fu, AW-C 2009, Exploring Most Interesting Subspaces for Effective Top N
Outlier Detection, Edith Cowan University, University of South Australia & The Chinese
University of Hong Kong, pp. 1-9.
Li, X & Han, J 2007, Mining approximate top-k subspace anomalies in multi-dimensional
time-series data, VLDB Endowment, Vienna, Austria.
Liu, J & Chen, D-S 2009, 'Fault Detection and Identification Using Modified Bayesian
Classification on PCA Subspace', Industrial & Engineering Chemistry Research, vol. 48, no.
6, pp. 3059-3077.
Moonesinghe, HDK & Tan, P-N 2006, 'Outlier Detection using Random Walks', paper
presented at the Proceedings of the 18th IEEE International Conference on Tools with
Artificial Intelligence.
Moore, B 1981, 'Principal component analysis in linear systems: Controllability,
observability, and model reduction', Automatic Control, IEEE Transactions on, vol. 26, no. 1,
pp. 17-32.
Parsons, L, Haque, E & Liu, H 2004, 'Evaluating Subspace Clustering Algorithms', paper
presented at the Workshop on Clustering High Dimensional Data and its Applications, SIAM
International Conference on Data Mining (SDM 2004).
Parsons, L, Haque, E & Liu, H 2004, 'Subspace clustering for high dimensional data: a
review', SIGKDD Explor. Newsl., vol. 6, no. 1, pp. 90-105.
Prastawa, M, Bullitt, E, Ho, S & Gerig, G 2004, 'A brain tumor segmentation framework
based on outlier detection', Medical Image Analysis, vol. 8, no. 3, pp. 275-283.
Provost, F & Fawcett, T 2001, 'Robust Classification for Imprecise Environments', Machine
Learning, vol. 42, no. 3, pp. 203-231.
Riedewald, M, Agrawal, D, Abbadi, A & Korn, F 2003, 'Accessing Scientific Data: Simpler
is Better', in Advances in Spatial and Temporal Databases, pp. 214-232.
Rust, J 1997, 'Using Randomization to Break the Curse of Dimensionality', Econometrica,
vol. 65, no. 3, pp. 487-516.
Scholkopf, B, Smola, A & Muller, K-R 1998, 'Nonlinear Component Analysis as a Kernel
Eigenvalue Problem', Neural Computation, vol. 10, no. 5, p. 1299.
Smith, R, Bivens, A, Embrechts, M, Palagiri, C & Szymanski, B 2002, 'Clustering
approaches for anomaly based intrusion detection', paper presented at the Proceedings of
Intelligent Engineering Systems through Artificial Neural Networks.
Steinwart, I, Hush, D & Scovel, C 2005, 'A Classification Framework for Anomaly
Detection', J. Mach. Learn. Res., vol. 6, pp. 211-232.
Streifel, RJ, Marks II, RJ, El-Sharkawi, MA & Kerszenbaum, I 1996, 'Detection of shorted-turns
in the field winding of turbine-generator rotors using novelty detectors - development and
field test', IEEE Transactions on Energy Conversion, vol. 11, no. 2, pp. 312-317.
Tallam, RM, Sang Bin, L, Stone, GC, Kliman, GB, Jiyoon, Y, Habetler, TG & Harley, RG
2007, 'A Survey of Methods for Detection of Stator-Related Faults in Induction Machines',
Industry Applications, IEEE Transactions on, vol. 43, no. 4, pp. 920-933.
Tang, J, Chen, Z, Fu, A & Cheung, D 2002, 'Enhancing Effectiveness of Outlier Detections
for Low Density Patterns', in Advances in Knowledge Discovery and Data Mining, pp. 535-548.
Vaidya, J 2004, 'Privacy-Preserving Outlier Detection', paper presented at the Proceedings of
the Fourth IEEE International Conference on Data Mining.
Wang, Y, Tetko, IV, Hall, MA, Frank, E, Facius, A, Mayer, KFX & Mewes, HW 2005, 'Gene
selection from microarray data for cancer classification--a machine learning approach',
Computational Biology and Chemistry, vol. 29, no. 1, pp. 37-46.
Wei, L, Qian, W, Zhou, A, Jin, W & Yu, J 2003, 'HOT: Hypergraph-Based Outlier Test for
Categorical Data', in Advances in Knowledge Discovery and Data Mining, pp. 562-562.
Wenke, L & Dong, X 2001, 'Information-theoretic measures for anomaly detection', paper
presented at the Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium
on.
Witten, IH & Frank, E 2005, Data Mining: Practical machine learning tools and techniques,
2 edn, San Fransisco.
Yianilos, PN 2000, Locally lifting the curse of dimensionality for nearest neighbor search
(extended abstract), Society for Industrial and Applied Mathematics, San Francisco,
California, United States.
Zhang, K, Shi, S, Gao, H & Li, J 2007, 'Unsupervised Outlier Detection in Sensor Networks
Using Aggregation Tree', in Advanced Data Mining and Applications, pp. 158-169.