* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Graph preprocessing
Geographic information system wikipedia , lookup
Neuroinformatics wikipedia , lookup
Theoretical computer science wikipedia , lookup
Inverse problem wikipedia , lookup
Multidimensional empirical mode decomposition wikipedia , lookup
Pattern recognition wikipedia , lookup
Data analysis wikipedia , lookup
Graph preprocessing
Introduction
Noise removal and data enhancement problem
Noise removal and data enhancement on binary data
Noise removal and data enhancement on graph data
Noise removal and data enhancement tools
Current research problems
Future directions
Framework for validating data cleaning
techniques on binary data
Data cleaning techniques to enhance data
analysis in the presence of high noise levels
Strengths and weakness of techniques
Experimental results on several real-world
data sets
Impact on clustering and association analysis
Conclusions and future work
Motivation & Problem Statement
• Several preprocessing techniques for real world data have
been proposed
– Each technique proposed for a single application domain
– No comparative evaluation of techniques across different
domains
– Such evaluation is essential to identify relative strengths
and weaknesses of the techniques
• Problem Statement: Given a set of n points S, and a query
point Sq, find an anomaly score for Sq with respect to S
– Points in S are assumed to be (mostly) normal
• This definition is applicable in multiple domains such as
– Flight safety
– System call intrusion detection
– Proteomics
Data cleaning techniques at the data analysis
stage
Distance-based
Local Outlier Factor (LOF) based approaches
Clustering-based
HCleaner, is a hyper clique-based data cleaner
Distance based Outlier Detection
Nearest Neighbour (NN) approach
For each data point d compute the distance to the k-th nearest
neighbour dk
Sort all data points according to the distance dk
Outliers are points that have the largest distance dk and therefore
are located in the more sparse neighbourhoods
Usually data points that have top n% distance dk are identified as
outliers
o n – user parameter
Not suitable for datasets that have modes with varying density
* Knorr, Ng,Algorithms for Mining Distance-Based Outliers in Large Datasets, VLDB98
** S. Ramaswamy, R. Rastogi, S. Kyuseok: Efficient Algorithms for Mining Outliers from Large Data
Sets, ACM SIGMOD Conf. On Management of Data, 2000.
Nearest Neighbour Based Techniques
• Key assumption: normal points have close neighbours while
anomalies are located far from other points
• General two-step approach
– Compute neighbourhood for each data record
– Analyze the neighbourhood to determine whether data
record is anomaly or not
• Categories:
– Distance based methods
• Anomalies are data points most distant from other
points
– Density based methods
• Anomalies are data points in low density regions
Nearest Neighbour Based Techniques
• Advantage
Can be used in unsupervised or semi-supervised setting
(do not make any assumptions about data distribution)
• Drawbacks
If normal points do not have sufficient number of
neighbors the techniques may fail
Computationally expensive
In high dimensional spaces, data is sparse and the concept
of similarity may not be meaningful anymore. Due to the
sparseness, distances between any two data records may
become quite similar => Each data record may be
considered as potential outlier!
Nearest Neighbour Based Techniques
• Distance based approaches
• A point O in a dataset is an DB(p, d) outlier if at least
fraction p of the points in the data set lies greater than
distance d from the point O*
• Density based approaches
• Compute local densities of particular regions and declare
instances in low density regions as potential anomalies
• Approaches
o Local Outlier Factor (LOF)
*Knorr, Ng,Algorithms for Mining Distance-Based Outliers in Large Datasets, VLDB98
Distance based approaches
o Complexity of this algorithm is O(n2)
o Nearest neighbor sets have to be
constructed
o It has a problems when a data set has
regions of varying density
Data cleaning techniques at the data analysis
stage
Distance-based
Local Outlier Factor (LOF) based approaches
Clustering-based
HCleaner, is a hyper clique-based data cleaner
Local Outlier Factor (LOF)*
• For each data point q compute the distance to the k-th nearest neighbor (kdistance)
•Compute reachability distance (reach-dist) for each data example q with respect
to data example p as:
oreach-dist(q, p) = max{k-distance(p), d(q,p)}
•Compute local reachability density (lrd) of data example q as inverse of the
average reachabaility distance based on the MinPts nearest neighbors of data
example q
o
lrd(q) =
∑
p
MinPts
reach dist
MinPts
q,p
•Compaute LOF(q) as ratio of average local reachability density of q’s k-nearest
neighbors and local reachability density of the data record q
o
LOF(q) =
lrd p
1
⋅∑
MinPts p lrd q
* - Breunig, et al, LOF: Identifying Density-Based Local Outliers, KDD 2000.
Advantages of Density based Techniques
• Local Outlier Factor (LOF) approach
• Example:
Distance from p3 to
nearest neighbor
In the NN approach, p2 is not
considered as outlier, while
the LOF approach find both p1
and p2 as outliers
p3
Distance from p2 to
nearest neighbor
p2
p1
NN approach may consider p3
as outlier, but LOF approach
does not
Local Outlier Factor (LOF)*
• For each data point q compute the distance to the k-th nearest neighbor (kdistance)
•Compute reachability distance (reach-dist) for each data example q with respect
to data example p as:
reach-dist(q, p) = max{k-distance(p), d(q,p)}
•Compute local reachability density (lrd) of data example q as inverse of the
average reachabaility distance based on the MinPts nearest neighbors of data
example q
MinPts
lrd(q) =
reach _ dist MinPts(q, p)
p
•Compaute LOF(q) as ratio of average local reachability density of q’s k-nearest
neighbors and local reachability density of the data record q
LOF(q) =
1
lrd ( p)
MinPts p lrd (q)
* - Breunig, et al, LOF: Identifying Density-Based Local Outliers, KDD 2000.