Graph preprocessing
• Introduction
• Noise removal and data enhancement problem
• Noise removal and data enhancement on binary data
• Noise removal and data enhancement on graph data
• Noise removal and data enhancement tools
• Current research problems
• Future directions
Framework for validating data cleaning techniques on binary data
• Data cleaning techniques to enhance data analysis in the presence of high noise levels
• Strengths and weaknesses of techniques
• Experimental results on several real-world data sets
• Impact on clustering and association analysis
• Conclusions and future work
Motivation & Problem Statement
• Several preprocessing techniques for real world data have
been proposed
– Each technique proposed for a single application domain
– No comparative evaluation of techniques across different
domains
– Such evaluation is essential to identify relative strengths
and weaknesses of the techniques
• Problem Statement: Given a set of n points S, and a query
point Sq, find an anomaly score for Sq with respect to S
– Points in S are assumed to be (mostly) normal
• This definition is applicable in multiple domains such as
– Flight safety
– System call intrusion detection
– Proteomics
Data cleaning techniques at the data analysis stage
• Distance-based
• Local Outlier Factor (LOF) based approaches
• Clustering-based
• HCleaner, a hyperclique-based data cleaner
Distance based Outlier Detection
Nearest Neighbour (NN) approach
• For each data point d, compute the distance dk to its k-th nearest neighbour
• Sort all data points according to the distance dk
• Outliers are the points with the largest distance dk, which therefore lie in the sparsest neighbourhoods
• Usually the data points whose distance dk is in the top n% are identified as outliers
– n is a user parameter
• Not suitable for datasets that have modes with varying density
* Knorr, Ng: Algorithms for Mining Distance-Based Outliers in Large Datasets, VLDB 1998.
** S. Ramaswamy, R. Rastogi, K. Shim: Efficient Algorithms for Mining Outliers from Large Data Sets, ACM SIGMOD Conf. on Management of Data, 2000.
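The NN approach above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the optimized algorithms from the cited papers; the function names `kth_nn_distance` and `top_fraction_outliers` and the toy data are assumptions made for this example.

```python
import math

def kth_nn_distance(points, k):
    """For each point, return the distance dk to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def top_fraction_outliers(points, k, n_percent):
    """Indices of the points whose distance dk falls in the top n_percent."""
    scores = kth_nn_distance(points, k)
    n_out = max(1, math.ceil(len(points) * n_percent / 100))
    # Sort indices by dk, largest first, and keep the top n_out.
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_out]

# Toy data (hypothetical): four points on a unit square plus one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
outliers = top_fraction_outliers(points, k=2, n_percent=20)
```

Here the isolated point at index 4 has by far the largest dk, so it is the one reported. Every point is compared with every other point, so the cost grows quadratically with the number of points.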
Nearest Neighbour Based Techniques
• Key assumption: normal points have close neighbours while
anomalies are located far from other points
• General two-step approach
– Compute the neighbourhood of each data record
– Analyze the neighbourhood to determine whether the data record is an anomaly or not
• Categories:
– Distance based methods
• Anomalies are data points most distant from other
points
– Density based methods
• Anomalies are data points in low density regions
Nearest Neighbour Based Techniques
• Advantage
– Can be used in an unsupervised or semi-supervised setting (they make no assumptions about the data distribution)
• Drawbacks
– If normal points do not have a sufficient number of neighbors, the techniques may fail
– Computationally expensive
– In high-dimensional spaces, data is sparse and the concept of similarity may no longer be meaningful. Because of the sparseness, the distances between any two data records may become quite similar, so each data record may be considered a potential outlier!
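The high-dimensional drawback can be observed with a small experiment. The sketch below is only an illustration under assumed settings (uniform random points in the unit cube, a fixed random seed, and a hypothetical helper `relative_contrast`); it measures how much the farthest point differs from the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

low_dim = relative_contrast(2)      # nearest and farthest points differ a lot
high_dim = relative_contrast(1000)  # distances concentrate around one value
```

In this setup the contrast in 1000 dimensions is much smaller than in 2 dimensions, which is why a threshold on distance separates outliers from normal points less reliably as dimensionality grows.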
Nearest Neighbour Based Techniques
• Distance based approaches
– A point O in a dataset is a DB(p, d)-outlier if at least a fraction p of the points in the data set lie at a distance greater than d from O*
• Density based approaches
– Compute local densities of particular regions and declare instances in low density regions as potential anomalies
– Approaches
o Local Outlier Factor (LOF)
* Knorr, Ng: Algorithms for Mining Distance-Based Outliers in Large Datasets, VLDB 1998.
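The DB(p, d) definition translates directly into a check on one point. The following is a naive sketch, not the algorithm from the cited paper; the function name `is_db_outlier` and the toy data are assumptions, and the fraction is taken over all points in the data set, including O itself.

```python
import math

def is_db_outlier(points, index, p, d):
    """True if at least a fraction p of the points lie farther than d from points[index]."""
    o = points[index]
    # O itself is at distance 0, so it never counts as "far".
    far = sum(1 for q in points if math.dist(o, q) > d)
    return far >= p * len(points)

# Toy data (hypothetical): four points on a unit square plus one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
isolated = is_db_outlier(points, 4, p=0.75, d=5.0)   # 4 of 5 points are farther than 5
clustered = is_db_outlier(points, 0, p=0.75, d=5.0)  # only 1 of 5 points is farther than 5
```

Checking a single point scans the whole data set, so testing every point this way costs O(n^2) distance computations.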
Distance based approaches
o Complexity of this algorithm is O(n^2)
o Nearest neighbor sets have to be constructed
o It has problems when a data set has regions of varying density
Local Outlier Factor (LOF)*
• For each data point q, compute the distance to its k-th nearest neighbor (k-distance), with k = MinPts
• Compute the reachability distance (reach-dist) of each data example q with respect to data example p as:
$$\text{reach-dist}(q, p) = \max\{\text{k-distance}(p),\ d(q, p)\}$$
• Compute the local reachability density (lrd) of data example q as the inverse of the average reachability distance based on the MinPts nearest neighbors of data example q, where p ranges over those neighbors:
$$\text{lrd}(q) = \frac{MinPts}{\sum_{p} \text{reach-dist}_{MinPts}(q, p)}$$
• Compute LOF(q) as the ratio of the average local reachability density of q's k-nearest neighbors to the local reachability density of the data record q:
$$\text{LOF}(q) = \frac{1}{MinPts} \cdot \sum_{p} \frac{\text{lrd}(p)}{\text{lrd}(q)}$$
* Breunig et al.: LOF: Identifying Density-Based Local Outliers, ACM SIGMOD 2000.
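The LOF steps above can be followed one by one in a brute-force sketch. This is an illustration under simplifying assumptions, not the reference implementation: every point gets exactly MinPts neighbours (ties broken by index), no two points coincide, and the names `knn` and `lof_scores` are invented for this example.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbours of point i (ties broken by index)."""
    order = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )
    return order[:k]

def lof_scores(points, min_pts):
    n = len(points)
    neigh = [knn(points, i, min_pts) for i in range(n)]
    # k-distance(p): distance from p to its MinPts-th nearest neighbour.
    k_dist = [math.dist(points[i], points[neigh[i][-1]]) for i in range(n)]

    def reach_dist(q, p):
        # reach-dist(q, p) = max{k-distance(p), d(q, p)}
        return max(k_dist[p], math.dist(points[q], points[p]))

    # lrd(q) = MinPts / sum of reach-dist(q, p) over the neighbours p of q.
    lrd = [min_pts / sum(reach_dist(q, p) for p in neigh[q]) for q in range(n)]
    # LOF(q) = (1 / MinPts) * sum of lrd(p) / lrd(q) over the neighbours p of q.
    return [sum(lrd[p] for p in neigh[q]) / (min_pts * lrd[q]) for q in range(n)]

# Toy data (hypothetical): four points on a unit square plus one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = lof_scores(points, min_pts=2)
```

On this toy data the four square corners have an LOF of 1, meaning their density matches that of their neighbours, while the isolated point scores far above 1.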
Advantages of Density based Techniques
• Local Outlier Factor (LOF) approach
• Example:
[Figure: points p1, p2 and p3, with the distance from p2 to its nearest neighbor and the distance from p3 to its nearest neighbor marked]
– In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers
– The NN approach may consider p3 an outlier, but the LOF approach does not