Linear and Non Linear Dimensionality Reduction
for Distributed Knowledge Discovery
Panagis Magdalinos
Supervising Committee:
Michalis Vazirgiannis,
Emmanuel Yannakoudakis,
Yannis Kotidis
Athens University of Economics and Business
Athens, 31st of May 2010
Outline
Introduction – Motivation
Contributions
FEDRA: A Fast and Efficient Dimensionality Reduction
Algorithm
A Framework for Linear Distributed Dimensionality Reduction
Distributed Non Linear Dimensionality Reduction
A new dimensionality reduction algorithm
Large scale data mining with FEDRA
Distributed Isomap (D-Isomap)
Distributed Knowledge Discovery with the use of D-Isomap
An Extensible Suite for Dimensionality Reduction
Conclusions and Future Research Directions
Motivation
Top 10 Challenges in Data Mining1
Typical examples
Banks all around the world
World Wide Web
Network Management
More challenges are envisaged in the future
Scaling Up for High Dimensional Data and High Speed Data Streams
Distributed Data Mining
Novel distributed applications and trends
Peer-to-peer networks
Sensor networks
Ad-hoc mobile networks
Autonomic Networking
Commonality: High dimensional data in massive volumes.
1. Q.Yang and X.Wu: “10 Challenging Problems in Data Mining Research”, International Journal of Information Technology & Decision Making, Vol. 5, No. 4,
2006, 597-604
The curses of dimensionality
Curse of dimensionality
[Figure: the sample size needed to cover R^1, R^2, R^3, R^4 grows as 2^1, 2^2, 2^3, 2^4]
Empty space phenomenon
Maximum and minimum distance of a dataset tend to be equal
as dimensions grow (i.e., Dmax – Dmin ≈ 0)
Data mining becomes resource intensive
K-means and k-nn are typical examples
Solutions
Dimensionality reduction
The curse of dimensionality
MDS, PCA, SVD, FastMap, Random Projections…
Lower dimensional embeddings while enabling the subsequent addition of
new points.
Significant reduction in the number of dimensions.
We can project from 500 dimensions to 10 while retaining cluster structure.
The empty space phenomenon
Meaningful results from distance functions
k-NN classification quality almost doubles when projecting from more than
20000 dimensions to 30.
Computational requirements
Distance based algorithms are significantly accelerated.
k-Means converges in less than 40 seconds, while it initially required almost 7
minutes.
Classification
Problems
Hard problems: significant reduction required
Soft problems: milder requirements
Visualization Problems
Methods
Linear and Non Linear
Exact and Approximate
Global and Local
Data Aware and Data Oblivious
Quality Assessment
Distortion:
Provision of an upper and lower bound to the new pairwise distance.
The new distance is provided as a function of the initial distance:
(1/c1)D(a,b)≤ D’(a,b) ≤ c2D(a,b) , c1, c2 > 1
A good method minimizes the product c1·c2
Stress
Distortion alone might be misleading
Stress quantifies the distance distortion on the particular dataset at hand (a computation sketch follows this list).
Stress = √( Σ (d(Xi,Xj) − d(X'i,X'j))² / Σ d(Xi,Xj)² )
Task Related Metric
Clustering/Classification Quality
Pruning Power
Computational Cost
Visualization
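A minimal sketch of how the Stress value above can be computed for a given projection; the function name and the use of NumPy are illustrative assumptions, not part of the thesis:

```python
import numpy as np

def stress(X, X_proj):
    """Stress = sqrt( sum (d(Xi,Xj) - d(X'i,X'j))^2 / sum d(Xi,Xj)^2 )."""
    n = X.shape[0]
    num, den = 0.0, 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d_orig = np.linalg.norm(X[i] - X[j])            # distance in the original space
            d_new = np.linalg.norm(X_proj[i] - X_proj[j])   # distance in the target space
            num += (d_orig - d_new) ** 2
            den += d_orig ** 2
    return np.sqrt(num / den)
```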
Contributions
Definition of a new, global, linear, approximate dimensionality
reduction algorithm
Definition of a framework for the decentralization of any landmark
based dimensionality reduction method
Motivated by low memory requirements of landmark based algorithms
Applicable in various network topologies
Definition of the first distributed, non linear, global approximate
dimensionality reduction algorithm
Fast and Efficient Dimensionality Reduction Algorithm (FEDRA)
Combination of low time and space requirements together with high quality
results
Decentralized version of Isomap (D-Isomap)
Application on knowledge discovery from text collections
A prototype enabling the experimentation with dimensionality
reduction methods (x-SDR)
Ideal for teaching and research in academia
FEDRA: A Fast and Efficient Dimensionality
Reduction Algorithm
Based on :
•P. Magdalinos, C.Doulkeridis, M.Vazirgiannis, "FEDRA: A Fast and Efficient Dimensionality Reduction
Algorithm", In Proceedings of the SIAM International Conference on Data Mining (SDM'09), Sparks
Nevada, USA, May 2009.
•P. Magdalinos, C.Doulkeridis, M.Vazirgiannis, "Enhancing Clustering Quality through Landmark Based
Dimensionality Reduction ", Accepted with revisions in the Transactions on Knowledge Discovery from
Data, Special Issue on Large Scale Data Mining – Theory and Applications.
The general idea
Instead of trying to map the whole dataset in the new space
Extract a small fraction of data and embed it in the new space
Create the “kernel” around which the whole dataset is going to be placed
Minimize the loss of information during this first part of the process.
Project each remaining point independently by taking into account only the
initial set of sampled data.
[Figure: points P1–P4 mapped from the original space to the target space]
The formulation of this idea into a coherent algorithm resulted in the
definition of FEDRA (Fast and Efficient Dimensionality Reduction
Algorithm)
A global, linear, approximate, landmark based method
Our goal
Formulate a method which combines:
Results of high quality
Minimum space requirements
Minimum time requirements
Scalability in terms of cardinality and dimensionality
Application:
Hard dimensionality reduction problems
Projecting from 500 dimensions to 10 while retaining inter-object relations
Enabling faster convergence of k-Means
Top 10 Challenge: Scaling up for high dimensional data
The FEDRA Algorithm
Input: Projection dimensionality (k), original distances in Rn (D), distance metric (p)
Output: New dataset in Rk (P')
1. L ← select k points and populate the set of landmarks
2. L' ← project all landmarks in the target space by requiring that
   ||L'i – L'j||p = ||Li – Lj||p for 1 ≤ i,j ≤ k
3. P' ← L'
4. For each non-landmark point X:
   4.1 X' ← obtain the projection of X by requiring that ||L'i – X'||p = ||Li – X||p for 1 ≤ i ≤ k
   4.2 P' ← P' ∪ {X'}
5. Return P'
How do we select landmarks?
Does this system of
equations have a solution?
Does this simplification
come at a cost?
Does the algorithm converge?
Isn’t it time consuming?
These are the questions that we will answer in the next couple of slides
The theory underlying FEDRA
Theorem 1: A set of k+1 points, pi i=1…k+1, described only by their pairwise
distances which have been defined with the use of a Minkowski distance metric p,
can be embedded in Rk without distortion. Their coordinates can be derived in
polynomial time through the following set of equations:
if j < i−1: p'i,j is given by the single root of
  |p'i,j|^p − |p'i,j − p'j+1,j|^p + Σf=1..j−1 |p'i,f|^p − Σf=1..j−1 |p'i,f − p'j+1,f|^p + dp(pj+1,pi)^p − dp(pi,p1)^p = 0
if j = i−1: p'i,j = ( dp(pi,p1)^p − Σf=1..i−2 |p'i,f|^p )^(1/p)
p'i,j = 0 otherwise
Theorem 2: Any equation of the form f(x)=|x|p–|x-a|p–d where aЄR\{0}, dЄR,
pЄN\{0} has a single root in R.
if −1 < v = d/|a|^p < 1 the root lies in (0, a)
otherwise the root lies in (a, |v|·a)
The cost of embedding the k landmarks is ck2/2 where c is the cost of the
Newton-Raphson method (for p=2 c=1)
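As an illustration of the constant c above, a hedged sketch of locating the single root of f(x) = |x|^p − |x−a|^p − d with Newton-Raphson; the closed-form branch for p = 2 reflects why c = 1 in that case, while the starting point and tolerance are arbitrary choices:

```python
def theorem2_root(a, d, p, tol=1e-10, max_iter=100):
    """Single root of f(x) = |x|^p - |x - a|^p - d, with a != 0 and p >= 2."""
    if p == 2:
        # |x|^2 - |x - a|^2 = 2*a*x - a^2, so the root has a closed form.
        return (d + a * a) / (2.0 * a)
    sign = lambda t: 1.0 if t >= 0 else -1.0
    f = lambda x: abs(x) ** p - abs(x - a) ** p - d
    df = lambda x: (p * abs(x) ** (p - 1) * sign(x)
                    - p * abs(x - a) ** (p - 1) * sign(x - a))
    x = a / 2.0  # start between 0 and a, where the derivative is non-zero
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x
```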
Theorem 1 in practice (1/2)
[Figure: embedding P1, P2, P3 step by step in the target space]
No distortion requires that ||P'i − P'j||(p) = ||Pi − Pj||(p) for i,j = 1..4
First point is mapped as P’1 = O = {0,0,0}
Second point is mapped at P’2 = {||P2 – P1||(p),0,0}
The third point should satisfy simultaneously
||P’3 – P’1||(p)= ||P3– P1||(p)
||P’3 – P’2||(p)= ||P3– P2||(p)
The solution is the intersection of the circles
Theorem 1 in practice (2/2)
[Figure: placing P4 given the already embedded P1, P2, P3]
Fourth point should satisfy simultaneously
||P’4 – P’1||(p)= ||P4– P1||(p)
||P’4 – P’2||(p)= ||P4– P2||(p)
||P’4 – P’3||(p)= ||P4– P3||(p)
Three intersecting spheres. The intersection of two spheres is a circle.
Consequently we search for the intersection of a circle with a sphere.
Reducing Time Complexity (1/2)
Simplified through the following iterative scheme
The embedding of Xi in Rk given the embeddings of Pj , j = 1..i-1
|x'i,1|^p + |x'i,2|^p + |x'i,3|^p + … + |x'i,i-1|^p = ||P1 − Xi||^p
|x'i,1 − p'2,1|^p + |x'i,2|^p + |x'i,3|^p + … + |x'i,i-1|^p = ||P2 − Xi||^p
|x'i,1 − p'3,1|^p + |x'i,2 − p'3,2|^p + |x'i,3|^p + … + |x'i,i-1|^p = ||P3 − Xi||^p
……
|x'i,1 − p'i-1,1|^p + |x'i,2 − p'i-1,2|^p + |x'i,3 − p'i-1,3|^p + … + |x'i,i-1|^p = ||Pi-1 − Xi||^p
Note that by subtracting the second equation from the first we derive
|x’i,1|p - |x’i,1-p’2,1|p - ||P1 - Xi||p + ||P2 - Xi||p=0
The equation has a single unknown and a single root x’i,1
In general, the value of the j-th coordinate is derived by subtracting the
(j+1)-th equation from the first.
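A hedged sketch of this projection step for the Euclidean case p = 2, assuming the landmarks are already embedded in the lower-triangular form of Theorem 1 (landmark 1 at the origin, landmark j non-zero only in its first j−1 coordinates); function and variable names are illustrative:

```python
import numpy as np

def fedra_project_point(L_emb, dists):
    """Embed one point in R^k given the k embedded landmarks (rows of L_emb,
    lower triangular with L_emb[0] = 0) and the point's original distances to
    them (dists), for the Euclidean metric p = 2."""
    k = L_emb.shape[0]
    x = np.zeros(k)
    # Coordinate j-1 follows from subtracting equation j+1 from the first one.
    for j in range(1, k):
        lj = L_emb[j]
        rhs = (dists[0] ** 2 - dists[j] ** 2
               + np.sum(lj[:j] ** 2)
               - 2.0 * np.dot(lj[:j - 1], x[:j - 1]))
        x[j - 1] = rhs / (2.0 * lj[j - 1])
    # The last coordinate comes from the first equation (line / sphere intersection).
    rem = dists[0] ** 2 - np.sum(x[:k - 1] ** 2)
    x[k - 1] = np.sqrt(max(rem, 0.0))  # clipped when the triangle inequality fails
    return x
```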
Reducing Time Complexity (2/2)
By subtracting the i-th equation from the first we essentially calculate the
corresponding coordinate (i.e. a plane in R3).
The intersection of the k-1 planes corresponds to a line.
The first equation is satisfied by points P1,P2 that correspond to the intersection of
the line with the norm-sphere of R3.
[Figure: the planes X=a and Y=b intersect in a line; the line meets the sphere of the first equation at two points P1, P2]
We lower time complexity from O(ck2) to O(ck) or even O(k) when p=2
What if the intersection of the line with the sphere does not exist?
Existence of solution
Theorem 3: For any non-linear system of
equations defined by FEDRA, there always
exists at least one solution, provided that the
triangular inequality is sustained in the
original space.
No convergence
||O'A'|| + ||A'L'1|| < ||O'L'1||
Theorem 1 guarantees that
||O’A’||=||OA|| , ||A’L’1||= ||AL1||,
||O’L’1||=||OL1||
Triangular inequality is not sustained in
the original space
The FEDRA Algorithm
Input: Projection dimensionality (k), original distances in Rn (D), distance metric (p)
Output: New dataset in Rk (P')
1. L ← select k points and populate the set of landmarks
2. L' ← project all landmarks in the target space by applying Theorem 1 and its accompanying methodology
3. P' ← L'
4. For each non-landmark point X:
   4.1 X' ← obtain the projection of X by applying Theorem 1 and its accompanying methodology
   4.2 P' ← P' ∪ {X'}
5. Return P'
How do we select landmarks?
Does this system of
equations have a solution?
Yes, always!
Does this simplification
come at a cost?
Does the algorithm converge?
Yes, always!
Isn’t it time consuming?
No! In fact it is only O(k) per point!
Still some questions remain…
FEDRA requirements
FEDRA requirements in terms of time and space
Exhibits low memory requirements combined with low computational
complexity
Memory: O(k2), k: lower dimensionality
Time: O(cdk), d: number of objects, c: constant
Addition of new point : O(ck)
Achieved by relaxing the original requirements and requesting that every
projected point retains unaltered k distances to other data points
Advantageous features
Operates on similarity/dissimilarity matrix
Applicable with any Minkowski distance metric
FEDRA can provide a mapping from L^n_p to L^k_p where p ≥ 1
Distortion
[Figure: projection of two points A, B using landmarks L1, L2; Ay, By denote auxiliary projections of A and B]
Theorem 4: Using any two landmarks L1, L2, FEDRA can project any two points A, B while
guaranteeing that their new distance A'B' will be bounded according to:
AB² − 4·AAy·BBy ≤ A'B'² ≤ AB² + 4·AAy·BBy
Alternatively: A'B'² = AB² − 2·BL1·AL1·(cos(A'L'1B') − cos(AL1B))
Distortion = √( (AB² + 4·AAy·BBy) / (AB² − 4·AAy·BBy) )
For any Minkowski distance metric p:
AB^p − Δ ≤ A'B'^p ≤ AB^p + Δ, where Δ = 2·BBy·Σk=1..p (AAy + BBy)^(p−k) (AAy − BBy)^(k−1)
Does this simplification come at a cost? The distance distortion is low and upper bounded.
Landmarks selection
Based on the former analysis it can be proved that the ideal landmark
set should satisfy for any two landmarks Li, Lj and any point A, one of
the following relations:
LiA ≈ LjA – LiLj (or simply that LiLj ≈ 0 )
LjA ≈ LiLj – LiA
LiA ≈ LjA – LiLj requires the creation of a compact “kernel” where
landmarks exhibit minimum distances from each other
LjA ≈ LiLj – LiA requires that cluster centroids are chosen as the
landmarks
So if random selection is not acceptable we use a set of k
landmarks that exhibit minimum distance from each other.
How do we select landmarks?: Either randomly or heuristically
according to theory.
Ameliorating projection quality (I)
Depending on the properties of the selected landmark set, a single case of failure may arise1
[Figure: the failure case, where clusters A and B overlap after projection with landmarks L1, L2]
1. V.Athitsos, J.Alon, S.Sclaroff, G.Kollios, “BoostMap: An Embedding Method for Efficient Nearest Neighbor Retrieval”, IEEE Transactions on PAMI, Vol. 30, No. 1, January 2008
Ameliorating projection quality (II)
What if we sample an additional set of points and use it for enhancing
projection quality?
Zero distortion from the landmark points and minimum distortion
from another k points.
[Figure: with the additional sample points, clusters A and B remain separated after projection with landmarks L1, L2]
Does this simplification come at a cost? The distance distortion is low and upper
bounded. Moreover the projection of a point can be determined using the already
projected non landmark points
FEDRA Applications
The purpose of the conducted experimental evaluation process is:
Highlight the efficiency and effectiveness of FEDRA on hard dimensionality
reduction problems
Highlight FEDRA’s scaling ability and applicability in large scale data mining
Showcase the enhancement of a typical data mining task like clustering due
to the application of FEDRA
Dataset            Cardinality   n     Classes   k          Description
Ionosphere         351           34    2         3:1:7      Radar Observations
Segmentation       2100          19    7         3:1:7      Image Segmentation Data
Musk               476           166   2         3:3:15     Molecules Data
Synthetic Control  600           60    6         3:1:7      Synthetic Dataset
Alpha              500000        500   2         10:10:50   Pascal Large Scale Challenge ‘08
Beta               500000        500   2         10:10:50   Pascal Large Scale Challenge ‘08
Gamma              500000        500   2         10:10:50   Pascal Large Scale Challenge ‘08
Delta              500000        500   2         10:10:50   Pascal Large Scale Challenge ‘08
Metrics
We assess the quality of FEDRA through the following metrics
Stress
√( Σ (d(Xi,Xj) − d(X'i,X'j))² / Σ d(Xi,Xj)² )
Clustering quality maintenance defined as Quality in Rk / Quality in Rn
Clustering quality: Purity = (1/N) Σi=1..a maxj |Ci ∩ Sj| (a computation sketch follows this list)
Time requirements for each algorithm to produce the embedding
Time requirements for k-Means to converge
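A minimal sketch of the Purity computation above, assuming cluster and class memberships are given as label sequences (names are illustrative):

```python
from collections import Counter

def purity(cluster_labels, class_labels):
    """Purity = (1/N) * sum over clusters of the size of the dominant class."""
    clusters = {}
    for c, s in zip(cluster_labels, class_labels):
        clusters.setdefault(c, []).append(s)
    hits = sum(Counter(members).most_common(1)[0][1] for members in clusters.values())
    return hits / len(class_labels)
```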
We compare FEDRA with landmark based methods:
Landmark MDS
Metric Map
Vantage Objects
As well as prominent methods such as
PCA
FastMap
Random Projection
Stress evolution
Dataset: segmentation
Dataset: ionosphere
Purity evolution
Dataset: alpha
Dataset: beta
Experimental analysis indicates:
FEDRA exhibits behavior similar to landmark based approaches and slightly
ameliorates clustering quality
Time Requirements
Dataset: alpha
Dataset: beta
k-Means Convergence
296secs
324secs
Dataset: alpha
Dataset: beta
Experimental analysis indicates:
k-Means converges slower on the dataset of Vantage Objects
FEDRA reduces k-Means convergence requirements
Summary
FEDRA is a viable solution for hard dimensionality reduction
problems.
Quality of results comparable to PCA
Low time requirements, outperformed by Random Projection
Low stress values, sometimes lower than FastMap
Maintain or ameliorate original clustering quality, similar
behavior to other methods
Enables faster convergence of k-Means
Linear Distributed Dimensionality Reduction
Based on :
•P. Magdalinos, C.Doulkeridis, M.Vazirgiannis "K-Landmarks: Distributed Dimensionality Reduction for
Clustering Quality Maintenance" In Proceedings of 10th European Conference on Principles and
Practice of Knowledge Discovery in Databases (PKDD'06), Berlin, Germany, September 2006.
(Acceptance Rate (full papers) 8,8%)
•P. Magdalinos, C.Doulkeridis, M.Vazirgiannis, "Enhancing Clustering Quality through Landmark Based
Dimensionality Reduction ", Accepted with revisions in the Transactions on Knowledge Discovery from
Data, Special Issue on Large Scale Data Mining – Theory and Applications.
The general idea
All landmark based algorithms are applicable in distributed
environments
The idea is to sample landmarks from all nodes and use them to define
the original landmark set.
Then, communicate this set to all nodes.
[Figure: peers 1–7 contribute sampled points to the global landmark set, which is then communicated back to all peers]
Our goal
Formulate a method which combines:
Minimum requirements in terms of network resources
Immunity to subsequent alterations of the dataset
Adaptability to network changes
Top 10 Challenge: Distributed Data Mining
Application
Hard dimensionality reduction problems
Projecting from 500 dimensions to 10 while retaining inter-object relations
Reduction of network resources consumption
State of the art:
Distributed PCA
Distributed FastMap
Requirements and Candidates
Requirements:
There exists some kind of network organization scheme
Physical topology
Self-Organization
Each algorithm is composed of two parts
A centrally executed part
A decentralized part
Ideal Candidate: Any landmark based dimensionality reduction
algorithm
Landmark selection process
Aggregation of landmarks in a central location
Derivation of the projection operator
Communication of the operator to all nodes
Projection of each point independently
Distributed FEDRA
Applying the landmark based paradigm in a network environment
Select landmarks at peer level
Communicate all landmarks to aggregator
O(nk) network load
Project landmarks and communicate the results
O(nkM +Mk2) network load
Each peer projects each point independently
Assuming a fixed number of |L| landmarks then network requirements are
upper bounded for each algorithm
O(n|L|M+M|L|k)
Landmark based algorithms are less demanding than distributed PCA
Distributed PCA: O(Mn2 + nkM)
As long as |L| < n
Selecting the landmark points
Each peer may select:
k points from the local dataset
Select k local points (randomly or heuristically)
Transmit them to the aggregator
The aggregator receives Mk points from all peers and selects the landmark
set.
Network load is O(Mkn + Mk2)
k/M points from the local dataset
This implies that the aggregator will inform the peers about the size of the
network
The landmark selection happens only once in the lifetime of the network;
arrivals and departures will have no effect.
Network load is O(kn + Mk2)
Zero points from the local set
The aggregator selects from the local dataset k landmarks
Network load is O(Mk2)
Application
Datasets from the Pascal Large Scale Challenge 2008
500-node network with random connections between elements
Nodes are connected with 5% probability
Distributed K-Means (P2P-Kmeans1) approach in order to assess the
quality of the produced embedding
Dataset   Cardinality   n     Classes   k          Description
Alpha     500000        500   2         10:10:50   Pascal Large Scale Challenge ‘08
Beta      500000        500   2         10:10:50   Pascal Large Scale Challenge ‘08
Gamma     500000        500   2         10:10:50   Pascal Large Scale Challenge ‘08
Delta     500000        500   2         10:10:50   Pascal Large Scale Challenge ‘08
1. S.Datta, C.Giannella, H.Kargupta: Approximate Distributed K-means clustering over a P2P network. IEEE TKDE 2009, vol 21, no10, 10/2009
Dataset: alpha
Dataset: gamma
Dataset: beta
Dataset: delta
Network Requirements
Random Projection deviates from the framework
Random Projection: The aggregator identifies the projection matrix
Distributed clustering induces a network cost of more than 10GB
Hard dimensionality reduction preprocessing (requiring at most 200MB) reduces
the cost to roughly 1GB.
Summary
Landmark based dimensionality reduction algorithms provide a viable
solution to distributed dimensionality reduction pre-processing
High quality results
Low network requirements
No special requirements in terms of network organization
Adaptability to potential failures
Results obtained in a network of 500 peers
Dimensionality reduction preprocessing and subsequent P2P-Kmeans
application necessitates only 12% of the original P2P-Kmeans load
Clustering quality remains the same or is slightly ameliorated
Distributed FEDRA
Low network requirements combined with high quality results
Distributed Non Linear Dimensionality Reduction
Based on :
•P.Magdalinos, M.Vazirgiannis, D.Valsamou, "Distributed Knowledge Discovery with Non Linear
Dimensionality Reduction", To appear in the Proceedings of the 14th Pacific-Asia Conference on
Knowledge Discovery and Data Mining (PAKDD'10), Hyderabad, India, June 2010. (Acceptance Rate
(full paper) 10,2%)
•P. Magdalinos, G.Tsatsaronis, M.Vazirgiannis, “Distributed Text Mining based on Non Linear
Dimensionality Reduction", Submitted to European Conference on Machine Learning and Principles
and Practice of Knowledge Discovery in Databases (ECML-PKDD 2010), Currently under review.
Our goal
Top 10 Challenges: Distributed data mining of high dimensional data
Vector Space Model:
Each word defines an axis; each document is a vector residing in a high
dimensional space
Numerous methods that try to project data in a low dimensional space while
assuming linear dependence between variables.
However latest experimental results show that this assumption is incorrect
Application
Scaling Up for High Dimensional Data
Distributed Data Mining
Hard dimensionality reduction and visualization problems
Unfolding a manifold distributed across a network of peers
Mining information from distributed text collections
State of the art:
None!
The general idea
The idea is to replicate the original Isomap algorithm in a highly distributed
environment and still get results of equal quality.
Isomap
Distributed Isomap: A three phased approach:
[Figure: the peers first run distributed NN and SP algorithms, then each peer approximates the Multidimensional Scaling step locally]
Indexing and k-NN retrieval (1/4)
Which LSH family to employ1?
Since we use the Euclidean distance we should use a Euclidean
distance preserving mapping
h(x) = floor((x·r + b)/w)
where x is the data point, r is a 1×n random vector, w ∈ N and b ∈ [0, w)
This family of functions guarantees that the probability of collision
decreases with the points' original distance.
Given f hash functions, for each table we obtain an f-dimensional hash
vector, e.g. (1, 5, …, 7) or (2, 4, …, 1), as sketched below.
1. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51(1) (2008)
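A hedged sketch of one such hash table with f functions of the form h(x) = floor((x·r + b)/w); the parameter choices and helper names are illustrative assumptions:

```python
import numpy as np

def make_hash_table(n_dims, f, w, seed=0):
    """Build f Euclidean-LSH functions h_i(x) = floor((x . r_i + b_i) / w) for one table."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((f, n_dims))   # one random vector r_i per function
    b = rng.uniform(0.0, w, size=f)        # offsets b_i drawn from [0, w)
    def hash_vector(x):
        # Returns the f-dimensional hash vector of point x.
        return np.floor((R @ x + b) / w).astype(int)
    return hash_vector
```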
Indexing and k-NN retrieval (2/4)
Indexing and guaranteeing load balancing
Consider the norm-1 of the produced vector, ∑i=1f|hi(x)|
The values are generated from the normal distribution
N(f/2,fμ||x||/w)1
Consider 2 standard deviations and split the range into M cells
For a given hash vector v, the peer that will index it is:
peerid = (M(||v||1-μl1+2σl1)/4σl1)modM
[Illustration: each hash vector is mapped by its l1 norm, l1 = Σ|vi|, onto the range [μl1 − 2σl1, μl1 + 2σl1], which is split into M cells, one per peer]
1. Haghani, P., Michel, S., Aberer, K.: Distributed similarity search in high dimensions using locality sensitive hashing. ACM EDBT pp. 744--755 (2009)
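A sketch of the peer assignment rule above, assuming the mean and standard deviation of the l1 norm are available (e.g. estimated as in the cited work); the helper name is illustrative:

```python
def index_peer(hash_vec, mu_l1, sigma_l1, M):
    """Map a hash vector to one of M peers via its l1 norm,
    splitting [mu_l1 - 2*sigma_l1, mu_l1 + 2*sigma_l1] into M cells."""
    l1 = sum(abs(int(h)) for h in hash_vec)
    cell = M * (l1 - mu_l1 + 2 * sigma_l1) / (4 * sigma_l1)
    return int(cell) % M
```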
Indexing and k-NN retrieval (3/4)
How to effectively and efficiently search for the kNN of each point?
Baseline: For each local point di
For each table T
Find the peer that indexes it
Retrieve all points from corresponding bucket
Retrieve actual points
Calculate actual distances, rank them and retain k-NNs
What if we could somehow identify a range and upper bound the difference of
δ=| ||h(x)||1- ||h(y)||1 |?
Theorem 5: Given f hash functions hi(x) = floor((ri·xT + bi)/w) where ri is a 1×n
random vector, w∈N, bi∈[0, w), i = 1...f, the difference δ of the l1 norms of the
projections xf, yf of two points x, y∈Rn is upper bounded by (||A||·||x − y||)/w,
where A = Σi=1..f |ri| and ||x − y|| is the points' Euclidean distance.
Although the bound is rather large, it still reduces the required number of
messages
Indexing and k-NN retrieval (4/4)
[Figure: the indexing phase; each point V is hashed, peerid = f(hash(V)), and (||hash(V)||1, point id) is sent to the indexing peer's hash table]
Indexing cost: Messages O(dT), Time O(diTfn), Memory O(fn)
[Figure: the k-NN retrieval phase; candidate point ids are retrieved from the indexing peers and the actual points are fetched with request-reply messages]
k-NN retrieval cost: Messages O(cskd), Time O(cskdi), Memory O(cskn)
Geodesic Distances (1/2)
At this step, each peer has identified the NN graphs of its points G
(G = ∪i=1..|Di| Gi)
[Figure: each peer holds the rows of the global NN-graph adjacency matrix that correspond to its own points]
The target is to identify the SPs from each point to the rest of the
dataset
Use best practices from computer networking
Distance Vector Routing or Distributed Bellman Ford
Assume that each point is a network node and each calculated
distance a link between the corresponding points/nodes
From a node’s perspective, DVR replicates a ranged search, starting
with one link and progressively augmenting it by 1
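A minimal single-process sketch of the Bellman-Ford relaxation that DVR distributes, run here on the union of the peers' kNN edges; the graph representation is an assumption:

```python
def bellman_ford(n, edges, source):
    """Shortest (geodesic) distances from `source` over the symmetrised kNN graph.
    edges: list of (u, v, weight) tuples with node ids in 0..n-1."""
    INF = float("inf")
    dist = [INF] * n
    dist[source] = 0.0
    for _ in range(n - 1):                  # at most n-1 relaxation rounds
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
            if dist[v] + w < dist[u]:       # treat edges as undirected
                dist[u] = dist[v] + w
                changed = True
        if not changed:
            break
    return dist
```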
Geodesic Distances (2/2)
Start at node 1
Discover paths 1 hop away, then 2 hops away, then 3 hops away
[Figure: peers 1–4 are progressively reached; peer 5 will never be reached]
If the graph is not connected some distances remain ∞; substituting 5·max(distance) for them makes the graph connected.
Messages: O(kNN·M·d²), Space: O(di·d), Time: O(M)
Multidimensional Scaling
At this step, each peer has a fraction of the global matrix.
[Figure: each peer holds only the rows of the global geodesic distance matrix that correspond to its own points]
Instead of calculating the MDS approximate it!
Employ landmark based dimensionality reduction algorithms and
Derive the embedding
Approximate the whole dataset at peer level!
All of this with zero network load!
What if the landmarks are not enough?
Employ the approach of distributed FEDRA
Network requirements: O(knM)
Reducing Messages
Since we will work only with a small number of landmarks, why not
calculate only their shortest paths?
A node is randomly selected and initiates the SP process:
Selects the required number of landmarks (i.e., a)
Initiates the SP algorithm: O(adkNNM) messages
Communicates the results to all nodes: O(Ma) messages
All nodes execute the landmark based DR algorithm locally
Network cost:
Base approach O(kNNMd²)
Landmark based approach O(adkNNM + Md)
Landmark based approach is always cheaper.
D-Isomap in total
Messages: O(csdk + dT + adkNNM)
Time: O(cskdi + M) + CDLDR
Space: O(cskdi + did) + CDLDR
Adding or Deleting points
Addition of points:
Hashing and Identification of kNNs
Calculation of geodesic distances from landmarks using local information
Low dimensional projection using FEDRA, LMDS or Vantage Objects
Network Cost: O(cskNN), Time: O(cskNN)+ CDLDR, Memory: O(n+kNN) + CDLDR
Deletion of points:
Inform indexing peer that the point is deleted
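A sketch of the landmark-distance update for an arriving point, following the min{y + a, z + b} rule illustrated in the figure below; names and the dictionary layout are illustrative:

```python
def geodesics_for_new_point(landmark_rows, nn_dists):
    """landmark_rows: {landmark: {point: geodesic distance}} already computed locally.
    nn_dists: {point: distance} from the new point to its nearest neighbours.
    Returns the approximate geodesic distance from each landmark to the new point."""
    return {L: min(d + row[X] for X, d in nn_dists.items())
            for L, row in landmark_rows.items()}
```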
[Figure: a new point X4 arrives with nearest neighbours X1 (distance y) and X2 (distance z); in the local landmark-distance matrix each landmark row Li gains the entry min{y + d(Li, X1), z + d(Li, X2)}, and the embedding is then updated]
Experimental Evaluation
The purpose of the conducted experimental evaluation process is:
Validate the non linear nature of D-Isomap on well known manifolds
Highlight D-Isomap’s applicability in distributed knowledge discovery
experiments
Compare D-Isomap’s performance against state of the art, centralized
methods for unsupervised clustering and classification of document
collections.
Dataset        Cardinality   n        Classes   k            peers        Description
Swiss Roll     3000          3        ---       2            10:5:30      Swiss Roll dataset
Helix          3000          3        ---       2            10:5:30      Helix dataset
3D Clusters    3000          3        ---       2            10:5:30      Artificial 5-cluster dataset
Reuters        12216         21454    117       10:5:30      100:25:200   Reuters text collection
20 Newsgroup   18846         130080   20        100:25:200   100:25:200   20 NS text collection
Non linear manifolds (1/3)
What we expect to see:
Input:
Output:
Non linear manifolds (2/3)
D-Isomap with LMDS
D-Isomap with FEDRA (p=2)
D-Isomap with FEDRA (p=3)
D-Isomap with LMDS
D-Isomap with FEDRA (p=2)
D-Isomap with LMDS
Non linear manifolds (3/3)
Network Requirements (MBs)
Network composed of 30 peers
Actual size of dataset: 60KB
kNN   Full SP   Full SP, Bound   Partial SP   Partial SP, Bound
6     14.584    14.454           0.251        0.131
8     14.584    14.454           0.251        0.131
10    14.584    14.454           0.251        0.132
12    14.584    14.454           0.251        0.132
14    14.584    14.454           0.251        0.132
Theorem 5 reduces network requirements but is influenced by the range bound.
Not connected graph! Distance substitution did not work; D-Isomap failed for kNN=2.
Not connected graph, but distance substitution works. Larger values of kNN reduce network requirements.
kNN   Full SP   Full SP, Bound   Partial SP   Partial SP, Bound
2     0.29      0.29             0.23         0.23
3     44.70     44.69            0.53         0.53
4     44.53     44.45            0.53         0.53
5     44.16     42.19            0.53         0.53
6     42.84     42.03            0.52         0.53
kNN   Full SP   Full SP, Bound   Partial SP   Partial SP, Bound
6     39.14     39.92            0.50         0.49
8     34.52     34.41            0.47         0.46
10    31.51     31.53            0.45         0.46
12    29.46     29.49            0.45         0.44
14    28.00     28.10            0.45         0.43
Text Mining with D-Isomap
We compare D-Isomap with
LSI
LSK (kernel LSI)
LPI (a hybrid of kernel LSI and Spectral Clustering)
We assume:
100:25:200 peers connected in Chord-style ring
kNN = 6:2:14 for LPI and D-Isomap and cs=5 for kNN retrieval
Documents are represented as vectors using Term-Frequency
Vectors are not normalized to unit norm.
Algorithms:
k-Means
k-NN (NN=7)
Metrics:
Quality maintenance defined as F-measure in Rk/ F-measure in Rn
F-measure:= 2*precision*recall/(precision+recall)
Network Load
Obtained results
Reuters using kNN=14 for D-Isomap and LPI
Classification with k-NN (using 7NNs)
Clustering with k-Means
20-Newsgroup using kNN=14 for D-Isomap and LPI
Classification with k-NN (using 7NNs)
Clustering with k-Means
Network Requirements
The main disadvantage:
Network load of 4.5-6.5GB on Reuters (20-60MBs per node)
Network load of 3.8-6GB on 20-Newsgroup (17-60MBs per node)
Incurred only once in the lifetime of the network
Network load is minimized as kNN values grow larger
Graph diameter is reduced
Summary
Distributed Isomap:
The first, distributed, non linear dimensionality reduction algorithm
Manages to reveal the underlying linear nature of highly non linear manifolds
Enhances the classification ability of k-NN
Manages to approximately reconstruct the original dataset on a single peer
node
Results obtained in a network of 200 peers
Experimental validation of the curse of dimensionality and the empty space
phenomenon (projecting to 0.05% of initial dimensions almost doubled the
produced f-measure)
D-Isomap managed to produce results of quality comparable and sometimes
superior to central algorithms
Disadvantage: High network requirements
x-SDR: An eXtensible Suite for Dimensionality
Reduction
Based on :
•P.Magdalinos, A.Kapernekas, A.Mpiratsis, M.Vazirgiannis, “X-SDR: An Extensible Experimentation
Suite for Dimensionality Reduction” Submitted to European Conference on Machine Learning and
Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2010), Currently under
review.
•Downloadable from:
•www.db-net.aueb.gr/panagis/X-SDR
The X-SDR Prototype
An open source extensible suite
Aggregates well known prototypes from
Data mining (Weka)
Dimensionality reduction (MTDR suite)
Key features
C# and Matlab
http://www.db-net.aueb.gr/panagis/X-SDR/installation/downloads/xSDRSC.7z
Easily extensible by the user
Does not require any special programming skills
Evaluation of results through specific metrics, visualization and data mining.
Exploitation
Will be used in the context of data mining and machine learning courses
Conclusions and Future Research Directions
Conclusions
Introduced novelties
FEDRA, a new, global, linear, approximate dimensionality reduction
algorithm
Combination of low time and space requirements together with high
quality results
Definition of a methodology for the decentralization of any landmark
based dimensionality reduction method
Applicable in various network topologies
Definition of D-Isomap, the first distributed, non linear, global
approximate dimensionality reduction algorithm
Application on knowledge discovery from text collections
A prototype enabling the experimentation with dimensionality
reduction methods (x-SDR)
Future Work
D-Isomap has great potentials:
Assume a global landmark selection process
Given the low dimensional embedding d’ of any document d
d’ Є peeri = d’ Є peerj, i.e. every peer holds the same embedding of d
hence hash(d’ Є peeri) = hash(d’ Є peerj)
After termination apply a second hash function and create a new
distributed hash table
Every peer is capable of answering any query.
Pointers to relevant documents can be retrieved with a single message
Queried peer searches locally in the approximated dataset
Retrieves relevant document dr
Applies the hash function and retrieves indexing peers pind
Retrieves from pind the actual host peer (ph)
Cost is only a couple of bytes (hash(dr) and IP of ph)
Focus on applying D-Isomap in a real-life scenario!
Publications
Accepted:
P. Magdalinos, C.Doulkeridis, M.Vazirgiannis, “Enhancing Clustering Quality
through Landmark Based Dimensionality Reduction”, Accepted with revisions in
the Transactions on Knowledge Discovery from Data, Special Issue on Large
Scale Data Mining – Theory and Applications.
D.Mavroeidis, P.Magdalinos, “A Sequential Sampling Framework for Spectral kMeans based on Efficient Bootstrap Accuracy Estimations: Application to
Distributed Clustering”, Accepted with revisions in the Transactions on Knowledge
Discovery from Data.
P.Magdalinos, M.Vazirgiannis, D.Valsamou, “Distributed Knowledge Discovery
with Non Linear Dimensionality Reduction”, To appear in the Proceedings of the
14th Pacific-Asia Conference on Knowledge Discovery and Data Mining
(PAKDD'10), Hyderabad, India, June 2010. (Acceptance Rate (full papers)
10,2%)
P. Magdalinos, C.Doulkeridis, M.Vazirgiannis, “FEDRA: A Fast and Efficient
Dimensionality Reduction Algorithm”, In Proceedings of the SIAM International
Conference on Data Mining (SDM'09), Sparks Nevada, USA, May 2009.
Publications
P. Magdalinos, C.Doulkeridis, M.Vazirgiannis “K-Landmarks: Distributed
Dimensionality Reduction for Clustering Quality Maintenance”, In Proceedings of
10th European Conference on Principles and Practice of Knowledge Discovery in
Databases (PKDD'06), Berlin, Germany, September 2006. (Acceptance Rate (full
papers) 8,8%)
P. Magdalinos, C. Doulkeridis, M. Vazirgiannis, “A Novel Effective Distributed
Dimensionality Reduction Algorithm”, SIAM Feature Selection for Data Mining
Workshop (SIAM-FSDM‘06), Maryland Bethesda, April 2006.
Under Review:
P. Magdalinos, G.Tsatsaronis, M.Vazirgiannis, “Distributed Text Mining based on
Non Linear Dimensionality Reduction”, Submitted to European Conference on
Machine Learning and Principles and Practice of Knowledge Discovery in
Databases (ECML-PKDD 2010), Currently under review.
P.Magdalinos, A.Kapernekas, A.Mpiratsis, M.Vazirgiannis, “X-SDR: An Extensible
Experimentation Suite for Dimensionality Reduction” , Submitted to European
Conference on Machine Learning and Principles and Practice of Knowledge
Discovery in Databases (ECML-PKDD 2010), Currently under review.
Technical Reports
Technical Reports:
D.Mavroeidis, P.Magdalinos, M.Vazirgiannis, “Distributed PCA for Network
Anomaly Detection based on Sparse PCA and Principal Subspace Stability”, AUEB
2008
Thank you!
Back Up Slides
Intrinsic Dimensionality with…
The Eigenvalues approach
The number of principal components which retain variance above a
certain threshold. (PCA)
Identify a maximum eigengap which also identifies the number of
data clusters (Spectral Clustering)
The number of eigenvalues above a certain threshold
The Stress approach
Project the dataset (or a sample) in various target dimensionalities
Plot the derived stress values
Clustering and then PCA application
Works well on non linear data
Correlation dimension (the number of object pairs closer than r is proportional to r^D)
Compute C(r) = 2/(n(n−1)) Σi=1..n Σj=i+1..n I{||xi − xj|| < r}
Plot log C(r) versus log r (a short sketch follows)
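A short sketch of estimating C(r); the radii one would plot are an arbitrary choice:

```python
import numpy as np

def correlation_sum(X, r):
    """C(r) = 2/(n(n-1)) * #{(i, j), i < j : ||xi - xj|| < r} for the rows of X."""
    n = X.shape[0]
    count = 0
    for i in range(n):
        d = np.linalg.norm(X[i + 1:] - X[i], axis=1)  # distances to later points
        count += int(np.sum(d < r))
    return 2.0 * count / (n * (n - 1))

# The correlation dimension D is the slope of log C(r) against log r.
```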
FEDRA requirements (ext.)
Artificially generated dataset: 5000 objects with 1000 dimensions
Experimental assessment of:
The dependence of FEDRA on the size of the dataset
The dependence of FEDRA on the Minkowski metric (parameter c)
Progressive augmentation of the dataset with a step of 100 objects
Comparing against SOTA
Algorithm           Time                      Space          Addition
MDS                 O(d^3)                    O(d^2)         O(d)
PCA/SVD             O(n^3 + n^2·d)            O(nd + n^2)    O(kn)
FastMap             O(dk)                     O(n^2)         O(k)
Random Projection   O(d·n·ε^-2·log d)^1,2     O(kn)          O(n·ε^-2·log d)
Landmark MDS        O(ksd + s^3)              O(ks)          O(ns + ks)
Metric Map          O(dk^2 + k^3)             O(k^2)         O(k^2)
BoostMap            O(dT)                     O(d)           O(k)
SparseMap           O(d·log^2 d)              O(d·log^2 d)   O(log^2 d)
Vantage Objects     O(dk)                     O(k^2)         O(k)
FEDRA               O(cdk)                    O(k^2)         O(ck)
1. Ailon N., Chazelle, B.: Faster Dimension Reduction. Communications of ACM 52(3), pages 97-104 (2010)
2. Construction of the projection matrix requires O(n·log n)
k-nn querying with FEDRA
Consider two landmarks L1, L2
and an embedded object X.
Range query (points r away from
X in Rn)
Inside circle (L1,d(L1,X)+r)
Inside circle (L2,d(L2,X)+r)
The intersection is our solution
All objects which are exactly r away from X in the original space lie:
[Figure: rings around landmarks L1 and L2 with radii d(L1,X) ± r and d(L2,X) ± r]
Outside circle (L1,d(L1,X)-r)
Inside circle (L1,d(L1,X)+r)
Outside circle (L2,d(L2,X)-r)
Inside circle (L2,d(L2,X)+r)
The common region of these circles holds all points which
exhibit distance r from X in Rn
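A sketch of the pruning this enables: candidate answers to a range query of radius r around X are the points whose distances to the landmarks fall inside both rings; the data layout is an assumption:

```python
def range_query_candidates(dist_to_L1, dist_to_L2, qL1, qL2, r):
    """dist_to_L1 / dist_to_L2: {point_id: distance to that landmark};
    qL1, qL2: the query's distances d(L1, X) and d(L2, X).
    Keep only points lying inside both rings of width 2r."""
    return [p for p in dist_to_L1
            if abs(dist_to_L1[p] - qL1) <= r and abs(dist_to_L2[p] - qL2) <= r]
```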
Sphere to Sphere Intersection
Intersecting Spheres
(in coordinates X, Y, Z)
S1: x² + y² + z² = R²
S2: (x − d)² + y² + z² = r²
S2 − S1:
(x − d)² + R² − x² = r²
x² − 2dx + d² − x² = r² − R²
x = (d² − r² + R²)/(2d)
This is where FEDRA's computations halt.
Intersection:
y² + z² = R² − x²
y² + z² = (4d²R² − (d² − r² + R²)²)/(4d²)
Random Projections (1/2)
Johnson-Lindenstrauss Lemma [1984]:
For any 0 < ε < 1 and any integer d, let k be a positive integer such that
k ≥ 4(ε²/2 − ε³/3)^(−1) ln d. Then for any set V of d points in Rn there is a map f:
Rn → Rk such that for all u,v Є V, (1−ε)||u−v||² ≤ ||f(u)−f(v)||² ≤ (1+ε)||u−v||².
Further this mapping can be found in randomized polynomial time.
[Achlioptas, PODS 2001]: Two distributions (a projection sketch follows this list)
±1 with probability 1/2 each
√3·(±1) with probability 1/6 each, zero otherwise
[Ailon, STOC 2006]: Cost
Theoretic: O(dkn)
Actual: Implementation dependent. Even in the most naïve implementation, it
is much less, since projection matrix is 1/3 full with +/-1
[Alon, Discrete Math 2003]: Projection matrix cannot become sparser
Only by a factor of log(1/ε)
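A hedged sketch of the sparse Achlioptas projection above together with the JL target dimensionality; the 1/√k scaling is the usual convention and the names are illustrative:

```python
import numpy as np

def jl_dimension(d, eps):
    """Minimum k from the JL lemma: k >= 4 * (eps^2/2 - eps^3/3)^-1 * ln d."""
    return int(np.ceil(4.0 * np.log(d) / (eps ** 2 / 2 - eps ** 3 / 3)))

def achlioptas_project(X, k, seed=0):
    """Project the rows of X to R^k with the sparse matrix whose entries are
    +sqrt(3), 0, -sqrt(3) with probabilities 1/6, 2/3, 1/6."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    R = rng.choice([np.sqrt(3), 0.0, -np.sqrt(3)], size=(n, k), p=[1/6, 2/3, 1/6])
    return X @ R / np.sqrt(k)
```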
Random Projections (2/2)
Fast Johnson-Lindenstrauss Transform [Ailon, Comm ACM 2010]:
Given a fixed set X of d points in Rn, ε<1 and pЄ{1,2} draw a matrix F from
FJLT. With probability at least 2/3 the following two events will occur:
For any x Є X: (1−ε)·ap·||x||2 ≤ ||Fx||p ≤ (1+ε)·ap·||x||2, where a1 = k·√(2/π) and a2 = k
The mapping requires O(n·log n + n·ε^(−2)·log d) operations
FJLT Trick:
Densification of vectors through a Fast Fourier Transform
FJLT vs Achlioptas: Projection matrix is sparser than 2/3!
Advantage: Faster projection
Disadvantage: Bounds are guaranteed only for p=1,2
FEDRA vs Achlioptas
Achlioptas bounds are stricter than FEDRA’s
FEDRA provides bounds projecting from L^n_p to L^k_p while Achlioptas from L^n_2 to L^k_p
FEDRA projects close points closer and distant points further
FEDRA vs FJLT
FJLT provides bounds for projecting from L^n_2 to L^k_{1,2}
FEDRA vs Prominent DR methods
FEDRA against PCA, SVD
and Random Projection
Metric: Incorrectly
Clustered Instances
(essentially 1-Purity)
Depiction ICI vs Stress
Stress evolution - FEDRA
Experimental analysis indicates:
The best setup should include the projection heuristic
Heuristic landmark selection does not produce significantly better results than
random FEDRA
Best setup: Random Landmark Selection and Assisted Projection
Purity Evolution - FEDRA
Experimental analysis indicates:
All setups maintain clustering quality in the new space (2%-10% of initial
dimensions)
Best setup: Random Landmark Selection and Random Projection
Time Requirements - FEDRA
Experimental analysis indicates:
Random FEDRA is faster than any other configuration
Assisted Projection is sometimes cheaper than Landmark Selection!
Best setup: Random Landmark Selection and Random Projection
k-Means Convergence - FEDRA
Experimental analysis indicates:
All approaches exhibit approximately the same results
Landmark Selection and Assisted Projection significantly enhance k-Means’
speed of convergence (only 10 seconds )
So which is the best setup?
Based on results: Landmark Selection and Assisted Projection configuration
Results vs Cost: Random FEDRA seems a viable compromising solution
Purity evolution (ext.)
Dataset: gamma
Dataset: delta
F-measure maintenance (1/2)
Dataset: alpha
Dataset: beta
Evaluation of clustering using F-measure
F-measure = 2·Recall·Precision / (Recall + Precision)
Recall = True Positives / (True Positives + False Negatives)
Precision = True Positives / (True Positives + False Positives)
F-measure maintenance (2/2)
Dataset: gamma
Dataset: delta
F-measure with P2P Kmeans (1/2)
Dataset: alpha
Dataset: beta
Evaluation of clustering using F-measure
F-measure = 2·Recall·Precision / (Recall + Precision)
Recall = True Positives / (True Positives + False Negatives)
Precision = True Positives / (True Positives + False Positives)
F-measure with P2P Kmeans (2/2)
Dataset: gamma
Dataset: delta
D-Isomap Requirements Assumptions
We want to follow the Isomap paradigm but apply it in a
network context. The following requirements rise:
Approximate NN querying results in a network context
Calculate shortest paths in distributed environment
Consider distributed shortest path algorithms widely used for
routing in the Internet
Approximate the multidimensional scaling
Consider an LSH based DHT and therefore a structured P2P
network like Chord
Consider landmark based dimensionality reduction approaches
that operate on small fractions of the whole dataset
Assumptions: M peers organized in a Chord-ring topology.
p-stable distributions and LSH
Definition:
A distribution D over R is called p-stable if there exists p≥0 such that for any
n real numbers r1,…,rn and i.i.d variables X1,…,Xn with distribution D, the
random variable ΣiriXi has the same distribution as ||r||pX where X is a
random variable with distribution D.
From p-stable distributions to locality sensitive hashing
Notice that rXT = ΣiriXi
Therefore given u1,u2
dp(u1,u2) = ||u1-u2||p
u1XT − u2XT = (u1 − u2)XT, which is distributed as dp(u1,u2)·X with X drawn from D
So if a = u1XT and b = u2XT, a small value of |a − b| implies a small dp(u1,u2)
“Small” compared to what?
Identify an interval w and map each value onto this interval:
h(ui) = floor((ui·aT + b)/w)
Collision (i.e. same hash values) translates to small |a-b|
Solving non connected NG problem
of Isomap
Instead of calculating the SPs calculate Minimum Spanning Trees:
k-connected sub graph
Minimal spanning tree k-edge connected
NP hard problems
Proposals:
Combination of k-edge connected MSTs [D.Zhao, L.Yang, TPAMI 2009]
Also proposes solution for updating the Shortest Path
Incremental Isomap [M.Law, K.Jain, TPAMI 2006]
Our “trick” for connected graphs
Simple and based on the intuition that if a sub-graph is separated from the
rest then probably its points belong to a different cluster and therefore should
be attributed a large value.
Inverse of the technique employed in [M.Vlachos et al. SIGKDD 2002]
Swiss Roll – 30 peers – various kNN
Helix – 30 peers – various kNN
3D Clusters – 30 peers – various kNN
Original Values of k-Means and kNN during D-Isomap experiments
k-NN classification results for NN=7 on Reuters
F-measure ~ 0.45 (micro F-measure)
k-means clustering results for Reuters (top 10 categories)
F-measure ~ 0.25
k-NN classification results for NN=7 on 20 Newsgroup
F-measure ~ 0.55 (micro F-measure)
k-means clustering results for 20 Newsgroup
F-measure ~ 0.22
Future Work (ext.)
Extensions will concentrate on the following three axes
Minimize network requirements
Instead of requesting the actual document retrieve its projection using
Random Projection (fixed ε)
Definition of a formal method (specific for each dataset) for the
definition of Theorem 5 bound
Ameliorate the produced results
Apply edge-covering techniques from graph theory in order to select a
good set of landmarks for the shortest path process
Enhance D-Isomap’s viability for large scale retrieval
Create clusters of nodes, all holding the same information (i.e. Crespo &
Molina’s concept of SON)
Adapt techniques from routing (i.e. OSPF) so as to enable neighboring
clusters to exchange information
Adapt name resolution protocol (i.e. DNS) so as to enable quick and
reliable information retrieval from clusters.
Source Code and Results
For FEDRA and the Framework for Distributed Dimensionality
Reduction:
For D-Isomap
www.db-net.aueb.gr/panagis/TKDD2009
www.db-net.aueb.gr/panagis/PAKDD2010/ (manifold unfolding capability)
www.db-net.aueb.gr/panagis/PKDD2010/ (extensions assessment and
application on text collections)
For x-SDR
www.db-net.aueb.gr/panagis/X-SDR (source code, analysis, deployment
instructions)