Orthogonal Range Clustering
Ebrahim Ehsanfar
Technical University of Dortmund

A database in a bank records transactions:
• amount, date, account holders, …

Query: find all the transactions (points P) such that
• the amount is between $1000 and $2000
• it happened between 10:40 AM and 11:20 AM
Range: R = [10:40, 11:20] × [1000, 2000]
Goal: finding P ∩ R
Geometric interpretation: the points of P lie in the plane
(time on one axis, amount on the other); the reported points
are those inside the axis-parallel rectangle R.
Range: R = [10:40, 11:20] × [1000, 2000]
Goal: finding P ∩ R

Definitions:
d(x, C) = min_{c ∈ C} d(x, c)
d²(x, C) = min_{c ∈ C} d²(x, c)
cost(P, C) = Σ_{x ∈ P} d²(x, C)
(the sum of the squared distances)

Problem:
Given an input set P ⊂ ℝ^d with |P| = n
and k ∈ ℕ, find a set C ⊂ ℝ^d with |C| = k that
minimizes cost(P, C)
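The definitions above translate directly into Python; a minimal sketch (the sample point set and centers below are made up for illustration):

```python
def dist2(x, c):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def d2(x, C):
    """d²(x, C): squared distance from x to its nearest center in C."""
    return min(dist2(x, c) for c in C)

def cost(P, C):
    """cost(P, C): sum of squared distances to the nearest centers."""
    return sum(d2(x, C) for x in P)

# Illustrative data: four points, two centers.
P = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
C = [(0.5, 0.0), (10.5, 0.0)]
print(cost(P, C))  # each point is 0.5 from its nearest center: 4 * 0.25 = 1.0
```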
Lloyd's algorithm:
Choose k initial centers C = {c1, …, ck}
While the cost improves:
1. For i = 1, …, n:
   • find the closest center cj ∈ C to point pi
   • assign pi to cluster Cj
2. For j = 1, …, k:
   • set cj to be the center of mass of all points in Cj
The algorithm terminates in a local minimum
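The two alternating steps can be sketched as follows (a minimal, unoptimized sketch; the naive random initialization and the sample point set are illustrative):

```python
import random

def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(P, k, iters=100, seed=0):
    """Minimal sketch of Lloyd's algorithm with naive random initialization."""
    centers = random.Random(seed).sample(P, k)
    for _ in range(iters):
        # Step 1: assign each point to the cluster of its closest center.
        clusters = [[] for _ in range(k)]
        for p in P:
            j = min(range(k), key=lambda j: dist2(p, centers[j]))
            clusters[j].append(p)
        # Step 2: move each center to the center of mass of its cluster.
        new_centers = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:  # no change: reached a local minimum
            break
        centers = new_centers
    return centers

# Two well-separated groups of three points each.
P = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
     (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = sorted(lloyd(P, k=2))
print(centers)
```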

Lloyd's algorithm is very sensitive to the initial
values of the centers.

We can give better initial assignments to
Lloyd's algorithm (seeding algorithms).

A well-known seeding procedure: k-means++
[Arthur et al. 2007]
• In each iteration, the next center is chosen
  randomly from the input points
• The distribution over the points is not
  uniform
• Each point is picked with probability
  proportional to the minimum squared
  distance from it to the already picked centers
• Gives an O(log k) approximation guarantee
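This D²-sampling step can be sketched as follows (seeding only, no Lloyd iterations; the sample point set is made up):

```python
import random

def d2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_seed(P, k, seed=0):
    """k-means++ seeding: pick centers by D^2-sampling."""
    rng = random.Random(seed)
    centers = [rng.choice(P)]  # first center: uniform at random
    while len(centers) < k:
        # squared distance from each point to its nearest chosen center
        w = [min(d2(p, c) for c in centers) for p in P]
        # sample the next center with probability proportional to w
        r = rng.random() * sum(w)
        acc = 0.0
        for p, wi in zip(P, w):
            acc += wi
            if acc >= r:
                centers.append(p)
                break
        else:
            centers.append(P[-1])  # guard against floating-point rounding
    return centers

P = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
C = kmeans_pp_seed(P, k=2)
print(C)
```

Because already-chosen centers have squared distance 0, they are (almost surely) never picked twice.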

Goal: Compute a set of k centers, such that
the sum of the squared distances between
each point and its nearest center is minimized

A common approach is to represent the points as
a collection of canonical subsets.
Recall the range reporting problem:
• any query range can be formed as the disjoint union of
  canonical subsets
• there are many ways to select the canonical subsets
• the choice affects the space and time
  complexities

Goal: build a data structure that strikes a
balance between:
• the total number of canonical subsets (space)
• the number of canonical subsets needed to answer a query
  (time)

Canonical subsets through the use of a
partition tree:
• a rooted (binary) tree
• leaves: the points of P
• any node u is associated with a subset of P,
  the canonical subset of u: the points stored in the
  leaves of the subtree rooted at u (e.g. {59, 62, 70, 80})

For a 1-D query range [a, b] (e.g. [25, 90]), follow the two
search paths to a and b and report the canonical subsets
hanging between them.
Query time: O(log n + k)
Space complexity: O(n)
(a binary tree with n leaves)
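A 1-D partition tree can be sketched with nested dicts (a minimal sketch; the point set is made up so that {59, 62, 70, 80} appears as a canonical subset):

```python
def build(pts):
    """Build a balanced partition tree over sorted points; every node
    stores its canonical subset (the points in the leaves below it)."""
    node = {"pts": pts}
    if len(pts) > 1:
        mid = len(pts) // 2
        node["left"] = build(pts[:mid])
        node["right"] = build(pts[mid:])
    return node

def query(node, lo, hi, out):
    """Collect O(log n) canonical subsets whose disjoint union is
    exactly the set of points inside [lo, hi]."""
    a, b = node["pts"][0], node["pts"][-1]
    if b < lo or a > hi:
        return                    # disjoint from the query range
    if lo <= a and b <= hi:
        out.append(node["pts"])   # whole canonical subset is inside
        return
    query(node["left"], lo, hi, out)   # node straddles a boundary
    query(node["right"], lo, hi, out)

T = build([3, 10, 23, 37, 59, 62, 70, 80])
subsets = []
query(T, 25, 90, subsets)
print(subsets)  # [[37], [59, 62, 70, 80]]
```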


Assume d > 1.
We want to perform range searching in ℝ^d:
• build T with respect to the x1-coordinate
• for each canonical subset, build a (d-1)-dimensional
  data structure using the remaining coordinates

For answering a d-dimensional query:
• find the canonical nodes of T associated with the
  x1-range of the query
• make a (d-1)-dimensional query on each canonical
  subtree recursively, using the remaining coordinates
Total: O(n log n)

Suppose the (d-1)-dimensional range tree has size O(n log^(d-2) n).
Then the size of the d-dimensional range tree is O(n log^(d-1) n).
Time: O(log^d n + k)
Space: O(n log^(d-1) n)
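For d = 2 this construction can be sketched as follows: each canonical node of the x-tree carries a y-sorted list as its associated 1-D structure, so a query visits O(log n) canonical nodes and does a binary search in each (a counting sketch; the point set and query ranges are illustrative):

```python
import bisect

def build2d(pts):
    """2-D range tree sketch: split by x; every node stores the
    y-coordinates of its canonical subset in sorted order."""
    pts = sorted(pts)  # sort by x-coordinate
    def make(node_pts):
        node = {"ys": sorted(p[1] for p in node_pts),
                "xmin": node_pts[0][0], "xmax": node_pts[-1][0]}
        if len(node_pts) > 1:
            mid = len(node_pts) // 2
            node["left"] = make(node_pts[:mid])
            node["right"] = make(node_pts[mid:])
        return node
    return make(pts)

def count2d(node, x1, x2, y1, y2):
    """Count the points inside [x1, x2] x [y1, y2]."""
    if node["xmax"] < x1 or node["xmin"] > x2:
        return 0                              # disjoint in x
    if x1 <= node["xmin"] and node["xmax"] <= x2:
        ys = node["ys"]                       # canonical node: query its y-structure
        return bisect.bisect_right(ys, y2) - bisect.bisect_left(ys, y1)
    if "left" not in node:
        return 0
    return (count2d(node["left"], x1, x2, y1, y2) +
            count2d(node["right"], x1, x2, y1, y2))

T = build2d([(1, 5), (2, 9), (3, 1), (4, 7), (6, 2), (7, 8), (8, 4), (9, 6)])
print(count2d(T, 2, 7, 4, 8))  # only (4, 7) and (7, 8) qualify: 2
```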

Key Idea:
Replace many points by one weighted
representative

A small weighted set, such that for any set of k
cluster centers the clustering cost of the coreset is
an approximation for the clustering cost of the
original set P with small relative error

Notation [Har-Peled, Mazumdar 2004]:
Let k ∈ ℕ and ε ≤ 1. A weighted multiset S ⊆ ℝ^d
with positive weight function w : S → ℝ is called a (k, ε)-coreset
of P iff for each C ⊆ ℝ^d of size |C| = k we have
(1 - ε)·cost(P, C) ≤ cost_w(S, C) ≤ (1 + ε)·cost(P, C)
• [Agarwal et al. 2001]: introduced coresets
• [Har-Peled and Kushal 2005]: size of the coreset
  is independent of n
• [Feldman, Monemizadeh, Sohler 2007]: weak
  coresets
• [Chen 2007]: size of coresets is linear in d (high-
  dimensional data)
• [Ackermann et al. 2010]: coresets by adaptive
  sampling
• [Feldman, Schmidt, Sohler 2013]: coreset size is
  independent of n and d



Observation (Merge) [Har-Peled, Mazumdar 2004]:
If S1 is a (k, ε)-coreset for P1
and S2 is a (k, ε)-coreset for P2,
then S = S1 ∪ S2 is a (k, ε)-coreset
for P = P1 ∪ P2
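The merge observation can be illustrated numerically with a toy grid-based coreset. This is an illustration only: the grid construction here is a crude simplification (the real construction of [Har-Peled, Mazumdar 2004] chooses cell sizes adaptively to guarantee the (k, ε) bound), and the sample points below are chosen to lie on the grid, so the toy coreset is exact (ε = 0):

```python
from collections import Counter

def cost(P, C):
    """Unweighted k-means cost: sum of squared distances to nearest center."""
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in C) for p in P)

def cost_w(S, C):
    """Weighted cost over a coreset S given as {point: weight}."""
    return sum(w * min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in C)
               for p, w in S.items())

def grid_coreset(P, cell):
    """Toy coreset: snap each point to a grid of width `cell`; the
    weight of a grid point is the number of points snapped to it."""
    S = Counter()
    for p in P:
        S[tuple(cell * round(x / cell) for x in p)] += 1
    return S

# Two halves of the input, each summarized separately.
P1 = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
P2 = [(5.0, 5.0), (5.0, 5.0), (5.0, 6.0)]
S1, S2 = grid_coreset(P1, 1.0), grid_coreset(P2, 1.0)
S = S1 + S2                     # merging coresets = adding the weighted multisets
C = [(0.0, 0.0), (5.0, 5.0)]
print(cost(P1 + P2, C), cost_w(S, C))  # equal here, since the coreset is exact
```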

At any level i of the tree there are (almost) 2^i canonical
subsets of n/2^i points each.

Replace each canonical subset with its associated coreset.

The final coreset range tree is then ready for k-means
clustering of the points within any orthogonal range query.




Dataset: Hip2 (Hipparcos, the High Precision
Parallax Collecting Satellite)
• a scientific mission of the European Space
  Agency (ESA)
• launched in 1989 and operated until 1993
• goal: accurate measurements of the positions
  of objects in the sky

The Hipparcos Catalogue (1997):
• a high-precision catalogue of 117,995 stars
• plus lower-precision data (more than a million stars)
• dimension: more than 40

The dataset used in our experiments:
• 500,000 of the more accurate points of the lower-
  precision catalogue
• only the coordinates of the stars (in radians, from 0 to 2π)




• Linear-size data structures
• Simplex range clustering
• Other machine learning methods (e.g.
  regression)
• Database index data structures
  - SELECT * FROM Book WHERE 50$ < price < 100$
  - k-Cluster * FROM Book WHERE 50$ < price < 100$