Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Ebrahim Ehsanfar Technical University of Dortmund A database in a bank records transactions Amount, Date, Accounts Holders… Query: find all the transactions (Pi) such that • The amount is between $1000 and $2000 • It happened between 10:40 AM and 11:20AM Range: R= [10:40,11:20]x[1000,2000] Goal: Finding P∩ R Geometric interpretation Reported Points={ 1 } 3 2 4 P n R Range: R= [10:40,11:20]x[1000,2000] Goal: Finding P∩ R Definitions: d(x, C) min cC d(x, c) d 2 (x, C) min cC d 2 (x, c) cost(P, C) xP d (x, C) 2 (the sum of the squared distances) Problem: Given an input set P R d with P n and k N, find a set C R d with C k that minimizes cost(P, C) Choose k initial centers C = c1,…,ck While it gets better results 1. 2. • • • For i = 1,...,N - find closest center ck of C to each point pi - assign instance pi to cluster Ck For k = 1,...,K set ck to be the center of mass of all points in Ci The algorithm terminates in a local minimum Lloyds algorithm is very sensitive to the initial values of the centers We can give better initial assignments to Lloyds algorithm (seeding algorithms) A well known seeding procedure, K-means++ [Arthur et al. 2007] In each iteration, the next center is chosen randomly from the input points The distribution over the points is not uniform Each point picked with probability proportional to the minimum squared distance from it to the picked centers Gives an O(log(k)) approximation guarantee Goal: Compute a set of k centers, such that the sum of the squared distances between each point and its nearest center is minimized A common approach is to represent points as a collection of canonical subsets Recall ,to the range reporting problem Any set can be formed as the disjoint union of canonical subsets Many ways to select canonical subsets Choice affects the space and time complexities Goal: Build a data structure to strike a balance between: # of total canonical subsets (Space) # of canonical subsets needed to answer a query (Time) Canonical subsets through the use of a partition tree: rooted (binary)tree Leaves: points of P Any node u: associated with a subset of P Canonical ▪ points stored in the leaves of the subtree rooted subtree ofat u {59,62,70,80} The search paths to a and b (e.g. [25,90] ) Query time: O((log n)+k) Space complexity: O(n) (binary tree with ‘n’ leaves) Assume d > 1 We want to perform range searching in Build T with respect to the x1-coordinate For each canonical set, build a (d-1) dimensional data structure using coordinates For answering d-dimensional query Find the canonical nodes of T associated with Make a d-1 dimensional Query on each canonical subtree recursively, using Total: O(n log n) Suppose (d-1)-dim. range tree has size of Then size of d-dim. range tree is d Time : O(log n + k) Space : O(n log (d -1) n) Key Idea: Replace many points by one weighted representative A small weighted set, such that for any set of k cluster centers the clustering cost of the coreset is an approximation for the clustering cost of the original set P with small relative error Notation [Har-Peled, Mazumdar 2004] : Let k N and 1. A weighted multiset S R d with positive weight function w : s R is called (k, ) - coreset of P iff for each C R d of size C k we have (1 - )cost(P, C) cost w (S, C) (1 )cost(P, C) [Agarwal et. al. 2001], introducing coreset [Har-Peled and Kushal 2005], size of the coreset is independent of n [Feldman, Monemizadeh, Sohler 2007]: weak coreset [Chen 2007]: Size of coresets is linear in d (High dim. Data) [Ackermann et al. 2010] Coreset by Adaptive sampling [Feldman, Schmidt, Sohler 2013] Coreset size is independent of n and d Observation: (Merge [Har-Peled, Mazumdar 2004] ) If Then S1 is a (k , ) coreset for P1 S2 is a (k , ) coreset for P2 S=S1∩S2 is a (k , ) coreset for P=P1∩P2 In any level i, you can get (almost) 2^i (n/(2^i)) Points in form of the canonical subsets Replace any canonical subsets with the associated coreset The final coreset range tree is ready for k- means clustering the points within any orthogonal range queries Dataset: Hip2 (Hipparcos 2-High precision parallax collecting satellite) Scientific mission of the European Space Agency (ESA) launched in 1989 and operated between 1989 and 1993 Get accurate measurements of the positions of objects on the sky The Hipparcos Catalogue (1997): A high-precision catalogue of 117,995 stars The lower precision data (more than a million) Dimension: More than 40 Our experimented dataset: 500,000 of the more accurate points of lower precision catalogue Only the coordination of the stars (in Radian from 0 2*pi) Linear size data structures Simplex range clustering Other machine learning methods (e.g. regression) Database index data structures SELECT * FROM Book WHERE 50$<price<100$ k-Cluster * FROM Book WHERE 50$<price<100$