Survey

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Survey

Document related concepts

Transcript

1 RNN(q) – returns a set of data points that have the query point q as the nearest neighbor. Advanced database applications: fixed wireless telephone access application – “load” detection problem: count how many users are currently using a specific base station q if q’s load is too heavy activating an inactive base station to lighten the load of that over loaded base station Asymetric Property The Nearest Neighbor Relation is not symmetric, the set of points that are closest to a query point (i.e., the Nearest Neighbors) differs from the set of points that have the query point as their Nearest Neighbor (called the Reverse Nearest Neighbors) 2 • NN(q) = p • If p is the nearest neighbor of q, then q need not be the nearest neighbor of p (in this case the nearest neighbor of p is r). • those efficient NN algorithms cannot directly applied to solve the RNN problems. Algorithms for RNN problems are needed. NN(p) = q p r q A straight forward solution: -- check for each point whether it has q as its nearest neighbor -- not suitable for large data set! 3 Bichromatic Version: the data points are of two categories, say red and blue. The RNN query point q is in one of the categories, say blue. So RNN(q) must determine the red points which have the query point q as the closest blue point. e.g. fixed wireless telephone access application: clients/red (e.g. call initiation or termination) servers/blue (e.g. fixed wireless base stations) Monochromatic Version: all points are of the same color is the monochromatic version. 4 RNN queries have been studied for finite, stored data sets RNN can identify "influence" of a data point on the database •[F. Korn and S. Muthukrishnan, Influence Sets Based on Reverse Nearest Neighbor Queries] •[I. Stanoi, M. Riedewald, D., Mirek Riedewald, D. Agrawal, A.E. Abbadi, Discovery of influence sets in frequently updated databases] •[C. Yang, King-Ip Lin, An index structure for efficient reverse nearest neighbor queries ] 5 Finding the set of customers affected by the opening of a new store outlet location Notifying the subset of subscribers to a digital library who will find a newly added document most relevant Finding set of users whose profiles are more similar to the new service offering than to any other service 6 Fixed Physical Position Defined Coverage Area Calls Arrives in Streams Worst-Case “Signal Strength” – RNN MAXDIST “Load” on Base Station – RNN COUNT Optimization RNNA problems 7 Fixed Physical Position Detect vehicles, estimate speed and length User Queries Arrives in Streams Periodic Updates of Closest Sensor “Load” on Sensor – RNN COUNT “Accuracy” of Information – RNN MAXDIST Optimization RNNA problems 8 Max-RNNA – Given K servers, return the maximum RNNA over all clients to any of the servers List-RNNA – Given K servers, return the RNNA over all clients to each of the servers Opt-RNNA – Find a set of at most K servers for which their RNNAs are below a given threshold 9 Max-RNN-Count Insertion and Deletion – 3-approximation Insertion only – (1+) -approximation Max-RNN-MAXDIST (1+) -approximation List-RNN-COUNT & List-RNN-MAXDIST Lower- & Upper-bound as function of the true counts Opt-RNN-COUNT 8-approximation Opt-RNN-MAXDIST (1+) –approximation Space – near-linear in the number of available servers 10 No previous works for RNNA over Data Streams Algorithms over Data Streams Algorithms for computing RNN over a conventional DB 11 1. Space requirements of Selection and Sorting as a function of the number of passes over the data [J. I. Munro and M. S. Paterson. Selection and Sorting with Limited Storage] 2. Formalization of the Data Stream Model [A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries] and [M. R. Henzinger, P. Raghavan, S. Rajagopalan. Computing on data streams] 12 3. Computing the approximate median and other quantiles in a single pass over data set [R. Agrawal, A. Swami, A One-Pass Space-Efficient Algorithm for Finding Quantiles] [G.S. Manku, S. Rajagopalan, B.G. Lindsay. Approximate Medians and other Quantiles in One Pass and with Limited Memory] [G.S. Manku, S. Rajagopalan, B.G. Lindsay. Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets] [M. Greenwald and S. Khanna. Space- Efficient Online Computation of Quantile Summaries] 13 4. Computing the approximate online quantiles with probabilistic guaranties over data stream [A.C. Gilbert, Y.Kotidis, S. Muthukrishnan, M.J. Strauss. How to Summarize the Universe: Dynamic Maintenance of Quantiles] 5. Histogram construction over data stream [A.C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, M.J. Strauss. Fast, Small-Space Algorithms for Approximate Histogram Maintenance ] 14 6. Maintaining summary structures for maintaining approximate aggregates over data stream [A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries] and [M. R. Henzinger, P. Raghavan, S. Rajagopalan. Computing on data streams] [J. Gehrke, F. Korn, and D. Srivastava. On computing correlated aggregates over continual data streams] 15 7. Construction of decision trees [P. Domingos, G. Hulten. Mining High-Speed Data Streams] [J. Gehrke, V. Ganti, R. Ramakrishnan, W.-Y. Loh. BOAT Optimistic Decision Tree Construction] 8. Association rules [C. Hidber. Online Association Rule Mining] 9. Similarity matching [G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan. Comparing Data Streams Using Hamming Norms] 16 10. Clustering algorithms (k-median clustering problem) [M. Charikar, C. Chekuri, T. Feder, R. Motwani. Incremental Clustering and Dynamic Information Retrieval ] [S. Guha, N. Mishra, R. Motwani, L. O'Callaghan. Clustering Data Streams] 17 11. Lp norms [P. Indyk. Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation] 12. Hamming norms [G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan. Comparing Data Streams Using Hamming Norms] 13. Quantiles [A.C. Gilbert, Y.Kotidis, S. Muthukrishnan, M.J. Strauss. How to Summarize the Universe: Dynamic Maintenance of Quantiles] 14. Sliding window [M. Datar. Maintaining Stream Statistics over Sliding Windows ] 18 15. Study of RNN in data bases [F. Korn and S. Muthukrishnan, Influence Sets Based on Reverse Nearest Neighbor Queries] 16. Efficient access methods for indexing RNN [I. Stanoi, M. Riedewald, D., Mirek Riedewald, D. Agrawal, A.E. Abbadi, Discovery of influence sets in frequently updated databases] [C. Yang, King-Ip Lin, An index structure for efficient reverse nearest neighbor queries ] 19 Collection of n available servers (not necessary active) li – location of server i Clients arrive and depart Lj – location of client j RNN of server i is the set of all clients that have i as their NN server 20 RNN-COUNT(i) – number of clients currently in the system for which i is the NN – “LOAD” for active servers RNN-MAXDIST(i ) – largest distance to a client that has i as its NN – “QUALITY” for active servers Streams of clients are large – can’t be stored in memory – computing approximate RNNA values 21 Max-RNNA – Given K active servers, return the maximum RNNA over all clients to their closest active server – “Worst-case Load” or “Quality” List-RNNA – Given K active servers, return a list of the RNNA over all clients to each of the K active servers - “Maximum Load” or “Worst-case Quality” Opt-RNNA – Find a set of at most K servers from the available ones to be active, for which their RNNAs are below a given threshold – “Optimization” 22 Assumption: Servers are on as straight line Counters for servers i, j and client k: CLij -> Lk[li, (li+lj)/2) CRij -> Lk((li+lj)/2, lj] 23 The algorithm: Let l be the closest active server from the left of i and r from the right. RNN-COUNT(i) = CLil + CRir Require O(n2) space O(n2) updates We want – space near-linear and less updates Approximation is needed 24 Definitions: s1,.. sk are the K servers designated to be active Assumption: Servers are sorted l1 … ln Counter number of clients for server i: C(i) -> Lk[li, li+1) – at the right side of server i C(0) – at left side of server 1 Require: O(n) space O(log n) updates (look for wanted server) 25 Max-RNNA(s1,.. sk) = maxi RNN-COUNT(si) 26 RNN-COUNT(s0) C(0) 1 J< C(1) RNN-COUNT(s1) 2 C(2) 3 C(3) 4 C(4) J>+1 27 28 Mi for each si The Proof is similar to previous theorem 29 Greedy Algorithm finds: Minimal Number of active servers – K maxi RNN-COUNT(si)C 30 31 32 Minimize maxi RNN-COUNT(si) Given upper bound on number of servers K Algorithm 1. Choose different values of C 2. Run Greedy Algorithm of Opt-RNNA 3. Repeat until solve with number of servers K*K 33 Assumption: Servers are sorted l1 … ln Counter number of clients for server i: C(i) -> Lk[li, li+1) – at the right side of server i C(0) – at left side of server 1 Maintain l-quantiles (Greenwald & Khanna) ci1…cil – number of clients lying in [li, Lcik] Within (1)kC(i)/l, where 1k l Require: O(logC(i)/) space 34 Max-RNNA(s1,.. sk) = maxi RNN-COUNT(si) 35 36 Implementation in the same way Maintenance of data structure for deletion ? 37 The algorithm: Histogram based on space partitioning Assumption: Servers are sorted l1 … ln Exponential sized buckets Domain size U, such that U = [min(Lj,li), max(Lj,li)] Dividers between servers i and (i+1) – gij at distance (1+ )j from li Number of dividers is O(log1+ [li+1-li]) 38 Counter number of clients between gik and gik+1 is #gik For updates of client j: Find i such that Lj[li, li+1) Find k such that Lj[gik , gik+1) Update value #gik Require O(n log1+ U) space O(log1+ U) updates 39 Max-RNNA(s1,.. sk) = maxi RNN-MAXDIST(si) 40 Details of the proof will be given in the future paper. 41 Di=max{RDi,LDi} for each si The Proof is similar to previous theorem 42 Greedy Algorithm with limited backtracking finds: Minimal Number of active servers – K maxi RNN-MAXDIST(si)D 43 The proof will be given in the future paper. 44 Minimize maxi RNN-MAXDIST(si) Given upper bound on number of servers K Algorithm 1. Choose different values of D 2. Run Greedy Algorithm of Opt-RNNA 3. Repeat until solve with number of servers K*K 45 Nearest Neighbor and Reverse Nearest Neighbor Queries for Let the space aroundare a query point q be divided into six equal Assumption: the clients on the same axis as the servers Moving Objects regions Si (1<=i<=6) by straight lines intersecting q. Si therefore is the space between two space dividing lines. R.Benetis, C.S.Jensen,G.Karciauskas, S.Saltenis L3 L2 s2 for Dynamic Databases Reverse Nearest Neighbor Queries s3 s1 SHOU Yu Tao q s4 L1 s6 s5 For a given 2-dimensional dataset, RNN(q) will return at most six data points. And they are must be on the same circle centered at q. 46 47 The following aspects were tested: Experimental data: CALIFORNIA – latitude of 63k buildings in California, uniform and binomial distributions 48 49 50 51 52 RNNA supports computations based on geographical distances or vector-space similarity between servers and clients Applications of RNNA: o Classical – facility location o Emerging – fixed wireless telephony access and sensor-based traffic monitoring Data of RNNA arrives in streams RNNA performs online computations 53 We study three problems: Max-RNNA List-RNNA Opt-RNNA Two aggregates: COUNT MAXDIST Approximate algorithms with near-linear space usage 54 Any Questions?