Download Presentation

Document related concepts

Prognostics wikipedia , lookup

Intelligent maintenance system wikipedia , lookup

Transcript
1
 RNN(q) – returns a set of data points that
have the query point q as the nearest neighbor.
 Advanced database applications:
fixed wireless telephone access application – “load”
detection problem:
count how many users are currently using a specific base
station q  if q’s load is too heavy  activating an inactive
base station to lighten the load of that over loaded base station
 Asymetric Property
The Nearest Neighbor Relation is not symmetric, the set of
points that are closest to a query point (i.e., the Nearest
Neighbors) differs from the set of points that have the query
point as their Nearest Neighbor (called the Reverse Nearest
Neighbors)
2
•
NN(q) = p
•
If p is the nearest neighbor of q, then q need not be the nearest neighbor of p
(in this case the nearest neighbor of p is r).
•
those efficient NN algorithms cannot directly applied to solve the RNN
problems. Algorithms for RNN problems are needed.
NN(p) = q
p
r
q
A straight forward solution:
-- check for each point whether it has q as its nearest neighbor
-- not suitable for large data set!
3
 Bichromatic Version:
 the data points are of two categories, say red and blue. The RNN query
point q is in one of the categories, say blue. So RNN(q) must
determine the red points which have the query point q as the closest
blue point.
 e.g. fixed wireless telephone access application:
clients/red (e.g. call initiation or termination)
servers/blue (e.g. fixed wireless base stations)
 Monochromatic Version:
 all points are of the same color is the monochromatic version.
4
RNN queries have been studied for finite, stored data sets
RNN can identify "influence" of a data point on the
database
•[F. Korn and S. Muthukrishnan, Influence Sets Based on
Reverse Nearest Neighbor Queries]
•[I. Stanoi, M. Riedewald, D., Mirek Riedewald, D.
Agrawal, A.E. Abbadi, Discovery of influence sets in
frequently updated databases]
•[C. Yang, King-Ip Lin, An index structure for efficient
reverse nearest neighbor queries ]
5
 Finding the set of customers affected by the opening of a new
store outlet location
 Notifying the subset of subscribers to a digital library who will
find a newly added document most relevant
 Finding set of users whose profiles are more similar to the new
service offering than to any other service
6
 Fixed Physical Position
 Defined Coverage Area
 Calls Arrives in Streams
 Worst-Case “Signal Strength” – RNN MAXDIST
 “Load” on Base Station – RNN COUNT
 Optimization RNNA problems
7
 Fixed Physical Position
 Detect vehicles, estimate speed and length
 User Queries Arrives in Streams
 Periodic Updates of Closest Sensor
 “Load” on Sensor – RNN COUNT
 “Accuracy” of Information – RNN MAXDIST
 Optimization RNNA problems
8
 Max-RNNA – Given K servers, return the maximum
RNNA over all clients to any of the servers
 List-RNNA – Given K servers, return the RNNA over
all clients to each of the servers
 Opt-RNNA – Find a set of at most K servers for
which their RNNAs are below a given threshold
9
 Max-RNN-Count
Insertion and Deletion – 3-approximation
Insertion only – (1+) -approximation
 Max-RNN-MAXDIST
(1+) -approximation
 List-RNN-COUNT & List-RNN-MAXDIST
Lower- & Upper-bound as function of the true counts
 Opt-RNN-COUNT
8-approximation
 Opt-RNN-MAXDIST
(1+) –approximation
Space – near-linear in the number of available servers
10
 No previous works for RNNA over Data Streams
 Algorithms over Data Streams
 Algorithms for computing RNN over a
conventional DB
11
1. Space requirements of Selection and Sorting as a function of the
number of passes over the data
[J. I. Munro and M. S. Paterson. Selection and Sorting with
Limited Storage]
2. Formalization of the Data Stream Model
[A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J. Strauss.
Surfing Wavelets on Streams: One-Pass Summaries for
Approximate Aggregate Queries] and [M. R. Henzinger, P.
Raghavan, S. Rajagopalan. Computing on data streams]
12
3. Computing the approximate median and other quantiles in a
single pass over data set
[R. Agrawal, A. Swami, A One-Pass Space-Efficient Algorithm
for Finding Quantiles]
[G.S. Manku, S. Rajagopalan, B.G. Lindsay. Approximate
Medians and other Quantiles in One Pass and with Limited
Memory]
[G.S. Manku, S. Rajagopalan, B.G. Lindsay. Random Sampling
Techniques for Space Efficient Online Computation of
Order Statistics of Large Datasets]
[M. Greenwald and S. Khanna. Space- Efficient Online
Computation of Quantile Summaries]
13
4. Computing the approximate online quantiles with probabilistic
guaranties over data stream
[A.C. Gilbert, Y.Kotidis, S. Muthukrishnan, M.J. Strauss. How
to Summarize the Universe: Dynamic Maintenance of
Quantiles]
5. Histogram construction over data stream
[A.C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan,
M.J. Strauss. Fast, Small-Space Algorithms for Approximate
Histogram Maintenance ]
14
6. Maintaining summary structures for maintaining approximate
aggregates over data stream
[A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J. Strauss.
Surfing Wavelets on Streams: One-Pass Summaries for
Approximate Aggregate Queries] and [M. R. Henzinger, P.
Raghavan, S. Rajagopalan. Computing on data streams]
[J. Gehrke, F. Korn, and D. Srivastava. On computing correlated
aggregates over continual data streams]
15
7. Construction of decision trees
[P. Domingos, G. Hulten. Mining High-Speed Data Streams]
[J. Gehrke, V. Ganti, R. Ramakrishnan, W.-Y. Loh. BOAT
Optimistic Decision Tree Construction]
8. Association rules
[C. Hidber. Online Association Rule Mining]
9. Similarity matching
[G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan.
Comparing Data Streams Using Hamming Norms]
16
10. Clustering algorithms (k-median clustering problem)
[M. Charikar, C. Chekuri, T. Feder, R. Motwani. Incremental
Clustering and Dynamic Information Retrieval ]
[S. Guha, N. Mishra, R. Motwani, L. O'Callaghan.
Clustering Data Streams]
17
11. Lp norms
[P. Indyk. Stable Distributions, Pseudorandom Generators,
Embeddings and Data Stream Computation]
12. Hamming norms
[G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan. Comparing
Data Streams Using Hamming Norms]
13. Quantiles
[A.C. Gilbert, Y.Kotidis, S. Muthukrishnan, M.J. Strauss. How to
Summarize the Universe: Dynamic Maintenance of
Quantiles]
14. Sliding window
[M. Datar. Maintaining Stream Statistics over Sliding Windows ]
18
15. Study of RNN in data bases
[F. Korn and S. Muthukrishnan, Influence Sets Based on
Reverse Nearest Neighbor Queries]
16. Efficient access methods for indexing RNN
[I. Stanoi, M. Riedewald, D., Mirek Riedewald, D. Agrawal,
A.E. Abbadi, Discovery of influence sets in frequently
updated databases]
[C. Yang, King-Ip Lin, An index structure for efficient
reverse nearest neighbor queries ]
19
Collection of n available servers (not necessary active)
li – location of server i
Clients arrive and depart
Lj – location of client j
RNN of server i is the set of all clients that have i as their NN server
20
RNN-COUNT(i) – number of clients currently in the
system for which i is the NN – “LOAD” for active
servers
RNN-MAXDIST(i ) – largest distance to a client that
has i as its NN – “QUALITY” for active servers
Streams of clients are large – can’t be stored in
memory – computing approximate RNNA values
21
 Max-RNNA – Given K active servers, return the
maximum RNNA over all clients to their closest active
server – “Worst-case Load” or “Quality”
 List-RNNA – Given K active servers, return a list of
the RNNA over all clients to each of the K active servers
- “Maximum Load” or “Worst-case Quality”
 Opt-RNNA – Find a set of at most K servers from the
available ones to be active, for which their RNNAs are
below a given threshold – “Optimization”
22
Assumption: Servers are on as straight line
Counters for servers i, j and client k:
CLij -> Lk[li, (li+lj)/2)
CRij -> Lk((li+lj)/2, lj]
23
The algorithm:
Let l be the closest active server from the left of i and r from the right.
RNN-COUNT(i) = CLil + CRir
Require
O(n2) space
O(n2) updates
We want – space near-linear and less updates  Approximation is
needed
24
Definitions: s1,.. sk are the K servers designated to be active
Assumption: Servers are sorted l1 … ln
Counter number of clients for server i:
C(i) -> Lk[li, li+1) – at the right side of server i
C(0) – at left side of server 1
Require:
O(n) space
O(log n) updates (look for wanted server)
25
Max-RNNA(s1,.. sk) = maxi RNN-COUNT(si)
26
RNN-COUNT(s0)
C(0)
1
J<
C(1)
RNN-COUNT(s1)
2
C(2)
3
C(3)
4
C(4)
J>+1
27
28
Mi for each si
The Proof is similar to previous theorem
29
Greedy Algorithm finds:
 Minimal Number of active servers – K
 maxi RNN-COUNT(si)C
30
31
32
Minimize maxi RNN-COUNT(si)
Given upper bound on number of servers K
Algorithm
1. Choose different values of C
2. Run Greedy Algorithm of Opt-RNNA
3. Repeat until solve with number of servers K*K
33
Assumption: Servers are sorted l1 … ln
Counter number of clients for server i:
C(i) -> Lk[li, li+1) – at the right side of server i
C(0) – at left side of server 1
Maintain l-quantiles (Greenwald & Khanna)
ci1…cil – number of clients lying in [li, Lcik]
Within (1)kC(i)/l, where 1k l
Require:
O(logC(i)/) space
34
Max-RNNA(s1,.. sk) = maxi RNN-COUNT(si)
35
36
Implementation in
the same way
Maintenance of data structure for deletion ?
37
The algorithm:
Histogram based on space partitioning
Assumption: Servers are sorted l1 … ln
Exponential sized buckets
Domain size U, such that U = [min(Lj,li), max(Lj,li)]
Dividers between servers i and (i+1) – gij at distance (1+ )j from li
Number of dividers is O(log1+  [li+1-li])
38
Counter number of clients between gik and gik+1 is #gik
For updates of client j:
 Find i such that Lj[li, li+1)
 Find k such that Lj[gik , gik+1)
 Update value #gik
Require
O(n log1+  U) space
O(log1+  U) updates
39
Max-RNNA(s1,.. sk) = maxi RNN-MAXDIST(si)
40
Details of the proof will be given in the future paper.
41
Di=max{RDi,LDi} for each si
The Proof is similar to previous theorem
42
Greedy Algorithm with limited backtracking finds:
 Minimal Number of active servers – K
 maxi RNN-MAXDIST(si)D
43
The proof will be given in the future paper.
44
Minimize maxi RNN-MAXDIST(si)
Given upper bound on number of servers K
Algorithm
1. Choose different values of D
2. Run Greedy Algorithm of Opt-RNNA
3. Repeat until solve with number of servers K*K
45
Nearest Neighbor and Reverse Nearest Neighbor Queries for
Let the space
aroundare
a query
point
q be
divided
into
six equal
Assumption:
the clients
on the
same
axis
as the
servers
Moving
Objects
regions
Si (1<=i<=6) by straight lines intersecting q. Si therefore is
the space between two space dividing lines.
R.Benetis,
C.S.Jensen,G.Karciauskas, S.Saltenis
L3
L2
s2 for Dynamic Databases
Reverse Nearest Neighbor Queries
s3
s1
SHOU Yu Tao
q
s4
L1
s6
s5
For a given 2-dimensional dataset, RNN(q) will return at most six
data points. And they are must be on the same circle centered at q.
46
47
The following aspects were tested:
Experimental data: CALIFORNIA – latitude of 63k buildings in
California, uniform and binomial distributions
48
49
50
51
52
 RNNA supports computations based on geographical distances or
vector-space similarity between servers and clients
 Applications of RNNA:
o Classical – facility location
o Emerging – fixed wireless telephony access and sensor-based
traffic monitoring
 Data of RNNA arrives in streams
 RNNA performs online computations
53
 We study three problems:
 Max-RNNA
 List-RNNA
 Opt-RNNA
 Two aggregates:
 COUNT
 MAXDIST
 Approximate algorithms with near-linear space usage
54
Any Questions?