PIRS: Query Verification on Data Streams





Ke Yi, Hong Kong University of Science and Technology
Feifei Li, Florida State University
Marios Hadjieleftheriou, AT&T Labs
George Kollios, Boston University
Divesh Srivastava, AT&T Labs
Work done while the first and second authors were at AT&T Labs.
Publishing Data and Outsourcing Query Service
[Figure: an IP traffic stream (011001…110…) from the network is fed to Gigascope, an analysis tool, which returns statistics as results.]
Revisiting the CISCO – AT&T Example
[Figure: an IP traffic stream (011001…110…) from the network is analyzed by Gigascope, which produces statistics.]
Lawyers: we sign the trust agreement.
Could (computer scientists) help?
Concrete Example
IP Stream: $p_m, \ldots, p_3, p_2, p_1$, where each packet $p_i$ carries (srcIP, destIP, packet_size)
Continuous Query:
SELECT SUM(packet_size) FROM IP_trace
GROUP BY srcIP, destIP
Answer (one row per reporting time, one column per group):
Time   Group 1   Group 2   Group 3   ...   Group n
5      10KB      2KB       150KB     ...   5KB
10     11KB      130KB     1MB       ...   20KB
13     ...
Continuous Query Verification (CQV) on Data Streams
1. Client registers the query.
2. Server reports the answer upon request.
Source of streams: both the client and the server monitor the same stream.
Server maintains the exact answer (Group 1, Group 2, Group 3, …).
Client maintains a synopsis X.
Query:
SELECT SUM(packet_size) FROM IP_Trace
GROUP BY src_ip, dest_ip
The Model for the Stream
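Stream S: tuples of the form agg_attribute | group_id arrive over time, e.g. 9|1 at T=1, 7|i at T=2, 1|1 at T=3, …
$V^T = (v_1, v_2, \ldots, v_n)$ is the vector of per-group aggregates at time $T$, with $\sum_{i=1}^{n} v_i \le m$.
[Figure: after the three tuples above, $v_1 = 10$, $v_i = 7$, and all other entries of $V^T$ are 0.]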
Continuous Query Verification: CQV
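[Figure: as tuples 9|1, 7|i, 1|1, … arrive on stream S at T=1, T=2, T=3, the server updates the exact vector $V^T$ and the client updates its synopsis $X^T$. When the server reports an answer, the client checks it against $X^T$ and either raises an alarm or raises no alarm.]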
PIRS: Polynomial Identity Random Synopsis
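Choose a prime $p$: $\max\{n, m/\delta\} < p \le 2\max\{n, m/\delta\}$
Choose a random number $\alpha \in \mathbb{Z}_p$
$X(V^T) = (\alpha - 1)^{v_1} \cdot (\alpha - 2)^{v_2} \cdots (\alpha - n)^{v_n} \bmod p$
Check: $X(V'^T) \stackrel{?}{=} X(V^T)$, where $V'^T$ is the answer reported by the server
Raise an alarm if they are not equal; otherwise no alarm
Decomposability: $X(V_a + V_b) = X(V_a) \cdot X(V_b)$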
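The definition above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' library: the helper names `choose_prime` and `pirs`, the trial-division primality test, and the demo values of `n`, `m` and `delta` are all hypothetical.

```python
import random

def is_prime(q: int) -> bool:
    """Trial division; adequate only for the small demo values used here."""
    if q < 2:
        return False
    f = 2
    while f * f <= q:
        if q % f == 0:
            return False
        f += 1
    return True

def choose_prime(n: int, m: int, delta: float) -> int:
    """Smallest prime above max{m/delta, n}; Bertrand's postulate keeps it within a factor of 2."""
    lower = max(int(m / delta), n)
    p = lower + 1
    while not is_prime(p):
        p += 1
    return p

def pirs(v, alpha, p):
    """X(V) = prod_i (alpha - i)^(v_i) mod p, with groups numbered 1..n."""
    x = 1
    for i, v_i in enumerate(v, start=1):
        x = x * pow((alpha - i) % p, v_i, p) % p
    return x

# Hypothetical demo parameters: n groups, total aggregate at most m, error bound delta.
n, m, delta = 5, 20, 0.01
p = choose_prime(n, m, delta)
alpha = random.randrange(p)      # the secret random seed, kept by the client

v_a = [3, 0, 2, 0, 1]
v_b = [1, 4, 0, 0, 2]
v_sum = [a + b for a, b in zip(v_a, v_b)]

# Decomposability: X(V_a + V_b) = X(V_a) * X(V_b) mod p
assert pirs(v_sum, alpha, p) == pirs(v_a, alpha, p) * pirs(v_b, alpha, p) % p
```

The client never stores the vector itself, only `p`, `alpha` and the current value of `X`; the full-vector function is used here just to check the decomposability property.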
Incremental Update to PIRS
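Stream S: 9|1 at T=1, 7|i at T=2, 1|1 at T=3, …
T=1, update to $v_1$: $X_1 = (\alpha - 1)^9$
T=2, update to $v_i$: $X_2 = X_1 \cdot (\alpha - i)^7$
T=3, update to $v_1$: $X_3 = X_2 \cdot (\alpha - 1)^1$
An update to group $i$ with value $u$ can be done in $O(\log u)$ time (exponentiation by squaring): $X_T = X_{T-1} \cdot (\alpha - i)^u$, with all arithmetic mod $p$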
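A minimal sketch of the incremental update, assuming a hypothetical class name `PirsSynopsis`, a small demo prime, and group 4 standing in for the slide's group i. Python's three-argument `pow` performs modular exponentiation, which needs O(log u) multiplications.

```python
import random

class PirsSynopsis:
    """Client-side state: only p, alpha and the running value X (hypothetical helper)."""

    def __init__(self, p: int, alpha: int):
        self.p = p
        self.alpha = alpha
        self.x = 1  # synopsis of the all-zero vector

    def update(self, group: int, u: int) -> None:
        """Process one tuple u|group: X <- X * (alpha - group)^u mod p."""
        self.x = self.x * pow((self.alpha - group) % self.p, u, self.p) % self.p

p = 2003                      # small prime chosen only for the demo
alpha = random.randrange(p)
syn = PirsSynopsis(p, alpha)

# Tuples written as (agg_attribute, group_id), mirroring 9|1, 7|i, 1|1 with i = 4.
for u, group in [(9, 1), (7, 4), (1, 1)]:
    syn.update(group, u)

# The result matches the synopsis of the final vector V = (10, 0, 0, 7).
expected = pow((alpha - 1) % p, 10, p) * pow((alpha - 4) % p, 7, p) % p
assert syn.x == expected
```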
It Solves the CQV Problem!
Theorem: Given any $W^T \ne V^T$ reported in place of the true answer $V^T$, PIRS raises an alarm with probability at least $1-\delta$.
1. If $V = W$, PIRS obviously raises no alarm.
2. If $V \ne W$, consider
$f_v(x) = (x-1)^{v_1}(x-2)^{v_2}\cdots(x-n)^{v_n}$, $f_w(x) = (x-1)^{w_1}(x-2)^{w_2}\cdots(x-n)^{w_n}$
$f_v(x) \equiv f_w(x)$ iff $V = W$: a polynomial with 1 as the leading coefficient is completely determined by its zeroes.
If $V \ne W$, $f_v(x) = f_w(x)$ happens at no more than $m$ values of $x$, due to the fundamental theorem of algebra ($f_v - f_w$ is a nonzero polynomial of degree at most $m$).
Since we have $p > m/\delta$ choices for $\alpha$, the probability that $X(V) = X(W)$ is at most $\delta$.
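The counting argument can be checked exhaustively on a toy instance. The sketch below assumes a deliberately tiny prime and two hand-picked vectors (all hypothetical values) and counts the seeds on which the two synopses collide; the argument above says this count cannot exceed m.

```python
def pirs(v, alpha, p):
    """X(V) = prod_i (alpha - i)^(v_i) mod p, with groups numbered 1..n."""
    x = 1
    for i, v_i in enumerate(v, start=1):
        x = x * pow((alpha - i) % p, v_i, p) % p
    return x

p = 101                     # tiny prime so every seed can be enumerated
v = [3, 0, 2, 1, 0]         # true answer
w = [3, 1, 2, 0, 0]         # wrong answer: the count of group 4 moved to group 2
m = max(sum(v), sum(w))     # bound on the degree of f_v and f_w

# Seeds alpha for which the wrong answer would go undetected.
collisions = [a for a in range(p) if pirs(v, a, p) == pirs(w, a, p)]

assert len(collisions) <= m
print(f"{len(collisions)} of {p} seeds miss the error (bound: {m})")
```

With a realistic prime p > m/δ the same bound means that at most a δ fraction of the seeds can miss any given error.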
Optimality of PIRS
Theorem: PIRS occupies $O(\log(m/\delta) + \log n)$ bits of space (at most 3 words: $p$, $\alpha$, $X(V)$), and spends $O(1)$ time to process a tuple for a count query, or $O(\log u)$ time to process a tuple for a sum query.
Theorem: Any synopsis for solving the CQV problem with error probability at most $\delta$ has to keep $\Omega(\log(\min\{n,m\}/\delta))$ bits.
Multiple Queries
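[Figure: query Q1 has vector $V_{1..n_1}$ with synopsis $X_1$, and query Q2 has vector $V_{1..n_2}$ with synopsis $X_2$. The two vectors are concatenated into one vector $V_{1..(n_1+n_2)}$ with a single synopsis $X$. A stream tuple 9|1,8 triggers an update to $v_1$ and an update to $v_8$.]
Theorem: our synopses use constant space for multiple queries.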
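One way to read the multiple-query slide is that the group ids of the second query are shifted by n1, so that a single synopsis covers the concatenated vector. The sketch below illustrates that reading; the sizes `n1 = 7` and `n2 = 5`, the demo prime, and the helper names are assumptions made for the example.

```python
import random

def pirs(v, alpha, p):
    """X(V) = prod_i (alpha - i)^(v_i) mod p, with groups numbered 1..n."""
    x = 1
    for i, v_i in enumerate(v, start=1):
        x = x * pow((alpha - i) % p, v_i, p) % p
    return x

n1, n2 = 7, 5               # hypothetical numbers of groups for Q1 and Q2
p = 2003                    # small demo prime, larger than n1 + n2
alpha = random.randrange(p)

def update(x, global_group, u):
    """Fold one tuple into the shared synopsis."""
    return x * pow((alpha - global_group) % p, u, p) % p

# A tuple with value 9 falls in group 1 of Q1 and group 1 of Q2.
# Q2's ids are shifted by n1, so the tuple touches v_1 and v_8.
x = 1
for global_group in (1, n1 + 1):
    x = update(x, global_group, 9)

v_q1 = [9, 0, 0, 0, 0, 0, 0]
v_q2 = [9, 0, 0, 0, 0]
assert x == pirs(v_q1 + v_q2, alpha, p)   # one synopsis for both queries
```

Under this reading the space stays at three words no matter how many queries are registered, although the prime must be chosen for the combined number of groups and the combined total aggregate.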
Handle the Load Shedding

Semantic Load Shedding: drop tuples from certain groups
  A small number of groups have errors
Random Load Shedding:
  All groups have a small amount of error
CQV with Semantic Load Shedding
Randomly drop certain tuples according to groups
[Figure: stream 9|1, 7|i, 2|j, 1|1, 4|k, 5|1, … with the tuples of some groups dropped.]
The server claims that at most γ groups have errors.
Goal: detect whether more than γ groups have errors.
We have designed synopses that use $O(\gamma \log(1/\delta) \log n)$ bits of space and achieve error probability at most δ.
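PIRSγ: An Exact Solution
$k = c_1 \gamma^2$ for $c_1 \ge 4.819$
$b$: a pairwise independent hash function that maps $\{1, \ldots, n\}$ uniformly to $\{1, \ldots, k\}$, e.g., $b(i) = ((x i + y) \bmod p) \bmod k$
[Figure: $\log(1/\delta)$ layers, each with $k$ buckets and one PIRS per bucket. An update to $v_8$ goes to the bucket chosen by the hash, e.g. $b(8) = 2$. A layer raises an alarm if at least γ of its buckets raise alarms; the synopsis raises an alarm if at least one layer raises an alarm.]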
PIRSγ: An Exact Solution
Theorem: PIRSγ requires $O(\gamma^2 \log(1/\delta) \log n)$ bits, spends $O(\log(1/\delta))$ time to process a tuple, and solves CQV with semantic load shedding.
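The layered structure can be sketched as follows. This is an illustration under several assumptions rather than the authors' implementation: the class name `PirsGamma`, base-2 logarithms for the number of layers, an independent seed per bucket, reuse of the PIRS prime as the hash modulus, and buckets numbered from 0 are all choices made for the example.

```python
import math
import random

class PirsGamma:
    """log(1/delta) layers of k = c1 * gamma^2 buckets, one PIRS value per bucket."""

    def __init__(self, gamma, delta, p, c1=4.819, rng=None):
        rng = rng or random.Random()
        self.p, self.gamma = p, gamma
        self.k = math.ceil(c1 * gamma ** 2)
        self.layers = max(1, math.ceil(math.log2(1 / delta)))
        # Per layer: coefficients (a, b) of the hash ((a*i + b) mod p) mod k.
        self.hashes = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(self.layers)]
        # One independent seed alpha per bucket (an assumption of this sketch).
        self.alphas = [[rng.randrange(p) for _ in range(self.k)] for _ in range(self.layers)]
        self.x = [[1] * self.k for _ in range(self.layers)]

    def _bucket(self, layer, group):
        a, b = self.hashes[layer]
        return ((a * group + b) % self.p) % self.k

    def _apply(self, table, group, u):
        # One bucket per layer is touched, so a tuple costs O(log(1/delta)) updates.
        for layer in range(self.layers):
            j = self._bucket(layer, group)
            base = (self.alphas[layer][j] - group) % self.p
            table[layer][j] = table[layer][j] * pow(base, u, self.p) % self.p

    def update(self, group, u):
        """Process one stream tuple u|group."""
        self._apply(self.x, group, u)

    def alarm(self, answer):
        """answer maps group id -> aggregate claimed by the server."""
        claimed = [[1] * self.k for _ in range(self.layers)]
        for group, total in answer.items():
            self._apply(claimed, group, total)
        # A layer alarms if at least gamma buckets disagree; any alarming layer suffices.
        return any(
            sum(c != x for c, x in zip(claimed[layer], self.x[layer])) >= self.gamma
            for layer in range(self.layers)
        )

rng = random.Random(7)
syn = PirsGamma(gamma=3, delta=0.01, p=1_000_003, rng=rng)
truth = {}
for _ in range(500):
    group, u = rng.randrange(1, 1001), rng.randrange(1, 10)
    syn.update(group, u)
    truth[group] = truth.get(group, 0) + u

assert syn.alarm(truth) is False   # the correct answer never raises an alarm
```

An answer with fewer than γ wrong groups can disturb fewer than γ buckets in each layer, so it never raises an alarm in this sketch; answers with many wrong groups are caught with high probability rather than with certainty.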
Intuition on Approximation
[Figure: probability of raising an alarm (y-axis) versus number of errors (x-axis, marked at $\gamma^-$, $\gamma$, $\gamma^+$), comparing the ideal synopsis with the approximation.]
PIRS±γ: An Approximate Solution
Theorem: PIRS±γ requires $O(\gamma \log(1/\delta) \log n)$ bits and spends $O(\gamma \log(1/\delta))$ time to process a tuple.
CQV with Random Load Shedding
Randomly drop tuples
All groups have small errors
Goal: detect whether any group has an error greater than a claimed threshold
Theorem: Any synopsis that solves this problem with error probability at most δ requires at least Ω(n) bits (by a reduction involving the problem of estimating the infinite frequency moment: the number of occurrences of the most frequent item).
Sliding Window and Other Queries



PIRS extends easily to the sliding window model because it is decomposable, i.e., X(v1+v2)=X(v1)*X(v2).
PIRS also handles other queries that can be transformed into Group By aggregation queries.
Details are in the paper.
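One possible way to use decomposability for sliding windows, which may differ from the construction in the paper, is to keep one synopsis per pane of the window and multiply the pane synopses on demand. The class name `PanedWindow`, the pane granularity, and the demo prime below are assumptions.

```python
import random
from collections import deque

p = 2003                    # small demo prime
alpha = random.randrange(p)

def pirs(v, alpha, p):
    """X(V) = prod_i (alpha - i)^(v_i) mod p, with groups numbered 1..n."""
    x = 1
    for i, v_i in enumerate(v, start=1):
        x = x * pow((alpha - i) % p, v_i, p) % p
    return x

class PanedWindow:
    """Sliding window split into panes, with one PIRS value per pane."""

    def __init__(self, num_panes: int):
        self.panes = deque([1] * num_panes, maxlen=num_panes)

    def update(self, group: int, u: int) -> None:
        """Fold tuple u|group into the newest pane."""
        self.panes[-1] = self.panes[-1] * pow((alpha - group) % p, u, p) % p

    def advance(self) -> None:
        """Open a new pane; the oldest pane falls out of the window."""
        self.panes.append(1)

    def synopsis(self) -> int:
        """X(window) is the product of the pane synopses, by decomposability."""
        x = 1
        for pane in self.panes:
            x = x * pane % p
        return x

window = PanedWindow(num_panes=2)
window.update(1, 9)          # pane A: 9|1
window.advance()
window.update(3, 7)          # pane B: 7|3
window.advance()             # pane A expires
window.update(1, 1)          # pane C: 1|1

# The window now holds panes B and C, i.e. V = (1, 0, 7).
assert window.synopsis() == pirs([1, 0, 7], alpha, p)
```

The space grows with the number of panes rather than with the number of groups, which is the trade-off this sketch makes for avoiding any division in the field.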
Some Experiments

We use real streams:
  World Cup Data (WC)
  IP traces from the AT&T network (IP)
We perform the following query:
  WC: Aggregate on response size and group by client id/object id (50M groups)
  IP: Aggregate on packet size and group by source IP/destination IP (7M groups)
Hardware for the client:
  2.8GHz Intel Pentium 4 CPU
  512 MB memory
  Linux machine
Detection Accuracy
$n = 2^{64} \approx 2 \times 10^{19}$, $m = 10^{10}$, hence $\delta = m/p \approx 0.5 \times 10^{-9}$
n is determined by the potential number of groups, not the actual number of groups
Over 100,000 random attacks, PIRS identifies all of them.
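The arithmetic behind the quoted error probability can be reproduced directly. The sketch assumes that n dominates the choice of the prime, so that p lies between n and 2n; the variable names are illustrative.

```python
n = 2 ** 64        # potential number of groups
m = 10 ** 10       # bound on the total aggregate

# With n < p <= 2n, the failure probability m/p lies in this range.
upper = m / n          # approached when p is just above n
lower = m / (2 * n)    # reached when p is as large as 2n

print(f"{lower:.2e} <= m/p < {upper:.2e}")
```

Both ends of the range are a few times 10^-10, which agrees with the order of magnitude quoted on the slide.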
Memory Usage of Exact
Exact’s memory usage is linear and expensive.
PIRS uses only a constant 3 words (27 bytes) at all times.
Update Time (per tuple) of Exact
[Figure annotation: cache misses and memory swap.]
1. Exact is fast when memory usage is small.
2. It becomes extremely slow due to cache misses and memory swap operations.
Running Time Analysis
Average Update Time   WC        IPs
Count                 0.98 μs   0.98 μs
Sum                   8.01 μs   6.69 μs
IPs exhibits a smaller update cost for the sum query because its average value of u is smaller than that of WC.
Multiple Queries: Exact Memory Usage
Exact’s memory usage is linear w.r.t. the number of queries and increases over time.
PIRS always uses only a constant 3 words (27 bytes).
Multiple Queries: Exact Update Time Per Tuple
Multiple Queries: PIRS Update Time Per Tuple
The Library
Download PIRS and other synopses at:
http://www.cs.fsu.edu/~lifeifei/pirs/
Conclusion



A space- and update-efficient synopsis for verifying continuous group-by aggregation queries on streaming data.
It can be generalized to handle selection queries and sliding-window semantics.
How about more complicated queries?
Thanks!

Questions
Problem and Goals

Assumption:
  Client and DSMS observe the same stream
Problem:
  Client needs to verify the results
Goals:
  Be memory and update efficient
  Tolerance for a limited number of errors
  Tolerance for small errors
  Support multiple queries
Related Techniques to PIRS

Incremental Cryptography
  Block operations (insert, delete); cannot support arithmetic operations
Program Verification
  Server may pass the program execution but simply return random outputs
Fingerprinting Technique
  PIRS is a fingerprinting technique
CQV with Semantic Load Shedding
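$E(V', V) = \{\, i \mid v'_i \ne v_i \,\}$, where $V'$ is the answer reported by the server
$V' =_\gamma V$ iff $|E(V', V)| < \gamma$
$V' \ne_\gamma V$ iff $|E(V', V)| \ge \gamma$
Design a synopsis that raises an alarm with probability at least $1 - \delta$ if $V' \ne_\gamma V$ and raises no alarm if $V' =_\gamma V$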
PIRS±γ: An Approximate Solution
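Theorem: PIRS±γ:
1. raises no alarm with probability at least $1-\delta$ on any $V' =_{\gamma^-} V$, where $\gamma^- = \left(1 - \frac{c}{\ln \gamma}\right)\gamma$;
2. raises an alarm with probability at least $1-\delta$ on any $V' \ne_{\gamma^+} V$, where $\gamma^+ = \left(1 + \frac{c}{\ln \gamma}\right)\gamma$,
for any $c > -\ln\ln 2 \approx 0.367$. Here $V'$ is the answer reported by the server.
The proof uses the intuition of the coupon collector problem and the Chernoff bound.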
PIRS±γ: An Approximate Solution
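Choose $k$ such that $\gamma = [k \ln k]$ (rounded to an integer)
$b_1, \ldots, b_n$: $n$ $\gamma^+$-wise independent random numbers uniformly distributed in $\{1, \ldots, k\}$
[Figure: $\log(1/\delta)$ layers, each with $k$ buckets and one PIRS per bucket. An update to $v_i$ goes to bucket $b_i$, e.g. $b_i = 2$. A layer raises an alarm if all $k$ of its buckets raise alarms; the synopsis raises an alarm if the majority of layers raise alarms.]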
Information Disclosure on Multiple Attacks
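$R$: space of random seeds used by PIRS
witness: $W(V', V) = \{\, r \in R \mid \text{PIRS raises an alarm on } r \,\}$, where $V'$ is the answer returned by the server
non-witness: $\overline{W}(V', V) = R - W(V', V)$
$|\overline{W}(V', V)| = |R|$, if $V' = V$
$|\overline{W}(V', V)| \le \delta |R|$, if $V' \ne V$
[Figure: the client runs PIRS, $X(V)$, on seed $r$; the server returns $V'$. If $V' = V$, the server learns nothing about $r$. If $V' \ne V$ and the server received an alarm, it learns $r \in W(V', V)$.]
Insight: the server could potentially get rid of a δ portion of the seeds from each notified failed attack!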
Information Disclosure on Multiple Attacks
Theorem: For a total of $k$ attacks made by Bob against PIRS, the probability that none of them succeeds is at least $1 - k\delta$.
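One way to see where a bound of this form can come from, offered as a sketch rather than as the authors' proof: assume that each notified failed attack lets Bob discard at most $\delta|R|$ seeds, so that after $j$ failures the seed is still uniform over at least $(1 - j\delta)|R|$ candidates, while any single attack succeeds on at most $\delta|R|$ seeds. Assuming $k\delta < 1$, the product telescopes:

```latex
\Pr[\text{attack } j+1 \text{ fails} \mid \text{first } j \text{ attacks failed}]
  \;\ge\; 1 - \frac{\delta |R|}{(1 - j\delta)\,|R|}
  \;=\; \frac{1 - (j+1)\delta}{1 - j\delta},
\qquad
\Pr[\text{all } k \text{ attacks fail}]
  \;\ge\; \prod_{j=0}^{k-1} \frac{1 - (j+1)\delta}{1 - j\delta}
  \;=\; 1 - k\delta .
```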
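Proof of the Optimality
A synopsis $X$ is a function $f : U \to M$ drawn from a family $F = \{f_1, f_2, \ldots\}$, with $f = f_i$ chosen with probability $p(f_i)$, assuming $p(f_1) \ge p(f_2) \ge \cdots$
$X$ needs at least $\log|M|$ bits for its output and $\log|F|$ bits to describe the function from $F$
$F_{V',V} = \{\, f \in F \mid f(V') = f(V) \,\}$
For $X$ to have error probability at most δ: $\sum_{f \in F_{V',V}} p(f) \le \delta$ for any $V' \ne V$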
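Proof of the Optimality
Let $k = \lceil \delta |F| \rceil + 1$, and consider $f_1, \ldots, f_k$:
$\sum_{i=1}^{k} p(f_i) > \delta$
There is a total of $|M|^k$ possible combinations for the outputs of these $k$ functions
By the pigeonhole principle, $|U| \le |M|^k$
$|U| = 2^n$
$\log|U| \le (\lceil \delta |F| \rceil + 1) \log|M|$
$|F| \log|M| = \Omega(n/\delta)$
If $\log|F| \le (1-\epsilon)\log(n/\delta)$, then $\log|M| = \Omega((n/\delta)^{\epsilon})$
else: $\log|F| = \Omega(\log(n/\delta))$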