PIRS: Query Verification on Data Streams

Ke Yi, Hong Kong University of Science and Technology
Feifei Li, Florida State University
Marios Hadjieleftheriou, AT&T Labs
George Kollios, Boston University
Divesh Srivastava, AT&T Labs
(Work done while the first and second authors were working at AT&T Labs.)

Publishing Data and Outsourcing Query Service
[Figure: an IP traffic stream (011001…110…) coming from the network is processed by Gigascope, an analysis tool, which returns statistics as results.]

Revisiting the CISCO – AT&T Example
[Figure: the network's IP traffic stream (011001…110…) flows into Gigascope, which returns statistics. Callouts: trust; lawyers: sign the agreement; computer scientists: could we help?]

Concrete Example
IP stream: $p_m, \ldots, p_3, p_2, p_1$, where each packet is (srcIP, destIP, packet_size)
Continuous query: SELECT SUM(packet_size) FROM IP_trace GROUP BY srcIP, destIP
Answer:

Time | Group 1 | Group 2 | Group 3 | ... | Group n
5    | 10KB    | 2KB     | 150KB   | ... | 5KB
10   | 11KB    | 130KB   | 1MB     | ... | 20KB
13   | ...

Continuous Query Verification (CQV) on Data Streams
1. The client registers a query, e.g., SELECT SUM(packet_size) FROM IP_Trace GROUP BY src_ip, dest_ip.
2. The server reports the answer upon request.
Both the client and the server monitor the same stream from the source of streams.
The server maintains the exact answer (Group 1, Group 2, Group 3, …).
The client maintains a synopsis X.

The Model for the Stream
Stream S: tuples of the form agg_attribute | group_id, e.g., 9|1 at T=1, 7|i at T=2, 1|1 at T=3, …
The stream defines a vector $V^T = (v_1, v_2, \ldots, v_n)$ of per-group aggregates at time T, with $\sum_{i=1}^{n} v_i \le m$.
After the three tuples above, $v_1 = 10$, $v_i = 7$, and every other entry is 0.

Continuous Query Verification: CQV
[Figure: each tuple of S (9|1, 7|i, 1|1, …) updates the server's vector $V^T$ and the client's synopsis $X^T$. A reported vector is checked against the synopsis: a vector that differs from $V^T$ raises an alarm, the correct vector raises no alarm.]

PIRS: Polynomial Identity Random Synopsis
Choose a prime p with $\max\{n, m/\delta\} < p \le 2\max\{n, m/\delta\}$.
Choose a random number $\alpha \in \mathbb{Z}_p$.
Synopsis: $X(V^T) = (\alpha - 1)^{v_1} (\alpha - 2)^{v_2} \cdots (\alpha - n)^{v_n} \bmod p$
Check, for the vector $W^T$ reported by the server: is $X(W^T) = X(V^T)$? Raise an alarm if not equal; otherwise no alarm.
Decomposability: $X(V_a + V_b) = X(V_a) \cdot X(V_b)$

Incremental Update to PIRS
T=1, tuple 9|1 (update to $v_1$): $X_1 = (\alpha - 1)^9$
T=2, tuple 7|i (update to $v_i$): $X_2 = X_1 \cdot (\alpha - i)^7$
T=3, tuple 1|1 (update to $v_1$): $X_3 = X_2 \cdot (\alpha - 1)^1$
An update to group i with value u can be done in O(log u) time using exponentiation by squaring: $X_t = X_{t-1} \cdot (\alpha - i)^u$ (all arithmetic is mod p).

It Solves the CQV Problem!
Theorem: Given any $W^T \ne V^T$, PIRS raises an alarm with probability at least $1 - \delta$.
1. If V = W, PIRS obviously raises no alarm.
2. If V ≠ W, let
   $f_v(x) = (x - 1)^{v_1} (x - 2)^{v_2} \cdots (x - n)^{v_n}$ and $f_w(x) = (x - 1)^{w_1} (x - 2)^{w_2} \cdots (x - n)^{w_n}$.
   Then $f_v(x) \equiv f_w(x)$ iff V = W, because a polynomial with 1 as its leading coefficient is completely determined by its zeroes.
   If V ≠ W, $f_v(x) = f_w(x)$ happens at no more than m values of x, by the fundamental theorem of algebra: $f_v - f_w$ is a nonzero polynomial of degree at most m.
   Since there are p > m/δ choices for α, the probability that X(V) = X(W) is at most δ.

Optimality of PIRS
Theorem: PIRS occupies $O(\log(m/\delta) + \log n)$ bits of space (at most 3 words, i.e., p, α, and X(V)), and spends O(1) time to process a tuple for a count query, or O(log u) time to process a tuple for a sum query.
Theorem: Any synopsis for solving the CQV problem with error probability at most δ has to keep $\Omega(\log(\min\{n, m\}/\delta))$ bits.

Multiple Queries
[Figure: instead of one synopsis $X_1$ for the vector $V_{1..n_1}$ of query Q1 and another synopsis $X_2$ for the vector $V_{1..n_2}$ of query Q2, the two vectors are combined into $V_{1..(n_1+n_2)}$ with a single synopsis X. A tuple 9|1,8 of stream S triggers an update to $v_1$ and an update to $v_8$.]
Theorem: Our synopses use constant space for multiple queries.

Handle the Load Shedding
Semantic load shedding: drop tuples from certain groups, so a small number of groups have errors.
Random load shedding: all groups have a small amount of error.

CQV with Semantic Load Shedding
Certain tuples are randomly dropped according to their groups, e.g., from the stream 9|1 7|i 2|j 1|1 4|k 5|1 …
The server claims that at most γ groups have errors.
Goal: detect whether more than γ groups have errors.
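To make this goal concrete, the sketch below computes the set of groups on which a reported vector disagrees with the true one and checks the server's claim directly. It needs the full true vector, so it is only a reference check, not a synopsis. The function names and the dict representation of vectors are assumptions made for illustration.

```python
def error_groups(reported, true):
    """Return the ids of groups whose reported aggregate differs from the true one.

    Both vectors are dicts {group_id: aggregate}; missing groups count as 0.
    """
    groups = set(reported) | set(true)
    return {g for g in groups if reported.get(g, 0) != true.get(g, 0)}


def violates_claim(reported, true, gamma):
    """True if more than gamma groups have errors, i.e. an alarm is warranted."""
    return len(error_groups(reported, true)) > gamma


# Hypothetical example: tuples for groups 1 and 3 were dropped by the server.
true_vector = {1: 15, 2: 7, 3: 2}
one_error = {1: 10, 2: 7, 3: 2}
two_errors = {1: 10, 2: 7}

print(violates_claim(one_error, true_vector, gamma=1))   # False
print(violates_claim(two_errors, true_vector, gamma=1))  # True
```

The client cannot afford to store the true vector, which is why a small-space synopsis is needed for the same decision.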
We have designed synopses that use $O(\gamma \log(1/\delta) \log n)$ bits of space and achieve an error probability of at most δ.

PIRSγ: An Exact Solution
Let $k = c_1 \gamma^2$ for $c_1 = 4.819$.
Let b be a pairwise independent hash function that maps {1, …, n} uniformly to {1, …, k}, e.g., $b(i) = ((x i + y) \bmod p) \bmod k$. For example, if b(8) = 2, updates to $v_8$ go to bucket 2.
[Figure: log(1/δ) layers, each consisting of k buckets with one PIRS per bucket. A layer raises an alarm if at least γ of its buckets raise alarms; the synopsis raises an alarm if at least one layer raises an alarm.]
Theorem: PIRSγ requires $O(\gamma^2 \log(1/\delta) \log n)$ bits, spends $O(\log(1/\delta))$ time to process a tuple, and solves CQV with semantic load shedding.

Intuition on Approximation
[Figure: probability of raising an alarm versus number of errors. The ideal synopsis jumps from 0 to 1 exactly at γ; the approximation rises gradually between $\gamma^-$ and $\gamma^+$.]

PIRS±γ: An Approximate Solution
Theorem: PIRS±γ requires $O(\gamma \log(1/\delta) \log n)$ bits and spends $O(\gamma \log(1/\delta))$ time to process a tuple.

CQV with Random Load Shedding
Tuples are dropped at random, so all groups have small errors.
Goal: detect whether any group has an error greater than a claimed threshold.
Theorem: Any synopsis that solves this problem with error probability at most δ requires at least Ω(n) bits (shown via a reduction involving the problem of estimating the infinite frequency moment, i.e., the number of occurrences of the most frequent item).

Sliding Window and Other Queries
It is easy to extend PIRS to the sliding window model because it is decomposable, i.e., $X(V_1 + V_2) = X(V_1) \cdot X(V_2)$.
Other queries can be handled if they can be transformed into group-by aggregation queries.
Details are in the paper.
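The basic PIRS synopsis, its incremental update, and the decomposability property used for sliding windows can be illustrated with a minimal Python sketch. The class name `Pirs`, the fixed Mersenne prime `P = 2**61 - 1`, and the dict representation of vectors are assumptions made for illustration and are not taken from the authors' library; the sketch assumes that the prime required by the analysis is no larger than `P`.

```python
import random

# Assumption: 2**61 - 1 is a known Mersenne prime, used here in place of a
# prime picked from the range required by the analysis.
P = 2**61 - 1


class Pirs:
    """Hypothetical PIRS-style synopsis: X(V) = prod_i (alpha - i)^(v_i) mod p."""

    def __init__(self, p=P, alpha=None):
        self.p = p
        # alpha must stay hidden from the server; pass it only for repeatable tests.
        self.alpha = random.randrange(p) if alpha is None else alpha
        self.x = 1  # synopsis of the all-zero vector

    def _factor(self, group, value):
        # Three-argument pow performs fast modular exponentiation,
        # which needs O(log value) multiplications.
        return pow((self.alpha - group) % self.p, value, self.p)

    def update(self, group, value=1):
        """Process one stream tuple value|group."""
        self.x = self.x * self._factor(group, value) % self.p

    def digest(self, vector):
        """Synopsis of a full vector given as {group_id: aggregate}."""
        x = 1
        for group, value in vector.items():
            x = x * self._factor(group, value) % self.p
        return x

    def raises_alarm(self, reported):
        """Compare the synopsis of the reported vector with the maintained one."""
        return self.digest(reported) != self.x


# alpha is fixed here only so that the example is repeatable.
client = Pirs(alpha=123456789)
for value, group in [(9, 1), (7, 5), (1, 1)]:
    client.update(group, value)

correct = {1: 10, 5: 7}
tampered = {1: 10, 5: 6}
print(client.raises_alarm(correct))   # False
print(client.raises_alarm(tampered))  # True

# Decomposability: X(Va + Vb) = X(Va) * X(Vb) mod p
va, vb = {1: 9, 5: 7}, {1: 1}
assert client.digest(correct) == client.digest(va) * client.digest(vb) % client.p
```

In a real deployment `alpha` would be drawn at random and kept secret; a `random` generator is used only to keep the sketch short, and a cryptographically strong source would be a more cautious choice.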
Some Experiments
We use real streams:
World Cup data (WC)
IP traces from the AT&T network (IP)
We perform the following queries:
WC: aggregate on response size and group by client id/object id (50M groups)
IP: aggregate on packet size and group by source IP/destination IP (7M groups)
Hardware for the client: 2.8GHz Intel Pentium 4 CPU, 512 MB memory, Linux machine

Detection Accuracy
$n = 2^{64} \approx 2 \times 10^{19}$, $m = 10^{10}$, hence $m/p \le 0.5 \times 10^{-9}$.
n is determined by the potential number of groups, not the actual number of groups.
Over 100,000 random attacks, PIRS identifies all of them.

Memory Usage of Exact
Exact's memory usage is linear and expensive.
PIRS uses only a constant 3 words (27 bytes) at all times.

Update Time (per tuple) of Exact
[Figure: per-tuple update time of Exact, annotated "cache misses and memory swap".]
1. Exact is fast when memory usage is small.
2. It becomes extremely slow due to cache misses and memory swap operations.

Running Time Analysis
Average update time:

      | WC      | IPs
Count | 0.98 μs | 0.98 μs
Sum   | 8.01 μs | 6.69 μs

IPs exhibits a smaller update cost for the sum query because its average value of u is smaller than that of WC.

Multiple Queries: Exact Memory Usage
Exact's memory usage is linear w.r.t. the number of queries and increases over time.
PIRS always uses only a constant 3 words (27 bytes).

Multiple Queries: Exact Update Time Per Tuple
[Figure]

Multiple Queries: PIRS Update Time Per Tuple
[Figure]

The Library
Download PIRS and other synopses at: http://www.cs.fsu.edu/~lifeifei/pirs/

Conclusion
A space- and update-efficient synopsis for verifying continuous group-by aggregation queries on streaming data.
It can be generalized to handle selection queries and sliding-window semantics.
How about more complicated queries?

Thanks!
Questions

Problem and Goals
Assumption: the client and the DSMS observe the same stream.
Problem: the client needs to verify the results.
Goals:
Be memory and update efficient
Tolerate a limited number of errors
Tolerate small errors
Support multiple queries

Related Techniques to PIRS
Incremental cryptography: block operations (insert, delete); cannot support arithmetic operations.
Program verification: the server may pass the program execution check but simply return random outputs.
Fingerprinting technique: PIRS is a fingerprinting technique.

CQV with Semantic Load Shedding
Let V be the true vector and V' the vector reported by the server.
$E(V', V) = \{ i \mid v'_i \ne v_i \}$
$V' =_\gamma V$ iff $|E(V', V)| \le \gamma$
$V' \ne_\gamma V$ iff $|E(V', V)| > \gamma$
Design a synopsis that raises an alarm with probability at least $1 - \delta$ if $V' \ne_\gamma V$, and raises no alarm if $V' =_\gamma V$.

PIRS±γ: An Approximate Solution
Theorem: PIRS±γ
1. raises no alarm with probability at least $1 - \delta$ on any $V' =_{\gamma^-} V$, where $\gamma^- = (1 - \frac{c}{\ln \gamma})\gamma$;
2. raises an alarm with probability at least $1 - \delta$ on any $V' \ne_{\gamma^+} V$, where $\gamma^+ = (1 + \frac{c}{\ln \gamma})\gamma$;
for any $c > -\ln \ln 2 \approx 0.367$.
The proof uses the intuition of the coupon collector problem and the Chernoff bound.

PIRS±γ: An Approximate Solution
Choose k such that $k \ln k = \gamma$.
Let $b_1, \ldots, b_n$ be n-wise independent random numbers uniformly distributed in {1, …, k}.
[Figure: log(1/δ) layers, each consisting of k buckets with one PIRS per bucket. For example, if $b_i = 2$, updates to $v_i$ go to bucket 2. A layer raises an alarm if all k of its buckets raise alarms; the synopsis raises an alarm if a majority of the layers raise alarms.]

Information Disclosure on Multiple Attacks
R: the space of random seeds used by PIRS
Witness set: $W(V', V) = \{ r \in R \mid \text{PIRS raises an alarm on } r \}$
Non-witness set: $\overline{W}(V', V) = R - W(V', V)$
$|\overline{W}(V', V)| \le \delta |R|$ if $V' \ne V$
$|\overline{W}(V', V)| = |R|$ if $V' = V$
[Figure: PIRS computes X(V) on seed r and the server returns V'. If $V' = V$, the server learns nothing about r. If $V' \ne V$ and an alarm is raised, the server learns that $r \in W(V', V)$.]
Insight: with each notified failed attack ($V' \ne V$ and an alarm is received), the server can potentially rule out a δ portion of the seeds in R.

Information Disclosure on Multiple Attacks
Theorem: For a total of k attacks made by Bob on PIRS, the probability that none of them succeeds is at least $1 - k\delta$.
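One way to see this bound is a union bound over the non-witness sets. The sketch below assumes that Bob's strategy is deterministic, so the attack vectors $V'_1, \ldots, V'_k$ submitted while every earlier attack fails form a fixed sequence, and that the seed r is uniform over R. The indexing $V'_i$ is notation introduced here for illustration. Attack i succeeds exactly when r falls in the non-witness set $\overline{W}(V'_i, V)$, which has size at most $\delta|R|$ because $V'_i \ne V$.

```latex
\begin{align*}
\Pr[\text{some attack succeeds}]
  &= \Pr\Big[\, r \in \bigcup_{i=1}^{k} \overline{W}(V'_i, V) \Big] \\
  &\le \sum_{i=1}^{k} \frac{|\overline{W}(V'_i, V)|}{|R|} \\
  &\le k\delta .
\end{align*}
```

Hence the probability that none of the k attacks succeeds is at least $1 - k\delta$.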
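Proof of the Optimality
A synopsis X is a function $f: U \to M$ drawn from a family $F = \{f_1, f_2, \ldots\}$, where $f_i$ is chosen with probability $p(f_i)$; assume $p(f_1) \ge p(f_2) \ge \cdots$.
X needs at least $\log |M|$ bits for its output and $\log |F|$ bits to describe the function from F.
For $V \ne V'$, let $F_{V,V'} = \{ f \in F \mid f(V) = f(V') \}$.
For X to have error probability at most δ: $\sum_{f \in F_{V,V'}} p(f) \le \delta$.

Proof of the Optimality
Let $k = \delta |F| + 1$ and consider $f_1, \ldots, f_k$: $\sum_{i=1}^{k} p(f_i) > \delta$.
There is a total of $|M|^k$ possible combinations for the outputs of these k functions, so by the pigeonhole principle $|U| \le |M|^k$.
With $|U| \ge 2^n$: $\log |U| \le (\delta |F| + 1) \log |M|$.
[Case analysis on $\delta |F| \log |M|$ relative to n, giving lower bounds on $\log |F|$ and $\log |M|$ in each case; the exact expressions are garbled in the source.]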