Download Slides - VLDB 2009

Coordinated weighted sampling for estimating aggregates over multiple weight assignments Edith Cohen, AT&T Research Haim Kaplan, Tel Aviv University Shubho Sen, AT&T Research Data model • Universe U of items: i1 w1(i1) w2(i1) w3(i1) i2 w1(i2) w2(i2) w3(i2) i3 i4 w1(i3) w1(i4) w2(i3) w2(i4) w3(i3) w3(i4) ………… IP addresses traffic day 1 traffic day 2 traffic day 3 Multiple weight assignments defined over the items Data model (example) • Universe U of items: i1 w1(i1) w2(i1) w3(i1) i2 w1(i2) w2(i2) w3(i2) i3 i4 w1(i3) w1(i4) w2(i3) w2(i4) w3(i3) w3(i4) ………… facebook users male friends female friends online per day Data model (example) • Universe U of items: i1 w1(i1) w2(i1) w3(i1) i2 w1(i2) w2(i2) w3(i2) i3 i4 w1(i3) w1(i4) w2(i3) w2(i4) w3(i3) w3(i4) ………… customers wait time item 1 wait time item 2 wait time item 3 Data model (example) • Universe U of items: i1 w1(i1) w2(i1) w3(i1) i2 w1(i2) w2(i2) w3(i2) i3 i4 w1(i3) w1(i4) w2(i3) w2(i4) w3(i3) w3(i4) ………… license plates GW bridge day 1 GW bridge day 2 GW bridge day 3 Data model (example) • Universe U of items: i1 w1(i1) w2(i1) w3(i1) i2 w1(i2) w2(i2) w3(i2) i3 i4 w1(i3) w1(i4) w2(i3) w2(i4) w3(i3) w3(i4) ………… web pages bytes out links in links Aggregation queries • Selection predicate d(i) over items in U • compute the sum of the weights of all items for which the predicate holds:  w(i) i|d ( i ) true Examples: • Total number of links out of pages of some web site • Total number of flows into a particular network This work:Aggregates depending on more than one weight function • Selection predicate d(i) over items in U • Want to compute the sum of f(i) for all items for which the predicate holds:  f (i ) i|d ( i ) true f(i) depends on some subset of the weights given to i Simple examples: f(i) = maxbwb(i), f(i) = minbwb(i), f(i)=maxbwb(i)-minbwb(i) Data model (example) • Universe U of items: i1 w1(i1) w2(i1) w3(i1) i2 w1(i2) w2(i2) w3(i2) i3 i4 w1(i3) w1(i4) w2(i3) w2(i4) w3(i3) w3(i4) ………… customers wait time item 1 wait time item 2 wait time item 3 f(i) = maxbwb(i) aggregated over a subset gives the total waiting time of these customers Data model • Universe U of items: i1 w1(i1) w2(i1) w3(i1) i2 w1(i2) w2(i2) w3(i2) i3 i4 w1(i3) w1(i4) w2(i3) w2(i4) w3(i3) w3(i4) ………… IP addresses traffic day 1 traffic day 2 traffic day 3 f(i) = maxbwb(i)-minbwb(i) aggregated over a network gives sum of maximum changes in # of flows Challenges • Massive data volumes, too large to store or to transport elsewhere • w1 may not be available when we process w2, etc, collected at different place/time • Many different types of queries (d(i),f(i)), not known in advance  Exact aggregation can be very resource consuming or simply impossible Challenge: which summary to keep and how to estimate the aggregate ?? Solution: Use Sampling Keep a sample Desirable Properties of Sampling Scheme: • Scalability: efficient on data streams and distributed data, decouple the summarization of different sets • Applicable to a wide range of d(i), f(i) • Good estimators: unbiased, small variance Sampling a single weight function •Independent (Bernoulli) with probability kw(i)/W Poisson, PPS sampling •k times without replacement Order (bottom-k) sampling (includes PPSWOR, priority sampling (Rosen 72, Rosen 97, CK07, DLT 07) • Methods can be implemented efficiently in a stream/reservoir/distributed setting •Order (bottom-k) sampling dominates other methods (fixed sample size, with unbiased estimators [CK07, DLT07] ) Rank-based definition of Order/bottom-k sampling • For each item draw a random rank value from some distribution fw(i) • Order/bottom-k: Pick the k smallest-ranked items to your sample Rank distributions: u (i )  U[0,1] (The seed) • r(i)=u(i)/w(i) : priority order samples • r(i)=-ln(u(i))/w(i) : PPSWOR order samples Relations between samples by different weight function Independent sampling: Get a sample from each weight function independently Coordinated Sampling: If w2(i) ≥ w1(i) then r2(i) ≤ r1(i). For example by using the same seed u(i) for all wi: r1(i) = u(i)/w1(i) r2(i) = u(i)/w2(i) Can be achieved efficiently via hashing Coordination is critical • Coordination is critical for tight estimates • We develop estimators for coordinated samples and analyze them A lot of previous work on coordination in the context of survey sampling and when weights are uniform.. Horvitz Thompson Estimators [HT52] HT estimator: Give each item i an adjusted weight a(i) Let p(i) be the prob. that item i is sampled If i is in sample, a(i)=w(i)/p(i) (otherwise a(i) = 0 )  w(i)  E[a(i )]  p(i )    (1  p(i ))0  w(i)  p(i )  a(Q)   a(i )  iQ  a(i ) iQ  S  a(Q) is an unbiased estimator of w(Q) Using HT for order samples Problem • Cannot compute p(i) from an order sample • Therefore cannot compute HT adjusted weights a(i)=w(i)/p(i) p(i)? Solution: Apply HT on partitions • • p1(i) For item i, if we can find a partition of the sample space such that we can compute the p3(i) conditioned p(i) for item i for the cell that the 2 p (i) sampled set belongs to p4(i) Then apply HT using that conditioned p(i)  Obtain unbiased adjusted weights for each item p5(i) p6(i) Estimators for aggregates over multiple weights Suppose we want to estimate:  i|d ( i ) true b min b w (i) We have a sample for each weight function using independent or coordinated ranks Estimator for  i|d ( i ) true b min b w (i) Identify the items i for which d(i)=true and you know minbwb(i) (=Items contained in all samples) Compute p(i), the probability of such item to be sampled for all weight functions (conditioned in a subspace that depends on the other ranks) Set a(i) = minbwb(i)/p(i) Sum up these adjusted weights Independent vs Coordinated  1  p(i)  var[a(i)]   min b w (i)     p(i )  b 2 Variance is small when p(i) is large Coordinated sketches have larger p(i) Estimator for  i|d ( i ) true b max b w (i) Identify the items i for which d(i)=true and you know maxbwb(i) If you see all weights then you know maxbwb(i) But you never see them all if minbwb(i) = 0 Estimator for  i|d ( i ) true b max b w (i) Use the consistency of the ranks: If the largest weight you see has rank smaller than minb{kth smallest rank of an item  I\{i}} then you know it’s the maximum. Estimator for  i|d ( i ) true b max b w (i) If the largest weight you see has rank smaller than minb{kth smallest rank of an item  I\{i}} then you know it’s the maximum. Compute p(i), the probability of this to happen Set a(i) = minbwb(i)/p(i) Sum up these adjusted weights More estimators Now we can get an unbiased estimator for the L1 difference:  (min w (i)  min b w (i)) b b b Can prove that this estimator is nonnegative Empirical Evaluation Data sets: • IP packet traces: Items: (src,dest,src-port,dest-port). wb: total # of bytes in hour b • Netflix data sets: Items: movies. wb: total # of ratings in month b • Stock data: Items: ticker symbols wb: high value on day b Netflix: b var[ a (min w  ind b (i))] i  var[a coor i b (min b w (i ))] Netflix:  var[a( f (i))] i for various f Summary • Coordinated sketches improves accuracy and sometimes estimating is not feasible without it • It can be achieved with little overhead using state of the art sampling techniques Thank you! Application: Sensor nodes recording daily vehicle traffic in different locations in a city • Items: vehicles (license plate numbers) • Sets: All vehicles observed at a particular location/date Example queries: • Number of distinct vehicles in Manhattan on election day (size of the union of all Manhattan locations on election day) • Number of trucks with PA license plates that crossed both the Hudson and the East River on June 18, 2009 Application: Items are IP addresses. Sets h1,…,h24 correspond to destination IP’s observed in different 1hour time periods. Example Queries: number of distinct IP dests • In the first 12 hours (size of union of h1,…,h12) • In at most one/at least 4 out of h1,…,h24 • In the intersection of h9,…,h17 (all business hours) • Same queries restricted to: blacklisted IP’s, servers in a specific country etc. Uses: Traffic Analysis, Anomaly detection Application: Text/HyperText corpus: items are features/terms, sets are documents (or vice versa) Example Queries: •Number of distinct hyperlinks to financial news websites from websites in the .edu domain •Fraction of pages citing A that also cite B •Similarity (Jaccard coefficient) of two documents Uses: research, duplicate elimination, similarity search or clustering Application: Market basket data set. Items are consumers/baskets, subsets are goods (or vice versa) Example queries: • Likelihood of purchasing beer if diapers were purchased • Number of baskets from zip code 07932 containing shampoo • Number of consumers purchasing baby food and not diapers Uses: Marketing, placement, pricing, choosing products Rank-based definition of sampling • For each item draw a random rank value from some distribution fw(i) •Poisson: Pick all items with rank value below Tau(k) •Order/bottom-k: Pick the k smallest-ranked items to your sample Rank distributions: u  U[0,1] • r(i)=u/w(i) : Poisson PPS, priority order samples • r(i)=-ln(u)/w(i) : PPSWOR order samples  Horvitz Thompson Estimators [HT52] Give each item i an adjusted weight a(i): Let p(i) be the prob. that item i is sampled If i is in the sample a(i)=w(i)/p(i) (otherwise a(i) = 0 )  w(i)  E[a(i )]  p(i )    (1  p(i ))0  w(i)  p(i )   a(i) is an unbiased estimator of w(i) Rank Conditioning Estimator for single bottom-k sample set [CK07, DLT07] Use the principal of applying HT on a partition A 0.15 0.62 0.10 0.79 0.51 4-sample of A 0.04 0.42 0.38 0.87 0.94 < 0.42 For item i in sample, partition is based on the kth smallest rank value among all other items (excluding i) in the sample being some fixed value. In this specific case, k=4 and the fixed value = 0.42. Compute p(i) : the probability that the rank value of i is smaller than 0.42. For priority order samples, r(i)=u/w(i); p(i)=prob(r(i)<0.42)= min{1,0.42w(i)}; a(i)=w(i)/p(i)=max{w(i),100/42}

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Slides - VLDB 2009