Download whats_hot_and_whats_not

What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically G. Cormode and S. Muthukrishman Rutgers University ACM Principles of Database Systems 2003 ACM Transactions on Database Systems 2005 Introduction  Find “hot” items, but the set of hot items will change over time  Applications: caching, load balancing, sensor networks, data mining, etc.  Usually focus on “insert” only, this paper also take “delete” into account Prior works  Stream with sliding window (*) Arrival time 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 1 0 1 0 1 1 0 0 0 1 0 1 0 1 0 1 1 0 1 0 1 1 1 1 Elements  Flajolet-Martin approach (*)  Estimate number of distinct elements  Majority voting algorithm  Use only one counter to identify the majority item  Lossy counting * http://vc.cs.nthu.edu.tw/ezLMS/show.php?id=385 Contribution of the paper  Dynamically maintain the hot items  Both insert and delete transactions are supported  Randomized algorithm  Use hash table  Use “random” to confuse omniscient adversary  Small space required  Short processing time Finding the majority item  Keep log2m+1 counters  C0: keep how many items are “live”  Cj (j!=0): increase or decrease if bit(x,j)=1  Search: if there is a majority, it is given by  log2 m j 1 2 j gt (c j , c0 / 2)  No false negative, but false positive is possible Algorithms to find the majority element in a sequence of updates Example Find majority: x=0 +21 +0=2 Space of 8 items Counter 0 Counter 1 (20) #>(counter 0)/2 ? Counter 2 (21) Counter 3 (22) 1 2 2 2 7 2 4 6 False positive is possible! Finding hot items  Sequence with length n  Item identifiers: 1..m  nx(t): # of inserts - # of deletes before time t  fx(t): nx(t)/sigma(ny(t), y=1..m)  Hot item: given k, fx(t) > 1/(k+1) Process Item (insert or delete)  Classify sets by universal hash function  Initialize c[0..2Tk][0..logm]=0, c=0  T: # of groups  k: frequency threshold (fx(t)>1/(k+1))  for all (i, transType) do if (transType == insert)  c=c+1 else  c=c-1 for x=1 to T do index = hash(x) // uniformly distributed UpdateCounters(i,transType,c[index]) Find hot sets  for i=1 to T do //for each group if c[i][0] ≧n/(k+1) position=0; t=1; for j=1 to logm do if (c[i][j] ≧ n/(k+1)) position = position + t t = t*2 output(position) Similar to the algorithm to find the majority Error probability  Choosing |h|≧2k, T=log2(k/δ), the algorithm ensures that the probability of all hot items being output is at least 1-δ  Details of the proof (*,**) * Universal classes of hash functions, J. Comput. Syst. 1979 ** the two papers currently presented Experiments  Synthetic data:      Uniformly insert Zip-f insert Uniformly delete 1,000,000 items k=50 (hot items: f>1/(k+1))  Real data:  Telephone connections (from AT&T)  3.5 million transactions  Every 100,000 transactions, query (src, dest) pairs with frequency greater than 1% Results of synthetic data  Recall: proportion of the hot items that are found by the method  Precision: proportion of items identified by the algorithm are hot items Results of real data Conclusion  Propose a new method for identifying hot items  Cope with dynamic datasets

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download whats_hot_and_whats_not