Survey							
                            
		                
		                * Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically G. Cormode and S. Muthukrishman Rutgers University ACM Principles of Database Systems 2003 ACM Transactions on Database Systems 2005 Introduction  Find “hot” items, but the set of hot items will change over time  Applications: caching, load balancing, sensor networks, data mining, etc.  Usually focus on “insert” only, this paper also take “delete” into account Prior works  Stream with sliding window (*) Arrival time 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 1 0 1 0 1 1 0 0 0 1 0 1 0 1 0 1 1 0 1 0 1 1 1 1 Elements  Flajolet-Martin approach (*)  Estimate number of distinct elements  Majority voting algorithm  Use only one counter to identify the majority item  Lossy counting * http://vc.cs.nthu.edu.tw/ezLMS/show.php?id=385 Contribution of the paper  Dynamically maintain the hot items  Both insert and delete transactions are supported  Randomized algorithm  Use hash table  Use “random” to confuse omniscient adversary  Small space required  Short processing time Finding the majority item  Keep log2m+1 counters  C0: keep how many items are “live”  Cj (j!=0): increase or decrease if bit(x,j)=1  Search: if there is a majority, it is given by  log2 m j 1 2 j gt (c j , c0 / 2)  No false negative, but false positive is possible Algorithms to find the majority element in a sequence of updates Example Find majority: x=0 +21 +0=2 Space of 8 items Counter 0 Counter 1 (20) #>(counter 0)/2 ? Counter 2 (21) Counter 3 (22) 1 2 2 2 7 2 4 6 False positive is possible! Finding hot items  Sequence with length n  Item identifiers: 1..m  nx(t): # of inserts - # of deletes before time t  fx(t): nx(t)/sigma(ny(t), y=1..m)  Hot item: given k, fx(t) > 1/(k+1) Process Item (insert or delete)  Classify sets by universal hash function  Initialize c[0..2Tk][0..logm]=0, c=0  T: # of groups  k: frequency threshold (fx(t)>1/(k+1))  for all (i, transType) do if (transType == insert)  c=c+1 else  c=c-1 for x=1 to T do index = hash(x) // uniformly distributed UpdateCounters(i,transType,c[index]) Find hot sets  for i=1 to T do //for each group if c[i][0] ≧n/(k+1) position=0; t=1; for j=1 to logm do if (c[i][j] ≧ n/(k+1)) position = position + t t = t*2 output(position) Similar to the algorithm to find the majority Error probability  Choosing |h|≧2k, T=log2(k/δ), the algorithm ensures that the probability of all hot items being output is at least 1-δ  Details of the proof (*,**) * Universal classes of hash functions, J. Comput. Syst. 1979 ** the two papers currently presented Experiments  Synthetic data:      Uniformly insert Zip-f insert Uniformly delete 1,000,000 items k=50 (hot items: f>1/(k+1))  Real data:  Telephone connections (from AT&T)  3.5 million transactions  Every 100,000 transactions, query (src, dest) pairs with frequency greater than 1% Results of synthetic data  Recall: proportion of the hot items that are found by the method  Precision: proportion of items identified by the algorithm are hot items Results of real data Conclusion  Propose a new method for identifying hot items  Cope with dynamic datasets