What’s Hot and What’s Not:
Tracking Most Frequent Items
Dynamically
G. Cormode and S. Muthukrishnan
Rutgers University
ACM Principles of Database Systems 2003
ACM Transactions on Database Systems 2005
Introduction
 Find “hot” items, but the set of hot
items will change over time
 Applications: caching, load balancing,
sensor networks, data mining, etc.
 Prior work usually handles inserts only; this
paper also takes deletes into account
Prior works
 Stream with sliding window (*)
Arrival time: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
Elements:      1  0  1  0  1  1  0  0  0  1  0  1  0  1  0  1  1  0  1  0  1  1  1  1
 Flajolet-Martin approach (*)
 Estimate number of distinct elements
 Majority voting algorithm
 Use only one counter to identify the
majority item
 Lossy counting
* http://vc.cs.nthu.edu.tw/ezLMS/show.php?id=385
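As a point of comparison for the insert-only prior work, the one-counter majority voting idea can be sketched as below. This is a minimal illustration of the scheme usually called the Boyer-Moore majority vote; the function name `majority_candidate` is hypothetical and does not come from the slides. The returned value is only a candidate: if the stream has no majority, a second pass is needed to reject it, and deletions are not supported.

```python
from typing import Iterable, Optional, TypeVar

Item = TypeVar("Item")


def majority_candidate(stream: Iterable[Item]) -> Optional[Item]:
    """Return the only item that can be a majority, using one counter."""
    candidate: Optional[Item] = None
    count = 0
    for item in stream:
        if count == 0:
            # No current candidate: adopt this item.
            candidate, count = item, 1
        elif item == candidate:
            count += 1
        else:
            # A different item cancels one vote for the candidate.
            count -= 1
    return candidate


# Item 2 occurs 5 times out of 8, so it is the majority.
print(majority_candidate([2, 1, 2, 3, 2, 2, 4, 2]))  # prints 2
```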
Contribution of the paper
 Dynamically maintain the hot items
 Both insert and delete transactions are
supported
 Randomized algorithm
 Use hash table
 Use randomness to defeat an omniscient
adversary
 Small space required
 Short processing time
Finding the majority item
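 Keep log2 m + 1 counters
 c0: counts how many items are “live” (inserts minus deletes)
 cj (j ≠ 0): incremented on insert and decremented on delete of x if bit(x, j) = 1
 Search: if there is a majority item, it is given by
$x = \sum_{j=1}^{\log_2 m} 2^{j-1} \, gt(c_j, c_0/2)$
where bit(x, j) is the j-th bit of x counted from the least significant bit (j = 1), and gt(a, b) = 1 if a > b, else 0
 No false negatives, but false positives are possible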
Algorithms to find the majority
element in a sequence of updates
Example
 Space of 8 items
 Items: 1 2 2 2 7 2 4 6
 Counter 0: number of live items
 Counter 1 (2^0), Counter 2 (2^1), Counter 3 (2^2): is the count > (counter 0)/2?
 Find majority: x = 0 + 2^1 + 0 = 2
 False positive is possible!
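The bit-counter scheme and the example above can be reproduced with a short sketch. The class name `MajorityTracker` and its methods are hypothetical, and the sketch assumes identifiers fit in `num_bits` bits. Counter j (j ≥ 1) tracks the bit of weight 2^(j-1), matching "Counter 1 (2^0)" in the example.

```python
class MajorityTracker:
    """Majority candidate under inserts and deletes, using bit counters."""

    def __init__(self, num_bits: int) -> None:
        self.num_bits = num_bits
        # counters[0] counts live items; counters[j] counts live items
        # whose bit of weight 2**(j-1) is set.
        self.counters = [0] * (num_bits + 1)

    def update(self, x: int, delta: int) -> None:
        """Apply an insert (delta=+1) or a delete (delta=-1) of item x."""
        self.counters[0] += delta
        for j in range(1, self.num_bits + 1):
            if (x >> (j - 1)) & 1:
                self.counters[j] += delta

    def candidate(self) -> int:
        """Assemble the candidate from bits whose counter exceeds c0/2."""
        half = self.counters[0] / 2
        return sum(
            1 << (j - 1)
            for j in range(1, self.num_bits + 1)
            if self.counters[j] > half
        )


tracker = MajorityTracker(num_bits=3)
for x in [1, 2, 2, 2, 7, 2, 4, 6]:
    tracker.update(x, +1)
print(tracker.counters)     # prints [8, 2, 6, 3]
print(tracker.candidate())  # prints 2
```

In this stream item 2 holds exactly 4 of the 8 live items, which is not a strict majority, so the output is only a candidate. After deleting 7, 4 and 6 the same candidate becomes a true majority (4 of 5).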
Finding hot items
 Sequence with length n
 Item identifiers: 1..m
 n_x(t): (# of inserts of x) − (# of deletes of x)
before time t
 $f_x(t) = n_x(t) / \sum_{y=1}^{m} n_y(t)$
 Hot item: given k, x is hot at time t if f_x(t) > 1/(k+1)
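The definitions above can be checked with an exact, full-memory baseline. This is not the paper's algorithm; it is only a brute-force reference that stores one count per item. The function name `exact_hot_items` and the `"insert"`/`"delete"` transaction labels are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def exact_hot_items(transactions: Iterable[Tuple[int, str]], k: int) -> Set[int]:
    """Return every item x with f_x > 1/(k+1), computed exactly."""
    net: Counter = Counter()
    for item, trans_type in transactions:
        # n_x = inserts of x minus deletes of x
        net[item] += 1 if trans_type == "insert" else -1
    total = sum(net.values())  # sum of n_y over all items y
    if total <= 0:
        return set()
    # n_x / total > 1/(k+1), written without division
    return {x for x, n_x in net.items() if n_x * (k + 1) > total}


log = [(1, "insert"), (2, "insert"), (2, "insert"), (3, "insert"), (3, "delete")]
print(exact_hot_items(log, k=1))  # prints {2}
```

With at most k items able to exceed the 1/(k+1) threshold at once, this baseline is useful mainly as ground truth when measuring a sketch-based method.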
Process Item (insert or delete)
 Classify items into groups with universal hash functions
 Initialize c[0..2Tk][0..log m] = 0, c = 0 (scalar c: number of live items)
 T: # of hash functions (each splits the items into 2k groups)
 k: frequency threshold (f_x(t) > 1/(k+1))
 for all (i, transType) do
    if (transType == insert) then c = c + 1
    else c = c - 1
    for x = 1 to T do
        index = 2k(x-1) + h_x(i) // h_x: x-th hash function, uniform over 2k groups
        UpdateCounters(i, transType, c[index])
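One common way to obtain such hash functions is the Carter-Wegman construction ((a·i + b) mod P) mod W. The sketch below assumes that construction and assumes the prime P = 2^31 − 1 exceeds every item identifier; the helper names `make_hash` and `group_indices` are hypothetical.

```python
import random
from typing import Callable, List

P = 2_147_483_647  # Mersenne prime 2**31 - 1 (assumed larger than any item id)


def make_hash(num_groups: int, rng: random.Random) -> Callable[[int], int]:
    """Draw one hash function mapping item ids to 0..num_groups-1."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)

    def h(i: int) -> int:
        return ((a * i + b) % P) % num_groups

    return h


def group_indices(i: int, hashes: List[Callable[[int], int]], num_groups: int) -> List[int]:
    """Row of the counter table that item i updates under each hash function."""
    return [t * num_groups + h(i) for t, h in enumerate(hashes)]


rng = random.Random(0)
k, T = 3, 4
hashes = [make_hash(2 * k, rng) for _ in range(T)]
# One row per hash function, each inside its own block of 2k rows.
print(group_indices(42, hashes, 2 * k))
```

Drawing a and b at random after the adversary has fixed the input is what keeps any fixed pair of items from colliding too often.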
Find hot sets
 for i = 1 to 2Tk do // for each group
    if c[i][0] > n/(k+1) // n: current number of live items
        position = 0; t = 1
        for j = 1 to log m do
            if (c[i][j] > n/(k+1))
                position = position + t
            t = t*2
        output(position)
Similar to the algorithm to find the majority
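Putting the update and search steps together gives the following sketch. It follows the slide pseudocode only and is not the authors' reference implementation. Assumptions: the class name `HotItemTracker`, a Carter-Wegman hash with P = 2^31 − 1, 2k groups per hash function, the live-item count as the threshold base, and a strict ">" test to match the hot-item definition.

```python
import random
from typing import List, Optional, Set


class HotItemTracker:
    P = 2_147_483_647  # prime, assumed larger than any item id

    def __init__(self, k: int, m: int, T: int, seed: Optional[int] = None) -> None:
        rng = random.Random(seed)
        self.k, self.T = k, T
        self.bits = m.bit_length()   # bits needed for ids 1..m
        self.width = 2 * k           # groups per hash function
        self.live = 0                # number of live items
        self.ab = [(rng.randrange(1, self.P), rng.randrange(self.P)) for _ in range(T)]
        # One row of counters per group: [group count, bit 1, ..., bit `bits`]
        self.c: List[List[int]] = [[0] * (self.bits + 1) for _ in range(T * self.width)]

    def _row(self, t: int, item: int) -> int:
        a, b = self.ab[t]
        return t * self.width + ((a * item + b) % self.P) % self.width

    def process(self, item: int, trans_type: str) -> None:
        """Apply one transaction; trans_type is 'insert' or 'delete'."""
        delta = 1 if trans_type == "insert" else -1
        self.live += delta
        for t in range(self.T):
            row = self.c[self._row(t, item)]
            row[0] += delta
            for j in range(1, self.bits + 1):
                if (item >> (j - 1)) & 1:
                    row[j] += delta

    def find_hot(self) -> Set[int]:
        """Decode one candidate from every group heavy enough to hold a hot item."""
        found: Set[int] = set()
        for row in self.c:
            if row[0] * (self.k + 1) > self.live:
                position = 0
                for j in range(1, self.bits + 1):
                    if row[j] * (self.k + 1) > self.live:
                        position += 1 << (j - 1)
                found.add(position)
        return found


tracker = HotItemTracker(k=2, m=100, T=4, seed=1)
for _ in range(50):
    tracker.process(5, "insert")
for x in range(10, 30):
    tracker.process(x, "insert")
print(tracker.find_hot())  # prints {5}
```

In this usage the 20 background items together stay below the threshold, so the only group that passes is the one holding item 5 and its bits decode cleanly. With heavier background traffic a group can decode to an item that is not hot, which is the false-positive case.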
Error probability
 Choosing hash range |h| ≥ 2k and T = log2(k/δ)
hash functions, the algorithm outputs
every hot item with probability at
least 1−δ
 Details of the proof (*,**)
* J. L. Carter and M. N. Wegman, Universal classes of hash functions, J. Comput. Syst. Sci., 1979
** the two papers currently presented
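To get a feel for the sizes these choices imply, the parameters can be computed directly. This is a rough sketch: rounding T up to an integer and using the bit length of m for the "log m" columns are assumptions, and `sketch_parameters` is a hypothetical helper name.

```python
import math
from typing import Tuple


def sketch_parameters(k: int, delta: float, m: int) -> Tuple[int, int, int]:
    """Return (T, number of groups, total number of counters)."""
    T = math.ceil(math.log2(k / delta))   # number of hash functions
    groups = 2 * k * T                    # 2k groups per hash function
    counters_per_group = m.bit_length() + 1
    return T, groups, groups * counters_per_group


# k = 50, failure probability 1%, one million possible identifiers
print(sketch_parameters(50, 0.01, 1_000_000))  # prints (13, 1300, 27300)
```

The counter total grows with k·log(k/δ)·log m and does not depend on the length of the transaction sequence.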
Experiments
 Synthetic data:
 Uniformly insert
 Zipf insert
 Uniformly delete
 1,000,000 items
 k=50 (hot items: f>1/(k+1))
 Real data:
 Telephone connections (from AT&T)
 3.5 million transactions
 Every 100,000 transactions, query (src, dest)
pairs with frequency greater than 1%
Results of synthetic data
 Recall: proportion of the hot items that
are found by the method
 Precision: proportion of the items output
by the algorithm that are truly hot
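Both measures follow directly from the set of reported items and the set of truly hot items. The sketch below uses a hypothetical function name and adopts the convention (an assumption, not from the slides) that an empty denominator yields 1.0.

```python
from typing import Set, Tuple


def recall_precision(reported: Set[int], truly_hot: Set[int]) -> Tuple[float, float]:
    """Return (recall, precision) of the reported items against ground truth."""
    hits = len(reported & truly_hot)
    recall = hits / len(truly_hot) if truly_hot else 1.0
    precision = hits / len(reported) if reported else 1.0
    return recall, precision


# Two of the four hot items were found; two of the three reported items are hot.
recall, precision = recall_precision({1, 2, 3}, {2, 3, 4, 5})
print(recall)     # prints 0.5
print(precision)  # two thirds
```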
Results of real data
Conclusion
 Propose a new method for identifying
hot items
 Cope with dynamic datasets