The Data Stream Model

Data streams everywhere
• Telcos: phone calls
• Satellite, radar, and sensor data
• Computer systems and network monitoring
• Search logs, access logs
• RSS feeds, social network activity
• Websites, clickstreams, query streams
• E-commerce, credit card sales
• ...

Example 1: Online shop
Thousands of visits per day.
• Is this "customer" a robot?
• Does this customer want to buy? Is the customer lost? Is s/he finding what s/he wants?
• What products should we recommend to this user?
• What ads should we show to this user?
• Should we get more machines from the cloud to handle incoming traffic?

Example 2: Web search
Millions of queries per day.
• What are the top queries right now?
• Which terms are gaining popularity now?
• What ads should we show for this query and user?

Example 3: Phone company
Hundreds of millions of calls per day, each call about 1000 bytes per switch, i.e., about 1 TB/month, which must be kept for billing.
• Is this call fraudulent?
• Why do we get so many call drops in area X? Should we reroute differently tomorrow?
• Is this customer thinking of leaving us? How do we cross-sell / up-sell this customer?

Data Streams: Modern-Times Data
1. Data arrives as a sequence of items
2. At high speed
3. Potentially infinite
4. We can't store them all
5. We can't go back, or going back is too slow
6. Reality is evolving and non-stationary

In algorithmic words: the Data Stream axioms
1. One pass
2. Low time per item: read, process, discard
3. Sublinear memory: keep only summaries or sketches
4. Anytime, real-time answers
5. The stream evolves over time

These two weeks
• The data stream model
• Statistics on streams; frequent elements
• Sketches for linear algebra and graphs
• Dealing with change
• Predictive models
• Evaluation
• Clustering
• Frequent pattern mining

The data stream model

Computing in data streams
Approximate answers are often OK, specifically in learning and mining contexts, and they are often computable with surprisingly low memory, in one pass.

Main Ingredients: Approximation and Randomization
• Algorithms use a source of independent random bits
• So different runs give different outputs
• But "most runs" are "approximately correct"

Randomized Algorithms: (ε,δ)-approximation
A randomized algorithm A (ε,δ)-approximates a function f : X → R iff for every x ∈ X, with probability ≥ 1 − δ:
• (absolute approximation) |A(x) − f(x)| < ε
• (relative approximation) |A(x) − f(x)| < ε·f(x)
Often ε and δ are given as inputs to A; ε is the accuracy and δ the confidence.

Three problems on Data Streams
• Counting distinct elements
• Finding heavy hitters
• Counting in a sliding window

Counting distinct elements
How many distinct IP addresses has the router seen? An IP may have passed once, or many, many times.
Fact: any algorithm must use Ω(n) memory to solve this problem exactly on a data stream, where n is the number of distinct IPs seen.
Fact: O(log n) memory suffices to approximate the count within 1%.

Finding heavy hitters
Which IPs have each used more than an ε fraction of the bandwidth? (Note: there can be at most 1/ε of these.)
Fact: any algorithm must use Ω(n) memory to solve this problem exactly on a data stream, where n is the number of distinct IPs seen.
Fact: O(1/ε) memory suffices if we allow a constant error factor.

Counts in a sliding window
Stream of bits; fixed n. Question: how many 1s were there among the last n bits?
Fact: any algorithm must use Ω(n) memory to solve this problem exactly on a data stream.
Fact: O(log n) memory suffices to approximate the count within 1%.

Argument for sketches
• If we keep one count, it is OK to use a lot of memory
• If we have to keep many counts, they should each use low memory
• When learning / mining, we need to keep many counts
• Sketching is therefore a good basis for data stream learning / mining

Sampling
Problem: given a data stream, choose k items uniformly at random, storing only k elements in memory.
Input: a stream of data that arrives online; a sample size k; a sample range, either the entire stream or the most recent window (count-based or time-based).
Output: k elements chosen uniformly at random within the sample range.

Reservoir Sampling
Classical algorithm by Vitter (1985); the size of the data stream need not be known in advance. Goal: maintain a fixed-size uniform random sample (see the sketch below).
• Put the first k elements from the stream into the reservoir S
• When the i-th element arrives, with i > k:
  • add it to reservoir S with probability p = k/i
  • if added, remove an element from S chosen uniformly at random
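Vitter's scheme is compact enough to sketch directly. Below is a minimal Python sketch, assuming the stream is any iterable (the function name is illustrative, not from the paper):

```python
import random

def reservoir_sample(stream, k):
    """One-pass uniform sample of size k using O(k) memory.
    After i items, every item seen so far is in S with probability k/i."""
    S = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            S.append(x)                 # fill the reservoir first
        elif random.random() < k / i:
            S[random.randrange(k)] = x  # keep x, evicting a uniform member
    return S
```

Overwriting a uniformly chosen slot is equivalent to the slide's "add with probability k/i, then randomly remove an element from S".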
Duplicates in the Stream
Observations: streams contain duplicate elements, e.g., following a Zipf (power-law) distribution. Any value occurring frequently in the sample is a wasteful use of the available space.

Concise Sampling
By Gibbons and Matias (1998). Represent an element in the sample by a (value, count) pair.
• Start with τ = 1
• Add a new element with probability 1/τ (increase its count if the element is already in S)
• If S is full:
  • increase τ to τ′
  • keep each element (or each unit of its count) with probability τ/τ′, evicting it (or decreasing its count) otherwise

Counting
The most basic question: how many items have we read so far in the data stream? To count up to t elements exactly, log t bits are necessary. An approximate solution exists that uses only log log t bits.

Voting (consider this!)
First consider the following way of carrying out an election. We have m voters in a room, each voting for some candidate i ∈ [n]. We ask the voters to run around the room and pair up with another voter who voted for a different candidate (note: some voters may not find a partner, for example if everyone voted for the same candidate). Then we kick out of the room everyone who did manage to find a partner. A claim whose proof we leave to the reader as an exercise: if there actually was a candidate with a strict majority, then a non-zero number of voters will be left in the room at the end, and furthermore all of them will be supporters of the majority candidate.

MAJORITY algorithm
Task: given a list of m numbers (representing votes), is there an absolute majority, i.e., an element occurring more than m/2 times?
Correctness is based on the pairing argument:
• every non-majority element can be paired with a majority element
• after the pairing, there will still be majority elements left

FREQUENT algorithm (Misra-Gries)
A generalization of MAJORITY: find all elements in a sequence whose frequency exceeds a 1/k fraction of the total count, i.e., whose frequency is greater than m/k (see the sketch below).
Example: in a stream with m = 12 elements and k = 3, all elements with more than m/k = 12/3 = 4 occurrences should be reported. The elements are reported correctly; the estimated counts are off by 2.
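A minimal Python sketch of FREQUENT, assuming we may keep up to k − 1 counters (the function name is illustrative):

```python
def frequent(stream, k):
    """Misra-Gries FREQUENT: one pass, at most k - 1 counters.
    Every element with true frequency > m/k ends up in the result,
    and each returned count underestimates by at most m/k."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # The "pairing" step: one occurrence of x cancels one
            # occurrence of each currently tracked element.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

When a second pass over the data is possible, it turns the surviving candidates into exact frequencies.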
Space Saving: Performance
For a comparison of FREQUENT, Space Saving, and related algorithms, see Graham Cormode and Marios Hadjieleftheriou, "Finding the frequent items in streams of data," Communications of the ACM 52.10 (2009): 97-105.

Sampling a Sliding Window
• Timeliness: old data are not useful
• Restrict samples to a window of recent data
• As new data arrive, old data "expire"
• Reservoir sampling cannot handle data expiration: we would need to replace an "expired" element in the reservoir with a random element of the current window, which is not stored

A Naive Algorithm
• Place a moving window of size N on the stream; an old element y expires when a new element x arrives
• If y is not in the reservoir, do nothing; otherwise replace y with x
• Problem: periodicity. If the j-th element is in the sample, then every element with index j + cN is in the sample

Chain Sampling (Babcock, Datar, Motwani, 2002)
Motivation: when an element x is added to the sample, decide immediately which future element y will replace x when x expires, and store y when it arrives (x has not expired yet). Of course, we must also decide which future element will replace y in turn, so we never have to look back. This creates a chain (here, for a sample of size 1):
• include each new element in the sample with probability 1/min(i, N)
• when the i-th element is added to the sample, randomly choose a future index in [i+1, i+N] to replace it when it expires
• once the element with that index arrives, store it and choose the index that will replace it in turn, building a "chain" of potential replacements (see the sketch below)
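A minimal Python sketch of chain sampling for a sample of size 1, under my own naming; larger samples can be obtained, e.g., by running independent copies of the chain:

```python
import random
from collections import deque

class ChainSample:
    """Chain sampling (Babcock, Datar, Motwani 2002), sample size 1,
    over a sliding window of the N most recent items."""
    def __init__(self, N):
        self.N = N
        self.i = 0            # index of the most recent item
        self.chain = deque()  # (index, value): sample + arrived replacements
        self.next_idx = None  # future index that will extend the chain

    def feed(self, x):
        self.i += 1
        # The awaited replacement arrives: append it, pick its successor.
        if self.chain and self.i == self.next_idx:
            self.chain.append((self.i, x))
            self.next_idx = random.randint(self.i + 1, self.i + self.N)
        # With probability 1/min(i, N), restart the chain at the new item.
        if random.random() < 1.0 / min(self.i, self.N):
            self.chain = deque([(self.i, x)])
            self.next_idx = random.randint(self.i + 1, self.i + self.N)
        # Drop the head once it falls out of the window; its successor
        # (already stored) becomes the sample.
        while self.chain and self.chain[0][0] <= self.i - self.N:
            self.chain.popleft()

    def sample(self):
        return self.chain[0][1] if self.chain else None
```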
Sliding Windows
A useful model of stream processing is that queries are about a window of the N most recent elements received. The interesting case: N is so large that the data cannot be stored in memory, or even on disk; or there are so many streams that windows for all of them cannot be stored.
Amazon example: for every product X we keep a 0/1 stream recording whether X was sold in the n-th transaction, and we want to answer queries such as "how many times have we sold X in the last k sales?"

Sliding Window: 1 Stream
[Figure: a window of size N = 6 sliding over a character stream, with the past to the left and the future to the right.]

Counting Bits (1)
Problem: given a stream of 0s and 1s, be prepared to answer queries of the form "how many 1s are in the last k bits?" for any k ≤ N.
Obvious solution: store the most recent N bits; when a new bit comes in, discard the (N+1)-st oldest bit.

Counting Bits (2)
You cannot get an exact answer without storing the entire window. The real problem: what if we cannot afford to store N bits? E.g., we are processing 1 billion streams and N = 1 billion. But we are happy with an approximate answer.

An attempt: Simple solution
Q: how many 1s are in the last N bits? A simple solution that does not really solve our problem relies on a uniformity assumption. Maintain two counters:
• S: the number of 1s from the beginning of the stream
• Z: the number of 0s from the beginning of the stream
Then estimate the number of 1s in the last N bits as N · S/(S+Z). But what if the stream is non-uniform? What if the distribution changes over time?

DGIM Method [Datar, Gionis, Indyk, Motwani]
The DGIM solution does not assume uniformity. We store O(log² N) bits per stream, and the answer is approximate but never off by more than 50%. The error factor can be reduced to any fraction > 0, with a more complicated algorithm and proportionally more stored bits.

Idea: Exponential Windows
A solution that does not quite work: summarize exponentially increasing regions of the stream, looking backward, and drop small regions if they begin at the same point as a larger region.
[Figure: a bit stream summarized by the 1-counts of exponentially increasing regions looking backward; a window of width 16 contains 6 1s.]
We can reconstruct the count of the last N bits, except that we are not sure how many of the oldest region's 6 1s are included in the N.

What's Good?
• Stores only O(log² N) bits: O(log N) counts of O(log N) bits each
• Easy to update as more bits enter
• Error in the count is no greater than the number of 1s in the "unknown" region

What's Not So Good?
• As long as the 1s are fairly evenly distributed, the error due to the unknown region is small: no more than 50%
• But it could be that all the 1s are in the unknown area at the end; in that case, the error is unbounded!

Fixup: DGIM method [Datar, Gionis, Indyk, Motwani]
Idea: instead of summarizing fixed-length blocks, summarize blocks containing specific numbers of 1s. Let the block sizes (numbers of 1s) increase exponentially. When there are few 1s in the window, block sizes stay small, so errors are small.

DGIM: Timestamps
Each bit in the stream has a timestamp, starting 1, 2, ... We record timestamps modulo N (the window size), so we can represent any relevant timestamp in O(log N) bits.

DGIM: Buckets
A bucket in the DGIM method is a record consisting of:
(A) the timestamp of its end [O(log N) bits]
(B) the number of 1s between its beginning and end [O(log log N) bits]
Constraint on buckets: the number of 1s must be a power of 2; storing just the exponent explains the O(log log N) bits in (B).

Representing a Stream by Buckets
• There are either one or two buckets with the same power-of-2 number of 1s
• Buckets do not overlap in timestamps
• Buckets are sorted by size: earlier buckets are not smaller than later buckets
• Buckets disappear when their end-time is more than N time units in the past

Example: Bucketized Stream
[Figure: a bit stream covered by buckets: at least one of size 16, partially beyond the window; two of size 8; two of size 4; one of size 2; two of size 1.]
Three properties of buckets are maintained:
• either one or two buckets with the same power-of-2 number of 1s
• buckets do not overlap in timestamps
• buckets are sorted by size

Updating Buckets (1)
When a new bit comes in, drop the oldest bucket if its end-time is more than N time units before the current time. Then there are two cases: the current bit is 0 or 1. If the current bit is 0, no other changes are needed.

Updating Buckets (2)
If the current bit is 1 (see the sketch below):
1. Create a new bucket of size 1 for just this bit; its end timestamp is the current time
2. If there are now three buckets of size 1, combine the oldest two into a bucket of size 2
3. If there are now three buckets of size 2, combine the oldest two into a bucket of size 4
4. And so on ...

Example: Updating Buckets
[Figure: successive states of the stream as bits arrive; each new 1 creates a size-1 bucket, and merges cascade so that at most two buckets of each size remain.]
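Putting the rules together, here is a minimal Python sketch of a DGIM counter. The class name and representation are mine, and timestamps are stored as plain integers rather than modulo N for readability; the count() method applies the query rule described under "How to Query?" below:

```python
from collections import deque

class DGIMCounter:
    """Approximate count of 1s among the last N bits, using buckets
    (end_time, size) with power-of-2 sizes, at most two per size."""
    def __init__(self, N):
        self.N = N
        self.t = 0
        self.buckets = deque()  # newest bucket on the left, oldest on the right

    def update(self, bit):
        self.t += 1
        # Drop the oldest bucket once its end falls out of the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit != 1:
            return  # a 0 requires no further changes
        self.buckets.appendleft((self.t, 1))
        # Merge cascade: on three buckets of equal size, combine the
        # oldest two into one of double size, keeping the newer end-time.
        i = 0
        while i + 2 < len(self.buckets) and \
                self.buckets[i][1] == self.buckets[i + 2][1]:
            end_time, size = self.buckets[i + 1]
            del self.buckets[i + 2]
            self.buckets[i + 1] = (end_time, 2 * size)
            i += 1

    def count(self):
        """Sum of all bucket sizes except the oldest, plus half the
        oldest (we cannot tell how much of it is still in the window)."""
        if not self.buckets:
            return 0
        return sum(s for _, s in self.buckets) - self.buckets[-1][1] // 2
```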
How to Query?
To estimate the number of 1s in the most recent N bits:
1. Sum the sizes of all buckets except the last, oldest one (note: "size" means the number of 1s in the bucket)
2. Add half the size of the last bucket
Remember: we do not know how many 1s of the last bucket are still within the wanted window.

Part of this presentation is from: J. Leskovec, A. Rajaraman, J. Ullman, Mining of Massive Datasets, http://www.mmds.org
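As a quick sanity check of the DGIMCounter sketched earlier (assuming that class is in scope), compare its estimate against an exact count on a random bit stream:

```python
import random

dgim = DGIMCounter(N=1000)
stream = [random.randint(0, 1) for _ in range(10_000)]
for b in stream:
    dgim.update(b)

exact = sum(stream[-1000:])
print(f"exact = {exact}, DGIM estimate = {dgim.count()}")
# The estimate is guaranteed within 50% of the truth; typically much closer.
```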