Data stream model
Data streams everywhere
Telcos: phone calls
Satellite, radar, sensor data
Computer systems and network monitoring
Search logs, access logs
RSS feeds, social network activity
Websites, clickstreams, query streams
E-commerce, credit card sales
...
Example 1: Online shop
Thousands of visits / day
Is this "customer" a robot?
Does this customer want to buy?
Is the customer lost? Finding what s/he wants?
What products should we recommend to this user?
What ads should we show to this user?
Should we get more machines from the cloud to
handle incoming traffic?
Example 2: Web searchers
Millions of queries / day
• What are the top queries right now?
• Which terms are gaining popularity now?
• What ads should we show for this query and user?
Example 3: Phone company
Hundreds of millions of calls/day
Each call is about 1000 bytes per switch,
i.e., about 1 TB/month; must be kept for billing
• Is this call fraudulent?
• Why do we get so many call drops in area X?
• Should we reroute differently tomorrow?
• Is this customer thinking of leaving us? How
to cross-sell / up-sell this customer?
Data Streams: the data of modern times
1. Data arrives as a sequence of items
2. At high speed
3. Infinite
4. Can't store them all
5. Can't go back; or too slow
6. Evolving, non-stationary reality
In algorithmic words…
The Data Stream axioms:
1. One pass
2. Low time per item - read, process, discard
3. Sublinear memory - only summaries or sketches
4. Anytime, real-time answers
5. The stream evolves over time
These two weeks
The data stream model.
Statistics on streams; frequent elements
Sketches for linear algebra and graphs
Dealing with change
Predictive models
Evaluation
Clustering
Frequent pattern mining
The data stream model
Computing in data streams
Approximate answers are often OK
Specifically, in learning and mining contexts
Often computable with surprisingly low memory, one pass
Main Ingredients: Approximation and Randomization
Algorithms use a source of independent random bits
So different runs give different outputs
But “most runs” are “approximately correct”
Randomized Algorithms
(ε,δ)-approximation
A randomized algorithm A (ε,δ)-approximates a function
f : X → ℝ iff for every x ∈ X, with probability ≥ 1 − δ:
• (absolute approximation) |A(x) − f(x)| < ε
• (relative approximation) |A(x) − f(x)| < ε·f(x)
For example, a (0.01, 0.05)-approximation returns, with probability
at least 95%, an answer within 1% of the true value.
Often ε, δ are given as inputs to A
ε = accuracy; δ = confidence
Three problems on Data Streams
For example:
• Counting distinct elements
• Finding heavy hitters
• Counting in a sliding window
Counting distinct elements
How many distinct IP addresses has the router seen?
An IP may have passed once, or many, many times
Fact: Any algorithm must use Ω(n) memory to solve this
problem exactly on a data stream, where n is number of
different IPs seen
Fact: O(log n) memory suffices to approximate within 1%
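The O(log n) bound is achieved by hashing-based sketches in the Flajolet-Martin family. Below is a minimal illustrative sketch in Python, not from the slides: it tracks only the maximum number of trailing zero bits among hashed items, so memory is O(log n); real implementations average many such estimators to get within a few percent.

```python
import hashlib

def trailing_zeros(h: int) -> int:
    """Position of the lowest set bit of h (h > 0)."""
    return (h & -h).bit_length() - 1

class FMSketch:
    """Crude single-estimator Flajolet-Martin distinct counter."""
    def __init__(self):
        self.max_r = 0  # most trailing zeros seen in any hash value

    def add(self, item) -> None:
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        if h:
            self.max_r = max(self.max_r, trailing_zeros(h))

    def estimate(self) -> float:
        # with ~n distinct items, max_r concentrates around log2(n)
        return 2.0 ** self.max_r

s = FMSketch()
for ip in ["10.0.0.%d" % (i % 1000) for i in range(100_000)]:
    s.add(ip)          # 1000 distinct values, each seen many times
print(s.estimate())    # rough estimate of 1000 (high variance with one hash)
```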
Finding heavy hitters
Which IPs have used over an ε fraction of bandwidth (each)?
(Note: There can’t be more than 1/ε of these)
Fact: Any algorithm must use Ω(n) memory to solve this
problem exactly on a data stream, where n is number of
distinct IPs seen
Fact: O(1/ε) memory suffices if we allow a constant error
factor
Counts in a sliding window
Stream of bits; fixed n
Question: “how many 1’s were there among the last n”?
Fact: Any algorithm must use Ω(n) memory to solve this
problem exactly on a data stream
Fact: O(log n) suffices to approximate within 1%
Argument for sketches
If we keep one count, it’s ok to use a lot of memory
If we have to keep many counts, they should use low memory
When learning / mining, we need to keep many counts
• Sketching is a good basis for data stream learning / mining
Sampling
Problem:
Given a data stream, choose k items uniformly at random, storing only k
elements in memory.
Input:
Stream of data that arrive online
Sample size k
Sample range: entire stream or most recent window (count-based or
time-based)
Output:
k elements chosen uniformly at random within the sample range
Reservoir Sampling
Classical algorithm by Vitter (1985):
The size of the data stream is not known in advance
Goal: maintain a fixed-size uniform random sample
Put the first k elements from the stream into the
reservoir S
When the i-th element arrives, where i > k:
• Add it to reservoir S with probability p = k/i
• If added, evict an element chosen uniformly at random from S
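A minimal Python sketch of the procedure above; overwriting a uniformly random slot is equivalent to evicting a random element and inserting the new one.

```python
import random

def reservoir_sample(stream, k):
    """Return k items chosen uniformly at random from the stream."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)                 # fill the reservoir first
        elif random.random() < k / i:              # keep i-th item w.p. k/i
            reservoir[random.randrange(k)] = item  # evict a random element
    return reservoir

print(reservoir_sample(range(1_000_000), 5))  # 5 uniform picks, one pass
```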
Duplicates in Stream
Observations:
Stream contains duplicate elements
e.g. Zipf distribution (Power Law)
Any value occurring frequently in the sample is a wasteful use of the
available space
Concise Sampling
By Gibbons and Matias (1998)
Represent an element in the sample by (value, count)
τ = 1
Add each new element with probability 1/τ (increase its count if the
element is already in S)
If S is full:
• increase τ to τ′
• keep each sampled element (or each unit of its count) with
probability τ/τ′, evicting or decrementing otherwise
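A minimal Python sketch matching the description above. The capacity bound and the doubling schedule for raising τ are my assumptions; Gibbons and Matias leave the schedule flexible.

```python
import random

def concise_sample(stream, capacity):
    """S maps value -> count; footprint stays at most `capacity` entries."""
    S = {}
    tau = 1.0
    for x in stream:
        if random.random() < 1.0 / tau:        # admit this arrival w.p. 1/tau
            S[x] = S.get(x, 0) + 1             # duplicates only bump a count
            while len(S) > capacity:           # overflow: raise the threshold
                new_tau = 2 * tau              # assumed doubling schedule
                for v in list(S):
                    # keep each counted instance w.p. tau / new_tau
                    kept = sum(random.random() < tau / new_tau
                               for _ in range(S[v]))
                    if kept:
                        S[v] = kept
                    else:
                        del S[v]
                tau = new_tau
    return S, tau
```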
Counting
Most basic question?
How many items have we read so far in the data stream?
To count up to t elements exactly, log t bits are necessary
Next is an approximate solution using log log t bits
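The solution alluded to here is the classic Morris approximate counter: store only (roughly) the logarithm of the count, so the memory is log log t bits. A minimal sketch:

```python
import random

class MorrisCounter:
    def __init__(self):
        self.c = 0                   # stores ~log2(count): O(log log t) bits

    def increment(self) -> None:
        # bump the stored exponent with probability 2^-c
        if random.random() < 2.0 ** -self.c:
            self.c += 1

    def estimate(self) -> float:
        return 2.0 ** self.c - 1     # unbiased estimate of the true count

m = MorrisCounter()
for _ in range(1_000_000):           # count a million events
    m.increment()
print(m.c, m.estimate())             # the counter itself stays tiny
```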
Voting (Consider this!)
First consider the following means of carrying out an election.
We have m voters in a room, each voting for some candidate i
∈ [n]. We ask the voters to run around the room and find one
other voter to pair up with who voted for a different candidate
(note: some voters may not be able to find someone to pair
with, for example if everyone voted for the same candidate).
Then, we kick everyone out of the room who did manage to
find a partner. A claim whose proof we leave to the reader as
an exercise is that if there actually was a candidate with a
strict majority, then some non-zero number of voters will be
left in the room at the end, and furthermore all these voters
will be supporters of the majority candidate.
Majority algorithm
Task: Given a list of m numbers (representing votes),
is there an absolute majority (an element occurring
> m/2 times)?
Correctness based on pairing argument:
• Every non-majority element can be paired with a
majority element
• After the pairing, there will still be majority elements
left
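This pairing argument is implemented in one pass, with one candidate and one counter, by the classic MAJORITY (Boyer-Moore) algorithm; a minimal sketch:

```python
def majority_candidate(stream):
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1      # start a new "unpaired" group
        elif x == candidate:
            count += 1                   # one more unpaired supporter
        else:
            count -= 1                   # pair x off with one supporter
    # if a strict majority exists, it is this candidate;
    # a second pass is needed to verify it really has > m/2 votes
    return candidate

print(majority_candidate("ABABBBA"))  # -> 'B' (4 of 7 votes > m/2)
```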
FREQUENT algorithm (Misra-Gries)
Generalization of MAJORITY: find all elements in a
sequence whose frequency exceeds 1/k fraction of the
total count (i.e. frequency >m/k)
Example: for a stream with m = 12 elements and k = 3, all elements with
more than m/k = 12/3 = 4 occurrences should be reported.
In that example the elements are reported correctly; the count estimates
are off by 2.
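A minimal sketch of FREQUENT with k − 1 counters; the stream below is an illustrative stand-in for the slide's figure.

```python
def misra_gries(stream, k):
    """Candidates for > m/k heavy hitters; counts undercount by <= m/k."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # "pair" x with one occurrence of every counted element
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# m = 12, k = 3: only 'a' (7 occurrences > 4) must be reported.
print(misra_gries(list("aababcacaada"), 3))  # {'a': 5, 'd': 1}: 'a' off by 2
```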
Space Saving
For a performance comparison of these frequent-items algorithms, see:
Graham Cormode and Marios Hadjieleftheriou. "Finding the frequent items
in streams of data." Communications of the ACM 52.10 (2009): 97–105.
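A minimal sketch of Space Saving itself (due to Metwally, Agrawal, and El Abbadi, 2005), one of the algorithms the survey above compares: keep k counters, and when a new element finds the table full, it takes over the smallest counter.

```python
def space_saving(stream, k):
    counters = {}  # value -> count (overestimates by at most the min count)
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # evict the element with the smallest count; inherit its count
            victim = min(counters, key=counters.get)
            counters[x] = counters.pop(victim) + 1
    return counters  # every element with true count > m/k is present

print(space_saving(list("aababcacaada"), 2))  # {'a': 7, 'd': 5}
```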
Sampling a Sliding Window
• Timeliness: old data are not useful.
• Restrict samples to a window of recent data
• As new data arrives, old data “expires”
• Reservoir sampling cannot handle data expiration
• Replace an “expired” element in the reservoir with a random element in the
current window
A Naive Algorithm
• Place a moving window of size N on the stream
• an old element y expires when a new element x arrives
• If y is not in the reservoir, we do nothing; otherwise we replace y
with x
• Problem: periodicity
• If the j-th element is in the sample, then every element with index j+cN is
in the sample
Chain Sampling (Babcock, Datar, Motwani, 2002)
• Motivation:
• When an element x is added to the sample, decide immediately
which future element y will replace x when x expires
• Store y when y arrives (x has not expired yet)
• Of course, we must also decide which future element will replace y; this
way, we never have to look back.
• Create a chain (for sample of size 1)
• Include each new element in the sample with probability
1/min(i,N)
• When the i-th element is added to the sample
• we randomly choose a future element whose index is in [i+1, i+N]
to replace it when it expires
• Once the element with that index arrives, store it and choose the
index that will replace it in turn, building a “chain” of potential
replacements
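A minimal Python sketch for a single sample (k = 1) over a count-based window of the N most recent items; the generator interface and variable names are mine.

```python
import random

def chain_samples(stream, N):
    """Yield, after each arrival, a uniform sample of the current window."""
    sample = None  # (index, value) of the current sample
    chain = []     # future replacements [(index, value)]; the last is pending
    for i, x in enumerate(stream, start=1):
        if random.random() < 1.0 / min(i, N):
            sample = (i, x)  # x becomes the sample; start a fresh chain
            chain = [(random.randint(i + 1, i + N), None)]
        elif chain and chain[-1][0] == i:
            chain[-1] = (i, x)  # the awaited replacement arrived: store it
            chain.append((random.randint(i + 1, i + N), None))
        if sample and sample[0] <= i - N:  # sample fell out of the window
            sample = chain.pop(0)          # its stored successor takes over
        yield sample[1]

for s in chain_samples(range(20), N=5):
    print(s, end=" ")  # at each step, a uniform pick from the last 5 items
```

The expected length of the chain is constant, so the expected memory per maintained sample is O(1).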
Sliding Windows
A useful model of stream processing is that
queries are about a window of length N –
the N most recent elements received
Interesting case: N is so large that the data
cannot be stored in memory, or even on disk
Or, there are so many streams that windows
for all cannot be stored
Amazon example:
For every product X we keep a 0/1 stream of whether
that product was sold in the n-th transaction
We want to answer queries such as: how many times
have we sold X in the last k sales?
Sliding Window: 1 Stream
Sliding window on a single stream:
qwertyuiopasdfghjklzxcvbnm

[Figure: a window of length N = 6 sliding one position at a time over the character stream above; past elements on the left, future arrivals on the right]
Counting Bits (1)
Problem:
Given a stream of 0s and 1s
Be prepared to answer queries of the form
How many 1s are in the last k bits? where k ≤ N
Obvious solution:
Store the most recent N bits
When a new bit comes in, discard the (N+1)-st bit

Example (suppose N = 6): 010011011101010110110110
(past on the left, future on the right)
Counting Bits (2)
You cannot get an exact answer without
storing the entire window
Real problem:
What if we cannot afford to store N bits?
E.g., we're processing 1 billion streams and
N = 1 billion
But we are happy with an approximate
answer
An attempt: Simple solution
Q: How many 1s are in the last N bits?
A simple solution that does not really solve our
problem: the uniformity assumption
Maintain 2 counters:
S: number of 1s from the beginning of the stream
Z: number of 0s from the beginning of the stream
How many 1s are in the last N bits? ≈ N · S / (S + Z)
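A minimal sketch of this two-counter estimator:

```python
class UniformEstimator:
    """Scales the stream-wide rate of 1s to the window: assumes uniformity."""
    def __init__(self):
        self.S = 0  # 1s since the beginning of the stream
        self.Z = 0  # 0s since the beginning of the stream

    def add(self, bit: int) -> None:
        if bit:
            self.S += 1
        else:
            self.Z += 1

    def ones_in_last(self, N: int) -> float:
        return N * self.S / (self.S + self.Z)
```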
But what if the stream is non-uniform?
What if the distribution changes over time?
DGIM Method
[Datar, Gionis, Indyk, Motwani]
DGIM solution that does not assume
uniformity
We store O(log² N) bits per stream
Solution gives approximate answer,
never off by more than 50%
Error factor can be reduced to any fraction > 0,
with more complicated algorithm and
proportionally more stored bits
Idea: Exponential Windows
Solution that doesn’t (quite) work:
Summarize exponentially increasing regions
of the stream, looking backward
Drop small regions if they begin at the same point
as a larger region
[Figure: regions of exponentially increasing width (1, 2, 4, 8, 16, …) summarizing the stream 010011100010100100010110110111001010110011010 looking backward from the present; each region stores its count of 1s, e.g., the window of width 16 has 6 1s, and the count of the region straddling the window boundary N is uncertain ("?")]
We can reconstruct the count of the last N bits, except we
are not sure how many of the last 6 1s are included in the N
What’s Good?
Stores only O(log² N) bits:
O(log N) counts of log₂ N bits each
Easy update as more bits enter
Error in count no greater than the number
of 1s in the “unknown” area
What’s Not So Good?
As long as the 1s are fairly evenly distributed,
the error due to the unknown region is small –
no more than 50%
But it could be that all the 1s are in the
unknown area at the end
In that case, the error is unbounded!
Fixup: DGIM method
[Datar, Gionis, Indyk, Motwani]
Idea: Instead of summarizing fixed-length
blocks, summarize blocks with specific
number of 1s:
Let the block sizes (number of 1s) increase
exponentially
When there are few 1s in the window, block
sizes stay small, so errors are small
DGIM: Timestamps
Each bit in the stream has a timestamp,
starting 1, 2, …
Record timestamps modulo N (the window
size), so we can represent any relevant
timestamp in O(log₂ N) bits
DGIM: Buckets
A bucket in the DGIM method is a record
consisting of:
(A) The timestamp of its end [O(log N) bits]
(B) The number of 1s between its beginning and
end [O(log log N) bits]
Constraint on buckets:
Number of 1s must be a power of 2
That explains the O(log log N) in (B) above
Representing a Stream by Buckets
Either one or two buckets with the same
power-of-2 number of 1s
Buckets do not overlap in timestamps
Buckets are sorted by size
Earlier buckets are not smaller than later buckets
Buckets disappear when their
end-time is > N time units in the past
Example: Bucketized Stream
[Figure: the stream 1001010110001011010101010101011010101010101110101010111010100010110010 bucketized over the window of length N: two buckets of size 1, one of size 2, two of size 4, two of size 8, and at least one of size 16, partially beyond the window]
Three properties of buckets that are maintained:
- Either one or two buckets with the same power-of-2 number of 1s
- Buckets do not overlap in timestamps
- Buckets are sorted by size
Updating Buckets (1)
When a new bit comes in, drop the last
(oldest) bucket if its end-time is prior to N
time units before the current time
2 cases: Current bit is 0 or 1
If the current bit is 0:
no other changes are needed
Updating Buckets (2)
If the current bit is 1:
(1) Create a new bucket of size 1, for just this bit;
its end timestamp = current time
(2) If there are now three buckets of size 1,
combine the oldest two into a bucket of size 2
(3) If there are now three buckets of size 2,
combine the oldest two into a bucket of size 4
(4) And so on …
Example: Updating Buckets
Current state of the stream:
1001010110001011010101010101011010101010101110101010111010100010110010
Bit of value 1 arrives
0010101100010110101010101010110101010101011101010101110101000101100101
Two buckets of size 1 get merged into a bucket of size 2
0010101100010110101010101010110101010101011101010101110101000101100101
Next bit 1 arrives and a new size-1 bucket is created; then a 0 comes, then a 1:
0101100010110101010101010110101010101011101010101110101000101100101101
Buckets get merged…
0101100010110101010101010110101010101011101010101110101000101100101101
State of the buckets after merging
0101100010110101010101010110101010101011101010101110101000101100101101
How to Query?
To estimate the number of 1s in the most
recent N bits:
1. Sum the sizes of all buckets but the last
(note: "size" means the number of 1s in the bucket)
2. Add half the size of the last bucket
Remember: We do not know how many 1s
of the last bucket are still within the wanted
window
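A minimal Python sketch of the DGIM update and query rules just described. For readability, timestamps are kept exact rather than stored modulo N, and the query is for the full window of size N.

```python
import random

class DGIM:
    def __init__(self, N: int):
        self.N = N
        self.t = 0          # current timestamp
        self.buckets = []   # (end_timestamp, size) pairs, newest first

    def add(self, bit: int) -> None:
        self.t += 1
        # drop the oldest bucket once its end-time leaves the window
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if not bit:
            return                           # a 0 needs no other changes
        self.buckets.insert(0, (self.t, 1))  # new size-1 bucket for this bit
        # cascade: three buckets of one size -> merge the two oldest,
        # keeping the end timestamp of the newer of the merged pair
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 2][1]:
                newer = self.buckets[i + 1]
                self.buckets[i + 1] = (newer[0], newer[1] * 2)
                del self.buckets[i + 2]
            i += 1

    def count(self) -> float:
        """Estimated 1s in the last N bits: all buckets, half of the oldest."""
        if not self.buckets:
            return 0.0
        return sum(s for _, s in self.buckets[:-1]) + self.buckets[-1][1] / 2

d = DGIM(N=1000)
for _ in range(100_000):
    d.add(random.getrandbits(1))
print(d.count())  # close to 500: the window holds ~500 1s
```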
Part of this presentation is from:
J. Leskovec, A. Rajaraman, J. Ullman: Mining of
Massive Datasets, http://www.mmds.org