COSC 526 Class 4
Naïve Bayes with Large Feature sets:
(1) Stream and Sort
(2) Request and Answer Pattern
(3) Rocchio’s Algorithm
Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: [email protected]
Slides pilfered from…
• William Cohen’s class (10-605)
• Andrew Moore’s class (10-601)
• Shannon Quinn’s class (CSCI-6900)
• Jure Leskovec (mmds.org)
• More (attributed as they come)!
Verification, Validation and Uncertainty Quantification of Machine Learning Algorithms: Phase I Demonstration
Recalling Naïve Bayes
• You have a train dataset and a test dataset
• Initialize an “event counter” (hashtable) C
• For each example id, y, x1,….,xd in train:
– C(“Y=ANY”) ++; C(“Y=y”) ++
– For j in 1..d:
• C(“Y=y ^ X=xj”) ++
• For each example id, y, x1,….,xd in test:
– For each y’ in dom(Y):
• Compute log Pr(y’, x1,….,xd) =

  Σj log [ (C(X=xj ^ Y=y’) + m·qx) / (C(Y=y’) + m) ]  +  log [ (C(Y=y’) + m·qy) / (C(Y=ANY) + m) ]

  where qx = 1/|V| (V is the vocabulary of words), qy = 1/|dom(Y)|, and m = 1
– Return the best y’

Assumptions we made in last class:
• the hashtable will fit in memory!
• the vocabulary faced while learning is limited!
• Assuming that the hashtable/counters do not fit into memory
• Motivation:
– Zipf’s Law: many words that you see are not seen that often
Implementing Naïve Bayes with Big Data
[Figure: storage hierarchy with increasing access costs – via Bruce Croft]
Implementing Naïve Bayes with constraints
• Assuming that the hashtable/counters do not fit into memory
• Motivation:
– Heaps’ law: if V is the size of the vocabulary and n is the length of the corpus of words:

  V = K·n^β, with constants K and 0 < β < 1
  (K ≈ 0.01–0.1, β ≈ 0.4–0.6)

source: Wikipedia (Heaps’ Law)
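A quick numeric sketch may help fix intuition. The constants K and β below are illustrative picks within the quoted ranges, not values fit to any real corpus:

```python
# Heaps' law sketch: vocabulary size V = K * n**beta grows sublinearly in the
# corpus length n. K and beta below are illustrative, not fit to a corpus.
def heaps_vocab(n, K=44.0, beta=0.5):
    """Predicted vocabulary size for a corpus of n tokens."""
    return K * n ** beta

# With beta = 0.5, doubling the corpus grows the vocabulary by only sqrt(2),
# which is why O(sqrt(n)) vocabulary storage is a reasonable planning number.
growth = heaps_vocab(2_000_000) / heaps_vocab(1_000_000)
```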
Storage and Compute Requirements
• For text classification on a corpus of O(n) words, expect to use O(√n) storage for the vocabulary
• What about other features?
– Hypertext
– Phrases
– Keywords…
Possible Approaches…
• Using a Relational Database
– or a key-value store

Accessing data: times (“numbers that every CS person should know,” according to Jeff Dean):

  L1 cache reference .......................... 0.5 ns
  Branch mispredict ........................... 5 ns
  L2 cache reference .......................... 7 ns
  Mutex lock/unlock ........................... 100 ns
  Main memory reference ....................... 100 ns
  Compress 1 KB with Zippy .................... 10,000 ns
  Send 2 KB over a 1 Gbps network ............. 20,000 ns
  Read 1 MB from memory (sequentially) ........ 250,000 ns
  Transfer & copy data within datacenter ...... 500,000 ns
  Disk seek ................................... 10,000,000 ns
  Read 1 MB sequentially from network ......... 10,000,000 ns
  Read 1 MB sequentially from disk ............ 30,000,000 ns
  Send packet CA → Netherlands → CA ........... 150,000,000 ns
Using a relational DB for Big ML
• Prefer random access of big data:
– e.g., checking if parameter weights are sensible
• Simplest approach:
– sort the data and use binary search
– Advantage: O(log2 n) to access a query row
– Let K = rows per MB
– Scan the data once and record the seek position of every Kth row in an index file (or in memory)
  (cheap, since the index is ~n/1000 the size of the data)
– To find a row r:
• Find r’, the last item in the index smaller than r
• Seek to r’ and read the next MB
  (cost is ~= 2 seeks)
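The scheme above can be simulated in a few lines. Here a sorted Python list stands in for the sorted file on disk, and the index records the “seek position” (just a row offset here) of every Kth row; the data and names are made up for illustration:

```python
import bisect

# In-memory sketch of the "sort + sparse index" idea: `rows` plays the role of
# a sorted file on disk, `index` holds (key, offset) for every K-th row.
def build_index(rows, K):
    return [(rows[i][0], i) for i in range(0, len(rows), K)]

def lookup(rows, index, K, key):
    keys = [k for k, _ in index]
    # First "seek": find the last indexed key <= the query key.
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None
    start = index[i][1]
    # Second "seek": read the next block of K rows and scan it.
    for k, v in rows[start:start + K]:
        if k == key:
            return v
    return None

rows = sorted((f"w{i:03d}", i) for i in range(100))
idx = build_index(rows, K=10)
```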
Using a relational DB for Big ML
• Access went from ~= 1 seek (best possible) to ~= 2 seeks + finding r’ in the index.
– If the index is O(1 MB), then finding r’ is also ~1 seek
– 3 seeks per random access into a GB
• What if the index is still large?
– Build (the same sort of) index for the index!
– Now we’re paying 4 seeks for each random access into a TB
– Suppose we do this recursively…
• B-tree structure
• Good for access, but bad for inserting and deleting records!
Where does the latency come from?
• The best case occurs when the data is in the same sector or block
• Data can be spread among non-adjacent blocks/sectors
How do we reduce the cost of access?
Using a relational DB for Big ML
• Counts are stored on disk, not in memory:
– Access times: O(n·scan) → O(n·scan·4·seek)
• DBs are good at caching frequently-used values
– Seeks may be infrequent
• However, with random access, the cost of access and updating (deletion/insertion) can be prohibitive!
Counting
• A stream of examples (example 1, example 2, example 3, ….) feeds “increment C[x] by D” messages into the counting logic, which updates a hash table, database, etc.
• Hashtable issue: memory is too small
• Database issue: seeks are slow
Distributed Counting
• The counting logic on Machine 0 routes “increment C[x] by D” messages to Machines 1..K, each holding its own hash table
• Now we have enough memory….
• New issues:
– Machines and $$!
– Routing increment requests to the right machine
– Sending increment requests across the network
– Communication complexity
Using a memory-distributed database
• Distributed machines now hold the counts:
– Extra cost in communication, O(n) [it’s okay]:
• O(n·scan) → O(n·scan + n·send/rcv)
– Complexity: routing needs to be done correctly
• Distributing data in memory across machines is not as cheap as accessing memory locally, because of communication costs.
• The problem we’re dealing with is not size. It’s the interaction between size and locality: we have a large structure that’s being accessed in a non-local way.
What other options do we have…
• Compress the counter hash-table:
– use integers as keys instead of strings?
– use approximate counts?
– discard infrequent/unhelpful words?
• Trade off time for space:
– If the counter updates are better ordered, we can avoid using the disk
Large Vocabulary Naïve Bayes / Counting
• Assume the counters need K times as much memory as we have
• How?
  Construct a hash function h(event)
  For i = 0, …, K-1:
    Scan through the train dataset
    Increment counters for an event only if h(event) mod K == i
    Save this counter set to disk at the end of the scan
  After K scans you have a complete counter set
• This works for any counting task, not just naïve Bayes
• What we’re really doing here is organizing our “messages” to get more locality….
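The K-pass trick above can be sketched directly. The toy event list is made up, and the partial tables are yielded rather than written to disk:

```python
from collections import Counter

# Sketch of the K-pass counting trick: if the full counter table would need
# K times the memory we have, make K passes over the data and, on pass i,
# count only events whose hash falls in partition i. `events` must be
# re-scannable (a list, or a file we can re-open), not a one-shot generator.
def partitioned_counts(events, K):
    for i in range(K):
        # One full scan per partition; in the real setting this partial
        # table would be saved to disk here instead of yielded.
        yield Counter(e for e in events if hash(e) % K == i)

events = ["a", "b", "a", "c", "b", "a"]
merged = Counter()
for partial in partitioned_counts(events, K=3):
    merged.update(partial)
```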
How can we do Naïve Bayes with a large vocabulary?
• Sort the intermediate counts:
– Worst case scenario O(n log n)!
– For very large datasets, use merge-sort:
• sort k subsets that fit in memory
• merge the results in linear time
Large Vocabulary Naïve Bayes
• Create a hashtable C
• For each example id, y, x1,….,xd in train:
– C(“Y=ANY”) ++; C(“Y=y”) ++
– Print “Y=ANY += 1”
– Print “Y=y += 1”
– For j in 1..d:
• C(“Y=y ^ X=xj”) ++
• Print “Y=y ^ X=xj += 1”
  (the printed lines are messages to another component that increments the counters)
• Sort the event-counter update “messages”
• Scan the sorted messages, then compute and output the final counter values
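The message-printing step translates to a few lines; the toy training set below is made up:

```python
# Sketch of the message-emitting step: each training example emits
# counter-update messages instead of touching a shared hashtable; the actual
# counting is deferred to the later sort-and-scan phase.
def emit_messages(examples):
    for y, words in examples:
        yield "Y=ANY += 1"
        yield f"Y={y} += 1"
        for x in words:
            yield f"Y={y} ^ X={x} += 1"

train = [("sports", ["hockey", "hat"]), ("business", ["zynga"])]
messages = list(emit_messages(train))
```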
Distributed Counting → Stream and Sort Counting
• Examples (example 1, example 2, example 3, ….) on Machine 0 feed the counting logic, which emits “increment C[x] by D” messages
• Message-routing logic delivers each message to the right machine (Machines 1..K), each holding its own hash table
Distributed Counting → Stream and Sort Counting
• Machine A: counting logic streams over the examples and emits “increment C[x] by D” messages
• Machine B: sorts the messages (“C[x1] += D1”, “C[x1] += D2”, ….)
• Machine C: logic to combine counter updates
Large-vocabulary Naïve Bayes
• Create a hashtable C
• For each example id, y, x1,….,xd in train:
– C(“Y=ANY”) ++; C(“Y=y”) ++
– Print “Y=ANY += 1”
– Print “Y=y += 1”
– For j in 1..d:
• C(“Y=y ^ X=xj”) ++
• Print “Y=y ^ X=xj += 1”
• Sort the event-counter update “messages”
– We’re collecting together messages about the same counter
• Scan and add the sorted messages and output the final counter values

The sorted messages look like:

  Y=business += 1
  Y=business += 1
  …
  Y=business ^ X=aaa += 1
  …
  Y=business ^ X=zynga += 1
  Y=sports += 1
  …
  Y=sports ^ X=hat += 1
  Y=sports ^ X=hockey += 1
  Y=sports ^ X=hockey += 1
  Y=sports ^ X=hockey += 1
  …
  Y=sports ^ X=hoe += 1
  …
Large-vocabulary Naïve Bayes
Scan-and-add, streaming over the sorted messages:

  previousKey = Null
  sumForPreviousKey = 0
  For each (event, delta) in input:
    If event == previousKey:
      sumForPreviousKey += delta
    Else:
      OutputPreviousKey()
      previousKey = event
      sumForPreviousKey = delta
  OutputPreviousKey()

  define OutputPreviousKey():
    If previousKey != Null:
      print previousKey, sumForPreviousKey

Accumulating the event counts requires constant storage… as long as the input is sorted.
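The scan-and-add pseudocode translates almost line for line into Python. The message format follows the slides; the small sorted input is made up:

```python
# One pass over the *sorted* messages, using constant storage beyond the
# current key and its running sum.
def scan_and_add(sorted_messages):
    out = []
    previous_key, running_sum = None, 0
    for line in sorted_messages:
        event, delta = line.rsplit(" += ", 1)
        if event == previous_key:
            running_sum += int(delta)
        else:
            if previous_key is not None:       # OutputPreviousKey()
                out.append((previous_key, running_sum))
            previous_key, running_sum = event, int(delta)
    if previous_key is not None:               # flush the last key
        out.append((previous_key, running_sum))
    return out

msgs = sorted([
    "Y=sports ^ X=hockey += 1",
    "Y=sports += 1",
    "Y=sports ^ X=hockey += 1",
    "Y=business += 1",
])
counts = dict(scan_and_add(msgs))
```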
Large-vocabulary Naïve Bayes
• For each example id, y, x1,….,xd in train:
– Print Y=ANY += 1
– Print Y=y += 1
– For j in 1..d:
• Print Y=y ^ X=xj += 1
  Complexity: O(n), n = size of train (assuming a constant number of labels applies to each document)
• Sort the event-counter update “messages”
  Complexity: O(n log n)
• Scan and add the sorted messages and output the final counter values
  Complexity: O(n)

  java MyTrainer train | sort | java MyCountAdder > model

Model size: max O(n), O(|V||dom(Y)|)
Distributed Counting → Stream and Sort Counting
• Machine A: counting logic streams over the examples and emits “increment C[x] by D” messages via standardized message-routing logic
• Machine B: sorts the messages (“C[x1] += D1”, “C[x1] += D2”, ….)
• Machines C1..CK: logic to combine counter updates
• Trivial to parallelize!
Part II: Phrase Finding with Request and Answer Pattern
ACL Workshop 2003
Why phrase-finding?
• There are lots of phrases
• There’s no supervised data
• It’s hard to articulate:
– What makes a phrase a phrase, vs. just an n-gram?
• a phrase is independently meaningful (“test drive”, “red meat”) or not (“are interesting”, “are lots”)
– What makes a phrase interesting?
In computer terms (aka learning), what is a good phrase?
• Phraseness: the degree to which a sequence of words “qualifies” as a phrase
– Statistics: how often do the words co-occur vs. occur separately?
• Informativeness: the degree to which the phrase captures or illustrates a key idea within a document, relative to a domain:
– Statistics: two aspects of co-occurrence
– Background corpus (training)
– Foreground corpus (testing)
“Phraseness”1 – based on BLRT
• Binomial Log-Likelihood Ratio Test (BLRT):
– Draw samples:
• n1 draws, k1 successes
• n2 draws, k2 successes
• Are they from one binomial (i.e., k1/n1 and k2/n2 differ only due to chance) or from two distinct binomials?
– Define:
• pi = ki/ni, p = (k1+k2)/(n1+n2)
• L(p,k,n) = p^k (1-p)^(n-k)

  BLRT(n1, k1, n2, k2) = 2 log [ L(p1,k1,n1)·L(p2,k2,n2) / ( L(p,k1,n1)·L(p,k2,n2) ) ]
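The BLRT score just defined is a few lines of code. Computing the likelihood in log space is an implementation choice here (to avoid underflow), and the sample counts are made up:

```python
import math

# BLRT sketch following the definition above. log_L(p, k, n) is the log of
# L(p,k,n) = p**k * (1-p)**(n-k), with the degenerate p in {0, 1} cases
# handled explicitly so we never take log(0).
def log_L(p, k, n):
    if p == 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p == 1.0:
        return 0.0 if k == n else float("-inf")
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def blrt(n1, k1, n2, k2):
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2.0 * (log_L(p1, k1, n1) + log_L(p2, k2, n2)
                  - log_L(p, k1, n1) - log_L(p, k2, n2))

same = blrt(100, 10, 200, 20)       # identical proportions: no evidence
different = blrt(100, 50, 100, 10)  # very different proportions
```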
“Phraseness”1 – based on BLRT
For a phrase x y (W1=x ^ W2=y):
– Define pi = ki/ni, p = (k1+k2)/(n1+n2), L(p,k,n) = p^k (1-p)^(n-k)

  φp(n1, k1, n2, k2) = 2 log [ L(p1,k1,n1)·L(p2,k2,n2) / ( L(p,k1,n1)·L(p,k2,n2) ) ]

  k1 = C(W1=x ^ W2=y)   how often bigram x y occurs in corpus C
  n1 = C(W1=x)          how often word x occurs in corpus C
  k2 = C(W1≠x ^ W2=y)   how often y occurs in C after a non-x
  n2 = C(W1≠x)          how often a non-x occurs in C

Does y occur at the same frequency after x as in other positions?
“Informativeness”1 – based on BLRT
For a phrase x y (W1=x ^ W2=y) and two corpora, C (foreground) and B (background):
– Define pi = ki/ni, p = (k1+k2)/(n1+n2), L(p,k,n) = p^k (1-p)^(n-k)

  φi(n1, k1, n2, k2) = 2 log [ L(p1,k1,n1)·L(p2,k2,n2) / ( L(p,k1,n1)·L(p,k2,n2) ) ]

  k1 = C(W1=x ^ W2=y)    how often bigram x y occurs in corpus C
  n1 = C(W1=* ^ W2=*)    how many bigrams occur in corpus C
  k2 = B(W1=x ^ W2=y)    how often x y occurs in the background corpus
  n2 = B(W1=* ^ W2=*)    how many bigrams occur in the background corpus

Does x y occur at the same frequency in both corpora?
Phrase finding, with BLRT
Intuition of phraseness and informativeness
• Goal: distinguish between distributions and quantify how different they are
• Phraseness: compare the bigram-model estimate of x y with the unigram-model estimate
• Informativeness: compare estimates from the foreground vs. the background corpus
• Use pointwise KL divergence
Phraseness: Defining our metric
• Using KL divergence
• Difference between the bigram and unigram language models in the foreground:
– Bigram model: P(x y) = P(x)P(y|x)
– Unigram model: P(x y) = P(x)P(y)
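The pointwise KL idea can be sketched numerically. The helper names and the toy probabilities below are illustrative, not from the slides:

```python
import math

# Pointwise KL sketch of the phraseness score: the contribution of the single
# outcome "x y" to KL(bigram model || unigram model) in the foreground.
def pointwise_kl(p, q):
    """p * log(p / q): one outcome's contribution to KL(p || q)."""
    return p * math.log(p / q)

def phraseness(p_xy, p_x, p_y):
    # Bigram model assigns p_xy = P(x)P(y|x); unigram model assigns P(x)P(y).
    return pointwise_kl(p_xy, p_x * p_y)

# A "test drive"-like bigram occurs far more often than independence would
# predict, so its phraseness is positive; an independent pair scores ~0.
collocated = phraseness(p_xy=0.001, p_x=0.004, p_y=0.005)
independent = phraseness(p_xy=0.004 * 0.005, p_x=0.004, p_y=0.005)
```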
Informativeness: Defining our metric
• Using KL divergence (foreground relative to background)
• Why not combine phraseness & informativeness?
  (the combined score adds the phraseness and informativeness terms)
Phrase finding with pointwise KL divergence, combined
Advantages of using pointwise KL divergence over BLRT
• BLRT scores frequency in foreground and background symmetrically
– pointwise KL divergence does not!
• Phraseness and informativeness are comparable:
– combination is straightforward
– no classifier needed
• Language modeling:
– extends to n-grams and smoothing methods
– can modularize the software
Now, how do we implement it?
• Request and Answer Pattern
• Main data structure: a table of key-value pairs
– key: a phrase x y
– value: a mapping from attribute names to values
• Send messages to this “table” and get results back

  Key          Value
  old man      freq(B) = 10, freq(F) = 13, I = 1.3, P = 740
  bad service  freq(B) = 8,  freq(F) = 25, I = 560, P = 254
From the foreground, generating and scoring phrases
• Count events with “W1=x and W2=y”:
– stream-and-sort approach, as in naïve Bayes
• Stream through the output:
– convert to {phrase, attributes-of-phrase} records
– only one attribute: freq_in_F=n
• Stream through the foreground and count events “W1=x” in a (memory-based) hashtable
• Compute phraseness. Scan through the phrase table to add an extra attribute to hold word frequencies
From the background, count events
• Count events with “W1=x and W2=y”:
– stream-and-sort approach, as in naïve Bayes
• Stream through the output:
– convert to {phrase, attributes-of-phrase} records
– only one attribute: freq_in_B=n
• Sort the two phrase tables (freq_in_B and freq_in_F) and run the output through another reduce:
– append together attributes with the same key
Putting things together!
• Scan the phrase table once more:
– add in the informativeness scores

Assuming the word vocabulary nW is small:
• Scan foreground corpus C for phrases: O(nC), producing mC phrase records (of course mC << nC)
• Compute phraseness: O(mC)   (assumes word counts fit in memory)
• Scan background corpus B for phrases: O(nB), producing mB phrase records
• Sort together and combine records: O(m log m), m = mB + mC
• Compute informativeness and combined quality: O(m)
Can we do better? Keeping word counts out of memory
• Assume we have word and phrase tables:
– how do we incorporate word records with phrases?
• For each phrase x y, request the word counts:
– Print “x ~request=freq-in-F, from=xy”
– Print “y ~request=freq-in-F, from=xy”
• Sort all the word requests in with the word tables
• Scan through the result and generate the answers: for each word w, a1=n1, a2=n2, ….
– Print “xy ~request=freq-in-C, from=w”
• Sort again and augment the xy records appropriately
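The request-and-answer join above can be simulated in memory. This sketch assumes every requested word actually has a count record; the names and toy counts are made up:

```python
# In-memory sketch of the request-and-answer pattern: each phrase "x y" emits
# one request per word it needs, the requests are sorted in with the
# word-count records, and a single scan answers them.
def answer_requests(word_counts, phrases):
    # Requests are (word, ("request", phrase)) messages.
    requests = [(w, ("request", xy)) for xy in phrases for w in xy.split()]
    # Sort records and requests together so each word's count record comes
    # immediately before all requests for that word.
    table = sorted(list(word_counts.items()) + requests,
                   key=lambda kv: (kv[0], isinstance(kv[1], tuple)))
    answers, current = {}, None
    for key, val in table:
        if isinstance(val, tuple):        # a request: answer from last record
            answers.setdefault(val[1], {})[key] = current
        else:                             # a word record: remember its count
            current = val
    return answers

words = {"red": 10, "meat": 4, "are": 50, "lots": 7}
ans = answer_requests(words, ["red meat", "are lots"])
```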
Final Analysis
1. Scan foreground corpus F for phrases and words: O(nF)
   – producing mF phrase records, vF word records
2. Scan phrase records producing word-freq requests: O(mF)
   – producing 2mF requests
3. Sort requests with word records: O((2mF + vF) log(2mF + vF)) = O(mF log mF), since vF < mF
4. Scan through and answer requests: O(mF)
5. Sort answers with phrase records: O(mF log mF)
• Repeat 1-5 for the background corpus: O(nB + mB log mB)
• Combine the two phrase tables: O(m log m), m = mB + mC
• Compute all the statistics: O(m)
Part III: Rocchio’s Algorithm
Naïve Bayes is unusual
• Only one pass through the data
• We don’t necessarily care about the order of the data
• Rocchio’s algorithm:
– instead of “binary” or “counts” for words, we will use a vector space model
– then classify!

Relevance Feedback in Information Retrieval, SMART Retrieval System Experiments in Automatic Document Processing, 1971, Prentice Hall Inc.
Vector Space Model for Documents
• Three aspects are needed:
– Word weighting method: we will use the term frequency model
– Document length normalization: we will use Euclidean vector length
– Similarity measure: we will use cosine similarity
Rocchio’s algorithm (many variants of these formulae exist)

  DF(w)   = # different docs w occurs in
  TF(w,d) = # times w occurs in doc d
  IDF(w)  = |D| / DF(w), where D is the total set of documents
  u(w,d)  = log(TF(w,d) + 1) × log(IDF(w))
  u(d)    = ⟨u(w1,d), …., u(w|V|,d)⟩

  u(y) = α · (1/|Cy|) Σ_{d ∈ Cy} u(d)/||u(d)||2  −  β · (1/|D−Cy|) Σ_{d’ ∈ D−Cy} u(d’)/||u(d’)||2

  f(d) = argmax_y [ u(d)/||u(d)||2 ] · [ u(y)/||u(y)||2 ]

Notes:
• u(w,d) = 0 for words not in d, so store only the non-zeros in u(d): its size is O(|d|)
• But the size of u(y) is O(|V|)
Rocchio’s algorithm
Given a table mapping w to DF(w), we can compute v(d) from the words in d… and the rest of the learning algorithm is just adding:

  DF(w)   = # different docs w occurs in
  TF(w,d) = # times w occurs in doc d
  IDF(w)  = |D| / DF(w)
  u(w,d)  = log(TF(w,d) + 1) × log(IDF(w))
  u(d)    = ⟨u(w1,d), …., u(w|V|,d)⟩,   v(d) = u(d)/||u(d)||2 = ⟨v(w1,d), ….⟩

  u(y) = α · (1/|Cy|) Σ_{d ∈ Cy} v(d)  −  β · (1/|D−Cy|) Σ_{d’ ∈ D−Cy} v(d’),   v(y) = u(y)/||u(y)||2

  f(d) = argmax_y v(d) · v(y)
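A compact sketch of the v(d)/v(y) machinery above, using the α-only variant (β = 0) on made-up toy documents. It assumes test-time words were seen in training (a real implementation would smooth or skip unseen words):

```python
import math
from collections import Counter, defaultdict

# Rocchio sketch: u(w,d) = log(TF+1) * log(IDF); vectors are L2-normalized
# sparse dicts; class centroids are averaged, normalized document vectors.
def normalize(u):
    norm = math.sqrt(sum(x * x for x in u.values())) or 1.0
    return {w: x / norm for w, x in u.items()}

def doc_vector(words, df, n_docs):
    tf = Counter(words)
    return normalize({w: math.log(tf[w] + 1) * math.log(n_docs / df[w])
                      for w in tf})

def train(docs):                      # docs: list of (label, words)
    df = Counter(w for _, words in docs for w in set(words))
    n = len(docs)
    sums, counts = defaultdict(Counter), Counter()
    for y, words in docs:
        sums[y].update(doc_vector(words, df, n))
        counts[y] += 1
    centroids = {y: normalize({w: v / counts[y] for w, v in s.items()})
                 for y, s in sums.items()}
    return df, n, centroids

def classify(words, df, n, centroids):
    v = doc_vector(words, df, n)
    return max(centroids, key=lambda y: sum(v.get(w, 0.0) * x
                                            for w, x in centroids[y].items()))

docs = [("sports", ["hockey", "goal", "ice"]),
        ("sports", ["goal", "match"]),
        ("business", ["stock", "market"]),
        ("business", ["market", "merger"])]
df, n, centroids = train(docs)
```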
Rocchio vs. Bayes
Imagine a similar process, but for labeled documents…

Train data:
  id1 y1 w1,1 w1,2 w1,3 …. w1,k1
  id2 y2 w2,1 w2,2 w2,3 ….
  id3 y3 w3,1 w3,2 ….
  id4 y4 w4,1 w4,2 …
  id5 y5 w5,1 w5,2 ….
  ..

Event counts:
  X=w1^Y=sports      5245
  X=w1^Y=worldNews   1054
  X=..               2120
  X=w2^Y=…             37
  X=…                   3
  …

Recall the Naïve Bayes test process:
  id1 y1 w1,1 w1,2 w1,3 …. w1,k1   C[X=w1,1^Y=sports]=5245, C[X=w1,1^Y=..], C[X=w1,2^…]
  id2 y2 w2,1 w2,2 w2,3 ….         C[X=w2,1^Y=….]=1054, …, C[X=w2,k2^…]
  id3 y3 w3,1 w3,2 ….              C[X=w3,1^Y=….]=…
  id4 y4 w4,1 w4,2 …
  …
Rocchio Summary
• Compute DF:
– one scan through the docs
– time: O(n), n = corpus size (like NB event-counts)
• Compute v(idi) for each document:
– output size O(n)
– time: O(n); one scan if DF fits in memory (like the first part of the NB test procedure otherwise)
• Add up vectors to get v(y):
– time: O(n); one scan if the v(y)’s fit in memory (like NB training otherwise)
• Classification ~= disk NB
Summary
• Three approaches to working with Big “Streaming” Data:
– Stream and Sort
– Request and Answer
– Rocchio’s Algorithm
• Implementing phrase finding:
– correlations between words for identifying phrases
• Where can these be used?
– NLP-related tasks: complex named-entity resolution
– other counting applications!
More about Data Streams…
Streaming Data
• Data streams are “infinite” and “continuous”
– the distribution can evolve over time
• Managing streaming data is important:
– input is coming extremely fast!
– we don’t see all of the data
– (sometimes) input rates are limited (e.g., Google queries, Twitter and Facebook status updates)
• Key question in streaming data:
– How do we make critical calculations about the stream using limited (secondary) memory?
Is Naïve Bayes a Streaming Algorithm?
• Yes!
• In machine learning, we call this Online Learning
– allows us to model as the data becomes available
– adapts slowly as the data streams in
• How are we doing this?
– NB: slow updates to the model
– learn from the training data
– for every example in the stream, we update the model (learning rate)
General Stream Processing Model
• Streams entering; each stream is composed of elements/tuples, e.g.:
  … 1, 5, 2, 7, 0, 9, 3
  … a, r, v, t, y, h, b
  … 0, 0, 1, 0, 1, 1, 0      (time →)
• A processor handles standing queries and ad-hoc queries, and produces output
• Limited working storage, plus archival storage

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Types of queries on Data Streams
• Sampling from a data stream:
– construct a random sample
• Queries over a sliding window:
– number of items of type x in the last k elements of the stream
• Filtering a data stream:
– select elements with property x from the stream
• Counting distinct elements:
– number of distinct elements in the last k elements of the stream
• Estimating moments
• Finding frequent items
Applications
• Finding relevant products using clickstreams
– Amazon, Google, all of them want to find relevant products to shop for, based on user interactions
• Trending topics on social media
• Health monitoring using sensor networks
• Monitoring molecular dynamics simulations

A. Ramanathan, J.-O. Yoo and C.J. Langmead, On-the-fly identification of conformational sub-states from molecular dynamics simulations. J. Chem. Theory Comput. (2011). Vol. 7 (3): 778-789.
A. Ramanathan, P.K. Agarwal, M.G. Kurnikova, and C.J. Langmead, An online approach for mining collective behaviors from molecular dynamics simulations. J. Comp. Biol. (2010). Vol. 17 (3): 309-324.
Sampling from a Data Stream
• Practical consideration: we cannot store all of the data, but we can store a sample!
• Two problems:
1. Sample a fixed proportion of the elements in the stream (say 1 in 10)
2. Maintain a random sample of fixed size over a potentially infinite stream
– at any time k, we would like a random sample of s elements
• Desirable property: for all time steps k, each of the k elements seen so far has equal probability of being sampled
Sampling a fixed proportion
• Scenario: let’s look at a search engine query stream
– Stream: <user, query, time>
– Why is this interesting?
– We can answer questions like “how often did a user have to run the same query in a single day?”
– We have storage for only 1/10th of the query stream
• Simple solution:
– generate a random integer in [0..9] for each query
– store the query if the integer is 0, otherwise discard it
What is wrong with our simple approach?
• Suppose each user issues x queries once and d queries twice, for (x + 2d) total queries
• What fraction of queries by an average search engine user are duplicates? d/(x+d)
• If we keep only 10% of the queries:
– The sample will contain x/10 of the singleton queries and 2d/10 of the duplicate queries at least once
– But only d/100 pairs of duplicates:
• d/100 = 1/10 ∙ 1/10 ∙ d
– Of the d “duplicates”, 18d/100 appear exactly once:
• 18d/100 = ((1/10 ∙ 9/10) + (9/10 ∙ 1/10)) ∙ d
– So the sample’s fraction of duplicates is (d/100) divided by (x/10 + d/100 + 18d/100):
• d/(10x + 19d), not d/(x+d)
Instead of queries, let’s sample users…
• Pick 1/10th of the users and take all their searches into the sample
• Use a hash function that hashes the user name or user id uniformly into 10 buckets
• Why is this a better solution?
Generalized Solution
• Stream of tuples with keys:
– the key is some subset of each tuple’s components
• e.g., the tuple is (user, search, time); the key is user
– the choice of key depends on the application
• To get a sample of a/b fraction of the stream:
– hash each tuple’s key uniformly into b buckets
– pick the tuple if its hash value is at most a
• How do we generate a 30% sample?
– Hash into b = 10 buckets; take the tuple if it hashes to one of the first 3 buckets
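The key-hash sampler above is a few lines of code. This sketch keeps a tuple when its bucket is strictly below a (i.e., one of the first a of b buckets, matching the 30% example); md5 is used because Python's built-in hash() is salted per process, and the query data is made up:

```python
import hashlib

# Generalized a/b-fraction sampler: hash each tuple's key into b buckets and
# keep the tuple if it lands in one of the first a buckets, so all tuples
# sharing a key are kept or dropped together.
def bucket(key, b):
    return int(hashlib.md5(key.encode("utf8")).hexdigest(), 16) % b

def sample(stream, key_of, a, b):
    return [t for t in stream if bucket(key_of(t), b) < a]

queries = [("alice", "q1"), ("bob", "q2"), ("alice", "q3"), ("carol", "q4")]
kept = sample(queries, key_of=lambda t: t[0], a=3, b=10)   # ~30% of users
```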
Maintaining a fixed-size sample
• Suppose we need to maintain a random sample S of size exactly s tuples
– e.g., a main-memory size constraint
• Why? We don’t know the length of the stream in advance
• Suppose at time n we have seen n items
– each item should be in the sample S with equal probability s/n
• How to think about the problem: say s = 2
– Stream: a x c y z k c d e g…
– At n = 5, each of the first 5 tuples is included in the sample S with equal probability
– At n = 7, each of the first 7 tuples is included in the sample S with equal probability
• An impractical solution would be to store all n tuples seen so far and pick s of them at random
Fixed Size Sample
• Algorithm (a.k.a. Reservoir Sampling):
– store the first s elements of the stream in S
– suppose we have seen n-1 elements, and now the nth element arrives (n > s):
• with probability s/n, keep the nth element, else discard it
• if we keep the nth element, it replaces one of the s elements in the sample S, picked uniformly at random
• Claim: this algorithm maintains a sample S with the desired property:
– after n elements, the sample contains each element seen so far with probability s/n
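The algorithm above is a direct few-line loop; the seed parameter is only for reproducibility of the sketch:

```python
import random

# Reservoir sampling as described above: keep the first s elements, then keep
# the n-th element with probability s/n, replacing a uniformly random slot.
def reservoir_sample(stream, s, seed=0):
    rng = random.Random(seed)
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            sample.append(item)                 # fill the reservoir
        elif rng.random() < s / n:
            sample[rng.randrange(s)] = item     # replace a random slot
    return sample

picked = reservoir_sample(range(1000), s=10)
```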
Proof: By Induction
• We prove this by induction:
– assume that after n elements, the sample contains each element seen so far with probability s/n
– we need to show that after seeing element n+1, the sample maintains the property:
• the sample contains each element seen so far with probability s/(n+1)
• Base case:
– after we see n = s elements, the sample S has the desired property:
• each of the n = s elements is in the sample with probability s/s = 1

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Proof: By Induction
• Inductive hypothesis: after n elements, the sample S contains each element seen so far with prob. s/n
• Now element n+1 arrives
• Inductive step: for an element already in S, the probability that the algorithm keeps it in S is:

  (1 − s/(n+1)) + (s/(n+1)) · ((s−1)/s) = n/(n+1)

  (first term: element n+1 is discarded; second term: element n+1 is kept, but this element’s slot in the sample is not the one picked for replacement)

• So, at time n, tuples in S were there with prob. s/n
• From time n to n+1, a tuple stays in S with prob. n/(n+1)
• So the prob. a tuple is in S at time n+1 = (n/(n+1)) · (s/n) = s/(n+1)
Take home messages…
• Part of ML is a good grasp of theory
• Part of ML is a good grasp of which hacks tend to work
• These are not always the same
– especially in big-data situations
• Techniques covered so far:
– brute-force estimation of a joint distribution
– Naive Bayes
– stream-and-sort and request-and-answer patterns
– BLRT and KL-divergence (and when to use them)
– TF-IDF weighting, especially IDF
• it’s often useful even when we don’t understand why