COSC 526 Class 4
Naïve Bayes with Large Feature Sets: (1) Stream and Sort (2) Request and Answer Pattern (3) Rocchio's Algorithm
Arvind Ramanathan, Computational Science & Engineering Division, Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266, E-mail: [email protected]
Verification, Validation and Uncertainty Quantification of Machine Learning Algorithms: Phase I Demonstration

Slides pilfered from…
• William Cohen's class (10-605)
• Andrew Moore's class (10-601)
• Shannon Quinn's class (CSCI-6900)
• Jure Leskovec (mmds.org)
• More (attributed as they come)!

Recalling Naïve Bayes
• You have a train dataset and a test dataset
• Initialize an "event counter" (hashtable) C
• For each example id, y, x1,…,xd in train:
– C("Y=ANY") ++; C("Y=y") ++
– For j in 1..d: C("Y=y ^ X=xj") ++
• For each example id, y, x1,…,xd in test:
– For each y' in dom(Y), compute

log Pr(y', x1,…,xd) = Σ_j log[ (C(X=xj ^ Y=y') + m·qx) / (C(Y=y') + m) ] + log[ (C(Y=y') + m·qy) / (C(Y=ANY) + m) ]

where qx = 1/|V|, qy = 1/|dom(Y)|, and m = 1
– Return the best y'
• Assumptions we made last class: the hashtable will fit in memory, and the vocabulary faced while learning is limited!

Implementing Naïve Bayes with constraints
• Assume the hashtable/counters do not fit into memory
• Motivation:
– Zipf's law: many of the words that you see are not seen that often

Implementing Naïve Bayes with Big Data
[figure via Bruce Croft: increasing costs as data moves down the memory/storage hierarchy]

Implementing Naïve Bayes with constraints
• Assume the hashtable/counters do not fit into memory
• Motivation:
– Heaps' law: if V is the size of the vocabulary and n is the length of the corpus in words, then V = K·n^β, with constants K and 0 < β < 1; typically K is between 10 and 100 and β between 0.4 and 0.6 (source: Wikipedia, Heaps' law)

Storage and Compute Requirements
• For text classification on a corpus of O(n) words, expect to use O(√n) storage for the vocabulary (Heaps' law with β ≈ 0.5)
• What about other features?
– Hypertext
– Phrases
– Keywords…

Possible Approaches…
• Use a relational database, or a key-value store

Accessing data: times (numbers every CS person should know, according to Jeff Dean)
L1 cache reference: 0.5 ns
Branch mispredict: 5 ns
L2 cache reference: 7 ns
Mutex lock/unlock: 100 ns
Main memory reference: 100 ns
Compress 1 KB with Zippy: 10,000 ns
Send 2 KB over 1 Gbps network: 20,000 ns
Read 1 MB sequentially from memory: 250,000 ns
Transfer & copy data within datacenter: 500,000 ns
Disk seek: 10,000,000 ns
Read 1 MB sequentially from network: 10,000,000 ns
Read 1 MB sequentially from disk: 30,000,000 ns
Send packet CA → Netherlands → CA: 150,000,000 ns

Using a relational DB for Big ML
• We prefer random access to big data: e.g., checking whether parameter weights are sensible
• Simplest approach: sort the data and use binary search
– Advantage: O(log2 n) to access a query row
– Let K = rows per MB; scan the data once and record the seek position of every Kth row in an index file (or in memory). This is cheap, since the index is roughly 1/1000 the size of the data
– To find a row r:
• Find r', the last item in the index smaller than r
• Seek to r' and read the next MB
– Cost is ~2 seeks

Using a relational DB for Big ML
• Access went from ~1 seek (best possible) to ~2 seeks plus finding r' in the index
– If the index is O(1 MB), finding r' is also about 1 seek: ~3 seeks per random access into a GB
• What if the index is still large?
– Build the same sort of index for the index!
– Now we pay ~4 seeks for each random access into a TB
– Doing this recursively gives a B-tree structure
• Good for access, but bad for inserting and deleting records!
Where does the latency come from?
• The best case occurs when the data is in the same sector or block
• Data can be spread among non-adjacent blocks/sectors
• How do we reduce the cost of access?

Using a relational DB for Big ML
• Counts are stored on disk, not in memory:
– Access time goes from O(n·scan) to O(n·scan·4·seek)
• DBs are good at caching frequently used values, so seeks may be infrequent
• However, with random access, the cost of access and updating (deletion/insertion) can be prohibitive!

Counting
• A stream of examples (example 1, example 2, example 3, …) feeds counting logic, which sends "increment C[x] by D" updates to a hash table, database, etc.
• Hashtable issue: memory is too small
• Database issue: seeks are slow

Distributed Counting
• The counting logic now routes "increment C[x] by D" messages to hash tables on machines 1..K
• Now we have enough memory…
• New issues:
– Machines and $$!
– Routing increment requests to the right machine
– Sending increment requests across the network
– Communication complexity

Using a memory-distributed database
• Distributed machines now hold the counts:
– Extra communication cost is O(n) [that's okay]: O(n·scan) becomes O(n·scan + n·send/receive)
– Complexity: routing needs to be done correctly
• Distributing data in memory across machines is not as cheap as accessing memory locally, because of communication costs
• The problem we're dealing with is not size.
It's the interaction between size and locality: we have a large structure that's being accessed in a non-local way.

What other options do we have…
• Compress the counter hash table:
– Use integers as keys instead of strings?
– Use approximate counts?
– Discard infrequent/unhelpful words?
• Trade off time for space:
– If the counter updates are better ordered, we can avoid using the disk

Large-Vocabulary Naïve Bayes / Counting
• Assume we need K times as much memory as we have. How?
– Construct a hash function h(event)
– For i = 0, …, K-1:
• Scan through the train dataset
• Increment counters for an event only if h(event) mod K == i
• Save this counter set to disk at the end of the scan
– After K scans you have a complete counter set
• This works for any counting task, not just Naïve Bayes
• What we're really doing here is organizing our "messages" to get more locality…

How can we do Naïve Bayes with a large vocabulary?
• Sort the intermediate counts:
– Worst-case scenario: O(n log n)!
– For very large datasets, use merge sort:
• Sort k subsets that fit in memory
• Merge the results in linear time

Large-Vocabulary Naïve Bayes
• Create a hashtable C
• For each example id, y, x1,…,xd in train:
– C("Y=ANY") ++; C("Y=y") ++
– Print "Y=ANY += 1"
– Print "Y=y += 1"
– For j in 1..d:
• C("Y=y ^ X=xj") ++
• Print "Y=y ^ X=xj += 1"
(The printed lines are messages to another component that increments the counters.)
• Sort the event-counter update "messages"
– This collects together messages about the same counter, e.g.:
Y=business += 1
Y=business += 1
…
Y=business ^ X=aaa += 1
…
Y=business ^ X=zynga += 1
Y=sports ^ X=hat += 1
Y=sports ^ X=hockey += 1
Y=sports ^ X=hockey += 1
Y=sports ^ X=hockey += 1
…
Y=sports ^ X=hoe += 1
…
Y=sports += 1
• Scan and add the sorted messages, and output the final counter values

Distributed Counting → Stream and Sort Counting
• Distributed: examples feed counting logic plus message-routing logic, which sends "increment C[x] by D" to hash tables on machines 1..K
• Stream and sort: examples feed counting logic (machine A), which emits "increment C[x] by D" messages; a sort step (machine B) groups them ("C[x1] += D1", "C[x1] += D2", …); logic to combine counter updates runs on machine C

Large-Vocabulary Naïve Bayes: scan-and-add (streaming)
previousKey = Null
sumForPreviousKey = 0
For each (event, delta) in input:
  If event == previousKey:
    sumForPreviousKey += delta
  Else:
    OutputPreviousKey()
    previousKey = event
    sumForPreviousKey = delta
OutputPreviousKey()

define OutputPreviousKey():
  If previousKey != Null:
    print previousKey, sumForPreviousKey

• Accumulating the event counts requires constant storage… as long as the input is sorted.
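The scan-and-add pseudocode translates almost line for line into Python. Here is a sketch (an illustration, not the course's code) written as a generator over already-sorted (event, delta) messages:

```python
def scan_and_add(sorted_messages):
    """Stream over (event, delta) pairs sorted by event and yield one
    (event, total) pair per distinct event. Uses constant storage,
    because sorting put all messages for a key next to each other."""
    previous_key = None
    total = 0
    for event, delta in sorted_messages:
        if event == previous_key:
            total += delta
        else:
            if previous_key is not None:
                yield previous_key, total   # OutputPreviousKey()
            previous_key = event
            total = delta
    if previous_key is not None:
        yield previous_key, total           # flush the last key
```

For example, `list(scan_and_add(sorted(messages)))` turns a stream of "+= 1" messages into final counter values.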
Large-Vocabulary Naïve Bayes: complexity
• For each example id, y, x1,…,xd in train, print the "+= 1" messages: O(n), n = size of train (assuming a constant number of labels per document)
• Sort the event-counter update messages: O(n log n)
• Scan and add the sorted messages, output the final counter values: O(n)
• Model size: max O(n), O(|V|·|dom(Y)|)
• As a pipeline: java MyTrainer train | sort | java MyCountAdder > model

Distributed Counting → Stream and Sort Counting
• With standardized message-routing logic (machine A emits, machine B sorts), the combine step is trivial to parallelize across machines C1..CK. Easy to parallelize!

Part II: Phrase Finding with the Request and Answer Pattern
[example phrase-finding results, ACL Workshop 2003]

Why phrase-finding?
• There are lots of phrases
• There's no supervised data
• It's hard to articulate:
– What makes a phrase a phrase, vs. just an n-gram? A phrase is independently meaningful ("test drive", "red meat") or not ("are interesting", "are lots")
– What makes a phrase interesting?
In computer terms (i.e., for learning), what is a good phrase?
• Phraseness: the degree to which a sequence of words "qualifies" as a phrase
– Statistics: how often do the words co-occur vs. occur separately?
• Informativeness: the degree to which the phrase captures or illustrates a key idea within a document, relative to a domain
– Statistics: two aspects of co-occurrence
– Background corpus (training) vs. foreground corpus (testing)

"Phraseness" 1: based on BLRT
• Binomial Log-Likelihood Ratio Test (BLRT):
– Draw samples:
• n1 draws, k1 successes
• n2 draws, k2 successes
– Are they from one binomial (i.e., k1/n1 and k2/n2 differ only by chance) or from two distinct binomials?
– Define p1 = k1/n1, p2 = k2/n2, p = (k1+k2)/(n1+n2), and L(p,k,n) = p^k (1-p)^(n-k); then

BLRT(n1,k1,n2,k2) = [L(p1,k1,n1) · L(p2,k2,n2)] / [L(p,k1,n1) · L(p,k2,n2)]

"Phraseness" 1: based on BLRT (continued)
• The actual test statistic is 2 log of this ratio. Again: are the samples from one binomial, or from two distinct binomials?
– With pi = ki/ni, p = (k1+k2)/(n1+n2), and L(p,k,n) = p^k (1-p)^(n-k):

φp(n1,k1,n2,k2) = 2 log { [L(p1,k1,n1) · L(p2,k2,n2)] / [L(p,k1,n1) · L(p,k2,n2)] }

• For a phrase "x y" (W1=x ^ W2=y), the counts are:
k1 = C(W1=x ^ W2=y): how often bigram "x y" occurs in corpus C
n1 = C(W1=x): how often word x occurs in corpus C
k2 = C(W1≠x ^ W2=y): how often y occurs in C after a non-x
n2 = C(W1≠x): how often a non-x occurs in C
• Question asked: does y occur at the same frequency after x as in other positions?

"Informativeness" 1: based on BLRT
• The same statistic φi, now for a phrase "x y" (W1=x ^ W2=y) and two corpora, C (foreground) and B (background):
k1 = C(W1=x ^ W2=y): how often bigram "x y" occurs in corpus C
n1 = C(W1=* ^ W2=*): how many bigrams are in corpus C
k2 = B(W1=x ^ W2=y): how often "x y" occurs in the background corpus
n2 = B(W1=* ^ W2=*): how many bigrams are in the background corpus
• Question asked: does "x y" occur at the same frequency in both corpora?
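As an illustration (not from the slides), the BLRT statistic defined above can be computed directly from the four counts. Working with log-likelihoods rather than the raw ratio avoids floating-point underflow for large n:

```python
import math

def _ll(p, k, n):
    """Binomial log-likelihood k*log(p) + (n-k)*log(1-p), with 0*log(0) = 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1 - p)
    return total

def blrt(k1, n1, k2, n2):
    """2*log of the BLRT ratio: 0 when the two samples look like one
    binomial, large when they look like two distinct binomials.
    Fill k1, n1, k2, n2 from the phraseness or informativeness table."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2 * (_ll(p1, k1, n1) + _ll(p2, k2, n2)
                - _ll(p, k1, n1) - _ll(p, k2, n2))
```

Because p1 and p2 are the maximum-likelihood rates for each sample, the numerator is never smaller than the denominator, so the statistic is always ≥ 0.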
Phrase finding with BLRT
[table of top-ranked phrases found with BLRT]

Intuition of phraseness and informativeness
• Goal: distinguish between distributions and quantify how different they are
• Phraseness: estimate "x y" with a bigram model vs. a unigram model
• Informativeness: estimate with the foreground vs. the background corpus
• Use pointwise KL divergence

Phraseness: defining our metric
• Use KL divergence: the difference between the bigram and unigram language models in the foreground
– Bigram model: P(x y) = P(x) P(y|x)
– Unigram model: P(x y) = P(x) P(y)

Informativeness: defining our metric
• Use KL divergence between the foreground and background estimates
• Why not combine phraseness and informativeness? Phrase finding with pointwise KL divergence simply adds the two scores.

Advantages of pointwise KL divergence over BLRT
• BLRT scores frequency in foreground and background symmetrically; pointwise KL divergence does not!
• Phraseness and informativeness scores are comparable:
– Combination is straightforward
– No classifier is needed
• Language modeling:
– Extends to n-grams and smoothing methods
– The software can be modularized

Now, how do we implement it?
• Request and Answer Pattern
• Main data structure: tables of key-value pairs
– Key: a phrase "x y"
– Value: a mapping from attribute names to values
• Send messages to this "table" and get results back, e.g.:
Key: "old man", Value: freq(B) = 10, freq(F) = 13, I = 1.3, P = 740
Key: "bad service", Value: freq(B) = 8, freq(F) = 25, I = 560, P = 254

From the foreground, generate and score phrases
• Count events "W1=x ^ W2=y" with the stream-and-sort approach from Naïve Bayes
• Stream through the output:
– Convert to {phrase, attributes-of-phrase} records
– Only one attribute so far: freq_in_F = n
• Stream through the foreground and count events "W1=x" in a (memory-based) hashtable
• Compute phrasiness. Scan through the phrase table to add an extra attribute holding the word frequencies

From the background, count events
• Count events "W1=x ^ W2=y" with the stream-and-sort approach from Naïve Bayes
• Stream through the output:
– Convert to {phrase, attributes-of-phrase} records
– Only one attribute: freq_in_B = n
• Sort the two phrase tables (freq_in_B and freq_in_F) together and run the output through another reduce:
– Append together attributes with the same key

Putting things together!
• Scan the phrase table once more:
– Add the informativeness scores
• Assuming the word vocabulary nW is small:
– Scan foreground corpus C for phrases: O(nC), producing mC phrase records (of course mC << nC)
– Compute phrasiness: O(mC); assumes the word counts fit in memory
– Scan background corpus B for phrases: O(nB), producing mB phrase records
– Sort together and combine records: O(m log m), m = mB + mC
– Compute informativeness and combined quality: O(m)

Can we do better? Keeping word counts out of memory
• Assume we have word and phrase tables: how do we incorporate the word records into the phrase records?
• For each phrase "x y", request the word counts:
– Print "x ~request=freq-in-F, from=xy"
– Print "y ~request=freq-in-F, from=xy"
• Sort all the word requests in with the word tables
• Scan through the result and generate the answers: for each word w with attributes a1=n1, a2=n2, …:
– Print "xy ~request=freq-in-C, from=w"
• Sort again and augment the "x y" records appropriately

Final Analysis
• Scan foreground corpus F for phrases and words: O(nF), producing mF phrase records and vF word records
• Scan the phrase records producing word-frequency requests: O(mF), producing 2mF requests
• Sort the requests in with the word records: O((2mF + vF) log(2mF + vF)) = O(mF log mF), since vF < mF
• Scan through and answer the requests: O(mF)
• Sort the answers in with the phrase records: O(mF log mF)
• Repeat steps 1-5 for the background corpus: O(nB + mB log mB)
• Combine the two phrase tables: O(m log m), m = mB + mF
• Compute all the statistics: O(m)

Part III: Rocchio's Algorithm

Naïve Bayes is unusual
• Only one pass through the data
• We don't necessarily care about the order of the data
• Rocchio's algorithm:
– Instead of "binary" or "counts" for words, we will use a vector space model
– Then classify!
(Rocchio, Relevance Feedback in Information Retrieval, in The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice-Hall, 1971.)

Vector Space Model for Documents
• Three aspects are needed:
– Word weighting: we will use the term frequency model
– Document length normalization: we will use the Euclidean vector length
– Similarity measure: we will use cosine similarity

Rocchio's algorithm (many variants of these formulae exist)
• DF(w) = number of different docs w occurs in
• TF(w,d) = number of times w occurs in doc d
• IDF(w) = |D| / DF(w), where D is the full set of documents
• u(w,d) = log(TF(w,d) + 1) · log(IDF(w))
• u(d) = ⟨u(w1,d), …, u(w|V|,d)⟩
– As long as u(w,d) = 0 for words not in d, store only the non-zeros in u(d), so its size is O(|d|)
• u(y) = α · (1/|Cy|) · Σ_{d in Cy} u(d)/||u(d)||2 − β · (1/|D−Cy|) · Σ_{d' in D−Cy} u(d')/||u(d')||2
– But the size of u(y) is O(|V|)
• f(d) = argmax_y [u(d)/||u(d)||2] · [u(y)/||u(y)||2]

Rocchio's algorithm (continued)
• Given a table mapping w to DF(w), we can compute v(d) from the words in d, and the rest of the learning algorithm is just adding:
• v(d) = u(d)/||u(d)||2 = ⟨v(w1,d), …⟩
• u(y) = α · (1/|Cy|) · Σ_{d in Cy} v(d) − β · (1/|D−Cy|) · Σ_{d' in D−Cy} v(d'), with v(y) = u(y)/||u(y)||2
• f(d) = argmax_y v(d) · v(y)

Rocchio vs. Bayes
• Imagine a similar process, but for labeled documents:
– Train data: rows (idi, yi, wi,1, wi,2, wi,3, …)
– Naïve Bayes turns these into event counts, e.g. C[X=w1 ^ Y=sports] = 5245, C[X=w1 ^ Y=worldNews] = 1054, C[X=w2 ^ Y=…] = 2120, …
– Recall the Naïve Bayes test process: each test row (idi, yi, wi,1, …) is joined with the counters for its words, C[X=wi,1 ^ Y=sports] = 5245, etc.

Rocchio Summary
• Compute DF: one scan through the docs; time O(n), n = corpus size (like NB event counts)
• Compute v(idi) for each document: output size O(n); time O(n), one scan if DF fits in memory (otherwise like the first part of the NB test procedure)
• Add up the vectors to get v(y): time O(n), one scan if the v(y)'s fit in memory (otherwise like NB training)
• Classification is roughly the same as disk-based NB

Summary
• Three approaches for working with big "streaming" data:
– Stream and sort
– Request and answer
– Rocchio's algorithm
• Implementing phrase finding:
– Correlations between words identify phrases
• Where can these be used?
– NLP-related tasks: complex named-entity resolution
– Other counting applications!
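The Rocchio formulas above can be sketched as a toy Python classifier. This is an illustration under simplifying assumptions (the common β = 0 variant, so the negative-class term is dropped, and no smoothing); the names are hypothetical:

```python
import math
from collections import Counter, defaultdict

def tfidf(doc_words, df, n_docs):
    """v(d): TF-IDF weights u(w,d)=log(TF+1)*log(|D|/DF(w)), L2-normalized."""
    tf = Counter(doc_words)
    u = {w: math.log(tf[w] + 1) * math.log(n_docs / df[w]) for w in tf}
    norm = math.sqrt(sum(x * x for x in u.values())) or 1.0
    return {w: x / norm for w, x in u.items()}

def train(docs, labels):
    """One scan for DF, one scan to sum normalized doc vectors per class."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    cents = defaultdict(Counter)
    for d, y in zip(docs, labels):
        cents[y].update(tfidf(d, df, n))          # u(y) with alpha=1, beta=0
    for y, c in cents.items():                    # v(y) = u(y)/||u(y)||
        norm = math.sqrt(sum(x * x for x in c.values())) or 1.0
        cents[y] = {w: x / norm for w, x in c.items()}
    return df, n, dict(cents)

def classify(doc, df, n, cents):
    """f(d) = argmax_y v(d) . v(y) (cosine similarity)."""
    v = tfidf([w for w in doc if w in df], df, n)
    return max(cents, key=lambda y: sum(v.get(w, 0.0) * x
                                        for w, x in cents[y].items()))
```

Only the non-zero weights are stored per document, matching the O(|d|) storage note above.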
More about Data Streams…

Streaming Data
• Data streams are "infinite" and "continuous"
– The distribution can evolve over time
• Managing streaming data is important:
– Input is coming extremely fast!
– We don't see all of the data
– (Sometimes) we cannot limit the input rate (e.g., Google queries, Twitter and Facebook status updates)
• Key question in streaming data:
– How do we make critical calculations about the stream using a limited amount of (secondary) memory?

Is Naïve Bayes a Streaming Algorithm?
• Yes! In machine learning, we call this online learning
– It allows us to model as the data becomes available
– It adapts slowly as the data streams in
• How are we doing this?
– NB makes slow updates to the model
– Learn from the training data
– For every example in the stream, we update the model (learning rate)

General Stream Processing Model
[diagram: several streams of elements/tuples (e.g., 1, 5, 2, 7, 0, 9, 3; a, r, v, t, y, h, b; 0, 0, 1, 0, 1, 1, 0) enter a processor over time; standing queries and ad-hoc queries run against it to produce output; the processor has limited working storage, backed by archival storage]
(J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)

Types of queries on Data Streams
• Sampling from a data stream: construct a random sample
• Queries over a sliding window: the number of items of type x in the last k elements of the stream
• Filtering a data stream: select elements with property x from the stream
• Counting distinct elements: the number of distinct elements in the last k elements of the stream
• Estimating moments
• Finding frequent items

Applications
• Finding relevant products using clickstreams: Amazon, Google, and the rest all want to recommend relevant products based on user interactions
• Trending topics on social media
• Health monitoring using sensor networks
• Monitoring molecular dynamics simulations:
– A. Ramanathan, J.-O. Yoo and C.J. Langmead, On-the-fly identification of conformational sub-states from molecular dynamics simulations. J. Chem. Theory Comput. (2011), Vol. 7(3): 778-789.
– A. Ramanathan, P.K. Agarwal, M.G. Kurnikova, and C.J. Langmead, An online approach for mining collective behaviors from molecular dynamics simulations. J. Comp. Biol. (2010), Vol. 17(3): 309-324.

Sampling from a Data Stream
• Practical consideration: we cannot store all of the data, but we can store a sample!
• Two problems:
1. Sample a fixed proportion of the elements in the stream (say 1 in 10)
2. Maintain a random sample of fixed size over a potentially infinite stream
– Desirable property: at any time k, we would like a random sample of s elements, where each of the k elements seen so far has equal probability of being sampled

Sampling a fixed proportion
• Scenario: a search engine query stream
– Stream of tuples: <user, query, time>
– Why is this interesting? We can answer questions like "how often did a user have to re-run the same query in a single day?"
– We only have storage for 1/10th of the query stream
• Simple solution:
– Generate a random integer in [0..9] for each query
– Store the query if the integer is 0, otherwise discard it

What is wrong with our simple approach?
• Suppose each user issues x queries once and d queries twice, for x + 2d total queries
• What fraction of queries by an average search-engine user are duplicates? d/(x+d)
• If we keep only 10% of the queries:
– The sample will contain x/10 of the singleton queries and 2d/10 of the duplicate queries at least once
– But only d/100 pairs of duplicates: d/100 = 1/10 · 1/10 · d
– Of the d "duplicates", 18d/100 appear exactly once: 18d/100 = ((1/10 · 9/10) + (9/10 · 1/10)) · d
– So the fraction appearing twice in the sample is (d/100) divided by (x/10 + d/100 + 18d/100), i.e. d/(10x + 19d), which is not d/(x+d)

Instead of queries, let's sample users…
• Pick 1/10th of the users and take all of their searches into the sample
• Use a hash function that hashes the user name or user ID uniformly into 10 buckets
• Why is this a better solution?
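The user-level sampling idea can be sketched as follows (a hypothetical illustration; the hash function and the a, b values are arbitrary choices). Because the keep/drop decision depends only on the user, all of a sampled user's queries survive together, so within-user duplicates are preserved, which is exactly what query-level sampling destroyed:

```python
import hashlib

def keep(user, a=1, b=10):
    """Deterministically keep ~a/b of users, and all of their queries:
    hash the key (the user) into b buckets, keep the first a buckets."""
    h = int(hashlib.md5(user.encode()).hexdigest(), 16)
    return h % b < a

# Toy stream of (user, query) tuples; alice repeats a query.
stream = [("alice", "q1"), ("bob", "q2"), ("alice", "q1"), ("carol", "q3")]
sample = [(u, q) for u, q in stream if keep(u, a=5, b=10)]
```

Since the decision is per user, either both copies of alice's duplicate query land in the sample or neither does.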
Generalized Solution
• Stream of tuples with keys:
– The key is some subset of each tuple's components (e.g., the tuple is (user, search, time); the key is user)
– The choice of key depends on the application
• To get a sample of an a/b fraction of the stream:
– Hash each tuple's key uniformly into b buckets
– Pick the tuple if it hashes into one of the first a buckets
• How do we generate a 30% sample? Hash into b = 10 buckets and take the tuple if it hashes to one of the first 3 buckets

Maintaining a fixed-size sample
• Suppose we need to maintain a random sample S of exactly s tuples, e.g., because of a main-memory size constraint
• Why is this hard? We don't know the length of the stream in advance
• Suppose at time n we have seen n items; we want each item to be in the sample S with equal probability s/n
– How to think about the problem: say s = 2 and the stream is a x c y z k c d e g …
– At n = 5, each of the first 5 tuples should be in the sample S with equal probability
– At n = 7, each of the first 7 tuples should be in the sample S with equal probability
– An impractical solution would be to store all n tuples seen so far and pick s of them at random

Fixed-Size Sample
• Algorithm (a.k.a. reservoir sampling):
– Store the first s elements of the stream in S
– Suppose we have seen n−1 elements, and now the nth element arrives (n > s):
• With probability s/n, keep the nth element, else discard it
• If we picked the nth element, it replaces one of the s elements in the sample S, chosen uniformly at random
• Claim: this algorithm maintains a sample S with the desired property:
– After n elements, the sample contains each element seen so far with probability s/n

Proof: By Induction
• Base case: after we see n = s elements, the sample S has the desired property, since each of the n = s elements is in the sample with probability s/s = 1
• Inductive hypothesis: after n elements, the sample S contains each element seen so far with probability s/n
• Inductive step: element n+1 arrives; we show the sample then contains each element seen so far with probability s/(n+1)
– For an element already in S, the probability that the algorithm keeps it in S is

(1 − s/(n+1)) + (s/(n+1)) · ((s−1)/s) = n/(n+1)

where the first term is "element n+1 discarded" and the second is "element n+1 not discarded, and this element's slot in the sample not picked"
– So at time n, a tuple was in S with probability s/n, and from time n to n+1 it stays in S with probability n/(n+1)
– Hence the probability that a tuple is in S at time n+1 is (n/(n+1)) · (s/n) = s/(n+1)
(J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)

Take-home messages…
• Part of ML is a good grasp of theory
• Part of ML is a good grasp of which hacks tend to work
• These are not always the same, especially in big-data situations
• Techniques covered so far:
– Brute-force estimation of a joint distribution
– Naïve Bayes
– Stream-and-sort and request-and-answer patterns
– BLRT and KL divergence (and when to use them)
– TF-IDF weighting, especially IDF: it's often useful even when we don't understand why
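As a final worked example, the reservoir-sampling algorithm whose correctness was proved above fits in a few lines of Python (an illustration, not the lecture's code):

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Maintain a uniform sample of exactly s items from a stream of
    unknown length: keep the first s items, then replace a random slot
    with probability s/n when the n-th item arrives."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            sample.append(item)            # fill the reservoir
        elif rng.random() < s / n:         # keep the n-th item w.p. s/n
            sample[rng.randrange(s)] = item  # evict a uniform slot
    return sample
```

After the stream ends, each element seen is in the sample with probability s/n, exactly the invariant from the induction proof.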