CSC 556 – DBMS II, Spring 2013, Week 7
Bayesian Inference: Paul Graham's Plan for Spam, + A Patent Application for Learning Mobile Preferences, + some text-related matters

Bayes and Incremental (Lazy) Learning
• Bayes' formula:
• P(selected | K) = (P(K | selected) x P(selected)) / P(K)
• Collecting the terms on the right-hand side:
  – A priori probability of events:
    • P(selected) = size of SelectedSet / size of InspectedSet
    • P(K) = count of K in InspectedSet / size of InspectedSet
  – A posteriori probability (conditional probability):
    • P(K | selected) = count of K in SelectedSet / size of SelectedSet
• The InspectedSet holds the available alternatives that the mobile user inspected. It corresponds to all emails seen in Graham.
• The SelectedSet holds the available alternatives contacted by the mobile user. It corresponds to the spam vs. non-spam choices in Graham.

Graham's Formula (Hackers and Painters)
• rg = min(1, 2 * (good(w) / G))
• rb = min(1, bad(w) / B)
• Pspam|w = max(.01, min(.99, rb / (rg + rb)))
• w is a token from an email; good(w) and bad(w) are occurrence counts of w in good and bad emails; G and B are the numbers of good and bad emails.
• The 2 * biases against false positives (don't over-discard).
• False positives are non-linearly undesirable.
• The min & max are related to the Laplace Estimator (textbook).

Graham continued
• Graham uses the 15 most interesting tokens.
• They occur > 5 times and have the maximum distance from .5.
• Pspam = ∏i=1..15 Pspam|wi / (∏i=1..15 Pspam|wi + ∏i=1..15 (1 - Pspam|wi))
• Assign a .4 spam probability to unknown tokens.
• Treat an email as spam if Pspam > .9.
• (A code sketch of these formulas follows the Bayesian Preference Rules slide below.)

Spam of the Future (is now)
"Assuming they could solve the problem of the headers, the spam of the future will probably look something like this:

  Hey there. Thought you should check out the following:
  http://www.27meg.com/foo

because that is about as much sales pitch as content-based filtering will leave the spammer room to make. (Indeed, it will be hard even to get this past filters, because if everything else in the email is neutral, the spam probability will hinge on the url, and it will take some effort to make that look neutral.)"
• See my example from March 12.

Better Bayesian Filtering (1/2003)
• Train on lots and lots of data.
• Don't ignore headers; don't stem tokens.
• Related to textbook section 3.5, n-grams for small n ≈ 3.
• Use only the most significant entropy-reducing tokens (Graham's 15).
• Bias against false positives.
• Graham spends a lot of time talking about tokenization. Data cleaning, data to include, and data format are all key & work-intensive.

Parson, Glossner & Jinturkar
• Mobile users have direct associations to their "friends and relatives" (and establishments, etc.).
• These directional associations have strengths 0.0 through 1.0, borrowed from spreading activation.
• http://en.wikipedia.org/wiki/Spreading_activation
• http://en.wikipedia.org/wiki/Semantic_network
• When arriving / departing, weighted associations scale by transitive strength * proximity.
• The InspectedSet consists of the alternatives inspected by the user on the handset.

Centile Ranking
• For learning Associations and Preferences, and for adjusting their Weights.
• Raw Score: NumberOfTimesSelected / NumberOfTimesInspected.
• Sort by Raw Score.
• Form Equivalence Classes.
• Position in Sort gives Percentile Rank (CentileWeight).
• NewStrength = ((OldStrength * DragFactor) + CentileWeight) / (DragFactor + 1)
• New Preference Learning ranks Keywords according to the Association weights that reach those keywords, applies Centile Ranking to the top "selected" keywords, then forms a single Keyword query formula for the top-ranked Keywords. (Sketched in code below.)

Bayesian Preference Rules
• P(selected | K) = (P(K | selected) x P(selected)) / P(K)
• P(selected) = size of SelectedSet / size of InspectedSet
• P(K) = count of K in InspectedSet / size of InspectedSet
• P(K | selected) = count of K in SelectedSet / size of SelectedSet
• The InspectedSet holds the available, inspected alternatives.
• The SelectedSet holds the available alternatives contacted by the mobile user.
• NewStrength = ((OldStrength x DragFactor) + P(selected | K)) / (DragFactor + 1) (Sketched in code below.)
• Use only Keywords reached via Associations, i.e., contacts that were actually made. The top percent are treated as "selected." This avoids storing data for unselected keywords. Multiple keywords connected by and, or, minus are possible; and is implemented.
• Garbage collect weak Associations & Preferences (via a Threshold).
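The following is a minimal Python sketch of Graham's per-token and whole-message formulas from the slides above. It assumes simple in-memory count tables; the names good_counts, bad_counts, n_good, and n_bad are illustrative, and the handling of rare tokens (total count <= 5 gets the .4 default) is one reading of the "> 5 times" rule, not Graham's exact code.

    from functools import reduce

    def token_spam_prob(w, good_counts, bad_counts, n_good, n_bad):
        """P(spam | w) per Graham, clamped to [.01, .99]."""
        g = good_counts.get(w, 0)
        b = bad_counts.get(w, 0)
        if 2 * g + b <= 5:                  # rare or unknown token:
            return 0.4                      # Graham's default probability
        rg = min(1.0, 2.0 * g / n_good)     # the 2 * biases against false positives
        rb = min(1.0, b / n_bad)
        return max(0.01, min(0.99, rb / (rg + rb)))

    def message_spam_prob(tokens, good_counts, bad_counts, n_good, n_bad):
        """Combine the 15 tokens whose probabilities lie farthest from .5."""
        probs = [token_spam_prob(w, good_counts, bad_counts, n_good, n_bad)
                 for w in set(tokens)]
        probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
        interesting = probs[:15]            # Graham's 15 most interesting tokens
        prod_spam = reduce(lambda x, y: x * y, interesting, 1.0)
        prod_ham = reduce(lambda x, y: x * y, [1.0 - p for p in interesting], 1.0)
        return prod_spam / (prod_spam + prod_ham)

    # Treat the email as spam when message_spam_prob(...) > 0.9.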
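Similarly, a minimal sketch of Centile Ranking plus the drag-factor update, assuming each alternative carries (timesSelected, timesInspected) counts; the function and variable names are illustrative.

    def centile_weights(counts):
        """counts: {alternative: (timesSelected, timesInspected)}.
        Raw score = selected / inspected; equal raw scores form one
        equivalence class; position in the sorted classes gives the
        centile weight in (0, 1]."""
        raw = {a: sel / insp for a, (sel, insp) in counts.items() if insp > 0}
        classes = sorted(set(raw.values()))   # equivalence classes of raw scores
        rank = {s: (i + 1) / len(classes) for i, s in enumerate(classes)}
        return {a: rank[s] for a, s in raw.items()}

    def new_strength(old_strength, centile_weight, drag_factor):
        """NewStrength = ((OldStrength * DragFactor) + CentileWeight) / (DragFactor + 1)."""
        return (old_strength * drag_factor + centile_weight) / (drag_factor + 1)

    # Example: centile_weights({'pizza': (3, 10), 'bank': (1, 10)})
    # ranks 'pizza' (raw .3, weight 1.0) above 'bank' (raw .1, weight .5).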
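And a minimal sketch of the Bayesian Preference Rules update, assuming the InspectedSet and SelectedSet are tracked as keyword Counters together with their set sizes; the container names here are assumptions, not from the patent application.

    from collections import Counter

    def p_selected_given_k(k, inspected_kw, selected_kw, n_inspected, n_selected):
        """P(selected | K) = (P(K | selected) x P(selected)) / P(K)."""
        if n_selected == 0 or inspected_kw[k] == 0:
            return 0.0                              # no evidence for K
        p_selected = n_selected / n_inspected
        p_k = inspected_kw[k] / n_inspected
        p_k_given_sel = selected_kw[k] / n_selected
        return (p_k_given_sel * p_selected) / p_k

    def updated_strength(old, k, inspected_kw, selected_kw,
                         n_inspected, n_selected, drag_factor):
        """NewStrength = ((OldStrength x DragFactor) + P(selected | K)) / (DragFactor + 1)."""
        p = p_selected_given_k(k, inspected_kw, selected_kw,
                               n_inspected, n_selected)
        return (old * drag_factor + p) / (drag_factor + 1)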
Deficiencies of Weka's relational approach
• Some attributes are naturally sets of values from a domain. Some are counted multisets.
• Some are naturally maps K -> V.
• Multi-instance classifiers in Weka are only a small subset.
• They do not address multi-value attributes.
• Text mining is also outside Weka's scope.

Data Mining Scrabble
• It requires text processing.
• The text is not conventional prose.
• It poses problems for data representation / normalization.
• It will support multi-instance classification.
• Each "play" in a game is an instance of attributes.
• Plays across games have relationships.
• It may support analysis of social interactions.

What to collect?
• Each instance contains gameID, moveNumber, playerID, row or column of play, pointGain, pointSum, remainderInTileBag, isWinner/rank.
• These are conventional scalar attribute values.
• Multi-value attributes are harder.
• There are zero or more PLAYED, LEAVED, CROSSED, BORROWED (from sides), and SWAPPED letters per play.
• Bindings for blanks and for 3-word, 2-word, 3-letter, and 2-letter bonus locations vary in how many are available and in what they bind (word vs. letter).
• How do we deal with strings, fragments, and letter blends? Contiguous consonants, also vowels? (One possible representation is sketched in code below.)
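To make the representation problem concrete, here is one possible shape for a single "play" instance, mixing the scalar attributes listed above with the multi-value ones (counted multisets of letters, plus a K -> V map) that Weka's flat relational format handles poorly. Field names beyond those on the slide, and the structure chosen for bonus_bindings, are assumptions for illustration.

    from collections import Counter
    from dataclasses import dataclass, field
    from typing import Dict

    @dataclass
    class Play:
        # Conventional scalar attributes.
        game_id: int
        move_number: int
        player_id: str
        row_or_column: str            # e.g. 'row 8' or 'col H'
        point_gain: int
        point_sum: int
        remainder_in_tile_bag: int
        rank: int                     # isWinner / rank
        # Multi-value attributes: counted multisets of letters per role.
        played: Counter = field(default_factory=Counter)
        leaved: Counter = field(default_factory=Counter)
        crossed: Counter = field(default_factory=Counter)
        borrowed: Counter = field(default_factory=Counter)   # from sides
        swapped: Counter = field(default_factory=Counter)
        # Map K -> V: bonus square used -> what it bound ('word' vs. 'letter').
        bonus_bindings: Dict[str, str] = field(default_factory=dict)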