KEY CONCEPTS IN PROBABILITY: SMOOTHING, MLE, AND MAP

Outline
• MAPs and MLEs – catch-up from last week
• Joint Distributions – a new learner
• Naïve Bayes – another new learner

Administrivia
• Homeworks:
  – Due tomorrow
  – Hardcopy and Autolab submission (see wiki)
• Texts – Mitchell or Murphy are optional this week
  – Mitchell: an update from Tom Mitchell's long-expected new edition
  – Bishop is also excellent if you prefer, but a little harder to skip around in
  – Pick one or the other (both is overkill)
  – The main differences are not content but notation: for instance…

Some practical problems
• I bought a loaded d20 on eBay… but it didn't come with any useful specs. How can I find out how it behaves?
• [Figure: histogram of frequency (0–6) vs. face shown (1–20) for the 20 rolls]
1. Collect some data (20 rolls)
2. Estimate Pr(i) = C(rolls of i) / C(any roll)

A better solution
• I bought a loaded d20 on eBay… but it didn't come with any specs. How can I find out how it behaves?
• [Figure: histogram of frequency vs. face shown for the 20 rolls]
0. Imagine some data (20 rolls, each face i shows up 1x)
1. Collect some data (20 rolls)
2. Estimate P̂r(i) = (C(i) + 1) / (C(ANY) + C(IMAGINED))

A better solution?
• Q: What if I imagined m rolls, with a probability q = 1/20 of rolling any particular i?
• P̂r(i) = (C(i) + mq) / (C(ANY) + m)
• I can use this formula with m > 20, or even with m < 20 … say with m = 1

Terminology – more later
• This is called a uniform Dirichlet prior
• C(i), C(ANY) are sufficient statistics
• P̂r(i) = (C(i) + mq) / (C(ANY) + m)
• MLE = maximum likelihood estimate
• MAP = maximum a posteriori estimate
• Tom's notes use different notation – see below

Some differences….
• William: estimate each probability Pr(i) associated with a multinomial, with MLE as
    P̂r(i) = C(i) / C(ANY)
  for C(i) = count of times you saw i, and with MAP as
    P̂r(i) = (C(i) + mq) / (C(ANY) + m)
• Tom: estimate Θ = P(heads) for a binomial, with MLE as
    Θ = #heads / (#heads + #tails)
  and with MAP as
    Θ = (#heads + #imaginary heads) / (#heads + #tails + #imaginary heads + #imaginary tails)

Some apparent differences….
• The two estimates are the same. Writing α1 = #heads, α0 = #tails, γ1 = #imaginary heads, γ0 = #imaginary tails:
  – C(i) = α1
  – C(ANY) = α0 + α1
  – m = γ0 + γ1 – the amount of pseudo-data, i.e. the confidence in the prior
  – q = γ1 / (γ0 + γ1) – the prior itself
• [Figure: priors for imagined m=60 samples with q=0.3, m=60 samples with q=0.4, m=120 samples with q=0.3, and m=120 samples with q=0.4]

Why we call this a MAP
• Simple case: replace the die with a coin
  – Now there's one parameter: q = P(H)
  – I start with a prior over q, P(q)
  – I get some data: D = {D1=H, D2=T, ….}
  – I compute the maximum of the posterior of q:
    argmax_q P(q|D) = argmax_q P(D|q)P(q)/P(D) = argmax_q P(D|q)P(q)
  – MLE estimate: argmax_q P(D|q); MAP estimate: argmax_q P(D|q)P(q)

Why we call this a MAP
• Same coin setup, but now I compute the full posterior of q
• The math works if the pdf of P(q) is proportional to q^α (1−q)^β
• α+1, β+1 are counts of imaginary pos/neg examples
• [Figure: plot of this pdf]

Why we call this a MAP
• This is called a beta distribution
• The generalization to multinomials is called a Dirichlet distribution
• Parameters are α1, …, αK, and the density is f(x1, …, xK) ∝ x1^α1 · … · xK^αK
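Not part of the slides, but a minimal sketch of the two estimates above in Python, using made-up roll data: with m = 0 the function returns the MLE C(i)/C(ANY); with m imaginary rolls and a uniform q = 1/20 it returns the smoothed MAP estimate (C(i) + mq)/(C(ANY) + m).

```python
from collections import Counter

def estimate_die(rolls, faces=20, m=0.0, q=None):
    """MLE (m=0) or MAP/smoothed (m>0) estimate of Pr(face) from observed rolls.

    MAP:  Pr(i) = (C(i) + m*q) / (C(ANY) + m), i.e. m imaginary rolls spread
    according to the prior q over the faces (uniform Dirichlet by default).
    """
    if q is None:
        q = 1.0 / faces                  # uniform prior: q = 1/20 for a d20
    counts = Counter(rolls)
    total = len(rolls)
    return {i: (counts[i] + m * q) / (total + m) for i in range(1, faces + 1)}

# Hypothetical 20 rolls of the loaded d20 (made-up data, just for illustration).
rolls = [20, 20, 3, 17, 20, 1, 20, 9, 20, 20, 5, 20, 12, 20, 7, 20, 20, 2, 20, 18]

mle = estimate_die(rolls)            # faces never rolled get probability 0
map_est = estimate_die(rolls, m=20)  # 20 imaginary rolls, one per face
print(mle[4], map_est[4])            # 0.0 vs. 0.025
```

Note how the MAP estimate never assigns zero probability to an unseen face, which matters once these estimates feed a classifier.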
KEY CONCEPTS IN PROBABILITY: THE JOINT DISTRIBUTION

Some practical problems
• I have 1 standard "fair" d6 die and 2 loaded d6 dice, one loaded high, one loaded low.
• Loaded high: P(X=6) = 0.50; loaded low: P(X=1) = 0.50
• Experiment: pick two of the three d6 dice uniformly at random (call the pair A) and roll them both.
• What is more likely – rolling a seven or rolling doubles?
• Three possible pairs for A: HL, HF, FL. For the event D (say, doubles):
  P(D) = P(D ∧ A=HL) + P(D ∧ A=HF) + P(D ∧ A=FL)
       = P(D|A=HL)·P(A=HL) + P(D|A=HF)·P(A=HF) + P(D|A=FL)·P(A=FL)

A brute-force solution
• Enumerate every combination of (A, Roll 1, Roll 2) with its probability:
  A    Roll 1  Roll 2  P                  Comment
  FL   1       1       1/3 · 1/6 · 1/2    doubles
  FL   1       2       1/3 · 1/6 · 1/10
  …    …       …       …
  FL   1       6       …                  seven
  FL   2       1       …
  …    …       …       …
  FL   6       6       …                  doubles
  HL   1       1       …                  doubles
  HL   1       2       …
  …    …       …       …
  HF   1       1       …                  doubles
  …    …       …       …
• A joint probability table shows P(X1=x1 and … and Xk=xk) for every possible combination of values x1, x2, …, xk.
• With this you can compute P(E) for any E that is a boolean combination of the primitive events (Xi=xi), e.g.
  – P(doubles)
  – P(seven or eleven)
  – P(total is higher than 5)
  – ….

The Joint Distribution
• Example: Boolean variables A, B, C
• Recipe for making a joint distribution of M variables:
  1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
  2. For each combination of values, say how probable it is.
  3. If you subscribe to the axioms of probability, those numbers must sum to 1.

  A  B  C  Prob
  0  0  0  0.30
  0  0  1  0.05
  0  1  0  0.10
  0  1  1  0.05
  1  0  0  0.05
  1  0  1  0.10
  1  1  0  0.25
  1  1  1  0.10

Estimating The Joint Distribution
• Same recipe, except that in step 2 you estimate how probable each combination of values is from data.

Pros and Cons of the Joint Distribution
• You can do a lot with it!
  – Answer any query Pr(Y1,Y2,…|X1,X2,…)
• It takes up a lot of room!
• It takes a lot of data to train!
• It can be expensive to use
• The big question: how do you simplify (approximate, compactly store, …) the joint and still be able to answer interesting queries?
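As a sketch (not from the slides), the "answer any query" claim can be made concrete in a few lines of Python, using the A, B, C table above as the joint:

```python
# The example joint distribution over Boolean A, B, C from the table above.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9   # axiom: the rows sum to 1

def prob(event):
    """P(event), where event is any boolean function of a row (a, b, c)."""
    return sum(p for row, p in joint.items() if event(*row))

def cond(event, given):
    """P(event | given), by summing the matching rows of the joint."""
    return prob(lambda a, b, c: event(a, b, c) and given(a, b, c)) / prob(given)

print(prob(lambda a, b, c: a == 1 or b == 1))   # P(A=1 or B=1) = 0.65
print(cond(lambda a, b, c: c == 1,              # P(C=1 | A=1)  = 0.40
           lambda a, b, c: a == 1))
```

The same row-summing pattern answers any query over the dice or 5-gram joints below; the catch, as the pros and cons slide says, is that the table (and the data needed to fill it) grows exponentially with the number of variables.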
Density Estimation
• Our Joint Distribution learner is our first example of something called Density Estimation
• A Density Estimator learns a mapping from a set of attribute values to a probability:
  Input Attributes → Density Estimator → Probability
Copyright © Andrew W. Moore

Density Estimation – looking ahead
• Compare it to two other major kinds of models:
  – Input Attributes → Classifier → Prediction of categorical output or class (one of a few discrete values)
  – Input Attributes → Density Estimator → Probability
  – Input Attributes → Regressor → Prediction of real-valued output
Copyright © Andrew W. Moore

Another example
• Starting point: Google Books 5-gram data
  – All 5-grams that appear >= 40 times in a corpus of 1M English books
    • 30GB compressed, 250–300GB uncompressed
    • Each 5-gram comes with a frequency distribution over years (which I ignored)
  – Pulled out counts for all 5-grams (A,B,C,D,E) where C=affect or C=effect, and turned this into a joint probability table

Some of the Joint Distribution
  A      B       C       D     E           p
  is     the     effect  of    the         0.00036
  is     the     effect  of    a           0.00034
  .      The     effect  of    this        0.00034
  to     this    effect  :     "           0.00034
  be     the     effect  of    the         …
  …      …       …       …     …           …
  …      the     effect  of    any         0.00024
  …      …       …       …     …           …
  does   not     affect  the   general     0.00020
  does   not     affect  the   question    0.00020
  any    manner  affect  the   principle   0.00018
  … about 50k more rows, which summarize 90M 5-gram instances in text

Example queries
• Pr(C)?
  c           Pr(C=c)
  C=effect    0.94628
  C=affect    0.04725
  C=Effect    0.00575
  C=EFFECT    0.00067
  C=effecT    …

Example queries
• Pr(B | C=affect)?
  b           Pr(B=b | C=affect)
  B=not       0.61357
  B=to        0.11483
  B=may       0.03267
  B=they      0.02738
  B=which     …

Example queries
• Pr(C | B=not, D=the)?
  c           Pr(C=c | B=not, D=the)
  C=affect    0.99644
  C=effect    0.00356

Density Estimation As a Classifier
• Input Attributes → Classifier → Prediction of categorical output or class (one of a few discrete values)
• Input Attributes → Density Estimator → Probability P(X1=x1,…,Xn=xn)
• Input Attributes + Class Y → Density Estimator → Probabilities P(Y=y1|X1=x1,…,Xn=xn), …, P(Y=yk|X1=x1,…,Xn=xn)
  – Predict: f(X1=x1,…,Xn=xn) = argmax_yi P(Y=yi|X1=x1,…,Xn=xn)
Copyright © Andrew W. Moore

An experiment: how useful is the brute-force joint classifier?
• Test set: extracted all uses of affect or effect in a 20k-document newswire corpus:
  – about 723 n-grams, 661 distinct
• Tried to predict the center word C with:
  – argmax_c Pr(C=c | A=a, B=b, D=d, E=e), using the joint estimated from the Google n-gram data

Poll time…
• https://piazza.com/class/ij382zqa2572hc

Example queries
• How many errors would I expect in 100 trials if my classifier always just guesses the most frequent class?
• https://piazza.com/class/ij382zqa2572hc
  c           Pr(C=c)
  C=effect    0.94628
  C=affect    0.04725
  C=Effect    0.00575
  C=EFFECT    0.00067
  C=effecT    …

Performance summary
  Pattern        Used   Errors
  P(C|A,B,D,E)   101    1
• But: no counts at all for (a,b,d,e) for 622 of the 723 instances!

Slightly fancier idea….
• Tried to predict the center word with:
  – Pr(C | A=a, B=b, D=d, E=e)
  – then P(C | A,B,D) if there's no data for that
  – then P(C | B,D) if there's no data for that
  – then P(C | B) …
  – then P(C)
• Examples:
  – "The cumulative _ of the" → effect (1.0)
  – "Go into _ on January" → effect (1.0)
  – "From cumulative _ of accounting" → not present in the training data
    • Nor is "From cumulative _ of _"
    • But "_ cumulative _ of _" → effect (1.0)
  – "Would not _ Finance Minister" → not present
    • But "_ not _ _ _" → affect (0.9625)

Performance summary
  Pattern        Used   Errors
  P(C|A,B,D,E)   101    1
  P(C|A,B,D)     157    6
  P(C|B,D)       163    13
  P(C|B)         244    78
  P(C)           58     31
  Total          723
• Slide callouts: 3% error, 5% error, 15% error – the cumulative error rate after backing off through P(C|A,B,D), P(C|B,D), and P(C|B) respectively.
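A minimal sketch (not from the lecture) of the backoff scheme in the "Slightly fancier idea" slide: count the center word C against progressively smaller contexts, then predict with the most specific context that has any data. The names build_tables and predict, and the ngrams input (one (a,b,c,d,e) tuple per occurrence), are hypothetical; the real Google data arrives as 5-grams with counts, which you would add instead of incrementing by 1.

```python
from collections import Counter, defaultdict

# Context extractors, from most to least specific:
# P(C|A,B,D,E), P(C|A,B,D), P(C|B,D), P(C|B), P(C)
PATTERNS = [
    lambda a, b, d, e: (a, b, d, e),
    lambda a, b, d, e: (a, b, d),
    lambda a, b, d, e: (b, d),
    lambda a, b, d, e: (b,),
    lambda a, b, d, e: (),
]

def build_tables(ngrams):
    """Count the center word C under each context granularity."""
    tables = [defaultdict(Counter) for _ in PATTERNS]
    for a, b, c, d, e in ngrams:
        for table, pattern in zip(tables, PATTERNS):
            table[pattern(a, b, d, e)][c] += 1
    return tables

def predict(tables, a, b, d, e):
    """argmax_c P(C=c | context), backing off until some context has counts."""
    for table, pattern in zip(tables, PATTERNS):
        counts = table.get(pattern(a, b, d, e))
        if counts:                           # most specific context with data
            word, n = counts.most_common(1)[0]
            return word, n / sum(counts.values())
    return None, 0.0                         # no training data at all
```

With tables built from the 5-gram data, predict(tables, "The", "cumulative", "of", "the") should behave like the "The cumulative _ of the" → effect example, assuming the counts match the slide; queries with no exact match fall through to the less specific patterns, which is where the extra errors in the final performance table come from.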