Reasoning Under Uncertainty
Artificial Intelligence
CMSC 25000
February 21, 2008
Roadmap
• Reasoning under uncertainty
– Decision trees
Decision Making
• Design model of rational decision making
– Maximize expected value among alternatives
• Uncertainty from
– Outcomes of actions
– Choices taken
• To maximize outcome
– Select maximum over choices
– Weighted average value of chance outcomes
Gangrene Example
[Decision-tree figure:]
• Medicine: Full recovery 0.7 -> 1000; Worse 0.25 -> second decision; Die 0.05 -> 0
– Second decision, Medicine: Live 0.6 -> 995; Die 0.4 -> 0
– Second decision, Amputate leg: Live 0.98 -> 700; Die 0.02 -> 0
• Amputate foot: Live 0.99 -> 850; Die 0.01 -> 0
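The tree can be evaluated mechanically: take the probability-weighted average at chance nodes and the maximum at decision nodes. A minimal sketch; the nested-tuple representation and branch grouping are my reading of the slide's figure:

```python
# Evaluate a decision tree: max over choices, probability-weighted
# average over chance outcomes. Values/probabilities from the slide.

def expected_value(node):
    kind = node[0]
    if kind == "leaf":                      # ("leaf", utility)
        return node[1]
    if kind == "chance":                    # ("chance", [(prob, child), ...])
        return sum(p * expected_value(c) for p, c in node[1])
    if kind == "decision":                  # ("decision", {choice: child})
        return max(expected_value(c) for c in node[1].values())
    raise ValueError(kind)

# Second decision reached when the patient gets worse under medicine.
worse = ("decision", {
    "medicine":     ("chance", [(0.6, ("leaf", 995)), (0.4, ("leaf", 0))]),
    "amputate leg": ("chance", [(0.98, ("leaf", 700)), (0.02, ("leaf", 0))]),
})
tree = ("decision", {
    "medicine":      ("chance", [(0.7, ("leaf", 1000)),
                                 (0.25, worse),
                                 (0.05, ("leaf", 0))]),
    "amputate foot": ("chance", [(0.99, ("leaf", 850)),
                                 (0.01, ("leaf", 0))]),
})

print(expected_value(tree))   # medicine branch wins over amputate foot
```

With these values the medicine branch yields 0.7*1000 + 0.25*686 = 871.5, beating amputate foot's 0.99*850 = 841.5.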
Decision Tree Issues
• Problem 1: Tree size
– k activities : 2^k orders
• Solution 1: Hill-climbing
– Choose best apparent choice after one step
• Use entropy reduction
• Problem 2: Utility values
– Difficult to estimate: sensitivity, duration
• Values change depending on phrasing of question
• Solution 2c: Model effect of outcome over lifetime
Conclusion
• Reasoning with uncertainty
– Many real systems uncertain - e.g. medical
diagnosis
• Bayes’ Nets
– Model (in)dependence relations in reasoning
– Noisy-OR simplifies model/computation
• Assumes causes independent
• Decision Trees
– Model rational decision making
• Maximize outcome: Max choice, average outcomes
Bayesian Spam Filtering
• Automatic Text Categorization
• Probabilistic Classifier
– Conditional Framework
– Naïve Bayes Formulation
• Independence assumptions galore
– Feature Selection
– Classification & Evaluation
Spam Classification
• Text categorization problem
– Given a message, M, is it Spam or NotSpam?
• Probabilistic framework
– P(Spam|M)> P(NotSpam|M)
• P(Spam|M)=P(Spam,M)/P(M)
• P(NotSpam|M)=P(NotSpam,M)/P(M)
– Which is more likely?
Characterizing a Message
• Represent message M as set of features
– Features: a1,a2,….an
• What features?
– Words! (again)
– Alternatively, n-gram sequences
• Stemmed (?)
• Term frequencies: N(W, Spam); N(W,NotSpam)
– Also, N(Spam),N(NotSpam): # of words in each class
Characterizing a Message II
• Estimating term conditional probabilities (smoothed):
P(W|C) = (N(W,C) + 1/K) / (N(C) + 1)
• Selecting good features:
– Exclude terms s.t.
• N(W,Spam)+N(W,NotSpam) < 4
• 0.45 <=P(W|Spam)/(P(W|Spam)+P(W|NotSpam))<=0.55
Naïve Bayes Formulation
• Naïve Bayes (aka “Idiot” Bayes)
– Assumes all features independent
• Not accurate but useful simplification
• So,
– P(M,Spam) = P(a1,a2,…,an,Spam)
– = P(a1,a2,…,an|Spam) P(Spam)
– = P(a1|Spam) … P(an|Spam) P(Spam)
– Likewise for NotSpam
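Putting the smoothed estimates and the independence assumption together, classification reduces to comparing two sums of log probabilities (logs avoid underflow on long messages). A sketch with invented toy counts and an assumed K; this is not Pantel & Lin's actual data:

```python
import math

# Toy word counts N(W,C) and class word totals N(C); all numbers here
# are invented for illustration, not Pantel & Lin's training data.
counts = {
    "spam":    {"free": 30, "money": 20, "meeting": 1},
    "notspam": {"free": 2,  "money": 3,  "meeting": 40},
}
totals = {"spam": 60, "notspam": 100}
priors = {"spam": 0.4, "notspam": 0.6}
K = 5000   # smoothing constant from the slide's formula; value assumed

def log_p_word(w, c):
    # Smoothed estimate: P(W|C) = (N(W,C) + 1/K) / (N(C) + 1)
    return math.log((counts[c].get(w, 0) + 1.0 / K) / (totals[c] + 1))

def classify(words):
    # Naive Bayes: log P(C) + sum of log P(w|C), pick the larger class
    score = {c: math.log(priors[c]) + sum(log_p_word(w, c) for w in words)
             for c in priors}
    return max(score, key=score.get)

print(classify(["free", "money"]))   # spam-flavored words
print(classify(["meeting"]))         # notspam-flavored word
```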
Experimentation (Pantel & Lin)
• Training: 160 spam, 466 non-spam
• Test: 277 spam, 346 non-spam
• 230,449 training words; 60,434 spam words
– 12,228 terms; filtering reduces to 3,848
Results (PL)
• False positives: 1.16%
• False negatives: 8.3%
• Overall error: 4.33%
• Simple approach, effective
Variants
• Features?
• Model?
– Explicit bias to certain error types
• Address lists
• Explicit rules
Uncertain Reasoning over Time
Noisy-Channel Model
• Original message not directly observable
– Passed through some channel between sender and receiver, plus noise
– Examples: telephone (Shannon); word sequence vs. acoustics (Jelinek); genome sequence vs. CATG reads; object vs. image
• Derive the most likely original input based on the observations
Bayesian Inference
• P(W|O) difficult to compute directly
– W = input, O = observations
W* = argmax_W P(W|O)
= argmax_W P(O|W) P(W) / P(O)
= argmax_W P(O|W) P(W)
Generative and Sequence Applications
• AI: Speech recognition!, POS tagging,
sense tagging, dialogue, image
understanding, information retrieval
• Non-AI:
– Bioinformatics: gene sequencing
– Security: intrusion detection
– Cryptography
Hidden Markov Models
Probabilistic Reasoning over Time
• Issue: Discrete models
– Many processes continuously changing
– How do we make observations? States?
• Solution: Discretize
– “Time slices”: Make time discrete
– Observations, States associated with time:
Ot, Qt
• Observations can be discrete or continuous
– Here focus on discrete for clarity
Modelling Processes over Time
• Infer underlying state sequence from observed
• Issue: New state depends on preceding states
– Analyzing sequences
• Problem 1: Possibly unbounded # prob tables
– Observation+State+Time
• Solution 1: Assume stationary process
– Rules governing process same at all time
• Problem 2: Possibly unbounded # parents
– Markov assumption: Only consider finite history
– Common: first- or second-order Markov: depend on last one or two states
Hidden Markov Models (HMMs)
• An HMM is:
– 1) A set of states: Q = q0, q1, …, qk
– 2) A set of transition probabilities: A = a01, …, amn
• Where aij is the probability of the transition qi -> qj
– 3) Observation probabilities: B = bi(ot)
• The probability of observing ot in state i
– 4) An initial probability distribution over states: πi
• The probability of starting in state i
– 5) A set of accepting states
Three Problems for HMMs
• Find the probability of an observation sequence
given a model
– Forward algorithm
• Find the most likely path through a model given
an observed sequence
– Viterbi algorithm (decoding)
• Find the most likely model (parameters) given
an observed sequence
– Baum-Welch (EM) algorithm
Bins and Balls Example
• Assume there are two bins filled with red
and blue balls. Behind a curtain, someone
selects a bin and then draws a ball from it
(and replaces it). They then select either
the same bin or the other one and then
select another ball…
– (Example due to J. Martin)
Bins and Balls Example
[Figure: two-state transition diagram. Bin 1: self-loop 0.6, to Bin 2 0.4; Bin 2: self-loop 0.7, to Bin 1 0.3]
Bins and Balls
• Π: Bin 1: 0.9; Bin 2: 0.1
• A:
        Bin 1  Bin 2
Bin 1   0.6    0.4
Bin 2   0.3    0.7
• B:
        Bin 1  Bin 2
Red     0.7    0.4
Blue    0.3    0.6
Bins and Balls
• Assume the observation sequence:
– Blue Blue Red (BBR)
• Both bins have Red and Blue
– Any state sequence could produce observations
• However, NOT equally likely
– Big difference in start probabilities
– Observation depends on state
– State depends on prior state
Bins and Balls
Blue Blue Red
111: (0.9*0.3)*(0.6*0.3)*(0.6*0.7) = 0.0204
112: (0.9*0.3)*(0.6*0.3)*(0.4*0.4) = 0.0078
121: (0.9*0.3)*(0.4*0.6)*(0.3*0.7) = 0.0136
122: (0.9*0.3)*(0.4*0.6)*(0.7*0.4) = 0.0181
211: (0.1*0.6)*(0.3*0.3)*(0.6*0.7) = 0.0023
212: (0.1*0.6)*(0.3*0.3)*(0.4*0.4) = 0.0009
221: (0.1*0.6)*(0.7*0.6)*(0.3*0.7) = 0.0053
222: (0.1*0.6)*(0.7*0.6)*(0.7*0.4) = 0.0071
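These path probabilities follow mechanically from Π, A, and B, so they can be checked by brute-force enumeration. A sketch, with states 1 and 2 standing for Bin 1 and Bin 2:

```python
from itertools import product

# Bins-and-balls HMM parameters from the slides.
pi = {1: 0.9, 2: 0.1}
A = {(1, 1): 0.6, (1, 2): 0.4, (2, 1): 0.3, (2, 2): 0.7}
B = {(1, "Red"): 0.7, (1, "Blue"): 0.3, (2, "Red"): 0.4, (2, "Blue"): 0.6}

obs = ["Blue", "Blue", "Red"]

def seq_prob(states):
    # P(states, obs) = pi(q1) b_q1(o1) * product of a(q_prev,q) b_q(o)
    p = pi[states[0]] * B[(states[0], obs[0])]
    for prev, cur, o in zip(states, states[1:], obs[1:]):
        p *= A[(prev, cur)] * B[(cur, o)]
    return p

probs = {s: seq_prob(s) for s in product([1, 2], repeat=3)}
total = sum(probs.values())        # P(obs): sum over all state sequences
best = max(probs, key=probs.get)   # most likely state sequence
print(total, best)
```

Summing all eight sequences gives P(Blue Blue Red) = 0.07542, and 111 is the most likely sequence.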
Answers and Issues
• Here, to compute probability of observed
– Just add up all the state sequence probabilities
• To find most likely state sequence
– Just pick the sequence with the highest value
• Problem: Computing all paths expensive
– 2T*N^T
• Solution: Dynamic Programming
– Sweep across all states at each time step
• Summing (Problem 1) or Maximizing (Problem 2)
Forward Probability
αj(t) = P(o1, o2, …, ot, qt = j | λ)
αj(1) = πj bj(o1),  1 ≤ j ≤ N
αj(t+1) = [ Σi=1..N αi(t) aij ] bj(ot+1)
P(O | λ) = Σi=1..N αi(T)
where α is the forward probability, t is the time in the utterance, i and j are states in the HMM, aij is the transition probability, bj(ot) is the probability of observing ot in state j, N is the number of states, and T is the final time step.
Pronunciation Example
• Observations: 0/1
[Figure: sequence pronunciation model]
Acoustic Model
• 3-state phone model for [m]
– Use Hidden Markov Model (HMM)
[Figure: Onset -> Mid -> End -> Final.
Transition probabilities: Onset self-loop 0.3, Onset->Mid 0.7; Mid self-loop 0.9, Mid->End 0.1; End self-loop 0.4, End->Final 0.6.
Observation probabilities: Onset: C1 0.5, C2 0.2, C3 0.3; Mid: C3 0.2, C4 0.7, C5 0.1; End: C4 0.1, C6 0.5, C6 0.4]
– Probability of sequence: sum of probabilities of paths
Forward Algorithm
• Idea: a matrix where each cell forward[t,j] represents the probability of being in state j after seeing the first t observations.
• Each cell expresses the probability:
forward[t,j] = P(o1,o2,...,ot, qt=j | λ)
• qt = j means "the tth state in the sequence of states is state j."
• Compute this probability by summing over the extensions of all paths leading to the current cell.
• An extension of a path from state i at time t-1 to state j at time t is computed by multiplying together:
i. the previous path probability from the previous cell, forward[t-1,i],
ii. the transition probability aij from previous state i to current state j, and
iii. the observation likelihood bj(ot) that current state j matches observation symbol ot.
Forward Algorithm
Function Forward(observations of length T, state-graph) returns observation probability
  num-states <- num-of-states(state-graph)
  Create a path probability matrix forward[num-states+2, T+2]
  forward[0,0] <- 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph do
        new-score <- forward[s,t] * a[s,s'] * bs'(ot)
        forward[s',t+1] <- forward[s',t+1] + new-score
  return the sum of the final column of forward[]
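On the bins-and-balls model the forward recursion can be written directly. A sketch using the Π, A, B values from the earlier slides; it should agree with summing all eight explicit paths:

```python
def forward(obs, states, pi, A, B):
    # alpha[j] = P(o1..ot, qt = j); initialize with pi_j * b_j(o1)
    alpha = {j: pi[j] * B[(j, obs[0])] for j in states}
    for o in obs[1:]:
        # Sum over predecessor states, then weight by the emission prob.
        alpha = {j: sum(alpha[i] * A[(i, j)] for i in states) * B[(j, o)]
                 for j in states}
    return sum(alpha.values())     # P(O | model)

pi = {1: 0.9, 2: 0.1}
A = {(1, 1): 0.6, (1, 2): 0.4, (2, 1): 0.3, (2, 2): 0.7}
B = {(1, "Red"): 0.7, (1, "Blue"): 0.3, (2, "Red"): 0.4, (2, "Blue"): 0.6}

p_obs = forward(["Blue", "Blue", "Red"], [1, 2], pi, A, B)
print(p_obs)
```

The work is O(N^2 T) rather than the O(2T N^T) of exhaustive path enumeration.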
Viterbi Code
Function Viterbi(observations of length T, state-graph) returns best-path
  num-states <- num-of-states(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] <- 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph do
        new-score <- viterbi[s,t] * a[s,s'] * bs'(ot)
        if ((viterbi[s',t+1] == 0) || (viterbi[s',t+1] < new-score)) then
          viterbi[s',t+1] <- new-score
          back-pointer[s',t+1] <- s
  Backtrace from the highest-probability state in the final column of viterbi[] and return the path
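A sketch of the same idea on the bins-and-balls model: compared with the forward recursion, the sum over predecessors becomes a max, plus backpointers for the backtrace.

```python
def viterbi(obs, states, pi, A, B):
    # delta[j] = probability of the best path ending in state j
    delta = {j: pi[j] * B[(j, obs[0])] for j in states}
    back = []                       # backpointers, one dict per time step
    for o in obs[1:]:
        new, ptr = {}, {}
        for j in states:
            i_best = max(states, key=lambda i: delta[i] * A[(i, j)])
            ptr[j] = i_best
            new[j] = delta[i_best] * A[(i_best, j)] * B[(j, o)]
        delta = new
        back.append(ptr)
    # Backtrace from the highest-probability final state.
    last = max(states, key=delta.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path)), delta[last]

pi = {1: 0.9, 2: 0.1}
A = {(1, 1): 0.6, (1, 2): 0.4, (2, 1): 0.3, (2, 2): 0.7}
B = {(1, "Red"): 0.7, (1, "Blue"): 0.3, (2, "Red"): 0.4, (2, "Blue"): 0.6}

path, p = viterbi(["Blue", "Blue", "Red"], [1, 2], pi, A, B)
print(path, p)
```

This recovers the 111 sequence found by enumeration, in O(N^2 T) time.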
Modeling Sequences, Redux
• Discrete observation values
– Simple, but inadequate
– Many observations highly variable
• Gaussian pdfs over continuous values
– Assume normally distributed observations
• Typically sum over multiple shared Gaussians
– “Gaussian mixture models”
– Trained with HMM model
1
[( ot   j )  ( ot   j )]
1
j
b j (ot ) 
e
(2 ) |  j |
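For a single continuous observation dimension the density above reduces to the familiar univariate normal. A sketch (the observation, mean, and variance values are invented):

```python
import math

def gaussian_b(o, mu, var):
    # Univariate case of the emission density: N(o; mu, var)
    return math.exp(-0.5 * (o - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

# Density peaks at the mean and falls off with distance from it.
print(gaussian_b(0.0, 0.0, 1.0))   # at the mean, unit variance
print(gaussian_b(1.0, 0.0, 1.0))   # one standard deviation away
```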
Learning HMMs
• Issue: Where do the probabilities come
from?
• Solution: Learn from data
– Trains transition (aij) and emission (bj)
probabilities
• Typically assume structure
– Baum-Welch, aka the forward-backward algorithm
• Iteratively estimate expected counts of transitions/emissions
• Get estimated probabilities by forward computation
– Divide probability mass over contributing paths
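In the fully observed special case, where the state sequence is labeled, the estimates are just normalized counts; Baum-Welch generalizes this by replacing observed counts with expected counts. A sketch with a made-up labeled sequence (states and observations are invented):

```python
from collections import Counter

# Fully observed case: estimate a_ij and b_j(o) from labeled data by
# counting. The labels below are invented for illustration.
states = ["H", "H", "C", "C", "H"]     # hypothetical state labels
obs    = ["3", "2", "1", "1", "3"]     # hypothetical observations

trans = Counter(zip(states, states[1:]))   # N(i -> j)
emit = Counter(zip(states, obs))           # N(j emits o)
state_count = Counter(states)              # N(j)

def a(i, j):
    # Transition estimate: N(i -> j) / total transitions out of i
    return trans[(i, j)] / sum(c for (s, _), c in trans.items() if s == i)

def b(j, o):
    # Emission estimate: N(j emits o) / N(j)
    return emit[(j, o)] / state_count[j]

print(a("H", "C"), b("H", "3"))
```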