Reasoning Under Uncertainty
Artificial Intelligence
CMSC 25000
February 21, 2008
Roadmap
• Reasoning under uncertainty
– Decision trees
Decision Making
• Design model of rational decision making
– Maximize expected value among alternatives
• Uncertainty from
– Outcomes of actions
– Choices taken
• To maximize outcome
– Select maximum over choices
– Weighted average value of chance outcomes
Gangrene Example
[Decision-tree figure:]
• Medicine: Full recovery 0.7 -> 1000; Worse 0.25 -> second decision; Die 0.05 -> 0
– Second decision, Medicine: Live 0.6 -> 995; Die 0.4 -> 0
– Second decision, Amputate leg: Live 0.98 -> 700; Die 0.02 -> 0
• Amputate foot: Live 0.99 -> 850; Die 0.01 -> 0
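The tree can be evaluated mechanically: take the probability-weighted average at chance nodes and the maximum at decision nodes. A minimal sketch; the nested-tuple representation and branch grouping are my reading of the slide's figure:

```python
# Evaluate a decision tree: max over choices, probability-weighted
# average over chance outcomes. Values/probabilities from the slide.

def expected_value(node):
    kind = node[0]
    if kind == "leaf":                      # ("leaf", utility)
        return node[1]
    if kind == "chance":                    # ("chance", [(prob, child), ...])
        return sum(p * expected_value(c) for p, c in node[1])
    if kind == "decision":                  # ("decision", {choice: child})
        return max(expected_value(c) for c in node[1].values())
    raise ValueError(kind)

# Second decision reached when the patient gets worse under medicine.
worse = ("decision", {
    "medicine":     ("chance", [(0.6, ("leaf", 995)), (0.4, ("leaf", 0))]),
    "amputate leg": ("chance", [(0.98, ("leaf", 700)), (0.02, ("leaf", 0))]),
})
tree = ("decision", {
    "medicine":      ("chance", [(0.7, ("leaf", 1000)),
                                 (0.25, worse),
                                 (0.05, ("leaf", 0))]),
    "amputate foot": ("chance", [(0.99, ("leaf", 850)),
                                 (0.01, ("leaf", 0))]),
})

print(expected_value(tree))   # medicine branch wins over amputate foot
```

With these values the medicine branch yields 0.7*1000 + 0.25*686 = 871.5, beating amputate foot's 0.99*850 = 841.5.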
Decision Tree Issues
• Problem 1: Tree size
– k activities : 2^k orders
• Solution 1: Hill-climbing
– Choose best apparent choice after one step
• Use entropy reduction
• Problem 2: Utility values
– Difficult to estimate: sensitivity, duration
• Values change depending on phrasing of question
• Solution 2c: Model effect of outcome over lifetime
Conclusion
• Reasoning with uncertainty
– Many real systems uncertain - e.g. medical
diagnosis
• Bayes’ Nets
– Model (in)dependence relations in reasoning
– Noisy-OR simplifies model/computation
• Assumes causes independent
• Decision Trees
– Model rational decision making
• Maximize outcome: Max choice, average outcomes
Bayesian Spam Filtering
• Automatic Text Categorization
• Probabilistic Classifier
– Conditional Framework
– Naïve Bayes Formulation
• Independence assumptions galore
– Feature Selection
– Classification & Evaluation
Spam Classification
• Text categorization problem
– Given a message, M, is it Spam or NotSpam?
• Probabilistic framework
– P(Spam|M)> P(NotSpam|M)
• P(Spam|M)=P(Spam,M)/P(M)
• P(NotSpam|M)=P(NotSpam,M)/P(M)
– Which is more likely?
Characterizing a Message
• Represent message M as set of features
– Features: a1,a2,….an
• What features?
– Words! (again)
– Alternatively, n-gram sequences
• Stemmed (?)
• Term frequencies: N(W, Spam); N(W,NotSpam)
– Also, N(Spam),N(NotSpam): # of words in each class
Characterizing a Message II
• Estimating term conditional probabilities (smoothed):
P(W|C) = (N(W,C) + 1/K) / (N(C) + 1)
• Selecting good features:
– Exclude terms s.t.
• N(W,Spam)+N(W,NotSpam) < 4
• 0.45 <=P(W|Spam)/(P(W|Spam)+P(W|NotSpam))<=0.55
Naïve Bayes Formulation
• Naïve Bayes (aka “Idiot” Bayes)
– Assumes all features independent
• Not accurate but useful simplification
• So,
– P(M,Spam) = P(a1,a2,…,an,Spam)
– = P(a1,a2,…,an|Spam) P(Spam)
– = P(a1|Spam) … P(an|Spam) P(Spam)
– Likewise for NotSpam
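Putting the smoothed estimates and the independence assumption together, classification reduces to comparing two sums of log probabilities (logs avoid underflow on long messages). A sketch with invented toy counts and an assumed K; this is not Pantel & Lin's actual data:

```python
import math

# Toy word counts N(W,C) and class word totals N(C); all numbers here
# are invented for illustration, not Pantel & Lin's training data.
counts = {
    "spam":    {"free": 30, "money": 20, "meeting": 1},
    "notspam": {"free": 2,  "money": 3,  "meeting": 40},
}
totals = {"spam": 60, "notspam": 100}
priors = {"spam": 0.4, "notspam": 0.6}
K = 5000   # smoothing constant from the slide's formula; value assumed

def log_p_word(w, c):
    # Smoothed estimate: P(W|C) = (N(W,C) + 1/K) / (N(C) + 1)
    return math.log((counts[c].get(w, 0) + 1.0 / K) / (totals[c] + 1))

def classify(words):
    # Naive Bayes: log P(C) + sum of log P(w|C), pick the larger class
    score = {c: math.log(priors[c]) + sum(log_p_word(w, c) for w in words)
             for c in priors}
    return max(score, key=score.get)

print(classify(["free", "money"]))   # spam-flavored words
print(classify(["meeting"]))         # notspam-flavored word
```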
Experimentation (Pantel & Lin)
• Training: 160 spam, 466 non-spam
• Test: 277 spam, 346 non-spam
• 230,449 training words; 60,434 spam words
– 12,228 terms; filtering reduces to 3,848
Results (PL)
• False positives: 1.16%
• False negatives: 8.3%
• Overall error: 4.33%
• Simple approach, effective
Variants
• Features?
• Model?
– Explicit bias to certain error types
• Address lists
• Explicit rules
Uncertain Reasoning over Time
Noisy-Channel Model
• Original message not directly observable
– Passed through some channel between sender and receiver, plus noise
– Examples: telephone (Shannon); word sequence vs. acoustics (Jelinek); genome sequence vs. CATG reads; object vs. image
• Derive the most likely original input based on the observations
Bayesian Inference
• P(W|O) difficult to compute directly
– W = input, O = observations
W* = argmax_W P(W|O)
= argmax_W P(O|W) P(W) / P(O)
= argmax_W P(O|W) P(W)
Generative and Sequence Applications
• AI: Speech recognition!, POS tagging,
sense tagging, dialogue, image
understanding, information retrieval
• Non-AI:
– Bioinformatics: gene sequencing
– Security: intrusion detection
– Cryptography
Hidden Markov Models
Probabilistic Reasoning over Time
• Issue: Discrete models
– Many processes continuously changing
– How do we make observations? States?
• Solution: Discretize
– “Time slices”: Make time discrete
– Observations, States associated with time:
Ot, Qt
• Observations can be discrete or continuous
– Here focus on discrete for clarity
Modelling Processes over Time
• Infer underlying state sequence from observed
• Issue: New state depends on preceding states
– Analyzing sequences
• Problem 1: Possibly unbounded # prob tables
– Observation+State+Time
• Solution 1: Assume stationary process
– Rules governing process same at all time
• Problem 2: Possibly unbounded # parents
– Markov assumption: Only consider finite history
– Common: first- or second-order Markov: depend on last one or two states
Hidden Markov Models (HMMs)
• An HMM is:
– 1) A set of states: Q = q0, q1, …, qk
– 2) A set of transition probabilities: A = a01, …, amn
• Where aij is the probability of the transition qi -> qj
– 3) Observation probabilities: B = bi(ot)
• The probability of observing ot in state i
– 4) An initial probability distribution over states: πi
• The probability of starting in state i
– 5) A set of accepting states
Three Problems for HMMs
• Find the probability of an observation sequence
given a model
– Forward algorithm
• Find the most likely path through a model given
an observed sequence
– Viterbi algorithm (decoding)
• Find the most likely model (parameters) given
an observed sequence
– Baum-Welch (EM) algorithm
Bins and Balls Example
• Assume there are two bins filled with red
and blue balls. Behind a curtain, someone
selects a bin and then draws a ball from it
(and replaces it). They then select either
the same bin or the other one and then
select another ball…
– (Example due to J. Martin)
Bins and Balls Example
[Figure: two-state transition diagram. Bin 1: self-loop 0.6, to Bin 2 0.4; Bin 2: self-loop 0.7, to Bin 1 0.3]
Bins and Balls
• Π: Bin 1: 0.9; Bin 2: 0.1
• A:
        Bin 1  Bin 2
Bin 1   0.6    0.4
Bin 2   0.3    0.7
• B:
        Bin 1  Bin 2
Red     0.7    0.4
Blue    0.3    0.6
Bins and Balls
• Assume the observation sequence:
– Blue Blue Red (BBR)
• Both bins have Red and Blue
– Any state sequence could produce observations
• However, NOT equally likely
– Big difference in start probabilities
– Observation depends on state
– State depends on prior state
Bins and Balls
Blue Blue Red
111: (0.9*0.3)*(0.6*0.3)*(0.6*0.7) = 0.0204
112: (0.9*0.3)*(0.6*0.3)*(0.4*0.4) = 0.0078
121: (0.9*0.3)*(0.4*0.6)*(0.3*0.7) = 0.0136
122: (0.9*0.3)*(0.4*0.6)*(0.7*0.4) = 0.0181
211: (0.1*0.6)*(0.3*0.3)*(0.6*0.7) = 0.0023
212: (0.1*0.6)*(0.3*0.3)*(0.4*0.4) = 0.0009
221: (0.1*0.6)*(0.7*0.6)*(0.3*0.7) = 0.0053
222: (0.1*0.6)*(0.7*0.6)*(0.7*0.4) = 0.0071
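These path probabilities follow mechanically from Π, A, and B, so they can be checked by brute-force enumeration. A sketch, with states 1 and 2 standing for Bin 1 and Bin 2:

```python
from itertools import product

# Bins-and-balls HMM parameters from the slides.
pi = {1: 0.9, 2: 0.1}
A = {(1, 1): 0.6, (1, 2): 0.4, (2, 1): 0.3, (2, 2): 0.7}
B = {(1, "Red"): 0.7, (1, "Blue"): 0.3, (2, "Red"): 0.4, (2, "Blue"): 0.6}

obs = ["Blue", "Blue", "Red"]

def seq_prob(states):
    # P(states, obs) = pi(q1) b_q1(o1) * product of a(q_prev,q) b_q(o)
    p = pi[states[0]] * B[(states[0], obs[0])]
    for prev, cur, o in zip(states, states[1:], obs[1:]):
        p *= A[(prev, cur)] * B[(cur, o)]
    return p

probs = {s: seq_prob(s) for s in product([1, 2], repeat=3)}
total = sum(probs.values())        # P(obs): sum over all state sequences
best = max(probs, key=probs.get)   # most likely state sequence
print(total, best)
```

Summing all eight sequences gives P(Blue Blue Red) = 0.07542, and 111 is the most likely sequence.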
Answers and Issues
• Here, to compute probability of observed
– Just add up all the state sequence probabilities
• To find most likely state sequence
– Just pick the sequence with the highest value
• Problem: Computing all paths expensive
– 2T*N^T
• Solution: Dynamic Programming
– Sweep across all states at each time step
• Summing (Problem 1) or Maximizing (Problem 2)
Forward Probability
αj(t) = P(o1, o2, …, ot, qt = j | λ)
αj(1) = πj bj(o1),  1 ≤ j ≤ N
αj(t+1) = [ Σi=1..N αi(t) aij ] bj(ot+1)
P(O | λ) = Σi=1..N αi(T)
where α is the forward probability, t is the time in the utterance, i and j are states in the HMM, aij is the transition probability, bj(ot) is the probability of observing ot in state j, N is the number of states, and T is the final time step.
Pronunciation Example
• Observations: 0/1
[Figure: sequence pronunciation model]
Acoustic Model
• 3-state phone model for [m]
– Use Hidden Markov Model (HMM)
[Figure: Onset -> Mid -> End -> Final.
Transition probabilities: Onset self-loop 0.3, Onset->Mid 0.7; Mid self-loop 0.9, Mid->End 0.1; End self-loop 0.4, End->Final 0.6.
Observation probabilities: Onset: C1 0.5, C2 0.2, C3 0.3; Mid: C3 0.2, C4 0.7, C5 0.1; End: C4 0.1, C6 0.5, C6 0.4]
– Probability of sequence: sum of probabilities of paths
Forward Algorithm
• Idea: a matrix where each cell forward[t,j] represents the probability of being in state j after seeing the first t observations.
• Each cell expresses the probability:
forward[t,j] = P(o1,o2,...,ot, qt=j | λ)
• qt = j means "the tth state in the sequence of states is state j."
• Compute this probability by summing over the extensions of all paths leading to the current cell.
• An extension of a path from state i at time t-1 to state j at time t is computed by multiplying together:
i. the previous path probability from the previous cell, forward[t-1,i],
ii. the transition probability aij from previous state i to current state j, and
iii. the observation likelihood bj(ot) that current state j matches observation symbol ot.
Forward Algorithm
Function Forward(observations of length T, state-graph) returns observation probability
  num-states <- num-of-states(state-graph)
  Create a path probability matrix forward[num-states+2, T+2]
  forward[0,0] <- 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph do
        new-score <- forward[s,t] * a[s,s'] * bs'(ot)
        forward[s',t+1] <- forward[s',t+1] + new-score
  return the sum of the final column of forward[]
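On the bins-and-balls model the forward recursion can be written directly. A sketch using the Π, A, B values from the earlier slides; it should agree with summing all eight explicit paths:

```python
def forward(obs, states, pi, A, B):
    # alpha[j] = P(o1..ot, qt = j); initialize with pi_j * b_j(o1)
    alpha = {j: pi[j] * B[(j, obs[0])] for j in states}
    for o in obs[1:]:
        # Sum over predecessor states, then weight by the emission prob.
        alpha = {j: sum(alpha[i] * A[(i, j)] for i in states) * B[(j, o)]
                 for j in states}
    return sum(alpha.values())     # P(O | model)

pi = {1: 0.9, 2: 0.1}
A = {(1, 1): 0.6, (1, 2): 0.4, (2, 1): 0.3, (2, 2): 0.7}
B = {(1, "Red"): 0.7, (1, "Blue"): 0.3, (2, "Red"): 0.4, (2, "Blue"): 0.6}

p_obs = forward(["Blue", "Blue", "Red"], [1, 2], pi, A, B)
print(p_obs)
```

The work is O(N^2 T) rather than the O(2T N^T) of exhaustive path enumeration.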
Viterbi Code
Function Viterbi(observations of length T, state-graph) returns best-path
  num-states <- num-of-states(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] <- 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph do
        new-score <- viterbi[s,t] * a[s,s'] * bs'(ot)
        if ((viterbi[s',t+1] == 0) || (viterbi[s',t+1] < new-score)) then
          viterbi[s',t+1] <- new-score
          back-pointer[s',t+1] <- s
  Backtrace from the highest-probability state in the final column of viterbi[] and return the path
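A sketch of the same idea on the bins-and-balls model: compared with the forward recursion, the sum over predecessors becomes a max, plus backpointers for the backtrace.

```python
def viterbi(obs, states, pi, A, B):
    # delta[j] = probability of the best path ending in state j
    delta = {j: pi[j] * B[(j, obs[0])] for j in states}
    back = []                       # backpointers, one dict per time step
    for o in obs[1:]:
        new, ptr = {}, {}
        for j in states:
            i_best = max(states, key=lambda i: delta[i] * A[(i, j)])
            ptr[j] = i_best
            new[j] = delta[i_best] * A[(i_best, j)] * B[(j, o)]
        delta = new
        back.append(ptr)
    # Backtrace from the highest-probability final state.
    last = max(states, key=delta.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path)), delta[last]

pi = {1: 0.9, 2: 0.1}
A = {(1, 1): 0.6, (1, 2): 0.4, (2, 1): 0.3, (2, 2): 0.7}
B = {(1, "Red"): 0.7, (1, "Blue"): 0.3, (2, "Red"): 0.4, (2, "Blue"): 0.6}

path, p = viterbi(["Blue", "Blue", "Red"], [1, 2], pi, A, B)
print(path, p)
```

This recovers the 111 sequence found by enumeration, in O(N^2 T) time.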
Modeling Sequences, Redux
• Discrete observation values
– Simple, but inadequate
– Many observations highly variable
• Gaussian pdfs over continuous values
– Assume normally distributed observations
• Typically sum over multiple shared Gaussians
– “Gaussian mixture models”
– Trained with HMM model
1
[( ot   j )  ( ot   j )]
1
j
b j (ot ) 
e
(2 ) |  j |
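For a single continuous observation dimension the density above reduces to the familiar univariate normal. A sketch (the observation, mean, and variance values are invented):

```python
import math

def gaussian_b(o, mu, var):
    # Univariate case of the emission density: N(o; mu, var)
    return math.exp(-0.5 * (o - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

# Density peaks at the mean and falls off with distance from it.
print(gaussian_b(0.0, 0.0, 1.0))   # at the mean, unit variance
print(gaussian_b(1.0, 0.0, 1.0))   # one standard deviation away
```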
Learning HMMs
• Issue: Where do the probabilities come
from?
• Solution: Learn from data
– Trains transition (aij) and emission (bj)
probabilities
• Typically assume structure
– Baum-Welch, aka the forward-backward algorithm
• Iteratively estimate expected counts of transitions/emissions
• Get estimated probabilities by forward computation
– Divide probability mass over contributing paths
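In the fully observed special case, where the state sequence is labeled, the estimates are just normalized counts; Baum-Welch generalizes this by replacing observed counts with expected counts. A sketch with a made-up labeled sequence (states and observations are invented):

```python
from collections import Counter

# Fully observed case: estimate a_ij and b_j(o) from labeled data by
# counting. The labels below are invented for illustration.
states = ["H", "H", "C", "C", "H"]     # hypothetical state labels
obs    = ["3", "2", "1", "1", "3"]     # hypothetical observations

trans = Counter(zip(states, states[1:]))   # N(i -> j)
emit = Counter(zip(states, obs))           # N(j emits o)
state_count = Counter(states)              # N(j)

def a(i, j):
    # Transition estimate: N(i -> j) / total transitions out of i
    return trans[(i, j)] / sum(c for (s, _), c in trans.items() if s == i)

def b(j, o):
    # Emission estimate: N(j emits o) / N(j)
    return emit[(j, o)] / state_count[j]

print(a("H", "C"), b("H", "3"))
```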