Introduction to Artificial Intelligence
Marc Toussaint
April 7, 2015
The majority of slides of the earlier parts are adapted from Stuart Russell.
This is a direct concatenation and reformatting of all lecture slides and exercises from
the Artificial Intelligence course (winter term 2014/15, U Stuttgart), including indexing to
help prepare for exams.
[Course overview map (figure): search (BFS, backtracking search, constraint propagation, alpha/beta pruning, minimax, games, MCTS, UCB, bandits), CSP and propositional / first-order / relational logic (fwd/bwd chaining, sequential assignment), probabilistic and graphical models (belief propagation, message passing, HMMs, relational graphical models), sequential decision problems and MDPs (dynamic programming, V(s), Q(s,a), decision theory, utilities, multi-agent MDPs, relational MDPs), and learning (ML, Reinforcement Learning, Active Learning).]
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

3 Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Example: Romania (1:2) Example: Vacuum World (1:5) Problem Definition: Deterministic, fully observable (1:9)
Example: The 8-Puzzle (1:15)
3.1 Tree Search Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Tree search implementation: states vs nodes (1:25) Tree Search: General Algorithm (1:26) Breadth-first search
(BFS) (1:29) Complexity of BFS (1:37) Uniform-cost search (1:38) Depth-first search (DFS) (1:39) Complexity
of DFS (1:52) Iterative deepening search (1:54) Complexity of Iterative Deepening Search (1:63) Graph search
and repeated states (1:65)

4 Informed search algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Best-first Search (2:3) Greedy Search (2:5) Complexity of Greedy Search (2:14) A∗ search (2:15) A∗: Proof 1 of
Optimality (2:22) Complexity of A∗ (2:27) A∗: Proof 2 of Optimality (2:28) Admissible heuristics (2:30)
Memory-bounded A∗ (2:34)

5 Constraint Satisfaction Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Constraint satisfaction problems (CSPs): Definition (3:2) Map-Coloring Problem (3:3)
5.1 Methods for solving CSPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Backtracking (3:10) Variable order: Minimum remaining values (3:18) Variable order: Degree heuristic (3:19)
Value order: Least constraining value (3:20) Forward checking (3:21) Constraint propagation (3:25)
Tree-structured CSPs (3:33)

6 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Optimization problem: Definition (4:2) Local Search (4:5) Travelling Salesman Problem (TSP) (4:6) Local optima,
plateaus (4:8) Iterated Local Search (4:9) Simulated Annealing (4:11) Genetic Algorithms (4:14)
6.1 A glimpse at general optimization problems . . . . . . . . . . . . . . . . . . . . . . 24
LP, QP, ILP, NLP (4:17) Slack Variables (4:19) n-queens as ILP (4:19) TSP as ILP (4:20) CSP as ILP (4:21)

7 Propositional Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Knowledge base: Definition (5:2) Wumpus World example (5:4) Logics: Definition, Syntax, Semantics (5:20) Entailment (5:21) Model (5:22) Inference (5:28) Propositional logic: Syntax (5:29) Propositional logic: Semantics (5:31)
Logical equivalence (5:37) Validity (5:38) Satisfiability (5:38) Horn Form (5:40) Modus Ponens (5:40) Forward chaining (5:41) Completeness of Forward Chaining (5:51) Backward Chaining (5:52) Conjunctive Normal Form (5:64)
Resolution (5:64) Conversion to CNF (5:65)

8 First Order Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
FOL: Syntax (6:5) Universal quantification (6:7) Existential quantification (6:8)
8.1 FOL description of interactive domains . . . . . . . . . . . . . . . . . . . . . . . . 38
Situation Calculus (6:21) Frame problem (6:22) Planning Domain Definition Language (PDDL) (6:24)

9 First Order Logic – Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Reduction to propositional inference (7:6) Unification (7:9) Generalized Modus Ponens (7:14) Forward Chaining
(7:15) Backward Chaining (7:17) Resolution (7:19) Conversion to CNF (7:20)

10 Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Probabilities as (subjective) information calculus (8:2) Frequentist vs Bayesian (8:4)
10.1 Basic definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Definitions based on sets (8:6) Random variables (8:7) Probability distribution (8:8) Joint distribution (8:9)
Marginal (8:9) Conditional distribution (8:9) Bayes' Theorem (8:11) Multiple RVs, conditional independence (8:12)
10.2 Probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Bernoulli and Binomial (8:14) Beta (8:15) Multinomial (8:18) Dirichlet (8:19) Conjugate priors (8:23) Dirac (8:26)
Gaussian (8:27) Particle approximation of a distribution (8:29) Utilities and Decision Theory (8:32) Entropy (8:33)
Kullback-Leibler divergence (8:34)

11 Bandits & UCT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Multi-armed Bandits (9:1) Exploration, Exploitation (9:6) Upper Confidence Bound (UCB) (9:8) Monte Carlo Tree
Search (MCTS) (9:14) Upper Confidence Tree (UCT) (9:19)

12 Game Playing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Minimax (10:3) Alpha-Beta Pruning (10:6) Evaluation functions (10:11) UCT for games (10:12)

13 Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Bayesian Network (11:3) Conditional independence in a Bayes Net (11:7) Inference: general meaning (11:12)
13.1 Inference Methods in Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . 57
Inference in graphical models: overview (11:17) Monte Carlo (11:19) Importance sampling (11:22) Gibbs sampling (11:24) Variable elimination (11:27) Factor graph (11:30) Belief propagation (11:36) Message passing (11:36)
Loopy belief propagation (11:39) Junction tree algorithm (11:41) Maximum a-posteriori (MAP) inference (11:45)
Conditional random field* (11:46)

14 Dynamic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Markov Process (12:2) Filtering, Smoothing, Prediction (12:3) Hidden Markov Model (12:4) HMM: Inference (12:5)
HMM inference (12:6) Kalman filter (12:9)

15 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Markov Decision Process (MDP) (13:3) Value Function (13:4) Bellman optimality equation (13:8) Value Iteration
(13:10) Q-Function (13:11) Q-Iteration (13:12) Proof of convergence of Q-Iteration (13:13) Temporal difference (TD)
(13:19) Sarsa (13:21) Q-learning (13:22) Proof of convergence of Q-learning (13:24) Eligibility traces (13:26)
Model-based RL (13:34) Imitation Learning (13:37) Inverse RL (13:40) Policy gradients (13:44)

16 Reinforcement Learning – Exploration . . . . . . . . . . . . . . . . . . . . . . . . . 74
Epsilon-greedy exploration in Q-learning (14:4) Sample Complexity (14:6) PAC-MDP efficiency (14:7)
Explicit-Exploit-or-Explore* (14:9) R-Max (14:14) Bayesian RL (14:15) Optimistic heuristics (14:16)

17 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
17.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
17.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
17.3 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
17.4 Exercise 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
17.5 Exercise 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
17.6 Exercise 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
17.7 Exercise 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
17.8 Exercise 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
17.9 Exercise 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
17.10 Exercise 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
1 Introduction
I wasn’t happy with the first slides (’Introduction’ and ’Intelligent Agents’).
So I skip them here. They will also not be relevant for the exam. You
may find them on the lecture webpage.
3 Search
Outline (1:1)
• Problem formulation & examples
• Basic search algorithms

Example: Romania (1:2)
On holiday in Romania; currently in Arad. Flight leaves tomorrow from Bucharest.
Formulate goal: be in Bucharest, Sgoal = {Bucharest}
Formulate problem:
  states: various cities, S = {Arad, Timisoara, . . . }
  actions: drive between cities, A = {edges between states}
Find solution:
  sequence of cities, e.g., Arad, Sibiu, Fagaras, Bucharest
  minimize costs with cost function, (s, a) ↦ c

Example: Romania (1:3)
[figure: map of Romania with roads between cities]

Example: vacuum world (1:4–1:8)
[figure: the eight vacuum-world states #1–#8]
Deterministic, fully observable, start in #5. Solution?? [Right, Suck]
Non-observable, start in {1, 2, 3, 4, 5, 6, 7, 8}, e.g., Right goes to {2, 4, 6, 8}. Solution?? [Right, Suck, Left, Suck]
Non-deterministic, start in #5; Murphy's Law: Suck can dirty a clean carpet; local sensing: dirt, location only. Solution?? [Right, if dirt then Suck]

Problem types
Deterministic, fully observable ("single-state problem")
  Agent knows exactly which state it will be in; solution is a sequence
  First state and world known → the agent does not rely on observations
Non-observable ("conformant problem")
  Agent may have no idea where it is; solution (if any) is a sequence
Nondeterministic and/or partially observable ("contingency problem")
  percepts provide new information about current state
  solution is a reactive plan or a policy
  often interleave search, execution
Unknown state space ("exploration problem")
Deterministic, fully observable problem def. (1:9)
A deterministic, fully observable problem is defined by four items:
  initial state s0 ∈ S, e.g., s0 = Arad
  successor function succ : S × A → S, e.g., succ(Arad, Arad-Zerind) = Zerind
  goal states Sgoal ⊆ S, e.g., s = Bucharest
  step cost function cost(s, a, s′), assumed to be ≥ 0, e.g., traveled distance, number of actions executed, etc.;
  the path cost is the sum of step costs
A solution is a sequence of actions leading from s0 to a goal
An optimal solution is a solution with minimal path costs

Example: vacuum world state space graph (1:10–1:14)
states??: integer dirt and robot locations (ignore dirt amounts etc.)
actions??: Left, Right, Suck, NoOp
goal test??: no dirt
path cost??: 1 per action (0 for NoOp)
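For illustration, here is a minimal Python sketch of a problem in exactly this format (initial state, successor function, goal states, step cost), using the Romania example; this is not part of the original slides, the class and field names are my own, and only a few of the road distances are included.

# Minimal sketch of a deterministic, fully observable search problem (cf. slide 1:9).
# Only a small subset of the Romania road map is encoded.
ROADS = {
    ('Arad', 'Zerind'): 75, ('Arad', 'Sibiu'): 140, ('Arad', 'Timisoara'): 118,
    ('Sibiu', 'Fagaras'): 99, ('Fagaras', 'Bucharest'): 211,
}

class RomaniaProblem:
    def __init__(self):
        self.initial_state = 'Arad'        # s0
        self.goal_states = {'Bucharest'}   # Sgoal

    def actions(self, s):
        # an action = driving to a directly connected city
        return [b for (a, b) in ROADS if a == s] + [a for (a, b) in ROADS if b == s]

    def succ(self, s, a):
        # deterministic successor function succ : S x A -> S
        return a

    def cost(self, s, a, s2):
        # step cost >= 0: traveled distance
        return ROADS.get((s, s2)) or ROADS.get((s2, s))

    def is_goal(self, s):
        return s in self.goal_states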
Example: The 8-puzzle (1:15–1:19)
states??: integer locations of tiles (ignore intermediate positions)
actions??: move blank left, right, up, down (ignore unjamming etc.)
goal test??: = goal state (given)
path cost??: 1 per move
[Note: optimal solution of the n-Puzzle family is NP-hard]

3.1 Tree Search Algorithms

Tree search algorithms (1:21)
Basic idea: offline, simulated exploration of state space by generating successors of already-explored states (a.k.a. expanding states)

Tree search example (1:22–1:24)
[figures: successive expansions of the search tree for the Romania problem, starting at Arad]
Implementation: states vs. nodes (1:25)
A state is a (representation of) a physical configuration
A node is a data structure constituting part of a search tree
  includes parent, children, depth, path cost g(x)
States do not have parents, children, depth, or path cost!
The EXPAND function creates new nodes, filling in the various fields and using the SUCCESSORFN of the problem to create the corresponding states.

Implementation: general tree search (1:26)

function TREE-SEARCH(problem, fringe) returns a solution, or failure
  fringe ← INSERT(MAKE-NODE(INITIAL-STATE[problem]), fringe)
  loop do
    if fringe is empty then return failure
    node ← REMOVE-FRONT(fringe)
    if GOAL-TEST(problem, STATE(node)) then return node
    fringe ← INSERTALL(EXPAND(node, problem), fringe)

function EXPAND(node, problem) returns a set of nodes
  successors ← the empty set
  for each action, result in SUCCESSOR-FN(problem, STATE[node]) do
    s ← a new NODE
    PARENT-NODE[s] ← node; ACTION[s] ← action; STATE[s] ← result
    PATH-COST[s] ← PATH-COST[node] + STEP-COST(STATE[node], action, result)
    DEPTH[s] ← DEPTH[node] + 1
    add s to successors
  return successors

Search strategies (1:27)
A strategy is defined by picking the order of node expansion
Strategies are evaluated along the following dimensions:
  completeness—does it always find a solution if one exists?
  time complexity—number of nodes generated/expanded
  space complexity—maximum number of nodes in memory
  optimality—does it always find a least-cost solution?
Time and space complexity are measured in terms of
  b—maximum branching factor of the search tree
  d—depth of the least-cost solution
  m—maximum depth of the state space (may be ∞)

Uninformed search strategies (1:28)
Uninformed strategies use only the information available in the problem definition:
  Breadth-first search
  Uniform-cost search
  Depth-first search
  Depth-limited search
  Iterative deepening search

Breadth-first search (1:29–1:32)
Expand shallowest unexpanded node
Implementation: fringe is a FIFO queue, i.e., new successors go at end

Properties of breadth-first search (1:33–1:37)
Complete?? Yes (if b is finite)
Time?? 1 + b + b^2 + b^3 + . . . + b^d + b(b^d − 1) = O(b^(d+1)), i.e., exponential in d
Space?? O(b^(d+1)) (keeps every node in memory)
Optimal?? Yes (if cost = 1 per step); not optimal in general
Space is the big problem; can easily generate nodes at 100MB/sec, so 24hrs = 8640GB.

Uniform-cost search (1:38)
Expand least-cost unexpanded node
Implementation: fringe = queue ordered by path cost, lowest first
Equivalent to breadth-first if step costs all equal
Complete?? Yes, if step cost ≥ ε
Time?? # of nodes with g ≤ cost of optimal solution, O(b^⌈C*/ε⌉), where C* is the cost of the optimal solution
Space?? # of nodes with g ≤ cost of optimal solution, O(b^⌈C*/ε⌉)
Optimal?? Yes—nodes expanded in increasing order of g(n)

Depth-first search (1:39–1:47)
Expand deepest unexpanded node
Implementation: fringe = LIFO queue, i.e., put successors at front

Properties of depth-first search (1:48–1:52)
Complete?? No: fails in infinite-depth spaces, spaces with loops
  Modify to avoid repeated states along path ⇒ complete in finite spaces
Time?? O(b^m): terrible if m is much larger than d, but if solutions are dense, may be much faster than breadth-first
Space?? O(bm), i.e., linear space!
Optimal?? No
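The TREE-SEARCH scheme above together with the different fringe disciplines (FIFO for BFS, LIFO for DFS, priority queue for uniform-cost search) can be condensed into one small Python sketch. This is my own illustration, not lecture code; it reuses the RomaniaProblem sketch from the earlier example.

import heapq
from collections import deque

class Node:
    def __init__(self, state, parent=None, action=None, path_cost=0.0):
        self.state, self.parent, self.action, self.path_cost = state, parent, action, path_cost
        self.depth = 0 if parent is None else parent.depth + 1

def expand(node, problem):
    # EXPAND: generate all successor nodes of a node
    children = []
    for a in problem.actions(node.state):
        s2 = problem.succ(node.state, a)
        children.append(Node(s2, node, a, node.path_cost + problem.cost(node.state, a, s2)))
    return children

def tree_search(problem, strategy='bfs'):
    # the strategy picks the order of node expansion: 'bfs' (FIFO), 'dfs' (LIFO), 'ucs' (lowest path cost)
    root = Node(problem.initial_state)
    if strategy == 'ucs':
        fringe, counter = [(0.0, 0, root)], 1
        while fringe:
            _, _, node = heapq.heappop(fringe)
            if problem.is_goal(node.state):
                return node
            for child in expand(node, problem):
                heapq.heappush(fringe, (child.path_cost, counter, child)); counter += 1
        return None
    fringe = deque([root])
    while fringe:
        node = fringe.popleft() if strategy == 'bfs' else fringe.pop()
        if problem.is_goal(node.state):
            return node
        fringe.extend(expand(node, problem))
    return None

# tree_search(RomaniaProblem(), 'bfs') finds Bucharest; note that plain tree-search DFS
# can loop forever on graphs with cycles (cf. slide 1:48). Adding a closed set of visited
# states before expansion turns this into the GRAPH-SEARCH variant of slide 1:66.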
Depth-limited search (1:53)
= depth-first search with depth limit l, i.e., nodes at depth l have no successors
Recursive implementation:

function DEPTH-LIMITED-SEARCH(problem, limit) returns soln/fail/cutoff
  RECURSIVE-DLS(MAKE-NODE(INITIAL-STATE[problem]), problem, limit)

function RECURSIVE-DLS(node, problem, limit) returns soln/fail/cutoff
  cutoff-occurred? ← false
  if GOAL-TEST(problem, STATE[node]) then return node
  else if DEPTH[node] = limit then return cutoff
  else for each successor in EXPAND(node, problem) do
    result ← RECURSIVE-DLS(successor, problem, limit)
    if result = cutoff then cutoff-occurred? ← true
    else if result ≠ failure then return result
  if cutoff-occurred? then return cutoff else return failure

Iterative deepening search (1:54–1:57)

function ITERATIVE-DEEPENING-SEARCH(problem) returns a solution
  inputs: problem, a problem
  for depth ← 0 to ∞ do
    result ← DEPTH-LIMITED-SEARCH(problem, depth)
    if result ≠ cutoff then return result
  end

Properties of iterative deepening search (1:58–1:63)
Complete?? Yes
Time?? (d + 1)b^0 + d b^1 + (d − 1)b^2 + . . . + b^d = O(b^d)
Space?? O(bd)
Optimal?? Yes, if step cost = 1
Can be modified to explore uniform-cost tree
Numerical comparison for b = 10 and d = 5, solution at far left leaf:
  N(IDS) = 50 + 400 + 3,000 + 20,000 + 100,000 = 123,450
  N(BFS) = 10 + 100 + 1,000 + 10,000 + 100,000 + 999,990 = 1,111,100
IDS does better because other nodes at depth d are not expanded
BFS can be modified to apply goal test when a node is generated
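A compact Python sketch of depth-limited and iterative deepening search in the spirit of the pseudocode above; for brevity it recurses over states rather than nodes, signals "cutoff" with a sentinel string, and bounds the outer loop with a practical max_depth. These are my own simplifications, not the lecture's.

CUTOFF = 'cutoff'

def depth_limited_search(problem, limit, state=None, depth=0):
    # depth-first search that treats states at depth `limit` as having no successors
    state = problem.initial_state if state is None else state
    if problem.is_goal(state):
        return [state]
    if depth == limit:
        return CUTOFF
    cutoff_occurred = False
    for a in problem.actions(state):
        result = depth_limited_search(problem, limit, problem.succ(state, a), depth + 1)
        if result == CUTOFF:
            cutoff_occurred = True
        elif result is not None:
            return [state] + result   # prepend current state to the found path
    return CUTOFF if cutoff_occurred else None

def iterative_deepening_search(problem, max_depth=50):
    # repeatedly run depth-limited search with increasing limit
    for depth in range(max_depth + 1):
        result = depth_limited_search(problem, depth)
        if result != CUTOFF:
            return result   # either a path or None (= failure)
    return None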
Summary of algorithms (1:64)

Criterion   Breadth-First   Uniform-Cost   Depth-First   Depth-Limited   Iterative Deepening
Complete?   Yes*            Yes*           No            Yes, if l ≥ d   Yes
Time        b^(d+1)         b^⌈C*/ε⌉       b^m           b^l             b^d
Space       b^(d+1)         b^⌈C*/ε⌉       bm            bl              bd
Optimal?    Yes*            Yes            No            No              Yes*
Loops: Repeated states
Failure to detect repeated states can turn a linear problem into
an exponential one!
1:65
Graph search
function GRAPH-SEARCH(problem, fringe) returns a solution, or failure
  closed ← an empty set
  fringe ← INSERT(MAKE-NODE(INITIAL-STATE[problem]), fringe)
  loop do
    if fringe is empty then return failure
    node ← REMOVE-FRONT(fringe)
    if GOAL-TEST(problem, STATE[node]) then return node
    if STATE[node] is not in closed then
      add STATE[node] to closed
      fringe ← INSERTALL(EXPAND(node, problem), fringe)
  end
But: storing all visited nodes leads again to exponential space
complexity (as for BFS)
1:66
Summary
In BFS (or uniform-cost search), the fringe propagates layer-wise, containing nodes of similar distance-from-start (cost-so-far), leading to optimal paths but exponential space complexity O(b^(d+1)).
In DFS, the fringe is like a deep light beam sweeping over the tree, with space complexity O(bm). Iterative deepening also leads to optimal paths.
Graph search can be exponentially more efficient than tree search, but storing the visited nodes may lead to exponential space complexity, as for BFS.
1:67
4 Informed search algorithms
Outline (2:1)
• Best-first search
• A∗ search
• Heuristics

Review: Tree search (2:2)

function TREE-SEARCH(problem, fringe) returns a solution, or failure
  fringe ← INSERT(MAKE-NODE(INITIAL-STATE[problem]), fringe)
  loop do
    if fringe is empty then return failure
    node ← REMOVE-FRONT(fringe)
    if GOAL-TEST[problem] applied to STATE(node) succeeds return node
    fringe ← INSERTALL(EXPAND(node, problem), fringe)

A strategy is defined by picking the order of node expansion

Best-first search (2:3)
Idea: use an arbitrary priority function f(n) for each node
  – actually f(n) is neg-priority: nodes with lower f(n) have higher priority
f(n) should reflect which nodes could be on an optimal path
  – could is optimistic – the lower f(n) the more optimistic you are that n is on an optimal path
⇒ Expand the unexpanded node with highest priority
Implementation: fringe is a queue sorted in decreasing order of priority
Special cases: greedy search, A∗ search

Romania with step costs in km (2:4)
[figure: Romania road map with step costs and straight-line distances to Bucharest]

Greedy search (2:5)
We set the priority function equal to a heuristic f(n) = h(n)
h(n) = estimate of cost from n to the closest goal
E.g., hSLD(n) = straight-line distance from n to Bucharest
Greedy search expands the node that appears to be closest to goal

Greedy search example (2:6–2:9)
[figures: greedy expansion from Arad using hSLD]

Properties of greedy search (2:10–2:14)
Complete?? No–can get stuck in loops, e.g., with Oradea as goal, Iasi → Neamt → Iasi → Neamt →
  Complete in finite space with repeated-state checking
Time?? O(b^m), but a good heuristic can give dramatic improvement
Space?? O(b^m)—keeps all nodes in memory
Optimal?? No
Greedy search does not care about the 'past' (the cost-so-far).

A∗ search (2:15)
Idea: combine information from the past and the future
  – neg-priority = cost-so-far + estimated cost-to-go
Evaluation function f(n) = g(n) + h(n)
  g(n) = cost-so-far to reach n
  h(n) = estimated cost-to-go from n
  f(n) = estimated total cost of path through n to goal
A∗ search uses an admissible (=optimistic) heuristic
  i.e., h(n) ≤ h∗(n) where h∗(n) is the true cost-to-go from n.
  (Also require h(n) ≥ 0, so h(G) = 0 for any goal G.)
E.g., hSLD(n) never overestimates the actual road distance
Theorem: A∗ search is optimal (=finds the optimal path)

A∗ search example (2:16–2:20)
[figures: A∗ expansion from Arad using f = g + hSLD]
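A minimal A∗ sketch in Python for the f = g + h evaluation above. It takes the heuristic as a function argument (e.g., straight-line distance) and uses a best-cost dictionary for repeated-state checking; the function and variable names are my own, not lecture material.

import heapq

def astar_search(problem, h):
    # f(n) = g(n) + h(n); always expand the node with lowest f
    start = problem.initial_state
    fringe = [(h(start), 0.0, start, [start])]   # entries: (f, g, state, path)
    best_g = {start: 0.0}
    while fringe:
        f, g, state, path = heapq.heappop(fringe)
        if problem.is_goal(state):
            return path, g
        for a in problem.actions(state):
            s2 = problem.succ(state, a)
            g2 = g + problem.cost(state, a, s2)
            if g2 < best_g.get(s2, float('inf')):   # keep only the cheapest known path to s2
                best_g[s2] = g2
                heapq.heappush(fringe, (g2 + h(s2), g2, s2, path + [s2]))
    return None, float('inf')

# e.g., astar_search(RomaniaProblem(), h=lambda s: 0) degenerates to uniform-cost search;
# with an admissible h (such as hSLD) the returned path is optimal.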
A∗ search example (2:21)
[figure]

Proof of optimality of A∗ (2:22)
Suppose some suboptimal goal G2 has been generated and is in the fringe (but has not yet been selected to be tested for goal condition!). Let n be an unexpanded node on a shortest path to an optimal goal G.
  f(G2) = g(G2)    since h(G2) = 0
        > g(G)     since G2 is suboptimal
        ≥ f(n)     since h is admissible
Since f(n) < f(G2), A∗ will expand n before it will select G2 from the fringe (for goal testing). Then, as G is added to the fringe, and since f(G) = g(G) < f(G2) = g(G2), it will select G before G2 for goal testing.

Properties of A∗ (2:23–2:27)
Complete?? Yes, unless there are infinitely many nodes with f ≤ f(G)
Time?? Exponential in [relative error in h × length of soln.]
Space?? Keeps all nodes in memory
Optimal?? Yes
A∗ expands all nodes with f(n) < C∗
A∗ expands some nodes with f(n) = C∗
A∗ expands no nodes with f(n) > C∗

Optimality of A∗ (more useful) (2:28)
Lemma: A∗ expands nodes in order of increasing f value∗
Gradually adds "f-contours" of nodes (cf. breadth-first adds layers)
Contour i has all nodes with f = fi, where fi < fi+1

Proof of lemma: Consistency (2:29)
A heuristic is consistent if
  h(n) ≤ c(n, a, n′) + h(n′)
If h is consistent, we have
  f(n′) = g(n′) + h(n′)
        = g(n) + c(n, a, n′) + h(n′)
        ≥ g(n) + h(n)
        = f(n)
I.e., f(n) is nondecreasing along any path.
Admissible heuristics (2:30–2:31)
E.g., for the 8-puzzle:
  h1(n) = number of misplaced tiles
  h2(n) = total Manhattan distance (i.e., no. of squares from desired location of each tile)
  h1(S) =?? 6
  h2(S) =?? 4+0+3+3+1+0+2+1 = 14

Dominance (2:32)
If h2(n) ≥ h1(n) for all n (both admissible) then h2 dominates h1 and is better for search
Typical search costs:
  d = 14: IDS = 3,473,941 nodes; A∗(h1) = 539 nodes; A∗(h2) = 113 nodes
  d = 24: IDS ≈ 54,000,000,000 nodes; A∗(h1) = 39,135 nodes; A∗(h2) = 1,641 nodes
Given any admissible heuristics ha, hb,
  h(n) = max(ha(n), hb(n))
is also admissible and dominates ha, hb

Relaxed problems (2:33)
Admissible heuristics can be derived from the exact solution cost of a relaxed version of the problem
If the rules of the 8-puzzle are relaxed so that a tile can move anywhere, then h1(n) gives the shortest solution
If the rules are relaxed so that a tile can move to any adjacent square, then h2(n) gives the shortest solution
Key point: the optimal solution cost of a relaxed problem is no greater than the optimal solution cost of the real problem

Memory-bounded A∗ (2:34)
As with BFS, A∗ has exponential space complexity
Iterative-deepening A∗ works for integer path costs, but is problematic for real-valued costs
(Simplified) Memory-bounded A∗ (SMA∗):
  – Expand as usual until a memory bound is reached
  – Then, whenever adding a node, remove the worst node n′ from the tree
  – worst means: the n′ with highest f(n′)
  – To not lose information, back up the measured step-cost cost(ñ, a, n′) to improve the heuristic h(ñ) of its parent
SMA∗ is complete and optimal if the depth of the optimal path is within the memory bound

Summary (2:35)
Combine information from the past and the future
A heuristic function h(n) represents information about the future
  – it estimates cost-to-go optimistically
Good heuristics can dramatically reduce search cost
Greedy best-first search expands lowest h
  – incomplete and not always optimal
A∗ search expands lowest f = g + h
  – neg-priority = cost-so-far + estimated cost-to-go
  – complete and optimal
  – also optimally efficient (up to tie-breaks, for forward search)
Admissible heuristics can be derived from exact solution of relaxed problems
Memory-bounded strategies exist

Outlook (2:36)
We postpone tree search with partial observations
  – rather discuss this in a fully probabilistic setting later
We postpone tree search for games
  – minimax extension to tree search
  – discuss state-of-the-art probabilistic Monte-Carlo tree search methods later
Next: Constraint Satisfaction Problems
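The two admissible 8-puzzle heuristics from slide 2:30 can be written down directly. In this small sketch (mine, not the lecture's) a state is assumed to be a tuple of 9 entries in row-major order, with 0 for the blank.

def h1(state, goal):
    # number of misplaced tiles (the blank, 0, is not counted)
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def h2(state, goal):
    # total Manhattan distance of each tile from its goal position on the 3x3 board
    total = 0
    for tile in range(1, 9):
        i, j = divmod(state.index(tile), 3)
        gi, gj = divmod(goal.index(tile), 3)
        total += abs(i - gi) + abs(j - gj)
    return total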
5 Constraint Satisfaction Problems
Outline (3:1)
• CSP examples
• Backtracking sequential assignment for CSPs
• Problem structure and problem decomposition
• Later: general-purpose discrete (and continuous) optimization methods

Constraint satisfaction problems (CSPs) (3:2)
In previous lectures we considered sequential decision problems; CSPs are not sequential decision problems. However, the basic methods address them by sequentially testing 'decisions'.
CSP:
  We have n variables xi, each with domain Di, xi ∈ Di
  We have K constraints Ck, each of which determines the feasible configurations of a subset of variables
  The goal is to find a configuration X = (X1, .., Xn) of all variables that satisfies all constraints
Formally Ck = (Ik, ck) where Ik ⊆ {1, .., n} determines the subset of variables, and ck : DIk → {0, 1} determines whether a configuration xIk ∈ DIk of this subset of variables is feasible

Example: Map-Coloring (3:3)
Variables W, N, Q, E, V, S, T   (E = New South Wales)
Domains Di = {red, green, blue} for all variables
Constraints: adjacent regions must have different colors, e.g., W ≠ N, or
  (W, N) ∈ {(red, green), (red, blue), (green, red), (green, blue), . . .}

Example: Map-Coloring contd. (3:4)
Solutions are assignments satisfying all constraints, e.g.,
  {W = red, N = green, Q = red, E = green, V = red, S = blue, T = green}

Constraint graph (3:5)
Binary CSP: each constraint relates at most two variables
Constraint graph: a bipartite graph: nodes are variables, boxes are constraints
In the map-coloring problem, all constraints relate two variables: boxes↔edges
In general, constraints may constrain several (or one) variables (|Ik| ≠ 2)
[figure: constraint graph over W, N, Q, E, V, S, T with constraints c1–c9]

Varieties of CSPs (3:6)
• Discrete variables:
  finite domains; each Di of size |Di| = d ⇒ O(d^n) complete assignments
    – e.g., Boolean CSPs, incl. Boolean satisfiability
  infinite domains (integers, strings, etc.)
    – e.g., job scheduling, variables are start/end days for each job
    – linear constraints solvable, nonlinear undecidable
• Continuous variables
  – e.g., start/end times for Hubble Telescope observations
  – linear constraints solvable in poly time by LP methods

Varieties of constraints (3:7)
Unary constraints involve a single variable, |Ik| = 1, e.g., S ≠ green
Binary constraints involve pairs of variables, |Ik| = 2, e.g., S ≠ W
Higher-order constraints involve 3 or more variables, |Ik| > 2, e.g., Sudoku
Having "soft constraints" (preferences, cost, probabilities) leads to general optimization and probabilistic inference problems

Real-world CSPs (3:8)
Assignment problems, e.g., who teaches what class
Timetabling problems, e.g., which class is offered when and where?
Hardware configuration
Spreadsheets
Transportation scheduling
Factory scheduling
Floorplanning
Notice that many real-world problems involve real-valued variables

5.1 Methods for solving CSPs

Sequential assignment approach (3:10)
Let's start with the straightforward, dumb approach, then fix it
States are defined by the values assigned so far
• Initial state: the empty assignment, { }
• Successor function: assign a value to an unassigned variable that does not conflict with current assignment ⇒ fail if no feasible assignments (not fixable!)
• Goal test: the current assignment is complete
1) Every solution appears at depth n with n variables ⇒ use depth-first search
2) b = (n − ℓ)d at depth ℓ, hence n!·d^n leaves!

Backtracking sequential assignment (3:11)
Two variable assignment decisions are commutative, i.e.,
  [W = red then N = green] same as [N = green then W = red]
We can fix a single next variable to assign a value to at each node
This does not compromise completeness (ability to find the solution)
⇒ b = d and there are d^n leaves
Depth-first search for CSPs with single-variable assignments is called backtracking search
Backtracking search is the basic uninformed algorithm for CSPs
Can solve n-queens for n ≈ 25

Backtracking search (3:12)

function BACKTRACKING-SEARCH(csp) returns solution/failure
  return RECURSIVE-BACKTRACKING({ }, csp)

function RECURSIVE-BACKTRACKING(assignment, csp) returns soln/failure
  if assignment is complete then return assignment
  var ← SELECT-UNASSIGNED-VARIABLE(VARIABLES[csp], assignment, csp)
  for each value in ORDERED-DOMAIN-VALUES(var, assignment, csp) do
    if value is consistent with assignment given CONSTRAINTS[csp] then
      add [var = value] to assignment
      result ← RECURSIVE-BACKTRACKING(assignment, csp)
      if result ≠ failure then return result
      remove [var = value] from assignment
  return failure

Backtracking example (3:13–3:15)
[figures: step-by-step backtracking assignment for the map-coloring CSP]
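A direct Python transcription of RECURSIVE-BACKTRACKING for binary constraints, followed by the map-coloring problem from slide 3:3 as a usage example. The data layout (domains as a dict, constraints as a dict of predicates) and the static variable order are my own choices for illustration.

def backtracking_search(variables, domains, constraints):
    # constraints: dict mapping (X, Y) -> predicate(x_val, y_val) returning True if feasible
    def consistent(var, value, assignment):
        for other, val in assignment.items():
            fwd = constraints.get((var, other)) or (lambda a, b: True)
            bwd = constraints.get((other, var)) or (lambda a, b: True)
            if not fwd(value, val) or not bwd(val, value):
                return False
        return True

    def recurse(assignment):
        if len(assignment) == len(variables):
            return assignment
        var = next(v for v in variables if v not in assignment)   # simple static order
        for value in domains[var]:
            if consistent(var, value, assignment):
                assignment[var] = value
                result = recurse(assignment)
                if result is not None:
                    return result
                del assignment[var]   # remove [var = value] and try the next value
        return None

    return recurse({})

# Map-coloring example (W, N, Q, E, S, V adjacent pairs must differ; T is unconstrained)
neq = lambda a, b: a != b
adjacent = [('W','N'), ('W','S'), ('N','S'), ('N','Q'), ('Q','S'), ('Q','E'),
            ('E','S'), ('E','V'), ('V','S')]
constraints = {pair: neq for pair in adjacent}
domains = {v: ['red', 'green', 'blue'] for v in 'WNQESVT'}
print(backtracking_search(list('WNQESVT'), domains, constraints))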
Backtracking example (3:16)
[figure]

Improving backtracking efficiency (3:17)
Simple heuristics can give huge gains in speed:
1. Which variable should be assigned next?
2. In what order should its values be tried?
3. Can we detect inevitable failure early?
4. Can we take advantage of problem structure?

Minimum remaining values (3:18)
Minimum remaining values (MRV): choose the variable with the fewest legal values

Degree heuristic (3:19)
Tie-breaker among MRV variables
Degree heuristic: choose the variable with the most constraints on remaining variables

Least constraining value (3:20)
Given a variable, choose the least constraining value: the one that rules out the fewest values in the remaining variables
Combining these heuristics makes 1000 queens feasible

Forward checking (3:21–3:23)
Idea: Keep track of remaining legal values for unassigned variables
Terminate search when any variable has no legal values
Forward checking contd. (3:24)
Idea: Keep track of remaining legal values for unassigned variables
Terminate search when any variable has no legal values
Use a data structure DOMAIN[X] to explicitly store Di for each node

Constraint propagation (3:25)
Forward checking propagates information from assigned to unassigned variables, but doesn't provide early detection for all failures:
  N and S cannot both be blue!
Idea: propagate the implied constraints several steps further
Generally, this is called constraint propagation

Arc consistency (for pair-wise constraints) (3:26–3:29)
Simplest form of propagation makes each arc consistent
X → Y is consistent iff for every value x of X there is some allowed y
If X loses a value, neighbors of X need to be rechecked
Arc consistency detects failure earlier than forward checking
Can be run as a preprocessor or after each assignment

Arc consistency algorithm (for pair-wise constraints) (3:30)

function AC-3(csp) returns the CSP, possibly with reduced domains
  inputs: csp, a pair-wise CSP with variables {X1, X2, . . . , Xn}
  local variables: queue, a queue of arcs, initially all the arcs in csp
  while queue is not empty do
    (Xi, Xj) ← REMOVE-FIRST(queue)
    if REMOVE-INCONSISTENT-VALUES(Xi, Xj) then
      for each Xk in NEIGHBORS[Xi] do
        add (Xk, Xi) to queue

function REMOVE-INCONSISTENT-VALUES(Xi, Xj) returns true iff DOM[Xi] changed
  changed ← false
  for each x in DOMAIN[Xi] do
    if no value y in DOMAIN[Xj] allows (x, y) to satisfy the constraint Xi ↔ Xj
      then delete x from DOMAIN[Xi]; changed ← true
  return changed

O(n²d³), can be reduced to O(n²d²)
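The AC-3 pseudocode above in Python, for binary constraints given as predicates. The representation (domains as mutable sets, constraints as a dict of predicates, symmetric arcs derived from it) is an assumption for illustration, compatible with the backtracking example earlier.

from collections import deque

def ac3(variables, domains, constraints):
    # constraints: dict (X, Y) -> predicate(x, y); arcs are considered in both directions
    arcs = [(x, y) for (x, y) in constraints] + [(y, x) for (x, y) in constraints]
    neighbors = {v: {b for (a, b) in arcs if a == v} for v in variables}

    def allows(xi, xj, x, y):
        c = constraints.get((xi, xj))
        if c is not None and not c(x, y):
            return False
        c_rev = constraints.get((xj, xi))
        if c_rev is not None and not c_rev(y, x):
            return False
        return True

    def remove_inconsistent_values(xi, xj):
        changed = False
        for x in list(domains[xi]):
            # delete x if no value y in DOMAIN[Xj] is compatible with it
            if not any(allows(xi, xj, x, y) for y in domains[xj]):
                domains[xi].remove(x)
                changed = True
        return changed

    queue = deque(arcs)
    while queue:
        xi, xj = queue.popleft()
        if remove_inconsistent_values(xi, xj):
            for xk in neighbors[xi]:
                queue.append((xk, xi))   # recheck neighbors of Xi
    return domains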
Constraint propagation (3:31)
See textbook for details for non-pair-wise constraints
Very closely related to message passing in probabilistic models
In practice: design approximate constraint propagation for the specific problem
E.g., Sudoku: If Xi is assigned, delete this value from all peers

Problem structure (3:32)
[figure: constraint graph of the map-coloring problem with variables W, N, Q, E, V, S, T and constraints c1–c9]
Tasmania and mainland are independent subproblems
Identifiable as connected components of constraint graph

Tree-structured CSPs (3:33)
Theorem: if the constraint graph has no loops, the CSP can be solved in O(n d²) time
Compare to general CSPs, where worst-case time is O(d^n)
This property also applies to logical and probabilistic reasoning!

Algorithm for tree-structured CSPs (3:34)
1. Choose a variable as root, order variables from root to leaves such that every node's parent precedes it in the ordering
2. For j from n down to 2, apply REMOVE-INCONSISTENT(Parent(Xj), Xj)
   This is backward constraint propagation
3. For j from 1 to n, assign Xj consistently with Parent(Xj)
   This is forward sequential assignment (trivial backtracking)

Nearly tree-structured CSPs (3:35)
Conditioning: instantiate a variable, prune its neighbors' domains
Cutset conditioning: instantiate (in all ways) a set of variables such that the remaining constraint graph is a tree
Cutset size c ⇒ runtime O(d^c · (n − c)d²), very fast for small c

Summary (3:36)
CSPs are a fundamental kind of problem:
  finding a feasible configuration of n variables
  the set of constraints defines the (graph) structure of the problem
Sequential assignment approach
  Backtracking = depth-first search with one variable assigned per node
Variable ordering and value selection heuristics help significantly
Forward checking prevents assignments that guarantee later failure
Constraint propagation (e.g., arc consistency) does additional work to constrain values and detect inconsistencies
The CSP representation allows analysis of problem structure
Tree-structured CSPs can be solved in linear time
If after assigning some variables, the remaining structure is a tree → linear time feasibility check by tree CSP
6 Optimization
Outline (4:1)
• Local Search
• Iterated Local Search
• Simulated annealing & Genetic algorithms (briefly)
• General formulation of optimization problems
• LP, QP, ILP, non-linear program
• ILP formulations of n-queens, CSP, TSP

Optimization problems (4:2)
We have n variables xi: continuous x ∈ R^n, or discrete xi ∈ {1, .., d}, or mixed
An optimization problem (or mathematical program) is defined by
  min_x f(x)  s.t.  g(x) ≤ 0, h(x) = 0
where g : R^n → R^k defines k inequality constraints, and h : R^n → R^l defines l equality constraints
Optimization is a central thread through all of science:
  – Machine Learning, Robotics, Computer Vision
  – Engineering, Control Theory
  – Economics, Operations Research
  – Physics, Chemistry, Molecular Biology
  – Social Sciences
Computational modelling of natural phenomena often via optimality principles

(4:3)
Stefan Funke gives an excellent lecture on Discrete Optimization (WS)
  – max-flow and min-cut on graphs
  – Linear Programs, esp. Simplex methods
  – Integer Linear Programming and LP-relaxations
I offer a lecture on Continuous Optimization (SS)
  – Gradient and Newton methods
  – Lagrangian, log-barrier, augmented lagrangian methods, primal-dual
  – Local & stochastic search, global optimization, Bayesian optimization

Iterative improvement (4:4)
The majority of optimization methods iteratively manipulate x to monotonely improve x, e.g.:
  – line search, backtracking, trust region methods
  – gradient-based, (Quasi-) Newton methods
  – interior point methods, Simplex method
  – primal-dual Newton
  – local search, pattern search, Nelder-Mead
Exceptions:
  – Global Optimization, Bayesian Optimization
  – stochastic search, simulated annealing, evolutionary algorithms

Local search (greedy downhill, hill climbing) (4:5)
We assume there is a finite neighborhood N(x) defined for every x
Greedy local search (variant 1):
  Input: Initial x, function f(x)
  Output: Local minimum x̂ of f(x)
  1: repeat
  2:   x̂ ← x
  3:   x ← argmin_{y∈N(x)} f(y)
  4: until f(x̂) ≤ f(x)
Variant 2: x ← the "first" y ∈ N(x) such that f(y) < f(x)

Example: Travelling Salesman Problem (TSP) (4:6)
Goal: Find the shortest closed tour visiting n cities.
Start with any complete tour; modify 2 arcs to make the tour shorter
Variants of this approach get within 1% of optimum very quickly with thousands of cities
In TSP, this neighborhood is called 2-opt (modifying 2 arcs). 3-opt or 4-opt are larger neighborhoods.
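A greedy local search sketch in the spirit of variant 1 above, instantiated with segment reversals (2-opt moves) as the TSP neighborhood. The tour representation (a list of city indices) and the distance matrix `dist` are illustrative assumptions of mine.

def tour_length(tour, dist):
    # total length of a closed tour given a symmetric distance matrix dist[i][j]
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def two_opt_neighbors(tour):
    # 2-opt neighborhood: reverse a contiguous segment (removes 2 arcs and reconnects them)
    n = len(tour)
    for i in range(n - 1):
        for j in range(i + 1, n):
            yield tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

def greedy_local_search(tour, dist):
    # variant 1: jump to the best neighbor until no neighbor improves f
    while True:
        best = min(two_opt_neighbors(tour), key=lambda t: tour_length(t, dist))
        if tour_length(best, dist) >= tour_length(tour, dist):
            return tour          # local minimum of the 2-opt neighborhood
        tour = best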
Example: n-queens (4:7)
Goal: Put n queens on an n × n board with no two queens on the same row, column, or diagonal
Start with any configuration of n queens; move a queen to reduce number of conflicts
Almost always solves n-queens problems almost instantaneously for very large n, e.g., n = 1 million

Local search contd. (4:8)
Useful to consider solution space landscape
Random-restart local search overcomes local optima problem—trivially complete
Random sideways moves escape from plateaus, but loop on flat optima

Iterated Local Search (≠ random restarts) (4:9)
Random restarts may be rather expensive, sampling initial x is fully uninformed
Idea: Escape local minimum x by restarting in a meta-neighborhood N∗(x)
  Input: Initial x, function f(x)
  Output: Local minimum x̂ of f(x)
  1: repeat
  2:   For all meta-neighbors yi ∈ N∗(x) compute ŷi ← LocalSearch(yi)
  3:   x ← argmin_{y∈{ŷ1,..,ŷI}} f(y)
  4: until x converges
LocalSearch uses a simple/quick neighborhood N(x)
The meta-neighborhood N∗(x) enables large jumps towards alternative local optima
Variant 2: x ← the "first" yi ∈ N∗(x) such that f(ŷi) < f(x)
Stochastic variant: Meta-neighborhood N∗(x) is replaced by a transition prob. q∗(y|x)

Example: Travelling Salesman Problem (4:10)
LocalSearch uses the simple 2-opt or 3-opt neighborhood (→ quick)
Iterated Local Search uses 4-opt meta-neighborhood (double bridges)

Simulated annealing (4:11)
Idea: Escape local minimum by allowing some "bad" moves but gradually decrease their size and frequency
  Input: initial x, function f(x), proposal distribution q(y|x), initial temp. T
  Output: Global minimum x̂ of f(x)
  1: repeat
  2:   Sample y from the neighborhood of x, y ∼ q(y|x)
  3:   Acceptance probability A = min{1, e^((f(x)−f(y))/T) q(x|y)/q(y|x)}
  4:   With probability A update x ← y
  5:   Decrease T, e.g. T ← (1 − ε)T for small ε
  6: until T = 0 or x converges
Typically q(x|y) = q(y|x)
The new sample y is always accepted if y is better than x (f(y) ≤ f(x))
If y is worse than x, only accept with probability e^((f(x)−f(y))/T)

Properties of simulated annealing (4:12)
At fixed "temperature" T, state occupation probability reaches the Boltzmann distribution
  p(x) = α e^(−f(x)/(kT))
T decreased slowly enough ⟹ always reach best state x∗ = argmin_x f(x)
  because e^(−f(x∗)/(kT)) / e^(−f(x)/(kT)) = e^((f(x)−f(x∗))/(kT)) ≫ 1 for small T
Is this necessarily an interesting guarantee??
Devised by Metropolis et al., 1953, for physical process modelling

Local beam search (maintain k candidates) (4:13)
Idea: keep k candidates instead of 1; choose top k of all their successors
Not the same as k searches run in parallel!
  Searches that find good candidates recruit other searches to join them
Problem: quite often, all k candidates end up on same local hill
Idea: choose k successors randomly, biased towards good ones
Genetic algorithms (4:14)
= stochastic local beam search + generate successors from pairs of candidates

Genetic algorithms contd. (4:15)
GAs require solutions encoded as strings (GPs use trees or programs)
Crossover helps iff substrings are meaningful components
GAs ≠ evolution: e.g., real genes encode replication machinery!
More general view: keeping multiple candidates allows us to use more general neighborhoods N(x1, .., xK) or meta-neighborhoods

6.1 A glimpse at general optimization problems

Optimization problems (4:17)
Linear Program (LP)
  min_x c⊤x  s.t.  Gx ≤ h, Ax = b
  – Simplex Algorithm, Interior point method (Log-barrier), Augmented Lagrangian, primal-dual
  – LP in standard form: min_x c⊤x  s.t.  x ≥ 0, Ax = b
Quadratic Program (QP) (Q is positive definite)
  min_x ½ x⊤Qx + c⊤x  s.t.  Gx ≤ h, Ax = b
  – Log-barrier, Augmented Lagrangian, primal-dual Newton
Integer Linear Program (ILP)
  min_x c⊤x  s.t.  Ax = b, xi ∈ {1, .., di}
  – LP-relaxations & backtracking, specialized methods, graph cut methods
Non-linear program (Convex Program: f, g convex and h linear)
  min_x f(x)  s.t.  g(x) ≤ 0, h(x) = 0
  – Sequential Quadratic Programming (SQP), Log-barrier, Augmented Lagrangian, primal-dual

The art is in finding a reduction (4:18)
How can a real-world problem be encoded as an optimization problem?

Example: n-queens as Integer Linear Program (4:19)
binary indicator variables xij for a queen at position (i, j), i, j = 1, .., n
Constraints:
  – row constraints ∀i : Σ_j xij ≤ 1
  – column constraints ∀j : Σ_i xij ≤ 1
  – diagonal cnstr. ∀ i∈{−n+1,..,n−1} : Σ_{j: j, i+j ∈ {1,..,n}} x_{i+j,j} ≤ 1
  – diagonal cnstr. ∀ i∈{−n+1,..,n−1} : Σ_{j: j, i−j ∈ {1,..,n}} x_{i−j,j} ≤ 1
Objective Function: arbitrary (e.g. f(x) = 1)! We encoded everything in the constraints!
Better alternative: Optimize the number of constraint violations:
  instead of "≤ 1" write "≤ 1 + ξk" in all constraints
  the slack variables ξ = (ξ1, .., ξK) become part of the state
  add the constraints ξk ≥ 0
  objective function f(x, ξ) = Σ_k ξk
  related to Phase I optimization of finding a feasible x

Example: TSP as Integer Linear Program (4:20)
binary indicator variables xij for (ij) ∈ tour
city-visit-times ti ∈ {1, .., n}
Objective: cost f(x) = Σ_{ij} cij xij of the tour
Constraints:
  – Columns sum to 1: ∀j : Σ_i xij = 1
  – Rows sum to 1: ∀i : Σ_j xij = 1
  – city-visit-times ti must fulfill: ∀ 2 ≤ i ≠ j ≤ n : ti − tj ≤ n − 2 − (n − 1)xij
(There are alternative formulations.)

Example: CSP as Integer Linear Program (4:21)
binary indicator variables xiv = [Xi = v] for every CSP variable Xi
Constraints:
  – "Xi can have only one value": ∀i : Σ_v xiv = 1 (cf. probabilities..)
  – If [Xi = v ∧ Xj = w] is constraint-violating, add a constraint xiv + xjw ≤ 1
  – Do this for EVERY forbidden local configuration (MANY constraints)
Objective Function: arbitrary (e.g. f(x) = 1)! We encoded everything in the constraints!
Better alternative:
  Translate the constraints into soft constraints xiv + xjw ≤ 1 + ξk
  Minimize Σ_k ξk s.t. ξk ≥ 0
(There exists a more efficient formulation for MaxSAT in conjunctive normal form.)

Summary (4:22)
Many problems can be reduced to optimization problems
Local Search, esp. Iterated Local Search, is often effective in practice
In continuous domains, when gradients and Hessians are given → full field of optimization
Ongoing research in global & Bayesian optimization
7 Propositional Logic
Outline (5:1)
• Knowledge-based agents
• Wumpus world
• Logic in general—models and entailment
• Propositional (Boolean) logic
• Equivalence, validity, satisfiability
• Inference rules and theorem proving
  – forward chaining
  – backward chaining
  – resolution

Knowledge bases (5:2)
Knowledge base = set of sentences of a formal language
Declarative approach to building an agent (or other system):
  TELL it what it needs to know
Then it can ASK itself what to do—answers should follow from the KB
Agents can be viewed at the knowledge level
  i.e., what they know, regardless of how implemented
Or at the implementation level
  i.e., data structures in KB and algorithms that manipulate them

A simple knowledge-based agent (5:3)

function KB-AGENT(percept) returns an action
  static: KB, a knowledge base
          t, a counter, initially 0, indicating time
  TELL(KB, MAKE-PERCEPT-SENTENCE(percept, t))
  action ← ASK(KB, MAKE-ACTION-QUERY(t))
  TELL(KB, MAKE-ACTION-SENTENCE(action, t))
  t ← t + 1
  return action

The agent must be able to:
  Represent states, actions, etc.
  Incorporate new percepts
  Update internal representations of the world
  Deduce hidden properties of the world
  Deduce appropriate actions

Wumpus World description (5:4)
Performance measure: gold +1000, death -1000, -1 per step, -10 for using the arrow
Environment:
  Squares adjacent to wumpus are smelly
  Squares adjacent to pit are breezy
  Glitter iff gold is in the same square
  Shooting kills wumpus if you are facing it
  The wumpus kills you if in the same square
  Shooting uses up the only arrow
  Grabbing picks up gold if in same square
  Releasing drops the gold in same square
Actuators: Left turn, Right turn, Forward, Grab, Release, Shoot, Climb
Sensors: Breeze, Glitter, Stench, Bump, Scream

Wumpus world characterization (5:5–5:10)
Observable?? No—only local perception
Deterministic?? Yes—outcomes exactly specified
Static?? Yes—Wumpus and Pits do not move
Discrete?? Yes
Single-agent?? Yes—Wumpus is essentially a natural feature

Exploring a wumpus world (5:11–5:15)
[figures: step-by-step exploration of the wumpus world grid]
Exploring a wumpus world (5:16–5:18)
[figures: continued exploration, inferring pit and wumpus locations from stench and breeze percepts]

Other tight spots (5:19)
Breeze in (1,2) and (2,1) ⇒ no safe actions
  Assuming pits uniformly distributed, (2,2) has pit w/ prob 0.86, vs. 0.31
Smell in (1,1) ⇒ cannot move
  Can use a strategy of coercion: shoot straight ahead
  wumpus was there ⇒ dead ⇒ safe
  wumpus wasn't there ⇒ safe

Logic in general (5:20)
Logics are formal languages for representing information such that conclusions can be drawn
Syntax defines the sentences in the language
Semantics define the "meaning" of sentences; i.e., define truth of a sentence in a world
E.g., the language of arithmetic
  x + 2 ≥ y is a sentence; x2 + y > is not a sentence
  x + 2 ≥ y is true iff the number x + 2 is no less than the number y
  x + 2 ≥ y is true in a world where x = 7, y = 1
  x + 2 ≥ y is false in a world where x = 0, y = 6

Entailment (5:21)
Entailment means that one thing follows from another:
  KB |= α
Knowledge base KB entails sentence α if and only if α is true in all worlds where KB is true
E.g., the KB containing "the Giants won" and "the Reds won" entails "Either the Giants won or the Reds won"
E.g., x + y = 4 entails 4 = x + y
Entailment is a relationship between sentences (i.e., syntax) that is based on semantics

Models (5:22)
Given a logical sentence, when is its truth uniquely defined in a world?
Logicians typically think in terms of models, which are formally structured worlds
  (e.g., full abstract description of a world, configuration of all variables, world state)
with respect to which truth can uniquely be evaluated
We say m is a model of a sentence α if α is true in m
M(α) is the set of all models of α
Then KB |= α if and only if M(KB) ⊆ M(α)
  E.g. KB = Giants won and Reds won, α = Giants won

Entailment in the wumpus world (5:23)
Situation after detecting nothing in [1,1], moving right, breeze in [2,1]
Consider possible models for ?s assuming only pits
3 Boolean choices ⇒ 8 possible models

Wumpus models (5:24–5:27)
[figures: the 8 candidate pit configurations and the models of KB]
KB = wumpus-world rules + observations
α1 = "[1,2] is safe", KB |= α1, proved by model checking
α2 = "[2,2] is safe", KB ⊭ α2

Inference (5:28)
Inference in the general sense means: Given some pieces of information (prior, observed variables, knowledge base) what is the implication (the implied information, the posterior) on other things (non-observed variables, sentence)
KB ⊢i α = sentence α can be derived from KB by procedure i
Consequences of KB are a haystack; α is a needle.
  Entailment = needle in haystack; inference = finding it
Soundness: i is sound if whenever KB ⊢i α, it is also true that KB |= α
Completeness: i is complete if whenever KB |= α, it is also true that KB ⊢i α
Preview: we will define a logic (first-order logic) which is expressive enough to say almost anything of interest, and for which there exists a sound and complete inference procedure. That is, the procedure will answer any question whose answer follows from what is known by the KB.
Propositional logic: Syntax (5:29)
Propositional logic is the simplest logic—illustrates basic ideas
The proposition symbols P1, P2 etc are sentences
If S is a sentence, ¬S is a sentence (negation)
If S1 and S2 are sentences, S1 ∧ S2 is a sentence (conjunction)
If S1 and S2 are sentences, S1 ∨ S2 is a sentence (disjunction)
If S1 and S2 are sentences, S1 ⇒ S2 is a sentence (implication)
If S1 and S2 are sentences, S1 ⇔ S2 is a sentence (biconditional)

Propositional logic: Syntax grammar (5:30)
⟨sentence⟩          → ⟨atomic sentence⟩ | ⟨complex sentence⟩
⟨atomic sentence⟩   → true | false | P | Q | R | . . .
⟨complex sentence⟩  → ¬ ⟨sentence⟩
                    | (⟨sentence⟩ ∧ ⟨sentence⟩)
                    | (⟨sentence⟩ ∨ ⟨sentence⟩)
                    | (⟨sentence⟩ ⇒ ⟨sentence⟩)
                    | (⟨sentence⟩ ⇔ ⟨sentence⟩)
Propositional logic: Semantics
Each model specifies true/false for each proposition symbol
E.g.   P1,2    P2,2    P3,1
       true    true    false
(With these symbols, 8 possible models, can be enumerated automatically.)
Rules for evaluating truth with respect to a model m:
¬S        is true iff   S is false
S1 ∧ S2   is true iff   S1 is true and S2 is true
S1 ∨ S2   is true iff   S1 is true or S2 is true
S1 ⇒ S2   is true iff   S1 is false or S2 is true
          i.e., is false iff S1 is true and S2 is false
S1 ⇔ S2   is true iff   S1 ⇒ S2 is true and S2 ⇒ S1 is true
Simple recursive process evaluates an arbitrary sentence, e.g.,
¬P1,2 ∧ (P2,2 ∨ P3,1) = true ∧ (false ∨ true) = true ∧ true = true
5:31

Truth tables for inference
B1,1  B2,1  P1,1  P1,2  P2,1  P2,2  P3,1  |  R1    R2    R3    R4    R5    |  KB
false false false false false false false |  true  true  true  true  false |  false
false false false false false false true  |  true  true  false true  false |  false
  ..    ..    ..    ..    ..    ..    ..  |   ..    ..    ..    ..    ..   |   ..
false true  false false false false false |  true  true  false true  true  |  false
false true  false false false false true  |  true  true  true  true  true  |  true
false true  false false false true  false |  true  true  true  true  true  |  true
false true  false false false true  true  |  true  true  true  true  true  |  true
false true  false false true  false false |  true  false false true  true  |  false
  ..    ..    ..    ..    ..    ..    ..  |   ..    ..    ..    ..    ..   |   ..
true  true  true  true  true  true  true  |  false true  true  false true  |  false
Enumerate rows (different assignments to symbols),
if KB is true in row, check that α is too
5:35
Truth tables for connectives
P      Q      ¬P     P ∧ Q   P ∨ Q   P ⇒ Q   P ⇔ Q
false  false  true   false   false   true    true
false  true   true   false   true    true    false
true   false  false  false   true    false   false
true   true   false  true    true    true    true
5:32

Inference by enumeration
Depth-first enumeration of all models is sound and complete
function TT-Entails?(KB, α) returns true or false
  inputs: KB, the knowledge base, a sentence in propositional logic
          α, the query, a sentence in propositional logic
  symbols ← a list of the proposition symbols in KB and α
  return TT-Check-All(KB, α, symbols, [ ])

function TT-Check-All(KB, α, symbols, model) returns true or false
  if Empty?(symbols) then
    if PL-True?(KB, model) then return PL-True?(α, model)
    else return true
  else do
    P ← First(symbols); rest ← Rest(symbols)
    return TT-Check-All(KB, α, rest, Extend(P, true, model)) and
           TT-Check-All(KB, α, rest, Extend(P, false, model))

O(2^n) for n symbols
5:36
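The enumeration above translates almost directly into code. Below is a minimal Python sketch of TT-Entails? (not from the slides): sentences are represented as nested tuples such as ('=>', 'P', 'Q'), a representation chosen here only for illustration.

import itertools

def pl_true(s, model):
    # Evaluate a sentence in a model (dict: symbol -> bool),
    # following the evaluation rules of slide 5:31.
    if isinstance(s, str):
        return model[s]
    op, *args = s
    if op == 'not': return not pl_true(args[0], model)
    if op == 'and': return all(pl_true(a, model) for a in args)
    if op == 'or':  return any(pl_true(a, model) for a in args)
    if op == '=>':  return (not pl_true(args[0], model)) or pl_true(args[1], model)
    if op == '<=>': return pl_true(args[0], model) == pl_true(args[1], model)
    raise ValueError(op)

def symbols_of(s):
    return {s} if isinstance(s, str) else set().union(*(symbols_of(a) for a in s[1:]))

def tt_entails(kb, alpha):
    # KB |= alpha iff alpha holds in every model of KB; enumerates O(2^n) models.
    syms = sorted(symbols_of(kb) | symbols_of(alpha))
    for values in itertools.product([True, False], repeat=len(syms)):
        model = dict(zip(syms, values))
        if pl_true(kb, model) and not pl_true(alpha, model):
            return False
    return True

# Wumpus example: KB = ¬P11 ∧ (B11 ⇔ P12 ∨ P21) ∧ ¬B11 entails ¬P12
kb = ('and', ('not', 'P11'), ('<=>', 'B11', ('or', 'P12', 'P21')), ('not', 'B11'))
print(tt_entails(kb, ('not', 'P12')))   # True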
Logical equivalence
Two sentences are logically equivalent iff true in same models:
α ≡ β if and only if α |= β and β |= α
(α ∧ β)        ≡  (β ∧ α)                      commutativity of ∧
(α ∨ β)        ≡  (β ∨ α)                      commutativity of ∨
((α ∧ β) ∧ γ)  ≡  (α ∧ (β ∧ γ))                associativity of ∧
((α ∨ β) ∨ γ)  ≡  (α ∨ (β ∨ γ))                associativity of ∨
¬(¬α)          ≡  α                            double-negation elimination
(α ⇒ β)        ≡  (¬β ⇒ ¬α)                    contraposition
(α ⇒ β)        ≡  (¬α ∨ β)                     implication elimination
(α ⇔ β)        ≡  ((α ⇒ β) ∧ (β ⇒ α))          biconditional elimination
¬(α ∧ β)       ≡  (¬α ∨ ¬β)                    De Morgan
¬(α ∨ β)       ≡  (¬α ∧ ¬β)                    De Morgan
(α ∧ (β ∨ γ))  ≡  ((α ∧ β) ∨ (α ∧ γ))          distributivity of ∧ over ∨
(α ∨ (β ∧ γ))  ≡  ((α ∨ β) ∧ (α ∨ γ))          distributivity of ∨ over ∧
5:37

Forward and backward chaining
Applicable when KB is in Horn Form
Horn Form (restricted)
KB = conjunction of Horn clauses
Horn clause =
– proposition symbol; or
– (conjunction of symbols) ⇒ symbol
E.g., C ∧ (B ⇒ A) ∧ (C ∧ D ⇒ B)
Modus Ponens (for Horn Form): complete for Horn KBs
α1, . . . , αn,   α1 ∧ · · · ∧ αn ⇒ β
--------------------------------------
β
Can be used with forward chaining or backward chaining.
These algorithms are very natural and run in linear time
5:40
Validity and satisfiability
A sentence is valid if it is true in all models,
e.g., true,   A ∨ ¬A,   A ⇒ A,   (A ∧ (A ⇒ B)) ⇒ B
Validity is connected to inference via the Deduction Theorem:
KB |= α if and only if (KB ⇒ α) is valid
A sentence is satisfiable if it is true in some model
e.g., A ∨ B,   C
A sentence is unsatisfiable if it is true in no models
e.g., A ∧ ¬A
Satisfiability is connected to inference via the following:
KB |= α if and only if (KB ∧ ¬α) is unsatisfiable
i.e., prove α by reductio ad absurdum
5:38

Forward chaining
Idea: fire any rule whose premises are satisfied in the KB,
add its conclusion to the KB, until query is found
P ⇒ Q
L ∧ M ⇒ P
B ∧ L ⇒ M
A ∧ P ⇒ L
A ∧ B ⇒ L
A
B
5:41
Proof methods
Proof methods divide into (roughly) two kinds:
Application of inference rules
– Legitimate (sound) generation of new sentences from old
– Proof = a sequence of inference rule applications
  Can use inference rules as operators in a standard search alg.
– Typically require translation of sentences into a normal form
Model checking
truth table enumeration (always exponential in n)
improved backtracking, e.g., Davis–Putnam–Logemann–Loveland (see book)
heuristic search in model space (sound but incomplete)
e.g., min-conflicts-like hill-climbing algorithms
5:39

Forward chaining example
(figures: forward chaining stepped through on the example KB)
5:42–5:49

Backward chaining
Idea: work backwards from the query q:
to prove q by BC,
check if q is known already, or
prove by BC all premises of some rule concluding q
Avoid loops: check if new subgoal is already on the goal stack
Avoid repeated work: check if new subgoal
1) has already been proved true, or
2) has already failed
5:52
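A minimal Python sketch of this backward-chaining idea for propositional Horn clauses, applied to the example KB of slide 5:41; the (premises, conclusion) clause representation is an assumption made here for illustration, and the caching of already proved/failed subgoals mentioned above is omitted for brevity.

def bc_entails(clauses, facts, q, stack=frozenset()):
    # clauses: list of (premises, conclusion); facts: set of known symbols.
    if q in facts:
        return True
    if q in stack:                      # avoid loops: q is already a subgoal
        return False
    for premises, conclusion in clauses:
        if conclusion == q and all(
                bc_entails(clauses, facts, p, stack | {q}) for p in premises):
            return True
    return False

clauses = [(['P'], 'Q'), (['L', 'M'], 'P'), (['B', 'L'], 'M'),
           (['A', 'P'], 'L'), (['A', 'B'], 'L')]
print(bc_entails(clauses, {'A', 'B'}, 'Q'))   # True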
Forward chaining algorithm
function PL-FC-Entails?(KB, q) returns true or false
  inputs: KB, the knowledge base, a set of propositional Horn clauses
          q, the query, a proposition symbol
  local variables: count, a table, indexed by clause, initially the number of premises
                   inferred, a table, indexed by symbol, each entry initially false
                   agenda, a list of symbols, initially the symbols known in KB
  while agenda is not empty do
    p ← Pop(agenda)
    unless inferred[p] do
      inferred[p] ← true
      for each Horn clause c in whose premise p appears do
        decrement count[c]
        if count[c] = 0 then do
          if Head[c] = q then return true
          Push(Head[c], agenda)
  return false
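A compact Python sketch of the same bookkeeping (count, inferred, agenda), again assuming clauses given as (premises, head) pairs:

def pl_fc_entails(clauses, known_symbols, q):
    count = {i: len(prem) for i, (prem, _) in enumerate(clauses)}   # unmet premises
    inferred = {}
    agenda = list(known_symbols)
    while agenda:
        p = agenda.pop()
        if p == q:
            return True
        if not inferred.get(p, False):
            inferred[p] = True
            for i, (prem, head) in enumerate(clauses):
                if p in prem:
                    count[i] -= 1
                    if count[i] == 0:        # all premises satisfied: fire the rule
                        agenda.append(head)
    return False

clauses = [(['P'], 'Q'), (['L', 'M'], 'P'), (['B', 'L'], 'M'),
           (['A', 'P'], 'L'), (['A', 'B'], 'L')]
print(pl_fc_entails(clauses, ['A', 'B'], 'Q'))   # True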
5:50
Proof of completeness
FC derives every atomic sentence that is entailed by KB
1. FC reaches a fixed point where no new atomic sentences are derived
2. Consider the final state as a model m, assigning true/false to symbols
3. Every clause in the original KB is true in m
   Proof: Suppose a clause a1 ∧ . . . ∧ ak ⇒ b is false in m
   Then a1 ∧ . . . ∧ ak is true in m and b is false in m
   Therefore the algorithm has not reached a fixed point!
4. Hence m is a model of KB
5. If KB |= q, q is true in every model of KB, including m
General idea: construct any model of KB by sound inference, check α
5:51

Backward chaining example
(figures: backward chaining stepped through on the same example KB)
5:53–5:62
Forward vs. backward chaining
FC is data-driven, cf. automatic, unconscious processing,
e.g., object recognition, routine decisions
May do lots of work that is irrelevant to the goal
BC is goal-driven, appropriate for problem-solving,
e.g., Where are my keys? How do I get into a PhD program?
Complexity of BC can be much less than linear in size of KB
5:63

Resolution
Conjunctive Normal Form (CNF—universal)
conjunction of disjunctions of literals (clauses)
E.g., (A ∨ ¬B) ∧ (B ∨ ¬C ∨ ¬D)
Resolution inference rule (for CNF): complete for propositional logic
ℓ1 ∨ · · · ∨ ℓk,   m1 ∨ · · · ∨ mn
---------------------------------------------------------------------------
ℓ1 ∨ · · · ∨ ℓi−1 ∨ ℓi+1 ∨ · · · ∨ ℓk ∨ m1 ∨ · · · ∨ mj−1 ∨ mj+1 ∨ · · · ∨ mn
where ℓi and mj are complementary literals. E.g.,
P1,3 ∨ P2,2,   ¬P2,2
---------------------
P1,3
Resolution is sound and complete for propositional logic
5:64

Conversion to CNF
B1,1 ⇔ (P1,2 ∨ P2,1)
1. Eliminate ⇔, replacing α ⇔ β with (α ⇒ β) ∧ (β ⇒ α).
   (B1,1 ⇒ (P1,2 ∨ P2,1)) ∧ ((P1,2 ∨ P2,1) ⇒ B1,1)
2. Eliminate ⇒, replacing α ⇒ β with ¬α ∨ β.
   (¬B1,1 ∨ P1,2 ∨ P2,1) ∧ (¬(P1,2 ∨ P2,1) ∨ B1,1)
3. Move ¬ inwards using de Morgan’s rules and double-negation:
   (¬B1,1 ∨ P1,2 ∨ P2,1) ∧ ((¬P1,2 ∧ ¬P2,1) ∨ B1,1)
4. Apply distributivity law (∨ over ∧) and flatten:
   (¬B1,1 ∨ P1,2 ∨ P2,1) ∧ (¬P1,2 ∨ B1,1) ∧ (¬P2,1 ∨ B1,1)
5:65

Resolution algorithm
Proof by contradiction, i.e., show KB ∧ ¬α unsatisfiable
function PL-Resolution(KB, α) returns true or false
  inputs: KB, the knowledge base, a sentence in propositional logic
          α, the query, a sentence in propositional logic
  clauses ← the set of clauses in the CNF representation of KB ∧ ¬α
  new ← { }
  loop do
    for each Ci, Cj in clauses do
      resolvents ← PL-Resolve(Ci, Cj)
      if resolvents contains the empty clause then return true
      new ← new ∪ resolvents
    if new ⊆ clauses then return false
    clauses ← clauses ∪ new
5:66

Resolution example
KB = (B1,1 ⇔ (P1,2 ∨ P2,1)) ∧ ¬B1,1
α = ¬P1,2
5:67
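A minimal Python sketch of PL-Resolution, with clauses as frozensets of literal strings ('P12', '-P12'); the CNF clauses of the example are written down by hand here rather than produced by the conversion steps above.

from itertools import combinations

def negate(lit):
    return lit[1:] if lit.startswith('-') else '-' + lit

def resolve(ci, cj):
    # All resolvents of two clauses (each a frozenset of literals).
    return [frozenset((ci - {l}) | (cj - {negate(l)}))
            for l in ci if negate(l) in cj]

def pl_resolution(clauses):
    # True iff the empty clause is derivable, i.e. the clause set is unsatisfiable.
    clauses = set(clauses)
    while True:
        new = set()
        for ci, cj in combinations(clauses, 2):
            for r in resolve(ci, cj):
                if not r:
                    return True
                new.add(r)
        if new <= clauses:
            return False
        clauses |= new

# KB = (B11 ⇔ P12 ∨ P21) ∧ ¬B11,  α = ¬P12:  test KB ∧ ¬α for unsatisfiability
kb_and_not_alpha = [frozenset(['-B11', 'P12', 'P21']), frozenset(['-P12', 'B11']),
                    frozenset(['-P21', 'B11']), frozenset(['-B11']),
                    frozenset(['P12'])]                   # last clause is ¬α
print(pl_resolution(kb_and_not_alpha))                    # True, hence KB |= ¬P12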
Summary
Logical agents apply inference to a knowledge base
to derive new information and make decisions
Basic concepts of logic:
– syntax: formal structure of sentences
– semantics: truth of sentences wrt models
– entailment: necessary truth of one sentence given another
– inference: deriving sentences from other sentences
– soundness: derivations produce only entailed sentences
– completeness: derivations can produce all entailed sentences
Wumpus world requires the ability to represent partial and negated information, reason by cases, etc.
Forward, backward chaining are linear-time, complete for Horn clauses
Resolution is complete for propositional logic
Propositional logic lacks expressive power
5:68

Dictionary: logic in general
a logic: a language, elements α are sentences, (grammar example: slide 34)
model m: a world/state description that allows us to evaluate
α(m) ∈ {true, false} uniquely for any sentence α, M(α) = {m : α(m) = true}
entailment α |= β: M(α) ⊆ M(β), “∀m : α(m) ⇒ β(m)”  (Folgerung)
equivalence α ≡ β: iff (α |= β and β |= α)
KB: a set of sentences
inference procedure i can infer α from KB: KB ⊢i α
soundness of i: KB ⊢i α implies KB |= α  (Korrektheit)
completeness of i: KB |= α implies KB ⊢i α
5:69

Dictionary: propositional logic
conjunction: α ∧ β, disjunction: α ∨ β, negation: ¬α
implication: α ⇒ β ≡ ¬α ∨ β, biconditional: α ⇔ β ≡ (α ⇒ β) ∧ (β ⇒ α)
Note: |= and ≡ are statements about sentences in a logic; ⇒ and ⇔ are symbols in the grammar of propositional logic
α valid: true for any model, e.g.: KB |= α iff [(KB ⇒ α) is valid]  (allgemeingültig)
α unsatisfiable: true for no model, e.g.: KB |= α iff [(KB ∧ ¬α) is unsatisfiable]
literal: A or ¬A, clause: disjunction of literals, CNF: conjunction of clauses
Horn clause: symbol | (conjunction of symbols ⇒ symbol), Horn form: conjunction of Horn clauses
Modus Ponens rule: complete for Horn KBs
α1, . . . , αn,   α1 ∧ · · · ∧ αn ⇒ β
--------------------------------------
β
Resolution rule: complete for propositional logic in CNF, let “ℓi = ¬mj”:
ℓ1 ∨ · · · ∨ ℓk,   m1 ∨ · · · ∨ mn
---------------------------------------------------------------------------
ℓ1 ∨ · · · ∨ ℓi−1 ∨ ℓi+1 ∨ · · · ∨ ℓk ∨ m1 ∨ · · · ∨ mj−1 ∨ mj+1 ∨ · · · ∨ mn
5:70
8 First Order Logic
Outline
• Why FOL?
• Syntax and semantics of FOL
• Example sentences
• Wumpus world in FOL
6:1

Syntax of FOL: Basic elements
Constants     KingJohn, 2, UCB, . . .
Predicates    Brother, >, . . .
Functions     Sqrt, LeftLegOf, . . .
Variables     x, y, a, b, . . .
Connectives   ∧ ∨ ¬ ⇒ ⇔
Equality      =
Quantifiers   ∀ ∃
6:5
Pros and cons of propositional logic
Pros:
Propositional logic is declarative: pieces of syntax correspond to facts
Propositional logic allows partial/disjunctive/negated information
(unlike most data structures and databases)
Propositional logic is compositional:
meaning of B1,1 ∧ P1,2 is derived from meaning of B1,1 and of P1,2
Meaning in propositional logic is context-independent
(unlike natural language, where meaning depends on context)
Cons:
Propositional logic has very limited expressive power
(unlike natural language)
E.g., cannot say “pits cause breezes in adjacent squares”
except by writing one sentence for each square
6:2

First Order Logic: Syntax grammar
⟨sentence⟩           →  ⟨atomic sentence⟩
                        | ⟨complex sentence⟩
                        | [∀ | ∃] ⟨variable⟩ ⟨sentence⟩
⟨atomic sentence⟩    →  predicate(⟨term⟩, . . . )
                        | ⟨term⟩ = ⟨term⟩
⟨term⟩               →  function(⟨term⟩, . . . )
                        | constant
                        | variable
⟨complex sentence⟩   →  ¬ ⟨sentence⟩
                        | (⟨sentence⟩ [∧ | ∨ | ⇒ | ⇔] ⟨sentence⟩)
6:6

Universal quantification
∀ ⟨variables⟩ ⟨sentence⟩
Everyone at Berkeley is smart:
∀ x At(x, Berkeley) ⇒ Smart(x)
∀ x P is true in a model m iff P is true with x being
each possible object in the model
6:7
First-order logic
Whereas propositional logic assumes world contains facts,
first-order logic (like natural language) assumes the world contains
• Objects: people, houses, numbers, theories, Ronald McDonald, colors, baseball games, wars, centuries . . .
• Relations: red, round, bogus, prime, multistoried . . .,
  brother of, bigger than, inside, part of, has color, occurred after, owns, comes between, . . .
• Functions: father of, best friend, third inning of, one more than, end of . . .
6:3

Logics in general
Language              Ontological Commitment              Epistemological Commitment
Propositional logic   facts                               true/false/unknown
First-order logic     facts, objects, relations           true/false/unknown
Temporal logic        facts, objects, relations, times    true/false/unknown
Probability theory    facts                               degree of belief
Fuzzy logic           facts + degree of truth             known interval value
6:4

Existential quantification
∃ ⟨variables⟩ ⟨sentence⟩
Someone at Stanford is smart:
∃ x At(x, Stanford) ∧ Smart(x)
∃ x P is true in a model m iff P is true with x being
some possible object in the model
6:8

Properties of quantifiers
∀ x ∀ y is the same as ∀ y ∀ x
∃ x ∃ y is the same as ∃ y ∃ x
∃ x ∀ y is not the same as ∀ y ∃ x
∃ x ∀ y Loves(x, y)
“There is a person who loves everyone in the world”
∀ y ∃ x Loves(x, y)
“Everyone in the world is loved by at least one person”
Quantifier duality: each can be expressed using the other
∀ x Likes(x, IceCream)      ¬∃ x ¬Likes(x, IceCream)
∃ x Likes(x, Broccoli)      ¬∀ x ¬Likes(x, Broccoli)
6:9
Truth in first-order logic
Sentences are true with respect to a model and an interpretation
Model contains ≥ 1 objects (domain elements) and relations among them
Interpretation specifies referents for
constant symbols → objects (domain elements)
predicate symbols → relations
function symbols → functional relations
An atomic sentence predicate(term1, . . . , termn) is true
iff the objects referred to by term1, . . . , termn
are in the relation referred to by predicate
6:10

Models for FOL: Example
(figure)
6:11

Models for FOL: Lots!
Entailment in propositional logic can be computed by enumerating models
We can enumerate the FOL models for a given KB vocabulary:
For each number of domain elements n from 1 to ∞
  For each k-ary predicate Pk in the vocabulary
    For each possible k-ary relation on n objects
      For each constant symbol C in the vocabulary
        For each choice of referent for C from n objects . . .
Computing entailment by enumerating FOL models is not easy!
6:12

Example sentences (6:13–6:17)
Brothers are siblings
∀ x, y Brother(x, y) ⇒ Sibling(x, y).
“Sibling” is symmetric
∀ x, y Sibling(x, y) ⇔ Sibling(y, x).
One’s mother is one’s female parent
∀ x, y Mother(x, y) ⇔ (Female(x) ∧ Parent(x, y)).
A first cousin is a child of a parent’s sibling
∀ x, y FirstCousin(x, y) ⇔ ∃ p, ps Parent(p, x) ∧ Sibling(ps, p) ∧ Parent(ps, y)
6:17

8.1 FOL description of interactive domains
6:18
6:12
Knowledge base for the wumpus world
“Perception”
∀ b, g, t Percept([Smell, b, g], t) ⇒ Smelt(t)
∀ s, b, t Percept([s, b, Glitter], t) ⇒ AtGold(t)
Reflex: ∀ t AtGold(t) ⇒ Action(Grab, t)
Reflex with internal state: do we have the gold already?
∀ t AtGold(t) ∧ ¬Holding(Gold, t) ⇒ Action(Grab, t)
Holding(Gold, t) cannot be observed
⇒ keeping track of change is essential
6:19

Deducing hidden properties
Properties of locations:
∀ x, t At(Agent, x, t) ∧ Smelt(t) ⇒ Smelly(x)
∀ x, t At(Agent, x, t) ∧ Breeze(t) ⇒ Breezy(x)
Squares are breezy near a pit:
∀ y Breezy(y) ⇔ [∃ x Pit(x) ∧ Adjacent(x, y)]
Implies two rules:
Diagnostic rule—infer cause from effect
∀ y Breezy(y) ⇒ ∃ x Pit(x) ∧ Adjacent(x, y)
Causal rule—infer effect from cause
∀ x, y Pit(x) ∧ Adjacent(x, y) ⇒ Breezy(y)
6:20

Keeping track of change: Situation Calculus
Facts hold in situations, rather than eternally
E.g., Holding(Gold, Now) rather than just Holding(Gold)
Situation calculus is one way to represent change in FOL:
Adds a situation argument to each non-eternal predicate
E.g., Now in Holding(Gold, Now) denotes a situation
Situations are connected by the Result function
Result(a, s) is the situation that results from doing a in s
6:21

Describing actions I: Frame problem
“Effect” axiom—describe changes due to action
∀ s AtGold(s) ⇒ Holding(Gold, Result(Grab, s))
“Frame” axiom—describe non-changes due to action
∀ s HaveArrow(s) ⇒ HaveArrow(Result(Grab, s))
Frame problem: find an elegant way to handle non-change
(a) representation—avoid frame axioms
(b) inference—avoid repeated “copy-overs” to keep track of state
6:22

Describing actions II
Successor-state axioms solve the representational frame problem
Each axiom is “about” a predicate (not an action per se):
P true afterwards ⇔ [an action made P true
                     ∨ P true already and no action made P false]
For holding the gold:
∀ a, s Holding(Gold, Result(a, s)) ⇔
   [(a = Grab ∧ AtGold(s))
    ∨ (Holding(Gold, s) ∧ a ≠ Release)]
6:23

Planning Domain Definition Language (PDDL)
The Situation Calculus is very general, but not concise
The AI community developed a simpler format (based on STRIPS)
for the 1998/2000 International Planning Competition (IPC)
6:24

PDDL
The precondition specifies if an action predicate is applicable in a given situation
The effect determines the changed facts
Frame assumption: All facts not mentioned in the effect remain unchanged.
The majority of state-of-the-art AI planners use this format
FFplan: (B. Nebel, Freiburg) a forward chaining heuristic state space planner
For probabilistic versions of PDDL:
T. Lang and M. Toussaint: Planning with noisy probabilistic relational rules. JAIR, 2010.
6:25

Planning as FOL inference
A general approach to planning is to query the KB for a plan that fulfills a goal condition
There is debate and ongoing research on this versus fwd search
6:26
Substitution
Suppose a wumpus-world agent is using an FOL KB
and perceives a smell and a breeze (but no glitter) at s:
Tell(KB, Percept([Smell, Breeze, None], s))
Ask(KB, ∃ a Action(a, s))
I.e., does KB entail any particular actions at s?
Answer: Yes, {a/Shoot}   ← substitution (binding list)
Given a sentence S and a substitution σ,
Sσ denotes the result of plugging σ into S; e.g.,
S = Smarter(x, y)
σ = {x/Hillary, y/Bill}
Sσ = Smarter(Hillary, Bill)
Ask(KB, S) returns some/all σ such that KB |= Sσ
6:27
Making plans as inference over plans
Represent plans as action sequences [a1, a2, . . . , an]
PlanResult(p, s) is the result of executing p in s
Then the query Ask(KB, ∃ p Holding(Gold, PlanResult(p, S0)))
has the solution {p/[Forward, Grab]}
Definition of PlanResult in terms of Result:
∀ s PlanResult([ ], s) = s
∀ a, p, s PlanResult([a|p], s) = PlanResult(p, Result(a, s))
6:28
Summary
First-order logic:
– objects and relations are semantic primitives
– syntax: constants, functions, predicates, equality, quantifiers
Increased expressive power: sufficient to define wumpus world
Situation calculus:
– convention for describing actions and change in FOL
– can formulate planning as inference on a situation calculus
KB
Planning Domain Definition Language (PDDL):
– more common restricted language
– more concise because of the frame assumption
– directly lends to forward chaining methods (like FFplan)
6:29
9 First Order Logic – Inference
Outline
• Reducing first-order inference to propositional inference
• Unification
• Generalized Modus Ponens
• Forward and backward chaining
• Resolution
7:1

A brief history of reasoning
450 B.C.   Stoics          propositional logic, inference (maybe)
322 B.C.   Aristotle       “syllogisms” (inference rules), quantifiers
1565       Cardano         probability theory (propositional logic + uncertainty)
1847       Boole           propositional logic (again)
1879       Frege           first-order logic
1922       Wittgenstein    proof by truth tables
1930       Gödel           ∃ complete algorithm for FOL
1930       Herbrand        complete algorithm for FOL (reduce to propositional)
1931       Gödel           ¬∃ complete algorithm for arithmetic
1960       Davis/Putnam    “practical” algorithm for propositional logic
1965       Robinson        “practical” algorithm for FOL—resolution
7:2

Universal instantiation (UI)
Every instantiation of a universally quantified sentence is entailed by it:
∀ v α
----------------
Subst({v/g}, α)
for any variable v and ground term g
7:3

Existential instantiation (EI)
For any sentence α, variable v, and constant symbol k
that does not appear elsewhere in the knowledge base:
∃ v α
----------------
Subst({v/k}, α)
E.g., ∃ x Crown(x) ∧ OnHead(x, John) yields
Crown(C1) ∧ OnHead(C1, John)
provided C1 is a new constant symbol, called a Skolem constant
Another example: from ∃ x d(x^y)/dy = x^y we obtain
d(e^y)/dy = e^y
provided e is a new constant symbol
7:4

Existential instantiation contd.
UI can be applied several times to add new sentences;
the new KB is logically equivalent to the old
EI can be applied once to replace the existential sentence;
the new KB is not equivalent to the old,
but is satisfiable iff the old KB was satisfiable
7:5

Reduction to propositional inference
Suppose the KB contains just the following:
∀ x King(x) ∧ Greedy(x) ⇒ Evil(x)
King(John)
Greedy(John)
Brother(Richard, John)
Instantiating the universal sentence in all possible ways, we have
King(John) ∧ Greedy(John) ⇒ Evil(John)
King(Richard) ∧ Greedy(Richard) ⇒ Evil(Richard)
King(John)
Greedy(John)
Brother(Richard, John)
The new KB is propositionalized: proposition symbols are King(John), . . .
7:6

Reduction contd.
Idea: propositionalize KB and query, apply resolution, return result
Problem: with function symbols, there are infinitely many ground terms,
e.g., Father(Father(Father(John)))
Theorem: Herbrand (1930). If a sentence α is entailed by an FOL KB,
it is entailed by a finite subset of the propositional KB
Idea: For n = 0 to ∞ do
  create a propositional KB by instantiating with depth-n terms
  see if α is entailed by this KB
Problem: works if α is entailed, loops if α is not entailed
Theorem: Turing (1936), Church (1936), entailment in FOL is semidecidable
7:7

Problems with propositionalization
Propositionalization seems to generate lots of irrelevant sentences.
E.g., from
∀ x King(x) ∧ Greedy(x) ⇒ Evil(x)
King(John)
∀ y Greedy(y)
Brother(Richard, John)
it seems obvious that Evil(John), but propositionalization produces
lots of facts such as Greedy(Richard) that are irrelevant
With p k-ary predicates and n constants, there are p · n^k instantiations
With function symbols, it gets much much worse!
7:8

Unification
We can get the inference immediately if we can find a substitution θ
such that King(x) and Greedy(x) match King(John) and Greedy(y)
θ = {x/John, y/John} works
Unify(α, β) = θ if αθ = βθ

p                  q                       θ
Knows(John, x)     Knows(John, Jane)       {x/Jane}
Knows(John, x)     Knows(y, OJ)            {x/OJ, y/John}
Knows(John, x)     Knows(y, Mother(y))     {y/John, x/Mother(John)}
Knows(John, x)     Knows(x, OJ)            fail

Standardizing apart eliminates overlap of variables, e.g., Knows(z17, . . .)
7:9–7:13

Generalized Modus Ponens (GMP)
p1′, p2′, . . . , pn′,   (p1 ∧ p2 ∧ . . . ∧ pn ⇒ q)
----------------------------------------------------
qθ
where pi′θ = piθ for all i

p1′ is King(John)           p1 is King(x)
p2′ is Greedy(y)            p2 is Greedy(x)
θ is {x/John, y/John}       q is Evil(x)
qθ is Evil(John)

GMP used with KB of definite clauses (exactly one positive literal)
All variables assumed universally quantified
7:14
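A minimal Python sketch of Unify, with a representation chosen here for illustration (not prescribed by the slides): variables are lowercase strings, constants are capitalized strings, and compound terms are tuples such as ('Knows', 'John', 'x').

def is_var(t):
    return isinstance(t, str) and t[:1].islower()

def unify(x, y, theta=None):
    # Returns a substitution dict theta with x·theta = y·theta, or None on failure.
    if theta is None:
        theta = {}
    if x == y:
        return theta
    if is_var(x):
        return unify_var(x, y, theta)
    if is_var(y):
        return unify_var(y, x, theta)
    if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
        for xi, yi in zip(x, y):
            theta = unify(xi, yi, theta)
            if theta is None:
                return None
        return theta
    return None

def unify_var(v, t, theta):
    if v in theta:
        return unify(theta[v], t, theta)
    if is_var(t) and t in theta:
        return unify(v, theta[t], theta)
    if occurs(v, t, theta):
        return None                       # occurs check
    return {**theta, v: t}

def occurs(v, t, theta):
    if t == v:
        return True
    if is_var(t) and t in theta:
        return occurs(v, theta[t], theta)
    return isinstance(t, tuple) and any(occurs(v, ti, theta) for ti in t)

# Rows of the table above:
print(unify(('Knows', 'John', 'x'), ('Knows', 'y', ('Mother', 'y'))))
#   {'y': 'John', 'x': ('Mother', 'y')}  -- applying theta fully gives Mother(John)
print(unify(('Knows', 'John', 'x'), ('Knows', 'x', 'OJ')))     # None, i.e. fail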
Forward chaining algorithm
Full first-order version:
function FOL-FC-Ask(KB, α) returns a substitution or false
  repeat until new is empty
    new ← { }
    for each sentence r in KB do
      (p1 ∧ . . . ∧ pn ⇒ q) ← Standardize-Apart(r)
      for each θ such that (p1 ∧ . . . ∧ pn)θ = (p′1 ∧ . . . ∧ p′n)θ
              for some p′1, . . . , p′n in KB
        q′ ← Subst(θ, q)
        if q′ is not a renaming of a sentence already in KB or new then do
          add q′ to new
          φ ← Unify(q′, α)
          if φ is not fail then return φ
    add new to KB
  return false
7:15

Properties of forward chaining
Sound and complete for first-order definite clauses
(proof similar to propositional proof)
Datalog = first-order definite clauses + no functions (e.g., crime KB)
FC terminates for Datalog in poly iterations: at most p·n^k literals
May not terminate in general if α is not entailed
This is unavoidable: entailment with definite clauses is semidecidable
7:16

Backward chaining algorithm
function FOL-BC-Ask(KB, goals, θ) returns a set of substitutions
  inputs: KB, a knowledge base
          goals, a list of conjuncts forming a query (θ already applied)
          θ, the current substitution, initially the empty substitution { }
  local variables: answers, a set of substitutions, initially empty
  if goals is empty then return {θ}
  q′ ← Subst(θ, First(goals))
  for each sentence r in KB
      where Standardize-Apart(r) = (p1 ∧ . . . ∧ pn ⇒ q)
      and θ′ ← Unify(q, q′) succeeds
    new_goals ← [p1, . . . , pn | Rest(goals)]
    answers ← FOL-BC-Ask(KB, new_goals, Compose(θ′, θ)) ∪ answers
  return answers
7:17

Properties of backward chaining
Depth-first recursive proof search: space is linear in size of proof
Incomplete due to infinite loops
⇒ fix by checking current goal against every goal on stack
Inefficient due to repeated subgoals (both success and failure)
⇒ fix using caching of previous results (extra space!)
Widely used (without improvements!) for logic programming
7:18

Resolution: brief summary
ℓ1 ∨ · · · ∨ ℓk,   m1 ∨ · · · ∨ mn
------------------------------------------------------------------------------
(ℓ1 ∨ · · · ∨ ℓi−1 ∨ ℓi+1 ∨ · · · ∨ ℓk ∨ m1 ∨ · · · ∨ mj−1 ∨ mj+1 ∨ · · · ∨ mn)θ
where Unify(ℓi, ¬mj) = θ.
For example,
¬Rich(x) ∨ Unhappy(x),   Rich(Ken)
-----------------------------------
Unhappy(Ken)
with θ = {x/Ken}
Apply resolution steps to CNF(KB ∧ ¬α); complete for FOL
7:19

Conversion to CNF
Everyone who loves all animals is loved by someone:
∀ x [∀ y Animal(y) ⇒ Loves(x, y)] ⇒ [∃ y Loves(y, x)]
1. Eliminate biconditionals and implications
∀ x [¬∀ y ¬Animal(y) ∨ Loves(x, y)] ∨ [∃ y Loves(y, x)]
2. Move ¬ inwards: ¬∀ x, p ≡ ∃ x ¬p,  ¬∃ x, p ≡ ∀ x ¬p:
∀ x [∃ y ¬(¬Animal(y) ∨ Loves(x, y))] ∨ [∃ y Loves(y, x)]
∀ x [∃ y ¬¬Animal(y) ∧ ¬Loves(x, y)] ∨ [∃ y Loves(y, x)]
∀ x [∃ y Animal(y) ∧ ¬Loves(x, y)] ∨ [∃ y Loves(y, x)]
7:20

Conversion to CNF contd.
3. Standardize variables: each quantifier should use a different one
∀ x [∃ y Animal(y) ∧ ¬Loves(x, y)] ∨ [∃ z Loves(z, x)]
4. Skolemize: a more general form of existential instantiation.
Each existential variable is replaced by a Skolem function
of the enclosing universally quantified variables:
∀ x [Animal(F(x)) ∧ ¬Loves(x, F(x))] ∨ Loves(G(x), x)
5. Drop universal quantifiers:
[Animal(F(x)) ∧ ¬Loves(x, F(x))] ∨ Loves(G(x), x)
6. Distribute ∧ over ∨:
[Animal(F(x)) ∨ Loves(G(x), x)] ∧ [¬Loves(x, F(x)) ∨ Loves(G(x), x)]
7:21
10 Probabilities
• Utilities, decision theory, entropy, KLD
8:3
Probability: Frequentist and Bayesian
Objective Probability
• Frequentist probabilities are defined in the limit of an infinite
number of trials
Example: “The probability of a particular coin landing heads up is 0.43”
The double slit experiment:
x
• Bayesian (subjective) probabilities quantify degrees of belief
Example: “The probability of it raining tomorrow is 0.3”
– Not possible to repeat “tomorrow”
θ
8:4
P
10.1 Basic definitions
8:5
Probabilities & Sets
• Sample Space/domain O, e.g. O = {1, 2, 3, 4, 5, 6}
• Probability P : A ⊂ O 7→ [0, 1]
e.g., P ({1}) = 16 , P ({4}) = 16 , P ({2, 5}) = 31 ,
8:1
• Axioms: ∀A, B ⊆ O
Probability Theory
– Nonnegativity P (A) ≥ 0
– Additivity P (A ∪ B) = P (A) + P (B) if A ∩ B = { }
• Why do we need probabilities?
– Normalization P (O) = 1
– Obvious: to express inherent (objective) stochasticity of the
world
• Implications
0 ≤ P (A) ≤ 1
• But beyond this: (also in a “deterministic world”):
– lack of knowledge!
P ({ }) = 0
A ⊆ B ⇒ P (A) ≤ P (B)
– hidden (latent) variables
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
– expressing uncertainty
P (O \ A) = 1 − P (A)
– expressing information (and lack of information)
– Subjective Probability
8:6
Probabilities & Random Variables
• Probability Theory: an information calculus
8:2
• For a random variable X with discrete domain dom(X) = O we
write:
Outline
• Basic definitions
– Random variables
– joint, conditional, marginal distribution
– Bayes’ theorem
• Probability distributions:
– Binomial & Beta
– Multinomial & Dirichlet
– Conjugate priors
– Gauss
– Dirac & Particles
∀x∈O : 0 ≤ P (X = x) ≤ 1
P
P (X = x) = 1
x∈O
Example: A dice can take values O = {1, .., 6}.
X is the random variable of a dice throw.
P (X = 1) ∈ [0, 1] is the probability that X takes value 1.
• A bit more formally: a random variable is a map from a measurable
space to a domain (sample space) and thereby introduces a probability
measure on the domain (“assigns a probability to each possible value”)
8:7
Probability Distributions
likelihood · prior
normalization
posterior =
• P (X = 1) ∈ R denotes a specific probability
8:11
P (X) denotes the probability distribution (function over O)
Multiple RVs:
Example: A dice can take values O = {1, 2, 3, 4, 5, 6}.
By P (X) we describe the full distribution over possible values {1, .., 6}.
These are 6 numbers that sum to one, usually stored in a table, e.g.:
[ 16 , 16 , 61 , 16 , 16 , 16 ]
• Analogously for n random variables X1:n (stored as a rank n
tensor)
Joint:
• In implementations we typically represent distributions over discrete random variables as tables (arrays) of numbers
P (X1:n )
Marginal:
P (X1:n ),
X2:n
P (X1 |X2:n ) =
Conditional:
• Notation for summing over a RV:
P
P (X1 ) =
P (X1:n )
P (X2:n )
• X is conditionally independent of Y given Z iff:
In equation we often needP
to sum over RVs. We then write
X P (X) · · ·
P
as shorthand for the explicit notation x∈dom(X) P (X = x) · · ·
P (X|Y, Z) = P (X|Z)
8:8
Joint distributions
• Product rule and Bayes’ Theorem:
P (X1:n ) =
Qn
i=1
P (X, Z, Y )
P (X|Y, Z) P (Y |Z) P (Z)
P (Xi |Xi+1:n )
P (X1 |X2:n ) =
P (X2 |X1 ,X3:n ) P (X1 |X3:n )
P (X2 |X3:n )
Assume we have two random variables X and Y
=
P (X|Y, Z) =
P (Y |X,Z) P (X|Z)
P (Y |Z)
P (X, Y |Z) =
P (X,Z|Y ) P (Y )
P (Z)
8:12
• Definitions:
Joint:
P (X, Y )
Marginal:
10.2
P (X) =
Conditional:
P
Y
Probability distributions
P (X, Y )
P (X|Y ) =
P (X,Y )
P (Y )
The conditional is normalized: ∀Y :
P
X
Bishop, C. M.: Pattern Recognition and Machine Learning.
Springer, 2006
http://research.
microsoft.com/en-us/um/
people/cmbishop/prml/
P (X|Y ) = 1
• X is independent of Y iff: P (X|Y ) = P (X)
(table thinking: all columns of P (X|Y ) are equal)
8:9
Joint distributions
8:13
Bernoulli & Binomial
joint: P (X, Y )
P
marginal: P (X) = Y P (X, Y )
conditional:
P (X|Y ) =
• We have a binary random variable x ∈ {0, 1}
P (X,Y )
P (Y )
{0, 1})
The Bernoulli distribution is parameterized by a single scalar µ,
• Implications of these definitions:
Product rule:
(i.e. dom(x) =
P (x = 1 | µ) = µ ,
P (X, Y ) = P (X|Y ) P (Y ) = P (Y |X) P (X)
P (x = 0 | µ) = 1 − µ
Bern(x | µ) = µx (1 − µ)1−x
Bayes’ Theorem:
P (X|Y ) =
P (Y |X) P (X)
P (Y )
• We have a data set of random variables D = {x1 , .., xn }, each
xi ∈ {0, 1}. If each xi ∼ Bern(xi | µ) we have
8:10
P (D | µ) =
Bayes’ Theorem
Qn
i=1
argmax log P (D | µ) = argmax
µ
P (X|Y ) =
Bern(xi | µ) =
P (Y |X) P (X)
P (Y )
µ
n
X
Qn
i=1
µxi (1 − µ)1−xi
xi log µ + (1 − xi ) log(1 − µ) =
i=1
• The Binomial distribution is the distribution over the count m =
Pn
i=1 xi


n  µm (1 − µ)n−m ,
Bin(m | n, µ) = 


m



n!
n =

(n − m)! m!
m

1X
ni
8:14
Beta
Multinomial
• We have an integer random variable x ∈ {1, .., K}
The probability of a single x can be parameterized by µ = (µ1 , .., µK )
How to express uncertainty over a Bernoulli parameter µ
• The Beta distribution is over the interval [0, 1], typically the pa-
P (x = k | µ) = µk
rameter µ of a Bernoulli:
with the constraint
1
Beta(µ | a, b) =
µa−1 (1 − µ)b−1
B(a, b)
with mean hµi =
a
a+b
and mode µ∗ =
a−1
a+b−2
PK
k=1
µk = 1 (probabilities need to be nor-
malized)
• We have a data set of random variables D = {x1 , .., xn }, each
for a, b > 1
xi ∈ {1, .., K}. If each xi ∼ P (xi | µ) we have
• The crucial point is:
– Assume we are in a world with a “Bernoulli source” (e.g.,
binary bandit), but don’t know its parameter µ
– Assume we have a prior distribution P (µ) = Beta(µ | a, b)
P (D | µ) =
where mk =
i=1
Pn
µx i =
i=1 [xi
Qn
i=1
[x =k]
QK
k=1
µk i
=
QK
k=1
m
µk k
= k] is the count of [xi = k]. The ML
estimator is
– Assume we collected some
P data D = {x1 , .., xn },
Pxi ∈
{0, 1}, with counts aD = i xi of [xi = 1] and bD = i (1 −
xi ) of [xi = 0]
– The posterior is
Qn
argmax log P (D | µ) =
µ
1
(m1 , .., mK )
n
• The Multinomial distribution is this distribution over the counts
P (D | µ)
P (µ) ∝ Bin(D | µ) Beta(µ | a, b)
P (µ | D) =
P (D)
mk
∝ µaD (1 − µ)bD µa−1 (1 − µ)b−1 = µa−1+aD (1 − µ)b−1+bD
Mult(m1 , .., mK | n, µ) ∝
QK
k=1
m
µk k
= Beta(µ | a + aD , b + bD )
8:18
8:15
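A tiny numeric sketch of this conjugate update (plain Python; the data below is made up): add the counts aD and bD of ones and zeros to the prior parameters.

def beta_update(a, b, data):
    # Conjugate update of a Beta(a, b) prior with binary data x_i in {0, 1}.
    aD = sum(data)               # count of x_i = 1
    bD = len(data) - aD          # count of x_i = 0
    return a + aD, b + bD

a, b = 1.0, 1.0                  # uniform prior over mu
a, b = beta_update(a, b, [1, 0, 1, 1, 0, 1])
print(a, b)                      # 5.0 3.0
print(a / (a + b))               # posterior mean <mu> = 0.625
print((a - 1) / (a + b - 2))     # posterior mode (for a, b > 1) ≈ 0.667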
Dirichlet
Beta
How to express uncertainty over a Multinomial parameter µ
The prior is Beta(µ | a, b), the posterior is Beta(µ | a + aD , b + bD )
• Conclusions:
– The semantics of a and b are counts of [xi = 1] and [xi = 0],
respectively
– The Beta distribution is conjugate to the Bernoulli (explained later)
– With the Beta distribution we can represent beliefs (state
of knowledge) about uncertain µ ∈ [0, 1] and know how to
update this belief given data
8:16
• The Dirichlet distribution is over the K-simplex, that is, over
P
µ1 , .., µK ∈ [0, 1] subject to the constraint K
k=1 µk = 1:
Dir(µ | α) ∝
QK
k=1
α −1
µk k
It is parameterized by α = (α1 , .., αK ), has mean hµi i =
and mode µ∗i =
Pαi −1
j αj −K
Pαi
j αj
for ai > 1.
• The crucial point is:
– Assume we are in a world with a “Multinomial source” (e.g.,
an integer bandit), but don’t know its parameter µ
– Assume we have a prior distribution P (µ) = Dir(µ | α)
– Assume we collected someP
data D = {x1 , .., xn }, xi ∈
{1, .., K}, with counts mk = i [xi = k]
Beta
– The posterior is
P (D | µ)
P (µ) ∝ Mult(D | µ) Dir(µ | a, b)
P (D)
QK
Q
m QK
αk −1
αk −1+mk
∝ k=1 µk k
= K
k=1 µk
k=1 µk
P (µ | D) =
= Dir(µ | α + m)
8:19
Dirichlet
from Bishop
The prior is Dir(µ | α), the posterior is Dir(µ | α + m)
8:17
• Conclusions:
Conjugate priors
– The semantics of α is the counts of [xi = k]
– The Dirichlet distribution is conjugate to the Multinomial
– With the Dirichlet distribution we can represent beliefs (state
of knowledge) about uncertain µ of an integer random variable and know how to update this belief given data
likelihood
Binomial Bin(D | µ)
Multinomial
Mult(D | µ)
Gauss N(x | µ, Σ)
1D
Gauss
N(x | µ, λ-1 )
nD
Gauss
N(x | µ, Λ-1 )
nD
Gauss
N(x | µ, Λ-1 )
8:20
Dirichlet
Illustrations for α = (0.1, 0.1, 0.1), α = (1, 1, 1) and α = (10, 10, 10):
conjugate
Beta Beta(µ | a, b)
Dirichlet Dir(µ | α)
Gauss N(µ | µ0 , A)
Gamma Gam(λ | a, b)
Wishart Wish(Λ | W, ν)
Gauss-Wishart
N(µ | µ0 , (βΛ)-1 ) Wish(Λ | W, ν)
8:24
from Bishop
Distributions over continuous domain
8:21
Motivation for Beta & Dirichlet distributions
• Bandits:
– If we have binary [integer] bandits, the Beta [Dirichlet] distribution is a way to represent and update beliefs
8:25
Distributions over continuous domain
• Let x be a continuous RV. The probability density function
(pdf) p(x) ∈ [0, ∞) defines the probability
– The belief space becomes discrete: The parameter α of
the prior is continuous, but the posterior updates live on a
discrete “grid” (adding counts to α)
b
Z
P (a ≤ x ≤ b) =
p(x) dx ∈ [0, 1]
a
– We can in principle do belief planning using this
• Reinforcement Learning:
– Assume we know that the world is a finite-state MDP, but
do not know its transition probability P (s0 | s, a). For each
(s, a), P (s0 | s, a) is a distribution over the integer s0
– Having a separate Dirichlet distribution for each (s, a) is a
way to represent our belief about the world, that is, our belief about P (s0 | s, a)
– We can in principle do belief planning using this → Bayesian
Reinforcement Learning
• Dirichlet distributions are also used to model texts (word distributions in text), images, or mixture distributions in general
The (cumulative) probability distribution F (y) = P (x ≤ y) =
Ry
dx p(x) ∈ [0, 1] is the cumulative integral with limy→∞ F (y) =
−∞
1
(In discrete domain: probability distribution and probability mass function P (x) ∈ [0, 1] are used synonymously.)
• Two basic examples:
Gaussian: N(x | µ, Σ) =
>
1
e− 2 (x−µ)
1
| 2πΣ | 1/2
Σ-1 (x−µ)
Dirac or d (“point particle”) d(x) = 0 except at x = 0,
8:22
Conjugate priors
R
d(x) dx =
1
d(x) =
∂
H(x)
∂x
where H(x) = [x ≥ 0] is the Heaviside step function
8:26
• Assume you have data D = {x1 , .., xn } with likelihood
Gaussian distribution
P (D | θ)
N (x|µ, σ 2 )
that depends on an uncertain parameter θ
2σ
• 1-dim: N(x | µ, σ 2 ) =
Assume you have a prior P (θ)
1
| 2πσ 2 | 1/2
1
2
e− 2 (x−µ)
/σ 2
µ
• n-dim Gaussian in normal form:
• The prior P (θ) is conjugate to the likelihood P (D | θ) iff the posterior
P (θ | D) ∝ P (D | θ) P (θ)
N(x | µ, Σ) =
1
1
exp{− (x − µ)> Σ-1 (x − µ)}
2
| 2πΣ | 1/2
with mean µ and covariance matrix Σ. In canonical form:
is in the same distribution class as the prior P (θ)
N[x | a, A] =
exp{− 12 a>A-1 a}
1
exp{− x> A x + x>a}
2
| 2πA-1 | 1/2
(1)
• Having a conjugate prior is very convenient, because then you
with precision matrix A = Σ-1 and coefficient a = Σ-1 µ (and
know how to update the belief given data
8:23
mean µ = A-1 a).
• Gaussian identities: see http://ipvs.informatik.uni-stuttgart.
Motivation for particle distributions
de/mlr/marc/notes/gaussians.pdf
8:27
• Numeric representation of “difficult” distributions
– Very general and versatile
– But often needs many samples
Motivation for Gaussian distributions
• Distributions over games (action sequences), sample based planning, MCTS
• Gaussian Bandits
• State estimation, particle filters
• Control theory, Stochastic Optimal Control
• State estimation, sensor processing, Gaussian filtering (Kalman
• etc
8:31
filtering)
• Machine Learning
Utilities & Decision Theory
• etc
8:28
• Given a space of events O (e.g., outcomes of a trial, a game,
etc) the utility is a function
Particle Approximation of a Distribution
U : O→R
• We approximate a distribution p(x) over a continuous domain
Rn
• The utility represents preferences as a single scalar – which is
not always obvious (cf. multi-objective optimization)
• A particle distribution q(x) is a weighed set S = {(x , w
i
i
)}N
i=1
of
N particles
– each particle has a “location” xi ∈ Rn and a weight wi ∈ R
P
– weights are normalized, i wi = 1
q(x) :=
N
X
wi d(x − xi )
• Decision Theory making decisions (that determine p(x)) that
maximize expected utility
Z
E{U }p =
U (x) p(x)
x
• Concave utility functions imply risk aversion (and convex, risk-
i=1
taking)
where d(x − xi ) is the d-distribution.
8:32
• Given weighted particles, we can estimate for any (smooth) f :
Z
hf (x)ip =
f (x)p(x)dx ≈
PN
i=1
wi f (xi )
Entropy
• The neg-log (− log p(x)) of a distribution reflects something like
x
See An Introduction to MCMC for Machine Learning www.cs.
ubc.ca/˜nando/papers/mlintro.pdf
“error”:
– neg-log of a Gaussian ↔ squared error
– neg-log likelihood ↔ prediction error
8:29
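A minimal sketch of the particle estimate above (numpy assumed; the toy samples are made up): with normalized weights, any expectation becomes a weighted sum over the particles.

import numpy as np

def particle_expectation(xs, ws, f):
    # <f(x)>_p  ≈  sum_i w_i f(x_i), with the weights normalized to sum to 1
    ws = np.asarray(ws, dtype=float)
    ws = ws / ws.sum()
    return float(np.sum(ws * np.array([f(x) for x in xs])))

rng = np.random.default_rng(0)
xs = rng.normal(loc=1.0, scale=2.0, size=10000)   # i.i.d. samples from p(x)
ws = np.ones(len(xs))                             # equal weights
print(particle_expectation(xs, ws, lambda x: x))       # ≈ 1.0  (mean)
print(particle_expectation(xs, ws, lambda x: x ** 2))  # ≈ 5.0  (second moment)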
• The (− log p(x)) is the “optimal” coding length you should assign
Particle Approximation of a Distribution
to a symbol x. This will minimize the expected length of an
encoding
Histogram of a particle representation:
Z
H(p) =
p(x)[− log p(x)]
x
• The entropy H(p) = Ep(x) {− log p(x)} of a distribution p is a
measure of uncertainty, or lack-of-information, we have about x
8:33
Kullback-Leibler divergence*
• Assume you use a “wrong” distribution q(x) to decide on the
coding length of symbols drawn from p(x). The expected length
of a encoding is
Z
8:30
p(x)[− log q(x)] ≥ H(p)
x
Introduction to Artificial Intelligence, Marc Toussaint—April 7, 2015
• The difference
D p q =
Z
p(x) log
x
p(x)
≥0
q(x)
is called Kullback-Leibler divergence
Proof of inequality, using the Jensen inequality:
Z
−
p(x) log
x
q(x)
≥ − log
p(x)
Z
p(x)
x
q(x)
=0
p(x)
8:34
Some more continuous distributions*
Gaussian
Dirac or d
Student’s t
(=Gaussian for ν → ∞, otherwise
heavy tails)
Exponential
N(x | a, A)
> -1
1
1
e− 2 (x−a) A
| 2πA | 1/2
=
(x−a)
∂
d(x) = ∂x
H(x)
ν+1
2
p(x; ν) ∝ [1 + xν ]− 2
p(x; λ) = [x ≥ 0] λe−λx
(distribution over single event time)
Laplace
(“double exponential”)
Chi-squared
Gamma
p(x; µ, b) =
1 − | x−µ | /b
e
2b
p(x; k) ∝ [x ≥ 0] xk/2−1 e−x/2
p(x; k, θ) ∝ [x ≥ 0] xk−1 e−x/θ
8:35
11 Bandits & UCT
Bandits: Formal Problem Definition
• Let at ∈ {1, .., n} be the choice of machine at time t
Let yt ∈ R be the outcome
Multi-armed Bandits
• A policy or strategy maps all the history to a new choice:
π : [(a1 , y1 ), (a2 , y2 ), ..., (at-1 , yt-1 )] 7→ at
• Problem: Find a policy π that
maxh
PT
t=1
yt i
or
maxhyT i
• There are n machines
• Each machine i returns a reward y ∼ P (y; θi )
The machine’s parameter θi is unknown
or other objectives like discounted infinite horizon maxh
P∞
t=1
γ t yt i
• Your goal is to maximize the reward, say, collected over the first
T trials
9:1
9:5
Exploration, Exploitation
• “Two effects” of choosing a machine:
– You collect more data about the machine → knowledge
Bandits – applications
– You collect reward
• Online advertisement
• For example
– Exploration: Choose the next action at to minhH(bt )i
– Exploitation: Choose the next action at to maxhyt i
9:6
• Clinical trials, robotic scientist
Digression: Active Learning
• “Active Learning”
• Efficient optimization
9:2
“Experimental Design”
“Exploration in Reinforcement Learning”
Bandits
All of these are strongly related to trying to minimize (also) H(bt )
• The bandit problem is an archetype for
– Sequential decision making
– Decisions that influence knowledge as well as rewards/states
Gaussian Processes:
– Exploration/exploitation
• The same aspects are inherent also in global optimization, active learning & RL
• The Bandit problem formulation is the basis of UCB – which is
(from Rasmussen & Williams)
the core of several planning and decision making methods
9:7
• Bandit problems are commercially very relevant
9:3
Upper Confidence Bounds (UCB)
9:4

Upper Confidence Bound (UCB)
1: Initialization: Play each machine once
2: repeat
3:   Play the machine i that maximizes ŷi + β √(2 ln n / ni)
4: until
UCB for Gauss
yˆi is the average reward of machine i so far
• If we have a single Gaussian bandits, we can compute
P
the mean estimator µ
ˆ = n1 i yi
P
2
1
the empirical variance σ
ˆ 2 = n−1
i (yi − µ)
ni is how often machine i has been played so far
P
n = i ni is the number of rounds so far
β is often chosen as β = 1
and the variance of the mean estimator Var{µ} = sˆ2 /n
The bound is derived from the Hoeffding inequality
• µ
ˆ and Var{µ} describe our posterior Gaussian belief over the
See Finite-time analysis of the multiarmed bandit problem, Auer, Cesa-Bianchi &
Fischer, Machine learning, 2002.
true underlying µ
Using the err-function we can get exact quantiles
9:8
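A minimal Python sketch of the UCB rule above for Bernoulli machines (numpy assumed; the machine parameters are made up for illustration).

import numpy as np

def ucb1(bandit_probs, T, beta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    k = len(bandit_probs)
    counts = np.zeros(k)                 # n_i: how often machine i was played
    sums = np.zeros(k)                   # summed rewards of machine i
    for t in range(T):
        if t < k:
            i = t                        # initialization: play each machine once
        else:
            ucb = sums / counts + beta * np.sqrt(2.0 * np.log(t) / counts)
            i = int(np.argmax(ucb))
        y = float(rng.random() < bandit_probs[i])   # Bernoulli reward
        counts[i] += 1
        sums[i] += y
    return sums.sum(), counts

reward, counts = ucb1([0.2, 0.5, 0.8], T=5000)
print(reward, counts)                    # most plays should go to the 0.8 machine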
• Alternative strategies:
90%-quantile(µi)
µ̂i + β √Var{µi} = µ̂i + β σ̂/√n
9:11

UCB algorithms
• UCB algorithms determine a confidence interval such that
ŷi − σi < ⟨yi⟩ < ŷi + σi
with high probability.
UCB chooses the upper bound of this confidence interval
• Optimism in the face of uncertainty

UCB - Discussion
• UCB over-estimates the reward-to-go (under-estimates cost-to-go),
just like A∗ – but does so in the probabilistic setting of bandits
• Strong bounds on the regret (sub-optimality) of UCB (e.g. Auer
et al.)
• The fact that regret bounds exist is great!
9:9
• UCB became a core method for algorithms (including planners)
UCB for Bernoulli
to decide what to explore:
• If we have a single Bernoulli bandits, we can count
In tree search, the decision of which branches/actions to explore
is itself a decision problem. An “intelligent agent” (like UCB) can
a = 1 + #wins ,
b = 1 + #losses
be used within the planner to make decisions about how to grow
the tree.
• Our posterior over the Bernoulli parameter µ is Beta(µ | a, b)
• The mean is hµi =
a
a+b
The mode (most likely) is µ∗ =
The variance is Var{µ} =
a−1
a+b−2
9:12
Monte Carlo Tree Search
for a, b > 1
ab
(a+b+1)(a+b)2
9:13
One can numerically compute the inverse cumulative Beta distribution → get exact quantiles
Monte Carlo Tree Search (MCTS)
• MCTS is very successful on Computer Go and other games
• Alternative strategies:
• MCTS is rather simple to implement
• MCTS is very general: applicable on any discrete domain
argmax 90%-quantile(µi )
i
• Key paper:
argmaxhµi i + β
´ Bandit based Monte-Carlo Planning, ECML
Kocsis & Szepesvari:
p
Var{µi }
2006.
i
9:10
• Survey paper:
Browne et al.: A Survey of Monte Carlo Tree Search Methods,
2012.
• Tutorial presentation:
9:17
http://web.engr.oregonstate.edu/˜afern/
icaps10-MCP-tutorial.ppt
9:14
Monte Carlo methods
Flat Monte Carlo
• The goal of MCTS is to estimate the utility (e.g., expected payoff
• General, the term Monte Carlo simulation refers to methods that
D) depending on the first action a chosen:
generate many i.i.d. random samples xi ∼ P (x) from a distribu-
Q(s0 , a) = E{D|s0 , a}
tion P (x). Using the samples one can estimate expectations of
anything that depends on x, e.g. f (x):
Z
hf i =
P (x) f (x) dx ≈
x
where expectation is taken with w.r.t. the whole future randomized actions (including a potential opponent)
N
1 X
f (xi )
N i=1
• Flat Monte Carlo does so by rolling out many random simula(In this view, Monte Carlo approximates an integral.)
tions (using a R OLLOUT P OLICY) without growing a tree
The key difference/advantage of MCTS over flat MC is that the
• Example: What is the probability that a solitaire would come out
tree growth focusses computational effort on promising actions
successful? (Original story by Stan Ulam.) Instead of trying to
9:18
analytically compute this, generate many random solitairs and
Upper Confidence Tree (UCT)
count.
• UCT uses UCB to realize the T REE P OLICY, i.e. to decide where
• The method developed in the 1940s, when computers became
to expand the tree
faster. Fermi, Ulam and von Neumann initiated the idea. von
Neumann called it “Monte Carlo” as a code name.
9:15
• B ACKUP updates all parents of vl as
n(v) ← n(v) + 1 (count how often has it been played)
Generic MCTS scheme
Q(v) ← Q(v) + D (sum of rewards received)
• T REE P OLICY chooses child nodes based on UCB:
s
2 ln n(v)
Q(v 0 )
argmax
+β
0)
0
n(v
n(v 0 )
v ∈∂(v)
or choose v 0 if n(v 0 ) = 0
9:19
from Browne et al.
1:
2:
3:
4:
5:
6:
7:
8:
start tree V = {v0 }
while within computational budget do
vl ← T REE P OLICY(V ) chooses a leaf of V
append vl to V
D ← R OLLOUT P OLICY(V ) rolls out a full simulation,
with return D
B ACKUP(vl , D) updates the values of all parents of vl
end while
return best child of v0
9:16
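A condensed Python sketch of this scheme for a deterministic single-player domain; the interface actions(s), step(s, a) -> (next_state, reward) and is_terminal(s) is an assumption made for illustration. The tree policy is UCB as in UCT, the rollout policy is uniformly random.

import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}               # action -> child Node
        self.n, self.Q = 0, 0.0          # visit count, summed return

def uct(root_state, actions, step, is_terminal, budget=1000, beta=1.0):
    root = Node(root_state)
    for _ in range(budget):
        node, ret = root, 0.0
        # 1) tree policy: descend with UCB while all actions are expanded
        while not is_terminal(node.state) and all(a in node.children for a in actions(node.state)):
            a = max(actions(node.state),
                    key=lambda a: node.children[a].Q / node.children[a].n
                    + beta * math.sqrt(2 * math.log(node.n) / node.children[a].n))
            _, r = step(node.state, a)
            ret += r
            node = node.children[a]
        # 2) expand one untried action
        if not is_terminal(node.state):
            a = random.choice([a for a in actions(node.state) if a not in node.children])
            s2, r = step(node.state, a)
            ret += r
            node.children[a] = Node(s2, parent=node)
            node = node.children[a]
        # 3) rollout policy: random actions to a terminal state, giving the return D
        s = node.state
        while not is_terminal(s):
            s, r = step(s, random.choice(actions(s)))
            ret += r
        # 4) backup: update n(v) and Q(v) for the new leaf and all its parents
        while node is not None:
            node.n += 1
            node.Q += ret
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].n)   # best child of root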
Generic MCTS scheme
• In comparison to other planners it always computes full roll-outs
to a terminal state. No heuristics to estimate the utility of a state
are needed.
• The tree grows unbalanced
• The T REE P OLICY decides where the tree is expanded – and
needs to trade off exploration vs. exploitation
• The R OLLOUT P OLICY is necessary to simulate a roll out. It could
be a random policy; at least a randomized policy.
12 Game Playing
Properties of minimax
Complete?? Yes, if tree is finite (chess has specific rules for
this)
Outline
Optimal?? Yes, against an optimal opponent. Otherwise??
• Minimax
Time complexity?? O(bm )
• α–β pruning
Space complexity?? O(bm) (depth-first exploration)
• UCT for games
For chess, b ≈ 35, m ≈ 100 for “reasonable” games
10:1
⇒ exact solution completely infeasible
But do we need to explore every path?
Game tree (2-player, deterministic, turns)
10:5
α–β pruning example
10:2
Minimax
Perfect play for deterministic, perfect-information games
Idea: choose move to position with highest minimax value
= best achievable payoff against best play
E.g., 2-ply game:
10:3
Minimax algorithm
function M INIMAX -D ECISION(state) returns an action
inputs: state, current state in game
return the a in ACTIONS(state) maximizing M IN -VALUE(R ESULT(a,
state))
function M AX -VALUE(state) returns a utility value
if T ERMINAL -T EST(state) then return U TILITY(state)
v ← −∞
for a, s in S UCCESSORS(state) do v ← M AX(v, M IN -VALUE(s))
return v
function M IN -VALUE(state) returns a utility value
if T ERMINAL -T EST(state) then return U TILITY(state)
v←∞
for a, s in S UCCESSORS(state) do v ← M IN(v, M AX -VALUE(s))
return v
10:4
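A minimal Python sketch of the same recursion for a game given by successors(state) (pairs of action and next state), utility(state) and is_terminal(state) — an assumed interface; the two commented cutoff lines already implement the α–β pruning discussed below.

def minimax(state, successors, utility, is_terminal, maximizing=True,
            alpha=float('-inf'), beta=float('inf')):
    # alpha/beta: best values found so far for MAX / MIN along the current path
    if is_terminal(state):
        return utility(state)
    if maximizing:
        v = float('-inf')
        for a, s in successors(state):
            v = max(v, minimax(s, successors, utility, is_terminal, False, alpha, beta))
            if v >= beta:
                return v                 # prune: MIN will avoid this branch anyway
            alpha = max(alpha, v)
        return v
    else:
        v = float('inf')
        for a, s in successors(state):
            v = min(v, minimax(s, successors, utility, is_terminal, True, alpha, beta))
            if v <= alpha:
                return v                 # prune: MAX will avoid this branch anyway
            beta = min(beta, v)
        return v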
10:6
Suppose we have 100 seconds, explore 104 nodes/second
⇒ 106 nodes per move ≈ 358/2
Why is it called α–β?
⇒ α–β reaches depth 8 ⇒ pretty good chess program
10:10
Evaluation functions
α is the best value (to
MAX )
If V is worse than α,
MAX
Define β similarly for
MIN
found so far off the current path
will avoid it ⇒ prune that branch
For chess, typically linear weighted sum of features
10:7
E VAL(s) = w1 f1 (s) + w2 f2 (s) + . . . + wn fn (s)
The α–β algorithm
e.g., w1 = 9 with
function A LPHA -B ETA -D ECISION(state) returns an action
return the a in ACTIONS(state) maximizing M IN -VALUE(R ESULT(a,
state))
function M AX -VALUE(state, α, β) returns a utility value
inputs: state, current state in game
α, the value of the best alternative for MAX along the path to
state
β, the value of the best alternative for MIN along the path to
state
if T ERMINAL -T EST(state) then return U TILITY(state)
v ← −∞
for a, s in S UCCESSORS(state) do
v ← M AX(v, M IN -VALUE(s, α, β))
if v ≥ β then return v
α ← M AX(α, v)
return v
f1 (s) = (number of white queens) – (number of black queens),
etc.
10:11
Upper Confidence Tree (UCT) for games
Standard backup updates all parents of vl as
n(v) ← n(v) + 1 (count how often has it been played)
Q(v) ← Q(v) + ∆ (sum of rewards received)
In games use a “negamax” backup: While iterating upward, flip
sign ∆ ← −∆ in each iteration
function M IN -VALUE(state, α, β) returns a utility value
same as M AX -VALUE but with roles of α, β reversed
Survey of MCTS applications:
10:8
Browne et al.: A Survey of Monte Carlo Tree Search Methods, 2012.
10:12
Properties of α–β
Brief notes on game theory
Pruning does not affect final result
• (Small) zero-sum games can be represented by a payoff matrix
Good move ordering improves effectiveness of pruning
• Uji denotes the utility of player 1 if she chooses the pure (=de-
A simple example of the value of reasoning about which computations are relevant (a form of metareasoning)
terministic) strategy i and player 2 chooses the pure strategy j.
U T = −U
Zero-sum games: Uji = −Uij ,
10:9
Resource limits
Standard approach:
• Use C UTOFF -T EST instead of T ERMINAL -T EST
e.g., depth limit (perhaps add quiescence search)
• Finding a minimax optimal mixed strategy p is a Linear Program
max w
w
s.t. U p ≥ w ,
X
pi = 1 ,
p≥0
i
Note that U p ≥ w implies minj (U p)j ≥ w.
• Gainable payoff of player 1: maxp minq q T U p
Minimax-Theorem: maxp minq q T U p = minq maxp q T U p
Minimax-Theorem ↔ optimal p with w ≥ 0 exists
• Use E VAL instead of U TILITY
i.e., evaluation function that estimates desirability of
position
10:13
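As a hedged illustration, the LP above can be handed to a generic solver. The sketch below uses scipy.optimize.linprog with variables (p, w), maximizing w subject to U p ≥ w, Σi pi = 1, p ≥ 0; the payoff matrix is a made-up matching-pennies-style game.

import numpy as np
from scipy.optimize import linprog

def minimax_strategy(U):
    # U[j, i]: utility of player 1 when she plays pure strategy i and the
    # opponent plays pure strategy j (the convention of the slide above).
    m, n = U.shape
    c = np.concatenate([np.zeros(n), [-1.0]])            # maximize w = minimize -w
    A_ub = np.hstack([-U, np.ones((m, 1))])              # U p >= w  <=>  -U p + w <= 0
    b_ub = np.zeros(m)
    A_eq = np.concatenate([np.ones(n), [0.0]]).reshape(1, -1)   # sum_i p_i = 1
    b_eq = [1.0]
    bounds = [(0, None)] * n + [(None, None)]            # p >= 0, w free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[n]                           # mixed strategy p, value w

U = np.array([[1.0, -1.0], [-1.0, 1.0]])                 # zero-sum 2x2 game
print(minimax_strategy(U))                               # p ≈ (0.5, 0.5), w ≈ 0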
13 Graphical Models
• The joint distribution can be factored as
P (X1:n ) =
n
Y
P (Xi | Parents(Xi ))
i=1
Outline
• Missing links imply conditional independence
• Ancestral simulation to sample from joint distribution
• A. Introduction
11:5
– Motivation and definition of Bayes Nets
– Conditional independence in Bayes Nets
– Examples
Example
• B. Inference in Graphical Models
– Sampling methods (Rejection, Importance, Gibbs)
– Variable Elimination & Factor Graphs
– Message passing, Loopy Belief Propagation
(Heckermann 1995)
P(B=bad) =0.02
P(F=empty)=0.05
Battery
Fuel
11:1
Gauge
Graphical Models
P(G=empty|B=good,F=not empty)=0.04
P(G=empty|B=good,F=empty)=0.97
P(G=empty|B=bad,F=not empty)=0.10
P(G=empty|B=bad,F=empty)=0.99
• The core difficulty in modelling is specifying
TurnOver
What are the relevant variables?
Start
P(S=no|T=yes,F=not empty)=0.01
P(S=no|T=yes,F=empty)=0.92
P(S=no|T=no,F=not empty)=1.00
P(S=no|T=no,F=empty)=1.00
P(T=no|B=good)=0.03
P(T=no|B=bad)=0.98
How do they depend on each other?
(Or how could they depend on each other → learning)
⇐⇒ P (S, T, G, F, B) = P (B) P (F ) P (G|F, B) P (T |B) P (S|T, F )
• Graphical models are a simple, graphical notation for
• Table sizes: LHS = 25 − 1 = 31 RHS = 1 + 1 + 4 + 2 + 4 = 12
1) which random variables exist
11:6
2) which random variables are “directly coupled”
Thereby they describe a joint probability distribution P (X1 , .., Xn )
Bayes Nets & conditional independence
over n random variables.
• Independence: Indep(X, Y ) ⇐⇒ P (X, Y ) = P (X) P (Y )
• 2 basic variants:
– Bayesian Networks
– Factor Graphs
Field)
• Conditional independence:
(aka. directed model, belief network)
Indep(X, Y |Z) ⇐⇒ P (X, Y |Z) = P (X|Z) P (Y |Z)
(aka. undirected model, Markov Random
11:2
X
Bayesian Networks
Z
• A Bayesian Network is a
– directed acyclic graph (DAG)
– where each node represents a random variable Xi
Z
Y
X
Z
Y
X
Y
(head-to-head)
(tail-to-tail)
(head-to-tail)
Indep(X, Y )
¬Indep(X, Y |Z)
¬Indep(X, Y )
Indep(X, Y |Z)
¬Indep(X, Y )
Indep(X, Y |Z)
11:7
– for each node we have a conditional probability distribution
P (Xi | Parents(Xi ))
• In the simplest case (discrete RVs), the conditional distribution
is represented as a conditional probability table (CPT)
11:3
Example
P (X, Y, Z) = P (X) P (Y ) P (Z|X, Y )
P
P (X, Y ) = P (X) P (Y )
Z P (Z|X, Y ) = P (X) P (Y )
• Tail-to-tail: Indep(X, Y |Z)
P (X, Y, Z) = P (Z) P (X|Z) P (Y |Z)
drinking red wine → longevity?
11:4
Bayesian Networks
• DAG → we can sort the RVs; edges only go from lower to higher
index
• Head-to-head: Indep(X, Y )
P (X, Y |Z) = P (X, Y, Z) = P (Z) = P (X|Z) P (Y |Z)
• Head-to-tail: Indep(X, Y |Z)
P (X, Y, Z) = P (X) P (Z|X) P (Y |Z)
P (X, Y |Z) =
P (X,Y,Z)
P (Z)
=
P (X,Z) P (Y |Z)
P (Z)
= P (X|Z) P (Y |Z)
11:8
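To make “ancestral simulation” (slide 11:5) concrete, the following is a small Python sketch (my own code, not from the lecture) that samples the car-start network of slide 11:6 in topological order; the boolean encoding of the CPT events is my own assumption.

   import random

   # Each variable is binary; True encodes the event named first in the CPTs:
   # B=bad, F=empty, G=empty, T=no, S=no.
   def sample_car_network(rng=random):
       B = rng.random() < 0.02                      # P(B=bad) = 0.02
       F = rng.random() < 0.05                      # P(F=empty) = 0.05
       p_g = {(False, False): 0.04, (False, True): 0.97,
              (True,  False): 0.10, (True,  True): 0.99}   # P(G=empty | B, F)
       G = rng.random() < p_g[(B, F)]
       p_t = {False: 0.03, True: 0.98}              # P(T=no | B)
       T = rng.random() < p_t[B]
       p_s = {(False, False): 0.01, (False, True): 0.92,
              (True,  False): 1.00, (True,  True): 1.00}   # P(S=no | T, F)
       S = rng.random() < p_s[(T, F)]
       return dict(B=B, F=F, G=G, T=T, S=S)

   # Monte Carlo estimate of P(S=no) from ancestral samples
   K = 100000
   print(sum(sample_car_network()['S'] for _ in range(K)) / K)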
General rules for determining conditional independence in a Bayes net:

• Given three groups of random variables X, Y, Z:
   Indep(X, Y |Z) ⇐⇒ every path from X to Y is “blocked by Z”
• A path is “blocked by Z” ⇐⇒ on this path...
   – ∃ a node in Z that is head-to-tail w.r.t. the path, or
   – ∃ a node in Z that is tail-to-tail w.r.t. the path, or
   – ∃ another node A which is head-to-head w.r.t. the path and neither A nor any of its descendants are in Z
11:9

Example

(Heckermann 1995) The car-start network again, with CPTs
   P(B=bad) = 0.02,  P(F=empty) = 0.05
   P(G=empty | B=good, F=not empty) = 0.04,  P(G=empty | B=good, F=empty) = 0.97
   P(G=empty | B=bad, F=not empty) = 0.10,  P(G=empty | B=bad, F=empty) = 0.99
   P(T=no | B=good) = 0.03,  P(T=no | B=bad) = 0.98
   P(S=no | T=yes, F=not empty) = 0.01,  P(S=no | T=yes, F=empty) = 0.92
   P(S=no | T=no, F=not empty) = 1.00,  P(S=no | T=no, F=empty) = 1.00

Indep(T, F )?   Indep(B, F |S)?   Indep(B, S|T )?
11:10

What can we do with Bayes nets?

• Inference: Given some pieces of information (prior, observed variables), what is the implication (the implied information, the posterior) on a non-observed variable?
• Decision Making: If utilities and decision variables are defined
   → compute optimal decisions in probabilistic domains
• Learning:
   – Fully Bayesian Learning: Inference over parameters (e.g., β)
   – Maximum likelihood training: Optimizing parameters
• Structure Learning (Learning/Inferring the graph structure itself): Decide which model (which graph structure) fits the data best; thereby uncovering conditional independencies in the data.
11:11

Inference

• Inference: Given some pieces of information (prior, observed variables), what is the implication (the implied information, the posterior) on a non-observed variable?
• In a Bayes Net: Assume there are three groups of RVs:
   – Z are observed random variables
   – X and Y are hidden random variables
   – We want to do inference about X, not Y
   Given some observed variables Z, compute the posterior marginal P (X | Z) for some hidden variable X:
   P (X | Z) = P (X, Z) / P (Z) = (1 / P (Z)) ∑_Y P (X, Y, Z)
   where Y are all hidden random variables except for X
• Inference requires summing over (eliminating) hidden variables.
11:12

Example: Holmes & Watson

• Mr. Holmes lives in Los Angeles. One morning when Holmes leaves his house, he realizes that his grass is wet. Is it due to rain, or has he forgotten to turn off his sprinkler?
   – Calculate P (R|H), P (S|H) and compare these values to the prior probabilities.
   – Calculate P (R, S|H).
      Note: R and S are marginally independent, but conditionally dependent
• Holmes checks Watson’s grass, and finds it is also wet.
   – Calculate P (R|H, W ), P (S|H, W )
   – This effect is called explaining away

JavaBayes: run it from the html page
http://www.cs.cmu.edu/˜javabayes/Home/applet.html
11:13

Example: Holmes & Watson

The network: Rain → {Watson, Holmes}, Sprinkler → Holmes, with CPTs
   P(R=yes) = 0.2,  P(S=yes) = 0.1
   P(W=yes | R=yes) = 1.0,  P(W=yes | R=no) = 0.2
   P(H=yes | R=yes, S=yes) = 1.0,  P(H=yes | R=yes, S=no) = 1.0
   P(H=yes | R=no, S=yes) = 0.9,  P(H=yes | R=no, S=no) = 0.0

P (H, W, S, R) = P (H|S, R) P (W |R) P (S) P (R)

P (R|H) = (1 / P (H)) ∑_{W,S} P (R, W, S, H) = (1 / P (H)) ∑_{W,S} P (H|S, R) P (W |R) P (S) P (R)
        = (1 / P (H)) ∑_S P (H|S, R) P (S) P (R)

P (R = 1 | H = 1) = (1 / P (H = 1)) (1.0 · 0.2 · 0.1 + 1.0 · 0.2 · 0.9) = 0.2 / P (H = 1)
P (R = 0 | H = 1) = (1 / P (H = 1)) (0.9 · 0.8 · 0.1 + 0.0 · 0.8 · 0.9) = 0.072 / P (H = 1)
11:14
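As a quick check of the numbers above, here is a small enumeration sketch (my own code, not from the lecture) that computes P(R | H=yes) for the Holmes & Watson network by brute-force summation and normalization.

   from itertools import product

   P_R = {1: 0.2, 0: 0.8}
   P_S = {1: 0.1, 0: 0.9}
   P_W_given_R = {1: 1.0, 0: 0.2}                       # P(W=yes | R)
   P_H_given_RS = {(1, 1): 1.0, (1, 0): 1.0,
                   (0, 1): 0.9, (0, 0): 0.0}            # P(H=yes | R, S)

   def joint(h, w, s, r):
       p_h = P_H_given_RS[(r, s)] if h else 1 - P_H_given_RS[(r, s)]
       p_w = P_W_given_R[r] if w else 1 - P_W_given_R[r]
       return p_h * p_w * P_S[s] * P_R[r]

   # P(R | H=1): sum out W and S, then normalize
   unnorm = {r: sum(joint(1, w, s, r) for w, s in product((0, 1), repeat=2))
             for r in (0, 1)}
   Z = sum(unnorm.values())
   print({r: p / Z for r, p in unnorm.items()})   # approx {0: 0.265, 1: 0.735}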
• These types of calculations can be automated
   → Variable Elimination Algorithm (discussed later)
11:15

13.1  Inference Methods in Graphical Models

11:16

Inference methods in graphical models

• Sampling:
   – Rejection sampling, importance sampling, Gibbs sampling
   – More generally, Markov-Chain Monte Carlo (MCMC) methods
• Message passing:
   – Exact inference on trees (includes the Junction Tree Algorithm)
   – Belief propagation
• Other approximations/variational methods
   – Expectation propagation
   – Specialized variational methods depending on the model
• Reductions:
   – Mathematical Programming (e.g. LP relaxations of MAP)
   – Compilation into Arithmetic Circuits (Darwiche et al.)
11:17

Sampling

• Read Andrieu et al: An Introduction to MCMC for Machine Learning (Machine Learning, 2003)
• Here I’ll discuss only three basic methods:
   – Rejection sampling
   – Importance sampling
   – Gibbs sampling
11:18

Monte Carlo methods

• Generally, a Monte Carlo method is a method to generate a set of (potentially weighted) samples that approximate a distribution p(x).
   In the unweighted case, the samples should be i.i.d. xi ∼ p(x)
   In the general (also weighted) case, we want particles that allow to estimate expectations of anything that depends on x, e.g. f (x):
   ⟨f (x)⟩_p = ∫ f (x) p(x) dx = lim_{N→∞} ∑_{i=1}^N wi f (xi )
   In this view, Monte Carlo methods approximate an integral.
• Motivation: p(x) itself is too complicated to express analytically or to compute ⟨f (x)⟩_p directly
• Example: What is the probability that a solitair would come out successful? (Original story by Stan Ulam.) Instead of trying to analytically compute this, generate many random solitairs and count.
• Naming: The method developed in the 40ies, when computers became faster. Fermi, Ulam and von Neumann initiated the idea. von Neumann called it “Monte Carlo” as a code name.
11:19

Rejection Sampling

• We have a Bayesian Network with RVs X1:n , some of which are observed: Xobs = yobs , obs ⊂ {1 : n}
• The goal is to compute marginal posteriors P (Xi | Xobs = yobs ) conditioned on the observations.
• We generate a set of K (joint) samples of all variables
   S = {x^k_{1:n}}_{k=1}^K
   Each sample x^k_{1:n} = (x^k_1 , x^k_2 , .., x^k_n ) is a list of instantiations of all RVs.
11:20

Rejection Sampling

• To generate a single sample x^k_{1:n} :
   1. Sort all RVs in topological order; start with i = 1
   2. Sample a value x^k_i ∼ P (Xi | x^k_{Parents(i)} ) for the i-th RV conditional to the previous samples x^k_{1:i-1}
   3. If i ∈ obs, compare the sampled value x^k_i with the observation yi . Reject and repeat from 1. if the sample is not equal to the observation.
   4. Repeat with i ← i + 1 from 2.
• We compute the marginal probabilities from the sample set S:
   P (Xi = x | Xobs = yobs ) ≈ countS (x^k_i = x) / K
   or pair-wise marginals:
   P (Xi = x, Xj = x' | Xobs = yobs ) ≈ countS (x^k_i = x ∧ x^k_j = x') / K
11:21

Importance sampling (with likelihood weighting)

• Rejecting whole samples may become very inefficient in large Bayes Nets!
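Before moving on to importance sampling, here is a minimal rejection-sampling sketch (my own code) for the procedure of slides 11:20–11:21, using the Holmes & Watson network from slide 11:13 with evidence H = yes; the per-variable early rejection of the slide is collapsed into sampling the full joint and filtering, which is equivalent here.

   import random

   def sample_holmes(rng=random):
       R = rng.random() < 0.2
       S = rng.random() < 0.1
       W = rng.random() < (1.0 if R else 0.2)
       p_h = {(1, 1): 1.0, (1, 0): 1.0, (0, 1): 0.9, (0, 0): 0.0}
       H = rng.random() < p_h[(R, S)]
       return dict(R=R, S=S, W=W, H=H)

   def rejection_sampling(K=100000):
       accepted = []
       while len(accepted) < K:
           x = sample_holmes()
           if x['H']:                 # keep only samples consistent with H = yes
               accepted.append(x)
       return sum(x['R'] for x in accepted) / K   # approx P(R=yes | H=yes) = 0.735

   print(rejection_sampling())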
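And, anticipating the likelihood-weighting procedure detailed on slides 11:22–11:23 just below, the corresponding weighted variant looks like this (again my own sketch, not lecture code):

   import random

   def weighted_sample(evidence, rng=random):
       # one pass over (R, S, W, H) in topological order
       w = 1.0
       R = rng.random() < 0.2
       S = rng.random() < 0.1
       W = rng.random() < (1.0 if R else 0.2)      # W is unobserved: sample it
       p_h = {(1, 1): 1.0, (1, 0): 1.0, (0, 1): 0.9, (0, 0): 0.0}[(R, S)]
       H = evidence['H']                            # H is observed: clamp it
       w *= p_h if H else 1 - p_h                   # weight by the likelihood
       return dict(R=R, S=S, W=W, H=H), w

   def estimate_R_given_H(K=100000):
       num = den = 0.0
       for _ in range(K):
           x, w = weighted_sample({'H': True})
           num += w * x['R']
           den += w
       return num / den        # approx 0.735

   print(estimate_R_given_H())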
• New strategy: We generate a weighted sample set
   S = {(x^k_{1:n} , w^k )}_{k=1}^K
   where each sample x^k_{1:n} is associated with a weight w^k
• In our case, we will choose the weights proportional to the likelihood P (Xobs = yobs | X1:n = x^k_{1:n} ) of the observations conditional to the sample x^k_{1:n}
11:22

Importance sampling

• To generate a single sample (w^k , x^k_{1:n} ):
   1. Sort all RVs in topological order; start with i = 1 and w^k = 1
   2. a) If i ∉ obs, sample a value x^k_i ∼ P (Xi | x^k_{Parents(i)} ) for the i-th RV conditional to the previous samples x^k_{1:i-1}
      b) If i ∈ obs, set the value x^k_i = yi and update the weight according to the likelihood
         w^k ← w^k P (Xi = yi | x^k_{1:i-1} )
   3. Repeat with i ← i + 1 from 2.
• We compute the marginal probabilities as:
   P (Xi = x | Xobs = yobs ) ≈ ∑_{k=1}^K w^k [x^k_i = x] / ∑_{k=1}^K w^k
   and likewise pair-wise marginals, etc.
   Notation: [expr] = 1 if expr is true and zero otherwise
11:23

Gibbs sampling*

• In Gibbs sampling we also generate a sample set S – but in this case the samples are not independent from each other. The next sample “modifies” the previous one:
• First, all observed RVs are clamped to their fixed value x^k_i = yi for any k.
• To generate the (k + 1)-th sample, iterate through the latent variables i ∉ obs, updating:
   x^{k+1}_i ∼ P (Xi | x^k_{1:n\i} )
             ∼ P (Xi | x^k_1 , x^k_2 , .., x^k_{i-1} , x^k_{i+1} , .., x^k_n )
             ∼ P (Xi | x^k_{Parents(i)} ) ∏_{j : i∈Parents(j)} P (Xj = x^k_j | Xi , x^k_{Parents(j)\i} )
   That is, each x^{k+1}_i is resampled conditional to the other (neighboring) current sample values.
11:24

Gibbs sampling*

• As for rejection sampling, Gibbs sampling generates an unweighted sample set S which can directly be used to compute marginals.
   In practice, one often discards an initial set of samples (burn-in) to avoid starting biases.
• Gibbs sampling is a special case of MCMC sampling.
   Roughly, MCMC means to invent a sampling process, where the next sample may stochastically depend on the previous (Markov property), such that the final sample set is guaranteed to correspond to P (X1:n ).
   → An Introduction to MCMC for Machine Learning
11:25

Sampling – conclusions

• Sampling algorithms are very simple, very general and very popular
   – they equally work for continuous & discrete RVs
   – one only needs to ensure/implement the ability to sample from conditional distributions, no further algebraic manipulations
   – MCMC theory can reduce the required number of samples
• In many cases exact and more efficient approximate inference is possible by actually computing/manipulating whole distributions in the algorithms instead of only samples.
11:26

Variable Elimination

11:27

Variable Elimination example

(figure: the Bayes net over X1 , .., X6 with the elimination “sub-problems” F1 , .., F5 and the resulting messages µ1 , .., µ5 )

P (x5 ) = ∑_{x1 ,x2 ,x3 ,x4 ,x6} P (x1 ) P (x2 |x1 ) P (x3 |x1 ) P (x4 |x2 ) P (x5 |x3 ) P (x6 |x2 , x5 )
       = ∑_{x1 ,x2 ,x3 ,x6} P (x1 ) P (x2 |x1 ) P (x3 |x1 ) P (x5 |x3 ) P (x6 |x2 , x5 ) ∑_{x4} P (x4 |x2 )      [F1 (x2 , x4 ) → µ1 (x2 )]
       = ∑_{x1 ,x2 ,x3 ,x6} P (x1 ) P (x2 |x1 ) P (x3 |x1 ) P (x5 |x3 ) P (x6 |x2 , x5 ) µ1 (x2 )
       = ∑_{x1 ,x2 ,x3} P (x1 ) P (x2 |x1 ) P (x3 |x1 ) P (x5 |x3 ) µ1 (x2 ) ∑_{x6} P (x6 |x2 , x5 )      [F2 (x2 , x5 , x6 ) → µ2 (x2 , x5 )]
       = ∑_{x1 ,x2 ,x3} P (x1 ) P (x2 |x1 ) P (x3 |x1 ) P (x5 |x3 ) µ1 (x2 ) µ2 (x2 , x5 )
       = ∑_{x2 ,x3} P (x5 |x3 ) µ1 (x2 ) µ2 (x2 , x5 ) ∑_{x1} P (x1 ) P (x2 |x1 ) P (x3 |x1 )      [F3 (x1 , x2 , x3 ) → µ3 (x2 , x3 )]
       = ∑_{x2 ,x3} P (x5 |x3 ) µ1 (x2 ) µ2 (x2 , x5 ) µ3 (x2 , x3 )
       = ∑_{x3} P (x5 |x3 ) ∑_{x2} µ1 (x2 ) µ2 (x2 , x5 ) µ3 (x2 , x3 )      [F4 (x3 , x5 ) → µ4 (x3 , x5 )]
       = ∑_{x3} P (x5 |x3 ) µ4 (x3 , x5 )      [F5 (x3 , x5 ) → µ5 (x5 )]
       = µ5 (x5 )
11:28
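The elimination steps above can be written as one small generic routine over factor tables; the following Python sketch (my own illustration, not lecture code) represents each factor as a dictionary over value tuples and eliminates one variable at a time, in the spirit of the Variable Elimination Algorithm discussed on slide 11:32.

   from itertools import product

   def eliminate(factors, var, domain):
       # sum out `var` from the product of all factors that mention it;
       # each factor is (vars, table) with table[(values in vars-order)] = number
       coupled = [f for f in factors if var in f[0]]
       rest = [f for f in factors if var not in f[0]]
       new_vars = sorted({v for vs, _ in coupled for v in vs if v != var})
       new_table = {}
       for assignment in product(domain, repeat=len(new_vars)):
           env = dict(zip(new_vars, assignment))
           total = 0.0
           for value in domain:                       # sum over the eliminated variable
               env[var] = value
               prod_val = 1.0
               for vs, table in coupled:
                   prod_val *= table[tuple(env[v] for v in vs)]
               total += prod_val
           new_table[assignment] = total
       return rest + [(tuple(new_vars), new_table)]

   def marginal(factors, keep, elim_order, domain=(0, 1)):
       for var in elim_order:
           factors = eliminate(factors, var, domain)
       result = {}
       for assignment in product(domain, repeat=len(keep)):
           env = dict(zip(keep, assignment))
           val = 1.0
           for vs, table in factors:                  # multiply remaining factors
               val *= table[tuple(env[v] for v in vs)]
           result[assignment] = val
       return result

For the example above one would create one factor per CPT, e.g. (('x4', 'x2'), table_P_x4_given_x2), and call marginal(factors, keep=('x5',), elim_order=('x4', 'x6', 'x1', 'x2', 'x3')).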
Variable Elimination example – lessons learnt
• There is a dynamic programming principle behind Variable Elimination:
– For eliminating X5,4,6 we use the solution of eliminating
X4,6
– The “sub-problems” are represented by the F terms, their
solutions by the remaining µ terms
Variable Elimination Algorithm
• eliminate single variable(F, i)
6:
Input: list F of factors, variable id i
Output: list F of factors
find relevant subset Fˆ ⊆ F of factors coupled to i: Fˆ =
{k : i ∈ ∂k}
ˆ with neighborhood ∂ k
ˆ = all variables
create new factor k
in Fˆ except i
P Q
compute µkˆ (X∂ kˆ ) = Xi k∈Fˆ fk (X∂k )
remove old factors Fˆ and append new factor µˆ to F
7:
return F
1:
2:
3:
4:
– We’ll continue to discuss this 4 slides later!
5:
• The factorization of the joint
– determines in which order Variable Elimination is efficient
– determines what the terms F (...) and µ(...) depend on
• elimination algorithm(µ, F, M )
• We can automate Variable Elimination. For the automation, all
that matters is the factorization of the joint.
1:
2:
11:29
3:
4:
5:
Factor graphs
6:
• In the previous slides we introduces the box
k
notation to indi-
7:
Input: list F of factors, tuple M of desired output variables
ids
Output: single factor µ over variables XM
define all variables present in F : V = vars(F )
define variables to be eliminated: E = V \ M
for all i ∈ E: eliminate single variable(F, i)
for all remaining factors, compute the product µ =
Q
f ∈F f
return µ
cate terms that depend on some variables. That’s exactly what
11:32
factor graphs represent.
Variable Elimination on trees
• A Factor graph is a
– bipartite graph
Y3
1
– where each circle node represents a random variable Xi
Y8
– each box node represents a factor fk , which is a function
fk (X∂k )
Y1
2
6
X
Y4
F1 (Y1,8 , X)
F3 (Y3,4,5 , X)
Y2
– the joint probability distribution is given as
4
P (X1:n ) =
K
Y
7
3
5
Y5
F2 (Y2,6,7 , X)
Y6
Y7
fk (X∂k )
k=1
The subtrees w.r.t. X can be described as
Notation: ∂k is shorthand for Neighbors(k)
11:30
F1 (Y1,8 , X) = f1 (Y8 , Y1 ) f2 (Y1 , X)
F2 (Y2,6,7 , X) = f3 (X, Y2 ) f4 (Y2 , Y6 ) f5 (Y2 , Y7 )
Bayes Net → factor graph
F3 (Y3,4,5 , X) = f6 (X, Y3 , Y4 ) f7 (Y4 , Y5 )
X4
X2
X1
The joint distribution is:
X6
P (Y1:8 , X) = F1 (Y1,8 , X) F2 (Y2,6,7 , X) F3 (Y3,4,5 , X)
11:33
X3
• Bayesian Network:
X5
P (x1:6 ) = P (x1 ) P (x2 |x1 ) P (x3 |x1 ) P (x4 |x2 ) P (x5 |x3 ) P (x6 |x2 , x5 )
X2
Variable Elimination on trees
X4
Y3
µ2→X
Y1
2
X1
• Factor Graph:
X6
X3
Y9
1
Y8
F1 (Y1,8 , X)
6
X µ6→X
µ3→X
3
X5
8
Y4
7
F3 (Y3,4,5 , X)
Y2
Y5
4
P (x1:6 ) = f1 (x1 , x2 ) f2 (x3 , x1 ) f3 (x2 , x4 ) f4 (x3 , x5 ) f5 (x2 , x5 , x6 )
5
F2 (Y2,6,7 , X)
Y6
Y7
→ each CPT in the Bayes Net is just a factor (we neglect the
We can eliminate each tree independently. The remaining terms
special semantics of a CPT)
11:31
(messages) are:
P
µF1 →X (X) = Y1,8 F1 (Y1,8 , X)
µF2 →X (X) =
P
F2 (Y2,6,7 , X)
• Message passing exemplifies how to exploit the factorization
µF3 →X (X) =
P
F3 (Y3,4,5 , X)
structure of the joint distribution for the algorithmic implemen-
Y2,6,7
Y3,4,5
tation
The marginal P (X) is the product of subtree messages
P (X) = µF1 →X (X) µF2 →X (X) µF3 →X (X)
11:34
• Note: These are recursive equations. They can be resolved
exactly if and only if the dependency structure (factor graph) is
Variable Elimination on trees – lessons learnt
a tree. If the factor graph had loops, this would be a “loopy
recursive equation system”...
• The “remaining terms” µ’s are called messages
11:37
Intuitively, messages subsume information from a subtree
• Marginal = product of messages, P (X) =
Q
Message passing variants
µFk →X , is very
k
intuitive:
– Fusion of independent information from the different subtrees
– Fusing independent information ↔ multiplying probability
tables
• Message passing has many important applications:
– Many models are actually trees: In particular chains esp.
Hidden Markov Models
– Message passing can also be applied on non-trees (↔
loopy graphs) → approximate inference (Loopy Belief Propagation)
• Along a (sub-) tree, messages can be computed recursively
11:35
– Bayesian Networks can be “squeezed” to become trees →
exact inference in Bayes Nets! (Junction Tree Algorithm)
11:38
Message passing
• General equations (belief propagation (BP)) for recursive message computation (writing µk→i (Xi ) instead of µFk →X (X)):
• If the graphical model is not a tree (=has loops):
– The recursive message equations cannot be resolved.
µ
¯ j→k (Xj )
µk→i (Xi ) =
X
fk (X∂k )
X∂k\i
z Y
Y
j∈∂k\i
|
Loopy Belief Propagation
– However, we could try to just iterate them as update equations...
}|
{
µk0 →j (Xj )
k0 ∈∂j\k
{z
F (subtree)
}
• Loopy BP update equations:
Q
j∈∂k\i :
excl. i
Q
k0 ∈∂j\k
(initialize with µk→i = 1)
branching at factor k, prod. over adjacent variables j
: branching at variable j, prod. over adjacent factors k
0
µnew
k→i (Xi ) =
X
fk (X∂k )
X∂k\i
Y
Y
j∈∂k\i
k0 ∈∂j\k
µold
k0 →j (Xj )
excl. k
11:39
µ
¯ j→k (Xj ) are called “variable-to-factor messages”: store them for efficiency
Loopy BP remarks
Y3
µ1→Y1
Y9
1
Y8
µ2→X
Y1
2
F1 (Y1,8 , X)
6
X µ6→X
µ3→X
3
4
8
Y4 µ7→Y47
F3 (Y3,4,5 , X)
Y2
µ4→Y2
µ8→Y4
µ5→Y2
5
Y5
Example messages:
P
µ2→X =
f2 (Y1 , X) µ1→Y1 (Y1 )
PY1
µ6→X =
f6 (Y3 , Y4 , X) µ7→Y4 (Y4 )
PY3 ,Y4
µ3→X =
Y f3 (Y2 , X) µ4→Y2 (Y2 ) µ5→Y2 (Y2 )
2
F2 (Y2,6,7 , X)
Y6
Y7
11:36
Message passing remarks
• Computing these messages recursively on a tree does nothing
else than Variable Elimination
Q
⇒ P (Xi ) = k∈∂i µk→i (Xi ) is the correct posterior marginal
• However, since it stores all “intermediate terms”, we can compute ANY marginal P (Xi ) for any i
• Problem of loops intuitively:
loops ⇒ branches of a node do not represent independent information!
– BP is multiplying (=fusing) messages from dependent sources of information
• No convergence guarantee, but if it converges, then to a state of marginal consistency
X
X
b(X∂k ) =
b(X∂k0 ) = b(Xi )
X∂k\i
X
∂k0 \i
and to the minimum of the Bethe approximation of the free energy (Yedidia, Freeman, & Weiss, 2001)
• We shouldn’t be overly disappointed:
– if BP was exact on loopy graphs we could efficiently solve NP hard problems...
– loopy BP is a very interesting approximation to solving an NP hard problem
– is hence also applied in context of combinatorial optimization (e.g., SAT
problems)
• Ways to tackle the problems with BP convergence:
– Damping (Heskes, 2004: On the uniqueness of loopy belief propagation
fixed points)
– CCCP (Yuille, 2002: CCCP algorithms to minimize the Bethe and Kikuchi
free energies: Convergent alternatives to belief propagation)
– Tree-reweighted MP (Kolmogorov, 2006: Convergent tree-reweighted message passing for energy minimization)
11:43
11:40
Junction Tree Algorithm
Junction Tree Algorithm Example
• Many models have loops
X2
Instead of applying loopy BP in the hope of getting a good approximation, it is possible to convert every model into a tree by redefinition of
X4
X4
X6
X1
X1
RVs. The Junction Tree Algorithms converts a loopy model into a tree.
X3
X2,5
X2,3
X6
X5
• Loops are resolved by defining larger variable groups (separators) on which messages are defined
• If we eliminate in order 4, 6, 5, 1, 2, 3, we get remaining terms
11:41
(X2 ), (X2 , X5 ), (X2 , X3 ), (X2 , X3 ), (X3 )
Junction Tree Example
• Example:
which translates to the Junction Tree on the right
A
B
A
B
C
D
C
D
11:44
Maximum a-posteriori (MAP) inference
• Often we want to compute the most likely global assignment
• Join variable B and C to a single separator
MAP
X1:n
= argmax P (X1:n )
A
X1:n
B, C
D
This can be viewed as a variable substitution: rename the tuple
of all random variables.
This is called MAP inference and can be solved
P
by replacing all
by max in the message passing equations – the
algorithm is called Max-Product Algorithm and is a generalization of
Dynamic Programming methods like Viterbi or Dijkstra.
(B, C) as a single random variable
• Application: Conditional Random Fields
• A single random variable may be part of multiple separators –
but only along a running intersection
f (y, x) = φ(y, x)>β =
k
X
φj (y∂j , x)βj = log
j=1
11:42
Junction Tree Algorithm
k
hY
eφj (y∂j ,x)βj
i
j=1
with prediction x 7→ y ∗ (x) = argmax f (x, y)
y
Finding the argmax is a MAP inference problem! This is frequently
needed in the innerloop of CRF learning algorithms.
• Standard formulation: Moralization & Triangulation
11:45
A clique is a fully connected subset of nodes in a graph.
1) Generate the factor graph (classically called “moralization”)
Conditional Random Fields
2) Translate each factor to a clique: Generate the undirected
graph
• The following are interchangable:
“Random Field” ↔ “Markov R
where undirected edges connect all RVs of a factor
3) Triangulate the undirected graph
4) Translate each clique back to a factor; identify the separators
between factors
• Formulation in terms of variable elimination:
1) Start with a factor graph
• Therefore, a CRF is a conditional factor graph:
– A CRF defines a mapping from input x to a factor graph
over y
– Each feature φj (y∂j , x) depends only on a subset ∂j of
variables y∂j
– If y∂j are discrete, a feature φj (y∂j , x) is usually an indicator feature (see lecture 03); the corresponding parameter
βj is then one entry of a factor fk (y∂j ) that couples these
variables
2) Choose an order of variable elimination
11:46
3) Keep track of the “remaining µ terms” (slide 14): which RVs
would they depend on? → this identifies the separators
What we didn’t cover
• A very promising line of research is solving inference problems
using mathematical programming. This unifies research in the
areas of optimization, mathematical programming and probabilistic inference.
Linear Programming relaxations of MAP inference and CCCP methods are great
examples.
11:47
14 Dynamic Models
Motivation

– Robotics
– Speech recognition
– Music
12:1

Markov processes (Markov chains)

Construct a Bayes net from these variables: parents?

Markov assumption: Xt depends on bounded subset of X0:t−1
First-order Markov process: P (Xt | X0:t−1 ) = P (Xt | Xt−1 )
Second-order Markov process: P (Xt | X0:t−1 ) = P (Xt | Xt−2 , Xt−1 )
Sensor Markov assumption: P (Yt | X0:t , Y0:t−1 ) = P (Yt | Xt )
Stationary process: transition model P (Xt | Xt−1 ) and sensor model P (Yt | Xt ) fixed for all t
12:2

Different inference problems in Markov Models

• P (xt | y0:T ) marginal posterior
• P (xt | y0:t ) filtering
• P (xt | y0:a ), t > a prediction
• P (xt | y0:b ), t < b smoothing
• P (y0:T ) likelihood calculation
• Viterbi alignment: Find the sequence x∗0:T that maximizes P (x0:T | y0:T )
   (This is done using max-product, instead of sum-product message passing.)
12:3

Hidden Markov Models as Graphical Model

• We assume we have
   – a discrete latent variable Xt in each time slice
   – observed (discrete or continuous) variables Yt in each time slice
   – some observation model P (Yt | Xt ; θ)
   – some transition model P (Xt | Xt-1 ; θ)
• A Hidden Markov Model (HMM) is defined as the joint distribution
   P (X0:T , Y0:T ) = P (X0 ) · ∏_{t=1}^T P (Xt |Xt-1 ) · ∏_{t=0}^T P (Yt |Xt ) .
   (figure: the chain X0 → X1 → X2 → X3 → · · · → XT with observations Y0 , Y1 , Y2 , Y3 , .., YT )
12:4

Inference in an HMM – a tree!

(figure: the HMM chain with the factors Fpast (X0:2 , Y0:1 ), Fnow (X2 , Y2 ), Ffuture (X2:T , Y3:T ) around Xt = X2 )

• The marginal posterior P (Xt | Y1:T ) is the product of three messages
   P (Xt | Y1:T ) ∝ P (Xt , Y1:T ) = µpast (Xt ) µnow (Xt ) µfuture (Xt )   (the three messages correspond to α, ρ, β)
• For all a < t and b > t
   – Xa conditionally independent from Xb given Xt
   – Ya conditionally independent from Yb given Xt
   “The future is independent of the past given the present” — Markov property
   (conditioning on Yt does not yield any conditional independences)
12:5

Inference in HMMs

Applying the general message passing equations:

forward msg.       µXt-1→Xt (xt ) =: αt (xt ) = ∑_{xt-1} P (xt |xt-1 ) αt-1 (xt-1 ) ρt-1 (xt-1 ) ,   α0 (x0 ) = P (x0 )
backward msg.      µXt+1→Xt (xt ) =: βt (xt ) = ∑_{xt+1} P (xt+1 |xt ) βt+1 (xt+1 ) ρt+1 (xt+1 ) ,   βT (xT ) = 1
observation msg.   µYt→Xt (xt ) =: ρt (xt ) = P (yt | xt )
posterior marginal   q(xt ) ∝ αt (xt ) ρt (xt ) βt (xt )
posterior marginal   q(xt , xt+1 ) ∝ αt (xt ) ρt (xt ) P (xt+1 |xt ) ρt+1 (xt+1 ) βt+1 (xt+1 )
12:6
Inference in HMMs – implementation notes

• The message passing equations can be implemented by reinterpreting them as matrix equations: Let αt , βt , ρt be the vectors corresponding to the probability tables αt (xt ), βt (xt ), ρt (xt ); and let P be the matrix with entries P (xt | xt-1 ). Then

   1:  α0 = π ,  βT = 1
   2:  for t = 1:T-1 :  αt = P (αt-1 · ρt-1 )
   3:  for t = T-1:0 :  βt = P^T (βt+1 · ρt+1 )
   4:  for t = 0:T :  qt = αt · ρt · βt
   5:  for t = 0:T-1 :  Qt = P · [(βt+1 · ρt+1 ) (αt · ρt )^T]

   where · is the element-wise product! Here, qt is the vector with entries q(xt ), and Qt the matrix with entries q(xt+1 , xt ). Note that the equation for Qt describes Qt (x', x) = P (x'|x)[(βt+1 (x') ρt+1 (x'))(αt (x) ρt (x))].
12:7

Inference in HMMs: classical derivation

Given our knowledge of Belief Propagation, inference in HMMs is simple. For reference, here is a more classical derivation:

   P (xt | y0:T ) = P (y0:T | xt ) P (xt ) / P (y0:T )
                 = P (y0:t | xt ) P (yt+1:T | xt ) P (xt ) / P (y0:T )
                 = P (y0:t , xt ) P (yt+1:T | xt ) / P (y0:T )
                 = αt (xt ) βt (xt ) / P (y0:T )

   αt (xt ) := P (y0:t , xt ) = P (yt |xt ) P (y0:t-1 , xt ) = P (yt |xt ) ∑_{xt-1} P (xt | xt-1 ) αt-1 (xt-1 )

   βt (xt ) := P (yt+1:T | xt ) = ∑_{xt+1} P (yt+1:T | xt+1 ) P (xt+1 | xt ) = ∑_{xt+1} [ βt+1 (xt+1 ) P (yt+1 |xt+1 ) ] P (xt+1 | xt )

   Note: αt here is the same as αt · ρt on all other slides!
12:8

Kalman Filter example

• filtering of a position (x, y) ∈ R2 :  (figure)
12:10

Kalman Filter example

• smoothing of a position (x, y) ∈ R2 :  (figure)
12:11
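The matrix recursions of slide 12:7 map almost one-to-one onto NumPy; this is my own sketch, assuming pi is the initial distribution, P[x_next, x] = P(x_next | x) and rho[t, x] = P(y_t | x).

   import numpy as np

   def forward_backward(pi, P, rho):
       # pi: (K,), P: (K, K), rho: (T+1, K); returns posteriors q[t, x]
       T = rho.shape[0] - 1
       K = len(pi)
       alpha = np.zeros((T + 1, K))
       beta = np.ones((T + 1, K))
       alpha[0] = pi
       for t in range(1, T + 1):                      # forward pass
           alpha[t] = P @ (alpha[t - 1] * rho[t - 1])
       for t in range(T - 1, -1, -1):                 # backward pass
           beta[t] = P.T @ (beta[t + 1] * rho[t + 1])
       q = alpha * rho * beta
       return q / q.sum(axis=1, keepdims=True)        # normalize per time step

For long sequences one would additionally normalize alpha and beta at every step for numerical stability; that detail is omitted here.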
HMM remarks
HMM example: Learning Bach
• A machine “listens” (reads notes of) Bach pieces over and over
• The computation of forward and backward messages along the
again
→ It’s supposed to learn how to write Bach pieces itself (or at
Markov chain is also called forward-backward algorithm
• The EM algorithm to learn the HMM parameters is also called
least harmonize them).
Baum-Welch algorithm
• If the latent variable xt is continuous xt ∈ Rd instead of discrete, then such a Markov model is also called state space
• Harmonizing Chorales in the Style of J S Bach Moray Allan &
Chris Williams (NIPS 2004)
model.
• If the continuous transitions and observations are linear Gaus-
– observed sequence Y0:T Soprano melody
sian
P (xt+1 |xt ) = N(xt+1 | Axt +a, Q) ,
• use an HMM
P (yt |xt ) = N(yt | Cxt +c, W )
– latent sequence X0:T chord & and harmony:
then the forward and backward messages αt and βt are also
Gaussian.
→ forward filtering is also called Kalman filtering
→ smoothing is also called Kalman smoothing
• Sometimes, computing forward and backward messages (in disrete or continuous context) is also called Bayesian filtering/smoothing
12:9
12:12
HMM example: Learning Bach
• results: http://www.anc.inf.ed.ac.uk/demos/hmmbach/
• See also work by Gerhard Widmer http://www.cp.jku.at/
people/widmer/
12:13
Dynamic Bayesian Networks
– Arbitrary BNs in each time slice
– Special case: MDPs, speech, etc
12:14
15 Reinforcement Learning
• Stationary MDP:
– We assume P (s0 | s, a) and P (r|s, a) independent of time
R
– We also define R(s, a) := E{r|s, a} = r P (r|s, a) dr
Long history of RL in AI
13:3
Idea of programming a computer to learn by trial and error (Tur-
State value function
ing, 1954)
SNARCs (Stochastic Neural-Analog Reinforcement Calculators)
(Minsky, 54)
• The value (expected discounted return) of policy π when started
in state s:
Checkers playing program (Samuel, 59)
Lots of RL in the 60s (e.g., Waltz & Fu 65; Mendel 66; Fu 70)
MENACE (Matchbox Educable Naughts and Crosses Engine
V π (s) = Eπ {r0 + γr1 + γ 2 r2 + · · · | s0 = s}
(Mitchie, 63)
discounting factor γ ∈ [0, 1]
RL based Tic Tac Toe learner (GLEE) (Mitchie 68)
Classifier Systems (Holland, 75)
Adaptive Critics (Barto & Sutton, 81)
• Definition of optimality:
behavior π ∗ is optimal iff
Temporal Differences (Sutton, 88)
∗
from Satinder Singh’s Introduction to RL, videolectures.com
∀s : V π (s) = V ∗ (s)
where V ∗ (s) = max V π (s)
π
(simultaneously maximising the value in all states)
• Long history in Psychology
13:1
(In MDPs there always exists (at least one) optimal deterministic policy.)
Outline
13:4
• Markov Decision Processes as formal model
– Definition
– Value/Q-function
– Planning as computing V /Q given a model
• Learning
An example for a
value function...
– Temporal Difference & Q-learning
– Limitations of the model-free view
– Model-based RL
• Exploration
• Briefly
demo:
test/mdp runVI
– Imitation Learning & Inverse RL
– Continuous states and actions (LSPI, Policy Gradients)
13:2
Values provide a gradient towards desirable states
13:5
Markov Decision Process
Value function
a0
a1
a2
s0
s1
s2
• The value function V is a central concept in all of RL!
r0
P (s0:T +1 , a0:T , r0:T ; π) = P (s0 )
r1
QT
t=0
r2
P (at |st ; π) P (rt |st , at ) P (st+1 |st , at )
• In other domains (stochastic optimal control) it is also called
– world’s initial state distribution P (s0 )
– world’s transition probabilities P (st+1 | st , at )
– world’s reward probabilities P (rt | st , at )
– agent’s policy π(at | st ) = P (a0 |s0 ; π)
at = π(st ))
Many algorithms can directly be derived from properties of the value
function.
cost-to-go function (cost = −reward)
13:6
(or deterministic
Recursive property of the value function
• Value Iteration:
(initialize Vk=0 (s) = 0)
h
i
X
∀s : Vk+1 (s) = max R(s, a) + γ
P (s0 |s, a) Vk (s0 )
a
π
s0
2
V (s) = E{r0 + γr1 + γ r2 + · · · | s0 = s; π}
stopping criterion:
= E{r0 | s0 = s; π} + γE{r1 + γr2 + · · · | s0 = s; π}
P
= R(s, π(s)) + γ s0 P (s0 | s, π(s)) E{r1 + γr2 + · · · | s1 = s0 ; π}
P
= R(s, π(s)) + γ s0 P (s0 | s, π(s)) V π (s0 )
maxs |Vk+1 (s) − Vk (s)| ≤ • Note that V ∗ is a fixed point of value iteration!
• Value Iteration converges to the optimal value function V ∗ (proof
below)
demo:
V π = Rπ + γP π V π
• We can write this in vector notation
test/mdp runVI
13:10
with vectors V πs = V π (s), Rπs = R(s, π(s)) and matrix P πs0 s =
P (s0 | s, π(s))
State-action value function (Q-function)
• For stochastic π(a|s):
P
P
V π (s) = a π(a|s)R(s, a) + γ s0 ,a π(a|s)P (s0 | s, a) V π (s0 )
• We repeat the last couple of slides for the Q-function...
13:7
• The state-action value function (or Q-function) is the expected
discounted return when starting in state s and taking first action
Bellman optimality equation
a:
• Recall the recursive property of the value function
π
V (s) = R(s, π(s)) + γ
P
s0
0
Qπ (s, a) = Eπ {r0 + γr1 + γ 2 r2 + · · · | s0 = s, a0 = a}
X
= R(s, a) + γ
P (s0 | s, a) Qπ (s0 , π(s0 ))
0
π
P (s | s, π(s)) V (s )
s0
(Note: V π (s) = Qπ (s, π(s)).)
• Bellman optimality equation
i
h
P
V ∗ (s) = maxa R(s, a) + γ s0 P (s0 | s, a) V ∗ (s0 )
h
i
P
with π ∗ (s) = argmaxa R(s, a) + γ s0 P (s0 | s, a) V ∗ (s0 )
• Bellman optimality equation for the Q-function
P
Q∗ (s, a) = R(s, a) + γ s0 P (s0 | s, a) maxa0 Q∗ (s0 , a0 )
(Sketch of proof: If π would select another action than argmaxa [·], then π 0 which
= π everywhere except π 0 (s) = argmaxa [·] would be better.)
with π ∗ (s) = argmaxa Q∗ (s, a)
13:11
Q-Iteration
• This is the principle of optimality in the stochastic case
13:8
• Recall the Bellman equation:
Richard E. Bellman
(1920—1984)
Bellman’s principle of optimality
Q∗ (s, a) = R(s, a) + γ
• Q-Iteration:
B
P
s0
P (s0 | s, a) maxa0 Q∗ (s0 , a0 )
(initialize Qk=0 (s, a) = 0)
∀s,a : Qk+1 (s, a) = R(s, a) + γ
A
X
P (s0 |s, a) max
Qk (s0 , a0 )
0
a
s0
A opt ⇒ B opt
stopping criterion:
maxs,a |Qk+1 (s, a) − Qk (s, a)| ≤ • Note that Q∗ is a fixed point of Q-Iteration!
h
i
V ∗ (s) = max R(s, a) + γ s0 P (s0 | s, a) V ∗ (s0 )
a
h
i
P
∗
π (s) = argmax R(s, a) + γ s0 P (s0 | s, a) V ∗ (s0 )
P
• Q-Iteration converges to the optimal state-action value function
Q∗
a
13:12
13:9
Proof of convergence
Value Iteration
• Let Dk = ||Q∗ − Qk ||∞ = maxs,a |Q∗ (s, a) − Qk (s, a)|
• How can we use this to compute V ∗ ?
• Recall the Bellman optimality equation:
Qk+1 (s, a) = R(s, a) + γ
X
≤ R(s, a) + γ
X
s0
∗
h
V (s) = maxa R(s, a) + γ
P
s0
0
∗
0
P (s | s, a) V (s )
i
s0
P (s0 |s, a) max
Qk (s0 , a0 )
0
a
h
i
∗ 0
0
P (s0 |s, a) max
Q
(s
,
a
)
+
D
k
0
a
h
i
X
∗ 0
0
= R(s, a) + γ
P (s0 |s, a) max
Q
(s
,
a
)
+ γDk
0
a
s0
• Learning
– Temporal Difference & Q-learning
– Limitations of the model-free view
– Model-based RL
= Q∗ (s, a) + γDk
similarly: Qk ≥ Q∗ − Dk ⇒ Qk+1 ≥ Q∗ − γDk
• Exploration
• Briefly
• The proof translates directly also to value iteration
13:13
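To tie the updates of slides 13:10–13:12 together, here is my own tabular Q-Iteration sketch (not lecture code); the data structures P[a] and R are assumptions, with P[a][s, s'] = P(s'|s, a) and R an |S|×|A| reward table.

   import numpy as np

   def q_iteration(P, R, gamma=0.9, eps=1e-6):
       # returns the optimal Q-function as an (S, A) array
       S, A = R.shape
       Q = np.zeros((S, A))
       while True:
           V = Q.max(axis=1)                                  # V_k(s) = max_a Q_k(s, a)
           Q_new = np.array([R[:, a] + gamma * P[a] @ V for a in range(A)]).T
           if np.abs(Q_new - Q).max() <= eps:                 # stopping criterion
               return Q_new
           Q = Q_new

   # greedy policy: pi(s) = argmax_a Q*(s, a)
   # Value Iteration is the special case that only stores V_{k+1}(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V_k(s')]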
– Imitation Learning & Inverse RL
– Continuous states and actions (LSPI, Policy Gradients)
13:16
For completeness*
Learning in MDPs
• While interacting with the world, the agent collects data of the
• Policy Evaluation computes V π instead of V ∗ : Iterate:
form
π
∀s : Vk+1
(s) = R(s, π(s)) + γ
P
s0
D = {(st , at , rt , st+1 )}Tt=1
P (s0 |s, π(s)) Vkπ (s0 )
(state, action, immediate reward, next state)
Or use matrix inversion V π = (I −γP π )−1 Rπ , which is O(|S|3 ).
What could we learn from that?
• Policy Iteration uses V π to incrementally improve the policy:
• Model-based RL:
learn to predict next state: estimate P (s0 |s, a)
1. Initialise π0 somehow (e.g. randomly)
learn to predict immediate reward: estimate P (r|s, a)
2. Iterate:
– Policy Evaluation: compute V
πk
πk
or Q
– Policy Update: πk+1 (s) ← argmaxa Qπk (s, a)
• Model-free RL:
learn to predict value: estimate V (s) or Q(s, a)
demo:
test/mdp runPI
13:14
• Policy search:
e.g., estimate the “policy gradient”, or directly use black box
Towards Learning
(e.g. evolutionary) search
13:17
• From Sutton & Barto’s Reinforcement Learning book:
The term dynamic programming (DP) refers to a collection of algorithms that can
be used to compute optimal policies given a perfect model of the environment as
a Markov decision process (MDP). Classical DP algorithms are of limited utility
in reinforcement learning both because of their assumption of a perfect model
and because of their great computational expense, but they are still important
theoretically. DP provides an essential foundation for the understanding of the
methods presented in the rest of this book. In fact, all of these methods can
be viewed as attempts to achieve much the same effect as DP, only with less
computation and without assuming a perfect model of the environment.
Let’s introduce basic model-free methods first.
D = {(s, a, r, s0 )t }Tt=0
learn
→
V (s)
→
π(s)
13:18
• So far, we introduced basic notions of an MDP and value func-
Temporal difference (TD) learning with V
tions and methods to compute optimal policies assuming that
• Recall the recursive property of V (s):
we know the world (know P (s0 |s, a) and R(s, a))
Value Iteration and Q-Iteration are instances of Dynamic Programming
V π (s) = R(s, π(s)) + γ
P
s0
P (s0 | s, π(s)) V π (s0 )
• TD learning: Given a new experience (s, a, r, s0 )
Vnew (s) = (1 − α) Vold (s) + α [r + γVold (s0 )]
• Reinforcement Learning?
= Vold (s) + α [r + γVold (s0 ) − Vold (s)] .
13:15
Outline
• Markov Decision Processes as formal model
– Definition
– Value/Q-function
– Planning as computing V /Q given a model
• Reinforcement:
– more reward than expected (r > Vold (s) − γVold (s0 ))
→ increase V (s)
– less reward than expected (r < Vold (s) − γVold (s0 ))
→ decrease V (s)
13:19
• Reinforcement:
– more reward than expected (r > Qold (s, a)−γ maxa0 Qold (s0 , a0 ))
Temporal difference (TD) learning with Q
→ increase Q(s, a)
– less reward than expected (r < Qold (s, a)−γ maxa0 Qold (s0 , a0 ))
• Recall the recursive property of Q(s, a):
→ decrease Q(s, a)
π
Q (s, a) = R(s, a) + γ
P
s0
0
∗
0
0
P (s |s, a) Q (s , π(s ))
13:22
Q-learning = off-policy TD learning with Q∗
• TD learning: Given a new experience (s, a, r, s0 , a0 = π(s0 ))
Qnew (s, a) = (1 − α) Qold (s, a) + α [r + γQold (s0 , a0 )]
• Off-policy: We estimate Q∗ while executing π
= Qold (s, a) + α [r + γQold (s0 , a0 ) − Qold (s, a)]
• Q-learning:
1:
• Reinforcement:
2:
– more reward than expected (r > Qold (s, a) − γQold (s0 , a0 ))
3:
4:
→ increase Q(s, a)
5:
– less reward than expected (r < Qold (s, a) − γQold (s0 , a0 ))
6:
7:
→ decrease Q(s, a)
8:
13:20
9:
10:
Sarsa = on-policy TD learning with Q
Initialize Q(s, a) = 0
repeat
// for each episode
Initialize start state s
repeat
// for each step of episode
Choose action a ≈ argmaxa Q(s, a)
Take action a, observe r, s0
Q(s, a) ← Q(s, a) + α [r + γ maxa0 Qold (s0 , a0 ) −
Qold (s, a)]
s ← s0 , a ← a0
until end of episode
until happy
13:23
• On-policy: We estimate Qπ while executing π
Q-learning convergence with prob 1
1:
2:
3:
4:
5:
• Sarsa:
6:
7:
8:
9:
10:
11:
Initialize Q(s, a) = 0
repeat
// for each episode
Initialize start state s
Choose action a ≈ argmaxa Q(s, a)
repeat
// for each step of episode
Take action a, observe r, s0
Choose action a0 ≈ argmaxa0 Q(s0 , a0 )
Q(s, a) ← Q(s, a)+α [r+γQold (s0 , a0 )−Qold (s, a)]
s ← s0 , a ← a0
until end of episode
until happy
• Q-learning is a stochastic approximation of Q-Iteration:
Q-learning: Qnew (s, a) = (1 − α)Qold (s, a) + α[r + γ maxa0 Qold (s0 , a0 )]
Q-Iteration: P
∀s,a : Qk+1 (s, a) =
R(s, a) + γ s0 P (s0 |s, a) maxa0 Qk (s0 , a0 )
We’ve shown convergence of Q-Iteration to Q∗
• Convergence of Q-learning:
Q-Iteration is a deterministic update: Qk+1 = T (Qk )
• ε-greedy action selection:
   a ≈ argmaxa Q(s, a)  ⇐⇒  a = { random action with prob. ε ; argmaxa Q(s, a) else }

Q-learning is a stochastic version: Qk+1 = (1−α)Qk + α[T (Qk ) + ηk ]
ηk is zero mean!
13:24
13:21
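The Q-learning loop of slide 13:23, with the ε-greedy action selection of slide 13:21, written out as my own sketch; env is a hypothetical episodic environment with reset() and step(a) returning (next_state, reward, done).

   import random
   from collections import defaultdict

   def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
       Q = defaultdict(float)                      # Q[(s, a)], initialized to 0
       for _ in range(episodes):
           s = env.reset()
           done = False
           while not done:
               if random.random() < eps:           # epsilon-greedy exploration
                   a = random.choice(actions)
               else:
                   a = max(actions, key=lambda b: Q[(s, b)])
               s_next, r, done = env.step(a)
               td_target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
               Q[(s, a)] += alpha * (td_target - Q[(s, a)])   # TD update
               s = s_next
       return Q

Replacing the max over next actions by the Q-value of the action actually chosen next turns this into the on-policy Sarsa variant of slide 13:21.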
Q-learning impact
Q-learning
• Q-Learning was the first provably convergent direct adaptive op-
• Recall the Bellman optimality equation for the Q-function:
Q∗ (s, a) = R(s, a) + γ
P
s0
timal control algorithm
P (s0 |s, a) maxa0 Q∗ (s0 , a0 )
0
• Q-learning (Watkins, 1988) Given a new experience (s, a, r, s )
Qnew (s, a) = (1 − α) Qold (s, a) + α [r + γmax
Qold (s0 , a0 )]
0
a
= Qold (s, a) + α [r − Qold (s, a) + γ max
Qold (s0 , a0 )]
0
a
• Great impact on the field of Reinforcement Learning in 80/90ies
– “Smaller representation than models”
– “Automatically focuses attention to where it is needed,”
i.e., no sweeps through state space
– Can be made more efficient with eligibility traces
13:25
Eligibility traces
TD-Gammon notes
• Temporal Difference: based on single experience (s0 , r0 , s1 )
• Choose features as raw position inputs (number of pieces at
each place)
Vnew (s0 ) = Vold (s0 ) + α[r0 + γVold (s1 ) − Vold (s0 )]
→ as good as previous computer programs
• Longer experience sequence, e.g.: (s0 , r0 , r1 , r2 , s3 )
• Using previous computer program’s expert features
Temporal credit assignment, think further backwards: receiv-
→ world-class player
ing r3 also tells us something about V (s0 )
Vnew (s0 ) = Vold (s0 ) + α[r0 + γr1 + γ 2 r2 + γ 3 Vold (s3 ) − Vold (s0 )]
• TD(λ): remember where you’ve been recently (“eligibility trace”)
and update those values as well:
• Kit Woolsey was world-class player back then:
– TD-Gammon particularly good on vague positions
– not so good on calculable/special positions
– just the opposite to (old) chess programs
e(st ) ← e(st ) + 1
∀s : Vnew (s) = Vold (s) + α e(s) [rt + γVold (st+1 ) − Vold (st )]
• See anotated matches: http://www.bkgm.com/matches/
woba.html
∀s : e(s) ← γλe(s)
• Core topic of Sutton & Barto book
• Good example for
– value function approximation
→ great improvement of basic RL algorithms
13:26
– game theory, self-play
TD(λ), Sarsa(λ), Q(λ)
13:29
Detour: Dopamine
• TD(λ):
∀s : V (s) ← V (s) + α e(s) [rt + γVold (st+1 ) − Vold (st )]
• Sarsa(λ)
∀s,a : Q(s, a) ← Q(s, a) + α e(s, a) [r + γQold (s0 , a0 ) − Qold (s, a)]
• Q(λ)
∀s,a : Q(s, a) ← Q(s, a) + α e(s, a) [r + γ maxa0 Qold (s0 , a0 ) −
Montague, Dayan & Sejnowski: A Framework for Mesencephalic
Dopamine Systems based on Predictive Hebbian Learning. Jour-
Qold (s, a)]
13:27
nal of Neuroscience, 16:1936-1947, 1996.
13:30
TD-Gammon, by Gerald Tesauro
So what does that mean?
(See section 11.1 in Sutton & Barto’s book.)
• MLP to represent the value function V (s)
– We derived an algorithm from a general framework
– This algorithm involves a specific variable (reward residual)
– We find a neural correlate of exactly this variable
Great!
Devil’s advocate:
• Only reward given at end of game for win.
• Self-play: use the current policy to sample moves on both sides!
– Does not proof that TD learning is going on
Only that an expected reward is compared with a experienced reward
• random policies → games take up to thousands of steps. Skilled
– Does not discriminate between model-based and modelfree
(Both can induce an expected reward)
players ∼ 50 − 60 steps.
• TD(λ) learning (gradient-based update of NN weights)
13:28
13:31
Limitations of the model-free view
• Learning
– Temporal Difference & Q-learning
– Limitations of the model-free view
– Model-based RL
• Given learnt values, behavior is a fixed SR (or state-action) mapping
• If the “goal” changes: need to re-learn values for every state in
• Exploration
the world! all previous values are obsolete
• Briefly
– Imitation Learning & Inverse RL
– Continuous states and actions (LSPI, Policy Gradients)
• No general “knowledge”, only values
• No anticipation of general outcomes (s0 ), only of value
13:35
• No “planning”
13:32
dynamic prog.
V (s)
policy
π(s)
learn policy
π(s)
Model−free
policy
π(s)
optimize policy
π(s)
learn latent costs
R(s, a)
dynamic prog.
V (s)
Inverse RL
learn value fct.
V (s)
Imitation Learning
learn model
P (s0 |s, a)
R(s, a)
demonstration data
D = {(s0:T , a0:T )d }n
d=1
Policy Search
experience data
D = {(st , at , rt )}Tt=0
Model−based
By definition, goal-directed behavior is performed to obtain a desired
goal. Although all instrumental behavior is instrumental in achieving its
contingent goals, it is not necessarily purposively goal-directed. Dickinson and Balleine [1,11] proposed that behavior is goal-directed if: (i)
it is sensitive to the contingency between action and outcome, and (ii)
the outcome is desired. Based on the second condition, motivational
manipulations have been used to distinguish between two systems of
action control: if an instrumental outcome is no longer a valued goal (for
instance, food for a sated animal) and the behavior persists, it must not
be goaldirected. Indeed, after moderate amounts of training, outcome
revaluation brings about an appropriate change in instrumental actions
(e.g. leverpressing) [43,44], but this is no longer the case for extensively trained responses ([30,31], but see [45]). That extensive training
can render an instrumental action independent of the value of its consequent outcome has been regarded as the experimental parallel of the
folk psychology maxim that wellperformed actions become habitual [9]
(see Figure I).
Five approaches to learning behavior
policy
π(s)
13:36
Niv, Joel & Dayan: A normative perspective on motivation. TICS,
Imitation Learning
10:375-381, 2006.
13:33
D = {(s0:T , a0:T )d }n
d=1
learn/copy
→
π(s)
Model-based RL
D = {(s, a, r, s0 )t }Tt=0
learn
→
P (s0 |s, a)
DP
→
V (s)
→
π(s)
• Model learning: Given data D = {(st , at , rt , st+1 )}Tt=1 estimate
• Use ML to imitate demonstrated state trajectories x0:T
Literature:
0
P (s |s, a) and R(s, a)
Atkeson & Schaal: Robot learning from demonstration (ICML 1997)
Schaal, Ijspeert & Billard: Computational approaches to motor learning
by imitation (Philosophical Transactions of the Royal Society of London.
Series B: Biological Sciences 2003)
For instance:
– discrete state-action: Pˆ (s0 |s, a) =
#(s0 ,s,a)
#(s,a)
– continuous state-action: Pˆ (s0 |s, a) = N(s0 | φ(s, a)>β, Σ)
estimate parameters β (and perhaps Σ) as for regression
(including non-linear features, regularization, cross-validation!)
Grimes, Chalodhorn & Rao: Dynamic Imitation in a Humanoid Robot
through Nonparametric Probabilistic Inference. (RSS 2006)
Rudiger
Dillmann: Teaching and learning of robot tasks via observation
¨
of human performance (Robotics and Autonomous Systems, 2004)
13:37
• Planning, for instance:
– discrete state-action:
model
Value Iteration with the estimated
– continuous state-action: Least Squares Value Iteration
Stochastic Optimal Control (Riccati, Differential Dynamic
Prog.)
13:34
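For the discrete case above, model learning reduces to counting; here is a minimal sketch (my own) of estimating P̂(s'|s, a) and R̂(s, a) from experience tuples, which can then be fed into Value Iteration as sketched earlier.

   from collections import defaultdict

   def estimate_model(data):
       # data: list of (s, a, r, s_next) tuples
       trans_counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s_next: count}
       reward_sum = defaultdict(float)
       sa_counts = defaultdict(int)
       for s, a, r, s_next in data:
           trans_counts[(s, a)][s_next] += 1
           reward_sum[(s, a)] += r
           sa_counts[(s, a)] += 1
       P_hat = {sa: {s2: c / sa_counts[sa] for s2, c in nxt.items()}
                for sa, nxt in trans_counts.items()}           # #(s',s,a) / #(s,a)
       R_hat = {sa: reward_sum[sa] / sa_counts[sa] for sa in sa_counts}
       return P_hat, R_hat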
Imitation Learning
• There a many ways to imitate/copy the oberved policy:
Learn a density model P (at | st )P (st ) (e.g., with mixture of Gaussians) from the observed data and use it as policy (Billard et al.)
Outline
• Markov Decision Processes as formal model
– Definition
– Value/Q-function
– Planning as computing V /Q given a model
Or trace observed trajectories by minimizing perturbation costs
(Atkeson & Schaal 1997)
13:38
Imitation Learning
(Abbeel & Ng, ICML 2004)
13:41
Atkeson & Schaal
13:42
13:39
Continuous state/actions in model-free RL
Inverse RL
• All of this is fine in small finite state & action spaces.
D = {(s0:T , a0:T )d }n
d=1
learn
→
R(s, a)
DP
→
V (s)
→
π(s)
Q(s, a) is a |S| × |A|-matrix of numbers.
π(a|s) is a |S| × |A|-matrix of numbers.
• Use ML to “uncover” the latent reward function in observed behavior
• In the following:
– optimize a parameterized π(a|s) (policy search)
Literature:
13:43
Pieter Abbeel & Andrew Ng: Apprenticeship learning via inverse reinforcement learning (ICML 2004)
Andrew Ng & Stuart Russell: Algorithms for Inverse Reinforcement Learning (ICML 2000)
Nikolay Jetchev & Marc Toussaint: Task Space Retrieval Using Inverse
Feedback Control (ICML 2011).
Policy gradients
• In continuous state/action case, represent the policy as linear in
arbitrary state features:
13:40
π(s) =
k
X
φj (s)βj = φ(s)>β
(deterministic)
j=1
Inverse RL (Apprenticeship Learning)
• Given: demonstrations D = {xd0:T }n
d=1
π(a | s) = N(a | φ(s)>β, Σ)
(stochastic)
with k features φj .
• Try to find a reward function that discriminates demonstrations from other policies
• Basically, given an episode ξ = (st , at , rt )H
t=0 , we want to esti-
– Assume the reward function is linear in some features R(x) =
mate
w>φ(x)
∂V (β)
∂β
– Iterate:
1. Given a set of candidate policies {π0 , π1 , ..}
13:44
2. Find weights w that maximize the value margin between
Policy Gradients
teacher and all other candidates
max ξ
w,ξ
s.t. ∀πi :
w>hφiD
| {z }
value of demonstrations
≥ w>hφiπi +ξ
| {z }
value of πi
2
||w|| ≤ 1
3. Compute a new candidate policy πi that optimizes R(x) =
>
w φ(x) and add to candidate list.
• One approach is called REINFORCE:
Z
Z
∂V (β)
∂
∂
=
P (ξ|β) R(ξ) dξ = P (ξ|β)
log P (ξ|β)R(ξ)dξ
∂β
∂β
∂β
= Eξ|β {
H
H
X
∂ log π(at |st ) X t0 −t
∂
log P (ξ|β)R(ξ)} = Eξ|β {
γt
γ
rt0 }
∂β
∂β
t=0
t0 =t
|
{z
}
Qπ (st ,at ,t)
∂V (β)
∂β
• Another is PoWER, which requires
β←β+
Eξ|β {
PH
Eξ|β {
Pt=0
H
=0
13:48
t Qπ (st , at , t)}
t=0
Basic topics not covered
Qπ (st , at , t)}
• Partial Observability (POMDPs)
See: Peters & Schaal (2008): Reinforcement learning of motor skills with policy
gradients, Neural Networks.
Kober & Peters: Policy Search for Motor Primitives in Robotics, NIPS 2008.
Vlassis, Toussaint (2009): Learning Model-free Robot Control by a Monte Carlo
EM Algorithm. Autonomous Robots 27, 123-130.
13:45
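The REINFORCE expectation above can be approximated from sampled episodes. Below is my own minimal sketch (not from the cited papers) for a linear-Gaussian policy π(a|s) = N(a | φ(s)ᵀβ, σ²) with a scalar action, using the log-likelihood gradient ∂/∂β log π(a_t|s_t) = (a_t − φ(s_t)ᵀβ) φ(s_t) / σ².

   import numpy as np

   def reinforce_gradient(episodes, phi, beta, sigma=1.0, gamma=0.99):
       # episodes: list of trajectories [(s_0, a_0, r_0), ..., (s_H, a_H, r_H)]
       # returns a Monte Carlo estimate of dV(beta)/dbeta
       grad = np.zeros_like(beta)
       for traj in episodes:
           H = len(traj)
           for t, (s, a, r) in enumerate(traj):
               # Q^pi(s_t, a_t, t) estimated by the discounted return from t onward
               ret = sum(gamma ** (t2 - t) * traj[t2][2] for t2 in range(t, H))
               feat = phi(s)
               dlogpi = (a - feat @ beta) * feat / sigma ** 2
               grad += (gamma ** t) * dlogpi * ret
       return grad / len(episodes)

A plain gradient ascent step beta += step_size * reinforce_gradient(...) then corresponds to the vanilla policy gradient; PoWER replaces this by the reweighted update shown above.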
What if the agent does not observe the state st ? → The policy π(at | bt )
needs to build on an internal representation, called belief βt .
• Continuous state & action spaces, function approximation in RL
• Predictive State Representations, etc etc...
13:49
Kober & Peters: Policy Search for Motor Primitives in Robotics, NIPS 2008.
13:46
policy
π(s)
Model−free
policy
π(s)
learn policy
π(s)
learn latent costs
R(s, a)
dynamic prog.
V (s)
Inverse RL
dynamic prog.
V (s)
optimize policy
π(s)
Imitation Learning
learn value fct.
V (s)
Model−based
learn model
P (s0 |s, a)
R(s, a)
demonstration data
D = {(s0:T , a0:T )d }n
d=1
Policy Search
experience data
D = {(st , at , rt )}Tt=0
policy
π(s)
– Policy gradients are one form of policy search.
– There are other, direct policy search methods
(plain stochastic search, e.g. “Covariance Matrix Adaptation”)
13:47
Conclusions
• Markov Decision Processes and RL provide a solid framework
for describing behavioural learning & planning
• Little taxonomy:
policy
π(s)
learn policy
π(s)
learn latent costs
R(s, a)
dynamic prog.
V (s)
Inverse RL
policy
π(s)
optimize policy
π(s)
Imitation Learning
dynamic prog.
V (s)
Model−free
learn value fct.
V (s)
Model−based
learn model
P (s0 |s, a)
R(s, a)
demonstration data
D = {(s0:T , a0:T )d }n
d=1
Policy Search
experience data
D = {(st , at , rt )}Tt=0
policy
π(s)
16 Reinforcement Learning – Exploration
• Don’t forget the difference between learning and planning. Planning (solving an MDP / calculating optimal state and action val-
Exploration is fundamental intelligent behavior
ues) is the problem of adapting the behavior based on a given
model
14:3
• Try to control the data
you collect!
– Make decisions that
lead to interesting
data
ε-greedy exploration in Q-learning
1:
2:
3:
– Reflect
your
own
knowledge;
know
what you don’t know,
what you could learn
4:
5:
– Go to situations where
you might learn something
• Curiousity, fun, intrinsic
motivation, life-long learning
6:
7:
8:
9:
10:
Initialize Q(s, a) = 0
repeat[for each episode]
Initialize start state s
repeat[for each step of episode]
Choose action a = { random with prob. ε ; argmaxa Q(s, a) else }
Take action a, observe r, s0
Qnew (s, a)
←
Qold (s, a)
γ maxa0 Qold (s0 , a0 ) − Qold (s, a)]
s ← s0
until end of episode
until happy
=
+
α
[r
+
14:1
• Estimate Q(s, a) converges to Q∗ (s, a) with infinite number of
Recall Markov decision processes
a0
a1
a2
s0
s1
s2
r0
r1
samples for (s, a) and appropriate α (Watkins and Dayan, 1992)
• Off-policy learning:
– Q-learning estimates π ∗ in form of Q∗ (line 7)
– However, Q-learning does not execute π ∗ (or its current estimate thereof), but -greedy to ensure to explore (line 5)
r2
14:4
P (s0:T , a0:T , r0:T ; π) =
P (s0 )P (a0 |s0 ; π)P (r0 |a0 , s0 )
QT
t=1
P (st |at-1 , st-1 )P (at |st ; π)P (rt |at , st )
“Exploration-exploitation tradeoff”
• Two different types of behavior:
– exploration: act with the goal to learn as much as possible;
perform actions with unknown rewards / outcomes / values
– world’s initial state distribution P (s0 )
– world’s transition probabilities P (st+1 | at , st )
– world’s reward probabilities P (rt | at , st )
– exploitation: act with the goal of getting as much reward as
possible;
– discount parameter γ for future rewards
perform actions which are known to produce large reward / value
– agent’s policy π(at | st ) (or deterministic at = π(st ))
• Exploration-exploitation tradeoff:
not part of the world model!
– two different sources of uncertainty: the world itself (not con-
be sure not to miss states and actions with large rewards;
but do not waste too much time in low-reward states and actions
trolled by the agent) vs. the policy (controlled by the agent)
14:5
14:2
Recall reinforcement learning
P
t
• Agent wants to maximize its future rewards E[ ∞
t=0 γ rt | s0 ; π]
Sample Complexity
• Let A be an RL algorithm which acts in an unknown MDP, resulting in s0 , a0 , r0 , s1 , a1 , r1 , . . .
• Agent starts without a world model
→ no P (st+1 | at , st ), no P (rt | at , st ), no V ∗ (s), no Q∗ (s, a)
• Agent needs to learn from experience s0 , a0 , r0 , s1 , a1 , r1 , . . .
which actions lead to high rewards
– model-based RL: learn world model and then plan
– model-free RL: learn V and Q directly (Q-learning, TDlearning)
• How can we describe and judge the exploration efficiency of A
in formal terms?
• Definition (Kakade, 2003):
P
s
∗
Let VtA = E[ ∞
s=0 γ rt+s | s0 , a0 , r0 . . . st−1 , at−1 , rt−1 , st ]. V
is the value function of the optimal policy.
Let ε > 0 be a prescribed accuracy.
The sample complexity of A is the number of timesteps t such
that VtA (st ) < V ∗ (st ) − ε
This is the number of timesteps where the policy of A is more
than worse than the optimal policy
14:6
E 3 sketch*
PAC-MDP efficiency
• Let δ > 0 be an allowed probability of failure.
A is called PAC-MDP efficient if with probability 1 − δ its sample
complexity scales polynomially in δ, and quantities describing
the MDP
• PAC: probably (δ) approximately (ε) correct
The PAC framework is fundamental to frequentist learning theory.
For instance, it can be used to derive guarantees on the generalization performance of support vector machines
Input: State s
Output: Action a
1: if s is known then
2:
Plan in MDPknown
// Sufficiently accurate model
estimates
3:
if resulting plan has value above some threshold then
4:
return first action of plan
// Exploitation
5:
else
6:
Plan in MDPunknown
7:
return first action of plan
// Planned exploration
8:
end if
9: else
10:
return action with the least observations in s// Direct
exploration
11: end if
14:10
• Quantities describing the MDP: number of states, number of
actions, discount factor γ, maximal reward Rmax > 0, parame-
E 3 example*
ters in the transition model P (s0 | s, a), . . .
14:7
PAC-MDP efficiency
• ε-greedy is not PAC-MDP efficient.
Its sample complexity is exponential in the number of states
(Whitehead, 1991)
S. Singh (Tutorial 2005)
• Examples of PAC-MDP efficient approaches:
– model-based: E 3 , R- MAX
14:11
– model-free: Delayed Q-learning
14:8
3
Explicit-Exploit-or-Explore (E ) algorithm*
E 3 example*
Kearns and Singh (2002)
• PAC-MDP efficient model-based RL algorithm
• Based on two previously established key ideas:
– counts c(s, a) for states and actions to quantify model
confidence: s is known if all actions in s sufficiently often
executed
– optimism in the face of uncertainty: unknown states are
assumed to give maximum reward (whose value is known)
• E 3 uses two MDPs:
S. Singh (Tutorial 2005)
– MDPknown : known states with (approximately exact) estimates of P (st+1 | st , at ) and P (rt | st , at )
→ captures what you know and drives exploitation
– MDPunknown : MDPknown without reward + special state s0
where the agent receives maximum reward
→ drives exploration
14:12
E 3 example*
• Rmax:
(
Rmax
– R (s, a) = ˆ
θrsa
∗
if #s,a < n
, P ∗ (s0 |s, a) =
otherwise
(
ds0 s∗
θˆs0 sa
if #s
other
– Guarantees over-estimation of values, polynomial PAC results!
– Read about “KWIK-Rmax”! (Li, Littman, Walsh, Strehl, 2011)
• Bayesian Exploration Bonus (BEB), Kolter & Ng (ICML 2009)
– Choose P ∗ (s0 |s, a) = P (s0 |s, a, b) integrating over the current belief b(θ) (non-over-confident)
β
– But choose R∗ (s, a) = θˆrsa +
with a hyperparam1+α0 (s,a)
eter α0 (s, a), over-estimating return
• Confidence intervals for V -/Q-function (Kealbling ’93, Dearden
Mˆ : estimated known state MDP
M : true known state MDP
et al. ’99)
14:16
S. Singh (Tutorial 2005)
14:13
More ideas about exploration
R- MAX
• Intrinsic rewards for learning progress
– “fun”, “curiousity”
Brafman and Tennenholtz (2002)
• similar to E 3 ; implicit instead of explicit exploration
– in addition to the external “standard” reward of the MDP
• based on reward function
– “Curious agents are interested in learnable but yet unknown
regularities, and get bored by both predictable and inherently unpredictable things.” (J. Schmidhuber)
RR- MAX (s, a) =
– Use of a meta-learning system which learns to predict the
error that the learning machine makes in its predictions;
meta-predictions measure the potential interestingness of
situations (Oudeyer et al.)
R(s, a) c(s, a) ≥ m (s, a known)
Rmax
c(s, a) < m (s, a unknown)
• Is PAC-MDP efficient
• Optimism in the face of uncertainty
14:14
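A hedged sketch (my own, not the original algorithm's code) of the R-MAX reward substitution from this slide: state–action pairs visited fewer than m times are treated as if they returned the maximum reward, so that any planner run on the optimistic model drives the agent toward them.

   def rmax_reward(R_hat, counts, states, actions, m, R_max):
       # R_hat[(s, a)]: estimated mean reward; counts[(s, a)]: visit counts
       R_opt = {}
       for s in states:
           for a in actions:
               if counts.get((s, a), 0) >= m:        # (s, a) is "known"
                   R_opt[(s, a)] = R_hat[(s, a)]
               else:                                  # unknown -> optimism
                   R_opt[(s, a)] = R_max
       return R_opt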
• Dimensionality reduction for model-based exploration in continuous spaces: low-dimensional representation of the transition
Bayesian RL
function; focus exploration on relevant dimensions (A. Nouri,
• There exists an optimal solution to the exploration-exploitation
M. Littman)
14:17
Bayesian RL

• There exists an optimal solution to the exploration-exploitation trade-off: belief planning (see my tutorial "Bandits, Global Optimization, Active Learning, and Bayesian RL – understanding the common ground")

V^π(b,s) = R(s, π(b,s)) + ∫_{b',s'} P(b',s' | b,s,π(b,s)) V^π(b',s')

– the agent maintains a distribution (belief) b(m) over MDP models m
– typically, the MDP structure is fixed; the belief is over the parameters
– the belief is updated after each observation (s,a,r,s'): b → b'
– only tractable for very simple problems

• Bayes-optimal policy π* = argmax_π V^π(b,s)
– no other policy leads to more rewards in expectation w.r.t. the prior distribution over MDPs
– solves the exploration-exploitation tradeoff

14:15

More ideas about exploration

• Exploration to reduce uncertainty of a belief p(x)
– in robotics, p(x) might be the belief about the robot position x
– entropy as a probabilistic measure of information:
H(p(x)) = −∫ p(x) log p(x) dx
H(p) is maximal if p is uniform, and minimal if p is a point mass distribution
– information gain of action a:
I(a) = H(p(x)) − E_z[ H(p(x' | z, a)) ]     (z is the potential observation)
i.e., the expected change of entropy when executing the action
– maximizing information gain = minimizing uncertainty in the belief

14:18
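For a discrete belief both quantities above can be evaluated directly. The following sketch (Python 3, numpy) uses a made-up toy sensor model; the belief, the likelihood and the numbers are assumptions chosen only to illustrate the formulas.

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def information_gain(prior, likelihood):
    """I(a) = H(p(x)) - E_z[ H(p(x | z, a)) ] for a discrete belief.
    prior: p(x), shape (n,); likelihood: P(z | x, a), shape (n, m)."""
    p_z = prior.dot(likelihood)                  # P(z | a)
    gain = entropy(prior)
    for z in range(likelihood.shape[1]):
        posterior = prior * likelihood[:, z] / p_z[z]
        gain -= p_z[z] * entropy(posterior)
    return gain

# toy example: uniform belief over 4 robot positions, a sensor that reports
# "left half" / "right half" correctly with probability 0.9
prior = np.full(4, 0.25)
lik = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
print(information_gain(prior, lik))

Comparing the information gain of several candidate actions and executing the maximizer is the greedy version of the uncertainty-reduction idea on the slide above.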
Optimistic heuristics

• As with UCB, choose estimators for R*, P* that are optimistic/over-confident:

V_t(s) = max_a [ R*(s,a) + Σ_{s'} P*(s'|s,a) V_{t+1}(s') ]

• Rmax:
– R*(s,a) = Rmax if #_{s,a} < n, and R*(s,a) = θ̂_{rsa} otherwise;
  P*(s'|s,a) = δ_{s's*} if #_{s,a} < n, and P*(s'|s,a) = θ̂_{s'sa} otherwise
– guarantees over-estimation of values, polynomial PAC results!
– read about "KWIK-Rmax"! (Li, Littman, Walsh, Strehl, 2011)

• Bayesian Exploration Bonus (BEB), Kolter & Ng (ICML 2009)
– choose P*(s'|s,a) = P(s'|s,a,b), integrating over the current belief b(θ) (non-over-confident)
– but choose R*(s,a) = θ̂_{rsa} + β/(1+α_0(s,a)) with a hyperparameter α_0(s,a), over-estimating the return

• Confidence intervals for the V-/Q-function (Kaelbling '93, Dearden et al. '99)

14:16

Digression: Active Learning

Cohn, Ghahramani, Jordan (1996)

• active choice of learning examples in a supervised learning setting: learn a mapping X → Y from training examples D = {(x_i, y_i)}_{i=1}^m

• active learning protocol:
– select a new input x̃ (x̃ may be a query, experiment, action, ...)
– observe the resulting output ỹ
– incorporate the new example (x̃, ỹ) into D, relearn, and repeat

• crucial question: how to choose x̃?

• heuristics:
– where we don't have data
– where we perform poorly
– where we have low confidence
– where we expect it to change our model
– where we previously found data

• in the following: select x̃ in a statistically "optimal" manner

14:19
Digression: Active Learning (continued)

• Goal: minimize the variance of the prediction ŷ for a given x:

σ²_ŷ = E_D[ (ŷ − E_D[ŷ])² ]

which changes with a new training example (x̃, ỹ)

• Choose the x̃ which minimizes the expected predictive variance conditioned on having seen x̃:

⟨σ²_ŷ⟩ = E_{D∪(x̃,ỹ)}[ σ²_ŷ | x̃ ]

• How can we compute ⟨σ²_ŷ⟩? Monte Carlo approximation (sampling): evaluate at a set of reference points drawn from P(x)

14:20

Digression: Active Learning (continued)

• Example: mixture of Gaussians
– analytic solution for calculating the expected predictive variances of the learner

Cohn, Ghahramani, Jordan (1996)

14:21
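The selection criterion can be made concrete with a model for which the predictive variance is available in closed form. The sketch below (Python 3, numpy) uses Bayesian linear regression rather than the mixture of Gaussians from the slide; for this model the predictive variance does not depend on the observed output ỹ, so the expectation over ỹ is trivial. All data, prior parameters and shapes are assumptions made for the example; the Monte Carlo reference points mirror the slide's approximation of the average over P(x).

import numpy as np

def predictive_variances(X, Xref, sigma2=0.1, tau2=1.0):
    """Posterior predictive variance of Bayesian linear regression
    (weights ~ N(0, tau2 I), noise variance sigma2) at the points Xref,
    given training inputs X."""
    d = X.shape[1]
    A = X.T.dot(X) / sigma2 + np.eye(d) / tau2
    Ainv = np.linalg.inv(A)
    return sigma2 + np.einsum('ij,jk,ik->i', Xref, Ainv, Xref)

def select_query(X, candidates, Xref):
    # choose the candidate x~ whose addition minimizes the average predictive variance
    scores = [predictive_variances(np.vstack([X, x]), Xref).mean() for x in candidates]
    return candidates[int(np.argmin(scores))]

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5, 2))          # training inputs seen so far
candidates = rng.uniform(-1, 1, size=(20, 2))
Xref = rng.uniform(-1, 1, size=(200, 2))     # reference points drawn from P(x)
print(select_query(X, candidates, Xref))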
Conclusions

• Exploration is fundamental intelligent behavior

• RL agents need to solve the exploration-exploitation tradeoff

• Sample complexity is a measure of the exploration efficiency of an RL algorithm

• Ideas for driving exploration: random actions, optimism in the face of uncertainty, maximizing learning progress and information gain

• Active learning selects statistically optimal training data for efficient supervised learning

• Generalization over states and actions is crucial (→ relational RL)

14:22

References

• Brafman, Tennenholtz (2002): R-max – a general polynomial time algorithm for near-optimal RL. JMLR.
• Cohn, Ghahramani, Jordan (1996): Active learning with statistical models. JAIR.
• Kakade (2003): On the sample complexity of RL. PhD thesis.
• Kearns, Singh (2002): Near-optimal reinforcement learning in polynomial time. Machine Learning Journal.
• Li (2009): A unifying framework for computational RL theory. PhD thesis.
• Nouri, Littman (2010): Dimension reduction and its application to model-based exploration in continuous spaces. Machine Learning Journal.
• Oudeyer, Kaplan, Hafner (2007): Intrinsic motivation systems for autonomous mental development. IEEE Evolutionary Computation.
• Schmidhuber (1991): Curious model-building control systems. In Int. Joint Conf. on Neural Networks.

14:23
17 Exercises
17.1 Exercise 1

17.1.1 First Steps

You will hand in your exercises in groups of (up to) three. If you did not write Johannes your group members, please do so as soon as possible. ([email protected])

All exercises will be in Python and handed in via Git. Make yourself familiar with both Python and Git by reading a few tutorials and examples. You can find links to some good free tutorials at the course website at https://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/14-ArtificialIntelligence/.

Log in to our GitLab system at https://sully.informatik.uni-stuttgart.de/gitlab/ with the account sent to you. If you did not receive an account yet, please email Johannes.

Create an SSH key (if you don't already have one) and upload it in your profile under "Profile Settings" and "SSH Keys".

$ ssh-keygen
$ cat ~/.ssh/id_rsa.pub

Clone your repository with:

$ git clone [email protected]:ai_lecture/group_[GROUP_NUMBER].git

17.1.2 Tree Search

In the repository you will find the directory e01-graphsearch with a couple of files. First there is e01-graphsearch.py with the boilerplate code for the exercise. The comments in the code define what each function is supposed to do. Implement each function and you are done with the exercise.

The second file you will find is tests.py. It consists of tests that check whether your functions do what they should. You don't have to care about this file, but you can have a look at it to understand the exercise better.

The next file is data.py. It consists of a very small graph and the S-Bahn net of Stuttgart as graph structures. It will be used by the tests. If you like, you can play around with the data in it.

The last file is run_tests.sh. It runs the tests, so that you can use them to check whether you are doing it right. Note that our test suite will be different from the one we hand to you, so just mocking each function with the desired output without actually computing it will not work. You can run the tests by executing:

$ sh run_tests.sh

If you are done implementing the exercise, simply commit your implementation and push it to our server:

$ git add e01-graphsearch.py
$ git commit
$ git push

Task: Implement breadth-first search, uniform-cost search, depth-limited search, iterative deepening search and A-star as described in the lecture (A-star will be the topic of the next lecture). All methods get as input a graph, a start state, and a goal state. Your methods should return two things: the path from start to goal, and the fringe at the moment when the goal state is found (the latter allows us to check the correctness of the implementation). The first return value should be the found Node (which has the path implicitly included through the parent links) and a Queue (one of the following: Queue, LifoQueue, PriorityQueue and NodePriorityQueue) object holding the fringe. You also have to fill in the priority computation at the put() method of the NodePriorityQueue.

Iterative deepening and depth-limited search are a bit different in that they do not explicitly have a fringe. You don't have to return a fringe in those cases, of course. Depth-limited search additionally gets a depth limit as input. A-star gets a heuristic function as input, which you can call like this:

def a_star_search(graph, start, goal, heuristic):
    # ...
    h = heuristic(node.state, goal)
    # ...

Tips:

– For those used to IDEs like Visual Studio or Eclipse: Install PyCharm (Community Edition). Start it in the git directory. Perhaps set the Keymap to 'Visual Studio' (which sets exactly the same keys for running and stepping in the debugger). That helps a lot.

– Use the data structure Node that is provided. It has exactly the attributes mentioned on slide 26.

– Maybe you don't have to implement the 'Tree-Search' and 'Expand' methods separately; you might want to put them in one little routine.
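To illustrate the expected structure, here is a minimal breadth-first search sketch (Python 3). It is not the course boilerplate: the simplified Node class, the plain deque used as fringe and the dict-of-neighbors graph are assumptions for this sketch; the provided Node and Queue classes in e01-graphsearch.py have more attributes and a different interface.

from collections import deque

class Node(object):
    """Simplified stand-in for the provided Node class (state + parent link only)."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent

    def path(self):
        node, p = self, []
        while node is not None:
            p.append(node.state)
            node = node.parent
        return list(reversed(p))

def breadth_first_search(graph, start, goal):
    """graph: dict mapping a state to the list of its neighbor states."""
    fringe = deque([Node(start)])
    visited = {start}
    while fringe:
        node = fringe.popleft()
        if node.state == goal:
            return node, fringe          # goal node and fringe at that moment
        for s in graph[node.state]:
            if s not in visited:         # graph search: avoid repeated states
                visited.add(s)
                fringe.append(Node(s, parent=node))
    return None, fringe

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
node, fringe = breadth_first_search(graph, "A", "D")
print(node.path())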
17.2 Exercise 2

This sheet contains in-class exercises ("Präsenzübungen"), some of which will be discussed in the tutorial session on 6.11. and which also serve as exam preparation. They are not to be handed in. Students will be asked at random to attempt the problems.

17.2.1 In-class exercise: Example of tree search

Consider the state space in which the start state is numbered 1 and the successor function for state n returns the states numbered 4n−2, 4n−1, 4n and 4n+1. Assume that this given order is also exactly the order in which the neighbors are traversed by expand and entered into the LIFO fringe.

• Draw the part of the state space that comprises states 1 to 21.

• Give the visiting order (a visit = [a node is taken from the fringe, goal-checked, and expanded]) for a depth-limited search with limit 2 and for an iterative deepening search, each with goal node 4. After each visit of a node, give the then-current content of the fringe. The initial fringe is [1]. For each visit, use roughly the notation:
visited state: [fringe after the visit]

• Does a finite state space always lead to a finite search tree? Justify your answer.

17.2.2 In-class exercise: Special cases of the search strategies

Prove the following statements:

• Breadth-first search is a special case of uniform-cost search.

• Breadth-first search, depth-first search and uniform-cost search are special cases of best-first search.

• Uniform-cost search is a special case of A* search.

• There are state spaces in which iterative deepening search has a higher runtime complexity than depth-first search (O(n²) vs. O(n)).

17.2.3 In-class exercise: Greedy best-first search and A* search

Consider the Romania map on slide 4 of 03-search.pdf:

• The straight-line-distance heuristic hSLD causes problems for greedy best-first search when we want to go from Iasi to Fagaras, but not in the opposite direction. Find a case in which greedy search with hSLD does not find the shortest path in either direction.

• Trace the route from Lugoj to Bucharest using an A* search, with the straight-line distance as the heuristic. List all nodes that are considered along the way and give the values of f, g and h for each of them.

• Give the shortest path found by the A* search.
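For the tree-search trace asked for in 17.2.1, a few lines of Python can be used to cross-check a hand-derived visiting order. This is only a sketch; in particular, the order in which successors are pushed onto the LIFO fringe (and hence the visiting order) follows one reading of the exercise text and may need to be adapted to the convention used in the tutorial.

def depth_limited_search(start, goal, limit):
    fringe = [(start, 0)]                        # LIFO fringe of (state, depth)
    while fringe:
        state, depth = fringe.pop()              # visit = pop, goal-check, expand
        if state == goal:
            print(state, ": goal found")
            return state
        if depth < limit:
            for s in (4*state - 2, 4*state - 1, 4*state, 4*state + 1):
                fringe.append((s, depth + 1))    # entered into the LIFO fringe in this order
        print(state, ":", [s for s, _ in fringe])
    return None

depth_limited_search(1, 4, 2)
# iterative deepening would call depth_limited_search with limit = 0, 1, 2, ...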
17.3 Exercise 3

17.3.1 Install Numpy

For this exercise you will need numpy. Please install a recent version of it. Installation notes can be found at http://www.scipy.org/install.html. Any reasonably recent version will do for this exercise (so using a version from the Ubuntu repository, for instance, is fine).

17.3.2 Constrained Satisfaction Problems

Pull the current exercise from our server to your local repository:

$ git pull

Task 1: Implement backtracking for the constrained satisfaction problem definition you find in csp.py. Make three different versions of it: 1) without any heuristic, 2) with minimal remaining values as heuristic but without tie-breaker (take the first best solution), 3) with minimal remaining values and the degree heuristic as tie-breaker.

Optional: Implement AC-3 or any approximate form of constraint propagation and activate it if the corresponding parameter is set.

Task 2: Implement a method to convert a Sudoku into a csp.ConstrainedSatisfactionProblem. The Sudoku is given as a numpy array; every empty field is set to 0. The CSP you create should cover all rules of a Sudoku, which are (from http://en.wikipedia.org/wiki/Sudoku):

Fill a 9 × 9 grid with digits so that each column, each row, and each of the nine 3 × 3 sub-grids that compose the grid (also called 'boxes', 'blocks', 'regions', or 'sub-squares') contains all of the digits from 1 to 9.

In the lecture we mentioned the all-different constraint for columns, rows, and blocks. As the csp.ConstrainedSatisfactionProblem only allows you to represent pair-wise unequal constraints (to facilitate constraint propagation), you need to convert this.
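The following sketch (Python 3) shows backtracking with the minimum-remaining-values heuristic on a generic representation; it is not the csp.py interface from the repository, and the degree tie-breaker of version 3 is left out for brevity. Domains are a dict, and constraints are pairs (x, y) meaning x ≠ y.

def legal_values(var, domains, constraints, assignment):
    vals = []
    for val in domains[var]:
        conflict = any((a == var and assignment.get(b) == val) or
                       (b == var and assignment.get(a) == val)
                       for (a, b) in constraints)
        if not conflict:
            vals.append(val)
    return vals

def backtracking(domains, constraints, assignment=None):
    """Backtracking with the minimum-remaining-values heuristic (sketch)."""
    if assignment is None:
        assignment = {}
    if len(assignment) == len(domains):
        return dict(assignment)
    unassigned = [v for v in domains if v not in assignment]
    # MRV: choose the variable with the fewest values consistent with the assignment
    var = min(unassigned, key=lambda v: len(legal_values(v, domains, constraints, assignment)))
    for val in legal_values(var, domains, constraints, assignment):
        assignment[var] = val
        result = backtracking(domains, constraints, assignment)
        if result is not None:
            return result
        del assignment[var]
    return None

# tiny map-coloring example: a triangle of countries plus one extra neighbor
domains = {v: [0, 1, 2] for v in "ABCD"}
constraints = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
print(backtracking(domains, constraints))

The Sudoku conversion of Task 2 amounts to generating exactly such pair-wise ≠ constraints for every pair of cells that share a row, a column or a 3 × 3 block.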
17.4 Exercise 4

This sheet contains in-class exercises that will be discussed in the tutorial session on 20.11. and that also serve as exam preparation. They are not to be handed in. Students will be asked at random to attempt the problems.

17.4.1 In-class exercise: CSP

Consider the following map excerpt (figure omitted). The map is to be colored with a total of 4 colors such that any two neighboring countries have different colors.

(a) With which country would one most likely begin?

(b) Color the first country and apply constraint propagation throughout.

17.4.2 In-class exercise: Generalized arc consistency

We have n variables x_i, each with the (current) domain D_i. Constraint propagation by establishing local constraint consistency ("arc consistency") in general means the following:

For a variable x_i and an adjacent constraint C_k, we delete all values v from D_i for which there exists no tuple τ ∈ D_{I_k} with τ_i = v that satisfies the constraint.

Consider a simple example:

x1, x2 ∈ {1, 2},   x3, x4 ∈ {2, .., 6},   c = AllDiff(x1, .., x4)

(a) How does constraint propagation from c to x3 update the domain D3?

(b) On http://norvig.com/sudoku.html Norvig describes his Sudoku solver, using the following rules for constraint propagation:

(1) If a square has only one possible value, then eliminate that value from the square's peers.

(2) If a unit (block, row or column) has only one possible place for a value, then put the value there.

Is this a general implementation of constraint propagation for the allDiff constraint?

Note: Generalized arc consistency is equivalent to so-called message passing (or belief propagation) in probabilistic networks, except that the messages are domain sets instead of belief vectors. See also www.lirmm.fr/~bessiere/stock/TR06020.pdf
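The definition of generalized arc consistency can be checked mechanically by brute force: for each value of the revised variable, search for a completing tuple in which all variables are pairwise different. The sketch below (Python 3) does exactly that for the AllDiff example above; the representation of domains as a dict of sets is an assumption for this sketch.

from itertools import product

def revise_alldiff(domains, var):
    """Remove from domains[var] every value that cannot be extended to a tuple
    in which all variables of the AllDiff constraint take pairwise different values."""
    others = [v for v in domains if v != var]
    new_domain = set()
    for value in domains[var]:
        for combo in product(*(domains[o] for o in others)):
            values = (value,) + combo
            if len(set(values)) == len(values):     # all different
                new_domain.add(value)
                break
    domains[var] = new_domain

domains = {"x1": {1, 2}, "x2": {1, 2}, "x3": set(range(2, 7)), "x4": set(range(2, 7))}
revise_alldiff(domains, "x3")
print(domains["x3"])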
17.5 Exercise 5

Submission deadline: 3 Dec, 24:00h. Please submit your solutions as a Python file (see the template in the email) with the name e05/e05_sol.py (in the directory e05) in your git account. If anything is unclear, please get in touch.

17.5.1 Satisfiability and validity (propositional logic)

Decide whether the following sentences are satisfiable, valid, or neither (none).

(a) Smoke ⇒ Smoke
(b) Smoke ⇒ Fire
(c) (Smoke ⇒ Fire) ⇒ (¬Smoke ⇒ ¬Fire)
(d) Smoke ∨ Fire ∨ ¬Fire
(e) ((Smoke ∧ Heat) ⇒ Fire) ⇔ ((Smoke ⇒ Fire) ∨ (Heat ⇒ Fire))
(f) (Smoke ⇒ Fire) ⇒ ((Smoke ∧ Heat) ⇒ Fire)
(g) Big ∨ Dumb ∨ (Big ⇒ Dumb)
(h) (Big ∧ Dumb) ∨ ¬Dumb

17.5.2 Enumerating models (propositional logic)

Consider propositional logic with the symbols A, B, C and D. In total there are 16 models. In how many models are the following sentences satisfied?

1. (A ∧ B) ∨ (B ∧ C)
2. A ∨ B
3. A ⇔ (B ⇔ C)

17.5.3 Unification (predicate logic)

For each pair of atomic sentences, give the most general unifier if it exists. Do not standardize apart any further. Return None if no unifier exists; otherwise return a dictionary with the variable as key and the constant as value. For example, for P(A), P(x):

sol3z = {'x': 'A'}

1. P(A, B, B), P(x, y, z).
2. Q(y, G(A, B)), Q(G(x, x), y).
3. Older(Father(y), y), Older(Father(x), John).
4. Knows(Father(y), y), Knows(x, x).

17.5.4 In-class exercise: Matching as constraint satisfaction problem

Consider the Generalized Modus Ponens (slide 09:15) for inference (forward and backward chaining) in first-order logic. Applying this inference rule requires finding a substitution θ such that p'_i θ = p_i θ for all i.

Show constructively that the problem of finding a substitution θ (also called the matching problem) is equivalent to a Constraint Satisfaction Problem. "Constructively" means: explicitly construct/define a CSP that is equivalent to the matching problem.

Note: The PDDL language to describe agent planning problems (slide 08:24) is similar to a knowledge base in Horn form. Checking whether the action preconditions hold in a given situation is exactly the matching problem; applying the Generalized Modus Ponens corresponds to the application of the action rule on the current situation.
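A compact unification routine can be used to check the answers to 17.5.3. The sketch below (Python 3) is one possible implementation, not the expected solution format: terms are represented as nested tuples, strings starting with a lowercase letter are treated as variables, and an occurs check is included (it is what makes pair 4 fail).

def is_var(t):
    return isinstance(t, str) and t[0].islower()

def substitute(t, s):
    if is_var(t):
        return substitute(s[t], s) if t in s else t
    if isinstance(t, tuple):
        return tuple(substitute(a, s) for a in t)
    return t

def occurs(v, t, s):
    t = substitute(t, s)
    if t == v:
        return True
    return isinstance(t, tuple) and any(occurs(v, a, s) for a in t)

def unify(x, y, s=None):
    """Return a most general unifier (a dict) of x and y, or None."""
    if s is None:
        s = {}
    x, y = substitute(x, s), substitute(y, s)
    if x == y:
        return s
    if is_var(x):
        return None if occurs(x, y, s) else dict(s, **{x: y})
    if is_var(y):
        return None if occurs(y, x, s) else dict(s, **{y: x})
    if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
        for xi, yi in zip(x, y):
            s = unify(xi, yi, s)
            if s is None:
                return None
        return s
    return None

print(unify(('P', 'A', 'B', 'B'), ('P', 'x', 'y', 'z')))
print(unify(('Knows', ('Father', 'y'), 'y'), ('Knows', 'x', 'x')))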
17.5.5 In-class exercise

In the lecture we discussed the case

"A first cousin is a child of a parent's sibling"

∀x, y  FirstCousin(x, y) ⇔ ∃p, z  Parent(p, x) ∧ Sibling(z, p) ∧ Parent(z, y)

A question was whether this is equivalent to

∀x, y, p, z  FirstCousin(x, y) ⇔ Parent(p, x) ∧ Sibling(z, p) ∧ Parent(z, y)

Let's simplify: Show that the following two sentences

∀x  A(x) ⇔ ∃y B(y, x)     (2)

∀x, y  A(x) ⇔ B(y, x)     (3)

are different. For this, bring both sentences into CNF as described on slides 09:21 and 09:22 of lecture 09-FOLinference.

17.6 Exercise 6

Submission deadline: 7 Jan 2015, 24:00h.

17.6.1 Chess

Implement a chess-playing program. The basic Python code for this is in your repositories. We have also already implemented the basic structure of the UCT algorithm, so that you only have to implement the individual functions. You may also shorten the search depth and use, e.g., the provided very simple evaluation function, or your own. It is up to you to implement a completely different algorithm instead (e.g., Minimax), as long as the following interface is respected:

class ChessPlayer(object):
    def __init__(self, game_board, player):
        # The game_board is the board at the beginning; player is
        # either chess.WHITE or chess.BLACK, depending on the player you are.
        pass

    def inform_move(self, move):
        # After each move (also your own) this function informs the player
        # of the move played (which can be different from the one you
        # chose, if you chose an illegal one).
        pass

    def get_next_move(self, board, secs):
        # Return a move you want to play on the given board within secs
        # seconds.
        pass

You can use the included python-chess library. This time there are no unit tests; instead, your ChessPlayer should be able to play against another player and should, if possible, win against a randomly playing player. You can test your implementation with

$ python2 interface.py --human

to play against your player as a human, or with

$ python2 interface.py --random

to let a randomly playing player compete against your program.
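As a baseline for testing the interface, a player that picks a uniformly random legal move can be sketched as follows. This is not the provided UCT skeleton; it assumes board.push() and board.legal_moves as exposed by recent python-chess versions, and the version bundled with the exercise may have a slightly different API.

import random
import chess  # python-chess

class RandomChessPlayer(object):
    """Minimal ChessPlayer that plays a uniformly random legal move."""
    def __init__(self, game_board, player):
        self.board = game_board
        self.player = player              # chess.WHITE or chess.BLACK

    def inform_move(self, move):
        self.board.push(move)             # keep our own board in sync with the game

    def get_next_move(self, board, secs):
        # ignore the time budget and return any legal move
        return random.choice(list(board.legal_moves))

Replacing get_next_move with a UCT (or Minimax) search over the legal moves, guided by the evaluation function, is the actual task.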
17.7 Exercise 7

This sheet contains in-class exercises that will be discussed in the tutorial session on 12.01. and that also serve as exam preparation. They are not to be handed in. Students will be asked at random to attempt the problems.

17.7.1 In-class exercise: Conditional probability

1. The probability of contracting a particular tropical disease is 0.02%. A test that determines whether one has the disease is correct in 99.995% of cases. How high is the probability of actually having the disease if the test comes out positive?

2. Another rare disease affects 0.005% of all people. A corresponding test is correct in 99.99% of cases. With what probability is one affected by the disease given a positive test result?

3. There is a new test for the disease from b), which is correct in 99.995% of cases. What is the probability here of being ill if the test comes out positive?
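All three items are applications of Bayes' theorem. The pattern of the computation, for the numbers of item 1, looks as follows; the assumption made here is that "correct in 99.995% of cases" means both the true-positive rate and the true-negative rate are 0.99995.

prior = 0.0002            # P(disease) = 0.02%
acc   = 0.99995           # test correct in 99.995% of cases

p_pos_given_d   = acc      # sensitivity (assumption, see above)
p_pos_given_not = 1 - acc  # false-positive rate

p_pos = p_pos_given_d * prior + p_pos_given_not * (1 - prior)
p_d_given_pos = p_pos_given_d * prior / p_pos
print(p_d_given_pos)

Items 2 and 3 follow by plugging in the respective prior and accuracy.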
17.7.2 In-class exercise: Bandits

Assume you have 3 bandits. You have already tested them a few times and received the returns

• From bandit 1: 8 7 12 13 11 9

• From bandit 2: 8 12

• From bandit 3: 5 13

For the returns of each bandit separately, compute a) the mean return, b) the standard deviation of the returns, and c) the standard deviation of the mean estimator.

Which bandit would you choose next? (Distinguish the cases: a) if you know this is the last chance to pull a bandit; b) if you will have many more trials thereafter.)
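The three requested quantities can be computed in a few lines (Python 3, numpy). Using the sample standard deviation (ddof=1) is a choice made in this sketch; the lecture's convention may divide by n instead.

import numpy as np

returns = {
    1: [8, 7, 12, 13, 11, 9],
    2: [8, 12],
    3: [5, 13],
}

for k, r in returns.items():
    r = np.array(r, dtype=float)
    mean = r.mean()                  # a) mean return
    sd = r.std(ddof=1)               # b) standard deviation of the returns
    sd_mean = sd / np.sqrt(len(r))   # c) standard deviation of the mean estimator
    print(k, mean, sd, sd_mean)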
17.8 Exercise 8

Submission deadline: Wed, 28.01.2015, 23:59h.

17.8.1 Spam filter with Naive Bayes

In the lecture you learned about graphical models and inference in them. The widely used Naive Bayes classifier is based on this foundation. The Bayes classifier is used, for example, to automatically detect spam emails. For this, training emails are examined and the word frequencies are counted. From these, D probabilities p(x_i | c) for the occurrence of a particular word, given that an email is spam/ham, are estimated. Now the assumption is made that all these probabilities are independent, so that the joint distribution can be computed as follows:

p(x, c) = p(c) ∏_{i=1}^D p(x_i | c)     (4)

For a newly arriving email one can then analyze the occurring words x* and use Bayes' formula to compute the probability that this email is spam or ham:

p(c | x*) = p(x* | c) p(c) / p(x*) = p(x* | c) p(c) / Σ_c p(x* | c) p(c)     (5)

Task: Implement a Naive Bayes classifier for the spam emails. You will find training data, and Python code that can handle it, in your repository.

Your implementation should contain two functions:

class NaiveBayes(object):
    def train(self, database):
        ''' Train the classifier with the given database. '''
        pass

    def spam_prob(self, email):
        ''' Compute the probability for the given email that it is spam. '''
        return 0.

Tip: In his book "Bayesian Reasoning and Machine Learning", David Barber gives a very good introduction to the Naive Bayes classifier (page 243 ff., or page 233 ff. in the free online version of the book, which can be downloaded at http://www.cs.ucl.ac.uk/staff/d.barber/brml/).
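A generic word-count version of equations (4) and (5) can be sketched as follows (Python 3). It assumes the training data is a list of (word_list, is_spam) pairs and uses Laplace smoothing; the course's database object and email representation are different, so this is only an illustration of the computation, not a drop-in solution.

import math
from collections import defaultdict

class NaiveBayesSketch(object):
    def train(self, data):
        self.word_counts = {True: defaultdict(int), False: defaultdict(int)}
        self.class_counts = {True: 0, False: 0}
        self.vocab = set()
        for words, is_spam in data:
            self.class_counts[is_spam] += 1
            for w in words:
                self.word_counts[is_spam][w] += 1
                self.vocab.add(w)

    def spam_prob(self, words):
        n = sum(self.class_counts.values())
        log_p = {}
        for c in (True, False):
            total = sum(self.word_counts[c].values())
            lp = math.log(self.class_counts[c] / n)          # log p(c)
            for w in words:
                # Laplace-smoothed log p(w | c), smoothing over the vocabulary
                lp += math.log((self.word_counts[c][w] + 1) / (total + len(self.vocab)))
            log_p[c] = lp
        m = max(log_p.values())                              # normalize as in equation (5)
        num = math.exp(log_p[True] - m)
        return num / (num + math.exp(log_p[False] - m))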
17.9 Exercise 9

This sheet contains in-class exercises that will be discussed in the tutorial session on 26.01. and that also serve as exam preparation. They are not to be handed in. Students will be asked at random to attempt the problems.

17.9.1 In-class exercise: Hidden Markov Models

You are standing on a bridge over the B14 in Stuttgart at night and want to count how many trucks, buses and vans drive towards Bad Cannstatt. Since you are watching several lanes at once and it is dark, you make the following errors when observing the traffic:

• A truck you recognize as a bus in 30% of the cases and as a van in 10% of the cases.

• A bus you recognize as a truck in 40% of the cases and as a van in 10% of the cases.

• A van you recognize as a bus in 10% and as a truck in 10% of the cases.

Furthermore you assume the following:

• A bus is followed by a bus with 10% and by a truck with 30% probability, otherwise by a van.

• A truck is followed by a van with 60% and by a bus with 30% probability, otherwise by another truck.

• A van is followed by a van with 80% probability, and by a bus or a truck with 10% each.

You know for certain that the first observed vehicle really is a van.

a) Formulate the HMM of this scenario, i.e., give P(X1), P(X_{t+1} | X_t) and P(Y_t | X_t) explicitly.

b) Prediction: What is the marginal distribution P(X3) over the third vehicle?

c) Filtering: You made the observations Y_{1:3} = (van, bus, bus). What is the probability P(X3 | Y_{1:3}) of the third vehicle, given these observations?

d) Smoothing: What is the probability P(X2 | Y_{1:3}) of the second vehicle, given the three observations?

e) Viterbi (most likely sequence): What is the most likely sequence argmax_{X_{1:3}} P(X_{1:3} | Y_{1:3}) of vehicles, given the three observations?
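Parts b) and c) can be cross-checked numerically with a short forward-filtering sketch (Python 3, numpy). The state ordering (truck, bus, van) and the matrix layout are choices made for this sketch; the matrices themselves are read off the statements above.

import numpy as np

# state / observation order: 0 = truck, 1 = bus, 2 = van
T = np.array([[0.1, 0.3, 0.6],    # after a truck;  T[i, j] = P(X_{t+1}=j | X_t=i)
              [0.3, 0.1, 0.6],    # after a bus
              [0.1, 0.1, 0.8]])   # after a van
O = np.array([[0.6, 0.3, 0.1],    # observing a truck;  O[i, j] = P(Y=j | X=i)
              [0.4, 0.5, 0.1],    # observing a bus
              [0.1, 0.1, 0.8]])   # observing a van
p1 = np.array([0.0, 0.0, 1.0])    # the first vehicle is known to be a van

# b) prediction: P(X3) without any observations
print(p1.dot(T).dot(T))

# c) filtering with observations Y_{1:3} = (van, bus, bus)
ys = [2, 1, 1]
alpha = p1 * O[:, ys[0]]
alpha /= alpha.sum()
for y in ys[1:]:
    alpha = alpha.dot(T) * O[:, y]
    alpha /= alpha.sum()
print(alpha)                      # P(X3 | Y_{1:3})

Smoothing (d) additionally requires a backward pass, and Viterbi (e) replaces the sum in the forward recursion by a maximization.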
17.10 Exercise 10

This sheet contains in-class exercises that will be discussed in the tutorial session on 05.02. and that also serve as exam preparation. They are not to be handed in. Students will be asked at random to attempt the problems.

17.10.1 In-class exercise: Value Iteration

(Figure: a circle of 8 numbered states, with state #1 marked green and state #2 marked red.)

Consider the circle of states above, which depicts the 8 states of an MDP. The green state (#1) receives a reward of r = 4096 and is a terminal state; the red state (#2) is punished with r = −512. Consider a discounting of γ = 1/2.

Description of P(s'|s,a):

• The agent can choose between two actions: going one step clock-wise or one step counter-clock-wise.

• With probability 3/4 the agent will transition to the desired state, with probability 1/4 to the state in the opposite direction.

• Exception: the green state (#1) is a terminal state. The MDP terminates after the agent has reached this state and collected the reward.

Description of P(r|s,a):

• The agent will receive a reward of r = 4096 upon reaching the green state (#1).

• The agent receives a reward of r = −512 upon reaching the red state (#2).

1. Perform three steps of Value Iteration: Initialize V_{k=0}(s) = 0; what is V_{k=1}(s), V_{k=2}(s), V_{k=3}(s)?

2. How could you compute the optimal value function V^π(s) for a GIVEN policy (e.g., always walk clock-wise) in closed form? Provide an explicit matrix equation.

3. Assume you are given V*(s). How can you compute the optimal Q*(s,a) from this? And assume Q*(s,a) is given; how can you compute the optimal V*(s) from this? Provide general equations.

4. What is Q_{k=3}(s,a) for the example above? What is the "optimal" policy given Q_{k=3}?

17.10.2 In-class exercise: TD-learning

Consider TD-learning on the MDP above. The initial value function is V(s) = 0. Consider the following setup: The agent starts in state 4, then permanently chooses the clock-wise action. As soon as it reaches the green terminal state (one way or another), it is beamed back to the start state 4 and everything repeats over and over again.

1. Describe at what events plain TD-learning will update the value function, and how it will update it. Guess roughly how many steps the agent will have taken when V(s4) becomes non-zero for the first time. How would this be different with eligibility traces?
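The value iteration steps of 17.10.1 can be cross-checked with a short script (Python 3, numpy). The sketch below assumes one consistent convention: the reward is collected on entering a state, the discount applies only to the future value, and the terminal green state keeps value 0; the placement of green/red on the ring and the direction labels are also assumptions, so hand-derived numbers using the lecture's exact convention may differ in detail.

import numpy as np

n, gamma = 8, 0.5
GREEN, RED = 0, 1                          # states #1 and #2 of the exercise
def entry_reward(s):                       # reward for arriving in state s
    return 4096.0 if s == GREEN else (-512.0 if s == RED else 0.0)

V = np.zeros(n)
for k in range(3):                         # three sweeps, as asked in question 1
    V_new = np.zeros(n)
    for s in range(n):
        if s == GREEN:                     # terminal: no further reward afterwards
            continue
        qs = []
        for direction in (+1, -1):         # clock-wise / counter-clock-wise
            q = 0.0
            for prob, d in ((0.75, direction), (0.25, -direction)):
                s2 = (s + d) % n
                q += prob * (entry_reward(s2) + gamma * V[s2])
            qs.append(q)
        V_new[s] = max(qs)
    V = V_new
    print(k + 1, V)

Keeping the per-action values qs instead of only their maximum gives Q_{k}(s,a) for question 4, and the greedy action per state gives the corresponding policy.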
Index
Existential quantification (6:8),
Explicit-Exploit-or-Explore* (14:9),
Exploration, Exploitation (9:6),
n-queens as ILP (4:19),
A∗ search (2:15),
A∗ : Proof 1 of Optimality (2:22),
A∗ : Proof 2 of Optimality (2:28),
Factor graph (11:30),
Filtering, Smoothing, Prediction (12:3),
FOL: Syntax (6:5),
Forward Chaining (7:15),
Forward chaining (5:41),
Forward checking (3:21),
Frame problem (6:22),
Frequentist vs Bayesian (8:4),
Admissible heuristics (2:30),
Alpha-Beta Pruning (10:6),
Backtracking (3:10),
Backward Chaining (5:52),
Backward Chaining (7:17),
Bayes’ Theorem (8:11),
Bayesian Network (11:3),
Bayesian RL (14:15),
Belief propagation (11:36),
Bellman optimality equation (13:8),
Bernoulli and Binomial (8:14),
Best-first Search (2:3),
Beta (8:15),
Breadth-first search (BFS) (1:29),
Gaussian (8:27),
Generalized Modus Ponens (7:14),
Genetic Algorithms (4:14),
Gibbs sampling (11:24),
Graph search and repeated states (1:65),
Greedy Search (2:5),
Hidden Markov Model (12:4),
HMM inference (12:6),
HMM: Inference (12:5),
Horn Form (5:40),
Completeness of Forward Chaining (5:51),
Complexity of BFS (1:37),
Complexity of DFS (1:52),
Complexity of Greedy Search (2:14),
Complexity of Iterative Deepening Search (1:63),
Complexity of A∗ (2:27),
Conditional distribution (8:9),
Conditional independence in a Bayes Net (11:7),
Conditional random field* (11:46),
Conjugate priors (8:23),
Conjunctive Normal Form (5:64),
Constraint propagation (3:25),
Constraint satisfaction problems (CSPs): Definition (3:2),
Imitation Learning (13:37),
Importance sampling (11:22),
Inference (5:28),
Inference in graphical models: overview (11:17),
Inference: general meaning (11:12),
Inverse RL (13:40),
Iterated Local Search (4:9),
Iterative deepening search (1:54),
Joint distribution (8:9),
Junction tree algorithm (11:41),
Conversion to CNF (5:65),
Conversion to CNF (7:20),
CSP as ILP (4:21),
Kalman filter (12:9),
Knowledge base: Definition (5:2),
Kullback-Leibler divergence (8:34),
Definitions based on sets (8:6),
Depth-first search (DFS) (1:39),
Dirac (8:26),
Dirichlet (8:19),
Local optima, plateaus (4:8),
Local Search (4:5),
Logical equivalence (5:37),
Logics: Definition, Syntax, Semantics (5:20),
Loopy belief propagation (11:39),
LP, QP, ILP, NLP (4:17),
Eligibility traces (13:26),
Entailment (5:21),
Entropy (8:33),
Epsilon-greedy exploration in Q-learning (14:4),
Evaluation functions (10:11),
Example: Romania (1:2),
Example: The 8-Puzzle (1:15),
Example: Vacuum World (1:5),
Map-Coloring Problem (3:3),
Marginal (8:9),
Markov Decision Process (MDP) (13:3),
Markov Process (12:2),
Maximum a-posteriori (MAP) inference (11:45),
Memory-bounded A∗ (2:34),
Message passing (11:36),
Minimax (10:3),
Model (5:22),
Model-based RL (13:34),
Modus Ponens (5:40),
Monte Carlo (11:19),
Monte Carlo Tree Search (MCTS) (9:14),
Multi-armed Bandits (9:1),
Multinomial (8:18),
Multiple RVs, conditional independence (8:12),
Optimistic heuristics (14:16),
Optimization problem: Definition (4:2),
PAC-MDP efficiency (14:7),
Particle approximation of a distribution (8:29),
Planning Domain Definition Language (PDDL) (6:24),
Policy gradients (13:44),
Probabilities as (subjective) information calculus (8:2),
Probability distribution (8:8),
Problem Definition: Deterministic, fully observable (1:9),
Proof of convergence of Q-Iteration (13:13),
Proof of convergence of Q-learning (13:24),
Propositional logic: Semantics (5:31),
Propositional logic: Syntax (5:29),
Q-Function (13:11),
Q-Iteration (13:12),
Q-learning (13:22),
R-Max (14:14),
Random variables (8:7),
Reduction to propositional inference (7:6),
Resolution (5:64),
Resolution (7:19),
Sample Complexity (14:6),
Sarsa (13:21),
Satisfiability (5:38),
Simulated Annealing (4:11),
Situation Calculus (6:21),
Slack Variables (4:19),
Temporal difference (TD) (13:19),
Travelling Salesman Problem (TSP) (4:6),
Tree search implementation: states vs nodes (1:25),
Tree Search: General Algorithm (1:26),
Tree-structured CSPs (3:33),
TSP as ILP (4:20),
UCT for games (10:12),
Unification (7:9),
Uniform-cost search (1:38),
Universal quantification (6:7),
Upper Confidence Bound (UCB) (9:8),
Upper Confidence Tree (UCT) (9:19),
Utilities and Decision Theory (8:32),
Validity (5:38),
Value Function (13:4),
Value Iteration (13:10),
Value order: Least constraining value (3:20),
Variable elimination (11:27),
Variable order: Degree heuristic (3:19),
Variable order: Minimum remaining values (3:18),
Wumpus World example (5:4),