Understanding AlphaGo

Go Overview
• Originated in ancient China about 2,500 years ago
• A two-player game
• Goal: surround more territory than the opponent
• 19×19 grid board; the playing pieces are called "stones"
• A turn = place a stone or pass
• The game ends when both players pass

Go Overview – only two basic rules
1. Capture rule: stones that have no liberties are captured and removed from the board
2. Ko rule: a player is not allowed to make a move that returns the game to the previous position

Go in a Reinforcement Learning Set-Up
• Environment states S = board positions; actions A = legal moves
• Transitions between states
• Reinforcement function: r(s) = 0 if s is not a terminal state, ±1 (win/loss) otherwise
• Goal: find a policy that maximizes the expected total payoff

Why is Go hard for computers?
• The number of possible board configurations is extremely high (~10^700), so brute-force exhaustive search is impossible
• Chess: b ≈ 35, d ≈ 80; Go: b ≈ 250, d ≈ 150
• Main challenges: the branching factor and the value function

Training the Deep Neural Networks
• P_σ – SL policy network, trained from human expert (state, action) pairs
• P_π – rollout policy network, trained from human expert (state, action) pairs
• P_ρ – RL policy network, trained by self-play
• V_θ – value network, trained on (state, win/loss) pairs
• All of them are combined at play time by Monte Carlo Tree Search

SL Policy Network P_σ
• Trained on ~30 million human expert (state, action) pairs
• Goal: maximize the log likelihood of the expert's action
• Input: 19×19×48 (48 feature planes)
• Architecture: 12 convolutional + rectifier layers followed by a softmax
• Output: a probability map over actions
• Bigger networks are better but slower
• Move-prediction accuracy: AlphaGo with all input features 57.0%; AlphaGo with only the raw board position 55.7%; previous state of the art 44.4%

Training the Rollout Policy Network P_π
• Similar to the SL policy P_σ: the output is a probability map over actions, and the goal is to maximize the log likelihood
• Input: not the full grid, but handcrafted local features
• Comparison with the SL policy: P_σ forwards in 3 milliseconds with 55.4% accuracy; P_π forwards in 2 microseconds with 24.2% accuracy

Training the RL Policy Network P_ρ
• A refined version of the SL policy P_σ, with the same architecture (19×19×48 input, 12 convolutional + rectifier layers, softmax over a probability map)
• Weights are initialized to ρ = σ
• Trained by self-play: P_ρ plays against P_ρ⁻, where ρ⁻ is drawn from {ρ⁻ | ρ⁻ is an old version of ρ}, and the weights are updated by stochastic gradient ascent (SGA)
• Playing against old versions prevents overfitting
• The RL policy won more than 80% of its games against the SL policy

Training the Value Network V_θ
• Position evaluation: approximating the optimal value function
• Input: a state (19×19×48); output: the probability of winning (a scalar)
• Architecture: convolutional + rectifier layers followed by a fully connected layer
• Goal: minimize the MSE
• Overfitting is a risk because positions within a game are strongly correlated

Monte Carlo Tree Search
• Monte Carlo experiments: repeated random sampling to obtain numerical results
• A search method for making optimal decisions in AI problems
• The strongest Go AIs (Fuego, Pachi, Zen, and Crazy Stone) all rely on MCTS

Monte Carlo Tree Search
Each round of Monte Carlo tree search consists of four steps (see the sketch below):
1. Selection
2. Expansion
3. Simulation
4. Backpropagation
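To make the four steps concrete, here is a minimal, self-contained sketch of one MCTS round with UCT selection, applied to a toy take-away game (Nim). The game, the class and function names, and the exploration constant c = 1.4 are illustrative choices for this sketch; they are not part of AlphaGo.

```python
import math
import random


class Node:
    def __init__(self, state, to_move, parent=None):
        self.state = state          # stones left in the Nim pile
        self.to_move = to_move      # +1 or -1: the player about to move
        self.parent = parent
        self.children = {}          # move -> child Node
        self.wins = 0.0             # W_i: reward accumulated through this node
        self.visits = 0             # n_i: times this node has been visited


def legal_moves(state):
    return [m for m in (1, 2, 3) if m <= state]


def uct_score(child, parent_visits, c=1.4):
    # Exploitation term W_i / n_i plus exploration term C * sqrt(ln t / n_i)
    if child.visits == 0:
        return float("inf")
    return child.wins / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)


def mcts_round(root):
    # 1. Selection: descend while the node is non-terminal and fully expanded
    node = root
    while node.state > 0 and len(node.children) == len(legal_moves(node.state)):
        node = max(node.children.values(), key=lambda ch: uct_score(ch, node.visits))

    # 2. Expansion: add one unexplored child if the node is not terminal
    if node.state > 0:
        move = random.choice([m for m in legal_moves(node.state) if m not in node.children])
        node.children[move] = Node(node.state - move, -node.to_move, parent=node)
        node = node.children[move]

    # 3. Simulation: random rollout until a terminal state
    state, to_move = node.state, node.to_move
    while state > 0:
        state -= random.choice(legal_moves(state))
        to_move = -to_move
    winner = -to_move  # whoever takes the last stone wins this Nim variant

    # 4. Backpropagation: update W_i and n_i along the path back to the root
    while node is not None:
        node.visits += 1
        # credit the win to the player who moved into this node
        node.wins += 1.0 if winner == -node.to_move else 0.0
        node = node.parent


def best_move(state, to_move=1, n_rounds=2000):
    root = Node(state, to_move)
    for _ in range(n_rounds):
        mcts_round(root)
    # pick the most-visited child, as AlphaGo does at the root
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]


if __name__ == "__main__":
    print(best_move(10))  # taking 2 leaves 8, a losing position for the opponent
```

AlphaGo keeps the same four-step skeleton but replaces the uniformly random rollout with the fast rollout policy P_π, mixes the rollout outcome with the value network V_θ at the leaf, and biases selection with the prior P(s, a) from the SL policy.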
MCTS – Upper Confidence Bounds for Trees (UCT)
• Exploration–exploitation trade-off
• Kocsis, L. & Szepesvári, C., "Bandit Based Monte-Carlo Planning" (2006)
• Converges to the optimal solution
• Selection rule: UCT_i = W_i / n_i (exploitation) + C · sqrt(ln t / n_i) (exploration), where
  – W_i: number of wins after visiting node i
  – n_i: number of times node i has been visited
  – C: exploration parameter
  – t: number of times the parent of node i has been visited

AlphaGo MCTS
• Four phases: Selection, Expansion, Evaluation, Backpropagation
• Each edge (s, a) stores:
  – Q(s, a): action value (the average value of the subtree)
  – N(s, a): visit count
  – P(s, a): prior probability
• Leaf evaluation combines:
  1. The value network
  2. A random rollout played until the terminal state
• How to choose the next move at the root?
  – Maximum visit count, which is less sensitive to outliers than the maximum action value

AlphaGo vs. Experts
• 4:1 against Lee Sedol
• 5:0 against Fan Hui

Take Home
• Modular system
• Reinforcement learning and deep learning combined
• Generic approach
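As a concrete illustration of the AlphaGo search bookkeeping described above, here is a minimal sketch of the per-edge statistics Q(s, a), N(s, a), P(s, a) and of the two decision rules: Q + u for in-tree selection and maximum visit count at the root. The Edge class, the move labels, the constant c_puct = 5.0, and the exact form of the exploration bonus u(s, a) ∝ P(s, a) / (1 + N(s, a)) are assumptions of this sketch; the bonus follows the published AlphaGo description, but the details are illustrative, not the system's actual implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    P: float          # prior probability from the SL policy network P_sigma
    N: int = 0        # visit count
    W: float = 0.0    # total value accumulated from leaf evaluations

    @property
    def Q(self) -> float:
        # action value: average value of the subtree below this edge
        return self.W / self.N if self.N else 0.0


def select_action(edges: dict[str, Edge], c_puct: float = 5.0) -> str:
    """In-tree selection: argmax_a Q(s, a) + u(s, a), where the exploration
    bonus u(s, a) is proportional to P(s, a) / (1 + N(s, a))."""
    total_visits = sum(e.N for e in edges.values())

    def score(e: Edge) -> float:
        u = c_puct * e.P * math.sqrt(total_visits) / (1 + e.N)
        return e.Q + u

    return max(edges, key=lambda a: score(edges[a]))


def backup(path: list[Edge], leaf_value: float) -> None:
    """Backpropagation: add the leaf evaluation (a mix of the value network's
    estimate and the rollout outcome) to every edge on the path; sign handling
    for alternating players is omitted in this sketch."""
    for edge in path:
        edge.N += 1
        edge.W += leaf_value


def choose_move(root_edges: dict[str, Edge]) -> str:
    """Final move at the root: maximum visit count, which is less sensitive
    to outliers than the maximum action value."""
    return max(root_edges, key=lambda a: root_edges[a].N)


if __name__ == "__main__":
    # hypothetical root edges with made-up priors and leaf values
    edges = {"D4": Edge(P=0.4), "Q16": Edge(P=0.35), "K10": Edge(P=0.25)}
    for value in (0.6, 0.2, 0.7, 0.5, 0.9, 0.4):
        a = select_action(edges)
        backup([edges[a]], value)
    print(choose_move(edges))
```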