Study on Genetic Network Programming (GNP) with Learning and Evolution
Hirasawa laboratory, Artificial Intelligence section, Information architecture field
Graduate School of Information, Production and Systems, Waseda University

I Research background
• Systems are becoming large and complex: robot control, elevator group control systems, stock trading systems.
• It is very difficult to design efficient control rules by hand while taking many kinds of real-world phenomena into account.
• Intelligent systems (evolutionary and learning algorithms) can solve such problems automatically.

II Objective of the research
• Propose an algorithm which combines evolution and learning, as they appear in the natural world:
– Evolution: many individuals (living things) adapt to the world (environment) over a long succession of generations; evolution gives living things their inherent functions and characteristics.
– Learning: the knowledge that living things acquire during their lifetime through trial and error.

III Evolution
• Characteristics of living things are determined by genes; evolution gives them their inherent characteristics and functions.
• Evolution is realized by the following components: selection, crossover, and mutation.
– Selection: individuals that fit the environment survive; the others die out.
– Crossover: genes are exchanged between two individuals, and new individuals are produced.
– Mutation: some genes of the selected individuals are changed to other ones, and new individuals are produced.

IV Learning
Important factors in reinforcement learning
• State transition (definition of states and actions)
• Trial-and-error learning
• Future prediction

Framework of reinforcement learning
• Action rules are learned through the interaction between an agent and an environment: the environment gives the agent a state signal (sensor input), the agent takes an action, and the environment returns a reward (an evaluation of the action).
• The aim of RL is to maximize the total reward obtained from the environment.

State transition
• s_t is the state at time t and a_t is the action taken at time t; the agent moves through the states s_t, s_t+1, s_t+2, …, s_t+n and receives the rewards r_t, r_t+1, r_t+2, …, r_t+n (e.g., a reward of 100 on reaching the goal).
• Example (maze problem): a_t: move right, a_t+1: move upward, a_t+2: move left, …, a_t+n: do nothing (end).

Trial-and-error learning
• The agent decides on an action and takes it.
• Success (a reward is obtained): the acquired knowledge is "take this action again".
• Failure (a negative reward is obtained): the acquired knowledge is "don't take this action again".
• The reward (a scalar value) indicates whether the action was good or not.

Future prediction
• Reinforcement learning estimates future rewards and takes actions accordingly: from the current state s_t, the future states s_t+1, s_t+2, s_t+3 and the rewards r_t, r_t+1, r_t+2 are considered.
• Reinforcement learning considers not only the current reward but also future rewards.
– Case 1: r_t = 1, r_t+1 = 1, r_t+2 = 1.
– Case 2: r_t = 0, r_t+1 = 0, r_t+2 = 100.
– Case 2 is better in terms of the total reward, even though its immediate rewards are zero.

V GNP with evolution and learning
Genetic Network Programming (GNP)
• GNP is an evolutionary computation method. In evolutionary computation, a solution corresponds to a gene:
– Solutions (programs) are represented by genes.
– The programs are evolved (changed) by selection, crossover, and mutation.

Structure of GNP
• GNP represents its programs using directed graph structures.
• The graph structures can be represented as gene structures.
• The graph structure is composed of processing nodes and judgment nodes.
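As a rough illustration of how such a directed-graph program could be encoded and executed, here is a minimal Python sketch. It is only an assumption-laden sketch: the names Node and run_gnp, the callback interface, and the single-parameter-per-node simplification are illustrative and not taken from the presentation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Node:
    kind: str               # "judgment" or "processing"
    function_id: int        # which sensor to judge / which action to take
    connections: List[int]  # next-node index per branch (one entry for a processing node)
    param: float = 0.0      # threshold or action value (the part tuned by learning)

def run_gnp(nodes: List[Node],
            judge: Callable[[int, float], int],
            act: Callable[[int, float], None],
            start: int = 0, steps: int = 10) -> None:
    """Execute node transitions: a judgment node selects a branch from the
    judgment result, a processing node determines an agent action."""
    current = start
    for _ in range(steps):
        node = nodes[current]
        if node.kind == "judgment":
            branch = judge(node.function_id, node.param)  # e.g. sensor >= threshold -> 0, else 1
            current = node.connections[branch]
        else:
            act(node.function_id, node.param)             # e.g. set a wheel speed
            current = node.connections[0]

# Toy example: judgment node 0 branches to processing node 1, which loops back to node 0.
toy = [Node("judgment", 0, [1, 1], 500.0), Node("processing", 0, [0], 10.0)]
run_gnp(toy, judge=lambda fid, p: 0, act=lambda fid, p: None, steps=4)
```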
[Figure: an example graph structure and its corresponding gene structure]

Khepera robot
• The Khepera robot is used for the performance evaluation of GNP.
• Obstacle sensors: the sensor value is close to 0 far from obstacles and close to 1023 close to obstacles.
• Wheels: the speed of the right wheel (VR) and the speed of the left wheel (VL) range from -10 (backward) to 10 (forward).

Node functions
• Processing node: each node determines an agent action. Example (Khepera robot behavior): "set the speed of the right wheel at 10".
• Judgment node: each node selects a branch based on the judgment result. Example: "judge the value of sensor 1", with the branches "500 or more" and "less than 500".

An example of node transition
[Figure: node transition combining "judge sensor 5" (80 or more / less than 80), "judge sensor 1" (the value is 700 or more / less than 700), and "set the speed of the right wheel at 5"]

Flowchart of GNP
• start → generate an initial population (initial programs) → repeat one generation {task execution with reinforcement learning → evolution (selection / crossover / mutation)} until the last generation → stop.

Evolution of GNP
• Selection: good individuals (programs) are selected from the GNP population based on their fitness, where the fitness indicates how well each individual achieves the given task; the selected individuals are used for crossover and mutation.
• Crossover: some nodes and their connections are exchanged between individual 1 and individual 2.
• Mutation: connections are changed, or a node function is changed (e.g., "speed of right wheel: 5" → "speed of left wheel: 10"); new individuals are produced.

The role of learning
• Node parameters are changed by reinforcement learning.
• Example: a judgment node "judge sensor 0" (1000 or more / less than 1000) leads to a processing node "set the speed of the right wheel at 10", and the robot collides with an obstacle.
– Judgment node: the threshold 1000 is changed to 500 in order to judge obstacles more sensitively.
– Processing node: the speed 10 is changed to 5 so as not to collide with the obstacle.

The aim of combining evolution and learning
• Evolution uses many individuals, and better ones are selected after task execution.
• Learning uses a single individual, and better action rules can be determined during task execution.
• The combination creates efficient programs and searches for solutions faster.

VI Simulation
Wall-following behavior
1. All the sensor values must not be more than 1000.
2. At least one sensor value must be more than 100.
3. Move straight.
4. Move fast.

Simulation environment
• Reward(t) = {(vR(t) + vL(t)) / 20} × {1 − |vR(t) − vL(t)| / 20} × C
• C = 1 if conditions 1 and 2 are satisfied; C = 0 otherwise.
• fitness = Σ_{t=1}^{1000} Reward(t) / 1000
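The reward and fitness above can be written out directly. A minimal sketch, assuming the reconstruction of the formula given above; the function names step_reward and fitness are illustrative only.

```python
def step_reward(v_r: float, v_l: float, sensors: list) -> float:
    """Reward(t) for the wall-following task as defined above.
    Wheel speeds lie in [-10, 10]; sensor values lie in [0, 1023]."""
    cond1 = all(s <= 1000 for s in sensors)      # condition 1: no sensor value exceeds 1000
    cond2 = any(s > 100 for s in sensors)        # condition 2: at least one sensor value exceeds 100
    c = 1.0 if (cond1 and cond2) else 0.0
    speed_term = (v_r + v_l) / 20.0              # rewards moving fast
    straight_term = 1.0 - abs(v_r - v_l) / 20.0  # rewards moving straight
    return speed_term * straight_term * c

def fitness(trajectory) -> float:
    """Fitness = sum of Reward(t) over the 1000-step trial, divided by 1000."""
    return sum(step_reward(v_r, v_l, s) for v_r, v_l, s in trajectory) / 1000.0

# Example: full speed straight ahead next to a wall (one sensor reads 600) gives Reward(t) = 1.
print(step_reward(10, 10, [600, 0, 0, 0, 0, 0, 0, 0]))  # -> 1.0
```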
Node functions
• Processing nodes (2 kinds): determine the speed of the right wheel; determine the speed of the left wheel.
• Judgment nodes (8 kinds): judge the value of sensor 0, …, judge the value of sensor 7.

Simulation result
[Figure: fitness curves of the best individuals (fitness vs. generation, 0 to 1000 generations) averaged over 30 independent simulations, comparing GNP with learning and evolution against standard GNP (GNP with evolution), together with the track of the robot in the simulation environment]
• Conditions:
– The number of individuals: 600
– The number of nodes: 34 (judgment nodes: 24, processing nodes: 10)

Simulations in inexperienced environments (generalization ability)
• The best program obtained in the previous environment is executed in an inexperienced environment.
• The robot can still show the wall-following behavior.

VII Conclusion
• An algorithm for GNP using evolution and reinforcement learning is proposed.
– The simulation results show that the proposed method can learn the wall-following behavior well.
• Future work
– Apply GNP with evolution and reinforcement learning to real-world applications: elevator control systems and stock trading models.
– Compare with other evolutionary algorithms.

VIII Other simulations
Tileworld
• The tileworld consists of floor, walls, tiles, holes, and the agent.
• The agent can push a tile and drop it into a hole; the aim of the agent is to drop as many tiles into holes as possible.
• Fitness = the number of dropped tiles.
• Reward r_t = 1 (when a tile is dropped into a hole).

Node functions
• Processing nodes: go forward, turn right, turn left, stay.
• Judgment nodes: what is in the forward cell? (floor, tile, hole, wall, or agent); likewise for the backward, left, and right cells; the direction of the nearest tile (forward, backward, left, right, or nothing); the direction of the nearest hole; the direction of the nearest hole from the nearest tile; the direction of the second nearest tile.

Example of node transition
[Figure: node transition combining the judgment nodes "direction of the nearest hole" (forward / backward / left / right / nothing) and "what is in the forward cell?" (floor / tile / hole / wall / agent) with the processing node "go forward"]

Simulation 1 (Environment I)
• There are 30 tiles and 30 holes.
• The same environment is used every generation.
• Time limit: 150 steps.

Fitness curve (simulation 1)
[Figure: fitness vs. generation (0 to 5000 generations) for GNP with learning and evolution, GNP with evolution, GP (max depth 5), GP-ADFs (main tree: max depth 3, ADF: depth 2), and EP (evolution of finite state machines)]

Simulation 2 (Environment II, example of an initial state)
• 20 tiles and 20 holes are put at random positions.
• One tile and one hole appear just after the agent pushes a tile into a hole.
• Time limit: 300 steps.

Fitness curve (simulation 2)
[Figure: fitness vs. generation (0 to 5000 generations) for GNP with learning and evolution, GNP with evolution, EP, GP-ADFs (main tree: max depth 3, ADF: depth 2), and GP (max depth 5)]

Ratio of used nodes
[Figure: ratio of used nodes per node function (go forward, turn left, turn right, do nothing, judge forward/backward/left/right cell, direction of the nearest tile, direction of the nearest hole, direction of the hole from the tile, second nearest tile) at the initial generation and at the last generation]

Summary of the simulations
Data on the best individuals obtained at the last generation (30 samples); GNP-LE: GNP with learning and evolution, GNP-E: GNP with evolution.

Simulation I (GNP-LE / GNP-E / GP / GP-ADFs / EP)
• Mean fitness: 21.23 / 18.00 / 14.00 / 15.43 / 16.30
• Standard deviation: 2.73 / 1.88 / 4.00 / 1.94 / 1.99
• T-test p value vs. GNP-LE: – / 1.04×10^-6 / 3.13×10^-17 / 3.17×10^-13 / 5.31×10^-11
• T-test p value vs. GNP-E: – / – / 3.03×10^-11 / 1.32×10^-6 / 5.95×10^-4

Simulation II (GNP-LE / GNP-E / GP / GP-ADFs / EP)
• Mean fitness: 19.93 / 15.30 / 6.10 / 6.67 / 14.40
• Standard deviation: 2.43 / 3.88 / 1.75 / 3.19 / 2.54
• T-test p value vs. GNP-LE: – / 5.90×10^-8 / 1.53×10^-31 / 2.90×10^-26 / 1.36×10^-12
• T-test p value vs. GNP-E: – / – / 7.46×10^-15 / 5.91×10^-13 / 1.46×10^-1
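The p values above compare the 30 best-individual fitness samples of one method against those of another. The presentation does not state which t-test variant was used, so the following is only a hedged sketch using SciPy, with randomly generated placeholder data standing in for the real fitness samples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder samples: 30 best-individual fitness values per method, drawn at the
# reported mean / standard deviation of Simulation I (not the actual experimental data).
gnp_le = rng.normal(21.23, 2.73, 30)
gp = rng.normal(14.00, 4.00, 30)

# Welch's two-sample t-test (equal_var=True would give Student's t-test instead).
t_stat, p_value = stats.ttest_ind(gnp_le, gp, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```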
Summary of the simulations: calculation time comparison

Simulation I (GNP with LE / GNP with E / GP / GP-ADFs / EP)
• Calculation time for 5000 generations [s]: 1,717 / 1,019 / 3,281 / 3,252 / 2,802
• Ratio to GNP with E (= 1): 1.68 / 1 / 3.22 / 3.19 / 2.75

Simulation II (GNP with LE / GNP with E / GP / GP-ADFs / EP)
• Calculation time for 5000 generations [s]: 2,734 / 1,177 / 12,059 / 5,921 / 1,584
• Ratio to GNP with E (= 1): 2.32 / 1 / 10.25 / 5.03 / 1.35

The program obtained by GNP
[Figure: step-by-step trace (from step 0) of the agent controlled by the program obtained by GNP]

Maze problem
• Objective: reach the goal (G) as early as possible.
• The maze consists of floor, walls, a door, a key (K), the goal, and the agent; the key is necessary to open the door in front of the goal.
• Time limit: 300 steps.
• Fitness = the remaining time (when the agent reaches the goal); 0 (when the agent cannot reach the goal); see the sketch at the end of this section.
• Reward r_t = 1 (when the agent reaches the goal).

Node functions
• Processing nodes: go forward, turn right, turn left, random (take one of the three actions randomly).
• Judgment nodes: judge the forward cell, judge the backward cell, judge the left cell, judge the right cell.

Fitness curve (maze problem)
[Figure: fitness vs. generation (0 to 3000 generations) for GNP with learning and evolution (GNP-LE), GNP with evolution (GNP-E), and GP]

Data on the best individuals obtained at the last generation (30 samples), GNP-LE / GNP-E / GP
• Mean: 253.0 / 246.2 / 227.0
• Standard deviation: 0.00 / 2.30 / 37.4
• Ratio of reaching the goal: 100% / 100% / 100%
• Ratio of obtaining the optimal policy: 100% / 3.3% / 63%
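As a closing illustration, here is a minimal sketch of the maze fitness and reward defined earlier in this section; the constant and function names are illustrative only.

```python
TIME_LIMIT = 300  # step limit of the maze problem

def maze_reward(reached_goal: bool) -> float:
    """Reward r_t = 1 only at the step on which the agent reaches the goal."""
    return 1.0 if reached_goal else 0.0

def maze_fitness(steps_used: int, reached_goal: bool) -> int:
    """Fitness = remaining time when the goal is reached, 0 otherwise."""
    return TIME_LIMIT - steps_used if reached_goal else 0

# Example: reaching the goal after 47 steps leaves 253 steps of remaining time,
# which matches the best mean fitness (253.0) reported in the table above.
print(maze_fitness(47, True))    # -> 253
print(maze_fitness(300, False))  # -> 0
```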