Study on Genetic Network Programming (GNP) with Learning and Evolution
Hirasawa laboratory, Artificial Intelligence section
Information architecture field
Graduate School of Information, Production and Systems
Waseda University
I Research Background
• Systems are becoming large and complex
– robot control
– elevator group control systems
– stock trading systems
• It is very difficult to make efficient control rules considering the many kinds of real-world phenomena
• Intelligent systems (evolutionary and learning algorithms) can solve such problems automatically
II Objective of the research
• Propose an algorithm which combines evolution and learning
– In the natural world:
• Evolution ― many individuals (living things) adapt to the world (environment) over a long succession of generations; evolution gives the living things their inherent functions and characteristics
• Learning ― knowledge that the living things acquire during their lifetime through trial and error
III Evolution
Evolution
• Characteristics of living things are determined by genes
• Evolution gives inherent characteristics and functions
• Evolution is realized by the following components: selection, crossover and mutation
Selection
• Individuals that fit the environment survive; the others die out.
Crossover
• Genes are exchanged between two individuals; new individuals are produced.
Mutation
• Some genes of the selected individuals are changed to other ones; new individuals are produced.
IV Learning
Important factors in reinforcement learning
• State transition (definition of states and actions)
• Trial-and-error learning
• Future prediction
Framework of reinforcement learning
• Action rules are learned through the interaction between an agent and an environment.
• The agent receives a state signal (sensor input) from the environment, takes an action, and receives a reward (an evaluation of the action).
• The aim of RL is to maximize the total reward obtained from the environment.
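A minimal sketch of this interaction loop in Python; the env/agent interfaces here (reset, step, select_action, update) are illustrative placeholders rather than part of GNP:

def run_episode(env, agent, max_steps=1000):
    # The agent repeatedly observes a state, takes an action and receives a reward.
    total_reward = 0.0
    state = env.reset()                              # initial state signal (sensor input)
    for t in range(max_steps):
        action = agent.select_action(state)          # decide an action a_t
        next_state, reward, done = env.step(action)  # environment returns r_t and s_(t+1)
        agent.update(state, action, reward, next_state)  # learn from the experience
        total_reward += reward
        state = next_state
        if done:                                     # e.g. the goal is reached
            break
    return total_reward                              # RL tries to maximize this sum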
State transition
• At time t the agent is in state st, takes action at, receives reward rt and moves to state st+1; the cycle repeats with (st+1, at+1, rt+1), (st+2, at+2, rt+2), … until, for example, a goal state st+n is reached and a large reward (e.g. 100) is given.
Example: maze problem
• Starting from the start cell, the agent moves toward the goal G through states st, st+1, st+2, …, st+n, taking actions such as at: move right, at+1: move upward, at+2: move left, …, at+n: do nothing (end).
Trial-and-error learning
Concept of reinforcement learning
• The agent decides an action and takes it.
• Success (it gets a reward): the acquired knowledge is "take this action again".
• Failure (it gets a negative reward): the acquired knowledge is "don't take this action again".
• The reward (a scalar value) indicates whether the action was good or not.
Future prediction
• Reinforcement learning estimates the future rewards and takes actions accordingly: from the current state st, the sequence st, at, rt → st+1, at+1, rt+1 → st+2, at+2, rt+2 → st+3, … extends into the future.
Future prediction
• Reinforcement learning considers not only the current reward but also the future rewards.
• Case 1: rt = 1, rt+1 = 1, rt+2 = 1
• Case 2: rt = 0, rt+1 = 0, rt+2 = 100
• Looking only at the immediate reward, Case 1 seems better, but taking the future rewards into account, Case 2 gives a much larger total.
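A small numeric sketch of the two cases: judged only by the immediate reward, Case 1 looks better, but summing the (discounted) future rewards makes Case 2 the better choice. The discount factor 0.9 is an illustrative assumption, not a value given on the slide.

def discounted_return(rewards, gamma=0.9):
    # Sum of rewards, each discounted by gamma per time step.
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

case1 = [1, 1, 1]      # rt = 1, rt+1 = 1, rt+2 = 1
case2 = [0, 0, 100]    # rt = 0, rt+1 = 0, rt+2 = 100

print(discounted_return(case1))   # 2.71
print(discounted_return(case2))   # 81.0 -> considering the future, Case 2 is preferred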
V GNP with evolution and learning
Genetic Network Programming (GNP)
• GNP is an evolutionary computation method.
• What is evolutionary computation? solution = gene
• Solutions (programs) are represented by genes.
• The programs are evolved (changed) by selection, crossover and mutation.
Structure of GNP
• GNP represents its programs using directed graph structures.
• The graph structures can be represented as gene structures.
• The graph structure is composed of processing nodes and judgment nodes.
(Figure: a GNP graph structure and the corresponding gene structure; each node is encoded as a row of integers such as 0 0 3 4, 0 1 1 6, 0 2 5 7, 1 0 8 0, 1 0 0 4, …, 1 5 1 2.)
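One possible way to hold such a graph in code is sketched below; the slide only shows rows of integers, so the exact field layout (node type, function identifier, outgoing connections) is an illustrative assumption:

from dataclasses import dataclass
from typing import List

JUDGMENT, PROCESSING = 0, 1    # node types used in the sketches below

@dataclass
class NodeGene:
    node_type: int          # JUDGMENT or PROCESSING
    function_id: int        # which judgment/processing function the node carries
    connections: List[int]  # indices of the next nodes, one per branch

# A tiny illustrative individual: one judgment node with two branches
# and two processing nodes that both return to the judgment node.
individual = [
    NodeGene(JUDGMENT,   0, [1, 2]),
    NodeGene(PROCESSING, 0, [0]),
    NodeGene(PROCESSING, 1, [0]),
]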
Khepera robot
• The Khepera robot is used for the performance evaluation of GNP.
• Obstacle sensors: the sensor value is close to 0 far from obstacles and close to 1023 close to obstacles.
• Wheels: the speed of the right wheel vR and the speed of the left wheel vL each range from -10 (backward) to 10 (forward).
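A sketch of that interface in Python; the class and method names are hypothetical, and only the ranges (sensor values 0–1023, wheel speeds -10 to 10, eight sensors as used later) come from the slides:

from dataclasses import dataclass, field
from typing import List

@dataclass
class KheperaState:
    # Eight obstacle sensors: close to 0 far from obstacles, close to 1023 near them.
    sensors: List[int] = field(default_factory=lambda: [0] * 8)
    v_right: float = 0.0   # speed of the right wheel, -10 (back) .. 10 (forward)
    v_left: float = 0.0    # speed of the left wheel,  -10 (back) .. 10 (forward)

    def set_speeds(self, v_right: float, v_left: float) -> None:
        # Clip requested speeds into the allowed range.
        self.v_right = max(-10.0, min(10.0, v_right))
        self.v_left = max(-10.0, min(10.0, v_left))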
Node functions
Processing node
• Each node determines an agent action.
• Ex) Khepera robot behavior: set the speed of the right wheel at 10.
Judgment node
• Each node selects a branch based on its judgment result.
• Ex) judge the value of sensor 1: one branch for "500 or more", another for "less than 500".
An example of node transition
(Figure: the judgment node "judge sensor 5" branches on whether the value is 80 or more or less than 80; the judgment node "judge sensor 1" branches on whether the value is 700 or more or less than 700; the processing node "set the speed of the right wheel at 5" is reached along one of the branches.)
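Building on the NodeGene sketch above, node transitions like this example can be executed as a simple loop: a judgment node asks a question about the current sensor values and follows the branch for the answer, while a processing node performs its action and follows its single connection. The function tables are illustrative placeholders:

def run_transitions(individual, robot, judgment_funcs, processing_funcs, steps):
    # judgment_funcs[f](robot) returns a branch index (e.g. 0 for "700 or more").
    # processing_funcs[f](robot) changes the robot state (e.g. set the right wheel to 5).
    node_idx = 0                     # start node (assumed to be node 0)
    for _ in range(steps):
        node = individual[node_idx]
        if node.node_type == JUDGMENT:
            branch = judgment_funcs[node.function_id](robot)
            node_idx = node.connections[branch]
        else:
            processing_funcs[node.function_id](robot)
            node_idx = node.connections[0]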
Flowchart of GNP
start → generate an initial population (initial programs) → [one generation: task execution → reinforcement learning → evolution (selection / crossover / mutation)] → repeat until the last generation → stop
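The flowchart can be read as the loop sketched below, where evaluate (task execution plus reinforcement learning), select and reproduce are placeholders for the steps named on the slide:

def run_gnp(population, evaluate, select, reproduce, generations):
    for generation in range(generations):
        # Task execution: each program controls the agent; during the run,
        # reinforcement learning can adjust that program's node parameters.
        fitnesses = [evaluate(individual) for individual in population]
        # Evolution: selection, then crossover and mutation build the next population.
        parents = select(population, fitnesses)
        population = reproduce(parents)
    return population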
Evolution of GNP
Selection
• Good individuals (programs) are selected from the GNP population based on their fitness.
• Fitness indicates how well each individual achieves a given task.
• The selected individuals are used for crossover and mutation.
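The slide does not say which selection scheme GNP uses, so the sketch below shows fitness-proportional (roulette) selection only as one common possibility:

import random

def roulette_selection(population, fitnesses, n_parents):
    # Pick individuals with probability proportional to their (non-negative) fitness.
    total = sum(fitnesses)
    if total <= 0:
        return random.choices(population, k=n_parents)           # fall back to uniform
    return random.choices(population, weights=fitnesses, k=n_parents)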
Evolution of GNP
Crossover
• Some nodes and their connections are exchanged between two individuals (individual 1 and individual 2).
Mutation
• Connections are changed, or a node function is changed (e.g. a node "speed of right wheel: 5" becomes "speed of left wheel: 10").
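A sketch of these operators over the NodeGene list used earlier; the swap and mutation rates, and the detail of which genes are exchanged, are illustrative assumptions:

import copy
import random

def crossover(parent1, parent2, swap_rate=0.1):
    # Exchange corresponding nodes (and their connections) between two individuals.
    child1, child2 = copy.deepcopy(parent1), copy.deepcopy(parent2)
    for i in range(len(child1)):
        if random.random() < swap_rate:
            child1[i], child2[i] = child2[i], child1[i]
    return child1, child2

def mutate(individual, num_functions, mutation_rate=0.05):
    # Randomly rewire a connection or change a node function.
    child = copy.deepcopy(individual)
    for node in child:
        if random.random() < mutation_rate:
            branch = random.randrange(len(node.connections))
            node.connections[branch] = random.randrange(len(child))   # change a connection
        if random.random() < mutation_rate:
            node.function_id = random.randrange(num_functions)        # change the node function
    return child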
The role of learning
• Node parameters are changed by reinforcement learning.
• Example) A judgment node "judge sensor 0" branches on whether the value is 1000 or more or less than 1000, and a processing node sets the speed of the right wheel at 10; the robot collides with an obstacle.
– Judgment node: the threshold 1000 is changed to 500 in order to judge obstacles more sensitively.
– Processing node: the speed 10 is changed to 5 so as not to collide with the obstacle.
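In the GNP-with-reinforcement-learning literature this kind of parameter change is typically driven by a Sarsa-type update; the sketch below assumes each node keeps a Q-value per candidate parameter (for example thresholds 500/1000 or speeds 5/10), with illustrative learning-rate and discount constants:

ALPHA = 0.1   # learning rate (illustrative)
GAMMA = 0.9   # discount factor (illustrative)

def sarsa_update(q_table, node, choice, reward, next_node, next_choice):
    # q_table[node][choice] estimates the future reward of using that parameter choice.
    td_error = reward + GAMMA * q_table[next_node][next_choice] - q_table[node][choice]
    q_table[node][choice] += ALPHA * td_error

Parameter choices with higher Q-values are then selected more often (e.g. epsilon-greedily), so a threshold of 1000 that keeps leading to collisions gradually loses out to 500.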
The aim of combining evolution and learning
• Evolution uses many individuals, and better ones are selected after task execution.
• Learning uses one individual, and better action rules can be determined during task execution.
• Combining them is expected to create efficient programs and to search for solutions faster.
VI Simulation
• Wall-following behavior
1. All the sensor values must not be more than 1000
2. At least one sensor value is more than 100
3. Move straight
4. Move fast
(Figure: simulation environment)
Reward(t) = C × (vR(t) + vL(t)) / 20
C = 1 if conditions 1 and 2 are satisfied; C = 0 otherwise
fitness = Σ(t = 1 to 1000) Reward(t) / 1000
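Reading the formulas directly, each step's reward is the normalized wheel-speed sum gated by the two conditions, and the fitness is its average over the 1000 time steps; a short sketch:

def step_reward(v_right, v_left, sensors):
    # Reward(t) = C * (vR(t) + vL(t)) / 20
    # C = 1 only if condition 1 (no sensor value above 1000) and
    # condition 2 (at least one sensor value above 100) are both satisfied.
    condition1 = all(s <= 1000 for s in sensors)
    condition2 = any(s > 100 for s in sensors)
    c = 1.0 if (condition1 and condition2) else 0.0
    return c * (v_right + v_left) / 20.0

def fitness(step_rewards):
    # fitness = sum of Reward(t) for t = 1..1000, divided by 1000.
    return sum(step_rewards) / 1000.0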
Node functions
Processing nodes (2 kinds)
• Determine the speed of the right wheel
• Determine the speed of the left wheel
Judgment nodes (8 kinds)
• Judge the value of sensor 0, …, judge the value of sensor 7
Simulation result
(Figure: fitness curves (0 to 0.8) of the best individuals over 1000 generations, averaged over 30 independent simulations, for GNP with learning and evolution and for standard GNP (GNP with evolution); a second panel shows the track of the robot from the start position.)
• Conditions
– The number of individuals: 600
– The number of nodes: 34
• Judgment nodes: 24
• Processing nodes: 10
Simulations in the inexperienced environments
Simulation on the generalization ability
• The best program obtained in the previous environment is executed in an inexperienced environment (from a new start position).
• The robot can show the wall-following behavior.
VII Conclusion
• An algorithm of GNP that uses evolution and reinforcement learning was proposed.
– From the simulation results, the proposed method can learn the wall-following behavior well.
• Future work
– Apply GNP with evolution and reinforcement learning to real-world applications
• Elevator group control systems
• Stock trading models
– Compare with other evolutionary algorithms
VIII Other simulations
Tileworld
• The agent can push a tile and drop it into a hole.
• The aim of the agent is to drop as many tiles into holes as possible.
(Figure: example of a tileworld, showing walls, floor, tiles, holes and agents.)
• Fitness = the number of dropped tiles
• Reward rt = 1 (when a tile is dropped into a hole)
Node functions
Processing nodes
• go forward, turn right, turn left, stay
Judgment nodes
• What is in the forward cell? (floor, tile, hole, wall or agent); the same judgment for the backward, left and right cells
• The direction of the nearest tile (forward, backward, left, right or nothing)
• The direction of the nearest hole
• The direction of the nearest hole from the nearest tile
• The direction of the second nearest tile
Example of node transition
(Figure: the judgment node "direction of the nearest hole" branches on forward, backward, left, right or nothing; the judgment node "what is in the forward cell?" branches on floor, wall, agent, tile or hole; a processing node executes "go forward".)
Simulation 1
– There are 30 tiles and 30 holes
– The same environment is used every generation
– Time limit: 150 steps
(Figure: Environment I)
Fitness curve (simulation 1)
(Figure: fitness (0 to 20) versus generation (0 to 5000) for GNP with learning and evolution, GNP with evolution, GP-ADFs (main tree: max depth 3, ADF: depth 2), GP (max depth 5) and EP (evolution of finite state machines).)
Simulation 2
• 20 tiles and 20 holes are put at random positions
• One tile and one hole appear just after the agent pushes a tile into a hole
• Time limit: 300 steps
(Figure: Environment II, an example of an initial state)
Fitness curve (simulation 2)
(Figure: fitness (0 to 25) versus generation (0 to 5000) for GNP with learning and evolution, GNP with evolution, EP, GP-ADFs (main tree: max depth 3, ADF: depth 2) and GP (max depth 5).)
Ratio of used nodes
(Figure: the ratio of used node functions (go forward, turn left, turn right, do nothing, judge forward, judge backward, judge right side, judge left side, direction of tile, direction of hole, direction of hole from tile, second nearest tile) in the initial generation and in the last generation.)
Summary of the simulations
Data on the best individuals obtained at the last generation (30 samples)

Simulation I
                                 GNP-LE     GNP-E       GP           GP-ADFs      EP
Mean fitness                     21.23      18.00       14.00        15.43        16.30
Standard deviation               2.73       1.88        4.00         1.94         1.99
T-test (p value), GNP-LE vs:     -          1.04×10^-6  3.13×10^-17  3.17×10^-13  5.31×10^-11
T-test (p value), GNP-E vs:      -          -           3.03×10^-11  1.32×10^-6   5.95×10^-4

Simulation II
                                 GNP-LE     GNP-E       GP           GP-ADFs      EP
Mean fitness                     19.93      15.30       6.10         6.67         14.40
Standard deviation               2.43       3.88        1.75         3.19         2.54
T-test (p value), GNP-LE vs:     -          5.90×10^-8  1.53×10^-31  2.90×10^-26  1.36×10^-12
T-test (p value), GNP-E vs:      -          -           7.46×10^-15  5.91×10^-13  1.46×10^-1
Summary of the simulations
Calculation time comparison

Simulation I
                                            GNP with LE  GNP with E  GP      GP-ADFs  EP
Calculation time for 5000 generations [s]   1,717        1,019       3,281   3,252    2,802
Ratio to GNP with E (= 1)                   1.68         1           3.22    3.19     2.75

Simulation II
                                            GNP with LE  GNP with E  GP      GP-ADFs  EP
Calculation time for 5000 generations [s]   2,734        1,177       12,059  5,921    1,584
Ratio to GNP with E (= 1)                   2.32         1           10.25   5.03     1.35
The program obtained by GNP
(Figure: the obtained program and the agent behavior, shown from step 0.)
Maze problem
• Objective: reach the goal as early as possible
• The key (K) is necessary to open the door in front of the goal (G)
• Time limit: 300 steps
• fitness = remaining time (when reaching the goal); 0 (when the agent cannot reach the goal)
• reward rt = 1 (when reaching the goal)
(Figure: the maze, showing walls, floor, the door, the agent, the key K and the goal G.)
Node functions
Processing nodes
• go forward
• turn right
• turn left
• random (take one of the three actions randomly)
Judgment nodes
• judge the forward cell
• judge the backward cell
• judge the left cell
• judge the right cell
Fitness curve (maze problem)
(Figure: fitness (0 to 300) versus generation (0 to 3000) for GNP with learning and evolution (GNP-LE), GNP with evolution (GNP-E) and GP.)

Data on the best individuals obtained at the last generation (30 samples)
                                        GNP-LE  GNP-E  GP
Mean                                    253.0   246.2  227.0
Standard deviation                      0.00    2.30   37.4
Ratio of reaching the goal              100%    100%   100%
Ratio of obtaining the optimal policy   100%    3.3%   63%