Engineering goalkeeper behavior using an emotion learning method
Stevo Bozinovski and Peter Schoell
German National Center for Information Technology (GMD)
Sankt Augustin
Abstract. This paper reports an approach to engineering an evolving robot behavior for the role of a goalkeeper. In the course of our study of the problem, we also developed a hand-coded controller. The learning method we used is emotion-based self-reinforcement learning. The learning robot is trained by a non-didactic trainer that performs random shots. We also present a typical learning curve obtained from a training experiment.
Introduction
The new AI challenge set up by the RoboCup initiators, to develop within the next 50 years a robot football team that will beat a human football team, has induced many research efforts converging toward building robots capable of playing football (soccer). Various issues have emerged that will remain a challenge in the coming years, and they are covered in contemporary reports on the subject (e.g. Asada et al., 1999a). One of these issues is learning to play football. Close to it is the issue which we call the issue of football training, which we state as follows:
Human football players are trained, not programmed, to play football. Is it possible to achieve this with robot players? Does it make sense to define a RoboCup league in which only trained robots may play?
Following this idea, we tried to develop a training-based controller for a robot (a physical robot or a software agent). We considered the role of the goalkeeper robot, which appeared to be the simplest of the roles in a football team. In what follows we briefly describe our development of a trained goalkeeper.
A hand-coded goalkeeper controller
It seems that the RoboCup effort is gradually turning toward a role-based team structure (Asada et al. 1999a). In such a structure, the role of the goalkeeper (goalie, goaltender, goalman) has special importance. There are reports of teams built without implementing a goalkeeper (Tambe et al. 1999). But it seems that, as development progresses, the role of a goalkeeper should be considered. The goalkeeper role can be assigned to just one player, or possibly to each player, with the ability to take that role in a recognized situation. Stone and Veloso (1999) cover the issue of rigid and flexible roles within a RoboCup team.
To study the role of the goalkeeper, we first developed a hand-coded controller. We used previous experience in building GMD robots (Bredenfeld et al. 1999). From that experience we can define two main problems in connection with the goalkeeper: the self-localization problem and the ball-blocking problem.
Here we will not consider the self-localization problem in detail; let us only mention that we considered that problem for real robots and developed a solution (Schoell, 1999). Four behaviors from our Dual Dynamics approach (Jaeger and Christaller, 1998) were used for this purpose: ToggleBack, ToggleForward, CorrectHeading, and InitRobot. The robot can determine its position using several infrared sensors. The goal posts are used as fixed landmarks, as is the rear wall of the goal. The individual behaviors are specified with the Dual Dynamics Design Tool developed by A. Bredenfeld and tested on a software simulator.
Here we concentrate on the problem of the goalkeeper against the ball (the ball-blocking problem). Previously we used a single simple behavior for blocking the ball. In the course of the study toward developing an evolving, trained goalkeeper, we developed a simulation program with the following three behaviors:
Sentry. Sentry is the default behavior and is activated when the ball is not in sight. Since we assume that the robot must not go beyond the goal frame, it is also activated when the ball is to the left of the goal while the robot is already at the left edge of the goal (and analogously for the right edge). As part of this behavior, when the ball is not in sight, the robot positions itself in the middle of the goal frame.
Predict. This behavior is activated when the ball is visible and its trajectory is not close to the goal. In this behavior the robot moves along with the ball at a speed equal to the ball speed. In this way the robot expresses a prediction of where the ball will appear near the goal in the near future.
Intercept. This behavior is activated when the ball is close to the goal. The robot moves toward the ball at near-maximum speed in order to intercept it. The robot always moves in one dimension, along the goal, so the interception is done in that dimension. The optimization parameter is the distance between the ball and the robot along the direction of the robot's movement.
A simulation program, written as a Java animation applet, was developed to explore the performance of a goalkeeper designed in this way, and it showed successful results.
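To make the selection among these three behaviors concrete, the following is a minimal Java sketch of such a hand-coded controller. The class names, the threshold, and the observation fields are illustrative assumptions of ours, not the actual code of the applet.

// Minimal sketch of the hand-coded behavior selection described above.
// Thresholds and field names are illustrative assumptions.
public class GoalkeeperController {

    enum Behavior { SENTRY, PREDICT, INTERCEPT }

    static class BallObservation {
        boolean visible;         // is the ball in sight?
        double distanceToGoal;   // distance of the ball from the goal line
    }

    // Hypothetical threshold: below this distance the ball counts as "close to the goal".
    private static final double CLOSE_TO_GOAL = 1.0;

    Behavior select(BallObservation ball, boolean robotAtEdgeTowardBall) {
        if (!ball.visible || robotAtEdgeTowardBall) {
            return Behavior.SENTRY;     // default: reposition to the middle of the goal frame
        }
        if (ball.distanceToGoal > CLOSE_TO_GOAL) {
            return Behavior.PREDICT;    // move along with the ball at the ball's speed
        }
        return Behavior.INTERCEPT;      // move toward the ball at near-maximum speed
    }
}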
A trained goalkeeper
There have been several reports on training a football-playing robot. An interesting approach has been reported by Tambe et al. (1999). They trained a robot for a shooting role and implemented a supervised pattern classification method: a human expert selected a training set of various possible shooting situations and assigned a desired shot for each situation. It was shown that successful training can be achieved using this method. Another approach, for a passing-shooting pair, has been reported by the Osaka team (Asada et al. 1999b), but with limited success; as Werger (1999) reports, the system was not used in real play. It was mentioned (Asada et al. 1999a) that the Osaka team also used reinforcement learning for training the goalkeeper, but we have not yet had the opportunity to study a description of that training.
We studied the goalkeeper problem from the viewpoint of implementing successful training. A candidate approach was supervised pattern classification training, as previously done by Tambe et al. (1999) for a shooter role. However, during the study we recognized that the goalkeeper problem is similar to the pole-balancing problem (Figure 1). The pole-balancing problem is well known in reinforcement learning theory, and we already had some experience in solving it (Bozinovski and Anderson 1983). So we decided to try that approach, using Crossbar Learning, as we did for the pole-balancing problem.
Figure 1. Viewing the goalkeeper problem as a pole balancing problem
The Crossbar Learning approach
The crossbar learning method was implemented in the Crossbar Adaptive Array (CAA) architecture (Bozinovski 1982), which we briefly describe here (Figure 2). The learning method it introduced is now recognized as a forerunner of the well-known Q-learning method (Barto 1996), but the contribution of this architecture to machine learning is still under consideration (Bozinovski 1999).
Figure 2. Crossbar adaptive array architecture (block diagram: sensors s1...sM from the behavioral environment feed a situation recognizer; situations x1...xm enter the crossbar learning memory; behavior selection, via three-state buffers and emotion judgment, drives behavior routines B1, B2, B3 through outputs b1...bn; a genome string links the agent to its genetic environment)
As Figure 2 shows, the CAA architecture is assumed to exist in two environments: a behavioral environment in which it behaves, and a genetic environment with which it exchanges its genetic information. It receives a set of signals from the behavioral environment and partitions the environment into several situations. A situation is a behavior-dependent concept: for a previously chosen set of behaviors, a situation is a region of the input variable space in which a behavior-based system triggers the assigned behavior. The situations are received by the crossbar learning memory of the system. This memory first computes the emotion of being in the encountered situation, and then computes the behavior (action) that should be triggered in that situation. Its learning rule is

w(i,j) = w(i,j) + emotion(k)

where i is the previously performed behavior, j is the situation in which behavior i was performed, k is the situation that results from i and j, w(i,j) is the element of the learning matrix called the crossbar element, and emotion(k) is the computed emotion of being in k. In CAA, emotion is understood as internal state evaluation, performed by the agent itself.
From the genetic environment, a CAA agent receives a genome (or species) string which carries information about some internal states of the agent in connection with the environment situations. It ensures that the agent will feel bad in a dangerous situation; otherwise, feeling good or neutral in a dangerous situation, it would not survive the environment, due to wrongly inherited genetic memory. In the case of a goalkeeper agent, we specified that the goalkeeper feels bad only in the situation of a received goal. So the goalkeeper is genetically bound to learn to avoid bad situations, and will tend to defend the goal.
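As an illustration, the following Java sketch shows how the crossbar learning rule and the genome string described above could be realized in tabular form. The class and method names are our own, and computing the emotion of a situation as its genome value plus the best learned crossbar value of that situation is a simplifying assumption rather than a description of the original architecture.

// Sketch of the crossbar learning rule w(i,j) = w(i,j) + emotion(k).
// Computing emotion(k) as the genome value plus the best learned value
// of the consequence situation k is a simplifying assumption.
public class CrossbarAdaptiveArray {
    private final double[][] w;      // crossbar learning memory: w[behavior][situation]
    private final double[] genome;   // inherited evaluation of each situation, e.g. -1 for "goal received"

    public CrossbarAdaptiveArray(int numBehaviors, int numSituations, double[] genome) {
        this.w = new double[numBehaviors][numSituations];
        this.genome = genome;
    }

    // Emotion of being in situation k: inherited value plus best learned crossbar value.
    public double emotion(int k) {
        double best = w[0][k];
        for (int b = 1; b < w.length; b++) best = Math.max(best, w[b][k]);
        return genome[k] + best;
    }

    // Learning rule: behavior i was performed in situation j and led to situation k.
    public void update(int i, int j, int k) {
        w[i][j] += emotion(k);
    }

    // Trigger the behavior with the highest crossbar value in the current situation.
    public int selectBehavior(int situation) {
        int best = 0;
        for (int b = 1; b < w.length; b++) {
            if (w[b][situation] > w[best][situation]) best = b;
        }
        return best;
    }
}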
Goalkeeper training using the crossbar learning method
In implementing crossbar learning in the goalkeeping robot, we divided the input space into 10 situations depending on the ball vector in the field. For example, situation 1 is recognized when the ball is to the left of the goalkeeper and is moving left; situation 5 is recognized when the ball is in front of the robot and is moving straight toward the goal; and so on.
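For illustration, such a situation recognizer might look like the following Java sketch. Only situations 1 and 5 are specified above, so the remaining mapping and the input representation are assumptions made for the example.

// Illustrative discretization of the ball vector into situations.
// Only situations 1 and 5 follow the text; the rest of the mapping is assumed.
public final class BallSituationRecognizer {
    public static final int NUM_SITUATIONS = 10;

    // relX: ball position along the goal line relative to the goalkeeper (negative = left)
    // velX: lateral ball velocity (negative = moving left)
    // towardGoal: true if the ball trajectory points straight at the goal
    public static int recognize(double relX, double velX, boolean towardGoal) {
        if (relX < 0 && velX < 0) return 1;                  // ball left of the keeper, moving further left
        if (Math.abs(relX) < 0.1 && towardGoal) return 5;    // ball in front, moving straight at the goal
        // ... the remaining regions of the input space would map to situations 2-4 and 6-10
        return 2; // placeholder for the other regions in this sketch
    }
}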
The trainer is chosen to be non-didactic, in the sense that it does not introduce lessons or a didactic order among them. It simply shoots the ball randomly toward the goal side. The ball sometimes misses the goal (a missed shot), and sometimes bounces off the side walls. Some of the shots reach the goal, directly or after bouncing. It is expected that the goalkeeper will learn to defend those shots.
We distinguish a training phase and an exploitation phase of the goalkeeper. In the training phase the goalkeeper is trained, and after a certain level of performance is achieved, the goalkeeper is assumed ready to enter a real match, the exploitation phase. A performance measure is needed to decide whether a goalkeeper is well trained and can enter its exploitation phase. As the performance measure we chose the ratio of defended to received goals. We estimated that for a well-trained human goalkeeper this ratio must be at least 3, which means that 75% of all the balls reaching the goal frame are defended. We actually trained our robot until that ratio was 4 or more, depending on the experiment. We assumed that the training should not last more than 500 shooting trials, including missed shots. With this restriction we obtained a performance ratio between 4 and 10.
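The training procedure and its stopping criterion can be sketched in Java as follows. The Simulator interface is hypothetical; the target ratio of 4 and the limit of 500 trials are taken from the description above.

// Sketch of the training phase with the defended/received performance measure.
// The Simulator interface is hypothetical; the threshold 4 and the 500-trial
// limit follow the description above.
public class GoalkeeperTraining {

    public interface Simulator {
        // Plays one random shot toward the goal side and returns its outcome:
        // +1 if the shot was defended, -1 if a goal was received, 0 if the shot missed the goal frame.
        int randomShot();
    }

    public static double train(Simulator sim, int maxTrials, double targetRatio) {
        int defended = 0;
        int received = 0;
        for (int trial = 1; trial <= maxTrials; trial++) {
            int outcome = sim.randomShot();   // crossbar learning happens inside the simulated episode
            if (outcome > 0) defended++;
            if (outcome < 0) received++;
            if (received > 0 && (double) defended / received >= targetRatio) {
                return (double) defended / received;   // well trained: ready for the exploitation phase
            }
        }
        return received > 0 ? (double) defended / received : defended;
    }
    // Typical use in this setting would be: train(simulator, 500, 4.0)
}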
Figure 3 shows a learning curve from one of our experiments.
Figure 3. A learning curve for an evolving goalkeeper (Experiment 8): performance, measured as the ratio of defended to received goals (vertical axis, 0 to 7), versus the number of shots toward the goal (horizontal axis, 1 to 177)
As Figure 3 shows, at the beginning the robot may defend the goal by chance, but after a few steps the performance ratio drops, and then it increases due to the learning process. In this experiment, after about 50 shots toward the goal, the robot has gained enough experience to start behaving as a goalkeeper. Its performance improves gradually. Although it is trained to recognize most of the situations after 50 trials, some situations are learned only after about 113 shots toward the goal. The training in this experiment was carried out until the performance coefficient reached the value 7. It is assumed that this goalkeeper is now well trained and can enter its exploitation phase.
We carried out a number of experiments on one of our Java simulators, and all were successful, meaning that all achieved a performance ratio greater than 4 in fewer than 500 training trials, and all produced an evolving, well-trained goalkeeper.
Conclusion
Robot training for football playing is one of the challenges that can be recognized in connection with the RoboCup initiative. Here we described our effort to build a goalkeeper robot using a training method, with an emotion-based learning architecture as the robot controller. Our simulation study demonstrated the feasibility of this method for use in real robots.
Our future work is directed toward implementing this learning scheme in a dual-dynamics-based robot controller (Jaeger and Christaller, 1998; Bredenfeld et al., 1999).
References
Asada M., Kitano H., Noda I., Veloso M.: Robocup: Today and tomorrow – What we
have learned. Artificial Intelligence 110: 193-214, 1999a
Asada M., Uchibe E., Hosoda K.: Cooperative behavior acquisition for mobile robots
in dynamically changing real worlds via vision-based reinforcement learning and
development. Artificial Intelligence 110: 275-292, 1999b
Barto A.: Reinforcement learning. In Omidvar and Elliot (Eds.), Neural Networks for Control, Academic Press, 1996
Bredenfeld A., Christaller T., Goehring W., Guenter H., Jaeger H., Kobialka H.-U., Ploeger P.-G., Schoell P., Siegberg A., Verbeek C., Wilberg J.: Behavior engineering with "dual dynamics" model and design tools. IJCAI-99 RoboCup Workshop, Stockholm, 1999
Bozinovski S.: A self-learning system using secondary reinforcement. In R. Trappl (Ed.), Cybernetics and Systems Research, North Holland, 1982
Bozinovski S., Anderson C.: Associative memory as controller of an unstable system: Simulation of a learning control. Proc. IEEE MELECON, C5.11, Athens, 1983
Bozinovski S.: Crossbar Adaptive Array: The first connectionist network that solved the delayed reinforcement learning problem. In A. Dobnikar, N. Steele, D. Pearson, R. Albrecht (Eds.), Artificial Neural Nets and Genetic Algorithms, Springer Verlag, 1999
Jaeger H., Christaller T. Dual dynamics: designing behavior systems for autonomous
robots. Artificial Life and Robotics 2: 108-112, 1998
Schoell P.: The goalkeeper of the GMD RoboCup team. http://ais.gmd.de/BE/ddd, 1999
Tambe M., Adibi J., Al-Onaizan Y., Erdem A., Kaminka G., Marsella S., Muslea I.: Building agent teams using an explicit teamwork model and learning. Artificial Intelligence 110: 215-239, 1999
Werger B.: Cooperation without deliberation: A minimal behavior-based approach to
multi-robot teams. Artificial Intelligence 110: 293-320, 1999
Acknowledgment
The first author wishes to express his gratitude to Prof. Thomas Christaller and Dr. Herbert Jaeger for the invitation to work at GMD, as well as for the excellent conditions provided. Dr. Jaeger also made valuable contributions to this work. Dr. Bernd Mueller made valuable comments on this paper.