BULETINUL INSTITUTULUI POLITEHNIC DIN IAŞI
Published by
Universitatea Tehnică „Gheorghe Asachi” din Iaşi
Tomul LVI (LX), Fasc. 3, 2010
Secţia
AUTOMATICĂ şi CALCULATOARE
INTELLIGENT AGENT PLANNING WITH
QUASI-DETERMINED STATES USING INDUCTIVE
LEARNING
BY
FLORIN LEON
Abstract. Traditional representations for planning problems use predicate logic,
and many planning algorithms consider the environment to be deterministic and the
planning agent to be detached from its execution environment. Also, reactive agent
architectures have been proposed that address the problem of quick responses to changes
in the environment, and consider the intelligent behavior of an agent to be an emergent
result of the interaction of simpler, layered behaviors. However, these approaches do not
take into account learning as an intrinsic part of problem-solving or planning behavior. In
this paper, we describe a method of including a learning phase into the plan itself, so that
the agent can dynamically recognize the preconditions of an action when the states are not
fully determined, and even directly choose its actions based on learning results.
Key words: intelligent agents, planning, inductive learning, classification.
2000 Mathematics Subject Classification: 68T20, 68T42.
1. Introduction
Many researchers regard agent-based solutions as a new paradigm for handling complexity in software systems. Recently, an increasing number of industrial applications have confirmed the success of this approach, mainly in domains such as telecommunication networks, manufacturing enterprises, air traffic control, transportation systems, electronic commerce, patient monitoring, or process control [7].
Despite the lack of general agreement regarding an established
definition of an agent, it is generally accepted that it is a software or hardware
entity that displays the properties of autonomy, i.e. it is capable of independent,
unsupervised actions, and situatedness, i.e. it is part of the physical or simulated
execution environment. Wooldridge and Jennings [24], [23] further describe an
intelligent agent as having the additional properties of: reactivity (the ability to
perceive its environment, and respond in a timely manner to changes that occur
in it), pro-activeness (the ability to exhibit goal-directed behavior by taking the
initiative), and social ability (to interact with other agents and possibly
humans).
An agent can decide autonomously how to respond to a request, unlike a typical object in the object-oriented programming paradigm, whose methods are simply invoked by other objects and whose code is then executed automatically. Thus, an agent can be viewed from the outside as a black box, with perceptual or sensory input from its environment and effector capabilities to modify that same environment as output. Therefore, an agent can be seen as a function that continuously maps its perceptions into actions, in many cases taking its internal state into account as well.
One of the fundamental issues of agent-based design is thus how to choose the optimal action or, more generally, how to construct an optimal sequence of actions, i.e. a plan.
2. Planning Methods in Artificial Intelligence
Traditionally, a great deal of work in artificial intelligence was devoted
to planning algorithms. There are several notations used to describe planning
problems.
One of the first was the STanford Research Institute Problem Solver (STRIPS) [5], famous for its use by Shakey the robot, the first mobile robot able to reason about its own actions, in the context of the blocks world. In this representation, a state is a conjunction of positive, function-free ground literals. In order to address the frame problem, it uses the “closed-world
assumption”, i.e. any condition that is not explicitly mentioned in a state is
assumed to be false. The goal is represented by a conjunction of literals. The
actions are represented by so-called “action schemas”, including preconditions
and effects, for example:
Action( Drive(car, Iasi, Timisoara),
Precondition: At(car, Iasi) ∧ Car(car)
Effect: ¬At(car, Iasi) ∧ At(car, Timisoara))
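As an illustration of how such an action schema operates, the following Python sketch applies the Drive action to a state represented as a set of ground literals under the closed-world assumption (the function and variable names are ours, not part of STRIPS itself):

def applicable(state, preconditions):
    # An action is applicable if all of its preconditions hold in the state.
    return preconditions <= state

def apply_action(state, preconditions, add_list, delete_list):
    # Applying an action removes the literals on the delete list
    # and adds the literals on the add list.
    if not applicable(state, preconditions):
        raise ValueError("preconditions not satisfied")
    return (state - delete_list) | add_list

# The Drive(car, Iasi, Timisoara) schema from the example above:
state  = {("At", "car", "Iasi"), ("Car", "car")}
pre    = {("At", "car", "Iasi"), ("Car", "car")}
add    = {("At", "car", "Timisoara")}
delete = {("At", "car", "Iasi")}

new_state = apply_action(state, pre, add, delete)
# new_state now contains ("Car", "car") and ("At", "car", "Timisoara").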
The ADL language [14] is an extension of STRIPS that allows, among other things, both positive and negative literals in states, quantified variables in goals, and goals expressed as conjunctions and disjunctions. It also includes typed variables, e.g. (car: Car), and uses the “open world assumption”, i.e. unmentioned literals are unknown rather than false.
PDDL [6] differs from STRIPS and ADL from the syntactic point of
view and its goal is to be a broader modeling language for planning problems.
STRIPS and ADL can be considered to be fragments of PDDL. An example of a blocks world problem described in PDDL is as follows [21]:
(:action move
 :parameters (?b - block ?from ?to - loc)
 :precondition (and (not (blocked ?b)) (not (blocked ?to))
                    (on ?b ?from) (not (= ?from ?to)))
 :effect (and (not (blocked ?from)) (on ?b ?to)
              (when (not (= ?to table)) (blocked ?to))))
Many algorithms have been devised to solve classical planning
problems. Besides the straightforward state-space search, with the forward
(progression) and backward (regression) search methods resembling the
conventional problem solving techniques, we can mention the partial-order
planning (POP) algorithms, whose main idea is to build a plan from partial sequences of actions, without specifying an ordering between these sequences beforehand. Within the partial action sequences, a set of ordering constraints is
defined.
To address scalability issues, more recent planning algorithms have emerged, such as Graphplan [1], SATPLAN [9], and RAX [8].
A typical shortcoming of many classical planning algorithms is that
they assume the environment to be fully observable, static, and deterministic.
For dealing with incomplete information and non-deterministic settings, other
planning methods are needed, such as [17]:
1. Conformant planning, which must ensure that the plan achieves the goal in all possible circumstances, regardless of the true initial state and the actual action outcomes;
2. Conditional planning, which constructs plans with different branches for the different contingencies that could occur;
3. Execution monitoring and replanning, which checks whether the plan still applies to the current situation or needs to be revised;
4. Continuous planning, where the planner executes continuously, even after achieving one goal, and can therefore handle the abandonment of goals or the creation of additional goals.
A critique of the AI establishment was made by Rodney Brooks [2], [3], who considers that traditional AI planners are disembodied entities, not physically situated in their execution environment, as is the case with agents. He considers that the abstraction of representation can be misleading and oversimplifies real-world problems. Brooks proposed the reactive subsumption architecture, stating that the intelligence of an agent results from the continuous interaction between the agent and its environment. He considered a set of behavior levels, from low-level, critical reactions to more abstract ones, and claimed that intelligent behavior is an emergent property
of the interaction of these simpler behaviors.
However, although reactive architectures are well-suited for certain
problems, they are less suited to others [7]. Therefore, there is a need to
investigate hybrid architectures that combine the best of the two extremes.
3. Planning with Quasi-Determined States
Classical planning methods use a predicate logic representation
for the states. For example, if a robotic agent has a plan of taking an apple off
the table and putting it into a basket, a typical plan would use a predicate such
as Apple(a) to describe this object.
However, in a real-life situation, if more objects were placed on the table, it could be difficult for the agent to recognize the apple among them. Another difficulty would arise if the apple on the table had a non-standard appearance, in terms of size, shape, or color. In this case, the preconditions of actions, which describe the states in which the agent has to make a decision, are not fully determined. We call such states quasi-determined states.
In order to address the issue of reactive actions, we propose that besides
recognizing the preconditions of an action, classification can be used to directly
map a state to an action. A training dataset can be used to choose the
appropriate action in a state instead of a conditional planning method. As an
example, we can mention the classical “weather” dataset, which is used to decide whether someone should play tennis or not, based on attributes such as temperature, humidity, and outlook, or the decision to grant a loan to a person based on his/her marital status, age, and income. Classification is now applied extensively in real-life situations, such as deciding the optimal treatment of patients, classifying electronic payments as legitimate or fraudulent, speech and handwriting recognition, etc. All these and similar tasks could be part of a longer plan of an agent.
Another advantage of using learning for quasi-determined states is that
the agent can adapt in real-time to the changing conditions of its execution
environment.
There are many inductive learning algorithms that address the problem
of classification. We can mention three main classes of such algorithms: decision trees, which provide an explicit symbolic result, similar to the rules on which humans base their conscious reasoning; instance-based methods, similar to the way humans recognize objects or situations by analogy with previously encountered ones; and Bayesian methods, similar to the way in which humans make decisions based on probabilities or the frequency of event occurrence. Another well-known technique that can be used for classification is the sub-symbolic, neural network approach.
In summary, Fig. 1 shows the two situations in which an inductive learning phase can be included into a plan. Action 1 is a classical action within a plan. Action 2 has its preconditions determined by classification. Action 3 is dynamically determined by a supervised learning procedure. ES, the set of effects of the Start pseudo-action, represents the initial state of the problem. PF, the set of preconditions of the Finish pseudo-action, represents the goal of the problem. Pi designates the preconditions of an action and Ei its effects.
Fig. 1 – A plan with quasi-determined states and learning phases.
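As a minimal illustration of these two learning-based steps, the following Python sketch assumes a classifier object with a scikit-learn-style predict() method; all class, function, and attribute names are our own and are not prescribed by the approach described here:

class ConstantClassifier:
    # Stand-in for a trained model (e.g. a decision tree or nearest neighbor
    # classifier), used here only to make the sketch runnable.
    def __init__(self, label):
        self.label = label
    def predict(self, samples):
        return [self.label for _ in samples]

def precondition_holds(perception, recognizer, expected_label):
    # Action 2: a precondition such as Apple(a) is recognized by classifying
    # the raw perception (size, shape, color, ...) instead of being asserted.
    return recognizer.predict([perception])[0] == expected_label

def choose_action(perception, policy):
    # Action 3: the action itself is selected by a supervised procedure
    # that maps the perceived state directly to an action.
    return policy.predict([perception])[0]

apple_recognizer = ConstantClassifier("apple")
grasp_policy = ConstantClassifier("grasp")
perception = [7.2, "round", "red"]          # size, shape, color

if precondition_holds(perception, apple_recognizer, "apple"):
    print(choose_action(perception, grasp_policy))   # prints: grasp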
4. Case Study: The Mountain Car Problem
As an example of how learning and quasi-determined states can be
incorporated into a planning mechanism for an intelligent agent, we will
consider the “Mountain Car” problem. It was originally presented by
Andrew Moore in his PhD dissertation [12] and later Sutton and Barto
included it as an exercise in their well-known introductory book on reinforcement learning [19].
The task requires an agent to drive an underpowered car up a steep
mountain road. Since gravity is stronger than the engine of the car, even at full
power the car cannot accelerate up the steep slope [15]. The movement of the
car is described by two continuous output variables, position x ∈ [−1.2, 0.5] and velocity v ∈ [−0.07, 0.07], and one discrete input representing its acceleration a. The acceleration is therefore the action that the agent chooses, and it can take one of three discrete values: full thrust forward (+1), no thrust (0), and full thrust backward (−1).
Recently, a 3D version of the problem has been proposed, which extends the standard 2D variant and in which the states are described by four continuous variables [20].
The mountain curve on which the car moves is defined as h(x) = sin(3x). According to the laws of physics, the discrete-time state space equations of the system are those presented in Eq. (1):

(1)    v_{t+1} = v_t − 0.0025 · cos(3 x_t) + 0.001 · a_t
       x_{t+1} = x_t + v_{t+1}

where a_t ∈ {−1, 0, 1}.
Fig. 2 presents the setting of the mountain car problem (adapted after
Singh and Sutton [18] and Naeeni [13]).
Fig. 2 – The mountain car problem.
Both state variables are kept within the defined ranges, i.e. any value above or below the boundaries is clipped to the corresponding extreme value. When the position x reaches the extreme left boundary −1.2, the velocity v is set to 0. The goal, the top of the mountain, is located at x = 0.5.
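These equations and boundary rules translate directly into code; a minimal Python sketch (the constants and function names are ours) is:

import math

X_MIN, X_MAX = -1.2, 0.5
V_MIN, V_MAX = -0.07, 0.07
X_GOAL = 0.5

def step(x, v, a):
    # One discrete-time step of Eq. (1), with the boundary handling above:
    # both variables are clipped to their ranges and the velocity is reset
    # to 0 when the car hits the left wall.
    v = v - 0.0025 * math.cos(3 * x) + 0.001 * a
    v = max(V_MIN, min(V_MAX, v))
    x = x + v
    if x <= X_MIN:
        x, v = X_MIN, 0.0
    x = min(X_MAX, x)
    return x, v

def at_goal(x):
    # The goal, the top of the mountain, is located at x = 0.5.
    return x >= X_GOAL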
The problem is particularly interesting because, in order to reach its
goal, the car must gain enough kinetic energy by accelerating in alternating
directions, backward or forward. It must first drive backward, up the other
side of the valley, to gain enough momentum to drive forward up the hill. It
will therefore move away from the goal at first in order to find a working
solution. Also, the states of the problem defined by position and velocity are
continuous, real-valued, and this causes an additional difficulty for a
reinforcement learning algorithm dealing with discrete states. Finally, because of the external factor of gravity and the momentum of the car, the actions may not have the same results in similar states.
4.1. Naïve Heuristic
First, we can verify the assumptions of the problem when the agent uses a naïve heuristic, i.e. it maintains the acceleration forward (a = 1) at all times. Fig. 3 shows the behavior of the system for different initial positions.
Fig. 3 – The behavior of the agent using the naïve heuristic for different starting positions.
When the initial position is at the top of the mountain on the side opposite to the goal, the momentum of the car is enough to climb the mountain side and reach the goal. The momentum remains sufficient until the initial position becomes -0.84. Beyond this point, the naïve heuristic shows its limitation, because the car becomes engaged in an oscillatory movement over the alternating sides of the valley. Closer to the goal, only an initial position of 0.39 or greater is enough to reach the goal directly, using forward acceleration.
Fig. 4 – The behavior of the agent using the simple heuristic for different starting positions.
4.2. Simple Heuristic
Taking into account the characteristics of the problem, we can devise a simple heuristic that ensures that the goal is reached every time. The heuristic tries to make maximum use of the gravitational force: the acceleration of the car is set to the sign of its velocity. Fig. 4 shows the behavior of the system for different initial positions in this case.
One can see that even when the initial position x ∈ [−0.84, 0.38], the agent reaches its goal after several amplifying oscillations.
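Both heuristics can be written in a few lines each; the sketch below reuses the step() and at_goal() functions introduced earlier (the step budget of 2000 is an arbitrary choice of ours):

def naive_heuristic(x, v):
    # Section 4.1: always accelerate forward.
    return 1

def simple_heuristic(x, v):
    # Section 4.2: accelerate in the direction of the current velocity;
    # when the car is at rest, push it forward.
    return 1 if v >= 0 else -1

def rollout(policy, x0, v0=0.0, max_steps=2000):
    # Simulate the agent until the goal is reached or the step budget is spent.
    x, v = x0, v0
    for t in range(1, max_steps + 1):
        x, v = step(x, v, policy(x, v))
        if at_goal(x):
            return t
    return None     # the goal was not reached

print(rollout(naive_heuristic, -0.5))   # None: the car keeps oscillating (cf. Section 4.1)
print(rollout(simple_heuristic, -0.5))  # the goal is reached after several oscillations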
4.3. Reinforcement Learning Solution
The simple heuristic presented above does not solve the problem in an
optimal manner, i.e. with a minimum number of time steps. The problem was
originally designed to be solved with reinforcement learning algorithms, so we
employ such a technique to find shorter plans for the agent.
Model-free reinforcement learning algorithms using temporal differences, such as Q-Learning [22] or State-Action-Reward-State-Action (SARSA) [16], need to discretize the continuous input states of the problem. The Q function, used to determine the best action to be taken in a particular state, is defined as Q : S × A → ℝ and is usually implemented as a matrix containing real-valued estimates of the cumulative reward obtained by performing a particular action a ∈ A in a particular state s ∈ S. The mountain car problem is also difficult for a reinforcement learning algorithm because all the rewards are -1, with the exception of the goal state, where the reward is 0. Therefore, the agent becomes aware of a higher reward only when it finally reaches the goal.
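A minimal sketch of such a tabular SARSA learner over a uniform grid discretization, reusing the step() and at_goal() functions above, could look as follows (the grid resolution and learning parameters are illustrative choices of ours, not those of the cited Matlab implementation):

import random

N_POS, N_VEL = 30, 9                 # illustrative grid resolution
ACTIONS = (-1, 0, 1)
ALPHA, GAMMA, EPSILON = 0.5, 1.0, 0.1

Q = {}                               # (state, action) -> estimated return

def discretize(x, v):
    i = int((x - X_MIN) / (X_MAX - X_MIN) * (N_POS - 1))
    j = int((v - V_MIN) / (V_MAX - V_MIN) * (N_VEL - 1))
    return i, j

def q(s, a):
    return Q.get((s, a), 0.0)

def epsilon_greedy(s):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q(s, a))

def sarsa_episode(x0, v0=0.0, max_steps=10000):
    x, v = x0, v0
    s = discretize(x, v)
    a = epsilon_greedy(s)
    for _ in range(max_steps):
        x, v = step(x, v, a)
        r = 0.0 if at_goal(x) else -1.0      # reward structure described above
        s2 = discretize(x, v)
        a2 = epsilon_greedy(s2)
        # SARSA update: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))
        Q[(s, a)] = q(s, a) + ALPHA * (r + GAMMA * q(s2, a2) - q(s, a))
        if at_goal(x):
            break
        s, a = s2, a2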
For the following tests, the Matlab implementation of the SARSA algorithm by J.A. Martin [11] was used. For the initial positions where the first
approach began to fail, and also for the initial position of x = -0.5, which is the
“standard” start point suggested by the problem author(s), a comparison was
made in terms of the number of time steps of the solution. This comparison is
displayed in Fig. 5. In most cases, the reinforcement learning algorithm finds
shorter plans than the simple heuristic presented before.
Fig. 6 further presents a detailed comparison between the two
approaches for an initial position of x = -0.5 in terms of car trajectory and car
speed. The upper row contains the results of the simple heuristic. One can notice that during the second oscillation to the left, the car hits the fixed wall and its speed becomes 0. The additional steps of this solution stem from the fact that the heuristic does not control the acceleration so as to climb the left side of the mountain only up to a position that provides just enough momentum to reach the goal. The bottom row contains the results of the reinforcement learning algorithm.
Fig. 5 – Comparison between the number of solution steps
found by the reinforcement learning algorithm and the simple heuristic.
Fig. 6 – Position and velocity of the car for the initial position of x = -0.5
using simple heuristic and reinforcement learning, respectively.
4.4. Filtered Supervised Solution
From the Q matrix found by the SARSA algorithm, we can extract a
supervised learning dataset, so that each row of the Q matrix is transformed into
a training instance.
Let S be the matrix of states, S = (s_ij), 1 ≤ i ≤ n, 1 ≤ j ≤ m, let A be the action vector, A = (a_i), 1 ≤ i ≤ p, and let Q = (q_ij), 1 ≤ i ≤ n, 1 ≤ j ≤ p.
Then the supervised learning dataset will be a matrix whose instances are the states followed by the optimal action in that state. Formally, D is a matrix D = (d_ij), 1 ≤ i ≤ n, 1 ≤ j ≤ m + 1, where:

(2)    d_ij = s_ij, if 1 ≤ j ≤ m,
       d_ij = a_k, where q_ik = max{ q_il | l = 1, ..., p }, if j = m + 1.

More synthetically, the dataset D is defined as:

(3)    D = [ S | A* ]

where:

(4)    A* = (a_i*), such that a_i* = arg max_a Q(s_i, a).
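Eqs. (2)−(4) translate directly into code; a short NumPy sketch, with a toy example whose numbers are purely illustrative, is:

import numpy as np

def extract_dataset(S, A, Q):
    # S: n x m matrix of (discretized) states, A: vector of the p actions,
    # Q: n x p matrix of action values. Returns D = [S | A*], i.e. each
    # state followed by the action with the maximum Q value (Eqs. (2)-(4)).
    best = np.argmax(Q, axis=1)          # index of the best action per state
    A_star = np.asarray(A)[best]         # A* from Eq. (4)
    return np.column_stack((S, A_star))  # D from Eq. (3)

# Toy example with n = 2 states, m = 2 features, p = 3 actions:
S = np.array([[-0.5, 0.00], [-0.4, 0.03]])
A = np.array([-1, 0, 1])
Q = np.array([[-10.0, -12.0, -8.0],
              [ -9.0,  -7.0, -11.0]])
print(extract_dataset(S, A, Q))
# rows: [-0.5, 0.0, 1.0] and [-0.4, 0.03, 0.0]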
However, since the Q matrix was inherently constructed from trial and error episodes, eventually composed into coherent policies, we tried to determine how much of this learned experience was relevant for a supervised learning setting. The supervised learning algorithm used by the agent was the simple nearest neighbor algorithm [4], chosen for its simplicity and very good performance (often up to 100% accuracy on the training set) when the data are not affected by noise [10]. Another reason for choosing the nearest neighbor algorithm was its resemblance to human pattern recognition by analogy. In the case of the mountain car problem, the action is taken by similarity to previously learned actions in situations given by the car position and velocity, i.e. the input states of the problem. It is hypothesized that the dataset resulting from the reinforcement learning may contain noisy or irrelevant training instances, which may affect the optimality of the solution.
In order to filter the dataset, we considered 1000 trials of random
sampling, with a filtering factor varying from 10% to 90%.
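A possible sketch of this filtering experiment, combining a plain 1-nearest-neighbor policy over the extracted dataset with the simulation functions introduced earlier, is shown below (the sampling details are our own assumptions and need not match the original Matlab setup; no feature scaling is applied, for simplicity):

import random

def nearest_neighbor_action(dataset, x, v):
    # 1-NN: return the action of the closest training instance, where each
    # instance is (position, velocity, action) as in the extracted dataset D.
    nearest = min(dataset, key=lambda d: (d[0] - x) ** 2 + (d[1] - v) ** 2)
    return int(nearest[2])

def nn_rollout(dataset, x0, v0=0.0, max_steps=2000):
    x, v = x0, v0
    for t in range(1, max_steps + 1):
        x, v = step(x, v, nearest_neighbor_action(dataset, x, v))
        if at_goal(x):
            return t
    return None

def failure_rate(dataset, factor, trials=1000, x0=-0.5):
    # Keep a random fraction of the training instances and count how often
    # the resulting 1-NN policy fails to reach the goal.
    k = max(1, int(factor * len(dataset)))
    failed = 0
    for _ in range(trials):
        sample = random.sample(list(dataset), k)
        if nn_rollout(sample, x0) is None:
            failed += 1
    return failed / trials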
Fig. 7 – The percent of failed trials when the filtering factor varies.
Fig. 7 shows the percent of failed trials when the filtering factor varies.
By failed trials we mean plans in which the agent fails to reach the goal,
resulting in a continuous oscillatory behavior. When the training dataset is
small, the information may be insufficient for the agent to learn the solution. As
additional information is accumulated, the agent begins to use it to solve the
problem more frequently. Thus the failed trials decrease from over 50% with 27
randomly chosen training instances to less than 1% for 243 randomly chosen
training instances. Of course, all the 270 original instances are sufficient for the
agent to solve the problem every time.
Taking into account only the successful trials, we computed the average number of steps and the minimum number of steps needed to reach the goal, presented graphically in Figs. 8 and 9, respectively.
Fig. 8 – The average number of steps to goal when the filtering factor varies.
Fig. 9 – The minimum number of steps to goal when the filtering factor varies.
One can see that although the average number of steps increases when
the agent receives more information, the minimum number of steps is attained
only with a small dataset. In order to find the optimal dataset, we extended the test to 10000 trials, only for a filtering factor of 10% and the initial position of -0.5. From the initial 270 states used by the SARSA discretization, after removing redundant states and further removing the training instances where the acceleration was 0 (because we considered that the optimal solution should be attained only when the agent actively pursues the goal, with no passive actions), the number of training instances was reduced to 12. The best solution found has only 104 steps. These distinctive instances are displayed in Table 1.
Table 1
The Selected Training Instances

Car position x   Car velocity v   Acceleration a
 (agent state)    (agent state)   (agent action)
     −0.80            −0.04             −1
     −0.70             0.00              1
     −0.70             0.06              1
     −0.60            −0.06              1*
     −0.40            −0.03             −1
     −0.40            −0.02             −1
     −0.40             0.04              1
     −0.30            −0.04             −1
     −0.10            −0.04             −1
      0.00            −0.01             −1
      0.10            −0.03             −1
      0.10             0.02              1
It is clear that these instances conform to the simple heuristic described in Section 4.2, with only one exception, marked in Table 1 with an asterisk following the class/action. This instance is responsible for the decrease in the number of steps, because it “tells” the agent when to switch to driving forward on the left side of the valley, and thus to reach the goal on the right side earlier.
5. Conclusions
This paper presents a way to include supervised, inductive learning into
a planning problem. A model of extracting a training dataset from the Q matrix
of a reinforcement learning algorithm is described. The agent does not possess all the necessary information at any given time; instead, it needs to compute the optimal action. If the environment is non-deterministic, the agent can learn and change its model. A predicate representation of the states is not necessary, because these are dynamically recognized by means of predictions made on the basis of the learned model. However, the model can be interpreted symbolically, because the actual values that compose it are explicitly available.
A c k n o w l e d g e m e n t s. This work was supported by CNCSIS-UEFISCSU, project number PNII-IDEI 316/2008, Behavioural Patterns Library for Intelligent Agents Used in Engineering and Management.
Received: May 12, 2010
“Gheorghe Asachi” Technical University of Iaşi,
Department of Computer Engineering
e-mails: [email protected]
REFERENCES
1. Blum A., Furst M.L., Fast Planning Through Planning Graph Analysis. Artificial
Intel., 90, 1-2, 1997.
2. Brooks R.A., Intelligence without Reason. Proc. of the Twelfth Internat. Joint Conf.
on Artificial Intel., Sydney, Australia, 569−595, 1991.
3. Brooks R.A., Intelligence without Representation. Artificial Intel., 47, 139−159,
1991.
4. Cover T.M., Hart P.E., Nearest Neighbor Pattern Classification. IEEE Trans. on Information Theory, 13, 1, 21−27, 1967.
5. Fikes R., Nilsson N., STRIPS: A New Approach to the Application of Theorem
Proving to Problem Solving. Artificial Intel., 2, 189−208, 1971.
6. Ghallab M., Howe A., Knoblock C., McDermott D., Ram A., Veloso M., Weld D.,
Wilkins D., PDDL - the Planning Domain Definition Language. Techn.
Report CVC TR-98-003/DCS TR-1165, Yale Center for Computational
Vision and Control, 1998.
7. Jennings N.R., Sycara K., Wooldridge M., A Roadmap of Agent Research and Development. Autonomous Agents and Multi-Agent Syst., 1, 7−38, 1998.
8. Jónsson A., Morris P., Muscettola N., Rajan K., Planning in Interplanetary Space: Theory and Practice. Proc. of AIPS, 2000.
9. Kautz H., Selman B., Pushing the Envelope: Planning, Propositional Logic and
Stochastic Search. In Proc. of AAAI, 1996.
10. Leon F., Intelligent Agents with Cognitive Capabilities. Edit. Tehnopress, Iaşi,
România, 2006.
11. Martin J.A., A Reinforcement Learning Environment in Matlab. http://www.dia.fi.upm.es/~jamartin/download.htm, 2010.
12. Moore A., Efficient Memory-Based Learning for Robot Control. Ph. D. Diss, Univ.
of Cambridge, 1990.
13. Naeeni A.F., Advanced Multi-Agent Fuzzy Reinforcement Learning. Master Diss., Computer Sci. Depart., Dalarna Univ. College, Sweden, http://www2.informatik.hu-berlin.de/~ferdowsi/Thesis/Master%20Thesis.pdf, 2004.
14. Pednault E.P.D., ADL: Exploring the Middle Ground Between STRIPS and the
Situation Calculus. Proc. of Knowledge Representation Conf., 1989.
15. RL-Community, Mountain Car (Java). RL-Library, http://library.rl-community.org/wiki/Mountain_Car_(Java), 2010.
16. Rummery G.A., Niranjan M., On-line Q-learning Using Connectionist Systems.
Techn. Report CUED/F-INFENG/TR 166, Engng. Depart., Cambridge
Univ., 1994.
17. Russell S., Norvig P., Artificial Intelligence: A Modern Approach. Prentice Hall; 2nd
Ed., 2002.
18. Singh S.P., Sutton R.S., Reinforcement Learning with Replacing Eligibility Traces. Machine Learning, 22, 1/2/3, 123−158, 1996.
19. Sutton R.S., Barto A.G., Reinforcement Learning: An Introduction. MIT Press,
Cambridge, Massachusetts, 1998.
20. Taylor M.E., Kuhlmann G., Stone P., Autonomous Transfer for Reinforcement
Learning. Proc. of the Seventh Internat. Joint Conf. on Autonomous Agents
and Multiagent Syst., 2008.
21. Vlahavas I., Vrakas D. (Eds.), Intelligent Techniques for Planning. Idea Group
Publ., 2005.
22. Watkins C.J.C.H., Dayan P., Technical Note: Q-Learning. Machine Learning, 8, 279−292, 1992.
23. Wooldridge M., Intelligent Agents. In G. Weiß (Ed.), Multiagent Systems - A
Modern Approach to Distributed Artificial Intelligence, The MIT Press,
Cambridge, Massachusetts, 2000.
24. Wooldridge M., Jennings N.R., Agent Theories, Architectures, and Languages: a
Survey. In Wooldridge and Jennings (Eds.), Intelligent Agents, Springer
Verlag, Berlin, 1995.
A PLANNING METHOD FOR INTELLIGENT AGENTS
WITH QUASI-DETERMINED STATES
USING INDUCTIVE LEARNING
(Summary)
Traditional representations of planning problems use predicate logic, and many planning algorithms consider the environment to be deterministic and the planning agent to be detached from its execution environment. Reactive agent architectures have also been proposed, which attempt to solve the problem of quick responses to changes in the environment and consider intelligent behavior to be an emergent result of the interactions between simpler behaviors arranged on levels. However, these approaches do not take learning into account as an intrinsic part of problem-solving or planning behavior. This paper describes a method of including a learning phase into the plan itself, so that the agent can dynamically recognize the preconditions of an action when the states are not fully determined, and even choose its actions directly based on the results of learning. As a case study, the “mountain car” problem is considered, a typical reinforcement learning problem with continuous states. Comparisons are made between the results of two heuristics, the solution obtained with the SARSA algorithm, and the supervised approach obtained by filtering the results of the reinforcement learning. A model for extracting a training dataset from the Q matrix of a reinforcement learning algorithm is also described. The agent does not have all the necessary information at every moment in time; instead, it has to compute the optimal action. If the environment is non-deterministic, the agent can learn and change its model. A predicate representation of the states is not necessary, because these are dynamically recognized through predictions made on the basis of the learned model. Nevertheless, the model can be interpreted symbolically, because the actual values that compose it are explicitly available.