BULETINUL INSTITUTULUI POLITEHNIC DIN IAŞI
Publicat de Universitatea Tehnică „Gheorghe Asachi" din Iaşi
Tomul LVI (LX), Fasc. 3, 2010
Secţia AUTOMATICĂ şi CALCULATOARE

INTELLIGENT AGENT PLANNING WITH QUASI-DETERMINED STATES USING INDUCTIVE LEARNING

BY FLORIN LEON

Abstract. Traditional representations for planning problems use predicative logic, and many planning algorithms consider the environment to be deterministic and the planning agent to be detached from its execution environment. Reactive agent architectures have also been proposed that address the problem of quick responses to changes in the environment, and consider the intelligent behavior of an agent to be an emergent result of the interaction of simpler, layered behaviors. However, these approaches do not take learning into account as an intrinsic part of problem-solving or planning behavior. In this paper, we describe a method of including a learning phase into the plan itself, so that the agent can dynamically recognize the preconditions of an action when the states are not fully determined, and even directly choose its actions based on learning results.

Key words: intelligent agents, planning, inductive learning, classification.

2000 Mathematics Subject Classification: 68T20, 68T42.

1. Introduction

Many researchers regard agent-based solutions as a new paradigm for handling complexity in software systems. Recently, an increasing number of industrial applications have certified the success of this approach, mainly in domains such as telecommunication networks, manufacturing enterprises, air traffic control, transportation systems, electronic commerce, patient monitoring, and process control [7]. Despite the lack of general agreement regarding an established definition of an agent, it is generally accepted that an agent is a software or hardware entity that displays the properties of autonomy, i.e. it is capable of independent, unsupervised actions, and situatedness, i.e.
it is part of the physical or simulated execution environment. Wooldridge and Jennings [24], [23] further describe an intelligent agent as having the additional properties of: reactivity (the ability to perceive its environment and respond in a timely manner to changes that occur in it), pro-activeness (the ability to exhibit goal-directed behavior by taking the initiative), and social ability (the ability to interact with other agents and possibly humans).

An agent can decide its response to a request autonomously, unlike a typical object in the object-oriented programming paradigm, whose methods are simply called from other objects and whose code is automatically executed. Thus, an agent can be viewed from the outside as a black box, with perceptual or sensorial input from its environment and, as output, effectoric capabilities to modify that same environment. An agent can therefore be seen as a function that continuously maps its perceptions into actions, in many cases taking its internal state into account as well. One of the fundamental issues of agent-based design is thus how to choose the optimal action, or how to achieve an optimal sequence of actions: a plan.

2. Planning Methods in Artificial Intelligence

Traditionally, a great deal of work in artificial intelligence was devoted to planning algorithms. Several notations are used to describe planning problems. One of the first was the STanford Research Institute Problem Solver, STRIPS [5], famous for its use by Shakey the robot, the first mobile robot able to reason about its own actions, in the context of the blocks world. In this representation, a state is a conjunction of positive ground first-order literals. In order to address the frame problem, it uses the "closed-world assumption", i.e. any condition that is not explicitly mentioned in a state is assumed to be false. The goal is represented by a conjunction of literals.
The actions are represented by so-called "action schemas", including preconditions and effects, for example:

Action(Drive(car, Iasi, Timisoara),
  Precondition: At(car, Iasi) ∧ Car(car)
  Effect: ¬At(car, Iasi) ∧ At(car, Timisoara))

The ADL language [14] is an extension of STRIPS that allows, among other things, positive and negative literals in states, quantified variables in goals, and goals expressed as conjunctions and disjunctions. It also includes typed variables, e.g. (car: Car), and uses the "open-world assumption", i.e. unmentioned literals are unknown. PDDL [6] differs from STRIPS and ADL from a syntactic point of view, and its goal is to be a broader modeling language for planning problems; STRIPS and ADL can be considered fragments of PDDL. An example of a blocks world problem described in PDDL is as follows [21]:

(:action move
  :parameters (?b - block ?from ?to - loc)
  :precondition (and (not (blocked ?b)) (not (blocked ?to))
                     (on ?b ?from) (not (= ?from ?to)))
  :effect (and (not (blocked ?from)) (on ?b ?to)
               (when (not (= ?to table)) (blocked ?to))))

Many algorithms have been devised to solve classical planning problems. Besides straightforward state-space search, with the forward (progression) and backward (regression) search methods resembling conventional problem solving techniques, we can mention the partial-order planning (POP) algorithms, whose main idea is to establish partial sequences of actions in a plan without specifying an ordering between these sequences beforehand; within the partial action sequences, a set of ordering constraints is defined. Faced with scalability issues, more recent planning algorithms emerged, such as Graphplan [1], SATPLAN [9], and RAX [8]. A typical shortcoming of many classical planning algorithms is that they assume the environment to be fully observable, static, and deterministic.
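The STRIPS-style add/delete semantics described above can be illustrated with a minimal Python sketch, in which a state is a set of ground positive literals under the closed-world assumption (the grounded Drive action is taken from the example in the text; the tuple encoding is only illustrative):

```python
# Minimal sketch of STRIPS-style action application under the
# closed-world assumption: a state is a set of ground positive literals.

def applicable(state, preconditions):
    """An action is applicable when all of its preconditions hold in the state."""
    return preconditions <= state

def apply_action(state, preconditions, add_list, delete_list):
    """Return the successor state, or None if the action is not applicable."""
    if not applicable(state, preconditions):
        return None
    return (state - delete_list) | add_list

# The Drive(car, Iasi, Timisoara) schema from the text, grounded by hand:
state = {("At", "car", "Iasi"), ("Car", "car")}
pre   = {("At", "car", "Iasi"), ("Car", "car")}
add   = {("At", "car", "Timisoara")}
dele  = {("At", "car", "Iasi")}

new_state = apply_action(state, pre, add, dele)
# new_state == {("Car", "car"), ("At", "car", "Timisoara")}
```

The negative effect ¬At(car, Iasi) becomes membership in the delete list, which is exactly how STRIPS avoids stating frame axioms explicitly.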
For dealing with incomplete information and non-deterministic settings, other planning methods are needed, such as [17]:

1. Conformant planning, which must ensure that the plan achieves the goal in all possible circumstances, regardless of the true initial state and the actual action outcomes;
2. Conditional planning, which constructs plans with different branches for the different contingencies that could occur;
3. Execution monitoring and replanning, which checks whether the plan still applies to the current situation or needs to be revised;
4. Continuous planning, where the planner executes uninterruptedly, even after achieving one goal, and can therefore handle the abandonment of goals or the creation of additional goals.

A critique of the AI establishment was made by Rodney Brooks [2], [3], who considers that traditional AI planners are disembodied entities, not physically situated in their execution environment, as is the case with agents. He considers that the abstraction of representation can be misleading and can oversimplify real-world problems. Brooks proposed the reactive subsumption architecture, stating that the intelligence of an agent results from the continuous interaction between the agent and its environment. He considered a set of behavior levels, from low-level, critical reactions to more abstract ones, and claims that intelligent behavior is an emergent property of the interaction of these simpler behaviors. However, although reactive architectures are well-suited for certain problems, they are less suited to others [7]. Therefore, there is a need to investigate hybrid architectures that combine the best of the two extremes.

3. Planning with Quasi-Determined States

The classical planning methods use a predicative logic representation for the states.
For example, if a robotic agent has a plan of taking an apple off the table and putting it into a basket, a typical plan would use a predicate such as Apple(a) to describe this object. However, in a real-life situation, if more objects were placed on the table, it could be difficult for the agent to recognize the apple among them. Another difficulty would arise if the apple on the table had a non-standard appearance in terms of size, shape, or color. In this case, the preconditions of actions, which describe the states in which the agent has to make a decision, are not fully determined. We call such states quasi-determined states.

In order to address the issue of reactive actions, we propose that besides recognizing the preconditions of an action, classification can be used to directly map a state to an action. A training dataset can be used to choose the appropriate action in a state instead of a conditional planning method. As an example, we can mention the classical "weather" dataset, which decides whether someone should play tennis or not based on attributes such as temperature, humidity, and outlook, or the decision to give a loan to a person based on his/her marital status, age, and income. The applications of classification are presently extensive in real-life situations, such as deciding the optimal treatment of patients, classifying electronic payments as legitimate or not, speech and handwriting recognition, etc. All these and similar tasks could be part of a longer plan of an agent. Another advantage of using learning for quasi-determined states is that the agent can adapt in real time to the changing conditions of its execution environment.

There are many inductive learning algorithms that address the problem of classification.
We can mention three main classes of such algorithms: decision trees, which provide an explicit symbolic result, similar to the rules on which humans base their conscious reasoning process; instance-based methods, similar to the way humans recognize objects or situations by analogy to previously encountered ones; and Bayesian methods, similar to the way in which humans make decisions based on probabilities or the frequency of event occurrence. Another well-known technique that can be used for classification is the subsymbolic, neural network approach.

In summary, Fig. 1 shows the two situations where an inductive learning phase can be included in a plan. Action 1 is a classical action within a plan. Action 2 has its preconditions determined by classification. Action 3 is dynamically determined by a supervised procedure. ES, the set of effects of the Start pseudo-action, represents the initial state of the problem. PF, the set of preconditions of the Finish pseudo-action, represents the goal of the problem. Pi designate the preconditions of an action and Ei designate its effects.

Fig. 1 – A plan with quasi-determined states and learning phases.

4. Case Study: The Mountain Car Problem

As an example of how learning and quasi-determined states can be incorporated into a planning mechanism for an intelligent agent, we will consider the "Mountain Car" problem. It was originally presented by Andrew Moore in his PhD dissertation [12], and Sutton and Barto later included it as an exercise in their well-known introductory book on reinforcement learning [19]. The task requires an agent to drive an underpowered car up a steep mountain road. Since gravity is stronger than the engine of the car, even at full power the car cannot accelerate up the steep slope [15].
The movement of the car is described by two continuous output variables, position x ∈ [−1.2, 0.5] and velocity v ∈ [−0.07, 0.07], and one discrete input representing its acceleration a. The acceleration is therefore the action that the agent chooses, and it can be one of three available discrete options: full thrust forward (1), no thrust (0), and full thrust backward (−1). Recently, a 3D version of the problem has been proposed, which extends the standard 2D variant and whose states are described by 4 continuous variables [20]. The mountain curve on which the car is moving is defined as h = sin(3x). According to the laws of physics, the discrete-time state space equations of the system are those presented in Eq. (1):

(1)    v_{t+1} = v_t − 0.0025 · cos(3x_t) + 0.001 · a_t
       x_{t+1} = x_t + v_{t+1}

where a_t ∈ {−1, 0, 1}. Fig. 2 presents the setting of the mountain car problem (adapted after Singh and Sutton [18] and Naeeni [13]).

Fig. 2 – The mountain car problem.

Both state variables are kept within their defined ranges, i.e. all values above or below the boundaries are set to their extreme values. When the position x is equal to the extreme left boundary −1.2, the velocity v is set to 0. The goal, the top of the mountain, is located at x = 0.5.

The problem is particularly interesting because, in order to reach its goal, the car must gain enough kinetic energy by accelerating in alternating directions, backward or forward. It must first drive backward, up the other side of the valley, to gain enough momentum to drive forward up the hill. It will therefore move away from the goal at first in order to find a working solution. Also, the states of the problem, defined by position and velocity, are continuous, real-valued, and this causes an additional difficulty for a reinforcement learning algorithm dealing with discrete states. Finally, because of the external factor, gravity, and the momentum of the car, the actions may not have the same results in similar states.
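Eq. (1), together with the boundary rules above, can be turned into a short simulation step. This is only a sketch of the standard dynamics, not the exact code used in the experiments:

```python
import math

X_MIN, X_MAX = -1.2, 0.5
V_MIN, V_MAX = -0.07, 0.07

def step(x, v, a):
    """One discrete time step of the mountain car dynamics, Eq. (1)."""
    v = v - 0.0025 * math.cos(3 * x) + 0.001 * a
    v = max(V_MIN, min(V_MAX, v))          # keep velocity in its range
    x = x + v
    if x <= X_MIN:                         # left wall: position clipped, velocity zeroed
        x, v = X_MIN, 0.0
    x = min(x, X_MAX)
    return x, v

# Starting at rest in the valley, full forward thrust:
x, v = step(-0.5, 0.0, 1)
```

Note that at x = −0.5 the gravity term −0.0025·cos(−1.5) is already comparable in magnitude to the 0.001 thrust term, which is exactly why full forward thrust alone cannot climb the slope.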
4.1. Naïve Heuristic

First, we can verify the assumptions of the problem when the agent uses a naïve heuristic, i.e. it maintains the acceleration forward (a = 1) at all times. Fig. 3 shows the behavior of the system for different initial positions.

Fig. 3 – The behavior of the agent using the naïve heuristic for different starting positions.

When the initial position is near the top of the mountain on the opposite side from the goal, the car's momentum is enough to climb the mountain side and reach the goal. The momentum remains sufficient until the initial position becomes −0.84. Beyond this point, the naïve heuristic shows its limitation, because the car becomes engaged in an oscillatory movement over the alternating sides of the valley. Closer to the goal, only an initial position of 0.39 or greater is enough to reach the goal directly, using forward acceleration.

4.2. Simple Heuristic

Taking into account the characteristics of the problem, we can devise a simple heuristic that ensures that the goal is reached every time. The heuristic tries to make maximum use of the gravitational force: the acceleration of the car is the sign of its speed. Fig. 4 shows the behavior of the system for different initial positions in this case.

Fig. 4 – The behavior of the agent using the simple heuristic for different starting positions.

One can see that even when the initial position x ∈ [−0.84, 0.38], the agent reaches its goal after several amplifying oscillations.

4.3. Reinforcement Learning Solution

The simple heuristic presented above does not solve the problem in an optimal manner, i.e. with a minimum number of time steps. The problem was originally designed to be solved with reinforcement learning algorithms, so we employ such a technique to find shorter plans for the agent.
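As a baseline for the comparisons that follow, the simple heuristic of paragraph 4.2 can be simulated directly. This is a minimal self-contained sketch assuming the dynamics and boundary rules given earlier; the step cap is an arbitrary safety limit, not a value from the experiments:

```python
import math

def step(x, v, a):
    """Mountain car dynamics, Eq. (1), with the boundary rules of the text."""
    v = max(-0.07, min(0.07, v - 0.0025 * math.cos(3 * x) + 0.001 * a))
    x = x + v
    if x <= -1.2:                  # left wall: velocity is set to 0
        x, v = -1.2, 0.0
    return min(x, 0.5), v

def simple_heuristic(v):
    """Accelerate in the direction of the current velocity (its sign)."""
    return 1 if v > 0 else (-1 if v < 0 else 0)

def run(x0, max_steps=2000):
    """Drive with the simple heuristic; return the number of steps to the goal."""
    x, v = x0, 0.0
    for t in range(1, max_steps + 1):
        x, v = step(x, v, simple_heuristic(v))
        if x >= 0.5:
            return t
    return None  # goal not reached within max_steps

steps = run(-0.5)  # the "standard" starting position
```

Because the thrust always pushes in the direction of motion, each oscillation adds energy, so the car eventually escapes the valley from any starting position.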
Model-free reinforcement learning algorithms using temporal differences, such as Q-Learning [22] or State-Action-Reward-State-Action, SARSA [16], need to discretize the continuous input states of the problem. The Q function, used to determine the best action to be taken in a particular state, is defined as Q : S × A → ℜ and is usually implemented as a matrix containing the real-valued rewards r ∈ ℜ given by the environment in a particular state s ∈ S when performing a particular action a ∈ A. The mountain car problem is also difficult for a reinforcement learning algorithm because all the rewards are −1, with the exception of the goal state, where the reward is 0. Therefore, the agent becomes aware of a higher reward only when it finally reaches the goal.

For the following tests, the Matlab implementation of the SARSA algorithm by J.A. Martin [11] was used. For the initial positions where the first approach began to fail, and also for the initial position of x = −0.5, which is the "standard" start point suggested by the problem author(s), a comparison was made in terms of the number of time steps of the solution. This comparison is displayed in Fig. 5. In most cases, the reinforcement learning algorithm finds shorter plans than the simple heuristic presented before.

Fig. 5 – Comparison between the number of solution steps found by the reinforcement learning algorithm and the simple heuristic.

Fig. 6 further presents a detailed comparison between the two approaches for an initial position of x = −0.5 in terms of car trajectory and car speed. The upper row contains the results of the simple heuristic. One can notice that during the second left oscillation, the car hits the fixed wall and its speed becomes 0. The additional steps of this solution arise because the heuristic does not control the acceleration well enough to climb the left side of the mountain only up to a position sufficient to gain enough momentum to reach the goal. The bottom row contains the results of the reinforcement learning algorithm.

Fig. 6 – Position and velocity of the car for the initial position of x = −0.5 using the simple heuristic and reinforcement learning, respectively.

4.4. Filtered Supervised Solution

From the Q matrix found by the SARSA algorithm, we can extract a supervised learning dataset, so that each row of the Q matrix is transformed into a training instance. Let S be the matrix of states, S = (s_ij), 1 ≤ i ≤ n, 1 ≤ j ≤ m, let A be the action vector, A = (a_i), 1 ≤ i ≤ p, and let Q = (q_ij), 1 ≤ i ≤ n, 1 ≤ j ≤ p. Then the supervised learning dataset will be a matrix whose instances are the states followed by the optimal action in that state. Formally, D is a matrix D = (d_ij), 1 ≤ i ≤ n, 1 ≤ j ≤ m + 1, where:

(2)    d_ij = s_ij, if 1 ≤ j ≤ m
       d_ij = a_k, where q_ik = max{q_il | l = 1, ..., p}, if j = m + 1

More synthetically, the dataset D is defined as:

(3)    D = [S | A*]

where:

(4)    A* = (a_i*), such that a_i* = argmax_a Q(s_i, a)

However, since the Q matrix was inherently constructed from trial and error episodes, eventually composed into coherent policies, we tried to determine how much of this learned experience was relevant in a supervised learning setting. The supervised learning algorithm used by the agent was the simple nearest neighbor algorithm [4], due to its simplicity and very good performance (mostly up to 100% on the training set) when the data are not affected by noise [10]. Another reason for choosing the nearest neighbor algorithm was its resemblance to human pattern recognition by analogy. In the case of the mountain car problem, the action is taken by similarity to previously learned actions in situations given by the car position and car velocity, i.e. the input states of the problem.
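Eqs. (2)–(4) amount to a row-wise argmax over the Q matrix, and the resulting instances can then drive a 1-NN policy. The sketch below uses made-up values (not the paper's actual Q matrix) and unscaled Euclidean distance, since the paper does not specify a metric or feature scaling:

```python
def extract_dataset(states, actions, Q):
    """Eqs. (2)-(4): each instance is a state followed by its greedy action."""
    dataset = []
    for s_row, q_row in zip(states, Q):
        best = max(range(len(actions)), key=lambda l: q_row[l])  # argmax_l q_il
        dataset.append(tuple(s_row) + (actions[best],))
    return dataset

def nn_action(dataset, x, v):
    """1-NN over (position, velocity): act as the most similar instance did."""
    return min(dataset, key=lambda r: (r[0] - x) ** 2 + (r[1] - v) ** 2)[-1]

# Illustrative 3-state, 3-action example:
states  = [(-0.8, -0.04), (-0.5, 0.0), (0.1, 0.02)]
actions = (-1, 0, 1)
Q = [[-5.0, -6.0, -7.0],    # greedy action: -1
     [-4.0, -3.5, -3.0],    # greedy action: +1
     [-2.0, -2.5, -1.0]]    # greedy action: +1

D = extract_dataset(states, actions, Q)
a = nn_action(D, -0.78, -0.05)   # closest instance is (-0.8, -0.04) -> action -1
```

Because the rewards in this problem are all −1 except at the goal, the Q values are negative, and the greedy action simply maximizes each row, exactly as in Eq. (2).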
It is hypothesized that the dataset resulting from the reinforcement learning may contain noisy or irrelevant training instances, which may affect the optimality of the solution. In order to filter the dataset, we considered 1000 trials of random sampling, with a filtering factor varying from 10% to 90%.

Fig. 7 – The percentage of failed trials when the filtering factor varies.

Fig. 7 shows the percentage of failed trials when the filtering factor varies. By failed trials we mean plans in which the agent fails to reach the goal, resulting in a continuous oscillatory behavior. When the training dataset is small, the information may be insufficient for the agent to learn the solution. As additional information is accumulated, the agent begins to use it to solve the problem more frequently. Thus, the failed trials decrease from over 50% with 27 randomly chosen training instances to less than 1% for 243 randomly chosen training instances. Of course, all 270 original instances are sufficient for the agent to solve the problem every time.

Taking into account only the successful trials, we counted the average number of steps and the minimum number of steps needed to reach the goal, graphically presented in Figs. 8 and 9, respectively.

Fig. 8 – The average number of steps to goal when the filtering factor varies.

Fig. 9 – The minimum number of steps to goal when the filtering factor varies.

One can see that although the average number of steps increases when the agent receives more information, the minimum number of steps is attained only with a small dataset. In order to find the optimal dataset, we extended the test with 10000 trials, only for a filtering factor of 10%, for the initial position of −0.5.
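The random-sampling filtering itself is straightforward; a sketch of one trial is shown below. The success test is left abstract, since deciding whether a sampled subset reaches the goal requires the full simulation, and the stand-in dataset here is illustrative only:

```python
import random

def filter_dataset(dataset, factor, rng):
    """Keep a random fraction `factor` of the training instances."""
    k = max(1, round(factor * len(dataset)))
    return rng.sample(dataset, k)

rng = random.Random(42)
dataset = list(range(270))               # stand-in for the 270 instances
subset = filter_dataset(dataset, 0.10, rng)
# a 10% filtering factor keeps 27 of the 270 instances,
# matching the "27 randomly chosen training instances" in the text
```

With a 90% factor the same function keeps 243 instances, the other figure quoted in the text.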
From the initial 270 states used by the SARSA discretization, after removing redundant states and further removing the training instances where the acceleration was 0 (because we considered that the optimal solution could be attained only when the agent actively pursues the goal, with no passive actions), the number of training instances was reduced to 12. The best solution now has only 104 steps. These distinctive instances are displayed in Table 1.

Table 1
The Selected Training Instances

Car position x    Car velocity v    Acceleration a
 (agent state)     (agent state)    (agent action)
    −0.80             −0.04              −1
    −0.70              0.00               1
    −0.70              0.06               1
    −0.60             −0.06               1*
    −0.40             −0.03              −1
    −0.40             −0.02              −1
    −0.40              0.04               1
    −0.30             −0.04              −1
    −0.10             −0.04              −1
     0.00             −0.01              −1
     0.10             −0.03              −1
     0.10              0.02               1

It is clear that these instances conform to the simple heuristic described in paragraph 4.2, with only one exception, marked in Table 1 with an asterisk following the class/action. This instance is responsible for the decrease in the number of steps, because it "tells" the agent when to switch to driving forward on the left side of the valley, and thus to reach the goal on the right side earlier.

5. Conclusions

This paper presents a way to include supervised, inductive learning into a planning problem. A model of extracting a training dataset from the Q matrix of a reinforcement learning algorithm is described. The agent does not possess all the necessary information at any given time; it needs to compute the optimal action. If the environment is non-deterministic, the agent can learn and change its model. A predicative representation of the states is not necessary, because these are dynamically recognized by means of predictions made on the basis of the learnt model. However, the model can be symbolically interpreted, because the actual values that compose it are explicitly available.

A c k n o w l e d g e m e n t s.
This work was supported by CNCSIS–UEFISCSU, project number PNII-IDEI 316/2008, Behavioural Patterns Library for Intelligent Agents Used in Engineering and Management.

Received: May 12, 2010

"Gheorghe Asachi" Technical University of Iaşi,
Department of Computer Engineering
e-mail: [email protected]

REFERENCES

1. Blum A., Furst M.L., Fast Planning Through Planning Graph Analysis. Artificial Intel., 90, 1-2, 1997.
2. Brooks R.A., Intelligence without Reason. Proc. of the Twelfth Internat. Joint Conf. on Artificial Intel., Sydney, Australia, 569−595, 1991.
3. Brooks R.A., Intelligence without Representation. Artificial Intel., 47, 139−159, 1991.
4. Cover T.M., Hart P.E., Nearest Neighbor Pattern Classification. IEEE Trans. on Information Theory, 13, 1, 21−27, 1967.
5. Fikes R., Nilsson N., STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving. Artificial Intel., 2, 189−208, 1971.
6. Ghallab M., Howe A., Knoblock C., McDermott D., Ram A., Veloso M., Weld D., Wilkins D., PDDL − The Planning Domain Definition Language. Techn. Report CVC TR-98-003/DCS TR-1165, Yale Center for Computational Vision and Control, 1998.
7. Jennings N.R., Sycara K., Wooldridge M., A Roadmap of Agent Research and Development. Autonomous Agents and Multi-Agent Syst., 1, 7−38, 1998.
8. Johnson A., Morris P., Muscettola N., Rajan K., Planning in Interplanetary Space: Theory and Practice. Proc. of AIPS, 2000.
9. Kautz H., Selman B., Pushing the Envelope: Planning, Propositional Logic and Stochastic Search. Proc. of AAAI, 1996.
10. Leon F., Intelligent Agents with Cognitive Capabilities. Edit. Tehnopress, Iaşi, România, 2006.
11. Martin J.A., A Reinforcement Learning Environment in Matlab. http://www.dia.fi.upm.es/~jamartin/download.htm, 2010.
12. Moore A., Efficient Memory-Based Learning for Robot Control. Ph.D. Diss., Univ. of Cambridge, 1990.
13. Naeeni A.F., Advanced Multi-Agent Fuzzy Reinforcement Learning. Master Diss., Computer Sci. Depart., Dalarna Univ. College, Sweden, http://www2.informatik.hu-berlin.de/~ferdowsi/Thesis/Master%20Thesis.pdf, 2004.
14. Pednault E.P.D., ADL: Exploring the Middle Ground Between STRIPS and the Situation Calculus. Proc. of Knowledge Representation Conf., 1989.
15. RL-Community, Mountain Car (Java). RL-Library, http://library.rl-community.org/wiki/Mountain_Car_(Java), 2010.
16. Rummery G.A., Niranjan M., On-line Q-learning Using Connectionist Systems. Techn. Report CUED/F-INFENG/TR 166, Engng. Depart., Cambridge Univ., 1994.
17. Russell S., Norvig P., Artificial Intelligence: A Modern Approach. Prentice Hall, 2nd Ed., 2002.
18. Singh S.P., Sutton R.S., Reinforcement Learning with Replacing Eligibility Traces. Machine Learning, 22, 1-3, 123−158, 1996.
19. Sutton R.S., Barto A.G., Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts, 1998.
20. Taylor M.E., Kuhlmann G., Stone P., Autonomous Transfer for Reinforcement Learning. Proc. of the Seventh Internat. Joint Conf. on Autonomous Agents and Multiagent Syst., 2008.
21. Vlahavas I., Vrakas D. (Eds.), Intelligent Techniques for Planning. Idea Group Publ., 2005.
22. Watkins C.J.C.H., Dayan P., Technical Note: Q-Learning. Machine Learning, 8, 279−292, 1992.
23. Wooldridge M., Intelligent Agents. In G. Weiß (Ed.), Multiagent Systems − A Modern Approach to Distributed Artificial Intelligence, The MIT Press, Cambridge, Massachusetts, 2000.
24. Wooldridge M., Jennings N.R., Agent Theories, Architectures, and Languages: A Survey. In Wooldridge and Jennings (Eds.), Intelligent Agents, Springer Verlag, Berlin, 1995.

PLANNING METHOD FOR INTELLIGENT AGENTS WITH QUASI-DETERMINED STATES USING INDUCTIVE LEARNING

(Abstract)

Traditional representations of planning problems use predicative logic, and many planning algorithms consider the environment to be deterministic and the planning agent to be detached from its execution environment.
Reactive agent architectures have also been proposed, which attempt to solve the problem of quick responses to changes in the environment and consider intelligent behavior to be an emergent result of the interactions between simpler behaviors arranged in layers. However, these approaches do not take learning into account as an intrinsic part of problem-solving or planning behaviors. This article describes a method of including a learning phase into the plan itself, so that the agent can dynamically recognize the preconditions of an action when the states are not fully determined, and even directly choose its actions based on the learning results. As a case study, the "mountain car" problem is considered, a problem typical of reinforcement learning, with continuous states. Comparisons are made between the results of two heuristics, the solution obtained with the SARSA algorithm, and the supervised approach resulting from filtering the reinforcement learning results. A model of extracting a training dataset from the Q matrix of a reinforcement learning algorithm is also described. The agent does not possess all the necessary information at every moment in time; instead, it must compute the optimal action. If the environment is non-deterministic, the agent can learn and change its model. A predicative representation of the states is not necessary, because these are dynamically recognized through predictions made on the basis of the learnt model. Nevertheless, the model can be interpreted symbolically, because the actual values that compose it are explicitly available.