Deep and Beautiful.
The Reward Prediction Error Hypothesis of Dopamine
Abstract
According to the reward-prediction error hypothesis (RPEH) of dopamine, the phasic
activity of dopaminergic neurons in the midbrain signals a discrepancy between the predicted and
currently experienced reward of a particular event. It has been claimed that this hypothesis is deep, elegant and beautiful, representing one of the greatest successes of computational neuroscience. This paper examines this claim, making two contributions to the existing literature. First, it provides a comprehensive historical account of the main steps that led to the formulation and subsequent
success of the RPEH. Second, in light of this historical account, it explains in which sense the
RPEH is explanatory and under which conditions it can be justifiably deemed deeper than the
incentive salience hypothesis of dopamine, which is arguably the most prominent contemporary
alternative to the RPEH.
Keywords: Dopamine; Reward-Prediction Error; Explanatory Depth; Incentive Salience;
Reinforcement Learning
1. Introduction
According to the reward-prediction error hypothesis of dopamine (RPEH), the phasic activity of
dopaminergic neurons in specific regions in the midbrain signals a discrepancy between the
predicted and currently experienced reward of a particular event. The RPEH is widely regarded as
one of the greatest successes of computational neuroscience. Terrence Sejnowski, a pioneer in computational neuroscience and a prominent cognitive scientist, pointed to the RPEH when, in 2012,
he was invited by the online magazine Edge.org to answer the question “What is your favorite deep,
elegant, or beautiful explanation?” Several researchers in cognitive and brain sciences would agree
that this hypothesis “has become the standard model [for explaining dopaminergic activity and
reward-based learning] within neuroscience” (Caplin & Dean, 2008, p. 663). Even among critics,
the “stunning elegance” and the “beautiful rigor” of the RPEH are recognized (Berridge, 2007, p.
399 and p. 403).
However, the type of information coded by dopaminergic transmission—along with its
functional role in cognition and behaviour—is very likely to go beyond reward-prediction error.
The RPEH is not the only available hypothesis about what type of information is encoded by
dopaminergic activity in the midbrain (cf., Berridge, 2007; Friston, Shiner, FitzGerald, Galea,
Adams et al., 2012; Graybiel, 2008; Wise, 2004). Current evidence does not speak univocally in
favour of this hypothesis, and disagreement remains about the extent to which the RPEH is supported by
available evidence (Dayan & Niv, 2008; O’Doherty, 2012; Redgrave & Gurney, 2006). On the one
hand, it has been claimed that “to date no alternative has mustered as convincing and
multidirectional experimental support as the prediction-error theory of dopamine” (Niv &
Montague, 2009, p. 342); on the other hand, counter-claims have been put forward that the
RPEH is an “elegant illusion” and that “[s]o far, incentive salience predictions [that is, predictions
of an alternative hypothesis about dopamine] appear to best fit the data from situations that
explicitly pit the dopamine hypotheses against each other” (Berridge, 2007, p. 424).
How, then, has the RPEH become so successful? What exactly does it explain? And, granted
that it is at least intuitively uncontroversial that the RPEH is beautiful and elegant, in which sense
can it be justifiably deemed deeper than alternatives? The present paper addresses these questions
by first reconstructing the main historical events that led to the formulation and subsequent
success of the RPEH (Section 2).
With this historical account as background, the paper elucidates what the RPEH explains and how, contrasting it with the incentive salience hypothesis—arguably its most prominent current alternative. It is clarified that both hypotheses are concerned only with what type of information is
encoded by dopaminergic activity. Specifically, the RPEH has the dual role of accurately describing
the dynamic profile of phasic dopaminergic activity in the midbrain during reward-based learning
and decision-making, and of explaining this profile by citing the representational role of
dopaminergic phasic activity. If the RPEH is true, then a mechanism composed of midbrain
dopaminergic neurons and their phasic activity carries out the task of learning what to do in the face
of expected rewards, generating decisions accordingly (Section 3).
The paper finally explicates under which conditions some explanation of learning,
motivation or decision-making phenomena based on the RPEH can be justifiably deemed deeper
than some alternative explanation based on the incentive salience hypothesis. Two accounts of
explanatory depth are considered. According to one account, deeper explanatory generalizations
have wider scope (e.g., Hempel, 1959); according to the other, deeper explanatory generalizations
display a greater degree of invariance (e.g., Woodward & Hitchcock, 2003). It is argued that, although
it is premature to maintain that explanations based on the RPEH are actually deeper—in either of
these two senses of explanatory depth—than alternative explanations based on the incentive
salience hypothesis, relevant available evidence indicates that they may well be (Section 4). The
contribution of the paper to the existing literature is summarised in the conclusion.
2. Reward-Prediction Error Meets Dopamine
Dopamine is a neurotransmitter in the brain.1 It has significant effects on many aspects of cognition
and behaviour, including motor control, learning, attention, motivation, decision-making and mood
regulation. Dopamine is implicated in pathologies such as Parkinson's disease, schizophrenia, attention deficit hyperactivity disorder (ADHD) and addiction. These are some of the reasons why so much work has been directed at understanding the type of information carried by neurons that utilize dopamine as a neurotransmitter, as well as their functional roles in cognition and behaviour.

1 Neurotransmitters are chemicals that carry information from one neuron to another across synapses. Synapses are structures connecting neurons that allow one nerve cell to pass an electrical or chemical signal to one or more other cells. Synapses consist of a presynaptic nerve ending, which can contain neurotransmitters, a postsynaptic nerve ending, which can contain receptor sites for neurotransmitters, and the synaptic cleft, which is a physical gap between the presynaptic and the postsynaptic ending. After neurotransmitters are released by a presynaptic ending, they diffuse across the synaptic cleft and then bind with receptors on the postsynaptic ending, which alters the state of the postsynaptic neuron.
Neurons that use dopamine as a neurotransmitter to communicate information are called
dopamine or dopaminergic neurons. Such neurons are phylogenetically old and are found in all
mammals, birds, reptiles and insects. Dopaminergic neurons are localized in several brain networks
in the diencephalon (a.k.a. interbrain), mesencephalon (a.k.a. midbrain) and olfactory bulb
(Björklund & Dunnett, 2007). Approximately 90% of the total number of dopaminergic neurons is
in the ventral part of the midbrain, which comprises different dopaminergic networks with separate
pathways. One of these pathways is the nigrostriatal pathway. It links the substantia nigra, a
structure in the midbrain, with the striatum, which is the largest nucleus of the basal ganglia in the
forebrain and has two components: the putamen and the caudate nucleus. Another pathway is the
mesolimbic, which links the ventral tegmental area in the midbrain to structures in the forebrain,
external to the basal ganglia, such as the amygdala and the medial prefrontal cortex.
Dopamine neurons show two main patterns of firing activity, which modulate the level of extracellular dopamine: tonic and phasic activity (Grace, 1991). Tonic activity consists of regular firing patterns of ~1-6 Hz that maintain a slowly changing baseline level of extracellular
dopamine in afferent brain structures. Phasic activity consists of a sudden change in the firing rate
of dopamine neurons, which can increase up to ~20 Hz, causing a transient increase of extracellular
dopamine concentrations.
The discovery that neurons can communicate by releasing chemicals was due to the
German-born pharmacologist Otto Loewi—winner of the Nobel Prize in Physiology or Medicine along
with co-recipient Sir Henry Dale—in 1921 (cf., Loewi, 1936). The discovery of dopamine as a
neurotransmitter in the brain dates to 1957, and was due to the Swedish pharmacologist Arvid Carlsson—Nobel laureate in Physiology or Medicine in 2000, along with co-recipients Eric Kandel and Paul Greengard (cf., Carlsson, 2003). Carlsson's work in the 1950s and 1960s paved the way for
the findings that the basal ganglia contain the highest dopamine concentrations, that dopamine
depletion is likely to impair motor function and that patients with Parkinson’s disease have
markedly reduced concentrations of dopamine in the caudate and putamen (cf. Carlsson, 1959;
1966).
The search for the mechanisms of reward-based learning and motivation has been under way since at least the 1950s. James Olds and Peter Milner set out to investigate how electrical stimulation of certain brain areas could reinforce behaviour. They implanted electrodes in different areas of rats' brains and allowed the animals to move about a Skinner box. Rats received stimulation
whenever they pressed a lever in the box. When this stimulation was targeted at the ventral
tegmental area and basal forebrain, the rats showed signs of positive reinforcement, as they would
repeatedly press the lever up to 2,000 times per hour. These results suggested to Olds and Milner
that they had “perhaps located a system within the brain whose peculiar function is to produce a
rewarding effect on behavior” (Olds & Milner, 1954, p. 426).
The notion of “reward” here is to be understood within Thorndike’s (1911) and Skinner’s
(1938) theories of learning. As Olds and Milner put it: “In its reinforcing capacity, a stimulus
increases, decreases, or leaves unchanged the frequency of preceding responses, and accordingly it
is called a reward, a punishment, or a neutral stimulus” (Olds & Milner, 1954, p. 419). So, some
brain stimulation or some environmental stimulus is “rewarding” if animals learn to perform actions
that are reliably followed by that stimulation or stimulus.
Later experiments confirmed that electrical self-stimulation of specific brain regions has the
same impact on motivation as natural rewards, such as food or water for hungry or thirsty animals
(Trowill, Panksepp, & Gandelman, 1969; Crow 1972). The idea that some neurotransmitter could
be a relevant causal component of some mechanism of reward-based learning and motivation was
substantiated
by pharmacological
studies
(Stein,
1968;
1969).
Based
on
subsequent
pharmacological (Fibiger, 1978) and anatomical research (Lindvall & Björklund, 1974), hypotheses
about the involvement of dopaminergic neurons in this mechanism began to be formulated. In Roy
Wise’s (1978) words: “[from the available evidence] it can be concluded that dopamine plays a
specialized role in reward processes… It does seem to be the case that a dopaminergic system forms
a critical link in the neural circuitry which confers rewarding qualities on intracranial stimulation…
and on intravenous stimulant injections” (Wise, 1978, pp. 237-238).
Wise (1982) put forward one of the first hypotheses about dopamine function in cognition
and behaviour that aimed to explain a set of relevant data from anatomy, pharmacology, brain self-stimulation, pathology and lesion studies. It was called the anhedonia hypothesis, and was advanced based on pharmacological evidence that moderate doses of neuroleptic drugs (i.e. dopamine
antagonists)2 can disrupt behavioural phenomena during reinforcement tasks without severely
compromising motor function (cf., Costall & Naylor, 1978). The anhedonia hypothesis was
proposed as an alternative to simple motor hypotheses that claimed that the dopamine system is a
mechanism of motor control and that dopaminergic impairment causes only motor deficits (see,
e.g., Koob, 1982).
2 Drugs that block the effects of dopamine by binding to and occluding the dopamine receptor are called dopamine antagonists.

The anhedonia hypothesis stated that "the normal functioning of some unidentified dopaminergic substrate (it could be one or more of several dopaminergic projections in the brain) and its efferent connections are necessary for the motivational phenomena of reinforcement and incentive motivation and for the subjective experience of pleasure that usually accompanies these phenomena" (Wise, 1982, p. 53).3 This hypothesis is committed to the claims that some network of dopaminergic neurons to be specified is a causally relevant component of the mechanism of reinforcement, that some network of dopaminergic neurons is necessary to feel pleasure, and that pleasure is a necessary correlate of reinforcement.
The explanatory link between dopamine and pleasure was superficial, however. Testing the
effects of selective lesions of the mesostriatal dopamine system on rats' reactions to different tastes, Berridge, Venier, and Robinson (1989) observed that the predictions of both the anhedonia and the
motor hypothesis were not borne out. It was found that the subjective experience of pleasure is not a
necessary correlate of reinforcement and that dopaminergic neurons are not necessary for pleasure
(see Wise, 2004, for a later reassessment of the evidence).
On the basis of taste-reactivity data and of the psychopharmacological effects of drug
addiction, and drawing upon earlier theories of incentive motivation (e.g., Bindra, 1974; Toates,
1986), Kent Berridge and colleagues put forward the incentive salience hypothesis of dopamine
(ISH). According to this hypothesis, dopamine release by mesencephalic structures such as the
ventral tegmental area assigns “incentive value” to objects or behavioural acts. Incentive salience is
a motivational, “magnet-like” property, which makes external stimuli or internal representations
more salient, and more likely to be wanted, approached or consumed. Attribution of incentive
salience to a stimulus that predicts some reward makes both the stimulus and the reward "wanted" (Robinson & Berridge, 1993; Berridge & Robinson, 1998). Since the ISH has been claimed to be the most prominent contemporary alternative to the RPEH (e.g., Berridge, 2007), the sections below consider it more closely, and compare it with the RPEH along two dimensions of explanatory depth. For now, let us move to the next steps on the road to the RPEH.

3 'Incentive motivation' is synonymous with 'secondary reinforcement' (or 'conditioned reinforcement'), which refers to a stimulus or situation that has acquired its function as a reinforcer after it has been associated with a primary reinforcer such as water or food. When a stimulus acquires incentive properties, it acquires not only the ability to elicit and maintain instrumental behaviour, but also to attract approach and to elicit consummatory behaviour (cf., Bindra, 1974).
In the 1980s, the role of dopamine in motor function remained an active topic of research (e.g., Beninger, 1983; Stricker & Zigmond, 1986; White, 1986; see Dunnett & Robbins,
1992, for a later review). This interest was justified by earlier findings that Parkinsonian patients
display a drastic reduction of dopamine in the striatum (Ehringer & Hornykiewicz,
1960; Hornykiewicz, 1966), associated with symptoms like tremor, hypokinesia and rigidity.
Wolfram Schultz was among the neuroscientists working on the relationship between dopamine
depletion, motor function and Parkinson’s disease (Schultz, 1983). As a way to assess this
relationship, he used single-cell recordings of dopaminergic neurons in awake monkeys while they
were performing reaching movements for food reward in response to auditory or visual stimuli
(Schultz, Ruffieux, & Aebischer, 1983; Schultz, 1986). Phasic activity of midbrain dopamine
neurons was found to be associated with the presentation of the visual or auditory stimuli that
would be followed by the food reward. Some such neurons showed phasic changes in activity also
at the time the reward was obtained. Execution of reaching movements was less strongly
associated with dopaminergic activity, indicating that activity of midbrain dopaminergic neurons
does not encode specific movement parameters. Schultz and colleagues hypothesised that such
activity carried out some more general function having to do with a change in the level of
behavioural reactivity triggered by stimuli leading to a reward.
In the following ten years, Schultz and colleagues carried out similar single-cell recording
experiments from midbrain dopaminergic neurons in the ventral tegmental area and substantia nigra
of awake monkeys while they repeatedly performed an instrumental or Pavlovian conditioning task4
(Schultz & Romo, 1988; Romo & Schultz, 1990; Ljungberg, Apicella & Schultz, 1992; Schultz,
Apicella & Ljungberg, 1993; Schultz, Mirenowicz & Schultz, 1994). In a typical experiment a
thirsty monkey was seated before two levers. After a visual stimulus was displayed (e.g. a light
flashing), the monkey had to press the left but not the right lever in order to receive the juice
reward. A distinctive pattern of dopaminergic activity was observed during this experiment.
During the early phase of learning—when the monkey was behaving erratically—dopamine
neurons displayed a phasic burst of activity only when the reward was obtained. After a number of
trials, as the monkey had learnt the correct stimulus-action-reward association, the response of the
neurons to the reward disappeared. Now, whenever the visual stimulus was displayed, the monkey
began to show anticipatory licking behaviour, and its dopaminergic neurons showed phasic bursts
of activity associated with the presentation of the visual stimulus. If an expected juice reward was
omitted, the neurons responded with a dip of activity, below basal firing rate, at the time at which
reward would have been delivered, which suggested that dopaminergic activity is sensitive to both
the occurrence and time of the reward.
The pattern of dopaminergic activity observed in these types of tasks was explained in terms
of generic “attentional and motivational processes underlying learning and cognitive behavior”
(Schultz et al., 1993, p. 900). Schultz and colleagues did not refer to previous research by Wise and
others about the involvement of dopamine in the mechanisms of reward, motivation and learning,
nor did they refer to the growing literature on reinforcement learning from psychology and artificial
intelligence. Thus, in the early 1990s, the questions of what type of information dopaminergic activity encodes and what its causal role is in the mechanism of reward-based learning and motivation were still outstanding.

4 In instrumental (or operant) conditioning, animals learn to respond to specific stimuli in such a way as to obtain rewards and avoid punishments. In Pavlovian (or classical) conditioning, no response is required to get rewards and avoid punishments, since rewards and punishments come after specific stimuli independently of the animal's behaviour.
Meanwhile, by the late 1980s, Reinforcement Learning (RL) had been established as one of
the most popular computational frameworks in machine learning and artificial intelligence. RL
offers a collection of algorithms to solve the problem of learning what to do in the face of rewards
and punishments received by taking different actions in an unfamiliar environment (Sutton & Barto,
1998). One widely used RL algorithm is the temporal difference (TD) learning algorithm, whose development is most closely associated with Rich Sutton (1988). The development of the TD algorithm was influenced by earlier theories of animal learning in mathematical psychology,
especially by a seminal paper by Bush and Mosteller (1951), which offered a formal account of
how rewards increment the probability of a given behavioural response during instrumental
conditioning tasks. Bush and Mosteller’s account was extended by Rescorla and Wagner (1972),
whose model set the basis for the TD-learning algorithm.
The Rescorla-Wagner model is a formal model of instrumental and Pavlovian conditioning
that describes the underlying changes in associative strength between a signal (e.g., a conditioned
stimulus) and a subsequent stimulus (e.g., an unconditioned stimulus). The basic insight is similar
to the one informing the Bush-Mosteller model: learning depends on error in prediction. As
Rescorla and Wagner put it: “Organisms only learn when events violate their expectations. Certain
expectations are built up about the events following a stimulus complex; expectations initiated by
the complex and its component stimuli are then only modified when consequent events disagree
with the composite expectation” (Rescorla & Wagner, 1972, p. 75). Accordingly, learning is driven
by prediction errors, and the basic unit of learning is the conditioning trial. Change in associative
strength between a conditioned stimulus and an unconditioned stimulus is a function of differences
between what was predicted (i.e. the animal’s expectation of the unconditioned stimulus, given all
the conditioned stimuli present on the trial) and what actually happened (i.e. the unconditioned
stimulus) in a conditioning trial.
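The error-driven character of the model can be made concrete with a minimal sketch in Python (an illustrative toy only: the combined learning-rate parameter, the trial structure and the variable names are simplifying assumptions of mine, not Rescorla and Wagner's notation):

# Rescorla-Wagner sketch: the associative strength V of each conditioned stimulus
# is updated in proportion to the discrepancy between the outcome of the trial
# (lambda_) and the summed prediction of all stimuli present on that trial.

def rescorla_wagner(trials, alpha_beta=0.3, lambda_=1.0):
    V = {}  # associative strength of each conditioned stimulus
    for present_cs, us_present in trials:
        prediction = sum(V.get(cs, 0.0) for cs in present_cs)
        error = (lambda_ if us_present else 0.0) - prediction  # prediction error
        for cs in present_cs:
            V[cs] = V.get(cs, 0.0) + alpha_beta * error
    return V

# Ten light -> food pairings: V("light") climbs towards lambda_ as the error shrinks.
print(rescorla_wagner([(["light"], True)] * 10))

# Blocking: once the light fully predicts food, compound light+tone trials generate
# little error, so the tone acquires little associative strength.
print(rescorla_wagner([(["light"], True)] * 20 + [(["light", "tone"], True)] * 20))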
TD-learning extends the Rescorla-Wagner model by taking account of the timing of
different stimuli within learning trials, which in fact influences how associative strength changes.
TD-learning is driven by the difference between temporally successive estimates (or predictions) of
a certain quantity—for example, the total amount of reward expected over the future (i.e. value). At
any given time step, the estimate of this quantity is updated to bring it closer to the estimate at the
next time step. The TD-learning algorithm makes predictions about what will happen. Then it
compares these predictions with what actually happens. If the prediction is wrong, then the
difference between what was predicted and what actually happened is used for learning. This core
of TD-learning is captured by two equations. The first is an update rule:
[1]
V(S)_new = V(S)_old + η δ(t_outcome),
where V(S) denotes the value of a chosen option S, η is a learning rate parameter, and δ(t_outcome) is the temporal-difference reward-prediction error computed at each of two consecutive time steps (t_stimulus and t_outcome = t_stimulus + 1). The second equation defines reward-prediction error at time t as:
[2]
δ(t) = r(t) + V(t) - V(t - 1),
where V(t) is the predicted value of some option at time t, and r(t) is the reward outcome obtained at time t. The reward-prediction error at t_outcome is used to update V(S), that is the value of the chosen
option. The potential of TD-learning, and of RL more generally, to build neural network models
and help interpret some results in brain science was clear from the 1980s. As Sutton and Barto
(1998, p. 22) recall, “some neuroscience models developed at that time are well interpreted in terms
of temporal-difference learning (Hawkins and Kandel, 1984; Byrne, Gingrich, and Baxter, 1990;
Gelperin, Hopfield, and Tank, 1985; Tesauro, 1986),” however, the connection between dopamine
and TD-learning had still to be explicitly made.
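How equations [1] and [2] give rise to the firing pattern Schultz and colleagues had reported can be illustrated with a minimal tabular simulation in Python (a sketch under simplifying assumptions of mine: the cue is treated as arriving unpredictably, a single learning-rate value is assumed, and the function and variable names are illustrative, not those of the published models):

# Minimal tabular TD(0) sketch of a cue -> reward task. The cue is assumed to
# arrive unpredictably, so nothing earlier in the trial predicts it; reward r = 1
# follows the cue on rewarded trials. delta is the prediction error of equation [2]
# in its one-step form: delta(t) = r(t) + V(next prediction) - V(current prediction).

def run_trial(V, rewarded, eta=0.2):
    # Cue onset: the cue was not predicted, so the error equals the cue's value.
    delta_cue = V["cue"] - 0.0
    # Outcome: reward (or its omission) ends the trial, so the next value is 0.
    r = 1.0 if rewarded else 0.0
    delta_outcome = r + 0.0 - V["cue"]
    V["cue"] += eta * delta_outcome            # update rule of equation [1]
    return {"cue": delta_cue, "outcome": delta_outcome}

V = {"cue": 0.0}
print("first trial:   ", run_trial(V, rewarded=True))    # burst at reward, none at cue
for _ in range(200):                                      # repeated cue-reward pairings
    run_trial(V, rewarded=True)
print("after learning:", run_trial(V, rewarded=True))     # burst at cue, ~0 at reward
print("omitted reward:", run_trial(V, rewarded=False))    # dip below baseline at reward time

The printed errors mimic, in a schematic way, the shift of the phasic response from the reward to the reward-predicting stimulus, and the dip when an expected reward is omitted.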
In the early 1990s, Read Montague and Peter Dayan were working in Terry Sejnowski’s
Computational Neurobiology lab at the Salk Institute in San Diego. Dayan’s PhD at the University
of Edinburgh, in artificial intelligence and computer science, focused on RL, while some of
Montague’s work, both as a graduate student and postdoc in biophysics and neuroscience, focused
on models of self-organization and learning in neural networks. They both approached questions
about brains and neural circuits by asking what computational functions they carry out (cf., Dayan,
1994). One spring day in 1991, TD-learning and dopamine got connected (Montague, 2007, pp. 108-109). Dayan came across one of Schultz and colleagues' articles, which presented data from
recordings of dopamine neurons’ activity during an instrumental learning task. By examining the
plots in the article showing the firing patterns of the monkeys’ dopamine neurons, Dayan and
Montague recognized the signature of TD-learning. The similarity between the reward-prediction
error signal used in TD-learning algorithms and Schultz and colleagues’ recordings was striking.
Activity of dopamine neurons appeared to encode a reward-prediction error signal.
Montague, Dayan and Sejnowski began writing a paper that interpreted Schultz and
colleagues’ results within the computational framework of RL. Their project was to provide a
unifying TD-learning model that could explain the neurophysiological and behavioural regularities
observed in Schultz and colleagues’ experiments. In a short abstract, Quartz, Dayan, Montague and
Sejnowski (1992) laid out the insight that on-going activity in dopaminergic neurons in the
midbrain can encode comparisons of expected and actual reward outcomes, which drive learning
and decision-making. The insight was articulated by Montague, Dayan, Nowlan, Pouget, and
Sejnowski (1993), and presented at the Conference on Neural Information Processing Systems
(NIPS)—an annual event bringing together researchers interested in biological and artificial
learning systems. In that paper a statement of the connection between TD-learning and Schultz and
colleagues’ pattern of results was made: Some “diffuse modulatory systems… appear to deliver
reward and/or salience signals to the cortex and other structures to influence learning in the adult.
Recent data (Ljunberg et al. 1992) suggest that this latter influence is qualitatively similar to that
predicted by Sutton and Barto’s (1981, 1987) classical conditioning theory” (Montague et al., 1993,
p. 970). However, theoretical neuroscience—computational neuroscience, particularly—was in its
infancy at the time, and it had not been recognized as a field integral to neuroscience yet (cf.,
Abbott, 2008). The paper that Montague, Dayan and Sejnowski had originally set out to publish in
1991 was getting rejected by every major journal in neuroscience, partly because the field was
dominated by experimentalists (Montague, 2007, p. 285).
Dayan and Montague approached the issue from a different angle. The foraging behaviour of
honeybees was known to follow the pattern of TD-learning; and evidence from single-cell
recordings and from intracellular current injections indicated that bees use a neurotransmitter,
octopamine, to achieve TD-learning (Hammer, 1993). Motivated by these findings, Dayan and
Montague developed a model of the foraging behaviour of honeybees (Montague, Dayan, Person, &
Sejnowski, 1995). First, they identified a type of neural architecture, which might be common to
both vertebrates and invertebrates, and could implement TD-learning. Second, they argued for the
biological feasibility of this neurocomputational architecture, noting that bees’ diffuse
octopaminergic system is suited to carry out TD-learning. Finally, they showed that a version of the
TD-learning algorithm running on the neurocomputational architecture they specified could produce
some learning and decision-making phenomena displayed by bees during foraging. Montague and
colleagues highlighted that: “There is good evidence for similar predictive responses in primate
mesencephalic dopaminergic systems. Hence, dopamine delivery in primates may be used by target
neurons to guide action selection and learning, suggesting the conservation of an important
functional principle, albeit differing in its detailed implementation” (Montague et al., 1995, p.
728).5
By the mid-1990s, other research groups recognized that the activity of dopaminergic
neurons in tasks of the sort used by Schultz and colleagues can be accurately described as
implementing some reward-prediction error algorithm. Friston, Tononi, Reeke, Sporns, and
Edelman (1994) considered value-dependent plasticity in the brain in the context of evolution and
adaptive behaviour. They hypothesised that the ascending neuromodulatory systems, and—in light
of Ljungberg et al.'s (1992) findings—the dopaminergic system in particular, are core components of
some value-based mechanism, whose processes are selective for rewarding stimuli. Houk, Adams
and Barto (1995) put forward a hypothesis about the computational architecture of the basal
ganglia, where dopaminergic neurons would control learning and bias the selection of actions by computing reward-prediction errors.

5 This conclusion might engender confusion, as it is not obvious how the suggestion that a 'functional principle' is 'conserved' across evolution should be understood. When two unrelated organisms share some trait, this is often taken as evidence that the trait is homologous (i.e., derived from a common ancestor). But this is never sufficient evidence of homology. A proper phylogenetic reconstruction of the trait (involving more than two species at the very least) is necessary for establishing homology. Given the available evidence, Montague and colleagues' suggestion is better understood in terms of analogy rather than homology. Two (probably more) species—including honeybees, macaque monkeys and other primates—might have independently evolved distinct diffuse neurotransmitter systems that have analogous (similar) functional properties. The similarity is not due to a common ancestor. Rather, the similarity is due to convergent evolution: both species faced similar environmental challenges and selective pressures, which implies that TD-learning is an adaptive strategy to solve a certain class of learning and decision-making problems that recur across species. I am grateful to an anonymous referee for drawing my attention to this point.
The neuroscience community started to pay closer attention to the relationship between TD-learning and dopamine. After five years, Montague, Dayan and Sejnowski had their original paper published in The Journal of Neuroscience (Montague, Dayan, & Sejnowski, 1996). In this paper, after noting, with Wise (1982), that dopamine neurons are involved in a number of cognitive
and behavioural functions, they examined Schultz and colleagues’ results. These results indicated
that whatever is encoded in dopaminergic signals should be capable of explaining four sets of data.
“(1) The activities of these neurons do not code simply for the time and magnitude of reward
delivery. (2) Representations of both sensory stimuli (lights, tones) and rewarding stimuli (juice)
have access to driving the output of dopamine neurons. (3) The drive from both sensory and reward
representations to dopamine neurons is modifiable. (4) Some of these neurons have access to a
representation of the expected time of reward delivery” (Montague et al., 1996, p. 1938).
Montague, Dayan and Sejnowski (1996) emphasised an underappreciated aspect of Schultz
and colleagues’ results: dopaminergic neurons are sensitive not only to the expected and actual
experienced magnitude of reward, but also to the precise temporal relationships between the
occurrence of a reward-predictor and the occurrence of the actual reward. This aspect was crucial to
draw the connection between TD-computation and dopaminergic activity. For it suggested that
dopamine neurons should be able to represent relationships between reward-predictors, predictions
of both the likely time and magnitude of a future reward, and the actual experienced time and
magnitude of the reward. The core of Montague and colleagues' (1996) contribution consists in laying out the
computational framework of reinforcement learning and bringing it to bear on neurophysiological
and behavioural evidence related to dopamine so as to connect neural function to cognitive
function. By means of modelling and computer simulations, they showed that the type of algorithm
that can solve learning tasks of a certain kind could accurately and compactly describe the
behaviour of many dopaminergic neurons in the midbrain: “the fluctuating delivery of dopamine
from the VTA [i.e., ventral tegmental area] to cortical and subcortical target structures in part
delivers information about prediction errors between the expected amount of reward and the actual
reward” (Ibid., p. 1944, emphasis in original). One year later, in 1997, Montague and Dayan
published another similar paper, co-authored with Schultz, in Science, which has remained the standard reference for the RPEH.
3. Reward-Prediction Error and Incentive Salience: What Do They Explain?
In light of Montague et al. (1996) and Schultz et al. (1997), the RPEH can now be more precisely
characterised. The hypothesis states that the phasic firing of dopaminergic neurons in the ventral
tegmental area and substantia nigra “in part” encodes reward-prediction errors. Montague and
colleagues did not claim that all types of activity in all dopaminergic neurons encode only (or in all
circumstances) reward-prediction errors. Their hypothesis is about “a particular relationship
between the causes and effects of mesencephalic dopaminergic output on learning and behavioural
control” (Montague, Dayan, Sejnowski, 1996, p. 1944). This relationship may hold for a certain
type of activity of some dopaminergic neurons during certain kinds of learning and decision-making
tasks. The claim is not that dopaminergic neurons encode only reward-prediction errors. The claim
is neither that prediction errors can only be computed by dopaminergic activity, nor that all learning
and action selection is carried out through reward-prediction errors or dependent on dopaminergic
activity.
The RPEH relates dynamic patterns of activity in specific structures of the brain to a precise
computational function. As reward-prediction errors are differences between experienced and
expected rewards, whether or not dopamine neurons respond to a particular reward depends on
whether or not this reward was expected at all, on its expected magnitude, and on the expected time
of its delivery. Thus, the hypothesis relates dopaminergic responses to two types of variables:
reward and belief (or expectation) about the magnitude and time of delivery of a reward that may be
obtained in a situation. Accordingly, the RPEH can be understood as relating dopaminergic
responses to probability distributions over prizes (or lotteries), from which a prize with a certain
magnitude is obtained at a given time (Caplin & Dean, 2008; Caplin, Dean, Glimcher, & Rutledge,
2010).
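On this reading, the sign of the predicted dopaminergic response depends on how the obtained prize compares with the lottery's expectation (setting timing aside). A toy illustration in Python, with made-up numbers of my own rather than Caplin and Dean's formalism:

# A lottery over prizes: 50% chance of 4 units of juice, 50% chance of nothing.
lottery = {4.0: 0.5, 0.0: 0.5}
expected = sum(prize * p for prize, p in lottery.items())   # expected reward = 2.0

# Reward-prediction error for each prize that might actually be obtained.
for prize in (4.0, 2.0, 0.0):
    print(prize, prize - expected)   # positive, zero and negative error, respectively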
The hypothesis has the dual role of accurately describing the dynamic profile of phasic
dopaminergic activity in the midbrain during reward-based learning and decision-making, and of
explaining this profile by citing the representational role of dopaminergic phasic activity. Thus, the
RPEH addresses two distinct questions. First, how are some of the regularities in dopaminergic
neurons’ firing patterns accurately and compactly described? Second, what is the computational
function carried out by those firing patterns? By answering the second question, the RPEH
furnishes the basis for a neurocomputational explanation of reward-based learning and decision-making.
Neurocomputational explanations explain cognitive phenomena and behaviour (e.g.,
blocking and second-order conditioning6) by identifying and describing relevant mechanistic
components (e.g., dopaminergic neurons), their organized activities (e.g., dopaminergic neurons’
phasic firings), the computational routines they perform (e.g., computations of reward-prediction
errors) and the informational architecture of the system that carries out those computations (e.g., the
actor-critic architecture, which implements TD-learning and maps onto separable neural
components, see, e.g., Joel, Niv, & Ruppin, 2002; Balleine, Daw, & O’Doherty, 2009). Neural
computation can be understood as the transformation—via sensory input and patterns of activity of other neural populations—of neural firings according to algorithmic rules that are sensitive only to certain properties of neural firings (Churchland & Sejnowski, 1992; Colombo, 2013; Piccinini & Bahar, 2012).

6 In classical conditioning, blocking is the phenomenon whereby little or no conditioning occurs to a new stimulus if it is combined with a previously conditioned stimulus during the conditioning process. Second-order conditioning is a phenomenon whereby a conditional response is acquired by a neutral stimulus when the latter is paired with a stimulus that has previously been conditioned.
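To illustrate the kind of informational architecture mentioned above, here is a minimal actor-critic sketch in Python; the one-state task, the parameter values and the names are illustrative assumptions of mine rather than a reconstruction of any specific published model. The point is that a single prediction-error signal both trains the critic's value estimate and biases the actor's action preferences:

import math
import random

def softmax_choice(prefs, beta=1.0):
    # Convert action preferences into choice probabilities and sample one action.
    weights = {a: math.exp(beta * p) for a, p in prefs.items()}
    total = sum(weights.values())
    r = random.random() * total
    for action, w in weights.items():
        r -= w
        if r <= 0:
            return action
    return action

def actor_critic(trials=500, eta_v=0.1, eta_p=0.1):
    value = 0.0                          # critic: expected reward in this state
    prefs = {"left": 0.0, "right": 0.0}  # actor: action preferences
    for _ in range(trials):
        action = softmax_choice(prefs)
        reward = 1.0 if action == "left" else 0.0   # only the left lever pays off
        delta = reward - value           # reward-prediction error (critic's signal)
        value += eta_v * delta           # critic update
        prefs[action] += eta_p * delta   # the same error biases future action selection
    return value, prefs

print(actor_critic())  # the preference for "left" grows; value tracks obtained reward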
If RPEH is true, then a neurocomputational mechanism composed of midbrain dopaminergic
neurons and their phasic activity carries out the task of learning what to do in the face of expected
rewards and punishments, generating decisions accordingly. Currently, several features of this
mechanism remain to be identified. So, RPEH-based explanations of reward-learning and decision-making are currently gappy (Dayan & Niv, 2008; O'Doherty, 2012). Nonetheless, several cognitive
and brain scientists would agree that some RPEH-based neurocomputational mechanism to be
specified can adequately explain many learning and decision-making phenomena in a manner not
only more beautiful but also deeper than available alternatives.
The most prominent alternative to the RPEH currently available is probably the incentive
salience hypothesis (ISH). This hypothesis states that firing of dopaminergic neurons in a larger
mesocorticolimbic system mediates only incentive salience attribution. In Berridge’s words:
“Dopamine mediates only a ‘wanting’ component, by mediating the dynamic attribution of
incentive salience to reward-related stimuli, causing them and their associated reward to become
motivationally ‘wanted’” (Berridge, 2007, p. 408).
The ISH relates dopaminergic activations to a psychological construct: incentive salience
(a.k.a. "wanting"). Thereby, it answers the question of what the causal role of dopamine is in reward-related behaviour. By answering this question, the ISH furnishes the basis for a neuropsychological
explanation of reward-based motivation and decision-making. The ISH is committed to the claim
that dopaminergic firing codes incentive salience that bestows stimuli or internal representations
with the properties of being appetitive and attention-grabbing. Incentive salience attributions need
not be conscious and need not involve feelings of pleasure (a.k.a. “liking”). Dopaminergic activity
is necessary to motivate actions aimed at some goal, as it would be a core component of a
mechanism of motivation (or mechanism of “wanting”). Dopaminergic neurons are not relevant
components of the mechanism of reward-based learning: “to say dopamine acts as a prediction error
to cause new learning may be to make a causal mistake about dopamine’s role in learning: it
might… be called a ‘dopamine prediction error’” (Berridge, 2007, p. 399). So, the ISH can be
considered an alternative to the RPEH because it denies two central claims of the RPEH: first, it
denies that dopamine encodes reward-prediction errors; second, it denies that dopamine is a core
component of a mechanism of reward-based learning.
As with the RPEH, the explanation grounded in the ISH is gappy: it includes dopamine
as a core component of the mechanism of incentive salience attribution and motivation, but it leaves
several explanatory gaps (see, e.g., Berridge, 2007, note 8). In particular, the hypothesis is under-constrained in at least three ways that make it less precise than the RPEH. First, it does not precisely
identify the relevant anatomical location of the dopaminergic components; second, it is
uncommitted as to possible different roles of phasic and tonic dopaminergic signals; finally, it is not
formalised by a single computational model that could yield quantitative predictions.
As they are formulated, both the RPEH and ISH are only concerned with what type of
information is encoded by dopaminergic activity. Nonetheless, by putting forward different claims
about what dopaminergic neurons do, these hypotheses motivate different dopamine-centred
explanations of phenomena related to learning, motivation and decision-making. Although these
dopamine-centred explanations are currently tentative and incomplete, and so it might be premature
to argue that one explanation is actually deeper than the other, it is worthwhile explicating under
which conditions a RPEH-based explanation can be justifiably deemed deeper than an alternative
ISH-based explanation, while pointing to relevant available evidence.
4. Explanatory Depth, Reward-Prediction Error and Incentive Salience
A number of accounts of explanatory depth have recently been proposed in philosophy of science
(e.g., Woodward & Hitchcock, 2003; Strevens, 2009; Weslake, 2010). While significantly different,
these accounts agree that explanatory depth is a feature of generalizations that express the
relationship between an explanans and an explanandum.
According to Woodward and Hitchcock (2003), in order to be genuinely explanatory, a
generalization should exhibit patterns of counterfactual dependence relating the explanans to the
explanandum. Explanatory generalizations need not be laws or exceptionless regularities. They
should enable us to answer what-if-things-had-been-different questions that show what the
explanandum phenomenon depends upon. These questions concern the ways in which the
explanandum would change under changes or interventions on its explanans, where the key feature
of such interventions is that they do not causally affect the explanandum except through their effect
on the explanans (Woodward, 2003). The degree of depth of an explanatory generalization is a
function of the range of the counterfactual questions concerning possible changes in the target
system that it can answer. Given two competing explanatory generalizations G1 and G2, if G1 is invariant (or continues to hold) under a wider range of possible interventions or changes than G2,
then G1 is deeper than G2.7 Call this the “invariance account of explanatory depth.”
According to a different view (cf., Hempel, 1959), explanatory generalizations should
enable us to track pervasive uniformities in nature by being employable to account for a wide range
of phenomena displayed by several types of possible systems. The depth of an explanatory
generalization is a function of the range of possible systems to which it can apply.8 For a hypothesis to apply to a target system is for the hypothesis to accurately describe the relevant structures and dynamics of the system—where what is relevant and what is not is jointly determined by the causal structure of the real-world system under investigation, the scientists' varying epistemic interests and purposes in relation to that system, and the scientists' audience. Given two competing explanatory generalizations G1 and G2, if G1 can be applied to a wider range of possible systems or phenomena than G2, then G1 is deeper than G2. So, deeper explanatory generalizations have wider scope. Call this the "scope account of explanatory depth."9

7 Woodward and Hitchcock (2003, sec. 3) distinguish a number of ways in which a generalization may be more invariant than another. For the purposes of this paper, it suffices to point out that what they share is that they spell out different ways in which an explanatory generalization enables us to answer what-if-things-had-been-different questions.
4.1. Depth as scope, reward-prediction error and incentive salience
If some RPEH-based explanatory generalization can be applied to a wider range of possible
phenomena or systems than some alternative ISH-based explanatory generalization, then the RPEH-based generalization is deeper according to the scope account of explanatory depth. What evidence is available that is relevant to assessing this claim?
ISH-based explanations have been most directly applied to rats’ behaviour and to the
phenomenon of addiction in rodents and humans. In the late 1980s and early 1990s, incentive salience was invoked to explain the differential effects on "liking" (i.e. the experience of pleasure) and
“wanting” (i.e. incentive salience) of pharmacological manipulations of dopamine in rats during
taste-reactivity tasks (Berridge et al., 1989). Since then incentive salience has been used to explain
results from electrophysiological and pharmacological experiments that manipulated dopaminergic activity in mesocorticolimbic areas of rats performing Pavlovian or instrumental conditioning tasks (cf. Berridge & Robinson, 1998; Peciña, Cagniard, Berridge, Aldridge, & Zhuang, 2003; Tindell, Berridge, Zhang, Peciña, & Aldridge, 2005; Wyvell & Berridge, 2000).

8 Kitcher (1989) put forward a similar idea. His view, however, is that depth is a function of the range of actual situations to which a generalization can apply. For a discussion of some of the problems raised by this view, see Woodward and Hitchcock (2003, sec. 4).

9 It bears mention that the verdict is still out on how these two views on depth relate to one another (see, e.g., Strevens, 2004, for a discussion of this issue).
Most ISH-based explanations applied to humans concern a relatively small set of phenomena
observed in addiction and Parkinson’s disease (Robinson & Berridge, 2008; O’Sullivan et al.,
2011). From the viewpoint of incentive salience, addiction to certain substances or behaviours is
caused by over-attribution of incentive salience. Compulsive behaviour would depend on an
excessive attribution of incentive salience to drug-rewards and their cues, due to hypersensitivity or
“sensitization” (i.e. an increase in a drug effect caused by repeated drug administration) in
mesocortical dopaminergic projections. Sensitized dopaminergic systems would then cause
pathological incentive motivation for drugs or other stimuli.
It may appear that a RPEH-based explanation has obviously wider scope than an ISH-based
explanation. For TD-learning has been applied to many biological and artificial systems (see e.g.
Sutton & Barto, 1998, ch. 11). TD-learning seems to be widespread in nature. For instance, recall
that while Montague et al. (1995) argued that release of octopamine by a specific neuron in the
honeybee brain may signal a reward-prediction error, they also suggested that the same type of
“functional principle” guiding learning and action selection may well be conserved across species.
However, if honeybees, primates and other species share an analogous TD-learning
mechanism, or if many artificial systems implement TD-learning, this is not evidence for the wider
explanatory scope of a RPEH-based explanation. Rather, it is evidence for the wider explanatory
scope of RL, and particularly of TD-learning. The RPEH and the ISH are about dopamine. So,
relevant evidence for wider scope should involve dopaminergic neurons and their activity.
RPEH-based explanations of learning and decision-making apply at least to rats, monkeys,
and humans. The RPEH was formulated by comparing monkey electrophysiological data during
instrumental and Pavlovian conditioning tasks to the dynamics of a TD reward-prediction error
signal (Montague et al., 1996; Schultz et al., 1997). Since then, single-cell experiments with
monkeys have strengthened the case for a quantitatively accurate correspondence between phasic
dopaminergic firings in the midbrain and TD reward-prediction errors (Bayer & Glimcher, 2005;
Bayer, Lau & Glimcher, 2007). Recordings from the ventral tegmental area of rats that performed a
dynamic odour-discrimination task indicate that the RPEH generalizes to that species as well
(Roesch, Calu, & Schoenbaum, 2007). Finally, a growing number of studies using functional
magnetic resonance imaging (fMRI) in humans engaged in decision-making and learning tasks have shown
that activity in dopaminergic target areas such as the striatum and the orbitofrontal cortex correlates
with reward-prediction errors of TD-models (Berns, McClure, Pagnoni, & Montague, 2001;
Knutson, Adams, Fong, & Hommer, 2001; McClure, Berns, & Montague, 2003a; O’Doherty,
Dayan, Friston, Critchley, & Dolan, 2003). These findings are in fact coherent with the RPEH,
since fMRI measurements seem to reflect the incoming information that an area is processing, and
striatal and cortical areas such as the orbitofrontal cortex are primary recipients of dopaminergic
input from the ventral tegmental area (cf., McClure & D’Ardenne, 2009; Niv & Schoenbaum,
2008).10
Some RPEH-based explanation is employable to account for many phenomena related to
learning and decision-making. Among the cognitive phenomena and behaviour for which some
RPEH-based explanation has been put forward are: habitual vs. goal-directed behaviour (Daw, Niv,
& Dayan, 2005; Tricomi, Balleine, & O'Doherty, 2009), working memory (O'Reilly & Frank, 2006), performance monitoring (Holroyd & Coles, 2002), pathological gambling (Ross, 2010), and a variety of psychiatric conditions including depression (e.g., Huys, Vogelstein, & Dayan, 2008; for a review of computational psychiatry see Montague, Dolan, Friston, & Dayan, 2012).

10 It is currently not possible to use fMRI data to assess the dopaminergic status of the MRI signal. Reliable information about the precise source of the fMRI signal is difficult to gather for the cortex, let alone the basal ganglia, particularly because neuromodulators can be vasoactive themselves (see Kishida et al., 2011, and Zaghloul et al., 2009, on methodologies for the measurement of sub-second dopamine release in humans).
Most relevant here, a RPEH-based explanation of incentive salience itself has also been
proposed, which indicates that different hypotheses about dopamine may well cohere to a much
greater extent than might be supposed once they are properly formalized (McClure, Daw, &
Montague, 2003b). According to this proposal, incentive salience corresponds to expected future
reward, and dopamine—as suggested by the RPEH—serves the dual role of learning to predict
future reward and of biasing action selection towards stimuli predictive of reward. McClure and
colleagues demonstrated that some of the phenomena explained by the ISH, such as the dissociation between wanted and liked objects, directly follow from the role in biasing action selection that
dopamine possesses according to the RPEH. Dopamine release would assign incentive salience to
stimuli or actions by increasing the likelihood of choosing some action that leads to reward. So,
dopamine receptor antagonism would reduce the probability of selecting any action, because
estimated values for each available option would also decrease.
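The gist of this proposal can be conveyed with a toy calculation (a sketch under simplifying assumptions of mine: a fixed pair of learned option values, a "do nothing" default worth zero, and a single scaling factor standing in for dopamine receptor antagonism; McClure and colleagues' model is richer than this):

import math

def choice_probabilities(values, beta=1.0):
    # Softmax over option values, including a "do nothing" default worth 0.
    options = {**values, "do nothing": 0.0}
    weights = {a: math.exp(beta * v) for a, v in options.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

values = {"press lever": 2.0, "approach cue": 1.0}   # learned expected future rewards

print(choice_probabilities(values))                  # intact dopamine signalling
scaled = {a: 0.3 * v for a, v in values.items()}     # antagonism shrinks estimated values
print(choice_probabilities(scaled))                  # "do nothing" becomes relatively more likely

Scaling the values down flattens the choice probabilities, so the probability of selecting any reward-directed action falls, which is the pattern the proposal predicts for dopamine receptor antagonism.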
If this proposal correctly captures the concept of incentive salience—which has been
debated (Zhang, Berridge, Tindell, Smith, & Aldridge, 2009)—then there would be telling reason to
believe that some RPEH-based explanation has indeed wider scope than some ISH-based
explanation. We would be in a position to make a direct comparison between them, and the ISH-based explanation would be entailed by a more general RPEH-based explanation. So, for any
possible target system to which the ISH-based explanation applies, there would be an RPEH-based explanation that applies to the same system, but not vice versa.
4.2. Depth as invariance, reward-prediction error and incentive salience
If some RPEH-based generalization is invariant (or continues to hold) under a wider range of possible interventions or changes than an alternative ISH-based generalization, then the RPEH-based generalization is deeper according to the invariance account of explanatory depth. These
interventions—recall—should not causally affect the explanandum phenomena except through their
effect on the dopamine-centred mechanism whose behaviour is described by the explanatory
generalization.
In order to assess the relative degree of depth of alternative RPEH-based and ISH-based
explanations, relevant interventions should be on particular mechanisms found in a particular
biological lineage, such as some dopamine-centred mechanism found in primates. Interventions on
merely analogous mechanisms found across biological lineages will not provide evidence relevant
for depth-as-invariance.
It should also be noted that the degree of precision of the RPEH is higher than that of the
ISH. Unlike the ISH, the RPEH makes claims specific to dopaminergic phasic activity, and to
dopaminergic neurons in the ventral tegmental area and in the substantia nigra. It may be thought
that this means that the range of interventions on dopaminergic activity relevant to assess depth is
narrower for a RPEH-based explanation than for an ISH-based explanation: while for the ISH-based explanation relevant interventions may be on both phasic and tonic activity, or on
dopaminergic neurons in mesocorticolimbic areas besides the ventral tegmental area and the
substantia nigra, for the RPEH-based explanation relevant interventions will not concern tonic
activity or dopaminergic neurons in any mesocorticolimbic circuit. This thought is mistaken,
however. For the ISH remains silent about the specific range of interventions relevant to assess the
invariance of an ISH-based explanation. So, the range of interventions relevant to assess the depth
of an ISH-based explanation does not strictly contain the range of interventions relevant to assess
the depth of a RPEH-based explanation.
Finally, unlike the RPEH, the ISH lacks an uncontroversial formalization that could be
implemented in the design of an experiment and yield precise, quantitative predictions. So, a
RPEH-based explanation may be deeper than an alternative ISH-based explanation, even if they are
both invariant under interventions on e.g. phasic dopaminergic activity in the ventral tegmental
area. For the RPEH-based explanation will yield more accurate answers about how the
explanandum phenomenon will change.
One set of available relevant evidence concerns dopamine manipulations by drugs. If some
dopamine-centred explanatory generalization, based on either the RPEH or ISH, correctly gives
information about how some target behaviour would change, had dopaminergic signalling been
enhanced or reduced, then the generalization will show some degree of depth-as-invariance. Before
illustrating this idea with two examples, some caveats are in order. One caveat is that
neuromodulators like dopamine have multiple, complex, and poorly understood effects on target
neural circuits and on cognition at different spatial and temporal scales (Dayan, 2012). Knowledge
is lacking about the precise effects on neurophysiology and behaviour of drugs that intervene on
dopaminergic signals. Moreover—as mentioned above—RPEH-based and ISH-based explanations
are currently tentative and gappy, partly precisely because the effects of interventions on
dopaminergic systems are currently not well understood. Because these explanations are tentative and gappy, it remains controversial to what extent they would always be fundamentally different (cf.
e.g. McClure et al. 2003b; Niv & Montague, 2009, pp. 341-342).
To probe the explanatory link between dopaminergic signalling and reward-based learning
and decision-making, Pessiglione, Seymour, Flandin, Dolan, and Frith (2006) used an instrumental
learning task that involved monetary gains and losses, in combination with a pharmacological
manipulation of dopaminergic signalling, as well as computational and functional imaging
techniques, in healthy humans. Either haloperidol (an antagonist of dopamine receptors) or L-DOPA (a metabolic precursor of dopamine) was administered to different groups of participants.
The effects of these manipulations were examined on both brain activity and choice behaviour. It
was found that L-DOPA enhanced activity in the striatum—a main target of dopaminergic
signalling—while haloperidol diminished it, which suggested that the magnitude of reward-prediction error signals targeting the striatum was enhanced (or diminished) by treatment with L-DOPA (or haloperidol). Choice behaviour was found to be systematically modulated by these
manipulations. L-DOPA improved learning performance for monetary gains, while haloperidol decreased it; that is, participants treated with L-DOPA were more likely than participants treated
with haloperidol to choose stimuli associated with greater reward. Computational modelling results
demonstrated that differences in reward-prediction error magnitude were sufficient for a TD-learning model to predict the effects of the manipulations on choice behaviour.
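As a rough illustration of this kind of modelling, the sketch below implements a minimal instrumental-learning simulation. It is not Pessiglione et al.'s actual model: the learning rule, the parameter values, and the hypothetical drug_gain factor that scales the reward-prediction error are all simplifying assumptions. The sketch only shows how enhancing the prediction error (as L-DOPA is taken to do) speeds learning of the more rewarded option, while reducing it (as haloperidol is taken to do) slows that learning.

import math
import random

def simulate(drug_gain, n_trials=200, alpha=0.3, beta=3.0, seed=1):
    """Minimal two-option instrumental learning simulation (illustrative only).

    drug_gain scales the reward-prediction error, standing in for a hypothetical
    enhancement (> 1, 'L-DOPA-like') or reduction (< 1, 'haloperidol-like') of
    dopaminergic signalling.
    """
    rng = random.Random(seed)
    q = [0.0, 0.0]           # learned action values
    p_reward = [0.8, 0.2]    # option 0 is rewarded more often
    chose_better = 0
    for _ in range(n_trials):
        # Softmax probability of choosing option 0.
        p0 = 1.0 / (1.0 + math.exp(-beta * (q[0] - q[1])))
        choice = 0 if rng.random() < p0 else 1
        reward = 1.0 if rng.random() < p_reward[choice] else 0.0
        # Reward-prediction error, scaled by the hypothetical drug factor.
        delta = drug_gain * (reward - q[choice])
        q[choice] += alpha * delta
        chose_better += (choice == 0)
    return chose_better / n_trials

# Larger prediction errors lead to faster learning of the better option.
print("enhanced RPE ('L-DOPA-like'):", simulate(drug_gain=1.5))
print("reduced RPE ('haloperidol-like'):", simulate(drug_gain=0.5))

The point is only structural: the direction of the behavioural effect falls out of scaling the prediction-error term, which is the feature of RPEH-based explanations whose invariance is at issue.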
Some of the caveats spelled out above apply to Pessiglione et al.’s study. Pessiglione and
colleagues acknowledged that it was not obvious how the drugs they administered affected different
aspects of dopaminergic signalling with respect to e.g. tonic versus phasic firing, or distinct
dopamine receptors. Notably, they did not consider whether their interventions might have affected
learning behaviour (i.e. one of their target explananda) also through effects on motivation or
attention. Nonetheless, evidence of the sort they provided is relevant to assess the relative depth of
RPEH-based explanatory generalizations. For it can demonstrate that reward-based learning and
decision-making are often modulated by reward-prediction errors encoded by dopaminergic
activity. And it may indicate that the relation between RPEs and dopaminergic activity on the one
hand, and choice behaviour during some reinforcement learning tasks, on the other, shows some
degree of invariance.
There are fewer human studies probing the link between pharmacological manipulation of dopamine and incentive salience attribution and motivation. One such study concerned the mechanism of sexual motivation. Oei, Rombouts, Soeter, van Gerven, & Both (2012) investigated how dopamine modulates activation in the ventral striatum during subconscious processing of sexual stimuli; the ventral striatum, together with its dopaminergic pathways, is suggested to be a component of a larger mesocorticolimbic mechanism of incentive salience.
Incentive salience is thought to be an essential property of sexual stimuli that would
motivate behavioural approach tendencies, capture attention, and elicit urges to pursue sex. An ISH-based explanation of sexual motivation would claim that sexual desire is produced by processes in a
large mesocorticolimbic network driven by the release of dopamine into striatal targets.
Dopaminergic activations would bestow incentive salience on sexual cues and sexual unconditioned stimuli, making these stimuli “wanted” and attention-grabbing.
Oei and colleagues used fMRI combined with dopaminergic manipulations through
administration of L-DOPA and haloperidol in healthy participants to probe the amplification (or
depression) of the incentive salience of unconsciously perceived sexual cues. It was found that L-DOPA significantly enhanced activation in the ventral striatum and dorsal anterior cingulate—a
brain region involved in cognitive control, action selection, emotional processing and motor
control—when sexual stimuli were subliminally presented, in contrast to emotionally neutral and
emotionally negative stimuli. Haloperidol, instead, decreased activation in those areas when the
sexual stimuli were presented. It was concluded that the processing of sexual incentive stimuli is
sensitive to pharmacological manipulations of dopamine levels in the midbrain.
These findings provide some evidence relevant to assess the degree of depth of an ISH-based explanation of sexual motivation because they would indicate that such an explanation might be invariant over changes in the magnitude of dopaminergic signals in the mesocorticolimbic network,
which would enable sexual motivation, as well as over changes in conscious perception of the
sexual incentive stimuli. However—as Oei and colleagues acknowledged—these results did not
speak to whether dopamine-dependent regulation of incentive salience attribution is related to
increases (or decreases) in sexual desire or behavioural approach tendencies. Neither did they
discriminate whether or not the dopaminergic changes they observed could be predicted by the
reward-prediction signals in a TD-learning algorithm. Hence, studies such as this one leave open the
possibilities that the dopaminergic intervention did not in fact affect attention or motivation to
pursue sexual behaviour (i.e. target explananda phenomena of an ISH-based explanation of sexual
motivation), and that the intervention could have affected sexual motivation through the effects of
reward-prediction error neural computations underlying reinforcement learning.
5. Conclusion
This paper has made two types of contributions to existing literature, which should be of interest to
both historians and philosophers of cognitive science. First, the paper has provided a comprehensive
historical overview of the main steps that have led to the formulation of the RPEH. Second, in light
of this historical overview, it has made explicit what precisely the RPEH and the ISH explain, and
under which circumstances neurocomputational explanations of learning and decision-making
phenomena based on the RPEH can be justifiably deemed deeper than explanations based on the
ISH.
From the historical overview, it emerges that the formulation and subsequent success of the
RPEH depend, at least partly, on its capacity to combine several threads of research across psychology, neuroscience and machine learning. By bringing the computational framework of RL to bear on neurophysiological and behavioural data gathered about dopaminergic neurons since the 1960s, the RPEH connects dopamine's neural function to cognitive
function in a quantitatively precise and compact fashion.
It should now be clear that the RPEH, as well as the ISH, which is arguably its current main
alternative, are hypotheses about the type of information encoded by dopaminergic activity. As
such, they do not explain by themselves why or how people and other animals display certain types
of phenomena related to learning, decision-making or motivation. Nonetheless, by putting forward
different claims about what dopaminergic neurons encode, these hypotheses furnish the basis for
distinct dopamine-centred explanations of those phenomena.
The paper has examined some such explanations, contrasting them along two dimensions of
explanatory depth. It has not been established that RPEH-based explanations are actually deeper—
in either of the two senses of explanatory depth considered—than alternative explanations based on
the ISH. For the dopamine-centred explanations that the two hypotheses motivate are currently
tentative and incomplete. Nonetheless, from the relevant available evidence discussed in the paper,
there are grounds to tentatively believe that currently, for at least some phenomenon related to
learning, decision-making or motivation, some RPEH-based explanation has wider scope or a greater degree of invariance than some ISH-based alternative explanation.
Acknowledgements I am sincerely grateful to Aistis Stankevicius, Charles Rathkopf, Peter Dayan,
and especially to Gregory Radick, editor of this journal, and to two anonymous referees, for their
encouragement, constructive criticisms and helpful suggestions. The work on this project was
supported by the Deutsche Forschungsgemeinschaft (DFG) as part of the priority program “New
Frameworks of Rationality” ([SPP 1516]). The usual disclaimers about any remaining error or
mistake in the paper apply.
References
Abbott, L.F. (2008). Theoretical neuroscience rising. Neuron, 60, 489-495.
Balleine, B.W., Daw, N.D., & O’Doherty, J.P. (2009). Multiple forms of value learning and the function of dopamine. In P.W. Glimcher, C.F. Camerer, E. Fehr, & R.A. Poldrack (Eds.), Neuroeconomics: Decision Making and the Brain (pp. 367-388). New York: Academic Press.
Bayer, H.M., & Glimcher, P.W. (2005). Midbrain dopamine neurons encode a quantitative reward prediction
error signal. Neuron, 47(1), 129–141.
Bayer, H. M., Lau, B., & Glimcher, P. W. (2007). Statistics of midbrain dopamine neuron spike trains in the
awake primate. Journal of Neurophysiology, 98(3), 1428–1439.
Berns, G. S., McClure, S. M., Pagnoni, G., & Montague, P. R. (2001). Predictability modulates human brain
response to reward. Journal of Neuroscience, 21(8), 2793–2798.
Berridge, K. C. (2007) The debate over dopamine’s role in reward: the case for incentive salience.
Psychopharmacology (Berl), 191, 391-431.
Berridge, K.C., & Robinson, T.E. (1998). What is the role of dopamine in reward: hedonic impact, reward
learning, or incentive salience? Brain Research Reviews, 28, 309-369.
Berridge, K.C., Venier, I.L., & Robinson, T.E. (1989). Taste Reactivity Analysis of 6-OHDA aphagia
without impairment of taste reactivity: Implications for theories of dopamine function. Behavioral
Neuroscience, 103, 36-45.
Bindra, D.A. (1974). A motivational view of learning, performance, and behavior modification.
Psychological Review 81:199-213.
Björklund, A., & Dunnett, S.B. (2007). Dopamine neuron systems in the brain: an update. Trends in Neurosciences, 30(5), 194-202.
Bush, R.R., & Mosteller, F. (1951). A mathematical model for simple learning. Psychological Review, 58,
313–323.
Caplin, A., Dean, M., Glimcher, P.W., & Rutledge, R.B. (2010). Measuring beliefs and rewards: A
neuroeconomic approach. Quarterly Journal of Economics, 125(3), 923-960.
Caplin, A. & Dean, M. (2008). Dopamine, Reward Prediction Error, and Economics. Quarterly Journal of
Economics, 123: 2, 663-701.
Carlsson, A. (2003). A half-century of neurotransmitter research: impact on neurology and psychiatry. In H. Jörnvall (Ed.), Nobel Lectures. Physiology or Medicine, 1996–2000 (pp. 308-309). Singapore: World Scientific Publishing Co.
Carlsson, A. (1966). Morphologic and dynamic aspects of dopamine in the central nervous system. In: Costa
E., Côté L.J., Yahr M.D., editors. Biochemistry and pharmacology of the basal ganglia. Hewlett, NY:
Raven Press, pp. 107–113.
Carlsson A (1959). The occurrence, distribution, and physiological role of catecholamines in the nervous
system. Pharmacological Reviews, 11, 490-493.
Churchland, P.S. & Sejnowski, T.J. (1992). The Computational Brain. Cambridge, MA, MIT Press.
Colombo, M. (2013). Constitutive relevance and the personal/subpersonal distinction. Philosophical
Psychology, 26, 547-570.
Costall, B. & Naylor, R. J. (1979). Behavioural aspects of dopamine agonists and antagonists. In A. S. Horn,
J. Korf and B.H.C. Westerink (Eds.), The Neurobiology of Dopamine, Academic Press, London, pp.
555-576.
Crow, T.J. (1972). A map of the rat mesencephalon for electrical self-stimulation. Brain Research, 36, 265-273.
Dayan, P. & Niv, Y. (2008). Reinforcement learning: The good, the bad and the ugly. Current Opinion in
Neurobiology, 18, 185-196.
Dayan, P. (1994). Computational modelling. Current Opinion in Neurobiology, 4(2), 212-217.
Dayan, P. (2012). Twenty-five lessons from computational neuromodulation. Neuron, 76, 240-256.
Daw, N.D., Niv, Y., & Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8, 1704–1711.
Dunnett, S.B., & Robbins, T.W. (1992). The functional role of mesotelencephalic dopamine systems.
Biological reviews of the Cambridge Philosophical Society, 67(4), 491-518.
Ehringer, H., & Hornykiewicz, O. (1960). Verteilung von Noradrenalin und Dopamin (3-Hydroxytyramin) im Gehirn des Menschen und ihr Verhalten bei Erkrankungen des extrapyramidalen Systems. Klinische Wochenschrift, 38, 1236-1239.
Fibiger, H.C. (1978). Drugs and reinforcement mechanisms: A critical review of the catecholamine theory.
Annual Review of Pharmacology & Toxicology, 18, 37-56.
Friston K.J., Shiner T., FitzGerald T., Galea J.M., Adams R., Brown H., Dolan R.J., Moran R., Stephan K.E.,
& Bestmann, S. (2012). Dopamine, affordance and active inference. PLoS Computational Biology
8:e1002327. doi: 10.1371/journal.pcbi.1002327.
Friston, K.J., Tononi, G., Reeke, G. N., Sporns, O. & Edelman, G. M. (1994). Value-dependent selection in
the brain: simulation in a synthetic neural model. Neuroscience, 59, 229-243.
Glimcher, P.W. (2011). Understanding dopamine and reinforcement learning: The dopamine reward
prediction error hypothesis. Proceedings of the National Academy of Sciences USA, 108 (Suppl. 3),
15647–15654.
Grace, A.A. (1991). Phasic versus tonic dopamine release and the modulation of dopamine system
responsivity: a hypothesis for the etiology of schizophrenia. Neuroscience, 41(1), 1-24.
Graybiel, A.M. (2008). Habits, rituals and the evaluative brain. Annual Review of Neuroscience, 31, 359-387.
Hammer, M. (1993). An identified neuron mediates the unconditioned stimulus in associative olfactory
learning in honeybees. Nature, 366, 59-63.
Hempel, C.G. (1959). The Logic of Functional Analysis. In Symposium on Sociological Theory, ed. L.
Gross, 271–87. New York: Harper & Row. Repr. with revisions. 1965. In Aspects of Scientific
Explanation and Other Essays in the Philosophy of Science, 297–330. New York: Free Press.
Holroyd, C.B., & Coles, M.G.H. (2002). The neural basis of human error processing: Reinforcement
learning, dopamine, and the error-related negativity. Psychological Review, 109(4), 679–709.
Hornykiewicz, O. (1966). Dopamine (3-hydroxytyramine) and brain function. Pharmacological Reviews, 18, 925-964.
Huys, Q.J.M., Vogelstein, J., & Dayan, P. (2008). Psychiatry: Insights into depression through normative decision-making models. NIPS 2008.
Joel, D., Niv, Y. and Ruppin, E. (2002). Actor—critic models of the basal ganglia: new anatomical and
computational perspectives, Neural Networks, 15, 535-47.
Kishida, K.T., Sandberg, S.S., Lohrenz, T., Comair, Y.G., Saez, I.G., Phillips, P.E.M., & Montague, P.R.
(2011). Sub-second dopamine detection in human striatum. PLoS ONE, 6(8), e23291.
Kitcher, P. (1989). Explanatory Unification and the Causal Structure of the World. In Scientific Explanation,
ed. P. Kitcher and W. Salmon, Minneapolis: University of Minnesota Press, pp. 410–505.
Knutson, B., Adams, C. M., Fong, G. W., & Hommer, D. (2001). Anticipation of increasing monetary
reward selectively recruits nucleus accumbens. Journal of Neuroscience, 21(16), RC159.
Koob, G.F. (1982). The dopamine anhedonia hypothesis: a pharmacological phrenology. Behavioral and Brain Sciences, 5, 63-64.
Lindvall, O., & Björklund, A. (1974). The organization of the ascending catecholamine neuron systems in the rat brain as revealed by the glyoxylic acid fluorescence method. Acta Physiologica Scandinavica Suppl.
412, 1-48.
Ljungberg, T., Apicella, P., & Schultz, W. (1992). Responses of monkey dopamine neurons during learning of
behavioral reactions. Journal of Neurophysiology, 67, 145-163.
Loewi, O. (1936) The chemical transmission of nerve action. Nobel Lecture. Reprinted in Nobel Lectures,
Physiology or Medicine, vol. 2 (1922–1941), pp. 416–432. Amsterdam: Elsevier, 1965. Available online at: <http://www.nobelprize.org/nobel_prizes/medicine/laureates/1936/loewilecture.html>
McClure, S.M., & D’Ardenne, K. (2009). Computational neuroimaging: monitoring reward learning with
blood flow. In J.-C. Dreher & L. Tremblay (Eds.), Handbook of Reward and Decision Making (pp. 229-247). Oxford: Academic Press.
McClure, S.M., Berns, G. S., & Montague, P. R. (2003a). Temporal prediction errors in a passive learning
task activate human striatum. Neuron, 38(2), 339–346.
McClure, S.M., Daw, N.D., & Montague, P.R. (2003b). A computational substrate for incentive salience. Trends in Neurosciences, 26(8), 423-428.
Mirenowicz, J., & Schultz, W. (1994). Importance of unpredictability for reward responses in primate
dopamine neurons. Journal of Neurophysiology, 72(2), 1024-1027.
Montague, P.R. (2007). Your Brain is Almost Perfect: How we make Decisions. New York: Plume.
Montague, P.R., Dolan, R.J., Friston, K.J., & Dayan, P. (2012). Computational psychiatry. Trends in
Cognitive Sciences, 16, 72-80.
Montague, P.R., Dayan, P., & Sejnowski, T.J. (1996). A framework for mesencephalic dopamine systems
based on predictive Hebbian learning. Journal of Neuroscience, 16(5): 1936-1947.
Montague, P.R., Dayan, P., Person, C., & Sejnowski, T.J. (1995). Bee foraging in uncertain environments using predictive Hebbian learning. Nature, 377, 725-728.
Montague, P.R., Dayan, P., Nowlan, S.J., Pouget, A., & Sejnowski, T.J. (1993). Using aperiodic reinforcement for directed self-organization. In S.J. Hanson, J.D. Cowan, & C.L. Giles (Eds.), Advances in Neural Information Processing Systems 5 (pp. 969-977). San Mateo, CA: Morgan Kaufmann.
Niv, Y. (2009). Reinforcement learning in the brain. Journal of Mathematical Psychology, 53(3), 139-154.
Niv, Y., & Montague, P.R. (2009) Theoretical and empirical studies of learning. In Neuroeconomics:
Decision Making and the Brain, eds Glimcher PW, et al. (Academic Press, New York), pp 329–249.
Niv, Y., & Schoenbaum, G. (2008). Dialogues on prediction errors. Trends in Cognitive Sciences, 12(7), 265-272.
O’Doherty, J. P. (2012). Beyond simple reinforcement learning: the computational neurobiology of reward-learning and valuation. The European Journal of Neuroscience, 35(7), 987-990.
O’Doherty, J., Dayan, P., Friston, K., Critchley, H., & Dolan, R. (2003). Temporal difference learning model
accounts for responses in human ventral striatum and orbitofrontal cortex during Pavlovian appetitive
learning. Neuron, 38, 329-337.
Oei, N.Y., Rombouts, S.A., Soeter, S.P., van Gerven, J.M., & Both, S. (2012). Dopamine modulates reward
system activity during subconscious processing of sexual stimuli. Neuropsychopharmacology, 37,
1729-1737.
Olds, J., & Milner, P.M. (1954). Positive reinforcement produced by electrical stimulation of septal area and other regions of rat brain. Journal of Comparative & Physiological Psychology, 47, 419-427.
O’Reilly, R. C., & Frank, M. J. (2006). Making working memory work: A computational model of learning
in prefrontal cortex and basal ganglia. Neural Computation, 18, 283-328.
O’Sullivan, S.S., Wu, K., Politis, M., Lawrence, A.D., Evans, A.H., Bose, S.K., Djamshidian, A., Lees, A.J.
& Piccini, P. (2011) Cue-induced striatal dopamine release in Parkinson’s disease-associated
impulsive-compulsive behaviours. Brain, 134, 969-997.
Peciña, S., Cagniard, B., Berridge, K.C., Aldridge, J.W. & Zhuang, X. (2003). Hyperdopaminergic mutant
mice have higher ‘wanting’ but not ‘liking’ for sweet rewards. Journal of Neuroscience, 23, 9395–
9402
Pessiglione, M., Seymour, B., Flandin, G., Dolan, R.J., & Frith, C.D. (2006). Dopamine-dependent
prediction errors underpin reward-seeking behaviour in humans. Nature, 442, 1042-1045.
Piccinini, G., & Bahar, S. (2012). Neural Computation and the Computational Theory of Cognition.
Cognitive Science.
Quartz SR, Dayan P, Montague PR, & Sejnowski TJ (1992). Expectation learning in the brain using diffuse
ascending projections. Society for Neuroscience Abstracts 18:1210.
Redgrave, P. & Gurney, K. (2006). The short-latency dopamine signal: a role in discovering novel actions?
Nature Reviews Neuroscience, 7:967-975.
Rescorla R.A., & Wagner A.R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of
reinforcement and nonreinforcement. In: Classical Conditioning II: Current Research and Theory
(Eds Black AH, Prokasy WF) New York: Appleton Century Crofts, pp. 64-99.
Roesch, M. R., Calu, D. J., & Schoenbaum, G. (2007). Dopamine neurons encode the better option in rats
deciding between differently delayed or sized rewards. Nature Neuroscience, 10(12), 1615–1624.
Robinson, T.E. & Berridge, K.C. (1993). The neural basis of drug craving. An incentive-sensitization theory
of addiction. Brain Research Reviews, 18, 247-291.
Robinson, T.E., & Berridge, K.C. (2008). The incentive sensitization theory of addiction: some current issues. Philosophical Transactions of the Royal Society B: Biological Sciences, 363(1507), 3137–3146.
Robinson, S., Sandstrom, S.M., Denenberg, V.H., & Palmiter, R.D. (2005) Distinguishing whether dopamine
regulates liking, wanting, and/or learning about rewards. Behavioral Neuroscience, 119, 5–15.
Romo, R., & Schultz, W. (1990). Dopamine neurons of the monkey midbrain: contingencies of responses to
active touch during self-initiated arm movements. Journal of Neurophysiology, 63, 592-606.
Ross, D. (2010). Economic Models of Pathological Gambling. In D. Ross, H. Kincaid, D. Spurrett, & P.
Collins (Eds.), What is Addiction? (pp. 131-158). Cambridge (MA): MIT Press.
Schultz, W., Dayan, P., & Montague, P.R. (1997). A neural substrate of prediction and reward. Science, 275,
1593-1599.
Schultz, W., Apicella P., & Ljungberg, T. (1993). Responses of monkey dopamine neurons to reward and
conditioned stimuli during successive steps of learning a delayed response task. Journal of
Neuroscience, 13, 900-913.
Schultz W., & Romo, R. (1990). Dopamine neurons of the monkey midbrain: contingencies of responses to
stimuli eliciting immediate behavioral reactions. Journal of Neurophysiology, 63, 607-624.
Schultz, W. (1986). Responses of midbrain dopamine neurons to behavioral trigger stimuli in the monkey. Journal of Neurophysiology, 56, 1439-1461.
Schultz, W., Ruffieux, A., & Aebischer, P. (1983). The activity of pars compacta neurons of the monkey
substantia nigra in relation to motor activation. Experimental Brain Research, 51, 377-387.
Skinner, B.F. (1938). The behavior of organisms. New York: D. Appleton-Century.
Stein, L. (1969). Chemistry of purposive behavior. In J. T. Tapp (Ed.), Reinforcement and Behavior,
Academic Press, New York, pp. 328-355.
Stein, L. (1968). Chemistry of reward and punishment. In: Proceedings of the American College of Neuropsychopharmacology (Efron, D.H., Ed.) (U.S. Government Printing Office: Washington, DC), pp.
105-123.
Strevens, M. (2004). The causal and unification accounts of explanation unified—causally. Noûs, 38, 154-176.
Strevens, M. (2009). Depth: An Account of Scientific Explanation. Cambridge, MA: Harvard University
Press.
Stricker, E. M., & Zigmond, M. J. (1986). Brain monoamines, homeostasis, and adaptive behavior. In
Handbook of physiology, Vol. IV: Intrinsic regulatory systems of the brain (pp. 677-696). Bethesda,
MD: American Physiological Society.
Sutton, R.S. (1988). Learning to Predict by the Method of Temporal Differences. Machine Learning, 3, 9-44.
Sutton, R.S., & Barto, A.G. (1998). Reinforcement Learning. An Introduction. Cambridge, MA, MIT Press.
Sutton, R.S., & Barto, A.G. (1981). Toward a modern theory of adaptive networks: Expectation and
prediction. Psychological Review, 88(2), 135-170.
Sutton, R.S. & Barto, A.G. (1987). A temporal-difference model of classical conditioning. Proceedings of
the Ninth Annual Conference of the Cognitive Science Society. Seattle, WA.
Thorndike, E. (1911). Animal Intelligence. New York: MacMillan.
Tindell, A.J., Berridge, K.C., Zhang, J., Peciña, S. & Aldridge, J.W. (2005). Ventral pallidal neurons code
incentive motivation: amplification by mesolimbic sensitization and amphetamine. European Journal
of Neuroscience, 22, 2617-2634.
Toates, F. (1986). Motivational Systems. Cambridge University Press, Cambridge.
Tricomi, E., Balleine, B., & O’Doherty, J. (2009). A specific role for posterior dorsolateral striatum in
human habit learning. European Journal of Neuroscience, 29, 2225–2232.
Trowill, J. A., Panksepp, J., & Gandelman, R. (1969). An incentive model of rewarding brain stimulation.
Psychological Review, 76, 264-281.
Weslake, B. (2010). Explanatory Depth. Philosophy of Science, 77(2), 273-294.
White, N. M. (1986). Control of sensorimotor function by dopaminergic nigrostriatal neurons: Influences of
eating and drinking. Neuroscience and Biobehavioral Reviews, 10, 15-36.
Wise, R.A. (1982). Neuroleptics and operant behavior: the anhedonia hypothesis. Behavioral and Brain
Sciences, 5, 39-88.
Wise, R.A. (2004). Dopamine, learning and motivation. Nature Reviews Neuroscience, 5, 483-494.
Woodward, J. (2003). Making Things Happen: A Theory of Causal Explanation. New York: Oxford
University Press.
Woodward, J., & Hitchcock, C. (2003). Explanatory Generalizations, pt. 2, Plumbing Explanatory Depth.
Noûs, 37, 181–199.
Wyvell, C.L., & Berridge, K.C. (2000) Intra-accumbens amphetamine increases the conditioned incentive
salience of sucrose reward: enhancement of reward “wanting” without enhanced “liking” or response
reinforcement. Journal of Neuroscience, 20, 8122-8130.
Zaghloul, K.A., Blanco, J.A., Weidemann, C.T., McGill, K., Jaggi, J.L., Baltuch, G.H., & Kahana, M.J.
(2009). Human substantia nigra neurons encode unexpected financial rewards. Science 323(5920):
1496-1499.
Zhang, J., Berridge, K.C., Tindell, A.J., Smith, K.S., & Aldridge, J.W. (2009). A neural computational model of incentive salience. PLoS Computational Biology, 5, e1000437.