Notes on Learning to Compute and Computing to Learn
Khurshid Ahmad
Department of Computing
University of Surrey
Guildford, Surrey, GU2 7XH ([email protected])
One key message in modern neuroscience is multimodality: the ability of uni-modal areas in the brain
such as speech and vision to interact with each other
and with hetero-modal areas, areas that are activated
by two or more input modalities, to converge the
outputs of the uni-modal systems for producing ‘higher
cognitive’ behaviour. Such behaviour includes the
ability to quantify, the ability to retrieve images when
linguistic cues are provided and vice versa. Multi-net
neural computing systems that can simulate such
behaviour are reported. Multi-net systems comprise
modules that take unimodal input and one or more
modules that facilitate cross-modal interaction
through Hebbian connections. These systems have
achieved a modicum of success.
Keywords: Multi-net neural computing, multi-modal
systems, competitive learning, image retrieval,
information extraction.
Neural computing systems were inspired by
developments in neuroscience in the 1940’s and
with few exceptions such systems work as
cellular networks and attempt to simulate learning
at the cellular level.
Developments in
neurosciences, especially during the 1980’s have
provided a unique insight into how the brain
functions, and by implication, how humans learn
and behave.
A brief review of these
developments is in order here (Section 2) before I
describe some of the work in neural computing,
which has been inspired by looking at the cooperative behaviour of two or more cellular
networks for simulating intelligent behaviour.
The title of this paper is an imitation of
Elaine Rich and Kevin Knight’s introduction to
neural computing in their book Artificial
Intelligence (1999) where they suggest if the
network can compute then it will learn to
compute. This is perhaps at some variance with
other cognitivist approaches in machine learning
where the emphasis appears to be on ‘computing
to learn’ – the insistence being on representation
schema and reasoning strategies whether it is rule
induction or rule deduction, or it is learning by
synthesis or learning by analysis. More recent
approaches to machine learning, for instance
case-based reasoning, despite their welcome
departure from an immutable rule base, still have
the structuralist influence of early AI. My
interpretation of ‘learning to compute’ is that
learning emerges alongside changes in the
structure of the neural substrate. In neural
computing, weight changes do suggest changes in
the neural substrate. Work in neurobiology and
neuropsychology suggests that areas within the
brain get interconnected in addition to
interconnections within an area.
The more
important lesson is that some human behaviour
can only be explained on the basis of distinct and
large areas of the brain interacting in unison
(Section 3). Some neural computing systems
developed by my students over the last decade in
an attempt to train not only individual networks
but also to train networks to learn to behave cooperatively are discussed in Section 4; Section 5
comprises an afterword. Before we embark on
our discussion it may be useful to brush up on
techniques used to observe the brain ‘in action’
and elaborate on terminology used in neuroanatomy.
Seeing the Brain in Vivo
The human brain specifically, and animal brain
generally, is a complex whole which interacts
with its immediate environment, reacting here,
and through its muscular appendages, changing
the environment there. The brain at work is a
difficult place to observe. It is not possible to
measure the activity of single neurons or of
populations of neurons directly except during
surgery. One indirect measure of neuronal
activity is blood flow: active areas of the
brain invariably have increased blood flow
compared to when these areas are not active.
The positron emission tomography (PET) technique
uses radio-labelled positron-emitting isotopes;
areas of the brain that are active for durations of
1-1000 seconds, and localised to within 10-100
mm, can be ‘seen’ by a PET scanner. Functional
magnetic resonance imaging (fMRI) visualises
neuronal activity by observing how changing flow
rates affect the magnetic signals
to which MRI images are sensitive. Many
neuroanatomical correlates of perceptual and
cognitive behaviour have been reported using
these techniques, and there is speculation
about the evolution of cortical areas of the brain
as well.
These indirect observations have to be
interpreted carefully due to the ‘great variability’
in brain anatomy between individuals, which
poses technical and conceptual problems as to the
specific location of an activation in the brain.
These problems are exacerbated because the
‘positions of these [functional] areas [of the brain]
are not well predicted by anatomical landmarks in
many regions’[8].
300 Words on Neuroanatomy
Our physical and mental environment appears
‘seamless’ despite the seemingly independent
modalities of vision, speech, hearing, olfaction
and touch amongst others. These modalities
interact and intermingle to manifest as unities at a
certain level of description: these unities include
objects, events, concepts and emotions. The
unities are dynamic in that they move, appear and
disappear. The system that deals with each of
these apparently independent modalities, the
human nervous system, does appear to comprise
many interacting and diffusely delineated parts.
The human brain, part of the central nervous
system, is not a uniform mass and appears to have
two interconnected hemispheres, left and right.
The cerebral cortex, the outermost layer of grey
matter of the cerebral hemispheres, is critically
involved in how we not only use the cortex, and
other areas of the nervous system, to understand,
exploit and sustain our environment, but learn to
do it as well. This interplay of sensing and
learning to make sense of the environment is most
obvious in infancy and when humans encounter
novelties in their environment. Each sensory
modality and the associated motor movements
appear to be localised in one or more of the
four overlapping lobes of the brain – the frontal lobe, or
everything in front of the so-called central fissure
of the brain; the parietal lobe that is caudal to (or
towards the tail of) the frontal lobe; the temporal
lobe jutting forward from the base of the brain
and ventral to (or at the base of) the frontal and
parietal lobes; and the occipital lobe which lies at the
very back and is caudal to the parietal and
temporal lobes. Each of the lobes is further
subdivided into posterior (back) and anterior
(front) or medial and lateral parts. The cerebral
cortex is convoluted and as much as 2/3rd of the
cortex is ‘hidden’ in small grooves (sulci, singular
sulcus), large grooves (fissures), and gyri
(singular gyrus) or the bulges between adjacent
sulci or fissures.
For instance, the fissure
between the temporal and frontal lobe is called
the Sylvian fissure.
[Figure 1 region labels: P. Somatosensory; P. Motor Cortex; Motor Ass. Cortex; P. Auditory Cortex; Auditory Ass. Cortex; Visual Ass. Cortex; P. Visual Cortex; Ass. Cortex]
Figure 1. An approximate lateral view of the various regions of the cerebral cortex. The
numbered boxes refer to Brodmann’s cyto-architectural areas originally discussed in the early
20th century. (P. refers to Primary and Ass. to Association.) More accurate views of the cerebral
cortex are available on or on
Neural Correlates of Behaviour
Literature on physiological psychology suggests
that the left hemisphere is involved in ‘controlling
serial behaviors’ including those of talking,
understanding the speech of others, reading and
writing. The right hemisphere appears to be more
active when humans engage, in laboratory
conditions, in ‘synthesis’ – drawing sketches,
reading maps, making wholes from parts and is
also involved in ‘the expression of and
recognition of emotion in the tone of voice’ [11,
29]. Right-hemisphere lesions to the parietal
cortex are possibly responsible for deficits in
spatial attention – attention being the cognitive
faculty, which enables humans to focus on certain
features of the environment to the (relative)
exclusion of others [16].
Each primary sensory area of the cerebral
cortex sends information to adjacent regions,
called the sensory association cortex also known
as memory cortex: all motor areas have some
sensory input. Association cortices, parts of the
cerebral cortex that do not have a primary motor or
sensory role, are involved in the so-called ‘higher
order processing of sensory information.’ There
are three major cortices of interest: posterior
parietal cortex (PPC), prefrontal cortex, and
temporal cortex. The temporal cortex has the
visual and audio association cortices; the
prefrontal cortex comprises ‘motor memories
[specifically] memories of muscular movement
that are needed to articulate words’ [11] and is
supposed ‘to respond to complex sensory stimuli
of behavioural relevance’ [6]; and, the PPC
comprises areas involved in combining
information on personal (somatic) awareness –
where am I? – with information on extra-personal
(visual) space.
Cortical Correlates of Perception
The visual cortex has at least 25 different
subregions arranged in a hierarchical fashion for
processing from colours (human extrastriate
cortex) to object perception (ventral stream),
from movement-, size-, and orientation of objects
(inferior temporal cortex) to movement
perception and object location (posterior parietal
cortex). Various left-hemisphere (perisylvian)
areas – temporal, parietal and frontal – have been
identified as being involved with the
‘complexities of linguistic processing, ranging
from semantic to syntactic, morphological,
phonological and phonetic analysis and synthesis’
[20]. The ‘language areas’ in the brain were the
first to be identified when the doctrine of cerebral
localization was launched in the 19th century:
Broca’s area or the ‘motor speech area’,
comprising portions of the inferior frontal gyrus,
for coordinated action of diverse muscles used in
speech, and the ‘sensory or ideational speech
area’ – distributed over the left temporal lobe and
inferior parietal lobe – responsible for
‘comprehension of language, naming of objects’
[7] amongst others. Wernicke’s area, in the
parietal lobe, contains ‘the auditory entries of
words; the meanings are contained as memories
in the sensory association areas’ [11].
Cerebral localization data from 58 word
production experiments, in which activation
levels in parts of the brain were measured to within
10-100 mm using
neuroimaging techniques as diverse as PET, EEG,
and fMRI, suggest that there are 28 major
regions, mainly in the cerebral cortex, which were
highly activated; a finer grain analysis
suggests that there may be as many as 104 regions
which are involved in language production [18].
The experiments were performed by different
research teams but mainly focussed on 7 word
production tasks including picture naming and
word generation. These tasks include ‘conceptual
preparation, lexical selection, phonological code
retrieval and encoding’: the first task takes 150 ms, the
second is completed in the next 125 ms and the
last two within the next 125 ms. The next two
tasks take another 125 ms all told. There are
identifiable regions that are activated in word
generation that are not activated in picture naming
and vice versa. Lesion studies indicate that when
local specialist areas in the brain suffer damage
the brain cannot perform one or more functions
that it can normally perform: damage to the
various parts of parietal and temporal lobes
results in the loss of naming capability and
damage to the frontal cortex rostral (or in front of)
and to the base of the primary motor cortex leads
to non-fluent speech. (Section 4 briefly describes
two multi-net computing systems that attempt to
simulate language development and language
deficit respectively).
Vision and speech processing are good
examples of unimodal processing: it appears that
one single modality is being processed across a
network, comprising units at different locations in
the brain, and each component is more or less
specialised in processing that modality. Output of
the one unit becomes the input of the next and the
process continues until manifestations of
cognitive behaviour are finally produced:
identification of an object in an image,
understanding of a word or phrase, articulation of
linguistic output in response to a linguistic
Cortical Processing and Cortical
The above discussion is based implicitly on the
behaviour of an adult human. It is reasonable to
expect that humans have learnt how to react to
the various modalities. Neonates, infants and
young children, almost irrespective of their
intelligence level, appear to learn to react to their
external environment at a considerable pace
whilst their nervous system is rapidly evolving as
well [5]. During the first 12-38 weeks of antenatal life the structure of the central nervous
system is established by differentiation, through axonal growth and
dendritic ‘ontogeny’, and by the development of
synapses (‘synapto-genesis’) and the encasing of
the synapses in the myelin sheath (myelinogenesis). Even during the genesis of the antenatal nervous system, the infant responds to
stimuli and when born appears to have capability
as diverse as being able to visually enumerate and
so appears to possess a substrate for language as
well. The first year of post-natal life involves the
establishment of connectivity in the infant’s brain
together with glio-genesis – the encasing of the
brain in the white matter. The child’s gaze
stabilizes earlier on in the postnatal development,
the child goes from ‘one-word’ language stage to
two words during the first 18-24 months of life.
Modality and Neuronal Correlation
The child can integrate different modalities –
Lewkowicz and Turkewitz [19] report an
experiment involving 32 infants between the
‘ages’ of 11 hours and 48 hours in which their
visual preferences for ‘light patches of different
intensity’ were examined with and without an
auditory stimulus (through a white noise
generator). The authors conclude that ‘visual
preference in human newborns can be modified
by immediately preceding exposure to sound’
[19]. Such inter-sensory interaction between
auditory and visual stimulation can be observed
when we watch television or films where TV folk
and actors appear to be talking through their lips
but actually the sound is emitted from a
loudspeaker. And ventriloquism depends on the
dominance of the visual stimulus on the auditory
stimulus. This has led to the claim that ‘illusions
remind us that our visual experience is not a
measurement of the physical stimuli, but rather a
neural computation’ [31].
There are then degrees of involvement of
other modalities, for example, naming an object
in a visual scene requires two modalities, visual
and speech, although one can segregate the two in
that after the initial visual stimulus one can argue
that speech takes over. However, attention and
numerosity studies indicate that certain behaviour
can only be explained through multi-sensory
integration. This integration has been defined as
a statistically significant difference between the
neuron’s response to a stimulus combination
compared to its response to the individual
component stimulus [21].
There is some
evidence, based on experiments on cats, that
certain areas of the cats’ nervous system comprise
unimodal neurons, at least until 12 days after
birth, but then these neurons develop capability
for integrating multi-sensory information [30]. It
has been suggested that there are ‘many areas in
the mammalian brain […] where the different
sensory streams converge onto individual neurons
responsive to stimulation in more than one
modality’ [10] and such heteromodal neurons
have been found in the prefrontal cortex, posterior
parietal cortex and the lateral temporal cortex and
in the mid-brain (specifically in the superior
colliculus – the site where Meredith and Stein
[21] found multi-sensory integration in cats).
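The definition of multi-sensory integration given above, a statistically significant difference between a neuron's response to a stimulus combination and its response to an individual component stimulus, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the spike counts are invented, the function names are hypothetical, and the permutation test and percentage enhancement index are one plausible way of operationalising the definition, not necessarily the procedure used in [21].

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def enhancement_percent(combined, best_unimodal):
    """Percentage change of the combined response relative to the
    best unimodal response (an assumed index, for illustration)."""
    base = mean(best_unimodal)
    return 100.0 * (mean(combined) - base) / base


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Invented spike counts per trial for one hypothetical neuron.
audio_visual = [18, 20, 22, 19, 21, 23, 20, 22]  # combined stimulus
visual_only = [8, 10, 9, 11, 10, 9, 8, 10]       # best single modality

gain = enhancement_percent(audio_visual, visual_only)
p_value = permutation_p_value(audio_visual, visual_only)
```

Under this sketch a neuron would be labelled as integrating the two modalities when `p_value` falls below a chosen significance threshold; the threshold itself is a further assumption.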
Learning to Compute: Cross Modal
There are two developments that indicate that
there are certain ‘higher level’ cognitive tasks that
can be understood in terms of interaction between
different sensory modalities (and motor
movement). The first development relates to the
orientation of human spatial attention – our ability
to ‘focus selectively on sensory information from
particular locations in external space’ either
voluntarily, endogenous attention, or ‘reflexively’
by salient events, exogenous attention. In a
number of neuropsychological experiments, it has
been found that deficits in spatial attention are
correlated strongly with lesions to the parietal
cortex in the right hemisphere and lesions to the
frontal cortex. Amongst many like phenomena,
visual illusions, actors speaking on a film/TV
screen or ventriloquists’ dummy talking, indicate
that human attention has to be co-ordinated cross-modally. This makes it possible to ‘select
information from a common external source
across several modalities despite the differences’
in the initial coding of the source of each
modality [17]. Frequently, one or more senses
substitute for the (temporary) loss of one of the
senses: the heightened textural and spatial
awareness that results when looking for a light
source in a darkened region is a good example of such a
substitution. This cross cueing has been exploited
in the multi-net simulation reported in Section 4.1
where texts help in the retrieval of images.
The second development relates to topics
variously labelled numerosity, numerons, single
neuron arithmetic and number sense in humans
and some primates. The intuitive argument here
is that judgement related to quantities, an animal
guessing how many in a herd or the extent of
‘foraging area’, or infants and monkeys making
‘accurate’ judgements about whether two
quantities are equal to each other or very different
irrespective of physical attributes, must have a
neuronal correlate. Observations of enumeration
without having been taught a number system,
or of
approximate calculation without rigorously
carrying out arithmetic procedures, lead to the
speculation that there may be areas in the brain
where the visuo-spatial information about the
objects, for instance, is processed such that the
number information is preserved [15]. The
development of numerosity has been simulated
using a multi-net computing system later on in
this paper (Section 4.2). We will now discuss the
two developments in turn.
Cross-Modal Interaction and Spatial
The key to spatial attention is that different
stimuli, visual and auditory, help to identify the
spatial location of the object generating the
stimuli. One argument is that there may be a
neuronal correlate of such crossmodal interaction
between two stimuli. Information related to the
location of the stimulus (where) and identifying
the stimulus (what) appears to have correlates at
the neuronal level in the so-called dorsal and
ventral streams in the brain.
In a number of neuroimaging studies of
audio-visual speech perception, researchers have
attempted to identify ‘putative’ neuroanatomical
sites where multimodal integration actually takes
place [10] – these studies were inspired, in part,
by the earlier work on cats [21, 22]. Two
experiments, one dealing with subjects’ mouth
movements whilst looking at a videotape of the
lower half of a face silently mouthing numbers –
silent lip reading – and the second with subjects
listening to numbers being spoken with the
videotape switched off – auditory speech
perception – are of note here. The neuroimages
of the two experiments were compared and
contrasted. The areas activated in both the
experiments include primary auditory cortex and
the auditory association cortex. The putative
heteromodal cortex that integrates the two
stimuli straddles the auditory and visual association
cortices, appears to include Wernicke’s area
of language idealisation (Brodmann areas 37, 39,
40, 21/22), and lies in the region ‘proximal to the
superior temporal sulcus’ (STS). In the silent
lip-reading test the visual stimulus provided to the
brain appears to generate auditory cues from the
auditory association cortex and the convergence
takes place in the STS. Calvert and colleagues
claim that ‘Activation of the primary auditory
cortex by visible speech cues might proceed via
back projections from heteromodal cortex’ [10].
The authors point to other regions of the brain as
well, including the prefrontal cortex, posterior
parietal cortex, and possibly the midbrain region
of superior colliculus. There is a concomitant
claim that there may exist multimodal neurons
active in ‘many areas in the mammalian brain
[…] where the different sensory streams converge
onto individual neurons responsive to stimulation
in more than one modality’ [10].
One related cross-modal phenomenon is
that of synesthesia: a condition in which a sensory
experience normally associated with one modality
occurs when another modality is stimulated. This
is a (congenital) condition in which synesthetic
humans recall colour when shown letters or
numbers, for example.
Such cross-modal
behaviour may be attributed to cross wiring of the
otherwise unimodal brain or, more specifically,
cortical regions.
Grapheme-colour
synesthesia (or involuntary cross-activation) has
been explained by arguing that the colour areas in
the brain are in the fusiform gyrus and the visual-grapheme area is also in the fusiform, especially
in the left hemisphere [27].
The observation that neonates and monkeys have
a number sense and other mathematical skills like
estimation, trajectory computation, without being
formally educated in arithmetic, or any other
branch of mathematics, has given rise to
significant interest in this area. Neuroimaging
techniques have significantly contributed here and
some even claim to have neural correlates of our
ontogenetic numeracy. Furthermore, educational
psychologists have observed numeracy and
related skills can be acquired through training and
that there are stages in which skills are acquired –
there is an evolutionary process involved here. For
some neonates and primates only acquire
numeracy when they come in contact with their
physical environment. Whether ontogenetic or
evolutionary, our number sense involves ‘the
coding and internal manipulation of an abstract
semantic content, the meaning of number words’
[26]. Information about putative neural correlates
of numerosity has come from the so-called lesion
studies. Studies of brain damaged individuals
showing mathematical deficits in their behaviour,
when compared to their intact counterparts,
suggest that lesions to specific regions of the
parietal and temporal cortices may be the reason
for the deficits. The parietal cortex appears to
play a critical role in the representation of
magnitudes and the temporal cortex is involved in
the representation of the visual form of the
numbers [9, 13].
Number sense has played a major role in
psychology where many earlier studies were
dedicated to ‘the mathematical description of how
a continuum of sensation, such as loudness or
duration’ is represented in the brain/mind. The
19th century psychophysicist, Gustav Fechner, had
observed that ‘the intensity of subjective
sensation increases as the logarithm of the
stimulus intensity’. One 21st century
rendition of this ‘law’ is that the ‘external
stimulus is scaled into a logarithmic internal
representation of sensation’ [15]. Number related
behaviours ‘depend on the capacity to abstract
information from sensory inputs and to retain it in
memory’ and that in monkeys this capacity is in
the ‘prefrontal cortex’ [24] and there are reports
of activation in humans in proximate regions of
the brain. As predicted by Fechner, there is a
compressed scaling of numerical information, and
this information is stored in the prefrontal cortex
of the monkey [24] and the parietal cortex of the
human [26]. Nieder et al. report that over a third of the
352 randomly selected neurons from the lateral
prefrontal cortex of two monkeys ‘showed
activity that varied significantly with the number
of items in the sample display’ [24]: this suggests
that certain neurons specialise as ‘number
detectors’ – the elusive numerons perhaps have
been found.
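The logarithmic scaling attributed to Fechner above can be made concrete with a short sketch. It assumes the textbook form of the law, S = k ln(I / I0); the function names and the constants k and I0 are illustrative choices and do not come from [15], [24] or [26].

```python
import math


def internal_magnitude(stimulus, k=1.0, reference=1.0):
    """Fechner-style internal code: k * ln(stimulus / reference).

    k and reference are illustrative constants (assumptions).
    """
    if stimulus <= 0 or reference <= 0:
        raise ValueError("stimulus and reference must be positive")
    return k * math.log(stimulus / reference)


def subjective_gap(a, b, k=1.0):
    """Distance between two quantities on the compressed scale."""
    return abs(internal_magnitude(b, k) - internal_magnitude(a, k))


# On a compressed scale, neighbouring small numbers lie further
# apart than neighbouring large numbers.
gap_small = subjective_gap(1, 2)  # ln(2)
gap_large = subjective_gap(8, 9)  # ln(9/8)
```

Because only the ratio of the two quantities matters on such a scale, the pair (2, 4) is as discriminable as the pair (4, 8), while unit steps become harder to tell apart as the numbers grow.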
The compressed number line theory can be
used to explain the observation that neonates and
monkeys, and adults in a hurry, can accurately
enumerate quantities less than 5 without recourse
to overt counting. Higher numbers cannot be
enumerated with any accuracy through visual
enumeration or subitisation, and within the
numbers 1-5 there is a diminution in accuracy as
we approach the higher number. Subitisation is
sometimes related to the existence of ‘preverbal
numerical abilities’ [34]. Recent findings about
approximate calculations performed by healthy
volunteers show activity in the bilateral inferior
parietal cortex, in the prefrontal cortex and in the
cerebellum [14]; perhaps the preverbal numerical
abilities are localised in these areas. These
cortices were less active when the same
volunteers carried out tasks where they were
asked to perform exact arithmetic calculations.
During the exact arithmetic the left inferior
parietal cortex was highly activated together with
the left angular gyrus. The cortical regions active
in approximate arithmetic ‘fall outside of
traditional perisylvian language areas and are
involved in various visuo-spatial and analogical
mental transformations’ (ibid:971).
Exact calculations, it appears, depend on language-based representations. Dehaene et al. [14] recall
from previous lesion studies that lesions to the
left parietal area result in the loss of the sense of
numerical quantity but preserve rote language-based
arithmetic; contrariwise, damage to the
left hemisphere resulted in the loss of language
abilities but the sense of numerical quantity was
preserved.
Response times to numbers presented
verbally and to numbers
presented in Arabic notation show clear
differences; response to visual stimulation is on
average 100-200 ms faster than to verbal
stimulation [26]. Studies involving fMRI and
event-related potential measurement show
indication of localisation: higher activations are
reported for visual stimulation in the left and right
parietal and prefrontal cortex, and for verbal
stimulation the left and right occipital regions show
greater activation (ibid:1020). Simon et al. [28]
have conducted fMRI experiments to examine the
organisation of the parietal cortex by looking at
fMRI images of subjects performing six tasks,
including calculation and phoneme detection. They found
that ‘number processing tasks call for generic
processes of attention orientation and spatial
manipulation that operate upon a “spatial map” of
numerical quantities in the middle IPS [Intraparietal
Sulcus]’ [28]; the IPS is a ‘fine grained
anatomical specialisation’ which may have a
region for the manipulation of numerical
Computing to Learn: Co-operating
Neural Networks and Competing
(inhibiting) Neurons
In this section I present multi-net neural
computing systems in an attempt to simulate
aspects of behaviour that invariably, and from
what we have discussed above inevitably, involve
interaction between two or more modalities (see
Figure 2).
[Figure 2 diagram labels: Cross-modal Net: Hebbian Network]
Figure 2: An architecture for learning behaviour
encoded in two modalities and for learning how to
cross-modally learn the relationships between two unimodalities
Our early work during the early 1990’s focussed
on language processing, specifically child
language development and language disorder.
Both these systems have individually trained
networks for representing concepts and linguistic
descriptions of concepts that are cross-linked
through a Hebbian network. The language
development system learnt to simulate how young
children (18-24 months) produce one-word
utterances in response to audio-visual cues and to
simulate how these children learn the word-order
of their language based on inputs from their adult
The word-order learning system
learnt concepts and words using two independent
SOFM’s, cross linked by a Hebbian network,
together with an additive Grossberg network that
related semantic relations (for example, agents to
possessions, objects to containers, objects in
spatial locations). A back-propagation network
was used to teach the child the two-word
organisation as found in adult language. What the
network produced was a cross-modal output
(concept-word) that guided its production of
candidate two-word collocates. The system’s
learning outcomes were compared with standard
child language productions reported in the
literature and there was good agreement between
the observations and our system [1]. The
language disorder system was similarly
constructed with a conceptual and a word lexicon
using SOFM’s and the system learnt to cross-link
concepts and words through a Hebbian network.
The system was trained on normal associations
between words and concepts. The conceptual
SOFM was then systematically ablated and the
cross-modal output showed increasing semantic
errors – the so-called naming errors found in a
certain group of aphasics. The results of the
language disorder multi-net were in good
agreement with findings about aphasics in the
literature [33].
In this paper we report on our more recent
work: (i) a multi-net content-based image
retrieval system, which can store images with
their collateral textual description and
can learn to retrieve images by their collateral
linguistic features, and even images for which
there is no collateral text (Section 4.1); the link
between image and linguistic features was
established through a Hebbian network; and (ii) a
multi-net system that can learn to subitise and
another that can learn to count (Section 4.2); both
these systems rely on a Hebbian connection that
is learnt whilst two individual networks learn a
single modality each to represent quantities in a
visual scene and the verbal representation of the
The co-operative multi-net architecture we
report is essentially an extension of the original
idea of Willshaw and von der Malsburg [32]
where two layers of neurons were connected via
Hebbian synapses; the Hebbian synapses have the
useful properties of local modification and time
variance, and correlate
pre- and post-synaptic signals.
Kohonen’s self-organising feature map is also
based on a principle similar to that of Willshaw
and von der Malsburg. The links are initially
weighted as zero or set at random and as the
training proceeds the connection strengths are
changed. Once the correlation is established
between pre- and post-synaptic signals, two uni-modal
inputs for us, we can use one modality as a cue
for the other. In all the systems we have developed
our emphasis has been on the use of the
unsupervised learning algorithms, except in cases
where it was necessary to use supervised learning.
This preferential use of an
unsupervised learning algorithm is based on the
view that ontogenesis plays a key role in learning,
and that occasions where environmental input
acts as a teacher are rather limited though
nonetheless important. We use self-organising
feature maps (SOFM) due to Teuvo Kohonen that
can transform a continuous n-dimensional input
signal onto a one or two-dimensional discrete
space of ‘neurons’. A discriminant function is
used to relate the weight vectors that connect the
input layer to the output layer: the weight vector
closest to the input is regarded as the ‘winner’ and
selected neighbourhood neurons in the output
layer form a halo and are activated when the
winner is activated.
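The winner-and-halo update described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the systems reported here: the function names, the squared Euclidean discriminant, the Gaussian neighbourhood, the linear decay schedules and all parameter values are assumptions.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return (row, col) of the unit whose weight vector is closest
    to the input x (squared Euclidean distance as discriminant)."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_sofm(data, rows=5, cols=5, epochs=30,
               lr0=0.5, radius0=2.0, seed=0):
    """Train a small two-dimensional map on n-dimensional inputs."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)]
                for _ in range(cols)] for _ in range(rows)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # decaying step size
        radius = max(radius0 * (1.0 - frac), 0.5)   # shrinking halo
        for x in data:
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    # The winner gets h = 1; its neighbours get less.
                    h = math.exp(-grid_d2 / (2.0 * radius ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (x[i] - w[i])
    return weights
```

Calling `train_sofm` on a list of input vectors returns the trained weight grid, and `best_matching_unit` then maps any new input onto a discrete map position, which is the sense in which a continuous n-dimensional signal is transformed onto a two-dimensional space of ‘neurons’.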
Collateral Image and Text System
Images have been traditionally indexed with short
texts describing the objects within the image. In
some cases it takes a specialist to literally go
behind the image and discover objects not
apparent to a layperson: radiologists, forensic
professionals and good art critics are amongst those
specialists. The specialists also excel because
they are succinct in their descriptions;
sometimes this succinctness is a gift and at others
it is acquired through experience and
training. The accompanying text is sometimes
described as collateral to the image. The ability
to use the collateral texts for building
computer-based image retrieval systems will help in dealing
with image collections that can now be stored
digitally. Theoretically, the manner in which we
grasp the relationship between the ‘features’ of
the image and the ‘features’ of the collateral text
relates back to cross-modality. The use of a
multi-net system comprising independent yet
interacting neural networks also contributes to the
debate on multiple classifier systems.
We have developed a multi-net system that
learns to classify images within an image
collection, where each image has a collateral text,
based on the common visual features and the
verbal features of the collateral text. The
multi-net can also learn to correlate images and their
collateral texts using Hebbian links – this means
that one image may be associated with more than
one collateral text and vice versa [3, 4]. The
details of the system are given below:
Image Feature: Kohonen SOFM, 15 × 15
Collateral Text: Kohonen SOFM, 15 × 15
Image-Text Cross: Hebbian Links, 15 × 15
We have had access to 66 scene-of-crime
images used in training scene-of-crime officers:
58 of these images were used for training and 8
for testing purposes.
Image features were
represented as a 112-dimensional vector: 21
components for colour distribution, 19 for shapes,
and 72 for texture; each of the features was
automatically extracted. The collateral texts were
represented by a 50-dimensional input vector:
frequently occurring and semantically relevant
keywords were extracted automatically from the
collateral texts. Both the image and text vectors were
mapped onto the 15 × 15 output layers of two
Kohonen Maps. During training, a Hebbian
network learnt to associate the most active
neurons in the two maps thereby establishing a
degree of cross-modality. Another system was
created where only one SOFM was trained to
recognise both the images and the texts, using a
combined 162-dimensional (112 + 50) vector;
we call this a monolithic system.
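The Hebbian association between the most active neurons of the two maps can be sketched as follows. This is only an illustration under stated assumptions: each 15 × 15 map is flattened to 225 units, the links form a full 225 × 225 matrix that starts at zero, only the single winning unit in each map is treated as active, and the winner indices in `training_pairs` are hypothetical. The names `hebbian_update`, `cue_text_from_image` and `cue_image_from_text` are introduced for the example.

```python
import numpy as np

MAP_SIZE = 15 * 15  # each 15 x 15 output layer flattened to 225 units

def one_hot(index, size=MAP_SIZE):
    """Activity pattern in which only the winning unit is active (assumed)."""
    activity = np.zeros(size)
    activity[index] = 1.0
    return activity

def hebbian_update(links, image_activity, text_activity, rate=0.1):
    """Strengthen links between co-active image-map and text-map units."""
    links += rate * np.outer(image_activity, text_activity)
    return links

def cue_text_from_image(links, image_activity):
    """Text-map unit most strongly linked to the active image unit(s)."""
    return int(np.argmax(image_activity @ links))

def cue_image_from_text(links, text_activity):
    """Image-map unit most strongly linked to the active text unit(s)."""
    return int(np.argmax(links @ text_activity))

links = np.zeros((MAP_SIZE, MAP_SIZE))  # links initially weighted as zero

# Hypothetical (image winner, text winner) pairs seen during training.
training_pairs = [(12, 200), (97, 45), (12, 200), (180, 3)]
for image_winner, text_winner in training_pairs:
    hebbian_update(links, one_hot(image_winner), one_hot(text_winner))

text_unit = cue_text_from_image(links, one_hot(97))
image_unit = cue_image_from_text(links, one_hot(200))
```

Because a link is strengthened only when both of its end units are active together, an image winner retrieves the text unit it co-occurred with, and vice versa; one image unit can also accumulate links to several text units.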
The trained system was then tested on the
8 images and their collateral texts. The multi-net
system could classify 7 out of 8 images correctly
whereas the monolithic system could only
classify 3 out of the 8 images correctly. By
correctly we mean that the images fell into a
cluster of images on the output image and text
SOFMs deemed similar by forensic experts
working with us. A closer examination of how
the test images and their collateral texts were
classified in the image and the text SOFMs
showed that the image SOFM could classify 4 out
of 8 images correctly whereas 5 out of 8 collateral
texts were in the ‘right’ clusters. The cross-modal
interaction between the two improves the
classification considerably: crudely, by almost a
factor of two (7 correct compared with 4).
Numerosity Development
The Surrey Subitization System was developed to
study how an artificial neural network can learn to
enumerate approximately (subitise) [2]. The
system comprises three interacting modules.
The first is the mapping module, which responds to
the presence of objects in a visual scene irrespective
of their size and location, and then represents
the response.
The second is a magnitude
representation module that learns to represent
small numerosities as magnitudes along a number
line according to Fechner’s law and the so-called
distance effect. The third module is the
output module that comprises a word lexicon for
representing numbers verbally and a cross-modal
module that learns the relationship between the
magnitude representation and the verbal
representation. The details of the system are
shown below:
Mapping: Scale
Mapping: Translation
Verbal Representation
Cross Modality
Hebbian Links
648 × 72
72 × 15
36 × 36
16 × 64
36 × 64
The mapping modules transform a simple
visual scene onto a simple record of the whole
entities in the scene.
The magnitude
representation module, an SOFM, receives its
scale and translation invariant input from the
mapping module. This input is mapped onto the
output layer of the SOFM.
Initially, each
magnitude is assigned a random position on the
output layer:
[Figure: positions of the magnitudes on the output layer]
In one example, we trained the network to learn
numbers up to 5 and after 100 training cycles
the output layer appeared ‘compliant’ with
Fechner’s Law: numerosities were organised on a
compressed number line and reflected the
distance effect, that is, the larger the difference
between two numerosities, the further apart they are.
The results of our simulation compare well with
[12]; those authors used a hard-wired
network, whereas we trained ours. Our
subitisation system performs marginally better
than the supervised subitisation system reported
by [33]: the supervised system has difficulty in
representing intermediate numerosities, e.g. ‘4’ is
not represented well when the system learns
numerosities between ‘1’ and ‘5’, and it is dependent
upon the rather arbitrary choice of hidden layers
used typically in a supervised network [2]. In
another experiment the system was trained to
subitise numbers between 1 and 22, except for a
randomly selected set of six numbers: 2, 3, 10,
14, 15 and 19 [4]. The magnitude and verbal
representations were trained on the 16 numbers in
the training set. The trained system was then
tested for the six numbers in the test set: the
results were encouraging in that each of the
numbers was recognised by the neuron in the
magnitude representation closest to its value and
the corresponding verbal output was identified as
well. In contrast a monolithic SOFM combining
both the magnitude and verbal representation
trained on the set of 16 numbers failed to
recognise most of the test set. Another advantage
of the cross-modal system was that input in one
modality (say magnitude) could cue output in
another modality (say verbal articulation) and
vice versa.
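The compressed number line and the distance effect referred to in this section can be illustrated with a short worked example. It assumes the common logarithmic reading of Fechner's law, in which the position of numerosity n is proportional to ln(n); the constant `k` and the function names are hypothetical, and the sketch shows only the target organisation, not how the SOFM arrives at it.

```python
import math

def number_line_position(n, k=1.0):
    """Fechner-style position of numerosity n on the number line: k * ln(n)."""
    return k * math.log(n)

def separation(a, b):
    """Distance between two numerosities on the compressed number line."""
    return abs(number_line_position(a) - number_line_position(b))

numerosities = [1, 2, 3, 4, 5]
positions = [number_line_position(n) for n in numerosities]

# Gaps between neighbouring numerosities shrink as the numbers grow:
# this is the compression of the number line.
gaps = [b - a for a, b in zip(positions, positions[1:])]

# Distance effect: for a fixed starting point, a larger numerical
# difference gives a larger separation on the line.
near = separation(1, 2)
far = separation(1, 5)

for n, p in zip(numerosities, positions):
    print(f"numerosity {n}: position {p:.3f}")
```

On such a line, 4 and 5 sit closer together than 1 and 2, which is one way to read the difficulty with intermediate numerosities noted above.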
Based on observations from the neuroimaging
studies, lesion studies, and neuropsychological
experiments, I can claim retrospectively that all
the systems reported below have a heteromodal
region by way of an established training
algorithm, the Hebbian learning algorithm. The
strategy my colleagues and I adopted was to
simultaneously train two uni-modal neural
networks, say, one dealing with learning image
features and the other the linguistic description of
a visual scene.
During this training, we
interspersed a Hebbian network that learnt the
association between the image features and the
linguistic features of the accompanying
description. The Hebbian links are the putative
cross-modal links and the Hebbian networks the
heteromodal area.
It is important that the neural computing
literature is aware of some fascinating
neuropsychology, as such awareness will show the
scope and limitations of neural computing
systems. There have been words of warning
about reading too much into neuroimages [8],
and there have been numerous warnings about
reading too much into the results reported by the
neural computing community. Yes, each human
is unique and perhaps this is reflected in each
human’s neuroanatomy, but we all believe, at least
most of the time, that an actor is speaking on the
TV or cinema screen, or that we do use modalities
interchangeably whilst looking up objects and
artefacts. How do we explain shared behaviour?
Neural networks reported in the literature may be
reduced to a collection of switches or regression
analysis systems; both these statements are true
in that the whole purpose of a reductive argument
is to do just that. But since, according to our
current scientific thinking and practice, there is
electrical, chemical and metabolic activity observed
in the brain, perhaps there is no harm in starting
with switches and regression algorithms.
