Journal of Experimental Psychology:
Human Perception and Performance
1994, Vol. 20, No. 6, 1233-1247
Copyright 1994 by the American Psychological Association, Inc.
0096-1523/94/$3.00
When Two Meanings Are Better Than One: Modeling the Ambiguity
Advantage Using a Recurrent Distributed Network
Alan H. Kawamoto, William T. Farrar IV, and Christopher T. Kello
Ambiguous words are processed more quickly than unambiguous words in a lexical decision
task despite the fact that each sense of an ambiguous word is less frequent than the single
sense of unambiguous words of equal frequency or familiarity. In this computer simulation
study, we examined the effects of different assumptions of a fully recurrent connectionist
model in accounting for this processing advantage for ambiguous words. We argue that the
ambiguity advantage effect can be accounted for by distributed models if (a) the least mean
square (LMS) error-correction algorithm rather than the Hebbian algorithm is used in training
the network and (b) activation of the units representing the spelling rather than the meaning
is used to index word recognition times.
An important advantage of computational models is that
the underlying assumptions of the model must be explicitly
formulated. This explicit formulation allows comparison of
assumptions that are highly similar. In some cases, virtually
identical assumptions can give rise to qualitative differences
rather than merely quantitative differences. In this article,
we consider just such a situation. Two different connectionist learning algorithms lead to opposite predictions about the
time required to recognize ambiguous and unambiguous
words when activation of the spelling units is used as the
index of word recognition. In particular, ambiguous words
are incorrectly predicted to have a processing disadvantage
compared with unambiguous words when the Hebbian
learning algorithm is used, but correctly predicted to have a
processing advantage when the least mean square (LMS)
error-correction algorithm is used.
Ambiguity Advantage Effect
Ambiguous words pose a special challenge in modeling
word recognition. Not only must a way to represent the
multiple meanings of a word be found, but the manner in
which these meanings affect processing must also be considered. Nowhere is this problem more clearly manifested
than in accounting for relative processing times of ambiguous words and unambiguous words. For one particular
task—lexical decision—ambiguous words are processed
more quickly than unambiguous words, despite the fact that
the frequency of the different senses of an ambiguous word
is less than the frequency of the single sense of an unambiguous word (Kellas, Ferraro, & Simpson, 1988; Millis & Button, 1989).

Author note: Alan H. Kawamoto, William T. Farrar IV, and Christopher T. Kello, Program in Experimental Psychology, University of California, Santa Cruz. We would like to thank Art Jacobs and Max Coltheart for their comments. Correspondence concerning this article should be addressed to Alan H. Kawamoto, Program in Experimental Psychology, Clark Kerr Hall, University of California, Santa Cruz, California 95064.
The issue of an advantage that ambiguous words have
over unambiguous words in being recognized has been
particularly vexing theoretically, and as we show later,
empirically. To our knowledge, the earliest study claiming
an advantage for ambiguous words was reported by Rubenstein, Garfield, and Millikan (1970). Subjects were faster to
make lexical decisions to ambiguous words compared with
unambiguous words of the same printed frequency of occurrence when these words were presented in isolation. A
subsequent study showed that this ambiguity advantage
existed only for homonyms whose meanings were equiprobable and not systematically related (Rubenstein, Lewis, &
Rubenstein, 1971). However, Clark (1973) argued that the
conclusions reached by Rubenstein et al. could not be generalized to a different set of stimuli because the results were
not significant when both subjects and items were treated
as random variables. In a subsequent study, Forster and
Bednall (1976) also found that lexical decision times to
homonyms were slightly faster than unambiguous controls,
but, again, this advantage did not prove to be statistically
significant when both subjects and items were treated as
random variables.
The issue of an advantage for ambiguous words was again
raised in a series of experiments by Jastrzembski and his
colleague (Jastrzembski, 1981; Jastrzembski & Stanners,
1975). Arguing that the earlier studies failed to find significant differences between ambiguous and unambiguous
words because the ambiguous words that were used had
only two distinct meanings, Jastrzembski and Stanners compared words that had many meanings listed in a dictionary
with words that had only a few. They found that subjects
responded significantly faster to words with many dictionary meanings compared with words with only a few meanings even when both subjects and items were considered in
the analysis. However, the conclusions of that study have
also been challenged. Gernsbacher (1984) argued that familiarity and not number of meanings was the basis of the
difference. When familiarity rather than printed frequency
was crossed with number of meanings, the only effect that
was significant was a main effect of familiarity.
In response to Gernsbacher's (1984) critique, two subsequent studies used subjective measures of ambiguity rather
than the objective measure of number of meanings (Kellas
et al., 1988; Millis & Button, 1989). In addition, the target
words were controlled for both printed frequency and subjective familiarity. In both studies, an advantage for ambiguous words compared with unambiguous words was found
when both subjects and items were treated as random
variables.
Admittedly, the advantage for ambiguous words has
sometimes been small and has not always been statistically
significant when both subjects and items are used as random
factors. However, in every study comparing words with
many meanings and words with few meanings, an advantage (not necessarily significant) for words with many
meanings was found when other variables were collapsed
across this condition (see our Table 1 and Gernsbacher,
1984, Table 6).¹ Given that the frequency of each sense of
an ambiguous word is less than the frequency of an unambiguous word matched in printed frequency (half of the
frequency for ambiguous words with two equiprobable
meanings), one might have expected an ambiguity disadvantage. In fact, when other measures such as gaze durations were the dependent variable, equiprobable ambiguous
words took longer to process than unambiguous words
when the target words were presented in a neutral sentential
context (Carpenter & Daneman, 1981; Duffy, Morris, &
Rayner, 1988; Rayner & Duffy, 1986). Moreover, this processing disadvantage for ambiguous words persisted even
when a new unambiguous control word approximately half
of the printed frequency of the old control word was used.
Previous Accounts of the Ambiguity
Advantage Effect
Given the inherent disadvantage that ambiguous words
must overcome if the frequency of each sense is represented
separately, many models would predict that ambiguous
words would be recognized more slowly, not more quickly,
than unambiguous words. For models that use a localist
representation scheme, it is clear why an ambiguity disadvantage would be expected. For example, in the serial
search model (Forster & Bednall, 1976), each sense of an
ambiguous word would be found below the single entry of
an unambiguous word matched in printed frequency in a
frequency-ordered list. In logogen-type models (McClelland & Rumelhart, 1981; Morton, 1979), each logogen
corresponding to a sense of an ambiguous word would have
a higher threshold or lower resting level of activation compared with a matched unambiguous word. In order to overcome the disadvantage that ambiguous words have compared with unambiguous words, either a race between the
senses or facilitation among the senses of an ambiguous
word must be assumed (Jastrzembski, 1981; Kellas et al.,
1988).
By contrast, for models that use a distributed representa-
tion scheme, predictions are not as straightforward because
there is no single unit that corresponds uniquely to a lexical
entry. That is, different units are used to represent the
spelling, the pronunciation, and the meaning, and activation
over different sets of units is used as an index for different
responses (Kawamoto, 1988, 1993; Seidenberg & McClelland, 1989). For example, lexical decision is often indexed
by activation of the units representing the spelling, and
naming latency is usually indexed by activation of the units
representing the pronunciation (Kawamoto, 1993; Seidenberg & McClelland, 1989), but there are studies in which
lexical decision is indexed by units representing the pronunciation (Seidenberg & McClelland, 1989) or the meaning (Joordens & Besner, 1994). Because the relative activation over the different units is not the same for every
word, the same stimulus (or class of stimuli) can give
different predictions depending on the particular response
made. This property of distributed representations is, in fact,
part of the basis for the different conclusions reached in two
recent simulation studies of the ambiguity advantage effect.
Kawamoto (1993), who argued that distributed representations can account for the ambiguity advantage effect, used
activation over the spelling units as an index of lexical
decision times, whereas Joordens and Besner (1994), who
argued otherwise, used activation over the meaning units. It
should be emphasized that the basis for the disagreement is
really over what the appropriate measure of word recognition is when distributed representations are used, not that it
takes more time for the meaning units to become fully
activated (or stable) for ambiguous words compared with
unambiguous words. That is, Kawamoto (1993, Figure 6)
also showed that the meaning units of ambiguous words are
less activated than those of unambiguous words.
However, there was another important difference between
Kawamoto's (1993) simulation and Joordens and Besner's (1994) simulation. Kawamoto used the LMS error-correction learning algorithm to modify the connection
strengths, whereas Joordens and Besner used the Hebbian
learning algorithm. Many differences between these two
learning algorithms are already well-known (see McClelland & Rumelhart, 1988). For example, both the Hebbian
and the LMS algorithm will produce the output exactly if
the inputs are mutually orthogonal, but only the LMS algorithm will do so if the inputs are linearly independent but
not mutually orthogonal (as the complete lexical representation of ambiguous words would be).
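The contrast is easy to check numerically. The sketch below is our own illustration (NumPy, toy four-feature patterns, and helper names of our choosing, not code from the article): with mutually orthogonal inputs both rules recall the targets exactly (cosine 1.0), whereas with correlated but linearly independent inputs only the LMS rule does.

```python
import numpy as np

def train_hebb(X, T, eta=0.01, epochs=100):
    # Hebbian rule: dW = eta * outer(target, input)
    W = np.zeros((T.shape[1], X.shape[1]))
    for _ in range(epochs):
        for x, t in zip(X, T):
            W += eta * np.outer(t, x)
    return W

def train_lms(X, T, eta=0.05, epochs=100):
    # LMS (error-correcting) rule: dW = eta * outer(target - W @ input, input)
    W = np.zeros((T.shape[1], X.shape[1]))
    for _ in range(epochs):
        for x, t in zip(X, T):
            W += eta * np.outer(t - W @ x, x)
    return W

def recall_cosine(W, X, T):
    # 1.0 means the recalled pattern points exactly at the target.
    out = (W @ X.T).T
    return [round(float(o @ t / (np.linalg.norm(o) * np.linalg.norm(t))), 3)
            for o, t in zip(out, T)]

T = np.array([[1., -1., 1., -1.], [1., 1., -1., -1.]])       # targets
X_orth = np.array([[1., 1., -1., -1.], [1., -1., 1., -1.]])  # orthogonal inputs
X_corr = np.array([[1., 1., 1., -1.], [1., 1., 1., 1.]])     # correlated inputs

for name, X in (("orthogonal", X_orth), ("correlated", X_corr)):
    print(name, "Hebb:", recall_cosine(train_hebb(X, T), X, T),
          "LMS:", recall_cosine(train_lms(X, T), X, T))
```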
In this article, we consider a qualitative difference in
processing that arises after a network is trained using different learning algorithms. In doing so, we argue that the
ambiguity advantage effect is seen only when (a) the orthographic units rather than the meaning units are used as the
index of lexical decision times and (b) the LMS algorithm rather than the Hebbian algorithm is used. In addition, we show why the processing difference that we found can ultimately be attributed to the multiple meanings that ambiguous words have compared with the single meaning that unambiguous words have.

¹ In some experiments, there is sometimes an ambiguity disadvantage across a particular condition. For example, in Rubenstein, Lewis, and Rubenstein (1971), an ambiguity disadvantage was observed for medium- and high-frequency homographs whose meanings were systematically related (e.g., noun and verb sense of plow). However, these represent a fraction of the total number of conditions in which an ambiguity advantage was found.

Table 1
Results of Previous Studies Manipulating Number of Meanings (Supplement to Table 6 in Gernsbacher, 1984)

                                      High-NM words           Low-NM words            Difference
Study                                 NM     RT (ms)  Errors  NM     RT (ms)  Errors  NM        RT (ms)  Errors  MinF'   p
Forster & Bednall (1976)              2      543      2.3     1      564      5.0     1         21       2.7     —       —
Chumbley & Balota (1984)
  Experiment 2                        —      —        —       —      —        —       r = -.33                   1.43a   ns
  Experiment 3                        —      —        —       —      —        —       r = -.47                   4.36a   .05
Gernsbacher (1984)                    >10    948      19.5    1      951      19.5    —         3        0.0     <1      ns
Kellas, Ferraro, & Simpson (1988)
  Experiment 1                        >1     725      0.6     1      763      1.2     —         38       0.6     59.28   .05
Millis & Button (1989)
  Experiment 1                        4.45   589      3.3     1      676      5.0     3.45      87       1.7     2.75    ns
  Experiment 2                        7.25   609      3.2     3.08   748      9.7     4.17      139      6.5     32.17   .05
  Experiment 3                        2.36   661      3.8     1.38   732      6.3     0.98      71       2.5     12.91   .05

Note. Adapted from "Resolving 20 Years of Inconsistent Interactions Between Lexical Familiarity and Orthography, Concreteness, and Polysemy" by M. A. Gernsbacher, 1984, Journal of Experimental Psychology: General, 113, p. 273. Copyright 1984 by the American Psychological Association. Adapted by permission of the author. Dashes indicate that the value was not given in the original study. Forster and Bednall (1976) meaning metric not given. Chumbley and Balota (1984) used a log transform of the number of dictionary meanings. They used multiple regression analyses, so grouped means were unavailable. Gernsbacher (1984) used the number of dictionary meanings. Kellas et al. (1988) used a 3-point subjective scale: 0 = no meaning, 1 = one meaning, 2 = more than one meaning. Millis and Button (1989) subjective NM metrics: Experiment 1 = first meaning listed; Experiment 2 = total meanings listed; Experiment 3 = average meanings listed. NM = mean number of meanings; RT = mean reaction time.
a F values were calculated only by subjects, and they represent the number of meanings' unique contribution to predicting lexical RTs.
Fully Recurrent Connectionist Networks
In this section, we summarize the assumptions about the
representation, learning, and processing in our fully recurrent connectionist network model.
Representation

Each word is represented in a distributed fashion by a finite number of binary features (+ or −). Features that are present are positive with a value of +1.0, and features that are absent are negative with a value of −1.0. Some of these features represent the word's spelling, and the remaining features represent the word's meaning. Each of these features plays a role in the representation of every lexical entry (i.e., features are never 0), and there is no single feature that corresponds uniquely to one entry. With such a representation scheme, similarity is measured by the degree of overlap in the value of the features (e.g., the normalized dot product).

Network Architecture

The architecture of the connectionist network that we use is illustrated in Figure 1. There is a unit in the network for each feature used in the representation. Each unit receives input from the environment (indicated by the dashed segments) as well as from every other unit in the network (indicated by the solid segments). Each of the recurrent connections formed between two units in the network has a real-valued strength that is determined by the learning algorithm described in the following section.

Figure 1. Architecture of the fully recurrent connectionist network representing the spelling and meaning of a word. Only some of the units and connections are shown.

Learning Algorithm

Learning in our model consists of modifying the connection strengths during each learning trial. For each learning trial, the activation of the units in the network is first set to
the values corresponding to the particular values of a lexical
entry. The probability of a given entry being selected on a
particular learning trial varies according to its relative frequency of occurrence. For the LMS algorithm that we use in
our model, the change in the connection strength from a given unit j to a unit i, Δw_ij, is

Δw_ij = η(t_i − n_i)t_j,   (1)

where η is a scalar learning constant, t_i and t_j are the target activation levels of two distinct units i and j, and n_i is the net input to unit i when the activations are set to the targets. By contrast, for the Hebbian algorithm, the learning algorithm used by Joordens and Besner (1994), the change in the connection strength is

Δw_ij = η t_i t_j.   (2)
In addition, we assume that the value of each connection
strength is initially 0.
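In code, the two updates differ only in whether the target activation is first discounted by what the network already predicts. The following is a minimal sketch of our own (the simulations themselves used the McClelland and Rumelhart, 1988, software); the net-input term n_i follows the description of Equation 1 given later in the General Discussion:

```python
import numpy as np

def lms_trial(W, t, eta):
    # Equation 1: dw_ij = eta * (t_i - n_i) * t_j, where n_i = sum_j w_ij t_j
    # is the net input to unit i with activations clamped to the pattern t.
    W = W + eta * np.outer(t - W @ t, t)
    np.fill_diagonal(W, 0.0)   # units receive input from every *other* unit
    return W

def hebb_trial(W, t, eta):
    # Equation 2: dw_ij = eta * t_i * t_j
    W = W + eta * np.outer(t, t)
    np.fill_diagonal(W, 0.0)
    return W
```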
Processing
After training, the performance of the network is examined by presenting just the spelling as the input to the network. The input activates the corresponding orthographic unit in the network to +0.25 if the corresponding input feature is positive and −0.25 if it is negative. Because the input consists only of the spelling, there is no activation of the semantic units. The initial activation in each orthographic unit arising from the input then spreads to other orthographic units and to the semantic units. The activation of each unit thus changes as a function of time (wherein time changes in discrete steps). In particular, the activation value of unit i, a_i, at time t + 1 is

a_i(t + 1) = LIMIT[δ a_i(t) + Σ_j w_ij a_j(t) + s_i(t)],   (3)

where δ is a decay constant, s_i(t) is the influence of the input stimulus on unit i, and LIMIT bounds the activation to the range from −1.0 to +1.0. The activation values of all the units change in parallel. Each of these updating cycles is one iteration through the network.
The activation value of each of the units in the network at
a given point in time is the most complete representation of
the state of the network at that time. However, for our
purposes, we were particularly interested in the number of
iterations it took all of the units representing the spelling to
saturate (i.e., reach their minimal or maximal activation
levels of -1.0 and +1.0, respectively). This measure was
used as an index of lexical decision times. A response was
defined as an error if the pattern of activation did not
correspond with the input or if all of the units did not
saturate after a fixed number of iterations through the
network.
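The processing assumptions translate directly into an update loop. Below is a sketch of Equation 3 and the saturation index in our own rendering; a full scoring would also compare the saturated pattern with the word's spelling, per the error criterion above:

```python
import numpy as np

def recognize(W, spelling, n_ortho, delta, scale=0.25, max_iters=50):
    """Present only the spelling and count iterations until every
    orthographic unit saturates at -1.0 or +1.0 (Equation 3)."""
    a = np.zeros(W.shape[0])
    s = np.zeros(W.shape[0])
    s[:n_ortho] = scale * spelling            # +/-0.25 input, spelling only
    for iteration in range(1, max_iters + 1):
        a = np.clip(delta * a + W @ a + s, -1.0, 1.0)   # LIMIT
        if np.all(np.abs(a[:n_ortho]) == 1.0):
            return iteration, a               # latency index, final state
    return None, a                            # no saturation: scored an error
```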
A Critical Comparison
To demonstrate the critical processing difference that
arises when the LMS and Hebbian algorithms are used, we
examined separate lexicons (see Table 2) consisting of
either (a) an unambiguous word, (b) an ambiguous word
with two distinct unequiprobable meanings, (c) an ambiguous word with two distinct equiprobable meanings, and (d)
an ambiguous word with four distinct equiprobable meanings.² Each word was represented by 12 features: The first 4 were "orthographic" features and the last 8 were "semantic" features. The orthographic representations of the unambiguous word and the ambiguous words were identical. The semantic representations of the ambiguous words were mutually orthogonal. In addition, the "printed frequency" of the words was identical.
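For concreteness, the four lexicons can be written down as follows. This is a sketch with hypothetical feature values: the paper's exact +/− patterns are not reproduced, only the stated constraints of a shared 4-feature spelling, mutually orthogonal 8-feature meanings, and a total printed frequency of 4 per word apportioned as in Table 2:

```python
import numpy as np

SPELLING = np.array([+1., -1., +1., -1.])    # shared by every word (hypothetical values)

MEANINGS = np.array([                        # four mutually orthogonal meanings
    [+1, +1, +1, +1, +1, +1, +1, +1],
    [+1, -1, +1, -1, +1, -1, +1, -1],
    [+1, +1, -1, -1, +1, +1, -1, -1],
    [+1, -1, -1, +1, +1, -1, -1, +1],
], dtype=float)

def lexicon(freqs):
    # One (12-feature pattern, frequency) pair per sense.
    return [(np.concatenate([SPELLING, MEANINGS[k]]), f)
            for k, f in enumerate(freqs)]

LEXICONS = {
    "Unambiguous":    lexicon([4]),
    "2-Polarized":    lexicon([3, 1]),
    "2-Equiprobable": lexicon([2, 2]),
    "4-Equiprobable": lexicon([1, 1, 1, 1]),
}
```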
Four different, fully recurrent networks comprised of 12
units were each trained on one of the four different lexicons
using the LMS algorithm and four other networks were
trained using the Hebbian algorithm with the software provided in McClelland and Rumelhart (1988). The learning constant η, the same for both the LMS and the Hebbian algorithms, was equal to 0.02. The only difference in the
training was that the entire lexicon was presented twice for
a total of 8 trials during the training phase for the Hebbian
algorithm, but four times for a total of 16 trials for the LMS
algorithm. (The number of learning trials was chosen so that
performance was roughly comparable for the two learning
algorithms.)
Processing assumptions for the networks trained using the
two learning algorithms were identical. Each network was
initially presented with an input consisting of just the spelling pattern scaled by a factor of 0.25. Activation in the
network was computed according to Equation 3, with the
decay term δ equal to 0.2. As seen in Figure 2A, networks trained using the LMS algorithm did simulate the ambiguity advantage: The orthographic units were saturated by Iteration 7 for the network that learned the unambiguous word, but by Iteration 6 for each of the networks that learned an
ambiguous word. Moreover, the relative magnitude of the
activation values for the iterations preceding saturation was
correlated with the degree of ambiguity of the meaning: (a)
ambiguous words with four distinct equiprobable meanings,
(b) ambiguous words with two distinct equiprobable meanings, (c) ambiguous words with two distinct polarized
meanings, and (d) unambiguous words. By contrast, as seen
in Figure 2B, for networks trained with the Hebbian algorithm, the relative activation of the orthographic units was exactly opposite that found with the LMS algorithm.

² We also examined an ambiguous word with similar equiprobable meanings. However, we did not report the results because the size of the effect, relative to the other ambiguous words that all had meanings that were mutually orthogonal, was dependent on the degree of similarity. In general, the more similar the meanings, the more similar the results were to the unambiguous word. We should also point out that the lexicons were chosen to constitute just a single unambiguous word or all of the senses of a single ambiguous word for clarity. The same points could have been made by considering a single lexicon with unambiguous words and the different types of ambiguous words.

Table 2
Lexicons Consisting of Either an Unambiguous Word or an Ambiguous Word Used in the Critical Comparison

Lexicon          Sense   Frequency
Unambiguous      1       4
2-Polarized      1       3
                 2       1
2-Equiprobable   1       2
                 2       2
4-Equiprobable   1       1
                 2       1
                 3       1
                 4       1

Note. Each sense is a 12-feature pattern of +1.0 and −1.0 values: the 4 orthographic features (identical for every word) followed by the 8 semantic features, which were mutually orthogonal across senses.
The difference in activation for the different lexicons
trained using the LMS algorithm arises because the different
networks have different connection strengths. To illustrate
these differences, the connection strengths were partitioned
into four different subsets on the basis of their origin and
destination: orthographic unit → orthographic unit, orthographic unit → semantic unit, semantic unit → semantic unit, and semantic unit → orthographic unit. Table 3 shows the mean of the absolute value of the connection strengths for each of these subsets of connections. (The values for the connections from the semantic units to the orthographic units are not shown separately because the values are nearly identical to the reciprocal connections from the orthographic units to the semantic units.) On the basis of these values, it is clear that as ambiguity increases, the mean of the absolute value of the orthographic → orthographic connections increases and the mean of the absolute value of the orthographic → semantic connections decreases. In other words, as ambiguity increases, the influence of the orthographic → orthographic connections increases and the influence of the orthographic ↔ semantic connections decreases. This influence of the different connections on activation in the orthographic units can be demonstrated by (a) eliminating all but the orthographic → orthographic connections and (b) eliminating the orthographic → orthographic connections. Figure 3 shows the mean of the absolute value of activation in the orthographic units under these two conditions with only the spelling as the input.
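The partition is simple to compute from the weight matrix. In the helper below (ours; we adopt the convention that w_ij is the connection from unit j to unit i, with the orthographic units listed first), the two lesion conditions of Figure 3 are also constructed:

```python
import numpy as np

def mean_abs_by_block(W, n_ortho):
    """Mean |w| for each connection subset, as reported in Table 3.
    The zero self-connections on the diagonal are included in the
    O->O and S->S means."""
    o = slice(0, n_ortho)
    s = slice(n_ortho, W.shape[0])
    return {"O->O": float(np.mean(np.abs(W[o, o]))),
            "O->S": float(np.mean(np.abs(W[s, o]))),   # orthographic -> semantic
            "S->O": float(np.mean(np.abs(W[o, s]))),   # semantic -> orthographic
            "S->S": float(np.mean(np.abs(W[s, s])))}

def lesion(W, n_ortho, keep_oo_only):
    """Keep only the O->O block, or zero it out while keeping the rest
    (the two conditions plotted in Figure 3)."""
    o = slice(0, n_ortho)
    if keep_oo_only:
        W2 = np.zeros_like(W)
        W2[o, o] = W[o, o]
    else:
        W2 = W.copy()
        W2[o, o] = 0.0
    return W2
```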
The differences in connection strengths (and thus the differences in activation of the orthographic units) that arise in training the different words are due to the error-correcting property of the LMS algorithm. When an unambiguous word is being learned, all of the orthographic and semantic features contribute to predicting that word's orthographic features. By contrast, when an ambiguous word is being learned, only the unambiguous features contribute to predicting that word's orthographic features: The ambiguous features do not make a contribution because the changes that are made for one sense of an ambiguous word are offset by the other senses. Consequently, the orthographic features (as well as the unambiguous semantic features) must compensate for the missing contribution of the ambiguous semantic features by making a larger contribution in predicting the activation of the orthographic features. With an error-correction learning algorithm, any manipulation that increases the ambiguity will lead to a larger contribution of the orthographic → orthographic connections, and, conversely, any manipulation that decreases the ambiguity will lead to a smaller contribution. Thus, increasing the number of distinct meanings from two to four increases the ambiguity (i.e., with four orthogonal meanings, there are three features that are ambiguous, four features that are partially biased, and one feature that is unambiguous), whereas increasing the relative dominance of one meaning (or increasing the similarity of the meanings) decreases the ambiguity.
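The offsetting of changes can be seen in a two-unit toy example of our own devising: one orthographic feature that is always +1 and one semantic feature that is +1 in one sense and −1 in the other. Under the LMS rule, the cross-connections hover near zero because each sense's update is undone by the other:

```python
import numpy as np

eta = 0.1
W = np.zeros((2, 2))            # W[i, j]: connection from unit j to unit i
A = np.array([+1., +1.])        # [orthographic, semantic], sense A
B = np.array([+1., -1.])        # same spelling, opposite semantic feature

for trial in range(200):
    t = A if trial % 2 == 0 else B
    W += eta * np.outer(t - W @ t, t)   # LMS update (Equation 1)
    np.fill_diagonal(W, 0.0)            # no self-connections

# Both cross-connections oscillate within about eta / (2 - eta) of zero,
# so in the full network the orthographic and unambiguous semantic
# features must carry the prediction instead.
print(np.round(W, 3))
```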
Figure 2. Time course of activation of the orthographic units when just spelling is presented for each of the networks trained using (a) the least mean square learning algorithm and (b) the Hebbian learning algorithm.

Table 3
Mean of the Absolute Value of the Connection Strengths for Different Subsets of Connections (Orthographic → Orthographic, Orthographic → Semantic, and Semantic → Semantic) for Each Network

Lexicon          O→O     O→S     S→S
Unambiguous      .089    .089    .089
2-Polarized      .122    .077    .085
2-Equiprobable   .132    .066    .075
4-Equiprobable   .166    .065    .079

Note. O = orthographic; S = semantic.

In contrast to the LMS algorithm, the changes made for a connection on a given learning trial with the Hebbian algorithm are independent of previous changes made to that connection as well as previous changes made to all other connections. That is, unlike the LMS algorithm in which the network's current performance (i.e., the magnitude of the error) influences the magnitude of the change in the connection strength, the changes made using the Hebbian algorithm are always the same for a particular stimulus. Thus, the Hebbian learning algorithm does not compensate for the fact that the ambiguous semantic features are not contributing to the prediction of the orthographic features. In particular, the connection strengths from one orthographic unit to another orthographic unit in the network trained on the unambiguous word are identical to the corresponding connection strengths in the networks trained on one of the ambiguous words.
Despite the large differences in the final connection
strengths obtained using the Hebbian and the error-correction learning algorithms, the connection strengths are actually highly similar early in the training. In fact, after the first
learning trial, the connection strengths are identical using
the LMS and Hebbian learning algorithms, assuming the
same lexical entry is learned and that the weights are initially zero. Thus, for the error-correction algorithm, there is
an ambiguity disadvantage early in training that becomes an
ambiguity advantage later in training (see Table 4).
Finally, as already noted, one of the key assumptions was
that activation in the orthographic units is used as an index
of lexical decision times. Indeed, even with the LMS algorithm, the unambiguous word would have been recognized
before the ambiguous words if saturation of the semantic
units had been used as the index of recognition. As Figure
4 shows, the relative magnitude of the activation values for
the last 8 units (the semantic units) was exactly opposite that
found in Figure 2A.
Figure 3. Time course of activation of the orthographic units for each of the networks trained using the LMS algorithm when (a) only the orthographic → orthographic connections are used and (b) all other connections except for the orthographic → orthographic connections are used.

Simulation of Data

A real lexicon is much more complicated than the examples just considered. A real lexicon contains many words, and most words are similar in spelling to other words. In fact, ignoring the part of the spelling that is dissimilar, a shorter "ambiguous" word is generated (e.g., the unambiguous words BERM and BERG yield the ambiguous string BER_). Thus, even unambiguous words could manifest an "ambiguity advantage," and this possibility increases as the number of neighbors those words have increases. To demonstrate the ambiguity advantage effect with a real lexicon, the data from one experiment will be simulated.

Our next simulation examined a particular set of data reported by Millis and Button (1989, Experiment 3). We chose this specific experiment because the target words used in the experiment were reported and because these results have addressed previous criticisms. In their experiment, Millis and Button found that ambiguous words with few meanings (an average of 1.38 meanings) had a mean lexical decision time of 732 ms and that ambiguous words with many meanings (an average of 2.36 meanings) had a mean lexical decision time of 661 ms. For purposes of comparison, we also included unambiguous words matched in printed frequency because many experiments examining the effect of ambiguity on word recognition have specifically compared unambiguous words with ambiguous words, rather than two classes of ambiguous words. Unfortunately,
to our knowledge, there is no single experiment that specifically compared unambiguous words with ambiguous
words with few and many meanings and therefore we could
not explicitly compare all three classes of words.
Lexicon
The lexicon consisted of a total of 108 four-letter words.
Of these, 24 were ambiguous words drawn from the test
words used in Millis and Button (1989, Experiment 3). In
our implementation, the ambiguous words that had few
meanings were assigned two meanings and the ambiguous
words that had many meanings were assigned four meanings. For simplicity, we assumed that all meanings of an
ambiguous word were of the same or nearly the same
frequency (e.g., if a word with four meanings had a printed
frequency of 14, two meanings would have a frequency of 4
and two meanings would have a frequency of 3) because
Millis and Button did not report relative dominance of the
meanings. In addition, 12 unambiguous words whose Francis and Kucera (1982) frequencies were approximately equal to those of the ambiguous items were included for comparison.
The ambiguous and unambiguous items are presented in the
Appendix. The lexicon also contained an additional 72
unambiguous filler words. They were randomly chosen
from among the four-letter words in the Francis and Kucera
(1982) corpus that had a frequency greater than 1 and were
not proper names. Words with frequencies greater than 500
were treated as if they had a frequency of 500. Twenty
different sets of 72 filler items were created for each of
the 20 replications of the simulation (described more fully
shortly) to reduce the possibility that an effect was due to a
specific set of filler items.
Each lexical entry was composed of a spelling (represented by 56 orthographic features) and a meaning (represented by 70 semantic features). The orthographic representation was identical to that used by McClelland and Rumelhart (1981): Each of the four letters was represented by 14
features. The semantic representation of each word meaning
was generated by randomly permuting the order of 35
positive and 35 negative features. The semantic representation for each ambiguous and unambiguous word was different for each network.
Table 4
Number of Iterations to Reach Saturation Following Different Amounts of Training Using the Error-Correction Algorithm

No. epochs   Unambiguous   2-Equiprobable
1            —             —
2            8             10
3            7             7
4            7             6

Note. Dashes indicate that neither network reached saturation after just one epoch of training.

Figure 4. Time course of activation of the semantic units for each of the networks trained using the least mean square algorithm.
Networks
Twenty fully recurrent networks with identical architectures were used to simulate 20 subjects. Each network
consisted of 126 units, with each unit representing 1 of the
126 orthographic or semantic features of a lexical entry.
Training
The training regimen was identical to that described earlier. Initially, the strengths of all connections between units
were set to zero. The connection strengths between each
pair of units were then modified after each learning trial
according to Equation 1 with the learning constant η set to 0.00005.³ The number of times each word was presented
to the network during one training epoch was equal to that
word's Francis and Kucera (1982) frequency.
Performance
To assess the performance of the network, the spelling of
each ambiguous and unambiguous test word, scaled
by 0.25, was presented to the network. The decay constant
δ was set to 0.95. For each test word, the network continued
processing until all of the orthographic units were saturated
or until the network had reached 200 iterations. When the
network did not fully activate the correct orthographic units
in a given trial, that trial was considered an error and
removed from the latency analysis. If the network took more
than 20 iterations to obtain the appropriate output in any
given trial, that trial's iteration time was truncated to 20.
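In code, the scoring rule amounts to the following sketch (ours; `saturated_pattern_correct` is a hypothetical stand-in for the comparison of the final pattern with the target spelling):

```python
def score_trial(iterations, saturated_pattern_correct, cutoff=20):
    """Error trials (wrong pattern, or no saturation within the limit of
    200 iterations) are dropped from the latency analysis; correct trials
    longer than 20 iterations are truncated to 20."""
    if iterations is None or not saturated_pattern_correct:
        return None                  # error: excluded from latencies
    return min(iterations, cutoff)   # truncated latency, in iterations
```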
³ The learning constant η must be less than 1/λmax, where λmax is the maximal eigenvalue of the sample covariance matrix, in order for the algorithm to converge (Widrow & Stearns, 1985). Because some items in the lexicon have a much higher relative frequency than other items in the lexicon, λmax is large. The large λmax in turn dictates a small η.
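The bound in the footnote can be checked directly. A sketch of our own, taking the frequency-weighted second-moment matrix of the training patterns as the sample covariance the footnote refers to:

```python
import numpy as np

def max_stable_eta(patterns, frequencies):
    """Upper bound 1 / lambda_max on the learning constant (Widrow &
    Stearns, 1985), with each pattern repeated by its frequency."""
    X = np.repeat(np.asarray(patterns, dtype=float),
                  np.asarray(frequencies, dtype=int), axis=0)
    second_moment = X.T @ X / len(X)
    lam_max = float(np.linalg.eigvalsh(second_moment).max())
    return 1.0 / lam_max
```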
The performance of the network after 5, 10, and 15
epochs of training is shown in Figure 5. The simulation
results of the response times shown in Figure 5A are consistent with the empirical data. On average, ambiguous
words took less time to saturate the orthographic units
compared with unambiguous words, and ambiguous words
with many meanings took less time to saturate than ambiguous words with few meanings. The average number of
iterations it took the ambiguous words to saturate after 15
epochs of training can be multiplied by 546, with 2,053 then subtracted, to yield the lexical decision times (in
milliseconds) reported by Millis and Button (1989, Experiment 3). (Note that a different learning constant would lead
to the same pattern of results, although the particular values
attained after a given number of training epochs, and thus
the constants for the transformation, would be different.)
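The reported mapping from model latencies to empirical latencies is the linear transform below (the constants are specific to 15 epochs of training with this learning constant, as just noted):

```python
def iterations_to_ms(iterations):
    """Map saturation iterations (after 15 training epochs) onto the
    lexical decision times of Millis and Button (1989, Experiment 3)."""
    return 546.0 * iterations - 2053.0
```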
Figure 5. Performance at different points during the training of a network that learns a large lexicon: (a) number of iterations for the orthographic units to saturate and (b) number of errors.

However, the simulation results of the error performance shown in Figure 5B are only partially consistent with the empirical data. That is, as in the data from Millis and Button
(1989, Experiment 3) and other studies (see our Table 1 and
Gernsbacher, 1984), ambiguous words with many meanings
had fewer errors than ambiguous words with few meanings,
but in contrast to published results, unambiguous words had
fewer errors than ambiguous words. We cannot explain this
particular simulation result at this point, although the small
neighborhoods of some of the ambiguous stimuli (e.g., huge
and pond) may have been a contributing factor.
General Discussion
Many models that begin with very different assumptions
about what information is used and how that information is
used in processing can end up making highly similar qualitative predictions (see also Van Orden and Goldinger,
1994, for a related discussion). For example, for naming
latencies, the main effects of frequency and regularity as
well as the frequency by regularity interaction can be accounted for by dual-route models in which both lexical and
sublexical information are represented, by analogy models
in which only lexical information is represented, and by
distributed models in which only sublexical information
is represented (see the discussions in Kawamoto, 1994;
Kawamoto & Zemblidge, 1992). Moreover, distinct versions of each of these three classes of models can account
for these effects. In the dual-route model, for example, the
lexical and sublexical rule routes can either be independent
with the pronunciation and latency determined by a race
between the routes (Coltheart, 1978), or the two routes can
interact with the pronunciation and latency determined by
integrating information from the two routes (Monsell,
Patterson, Graham, Hughes, & Milroy, 1992).
When different models all make the same qualitative
predictions, there are at least two alternatives in deciding
among the competing models. One alternative is to make the
decision on a quantitative basis by determining which
model provides the best fit using a measure such as root-mean-square deviation. As seen in this special issue, the
explicit formulation of models as computer programs allows
such comparisons to be made. For example, one can compare the advantage of using number of types with the
number of tokens in formulating rules of pronunciation
(e.g., Norris, 1994). Another alternative is to determine how
well the various models account for other related phenomena. For example, recent computational models of naming
have begun to focus on results such as generalization capacity (e.g., Coltheart, Curtis, Atkins, & Haller, 1993).
However, there are also many models with highly similar
assumptions that make qualitatively different predictions.
For example, different assumptions about the variance of
finishing times of the two routes in the race version of the
dual-route model lead to opposite predictions for the latencies of correct naming responses of low-frequency regular
and irregular words (Kawamoto & Zemblidge, 1992). In
particular, if the variance of the slower rule route is less than
or equal to the variance of the faster lexical route, irregular
words are incorrectly predicted to have shorter latencies
than regular words. However, if the variance of the rule
route is sufficiently greater than the variance of the lexical
route, irregular words are correctly predicted to have longer
latencies than regular words.
The comparison between models that we have focused on
in this study is an example in which a single assumption, the
particular learning algorithm used in modifying the connection strengths, led to opposite predictions. (Note that other
assumptions are needed to account for the ambiguity advantage effect.) This single difference has important consequences because the Hebbian algorithm leads to a linear
(i.e., additive) effect of learning, whereas the LMS algorithm leads to a nonlinear (i.e., nonadditive) effect of learning. As in other assumptions in which one alternative is
linear and the other is nonlinear, the difference in the effect
can be dramatic. In the following discussion, we begin by
comparing our approach with other recurrent network models. We consider which of the many different assumptions
are critical in distinguishing the models with respect to
modeling the ambiguity advantage effect. We then turn to a
version of our model that incorporates pronunciation and
discuss how the effect of semantics, the basis for our account of the ambiguity effect in the lexical decision task,
also leads to the prediction of an ambiguity effect in the
naming task. We also discuss how semantics plays a role in
other effects of the time course of naming. Finally, we
consider encoding variability, a widely discussed memory
result that is analogous to the ambiguity advantage effect.
Recurrent Networks
Recurrent network models, like all other models, can
make qualitatively different predictions if different assumptions are made. These assumptions include choices about
the specific representation of a word, the architecture of the
network, the updating of a unit's activation, the learning
algorithm, and the appropriate index of a behavioral response. In the first set of simulations presented in this study,
for example, the only difference between the two models
was the learning algorithm. Thus, the faster saturation of the
spelling units manifested for ambiguous words compared
with unambiguous words could be attributed directly to the
difference in the connection strengths that arose from the
different learning algorithms.
The basis for the processing advantage can be understood
by a closer examination of the two learning algorithms.
With the Hebbian learning algorithm, the magnitude of the
change in the connection strength is proportional to the
product of the activation of the "preconnection" unit and the
activation of the "postconnection" unit (Equation 2). Thus,
the change made to a connection from the preconnection
unit to the postconnection unit on a given learning trial is
independent of previous changes made to that connection as
well as previous changes made to all other connections to
the postconnection unit. This independence is manifested in
two ways. First, the changes to the connection strengths
made in response to a particular stimulus are identical for
each presentation of that stimulus. Second, and more important, the Hebbian learning algorithm does not compen-
sate for the fact that the ambiguous semantic features are not
contributing to the prediction of the orthographic features.
Thus, in the critical comparison, the connection strengths
between two orthographic units in the network trained on an
ambiguous word were identical to the corresponding connection strengths in the networks trained on the unambiguous word because we assumed identical spellings for all the
words.
By contrast, with the LMS algorithm, the magnitude of
the change in the connection strength is proportional to the
product of the activation of the preconnection unit and the
error, the difference between the activation of the postconnection unit and the net input to that unit (Equation 1). Thus,
unlike the Hebbian algorithm, the change made to a connection on a given learning trial is not independent of
previous changes made to the connections to the postconnection unit. This dependence is manifested in two ways.
First, the changes made to the connection strengths for a
particular stimulus are not identical for each presentation of
that stimulus. Second, and more important, the LMS learning algorithm does compensate for the fact that the ambiguous semantic features are not contributing to the prediction
of the orthographic features. Thus, in the critical comparison, the connections between two orthographic units in the
network trained on each of the ambiguous words make a
larger contribution to predicting the activation of the postconnection unit than corresponding connections in the networks trained on the unambiguous word. These differences
are clearly manifested by comparing the processing with
only the orthographic —> orthographic connections and with
all but the orthographic —> orthographic connections (see
Figure 3).
However, the learning assumption is only one of a number of key assumptions required to account for the ambiguity advantage. Another key assumption is that semantics
must be represented in some form. Clearly, if a particular
model simply does not implement semantics (e.g., Golden,
1986; Seidenberg & McClelland, 1989), then the model will
certainly not be able to account for any effect that is based
on semantics. In particular, the LMS algorithm would not
differentially affect the orthographic → orthographic connections of ambiguous words compared with unambiguous words if the effect of semantic → orthographic connections
were not taken into account during learning. Note that in our
account, the ambiguity advantage that is manifested during
processing can ultimately be attributed to the effect (actually, the lack of a contribution) of semantics in predicting
the features of orthographic units during learning.⁴
⁴ In a previous simulation using this model, one of us (Kawamoto, 1993) actually argued that the ambiguity advantage was due to the cooperation between the two senses of an ambiguous word that arose during learning and processing. That is, the changes to the orthographic → orthographic connections and the semantic → orthographic connections (for unambiguous semantic features) would be similar for the different senses of an ambiguous word. Thus, the connection strength changes for an ambiguous word would be roughly comparable to the changes for an unambiguous word. Then, when the spelling is presented during a test trial, activation in the orthographic units leads to activation of both meanings in the semantic units. These different meanings in turn activate the same spelling pattern in the orthographic units. (A similar proposal was discussed by Rueckl and Olds, 1993.) Although this semantically mediated pathway does influence the activity in the orthographic units, in our model the ambiguity advantage is due to the differential orthographic → orthographic connection strengths that arise when the error-correction algorithm is used. This point was demonstrated earlier with a network that lacked orthographic → orthographic connections.

One
final critical assumption is that the orthographic units and
not the semantic units are the index of lexical decision
times. As seen earlier in this study and elsewhere
(Kawamoto, 1993), there would be an ambiguity disadvantage if the semantic units were used as the index of lexical
decision time. In the following discussion, we show why
these assumptions are important in accounting for the
ambiguity advantage effect by examining other recurrent
models.
The first alternative recurrent model that we consider in
detail is the Hopfield (1982) network. This model is particularly important because Joordens and Besner (1994) have
specifically examined the ambiguity advantage effect using
a Hopfield network and reached the opposite conclusion
from us. We begin by noting the differences between the
two models (with our assumptions presented first): (a) connections between orthographic units—present versus absent; (b) learning algorithm—LMS versus Hebbian; (c)
index of lexical decision times—orthographic versus
semantic units; (d) activation of orthographic units—
unclamped versus clamped; (e) activation value—real versus binary valued; and (f) schedule of updating activation—parallel versus asynchronous. Of these differences, only the
first three are critical in accounting for the ambiguity advantage effect, as demonstrated earlier. The remaining differences do not affect the outcome. In particular, if Joordens
and Besner had assumed full connectivity between units and
had not clamped the orthographic units, an input spelling
that had been corrupted by noise would be corrected more
quickly for unambiguous words compared with ambiguous
words. In addition, the different processing assumptions
reflected in the last two assumptions lead to the same
results because the processing in both cases is governed
by the same Lyapunov function (Golden, 1986; Hopfield,
1984).
The next models that we consider are resonance models
(Grossberg & Stone, 1986; Kosko, 1988; Murre, Phaf, &
Wolters, 1992; Van Orden & Goldinger, 1994). The major
difference between resonance models and the fully recurrent
models considered in this study is the topology of the
network (i.e., the pattern of connectivity between units). In
resonance models, there are separate groups of units (e.g.,
orthographic and semantic units) with modifiable connections only between units in different groups: There are no
modifiable connections between orthographic units, the
very connections that we argue to be the basis for the
ambiguity advantage in our model (see Figure 3).
The resonance model that we consider first, Kosko's
(1988) bidirectional associative memory (BAM) model, is
closest in spirit to our fully recurrent model because the
BAM also uses a distributed representation comprised
of +1 and −1 binary values. To make the discussion explicit, we compare the unambiguous word with the ambiguous word with two equiprobable meanings as we did
earlier. We first consider the effects of learning the mapping
from spelling to meaning and then consider the mapping
from meaning to spelling. (This separate consideration of
the network as two feedforward networks is valid for a
linear activation function because there are no connections
between units within a group.) For the mapping from spelling to meaning, the spelling of the unambiguous word is
mapped to one meaning twice as often as the spelling of the
ambiguous word is mapped to each of its two meanings.
For the Hebbian algorithm, learning has the following consequences:

S_un → 2Tη M_un,   (4a)

S_am → Tη(M_am1 + M_am2),   (4b)

where T is the number of learning epochs and η is the learning constant, S_un and S_am are the patterns corresponding to the spelling of the unambiguous and ambiguous words, respectively, and M_un, M_am1, and M_am2 are the patterns corresponding to the meaning of the unambiguous word, the meaning of the first sense of the ambiguous word, and the meaning of the second sense of the ambiguous word, respectively. Note that the magnitude of the meaning vector of the unambiguous word is actually √2 times larger than the meaning vector of the ambiguous word because the two meanings of the ambiguous word are orthogonal. (Another way to see this is to observe that half of the meaning units of the ambiguous word have an activation equal to 0.)

For the LMS algorithm, it is possible to write a simple closed form expression for the activation of the meaning only for the unambiguous word:

S_un → [1 − (1 − η)^(2T)]M_un.   (5)

However, the asymptotic effect of the LMS algorithm can be expressed simply as

S_un → M_un,   (6a)

S_am → ½(M_am1 + M_am2).   (6b)
Thus, for the mapping from spelling to meaning, the asymptotic effect of the LMS algorithm is identical to the
effect of the Hebbian algorithm. By contrast, the mapping
from meaning to spelling is different for the Hebbian and
LMS algorithms because there are two orthogonal meanings for an ambiguous word. Thus, there are two different
inputs for ambiguous words compared with the single
repeated input for unambiguous words. For the Hebbian
algorithm,
M_un → 2Tη S_un,   (7a)

M_am1 → Tη S_am,   (7b)

M_am2 → Tη S_am.   (7c)
For the LMS algorithm,

M_un → [1 − (1 − η)^(2T)]S_un,   (8a)

M_am1 → [1 − (1 − η)^T]S_am,   (8b)

M_am2 → [1 − (1 − η)^T]S_am.   (8c)
Asymptotically, the LMS algorithm leads to

M_un → S_un,   (9a)

M_am1 → S_am,   (9b)

M_am2 → S_am.   (9c)
Consequently, for the Hebbian algorithm, the activation fed
back to the spelling units from the meaning units after
instantiation of the spelling (scaled by a constant) is twice as
large for unambiguous words compared with ambiguous
words. By contrast, for the LMS algorithm, the activation
fed back to the spelling units would be the same for unambiguous and ambiguous words. Thus, for the BAM, neither
the Hebbian nor the LMS algorithm can account for the
ambiguity advantage effect.
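Equations 4 through 9 are easy to confirm numerically for a linear mapping. In the sketch below, which is ours (orthogonal toy meanings, arbitrary ±1 spellings, and our own η and T), the Hebbian feedback to the spelling units comes out twice as large for the unambiguous word, whereas the LMS feedback is equalized:

```python
import numpy as np

rng = np.random.default_rng(1)
S_un = rng.choice([-1.0, 1.0], size=8)       # spelling, unambiguous word
S_am = rng.choice([-1.0, 1.0], size=8)       # spelling, ambiguous word
M_un  = np.array([+1., +1., +1., +1., +1., +1., +1., +1.])
M_am1 = np.array([+1., -1., +1., -1., +1., -1., +1., -1.])
M_am2 = np.array([+1., +1., -1., -1., +1., +1., -1., -1.])  # all mutually orthogonal

eta, T = 0.05, 200
W_hebb = np.zeros((8, 8))                    # meaning -> spelling weights
W_lms = np.zeros((8, 8))
for _ in range(T):
    # The unambiguous pairing occurs twice per epoch, each sense once.
    for m, s, reps in ((M_un, S_un, 2), (M_am1, S_am, 1), (M_am2, S_am, 1)):
        for _ in range(reps):
            W_hebb += eta * np.outer(s, m)                  # Equation 7
            W_lms += eta * np.outer(s - W_lms @ m, m)       # Equations 8-9

print("Hebb:", np.linalg.norm(W_hebb @ M_un), np.linalg.norm(W_hebb @ M_am1))
print("LMS :", np.linalg.norm(W_lms @ M_un), np.linalg.norm(W_lms @ M_am1))
# Hebbian: the first norm is twice the second; LMS: the two norms match.
```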
We now turn to another class of resonance models in
which the connections between orthographic and semantic
units are all excitatory and the connections between units
within a group are all inhibitory (e.g., Van Orden &
Goldinger, 1994). Note that with the assumption of mutually inhibitory connections between units, activation is positive and the representation is in some sense "local." That is,
within a group of units that have mutually inhibitory connections, only one unit, the "winner," remains active; the
activation of the remaining units decreases to 0. We thus
refer to this class of resonance models as "winner-take-all"
models. The first winner-take-all model that we consider is
the meta-model proposed by Van Orden and his colleagues
(Van Orden & Goldinger, 1994; Van Orden, Pennington, &
Stone, 1990). In their model, there are three groups of units
representing the spelling, the pronunciation, and the meaning of a word. The connections between units in each pair of
groups are modified according to a covariational learning
principle. Although it is not possible to make specific predictions because details of the activation function and learning function remain open, we point out that the same outcome noted earlier for the BAM for both the Hebbian and
LMS algorithms would be expected under some conditions
even without the effect of the mutually inhibitory connections. In fact, the unambiguous words would be expected to
gain an advantage when the effect of the mutually inhibitory
connections are added because these connections would
tend to suppress the simultaneous activations of the two
semantic representations activated for ambiguous words.
The second winner-take-all model that we consider is the
categorization and learning module (CALM) model proposed by Murre et al. (1992). This model is actually much
more complex than the preceding winner-take-all model
because there are additional units that coordinate control in
a module. In particular, there are three classes of units in a
module: Representation units that correspond to the units in
the preceding model, veto units that are paired with a single
representation unit, and an arousal unit. (It should be noted
that because each representation unit forms excitatory connections to its complementary veto unit, more than one unit
in a module remains active, but, as before, only one representation unit is the winner.) Each class of units forms only
fixed excitatory or inhibitory connections to other units
within the module and to an "external" node outside the
module. In addition, there are modifiable excitatory connections between representation units in different modules.
Given the complexity of the architecture and the nonlinearities of the activation function and learning rule, we do not
hazard making predictions without an implementation of the
model. However, we do point out the possibility that qualitatively different predictions could arise depending on the
particular values of the fixed parameters chosen.
The final set of recurrent models that we consider is based
on the interactive-activation (IA) model (McClelland &
Rumelhart, 1981). The IA model shares a number of characteristics with the winner-take-all models just described,
such as mutually inhibitory connections between units
within a group; unlike the other resonance models discussed, the connections between groups of units are set as a
parameter rather than learned. The IA model has been the
basis for a previous account of the ambiguity advantage
effect (Kellas et al., 1988). In their account, Kellas et al.
proposed that ambiguous words are represented by multiple
entries at the word level. However, unlike other word units,
the senses of an ambiguous word do not compete with each
other and thus end up cooperating with each other. This
cooperation arises because the neighbors of an ambiguous
word are inhibited by two word units and because the
component letters receive positive feedback from two word
units. This cooperation would clearly lead to one of the units
corresponding to an ambiguous word reaching threshold
more quickly than the single unit of an unambiguous word
if the thresholds were based on the printed frequency. Furthermore, if the cooperation were sufficiently great, the
advantage provided by the multiple entries of an ambiguous
word could also overcome the disadvantage that arises even
when the threshold is based on frequency of occurrence of
a particular sense. However, it should be pointed out that an
additional mechanism would still be required to resolve the
ambiguity.
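To make the arithmetic of this account concrete, consider the following minimal sketch. It is our illustration rather than an implementation of the IA model: neighbor inhibition is omitted, the update rule is a simple leaky accumulator rather than the McClelland and Rumelhart (1981) activation function, and all parameter values are arbitrary. The only mechanism retained is the one at issue, positive feedback from word-level units to the shared letter units.

```python
def time_to_threshold(n_word_units, excite=0.15, feedback=0.10,
                      decay=0.05, threshold=0.8, max_steps=200):
    """Steps until a word unit reaches threshold; n_word_units is 1 for an
    unambiguous word, 2 for an ambiguous word whose sense units cooperate."""
    letters, word = 0.5, 0.0           # letter units start at a resting level
    for t in range(max_steps):
        word += excite * letters - decay * word
        # every active sense unit feeds activation back to the same letters
        letters = min(1.0, letters + feedback * n_word_units * word
                      - decay * (letters - 0.5))
        if word >= threshold:
            return t
    return max_steps

print("unambiguous (1 word unit):", time_to_threshold(1))
print("ambiguous (2 word units): ", time_to_threshold(2))
```

Because the two sense units do not compete, the feedback they jointly return raises the letter activation faster, which in turn pushes each sense unit across threshold sooner, exactly the cooperation described above.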
Local Versus Distributed Representations
One key difference between different word recognition
models is whether a lexical entry is represented in a distributed fashion or in a localist fashion. With a distributed
representation, a lexical entry is represented as values over
a finite number of units, with no single unit corresponding
uniquely to an unambiguous word or a sense of an ambiguous word. By contrast, with a localist representation, a
single unit does correspond uniquely to an unambiguous
word or a sense of an ambiguous word.
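The contrast can be stated compactly with hypothetical feature vectors (the features and values below are invented for illustration). In a localist scheme, each sense owns a unit and any two senses are equally unrelated; in a distributed scheme, a sense is a pattern over shared features, and similarity is graded vector overlap.

```python
import numpy as np

# Localist: one unit per sense; no graded similarity between senses.
fish_local, tone_local, voice_local = np.eye(3)

# Distributed: senses are patterns over eight shared semantic features
# (these features and their values are invented for illustration).
fish_dist  = np.array([1, 1, 0, 0, 0, 0, 1, 0])  # animate, aquatic, ...
tone_dist  = np.array([0, 0, 1, 1, 1, 0, 0, 1])  # auditory, low pitch, ...
voice_dist = np.array([0, 0, 1, 1, 0, 1, 0, 1])  # auditory, low pitch, human

def overlap(a, b):
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

print(overlap(tone_local, voice_local))  # 0.00: all local senses equally distinct
print(overlap(tone_dist, voice_dist))    # 0.75: polysemous senses share features
print(overlap(fish_dist, tone_dist))     # 0.00: homonymous senses share none
```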
The distinction between local and distributed representations has important implications in determining the relationship between ambiguous senses and becomes particularly
salient in distinguishing homonymous senses from polysemous senses. Consider, for example, the problem of deciding to add an entry to the lexicon. That is, when is a
particular usage of an ambiguous word sufficiently distinct
to be considered a new sense, and how can the similarity
between senses be represented? Actually, these problems
must also be confronted by the lexicographer. Consider, for
example, the different senses of bass listed in Merriam-Webster's Collegiate Dictionary (10th ed., 1993):

1. bass \ˈbas\ n, pl bass or basses [ME base, alter. of OE bærs; akin to OE byrst bristle]: any of numerous edible spiny-finned fishes (esp. families Centrarchidae and Serranidae)

2. bass \ˈbās\ adj [ME bas base] 1: deep or grave in tone 2: of low pitch

3. bass \ˈbās\ n 1: a deep or grave tone: low-pitched sound 2a(1): the lowest part in polyphonic or harmonic music 2a(2): the lower half of the whole vocal or instrumental tonal range 2b(1): the lowest male singing voice 2b(2): a person having such a voice 2c: the lowest member in range of a family of instruments

4. bass \ˈbas\ n [alter. of bast] 1: a coarse tough fiber from palms 2: BASSWOOD. (p. 96)
In the entries for bass, the definitions for the second entry are listed as 1 and 2, whereas those for the third entry are listed as 1, 2a(1), 2a(2), 2b(1), 2b(2), or 2c. What determines
whether a use warrants a separate subentry or sub-subentry?
Moreover, given that the adjectival entry is listed separately,
how can its unsystematic relationship with the first noun
entry and its systematic relationship with the second noun
entry be represented?
With a distributed representation, no explicit decision
about the distinctiveness of a new sense has to be made.
Each new sense or word simply corresponds to a new
pattern of activation over the finite set of features. As the
difference between two different usages increases or decreases, the difference in the representation of the senses
increases or decreases, respectively. For senses that are
highly similar, as in two of the meanings listed in the third
entry of bass (2a(1): the lowest part in polyphonic or harmonic music, and 2a(2): the lower half of the whole vocal or
instrumental tonal range), the semantic representation
would be nearly identical. The representation for these
meanings would also be similar to the other senses listed
under that entry. For senses that are derivationally related,
as is the case for the second and third entries of bass, the
semantic representations would be similar, although the
representation for the part of speech would be different. By
contrast, the semantic representation of the second and third
entries would be different from that of the first entry. In
other words, each sense differs quantitatively from other
senses of the ambiguous word (as well as all other lexical
entries). In terms of the energy landscape metaphor discussed elsewhere (Kawamoto, 1993), the distance between
the energy minima corresponding to these senses decreases
as similarity between the senses increases. If there are many
related senses, there will be a broad basin of attraction
carved out by these senses during learning. On the other
hand, if the senses are unrelated, there will be widely dispersed basins of attraction.
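One concrete reading of this metaphor is given by the following sketch, which is our own and makes assumptions the text does not: a small Hopfield-type network, Hebbian outer-product storage, and ±1 feature vectors. Each stored sense becomes an energy minimum; settling from each sense and measuring the distance between the settled states shows that minima for related senses lie close together, whereas minima for unrelated senses are widely separated.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 40                                    # semantic feature units (+1/-1)

def store(patterns):
    # Hebbian outer-product weights with no self-connections
    W = sum(np.outer(p, p) for p in patterns).astype(float)
    np.fill_diagonal(W, 0.0)
    return W

def settle(W, s, sweeps=20):
    s = s.copy()
    for _ in range(sweeps):               # asynchronous threshold updates
        for i in rng.permutation(len(s)):
            s[i] = 1.0 if W[i] @ s >= 0 else -1.0
    return s

def hamming(a, b):
    return int(np.sum(a != b))

base = rng.choice([-1.0, 1.0], N)
related = base.copy()
related[:4] *= -1                         # related sense: differs in 4 features
unrelated = rng.choice([-1.0, 1.0], N)    # unrelated sense: random pattern

for label, sense2 in [("related", related), ("unrelated", unrelated)]:
    W = store([base, sense2])
    m1, m2 = settle(W, base), settle(W, sense2)
    print(label, "distance between minima:", hamming(m1, m2))
```

With more related senses stored, the nearby minima would begin to merge, which is the broad basin alluded to above.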
By contrast, with a local representation such as that used
in the IA model, a decision must be made to determine
whether a novel use of an ambiguous word is sufficiently
distinct to warrant its own unit at the word level. Thus, each
new sense differs qualitatively from other existing senses.
Moreover, even in models that attempt to represent similarity between senses, there is still a qualitative distinction
between senses that are considered to be similar (i.e., derivationally related) and those that are not (Jastrzembski,
1981). The freedom allowed by these choices is clearly
illustrated by the decision to represent derivationally related
words either as one entry (e.g., Rubenstein et al., 1971) or
as separate entries (e.g., Cottrell & Small, 1983; Jastrzembski, 1981).
The number of senses that an ambiguous word has is
important in local representation schemes because processing depends critically on that number. For example, in the IA model, adding another sense has two opposite effects. On the one hand, each of the units representing an ambiguous word decreases in its baseline activation (because the frequency of each sense is lower) and thus would become activated more slowly. This effect, however, would not be expected to be large, because the resultant change in baseline activation would have only a small impact on the time course of activation. On the other
hand, by adding another unit to the cohort of cooperating
units, neighbors would be inhibited more severely and positive feedback would also increase significantly. To take an
extreme position, assume that a word unit is generated for
each use of an unambiguous as well as an ambiguous word.
In this case, no ambiguity advantage would be predicted.
This example is admittedly unrealistic, but it does illustrate
the importance of the number of word units. However, more
realistic examples arise in which the distinction between
polysemy and homonymy is critical. Consider a situation in
which there are two distinct polarized senses of an ambiguous word. The dominant sense has a large number of
polysemous senses and the subordinate sense has only a
single sense. If the distinct senses are the entries in the race,
then the dominant sense will usually be the winner. On the
other hand, if the polysemous senses are the entries, then the
single sense corresponding to the subordinate sense could
be the winner most of the time. It should be noted that for
either choice (frequency on the basis of distinct senses or
polysemous senses), decisions will also have to be made
about the distinctiveness of a sense: Are two senses similar
enough to be considered the same sense and are the different
uses of a single sense distinct enough to be considered
different senses? These decisions will have an impact on the
winner because they determine the frequency of a sense and
the number of senses. Finally, frequency of a sense also
depends on whether derivations with the same form (e.g.,
the noun and verb forms of plow) are stored separately (e.g.,
as different logogens or different locations in a list) or
together (e.g., as the same logogen or at the same location
in a list).
Naming
In this section, we consider another task that figures
prominently in experiments of lexical memory: speeded
naming. In distributed models, the dependent variables considered in speeded naming—naming latency and errors—
have been indexed by units representing a word's pronunciation (Kawamoto, 1993; Seidenberg & McClelland, 1989;
Van Orden et al., 1990). Because our model does not
represent pronunciation, we base our discussion on an otherwise identical model that does (Kawamoto,
1993).
One naming study that is relevant to discussions of the
ambiguity advantage effect has recently been reported
(Fera, Joordens, Balota, & Ferraro, 1992). Consistent with
results using the lexical decision task, they found an advantage of ambiguous words over unambiguous words. This
result can be accounted for by our distributed model assuming that the LMS algorithm is used. In this case, the connections from orthographic units to pronunciation units as
well as connections from pronunciation units to other pronunciation units compensate for the smaller contribution of
the semantic units for ambiguous words. Thus, when the
input is just the spelling, activation in the pronunciation
units would increase in absolute value faster for ambiguous
words compared with unambiguous words, just as it
would in the spelling units. Of course, if the Hebbian
algorithm had been used, we would expect the opposite
effect. Indeed, an ambiguity disadvantage is seen when the
Hebbian algorithm is used (M. E. J. Masson, personal
communication, November 6, 1993).
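The opposing predictions can be reproduced in a deliberately stripped-down linear version of the model. The sketch below is ours, not the network reported here or in Masson's simulations: it assumes random ±1 feature vectors, one unambiguous and one ambiguous word of equal total frequency, Hebbian versus delta-rule (LMS) training of a fully connected linear autoassociator, spelling-only input at test, and a readout that correlates the pronunciation units with the correct pronunciation after a few update cycles.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 30                                    # units per block (orth, sem, phon)
N = 3 * D

def vec():
    return rng.choice([-1.0, 1.0], D)

def pattern(o, s, p):
    return np.concatenate([o, s, p])

# One unambiguous word (one meaning) and one ambiguous word (two
# meanings, same spelling and pronunciation), equal total frequency.
o_u, s_u, p_u = vec(), vec(), vec()
o_a, s_a1, s_a2, p_a = vec(), vec(), vec(), vec()
train = [(pattern(o_u, s_u, p_u), 1.0),
         (pattern(o_a, s_a1, p_a), 0.5),  # each sense half as frequent
         (pattern(o_a, s_a2, p_a), 0.5)]

def hebbian(train):
    W = np.zeros((N, N))
    for x, f in train:
        W += f * np.outer(x, x) / N       # frequency-weighted covariance
    return W

def lms(train, eta=0.01, epochs=300):
    W = np.zeros((N, N))
    for _ in range(epochs):
        for x, f in train:
            W += f * eta * np.outer(x - W @ x, x)   # error correction
    return W

def phon_readout(W, o, p, cycles=3):
    x = np.concatenate([o, np.zeros(D), np.zeros(D)])   # spelling-only input
    for _ in range(cycles):
        x = W @ x
    return (x[2 * D:] @ p) / D            # support for correct pronunciation

for name, W in [("Hebbian", hebbian(train)), ("LMS", lms(train))]:
    print(f"{name:7s} unambiguous={phon_readout(W, o_u, p_u):.3f} "
          f"ambiguous={phon_readout(W, o_a, p_a):.3f}")
```

With Hebbian weights, the spelling activates only the blended, and therefore weaker, semantic pattern of the ambiguous word, so recurrent amplification is slower and an ambiguity disadvantage emerges over cycles; with LMS weights, error correction has strengthened the direct orthography-to-pronunciation and pronunciation-to-pronunciation contributions to compensate, producing the advantage.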
In the results just considered, the effect of semantics on
naming arose as a consequence of learning. In other situations, we have observed an effect of semantics on naming
that arises during the processing (Kawamoto & Zemblidge,
1992). For example, in speeded naming of monosyllabic
homographs (e.g., wound), naming latencies of the more
frequent irregular pronunciation are longer than those of the
less frequent regular pronunciation. Moreover, the same
pattern arises for the irregular word pint, although in this
case the regular pronunciation is incorrect. To account for
this pattern of results, there must be at least two constraints
on pronunciation, each with a different time course. In
particular, the regularity constraint must arise before the
word-specific constraint (which ultimately determines the
correct pronunciation of irregular words). In our model, the
regularity constraint corresponds to the mapping from spelling to pronunciation and the word-specific constraint corresponds to the semantically mediated mapping. Thus, our
model, like the model proposed by Van Orden and his
colleagues (Van Orden, 1987; Van Orden et al., 1990),
argues that the regularity constraint is faster than the word-specific constraint, not slower as proponents of the dual-route model have proposed.
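The claim reduces to a crossover between two growth curves: a fast regularity constraint with a modest asymptote and a slower word-specific constraint with a higher one. The rates and strengths in this sketch are arbitrary values of our own, chosen only to make the crossover visible.

```python
import math

def constraint(t, rate, strength):
    # exponential approach to asymptote: strength * (1 - e^(-rate * t))
    return strength * (1.0 - math.exp(-rate * t))

# For wound: the regular pronunciation is supported by the fast
# spelling-to-pronunciation mapping, the more frequent irregular
# pronunciation by the slower semantically mediated mapping.
for t in range(1, 26, 4):
    regular = constraint(t, rate=0.5, strength=1.0)    # fast, weaker
    specific = constraint(t, rate=0.1, strength=1.6)   # slow, stronger
    leader = "regular" if regular > specific else "word-specific"
    print(f"t={t:2d}  regular={regular:.2f}  word-specific={specific:.2f}  {leader}")
```

Responses initiated early in processing therefore favor the regular pronunciation even when the irregular pronunciation is more frequent, which is the ordering of constraints our account shares with Van Orden's.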
Encoding Variability
The model of lexical memory used in these simulations is
a member of a family of content-addressable memories
(J. A. Anderson, Silverstein, Ritz, & Jones, 1977; Kohonen,
1977; McClelland & Rumelhart, 1985). As such, this lexical
memory model should be able to account for more general
memory phenomena. Conversely, one would expect that
general memory models could be used to account for more
specific lexical memory phenomena.
One set of phenomena that content-addressable memories
should be particularly well suited for is encoding effects, the
effect of context on memory. Encoding effects have been
manifested in two ways: as encoding specificity and as
encoding variability (cf. J. R. Anderson, 1990). With encoding specificity, performance is improved when the
context in which something was learned is reinstantiated
(Martin, 1968). For example, Thomson (1972) showed
that when subjects must remember only the second word
in a word pair (e.g., sky blue), their subsequent recognition of the target word is much better when it is presented with the original word pair than when it is presented alone.
With encoding variability, an item presented for study in
different contexts is better remembered than an item presented in just a single context (Madigan, 1969; Melton,
1967, 1970; but also see Rueckl & Olds, 1993). For example, Madigan showed that when a target item is repeated during study, subsequent recall of the target word without any context cues is much better when the target word had been presented with different context words rather than with the same context word on each presentation.
One standard account of encoding variability is based on
encoding specificity: The random context provided during
the test phase is more likely to resemble one of two different
contexts compared with a single context provided during the
study phase (J. R. Anderson, 1990). This standard account
corresponds closely to Bower's (1972) formal model of
encoding variability. According to Bower, a stimulus activates a fixed number of stimulus elements or components
(i.e., "features" in our terminology). The particular elements
that are activated depend on the context. An activated element is associated in an all-or-none fashion to a given
response. Under these assumptions, presentation of the
stimulus with an identical context on a second learning trial
will activate similar stimulus elements, and thus only a
few additional features will subsequently be associated
with the response. By contrast, presentation of the stimulus with a different context on a second learning trial will
activate different stimulus elements, and thus many additional features will subsequently be associated with the
response. Thus, when a random context is present, the
features that are activated (which are fixed in number) are
more likely to overlap with the features activated under
two different contexts than with those activated under the same context during
learning. As in our account based on the LMS algorithm,
it is a nonlinearity in the learning algorithm that gives
rise to a processing advantage when learning occurs under
two different contexts.
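Bower's model translates almost line for line into a small Monte Carlo sketch. The pool size, sample size, and number of test trials below are arbitrary choices of ours; the context-driven sampling and all-or-none association are the model's assumptions.

```python
import random

random.seed(0)
POOL, SAMPLE, TESTS = 100, 20, 10_000   # element pool; elements active per context

def context_sample():
    # the context determines which fixed-size subset of elements is active
    return set(random.sample(range(POOL), SAMPLE))

def recall_probability(study_contexts):
    # all-or-none: every element active at study is associated to the response
    associated = set().union(*study_contexts)
    hits = 0.0
    for _ in range(TESTS):
        test_elements = context_sample()        # random context at test
        # recall succeeds in proportion to associated elements sampled
        hits += len(test_elements & associated) / SAMPLE
    return hits / TESTS

same_context = context_sample()
print("same context twice :", recall_probability([same_context, same_context]))
print("two different ones :", recall_probability([context_sample(), context_sample()]))
```

The nonlinearity is the set union: a second study trial in the same context re-associates mostly the same elements, whereas a different context recruits largely new ones, so a random test context overlaps more with the elements associated under varied study contexts.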
The simulations presented in this study suggest an alternative account of encoding variability. In particular, if we
assume that a context corresponds to the meaning of a word
in our simulations, the encoding variability effect is analogous to the ambiguity advantage effect. Thus, encoding
variability could arise directly out of learning as seen in the
simulations presented in this study without depending on a
stochastic component during processing as has previously
been proposed. More interesting, perhaps, is the fact that the
two approaches make different predictions about the effect
over the course of learning. In particular, our model predicts
a disadvantage early in the course of learning that subsequently becomes an advantage late in learning, whereas the
stimulus sampling approach predicts an advantage throughout the course of learning.
Future Directions
The majority of empirical work on word recognition has
been based on monosyllabic monomorphemic English
words. Not surprisingly, the majority of computer simulation models have also focused on this class of words.
Clearly, however, these words represent just a fraction of
the words in English. Following current empirical trends in
which there is greater interest in multisyllabic, multimorphemic words, there needs to be greater consideration of
these words in computational models of word recognition.
Such an extension would clearly require more hierarchical
representations of pronunciations including syllables and
possibly onsets and rimes. In addition, stress assignment as
well as other grammatical-category-conditioned regularities
in pronunciation would have to be incorporated. We hope
that formulation of models as computer programs will allow
detailed comparison of different models by providing explicit predictions.
References
Anderson, J. A., Silverstein, J. W., Ritz, S. A., & Jones, R. S.
(1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84, 413-451.
Anderson, J. R. (1990). Cognitive psychology and its implications
(3rd ed.). New York: Freeman.
Bower, G. H. (1972). Stimulus-sampling theory of encoding variability. In A. W. Melton & E. Martin (Eds.), Coding processes
in human memory (pp. 85-123). New York: Holt, Rinehart &
Winston.
Carpenter, P. A., & Daneman, M. (1981). Lexical retrieval and
error recovery in reading: A model based on eye fixations.
Journal of Verbal Learning and Verbal Behavior, 20, 137-160.
Chumbley, J. I., & Balota, D. A. (1984). A word's meaning affects
the decision in lexical decision. Memory & Cognition, 12, 590-606.
Clark, H. H. (1973). The language-as-a-fixed-effect fallacy: A
critique of language statistics in psychological research. Journal
of Verbal Learning and Verbal Behavior, 12, 335-359.
Coltheart, M. (1978). Lexical access in simple reading tasks. In G.
Underwood (Ed.), Strategies in information processing (pp.
151-216). San Diego, CA: Academic Press.
Coltheart, M., Curtis, B., Atkins, P., & Haller, M. (1993). Models
of reading aloud: Dual-route and parallel-distributed-processing
approaches. Psychological Review, 100, 589-608.
Cottrell, G. W., & Small, S. L. (1983). A connectionist scheme for
modeling word sense disambiguation. Cognition and Brain
Theory, 6, 89-120.
Duffy, S. A., Morris, R. K., & Rayner, K. (1988). Lexical ambiguity and fixation times in reading. Journal of Memory and
Language, 27, 429-446.
Fera, P., Joordens, S., Balota, D. A., & Ferraro, F. R. (1992,
November). Ambiguity in meaning and phonology: Effects on
naming. Poster presented at the annual meeting of the Psychonomic Society, St. Louis, MO.
Forster, K. I., & Bednall, E. S. (1976). Terminating and exhaustive
search in lexical access. Memory & Cognition, 4, 53-61.
Francis, W. N., & Kucera, H. (1982). Frequency analysis of
English usage: Lexicon and grammar. Boston: Houghton
Mifflin.
Gernsbacher, M. A. (1984). Resolving 20 years of inconsistent
interactions between lexical familiarity and orthography, concreteness, and polysemy. Journal of Experimental Psychology:
General, 113, 256-281.
Golden, R. M. (1986). The "brain-state-in-a-box" neural model is
a gradient descent algorithm. Journal of Mathematical Psychology, 30, 73-80.
Grossberg, S., & Stone, G. (1986). Neural dynamics of word
recognition and recall: Attentional priming, learning, and resonance. Psychological Review, 93, 46-74.
Hopfield, J. J. (1982). Neural networks and physical systems with
emergent collective computational abilities. Proceedings of the
National Academy of Sciences USA, 79, 2554-2558.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons.
Proceedings of the National Academy of Sciences USA, 81,
3088-3092.
Jastrzembski, J. E. (1981). Multiple meanings, number of related
meanings, frequency of occurrence, and the lexicon. Cognitive
Psychology, 13, 278-305.
Jastrzembski, J. E., & Stanners, R. F. (1975). Multiple word meanings and lexical search speed. Journal of Verbal Learning and
Verbal Behavior, 14, 534-537.
Joordens, S., & Besner, D. (1994). When banking on meaning is
not (yet) money in the bank: Explorations in connectionist
modeling. Journal of Experimental Psychology: Learning,
Memory, and Cognition, 20, 1051-1062.
Kawamoto, A. H. (1988). Distributed representations of ambiguous words and their resolution in a connectionist network. In
S. I. Small, G. W. Cottrell, & M. K. Tanenhaus (Eds.), Lexical
ambiguity resolution (pp. 195-228). San Mateo, CA: Morgan
Kaufmann.
Kawamoto, A. H. (1993). Nonlinear dynamics in the resolution of
lexical ambiguity: A parallel distributed processing account.
Journal of Memory and Language, 32, 474-516.
Kawamoto, A. H. (1994). One system or two to handle regulars and
exceptions: How time-course of processing can inform this
debate. In S. D. Lima, R. Corrigan, & G. Iverson (Eds.), The
reality of linguistic rules (pp. 389-415). Philadelphia: John
Benjamins.
Kawamoto, A. H., & Zemblidge, J. (1992). Pronunciation of homographs. Journal of Memory and Language, 31, 349-374.
Kellas, G., Ferraro, F. R., & Simpson, G. B. (1988). Lexical ambiguity and the time course of attentional allocation in word
recognition. Journal of Experimental Psychology: Human Perception and Performance, 14, 601-609.
Kohonen, T. (1977). Associative memory. Berlin: Springer.
Kosko, B. (1988). Bi-directional associative memories. IEEE
Transactions on Systems, Man and Cybernetics, 18, 49-60.
Madigan, S. A. (1969). Intraserial repetition and coding processes
in free recall. Journal of Verbal Learning and Verbal Behavior, 8, 828-835.
Martin, E. (1968). Stimulus meaningfulness and paired-associate
transfer: An encoding variability hypothesis. Psychological Review, 75, 421-441.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: 1. An
account of basic findings. Psychological Review, 88, 375-407.
McClelland, J. L., & Rumelhart, D. E. (1985). Distributed memory
and the representation of general and specific knowledge. Journal of Experimental Psychology: General, 114, 159-188.
McClelland, J. L., & Rumelhart, D. E. (1988). Explorations in
parallel distributed processing: A handbook of models, programs, and exercises. Cambridge, MA: MIT Press.
Melton, A. W. (1967). Repetition and retrieval from memory.
Science, 158, 532.
Melton, A. W. (1970). The situation with respect to the spacing of
repetitions and memory. Journal of Verbal Learning and Verbal
Behavior, 9, 596-606.
Merriam-Webster, Incorporated. (1993). Merriam-Webster's Collegiate Dictionary (10th ed.). Springfield, MA: Author.
Millis, M. L., & Button, S. B. (1989). The effect of polysemy on
lexical decision time: Now you see it, now you don't. Memory
& Cognition, 17, 141-147.
Monsell, S., Patterson, K. E., Graham, A., Hughes, C. H., &
Milroy, R. (1992). Lexical and sublexical translation of spelling
to sound: Strategic anticipation of lexical status. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 18, 452-467.
Morton, J. (1979). Word recognition. In J. Morton & J. C.
Marshall (Eds.), Psycholinguistics: 2. Structures and processes.
Cambridge, MA: MIT Press.
Murre, J. M. M., Phaf, R. H., & Wolters, G. (1992). CALM: Categorizing and learning module. Neural Networks, 5, 55-82.
Norris, D. (1994). A quantitative multiple-levels model of reading
aloud. Journal of Experimental Psychology: Human Perception
and Performance, 20, 1212-1232.
Rayner, K., & Duffy, S. A. (1986). Lexical complexity and fixation times in reading: Effects of word frequency, verb complexity, and lexical ambiguity. Memory & Cognition, 14, 191-201.
Rubenstein, H., Garfield, L., & Millikan, J. A. (1970). Homographic entries in the internal lexicon. Journal of Verbal Learning and Verbal Behavior, 9, 487-494.
Rubenstein, H., Lewis, S. S., & Rubenstein, M. A. (1971). Homographic entries in the internal lexicon: Effects of systematicity
and relative frequency of meanings. Journal of Verbal Learning
and Verbal Behavior, 10, 57-62.
Rueckl, J. G., & Olds, E. M. (1993). When pseudowords acquire
meaning: Effect of semantic associations on pseudoword repetition priming. Journal of Experimental Psychology: Learning,
Memory, and Cognition, 19, 515-527.
Seidenberg, M. S., & McClelland, J. L. (1989). A distributed,
developmental model of word recognition and naming. Psychological Review, 96, 523-568.
Thomson, D. M. (1972). Context effects on recognition memory.
Journal of Verbal Learning and Verbal Behavior, 11, 497-511.
Van Orden, G. C. (1987). A ROWS is a ROSE: Spelling, sound,
and reading. Memory & Cognition, 15, 181-198.
Van Orden, G. C., & Goldinger, S. D. (1994). Interdependence of
form and function in cognitive systems explains perception of
printed words. Journal of Experimental Psychology: Human
Perception and Performance, 20, 1269-1291.
Van Orden, G. C., Pennington, B. F., & Stone, G. O. (1990). Word
identification in reading and the promise of subsymbolic psycholinguistics. Psychological Review, 97, 488-522.
Widrow, B., & Stearns, S. D. (1985). Adaptive signal processing.
Englewood Cliffs, NJ: Prentice Hall.
Appendix
Test Words Used in the Simulation and Their
Francis and Kucera (1982) Frequency
Unambiguous          Two equiprobable       Four equiprobable
Word    Frequency    Word    Frequency      Word    Frequency
CURE    28           HUGE    30             LANE    30
GROW    53           LAKE    37             FOOL    37
SLID    24           POND    15             LOCK    23
DISK    25           PAST    60             DULL    27
LIFT    23           MILE    45             HORN    31
FAIL    37           PAIR    50             DIVE    23
CUTS    30           SHIP    81             REAR    51
GAVE    60           LOAN    23             TELL    60
NECK    81           CHIN    27             FILE    81
PINK    48           SUIT    48             BOND    46
CAST    45           SAND    28             TAIL    36
TORN    25           PLUG    25             LOAD    45
Received January 31, 1994
Revision received April 1, 1994
Accepted April 15, 1994