Journal of Experimental Psychology:
Human Perception and Performance
1994, Vol. 20, No. 6, 1233-1247
Copyright 1994 by the American Psychological Association, Inc.
0096-1523/94/$3.00
When Two Meanings Are Better Than One: Modeling the Ambiguity
Advantage Using a Recurrent Distributed Network
Alan H. Kawamoto, William T. Farrar IV, and Christopher T. Kello
Ambiguous words are processed more quickly than unambiguous words in a lexical decision
task despite the fact that each sense of an ambiguous word is less frequent than the single
sense of unambiguous words of equal frequency or familiarity. In this computer simulation
study, we examined the effects of different assumptions of a fully recurrent connectionist
model in accounting for this processing advantage for ambiguous words. We argue that the
ambiguity advantage effect can be accounted for by distributed models if (a) the least mean
square (LMS) error-correction algorithm rather than the Hebbian algorithm is used in training
the network and (b) activation of the units representing the spelling rather than the meaning
is used to index word recognition times.
An important advantage of computational models is that
the underlying assumptions of the model must be explicitly
formulated. This explicit formulation allows comparison of
assumptions that are highly similar. In some cases, virtually
identical assumptions can give rise to qualitative differences
rather than merely quantitative differences. In this article,
we consider just such a situation. Two different connectionist learning algorithms lead to opposite predictions about the
time required to recognize ambiguous and unambiguous
words when activation of the spelling units is used as the
index of word recognition. In particular, ambiguous words
are incorrectly predicted to have a processing disadvantage
compared with unambiguous words when the Hebbian
learning algorithm is used, but correctly predicted to have a
processing advantage when the least mean square (LMS)
error-correction algorithm is used.
Ambiguity Advantage Effect
Ambiguous words pose a special challenge in modeling
word recognition. Not only must a way to represent the
multiple meanings of a word be found, but the manner in
which these meanings affect processing must also be considered. Nowhere is this problem more clearly manifested
than in accounting for relative processing times of ambiguous words and unambiguous words. For one particular
task—lexical decision—ambiguous words are processed
more quickly than unambiguous words, despite the fact that
the frequency of the different senses of an ambiguous word
is less than the frequency of the single sense of an unambiguous word (Kellas, Ferraro, & Simpson, 1988; Millis & Button, 1989).

Author note: Alan H. Kawamoto, William T. Farrar IV, and Christopher T. Kello, Program in Experimental Psychology, University of California, Santa Cruz. We would like to thank Art Jacobs and Max Coltheart for their comments. Correspondence concerning this article should be addressed to Alan H. Kawamoto, Program in Experimental Psychology, Clark Kerr Hall, University of California, Santa Cruz, California 95064.
The issue of an advantage that ambiguous words have
over unambiguous words in being recognized has been
particularly vexing theoretically, and as we show later,
empirically. To our knowledge, the earliest study claiming
an advantage for ambiguous words was reported by Rubenstein, Garfield, and Millikan (1970). Subjects were faster to
make lexical decisions to ambiguous words compared with
unambiguous words of the same printed frequency of occurrence when these words were presented in isolation. A
subsequent study showed that this ambiguity advantage
existed only for homonyms whose meanings were equiprobable and not systematically related (Rubenstein, Lewis, &
Rubenstein, 1971). However, Clark (1973) argued that the
conclusions reached by Rubenstein et al. could not be generalized to a different set of stimuli because the results were
not significant when both subjects and items were treated
as random variables. In a subsequent study, Forster and
Bednall (1976) also found that lexical decision times to
homonyms were slightly faster than unambiguous controls,
but, again, this advantage did not prove to be statistically
significant when both subjects and items were treated as
random variables.
The issue of an advantage for ambiguous words was again
raised in a series of experiments by Jastrzembski and his
colleague (Jastrzembski, 1981; Jastrzembski & Stanners,
1975). Arguing that the earlier studies failed to find significant differences between ambiguous and unambiguous
words because the ambiguous words that were used had
only two distinct meanings, Jastrzembski and Stanners compared words that had many meanings listed in a dictionary
with words that had only a few. They found that subjects
responded significantly faster to words with many dictionary meanings compared with words with only a few meanings even when both subjects and items were considered in
the analysis. However, the conclusions of that study have
also been challenged. Gernsbacher (1984) argued that familiarity and not number of meanings was the basis of the
difference. When familiarity rather than printed frequency
was crossed with number of meanings, the only effect that
was significant was a main effect of familiarity.
In response to Gernsbacher's (1984) critique, two subsequent studies used subjective measures of ambiguity rather
than the objective measure of number of meanings (Kellas
et al., 1988; Millis & Button, 1989). In addition, the target
words were controlled for both printed frequency and subjective familiarity. In both studies, an advantage for ambiguous words compared with unambiguous words was found
when both subjects and items were treated as random
variables.
Admittedly, the advantage for ambiguous words has
sometimes been small and has not always been statistically
significant when both subjects and items are used as random
factors. However, in every study comparing words with
many meanings and words with few meanings, an advantage (not necessarily significant) for words with many
meanings was found when other variables were collapsed
across this condition (see our Table 1 and Gernsbacher,
1984, Table 6).¹ Given that the frequency of each sense of
an ambiguous word is less than the frequency of an unambiguous word matched in printed frequency (half of the
frequency for ambiguous words with two equiprobable
meanings), one might have expected an ambiguity disadvantage. In fact, when other measures such as gaze durations were the dependent variable, equiprobable ambiguous
words took longer to process than unambiguous words
when the target words were presented in a neutral sentential
context (Carpenter & Daneman, 1981; Duffy, Morris, &
Rayner, 1988; Rayner & Duffy, 1986). Moreover, this processing disadvantage for ambiguous words persisted even
when a new unambiguous control word approximately half
of the printed frequency of the old control word was used.
Previous Accounts of the Ambiguity
Advantage Effect
Given the inherent disadvantage that ambiguous words
must overcome if the frequency of each sense is represented
separately, many models would predict that ambiguous
words would be recognized more slowly, not more quickly,
than unambiguous words. For models that use a localist
representation scheme, it is clear why an ambiguity disadvantage would be expected. For example, in the serial
search model (Forster & Bednall, 1976), each sense of an
ambiguous word would be found below the single entry of
an unambiguous word matched in printed frequency in a
frequency-ordered list. In logogen-type models (McClelland & Rumelhart, 1981; Morton, 1979), each logogen
corresponding to a sense of an ambiguous word would have
a higher threshold or lower resting level of activation compared with a matched unambiguous word. In order to overcome the disadvantage that ambiguous words have compared with unambiguous words, either a race between the
senses or facilitation among the senses of an ambiguous
word must be assumed (Jastrzembski, 1981; Kellas et al.,
1988).
By contrast, for models that use a distributed representa-
tion scheme, predictions are not as straightforward because
there is no single unit that corresponds uniquely to a lexical
entry. That is, different units are used to represent the
spelling, the pronunciation, and the meaning, and activation
over different sets of units is used as an index for different
responses (Kawamoto, 1988, 1993; Seidenberg & McClelland, 1989). For example, lexical decision is often indexed
by activation of the units representing the spelling, and
naming latency is usually indexed by activation of the units
representing the pronunciation (Kawamoto, 1993; Seidenberg & McClelland, 1989), but there are studies in which
lexical decision is indexed by units representing the pronunciation (Seidenberg & McClelland, 1989) or the meaning (Joordens & Besner, 1994). Because the relative activation over the different units is not the same for every
word, the same stimulus (or class of stimuli) can give
different predictions depending on the particular response
made. This property of distributed representations is, in fact,
part of the basis for the different conclusions reached in two
recent simulation studies of the ambiguity advantage effect.
Kawamoto (1993), who argued that distributed representations can account for the ambiguity advantage effect, used
activation over the spelling units as an index of lexical
decision times, whereas Joordens and Besner (1994), who
argued otherwise, used activation over the meaning units. It
should be emphasized that the basis for the disagreement is
really over what the appropriate measure of word recognition is when distributed representations are used, not that it
takes more time for the meaning units to become fully
activated (or stable) for ambiguous words compared with
unambiguous words. That is, Kawamoto (1993, Figure 6)
also showed that the meaning units of ambiguous words are
less activated than those of unambiguous words.
However, there was another important difference between
Kawamoto's (1993) simulation and Joordens and Besner's (1994) simulation. Kawamoto used the LMS error-correction learning algorithm to modify the connection
strengths, whereas Joordens and Besner used the Hebbian
learning algorithm. Many differences between these two
learning algorithms are already well-known (see McClelland & Rumelhart, 1988). For example, both the Hebbian
and the LMS algorithm will produce the output exactly if
the inputs are mutually orthogonal, but only the LMS algorithm will do so if the inputs are linearly independent but
not mutually orthogonal (as the complete lexical representation of ambiguous words would be).
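The contrast is easy to check numerically. The sketch below is our own illustration (NumPy, toy four-feature patterns, and helper names of our choosing, not code from the article): with mutually orthogonal inputs both rules recall the targets exactly (cosine 1.0), whereas with correlated but linearly independent inputs only the LMS rule does.

```python
import numpy as np

def train_hebb(X, T, eta=0.01, epochs=100):
    # Hebbian rule: dW = eta * outer(target, input)
    W = np.zeros((T.shape[1], X.shape[1]))
    for _ in range(epochs):
        for x, t in zip(X, T):
            W += eta * np.outer(t, x)
    return W

def train_lms(X, T, eta=0.05, epochs=100):
    # LMS (error-correcting) rule: dW = eta * outer(target - W @ input, input)
    W = np.zeros((T.shape[1], X.shape[1]))
    for _ in range(epochs):
        for x, t in zip(X, T):
            W += eta * np.outer(t - W @ x, x)
    return W

def recall_cosine(W, X, T):
    # 1.0 means the recalled pattern points exactly at the target.
    out = (W @ X.T).T
    return [round(float(o @ t / (np.linalg.norm(o) * np.linalg.norm(t))), 3)
            for o, t in zip(out, T)]

T = np.array([[1., -1., 1., -1.], [1., 1., -1., -1.]])       # targets
X_orth = np.array([[1., 1., -1., -1.], [1., -1., 1., -1.]])  # orthogonal inputs
X_corr = np.array([[1., 1., 1., -1.], [1., 1., 1., 1.]])     # correlated inputs

for name, X in (("orthogonal", X_orth), ("correlated", X_corr)):
    print(name, "Hebb:", recall_cosine(train_hebb(X, T), X, T),
          "LMS:", recall_cosine(train_lms(X, T), X, T))
```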
In this article, we consider a qualitative difference in
processing that arises after a network is trained using different learning algorithms. In doing so, we argue that the
ambiguity advantage effect is seen only when (a) the orthographic units rather than the meaning units are used as the
index of lexical decision times and (b) the LMS algorithm rather than the Hebbian algorithm is used. In addition, we show why the processing difference that we found can ultimately be attributed to the multiple meanings that ambiguous words have compared with the single meaning that unambiguous words have.

¹ In some experiments, there is sometimes an ambiguity disadvantage across a particular condition. For example, in Rubenstein, Lewis, and Rubenstein (1971), an ambiguity disadvantage was observed for medium- and high-frequency homographs whose meanings were systematically related (e.g., noun and verb sense of plow). However, these represent a fraction of the total number of conditions in which an ambiguity advantage was found.

Table 1
Results of Previous Studies Manipulating Number of Meanings (Supplement to Table 6 in Gernsbacher, 1984)

                                      High-NM words           Low-NM words            Difference
Study                                 NM     RT (ms)  Errors  NM     RT (ms)  Errors  NM        RT (ms)  Errors  MinF'   p
Forster & Bednall (1976)              2      543      2.3     1      564      5.0     1         21       2.7     —       —
Chumbley & Balota (1984)
  Experiment 2                        —      —        —       —      —        —       r = -.33                   1.43a   ns
  Experiment 3                        —      —        —       —      —        —       r = -.47                   4.36a   .05
Gernsbacher (1984)                    >10    948      19.5    1      951      19.5    —         3        0.0     <1      ns
Kellas, Ferraro, & Simpson (1988)
  Experiment 1                        >1     725      0.6     1      763      1.2     —         38       0.6     59.28   .05
Millis & Button (1989)
  Experiment 1                        4.45   589      3.3     1      676      5.0     3.45      87       1.7     2.75    ns
  Experiment 2                        7.25   609      3.2     3.08   748      9.7     4.17      139      6.5     32.17   .05
  Experiment 3                        2.36   661      3.8     1.38   732      6.3     0.98      71       2.5     12.91   .05

Note. Adapted from "Resolving 20 Years of Inconsistent Interactions Between Lexical Familiarity and Orthography, Concreteness, and Polysemy" by M. A. Gernsbacher, 1984, Journal of Experimental Psychology: General, 113, p. 273. Copyright 1984 by the American Psychological Association. Adapted by permission of the author. Dashes indicate that the value was not given in the original study. Forster and Bednall (1976) meaning metric not given. Chumbley and Balota (1984) used a log transform of the number of dictionary meanings. They used multiple regression analyses, so grouped means were unavailable. Gernsbacher (1984) used the number of dictionary meanings. Kellas et al. (1988) used a 3-point subjective scale: 0 = no meaning, 1 = one meaning, 2 = more than one meaning. Millis and Button (1989) subjective NM metrics: Experiment 1 = first meaning listed; Experiment 2 = total meanings listed; Experiment 3 = average meanings listed. NM = mean number of meanings; RT = mean reaction time.
a F values were calculated only by subjects, and they represent the number of meanings' unique contribution to predicting lexical RTs.
Fully Recurrent Connectionist Networks
In this section, we summarize the assumptions about the
representation, learning, and processing in our fully recurrent connectionist network model.
Representation

Each word is represented in a distributed fashion by a finite number of binary features (+ or −). Features that are present are positive with a value of +1.0, and features that are absent are negative with a value of −1.0. Some of these features represent the word's spelling, and the remaining features represent the word's meaning. Each of these features plays a role in the representation of every lexical entry (i.e., features are never 0), and there is no single feature that corresponds uniquely to one entry. With such a representation scheme, similarity is measured by the degree of overlap in the value of the features (e.g., the normalized dot product).

Network Architecture

The architecture of the connectionist network that we use is illustrated in Figure 1. There is a unit in the network for each feature used in the representation. Each unit receives input from the environment (indicated by the dashed segments) as well as from every other unit in the network (indicated by the solid segments). Each of the recurrent connections formed between two units in the network has a real-valued strength that is determined by the learning algorithm described in the following section.

Figure 1. Architecture of the fully recurrent connectionist network representing the spelling and meaning of a word. Only some of the units and connections are shown.

Learning Algorithm

Learning in our model consists of modifying the connection strengths during each learning trial. For each learning trial, the activation of the units in the network is first set to
the values corresponding to the particular values of a lexical
entry. The probability of a given entry being selected on a
particular learning trial varies according to its relative frequency of occurrence. For the LMS algorithm that we use in
our model, the change in the connection strength from a given unit j to a unit i, Δw_ij, is

Δw_ij = η(t_i − n_i)t_j,   (1)

where η is a scalar learning constant, t_i and t_j are the target activation levels of two distinct units i and j, and n_i is the net input to unit i when the activations are set to the targets. By contrast, for the Hebbian algorithm, the learning algorithm used by Joordens and Besner (1994), the change in the connection strength is

Δw_ij = η t_i t_j.   (2)
In addition, we assume that the value of each connection
strength is initially 0.
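In code, the two updates differ only in whether the target activation is first discounted by what the network already predicts. The following is a minimal sketch of our own (the simulations themselves used the McClelland and Rumelhart, 1988, software); the net-input term n_i follows the description of Equation 1 given later in the General Discussion:

```python
import numpy as np

def lms_trial(W, t, eta):
    # Equation 1: dw_ij = eta * (t_i - n_i) * t_j, where n_i = sum_j w_ij t_j
    # is the net input to unit i with activations clamped to the pattern t.
    W = W + eta * np.outer(t - W @ t, t)
    np.fill_diagonal(W, 0.0)   # units receive input from every *other* unit
    return W

def hebb_trial(W, t, eta):
    # Equation 2: dw_ij = eta * t_i * t_j
    W = W + eta * np.outer(t, t)
    np.fill_diagonal(W, 0.0)
    return W
```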
Processing
After training, the performance of the network is examined by presenting just the spelling as the input to the network. The input activates the corresponding orthographic unit in the network to +0.25 if the corresponding input feature is positive and −0.25 if it is negative. Because the input consists only of the spelling, there is no activation of the semantic units. The initial activation in each orthographic unit arising from the input then spreads to other orthographic units and to the semantic units. The activation of each unit thus changes as a function of time (wherein time changes in discrete steps). In particular, the activation value of unit i, a_i, at time t + 1 is

a_i(t + 1) = LIMIT[δ a_i(t) + Σ_j w_ij a_j(t) + s_i(t)],   (3)

where δ is a decay constant, s_i(t) is the influence of the input stimulus on unit i, and LIMIT bounds the activation to the range from −1.0 to +1.0. The activation values of all the units change in parallel. Each of these updating cycles is one iteration through the network.
The activation value of each of the units in the network at
a given point in time is the most complete representation of
the state of the network at that time. However, for our
purposes, we were particularly interested in the number of
iterations it took all of the units representing the spelling to
saturate (i.e., reach their minimal or maximal activation
levels of -1.0 and +1.0, respectively). This measure was
used as an index of lexical decision times. A response was
defined as an error if the pattern of activation did not
correspond with the input or if all of the units did not
saturate after a fixed number of iterations through the
network.
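The processing assumptions translate directly into an update loop. Below is a sketch of Equation 3 and the saturation index in our own rendering; a full scoring would also compare the saturated pattern with the word's spelling, per the error criterion above:

```python
import numpy as np

def recognize(W, spelling, n_ortho, delta, scale=0.25, max_iters=50):
    """Present only the spelling and count iterations until every
    orthographic unit saturates at -1.0 or +1.0 (Equation 3)."""
    a = np.zeros(W.shape[0])
    s = np.zeros(W.shape[0])
    s[:n_ortho] = scale * spelling            # +/-0.25 input, spelling only
    for iteration in range(1, max_iters + 1):
        a = np.clip(delta * a + W @ a + s, -1.0, 1.0)   # LIMIT
        if np.all(np.abs(a[:n_ortho]) == 1.0):
            return iteration, a               # latency index, final state
    return None, a                            # no saturation: scored an error
```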
A Critical Comparison
To demonstrate the critical processing difference that
arises when the LMS and Hebbian algorithms are used, we
examined separate lexicons (see Table 2) consisting of
either (a) an unambiguous word, (b) an ambiguous word
with two distinct unequiprobable meanings, (c) an ambiguous word with two distinct equiprobable meanings, and (d)
an ambiguous word with four distinct equiprobable meanings.² Each word was represented by 12 features: The first 4 were "orthographic" features and the last 8 were "semantic" features. The orthographic representations of the unambiguous word and the ambiguous words were identical. The semantic representations of the ambiguous words were mutually orthogonal. In addition, the "printed frequency" of the words was identical.
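For concreteness, the four lexicons can be written down as follows. This is a sketch with hypothetical feature values: the paper's exact +/− patterns are not reproduced, only the stated constraints of a shared 4-feature spelling, mutually orthogonal 8-feature meanings, and a total printed frequency of 4 per word apportioned as in Table 2:

```python
import numpy as np

SPELLING = np.array([+1., -1., +1., -1.])    # shared by every word (hypothetical values)

MEANINGS = np.array([                        # four mutually orthogonal meanings
    [+1, +1, +1, +1, +1, +1, +1, +1],
    [+1, -1, +1, -1, +1, -1, +1, -1],
    [+1, +1, -1, -1, +1, +1, -1, -1],
    [+1, -1, -1, +1, +1, -1, -1, +1],
], dtype=float)

def lexicon(freqs):
    # One (12-feature pattern, frequency) pair per sense.
    return [(np.concatenate([SPELLING, MEANINGS[k]]), f)
            for k, f in enumerate(freqs)]

LEXICONS = {
    "Unambiguous":    lexicon([4]),
    "2-Polarized":    lexicon([3, 1]),
    "2-Equiprobable": lexicon([2, 2]),
    "4-Equiprobable": lexicon([1, 1, 1, 1]),
}
```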
Four different, fully recurrent networks comprised of 12
units were each trained on one of the four different lexicons
using the LMS algorithm and four other networks were
trained using the Hebbian algorithm with the software provided in McClelland and Rumelhart (1988). The learning constant η, the same for both the LMS and the Hebbian algorithms, was equal to 0.02. The only difference in the
training was that the entire lexicon was presented twice for
a total of 8 trials during the training phase for the Hebbian
algorithm, but four times for a total of 16 trials for the LMS
algorithm. (The number of learning trials was chosen so that
performance was roughly comparable for the two learning
algorithms.)
Processing assumptions for the networks trained using the
two learning algorithms were identical. Each network was
initially presented with an input consisting of just the spelling pattern scaled by a factor of 0.25. Activation in the
network was computed according to Equation 3, with the
decay term δ equal to 0.2. As seen in Figure 2A, networks trained using the LMS algorithm did simulate the ambiguity advantage: The orthographic units were saturated by Iteration 7 for the network that learned the unambiguous word, but by Iteration 6 for each of the networks that learned an
ambiguous word. Moreover, the relative magnitude of the
activation values for the iterations preceding saturation was
correlated with the degree of ambiguity of the meaning: (a)
ambiguous words with four distinct equiprobable meanings,
(b) ambiguous words with two distinct equiprobable meanings, (c) ambiguous words with two distinct polarized
meanings, and (d) unambiguous words. By contrast, as seen
in Figure 2B, for networks trained with the Hebbian algorithm, the relative activation of the orthographic units was exactly opposite that found with the LMS algorithm.

² We also examined an ambiguous word with similar equiprobable meanings. However, we did not report the results because the size of the effect, relative to the other ambiguous words that all had meanings that were mutually orthogonal, was dependent on the degree of similarity. In general, the more similar the meanings, the more similar the results were to the unambiguous word. We should also point out that the lexicons were chosen to constitute just a single unambiguous word or all of the senses of a single ambiguous word for clarity. The same points could have been made by considering a single lexicon with unambiguous words and the different types of ambiguous words.

Table 2
Lexicons Consisting of Either an Unambiguous Word or an Ambiguous Word Used in the Critical Comparison

Lexicon          Sense   Frequency
Unambiguous      1       4
2-Polarized      1       3
                 2       1
2-Equiprobable   1       2
                 2       2
4-Equiprobable   1       1
                 2       1
                 3       1
                 4       1

Note. Each sense is a 12-feature pattern of +1.0 and −1.0 values: the 4 orthographic features (identical for every word) followed by the 8 semantic features, which were mutually orthogonal across senses.
The difference in activation for the different lexicons
trained using the LMS algorithm arises because the different
networks have different connection strengths. To illustrate
these differences, the connection strengths were partitioned
into four different subsets on the basis of their origin and
destination: orthographic unit → orthographic unit, orthographic unit → semantic unit, semantic unit → semantic unit, and semantic unit → orthographic unit. Table 3 shows the mean of the absolute value of the connection strengths for each of these subsets of connections. (The values for the connections from the semantic units to the orthographic units are not shown separately because the values are nearly identical to the reciprocal connections from the orthographic units to the semantic units.) On the basis of these values, it is clear that as ambiguity increases, the mean of the absolute value of the orthographic → orthographic connections increases and the mean of the absolute value of the orthographic → semantic connections decreases. In other words, as ambiguity increases, the influence of the orthographic → orthographic connections increases and the influence of the orthographic ↔ semantic connections decreases. This influence of the different connections on activation in the orthographic units can be demonstrated by (a) eliminating all but the orthographic → orthographic connections and (b) eliminating the orthographic → orthographic connections. Figure 3 shows the mean of the absolute value of activation in the orthographic units under these two conditions with only the spelling as the input.
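The partition is simple to compute from the weight matrix. In the helper below (ours; we adopt the convention that w_ij is the connection from unit j to unit i, with the orthographic units listed first), the two lesion conditions of Figure 3 are also constructed:

```python
import numpy as np

def mean_abs_by_block(W, n_ortho):
    """Mean |w| for each connection subset, as reported in Table 3.
    The zero self-connections on the diagonal are included in the
    O->O and S->S means."""
    o = slice(0, n_ortho)
    s = slice(n_ortho, W.shape[0])
    return {"O->O": float(np.mean(np.abs(W[o, o]))),
            "O->S": float(np.mean(np.abs(W[s, o]))),   # orthographic -> semantic
            "S->O": float(np.mean(np.abs(W[o, s]))),   # semantic -> orthographic
            "S->S": float(np.mean(np.abs(W[s, s])))}

def lesion(W, n_ortho, keep_oo_only):
    """Keep only the O->O block, or zero it out while keeping the rest
    (the two conditions plotted in Figure 3)."""
    o = slice(0, n_ortho)
    if keep_oo_only:
        W2 = np.zeros_like(W)
        W2[o, o] = W[o, o]
    else:
        W2 = W.copy()
        W2[o, o] = 0.0
    return W2
```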
The differences in connection strengths (and thus the differences in activation of the orthographic units) that arise in training the different words are due to the error-correcting property of the LMS algorithm. When an unambiguous word is being learned, all of the orthographic and semantic features contribute to predicting that word's orthographic features. By contrast, when an ambiguous word is being learned, only the unambiguous features contribute to predicting that word's orthographic features: The ambiguous features do not make a contribution because the changes that are made for one sense of an ambiguous word are offset by the other senses. Consequently, the orthographic features (as well as the unambiguous semantic features) must compensate for the missing contribution of the ambiguous semantic features by making a larger contribution in predicting the activation of the orthographic features. With an error-correction learning algorithm, any manipulation that increases the ambiguity will lead to a larger contribution of the orthographic → orthographic connections, and, conversely, any manipulation that decreases the ambiguity will lead to a smaller contribution. Thus, increasing the number of distinct meanings from two to four increases the ambiguity (i.e., with four orthogonal meanings, there are three features that are ambiguous, four features that are partially biased, and one feature that is unambiguous), whereas increasing the relative dominance of one meaning (or increasing the similarity of the meanings) decreases the ambiguity.
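The offsetting of changes can be seen in a two-unit toy example of our own devising: one orthographic feature that is always +1 and one semantic feature that is +1 in one sense and −1 in the other. Under the LMS rule, the cross-connections hover near zero because each sense's update is undone by the other:

```python
import numpy as np

eta = 0.1
W = np.zeros((2, 2))            # W[i, j]: connection from unit j to unit i
A = np.array([+1., +1.])        # [orthographic, semantic], sense A
B = np.array([+1., -1.])        # same spelling, opposite semantic feature

for trial in range(200):
    t = A if trial % 2 == 0 else B
    W += eta * np.outer(t - W @ t, t)   # LMS update (Equation 1)
    np.fill_diagonal(W, 0.0)            # no self-connections

# Both cross-connections oscillate within about eta / (2 - eta) of zero,
# so in the full network the orthographic and unambiguous semantic
# features must carry the prediction instead.
print(np.round(W, 3))
```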
Figure 2. Time course of activation of the orthographic units when just spelling is presented for each of the networks trained using (a) the least mean square learning algorithm and (b) the Hebbian learning algorithm.

Table 3
Mean of the Absolute Value of the Connection Strengths for Different Subsets of Connections (Orthographic → Orthographic, Orthographic → Semantic, and Semantic → Semantic) for Each Network

Lexicon          O→O     O→S     S→S
Unambiguous      .089    .089    .089
2-Polarized      .122    .077    .085
2-Equiprobable   .132    .066    .075
4-Equiprobable   .166    .065    .079

Note. O = orthographic; S = semantic.

In contrast to the LMS algorithm, the changes made for a connection on a given learning trial with the Hebbian algorithm are independent of previous changes made to that connection as well as previous changes made to all other connections. That is, unlike the LMS algorithm in which the network's current performance (i.e., the magnitude of the error) influences the magnitude of the change in the connection strength, the changes made using the Hebbian algorithm are always the same for a particular stimulus. Thus, the Hebbian learning algorithm does not compensate for the fact that the ambiguous semantic features are not contributing to the prediction of the orthographic features. In particular, the connection strengths from one orthographic unit to another orthographic unit in the network trained on the unambiguous word are identical to the corresponding connection strengths in the networks trained on one of the ambiguous words.
Despite the large differences in the final connection
strengths obtained using the Hebbian and the error-correction learning algorithms, the connection strengths are actually highly similar early in the training. In fact, after the first
learning trial, the connection strengths are identical using
the LMS and Hebbian learning algorithms, assuming the
same lexical entry is learned and that the weights are initially zero. Thus, for the error-correction algorithm, there is
an ambiguity disadvantage early in training that becomes an
ambiguity advantage later in training (see Table 4).
Finally, as already noted, one of the key assumptions was
that activation in the orthographic units is used as an index
of lexical decision times. Indeed, even with the LMS algorithm, the unambiguous word would have been recognized
before the ambiguous words if saturation of the semantic
units had been used as the index of recognition. As Figure
4 shows, the relative magnitude of the activation values for
the last 8 units (the semantic units) was exactly opposite that
found in Figure 2A.
Figure 3. Time course of activation of the orthographic units for each of the networks trained using the LMS algorithm when (a) only the orthographic → orthographic connections are used and (b) all other connections except for the orthographic → orthographic connections are used.

Simulation of Data

A real lexicon is much more complicated than the examples just considered. A real lexicon contains many words, and most words are similar in spelling to other words. In fact, ignoring the part of the spelling that is dissimilar, a shorter "ambiguous" word is generated (e.g., the unambiguous words BERM and BERG yield the ambiguous string BER_). Thus, even unambiguous words could manifest an "ambiguity advantage," and this possibility increases as the number of neighbors those words have increases. To demonstrate the ambiguity advantage effect with a real lexicon, the data from one experiment will be simulated.

Our next simulation examined a particular set of data reported by Millis and Button (1989, Experiment 3). We chose this specific experiment because the target words used in the experiment were reported and because these results have addressed previous criticisms. In their experiment, Millis and Button found that ambiguous words with few meanings (an average of 1.38 meanings) had a mean lexical decision time of 732 ms and that ambiguous words with many meanings (an average of 2.36 meanings) had a mean lexical decision time of 661 ms. For purposes of comparison, we also included unambiguous words matched in printed frequency because many experiments examining the effect of ambiguity on word recognition have specifically compared unambiguous words with ambiguous words, rather than two classes of ambiguous words. Unfortunately,
to our knowledge, there is no single experiment that specifically compared unambiguous words with ambiguous
words with few and many meanings and therefore we could
not explicitly compare all three classes of words.
Lexicon
The lexicon consisted of a total of 108 four-letter words.
Of these, 24 were ambiguous words drawn from the test
words used in Millis and Button (1989, Experiment 3). In
our implementation, the ambiguous words that had few
meanings were assigned two meanings and the ambiguous
words that had many meanings were assigned four meanings. For simplicity, we assumed that all meanings of an
ambiguous word were of the same or nearly the same
frequency (e.g., if a word with four meanings had a printed
frequency of 14, two meanings would have a frequency of 4
and two meanings would have a frequency of 3) because
Millis and Button did not report relative dominance of the
meanings. In addition, 12 unambiguous words whose Francis and Kucera (1982) frequencies were approximately equal to those of the ambiguous items were included for comparison.
The ambiguous and unambiguous items are presented in the
Appendix. The lexicon also contained an additional 72
unambiguous filler words. They were randomly chosen
from among the four-letter words in the Francis and Kucera
(1982) corpus that had a frequency greater than 1 and were
not proper names. Words with frequencies greater than 500
were treated as if they had a frequency of 500. Twenty
different sets of 72 filler items were created for each of
the 20 replications of the simulation (described more fully
shortly) to reduce the possibility that an effect was due to a
specific set of filler items.
Each lexical entry was composed of a spelling (represented by 56 orthographic features) and a meaning (represented by 70 semantic features). The orthographic representation was identical to that used by McClelland and Rumelhart (1981): Each of the four letters was represented by 14
features. The semantic representation of each word meaning
was generated by randomly permuting the order of 35
positive and 35 negative features. The semantic representation for each ambiguous and unambiguous word was different for each network.
Table 4
Number of Iterations to Reach Saturation Following Different Amounts of Training Using the Error-Correction Algorithm

No. epochs   Unambiguous   2-Equiprobable
1            —             —
2            8             10
3            7             7
4            7             6

Note. Dashes indicate that neither network reached saturation after just one epoch of training.

Figure 4. Time course of activation of the semantic units for each of the networks trained using the least mean square algorithm.
Networks
Twenty fully recurrent networks with identical architectures were used to simulate 20 subjects. Each network
consisted of 126 units, with each unit representing 1 of the
126 orthographic or semantic features of a lexical entry.
Training
The training regimen was identical to that described earlier. Initially, the strengths of all connections between units
were set to zero. The connection strengths between each
pair of units were then modified after each learning trial
according to Equation 1 with the learning constant η set to 0.00005.³ The number of times each word was presented
to the network during one training epoch was equal to that
word's Francis and Kucera (1982) frequency.
Performance
To assess the performance of the network, the spelling of
each ambiguous and unambiguous test word, scaled
by 0.25, was presented to the network. The decay constant
δ was set to 0.95. For each test word, the network continued
processing until all of the orthographic units were saturated
or until the network had reached 200 iterations. When the
network did not fully activate the correct orthographic units
in a given trial, that trial was considered an error and
removed from the latency analysis. If the network took more
than 20 iterations to obtain the appropriate output in any
given trial, that trial's iteration time was truncated to 20.
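In code, the scoring rule amounts to the following sketch (ours; `saturated_pattern_correct` is a hypothetical stand-in for the comparison of the final pattern with the target spelling):

```python
def score_trial(iterations, saturated_pattern_correct, cutoff=20):
    """Error trials (wrong pattern, or no saturation within the limit of
    200 iterations) are dropped from the latency analysis; correct trials
    longer than 20 iterations are truncated to 20."""
    if iterations is None or not saturated_pattern_correct:
        return None                  # error: excluded from latencies
    return min(iterations, cutoff)   # truncated latency, in iterations
```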
³ The learning constant η must be less than 1/λmax, where λmax is the maximal eigenvalue of the sample covariance matrix, in order for the algorithm to converge (Widrow & Stearns, 1985). Because some items in the lexicon have a much higher relative frequency than other items in the lexicon, λmax is large. The large λmax in turn dictates a small η.
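The bound in the footnote can be checked directly. A sketch of our own, taking the frequency-weighted second-moment matrix of the training patterns as the sample covariance the footnote refers to:

```python
import numpy as np

def max_stable_eta(patterns, frequencies):
    """Upper bound 1 / lambda_max on the learning constant (Widrow &
    Stearns, 1985), with each pattern repeated by its frequency."""
    X = np.repeat(np.asarray(patterns, dtype=float),
                  np.asarray(frequencies, dtype=int), axis=0)
    second_moment = X.T @ X / len(X)
    lam_max = float(np.linalg.eigvalsh(second_moment).max())
    return 1.0 / lam_max
```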
The performance of the network after 5, 10, and 15
epochs of training is shown in Figure 5. The simulation
results of the response times shown in Figure 5A are consistent with the empirical data. On average, ambiguous
words took less time to saturate the orthographic units
compared with unambiguous words, and ambiguous words
with many meanings took less time to saturate than ambiguous words with few meanings. The average number of
iterations it took the ambiguous words to saturate after 15
epochs of training can be multiplied by 546, with 2,053 then subtracted, to yield the lexical decision times (in
milliseconds) reported by Millis and Button (1989, Experiment 3). (Note that a different learning constant would lead
to the same pattern of results, although the particular values
attained after a given number of training epochs, and thus
the constants for the transformation, would be different.)
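The reported mapping from model latencies to empirical latencies is the linear transform below (the constants are specific to 15 epochs of training with this learning constant, as just noted):

```python
def iterations_to_ms(iterations):
    """Map saturation iterations (after 15 training epochs) onto the
    lexical decision times of Millis and Button (1989, Experiment 3)."""
    return 546.0 * iterations - 2053.0
```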
Figure 5. Performance at different points during the training of a network that learns a large lexicon: (a) number of iterations for the orthographic units to saturate and (b) number of errors.

However, the simulation results of the error performance shown in Figure 5B are only partially consistent with the empirical data. That is, as in the data from Millis and Button
(1989, Experiment 3) and other studies (see our Table 1 and
Gernsbacher, 1984), ambiguous words with many meanings
had fewer errors than ambiguous words with few meanings,
but in contrast to published results, unambiguous words had
fewer errors than ambiguous words. We cannot explain this
particular simulation result at this point, although the small
neighborhoods of some of the ambiguous stimuli (e.g., huge
and pond) may have been a contributing factor.
General Discussion
Many models that begin with very different assumptions
about what information is used and how that information is
used in processing can end up making highly similar qualitative predictions (see also Van Orden and Goldinger,
1994, for a related discussion). For example, for naming
latencies, the main effects of frequency and regularity as
well as the frequency by regularity interaction can be accounted for by dual-route models in which both lexical and
sublexical information are represented, by analogy models
in which only lexical information is represented, and by
distributed models in which only sublexical information
is represented (see the discussions in Kawamoto, 1994;
Kawamoto & Zemblidge, 1992). Moreover, distinct versions of each of these three classes of models can account
for these effects. In the dual-route model, for example, the
lexical and sublexical rule routes can either be independent
with the pronunciation and latency determined by a race
between the routes (Coltheart, 1978), or the two routes can
interact with the pronunciation and latency determined by
integrating information from the two routes (Monsell,
Patterson, Graham, Hughes, & Milroy, 1992).
When different models all make the same qualitative
predictions, there are at least two alternatives in deciding
among the competing models. One alternative is to make the
decision on a quantitative basis by determining which
model provides the best fit using a measure such as root-mean-square deviation. As seen in this special issue, the
explicit formulation of models as computer programs allows
such comparisons to be made. For example, one can compare the advantage of using number of types with the
number of tokens in formulating rules of pronunciation
(e.g., Norris, 1994). Another alternative is to determine how
well the various models account for other related phenomena. For example, recent computational models of naming
have begun to focus on results such as generalization capacity (e.g., Coltheart, Curtis, Atkins, & Haller, 1993).
However, there are also many models with highly similar
assumptions that make qualitatively different predictions.
For example, different assumptions about the variance of
finishing times of the two routes in the race version of the
dual-route model lead to opposite predictions for the latencies of correct naming responses of low-frequency regular
and irregular words (Kawamoto & Zemblidge, 1992). In
particular, if the variance of the slower rule route is less than
or equal to the variance of the faster lexical route, irregular
words are incorrectly predicted to have shorter latencies
than regular words. However, if the variance of the rule
route is sufficiently greater than the variance of the lexical
route, irregular words are correctly predicted to have longer
latencies than regular words.
The comparison between models that we have focused on
in this study is an example in which a single assumption, the
particular learning algorithm used in modifying the connection strengths, led to opposite predictions. (Note that other
assumptions are needed to account for the ambiguity advantage effect.) This single difference has important consequences because the Hebbian algorithm leads to a linear
(i.e., additive) effect of learning, whereas the LMS algorithm leads to a nonlinear (i.e., nonadditive) effect of learning. As in other assumptions in which one alternative is
linear and the other is nonlinear, the difference in the effect
can be dramatic. In the following discussion, we begin by
comparing our approach with other recurrent network models. We consider which of the many different assumptions
are critical in distinguishing the models with respect to
modeling the ambiguity advantage effect. We then turn to a
version of our model that incorporates pronunciation and
discuss how the effect of semantics, the basis for our account of the ambiguity effect in the lexical decision task,
also leads to the prediction of an ambiguity effect in the
naming task. We also discuss how semantics plays a role in
other effects of the time course of naming. Finally, we
consider encoding variability, a widely discussed memory
result that is analogous to the ambiguity advantage effect.
Recurrent Networks
Recurrent network models, like all other models, can
make qualitatively different predictions if different assumptions are made. These assumptions include choices about
the specific representation of a word, the architecture of the
network, the updating of a unit's activation, the learning
algorithm, and the appropriate index of a behavioral response. In the first set of simulations presented in this study,
for example, the only difference between the two models
was the learning algorithm. Thus, the faster saturation of the
spelling units manifested for ambiguous words compared
with unambiguous words could be attributed directly to the
difference in the connection strengths that arose from the
different learning algorithms.
The basis for the processing advantage can be understood
by a closer examination of the two learning algorithms.
With the Hebbian learning algorithm, the magnitude of the
change in the connection strength is proportional to the
product of the activation of the "preconnection" unit and the
activation of the "postconnection" unit (Equation 2). Thus,
the change made to a connection from the preconnection
unit to the postconnection unit on a given learning trial is
independent of previous changes made to that connection as
well as previous changes made to all other connections to
the postconnection unit. This independence is manifested in
two ways. First, the changes to the connection strengths
made in response to a particular stimulus are identical for
each presentation of that stimulus. Second, and more important, the Hebbian learning algorithm does not compen-
sate for the fact that the ambiguous semantic features are not
contributing to the prediction of the orthographic features.
Thus, in the critical comparison, the connection strengths
between two orthographic units in the network trained on an
ambiguous word were identical to the corresponding connection strengths in the networks trained on the unambiguous word because we assumed identical spellings for all the
words.
By contrast, with the LMS algorithm, the magnitude of
the change in the connection strength is proportional to the
product of the activation of the preconnection unit and the
error, the difference between the activation of the postconnection unit and the net input to that unit (Equation 1). Thus,
unlike the Hebbian algorithm, the change made to a connection on a given learning trial is not independent of
previous changes made to the connections to the postconnection unit. This dependence is manifested in two ways.
First, the changes made to the connection strengths for a
particular stimulus are not identical for each presentation of
that stimulus. Second, and more important, the LMS learning algorithm does compensate for the fact that the ambiguous semantic features are not contributing to the prediction
of the orthographic features. Thus, in the critical comparison, the connections between two orthographic units in the
network trained on each of the ambiguous words make a
larger contribution to predicting the activation of the postconnection unit than corresponding connections in the networks trained on the unambiguous word. These differences
are clearly manifested by comparing the processing with
only the orthographic —> orthographic connections and with
all but the orthographic —> orthographic connections (see
Figure 3).
However, the learning assumption is only one of a number of key assumptions required to account for the ambiguity advantage. Another key assumption is that semantics
must be represented in some form. Clearly, if a particular
model simply does not implement semantics (e.g., Golden,
1986; Seidenberg & McClelland, 1989), then the model will
certainly not be able to account for any effect that is based
on semantics. In particular, the LMS algorithm would not
differentially affect the orthographic → orthographic connections of ambiguous words compared with unambiguous words if the effect of semantic → orthographic connections
were not taken into account during learning. Note that in our
account, the ambiguity advantage that is manifested during
processing can ultimately be attributed to the effect (actually, the lack of a contribution) of semantics in predicting
the features of orthographic units during learning.⁴
⁴ In a previous simulation using this model, one of us (Kawamoto, 1993) actually argued that the ambiguity advantage was due to the cooperation between the two senses of an ambiguous word that arose during learning and processing. That is, the changes to the orthographic → orthographic connections and the semantic → orthographic connections (for unambiguous semantic features) would be similar for the different senses of an ambiguous word. Thus, the connection strength changes for an ambiguous word would be roughly comparable to the changes for an unambiguous word. Then, when the spelling is presented during a test trial, activation in the orthographic units leads to activation of both meanings in the semantic units. These different meanings in turn activate the same spelling pattern in the orthographic units. (A similar proposal was discussed by Rueckl and Olds, 1993.) Although this semantically mediated pathway does influence the activity in the orthographic units, in our model the ambiguity advantage is due to the differential orthographic → orthographic connection strengths that arise when the error-correction algorithm is used. This point was demonstrated earlier with a network that lacked orthographic → orthographic connections.

One
final critical assumption is that the orthographic units and
not the semantic units are the index of lexical decision
times. As seen earlier in this study and elsewhere
(Kawamoto, 1993), there would be an ambiguity disadvantage if the semantic units were used as the index of lexical
decision time. In the following discussion, we show why
these assumptions are important in accounting for the
ambiguity advantage effect by examining other recurrent
models.
The first alternative recurrent model that we consider in
detail is the Hopfield (1982) network. This model is particularly important because Joordens and Besner (1994) have
specifically examined the ambiguity advantage effect using
a Hopfield network and reached the opposite conclusion
from us. We begin by noting the differences between the
two models (with our assumptions presented first): (a) connections between orthographic units—present versus absent; (b) learning algorithm—LMS versus Hebbian; (c)
index of lexical decision times—orthographic versus
semantic units; (d) activation of orthographic units—
unclamped versus clamped; (e) activation value—real versus binary valued; and (f) schedule of updating activation—parallel versus asynchronous. Of these differences, only the
first three are critical in accounting for the ambiguity advantage effect, as demonstrated earlier. The remaining differences do not affect the outcome. In particular, if Joordens
and Besner had assumed full connectivity between units and
had not clamped the orthographic units, an input spelling
that had been corrupted by noise would be corrected more
quickly for unambiguous words compared with ambiguous
words. In addition, the different processing assumptions
reflected in the last two assumptions lead to the same
results because the processing in both cases is governed
by the same Lyapunov function (Golden, 1986; Hopfield,
1984).
The next models that we consider are resonance models
(Grossberg & Stone, 1986; Kosko, 1988; Murre, Phaf, &
Wolters, 1992; Van Orden & Goldinger, 1994). The major
difference between resonance models and the fully recurrent
models considered in this study is the topology of the
network (i.e., the pattern of connectivity between units). In
resonance models, there are separate groups of units (e.g.,
orthographic and semantic units) with modifiable connections only between units in different groups: There are no
modifiable connections between orthographic units, the
very connections that we argue to be the basis for the
ambiguity advantage in our model (see Figure 3).
The resonance model that we consider first, Kosko's
(1988) bidirectional associative memory (BAM) model, is
closest in spirit to our fully recurrent model because the
BAM also uses a distributed representation comprised
of +1 and −1 binary values. To make the discussion explicit, we compare the unambiguous word with the ambiguous word with two equiprobable meanings as we did
earlier. We first consider the effects of learning the mapping
from spelling to meaning and then consider the mapping
from meaning to spelling. (This separate consideration of
the network as two feedforward networks is valid for a
linear activation function because there are no connections
between units within a group.) For the mapping from spelling to meaning, the spelling of the unambiguous word is
mapped to one meaning twice as often as the spelling of the
ambiguous word is mapped to each of its two meanings.
For the Hebbian algorithm, learning has the following consequences:

S_un → 2Tη M_un,   (4a)

S_am → Tη(M_am1 + M_am2),   (4b)

where T is the number of learning epochs and η is the learning constant, S_un and S_am are the patterns corresponding to the spelling of the unambiguous and ambiguous words, respectively, and M_un, M_am1, and M_am2 are the patterns corresponding to the meaning of the unambiguous word, the meaning of the first sense of the ambiguous word, and the meaning of the second sense of the ambiguous word, respectively. Note that the magnitude of the meaning vector of the unambiguous word is actually √2 times larger than the meaning vector of the ambiguous word because the two meanings of the ambiguous word are orthogonal. (Another way to see this is to observe that half of the meaning units of the ambiguous word have an activation equal to 0.)

For the LMS algorithm, it is possible to write a simple closed form expression for the activation of the meaning only for the unambiguous word:

S_un → [1 − (1 − η)^(2T)]M_un.   (5)

However, the asymptotic effect of the LMS algorithm can be expressed simply as

S_un → M_un,   (6a)

S_am → ½(M_am1 + M_am2).   (6b)
Thus, for the mapping from spelling to meaning, the asymptotic effect of the LMS algorithm is identical to the
effect of the Hebbian algorithm. By contrast, the mapping
from meaning to spelling is different for the Hebbian and
LMS algorithms because there are two orthogonal meanings for an ambiguous word. Thus, there are two different
inputs for ambiguous words compared with the single
repeated input for unambiguous words. For the Hebbian
algorithm,
M_un → 2Tη S_un,   (7a)

M_am1 → Tη S_am,   (7b)

M_am2 → Tη S_am.   (7c)
For the LMS algorithm,

M_un → [1 − (1 − η)^(2T)]S_un,   (8a)

M_am1 → [1 − (1 − η)^T]S_am,   (8b)

M_am2 → [1 − (1 − η)^T]S_am.   (8c)
Asymptotically, the LMS algorithm leads to

M_un → S_un,   (9a)

M_am1 → S_am,   (9b)

M_am2 → S_am.   (9c)
Consequently, for the Hebbian algorithm, the activation fed
back to the spelling units from the meaning units after
instantiation of the spelling (scaled by a constant) is twice as
large for unambiguous words compared with ambiguous
words. By contrast, for the LMS algorithm, the activation
fed back to the spelling units would be the same for unambiguous and ambiguous words. Thus, for the BAM, neither
the Hebbian nor the LMS algorithm can account for the
ambiguity advantage effect.
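Equations 4 through 9 are easy to confirm numerically for a linear mapping. In the sketch below, which is ours (orthogonal toy meanings, arbitrary ±1 spellings, and our own η and T), the Hebbian feedback to the spelling units comes out twice as large for the unambiguous word, whereas the LMS feedback is equalized:

```python
import numpy as np

rng = np.random.default_rng(1)
S_un = rng.choice([-1.0, 1.0], size=8)       # spelling, unambiguous word
S_am = rng.choice([-1.0, 1.0], size=8)       # spelling, ambiguous word
M_un  = np.array([+1., +1., +1., +1., +1., +1., +1., +1.])
M_am1 = np.array([+1., -1., +1., -1., +1., -1., +1., -1.])
M_am2 = np.array([+1., +1., -1., -1., +1., +1., -1., -1.])  # all mutually orthogonal

eta, T = 0.05, 200
W_hebb = np.zeros((8, 8))                    # meaning -> spelling weights
W_lms = np.zeros((8, 8))
for _ in range(T):
    # The unambiguous pairing occurs twice per epoch, each sense once.
    for m, s, reps in ((M_un, S_un, 2), (M_am1, S_am, 1), (M_am2, S_am, 1)):
        for _ in range(reps):
            W_hebb += eta * np.outer(s, m)                  # Equation 7
            W_lms += eta * np.outer(s - W_lms @ m, m)       # Equations 8-9

print("Hebb:", np.linalg.norm(W_hebb @ M_un), np.linalg.norm(W_hebb @ M_am1))
print("LMS :", np.linalg.norm(W_lms @ M_un), np.linalg.norm(W_lms @ M_am1))
# Hebbian: the first norm is twice the second; LMS: the two norms match.
```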
We now turn to another class of resonance models in
which the connections between orthographic and semantic
units are all excitatory and the connections between units
within a group are all inhibitory (e.g., Van Orden &
Goldinger, 1994). Note that with the assumption of mutually inhibitory connections between units, activation is positive and the representation is in some sense "local." That is,
within a group of units that have mutually inhibitory connections, only one unit, the "winner," remains active; the
activation of the remaining units decreases to 0. We thus
refer to this class of resonance models as "winner-take-all"
models. The first winner-take-all model that we consider is
the meta-model proposed by Van Orden and his colleagues
(Van Orden & Goldinger, 1994; Van Orden, Pennington, &
Stone, 1990). In their model, there are three groups of units
representing the spelling, the pronunciation, and the meaning of a word. The connections between units in each pair of
groups are modified according to a covariational learning
principle. Although it is not possible to make specific predictions because details of the activation function and learning function remain open, we point out that the same outcome noted earlier for the BAM for both the Hebbian and
LMS algorithms would be expected under some conditions
even without the effect of the mutually inhibitory connections. In fact, the unambiguous words would be expected to
gain an advantage when the effect of the mutually inhibitory
connections are added because these connections would
tend to suppress the simultaneous activations of the two
semantic representations activated for ambiguous words.
The second winner-take-all model that we consider is the
categorization and learning module (CALM) model proposed by Murre et al. (1992). This model is actually much
more complex than the preceding winner-take-all model
because there are additional units that coordinate control in
a module. In particular, there are three classes of units in a
module: Representation units that correspond to the units in
the preceding model, veto units that are paired with a single
representation unit, and an arousal unit. (It should be noted
that because each representation unit forms excitatory connections to its complementary veto unit, more than one unit
in a module remains active, but, as before, only one representation unit is the winner.) Each class of units forms only
fixed excitatory or inhibitory connections to other units
within the module and to an "external" node outside the
module. In addition, there are modifiable excitatory connections between representation units in different modules.
Given the complexity of the architecture and the nonlinearities of the activation function and learning rule, we do not
hazard making predictions without an implementation of the
model. However, we do point out the possibility that qualitatively different predictions could arise depending on the
particular values of the fixed parameters chosen.
The final set of recurrent models that we consider is based
on the interactive-activation (IA) model (McClelland &
Rumelhart, 1981). The IA model shares a number of characteristics with the winner-take-all models just described,
such as mutually inhibitory connections between units
within a group; unlike the other resonance models discussed, the connections between groups of units are set as a
parameter rather than learned. The IA model has been the
basis for a previous account of the ambiguity advantage
effect (Kellas et al., 1988). In their account, Kellas et al.
proposed that ambiguous words are represented by multiple
entries at the word level. However, unlike other word units,
the senses of an ambiguous word do not compete with each
other and thus end up cooperating with each other. This
cooperation arises because the neighbors of an ambiguous
word are inhibited by two word units and because the
component letters receive positive feedback from two word
units. This cooperation would clearly lead to one of the units
corresponding to an ambiguous word reaching threshold
more quickly than the single unit of an unambiguous word
if the thresholds were based on the printed frequency. Furthermore, if the cooperation were sufficiently great, the
advantage provided by the multiple entries of an ambiguous
word could also overcome the disadvantage that arises even
when the threshold is based on frequency of occurrence of
a particular sense. However, it should be pointed out that an
additional mechanism would still be required to resolve the
ambiguity.
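To make the arithmetic of this account concrete, consider the following minimal sketch. It is our illustration rather than an implementation of the IA model: neighbor inhibition is omitted, the update rule is a simple leaky accumulator rather than the McClelland and Rumelhart (1981) activation function, and all parameter values are arbitrary. The only mechanism retained is the one at issue, positive feedback from word-level units to the shared letter units.

```python
def time_to_threshold(n_word_units, excite=0.15, feedback=0.10,
                      decay=0.05, threshold=0.8, max_steps=200):
    """Steps until a word unit reaches threshold; n_word_units is 1 for an
    unambiguous word, 2 for an ambiguous word whose sense units cooperate."""
    letters, word = 0.5, 0.0           # letter units start at a resting level
    for t in range(max_steps):
        word += excite * letters - decay * word
        # every active sense unit feeds activation back to the same letters
        letters = min(1.0, letters + feedback * n_word_units * word
                      - decay * (letters - 0.5))
        if word >= threshold:
            return t
    return max_steps

print("unambiguous (1 word unit):", time_to_threshold(1))
print("ambiguous (2 word units): ", time_to_threshold(2))
```

Because the two sense units do not compete, the feedback they jointly return raises the letter activation faster, which in turn pushes each sense unit across threshold sooner, exactly the cooperation described above.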
Local Versus Distributed Representations
One key difference between different word recognition
models is whether a lexical entry is represented in a distributed fashion or in a localist fashion. With a distributed
representation, a lexical entry is represented as values over
a finite number of units, with no single unit corresponding
uniquely to an unambiguous word or a sense of an ambiguous word. By contrast, with a localist representation, a
single unit does correspond uniquely to an unambiguous
word or a sense of an ambiguous word.
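The contrast can be stated compactly with hypothetical feature vectors (the features and values below are invented for illustration). In a localist scheme, each sense owns a unit and any two senses are equally unrelated; in a distributed scheme, a sense is a pattern over shared features, and similarity is graded vector overlap.

```python
import numpy as np

# Localist: one unit per sense; no graded similarity between senses.
fish_local, tone_local, voice_local = np.eye(3)

# Distributed: senses are patterns over eight shared semantic features
# (these features and their values are invented for illustration).
fish_dist  = np.array([1, 1, 0, 0, 0, 0, 1, 0])  # animate, aquatic, ...
tone_dist  = np.array([0, 0, 1, 1, 1, 0, 0, 1])  # auditory, low pitch, ...
voice_dist = np.array([0, 0, 1, 1, 0, 1, 0, 1])  # auditory, low pitch, human

def overlap(a, b):
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

print(overlap(tone_local, voice_local))  # 0.00: all local senses equally distinct
print(overlap(tone_dist, voice_dist))    # 0.75: polysemous senses share features
print(overlap(fish_dist, tone_dist))     # 0.00: homonymous senses share none
```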
The distinction between local and distributed representations has important implications in determining the relationship between ambiguous senses and becomes particularly
salient in distinguishing homonymous senses from polysemous senses. Consider, for example, the problem of deciding to add an entry to the lexicon. That is, when is a
particular usage of an ambiguous word sufficiently distinct
to be considered a new sense, and how can the similarity
between senses be represented? Actually, these problems
must also be confronted by the lexicographer. Consider, for
example, the different senses of bass listed in Merriam-Webster's Collegiate Dictionary (10th ed., 1993):

1. bass \ˈbas\ n, pl bass or basses [ME base, alter. of OE bærs; akin to OE byrst bristle]: any of numerous edible spiny-finned fishes (esp. families Centrarchidae and Serranidae)

2. bass \ˈbās\ adj [ME bas base] 1: deep or grave in tone 2: of low pitch

3. bass \ˈbās\ n 1: a deep or grave tone: low-pitched sound 2a(1): the lowest part in polyphonic or harmonic music 2a(2): the lower half of the whole vocal or instrumental tonal range 2b(1): the lowest male singing voice 2b(2): a person having such a voice 2c: the lowest member in range of a family of instruments

4. bass \ˈbas\ n [alter. of bast] 1: a coarse tough fiber from palms 2: BASSWOOD. (p. 96)
In the entries for bass, the definitions for the second entry are listed as 1 and 2, whereas those for the third entry are listed as 1, 2a(1), 2a(2), 2b(1), 2b(2), or 2c. What determines
whether a use warrants a separate subentry or sub-subentry?
Moreover, given that the adjectival entry is listed separately,
how can its unsystematic relationship with the first noun
entry and its systematic relationship with the second noun
entry be represented?
With a distributed representation, no explicit decision
about the distinctiveness of a new sense has to be made.
Each new sense or word simply corresponds to a new
pattern of activation over the finite set of features. As the
difference between two different usages increases or decreases, the difference in the representation of the senses
increases or decreases, respectively. For senses that are
highly similar, as in two of the meanings listed in the third
entry of bass (2a(1): the lowest part in polyphonic or harmonic music, and 2a(2): the lower half of the whole vocal or
instrumental tonal range), the semantic representation
would be nearly identical. The representation for these
meanings would also be similar to the other senses listed
under that entry. For senses that are derivationally related,
as is the case for the second and third entries of bass, the
semantic representations would be similar, although the
representation for the part of speech would be different. By
contrast, the semantic representation of the second and third
entries would be different from that of the first entry. In
other words, each sense differs quantitatively from other
senses of the ambiguous word (as well as all other lexical
entries). In terms of the energy landscape metaphor discussed elsewhere (Kawamoto, 1993), the distance between
the energy minima corresponding to these senses decreases
as similarity between the senses increases. If there are many
related senses, there will be a broad basin of attraction
carved out by these senses during learning. On the other
hand, if the senses are unrelated, there will be widely dispersed basins of attraction.
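One concrete reading of this metaphor is given by the following sketch, which is our own and makes assumptions the text does not: a small Hopfield-type network, Hebbian outer-product storage, and ±1 feature vectors. Each stored sense becomes an energy minimum; settling from each sense and measuring the distance between the settled states shows that minima for related senses lie close together, whereas minima for unrelated senses are widely separated.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 40                                    # semantic feature units (+1/-1)

def store(patterns):
    # Hebbian outer-product weights with no self-connections
    W = sum(np.outer(p, p) for p in patterns).astype(float)
    np.fill_diagonal(W, 0.0)
    return W

def settle(W, s, sweeps=20):
    s = s.copy()
    for _ in range(sweeps):               # asynchronous threshold updates
        for i in rng.permutation(len(s)):
            s[i] = 1.0 if W[i] @ s >= 0 else -1.0
    return s

def hamming(a, b):
    return int(np.sum(a != b))

base = rng.choice([-1.0, 1.0], N)
related = base.copy()
related[:4] *= -1                         # related sense: differs in 4 features
unrelated = rng.choice([-1.0, 1.0], N)    # unrelated sense: random pattern

for label, sense2 in [("related", related), ("unrelated", unrelated)]:
    W = store([base, sense2])
    m1, m2 = settle(W, base), settle(W, sense2)
    print(label, "distance between minima:", hamming(m1, m2))
```

With more related senses stored, the nearby minima would begin to merge, which is the broad basin alluded to above.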
By contrast, with a local representation such as that used
in the IA model, a decision must be made to determine
whether a novel use of an ambiguous word is sufficiently
distinct to warrant its own unit at the word level. Thus, each
new sense differs qualitatively from other existing senses.
Moreover, even in models that attempt to represent similarity between senses, there is still a qualitative distinction
between senses that are considered to be similar (i.e., derivationally related) and those that are not (Jastrzembski,
1981). The freedom allowed by these choices is clearly
illustrated by the decision to represent derivationally related
words either as one entry (e.g., Rubenstein et al., 1971) or
as separate entries (e.g., Cottrell & Small, 1983; Jastrzembski, 1981).
The number of senses that an ambiguous word has is
important in local representation schemes because processing depends critically on that number. For example, in the IA model, adding another sense has two opposite effects. On the one hand, each of the units representing an ambiguous word decreases in its baseline activation (because the frequency of each sense is lower) and thus would become activated more slowly. This effect, however, would not be expected to be large, because the resultant change in baseline activation would have only a small impact on the time course of activation. On the other
hand, by adding another unit to the cohort of cooperating
units, neighbors would be inhibited more severely and positive feedback would also increase significantly. To take an
extreme position, assume that a word unit is generated for
each use of an unambiguous as well as an ambiguous word.
In this case, no ambiguity advantage would be predicted.
This example is admittedly unrealistic, but it does illustrate
the importance of the number of word units. However, more
realistic examples arise in which the distinction between
polysemy and homonymy is critical. Consider a situation in
which there are two distinct polarized senses of an ambiguous word. The dominant sense has a large number of
polysemous senses and the subordinate sense has only a
single sense. If the distinct senses are the entries in the race,
then the dominant sense will usually be the winner. On the
other hand, if the polysemous senses are the entries, then the
single sense corresponding to the subordinate sense could
be the winner most of the time. It should be noted that for
either choice (frequency on the basis of distinct senses or
polysemous senses), decisions will also have to be made
about the distinctiveness of a sense: Are two senses similar
enough to be considered the same sense and are the different
uses of a single sense distinct enough to be considered
different senses? These decisions will have an impact on the
winner because they determine the frequency of a sense and
the number of senses. Finally, frequency of a sense also
depends on whether derivations with the same form (e.g.,
the noun and verb forms of plow) are stored separately (e.g.,
as different logogens or different locations in a list) or
together (e.g., as the same logogen or at the same location
in a list).
Naming
In this section, we consider another task that figures
prominently in experiments of lexical memory: speeded
naming. In distributed models, the dependent variables considered in speeded naming—naming latency and errors—
have been indexed by units representing a word's pronunciation (Kawamoto, 1993; Seidenberg & McClelland, 1989;
Van Orden et al., 1990). Because our model does not
represent pronunciation, we base our discussion on an otherwise identical model that does (Kawamoto,
1993).
One naming study that is relevant to discussions of the
ambiguity advantage effect has recently been reported
(Fera, Joordens, Balota, & Ferraro, 1992). Consistent with
results using the lexical decision task, they found an advantage of ambiguous words over unambiguous words. This
result can be accounted for by our distributed model assuming that the LMS algorithm is used. In this case, the connections from orthographic units to pronunciation units as
well as connections from pronunciation units to other pronunciation units compensate for the smaller contribution of
the semantic units for ambiguous words. Thus, when the
input is just the spelling, activation in the pronunciation
units would increase in absolute value faster for ambiguous
words compared with unambiguous words, just as it
would in the spelling units. Of course, if the Hebbian
algorithm had been used, we would expect the opposite
effect. Indeed, an ambiguity disadvantage is seen when the
Hebbian algorithm is used (M. E. J. Masson, personal
communication, November 6, 1993).
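The opposing predictions can be reproduced in a deliberately stripped-down linear version of the model. The sketch below is ours, not the network reported here or in Masson's simulations: it assumes random ±1 feature vectors, one unambiguous and one ambiguous word of equal total frequency, Hebbian versus delta-rule (LMS) training of a fully connected linear autoassociator, spelling-only input at test, and a readout that correlates the pronunciation units with the correct pronunciation after a few update cycles.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 30                                    # units per block (orth, sem, phon)
N = 3 * D

def vec():
    return rng.choice([-1.0, 1.0], D)

def pattern(o, s, p):
    return np.concatenate([o, s, p])

# One unambiguous word (one meaning) and one ambiguous word (two
# meanings, same spelling and pronunciation), equal total frequency.
o_u, s_u, p_u = vec(), vec(), vec()
o_a, s_a1, s_a2, p_a = vec(), vec(), vec(), vec()
train = [(pattern(o_u, s_u, p_u), 1.0),
         (pattern(o_a, s_a1, p_a), 0.5),  # each sense half as frequent
         (pattern(o_a, s_a2, p_a), 0.5)]

def hebbian(train):
    W = np.zeros((N, N))
    for x, f in train:
        W += f * np.outer(x, x) / N       # frequency-weighted covariance
    return W

def lms(train, eta=0.01, epochs=300):
    W = np.zeros((N, N))
    for _ in range(epochs):
        for x, f in train:
            W += f * eta * np.outer(x - W @ x, x)   # error correction
    return W

def phon_readout(W, o, p, cycles=3):
    x = np.concatenate([o, np.zeros(D), np.zeros(D)])   # spelling-only input
    for _ in range(cycles):
        x = W @ x
    return (x[2 * D:] @ p) / D            # support for correct pronunciation

for name, W in [("Hebbian", hebbian(train)), ("LMS", lms(train))]:
    print(f"{name:7s} unambiguous={phon_readout(W, o_u, p_u):.3f} "
          f"ambiguous={phon_readout(W, o_a, p_a):.3f}")
```

With Hebbian weights, the spelling activates only the blended, and therefore weaker, semantic pattern of the ambiguous word, so recurrent amplification is slower and an ambiguity disadvantage emerges over cycles; with LMS weights, error correction has strengthened the direct orthography-to-pronunciation and pronunciation-to-pronunciation contributions to compensate, producing the advantage.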
In the results just considered, the effect of semantics on
naming arose as a consequence of learning. In other situations, we have observed an effect of semantics on naming
that arises during the processing (Kawamoto & Zemblidge,
1992). For example, in speeded naming of monosyllabic
homographs (e.g., wound), naming latencies of the more
frequent irregular pronunciation are longer than those of the
less frequent regular pronunciation. Moreover, the same
pattern arises for the irregular word pint, although in this
case the regular pronunciation is incorrect. To account for
this pattern of results, there must be at least two constraints
on pronunciation, each with a different time course. In
particular, the regularity constraint must arise before the
word-specific constraint (which ultimately determines the
correct pronunciation of irregular words). In our model, the
regularity constraint corresponds to the mapping from spelling to pronunciation and the word-specific constraint corresponds to the semantically mediated mapping. Thus, our
model, like the model proposed by Van Orden and his
colleagues (Van Orden, 1987; Van Orden et al., 1990),
argues that the regularity constraint is faster than the word-specific constraint, not slower as proponents of the dual-route model have proposed.
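The claim reduces to a crossover between two growth curves: a fast regularity constraint with a modest asymptote and a slower word-specific constraint with a higher one. The rates and strengths in this sketch are arbitrary values of our own, chosen only to make the crossover visible.

```python
import math

def constraint(t, rate, strength):
    # exponential approach to asymptote: strength * (1 - e^(-rate * t))
    return strength * (1.0 - math.exp(-rate * t))

# For wound: the regular pronunciation is supported by the fast
# spelling-to-pronunciation mapping, the more frequent irregular
# pronunciation by the slower semantically mediated mapping.
for t in range(1, 26, 4):
    regular = constraint(t, rate=0.5, strength=1.0)    # fast, weaker
    specific = constraint(t, rate=0.1, strength=1.6)   # slow, stronger
    leader = "regular" if regular > specific else "word-specific"
    print(f"t={t:2d}  regular={regular:.2f}  word-specific={specific:.2f}  {leader}")
```

Responses initiated early in processing therefore favor the regular pronunciation even when the irregular pronunciation is more frequent, which is the ordering of constraints our account shares with Van Orden's.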
Encoding Variability
The model of lexical memory used in these simulations is
a member of a family of content-addressable memories
(J. A. Anderson, Silverstein, Ritz, & Jones, 1977; Kohonen,
1977; McClelland & Rumelhart, 1985). As such, this lexical
memory model should be able to account for more general
memory phenomena. Conversely, one would expect that
general memory models could be used to account for more
specific lexical memory phenomena.
One set of phenomena that content-addressable memories
should be particularly well suited for is encoding effects, the
effect of context on memory. Encoding effects have been
manifested in two ways: as encoding specificity and as
encoding variability (cf. J. R. Anderson, 1990). With encoding specificity, performance is improved when the
context in which something was learned is reinstantiated
(Martin, 1968). For example, Thomson (1972) showed
that when subjects must remember only the second word
in a word pair (e.g., sky blue), their subsequent recognition of the target word is much better when it is presented with the original word pair than when it is presented alone.
With encoding variability, an item presented for study in
different contexts is better remembered than an item presented in just a single context (Madigan, 1969; Melton,
1967, 1970; but also see Rueckl & Olds, 1993). For example, Madigan showed that when a target item is repeated during study, subsequent recall of the target word without any context cues is much better when the target word had been presented with different context words rather than with the same context word on each presentation.
One standard account of encoding variability is based on
encoding specificity: The random context provided during
the test phase is more likely to resemble one of two different
contexts compared with a single context provided during the
study phase (J. R. Anderson, 1990). This standard account
corresponds closely to Bower's (1972) formal model of
encoding variability. According to Bower, a stimulus activates a fixed number of stimulus elements or components
(i.e., "features" in our terminology). The particular elements
that are activated depend on the context. An activated element is associated in an all-or-none fashion to a given
response. Under these assumptions, presentation of the
stimulus with an identical context on a second learning trial
will activate similar stimulus elements, and thus only a
few additional features will subsequently be associated
with the response. By contrast, presentation of the stimulus with a different context on a second learning trial will
activate different stimulus elements, and thus many additional features will subsequently be associated with the
response. Thus, when a random context is present, the
features that are activated (which are fixed in number) are
more likely to overlap with the features activated under
two different contexts than with those activated under the same context during
learning. As in our account based on the LMS algorithm,
it is a nonlinearity in the learning algorithm that gives
rise to a processing advantage when learning occurs under
two different contexts.
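Bower's model translates almost line for line into a small Monte Carlo sketch. The pool size, sample size, and number of test trials below are arbitrary choices of ours; the context-driven sampling and all-or-none association are the model's assumptions.

```python
import random

random.seed(0)
POOL, SAMPLE, TESTS = 100, 20, 10_000   # element pool; elements active per context

def context_sample():
    # the context determines which fixed-size subset of elements is active
    return set(random.sample(range(POOL), SAMPLE))

def recall_probability(study_contexts):
    # all-or-none: every element active at study is associated to the response
    associated = set().union(*study_contexts)
    hits = 0.0
    for _ in range(TESTS):
        test_elements = context_sample()        # random context at test
        # recall succeeds in proportion to associated elements sampled
        hits += len(test_elements & associated) / SAMPLE
    return hits / TESTS

same_context = context_sample()
print("same context twice :", recall_probability([same_context, same_context]))
print("two different ones :", recall_probability([context_sample(), context_sample()]))
```

The nonlinearity is the set union: a second study trial in the same context re-associates mostly the same elements, whereas a different context recruits largely new ones, so a random test context overlaps more with the elements associated under varied study contexts.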
The simulations presented in this study suggest an alternative account of encoding variability. In particular, if we
assume that a context corresponds to the meaning of a word
in our simulations, the encoding variability effect is analogous to the ambiguity advantage effect. Thus, encoding
variability could arise directly out of learning as seen in the
simulations presented in this study without depending on a
stochastic component during processing as has previously
been proposed. More interesting, perhaps, is the fact that the
two approaches make different predictions about the effect
over the course of learning. In particular, our model predicts
a disadvantage early in the course of learning that subsequently becomes an advantage late in learning, whereas the
stimulus sampling approach predicts an advantage throughout the course of learning.
Future Directions
The majority of empirical work on word recognition has
been based on monosyllabic monomorphemic English
words. Not surprisingly, the majority of computer simulation models have also focused on this class of words.
Clearly, however, these words represent just a fraction of
the words in English. Following current empirical trends in
which there is greater interest in multisyllabic, multimorphemic words, there needs to be greater consideration of
these words in computational models of word recognition.
Such an extension would clearly require more hierarchical
representations of pronunciations including syllables and
possibly onsets and rimes. In addition, stress assignment as
well as other grammatical-category-conditioned regularities
in pronunciation would have to be incorporated. We hope
that formulation of models as computer programs will allow
detailed comparison of different models by providing explicit predictions.
References
Anderson, J. A., Silverstein, J. W., Ritz, S. A., & Jones, R. S.
(1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84, 413-451.
Anderson, J. R. (1990). Cognitive psychology and its implications
(3rd ed.). New York: Freeman.
Bower, G. H. (1972). Stimulus-sampling theory of encoding variability. In A. W. Melton & E. Martin (Eds.), Coding processes
in human memory (pp. 85-123). New York: Holt, Rinehart &
Winston.
Carpenter, P. A., & Daneman, M. (1981). Lexical retrieval and
error recovery in reading: A model based on eye fixations.
Journal of Verbal Learning and Verbal Behavior, 20, 137-160.
Chumbley, J. I., & Balota, D. A. (1984). A word's meaning affects
the decision in lexical decision. Memory & Cognition, 12, 590-606.
Clark, H. H. (1973). The language-as-a-fixed-effect fallacy: A
critique of language statistics in psychological research. Journal
of Verbal Learning and Verbal Behavior, 12, 335-359.
Coltheart, M. (1978). Lexical access in simple reading tasks. In G.
Underwood (Ed.), Strategies in information processing (pp.
151-216). San Diego, CA: Academic Press.
Coltheart, M., Curtis, B., Atkins, P., & Haller, M. (1993). Models
of reading aloud: Dual-route and parallel-distributed-processing
approaches. Psychological Review, 100, 589-608.
Cottrell, G. W., & Small, S. L. (1983). A connectionist scheme for
modeling word sense disambiguation. Cognition and Brain
Theory, 6, 89-120.
Duffy, S. A., Morris, R. K., & Rayner, K. (1988). Lexical ambiguity and fixation times in reading. Journal of Memory and
Language, 27, 429-446.
Fera, P., Joordens, S., Balota, D. A., & Ferraro, F. R. (1992,
November). Ambiguity in meaning and phonology: Effects on
naming. Poster presented at the annual meeting of the Psychonomic Society, St. Louis, MO.
Forster, K. I., & Bednall, E. S. (1976). Terminating and exhaustive
search in lexical access. Memory & Cognition, 4, 53-61.
Francis, W. N., & Kucera, H. (1982). Frequency analysis of
English usage: Lexicon and grammar. Boston: Houghton
Mifflin.
Gernsbacher, M. A. (1984). Resolving 20 years of inconsistent
interactions between lexical familiarity and orthography, concreteness, and polysemy. Journal of Experimental Psychology:
General, 113, 256-281.
Golden, R. M. (1986). The "brain-state-in-a-box" neural model is
a gradient descent algorithm. Journal of Mathematical Psychology, 30, 73-80.
Grossberg, S., & Stone, G. (1986). Neural dynamics of word
recognition and recall: Attentional priming, learning, and resonance. Psychological Review, 93, 46-74.
Hopfield, J. J. (1982). Neural networks and physical systems with
emergent collective computational abilities. Proceedings of the
National Academy of Sciences USA, 79, 2554-2558.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons.
Proceedings of the National Academy of Sciences USA, 81,
3088-3092.
Jastrzembski, J. E. (1981). Multiple meanings, number of related
meanings, frequency of occurrence, and the lexicon. Cognitive
Psychology, 13, 278-305.
Jastrzembski, J. E., & Stanners, R. F. (1975). Multiple word meanings and lexical search speed. Journal of Verbal Learning and
Verbal Behavior, 14, 534-537.
Joordens, S., & Besner, D. (1994). When banking on meaning is
not (yet) money in the bank: Explorations in connectionist
modeling. Journal of Experimental Psychology: Learning,
Memory, and Cognition, 20, 1051-1062.
Kawamoto, A. H. (1988). Distributed representations of ambiguous words and their resolution in a connectionist network. In
S. I. Small, G. W. Cottrell, & M. K. Tanenhaus (Eds.), Lexical
ambiguity resolution (pp. 195-228). San Mateo, CA: Morgan
Kaufmann.
Kawamoto, A. H. (1993). Nonlinear dynamics in the resolution of
lexical ambiguity: A parallel distributed processing account.
Journal of Memory and Language, 32, 474-516.
Kawamoto, A. H. (1994). One system or two to handle regulars and
exceptions: How time-course of processing can inform this
debate. In S. D. Lima, R. Corrigan, & G. Iverson (Eds.), The
reality of linguistic rules (pp. 389-415). Philadelphia: John
Benjamins.
Kawamoto, A. H., & Zemblidge, J. (1992). Pronunciation of homographs. Journal of Memory and Language, 31, 349-374.
Kellas, G., Ferraro, F. R., & Simpson, G. B. (1988). Lexical ambiguity and the time course of attentional allocation in word
recognition. Journal of Experimental Psychology: Human Perception and Performance, 14, 601-609.
Kohonen, T. (1977). Associative memory. Berlin: Springer.
Kosko, B. (1988). Bi-directional associative memories. IEEE
Transactions on Systems, Man and Cybernetics, 18, 49-60.
Madigan, S. A. (1969). Intraserial repetition and coding processes
in free recall. Journal of Verbal Learning and Verbal Behavior, 8, 828-835.
Martin, E. (1968). Stimulus meaningfulness and paired-associate
transfer: An encoding variability hypothesis. Psychological Review, 75, 421-441.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: 1. An
account of basic findings. Psychological Review, 88, 375-407.
McClelland, J. L., & Rumelhart, D. E. (1985). Distributed memory
and the representation of general and specific knowledge. Journal of Experimental Psychology: General, 114, 159-188.
McClelland, J. L., & Rumelhart, D. E. (1988). Explorations in
parallel distributed processing: A handbook of models, programs, and exercises. Cambridge, MA: MIT Press.
Melton, A. W. (1967). Repetition and retrieval from memory.
Science, 158, 532.
Melton, A. W. (1970). The situation with respect to the spacing of
repetitions and memory. Journal of Verbal Learning and Verbal
Behavior, 9, 596-606.
Merriam-Webster, Incorporated. (1993). Merriam-Webster's Collegiate Dictionary (10th ed.). Springfield, MA: Author.
Millis, M. L., & Button, S. B. (1989). The effect of polysemy on
lexical decision time: Now you see it, now you don't. Memory
& Cognition, 17, 141-147.
Monsell, S., Patterson, K. E., Graham, A., Hughes, C. H., &
Milroy, R. (1992). Lexical and sublexical translation of spelling
to sound: Strategic anticipation of lexical status. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 18, 452-467.
Morton, J. (1979). Word recognition. In J. Morton & J. C.
Marshall (Eds.), Psycholinguistics: 2. Structures and processes.
Cambridge, MA: MIT Press.
Murre, J. M. M., Phaf, R. H., & Wolters, G. (1992). CALM: Categorizing and learning module. Neural Networks, 5, 55-82.
Norris, D. (1994). A quantitative multiple-levels model of reading
aloud. Journal of Experimental Psychology: Human Perception
and Performance, 20, 1212-1232.
Rayner, K., & Duffy, S. A. (1986). Lexical complexity and fixation times in reading: Effects of word frequency, verb complexity, and lexical ambiguity. Memory & Cognition, 14, 191-201.
Rubenstein, H., Garfield, L., & Millikan, J. A. (1970). Homographic entries in the internal lexicon. Journal of Verbal Learning and Verbal Behavior, 9, 487-494.
Rubenstein, H., Lewis, S. S., & Rubenstein, M. A. (1971). Homographic entries in the internal lexicon: Effects of systematicity
and relative frequency of meanings. Journal of Verbal Learning
and Verbal Behavior, 10, 57-62.
Rueckl, J. G., & Olds, E. M. (1993). When pseudowords acquire
meaning: Effect of semantic associations on pseudoword repetition priming. Journal of Experimental Psychology: Learning,
Memory, and Cognition, 19, 515-527.
Seidenberg, M. S., & McClelland, J. L. (1989). A distributed,
developmental model of word recognition and naming. Psychological Review, 96, 523-568.
Thomson, D. M. (1972). Context effects on recognition memory.
Journal of Verbal Learning and Verbal Behavior, 11, 497-511.
Van Orden, G. C. (1987). A ROWS is a ROSE: Spelling, sound,
and reading. Memory & Cognition, 15, 181-198.
Van Orden, G. C., & Goldinger, S. D. (1994). Interdependence of
form and function in cognitive systems explains perception of
printed words. Journal of Experimental Psychology: Human
Perception and Performance, 20, 1269-1291.
Van Orden, G. C., Pennington, B. F., & Stone, G. O. (1990). Word
identification in reading and the promise of subsymbolic psycholinguistics. Psychological Review, 97, 488-522.
Widrow, B., & Stearns, S. D. (1985). Adaptive signal processing.
Englewood Cliffs, NJ: Prentice Hall.
Appendix
Test Words Used in the Simulation and Their
Francis and Kucera (1982) Frequency
Unambiguous          Two equiprobable       Four equiprobable
Word    Frequency    Word    Frequency      Word    Frequency
CURE    28           HUGE    30             LANE    30
GROW    53           LAKE    37             FOOL    37
SLID    24           POND    15             LOCK    23
DISK    25           PAST    60             DULL    27
LIFT    23           MILE    45             HORN    31
FAIL    37           PAIR    50             DIVE    23
CUTS    30           SHIP    81             REAR    51
GAVE    60           LOAN    23             TELL    60
NECK    81           CHIN    27             FILE    81
PINK    48           SUIT    48             BOND    46
CAST    45           SAND    28             TAIL    36
TORN    25           PLUG    25             LOAD    45
Received January 31, 1994
Revision received April 1, 1994
Accepted April 15, 1994