Machine-to-man communication by speech
Part II: Synthesis of prosodic features of
speech by rule
by JONATHAN ALLEN
Research Laboratory of Electronics, Massachusetts Institute of Technology
Cambridge, Massachusetts
For several years, research has gone on in an attempt to develop a reading machine for the blind.
Such a machine must be able to scan letters on a normal printed page, then recognize the scanned letters
and punctuation, and finally convert the resultant
character strings into an encoded form that may be
perceived by some nonvisual sensory modality. Within recent years, at the Massachusetts Institute of
Technology, an opaque scanner has been developed,1
and an algorithm for recognizing scanned letters has
been devised.2 The output display can take many
forms, but the form that we feel is best suited for acceptably high reading speeds and intelligibility is
synthesized speech. Effort has recently been focused
on the conversion of orthographic letter strings to
synthesized speech.
An algorithm for grapheme-to-phoneme conversion
(letter representation to sound representation) has
been invented by Lee,3 which is capable of specifying
sufficient phonemic information to a terminal analog
speech synthesizer for translation to synthesizer commands. The algorithm uses a dictionary to store the
constituent morphs of English words, together with
their phonemic representation. Hence each scanned
word is transformed into a concatenated string of
phonemic symbols that are then interpreted by the
synthesizer.
The resulting speech is usually intelligible, but not
suitable for long-term use. Several problems remain,
apart from those concerned directly with speech
synthesis by rule from phonemic specifications. First,
many words can be nouns or verbs, depending on context [refuse, incline, survey], and proper stress cannot
be specified until the intended syntactic form class is
known. Second, punctuation and phrase boundaries
may be used to specify pauses that help to make the
complete sentence understandable. Third, more
complicated stress contours over phrases can be
specified which facilitate sentence perception. Finally,
intonation contours, or "tunes," are important for
designating statements, questions, exclamations, and
continuing or terminal juncture. These features (stress,
intonation, and pauses) comprise the main prosodic
or suprasegmental features of speech.
Several experiments4,5,6 have shown that we tend
to perceive sentences in chunks or phrasal units, and
that the grammatical structure of these phrases is
important for the correct perception of the sentence.
In order to display this required structure to a listener,
a speaker makes use of many redundant devices,
among them the prosodic features, to convey the syntactic surface structure. When speech is being synthesized in an imperfect way at the phonemic level, the
addition of these features can be used by
listeners to compensate for the lack of other information. The listener may then use these cues to hypothesize the syntactic structure, and hence generate his
own phonetic "shape" of the perceived sentence.
There is little reason to believe that the perceived
stress contour, for example, must represent some
continuing physical property of the utterance, since
the listener uses some form of internalized rules to
"hear" the stress contour, whether or not it is physically present in a clear way. Hence, once the syntactic
surface structure can be determined, the "stress" can
be heard. Alternatively, prosodic features can be
used in a limited fashion to help point out the surface
structure, which is then used in the perception of the
phonetic shape of the sentence.
The present paper describes a procedure for parsing
sentences composed of words that are in turn derived
from the morphs provided by the grapheme-to-phoneme decomposition, as well as a phonological
procedure for specifying prosodic features over the
From the collection of the Computer History Museum (www.computerhistory.org)
Spring Joint Computer Conference, 1968
revealed phrases. As we have indicated, only a limited
amount of the sentence is parsed and provided with
prosodics, since the listener will "hear" the entire
sentence once the structure is clear. We consider
first the required parts-of-speech preprocessor, then
the parser, and finally the phonological algorithm.
Parts-of-speech preprocessor
After the grapheme-to-phoneme conversion is
complete, many words will have been decomposed
into their constituent morphs. For example, [grasshopper] → [grass] + [hop] + [er], and [browbeat] → [brow] + [beat]. Each of these morphs corresponds to a dictionary entry that contains, in addition to phonemic specifications, parts-of-speech information. In the case of morphs that can exist alone
([grass, hop, brow,] etc.) this information consists in
a set of parts of speech for that word, called the grammatical homographs of the word, and this set often
has more than one homograph. For prefixes and
suffixes ([re-, -s, -er, -ness,] etc.), information is
given indicating the resultant part of speech when
the prefix or suffix is concatenated with a root morph.
Thus [-ness] always forms a noun, as in [goodness]
and [madness].
Other researchers7,8 have used a computational
dictionary to compute parts of speech, relying on the
prevalence of function words (determiners, prepositions, conjunctions, and auxiliaries), together with
suffix rules of the type just described and their accompanying exception lists. This procedure, of course,
keeps the lexicon small, but results in arbitrary parts-of-speech classification when the word is not a function word, and does not have a recognizable suffix.
Furthermore, ambiguous suffixes such as [-s] (implying plural noun or singular verb) carry over their
ambiguity to the entire word, whereas if the root
word has a unique part of speech like [cat], our
procedure gives a unique result: [cats] (plural noun).
Hence the presence of the morph lexicon can often
be used to advantage, especially in the prevalent
noun/verb ambiguities.
The parts-of-speech algorithm considers each morph
of the word and its relation with its left neighbor, starting from the right end of the word. If there are two or
more suffixes [commendables, topicality] the suffixes are entered into a last-in first-out push-down
stack. Then the top suffix is joined to the root morph,
and the additional suffixes are concatenated until
the stack is empty. Compounding is done next, and
finally any prefixes are attached. Prefixes generally
do not affect the part of speech of the root morph, but
[em-, en-,] and [be-] all change the part of speech to
verb. Compounds can occur in English in any of three
ways, and there appears to be no reliable method for
distinguishing these classes. There can, of course, be
two separate words, as in [bus stop], or two words
hyphenated, as in [hand-cuff], or finally, two root
words concatenated directly, as in [sandpaper].
The parts-of-speech algorithm treats the last two
cases, leaving the two-word case for the parser to
handle. The algorithm ignores the presence of a
hyphen, except that it "remembers" that the hyphen
occurred, and then processes hyphenated and one-word compounds as though they were both single
words. The parts of speech of the two elements of the
compound are considered as row and column entries
to a matrix whose cells yield the resulting part of
speech. Thus Adverb·Noun → Noun ([underworld]). In general, since each element may have
several parts of speech, the matrix is entered for each
possible combination, but the maximum number of
resulting parts of speech is three. Combinations of
suffixes with compounds ([handwriting]) can be
accommodated, as well as one-word compounds
containing more than two morphs.
The algorithm has a special routine to handle
troublesome suffixes such as [-er, -es, -s], in an attempt to reduce the resulting number of parts of
speech to a minimum.
In this way, the algorithm makes use of the parts of
speech information of the individual morphs to compute the parts of speech set for the word formed by
these morphs. These sets then serve as input to the
parser, after having first been ordered to suit the principles of the parser.
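The morph-to-word computation described above can be sketched as follows. This is a minimal illustration with a toy lexicon; the names LEXICON, SUFFIX_POS, and COMPOUND are our own hypothetical devices, not the authors' data structures. Suffixes are pushed onto a last-in first-out stack as the word is scanned from the right, then popped and joined to the root, and compound parts of speech are read from a matrix.

```python
# Toy morph lexicon: free morphs map to their grammatical homographs;
# suffixes map to the parts of speech they can produce.
LEXICON = {
    "grass": {"noun"}, "hop": {"noun", "verb"}, "cat": {"noun"},
    "good": {"adjective"}, "topic": {"noun"},
}
SUFFIX_POS = {
    "-ness": {"noun"},              # [-ness] always forms a noun
    "-al":   {"adjective"},
    "-ity":  {"noun"},
    "-s":    {"noun", "verb"},      # plural noun or singular verb
}
# Compound matrix: (left POS, right POS) -> resulting POS,
# e.g. Adverb-Noun -> Noun, as in [underworld].
COMPOUND = {("adverb", "noun"): "noun", ("noun", "noun"): "noun"}

def apply_suffix(stem_pos, suffix):
    # Special routine for ambiguous suffixes such as [-s]: if the stem
    # has a unique part of speech, the result is unambiguous ([cats]).
    if suffix == "-s" and len(stem_pos) == 1:
        return {"noun"} if "noun" in stem_pos else {"verb"}
    return SUFFIX_POS[suffix]

def word_pos(root, suffixes_seen_right_to_left):
    """Pop suffixes off a LIFO stack; the innermost suffix joins first."""
    stack = list(suffixes_seen_right_to_left)
    pos = LEXICON[root]
    while stack:
        pos = apply_suffix(pos, stack.pop())
    return pos

def compound_pos(left_pos, right_pos):
    """Enter the matrix for every combination of element homographs."""
    return {COMPOUND[(l, r)] for l in left_pos for r in right_pos
            if (l, r) in COMPOUND}
```

For [topicality] the right-to-left scan meets [-ity] before [-al], so the stack pops [-al] first and joins it to the root; for [cats] the unique part of speech of the root resolves the ambiguity of [-s].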
Parsing
As we have remarked, if a listener is aware of the
surface syntactic structure of a spoken sentence, then
he may generate internally the accompanying prosodic
features to the extent that they are determinable by
linguistic rules forming part of his language competence. Hence we desire to make this structure evident
to the listener by providing cues to the syntax in the
prosodics of the synthesized speech. To do this, we
must first determine the structure, and tlien implement
prosodics corresponding to the structure. Since we
are trying to provide only a limited number of such
cues (enough to· allow the structure to be deduced), we
have designed a limited parser that reveals the syntax
of only a portion of the sentence. We have tried to
find the simplest parser consistent with the phonological goals that would also use minimum core storage and run fast enough (in the context of the over-all
reading machine) to allow for a realistic speaking rate,
say, 150-180 words per minute. Because the absence
or incorrect implementation of prosodics in a small
percentage of the output sentences is not likely to be
catastrophic, we can tolerate occasional mistakes by
the parser, but we have tried to achieve 90 per cent
accuracy. These requirements, for a limited, phrase-level parser operating in real-time at comfortable
speaking rates within restricted core storage, are indeed severe, and many features found in other parsers
are absent here. We do not use a large number of
parts of speech classifications, nor do we exhaustively
cycle through all the homographs of the words of a
sentence to find all possible parsings. Inherent syntactic ambiguity ([They are washing machines]) is
ignored, the resulting phrase structures being biased
toward noun phrases and prepositional phrases. No
deep-structure "trees" are obtained, since these are
not needed in the phonological algorithm, and only
noun phrases and prepositional phrases are detected,
so that no sentencehood or clause-level tests are made.
We do, however, compute a bracketed structure
within each detected phrase, such as [the [old house]]
and [in [[brightly lighted] windows]], since this
structure is required by the phonological algorithm.
The result is a context-sensitive parser that avoids
time-consuming enumerative procedures, and consults alternative homographs only when some condition is detected (such as [to] used to introduce
either an infinitive or a prepositional phrase) which
requires such a search.
The parser makes two passes (left-to-right) over a
given input sentence. The first pass computes a tentative bracketing of noun phrases and prepositional
phrases. Inasmuch as this initial bracketing makes no
clause-level checks and does not directly examine the
frequently occurring noun/verb ambiguities, it is
followed by a special routine designed to resolve these
ambiguities by means of local context and grammatical
number agreement tests. These last tests are also designed to resolve noun/verb ambiguities that do not
occur in bracketed phrases, as [refuse] in [They
refuse to leave.]. As a result of these two passes, a
limited phrase bracketing of the sentence is obtained,
and some ambiguous words have been assigned a
unique part of speech, yet several words remain as
unbracketed constituents.
The first pass is designed to quickly set up tentative
noun phrase and prepositional phrase boundaries. This
process may be thought of as operating in three parts.
The program scans the sentence from left to right
looking for potential phrase openers. For example,
determiners, adjectives, participles, and nouns may
introduce noun phrases, and prepositional phrases
always start with a preposition. In the case of some
introducers, such as present participles, words further
along in the sentence are examined, as well as previous words, to determine the grammatical function of
the participle, as in [Wiring circuits is fun.] Once a
phrase opener has been found, very quick relational
tests between neighboring words are made to determine whether the right phrase boundary has been
reached. These checks are possible because English
relies heavily on word order in its structure. Having
found a tentative right phrase boundary, right context
checks are made to determine whether or not this
boundary should be accepted. After completion of
these checks, the phrase is closed and a new phrase
introducer is looked for. This procedure continues
until the end of the sentence is reached.
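The three-part scan of the first pass might be sketched as below. The opener and continuer categories are illustrative guesses, much coarser than the quick relational tests between neighboring words that the paper describes, and the function name is our own.

```python
# Tentative first-pass bracketing: find a phrase opener, extend the
# phrase while the neighbor can still continue it, close it, resume.
OPENERS = {"determiner", "adjective", "participle", "noun", "preposition"}
CONTINUERS = {"determiner", "adjective", "adverb", "participle", "noun"}

def bracket(tagged):
    """tagged: list of (word, pos) pairs. Returns half-open phrase spans."""
    phrases, i = [], 0
    while i < len(tagged):
        if tagged[i][1] in OPENERS:
            j = i + 1
            while j < len(tagged) and tagged[j][1] in CONTINUERS:
                j += 1                  # extend to a tentative right boundary
            phrases.append((i, j))
            i = j                       # look for the next phrase introducer
        else:
            i += 1
    return phrases
```

On [They saw the gray house], tagged with one part of speech per word, this yields a single span covering [the gray house]; the real first pass additionally makes right-context checks before accepting a boundary.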
When the bracketing is complete, further tests are
made to check for errors in bracketing caused by frequent noun/verb ambiguities. For example, the sentence [That old man lives in the gray house.] would be
initially bracketed:
[That old man lives]NP [in the gray house]PrepP.
Notice that sentencehood tests (although not performed by the parser) would immediately reveal that
the sentence lacks a verb, and further routines could
deduce that [lives], which can be a noun (plural) or
verb (third person singular), is functioning as a verb,
although the bracketing routine, since it is biased toward noun homographs, made [lives] part of the noun
phrase. We also note the importance of this error
for the phonetic shape of the sentence, since [lives]
changes its phonemic structure according to its
grammatical function in the sentence. An agreement test, however, compares the rightmost "noun"
with any determiners that may reflect grammatical
number. In this case, [that] is a singular demonstrative pronoun, so we know that [lives] does not agree
with it, and hence must be a verb. After the agreement test has been made for each noun phrase, local
context checks are used in an attempt to remove
noun/verb ambiguities that are important for the
phonological implementation, and yet have not
been bracketed into phrases containing more than
one word. Thus in the sentence [They produce
and develop many different machines.] , the algorithm would note that [produce] is immediately
preceded by a personal pronoun in the nominative
case, and hence the word is functioning as a verb.
Such knowledge can then be used to put stress on the
second syllable of the word in accordance with its
function.
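The two disambiguation tests just described can be reduced to a sketch like the following; the word lists are toy samples and the function names are ours, not the authors'.

```python
SINGULAR_DETERMINERS = {"that", "this", "a", "an"}
NOMINATIVE_PRONOUNS = {"i", "we", "they", "he", "she"}

def agreement_test(determiner, rightmost_is_plural_noun):
    """[That old man lives]: a singular demonstrative cannot agree with
    a plural noun, so the noun/verb homograph must be a verb."""
    if determiner.lower() in SINGULAR_DETERMINERS and rightmost_is_plural_noun:
        return "verb"
    return "noun"

def local_context_test(preceding_word):
    """[They produce ...]: a nominative personal pronoun immediately
    before a noun/verb homograph marks it as functioning as a verb."""
    return "verb" if preceding_word.lower() in NOMINATIVE_PRONOUNS else "noun"
```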
At the conclusion of the parsing process described
above, phrase boundaries for noun phrases and prepositional phrases have been marked, but the structure
within the phrase is not known. In order to apply the
rules that are used for computing stress patterns within the phrase, however, internal bracketing must be
provided. For this reason, determiner-adjective-noun
sequences are given a "progressive" bracketing, as
[the [long [red barn]]], whereas noun phrases beginning with adverbials are given "regressive" bracketings, as [[[very brightly] projected] pictures]. A
preposition beginning a prepositional phrase always
has a progressive relation to the remaining noun
phrase, so that we have [in [the [long [red barn]]]]
and [of [[[very brightly] projected] pictures]].
Furthermore, two nouns together, as in [the local
bus stop], are marked as a compound for use by the
phonological algorithm.
The procedure described above is thus able to detect noun phrases and prepositional phrases and to
compute the internal structure of these phrases. The
grammar and parsing logic are intertwined in this
procedure, so that an explicit statement of the grammar is impossible. Nevertheless, the rules are easily
modified, and additions can readily be made. If, for
example, we decide to detect verbal constructions,
this could easily be done. At present, however, we
feel that recognition of noun phrases and prepositional
phrases and the provision of prosodics for these
phrases is sufficient to allow the listener to deduce the
correct syntactic structure for large samples of
representative text.
Phonological algorithm
The method for detecting and bracketing noun
phrases and prepositional phrases has now been
described. We assume that this surface structure is
sufficient to allow the specification of stress and intonation within these phrasal units. The basis for this
assumption is given in the work of Chomsky and
Halle. 9 The phonological algorithm then uses the
surface syntactic bracketing, plus punctuation and
clause-marker words, to deduce the pattern for stress,
pauses, and intonation related to the detected phrases.
In the present implementation, only three acoustic
parameters are varied to implement the prosodic
features. These are fundamental frequency (f0), vowel
duration, and pauses. It is well known that juncture
pauses have acoustic effects on the neighboring phonemes other than vowel lengthening and f0 changes,
but these effects are ignored in the present synthesis.
We thus consider f0, vowel duration, and pauses to
constitute an interacting parameter system that serves
as a group of acoustic features used to implement the
prosodics. The "sharing" of f0 for use in marking
both stress and intonation contours is another example of the interactive nature of these acoustic parameters.
Stress is implemented within the detected phrases
by iterative use of the stress cycle rules, described by
Chomsky and Halle. 9 These rules operate on the two
constituents within the innermost brackets to specify
where main stress should be placed. All other stresses
are then "pushed down" by one. (Here, "one" is the
highest stress.) The innermost brackets are then
"erased," and the rules applied to the next pair of
constituents. This cycle is then continued until the
phrase boundaries are reached. For compounds, the
rules specify main stress on the leftmost element
(compound rule), whereas for all other syntactic
units (e.g., phrases) main stress goes on the rightmost unit (nuclear stress rule). For example, we have
[the [long [red barn]]]
             2    1
        2    3    1

where initially stress is 1 on all units except the article the, and two cycles of the phrase rule are used. The parser has, of course, provided the bracketing of the phrase. Also,
[in [[[very brightly] lighted] rooms]]
        2      1
        3      2        1
        4      3        2       1

requires three applications of the rules, and
[the [new [bus stop]]]
             1    2
        2    1    3

which contains a compound, requires two iterations.
It is clear that for long phrases requiring several
iterations, say n, there will be n + I stress levels.
Most linguists, however, recognize no more than four
levels, so the algorithm clips off the lower levels. At
present, three levels are being used, but this limit can
be easily changed in the program.
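The cycle can be sketched as a small recursive procedure. The tuple encoding ("P" for a phrase pair, "C" for a compound pair) and the function-word list are our own devices for illustration, not the authors' program; the clipping step keeps at most the stated number of stress levels.

```python
FUNCTION_WORDS = {"the", "a", "an", "in", "of", "to"}

def cycle(node):
    """node: a word, or ("P"|"C", left, right). Returns [(word, stress)];
    1 is main stress, None marks an unstressed function word."""
    if isinstance(node, str):
        return [(node, None if node in FUNCTION_WORDS else 1)]
    kind, left, right = node
    li, ri = cycle(left), cycle(right)
    items = li + ri
    # an unstressed constituent (e.g. an article) triggers no cycle
    if not any(s for _, s in li) or not any(s for _, s in ri):
        return items
    ones = [i for i, (_, s) in enumerate(items) if s == 1]
    # compound rule: main stress on the leftmost element;
    # nuclear stress rule: main stress on the rightmost unit
    keep = ones[0] if kind == "C" else ones[-1]
    # all other stresses are "pushed down" by one
    return [(w, s if i == keep or s is None else s + 1)
            for i, (w, s) in enumerate(items)]

def stress(node, levels=3):
    """Clip off the lower levels, as the algorithm does."""
    return [(w, s if s is None else min(s, levels)) for w, s in cycle(node)]
```

Run on the worked examples in the text, this gives long 2, red 3, barn 1 for the first phrase, and new 2, bus 1, stop 3 for the compound example.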
In the examples it has been implicitly assumed that
each content word started with main stress before the
rules were applied. Each word does have a main
stress initially, but in general each word has its own
stress contour, as, for example, in the triple [nation,
national, nationality]. (As Lee3 has pointed out, pairs
such as [nation/national] can be handled by placing
the two words directly in the morph dictionary, but
we have tried to extend the stress algorithm to cover
many of these cases. Clearly, there is a compromise
between processing time and dictionary size to be
determined by experience.) Thus the algorithm must
compute the stress for individual words by applying
rules for compounds and suffixes. The compound rule
is the same as for two separate words that comprise a compound (e.g., [bus stop, browbeat]). Each
morph in the lexicon is given lexical stress, so that
an initial stress contour is provided. Each suffix is
also provided with information about its effect on
stress. Hence [-s, -ed] and [-ing] all leave the root
morph stress unaltered, and have the lowest level
stress for themselves. Another example is the suffix
[-ion], which always places main stress on the
immediately preceding vowel (e.g., [nationalization, distribution]). At present, such changes in stress
of the root word are not computed by rule. In this way,
stress contours for individual words are first computed, and then these are "placed" in the bracketed
phrase structure and the stress cycle is applied until
the over-all stress pattern is obtained for the whole
phrase. Note that function words receive no stress, so
that stress is controlled for these words, even though
they do not appear in bracketed phrases.
Pauses are provided in a definite hierarchy throughout each sentence. The following disposition of pauses
has been arrived at empirically, and represents a
compromise between naturalness and intelligibility.
At present, no pauses are used within the word at the
juncture between any two morphs. Within a bracketed
phrase or between two adjacent unbracketed constituents no pauses are used between words. At phrase
boundaries, pauses of 200 to 400 msec have been
used to set off the detected phrase. Short pauses of
100 msec are used where commas and semicolons
appear, and pauses of 200 msec are inserted before
clause-marker words such as [that, since, which]
etc., which serve to break up the sentence into clausal
units. Finally, terminal pauses of 500 msec are
provided for colon, period, question mark, and exclamation point. Thus a hierarchy of pauses is used to
help make the grammatical structure of the sentence
clear.
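The pause hierarchy above can be summarized as a table. The durations are those quoted in the text; the boundary labels themselves are our own invention.

```python
# Pause durations (msec) in the hierarchy, from tightest to loosest juncture.
PAUSE_MS = {
    "morph_juncture": 0,            # no pause between morphs within a word
    "within_phrase": 0,             # no pause between words in a phrase
    "comma_semicolon": 100,         # short pause at commas and semicolons
    "clause_marker": 200,           # before [that, since, which], etc.
    "phrase_boundary": (200, 400),  # range used to set off detected phrases
    "terminal": 500,                # colon, period, question mark, exclamation
}
```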
The provision of intonational f0 contours by rule
has been described by Mattingly,10 and our technique
is similar to his. The slope of the f0 contour is controlled by the specific phonemes encountered in the
sentence, and by the nature of the pause at the end
of the phrasal unit. Rising terminal contours are
specified at the end of interrogative clauses just
preceding the question mark, except when the clause
starts with a [wh-] word, as [where is the station?].
In the absence of a question mark, the intonation f0
contour is falling with a slope determined by rule as is
done by Mattingly.
The starting point for f0 at the beginning of a sentence is fixed at 110 Hz. The jumps in f0 for the various stress levels vary with the initial value of f0, but
nominally they are 12, 15, and 30 Hz corresponding to
the stress levels 3, 2, and 1 respectively. As noted before, 1 corresponds to the highest stress in our system.
Subjective experience
The method of implementing prosodic features on
the limited basis described above has been used in
connection with the TX-0 computer at M.I.T.,
driving a terminal analog synthesizer. While the
resulting speech is still unnatural in many respects, a
substantial improvement in speech quality has been
attained. It appears that by using limited phrase level
parsing and implementation of prosodics mainly within these phrases, sufficient cues can be provided to the
listener to enable him to detect the grammatical
structure of the sentence and hence provide his own
internal phonetic shape for the sentence. Since this
system will become part of a complete computer-controlled reading machine operating in real time, it
is encouraging to find that such a limited approach is
able to improve the speech quality markedly. We anticipate that further work on both phonemic and
prosodic synthesis rules will yield even greater intelligibility and naturalness in the output speech, with
little additional computing load placed on the system.
DISCUSSION
The speech synthesis system described here has been
developed for research purposes. Hence the implementation of our speech synthesis system has remained very flexible so that further improvements
can be easily accommodated. Better rules for phonemic synthesis are being developed, and will be incorporated into the system. Much work remains to be
done on the determination of the physiological mechanisms underlying stress, and the resultant observable
phonetic patterns which arise from these articulations. Particular attention is being focused on the nature and interaction of fo and vowel duration as correlates of stress. There will also undoubtedly be further improvements in the parsing procedure as
experience dictates. From the linguistic point of
view, the lexicon for a language should contain only
the idiosyncrasies of a language, everything derivable
by rule being computed as part of the language user's
performance. Engineering considerations, however,
clearly dictate a compromise with this view, and the
cost of memory versus the cost of computing with an
extensive set of rules must be examined further. It
may, for example, become feasible to compute lexical
stress by rule, but any advantages of this procedure
must outweigh the cost in time and program storage
for these rules.
ACKNOWLEDGMENTS
This work was supported principally by the National
Institutes of Health (Grant 1 P01 GM-1490-01) and
in part by the Joint Services Electronics Program
(Contract DA28-043-AMC-02536(E)); additional
support was received through a fellowship from Bell
Telephone Laboratories, Inc.
REFERENCES

1 C L SEITZ
An opaque scanner for reading machine research
SM Thesis MIT 1967
2 J K CLEMENS
Optical character recognition for reading machine applications
Doctoral Thesis MIT 1965
3 F F LEE
A study of grapheme to phoneme translation of English
PhD Thesis MIT 1965
4 G A MILLER
Decision units in the perception of speech
IRE Transactions on Information Theory Vol IT-8 No 2 p 81 February 1962
5 G A MILLER G A HEISE W LICHTEN
The intelligibility of speech as a function of the context of the test materials
J Exptl Psychol 41 p 329 1951
6 G A MILLER S ISARD
Some perceptual consequences of linguistic rules
J Verb Learn Verb Behav 2 p 217 1963
7 S KLEIN R F SIMMONS
A computational approach to the grammatical coding of English words
J Assoc Computing Machinery 10 p 334 1963
8 D C CLARKE R E WALL
An economical program for the limited parsing of English
AFIPS Conference Proceedings p 307 Fall Joint Comp Conf 1965
9 N CHOMSKY M HALLE
Sound patterns of English
(in press)
10 I G MATTINGLY
Synthesis by rule of prosodic features
Language and Speech 1966