Machine-to-man communication by speech
Part II: Synthesis of prosodic features of
speech by rule
by JONATHAN ALLEN
Research Laboratory of Electronics, Massachusetts Institute of Technology
Cambridge, Massachusetts
For several years, research has gone on in an attempt to develop a reading machine for the blind.
Such a machine must be able to scan letters on a normal printed page, then recognize the scanned letters
and punctuation, and finally convert the resultant
character strings into an encoded form that may be
perceived by some nonvisual sensory modality. Within recent years, at the Massachusetts Institute of
Technology, an opaque scanner has been developed,1
and an algorithm for recognizing scanned letters has
been devised.2 The output display can take many
forms, but the form that we feel is best suited for acceptably high reading speeds and intelligibility is
synthesized speech. Effort has recently been focused
on the conversion of orthographic letter strings to
synthesized speech.
An algorithm for grapheme-to-phoneme conversion
(letter representation to sound representation) has
been invented by Lee,3 which is capable of specifying
sufficient phonemic information to a terminal analog
speech synthesizer for translation to synthesizer commands. The algorithm uses a dictionary to store the
constituent morphs of English words, together with
their phonemic representation. Hence each scanned
word is transformed into a concatenated string of
phonemic symbols that are then interpreted by the
synthesizer.
The resulting speech is usually intelligible, but not
suitable for long-term use. Several problems remain,
apart from those concerned directly with speech
synthesis by rule from phonemic specifications. First,
many words can be nouns or verbs, depending on context [refuse, incline, survey], and proper stress cannot
be specified until the intended syntactic form class is
known. Second, punctuation and phrase boundaries
may be used to specify pauses that help to make the
complete sentence understandable. Third, more
complicated stress contours over phrases can be
specified which facilitate sentence perception. Finally,
intonation contours, or "tunes," are important for
designating statements, questions, exclamations, and
continuing or terminal juncture. These features (stress,
intonation, and pauses) comprise the main prosodic
or suprasegmental features of speech.
Several experiments4,5,6 have shown that we tend
to perceive sentences in chunks or phrasal units, and
that the grammatical structure of these phrases is
important for the correct perception of the sentence.
In order to display this required structure to a listener,
a speaker makes use of many redundant devices,
among them the prosodic features, to convey the syntactic surface structure. When speech is being synthesized in an imperfect way at the phonemic level, the
addition of these features can be used by
listeners to compensate for the lack of other information. The listener may then use these cues to hypothesize the syntactic structure, and hence generate his
own phonetic "shape" of the perceived sentence.
There is little reason to believe that the perceived
stress contour, for example, must represent some
continuing physical property of the utterance, since
the listener uses some form of internalized rules to
"hear" the stress contour, whether or not it is physically present in a clear way. Hence, once the syntactic
surface structure can be determined, the "stress" can
be heard. Alternatively, prosodic features can be
used in a limited fashion to help point out the surface
structure, which is then used in the perception of the
phonetic shape of the sentence.
The present paper describes a procedure for parsing
sentences composed of words that are in turn derived
from the morphs provided by the grapheme-to-phoneme decomposition, as well as a phonological
procedure for specifying prosodic features over the
From the collection of the Computer History Museum (www.computerhistory.org)
Spring Joint Computer Conference, 1968
revealed phrases. As we have indicated, only a limited
amount of the sentence is parsed and provided with
prosodics, since the listener will "hear" the entire
sentence once the structure is clear. We consider
first the required parts-of-speech preprocessor, then
the parser, and finally the phonological algorithm.
Parts-of-speech preprocessor
After the grapheme-to-phoneme conversion is
complete, many words will have been decomposed
into their constituent morphs. For example, [grasshopper] → [grass] + [hop] + [er], and [browbeat] → [brow] + [beat]. Each of these morphs corresponds to a dictionary entry that contains, in addition to phonemic specifications, parts-of-speech information. In the case of morphs that can exist alone
([grass, hop, brow,] etc.) this information consists in
a set of parts of speech for that word, called the grammatical homographs of the word, and this set often
has more than one homograph. For prefixes and
suffixes ([re-, -s, -er, -ness,] etc.), information is
given indicating the resultant part of speech when
the prefix or suffix is concatenated with a root morph.
Thus [-ness] always forms a noun, as in [goodness]
and [madness].
Other researchers7,8 have used a computational
dictionary to compute parts of speech, relying on the
prevalence of function words (determiners, prepositions, conjunctions, and auxiliaries), together with
suffix rules of the type just described and their accompanying exception lists. This procedure, of course,
keeps the lexicon small, but results in arbitrary parts-of-speech classification when the word is not a function word, and does not have a recognizable suffix.
Furthermore, ambiguous suffixes such as [-s] (implying plural noun or singular verb) carry over their
ambiguity to the entire word, whereas if the root
word has a unique part of speech like [cat], our
procedure gives a unique result: [cats] (plural noun).
Hence the presence of the morph lexicon can often
be used to advantage, especially in the prevalent
noun/verb ambiguities.
The parts-of-speech algorithm considers each morph
of the word and its relation with its left neighbor, starting from the right end of the word. If there are two or
more suffixes [commendables, topicality] the suffixes are entered into a last-in first-out push-down
stack. Then the top suffix is joined to the root morph,
and the additional suffixes are concatenated until
the stack is empty. Compounding is done next, and
finally any prefixes are attached. Prefixes generally
do not affect the part of speech of the root morph, but
[em-, en-,] and [be-] all change the part of speech to
verb. Compounds can occur in English in any of three
ways, and there appears to be no reliable method for
distinguishing these classes. There can, of course, be
two separate words, as in [bus stop], or two words
hyphenated, as in [hand-cuff], or finally, two root
words concatenated directly, as in [sandpaper].
The parts-of-speech algorithm treats the last two
cases, leaving the two-word case for the parser to
handle. The algorithm ignores the presence of a
hyphen, except that it "remembers" that the hyphen
occurred, and then processes hyphenated and one-word compounds as though they were both single
words. The parts of speech of the two elements of the
compound are considered as row and column entries
to a matrix whose cells yield the resulting part of
speech. Thus Adverb·Noun → Noun ([underworld]). In general, since each element may have
several parts of speech, the matrix is entered for each
possible combination, but the maximum number of
resulting parts of speech is three. Combinations of
suffixes with compounds ([handwriting]) can be
accommodated, as well as one-word compounds
containing more than two morphs.
The algorithm has a special routine to handle
troublesome suffixes such as [-er, -es, -s], in an attempt to reduce the resulting number of parts of
speech to a minimum.
In this way, the algorithm makes use of the parts of
speech information of the individual morphs to compute the parts of speech set for the word formed by
these morphs. These sets then serve as input to the
parser, after having first been ordered to suit the principles of the parser.
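The morph-to-word computation described above can be sketched as follows. This is a minimal illustration with a toy lexicon; the names LEXICON, SUFFIX_POS, and COMPOUND are our own hypothetical devices, not the authors' data structures. Suffixes are pushed onto a last-in first-out stack as the word is scanned from the right, then popped and joined to the root, and compound parts of speech are read from a matrix.

```python
# Toy morph lexicon: free morphs map to their grammatical homographs;
# suffixes map to the parts of speech they can produce.
LEXICON = {
    "grass": {"noun"}, "hop": {"noun", "verb"}, "cat": {"noun"},
    "good": {"adjective"}, "topic": {"noun"},
}
SUFFIX_POS = {
    "-ness": {"noun"},              # [-ness] always forms a noun
    "-al":   {"adjective"},
    "-ity":  {"noun"},
    "-s":    {"noun", "verb"},      # plural noun or singular verb
}
# Compound matrix: (left POS, right POS) -> resulting POS,
# e.g. Adverb-Noun -> Noun, as in [underworld].
COMPOUND = {("adverb", "noun"): "noun", ("noun", "noun"): "noun"}

def apply_suffix(stem_pos, suffix):
    # Special routine for ambiguous suffixes such as [-s]: if the stem
    # has a unique part of speech, the result is unambiguous ([cats]).
    if suffix == "-s" and len(stem_pos) == 1:
        return {"noun"} if "noun" in stem_pos else {"verb"}
    return SUFFIX_POS[suffix]

def word_pos(root, suffixes_seen_right_to_left):
    """Pop suffixes off a LIFO stack; the innermost suffix joins first."""
    stack = list(suffixes_seen_right_to_left)
    pos = LEXICON[root]
    while stack:
        pos = apply_suffix(pos, stack.pop())
    return pos

def compound_pos(left_pos, right_pos):
    """Enter the matrix for every combination of element homographs."""
    return {COMPOUND[(l, r)] for l in left_pos for r in right_pos
            if (l, r) in COMPOUND}
```

For [topicality] the right-to-left scan meets [-ity] before [-al], so the stack pops [-al] first and joins it to the root; for [cats] the unique part of speech of the root resolves the ambiguity of [-s].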
Parsing
As we have remarked, if a listener is aware of the
surface syntactic structure of a spoken sentence, then
he may generate internally the accompanying prosodic
features to the extent that they are determinable by
linguistic rules forming part of his language competence. Hence we desire to make this structure evident
to the listener by providing cues to the syntax in the
prosodics of the synthesized speech. To do this, we
must first determine the structure, and tlien implement
prosodics corresponding to the structure. Since we
are trying to provide only a limited number of such
cues (enough to· allow the structure to be deduced), we
have designed a limited parser that reveals the syntax
of only a portion of the sentence. We have tried to
find the simplest parser consistent with the phonological goals that would also use minimum core storage and run fast enough (in the context of the over-all
reading machine) to allow for a realistic speaking rate,
say, 150-180 words per minute. Because the absence
or incorrect implementation of prosodics in a small
percentage of the output sentences is not likely to be
catastrophic, we can tolerate occasional mistakes by
the parser, but we have tried to achieve 90 per cent
accuracy. These requirements, for a limited, phrase-level parser operating in real-time at comfortable
speaking rates within restricted core storage, are indeed severe, and many features found in other parsers
are absent here. We do not use a large number of
parts of speech classifications, nor do we exhaustively
cycle through all the homographs of the words of a
sentence to find all possible parsings. Inherent syntactic ambiguity ([They are washing machines]) is
ignored, the resulting phrase structures being biased
toward noun phrases and prepositional phrases. No
deep-structure "trees" are obtained, since these are
not needed in the phonological algorithm, and only
noun phrases and prepositional phrases are detected,
so that no sentencehood or clause-level tests are made.
We do, however, compute a bracketed structure
within each detected phrase, such as [the [old house]]
and [in [[brightly lighted] windows]], since this
structure is required by the phonological algorithm.
The result is a context-sensitive parser that avoids
time-consuming enumerative procedures, and consults alternative homographs only when some condition is detected (such as [to] used to introduce
either an infinitive or a prepositional phrase) which
requires such a search.
The parser makes two passes (left-to-right) over a
given input sentence. The first pass computes a tentative bracketing of noun phrases and prepositional
phrases. Inasmuch as this initial bracketing makes no
clause-level checks and does not directly examine the
frequently occurring noun/verb ambiguities, it is
followed by a special routine designed to resolve these
ambiguities by means of local context and grammatical
number agreement tests. These last tests are also designed to resolve noun/verb ambiguities that do not
occur in bracketed phrases, as [refuse] in [They
refuse to leave.]. As a result of these two passes, a
limited phrase bracketing of the sentence is obtained,
and some ambiguous words have been assigned a
unique part of speech, yet several words remain as
unbracketed constituents.
The first pass is designed to quickly set up tentative
noun phrase and prepositional phrase boundaries. This
process may be thought of as operating in three parts.
The program scans the sentence from left to right
looking for potential phrase openers. For example,
determiners, adjectives, participles, and nouns may
introduce noun phrases, and prepositional phrases
always start with a preposition. In the case of some
introducers, such as present participles, words further
along in the sentence are examined, as well as previous words, to determine the grammatical function of
the participle, as in [Wiring circuits is fun.] Once a
phrase opener has been found, very quick relational
tests between neighboring words are made to determine whether the right phrase boundary has been
reached. These checks are possible because English
relies heavily on word order in its structure. Having
found a tentative right phrase boundary, right context
checks are made to determine whether or not this
boundary should be accepted. After completion of
these checks, the phrase is closed and a new phrase
introducer is looked for. This procedure continues
until the end of the sentence is reached.
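The three-part scan of the first pass might be sketched as below. The opener and continuer categories are illustrative guesses, much coarser than the quick relational tests between neighboring words that the paper describes, and the function name is our own.

```python
# Tentative first-pass bracketing: find a phrase opener, extend the
# phrase while the neighbor can still continue it, close it, resume.
OPENERS = {"determiner", "adjective", "participle", "noun", "preposition"}
CONTINUERS = {"determiner", "adjective", "adverb", "participle", "noun"}

def bracket(tagged):
    """tagged: list of (word, pos) pairs. Returns half-open phrase spans."""
    phrases, i = [], 0
    while i < len(tagged):
        if tagged[i][1] in OPENERS:
            j = i + 1
            while j < len(tagged) and tagged[j][1] in CONTINUERS:
                j += 1                  # extend to a tentative right boundary
            phrases.append((i, j))
            i = j                       # look for the next phrase introducer
        else:
            i += 1
    return phrases
```

On [They saw the gray house], tagged with one part of speech per word, this yields a single span covering [the gray house]; the real first pass additionally makes right-context checks before accepting a boundary.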
When the bracketing is complete, further tests are
made to check for errors in bracketing caused by frequent noun/verb ambiguities. For example, the sentence [That old man lives in the gray house.] would be
initially bracketed:
[That old man lives]NP [in the gray house]PrepP.
Notice that sentencehood tests (although not performed by the parser) would immediately reveal that
the sentence lacks a verb, and further routines could
deduce that [lives], which can be a noun (plural) or
verb (third person singular), is functioning as a verb,
although the bracketing routine, since it is biased toward noun homographs, made [lives] part of the noun
phrase. We also note the importance of this error
for the phonetic shape of the sentence, since [lives]
changes its phonemic structure according to its
grammatical function in the sentence. An agreement test, however, compares the rightmost "noun"
with any determiners that may reflect grammatical
number. In this case, [that] is a singular demonstrative pronoun, so we know that [lives] does not agree
with it, and hence must be a verb. After the agreement test has been made for each noun phrase, local
context checks are used in an attempt to remove
noun/verb ambiguities that are important for the
phonological implementation, and yet have not
been bracketed into phrases containing more than
one word. Thus in the sentence [They produce
and develop many different machines.] , the algorithm would note that [produce] is immediately
preceded by a personal pronoun in the nominative
case, and hence the word is functioning as a verb.
Such knowledge can then be used to put stress on the
second syllable of the word in accordance with its
function.
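The two disambiguation tests just described can be reduced to a sketch like the following; the word lists are toy samples and the function names are ours, not the authors'.

```python
SINGULAR_DETERMINERS = {"that", "this", "a", "an"}
NOMINATIVE_PRONOUNS = {"i", "we", "they", "he", "she"}

def agreement_test(determiner, rightmost_is_plural_noun):
    """[That old man lives]: a singular demonstrative cannot agree with
    a plural noun, so the noun/verb homograph must be a verb."""
    if determiner.lower() in SINGULAR_DETERMINERS and rightmost_is_plural_noun:
        return "verb"
    return "noun"

def local_context_test(preceding_word):
    """[They produce ...]: a nominative personal pronoun immediately
    before a noun/verb homograph marks it as functioning as a verb."""
    return "verb" if preceding_word.lower() in NOMINATIVE_PRONOUNS else "noun"
```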
At the conclusion of the parsing process described
above, phrase boundaries for noun phrases and prepositional phrases have been marked, but the structure
within the phrase is not known. In order to apply the
rules that are used for computing stress patterns within the phrase, however, internal bracketing must be
provided. For this reason, determiner-adjective-noun
sequences are given a "progressive" bracketing, as
[the [long [red barn]]], whereas noun phrases beginning with adverbials are given "regressive" bracketings, as [[[very brightly] projected] pictures]. A
preposition beginning a prepositional phrase always
has a progressive relation to the remaining noun
phrase, so that we have [in [the [long [red barn]]]]
and [of [[[very brightly] projected] pictures]].
Furthermore, two nouns together, as in [the local
bus stop], are marked as a compound for use by the
phonological algorithm.
The procedure described above is thus able to detect noun phrases and prepositional phrases and to
compute the internal structure of these phrases. The
grammar and parsing logic are intertwined in this
procedure, so that an explicit statement of the grammar is impossible. Nevertheless, the rules are easily
modified, and additions can readily be made. If, for
example, we decide to detect verbal constructions,
this could easily be done. At present, however, we
feel that recognition of noun phrases and prepositional
phrases and the provision of prosodics for these
phrases is sufficient to allow the listener to deduce the
correct syntactic structure for large samples of
representative text.
Phonological algorithm
The method for detecting and bracketing noun
phrases and prepositional phrases has now been
described. We assume that this surface structure is
sufficient to allow the specification of stress and intonation within these phrasal units. The basis for this
assumption is given in the work of Chomsky and
Halle. 9 The phonological algorithm then uses the
surface syntactic bracketing, plus punctuation and
clause-marker words, to deduce the pattern for stress,
pauses, and intonation related to the detected phrases.
In the present implementation, only three acoustic
parameters are varied to implement the prosodic
features. These are fundamental frequency (f0), vowel
duration, and pauses. It is well known that juncture
pauses have acoustic effects on the neighboring phonemes other than vowel lengthening and f0 changes,
but these effects are ignored in the present synthesis.
We thus consider f0, vowel duration, and pauses to
constitute an interacting parameter system that serves
as a group of acoustic features used to implement the
prosodics. The "sharing" of f0 for use in marking
both stress and intonation contours is another example of the interactive nature of these acoustic parameters.
Stress is implemented within the detected phrases
by iterative use of the stress cycle rules, described by
Chomsky and Halle. 9 These rules operate on the two
constituents within the innermost brackets to specify
where main stress should be placed. All other stresses
are then "pushed down" by one. (Here, "one" is the
highest stress.) The innermost brackets are then
"erased," and the rules applied to the next pair of
constituents. This cycle is then continued until the
phrase boundaries are reached. For compounds, the
rules specify main stress on the leftmost element
(compound rule), whereas for all other syntactic
units (e.g., phrases) main stress goes on the rightmost unit (nuclear stress rule). For example, we have
[the [long [red barn]]]
             2    1
        2    3    1

where initially stress is 1 on all units except the article the, and two cycles of the phrase rule are used. The parser has, of course, provided the bracketing of the phrase. Also,
[in [[[very brightly] lighted] rooms]]
        2      1
        3      2        1
        4      3        2       1

requires three applications of the rules, and
[the [new [bus stop]]]
             1    2
        2    1    3

which contains a compound, requires two iterations.
It is clear that for long phrases requiring several
iterations, say n, there will be n + I stress levels.
Most linguists, however, recognize no more than four
levels, so the algorithm clips off the lower levels. At
present, three levels are being used, but this limit can
be easily changed in the program.
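The cycle can be sketched as a small recursive procedure. The tuple encoding ("P" for a phrase pair, "C" for a compound pair) and the function-word list are our own devices for illustration, not the authors' program; the clipping step keeps at most the stated number of stress levels.

```python
FUNCTION_WORDS = {"the", "a", "an", "in", "of", "to"}

def cycle(node):
    """node: a word, or ("P"|"C", left, right). Returns [(word, stress)];
    1 is main stress, None marks an unstressed function word."""
    if isinstance(node, str):
        return [(node, None if node in FUNCTION_WORDS else 1)]
    kind, left, right = node
    li, ri = cycle(left), cycle(right)
    items = li + ri
    # an unstressed constituent (e.g. an article) triggers no cycle
    if not any(s for _, s in li) or not any(s for _, s in ri):
        return items
    ones = [i for i, (_, s) in enumerate(items) if s == 1]
    # compound rule: main stress on the leftmost element;
    # nuclear stress rule: main stress on the rightmost unit
    keep = ones[0] if kind == "C" else ones[-1]
    # all other stresses are "pushed down" by one
    return [(w, s if i == keep or s is None else s + 1)
            for i, (w, s) in enumerate(items)]

def stress(node, levels=3):
    """Clip off the lower levels, as the algorithm does."""
    return [(w, s if s is None else min(s, levels)) for w, s in cycle(node)]
```

Run on the worked examples in the text, this gives long 2, red 3, barn 1 for the first phrase, and new 2, bus 1, stop 3 for the compound example.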
In the examples it has been implicitly assumed that
each content word started with main stress before the
rules were applied. Each word does have a main
stress initially, but in general each word has its own
stress contour, as, for example, in the triple [nation,
national, nationality]. (As Lee3 has pointed out, pairs
such as [nation/national] can be handled by placing
the two words directly in the morph dictionary, but
we have tried to extend the stress algorithm to cover
many of these cases. Clearly, there is a compromise
between processing time and dictionary size to be
determined by experience.) Thus the algorithm must
compute the stress for individual words by applying
rules for compounds and suffixes. The compound rule
is the same as for two separate words that comprise a compound (e.g., [bus stop, browbeat]). Each
morph in the lexicon is given lexical stress, so that
an initial stress contour is provided. Each suffix is
also provided with information about its effect on
stress. Hence [-s, -ed] and [-ing] all leave the root
morph stress unaltered, and have the lowest level
stress for themselves. Another example is the suffix
[-ion], which always places main stress on the
immediately preceding vowel (e.g., [nationalization, distribution]). At present, such changes in stress
of the root word are not computed by rule. In this way,
stress contours for individual words are first computed, and then these are "placed" in the bracketed
phrase structure and the stress cycle is applied until
the over-all stress pattern is obtained for the whole
phrase. Note that function words receive no stress, so
that stress is controlled for these words, even though
they do not appear in bracketed phrases.
Pauses are provided in a definite hierarchy throughout each sentence. The following disposition of pauses
has been arrived at empirically, and represents a
compromise between naturalness and intelligibility.
At present, no pauses are used within the word at the
juncture between any two morphs. Within a bracketed
phrase or between two adjacent unbracketed constituents no pauses are used between words. At phrase
boundaries, pauses of 200 to 400 msec have been
used to set off the detected phrase. Short pauses of
100 msec are used where commas and semicolons
appear, and pauses of 200 msec are inserted before
clause-marker words such as [that, since, which]
etc., which serve to break up the sentence into clausal
units. Finally, terminal pauses of 500 msec are
provided for colon, period, question mark, and exclamation point. Thus a hierarchy of pauses is used to
help make the grammatical structure of the sentence
clear.
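The pause hierarchy above can be summarized as a table. The durations are those quoted in the text; the boundary labels themselves are our own invention.

```python
# Pause durations (msec) in the hierarchy, from tightest to loosest juncture.
PAUSE_MS = {
    "morph_juncture": 0,            # no pause between morphs within a word
    "within_phrase": 0,             # no pause between words in a phrase
    "comma_semicolon": 100,         # short pause at commas and semicolons
    "clause_marker": 200,           # before [that, since, which], etc.
    "phrase_boundary": (200, 400),  # range used to set off detected phrases
    "terminal": 500,                # colon, period, question mark, exclamation
}
```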
The provision of intonational f0 contours by rule
has been described by Mattingly,10 and our technique
is similar to his. The slope of the f0 contour is controlled by the specific phonemes encountered in the
sentence, and by the nature of the pause at the end
of the phrasal unit. Rising terminal contours are
specified at the end of interrogative clauses just
preceding the question mark, except when the clause
starts with a [wh-] word, as [where is the station?].
In the absence of a question mark, the intonation f0
contour is falling with a slope determined by rule as is
done by Mattingly.
The starting point for f0 at the beginning of a sentence is fixed at 110 Hz. The jumps in f0 for the various stress levels vary with the initial value of f0, but
nominally they are 12, 15, and 30 Hz corresponding to
the stress levels 3, 2, and 1 respectively. As noted before, 1 corresponds to the highest stress in our system.
Subjective experience
The method of implementing prosodic features on
the limited basis described above has been used in
connection with the TX-0 computer at M.I.T.,
driving a terminal analog synthesizer. While the
resulting speech is still unnatural in many respects, a
substantial improvement in speech quality has been
attained. It appears that by using limited phrase level
parsing and implementation of prosodics mainly within these phrases, sufficient cues can be provided to the
listener to enable him to detect the grammatical
structure of the sentence and hence provide his own
internal phonetic shape for the sentence. Since this
system will become part of a complete computer-controlled reading machine operating in real time, it
is encouraging to find that such a limited approach is
able to improve the speech quality markedly. We anticipate that further work on both phonemic and
prosodic synthesis rules will yield even greater intelligibility and naturalness in the output speech, with
little additional computing load placed on the system.
DISCUSSION
The speech synthesis system described here has been
developed for research purposes. Hence the implementation of our speech synthesis system has remained very flexible so that further improvements
can be easily accommodated. Better rules for phonemic synthesis are being developed, and will be incorporated into the system. Much work remains to be
done on the determination of the physiological mechanisms underlying stress, and the resultant observable
phonetic patterns which arise from these articulations. Particular attention is being focused on the nature and interaction of fo and vowel duration as correlates of stress. There will also undoubtedly be further improvements in the parsing procedure as
experience dictates. From the linguistic point of
view, the lexicon for a language should contain only
the idiosyncrasies of a language, everything derivable
by rule being computed as part of the language user's
performance. Engineering considerations, however,
clearly dictate a compromise with this view, and the
cost of memory versus the cost of computing with an
extensive set of rules must be examined further. It
may, for example, become feasible to compute lexical
stress by rule, but any advantages of this procedure
must outweigh the cost in time and program storage
for these rules.
ACKNOWLEDGMENTS
This work was supported principally by the National
Institutes of Health (Grant 1 P01 GM-1490-01) and
in part by the Joint Services Electronics Program
(Contract DA28-043-AMC-02536(E)); additional
support was received through a fellowship from Bell
Telephone Laboratories, Inc.
REFERENCES

1 C L SEITZ
An opaque scanner for reading machine research
SM Thesis MIT 1967
2 J K CLEMENS
Optical character recognition for reading machine applications
Doctoral Thesis MIT 1965
3 F F LEE
A study of grapheme to phoneme translation of English
PhD Thesis MIT 1965
4 G A MILLER
Decision units in the perception of speech
IRE Transactions on Information Theory Vol IT-8 No 2 p 81 February 1962
5 G A MILLER G A HEISE W LICHTEN
The intelligibility of speech as a function of the context of the test materials
J Exptl Psychol 41 p 329 1951
6 G A MILLER S ISARD
Some perceptual consequences of linguistic rules
J Verb Learn Verb Behav 2 p 217 1963
7 S KLEIN R F SIMMONS
A computational approach to the grammatical coding of English words
J Assoc Computing Machinery 10 p 334 1963
8 D C CLARKE R E WALL
An economical program for the limited parsing of English
AFIPS Conference Proceedings p 307 Fall Joint Comp Conf 1965
9 N CHOMSKY M HALLE
Sound patterns of English
(in press)
10 I G MATTINGLY
Synthesis by rule of prosodic features
Language and Speech 1966