* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Introduction to Computational Linguistics Context Free Grammars
Zulu grammar wikipedia , lookup
Serbo-Croatian grammar wikipedia , lookup
Preposition and postposition wikipedia , lookup
Portuguese grammar wikipedia , lookup
Arabic grammar wikipedia , lookup
Compound (linguistics) wikipedia , lookup
Chinese grammar wikipedia , lookup
Dependency grammar wikipedia , lookup
Lexical semantics wikipedia , lookup
Probabilistic context-free grammar wikipedia , lookup
Ancient Greek grammar wikipedia , lookup
Scottish Gaelic grammar wikipedia , lookup
Context-free grammar wikipedia , lookup
Malay grammar wikipedia , lookup
Latin syntax wikipedia , lookup
Yiddish grammar wikipedia , lookup
Spanish grammar wikipedia , lookup
Polish grammar wikipedia , lookup
Esperanto grammar wikipedia , lookup
Vietnamese grammar wikipedia , lookup
Determiner phrase wikipedia , lookup
English grammar wikipedia , lookup
Transformational grammar wikipedia , lookup
Introduction to Computational Linguistics Context Free Grammars Slide 1 [Revised Version] Week 5, Lectures 1 & 2 November 7, 2002 Today Phrase Structure and Context Free Grammars – What is a CFG? – Linguistic examples of CFGs Slide 2 Constituents – Linguistic problem areas for CFGs – Grammar and spoken language Part of Speech and Tagging Main points... Parts of speech represent classes of words based on morphological and distributional similarities. Accurate part of speech tagging is pretty much a solved problem (at least for Slide 3 English) by any one of a number of statistical or rule-based approaches. For situations where there is no tagged training data one can fall back on unsupervised HMM training, but results are still poor. Syntax By SYNTAX, we mean various aspects of how words are strung together to form components of sentences and how those components are strung together to form sentences. New concept: C ONSTITUENCY Slide 4 groups of words may behave as a single unit or CONSTITUENT, e.g., noun phrase evidence – the whole group appears in similar syntactic environments, e.g., before a verb – preposed/postposed constructions Syntax in CL Syntactic analysis is used to varying degrees in applications such as: Grammar checkers Spoken Language Understanding Question answering systems Slide 5 Information extraction Automatic Text Generation Machine Translation Typically, detailed syntactic analysis is taken to be a prerequisite for detailed semantic interpretation. Context Free Grammars (CFGs) These: capture constituency and ordering; formalise descriptive linguistic work of the ’40s and ’50s; are widely used in linguistics. Slide 6 CFGs don’t directly encode grammatical relations (such as SUBJECT of a verb) or dependency relations (such as HEAD of a clause); however, to a certain extent, they can be defined in terms of constituency. CFGs are somewhat oriented towards languages like English which have relatively fixed word order. Most modern linguistic theories of grammar incorporate some notions from context free grammar, to a lesser or greater extent. Context Free Grammars A CFG is a 4-tuple, and defines a formal language a set of non-terminal symbols a set of terminal symbols a set of productions Slide 7 – – – : is a non-terminal is a string of symbols from the infinite set of strings a designated start symbol CFGs are equivalent to Backus-Naur form (BNF) grammars. Example S NP VP NP N om VP Slide 8 D et N V NP D et N om N V a flight left = ‘noun phrase’, VP = ‘verb phrase’, Det = ‘determiner’, Nom = ‘Nominal’, N = ‘noun’, V = ‘verb’. Generativity As with FSAs, we can view a CFG as a device for (i) assigning structure to a given sentence, or (ii) generating sentences Sentences that can be DERIVED by a grammar belong to the formal language defined by the grammar, and are called G RAMMATICAL S ENTENCES Slide 9 Those sentences that cannot be derived are U NGRAMMATICAL S ENTENCES The formal language, , defined by a grammar is the set of strings composed of terminal symbols that are derivable from the START SYMBOL: and derives PARSING is mapping from a word string to its parse tree Derivations and Trees A DERIVATION of a string is just a sequence of rule applications Derivations can be visualized as PARSE TREES. NP Slide 10 Det Nom a N flight We say that a flight can be DERIVED from the non-terminal NP. Equivalence and Normal Form S TRONG E QUIVALENCE two grammars generate the same set of strings and assign the same phrase structures W EAK E QUIVALENCE Slide 11 two grammars generate the same set of strings but do not assign the same phrase structures C HOMSKY N ORMAL F ORM (CNF) productions of the form or upper case indicates non-terminals, lowercase indicates terminals Equivalence and Normal Form: Example Original CNF Slide 12 What type of equivalence? What’s this ‘Context’? The use of the term CONTEXT FREE in the description of this formalism has nothing to do with the ordinary use of the word “context”. All it really means is that there is a single non-terminal on the left-hand side of the rule, e.g., Slide 13 In other words, we can rewrite an in which we find the . as , regardless of the (string/tree) context CFG: As opposed to what? Regular Grammars — generally claimed to be too weak to capture lingistic generalizations. Context Sensitive grammars — generally regarded as too strong. Recursively Enumerable (Type 0) Grammars — generally regarded as way too Slide 14 strong. (But it has sometimes be suggested that NLs aren’t even RE!) RE vs. Recursive — we seem to know when strings are ungrammatical. Approaches that are TOO STRONG have the power to predict/describe/capture syntactic structures that don’t exist in human languages. However, it is generally agreed that CFGs aren’t quite strong enough. The computational processes associated with stronger formalisms are not as efficient as those associated with weaker methods. The Chomsky Hierarchy L ANGUAGE T YPE AUTOMATON G RAMMARS E XEMPLAR Type 0: Recursively Enumerable Universal Turing Machine Linear Bounded Automaton Type 1: Context sensitive Slide 15 Type 2: Context Free Push-Down Automaton Finite-state Automaton Type 3: Finite State where etc. are variables over nonterminal symbols, terminal symbols or words, etc. are variables over etc. are variables over strings of terminal and/or non-terminal symbols, and , , etc. are terminal symbols. Developing Grammars A huge amount of skilled effort goes into the development of grammars for human languages — can only scratch the surface here. Our primary question: What constitutes a constituent? Answer: A consistent, recurring, syntactic substructure. Slide 16 A Tiny Lexicon is prefer V A dj Slide 17 Pro Proper-Noun Conj cheapest other like need want fly latest ... you it non-stop first direct me I ... Alaska Baltimore Los Angeles Chicago United D et P flights breeze trip morning . . . N American . . . from to on near and or but . . . the a an this these that ... ... A Tiny Grammar S NP Nominal Slide 18 VP PP NP VP I + want a morning flight Pro I Proper-Noun Los Angeles Det Nom a + flight N Noml morning + flight Noun flights V do V NP want + a flight Verb NP PP leave + Boston + in the morning V PP leaving + on Thursday P NP from + Los Angeles Example Phrase Structure Tree S NP Slide 19 VP Pro V I prefer NP Det a Nom Noun Noun morning flight Key Constituents The key constituents that we will look at are the following: Sentences Noun phrases Verb phrases Slide 20 Prepositional phrases Common Sentence Types Declaratives John left. S NP VP Imperatives Slide 21 Leave! S VP Yes-No Questions Did John leave? Aux NP VP S WH Questions (who, what, where, which, why, how ) When did John leave? S Wh-XP Aux NP VP Noun Phrases (simplified) Revolves around a HEAD (the central noun) P RENOMINAL modifiers come before the head noun: NP Slide 22 (Det) (Card) (Ord) (Quant) (AP) Nom NB. Placing parentheses around a category on the rhs of a rule indicates that the constituent in question is OPTIONAL. Determiners (Det) a flight Cardinal Numbers (Card) one stop Noun Phrases, cont. Ordinal Numbers (Ord) the first stop Quantifiers (Quant) Slide 23 many fares Adjective Phrases (AP) a fast flight Noun Phrases (continued) P OSTNOMINAL modifiers come after the head noun N om N om N om Slide 24 N om PP N om VP N om S [+rel] Prepositional Phrase (PP): the flights from San Francisco to Boston Nonfinite VPs: VP[+ing]: flights leaving on Thursday VP[+to]: the last flight to arrive VP[+ed]: the aircraft used by USAir Relative Clauses (S[+rel]) S [+rel] NP [+rel] VP flights that serve lunch S [+rel] NP [+rel] S [+gap] flight which you booked Postnominal modifiers can be combined Slide 25 a flight [from San Francisco] [leaving Monday] a flight [which I’d booked] [which never departed] Recursive Structures There is no upper bound on the length of a grammatical English sentence. – Therefore the set of English sentences is infinite. A grammar is a finite statement about well-formedness. – To account for an infinite set, it has to allow iteration (e.g., Slide 26 ) or recursion. R ECURSIVE RULES: where the non-terminal on the left-hand side of the arrow in a rule also appears on the right-hand side of a rule. Recursive Structures, cont. Direct recursion: N om VP N om PP flight to Boston departed Miami at noon VP PP Indirect recursion: Slide 27 S VP NP VP said that the flight was late V S S NP Slide 28 VP Pro V they said S NP VP Pro V he claimed S NP VP Pro V she lied Is English a Regular Language? Chomsky (1957): 1. 2. If , then/*or ) is not a regular language. . . [either or ], then . ( ) 5. If [either [if , then ], or ], then . ( ) 4. If 3. Either Slide 29 (e.g., , *then/or “Note that many of the sentences of the form [(4)], etc., will be quite strange and unusual . . . . But they are all grammatical sentences, formed by processes of sentence construction so simple and elementary that even the most rudimentary grammar of English would contain them.” Chomsky (1957, p.23) Is English a Regular Language, cont (1) a. If Kim comes, then Lee will come. b. If either Rusty comes or Sandy stays at home, then Lee will come. c. If either if Kim comes then she’ll bring Sandy or Sandy stays at home, then Lee will come. Slide 30 (2) a. The mouse died. b. The mouse [that the cat chased] died. c. The mouse [that the cat [that the dog chased] bit] died. d. The mouse [that the cat [that the dog [that the child fed] chased] bit] died. Depth restriction seems quite robust ( ) Is the performance/competence distinction appropriate? Coordination NP VP S Slide 31 S NP and NP VP and VP and S (3) I need [[NP the times] and [NP the fares]]. (4) a flight [VP departing at 9a.m.] and [VP returning at 5p.m.]] (5) [[S I depart on Wednesday] and [S I’ll return on Friday]]. In fact, any phrasal constituent can be conjoined with a constituent of the same type to form a new constituent of that type. So, coordination in English is better expressed by a schema of the following sort (where XP is an arbitrary phrase): XP XP and XP Syntactic Ambiguity Many kinds of syntactic (structural) ambiguity. PP attachment has received much attention: VP Slide 32 VP V NP V saw Nom saw Nom PP the man with a telescope NP PP the man with a telescope PP Ambiguity Different structures naturally correspond to different semantic interpretations (‘readings’) Arises from independently motivated syntactic rules: VP Slide 33 NP V . . . PP N om PP However, also strong, lexically influenced, preferences: – I bought [a book [on linguistics]] – I bought [a book] [on sunday] Problem Areas for CFGs Agreement Subcategorization Movement Slide 34 Number Agreement In English, some determiners agree in number with the head noun: Slide 35 (6) This dog (7) Those dogs (8) * Those dog (9) * This dogs And verbs agree in number with their subjects: (10) What flights leave in the morning? (11) * What flight leave in the morning? Number Agreement, cont. Expand our grammar with multiple sets of rules? NP sg NP pl S sg Slide 36 S pl D etsg N sg D etpl N pl NP sg VP sg VP sg VP pl NP pl VP pl V sg ( NP ) ( NP ) ( PP ) V pl ( NP ) ( NP ) ( PP ) worse when we add person and even worse in languages with richer agreement (e.g., three genders). lose generalizations about nouns and verbs — can’t say property words of category V. is true of all Subcategorization Verbs have preferences for the kinds of constituents they co-occur with. (12) I found the cat (13) Slide 37 * I disappeared the cat. A traditional subcategorization of verbs: transitive (takes a direct object NP) intransitive In more recent approaches, there might be as many as a hundred subcategorizations of verb. Subcategorization, cont. More examples: find is subcategorized for an NP (can take an NP complement) want is subcategorized for an NP or an infinitival VP bet is subcategorized for NP NP S A listing of the possible sequences of complements is called the Slide 38 SUBCATEGORIZATION FRAME for the verb. As with agreement, the obvious CFG solution yields rule explosion: VP V ditr NP NP VP V intr VP VP V tr NP V ditrS NP NP S Example Subcategorization Frames Frame Verb Example eat, sleep I want to eat NP prefer, find, leave, Find [NP the flight from Pittsburgh to Boston] NP NP show, give Show [NP me] [NP airlines with flights from Pittsburgh] Slide 39 PP PP fly, travel I would like to fly [PP from Boston] [PP to Philadelphia] NP PP help, load, Can you help [NP me] [PP with a flight] VP [+inf] prefer, want, need I would prefer [VP [+inf] to go by United airlines] VP [+base] can, would, might I can [VP [+base] go from Boston] S mean Does this mean [S AA has a hub in Boston]? Movement (or Unbounded Dependency) Constructions (14) a.* I gave to the driver. b. I gave some money to the driver. (15) Slide 40 a. $5 [I gave to the driver], (and $1 I gave to the porter). b. He asked how much [I gave to the driver]. c. I forgot about the money which [I gave to the driver]. (16) How much did you think [I gave to the driver]? (17) How much did you think he claimed [I gave (18) How much did you think he claimed that I said [I gave (19) ... to the driver]? to the driver]? Spoken Language Slide 41 the . [exhale] . . . [inhale] . . [uh] does American airlines . offer any . one way flights . [uh] one way fares, for one hundred and sixty one dollars [mm] i’d like to leave i guess between [um] . [smack] . five o’clock no, five o’clock and [uh], seven o’clock . P M around, four, P M all right, [throat clear] . . i’d like to know the . give me the flight . times . in the morning . for September twentieth . nineteen ninety one [uh] one way [uh] seven fifteen, please on United airlines . . give me, the . . time . . from New York . [smack] . to Boise-, to . I’m sorry . on United airlines . [uh] give me the flight, numbers, the flight times from . [uh] Boston . to Dallas Grammars for Spoken Language Spoken language is similar to, yet different from, written language. Differences include: lexical statistics (e.g., spoken language has more pronouns) constructional simplicity (e.g., less embedding) Slide 42 disfluencies (e.g., uh, um, word repetitions, false starts) sentence fragments (e.g., around 4 p.m. as an answer) Disfluencies in Spoken Language Interruption Point Does American airlines offer any one−way flights Reparandum [uh] Editing Phase one−way fares for 160 dollars? Repair false start (reparandum) Slide 43 uh repair Reading J&K Chapter 9 Slide 44 Acknowledgements Some content in these slides is drawn from course materials created by James Martin and by Steven Bird. Slide 45