EE517 – Statistical Language Processing
Prof. Mari Ostendorf
Basic Linguistics (M & S Chpt 3)
Outline
• Part of speech and morphology (word structure)
• Syntax (phrase structure)
• Lexical semantics
• Discourse
Note: we’ll cover the first two linguistics topics in more detail because
there’s more statistical modeling work in those areas.
Word Part of Speech
Part of speech (POS) = grammatical category (e.g. noun, verb)
POS is usually listed with words in a dictionary.
POS tagging is an important tool in language processing. Just as Markov
models are the simplest model beyond i.i.d., POS tagging is the simplest
thing that you can do beyond just using the words.
Major POS groups include:
• Nouns, pronouns (person, place, thing, animal, concept, etc)
– Nouns can be singular vs. plural (dog, dogs), proper noun (Joe),
possessive (Joe’s)
– Pronouns (stand-ins for nouns) can be: first, second or third
person (I, you, he/she); nominative (he, she); accusative (me, him,
her); possessive (my, mine); reflexive (herself)
• Determiners, adjectives (accompany nouns)
– Determiners include: articles (a, the), demonstratives (this, that)
– Adjectives describe properties of nouns: red, long, pretty, rich,
richer, richest
• Verbs (describe actions, activities, states)
– main verbs: He threw the stone. (action); I read (activity);
I have $50. (state)
– verbs used with other verbs:
∗ auxiliary verbs: have, be
∗ modals: may, can, shall, will
– verbs have many forms based on singular/plural, tense, infinitive,
etc. (see text)
• Adverbs, prepositions, particles (accompany verbs and more)
– adverbs modify verbs (often, quickly), and sometimes adjectives
(very)
– prepositions are small words that express relations, e.g. spatial (in,
on, by, over, under), temporal (after, before), ...
– particles are a subclass of prepositions that combine with the verb
to form a single lexical item (though the two need not be adjacent).
Example:
The plane took off. (particle)
She took off her jacket. (particle)
She took her jacket off. (particle)
She took the book off the shelf. (preposition)
(*) She took the book off.
• Conjunctions and complementizers (connect phrases)
– Coordinating conjunctions connect words, phrases or sentences
that are in some sense equals (and, but)
– Complementizers (also called subordinating conjunctions) connect
phrases where one is primary and the other secondary (that,
because, if, although)
• Miscellaneous other categories:
– Interjections: oh
– Fillers: uh, um
You often can’t determine the category in isolation:
light the fire (V); turn on the light (N); light lunch (Adj)
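The ambiguity in the "light" example can be made concrete with a toy lexicon that maps each word to every category it can take. This is only a sketch: the lexicon entries and the function names are illustrative assumptions, not taken from a real dictionary or tagger.

```python
# Toy lexicon: each word maps to the set of POS categories it can take.
# The entries are illustrative assumptions, not from a real dictionary.
LEXICON = {
    "light": {"V", "N", "Adj"},
    "fire": {"N", "V"},
    "the": {"Det"},
    "lunch": {"N"},
}

def possible_tags(word):
    """Return the sorted candidate categories for a word (empty if unknown)."""
    return sorted(LEXICON.get(word.lower(), set()))

def is_ambiguous(word):
    """A word is ambiguous if the lexicon lists more than one category."""
    return len(possible_tags(word)) > 1

for w in ["light", "the", "lunch"]:
    print(w, possible_tags(w), "ambiguous" if is_ambiguous(w) else "unambiguous")
```

A lookup like this only lists the options; choosing among them requires the surrounding words, which is the job of a POS tagger.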
ASIDE: The examples given here are very much English-centric. There are different
distinctions that are important in other languages.
Morphology
Systematic relation between the different forms of a “word” (or lemma)
Morphological analysis is an important tool for dealing with words you’ve
never seen before, which is always a problem because in practice our
dictionaries are finite. It is less critical for English, but is particularly
useful for highly inflected languages (e.g. Finnish, Czech, Turkish).
Turkish example from Jurafsky & Martin:
uygarlaştıramadıklarımızdanmışsınızcasına
“(behaving) as if you are among those whom we could not cause to become civilized”
Typical computational problems that benefit from morphological analysis:
• Find word root for information retrieval or to assess when information
content is new vs. old
• Pronunciation prediction (text-to-speech), spell checking
• Part-of-speech tagging and parsing
• Machine translation
Two classes of morphemes: stems (supply the main meaning) and affixes
(add to the meaning)
Three types of morphological processes:
• inflection: does not change word class substantially but indicates minor grammatical distinctions, such as:
– singular/plural: dog, dogs
– tense: play, playing, played
– gender: fils/fille (French), chico/chica (Spanish)
• derivation: usually results in more radical change in syntactic category, sometimes changing meaning
– wide ADJ → widely ADV
– understand V → understandable ADJ
– teach V → teacher N
– compute V → computer N
• compounding: merging 2 words to get 1 lexical item
– disk drive, mad cow disease
– babysitter, suitcase, overtake
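As a rough illustration of analyzing inflection, the sketch below strips a few regular English suffixes to guess a stem. The suffix list, the minimum stem length and the function name are assumptions chosen for this example; real morphological analyzers use much richer rules and exception lists.

```python
# Suffixes for a few regular English inflections, tried in this order.
# Assumed for illustration; they over- or under-strip on many real words.
SUFFIXES = ["ing", "ed", "s"]
MIN_STEM = 3  # do not strip if the remaining stem would be shorter than this

def strip_inflection(word):
    """Return a guessed stem by removing the first suffix that matches."""
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM:
            return stem
    return word  # no rule applied; treat the word as its own stem

for w in ["dogs", "play", "playing", "played", "sing"]:
    print(w, "->", strip_inflection(w))
```

The minimum stem length is what keeps "sing" from being reduced to "s". Derivation and compounding are not handled at all.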
Phrase Structure
Syntax = study of regularities and constraints of word order and phrase
structure.
There are multiple theories of syntax. We’ll take a limited view that gets
across some key ideas and is useful for the computational models that we
will look at.
A sequence of words is a constituent if you can replace it or move it
around:
my good friends
I invited Janet
my neighbor and his charming daughter
Main types of constituents:
• NP: noun phrase
• VP: verb phrase
• PP: prepositional phrase
• AP: adjective phrase
Examples:
She is very sure of herself.
The man caught the butterfly with a net.
Phrase structure can be represented with a tree, as in the figure, or with
bracketing:
[S [NP The man] [VP [VBD caught] [NP the butterfly] [PP [IN with] [NP a net]]]]
The leaf nodes of the tree are the words and are referred to as “terminal nodes” or “terminals”. The internal nodes are referred to as “nonterminals” and are associated with phrase labels.
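To show that the bracketing and the tree carry the same information, here is a minimal sketch that reads a bracketed string into nested (label, children) tuples and then lists the terminals. The function names and the tuple representation are assumptions made for this example, and the sentence is written with the phrase labels as single tokens (NP, VP, VBD, ...).

```python
import re

def parse_brackets(text):
    """Parse '[LABEL child ...]' into nested (label, children) tuples.

    Children are nested tuples (nonterminals) or plain strings (words).
    """
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        label = tokens[pos + 1]  # tokens[pos] is the opening bracket
        pos += 2
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(parse_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    try:
        if not tokens or tokens[0] != "[":
            raise ValueError("input must start with an opening bracket")
        tree = parse_node()
    except IndexError:
        raise ValueError("unbalanced brackets") from None
    if pos != len(tokens):
        raise ValueError("unexpected text after the closing bracket")
    return tree

def terminals(tree):
    """Collect the words at the leaves, left to right."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(terminals(child))
    return words

SENTENCE = ("[S [NP The man] [VP [VBD caught] [NP the butterfly] "
            "[PP [IN with] [NP a net]]]]")
tree = parse_brackets(SENTENCE)
print(tree[0], [child[0] for child in tree[1]])
print(terminals(tree))
```

The root label and its immediate children correspond to the top of the tree, and the terminals read off left to right give back the original sentence.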
Phrases illustrated in this example:
• NP: the man, the butterfly, a net
• VP: caught the butterfly with a net
• PP: with a net
Phrases typically have a “head”, which is usually the central constituent
that determines the syntactic character of the phrase. For example, the
head of a noun phrase is the main noun; the head of a verb phrase is the
main verb.
You can describe either bracket or tree representation with rewrite rules:
S → NP VP
VP → VBD NP PP
PP → IN NP
and of course more rules are needed to handle other types of sentences.
Often the tree and the rules are converted to be binary, in which case you
would need two VP rules: VP → VP PP; VP → VBD NP.
These rules are referred to as “context free” because they don’t use information about the specific words or neighboring context. The options
depend only on a particular non-terminal phrase type.
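One way to see the binary rules at work is a small chart recognizer in the style of the CKY algorithm, sketched below. The phrase rules for S, VP and PP follow the notes; the rule NP → DT NN, the tags DT and NN, and the word lexicon are assumptions added here so that the grammar can reach the words of the example sentence.

```python
from itertools import product

# Binary rules keyed by (left child, right child) -> possible parent labels.
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("VP", "PP"): {"VP"},
    ("VBD", "NP"): {"VP"},
    ("IN", "NP"): {"PP"},
    ("DT", "NN"): {"NP"},  # assumption: not given in the notes
}
# Assumption: a toy word-to-tag lexicon for the example sentence only.
LEXICON = {
    "the": {"DT"}, "a": {"DT"},
    "man": {"NN"}, "butterfly": {"NN"}, "net": {"NN"},
    "caught": {"VBD"}, "with": {"IN"},
}

def cky_recognize(words, start="S"):
    """Return True if the grammar derives `words` from the start symbol."""
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds the labels that can span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(word.lower(), set()))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY_RULES.get((left, right), set())
    return start in chart[0][n]

print(cky_recognize("The man caught the butterfly with a net".split()))
print(cky_recognize("The man caught".split()))
```

Each rule lookup depends only on the two child labels, never on the particular words underneath them, which is the context-free property described above.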
• An advantage of context-free grammars: unlike n-grams, they can capture non-local dependencies, e.g.
The women who found the wallet were rewarded.
(*) The women who found the wallet was rewarded.
• A disadvantage: lexical context often matters.
“sew clothes” but not “sew wood blocks”
A solution to this problem is to “lexicalize” the grammar, which is popular
in current statistical parsers.
Aside: There are many other types of grammars used in statistical models, including dependency grammars and tree adjoining
grammars, for example.
Arguments and adjuncts
Verbs have:
• arguments (centrally involved with the verb)
– subject NP (almost always)
– direct object NP (for transitive verbs, not intransitive)
– indirect object NP (for ditransitive verbs)
Example: She gave him a ball.
• adjuncts (optional, frequently to tell time, place or manner)
– He went yesterday
– He went to the store
– John gave her the book with a smile.
Empty Categories
Useful for dealing with “missing” arguments.
[S'' [NP Which book] [S' [MD should] [S [NP Peter] [VP [VB buy] [NP φ]]]]]
Semantics
Semantics: the meaning of words and combinations of words
Focus here on the meaning of words, or lexical semantics
• Relations between words
– more vs. less general: animal → { cat, (dog → husky), giraffe}
animal is a hypernym of cat; cat is a hyponym of animal
– part vs. whole: tree → { trunk, (branch → leaf), root}
branch is a meronym of tree
– antonyms (opposites): hot vs. cold
– synonyms (similar): car vs. automobile
• Different meanings of the same orthographic word (word sense)
– homonyms: (some closely related meanings, some not)
∗ bank: the financial institution vs. the edge of land next to a river
vs. blood bank vs. airplane turn
∗ branch: of a tree vs. of an organization
– other: (same spelling, different pronunciation)
bass: the fish vs. the musical instrument vs. a low-pitched sound
• Arguments of verbs can have different semantic roles
– agent: person or thing doing something
– patient: person or thing having something done to it
– instrument
– goal
Examples:
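John cut the grass with the lawnmower. (John: agent; the grass: patient; the lawnmower: instrument)
The lawnmower cut the grass. (the lawnmower is the subject, but still the instrument)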
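The more-general vs. less-general relation listed above can be stored as child-to-parent links and followed upward. The sketch below uses only the animal example from these notes; the dictionary layout and function names are assumptions for illustration.

```python
# Child -> parent (hypernym) links from the animal example in the notes.
HYPERNYM = {
    "cat": "animal",
    "dog": "animal",
    "giraffe": "animal",
    "husky": "dog",
}

def hypernym_chain(word):
    """Return the increasingly general terms above `word`, nearest first."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

def is_hyponym(specific, general):
    """True if `general` appears anywhere above `specific` in the hierarchy."""
    return general in hypernym_chain(specific)

print(hypernym_chain("husky"))
print(is_hyponym("cat", "animal"), is_hyponym("animal", "cat"))
```

The part-whole (meronym) relation could be stored the same way with a separate dictionary, since it is a different relation from hypernymy.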
Discourse
Phenomena at the level of groups of phrases or sentences.
Elements of discourse (Grosz & Sidner)
• Segmentation: how phrases and sentences are grouped into topics and
sub-topics
• Intention: purpose of an utterance or phrase, speech act (question,
statement, greeting, backchannel (uh-huh))
• Attention: what’s in focus, global topic vs. local NP
Problems that have received attention in data-driven work:
• Topic segmentation or topic change detection, topic detection and
tracking
• Text coherence, discourse planning
• Speech act labeling (e.g. does “ok” mean agreement, a question,
moving on, or . . . )
• Reference resolution and entity recognition
Mary helped Peter get out of the cab. He thanked her.
Mary helped the passenger out of the cab.
The woman thanked her.
Hurricane Hugo destroyed 20000 Florida homes.
The disaster has been the most costly in . . .