Lecture slides: Nature of words - School of Computer Science

Words (etc.)
John Barnden
School of Computer Science
University of Birmingham
Natural Language Processing 1
2015/16 Semester 2
Some Problems
• Intro Exercise-Set B suggests that our intuitive notion of a “word” is vague,
and inadequate even for informal purposes.
– E.g., we think of sentences as formed of words, but probably wouldn’t think of
“£12,000” as a word, or of “VAT” as a proper word (except when it means a large
container, of course), but would nevertheless agree that “Mary bought the car for
£12,000 plus VAT” was a properly formed sentence!
• There are many sorts of abbreviation, acronym and special symbol in written
language, and the relationship of such special units to spoken language is
diverse.
• And what about punctuation marks – are they words? They carry meaning,
after all.
• And what about things like “um”, “ah”, “ow”, “owwww” (in speech or text)?
• New words are created every day. A new one for me in 2011: “treggings” at
Marks and Spencers.
– Example of a “portmanteau” word, formed from melding other words – here “trousers” and
“leggings”.
Differences between Words
• In language study, differences of meaning or sound or spelling sometimes are,
and sometimes aren’t, taken to indicate different words.
– “Present”[noun:=gift] and “present”[verb as in: present a proposal] are typically
regarded as different words though spelled the same. (Same spelling, different
meaning and sound.)
– “Bank”[noun:financial] and “bank”[noun:of a river] may be taken to be different
words, but may instead be regarded as one word with a variable meaning. (Same
spelling and sound, different meaning.) NB: there’s also one or more verbs spelled
“bank,” including one with a financial sense.
– “Patent”[noun:legal doc.] can be pronounced in two different ways, but both are
typically taken to be versions of just one word. (Same meaning and spelling,
different sound.)
– “Realize” and “realise”: typically regarded as alternative spellings of the same
word. (Same meaning and sound, different spelling.)
Special Terminology
• Language studiers have introduced specialized terms to try to make things
more precise, but still there’s some looseness and variability in the
terminology.
• Some terms (see Appendix at end of these slides for more detail):
– Homographs, homophones, homonyms
– Lemmas
– Citation forms
– Wordforms, lexical forms
– Lexemes
– Lexical items
Terminology, contd
• A lexical form (also: wordform) is a particular written string or spoken sound that would
be regarded as a word of the language.
[that loosely follows J&M p.120]
So all occurrences of the written string “presents” would be occurrences of the same written
lexical form, irrespective of sound or meaning.
(All occurrences of spoken items that sound like “to” would be of the same spoken lexical form,
irrespective of spelling or meaning.)
• A lemma [J&M p.645] is a particular lexical form used as a sort of standard or basic version
of some lexical forms that are just close variants of the same meaning (different number,
different tense, etc.).
– Thus “carpet” and “carpets” are different lexical forms but have the same lemma, “carpet”.
– The lemma for “sing”, “sang” and “sung” is “sing”.
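The form-to-lemma relationship can be sketched as a simple lookup table. This is only a minimal illustration: the table `LEMMA_OF` and the helper names `lemma` and `forms_of` are hypothetical, and the table holds just the examples from this slide.

```python
# Hypothetical toy table mapping each lexical form to its lemma.
# It contains only the examples from the slide, not a real lexicon.
LEMMA_OF = {
    "carpet": "carpet",
    "carpets": "carpet",
    "sing": "sing",
    "sang": "sing",
    "sung": "sing",
}

def lemma(form: str) -> str:
    """Return the lemma of a lexical form, or the form itself if it is unknown."""
    return LEMMA_OF.get(form.lower(), form)

def forms_of(lem: str) -> list[str]:
    """Return, in alphabetical order, all listed lexical forms sharing a lemma."""
    return sorted(f for f, l in LEMMA_OF.items() if l == lem)

print(lemma("carpets"))
print(forms_of("sing"))
```

Several different lexical forms map to one lemma, but each lexical form in this toy table has exactly one lemma; a realistic table would need to cope with forms whose lemma depends on sense or POS.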
Different Sorts of Meaning-Difference
• Ambiguity (or homonymy according to some definitions): where a particular
lexical form has a variety of senses.
• A special case is polysemy, where different senses for a lexical form are related
in some way.
– E.g., “bank” can mean a financial institution or a building serving customers of it.
– A “newspaper” can mean an institution or a particular physical object produced by
that institution.
– Exercise: “window”
• Some people restrict “ambiguity” (and/or “homonymy” ) to apply only to those
cases of different-sense that aren’t cases of polysemy.
Arbitrariness of Sense Differences
• The number of senses a lexical form has, and what they are, is in large part a matter of
choice and convenience for particular purposes.
Different dictionaries, NLP systems, etc. divide up senses differently.
• Consider the verb “cut”, as applied to physical objects. Cutting proceeds significantly
differently according to the type of object (cake, grass, meat, hair, ...). Do these
correspond to different senses of “cut”??
Consider “cut” as applied to government expenditure. Does this involve a different sense
of cut – or is there just one very generalized sense that applies to expenditure and hair
and grass and ...
• To what extent do lexical forms have clearly identifiable senses at all? I.e., perhaps the
sense in action at a given point is often at least partly affected by the unique situation
being talked about? Two uses of “cut” hardly ever have the same meaning????
Special Word Classifications
• Words are classified in a variety of ways.
• We’ll look at a few main classifications:
– into “parts of speech” (such as nouns and verbs; also often called “lexical categories” or “word
classes”, though these terms are more ambiguous).
– into “proper nouns” (David) and “common nouns” (car).
– into “open class” or “closed class”.
Those classifications are of particular importance to NLP.
Why Important?
• Parts of speech (POSs) [NB: about text as well as speech!]
– constitute a basic level of grammatical analysis
– help with more complex grammatical analysis of utterances
– are useful by themselves in specialized practical tasks such as “information extraction” and
“named entity recognition”
– are useful in more academic practical tasks such as searching corpora (large bodies of
recorded language) for examples of desired types: e.g. Can search for all examples of the word
“spaceship” preceded by an adjective and followed by a verb.
• Proper nouns
– Are important in practical tasks, e.g. named entity recognition and document summarization.
• Closed class words
– are important in grammatical analysis of utterances and signalling some particular phenomena
in some tasks
– But by contrast may need to be suppressed in some types of task.
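The corpus search mentioned under parts of speech (the word “spaceship” preceded by an adjective and followed by a verb) might look roughly like this once a corpus has been POS-tagged. This is a sketch under assumptions: the function name `find_adj_word_verb` and the tiny hand-tagged corpus are invented for illustration, and the tags follow the Penn Treebank convention that adjective tags begin with JJ and verb tags with VB.

```python
def find_adj_word_verb(tagged, word):
    """Find occurrences of `word` preceded by an adjective and followed by a verb.

    `tagged` is a list of (token, tag) pairs using Penn Treebank style tags.
    """
    hits = []
    for i in range(1, len(tagged) - 1):
        prev_tok, prev_tag = tagged[i - 1]
        tok, _ = tagged[i]
        next_tok, next_tag = tagged[i + 1]
        if (tok.lower() == word
                and prev_tag.startswith("JJ")
                and next_tag.startswith("VB")):
            hits.append((prev_tok, tok, next_tok))
    return hits

# Invented, hand-tagged mini-corpus (an assumption, not real corpus data).
corpus = [
    ("The", "DT"), ("huge", "JJ"), ("spaceship", "NN"), ("landed", "VBD"), (".", "."),
    ("A", "DT"), ("spaceship", "NN"), ("is", "VBZ"), ("big", "JJ"), (".", "."),
]

print(find_adj_word_verb(corpus, "spaceship"))
```

Only the first occurrence matches: the second “spaceship” is followed by a verb but preceded by a determiner, not an adjective. Without POS tags, the same search would need a list of every adjective and every verb form.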
Parts of Speech (POSs)
• Lexical forms are traditionally put into categories such as noun, pronoun, determiner,
article, verb, adjective, adverb, preposition, particle, conjunction, interjection. E.g.:
– Determiner: e.g.: the, a, this, ... and possibly: every, which
– Article: the, a, an
– Particle: e.g.: up, off, in, at, by when closely tied into a phrasal verb: e.g. “take up”.
– Preposition: e.g.: up, off, in, at, by in freer uses
– Conjunction: e.g.: and, or, but, since, if, when, because
– Interjection: e.g.: hello, wow
• Many lexical forms have more than one POS (consider love, three, off, that, kill)
• Modern linguistic theories may propose more categories, from 12 upwards.
• NLP systems typically use many more categories: from mid-30s (e.g. within the set of 45
“tags” in the Penn Treebank tagset [textbook]) up to 140+
• See various lists of POSs and related “tagsets” in sections 5.1-5.3 of textbook.
How are POSs Defined?
• With great difficulty!
• One try: use (partly-)conceptual criteria as in:
– Noun: denotes an entity or entity concept (car, snow, love, Tony Blair, Santa Claus, ...)
– Verb: means an action, state, relationship, etc. (to push, to sleep, to love, to be, ...)
– Adjective: adjusts an entity concept denoted by a nearby noun (red, sad, fake, ...)
– Adverb: qualifies some event or state denoted by a nearby verb, adjective, adverb,
clause, etc. (boldly, tomorrow, here, loosely, very(?), ...)
– Determiner: specifies what specific entities (denoted via other words), if any, are being
talked about by a nearby noun
– NB: the qualifications above about what POSs are nearby a word are my own.
• Such criteria are often mentioned, but are problematic.
– “destruction” is a noun but refers to an action
– auxiliary verbs, such as “have” in combination, have a special function
– “my”: sometimes classed as adjective, sometimes as determiner.
How are POSs Defined?, contd.
• Something else that may help: morphological data, as in:
– Certain lexical forms clump together like this:
carrot, carrots, carrot’s, carrots’
man, men, man’s, men’s
– Certain other lexical forms pattern like this:
criticize, criticizes, criticized, criticized, criticizing
sing, sings, sang, sung, singing
• So we can postulate two different classes and call them nouns and verbs.
– The distinction being drawn is:
one class has a singular/plural dimension and a possessive/non-possessive
dimension;
the other has a singular/plural dimension, a tense dimension, and ...
• In English, this sort of approach doesn’t extend very well to other words, which
largely don’t inflect (change shape) etc.
How are POSs Defined?, contd.
• What may be more helpful is “distributive” data – i.e. data about how words go
together in meaningful linguistic expressions, as in:
– Only certain words can follow “the” or “a[n]”; of these, some must usually be followed
by other words to be considered a meaningful unit; others don’t need to be.
– So we can have: the/a car, the/a big car
– But not usually: the/a big.
– And car and big act differently after forms of “be”:
– We can have : the thing is big but not the thing is car.
– Certain words such as is and hated need words around them to make sense.
• This sort of data (together perhaps with the above conceptual and morphological
data) may make it useful to divide words into nouns, adjectives, determiners, etc.
• Having done this, we may find it relatively easy to specify in general a grammar,
i.e. a description of the strings of lexical forms that are allowed in the language,
based on their assigned classes.
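The distributional tests above can be sketched as a small table of acceptability judgements plus a rule that reads a class off them. Everything named here (`JUDGEMENTS`, `distributional_class`, the labels “noun-like” and “adjective-like”) is a hypothetical illustration; the judgements are just the ones given on this slide for “car” and “big”.

```python
# Acceptability judgements for two test frames ("_" marks the slot).
# True means the filled-in frame is usually a meaningful expression.
JUDGEMENTS = {
    "car": {"the _": True, "the thing is _": False},
    "big": {"the _": False, "the thing is _": True},
}

def distributional_class(word: str) -> str:
    """Guess a rough word class from how the word behaves in the two frames."""
    frames = JUDGEMENTS[word]
    if frames["the _"] and not frames["the thing is _"]:
        return "noun-like"
    if frames["the thing is _"] and not frames["the _"]:
        return "adjective-like"
    return "unclear"

print(distributional_class("car"))
print(distributional_class("big"))
```

Two frames are far too few to separate all the classes; the point is only that classes can be defined by where words may appear, without appeal to meaning.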
[How are POSs Defined?, contd.]
• The classification of some words in a given language is contentious, and differs
between different schemes (see textbook).
• The way words are classified in one language may not work well for another
relatively distant one. We shouldn’t expect the notion of noun, verb, adverb,
etc. familiar from one language to correspond in a simple way to categories in
another, distant language.
• However, even languages like English and Japanese can be given at least roughly
the same (main) POSs, even though there’s a lot of detailed difference in the
morphological and distributive data.
• So perhaps one test of a classification scheme is how well it survives across
languages.
Proper and Common Nouns
• Proper nouns are, roughly speaking, those that refer to specific (though not
necessarily real) entities of certain types in specific contexts:
– David Cameron, David, Santa Claus, The Guardian, Love Actually [a film title],
Edgbaston, University of Birmingham, Department of the Environment, School of
Psychology, Prolog, English(?)
– NB: there are many Davids; many universities may have a School of Psychology: the
“specificity” is only within particular contexts, possibly fleeting.
• Common nouns are the remaining nouns:
– car, carrot, bandwidth, fifteen [in some uses], relationship, baking [as in the baking of
the cake]
• Not clear whether the following should count as proper nouns:
– Christianity, Islam, Act V, January, Tuesday [when used to refer to a day, not an
actress!]
• Not clear why nouns like fifteen and love aren’t considered to be proper nouns:
they can be considered to refer to specific entities, after all.
Proper and Common Nouns, contd.
• In English, proper nouns are usually spelled with an initial capital letter, if a
single word, or at least have the main words within them initially-capitalized
– But sometimes not, for special effect (advertisements) or choice (poet e.e. cummings)
• But (in English) not all words spelled with a capital letter are proper nouns: they
may be common nouns or not nouns at all:
– Blairite [can be noun or adjective]
– Englishwoman [noun], English [can be adjective (a “proper adjective”)]
– Islamic [not noun – a proper adjective]
– I [pronoun]
– Birminghamize [not noun – a “proper verb”?]
– Many acronyms and abbreviations: NB, PS, NLP, CS, ABM, VD, DVD [most are nouns]
• Anyway, the capitalization semi-criterion breaks down in many other languages
(incl. ones close to English such as German) and in English sentence starters,
document titles and section headings, headlines, etc.
• Many proper nouns are spelled the same as common nouns or other words
(apart from capitalization): Peter, Blacksmith.
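To see how leaky the capitalization semi-criterion is, here is a minimal sketch of a naive proper-noun finder that flags every capitalized token that is not sentence-initial. The function name `capitalized_candidates` and the example sentence are invented for illustration; the sentence reuses words from the lists above.

```python
def capitalized_candidates(tokens):
    """Return capitalized tokens that do not start a sentence.

    A deliberately naive heuristic: it treats capitalization alone as
    evidence of proper-noun status.
    """
    candidates = []
    sentence_start = True
    for tok in tokens:
        if tok[0].isupper() and not sentence_start:
            candidates.append(tok)
        # The next token starts a sentence if this one ends a sentence.
        sentence_start = tok in {".", "!", "?"}
    return candidates

tokens = ("Yesterday I met an Englishwoman in Edgbaston . "
          "She likes Islamic art and NLP .").split()

print(capitalized_candidates(tokens))
```

The heuristic returns “I”, “Englishwoman”, “Edgbaston”, “Islamic” and “NLP”, of which only “Edgbaston” is a clear proper noun. It would also miss any proper noun that happens to begin a sentence, since sentence-initial capitals are ignored.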
Closed Classes of Words
• These are classes whose membership is largely or completely fixed, and either
relatively very small or very rule-prescribed. E.g.:
– Prepositions
– Particles
– Determiners
– Conjunctions
– Pronouns
– Auxiliary verbs: e.g. can, should, may, be, have, do
– Basic degree modifiers: e.g.: very, quite, more, too
[when followed by an adjective/adverb]
• Such words are usually regarded as “function words”: they have special roles in grammar, as
arguably in all the above cases, in varying degrees. (But NB the degree modifier class in
general is fairly open, and only the basic members might be regarded as function words.)
• Not completely fixed membership: e.g.:
– ordinary language evolution
– differences across dialects, slangs, etc. (“in back of” in Amer. Eng., not Brit.)
Open Classes of Words
• Classes to which new members can be freely added, and often are.
• Notably:
– Nouns
– Non-auxiliary Verbs
– Adjectives
– Adverbs
– Interjections.
• Some fairly recent examples:
– tweet [verb & noun, related to Twitter communications: NB the lexical form existed
before, both as verb and noun],
– treggings, globish [a newly arising global form of English], mobile [noun for
phone], Blairite
– remote [noun, short for remote control]
– Newly invented proper nouns (or common nouns conscripted): Johnathan [new to
me anyway], the Gherkin [a building in London], Agatha Mabel Barnden
A Difficult Case: Numerals
• Written-out numerals: e.g.: one, seven, first, seventh, thirty-nine, thousand, millionth,
dozen, score, triple, quarter?, fourfold?, twice?
• Are they open or closed class?
• Membership seems very fixed and rule-prescribed.
• But:
– There are lots of numerals – more than in other closed classes
– and infinitely many if we allow strings such as “1053” and “MCMLVII” as numerals
– and perhaps for other reasons: “thousand-and-fifty-three-fold”.
– Invention of things such as zillion, squillion and nth, ith, jth
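The sense in which numeral strings are “rule-prescribed” yet unbounded in number can be sketched with two patterns: a few lines of rules accept an unlimited set of digit strings. The names `DIGITS`, `ROMAN` and `is_numeral_string` are hypothetical, and the Roman pattern is the common textbook one, which only covers values up to 3999.

```python
import re

# Digit strings: any length is allowed, so there are infinitely many.
DIGITS = re.compile(r"[0-9]+")

# Standard-form Roman numerals up to 3999 (an assumption about what to accept).
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")

def is_numeral_string(s: str) -> bool:
    """Return True if s is a digit string or a standard-form Roman numeral."""
    if DIGITS.fullmatch(s):
        return True
    # The Roman pattern also matches the empty string, so exclude that case.
    return bool(s) and ROMAN.fullmatch(s) is not None

for s in ["1053", "MCMLVII", "zillion", "IIII"]:
    print(s, is_numeral_string(s))
```

Membership here is fixed by rule rather than by listing, which is unlike a closed class such as the prepositions; and rules like these say nothing about inventions such as “zillion”.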
What Now?
• Next, we’ll talk about “morphology”, helped by our new knowledge of words
and classes of them.
• That will involve the question of how to compute the morphology of words.
• After that we’ll be in a position to look at “POS tagging”: actually computing the
POSs of words in discourse (and then annotating the words with their POSs).
• To some extent at least POS tagging also includes, or can include, finding proper
nouns, and intrinsically includes making the open/closed-class distinction and
finding function words.
• Exercise: What’s the written plural of “POS” and how do you pronounce it?
POSs?
POSes?
PsOS?
POS? [the last seems to be preferred currently]
Appendix:
More Detail on Some Matters
Word Separation
• Words are typically not separated from each other in speech, and can subtly
affect each other’s sound. So separation of speech into words is itself a
somewhat theoretical (if commonsensically natural) act.
• In some languages (e.g., Chinese, Japanese, old Latin) words are not (or not
always) separated in writing.
More Terminology
• A lexical item or lexical entry is often used to mean the (main) items listed in a dictionary
or lexicon (lexicon = database of words in, e.g., an AI system) and to which meanings are
given by the dictionary or lexicon.
– So lexical items are typically lemmas in effect, giving meanings for lexical forms.
– But note that dictionaries often list irregular inflected forms separately, e.g. “sung” (as past
participle of “sing”).
– An item in a dictionary can be a phrase rather than a single word.
Homographs, etc.
• According to J&M (pp.290, 646, 648):
– “Homographs”: words with the same spelling but different sound, such as
“live”[verb] and “live”[adjective]. [I think J&M also mean that the words have
different meanings, excluding the cases like “patent” above.]
– “Homophones”: words with the same sound but different spelling, such as “to”,
“too” and “two”. [I think J&M also mean that the words have different meanings,
excluding the cases like “realise/realize” above.] And note that “to” has more than
one meaning.
– “Homonyms”: different word senses (meanings) that are of words with the same
spelling and sound, as in “bank”.
• BUT: Other academics, and dictionaries, may define that terminology
somewhat differently.
– Webster’s Third New International defines “homonym” to mean various things, none the
same as J&M’s definition!! One (!!) of the meanings is: one of two or more words spelled and
pronounced alike but different in meaning. And “homonym” can also mean the same as
“homograph” or “homophone”!!
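The three terms, as read above from J&M, can be sketched as a comparison on spelling, sound and sense. This is only an illustration of that one reading: the `Word` class, the `relation` function and the informal respellings in the `sound` field are all assumptions, not real phonetic transcriptions, and other sources would assign the labels differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Word:
    spelling: str
    sound: str  # informal respelling, assumed only for this illustration
    sense: str

def relation(a: Word, b: Word) -> str:
    """Label a pair of words following the reading of J&M given above."""
    if a.sense == b.sense:
        return "same sense"  # e.g. realise/realize; not covered by the three terms
    same_spelling = a.spelling == b.spelling
    same_sound = a.sound == b.sound
    if same_spelling and same_sound:
        return "homonyms"
    if same_spelling:
        return "homographs"
    if same_sound:
        return "homophones"
    return "unrelated"

live_verb = Word("live", "liv", "be alive")
live_adj = Word("live", "lyve", "not recorded")
too = Word("too", "too", "also")
two = Word("two", "too", "the number 2")
bank_money = Word("bank", "bank", "financial institution")
bank_river = Word("bank", "bank", "side of a river")

print(relation(live_verb, live_adj))
print(relation(too, two))
print(relation(bank_money, bank_river))
```

Under this reading the three labels are mutually exclusive; under the Webster’s usage mentioned above, “homonym” could apply to all three pairs.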
More Terminology
• A wordform (also: lexical form) [loosely following J&M p.120] is a particular written string
or spoken sound that would be regarded as a word of the language.
So all occurrences of the written string “presents” would be occurrences of the same
written wordform, irrespective of sound or meaning. All occurrences of the spoken item
that sounds like “to” would be of the same spoken wordform, irrespective of spelling or
meaning.
I’ll use lexical form in preference to wordform to emphasize inclusion of special units such
as an abbreviation, acronym, or numeral. I’ll mainly be concerned with written forms.
• A lexeme [J&M p.645] is a lexical form (spoken or written) together with a particular sense
(meaning) for it.
• A lemma or citation form [J&M p.645] is a particular lexical form used as a sort of
standard or basic version of the wordform in a lexeme. Thus “carpet” and “carpets” are in
different lexemes and lexical forms but have the same lemma, “carpet”. The lemma for
“sing”, “sang” and “sung” is “sing”.
– Caution: J&M note on p.646 that “lemma” is sometimes used to mean the sense part of a
lexeme. Also, they themselves give a definition significantly different from the above on p.120!!
More Terminology, contd.
• A lexical item or lexical entry is often used to mean the (main) items listed in a dictionary
or lexicon (lexicon = database of words in, e.g., an AI system) and to which meanings are
given by the dictionary or lexicon.
• So lexical items are typically citation forms.
• But note that dictionaries often list irregular inflected forms separately, e.g. “sung” (as past
participle of “sing”).
• An item in a dictionary can be a phrase rather than a single word.
• “Lexical item” sometimes means the same as my “lexical form”.