Introduction to Part of Speech Tagging
Alan Ritter
Many slides adapted from Brendan O’Connor and Chris Manning
Where are we going with this?
• Text classification: bags of words
• Sequence tagging
• Parts of Speech
• Named Entity Recognition
• Other areas: bioinformatics (gene prediction), etc…
What’s a part-of-speech (POS)?
• Syntax = how words compose to form larger meaning-bearing units
• POS = syntactic categories for words
• You could substitute words within a class and have a syntactically valid sentence
• Gives information about how words combine into larger phrases
• I saw the dog
• I saw the cat
• I saw the ___
Parts of Speech is an old idea
• Perhaps starting with Aristotle in the West (384–322 BCE), there was the idea of having parts of speech
• School grammar: noun, verb, adjective, adverb, preposition, conjunction, pronoun, interjection
• Many more fine-grained possibilities
https://www.youtube.com/watch?v=ODGA7ssL-6g&index=1&list=PL6795522EAD6CE2F7
Open class (lexical) words
• Nouns: proper (IBM, Italy); common (cat / cats, snow)
• Verbs: main (see, registered)
• Adjectives: old, older, oldest
• Adverbs: slowly
• Numbers: 122,312; one
• … more
Closed class (functional)
• Determiners: the, some
• Conjunctions: and, or
• Pronouns: he, its
• Modals: can, had
• Prepositions: to, with
• Particles: off, up
• Interjections: Ow, Eh
• … more
Open vs. Closed classes
• Open vs. Closed classes
• Closed:
• determiners: a, an, the
• pronouns: she, he, I
• prepositions: on, under, over, near, by, …
• Why “closed”?
• Open:
• Nouns, Verbs, Adjectives, Adverbs.
Many Tagging Standards
• Penn Treebank (45 tags) … this is the most common one
• Brown corpus (85 tags)
• Coarse tag sets
• Universal POS tags (Petrov et al. https://github.com/slavpetrov/universalpos-tags)
• Motivation: cross-linguistic regularities
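To make the fine-versus-coarse distinction concrete, here is a minimal sketch that collapses Penn Treebank tags into coarse universal-style tags. The table is deliberately partial and written for illustration only; the repository linked above holds the full mapping files. The function name `coarsen` and the fallback tag `X` for unlisted tags are assumptions of this sketch.

```python
# Partial, illustrative mapping from Penn Treebank tags to coarse tags.
# Assumption: only a handful of tags are listed; anything else falls back to "X".
PTB_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "IN": "ADP", "DT": "DET", "PRP": "PRON", "CD": "NUM", "CC": "CONJ",
}


def coarsen(tagged, fallback="X"):
    """Replace each fine-grained tag with its coarse tag (hypothetical helper)."""
    return [(word, PTB_TO_COARSE.get(tag, fallback)) for word, tag in tagged]


if __name__ == "__main__":
    fine = [("Plays", "VBZ"), ("well", "RB"), ("with", "IN"), ("others", "NNS")]
    print(coarsen(fine))
```

Several fine tags (VB, VBD, VBZ, …) collapse into one coarse tag, which is what makes a coarse tag set easier to share across languages.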
What are parts of speech useful for?
• Phrase identification (chunking)
• Named entity recognition
• Information Extraction
• Parsing
Quick and Dirty Noun Phrase Identification
POS Tagging
• Words often have more than one POS: back
• The back door = JJ
• On my back = NN
• Win the voters back = RB
• Promised to back the bill = VB
• The POS tagging problem is to determine the POS tag for a particular instance of a word.
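The difference between a word type and a particular instance of it can be sketched in a few lines. The data below is just the four "back" examples from this slide; the tuple layout and the helper name `tags_by_type` are assumptions made for illustration.

```python
from collections import defaultdict

# (tokens, index of the target word, tag of that instance), from the slide.
INSTANCES = [
    (["The", "back", "door"], 1, "JJ"),
    (["On", "my", "back"], 2, "NN"),
    (["Win", "the", "voters", "back"], 3, "RB"),
    (["Promised", "to", "back", "the", "bill"], 2, "VB"),
]


def tags_by_type(instances):
    """Collect the set of tags observed for each word type."""
    observed = defaultdict(set)
    for tokens, index, tag in instances:
        observed[tokens[index].lower()].add(tag)
    return dict(observed)


if __name__ == "__main__":
    for word, tags in tags_by_type(INSTANCES).items():
        print(word, sorted(tags))
```

A lookup keyed on the word alone can only return the whole set of candidates; choosing one tag requires looking at the surrounding context of each instance.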
POS Tagging
• Input: Plays well with others
• Ambiguity: NNS/VBZ UH/JJ/NN/RB IN NNS
• Output: Plays/VBZ well/RB with/IN others/NNS (Penn Treebank POS tags)
• Uses:
• Text-to-speech (how do we pronounce “lead”?)
• Can write regexps like (Det) Adj* N+ over the output for phrases, etc.
• As input to or to speed up a full parser
• If you know the tag, you can back off to it in other tasks
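The regexp idea above can be sketched as follows. This is one possible encoding, not necessarily the one behind the slide: each tag is reduced to a single letter so that a match span in the letter string lines up with token positions, and (Det) is read as an optional determiner. The helper names and the first example sentence are made up for illustration.

```python
import re


def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    return [tuple(item.rsplit("/", 1)) for item in text.split()]


def tag_class(tag):
    """Reduce a Penn Treebank tag to one letter: D, A, N, or O (other)."""
    if tag == "DT":
        return "D"
    if tag.startswith("JJ"):
        return "A"
    if tag.startswith("NN"):
        return "N"
    return "O"


# (Det) Adj* N+ written over the one-letter codes.
NP_PATTERN = re.compile(r"D?A*N+")


def noun_phrases(tagged):
    """Return the word sequences whose tags match the noun phrase pattern."""
    codes = "".join(tag_class(tag) for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in NP_PATTERN.finditer(codes)
    ]


if __name__ == "__main__":
    print(noun_phrases(parse_tagged("The/DT old/JJ dog/NN saw/VBD a/DT cat/NN")))
    print(noun_phrases(parse_tagged("Plays/VBZ well/RB with/IN others/NNS")))
```

The one-letter encoding keeps the regular expression short and avoids having to map character offsets in a tag string back to tokens.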
POS tagging performance
• How many tags are correct? (Tag accuracy)
• About 97% currently
• But baseline is already 90%
• Baseline is performance of stupidest possible method
• Tag every word with its most frequent tag
• Tag unknown words as nouns
• Partly easy because
• Many words are unambiguous
• You get points for them (the, a, etc.) and for punctuation marks!
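The baseline described above can be sketched directly. The toy training sentences are invented for illustration, so the resulting accuracy says nothing about real corpora; lowercasing words before counting is also an assumption of this sketch.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_baseline(words, most_frequent, unknown_tag="NN"):
    """Tag known words with their most frequent tag, unknown words as nouns."""
    return [(w, most_frequent.get(w.lower(), unknown_tag)) for w in words]


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = sum(p == g for (_, p), (_, g) in zip(predicted, gold))
    return correct / len(gold)


# Toy data, made up for this example.
TRAIN = [
    [("the", "DT"), ("dog", "NN"), ("can", "MD"), ("run", "VB")],
    [("the", "DT"), ("can", "NN"), ("is", "VBZ"), ("old", "JJ")],
    [("I", "PRP"), ("can", "MD"), ("see", "VB"), ("the", "DT"), ("cat", "NN")],
]
GOLD = [("the", "DT"), ("old", "JJ"), ("can", "NN"),
        ("is", "VBZ"), ("registered", "VBN")]

if __name__ == "__main__":
    model = train_baseline(TRAIN)
    predicted = tag_baseline([w for w, _ in GOLD], model)
    print(predicted)
    print(accuracy(predicted, GOLD))
```

On this toy test sentence the baseline misses exactly the two hard cases: the ambiguous word "can" used as a noun, and the unseen word "registered", which is not a noun.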
Deciding on the correct part of speech can be difficult even for people
• Mrs/NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG
• All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN
• Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD
How difficult is POS tagging?
• About 11% of the word types in the Brown corpus are ambiguous with regard to part of speech
• But they tend to be very common words. E.g., that
• I know that he is honest = IN
• Yes, that play was nice = DT
• You can’t go that far = RB
• 40% of the word tokens are ambiguous
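The gap between the type-level and token-level figures can be reproduced in miniature. The sketch below counts a word type as ambiguous if it appears with more than one tag, then measures what share of types and of tokens that covers. The toy corpus borrows the three tags for "that" from this slide; the tags on the other words are assumptions added so the example runs, and the input is assumed to be non-empty.

```python
from collections import defaultdict


def ambiguity_rates(tagged_tokens):
    """Return (share of ambiguous word types, share of ambiguous tokens)."""
    tags_for = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_for[word.lower()].add(tag)
    ambiguous = {w for w, tags in tags_for.items() if len(tags) > 1}
    type_rate = len(ambiguous) / len(tags_for)
    token_rate = sum(w.lower() in ambiguous for w, _ in tagged_tokens) / len(tagged_tokens)
    return type_rate, token_rate


# Toy corpus: only the tags on "that" come from the slide; the rest are assumed.
CORPUS = [
    ("I", "PRP"), ("know", "VBP"), ("that", "IN"), ("he", "PRP"),
    ("is", "VBZ"), ("honest", "JJ"),
    ("that", "DT"), ("play", "NN"), ("was", "VBD"), ("nice", "JJ"),
    ("go", "VB"), ("that", "RB"), ("far", "RB"),
]

if __name__ == "__main__":
    type_rate, token_rate = ambiguity_rates(CORPUS)
    print(f"ambiguous types:  {type_rate:.0%}")
    print(f"ambiguous tokens: {token_rate:.0%}")
```

Here one ambiguous type out of eleven accounts for three of thirteen tokens, the same effect as in the Brown corpus figures: ambiguous types are few but frequent.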
It’s hard for people too!