Information Extraction: 10-707 and 11-748
Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and
the subether bore throughout the universe a dozen pictures of what he was doing.
He straightened and nodded to Dwar Reyn, then moved to a position beside the switch that would complete the contact
when he threw it. The switch that would connect, all at once, all of the monster computing machines of all the populated
planets in the universe - ninety-six billion planets - into the supercircuit that would connect them all into one supercalculator,
one cybernetics machine that would combine all the knowledge of all the galaxies.
Dwar Reyn spoke briefly to the watching and listening trillions. Then after a moment’s silence he said, “Now, Dwar Ev.”
Dwar Ev threw the switch. There was a mighty hum, the surge of power from ninety-six billion planets. Lights flashed and
quieted along the miles-long panel.
Dwar Ev stepped back and drew a deep breath. “The honour of asking the first questions is yours, Dwar Reyn.”
“Thank you,” said Dwar Reyn. “It shall be a question which no single cybernetics machine has been able to answer.”
He turned to face the machine. “Is there a God ?”
The mighty voice answered without hesitation, without the clicking of a single relay.
“Yes, now there is a god.”
Sudden fear flashed on the face of Dwar Ev. He leaped to grab the switch.
A bolt of lightning from the cloudless sky struck him down and fused the switch shut.
‘Answer’ by Fredric Brown.
©1954, Angels and Spaceships
Information Extraction: 10-707 and 11-748
• Instructor and staff:
– William Cohen (wcohen@cs, Wean 5317)
• Assistant: Sharon Cavlovich (sharonw@cs)
– TA: Vitor Carvalho (vitor@cs)
• Web page:
– http://www.cs.cmu.edu/~wcohen/10-707/
– Approximate but fairly detailed through spring
break
• Tues-Thurs 12-1:20pm, Wean 4615a
– Go ahead and eat!
– No class Thurs 1/25
Information Extraction: 10-707 and 11-748
• Prerequisite:
– Machine learning or consent of William
• Grading:
– Do a project, in groups of 2-3.
• Typical example: new algorithm on an old dataset, or old algorithm
on a new dataset.
• Write it up as a conference paper and present (as poster?)
• Your idea doesn’t have to actually work, or be novel.
– Do the readings and participate in class.
• Short critique of each assigned paper: ~= 500 words, one thing you
liked and/or one thing you didn’t like, and why.
• Start week of Jan 30th.
• Presentation of at least one paper (one of the suggested “optional”
papers, or one I approve)
Information Extraction: 10-707 and 11-748
• What is covered?
– What is information extraction?
• “(ML Approaches to) Extracting Structured Information from Text”
• “Learning How to Turn Words into Data”
– Applications:
• Web info extraction: building catalogs, directories, etc. from web sites
• Biotext info extraction: extracting facts like regulates(CDC23,TNF-1b)
• Question-answering: answering Q’s like “who invented the light bulb?”
• …
– Techniques:
• Named entity recognition: finding names in text
– …
– Graphical models for classifying sequences of tokens
• Extracting facts (aka events, relationships): classifying pairs of extractions
• Normalizing extracted data: classifying pairs of extractions
• Semi- and unsupervised approaches to finding information from large corpora (aka bootstrapping, “read the web”-like techniques)
• Today:
– Admin, motivation
– A brief overview of IE, and a less brief overview of named entity recognition
Motivation:
Why bother with IE?
[The short story “Answer” by Fredric Brown, reproduced above.]
The incredibly rapid growth and increasing pervasiveness of the Internet brings
to mind a piece of science fiction, a short story, that I read many years ago in
the days when UNIVACs and enormous IBM mainframes represented the
popular image of computers. In the story, […]
While such an exaggerated view of the power of networked computers may now
seem charmingly quaint, it is in fact not all that far beyond some of the wilder
claims that one hears for the future of the global information superhighway, of
which the Internet is widely regarded as the prototype. And if those claims are
exaggerated, they are at least reflective of the extent to which this technology
has caught hold of the popular and commercial imagination.
Technology and the Future, 7th Edition
Albert H. Teich, editor
© 1996
Some observations
• In the distant future:
– Complex AI systems are completed by ceremonially soldering
the final connection, not ceremonially compiling the last Java
class
– Performance is monitored by clicking relays
– A “lightning-from-a-cloudless-sky” peripheral exists
• Writing and debugging device drivers is a dangerous and
highly skilled profession
– Question-answering interfaces are still in use
• Natural-language query in, answer out
– Answering (some) complex questions requires combining
information from many different places
• With different parts contributed by different people?
Two ways to manage information

[Diagram contrasting two pipelines over a document collection.
Retrieval: a text query goes in, and matching documents come back as the answer.
“Ceremonial soldering”: IE first turns the documents into facts such as advisor(wc,vc), advisor(yh,tm), affil(wc,mld), affil(vc,lti), fn(wc,“William”), fn(vc,“Vitor”); inference over those facts then answers a structured query like X: advisor(wc,Y) & affil(X,lti)? with the bindings {X=em; X=vc}.]
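To make the “ceremonial soldering” path concrete, here is a minimal Python sketch of answering a conjunctive query by inference over IE-extracted facts. The fact base is hypothetical, mirroring the figure; note that only X=vc is derivable from the facts shown, so the figure’s extra binding X=em presumably depends on facts omitted there.

```python
# Hypothetical fact base, as produced by an IE system: (relation, arg1, arg2).
facts = {
    ("advisor", "wc", "vc"), ("advisor", "yh", "tm"),
    ("affil", "wc", "mld"), ("affil", "vc", "lti"),
    ("fn", "wc", "William"), ("fn", "vc", "Vitor"),
}

# Conjunctive query: find X such that advisor(wc, X) AND affil(X, lti).
answers = {x for (rel, a, x) in facts
           if rel == "advisor" and a == "wc"
           and ("affil", x, "lti") in facts}
print(answers)  # {'vc'}
```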
Some observations
• Using computers to merge information is and has
been important…
– Data cleaning and integration, record linkage, …
– Standards for data exchange:
• KQML, KIF, DAML+OIL, …
• Semantic web: N3Logic, OWL, …
– Friend-of-a-friend, GeneOntology, ….
– Growth from 456 OWL ontologies in 2004 to 14,600 in 2007
• Number of web pages estimated at 11.5B as of early 2006
– #webPages/#ontologies =~ 1,000,000 ?
– #webSites/#ontologies =~ 10,000 ?
– It seems to be much easier to generate sharable text than to
generate sharable knowledge.
– A lot of accessible knowledge is only accessible in text
How do you extract
information?
[Cohen / McCallum tutorial, NIPS 2002, KDD 2003, …]
[Some pilfering from Tom Mitchell’s invited talks]
What is “Information Extraction”
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
NAME | TITLE | ORGANIZATION
What is “Information Extraction”
As a task:
Filling slots in a database from sub-segments of text.
[same news excerpt as above]
IE
NAME | TITLE | ORGANIZATION
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | founder | Free Soft..
What is “Information Extraction”
As a family
of techniques:
Information Extraction =
segmentation + classification + clustering + association
[same news excerpt as above]
Segmentation output (aka “named entity extraction”):
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
What is “Information Extraction”
As a family
of techniques:
Information Extraction =
segmentation + classification + association + clustering
[same news excerpt as above]
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
What is “Information Extraction”
As a family
of techniques:
Information Extraction =
segmentation + classification + association + clustering
[same news excerpt as above]
* Microsoft Corporation
CEO
Bill Gates
* Microsoft
Gates
* Microsoft
Bill Veghte
* Microsoft
VP
Richard Stallman
founder
Free Software Foundation
Example: Finding Job Ads on the Web
[Screenshots: Martin Baker, a person; a genomics job; an employer’s job-posting form]
Example: A Solution
Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.htm
OtherCompanyJobs: foodscience.com-Job1
[Search interface over the extracted records: Category = Food Services, Keyword = Baker, Location = Continental U.S., with a list of matching job openings]
Data Mining the Extracted Job Information
Notice that we get something useful
from just identifying the person
names and then doing some counting
and trending
Landscape of IE Tasks (1/4):
Degree of Formatting
Text paragraphs
without formatting
Grammatical sentences
and some formatting & links
Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation
and B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.
Non-grammatical snippets,
rich formatting & links
Tables
Landscape of IE Tasks (2/4):
Intended Breadth of Coverage
Web site specific (formatting): e.g., Amazon.com book pages
Genre specific (layout): e.g., resumes
Wide, non-specific (language): e.g., university names
Landscape of IE Tasks (3/4):
Complexity
E.g. word patterns:

Closed set (e.g., U.S. states):
He was born in Alabama…
The big Wyoming sky…

Regular set (e.g., U.S. phone numbers):
Phone: (413) 545-1323
The CALD main office can be reached at 412-268-1299

Complex pattern (e.g., U.S. postal addresses):
University of Arkansas, P.O. Box 140, Hope, AR 71802
Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210

Ambiguous patterns, needing context and many sources of evidence (e.g., person names):
…was among the six houses sold by Hope Feldman that year.
Pawel Opalinski, Software Engineer at WhizBang Labs.
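The “regular set” case is the one a single pattern can cover directly. A minimal sketch in Python, with a deliberately simplified phone-number regex (real formats vary more):

```python
import re

# Simplified pattern: optional parentheses around the area code, then
# 3-3-4 digits separated by spaces, dots, or dashes.
phone = re.compile(r"\(?\b\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}\b")

text = ("Phone: (413) 545-1323. The CALD main office "
        "can be reached at 412-268-1299.")
print(phone.findall(text))  # ['(413) 545-1323', '412-268-1299']
```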
Landscape of IE Tasks (4/4):
Single Field/Record
Jack Welch will retire as CEO of General Electric tomorrow. The top role
at the Connecticut company will be filled by Jeffrey Immelt.
Single entity (“named entity” extraction):
Person: Jack Welch
Person: Jeffrey Immelt
Location: Connecticut

Binary relationship:
Relation: Person-Title (Person: Jack Welch, Title: CEO)
Relation: Company-Location (Company: General Electric, Location: Connecticut)

N-ary record:
Relation: Succession (Company: General Electric, Title: CEO, Out: Jack Welch, In: Jeffrey Immelt)
A little more depth on named
entity recognition (NER)
Models for NER
Running example: Abraham Lincoln was born in Kentucky.

• Lexicons: test whether a candidate string (“Kentucky”) is a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming).
• Classify pre-segmented candidates: a classifier predicts which class a given candidate segment belongs to.
• Boundary models: classifiers predict the BEGIN and END boundaries of entities.
• Sliding window: a classifier predicts which class each window of tokens belongs to, trying alternate window sizes.
• Token tagging: find the most likely state sequence over the tokens (HMMs, CRFs, …); this is often treated as a structured prediction problem, classifying tokens sequentially.
Sliding Windows
Extraction by Sliding Window

E.g., looking for the seminar location:

GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University

3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence
during the 1980s and 1990s.
As a result
of its success and growth, machine learning
is evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
A “Naïve Bayes” Sliding Window Model
[Freitag 1997]
…
00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
prefix: w_{t-m} … w_{t-1} | contents: w_t … w_{t+n} | suffix: w_{t+n+1} … w_{t+n+m}
Estimate Pr(LOCATION|window) using Bayes rule
Try all “reasonable” windows (vary length, position)
Assume independence for length, prefix words, suffix words, content words
Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)
If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
A “Naïve Bayes” Sliding Window Model
[Freitag 1997]
…
00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
prefix: w_{t-m} … w_{t-1} | contents: w_t … w_{t+n} | suffix: w_{t+n+1} … w_{t+n+m}
1. Create a dataset of examples like these:
+ (prefix00, …, prefixColon, contentWean, contentHall, …, suffixSpeaker, …)
- (prefixColon, …, prefixWean, contentHall, …, contentSpeaker, suffixColon, …)
…
2. Train a NaiveBayes classifier (or YFCL), treating the examples like BOWs for text classification.
3. If Pr(class=+ | prefix, contents, suffix) > threshold, predict the content window is a location.
• To think about: what if the extracted entities aren’t consistent, e.g. if the location overlaps with the speaker?
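A minimal sketch of the scoring step, assuming per-field, per-class word counts have already been collected from training windows (all names, counts, and the vocabulary-size constant here are hypothetical):

```python
import math
from collections import Counter

def window_score(prefix, contents, suffix, counts, priors):
    """Log-odds of LOCATION vs. OTHER for one window, assuming the
    prefix, contents, and suffix words are independent given the class."""
    def log_pr(words, field, label):
        c = counts[label][field]              # word -> frequency in that field
        n = sum(c.values())
        # add-one smoothing over a nominal vocabulary of 1000 words
        return sum(math.log((c[w] + 1) / (n + 1000)) for w in words)

    def total(label):
        return (math.log(priors[label])
                + log_pr(prefix, "prefix", label)
                + log_pr(contents, "contents", label)
                + log_pr(suffix, "suffix", label))

    return total("LOCATION") - total("OTHER")  # extract if above a threshold

# Toy counts standing in for quantities like Pr("Place" in prefix | LOCATION):
counts = {
    "LOCATION": {"prefix": Counter({"place": 5, ":": 9}),
                 "contents": Counter({"wean": 4, "hall": 4, "rm": 3}),
                 "suffix": Counter({"speaker": 3, ":": 2})},
    "OTHER":    {"prefix": Counter({"the": 50, ":": 5}),
                 "contents": Counter({"the": 60, "of": 40}),
                 "suffix": Counter({"the": 30, ".": 20})},
}
priors = {"LOCATION": 0.05, "OTHER": 0.95}
print(window_score(["place", ":"], ["wean", "hall", "rm", "5409"],
                   ["speaker", ":"], counts, priors))
```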
“Naïve Bayes” Sliding Window Results
Domain: CMU UseNet Seminar Announcements
[seminar announcement as above]
Field | F1
Person Name | 30%
Location | 61%
Start Time | 98%
Token Tagging
NER by tagging tokens
Given a sentence:
Yesterday Pedro Domingos flew to New York.
1) Break the sentence into tokens, and
classify each token with a label
indicating what sort of entity it’s part of:
person name
location name
background
Yesterday Pedro Domingos flew to New York
2) Identify names based on the entity labels
Person name: Pedro Domingos
Location name: New York
3) To learn an NER
system, use YFCL.
NER by tagging tokens
Similar labels tend to cluster together in text:
Yesterday Pedro Domingos flew to New York
(labels: person name, location name, background)

Another common labeling scheme is BIO (begin, inside, outside; e.g. beginPerson, insidePerson, beginLocation, insideLocation, outside). BIO also leads to strong dependencies between nearby labels (e.g. inside follows begin).
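Once tokens carry BIO labels, recovering entity spans is a single pass over the sequence. A small sketch (tag names illustrative):

```python
def bio_to_spans(tokens, tags):
    """Collect (label, phrase) spans from a BIO tag sequence."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):     # sentinel flushes the last span
        if tag == "O" or tag.startswith("B-"):
            if start is not None:              # close the open span
                spans.append((label, " ".join(tokens[start:i])))
                start, label = None, None
            if tag.startswith("B-"):           # open a new span
                start, label = i, tag[2:]
        # tags starting with "I-" simply extend the open span
    return spans

tokens = "Yesterday Pedro Domingos flew to New York".split()
tags = ["O", "B-Person", "I-Person", "O", "O", "B-Location", "I-Location"]
print(bio_to_spans(tokens, tags))
# [('Person', 'Pedro Domingos'), ('Location', 'New York')]
```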
NER with Hidden Markov Models
Given a sequence of observations:
Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM:
person name
location name
background

Find the most likely state sequence (Viterbi): argmax_s Pr(s, o)

Yesterday Pedro Domingos spoke this example sentence.

Any words said to be generated by the designated “person name” state are extracted as a person name:
Person name: Pedro Domingos
HMM for Segmentation of
Addresses
Example emission probabilities for two states (e.g., a building state and a U.S.-state state):
Hall 0.15, Wean 0.03, N-S 0.02, …
CA 0.15, NY 0.11, PA 0.08, …

• Simplest HMM Architecture: One state per entity type
[Pilfered from Sunita Sarawagi, IIT/Bombay]
HMMs for Information Extraction
…
00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun
1. The HMM consists of two probability tables:
• Pr(currentState=s | previousState=t) for s = background, location, speaker, …
• Pr(currentWord=w | currentState=s) for s = background, location, …
2. Estimate these tables as (smoothed) CPTs from counts, e.g.:
Pr(location | location) = #(loc→loc) / #(loc→*) transitions
3. Given a new sentence, find the most likely sequence of hidden states using the Viterbi method:
MaxProb(curr=s at position k) = max over states t of [ MaxProb(curr=t at position k-1) × Pr(word=w_{k-1} | t) × Pr(curr=s | prev=t) ]
…
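A compact sketch of Viterbi decoding in log space. The toy two-state tables are hypothetical, and the code uses the standard formulation in which the current state emits the current word (a slight variant of the recurrence above):

```python
import math

def viterbi(words, states, log_init, log_trans, log_emit, oov=-10.0):
    """Most likely hidden-state sequence under an HMM, in log space."""
    V = [{s: log_init[s] + log_emit[s].get(words[0], oov) for s in states}]
    back = []
    for w in words[1:]:
        prev, col, ptr = V[-1], {}, {}
        for s in states:
            best = max(states, key=lambda t: prev[t] + log_trans[t][s])
            col[s] = prev[best] + log_trans[best][s] + log_emit[s].get(w, oov)
            ptr[s] = best
        V.append(col)
        back.append(ptr)
    s = max(states, key=lambda t: V[-1][t])    # best final state
    path = [s]
    for ptr in reversed(back):                 # follow back-pointers
        s = ptr[s]
        path.append(s)
    return path[::-1]

lg = math.log
states = ["background", "location"]
log_init = {"background": lg(0.9), "location": lg(0.1)}
log_trans = {"background": {"background": lg(0.8), "location": lg(0.2)},
             "location":   {"background": lg(0.4), "location": lg(0.6)}}
log_emit = {"background": {"place": lg(0.05), ":": lg(0.1)},
            "location":   {"wean": lg(0.3), "hall": lg(0.3), "rm": lg(0.1)}}
print(viterbi("place : wean hall rm 5409".split(), states,
              log_init, log_trans, log_emit))
```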
“Naïve Bayes” Sliding Window vs HMMs
Domain: CMU UseNet Seminar Announcements
[seminar announcement as above]
Field | Sliding Window F1 | HMM F1
Speaker | 30% | 77%
Location | 61% | 79%
Start Time | 98% | 98%
What is a “symbol” ???
Cohen => “Cohen”, “cohen”, “Xxxxx”, “Xx”, … ?
5317 => “5317”, “9999”, “9+”, “number”, … ?
[Abstraction hierarchy over tokens:
All
  Numbers: 0..99 | 3-digits (000..999) | 0000..9999 | 5-digits (00000..99999) | others (000000..)
  Words: chars (A.. ..z) | multi-letter (aa..)
  Delimiters: . , / - + ? #]
Datamold: choose best abstraction level using holdout set
HMM Example: “Nymble”
[Bikel, et al 1998],
[BBN “IdentiFinder”]
Task: Named Entity Extraction

States: Person, Org, Other (plus five other name classes), with start-of-sentence and end-of-sentence states.

Train on ~500k words of news wire text.

Transition probabilities: P(st | st-1, ot-1), backing off to P(st | st-1), then P(st).
Observation probabilities: P(ot | st, st-1) or P(ot | st, ot-1), backing off to P(ot | st), then P(ot).

Results:
Language | Case | F1
English | Mixed | 93%
English | Upper | 91%
Spanish | Mixed | 90%
Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]
What is a symbol?
Bikel et al mix symbols from two abstraction levels
What is a symbol?
Ideally we would like to use many, arbitrary, overlapping
features of words.
identity of word
ends in “-ski”
is capitalized
is part of a noun phrase
is in a list of city names
is under node X in WordNet
is in bold font
is indented
is in hyperlink anchor
…
[Diagram: a linear chain of states S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}; overlapping features of O_t include: is “Wisniewski”, ends in “-ski”, part of noun phrase, …]

Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …
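A sketch of such a feature function (the city list is a toy stand-in; WordNet, font, indentation, and hyperlink features would come from richer resources or document markup):

```python
# Toy lexicon standing in for "is in a list of city names".
CITY_NAMES = {"pittsburgh", "boston", "new york"}

def word_features(tokens, t):
    """Overlapping, non-independent features of the t-th token."""
    w = tokens[t]
    return {
        "word=" + w.lower(): True,              # identity of word
        "ends_in_ski": w.lower().endswith("ski"),
        "is_capitalized": w[:1].isupper(),
        "in_city_list": w.lower() in CITY_NAMES,
    }

print(word_features(["Mr.", "Wisniewski", "lives", "in", "Pittsburgh"], 1))
# {'word=wisniewski': True, 'ends_in_ski': True,
#  'is_capitalized': True, 'in_city_list': False}
```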
What is a symbol?
[same feature list and diagram as above]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations:

Pr(s_t | x_t) = …
What is a symbol?
[same feature list and diagram as above]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state:

Pr(s_t | x_t, s_{t-1}) = …
What is a symbol?
[same feature list and diagram as above]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history:

Pr(s_t | x_t, s_{t-1}, s_{t-2}, …) = …
Ratnaparkhi’s MXPOST
• Sequential learning problem:
predict POS tags of words.
• Uses MaxEnt model
described above.
• Rich feature set.
• To smooth, discard features
occurring < 10 times.
Conditional Markov Models (CMMs) aka
MEMMs aka Maxent Taggers vs HMMS
HMM (generative, models the joint):

Pr(s, o) = ∏_i Pr(s_i | s_{i-1}) Pr(o_i | s_i)

[Diagram: chain of states S_{t-1} → S_t → S_{t+1}, each state emitting its observation O_t]

CMM/MEMM (conditional):

Pr(s | o) = ∏_i Pr(s_i | s_{i-1}, o_i)

[Diagram: chain of states where each S_t is predicted from S_{t-1} and the observation O_t]
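A minimal MEMM-style sketch using scikit-learn, with toy data, a tiny feature set, and greedy decoding in place of Viterbi (all of these are simplifying assumptions): each tagging decision is a maxent classification conditioned on the current word and the previous tag.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def feats(words, i, prev_tag):
    w = words[i]
    return {"word=" + w.lower(): 1,
            "is_cap": int(w[:1].isupper()),
            "prev=" + prev_tag: 1}

# Toy training data: a single tagged sentence (hypothetical).
train = [("Yesterday Pedro Domingos flew to New York".split(),
          ["O", "B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"])]

X, y = [], []
for words, tags in train:
    for i, tag in enumerate(tags):
        X.append(feats(words, i, tags[i - 1] if i else "START"))
        y.append(tag)

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

def greedy_decode(words):
    # A real MEMM runs Viterbi over Pr(s_i | s_{i-1}, o_i); greedy is simpler.
    tags, prev = [], "START"
    for i in range(len(words)):
        prev = clf.predict(vec.transform([feats(words, i, prev)]))[0]
        tags.append(prev)
    return tags

print(greedy_decode("Pedro flew to New York".split()))
```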
HMMs vs MEMM vs CRF

[Diagrams: the graphical models for an HMM, a MEMM, and a CRF]
Some things to think about
• We’ve seen sliding windows, non-sequential token
tagging, and sequential token tagging.
– Which of these are likely to work best, and when?
– Are there other ways to formulate NER as a learning task?
– Is there a benefit from using more complex graphical
models? What potentially useful information does a linear-chain CRF not capture?
– Can you combine sliding windows with a sequential model?
• Next lecture will survey IE of sets of related entities
(e.g., person and his/her affiliation).
– How can you formalize that as a learning task?