Learning Extractors from Unlabeled Text using Relevant Databases
Kedar Bellare and Andrew McCallum
Department of Computer Science
140 Governor’s Drive
University of Massachusetts
Amherst, MA 01003-4601
{kedarb,mccallum}@cs.umass.edu
Abstract
Supervised machine learning algorithms for information extraction generally require large amounts of training data. In many cases where labeling training data is
burdensome, there may, however, already exist an incomplete database relevant to the task at hand. Records
from this database can be used to label text strings
that express the same information. For tasks where
text strings do not follow the same format or layout,
and additionally may contain extra information, labeling the strings completely may be problematic. This
paper presents a method for training extractors which
fill in missing labels of a text sequence that is partially
labeled using simple high-precision heuristics. Furthermore, we improve the algorithm by utilizing labeled
fields from the database. In experiments with BibTeX
records and research paper citation strings, we show
a significant improvement in extraction accuracy over
a baseline that only relies on the database for training
data.
Introduction
The Web is a large repository of data sources where the majority of the information is represented as unstructured text.
Efficiently mining and querying such sources requires large-scale integration of unstructured resources into a structured
database. A first step towards achieving this goal involves
extraction of record-like information from unstructured and
unlabeled text. Information extraction (IE) approaches address the problem of segmenting and labeling unstructured
text to populate a database with a pre-defined schema.
In this paper, we mainly deal with information extraction
from textual records. Textual records are contiguous fragments of text present in unstructured text documents that resemble fields of a database record. In the context of restaurant addresses, an instance of a textual record is: “katsu.
1972 hillhurst ave. los feliz, california, 213-665-1891”
which can be segmented as [name= “katsu.”, address=
“1972 hillhurst ave.”, city= “los feliz”, state= “california”,
phone= “213-665-1891”]. Examples of textual records include citation strings found in research papers and contact
addresses found on person homepages. Pattern matching
Copyright © 2007, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
and rule-based approaches for IE that depend only on delimiters and other structural and font-based cues for segmenting such data are prone to failure, as these markers are generally not reliable. In addition, the varying order of fields rendered in text further complicates extraction. Algorithms
based on supervised machine learning such as conditional
random fields (CRFs) (Lafferty, McCallum, & Pereira 2001;
Peng & McCallum 2004; McCallum 2003; Sarawagi & Cohen 2005) and hidden Markov models (HMMs) (Rabiner
1989; Seymore, McCallum, & Rosenfeld 1999; Freitag &
McCallum 1999) are popular approaches for solving such
tasks. These approaches, however, require labeled training
data, such as annotated text, which is often scarce and expensive to produce. In many cases, there already exists a
database with data related to the expected input of the algorithm, and structure relevant to the desired output. This
database may be incomplete and noisy, and may be missing the surrounding contexts that act as reliable indicators for recognizing these data as they are rendered in unlabeled text; nonetheless, the database can act as a significant source of supervised guidance.
Previous work using databases to train information extractors has taken one of two simpler approaches. The first
attempts to train directly from the database alone – missing
information about typical transformations and context that
occur in the rendering of that data. For example, the method
proposed by Agichtein & Ganti (2004) trains a separate hidden Markov model for each field directly from database field
string values. The second approach uses hand-built heuristic rules to apply database field labels to unlabeled data, on
which training is then performed. The work of Geng &
Yang (2004) is one example. While correctly capturing the
context of data fields as they are rendered in the wild, this approach discards sequences that are not fully labeled. When
noisy and complex transformations are involved in the conversion of a record to text, such an approach may not work
well.
We build upon the framework of conditional random
fields to address these challenges. We present a method
that only requires some of the tokens in the text to be labeled with high precision. The input to the algorithm is a matching database record and textual record pair. (Associations between database and text can typically be created easily through heuristic methods or the use of an inverted index; they may also be created by asking the user for a match/non-match label.) An example of a matching database record for the text “katsu. 1972 hillhurst ave. los feliz, california, 213-665-1891” can be [name=“restaurant katsu”, address=“n. hillhurst avenue”, city=“los angeles”, state=“”, phone=“213-665-1891”]. Given such a pair, a high-precision partial labeling of the text sequence can be constructed easily. A simple
linear-chain CRF is trained with an expected gradient procedure to fill in missing labels of the text sequence. Similar
to Agichtein & Ganti (2004), we also explore the use of labeled field information in the database to further improve
our model. Employing a CRF during extraction enables us
to utilize multiple overlapping and rich features of the input
text in comparison with HMM-based techniques. Once an
extractor is trained, it is used to segment and label unseen
text sequences which are then added as records to the original database.
We present experimental results on a citation extraction
task using real-world BibTeX databases. Our results demonstrate a significant improvement over a baseline that only
relies on database records during training. For the baseline we implemented an enhanced version of the state-ofthe-art system presented in (Agichtein & Ganti 2004) . We
generate “pseudo”-textual records from a database by artificially concatenating fields in the order in which they may
appear in the real-world. Furthermore, real-world noise (e.g.
spelling errors, word and field deletions) is artificially added
to the database record before concatenation. Given such a
sequence that is fully-labeled using database field names,
a linear-chain CRF is trained on the data. In our experiments, we show that an extractor trained only on partially labeled text sequences achieves a 21% improvement in accuracy over this baseline. Furthermore, adding labeled supervision in the
form of database fields produces an additional 6% absolute
improvement in performance.
Related Work
Databases have been used in information extraction (IE) systems either for labeling text in the real world (Geng & Yang
2004; Craven & Kumlien 1999) or as sources of weaklylabeled data (Agichtein & Ganti 2004; Seymore, McCallum,
& Rosenfeld 1999).
Gene and protein databases are widely used for IE from
biomedical text. An automated process is employed in
(Craven & Kumlien 1999) to label biomedical abstracts with
the help of an existing protein database. The labeling procedure looks for exact mentions of database tuples in text
and does not deal with missing or noisy fields. Geng &
Yang (2004) suggest the use of database fields to heuristically label text in the real world and then train an extractor
only on data labeled completely with high-confidence. Producing a heuristic labeling of the entire text with high confidence may not be possible in general. We address this limitation in our work by requiring only a subset of tokens to be
labeled with high-precision. Other approaches (Sarawagi &
Cohen 2005; Sarawagi & Cohen 2004) employ database or dictionary lookups in combination with similarity measures
to add features to the text sequence. Although these features
are helpful at training time, learning algorithms still require
labeled data to make use of them.
Recent work on reference extraction with HMMs (Seymore, McCallum, & Rosenfeld 1999) has used BibTeX
records along with labeled citations to derive emission distributions for the HMM states. However, learning transition probabilities for the HMM requires labeled examples to
capture the structure present in real data. Some IE methods trained directly on database records (Agichtein & Ganti 2004) encode relaxations in the finite-state machine (FSM)
to capture the errors that may exist in real world text. Only a
few common relaxations such as token deletions, swaps and
insertions can be encoded into the model effectively. Because our model is trained directly on the text itself, we are
able to capture the variability present in real-world text.
Extraction framework
In this section, we present an overview of the framework
we employ to learn extractors from pairs of corresponding
database-text records.
A linear-chain conditional random field (CRF) constitutes
the basic building block in all our algorithms. The linear-chain CRF is an undirected graphical model consisting of a sequence of output random variables $\mathbf{y}$ connected to form a linear chain under first-order Markov assumptions. CRFs are trained to maximize the conditional likelihood of the output variables $\mathbf{y} = \langle y_1\, y_2 \ldots y_T \rangle$ given the input random variables $\mathbf{x} = \langle x_1\, x_2 \ldots x_T \rangle$. The probability distribution $p(\mathbf{y}|\mathbf{x})$ has an exponential form, with feature functions $f$ encoding sufficient statistics of $(\mathbf{x}, \mathbf{y})$. The output variables generally take on discrete states from an underlying finite state machine where each state is associated with a unique label $y \in \mathcal{Y}$. Hence, the conditional probability distribution $p(\mathbf{y}|\mathbf{x})$ is given by
$$p(\mathbf{y}|\mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t) \right) \qquad (1)$$
where $T$ is the length of the sequence, $K$ is the number of feature functions $f_k$, and $\lambda_k$ are the parameters of the distribution. $Z(\mathbf{x})$ is the partition function, given by
$$Z(\mathbf{x}) = \sum_{\mathbf{y}' \in \mathcal{Y}^T} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right).$$
When there are missing labels $\mathbf{h} = \langle h_1\, h_2 \ldots h_m \rangle$, their effect needs to be marginalized out when calculating the conditional probability distribution: $p(\mathbf{y}|\mathbf{x}) = \sum_{\mathbf{h} \in \mathcal{Y}^m} p(\mathbf{h}, \mathbf{y}|\mathbf{x})$.
The feature functions in a CRF model allow additional
flexibility to effectively take advantage of complex overlapping features of the input. These features can encode binary
or real-valued attributes of the input sequence with arbitrary
lookahead.
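To make Equation 1 concrete, the following is a minimal sketch (not the authors' implementation) of computing $\log p(\mathbf{y}|\mathbf{x})$ for a linear-chain CRF, assuming the feature scores $\sum_k \lambda_k f_k$ have already been collapsed into emission and transition potential arrays; the partition function $Z(\mathbf{x})$ is computed with a forward recursion in log space.

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_likelihood(emit, trans, labels):
    """Log p(y|x) for a linear-chain CRF (Equation 1).

    emit:   T x L array; emit[t, y] is the summed feature score of label y at position t
    trans:  L x L array; trans[a, b] is the score of the transition a -> b
    labels: length-T list of gold label indices
    """
    T, L = emit.shape
    # Unnormalized score of the gold label path.
    score = emit[0, labels[0]]
    for t in range(1, T):
        score += trans[labels[t - 1], labels[t]] + emit[t, labels[t]]
    # Forward recursion in log space for the partition function Z(x).
    alpha = emit[0].copy()
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + emit[t]
    return score - logsumexp(alpha)   # log p(y|x) = score(y) - log Z(x)
```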
The procedures for inference in linear-chain CRFs are similar to those used in HMMs; the Baum-Welch and Viterbi algorithms are easily extended, as described in Lafferty, McCallum, & Pereira (2001). During training, the parameters of the CRF are set to maximize the conditional likelihood of the training set $\mathcal{D} = \{(\mathbf{x}^{(1)}, \mathbf{y}^{(1)}), (\mathbf{x}^{(2)}, \mathbf{y}^{(2)}), \ldots, (\mathbf{x}^{(N)}, \mathbf{y}^{(N)})\}$.
Data format
In this section, we briefly work through an example of a
database containing restaurant records and a relevant textual
record found on a restaurant review website.
The input to our system is a database consisting of a
set of possibly noisy database (DB) records and corresponding textual records that are realizations of these DB
records. For example, a matching record-text pair can
be [name=“restaurant katsu”, address=“n. hillhurst avenue”, city=“los angeles”, state=“”, phone=“213-665-1891”] and the text rendering “katsu. 1972 hillhurst ave. los feliz, california, 213-665-1891”. Notice that the field state
is missing in the DB record. We assume a tokenization of
both the record and the text string based on whitespace characters. Let $\mathbf{x} = \langle x_1\, x_2 \ldots x_T \rangle$ represent the sequence of tokens in the text string under such a tokenization. We are not
given a label sequence corresponding to the input sequence
x.
Let $\mathbf{x}'$ be the DB record, with $\mathbf{x}'[i]$ denoting the $i$th field in the record. Let $\mathbf{y}'[i]$ be the label corresponding to the token sequence $\mathbf{x}'[i]$, where all the labels in the sequence have the same value. If some of the fields in the DB record are empty, then the corresponding record and label sequences are also empty.
We build robust statistical models to learn extractors by
leveraging label information from DB records. Using the
named fields of a DB tuple we unambiguously label some of
the tokens in the matching text string with high confidence.
To construct a partial labeling of the text we use heuristics
that include exact token match or strong similarity match
based on character edit distance (e.g. Levenshtein or character n-grams). In addition, we require that there be a unique
match of a record token within the text sequence. The final
labels produced have high precision although many of the
tokens may remain unlabeled, leading to low recall.
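A minimal sketch of such a high-precision labeling heuristic, under simplifying assumptions: whitespace tokenization, a light token normalization standing in for the paper's edit-distance similarity, and the uniqueness requirement described above. The record layout and helper names are illustrative.

```python
def partial_label(text_tokens, db_record):
    """High-precision partial labeling: a text token receives a field label
    only if it matches a DB token (after light normalization) and the match
    is unique within the text; ambiguous tokens stay unlabeled (None)."""
    norm = lambda s: s.strip(".,").lower()
    labels = [None] * len(text_tokens)
    claimed = [set() for _ in text_tokens]        # fields claiming each token
    for field, value in db_record.items():
        for rec_tok in value.split():
            hits = [i for i, t in enumerate(text_tokens)
                    if norm(t) == norm(rec_tok)]
            if len(hits) == 1:                    # require a unique match
                claimed[hits[0]].add(field)
    for i, fields in enumerate(claimed):
        if len(fields) == 1:                      # unambiguous field
            labels[i] = fields.pop()
    return labels

text = "katsu. 1972 hillhurst ave. los feliz, california, 213-665-1891".split()
record = {"name": "restaurant katsu", "address": "n. hillhurst avenue",
          "city": "los angeles", "state": "", "phone": "213-665-1891"}
print(partial_label(text, record))
# -> name at "katsu.", address at "hillhurst", city at "los", phone at the number
```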
A partial labeling of the text is denoted by $\mathbf{y} = \langle y_{t_1}\, y_{t_2} \ldots y_{t_m} \rangle$, where $t_1, t_2, \ldots, t_m$ are the positions in the sequence where labels are observed. In the gaps of the observed sequence $\mathbf{y}$, there exists a hidden label sequence $\mathbf{h} = \langle h_{t'_1}\, h_{t'_2} \ldots h_{t'_n} \rangle$ with unobserved label positions $t'_1, t'_2, \ldots, t'_n$. In the above example, “katsu”, “hillhurst”, “los” and “213-665-1891” would be labeled by
the high-precision heuristics with name, address, city and
phone respectively. The tokens “1972”, “ave.” and “feliz”
would remain unlabeled. If there were an extra token “los”
in the address field (e.g. “los n. hillhurst avenue”) or in
the text string (e.g. “los katsu 1972 hillhurst ...”), then the
tokens “los” in the various fields would also remain unlabeled.
CRF Training with Missing Labels
For a linear-chain CRF with an underlying FSM, we assume that each state has a corresponding unique output label $y \in \mathcal{Y}$. In the case of restaurant addresses, the states can be name, address, city, state and phone with a fully connected state machine. Given an input sequence $\mathbf{x}$ with a partially observed label sequence $\mathbf{y} = \langle y_{t_1}\, y_{t_2} \ldots y_{t_m} \rangle$ and a hidden label sequence $\mathbf{h} = \langle h_{t'_1}\, h_{t'_2} \ldots h_{t'_n} \rangle$, the conditional probability $p(\mathbf{y}|\mathbf{x})$ is given by
$$p(\mathbf{y}|\mathbf{x}) = \sum_{\mathbf{h}} p(\mathbf{y}, \mathbf{h}|\mathbf{x}) \qquad (2)$$
$$= \frac{1}{Z(\mathbf{x})} \sum_{\mathbf{h}} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(z_{t-1}, z_t, \mathbf{x}, t) \right)$$
where
$$z_t = \begin{cases} y_{t_i} & \text{for } t = t_i \\ h_{t'_j} & \text{for } t = t'_j. \end{cases}$$
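A sketch of how the sum over hidden labels in Equation 2 can be computed: a forward pass constrained to agree with the observed labels gives the numerator, and dividing by the unconstrained $Z(\mathbf{x})$ from the earlier sketch gives $p(\mathbf{y}|\mathbf{x})$. Potentials are assumed precomputed, as before.

```python
import numpy as np
from scipy.special import logsumexp

NEG_INF = -1e30   # effectively log(0)

def constrained_log_score(emit, trans, observed):
    """log sum_h of the unnormalized score over all completions z that agree
    with `observed` (a label index at observed positions t_i, None at hidden
    positions t'_j). Subtracting log Z(x) yields log p(y|x) of Equation 2."""
    T, L = emit.shape
    def mask(t):
        if observed[t] is None:          # hidden position: any label allowed
            return np.zeros(L)
        m = np.full(L, NEG_INF)          # observed position: pin the label
        m[observed[t]] = 0.0
        return m
    alpha = emit[0] + mask(0)
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + emit[t] + mask(t)
    return logsumexp(alpha)
```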
We could use EM to maximize the conditional data likelihood (Equation 2). However, we employ a direct gradient ascent procedure similar to the expected conjugate gradient suggested by Salakhutdinov, Roweis, & Ghahramani (2003):
$$\log p(\mathbf{y}|\mathbf{x}) = \log\left( \sum_{\mathbf{h}} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(z_{t-1}, z_t, \mathbf{x}, t) \right) \right) - \log\left( \sum_{\mathbf{y}'} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right) \right)$$
$$\frac{\partial \log p(\mathbf{y}|\mathbf{x})}{\partial \lambda_k} = \sum_{\mathbf{h}} p(\mathbf{h}|\mathbf{y}, \mathbf{x}) \sum_{t=1}^{T} f_k(z_{t-1}, z_t, \mathbf{x}, t) - \sum_{\mathbf{y}'} p(\mathbf{y}'|\mathbf{x}) \sum_{t=1}^{T} f_k(y'_{t-1}, y'_t, \mathbf{x}, t)$$
For the dataset of text strings that are partially labeled using heuristics, $\mathcal{D} = \{(\mathbf{x}^{(1)}, \mathbf{y}^{(1)}), (\mathbf{x}^{(2)}, \mathbf{y}^{(2)}), \ldots, (\mathbf{x}^{(N)}, \mathbf{y}^{(N)})\}$, we maximize the conditional log-likelihood
$$\mathcal{L}(\Lambda = \langle \lambda_1 \ldots \lambda_K \rangle; \mathcal{D}) = \sum_{i=1}^{N} \log p(\mathbf{y}^{(i)}|\mathbf{x}^{(i)}) - \sum_{k=1}^{K} \frac{\lambda_k^2}{2\sigma^2}$$
where $\sigma^2$ is the variance of the Gaussian prior used for regularization.
Thus,
$$\frac{\partial \mathcal{L}}{\partial \lambda_k} = \sum_{i=1}^{N} \sum_{\mathbf{h}} p(\mathbf{h}|\mathbf{y}^{(i)}, \mathbf{x}^{(i)}) \sum_{t=1}^{T} f_k(z_{t-1}, z_t, \mathbf{x}^{(i)}, t) - \sum_{i=1}^{N} \sum_{\mathbf{y}'} p(\mathbf{y}'|\mathbf{x}^{(i)}) \sum_{t=1}^{T} f_k(y'_{t-1}, y'_t, \mathbf{x}^{(i)}, t) - \frac{\lambda_k}{\sigma^2} \qquad (3)$$
The expected feature counts in Equation (3) are easily calculated using the forward-backward procedure (Culotta & McCallum 2004; McCallum 2003). Thus, given a partially labeled data set of textual records, we can set the parameters to maximize the conditional likelihood of the observed labels. In the process, missing labels are filled in such that they best explain the observed data, and hence a partially labeled text record becomes fully labeled.
To maximize the log-likelihood we use quasi-Newton L-BFGS optimization. Since there are missing labels, the objective is no longer convex and sub-optimal local maxima are possible. Some of the parameters may be initialized to reasonable values to attain a good local optimum, but we did not encounter problems with local maxima in practice. The parameters learned by the model are used at inference time to decode new text sequences.
Henceforth, we refer to this model as M-CRF, for missing-label linear-chain CRF.
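A sketch of the resulting training loop under these assumptions, using SciPy's L-BFGS to maximize the regularized objective; `ll_and_grad` is a hypothetical per-example helper standing in for the expected-gradient computation of Equation 3.

```python
import numpy as np
from scipy.optimize import minimize

def train(data, lam0, sigma2, ll_and_grad):
    """Fit CRF parameters by maximizing the regularized conditional
    log-likelihood with quasi-Newton L-BFGS. `ll_and_grad(lam, x, y)` is a
    hypothetical helper returning (log p(y|x), gradient) for one example,
    marginalizing hidden labels with forward-backward as in Equation 3."""
    def neg_objective(lam):
        ll, grad = 0.0, np.zeros_like(lam)
        for x, y in data:
            ll_i, g_i = ll_and_grad(lam, x, y)
            ll += ll_i
            grad += g_i
        ll -= np.dot(lam, lam) / (2.0 * sigma2)   # Gaussian prior
        grad -= lam / sigma2
        return -ll, -grad                          # L-BFGS minimizes
    return minimize(neg_objective, lam0, jac=True, method="L-BFGS-B").x
```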
CRF Training with the DB
The database provided as input to the system generally contains extra information that is not present in the partially labeled text sequences, such as:
• Additional fields that are not rendered in text.
• Training data for modeling transitions within the same
field.
Therefore, we train a CRF classifier on individual fields of a DB record $\mathbf{x}' = \{\mathbf{x}'[1], \mathbf{x}'[2], \ldots, \mathbf{x}'[n]\}$. The state machine we employ is the same as the FSM used in the M-CRF model and is trained to distinguish the field that a particular string belongs to. The labels at training time correspond to column names in the DB schema $\mathbf{y}' = \{y'[1], y'[2], \ldots, y'[n]\}$.
For a particular field $\mathbf{x}'[i] = \langle x'_1\, x'_2 \ldots x'_T \rangle$ we replicate the label $y'[i]$ for $T$ timesteps and construct a new label sequence $\mathbf{y}'[i] = \langle y'_1\, y'_2 \ldots y'_T \rangle$. A linear-chain CRF is then used as a classifier where the conditional probability for an individual field $p(\mathbf{y}'[i] \mid \mathbf{x}'[i])$ is given by
$$p(\mathbf{y}'[i] \mid \mathbf{x}'[i]) = \frac{1}{Z(\mathbf{x}'[i])} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y'_{t-1}, y'_t, \mathbf{x}'[i], t) \right)$$
and the conditional probability of a complete DB record $\mathbf{x}'$ is given by
$$p(\mathbf{y}'|\mathbf{x}') = \prod_{i=1}^{n} p(\mathbf{y}'[i] \mid \mathbf{x}'[i]).$$
Note that we do not model connections between different fields and treat each field independently. Given a database of records $\mathcal{D}' = \{\mathbf{x}'^{(1)}, \mathbf{x}'^{(2)}, \ldots, \mathbf{x}'^{(M)}\}$ and a fixed schema $\mathbf{y}' = \{y'[1], y'[2], \ldots, y'[n]\}$, we optimize the parameters $\Lambda$ to maximize the conditional likelihood of the data $\mathcal{D}'$:
$$\mathcal{L}(\Lambda = \langle \lambda_1 \ldots \lambda_K \rangle; \mathcal{D}') = \sum_{i=1}^{M} \log p(\mathbf{y}'|\mathbf{x}'^{(i)}) = \sum_{i=1}^{M} \sum_{j=1}^{n} \log p(\mathbf{y}'[j]|\mathbf{x}'^{(i)}[j])$$
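A minimal sketch of how the DB-CRF training sequences can be derived from records under these assumptions: each non-empty field becomes its own token sequence with the field label replicated at every timestep (the record layout is illustrative).

```python
def db_training_examples(db_records):
    """Turn each non-empty DB field into a (tokens, labels) pair where the
    field label is replicated for all T tokens, as in the DB-CRF model."""
    examples = []
    for record in db_records:
        for field, value in record.items():
            tokens = value.split()
            if tokens:                      # skip empty fields
                examples.append((tokens, [field] * len(tokens)))
    return examples

record = {"name": "restaurant katsu", "address": "n. hillhurst avenue",
          "city": "los angeles", "state": "", "phone": "213-665-1891"}
for toks, labs in db_training_examples([record]):
    print(toks, labs)
```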
As there are no hidden variables or missing labels, this learning problem is convex and optimal parameters can be obtained through an L-BFGS procedure. During decoding, when a complete text record needs to be segmented, we apply the Viterbi procedure to produce the best labeling of the text.
Since we do not learn parameters for transitions between fields, this approach, henceforth known as DB-CRF, may
perform poorly when applied in isolation. However, the current approach can be used to initialize the parameters of the
M-CRF model and then the M-CRF model can be further
trained on partially labeled text sequences. We call such an
approach DB+M-CRF.
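The staged procedure can be summarized as in the sketch below, which reuses the hypothetical `train` helper from the earlier M-CRF sketch; the two objective functions are assumptions standing in for the DB-CRF and M-CRF likelihoods.

```python
import numpy as np

def train_db_plus_m_crf(db_data, text_data, K, sigma2, db_obj, m_obj):
    """DB+M-CRF sketch: fit the convex DB-CRF objective first, then use its
    parameters to warm-start the non-convex missing-label (M-CRF) objective."""
    lam = train(db_data, np.zeros(K), sigma2, db_obj)   # DB-CRF stage (convex)
    return train(text_data, lam, sigma2, m_obj)         # M-CRF stage (warm start)
```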
Experiments
We present experimental results on a reference extraction
task which uses a database of BibTeX records and research
paper citations to learn an extractor. The Cora data set
(McCallum et al. 2000) contains segmented and labeled
references to 500 unique computer science research papers. This data set has been used in prior evaluations of
information extraction algorithms (McCallum et al. 2000;
Seymore, McCallum, & Rosenfeld 1999; Han et al. 2003;
Peng & McCallum 2004). We collected a set of BibTeX
records from the web by issuing queries to a search engine with keywords present in the citation text. We gathered
matching BibTeX records for 350 citations in the Cora data
set. We fixed the set of labels to the field names in references, namely, author, title, booktitle, journal, editor, volume, pages, date, institution, location, publisher and note.
We discard certain extra labels in the BibTeX, such as url and isbn. Other field names, like month and year, are converted into date. The final data set consists of
matching record-text pairs where the labels on the text are
used only at test time. We perform 7-fold cross-validation
on the data set with 300/50 training and test examples respectively.
We use a variety of features to enhance system performance. Regular expressions are used to detect whether a token contains all characters, all digits or both digits and characters (e.g. ALLCHAR, ALLDIGITS, ALPHADIGITS).
We also encode the number of characters or digits in the
token as a feature (e.g. NUMCHAR=3, NUMDIGITS=1).
Other domain-specific patterns for dates, pages and URLs
are also helpful (DOM). We also employ identity, suffixes,
prefixes and character n-grams of the token (TOK). Another
class of features that is helpful during extraction is lexicon features (LEX): the presence of a token in a lexicon such as “Common Last Names”, “Publisher names” or “US Cities”, encoded as a binary feature, helps improve the accuracy of the model. Finally, binary features at certain offsets (e.g. +1, -2) and various conjunctions can be added to the model (OFFSET).
Thus, a conditional model lets us use these arbitrary helpful
features that could not be exploited in a generative model.
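A sketch of token-level feature extraction in this spirit; the feature names follow the paper's groups (ALLCHAR, NUMCHAR, TOK, LEX, OFFSET), but the exact templates are assumptions.

```python
import re

def token_features(tokens, t, lexicons):
    """Binary features for token t: regex classes, length counts, identity,
    prefixes/suffixes, lexicon membership, and simple offset features."""
    tok = tokens[t]
    feats = set()
    if re.fullmatch(r"[A-Za-z]+", tok): feats.add("ALLCHAR")
    if re.fullmatch(r"\d+", tok):       feats.add("ALLDIGITS")
    if re.fullmatch(r"[A-Za-z0-9]+", tok) and not tok.isalpha() and not tok.isdigit():
        feats.add("ALPHADIGITS")
    feats.add(f"NUMCHAR={sum(c.isalpha() for c in tok)}")
    feats.add(f"NUMDIGITS={sum(c.isdigit() for c in tok)}")
    feats.add(f"TOK={tok.lower()}")
    for n in (2, 3):                          # prefixes and suffixes
        feats.add(f"PRE{n}={tok[:n].lower()}")
        feats.add(f"SUF{n}={tok[-n:].lower()}")
    for name, lex in lexicons.items():        # e.g. "US Cities"
        if tok.lower() in lex: feats.add(f"LEX={name}")
    if t + 1 < len(tokens):                   # OFFSET features
        feats.add(f"TOK+1={tokens[t + 1].lower()}")
    if t - 2 >= 0:
        feats.add(f"TOK-2={tokens[t - 2].lower()}")
    return feats
```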
We compare our models against a baseline that only relies on the database fields for training data. Instead of the
method suggested by Agichtein & Ganti (2004) in which
each field of the database is modeled separately, we actually simulate the construction of a textual record from a
given DB record. A CRF extractor is then trained using all
these labeled records. The baseline model (B-CRF) uses labeled text data that is constructed by artificially concatenating database fields in an order that is observed in real-world
text. This order is randomly chosen from a list of real-world
                 B-CRF   M-CRF   DB-CRF   DB+M-CRF   GS-CRF
Overall acc.     64.1%   85.2%   47.4%    91.9%      96.7%
author           92.9    89.6    87.0     97.5       99.2
title            69.7    93.7    50.6     96.9       98.3
booktitle        65.1    88.3    51.8     90.5       96.0
journal          64.8    78.1    31.0     83.8       92.7
editor           11.7     1.4     0.0     56.5       90.4
volume           37.8    68.3     1.4     83.7       96.9
pages            65.8    74.9    22.5     90.3       98.0
date             62.6    75.5     1.2     94.2       98.2
institution       0.0     6.8     9.1     20.4       79.4
location         18.8    47.1    10.7     80.0       89.5
publisher        34.9    59.6    28.8     84.2       94.1
note              7.8     4.1     4.5      8.2       26.2
Average F-1      48.4    61.9    26.3     78.7       89.0
Table 1: Extraction results on the paired BibTeX-citation data set. Bold-faced numbers indicate the best model that is closest
in performance to the gold-standard extractor.
orderings (provided by the user) that occur in text. Furthermore, delimiters are placed between fields while concatenating them. We further strengthen the baseline by modeling
variations of the record as they appear in the text. We randomly remove delimiting information, delete tokens, insert
tokens, introduce spelling errors and delete a field before
concatenation to simulate noise in the real-world text. Each
of these errors is injected with a probability of 0.1 during the concatenation phase. Given a set of fully-labeled noisy token sequences, a linear-chain CRF is trained to maximize the
conditional likelihood of the observed label sequences. The
trained CRF is then used as an extractor at test time. Lastly,
the gold-standard extractor (GS-CRF) is a CRF trained on
labeled citations from the Cora data set.
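A sketch of the pseudo-textual record generation for the B-CRF baseline under stated assumptions: fields are concatenated in a given order with a delimiter, and a subset of the noise operations (field deletion, token deletion, spelling errors, dropped delimiters) each fire with probability 0.1; token insertion and the delimiter-labeling choice are simplifications.

```python
import random

NOISE_P = 0.1   # probability of each noise operation

def pseudo_record(record, field_order, delim=","):
    """Concatenate DB fields in a realistic order and inject noise, yielding
    a fully labeled (tokens, labels) training pair for the baseline."""
    tokens, labels = [], []
    for field in field_order:
        value = record.get(field, "")
        if not value or random.random() < NOISE_P:          # field deletion
            continue
        for tok in value.split():
            if random.random() < NOISE_P:                   # token deletion
                continue
            if random.random() < NOISE_P and len(tok) > 1:  # spelling error
                i = random.randrange(len(tok))
                tok = tok[:i] + tok[i + 1:]                 # drop one character
            tokens.append(tok)
            labels.append(field)
        if random.random() >= NOISE_P:                      # sometimes drop delimiter
            tokens.append(delim)
            labels.append(field)    # delimiter label is an arbitrary choice here
    return tokens, labels
```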
The evaluation metrics used in the comparison of our algorithms against the baseline model are overall accuracy (the number of tokens labeled correctly in all text sequences, ignoring punctuation) and per-field F1.
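A sketch of these metrics, under the assumption that F1 is computed at the token level (the paper does not spell out token- versus segment-level F1):

```python
def token_accuracy(data, predictions, punct=frozenset(",.;:")):
    """Overall accuracy: fraction of non-punctuation tokens labeled correctly.
    `data` pairs (tokens, gold_labels); `predictions` holds predicted labels."""
    correct = total = 0
    for (tokens, gold), pred in zip(data, predictions):
        for tok, g, p in zip(tokens, gold, pred):
            if tok in punct:               # punctuation is ignored
                continue
            total += 1
            correct += (g == p)
    return correct / total

def field_f1(data, predictions, field):
    """Token-level F1 for a single field label."""
    tp = fp = fn = 0
    for (_, gold), pred in zip(data, predictions):
        for g, p in zip(gold, pred):
            tp += (g == field and p == field)
            fp += (g != field and p == field)
            fn += (g == field and p != field)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```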
Table 1 shows the results of various algorithms applied to the data set. The M-CRF model clearly provides an improvement of 21% over the baseline B-CRF in terms of overall accuracy. Using DB-CRF in isolation without labeled text data does worse than the baseline. However, initializing the DB+M-CRF model with parameters learned from DB-CRF provides a further boost of 6% absolute improvement in overall accuracy over the M-CRF model. All these differences are significant at p = 0.05 under a one-tailed paired t-test.
The DB+M-CRF model provides the highest performance
and is indeed very close to the gold-standard performance
of GS-CRF.
The effect of the token-specific prefix/suffix (TOK), domain-specific (DOM), lexicon (LEX) and offset-conjunction (OFFSET) features for the DB+M-CRF model is explored in Table 2. In particular, we find that the TOK features are most helpful during extraction.
Table 3 shows the percentage of DB records that contain a
certain field in our data set. In general, fields that are poorly
represented in DB records such as editor, institution and note
Feature type    % Overall acc.
ALL             91.94
–OFFSET         91.95
–LEX            90.89*
–DOM            90.72
–TOK            14.46*
Table 2: Effect of removing the OFFSET, LEX, DOM and TOK features one-by-one on the extraction performance of the DB+M-CRF model. Starred numbers indicate a significant difference from the row above (p = 0.05).
have a corresponding low F1 performance during extraction.
Additionally, since the content of the note field is varied, the
extraction accuracy is further reduced.
The largest improvement in performance is observed for
the editor field. During training, only a few partially labeled
text sequences contain tokens labeled as editor. Due to this lack of supervision, the M-CRF model is unable to disambiguate between the author and editor fields and therefore incorrectly labels the editor fields in the test data. However, by additionally utilizing editor fields present only in the database and not in the text, the DB+M-CRF model provides an absolute
improvement of 45% in F1 over the best baseline. Hence,
an advantage of the DB is that it gives further supervision
when there are extra fields in a record that are not present in
the matching text.
The database also improves segmentation accuracy by
modeling individual fields and their boundary tokens. This
is especially noticeable for fields such as booktitle and title where starting tokens may be stop words that are generally left unlabeled by high-precision heuristics. We find that
errors in labeling tokens by the M-CRF model on the author/title, title/booktitle and author/booktitle boundaries are
correctly handled by the DB+M-CRF model.
Furthermore, by modeling confusing fields such as volume, pages and date in context, the M-CRF and DB+M-
Field name     % occurrence in DB records
author          99.7
title           99.4
booktitle       43.9
journal         39.9
editor           5.7
volume          43.9
pages           82.2
date           100.0
institution      2.3
location        49.6
publisher       50.9
note             4.2
Table 3: Percentage of records containing a certain field.
CRF models provide an improvement over other baselines.
Similar improvements are observed for booktitle/journal and
publisher/location. The latter confusion is caused by city
names present in publisher fields (e.g., “New York” in the
field [publisher=“Springer-Verlag New York, Inc.”]).
Hence, a combination of partially labeled text data and fully labeled DB records is helpful for extraction.
Conclusions
This paper suggests new ways of leveraging databases and
text sources to automatically learn extractors. We obtain significant error reductions over a baseline model that only relies on the database for supervision. We find that a combination of partially labeled text strings and labeled database record strings is required to attain state-of-the-art performance.
References
Agichtein, E., and Ganti, V. 2004. Mining reference tables
for automatic text segmentation. In KDD, 20.
Craven, M., and Kumlien, J. 1999. Constructing biological knowledge bases by extracting information from text
sources. In ISMB, 77–86. AAAI Press.
Culotta, A., and McCallum, A. 2004. Confidence estimation for information extraction. In Proceedings of the Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Boston, MA.
Freitag, D., and McCallum, A. 1999. Information extraction with HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction.
Geng, J., and Yang, J. 2004. Autobib: Automatic extraction of bibliographic information on the web. In IDEAS,
193–204.
Han, H.; Giles, C. L.; Manavoglu, E.; Zha, H.; Zhang, Z.;
and Fox, E. A. 2003. Automatic document metadata extraction using support vector machines. In JCDL, 37–48.
Lafferty, J.; McCallum, A.; and Pereira, F. C. N. 2001.
Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 282.
McCallum, A.; Nigam, K.; Rennie, J.; and Seymore, K.
2000. Automating the construction of internet portals with
machine learning. Information Retrieval Journal 3:127–
163.
McCallum, A. 2003. Efficiently inducing features of conditional random fields. In UAI, 403–410. San Francisco, CA: Morgan Kaufmann.
Peng, F., and McCallum, A. 2004. Accurate information
extraction from research papers using conditional random
fields. In HLT-NAACL, 329–336.
Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2):257–286.
Salakhutdinov, R.; Roweis, S. T.; and Ghahramani, Z. 2003. Optimization with EM and expectation-conjugate-gradient. In ICML, 672.
Sarawagi, S., and Cohen, W. W. 2004. Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods. In KDD, 89.
Sarawagi, S., and Cohen, W. W. 2005. Semi-Markov conditional random fields for information extraction. In NIPS, 1185–1192. Cambridge, MA: MIT Press.
Seymore, K.; McCallum, A.; and Rosenfeld, R. 1999. Learning hidden Markov model structure for information extraction. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction.