BIOINFORMATICS
ORIGINAL PAPER
Vol. 21 no. 8 2005, pages 1653–1658
doi:10.1093/bioinformatics/bti165
Data and text mining
Wnt pathway curation using automated natural language
processing: combining statistical methods with partial and full
parse for knowledge extraction
Carlos Santos, Daniela Eggle and David J. States∗
Bioinformatics Program, The University of Michigan, Ann Arbor, MI 48109, USA
Received on October 15, 2004; revised on November 17, 2004; accepted on November 18, 2004
Advance Access publication November 25, 2004
ABSTRACT
Motivation: Wnt signaling is a very active area of research with
highly relevant publications appearing at a rate of more than one
per day. Building and maintaining databases describing signal transduction networks is a time-consuming and demanding task that
requires careful literature analysis and extensive domain-specific
knowledge. For instance, more than 50 factors involved in Wnt signal transduction have been identified as of late 2003. In this work
we describe a natural language processing (NLP) system that is
able to identify references to biological interaction networks in free
text and automatically assembles a protein association and interaction map.
Results: A ‘gold standard’ set of names and assertions was derived
by manual scanning of the Wnt genes website (http://www.stanford.
edu/∼rnusse/wntwindow.html) including 53 interactions involved in
Wnt signaling. This system was used to analyze a corpus of peer-reviewed articles related to Wnt signaling including 3369 Pubmed
and 1230 full text papers. Names for key Wnt-pathway associated
proteins and biological entities are identified using a chi-squared
analysis of noun phrases over-represented in the Wnt literature as
compared to the general signal transduction literature. Interestingly,
we identified several instances where generic terms were used on
the website when more specific terms occur in the literature, and one
typographic error on the Wnt canonical pathway. Using the named
entity list and performing an exhaustive assertion extraction of the
corpus, 34 of the 53 interactions in the ‘gold standard’ Wnt signaling set were successfully identified (64% recall). In addition, the
automated extraction found several interactions involving key Wnt-related molecules which were missing or different from those in the
canonical diagram, and these were confirmed by manual review of
the text. These results suggest that a combination of NLP techniques for information extraction can form a useful first-pass tool
for assisting human annotation and maintenance of signal pathway
databases.
Availability: The pipeline software components are freely available
on request to the authors.
Contact: [email protected]
Supplementary information: http://stateslab.bioinformatics.med.
umich.edu/software.html
∗To whom correspondence should be addressed.
INTRODUCTION
Detailed signal pathway annotation and model construction can
be an arduous task for human readers to accomplish. The task is
complicated for heavily investigated pathways like the Wnt signal
transduction cascade or other major cellular pathways due to the large
volume of papers published for biological interactions involving
members of those pathways. In the Wnt signal transduction literature,
for instance, there were 239 MeSH-annotated ‘Signal Transduction’
Wnt pathway MEDLINE articles in 2003, and 889 articles for the
period from 2000 to 2004. Expanding the search to include other
co-factors or major proteins in the pathway expands the results to
many thousands of articles.
For a pathway like the Wnt pathway, up-to-date models are
essential for investigators in the field; without accurate models,
experimental results may be placed outside of the proper biological
context or key insights may be missed altogether if the model structure is incorrect. Comprehensively annotated models of complex
pathways like Wnt are also essential for hypothesis-generation and
experiment validation, yet with the exception of periodic reviews on
the subject, there are few sources of Wnt-signaling information that
are kept consistent with the latest published literature.
In the past, various groups (Andrade and Valencia, 1997; Blaschke,
1999; Daraselia et al., 2004; Iliopoulos et al., 2001; Koike et al.,
2003; Raychaudhuri et al., 2002; Stephens et al., 2001; Wilbur and
Yang, 1996) have used natural language processing (NLP) systems
to extract biological molecule annotation information (Andrade and
Valencia, 1997), to detect protein–protein interaction information
(Bader and Hogue, 2002; Blaschke, 1999; Marcotte et al., 2001),
or to improve indexing and recall into searches from MEDLINE
abstracts (Iliopoulos et al., 2001; Stapley and Benoit, 2000). Methods
included a mixture of text mining and indexing, with some groups
using classification by Bayesian statistics (Wilbur and Yang, 1996),
structured grammar matches (Temkin and Gilder, 2003), or word
filtering of known entities, as well as the use of partial and full
parsers. Full parsers have been employed to discover protein–protein
interactions with promising results, highlighting the utility of this
approach (Daraselia et al., 2004); however they are not available as
open-source.
We have developed an automated NLP-based system to assist in the
generation of up-to-date pathway models from the literature that can
automatically detect and rank key interacting proteins in an article
corpus like that of Wnt signaling.
© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
The named entity module we present employs a word-statistic
chi-squared test, but begins with a partial parser to derive the necessary named entities. Then, the full parser module provides deep
phrase attachment, syntax annotation and grammatical relations, and
extracts interaction statements by filtering results with a list of verbs
and the named entity list derived from the partial parse.
We avoid the need to generate and maintain a large-scale named
entity list by taking advantage of both the Link parser’s (Sleator and
Temperly, 1991) phrase attachment facilities and the fast partial parser’s (Abney, 1996a) noun phrase annotation to generate a list
of words specific to Wnt signal transduction. Our system uses the
fast partial parser coupled with a simple statistical test to automatically build a corpus-specific named entity list without requiring an
extensive pre-computed synonym list. While this approach is only a
first-pass disambiguation of the named entities found within the corpus, for the queries likely to be of interest to a human domain expert,
we find this automated named entity annotation to be at least as specific as the human-constructed signaling pathway entities available
in the public domain.
Following named entity extraction, we detect the actual interactions and protein associations with the Link parser (Sleator and
Temperly, 1991). The parser allows us to reduce grammatically complicated sentences into simplified ‘tuples’ which roughly correspond
to specific biological assertions made in any particular sentence. The
3-tuple representation allows for fast search for a direct linking verb
between two named entities. The search we perform yields various
relevant possible additions to the canonical Wnt pathway, as well as
provides provenance and annotation for a majority of the interactions
present in the pathway where source material was not annotated.
METHODS: ARTICLE XML PROCESSING AND
FULL PARSE
HTML retrieval and XML conversion
Full-text and MEDLINE articles are retrieved using NCBI’s Linkout eretrieval utility (National Center for Biotechnology Information—Entrez
Programming Utilities, 2004). For an initial query, an XML file of retrieved
UI (Pubmed ID) entries serves as a corpus index, from which a local Perl script
retrieves where possible the full-text article (via LinkOut URL) and MEDLINE entry. The latter entry serves as a backup entry for cases where full-text
may not be present, or where the NCBI LinkOut URL yields only a PDF file.
For the Wnt signaling pathway, we queried Pubmed with:
(‘Signal Transduction’[MeSH] OR Wnt[All fields] OR Akt[All Fields] OR
catenin[All Fields] OR frizzled[All Fields])
The query yielded 3523 articles (full analysis in supplementary data),
of which 3369 could be retrieved in XML. Of these 3369 documents,
the majority (2914) had a parseable abstract field (either from HTML or
MEDLINE record), and of the 455 that did not, the papers were often
review papers, with the XML tag marked as ‘TOP’. The full corpus
composition is available as supplementary information at: http://stateslab.
bioinformatics.med.umich.edu/software.html. The query was restricted to the
past five years (1999/03/03 to 2004/03/01).
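The query and date window above can be expressed as a request to NCBI's Entrez Programming Utilities. The following is a minimal sketch in Python rather than the Perl used in this work. It assumes the present-day `esearch.fcgi` endpoint with its `db`, `term`, `datetype`, `mindate`, `maxdate` and `retmax` parameters; the `retmax` value and the choice of publication date (`pdat`) are illustrative assumptions, and no network request is made.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed endpoint: the current NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The PubMed query given in the text.
WNT_QUERY = ('"Signal Transduction"[MeSH] OR Wnt[All Fields] OR Akt[All Fields] '
             'OR catenin[All Fields] OR frizzled[All Fields]')


def build_esearch_url(term, mindate, maxdate, retmax=10000):
    """Return an esearch URL for a PubMed query restricted to a date window."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",   # assumption: filter on publication date
        "mindate": mindate,
        "maxdate": maxdate,
        "retmax": retmax,     # assumption: upper bound on returned IDs
    }
    return ESEARCH + "?" + urlencode(params)


url = build_esearch_url(WNT_QUERY, "1999/03/03", "2004/03/01")

# Decode the URL again to confirm the query survives percent-encoding.
decoded = parse_qs(urlparse(url).query)
print(decoded["term"][0])
```

The returned list of identifiers would then play the role of the corpus index described above.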
XML document structure parsing
To normalize successfully retrieved HTML papers, we developed a
document-structure parsing script in Perl (v. 5.6.0) that extracts into XML format the Titles, PMID, Abstract, Methods/Materials, Conclusions, Figures,
Tables, and References sections of full-text articles. We parse sentences
within all sections by default, only explicitly excluding sections parsed as
‘References’. It is important to note that of the 3369 retrieved papers, over
10% had no explicitly labeled ‘abstract’ section (even if one was provided in
the MEDLINE).
Pre-processing and parse
For parsing, we process and exclude non-parseable sections like references and tables in each paper. Articles are then processed through
a Link grammar parser (Sleator and Temperly, 1991) (version 4.1a;
http://www.link.cs.cmu.edu/link/ftp.html) on a 16-node Linux cluster.
Link parser output
For each sentence, the parser yields word associations as a flat list with
left-hand terms ‘attached’ by a grammar relation to terms on the right. The
‘subject–verb-object’ relations provided by the parser form the core assertions
we wish to capture from the parse. The parser captures the main verb of each
clause or sentence, links it with the proper subject noun, and object if present,
yielding a subject–verb–object assertion which we extract as a 3-tuple.
METHODS: ASSERTION REPRESENTATION VIA
LINK PARSING: SUBJECT–VERB–OBJECT
TUPLES
Tuple format
The structures we call tuples are Link-grammar-parser derived structured,
hierarchical representations of grammatical relations between phrases and
words within sentences. Generally, each tuple takes the form of a three-component structure.
In our tuple format:
<int pmid="12952940">
<protA>Wnt</protA>
<protB>Frizzled</protB>
<assert>
<src_sent>...</src_sent>
<tuple>
<subj>...</subj>
<verb> ... </verb>
<obj> ... </obj>
</tuple>
</assert>
</int>
Each interaction (int) contains two named entities, protA and protB, and an assert
element which contains a sentence (src_sent) and a tuple element (tup). The
tup contains a subject (subj), a verb (verb) and an object (obj). The subject and
object terms can be either single or multi-word nouns, attached to modifying
prepositional phrases, adjectives and articles. Verbs are single words and are
marked as verb. Objects follow the specific verb marked.
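As an illustration of this record layout, the sketch below assembles one interaction element with Python's standard `xml.etree.ElementTree` module. The element names follow the listing above; the function name `build_interaction` and the placeholder PMID `"0"` are hypothetical, and the sentence is the Wnt8 example used elsewhere in this paper. This is not the Perl code of the pipeline.

```python
import xml.etree.ElementTree as ET


def build_interaction(pmid, prot_a, prot_b, sentence, subj, verb, obj):
    """Build one <int> element holding a source sentence and its 3-tuple."""
    node = ET.Element("int", {"pmid": pmid})
    ET.SubElement(node, "protA").text = prot_a
    ET.SubElement(node, "protB").text = prot_b
    assertion = ET.SubElement(node, "assert")
    ET.SubElement(assertion, "src_sent").text = sentence
    tup = ET.SubElement(assertion, "tuple")
    ET.SubElement(tup, "subj").text = subj
    ET.SubElement(tup, "verb").text = verb
    ET.SubElement(tup, "obj").text = obj
    return node


# Placeholder PMID "0": the example sentence is not tied to a real record here.
record = build_interaction(
    "0", "Wnt8", "LRP6",
    "Wnt8 binds to LRP6 and Frizzled8.",
    "Wnt8", "binds", "LRP6",
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Keeping the source sentence next to the tuple means a curator can check each assertion against its provenance without returning to the article.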
Some authors (Koike et al., 2003) employ sophisticated template-matching
with partial parse-based algorithms when detecting interactions. These systems are faster than our parse, but often require substantial manual template
generation for the partial parser.
Our interaction detection searched for phrases with two named entities
flanking any of a select group of stemmed verbs. The verb list itself was
manually compiled from a listing of verbs found in the corpus and from
verbs in general usage likely to be found describing protein-interactions.
These ‘direct’ and ‘indirect’ physical interaction verbs are split into:
Direct interaction verbs: bind (bound), interact(-s,-ed), stabilize(-s,-d),
phosphorylate(-s,-d), ubiquinate(-s,-d), sumoylate(-s,-d), degrade(-s,-d),
block(s).
Indirect interaction verbs: induc(-es,-ed), trigger(-s,-ed), block(s),
enhance(s), synergize(s), cooperate(s), localizes, regul(-ates,-ion),
activate(s), inhibit(s), control(s), translocate(s), antagonize(s),
amplif(-y,-ies), transduce(s), degrade(s), trigger(s).
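The search described above, two named entities flanking a listed verb, can be sketched as a filter over subject-verb-object tuples. The stems below are derived from the verb lists in this section; verbs that appear in both lists (block, degrade) are treated as direct, which is an assumption of this sketch. The function names, the single-token entity matching and the prefix test on stems are also simplifications and not the pipeline's actual matching rules.

```python
import re

# Stems derived from the direct and indirect verb lists in the text.
DIRECT_STEMS = ("bind", "bound", "interact", "stabiliz", "phosphorylat",
                "ubiquinat", "sumoylat", "degrad", "block")
INDIRECT_STEMS = ("induc", "trigger", "enhanc", "synergiz", "cooperat",
                  "localiz", "regul", "activat", "inhibit", "control",
                  "translocat", "antagoniz", "amplif", "transduc")


def classify_verb(verb):
    """Return 'direct', 'indirect' or None for a verb such as 'binds.v'."""
    word = verb.lower().split(".")[0]  # drop a Link-style suffix like ".v"
    if word.startswith(DIRECT_STEMS):
        return "direct"
    if word.startswith(INDIRECT_STEMS):
        return "indirect"
    return None


def find_entity(phrase, entities):
    """Return the first single-token named entity present in the phrase."""
    tokens = set(re.findall(r"[A-Za-z0-9-]+", phrase.lower()))
    for name in entities:
        if name.lower() in tokens:
            return name
    return None


def match_interaction(subj, verb, obj, entities):
    """Return (entity A, interaction type, entity B), or None if no match."""
    kind = classify_verb(verb)
    a = find_entity(subj, entities)
    b = find_entity(obj, entities)
    if kind and a and b and a != b:
        return (a, kind, b)
    return None


names = ["Wnt8", "LRP6", "Frizzled8", "Akt"]
hit = match_interaction("Wnt8", "binds.v", "LRP6", names)
miss = match_interaction("Akt", "activated.v", "many cytokines", names)
print(hit, miss)
```

Multi-word entities such as 'casein kinase i epsilon' would need phrase-level matching, which this sketch omits.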
Tuple examples
The system outputs tuple assertions from sentences in XML:
<assert>
<src_sent>
Wnt8 binds to LRP6 and Frizzled8.
</src_sent>
<tup>
<subj>Wnt8</subj>
<verb mod="v">binds.v</verb>
<obj><p pp="to">LRP6</p></obj>
</tup>
</assert>
The sentence above, ‘Wnt8 binds to LRP6 and Frizzled8.’ yields two
assertion tuples: the binding of ‘Wnt8’ to ‘LRP6’ and a matching tuple (not
shown) for the binding of ‘Wnt8’ to ‘Frizzled8’.
In addition to direct interactions, in sentences where a verb suggesting an
interaction is found within the object, we take as the asserting verb the
closest preceding matching verb or gerund within the phrase that contains
the named entity in the object.
METHODS: AUTOMATIC NAME EXTRACTION
FROM A PARTIAL PARSER
The Cass parser (Abney, 1996b) is a fast (10 000 sentences/hour) deterministic partial parser that we use to construct a named entity set specific to
the current domain. The parser has several key advantages over a parser like
Link that make it a worthwhile choice for a named entity recognizer, primarily its good specificity for detecting selected ‘phrase chunks’ of sentences at
speeds which are many orders of magnitude greater than those achieved with
a full parser like Link. This markup allows us to statistically compile named
entity candidates (noun phrases) from the small topic-specific corpus against
a massive background corpus (all ‘signal transduction’), while reserving the
use of a computationally expensive full parser only for determining tuples in
the small corpus.
We used the Cass parser to select named entities (noun phrases) for the Wnt
pathway by comparing the occurrence of named entities in the Wnt-specific
article corpus against their occurrence in a ‘background’ signal transduction
literature corpus (10 000 records, yielding 8873 parsed articles corresponding
to the PubMed query ‘Signal Transduction’[MeSH] from the previous two
years).
By comparing the frequency of ‘Wnt’ to ‘signal transduction’ noun phrases,
we calculated one-degree of freedom chi-squared values for Wnt Cass noun
phrases relative to the signal transduction corpus and ranked them according
to that chi-squared value. Significance was set as p < 0.001. Examples
of over-represented Wnt terms included both single phrases and compound
phrases.
For every NX term, X^2 was calculated as:
\[ X^2 = \sum_{i=1}^{k} \frac{\left( \frac{w_i}{W} - \frac{s_i}{S} \right)^2}{s_i / S} \qquad (1) \]
where
w_i: the number of occurrences of NX term i in the Wnt-specific corpus;
W: the total number of NX terms in the Wnt-specific corpus;
s_i: the number of occurrences of term i in the signal transduction corpus;
S: the total number of NX terms in the signal transduction corpus.
Note that not all terms were proteins, since the terms are noun phrases in general; proteins of interest were filtered at search time. Noun phrases we detected
included both single (‘wnt’) and multiple-word forms that would otherwise
be missed by a dictionary-based search (e.g. ‘casein kinase i epsilon’).
METHODS: AUTOMATIC NAME EXTRACTION
USING A FULL PARSER
Full-parse phrase-derived named entity extraction
from the Link parser
The second named entity-extracting module in the pipeline scans the tuples
generated (Wnt-specific tuples) from the Link parse for tuples derived from
sentences such as ‘X is . . . a protein’ and ‘the Y protein’. For every tuple
formatted with ‘is’ as the verb, we find the subject, and if it is a single word
or phrase, capture the predicate phrase for that tuple, and append the subject
into an index entry one word at a time, recursively. For example:
Sentence: E-cadherin is a transmembrane glycoprotein ..
E-cadherin >> is >> glycoprotein
E-cadherin >> is >> transmembrane glycoprotein
E-cadherin >> Append to "glycoprotein" file
E-cadherin >> Append to "transmembrane glycoprotein" file
After categories are formed and the first set of names is input, the system
re-scans the entire corpus for phrases of the form ‘article X Y’, where article
is either ‘a’, ‘an’, or ‘the’, Y is a term category (e.g. ‘protein’), and X is a
non-whitespace term. This second pass allows us to capture a small additional
fraction of terms of the form ‘the Wnt protein’, where the last word in the
phrase is a solid term category like ‘protein’.
The end result of both passes is a series of categories or category files,
comprising a shallow ontology. This auto-categorization system yielded 7066
distinct categories for the 3306-article Wnt-signaling specific corpus, and
24 474 terms within those categories, of which 24 323 were unique terms. The
largest categories are, not surprisingly, commonly discussed terms, including
‘protein’, ‘gene’, ‘proteins’, etc.
We find the terms extracted are very specific as they are directly extracted
from direct declarative statements in the corpus.
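The two categorization passes can be sketched as follows. This is a simplified Python illustration, not the pipeline's implementation: the first function files a subject under each trailing sub-phrase of an ‘is’ predicate, as in the E-cadherin example, and the second looks for ‘article X Y’ patterns for single-word categories. The function names, the in-memory dictionary standing in for category files, and the seed sentence used to create the ‘protein’ category are all assumptions of the sketch.

```python
import re
from collections import defaultdict

ARTICLES = ("a", "an", "the")


def categorize_is_a(subject, predicate, index):
    """File the subject under each trailing sub-phrase of the predicate."""
    words = [w for w in predicate.split() if w.lower() not in ARTICLES]
    # Grow the category one word at a time, starting from the head noun.
    for start in range(len(words) - 1, -1, -1):
        index[" ".join(words[start:])].add(subject)


def second_pass(text, index):
    """Capture X from 'article X Y' where Y is a known one-word category."""
    for category in list(index):
        if " " in category:
            continue
        pattern = r"\b(?:a|an|the)\s+(\S+)\s+" + re.escape(category) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            index[category].add(match.group(1))


index = defaultdict(set)
categorize_is_a("E-cadherin", "a transmembrane glycoprotein", index)
# Hypothetical seed so that a 'protein' category exists for the second pass.
categorize_is_a("Axin", "a protein", index)
second_pass("Secretion of the Wnt protein was reduced.", index)
print(sorted(index["glycoprotein"]), sorted(index["protein"]))
```

Because each category is just a set of subjects, the resulting structure corresponds to the shallow ontology described above, with no hierarchy beyond shared trailing words.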
MANUAL ANNOTATION RESULTS
Precision is measured as the fraction of returned interactions that
are correct, and recall as the percentage of the interactions in the
gold standard (Nusse, 2004) that are captured. Results
are given in Table 1.
Calculation of precision
We define precision as the fraction of correct tuples returned by the
parser. A tuple counts as correct when its source sentence actually
supports a direct physical binding interaction or mentions an
indirect but biological relationship between the two protein entities
in the tuple.
From the corpus, we derived a set of 6787 tuples/interactions, of
which 1210 were unique pair-wise. We tested 5% (randomly selected) of the data set (340 sentences), representing individual unique
sentences with their tuples and the two interacting proteins, and hand-scored assertions for the accuracy of the tuple and named entity
search to determine if the sentences support the interactions noted.
This tests the performance of the parse/extraction software without
explicitly biasing the sampling towards a subset of the corpus (e.g.
interactions which only contain a few papers in the entire corpus).
For the parser evaluation, we tally but ignore from the final count
all name-detection errors as these are a function of the named entity
module or of the human input.
‘Direct’ verb tuples are more useful for actual diagramming of
physical pathways, but the ‘indirect’ interactions are still indicative
of relationships between distant pathway components, and may be
useful for validation of models built with the system. We are not
measuring interaction directionality at present in the system.
Table 1. Performance metrics from Wnt hand-input named entities
| Metric | Count (%) | Interaction as detected | Pubmed ID | Example sentence | Example tuple (short format) |
| --- | --- | --- | --- | --- | --- |
| Total sample counted manually | 340 | - | - | - | - |
| Total false positive names (ignored from both tallies below) | 27 (7.9%) | Akt <-> Tir | 12896980 | Although Akt activity was also induced by Tiron and DPI, the other two free-radical scavengers examined , only selenite supported cell growth. | LIN: [Akt activity.n] v:<was.v> [m:<induced.v> only [pp by Tiron]] |
| Total indirectly/categorically correct interactions (A pathway. . .B pathway. . .ignoring name errors) | 175 (55.9%) | Akt <-> PI3-kinase | 14557259 | Akt is activated by many growth factors and cytokines in a PI3-kinase-dependent manner. | Akt v:<is.v> [m:<activated.v> [pp in [a PI3-kinase-dependent manner.n]] [pp by [many cytokines.n]]] |
| Total directly/physical interaction correct (A->binds->B, ignoring name errors) | 108 (34.5%) | Dvl <-> Axin | 11113207 | Consistent with these results, Dvl interacts with Axin and inhibits GSK-3 beta-dependent phosphorylation of beta-catenin, APC, and Axin in the Axin complex. | Dvl v:<interacts.v> [pp with Axin] |
| Total correct names, but error in the parse (ignoring name errors) | 30 (9.5%) | Dvl <-> Axin | 11113207 | Consistent with these results, Dvl interacts with Axin and inhibits GSK-3 beta-dependent phosphorylation of beta-catenin, APC, and Axin in the Axin complex | Dvl v:<inhibits.v> [pp in [the Axin complex.n]] |
| Total Gold Standard Associations Detected | 34 of 53 (58.4%) | | | | |
| Parse/extract precision for assertions with correctly selected names | (175 + 108)/(340 − 27) = 90.4% | | | | |
| Recall versus gold standard (Wnt genes website) | 34/53 (64.1%) | | | | |
| Separate unique interactions (overall) | 1210 | | | | |
Interacting proteins are represented in bold.
Calculation of recall
The exact recall metric for a system like ours is difficult to calculate
manually, as it would require determining the total number of ‘facts’
made about binding proteins in the articles scanned. We therefore
calculate recall as the fraction of the gold standard interaction set
we are able to reproduce compared to the Wnt genes homepage,
rather than as the fraction of interactions detected against the absolute
‘assertion or interaction’ count in the corpus.
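The precision and recall figures follow directly from the counts reported in Table 1. The short Python sketch below reproduces the arithmetic; the variable names are illustrative, and all numbers are taken from the table.

```python
# Counts reported in Table 1.
sampled = 340            # hand-scored sentences
name_errors = 27         # false positive names, excluded from the tallies
indirect_correct = 175   # indirectly/categorically correct interactions
direct_correct = 108     # directly/physically correct interactions
parse_errors = 30        # correct names but an error in the parse
gold_total = 53          # interactions in the gold standard
gold_found = 34          # gold standard interactions recovered

# Sentences that remain once name-detection errors are set aside.
scored = sampled - name_errors

# Precision: correct assertions among those with correctly selected names.
precision = (indirect_correct + direct_correct) / scored

# Recall: fraction of the gold standard set that was reproduced.
recall = gold_found / gold_total

print(f"scored={scored} precision={precision:.3f} recall={recall:.3f}")
```

The three scored categories (175, 108 and 30) sum to the 313 sentences that remain after name errors are removed, which is why the percentages in Table 1 are expressed relative to 313 rather than 340.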
Domain specificity
By default, all returned interactions that are ‘correct’ are within the
domain. The corpus itself is the domain we examine, and we expect
a ‘Wnt’ corpus to therefore contain only within-domain interactions.
DISCUSSION: USE OF A PARTIAL PARSER FOR
NAMED ENTITY EXTRACTION
The Cass parser lacks certain phrase attachment and coordination
capabilities of Link, but we found that its relatively good accuracy and
very high speed allowed us to use Cass as a named entity extractor.
Cass’ finite-state grammar rules allow us to extract multiple-word
noun phrases without requiring the use of an external dictionary or
coordination and integration with existing synonym lists.
In actual usage, we found that compiling extensive named entity
lists from other databases provided little benefit, as in the end, interactions adding to the gold standard will be manually verified before
being submitted as authoritative. Extracting the named entities from
the text itself yields word phrases that are guaranteed to match (even if
they are spelling variants), and allows extraction of useful assertions
that can later be verified for accuracy. As expected, this process is
extremely fast, but can occasionally introduce spurious ‘interactions’
between terms and common phrases.
RESULTS: COMPARISON OF AUTOMATIC WNT
PATHWAY ANNOTATION AND THE EXISTING
GOLD STANDARD
The system discovered various high chi-squared terms with additional or different annotations than those present in the gold standard:
The phosphorylation interaction between CKI-epsilon
(CK1e) and APC
In the diagrammed gold standard Wnt-signaling pathway, no specific mention of CK1-epsilon (CKIe, CKI epsilon) interaction with
APC is made, and on closer inspection, Kishida et al. (2001) do
make a statement of the direct phosphorylation between the two
molecules.
The phosphorylation of beta-catenin by CKII (CK2)
The Wnt genes gold standard mentions CK2 as CKII in the context of
binding to Dishevelled, but does not specifically show direct interaction of CK2 with beta-catenin in the protein interaction figures
although links to a paper describing phosphorylation of beta-catenin
by CK2 are provided. Our search independently found two articles,
the cited article (Song et al., 2003) and a morphological
study (Rosner et al., 2002), which describe the direct interaction of
CK2 with beta-catenin. The chi-squared values for CK2
and beta-catenin are 1179.50 and 40537.69, respectively, suggesting
these terms are significantly over-represented in the Wnt literature as
a whole, and suggesting this interaction should be a directly featured
pair in the gold standard map.
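Chi-squared values such as those quoted for CK2 and beta-catenin come from comparing a term's share of the Wnt corpus with its share of the background signal transduction corpus. The Python sketch below illustrates one way to compute and rank such scores under stated assumptions: it uses a relative-frequency form (squared difference of the two shares, divided by the background share), scales it by the Wnt corpus size to obtain a count-based Pearson term, and compares that against 10.828, the standard critical value for p < 0.001 at one degree of freedom. The scaling, the skipping of terms absent from the background, the function names and the toy counts are assumptions of this sketch and are not taken from the pipeline.

```python
def term_score(w_i, W, s_i, S):
    """Squared difference of corpus shares, relative to the background share."""
    p_wnt = w_i / W
    p_background = s_i / S
    return (p_wnt - p_background) ** 2 / p_background


def rank_overrepresented(wnt_counts, background_counts, critical=10.828):
    """Rank terms that are over-represented in the Wnt corpus."""
    W = sum(wnt_counts.values())
    S = sum(background_counts.values())
    ranked = []
    for term, w_i in wnt_counts.items():
        s_i = background_counts.get(term, 0)
        if s_i == 0:
            continue  # assumption: terms unseen in the background are skipped
        if w_i / W <= s_i / S:
            continue  # keep only terms more frequent in the Wnt corpus
        scaled = W * term_score(w_i, W, s_i, S)  # count-scaled Pearson term
        if scaled > critical:
            ranked.append((term, scaled))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Toy noun-phrase counts, invented purely for illustration.
wnt_counts = {"wnt": 50, "protein": 40, "cell": 10}
background_counts = {"wnt": 10, "protein": 400, "cell": 590}
ranked = rank_overrepresented(wnt_counts, background_counts)
print(ranked)
```

Since the Wnt corpus size is constant across terms, scaling by it changes the magnitude of the scores but not their ranking.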
Six3 and Wnt regulation
The Wnt genes website lists Six3 [Sine oculis homeobox
(Drosophila) homologue 3] as a Wnt target gene (Lagutin et al.,
2003). Six3 also feeds back to repress Wnt expression, an interaction
not mentioned on the website and specifically absent from the
table of Wnt feedback target genes; although again, a paper cited by
the website describes this interaction (Braun et al., 2003).
Pathway expansion: Wnt downstream targets
Chen et al. (2001) report that Wnt-1 signaling inhibits apoptosis and
caspase activation induced by cancer chemotherapy. Such distant
pathway cross-talk events of activation and regulation between Wnt
and other pathways are difficult to curate manually and by definition
are often not fully referenced in ‘canonical’ diagrams. In particular,
remote downstream activation or cross-talk between proteins downstream of the canonical pathway are areas where statements in the
literature could be mined by automatic annotation software.
Wnt-7a and LMX-1b
Lmx1b is induced in the mouse dorsal mesenchyme by wnt-7a and
it is both necessary and sufficient to specify dorsal limb pattern (Liu
et al., 2003). The activation pattern was not noted in the Wnt genes
website, but was found amongst the interactions by the machine parse
(in article PMID 12588849) (Liu et al., 2003).
Typographical corrections: Pygopus and Pygopos
Human typists are not infallible, and the name recognizer component of the pathway automatically discovered the Pygopus name but
missed the interaction with Pygopos. The latter term entered the
term list through human entry, and manual review showed that the
misspelling originated in the annotation of the Wnt
signaling canonical pathway itself. The example serves not as any particular criticism of the pathway map, but rather highlights the risk of
relying on human typed input into pathway annotations; automated
systems do not fatigue or commit unintentional typographical mistakes whereas human input can lead to a certain degree of error even
in highly curated databases.
CONCLUSION
Our results with automatic component identification and interaction detection in the Wnt signaling pathway suggest that natural
language techniques are able to substantially improve the coverage
of canonical reference literature and signaling models. The high precision and processing speed of this automated signaling interaction
pipeline demonstrate the value of full parsers and statistical techniques. Using this approach as a ‘first-pass’ filter into the literature
can usefully assist scientists maintaining databases and information
resources in complex and rapidly evolving fields such as signaling
pathways. As with any fully automated system, however, the recall
rates with respect to the known canonical models do not yet match
those of an expert human reviewer.
In the future, we expect to capture directionality and type of interaction in a more robust way for our assertions. This will require
more template development, and may require the use of an ontology
as an outside reference source for error-detection of incorrect assertions. The role we most expect this system to serve is a real-time
scanning facility for new articles, searching for newly discovered
interactions. Automated computational methods are capable of analyzing a much broader coverage of literature than would be feasible
for a human reviewer to perform. In this role, there is a premium on
specificity to avoid overloading the manual reviewer with erroneous
matches, and our results suggest that deep-parsing, automated natural language processing technology is now capable of achieving this
requirement.
We found that our auto-categorization module, using statistical and natural-language parsing techniques, allowed us to build a named
entity list at run-time, rather than requiring a cumbersome fixed
named entity assembler before the processing. This approach was
perhaps our main advantage in this pipeline, because unlike general
English-language texts, the biomedical literature enjoys a substantial human-hierarchical index via the MeSH tags provided by
MEDLINE.
MeSH indexing provides a powerful tool for building reference and
background article sets that can be used to search a specific article
corpus for biologically relevant named entities which are typically
over-represented with high statistical significance. The fast partial
parser CASS serves a useful role in assigning multiple-word entities.
CASS is uniquely powerful in its ability to efficiently process very
large collections of text. This speed is a result of algorithmic efficiencies which are unlikely to be matched by more complete full parsers.
The combination of fast partial-parse, exploiting MeSH indexing and
statistical analysis of multiple word phrases significantly simplifies
our task of assembling a comprehensive term list.
At a deeper level of text interpretation, the Link parser provides
us with grammatical relations, which allows us to move beyond
simple association statistics to access the information encoded in the
grammatical structure of sentences. While some sentences in biomedical text are too complex to be accurately parsed using current
technology, we find that parsers such as Link are able to accurately
and efficiently parse the majority of sentences in the molecular biology literature. Using the integrated approach described above, we
are beginning to be able to analyze the knowledge encoded in biomedical text.
ACKNOWLEDGEMENTS
We wish to thank Dr Stephen Abney, Dragomir Radev and H.V.
Jagadish for many hours of thoughtful discussion and critical
feedback. This project was supported in part by a grant from the
NIH/National Library of Medicine R01 LM008106.
REFERENCES
Abney,S. (1996a) Partial parsing via finite-state cascades. J. Natural Language Eng., 2,
337–344.
Abney,S. (1996b) Statistical Methods and Linguistics. The MIT Press, Cambridge, MA.
Andrade,M.A. and Valencia,A. (1997) Automatic annotation for biological sequences
by extraction of keywords from MEDLINE abstracts. Development of a prototype
system. Proc. Int. Conf. Intell. Syst. Mol. Biol., 5, 25–32.
Bader,G.D. and Hogue,C.W. (2002) Analyzing yeast protein-protein interaction data
obtained from different sources. Nat. Biotechnol., 20, 991–997.
Blaschke,C. et al. (1999) Automatic extraction of biological information from scientific
text: protein-protein interactions. Proceedings of the AAAI Conference on Intelligent
Systems for Molecular Biology (ISMB), AAAI Press, pp. 60–67.
Braun,M.M., Etheridge,A., Bernard,A., Robertson,C.P. and Roelink,H. (2003) Wnt signaling is required at distinct stages of development for the induction of the posterior
forebrain. Development, 130, 5579–5587.
Chen,S., Guttridge,D.C., You,Z., Zhang,Z., Fribley,A., Mayo,M.W., Kitajewski,J. and
Wang,C.Y. (2001) Wnt-1 signaling inhibits apoptosis by activating beta-catenin/T
cell factor-mediated transcription. J. Cell Biol., 152, 87–96.
Daraselia,N., Yuryev,A., Egorov,S., Novichkova,S., Nikitin,A. and Mazo,I. (2004)
Extracting human protein interactions from MEDLINE using a full-sentence parser.
Bioinformatics, 20, 604–611.
Iliopoulos,I., Enright,A.J. and Ouzounis,C.A. (2001) Textquest: document clustering of
Medline abstracts for concept discovery in molecular biology. Pac. Symp. Biocomput.,
384–395.
Kishida,M., Hino,S., Michiue,T., Yamamoto,H., Kishida,S., Fukui,A., Asashima,M.
and Kikuchi,A. (2001) Synergistic activation of the Wnt signaling pathway by Dvl
and casein kinase Iepsilon. J. Biol. Chem., 276, 33147–33155.
Koike,A., Kobayashi,Y. and Takagi,T. (2003) Kinase pathway database: an integrated protein-kinase and NLP-based protein-interaction resource. Genome Res., 13,
1231–1243.
Lagutin,O.V., Zhu,C.C., Kobayashi,D., Topczewski,J., Shimamura,K., Puelles,L.,
Russell,H.R., McKinnon,P.J., Solnica-Krezel,L. and Oliver,G. (2003) Six3 repression
of Wnt signaling in the anterior neuroectoderm is essential for vertebrate forebrain
development. Genes Dev., 17, 368–379.
Liu,C., Nakamura,E., Knezevic,V., Hunter,S., Thompson,K. and Mackem,S. (2003) A
role for the mesenchymal T-box gene Brachyury in AER formation during limb
development. Development, 130, 1327–1337.
Marcotte,E.M., Xenarios,I. and Eisenberg,D. (2001) Mining literature for protein–
protein interactions. Bioinformatics, 17, 359–363.
National Center for Biotechnology Information—Entrez Programming Utilities (2004)
(NCBI).
Nusse,R. (2004) The Wnt gene Homepage (Howard Hughes Medical Institute).
Raychaudhuri,S., Schutze,H. and Altman,R.B. (2002) Using text analysis to identify
functionally coherent gene groups. Genome Res., 12, 1582–1590.
Rosner,A., Miyoshi,K., Landesman-Bollag,E., Xu,X., Seldin,D.C., Moser,A.R.,
MacLeod,C.L., Shyamala,G., Gillgrass,A.E. and Cardiff,R.D. (2002) Pathway pathology: histological differences between ErbB/Ras and Wnt pathway transgenic
mammary tumors. Am. J. Pathol., 161, 1087–1097.
Sleator,D. and Temperly,D. (1991) Parsing English with a Link Grammar. Carnegie
Mellon University, Computer Science Technical Report CMU-CS-91-916.
Song,D.H., Dominguez,I., Mizuno,J., Kaut,M., Mohr,S.C. and Seldin,D.C. (2003)
CK2 phosphorylation of the armadillo repeat region of beta-catenin potentiates Wnt
signaling. J. Biol. Chem., 278, 24018–24025.
Stapley,B.J. and Benoit,G. (2000) Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts. Pac. Symp. Biocomput.,
529–540.
Stephens,M., Palakal,M., Mukhopadhyay,S., Raje,R. and Mostafa,J. (2001) Detecting
gene relations from Medline abstracts. Pac. Symp. Biocomput., pp. 483–495.
Temkin,J.M. and Gilder,M.R. (2003) Extraction of protein interaction information from unstructured text using a context-free grammar. Bioinformatics, 19,
2046–2053.
Wilbur,W.J. and Yang,Y. (1996) An analysis of statistical term strength and its use
in the indexing and retrieval of molecular biology texts. Comput. Biol. Med., 26,
209–222.