A Verb-centric Approach for Relationship Extraction in Biomedical Text
Abhishek Sharma
Department of Computer Science
San Francisco State University
San Francisco, CA, USA
[email protected]
Rajesh Swaminathan
Department of Computer Science
San Francisco State University
San Francisco, CA, USA
[email protected]
Abstract—Advances in biomedical technology and research
have resulted in a large number of research findings, which
are primarily published in unstructured text such as journal
articles. Text mining techniques have been thus employed to
extract knowledge from such data. In this article we focus on
the task of identifying and extracting relations between bioentities such as green tea and breast cancer. Unlike previous
work that employs heuristics such as co-occurrence patterns
and handcrafted syntactic rules, we propose a verb-centric
algorithm. This algorithm identifies and extracts the main
verb(s) in a sentence; therefore, it does not require the usage
of predefined rules or patterns. Using the main verb(s) it then
extracts the two involved entities of a relationship. The
biomedical entities are identified using a dependency parse
tree by applying syntactic and linguistic features such as
preposition phrases and semantic role analysis. The proposed
verb-centric approach can effectively handle complex sentence
structures such as clauses and conjunctive sentences. We
evaluate the algorithm on several datasets and achieve an
average F-score of 0.905, which is significantly higher than
that of previous work.
Keywords-Biomedical Text Mining; Relationship Extraction; Verb-centric Method; Natural Language Processing (NLP)
I. INTRODUCTION
Unstructured text such as journal articles still remains
the main means of publishing research in the field of life
sciences. For instance, the MEDLINE database currently
holds 19 million articles or citations in the biomedical field.
Furthermore every day, there are 1500~3000 new articles
added to the database [23]. With this exponential increase in
biomedical articles, it becomes unrealistic for humans to
gather relevant knowledge by manually curating this
information trove. It is therefore imperative to apply text
mining techniques to automatically analyze this voluminous
unstructured data. In this article, we describe an algorithm to extract a specific type of knowledge: the relationship
between a set of bio-entities including foods, chemicals,
proteins, genes and diseases, from journal articles in
the biomedical domain.
The ultimate goal of analyzing biomedical articles
is to construct bionetworks that capture various interactions
or relationships among different bio-entities (e.g. soy,
isoflavones, and osteoporosis) [29][4][7][35][38][43]. To
this end, two common tasks – entity recognition and
relationship extraction – have been extensively studied,
where the former often focuses on recognizing protein and
gene names and the latter the relation between proteins and
genes [4][35][39].

Hui Yang
Department of Computer Science
San Francisco State University
San Francisco, CA, USA
[email protected]

We recently proposed to study the polarity/strength of a relationship as well. For instance, the
sentence, Soy consumption may reduce the risk of fracture,
exhibits a positive polarity with a weak level of certainty or
strength [41]. In this article, we primarily focus on the
identification and extraction of relationships between
entities in biomedical literature. Five types of entities – food,
disease, protein, chemical and gene are considered in this
work, since the long term goal of this work is to build a
food-disease-gene network [41]. We have proposed a verb-centric method to address this issue. To illustrate the main
goal of our algorithm, let us consider the sentence, Soy food
consumption may reduce the risk of fracture [42]. It
describes a relationship, may reduce the risk of, between the
entities, Soy, food consumption and fracture. When the
proposed algorithm takes this sentence as an input, it will
first confirm whether it is a relationship bearing sentence. If
yes, the algorithm will then extract the relevant parts of the
relationship as stated in the sentence and produce the output,
Soy food consumption | may reduce the risk of | fracture.
There are three components in the output separated by the |
symbol. The middle component, may reduce the risk of, is
the relationship-depicting phrase, whereas Soy food
consumption and fracture are the two participating entities.
Four main approaches have been applied in the past to
extract relationships, namely the co-occurrence-based, link-based, rule-based and machine learning approaches such as Hidden Markov Models [35][4][7][43]. In the co-occurrence approach, if two entities frequently collocate with each other, a relationship is inferred between them. Link-based or association-based methods extend the co-occurrence approach: two entities are considered to have a relationship if they co-occur with a common entity [11][12][36][13]. The above two approaches often do not
employ natural language processing (NLP) techniques and
focus on a few target entities, such as “fish oil” and “Raynaud’s disease” in the work by Swanson, Lindsay and Gordon [38][11]. Rule-based techniques, on the other hand, rely heavily on NLP techniques to identify both syntactic and semantic units in the text. Handcrafted rules are then
applied to extract relationships between entities. For
instance, Fundel et al. construct three rules and apply them
to a dependency parse tree to extract the interactions
between genes and proteins [8]. Feldman introduces a
template in the form of NP1-Verb-NP2 to identify the
relation between two entities corresponding to two noun
phrases, NP1 and NP2, respectively [7]. Many rule-based
studies often predetermine the relationship categories using
target verbs such as, inhibit and activate or negation words
such as, but not [40][24][29][33]. Finally, machine learning approaches such as SVMs and CRFs have also been used to extract relationships [29]. These approaches are supervised and require a manually annotated training dataset, which can be expensive to construct. Rule-based approaches, in contrast, have proven relatively successful in the past [7][43]: they often deliver reasonable precision (0.60~0.80) but low recall (~0.50), with few exceptions [8][41]. The proposed approach
in this paper bears many similarities with the rule-based
approaches.
Existing works however exhibit several major
limitations. First, as mentioned earlier, they mainly focus on
proteins and genes and their interactions, whereas in our
work we consider foods, chemicals and diseases. Second,
rules are not only expensive to construct but also severely
limit the ability to uncover relevant relations, that is, the low
recall issue. Third, these approaches assume that the entities
of interest have been correctly recognized and labeled. This
however is often not true due to the limitations in the
automated entity recognition process. As a result, some
entities are not recognized at all, while others are not recognized
food in the phrase, intake of tofu. Fourth, these approaches
do not effectively handle complex sentences that contain
multiple relationships involving multiple pairs of entities,
especially when a sub-clause is involved. For example, in
the sentence, Tofu intake showed a significant association
with genistein and miso soup showed a slight association
with phytoestrogens, there are two relationships with two
sets of participating entities. Finally, these approaches
primarily rely on a parser to correctly identify the verb
phrase. This however can be challenging, since verb phrases
are often not correctly tagged as a separate syntactic
constituent by today’s state-of-the-art parsers [21].
To overcome these limitations, we propose a verb-centric algorithm. Using preposition phrase handling, our
algorithm effectively addresses the incomplete entity issue.
By analyzing the phrase-level conjunction structure, our
algorithm is able to separate the one-to-many or many-to-one relations into multiple binary relations. For instance,
three instances are extracted from, A, B and C is related to
D. The multiple relationship issue is handled by analyzing
the sentence level conjunction structure. The challenge of
missing entities is addressed by analyzing the semantic roles
(e.g. subjects/objects) of a noun phrase. Finally, instead of
relying on the verb phrases identified by a parser, we rely on
main verbs.
We have evaluated our algorithm on several datasets
drawn from MEDLINE and achieved an average F-score of ~0.90. We have also implemented the conventional NP1-Verb-NP2 rule-based relationship extraction method and
evaluated it against the same datasets, which achieved an
average F-score of ~0.82 [41]. This shows that the
proposed verb-centric approach clearly outperforms the
conventional approaches for relationship extraction.
Please note that the work described in this article
corresponds to one of the four modules in an information
extraction system that has been proposed by us earlier [41].
As shown in Figure 1, the system consists of four modules
(Entity Recognition, Relationship Extraction, Relationship
Polarity Analysis and Relationship Strength Analysis). The
relationship extraction module is executed after all the
entities have been extracted. The output of the relationship
extraction module will act as an input to the next two
modules, relationship polarity and strength analysis.
Therefore, the performance of the relationship extraction
module plays a critical role in the success of the entire
framework. Please refer to [41] for more details.
Figure 1. Architecture of the overall system [41]
II. ALGORITHM
Figure 2. Relationship Extraction System Overview
The goal of our algorithm is sentence-wise identification
and extraction of relationships. In addition, the algorithm should be able to identify and extract the participating entities of the relationship in a given sentence. To reach this goal we have designed a three-step algorithm. Given a sentence, our algorithm first identifies whether it is a relationship-bearing sentence or not. It next extracts the relevant relationship-depicting phrase and participating entities from the relationship-bearing sentence. Finally, it formats the extracted relations into a list of binary relations in the form Entity | Relationship | Entity. Figure 2 presents a
schematic description of the verb-centric relationship
extraction algorithm. In the following sections, we describe
each of these steps in detail.
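The three-step flow described above can be sketched as follows. This is a minimal sketch in which the two step functions are hypothetical stand-ins for the components detailed in the sections below:

```python
def extract_relations(sentence, is_bearing, extract_parts):
    """Top-level flow of the verb-centric algorithm (a sketch).

    `is_bearing` and `extract_parts` are injected stubs standing in for
    the identification and extraction components described later.
    """
    # Step 1: keep only relationship-bearing sentences.
    if not is_bearing(sentence):
        return []
    # Step 2: extract (left entity, relationship phrase, right entity) triples.
    triples = extract_parts(sentence)
    # Step 3: format each triple as Entity | Relationship | Entity.
    return [f"{left} | {rel} | {right}" for left, rel, right in triples]
```

The stubs make the control flow explicit without committing to a particular parser or NER implementation.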
A. Data Acquisition and Entity Recognition
To acquire relevant scientific publications from the
MEDLINE database, we utilize an in-house program [23].
Given a set of keywords, the program automatically
downloads all the relevant articles from MEDLINE and
stores them on a local disk. Both abstracts and full texts of
the biomedical articles, if available, are downloaded. In this
work, we only analyze the abstracts.
The NER (Named Entity Recognition) module takes
each abstract and applies a lexicon-based approach to
recognize the following types of entities: food, disease, gene, protein and chemical. The NER module also
identifies co-references and abbreviations. Please refer to
[41] for details. Let us consider the sentence, Hence, soy
isoflavones and saponins are likely to be protective of colon
cancer and to be well tolerated. The NER module will
recognize the following entities from this sentence,
{\chemical soy isoflavones}, {\chemical saponins}, {\disease
colon cancer}, each labeled by semantic type. This NER
module was evaluated using the GENIA corpus and datasets
drawn from the MEDLINE database. See [41] for details.
B. Algorithm
1) Relationship extraction - what sentences?:
Our relationship extraction algorithm identifies and extracts
relationships at the sentence level. Given a sentence we first
need to determine whether it contains a relationship or not.
To this end, we performed the following case study and
were able to identify several key criteria for a relationship-bearing sentence. Specifically, we randomly selected
50 abstracts and recruited a team of 5 members to manually
annotate all the entities of interest and their relationships.
For instance, the sentence in the above section will be
annotated as, Hence, {\chemical soy isoflavones and
saponins} {\relationship are likely to be protective of}
{\disease colon cancer} and to be well tolerated.
A total of 352 relationship-bearing sentences in the 50
abstracts are identified and annotated. We analyze the Inter-Annotator Agreement (IAA) over all these sentences and
address the discrepancies by taking the one agreed upon by
3 or more annotators. We then conduct a collection of
studies to characterize these 352 sentences. We first
compare the list of manually annotated entities with the list
identified by the NER module. We observe that although the
NER module can correctly identify around 90% of the
entities, it exhibits intrinsic limitations as described in the
previous section, for instance, incomplete and missing
entities. We also observe that these 352 sentences exhibit a
variety of sentence structures: while some are relatively simple, in the form of subject-verb-object, a good portion involves multiple relations through the use of conjunctive structures
(e.g. A is related to B, that affects C. A, B, C is related to D.
A is related to B and C is related to D.) We also notice that
~97% of these sentences describe a verb-based relationship
(e.g. A affects B). We then manually collect all such verbs,
and compare them with the 54 verbs or verb phrases
included in the UMLS Semantic Network. These 54 verbs
are intended to capture the main relations that may exist
between biomedical entities. We observed that these two
lists have a good overlap. For the verbs that appear in the
manually annotated abstracts but not in UMLS, we can often
identify a verb that is semantically similar. Finally, although
~61% of these 352 sentences involve two or more entities of interest, ~36% only contain one entity. We therefore need to relax
the commonly adopted criterion that requires the presence of
the two entities of interest when extracting relationships.
To summarize, through this case study, we are able to
not only gain valuable insights into the challenges we are facing, but also identify two key criteria we can use to reliably determine whether a given sentence is relationship-bearing: (1) the sentence needs to contain at least one entity of interest; and (2) it contains a verb that is semantically similar to one of the 54 verbs listed in the UMLS Semantic Network.
2) Identifying the relationship-bearing sentences.
We are now ready to describe the main algorithm as
depicted in Figure 2. As mentioned earlier, the inputs to the
algorithm are sentences whose entities are automatically
labeled by the NER module. Given a labeled sentence, we
first determine whether it contains a relationship using the
two criteria derived above. This takes the following two
steps:
Step 1: Given a sentence, the entities recognized by the NER module might be redundant or incorrect as a result of imperfect parsing. To deal with this issue, we (1) discard an entity if its span covers the entire sentence; and (2) discard an entity if its span is a subset of another entity's span in the same sentence. Once the above tasks are done, we retain the sentence if it contains one or more labeled entities.
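The two filtering rules amount to simple span checks. A minimal sketch, assuming entities are represented as (start, end) character offsets into the sentence (a hypothetical encoding of the NER output):

```python
def filter_entities(sentence, spans):
    """Clean up NER output before the relationship check (a sketch).

    `spans` is a list of (start, end) character offsets, a hypothetical
    representation of the entities labeled by the NER module.
    """
    # (1) Discard an entity whose span covers the entire sentence.
    kept = [(s, e) for (s, e) in spans if not (s == 0 and e == len(sentence))]
    # (2) Discard an entity whose span is a proper subset of another's.
    return [(s, e) for (s, e) in kept
            if not any((s2, e2) != (s, e) and s2 <= s and e <= e2
                       for (s2, e2) in kept)]
```

A sentence is retained for Step 2 only if at least one entity survives this filtering.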
Step 2: In this step, we examine whether each sentence retained in Step 1 contains a verb that is semantically similar to one of the 54 UMLS verbs. To do this, we first expand the list of these 54 verbs to include verbs that are semantically similar to one of them, using both WordNet [17] and VerbNet [16]. We then check whether one of the verbs in the sentence under consideration is on the expanded verb list. If the answer is yes, the sentence is considered relationship-bearing and passed on to the next step for further analysis.
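This check can be sketched as follows. The verb excerpt and the synonym map standing in for the WordNet/VerbNet expansion are illustrative assumptions, not the actual 54-verb list:

```python
# A small excerpt of the UMLS Semantic Network verbs, plus a hypothetical
# synonym map standing in for the WordNet/VerbNet expansion step.
UMLS_VERBS = {"affects", "causes", "prevents", "treats"}
SYNONYMS = {"prevents": {"protects", "reduces"}, "affects": {"modulates"}}

def expand_verbs(base, synonyms):
    """Expand the base verb list with semantically similar verbs."""
    expanded = set(base)
    for verb in base:
        expanded |= synonyms.get(verb, set())
    return expanded

def passes_verb_check(sentence_verbs, expanded):
    """Step 2: does the sentence contain a verb on the expanded list?"""
    return any(v in expanded for v in sentence_verbs)
```

In practice the expansion would be computed once, offline, and reused for every sentence.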
3) Verb-based relationship extraction
As described earlier, we have observed in our case study
that although many authors employ the simple Subject-Verb-Object sentence structure to describe a relationship
between two entities, it is not uncommon for authors to use
more complex sentence structures. Two such structures are especially popular: (1) using conjunctive structures or clauses
to describe multiple verb-based relations in a single
sentence. For instance, the following sentence,
Supplementation with soy protein containing isoflavones
does not reduce colorectal epithelial cell proliferation or the
average height of proliferating cells in the cecum, sigmoid
colon, and rectum and increases cell proliferation measures
in the sigmoid colon, consists of two relationship-depicting
phrases, does not reduce and increases, which are connected by the conjunctive word, and; and (2) using conjunctive
structure to describe a single verb-based many-to-one or
one-to-many relationship. As an example, the sentence,
Fermented soy products are known to contain high
concentrations of the isoflavone, genistein, and other
compounds, specifies a one-to-three relation.
As one can observe from the different sentence structures, regardless of a sentence's complexity, the most critical step towards correctly extracting all the relationships in a sentence is to identify the verb phrases. To achieve this,
previous work relies on two main techniques: (1) Using the
parse tree of a sentence to extract the verb phrases. This can
be inaccurate as most state-of-the-art parsers cannot
correctly identify the right boundary of a verb phrase. As a
result, the right boundary of a verb phrase is often defaulted
to the end of a sentence; and (2) assuming the entities of
interest are correctly recognized and using such entities as
anchors to infer the span of a verb phrase. This again has
serious limitations as no entity recognition algorithm can
achieve perfect performance so far. In addition, not every
verb phrase is surrounded by two entities. An example is
shown in the first sentence in the above paragraph.
To overcome these problems we propose to extract the
main verbs from a sentence where a main verb is the most
important verb in a sentence and states the action of the
subject. We take a two-step procedure for this purpose: (1)
we first partition a sentence into multiple semantic units if
necessary, such that each unit consists of only one main
verb. This is done by analyzing the conjunctive structure at
the sentence level; and (2) for each semantic unit, we then
identify and extract its main verb. We next describe these two steps in detail.
To partition a complex sentence into multiple semantic
units, we analyze the sentence-level conjunctive structure using the OpenNLP parser [21]. A list of conjunctions is retrieved from the parse tree of the sentence. The parent sub-tree of each conjunction is then retrieved. If the parent sub-tree corresponds to the entire input sentence, the algorithm asserts that it is a sentence-level conjunction. It then uses the
conjunctions to break the sentence into smaller semantic
units. Let us consider the same example mentioned earlier,
Supplementation with soy protein containing isoflavones
does not reduce colorectal epithelial cell proliferation or the
average height of proliferating cells in the cecum, sigmoid
colon, and rectum and increases cell proliferation measures
in the sigmoid colon. In this sentence, the conjunction word, and, between the words, rectum and increases, has the entire sentence returned as its parent sub-tree, while, and, between
the words, colon and rectum, has the phrase, in the cecum,
sigmoid colon, and rectum, as the parent. Therefore, the
first, and, is a sentence-level conjunction and the sentence is
divided into two parts at this conjunction. Clearly, each part
contains a verb-based relationship.
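The partitioning step can be sketched as follows, with parse trees represented as nested (label, children) tuples and (tag, word) leaves. Checking for a conjunction directly under the root approximates the parent-sub-tree test described above; real trees would come from the OpenNLP parser:

```python
def leaves(node):
    """Collect the tokens under a parse-tree node.

    Trees are nested (label, children) tuples; leaves are (tag, word).
    """
    label, children = node
    if isinstance(children, str):
        return [children]
    tokens = []
    for child in children:
        tokens.extend(leaves(child))
    return tokens

def split_at_sentence_conjunctions(root):
    """Split a sentence at conjunctions whose parent sub-tree is the root."""
    _, children = root
    units, current = [], []
    for child in children:
        if child[0] == "CC":          # sentence-level conjunction
            units.append(" ".join(current))
            current = []
        else:
            current.extend(leaves(child))
    units.append(" ".join(current))
    return [u for u in units if u]
```

A conjunction nested deeper (e.g. inside a prepositional phrase) never reaches the root's child list, so phrase-level conjunctions are left intact.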
To find the main verb(s), the algorithm considers the
parse tree of a sentence generated by the OpenNLP parser. It
traverses the parse tree using preorder traversal and
continuously checks the tags of each right sub-tree until it
reaches the deepest verb node of a verb phrase. This verb is
then considered the main verb of the sentence [35]. Using
the software tool Grammar Scope, which provides a
graphical representation of a parse tree [21], we have
manually verified this main verb extraction procedure.
Shown in Figure 3 is a parse tree produced by this tool for
the sentence, Phytoestrogens may be associated with a
reduced risk of hormone dependent neoplasms such as
prostate and breast cancer. The verb, associated, is correctly extracted as the main verb. Main verb extraction is a critical step, as it facilitates the final step of verb-centric relationship extraction.
Figure 3. Graphical representation of the parse tree of the sentence, Phytoestrogens may be associated with a reduced risk of hormone dependent neoplasms such as prostate and breast cancer.
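The traversal can be sketched as follows, over a parse tree represented as nested (label, children) tuples with (tag, word) leaves. In a right-branching verb phrase the deepest verb is the last one visited in preorder, which is what this sketch exploits:

```python
def main_verb(node):
    """Return the deepest verb in a parse tree (a sketch).

    Trees are nested (label, children) tuples; leaves are (tag, word).
    A preorder walk keeps overwriting the result, so the last verb
    found in a right-branching verb phrase -- the deepest -- wins.
    """
    label, children = node
    if isinstance(children, str):
        return children if label.startswith("VB") or label == "MD" else None
    verb = None
    for child in children:
        found = main_verb(child)
        if found is not None:
            verb = found        # a deeper/later verb replaces a shallower one
    return verb
```

For "may be associated with …" the walk passes may and be before settling on associated, matching the example above.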
4) Determining the two participating entities
The main verbs extracted in the above section state the main relations between entities; a main verb, however, only constitutes one third of a binary relationship. To complete a
relationship, we still need to identify its two involved
entities. A proximity-based approach is adopted for this
purpose. Specifically, we take the entities located
immediately to the left and to the right of a main verb in the
same semantic unit as the two participating entities. For this
proximity-based approach to work effectively, we however
have to tackle the following issues: (1) the NER module
might not recognize all the entities of interest. We term this
as the missing entity issue; (2) the incomplete entity issue,
where the entity identified by the NER is part of a true
entity. For instance, only the word, equol, in the phrase,
serum concentration of equol, is labeled as, chemical; and
(3) the many-to-one or one-to-many relationship mentioned
earlier (for instance, A activates B, C and D). To tackle
these three issues, we rely on both dependency parse trees
and linguistic features. We next describe our solution to each of these issues in detail.
For the missing entity issue, we first observe that many
of these missing entities are either a subject or an object in a
sentence. We therefore address this issue by identifying the
subject and object in a sentence. The program passes each
relationship-bearing sentence unit as an input to the Stanford
NLP Parser [15]. This parser produces pairs of dependent
words, where each pair satisfies one of the 55 prescribed
grammatical relations [15]. Several rules are applied to
extract the subject/object in a sentence using the word
dependencies. We explain them with an example. Consider the sentence shown in Figure 4 and the
dependencies obtained from the Stanford Parser. The typed dependencies are generated in pairs of governor (Gov) and dependent (Dep), followed by the grammatical relation (Rel) between them. To recognize the subject,
the program scans for the substring of ‘subj’ in the
relation, then searches for the occurrence of the governor or dependent word in any of the pairs having a
noun or preposition relation. In the sentence shown in Figure 4, the word ‘consumption’ has a ‘subj’ relation with ‘shown’. The program finds the occurrence of ‘consumption’ with ‘Soy’ as a noun modifier, stops at this point, and labels ‘Soy consumption’ as the subject. Similarly, it searches for a
typed dependency with the substring of ‘obj’ in a
relationship and follows the same procedure as for the
subject to locate the object.
Soy consumption has been shown to modulate bone turnover and increase bone mineral density in postmenopausal women.

Gov: consumption-2  Dep: Soy-1              Rel: nn
Gov: shown-5        Dep: consumption-2      Rel: nsubjpass
Gov: turnover-9     Dep: bone-8             Rel: nn
Gov: modulate-7     Dep: turnover-9         Rel: dobj
Gov: modulate-7     Dep: increase-11        Rel: conj_and
Gov: density-14     Dep: bone-12            Rel: nn
Gov: density-14     Dep: mineral-13         Rel: nn
Gov: increase-11    Dep: density-14         Rel: dobj
Gov: women-17       Dep: postmenopausal-16  Rel: amod
Gov: density-14     Dep: women-17           Rel: prep_in

Figure 4. Typed dependencies of a sentence generated by the Stanford Parser
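The subject/object search over these typed dependencies can be sketched as follows, with each dependency reduced to a (governor, dependent, relation) triple (word indices dropped for brevity):

```python
def find_argument(deps, marker):
    """Locate the subject (marker='subj') or object (marker='obj') in
    typed dependencies given as (governor, dependent, relation) triples,
    then expand the head word with its noun ('nn') modifiers.
    """
    for gov, dep, rel in deps:
        if marker in rel:                     # e.g. 'nsubjpass', 'dobj'
            head = dep
            # Collect noun modifiers of the head, as in 'Soy' -> 'consumption'.
            mods = [d for g, d, r in deps if g == head and r == "nn"]
            return " ".join(mods + [head])
    return None
```

With the Figure 4 dependencies this recovers "Soy consumption" as the subject and "bone turnover" as the object of modulate.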
Once the subject and object of a sentence unit have been
labeled, we compare them with the list of entities labeled by
the NER module. If either of them does not overlap with any
of the existing entities, we find the smallest noun phrase that
contains the subject or object and add it to the existing list of
labeled entities.
To handle the incomplete entity issue, we search for a preposition phrase around an entity and merge the two, based on our observations of the incomplete entities. Consider, for example, the phrases, serum concentration of equol, and, dietary intake of tofu and miso soup. Here, only, equol, and, dietary intake, are labeled as entities by the NER module; these are incomplete and should instead be, serum concentration of equol, and, dietary intake of tofu and miso soup, respectively. To address this
issue, our program identifies the preposition phrase right
before, after, or overlapping with a labeled entity and merges the preposition phrase with the entity to replace the
incomplete one. In the above two cases, of equol and of tofu
and miso soup, are the two preposition phrases. Since,
equol, is part of the preposition phrase, of equol, the
program will search for the noun phrase immediately preceding it, which is, serum concentration, and merge them together
to replace, equol. On the other hand, since the preposition phrase, of tofu and miso soup, comes right after the labeled entity, dietary intake, the program simply merges the two. After this
preposition handling, our algorithm is able to identify the
complete form of the two entities in the above sentence.
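A minimal sketch of this merging logic, assuming the parser has already supplied the entity, preposition-phrase, and preceding-noun-phrase spans as token indices (all hypothetical inputs standing in for parse-tree output):

```python
def complete_entity(tokens, entity, pp, np_before=None):
    """Merge an incomplete entity with an adjacent preposition phrase.

    All spans are (start, end) token indices; they stand in for what
    the parser would supply.
    """
    (s, e), (ps, pe) = entity, pp
    if ps <= s and e <= pe and np_before:
        # Entity lies inside the PP (e.g. 'of equol'): prepend the noun
        # phrase immediately preceding the PP ('serum concentration').
        return " ".join(tokens[np_before[0]:pe])
    if ps == e:
        # PP starts right after the entity ('dietary intake' + 'of tofu ...').
        return " ".join(tokens[s:pe])
    return " ".join(tokens[s:e])
```

Both example cases from the text reduce to one of the two branches.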
Finally, to address the one-to-many or many-to-one
relationships as exhibited in the following sentence,
Fermented soy products are known to contain high
concentrations of the isoflavone, genistein, and other
compounds, we utilize the dependency parse tree to first
merge the multiple entities into one compound entity. This
allows us to correctly establish the one-to-many and many-to-one relationship using the main verb identified earlier.
The merged compound entity will be split into individual
entities as they are when we produce the final binary
relationships. This will be described later in more detail.
Let us now use the above sentence to illustrate the
merging process using a dependency parse tree. Using the
typed dependencies produced by the Stanford parser, the
algorithm finds each governor with a relation tag of, conjunction, and identifies all the dependents that have the
same governor. It also finds the noun phrase for each
governor and its corresponding set of dependent words. It
finally merges all the noun phrases into a compound entity.
In the example, the three entities will be merged into one.
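The conjunct-grouping step can be sketched as follows over (governor, dependent, relation) triples; the full procedure would additionally map each word back to its enclosing noun phrase before merging:

```python
def merge_conjuncts(deps):
    """Group conjoined entities into one compound entity (a sketch).

    `deps` contains (governor, dependent, relation) triples from a
    dependency parse; conjunct relations are tagged 'conj_...'.
    """
    compounds = {}
    for gov, dep, rel in deps:
        if rel.startswith("conj"):
            # The governor anchors the group; each conjunct joins it.
            compounds.setdefault(gov, [gov]).append(dep)
    return compounds
```

For the fermented-soy example, isoflavone, genistein, and compounds would end up in one group.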
5) Exception: the “between … and” case
A special case in describing a relationship is the use of,
between … and … template. The approach described earlier
cannot handle this structure. We therefore treat such cases
separately. Consider the following sentence, Few studies
showed protective effect between phytoestrogen intake and
prostate cancer risk. The algorithm uses the parse tree generated by the OpenNLP parser for the sentence and determines the preposition phrase that contains the preposition, between [21]. It then identifies the entity after, between, and before, and, as the left entity, and the entity immediately after, and, as the right entity. The relationship-depicting phrase is constructed using the main verb plus all the words between the main verb and the word, between. Therefore, the binary relationship of the sentence stated above is, phytoestrogen intake | showed protective effect | prostate cancer risk.
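A token-level sketch of this template handling; the paper locates the phrases via the parse tree, so simple string search stands in for that here:

```python
def between_and(sentence, verb):
    """Extract a binary relation from the 'between X and Y' template.

    A token-level sketch; `verb` is the previously extracted main verb.
    """
    tokens = sentence.rstrip(".").split()
    b = tokens.index("between")
    a = tokens.index("and", b)            # the 'and' pairing this 'between'
    left = " ".join(tokens[b + 1:a])      # entity after 'between', before 'and'
    right = " ".join(tokens[a + 1:])      # entity immediately after 'and'
    v = tokens.index(verb)
    relation = " ".join(tokens[v:b])      # main verb up to 'between'
    return f"{left} | {relation} | {right}"
```

On the example sentence this reproduces the binary relationship given above.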
6) Main verb-based Relationship construction
At this point, we have identified all the main verbs in a
sentence. We have also labeled all the entities of interest
after having addressed the three entity-related issues:
missing entities, incomplete entities and the one-to-many or
many-to-one relationships. We are ready to use the main
verbs to construct the relations contained in the sentence. To
do this, we take each main verb and identify the labeled
entities before and after the main verb in the same semantic
unit. For instance, the sentence we have seen earlier,
Supplementation with soy protein containing isoflavones
does not reduce colorectal epithelial cell proliferation or the
average height of proliferating cells in the cecum, sigmoid
colon, and rectum and increases cell proliferation measures
in the sigmoid colon, will produce the following two
relationships: (1) Supplementation with soy protein
containing isoflavones | does not reduce colorectal
epithelial cell proliferation or the average height of
proliferating cells in | the cecum, sigmoid colon, and
rectum and (2) Supplementation with soy protein
containing isoflavones | increases cell proliferation
measures in | the sigmoid colon. Note that the left entity of
the second relationship is the subject of the entire sentence.
This is done as a post-processing step. Also note that the first
relationship actually contains a one-to-many relationship.
Our algorithm will still treat it as a binary relation, as the
entities on the right are not individually labeled. Instead,
they together are identified as the object of, reduce.
For a many-to-one or one-to-many relationship as
exhibited in the sentence, Fermented soy products are
known to contain high concentrations of the isoflavone,
genistein, and other compounds, our program first combines
the multiple entities located on the same side of the relation
into one compound entity in order to correctly associate
them with the main verb. For the above sentence, we will
first produce the following relationship, Fermented soy
products | are known to contain high concentrations of | the
isoflavone, genistein, and other compounds. We then break
apart the compound entity into individual entities as they are
initially extracted. The above sentence therefore will
produce three relationships: (1) Fermented soy products |
are known to contain high concentrations of | the isoflavone
(2) Fermented soy products | are known to contain high
concentrations of | genistein (3) Fermented soy products |
are known to contain high concentrations of | other
compounds.
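Splitting a compound entity back into its individual binary relations is then straightforward; a minimal sketch:

```python
def expand_compound(left, relation, conjuncts):
    """Break a compound right-hand entity back into individual entities,
    producing one binary relation per conjunct."""
    return [f"{left} | {relation} | {c}" for c in conjuncts]
```

The same expansion applies symmetrically when the compound entity is on the left-hand side.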
The above procedure also applies to the many-to-one or
one-to-many relationship described by the “between … and
…” scenario. For instance, the sentence, No association was
suggested between the frequency of consumption of fruit,
vegetables, green tea, and soy products and gastric cancer,
will lead to the identification of the following relationships:
(1) fruit | No association was suggested | gastric cancer, (2) vegetables | No association was suggested | gastric cancer, (3) green tea | No association was suggested | gastric cancer, and (4) soy products | No association was suggested | gastric cancer.
Finally, for a relationship involving an entity that is
either in an abbreviated format or a pronoun phrase, we
replace it with its original long form or its co-referred entity.
Consider the following two examples: (1) VLFD___a very-low-fat diet | significantly reduces | estrogen concentrations in postmenopausal women; (2) its intake___Soy | may help to prevent some | diseases. In the first example, VLFD is identified by our NER module as an abbreviation of, a very-low-fat diet. We attach this long form in the final
relationship construction. In the second example, the NER
module recognizes that the pronoun, its, co-refers to the
entity, soy. We hence attach, soy, in the extracted
relationship.
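The final substitution step can be sketched as below. The lookup dictionaries stand in for the output of the NER module described above; their contents and the function name are illustrative assumptions.

```python
# Abbreviation long forms and pronoun antecedents, as would be supplied
# by the NER module for the two examples above (assumed, for illustration).
ABBREVIATIONS = {"VLFD": "a very low-fat diet"}
COREFERENCES = {"its": "Soy"}

def resolve_entity(entity: str) -> str:
    """Replace an abbreviation or pronoun in an entity with its long
    form or co-referred entity before final relationship construction."""
    for abbr, long_form in ABBREVIATIONS.items():
        if abbr in entity:
            return entity.replace(abbr, long_form)
    for pron, referent in COREFERENCES.items():
        if entity.lower().startswith(pron):
            return entity.lower().replace(pron, referent, 1)
    return entity

resolve_entity("VLFD")        # -> "a very low-fat diet"
resolve_entity("its intake")  # -> "Soy intake"
```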
III. EVALUATION
In this section, we report the evaluation results by
applying the proposed verb-centric algorithm to three
datasets drawn from the MEDLINE database. As shown in
Table I, the three datasets are created using three sets of
keywords: {soy, cancer}, {beta-carotene, cancer} and
{benzyl, cancer}. They consist of a total of ~1400 sentences,
of which ~750 are relationship-bearing.
TABLE I. DATASETS DESCRIPTION

ID   #(Abs)   Keywords                 #Sentences/   Relation-bearing
                                       abstract      sent./abstract
1    50       Soy & Cancer             8             5
2    80       Beta-carotene & Cancer   9             4
3    40       Benzyl & Cancer          7             4
To evaluate our algorithm, we first manually
annotated all the relationships in all the datasets. For each
relationship we explicitly label the relationship depicting
phrase and participating entities. Here is an annotated
sentence as an example, Hence, {\chemical soy isoflavones}
and {\chemical saponins} {\relationship are likely to be
protective of} {\disease colon cancer} and to be well
tolerated. The manual annotation was done independently
by a group of five members. All members are graduate
students who have been analyzing biomedical text for at
least one year. For this annotation, we are interested in the
following types of biomedical entities and their relationship
as reported in the literature: food, disease, protein, chemical
and gene. To reduce potential subjectivity in the manual
annotation, we combine the five versions of annotations
from the five members into one version using majority rule
policy. Specifically, given a sentence, the annotation that is
agreed by 3 or more members is taken as the final version.
For some sentences, however, a majority vote could not be
reached; in such cases, a group discussion was held to reach
an agreement.
Note that we did not evaluate our algorithm using existing
public datasets such as LLL (Learning Language in Logic)
challenge dataset [28] because such datasets often focus on
genes and proteins, whereas we are also concerned with
foods, chemicals and diseases. We plan to test our algorithm
using such datasets in the future.
We feed the same three datasets to our named entity
recognition (NER) module to automatically label the five
types of entities mentioned above. The verb-centric
algorithm then takes the labeled sentences as input for
relationship extraction. Three measures - precision, recall
and F-score - are used to measure the effectiveness of our
algorithm, where Precision = TP/(TP + FP), Recall =
TP/(TP + FN) and F-score = (2 * Precision * Recall)/
(Precision + Recall), with TP, FP and FN denoting true
positives, false positives and false negatives, respectively.
Given a relationship-bearing sentence, it is counted as a true
positive if (1) the algorithmically extracted relationship-depicting
phrase (RDP) overlaps with the manually annotated RDP, or
(2) the two RDPs agree on the same main verb. Otherwise,
the sentence is counted as a false positive. Finally, a false
negative corresponds to a sentence that is manually annotated
as relationship-bearing but not identified as such by our
algorithm. We also consider a stricter criterion in which we
not only check whether the algorithmically extracted RDPs
match their manual counterparts, but also whether the
entities involved in a relationship match. We report such
results later. In the following discussion, unless otherwise
noted, the reported precision, recall and F-score are
computed on the basis of the main verbs only.
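The three measures translate directly into code; the example below reuses the verb-centric precision and recall reported for Dataset 1 later in this section.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)

def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Verb-centric results on Dataset 1 (Table II): P = 0.9589, R = 0.9045
round(f_score(0.9589, 0.9045), 4)  # -> 0.9309
```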
To demonstrate the advantage of the proposed verb-centric
approach, we also compare it with a rule-based
approach that uses the rule NP1-VP-NP2. According to
this rule, a relationship is composed of a verb phrase
sandwiched between two noun phrases, both of which are
required to be labeled as entities of interest. We refer to this
approach as the NVN method or NVN. We first study the
effect of the three entity-related issues (missing, incomplete
and conjoining) on the NVN algorithm. The results are shown
in Table II. Without any additional entity handling, in other
words simply relying on the entities recognized by the NER
module, we achieve a precision of 0.79 and recall of 0.82.
Both measures are improved (0.80 and 0.86) after we have
incorporated the noun phrases that function as subject and
object in a sentence. Finally, these measures are further
improved (0.84 and 0.87) after conjoining entities are also
dealt with. These gradual improvements indicate the
necessity and importance of handling these entity related
issues. We however could not further improve the accuracy
of the NVN approach as it is unable to correctly extract
relationships where only one entity is available in the
proximity of a verb. For instance, in a sentence such as “A
increases B but decreases C”, the NVN method could not
correctly extract the relation between A and C. In contrast to
NVN, our verb-centric algorithm is more versatile and able
to handle various scenarios. As evidenced in Table II, on the
same dataset, the verb-centric method achieves a precision
of 0.96 and a recall of 0.90, which is significantly higher
than the NVN method.
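The NVN limitation can be illustrated with a small sketch. This is not the authors' implementation; the token-tag representation and function name are assumptions made for the example "A increases B but decreases C" discussed above.

```python
def nvn_extract(tagged):
    """Apply the NP1-VP-NP2 rule: for each verb, pair it with the
    nearest entity on each side. tagged is a list of (token, tag)
    pairs with tag in {'ENT', 'VERB', 'OTHER'}."""
    rels = []
    for i, (tok, tag) in enumerate(tagged):
        if tag == 'VERB':
            left = next((t for t, g in reversed(tagged[:i]) if g == 'ENT'), None)
            right = next((t for t, g in tagged[i + 1:] if g == 'ENT'), None)
            if left and right:
                rels.append((left, tok, right))
    return rels

sent = [('A', 'ENT'), ('increases', 'VERB'), ('B', 'ENT'),
        ('but', 'OTHER'), ('decreases', 'VERB'), ('C', 'ENT')]
nvn_extract(sent)
# -> [('A', 'increases', 'B'), ('B', 'decreases', 'C')]
```

Note how NVN wrongly pairs B, the nearest entity, with "decreases"; the subject A is not in the proximity of the second verb, which is exactly the case the verb-centric algorithm handles.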
TABLE II. EFFECT OF MISSING, INCOMPLETE AND CONJOINING ENTITIES
USING DATASET 1 (KEYWORDS: SOY, CANCER)

Approach                                             Precision   Recall   F-score
NVN without missing/incomplete/conjoining entities   0.79        0.82     0.8047
NVN with subject/object extraction                   0.80        0.89     0.8426
NVN with missing/incomplete/conjoining entities      0.84        0.87     0.8547
Verb-centric relationship extraction                 0.9589      0.9045   0.9309

We also apply the verb-centric algorithm to all three
datasets and study whether it is biased toward a specific
dataset. In addition, we study whether it is sensitive to
sentences located in different sections of an abstract.
Specifically, we focus on the conclusion sentences of an
abstract; if an abstract does not explicitly identify its
conclusion sentences, the last three sentences are regarded as
the conclusion. Tables III-V summarize the results of these
studies. From these tables, we observe that our approach is
not biased and achieves a balanced precision from 0.86 to
~0.95 and a recall from 0.88 to ~0.92. There is some
variation in the precision and recall values across the three
datasets. Datasets 2 and 3 contain more complex sentences
with multiple relationships. In certain cases these sentences
are not properly separated into smaller units; as a result,
certain parts of the sentences are not considered and the
corresponding relationships are not identified or extracted,
which affects the precision and recall on datasets 2 and 3
compared to dataset 1. Nevertheless, this performance is
better than that of existing methods, which often deliver a
reasonable precision (0.60~0.80) but a low recall (~0.50)
[41].

TABLE III. RESULTS FOR VERB-CENTRIC RELATIONSHIP EXTRACTION ON
DATASET 1

            Full Abstracts   Conclusions
Precision   0.9589           0.8873
Recall      0.9045           0.9545
F-score     0.9309           0.9196

TABLE IV. RESULTS FOR VERB-CENTRIC RELATIONSHIP EXTRACTION ON
DATASET 2

            Full Abstracts   Conclusions
Precision   0.8641           0.9032
Recall      0.8833           0.918
F-score     0.8736           0.918

TABLE V. RESULTS FOR VERB-CENTRIC RELATIONSHIP EXTRACTION ON
DATASET 3

            Full Abstracts   Conclusions
Precision   0.91             0.878
Recall      0.901            0.8999
F-score     0.8999           0.8999

As mentioned earlier in this section, we also adopt a
stricter criterion to determine a true positive. Here, we take
both the relationship-depicting phrase (RDP) and the
involved entities into account: an algorithmically extracted
relationship is considered a true positive only if its RDP (or
main verb) and involved entities match the manual
annotation. The results on all three datasets under this
criterion are reported in Table VI. Comparing Table VI with
Tables III-V, one can observe that on all three datasets the
values of both precision and recall have dropped by only a
small margin. This demonstrates that the proximity-based
approach for identifying entities involved in a relationship
works reasonably well. It also shows that the strategies we
have introduced to handle missing, incomplete and
conjoining entities are effective.

TABLE VI. EVALUATION RESULTS CONSIDERING ENTITY MATCH

            Precision   Recall   F-Score
Dataset 1   0.9455      0.8683   0.902
Dataset 2   0.8580      0.8436   0.8507
Dataset 3   0.9052      0.8614   0.8827
A. Analysis of Errors
We have also manually gone over the results to
determine the main factors that lead to false positives and
false negatives. We notice that most false positives are
caused by two rules employed during the manual annotation
process. First, the annotators did not label part-of, or
hierarchical, relationships between entities (for example, A
is an ingredient in B), since they consider these to be known
facts rather than findings derived from a particular study.
The algorithm, on the other hand, extracts such
relationships. Second, the manual annotations rule out
relationships described as part of the study objective. For
instance, annotators do not label the sentence, We
investigate whether soy consumption can reduce the risk of
breast cancer. The verb-centric algorithm, however, extracts
a relationship from the whether-clause.
As for false negatives, they are often caused by the
following two scenarios. (1) The use of an unconventional
writing style. As an example, consider the sentence, The
significant inverse association with beta-carotene and
lutein/zeaxanthin was more pronounced in women, and in
overweight or obese subjects. Instead of using the, between
… and …, structure, the author uses the, with … and …,
structure. As a result, the algorithm is unable to detect this
relationship. (2) The current verb-centric algorithm is unable
to extract relationships in the form of, no preventative effect
of A on B. However, we can readily extend our algorithm to
identify and extract such cases by considering the main
preposition words such as by, of, on, in, at, to, for and with.
We plan to address this issue in the future.
IV. RELATED WORK
Relationship extraction has recently become an area of
interest, resulting in many studies [29][4][7][35][38][43].
Four mainstream approaches have been used for relationship
extraction: co-occurrence, link-based, rule-based and
machine learning approaches. The co-occurrence approach
infers a relationship between two entities if they frequently
collocate with each other. This is a simple method in which
relations between biomedical entities are detected by
collecting the sentences in which they co-occur. The link-based
methods extend the co-occurrence approach by
inferring relationships between entities that co-occur with a
common term. These methods give high recall, but their
precision is often low [43][8][35][4][7].
Several rule-based algorithms have been applied to
extract relationships from biomedical text. An increasing
emphasis has been given to syntactic structures extracted
from a parse tree, such as noun phrases, verb phrases,
subject and object [43]. However, it has been shown that on
larger corpora such methods can be computationally costly
[6]. Fundel et al. used the Stanford Lexicalized Parser to
generate dependency trees from MEDLINE abstracts; their
system, used to find protein/gene interactions, delivers a
reasonable precision and recall of 0.79 and 0.85 on a 50-abstract
dataset [8]. Another dependency-parser-based
system was built for relation extraction by Rinaldi et al. It
uses lexical and semantic information in conjunction with
dependency parsing and obtains a precision ranging from
0.52 (strict) to 0.90 (approximate boundary) and a recall
from 0.40 (estimated lower bound) to 0.60 (actually
measured) [31]. Tsai et al. combine semantic role labeling,
such as location and time information, with syntactic
analysis in their study. They use the 30 most frequently used
biomedical verbs and their predicate-argument structures,
achieving a high F-score of 0.87 [40]. Kim et al. extract
patterns like "A binds B but not C" between proteins and
genes; their system gives a fast execution time of
0.038/abstract and a high precision of 0.97 [24].
Subramaniam et al. developed the Bio-Annotator system,
which is part of their relation extraction system and uses
rules and dictionary lookup for identifying and classifying
biological terms [33][37]. Note that these systems
significantly differ from our work in that they focus on
specific types of relationships, whereas ours can extract a
much wider range of relationships.
Several machine learning methods have also been used
for relationship extraction. One commonly used approach is
based on Conditional Random Fields (CRFs), probabilistic
graphical models used for labeling and segmenting
sequences [2]. In addition, both the Naïve Bayes classifier
and the Hidden Markov Model have been applied to
relationship extraction [5][9].
In the area of semantic role extraction, Gildea and
Jurafsky [9] describe a statistical approach for semantic role
labeling using data collected from FrameNet, analyzing
features such as phrase type, grammatical function and
position in the sentence. Similarly, Shi and Mihalcea
propose a rule-based approach for semantic parsing using
FrameNet and WordNet [33][34]. They extract rules from
the tagged data provided by FrameNet, which specify the
realization (order and different syntactic features) of the
present semantic roles.
V. CONCLUSION/DISCUSSION
We present a verb-centric relationship extraction
algorithm in this article. Given a sentence from biomedical
text, our algorithm identifies whether it is a relationship-bearing
sentence and, if so, extracts the relationship-depicting
phrase from the sentence. Our algorithm also
extracts the participating entities around the relationship-depicting
phrase, handling the missing, incomplete and
conjoining entity issues involved in their extraction.
Evaluated on three datasets, our algorithm achieves a
balanced precision from 0.86 to ~0.95 and a recall from
0.88 to ~0.92.
In future work, we would like to evaluate our algorithm
on public datasets in order to enable direct comparisons
with other relationship extraction approaches. We also plan
to work on relationship integration and categorization, for
example, hierarchical and causal relations. Finally, a future
goal is the visualization of the biomedical entities and
relationships extracted by the algorithm described in this
article.
ACKNOWLEDGEMENT
We would like to thank Jason De' Silva, Dong Yan and
Vilas Ketkar for helping us with the manual annotation of
the datasets used in the evaluation of our verb-centric
relationship extraction algorithm.
REFERENCES
[1] Aronson A.R., "Effective Mapping of Biomedical Text to the UMLS Metathesaurus: The MetaMap Program," AMIA 2001.
[2] Bundschus M., Dejori M., Stetter M., Tresp V. and Kriegel H.P., "Extraction of semantic biomedical relations from text using conditional random fields," BMC Bioinformatics 2008, 9:207, doi:10.1186/1471-2105-9-207.
[3] Chulan U.A.U., Sulaiman M.N., Hamid J.A., Mahmod R. and Selamat H., "Extracting Relationship in Text using Connectors," Faculty of Computer Science and Information Systems, UPM.
[4] Cohen A.M. and Hersh W.R., "A survey of current work in biomedical text mining," Briefings in Bioinformatics, vol. 6, no. 1, pp. 57-71, March 2005.
[5] Collier N., Nobata C. and Tsujii J., "Extracting the names of genes and gene products with a hidden Markov model," in Proceedings of the International Conference on Computational Linguistics (COLING), Morgan Kaufmann, pp. 201-207, 2000.
[6] Curran J.R. and Moens M., "Scaling context space," in Proceedings of the Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, ACL, pp. 231-238, 2002.
[7] Feldman R., Regev Y., Finkelstein-Landau M., Hurvitz E. and Kogan B., "Mining biomedical literature using information extraction," ClearForest Corp., USA & Israel, KDD Cup 2002 competition.
[8] Fundel K., Küffner R. and Zimmer R., "RelEx - relation extraction using dependency parse trees," Bioinformatics 2007, 23:365-371.
[9] Gildea D. and Jurafsky D., "Automatic Labeling of Semantic Roles," Computational Linguistics, 28(3):245-288, 2002.
[10] Girju R., Roth D. and Sammons M., "Disambiguation of VerbNet classes," Interdisciplinary Workshop on Verb Features and Verb Classes, 2005.
[11] Gordon M.D. and Lindsay R.K., "Literature-based discovery by lexical statistics," Journal of the American Society for Information Science, vol. 50, pp. 574-587, 1999.
[12] Gordon M.D. and Lindsay R.K., "Toward discovery support systems: A replication, reexamination and extension of Swanson's work on literature-based discovery of a connection between Raynaud's and fish oil," Journal of the American Society for Information Science, vol. 47, pp. 116-128, 1996.
[13] Hristovski D., Peterlin B., Mitchell J.A. and Humphrey S.M., "Using literature-based discovery to identify disease candidate genes," International Journal of Medical Informatics, vol. 74(2-4), pp. 289-298.
[14] Bodenreider O., Hole W.T., Humphreys B.L., Roth L.A. and Srinivasan S., "Customizing the UMLS Metathesaurus for your Applications," Proc. AMIA Symp., November 2001. <http://www.nlm.nih.gov/research/umls/>
[15] Klein D. and Manning C.D., "Fast Exact Inference with a Factored Model for Natural Language Parsing," in Advances in Neural Information Processing Systems 15 (NIPS 2002), Cambridge, MA: MIT Press, pp. 3-10, 2003. <http://nlp.stanford.edu/software/lex-parser.shtml>
[16] Schuler K.K., "VerbNet: A broad-coverage, comprehensive verb lexicon," University of Pennsylvania (Dissertation), unpublished. <http://verbs.colorado.edu/~mpalmer/projects/>
[17] Miller G.A., "WordNet - About Us," WordNet, Princeton University, 2009. <http://wordnet.princeton.edu/>
[18] Aronson A.R., "Effective Mapping of Biomedical Text to the UMLS Metathesaurus: The MetaMap Program," AMIA 2001 Proceedings. <http://www.nlm.nih.gov/research/umls/meta3.html>
[19] Holden J.M., Haytowitz D.B., Pehrsson P.R., Exler J. and Trainer D., "USDA's National Food and Nutrient Analysis Program," Progress Report, 17th International Congress of Nutrition, Vienna, Austria, August 27-31, 2001. <http://www.nal.usda.gov/fnic/foodcomp/Data/>
[20] Bou B., "Stanford parser grammatical relationship browser." <http://grammarscope.sourceforge.net/>
[21] Marcus M.P., Santorini B. and Marcinkiewicz M.A., "Building a large annotated corpus of English: The Penn Treebank." <http://www.cis.upenn.edu/~treebank/>
[22] Baldridge J., Bierner G. and Morton T., "OpenNLP." <http://opennlp.sourceforge.net/>
[23] McEntyre J. and Ostell J., "The NCBI Handbook," National Center for Biotechnology Information, 2002. <http://www.ncbi.nlm.nih.gov/pubmed/>
[24] Kim J.J., Zhang Z., Park J.C., et al., "BioContrasts: extracting and exploiting protein-protein contrastive relations from biomedical literature," Bioinformatics 2006, 22:597-605. <http://bioinformatics.oxfordjournals.org/cgi/reprint/22/5/597.pdf>
[25] Klein A., He X., Roch M., Mallett A., Duska L., Supko J.G. and Seiden M.V., "Prolonged stabilization of platinum-resistant ovarian cancer in a single patient consuming a fermented soy therapy," Gynecol. Oncol., 100(1):205-209, Jan. 2006; Epub Sep. 19, 2005.
[26] Mukherjea S. and Sahay S., "Discovering Biomedical Relations Utilizing the World-Wide Web," Pacific Symposium on Biocomputing, 11:164-175, 2006.
[27] Mustafa J. and Seki K., "Discovering Implicit Associations between Genes and Hereditary Diseases," Pacific Symposium on Biocomputing, 12:316-327, 2007.
[28] Nedellec C., "Learning language in logic - genic interaction extraction challenge," in Proceedings of the ICML05 Workshop: Learning Language in Logic (LLL05), 2005.
[29] Palakal M., Mukhopadhyay S. and Stephens M., "Identification of Biological Relationships from Text Documents," Springer US, 1571-0270, pp. 449-489, 10.1007/b13595, 2005.
[30] Popowich F., "Using Text Mining and Natural Language Processing for Health Care Claims Processing," ACM SIGKDD Explorations Newsletter, vol. 7, issue 1, June 2005.
[31] Rinaldi F., et al., "Mining of relations between proteins over biomedical scientific literature using a deep-linguistic approach," Artif. Intell. Med. 2007, 39(2):127-136.
[32] Rusu D., Dali L., Fortuna B., Grobelnik M. and Mladenić D., "Triplet Extraction from Sentences," in Proceedings of the 10th International Multiconference Information Society (IS 2007), Ljubljana, vol. A, pp. 218-222, 2007.
[33] Sahay S., Mukherjea S., Agichtein E., Garcia E.V., Navathe S.B. and Ram A., "Discovering Semantic Biomedical Relations Utilizing the Web," ACM Trans. Knowl. Discov. Data, 2(1), Article 3, 15 pages, March 2008. DOI = 10.1145/1342320.1342323
[34] Shi L. and Mihalcea R., "Open Text Parsing Using FrameNet and WordNet," in Proceedings of HLT-NAACL 2004: Demonstration Papers, pp. 247-250, Boston, Massachusetts, USA, May 2-7, 2004, Association for Computational Linguistics.
[35] Skusa A., Rüegg A. and Köhler J., "Extraction of biological interaction networks from scientific literature," Briefings in Bioinformatics, 6(3):263-276, 2005.
[36] Srinivasan P., "Generating hypotheses from MEDLINE," Journal of the American Society for Information Science, vol. 55, pp. 369-413, 2004.
[37] Subramaniam L.V., Mukherjea S., Kankar P., Srivastava B., Batra V.S., Kamesam P.V. and Kothari R., "Information extraction from biomedical literature: Methodology, evaluation and an application," in Proceedings of the ACM CIKM International Conference on Information and Knowledge Management (CIKM'03), ACM Press, pp. 410-417.
[38] Swanson D.R., "Fish oil, Raynaud's syndrome, and undiscovered public knowledge," Perspect. Biol. Med., vol. 30, pp. 7-18, 1986.
[39] Tanabe L. and Wilbur W.J., "Tagging gene and protein names in biomedical text," Bioinformatics, vol. 18, no. 8, pp. 1124-1132, Oxford University Press, 2002.
[40] Tsai T.H., Chou W.C., Lin Y.C., et al., "BIOSMILE: Adapting semantic role labeling for biomedical verbs: an exponential model coupled with automatically generated template features," in BioNLP, 2006.
[41] Yang H., Sharma A., Swaminathan R. and Ketkar V., "On building a quantitative food-disease-gene network," 2nd International Conference on Bioinformatics and Computational Biology, March 2010, in press.
[42] Zhang X., Shu X.O., Li H., Yang G., Li Q., Gao Y.T. and Zheng W., "Prospective cohort study of soy food consumption and risk of bone fracture among postmenopausal women," Arch. Intern. Med., 165(16):1890-1895, September 2005.
[43] Zweigenbaum P., Demner-Fushman D., Yu H. and Cohen K.B., "Frontiers of biomedical text mining: current progress," Briefings in Bioinformatics, vol. 8, no. 5, pp. 358-375, doi:10.1093/bib/bbm045, 2007.