A Verb-centric Approach for Relationship Extraction in Biomedical Text
Abhishek Sharma
Department of Computer Science
San Francisco State University
San Francisco, CA, USA
[email protected]
Rajesh Swaminathan
Department of Computer Science
San Francisco State University
San Francisco, CA, USA
[email protected]
Abstract—Advances in biomedical technology and research
have resulted in a large number of research findings, which
are primarily published in unstructured text such as journal
articles. Text mining techniques have been thus employed to
extract knowledge from such data. In this article we focus on
the task of identifying and extracting relations between bioentities such as green tea and breast cancer. Unlike previous
work that employs heuristics such as co-occurrence patterns
and handcrafted syntactic rules, we propose a verb-centric
algorithm. This algorithm identifies and extracts the main
verb(s) in a sentence; therefore, it does not require the usage
of predefined rules or patterns. Using the main verb(s) it then
extracts the two involved entities of a relationship. The
biomedical entities are identified using a dependency parse
tree by applying syntactic and linguistic features such as
preposition phrases and semantic role analysis. The proposed
verb-centric approach can effectively handle complex sentence
structures such as clauses and conjunctive sentences. We
evaluate the algorithm on several datasets and achieve an
average F-score of 0.905, which is significantly higher than
that of previous work.
Keywords-Biomedical Text Mining; Relationship Extraction; Verb-centric Method; Natural Language Processing (NLP)
I. INTRODUCTION
Unstructured text such as journal articles still remains
the main means of publishing research in the field of life
sciences. For instance, the MEDLINE database currently
holds 19 million articles or citations in the biomedical field.
Furthermore every day, there are 1500~3000 new articles
added to the database [23]. With this exponential increase in
biomedical articles, it becomes unrealistic for humans to
gather relevant knowledge by manually curating this
information trove. It is therefore imperative to apply text
mining techniques to automatically analyze this voluminous
unstructured data. In this article, we describe an algorithm to extract a specific type of knowledge: the relationship
between a set of bio-entities including foods, chemicals,
proteins, genes and diseases, from journal articles in
the biomedical domain.
The ultimate goal of analyzing biomedical articles
is to construct bionetworks that capture various interactions
or relationships among different bio-entities (e.g. soy,
isoflavones, and osteoporosis) [29][4][7][35][38][43]. To
this end, two common tasks – entity recognition and
relationship extraction – have been extensively studied,
where the former often focuses on recognizing protein and
gene names and the latter the relation between proteins and
genes [4][35][39].

Hui Yang
Department of Computer Science
San Francisco State University
San Francisco, CA, USA
[email protected]

We recently proposed to study the polarity/strength of a relationship as well. For instance, the
sentence, Soy consumption may reduce the risk of fracture,
exhibits a positive polarity with a weak level of certainty or
strength [41]. In this article, we primarily focus on the
identification and extraction of relationships between
entities in biomedical literature. Five types of entities – food,
disease, protein, chemical and gene are considered in this
work, since the long term goal of this work is to build a
food-disease-gene network [41]. We have proposed a verb-centric method to address this issue. To illustrate the main
goal of our algorithm, let us consider the sentence, Soy food
consumption may reduce the risk of fracture [42]. It
describes a relationship, may reduce the risk of, between the
entities, Soy, food consumption and fracture. When the
proposed algorithm takes this sentence as an input, it will
first confirm whether it is a relationship bearing sentence. If
yes, the algorithm will then extract the relevant parts of the
relationship as stated in the sentence and produce the output,
Soy food consumption | may reduce the risk of | fracture.
There are three components in the output separated by the |
symbol. The middle component, may reduce the risk of, is
the relationship-depicting phrase, whereas Soy food
consumption and fracture are the two participating entities.
Four main approaches have been applied in the past to
extract relationships, namely the co-occurrence-based, link-based, rule-based and machine learning approaches such as Hidden Markov Models [35][4][7][43]. In the co-occurrence approach, if two entities frequently collocate with each other, a relationship is inferred between them. Link-based or association-based methods extend the co-occurrence approach: two entities are considered to have a relationship if they co-occur with a common entity [11][12][36][13]. The above two approaches often do not
employ natural language processing (NLP) techniques and
focus on a few target entities, such as “fish oil” and “Raynaud’s disease” in the work by Swanson, Lindsay and Gordon [38][11]. Rule-based techniques, on the other hand, rely heavily on NLP techniques to identify both syntactic and semantic units in the text. Handcrafted rules are then
applied to extract relationships between entities. For
instance, Fundel et al. construct three rules and apply them
to a dependency parse tree to extract the interactions
between genes and proteins [8]. Feldman introduces a
template in the form of NP1-Verb-NP2 to identify the
relation between two entities corresponding to two noun
phrases, NP1 and NP2, respectively [7]. Many rule-based
studies often predetermine the relationship categories using
target verbs such as, inhibit and activate or negation words
such as, but not [40][24][29][33]. Finally, machine learning approaches such as SVMs and CRFs have also been used to extract relationships [29]. These approaches are supervised and require a manually annotated training dataset, which can be expensive to construct. Rule-based approaches, in contrast, have proven relatively successful in the past [7][43]: they often deliver reasonable precision (0.60~0.80) but low recall (~0.50), with few exceptions [8][41]. The proposed approach
in this paper bears many similarities with the rule-based
approaches.
Existing works however exhibit several major
limitations. First, as mentioned earlier, they mainly focus on
proteins and genes and their interactions, whereas in our
work we consider foods, chemicals and diseases. Second,
rules are not only expensive to construct but also severely
limit the ability to uncover relevant relations, that is, the low
recall issue. Third, these approaches assume that the entities
of interest have been correctly recognized and labeled. This
however is often not true due to the limitations in the
automated entity recognition process. As a result, some
entities are not recognized at all, while others are not recognized
food in the phrase, intake of tofu. Fourth, these approaches
do not effectively handle complex sentences that contain
multiple relationships involving multiple pairs of entities,
especially when a sub-clause is involved. For example, in
the sentence, Tofu intake showed a significant association
with genistein and miso soup showed a slight association
with phytoestrogens, there are two relationships with two
sets of participating entities. Finally, these approaches
primarily rely on a parser to correctly identify the verb
phrase. This however can be challenging, since verb phrases
are often not correctly tagged as a separate syntactic
constituent by today’s state-of-the-art parsers [21].
To overcome these limitations, we propose a verb-centric algorithm. Using preposition phrase handling, our
algorithm effectively addresses the incomplete entity issue.
By analyzing the phrase-level conjunction structure, our
algorithm is able to separate the one-to-many or many-to-one relations into multiple binary relations. For instance,
three instances are extracted from, A, B and C is related to
D. The multiple relationship issue is handled by analyzing
the sentence level conjunction structure. The challenge of
missing entities is addressed by analyzing the semantic roles
(e.g. subjects/objects) of a noun phrase. Finally, instead of
relying on the verb phrases identified by a parser, we rely on
main verbs.
We have evaluated our algorithm on several datasets
drawn from MEDLINE and achieved an average F-score of ~0.90. We have also implemented the conventional NP1-Verb-NP2 rule-based relationship extraction method and
evaluated it against the same datasets, which achieved an
average F-score of ~0.82 [41]. This shows that the
proposed verb-centric approach clearly outperforms the
conventional approaches for relationship extraction.
Please note that the work described in this article
corresponds to one of the four modules in an information
extraction system that has been proposed by us earlier [41].
As shown in Figure 1, the system consists of four modules
(Entity Recognition, Relationship Extraction, Relationship
Polarity Analysis and Relationship Strength Analysis). The
relationship extraction module is executed after all the
entities have been extracted. The output of the relationship
extraction module will act as an input to the next two
modules, relationship polarity and strength analysis.
Therefore, the performance of the relationship extraction
module plays a critical role in the success of the entire
framework. Please refer to [41] for more details.
Figure 1. Architecture of the overall system [41]
II. ALGORITHM
Figure 2. Relationship Extraction System Overview
The goal of our algorithm is sentence-wise identification
and extraction of relationships. In addition, the algorithm should be able to identify and extract the participating entities of the relationship in a given sentence. To reach this goal we have designed a three-step algorithm. Given a sentence, our algorithm first identifies whether it is a relationship-bearing sentence or not. It next extracts the relevant relationship-depicting phrase and participating entities from the relationship-bearing sentence. Finally, it formats the extracted relations into a list of binary relations in the form Entity | Relationship | Entity. Figure 2 presents a
schematic description of the verb-centric relationship
extraction algorithm. In the following sections, we describe
each of these steps in detail.
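The three-step flow described above can be sketched as follows. This is a minimal sketch in which the two step functions are hypothetical stand-ins for the components detailed in the sections below:

```python
def extract_relations(sentence, is_bearing, extract_parts):
    """Top-level flow of the verb-centric algorithm (a sketch).

    `is_bearing` and `extract_parts` are injected stubs standing in for
    the identification and extraction components described later.
    """
    # Step 1: keep only relationship-bearing sentences.
    if not is_bearing(sentence):
        return []
    # Step 2: extract (left entity, relationship phrase, right entity) triples.
    triples = extract_parts(sentence)
    # Step 3: format each triple as Entity | Relationship | Entity.
    return [f"{left} | {rel} | {right}" for left, rel, right in triples]
```

The stubs make the control flow explicit without committing to a particular parser or NER implementation.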
A. Data Acquisition and Entity Recognition
To acquire relevant scientific publications from the
MEDLINE database, we utilize an in-house program [23].
Given a set of keywords, the program automatically
downloads all the relevant articles from MEDLINE and
stores them on a local disk. Both abstracts and full texts of
the biomedical articles, if available, are downloaded. In this
work, we only analyze the abstracts.
The NER (Named Entity Recognition) module takes
each abstract and applies a lexicon-based approach to
recognize the following types of entities: food, disease, gene, protein and chemical. The NER module also
identifies co-references and abbreviations. Please refer to
[41] for details. Let us consider the sentence, Hence, soy
isoflavones and saponins are likely to be protective of colon
cancer and to be well tolerated. The NER module will
recognize the following entities from this sentence,
{\chemical soy isoflavones}, {\chemical saponins}, {\disease
colon cancer}, each labeled by semantic type. This NER
module was evaluated using the GENIA corpus and datasets
drawn from the MEDLINE database. See [41] for details.
B. Algorithm
1) Relationship extraction - what sentences?:
Our relationship extraction algorithm identifies and extracts
relationships at the sentence level. Given a sentence we first
need to determine whether it contains a relationship or not.
To this end, we performed the following case study and
were able to identify several key criteria for a relationship-bearing sentence. Specifically, we randomly selected
50 abstracts and recruited a team of 5 members to manually
annotate all the entities of interest and their relationships.
For instance, the sentence in the above section will be
annotated as, Hence, {\chemical soy isoflavones and
saponins} {\relationship are likely to be protective of}
{\disease colon cancer} and to be well tolerated.
A total of 352 relationship-bearing sentences in the 50
abstracts are identified and annotated. We analyze the Inter-Annotator Agreement (IAA) over all these sentences and
address the discrepancies by taking the one agreed upon by
3 or more annotators. We then conduct a collection of
studies to characterize these 352 sentences. We first
compare the list of manually annotated entities with the list
identified by the NER module. We observe that although the
NER module can correctly identify around 90% of the
entities, it exhibits intrinsic limitations as described in the
previous section, for instance, incomplete and missing
entities. We also observe that these 352 sentences exhibit a
variety of sentence structures: while some are relatively simple, in the form of subject-verb-object, a good portion involves multiple relations through the use of conjunctive structures
(e.g. A is related to B, that affects C. A, B, C is related to D.
A is related to B and C is related to D.) We also notice that
~97% of these sentences describe a verb-based relationship
(e.g. A affects B). We then manually collect all such verbs,
and compare them with the 54 verbs or verb phrases
included in the UMLS Semantic Network. These 54 verbs
are intended to capture the main relations that may exist
between biomedical entities. We observed that these two
lists have a good overlap. For the verbs that appear in the
manually annotated abstracts but not in UMLS, we can often
identify a verb that is semantically similar. Finally, although
~61% of these 352 sentences involve two or more entities of interest, ~36% only contain one entity. We therefore need to relax
the commonly adopted criterion that requires the presence of
the two entities of interest when extracting relationships.
To summarize, through this case study, we are able to
not only gain valuable insights into the challenges we are facing, but also identify two key criteria we can use to reliably determine whether a given sentence is relationship-bearing: (1) the sentence needs to contain at least one entity of interest; and (2) it contains a verb that is semantically similar to one of the 54 verbs listed in the UMLS Semantic Network.
2) Identifying the relationship-bearing sentences.
We are now ready to describe the main algorithm as
depicted in Figure 2. As mentioned earlier, the inputs to the
algorithm are sentences whose entities are automatically
labeled by the NER module. Given a labeled sentence, we
first determine whether it contains a relationship using the
two criteria derived above. This takes the following two
steps:
Step 1: Given a sentence, the entities recognized by the NER module might be redundant or incorrect as a result of imperfect parsing. To deal with this issue, we (1) discard an entity if its span covers the entire sentence; and (2) discard an entity if its span is a subset of another entity's span in the same sentence. Once the above tasks are done, we retain the sentence if it contains one or more labeled entities.
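The two filtering rules amount to simple span checks. A minimal sketch, assuming entities are represented as (start, end) character offsets into the sentence (a hypothetical encoding of the NER output):

```python
def filter_entities(sentence, spans):
    """Clean up NER output before the relationship check (a sketch).

    `spans` is a list of (start, end) character offsets, a hypothetical
    representation of the entities labeled by the NER module.
    """
    # (1) Discard an entity whose span covers the entire sentence.
    kept = [(s, e) for (s, e) in spans if not (s == 0 and e == len(sentence))]
    # (2) Discard an entity whose span is a proper subset of another's.
    return [(s, e) for (s, e) in kept
            if not any((s2, e2) != (s, e) and s2 <= s and e <= e2
                       for (s2, e2) in kept)]
```

A sentence is retained for Step 2 only if at least one entity survives this filtering.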
Step 2: In this step, we examine whether each sentence retained in Step 1 contains a verb that is semantically similar to one of the 54 UMLS verbs. To do this, we first expand the list of these 54 verbs to include verbs that are semantically similar to one of them, using both WordNet [17] and VerbNet [16]. We then check whether one of the verbs in the sentence under consideration is on the expanded verb list. If the answer is yes, the sentence is considered relationship-bearing and passed on to the next step for further analysis.
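This check can be sketched as follows. The verb excerpt and the synonym map standing in for the WordNet/VerbNet expansion are illustrative assumptions, not the actual 54-verb list:

```python
# A small excerpt of the UMLS Semantic Network verbs, plus a hypothetical
# synonym map standing in for the WordNet/VerbNet expansion step.
UMLS_VERBS = {"affects", "causes", "prevents", "treats"}
SYNONYMS = {"prevents": {"protects", "reduces"}, "affects": {"modulates"}}

def expand_verbs(base, synonyms):
    """Expand the base verb list with semantically similar verbs."""
    expanded = set(base)
    for verb in base:
        expanded |= synonyms.get(verb, set())
    return expanded

def passes_verb_check(sentence_verbs, expanded):
    """Step 2: does the sentence contain a verb on the expanded list?"""
    return any(v in expanded for v in sentence_verbs)
```

In practice the expansion would be computed once, offline, and reused for every sentence.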
3) Verb-based relationship extraction
As described earlier, we have observed in our case study
that although many authors employ the simple Subject-Verb-Object sentence structure to describe a relationship
between two entities, it is not uncommon for authors to use
more complex sentence structures. Two such structures are especially popular: (1) using conjunctive structures or clauses
to describe multiple verb-based relations in a single
sentence. For instance, the following sentence,
Supplementation with soy protein containing isoflavones
does not reduce colorectal epithelial cell proliferation or the
average height of proliferating cells in the cecum, sigmoid
colon, and rectum and increases cell proliferation measures
in the sigmoid colon, consists of two relationship-depicting
phrases, does not reduce and increases, which are connected by the conjunctive word, and; and (2) using conjunctive
structure to describe a single verb-based many-to-one or
one-to-many relationship. As an example, the sentence,
Fermented soy products are known to contain high
concentrations of the isoflavone, genistein, and other
compounds, specifies a one-to-three relation.
As one can observe from the different sentence structures, regardless of a sentence's complexity, the most critical step towards correctly extracting all the relationships in a sentence is to identify the verb phrases. To achieve this,
previous work relies on two main techniques: (1) Using the
parse tree of a sentence to extract the verb phrases. This can
be inaccurate as most state-of-the-art parsers cannot
correctly identify the right boundary of a verb phrase. As a
result, the right boundary of a verb phrase is often defaulted
to the end of a sentence; and (2) assuming the entities of
interest are correctly recognized and using such entities as
anchors to infer the span of a verb phrase. This again has
serious limitations as no entity recognition algorithm can
achieve perfect performance so far. In addition, not every
verb phrase is surrounded by two entities. An example is
shown in the first sentence in the above paragraph.
To overcome these problems we propose to extract the
main verbs from a sentence where a main verb is the most
important verb in a sentence and states the action of the
subject. We take a two-step procedure for this purpose: (1)
we first partition a sentence into multiple semantic units if
necessary, such that each unit consists of only one main
verb. This is done by analyzing the conjunctive structure at
the sentence level; and (2) for each semantic unit, we then
identify and extract its main verb. We next describe these two steps in detail.
To partition a complex sentence into multiple semantic
units, we analyze the sentence-level conjunctive structure using the OpenNLP parser [21]. A list of conjunctions is retrieved from the parse tree of the sentence. The parent sub-tree of each conjunction is then retrieved. If the parent sub-tree corresponds to the entire input sentence, the algorithm asserts that it is a sentence-level conjunction. It then uses the
conjunctions to break the sentence into smaller semantic
units. Let us consider the same example mentioned earlier,
Supplementation with soy protein containing isoflavones
does not reduce colorectal epithelial cell proliferation or the
average height of proliferating cells in the cecum, sigmoid
colon, and rectum and increases cell proliferation measures
in the sigmoid colon. In this sentence, the conjunction word, and, between the words, rectum and increases, has the entire sentence returned as its parent sub-tree, while, and, between
the words, colon and rectum, has the phrase, in the cecum,
sigmoid colon, and rectum, as the parent. Therefore, the
first, and, is a sentence-level conjunction and the sentence is
divided into two parts at this conjunction. Clearly, each part
contains a verb-based relationship.
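The partitioning step can be sketched as follows, with parse trees represented as nested (label, children) tuples and (tag, word) leaves. Checking for a conjunction directly under the root approximates the parent-sub-tree test described above; real trees would come from the OpenNLP parser:

```python
def leaves(node):
    """Collect the tokens under a parse-tree node.

    Trees are nested (label, children) tuples; leaves are (tag, word).
    """
    label, children = node
    if isinstance(children, str):
        return [children]
    tokens = []
    for child in children:
        tokens.extend(leaves(child))
    return tokens

def split_at_sentence_conjunctions(root):
    """Split a sentence at conjunctions whose parent sub-tree is the root."""
    _, children = root
    units, current = [], []
    for child in children:
        if child[0] == "CC":          # sentence-level conjunction
            units.append(" ".join(current))
            current = []
        else:
            current.extend(leaves(child))
    units.append(" ".join(current))
    return [u for u in units if u]
```

A conjunction nested deeper (e.g. inside a prepositional phrase) never reaches the root's child list, so phrase-level conjunctions are left intact.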
To find the main verb(s), the algorithm considers the
parse tree of a sentence generated by the OpenNLP parser. It
traverses the parse tree using preorder traversal and
continuously checks the tags of each right sub-tree until it
reaches the deepest verb node of a verb phrase. This verb is
then considered the main verb of the sentence [35]. Using
the software tool Grammar Scope, which provides a
graphical representation of a parse tree [21], we have
manually verified this main verb extraction procedure.
Shown in Figure 3 is a parse tree produced by this tool for
the sentence, Phytoestrogens may be associated with a
reduced risk of hormone dependent neoplasms such as
prostate and breast cancer. The verb, associated, is correctly extracted as the main verb. Main verb extraction is a critical step, as it facilitates the final step of verb-centric relationship extraction.
Figure 3. Graphical representation of the parse tree of the sentence, Phytoestrogens may be associated with a reduced risk of hormone dependent neoplasms such as prostate and breast cancer.
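The traversal can be sketched as follows, over a parse tree represented as nested (label, children) tuples with (tag, word) leaves. In a right-branching verb phrase the deepest verb is the last one visited in preorder, which is what this sketch exploits:

```python
def main_verb(node):
    """Return the deepest verb in a parse tree (a sketch).

    Trees are nested (label, children) tuples; leaves are (tag, word).
    A preorder walk keeps overwriting the result, so the last verb
    found in a right-branching verb phrase -- the deepest -- wins.
    """
    label, children = node
    if isinstance(children, str):
        return children if label.startswith("VB") or label == "MD" else None
    verb = None
    for child in children:
        found = main_verb(child)
        if found is not None:
            verb = found        # a deeper/later verb replaces a shallower one
    return verb
```

For "may be associated with …" the walk passes may and be before settling on associated, matching the example above.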
4) Determining the two participating entities
The main verbs extracted in the above section state the main relations between entities; a main verb, however, only constitutes one third of a binary relationship. To complete a
relationship, we still need to identify its two involved
entities. A proximity-based approach is adopted for this
purpose. Specifically, we take the entities located
immediately to the left and to the right of a main verb in the
same semantic unit as the two participating entities. For this
proximity-based approach to work effectively, we however
have to tackle the following issues: (1) the NER module
might not recognize all the entities of interest. We term this
as the missing entity issue; (2) the incomplete entity issue,
where the entity identified by the NER is part of a true
entity. For instance, only the word, equol, in the phrase,
serum concentration of equol, is labeled as, chemical; and
(3) the many-to-one or one-to-many relationship mentioned
earlier (for instance, A activates B, C and D). To tackle
these three issues, we rely on both dependency parse trees
and linguistic features. We next describe our solution to each of these issues in detail.
For the missing entity issue, we first observe that many
of these missing entities are either a subject or an object in a
sentence. We therefore address this issue by identifying the
subject and object in a sentence. The program passes each
relationship-bearing sentence unit as an input to the Stanford
NLP Parser [15]. This parser produces pairs of dependent
words, where each pair satisfies one of the 55 prescribed
grammatical relations [15]. Several rules are applied to
extract the subject/object in a sentence using the word
dependencies. We explain them with an example. Consider the sentence shown in Figure 4 and the
dependencies obtained from the Stanford Parser. The typed dependencies are generated in pairs of governor (Gov) and dependent (Dep), followed by the grammatical relation (Rel) between them. To recognize the subject,
the program scans for the substring of ‘subj’ in the
relation, then searches for the occurrence of the governor or dependent word in any of the pairs having a
noun or preposition relation. In the sentence shown in Figure 4, the word ‘consumption’ has a ‘subj’ relation with ‘shown’. The program finds the occurrence of ‘consumption’ with ‘Soy’ as a noun modifier, stops at this point, and labels ‘Soy consumption’ as the subject. Similarly, it searches for a
typed dependency with the substring of ‘obj’ in a
relationship and follows the same procedure as for the
subject to locate the object.
Soy consumption has been shown to modulate bone turnover and increase bone mineral density in postmenopausal women.

Gov: consumption-2  Dep: Soy-1              Rel: nn
Gov: shown-5        Dep: consumption-2      Rel: nsubjpass
Gov: turnover-9     Dep: bone-8             Rel: nn
Gov: modulate-7     Dep: turnover-9         Rel: dobj
Gov: modulate-7     Dep: increase-11        Rel: conj_and
Gov: density-14     Dep: bone-12            Rel: nn
Gov: density-14     Dep: mineral-13         Rel: nn
Gov: increase-11    Dep: density-14         Rel: dobj
Gov: women-17       Dep: postmenopausal-16  Rel: amod
Gov: density-14     Dep: women-17           Rel: prep_in

Figure 4. Typed dependencies of a sentence generated by the Stanford Parser
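The subject/object search over these typed dependencies can be sketched as follows, with each dependency reduced to a (governor, dependent, relation) triple (word indices dropped for brevity):

```python
def find_argument(deps, marker):
    """Locate the subject (marker='subj') or object (marker='obj') in
    typed dependencies given as (governor, dependent, relation) triples,
    then expand the head word with its noun ('nn') modifiers.
    """
    for gov, dep, rel in deps:
        if marker in rel:                     # e.g. 'nsubjpass', 'dobj'
            head = dep
            # Collect noun modifiers of the head, as in 'Soy' -> 'consumption'.
            mods = [d for g, d, r in deps if g == head and r == "nn"]
            return " ".join(mods + [head])
    return None
```

With the Figure 4 dependencies this recovers "Soy consumption" as the subject and "bone turnover" as the object of modulate.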
Once the subject and object of a sentence unit have been
labeled, we compare them with the list of entities labeled by
the NER module. If either of them does not overlap with any
of the existing entities, we find the smallest noun phrase that
contains the subject or object and add it to the existing list of
labeled entities.
To handle the incomplete entity issue, we search for a preposition phrase around an entity and merge the two, based on our observations of the incomplete entities. Consider, for example, the phrases, serum concentration of equol, and, dietary intake of tofu and miso soup. Here, only, equol, and, dietary intake, are labeled as entities by the NER module; these are incomplete and should instead be, serum concentration of equol, and, dietary intake of tofu and miso soup, respectively. To address this
issue, our program identifies the preposition phrase right
before, after, or overlapping with a labeled entity and merges the preposition phrase with the entity to replace the
incomplete one. In the above two cases, of equol and of tofu
and miso soup, are the two preposition phrases. Since,
equol, is part of the preposition phrase, of equol, the
program will search for the noun phrase immediately preceding it, which is, serum concentration, and merge them together
to replace, equol. On the other hand, since the preposition phrase, of tofu and miso soup, comes right after the labeled entity, dietary intake, the program simply merges the two. After this
preposition handling, our algorithm is able to identify the
complete form of the two entities in the above sentence.
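A minimal sketch of this merging logic, assuming the parser has already supplied the entity, preposition-phrase, and preceding-noun-phrase spans as token indices (all hypothetical inputs standing in for parse-tree output):

```python
def complete_entity(tokens, entity, pp, np_before=None):
    """Merge an incomplete entity with an adjacent preposition phrase.

    All spans are (start, end) token indices; they stand in for what
    the parser would supply.
    """
    (s, e), (ps, pe) = entity, pp
    if ps <= s and e <= pe and np_before:
        # Entity lies inside the PP (e.g. 'of equol'): prepend the noun
        # phrase immediately preceding the PP ('serum concentration').
        return " ".join(tokens[np_before[0]:pe])
    if ps == e:
        # PP starts right after the entity ('dietary intake' + 'of tofu ...').
        return " ".join(tokens[s:pe])
    return " ".join(tokens[s:e])
```

Both example cases from the text reduce to one of the two branches.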
Finally, to address the one-to-many or many-to-one
relationships as exhibited in the following sentence,
Fermented soy products are known to contain high
concentrations of the isoflavone, genistein, and other
compounds, we utilize the dependency parse tree to first
merge the multiple entities into one compound entity. This
allows us to correctly establish the one-to-many and many-to-one relationship using the main verb identified earlier.
The merged compound entity will be split into individual
entities as they are when we produce the final binary
relationships. This will be described later in more detail.
Let us now use the above sentence to illustrate the
merging process using a dependency parse tree. Using the
typed dependencies produced by the Stanford parser, the
algorithm finds each governor with a relation tag of, conjunction, and identifies all the dependents that have the
same governor. It also finds the noun phrase for each
governor and its corresponding set of dependent words. It
finally merges all the noun phrases into a compound entity.
In the example, the three entities will be merged into one.
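The conjunct-grouping step can be sketched as follows over (governor, dependent, relation) triples; the full procedure would additionally map each word back to its enclosing noun phrase before merging:

```python
def merge_conjuncts(deps):
    """Group conjoined entities into one compound entity (a sketch).

    `deps` contains (governor, dependent, relation) triples from a
    dependency parse; conjunct relations are tagged 'conj_...'.
    """
    compounds = {}
    for gov, dep, rel in deps:
        if rel.startswith("conj"):
            # The governor anchors the group; each conjunct joins it.
            compounds.setdefault(gov, [gov]).append(dep)
    return compounds
```

For the fermented-soy example, isoflavone, genistein, and compounds would end up in one group.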
5) Exception: the “between … and” case
A special case in describing a relationship is the use of,
between … and … template. The approach described earlier
cannot handle this structure. We therefore treat such cases
separately. Consider the following sentence, Few studies
showed protective effect between phytoestrogen intake and
prostate cancer risk. The algorithm uses the parse tree generated by the OpenNLP parser for the sentence and determines the preposition phrase that contains the preposition, between [21]. It then identifies the entity after, between, and before, and, as the left entity, and the entity immediately after, and, as the right entity. The relationship-depicting phrase is constructed using the main verb plus all the words between the main verb and the word, between. Therefore, the binary relationship of the sentence stated above is, phytoestrogen intake | showed protective effect | prostate cancer risk.
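A token-level sketch of this template handling; the paper locates the phrases via the parse tree, so simple string search stands in for that here:

```python
def between_and(sentence, verb):
    """Extract a binary relation from the 'between X and Y' template.

    A token-level sketch; `verb` is the previously extracted main verb.
    """
    tokens = sentence.rstrip(".").split()
    b = tokens.index("between")
    a = tokens.index("and", b)            # the 'and' pairing this 'between'
    left = " ".join(tokens[b + 1:a])      # entity after 'between', before 'and'
    right = " ".join(tokens[a + 1:])      # entity immediately after 'and'
    v = tokens.index(verb)
    relation = " ".join(tokens[v:b])      # main verb up to 'between'
    return f"{left} | {relation} | {right}"
```

On the example sentence this reproduces the binary relationship given above.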
6) Main verb-based Relationship construction
At this point, we have identified all the main verbs in a
sentence. We have also labeled all the entities of interest
after having addressed the three entity-related issues:
missing entities, incomplete entities and the one-to-many or
many-to-one relationships. We are ready to use the main
verbs to construct the relations contained in the sentence. To
do this, we take each main verb and identify the labeled
entities before and after the main verb in the same semantic
unit. For instance, the sentence we have seen earlier,
Supplementation with soy protein containing isoflavones
does not reduce colorectal epithelial cell proliferation or the
average height of proliferating cells in the cecum, sigmoid
colon, and rectum and increases cell proliferation measures
in the sigmoid colon, will produce the following two
relationships: (1) Supplementation with soy protein
containing isoflavones | does not reduce colorectal
epithelial cell proliferation or the average height of
proliferating cells in | the cecum, sigmoid colon, and
rectum and (2) Supplementation with soy protein
containing isoflavones | increases cell proliferation
measures in | the sigmoid colon. Note that the left entity of
the second relationship is the subject of the entire sentence.
This is done as a post-processing step. Also note that the first
relationship actually contains a one-to-many relationship.
Our algorithm will still treat it as a binary relation, as the
entities on the right are not individually labeled. Instead,
they together are identified as the object of, reduce.
For a many-to-one or one-to-many relationship as
exhibited in the sentence, Fermented soy products are
known to contain high concentrations of the isoflavone,
genistein, and other compounds, our program first combines
the multiple entities located on the same side of the relation
into one compound entity in order to correctly associate
them with the main verb. For the above sentence, we will
first produce the following relationship, Fermented soy
products | are known to contain high concentrations of | the
isoflavone, genistein, and other compounds. We then break
apart the compound entity into individual entities as they are
initially extracted. The above sentence therefore will
produce three relationships: (1) Fermented soy products |
are known to contain high concentrations of | the isoflavone
(2) Fermented soy products | are known to contain high
concentrations of | genistein (3) Fermented soy products |
are known to contain high concentrations of | other
compounds.
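Splitting a compound entity back into its individual binary relations is then straightforward; a minimal sketch:

```python
def expand_compound(left, relation, conjuncts):
    """Break a compound right-hand entity back into individual entities,
    producing one binary relation per conjunct."""
    return [f"{left} | {relation} | {c}" for c in conjuncts]
```

The same expansion applies symmetrically when the compound entity is on the left-hand side.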
The above procedure also applies to the many-to-one or
one-to-many relationship described by the “between … and
…” scenario. For instance, the sentence, No association was
suggested between the frequency of consumption of fruit,
vegetables, green tea, and soy products and gastric cancer,
will lead to the identification of the following relationships:
(1) fruit | No association was suggested | gastric cancer, (2) vegetables | No association was suggested | gastric cancer, (3) green tea | No association was suggested | gastric cancer, and (4) soy products | No association was suggested | gastric cancer.
Finally, for a relationship involving an entity that is
either in an abbreviated format or a pronoun phrase, we
replace it with its original long form or its co-referred entity.
Consider the following two examples: (1) VLFD___a very-low-fat diet | significantly reduces | estrogen concentrations in postmenopausal women; (2) its intake___Soy | may help to prevent some | diseases. In the first example, VLFD is identified by our NER module as an abbreviation of, a very-low-fat diet. We attach this long form in the final
relationship construction. In the second example, the NER
module recognizes that the pronoun, its, co-refers to the
entity, soy. We hence attach, soy, in the extracted
relationship.
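The final substitution step can be sketched as below. The lookup dictionaries stand in for the output of the NER module described above; their contents and the function name are illustrative assumptions.

```python
# Abbreviation long forms and pronoun antecedents, as would be supplied
# by the NER module for the two examples above (assumed, for illustration).
ABBREVIATIONS = {"VLFD": "a very low-fat diet"}
COREFERENCES = {"its": "Soy"}

def resolve_entity(entity: str) -> str:
    """Replace an abbreviation or pronoun in an entity with its long
    form or co-referred entity before final relationship construction."""
    for abbr, long_form in ABBREVIATIONS.items():
        if abbr in entity:
            return entity.replace(abbr, long_form)
    for pron, referent in COREFERENCES.items():
        if entity.lower().startswith(pron):
            return entity.lower().replace(pron, referent, 1)
    return entity

resolve_entity("VLFD")        # -> "a very low-fat diet"
resolve_entity("its intake")  # -> "Soy intake"
```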
III. EVALUATION
In this section, we report the evaluation results by
applying the proposed verb-centric algorithm to three
datasets drawn from the MEDLINE database. As shown in
Table I, the three datasets are created using three sets of
keywords: {soy, cancer}, {beta-carotene, cancer} and
{benzyl, cancer}. They consist of a total of ~1400 sentences,
of which ~750 are relationship-bearing.
TABLE I. DATASETS DESCRIPTION

ID   #(Abs)   Keywords                 #Sentences/   Relation-bearing
                                       abstract      sent./abstract
1    50       Soy & Cancer             8             5
2    80       Beta-carotene & Cancer   9             4
3    40       Benzyl & Cancer          7             4
To evaluate our algorithm, we first manually
annotated all the relationships in all the datasets. For each
relationship we explicitly label the relationship depicting
phrase and participating entities. Here is an annotated
sentence as an example, Hence, {\chemical soy isoflavones}
and {\chemical saponins} {\relationship are likely to be
protective of} {\disease colon cancer} and to be well
tolerated. The manual annotation was done independently
by a group of five members. All members are graduate
students who have been analyzing biomedical text for at
least one year. For this annotation, we are interested in the
following types of biomedical entities and their relationship
as reported in the literature: food, disease, protein, chemical
and gene. To reduce potential subjectivity in the manual
annotation, we combine the five versions of annotations
from the five members into one version using majority rule
policy. Specifically, given a sentence, the annotation that is
agreed by 3 or more members is taken as the final version.
For some sentences, however, a majority vote could not be
reached; in such cases, a group discussion was held to reach
an agreement.
Note that we did not evaluate our algorithm using existing
public datasets such as LLL (Learning Language in Logic)
challenge dataset [28] because such datasets often focus on
genes and proteins, whereas we are also concerned with
foods, chemicals and diseases. We plan to test our algorithm
using such datasets in the future.
We feed the same three datasets to our named entity
recognition (NER) module to automatically label the five
types of entities mentioned above. The verb-centric
algorithm then takes the labeled sentences as input for
relationship extraction. Three measures - precision, recall
and F-score - are used to measure the effectiveness of our
algorithm, where Precision = TP/(TP + FP), Recall =
TP/(TP + FN) and F-score = (2 * Precision * Recall)/
(Precision + Recall), with TP, FP and FN denoting true
positives, false positives and false negatives, respectively.
Given a relationship-bearing sentence, it is counted as a true
positive if (1) the algorithmically extracted relationship-depicting
phrase (RDP) overlaps with the manually annotated RDP, or
(2) the two RDPs agree on the same main verb. Otherwise,
the sentence is counted as a false positive. Finally, a false
negative corresponds to a sentence that is manually annotated
as relationship-bearing but not identified as such by our
algorithm. We also consider a stricter criterion in which we
not only check whether the algorithmically extracted RDPs
match their manual counterparts, but also whether the
entities involved in a relationship match. We report such
results later. In the following discussion, unless otherwise
noted, the reported precision, recall and F-score are
computed on the basis of the main verbs only.
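The three measures translate directly into code; the example below reuses the verb-centric precision and recall reported for Dataset 1 later in this section.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)

def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Verb-centric results on Dataset 1 (Table II): P = 0.9589, R = 0.9045
round(f_score(0.9589, 0.9045), 4)  # -> 0.9309
```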
To demonstrate the advantage of the proposed verb-centric
approach, we also compare it with a rule-based
approach that uses the rule NP1-VP-NP2. According to
this rule, a relationship is composed of a verb phrase
sandwiched between two noun phrases, both of which are
required to be labeled as entities of interest. We refer to this
approach as the NVN method or NVN. We first study the
effect of the three entity-related issues (missing, incomplete
and conjoining) on the NVN algorithm. The results are shown
in Table II. Without any additional entity handling, in other
words simply relying on the entities recognized by the NER
module, we achieve a precision of 0.79 and recall of 0.82.
Both measures are improved (0.80 and 0.86) after we have
incorporated the noun phrases that function as subject and
object in a sentence. Finally, these measures are further
improved (0.84 and 0.87) after conjoining entities are also
dealt with. These gradual improvements indicate the
necessity and importance of handling these entity related
issues. We however could not further improve the accuracy
of the NVN approach as it is unable to correctly extract
relationships where only one entity is available in the
proximity of a verb. For instance, in a sentence such as “A
increases B but decreases C”, the NVN method could not
correctly extract the relation between A and C. In contrast to
NVN, our verb-centric algorithm is more versatile and able
to handle various scenarios. As evidenced in Table II, on the
same dataset, the verb-centric method achieves a precision
of 0.96 and a recall of 0.90, which is significantly higher
than the NVN method.
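The NVN limitation can be illustrated with a small sketch. This is not the authors' implementation; the token-tag representation and function name are assumptions made for the example "A increases B but decreases C" discussed above.

```python
def nvn_extract(tagged):
    """Apply the NP1-VP-NP2 rule: for each verb, pair it with the
    nearest entity on each side. tagged is a list of (token, tag)
    pairs with tag in {'ENT', 'VERB', 'OTHER'}."""
    rels = []
    for i, (tok, tag) in enumerate(tagged):
        if tag == 'VERB':
            left = next((t for t, g in reversed(tagged[:i]) if g == 'ENT'), None)
            right = next((t for t, g in tagged[i + 1:] if g == 'ENT'), None)
            if left and right:
                rels.append((left, tok, right))
    return rels

sent = [('A', 'ENT'), ('increases', 'VERB'), ('B', 'ENT'),
        ('but', 'OTHER'), ('decreases', 'VERB'), ('C', 'ENT')]
nvn_extract(sent)
# -> [('A', 'increases', 'B'), ('B', 'decreases', 'C')]
```

Note how NVN wrongly pairs B, the nearest entity, with "decreases"; the subject A is not in the proximity of the second verb, which is exactly the case the verb-centric algorithm handles.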
TABLE II. EFFECT OF MISSING, INCOMPLETE AND CONJOINING ENTITIES
USING DATASET 1 (KEYWORDS: SOY, CANCER)

Approach                                             Precision   Recall   F-score
NVN without missing/incomplete/conjoining entities   0.79        0.82     0.8047
NVN with subject/object extraction                   0.80        0.89     0.8426
NVN with missing/incomplete/conjoining entities      0.84        0.87     0.8547
Verb-centric relationship extraction                 0.9589      0.9045   0.9309

We also apply the verb-centric algorithm to all three
datasets and study whether it is biased toward a specific
dataset. In addition, we study whether it is sensitive to
sentences located in different sections of an abstract.
Specifically, we focus on the conclusion sentences of an
abstract; if an abstract does not explicitly identify its
conclusion sentences, the last three sentences are regarded as
the conclusion. Tables III-V summarize the results of these
studies. From these tables, we observe that our approach is
not biased and achieves a balanced precision from 0.86 to
~0.95 and a recall from 0.88 to ~0.92. There is some
variation in the precision and recall values across the three
datasets. Datasets 2 and 3 contain more complex sentences
with multiple relationships. In certain cases these sentences
are not properly separated into smaller units; as a result,
certain parts of the sentences are not considered and the
corresponding relationships are not identified or extracted,
which affects the precision and recall on datasets 2 and 3
compared to dataset 1. Nevertheless, this performance is
better than that of existing methods, which often deliver a
reasonable precision (0.60~0.80) but a low recall (~0.50)
[41].

TABLE III. RESULTS FOR VERB-CENTRIC RELATIONSHIP EXTRACTION ON
DATASET 1

            Full Abstracts   Conclusions
Precision   0.9589           0.8873
Recall      0.9045           0.9545
F-score     0.9309           0.9196

TABLE IV. RESULTS FOR VERB-CENTRIC RELATIONSHIP EXTRACTION ON
DATASET 2

            Full Abstracts   Conclusions
Precision   0.8641           0.9032
Recall      0.8833           0.918
F-score     0.8736           0.918

TABLE V. RESULTS FOR VERB-CENTRIC RELATIONSHIP EXTRACTION ON
DATASET 3

            Full Abstracts   Conclusions
Precision   0.91             0.878
Recall      0.901            0.8999
F-score     0.8999           0.8999

As mentioned earlier in this section, we also adopt a
stricter criterion to determine a true positive. Here, we take
both the relationship-depicting phrase (RDP) and the
involved entities into account: an algorithmically extracted
relationship is considered a true positive only if its RDP (or
main verb) and involved entities match the manual
annotation. The results on all three datasets under this
criterion are reported in Table VI. Comparing Table VI with
Tables III-V, one can observe that on all three datasets the
values of both precision and recall have dropped by only a
small margin. This demonstrates that the proximity-based
approach for identifying entities involved in a relationship
works reasonably well. It also shows that the strategies we
have introduced to handle missing, incomplete and
conjoining entities are effective.

TABLE VI. EVALUATION RESULTS CONSIDERING ENTITY MATCH

            Precision   Recall   F-Score
Dataset 1   0.9455      0.8683   0.902
Dataset 2   0.8580      0.8436   0.8507
Dataset 3   0.9052      0.8614   0.8827
A. Analysis of Errors
We have also manually gone over the results to
determine the main factors that lead to false positives and
false negatives. We notice that most false positives are
caused by two rules employed during the manual annotation
process. First, the annotators did not label part-of, or
hierarchical, relationships between entities (for example, A
is an ingredient in B), since they consider these to be known
facts rather than findings derived from a particular study.
The algorithm, on the other hand, extracts such
relationships. Second, the manual annotations rule out
relationships described as part of the study objective. For
instance, annotators do not label the sentence, We
investigate whether soy consumption can reduce the risk of
breast cancer. The verb-centric algorithm, however, extracts
a relationship from the whether-clause.
As for false negatives, they are often caused by the
following two scenarios. (1) The use of an unconventional
writing style. As an example, consider the sentence, The
significant inverse association with beta-carotene and
lutein/zeaxanthin was more pronounced in women, and in
overweight or obese subjects. Instead of using the, between
… and …, structure, the author uses the, with … and …,
structure. As a result, the algorithm is unable to detect this
relationship. (2) The current verb-centric algorithm is unable
to extract relationships in the form of, no preventative effect
of A on B. However, we can readily extend our algorithm to
identify and extract such cases by considering the main
preposition words such as by, of, on, in, at, to, for and with.
We plan to address this issue in the future.
IV. RELATED WORK
Relationship extraction has recently become an area of
interest, resulting in many studies [29][4][7][35][38][43].
Four mainstream approaches have been used for relationship
extraction: co-occurrence, link-based, rule-based and
machine learning approaches. The co-occurrence approach
infers a relationship between two entities if they frequently
collocate with each other. This is a simple method in which
relations between biomedical entities are detected by
collecting the sentences in which they co-occur. The link-based
methods extend the co-occurrence approach by
inferring relationships between entities that co-occur with a
common term. These methods give high recall, but their
precision is often low [43][8][35][4][7].
Several rule-based algorithms have been applied to
extract relationships from biomedical text. An increasing
emphasis has been given to syntactic structures extracted
from a parse tree, such as noun phrases, verb phrases,
subject and object [43]. However, it has been shown that on
larger corpora such methods can be computationally costly
[6]. Fundel et al. used the Stanford Lexicalized Parser to
generate dependency trees from MEDLINE abstracts; their
system, used to find protein/gene interactions, delivers a
reasonable precision and recall of 0.79 and 0.85 on a 50-abstract
dataset [8]. Another dependency-parser-based
system was built for relation extraction by Rinaldi et al. It
uses lexical and semantic information in conjunction with
dependency parsing and obtains a precision ranging from
0.52 (strict) to 0.90 (approximate boundary) and a recall
from 0.40 (estimated lower bound) to 0.60 (actually
measured) [31]. Tsai et al. combine semantic role labeling,
such as location and time information, with syntactic
analysis in their study. They use the 30 most frequently used
biomedical verbs and their predicate-argument structures,
achieving a high F-score of 0.87 [40]. Kim et al. extract
patterns like "A binds B but not C" between proteins and
genes; their system gives a fast execution time of
0.038/abstract and a high precision of 0.97 [24].
Subramaniam et al. developed the Bio-Annotator system,
which is part of their relation extraction system and uses
rules and dictionary lookup for identifying and classifying
biological terms [33][37]. Note that these systems
significantly differ from our work in that they focus on
specific types of relationships, whereas ours can extract a
much wider range of relationships.
Several machine learning methods have also been used
for relationship extraction. One commonly used approach is
based on Conditional Random Fields (CRFs), probabilistic
graphical models used for labeling and segmenting
sequences [2]. In addition, both the Naïve Bayes classifier
and the Hidden Markov Model have been applied to
relationship extraction [5][9].
In the area of semantic role extraction, Gildea and
Jurafsky [9] describe a statistical approach for semantic role
labeling using data collected from FrameNet, analyzing
features such as phrase type, grammatical function and
position in the sentence. Similarly, Shi and Mihalcea
propose a rule-based approach for semantic parsing using
FrameNet and WordNet [33][34]. They extract rules from
the tagged data provided by FrameNet, which specify the
realization (order and different syntactic features) of the
present semantic roles.
V. CONCLUSION/DISCUSSION
We present a verb-centric relationship extraction
algorithm in this article. Given a sentence from biomedical
text, our algorithm identifies whether it is a relationship-bearing
sentence and, if so, extracts the relationship-depicting
phrase from the sentence. Our algorithm also
extracts the participating entities around the relationship-depicting
phrase, handling the missing, incomplete and
conjoining entity issues involved in their extraction.
Evaluated on three datasets, our algorithm achieves a
balanced precision from 0.86 to ~0.95 and a recall from
0.88 to ~0.92.
In future work, we would like to evaluate our algorithm
on public datasets in order to enable direct comparisons
with other relationship extraction approaches. We also plan
to work on relationship integration and categorization, for
example, hierarchical and causal relations. Finally, a future
goal is the visualization of the biomedical entities and
relationships extracted by the algorithm described in this
article.
ACKNOWLEDGEMENT
We would like to thank Jason De' Silva, Dong Yan and
Vilas Ketkar for helping us with the manual annotation of
the datasets used in the evaluation of our verb-centric
relationship extraction algorithm.
REFERENCES
[1] Aronson A.R., "Effective Mapping of Biomedical Text to the UMLS Metathesaurus: The MetaMap Program," AMIA 2001.
[2] Bundschus M., Dejori M., Stetter M., Tresp V. and Kriegel H.P., "Extraction of semantic biomedical relations from text using conditional random fields," BMC Bioinformatics 2008, 9:207, doi:10.1186/1471-2105-9-207.
[3] Chulan U.A.U., Sulaiman M.N., Hamid J.A., Mahmod R. and Selamat H., "Extracting Relationship in Text using Connectors," Faculty of Computer Science and Information Systems, UPM.
[4] Cohen A.M. and Hersh W.R., "A survey of current work in biomedical text mining," Briefings in Bioinformatics, vol. 6, no. 1, pp. 57-71, March 2005.
[5] Collier N., Nobata C. and Tsujii J., "Extracting the names of genes and gene products with a hidden Markov model," in Proceedings of the International Conference on Computational Linguistics (COLING), Morgan Kaufmann, pp. 201-207, 2000.
[6] Curran J.R. and Moens M., "Scaling context space," in Proceedings of the Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, ACL, pp. 231-238, 2002.
[7] Feldman R., Regev Y., Finkelstein-Landau M., Hurvitz E. and Kogan B., "Mining biomedical literature using information extraction," ClearForest Corp., USA & Israel, KDD Cup 2002 competition.
[8] Fundel K., Küffner R. and Zimmer R., "RelEx - relation extraction using dependency parse trees," Bioinformatics 2007, 23:365-371.
[9] Gildea D. and Jurafsky D., "Automatic Labeling of Semantic Roles," Computational Linguistics, 28(3):245-288, 2002.
[10] Girju R., Roth D. and Sammons M., "Disambiguation of VerbNet classes," Interdisciplinary Workshop on Verb Features and Verb Classes, 2005.
[11] Gordon M.D. and Lindsay R.K., "Literature-based discovery by lexical statistics," Journal of the American Society for Information Science, vol. 50, pp. 574-587, 1999.
[12] Gordon M.D. and Lindsay R.K., "Toward discovery support systems: A replication, reexamination and extension of Swanson's work on literature-based discovery of a connection between Raynaud's and fish oil," Journal of the American Society for Information Science, vol. 47, pp. 116-128, 1996.
[13] Hristovski D., Peterlin B., Mitchell J.A. and Humphrey S.M., "Using literature-based discovery to identify disease candidate genes," International Journal of Medical Informatics, vol. 74(2-4), pp. 289-298.
[14] Bodenreider O., Hole W.T., Humphreys B.L., Roth L.A. and Srinivasan S., "Customizing the UMLS Metathesaurus for your Applications," Proc. AMIA Symp., November 2001. <http://www.nlm.nih.gov/research/umls/>
[15] Klein D. and Manning C.D., "Fast Exact Inference with a Factored Model for Natural Language Parsing," in Advances in Neural Information Processing Systems 15 (NIPS 2002), Cambridge, MA: MIT Press, pp. 3-10, 2003. <http://nlp.stanford.edu/software/lex-parser.shtml>
[16] Schuler K.K., "VerbNet: A broad-coverage, comprehensive verb lexicon," University of Pennsylvania (Dissertation), unpublished. <http://verbs.colorado.edu/~mpalmer/projects/>
[17] Miller G.A., "WordNet - About Us," WordNet, Princeton University, 2009. <http://wordnet.princeton.edu/>
[18] Aronson A.R., "Effective Mapping of Biomedical Text to the UMLS Metathesaurus: The MetaMap Program," AMIA 2001 Proceedings. <http://www.nlm.nih.gov/research/umls/meta3.html>
[19] Holden J.M., Haytowitz D.B., Pehrsson P.R., Exler J. and Trainer D., "USDA's National Food and Nutrient Analysis Program," Progress Report, 17th International Congress of Nutrition, Vienna, Austria, August 27-31, 2001. <http://www.nal.usda.gov/fnic/foodcomp/Data/>
[20] Bou B., "Stanford parser grammatical relationship browser." <http://grammarscope.sourceforge.net/>
[21] Marcus M.P., Santorini B. and Marcinkiewicz M.A., "Building a large annotated corpus of English: The Penn Treebank." <http://www.cis.upenn.edu/~treebank/>
[22] Baldridge J., Bierner G. and Morton T., "OpenNLP." <http://opennlp.sourceforge.net/>
[23] McEntyre J. and Ostell J., "The NCBI Handbook," National Center for Biotechnology Information, 2002. <http://www.ncbi.nlm.nih.gov/pubmed/>
[24] Kim J.J., Zhang Z., Park J.C., et al., "BioContrasts: extracting and exploiting protein-protein contrastive relations from biomedical literature," Bioinformatics 2006, 22:597-605. <http://bioinformatics.oxfordjournals.org/cgi/reprint/22/5/597.pdf>
[25] Klein A., He X., Roch M., Mallett A., Duska L., Supko J.G. and Seiden M.V., "Prolonged stabilization of platinum-resistant ovarian cancer in a single patient consuming a fermented soy therapy," Gynecol. Oncol., 100(1):205-209, Jan. 2006; Epub Sep. 19, 2005.
[26] Mukherjea S. and Sahay S., "Discovering Biomedical Relations Utilizing the World-Wide Web," Pacific Symposium on Biocomputing, 11:164-175, 2006.
[27] Mustafa J. and Seki K., "Discovering Implicit Associations between Genes and Hereditary Diseases," Pacific Symposium on Biocomputing, 12:316-327, 2007.
[28] Nedellec C., "Learning language in logic - genic interaction extraction challenge," in Proceedings of the ICML05 Workshop: Learning Language in Logic (LLL05), 2005.
[29] Palakal M., Mukhopadhyay S. and Stephens M., "Identification of Biological Relationships from Text Documents," Springer US, 1571-0270, pp. 449-489, 10.1007/b13595, 2005.
[30] Popowich F., "Using Text Mining and Natural Language Processing for Health Care Claims Processing," ACM SIGKDD Explorations Newsletter, vol. 7, issue 1, June 2005.
[31] Rinaldi F., et al., "Mining of relations between proteins over biomedical scientific literature using a deep-linguistic approach," Artif. Intell. Med. 2007, 39(2):127-136.
[32] Rusu D., Dali L., Fortuna B., Grobelnik M. and Mladenić D., "Triplet Extraction from Sentences," in Proceedings of the 10th International Multiconference Information Society (IS 2007), Ljubljana, vol. A, pp. 218-222, 2007.
[33] Sahay S., Mukherjea S., Agichtein E., Garcia E.V., Navathe S.B. and Ram A., "Discovering Semantic Biomedical Relations Utilizing the Web," ACM Trans. Knowl. Discov. Data, 2(1), Article 3, 15 pages, March 2008. DOI = 10.1145/1342320.1342323
[34] Shi L. and Mihalcea R., "Open Text Parsing Using FrameNet and WordNet," in Proceedings of HLT-NAACL 2004: Demonstration Papers, pp. 247-250, Boston, Massachusetts, USA, May 2-7, 2004, Association for Computational Linguistics.
[35] Skusa A., Rüegg A. and Köhler J., "Extraction of biological interaction networks from scientific literature," Briefings in Bioinformatics, 6(3):263-276, 2005.
[36] Srinivasan P., "Generating hypotheses from MEDLINE," Journal of the American Society for Information Science, vol. 55, pp. 369-413, 2004.
[37] Subramaniam L.V., Mukherjea S., Kankar P., Srivastava B., Batra V.S., Kamesam P.V. and Kothari R., "Information extraction from biomedical literature: Methodology, evaluation and an application," in Proceedings of the ACM CIKM International Conference on Information and Knowledge Management (CIKM'03), ACM Press, pp. 410-417.
[38] Swanson D.R., "Fish oil, Raynaud's syndrome, and undiscovered public knowledge," Perspect. Biol. Med., vol. 30, pp. 7-18, 1986.
[39] Tanabe L. and Wilbur W.J., "Tagging gene and protein names in biomedical text," Bioinformatics, vol. 18, no. 8, pp. 1124-1132, Oxford University Press, 2002.
[40] Tsai T.H., Chou W.C., Lin Y.C., et al., "BIOSMILE: Adapting semantic role labeling for biomedical verbs: an exponential model coupled with automatically generated template features," in BioNLP, 2006.
[41] Yang H., Sharma A., Swaminathan R. and Ketkar V., "On building a quantitative food-disease-gene network," 2nd International Conference on Bioinformatics and Computational Biology, March 2010, in press.
[42] Zhang X., Shu X.O., Li H., Yang G., Li Q., Gao Y.T. and Zheng W., "Prospective cohort study of soy food consumption and risk of bone fracture among postmenopausal women," Arch. Intern. Med., 165(16):1890-1895, September 2005.
[43] Zweigenbaum P., Demner-Fushman D., Yu H. and Cohen K.B., "Frontiers of biomedical text mining: current progress," Briefings in Bioinformatics, vol. 8, no. 5, pp. 358-375, doi:10.1093/bib/bbm045, 2007.