Extracting Semantic Predication from Medline Citations for Pharmacogenomics

C.B. Ahlers (1), M. Fiszman (2), D.D. Fushman (1), F.M. Lang (1) and T.C. Rindflesch (1)
(1) National Center for Biomedical Communications, National Library of Medicine
(2) University of Tennessee, USA
(PSB 2007 12:209-220)

Abstract
This paper describes an NLP system (Enhanced SemRep) that identifies core assertions on pharmacogenomics in Medline.
The system was developed by adapting an existing system and depends on the UMLS.
Preliminary evaluation: 55% recall and 73% precision.

1. Introduction
Core research in pharmacogenomics investigates the interaction of genes and proteins with therapeutic substances, e.g. in the treatment of cancer (oncology).
Current NLP for pharmacogenomics concentrates on co-occurrence information without specifying exact relations.
Enhanced SemRep complements that approach by representing assertions in text as semantic predications.

Example
"These findings therefore demonstrate that dexamethasone is a potent inducer of multidrug resistance-associated protein expression in rat hepatocytes through a mechanism that seems not to involve the classical glucocorticoid receptor pathway."
1. Dexamethasone STIMULATES Multidrug Resistance-Associated Proteins
2. Dexamethasone NEG_INTERACTS_WITH Glucocorticoid receptor
3. Multidrug Resistance-Associated Proteins PART_OF Rats
4. Hepatocytes PART_OF Rats

Based on two existing systems
SemRep: extracts semantic predications from clinical text.
SemGen: developed from SemRep to identify etiologic relations between genetic phenomena and diseases.
Relations
The relations covered involve genes, drugs, diseases, and population groups.
At the gene level, more specific genetic phenomena (e.g. mutations, single nucleotide polymorphisms, and haplotype information) are not addressed.

2.
Background
NLP for Biomedicine
The Unified Medical Language System
SemRep and SemGen

2.1 NLP for Biomedicine
Co-occurrence of entities in text (gene-disease relations, Yen et al., 2006; drug-gene, Rindflesch et al., 2000).
Machine learning techniques (gene-disease relations, Chun et al., 2006; drug-gene, Chang et al., 2004).
Syntactic templates and shallow parsing (protein interactions, Blaschke et al., 1999).
Enhanced SemRep addresses a wide range of syntactic structures and specific semantic relations pertinent to pharmacogenomics.
Example relations: STIMULATES, DISRUPTS, CAUSES

2.2 UMLS
Metathesaurus (more than 10^6 concepts)
  Concept: fever; Synonyms: pyrexia, febrile, hyperthermia; Semantic Type: ‘Finding’
Relations between semantic types represent the allowable relationships between concepts:
  ‘Gene or Genome’ PART_OF ‘Cell’
  ‘Pharmacologic Substance’ INTERACTS_WITH ‘Enzyme’
  ‘Disease or Syndrome’ CO-OCCURS_WITH ‘Neoplastic Process’

2.3 SemRep and SemGen
SemRep: a rule-based symbolic NLP system.
Example: "Phenytoin induced gingival hyperplasia"
  [[head(noun(phenytoin)), metaconc(‘Phenytoin’:[orch,phsu]))], [verb(induced)], [head(noun([‘gingival hyperplasia’)), metaconc(‘Gingival Hyperplasia’:[dsyn]))]]
  Semantic types: phenytoin is a ‘Pharmacological Substance’; gingival hyperplasia is a ‘Disease or Syndrome’.
  Semantic Network relation: ‘Pharmacological Substance’ CAUSES ‘Disease or Syndrome’
  Argument identification: Phenytoin CAUSES Gingival Hyperplasia
SemGen: identifies semantic predications on the genetic etiology of disease.
Gene and protein names are identified with ABGene.
Since the UMLS Semantic Network does not cover molecular genetics, additional semantic relations were created:
  Gene-disease interactions: ASSOCIATE_WITH, PREDISPOSE, and CAUSE
  Gene-gene interactions: INHIBIT, STIMULATE, and INTERACTS_WITH

3. Methods
The pharmacogenomics literature was scrutinized to identify relevant predications not identified by either SemRep or SemGen.
1000 Medline citations containing drug and gene names were retrieved.
400 sentences were selected, covering genetic (gene-disease), genomic (gene-gene), and pharmacogenomic (drug-gene, drug-genome) relations. Relations between genes and population groups, between diseases and population groups, and pharmacological relations (drug-disease, drug-pharmacological effect, drug-drug) were also scrutinized.
After processing these 400 sentences with SemRep, errors were analyzed and categorized by cause.
The majority of errors were due to:
  The Semantic Network
  Errors in argument identification due to "empty" heads
  Gene name identification
These led to extensive modifications for Enhanced SemRep.
Gene name identification was addressed by adding ABGene to the machinery.

3.1 Modification of Semantic Network for Enhanced SemRep
Grouping semantic types: five broader semantic groups (Substance, Anatomy, Living Being, Process, and Pathology) were defined to permit predications relevant to pharmacogenomics.
Substance: ‘Amino Acid, Peptide, or Protein’, ‘Antibiotic’, ‘Carbohydrate’, ...
Anatomy: ‘Anatomical Structure’, ‘Body Part, Organ, or Organ Component’, ‘Cell’, ‘Gene or Genome’, ‘Neoplastic Process’, ‘Tissue’, ...
Living Being: ‘Animal’, ‘Archaeon’, ‘Bacterium’, ‘Fungus’, ‘Human’, ‘Invertebrate’, ‘Mammal’, ‘Organism’, ‘Vertebrate’, ‘Virus’
Process: ‘Acquired Abnormality’, ‘Anatomical Abnormality’, ‘Cell Function’, ‘Cell or Molecular Dysfunction’, ‘Congenital Abnormality’, ‘Laboratory Test Result’, ...
Pathology: ‘Acquired Abnormality’, ‘Anatomical Abnormality’, ‘Cell or Molecular Dysfunction’, ‘Congenital Abnormality’, ‘Disease or Syndrome’, ‘Injury or Poisoning’, ‘Mental or Behavioral Disorder’, ...

Define predications: categories 1-6
1: Genetic Etiology
  {Substance} ASSOCIATED_WITH OR PREDISPOSES OR CAUSES {Pathology}
2: Substance Relations
  {Substance} INTERACTS_WITH OR INHIBITS OR STIMULATES {Substance}
3: Pharmacological Effects
  {Substance} AFFECTS OR DISRUPTS OR AUGMENTS {Anatomy OR Process}
4: Clinical Actions
  {Substance} ADMINISTERED_TO {Living Being}
  {Process} MANIFESTATION_OF {Process}
  {Substance} TREATS {Living Being OR Pathology}
5: Organism Characteristics
  {Anatomy OR Living Being} LOCATION_OF {Substance}
  {Anatomy} PART_OF {Anatomy OR Living Being}
  {Process} PROCESS_OF {Living Being}
6: Co-existence
  {Substance} CO-EXISTS_WITH {Substance}
  {Process} CO-EXISTS_WITH {Process}

3.2 Empty Heads
Example: "We saw differential activation of CYP2C9 variants by dapsone."
"Variant" maps to the semantic type ‘Qualitative Concept’, but we want "CYP2C9 variant" to be a member of the Substance group.
Several categories of terms are therefore enumerated as semantically empty heads, e.g.
allele, mutation, variant, levels, expression, ...
Words from these lists that have been labeled as heads are hidden, and the word to their left is relabeled as the head.

3.3 Evaluation
The test set consisted of 300 sentences randomly selected from the 36,577 sentences containing drug and gene co-occurrences available on the Web site bionlp.stanford.edu/genedrug.
These sentences were annotated by three physicians (CBA, DD-F, MF).
They did not mark up all assertions in the sentences, only those representing a predication defined in Enhanced SemRep.
A total of 850 predications were assigned by the annotators.

4. Results
Category                                                           Recall  Precision
Genetic Etiology (associated_with, cause, predispose)               74%     74%
Substance Relations (interact_with, inhibit, stimulate)             50%     73%
Pharmacological Effects (affect, disrupt, augment)                  41%     68%
Clinical Actions (administered_to, manifestation_of, treat)         54%     84%
Organism Characteristics (location_of, part_of, process_of)         63%     71%
Total                                                               55%     73%

5.1 Discussion: Error Analysis
Word sense ambiguity (28%)
"Ticlopidine inhibition of phenytoin metabolism mediated by potent inhibition of CYP2C19."
"Inhibition" was wrongly mapped to ‘Psychological Inhibition’, giving the incorrect predication: CYP2C19 AFFECTS Psychological Inhibition.
Processing coordinate structures (35%)
"The cytotoxic activities of mercaptopurine and fluorouracil are regulated by thiopurine methyltransferase (TPMT) and dihydropyrimidine dehydrogenase (DPD), respectively."
Fluorouracil INTERACTS_WITH DPD gene. (○)
Mercaptopurine INTERACTS_WITH thiopurine methyltransferase. (X)

5.2 Processing Medline Citations on CYP2D6
2849 Medline citations contain variant forms of CYP2D6.
5219 predications containing CYP2D6 as an argument were analyzed according to two predication categories (Genetic Etiology and Substance Relations).
The predications were compared with the relations listed for this gene on the PharmGKB Web site (PharmacoGenetic Knowledge Base).

Genetic Etiology
267 total predications represented CYP2D6 as an etiologic agent for a disease.
Most frequent diseases: Parkinson’s disease (35), carcinoma of the lung (21), tardive dyskinesia (15), Alzheimer’s disease (9), bladder carcinoma (8).
169 true positives and 4 false positives; two were found not to contain the disease name in the referenced citation.
Only carcinoma of the lung occurs in PharmGKB.

Substance Relations
1128 total predications involve CYP2D6 and a drug.
69 drugs occurred 3 or more times in those predications; 41 of these drugs were in PharmGKB and 28 were not. 68 were true positives.
Inhibit CYP2D6: quinidine (45), paroxetine (34), fluoxetine (27), fluvoxamine (8), sertraline (8).
Interact_with CYP2D6: bufuralol (27), antipsychotic agents (25), dextromethorphan (21), venlafaxine (19), debrisoquin (18).
Quinidine and sertraline are not in PharmGKB. Bufuralol is not in PharmGKB.
SemRep failed to capture: cocaine, levomepromazine, maprotiline, trazodone, and yohimbine.

6. Conclusion
This paper applies an existing NLP system in the pharmacogenomics domain.
The major changes for developing Enhanced SemRep from SemRep involved modifying the semantic space stipulated by the UMLS Semantic Network.
The outputs are semantic predications that represent assertions from Medline citations expressing a range of specific relations in pharmacogenomics.
The information can support advanced information management applications for pharmacogenomics research and clinical care.
In the future, the authors intend to adapt the summarization and visualization techniques developed for clinical text.
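
Illustration (not from the paper)
The predication templates of Section 3.1 and the empty-head rule of Section 3.2 can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not the Enhanced SemRep implementation: the names GROUPS, TEMPLATES, is_licensed and effective_head are hypothetical, the group memberships are partial because the slides truncate them, placing ‘Pharmacologic Substance’ in the Substance group is an assumption, and the plural handling is a crude stand-in for real morphology.

```python
from typing import Dict, List, Set, Tuple

# Semantic groups with a few member types quoted on the slides. The slide
# lists are truncated ("..."), so these sets are partial. Putting
# 'Pharmacologic Substance' in the Substance group is an assumption.
GROUPS: Dict[str, Set[str]] = {
    "Substance": {"Amino Acid, Peptide, or Protein", "Antibiotic",
                  "Carbohydrate", "Pharmacologic Substance"},
    "Anatomy": {"Anatomical Structure", "Cell", "Gene or Genome", "Tissue"},
    "Living Being": {"Animal", "Bacterium", "Human", "Mammal", "Virus"},
    "Process": {"Cell Function", "Cell or Molecular Dysfunction"},
    "Pathology": {"Disease or Syndrome", "Injury or Poisoning",
                  "Cell or Molecular Dysfunction"},
}

# (subject groups, predicates, object groups), transcribed from categories 1-6.
Template = Tuple[Set[str], Set[str], Set[str]]
TEMPLATES: List[Template] = [
    ({"Substance"}, {"ASSOCIATED_WITH", "PREDISPOSES", "CAUSES"}, {"Pathology"}),
    ({"Substance"}, {"INTERACTS_WITH", "INHIBITS", "STIMULATES"}, {"Substance"}),
    ({"Substance"}, {"AFFECTS", "DISRUPTS", "AUGMENTS"}, {"Anatomy", "Process"}),
    ({"Substance"}, {"ADMINISTERED_TO"}, {"Living Being"}),
    ({"Process"}, {"MANIFESTATION_OF"}, {"Process"}),
    ({"Substance"}, {"TREATS"}, {"Living Being", "Pathology"}),
    ({"Anatomy", "Living Being"}, {"LOCATION_OF"}, {"Substance"}),
    ({"Anatomy"}, {"PART_OF"}, {"Anatomy", "Living Being"}),
    ({"Process"}, {"PROCESS_OF"}, {"Living Being"}),
    ({"Substance"}, {"CO-EXISTS_WITH"}, {"Substance"}),
    ({"Process"}, {"CO-EXISTS_WITH"}, {"Process"}),
]

# Singular forms of the empty-head terms named on the slide.
EMPTY_HEADS: Set[str] = {"allele", "mutation", "variant", "level", "expression"}


def groups_of(semantic_type: str) -> Set[str]:
    """Return every semantic group that contains the given semantic type."""
    return {g for g, members in GROUPS.items() if semantic_type in members}


def is_licensed(subj_type: str, predicate: str, obj_type: str) -> bool:
    """True if some template allows this predicate between the two types."""
    subj_groups, obj_groups = groups_of(subj_type), groups_of(obj_type)
    return any(
        predicate in preds and subj_groups & subj_ok and obj_groups & obj_ok
        for subj_ok, preds, obj_ok in TEMPLATES
    )


def effective_head(noun_phrase: List[str]) -> str:
    """Head of a noun phrase (last token), skipping one empty head if present."""
    head = noun_phrase[-1]
    word = head.lower()
    singular = word[:-1] if word.endswith("s") else word  # crude plural strip
    if singular in EMPTY_HEADS and len(noun_phrase) > 1:
        return noun_phrase[-2]  # the word to the left becomes the head
    return head


if __name__ == "__main__":
    print(is_licensed("Pharmacologic Substance", "STIMULATES",
                      "Amino Acid, Peptide, or Protein"))
    print(effective_head(["CYP2C9", "variants"]))
```

In this sketch a predication is accepted only when the subject and object types fall into groups that a template pairs with the predicate, which mirrors how grouping semantic types widens what the Semantic Network permits. The empty-head step runs before type lookup, so "CYP2C9 variants" is typed by "CYP2C9" rather than by "variants".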