Genomic Variation associated with Malignancy
[Diagram] Molecular entity types (genomic information): Gene, Variation. Phenotypic entity types (phenomic information): Malignancy Types, Site, Histology, Clinical Stage, Differentiation Status, Developmental State, Heredity Status.

Flow Chart for Manual Annotation Process
[Flow chart] Elements: Biomedical Literature, Entity Definitions, Annotators (Experts), Manually Annotated Texts, Annotation Ambiguity, Machine-learning Algorithm, Auto-Annotated Texts.

Defining biomedical entities
• Data Gathering
  - "A point mutation was found at codon 12 (G→A)." is tagged as a Variation.
• Data Classification
  - "point mutation": Variation.Type
  - "codon 12": Variation.Location
  - "G": Variation.InitialState
  - "A": Variation.AlteredState

Defining biomedical entities
• Conceptual boundaries
  - Sub-classification of entities
  - Levels of specificity
      Gene entity: Gene > Protein kinase (super family) > MAPK (gene family) > MAPK10
      Malignancy type entity: Cancer/Tumor > Carcinoma > Lung carcinoma > Squamous cell lung carcinoma
  - Conceptual overlaps between entities
      Symptom: Subjective or objective evidence of disease.
      Disease: A specific pathological process with a characteristic set of symptoms.
      Example: Arrhythmia vs. Long QT Syndrome
  - Domain-specific clarification
      Gene entity clarification: Regulation element -- promoters (e.g. TATA box)
• Syntactical boundaries
  - Text boundary issues (The K-ras gene)
  - Pronoun co-reference (this gene, it, they)
  - Structural overlap -- entity within entity
      Same entity type: MAP kinase kinase kinase
      Different entity type: Squamous cell lung carcinoma
  - Discontinuous mentions (N- and K-ras)

Semantic ambiguity challenges
• Ambiguity within an entity type
  - CAT: catalase; glycine-N-acyltransferase (GLYAT)
• Ambiguity between entity types
  - CAT: Gene entity; Organism
• Gene entity ambiguity
  - 3% of human genes share aliases
  - Huge ambiguity of genes between species (mouse and human)
  - Gene.general, Gene.gene/RNA, Gene.protein

Gene: Gene, RNA, Protein
Variation: Type, Location, Initial State, Altered State
Malignancy: Type, Site, Histology, Clinical Stage, Differentiation Status, Heredity Status, Developmental State
Physical Measurement
Cellular Process
Expressional Status
Environmental Factor
Clinical Treatment
Clinical Outcome
Research System
Research Methodology
Drug Effect
http://www.ldc.upenn.edu/mamandel/itre/annotators/onco/definitions.html

Manual Annotation

Corpus Release
• Jena University Language & Information Engineering Lab: http://www.julielab.de
• K Bretonnel Cohen and Lawrence Hunter, BMC Bioinformatics. 2006; 7(Suppl 3): S5.

Summary -- Entity Definition
• Developed iterative process for biomedical entity definition;
• Defined genomic and phenotypic entities with distinct conceptual and syntactical boundaries in genomic variation of malignancy;
• Constructed a manually annotated corpus with 1442 oncology-focused articles.

Named Entity Extractors
• "Mycn is amplified in neuroblastoma."
  - "Mycn": Gene
  - "amplified": Variation type
  - "neuroblastoma": Malignancy type

Automated Extractor Development
• Training and testing data
  - 1442 cancer-focused MEDLINE abstracts
  - 70% for training, 30% for testing
• Machine-learning algorithm
  - Conditional Random Field (CRF)
• Sets of Features
  - Example: "Lung cancer is the … of carcinoma deaths worldwide." ("Lung cancer" and "carcinoma" tagged MType)
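As a small illustration of how annotated mentions can become training data for a sequence tagger such as a CRF, the sketch below converts the entity spans in "Mycn is amplified in neuroblastoma." into one label per token. The BIO (begin/inside/outside) encoding, the hand-made tokenization, and the label strings such as `Malignancy.Type` are assumptions made for illustration; the slides do not say which encoding the extractors used.

```python
from typing import List, Tuple

# (first token index, end index exclusive, entity type); a hypothetical span format
Span = Tuple[int, int, str]


def bio_labels(tokens: List[str], spans: List[Span]) -> List[str]:
    """Return one BIO label per token for the given non-overlapping spans."""
    labels = ["O"] * len(tokens)  # tokens outside every mention stay "O"
    for start, end, entity_type in spans:
        labels[start] = f"B-{entity_type}"  # first token of the mention
        for i in range(start + 1, end):
            labels[i] = f"I-{entity_type}"  # remaining tokens of the mention
    return labels


# Example sentence from the slides, tokenized by hand.
tokens = ["Mycn", "is", "amplified", "in", "neuroblastoma", "."]
spans = [(0, 1, "Gene"), (2, 3, "Variation.Type"), (4, 5, "Malignancy.Type")]
labels = bio_labels(tokens, spans)

# A multi-token mention gets one B- label followed by I- labels.
long_tokens = ["Squamous", "cell", "lung", "carcinoma"]
long_labels = bio_labels(long_tokens, [(0, 4, "Malignancy.Type")])

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

A flat encoding like this cannot represent the nested or discontinuous mentions discussed under syntactical boundaries; those cases would need a richer scheme.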
Automated Extractor Development
• Training and testing data
  - 1442 cancer-focused MEDLINE abstracts
  - 70% for training, 30% for testing
• Machine-learning algorithm
  - Conditional Random Fields (CRFs)
• Sets of Features
  - Orthographic features (capitalization, punctuation, digit/number/alphanumeric/symbol);
  - Character n-grams (N=2,3,4);
  - Prefix/suffix (*oma);
  - Offset conjunction (3 consecutive word tokens);
  - Domain-specific lexicon (NCI neoplasm list).

Extractor Performance

Entity                      Precision   Recall
Gene                        0.864       0.787
Variation
  Type                      0.8556      0.7990
  Location                  0.8695      0.7722
  State-Initial             0.8430      0.8286
  State-Sub                 0.8035      0.7809
  Overall                   0.8541      0.7870
Malignancy type             0.8456      0.8218
Clinical Stage              0.8493      0.6492
Site                        0.8005      0.6555
Histology                   0.8310      0.7774
Developmental State         0.8438      0.7500

• Precision: (true positives)/(true positives + false positives)
• Recall: (true positives)/(true positives + false negatives)

Legend: Normal text / Malignancies
PMID: 15316311
Morphologic and molecular characterization of renal cell carcinoma in children and young adults.
A new WHO classification of renal cell carcinoma has been introduced in 2004. This classification includes the recently described renal cell carcinomas with the ASPL-TFE3 gene fusion and carcinomas with a PRCC-TFE3 gene fusion. Collectively, these tumors have been termed Xp11.2 or TFE3 translocation carcinomas, which primarily occur in children and young adults. To further study the characteristics of renal cell carcinoma in young patients and to determine their genetic background, 41 renal cell carcinomas of patients younger than 22 years were morphologically and genetically characterized. Loss of heterozygosity analysis of the von Hippel-Lindau gene region and screening for VHL gene mutations by direct sequencing were performed in 20 tumors. TFE3 protein overexpression, which correlates with the presence of a TFE3 gene fusion, was assessed by immunohistochemistry.
Applying the new WHO classification for renal cell carcinoma, there were 6 clear cell (15%), 9 papillary (22%), 2 chromophobe, and 2 collecting duct carcinomas. Eight carcinomas showed translocation carcinoma morphology (20%). One carcinoma occurred 4 years after a neuroblastoma. Thirteen tumors could not be assigned to types specified by the new WHO classification: 10 were grouped as unclassified (24%), including a unique renal cell carcinoma with prominently vacuolated cytoplasm and WT1 expression. Three carcinomas occurred in combination with nephroblastoma. Molecular analysis revealed deletions at 3p25-26 in one translocation carcinoma, one chromophobe renal cell carcinoma, and one papillary renal cell carcinoma. There were no VHL mutations. Nuclear TFE3 overexpression was detected in 6 renal cell carcinomas, all of which showed areas with voluminous cytoplasm and foci of papillary architecture, consistent with a translocation carcinoma phenotype. The large proportion of TFE3 "translocation" carcinomas and "unclassified" carcinomas in the first two decades of life demonstrates that renal cell carcinomas in young patients contain genetically and phenotypically distinct tumors with further potential for novel renal cell carcinoma subtypes. The far lower frequency of clear cell carcinomas and VHL alterations compared with adults suggests that renal cell carcinomas in young patients have a unique genetic background.

CRF-based Extractor vs. Pattern Matcher
• The testing corpus
  - 39 manually annotated MEDLINE abstracts selected
  - 202 malignancy type mentions identified
• The pattern matching system
  - 5,555 malignancy types extracted from NCI neoplasm ontology
  - Case-insensitive exact string matching applied
  - 85 malignancy type mentions (42.1%) recognized correctly
• The malignancy type extractor
  - 190 malignancy type mentions (94.1%) recognized correctly
  - Included all the baseline-identified mentions

The Types of Mentions NOT Identified by Pattern Matching

Mention Types                      Mention Examples              NCI List
Acronyms                           NB                            Neuroblastoma
Lexical variants (plural forms)    Renal cell carcinomas         Renal cell carcinoma
Polymorphic expressions            Lung cancer (tumor/tumour)    Lung neoplasm
Higher levels of specificity       Solid tumor                   <More specific tumor>
Tumor names with modifiers         Translocation carcinoma       Carcinoma

Normalization
• Synonymous mentions map to one unique identifier:
  - abdominal neoplasm
  - abdomen neoplasm
  - Abdominal tumour
  - Abdominal neoplasm NOS
  - Abdominal tumor
  - Abdominal Neoplasms
  - Abdominal Neoplasm
  - Neoplasm, Abdominal
  - Neoplasms, Abdominal
  - Neoplasm of abdomen
  - Tumour of abdomen
  - Tumor of abdomen
  - ABDOMEN TUMOR
• UMLS metathesaurus Concept Unique Identifier (CUI): C0000735
• 19,397 CUIs with 92,414 synonyms

Normalization – Computational Procedures
• Rule-based algorithm
• Applied to both entity mentions and vocabulary terms (UMLS metathesaurus)
  - Case insensitivity (carcinoma/Carcinoma)
  - Space/punctuation removal (lung-cancer/lungcancer)
  - Stemming (neuroblastoma/neuroblastomas)
• Applied to mentions only
  - First/last character removal (additional space/punctuation)
  - First/last word removal (translocation lung carcinoma)
• Evaluate the accuracy and the priority of the rules
  - 1,000 randomly selected entity mentions
  - Choose the best-performing rule combinations and sequences

MEDLINE Data Processing
• Tagging MEDLINE pre-2006 abstracts
  - 15,433,668 MEDLINE abstracts
  - 9,153,340 redundant and 580,002 distinct malignancy type mentions
  - ~60% of extracted mentions matched to UMLS CUIs
  - 1,642 CPU-hours (2.44 days on a 28-CPU cluster)
• Infrastructure construction (PostgreSQL database)

Gene-Malignancy-Evidence Matrix
• 21,493,687 normalized gene symbols (16,875 unique)
• 5,398,954 normalized malignancy types (4,166 CUIs)
• 3,100,773 distinct Gene-Malignancy-Evidence relations
[Table excerpt] Columns: Gene, Malignancy, Evidence.
  Gene: A1BG, ABCC1, B3GAT1, ERVK6, NFKB1, VIM, ……
  Malignancy: Adenocarcinoma, Lung Carcinoma, Breast Neoplasm, Stage IV Melanoma of the Skin, Colon Carcinoma, Gastrointestinal Stromal Tumor, ……
  Evidence: 1634938, 2292657, 3566173, ……; 11156254, 11159731, 11172691, ……; 6870377, 9129046, 9701020, ……; 9056412, 9620301, 9640365, ……; 12842827, 12901803, 12934082, ……; 12375611, 12657940, 12673425, ……

[Chart] Gene-Malignancy Relations Ranked by Frequency. Labeled relations: TP53-Carcinoma, ESR1-Breast Carcinoma, ESR1-Breast Neoplasm. Axis range shown: 6500 to 6850.

Summary -- Extractor Development and Application
• Developed well-performing automated entity extractors across genomic and phenotypic domains;
• Constructed rule-based computational procedure for normalization;
• Applied the extractors and normalizers to all MEDLINE abstracts;
• Imported the extracted information into a relational database.

Text Mining Applications -- Hypothesizing NB Candidate Genes
• Two distinct subtypes of neuroblastoma

                       NB Subtype A                      NB Subtype B
Developmental State    Younger age                       Older age
Biology                Differentiation                   Proliferation
Clinical Stage         Lower Stage                       Higher Stage
Clinical Outcome       Favorable                         Unfavorable
Trk Expression         High level expression of NTRK1    High level expression of NTRK2

• Distinct clinical behaviors (favorable vs. unfavorable)
• NGF/NTRK1 (TrkA) vs. BDNF/NTRK2 (TrkB) signaling pathways

                Trk Signaling   Angiogenesis   Differentiation   Drug Resistance   Tumorigenicity
NB Subtype A    NTRK1/NGF       Inhibits       Yes               Inhibits          Inhibits
NB Subtype B    NTRK2/BDNF      Promotes       No                Promotes          Promotes

• Determine the early response genes differentiating the two pathways
• More precise prognosis and clinical intervention

Text Mining Applications -- Hypothesizing NB Candidate Genes
[Diagram] NTRK1: SH-SY5Y, NGF. NTRK2: SH-SY5Y, BDNF.
• RNA extraction at 0, 1.5 hrs, 4 hrs and 12 hrs
• Affymetrix U133A Expression Array (RMAexpress normalization, SAM test)
• 751 differentially expressed genes

Text Mining Applications -- Hypothesizing NB Candidate Genes
Microarray Expression Data Analysis
• Gene Set 1: NTRK1, NTRK2 (468 genes)
• Gene Set 2: NTRK2, NTRK1 (283 genes)
[Figure] Gene symbols shown: NALP1, RALY, CDC2L6, RASGRP2, KCNK3, RPS6KA1, SEC61A2, VGF, CACNA1C, TBX3, THRA, B4GALT5, NRXN2, GNB5, RAI2, FRS3.

Text Mining Applications -- Hypothesizing NB Candidate Genes
Differentially represented genes in biomedical literature
• NTRK1 vs. NTRK2 pathway differentially associated genes/proteins based on literature
• Preferential association determined by co-occurrence with either receptor 5 times or more over the other
• Assumption: the co-occurrence frequency reflects functional correlation

Text Mining Applications -- Hypothesizing NB Candidate Genes
NTRK1/NTRK2 Preferentially Associated Genes in Literature
• LitSet 1: NTRK1 Associated Genes: 514
• LitSet 2: NTRK2 Associated Genes: 157

Text Mining Applications -- Hypothesizing NB Candidate Genes
Microarray Expression Data Analysis and NTRK1/NTRK2 Associated Genes in Literature
[Venn diagrams] Gene Set 1 (468) and NTRK1 Associated Genes (514): 18 genes in common. Gene Set 2 (283) and NTRK2 Associated Genes (157): 4 genes in common.

Functional Pathway Analysis
• Determine gene enrichment score for six selected functional pathways (Ingenuity Pathway Analysis Tool Kit):
  - CD -- Cell Death;
  - CGP -- Cell Growth and Proliferation;
  - CCSI -- Cell-to-Cell Signaling and Interaction;
  - CM -- Cell Morphology;
  - NSDF -- Nervous System Development and Function;
  - CAO -- Cellular Assembly and Organization.

Functional Pathway Analysis

Group                       CD             CGP            CCSI           CM             NSDF          CAO
Overall Group (N=10,459)    1979, 18.9%    2251, 21.5%    1492, 14.3%    1068, 10.2%    897, 8.58%    755, 7.22%
Array Group (N=751)         153, 20.4%     154, 20.5%     57, 9.98%      85, 11.3%      108, 19.6%    103, 13.7%
Text Group (N=550)          309, 56.2%     304, 55.3%     186, 33.8%     219, 39.8%     148, 26.9%    115, 20.9%
Overlap Group (N=22)        12, 54.5%      3, 13.6%       7, 31.8%       7, 31.8%       9, 40.9%      11, 50%
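The counts in the pathway table above can be turned into an enrichment p-value with a one-sided hypergeometric test. The sketch below is a minimal pure-Python version, assuming the Overall Group serves as the background population and that the upper tail P(X >= k) is the quantity of interest. The slides do not spell out how their test was parameterised or corrected, so values from this sketch need not reproduce the reported p-values. The function name `hypergeom_upper_tail` is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) for n genes drawn without replacement from N genes,
    K of which are annotated to the pathway."""
    first = max(k, 0)
    last = min(n, K)  # cannot observe more hits than draws or annotated genes
    # Exact integer arithmetic; math.comb returns 0 for infeasible terms.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(first, last + 1))
    return favourable / comb(N, n)


# Cell Death (CD) column: 12 of the 22 Overlap Group genes are annotated,
# against 1979 of 10,459 genes in the Overall Group (assumed background).
p_cd_overlap = hypergeom_upper_tail(k=12, N=10459, K=1979, n=22)
print(f"Overlap Group, CD: p = {p_cd_overlap:.3g}")
```

Exact integer binomials avoid the rounding problems of multiplying many small probabilities. A multiple-testing correction across the six pathways would still have to be applied afterwards.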
Hypergeometric Test P-values

                 CD        CGP       CCSI      CM        NSDF      CAO
Array Group      0.152     0.746     0.999     0.146     <0.001    <0.001
Text Group       0.0166    0.0216    0.0227    0.0109    <0.001    <0.001
Overlap Group    <0.001    0.728     0.009     0.001     <0.001    <0.001

Hypergeometric Test between Array and Overlap Groups

                 CD        CGP       CCSI      CM        NSDF      CAO
Overlap Group    <0.001    0.728     0.00940   0.0124    <0.001    0.0117

Multiple-test corrected P-values (Bonferroni step-down)

RT-PCR Experimental Validation
11 out of 22 genes selected for RT-PCR validation:

Symbol     Description
CAMK4      calcium/calmodulin-dependent protein kinase IV
VSNL1      visinin-like 1
TBC1D8     TBC1 domain family, member 8 (with GRAM domain)
RPS6KA1    ribosomal protein S6 kinase, 90kDa, polypeptide 1
EFNB3      ephrin-B3
B3GAT1     beta-1,3-glucuronyltransferase 1 (glucuronosyltransferase P)
GNAS       GNAS complex locus
NEFH       neurofilament, heavy polypeptide 200kDa
INA        internexin neuronal intermediate filament protein, alpha
NEFL       neurofilament, light polypeptide 68kDa
TYRO3      TYRO3 protein tyrosine kinase

RT-PCR Experimental Validation
[Chart] EFNB3: TrkA and TrkB series at 0hr, 1.5hr, 4hr and 12hr; vertical axis 0 to 2.5.

EFNB3 Discussion
• EFNB3 (ephrin-B3) belongs to a family of ligands that bind to Eph family receptor tyrosine kinases
• Implicated in axon guidance and vertebrate nervous system development
• Exhibited growth-suppressive activity against NB cells in vitro
• Preferentially and significantly associated with low tumor stage and favorable clinical outcomes in neuroblastoma primary tumors

RT-PCR Experimental Validation
[Chart] TYRO3: TrkA and TrkB series at 0hr, 1.5hr, 4hr and 12hr; vertical axis 0 to 1.4.

TYRO3 Discussion
• Transmembrane receptor tyrosine kinase activated by GAS6
• GAS6 has been shown to promote human fetal oligodendrocyte survival without proliferation
• GAS6 may also contribute to cell adhesion and immune responses
• Further study of GAS6/TYRO3 signaling is needed

Summary -- NB Application
• Prioritized array-determined differentially expressed genes by integrating text mining results
• Pathway analysis showed that the literature-based method enriches for functionally relevant genes
• RT-PCR experiments further validated the inferential power of text mining

Conclusion
• Created a process for iteratively and precisely defining biomedical semantic types directly from literature
• Developed automated entity extractors across genomic and phenotypic domains in malignancy with satisfactory accuracy rates
• Applied this computational entity recognition and normalization process to all MEDLINE abstracts
• Integrated text mining results with neuroblastoma experimental data to hypothesize candidate genes differentiating neuroblastoma subtypes

Future Directions
• Increasing dimensions of the information matrix
• Context-based normalization algorithm
• Relation extraction with deeper semantic parsing

Acknowledgement
• Penn BioIE Team: Dr. Mark Liberman, Dr. Mark Mandel, Dr. Ryan McDonald, Dr. Fernando Pereira, Annotator team
• White Lab: Steve Carroll, Hawren Fang, Kevin Murphy
• Brodeur Lab: Dr. Garrett Brodeur, Ms. Ruth Ho, Dr. Jane Minturn
• CHOP NAP Core: Dr. Eric Rappaport
• CHOP Bioinformatics Core: Dr. Xiaowu Gai, Dr. Jim Zhang