The Ontrez project at NCBO
Nigam Shah

Public data repositories
• Around 1100 databases in the NAR's 2008 database issue
• High-throughput gene expression data in repositories such as GEO, SMD, ArrayExpress
• Clinical trial repositories such as caBIG, TrialBank, clinicaltrials.gov
• Guideline repositories such as www.guideline.gov
• Image repositories such as BIRN
• Observational studies such as Framingham, NHANES, AMCIS

Database annotation
• Ontology-based annotation is not as widespread as desired
• Most annotation is still free text
• Possible reasons:
  1. Lack of a one-stop shop for bio-ontologies
  2. Lack of tools to annotate experimental data
     – Manual: Phenote
     – Automatic: ?
  3. Lack of a sustainable mechanism to create ontology-based annotations

Different kinds of annotations
[Figure labels]
• ELMO1 associated_with actin / cytoskeleton organization and biogenesis
• ELMO1 expression is altered by mechanical stimuli; other experiments
• Metadata annotation (dataset "Chronic Bladder Overdistension"): Expression profiling of cultured bladder smooth muscle cells subjected to repetitive mechanical stimulation for 4 hours. Chronic overdistension results in bladder wall thickening, associated with loss of muscle contractility. Results identify genes whose expression is altered by mechanical stimuli.

Annotations as assertions
• Annotation = an assertion declaring a relationship between a biomedical entity and a type in an ontology
• e.g. p53 <associated_with> cell death
• Annotations tell us what biologists believe to be true (in particular or in general)
• Most annotations are based on particular observations and are generalized during interpretation by a biologist or curator
• Semantics of annotations are not always declared a priori (e.g. associated_with, involves)

Annotations as 'Meta-data'
• Metadata: the text description accompanying a dataset in a database
• Metadata annotations should be machine processed (and indexed using ontologies) because:
  – Their volume is orders of magnitude larger than that of the summary results
  – These annotations do not state any biological fact
  – Hence they do not need a curator to create them
• These annotations are to be used to LOCATE datasets accurately as soon as they are available in a public repository
• We cannot afford to have a curation bottleneck

High level goal
• Process the metadata annotations to automatically tag the 'elements' in public repositories with as many ontology terms as possible
• For example, in the case of GEO dataset 906:
  – Expression profiling of cultured bladder smooth muscle cells subjected to repetitive mechanical stimulation for 4 hours. Chronic overdistension results in bladder wall thickening, associated with loss of muscle contractility. Results identify genes whose expression is altered by mechanical stimuli.
• Gets tagged with:
  – Expression, Expression of bladder, bladder, smooth, bladder muscle, muscle, smooth muscle, cells, mechanical, mechanical stimulation, stimulation, Chronic, results, bladder overdistension, associated, associated with, with, loss, genes, altered

Tagging [annotating] with ontology terms

Querying the annotation index

WHAT NEW SCIENCE DO WE ENABLE?
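
As a rough illustration of the tagging and index-querying steps on the preceding slides, here is a minimal sketch of dictionary-based concept recognition followed by a lookup in the resulting annotation index. It is not the Ontrez implementation: the lexicon, the concept IDs (`EX:…`), the element IDs, the second element's text, and the function names `tag`, `build_index` and `query` are all hypothetical and chosen only for this example.

```python
import re
from collections import defaultdict

# Hypothetical mini-lexicon: term -> concept ID. The IDs are invented placeholders,
# not identifiers from any real ontology.
LEXICON = {
    "bladder": "EX:0001",
    "smooth muscle": "EX:0002",
    "muscle": "EX:0003",
    "mechanical stimulation": "EX:0004",
    "genes": "EX:0005",
}

# Metadata text of the GEO example quoted on the slide.
DESCRIPTION = (
    "Expression profiling of cultured bladder smooth muscle cells subjected to "
    "repetitive mechanical stimulation for 4 hours. Chronic overdistension results "
    "in bladder wall thickening, associated with loss of muscle contractility. "
    "Results identify genes whose expression is altered by mechanical stimuli."
)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tag(text, lexicon, max_len=3):
    """Return every lexicon term that occurs in the text as an n-gram (n <= max_len)."""
    tokens = tokenize(text)
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                found.add(phrase)
    return found


def build_index(elements, lexicon):
    """Build an annotation index: concept ID -> set of element IDs tagged with it."""
    index = defaultdict(set)
    for element_id, text in elements.items():
        for term in tag(text, lexicon):
            index[lexicon[term]].add(element_id)
    return index


def query(index, concept_ids):
    """Return the elements annotated with all of the given concepts."""
    sets = [index.get(concept_id, set()) for concept_id in concept_ids]
    return set.intersection(*sets) if sets else set()


# Toy repository: the first text is from the slide, the second is invented.
ELEMENTS = {
    "GEO-906": DESCRIPTION,
    "toy-2": "Genes involved in smooth muscle development",
}
INDEX = build_index(ELEMENTS, LEXICON)
```

Overlapping matches are kept on purpose ("muscle" and "smooth muscle" both fire), which mirrors the overlapping tags listed for the GEO example. A call such as `query(INDEX, ["EX:0001"])` then returns the elements tagged with the placeholder "bladder" concept.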
New Science enabled
• Nature study on image features and gene expression
• Correlation between protein and gene expression for cancer classification
• Correlating gene expression and drug effect information for predicting drug efficacy
• Training and testing image processing algorithms

Decoding global gene expression programs in liver cancer by noninvasive imaging
Eran Segal, Claude B Sirlin, Clara Ooi, Adam S Adler, Jeremy Gollub, Xin Chen, Bryan K Chan, George R Matcuk, Christopher T Barry, Howard Y Chang & Michael D Kuo
Nature Biotechnology 25, 675-680 (2007). Published online: 21 May 2007

Correlation of protein and gene expression for the stratification of breast cancer patients

There are 20 other diseases for which this is possible!
Disease | GEO samples | TMAD samples
Acute myeloid leukemia | 366 | 3
Malignant melanoma | 47 | 43
B-cell lymphoma | 133 | 27
Prostate cancer | 47 | 15
Renal carcinoma | 34 | 185
Carcinoma squamous | 105 | 175
Multiple myeloma | 225 | 169
Clear cell carcinoma | 34 | 63
Renal cell carcinoma | 34 | 9
Breast carcinoma | 3 | 1277
Hepatocellular carcinoma | 80 | 163
Carcinoma lung | 91 | 66
Cutaneous malignant melanoma | 38 | 41
T-cell lymphoma | 29 | 31
Lymphoblastic lymphoma | 29 | 30
Uterine fibroid | 10 | 19
Medulloblastoma | 46 | 9
Clear cell sarcoma | 35 | 8
Leiomyosarcoma | 24 | 5
Mesothelioma | 54 | 5
Kaposi's sarcoma | 4 | 3
Cardiomyopathy | 14 | 2
Dilated cardiomyopathy | 14 | 2

TMAD incorporates the NCI Thesaurus ontology for searching tissues in the cancer domain. Image processing researchers can extract images and scores for training and testing classification algorithms.
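
The cross-resource comparison above (diseases that have samples in both GEO and TMAD) amounts to grouping annotated elements by a shared disease concept and keeping the concepts that appear in every resource of interest. The sketch below shows that grouping under stated assumptions: the annotation records and element IDs are invented toy data and do not reproduce the counts on the slide, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy annotation records: (resource, element ID, disease concept).
# All records are invented for illustration; they are not taken from GEO or TMAD.
ANNOTATIONS = [
    ("GEO", "geo-1", "breast carcinoma"),
    ("GEO", "geo-2", "breast carcinoma"),
    ("TMAD", "tmad-1", "breast carcinoma"),
    ("GEO", "geo-3", "mesothelioma"),
    ("GEO", "geo-4", "prostate cancer"),
    ("TMAD", "tmad-2", "prostate cancer"),
]


def group_by_disease(annotations):
    """Build disease -> resource -> set of element IDs."""
    index = defaultdict(lambda: defaultdict(set))
    for resource, element_id, disease in annotations:
        index[disease][resource].add(element_id)
    return index


def diseases_in_all(index, resources):
    """Return {disease: {resource: element count}} for diseases present in every resource."""
    shared = {}
    for disease, by_resource in index.items():
        # .get() avoids creating empty entries in the nested defaultdict.
        if all(by_resource.get(resource) for resource in resources):
            shared[disease] = {r: len(by_resource[r]) for r in resources}
    return shared


DISEASE_INDEX = group_by_disease(ANNOTATIONS)
SHARED = diseases_in_all(DISEASE_INDEX, ["GEO", "TMAD"])
```

With the toy records, `SHARED` keeps "breast carcinoma" and "prostate cancer" and drops "mesothelioma", which has no TMAD record in this invented data set.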
Current status of the prototype
Resource | Number of elements | Resource file size (Kb) | Number of direct annotations | Number of closure annotations | Total number of 'useful' annotations
(unlabeled) | 10164 | 13461 | 187686 | 681973 | 857459
(unlabeled) | 2751 | 2880 | 143134 | 484758 | 619133
ClinicalTrials.gov | 43918 | 8379 | 1206939 | 6792430 | 5217115
(unlabeled) | 546 | 163 | 16494 | 100984 | 116234
(unlabeled) | 1155 | 494 | 53082 | 290935 | 340915
TOTAL | 58534 | 25377 | 1607335 | 8351080 | 7150856
[Row labels detached from their rows in the source: PubMed, Gene Expression Omnibus, ARRS GoldMiner, ArrayExpress]

Ontrez: Target resources
[Diagram labels] Papers, Datasets, mRNA expression, Protein expression, Guidelines, GWAS, Clinical Trials, RCT reports, Trial description, Treatments, Drugs, Phenotype, text, images, Animal models, Alleles and Genotype, Genes, Variations
[Example counts, as extracted] Metastatic Melanoma 3330 7 76 237 1 1 314 1 2 47 0 0 Invasive Melanoma Melanoma in situ Spindle Cell Melanoma

Where can we go?
• Become a service for 'annotating' biomedical text
  – People send us text, we send back recognized concepts (maybe even relationships)
  – Given a set of concepts, we provide a similarity metric between them
  – Both of these services can be plugged into a variety of community and collaborative annotation tools
• Become 'the one-stop shop' for finding items across a wide variety of resources …
  – Integrate on the 'disease' dimension. Gene cards exist, disease cards don't
  – Focus on approx. 15 resources in the next year
  – PDB and PLoS are interested

Research questions - 1
Ontology | Genes/Proteins | Diseases | Drugs | body parts | developmental stages | Pathways | processes | genetic markers
SNOMEDCT | .. | X | .. | .. | .. | .. | .. | ..
Mouse anatomy and development | .. | .. | .. | X | X | .. | .. | ..
Zebrafish anatomy and development | .. | .. | .. | X | X | .. | .. | ..
[Rows whose cells are not aligned to labels in the source: RxNORM, INOH, NCIT, Gene Ontology (BP), FMA, Cell type Ontology, Mammalian Phenotype]
.. .. .. .. .. X X .. .. .. .. .. .. .. .. .. X .. .. .. .. .. .. .. .. .. .. .. .. .. X .. .. .. .. X .. .. .. .. .. .. .. .. .. .. .. .. .. .. X .. .. .. .. ..

Research questions - 2
Tool | Genes/Proteins | Diseases | Drugs | body parts | developmental stages | Pathways | processes | genetic markers
GATE | .. | .. | .. | .. | .. | .. | .. | ..
UMLS-Query | .. | .. | .. | .. | .. | .. | .. | ..
mgrep | .. | .. | .. | .. | .. | .. | .. | ..
MetaMap | .. | .. | .. | .. | .. | .. | .. | ..
UPenn (conditional random fields) | .. | .. | .. | .. | .. | .. | .. | ..
Language Modeling methods | .. | .. | .. | .. | .. | .. | .. | ..

Credits and collaborations
• Clement Jonquet
• Nipun Bhatia
• Manhong Dai
• Fan Meng
• Brian Athey
• Mark Musen