
[Title diagram labels: Medical Informatics; Genomics and Bioinformatics; Genes; Diseases; Anatomy; Physiology; Gene Annotation Databases; Novel relationships & Deeper insights]

Identification and Prioritization of Novel Disease Candidate Genes
Systems Biology Based Integrative Approaches

Bioinformatics to Systems Biology
November 16, 2007

Anil Jegga
Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center (CCHMC)
Department of Pediatrics, University of Cincinnati
Cincinnati, Ohio - 45229
[email protected]
http://anil.cchmc.org

Acknowledgements
• Jing Chen
• Eric Bardes
• Bruce Aronow

Support
• All the publicly available gene annotation resources, especially NCBI, MGI and UCSC
Cincinnati Children’s Hospital Medical Center
Computational Medical Center, Cincinnati
Mouse Models of Human Cancers Consortium
University of Cincinnati College of Medicine

Two Separate Worlds…..

Disease World (Medical Informatics)
• Patient Records
• Clinical Trials
• PubMed
• OMIM
• Clinical Synopsis
• Disease Database
  →Name
  →Synonyms
  →Related/Similar Diseases
  →Subtypes
  →Etiology
  →Predisposing Causes
  →Pathogenesis
  →Molecular Basis
  →Population Genetics
  →Clinical findings
  →System(s) involved
  →Lesions
  →Diagnosis
  →Prognosis
  →Treatment
  →Clinical Trials……

Bioinformatics & the “omes”
Genome, Regulome, Transcriptome, miRNAome, Proteome, Interactome, Metabolome, Variome, Pharmacogenome, Physiome, Pathome

382 “omes” so far……… and there is “UNKNOME” too (genes with no function known)
http://omics.org/index.php/Alphabetically_ordered_list_of_omics (as of November 15, 2007)

With Some Data Exchange… the Ultimate Goal…….
Disease World (Medical Informatics)
• Patient Records
• Clinical Trials
• PubMed
• OMIM
• Disease Database
  →Name
  →Synonyms
  →Related/Similar Diseases
  →Subtypes
  →Etiology
  →Predisposing Causes
  →Pathogenesis
  →Molecular Basis
  →Population Genetics
  →Clinical findings
  →System(s) involved
  →Lesions
  →Diagnosis
  →Prognosis
  →Treatment
  →Clinical Trials……

Bioinformatics
Genome, Regulome, Transcriptome, miRNAome, Proteome, Interactome, Metabolome, Physiome, Pathome, Variome, Pharmacogenome

Integrative Genomics; Biomedical Informatics

Personalized Medicine
►Decision Support System
►Course/Outcome Predictor
►Diagnostic Test Selector
►Clinical Trials Design
►Hypothesis Generator
►Novel Gene/Drug Targets…..

No Integrative Genomics is Complete without Ontologies
• Gene World: Gene Ontology (GO)
• Biomedical World: Unified Medical Language System (UMLS)

The 3 Gene Ontologies
• Molecular Function = elemental activity/task
  – the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
  – What a product ‘does’, precise activity
• Biological Process = biological goal or objective
  – broad biological goals, such as DNA repair or purine metabolism, that are accomplished by ordered assemblies of molecular functions
  – Biological objective, accomplished via one or more ordered assemblies of functions
• Cellular Component = location or complex
  – subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
  – ‘is located in’ (‘is a subcomponent of’)
http://www.geneontology.org

Example: Gene Product = hammer

Function (what)                  Process (why)
Drive a nail into wood           Carpentry
Drive stake into soil            Gardening
Smash a bug                      Pest Control
A performer’s juggling object    Entertainment

http://www.geneontology.org

Unified Medical Language System Knowledge Server (UMLSKS)
• The UMLS Metathesaurus contains information about biomedical concepts and terms from many controlled vocabularies and classifications used in patient records, administrative health data, bibliographic and full-text databases, and expert systems.
• The Semantic Network, through its semantic types, provides a consistent categorization of all concepts represented in the UMLS Metathesaurus. The links between the semantic types provide the structure for the Network and represent important relationships in the biomedical domain.
• The SPECIALIST Lexicon is an English language lexicon with many biomedical terms, containing syntactic, morphological, and orthographic information for each term or word.
http://umlsks.nlm.nih.gov/kss

Unified Medical Language System Metathesaurus
• Over 1 million biomedical concepts.
• About 5 million concept names from more than 100 controlled vocabularies and classifications (some in multiple languages) used in patient records, administrative health data, bibliographic and full-text databases and expert systems.
• The Metathesaurus is organized by concept or meaning. Alternate names for the same concept (synonyms, lexical variants, and translations) are linked together.
• Each Metathesaurus concept has attributes that help to define its meaning, e.g., the semantic type(s) or categories to which it belongs, its position in the hierarchical contexts from various source vocabularies, and, for many concepts, a definition.
• Customizable: Users can exclude vocabularies that are not relevant for specific purposes or not licensed for use in their institutions. MetamorphoSys, the multi-platform Java install and customization program distributed with the UMLS resources, helps users to generate pre-defined or custom subsets of the Metathesaurus.
• Uses:
  – linking between different clinical or biomedical vocabularies
  – information retrieval from databases with human-assigned subject index terms and from free-text information sources
  – linking patient records to related information in bibliographic, full-text, or factual databases
  – natural language processing and automated indexing research

Open biomedical ontologies
http://obo.sourceforge.net/

Mammalian Phenotype Ontology
1. The Mammalian Phenotype (MP) Ontology enables robust annotation of mammalian phenotypes in the context of mutations, quantitative trait loci and strains that are used as models of human biology and disease.
2. Each node in MPO represents a category of phenotypes, and each MP ontology term has a unique identifier, a definition, and synonyms, and is associated with gene variants causing these phenotypes in genetically engineered or mutagenesis experiments.
3. In the current version of MPO, there are >4250 terms associated with >4300 unique Entrez mouse genes (extrapolated to ~4300 orthologous human genes).
http://www.informatics.jax.org

Disease Gene Identification and Prioritization
Hypothesis: The majority of genes that impact or cause disease share membership in any of several functional relationships, OR functionally similar or related genes cause similar phenotypes.

Functional Similarity – Common/shared
• Gene Ontology term
• Pathway
• Phenotype
• Chromosomal location
• Expression
• Cis regulatory elements (transcription factor binding sites)
• miRNA regulators
• Interactions
• Other features…..

Background, Problems & Issues
1. Most of the common diseases are multifactorial and modified by genetically and mechanistically complex polygenic interactions and environmental factors.
2. High-throughput genome-wide studies, like linkage analysis and gene expression profiling, tend to be most useful for classification and characterization but do not provide sufficient information to identify or prioritize specific disease causal genes.
Background, Problems & Issues
3. Since multiple genes are associated with the same or similar disease phenotypes, it is reasonable to expect the underlying genes to be functionally related.
4. Such functional relatedness (common pathway, interaction, biological process, etc.) can be exploited to aid in the finding of novel disease genes. For example, genetically heterogeneous hereditary diseases such as Hermansky-Pudlak syndrome and Fanconi anaemia have been shown to be caused by mutations in different interacting proteins.

Background, Problems & Issues
Disease candidate gene studies
Example: dilated cardiomyopathy (Ellinor et al. J Am Coll Cardiol 2006)
• Linkage, gene expression
• Linkage analysis: locus region 10q25-26, ~9.5Mb with 68 genes
• Potential candidate genes (too many!)
• Fine mapping; hand/cherry picking; prioritization approach
• Hand/cherry picking: 7 candidates selected by experts (ADRB1 missing)
• Biological experiments (expensive, time consuming)

Background, Problems & Issues
Current candidate gene prioritization tools
Assumption: genes involved in the same complex disease will have similar functions
Example: dilated cardiomyopathy
• Approach without training
  – Input: multiple locus regions
  – Enriched functions
  – Prioritize genes based on the functions
• Approach with training
  – Training: known disease genes (10 from OMIM)
  – Test: 68 genes at 10q25-26
  – Score test genes based on their similarity to the training set

TOPPGene
Transcriptome Ontology Pathway based Prioritization of Genes
http://toppgene.cchmc.org
Chen J, Xu H, Aronow BJ, Jegga AG. 2007. Improved human disease candidate gene prioritization using mouse phenotype. BMC Bioinformatics 8(1): 392 [Epub ahead of print]
Applications:
1. For functional enrichment
2. For candidate gene prioritization

Why another gene prioritization method?
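The "approach with training" outlined above (find functions enriched in the known disease genes, then score test genes by how many of those functions they share) can be sketched in a few lines. This is a simplified stand-in and not the ToppGene scoring itself: the hypergeometric test, the 0.05 cutoff, the -log10(p) weighting, the function names and the toy gene and term labels are all assumptions made for illustration.

```python
from math import comb, log10


def hypergeom_tail(k, universe_size, carriers, sample_size):
    """P(X >= k) when sample_size genes are drawn from the universe and
    `carriers` genes in the universe are annotated with the term."""
    upper = min(carriers, sample_size)
    hits = sum(
        comb(carriers, i) * comb(universe_size - carriers, sample_size - i)
        for i in range(k, upper + 1)
    )
    return hits / comb(universe_size, sample_size)


def enriched_terms(training, annotations, alpha=0.05):
    """Return {term: p-value} for terms over-represented in the training genes."""
    universe = set(annotations)
    train = set(training) & universe
    candidate_terms = set()
    for gene in train:
        candidate_terms |= annotations[gene]
    enriched = {}
    for term in candidate_terms:
        carriers = {g for g in universe if term in annotations[g]}
        p = hypergeom_tail(len(carriers & train), len(universe),
                           len(carriers), len(train))
        if p <= alpha:
            enriched[term] = p
    return enriched


def prioritize(test_genes, training, annotations, alpha=0.05):
    """Rank test genes by the summed -log10(p) of enriched terms they carry."""
    enriched = enriched_terms(training, annotations, alpha)
    scores = {
        gene: sum(-log10(p) for term, p in enriched.items()
                  if term in annotations.get(gene, set()))
        for gene in test_genes
    }
    ranked = sorted(test_genes, key=lambda gene: scores[gene], reverse=True)
    return ranked, scores


# Hypothetical toy annotations: T* = training genes, C* = candidates, R* = others.
annotations = {
    "T1": {"termA"}, "T2": {"termA"}, "T3": {"termA"},
    "C1": {"termA", "termB"}, "C2": {"termB"},
    "R1": {"termB"}, "R2": {"termB"}, "R3": {"termB"},
    "R4": {"termB"}, "R5": {"termB"},
}
ranked, scores = prioritize(["C2", "C1"], ["T1", "T2", "T3"], annotations)
print(ranked, scores)
```

In the toy data only termA is enriched in the training set, so the candidate that carries it (C1) is ranked ahead of the one that does not. A real run would draw the annotations from sources such as GO, pathways or mouse phenotype, and would need multiple-testing correction.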
Comparison with other related approaches
Tools compared (year): POCUS (2003), Prospectr (2005), SUSPECTS (2006), ENDEAVOUR (2006), ToppGene (2007)
Feature types compared: Sequence Features, GO Annotations, Transcript Features, Protein Features, Literature, Phenotype Annotations, Training set
[Table: per-tool marks not recoverable]

Comparison with other related approaches: Feature Details
Tools compared (year): POCUS (2003), Prospectr (2005), SUSPECTS (2006), ENDEAVOUR (2006), ToppGene (2007)
[Table: column alignment lost; cell entries listed by row in source order]
• Sequence Features & Annotations: Gene length, Homology, Base composition; Gene length, Homology, Base composition; Blast, cis-element; Cytoband, cis-element, miRNA targets, GeneSets
• Gene Annotations: Gene Ontology; Gene Ontology; Gene Ontology; Gene Ontology; Mouse Phenotype
• Transcript Features: Gene expression; Gene expression; EST expression; Gene expression
• Protein Features: Protein domains; domains; domains, interactions, pathways; domains, interactions, pathways
• Literature: Keywords; Co-citation
• Training set: No; No; Yes; Yes; Yes

Mammalian Phenotype Ontology
We do not check whether the human ortholog of a mouse gene causes a similar phenotype. Rather, we assume that orthologous genes cause an “orthologous phenotype” and test the potential of the extrapolated mouse phenotype terms as a similarity measure to prioritize human disease candidate genes.

Mammalian Phenotype Ontology
77 human genes explicitly associated with “heart development” (GO:0007507)
Mouse orthologs cause various types of cardiac phenotype (MPO)

ToppGene – General Schema
[Figure: general schema]

TOPPGene - Data Sources
1. Gene Ontology: GO and NCBI Entrez Gene
2. Mouse Phenotype: MGI (used for the first time for human disease gene prioritization)
3. Pathways: KEGG, BioCarta, BioCyc, Reactome, GenMAPP, MSigDB
4. Domains: UniProt (Pfam, InterPro, etc.)
5. Interactions: NCBI Entrez Gene (BioGRID, Reactome, BIND, HPRD, etc.)
6. PubMed IDs: NCBI Entrez Gene
7. Expression: GEO
8. Cytoband: MSigDB
9. Cis-Elements: MSigDB
10. miRNA Targets: MSigDB
New features added: 8, 9 and 10

TOPPGene - Validation
• Random-gene cross-validation
  – Disease-gene relations from OMIM and GAD databases
  – Training set: disease genes with one gene (“target”) removed
  – Test set: 100 genes = “target” gene + 99 random genes
  – Rank of “target” gene
  – Control: random training sets
  – AUC and Sensitivity/Specificity

TOPPGene - Validation
Random-gene cross-validation: breast cancer example
• Disease genes: ATM, BARD1, BRCA1, BRCA2, BRIP1, CASP8, CHEK2, KRAS, PALB2, PIK3CA, PPM1D, RAD51, RB1CC1, SLC22A18, TP53
• Training set: BARD1, BRCA1, BRCA2, BRIP1, CASP8, CHEK2, KRAS, PALB2, PIK3CA, PPM1D, RAD51, RB1CC1, SLC22A18, TP53
• Test set (ATM + 99 random genes): KIAA1333, PQLC3, RBMY2OP, ZNF133, LOC402643, FBL, SLEB4, FAM32A, AACSL, ATM, NDUFB5, DENND4A, C14orf106, …, KCNJ16
• Ranked list after prioritization: 1. ATM, 2. KIAA1333, 3. PQLC3, 4. RBMY2OP, 5. ZNF133, 6. LOC402643, 7. FBL, 8. SLEB4, 9. FAM32A, 10. AACSL, 11. NDUFB5, 12. DENND4A, 13. C14orf106, …, 100. KCNJ16

Random-gene cross-validation result
• Training: 19 diseases with 693 genes
• Control: 20 random sets of 35 genes each
• AUC: 0.916
• Sensitivity/Specificity: 77/90
Sensitivity: frequency of “target” genes that are ranked above a particular threshold position
Specificity: the percentage of genes ranked below the threshold
[Figure: ROC curve; x-axis: 1 - specificity (false positive rate); y-axis: sensitivity (true positive rate)]

Using Mouse Phenotype as a feature of similarity measure improves human disease gene prioritization
Random-gene cross-validation with only one feature
[Figure: AUC of different feature sets; series: AUC (random control), AUC (p-value score), Coverage; feature sets: All, GO:MF, GO:BP, MP, Pathway, Domain, Pubmed, Interaction, Expression]

Using Mouse Phenotype as a feature of similarity measure improves human disease gene prioritization
Random-gene cross-validation by leaving one feature out
Overall performance
• All features: 0.913
• All – MP: 0.893
• All – MP – PubMed: 0.888
Sensitivity: true positive rate at a cutoff score
Specificity: true negative rate at the same cutoff
[Figure: ROC curves for All, All – MP, and All – MP – PubMed; x-axis: 1 - specificity (false positive rate), 0.00 to 0.20; y-axis: sensitivity (true positive rate)]

Locus-region cross-validation using different feature sets

Features               Average rank ratio     Times “target” genes     Times “target” genes
                       of “target” genes      ranked top 5%            ranked top 10%
All                    7.39%                  118                      125
GO + MP + PubMed       7.50%                  118                      126
MP + PubMed            7.08%                  121                      126
Without GO             6.84%                  117                      123
Without Pathway        7.66%                  118                      124
Without Domain         6.71%                  118                      124
Without Interaction    7.17%                  120                      124
Without Expression     7.28%                  118                      128
Without MP             9.77%                  110                      117
Without Pubmed         9.91%                  100                      111
Without MP & Pubmed    22.61%                 71                       80

ToppGene web server (http://toppgene.cchmc.org)
For functional enrichment analysis

PPI - Predicting Disease Genes
1. Direct protein–protein interactions (PPI) are one of the strongest manifestations of a functional relation between genes.
2. Hypothesis: Interacting proteins lead to the same or similar disease phenotypes when mutated.
3. Several genetically heterogeneous hereditary diseases have been shown to be caused by mutations in different interacting proteins, for example Hermansky-Pudlak syndrome and Fanconi anaemia. Hence, protein–protein interactions might in principle be used to identify potentially interesting disease gene candidates.
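The validation figures reported above (rank of the “target” gene, AUC, sensitivity at a rank cutoff, average rank ratio) all derive from one leave-one-out loop, which can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code behind the slides: `score_fn` is a hypothetical placeholder for any prioritization method, the rank ratio is assumed to be rank divided by test-set size, and the AUC is computed as the mean per-trial fraction of decoy genes ranked below the target.

```python
import random


def random_gene_cross_validation(disease_genes, background, score_fn,
                                 n_random=99, seed=0):
    """For each disease gene: hold it out as the target, train on the rest,
    hide the target among n_random random genes, and record its rank."""
    rng = random.Random(seed)
    known = set(disease_genes)
    pool = [g for g in background if g not in known]
    ranks = []
    for target in disease_genes:
        training = [g for g in disease_genes if g != target]
        test = [target] + rng.sample(pool, n_random)
        scores = score_fn(training, test)  # {gene: score}, higher = more similar
        better = sum(scores[g] > scores[target] for g in test[1:])
        ranks.append(1 + better)  # rank 1 = top of the list
    return ranks


def mean_auc(ranks, n_test):
    """Mean fraction of the n_test - 1 decoys ranked below the target."""
    return sum((n_test - r) / (n_test - 1) for r in ranks) / len(ranks)


def sensitivity_at(ranks, cutoff):
    """Fraction of targets ranked at or above the cutoff position."""
    return sum(r <= cutoff for r in ranks) / len(ranks)


def average_rank_ratio(ranks, n_test):
    """Mean of rank / test-set size (lower is better)."""
    return sum(r / n_test for r in ranks) / len(ranks)


def toy_score(training, test):
    """Hypothetical scorer: perfectly separates genes whose name starts with D."""
    return {g: 1.0 if g.startswith("D") else 0.0 for g in test}


disease = ["D1", "D2", "D3", "D4", "D5"]
background = [f"R{i}" for i in range(200)]
ranks = random_gene_cross_validation(disease, background, toy_score)
print(ranks, mean_auc(ranks, 100), sensitivity_at(ranks, 5))
```

With the perfect toy scorer every target is ranked first, so the AUC is 1.0. A control run in the spirit of the slides would replace the training genes with random gene sets and repeat the same loop.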
Mining human interactome
Sources: HPRD, BioGrid
[Diagram labels: 7 Known Disease Genes; Direct Interactants of Disease Genes; Indirect Interactants of Disease Genes; counts shown: 66, 778]
Which of these interactants are potential new candidates?
Prioritize candidate genes among the interacting partners of the disease-related genes
• Training sets: disease-related genes
• Test sets: interacting partners of the training genes

Example: Breast cancer

OMIM genes (level 0)                   15
Directly interacting genes (level 1)   342
Indirectly interacting genes (level 2) 2469!

ToppGene web server (http://toppgene.cchmc.org)
For candidate gene prioritization

Example: Breast cancer study.
Genome-wide association study identifies novel breast cancer susceptibility loci. Nature. 2007 May 27.

rs id        Location   Gene    Training set     Test set
rs2981582    10q26      FGFR2   15 OMIM genes    83 genes in the region

Prioritization result:

Rank   Gene    Description                                              P-value
1      BUB3    budding uninhibited by benzimidazoles 3 homolog          0.003865
2      FGFR2   fibroblast growth factor receptor 2                      0.018906
3      BCCIP   BRCA2 and CDKN1A interacting protein                     0.04784

ToppGene Prioritization
Example: Breast cancer
Training set: 15 OMIM genes
Test set: 342 interacting genes

Ranked Interactants

Rank   Gene         Description
1      ATR          ataxia telangiectasia and Rad3 related
2      FANCD2       Fanconi anemia, complementation group D2
3      NBN (NBS1)   Nibrin

Limitations
General limitations of any training-test strategy:
• Requires prior knowledge of disease-gene associations.
• Assumes that the disease genes yet to be discovered will be consistent with what is already known about a disease.
• Depends on the accuracy and completeness of the functional annotations.
  – Only one-fifth of the known human genes have pathway or phenotype annotations, and there are still more than 40% of genes whose functions are not defined!
Chen et al., 2007; BMC Bioinformatics

Mouse Phenotype - Limitations
1. MP is not a disease-centric ontology, and the phenotype of the same gene mutation can vary depending on specific mouse strains or their genetic backgrounds.
2. Orthologous genes need not necessarily result in orthologous phenotypes.
Possible Solutions - Future Directions
More efficient cross-species phenome extrapolation, wherein the mouse phenotype terms are mapped semantically to human phenotype concepts (from UMLS) (“orthologous phenotype”) and the resultant orthologous genes associated with an orthologous phenotype are identified.
Chen et al., 2007; BMC Bioinformatics

PPIs for disease gene identification
Limitations
1. Noisy interactome data
  • In vitro vs. in vivo (e.g., only 5.8% of yeast two-hybrid predicted interactions were confirmed by HPRD)
  • Extrapolation of interactions from one species to another
  • Bias towards “well-studied” genes/proteins
2. Too many interactants! Hub proteins
3. Two interacting proteins need not lead to a similar phenotype when mutated
4. Disease proteins may lie at different points in a pathway and need not interact directly
5. Lastly, disease mutations need not always involve proteins
Oti et al., 2006; J Med Gen

http://anil.cchmc.org (under presentations)
And PRIORITIZATION too!
Thank You!
http://sbw.kgi.edu/
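The level 0 / level 1 / level 2 grouping used in the breast cancer interactome example, and the “too many interactants” problem listed under the PPI limitations, can be illustrated with a breadth-first expansion from the known disease genes. This is a sketch under stated assumptions: the function name, the edge-list input format and the toy gene labels are hypothetical, and a real run would load interactions from a source such as HPRD or BioGrid.

```python
from collections import deque


def interactants_by_level(edges, seeds, max_level=2):
    """Group genes by shortest interaction distance from the seed genes.
    Level 0 = seeds, level 1 = direct partners, level 2 = partners of partners."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    level = {gene: 0 for gene in seeds}
    queue = deque(seeds)
    while queue:
        gene = queue.popleft()
        if level[gene] == max_level:
            continue  # do not expand beyond the requested depth
        for partner in neighbors.get(gene, ()):
            if partner not in level:
                level[partner] = level[gene] + 1
                queue.append(partner)

    by_level = {}
    for gene, lvl in level.items():
        by_level.setdefault(lvl, set()).add(gene)
    return by_level


# Hypothetical toy interactome: S1 and S2 stand in for known disease genes.
edges = [("S1", "A"), ("S2", "A"), ("A", "B"), ("B", "C"), ("S2", "D")]
levels = interactants_by_level(edges, ["S1", "S2"])
print({lvl: sorted(genes) for lvl, genes in sorted(levels.items())})
```

Counting the genes per level makes the growth visible, as in the source example where 15 OMIM genes expand to 342 direct and 2469 indirect interactants. That growth is the reason the expanded set is used as a test set for prioritization and not reported as candidates directly.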
