* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Lecture slides
Survey
Document related concepts
List of types of proteins wikipedia , lookup
Magnesium transporter wikipedia , lookup
Gene desert wikipedia , lookup
Promoter (genetics) wikipedia , lookup
Protein moonlighting wikipedia , lookup
Gene expression wikipedia , lookup
Molecular evolution wikipedia , lookup
Community fingerprinting wikipedia , lookup
Genome evolution wikipedia , lookup
Endogenous retrovirus wikipedia , lookup
Two-hybrid screening wikipedia , lookup
Interactome wikipedia , lookup
Protein–protein interaction wikipedia , lookup
Gene expression profiling wikipedia , lookup
Gene nomenclature wikipedia , lookup
Silencer (genetics) wikipedia , lookup
Transcript
Disease Gene Candidate Prioritization by Integrative Biology Table of contents: Where are we in the course pipeline? Maturation of the project and the project description Background Networks – deducing functional relationships from PPI data networks Protein interaction networks Functional modules / network clusters Phenotype association Grouping disorders based on their phenotype. Biological implications of phenotype clusters. Method and examples Integrating protein interaction data and phenotype associations in an automated large scale disease gene finding platform The DNA Array Analysis Pipeline Question Experimental Design Array design Probe design Sample Preparation Hybridization Buy Chip/Array Image analysis Normalization Expression Index Calculation Comparable Gene Expression Data Statistical Analysis Fit to Model (time series) Advanced Data Analysis Clustering Meta analysis PCA Classification Survival analysis Promoter Analysis Regulatory Network Project description Masters thesis sept 2003 Niels Tommerup Professor, Centre Director, Dr. Med. Søren Brunak, Professor, center director, Dr. Phil, PhD Project description Project: Find disease genes Niels Tommerup Professor, Centre Director Dr. Med. Using Bioinformatics Søren Brunak, Professor, center director, Dr. Phil, PhD Project description Project: Disease gene candidate prioritization by integrating protein interaction and phenotype association data. Niels Tommerup Professor, Centre Director Dr. Med. Søren Brunak, Professor, center director, Dr. Phil, PhD Project description Abstract The availability of the first draft of the human genome in 2001 (Venter, Adams et al. 2001) led to an increase in the number of methods for disease gene identification. However, the general number of candidates in most loci linked to a particular phenotype is in the hundreds (McCarthy, Smedley et al. 2003; van Driel, Cuelenaere et al. 2003), and the underlying genes in over 900 of the ~ 2550 loci associated with a phenotype in the “Online Mendelian Inheritance in Man” (OMIM) database, have not yet been identified (Hamosh, Scott et al. 2005). Evidently disease gene identification continues to be a very strenuous challenge, since mutational analysis of hundreds of candidates in a critical interval using methods currently available is extremely resource demanding. Thus, prioritising the candidates based on different criteria followed by an extensive investigation of promising candidates, is a logical step in the disease gene finding process. With the advent of proteomics, we are now able to retrieve information on gene functions in a large-scale manner, thus bridging the gap between genotype and phenotype, a possibility with significant interest for disease gene candidate prioritization. We propose that automated correlation of phenotype association networks, with interolog data (the transfer of protein interactions between orthologous protein pairs in different organisms), is a powerful way of identifying good disease gene candidates in a large list of genes in loci associated with a phenotype. Our method automatically identifies potential functional modules consisting of protein components, where at least one of the components is a disease related protein. When such incriminated modules are identified, the remaining protein components of the module are correlated with loci in the genome associated with a similar phenotype. A hit is reported if other protein components of the incriminated module are the product of genes in loci associated with an identical or overlapping phenotype. Using this large scale approach we show that a gene in a locus is a heavily incriminated candidate, if the protein product of the gene interacts with a protein involved in a similar or identical phenotype, and publish a list of 60 likely candidates in various disorders. Project description Abstract The availability of the first draft of the human genome in 2001 (Venter, Adams et al. 2001) led to an increase in the number of methods for disease gene identification. However, the general number of candidates in most loci linked to a particular phenotype is in the hundreds (McCarthy, Smedley et al. 2003; van Driel, Cuelenaere et al. 2003), and the underlying genes in over 900 of the ~ 2550 loci associated with a phenotype in the “Online Mendelian Inheritance in Man” (OMIM) database, have not yet been identified (Hamosh, Scott et al. 2005). Evidently disease gene identification continues to be a very strenuous challenge, since mutational analysis of hundreds of candidates in a critical interval using methods currently available is extremely resource demanding. Thus, prioritising the candidates based on different criteria followed by an extensive investigation of promising candidates, is a logical step in the disease gene finding process. With the advent of proteomics, we are now able to retrieve information on gene functions in a large-scale manner, thus bridging the gap between genotype and phenotype, a possibility with significant interest for disease gene candidate prioritization. We propose that automated correlation of phenotype association networks, with interolog data (the transfer of protein interactions between orthologous protein pairs in different organisms), is a powerful way of identifying good disease gene candidates in a large list of genes in loci associated with a phenotype. Our method automatically identifies potential functional modules consisting of protein components, where at least one of the components is a disease related protein. When such incriminated modules are identified, the remaining protein components of the module are correlated with loci in the genome associated with a similar phenotype. A hit is reported if other protein components of the incriminated module are the product of genes in loci associated with an identical or overlapping phenotype. Using this large scale approach we show that a gene in a locus is a heavily incriminated candidate, if the protein product of the gene interacts with a protein involved in a similar or identical phenotype, and publish a list of 60 likely candidates in various disorders. Background Background Finding genes responsible for major genetic disorders can lead to diagnostics, potential drug targets, treatments and large amounts of information about molecular cell biology in general. Background Methods for disease gene finding post genome era (>2001): Mircodeletions http://www.med.cmu.ac.th/dept/pediatrics/06interest-cases/ic-39/case39.html Translocations http://www.rscbayarea.com/image s/reciprocal_translocation.gif Linkage analysis Fagerheim et al 1996. 1q21-1q23.1 chr1:141,600,00-155,900,000 Background Automated methods for disease gene finding int the post genome era (>2001): ? Grouping: Tissues, Gene Ontology, Gene Expression, MeSH terms ……. (Perez-Iratxeta, Bork et al. 2002) (Freudenberg and Propping 2002) (van Driel, Cuelenaere et al. 2005) (Hristovski, Peterlin et al. 2005) Disease Gene Finding. Summery Background Why do we want to find disease genes, how has it been done until now? Networks – deducing functional relationships from network theory Protein interactionnetworks Functional modules / network clusters Phenotype association Grouping disorders based on their phenotype. Biological implications of phenotype clusters. Method and examples Combining network theory and phenotype associations in an automated large scale disease gene finding platform proof of concept. Status of pipeline / infrastructure Networks and functional modules Deducing functional relationships from protein interaction networks Networks and functional modules Deducing functional relationships from network theory Network theory is boooooooooring Networks Text mining of full text corpora e.g PubMed Central http://www.biosolveit.de/ToPNet/screenshots/fig1.html Networks Protein interaction networks of physical interactions. (Barabasi and Oltvai 2004). Networks daily Social Networks, The CBS interactome weekly monthly (de Licthenberg et al.) Networks daily Social Networks, The CBS interactome weekly monthly (de Licthenberg et al.) Extracting functional data from protein interaction networks The Ach receptor involved in Myasthenic Syndrome. InWeb Homo Sapiens Dynamic funcional module: Eg: Cell cycle regulation Metabolism Trans-organism protein interaction network Orthologs? Orthologous genes are direct descendants of a gene in a common ancestor: S.Cerevisiae D. Melanogaster H.Sapiens (O'Brien K, Remm et al. 2005) Trans-organism protein interaction network H.Sapiens MOSAIC D. Melanogaster Experim. C. Elegans Experim. S. Cerevisiae Experim. Infrastructure status BIND IntAct DIP Web server Opis MINT HPRD Command line Inweb.pl GRID Handcurated sets PPI – pred. Trans-organism ppi pipeline InWeb Homo Sapiens >122.000 int. > 22.000 genes CBS Datawarehouse Download/reformat db’s Scoring A) Topological B) No publ. Extraction perl modules Direct SQL access XML or SIF output Disease Gene Finding. Summery Background Why do we want to find disease genes, how has it been done until now? Networks – deducing functional relationships from network theory Protein interactionnetworks Functional modules / network clusters Phenotype association Grouping disorders based on their phenotype. Biological implications of phenotype clusters. Method and examples Combining network theory and phenotype associations in an automated large scale disease gene finding platform proof of concept. Status of pipeline / infrastructure Phenotype association Phenotype association Zelwegger syndrome Absent liver peroxisomes Hepatomegaly Intrahepatic biliary dysgenesis Prolonged neonatal jaundice Pyloric hypertrophy Patent ductus arteriosus Ventricular septal defects Bell-shaped thorax Small adrenal glands Absent renal peroxisomes Clitoromegaly Cryptorchidism Hydronephrosis Hypospadias Renal cortical microcysts Failure to thrive Abnormal electroretinogram Abnormal helices Anteverted nares Brushfield spots Cataracts Corneal clouding Epicanthal folds Flat facies Flat occiput Glaucoma High arched palate High forehead Hypertelorism Large fontanelles Macrocephaly Micrognathia Nystagmus Pale optic disk Pigmentary retinopathy Posteriorly rotated ears Protruding tongue Redundant skin folds of neck Round facies Sensorineural deafness Turribrachycephaly Upward slanting Hyporeflexia or areflexia Hypotonia Polymicrogyria Seizures Severe mental retardation Subependymal cysts Pulmonary hypoplasia Cubitus valgus Delayed bone age Metatarsus adductus Rocker-bottom feet Stippled epiphyses (especially patellar and acetabular regions) Talipes equinovarus Transverse palmar crease Ulnar deviation of hands Wide cranial sutures Transverse palmar crease Heterotopias/abnormal migration Hypoplastic olfactory lobes palpebral fissures Autosomal recessive Albuminuria Aminoaciduria Decreased dihydroxyacetone phosphate acyltransferase (DHAPAT) activity Decreased plasmologen Elevated long chain fatty acids Elevated serum iron and iron binding capacity Increased phytanic acid Pipecolic acidemia Breech presentation Death usually in first year of life Genetic heterogeneity Infants occasionally mistaken as having Down syndrome Agenesis/hypoplasic corpus collosum Phenotype association Word vectors Reference : Zelwegger Syndrome (214100) 214100 202370 Phenotype Sim. Score Adrenoleukodystrophy (202370) 0.781 Hyperpipecolatemia (239400) 0.703 Cerebrohepatorenal Syndr. (214110) 0.682 Refsum Disease (266510) 0.609 A relationship between the infantile form of Refsum disease and Zellweger syndrome was suggested by the observations of Poulos et al. (1984) in 2 patients. In the infantile form of Refsum disease, as in Zellweger syndrome, peroxisomes are deficient and peroxisomal functions are impaired (Schram et al., 1986). Clinically, infantile Refsum disease, ZWS, and adrenoleukodystrophy have several overlapping features. (Stokke et al., 1984). (http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=266510) Phenotype association Phenotype association network Word vectors CerebroHepatorenal Zelwegger Refsum Adrenoleuko -dystrophy Disease Gene Finding. Summery Background Why do we want to find disease genes, how has it been done until now? Networks – deducing functional relationships from network theory Protein interactionnetworks Functional modules / network clusters Phenotype association Grouping disorders based on their phenotype. Biological implications of phenotype clusters. Method and examples Combining network theory and phenotype associations in an automated large scale disease gene finding platform proof of concept. Method – Proof of concept Method InWeb Word vectors Homo Sapiens Phenotype clustering Results - Benchmark DE SANCTIS-CACCHIONE SYNDROME Gene map locus 10q11 >12MB area, 103 ranked genes CLINICAL FEATURES De Sanctis and Cacchione (1932) reported a condition, which they called 'xerodermic idiocy,' in which patients had xeroderma pigmentosum, mental deficiency, progressive neurologic deterioration, dwarfism, and gonadal hypoplasia. http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id= 278800 MIM RANK GENE P-value TRUE 278800 278800 1 2 ENSG00000032514 ENSG00000188611 0.300326793109544 0.0125655342047565 * 278800 278800 278800 278800 278800 278800 278800 . . . . . 278800 278800 278800 278800 278800 278800 278800 278800 278800 278800 278800 278800 278800 278800 278800 278800 278800 278800 278800 278800 278800 278800 278800 278800 278800 2 2 3 3 4 4 4 . . . . . 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 6 ENSG00000138297 ENSG00000165406 ENSG00000196693 ENSG00000185532 ENSG00000197910 ENSG00000165383 ENSG00000172538 . . . . . ENSG00000165511 ENSG00000182354 ENSG00000172661 ENSG00000165507 ENSG00000178440 ENSG00000138299 ENSG00000197704 ENSG00000012779 ENSG00000197354 ENSG00000189090 ENSG00000107551 ENSG00000126542 ENSG00000198364 ENSG00000185849 ENSG00000150165 ENSG00000128815 ENSG00000178645 ENSG00000138293 ENSG00000176833 ENSG00000179251 ENSG00000169826 ENSG00000172678 ENSG00000197752 ENSG00000107643 ENSG00000165733 0.0125655342047565 0.0125655342047565 0.0121357313793756 0.0121357313793756 0.00680983722337082 0.00680983722337082 0.00680983722337082 . . . . . 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00680983722337082 0.00412573091718715 0.000263885640603109 278800 7 ENSG00000169813 6,63E+07 Results – Benchmarking DE SANCTIS-CACCHIONE SYNDROME Ranked 1 P-value 0.300326793109544 #278800 DE SANCTISCACCHIONE SYNDROME DNA excision repair protein ERCC-6 *126340 DNA REPAIR DEFECT EM9 OF CHINESE HAMSTER OVARY CELLS, COMPLEMENTATION OF; EM9 #133540 COCKAYNE SYNDROME CKN2 #278730 XERODERMA PIGMENTOSUM, COMPLEMENTATION GROUP D DNA excision repair protein ERCC-2 #601675 TRICHOTHIODYSTROPHY Eukaryotic initiation factor 4A-I (eIF4A-I) Eukaryotic translation initiation factor 4E (eIF4E) Results – Benchmarking DE SANCTIS-CACCHIONE SYNDROME Ranked 2 P-value 0.0125655342047565 Disease Gene Finding. Summery Background Why do we want to find disease genes, how has it been done until now? Networks – deducing functional relationships from network theory Protein interactionnetworks Functional modules / network clusters Phenotype association Grouping disorders based on their phenotype. Biological implications of phenotype clusters. Method and examples Combining network theory and phenotype associations in an automated large scale disease gene finding platform proof of concept. Method - status In Silico: Two in silico proof of concept hits: BOR Syndrome Dilated Cardiomyopathy In Vitro: 5 Candidates being tested in the lab by mutational screening in patient material: BOR Syndrome Sensorineural Deafness Holoprosencephaly Obesity Anhidrosis Benchmarking: Benchmarking by unbiased prioritizing genes in ~ 1000 critical intervals where the actual disease gene is known. Project description Abstract The availability of the first draft of the human genome in 2001 (Venter, Adams et al. 2001) led to an increase in the number of methods for disease gene identification. However, the general number of candidates in most loci linked to a particular phenotype is in the hundreds (McCarthy, Smedley et al. 2003; van Driel, Cuelenaere et al. 2003), and the underlying genes in over 900 of the ~ 2550 loci associated with a phenotype in the “Online Mendelian Inheritance in Man” (OMIM) database, have not yet been identified (Hamosh, Scott et al. 2005). Evidently disease gene identification continues to be a very strenuous challenge, since mutational analysis of hundreds of candidates in a critical interval using methods currently available is extremely resource demanding. Thus, prioritising the candidates based on different criteria followed by an extensive investigation of promising candidates, is a logical step in the disease gene finding process. With the advent of proteomics, we are now able to retrieve information on gene functions in a large-scale manner, thus bridging the gap between genotype and phenotype, a possibility with significant interest for disease gene candidate prioritization. We propose that automated correlation of phenotype association networks, with interolog data (the transfer of protein interactions between orthologous protein pairs in different organisms), is a powerful way of identifying good disease gene candidates in a large list of genes in loci associated with a phenotype. Our method automatically identifies potential functional modules consisting of protein components, where at least one of the components is a disease related protein. When such incriminated modules are identified, the remaining protein components of the module are correlated with loci in the genome associated with a similar phenotype. A hit is reported if other protein components of the incriminated module are the product of genes in loci associated with an identical or overlapping phenotype. Using this large scale approach we show that a gene in a locus is a heavily incriminated candidate, if the protein product of the gene interacts with a protein involved in a similar or identical phenotype, and publish a list of 60 likely candidates in various disorders. Project description Project: Find disease genes Niels Tommerup Professor, Centre Director Dr. Med. Using Bioinformatics Søren Brunak, Professor, center director, Dr. Phil, PhD, physicist Acknowledgments Disease Gene Finding : Olga Rigina Olof Karlberg Zenia M. Størling Páll Ísólfur Ólason Kasper Lage Anders Gorm Anders Hinsby Yves Moreau Søren Brunak