Functional annotation and identification of candidate disease genes by computational analysis of normal tissue gene expression data

L. Miozzi¹, U. Ala¹, R. Piro², F. Rosa³, F. Di Cunto¹ and P. Provero¹
¹Dipartimento di Genetica, Biologia e Biochimica, Università di Torino, Torino, Italy; ²INFN, Sezione di Torino, Torino, Italy; ³ISI Foundation, Torino, Italy

Introduction
Among the open problems of molecular biology in the post-genomic era, the functional annotation of the human genome and the identification of genes involved in genetic diseases are especially important. Expression data on a genomic scale have been available for several years thanks to a set of new experimental techniques, and are widely believed to contain much information relevant to the solution of such problems.
Here we present the results of a computational analysis of publicly available expression data on human normal tissues, based on the integration of data obtained with the two most important experimental platforms (microarrays and SAGE) and of different measures of dissimilarity between expression profiles. The building blocks of the procedure are the Gene Expression Neighborhoods (GEN), small sets of tightly coexpressed genes, which are analyzed in terms of functional annotation and relevance to human diseases. This analysis provides putative functional annotations for many genes and identifies promising candidate disease genes for experimental verification.

The “guilt by association” principle
The presented work is based on the following principle: “since there is a strong correlation between coexpression and functional relatedness, a gene found to be coexpressed with several others involved in the same biological process can be putatively given the same functional annotation” (Brazma A. and Vilo J., 2000, FEBS Lett. 480:17-24).
Method

Publicly available expression data
In this work we analyze publicly available expression data on human normal tissues obtained with Affymetrix microarrays (http://symatlas.gnf.org/SymAtlas/) and with SAGE (Serial Analysis of Gene Expression; http://cgap.nci.nih.gov/). We considered 158 experiments covering 12109 genes for Affymetrix and 62 experiments covering 11741 genes for SAGE.

Integration of different quantitative measures of dissimilarity between expression profiles
Different measures of dissimilarity between expression profiles have been defined and integrated: Euclidean distance and Pearson linear dissimilarity for the microarray data; Euclidean distance and a dissimilarity measure based on the Poisson distribution (developed in a different context in Van Helden J., 2004, Bioinformatics 20(3):399-406) for the SAGE data.

Identification of Gene Expression Neighborhoods (GEN)
The unit of functional analysis, named Gene Expression Neighborhood (GEN), has been defined as a gene plus its k nearest expression neighbors, with k typically a rather small number (the results we report were obtained with k=6). For each dataset and each choice of dissimilarity measure we identified a number of GENs equal to the number of genes represented in the dataset.

GEN functional analysis using the controlled annotation vocabulary Gene Ontology
A GEN was considered functionally characterized if at least one Gene Ontology term (http://www.geneontology.org/) was shared by the majority (K) of its genes (K=4 genes in the results presented). To avoid overly generic GO terms, the analysis has been limited to terms shared by no more than a given maximum number M of genes in the whole experimental dataset under investigation (M=300 in the results presented). This limit ensures that the majority rule used to define functionally characterized GENs automatically implies statistically significant overrepresentation of the GO term involved.
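The GEN construction and the majority rule described above can be illustrated with a minimal Python sketch. It is not the code used for the poster. It assumes that the Pearson linear dissimilarity is 1 - r (the poster does not give the formula), that expression profiles are the rows of a NumPy matrix, and that GO annotations are given as one set of terms per gene. The Poisson-based measure is not reproduced here. All function names are hypothetical. The last function computes, under a hypergeometric null model (also an assumption), the probability that at least K of the n genes in a GEN share a term carried by M of the N genes in the dataset.

```python
import numpy as np
from collections import Counter
from math import comb

def pearson_dissimilarity(X):
    # ASSUMPTION: dissimilarity = 1 - Pearson correlation between rows (genes).
    return 1.0 - np.corrcoef(X)

def euclidean_distance(X):
    # Pairwise Euclidean distance between rows (genes).
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def build_gens(D, k=6):
    # One GEN per gene: the seed gene followed by its k nearest neighbours.
    gens = []
    for i in range(D.shape[0]):
        d = D[i].astype(float)   # astype returns a copy, so D is not modified
        d[i] = np.inf            # keep the seed out of its own neighbour list
        neighbours = np.argsort(d, kind="stable")[:k]
        gens.append([i] + neighbours.tolist())
    return gens

def characterizing_terms(gen, annotations, term_sizes, K=4, M=300):
    # Terms shared by at least K genes of the GEN and annotated to at most
    # M genes in the whole dataset (term_sizes maps term -> number of genes).
    counts = Counter(t for g in gen for t in annotations[g])
    return {t for t, c in counts.items() if c >= K and term_sizes[t] <= M}

def majority_p_value(N, M, n=7, K=4):
    # Hypergeometric tail: P(at least K of n random genes carry the term).
    tail = sum(comb(M, j) * comb(N - M, n - j) for j in range(K, n + 1))
    return tail / comb(N, n)
```

With the microarray figures quoted above (N=12109, M=300, n=7, K=4), `majority_p_value` gives a probability of order 1e-5 for a single term and a single GEN, which gives a sense of why the cap on M makes the majority rule stringent.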
Estimation of false discovery rate
The false discovery rate for the functionally characterized GENs has been estimated as follows: random GENs have been generated by reshuffling the gene names in the whole dataset (thus preserving the characteristics of the actual GENs, such as their degree of self-overlap) and subjected to the same functional analysis.

Leave-one-out
A leave-one-out analysis has been performed to estimate how many known annotations the method is able to recover.

Putative new GO functional annotations
Characterized GENs have been used to determine putative new functional annotations: for each functionally characterized GEN and for each GO term associated with it (that is, shared by the majority of its genes), the same GO term has been putatively attributed to the genes in the GEN not yet associated with it.
Finally, we looked for functionally characterized GENs containing at least 3 genes associated with a genetic disease in the OMIM database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM). When the relevant OMIM entries were related to each other, the genes in the GEN not associated with OMIM entries have been considered interesting candidates for involvement in similar pathologies.
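The reshuffling control and the annotation transfer described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the poster's implementation: GENs are lists of gene indices, annotations are one set of GO terms per gene, and the false discovery rate is estimated as the mean number of characterized GENs after reshuffling divided by the number observed with the real labels (the poster does not specify the estimator). Function names are hypothetical.

```python
import random
from collections import Counter

def is_characterized(gen, annotations, term_sizes, K=4, M=300):
    # True if some term is shared by >= K genes of the GEN and is annotated
    # to at most M genes in the whole dataset.
    counts = Counter(t for g in gen for t in annotations[g])
    return any(c >= K and term_sizes[t] <= M for t, c in counts.items())

def estimate_fdr(gens, annotations, K=4, M=300, n_shuffles=100, seed=0):
    # Reshuffle which annotation set belongs to which gene label. The GENs
    # themselves (and hence their mutual overlaps) are left untouched.
    rng = random.Random(seed)
    term_sizes = Counter(t for terms in annotations for t in terms)
    observed = sum(is_characterized(g, annotations, term_sizes, K, M)
                   for g in gens)
    if observed == 0:
        return 0.0
    shuffled = list(annotations)
    null_total = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        null_total += sum(is_characterized(g, shuffled, term_sizes, K, M)
                          for g in gens)
    # ASSUMPTION: FDR ~ expected characterized random GENs / observed ones.
    return min(1.0, (null_total / n_shuffles) / observed)

def propose_annotations(gens, annotations, K=4, M=300):
    # For every characterizing term of a GEN, propose it for the member
    # genes that do not carry it yet. Returns a set of (gene, term) pairs.
    term_sizes = Counter(t for terms in annotations for t in terms)
    proposals = set()
    for gen in gens:
        counts = Counter(t for g in gen for t in annotations[g])
        for t, c in counts.items():
            if c >= K and term_sizes[t] <= M:
                proposals.update((g, t) for g in gen
                                 if t not in annotations[g])
    return proposals
```

Permuting the annotation sets rather than rebuilding the neighborhoods keeps the size of every GO term and the overlap structure of the GENs fixed, so only the association between expression and annotation is destroyed.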
Integration with OMIM data

Potential new disease genes (OMIM)

Dataset | Disease | Gene
Microarray+Pearson | ACROMEGALOID FEATURES, OVERGROWTH, CLEFT PALATE, AND HERNIA | ENSG00000069482
Microarray+Pearson | AORTIC ANEURYSM, FAMILIAL THORACIC 1 | ENSG00000149591
Microarray+Pearson | CARDIOMYOPATHY, DILATED, 1C; CMD1C | ENSG00000107796
Microarray+Pearson | CHARCOT-MARIE-TOOTH DISEASE, AXONAL, TYPE 2G; CMT2G | ENSG00000166986
Microarray+Pearson | CHARCOT-MARIE-TOOTH DISEASE, DOMINANT INTERMEDIATE A | ENSG00000166197
Microarray+Pearson | CONVULSIONS, BENIGN FAMILIAL INFANTILE, 2 | ENSG00000087258
Microarray+Pearson | CONVULSIONS, FAMILIAL INFANTILE, WITH PAROXYSMAL CHOREOATHETOSIS; ICCA | ENSG00000087258
Microarray+Pearson | DEAFNESS, NEUROSENSORY, AUTOSOMAL RECESSIVE 46; DFNB46 | ENSG00000101608
Microarray+Pearson | EPILEPSY, IDIOPATHIC GENERALIZED, SUSCEPTIBILITY TO, 3; EIG3 | ENSG00000078725
Microarray+Pearson | EPILEPSY, PARTIAL, WITH VARIABLE FOCI | ENSG00000100095
Microarray+Pearson | FACIOSCAPULOHUMERAL MUSCULAR DYSTROPHY 1A; FSHMD1A | ENSG00000154553
Microarray+Pearson | MUSCULAR DYSTROPHY, LIMB-GIRDLE, TYPE 1F; LGMD1F | ENSG00000128595
Microarray+Pearson | PARKINSON DISEASE 3, AUTOSOMAL DOMINANT LEWY BODY; PARK3 | ENSG00000075340
Microarray+Pearson | POLYDACTYLY, PREAXIAL II; PPD2 | ENSG00000106538
Microarray+Pearson | ROSSELLI-GULIENETTI SYNDROME | ENSG00000137699
Microarray+Pearson | SCAPULOPERONEAL MYOPATHY; SPM | ENSG00000139329
Microarray+Pearson | VACUOLAR NEUROMYOPATHY | ENSG00000077009
Microarray+Pearson | VACUOLAR NEUROMYOPATHY | ENSG00000099800
Microarray+Pearson | ACROMEGALOID FEATURES, OVERGROWTH, CLEFT PALATE, AND HERNIA | ENSG00000131808
Microarray+Pearson | BREAST CANCER, 11-22 TRANSLOCATION ASSOCIATED | ENSG00000137713
Microarray+Pearson | BREAST CANCER, DUCTAL, 1; BRCD1 | ENSG00000139618
Microarray+Pearson | ELECTROENCEPHALOGRAM, LOW-VOLTAGE | ENSG00000075043
Microarray+Pearson | EOSINOPHILIA, FAMILIAL | ENSG00000113721
Microarray+Pearson | MICROCEPHALY, PRIMARY AUTOSOMAL RECESSIVE, 4; MCPH4 | ENSG00000156970
Microarray+Pearson | MUSCULAR DYSTROPHY, CONGENITAL, 1B | ENSG00000143632
Microarray+Pearson | SCAPULOPERONEAL MYOPATHY; SPM | ENSG00000011465
Microarray+Pearson | TRIPHALANGEAL THUMB-POLYSYNDACTYLY SYNDROME | ENSG00000106538
Microarray+Pearson | TUMOR SUPPRESSOR GENE ON CHROMOSOME 11 | ENSG00000137713
Microarray+Pearson | CARDIOMYOPATHY, DILATED, 1F; CMD1F | ENSG00000118523
Microarray+Pearson | CARDIOMYOPATHY, DILATED, 1Q; CMD1Q | ENSG00000091136
Microarray+Pearson | DEAFNESS, AUTOSOMAL RECESSIVE 51; DFNB51 | ENSG00000026508
Microarray+Pearson | MYOPATHY, LIMB-GIRDLE, WITH BONE FRAGILITY | ENSG00000147872
Microarray+Euclidean | ARRHYTHMOGENIC RIGHT VENTRICULAR DYSPLASIA, FAMILIAL, 5; ARVD5 | ENSG00000160808
Microarray+Euclidean | NONCOMPACTION OF LEFT VENTRICULAR MYOCARDIUM, FAMILIAL ISOLATED, AUTOSOMAL DOMINANT 2 | ENSG00000130598
Microarray+Euclidean | SCAPULOPERONEAL MYOPATHY; SPM | ENSG00000011465
Microarray+Euclidean | MUSCULAR DYSTROPHY, CONGENITAL, 1B | ENSG00000143632
Microarray+Euclidean | CARDIOMYOPATHY, DILATED, 1C; CMD1C | ENSG00000122367
SAGE+Euclidean | ANEURYSM, INTRACRANIAL BERRY, 3 | ENSG00000158747
SAGE+Euclidean | MYOPIA 5 | ENSG00000108821
SAGE+Euclidean | MYOPIA 6 | ENSG00000100122
SAGE+Euclidean | NONCOMPACTION OF LEFT VENTRICULAR MYOCARDIUM, FAMILIAL ISOLATED, AUTOSOMAL DOMINANT 2 | ENSG00000130598
SAGE+Euclidean | MICROPHTHALMIA-CATARACT | ENSG00000167971
SAGE+Euclidean | EXFOLIATIVE ICHTHYOSIS, AUTOSOMAL RECESSIVE, ICHTHYOSIS BULLOSA OF SIEMENS-LIKE | ENSG00000186081
SAGE+Euclidean | MACULAR DYSTROPHY, RETINAL, 2, BULL'S EYE | ENSG00000007062
SAGE+Euclidean | CATARACT, CONGENITAL NUCLEAR, AUTOSOMAL RECESSIVE 1; CATCN1 | ENSG00000105370
SAGE+Euclidean | CARDIOMYOPATHY, DILATED, 1C; CMD1C | ENSG00000122367
SAGE+Euclidean | ARRHYTHMOGENIC RIGHT VENTRICULAR DYSPLASIA, FAMILIAL, 5; ARVD5 | ENSG00000160808
SAGE+Euclidean | ACHROMATOPSIA 1 | ENSG00000129535
SAGE+Euclidean | ACHROMATOPSIA 1 | ENSG00000139988
SAGE+Euclidean | CONE-ROD DYSTROPHY 5; CORD5 | ENSG00000109047
SAGE+Euclidean | CONE-ROD DYSTROPHY 5; CORD5 | ENSG00000179036
SAGE+Euclidean | POSTERIOR COLUMN ATAXIA WITH RETINITIS PIGMENTOSA; AXPC1 | ENSG00000116703
SAGE+Euclidean | MYOPIA 6 | ENSG00000196431
SAGE+Euclidean | GLAUCOMA 3, PRIMARY INFANTILE, B; GLC3B | ENSG00000158747
SAGE+Euclidean | MICROPHTHALMIA-CATARACT | ENSG00000197253
SAGE+Euclidean | DUPUYTREN CONTRACTURE | ENSG00000087245
SAGE+Euclidean | CORNEAL DYSTROPHY, CRYSTALLINE, OF SCHNYDER | ENSG00000158747
SAGE+Euclidean | CATARACT, AUTOSOMAL RECESSIVE, EARLY-ONSET, PULVERULENT | ENSG00000172014
SAGE+Euclidean | CATARACT, POSTERIOR POLAR 3 | ENSG00000125864

Table 3 - List of candidate genes potentially involved in human genetic diseases.

Results

• The leave-one-out analysis showed that 1026 correct GO annotations involving 644 genes and 94 GO terms would have been identified by the method (see table 1).

a)
Dataset | Euclidean | Pearson | Poisson | Euclidean+Pearson | Euclidean+Poisson
Microarray | 428 | 788 | / | 958 | 428
SAGE | 50 | / | 51 | 50 | 92
Microarray+SAGE | 468 | 788 | 51 | 992 | 504

b)
Dataset | Euclidean | Pearson | Poisson | Euclidean+Pearson | Euclidean+Poisson
Microarray | 318 | 546 | / | 598 | 318
SAGE | 48 | / | 48 | 48 | 82
Microarray+SAGE | 353 | 546 | 48 | 625 | 376

Table 1 - Leave-one-out analysis results showing the number of GO annotations (a) and annotated genes (b) correctly identified.

• The distribution of GO terms among the three Gene Ontology branches changes significantly across the combinations of experimental dataset and dissimilarity measure, showing that different combinations are able to capture different aspects of coexpression.

[Fig. 1 panels: Microarray-Pearson, Microarray-Euclidean, SAGE-Euclidean, SAGE-Poisson]
Fig. 1 - Distribution of the correctly recovered GO annotations among the three GO branches (Biological process; Molecular function; Cellular component).

• Different dissimilarity measures describe different aspects of coexpression, correlated with different kinds of functional annotation (see tables 1 and 2), as shown by the fact that only a small fraction of GO annotations is predicted by more than one combination of dissimilarity measure and dataset.
c)
Dataset | Euclidean | Pearson | Poisson | Euclidean+Pearson | Euclidean+Poisson
Microarray | 688 | 1215 | / | 1731 | 688
SAGE | 188 | / | 230 | 188 | 407
Microarray+SAGE | 866 | 1215 | 230 | 1906 | 1081

d)
Dataset | Euclidean | Pearson | Poisson | Euclidean+Pearson | Euclidean+Poisson
Microarray | 569 | 950 | / | 1240 | 569
SAGE | 173 | / | 216 | 173 | 362
Microarray+SAGE | 720 | 950 | 216 | 1378 | 892

Table 2 - Number of putative new functional GO annotations (c) and of newly annotated genes (d).

• We have obtained 2113 putative new GO annotations involving 1540 genes and 194 GO terms (see table 2).

• The integration of our functional annotation results with the OMIM database allowed us to identify at least 59 interesting candidate genes potentially involved in human genetic diseases (see table 3).

Conclusion
We have developed an approach to analyze and integrate information obtained with different experimental techniques and with different dissimilarity measures, which explore different aspects of coexpression. The results show that this integration increases the amount of useful information obtained: combining datasets and dissimilarity measures yields more annotations than any single combination (see tables 1 and 2).