Comparative Sequence Analysis between Human and Mouse: Promoter Conservation and Protein Conservation

Hirokazu Chiba
Riu Yamashita
Kengo Kinoshita
Kenta Nakai

Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan

Keywords: promoter region, comparative analysis, alignment

1 Introduction

Comparative sequence analysis is a powerful tool for extracting functional information from the genomes of organisms. Completion of the human and mouse genomes led to fruitful comparative analyses, providing many insights into non-coding regions as well as protein-coding regions. Shortly after the two genomes were completed, large-scale collections of the 5' ends of human and mouse cDNAs were constructed, which made comprehensive and reliable identification of promoter regions possible. Based on these data, some pioneering studies of human and mouse promoter regions were carried out. Suzuki et al. [1] identified blocks of highly conserved regions in orthologous promoter sequences, and Iwama and Gojobori [2] found that transcription factor genes and developmental process genes show the highest degree of upstream sequence conservation.

Our group previously constructed DBTSS (DataBase of human Transcription Start Sites), which has recently been updated [3]. In this study, we performed a comparative analysis between human and mouse based on DBTSS, with two aims. First, we tried to carry out the most comprehensive comparison to date and to investigate the relationship between promoter conservation and gene function. Second, we tried to determine what kind of relationship exists between promoter conservation and protein conservation.

2 Materials and Methods

2.1 Alignment of Orthologous Promoters and Orthologous Proteins

DBTSS provides a representative TSS for each gene in the human and mouse genomes, together with an ortholog table for comparing TSS data. In this study, the sequence from -1000 to +200 relative to the representative TSS was defined as the promoter sequence. Protein sequences were obtained from the NCBI Reference Sequence (RefSeq) database and associated with the corresponding genes. As a result, we obtained 8429 pairs of one-to-one orthologous genes with both promoter sequences and protein sequences. Promoter pairs and protein pairs were each aligned with water, the local alignment program of the EMBOSS package (ftp://emboss.open-bio.org/pub/EMBOSS/). To assess the conservation of promoter sequences, the raw alignment score was used. For proteins, the raw score seems inappropriate because it depends strongly on protein length. Therefore, we used the percentage identity for protein sequences.

2.2 GO Annotation of Genes and Significance Test for Conservation

The Gene Ontology (GO) is widely used for annotating genes. The GO annotation for each human and mouse gene was obtained from the gene2go file at NCBI (ftp://ftp.ncbi.nih.gov/gene/DATA/). To summarize the attributes of genes, we used ‘GO slim’, a slimmed-down version of the GO vocabulary. Each GO term was mapped to GO slim terms using map2slim.pl of the go-perl package (http://search.cpan.org/~cmungall/go-perl/). A set of high-level terms was selected to cover most aspects of each of the three ontologies. For each GO term, we tested whether the alignment scores of the genes associated with the term were significantly high using the Wilcoxon rank sum test. The control group for a GO term was the set of genes that are not associated with that term but are associated with other terms. Similarly, we tested whether the alignment scores were significantly low. For proteins, the percentage identity was used instead of the alignment score.
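
The significance test of Section 2.2 can be illustrated with a small sketch. The paper does not say which software performed the test, so the code below is only one possible reading: a pure-Python Wilcoxon rank sum test using the normal approximation, without tie or continuity correction. The function names (average_ranks, rank_sum_test) and the toy scores are hypothetical and are not data from this study.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(term_scores, control_scores, alternative="greater"):
    """One-sided Wilcoxon rank sum test (normal approximation).

    term_scores:    scores of genes annotated with the GO term
    control_scores: scores of genes annotated only with other terms
    alternative:    "greater" tests for high scores, "less" for low scores
    Returns (z, p). No tie or continuity correction is applied.
    """
    n1, n2 = len(term_scores), len(control_scores)
    ranks = average_ranks(list(term_scores) + list(control_scores))
    w = sum(ranks[:n1])                      # rank sum of the term group
    mean_w = n1 * (n1 + n2 + 1) / 2.0        # expectation under the null
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z >= z)
    p = upper_tail if alternative == "greater" else 1.0 - upper_tail
    return z, p


# Hypothetical toy alignment scores, for illustration only.
term = [1350, 1420, 1280, 1500, 1390]
control = [900, 1010, 870, 1100, 950, 980]

z_high, p_high = rank_sum_test(term, control, "greater")
z_low, p_low = rank_sum_test(term, control, "less")
print("high conservation: z = %.3f, p = %.4f" % (z_high, p_high))
print("low conservation:  z = %.3f, p = %.4f" % (z_low, p_low))
```

For proteins, the same function would be applied to percentage identities instead of alignment scores. With thousands of genes per comparison the normal approximation is usually adequate, but an established statistics package that handles ties and small samples would be preferable for real analyses.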

3 Results and Discussion

3.1 Promoter Conservation and Gene Function

We aligned the promoter sequences of human and mouse. The distribution of alignment scores had two peaks: a major peak around 1000 and a minor peak below 100. When we aligned non-orthologous promoters generated by shuffling the pairs, the score distribution precisely matched the minor peak of the orthologous distribution. It is therefore plausible that the minor peak corresponds to improperly paired “pseudo-orthologous” promoter sequences. We discarded promoter pairs with scores of less than 200. The remaining 6901 of the 8429 promoter pairs were used in the following analysis.

We examined the relationship between promoter conservation and gene function. The terms with the most significantly high conservation were “development” and “transcription factor activity”, which is consistent with previous reports [2, 4]. Other terms with high conservation included “signal transduction” and “cell-cell signaling”. On the other hand, the terms with the most significantly low conservation were “oxidoreductase activity”, “mitochondrion”, “ribosome”, and “lysosome”. If promoter conservation reflects how strictly gene expression is regulated, these results suggest that genes conveying signals require relatively strict regulation, while genes for energy generation, macromolecule biosynthesis, and digestion may not be so strictly regulated.

3.2 Promoter Conservation and Protein Conservation

We aligned the protein sequences and examined the relationship between protein conservation and gene function in the same way. Terms with significantly high or low protein conservation were identified. We then examined the relationship between promoter conservation and protein conservation. The correlation between them was weak (Kendall’s rank correlation coefficient was 0.179). Notably, we found that some GO terms with significantly high promoter conservation show significantly low protein conservation.
These are “receptor binding activity” and “extracellular matrix”. Conversely, there were GO terms with significantly low promoter conservation and significantly high protein conservation: “ribosome” and “protein biosynthesis”. These results suggest that promoter sequences and protein sequences might evolve under different constraints.

References

[1] Suzuki, Y., Yamashita, R., Shirota, M., Sakakibara, Y., Chiba, J., Mizushima-Sugano, J., Nakai, K., and Sugano, S., Sequence comparison of human and mouse genes reveals a homologous block structure in the promoter regions, Genome Res., 14(9):1711–1718, 2004.
[2] Iwama, H., and Gojobori, T., Highly conserved upstream sequences for transcription factor genes and implications for the regulatory network, Proc. Natl Acad. Sci. USA, 101(49):17156–17161, 2004.
[3] Yamashita, R., Suzuki, Y., Wakaguri, H., Tsuritani, K., Nakai, K., and Sugano, S., DBTSS: DataBase of human transcription start sites, progress report 2006, Nucleic Acids Res., 34(Database issue):D86–D89, 2006.
[4] Woolfe, A., Goodson, M., Goode, D.K., Snell, P., McEwen, G.K., Vavouri, T., Smith, S.F., North, P., Callaway, H., Kelly, K., Walter, K., Abnizova, I., Gilks, W., Edwards, Y.J., Cooke, J.E., and Elgar, G., Highly conserved non-coding sequences are associated with vertebrate development, PLoS Biol., 3(1):e7, 2005.
