Comparative Sequence Analysis between Human and Mouse:
Promoter Conservation and Protein Conservation
Hirokazu Chiba, Riu Yamashita, Kengo Kinoshita, Kenta Nakai
Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku,
Tokyo 108-8639, Japan
Keywords: promoter region, comparative analysis, alignment
1 Introduction
Comparative sequence analysis is a powerful tool to extract functional information from genomes of
organisms. Completion of the human and mouse genomes led to fruitful comparative analyses, providing
many insights into non-coding regions as well as into protein coding regions. Shortly after the completion of
human and mouse genomes, large-scale collections of the 5' ends of cDNAs for human and mouse were
constructed, which made possible comprehensive and reliable identification of promoter regions. Based on
these data, some pioneering works for human and mouse promoter regions were carried out. Suzuki et al. [1]
identified blocks of highly conserved regions in orthologous promoter sequences, and Iwama and Gojobori
[2] found that transcription factors and developmental process genes show the highest degree of upstream
sequence conservation.
Our group previously constructed DBTSS, the DataBase of human Transcriptional Start Sites, which has
recently been updated [3]. In this study, we performed a comparative analysis between human and mouse based on
DBTSS, with two aims. First, we sought to carry out the most comprehensive comparison to date and to
investigate the relationship between promoter conservation and gene function. Second, we sought to clarify
what relationship exists between promoter conservation and protein conservation.
2 Materials and Methods
2.1 Alignment of Orthologous Promoters and Orthologous Proteins
DBTSS provides a representative TSS for each gene in human and mouse genomes, and an orthologous table
for comparison of TSS data. In this study, sequences from -1000 to +200 relative to the representative TSSs
were defined as promoter sequences. Protein sequences were obtained from the NCBI Reference Sequence
(RefSeq) database and associated with the corresponding genes. As a result, we obtained 8429 pairs of one-to-one orthologous genes with
both promoter sequences and protein sequences. Promoter pairs and protein pairs were each aligned
with water, the local alignment program of the EMBOSS package (ftp://emboss.open-bio.org/pub/EMBOSS/). To
assess the conservation of promoter sequences, the raw alignment score was used. For proteins, the
raw score seems inappropriate because it depends largely on protein length. Therefore, we
used the percentage identity for protein sequences.
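To illustrate the two conservation measures described above, the following is a minimal sketch of a Smith-Waterman local alignment that returns both the raw score and the percentage identity. It is not the EMBOSS water implementation: the function name `local_alignment`, the linear gap penalty, and the default scoring values (match 5, mismatch -4, gap -10) are assumptions chosen for illustration, and water itself uses substitution matrices and affine gap penalties.

```python
def local_alignment(a, b, match=5, mismatch=-4, gap=-10):
    """Smith-Waterman local alignment with a linear gap penalty.

    Illustrative only: scoring parameters are assumptions, not the
    settings used in the study. Returns (raw_score, percent_identity).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic programming matrix; scores never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            s = max(0,
                    score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                    score[i - 1][j] + gap,      # gap in b
                    score[i][j - 1] + gap)      # gap in a
            score[i][j] = s
            if s > best:
                best, best_pos = s, (i, j)

    # Trace back from the best cell to count identical alignment columns.
    i, j = best_pos
    identical = columns = 0
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1

    identity = 100.0 * identical / columns if columns else 0.0
    return best, identity
```

With these assumed parameters, the raw score grows with the length of the aligned region, whereas the percentage identity is normalized by the number of alignment columns, which is why the latter is the more length-independent measure.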
2.2 GO Annotation of Genes and Significance Test for Conservation
The Gene Ontology (GO) is widely used for annotating genes. The GO annotation for each human
and mouse gene was obtained from the gene2go file at NCBI (ftp://ftp.ncbi.nih.gov/gene/DATA/). In this study, to
summarize the attributes of genes, a slimmed-down version of the GO vocabulary, ‘GO slim’, was used. Each
GO term was mapped to GO slim terms using map2slim.pl of the go-perl package
(http://search.cpan.org/~cmungall/go-perl/). A set of high-level terms was selected to cover most aspects of
each of the three ontologies. For each GO term, we used the Wilcoxon rank sum test to test whether the
alignment scores of the genes associated with the term are significantly high. The control group for a GO term is the set of genes that are
not associated with that term but are associated with other terms. Similarly, we tested whether the alignment scores were
significantly low. For proteins, the percentage identity was used instead of the alignment score.
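The test described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `split_by_term` and `rank_sum_test` and the input dictionaries are hypothetical, and the p-values use the normal approximation without a tie correction, which is an assumption that is reasonable only for large groups.

```python
import math


def split_by_term(scores, annotations, term):
    """Split gene scores into the term group and its control group.

    scores: dict mapping gene -> conservation score (hypothetical input).
    annotations: dict mapping gene -> set of GO slim terms (hypothetical input).
    The control group holds genes annotated with other terms but not `term`.
    """
    group = [scores[g] for g, t in annotations.items()
             if g in scores and term in t]
    control = [scores[g] for g, t in annotations.items()
               if g in scores and t and term not in t]
    return group, control


def rank_sum_test(group, control):
    """Wilcoxon rank sum test using the normal approximation.

    Returns (z, p_high, p_low): one-sided p-values for the group scoring
    significantly higher or lower than the control.
    """
    n1, n2 = len(group), len(control)
    pooled = sorted(group + control)

    # Assign midranks so that tied values share the same average rank.
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2.0
        i = j

    w = sum(ranks[x] for x in group)                 # rank sum of the group
    mean = n1 * (n1 + n2 + 1) / 2.0                  # expected rank sum
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)   # no tie correction
    z = (w - mean) / sd
    p_high = 0.5 * math.erfc(z / math.sqrt(2))
    p_low = 0.5 * math.erfc(-z / math.sqrt(2))
    return z, p_high, p_low
```

The same function serves both directions of the test: a small `p_high` indicates significantly high conservation for the term, and a small `p_low` indicates significantly low conservation.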
3 Results and Discussion
3.1 Promoter Conservation and Gene Function
We aligned the promoter sequences of human and mouse. The distribution of alignment scores had two peaks: a
major peak around 1000 and a minor peak below 100. When we aligned non-orthologous promoters
generated by shuffling pairs, the score distribution precisely matched the minor peak of the orthologous distribution. It is therefore
plausible that the minor peak corresponds to improperly paired “pseudo-orthologous” promoter
sequences. We discarded promoter pairs with scores less than 200. The remaining 6901 of the 8429
promoter pairs were used in the following analysis.
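The shuffling control and the score cutoff can be sketched as below. The exact shuffling procedure is not specified in the text, so the rotation scheme here is an assumption, and the function names `shuffled_pairs` and `filter_by_score` are hypothetical.

```python
import random


def shuffled_pairs(pairs, seed=0):
    """Re-pair human and mouse genes so that no original pair survives.

    pairs: list of (human_gene, mouse_gene) one-to-one ortholog pairs.
    Assumed scheme: shuffle the pair order, then give each human gene the
    mouse partner of the next pair in that order. Needs at least two pairs.
    """
    humans = [h for h, _ in pairs]
    mice = [m for _, m in pairs]
    order = list(range(len(pairs)))
    random.Random(seed).shuffle(order)
    n = len(order)
    return [(humans[order[k]], mice[order[(k + 1) % n]]) for k in range(n)]


def filter_by_score(scored_pairs, cutoff=200):
    """Keep pairs with an alignment score of at least `cutoff`.

    scored_pairs: list of (pair, score) tuples. The default of 200 follows
    the threshold stated in the text.
    """
    return [(pair, s) for pair, s in scored_pairs if s >= cutoff]
```

Aligning the output of `shuffled_pairs` gives a background score distribution for non-orthologous promoters, against which the minor peak of the orthologous distribution can be compared.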
We examined the relationship between promoter conservation and gene function. The terms with the most
significantly high conservation were “development” and “transcription factor activity”. This is consistent
with previous reports [2, 4]. Other terms with high conservation included “signal transduction” and “cell-cell
signaling”. On the other hand, the terms with the most significantly low conservation were “oxidoreductase
activity”, “mitochondrion”, “ribosome”, and “lysosome”. Assuming that promoter conservation reflects
how strictly gene expression is regulated, these results suggest that genes conveying signals require
relatively strict regulation, while genes for energy generation, macromolecule biosynthesis, and digestion
may not be so strictly regulated.
3.2 Promoter Conservation and Protein Conservation
We aligned the protein sequences and similarly examined the relationship between protein conservation and gene
function. Terms with significantly high or low protein conservation were identified. We
subsequently examined the relationship between promoter conservation and protein conservation. The
correlation between them was weak (Kendall’s rank correlation was 0.179). Notably, we found that some
GO terms with significantly high promoter conservation show significantly low protein conservation:
“receptor binding activity” and “extracellular matrix”. Conversely, there were GO terms with
significantly low promoter conservation and significantly high protein conservation: “ribosome”
and “protein biosynthesis”. These results suggest that promoter sequences and protein sequences might
evolve under different constraints.
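For reference, Kendall's rank correlation between two conservation measures can be computed as in the sketch below. The text does not say which variant of the statistic was used; this example assumes the tau-b variant, which adjusts for ties, and the function name `kendall_tau_b` is hypothetical. The quadratic pairwise loop is adequate for a few thousand gene pairs but is not optimized.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length lists of scores.

    x could hold promoter alignment scores and y protein percentage
    identities for the same genes (illustrative usage only).
    """
    concordant = discordant = only_x = only_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue            # tied in both variables: ignored
            if dx == 0:
                only_x += 1         # tied in x only
            elif dy == 0:
                only_y += 1         # tied in y only
            elif dx * dy > 0:
                concordant += 1     # same ordering in x and y
            else:
                discordant += 1     # opposite ordering
    denom = math.sqrt((concordant + discordant + only_x)
                      * (concordant + discordant + only_y))
    return (concordant - discordant) / denom if denom else 0.0
```

A value near 0, such as the 0.179 reported above, means that knowing a gene's rank in promoter conservation says little about its rank in protein conservation.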
References
[1] Suzuki, Y., Yamashita, R., Shirota, M., Sakakibara, Y., Chiba, J., Mizushima-Sugano, J., Nakai, K., and
Sugano, S., Sequence comparison of human and mouse genes reveals a homologous block structure in
the promoter regions, Genome Res., 14(9):1711–1718, 2004.
[2] Iwama, H., and Gojobori, T., Highly conserved upstream sequences for transcription factor genes and
implications for the regulatory network, Proc. Natl Acad. Sci. USA, 101(49):17156–17161, 2004.
[3] Yamashita, R., Suzuki, Y., Wakaguri, H., Tsuritani, K., Nakai, K., and Sugano, S., DBTSS: DataBase of
human transcription start sites, progress report 2006, Nucleic Acids Res., 34(Database issue):D86–D89,
2006.
[4] Woolfe, A., Goodson, M., Goode, D.K., Snell, P., McEwen, G.K., Vavouri, T., Smith, S.F., North, P.,
Callaway, H., Kelly, K., Walter, K., Abnizova, I., Gilks, W., Edwards, Y.J., Cooke, J.E., and Elgar, G.,
Highly conserved non-coding sequences are associated with vertebrate development, PLoS Biol., 3(1):e7,
2005.