
Genome Informatics 11: 327–328 (2000)

Estimating the Number of Mouse Genes and the Duplicated Regions within the Mouse Genome

Yasuhiko Wada (1), Tadashi Imanishi (2), Takashi Gojobori (2)

(1) Faculty of Agriculture, Saga University, 1 Honjyo, Saga city, Saga 840-8502, Japan
(2) Center for Information Biology, Natl. Inst. Genet., Mishima, Shizuoka 411-8540, Japan

Keywords: genome duplication, database, number of genes, mouse genome

1 Introduction

To elucidate the evolution of mammalian genomes, it is crucial to estimate the number of genes in the genome and to measure the degree of redundancy in the genome in various species. The number of human protein-coding genes was recently estimated at 35,000-40,000, though this figure remains controversial. In addition, traces of ancient duplications of extensive chromosomal regions are being discovered within the human genome.

In this study, we estimated the number of mouse genes using expressed sequence tags (ESTs) from a full-length cDNA library and a set of genes obtained by clustering mRNA sequences from DDBJ/EMBL/GenBank. We also estimated the duplicated chromosomal regions within the mouse genome using map information derived from the Mouse Genome Database and the numerous homologous gene pairs from DDBJ/EMBL/GenBank.

2 Estimating the number of mouse genes

To estimate the number of mouse genes, we adopted the method reported by Ewing and Green (2000) [1]. The method involves determining the overlap between two independently derived sets of gene sequences. The first set should contain full-length sequences for an unbiased sample of genes from the genome. The second set may contain sequences that are incomplete or redundant, provided they are accurate enough to reliably determine matches to genes in the first set.
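The overlap method described above is, in effect, a capture-recapture estimate: the fraction of the second set that is matched by the first set approximates the fraction of all genes that the first set covers. A minimal Python sketch of that arithmetic follows, where n1 is the number of genes in the first set, n2 the number of sequences in the second set, and m2 the number of second-set sequences matched by the first. The function name and the counts used in the example are hypothetical, chosen only to illustrate the calculation; they are not values from this study.

```python
def estimate_gene_count(n1: int, n2: int, m2: int) -> float:
    """Overlap-based estimate G = n1 * n2 / m2.

    n1: number of genes in the first (full-length, unbiased) set
    n2: number of sequences in the second set
    m2: number of second-set sequences matched by the first set
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("n1 and n2 must be positive")
    if m2 <= 0:
        raise ValueError("m2 must be positive: with no matches there is no estimate")
    if m2 > n2:
        raise ValueError("m2 cannot exceed n2")
    return n1 * n2 / m2


# Hypothetical counts (NOT from the paper): 50 of 1,000 second-set
# sequences match a first set of 4,000 genes.
example_estimate = estimate_gene_count(n1=4000, n2=1000, m2=50)
print(example_estimate)
```

Because m2 sits in the denominator, any change in the matching rule that increases the number of accepted matches lowers the estimate proportionally.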
Under these assumptions, if the first set has $n_1$ genes, the second set has $n_2$ genes, and the number of sequences in the second set that are matched by the first is $m_2$, the total number $G$ of genes in the genome may be estimated as $G = n_1 n_2 / m_2$.

For the first set of gene sequences, we used a set of 3,752 genes obtained by clustering mRNA sequences from GenBank (release 118). For the second set, we used redundant expressed sequence tags (Riken-MEI 4.02) from the full-length cDNA library generated by the Genome Exploration Research Group of the RIKEN Tsukuba Life Science Center [2]. When comparing the two sequence sets, we accepted only matches in which the aligned region was at least 100 bases long and the sequences showed 95% or higher identity.

According to our preliminary result, the total number of mouse genes was estimated as 75,327. However, the estimate depends heavily on the threshold for sequence matches: if we accept matches with a lower level of sequence identity, more sequences are counted as matched and the estimated number becomes much smaller.

3 Estimating the duplicated regions within the mouse genome

We used a set of 2,556 amino acid sequences that have links to GenBank accession numbers and mapping positions in the Mouse Genome Database. The amino acid sequences of the mRNAs were obtained from GenBank release 118. To search for homologous gene pairs, we performed an all-against-all FASTP [3] search of the amino acid sequences using the fasta3.1 package. A gene pair was defined as homologous when the FASTP expect value met the 1.0E-5 threshold, the length of the overlapped region was over 100 base pairs, and the ratio of the overlapped region to the original length of the longer of the two sequences was over 0.8.

We defined candidate duplicated regions as pairs of chromosomal regions that share more than two homologous gene pairs, with the genes located within 5 cM of one another on each chromosome. Tandem repeats were defined as homologous genes located within the same 5 cM region.
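The pair-selection criteria above can be expressed as a simple filter. The following Python sketch is illustrative only, not the authors' implementation. The Hit record and its field names are hypothetical and do not correspond to the fasta3.1 output format. The direction of the expect-value test (values at or below 1.0E-5 pass, the usual convention for significance) is an assumption about the intended reading of the criterion, as is treating "within 5 cM" as a distance of at most 5 cM on the same chromosome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise search result (hypothetical record layout)."""
    gene_a: str
    gene_b: str
    evalue: float   # expect value reported for the alignment
    overlap: int    # length of the overlapped (aligned) region
    len_a: int      # full length of sequence A
    len_b: int      # full length of sequence B


def is_homologous_pair(hit: Hit,
                       max_evalue: float = 1.0e-5,
                       min_overlap: int = 100,
                       min_ratio: float = 0.8) -> bool:
    """Apply the three criteria: expect value, overlap length, overlap ratio.

    ASSUMPTION: a hit passes the expect-value test when evalue <= max_evalue.
    """
    longer = max(hit.len_a, hit.len_b)
    return (hit.evalue <= max_evalue
            and hit.overlap > min_overlap
            and hit.overlap / longer > min_ratio)


def is_tandem(chrom_a: str, cm_a: float,
              chrom_b: str, cm_b: float,
              window_cm: float = 5.0) -> bool:
    """True when two homologous genes lie on the same chromosome within window_cm.

    ASSUMPTION: "within 5 cM" is read as an absolute map distance <= 5 cM.
    """
    return chrom_a == chrom_b and abs(cm_a - cm_b) <= window_cm


# Made-up hits for illustration only.
hits = [
    Hit("geneA", "geneB", 1e-20, 350, 400, 380),  # passes all three tests
    Hit("geneC", "geneD", 1e-20, 300, 400, 380),  # overlap ratio 0.75: rejected
    Hit("geneE", "geneF", 1e-3, 350, 400, 380),   # expect value too large: rejected
]
homologous = [h for h in hits if is_homologous_pair(h)]
```

Keeping the thresholds as keyword arguments makes it easy to see how sensitive the resulting set of pairs is to each cutoff.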
For each candidate, the probability that the expected number of duplicated regions is zero was calculated using the following formula:

$$P = \exp\left(-P_x \, M(M-1)/2\right)$$

where $P_x = x!\,C^x$, $x$ is the number of non-homologous genes located on the duplicated region, $C = (\text{total number of homologous gene pairs}) / \{N(N-1)/2\}$, $N$ is the total number of genes after each tandem repeat is counted as one gene, $M$ is the number of combinations of $x$ genes linked together within a $D_{max}$ cM region that can be drawn from the whole genome, and $D_{max}$ is the maximum distance between two genes located on the duplicated region. In this study, we defined regions for which this probability exceeded 0.95 as chromosomal duplicated regions.

We found 89 tandem repeats containing 260 genes. After the statistical test, 27 candidate pairs of duplicated regions among chromosomes were found, and 129 genes were located in the duplicated regions. More detailed information is available on our WWW page [4]. The average identity of gene pairs located on all duplicated regions was 42.4%, and the average overlapped length of two amino acid sequences was 432.1. Seven procollagen genes, three Hox gene clusters, six integrin genes, three fibroblast growth factor receptor genes and three eyes absent homolog genes were located on the duplicated regions.

Figure 1: Relationship among six chromosomal regions related to Hox gene clusters in the mouse genome.

Acknowledgements

The authors would like to acknowledge Dr. H. Yasue (National Institute of Animal Industry) for helpful discussion. This study was carried out under the ISM Cooperative Research Program (2000-ISM1019).

References

[1] Ewing, B. and Green, P., Analysis of expressed sequence tags indicates 35,000 human genes, Nature Genetics, 25:232–234, 2000.
[2] http://genome.rtc.riken.go.jp/
[3] Pearson, W.R. and Lipman, D.J., Improved tools for biological sequence comparison, Proc. Natl. Acad. Sci. USA, 85:2444–2448, 1988.
[4] http://genome.ag.saga-u.ac.jp/genome/duplicate/mouse/
