Genome Informatics 11: 327–328 (2000)
Estimating the Number of Mouse Genes and the
Duplicated Regions within the Mouse Genome
Yasuhiko Wada1 [email protected]
Tadashi Imanishi2 [email protected]
Takashi Gojobori2 [email protected]
1 Faculty of Agriculture, Saga University, 1 Honjyo, Saga city, Saga 840-8502, Japan
2 Center for Information Biology, Natl. Inst. Genet., Mishima, Shizuoka 411-8540, Japan
Keywords: genome duplication, database, number of genes, mouse genome
1 Introduction
To elucidate the evolution of mammalian genomes, it is crucial to estimate the number of genes in the
genome and to measure the degree of redundancy in the genome in various species. The number of
human protein-coding genes was recently estimated as 35,000-40,000, though this figure is still
controversial. Also, traces of ancient duplications of extensive chromosomal regions have been
discovered within the human genome. In this study, we estimated the number of mouse genes using
expressed sequence tags from a full-length cDNA library and a set of genes obtained by clustering
mRNA sequences from DDBJ/EMBL/GenBank. We also estimated the duplicated chromosomal regions within the mouse
genome using the map information derived from the Mouse Genome Database and the numerous
homologous gene pairs from DDBJ/EMBL/GenBank.
2 Estimating the number of mouse genes
To estimate the number of mouse genes, we adopted the method reported by Ewing and Green
(2000) [1]. The method involves determining the overlap between two independently derived sets of gene
sequences. The first set should contain full-length sequences for an unbiased sample of genes from
the genome. The second set may have sequences that are incomplete or redundant, provided they are
accurate enough to reliably determine matches to genes in the first set. Under these assumptions, if
the first set has $n_1$ genes, the second set has $n_2$ genes, and the number of sequences in the second
set that are matched by the first is $m_2$, the total number $G$ of genes in the genome may be estimated
as $G = n_1 n_2 / m_2$.
For the first set of gene sequences, we used a set of 3,752 genes obtained by clustering mRNA
sequences from GenBank (release 118). For the second set, we used redundant expressed sequence tags
(Riken-MEI 4.02) from the full-length cDNA library generated by the Genome Exploration Research Group
of the RIKEN Tsukuba Life Science Center [2]. For comparison of the two sequence sets, we only accepted
matches in which the aligned regions were a minimum of 100 bases and the sequences showed 95% or
higher identity. According to our preliminary result, the total number of mouse genes was estimated
as 75,327. However, the estimated number is heavily dependent on the threshold for sequence matches;
if we accept matches that show a lower level of sequence identity, the number of matches in the
denominator of the estimator increases and the estimated number becomes much smaller.
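The estimator and the match threshold described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the values of n2 and m2 in the usage example are made up for demonstration, since the text reports only n1 = 3,752 and the final estimate.

```python
def accept_match(aligned_length: int, percent_identity: float,
                 min_length: int = 100, min_identity: float = 95.0) -> bool:
    """Return True if an alignment meets the thresholds stated in the text:
    at least 100 aligned bases and at least 95% identity."""
    return aligned_length >= min_length and percent_identity >= min_identity


def estimate_gene_count(n1: int, n2: int, m2: int) -> float:
    """Overlap-based estimate G = n1 * n2 / m2.

    n1: number of genes in the first (full-length, unbiased) set
    n2: number of sequences in the second set
    m2: number of sequences in the second set matched by the first set
    """
    if m2 <= 0:
        raise ValueError("m2 must be positive; no overlap gives no estimate")
    return n1 * n2 / m2


if __name__ == "__main__":
    # n1 comes from the text; n2 and m2 are hypothetical placeholders.
    n1, n2, m2 = 3752, 10000, 500
    print(estimate_gene_count(n1, n2, m2))
    # Relaxing the identity threshold accepts more matches (larger m2),
    # which lowers the estimate.
    print(estimate_gene_count(n1, n2, 2 * m2))
```

Doubling m2 in the second call halves the estimate, which is the sensitivity to the match threshold noted in the text.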
3 Estimating the duplicated regions within the mouse genome
We used a set of 2,556 amino acid sequences that have links to GenBank accession numbers and
mapping positions in the Mouse Genome Database. The amino acid sequences of the mRNA were
obtained from GenBank (release 118). To search for homologous gene pairs, we performed a FASTP [3]
search among all the amino acid sequences using the fasta3.1 package. The criteria for defining a
homologous gene pair were that the expect value of the FASTP result is below 1.0E-5, the length of the
overlapped region is over 100 base pairs, and the ratio of the overlapped region to the original length
of the longer of the two sequences is over 0.8. We defined candidate duplicated regions as regions
that have more than two homologous gene pairs located within 5 cM on each chromosome. Tandem repeats
were defined as homologous genes located within a 5 cM region.
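The pair-selection criteria above can be written as a simple filter. The following Python sketch is an illustration only: the `Hit` record and both function names are hypothetical, and it assumes that the expect-value criterion means hits more significant (smaller) than 1.0E-5, and that a tandem repeat means two homologous genes on the same chromosome no more than 5 cM apart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise search result (hypothetical record layout)."""
    evalue: float          # expect value reported by the search
    overlap_length: int    # length of the overlapped (aligned) region
    length_a: int          # full length of the first sequence
    length_b: int          # full length of the second sequence


def is_homologous_pair(hit: Hit, max_evalue: float = 1.0e-5,
                       min_overlap: int = 100, min_ratio: float = 0.8) -> bool:
    """Apply the three criteria: significant expect value (assumed to mean
    smaller than the threshold), overlap over 100, and overlap covering
    more than 0.8 of the longer sequence."""
    longer = max(hit.length_a, hit.length_b)
    if longer <= 0:
        return False
    return (hit.evalue < max_evalue
            and hit.overlap_length > min_overlap
            and hit.overlap_length / longer > min_ratio)


def is_tandem(chrom_a: str, pos_a_cm: float,
              chrom_b: str, pos_b_cm: float, window_cm: float = 5.0) -> bool:
    """Assumed reading of the tandem-repeat rule: same chromosome and map
    positions within window_cm of each other."""
    return chrom_a == chrom_b and abs(pos_a_cm - pos_b_cm) <= window_cm


if __name__ == "__main__":
    print(is_homologous_pair(Hit(1e-10, 150, 160, 170)))
    print(is_tandem("11", 56.0, "11", 58.5))
```

Pairs that pass `is_homologous_pair` but also satisfy `is_tandem` would be collapsed before searching for duplicated regions between different locations.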
The probability that the expected number of duplicated regions is zero was calculated using
the following formula:
$$P = \exp\left(-P_x \, \frac{M(M-1)}{2}\right)$$
where $P_x = x!\,C^x$, $x$ is the number of non-homologous genes located on the duplicated region,
$C = (\text{total number of homologous gene pairs}) / \{N(N-1)/2\}$, $N$ is the total number of genes
after each tandem repeat is counted as one gene, $M$ is the number of combinations of $x$ genes linked
together within a $D_{max}$ cM region across the whole genome, and $D_{max}$ is the maximum distance
between two genes located on the duplicated region. In this study, we defined a region for which this
probability was over 0.95 as a chromosomal duplicated region.
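As a worked illustration of the formula above, the following Python sketch evaluates P for given inputs. It reads the garbled exponent in the definition of Px as C raised to the power x, which is an assumption; the function names and the numbers in the usage example are hypothetical and are not taken from the study.

```python
import math


def homologous_pair_fraction(total_homologous_pairs: int, n_genes: int) -> float:
    """C = (total homologous gene pairs) / (N (N - 1) / 2)."""
    if n_genes < 2:
        raise ValueError("need at least two genes to form a pair")
    return total_homologous_pairs / (n_genes * (n_genes - 1) / 2)


def prob_zero_chance_regions(x: int, total_homologous_pairs: int,
                             n_genes: int, m: int) -> float:
    """P = exp(-Px * M (M - 1) / 2), with Px = x! * C**x (assumed reading).

    x: number of non-homologous genes on the candidate region
    n_genes: N, gene count after collapsing each tandem repeat to one gene
    m: M, number of combinations of x linked genes within Dmax cM genome-wide
    """
    c = homologous_pair_fraction(total_homologous_pairs, n_genes)
    p_x = math.factorial(x) * c ** x
    return math.exp(-p_x * m * (m - 1) / 2)


if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise the formula.
    p = prob_zero_chance_regions(x=3, total_homologous_pairs=10,
                                 n_genes=100, m=1000)
    # A candidate is accepted when P exceeds 0.95.
    print(p, p > 0.95)
```

Because Px shrinks rapidly as x grows, candidates supported by more genes reach the 0.95 cutoff more easily for the same M.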
In this study, we found 89 tandem repeats containing 260 genes. After the statistical test, 27
candidate pairs of duplicated regions among chromosomes were found, and 129 genes were located in
the duplicated regions. More detailed information is available on our WWW page [4]. The average
% identity of gene pairs located on all duplicated regions was 42.4%. The average overlapped length
of the two amino acid sequences was 432.1. Seven procollagen genes, three Hox gene clusters, six
integrin genes, three fibroblast growth factor receptor genes and three eyes absent homolog genes
were located on the duplicated regions.
Figure 1: Relationship among six chromosomal regions related to Hox gene clusters in the mouse genome.
Acknowledgements
The authors would like to acknowledge Dr. H. Yasue (National Institute of Animal Industry) for
helpful discussion. This study was carried out under the ISM Cooperative Research Program (2000-ISM1019).
References
[1] Ewing, B. and Green, P., Analysis of expressed sequence tags indicates 35,000 human genes,
Nature Genetics, 25:232–234, 2000.
[2] http://genome.rtc.riken.go.jp/
[3] Pearson, W.R. and Lipman, D.J., Improved tools for biological sequence comparison, Proc. Natl.
Acad. Sci. USA, 85:2444–2448, 1988.
[4] http://genome.ag.saga-u.ac.jp/genome/duplicate/mouse/