Genome Informatics 12: 437–439 (2001)
Latent Periodicity of Many Genes
Eugene Korotkov1,2, Nikolay Kudryaschov1
[email protected], [email protected]
1 Department of Cybernetics, Moscow Physical Engineering Institute, Moscow, 115409, Russian Federation
2 Center of Bioengineering, Russian Academy of Sciences, Moscow, 117312, Russian Federation
Keywords: latent periodicity, information decomposition, gene structuredness
1 Introduction
The development of mathematical methods for studying periodicity in symbolic sequences has become especially important. The main reason is the successful sequencing of DNA from various genomes and the accumulation of a great number of amino acid sequences. Mathematicians and biologists therefore face the problem of determining the structural features of these sequences and finding the biological meaning of the features that are revealed.
One such structural feature is the periodicity of symbolic sequences. Comprehensive mathematical methods were developed earlier for studying the periodicity of continuous and discrete numerical sequences; they use the Fourier transform and make it possible to define the spectral density of a numerical sequence. However, applying the Fourier transform to a symbolic sequence requires representing it as a numerical sequence in which the properties of the symbolic text are reflected unambiguously. The most widely used approach constructs, from the given symbolic sequence, m sequences of zeros and ones according to the rule: x(i, j) = 1 if the symbol ai occupies site j, and x(i, j) = 0 in all other cases. Here A = {a1, a2, ..., am} is the alphabet of the symbolic sequence and m is the size of that alphabet. The Fourier transform is then applied to each of these numerical sequences, and the Fourier harmonics corresponding to symbols of type i are calculated, as well as the matrix structure factors corresponding to pair correlations of symbols [6].
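The indicator-sequence construction described above can be illustrated with a minimal Python sketch. It assumes NumPy is available; the function names `indicator_sequences` and `power_spectra`, the mean subtraction, and the toy sequence are illustrative choices and are not taken from [6].

```python
import numpy as np

def indicator_sequences(seq, alphabet):
    """Return an (m, L) array x with x[i, j] = 1 if seq[j] == alphabet[i], else 0."""
    return np.array([[1.0 if s == a else 0.0 for s in seq] for a in alphabet])

def power_spectra(seq, alphabet):
    """Squared magnitudes of the Fourier harmonics of each indicator sequence."""
    x = indicator_sequences(seq, alphabet)
    # Subtract each row's mean so the zero-frequency term does not dominate.
    x = x - x.mean(axis=1, keepdims=True)
    return np.abs(np.fft.rfft(x, axis=1)) ** 2

seq = "ACG" * 20                 # toy sequence with an exact period of 3
spectra = power_spectra(seq, "ACGT")
total = spectra.sum(axis=0)      # sum the spectra over the four symbols
k = int(np.argmax(total))        # index of the strongest harmonic
period = len(seq) / k            # harmonic k corresponds to period L / k
print(k, period)
```

For this exactly periodic toy sequence the strongest harmonic sits at the index that corresponds to a period of 3. Real gene sequences with substitutions give much less sharp peaks.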
However, in our opinion this method works rather well for studying periods of relatively short length (shorter than the size of the alphabet of the symbolic sequence). For periods longer than the alphabet size, the statistical significance of the longer periods can be “decomposed” in favor of the shorter ones. The statistical significance of a longer period is, in effect, “spread” over the statistical significance of the shorter periods; that is, harmonics with longer periods are attenuated in favor of harmonics with shorter periods. This effect is even stronger when the periodic sequence contains several substitutions, so that its periods are not exactly identical.
The main purpose of this work is to present the results of our study of DNA sequences by the information decomposition (ID) method and to show that latent periodicity exists in many gene DNA sequences.
2 Method and Results
The main principles of the ID method were developed in earlier publications [1, 2, 3, 4, 5]. The ID method is similar to the Fourier transform method for numerical sequences, but it has the following advantages: 1. The calculation of the ID spectrum does not require any transformation of a symbolic sequence into numerical sequences; 2. ID reveals both obvious periodicity and latent periodicity of a symbolic sequence, in which there is no statistically significant similarity between any two periods; 3. The statistical significance of long periods is not spread over the statistical significance of shorter periods; 4. The type of periodicity can be determined from the matrix M. In this work we present the results of a study, by the ID method, of the DNA sequences from GenBank, release 122. The analysis found more than 1.5 × 10^6 DNA sequences in GenBank with different types of latent periodicity and period lengths from 6 to 200 bases. We also found 7 × 10^5 sequences with different types of triplet periodicity.
[Figure 1, panels A to D: Z plotted against period length in DNA bases.]
Figure 1: Examples of the latent periodicity of gene sequences. A: Deinococcus radiodurans gene for c-di-GMP phosphodiesterase (2867-5239 base pairs) from sequence AE002006. The DNA sequence from 3108 to 3963 bases has a latent periodicity of 120 bases with Z = 9.1. B: Methylobacterium extorquens methanol oxidation gene mxaE (165-1010 base pairs) from sequence AF017434. The DNA sequence from 232 to 1015 bases has a latent periodicity of 126 bases with Z = 7.5. C: Gene coding for the high-sulphur wool matrix protein B2A from sheep (73-561 base pairs) from SHPWMPBB. The DNA sequence from 373 to 634 bases has a latent periodicity of 5 bases with Z = 8.0. D: Two V-regions of the T-cell receptor alpha-chain gene from Fugu rubripes (13412-13718; 13850-13892) from FRTCRA1. The DNA sequence from 13628 to 14594 bases has a latent periodicity of 59 bases with Z = 7.5.
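The formula of the ID spectrum is not given in this text, so the following Python sketch rests on stated assumptions. It assumes that the spectrum value for a trial period n is the mutual information between the symbols and their positions modulo n (the pair counts acting as an alphabet-size by n matrix in the role of M), and that Z compares this value with shuffled copies of the sequence. The names `mutual_information` and `z_score` and the parameters `shuffles` and `seed` are hypothetical, and this is not presented as the authors' implementation.

```python
import math
import random
from collections import Counter

def mutual_information(seq, n):
    """Mutual information (in nats) between symbols and positions modulo n.

    Assumption of this sketch: the (symbol, position mod n) counts stand in
    for the alphabet-size by n matrix M mentioned in the text.
    """
    L = len(seq)
    pair = Counter((s, j % n) for j, s in enumerate(seq))
    sym = Counter(seq)
    pos = Counter(j % n for j in range(L))
    return sum(c / L * math.log(c * L / (sym[s] * pos[p]))
               for (s, p), c in pair.items())

def z_score(seq, n, shuffles=200, seed=0):
    """Z of the observed value against shuffled copies (Monte Carlo estimate)."""
    rng = random.Random(seed)
    observed = mutual_information(seq, n)
    chars = list(seq)
    samples = []
    for _ in range(shuffles):
        rng.shuffle(chars)  # keeps symbol composition, destroys the order
        samples.append(mutual_information(chars, n))
    mean = sum(samples) / shuffles
    var = sum((v - mean) ** 2 for v in samples) / (shuffles - 1)
    return (observed - mean) / math.sqrt(var)

seq = "ACGTTGCA" * 10  # toy sequence with an exact period of 8
print(mutual_information(seq, 8), z_score(seq, 8))
```

Because the measure works on symbol counts per position within the period, no conversion of the symbolic sequence into numerical sequences is needed, which matches the first advantage listed above. On real data the number of shuffles and the Z threshold would have to be chosen with care.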
3 Discussion
The developed method of information decomposition (ID) of symbolic sequences proved capable of revealing the latent structuredness of thousands of genes. The results obtained with the method show that many known genetic texts contain sequences with latent periodicity of various lengths and types that could not be revealed earlier. The origin of latent periodicity in genetic texts might be connected both with the evolution of genomes and protein molecules and with the functional meaning of various sequences. For example, a periodicity of 21 bases is usually connected with α-helix formation in protein molecules. Longer periodicity could be determined by the formation of domain organization in proteins; it could also be involved in the binding of nucleosomes to DNA. We observed a great variety of period lengths and types in DNA sequences. It is possible to assume that the ID method is able to “see” certain structural characteristics of gene sequences that reflect gene evolution or the spatial organization of the corresponding proteins. In this regard ID is an important method that makes it possible to connect the origin of certain protein structures with the presence of certain latent periodicity in the corresponding DNA sequences.
References
[1] Chaley, M.B., Korotkov, E.V., and Skryabin, K.G., Method revealing latent periodicity of the
nucleotide sequences modified for a case of small samples, DNA Research, 6:153–163, 1999.
[2] Korotkov, E.V., Korotkova, M.A., and Tulko, J.S., Latent sequence periodicity of some oncogenes
and DNA-binding protein genes, CABIOS, 13:37–44, 1997.
[3] Korotkova, M.A., Korotkov, E.V., and Rudenko, V.M., Latent periodicity of protein sequences,
Journal of Molecular Modelling, 5:103–115, 1999.
[4] Korotkov, E.V. and Korotkova, M.A., DNA regions with latent periodicity in some human clones,
DNA Sequence, 5:353–358, 1995.
[5] Korotkov, E.V., Korotkova, M.A., Rudenko, V.M., and Skryabin, K.G., Regions with the latent
periodicity in the amino acid sequences of the many proteins, Molekulyarnaya Biologiya (Russian),
33:1–8, 1999.
[6] Lobzin, V.V. and Chechetkin, V.R., Order and correlations in genomic DNA sequences, The
spectral approach, Uspekhi Fizicheskikh Nauk (Russian), 170:57–81, 2000.