Genome Informatics 12: 437–439 (2001)

Latent Periodicity of Many Genes

Eugene Korotkov (1,2) [email protected]
Nikolay Kudryaschov (1) [email protected]

1 Department of Cybernetics, Moscow Physical Engineering Institute, Moscow, 115409, Russian Federation
2 Center of Bioengineering, Russian Academy of Sciences, Moscow, 117312, Russian Federation

Keywords: latent periodicity, information decomposition, gene structuredness

1 Introduction

The development of mathematical methods for studying the periodicity of symbolic sequences has become especially important. This is mainly due to the successful sequencing of DNA from various genomes and the accumulation of a great number of amino acid sequences. Mathematicians and biologists therefore face the problem of determining the structural features of these sequences and finding the biological meaning of the features revealed. One such structural feature is the periodicity of symbolic sequences.

Comprehensive mathematical methods were developed earlier for studying the periodicity of continuous and discrete numerical sequences. They use the Fourier transformation and allow the spectral density of a numerical sequence to be defined. Applying the Fourier transformation to a symbolic sequence, however, requires representing it as a numerical sequence that reflects the properties of the symbolic text unambiguously. The most widely used method constructs, from the given symbolic sequence, m sequences of zeros and ones according to the rule: x(i, j) = 1 if the symbol a_i occupies site j, and x(i, j) = 0 otherwise. Here A = {a_1, a_2, ..., a_m} is the alphabet of the symbolic sequence and m is the size of the alphabet.
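The indicator-sequence construction described above can be illustrated with a minimal sketch. It is not code from the paper: the function names (`indicator_sequences`, `power_spectrum`, `total_spectrum`), the choice of the DNA alphabet "ACGT", the normalization by sequence length, and the toy sequence are all assumptions made for illustration. The sketch builds the m binary sequences x(i, j) and sums their naive discrete Fourier power spectra.

```python
import cmath

def indicator_sequences(seq, alphabet="ACGT"):
    """Return {a_i: [x(i, j) for each site j]}, where x(i, j) = 1 if a_i occupies site j."""
    return {a: [1 if s == a else 0 for s in seq] for a in alphabet}

def power_spectrum(x):
    """Naive O(N^2) discrete Fourier power |X_k|^2 / N for harmonics k = 0 .. N//2."""
    n = len(x)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum

def total_spectrum(seq, alphabet="ACGT"):
    """Sum the power spectra of all m indicator sequences (illustrative choice)."""
    spectra = [power_spectrum(x) for x in indicator_sequences(seq, alphabet).values()]
    return [sum(values) for values in zip(*spectra)]

# Hypothetical toy input: an exactly periodic sequence with period 3.
seq = "ACG" * 20
spec = total_spectrum(seq)
# Strongest harmonic, ignoring k = 0 (which only reflects symbol counts).
peak = max(range(1, len(spec)), key=spec.__getitem__)
print(len(seq) / peak)  # period of the strongest harmonic: 3.0
```

For this exactly periodic toy input the strongest harmonic sits at k = 20 of a length-60 sequence, i.e. a period of 3 bases.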
The Fourier transformation is then applied to each of these numerical sequences, and the Fourier harmonics corresponding to symbols of type i are calculated, as well as the matrix structural factors corresponding to pair correlations of symbols [6]. In our opinion, however, this method works reasonably well only for periods of relatively short length (shorter than the size of the alphabet of the symbolic sequence). For periods longer than the alphabet size, the statistical significance of the longer periods may be "decomposed" in favor of the shorter ones. The statistical significance of a longer period is thus "spread" over the statistical significance of shorter periods, i.e. harmonics with longer periods are attenuated in favor of harmonics with shorter periods. The effect is even stronger when the periodic sequence contains substitutions, so that the individual periods are no longer identical.

The main purpose of this work is to present our results on the study of DNA sequences by the information decomposition (ID) method and to show that latent periodicity exists in many gene DNA sequences.

2 Method and Results

The main principles of the ID method were developed in earlier publications [1, 2, 3, 4, 5]. The ID method is similar to the Fourier transformation method for numerical sequences, but it has the following advantages:

1. The calculation of the ID spectrum does not require any transformation of a symbolic sequence into numerical sequences.
2. ID reveals both obvious periodicity and latent periodicity of a symbolic sequence, in which there is no statistically significant similarity between any two periods.
3. The statistical significance of long periods is not spread over the statistical significance of shorter periods.
4. On the basis of the matrix M it is possible to determine the type of periodicity.

In this work we present the results of studying, by the ID method, the DNA sequences from GenBank, version 122. This analysis found more than 1.5 × 10^6 DNA sequences in GenBank with different types of latent periodicity, with periods from 6 to 200 bases. We also found 7 × 10^5 sequences with different types of triplet periodicity.

Figure 1: Examples of the latent periodicity of gene sequences. Each panel (A to D) plots Z against period length in DNA bases.
A - Deinococcus radiodurans gene for c-di-GMP phosphodiesterase (2867-5239 base pairs) from sequence AE002006. The DNA sequence from 3108 to 3963 bases has latent periodicity with a length of 120 bases and Z = 9.1.
B - Methylobacterium extorquens methanol oxidation gene mxaE (165-1010 base pairs) from sequence AF017434. The DNA sequence from 232 to 1015 bases has latent periodicity with a length of 126 bases and Z = 7.5.
C - Gene coding for the high-sulphur wool matrix protein B2A from sheep (73-561 base pairs) from SHPWMPBB. The DNA sequence from 373 to 634 bases has latent periodicity with a length of 5 bases and Z = 8.0.
D - Two V-regions of the T-cell receptor alpha-chain gene from Fugu rubripes (13412-13718; 13850-13892) from FRTCRA1. The DNA sequence from 13628 to 14594 bases has latent periodicity with a length of 59 bases and Z = 7.5.

3 Discussions

The developed method of information decomposition (ID) of symbolic sequences proved capable of revealing the latent structuredness of thousands of genes. The results obtained with the method show that many known genetic texts contain sequences with latent periodicity of various lengths and types, which could not be revealed earlier.
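To make the notion of an ID spectrum more concrete, here is a minimal sketch; it is not the authors' implementation, and the text above does not give the formulas. It assumes that the spectrum value for a trial period n is the mutual information I between the position inside the period (j mod n) and the symbol at site j, that 2I is approximately chi-square distributed with (n - 1)(k - 1) degrees of freedom for an alphabet of size k, and that Z is the usual normal approximation sqrt(4I) - sqrt(2(n - 1)(k - 1) - 1). The function names and the toy sequence are hypothetical.

```python
import math
from collections import Counter

def mutual_information(seq, period):
    """Total mutual information (in nats) between position-in-period and symbol.

    The (position, symbol) counts play the role of a period-by-alphabet
    contingency matrix (assumed here to correspond to the matrix M).
    """
    n = len(seq)
    cell = Counter((j % period, s) for j, s in enumerate(seq))
    row = Counter()  # number of sites at each position within the period
    col = Counter()  # number of sites occupied by each symbol
    for (p, s), c in cell.items():
        row[p] += c
        col[s] += c
    return sum(c * math.log(c * n / (row[p] * col[s])) for (p, s), c in cell.items())

def z_score(seq, period, alphabet_size=4):
    """Normal approximation of the chi-square statistic 2I (assumed form of Z)."""
    dof = (period - 1) * (alphabet_size - 1)
    info = max(mutual_information(seq, period), 0.0)  # guard against rounding below zero
    return math.sqrt(4.0 * info) - math.sqrt(2.0 * dof - 1.0)

# Hypothetical toy input: period-5 repeat with every 7th base replaced by "A",
# so that individual periods are no longer identical.
bases = list("ACGTT" * 30)
for j in range(0, len(bases), 7):
    bases[j] = "A"
seq = "".join(bases)

spectrum = {p: z_score(seq, p) for p in range(2, 21)}
best = max(spectrum, key=spectrum.get)
print(best)  # the toy sequence was built around a period of 5 bases
```

Because the degrees of freedom grow with the trial period, multiples of the true period (10, 15, 20) score lower than the period itself in this toy example, even though their mutual information is at least as large.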
The origin of latent periodicity in genetic texts might be connected both with the evolution of genomes and protein molecules and with the functional meaning of various sequences. For example, a periodicity of 21 bases is usually associated with α-helix formation in protein molecules. Longer periodicity could be determined by the domain organization of proteins; it could also be involved in the binding of nucleosomes to DNA. We observed a great variety of period lengths and types in DNA sequences. It is possible to assume that the ID method is able to "see" certain structural characteristics of gene sequences that reflect gene evolution or the spatial organization of the corresponding proteins. In this regard, ID is an important method that makes it possible to connect the origin of certain protein structures with the presence of certain latent periodicity in the corresponding DNA sequences.

References

[1] Chaley, M.B., Korotkov, E.V., and Skryabin, K.G., Method revealing latent periodicity of the nucleotide sequences modified for a case of small samples, DNA Research, 6:153–163, 1999.
[2] Korotkov, E.V., Korotkova, M.A., and Tulko, J.S., Latent sequence periodicity of some oncogenes and DNA-binding protein genes, CABIOS, 13:37–44, 1997.
[3] Korotkova, M.A., Korotkov, E.V., and Rudenko, V.M., Latent periodicity of protein sequences, Journal of Molecular Modelling, 5:103–115, 1999.
[4] Korotkov, E.V. and Korotkova, M.A., DNA regions with latent periodicity in some human clones, DNA Sequence, 5:353–358, 1995.
[5] Korotkov, E.V., Korotkova, M.A., Rudenko, V.M., and Skryabin, K.G., Regions with the latent periodicity in the amino acid sequences of the many proteins, Molekulyarnaya Biologiya (Russian), 33:1–8, 1999.
[6] Lobzin, V.V. and Chechetkin, V.R., Order and correlations in genomic DNA sequences. The spectral approach, Uspekhi Fizicheskikh Nauk (Russian), 170:57–81, 2000.