LATENT PERIODICITY OF DNA SEQUENCES OF MANY GENES

E.V. KOROTKOV
Center of Bioengineering of the Russian Academy of Sciences, 60-Oktybrya prospect, 7/1, Moscow, Russia

D.A. PHOENIX
University of Central Lancashire, Preston PR1 2HE, UK

ABSTRACT

A method of searching for latent periodicity is being developed. Mutual information is used to reveal the latent periodicity of DNA or mRNA sequences. The latent periodicity of a DNA sequence is a periodicity with a low level of homology between any two periods inside the sequence. The mutual information between an artificial numerical sequence and the DNA sequence is calculated. The length of the artificial sequence period is varied from 2 to 150. A high level of mutual information between the artificial and DNA sequences allows any type of latent periodicity of the DNA sequence to be found. Latent periodicity of many DNA coding regions has been found. The potential significance of latent periodicity is discussed.

INTRODUCTION

The search for periodic sequences in DNA is important for understanding the DNA sequence structure of the genomes of human and other species. Periodic sequences in the human genome are currently known as satellite and minisatellite sequences [11, 26, 2]. These sequences contain a short repeated fragment of DNA. Mathematical methods for determining the periodicity of symbolic sequences have been developed [7, 8, 19, 20, 21, 27]. These approaches apply mathematical methods devised for analyzing the periodicity of numerical sequences to the study of symbolic sequences. A transformation of the symbolic sequence into a numerical sequence is frequently used for such analysis. These investigations have shown a periodicity of the amino acid sequences of some genes [18, 19, 20, 21, 25] and a periodicity of some DNA sequences [3, 4, 7, 8, 23]. The study of the periodicity of DNA sequences is mainly based on the analysis of homologous periodicity, including imperfect periodicity [17, 24, 27, 28].
However, hidden periodicity of a DNA sequence may exist that cannot be found by the algorithms developed so far. For example, a period (A/C/T)(T/G)(G/A)(T/A)(C/G/A)(G/T) may be observed, where the first position of the period contains A, C or T, the second position contains T or G, and so on. A new mathematical method of latent periodicity detection was developed earlier, and the latent periodicity of some human genes was found [15]. The mathematical approach directed at detecting such regularities in DNA and mRNA sequences is improved in the present work. Latent periodicity of some DNA regions is shown. These DNA regions contain the following genes: the albumin gene, the farnesyltransferase alpha-subunit gene and plasminogen. The mathematical approach being developed uses the method of enlarged similarity of DNA sequences [13, 14].

SYSTEM AND METHODS

Latent periodicity is detected by comparing artificial periodic sequences with the DNA sequence analyzed [15]. The alphabet of the artificial sequences contains the letters $S_j$. The sequence $S_1S_2S_1S_2S_1S_2\ldots$ is generated to search for a DNA period equal to 2 bases. The sequence $S_1S_2\ldots S_nS_1S_2\ldots S_n$ is generated to study a DNA period equal to n bases. The length of the artificial sequence is equal to the length of the analyzed sequence. The artificial sequences with periods from 2 up to 150 are compared in turn with the analyzed DNA sequence. Mutual information is chosen as the measure of similarity between the artificial and DNA sequences. A matrix M is filled in to calculate the mutual information. The elements of M are the numbers of coincidences of each type between the artificial sequence and the DNA sequence. The dimension of M is 4 x n. The rows of M correspond to the bases A, U, C and G. The columns of M correspond to the letters $S_j$ (j = 1, 2, ..., n) of the artificial sequence. The sums of the row elements are equal to the numbers of A, U, C and G bases in the DNA sequence.
The sums of the elements in each column are equal to the numbers of $S_j$ letters in the artificial sequence. The mutual information is calculated using the formula [16]:

\[ I = \sum_{i=1}^{4}\sum_{j=1}^{n} m_{ij}\ln m_{ij} - \sum_{i=1}^{4} x_i\ln x_i - \sum_{j=1}^{n} y_j\ln y_j + L\ln L \tag{1} \]

Here $m_{ij}$ is an element of the matrix M; $x_i$ is the number of A, U, C or G symbols in the DNA sequence; $y_j$ is the number of $S_j$ symbols in the artificial sequence; L is the length of the compared sequences. The value 2I is distributed as $\chi^2$ with 3(n-1) degrees of freedom. This makes it possible to evaluate the probability that the periodicity arose by chance.

All possible periods can be divided into two classes. The first class (simple periods) includes periods equal to prime numbers, that is, numbers divisible without remainder only by one and by themselves. The second class (compound periods) contains periods equal to a product of prime numbers. Let us have a compound period A = B*C, where B and C are simple periods. Let the period A have the least probability $\alpha = P(\chi^2 \ge 2I)$ among all periods analyzed. It was shown earlier [29]:

\[ I(DNA,B) + I_{DNA}(C,B) + I(DNA,C) = I(C*B,DNA) + I(B,C) \tag{2} \]

Here I(DNA,B) is the mutual information between the DNA sequence and the artificial sequence with period B; I(DNA,C) is the mutual information between the DNA sequence and the artificial sequence with period C; I(B,C) is the mutual information between the two artificial sequences with periods B and C; I(C*B,DNA) is the mutual information between the DNA sequence and the compound artificial sequence with period B*C.

For example, let B = 3 and C = 2. We may write the sequences B and C as:

$b_1b_2b_3b_1b_2b_3b_1b_2b_3b_1b_2b_3$
$c_1c_2c_1c_2c_1c_2c_1c_2c_1c_2c_1c_2$

Then we may unite the two sequences and introduce 6 new letters: $a_1 = b_1c_1$; $a_2 = b_2c_2$; $a_3 = b_3c_1$; $a_4 = b_1c_2$; $a_5 = b_2c_1$; $a_6 = b_3c_2$. The sequence $A = \{a_1a_2a_3a_4a_5a_6a_1a_2a_3a_4a_5a_6\}$ is the united sequence B*C.
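Formula (1) and the matrix M can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' program: the function names `mutual_information` and `degrees_of_freedom` are hypothetical, the artificial sequence is represented implicitly by the position index modulo the period, and cells with zero counts are skipped (the usual convention 0 ln 0 = 0, assumed here).

```python
import math
from collections import Counter


def mutual_information(dna: str, period: int) -> float:
    """Mutual information of formula (1) between a sequence and an
    artificial periodic sequence with the given period."""
    length = len(dna)
    # m_ij: coincidences of base i with artificial letter j (j = position mod period)
    cells = Counter((base, pos % period) for pos, base in enumerate(dna))
    rows = Counter(dna)                                  # x_i: base counts
    cols = Counter(pos % period for pos in range(length))  # y_j: letter counts

    def sum_c_ln_c(counts):
        # Zero counts never appear in a Counter built this way, but guard anyway.
        return sum(c * math.log(c) for c in counts if c > 0)

    return (sum_c_ln_c(cells.values())
            - sum_c_ln_c(rows.values())
            - sum_c_ln_c(cols.values())
            + length * math.log(length))


def degrees_of_freedom(period: int) -> int:
    """Degrees of freedom of 2I for a four-letter alphabet: 3(n - 1)."""
    return 3 * (period - 1)


toy = "ACGT" * 30                        # perfectly periodic toy sequence, L = 120
i_period_4 = mutual_information(toy, 4)  # equals L ln 4 for this toy input
i_period_3 = mutual_information(toy, 3)  # equals 0: no association with period 3
```

For the toy repeat, every base coincides with exactly one letter of the period-4 artificial sequence, so I reaches its maximum L ln 4, while the period-3 artificial sequence is independent of the bases and gives I = 0.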
The same method is applied to construct the sequences (DNA*B), (DNA*C) and (DNA*C*B). The mutual information I(B,C) is calculated as:

\[ I(B,C) = \{H(B) + H(C) - H(B*C)\}\,L \tag{3} \]

where H(B), H(C) and H(B*C) are the entropies of the B, C and B*C sequences and L is the length of the sequences. H(B) is calculated as:

\[ H(B) = -\sum_{i=1}^{B} p_i(B)\ln p_i(B) \tag{4} \]

where $p_i(B)$ is the probability of letter i in the B sequence (i = 1, 2, ..., B). The entropies H(C) and H(B*C) are calculated in a similar way. $I_{DNA}(C,B)$ is the conditional mutual information between the C and B sequences [29]:

\[ I_{DNA}(C,B) = H(DNA*B) + H(DNA*C) - H(DNA) - H(DNA*C*B) \tag{5} \]

Here H(DNA*B) is the entropy of the united sequence DNA*B, H(DNA*C) is the entropy of the united sequence DNA*C, H(DNA) is the entropy of the DNA sequence and H(DNA*C*B) is the entropy of the united sequence DNA*C*B. As in formula (3), the right-hand side is multiplied by L when $I_{DNA}(C,B)$ is combined with the other terms of formula (2). $I_{DNA}(C,B)$ is not negative [29].

If the periods of the two sequences B and C are distinct prime numbers, I(B,C) = 0. This means that the mutual information values I(DNA,B) and I(DNA,C) are independent parts of I(DNA,A). If the mutual information is considered as a random value, then I(DNA,B) and I(DNA,C) are independent random values. According to [16], 2I(DNA,B) and 2I(DNA,C) are distributed as $\chi^2$ with 3(B-1) and 3(C-1) degrees of freedom, respectively.

The mutual information I(DNA,kB), where k = 2, 3, 4, ..., may be equal to or exceed I(DNA,B). For example, if we calculate the mutual information between DNA and the artificial sequences with periods of 3, 6, 9, 12, ..., then the mutual information for periods 6, 9, 12, ... is greater than or equal to I(DNA,3). This means that the mutual information accumulates as k increases. For the compound period A it is convenient to take the value $I_{DNA}(C,B)$ as a measure of the sequence periodicity. Twice this value has the $\chi^2$ distribution with the number of degrees of freedom equal to 3(A-1) - 3(B-1) - 3(C-1). $I_{DNA}(C,B)$ reflects the contribution of the period A itself to the periodicity, with the influence of all simple periods excluded.
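The entropies of united sequences and the balance of formula (2) can be checked numerically with a short Python sketch. All names here (`entropy`, `periodic`, `unite`, `mutual_info`, `conditional_mutual_info`) are hypothetical helpers introduced for illustration, the toy sequence is arbitrary, and the conditional term of formula (5) is multiplied by L so that it is on the same scale as formula (3); that scaling is an assumption of this sketch.

```python
import math
from collections import Counter


def entropy(symbols):
    """Per-symbol entropy -sum p ln p of a sequence of hashable symbols."""
    symbols = list(symbols)
    total = len(symbols)
    return -sum(c / total * math.log(c / total)
                for c in Counter(symbols).values())


def periodic(length, period):
    """Artificial sequence: letter index 0..period-1 repeated along the length."""
    return [pos % period for pos in range(length)]


def unite(*seqs):
    """United sequence X*Y*...: one combined letter per position."""
    return list(zip(*seqs))


def mutual_info(x, y):
    """I(X,Y) = {H(X) + H(Y) - H(X*Y)} L, in the form of formula (3)."""
    return (entropy(x) + entropy(y) - entropy(unite(x, y))) * len(x)


def conditional_mutual_info(dna, c, b):
    """Formula (5), multiplied by L (assumption) to match formula (3)."""
    h = (entropy(unite(dna, b)) + entropy(unite(dna, c))
         - entropy(dna) - entropy(unite(dna, c, b)))
    return h * len(dna)


dna = list("ATGGCACGTTAGCCATGAAACGTTTGCA" * 3)  # arbitrary toy sequence
b = periodic(len(dna), 3)                       # simple period B = 3
c = periodic(len(dna), 2)                       # simple period C = 2

# Left-hand and right-hand sides of formula (2)
lhs = mutual_info(dna, b) + conditional_mutual_info(dna, c, b) + mutual_info(dna, c)
rhs = mutual_info(unite(c, b), dna) + mutual_info(b, c)
```

With these definitions `lhs` and `rhs` agree up to rounding error, and `mutual_info(b, c)` vanishes because the sequence length is a multiple of 6, so the letters of the two artificial sequences are paired uniformly.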
In practice it is convenient to obtain the $I_{DNA}(C,B)$ spectrum for every period. This may be done by successively subtracting the mutual information of simple periods from the mutual information of compound periods. A similar subtraction can be made for the degrees of freedom, because the mutual information is distributed as $\chi^2$ with 3(n-1) degrees of freedom (where n is the period length). Then the mutual information of compound periods is subtracted from the mutual information of longer compound periods divisible by them. For example, the mutual information I(DNA,6) is subtracted from I(DNA,12), I(DNA,18) and so on. The resulting spectrum of modified mutual information $I_m(n)$ shows the contribution of each period n to the formation of the periodicity. The values $2I_m(n)$ are distributed as $\chi^2$ with the remaining number of degrees of freedom.

The contribution of each period to the observed periodicity is conveniently obtained from the spectrum $I_m(n)$. For example, if a DNA sequence has periodicity only at 18 bases, then the modified $I_m(n)$ spectrum has only one maximum. If a DNA sequence has periods equal to 3, 6, 9 and 18 bases, then the $I_m(n)$ spectrum has four comparatively small maxima. The modified $I_m(n)$ spectrum may be applied to the detection of enclosed periods. To evaluate the statistical significance of each period, the modified spectrum $I_m(n)$ must be represented as a spectrum of the argument X of the normal distribution. The transformation of $I_m(n)$ is performed using formula (6) [10]:

\[ X(n) = \bigl(4I_m(n)\bigr)^{1/2} - (2t-1)^{1/2} \tag{6} \]

Here t is the number of degrees of freedom for $I_m(n)$. This spectrum X(n) will be shown below for some DNA regions with latent periodicity. Since the DNA sequences of coding regions have a period equal to 3 bases [28], the search for latent periodicity should be conducted by subtracting I(DNA,3) from the mutual information of all periods divisible by 3. This eliminates the influence of triplet periodicity on long latent periodicity. The present work searches for DNA regions in which I(DNA,3n) - I(DNA,3), expressed as X, exceeds 5.6.
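The transformation of formula (6) and the triplet correction can be sketched in Python as follows. The function names are hypothetical, and `normal_upper_tail` is added only to relate a value of X to a tail probability of the standard normal distribution; it is not described in the text. The sketch assumes n >= 2 in `triplet_corrected_x`, so that the number of degrees of freedom is positive.

```python
import math


def x_statistic(i_m: float, t: int) -> float:
    """Formula (6): X = sqrt(4 I_m) - sqrt(2t - 1), t = degrees of freedom."""
    return math.sqrt(4.0 * i_m) - math.sqrt(2.0 * t - 1.0)


def triplet_corrected_x(i_3n: float, i_3: float, n: int) -> float:
    """X for I(DNA,3n) - I(DNA,3), with 3(3n - 1) - 3(3 - 1) degrees of freedom.

    Assumes n >= 2; a slightly negative difference caused by rounding is
    clipped to zero.
    """
    t = 3 * (3 * n - 1) - 3 * (3 - 1)
    return x_statistic(max(i_3n - i_3, 0.0), t)


def normal_upper_tail(x: float) -> float:
    """P(Z > x) for a standard normal variable Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


# Tail probability at the threshold X = 5.6 (of the order of 1e-8)
threshold_probability = normal_upper_tail(5.6)
```

The clipping in `triplet_corrected_x` is a design choice of this sketch: the text states that I(DNA,3n) is never smaller than I(DNA,3), so a negative difference can only come from floating-point rounding.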
A value of X = 5.6 corresponds to a probability of $10^{-8}$ that the long periodicity found is caused by random factors.

RESULTS AND DISCUSSION

The search for regions with latent periodicity was performed in DNA and mRNA clones from the EMBL data bank. Clones shorter than 1000 bases were not analyzed. An artificial sequence containing 1000 bases was compared with the first 1000 bases of a DNA or mRNA clone. The left and right borders were varied independently for each artificial sequence with a period from 2 up to 150. The purpose was to find the DNA or mRNA region having the best periodicity and the maximum of I(DNA,3n) - I(DNA,3). If no DNA or mRNA region with latent periodicity was found, the scanned artificial sequence was shifted by 100 bases. If a region with latent periodicity was revealed, the artificial sequence was shifted by 500 bases and the calculations were repeated. The full length of the clone was analyzed in this way. After that the next DNA clone from the EMBL data bank was analyzed.

The analysis revealed many DNA and mRNA regions with latent periodicity. More than 30% of human genes (longer than 1000 bases) from the EMBL data bank have regions with latent periodicity. Three regions with pronounced latent periods are shown in Fig. 1 as examples. The corresponding X(n) spectra of those regions are shown in Fig. 2. The value 2I'(n,DNA) in Fig. 1 is equal to 2I(DNA,n) - 3(n-1). The value 3(n-1) is the average value of 2I(DNA,n) when the artificial sequence and the DNA or mRNA sequence are random sequences. The sequences found with latent periodicity have a significant period equal to 3 bases. These three mRNA sequences have 2I(3,DNA) in the interval from 16.3 up to 62.6.
It corresponds to the probability $\alpha$ from less than $10^{-9}$ up to IO-J. On the background of the period equal to 3 bases (which is characteristic of DNA coding regions), periods of sizes divisible by 3 are found. The values 2I'(3n,DNA) for three regions, from the A06977 clone [12], the HSFTA clone [1] and the HSPLASM clone [5], are shown in Fig. 1. These regions are components of the galactose gene, the farnesyltransferase gene and plasminogen. Those regions have latent periods equal to 96, 102 and 102 bases. The homology between any pair of periods in these regions is at a statistically insignificant level.

A method of searching for periodicity of a symbolic sequence was developed earlier [19]. The method is based on writing out the sequence in rows of N columns to detect a period of N. The symbols of the sequence are then combined into two groups and a two-letter alphabet is introduced (the letters A and B). The symbols are merged into the two groups on the basis of some common characteristics. For amino acid sequences these characteristics may be the hydrophobicity of the amino acids, their charges and some other properties [19]. However, the introduction of a two-letter alphabet limits the types of periodic sequences that may be found. For example, if we have the period {(a/t)(t/g/c)(c/a)}, then a new alphabet A = {a,t}, B = {c,g} reveals the periodicity in the first column, but the periodicity in the second and third columns is not revealed. If we introduce a new alphabet A = {t,g,c}, B = {a}, then we may find the periodicity of the second column only. An analysis of periodicity by introduction of a new two-letter alphabet is effective if the distribution of base types in the columns is similar. Moreover, if we consider triplet periodicity, there are many ways of grouping triplets into two letters, and it is very hard to test all of them. Regions of DNA or mRNA sequences of some genes with latent periodicity were found earlier [15].
The results of this work and of the earlier work show that no less than 30% of human genes from the EMBL data bank have regions with latent periodicity. Latent periodicity is not similar to a homologous alternation of DNA or mRNA bases and can strongly influence codon usage. It is possible to assume that latent periodicity is a typical feature of some genes and that it is related to the evolutionary origin of genes by a process of multiple duplications. In that case latent periodicity is a reflection of ancient evolutionary events in gene sequences. It is also possible to assume that latent periodicity has a certain function in the cell. Latent periodicity can provide certain bends of DNA coding regions [6, 9, 22]. A computer data bank of the regions with latent periodicity is now being created. This data bank will be based on a computer at Lancashire University.

REFERENCES

1. D.A.Andres, A.Milatovich, T.Ozcelik, J.M.Wenzlau, M.S.Brown, J.L.Goldstein, U.Francke "cDNA cloning of the two subunits of human CAAX farnesyltransferase and chromosomal mapping of FNTA and FNTB loci and related sequences" Genomics. 18, 105 (1993)
2. V.V.Bliskovsky "Tandem DNA repeats in vertebrate genomes: structure, possible mechanisms of creation and evolution" Mol.Biol. (Russian). 25, 965 (1991)
3. M.Bina "Periodicity of dinucleotides in nucleosomes derived from Simian virus 40 chromatin" J.Mol.Biol. 235, 198 (1994)
4. B.Borstnik, D.Pampernik, D.Lukman, D.Ugarkovic and M.Plohl "Tandemly repeated pentanucleotides in DNA sequences of eucaryotes" Nucl. Acids Res. 22, 3412 (1994)
5. M.J.Browne, C.G.Chapman, I.Dodd, J.E.Carey, G.M.Lawrence, D.Mitchell, J.H.Robinson "Expression of recombinant human plasminogen and aglycoplasminogen in HeLa cells" Unpublished.
6. P.Carrera and F.Azorin "Structural characterization of intrinsically curved AT-rich DNA sequences" Nucl. Acids Res. 22, 3671 (1994)
7. V.R.Chechetkin, L.A.Knizhnikova and A.Yu.Turygin "Three-quasiperiodicity, mutual correlations, ordering and long modulations in genomic nucleotide sequences viruses" J. of Biomol. Str. & Dynamics. 12, 271 (1994)
8. E.A.Cheever, G.C.Overton and B.B.Searls "Fast Fourier transform-based correlations of DNA sequences using complex plane encoding" CABIOS. 7, 143 (1991)
9. D.S.Goodsell and R.E.Dickerson "Bending and curvature calculations in B-DNA" Nucl. Acids Res. 22, 5497 (1994)
10. Handbook of applicable mathematics. Volume IV. (1984). Wiley-Interscience Publication. John Wiley & Sons. New York-Brisbane-Toronto-Singapore.
11. C.M.Hearne, S.Ghosh and J.A.Todd "Microsatellites for linkage analysis of genetic traits" Trends in Genetics. 8, 288 (1992)
12. E.Hinchliffe "Induction of galactose regulated gene expression in yeast" Patent number EP0248637 (1987)
13. E.V.Korotkov "Fast method of homology and purine-pyrimidine mutual relations between DNA sequences search" DNA Sequence. 4, 413 (1994)
14. E.V.Korotkov and M.A.Korotkova "Enlarged similarity of nucleic acids sequences" DNA Research. 3, N.3, I (1996)
15. E.V.Korotkov and M.A.Korotkova "Latent periodicity of DNA sequences of some human genes" DNA Sequence. 5, 353 (1995)
16. S.Kullback "Information theory and Statistics" John Wiley & Sons, Inc. New York, London (1959)
17. V.Yu.Makeev and V.G.Tumanyan "On link between the distance and correlation analysis of various types and Fourier transformation used for the search of the periodical patterns in the primary structures of the biopolymers" Biophysics (Russian). 39, 294 (1994)
18. A.D.McLachlan "Repeated helical pattern in apolipoprotein A-I" Nature. 267, 465 (1977a)
19. A.D.McLachlan "Analysis of periodic patterns in amino acid sequences: collagen" Biopolymers. 16, 1271 (1977b)
20. A.D.McLachlan "Multichannel Fourier analysis of patterns in protein sequences" J.Phys.Chem. 97, 3000 (1993)
21. A.D.McLachlan and M.Stewart "Analysis of repeated motif in the talin rod" J.Mol.Biol. 235, 1278 (1994)
22. P.T.McNamara, A.Bolshoy, E.N.Trifonov and R.E.Harrington "Sequence-dependent kinks in curved DNA" J.Biomol. Str. & Dyn. 3, 529 (1990)
23. E.Pizzi, S.Linni and C.Frontali "Detection of latent sequence periodicities" Nucl. Acids Res. 18, 3745 (1990)
24. B.D.Silverman and R.Linsker "A measure of DNA periodicity" J.Theor. Biol. 118, 295 (1986)
25. M.Stewart and A.D.McLachlan "Fourteen actin-binding sites on tropomyosin?" Nature. 257, 331 (1975)
26. J.A.Todd "La carte des microsatellites est arrivee" Human Molecular Genetics 1, 663 (1992)
27. R.F.Voss "Evolution of long-range fractal correlations and 1/f noise in DNA base sequences" Phys. Rev. Letters. 68, 3805 (1992)
28. G. Von Heijne "Sequences analysis in molecular biology" Academic Press, Inc. San Diego-New York-London (1987)
29. A.M.Yaglom and I.M.Yaglom "Probability and Information" Nauka Press, Moscow (1960)
