LATENT PERIODICITY OF DNA SEQUENCES OF MANY GENES
E.V. KOROTKOV
Center of Bioengineering of the Russian Academy of Sciences, 60-Oktyabrya prospect, 7/1,
Moscow, Russia
D.A. PHOENIX
University of Central Lancashire, Preston PR1 2HE, UK
ABSTRACT
A method of searching for latent periodicity is being developed. Mutual information is used
to reveal latent periodicity in DNA or mRNA sequences. The latent periodicity of a DNA
sequence is a periodicity with a low level of homology between any two periods inside the
sequence. The mutual information between an artificial numerical sequence and the DNA
sequence is calculated. The period length of the artificial sequence is varied from 2 to 150.
A high level of mutual information between the artificial and DNA sequences allows any type
of latent periodicity in the DNA sequence to be found. Latent periodicity has been found in
many DNA coding regions. The potential significance of latent periodicity is discussed.
INTRODUCTION
The search for periodic sequences in DNA is important for understanding the
sequence structure of the genomes of human and other species. Periodic sequences
in the human genome are currently known as satellite and minisatellite sequences
[11, 26, 2]. These sequences contain a short repeated fragment of DNA. Mathematical
methods for determining the periodicity of symbolic sequences have been developed
[7, 8, 19, 20, 21, 27]. These approaches apply mathematical methods developed for
the analysis of periodicity in numerical sequences to the study of periodicity in
symbolic sequences. A transformation of the symbolic sequence into a numerical
sequence is frequently used for such analysis. These investigations have shown a
periodicity in the amino acid sequences of some genes [18, 19, 20, 21, 25] and a
periodicity in some DNA sequences [3, 4, 7, 8, 23].
The study of the periodicity of DNA sequences is mainly based on the analysis of
homologous periodicity, including imperfect periodicity [17, 24, 27, 28]. However, a
hidden periodicity of a DNA sequence may exist that cannot be found by the
algorithms developed so far. For example, a period (A/C/T)(T/G)(G/A)(T/A)(C/G/A)(G/T)
may be observed, where the first position of the period contains A, C or T,
the second position contains T or G, and so on. Earlier, a new mathematical method
of latent periodicity detection was developed and the latent periodicity of some
human genes was found [15]. The mathematical approach to detecting such
regularities in DNA and mRNA sequences is improved in the present work. Latent
periodicity of some DNA regions is shown. These regions contain the following genes:
the albumin gene, the farnesyltransferase alpha-subunit gene and the plasminogen
gene. The mathematical approach being developed uses the method of enlarged
similarity of DNA sequences [13, 14].
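As an illustration of such a degenerate period, the following Python sketch builds a sequence from the pattern (A/C/T)(T/G)(G/A)(T/A)(C/G/A)(G/T) and measures the average identity between period copies. The function names, the uniform choice among the allowed bases and the seed are assumptions of this illustration and are not taken from the method of [15].

```python
import random

# Degenerate period from the text: each position allows a set of bases.
PATTERN = ["ACT", "TG", "GA", "TA", "CGA", "GT"]

def generate_latent_periodic(pattern, repeats, seed=0):
    """Draw one allowed base per position for every copy of the period."""
    rng = random.Random(seed)
    return "".join(rng.choice(allowed)
                   for _ in range(repeats)
                   for allowed in pattern)

def mean_pairwise_identity(seq, period):
    """Average fraction of identical positions over all pairs of period copies."""
    copies = [seq[i:i + period] for i in range(0, len(seq) - period + 1, period)]
    total, pairs = 0.0, 0
    for a in range(len(copies)):
        for b in range(a + 1, len(copies)):
            total += sum(x == y for x, y in zip(copies[a], copies[b])) / period
            pairs += 1
    return total / pairs

seq = generate_latent_periodic(PATTERN, repeats=50)
identity = mean_pairwise_identity(seq, len(PATTERN))
```

Every position of `seq` obeys the pattern, yet individual copies of the period differ from one another, which is the situation that homology-based searches are poorly suited to.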
SYSTEM AND METHODS
A comparison of artificial periodic sequences with the analyzed DNA sequence is
used for the detection of latent periodicity [15]. The alphabet of the artificial
sequences consists of the letters S_i. The sequence S_1 S_2 S_1 S_2 S_1 S_2 ... is
generated to search for a DNA period equal to 2 bases. The sequence
S_1 S_2 ... S_n S_1 S_2 ... S_n is generated to study a DNA period equal to n bases.
The length of the artificial sequence is equal to the length of the analyzed
sequence. Artificial sequences with periods from 2 up to 150 are compared in turn
with the analyzed DNA sequence. Mutual information is chosen as the measure of
similarity between the artificial and DNA sequences. A matrix M is filled in to
calculate the mutual information. The elements of M are the numbers of coincidences
of each type between the artificial sequence and the DNA sequence. The dimension
of M is 4xn. The rows of M correspond to the bases A, U, C and G, and the columns
correspond to the letters S_j (j = 1, 2, ..., n) of the artificial sequence. The sums
of the row elements are equal to the numbers of A, U, C and G bases in the DNA
sequence. The sums of the elements in each column are equal to the numbers of S_j
letters in the artificial sequence. The mutual information is calculated using the
formula [16]:
$$I = \sum_{i=1}^{4}\sum_{j=1}^{n} m_{ij}\ln m_{ij} - \sum_{i=1}^{4} x_i \ln x_i - \sum_{j=1}^{n} y_j \ln y_j + L\ln L \tag{1}$$
Here m_ij is an element of the matrix M; x_i is the number of symbols of the i-th base
type (A, U, C or G) in the DNA sequence; y_j is the number of S_j symbols in the
artificial sequence; L is the length of the compared sequences. The value 2I is
distributed as χ² with 3(n-1) degrees of freedom. This makes it possible to evaluate
the probability that the periodicity arose by chance.
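The calculation of formula (1) can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the matrix M is stored as a counter keyed by (base, column) pairs, and terms with zero counts are taken as zero.

```python
import math
from collections import Counter

def xlogx(v):
    """v * ln(v), with the convention 0 * ln(0) = 0."""
    return v * math.log(v) if v > 0 else 0.0

def mutual_information(seq, n):
    """Formula (1): I between seq and the artificial sequence S_1..S_n S_1..S_n ..."""
    L = len(seq)
    # m[(base, j)]: coincidences of each base with the artificial letter S_j
    m = Counter((base, k % n) for k, base in enumerate(seq))
    x = Counter(seq)                        # row sums: number of each base
    y = Counter(k % n for k in range(L))    # column sums: number of each S_j
    return (sum(xlogx(v) for v in m.values())
            - sum(xlogx(v) for v in x.values())
            - sum(xlogx(v) for v in y.values())
            + xlogx(L))

def degrees_of_freedom(n):
    """Degrees of freedom of the chi-square distribution of 2I for period n."""
    return 3 * (n - 1)

toy = "ACGT" * 50                   # perfectly periodic toy sequence, period 4
i4 = mutual_information(toy, 4)     # equals 200 * ln 4 for this sequence
i3 = mutual_information(toy, 3)     # close to zero: no period of 3 bases
```

For the toy sequence the period of 4 bases yields the maximum possible value L ln 4, while the unrelated period of 3 bases yields a value close to zero.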
All possible periods can be divided into two classes. The first class (simple
periods) includes periods equal to prime numbers, that is, numbers divisible without
remainder only by one and by themselves. The second class (compound periods)
contains periods equal to a product of prime numbers. Consider a compound period
A=B*C, where B and C are simple periods. Let the period A have the lowest
probability α = P(χ² ≥ 2I) among all periods analyzed. It was shown earlier [29] that:
$$I(\mathrm{DNA},B) + I_{\mathrm{DNA}}(C,B) + I(\mathrm{DNA},C) = I(C*B,\mathrm{DNA}) + I(B,C) \tag{2}$$
Here I(DNA,B) is the mutual information between the DNA sequence and the artificial
sequence with period B; I(DNA,C) is the mutual information between the DNA sequence
and the artificial sequence with period C; I(B,C) is the mutual information between
the two artificial sequences with periods B and C; I(B*C,DNA) is the mutual
information between the DNA sequence and the compound artificial sequence with
period B*C. For example, let B=3 and C=2. We may write the sequences B and C as:
b1b2b3b1b2b3b1b2b3b1b2b3
c1c2c1c2c1c2c1c2c1c2c1c2
Then we may unite these two sequences and introduce 6 new letters: a1=b1c1;
a2=b2c2; a3=b3c1; a4=b1c2; a5=b2c1; a6=b3c2. The sequence
A = {a1a2a3a4a5a6a1a2a3a4a5a6a1a2a3a4a5a6} is the united sequence B*C. The same
method is applied to construct the sequences (DNA*B), (DNA*C) and (DNA*C*B). The
mutual information I(B,C) is calculated as:
$$I(B,C) = \{H(B) + H(C) - H(B*C)\}\,L \tag{3}$$
where H(B), H(C) and H(B*C) are the entropies of the B, C and B*C sequences and L is
the length of the sequences. H(B) is calculated as:
$$H(B) = -\sum_{i} p_i(B)\ln p_i(B) \tag{4}$$
where p_i(B) is the probability of the i-th letter in the B sequence (i = 1, 2, ..., B).
The entropies H(C) and H(B*C) are calculated in a similar way. I_DNA(C,B) is the
conditional mutual information between the C and B sequences [29]:
$$I_{\mathrm{DNA}}(C,B) = H(\mathrm{DNA}*B) + H(\mathrm{DNA}*C) - H(\mathrm{DNA}) - H(\mathrm{DNA}*C*B) \tag{5}$$
Here H(DNA*B) is the entropy of the united sequence DNA*B, H(DNA*C) is the
entropy of the united sequence DNA*C, H(DNA) is the entropy of the DNA
sequence and H(DNA*C*B) is the entropy of the united sequence DNA*C*B.
I_DNA(C,B) is non-negative [29]. If the periods of the two sequences B and C are
different prime numbers, then I(B,C)=0. This means that the mutual informations
I(DNA,B) and I(DNA,C) are independent parts of I(DNA,A). If the mutual
information is considered as a random value, then I(DNA,B) and I(DNA,C) are
independent random values. According to [16], 2I(DNA,B) and 2I(DNA,C) are
distributed as χ² with 3(B-1) and 3(C-1) degrees of freedom, respectively.
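Relation (2) can be checked numerically with the entropy definitions of formulas (3) to (5). The Python sketch below is illustrative only: the toy DNA string and the function names are hypothetical, and the conditional mutual information of formula (5) is multiplied by L here so that all terms share the units of formula (3), which is an assumption of this sketch.

```python
import math
from collections import Counter

def entropy(symbols):
    """Formula (4): Shannon entropy (natural logarithm) of a symbol sequence."""
    L = len(symbols)
    return -sum(c / L * math.log(c / L) for c in Counter(symbols).values())

def periodic(L, n):
    """Artificial sequence of period n; letters are encoded as 0..n-1."""
    return [k % n for k in range(L)]

def mi(u, v):
    """Formula (3): mutual information of two sequences, multiplied by L."""
    return (entropy(u) + entropy(v) - entropy(list(zip(u, v)))) * len(u)

def conditional_mi(dna, c, b):
    """Formula (5), multiplied by L (assumption, to match the units of mi)."""
    return (entropy(list(zip(dna, b))) + entropy(list(zip(dna, c)))
            - entropy(dna) - entropy(list(zip(dna, c, b)))) * len(dna)

dna = "ATGGCTAGCTTAGGCATCGATCGATTAGCAGGCTA" * 6   # arbitrary toy sequence, 210 bases
L = len(dna)
B, C = periodic(L, 3), periodic(L, 2)
A = list(zip(B, C))                               # united sequence B*C, period 6

lhs = mi(dna, B) + conditional_mi(dna, C, B) + mi(dna, C)
rhs = mi(dna, A) + mi(B, C)
```

With L a multiple of 6, `mi(B, C)` vanishes up to rounding, and `lhs` and `rhs` agree, in line with relation (2).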
The mutual information I(DNA,kB), where k = 2, 3, 4, ..., may be equal to or exceed
I(DNA,B). For example, if we calculate the mutual information between DNA and the
artificial sequences with periods 3, 6, 9, 12, ..., then the mutual information for
periods 6, 9, 12, ... is greater than or equal to I(DNA,3). This means that mutual
information accumulates as k increases. For the compound period A it is convenient
to take the value I_DNA(C,B) as the measure of sequence periodicity. Twice this value
has the χ² distribution with the number of degrees of freedom equal to
3(A-1)-3(B-1)-3(C-1). I_DNA(C,B) reflects the contribution of the period A to this
periodicity after the influence of all simple periods has been excluded. In practice
it is convenient to obtain the I_DNA(C,B) spectrum for any period. This may be done
by successively subtracting the mutual information of simple periods from the
mutual information of compound periods. A similar subtraction can be done for the
degrees of freedom, because twice the mutual information is distributed as χ² with
3(n-1) degrees of freedom (where n is the period length). Then the mutual
information of compound periods is subtracted from the mutual information of
longer compound periods divisible by them. For example, the mutual information
I(DNA,6) is subtracted from I(DNA,12), I(DNA,18) and so on. The resulting spectrum
of modified mutual information I_m(n) shows the contribution of each period n to
the formation of periodicity. The values 2I_m(n) are distributed as χ² with the
remaining number of degrees of freedom.
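One possible reading of this successive subtraction is sketched below in Python. It is an interpretation and not necessarily the authors' exact procedure: periods are processed in increasing order, and for each period n the already modified values of all its proper divisors d (2 <= d < n) are subtracted, together with their degrees of freedom. The function name and the toy spectrum are hypothetical.

```python
def modified_spectrum(I, max_period):
    """I maps each period n (2..max_period) to I(DNA, n).

    Returns (Im, dof): the modified mutual information and the remaining
    degrees of freedom for every period.
    """
    Im, dof = {}, {}
    for n in range(2, max_period + 1):
        Im[n] = I[n]
        dof[n] = 3 * (n - 1)
        for d in range(2, n // 2 + 1):
            if n % d == 0:           # d is a proper divisor of n
                Im[n] -= Im[d]       # remove the contribution of the shorter period
                dof[n] -= dof[d]     # and its degrees of freedom
    return Im, dof

# Toy input: information appears only at multiples of 18 (a pure 18-base period).
I_toy = {n: (100.0 if n % 18 == 0 else 0.0) for n in range(2, 41)}
Im_toy, dof_toy = modified_spectrum(I_toy, 40)
```

For a product of two different primes A=B*C this reading gives 3(A-1)-3(B-1)-3(C-1) degrees of freedom, as stated in the text, and the toy spectrum keeps a single maximum at n = 18 while the value accumulated at n = 36 is removed.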
The contribution of each period to the observed periodicity is conveniently obtained
from the spectrum I_m(n). For example, if a DNA sequence has a periodicity of 18
bases only, then the modified spectrum I_m(n) has only one maximum. If a DNA
sequence has periods equal to 3, 6, 9 and 18 bases, then the I_m(n) spectrum has four
comparatively small maxima. The modified spectrum I_m(n) may be applied to the
detection of nested periods.
To evaluate the statistical significance of each period, the modified spectrum
I_m(n) must be represented as a spectrum of the argument X of the normal
distribution. The transformation of I_m(n) is performed using formula (6) [10]:
$$X(n) = (4 I_m(n))^{1/2} - (2t-1)^{1/2} \tag{6}$$
Here t is the number of degrees of freedom for I_m(n). This spectrum X(n) will be
shown below for some DNA regions with latent periodicity.
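Formula (6) is a one-line computation. The Python sketch below uses a hypothetical function name, and clamping negative values of I_m(n) to zero before taking the square root is an assumption of this sketch, since the text does not say how such values are handled.

```python
import math

def normal_argument(Im, t):
    """Formula (6): X = sqrt(4 * Im) - sqrt(2 * t - 1), with t degrees of freedom."""
    Im = max(Im, 0.0)   # assumption: negative modified values are treated as zero
    return math.sqrt(4.0 * Im) - math.sqrt(2.0 * t - 1.0)

x_strong = normal_argument(50.0, 6)   # 2*Im = 100 with 6 degrees of freedom
x_random = normal_argument(3.0, 6)    # 2*Im equal to its mean value t = 6
```

When 2I_m(n) equals its mean value t, the argument X is close to zero; values of 2I_m(n) far above t give large positive X.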
Since the DNA sequences of coding regions have a period equal to 3 bases [28],
the search for latent periodicity should be conducted by subtracting I(DNA,3) from
all periods divisible by 3. This eliminates the influence of triplet periodicity on
long latent periodicity. In the present work the search is conducted for DNA regions
in which I(DNA,3n)-I(DNA,3) corresponds to X exceeding 5.6. The value X = 5.6
corresponds to a probability of 10^-8 that the long periodicity found is caused by
random factors.
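The correspondence between the threshold and the probability can be checked with the upper tail of the standard normal distribution. The short Python sketch below uses only the standard library; the function name is hypothetical.

```python
import math

def upper_tail_normal(x):
    """P(Z > x) for a standard normal variable Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

p_threshold = upper_tail_normal(5.6)   # on the order of 10^-8
```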
RESULTS AND DISCUSSION
The search for regions with latent periodicity was performed in DNA and mRNA
clones from the EMBL data bank. Clones shorter than 1000 bases were not analyzed.
An artificial sequence containing 1000 bases was compared with the first 1000 bases
of a DNA or mRNA clone. The left and right borders were varied independently for
each artificial sequence with a period from 2 up to 150. The purpose was to find the
DNA or mRNA region having the best periodicity and the maximum of
I(DNA,3n)-I(DNA,3). If a DNA or mRNA region with latent periodicity was not found,
the scanned artificial sequence was shifted by 100 bases. If a region with latent
periodicity was revealed, the artificial sequence was shifted by 500 bases, and the
calculation procedure was repeated. The full length of a clone was analyzed in this
way. After that the next DNA clone from the EMBL data bank was analyzed.
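The scanning loop described above can be sketched in Python as follows. This is a simplified illustration: the independent variation of the left and right borders is omitted, and `score` is a hypothetical callable standing in for the significance measure of a 1000-base fragment. The toy scorer at the end only demonstrates the stepping logic.

```python
def scan_clone(seq, score, threshold, window=1000, step_miss=100, step_hit=500):
    """Slide a fixed window along seq and collect windows scoring above threshold.

    Returns a list of (start, end, value). Clones shorter than the window
    yield no results, matching the rule that short clones are not analyzed.
    """
    hits = []
    start = 0
    while start + window <= len(seq):
        value = score(seq[start:start + window])
        if value > threshold:
            hits.append((start, start + window, value))
            start += step_hit      # region found: shift by 500 bases
        else:
            start += step_miss     # nothing found: shift by 100 bases
    return hits

# Toy usage: the "score" is simply the fraction of A in the window.
toy_clone = "C" * 1500 + "A" * 1500
toy_hits = scan_clone(toy_clone, lambda s: s.count("A") / len(s), threshold=0.9)
```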
The analysis revealed many DNA and mRNA regions with latent periodicity. More
than 30% of human genes (longer than 1000 bases) from the EMBL data bank have
regions with latent periodicity. Three regions with pronounced latent periods are
shown in Fig. 1 as examples. The corresponding X(n) spectra of these regions are
shown in Fig. 2. The value 2I'(n,DNA) in Fig. 1 is equal to 2I(DNA,n)-3(n-1). The
value 3(n-1) is the mean value of 2I(DNA,n) when the artificial sequence and the DNA
or mRNA sequence are random sequences. The sequences found with latent periodicity
have a significant period equal to 3 bases. These 3 mRNA sequences have 2I(3,DNA) in
the interval from 16.3 up to 62.6. This corresponds to a probability α from less than
10^-9 up to 10^-3. On the background of the period equal to 3 bases (which is
characteristic of DNA coding regions), periods with lengths divisible by 3 are found.
The values 2I'(3n,DNA) for 3 regions, from the A06977 clone [12], the HSFTA clone [1]
and the HSPLASM clone [5], are shown in Fig. 1. These regions are parts of the
galactose gene, the farnesyltransferase gene and the plasminogen gene. These regions
have latent periods equal to 96, 102 and 102 bases. The homology between any pair
of periods in these regions is statistically insignificant.
Earlier, a method of searching for periodicity in a symbolic sequence was
developed [19]. The method is based on writing out the sequence in rows of N
columns to detect a period of N. The symbols of the sequence are then combined into
two groups and a two-letter alphabet is introduced (letters A and B). The symbols
are merged into the two groups on the basis of some common characteristics. For an
amino acid sequence these may be the hydrophobicity of the amino acids, their
charges and some other properties [19]. However, the introduction of a two-letter
alphabet limits the types of periodic sequences that may be found. For example, if
we have the period {(a/t)(t/g/c)(c/a)}, then a new alphabet A={a,t}, B={c,g} reveals
a period in the first column, but periodicity is not revealed in the second and third
columns. If we introduce a new alphabet A={t,g,c}, B={a}, then we may find a
periodicity in the second column only. An analysis of periodicity by the introduction
of a new two-letter alphabet is effective if there is a similar distribution of base
types in the columns. Moreover, if we consider triplet periodicity, there are many
ways of merging triplets into two letters, and it is very hard to test all of them.
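The limitation of a two-letter alphabet can be illustrated with a small Python sketch. The sequence is built deterministically from the period {(a/t)(t/g/c)(c/a)}, and for each column the share of the dominant letter after recoding is reported; a share of 1.0 means the column looks perfectly periodic in the reduced alphabet. The function name and the toy sequence are hypothetical.

```python
def column_purity(seq, period, groups):
    """Share of the dominant two-letter symbol in each column of the period.

    groups maps each new letter to the string of bases it replaces.
    """
    recode = {base: label for label, bases in groups.items() for base in bases}
    purity = []
    for j in range(period):
        column = [recode[b] for b in seq[j::period]]
        top = max(column.count(label) for label in groups)
        purity.append(top / len(column))
    return purity

# 12 copies of the period {(a/t)(t/g/c)(c/a)}, cycling through the allowed bases.
toy = "".join("at"[k % 2] + "tgc"[k % 3] + "ca"[k % 2] for k in range(12))

first = column_purity(toy, 3, {"A": "at", "B": "cg"})    # only column 1 is pure
second = column_purity(toy, 3, {"A": "tgc", "B": "a"})   # only column 2 is pure
```

Neither recoding makes all three columns pure at once, which is why a measure computed on the full four-letter alphabet is preferable for periods of this kind.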
Regions with latent periodicity in the DNA or mRNA sequences of some genes
were found earlier [15]. The results of this work and of the earlier work show that
no less than 30% of human genes from the EMBL data bank have regions with latent
periodicity. Latent periodicity is not similar to a homologous alternation of DNA or
mRNA bases and can strongly influence codon usage. It is possible to assume that
latent periodicity is a typical feature of some genes and that it can be related to
the evolutionary origin of genes by a process of multiple duplications. In that case
latent periodicity is a reflection of ancient evolutionary events in gene sequences.
It is also possible to assume that latent periodicity has a certain function in the
cell. Latent periodicity can provide certain bends of DNA coding regions [6, 9, 22].
A computer data bank of the regions with latent periodicity is now being created.
This data bank will be hosted on a computer at Lancashire University.
REFERENCES
1. D.A. Andres, A. Milatovich, T. Ozcelik, J.M. Wenzlau, M.S. Brown, J.L. Goldstein,
U. Francke "cDNA cloning of the two subunits of human CAAX farnesyltransferase
and chromosomal mapping of FNTA and FNTB loci and related sequences"
Genomics. 18, 105 (1993)
2. V.V. Bliskovsky "Tandem DNA repeats in vertebrate genomes: structure, possible
mechanisms of creation and evolution" Mol. Biol. (Russian). 25, 965 (1991)
3. M. Bina "Periodicity of dinucleotides in nucleosomes derived from Simian virus
40 chromatin" J. Mol. Biol. 235, 198 (1994)
4. B. Borstnik, D. Pumpernik, D. Lukman, D. Ugarkovic and M. Plohl "Tandemly
repeated pentanucleotides in DNA sequences of eucaryotes" Nucl. Acids Res. 22,
3412 (1994)
5. M.J. Browne, C.G. Chapman, I. Dodd, J.E. Carey, G.M. Lawrence, D. Mitchell,
J.H. Robinson "Expression of recombinant human plasminogen and
aglycoplasminogen in HeLa cells" Unpublished.
6. P. Carrera and F. Azorin "Structural characterization of intrinsically curved
AT-rich DNA sequences" Nucl. Acids Res. 22, 3671 (1994)
7. V.R. Chechetkin, L.A. Knizhnikova and A.Yu. Turygin "Three-quasiperiodicity,
mutual correlations, ordering and long modulations in genomic nucleotide
sequences viruses" J. of Biomol. Str. & Dynamics. 12, 271 (1994)
8. E.A. Cheever, G.C. Overton and B.B. Searls "Fast Fourier transform-based
correlations of DNA sequences using complex plane encoding" CABIOS. 7, 143
(1991)
9. D.S. Goodsell and R.E. Dickerson "Bending and curvature calculations in
B-DNA" Nucl. Acids Res. 22, 5497 (1994)
10. Handbook of applicable mathematics. Volume IV. (1984). Wiley-Interscience
Publication. John Wiley & Sons. New York-Brisbane-Toronto-Singapore.
11. C.M. Hearne, S. Ghosh and J.A. Todd "Microsatellites for linkage analysis of
genetic traits" Trends in Genetics. 8, 288 (1992)
12. E. Hinchliffe "Induction of galactose regulated gene expression in yeast" Patent
number EP0248637 (1987)
13. E.V. Korotkov "Fast method of homology and purine-pyrimidine mutual
relations between DNA sequences search" DNA Sequence. 4, 413 (1994)
14. E.V. Korotkov and M.A. Korotkova "Enlarged similarity of nucleic acids
sequences" DNA Research. 3, N.3, I (1996)
15. E.V. Korotkov and M.A. Korotkova "Latent periodicity of DNA sequences of
some human genes" DNA Sequence. 5, 353 (1995)
16. S. Kullback "Information Theory and Statistics" John Wiley & Sons, Inc.
New York, London (1959)
17. V.Yu. Makeev and V.G. Tumanyan "On link between the distance and correlation
analysis of various types and Fourier transformation used for the search of the
periodical patterns in the primary structures of the biopolymers" Biophysics
(Russian). 39, 294 (1994)
18. A.D. McLachlan "Repeated helical pattern in apolipoprotein A-I" Nature. 267,
465 (1977a)
19. A.D. McLachlan "Analysis of periodic patterns in amino acid sequences:
collagen" Biopolymers. 16, 1271 (1977b)
20. A.D. McLachlan "Multichannel Fourier analysis of patterns in protein sequences"
J. Phys. Chem. 97, 3000 (1993)
21. A.D. McLachlan and M. Stewart "Analysis of repeated motif in the talin rod"
J. Mol. Biol. 235, 1278 (1994)
22. P.T. McNamara, A. Bolshoy, E.N. Trifonov and R.E. Harrington
"Sequence-dependent kinks in curved DNA" J. Biomol. Str. & Dyn. 3, 529 (1990)
23. E. Pizzi, S. Liuni and C. Frontali "Detection of latent sequence periodicities"
Nucl. Acids Res. 18, 3745 (1990)
24. B.D. Silverman and R. Linsker "A measure of DNA periodicity" J. Theor. Biol.
118, 295 (1986)
25. M. Stewart and A.D. McLachlan "Fourteen actin-binding sites on tropomyosin?"
Nature. 257, 331 (1975)
26. J.A. Todd "La carte des microsatellites est arrivee" Human Molecular Genetics. 1,
663 (1992)
27. R.F. Voss "Evolution of long-range fractal correlations and 1/f noise in DNA
base sequences" Phys. Rev. Letters. 68, 3805 (1992)
28. G. von Heijne "Sequence analysis in molecular biology" Academic Press, Inc.
San Diego-New York-London (1987)
29. A.M. Yaglom and I.M. Yaglom "Probability and Information" Nauka Press,
Moscow (1960)