Methods of identification and localization of the DNA coding sequences
Jacek Leluk
Interdisciplinary Centre for Mathematical and Computational Modelling, Warsaw University
Identification of coding/non-coding sequences in genome
- Measures dependent on a model of coding DNA
  - based on oligonucleotide counts: codon usage, amino acid usage, codon preference, hexamer usage
  - based on base compositional bias between codon positions: codon prototype
  - based on dependence between nucleotide positions: Markov models
- Measures independent of a model of coding DNA
  - based on base compositional bias between codon positions: position asymmetry
  - based on periodic correlation between nucleotide positions: periodic asymmetry index, average mutual information, Fourier spectrum
The notation used
- $S$ – DNA sequence of length $l$; $S_i$ ($i = 1, \dots, l$) denotes the individual nucleotides
- $C$ – sequence of codons; $C_j$ – the codon occupying position $j$ in the sequence
- $C^i$ – the sequence of codons that results when the grouping of nucleotides from sequence $S$ into codons starts at nucleotide $i$
- $C^i_j$ – the codon occupying position $j$ in the decomposition $i$ of the sequence $S$
- $C_j[k]$ – the nucleotide occupying position $k$ in the codon $C_j$
Examples
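As an illustration of the notation (not the content of the original slide), the three decompositions of a sequence into codons can be sketched in Python. The function name `decompose` and the toy sequence are assumptions made for this example; an incomplete trailing triplet is simply dropped.

```python
def decompose(seq, frame):
    """Return the codons of decomposition `frame` (1, 2 or 3) of `seq`.

    Grouping starts at nucleotide number `frame` (1-based); an incomplete
    triplet at the end of the sequence is dropped.
    """
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2 or 3")
    start = frame - 1
    return [seq[k:k + 3] for k in range(start, len(seq) - 2, 3)]


S = "ATGGCCAAGTGA"  # hypothetical toy sequence
for i in (1, 2, 3):
    print(i, decompose(S, i))
```

Indexing a returned list with `[j - 1]` gives the codon at position j of that decomposition, and indexing that codon with `[k - 1]` gives its k-th nucleotide.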
Measures based on a model of coding DNA
The notation used
- $P^i(S)$ – probability of the sequence of nucleotides $S$, given that $S$ is coding in frame $i$ ($i = 1, 2, 3$)
- $P_0(S)$ – probability of $S$ as a non-coding (randomly generated) DNA sequence
Likelihood ratio
$P^i(S) / P_0(S)$ – the ratio of the probability of finding the sequence of nucleotides $S$ if $S$ is coding in frame $i$ over the probability of finding the sequence of nucleotides $S$ if $S$ is non-coding
Measures based on a model of coding DNA
The notation used
Log-likelihood ratio
$LP^i(S) = \log \frac{P^i(S)}{P_0(S)}$ – coding potential of sequence $S$ in frame $i$ given the model of coding DNA
- $LP^i(S) > 0$: the probability of the sequence of nucleotides $S$ is higher assuming that $S$ is coding in frame $i$ than assuming that $S$ is non-coding
- $LP^i(S) < 0$: the probability of $S$ is higher assuming that $S$ does not code in frame $i$ than assuming that $S$ is coding in frame $i$
The log-likelihood ratio is computed for all three possible frames. If the sequence is coding, the log-likelihood ratio will be larger for one of the frames than for the other two.
Measures based on a model of coding DNA
Measures based on oligonucleotide counts
Codon usage
$F(C)$ – frequency (probability) of codon $C$ in the genes of the considered species (the codon usage table)
$P(C) = F(C_1)\,F(C_2) \cdots F(C_m)$ – probability of finding the sequence of codons $C = C_1 C_2 \dots C_m$ knowing that $C$ codes for a protein
$P_0(C) = (1/64)^m$ – probability of finding the same sequence of $m$ codons in non-coding DNA
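A minimal sketch of the codon usage measure, assuming that the probability of a codon sequence is the product of the codon frequencies and that the score is the log of its ratio to (1/64)^m. The function names, the toy training genes and the pseudocount are assumptions made for this example, not part of the slides; sequences are expected to contain only A, C, G and T.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codons_in_frame(seq, frame):
    """Codons obtained when grouping starts at nucleotide `frame` (1, 2 or 3)."""
    return [seq[k:k + 3] for k in range(frame - 1, len(seq) - 2, 3)]


def train_codon_usage(genes, pseudocount=1.0):
    """Estimate a codon usage table from in-frame genes.

    The pseudocount (an assumption of this sketch) keeps unseen codons
    from getting probability zero.
    """
    counts = Counter()
    for gene in genes:
        counts.update(codons_in_frame(gene, 1))
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def codon_usage_score(seq, frame, usage):
    """Log-likelihood ratio: sum over codons of log(F(codon) / (1/64))."""
    return sum(math.log(usage[c] * 64) for c in codons_in_frame(seq, frame))


toy_genes = ["ATGGCCAAGGCCAAGTGA", "ATGAAGGCCGCCAAGTAA"]  # hypothetical data
F = train_codon_usage(toy_genes)
query = "ATGGCCAAGGCC"
scores = {i: codon_usage_score(query, i, F) for i in (1, 2, 3)}
best = max(scores, key=scores.get)
print(scores, "best frame:", best)
```

In practice the table would come from a published codon usage table for the species rather than from two toy genes.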
Measures based on a model of coding DNA
Measures based on oligonucleotide counts
Amino acid usage
$F'(C)$ – the observed probability of the amino acid encoded by codon $C$ in the existing proteins
This value can be derived directly from a codon usage table by summing the probabilities of synonymous codons:
$F'(c) = \sum_{c' \equiv c} F(c')$, where $c' \equiv c$ means $c'$ is synonymous to $c$
$P(C) = F'(C_1)\,F'(C_2) \cdots F'(C_m)$ – probability of finding the amino acid sequence that results from translating the sequence in the coding open reading frame
$F_0(C) = n_C / 64$ – frequency of the "non-coding amino acids"; $n_C$ – number of codons synonymous to $C$
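The summation over synonymous codons can be sketched as follows, assuming the standard genetic code and treating the three stop codons as one synonymous class (an assumption of this sketch). The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def amino_acid_usage(codon_usage):
    """For each codon c, sum the usage of all codons synonymous to c."""
    per_amino_acid = {}
    for codon, freq in codon_usage.items():
        aa = GENETIC_CODE[codon]
        per_amino_acid[aa] = per_amino_acid.get(aa, 0.0) + freq
    return {codon: per_amino_acid[GENETIC_CODE[codon]] for codon in codon_usage}


def noncoding_amino_acid_frequency():
    """n_c / 64 for each codon c, where n_c counts the codons synonymous to c."""
    family_size = {}
    for aa in GENETIC_CODE.values():
        family_size[aa] = family_size.get(aa, 0) + 1
    return {codon: family_size[aa] / 64 for codon, aa in GENETIC_CODE.items()}


uniform = {codon: 1 / 64 for codon in GENETIC_CODE}  # hypothetical flat usage table
F_prime = amino_acid_usage(uniform)
F_zero = noncoding_amino_acid_frequency()
print(F_prime["CTT"], F_zero["CTT"], F_zero["ATG"])
```

With a flat codon usage table the two quantities coincide, which is why only a biased (species-specific) table yields a non-zero score.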
Measures based on a model of coding DNA
Measures based on oligonucleotide counts
Codon preference
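$F''(C) = F(C) / \sum_{c' \equiv C} F(c')$ – relative probability in coding regions of codon $C$ among codons synonymous to $C$
$P^i(S) = F''(C^i_1)\,F''(C^i_2) \cdots F''(C^i_m)$ – probability of the sequence $S$ encoding the particular amino acid sequence in frame $i$
In non-coding regions there is no preference between "synonymous codons". Then:
$F_0(C) = 1 / n_C$ – probability of codon $C$ in non-coding DNA, where $n_C$ is the number of codons synonymous to $C$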
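A minimal sketch of the codon preference table, assuming the standard genetic code: each codon's usage is divided by the summed usage of its synonymous family, while the non-coding model gives every member of a family the same share. Function names and the toy usage table are assumptions made for this example.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codon_preference(codon_usage):
    """Usage of each codon relative to the total usage of its synonymous family."""
    family_total = Counter()
    for codon, freq in codon_usage.items():
        family_total[GENETIC_CODE[codon]] += freq
    return {c: f / family_total[GENETIC_CODE[c]] for c, f in codon_usage.items()}


def noncoding_preference():
    """No preference among synonymous codons: 1 / (family size) for each codon."""
    family_size = Counter(GENETIC_CODE.values())
    return {c: 1 / family_size[aa] for c, aa in GENETIC_CODE.items()}


# Hypothetical usage table: flat, except that GCC is three times as frequent.
weights = {c: 1.0 for c in GENETIC_CODE}
weights["GCC"] = 3.0
total = sum(weights.values())
usage = {c: w / total for c, w in weights.items()}

pref = codon_preference(usage)
pref0 = noncoding_preference()
print(pref["GCC"], pref["GCT"], pref0["GCC"])
```

Unlike codon usage, this measure is insensitive to the amino acid composition of the encoded protein; only the choice among synonymous codons matters.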
Measures based on a model of coding DNA
Measures based on oligonucleotide counts
Hexamer usage
This approach is based on the hexamer usage table $F(h_i)$ for $i = 1, 2, 3, \dots, 4096$. In this case there are six reading frames to be analyzed.
The probability of a sequence of hexanucleotides $H = H_1 H_2 \cdots H_m$ in the coding frame of a coding sequence is
$P(H) = F(H_1)\,F(H_2) \cdots F(H_m)$
Measures based on a model of coding DNA
Measures based on base compositional bias between codon positions
Codon prototype
Let $f(b, r)$ be the probability of nucleotide $b$ at codon position $r$, as estimated from known coding regions. Then:
$F(c) = f(c[1], 1)\,f(c[2], 2)\,f(c[3], 3)$ is the probability of codon $c$ in coding regions, assuming independence between adjacent nucleotides
$F_0(c) = 1/64$ is the probability of every triplet $c$ in non-coding DNA
Example:
$P^1(S) = F(C^1_1)\,F(C^1_2) \cdots F(C^1_m)$
$P^2(S)$ and $P^3(S)$ are computed in a similar way
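A minimal sketch of the codon prototype score, assuming the coding probability of a triplet is the product of the three position-specific nucleotide probabilities f(b, r) and that every triplet has probability 1/64 in non-coding DNA. The function names, the toy training genes and the pseudocount are assumptions made for this example.

```python
import math


def position_frequencies(genes, pseudocount=1.0):
    """Estimate f(b, r) from in-frame genes, with a pseudocount per cell."""
    counts = {(b, r): pseudocount for b in "ACGT" for r in (1, 2, 3)}
    for gene in genes:
        for k, b in enumerate(gene):
            counts[(b, k % 3 + 1)] += 1
    f = {}
    for r in (1, 2, 3):
        total = sum(counts[(b, r)] for b in "ACGT")
        for b in "ACGT":
            f[(b, r)] = counts[(b, r)] / total
    return f


def prototype_score(seq, frame, f):
    """Sum over the codons of frame `frame` of log(F(c) / (1/64))."""
    score = 0.0
    for k in range(frame - 1, len(seq) - 2, 3):
        c = seq[k:k + 3]
        F_c = f[(c[0], 1)] * f[(c[1], 2)] * f[(c[2], 3)]
        score += math.log(F_c * 64)
    return score


toy_genes = ["GCAGCTGCCGCG", "GCTGCAGCGGCC"]  # hypothetical training data
f = position_frequencies(toy_genes)
query = "GCAGCTGCC"
scores = {i: prototype_score(query, i, f) for i in (1, 2, 3)}
print(scores)
```

The model needs only a 4x3 table of frequencies, so it can be estimated from far less training data than a full codon usage table.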
Measures based on a model of coding DNA
Measures based on dependence between nucleotide positions
Markov Models
In Markov models the probability of a nucleotide at a particular codon position depends on the nucleotide(s) preceding it.
The Markov model of order 1 is the simplest of the Markov models. The probability of a nucleotide depends only on the preceding nucleotide. In this case, the model of coding DNA is based on the probabilities of the four nucleotides at each codon position, conditional on the nucleotide occurring at the preceding codon position (technically called the transition probabilities). Thus, instead of one single matrix, as in Codon Prototype, three 4x4 matrices (the transition matrices) are required, F1, F2, and F3, each one corresponding to a different codon position.
Markov models of order 1 to 5 are used.
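A minimal sketch of a first-order Markov model with three position-specific 4x4 transition matrices. It assumes a non-coding model in which every nucleotide has probability 1/4 regardless of its predecessor and ignores the probability of the first nucleotide; both are simplifications of this sketch. Function names, toy genes and the pseudocount are hypothetical.

```python
import math


def train_transitions(genes, pseudocount=1.0):
    """Return F[r][a][b]: probability of b at codon position r given preceding a."""
    F = {r: {a: {b: pseudocount for b in "ACGT"} for a in "ACGT"} for r in (1, 2, 3)}
    for gene in genes:
        for k in range(1, len(gene)):
            r = k % 3 + 1  # codon position of nucleotide k in an in-frame gene
            F[r][gene[k - 1]][gene[k]] += 1
    for r in F:
        for a in F[r]:
            row_total = sum(F[r][a].values())
            for b in F[r][a]:
                F[r][a][b] /= row_total
    return F


def markov_score(seq, frame, F):
    """Log-likelihood ratio of `seq` read in `frame` against a uniform model."""
    score = 0.0
    for k in range(1, len(seq)):
        r = (k - (frame - 1)) % 3 + 1  # codon position of nucleotide k in this frame
        score += math.log(F[r][seq[k - 1]][seq[k]] / 0.25)
    return score


toy_genes = ["GCAGCTGCCGCG", "GCTGCAGCGGCC"]  # hypothetical training data
F = train_transitions(toy_genes)
query = "GCAGCTGCC"
scores = {i: markov_score(query, i, F) for i in (1, 2, 3)}
best = max(scores, key=scores.get)
print(scores, "best frame:", best)
```

Higher-order models follow the same pattern with the preceding 2 to 5 nucleotides as context, at the cost of exponentially larger tables.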
Measures independent of a model of coding DNA
Measures based on base compositional bias between codon positions
Position asymmetry
The goal is to measure how asymmetric the distribution of nucleotides is at the three triplet positions in the sequence.
$f(b, r)$ – the relative frequency of nucleotide $b$ at codon position $r$ in the sequence $S$, as calculated from one of the three decompositions of $S$ into codons (any of them)
$\mu(b) = \frac{1}{3} \sum_{r=1}^{3} f(b, r)$ – average frequency of nucleotide $b$ at the three codon positions
$\mathrm{asymm}(b) = \sum_{r=1}^{3} \left( f(b, r) - \mu(b) \right)^2$ – asymmetry in the distribution of nucleotide $b$
Position Asymmetry of the sequence:
$\mathrm{PA}(S) = \mathrm{asymm}(A) + \mathrm{asymm}(C) + \mathrm{asymm}(G) + \mathrm{asymm}(T)$
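A minimal sketch of position asymmetry, assuming the per-nucleotide asymmetry is the sum of squared deviations of the three position-specific frequencies from their mean and that the sequence value is the sum over the four nucleotides. The function name is hypothetical; no model of coding DNA is needed.

```python
def position_asymmetry(seq):
    """Sum over A, C, G, T of the squared deviations of f(b, r) from their mean."""
    n = len(seq) - len(seq) % 3  # use complete triplets only
    if n == 0:
        raise ValueError("sequence must contain at least one triplet")
    per_position = n // 3
    total = 0.0
    for b in "ACGT":
        # Relative frequency of b at each of the three triplet positions.
        f = [sum(1 for k in range(r, n, 3) if seq[k] == b) / per_position
             for r in range(3)]
        mu = sum(f) / 3
        total += sum((x - mu) ** 2 for x in f)
    return total


print(position_asymmetry("GCA" * 10))   # strongly position-biased toy sequence
print(position_asymmetry("ACGT" * 3))   # every nucleotide equally frequent at each position
```

Because the three decompositions only permute the triplet positions, the value does not depend on which decomposition is used, apart from edge effects at the sequence ends.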
Measures independent of a model of coding DNA
Measures based on periodic correlation between nucleotide positions
Periodic asymmetry index
This approach considers three distinct probabilities:
- the probability $P_{in}$ of finding pairs of the same nucleotide at distances k = 2, 5, 8, ...
- the probability $P^1_{out}$ of finding pairs of the same nucleotide at distances k = 0, 3, 6, ...
- the probability $P^2_{out}$ of finding pairs of the same nucleotide at distances k = 1, 4, 7, ...
The tendency to cluster homogeneous di-nucleotides in a 3-base periodic pattern can be measured by the Periodic Asymmetry Index:
$PAI = \max(P_{in}, P^1_{out}, P^2_{out}) / \min(P_{in}, P^1_{out}, P^2_{out})$
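The three probabilities can be estimated by counting identical pairs, as in the sketch below. It assumes that a distance k means k nucleotides lie between the two members of a pair, that the index is the ratio of the largest to the smallest of the three probabilities, and that distances are capped at `max_k`; the cap and the function names are assumptions made for this example.

```python
import math


def pair_probabilities(seq, max_k=20):
    """Estimate P_in, P1_out and P2_out from pairs at distances 0..max_k."""
    if len(seq) < 4:
        raise ValueError("sequence too short")
    label = {2: "in", 0: "out1", 1: "out2"}  # distance class by k mod 3
    same = {"in": 0, "out1": 0, "out2": 0}
    total = {"in": 0, "out1": 0, "out2": 0}
    for k in range(max_k + 1):
        step = k + 1  # k intervening nucleotides means an index offset of k + 1
        cls = label[k % 3]
        for a in range(len(seq) - step):
            total[cls] += 1
            if seq[a] == seq[a + step]:
                same[cls] += 1
    return {c: same[c] / total[c] for c in same}


def periodic_asymmetry_index(seq, max_k=20):
    """Ratio of the largest to the smallest of the three pair probabilities."""
    p = pair_probabilities(seq, max_k)
    lowest = min(p.values())
    return math.inf if lowest == 0 else max(p.values()) / lowest


print(pair_probabilities("GCA" * 20))
print(periodic_asymmetry_index("A" * 30))
```

A sequence without 3-base periodicity gives three similar probabilities and an index close to 1.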
Measures independent of a model of coding DNA
Measures based on periodic correlation between nucleotide positions
Average mutual information
$n_{ij}(k)$ – absolute number of times that nucleotide $i$ is followed by nucleotide $j$ at a distance of $k$ positions
$p_{ij}(k) = n_{ij}(k) / (l - k - 1)$ – probability that nucleotide $i$ is followed by nucleotide $j$ at a distance of $k$ positions
$C_{ij}(k) = p_{ij}(k) - p_i p_j$ – correlation between nucleotides $i$ and $j$ at a distance of $k$ positions, where $p_i$ and $p_j$ are the probabilities of occurrence of nucleotides $i$ and $j$ in the sequence $S$
Mutual Information function
$I(k) = \sum_{i,j} p_{ij}(k) \log_2 \frac{p_{ij}(k)}{p_i p_j}$ – quantifies the amount of information that can be obtained from one nucleotide about another nucleotide at a distance $k$
$I_{in}$ – the in-frame mutual information at distances k = 2, 5, 8, ...
$I_{out}$ – the out-frame mutual information at distances k = 0, 1, 3, 4, ...
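Average Mutual Information
$AMI = (I_{in} + 2 I_{out}) / 3$, where $I_{in}$ and $I_{out}$ are the in-frame and out-frame mutual information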
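A minimal sketch of the mutual information function and its in-frame and out-frame averages. It assumes that a distance k means k intervening nucleotides, that pair probabilities are estimated as counts divided by the number of available pairs, that the in-frame and out-frame values are plain means over distances up to `max_k`, and that the final average weights them 1:2. The weighting, the cap and the function names are assumptions of this sketch.

```python
import math
from collections import Counter


def mutual_information(seq, k):
    """Mutual information (in bits) between nucleotides k positions apart."""
    step = k + 1
    n_pairs = len(seq) - step
    if n_pairs <= 0:
        raise ValueError("sequence too short for this distance")
    pair_counts = Counter((seq[a], seq[a + step]) for a in range(n_pairs))
    single = Counter(seq)
    info = 0.0
    for (i, j), n in pair_counts.items():
        p_ij = n / n_pairs
        p_i = single[i] / len(seq)
        p_j = single[j] / len(seq)
        info += p_ij * math.log2(p_ij / (p_i * p_j))
    return info


def average_mutual_information(seq, max_k=11):
    """Return (AMI, I_in, I_out) using distances k = 0..max_k."""
    in_frame = [mutual_information(seq, k) for k in range(max_k + 1) if k % 3 == 2]
    out_frame = [mutual_information(seq, k) for k in range(max_k + 1) if k % 3 != 2]
    i_in = sum(in_frame) / len(in_frame)
    i_out = sum(out_frame) / len(out_frame)
    return (i_in + 2 * i_out) / 3, i_in, i_out


print(mutual_information("GCA" * 20, 2))
print(average_mutual_information("ATGGCCAAGGCCAAGTGAATGAAGGCC"))
```

For short sequences the estimates are biased upwards, since sampling noise in the pair counts always adds a small positive amount of apparent information.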
Measures independent of a model of coding DNA
Measures based on periodic correlation between nucleotide positions
Fourier analysis
The partial spectrum of a DNA sequence $S$ of length $l$ corresponding to nucleotide $b$ is defined as:
$S_b(f) = \frac{1}{l^2} \left| \sum_{j=1}^{l} U_b(S_j)\, e^{2\pi i f j} \right|^2$
where $U_b(S_j) = 1$ if $S_j = b$ and 0 otherwise, and $f$ is the discrete frequency, $f = k/l$, for $k = 1, 2, \dots, l/2$
DNA coding regions reveal the characteristic periodicity of 3 as a distinct peak at frequency $f = 1/3$
No such "peak" is apparent for non-coding sequences
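A minimal sketch of the partial spectrum, computed directly from the indicator sequence of one nucleotide. The 1/l² normalization and the function name are assumptions of this sketch; a fast Fourier transform would be used for long sequences.

```python
import cmath


def partial_spectrum(seq, b, k):
    """Partial spectrum of nucleotide `b` at the discrete frequency f = k / len(seq)."""
    l = len(seq)
    f = k / l
    # Sum the complex exponentials over the 1-based positions j where seq has b.
    total = sum(cmath.exp(2j * cmath.pi * f * j)
                for j, s in enumerate(seq, start=1) if s == b)
    return abs(total) ** 2 / l ** 2


seq = "GCA" * 20  # hypothetical, perfectly 3-periodic toy sequence
l = len(seq)
print(partial_spectrum(seq, "G", l // 3))  # at f = 1/3
print(partial_spectrum(seq, "G", l // 6))  # at f = 1/6
```

On real sequences the peak at f = 1/3 is judged against the average level of the spectrum, since isolated values fluctuate strongly.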
Summary of results
List of Gene Identification programs and Internet access (part 1)
List of Gene Identification programs and Internet access (part 2)
Sequence alignment (position ruler on the slide: 3, 10, 20, 30, 40, 50, 60)
ID       Aligned sequence
P01055   ESSKPCCDQCACTKSNPPQCRCSDMRLNSCHSACKSCICALSYPAQCF-CVDITDFCYEP-CKP
P01057   ESSKPCCDECACTKSIPPQCRCTDVRLNSCHSACSSCVCTFSIPAQCV-CVDMKDFCYAP-CKS
P01056   QSSKPCCBHCACTKSIPPQCRCTDLRLDSCHSACKSCICTLSIPAQCV-CBBIBDFCYEP-CKS
P01058   ESSKPCCDQCSCTKSMPPKCRCSDIRLNSCHSACKSCACTYSIPAKCF-CTDINDFCYEP-CKS
P01059   ESSKPCCDLCTCTKSIPPQCHCNDMRLNSCHSACKSCICALSEPAQCF-CVDTTDFCYKS-CHN
P01063   ESSKPCCDLCMCTASMPPQCHCADIRLNSCHSACDRCACTRSMPGQCR-CLDTTDFCYKP-CKS
P17734   QSSKPCCRQCACTKSIPPQCRCSQVRLNSCHSACKSCACTFSIPAQCF-CGBIBBFCYKP-CKS
P81483   -SSKPCCBHCACTKSIPPQCRCSBLRLNSCHSECKGCICTFSIPAQCI-CTDTNNFCYEP-CKS
P81484   -SSKPCCBHCACTKSIPPQCRCSBLRLNSCHSECKGCICTFSIPAQCI-CTDTNNFCYEP-CKS
P16343   ESSKPCCSSC-CTRSRPPQCQCTDVRLNSCHSACKSCMCTFSDPGMCS-CLDVTDFCYKP-CKS
P01064   EYSKPCCDLCMCTRSMPPQCSCEDIRLNSCHSDCKSCMCTRSQPGQCR-CLDTNDFCYKP-CKS
P82469   -SSGPCCDRCRCTKSEPPQCQCQDVRLNSCHSACEACVCSHSMPGLCS-CLDITHFCHEP-CKS
P01061   ESSHPCCDLCLCTKSIPPQCQCADIRLDSCHSACKSCMCTRSMPGQCR-CLDTHDFCHKP-CKS
P01062   ESSEPCCDSCDCTKSIPPECHCANIRLNSCHSACKSCICTRSMPGKCR-CLDTDDFCYKP-CES
P01060   QSSPPCCBICVCTASIPPQCVCTBIRLBSCHSACKSCMCTRSMPGKCR-CLBTTBYCYKS-CKS
1BBI:    ESSKPCCDQCACTKSNPPQCRCSDMRLNSCHSACKSCICALSYPAQCF-CVDITDFCYEP-CKP
1D6R:I   ---KPCCDQCACTKSNPPQCRCSDMRLNSCHSACKSCICALSYPAQCF-CVDITDFCYEP-CK
1DF9:C   ESSEPCCDSCDCTKSIPPQCHCANIRLNSCHSACKSCICTRSMPGKCR-CLDTDDFCYKP-CES
1PI2:    EYSKPCCDLCMCTRSMPPQCSCED-RINSCHSDCKSCMCTRSQPGQCR-CLDTNDFCYKP-CKS
1PBI:A   DVKSACCDTCLCTKSNPPTCRCVDVGET-CHSACLSCICAYSNPPKCQ-CFDTQKFCYKQ-CHN
AAB4719  ESSKPCCDQCTCTKSIPPQCRCTDVRLNSCHSACSSCVCTFSIPAQCV-CVDMKDFCYAP-CKS
TISYC2   ESSKPCCDLCMCTASMPPQCHCADIRLNSCHSACDRCACTRSMPGQCR-CLDTTDFCYKP-CKS
JC2225   ESSKPCCDLCMCTASMPPQCHCADIRLNSCHSACDRCACTRSMPGQCR-CLDTTDFCYKP-CKS
TIZB2    ESSKPCCDQC-CTKSMPPKCRCSDIRLDSCHSACKSCACTYSIPAKCF-CTDINDFCYEP-CKS
JC2073   ESSKPCCDECKCTKSEPPQCQCVDTRLESCHSACKLCLCALSFPAKCR-CVDTTDFCYKP-CKS
JC2072   ESSKPCCDECKCTKSEPPQCQCVDTRLESCHSACKLCLCALSFPAKCR-CVDTTDFCYKP-CKS
0506164  ESSKPCCDQC-CTKSMPPKCRCSDIRLDSCHSACKSCACTYSIPAKCF-CTDINDFCYEP-CKS
0401177  ESSKPCCDLCMCTASMPPQCHCADIRLNSCHSACDRCACTRSMPGQCR-CLDTTDFCYKP-CKS
763679A  ESSKPCCDLCMCTASMPPQCHCADIRLNSCHSACDRCACTRSMPGQCR-CLDTTDFCYKP-CKS
TISYD2   EYSKPCCDLCMCTRSMPPQCSCEDIRLNSCHSDCKSCMCTRSQPGQCR-CLDTNDFCYKP-CKS
0907248  ESSEPCCDSCRCTKSIPPQCHCADIRLNSCHSACKSCMCTRSMPGKCR-CLDTDDFCYKP-CES
1102213  ESSEPCCDLCLCTKSIPPQCQCADIRLNSCHSACKSCMCTRSMPGQCH-CLDTHDFCHKP-CKS
1102213  ESSEPCCDLCLCTKSIPPQCQCADIRLNSCHSACKSCMCTRSMPGQCR-CLDTHDFCHKP-CKS
0404180  EYSKPCCDLCMCTRSMPPQCSCEDIRLNSCHSDCKSCMCTRSQPGQCR-CLDTNDFCYKP-CKS
TIZB1B   ESSHPCCDLCLCTKSIPPQCQCADIRLDSCHSACKSCMCTRSMPGQCH-CLDTHDFCHKP-CKS
TIMB     ESSEPCCDSCDCTKSKPPQCHCANIRLNSCHSACKSCICTRSMPGKCR-CLDTDDFCYKP-CES
TIZB1P   ESSHPCCDLCLCTKSIPPQCQCADIRLNSCHSACKSCMCTRSMPGQCR-CLDTHDFCHKP-CKS
JC1066   ESSEPCCDSCDCTKSKPPQCHCANIRLNSCHSACKSCICTRSMPGKCR-CLDTDDFCTKP-CES
Q41066   DVKSACCDTCLCTKSDPPTCRCVDVGET-CHSACDSCICALSYPPQCQ-CFDTHKFCYKA-CHN
P80321   STTTACCDFCPCTRSIPPQCQCTDVREK-CHSACKSCLCTLSIPPQCH-CYDITDFCYPS-CR
Q41065   DVKSACCDTCLCTKSNPPTCRCVDVRET-CHSACDSCICAYSNPPKCQ-CFDTHKFCYKA-CHN
P81705   --TSACCDKCFCTKSNPPICQCRDVGET-CHSACKFCICALSYPAQCH-CLDQNTFCYDK-CDS
P56679   DVKSACCDTCLCTKSNPPTCRCVDVGET-CHSACLSCICAYSNPPKCQ-CFDTQKFCYKA-CHN
P16346   --TTACCNFCPCTRSIPPQCRCTDIGET-CHSACKTCLCTKSIPPQCH-CADITNFCYPK-CN
P01065   DVKSACCDTCLCTRSQPPTCRCVDVGER-CHSACNHCVCNYSNPPQCQ-CFDTHKFCYKA-CHS
P24661   DVKSACCDTCLCTKSEPPTCRCVDVGER-CHSACNSCVCRYSNPPKCQ-CFDTHKFCYKS-CHN
P07679   KRPWECCDIAMCTRSIPPICRCVDKVDR-CSDACKDCEETEDN--RHV-CFDTYIGDPGPTCHD
P19860   ERPWKCCDLQTCTKSIPAFCRCRDLLEQ-CSDACKECGKVRDSDPPRYICQDVYRGIPAPMCHE
P22737   ERPWKCCDLQTCTKSIPAFCRCRDLLEQ-CSDACKECGKVRDSDPPRYICQDVYRGIPAPMCHE
220645   ES-EGCCDRCICTKSMPPQCHCHDVRLDSCHSDCETCICTRSYPAQCR-CADTTDFCYKP-C-S
P09864   TRPWKCCDRAICTKSFPPMCRCMDMVEQ-CAATCKKCGPATSDSSRRV-CEDXY----------
P09863   KRPWKCCDQAVCTRSIPPICRCMDQVFE-CPSTCKACGPSVGDPSRRV-CQDQYV----------
Thank you for your attention