On Periodicity in the Occurrence of Nucleotides
in Protein Coding Stretches of Genomes
Probal Chaudhuri
Indian Statistical Institute, Calcutta
Tutorial in ICAPR, February 2009
Periodicity in DNA Sequences :
+ Depending on the size and the complexity of a genome, a substantial
part of it may be non-functional or may have unknown or uncertain
functions (“the junk or selfish DNA”).
+ How can the functional stretches of a genome (e.g., protein coding
regions or genes) be identified and distinguished from the “junk”
stretches?
+ Periodic occurrence of nucleotide bases (or of DNA words formed by
combinations of bases) is a possible feature that distinguishes
functional stretches from non-functional stretches.
Fourier Analysis :
Fourier Analysis to detect periodic occurrence of DNA words:
+ Fix a DNA word, i.e., a mononucleotide or an oligonucleotide. Then
convert the DNA sequence into a finite binary sequence {x_n} of 0’s
and 1’s that records the presence or absence of the DNA word at
different positions of the DNA sequence.
+ Next, compute the Discrete Fourier Transform and obtain the Fourier
Spectrum for the sequence {x_n}:
$$ f(\omega) = N^{-1} \sum_{n=1}^{N} x_n \exp(2 n \pi i \omega), $$
where 0 < ω < 1 and i = √(−1).
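The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `indicator_sequence` and `fourier_spectrum` are hypothetical, a word is recorded at the position where it starts, and the spectrum is taken as the modulus of the complex sum so that it is a real number that can be compared across frequencies (the formula as extracted shows no modulus bars).

```python
import cmath

def indicator_sequence(dna, word):
    """Binary sequence: 1 where `word` starts at that position of `dna`, else 0."""
    dna, word = dna.upper(), word.upper()
    return [1 if dna.startswith(word, n) else 0 for n in range(len(dna))]

def fourier_spectrum(x, omega):
    """Modulus of N^{-1} * sum_{n=1}^{N} x_n * exp(2*n*pi*i*omega) (assumed form)."""
    N = len(x)
    total = sum(xn * cmath.exp(2j * cmath.pi * n * omega)
                for n, xn in enumerate(x, start=1))
    return abs(total) / N

x = indicator_sequence("ATGATGATGATG", "A")
print(x)                            # [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
print(fourier_spectrum(x, 1 / 3))   # large: "A" recurs every third position
print(fourier_spectrum(x, 0.4))     # smaller away from the frequency 1/3
```

Evaluating the sum directly costs O(N) per frequency, which is enough for a handful of frequencies; an FFT would be the usual choice for a full spectrum.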
Fourier Analysis (contd.)
+ If k-periodicity (k ≥ 2) is suspected for a specified DNA word, it
can be checked using the statistic R(k), defined as the ratio of the
maximum value of the Fourier Spectrum f(ω) in a deleted neighborhood
of ω = 1/k (a neighborhood that excludes the point 1/k itself) to the
value of f(1/k) for the corresponding binary sequence.
+ A small value of R(k), especially one much smaller than 1, is
evidence of k-periodicity in the sequence.
+ For identification of genes, a popular choice is k = 3, with a
mononucleotide used to generate the binary sequence from the DNA
sequence.
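A minimal sketch of the statistic R(k) follows. The slides do not say how wide the deleted neighborhood is or how the maximum is searched, so the parameters `width`, `gap` and `steps` (a grid of frequencies with gap ≤ |ω − 1/k| ≤ width) are assumptions chosen only for illustration, as is taking the spectrum as the modulus of the Fourier sum.

```python
import cmath

def spectrum(x, omega):
    """Modulus of N^{-1} * sum_{n=1}^{N} x_n * exp(2*n*pi*i*omega)."""
    total = sum(xn * cmath.exp(2j * cmath.pi * n * omega)
                for n, xn in enumerate(x, start=1))
    return abs(total) / len(x)

def r_statistic(x, k, width=0.05, gap=0.01, steps=200):
    """R(k): max of the spectrum near (but not at) 1/k, divided by its value at 1/k.

    `width`, `gap` and `steps` are illustrative choices for the deleted
    neighborhood; the source does not fix them.
    """
    centre = 1.0 / k
    peak = spectrum(x, centre)
    if peak == 0:
        return float("inf")      # no power at 1/k: no evidence of k-periodicity
    offsets = [gap + (width - gap) * j / steps for j in range(steps + 1)]
    nearby = max(spectrum(x, centre + s * d) for d in offsets for s in (-1, 1))
    return nearby / peak

periodic = [1, 0, 0] * 40        # the word occurs at every third position
print(r_statistic(periodic, 3))  # well below 1
```

In practice the gap would need to scale with the sequence length, since the peak at 1/k gets narrower as N grows.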
Fourier Analysis (contd.)
Thresholding for R(k) :
+ Thresholds for the values of R(k) to discriminate between genes and
non-coding sequences may be organism specific.
+ Thresholding by a statistical test of significance would require
working out the “sampling distribution” of R(k). Such a “sampling
distribution” can be derived (at least in principle) once we assume
some stochastic model for the original DNA sequence.
+ Stochastic models investigated in the literature include homogeneous
and non-homogeneous but periodic Markov models of different orders
and their extensions.
Fourier Analysis (contd.)
Thresholding for R(k) by bootstrap type analysis:
+ A bootstrap type test of significance can be carried out by repeated
statistical perturbation (e.g., by random mutations and permutations of
nucleotide bases) of a given sequence and by computing the value of
R(k) for each perturbed sequence.
+ This leads to a bootstrap type statistical distribution of the R(k)
values and a P-value associated with the value of R(k) observed in the
data.
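One way such a test could look is sketched below, using random permutations of the bases only (random mutations are not implemented). The helper names, the neighborhood parameters of `r_statistic`, the number of permutations and the add-one correction of the P-value are all assumptions made for this illustration, not details given in the slides.

```python
import cmath
import random

def spectrum(x, omega):
    """Modulus of N^{-1} * sum_{n=1}^{N} x_n * exp(2*n*pi*i*omega)."""
    total = sum(xn * cmath.exp(2j * cmath.pi * n * omega)
                for n, xn in enumerate(x, start=1))
    return abs(total) / len(x)

def r_statistic(x, k, width=0.05, gap=0.01, steps=40):
    """R(k) with an assumed deleted neighborhood gap <= |omega - 1/k| <= width."""
    centre = 1.0 / k
    peak = spectrum(x, centre)
    if peak == 0:
        return float("inf")
    offsets = [gap + (width - gap) * j / steps for j in range(steps + 1)]
    nearby = max(spectrum(x, centre + s * d) for d in offsets for s in (-1, 1))
    return nearby / peak

def permutation_p_value(dna, base, k=3, n_perm=100, seed=0):
    """Observed R(k) and the share of shuffled sequences with R(k) at least as small."""
    def binary(seq):
        return [1 if c == base else 0 for c in seq]

    rng = random.Random(seed)
    observed = r_statistic(binary(dna), k)
    letters = list(dna)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(letters)     # a permutation keeps the base composition
        if r_statistic(binary(letters), k) <= observed:
            hits += 1
    # Add-one correction so the estimated P-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

observed, p = permutation_p_value("ATG" * 30, "A")
print(observed, p)
```

Small values of R(k) count as extreme here, because a small R(k) is what indicates k-periodicity.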
Gene Prediction :
Gene Prediction by Statistical Discriminant Analysis
+ One begins with the identification of Open Reading Frames (ORFs) in
a genome, which are identified by locating the START and the STOP
codons in the genome. There may be overlapping ORFs. To keep
things simple, let us consider only prokaryotic genomes so that there
are no introns.
+ In widely used gene prediction algorithms (e.g., GLIMMER and
GENEMARK), statistical classification of ORFs into genes and
non-coding regions is done based on specific stochastic models for
ORFs that belong to coding and non-coding regions.
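The ORF identification step can be illustrated with a deliberately simplified scan. The sketch assumes ATG as the only START codon and TAA, TAG, TGA as STOP codons, and it reads the forward strand only; real prokaryotic gene finders also handle alternative start codons and the reverse strand. The function name `find_orfs` and the `min_codons` cutoff are hypothetical.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ORFs on the forward strand.

    In each of the three reading frames, an ORF runs from the first START
    codon after the previous STOP up to and including the next in-frame
    STOP. `end` is exclusive, so dna[start:end] is the ORF.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

genome = "CCATGAAATAGGGATGCCCTGA"
for s, e in find_orfs(genome):
    print(s, e, genome[s:e])
# 2 11 ATGAAATAG
# 13 22 ATGCCCTGA
```

Because the three frames are scanned independently, ORFs found in different frames may overlap, as noted above.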
Gene Prediction (contd.) :
GLIMMER and GENEMARK :
+ GLIMMER uses “Interpolated Markov Models”, which can be viewed as
“variable order Markov models”.
+ Different versions of GENEMARK use different types of Markov and
“Hidden Markov Models”.
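To give a feel for how a Markov model can score an ORF, here is a sketch of the simplest case: a fixed-order Markov chain trained separately on coding and non-coding examples, with the ORF scored by a log-likelihood ratio. This is not GLIMMER's interpolated model nor any GENEMARK model; the function names, the order 2 and the pseudocount are assumptions made purely for illustration.

```python
import math
from collections import defaultdict

def train_markov(sequences, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + 4 * pseudocount   # 4 possible bases
        return (row.get(base, 0.0) + pseudocount) / total

    return prob

def log_likelihood(seq, prob, order=2):
    """Log-probability of `seq` given its first `order` bases."""
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))

def log_odds(seq, coding_prob, noncoding_prob, order=2):
    """Positive values favour the coding model, negative the non-coding one."""
    return (log_likelihood(seq, coding_prob, order)
            - log_likelihood(seq, noncoding_prob, order))

coding = train_markov(["ATGATGATGATG"])        # toy training data
noncoding = train_markov(["AAAAAAAAAAAA"])
print(log_odds("ATGATGATG", coding, noncoding))
```

An interpolated or variable-order model would, roughly speaking, combine several such context lengths instead of committing to a single order.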
Gene Prediction (contd.) :
One has to train the statistical classifiers:
+ GLIMMER trains its classifier by assuming that relatively long ORFs
that have good homology with known genes of closely related
organisms are “potential genes”.
+ GENEMARK uses an iterative self-training procedure. The iterative
steps involve modelling of Ribosomal Binding Sites (RBS), which
uses knowledge about known genes in known organisms.
Gene Prediction (contd.) :
Distance Based Analysis and Nearest Neighbour Type Classifiers :
+ Given an ORF, its distance from a set of “potential genes” as well as
its distance from a set of “potential non-genes” can be computed in
some appropriate way (e.g., based on oligonucleotide frequencies in
different DNA sequences).
+ If the distance from the set of “potential genes” turns out to be smaller
than that from the set of “potential non-genes”, the given sequence is
classified as a gene. Otherwise it is classified as a non-coding
sequence.
+ Alternatively, one can also try the more traditional nearest neighbour
classifier.
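The distance-based rule can be sketched as follows. The slides leave the distance open ("some appropriate way"), so the choices here are assumptions: sequences are represented by their trinucleotide frequency vectors, the distance is Euclidean, and the distance from an ORF to a set is the average distance to its members. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def word_frequencies(seq, w=3):
    """Relative frequencies of all overlapping DNA words of length `w`."""
    words = ["".join(p) for p in product("ACGT", repeat=w)]
    counts = Counter(seq[i:i + w] for i in range(len(seq) - w + 1))
    total = max(sum(counts[x] for x in words), 1)
    return [counts[x] / total for x in words]

def distance(seq_a, seq_b, w=3):
    """Euclidean distance between the two word-frequency vectors."""
    fa, fb = word_frequencies(seq_a, w), word_frequencies(seq_b, w)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fa, fb)))

def distance_to_set(orf, sequences, w=3):
    """Average distance from `orf` to the members of a set of sequences."""
    return sum(distance(orf, s, w) for s in sequences) / len(sequences)

def classify(orf, potential_genes, potential_non_genes, w=3):
    """Label the ORF by whichever set it is closer to."""
    d_gene = distance_to_set(orf, potential_genes, w)
    d_non = distance_to_set(orf, potential_non_genes, w)
    return "gene" if d_gene < d_non else "non-coding"

print(classify("ATGGCCATGGCC", ["ATGGCCATGGCCATG"], ["TTTTTTTTTTTT"]))  # gene
```

Replacing `distance_to_set` by the minimum over the members of each set would give the nearest neighbour variant mentioned in the last bullet.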
Gene Prediction (contd.) :
Blending of ideas from Fourier and Nearest Neighbour type analysis:
+ We first run the Fourier analysis on a set of identified ORFs in the
genome and compute the value of R(3) for each ORF sequence.
+ Then we form a set of “potential genes” and a set of “potential
non-genes”. An ORF sequence is placed in the first set if it exhibits
significant evidence of 3-periodicity with respect to some DNA word(s)
(i.e., if the value of R(3) is below some appropriate threshold), and
in the second set if it does not exhibit any such evidence of
3-periodicity with respect to any DNA word (i.e., if the value of R(3)
is above some appropriate threshold).
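The construction of the two training sets might look like the sketch below. It assumes the DNA words are the four single bases, summarises an ORF by its smallest R(3) over those words, and uses two placeholder thresholds `low` and `high`, so that ORFs falling between them stay unassigned; the slides only speak of "some appropriate threshold". The neighborhood parameters of `r_statistic` are likewise assumptions.

```python
import cmath

def spectrum(x, omega):
    """Modulus of N^{-1} * sum_{n=1}^{N} x_n * exp(2*n*pi*i*omega)."""
    total = sum(xn * cmath.exp(2j * cmath.pi * n * omega)
                for n, xn in enumerate(x, start=1))
    return abs(total) / len(x)

def r_statistic(x, k, width=0.05, gap=0.01, steps=50):
    """R(k) with an assumed deleted neighborhood gap <= |omega - 1/k| <= width."""
    centre = 1.0 / k
    peak = spectrum(x, centre)
    if peak == 0:
        return float("inf")
    offsets = [gap + (width - gap) * j / steps for j in range(steps + 1)]
    nearby = max(spectrum(x, centre + s * d) for d in offsets for s in (-1, 1))
    return nearby / peak

def min_r3(orf, words="ACGT"):
    """Smallest R(3) over the chosen DNA words (here the four single bases)."""
    return min(r_statistic([1 if c == w else 0 for c in orf], 3) for w in words)

def split_orfs(orfs, low=0.5, high=1.0):
    """Partition ORFs by R(3); `low` and `high` are placeholder thresholds."""
    potential_genes, potential_non_genes = [], []
    for orf in orfs:
        r = min_r3(orf)
        if r < low:
            potential_genes.append(orf)        # some word shows 3-periodicity
        elif r > high:
            potential_non_genes.append(orf)    # no word shows 3-periodicity
    return potential_genes, potential_non_genes

genes, non_genes = split_orfs(["ATG" * 30, "AC" * 45])
print(len(genes), len(non_genes))
```

Taking the minimum over words matches the wording above: one periodic word is enough for the first set, while the second set requires that no word shows periodicity.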
Gene Prediction (contd.)
+ Given an ORF, we select a random subset of the potential genes and
a random subset of the potential non-genes.
+ We classify the given ORF by a nearest neighbor type analysis, using
the randomly selected subsets as the training data set.
+ We repeat the random selection of subsets several times, and the
final classification of the given ORF is determined by “the majority
of the votes” over the different random selections.
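The random-subset voting scheme can be sketched as below. The concrete choices are assumptions for illustration only: a 1-nearest-neighbour rule on trinucleotide frequency vectors with Euclidean distance, a subset size of 2, and an odd number of rounds so that the vote cannot tie. The function names are hypothetical.

```python
import math
import random
from collections import Counter
from itertools import product

WORDS = ["".join(p) for p in product("ACGT", repeat=3)]

def word_frequencies(seq):
    """Relative frequencies of overlapping trinucleotides in `seq`."""
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    total = max(sum(counts[w] for w in WORDS), 1)
    return [counts[w] / total for w in WORDS]

def distance(a, b):
    """Euclidean distance between trinucleotide frequency vectors."""
    fa, fb = word_frequencies(a), word_frequencies(b)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(fa, fb)))

def nearest_label(orf, genes, non_genes):
    """1-nearest-neighbour label of `orf` within the given training subsets."""
    labelled = [(distance(orf, g), "gene") for g in genes]
    labelled += [(distance(orf, s), "non-coding") for s in non_genes]
    return min(labelled)[1]

def vote(orf, potential_genes, potential_non_genes,
         subset_size=2, rounds=11, seed=0):
    """Majority vote over `rounds` random training subsets."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(rounds):
        genes = rng.sample(potential_genes,
                           min(subset_size, len(potential_genes)))
        non_genes = rng.sample(potential_non_genes,
                               min(subset_size, len(potential_non_genes)))
        votes[nearest_label(orf, genes, non_genes)] += 1
    return votes.most_common(1)[0][0]

genes = ["ATGGCCATGGCCATG", "ATGGCAATGGCAATG", "ATGGCCATGGCAATG"]
non_genes = ["TTTTTTTTTTTT", "TTATTTTTATTT", "AAAAAAAAAAAA"]
print(vote("ATGGCCATGGCC", genes, non_genes))   # gene
```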
Gene Prediction (contd.)
Evaluation of a prediction algorithm :
$$ \text{Sensitivity} = \frac{\text{No. of True Positives}}{\text{No. of True Positives and False Negatives}} $$
$$ \text{Specificity} = \frac{\text{No. of True Positives}}{\text{No. of True Positives and False Positives}} $$
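As a small worked example of the two measures, the sketch below computes them from a set of predicted genes and a set of annotated genes. The function name `evaluate` and the gene identifiers are made up for illustration. Note that specificity as defined here divides by all predicted genes, which is the quantity often called precision elsewhere.

```python
def evaluate(predicted, actual):
    """Sensitivity and specificity (as defined above) for sets of gene IDs."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # predicted genes that are real genes
    fp = len(predicted - actual)   # predicted genes that are not real genes
    fn = len(actual - predicted)   # real genes that were missed
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tp / (tp + fp) if tp + fp else float("nan")
    return sensitivity, specificity

# 3 true positives, 1 false positive, 2 false negatives
print(evaluate({"g1", "g2", "g3", "x1"}, {"g1", "g2", "g3", "g4", "g5"}))
# (0.6, 0.75)
```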