On Periodicity in the Occurrence of Nucleotides in Protein Coding Stretches of Genomes
Probal Chaudhuri
Indian Statistical Institute, Calcutta
Tutorial in ICAPR, February 2009

Periodicity in DNA Sequences :
+ Depending on the size and the complexity of a genome, a substantial part of it may be non-functional or may have unknown or uncertain functions (“the junk or selfish DNA”).
+ How can the functional stretches of a genome (e.g., protein coding regions or genes) be identified and distinguished from the “junk” stretches?
+ Periodic occurrence of nucleotide bases (DNA words formed by combination of bases) can be a possible feature that distinguishes functional stretches from non-functional stretches.

Fourier Analysis to detect periodic occurrence of DNA words:
+ Fix a DNA word, i.e., a mononucleotide or an oligonucleotide. Then convert the DNA sequence into a finite binary sequence $\{x_n\}$ of 0’s and 1’s that records the presence and the absence of the DNA word at different positions of the DNA sequence.
+ Next, compute the Discrete Fourier Transform and obtain the Fourier Spectrum for the sequence $\{x_n\}$:

$$ f(\omega) = N^{-1} \sum_{n=1}^{N} x_n \exp(2 n \pi i \omega), $$

where $0 < \omega < 1$ and $i = \sqrt{-1}$.
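The two steps above (binary indicator sequence, then Fourier spectrum) can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names `indicator_sequence` and `spectrum` are hypothetical, and taking the modulus of f(ω) is an assumption made here so that spectrum values can be compared as real numbers.

```python
import cmath

def indicator_sequence(dna, word):
    """Return x_n = 1 if `word` starts at position n of `dna`, else 0."""
    dna, word = dna.upper(), word.upper()
    return [1 if dna.startswith(word, n) else 0 for n in range(len(dna))]

def spectrum(x, omega):
    """Modulus of f(omega) = (1/N) * sum_{n=1..N} x_n * exp(2 n pi i omega).

    Assumption: the modulus is used so that values are real and comparable.
    """
    total = sum(xn * cmath.exp(2j * cmath.pi * n * omega)
                for n, xn in enumerate(x, start=1))
    return abs(total) / len(x)

# Toy sequence (made up for illustration): "A" occurs at every third position.
x = indicator_sequence("ATGATGATGATG", "A")
print(x)
print(spectrum(x, 1 / 3), spectrum(x, 1 / 4))
```

On this toy input the spectrum is large at ω = 1/3 and essentially zero at ω = 1/4, which is the kind of contrast the periodicity analysis looks for.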
Fourier Analysis (contd.)
+ If k-periodicity (k ≥ 2) is suspected for a specified DNA word, that can be verified using the statistic R(k) = the ratio between the maximum value of the Fourier Spectrum f(ω) in a deleted neighborhood of ω = 1/k and the value of f(1/k) for the corresponding binary sequence.
+ A small value of R(k) (especially one much smaller than 1) indicates evidence for k-periodicity in the sequence.
+ For identification of genes, a well-known popular choice is k = 3 after a mononucleotide is used to generate the binary sequence from the DNA sequence.

Thresholding for R(k) :
+ Thresholds for the values of R(k) to discriminate between genes and non-coding sequences may be organism specific.
+ Thresholding by statistical test of significance would require working out the “sampling distribution” of R(k). Such a “sampling distribution” can be derived (at least in principle) after we assume some stochastic model for the original DNA sequence.
+ Stochastic models investigated in the literature include homogeneous and non-homogeneous but periodic Markov models of different orders and their extensions.

Thresholding for R(k) by bootstrap type analysis:
+ A bootstrap type test of significance can be carried out by repeated statistical perturbation (e.g., by random mutations and permutations of nucleotide bases) of a given sequence and by computing the value of R(k) for each perturbed sequence.
+ This leads to a bootstrap type statistical distribution of the R(k) values and a P-value associated with the value of R(k) observed in the data.
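A rough sketch of the R(k) statistic and a permutation-based P-value is given below. Several choices are assumptions made for illustration and are not specified in the slides: the modulus of f(ω) is used, the "deleted neighborhood" is taken to be the frequencies whose distance from 1/k lies between 1/N and 5/N, only random permutations (no mutations) are used as perturbations, and the function names and parameters are hypothetical.

```python
import cmath
import random

def spectrum(x, omega):
    """Modulus of f(omega) for a 0/1 sequence x (modulus is an assumption)."""
    total = sum(xn * cmath.exp(2j * cmath.pi * n * omega)
                for n, xn in enumerate(x, start=1))
    return abs(total) / len(x)

def r_statistic(x, k, grid=20):
    """R(k): max spectrum near (but not at) 1/k, divided by the value at 1/k.

    Assumption: the deleted neighborhood is 1/N <= |omega - 1/k| <= 5/N,
    sampled on `grid` points per side; N should be reasonably large.
    """
    n_len = len(x)
    inner, outer = 1.0 / n_len, 5.0 / n_len
    centre = 1.0 / k
    peak = spectrum(x, centre)
    if peak == 0.0:
        return float("inf")  # no signal at 1/k at all
    offsets = [inner + (outer - inner) * j / (grid - 1) for j in range(grid)]
    nearby = [spectrum(x, centre + sign * d)
              for d in offsets for sign in (-1, 1)
              if 0 < centre + sign * d < 1]
    return max(nearby) / peak

def indicator(seq, base):
    """Binary sequence marking the positions of a mononucleotide."""
    return [1 if c == base else 0 for c in seq]

def permutation_p_value(dna, base, k=3, n_perm=99, seed=0):
    """Observed R(k) and the share of shuffled sequences with R(k) as small."""
    rng = random.Random(seed)
    observed = r_statistic(indicator(dna, base), k)
    letters = list(dna)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(letters)  # perturb by permuting the bases
        if r_statistic(indicator(letters, base), k) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy, perfectly 3-periodic sequence (made up for illustration).
obs, p = permutation_p_value("ATG" * 30, "A")
print(obs, p)
```

Small P-values arise when few perturbed sequences give an R(k) as small as the observed one. The width of the neighborhood and the number of permutations are tuning choices that would need care on real data.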
Gene Prediction by Statistical Discriminant Analysis
+ One begins with the identification of Open Reading Frames (ORFs) in a genome, which are identified by locating the START and the STOP codons in the genome. There may be overlapping ORFs. To keep things simple, let us consider only prokaryotic genomes so that there are no introns.
+ In widely used gene prediction algorithms (e.g., GLIMMER and GENEMARK), statistical classification of ORFs into genes and non-coding regions is done based on specific stochastic models for ORFs that belong to coding and non-coding regions.

GLIMMER and GENEMARK :
+ GLIMMER uses “Interpolated Markov Models”, which can be viewed as “variable order Markov models”.
+ Different versions of GENEMARK use different types of Markov and “Hidden Markov Models”.

One has to train the statistical classifiers :
+ GLIMMER trains its classifier by assuming that relatively longer ORFs that have good homology with known genes of closely related organisms are “potential genes”.
+ GENEMARK uses an iterative self-training procedure. The iterative steps involve modelling of Ribosomal Binding Sites (RBS), and that uses knowledge about known genes in known organisms.
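The ORF identification step that precedes any of these classifiers can be sketched as below. This is a simplified illustration and not how GLIMMER or GENEMARK locate ORFs: it scans only the given strand, treats ATG as the only START codon by default, and takes the first START in each frame. The function name `find_orfs` and its parameters are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2, starts=("ATG",)):
    """Return (start, end) index pairs of ORFs on the given strand.

    `end` is exclusive and includes the STOP codon. `min_codons` counts
    codons from START up to, but not including, the STOP codon.
    Assumption: only the forward strand is scanned.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in starts:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return sorted(orfs)

# Toy sequence (made up): one ORF, ATG AAA TTT TAG, in the third frame.
dna = "CCATGAAATTTTAGGG"
for s, e in find_orfs(dna):
    print(s, e, dna[s:e])
```

Because the three frames are scanned independently, ORFs in different frames may overlap, as noted in the slides. A fuller version would also scan the reverse complement.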
Distance Based Analysis and Nearest Neighbour Type Classifiers :
+ Given an ORF, its distance from a set of “potential genes” as well as its distance from a set of “potential non-genes” can be computed in some appropriate way (e.g., based on oligonucleotide frequencies in different DNA sequences).
+ If the distance from the set of “potential genes” turns out to be smaller than that from the set of “potential non-genes”, the given sequence is classified as a gene. Otherwise it is classified as a non-coding sequence.
+ Alternatively, one can also try the more traditional nearest neighbour classifier.

Blending of the ideas from Fourier and Nearest Neighbour type analysis:
+ We first run the Fourier analysis on a set of identified ORFs in the genome and compute the value of R(3) for each ORF sequence.
+ Then we form a set of “potential genes” and “potential non-genes” by classifying each ORF sequence into the first set if it exhibits significant evidence of 3-periodicity with respect to some DNA word(s) (i.e., if the value of R(3) is below some appropriate threshold), and by classifying it into the second set if it does not exhibit any such evidence for 3-periodicity with respect to any DNA word (i.e., if the value of R(3) is above some appropriate threshold).
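Once sets of “potential genes” and “potential non-genes” are available, the distance-based rule can be sketched as follows. The slides leave the distance open ("in some appropriate way"), so the choices here are assumptions: sequences are represented by oligonucleotide (k-mer) relative frequencies, the Euclidean distance is used, and the distance to a set is taken as the mean distance to its members. All names and the toy sequences are hypothetical.

```python
import math
from itertools import product

def kmer_frequencies(seq, k=3):
    """Relative frequencies of all 4**k DNA words of length k in `seq`."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:          # skip words with ambiguous bases
            counts[word] += 1
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in words]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_to_set(seq, reference_seqs, k=3):
    """Mean distance from `seq` to the members of a set (an assumption)."""
    f = kmer_frequencies(seq, k)
    dists = [euclidean(f, kmer_frequencies(r, k)) for r in reference_seqs]
    return sum(dists) / len(dists)

def classify(seq, genes, non_genes, k=3):
    """'gene' if the sequence is closer to the gene set, else 'non-coding'."""
    d_gene = distance_to_set(seq, genes, k)
    d_non = distance_to_set(seq, non_genes, k)
    return "gene" if d_gene < d_non else "non-coding"

# Toy training sets, made up purely for illustration.
genes = ["ATGGCTGCTGCTTAA", "ATGGCAGCTGCATAA"]
non_genes = ["TTTTATTTTATTTTA", "ATTTTTATATTTTTT"]
print(classify("ATGGCTGCAGCTTAA", genes, non_genes, k=2))
```

Other distances or other ways of summarising the distance to a set (for example the minimum instead of the mean) fit the same rule.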
Gene Prediction (contd.)
+ Given an ORF, we select a random subset of the potential genes and a random subset of the potential non-genes.
+ Classify the given ORF using a nearest neighbor type analysis using the above randomly selected training data set.
+ Given a sequence, repeat the random selection of subsets several times, and the final classification of the given sequence is determined by “the majority of the votes” in different random selections.
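The random-subset voting scheme above can be sketched as follows. This is an illustrative sketch under assumptions, not the author's code: a 1-nearest-neighbour rule on dinucleotide frequencies with Euclidean distance stands in for the unspecified "nearest neighbor type analysis", and the names `vote_classify`, `subset_size` and `rounds` as well as the toy sequences are hypothetical.

```python
import math
import random
from collections import Counter

def kmer_frequencies(seq, k=2):
    """Sparse relative frequencies of the words of length k in `seq`."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values()) or 1
    return {w: c / total for w, c in counts.items()}

def distance(f, g):
    """Euclidean distance between two sparse frequency profiles."""
    return math.sqrt(sum((f.get(w, 0.0) - g.get(w, 0.0)) ** 2
                         for w in set(f) | set(g)))

def nearest_label(query, labelled):
    """Label of the training sequence closest to `query` (1-NN)."""
    f = kmer_frequencies(query)
    best = min(labelled, key=lambda item: distance(f, kmer_frequencies(item[0])))
    return best[1]

def vote_classify(query, genes, non_genes, subset_size=2, rounds=11, seed=0):
    """Majority vote over repeated random training subsets."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(rounds):
        g = rng.sample(genes, min(subset_size, len(genes)))
        n = rng.sample(non_genes, min(subset_size, len(non_genes)))
        sample = [(s, "gene") for s in g] + [(s, "non-gene") for s in n]
        votes[nearest_label(query, sample)] += 1
    return votes.most_common(1)[0][0]

# Toy "potential gene" / "potential non-gene" sets, made up for illustration.
genes = ["ATGGCTGCTGCTTAA", "ATGGCAGCTGCATAA", "ATGGCCGCTGCGTAA"]
non_genes = ["TTTTATTTTATTTTA", "ATTTTTATATTTTTT", "TATATTTTTTATTTA"]
print(vote_classify("ATGGCTGCAGCTTAA", genes, non_genes))
```

An odd number of rounds avoids ties between the two labels.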
Evaluation of a prediction algorithm :

$$ \text{Sensitivity} = \frac{\text{No. of True Positives}}{\text{No. of True Positives and False Negatives}} $$

$$ \text{Specificity} = \frac{\text{No. of True Positives}}{\text{No. of True Positives and False Positives}} $$
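The two measures can be computed from a set of predicted genes and a set of annotated genes, as in the short sketch below. The gene identifiers are made up for illustration and the function name `evaluate` is hypothetical. Note that specificity as defined on the slide (true positives over all predicted positives) is the quantity called precision or positive predictive value in other fields.

```python
def evaluate(predicted, annotated):
    """Return (sensitivity, specificity) as defined on the slide."""
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)   # predicted and truly a gene
    fp = len(predicted - annotated)   # predicted but not annotated
    fn = len(annotated - predicted)   # annotated but missed
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tp / (tp + fp) if tp + fp else float("nan")
    return sensitivity, specificity

# Made-up example: 4 annotated genes, 3 predictions, 2 of them correct.
annotated = {"g1", "g2", "g3", "g4"}
predicted = {"g1", "g2", "g5"}
print(evaluate(predicted, annotated))
```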