Methods of identification and localization of the DNA coding sequences

Jacek Leluk
Interdisciplinary Centre for Mathematical and Computational Modelling, Warsaw University

Identification of coding/non-coding sequences in the genome

Measures dependent on a model of coding DNA
  based on oligonucleotide counts:
    Codon usage
    Amino acid usage
    Codon preference
    Hexamer usage
  based on base compositional bias between codon positions:
    Codon prototype
  based on dependence between nucleotide positions:
    Markov models

Measures independent of a model of coding DNA
  based on base compositional bias between codon positions:
    Position asymmetry
  based on periodic correlation between nucleotide positions:
    Periodic asymmetry index
    Average mutual information
    Fourier spectrum

The notation used

S: DNA sequence of length l; S_i (i = 1, ..., l) denotes the individual nucleotides
C: sequence of codons; C_j is the codon occupying position j in the sequence
C^i: the sequence of codons that results when the grouping of nucleotides from sequence S into codons starts at nucleotide i
C^i_j: the codon occupying position j in decomposition i of the sequence S
C_j[k]: the nucleotide occupying position k in the codon

Examples

Measures based on a model of coding DNA
The notation used

P^i(S): probability of the sequence of nucleotides S, given that S is coding in frame i (i = 1, 2, 3)
P^0(S): probability of the sequence S as non-coding (randomly generated) DNA

Likelihood ratio
P^i(S) / P^0(S): the ratio of the probability of finding the sequence of nucleotides S if S is coding in frame i to the probability of finding the sequence of nucleotides S if S is non-coding
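The three decompositions of a sequence into codons, as introduced in the notation above, can be illustrated with a short Python sketch. The function name codons_in_frame and the sample sequence are illustrative assumptions, not part of the original slides; dropping incomplete trailing triplets is likewise an assumption made here.

```python
def codons_in_frame(seq, i):
    """Return the codons of decomposition i of seq (i = 1, 2 or 3).

    Grouping into triplets starts at nucleotide i (1-based), so the
    returned list plays the role of C^i and its element j-1 that of C^i_j.
    Trailing nucleotides that do not fill a whole codon are dropped
    (an assumption made for this sketch).
    """
    if i not in (1, 2, 3):
        raise ValueError("frame i must be 1, 2 or 3")
    start = i - 1  # convert the 1-based frame to a 0-based index
    return [seq[k:k + 3] for k in range(start, len(seq) - 2, 3)]


if __name__ == "__main__":
    s = "ATGGCCTAA"  # hypothetical example sequence
    for frame in (1, 2, 3):
        print(frame, codons_in_frame(s, frame))
```

For a codon c returned by this function, the nucleotide at codon position k is c[k - 1].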
Measures based on a model of coding DNA
The notation used

Log-likelihood ratio
LP^i(S) = log( P^i(S) / P^0(S) ): coding potential of sequence S in frame i, given the model of coding DNA; P^i(S) is the probability of S if S is coding in frame i, and P^0(S) is its probability if S is non-coding
LP^i(S) > 0: the probability of the sequence of nucleotides S is higher assuming that S is coding in frame i than assuming that S is non-coding
LP^i(S) < 0: the probability of S is higher assuming that S is non-coding than assuming that S is coding in frame i

The log-likelihood ratio is computed for all three possible frames. If the sequence is coding, the log-likelihood ratio will be larger for one of the frames than for the other two.

Measures based on a model of coding DNA
Measures based on oligonucleotide counts
Codon usage

F(C): frequency (probability) of codon C in the genes of the considered species (the codon usage table)
P(C) = F(C_1) F(C_2) ... F(C_m): probability of finding the sequence of m codons C, knowing that C codes for a protein
P^0(C) = (1/64)^m: probability of finding the same sequence of codons in non-coding DNA

Measures based on a model of coding DNA
Measures based on oligonucleotide counts
Amino acid usage

Sum of F(C') over all C' ≡ C: the observed probability of the amino acid encoded by codon C in the existing proteins. This value can be derived directly from a codon usage table by summing the probabilities of synonymous codons; C' ≡ C means that C' is synonymous to C.
The probability of finding the amino acid sequence that results from translating the sequence in the coding open reading frame is the product of these amino acid probabilities over the codons of the sequence.
n_C / 64: frequency of the "non-coding amino acids", where n_C is the number of codons synonymous to C

Measures based on a model of coding DNA
Measures based on oligonucleotide counts
Codon preference

F(C) / (sum of F(C') over all C' ≡ C): relative probability in coding regions of codon C among codons synonymous to C
The probability of the sequence S encoding the particular amino acid sequence in frame i is the product of these relative probabilities over the codons of decomposition i.
In non-coding regions there is no preference between "synonymous codons". Then:
1 / n_C: probability of codon C in non-coding DNA

Measures based on a model of coding DNA
Measures based on oligonucleotide counts
Hexamer usage

This approach is based on the hexamer usage table, with one entry for each hexamer i = 1, 2, 3, ..., 4096. In this case there are six reading frames to be analyzed. The probability of a sequence of hexanucleotides in the coding frame of a coding sequence is computed, as for codon usage, from the table frequencies of the successive hexanucleotides.

Measures based on a model of coding DNA
Measures based on base compositional bias between codon positions
Codon prototype

Let f(b,r) be the probability of nucleotide b at codon position r, as estimated from known coding regions. Then:
P(c) = f(c[1],1) f(c[2],2) f(c[3],3): the probability of codon c in coding regions, assuming independence between adjacent nucleotides
P^0(c) = 1/64: the probability of every triplet c in non-coding DNA
Example: P^1(S) is the product of the codon probabilities P(c) over the codons of the first decomposition of S; P^2(S) and P^3(S) are computed in a similar way.

Measures based on a model of coding DNA
Measures based on dependence between nucleotide positions
Markov models

In Markov models the probability of a nucleotide at a particular codon position depends on the nucleotide(s) preceding it. The Markov model of order 1 is the simplest of the Markov models: the probability of a nucleotide depends only on the preceding nucleotide. In this case, the model of coding DNA is based on the probabilities of the four nucleotides at each codon position, depending on the nucleotide occurring at the preceding codon position (technically called the transition probabilities).
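The order-1 model described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of any particular gene-finding program: the function names, the pseudocount, and the decision to ignore the probability of the first nucleotide are all choices made for this sketch, and the training sequences are assumed to start at codon position 1 and to contain only A, C, G and T.

```python
import math

NUCS = "ACGT"


def train_transitions(coding_seqs, pseudocount=1.0):
    """Estimate one 4x4 transition table per codon position.

    tables[r][prev][cur] is the probability of nucleotide cur at codon
    position r + 1, given that prev is the preceding nucleotide.
    The pseudocount (an assumption) avoids zero probabilities.
    """
    counts = [{a: {b: pseudocount for b in NUCS} for a in NUCS}
              for _ in range(3)]
    for seq in coding_seqs:
        for k in range(1, len(seq)):
            prev, cur = seq[k - 1], seq[k]
            if prev in NUCS and cur in NUCS:
                counts[k % 3][prev][cur] += 1  # k % 3: codon position of cur
    tables = []
    for r in range(3):
        table = {}
        for a in NUCS:
            total = sum(counts[r][a].values())
            table[a] = {b: counts[r][a][b] / total for b in NUCS}
        tables.append(table)
    return tables


def log_prob(seq, tables, frame):
    """Log-probability of seq if it is coding in the given frame (1, 2, 3).

    The first nucleotide contributes nothing (a simplification of
    this sketch); every later nucleotide contributes one transition.
    """
    lp = 0.0
    for k in range(1, len(seq)):
        r = (k - (frame - 1)) % 3  # codon position of seq[k] in this frame
        lp += math.log(tables[r][seq[k - 1]][seq[k]])
    return lp


if __name__ == "__main__":
    tables = train_transitions(["ATG" * 20])  # toy training data
    for frame in (1, 2, 3):
        print(frame, log_prob("ATGATGATG", tables, frame))
```

Comparing the three values returned by log_prob, one per frame, mirrors the comparison of the log-likelihood ratios across frames, since the non-coding term is the same for all three.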
Thus, instead of one single matrix, as in Codon Prototype, three 4x4 matrices (the transition matrices) are required, F1, F2, and F3, each one corresponding to a different codon position.

Markov models of orders 1 to 5 are used.

Measures independent of a model of coding DNA
Measures based on base compositional bias between codon positions
Position asymmetry

The goal is to measure how asymmetric the distribution of nucleotides is at the three triplet positions in the sequence.

f(b,r): the relative frequency of nucleotide b at codon position r in the sequence S, as calculated from one of the three decompositions of S into codons (any of them)
mu(b) = ( f(b,1) + f(b,2) + f(b,3) ) / 3: average frequency of nucleotide b at the three codon positions
asymm(b) = sum over r of ( f(b,r) - mu(b) )^2: asymmetry in the distribution of nucleotide b
PA(S) = asymm(A) + asymm(C) + asymm(G) + asymm(T): Position Asymmetry of the sequence

Measures independent of a model of coding DNA
Measures based on periodic correlation between nucleotide positions
Periodic asymmetry index

This approach considers three distinct probabilities:
- the probability Pin of finding pairs of the same nucleotide at distances k = 2, 5, 8, ...
- the probability P1out of finding pairs of the same nucleotide at distances k = 0, 3, 6, ...
- the probability P2out of finding pairs of the same nucleotide at distances k = 1, 4, 7, ...
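The three probabilities listed above can be estimated from a sequence with a short Python sketch. Several points are assumptions made here rather than statements from the slides: distance k is read as the number of nucleotides lying between the two members of a pair (so k = 0 means adjacent nucleotides), the distances are truncated at a hypothetical max_k parameter, and pairs from all distances in a class are pooled into a single fraction.

```python
def same_nucleotide_probability(seq, distances):
    """Fraction of nucleotide pairs that are identical.

    A pair at distance k is (seq[p], seq[p + k + 1]), i.e. k nucleotides
    lie between the two members (an assumed reading of "distance").
    Pairs from all given distances are pooled into one fraction.
    """
    same = 0
    total = 0
    for k in distances:
        step = k + 1
        for p in range(len(seq) - step):
            total += 1
            if seq[p] == seq[p + step]:
                same += 1
    return same / total if total else 0.0


def periodic_probabilities(seq, max_k=8):
    """Return (Pin, P1out, P2out) using distances up to max_k."""
    p_in = same_nucleotide_probability(seq, range(2, max_k + 1, 3))
    p1_out = same_nucleotide_probability(seq, range(0, max_k + 1, 3))
    p2_out = same_nucleotide_probability(seq, range(1, max_k + 1, 3))
    return p_in, p1_out, p2_out


if __name__ == "__main__":
    # A perfectly 3-periodic toy sequence: identical nucleotides occur
    # only at in-frame distances.
    print(periodic_probabilities("ACG" * 10))
```

On a perfectly 3-periodic toy sequence such as "ACG" repeated, only the in-frame probability is non-zero, which is the kind of imbalance between the three values that a periodicity measure is meant to capture.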
The tendency to cluster homogeneous di-nucleotides in a 3-base periodic pattern can be measured by the Periodic Asymmetry Index, which is computed from these three probabilities.

Measures independent of a model of coding DNA
Measures based on periodic correlation between nucleotide positions
Average mutual information

n_ij(k): absolute number of times that nucleotide i is followed by nucleotide j at a distance of k positions
p_ij(k): probability that nucleotide i is followed by nucleotide j at a distance of k positions
C_ij(k) = p_ij(k) - p_i p_j: correlation between nucleotides i and j at a distance of k positions, where p_i and p_j are the probabilities of occurrence of nucleotides i and j in sequence S

The Mutual Information function quantifies the amount of information that can be obtained from one nucleotide about another nucleotide at a distance k:
I(k) = sum over i, j of p_ij(k) log( p_ij(k) / (p_i p_j) )

I_in: the in-frame mutual information at distances k = 2, 5, 8, ...
I_out: the out-frame mutual information at distances k = 0, 1, 3, 4, ...
Average Mutual Information

Measures independent of a model of coding DNA
Measures based on periodic correlation between nucleotide positions
Fourier analysis

The partial spectrum of a DNA sequence S of length l corresponding to nucleotide b is defined through the discrete Fourier transform of the indicator sequence U_b(S_1), ..., U_b(S_l), where U_b(S_j) = 1 if S_j = b and 0 otherwise, and f is the discrete frequency, f = k/l, for k = 1, 2, ..., l/2.

DNA coding regions reveal the characteristic periodicity of 3 as a distinct peak at frequency f = 1/3. No such peak is apparent for non-coding sequences.

Summary of results

List of Gene Identification programs and Internet access (part 1)

List of Gene Identification programs and Internet access (part 2)

Position ruler: 3 10 20 30 40 50 60

P01055   ESSKPCCDQCACTKSNPPQCRCSDMRLNSCHSACKSCICALSYPAQCF-CVDITDFCYEP-CKP
P01057   ESSKPCCDECACTKSIPPQCRCTDVRLNSCHSACSSCVCTFSIPAQCV-CVDMKDFCYAP-CKS
P01056   QSSKPCCBHCACTKSIPPQCRCTDLRLDSCHSACKSCICTLSIPAQCV-CBBIBDFCYEP-CKS
P01058   ESSKPCCDQCSCTKSMPPKCRCSDIRLNSCHSACKSCACTYSIPAKCF-CTDINDFCYEP-CKS
P01059   ESSKPCCDLCTCTKSIPPQCHCNDMRLNSCHSACKSCICALSEPAQCF-CVDTTDFCYKS-CHN
P01063   ESSKPCCDLCMCTASMPPQCHCADIRLNSCHSACDRCACTRSMPGQCR-CLDTTDFCYKP-CKS
P17734   QSSKPCCRQCACTKSIPPQCRCSQVRLNSCHSACKSCACTFSIPAQCF-CGBIBBFCYKP-CKS
P81483   -SSKPCCBHCACTKSIPPQCRCSBLRLNSCHSECKGCICTFSIPAQCI-CTDTNNFCYEP-CKS
P81484   -SSKPCCBHCACTKSIPPQCRCSBLRLNSCHSECKGCICTFSIPAQCI-CTDTNNFCYEP-CKS
P16343   ESSKPCCSSC-CTRSRPPQCQCTDVRLNSCHSACKSCMCTFSDPGMCS-CLDVTDFCYKP-CKS
P01064   EYSKPCCDLCMCTRSMPPQCSCEDIRLNSCHSDCKSCMCTRSQPGQCR-CLDTNDFCYKP-CKS
P82469   -SSGPCCDRCRCTKSEPPQCQCQDVRLNSCHSACEACVCSHSMPGLCS-CLDITHFCHEP-CKS
P01061   ESSHPCCDLCLCTKSIPPQCQCADIRLDSCHSACKSCMCTRSMPGQCR-CLDTHDFCHKP-CKS
P01062   ESSEPCCDSCDCTKSIPPECHCANIRLNSCHSACKSCICTRSMPGKCR-CLDTDDFCYKP-CES
P01060   QSSPPCCBICVCTASIPPQCVCTBIRLBSCHSACKSCMCTRSMPGKCR-CLBTTBYCYKS-CKS
1BBI:    ESSKPCCDQCACTKSNPPQCRCSDMRLNSCHSACKSCICALSYPAQCF-CVDITDFCYEP-CKP
1D6R:I   ---KPCCDQCACTKSNPPQCRCSDMRLNSCHSACKSCICALSYPAQCF-CVDITDFCYEP-CK
1DF9:C   ESSEPCCDSCDCTKSIPPQCHCANIRLNSCHSACKSCICTRSMPGKCR-CLDTDDFCYKP-CES
1PI2:    EYSKPCCDLCMCTRSMPPQCSCED-RINSCHSDCKSCMCTRSQPGQCR-CLDTNDFCYKP-CKS
1PBI:A   DVKSACCDTCLCTKSNPPTCRCVDVGET-CHSACLSCICAYSNPPKCQ-CFDTQKFCYKQ-CHN
AAB4719  ESSKPCCDQCTCTKSIPPQCRCTDVRLNSCHSACSSCVCTFSIPAQCV-CVDMKDFCYAP-CKS
TISYC2   ESSKPCCDLCMCTASMPPQCHCADIRLNSCHSACDRCACTRSMPGQCR-CLDTTDFCYKP-CKS
JC2225   ESSKPCCDLCMCTASMPPQCHCADIRLNSCHSACDRCACTRSMPGQCR-CLDTTDFCYKP-CKS
TIZB2    ESSKPCCDQC-CTKSMPPKCRCSDIRLDSCHSACKSCACTYSIPAKCF-CTDINDFCYEP-CKS
JC2073   ESSKPCCDECKCTKSEPPQCQCVDTRLESCHSACKLCLCALSFPAKCR-CVDTTDFCYKP-CKS
JC2072   ESSKPCCDECKCTKSEPPQCQCVDTRLESCHSACKLCLCALSFPAKCR-CVDTTDFCYKP-CKS
0506164  ESSKPCCDQC-CTKSMPPKCRCSDIRLDSCHSACKSCACTYSIPAKCF-CTDINDFCYEP-CKS
0401177  ESSKPCCDLCMCTASMPPQCHCADIRLNSCHSACDRCACTRSMPGQCR-CLDTTDFCYKP-CKS
763679A  ESSKPCCDLCMCTASMPPQCHCADIRLNSCHSACDRCACTRSMPGQCR-CLDTTDFCYKP-CKS
TISYD2   EYSKPCCDLCMCTRSMPPQCSCEDIRLNSCHSDCKSCMCTRSQPGQCR-CLDTNDFCYKP-CKS
0907248  ESSEPCCDSCRCTKSIPPQCHCADIRLNSCHSACKSCMCTRSMPGKCR-CLDTDDFCYKP-CES
1102213  ESSEPCCDLCLCTKSIPPQCQCADIRLNSCHSACKSCMCTRSMPGQCH-CLDTHDFCHKP-CKS
1102213  ESSEPCCDLCLCTKSIPPQCQCADIRLNSCHSACKSCMCTRSMPGQCR-CLDTHDFCHKP-CKS
0404180  EYSKPCCDLCMCTRSMPPQCSCEDIRLNSCHSDCKSCMCTRSQPGQCR-CLDTNDFCYKP-CKS
TIZB1B   ESSHPCCDLCLCTKSIPPQCQCADIRLDSCHSACKSCMCTRSMPGQCH-CLDTHDFCHKP-CKS
TIMB     ESSEPCCDSCDCTKSKPPQCHCANIRLNSCHSACKSCICTRSMPGKCR-CLDTDDFCYKP-CES
TIZB1P   ESSHPCCDLCLCTKSIPPQCQCADIRLNSCHSACKSCMCTRSMPGQCR-CLDTHDFCHKP-CKS
JC1066   ESSEPCCDSCDCTKSKPPQCHCANIRLNSCHSACKSCICTRSMPGKCR-CLDTDDFCTKP-CES
Q41066   DVKSACCDTCLCTKSDPPTCRCVDVGET-CHSACDSCICALSYPPQCQ-CFDTHKFCYKA-CHN
P80321   STTTACCDFCPCTRSIPPQCQCTDVREK-CHSACKSCLCTLSIPPQCH-CYDITDFCYPS-CR
Q41065   DVKSACCDTCLCTKSNPPTCRCVDVRET-CHSACDSCICAYSNPPKCQ-CFDTHKFCYKA-CHN
P81705   --TSACCDKCFCTKSNPPICQCRDVGET-CHSACKFCICALSYPAQCH-CLDQNTFCYDK-CDS
P56679   DVKSACCDTCLCTKSNPPTCRCVDVGET-CHSACLSCICAYSNPPKCQ-CFDTQKFCYKA-CHN
P16346   --TTACCNFCPCTRSIPPQCRCTDIGET-CHSACKTCLCTKSIPPQCH-CADITNFCYPK-CN
P01065   DVKSACCDTCLCTRSQPPTCRCVDVGER-CHSACNHCVCNYSNPPQCQ-CFDTHKFCYKA-CHS
P24661   DVKSACCDTCLCTKSEPPTCRCVDVGER-CHSACNSCVCRYSNPPKCQ-CFDTHKFCYKS-CHN
P07679   KRPWECCDIAMCTRSIPPICRCVDKVDR-CSDACKDCEETEDN--RHV-CFDTYIGDPGPTCHD
P19860   ERPWKCCDLQTCTKSIPAFCRCRDLLEQ-CSDACKECGKVRDSDPPRYICQDVYRGIPAPMCHE
P22737   ERPWKCCDLQTCTKSIPAFCRCRDLLEQ-CSDACKECGKVRDSDPPRYICQDVYRGIPAPMCHE
220645   ES-EGCCDRCICTKSMPPQCHCHDVRLDSCHSDCETCICTRSYPAQCR-CADTTDFCYKP-C-S
P09864   TRPWKCCDRAICTKSFPPMCRCMDMVEQ-CAATCKKCGPATSDSSRRV-CEDXY----------
P09863   KRPWKCCDQAVCTRSIPPICRCMDQVFE-CPSTCKACGPSVGDPSRRV-CQDQYV----------

Thank you for your attention