GeneScout: a data mining system for predicting vertebrate genes in genomic DNA sequences
Authors: Michael M. Yin and Jason T. L. Wang
Source: Information Sciences, 163(1-3), pp. 201-218, 2004
Advisor: Min-Shiang Hwang
Speaker: Chun-Ta Li

Outline
• Introduction
• Related work
• The proposed approach
• Example
• Experiments and results
• Conclusions
• Comments

Introduction – 1/4
• Data mining
  – Knowledge discovery from data
• Data mining in life sciences:
  – Finding clustering rules for gene expressions
  – Discovering classification rules for proteins
  – Detecting associations between metabolic pathways
  – Predicting genes in genomic DNA sequences

Introduction – 2/4
• A genomic DNA sequence
  – Four types of nucleotides (A, C, G, T)
• The basic structure of a vertebrate gene
  [Figure: gene structure, labelled with codon, introns, exons (coding sequences), donor]
• A sequence fragment containing an exon of 296 nucleotides

Introduction – 3/4
[Figure: coding region]

Introduction – 4/4
• A number of programs have been developed for locating gene coding regions (exons).
• These remain insufficient:
  – The vertebrate DNA sequence signals involved in gene determination are usually ill defined.
  – Automated interpretation of genomic data without experimental validation is still a myth.
• Motivation:
  – GeneScout: developing accurate methods for automatically detecting vertebrate genomic DNA structures.
  – Exon: start sites, junction donor sites, acceptor sites

Related work – 1/2
• NN-based techniques (neural networks)
  – Gene structure prediction
  – Training

Related work – 2/2
• HMM-based techniques (hidden Markov models)
  – Describe sequential data or processes
  – Use a number of states
  – Probabilistic state transitions
  – Example: casting a die, with states Normal and Fake

The proposed approach – 1/6
• HMM models for predicting functional sites
  – The Start Site Model
  [Figure: Start Site Model, with the start codon marked]

The proposed approach – 2/6
• HMM models for predicting functional sites
  – The Donor Model
  [Figure: Donor Model, with the donor site marked]

The proposed approach – 3/6
• HMM models for predicting functional sites
  – The Acceptor Model
  [Figure: Acceptor Model, with the acceptor site marked]

The proposed approach – 4/6
• Graph representation of the gene detection problem
  – Candidate start codons, candidate donor sites, candidate acceptor sites
  [Figure legend: one edge style for exons, one for introns]

The proposed approach – 5/6
• A dynamic programming algorithm
  – Weight of the vertex v: W(v)
  – Weight of the edge (v1, v2): W(v1, v2)
  [Figure: graph with vertices start, acceptor, donor, stop; exon edges are weighted by the Codon model, the other edges are introns]

The proposed approach – 6/6
• An HMM model for computing coding potentials
  – The Codon Model
• Stop codons (TAA, TAG, TGA) are excluded from the model:
  – First state is base T
  – Second state is base A or G
  – Third state can only be C or T (A and G are not defined, which rules out TAA, TAG, TGA and also TGG)
• Example: TGT = 0.5 × 0.1 = 0.05

Example – 1/2
• Edge weights: exon edges are scored by the Codon model; intron edges carry no weight (0)
• The Codon Model
  – Exon = GCCATTGAA; its three codons score 0.12, 0.06 and 0.07
  – W(GCCATTGAA) = 0.12 + 0.06 + 0.07 = 0.25
  [Figure: graph with vertices start, acceptor (×3), donor (×3), stop; the exon edge is weighted 0.25, intron edges 0]

Example – 2/2
• A dynamic programming algorithm
  [Figure: the same graph with weights 1.65, 1.77, 0.33, 0.25, 0, 2.21, 1.23]
  – Total weight of the selected path = 1.7 + 0.25 + 2.21 + 0 + 1.65 + 0.33 + 1.23 = 7.37

Experiments and results – 1/3
• Data:
  – GenBank: 570 vertebrate sequences (28,992,149 nucleotides)
  – 2649 exons (444,498 nucleotides)
  – Start codon: ATG
  – Donor site: GT
  – Acceptor site: AG
• Evaluation method:
  – 10-way cross-validation
  – 570 sequences split into 10 sets: 9 sets of training data, 1 set of test data

Experiments and results – 2/3
• Evaluation measures:
  – Rate of correctly identified nucleotides
  – Rate of correctly identified nucleotides relative to those wrongly identified as coding nucleotides
  – Overall prediction accuracy at the nucleotide level (ranges from 1 to -1)
  – Rate of correctly identified exons
  – Rate of correctly identified exons relative to those wrongly identified as exons

Experiments and results – 3/3
• On 8 sequences, GeneScout correctly detected about 85% of the coding nucleotides, whereas GeneScan did not correctly predict any coding nucleotide.
• GeneScout runs much faster than GeneScan.

Conclusions
• GeneScout uses hidden Markov models to detect functional sites.
• A vertebrate genomic DNA sequence → a directed acyclic graph → a dynamic programming algorithm → optimal path
• Experimental results show that GeneScout can detect 51% of the exons in the data set.

Comments
• Advisor's comments
  – Run the computation backward from the stop codon of the gene structure, and compare the result with this paper's approach, which runs forward from the start codon.
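
Appendix: sketch of the dynamic programming step
The graph formulation in "The proposed approach – 5/6" and the worked sum in "Example – 1/2" can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the path score is the sum of vertex weights W(v) and edge weights W(v1, v2), that vertices are supplied in sequence order (a topological order of the graph), and that the three codon scores from the example pair with GCC, ATT and GAA in reading order. The names `exon_weight` and `best_path`, the toy graph and its vertex weights are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: the slide's scores 0.12, 0.06, 0.07 belong to GCC, ATT, GAA in that order.
CODON_SCORE: Dict[str, float] = {"GCC": 0.12, "ATT": 0.06, "GAA": 0.07}


def exon_weight(seq: str) -> float:
    """Sum the per-codon scores of an exon whose length is a multiple of 3."""
    return sum(CODON_SCORE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def best_path(
    order: List[str],
    vertex_w: Dict[str, float],
    edge_w: Dict[Tuple[str, str], float],
) -> Tuple[float, List[str]]:
    """Maximum-weight path from order[0] to order[-1] in a DAG.

    `order` must list the vertices in topological order.
    """
    best: Dict[str, float] = {order[0]: vertex_w[order[0]]}
    prev: Dict[str, str] = {}
    for v in order[1:]:
        for (u, x), w in edge_w.items():
            if x != v or u not in best:
                continue  # edge does not end at v, or u is unreachable
            cand = best[u] + w + vertex_w[v]
            if v not in best or cand > best[v]:
                best[v] = cand
                prev[v] = u
    goal = order[-1]
    if goal not in best:
        raise ValueError("stop vertex is not reachable from the start vertex")
    # Walk the predecessor links back from the stop vertex.
    path = [goal]
    while path[-1] in prev:
        path.append(prev[path[-1]])
    return best[goal], path[::-1]


# Hypothetical toy graph: exon edges use exon_weight, intron edges weigh 0.
order = ["start", "donor1", "donor2", "acceptor1", "stop"]
vertex_w = {"start": 1.0, "donor1": 0.5, "donor2": 2.0, "acceptor1": 1.5, "stop": 0.0}
edge_w = {
    ("start", "donor1"): exon_weight("GCCATTGAA"),
    ("start", "donor2"): exon_weight("GCC"),
    ("donor1", "acceptor1"): 0.0,
    ("donor2", "acceptor1"): 0.0,
    ("acceptor1", "stop"): exon_weight("ATTGAA"),
}
score, path = best_path(order, vertex_w, edge_w)
print(path)  # ['start', 'donor2', 'acceptor1', 'stop']
```

Because each vertex is processed once in topological order and each edge is examined against it, the sketch stays simple rather than fast; a per-vertex list of incoming edges would avoid rescanning every edge.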