GeneScout: a data mining system for predicting vertebrate genes in genomic DNA sequences
Authors: Michael M. Yin and Jason T. L. Wang
Sources: Information Sciences, 163(1-3), pp. 201-218, 2004
Advisor: Min-Shiang Hwang
Speaker: Chun-Ta Li
Outline
• Introduction
• Related work
• The proposed approach
• Example
• Experiments and results
• Conclusions
• Comments
Introduction – 1/4
• Data mining – knowledge discovery from data
• Data mining in life sciences:
– Finding clustering rules for gene expressions
– Discovering classification rules for proteins
– Detecting associations between metabolic pathways
– Predicting genes in genomic DNA sequences
Introduction – 2/4
• A genomic DNA sequence
– Four types of nucleotides (A, C, G, T)
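– Terms: codon (a triplet of nucleotides), introns (non-coding intervening sequences), exons (coding sequences), donor (the donor splice site)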
• The basic structure of a vertebrate gene
• A sequence fragment containing an exon of 296 nucleotides (coding sequences)
Introduction – 3/4
[Figure: coding region]
Introduction – 4/4
• A number of programs have been developed for locating
gene coding regions (exons).
• These programs remain insufficient:
– The vertebrate DNA sequence signals involved in gene determination are usually ill-defined.
– Automated interpretation of genomic data without experimental validation is still a myth.
• Motivation:
– GeneScout: developing accurate methods for automatically detecting vertebrate genomic DNA structures.
– Exon signals: start sites, splice junction donor sites, and acceptor sites
Related work – 1/2
• NN-based techniques (Neural Network)
– Gene structure prediction
– Training
Related work – 2/2
• HMM-based techniques (Hidden Markov Models)
– To describe sequential data or processes
– Using a number of states
– Probabilistic state transitions
– Example: casting a die (hidden states: Normal, Fake)
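The die example can be made concrete with a small sketch. The slide names only the two hidden states (Normal, Fake); every probability below is a hypothetical value chosen for illustration, and Viterbi decoding is used here as one standard way to recover the most likely state sequence, not necessarily the procedure shown on the slide.

```python
import math

STATES = ("Normal", "Fake")

# All numbers below are assumptions for illustration, not values from the slides.
START = {"Normal": 0.5, "Fake": 0.5}
TRANS = {
    "Normal": {"Normal": 0.95, "Fake": 0.05},
    "Fake": {"Normal": 0.10, "Fake": 0.90},
}
EMIT = {
    "Normal": {face: 1 / 6 for face in range(1, 7)},        # fair die
    "Fake": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},  # loaded toward 6
}


def viterbi(rolls):
    """Return the most likely hidden-state sequence for a non-empty list of rolls."""
    # best[s] = (log-probability of the best path ending in state s, that path)
    best = {
        s: (math.log(START[s]) + math.log(EMIT[s][rolls[0]]), [s]) for s in STATES
    }
    for roll in rolls[1:]:
        new_best = {}
        for s in STATES:
            # Pick the predecessor state that maximises the path score into s.
            score, path = max(
                ((best[p][0] + math.log(TRANS[p][s]), best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            new_best[s] = (score + math.log(EMIT[s][roll]), path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]
```

With these assumed numbers, a long run of sixes is decoded as Fake and a run without sixes as Normal; the same idea, with nucleotides instead of die faces, underlies the site models on the following slides.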
The proposed approach – 1/6
• HMM models for predicting functional sites
– The Start Site Model
[Figure: Start Site Model, with the start codon marked]
The proposed approach – 2/6
• HMM models for predicting functional sites
– The Donor Model
Donor site
The proposed approach – 3/6
• HMM models for predicting functional sites
– The Acceptor Model
Acceptor site
The proposed approach – 4/6
• Graph representation of the gene detection problem
– candidate start codons, candidate donor sites, candidate acceptor sites
– Edge types: exon, intron
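As a rough illustration of where the graph's vertices come from, the sketch below enumerates candidate sites using the consensus signals listed on the experiments slide (ATG for start codons, GT for donor sites, AG for acceptor sites). The function name and the returned layout are hypothetical, and the sketch stops at enumeration: scoring each candidate with the HMM site models is not shown.

```python
def candidate_sites(sequence):
    """Return 0-based positions of candidate start, donor, and acceptor signals.

    Assumption: candidates are found by plain string matching on the
    consensus signals ATG, GT, and AG; the real system additionally
    scores each candidate with an HMM.
    """
    seq = sequence.upper()
    sites = {"start": [], "donor": [], "acceptor": []}
    for i in range(len(seq) - 1):
        if seq[i:i + 3] == "ATG":
            sites["start"].append(i)
        if seq[i:i + 2] == "GT":
            sites["donor"].append(i)
        if seq[i:i + 2] == "AG":
            sites["acceptor"].append(i)
    return sites
```

Each position found this way would become a vertex, with an exon or intron edge linking compatible pairs of sites.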
The proposed approach – 5/6
• A dynamic programming algorithm
– Weight of the vertex v – W(v)
– Weight of the edge (v1,v2) – W(v1,v2)
[Figure: graph with start, donor, acceptor, and stop vertices; edges are exons (scored by the Codon model) or introns]
The proposed approach – 6/6
• An HMM model for computing coding potentials
– The Codon Model
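– Stop codons: TAA, TAG, TGA
• First state is base T
• Second state is base A or G
• Third state can only be C or T (A and G are not defined), so TAA, TAG, TGA, and TGG have no path in the model
– Example: TGT = 0.5 × 0.1 = 0.05, the product of the transition probabilities 0.5 and 0.1 along T → G → T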
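The TGT calculation can be sketched as a lookup in two transition tables. Only the product 0.5 × 0.1 comes from the slide; which transition carries which value, the table layout, and every other number below are assumptions made for illustration.

```python
# Transition probabilities from the first base to the second base.
# Only T -> G = 0.5 is taken from the slide example (assumed assignment);
# the remaining entries are hypothetical.
FIRST_TO_SECOND = {
    "T": {"A": 0.2, "C": 0.2, "G": 0.5, "T": 0.1},
}

# Transition probabilities into the third base, keyed by the first two bases.
# After TA or TG only C and T are defined, so stop codons have no path.
# Only TG -> T = 0.1 is taken from the slide example (assumed assignment).
SECOND_TO_THIRD = {
    "TA": {"C": 0.5, "T": 0.5},
    "TG": {"C": 0.9, "T": 0.1},
}


def codon_probability(codon):
    """Multiply the two transition probabilities along the codon's path.

    Undefined transitions contribute probability 0.0.
    """
    first, second, third = codon.upper()
    p_first_second = FIRST_TO_SECOND.get(first, {}).get(second, 0.0)
    p_second_third = SECOND_TO_THIRD.get(first + second, {}).get(third, 0.0)
    return p_first_second * p_second_third
```

Under these assumed tables, `codon_probability("TGT")` reproduces the slide's 0.05, while any stop codon evaluates to 0.0 because its third-base transition is missing.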
Example – 1/2
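• Edge types: exon (scored by the Codon model), intron (no model)
• The Codon Model
– Exon sequence: GCCATTGAA
– Codon scores: GCC = 0.12, ATT = 0.06, GAA = 0.07
– GCCATTGAA = 0.12 + 0.06 + 0.07 = 0.25
[Figure: graph of start, acceptor, donor, and stop vertices; the exon edge GCCATTGAA has weight 0.25 and the other edges shown have weight 0]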
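The exon weight in this example is the sum of its codon scores. A minimal sketch, assuming the reading frame starts at the first base of the exon and using only the three codon scores given on the slide (the function and table names are hypothetical):

```python
# Codon scores read off the slide: GCC, ATT, and GAA in that order.
CODON_SCORE = {"GCC": 0.12, "ATT": 0.06, "GAA": 0.07}


def exon_score(exon, codon_score):
    """Sum the Codon-model scores of consecutive codons in an exon.

    Assumption: the exon length is a multiple of three and the first
    base is in frame; partial codons at exon boundaries are not handled.
    """
    if len(exon) % 3 != 0:
        raise ValueError("exon length must be a multiple of 3 in this sketch")
    codons = [exon[i:i + 3] for i in range(0, len(exon), 3)]
    return sum(codon_score[codon] for codon in codons)
```

Calling `exon_score("GCCATTGAA", CODON_SCORE)` gives the edge weight 0.25 used in the graph, up to floating-point rounding.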
Example – 2/2
• A dynamic programming algorithm
[Figure: graph of start, acceptor, donor, and stop vertices annotated with weights 1.65, 1.77, 0.33, 0.25, 0, 2.21, and 1.23]
Weight of the optimal path = 1.7 + 0.25 + 2.21 + 0 + 1.65 + 0.33 + 1.23 = 7.37
Experiments and results – 1/3
• Data:
– GenBank: 570 vertebrate sequences (28,992,149 nucleotides) containing 2649 exons (444,498 nucleotides)
– start codon – ATG
– donor site – GT
– acceptor site – AG
• Evaluating method:
– 10-way cross-validation
– 570 sequences → 10 sets
– 9 sets → training data
– 1 set → test data
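The evaluation protocol can be sketched as follows: the 570 sequences are partitioned into 10 sets, and each set in turn serves as test data while the other 9 form the training data. The slides do not say how the sets were formed, so the seeded shuffle and round-robin assignment below are assumptions, as is the function name.

```python
import random


def ten_way_splits(items, k=10, seed=0):
    """Yield (training, test) pairs for k-way cross-validation.

    Assumption: items are shuffled with a fixed seed and dealt into
    k sets round-robin; the original partitioning is not documented.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, test
```

For 570 sequence identifiers this produces 10 rounds, each with 513 training sequences and 57 test sequences, and every sequence is tested exactly once.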
Experiments and results – 2/3
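• Nucleotide-level sensitivity: the proportion of nucleotides that are correctly identified
• Nucleotide-level specificity: the correctly identified nucleotides relative to those wrongly identified as nucleotides
• Correlation coefficient: the overall prediction accuracy at the nucleotide level (ranging from 1 to -1)
• Exon-level sensitivity: the proportion of exons that are correctly identified
• Exon-level specificity: the correctly identified exons relative to those wrongly identified as exons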
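The five measures can be sketched in code, assuming they correspond to the sensitivity, specificity, and correlation coefficient commonly used to evaluate gene finders; the slide's own formulas were not preserved, so the definitions and function names below are assumptions. Here TP, TN, FP, and FN count coding nucleotides predicted as coding, non-coding predicted as non-coding, non-coding predicted as coding, and coding predicted as non-coding.

```python
import math


def nucleotide_metrics(tp, tn, fp, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    Assumes the gene-finding convention in which specificity is
    TP / (TP + FP), and that no denominator is zero.
    """
    sensitivity = tp / (tp + fn)
    specificity = tp / (tp + fp)
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    correlation = (tp * tn - fp * fn) / denominator
    return sensitivity, specificity, correlation


def exon_metrics(correct_exons, actual_exons, predicted_exons):
    """Return exon-level (sensitivity, specificity) from simple counts."""
    return correct_exons / actual_exons, correct_exons / predicted_exons
```

The correlation coefficient reaches 1 for a perfect prediction and -1 for a fully inverted one, which matches the 1 to -1 range noted on the slide.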
Experiments and results – 3/3
• For 8 sequences, GeneScout correctly detected about 85% of the nucleotides, whereas GeneScan did not correctly predict any coding nucleotide
• GeneScout runs much faster than GeneScan
Conclusions
• GeneScout uses hidden Markov models to detect
functional sites.
• A vertebrate genomic DNA sequence → a directed acyclic graph → a dynamic programming algorithm → optimal path
• Experimental results show that GeneScout can detect 51% of the exons in the data set.
Comments
• Advisor’s comments
– Run the computation backward from the stop codon of the gene structure, and compare the result with this paper's approach of computing forward from the start codon.