DNA Analysis
Part II
Amir Golnabi
ENGS 112
Spring 2008
What we saw in part I:
1. Markov Chain
2. DNA and Modeling
3. Markovian Models for DNA Sequences
4. HMM for DNA Sequences
Part II:
1. DNA Methylation and CpG islands
2. Markov Chain Model
3. Hidden Markov Model
4. Finding the State Path
5. Parameter Estimation for HMMs
6. References
1. DNA Methylation and CpG islands
• CG dinucleotide in the human genome: a C followed by a G
on the same strand
• Modification of the cytosine by methylation
• High chance of mutation of methyl-C into a T
• As a result, CG dinucleotides are rarer in the genome than
expected from the independent frequencies of C and G
• Methylation is suppressed in short stretches of the
genome such as around the promoters or start regions
of many genes → more CG dinucleotides: CpG islands
• "p": "C" and "G" are connected by a phosphodiester
bond
• Two questions:
– Given a short stretch of genomic sequence, how would
we decide whether it comes from a CpG island?
– Given a long piece of sequence, how would we find the
CpG islands in it?
2. Given a short stretch of genomic sequence, how
would we decide whether it comes from a CpG island?
• Markov Chain:
Transition probabilities:
$a_{st} = P(x_i = t \mid x_{i-1} = s)$
Probability of sequences:
$P(x) = P(x_L, x_{L-1}, \ldots, x_1)$
$= P(x_L \mid x_{L-1}) \, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1) \, P(x_1) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}$
• Beginning and end of sequences:
> Silent states
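The product formula for P(x) can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: it borrows the '+' transition table shown later in this deck, works in log space to avoid underflow on long sequences, and assumes a uniform distribution for the first nucleotide (the slides do not give P(x_1)). The names `PLUS`, `INITIAL`, and `markov_log_prob` are hypothetical.

```python
import math

# '+' (CpG island) transition probabilities a_st from the table in these slides:
# outer key = previous nucleotide s, inner key = next nucleotide t.
PLUS = {
    "A": {"A": 0.180, "C": 0.274, "G": 0.426, "T": 0.120},
    "C": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
    "G": {"A": 0.161, "C": 0.339, "G": 0.375, "T": 0.125},
    "T": {"A": 0.079, "C": 0.355, "G": 0.384, "T": 0.182},
}

# Assumption (not from the slides): the first nucleotide is uniform, P(x_1) = 0.25.
INITIAL = {base: 0.25 for base in "ACGT"}


def markov_log_prob(seq, trans=PLUS, initial=INITIAL):
    """Return log P(x) = log P(x_1) + sum_{i=2..L} log a_{x_{i-1} x_i}."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    log_p = math.log(initial[seq[0]])
    # Each adjacent pair (x_{i-1}, x_i) contributes one transition factor.
    for prev, cur in zip(seq, seq[1:]):
        log_p += math.log(trans[prev][cur])
    return log_p


if __name__ == "__main__":
    print(math.exp(markov_log_prob("CGCG")))
```

Passing a different `trans` table (for example the '-' model) scores the same sequence under another chain.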
Transition probabilities using Maximum likelihood
estimator for CpG islands:
• Two Markov chain models:
1.CpG islands (the ‘+’ model)
2.Remainder of the sequence (the ‘-’ model)
• Table of frequencies (row: current nucleotide s, column: next nucleotide t):

+      A      C      G      T
A    0.180  0.274  0.426  0.120
C    0.171  0.368  0.274  0.188
G    0.161  0.339  0.375  0.125
T    0.079  0.355  0.384  0.182

-      A      C      G      T
A    0.300  0.205  0.285  0.210
C    0.322  0.298  0.078  0.302
G    0.248  0.246  0.298  0.208
T    0.177  0.239  0.292  0.292

• Each row sums to 1.
• Tables are asymmetric.
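As a rough sketch of how such a table could be obtained, the maximum likelihood estimator divides each transition count by the total count of transitions leaving the same nucleotide. The training sequences below are invented toy data, not the sequences used for the slides, and the function name `estimate_transitions` is hypothetical.

```python
ALPHABET = "ACGT"


def estimate_transitions(sequences):
    """Maximum likelihood estimate a_st = c_st / sum_t' c_st' from raw counts."""
    counts = {s: {t: 0 for t in ALPHABET} for s in ALPHABET}
    for seq in sequences:
        # Count every adjacent pair (previous nucleotide, next nucleotide).
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    probs = {}
    for s in ALPHABET:
        total = sum(counts[s].values())
        if total == 0:
            raise ValueError(f"no transitions observed out of {s!r}")
        probs[s] = {t: counts[s][t] / total for t in ALPHABET}
    return probs


if __name__ == "__main__":
    # Hypothetical toy training set, for illustration only.
    toy_island_sequences = ["ACGCGT", "TACG", "GGCA"]
    for s, row in estimate_transitions(toy_island_sequences).items():
        print(s, row)
```

Running the same function on sequences from island and non-island regions separately would give one table per model; by construction each row sums to 1.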
To use this model for discrimination: Log-odds
ratio:
$S(x) = \log \frac{P(x \mid \text{model } +)}{P(x \mid \text{model } -)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1} x_i}$
• x is the sequence
• β is the log-likelihood ratio of the corresponding
transition probabilities

β       A       C       G       T
A    -0.740   0.419   0.580  -0.803
C    -0.913   0.302   1.812  -0.685
G    -0.624   0.461   0.331  -0.730
T    -1.169   0.573   0.339  -0.679

[Figure: histogram of the length-normalized scores S(x) for all the sequences (~60,000 nucleotides)]
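The log-odds score can be sketched as follows. The two tables are copied from the slides; the β values in the slides are reproduced (up to rounding of the printed probabilities) when the logarithm is taken base 2, so the sketch uses `math.log2`. Two simplifications are assumptions of this sketch: the transition out of the begin state is ignored, and length normalization divides by the number of nucleotides. The function names are hypothetical.

```python
import math

# Transition tables from the slides (row: previous nucleotide, column: next).
PLUS = {
    "A": {"A": 0.180, "C": 0.274, "G": 0.426, "T": 0.120},
    "C": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
    "G": {"A": 0.161, "C": 0.339, "G": 0.375, "T": 0.125},
    "T": {"A": 0.079, "C": 0.355, "G": 0.384, "T": 0.182},
}
MINUS = {
    "A": {"A": 0.300, "C": 0.205, "G": 0.285, "T": 0.210},
    "C": {"A": 0.322, "C": 0.298, "G": 0.078, "T": 0.302},
    "G": {"A": 0.248, "C": 0.246, "G": 0.298, "T": 0.208},
    "T": {"A": 0.177, "C": 0.239, "G": 0.292, "T": 0.292},
}


def beta(s, t):
    """Log-likelihood ratio (base 2) of the s -> t transition: '+' vs '-' model."""
    return math.log2(PLUS[s][t] / MINUS[s][t])


def log_odds_score(seq, normalize=False):
    """S(x) as a sum of beta over adjacent pairs; begin transition ignored (assumption)."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    score = sum(beta(prev, cur) for prev, cur in zip(seq, seq[1:]))
    # Assumption: "length-normalized" means dividing by the sequence length.
    return score / len(seq) if normalize else score


if __name__ == "__main__":
    for candidate in ("CGCG", "ATAT"):
        print(candidate, round(log_odds_score(candidate), 3))
```

A positive score favors the CpG island model and a negative score favors the background model.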
3. Given a long piece of sequence, how would we
find the CpG islands in it?
• Single model for the entire sequence that
incorporates both Markov chains: HMM
• Similar transition probabilities within each set
• Small chance of switching between + and – regions
• There is no one-to-one correspondence between states
and symbols.
• Sequence of states (path π): Transition probabilities:
– State sequence is hidden in HMM
$a_{st} = P(\pi_i = t \mid \pi_{i-1} = s)$
• Sequence of symbols: emission probabilities:
– Prob. that symbol b is seen in state s
$e_s(b) = P(x_i = b \mid \pi_i = s)$
– Emission prob. in the CpG island model: 0 or 1
• A sequence can be generated from an HMM as follows:
– A state $\pi_1$ is chosen according to $a_{0i}$
– In $\pi_1$ an observation is emitted according to $e_{\pi_1}$
– A new state $\pi_2$ is chosen according to $a_{\pi_1 i}$
– and so forth: a sequence of random observations
– P(x) = prob. that x was generated by the model
– Joint probability of an observed seq. x and state seq. π:
$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i) \, a_{\pi_i \pi_{i+1}}$, where $\pi_{L+1} = 0$
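The generation procedure and the joint probability can be sketched for an eight-state CpG model (A+, C+, G+, T+, A-, C-, G-, T-). The slides do not give the switching probabilities, so the sketch makes several assumptions: a hypothetical switch probability `P_SWITCH`, a transition into a region that follows that region's table, a uniform start distribution, and no modeled end state (the final factor into state 0 is dropped). All names are illustrative.

```python
import math
import random

PLUS = {
    "A": {"A": 0.180, "C": 0.274, "G": 0.426, "T": 0.120},
    "C": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
    "G": {"A": 0.161, "C": 0.339, "G": 0.375, "T": 0.125},
    "T": {"A": 0.079, "C": 0.355, "G": 0.384, "T": 0.182},
}
MINUS = {
    "A": {"A": 0.300, "C": 0.205, "G": 0.285, "T": 0.210},
    "C": {"A": 0.322, "C": 0.298, "G": 0.078, "T": 0.302},
    "G": {"A": 0.248, "C": 0.246, "G": 0.298, "T": 0.208},
    "T": {"A": 0.177, "C": 0.239, "G": 0.292, "T": 0.292},
}
P_SWITCH = 0.01  # hypothetical chance of switching between '+' and '-' regions
STATES = [base + sign for sign in "+-" for base in "ACGT"]


def transition(s, t):
    """a_st for states such as 'C+' or 'G-' (assumed construction, see lead-in)."""
    table = PLUS if t[1] == "+" else MINUS
    weight = 1 - P_SWITCH if s[1] == t[1] else P_SWITCH
    return weight * table[s[0]][t[0]]


def emission(s, b):
    """e_s(b): each state emits only its own nucleotide, so the value is 0 or 1."""
    return 1.0 if s[0] == b else 0.0


def start(s):
    """a_0s: assumed uniform over the eight states."""
    return 1.0 / len(STATES)


def generate(length, rng):
    """Sample a (symbol sequence, state path) pair by walking the chain."""
    state = rng.choices(STATES, weights=[start(s) for s in STATES])[0]
    symbols, path = [], []
    for _ in range(length):
        path.append(state)
        symbols.append(state[0])  # deterministic emission
        weights = [transition(state, t) for t in STATES]
        state = rng.choices(STATES, weights=weights)[0]
    return "".join(symbols), path


def joint_log_prob(x, path):
    """log P(x, pi) without the end-state factor (no end state is modeled here)."""
    if not x or len(x) != len(path):
        raise ValueError("x and path must be non-empty and of equal length")
    log_p = math.log(start(path[0]))
    for i, (state, symbol) in enumerate(zip(path, x)):
        if emission(state, symbol) == 0.0:
            return float("-inf")  # path cannot emit this sequence
        if i + 1 < len(path):
            log_p += math.log(transition(state, path[i + 1]))
    return log_p


if __name__ == "__main__":
    x, path = generate(12, random.Random(0))
    print(x, path, joint_log_prob(x, path))
```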
• Example: Prob. of sequence ‘CGCG’ being emitted by the
state sequence (C+,G-,C-,G+):
$a_{0,C_+} \times 1 \times a_{C_+,G_-} \times 1 \times a_{G_-,C_-} \times 1 \times a_{C_-,G_+} \times 1 \times a_{G_+,0}$
• Not very useful in practice because the path is not
known → Path estimation: find the most likely path
– Viterbi Algorithm
– Forward or Backward Algorithm
• Example: CpG model: Generating symbol sequence CGCG
– State sequences: (C+,G+,C+,G+),(C-,G-,C-,G-),
(C+,G-,C-,G+)
– (C+,G-,C-,G+): switching back and forth between + and –
– (C-,G-,C-,G-): small prob. of CG in ‘-’ group
– (C+,G+,C+,G+): Best option!
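The CGCG example can be checked with a short sketch of the Viterbi algorithm (most probable path) and the forward algorithm (P(x) summed over all paths), both in log space. The model parameters are assumptions of this sketch, not values from the slides: a hypothetical switch probability `P_SWITCH`, a uniform start distribution, transitions into a region following that region's table, and no end state.

```python
import math

PLUS = {
    "A": {"A": 0.180, "C": 0.274, "G": 0.426, "T": 0.120},
    "C": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
    "G": {"A": 0.161, "C": 0.339, "G": 0.375, "T": 0.125},
    "T": {"A": 0.079, "C": 0.355, "G": 0.384, "T": 0.182},
}
MINUS = {
    "A": {"A": 0.300, "C": 0.205, "G": 0.285, "T": 0.210},
    "C": {"A": 0.322, "C": 0.298, "G": 0.078, "T": 0.302},
    "G": {"A": 0.248, "C": 0.246, "G": 0.298, "T": 0.208},
    "T": {"A": 0.177, "C": 0.239, "G": 0.292, "T": 0.292},
}
P_SWITCH = 0.01  # hypothetical switching probability
STATES = [base + sign for sign in "+-" for base in "ACGT"]
NEG_INF = float("-inf")
LOG_START = math.log(1 / len(STATES))  # assumed uniform start


def log_transition(s, t):
    table = PLUS if t[1] == "+" else MINUS
    weight = 1 - P_SWITCH if s[1] == t[1] else P_SWITCH
    return math.log(weight * table[s[0]][t[0]])


def log_emission(s, b):
    # Emission probabilities are 0 or 1: state 'C+' can only emit 'C'.
    return 0.0 if s[0] == b else NEG_INF


def viterbi(x):
    """Return (most probable state path, its log joint probability)."""
    if not x:
        raise ValueError("sequence must be non-empty")
    v = {s: LOG_START + log_emission(s, x[0]) for s in STATES}
    backpointers = []
    for symbol in x[1:]:
        new_v, pointers = {}, {}
        for t in STATES:
            best = max(STATES, key=lambda s: v[s] + log_transition(s, t))
            new_v[t] = v[best] + log_transition(best, t) + log_emission(t, symbol)
            pointers[t] = best
        v = new_v
        backpointers.append(pointers)
    last = max(STATES, key=lambda s: v[s])
    path = [last]
    for pointers in reversed(backpointers):  # trace back from the last state
        path.append(pointers[path[-1]])
    path.reverse()
    return path, v[last]


def log_sum_exp(values):
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(value - m) for value in values))


def forward_log_prob(x):
    """Return log P(x), the sum over all state paths (forward algorithm)."""
    if not x:
        raise ValueError("sequence must be non-empty")
    f = {s: LOG_START + log_emission(s, x[0]) for s in STATES}
    for symbol in x[1:]:
        f = {
            t: log_emission(t, symbol)
            + log_sum_exp([f[s] + log_transition(s, t) for s in STATES])
            for t in STATES
        }
    return log_sum_exp(list(f.values()))


if __name__ == "__main__":
    print(viterbi("CGCG"))
    print(forward_log_prob("CGCG"))
```

Under these assumed parameters the decoded path for CGCG stays in the '+' states, in line with the "best option" on the slide; a different switch probability could change the result for longer sequences.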
5. Parameter Estimation for HMMs:
HMM models:
1. Design the structure: states and their connections
2. Design parameter values: transition and emission
probabilities, $a_{st}$ and $e_s(b)$
Baum-Welch and Viterbi training
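Both training schemes re-estimate the parameters from counts: Viterbi training counts along decoded paths, while Baum-Welch uses expected counts. The sketch below shows only the counting step for sequences whose state paths are already known, which is the step Viterbi training would repeat after each decoding pass. The two-state labeling ('+' and '-'), the toy data, the pseudocount, and the name `estimate_hmm` are all assumptions made for illustration.

```python
from collections import Counter


def estimate_hmm(sequences, paths, states, alphabet, pseudocount=1.0):
    """Estimate a_st and e_s(b) by counting along known state paths.

    A positive pseudocount (an assumption of this sketch) keeps unseen
    transitions and emissions from getting probability zero.
    """
    trans_counts, emit_counts = Counter(), Counter()
    for x, path in zip(sequences, paths):
        if len(x) != len(path):
            raise ValueError("each sequence needs a path of equal length")
        for state, symbol in zip(path, x):
            emit_counts[(state, symbol)] += 1
        for s, t in zip(path, path[1:]):
            trans_counts[(s, t)] += 1

    a, e = {}, {}
    for s in states:
        # Normalize counts leaving state s to get transition probabilities.
        t_total = sum(trans_counts[(s, t)] + pseudocount for t in states)
        a[s] = {t: (trans_counts[(s, t)] + pseudocount) / t_total for t in states}
        # Normalize symbol counts in state s to get emission probabilities.
        e_total = sum(emit_counts[(s, b)] + pseudocount for b in alphabet)
        e[s] = {b: (emit_counts[(s, b)] + pseudocount) / e_total for b in alphabet}
    return a, e


if __name__ == "__main__":
    # Hypothetical labeled toy example: four island positions, then two background.
    a, e = estimate_hmm(["CGCGAT"], ["++++--"], states="+-", alphabet="ACGT")
    print(a)
    print(e)
```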
6. References
• Bandyopadhyay, Sanghamitra. Gene Identification: Classical and
Computational Intelligence Approach. 38 vols. IEEE, Jan. 2008.
• Durbin, R., S. Eddy, and A. Krogh. Biological Sequence Analysis.
Cambridge: Cambridge University, 1998.
• Koski, Timo. Hidden Markov Models for Bioinformatics. Sweden:
Kluwer Academic, 2001.
• Birney, E. "Hidden Markov models in biological sequence
analysis". July 2001:
• Haussler, David, David Kulp, Martin Reese, and Frank Eeckman. "A
Generalized Hidden Markov Model for the Recognition of Human Genes
in DNA".
• Boufounos, Petros, Sameh El-Difrawy, and Dan Ehrlich. "Hidden Markov
Models for DNA Sequencing".