Protein and DNA Sequence Analysis
Part II
Fritz Roth
BCMP 201
Spring 2008

Summary: Sequence Analysis Lectures
Lectures: Sequence Analysis I, Sequence Analysis II, Case Study
Topics:
- Searching sequence databases
- Aligning a pair of sequences
- Scoring aligned sequences
- Aligning multiple sequences
- Representing and finding sequence patterns

Outline
- Searching sequence databases
  - BLAST
  - BLAST statistics
- Aligning multiple sequences
- Representing and finding sequence patterns

Searching Sequence Databases
An O(nm) database search algorithm (Smith-Waterman) sounds pretty good.
But searching a 300 a.a. query against SwissProt could take an hour!
Enter… BLAST!

BLAST algorithm
Step 0 - Preprocess the Sequence Database
  Make a lookup table with locations of ‘words’ in all database sequences
For each query sequence:
  Step 1 - Define Query Words
  Step 2 - Locate Query Words in the Database
  Step 3 - Ungapped Extension
  Step 4 - Gapped Extension
From NCBI website

BLAST: The family

BLAST Step 1: Define query words
For every word in query, make neighborhood word list (Word Size = 2, Threshold 8)
Example: Adipokinetic hormone II - Migratory locust
Query: Q L N F S A G W

Query Word   Expanded List
QL           QL, QM, HL, ZL
LN           LN, LB
NF           NF, AF, NY, DF, QF, EF, GF, HF, KF, SF, TF, BF, ZF
FS           FS, FA, FN, FD, FG, FP, FT, FB, YS
SA           None score ≥ 8 (including SA)
AG           AG
GW           GW, AW, RW, NW, DW, QW, EW, HW, IW, KW, MW, PW, SW, TW, VW, BW, ZW, XW

Default word size in BLAST is 11 for DNA and 3 for proteins.
From http://www.psc.edu/biomed/training/tutorials/sequence/db/index.html
From NCBI website
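
The neighborhood-word step can be sketched in a few lines of Python. This is only an illustration: `toy_score` and its `SIMILAR` pairs are made-up stand-ins for a real substitution matrix, so the lists it produces will not match the table above.

```python
from itertools import product

# Hypothetical "similar" residue pairs; a real search would use a
# substitution matrix such as PAM or BLOSUM instead.
SIMILAR = {frozenset("LM"), frozenset("QH")}


def toy_score(a, b):
    """Toy substitution score: 5 for identity, 2 for a similar pair, else -2."""
    if a == b:
        return 5
    return 2 if frozenset((a, b)) in SIMILAR else -2


def query_words(query, w):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, alphabet, threshold, score=toy_score):
    """Every word over `alphabet` whose summed score against `word`
    reaches the threshold."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(x, y) for x, y in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return sorted(hits)
```

Enumerating every candidate word is fine for word sizes of 2 or 3 but grows as (alphabet size)^w, which is one reason the word size stays small for proteins.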

BLAST Step 2: Word lookup
- In step 0 we preprocessed the sequence database:
  - Storing locations of all neighborhood words that could match a query word with score above threshold.
- In step 2 we use the lookup table (a hash table) to do O(1) (constant time) lookup.
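
A minimal sketch of the lookup table, assuming the database is a plain dict of name to sequence. The function names `build_index` and `find_hits` are illustrative and not part of BLAST itself.

```python
from collections import defaultdict


def build_index(database, w):
    """Step 0: map every word of length w to its (sequence name, offset) locations."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((name, pos))
    return index


def find_hits(words, index):
    """Step 2: one constant-time dictionary lookup per (neighborhood) word."""
    hits = []
    for word in words:
        for name, pos in index.get(word, []):
            hits.append((word, name, pos))
    return hits
```

The index is built once per database, so its cost is shared by all later queries.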

BLAST Step 3: Ungapped Extension
For each single word ‘hit’ exceeding threshold T
- Extend an ungapped alignment until the alignment score starts decreasing
- Ungapped aligned sequences = High Scoring Segment Pair (HSP).

BLAST Step 3 variant: Ungapped Extension
Ungapped Extension from Two-Word Hits:
- Extend when two words are on same diagonal within distance A.
- Same sensitivity achieved with lower word threshold
For each hit…
- Extend ungapped alignment until alignment score decreases below best score S.
- If S is above a threshold, go to STEP 4, otherwise discard

Illustration of the Two-Word Variant
[Figure: word hits plotted on alignment diagonals; “+” marks hits with T ≥ 13, “.” marks hits with T ≥ 11.]
From Altschul et al, NAR, 1997
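
Ungapped extension from a single word hit might look like the sketch below. It assumes a simple match/mismatch score and an explicit drop-off parameter `x_drop`; both are illustrative choices, since the slides do not fix the scoring scheme or the stopping value.

```python
def extend_ungapped(query, subject, q_pos, s_pos, word_len, x_drop,
                    match=1, mismatch=-1):
    """Extend a word hit in both directions without gaps.

    Extension in a direction stops once the running score has fallen more
    than x_drop below the best score seen so far. Returns
    (best_score, query_start, query_end) for the resulting HSP.
    """
    def score(a, b):
        return match if a == b else mismatch

    best = current = sum(score(query[q_pos + k], subject[s_pos + k])
                         for k in range(word_len))

    # Extend to the right of the seed word.
    right = steps = 0
    while (q_pos + word_len + steps < len(query)
           and s_pos + word_len + steps < len(subject)):
        current += score(query[q_pos + word_len + steps],
                         subject[s_pos + word_len + steps])
        steps += 1
        if current > best:
            best, right = current, steps
        elif best - current > x_drop:
            break

    # Extend to the left, starting again from the best score so far.
    current = best
    left = steps = 0
    while q_pos - 1 - steps >= 0 and s_pos - 1 - steps >= 0:
        current += score(query[q_pos - 1 - steps], subject[s_pos - 1 - steps])
        steps += 1
        if current > best:
            best, left = current, steps
        elif best - current > x_drop:
            break

    return best, q_pos - left, q_pos + word_len + right
```

The returned score plays the role of S: an HSP is passed on to gapped extension only if it clears the chosen threshold.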

BLAST Step 4: Gapped extension
STEP 4 (BLAST2 variant): Gapped Extension Using Dynamic Programming
If ungapped alignment (HSP) had good enough score, then…
For each hit…
- Extend alignment in both directions from center of HSP using dynamic programming.
- For improved efficiency, neglect alignment paths that score below max score observed thus far by more than X.
From Altschul et al, NAR, 1997

Decrease-Limited Dynamic Programming
Stop if score falls by X from highest score so far.
[Figure: dynamic programming matrix for the sequences AGCCTA and ATGCCATG with Decrease Threshold = 2; cells on paths that fall too far below the best score are left unfilled.]
From Altschul et al, NAR, 1997

BLAST: A final alignment
From Altschul et al, NAR, 1997

BLAST: E-values
With a random sequence database, expected number of hits, E, is:
$E \approx K m n \, e^{-\lambda S}$
Where…
- m is the size of our query
- n is the size of our database
- S is the alignment score
- K and λ are scale parameters
  - They depend on gapped or ungapped alignment (λ and K for gapped is lower)
  - λ also depends on substitution matrix (BLOSUM62 uses $2\log_2$; some versions of PAM use $10\log_{10}$)

BLAST: P-values
P-value = Prob(N > 0) = Prob of a hit as good or better by chance
- Probability of N = 0 events given E expected events (Poisson):
  $\mathrm{Prob}(N = 0) = e^{-E}$
- So P-value is
  $\mathrm{Prob}(N > 0) = 1 - \mathrm{Prob}(N = 0) = 1 - e^{-E} \approx E$ (for small E)
E values and P values are about the same below .01
From Altschul et al 1994
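
The two formulas above translate directly into code. In this sketch the parameter values passed for K and λ are up to the caller; nothing here supplies real BLAST constants.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance hits: E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Poisson probability of at least one chance hit: P = 1 - exp(-E)."""
    # expm1 keeps precision when E is very small.
    return -math.expm1(-e)
```

Because E is proportional to n, doubling the database size doubles E for the same score, and for E below about 0.01 `p_value(e)` is nearly equal to `e`.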

BLAST: E- and P-values
$E \approx K m n \, e^{-\lambda S}$
Where…
- m is the size of our query
- n is the size of our database
One lesson: Don’t search a bigger database than you need to.

BLAST: Problematic ‘hits’
- Truly homologous but uninteresting
  - ‘Self-hits’
  - Repetitive elements
  - Vector sequence
- Non-homologous but similar sequence
  - Low complexity sequence
  - Coiled coil regions
  - Membrane-spanning (hydrophobic) sequences

Composition-based statistics
YKIL
 | |
FKVL
- Odds score of Y and F is:
  $\mathrm{Odds}_{YF} = \dfrac{p(Y \Leftrightarrow F)}{q(Y) \cdot q(F)}$
- But if your query is mostly “F”, then we should adjust the score
- In terms of Odds score:
  $\mathrm{Odds}'_{YF} = \mathrm{Odds}_{YF} \cdot q(Y) / q'(Y)$
- In terms of Log-Odds score:
  $S'_{YF} = S_{YF} + \log\left[ q(Y) / q'(Y) \right]$

BLAST: Filtering
- Filtering (aka Masking): Hiding regions that often give spurious high scores
- Standard filters for low-complexity:
  - SEG (Protein)
  - DUST (DNA)
- Some BLAST interfaces have a coiled-coil region filter, some have a repeat filter.

BLAST: Miscellaneous
- Matches that are more than 50% identical in a 20-40 amino acid region occur frequently by chance.
- Protein sequence comparisons typically double the evolutionary look-back time over DNA sequence comparisons.

Historical Footnote: FASTA
- FASTA (aka Pearson-Lipman) came after Smith-Waterman but before BLAST
- First to use path-restricted dynamic programming
- First to use rule of two hits on the same diagonal
- Now remembered mostly because of FASTA format sequence files:
>bovin ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
MDIAIHHPWIRRPFFPFHSPSRLFDQFFGEHLLESDLFPASTSLSPFYLR
PPSFLRAPSWIDTGLSEMRLEKDRFSVNLDVKHFSPEELKVKVLGDVIEV
HGKHEERQDEHGFISREFHRKYRIPADVDPLAITSSLSSDGVLTVNGPRK
QASGPERTIPITREEKPAVTAAPKK
>chick ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
MDITIHNPLVRRPLFSWLTPSRIFDQIFGEHLQESELLPTSPSLSPFLMR
SPFFRMPSWLETGLSEMRLEKDKFSVNLDVKHFSPEELKVKVLGDMIEIH
GKHEERQDEHGFIAREFSRKYRIPADVDPLTITSSLSLDGVLTVSAPRKQ
SDVPERSIPITREEKPAIAGSQRK

Outline
- Searching sequence databases
- Aligning multiple sequences
  - Global alignment
  - Local alignment
- Representing and finding sequence patterns
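
Reading FASTA-format text like the records above takes only a few lines. This is a minimal sketch that handles the plain case (header lines starting with `>`, sequence wrapped over several lines) and nothing more; `parse_fasta` is an illustrative name.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```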

Multiple Sequence Alignment
Can we use dynamic programming?
- We can extend Smith-Waterman to align N sequences of length L in $O(L^N)$.
- This becomes prohibitive above 3-4 sequences of length 100.
Approximate method examples
- Global alignment: ClustalW
- Local alignment: Gibbs Motif Sampling
  - (e.g., AlignACE)

ClustalW, a Tree-Based Method for Global Alignment
1. Align and score all pairs of sequences
2. Build a tree by successively merging sequence pairs (or sequence cluster pairs)
NOTE: similarity between clusters is the average over all sequence pairs between two clusters.
Adapted from (Sternberg, 1996)

ClustalW (continued)
3. Build multiple sequence alignment successively, starting with most similar pair
Adapted from (Sternberg, 1996)

ClustalW: Notes
- Dependence on initial pairwise alignment
  - (gaps come but they don’t go)
- Appropriate only if sequences are similar overall, rather than just in local regions
- For aligning sequences with <30% identity try slower, more accurate “T-Coffee”
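
Step 2 (tree building with average similarity between clusters) can be sketched as follows. This is a simplified illustration, not ClustalW's actual implementation: the pairwise similarities are assumed to be given already, keyed by `frozenset` pairs of names.

```python
from itertools import combinations


def build_guide_tree(names, similarity):
    """Merge the most similar clusters repeatedly; return a nested tuple tree.

    `similarity` maps frozenset({a, b}) to a pairwise similarity score.
    Similarity between clusters is the average over all sequence pairs.
    """
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in names]

    def average(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(similarity[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]
```

The nested tuples give the order in which the multiple alignment would be built up in step 3, starting with the innermost (most similar) pair.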

Outline
- Searching sequence databases
- Aligning multiple sequences
  - Global alignment
  - Local alignment
- Representing and finding sequence patterns

Gibbs Motif Sampling: Input Data Set
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
Seven amino acid biosynthesis genes
300-600 bp of upstream sequence per gene (Saccharomyces cerevisiae)

Gibbs Motif Sampling: Initial Seeding
[Figure: the seven upstream sequences (HIS7, ARO4, ILV6, THR4, ARO1, HOM2, PRO3) with the currently selected 10-bp sites marked.]
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
MAP score = -10.0

Gibbs Motif Sampling: Weighting
Add? Candidate site ATGAAAAAAT against the current sites:
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
What are the odds that this sequence came from the current alignment as opposed to the random model?
Use this as a weight

Gibbs Motif Sampling: More Sampling
Add? Remove.
Add or remove each sequence in a weighted random fashion.

Gibbs Motif Sampling: Convergence on GCN4
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********
MAP score = 20.37
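
The weighting question ("what are the odds that this sequence came from the current alignment as opposed to the random model?") can be sketched as below. The pseudocount scheme and the uniform background in the usage note are assumptions made for the example; AlignACE's actual scoring (including the MAP score) is more involved.

```python
def site_odds(site, aligned_sites, background, pseudo_weight=1.0):
    """Odds that `site` was drawn from the motif model built from
    `aligned_sites` rather than from the background model."""
    n = len(aligned_sites)
    odds = 1.0
    for pos, base in enumerate(site):
        count = sum(1 for s in aligned_sites if s[pos] == base)
        # Pseudocounts avoid zero probabilities for unseen bases.
        p_motif = (count + pseudo_weight * background[base]) / (n + pseudo_weight)
        odds *= p_motif / background[base]
    return odds


# The converged GCN4-like sites from the slide.
CONVERGED = ["AAAAGAGTCA", "AAATGACTCA", "AAGTGAGTCA", "AAAAGAGTCA",
             "GGATGAGTCA", "AAATGAGTCA", "GAATGAGTCA", "AAAAGAGTCA"]
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
```

With these inputs, `site_odds("AAATGAGTCA", CONVERGED, UNIFORM)` is far above 1 while `site_odds("CCCCCCCCCC", CONVERGED, UNIFORM)` is far below 1, so the first candidate would almost always be added and the second almost never.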

Gibbs Motif Sampling: Protein example
Locations of conserved repeats in bacterial outer membrane proteins (porins)
From Neuwald, Liu, Lawrence 1995

Outline
- Searching sequence databases
- Aligning multiple sequences
- Representing and finding sequence patterns
  - Consensus sequences
  - Regular expressions
  - Weight matrices
  - Profile Hidden Markov models

Representing and finding sequence patterns
- Consensus Sequences
- Regular Expressions
- Weight Matrices
- Hidden Markov Models

Pattern Model I: The Consensus Sequence
An Example Multiple Sequence Alignment
AGCATT
CGCACC
ATCATT
AGCACT
ACAAAT
CCCAAA
GCCAGG

Table of Counts
Position   1   2   3   4   5   6
A          4   0   1   7   2   1
C          2   3   6   0   2   1
G          1   3   0   0   1   1
T          0   1   0   0   2   4

Consensus is the sequence of (possibly degenerate) bases which best represents the aligned bases

Consensus Sequence: Degenerate Bases
Degenerate DNA Codes:
IUPAC Code   Meaning     Mnemonic
W            (A/T)       Weak
S            (G/C)       Strong
M            (A/C)       aMino
K            (G/T)       Keto
R            (A/G)       puRine
Y            (C/T)       pYrimidine
V            (A/C/G)     Not T or U, V
H            (A/C/T)     Not G, H
D            (A/G/T)     Not C, D
B            (C/G/T)     Not A, B
N            (A/C/G/T)   aNy base

Consensus Sequence: What is It?
(Same example alignment and table of counts as in Pattern Model I.)
A S C A N T?
M S C A N T?
M S C A H T?
ISSUES:
- >9 different methods that give different consensus for some choice of input! (Day and McMorris, 1992)
- Most papers don’t tell you how they reached consensus.
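
One of the many possible consensus rules, shown only to make the ambiguity concrete: keep every base whose column frequency reaches a cutoff and report the matching IUPAC code. The cutoff `min_fraction` is an arbitrary choice for this sketch, not a rule taken from the lecture.

```python
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
         "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT"}
SET_TO_CODE = {frozenset(bases): code for code, bases in IUPAC.items()}

ALIGNMENT = ["AGCATT", "CGCACC", "ATCATT", "AGCACT",
             "ACAAAT", "CCCAAA", "GCCAGG"]


def consensus(alignment, min_fraction=0.25):
    """Degenerate consensus: per column, keep bases at or above min_fraction."""
    n = len(alignment)
    codes = []
    for column in zip(*alignment):
        kept = frozenset(b for b in "ACGT" if column.count(b) / n >= min_fraction)
        # If no base reaches the cutoff, fall back to N.
        codes.append(SET_TO_CODE.get(kept, "N"))
    return "".join(codes)
```

With the default cutoff this rule yields `MSCAHT` for the example alignment; raising the cutoff changes the answer, which is exactly the problem the ISSUES list points out.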

Consensus Sequence: Proteins
Protein Degeneracy Codes:
IUPAC Code   Meaning
B (Asx)      D (Asp) or N (Asn)
Z (Glx)      E (Glu) or Q (Gln)
X            Any amino acid
Example protein consensus sequence for PKG kinase:
(R/K)X(S/T)X

Consensus Sequence: Scoring
Scoring your consensus pattern against query sequences
- Standard: perfect match > imperfect match
- Alternative: Perfect match > 1 mismatch > 2 mismatch > …
Pros
- Easy to calculate (no rules!)
- Concise
- Easy to score against a query sequence
Cons
- Rules are ambiguous
- Information is lost

Pattern Model II: The Regular Expression
Regular expressions
- Like consensus, but allow for more complicated rules.
EXAMPLE: CG(AA|TT)GC or CGGC
EXAMPLE: Regular expression that finds most protein kinases
[LIV]-G-{P}-G-{P}-[FYWMGSTNH]-[SGA]-{PW}-[LIVCAT]-{PD}-x-[GSTACLIVMFY]-x(5,18)-[LIVMFYWCSTAR]-[AIVP]-[LIVMFAGCKR]-K
[ ] means any of these amino acids; {} means anything but these
Pros
- Allows more complicated patterns than consensus
Cons
- Same disadvantages as consensus sequences

PHI-BLAST: Pattern-Hit Initiated BLAST
From http://bioweb.pasteur.fr/seqanal/blast/
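
Patterns written in the bracket/brace notation above can be searched with an ordinary regular-expression engine after a small syntax translation. The converter below is a sketch covering only the elements that appear in the kinase example (`[..]`, `{..}`, `x`, and `(n)` or `(n,m)` repeats); `prosite_to_regex` is an illustrative helper name.

```python
import re


def prosite_to_regex(pattern):
    """Translate a dash-separated pattern into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        # Split an element such as "x(5,18)" into its core and repeat counts.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                       # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # anything but these
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(core + repeat)
    return "".join(parts)


KINASE = ("[LIV]-G-{P}-G-{P}-[FYWMGSTNH]-[SGA]-{PW}-[LIVCAT]-{PD}-x-"
          "[GSTACLIVMFY]-x(5,18)-[LIVMFYWCSTAR]-[AIVP]-[LIVMFAGCKR]-K")
```

For example, `re.search(prosite_to_regex(KINASE), protein_sequence)` returns a match object when the motif is present and `None` otherwise. As with any regular expression, the result is all-or-nothing: near misses are not ranked.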

Pattern Model III: The weight matrix
- aka Position-Specific Scoring Matrix (PSSM)

Alignment
ACAA
TCAA
ACAG
AGCT

Count Table
Position   1     2     3     4
A          3     0     3     2
C          0     3     1     0
G          0     1     0     1
T          1     0     0     1

Probability Table
Position   1     2     3     4
A          .75   0     .75   .50
C          0     .75   .25   0
G          0     .25   0     .25
T          .25   0     0     .25

Random Probability Table
Position   1     2     3     4
A          .3    .3    .3    .3
C          .2    .2    .2    .2
G          .2    .2    .2    .2
T          .3    .3    .3    .3

Odds ratio of query sequence ACAC?
$\mathrm{Odds} = \dfrac{p_1(A)}{q_1(A)} \cdot \dfrac{p_2(C)}{q_2(C)} \cdot \dfrac{p_3(A)}{q_3(A)} \cdot \dfrac{p_4(C)}{q_4(C)} = \dfrac{.75}{.3} \cdot \dfrac{.75}{.2} \cdot \dfrac{.75}{.3} \cdot \dfrac{0}{.2} = 0$

PSI-BLAST: Position-Specific Iterated BLAST
From http://bioweb.pasteur.fr/seqanal/blast/

Weight matrix: using pseudocounts
Alignment
ACAA
TCAA
ACAG
AGCT

Count Table
Position   1     2     3     4
A          3     0     3     2
C          0     3     1     0
G          0     1     0     1
T          1     0     0     1

Revised Count Table
Position   1     2     3     4
A          3.3   0.3   3.3   2.3
C          0.2   3.2   1.2   0.2
G          0.2   1.2   0.2   1.2
T          1.3   0.3   0.3   1.3

Probability Table
Position   1     2     3     4
A          .66   .06   .66   .46
C          .04   .64   .24   .04
G          .04   .24   .04   .24
T          .26   .06   .06   .26

Random Probability Table: A .3, C .2, G .2, T .3 at every position

Odds ratio of query sequence ACAC?
$\mathrm{Odds} = \dfrac{p_1(A)}{q_1(A)} \cdot \dfrac{p_2(C)}{q_2(C)} \cdot \dfrac{p_3(A)}{q_3(A)} \cdot \dfrac{p_4(C)}{q_4(C)} = \dfrac{.66}{.3} \cdot \dfrac{.64}{.2} \cdot \dfrac{.66}{.3} \cdot \dfrac{.04}{.2} = 3.1$

Weight matrices vs consensus sequences
Alignment
ACAA
TCAA
ACAG
AGCT
Consensus might be: ACAA, ACAD, or ACAN
Example: Three test sequences ACAA, ACAG, ACAC.
Do they match the pattern in the alignment?
- Using consensus ACAA, only find ACAA exact match
- Using consensus ACAD, ACAA and ACAG are tied
- Using regular expression ACAN, all score equally!
- Using weight matrix, odds ratios are 24, 19, and 3, respectively
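
The pseudocount calculation can be reproduced with a short script. It assumes, as the revised count table suggests, that each count is increased by the background probability of that base (a total pseudocount weight of 1 per column); the function names are illustrative.

```python
ALIGNMENT = ["ACAA", "TCAA", "ACAG", "AGCT"]
BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}


def probability_table(alignment, background, pseudo_weight=1.0):
    """Per-position base probabilities, with background-weighted pseudocounts."""
    n = len(alignment)
    table = []
    for column in zip(*alignment):
        table.append({base: (column.count(base) + pseudo_weight * q)
                            / (n + pseudo_weight)
                      for base, q in background.items()})
    return table


def odds_ratio(query, table, background):
    """Product over positions of p_i(base) / q(base)."""
    odds = 1.0
    for probs, base in zip(table, query):
        odds *= probs[base] / background[base]
    return odds
```

Calling `odds_ratio(seq, probability_table(ALIGNMENT, BACKGROUND), BACKGROUND)` for each test sequence ranks ACAA above ACAG above ACAC; setting `pseudo_weight=0.0` reproduces the zero odds that ACAC gets without pseudocounts.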

Sequence Logos
- Randomness can be described in terms of Shannon entropy, or uncertainty.
  $H = -\sum_i p_i \cdot \log_2(p_i)$
EXAMPLE: Amount of computer memory it takes to store the result of a coin toss.
  $H_{\text{coin toss}} = -\sum_{i \in \{\text{heads},\text{tails}\}} 0.5 \cdot \log_2(0.5) = 1$ bit
  $H_{\max\ \text{for DNA pos}} = -\sum_{i \in \{A,C,G,T\}} 0.25 \cdot \log_2(0.25) = 2$ bits

Sequence Logos
- If frequencies of A, C, G, T are 88%, 1%, 1%, 10%, then…
  $H_{\text{pos5}} = -(.88 \cdot \log_2(.88) + .01 \cdot \log_2(.01) + .01 \cdot \log_2(.01) + .10 \cdot \log_2(.10)) = 0.6$ bits
- Information = Reduction in uncertainty = 2 bits - 0.6 bits = 1.4 bits
- Total stack height shows information
- Relative letter height shows relative frequency
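
The entropy and information figures above are easy to check numerically. The helper names below are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information(probs):
    """Reduction in uncertainty relative to a uniform distribution."""
    return math.log2(len(probs)) - entropy(probs)
```

For the logo position with frequencies 0.88, 0.01, 0.01, 0.10, `entropy` gives about 0.63 bits and `information` about 1.37 bits, which round to the 0.6 and 1.4 quoted on the slide.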

Pattern Model IV: Hidden Markov Models

Markov Models
- A machine that is always in one of several possible states
- Switches randomly between states
- Emits a symbol from each state
ACCGATAGCTA…

Hidden Markov Models
- Also switches randomly between states
- Also emits a symbol from each state
- Different states can emit the same letter!

Example: The Boss
Given recent “symbols” emitted, what state is the boss in?
[Table: a 29-step simulation with columns State (+ or -), Random#, Emission, Random#. The emissions, in order, were: Great! ×8, No! ×6, Great! ×2, No! ×6, Great!, No!, Great! ×3, No!, Great!]
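
The question "given recent symbols emitted, what state is the boss in?" is what the Viterbi algorithm answers. The sketch below uses a two-state model with made-up start, transition, and emission probabilities; the slides do not give the actual numbers, so every value in `START`, `TRANS`, and `EMIT` is an assumption.

```python
import math

STATES = ("+", "-")  # good mood, bad mood
START = {"+": 0.5, "-": 0.5}                      # hypothetical
TRANS = {"+": {"+": 0.9, "-": 0.1},               # hypothetical
         "-": {"+": 0.1, "-": 0.9}}
EMIT = {"+": {"Great!": 0.8, "No!": 0.2},         # hypothetical
        "-": {"Great!": 0.2, "No!": 0.8}}


def viterbi(observations, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Most probable state path for a sequence of emitted symbols."""
    # Log-probability of the best path ending in each state.
    score = {s: math.log(start[s]) + math.log(emit[s][observations[0]])
             for s in states}
    backpointers = []
    for symbol in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: score[r] + math.log(trans[r][s]))
            new_score[s] = (score[prev] + math.log(trans[prev][s])
                            + math.log(emit[s][symbol]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Because both states can emit both symbols, a single "No!" does not prove a bad mood; the path only switches state when a run of emissions makes the switch more probable than staying put.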

Pattern Model: The Profile HMM
Equivalent to a weight matrix… but with insertions and deletions:
Alignment
ACA---ATG
TCAACTATC
ACAC--AGC
AGC---ATC
ACCG--ATC

Using HMMs
- Find most probable path (as in boss’s mood example)
- Odds of ‘emitting’ a given sequence (used like a weight matrix)

Hidden Markov Models: DNA and Protein Examples
- Gene Prediction
- Transmembrane segments

Sequence Pattern Resources
DNA
- TRANSFAC, weight matrices for TF binding sites
- JASPAR, weight matrices for TF binding sites
- ESEfinder, weight matrices for splicing enhancer sequences
Protein
- PROSITE: regular expressions
- ProDom: weight matrices (based on PSI-BLAST)
- PFAM: HMMs
- SMART: weight matrices + HMMs, signalling domain-focused
- InterPro: all of the above
- Conserved Domain Database (CDD): SMART+PFAM+COGs

Conserved Domain Database
Remember MJ0577?

Conserved Domain Database: Example

Summary: Sequence Analysis Lectures
Lectures: Sequence Analysis I, Sequence Analysis II, Case Study
Topics:
- Searching sequence databases
- Aligning a pair of sequences
- Scoring aligned sequences
- Aligning multiple sequences
- Representing and finding sequence patterns