Protein and DNA Sequence Analysis
Part 1
Fritz Roth
BCMP 201
Spring 2008

Overview: Sequence Analysis Lectures
Sequence Analysis I
- Case Study w/ BLAST
- Aligning a pair of sequences
- Scoring aligned sequences
Sequence Analysis II
- Searching large sequence databases
- Representing and finding sequence patterns
- Aligning multiple sequences

Motivation: A Firehose of Sequence
As of Mar 2007:
- Whole-genome projects*:
  - 542 eubacteria
  - 39 archaebacteria
  - 24 microbial 'metagenomes'
  - 60 eukaryotes (of which 21 are multicellular)
  *does not include many 'draft' genomes
- Not whole-genome: 100 billion base pairs
[Figure: sequence database growth chart, x-axis 1997 to 2005]
http://www.ebi.ac.uk/genomes/index.html
http://www.ncbi.nlm.nih.gov/Genbank/genbankgrowth.jpg

Sequence Analysis: What's the Use?
- Find genes
- Infer protein function
- Infer evolutionary history
- Infer subcellular localization
- Infer gene regulation
- Infer protein-protein interactions
Identity vs. Similarity
Identity: Extent to which residues in aligned sequences are invariant.
Similarity: Extent to which residues in aligned sequences have similar properties. Need not have diverged from common ancestor.

Homology in Two Flavors
Homology: Similarity due to descent from common ancestor. Need not have same function.
Orthologs: Homologs diverged by speciation. Sequences with 'mutual best hit' relationship.
Paralogs: Homologs diverged by gene duplication. Share function less often.
Similar sequence… Same function?
[Figure: plot with the legend below]
Legend
  o      general similarity
  X      non-enzyme, same functional class
  p      enzyme, same functional class
  --p--  enzyme, same precise function
  --X--  non-enzyme, same precise function
Caveats: Only for single domain proteins,
From Wilson, Kreychman, and Gerstein, J Mol Biol, 2000

Outline: Sequence Analysis I
- BLAST Case study
- Pairwise sequence alignment
  - Global vs. local alignment
  - Dot plots
  - Smith-Waterman method
- Scoring aligned sequences
Case Study
- Methanococcus jannaschii: Archaebacterium from undersea thermal vent
- Genome sequenced, 1996
- MJ0577: predicted protein based on GeneMark software
1 MSVMYKKILY PTDFSETAEI ALKHVKAFKT LKAEEVILLH VIDEREIKKR DIFSLLLGVA
61 GLNKSVEEFE NELKNKLTEE AKNKMENIKK ELEDVGFKVK DIIVVGIPHE EIVKIAEDEG
121 VDIIIMGSHG KTNLKEILLG SVTENVIKKS NKPVLVVKRK NS
(Case study adapted from http://www.ncbi.nih.gov/Education/BLASTinfo/tut2.html )
Case Study: BLAST Results
[Figure: BLAST results; annotations: self; No known function]

Case Study: BLAST Results
[Figure: BLAST results; annotation: Putative filament protein]

Case Study: Filament protein?
[Figure: alignment annotated "Twilight Zone"]
- Predicted coiled-coil region (COILS)
- Manually filtered the query
- Subsequent BLAST run did not retrieve this protein
Case Study: BLAST with Cationic amino acid transporter

Case Study: BLAST Results
[Figure: BLAST results; annotations: Cationic amino acid transporter; M. jannaschii protein MJ0577; Twilight Zone]
- The transporter entry is 780 aa long, so MJ0577 (~160 aa) is not likely to perform the same function
- It may share a domain, but we'll drop this lead for now
Case Study: BLAST conclusion?
[Figure: BLAST results; annotations: Putative filament protein; Cationic amino acid transporter]
Not very satisfying

Position-Specific Iterated BLAST (PSI-BLAST)
[Figure from http://bioweb.pasteur.fr/seqanal/blast/]

Case study: Filament protein?
- Remember, this hit dropped out when predicted coiled-coil is masked
- Another approach to verify:
  - Run PSI-BLAST using filament protein as query
  - MJ0577 is below threshold in BLAST stage
  - MJ0577 is recovered in first profile scan
  - By this criterion, MJ0577 and filament protein are related, but we're still worried about coiled-coil region
Case Study: Universal stress protein?
[Figure: search results; annotation: Universal stress proteins]

Case Study: Following up the universal stress protein lead
- BLAST search with E. coli UspA does not yield MJ0577 as significant.
- First PSI-BLAST iteration yields MJ0577 and some of its closest relatives.
- Tentative Prediction: MJ0577 is a universal stress protein

E. coli Universal stress protein trivia
- Important for survival in stationary phase
- Is phosphorylated at either serine or threonine
- Autophosphorylates inefficiently in vitro

Case Study: Experimental Followup
- Usually, function before structure
- Zarembinski et al. solved MJ0577 structure
- Test case for "structural genomics"
- Structure similarity search yielded nothing new.
- However… a bound ATP!
- This supports stress protein prediction somewhat (remember autophosphorylation)

Outline: Sequence Analysis I
- BLAST Case study
- Pairwise sequence alignment
  - Global vs. local alignment
  - Dot plots
  - Smith-Waterman method
- Scoring aligned sequences
Global vs. Local Alignment
Global Alignment: Alignment of sequences over their entire length.
LGPSTKQFGKGSAS-RIWDN
|      |||| |  |  |
LNQIERSFGKG-AIMRLGDA
Local Alignment: Alignment of some portion of sequences.
-------FGKGSA-------
       |||| |
-------FGKG-A-------
Sequence Alignment: The Brute Force Approach
- If no gaps, slide the sequences past each other until you get the best score:
     ABCDEFGHI
     ||||
  WXYABCDZE
- For two n-residue sequences, this takes ~2n evaluations
- With gaps, scoring all alignments of two 100 aa proteins takes $\binom{2n}{n} = \frac{(2n)!}{(n!)^2}$, or ~$10^{59}$ evaluations
- $10^{59}$ evaluations on a 1 GHz computer -> longer than lifetime of universe!

Dot Plots: Local Alignment the Old-Fashioned Way
- A dot for each residue match
[Figure: example dot plot, both axes labeled 5' to 3']
[Figure: dot plot of hemoglobin alpha vs beta chain ("Dotter")]
Dot Plots: Filtering
[Figure: dot-plot panels; axes labeled 5' to 3' and N-terminal to C-terminal. Panel titles: "A dot for each match"; "A dot only for 2 matches"; "Window length 31, Match score: +5, Mismatch score: -4"]
From http://lectures.molgen.mpg.de/Pairwise/DotPlots/
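The filtering idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the figures: the function names `dot_plot` and `render` are hypothetical, and the filter used here (a dot only if a diagonal window contains at least `min_matches` identical positions) is an assumed simplification of the scored window (+5/-4 over 31 positions) named on the slide.

```python
def dot_plot(a, b, window=1, min_matches=1):
    """Return the set of (i, j) cells that receive a dot.

    A dot is placed at (i, j) when the diagonal window starting there,
    a[i:i+window] against b[j:j+window], contains at least min_matches
    identical positions. window=1 gives the unfiltered plot.
    """
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(a, b, dots):
    """Draw the plot as text: sequence b across the top, a down the side."""
    lines = ["  " + b]
    for i, ch in enumerate(a):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(b)))
        lines.append(ch + " " + row)
    return "\n".join(lines)


seq = "ACGTAC"
print(render(seq, seq, dot_plot(seq, seq)))                 # every single match
print(render(seq, seq, dot_plot(seq, seq, window=3, min_matches=3)))  # filtered
```

With the filter on, isolated off-diagonal matches disappear and only runs along a diagonal remain, which is the visual effect the filtered panels illustrate.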
Dot Plot Summary
Advantages
- Simple
- Visual
Drawbacks
- Isn't automated (try a whole genome!)
- Hard to find optimal alignment if gaps are included

Global Alignment
- In 1970, Needleman and Wunsch solved global alignment in $O(n^3)$ using dynamic programming. ($O(n^3)$ means running time is proportional to $n^3$, where n is input data size)
- In 1982, Gotoh solved it in $O(n^2)$.
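To see why dynamic programming matters, the brute-force numbers quoted earlier can be checked directly. The sketch below is illustrative only: `best_ungapped_shift` is a hypothetical helper that slides one sequence past the other without gaps, and the age of the universe is taken as roughly 4.3e17 seconds (about 13.8 billion years), which is an assumption added here and not a figure from the slides.

```python
from math import comb


def best_ungapped_shift(a, b):
    """Slide b past a with no gaps; return (matches, shift) for the best offset.

    shift is the index in a that lines up with b[0]; it can be negative.
    There are len(a) + len(b) - 1 offsets to try, i.e. about 2n.
    """
    best = (-1, 0)
    for shift in range(-(len(b) - 1), len(a)):
        matches = sum(
            1
            for j, ch in enumerate(b)
            if 0 <= shift + j < len(a) and a[shift + j] == ch
        )
        if matches > best[0]:
            best = (matches, shift)
    return best


print(best_ungapped_shift("ABCDEFGHI", "WXYABCDZE"))

# Gapped alignments of two n-residue sequences: (2n)! / (n!)^2
n = 100
alignments = comb(2 * n, n)
print(f"{alignments:.2e} alignments")

seconds = alignments / 1e9          # one evaluation per cycle at 1 GHz
age_of_universe_s = 4.3e17          # assumed value, ~13.8 billion years
print(f"{seconds / age_of_universe_s:.1e} universe lifetimes")

# Dynamic programming fills one cell per residue pair instead.
print(n * n, "cells for an n x n dynamic programming matrix")
```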
Local Alignment: Smith-Waterman
Smith-Waterman: A dynamic programming algorithm developed in 1981 that aligns two sequences of length n and m in O(nm).
Dynamic Programming: Finding the optimal solution to a problem by reusing optimal solutions of similar (but smaller) problems.

Sub-problem: What's the best possible score for two aligned segments that end at a particular position?
A-G
| |
ATG

How to Calculate Top Score
$$\mathrm{TopScore}(i,j) = \max \begin{cases} 0 \\ \mathrm{TopScore}(i-1,j-1) + \mathrm{Match}(i,j) \\ \mathrm{TopScore}(i-1,j) - \mathrm{Gap}(1) \\ \mathrm{TopScore}(i,j-1) - \mathrm{Gap}(1) \end{cases}$$
$$\mathrm{Match}(i,j) = \begin{cases} 3 & \text{for match} \\ -1 & \text{for mismatch} \end{cases}$$
Gap(1) = 2
-----AG
     |
ATGCCAT

[Matrix with first row and first column set to 0]
      ∆   A   G   C   C   T
  ∆   0   0   0   0   0   0
  A   0
  T   0
  G   0
  C   0
  C   0
  A   0
  T   0

[Matrix partially filled]
      ∆   A   G   C   C   T
  ∆   0   0   0   0   0   0
  A   0   3   1   0   0   0
  T   0   1   2   0   0   3
  G   0   0   4   2   0   1
  C   0   0   2   7   5
  C   0   0   0   5  10
  A   0   3   1   3   8
  T   0   1   2   1   6

All subproblems solved!
      ∆   A   G   C   C   T
  ∆   0   0   0   0   0   0
  A   0   3   1   0   0   0
  T   0   1   2   0   0   3
  G   0   0   4   2   0   1
  C   0   0   2   7   5   3
  C   0   0   0   5  10   8
  A   0   3   1   3   8   9
  T   0   1   2   1   6  11

What is the best alignment?
A-GCC-T
| ||| |
ATGCCAT
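The matrix fill can be sketched in Python as follows. This is a minimal illustration of the TopScore recurrence with the lecture's parameters (match 3, mismatch -1, Gap(1) = 2); the function names are hypothetical, and traceback to recover the alignment itself is left out.

```python
def smith_waterman_matrix(cols, rows, match=3, mismatch=-1, gap=2):
    """Fill the TopScore matrix for a linear gap penalty.

    Row 0 and column 0 are the empty-prefix boundary and stay at 0.
    cols labels the columns (left to right), rows labels the rows.
    """
    H = [[0] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    for i in range(1, len(rows) + 1):
        for j in range(1, len(cols) + 1):
            s = match if rows[i - 1] == cols[j - 1] else mismatch
            H[i][j] = max(
                0,                      # start a new local alignment
                H[i - 1][j - 1] + s,    # extend along the diagonal
                H[i - 1][j] - gap,      # gap in the column sequence
                H[i][j - 1] - gap,      # gap in the row sequence
            )
    return H


def best_cells(H):
    """Return the top score and every cell (row, column) where it occurs."""
    best = max(max(row) for row in H)
    cells = [(i, j) for i, row in enumerate(H) for j, v in enumerate(row) if v == best]
    return best, cells


H = smith_waterman_matrix("AGCCT", "ATGCCAT")
for label, row in zip("_ATGCCAT", H):
    print(label, " ".join(f"{v:2d}" for v in row))
print(best_cells(H))
```

The best local alignment ends at the highest-scoring cell; tracing back from there through the choices that produced each value recovers the aligned segments.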
Local Alignment: Smith-Waterman
Optimal Alignments
A-GCC-T
| ||| |
ATGCCAT

A-GCCTA
| ||| |
ATGCC-A

      ∆   A   G   C   C   T   A
  ∆   0   0   0   0   0   0   0
  A   0   3   1   0   0   0   3
  T   0   1   2   0   0   3   1
  G   0   0   4   2   0   1   2
  C   0   0   2   7   5   3   1
  C   0   0   0   5  10   8   6
  A   0   3   1   3   8   9  11
  T   0   1   2   1   6  11   9
  G   0   0   4   2   4   9  10

Outline: Sequence Analysis I
- Case study
- Pairwise sequence alignment
- Scoring aligned sequences
  - Log-odds scoring
  - Substitution matrices
  - Gap penalties
Scoring an Alignment
YKKILYGPTD--FSETA
  | |  |||  ||| |
FRKVLF-PTDGGFSEGA
Approaches to Similarity
- % Identity: exact match
- Similar codons
- Similar chemical characteristics (polar, non-polar, bulky, etc.)
- Conservation: frequently substituted amino acids are similar

Conservation-based Scoring
YKIL
 | |
FKVL
Two Models: Random vs. Diverged
- $q(i)$: Prob. of residue by chance
- $q(i) \cdot q(j)$: Prob. of residue pair by chance
- $p(i \Leftrightarrow j)$: Prob. of residue pair if diverged from common ancestor
- Odds ratio of Y and F is:
$$\mathrm{Odds}_{YF} = \frac{p(Y \Leftrightarrow F)}{q(Y) \cdot q(F)}$$
- Odds ratio of entire sequence is:
$$\mathrm{Odds}_{seq} = \frac{p(Y \Leftrightarrow F)}{q(Y) \cdot q(F)} \cdot \frac{p(K \Leftrightarrow K)}{q(K) \cdot q(K)} \cdot \frac{p(I \Leftrightarrow V)}{q(I) \cdot q(V)} \cdot \frac{p(L \Leftrightarrow L)}{q(L) \cdot q(L)}$$

Log-Odds Scores
YKIL
 | |
FKVL
- Log-odds score of Y and F is:
$$S_{YF} = \log \frac{p(Y \Leftrightarrow F)}{q(Y) \cdot q(F)}$$
- Log-odds score of entire sequence is:
$$S_{seq} = \log \left( \frac{p(Y \Leftrightarrow F)}{q(Y) \cdot q(F)} \times \frac{p(K \Leftrightarrow K)}{q(K) \cdot q(K)} \times \frac{p(I \Leftrightarrow V)}{q(I) \cdot q(V)} \times \frac{p(L \Leftrightarrow L)}{q(L) \cdot q(L)} \right)$$
$$= \log \frac{p(Y \Leftrightarrow F)}{q(Y) \cdot q(F)} + \log \frac{p(K \Leftrightarrow K)}{q(K) \cdot q(K)} + \log \frac{p(I \Leftrightarrow V)}{q(I) \cdot q(V)} + \log \frac{p(L \Leftrightarrow L)}{q(L) \cdot q(L)}$$
$$= S_{YF} + S_{KK} + S_{IV} + S_{LL}$$

Substitution Matrix = Log-Odds Table
PAM? BLOSUM?
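Because the log of a product is a sum of logs, scoring an alignment with a substitution matrix reduces to looking up one table entry per aligned pair and adding them. The sketch below illustrates this for YKIL vs FKVL. All probabilities in it are hypothetical numbers invented for the example, not values from PAM, BLOSUM, or the lecture, and base-10 logs are an assumed choice.

```python
import math

# Hypothetical background frequencies q(i); illustration only.
q = {"Y": 0.03, "F": 0.04, "K": 0.06, "I": 0.05, "V": 0.07, "L": 0.09}

# Hypothetical pair probabilities p(i <=> j) under the "diverged" model,
# keyed by the alphabetically sorted residue pair; illustration only.
p = {("F", "Y"): 0.002, ("K", "K"): 0.016, ("I", "V"): 0.012, ("L", "L"): 0.037}


def odds(a, b):
    """Odds ratio: pair probability if diverged vs. pair probability by chance."""
    pair = tuple(sorted((a, b)))
    return p[pair] / (q[a] * q[b])


def log_odds(a, b):
    """One substitution-matrix entry S_ab."""
    return math.log10(odds(a, b))


def score_alignment(seq1, seq2):
    """Sum the per-position log-odds scores of an ungapped alignment."""
    return sum(log_odds(a, b) for a, b in zip(seq1, seq2))


total = score_alignment("YKIL", "FKVL")
product = math.prod(odds(a, b) for a, b in zip("YKIL", "FKVL"))
print(total, math.log10(product))  # the two values agree up to rounding
```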
Percent Accepted Mutation (PAM) Matrix
PAM: A unit of evolutionary distance at which
proteins have diverged an average of 1% (1
amino acid per 100).
Developed by Margaret Dayhoff ~1978 based
on 71 protein families.
The log-odds scoring matrix for sequences with
~1% divergence is called PAM1.
PAM: How Do You Build It?
- Align sequences that are at least 85% identical.
- Reconstruct phylogenetic trees
- Infer ancestral sequences (71 trees w/ 1572 exchanges)
[Figure: example phylogenetic tree. Observed sequences ACGH, DBGH, ADIJ, CBIJ; inferred ancestral sequences ABGH and ABIJ; branches labeled with the inferred exchanges (B-C, A-D, B-D, A-C, I-G, J-H, I-L).]
- Count aligned residue pairs at every step
PAM: How Do You Build It?
Make a matrix M1 with probability of substitution after one PAM unit of time...
$$M_{ij}(t) = p(i \mid j, t)$$
For example, $M_{ij}(1)$ =
              j
           A     B     C
  i   A  .900  .090  .010
      B  .045  .950  .005
      C  .001  .001  .998
Adapted from Wheeler, BioComputing Hypertext Coursebook
PAM: How Do You Build It?
We can then extrapolate to any evolutionary time:
M(2) = [M(1)]^2; M(3) = [M(1)]^3; etc.

           M(1)               M(20)              M(200)             M(1000)
        A     B     C      A     B     C      A     B     C      A     B     C
  A   .900  .090  .010   .310  .548  .142   .120  .243  .637   .077  .154  .769
  B   .045  .950  .005   .274  .613  .113   .122  .248  .631   .077  .154  .769
  C   .001  .001  .998   .014  .023  .963   .064  .126  .810   .077  .154  .769

Relating M to log-odds
$$\mathrm{PAM}_{ij}(t) = \log \frac{M_{ij}(t)}{q(i)} = \log \frac{M_{ij}(t) \cdot q(j)}{q(i) \cdot q(j)} = \log \frac{p(i \mid j, t) \cdot q(j)}{q(i) \cdot q(j)} = \log \frac{p(i, j \mid t)}{q(i) \cdot q(j)}$$
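The extrapolation step is just repeated matrix multiplication, which can be sketched in plain Python using the three-letter example matrix from the slides. The helper names `mat_mul` and `mat_pow` are hypothetical; the printed values can be compared against the M(20), M(200), and M(1000) tables above. For large t every row approaches the same distribution, which for this example matrix works out to (1/13, 2/13, 10/13).

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [
        [sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(M, t):
    """Return M raised to the integer power t (t >= 1) by repeated squaring."""
    result = None
    base = M
    while t > 0:
        if t & 1:
            result = base if result is None else mat_mul(result, base)
        base = mat_mul(base, base)
        t >>= 1
    return result


# Example M(1) from the slide (rows A, B, C; each row sums to 1).
M1 = [
    [0.900, 0.090, 0.010],
    [0.045, 0.950, 0.005],
    [0.001, 0.001, 0.998],
]

for t in (1, 20, 200, 1000):
    print(f"M({t})")
    for row in mat_pow(M1, t):
        print("  " + " ".join(f"{v:.3f}" for v in row))
```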
The PAM250
The Trouble With PAM
- Rare substitutions not observed in PAM1 (36/190 substitutions not observed!).
- Errors in PAM1 are magnified by extrapolation.
- Distant sequences usually have islands (blocks) of conserved residues. (substitution not equally likely over entire sequence.)
Adapted from Wheeler, BioComputing Hypertext Coursebook
BLOSUM
BLOSUM (Blocks Substitution) Matrices
- Developed by Henikoff and Henikoff in 1992
- Does not extrapolate from close homologs.
- Uses BLOCKS, a collection of ungapped multiple alignments

ID   HOMSERKINASE; BLOCK
AC   PR00958A; distance from previous block=(8,20)
DE   Homoserine kinase signature
BL   adapted; width=16; seqs=18; 99.5%=874; strength=1260
KHSE_FREDI|P04947 (14) TTANLGPGFDCIGAAL 46
KHSE_SYNY3|P73646 (11) TTANIGPGFDCLGAAL 43
KHSE_BRELA|P07128 (17) SSANLGPGFDTLGLAL 36
KHSE_CORGL|P08210 (17) SSANLGPGFDTLGLAL 36
KHSE_MYCTU|Q10603 (20) SSANLGPGFDSVGLAL 37
KHSE_MYCLE|P45836 (18) SSANLGPGFDSIGLAL 38
KHSE_BACSU|P04948 (15) STANLGPGFDSVGMAL 42
O32121            (15) STANLGPGFDSVGMAL 42
KHSE_STRPN|P72535 ( 8) TSANIGPGFDSVGVAV 48
O67332            ( 9) TTTNFGSGFDTFGLAL 91
KHSE_LACLA|P52991 ( 8) TSANLGAGFDSIGIAV 68
KHSE_YEAST|P17423 (11) SSANIGPGYDVLGVGL 85
O43056            (11) SSANIGPGFDVLGMSL 46
KHSE_ECOLI|P00547 ( 9) SSANMSVGFDVLGAAV 62
KHSE_HAEIN|P44504 ( 9) SSANISVGFDTLGAAI 68
KHSE_SERMA|P27722 ( 9) SIGNVSVGFDVLGAAV 100
O25690            ( 8) TSANLGPGFDCLGLSL 46
KHSE_METJA|Q58504 (14) TSANLGVGFDVFGLCL 60
BLOSUM: How do you build it?
Procedure
- Aligned, ungapped sequence blocks from Blocks database.
- Tally observed frequency of each amino acid pair among aligned residues.
Adapted from Wheeler, BioComputing Hypertext Coursebook
BLOSUM
Remember the log-odds score $S_{ij} = \log \frac{p(i \Leftrightarrow j)}{q(i) \cdot q(j)}$
How do I calculate $p(i \Leftrightarrow j)$?
Count all possible residue pairings (e.g. there are eleven A-B pairs)

  BLOCK       Pair counts
  ABC              A    B    C
  ABC         A    7   11    3
  BAC         B         3    3
  AAB         C              3
  ABA

$$p(A \Leftrightarrow B) = \frac{\#AB\ \text{substitutions}}{\text{Total}\ \#\ \text{substitutions}} = \frac{11}{30} = .37$$
If $q(A) = .33$ and $q(B) = .33$, then
$$S_{AB} = \log \frac{p(A \Leftrightarrow B)}{q(A) \cdot q(B)} = \log \frac{0.37}{0.33 \cdot 0.33} = \log(3.4) = 0.53$$

BLOSUM: How do you build it?
Procedure
- Aligned, ungapped sequence blocks from Blocks database.
- Cluster similar sequences
- Tally observed frequency of each amino acid pair among aligned residues, counting members of a cluster fractionally toward the tally.
Adapted from Wheeler, BioComputing Hypertext Coursebook
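The unclustered pair tally for the toy block (ABC, ABC, BAC, AAB, ABA) can be reproduced with a short Python sketch. It covers only the simple counting step: the clustering and fractional weighting from the procedure are not implemented, the function name `pair_counts` is hypothetical, and base-10 logs are assumed because they reproduce the 0.53 on the slide.

```python
from collections import Counter
from itertools import combinations
import math

block = ["ABC", "ABC", "BAC", "AAB", "ABA"]


def pair_counts(block):
    """Count every pairing of residues within each column of the block.

    Pairs are unordered, so A-B and B-A are tallied under ("A", "B").
    """
    counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            counts[tuple(sorted((x, y)))] += 1
    return counts


counts = pair_counts(block)
total = sum(counts.values())          # 10 pairs per column, 3 columns
p_ab = counts[("A", "B")] / total     # observed A-B pair frequency
s_ab = math.log10(p_ab / (0.33 * 0.33))

print(dict(counts))
print(total, round(p_ab, 2), round(s_ab, 2))
```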
BLOSUM: "Tuning" for distant homology
- Cluster similar sequences
- Tally fractional aligned residue pairs (e.g. ½ of an "AB" pair)

  BLOCK       Pair counts
  ABC              A    B    C
  ABC         A    4    7    2
  BAC         B         1    2
  AAB         C              1
  ABA

$$p(A \Leftrightarrow B) = \frac{\#AB\ \text{substitutions}}{\text{Total}\ \#\ \text{substitutions}} = \frac{7}{17} = .41$$

BLOSUM: Comments
- Clustering is analogous to increasing PAM distance.
- Clustering threshold 80% -> BLOSUM 80.
- Good statistics:
  - $1.25 \times 10^6$ pairs contributed
  - Least frequent pair observed 2369 times!
BLOSUM and PAM correspondence
The Affine Gap Penalty
- Gap(L) = -b - (L-1) · e
- Where…
  - L is the gap length
  - b is the gap opening penalty
  - e is the gap extension penalty
- Default for BLAST is
  - gap-opening penalty of 11
  - gap extension penalty of 1
- Name comes from the affine transformation y=ax+b

Summary
- Basics of homology
- Case study
  - BLAST
  - PSI-BLAST
- Pairwise sequence alignment
  - Global vs. local
  - Dot plots
  - Smith-Waterman
- Defining sequence similarity
  - Log-odds scoring
  - Substitution matrices
  - Gap penalties

Links to Smith-Waterman applets
- Applet that shows Smith-Waterman alignment for DNA sequences:
  http://www.cs.pdx.edu/~ps/CapStone03/dynvis/SimilarityApplet.html
  (select "local alignment", and match, mismatch, and gap scores of 3, -1, and -1 to be consistent with the lecture)
- Applet that shows Smith-Waterman on protein sequences:
  http://www.cs.auckland.ac.nz/~cam/bio/swnw.html
  (Try "Smith-Waterman", "Affine Gap Model", "Blosum 62", Gap opening cost 9, Gap extension cost 9. You may want to hit "New Alignment" a few times until you get an alignment that scores well enough to be interesting.)
Email [email protected] with questions