Sequence-comparison methods
Gerard Kleywegt
Uppsala University
Outline
 Why compare sequences?
 Dotplots
 Pairwise sequence alignments
 Multiple sequence alignments
 Profile methods
Outline
Pairwise sequence alignments II
 Scoring matrices
– Dayhoff
– BLOSUM
 Fast pairwise methods
– FASTA
– BLAST
 Databases
 Assessing significance

LACTAL   KQFTKCELSQLLK--DIDGYGGIALPELICTMFHTSGYDTQAIVEN-DESTEYGLFQISN
|ID +SIM | | +|||+  +|   +| | | +|   +|     | ++|||  +| | || ||++||++
LYSOZY   KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINS

LACTAL   KLWCKSSQVPQSRNICDISCDKFLDDDITDDIMCAKKIL-DIKGIDYWLAHKALCT-EKL
|ID +SIM + ||     | |||+|+| |  +|  |||  + |||||+ |  |++ |+| +  |    +
LYSOZY   RWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDV

LACTAL   EQWLCE-KL
|ID +SIM + |+   +|
LYSOZY   QAWIRGCRL

Scoring matrices
 So far, we have used very simple scoring schemes such as:
– Match = +3
– Mismatch = -1
 May be good enough for DNA, but not for proteins
– Ex: leucine is much more likely to be replaced by an isoleucine than by a glutamate
– Why is this???
Amino-acid properties
 Ex: His = Positive, Charged, Polar, Aromatic, Hydrophobic
 The more barriers between two residues, the less similar (Taylor, 1986)
Scoring matrices
 More sophisticated scoring matrices for amino acids have been based on:
– Minimal number of base changes required to convert the codon of one amino acid into another (Genetic Code Matrix)
– Similarity of their physico-chemical properties
– Observed residue equivalences in aligned protein structures
– Observed substitutions in aligned sequences → substitution matrices
Substitution matrices
 Given 20 amino acid types, a substitution matrix is a symmetric 20 x 20 matrix
 Why is it symmetric?
– Usually not known if X replaced Y or vice versa (or if both derive from an ancestral residue Z)
 How many unique elements does a substitution matrix contain?
– 20*19/2 = 190 off-diagonal
– 20 diagonal
– 210 in all
Substitution matrices
 The matrix elements are the “log odds” ratios of substitution
 Odds ratio R(x,y) = (probability that residues x and y are aligned given that they evolved from each other or a common ancestor) / (probability of aligning x and y by chance)
– Ex: “x” = valine, “y” = leucine
 R(x,y) = P(x,y|related) / P(x,y|chance)
– P(x,y|related) = observed empirically
– P(x,y|chance) = f(x) . f(y)
– f(z) = frequency of residue type z in the population (database/proteome/…)
 Actual matrix elements:
– Take the base-10 logarithm, multiply by 10, round to nearest integer
– S(x,y) = nint (10 x log10(R(x,y)))
Substitution matrices
 Example:
– P(x,y|related) = 0.03 (observed)
– f(x) = 0.1 and f(y) = 0.05
– R(x,y) = 0.03 / (0.1*0.05) = 0.03 / 0.005 = 6
– Thus: for a pair of homologous proteins, x and y are 6 times more likely to be aligned than one would expect by chance
– S(x,y) = nint (10 log10(6)) = nint (10 * 0.778) = nint (7.78) = 8
– S(y,x) = S(x,y) = 8
Substitution matrices
 Another example:
– P(x,y|related) = 0.0025 (observed)
– f(x) = 0.1 and f(y) = 0.05
– R(x,y) = 0.0025 / (0.1*0.05) = 0.0025 / 0.005 = 0.5
– Thus: for a pair of homologous proteins, x and y are 2 times less likely to be aligned than one would expect by chance
– S(x,y) = nint (10 log10(0.5)) = nint (10 * -0.301) = nint (-3.01) = -3
– S(y,x) = S(x,y) = -3
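The two worked examples can be checked with a few lines of Python. This is only a sketch: the function name `log_odds_score` is made up for illustration, and Python's `round` stands in for nint.

```python
import math

def log_odds_score(p_related, f_x, f_y):
    """Return (R, S) for one residue pair.

    R = P(x,y|related) / (f(x) * f(y))
    S = nint(10 * log10(R))
    """
    odds = p_related / (f_x * f_y)
    score = round(10 * math.log10(odds))
    return odds, score

# The two examples from the slides (same background frequencies).
for p_related in (0.03, 0.0025):
    odds, score = log_odds_score(p_related, f_x=0.1, f_y=0.05)
    print(f"P(x,y|related)={p_related}: R={odds:.2f}, S={score}")
```

Because R(x,y) only depends on the product f(x) . f(y), swapping x and y gives the same score, which is why the matrix is symmetric.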
Dayhoff matrices
 To generate a substitution matrix we need to measure P(x,y|related)
 Dayhoff et al. were the first to do this (late 1960s)
 Analysed alignments of closely related sequences (<1% mutations)
 Resulting matrices are called:
– Dayhoff matrices
– MDM (mutation data matrix)
– PAM (point/percent accepted mutation)
Dayhoff matrices
 Analysis yielded PAM1 matrix
– Suitable for comparing proteins with less than 1% mutations
 PAM2 matrix is obtained by multiplying PAM1 with itself
– PAM2 = PAM1 PAM1 = (PAM1)^2
– S(I,L) ~ S(I,A)S(A,L) + S(I,C)S(C,L) + ..
– Strictly, the product is taken over the mutation probabilities; the log-odds scores are derived from the result
– Suitable for comparing proteins with ~2% mutations
 PAM3 = PAM2 PAM1 = (PAM1)^3
 PAM250 = (PAM1)^250
Dayhoff matrices
 Appropriate PAM matrix to use depends on (expected) level of sequence divergence (identity)
%SI   PAM
75    30
60    80
50    110
25    200
20    250
 PAM250 means 250 mutations per 100 residues - how can they still have %SI ~20%?
– Ex: A→G→S→T→V (4 mutations, 1 difference)
– Ex: I→L→I (2 mutations, no difference)
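The effect of repeated multiplication can be illustrated with a toy model. The sketch below uses a hypothetical three-letter alphabet and made-up mutation probabilities (`TOY_PAM1` is not Dayhoff's data). It raises the matrix to the n-th power and reports the fraction of positions expected to be unchanged, assuming all three letters are equally frequent.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(m, n):
    """Return m raised to the n-th power (n >= 0) by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = m
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result

def mean_identity(m):
    """Average diagonal element: expected fraction of unchanged positions
    when all letters are equally frequent (an assumption of this toy model)."""
    return sum(m[i][i] for i in range(len(m))) / len(m)

# Hypothetical one-step mutation probabilities; each row sums to 1.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]

for n in (1, 30, 80, 250, 500):
    print(n, round(mean_identity(mat_power(TOY_PAM1, n)), 3))
```

Even after many steps the expected identity in this toy model does not drop to zero, because repeated and back mutations at the same position hide part of the change. For three equally frequent letters it levels off at 1/3.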
PAM   LL   LI
2     7    -9
5     7    -6
10    7    -4
30    7    -1
80    6     1
250   6     2
500   7     4
BLOSUM matrices
 Dayhoff matrices
– Based on explicit model of evolution
– Based on global alignments of closely related proteins
– Based on a very small sample of sequences (only ~1500 observed substitutions; few W)
 BLOSUM matrices
– Henikoff & Henikoff, 1992
– Based on observed substitutions in conserved (gap-less) blocks of aligned sequences from many protein families (i.e., local)
– Turn out to work better than Dayhoff matrices
BLOCKS example
Block PR00178A
ID FATTYACIDBP; BLOCK
AC PR00178A; distance from previous block=(2,27)
DE Fatty acid-binding protein signature
BL adapted; width=23; seqs=85; 99.5%=1111; strength=1324
MYP2_BOVIN|P02690 ( 4) FLGTWKLVSSENFDEYMKALGVG 12
MYP2_RABIT|P02691 ( 4) FLGTWKLVSSENFDDYMKALGVG 13
FABH_BOVIN|P10790 ( 4) FVGTWKLVDSKNFDDYMKSLGVG 16
FABH_HUMAN|P05413 ( 4) FLGTWKLVDSKNFDDYMKSLGVG 16
[…]
FABI_MOUSE|P55050 ( 2) FDGTWKVDRNENYEKFMEKMGIN 42
FABI_XENLA|Q91775 ( 2) FDGTWKVDRSENYEKFMEVMGVN 44
FABL_HALBI|P81653 ( 2) FSGTWQVYSQENIEDFLRALSLP 87
FAB2_MANSE|P31417 ( 3) LGKVYSLVKQENFDGFLKSAGLS 96
FAB1_MANSE|P31416 ( 4) LGKVYKFDREENFDGFLKSIGLS 59
//
BLOSUM matrices
 BLOSUM62: good general-purpose matrix (see hand-out)
– Based on aligned blocks in which sequences with at least 62% sequence identity are clustered (counted as one)
– Comparable to PAM120
– Default for many programs
– Gap penalty (example): -11 - L for a gap of length L
 BLOSUM80 ~ PAM1
 BLOSUM45 ~ PAM250
Substitution matrices
 Percentage sequence similarity
– Count number of aligned residues whose score in the substitution matrix is greater than zero
– Divide by length of shortest sequence
 What are the %-ages SI and sequence similarity (using BLOSUM62) for the following alignment?
KQFTKCELSQLLK--DIDGYGG
KVFGRCELAAAMKRHGLDNYRG
Substitution matrices
KQFTKCELSQLLK--DIDGYGG
= = +===+  +=   += = =
KVFGRCELAAAMKRHGLDNYRG
 Identities: 9
 Similarities: 4 (+ 9)
 Length of shortest sequence: 20
 %SI = 100% * 9 / 20 = 45%
 Similarity = 100% * (9+4) / 20 = 65%
Fast pairwise methods
 Needleman-Wunsch-Sellers and Smith-Waterman are guaranteed to find an optimal alignment
 But they are too slow if you want to compare a sequence against a database with thousands or millions of sequences
 Faster (50-100x), heuristic methods have been developed for this purpose
– FASTA (global sequence alignment)
– BLAST (local sequence alignment)
FASTA
 Pearson & Lipman, 1988
 Method (grossly simplified!)
– Find identical “k-tuples” (k=1 or 2 for proteins, 4-6 for nucleic acids) in both sequences
– Extend these segments to include similar residues
– Impose window to limit insertions and deletions
– Select the 10 highest scoring segments within the window
– Limited dynamic programming to join segments within the window
– Cut (reasonable) corners, so not guaranteed to find optimal solution
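The first FASTA step (finding identical k-tuples) can be sketched as follows. This is not FASTA's actual implementation, only an illustration of the idea. The function names are made up, and the two test sequences are the ungapped starts of the lactalbumin and lysozyme sequences used earlier. Hits are grouped by diagonal (offset between the two positions), since hits on the same diagonal can be joined without gaps.

```python
from collections import defaultdict

def ktuple_index(seq, k):
    """Map every k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def diagonal_hits(query, target, k):
    """Count identical k-tuples per diagonal (target position - query position)."""
    index = ktuple_index(query, k)
    hits = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            hits[j - i] += 1
    return dict(hits)

lactal = "KQFTKCELSQLLKDIDGYGG"
lysozy = "KVFGRCELAAAMKRHGLDNYRG"
print(diagonal_hits(lactal, lysozy, k=2))
```

Looking up k-tuples in a precomputed index avoids comparing every residue with every other residue, which is where most of the speed-up over full dynamic programming comes from.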
BLAST
 Altschul et al., 1990
 Basic Local Alignment Search Tool
 The workhorse of bioinformatics
 Various versions, e.g.
– BLASTP = protein sequence versus protein database
– BLASTN = nucleic acid sequence versus nucleic acid database
– TBLASTX = translated nucleic acid sequence versus translated nucleic acid database
BLAST algorithm
 Method (grossly simplified!)
– Generate all 3-tuples of the sequence
– For each of them, find all 3-tuples that are similar
• Use BLOSUM62 and a cut-off value (e.g., 13)
• Ex: sequence LSPDGHD… LSP scores 15 with itself
– ISP and MSP score 13 → include
– LTP, VSP, LNP, LAP score 12 → ignore
– Locate the 3-tuples in each database sequence
– Find matching 3-tuples that lie nearby on the same diagonal
– Extend these segments
– Limited dynamic programming to join high-scoring segments
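The LSP example can be reproduced with a small sketch. Only the handful of BLOSUM62 entries needed for this example are included (`BLOSUM62_SUBSET`), so the code fails with a KeyError for any other residue pair. The function names are made up for illustration and are not part of BLAST.

```python
# Only the BLOSUM62 entries needed for the LSP example (not the full matrix).
BLOSUM62_SUBSET = {
    ("L", "L"): 4, ("S", "S"): 4, ("P", "P"): 7,
    ("I", "L"): 2, ("M", "L"): 2, ("V", "L"): 1,
    ("T", "S"): 1, ("N", "S"): 1, ("A", "S"): 1,
}

def pair_score(a, b):
    """Look up a residue pair in either order (the matrix is symmetric)."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]

def word_score(word1, word2):
    """Sum of the pair scores of two equally long words."""
    return sum(pair_score(a, b) for a, b in zip(word1, word2))

def neighbours(word, candidates, cutoff=13):
    """Keep the candidate words that score at least `cutoff` against `word`."""
    return [c for c in candidates if word_score(word, c) >= cutoff]

candidates = ["LSP", "ISP", "MSP", "LTP", "VSP", "LNP", "LAP"]
for c in candidates:
    print(c, word_score("LSP", c))
print(neighbours("LSP", candidates))
```

A real implementation would score all 20^3 possible 3-tuples against each query word using the full matrix. The candidate list here is restricted to the words named on the slide.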
Databases
 Large, central repositories of sequences
 More and more sequence data published directly on the web
 Why is it important to also deposit sequences in central repositories?
Databases
 Why is it not a good idea to make sequence data (or updates) only available through specialised websites?
– Limits user base
• Non-specialists may not know about all web-based resources
– e.g., in organism-specific databases
– Limits lifetime of the data
• Half-life of random websites is only ~2 years
– Limits biological context of the data
• No comparisons to other sequences possible
– Limits quality/coverage of central database
Databases
 Nucleotide sequences
– GenBank
• ~47 million sequences (August, 2005)
– EMBL
• ~55 million sequences (August, 2005)
– DDBJ
– Together > 100 gigabases (August, 2005)
 Protein sequences
– UniProt
• Swiss-Prot + TrEMBL + PIR
• 2,738,790 sequences (January, 2006)
– GenPept
• Translated Genbank, EMBL, …
• 3,230,559 sequences (January, 2006)
Databases
 “Data explosion”
– Driven by large-scale sequencing efforts
 GenBank
– DNA sequences
– NCBI (NIH)
– ~150x growth from 1995 to 2005
Databases
 Growth of UniProtKB/TrEMBL
– Annotated/translated protein sequences (EBI)
Databases
 When searching for related sequences
– TP = true positive = related & retrieved
– TN = true negative = not related & not retrieved
– FN = false negative = related & not retrieved
– FP = false positive = not related & retrieved
 Same concepts used to assess performance of machine-learning methods
Databases
 Performance measures (values 0…1)
– Sensitivity = recall = TP / (TP + FN)
• More sensitive if fewer FN (but maybe many FP)
• Fraction related sequences that is retrieved
– Selectivity = precision = TP / (TP + FP)
• More selective if fewer FP (but maybe many FN)
• Fraction retrieved sequences that is related
– F-measure = 2 * Prec * Rec / (Prec + Rec)
• Harmonic mean of the two measures
Databases
 Example: search with sequence A
Sequences       Retrieved   Not retrieved
Related to A    123 (TP)    12 (FN)
Not related     19 (FP)     8183 (TN)
Databases
 Sensitivity = TP/(TP+FN) = 0.91
– Probability that a related sequence will be retrieved
– A.k.a. recall
 Selectivity = TP/(TP+FP) = 0.87
– Probability that a retrieved sequence is related
– A.k.a. PPV (Positive Predictive Value)
– A.k.a. precision
 F-measure = 0.89
Databases
 Specificity = TN/(TN+FP) = 0.998
– Probability that a non-related sequence will not be retrieved
 NPV = TN/(TN+FN) = 0.999
– Negative predictive value
– Probability that a non-retrieved sequence will be non-related
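The five statistics for the search with sequence A can be recomputed from the four counts. A minimal sketch (the function name `retrieval_stats` is made up for illustration):

```python
def retrieval_stats(tp, fp, fn, tn):
    """Compute the performance measures defined above from the four counts."""
    sensitivity = tp / (tp + fn)   # recall
    selectivity = tp / (tp + fp)   # precision, PPV
    specificity = tn / (tn + fp)
    npv = tn / (tn + fn)
    f_measure = 2 * selectivity * sensitivity / (selectivity + sensitivity)
    return {
        "sensitivity": sensitivity,
        "selectivity": selectivity,
        "specificity": specificity,
        "npv": npv,
        "f_measure": f_measure,
    }

# Counts from the example search with sequence A.
stats = retrieval_stats(tp=123, fp=19, fn=12, tn=8183)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Note how specificity and NPV are close to 1 simply because the database contains far more unrelated than related sequences. Sensitivity and selectivity say more about the quality of the search.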
Databases
Summary
                 Test positive                 Test negative   Statistics
Property true    TP                            FN              Sensitivity, Recall
Property false   FP                            TN              Specificity
Statistics       Selectivity, Precision, PPV   NPV             F-measure
Statistic = true / (true + false)
Databases
Example
                 Retrieved   Not retrieved   Statistics
Related          9           1               ?
Not related      11          99              ?
Statistics       ?           ?               ?
Databases
Example
                 Retrieved     Not retrieved   Statistics
Related          9             1               9/10 = 0.9
Not related      11            99              99/110 = 0.9
Statistics       9/20 = 0.45   99/100 = 0.99   (2 x 0.9 x 0.45) / (0.9 + 0.45) = (1.8 x 0.45) / (3 x 0.45) = 0.6
Significance
 Alignment of the random sequences generated previously in class (2006)
 What level of sequence identity do you think they have?
PojK CGAGTTTTCGGCGTCTATCTT
TjeJ TAAACACAAGGCTACACA
PojK RVFGVYL
TjeJ (STOP)TQG-YT
Significance
 33 %?
PojK ------CGAGTTTTCGGCGTCTATCTT
           | ||  | |  |
TjeJ TAAACACAAGGCTAC-ACA--------
 44 %?
PojK CGAGTTTTCGGCGT-CTATCTT
       |      ||| | | | |
TjeJ TAAACACAAGGC-TAC-A-C-A
Significance
 How about 56 %?
PojK ----CGAGTTTTC--GGCGT-CTATC-TT
         | |     |  ||| | | | |
TjeJ TAAAC-A-----CAAGGC-TAC-A-CA--
 Not bad for a pair of random sequences
 Composition:
– PojK: 2A, 5C, 5G, 9T
– TjeJ: 9A, 5C, 2G, 2T
– Common: 2A, 5C, 2G, 2T = 11/18 = 61%
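The identity percentages and the composition limit can be recomputed from the three alignments. A sketch (function names are made up for illustration) that counts identical aligned residues and divides by the length of the shortest ungapped sequence:

```python
from collections import Counter

def percent_identity(aligned_a, aligned_b):
    """100 * identical aligned residues / length of the shortest ungapped sequence."""
    identical = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    shortest = min(len(aligned_a.replace("-", "")), len(aligned_b.replace("-", "")))
    return 100.0 * identical / shortest

def composition_limit(seq_a, seq_b):
    """Residues the two compositions have in common, as % of the shortest sequence."""
    common = sum((Counter(seq_a) & Counter(seq_b)).values())
    return 100.0 * common / min(len(seq_a), len(seq_b))

alignments = [
    ("------CGAGTTTTCGGCGTCTATCTT", "TAAACACAAGGCTAC-ACA--------"),
    ("CGAGTTTTCGGCGT-CTATCTT", "TAAACACAAGGC-TAC-A-C-A"),
    ("----CGAGTTTTC--GGCGT-CTATC-TT", "TAAAC-A-----CAAGGC-TAC-A-CA--"),
]
for a, b in alignments:
    print(round(percent_identity(a, b)))
print(round(composition_limit("CGAGTTTTCGGCGTCTATCTT", "TAAACACAAGGCTACACA")))
```

The more gaps are allowed, the closer the identity can be pushed towards the composition limit, regardless of whether the sequences are related.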
Significance
 Given two sequences and a scoring scheme, it is always possible to find the optimal alignment
– But is it statistically significant?
– Or biologically meaningful?
 Aligning two random DNA sequences without gaps, one would expect %SI ~25%, and as much as 50% with gaps
 Ditto, for proteins, %SI ~5%
Significance
 Z-scores
– Align two sequences, note score
– Randomise one of the sequences, align, note score (e.g., 100 or 1000 times, to get average and standard deviation)
– Alternatively, align it to a random sample of sequences from a database
– Z-score = (Score - Average) / St.dev.
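The shuffling procedure can be sketched in Python. The scoring scheme here is an assumption made for illustration: a plain Smith-Waterman local alignment with match = +3 and mismatch = -1 (the simple scheme mentioned earlier) and a linear gap penalty of -2 (chosen arbitrarily). A substitution matrix would normally be used for proteins.

```python
import random
import statistics

def local_score(a, b, match=3, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

def z_score(a, b, n_shuffles=200, seed=0):
    """Z = (score - average) / st.dev., with b shuffled n_shuffles times."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    real = local_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        shuffled_scores.append(local_score(a, "".join(letters)))
    st_dev = statistics.stdev(shuffled_scores)
    if st_dev == 0:
        raise ValueError("all shuffled scores are identical; Z is undefined")
    return (real - statistics.mean(shuffled_scores)) / st_dev

print(z_score("KQFTKCELSQLLKDIDGYGG", "KVFGRCELAAAMKRHGLDNYRG"))
```

Shuffling preserves the composition and length of the sequence, so the Z-score measures how much of the score is due to the order of the residues rather than to composition alone.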
Significance
 Z>15: almost certainly homologous
 Z=5-15: probably homologous
 Z<5: ???
Significance
 Optimal local alignment scores of unrelated protein sequences follow an extreme-value distribution (EVD)
 Compare: the length of the tallest person in each house in a country
 Different from normal distribution: right-hand tail decays slower
 Assuming a normal distribution would over-estimate the significance of local alignments
Significance
 EVD
– Probability of obtaining an alignment score greater than “x” by chance is:
P(S>x) = 1 - exp (-exp(-λ(x-u)))
– u = characteristic value = ln(Kmn) / λ
– λ = decay factor, K = constant
– m, n = sequence lengths
– K and λ can be calculated from the substitution matrix and the relative amino-acid frequencies
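The EVD formula can be evaluated directly. The sketch below takes the characteristic value as u = ln(Kmn)/λ (the Karlin-Altschul form). The values of λ, K, m and n are placeholders chosen for illustration, not parameters derived from a particular matrix.

```python
import math

def evd_p_value(x, lam, K, m, n):
    """P(S > x) = 1 - exp(-exp(-lam * (x - u))), with u = ln(K*m*n) / lam."""
    u = math.log(K * m * n) / lam
    # -expm1(-t) equals 1 - exp(-t) but stays accurate when t is tiny.
    return -math.expm1(-math.exp(-lam * (x - u)))

# Illustrative (assumed) parameters: two sequences of 300 residues each.
lam, K, m, n = 0.27, 0.04, 300, 300
for score in (20, 40, 60, 80):
    print(score, evd_p_value(score, lam, K, m, n))
```

At x = u the probability is 1 - 1/e (about 0.63), and beyond that it falls off roughly as exp(-λ(x-u)), which is the slowly decaying right-hand tail mentioned above.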
Significance
 EVD allows analytical calculation of the probability of exceeding a certain alignment score by chance
– p-values
– p = 0.01 means: 1 in 100 unrelated sequences gives at least this score
– In a database scan against 10^6 sequences, this would retrieve 10,000 false positives
 Expectation value: how many matches with at least a certain score are expected by chance in a database of N sequences
– E-values
– E = p*N
– Typical cut-off for BLAST searches: E < 0.01
Significance
 A search in a database with 1,000 sequences gives two hits:
– One has E=10^-5
– The other has p=10^-6
 Which hit is more significant? Why?
 E = pN, so E(other) = 10^-6 . 10^3 = 10^-3
 The hit with E=10^-5 is more significant
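The conversion between p-values and E-values is a single multiplication. A small sketch (the function name `e_value` is made up for illustration) that reproduces the numbers above:

```python
def e_value(p, n_sequences):
    """Expected number of chance hits: E = p * N."""
    return p * n_sequences

# p = 0.01 in a scan against 10^6 sequences.
print(e_value(0.01, 10**6))

# The two hits from the search in a database of 1,000 sequences.
e_first = 1e-5                 # given directly as an E-value
e_other = e_value(1e-6, 1000)  # given as a p-value, converted to E
print(e_first, e_other, "first hit is more significant:", e_first < e_other)
```

Comparing a p-value with an E-value directly is misleading. Both hits have to be put on the same scale before they can be ranked.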