Blast & Multiple Alignment
Scoring Alignments
• Quality = [10 × (matches)] + [-1 × (mismatches)] - [(Gap Creation Penalty)(# of Gaps) + (Gap Ext. Pen.)(Total length of Gaps)]
Scoring scheme incorporates an evolutionary model:
• Matches are conserved
• Mismatches are divergences
• Gaps are more likely to disrupt function, hence greater penalty than mismatch.
• Introduction of a gap (indel) is penalized more than extension of a gap.
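As a rough illustration of the Quality formula, the sketch below scores two already-aligned strings. The function name `alignment_quality` and the default gap penalties (50 to create, 3 to extend) are assumptions chosen for illustration; they are not values given in the slides.

```python
def alignment_quality(seq_a, seq_b, match=10, mismatch=-1,
                      gap_creation=50, gap_extension=3):
    """Score two aligned, equal-length strings; '-' marks a gapped position."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    n_gaps = 0       # number of separate gaps (each one pays the creation penalty)
    gap_length = 0   # total number of gapped positions (each pays the extension penalty)
    prev_state = None
    for a, b in zip(seq_a, seq_b):
        # Remember which sequence carries the gap so adjacent gaps in
        # different sequences are counted as two separate gaps.
        state = "a" if a == "-" else "b" if b == "-" else None
        if state is None:
            score += match if a == b else mismatch
        else:
            gap_length += 1
            if state != prev_state:
                n_gaps += 1
        prev_state = state
    return score - (gap_creation * n_gaps + gap_extension * gap_length)


print(alignment_quality("ACGT", "ACGT"))
print(alignment_quality("ACGTA", "AC--A", gap_creation=5, gap_extension=1))
```

Because the creation penalty is paid once per gap and the extension penalty once per gapped position, one long gap costs less than several short gaps of the same total length.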
• Both Global and Local alignment programs will
(almost) always give a match.
• It is important to determine if the match is
biologically relevant.
• Not necessarily relevant: Low complexity
regions.
– Sequence repeats (glutamine runs)
– Transmembrane regions (high in hydrophobes)
• If working with coding regions, you are
typically better off comparing protein
sequences. Greater information content.
Substitution Matrices
Substitution Matrix
• Nucleic Acid
• Incorporates the observation that transitions (A<->G or C<->T) are more common than transversions
• Amino Acid Substitution
Substitutions
• 20 different amino acids
– Physical and chemical properties of some
are similar.
A useful classification of amino acids
• Aliphatic - G, A, V, L, I, P
• Aromatic - F, Y, W
• Uncharged polar - S, T, N, Q
• Charged - D, E, H, K, R
• Sulfur-containing - C, M
Amino Acid Substitution Matrix
• Accounts for the observation that some
amino acid substitutions are better
tolerated than others.
• Other types of substitutions are rare.
     A    C    D    E
A    2   -2    0    0
C        12   -5   -5
D              4    3
E                   4
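A minimal sketch of how such a matrix is used: the dictionary below holds only the four-residue excerpt (A, C, D, E) from this slide, and the helper names `excerpt_pair_score` and `ungapped_score` are hypothetical.

```python
# Upper triangle of the four-residue excerpt; the matrix is symmetric.
EXCERPT = {
    ("A", "A"): 2, ("A", "C"): -2, ("A", "D"): 0, ("A", "E"): 0,
    ("C", "C"): 12, ("C", "D"): -5, ("C", "E"): -5,
    ("D", "D"): 4, ("D", "E"): 3,
    ("E", "E"): 4,
}


def excerpt_pair_score(x, y, matrix=EXCERPT):
    """Look up a score, trying both orders because only one triangle is stored."""
    return matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]


def ungapped_score(seq_a, seq_b, matrix=EXCERPT):
    """Sum the substitution scores over an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(excerpt_pair_score(x, y, matrix) for x, y in zip(seq_a, seq_b))


print(ungapped_score("ACDE", "ACED"))
```

Swapping D and E costs little (score 3, close to the identity score of 4), whereas replacing C with either is strongly penalized (-5).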
Two Main AA Substitution
Matrices
• Dayhoff PAM Matrix
– Aligned closely related proteins (orthologs)
to identify amino acid changes that were
acceptable to maintaining function.
• BLOSUM Matrix
– Developed from large number of conserved
amino acid patterns, termed BLOCKS
– BLOCKS: conserved, ungapped amino
acids identified in related proteins
Dayhoff PAM Matrix
• Fundamental Assumptions:
– mutation at each site is independent of
previous changes and other sites- Markov
model
– each site is equally mutable
Dayhoff PAM Matrix
Acceptable mutations -conserved function
Number of changes of each amino acid into
every other amino acid was counted
– Takes into account the frequency of
occurrence of each amino acid (not all
amino acids are equally abundant).
Amino acid   1978    1991
L            0.085   0.091
A            0.087   0.077
G            0.089   0.074
S            0.070   0.069
V            0.065   0.066
E            0.050   0.062
T            0.058   0.059
K            0.081   0.059
I            0.037   0.053
D            0.047   0.052
R            0.041   0.051
P            0.051   0.051
N            0.040   0.043
Q            0.038   0.041
F            0.040   0.040
Y            0.030   0.032
M            0.015   0.024
H            0.034   0.023
C            0.033   0.020
W            0.010   0.014
The frequencies in the middle column are taken from Dayhoff (1978), the frequencies in the right
column are taken from the 1991 recompilation of the mutation matrices by Jones et al. (Jones, D.T.
Taylor, W.R. & Thornton, J.M. (1991) CABIOS 8:275-282) representing a database of observations
that is approximately 40 times larger than that available to Dayhoff.
From: http://www.lmb.uni-muenchen.de/Groups/Bioinformatics/04/ch_04_3.html
Dayhoff PAM Matrix
• Results in a 20 X 20 matrix of
probabilities for each possible amino
acid substitution.
Dayhoff PAM Matrix
• Initial matrix is derived from very similar proteins.
• However, not all homologs are very similar.
• Extrapolate to encompass greater divergence by
multiplication of original matrix
• Results in a series of PAM matrices representing
different levels of similarity.
• PAM250, PAM120, PAM80 and PAM60 correspond to 20, 40, 50 and 60 percent similarity, respectively.
• As proteins being compared decrease in similarity, the
numerical value of the PAM matrix should increase.
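The extrapolation step can be sketched with a toy example. The 3-letter alphabet and the probabilities in `TOY_PAM1` are invented for illustration (a real PAM1 matrix is 20 × 20 and derived from counted mutations); only the idea of repeated multiplication comes from the slides.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Multiply the matrix by itself until it is raised to `power` (power >= 1)."""
    result = matrix
    for _ in range(power - 1):
        result = mat_mul(result, matrix)
    return result


# Hypothetical one-step mutation probabilities: row = original residue,
# column = replacement residue. Each row sums to 1 and most sites are unchanged.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After 250 steps the diagonal (probability of seeing the original residue) has dropped sharply and the off-diagonal entries have grown, which is why higher-numbered PAM matrices suit more divergent sequences.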
BLOSUM Matrix
– Developed from large number of conserved
amino acid patterns, termed BLOCKS
– BLOCKS: conserved, ungapped amino acids
identified in related proteins
– ~2000 conserved blocks-- thought to act as
signatures for families of related proteins.
– BLOSUM 62 matrix derived from BLOCKS exhibiting 62% similarity
– Higher the number, the greater the similarity
• opposite of PAM matrix.
Matrix Application
• Odds matrix:
– Ratio that compares the chance that the
mutation represents an authentic
evolutionary change (pair found in related
proteins) to the chance that the change
occurred by random sequence variation (pair
found in unrelated proteins) .
Matrix Application
• Log Odds matrix:
– Ratio that compares the chance that the mutation represents an
authentic evolutionary change (pair found in related proteins) to the
chance that the change occurred by random sequence variation (pair
found in unrelated proteins) .
– Convert to log scores to simplify score
determination (add log scores)
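A small sketch of the odds-to-log-odds conversion follows. The frequencies passed in are made-up numbers, and taking the chance expectation as the product of the two background frequencies is the usual simplification, stated here as an assumption.

```python
import math


def log_odds(pair_freq, freq_a, freq_b, base=2.0):
    """Log of (pair frequency in related proteins / frequency expected by chance)."""
    expected = freq_a * freq_b  # chance of pairing two independent residues
    return math.log(pair_freq / expected, base)


# Hypothetical frequencies for three aligned pairs.
tolerated = log_odds(0.04, 0.1, 0.1)   # seen more often than chance
neutral = log_odds(0.01, 0.1, 0.1)     # seen as often as chance
disfavored = log_odds(0.0025, 0.1, 0.1)  # seen less often than chance

print(round(tolerated, 3), round(neutral, 3), round(disfavored, 3))
# Adding log scores is equivalent to multiplying the underlying odds.
print(round(tolerated + disfavored, 3))
```

The sign of each score carries the interpretation: positive for a pair seen more often than chance, about zero for chance level, negative for a pair that is selected against.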
Matrix Application
• Practical Consequence
– Typically do not know the percent similarity
until you have an alignment.
– Use several different matrices and
compare output.
Substitution matrix
• Used to score alignments.
• Positive values: substitution is tolerated.
• Zero: substitution occurs with same
frequency as random event.
• Negative value: substitution is typically
selected against.
Expect value
(E-value)
• Expected number of hits, of equivalent
or better score, found by random
chance in a database of the size
searched.
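The slides do not give a formula for the E-value. One commonly quoted approximation from Karlin-Altschul statistics is E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the total database length. The sketch below assumes that form; the function name and the example numbers are illustrative only.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate number of chance hits scoring at least `bit_score` (assumed formula)."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 1024-residue query against a 1024-residue database.
print(expect_value(20, 1024, 1024))
# The same score against a database twice as large is twice as likely by chance.
print(expect_value(20, 1024, 2048))
```

This shows why an E-value depends on the size of the database searched: the same alignment score becomes less significant as the search space grows.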
Conserved domains
Domain: sequence of amino acids that typically fold to
a stable tertiary structure. Many proteins are multidomain.
Blast to Psi-Blast
• Blast makes use of Scoring Matrix
derived from large number of proteins.
• What if you want to find homologs
based upon a specific gene product?
• Develop a position specific scoring
matrix (PSSM).
PSSM
     M  F  W  Y  G  A  P  V  I  L  C  R  K  E  N  D  Q  S  T  H
M    5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
G    1  0  0  0  1  0  0  0  0  1  0  0  0  1  0  0  1  0  0  0
A    1  0  0  0  0  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0
S    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  3  2  0
F    0  4  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Determine the frequency of substitution at each position and convert it to a log-odds score.
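The conversion from counts to log-odds scores can be sketched as follows, using the five positions of the example PSSM. The pseudocount of 1 and the uniform background frequency (1/20 per amino acid) are simplifying assumptions, and the function names are hypothetical; real programs use more careful weighting.

```python
import math

AMINO_ACIDS = "MFWYGAPVILCRKENDQSTH"  # column order of the example table


def counts_to_log_odds(counts, pseudocount=1.0):
    """Turn one position's residue counts into log2-odds scores."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = sum(counts.get(aa, 0) for aa in AMINO_ACIDS)
    total += pseudocount * len(AMINO_ACIDS)
    return {
        aa: math.log2(((counts.get(aa, 0) + pseudocount) / total) / background)
        for aa in AMINO_ACIDS
    }


# Non-zero counts for the five positions of the example (rows M, G, A, S, F).
POSITION_COUNTS = [
    {"M": 5},
    {"M": 1, "G": 1, "L": 1, "E": 1, "Q": 1},
    {"M": 1, "A": 4},
    {"S": 3, "T": 2},
    {"F": 4, "Y": 1},
]

pssm = [counts_to_log_odds(counts) for counts in POSITION_COUNTS]


def score_against_pssm(sequence, matrix):
    """Sum the position-specific scores for an ungapped sequence."""
    return sum(column[aa] for column, aa in zip(matrix, sequence))


print(round(score_against_pssm("MGASF", pssm), 2))
print(round(score_against_pssm("MLATY", pssm), 2))
```

A methionine scores higher at the fully conserved first position than at the variable second position, which is the position-specific behaviour a general substitution matrix cannot provide.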
PSSM
INDEL
     M  F  W  Y  G  A  P  V  I  L  C  R  K  E  N  D  Q  S  T  H
M    5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
G    1  0  0  0  1  0  0  0  0  1  0  0  0  1  0  0  1  0  0  0
A    1  0  0  0  0  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0
S    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  3  2  0
F    0  4  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Indel 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
Can include a score for permitting insertions and deletions.
Perhaps this position is at a turn, where INDELs are common.
PSSM
• In evaluating (scoring) alignments,
PSSM approaches typically:
– Reward matches to columns that have
conserved amino acids
– Penalize mismatches to columns with
conserved amino acid more than
mismatches in a variable column
PSI-BLAST
• Input a single query sequence.
• Executes a BLAST run.
• Program takes significant hits,
incorporates matches into a PSSM.
• Sequences >98% similar not included
(avoid biasing the PSSM).
Power of approach:
• PSI-BLAST is iterative.
• Takes best hits and improves the
scoring matrix.
Original Blast
had 84 hits.
Utility of PSI-BLAST
• Identify distantly related proteins based
upon the profile.
• These potential matches may suggest
functions.
• Profile adds information only over the
identified region of similarity.
Problem of approach:
• PSI-BLAST is iterative.
• Takes best hits and improves the
scoring matrix.
• Investigator must be certain that new
hits are correct.
• Investigator must be certain region of
interest is included in PSSM.
Multiple Sequence Alignment
Multiple Sequence Alignment
(MSA)
• Can define most similar regions in a set
of proteins
– functional domains
– structural domains
• If structure of one (or more) members is
known, may be possible to predict some
structure of other members
MSA and Sequence Pair
Alignment
• Dynamic programming - (matrix
approach) provides an optimal
alignment between two sequences.
• Difficult for multiple alignment, because
the number of comparisons grows
exponentially with added sequences.
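A minimal sketch of the pairwise dynamic programming (matrix) approach is shown below, in the style of Needleman-Wunsch global alignment. The match, mismatch and gap scores are arbitrary illustrative values, and a simple per-position gap penalty is used instead of separate creation and extension penalties.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of two sequences."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing but gaps.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq_a[i - 1] == seq_b[j - 1]
            diagonal = table[i - 1][j - 1] + (match if same else mismatch)
            up = table[i - 1][j] + gap      # gap in seq_b
            left = table[i][j - 1] + gap    # gap in seq_a
            table[i][j] = max(diagonal, up, left)
    return table[-1][-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("ACGT", "AGT"))
```

The table has (len(seq_a) + 1) × (len(seq_b) + 1) cells. Aligning N sequences the same way would need an N-dimensional table, which is why the exact method quickly becomes impractical.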
[Figure: alignment matrix with Seq 1 and Seq 2 on the axes, showing the optimal alignment]
How to add a third sequence?
Complete all pair-wise comparisons.
Each added alignment imposes
boundaries on final MSA.
Optimal Multiple
Sequence Alignment
For more than three, problem
extends into N dimensional
space.
Scoring MSA
• Add scores derived from pair-wise
alignments.
• Sum of pairs (SP score).
• Gaps: constant penalty for any size of gap.
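The sum-of-pairs idea can be sketched as below. Each pair of rows in the MSA is scored separately and the pair scores are added; a run of gaps is charged one constant penalty whatever its length. The scores (+1, -1, -2) and the function names are assumptions for illustration.

```python
from itertools import combinations


def pairwise_row_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two rows of an MSA; each gap run costs `gap` once, whatever its length."""
    score = 0
    prev_state = None
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # column gapped in both rows says nothing about this pair
        state = "a" if a == "-" else "b" if b == "-" else None
        if state is None:
            score += match if a == b else mismatch
        elif state != prev_state:
            score += gap  # charge only when a new gap starts
        prev_state = state
    return score


def sum_of_pairs(alignment, **scores):
    """SP score: add the scores of every pair of rows in the alignment."""
    return sum(pairwise_row_score(a, b, **scores)
               for a, b in combinations(alignment, 2))


print(sum_of_pairs(["MSGL", "MTNL", "MS-L"]))
print(pairwise_row_score("MA--L", "MAGGL"))
```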
Progressive MSA
• Do pair-wise alignment
• Develop an evolutionary tree
• Most closely related sequences are then
aligned, then more distant are added.
• Genetic distance - number of mismatched
positions divided by the total number of
matched positions (gaps not considered).
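The genetic distance defined above can be sketched as follows. "Matched positions" is interpreted here as aligned positions where neither sequence has a gap; that reading, the function names, and the example sequences are assumptions.

```python
from itertools import combinations


def genetic_distance(row_a, row_b):
    """Mismatched positions divided by compared positions; gapped columns are skipped."""
    compared = 0
    mismatched = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            mismatched += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return mismatched / compared


def closest_pair(aligned):
    """Return the names of the two sequences with the smallest distance."""
    return min(combinations(aligned, 2),
               key=lambda pair: genetic_distance(aligned[pair[0]], aligned[pair[1]]))


example = {"a": "MSGL", "b": "MTNL", "c": "MSGI"}
print(genetic_distance(example["a"], example["b"]))
print(closest_pair(example))
```

In a progressive alignment the closest pair is aligned first, and more distant sequences are added afterwards in the order suggested by the tree.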
Example
• Card Domain
Gaps
• Clustalw attempts to place gaps
between conserved domains.
• In known sequences, gaps are
preferentially found between secondary
structure elements (alpha helices, beta
strands).
These are equivalent trees
[Figure: trees with leaves A, B and C drawn with their branches in different orders]
Problem with Progressive
Alignment: Errors made in
early alignments are
propagated throughout the
MSA
Profiles & Gaps
• From an MSA, a conserved region is identified and a scoring matrix (profile)
is constructed for that region.
• Each position has a score associated
with an amino acid substitution or gap.
• Blocks: also extracted from an MSA, but no
gaps are permitted.
• Block Server: http://blocks.fhcrc.org/blocks/blocks_search.html
Hidden Markov Models
• Probabilistic model of a Multiple
sequence alignment.
• No indel penalties are needed
• Experimentally derived information can
be incorporated
• Parameters are adjusted to represent
observed variation.
• Requires at least 20 sequences
The Evolution of a Sequence
• Over long periods of time a sequence will
acquire random mutations.
– These mutations may result in a new amino acid
at a given position, the deletion of an amino acid,
or the introduction of a new one.
– Over VERY long periods of time two sequences
may diverge so much that their relationship
cannot be seen through the direct comparison of
their sequences.
Hidden Markov Models
• Pair-wise methods rely on direct comparisons
between two sequences.
• In order to overcome the differences in the
sequences, a third sequence is introduced, which
serves as an intermediate.
• A high hit between the first and third sequences as
well as a high hit between the second and third
sequence, implies a relationship between the first
and second sequences. Transitive relationship
Introducing the HMM
• The intermediate sequence is kind of
like a missing link.
• The intermediate sequence does not
have to be a real sequence.
• The intermediate sequence becomes
the HMM.
Introducing the HMM
• The HMM is a mix of all the sequences
that went into its making.
• The score of a sequence against the
HMM shows how well the HMM serves
as an intermediate of the sequence.
– How likely it is to be related to all the other
sequences, which the HMM represents.
Match State with no Indels
MSGL
MTNL
B → M1 → M2 → M3 → M4 → E
Arrow indicates transition probability.
In this case 1 for each step.
Match State with no Indels
MSGL
MTNL
B → M1 → M2 → M3 → M4 → E
M1: M=1
M2: S=0.5, T=0.5
Also have probability of residue at each position.
Typically want to incorporate small probability
for all other amino acids.
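For the two-sequence example (MSGL, MTNL), the match-state model can be sketched as below. With no insert or delete states every transition has probability 1, so the probability of a sequence is just the product of its emission probabilities. The function names and the pseudocount value of 0.1 are assumptions for illustration.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def match_state_emissions(alignment, pseudocount=0.0):
    """Estimate one emission distribution per alignment column (one per match state)."""
    emissions = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(ALPHABET)
        emissions.append({aa: (column.count(aa) + pseudocount) / total
                          for aa in ALPHABET})
    return emissions


def sequence_probability(sequence, emissions):
    """Probability of a sequence when all transitions B -> M1 -> ... -> E equal 1."""
    if len(sequence) != len(emissions):
        raise ValueError("sequence length must equal the number of match states")
    probability = 1.0
    for aa, state in zip(sequence, emissions):
        probability *= state[aa]
    return probability


strict = match_state_emissions(["MSGL", "MTNL"])
smoothed = match_state_emissions(["MSGL", "MTNL"], pseudocount=0.1)

print(strict[1]["S"], strict[1]["T"])
print(sequence_probability("MSNL", strict))
print(sequence_probability("MAGL", strict), sequence_probability("MAGL", smoothed))
```

Without the small added probability, a sequence with any unseen residue (MAGL here) gets probability zero, which is the reason for giving every other amino acid a small non-zero probability.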
Permit insertion states
MS.GL
MT.NL
MSANI
Match states: B → M1 → M2 → M3 → M4 → E
Insert states: I1, I2, I3, I4
Transition probabilities may not be 1
Permit insertion states
MS..GL
MT..NL
MSA.NI
MTARNL
Match states: B → M1 → M2 → M3 → M4 → E
Insert states: I1, I2, I3, I4
MS..GL--
MT..NLAG
MSA.NIAG
MTARNLAG
DELETE PERMITS INCORPORATION OF
LAST TWO SITES OF SEQ1
Delete states: D1, D2, D3, D4, D5, D6
Insert states: I1, I2, I3, I4, I5, I6
Match states: B → M1 → M2 → M3 → M4 → M5 → M6 → E
The bottom line of states are the main states (M).
These model the columns of the alignment.
The second row of diamond-shaped states are called the insert states (I).
These are used to model the highly variable regions in the alignment.
The top row of circles are delete states (D).
These are silent or null states because they do not match any residues; they simply
allow the skipping over of main states.
Dirichlet Mixtures
• Additional information to expand
potential amino acids in individual sites.
• Observed frequency of amino acids
seen in certain chemical environments
– aromatic
– acidic
– basic
– neutral
– polar
The PSSM will skew
towards this region