Download Sequence Alignment - NIU Department of Biological Sciences

Sequence Alignment
• Why:
– To match a new sequence to others with known functions
– To search for ESTs and other signs of gene expression
– To understand population dynamics and evolutionary relationships between genes and species
– To find important regions within proteins
• Issues:
– Alignment should mimic evolutionary descent: the actual history of mutation and selection that led to this gene
• But it is too complicated to get perfectly correct
• Protein alignments work over larger evolutionary distances than nucleotide alignments
– How to treat substitutions, insertions and deletions (gaps)
– How to score possible alignments
• Global vs. local alignment
– Multiple alignment (as an extension of pairwise alignment)
• Hidden Markov Models and other ways of abstracting multiple alignment information
• Homology: related by evolutionary descent, as opposed to similarity, which is not necessarily based on descent from a common ancestor
– But in practice, long aligned sequences seem to arise only by evolution
– Short alignments can be due to chance or convergent evolution.
Example Alignments
• THISSEQUENCE vs. THATSEQUENCE
– Same length, just 2 mismatches
• THISISASEQUENCE vs. THATSEQUENCE
– Length is different, need to introduce gaps to maximize identities.
Scoring by Identity
• One simple way to score an alignment is by counting the number of
perfect matches.
– Get percentage of identities by dividing number of matches by total
positions (including gap positions). This is a measure of relatedness
between 2 proteins.
– For previous example, 11 matches with 16 positions = 68.75% (69%)
identities
• Length matters: it is harder to get a high percentage of identities in a
long sequence than in a short one.
• Problem of random matches. For nucleotides, 25% of all positions
in random sequences match, and it’s 5% for proteins.
– General rule, based on proteins with known structural similarity:
• Two proteins are probably structurally similar (and thus probably
homologous) if they have 30% or more identical amino acids over their
whole length when aligned.
• Less than 20% amino acid identity means probably not homologous
• Between 20% and 30% is a gray zone
• My personal happiness with matches increases when it’s above 35%
• Except for very unusual proteins, 100% identity doesn’t occur between
homologous proteins in different species
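The identity calculation can be sketched in a few lines of Python. The function name and the gapped rows are assumptions made for illustration: the slide gives only the counts (11 matches over 16 positions), and the gap placement below is one arrangement that reproduces them.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent identity of two aligned rows; gap positions count toward the total."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # A position is an identity only if both rows hold the same residue (not a gap).
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)
    return 100.0 * matches / len(row_a)


# Ungapped example: same length, 2 mismatches.
print(percent_identity("THISSEQUENCE", "THATSEQUENCE"))

# Gapped example (assumed gap placement): 11 matches over 16 positions.
print(percent_identity("THISISA-SEQUENCE", "TH----ATSEQUENCE"))
```

Dividing by the full alignment length, gaps included, follows the rule stated on the slide; dividing by the shorter sequence length instead would give a higher number.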
Dotplots
• Dotplots are a simple way of seeing alignments
– We really like to see good visual demonstrations, not just tables of numbers
• It’s a grid: put one sequence along the top and the other down the side, and put a dot wherever they match.
• You see the alignment as a diagonal
Dotplot Noise
• A big problem is noise: there are lots of random matches (roughly 5% for proteins) that confuse the image.
– Standard solution: create a sliding window (say 10 residues) and only mark a dot if a minimum number of matches occur in that window (say 3).
– A lot of noise goes away
• This is a sequence compared to itself, so there is a perfect diagonal.
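A minimal sketch of the sliding-window filter, assuming the window runs along the diagonal and the dot is placed at the window's first cell (the slide does not say where in the window the dot goes). The names `dotplot` and `render` are made up for this example.

```python
def dotplot(top, side, window=10, min_matches=3):
    """Return the set of (row, col) cells to mark.

    A cell is marked when at least `min_matches` of the `window` residue
    pairs along the diagonal starting at that cell are identical.
    """
    dots = set()
    for row in range(len(side) - window + 1):
        for col in range(len(top) - window + 1):
            matches = sum(side[row + k] == top[col + k] for k in range(window))
            if matches >= min_matches:
                dots.add((row, col))
    return dots


def render(top, side, dots):
    """Draw the grid as text: one sequence along the top, the other down the side."""
    lines = ["  " + top]
    for row, residue in enumerate(side):
        cells = "".join("*" if (row, col) in dots else "." for col in range(len(top)))
        lines.append(residue + " " + cells)
    return "\n".join(lines)


seq = "THISISASEQUENCE"
# window=1, min_matches=1 is the plain dotplot with all the noise.
print(render(seq, seq, dotplot(seq, seq, window=1, min_matches=1)))
print()
# A stricter window keeps the main diagonal and drops most isolated dots.
print(render(seq, seq, dotplot(seq, seq, window=3, min_matches=3)))
```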
A Real Dotplot
• Two haptoglobin sequences. (Haptoglobin is a blood protein that binds to hemoglobin that has gotten out of the red blood cells).
• You can see a gap in one sequence, a region of poor similarity just before it, and a simple sequence repeat near the beginning.
Similarity Matching
• In proteins, many substitutions occur that have little effect on structure or function
– or, they alter the protein to make it more adapted for the lifestyles of the different species
– This depends on where in the protein they occur and on the chemical and physical properties of the amino acids.
• Substitution matrices: scores of the probability of changing one amino acid into another.
– Amino acids are similar if they can frequently be substituted for each other.
– These are just overall numbers compiled over many sequences, not adapted to specific cases.
• Early attempts were based on amino acid properties, or on the number of nucleotide substitutions needed to change from one amino acid to the other.
• Now they are based on actual comparison between sequences.
– The two most popular types: PAM and BLOSUM
– There are other, more specialized substitution matrices, for comparing transmembrane regions, for example.
BLOSUM62 Matrix
Similarity Matrix Theory
• Think about aligning 2 proteins from similar species that are orthologs: same function and syntenic. At some point back in evolutionary time, there was a single DNA sequence that is the common ancestor of both proteins.
– Most paired amino acids are identical, but a few are different.
• Reduce the problem: consider a single aligned pair of amino acids that are not identical, T-S.
• We are comparing 2 theories of how these amino acids were derived from a common ancestor.
1. Random mutation followed by natural selection. Some substitutions will happen more frequently than others because they lead to functional proteins more often.
• The frequency with which T and S are substituted for each other by evolution is derived from counting them in well-aligned sequences. This is freq(T-S).
2. Completely random changes: every possible substitution happens in proportion to the relative frequencies of the different amino acids, and the two amino acids are unrelated to each other.
• In this case, the frequency of a T paired with an S is just the product of the frequency of T’s and the frequency of S’s in the entire protein (or proteome). This is freq(T) × freq(S).
• The odds ratio is the evolutionary theory (observed data) frequency divided by the random theory frequency: OR = freq(T-S) / (freq(T) × freq(S))
More Theory
• We want to get the odds that a given alignment fits the evolutionary model better than a random model.
– Good alignments give high odds ratios
• Need to multiply the ORs for all amino acid pairs in the alignment
• It is easier (and doesn’t overflow or underflow the computer’s floating-point arithmetic) to take the logarithm of the odds ratio for each amino acid pair, and then add the logarithms.
– This is the lod score (log of odds).
• A negative score means that the given substitution is less likely than chance, and a positive score means it is more likely than chance.
• You can score each possible alignment by adding up over the whole protein
• Some fooling with constants (which doesn’t distort the results but is either more pleasing to the human eye or makes further calculations easier): multiply the lod score by 10, or add a constant to make all values 0 or greater
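The odds-ratio and lod calculations can be sketched as follows. All frequencies here are hypothetical numbers chosen so the ratios come out round; they are not taken from any real substitution matrix, and the function names are made up for this example.

```python
import math


def odds_ratio(pair_freq, freq_a, freq_b):
    """Observed pair frequency divided by the frequency expected by chance."""
    return pair_freq / (freq_a * freq_b)


def lod_score(pair_freq, freq_a, freq_b):
    """Base-10 log of the odds ratio."""
    return math.log10(odds_ratio(pair_freq, freq_a, freq_b))


# Hypothetical (pair frequency, frequency of residue 1, frequency of residue 2).
aligned_pairs = [
    (0.020, 0.10, 0.10),  # seen twice as often as chance predicts
    (0.005, 0.10, 0.10),  # seen half as often as chance predicts
    (0.030, 0.10, 0.10),  # seen three times as often as chance predicts
]

# Multiplying the odds ratios and adding the lod scores carry the same information.
product_of_ratios = math.prod(odds_ratio(*pair) for pair in aligned_pairs)
sum_of_lods = sum(lod_score(*pair) for pair in aligned_pairs)
print(product_of_ratios, sum_of_lods, math.log10(product_of_ratios))

# Scaling by 10 and rounding gives small integers that are easier to read.
print([round(10 * lod_score(*pair)) for pair in aligned_pairs])
```

Over a few hundred residues the product of ratios can become too large or too small for floating-point numbers, while the sum of logs stays in a comfortable range.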
PAM
• PAM = “Point Accepted Mutations”, meaning single amino acid substitutions (point mutations) that have been “accepted” by natural selection: they are functional in different species.
• Derived by Dayhoff and colleagues in the 1960s and 1970s (although there are some newer versions around)
• They give a measure of the frequency of changing from one amino acid to another, as compared to the frequency of random change
• Derived from global alignments of homologous sequences from different, but closely related, species. The sequences had an average of 1 amino acid change per hundred residues. Thus we assume at most 1 mutation has occurred at each position.
– Do a phylogenetic analysis of the sequences to determine which mutations have occurred
– Calculate the lod scores. Then multiply all of them by 10 and round to integers.
– This set of scores derived from sequence alignments is the PAM1 matrix.
• Since most sequences being aligned are not between such closely related species, the PAM1 matrix (in its substitution-probability form, before the logs are taken) is multiplied by itself many times to mimic lots of small changes.
– This concept is a serious weakness: multiplying of errors magnifies them.
– The number after “PAM” is the number of times the matrix has been multiplied by itself.
– Common ones: PAM30, PAM70, PAM120, PAM250. Bigger number = better for more distant relationships
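A toy sketch of the matrix-multiplication idea, using a made-up two-letter alphabet instead of the 20 amino acids. The 1% change probability and the equal background frequencies are hypothetical values chosen to keep the arithmetic visible; the helper names are also made up.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Multiply a square matrix by itself `power` times, starting from the identity."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


def pam_scores(prob, background):
    """Rounded 10 * log10 odds, where prob[i][j] = P(residue i is replaced by j)."""
    n = len(prob)
    return [[round(10 * math.log10(prob[i][j] / background[j])) for j in range(n)]
            for i in range(n)]


# Hypothetical one-step matrix: each residue has a 1% chance of changing.
step = [[0.99, 0.01],
        [0.01, 0.99]]
background = [0.5, 0.5]

pam1_scores = pam_scores(step, background)
pam250_scores = pam_scores(mat_pow(step, 250), background)
print(pam1_scores)
print(pam250_scores)
```

In this toy model the mismatch score moves toward 0 as more steps are applied: after many rounds of change a mismatch says little about whether two sequences are related, which is why the higher-numbered matrices are more forgiving.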
BLOSUM
• BLOSUM = BLOcks SUbstitution Matrix. Derived in the 1990s by Henikoff and Henikoff.
• Based on local alignments of Blocks, which are short, highly homologous regions, with no gaps
• Sequences were grouped together if they were very similar, and then comparisons were made between the groups as in the PAM matrices.
– No attempt at phylogenetic trees
– The different BLOSUM matrices have specific cutoffs for amino acid identities. For example, for the BLOSUM62 matrix, sequences within a block that are at least 62% identical are grouped together and counted as one.
– The odds ratio for each substitution is calculated, but instead of taking the base 10 log and multiplying the result by 10 as in PAM, BLOSUM takes the base 2 log and multiplies by 2. This gives scores in “half-bits”.
• Bigger numbers imply closer evolutionary distance, so BLOSUM80 is better for closely related species than BLOSUM45.
• BLOSUM seems to work better than PAM
– BLOSUM62 is the default used in BLAST searches.
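The two scalings can be compared side by side. This is a small sketch with made-up function names; the odds ratios are arbitrary inputs, not values from a published matrix.

```python
import math


def pam_style_score(odds_ratio):
    """Base-10 log of the odds ratio, multiplied by 10 and rounded."""
    return round(10 * math.log10(odds_ratio))


def half_bit_score(odds_ratio):
    """Base-2 log of the odds ratio, multiplied by 2 and rounded (half-bits)."""
    return round(2 * math.log2(odds_ratio))


for ratio in (0.25, 1.0, 4.0):
    print(ratio, pam_style_score(ratio), half_bit_score(ratio))
```

Read the other way, a half-bit score of s corresponds to an odds ratio of about 2^(s/2): a score of +4 means the pair is seen roughly 4 times more often than chance predicts.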
BLOSUM62 and PAM120 Matrices
The colors represent different physicochemical properties.
Note that some substitutions are positive, which indicates that they occur more frequently than chance.
The average value is negative: it is more likely that an amino acid will stay the same than change.
The diagonal values are unchanged amino acids, all of which have positive values. Some are less changeable than others: tryptophan and cysteine especially.
Gaps
• Gaps occur with roughly 1/10 the frequency of base substitutions, so they are common in most alignments.
• Symbolized by hyphens ( --- ) paired with residues: like a mismatch with a blank space.
• You can assign a penalty for each gap position.
– This is called a linear gap penalty: the total penalty is proportional to the gap length.
• The problem is, once you start putting them in, you can get almost anything aligned.
• Alignment programs usually distinguish between creating a gap and extending a gap. Thus, there is a gap opening penalty and a (smaller) gap extension penalty.
– This is called an affine gap penalty.
• Although substitutions have a lot of theory behind them, gap penalties are generally determined by heuristic means.
– Heuristic = a method or value determined by trial-and-error experiments, without a strong guiding theory.
– In this case, gap opening and extension penalties are the result of trying many possibilities and seeing which ones give the most pleasing alignments.
– The BLAST default is a -11 penalty for opening the gap and -1 for each additional position in the gap. (11/1)
• Other options on BLAST at NCBI are 7/2, 8/2, 9/2, 10/1, and 12/1
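The difference between the two penalty schemes can be sketched with two small functions. The affine version follows the wording of the slide (the opening penalty covers the first gap position, the extension penalty each additional one); some programs instead charge the extension penalty for every position including the first, so this is an assumed convention rather than a description of any particular tool. The linear default of -6 is an arbitrary illustrative value.

```python
def linear_gap_penalty(length, per_position=-6):
    """Total penalty proportional to gap length."""
    return per_position * length


def affine_gap_penalty(length, opening=-11, extension=-1):
    """Opening penalty for the first gap position, extension for each additional one."""
    if length <= 0:
        return 0
    return opening + extension * (length - 1)


for length in (1, 2, 5, 10):
    print(length, linear_gap_penalty(length), affine_gap_penalty(length))
```

With these values a single long gap is much cheaper under the affine scheme than under the linear one, while many short gaps are expensive. That matches the intuition that one insertion or deletion event can add or remove several residues at once.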
Comparing 2 distantly related sequences with different gap penalties:
• Top alignment has fewer gaps and longer matches.
• Bottom alignment has more identities and similarities overall, but lots of little gaps. The matches near the C-terminus are absurd.
• Look at the short segment after the first gap in the lower alignment: gained 3 identities
How Do We Make Alignments?
• We have been working on scoring an alignment: identities and similarities, and gap penalties.
• But, how do you get an alignment to score in the first place?
– Trying all possibilities is one of those “more possibilities than there are atoms in the Universe” problems.
• The general solution: “dynamic programming”, a technique first applied to protein sequences by Needleman and Wunsch (1970)
– Their original method gave global alignments.
– Smith and Waterman (1981) provided a slight (but critical) modification that produced local alignments, which work better than global for most genes.
• These methods provide an optimal alignment, for a given substitution matrix and set of gap penalties.
• They are much faster than trying all possibilities, but still not quick enough. Various refinements and heuristic methods improve the speed.
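To get a feel for how fast the number of possibilities grows, here is a rough sketch that counts the paths through an alignment grid when each step is diagonal, vertical, or horizontal. It assumes every distinct path counts as a distinct global alignment (so alignments that differ only in the order of adjacent gaps are counted separately); the function name is made up.

```python
def count_alignment_paths(m, n):
    """Count corner-to-corner paths in an (m+1) x (n+1) grid using
    diagonal, vertical, and horizontal steps."""
    # previous[j] holds the path count for column j of the row above.
    previous = [1] * (n + 1)  # the top edge can only be reached by horizontal steps
    for _ in range(m):
        current = [1]  # the left edge can only be reached by vertical steps
        for j in range(1, n + 1):
            # A cell is reached from above, from the left, or from the diagonal.
            current.append(previous[j] + current[j - 1] + previous[j - 1])
        previous = current
    return previous[n]


for length in (2, 5, 10, 100):
    paths = count_alignment_paths(length, length)
    print(length, "residues:", len(str(paths)), "digits in the path count")
```

The counting itself uses the same trick as dynamic programming: each cell is filled in once from its three neighbors, so the work grows with the area of the grid rather than with the number of paths.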
Smith-Waterman Algorithm
• Start with a 2-dimensional matrix with one sequence along the top and the other sequence down the left side. All possible pairs of nucleotides or amino acids are represented by the cells of the matrix.
– “Edge rows” along the top and left side.
• All possible alignments are represented by the paths through the matrix.
– a diagonal step is an alignment between the query and the subject sequences at that position
– a vertical step is a gap in the query sequence
– a horizontal step is a gap in the subject sequence.
• Have a match reward and penalties for mismatches, gap openings, and gap extensions. For our example, we will use the BLOSUM62 matrix, with a linear gap penalty of -6
• Initialize the edge rows to scores of 0.
BLOSUM62, with positive scores marked
Calculating Cell Scores
• The cell at row i and column j has a score S(i, j)
• Starting at the top left cell, proceed row-by-row, calculating each cell’s score S(i, j). S(i, j) is the maximum of:
– 0 (i.e. set to 0 if the calculated score is less than 0)
– S(i-1, j-1) + match/mismatch score for cell (i, j)
– S(i, j-1) + match/mismatch score for cell (i, j) + gap penalty
– S(i-1, j) + match/mismatch score for cell (i, j) + gap penalty
Example, with TA along the top, TG down the side, and the cell in question marked "?":
    T  A
T   5  7
G   2  ?
For the cell in question, the bases don’t match, so it starts with a match/mismatch score of -1 (the gap penalty here is -4). There are 3 possible alignment paths to this cell:
1. diagonal (query/subject alignment). Score = 5 – 1 = 4.
2. vertical (query gap). Score = 7 – 4 – 1 = 2
3. horizontal (subject gap). Score = 2 – 4 – 1 = -3 (set to 0)
Since 4 is the maximum, the cell’s value is set to 4.
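The single-cell calculation can be written as a small function. By default it follows the rule used on this slide, where a gap move adds the cell's match/mismatch score as well as the gap penalty. Many textbook statements of Smith-Waterman add only the gap penalty on gap moves; the `add_substitution_on_gap` flag (a name made up for this sketch) switches to that form.

```python
def cell_score(diagonal, above, left, substitution, gap_penalty,
               add_substitution_on_gap=True):
    """Score one cell from its three neighbors.

    diagonal, above, left: scores already computed for the neighboring cells.
    substitution: match/mismatch score for this cell.
    Returns (score, candidates), where candidates maps each move to its raw score.
    """
    on_gap = substitution if add_substitution_on_gap else 0
    candidates = {
        "diagonal": diagonal + substitution,
        "vertical": above + on_gap + gap_penalty,
        "horizontal": left + on_gap + gap_penalty,
    }
    # Local alignment: a cell's score never drops below 0.
    return max(0, *candidates.values()), candidates


# The example cell: neighbors 5 (diagonal), 7 (above), 2 (left),
# mismatch score -1, gap penalty -4.
score, candidates = cell_score(5, 7, 2, substitution=-1, gap_penalty=-4)
print(score, candidates)
```

For this cell both forms of the rule pick the diagonal move and give the same score; they can differ in cells where a gap move wins.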
Smith-Waterman Details
• Start at the first row: T doesn’t match anything, and looking at BLOSUM62, the only positive score for a mismatch is +1 with S.
– We keep track of the 0 -> 1 diagonal
• Second row: H scores +1 with N, but nothing else.
– The diagonal starting with the 1 in the previous row is an H-A mismatch = -2, so 1 – 2 = -1, which is scored as 0.
• Third row: I gives positive scores with M, L, and V. But nothing builds on the previous row.
More S-W
• Fourth row: S has positive scores with N, A, and T.
– S-S = +4 match, added to 4 from the diagonal = 8
– S-A = 1. For a horizontal move (subject gap), 8 + 1 – 6 = 3.
– S-I is -2 mismatch, added to 2 from the diagonal = 0.
– S-G = 0 mismatch, added to 4 from the diagonal = 4
More S-W
Still More!
Traceback
• Then, start at the highest score in the matrix and trace back the path leading through the highest previous scores to 0. Go left and up only, preferring the diagonal path if a choice needs to be made.
– High score is 16, in the bottom row (but it could have been elsewhere).
• Write the alignment starting at the top.
– It doesn’t cover the entire sequence: it is a local alignment, not global.
– It isn’t perfect: the strong diagonal from LI and the 0 mismatch score from a G-N pairing overcame the gap penalty needed to put a gap where the G is.
– Nevertheless, given the BLOSUM62 matrix and the -6 linear gap penalty, this is an optimal alignment:
ISALIGNE
IS-LIN-E
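The fill and traceback steps can be put together in a short sketch. To stay self-contained it uses a simple match/mismatch score instead of BLOSUM62, so it does not reproduce the protein example above; the default scores and the function name are assumptions. As in these slides, a gap move adds the cell's match/mismatch score plus the gap penalty; setting `add_substitution_on_gap=False` gives the more common form in which a gap move adds only the gap penalty.

```python
def smith_waterman(query, subject, match=2, mismatch=-1, gap=-4,
                   add_substitution_on_gap=True):
    """Local alignment with a linear gap penalty.

    The query runs along the top (columns), the subject down the side (rows).
    Returns (best score, aligned query, aligned subject).
    """
    rows, cols = len(subject) + 1, len(query) + 1
    score = [[0] * cols for _ in range(rows)]  # edge row and column stay 0
    move = [[None] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if subject[i - 1] == query[j - 1] else mismatch
            on_gap = sub if add_substitution_on_gap else 0
            # Diagonal is listed first so that ties prefer the diagonal path.
            options = [
                (score[i - 1][j - 1] + sub, "diagonal"),
                (score[i - 1][j] + on_gap + gap, "vertical"),    # gap in query
                (score[i][j - 1] + on_gap + gap, "horizontal"),  # gap in subject
            ]
            value, step = max(options, key=lambda option: option[0])
            if value > 0:
                score[i][j], move[i][j] = value, step
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)

    # Traceback: start at the highest score and follow the moves back to a 0.
    i, j = best_cell
    top, side = [], []
    while score[i][j] > 0:
        step = move[i][j]
        if step == "diagonal":
            top.append(query[j - 1])
            side.append(subject[i - 1])
            i, j = i - 1, j - 1
        elif step == "vertical":
            top.append("-")
            side.append(subject[i - 1])
            i -= 1
        else:
            top.append(query[j - 1])
            side.append("-")
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(side))


best, aligned_query, aligned_subject = smith_waterman("GGACGGG", "TTACGTT")
print(best)
print(aligned_query)
print(aligned_subject)
```

Only the shared ACG core is reported: extending the alignment in either direction would add mismatches and lower the score, which is what makes the result local rather than global.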
Changing the Gap Penalty
• The top one has a -4 gap penalty and the bottom one has a -8 gap penalty (both linear). They give somewhat different alignments.
A Needleman-Wunsch Alignment
Speeding Things Up