Pairwise alignment
Sequence similarity
Sequence Comparison
Much of bioinformatics involves sequences
• DNA sequences
• RNA sequences
• Protein sequences
We can think of these sequences as strings
of letters
• DNA & RNA: alphabet of 4 letters
• Protein: alphabet of 20 letters
Sequence Comparison - Motivation
• Nucleotide
– Learn about evolutionary relationships
– Finding genes, domains, signals …
• Protein
– Learn about evolutionary relationships
– Classify protein families (function, structure)
– Identify common domains (function, structure)
Calculation of an alignment score
How do we align two sequences?
ATTGCAGTGATCG
ATTGCGTCGATCG
Solution 1
ATTGCAGTGATCG
|||||   |||||
ATTGCGTCGATCG
10 matches (|), 3 mismatches
Solution 2
ATTGCAGT-GATCG
||||| || |||||
ATTGC-GTCGATCG
12 matches (|), 2 gaps (-)
Which alignment is better?
We will use a scoring scheme
              Scheme 1   Scheme 2
Match            +1         +1
Mismatch         -1          0
Indel (gap)      -2         -2
Solution 1
ATTGCAGTGATCG
|||||   |||||
ATTGCGTCGATCG
10 matches, 3 mismatches
Scheme 1: 10×1 + 3×(-1) = 7
Scheme 2: 10×1 + 3×0 = 10
Solution 2
ATTGCAGT-GATCG
||||| || |||||
ATTGC-GTCGATCG
12 matches, 2 gaps
Scheme 1: 12×1 + 2×(-2) = 8
Scheme 2: 12×1 + 2×(-2) = 8
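The arithmetic above can be reproduced with a short sketch. The function name and argument names are illustrative assumptions, not part of the slides; the score values are the two schemes shown above (mismatch -1 versus mismatch 0).

```python
def score_alignment(top, bottom, match=1, mismatch=-1, indel=-2):
    """Score two already-aligned strings column by column ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += indel       # a letter paired with a gap
        elif x == y:
            total += match       # identical letters
        else:
            total += mismatch    # different letters
    return total


solution_1 = ("ATTGCAGTGATCG", "ATTGCGTCGATCG")
solution_2 = ("ATTGCAGT-GATCG", "ATTGC-GTCGATCG")

# Mismatch = -1: the gapped alignment scores higher.
print(score_alignment(*solution_1), score_alignment(*solution_2))
# Mismatch = 0: the ungapped alignment scores higher.
print(score_alignment(*solution_1, mismatch=0), score_alignment(*solution_2, mismatch=0))
```

Which alignment counts as "better" therefore depends on the scoring scheme, not only on the sequences.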
Scoring Alignments - intuition
• Similar sequences evolved from a common
ancestor
• Evolution changed the sequences from this
ancestral sequence by mutations:
– Replacements: one letter replaced by another
– Deletion: deletion of a letter
– Insertion: insertion of a letter
• Scoring of sequence similarity should
examine how many operations took place
Causes for sequence (dis)similarity
mutation:
a nucleotide at a certain location is replaced by
another nucleotide (e.g.: ATA → AGA)
insertion:
at a certain location one new nucleotide is
inserted between two existing nucleotides
(e.g.: AA → AGA)
deletion:
at a certain location one existing nucleotide
is deleted (e.g.: ACTG → AC-G)
indel:
an insertion or a deletion
Gaps
• Positions at which a letter is paired with a null
are called gaps.
• Gap scores are typically negative.
• Since a single mutational event may cause the insertion
or deletion of more than one residue, the presence of
a gap is ascribed more significance than the length
of the gap.
Gap Opening
• The gap-opening penalty defines the cost for
opening a gap in one of the sequences.
• If you raise the gap-opening penalty above
default, local alignments that contain gaps
may be split into several shorter alignments.
Affine Gap Penalties
• In nature, a series of indels often come as
a single event rather than a series of
single nucleotide events:
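[Figure: two example alignments; the one with a single contiguous gap is labeled "This is more likely", the one with several separate gaps "This is less likely".]
Normal scoring would give the same score for both alignments.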
Gap = Gapopen + Len * Gapextend
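A small sketch of the affine formula above, compared with a flat per-position gap score. The penalty values (open 5, extend 1, flat 2) are hypothetical and chosen only for illustration; real tools use different defaults and sometimes a slightly different convention for the first gap position.

```python
def affine_gap_penalty(length, gap_open=5, gap_extend=1):
    """Penalty for ONE gap of `length` positions: Gapopen + Len * Gapextend."""
    if length <= 0:
        return 0
    return gap_open + length * gap_extend


def linear_gap_penalty(length, per_position=2):
    """Flat scoring: every gapped position costs the same."""
    return per_position * length


# Three gapped positions as one event versus three separate events.
one_long_gap = affine_gap_penalty(3)          # 5 + 3*1
three_short_gaps = 3 * affine_gap_penalty(1)  # 3 * (5 + 1*1)
print(one_long_gap, three_short_gaps)

# Flat scoring cannot tell the two cases apart.
print(linear_gap_penalty(3), 3 * linear_gap_penalty(1))
```

With the affine penalty the single long gap is cheaper than three scattered gaps, which matches the intuition that one mutational event is more likely than three.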
Gap penalties lead to:
• Increasing the penalties for gap opening and extension
– The alignment will contain fewer gaps and more mismatches
• Decreasing the penalties for gap opening and extension
– The alignment will contain more gaps (of varied lengths) and fewer mismatches
• Holding the gap-opening penalty constant and increasing the gap-extension penalty
– Very long gaps will not be tolerated; they will be replaced with additional gaps of medium length and with mismatches.
Global alignment
A global alignment between two sequences is an alignment in which all the
characters in both sequences participate in the alignment.
Global alignment is most useful for closely related sequences. Because such sequences
are also easily identified by local alignment methods, global alignment is now used less often.
[Figure: schematic comparing global and local alignment of two sequences]
Local alignment
Local alignment methods find related regions within sequences - they can
consist of a subset of the characters within each sequence.
For example, positions 20-40 of sequence A might be aligned with positions
50-70 of sequence B.
This is a more flexible technique than global alignment and has the advantage
that related regions which appear in a different order in the two proteins can be
identified as being related.
Global vs. Local:
• Use global alignment if
– You expect, based on some biological information, that your
sequences will match over the entire length.
– Your sequences are of similar length.
• Use local alignment if
– You expect that only certain parts of two sequences will match (as
in the case of conserved segment that can be found in many
different proteins).
– Your sequences are very different in length.
– You want to search a sequence database (we will talk about it in
details later).
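A minimal sketch of how global and local alignment scores can be computed by dynamic programming. The recurrences are the standard Needleman-Wunsch (global) and Smith-Waterman (local) ones, which the slides do not name; the function name, the default scores, and the flat per-position gap penalty are assumptions made for illustration.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b with a flat gap penalty.

    local=False: every character of both sequences must take part (global).
    local=True:  the best-scoring pair of subsequences is used (local).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are charged in the first row and column.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                cell = max(cell, 0)          # local: may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    # Global score sits in the last cell; local score is the table maximum.
    return best if local else score[-1][-1]


print(alignment_score("ATTGCAGTGATCG", "ATTGCGTCGATCG"))
print(alignment_score("AAAGGGTTT", "CCCGGGCCC"))
print(alignment_score("AAAGGGTTT", "CCCGGGCCC", local=True))
```

In the second pair only the central GGG is shared: the global score is dragged down by the flanking mismatches, while the local score reports just the matching region.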
If two proteins share more than one common
region, for example one has a single copy of a
particular domain while the other has two
copies, it may be possible to "miss" one of the
two copies if using local alignment, which
presents only the best scoring alignment.
EMBOSS [best solution] vs. LALIGN (EMBnet) [several solutions]
Comparing nucleotides
• Every match got the same score
• Every mismatch got the same score
• Gap penalties: we chose them ourselves, but the defaults are usually good.
• However, in the case of amino acids (aa):
• Not all matches are the same
• Different mismatches get different scores
Amino acid properties
Serine (S) and Threonine (T) have
similar physicochemical properties
Aspartic acid (D) and Glutamic
acid (E) have similar properties
=> Substitution of S/T or E/D occurs relatively often during evolution
=> Substitution of S/T or E/D should result in scores that are only moderately lower than identities
Each aa is characterized by a combination of features (size, charge, etc.).
The relative importance of each feature may vary according to the aa's role in the 3-D structure and function of the protein.
So how can we score matches and mismatches?
Amino Acids Substitution Matrices
The PAM and BLOSUM substitution matrices describe
the likelihood that two residue types would mutate to each
other.
These matrices are based on biological sequence
information: the substitutions observed in structural
(BLOSUM) or evolutionary (PAM) alignments of well
studied protein families
These scoring systems have a probabilistic foundation.
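One common way to express this probabilistic foundation is a log-odds score: how often a pair of residues is observed aligned, relative to how often it would be expected by chance. The sketch below is a simplified illustration; the frequencies are hypothetical, and the scale factor of 2 (half-bit units, as commonly reported for BLOSUM62) is stated here as an assumption.

```python
import math


def log_odds_score(observed_pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded, scaled log2 of observed / expected pair frequency."""
    expected = freq_a * freq_b  # chance of pairing a with b if independent
    return round(scale * math.log2(observed_pair_freq / expected))


# Hypothetical background frequency of 0.1 for both residues.
print(log_odds_score(0.02, 0.1, 0.1))   # seen twice as often as chance: positive
print(log_odds_score(0.005, 0.1, 0.1))  # seen half as often as chance: negative
```

Pairs that substitute for each other more often than chance get positive scores, and rarely observed substitutions get negative scores.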
PAM series - Percent Accepted Mutation
(Accepted by natural selection)
• All the PAM data come from alignments of closely
related proteins (>85% amino acid identity) from 71
protein families (total of 1572 protein sequences).
Some of the protein families are:
Ig kappa chain
Kappa casein
Lactalbumin
Hemoglobin α
Myoglobin
Insulin
Histone H4
Ubiquitin
• PAM matrices are based on global sequence alignments
- these include both highly conserved and highly mutable
regions.
Various degrees of conservation
PAM1 is the matrix calculated from comparisons
of sequences with no more than 1% divergence. At an
evolutionary interval of PAM1, one change has occurred, on average,
over a length of 100 amino acids.
Other PAM matrices are extrapolated from PAM1. For
PAM250, 250 changes have occurred, on average, for two proteins over a
length of 100 amino acids (a single position may change more than once).
All the PAM data come from closely related proteins
(>85% amino acid identity).
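The extrapolation from PAM1 can be sketched as repeated matrix multiplication of a mutation probability matrix with itself. The 3-letter alphabet and all probabilities below are made up for illustration; the real PAM1 matrix is 20×20 and its values are not reproduced here.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Multiply m by itself n times (n >= 0), starting from the identity."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical "PAM1-like" matrix: each letter stays unchanged with
# probability 0.99 over one unit of evolutionary distance.
PAM1_TOY = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]

pam250_toy = mat_power(PAM1_TOY, 250)
for row in pam250_toy:
    print([round(p, 3) for p in row])
```

Even after 250 steps the diagonal entries stay well above zero: some positions change several times, others not at all, so two sequences at PAM250 still share some identical residues.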
THE BLOSUM Family of Matrices
Blocks Substitution Matrices (Henikoff and Henikoff, 1992)
• Blocks are short conserved patterns, 3-60 aa long.
• Proteins can be divided into families by common
blocks.
[Figure: a protein family with conserved blocks A, B, C, D]
• Different BLOSUM matrices emerge by looking
at sequences with different identity percentage.
Example: BLOSUM62 is derived from blocks in which sequences sharing
62% identity or more are clustered and counted as a single sequence.
The Blocks Database
Gapless alignment blocks
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -1  1  1 -2 -1 -3 -2  5
M -1 -2 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
BLOSUM62 scoring matrix
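A short sketch of how such a matrix is used to score an ungapped protein alignment. Only a handful of entries are copied from the matrix above; the dictionary layout and function names are assumptions made for illustration.

```python
# A few entries taken from the matrix above; the matrix is symmetric,
# so each pair is stored once.
BLOSUM62_SUBSET = {
    ("S", "S"): 4, ("T", "T"): 5, ("D", "D"): 6, ("E", "E"): 5, ("W", "W"): 11,
    ("S", "T"): 1,   # similar residues: small positive score
    ("D", "E"): 2,   # similar residues: small positive score
    ("D", "W"): -4,  # dissimilar residues: negative score
    ("S", "W"): -3,
}


def pair_score(a, b):
    """Look up a residue pair in either order."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]  # KeyError if the pair is not in the subset


def score_ungapped(seq1, seq2):
    """Sum the pair scores over two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(score_ungapped("SDW", "SDW"))  # identities only
print(score_ungapped("SDW", "TEW"))  # S/T and D/E substitutions
print(score_ungapped("SDW", "WWW"))  # dissimilar substitutions
```

Unlike the nucleotide schemes, identities do not all score the same (W/W is worth far more than S/S), and conservative substitutions such as S/T or D/E are only moderately penalized relative to identities.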
PAM versus BLOSUM
PAM
• Based on an explicit evolutionary model
• Derived from small, closely related proteins with ~15% divergence
• Higher PAM numbers to detect more remote sequence similarities
• Errors in PAM 1 are scaled 250X in PAM 250
BLOSUM
• Based on empirical frequencies
• Uses much larger, more diverse set of protein sequences (30-90% ID)
• Lower BLOSUM numbers to detect more remote sequence similarities
• Errors in BLOSUM arise from errors in alignment
Guidelines
• Lower PAMs and higher BLOSUMs find short
local alignments of highly similar sequences
• Higher PAMs and lower BLOSUMs find longer,
weaker local alignments
• No single matrix answers all questions
• BLOSUM is generally better than PAM for local
alignments.
• The default matrix is often an identity matrix for DNA
and BLOSUM62 for proteins.
• When using BLOSUM80 instead of BLOSUM45,
local alignments tend to be shorter.
• Low PAMs have the same effect as high BLOSUMs.
The BLOSUM number indicates percent identity, while the PAM number is
proportional to the percent of accepted mutations.