Introduction to sequence alignment
WEEK 2
Mike Hallett (David Walsh)
BIOL510: Bioinformatics
Outline: pairwise alignment
• The importance of pairwise alignment
• The important steps in comparing two sequences
(Sections 4.1-4.5)
• Performing pairwise alignments using BLAST
(Section 4.6-4.7): WILL COVER IN LAB
CLASS ON TUESDAY
4.1 Principles of sequence alignment
Sequences (DNA and protein) vary as a result of
evolutionary processes acting at the molecular level:
1. Point mutations: nucleotides or amino acids
2. Insertions and deletions (length variation)
3. Fusion of two genes into a single gene
Evolution in gene sequences can effectively mask any
underlying sequence similarity
P. 73
Similarity and Homology
• Similarity is a quantitative measure of how related two
sequences are:
– Usually based on pairwise alignment of two sequences
– By aligning sequences we can count the number of residues that
line up and express the result as percent identity
– High degrees of sequence similarity may imply a common
evolutionary history or a possible commonality in biological
function
• Homology refers specifically to similarity in sequence or
structure due to descent from a common ancestor
– The concept of homology implies an evolutionary relationship
Definition: homology
Homology
Similarity attributed to descent from a common ancestor.
Morphological homology
Molecular homology
fly        GAKKVIISAP  SAD.APM..F
human      GAKRVIISAP  SAD.APM..F
plant      GAKKVIISAP  SAD.APM..F
bacterium  GAKKVVMTGP  SKDNTPM..F
yeast      GAKKVVITAP  SS.TAPM..F
archaeon   GADKVLISAP  PKGDEPVKQL
Definitions: identity, similarity, conservation
Identity
The extent to which two (nucleotide or amino acid)
sequences are invariant.
Similarity
The extent to which nucleotide or protein sequences
are related. It is based upon identity plus conservation.
Conservation
Changes at a specific position of an amino acid (or,
less commonly, DNA) sequence that preserve the
physico-chemical properties of the original residue.
Pairwise sequence alignment is the most
fundamental operation of bioinformatics
• It is fundamental to characterizing genome sequences
• To identify genes within a genome
• To identify related proteins, predict protein structure
and function
• To construct phylogenetic trees and compare
evolutionary relationships
P. 72
Definition: pairwise alignment
Pairwise alignment: The process of lining up two
sequences to achieve maximal levels of similarity
Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
           |||||||| |||| |||||| ||||| | |||||||||||||||||||||||||||||||
Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247
4.5 Types of alignment
Global alignments: an alignment that covers the full
length of a gene or protein sequence
→ for aligning closely related sequences that are
similar over their whole length
Local alignments: an alignment that only covers a
certain region (e.g. domain) of a gene or protein
sequence
→ for aligning proteins that are only partly
related (e.g. multidomain proteins)
→ for identifying conserved regions in very
divergent sequences
General approach to pairwise alignment
• Choose two sequences
• Select an alignment algorithm that generates a score
• Score reflects degree of similarity
• Allow gaps (insertions, deletions)
• Estimate probability that the alignment
occurred by chance
4.1 Principles of sequence alignment
When sequences are derived from a common ancestor, we
want to align bases/amino acids derived from the same
ancestral position
β-corticotropin (sheep)   ala gly glu asp asp glu
Corticotropin A (pig)     asp gly ala glu asp glu
(Peptide hormones)
Oxytocin      CYIQNCPLG
Vasopressin   CYFQNCPRG
(Neuromodulators)
T H I S S E Q E N C E
T H A T S E Q E N C E
Identical matches
Two amino acid point mutations
Mismatches
P. 73
4.1 Principles of sequence alignment
Often sequences we wish to align will differ in length,
obscuring the similarity that exists:
T H I S I S A S E Q E N C E
T H A T S E Q E N C E
Identical matches
Mismatches
How many amino acid point mutations?
→ 8 point mutations?
P. 73
4.1 Principles of sequence alignment
Insertion/deletion mutations result in gaps in an alignment
T H I S I S A - S E Q E N C E
T H - - - - A T S E Q E N C E
Identical matches
Mismatches
How many amino acid point mutations?
→ 0 point mutations?
→ but two indel mutations!
The best pairwise alignment is not obvious, hence we have
algorithms for testing different alignments quantitatively
P. 74
Matches do not have to be identical
Certain amino acids resemble each other in their physical and
chemical characteristics, and can substitute functionally for
each other
T H I S I S A S E Q E N C E
T H A T - - - S E Q E N C E
serine - threonine
isoleucine - alanine
Charged amino acids
Polar uncharged amino acids
Hydrophobic amino acids
Pairwise alignment: protein versus DNA sequences
• Synonymous mutations alter DNA but not amino acid
sequences
• Nonsynonymous mutations alter amino acid sequence
• Protein sequences offer a longer “look-back” time
• DNA sequences can be translated into protein,
and then used in pairwise alignments
Codons are degenerate: changes in the third position
often do not alter the amino acid that is specified
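A minimal Python sketch of codon degeneracy. The six-codon excerpt of the standard genetic code and the helper name classify_mutation are illustrative choices, not part of the lecture material.

```python
# Excerpt of the standard genetic code (only the codons used below).
CODON_TABLE = {
    "GGA": "Gly", "GGC": "Gly", "GGG": "Gly", "GGT": "Gly",
    "GAT": "Asp", "GAA": "Glu",
}


def classify_mutation(codon_before: str, codon_after: str) -> str:
    """Label a codon change as synonymous or nonsynonymous."""
    aa_before = CODON_TABLE[codon_before]
    aa_after = CODON_TABLE[codon_after]
    return "synonymous" if aa_before == aa_after else "nonsynonymous"


# Both changes are in the third codon position.
print(classify_mutation("GGA", "GGC"))  # Gly -> Gly
print(classify_mutation("GAT", "GAA"))  # Asp -> Glu
```

The second call shows why the slide says "often" rather than "always": some third-position changes do alter the amino acid.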
DNA alignments
• Many times, DNA alignments are appropriate (or necessary):
-- to identify promoters and regulatory elements
-- to identify gene sequences
-- to study noncoding regions of DNA
-- to study DNA polymorphisms (SNPs)
Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
           |||||||| |||| |||||| ||||| | |||||||||||||||||||||||||||||||
Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247
4.2 Scoring alignments
How do we objectively determine which is the best
possible alignment for a pair of sequences?
→ Generate all possible alignments
(not feasible: about 10^75 possibilities for an alignment of 100 positions!)
→ Calculate a score for each alignment
• Optimal alignment: the alignment with the best score
• Suboptimal alignments: alignments with slightly poorer scores
4.2 Scoring alignments: Percent identity
The simplest way to quantify similarity is to sum the number
of bases/amino acid matches and divide by length of the
alignment
T H I S I S A - S E Q E N C E
T H - - - - A T S E Q E N C E
(10 matches / 15 positions) × 100 ≈ 67% identity
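The calculation above can be sketched in Python as follows. The function name percent_identity is hypothetical, and this sketch assumes gap columns count toward the alignment length but never as matches.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent identity = identical columns / alignment length * 100."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # A column is a match only when both rows hold the same residue (not a gap).
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * matches / len(row_a)


top = "THISISA-SEQENCE"
bottom = "TH----ATSEQENCE"
print(round(percent_identity(top, bottom), 1))  # 66.7
```

Other definitions divide by the length of the shorter sequence or by the number of ungapped columns, so the denominator should always be stated when reporting percent identity.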
4.2 Scoring alignments: dot plots
Dot-plots are a simple way to
visualize pairwise sequence
similarity
Fig 4.1
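A text-only dot plot can be sketched in a few lines of Python. This is an illustrative sketch, not the method used to draw Fig 4.1; the function name dot_plot and the '*' and '.' symbols are assumptions.

```python
def dot_plot(seq_x: str, seq_y: str) -> list[str]:
    """Return one text row per residue of seq_y; '*' marks identical residues."""
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]


# Columns follow the first sequence, rows follow the second.
for row in dot_plot("THISISASEQENCE", "THATSEQENCE"):
    print(row)
```

Runs of '*' along a diagonal correspond to stretches where the two sequences match without gaps; a shift from one diagonal to another suggests an insertion or deletion.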
Matches do not have to be identical
Do all amino acid substitutions occur with the same
probability? NO!!!!
T H I S I S A S E Q E N C E
T H A T - - - S E Q E N C E
serine – threonine : highly conservative
isoleucine – alanine : poorly conservative
Substitution Matrix
A substitution matrix contains the likelihood that a particular
pair of amino acids will occupy the same position due to
descent from a common ancestor (i.e. homology)
→ 20 x 20 substitution matrix
The BLOSUM62 substitution matrix
+1 for Ser to Thr
+5 for Arg to Arg
-2 for Arg to Asp
Fig 4.4
Scoring a pairwise alignment using the
BLOSUM62 matrix
T H I S S E Q E N C E
T H A T S E Q E N C E
5 8 -1 1 4 5 5 5 6 9 5
The overall alignment score (S) = 52
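The column-by-column sum can be sketched in Python. The dictionary below holds only the BLOSUM62 entries needed for this example, copied from the column scores shown on the slide; the names BLOSUM62_EXCERPT, pair_score and alignment_score are hypothetical.

```python
# BLOSUM62 entries needed for this example (values from the slide above).
BLOSUM62_EXCERPT = {
    ("T", "T"): 5, ("H", "H"): 8, ("I", "A"): -1, ("S", "T"): 1,
    ("S", "S"): 4, ("E", "E"): 5, ("Q", "Q"): 5, ("N", "N"): 6,
    ("C", "C"): 9,
}


def pair_score(a: str, b: str) -> int:
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def alignment_score(row_a: str, row_b: str) -> int:
    """Sum the substitution scores of an ungapped alignment."""
    return sum(pair_score(a, b) for a, b in zip(row_a, row_b))


print(alignment_score("THISSEQENCE", "THATSEQENCE"))  # 52
```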
Generation of substitution scoring matrices
• Based on the observed amino acid substitution
frequencies in alignments of homologous protein
sequences
• Use real data to model the evolutionary processes
PAM substitution matrices are calculated from
global protein alignments
BLOSUM substitution matrices are calculated from
local protein alignments
PAM matrices:
Point-accepted mutations
Dayhoff (1960s) calculated substitution probabilities
from alignments of highly similar protein families
All the PAM data come from closely related proteins
(>85% amino acid identity).
PAM1 is the matrix calculated from comparisons
of sequences with no more than 1% divergence.
Other PAM matrices are extrapolated from PAM1.
PAM250 corresponds to 250 accepted point mutations per
100 amino acids. Because the same site can change more than
once, two proteins at this distance still share some residues.
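The extrapolation is done by repeated matrix multiplication of the PAM1 mutation probability matrix. The Python sketch below illustrates this on a hypothetical two-residue alphabet (TOY_PAM1 is invented for illustration and is not real PAM data); the real matrices are 20 x 20 and are converted to log-odds scores afterwards.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical alphabet of two residues: each residue is unchanged with
# probability 0.99 after one PAM unit (1% change), mirroring the PAM1 idea.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_pow(TOY_PAM1, 250)
print([round(p, 3) for p in toy_pam250[0]])
```

In this toy model the probability of still seeing the original residue after 250 steps stays just above 0.5, because repeated changes at one site can restore the original residue.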
A PAM250 scoring matrix assigns scores and is
forgiving of mismatches…
(such as +17 for W to W
or -5 for W to T)
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2  -2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
…compared to scoring matrices such as PAM10,
which are strict and do not tolerate mismatches
(such as +13 for W to W
or -19 for W to T)
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   7
R -10   9
N  -7  -9   9
D  -6 -17  -1   8
C -10 -11 -17 -21  10
Q  -7  -4  -7  -6 -20   9
E  -5 -15  -5   0 -20  -1   8
G  -4 -13  -6  -6 -13 -10  -7   7
H -11  -4  -2  -7 -10  -2  -9 -13  10
I  -8  -8  -8 -11  -9 -11  -8 -17 -13   9
L  -9 -12 -10 -19 -21  -8 -13 -14  -9  -4   7
K -10  -2  -4  -8 -20  -6  -7 -10 -10  -9 -11   7
M  -8  -7 -15 -17 -20  -7 -10 -12 -17  -3  -2  -4  12
F -12 -12 -12 -21 -19 -19 -20 -12  -9  -5  -5 -20  -7   9
P  -4  -7  -9 -12 -11  -6  -9 -10  -7 -12 -10 -10 -11 -13   8
S  -3  -6  -2  -7  -6  -8  -7  -4  -9 -10 -12  -7  -8  -9  -4   7
T  -3 -10  -5  -8 -11  -9  -9 -10 -11  -5 -10  -6  -7 -12  -7  -2   8
W -20  -5 -11 -21 -22 -19 -23 -21 -10 -20  -9 -18 -19  -7 -20  -8 -19  13
Y -11 -14  -7 -17  -7 -18 -11 -20  -6  -9 -10 -12 -17  -1 -20 -10  -9  -8  10
V  -5 -11 -12 -11  -9 -10 -10  -9  -9  -1  -5 -13  -4 -12  -9 -10  -6 -22 -10   8
BLOSUM Matrices
BLOSUM matrices are based on local alignments.
The BLOCKS database contains thousands of groups of
multiple sequence alignments.
BLOSUM stands for blocks substitution matrix.
All BLOSUM matrices are based on observed alignments;
they are not extrapolated from comparisons of
closely related proteins.
BLOSUM Matrices
BLOSUM62 is a matrix calculated from comparisons of
sequences with no more than 62% identity.
BLOSUM62 is the default matrix in BLAST 2.0.
Though it is tailored for comparisons of moderately distant
proteins, it performs well in detecting closer relationships.
A search for distant relatives may be more sensitive
with a different matrix.
Selecting an appropriate scoring matrix
More conserved: rat versus mouse globin
Less conserved: rat versus bacterial globin
4.4 Inserting Gaps
Homologous sequences are often different in length as
a result of insertions and deletions (indels)
The alignment of indels involves inserting gaps into the
alignment
Gap penalty: each time a gap is introduced, a gap penalty is
subtracted from the score
A gap opening penalty is usually high
A gap extension penalty is usually low
Scoring a pairwise alignment using the
BLOSUM62 matrix and gap penalty
T  H  I  S  I  S  A  S  E  Q  E  N  C  E
T  H  A  T  -  -  -  S  E  Q  E  N  C  E
5  8 -1  1           4  5  5  5  6  9  5
Gap opening penalty = -11
Gap extension penalty = -1
(one gap of three positions: one opening, two extensions)
The overall score (S) = 52 + (-11 - 2) = 39
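The gap arithmetic can be sketched in Python. This sketch assumes the convention used on the slide: the opening penalty covers the first gap position and each further position costs one extension penalty, so a three-position gap costs -11 - 2 = -13 and the total is 52 - 13 = 39. The function name affine_gap_penalty is hypothetical.

```python
def affine_gap_penalty(gap_length: int, gap_open: int = -11, gap_extend: int = -1) -> int:
    """Penalty for one gap: opening for the first position, extension for each further one."""
    if gap_length <= 0:
        return 0
    return gap_open + (gap_length - 1) * gap_extend


substitution_total = 52  # sum of the BLOSUM62 column scores on the slide
print(substitution_total + affine_gap_penalty(3))  # 39
```

Some programs instead charge the extension penalty for every gap position, including the first, so it is worth checking which convention a given tool uses before comparing scores.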
4.4 Inserting Gaps
Alignment with a high gap penalty
Alignment with a low gap penalty
Page 86
Next in the course... Ch 5.
We have learned how to score an alignment, but
how do you generate the alignment in the first
place?
Here are two approaches:
• Dynamic Programming Algorithms
• Heuristic Search Algorithms
Sequence alignments continued
Rasko et al. Nucleic Acids Res. 2004; 32(3): 977–988
David Walsh
BIOL510: Bioinformatics
Outline: sequence alignments (Ch 5)
• Dynamic Programming Algorithms (Ch 5.2)
– Global alignment: Needleman-Wunsch
– Local alignment: Smith-Waterman
• Heuristic Search Algorithms (Ch 5.3)
– BLAST
• Alignment Score Significance (Ch 5.4)
WE WILL COVER THIS ON TUESDAY DURING THE
LAB
Scoring an alignment using the
BLOSUM62 substitution matrix and gap
penalty
T  H  I  S  I  S  A  S  E  Q  E  N  C  E
T  H  A  T  -  -  -  S  E  Q  E  N  C  E
5  8 -1  1           4  5  5  5  6  9  5
Gap opening penalty = -11
Gap extension penalty = -1
(one gap of three positions: one opening, two extensions)
The overall score (S) = 52 + (-11 - 2) = 39
Dynamic Programming Algorithms
For any given pair of sequences, if gaps are allowed
there is a large number of possible alignments.
Dynamic programming algorithms: can explore the full
range of alignments using a variety of different
constraints, by dividing the problem of alignment into
many smaller parts
Needleman and Wunsch published the original algorithm
in 1970, and there have been many modifications
and improvements since.
Global alignment versus local alignment
Global alignment (Needleman-Wunsch) extends
from one end of each sequence to the other.
Local alignment finds optimally matching
regions within two sequences (“subsequences”).
Local alignment is almost always used for database
searches such as BLAST. It is useful to find domains
(or limited regions of homology) within sequences.
Smith and Waterman (1981) solved the problem of
performing optimal local sequence alignment. Other
methods (BLAST, FASTA) are faster but less thorough.
Needleman-Wunsch: dynamic programming
N-W is guaranteed to find optimal alignments, although
the algorithm does not search all possible alignments.
It is an example of a dynamic programming algorithm:
an optimal path (alignment) is identified by
incrementally extending optimal subpaths.
Thus, a series of decisions is made at each step of the
alignment to find the pair of residues with the best score.
4.2 Scoring alignments: dot plots
Dot-plots are a simple way to
visualize pairwise sequence
similarity
But, they are the beginning of
generating optimal alignments as
well.
Fig 4.1
Three steps to global alignment
with the Needleman-Wunsch algorithm
[1] set up a matrix of two sequences
[2] score the matrix
[3] identify the optimal alignment(s)
Global alignment with the algorithm
of Needleman and Wunsch (1970)
• Two sequences can be compared in a matrix
along x- and y-axes.
• If they are identical, a path along a diagonal
can be drawn
• Find the optimal subpaths, and add them up to achieve
the best score. This involves
--adding gaps when needed
--allowing for conservative substitutions
--choosing a scoring system (simple or complicated)
• N-W is guaranteed to find optimal alignment(s)
Four possible outcomes in aligning two sequences
[1] identity (stay along a diagonal)
[2] mismatch (stay along a diagonal)
[3] gap in one sequence (move vertically!)
[4] gap in the other sequence (move horizontally!)
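The three steps and the diagonal, vertical and horizontal moves can be sketched in Python. This is a minimal sketch under simplifying assumptions, not the textbook's worked example: it uses a simple scoring system (match +1, mismatch -1) and a linear gap penalty of -2 per position instead of BLOSUM62, and the function name needleman_wunsch is hypothetical.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty; returns (score, row_a, row_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # [1] Set up the matrix; first row and column hold cumulative gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # [2] Score the matrix: each cell extends the best of three optimal subpaths.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # diagonal: identity or mismatch
                              score[i - 1][j] + gap,      # vertical: gap in seq_b
                              score[i][j - 1] + gap)      # horizontal: gap in seq_a
    # [3] Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(seq_a[i - 1])
            row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(seq_a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("THISISASEQENCE", "THATSEQENCE")
print(best)
print(top)
print(bottom)
```

When several paths tie, this sketch reports only one of them (it prefers the diagonal move), whereas the slide notes that there may be more than one optimal alignment.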
The initial stage of dynamic programming
Gap extension penalty (E) = -8
BLOSUM62 substitution matrix
Figure 5.8
The initial stage of dynamic programming: filling in the matrix
Gap extension penalty (E) = -8
BLOSUM62 substitution matrix
Thr – Ile: score = -1
Figures 5.8 and 5.10
The initial stage of dynamic programming: filling in the matrix
Score = -4
Gap extension penalty (E) = -8
BLOSUM62 substitution matrix
The final stage of dynamic programming: traceback
Figure 5.9
The initial stage of dynamic programming: filling in the matrix
Score = 7
Gap extension penalty (E) = -4
BLOSUM62 substitution matrix
The final stage of dynamic programming: traceback
Figure 5.11
Local alignment: the Smith-Waterman (SW)
algorithm
Remember: two protein sequences may not exhibit
homology along their full length
SW is a modification of the Needleman-Wunsch algorithm
Instead of looking at each sequence in its entirety, the
method compares segments of all possible lengths and
chooses the segments that optimize the similarity measure
Page 88, Page 136
Local alignment algorithm: subsequence alignment scores
less than zero (<0) are rejected and replaced with zero
Score = 12
Gap extension penalty (E) = -8 (!!!!!)
BLOSUM62 substitution matrix
Figure 5.15
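The rejection of negative scores can be sketched in Python. As a simplifying assumption, this sketch uses match +2, mismatch -1 and a linear gap penalty of -2 rather than BLOSUM62 with E = -8; the function name smith_waterman and the test sequences are hypothetical.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (best_score, segment_a, segment_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # Scores below zero are rejected: the cell is reset to 0.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero cell is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(seq_a[i - 1])
            row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(seq_a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(smith_waterman("XXXSEQENCEYYY", "ZZSEQENCEWW"))  # (14, 'SEQENCE', 'SEQENCE')
```

The two differences from the global algorithm are the floor of zero in each cell and the traceback, which starts at the highest-scoring cell instead of the bottom-right corner.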
Outline: sequence alignments (Ch 5)
• Dynamic Programming Algorithms (Ch 5.2)
– Global alignment: Needleman-Wunsch
– Local alignment: Smith-Waterman
• Heuristic Search Algorithms (Ch 5.3)
– BLAST
• Alignment Score Significance (Ch 5.4)
Will cover this topic during the lab