Sequence Alignment
Arun Goja
MITCON BIOPHARMA
Why do we want to compare
sequences?
• Evolutionary relationships
– Phylogenetic trees can be constructed based on
comparison of the sequences of a molecule (example: 16S
rRNA) taken from different species
– Residues conserved during evolution play an important
role
• Prediction of protein structure and function
– Proteins which are very similar in sequence generally have
similar 3D structure and function as well
– By searching a sequence of unknown structure against a
database of known proteins the structure and/or function
can in many cases be predicted
Why sequence alignment?
Sequence alignment is important for:
* prediction of function
* database searching
* gene finding
* sequence divergence
* sequence assembly
Over time, genes accumulate
mutations
• Environmental factors
– Radiation
– Oxidation
• Mistakes in replication or repair
• Deletions, duplications
• Insertions
• Inversions
• Point mutations
Deletions
• Codon deletion:
ACG ATA GCG TAT GTA TAG CCG…
– Effect depends on the protein, position, etc.
– Almost always deleterious
– Sometimes lethal
• Frame shift mutation:
ACG ATA GCG TAT GTA TAG CCG…
ACG ATA GCG ATG TAT AGC CG?…
– Almost always lethal
Indels
• Comparing two genes it is generally
impossible to tell if an indel is an insertion in
one gene, or a deletion in another, unless
ancestry is known:
ACGTCTGATACGCCGTATCGTCTATCT
ACGTCTGAT---CCGTATCGTCTATCT
Comparing two sequences
• Point mutations, easy:
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT
• Indels are difficult, must align sequences:
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT
ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
Causes for sequence (dis)similarity
mutation:
a nucleotide at a certain location is replaced by
another nucleotide (e.g.: ATA → AGA)
insertion:
at a certain location one new nucleotide is
inserted between two existing nucleotides
(e.g.: AA → AGA)
deletion:
at a certain location one existing nucleotide
is deleted (e.g.: ACTG → AC-G)
indel:
an insertion or a deletion
Definition
• Homology: related by descent
• Homologous sequence positions
[Figure: the sequence ATTGCGC and a related sequence ATCCGC, aligned as ATTGCGC / AT-CCGC so that homologous positions share a column]
Orthologous and paralogous
• Orthologous sequences differ because they
are found in different species (a speciation
event)
• Paralogous sequences differ due to a gene
duplication event
• Sequences may be both orthologous and
paralogous
Sequence alignment - meaning
Sequence alignment is used to study how sequences, such as protein or DNA sequences, have evolved from a
common ancestor.
Mismatches in the alignment correspond to mutations,
and gaps correspond to insertions or deletions.
Sequence alignment also refers to the process of constructing significant
alignments in a database of potentially unrelated sequences.
Sequence alignment - definition
Sequence alignment is an arrangement of two or more sequences, highlighting
their similarity.
The sequences are padded with gaps (dashes) so that wherever possible, columns
contain identical characters from the sequences involved
tcctctgcctctgccatcat---caaccccaaagt
|||| ||| ||||| |||||   ||||||||||||
tcctgtgcatctgcaatcatgggcaaccccaaagt
Pairwise alignment: the problem
The number of possible pairwise alignments increases explosively with
the length of the sequences:
Two protein sequences of length 100 amino acids can be aligned in
approximately 10^60 different ways.
The time needed to test all possibilities is of the same order of magnitude as the
entire lifetime of the universe.
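As a rough check of that figure, here is a minimal Python sketch. It rests on an assumption, since the slide does not say how alignments were counted: the number of global alignments of two length-n sequences is approximated by the binomial coefficient C(2n, n). The function name is illustrative only.

```python
from math import comb


def approx_alignment_count(n: int) -> int:
    """Binomial coefficient C(2n, n), used here as an assumed estimate of the
    number of global alignments of two sequences of length n."""
    return comb(2 * n, n)


count = approx_alignment_count(100)
# Report only the order of magnitude (number of digits minus one).
print(f"roughly 10^{len(str(count)) - 1} alignments")
```

Under this counting assumption the result is of roughly the same magnitude as the figure quoted above, which is why exhaustive enumeration is not an option.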
Pairwise Alignment
• The alignment of two sequences (DNA or
protein) is a relatively straightforward
computational problem.
– There are lots of possible alignments.
• Two sequences can always be aligned.
• Sequence alignments have to be scored.
• Often there is more than one solution with the
same score.
Methods of Alignment
• By hand - slide sequences on two lines of a
word processor
• Dot plot
– with windows
• Rigorous mathematical approach
– Dynamic programming (slow, optimal)
• Heuristic methods (fast, approximate)
– BLAST and FASTA
• Word matching and hash tables
Align by Hand
GATCGCCTA_TTACGTCCTGGAC <---> AGGCATACGTA_GCCCTTTCGC
You still need some kind of scoring system to
find the best alignment
Percent Sequence Identity
• The extent to which two nucleotide or amino
acid sequences are invariant
ACCTGAG-AG
ACGTG-GCAG
Column 3 is a mismatch and columns 6 and 8 are indels; 7 of 10 columns match:
70% identical
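The calculation can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and it assumes that gap columns count toward the alignment length, which is the convention that gives 70% for the example.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of alignment columns that hold identical residues.

    Assumes a and b are two rows of the same alignment (equal length,
    '-' marking gaps) and that gap columns count toward the total length.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # A column is a match only if both residues are equal and not a gap.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


print(percent_identity("ACCTGAG-AG", "ACGTG-GCAG"))
```

Other conventions exist (for example, dividing by the length of the shorter sequence), so reported identities are comparable only when the denominator is stated.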
Dotplot:
A dotplot gives an overview of all possible alignments
[Figure: dotplot of Sequence 1 (TACATTACGTAC) against Sequence 2 (ATACACTTA)]
Dotplot:
In a dotplot each diagonal corresponds to a possible (ungapped) alignment
[Figure: dotplot of Sequence 1 (TACATTACGTAC) against Sequence 2 (ATACACTTA), with one diagonal highlighted]
One possible alignment:
T A C A T T A C G T A C
A T A C A C T T A
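A dotplot like the one described above can be produced with a few lines of Python. This is a minimal sketch without windowing or filtering; the function name and the choice of "*" and "." as plot symbols are illustrative.

```python
def dotplot(seq1: str, seq2: str) -> list[str]:
    """Return a dot matrix as text rows.

    One row per residue of seq2, one column per residue of seq1;
    '*' marks identical residues and '.' marks everything else.
    """
    return ["".join("*" if a == b else "." for a in seq1) for b in seq2]


seq1 = "TACATTACGTAC"
seq2 = "ATACACTTA"
print("  " + seq1)
for residue, row in zip(seq2, dotplot(seq1, seq2)):
    print(residue, row)
```

Runs of "*" along a diagonal correspond to ungapped stretches where the two sequences agree.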
Insertions / Deletions in a Dotplot
[Figure: dotplot of Sequence 1 (TACTGTTCAT) against Sequence 2 (TACTGTCAT)]
T A C T G - T C A T
| | | | |   | | | |
T A C T G T T C A T
Alignment methods
• Rigorous algorithms = Dynamic Programming
– Needleman-Wunsch (global)
– Smith-Waterman (local)
• Heuristic algorithms
(faster but approximate)
• BLAST
• FASTA
Pairwise alignment
Pairwise sequence alignment methods are concerned with finding the best-matching
piecewise local or global alignments of protein (amino acid) or DNA (nucleic acid)
sequences.
Typically, the purpose of this is to find homologues (relatives) of a gene or gene-product
in a database of known examples.
This information is useful for answering a variety of biological questions:
1. The identification of sequences of unknown structure or function.
2. The study of molecular evolution.
Dynamic Programming Approach to
Sequence Alignment
The dynamic programming approach to sequence alignment builds each new result from the
best prior result found so far.
It aligns two sequences by inserting gaps at different locations so as to
maximize the score of the alignment.
The score is determined by a "match award", a "mismatch penalty" and a "gap
penalty". The higher the score, the better the alignment.
If both penalties are set to 0, the method finds an alignment with the maximum number of
matches.
Maximum match = the largest number of matches one sequence can have with another when
all possible deletions are allowed.
It is used to compare two DNA or protein sequences and, from their similarity, to predict
similarity of their functions.
Examples: Needleman-Wunsch (1970), Sellers (1974), Smith-Waterman (1981)
Global alignment
A global alignment between two sequences is an alignment in
which all the characters in both sequences participate in the
alignment.
Global alignments are useful mostly for finding closely-related
sequences.
Global Alignment
• Global algorithms are often not effective for highly diverged
sequences and do not reflect the biological reality that two
sequences may only share limited regions of conserved
sequence.
• Sometimes two sequences may be derived from ancient
recombination events where only a single functional
domain is shared.
• Global alignment is useful when you want to force two
sequences to align over their entire length
Global Alignment
Find the global best fit between two sequences
Example: the sequences s = VIVALASVEGAS and
t = VIVADAVIS align like:
A(s,t) =
V I V A L A S V E G A S
| | | |   |   |       |
V I V A D A - V - - I S
(the positions marked "-" are indels)
The Needleman-Wunsch algorithm
The Needleman-Wunsch algorithm performs a global alignment
on two sequences (s and t) and is applied to align protein or
nucleotide sequences.
The Needleman-Wunsch algorithm is an example of dynamic
programming, a discipline invented by Richard Bellman (an
American mathematician) in 1953, and is guaranteed to find the
alignment with the maximum score.
Local alignment
Local alignment methods find related regions within sequences - they can consist of a
subset of the characters within each sequence.
For example, positions 20-40 of sequence A might be aligned with positions
50-70 of sequence B.
This is a more flexible technique than global alignment and has the advantage that
related regions which appear in a different order in the two proteins (which is known as
domain shuffling) can be identified as being related.
This is not possible with global alignment methods.
The Smith-Waterman algorithm
The Smith-Waterman algorithm (1981) determines similar regions between
two nucleotide or protein sequences.
Smith-Waterman is also a dynamic programming algorithm and is a variation of
Needleman-Wunsch. As such, it has the desirable property that it is guaranteed to
find the optimal local alignment with respect to the scoring system being used
(which includes the substitution matrix and the gap-scoring scheme).
However, the Smith-Waterman algorithm is demanding of time and memory
resources: in order to align two sequences of lengths m and n, O(mn) time and
space are required.
Global vs. Local Alignments
• Global alignment algorithms start at the
beginning of two sequences and add gaps to
each until the end of one is reached.
• Local alignment algorithms find the region
(or regions) of highest similarity between two
sequences and build the alignment outward
from there.
Statistical analysis of alignments
This works identically to gene finding:
* Generate randomized sequences based on the
second string
* Determine the optimal alignments of the first
sequence with these randomized sequences
* Compute a histogram and rank the observed
score in this histogram
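The three steps above can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the author's procedure: randomization is done by shuffling the second string (which preserves its composition), the alignment score is a global score with match = 1, mismatch = -1 and gap = -2, and the names `global_score` and `shuffle_rank` are hypothetical.

```python
import random


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score, keeping only one previous row in memory."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_rank(first, second, n_shuffles=200, seed=0):
    """Return (observed score, fraction of shuffled scores >= observed)."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    observed = global_score(first, second)
    letters = list(second)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)           # randomized version of the second string
        if global_score(first, "".join(letters)) >= observed:
            hits += 1
    return observed, hits / n_shuffles


observed, fraction = shuffle_rank("ACGTACGTTGCAAGCTTAGC", "ACGTACGTTGCAAGCTTAGC")
print(observed, fraction)
```

A small fraction means the observed score sits in the upper tail of the histogram of random scores, which suggests the alignment is unlikely to arise by chance from composition alone.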
The Needleman-Wunsch algorithm
A smart way to reduce the massive number of possibilities
that need to be considered, yet still guarantees that the
best solution will be found (Saul Needleman and Christian
Wunsch, 1970).
The basic idea is to build up the best alignment by using
optimal alignments of smaller subsequences.
Needleman & Wunsch
• Place each sequence along one axis
• Place score 0 at the upper-left corner
• Fill in 1st row & column with gap penalty multiples
• Fill in the matrix with max value of 3 possible moves:
– Vertical move: Score + gap penalty
– Horizontal move: Score + gap penalty
– Diagonal move: Score + match/mismatch score
• The optimal alignment score is in the lower-right corner
• To reconstruct the optimal alignment, trace back where the max
at each step came from; stop when the origin is reached.
Example
• Let gap = -2, match = 1, mismatch = -1.

        empty   A    A    A    C
empty     0    -2   -4   -6   -8
A        -2     1   -1   -3   -5
G        -4    -1    0   -2   -4
C        -6    -3   -2   -1   -1

Optimal alignments (score -1):
AAAC
A-GC

AAAC
-AGC
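The matrix for this example can be reproduced with a short Python sketch of the fill step. The function name and argument order (row sequence first, column sequence second) are illustrative choices, not part of the original slides.

```python
def needleman_wunsch_matrix(rows, cols, match=1, mismatch=-1, gap=-2):
    """Fill the global alignment score matrix.

    rows runs down the left side of the matrix, cols across the top.
    """
    F = [[0] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    # First column and first row hold multiples of the gap penalty.
    for i in range(1, len(rows) + 1):
        F[i][0] = i * gap
    for j in range(1, len(cols) + 1):
        F[0][j] = j * gap
    for i in range(1, len(rows) + 1):
        for j in range(1, len(cols) + 1):
            s = match if rows[i - 1] == cols[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,   # diagonal move
                F[i - 1][j] + gap,     # vertical move
                F[i][j - 1] + gap,     # horizontal move
            )
    return F


for row in needleman_wunsch_matrix("AGC", "AAAC"):
    print(row)
```

The value in the lower-right cell is the optimal global score; the traceback that recovers the alignment itself is not included in this sketch.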
Local Alignment
• Problem first formulated:
– Smith and Waterman (1981)
• Problem:
– Find an optimal alignment between a
substring of s and a substring of t
• Algorithm:
– is a variant of the basic algorithm for global
alignment
Smith & Waterman
• Place each sequence along one axis
• Place score 0 at the upper-left corner
• Fill in 1st row & column with 0s
• Fill in the matrix with max value of 4 possible values:
– 0
– Vertical move: Score + gap penalty
– Horizontal move: Score + gap penalty
– Diagonal move: Score + match/mismatch score
• The optimal alignment score is the max in the matrix
• To reconstruct the optimal alignment, trace back where the max
at each step came from; stop when a zero is hit
Local Alignment
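• Let gap = -2, match = 1, mismatch = -1.
Sequences: GATCACCT (columns) and GATACCC (rows)

        empty  G  A  T  C  A  C  C  T
empty     0    0  0  0  0  0  0  0  0
G         0    1  0  0  0  0  0  0  0
A         0    0  2  0  0  1  0  0  0
T         0    0  0  3  1  0  0  0  1
A         0    0  1  1  2  2  0  0  0
C         0    0  0  0  2  1  3  1  0
C         0    0  0  0  1  1  2  4  2
C         0    0  0  0  1  0  2  3  3

GATCACCT
GAT-ACCC
The highest value in the matrix is 4; tracing back from it gives the local
alignment GATCACC / GAT-ACC.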
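The local alignment matrix for GATCACCT and GATACCC can be reproduced with the following Python sketch of the Smith-Waterman fill step, using the same scores (gap = -2, match = 1, mismatch = -1). The function name is illustrative, and only the matrix and best score are computed, not the traceback.

```python
def smith_waterman_matrix(rows, cols, match=1, mismatch=-1, gap=-2):
    """Fill the local alignment score matrix and return (matrix, best score).

    rows runs down the left side of the matrix, cols across the top.
    """
    H = [[0] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    best = 0
    for i in range(1, len(rows) + 1):
        for j in range(1, len(cols) + 1):
            s = match if rows[i - 1] == cols[j - 1] else mismatch
            H[i][j] = max(
                0,                     # start a new local alignment here
                H[i - 1][j - 1] + s,   # diagonal move
                H[i - 1][j] + gap,     # vertical move
                H[i][j - 1] + gap,     # horizontal move
            )
            best = max(best, H[i][j])
    return H, best


H, best = smith_waterman_matrix("GATACCC", "GATCACCT")
for row in H:
    print(row)
print("best local score:", best)
```

Because of the extra zero option no cell is ever negative, and the best score can sit anywhere in the matrix rather than in the lower-right corner.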
Pairwise alignment: the solution
“Dynamic programming”
(the Needleman-Wunsch algorithm)
Alignment depicted as path in matrix
[Figure: two alignments of TCGCA and TCCA, each drawn as a path through the matrix]
TCGCA
TC-CA

TCGCA
T-CCA
Alignment depicted as path in matrix
[Figure: path matrix for TCGCA against TCCA with one position labeled “x”]
Meaning of point in matrix: all
residues up to this point have
been aligned (but there are
many different possible
paths).
Position labeled “x”: TC aligned with TC
--TC
TC--

-TC
T-C

TC
TC
Creation of an alignment path matrix
• If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can calculate F(i,j)
• Three possibilities:
• xi and yj are aligned, F(i,j) = F(i-1,j-1) + s(xi ,yj)
• xi is aligned to a gap, F(i,j) = F(i-1,j) - d
• yj is aligned to a gap, F(i,j) = F(i,j-1) - d
• The best score up to (i,j) will be the largest of the three options
Dynamic programming: computation
of scores
[Figure: path matrix for TCGCA against TCCA with one position labeled “x”]
Any given point in matrix can only be
reached from three possible previous
positions (you cannot “align
backwards”).
=> Best scoring alignment ending in
any given point in the matrix can be
found by choosing the highest scoring
of the three possibilities:
score(x,y) = max of
    score(x,y-1) - gap-penalty
    score(x-1,y-1) + substitution-score(x,y)
    score(x-1,y) - gap-penalty
Each new score is found by choosing the maximum of three possibilities.
For each square in matrix: keep track of where best score came from.
Fill in scores one row at a time, starting in upper left corner of matrix,
ending in lower right corner.
Dynamic programming: example
Substitution scores:
     A   C   G   T
A    1  -1  -1  -1
C   -1   1  -1  -1
G   -1  -1   1  -1
T   -1  -1  -1   1
Gaps: -2

T C G C A
: :   : :
T C - C A
1+1-2+1+1 = 2
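The worked example can be checked with a Python sketch that fills the matrix from an explicit substitution table and then traces back to recover one optimal alignment. The names `align_global` and `SUB` are illustrative, and the order in which ties are broken during traceback (diagonal first, then vertical, then horizontal) is an arbitrary choice made here.

```python
def align_global(x, y, sub, gap=-2):
    """Return (score, aligned x, aligned y) for one optimal global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + sub[x[i - 1]][y[j - 1]],
                F[i - 1][j] + gap,
                F[i][j - 1] + gap,
            )
    # Trace back from the lower-right corner to the origin.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub[x[i - 1]][y[j - 1]]:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


# Substitution table from the example: 1 on the diagonal, -1 elsewhere.
SUB = {a: {b: (1 if a == b else -1) for b in "ACGT"} for a in "ACGT"}

score, top, bottom = align_global("TCGCA", "TCCA", SUB)
print(score)
print(top)
print(bottom)
```

Other optimal alignments with the same score may exist; a different tie-breaking order in the traceback would return one of those instead.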
Global versus local alignments
Global alignment: align full length of both sequences
(the “Needleman-Wunsch” algorithm).
Local alignment: find best partial alignment of two sequences
(the “Smith-Waterman” algorithm).
[Figure: Seq 1 and Seq 2 compared by global alignment and by local alignment]
Local alignment overview
• The recursive formula is changed by adding a fourth
possibility: zero. This means local alignment scores are never
negative.
score(x,y) = max of
    score(x,y-1) - gap-penalty
    score(x-1,y-1) + substitution-score(x,y)
    score(x-1,y) - gap-penalty
    0
• Trace-back is started at the highest value rather than in lower
right corner
• Trace-back is stopped as soon as a zero is encountered
Local alignment: example
Alignments: things to keep in mind
“Optimal alignment” means “having the highest possible score,
given substitution matrix and set of gap penalties”.
This is NOT necessarily the biologically most meaningful
alignment.
Specifically, the underlying assumptions are often wrong:
substitutions are not equally frequent at all positions, affine gap
penalties do not model insertion/deletion well, etc.
Pairwise alignment programs always produce an alignment even when it does not make sense to align sequences.