BNFO 602
Lecture 2
Usman Roshan
DNA Sequence Evolution
[Figure: evolutionary tree of DNA sequences. The ancestral sequence AAGACTT (-3 mil yrs) evolves through intermediate sequences at -2 mil yrs and -1 mil yrs into the present-day sequences:]
GGCTT (Mouse)
TAGGCCTT (Human)
TAGCCCTTA (Monkey)
ACACTTC (Lion)
ACCTT (Cat)
Sequence alignments
They tell us about
• Function or activity of a new gene/protein
• Structure or shape of a new protein
• Location or preferred location of a protein
• Stability of a gene or protein
• Origin of a gene or protein
• Origin or phylogeny of an organelle
• Origin or phylogeny of an organism
• And more…
Pairwise sequence alignment
• How to align two sequences?
• We use dynamic programming
• Treat DNA sequences as strings over the alphabet {A, C, G, T}
Dynamic programming
Define V(i,j) to be the optimal pairwise alignment score between S[1..i] and T[1..j] (|S| = m, |T| = n).
Time and space complexity is O(mn).
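As a minimal sketch of how the table V can be filled, the following assumes the usual recurrence in which V(i,j) is the best of a diagonal (substitution) move and two gap moves. The function name and the match, mismatch, and gap values are illustrative assumptions, not taken from the lecture.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of s and t.

    Assumed scoring (illustrative only): +1 match, -1 mismatch,
    and a gap score of -2 added for every gapped position.
    """
    m, n = len(s), len(t)
    # V[i][j] holds the best score for aligning s[:i] with t[:j].
    V = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        V[i][0] = i * gap  # s[:i] aligned against gaps only
    for j in range(1, n + 1):
        V[0][j] = j * gap  # t[:j] aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                V[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                V[i - 1][j] + gap,      # s[i-1] against a gap
                V[i][j - 1] + gap,      # t[j-1] against a gap
            )
    return V[m][n]


print(global_alignment_score("AAGACTT", "AAGGCTT"))  # 6 matches, 1 mismatch -> 5
```

The two nested loops visit each of the (m+1)(n+1) cells once, which is where the O(mn) time and space bound comes from.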
Dynamic programming
Animation slides by Elizabeth Thomas in
Cold Spring Harbor Labs (CSHL)
http://meetings.cshl.org/tgac/tgac/flash/DynamicProgramming.swf
How do we pick gap
parameters?
Structural alignments
• Recall that proteins have 3-D structure.
Structural alignment - example 1
Alignment of thioredoxins from human and fly, taken from the Wikipedia website. This protein is found in nearly all organisms and is essential for mammals. PDB ids are 3TRX and 1XWC.
Structural alignment - example 2
Taken from http://bioinfo3d.cs.tau.ac.il/Align/FlexProt/flexprot.html
Unaligned proteins: 2bbm and 1top are proteins from fly and chicken respectively.
Computer-generated aligned proteins.
Structural alignments
• We can produce high-quality manual alignments if the structure is available.
• These alignments can then serve as a benchmark to train gap parameters so that the alignment program produces correct alignments.
Benchmark alignments
• Protein alignment benchmarks
– BAliBASE, SABMARK, PREFAB, and HOMSTRAD are frequently used in studies of protein alignment.
– Protein benchmarks are generally large and have been in the research community for some time now.
– BAliBASE 3.0
Biologically realistic scoring matrices
• PAM and BLOSUM are the most popular
• PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations in 71 families of closely related proteins
• BLOSUM is more recent and is computed from blocks of sequences with sufficient similarity
PAM
• We need to compute the transition probability matrix M, which defines the probability of amino acid i converting to j
• Examine a set of closely related sequences that are easy to align (for PAM, 1572 mutations in 71 families)
• Compute probabilities of change and background probabilities by simple counting
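The counting step can be illustrated with a deliberately simplified sketch. It assumes the input is a list of already aligned sequence pairs and counts each observed column in both directions, since the direction of a change is unknown. The function name and input format are hypothetical, and this is not Dayhoff's full procedure, which also uses phylogenetic trees and a normalization to one accepted mutation per 100 residues.

```python
from collections import Counter


def estimate_transition_probabilities(aligned_pairs, gap="-"):
    """Estimate M[(x, y)] = P(x converts to y) and background frequencies.

    aligned_pairs: list of (a, b) strings of equal length (an assumption).
    Columns containing a gap are skipped.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    for a, b in aligned_pairs:
        for x, y in zip(a, b):
            if x == gap or y == gap:
                continue
            # Direction of change is unknown, so count both x->y and y->x.
            pair_counts[(x, y)] += 1
            pair_counts[(y, x)] += 1
            residue_counts[x] += 1
            residue_counts[y] += 1
    total = sum(residue_counts.values())
    background = {r: c / total for r, c in residue_counts.items()}
    # Normalize each row by how often residue x was observed.
    M = {(x, y): c / residue_counts[x] for (x, y), c in pair_counts.items()}
    return M, background


M, background = estimate_transition_probabilities([("AAAA", "AAAG")])
print(M[("A", "G")], background["G"])
```

With this normalization every row of M sums to 1, so each row can be read as a probability distribution over the residues that x may convert to.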
Local alignment
• Global alignment recursion:
\[
V(i,j) = \max
\begin{cases}
V(i-1,j-1) + S(x_i, y_j) \\
V(i-1,j) + g \\
V(i,j-1) + g
\end{cases}
\]
• Local alignment recursion:
\[
V(i,j) = \max
\begin{cases}
0 \\
V(i-1,j-1) + S(x_i, y_j) \\
V(i-1,j) + g \\
V(i,j-1) + g
\end{cases}
\]
Local alignment traceback
• Let T(i,j) be the traceback matrix and let m and n be the lengths of the input sequences.
• Global alignment traceback:
– Begin from T(m,n) and stop at T(0,0).
• Local alignment traceback:
– Find i*, j* such that V(i*,j*) is the maximum over all V(i,j).
– Begin traceback from T(i*,j*) and stop when V(i,j) <= 0.
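The local alignment recursion and its traceback can be sketched together as follows. This is an illustrative implementation under assumed scores (match +2, mismatch -1, gap -2); the function name and the pointer labels are not from the lecture.

```python
def local_alignment(s, t, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of s, aligned part of t)."""
    m, n = len(s), len(t)
    V = [[0] * (n + 1) for _ in range(m + 1)]     # scores, never below 0
    T = [[None] * (n + 1) for _ in range(m + 1)]  # traceback pointers
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            candidates = [
                (0, None),                        # start a new local alignment
                (V[i - 1][j - 1] + sub, "diag"),
                (V[i - 1][j] + gap, "up"),
                (V[i][j - 1] + gap, "left"),
            ]
            V[i][j], T[i][j] = max(candidates, key=lambda c: c[0])
            if V[i][j] > best:
                best, bi, bj = V[i][j], i, j
    # Trace back from the maximum cell until the score reaches 0.
    a, b = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and V[i][j] > 0:
        if T[i][j] == "diag":
            a.append(s[i - 1])
            b.append(t[j - 1])
            i, j = i - 1, j - 1
        elif T[i][j] == "up":
            a.append(s[i - 1])
            b.append("_")
            i -= 1
        else:
            a.append("_")
            b.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(local_alignment("TTTACGTTT", "GGACGGG"))  # (6, 'ACG', 'ACG')
```

Only the shared substring ACG is reported: the extra 0 option lets the alignment start fresh wherever extending it would lower the score.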
BLAST
• Local pairwise alignment heuristic
• Faster than standard pairwise alignment
programs such as SSEARCH, but less
sensitive.
• Online server:
http://www.ncbi.nlm.nih.gov/blast
BLAST
1. Given a query q and a target sequence, find substrings of length k (k-mers) of score at least t, also called hits. k is normally 3 to 5 for amino acids and 12 for nucleotides.
2. Extend each hit to a locally maximal segment. Terminate the extension when the reduction in score exceeds a pre-defined threshold.
3. Report maximal segments above score S.
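Step 2 can be illustrated with a small ungapped extension routine. This is a sketch under simplifying assumptions: the hit is an exact k-mer match, scoring is +1 for a match and -1 for a mismatch, and the names extend_hit and x_drop are hypothetical. It is not BLAST's actual implementation.

```python
def extend_hit(q, d, qi, di, k, match=1, mismatch=-1, x_drop=2):
    """Extend an exact k-mer hit q[qi:qi+k] == d[di:di+k] without gaps.

    Returns (score, start, end) where q[start:end] is the extended segment.
    Extension in each direction stops once the running score has fallen
    more than x_drop below the best score seen so far.
    """
    def extend(q_pos, d_pos, step):
        best = score = 0
        best_len = length = 0
        while 0 <= q_pos < len(q) and 0 <= d_pos < len(d):
            score += match if q[q_pos] == d[d_pos] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break
            q_pos += step
            d_pos += step
        return best, best_len

    right, r_len = extend(qi + k, di + k, +1)
    left, l_len = extend(qi - 1, di - 1, -1)
    return k * match + left + right, qi - l_len, qi + k + r_len


# The hit "ACG" at position 2 of both sequences extends to "ACGTA".
print(extend_hit("TTACGTAA", "CCACGTAC", 2, 2, 3))  # (5, 2, 7)
```

Only the best-scoring prefix of each extension is kept, so trailing mismatches explored before the cutoff do not become part of the reported segment.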
Finding k-mers quickly
• Preprocess the database of sequences:
– For each sequence in the database, store all k-mers in a hash table.
– This takes linear time.
• Query sequence:
– For each k-mer in the query sequence, look up the hash table of the target to see if it exists.
– This also takes linear time.
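The preprocessing and lookup steps above can be sketched with a Python dictionary standing in for the hash table. The function names and the toy database are illustrative assumptions, and only exact k-mer matches are found here.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Map every k-mer to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_hits(query, index, k):
    """Return (query position, sequence id, database position) for each hit."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, seq_id, dpos))
    return hits


database = {"s1": "ACGTAC", "s2": "TTACG"}  # toy database (assumption)
index = build_kmer_index(database, k=3)
print(find_hits("GACGA", index, k=3))  # [(1, 's1', 0), (1, 's2', 2)]
```

Building the index touches each database position once, and each query k-mer costs one expected constant-time lookup plus the number of hits it returns.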
Profile-sequence alignment
• Given a family alignment, how can we align it to a
sequence?
• First, we compute a profile of the alignment.
• We then align the profile to the sequence using
standard dynamic programming.
• However, we need to describe how to align a profile
vector to a nucleotide or residue.
Profile
• A profile can be described by a set of
vectors of nucleotide/residue
frequencies.
• For each position i of the alignment, we compute the normalized frequency of nucleotides A, C, G, and T
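A profile of this kind can be computed column by column. In the sketch below, gaps are written as "_" and are ignored when normalizing; that treatment of gaps and the function name are assumptions rather than something the slides specify.

```python
def compute_profile(alignment, alphabet="ACGT"):
    """Return one frequency vector (a dict) per alignment column.

    alignment: list of equal-length strings. Characters outside the
    alphabet (such as the gap symbol "_") are not counted.
    """
    profile = []
    for column in zip(*alignment):
        counts = {c: column.count(c) for c in alphabet}
        total = sum(counts.values())
        profile.append(
            {c: (counts[c] / total if total else 0.0) for c in alphabet}
        )
    return profile


profile = compute_profile(["ACG_", "ACT_", "AGT_"])
print(profile[0])  # {'A': 1.0, 'C': 0.0, 'G': 0.0, 'T': 0.0}
```

A column made up entirely of gaps gets an all-zero vector here; another reasonable choice would be to add a fifth entry for the gap frequency.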
Aligning a profile vector to a
nucleotide
• ClustalW/MUSCLE
– Let f be the profile vector
– \( \mathrm{Score}(f, j) = \sum_{i \in \{A,C,G,T\}} f_i \, S(i,j) \)
– where S(i,j) is the substitution scoring matrix
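The score of a profile vector against a single nucleotide is a frequency-weighted average of substitution scores. The sketch below assumes a simple substitution matrix S with +1 on the diagonal and -1 elsewhere, chosen only for illustration.

```python
def profile_score(f, j, S):
    """Score(f, j) = sum over i of f[i] * S[i][j]."""
    return sum(f[i] * S[i][j] for i in f)


alphabet = "ACGT"
# Illustrative substitution matrix: +1 for a match, -1 for a mismatch.
S = {a: {b: (1 if a == b else -1) for b in alphabet} for a in alphabet}

f = {"A": 0.75, "C": 0.25, "G": 0.0, "T": 0.0}
print(profile_score(f, "A", S))  # 0.75 * 1 + 0.25 * (-1) = 0.5
```

Using this function in place of S(x_i, y_j) in the dynamic programming recurrence gives a profile-to-sequence alignment.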

Multiple sequence alignment
• “Two sequences whisper, multiple sequences shout out loud” (Arthur Lesk)
• Computationally very hard (NP-hard)
Formally…
Multiple sequence alignment
Unaligned sequences:
GGCTT
TAGGCCTT
TAGCCCTTA
ACACTTC
ACCTT
Aligned sequences:
_G__GCTT_
TAGGCCTT_
TAGCCCTTA
A__CACTTC
A__C_CTT_
Conserved regions help us
to identify functionality
Sum of pairs score
• What is the sum of pairs score of this alignment?
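One way to compute a sum-of-pairs score is to add up, column by column, the score of every pair of rows. The scoring values below (match +1, mismatch -1, residue against gap -1, gap against gap 0) and the "_" gap symbol are assumptions made for this sketch; the lecture's own scoring scheme is not given in the transcript.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-1):
    """Sum the pairwise scores over all columns and all pairs of rows."""
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "_" and y == "_":
                continue          # gap against gap scores 0 (assumption)
            elif x == "_" or y == "_":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


# Columns score 3, -1, and -1 respectively.
print(sum_of_pairs(["AC_", "A_T", "ACT"]))  # 1
```

The same function can be applied to any list of equal-length aligned rows, for example a five-sequence alignment written with "_" for gaps.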
Iterative alignment
(heuristic for sum-of-pairs)
• Pick a random sequence from input set S
• Do (n-1) pairwise alignments and align to
closest one t in S
• Remove t from S and compute profile of
alignment
• While sequences remaining in S
– Do |S| pairwise alignments and align to closest
one t
– Remove t from S
Iterative alignment
• Once the alignment is computed, randomly divide it into two parts
• Compute the profile of each sub-alignment and realign the profiles
• If the sum-of-pairs score of the new alignment is better than the previous one, keep it; otherwise continue with a different division until a specified iteration limit
Progressive alignment
• Idea: perform profile alignments in the
order dictated by a tree
• Given a guide tree, do a post-order traversal and align sequences in that order
• Widely used heuristic
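The order in which a guide tree dictates the profile alignments can be sketched with a post-order traversal. Here the tree is written as nested tuples of sequence names; the tree shape and the function name are hypothetical, and the profile alignment itself is left out.

```python
def merge_order(tree):
    """Return (leaves under tree, list of merges in post-order).

    tree is either a sequence name (str) or a pair (left, right).
    Each merge is a pair of leaf lists whose alignments get joined.
    """
    if isinstance(tree, str):
        return [tree], []
    left, right = tree
    left_leaves, left_merges = merge_order(left)
    right_leaves, right_merges = merge_order(right)
    # Children are fully processed before their parent (post-order).
    merges = left_merges + right_merges + [(left_leaves, right_leaves)]
    return left_leaves + right_leaves, merges


# Hypothetical guide tree over the five example species.
guide_tree = (("Mouse", "Human"), ("Monkey", ("Lion", "Cat")))
for left, right in merge_order(guide_tree)[1]:
    print(left, "+", right)
```

Each printed line corresponds to one profile alignment: the two groups have already been aligned internally, and the step joins them into a larger alignment.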
Popular alignment programs
• ClustalW: most popular, progressive alignment
• MUSCLE: fast and accurate, progressive and
iterative combination
• T-COFFEE: slow but accurate, consistency based
alignment (align sequences in multiple alignment to
be close to the optimal pairwise alignment)
• PROBCONS: slow but highly accurate, probabilistic consistency-based progressive scheme
• DIALIGN: very good for local alignments
MUSCLE
Evaluation of multiple sequence
alignments
• Compare to benchmark “true”
alignments
• Use simulation
• Measure conservation of an alignment
• Measure accuracy of phylogenetic trees
• How well does it align motifs?
• More…
Comparison of alignments on
BAliBASE