Multiple Sequence Alignment
Carlow IT Bioinformatics
November 2006
MSA
• A central technique in bioinformatics
• along with:
– homology searching
– multiple sequence alignment
– phylogenetic trees
An example
“all you have to do” is re-write your sequences so that
similar features finish up in the same columns
Evolutionary relationship
• “similar features” ideally means
homologous – with a shared ancestor
• ClustalW and T-Coffee mimic the process
of evolution
– by weighting similar residues by how
conserved they are in evolution
• Important AAs don’t mutate
• Less important AAs change easily, even randomly
– by inserting judicious gaps
Criteria for alignment
• Amino acids in the same column have
– Structural similarity (used by threading progs)
• Practical exercise inferring position of Bsu recA AAs
– Evolutionary similarity – residues have a common
ancestor
– Functional similarity (active site, C-C bonds) may
have to hand edit known functions
– Sequence similarity
• The first 3 (clear biological attributes) are, you
hope, reflected by the last (an abstraction) which
is what MSA programs use
Applications
• Discover conserved patterns/motifs
– A step to describing a protein domain
– MSA can add a distant relative to your protein
family
• A step to define DNA regulatory elements.
• Prediction of secondary structure; also helps 3-D structure prediction
• A step to phylogenetic trees: to describe or
show the process of evolution
• PCR analysis/primer design
– find most and least degenerate regions of your
sequence
So why is it difficult?
Where to put the gap?
FGDERTHHS
FGD-D-HRS

FGDERTHHS
FGD--DHRS

FGDERTHHS
FGDD--HRS

Trivial 2-sequence alignment: 3 possibilities. As the length and
number of sequences increase, the number of possible alignments
becomes astronomical
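The gap-placement problem can be made concrete with a brute-force sketch. It assumes the shorter sequence is FGDDHRS (the second row with its gaps removed), tries every way of placing its two gaps across nine columns, and scores each candidate by counting identical columns. The function names are illustrative only, and real programs use substitution matrices rather than plain identity.

```python
from itertools import combinations

def place_gaps(seq, gap_columns, width):
    """Spread seq over `width` columns, writing '-' at the given columns."""
    residues = iter(seq)
    return "".join("-" if col in gap_columns else next(residues)
                   for col in range(width))

def identities(a, b):
    """Count columns where both rows hold the same residue (gaps never match)."""
    return sum(x == y and x != "-" for x, y in zip(a, b))

long_seq, short_seq = "FGDERTHHS", "FGDDHRS"
width = len(long_seq)
n_gaps = width - len(short_seq)

# Every possible placement of the gaps in the shorter sequence.
candidates = [place_gaps(short_seq, set(cols), width)
              for cols in combinations(range(width), n_gaps)]
best_score = max(identities(long_seq, c) for c in candidates)
best = [c for c in candidates if identities(long_seq, c) == best_score]

print(len(candidates), "placements; best identity score:", best_score)
for row in best:
    print(long_seq)
    print(row)
    print()
```

Of the 36 placements, exactly the three shown on the slide tie for the best identity score, so identity alone cannot choose between them. Exhaustive enumeration like this is only feasible for toy cases.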
Some data
• Cat ATGAAACGTCGGATCTAA
• Dog ATGAATCGACCCATCTAA
• Mus ATGGCGTGGCTTGGCATGTGA
• Rat ATGGCATGTCGTGGCATGTAG
Protocol step 1
• Align each pair of seqs C-D, C-M, C-R etc
• Get a score for each alignment
• And make a …
Similarity matrix
      Cat  Dog  Mus  Rat
Cat   ID
Dog   14   ID
Mus   10   10   ID
Rat   10   10   16   ID
• Number of identical residues
– Which pair of sequences is most similar?
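As a rough check on the matrix, the two equal-length pairs can be compared position by position without any gaps. The sketch below does that and then picks the most similar pair from the scores on the slide. The cross-species scores are copied from the slide, not recomputed, because they depend on how each pairwise alignment places its gaps. The function name is an illustrative assumption.

```python
seqs = {
    "Cat": "ATGAAACGTCGGATCTAA",
    "Dog": "ATGAATCGACCCATCTAA",
    "Mus": "ATGGCGTGGCTTGGCATGTGA",
    "Rat": "ATGGCATGTCGTGGCATGTAG",
}

def ungapped_identities(a, b):
    """Count identical positions in two sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences differ in length; they need aligning first")
    return sum(x == y for x, y in zip(a, b))

print("Cat/Dog:", ungapped_identities(seqs["Cat"], seqs["Dog"]))
print("Mus/Rat:", ungapped_identities(seqs["Mus"], seqs["Rat"]))

# Scores copied from the slide's similarity matrix.
similarity = {
    ("Cat", "Dog"): 14, ("Cat", "Mus"): 10, ("Dog", "Mus"): 10,
    ("Cat", "Rat"): 10, ("Dog", "Rat"): 10, ("Mus", "Rat"): 16,
}
most_similar = max(similarity, key=similarity.get)
print("Most similar pair:", most_similar)
```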
Progressive alignment
• Align the two most similar sequences, inserting
any gaps.
• Mus/Rat: lock these sequences together (call it
“RODent”)
• Return to similarity matrix to find next most
similar seqs or sequence cluster
• Dog/Cat: align and lock (call it CARnivore)
– if next step requires a gap, then gap inserted in both
carnivore sequences
• Align next most … (now it’s iterative)
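The order in which sequences get locked together can be sketched as a greedy clustering over the similarity matrix. This is a simplified illustration, not ClustalW's actual guide-tree method: it assumes the similarity between two clusters is the mean of the pairwise scores between their members, and the function name is made up for the example.

```python
def merge_order(names, similarity):
    """Repeatedly join the two most similar clusters; return the joins in order."""
    def leaves(cluster):
        # A cluster is either a sequence name or a nested pair of clusters.
        if isinstance(cluster, str):
            return [cluster]
        return [name for part in cluster for name in leaves(part)]

    def cluster_score(c1, c2):
        # Assumption: average similarity over all member pairs.
        scores = [similarity[frozenset((a, b))]
                  for a in leaves(c1) for b in leaves(c2)]
        return sum(scores) / len(scores)

    clusters = list(names)
    steps = []
    while len(clusters) > 1:
        index_pairs = [(i, j) for i in range(len(clusters))
                       for j in range(i + 1, len(clusters))]
        i, j = max(index_pairs,
                   key=lambda ij: cluster_score(clusters[ij[0]], clusters[ij[1]]))
        merged = (clusters[i], clusters[j])
        steps.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return steps

# Scores from the slide's similarity matrix.
similarity = {
    frozenset(("Cat", "Dog")): 14, frozenset(("Cat", "Mus")): 10,
    frozenset(("Dog", "Mus")): 10, frozenset(("Cat", "Rat")): 10,
    frozenset(("Dog", "Rat")): 10, frozenset(("Mus", "Rat")): 16,
}
for step in merge_order(["Cat", "Dog", "Mus", "Rat"], similarity):
    print(step)
```

With the slide's scores this joins Mus/Rat first, then Cat/Dog, then the two clusters.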
An alignment
Cat  ATGAAACGTCGG---ATCTAA
Dog  ATGAATCGACCC---ATCTAA
Mus  ATGGCGTGGCTTGGCATGTGA
Rat  ATGGCATGTCGTGGCATGTAG
     ***    * *     ** *
• Good: Always a two “sequence” problem
– So computationally possible
• Bad: Can’t rewrite or decouple (part of) the
dog/cat alignment in the light of later info.
Locked in a (suboptimal?) trough.
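Two small operations from this slide can be sketched in code: inserting the same gap into every sequence of a locked cluster, and marking fully conserved columns with an asterisk. The function names are illustrative assumptions, and the gap position (column 12, length 3) is read off the alignment above rather than computed.

```python
def insert_gap(rows, column, length):
    """Insert the same gap into every row of a locked cluster."""
    return [row[:column] + "-" * length + row[column:] for row in rows]

def conservation_line(rows):
    """Return '*' under columns where all rows share one residue, else ' '."""
    return "".join(
        "*" if len(set(col)) == 1 and "-" not in col else " "
        for col in zip(*rows)
    )

carnivores = ["ATGAAACGTCGGATCTAA", "ATGAATCGACCCATCTAA"]        # Cat, Dog
rodents = ["ATGGCGTGGCTTGGCATGTGA", "ATGGCATGTCGTGGCATGTAG"]     # Mus, Rat

# The gap needed to match the rodents goes into both carnivore rows at once.
alignment = insert_gap(carnivores, 12, 3) + rodents
for row in alignment:
    print(row)
print(conservation_line(alignment))
```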
More complex 10 seq example
Choosing the right seqs
• Use MSA to inform you!
• Always use AA/protein if possible
– can copy gaps back to DNA later
• Start with 6-15 sequences
• Eliminate very different (<30% id) seqs
• Eliminate identical sequences
• Watch out for partial sequences
• …or sequences that need ++ gaps to align
• Check for repeats with dotlet, Lalign
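Two of the rules above (drop identical sequences, drop anything under 30% identity) are easy to automate once a first-pass alignment exists. The sketch below assumes already-aligned rows, measures identity against one chosen reference sequence, and uses made-up toy sequences. The function names and that choice of reference are assumptions for illustration.

```python
def percent_identity(a, b):
    """Percent identity over columns where both aligned rows have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def select_sequences(aligned, reference, min_identity=30.0):
    """Keep the reference; drop exact duplicates and rows below min_identity."""
    kept = {reference: aligned[reference]}
    seen = {aligned[reference].replace("-", "")}
    for name, row in aligned.items():
        if name == reference:
            continue
        residues = row.replace("-", "")
        if residues in seen:
            continue                      # identical to one already kept
        if percent_identity(aligned[reference], row) < min_identity:
            continue                      # too divergent to align reliably
        seen.add(residues)
        kept[name] = row
    return kept

# Hypothetical toy data, not from the slides.
aligned = {
    "ref":   "MKTAYIAKQR",
    "dup":   "MKTAYIAKQR",
    "close": "MKTAYLAKQR",
    "far":   "MPWGHCDEFS",
}
print(list(select_sequences(aligned, "ref")))
```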
Less is more
• Large alignments
– take ++ CPU and time
– are hard to do well
– are difficult to display
– are difficult to use: in trees for example
– may include marginal seqs that wreck whole
alignment
• So start small and add/eliminate seqs until
you have a clear informative picture
Level of variation is important
• Choose sequence family with best rate of
evolution for your taxonomic group
– Histones evolve very slow (compare kingdoms)
– Transferrins are fast (compare classes,orders)
• Closely related sequences may have identical
protein (but variable DNA)
• Distantly related sequences no DNA signal
(“saturated”)
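Because closely related sequences may differ only at the DNA level, it is common to align the proteins and then copy the gaps back onto the coding DNA. A minimal sketch of that step follows. It assumes the DNA is in frame, has no stop codon, and really does encode the protein; the function does not check the translation. The function name is illustrative.

```python
def copy_gaps_to_dna(aligned_protein, dna):
    """Thread ungapped coding DNA through a gapped protein row, codon by codon."""
    n_residues = sum(aa != "-" for aa in aligned_protein)
    if len(dna) != 3 * n_residues:
        raise ValueError("DNA length must be 3 x the number of residues")
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    # Each protein gap becomes a three-base gap; each residue takes one codon.
    return "".join("---" if aa == "-" else next(codons) for aa in aligned_protein)

# Cat coding sequence from the earlier slide, stop codon removed (M K R R I).
cat_dna = "ATGAAACGTCGGATC"
cat_protein_aligned = "MKRR-I"
print(copy_gaps_to_dna(cat_protein_aligned, cat_dna))
```

For the Cat sequence this reproduces the three-base gap seen in the DNA alignment earlier in the slides.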
ClustalW at ch.embnet.org
Paste in your FASTA sequences
Output choices
ClustalW at EBI
Paste in your (FASTA) sequences
EBI: loads of options
T-coffee
Minimal input parameters and STILL a better job than ClustalW
Output EBI clustalW
[Screenshot of the EBI results page. Labelled parts: Jalview alignment editor; pairwise distance etc.; alignment; guide tree; what you submitted]
An alignment fragment
ACT_CANAL   -MDGEEVAALIIDNGSGMCKA
ACT_CANDU   -MDGEEVAALVIDNGSGMCKA
ACT_PICAN   -MDGEDVAALVIDNGSGMCKA
ACT_PICPA   -MDGEDVAALVIDNGSGMCKA
ACT_KLULA   -MDS-EVAALVIDNGSGMCKA
ACT_YEAST   -MDS-EVAALVIDNGSGMCKA
ACT_YARLI   -MED-ETVALVIDNGSGMCKA
ACT2_ABSGL  MSMEEDIAALVIDNASGMCKA
ACT2_SCHCO  --MDDEIQAVVIDNGSGMCKA
                 :  *:::**.******
* All AA in column identical
: AA similar size & hydrophobicity
. AA similar size or hydrophobicity
ClustalW format
The alignment, so what next?
• Look at it very closely
• Hand edit if necessary (probably)
• Eliminate problem sequences and redo?
• Use display option best for next step
– Phylip format for trees
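For reference, the PHYLIP layout is simple enough to write by hand. The sketch below produces the strict sequential form: a header with the number of sequences and columns, then each name padded or truncated to 10 characters followed by its aligned row. The function name is an assumption, and in practice the alignment program's own PHYLIP output option does the same job.

```python
def to_phylip(alignment):
    """Format {name: aligned_row} as strict sequential PHYLIP text."""
    lengths = {len(row) for row in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all rows must have the same length")
    lines = [f" {len(alignment)} {lengths.pop()}"]
    for name, row in alignment.items():
        # Strict PHYLIP reserves exactly 10 characters for the name.
        lines.append(f"{name[:10]:<10}{row}")
    return "\n".join(lines) + "\n"

alignment = {
    "Cat": "ATGAAACGTCGG---ATCTAA",
    "Dog": "ATGAATCGACCC---ATCTAA",
    "Mus": "ATGGCGTGGCTTGGCATGTGA",
    "Rat": "ATGGCATGTCGTGGCATGTAG",
}
print(to_phylip(alignment), end="")
```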
Parameter changes
• Substitution matrix: PAM, Gonnet, BLOSUM
– ClustalW chooses which matrix within the family
• PAM30 for closely related pairs; PAM120; PAM250 for more
distant
– Difficult alignment: a matrix change may help
• Gap penalties (open and extend) have optimal
values for each family: find them by trial and
error.
– ClustalW puts gaps (which are often external loops)
near previous gaps (longer loop)
• MSA does the grunt work. YOU do the fine
tuning.
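The effect of separate open and extend penalties can be seen on the three gap placements from the "So why difficult?" slide. The sketch below scores a pairwise alignment with a flat match/mismatch score and an affine gap cost. All numeric values are arbitrary assumptions for illustration, not ClustalW defaults, and a real run would use a substitution matrix instead of match/mismatch.

```python
def affine_score(a, b, match=2, mismatch=0, gap_open=4, gap_extend=1):
    """Score two aligned rows; a gap of length n costs gap_open + (n-1)*gap_extend."""
    if len(a) != len(b):
        raise ValueError("rows must be the same length")
    score = 0
    gap_row = None                      # which row the current gap run is in
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            # Continuing a gap in the same row is cheaper than opening a new one.
            score -= gap_extend if gap_row == row else gap_open
            gap_row = row
        else:
            score += match if x == y else mismatch
            gap_row = None
    return score

top = "FGDERTHHS"
for bottom in ("FGD-D-HRS", "FGD--DHRS", "FGDD--HRS"):
    print(bottom, affine_score(top, bottom))
```

All three placements have the same number of identities, but the one with two separate single-residue gaps scores lower than the two with one contiguous gap. Raising or lowering `gap_open` relative to `gap_extend` is the kind of trial-and-error tuning the slide describes.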
Guide tree
• To decide which pairs of sequences to align
first, a phylogenetic tree is calculated from the
pairwise distance matrix.
– Stored in a DND (dendrogram) file
• Never use this file to draw a tree
• ClustalW can construct a tree from the
multiple sequence alignment (better than
pairwise)
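Distances for a tree built from the finished multiple alignment can be sketched with the simplest measure, the proportion of differing sites (p-distance). The version below skips columns where either sequence has a gap. This is one common convention, assumed here for illustration; it is not necessarily what ClustalW computes, and the function name is made up.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of differing sites, ignoring columns with a gap in either row."""
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in compared) / len(compared)

msa = {
    "Cat": "ATGAAACGTCGG---ATCTAA",
    "Dog": "ATGAATCGACCC---ATCTAA",
    "Mus": "ATGGCGTGGCTTGGCATGTGA",
    "Rat": "ATGGCATGTCGTGGCATGTAG",
}
distances = {(p, q): p_distance(msa[p], msa[q]) for p, q in combinations(msa, 2)}
for pair, d in distances.items():
    print(pair, round(d, 3))
```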
Alignment display: weblogo
Always remember: sequence represents a 3-D structure
Patterns to recognise
(more reliable in MSA than in single seq)
MSA improves secondary structure (α-helix, β-sheet) prediction by >6%
• Alternate hydrophobic residues
– Surface β-sheet (zig-zag-zig-zag)
• Runs of hydrophobic residues
– Interior/buried β-sheet
• Hydrophobic residues with ~3.5 AA spacing (amphipathic)
– α-helix WNNWFNNFNNWNNNF
• Gaps/indels
– Probably surface not core
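These patterns are easier to spot if the sequence is reduced to a hydrophobic/non-hydrophobic mask. The sketch below does that and lists the spacing between successive hydrophobic positions. The set of residues counted as hydrophobic is an assumption (classifications vary), and the function names are illustrative.

```python
# Assumption: one common choice of hydrophobic residues; other scales differ.
HYDROPHOBIC = set("AVLIMFWC")

def hydrophobic_mask(seq):
    """Return 'H' for hydrophobic residues and '.' for everything else."""
    return "".join("H" if aa in HYDROPHOBIC else "." for aa in seq)

def hydrophobic_spacings(seq):
    """Distances between successive hydrophobic positions."""
    positions = [i for i, aa in enumerate(seq) if aa in HYDROPHOBIC]
    return [b - a for a, b in zip(positions, positions[1:])]

helix_example = "WNNWFNNFNNWNNNF"        # from the slide
print(hydrophobic_mask(helix_example))
print(hydrophobic_spacings(helix_example))

strand_example = "VTVTVTV"               # hypothetical alternating pattern
print(hydrophobic_mask(strand_example))
```

Spacings that mix 3s and 4s are what a helix with one hydrophobic face looks like in sequence, while a strict spacing of 2 is the zig-zag of a surface strand.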
Conserved residues
• W, F, Y: large hydrophobic, internal/core
– conserved W, F, Y are the best signal for domains
• G, P: turns; can mark the end of an α-helix or β-sheet
• C: conserved with regular spacing suggests C-C
disulphide bridges, e.g. defensins
• H, S: often catalytic sites in proteases (and other
enzymes)
• K, R, D, E: charged; ligand binding or salt bridge
• L: very common AA but not conserved
– except in leucine zipper L234567L234567L234567L
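Spacing patterns like these translate directly into regular expressions. The sketch below searches for the leucine zipper pattern written on the slide (an L every seventh residue, four times) and reports the spacing between cysteines. Both test sequences are made up, and a pattern hit is only a hint: many sequences match by chance without forming a zipper or a disulphide bridge.

```python
import re

# L, six other residues, repeated: L234567L234567L234567L on the slide.
# The lookahead lets overlapping matches be reported too.
LEUCINE_ZIPPER = re.compile(r"(?=L.{6}L.{6}L.{6}L)")

def find_leucine_zippers(seq):
    """Start positions (0-based) of every L-x6-L-x6-L-x6-L match."""
    return [m.start() for m in LEUCINE_ZIPPER.finditer(seq)]

def cysteine_spacings(seq):
    """Distances between successive cysteines in the sequence."""
    positions = [i for i, aa in enumerate(seq) if aa == "C"]
    return [b - a for a, b in zip(positions, positions[1:])]

zipper_like = "MK" + "LAAAAAA" * 3 + "LGG"   # hypothetical sequence
print(find_leucine_zippers(zipper_like))

cys_rich = "ACDDCGGGC"                       # hypothetical sequence
print(cysteine_spacings(cys_rich))
```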
Finish with an alignment:
defensins
3 pairs of C residues: 3 disulphide bridges