Homologues finding and
Multiple Sequence Alignment
Maya Schushan
November 2010
Outline- introduction to alignments
1. Introduction
2. Applications
3. General Alignment Methodology
4. Pairwise Alignment:
• Smith-Waterman
• Needleman-Wunsch
5. Multiple Sequence Alignment:
• ClustalW
• MUSCLE
• T-coffee
Introduction
What Is An Alignment?
A process of lining-up 2 or more sequences to achieve
maximum level of identity, in order to find homologies.
TCATG
CATTG
?
TCATG
CATTG
or
TCATG
CATTG
Introduction
What Is An Alignment?
• Comparing 2 (pairwise) or more (multiple) sequences.
• Searching for a series of identical or similar
characters in the sequences.
VLSPADKTNVKAAWAKVGAHAAGHG
||| |    |   |||| |  ||||
VLSEAEWQLVLHVWAKVEADVAGHG
Introduction
Basic Terms
• Homology:
Relation of sequences which is a result of divergence from a
common ancestor.
• Identity:
Sequences or Sub-sequences that are invariant.
• Similarity:
Sequences or Sub-sequences that are related.
CAG
CAT
Introduction
Homologues: Orthology vs Paralogy
Reproduced from NCBI education website
Introduction
The Limits of Sequence Similarity
Outline
1. Introduction
2. Applications
3. General Alignment Methodology
4. Pairwise Alignment:
• Smith-Waterman
• Needleman-Wunsch
5. Multiple Sequence Alignment:
• ClustalW
• MUSCLE
• T-coffee
Applications
Why Sequence Alignment?
1. Predict characteristics of a protein –
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGSSSNIGS--ITVNWYQQLPG
LRLSCTGSGFIFSS--YAMYWYQQAPG
LSLTCTGSGTSFDD-QYYSTWYQQPPG
Applications
Why Sequence Alignment?
A model is generated
according to a
template structure of a
homologous protein
Applications
Why Sequence Alignment?
2. Learn about evolutionary relationships –
• Two sequences from different organisms are similar → they may have a common ancestor.
• Needed for construction of phylogenetic trees
Applications
Why Sequence Alignment?
3. Research of disease –
• Comparison of sequences between individuals can
detect changes that are related to diseases
• Analysis of residues’ substitutions: mutation or
polymorphism?
Applications
Why Sequence Alignment?
4. Find similar sequences in a database
• The commonly used BLAST and FASTA search
programs have to utilize a form of an alignment to
detect similar sequences to the sequence in hand
• The methods employed have to be very fast to make
a search of a database containing millions of
sequences feasible
Applications
Why Sequence Alignment?
Examples for specific applications:
• Evolutionary conservation analysis (ConSeq/ConSurf)
• Motif and domain prediction (Prosite/InterPro/Pfam)
• Phylogenetic trees
• …
ConSurf analysis of
PDB entry 1hyt-hydrolase
Outline
1. Introduction
2. Applications
3. General Alignment Methodology
4. Pairwise Alignment:
• Smith-Waterman
• Needleman-Wunsch
5. Multiple Sequence Alignment:
• ClustalW
• MUSCLE
• T-coffee
General Alignment Methodology
Example: Aligning Two Globins
Human Hemoglobin (HH):
VLSPADKTNVKAAWGKVGAHAGYEG
Sperm Whale Myoglobin (SWM):
VLSEGEWQLVLHVWAKVEADVAGHG
General Alignment Methodology
Example: Aligning Two Globins
No Gaps:
(HH)  VLSPADKTNVKAAWGKVGAHAGYEG
(SWM) VLSEGEWQLVLHVWAKVEADVAGHG
• Percent identity: 36
• Percent similarity: 40
General Alignment Methodology
Example: Aligning Two Globins
With Gaps:
(HH)  VLSPADKTNVKAAWGKVGAH-AGYEG
(SWM) VLSEGEWQLVLHVWAKVEADVAGH-G
• Gaps: 2
• Percent identity: 45.833 (instead of 36 without gaps)
• Percent similarity: 54.167 (instead of 40 without gaps)
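The identity figures quoted for the two globin alignments can be reproduced with a short sketch. It assumes that percent identity is the number of identical columns divided by the number of columns in which neither sequence has a gap; this denominator is an inference from the quoted numbers (9/25 and 11/24), not something the slides state. The function name `percent_identity` is illustrative.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of identical residues over the columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns in which both sequences carry a residue.
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


HH_NOGAP = "VLSPADKTNVKAAWGKVGAHAGYEG"
SWM_NOGAP = "VLSEGEWQLVLHVWAKVEADVAGHG"
HH_GAP = "VLSPADKTNVKAAWGKVGAH-AGYEG"
SWM_GAP = "VLSEGEWQLVLHVWAKVEADVAGH-G"

print(round(percent_identity(HH_NOGAP, SWM_NOGAP), 3))  # 36.0
print(round(percent_identity(HH_GAP, SWM_GAP), 3))      # 45.833
```

Percent similarity would additionally need a definition of which residue pairs count as "similar", which the slides do not give, so it is left out here.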
General Alignment Methodology
Sequence Modifications
1. Insertion - an insertion of a letter or several letters into the sequence.
AAGA → AAGTA
2. Deletion - deleting a letter (or more) from the sequence.
AAGA → AGA
3. Substitution - replacing a sequence letter by another.
AAGA → AACA
InDel - Insertions + Deletions
General Alignment Methodology
Measuring An Alignment
S = ACTG
T = AGT
S’ = AC_TG S’ = ACTG S’ = ACTG
T’ = A_GT_ T’ = AGT_ T’ = _AGT
Good: Identical characters- match.
Bad: Different characters- mismatch; gap (InDel).
• Each pair of characters gets a value, depending on its identity.
• The similarity score of the alignment is the sum of pair values.
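As a minimal sketch of "sum of pair values", the three candidate alignments of S = ACTG and T = AGT can be scored column by column. The values used here (match +1, mismatch -1, gap -2) are assumptions chosen for illustration; the slides do not fix them.

```python
def alignment_score(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum the per-column values of two aligned strings; '_' marks a gap."""
    if len(s) != len(t):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(s, t):
        if a == "_" or b == "_":
            total += gap        # InDel column
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total


candidates = [("AC_TG", "A_GT_"), ("ACTG", "AGT_"), ("ACTG", "_AGT")]
for s_aligned, t_aligned in candidates:
    print(s_aligned, t_aligned, alignment_score(s_aligned, t_aligned))
# Scores under these assumed values: -4, -1, -5
```

Under these values the second candidate scores best; a different choice of values can change the ranking.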
General Alignment Methodology
Alignment Scoring
1. Assume an independent mutation model
2. Score at each position
– Positive if the same/similar (e.g.
– Negative if different or gap
3. The score of an alignment is the sum of the position scores
General Alignment Methodology
Alignment Scoring
• Different scoring → different best alignments
• Scoring systems implicitly represent a particular theory of
evolution
– Some mismatches are more plausible
• Transition vs. Transversion
• Lys→Arg ≠ Lys→Cys
– Gap extension vs. gap opening
General Alignment Methodology
Alignment Scoring
Scoring Matrix
• An n × n matrix: n=4 for DNA, n=20 for proteins
• Each matrix entry defines the score for observing the
two letters aligned with each other
– Positive if the pair is likely to be observed
– Negative otherwise
     A    G    C    T
A    1
G   -5    1
C   -5   -5    1
T   -5   -5   -5    1
General Alignment Methodology
DNA scoring matrices
• Transitions – purine to purine or pyrimidine to pyrimidine
(4 possibilities)
• Transversions – purine to pyrimidine or pyrimidine to purine
(8 possibilities)
• By chance alone transversions should occur twice as often as
transitions.
• In practice, transitions are more frequent than transversions.
General Alignment Methodology
DNA scoring matrices
From \ To    A    G    C    T
A            2
G           -4    2
C           -6   -6    2
T           -6   -6   -4    2
Match: 2; Transition: -4; Transversion: -6
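The DNA matrix can be written as a small function that classifies a pair of bases and looks up its score. This is a sketch using the three values from the matrix (2, -4, -6); the names `substitution_kind` and `dna_score` are illustrative. Enumerating all ordered pairs of distinct bases also recovers the counts of 4 transitions and 8 transversions.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
SCORES = {"match": 2, "transition": -4, "transversion": -6}


def substitution_kind(a: str, b: str) -> str:
    """Classify a pair of bases as match, transition or transversion."""
    if a == b:
        return "match"
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def dna_score(a: str, b: str) -> int:
    """Score of observing base a aligned with base b."""
    return SCORES[substitution_kind(a, b)]


kinds = [substitution_kind(a, b) for a, b in permutations("AGCT", 2)]
print(kinds.count("transition"), kinds.count("transversion"))  # 4 8
print(dna_score("A", "G"), dna_score("A", "C"), dna_score("T", "T"))  # -4 -6 2
```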
General Alignment Methodology
Proteins scoring matrices
• Observation: some substitutions
are more frequent than others,
e.g., chemically similar amino
acids
• As for DNA, protein matrices
define the probabilities of
change between the different
amino acids
• Popular matrices are based on
empirical data: PAM & BLOSUM
T L Y D K
T L Y D K
T L Y E K
T L Y D K
T L Y Q K
T L Y D K
In the fourth column, E and D are found in 7 / 8
General Alignment Methodology
Proteins scoring matrices
Category            Amino Acid
Acids and Amides    Asp (D), Glu (E), Asn (N), Gln (Q)
Basic               His (H), Lys (K), Arg (R)
Aromatic            Phe (F), Tyr (Y), Trp (W)
Hydrophilic         Ala (A), Cys (C), Gly (G), Pro (P), Ser (S), Thr (T)
Hydrophobic         Ile (I), Leu (L), Met (M), Val (V)
General Alignment Methodology
BLOSUM Matrices
• Based on BLOCKS database:
~2000 blocks from 500 families of related proteins
• Blocks: short conserved patterns of 3-60 aa without gaps
• Different BLOSUMn matrices are calculated independently
from BLOCKS
• BLOSUMn is computed after clustering sequences that share at
least n percent identity, so it reflects substitutions between
sequences less than n percent identical
General Alignment Methodology
BLOSUM Matrices
• Low BLOSUM numbers for distant sequences
• High BLOSUM numbers for similar sequences
• Generally:
– BLOSUM62 for general use
– BLOSUM80 for close relations
– BLOSUM45 for distant relations
General Alignment Methodology
Types of Gap Penalties
(insertions or deletions)
InDels are rare in evolution, but once created they are easy to
extend:
• Gap open – penalty for the first residue in a gap
• Gap extension – penalty for each additional residue in a gap
General Alignment Methodology
Types of Gap Penalties
Motivation: Aligning cDNAs to Genomic DNA
cDNA query
Genomic DNA
Conclusion: gap opening and extension should be ranked
differently to properly align the sequences
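A minimal sketch of an affine gap penalty shows why ranking opening and extension differently matters when a cDNA is aligned to genomic DNA: one long gap (an intron) costs far less than several short gaps of the same total length. The values 10 for opening and 0.5 for extension are illustrative assumptions, not taken from the slides.

```python
from itertools import groupby


def affine_gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 0.5) -> float:
    """Penalty for one gap: gap_open for the first residue, gap_extend for each further one."""
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


def total_gap_penalty(aligned: str, gap: str = "-",
                      gap_open: float = 10.0, gap_extend: float = 0.5) -> float:
    """Sum the affine penalties over every run of gap characters in one aligned row."""
    return sum(
        affine_gap_penalty(len(list(run)), gap_open, gap_extend)
        for is_gap, run in groupby(aligned, key=lambda ch: ch == gap)
        if is_gap
    )


print(total_gap_penalty("ACGT------ACGT"))  # one gap of 6  -> 12.5
print(total_gap_penalty("AC--GT--AC--GT"))  # three gaps of 2 -> 31.5
```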
General Alignment Methodology
Summary: Scoring and Alignment
• The final score of the alignment is the sum of the
positive scores and penalty scores:
Alignment score =
Scoring matrix:
+ Number of Identities
+ Number of Similarities
Gap penalties:
- Number of Gap insertions
- Number of Gap extensions
Outline
1. Introduction
2. Applications
3. General Alignment Methodology
4. Pairwise Alignment:
• Smith-Waterman
• Needleman-Wunsch
5. Multiple Sequence Alignment:
• ClustalW
• MUSCLE
• T-coffee
Pairwise Alignment
Local vs. Global
• Global alignment – finds the best alignment across the entire
length of both sequences.
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
• Local alignment – finds regions of similarity in parts of the
sequences.
ADLG     CDRYFQ
||||     |||| |
ADLG     CDRYYQ
Pairwise Alignment
Global: Needleman & Wunsch (1970)
• Finds the best alignment over the entire length of the two
sequences only
• The Needleman-Wunsch algorithm is appropriate for
finding the alignment of two sequences which are:
(i) of similar length;
(ii) similar across their entire lengths.
Example:
SIMILARITY
PI-LLAR---
Needleman, S. B. and Wunsch, C. D., 1970
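The global alignment idea can be sketched as a small dynamic-programming routine. This is a teaching sketch with a linear gap penalty and assumed scores (match +1, mismatch -1, gap -2), not the exact scheme of the 1970 paper. When several alignments share the best score, the one returned depends on the traceback order and may differ from the example shown on the slide.

```python
def needleman_wunsch(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_s, aligned_t) for a global alignment of s and t."""
    n, m = len(s), len(t)
    # score[i][j] = best score for aligning s[:i] with t[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align s[i-1] with t[j-1]
                              score[i - 1][j] + gap,       # gap in t
                              score[i][j - 1] + gap)       # gap in s
    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return score[n][m], "".join(reversed(a)), "".join(reversed(b))


best, row_s, row_t = needleman_wunsch("SIMILARITY", "PILLAR")
print(best)
print(row_s)
print(row_t)
```

Because the alignment must span both sequences end to end, the shorter sequence is padded with gaps, which is why the method suits sequences of similar length.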
Pairwise Alignment
Global: Needleman & Wunsch (1970)
Pairwise Alignment
Local: Smith & Waterman (1981)
• Makes an optimal alignment of the best segment of similarity
• Suitable when comparing substantially different sequences
that share only short patches of similarity
• Use when one sequence is short and the other is very long
• Can return a number of highly aligned segments
• For example, the local alignment of SIMILARITY and PILLAR:
MILAR
ILLAR
Smith, T.F. and Waterman, M.S., 1981
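A local alignment can be sketched with the same dynamic-programming table as the global case, with two changes: cell values never drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scores used (match +2, mismatch -1, gap -2) are assumptions for illustration, and this sketch returns only one best segment. Which segment is reported for a given pair, including SIMILARITY and PILLAR, depends on the scores chosen.

```python
def smith_waterman(s: str, t: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_s, aligned_t) for one best local alignment of s and t."""
    n, m = len(s), len(t)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # A local alignment may start anywhere, so scores are floored at zero.
            h[i][j] = max(0, h[i - 1][j - 1] + sub, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(smith_waterman("XXXACGTXXX", "YYACGTYY"))  # (8, 'ACGT', 'ACGT')
print(smith_waterman("SIMILARITY", "PILLAR"))
```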
Pairwise Alignment
Local: Smith & Waterman (1981)
Smith, T.F. and Waterman, M.S., 1981
Pairwise Alignment
User Input
• Pair of sequences
• Local or global alignment
• Scoring:
– Gap penalties: opening/extension
– Scoring matrix
Outline
1. Introduction
2. Applications
3. General Alignment Methodology
4. Pairwise Alignment:
• Smith-Waterman
• Needleman-Wunsch
5. Multiple Sequence Alignment:
• ClustalW
• MUSCLE
• T-coffee
Multiple Sequence Alignment
Pairwise Vs. Multiple Sequence Alignment
Alignments help to analyze sequence data:
organize and visualize.
Pairwise:
For 2 sequences
F G K - G K G
F G K F G K G
MSA:
For more than 2 sequences
F G K - G K G
F G K F G K G
- G K Q G K G
- - K F G K G
Multiple Sequence Alignment
Rules For Choosing Sequences
• Very similar sequences add little information
• Very different sequences cause trouble: those <30% identical
to more than half of the other sequences in the set
• Choose sequences as distantly related as possible
– Sequences between 30-80% identical to more than half
of the sequences in the set
• The more sequences the better
Multiple Sequence Alignment
Similarity Score of MSA
• Each position gets a value, depending on its identity.
• The similarity score of the alignment is the sum of all
position values.
• A popular way to compute position values:
SP - Sum of Pairs - each pair gets the score from the
similarity matrix (PAM, BLOSUM).
Goal: Find MSA with maximum similarity score
Bad News: This problem is NP-hard
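The sum-of-pairs score can be sketched as follows. In place of a PAM or BLOSUM lookup, this sketch uses a toy pair function (match +1, mismatch -1, residue against gap -2, gap against gap 0); all four values are assumptions for illustration.

```python
from itertools import combinations
from typing import Sequence


def pair_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Toy stand-in for a similarity-matrix lookup."""
    if a == "-" and b == "-":
        return 0          # assumed convention: two gaps neither help nor hurt
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(msa: Sequence[str]) -> int:
    """Add the scores of every pair of rows in every column of the alignment."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all rows of the alignment must have equal length")
    total = 0
    for column in zip(*msa):
        total += sum(pair_score(a, b) for a, b in combinations(column, 2))
    return total


print(sum_of_pairs(["FGK-GKG", "FGKFGKG", "-GKQGKG"]))  # 7
```

Scoring a given alignment is cheap; the hard part is searching over all possible alignments for the one with the maximum score.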
Multiple Sequence Alignment
More than a handful of MSA methods exist…
(Figure: methods range from approximate and fast to accurate and slow)
Multiple Sequence Alignment
ClustalW (1994)- Introduction
• This heuristic approach works because it uses the biological
meaning of MSA
• Based on the idea that the sequences we usually want to align
are phylogenetically related: a pairwise alignment algorithm is
used iteratively, first to align the most closely related pair of
sequences, then the next most similar one to that pair.
• Rule “once a gap, always a gap”: The gaps between more
similar pairs of sequences should not be affected by more
distantly related ones.
Thompson, J.D. et al., 1994
Multiple Sequence Alignment
ClustalW- Progressive Alignment
1. Quick pairwise alignment: calculate distance matrix
               1    2    3    4    5
Hbb_Human 1    -
Hbb_Horse 2   17    -
Hba_Human 3   59   60    -
Hba_Horse 4   59   59   13    -
Myg_Whale 5   77   77   75   75    -
2. Build a guide tree using the NJ phylogenetic method
(Figure: guide tree with leaves Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, Myg_Whale)
3. Progressive alignment following guide tree
Multiple Sequence Alignment
ClustalW- Progressive Alignment
     A    B    C    D
A    -    -    -    -
B    1    -    -    -
C    7    8    -    -
D   11    5    2    -
(Figure: guide tree with leaves A, B, C, D)
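The order in which a progressive aligner joins sequences can be sketched from the A to D distance matrix. ClustalW builds its guide tree with neighbor joining; to stay short, this sketch swaps in UPGMA-style average-linkage clustering, so it is a stand-in for the idea of "closest first", not ClustalW's actual tree construction. The function name `upgma` and the nested-tuple tree format are illustrative.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a nested-tuple guide tree by repeatedly joining the two closest clusters.

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    """
    # Each cluster is (tree, members): tree is a name or a nested tuple.
    clusters = [(name, [name]) for name in names]

    def average_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


DIST = {frozenset(pair): d for pair, d in {
    ("A", "B"): 1, ("A", "C"): 7, ("B", "C"): 8,
    ("A", "D"): 11, ("B", "D"): 5, ("C", "D"): 2,
}.items()}

print(upgma("ABCD", DIST))  # (('A', 'B'), ('C', 'D'))
```

With these distances A and B are aligned first (distance 1), then C and D (distance 2), and finally the two resulting profiles are aligned to each other.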
Multiple Sequence Alignment
ClustalW- Additional Features
• Sequence weighting:
– Each sequence gets a weight derived from the guide tree
– Close sequences are down-weighted
– Distant sequences receive high weights
– The weights are normalized so that the highest is 1
(Figure: guide tree annotated with weights w1 to w7)
W(Hbb_Human) = .081 + 1/2*.226 + 1/4*.061 + 1/5*.015 + 1/6*.062 = 0.221
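The weight formula for Hbb_Human reads as: walk from the leaf to the root and add each branch length divided by the number of sequences that share that branch. A minimal sketch of that arithmetic follows; the branch lengths and sharing counts are the ones printed in the formula, and the function name `sequence_weight` is illustrative. With the lengths as printed, the sum comes to about 0.2226, slightly above the quoted 0.221.

```python
def sequence_weight(path):
    """Sum branch_length / number_of_sequences_sharing_the_branch along a leaf-to-root path."""
    return sum(length / shared for length, shared in path)


# (branch length, number of sequences sharing the branch), leaf to root
HBB_HUMAN_PATH = [(0.081, 1), (0.226, 2), (0.061, 4), (0.015, 5), (0.062, 6)]

weight = sequence_weight(HBB_HUMAN_PATH)
print(round(weight, 4))  # 0.2226
```

A sequence with close relatives shares most of its path with them and so ends up with a lower weight than an isolated sequence.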
Multiple Sequence Alignment
ClustalW- Problems
• Sequences that are similar only in some smaller regions
→ ClustalW tries to find global alignments, not local.
• A sequence that contains a large insertion compared to the rest
→ global, not local
• A sequence that contains a repetitive element, while another sequence
contains only one copy.
Multiple Sequence Alignment
MUSCLE- Introduction
• The most recent popular MSA software (as of November 2010)
• Considered at the time to be the most accurate MSA software
available
• The basic idea: iterative progressive alignment
Edgar, R.C., 2004
Multiple Sequence Alignment
MUSCLE Innovations
Faster:
• Faster distance estimation between the input sequences
• Faster construction of an evolutionary tree
(UPGMA instead of NJ in ClustalW)
More accurate:
• Applying a new score function to the profile alignments
• Refinement of the initial results
Edgar R.C., 2004
Multiple Sequence Alignment
MUSCLE Innovations- Refinement Step
• An edge is chosen from the progressive alignment tree.
• The tree is divided into two subtrees by deleting this edge.
• The MSA from each subtree is computed by progressive alignment.
• The two MSAs are aligned, generating an entire new MSA
• If the new MSA achieves a higher score than the previous one → keep it
(Figure: the old MSA is split into MSA1 and MSA2, which are re-aligned into the new MSA)
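The accept-if-better loop of the refinement step can be sketched as below. The real work (splitting the tree at an edge, re-aligning the two sub-alignments and scoring the result) is hidden behind two caller-supplied functions, `realign_across_edge` and `score`; both names are hypothetical placeholders, and the toy versions at the bottom exist only to make the sketch runnable. None of this reflects MUSCLE's actual code.

```python
from typing import Callable, Iterable, List, TypeVar

Edge = TypeVar("Edge")
MSA = List[str]


def refine(msa: MSA, edges: Iterable[Edge],
           realign_across_edge: Callable[[MSA, Edge], MSA],
           score: Callable[[MSA], float]) -> MSA:
    """Visit each tree edge, rebuild the alignment across it, keep it only if the score improves."""
    best, best_score = msa, score(msa)
    for edge in edges:
        candidate = realign_across_edge(best, edge)
        candidate_score = score(candidate)
        if candidate_score > best_score:   # otherwise the old MSA is kept
            best, best_score = candidate, candidate_score
    return best


# Toy stand-ins, for demonstration only.
def conserved_columns(msa: MSA) -> int:
    """Toy score: number of columns in which all rows agree."""
    return sum(1 for column in zip(*msa) if len(set(column)) == 1)


TOY_REALIGNMENTS = {"e1": ["AC-T", "A-CT"], "e2": ["ACT", "ACT"]}


def toy_realign(msa: MSA, edge: str) -> MSA:
    """Toy re-alignment: returns a fixed candidate for each edge."""
    return TOY_REALIGNMENTS[edge]


print(refine(["A-CT", "AC-T"], ["e1", "e2"], toy_realign, conserved_columns))
# ['ACT', 'ACT']
```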
Multiple Sequence Alignment
MUSCLE- It’s Even More Complicated…
Multiple Sequence Alignment
All Against All- SH2 domains
T-coffee
MUSCLE
Edgar, R.C., 2004
Multiple Sequence Alignment
All Against All- BaliBase 2005
MUSCLE is superior
in some cases….
Edgar, R.C., 2004
Multiple Sequence Alignment
All Against All- PREFAB
T-coffee in others…
→ Trial and error is the best approach
Edgar, R.C., 2004