Lecture 3

Sequence Alignment
Outline
• Alignment of pairs of sequences
• Local and global alignments
• Methods of alignment
    Dot matrix analysis
    Dynamic programming approach
• Use of scoring matrices and gap penalties
• Scoring matrices: PAM and BLOSUM
What is sequence alignment?
Sequence alignment is a way of arranging DNA, RNA or protein
sequences to identify regions of similarity that may be
a consequence of functional, structural or evolutionary
relationships between the sequences.
Comparing two sequences (pairwise alignment) or more than two
(multiple sequence alignment) means searching for a series of
individual characters or patterns that are in the same order in the
sequences.
• There are two types of alignment: local and global.
Global alignment vs Local alignment
• Global alignment attempts to match as much of the two sequences as
possible, from end to end. Tools for global alignment are based on the
Needleman-Wunsch algorithm.
• Local alignment tries to find the regions with the highest density of
matches. Tools for local alignment are based on the Smith-Waterman
algorithm.
• Both algorithms are derived from the basic dynamic programming
algorithm.
Global alignment
L G P S S K Q T G K G S - S R I W D N
L N - I T K S A G K G A I M R L G D A

Local alignment
- - - - - - - T G K G - - - - - - - -
- - - - - - - A G K G - - - - - - - -
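The difference between the two alignment types can be sketched in a few lines of Python. This is only a minimal illustration: the match, mismatch and gap values are assumptions chosen for readability, not values from the lecture, and only the scores are computed, not the alignments themselves.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch style score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: leading gaps cost
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman style score: best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)  # a local alignment may start anywhere for free
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0,                         # restart instead of going negative
                       prev[j - 1] + pair,
                       prev[j] + gap,
                       curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# The two sequences from the figure, with gap characters removed.
seq1 = "LGPSSKQTGKGSSRIWDN"
seq2 = "LNITKSAGKGAIMRLGDA"
print("global:", global_score(seq1, seq2))
print("local: ", local_score(seq1, seq2))
```

The only differences are the initial row and column and the extra 0 in the maximum: the global version pays for every unmatched position, while the local version is free to ignore everything outside the best-matching region.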
Why do sequence alignment?
• Sequence alignment is useful for discovering structural,
functional and evolutionary information in biological sequences.
• Sequences that are very much alike may have similar secondary
and 3D structure, similar function and likely a common ancestral
sequence. It is extremely unlikely that such sequences became
similar by chance.
-- For DNA molecules with n nucleotides, the probability of a chance
match is very low: P = 4^-n.
-- For proteins with n amino acids, the probability is much lower still:
P = 20^-n.
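To get a feeling for how quickly these probabilities shrink, they can be evaluated directly. The sketch below assumes, as the formulas do, that every residue is equally likely and that positions are independent; the function name is just for illustration.

```python
def chance_identity(alphabet_size, n):
    """Probability that two random sequences of length n are identical,
    assuming equally likely, independent residues."""
    return alphabet_size ** -n


for n in (5, 10, 20):
    print(f"n={n:2d}  DNA: {chance_identity(4, n):.3e}  "
          f"protein: {chance_identity(20, n):.3e}")
```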
Sequence alignment helps with the following tasks: 1. annotation
of new sequences; 2. modelling of protein structures; 3. design and
analysis of gene expression experiments.
An example of aligning text strings
Raw Data ???
T C A T G
C A T T G

2 matches, 0 gaps
T C A T G
      | |
C A T T G

3 matches (2 end gaps)
T C A T G .
  | | |
. C A T T G

4 matches, 1 insertion
T C A - T G
  | |   | |
. C A T T G

4 matches, 1 insertion
T C A T - G
  | | |   |
. C A T T G
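The counts in the four candidate alignments above can be checked with a short script. It is a minimal sketch that follows the notation of the example: "-" is taken to mark an internal gap and "." an end gap; the function name is made up for illustration.

```python
def count_alignment(top, bottom):
    """Return (matches, internal_gaps, end_gaps) for two gapped strings
    of equal length. '-' marks an internal gap, '.' an end gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    matches = internal_gaps = end_gaps = 0
    for x, y in zip(top, bottom):
        if "." in (x, y):
            end_gaps += 1
        elif "-" in (x, y):
            internal_gaps += 1
        elif x == y:
            matches += 1
    return matches, internal_gaps, end_gaps


candidates = [
    ("TCATG", "CATTG"),
    ("TCATG.", ".CATTG"),
    ("TCA-TG", ".CATTG"),
    ("TCAT-G", ".CATTG"),
]
for top, bottom in candidates:
    print(top, bottom, count_alignment(top, bottom))
```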
Terminology of sequence comparison
• Sequence identity -- exactly the same amino acid or nucleotide in the
same position.
• Sequence similarity -- substitutions with similar chemical properties.
• Sequence homology -- general term that indicates evolutionary
relatedness among sequences; it is usually assessed by measuring the
percentage of sequence identity.
• Pairwise alignment -- used to find the best-matching piecewise (local)
or global alignment of two query sequences. Pairwise alignments
can only be used between two sequences at a time.
• Multiple sequence alignment -- tries to align all of the sequences in a
given query set.
Methods of pairwise alignment
• Dot matrix analysis
• The dynamic programming (DP) algorithm
• Word methods
What is Dot matrix analysis?
• A dot matrix analysis is a method for comparing two
sequences to look for possible alignments (Gibbs and McIntyre
1970).
• The algorithm for a dot matrix:
1. One sequence (A) is listed across the top of the matrix and the
other (B) is listed down the left side.
2. Starting from the first character in B, one moves across the page
keeping in the first row and placing a dot in every column where the
character in A is the same.
3. The process is continued, row by row, until all possible comparisons
between A and B have been made.
4. Any region of similarity is revealed by a diagonal row of dots.
5. Isolated dots not on a diagonal represent random matches.
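The five steps above translate almost directly into code. The following is a minimal sketch rather than the method of any particular tool; the function names and the text rendering ("*" for a dot, "." for an empty cell) are choices made here for illustration.

```python
def dot_matrix(seq_a, seq_b):
    """One row per character of B, one column per character of A;
    a cell is True where the two characters are the same."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(matrix, seq_a, seq_b):
    """Draw the matrix as text: A across the top, B down the left side."""
    lines = ["  " + " ".join(seq_a)]
    for ch, row in zip(seq_b, matrix):
        lines.append(ch + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


seq_a, seq_b = "TCATG", "CATTG"
print(render(dot_matrix(seq_a, seq_b), seq_a, seq_b))
```

In the printed grid, the run of three dots on one diagonal corresponds to the shared stretch C A T; the remaining dots are isolated matches.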
What can Dot matrix analysis do?
• Detection of matching regions can be improved by
filtering out random matches; this can be achieved by
using a sliding window.
• It can be used to assess repetitiveness in a single
sequence, such as direct and inverted repeats within the
sequence.
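A sliding-window filter can be sketched as follows: a dot is placed only when enough characters agree inside a window that starts at the given pair of positions. The window size and threshold below are assumed example values, not settings taken from the lecture or from any specific program.

```python
def windowed_dot_matrix(seq_a, seq_b, window=3, min_matches=3):
    """Dot at (i, j) only if at least `min_matches` of the `window`
    characters starting at B[i] and A[j] are identical."""
    rows = len(seq_b) - window + 1
    cols = len(seq_a) - window + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_b[i + k] == seq_a[j + k] for k in range(window))
            row.append(matches >= min_matches)
        matrix.append(row)
    return matrix


# Two different sequences: only the shared word CAT survives the filter.
print(windowed_dot_matrix("TCATG", "CATTG"))

# One sequence against itself: dots off the main diagonal reveal a direct repeat.
repeat = windowed_dot_matrix("ACGTACGT", "ACGTACGT", window=4, min_matches=4)
print(repeat[0][4], repeat[4][0])
```

Lowering `min_matches` below `window` tolerates some mismatches inside the window, at the price of letting more random matches through.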
1st example of Dot matrix analysis: two identical sequences
http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html
2nd example of Dot matrix analysis: two very different sequences
http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html
3rd example of Dot matrix analysis: two similar sequences
http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html
Dynamic programming algorithm
• The approach compares every pair of characters in the two sequences
and generates the best (optimal) alignment under the chosen scoring scheme.
• The method can also be useful in aligning nucleotide sequences to protein sequences.
The method is computationally demanding: the recursion has to score
every pair of positions, so the work grows with the product of the two
sequence lengths.
New algorithmic improvements, together with increasing computer capacity,
make it possible to align a query sequence against a large database in a few
minutes.
Two approaches to dynamic programming: top-down and
bottom-up.
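The two approaches can be compared on the same recurrence. In the sketch below, the top-down version is a recursive function whose results are cached, and the bottom-up version fills a table row by row; both return the same global alignment score. The scoring constants are assumptions for illustration only.

```python
from functools import lru_cache

MATCH, MISMATCH, GAP = 1, -1, -2  # assumed example scores


def score_top_down(a, b):
    """Recursion with memoisation: solve a prefix pair only when asked for."""
    @lru_cache(maxsize=None)
    def best(i, j):
        if i == 0:
            return j * GAP
        if j == 0:
            return i * GAP
        pair = MATCH if a[i - 1] == b[j - 1] else MISMATCH
        return max(best(i - 1, j - 1) + pair,
                   best(i - 1, j) + GAP,
                   best(i, j - 1) + GAP)
    return best(len(a), len(b))


def score_bottom_up(a, b):
    """Table filling: solve every prefix pair in order, smallest first."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        table[i][0] = i * GAP
    for j in range(1, len(b) + 1):
        table[0][j] = j * GAP
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            pair = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            table[i][j] = max(table[i - 1][j - 1] + pair,
                              table[i - 1][j] + GAP,
                              table[i][j - 1] + GAP)
    return table[len(a)][len(b)]


print(score_top_down("TCATG", "CATTG"), score_bottom_up("TCATG", "CATTG"))
```

The recursive form is limited by Python's recursion depth on long sequences, which is one practical reason table filling is the more common choice.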
The procedure of the dynamic programming algorithm
• The alignment procedure depends upon a scoring system based on
the probability that:
1) a particular amino acid pair is found in alignments of related proteins
(p_xy);
2) the same amino acid pair is aligned by chance (p_x p_y);
3) introduction of a gap would be a better choice as it increases the score.
• A substitution matrix is built from the ratio of the first two
probabilities. There are many such matrices; two of them, PAM and
BLOSUM, are discussed in the next few slides.
• The scores for introducing a gap and for extending it accompany the
matrices and represent prior knowledge and some assumptions.
One simple example: if the penalty for a gap is too
high, a reasonable alignment between slightly different sequences will
never be achieved, but if it is too low an optimal alignment is hardly possible.
Other assumptions are based on sophisticated statistical procedures.
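One common way to score gap introduction and gap extension separately is an affine penalty, sketched below. The numbers are assumptions: 11 echoes the single gap penalty used in the lecture's worked example, and the extension cost of 1 is chosen only for illustration.

```python
def gap_penalty(length, gap_open=11, gap_extend=1):
    """Affine gap penalty: the first gap position costs `gap_open`,
    every further position in the same gap costs `gap_extend`."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


for length in (1, 2, 5):
    print(length, gap_penalty(length))
```

With these values one long gap is much cheaper than several short ones of the same total length, which is the usual motivation for treating opening and extension differently.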
An example: scoring a sequence alignment
with a gap penalty
Sequence 1   V   D   S   -   C   Y
Sequence 2   V   E   S   L   C   Y
Score        4   2   4  -11  9   7
Score = sum of amino acid pair scores (26)
minus single gap penalty (11) = 15
Note: 1. Non-identical amino acids are likely to be placed in
corresponding positions.
2. The scores gained by matches are not all the same; for
instance, a match between two rare amino acids scores more than a
match between two common ones.
3. Gaps may be introduced into the alignment to optimise the
score. Each gap introduced incurs a penalty.
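The arithmetic of this example can be reproduced with a small lookup table. The pair scores and the gap penalty are the ones shown in the example; the dictionary contains only those five pairs, so it is a toy table rather than a full substitution matrix.

```python
# Pair scores and gap penalty copied from the worked example.
PAIR_SCORES = {
    ("V", "V"): 4,
    ("D", "E"): 2,
    ("S", "S"): 4,
    ("C", "C"): 9,
    ("Y", "Y"): 7,
}
GAP_PENALTY = 11


def alignment_score(top, bottom):
    """Sum the pair scores column by column; a column with '-' costs a gap."""
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total -= GAP_PENALTY
        else:
            total += PAIR_SCORES[(x, y)]
    return total


print(alignment_score("VDS-CY", "VESLCY"))  # 26 - 11 = 15
```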
Steps for the dynamic programming algorithm
1. Score of new alignment = Score of previous alignment (A) + Score of new aligned pair

   V D S - C Y       V D S - C       Y
   V E S L C Y       V E S L C       Y
        15       =       8       +   7

2. Score of alignment (A) = Score of previous alignment (B) + Score of new aligned pair

   V D S - C         V D S -         C
   V E S L C         V E S L         C
       8         =     -1        +   9

3. Repeat removing aligned pairs until the end of the alignment is reached
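The same decomposition can be followed in code by keeping a running total over the alignment columns and then peeling the last column off repeatedly. The column scores are taken from the worked example (4, 2, 4, -11, 9, 7); the helper name is made up for this sketch.

```python
from itertools import accumulate

# Column scores of V D S - C Y over V E S L C Y, from the worked example.
column_scores = [4, 2, 4, -11, 9, 7]


def peel_back(scores):
    """Return (new_score, previous_score, last_pair_score) for each step,
    starting from the full alignment and removing one column at a time."""
    prefix = list(accumulate(scores))
    return [(prefix[k], prefix[k - 1], scores[k])
            for k in range(len(scores) - 1, 0, -1)]


for new, previous, pair in peel_back(column_scores):
    print(f"{new} = {previous} + {pair}")
```

The first two printed lines are the two steps shown above: 15 = 8 + 7 and 8 = -1 + 9.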
Why use a substitution matrix?
• To determine the likelihood of homology between two
sequences.
• Substitutions that are more likely should get a
higher score.
• Substitutions that are less likely should get a
lower score.
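How to calculate Scoring Matrices
• Log-odds matrix where each cell gives the log-odds score of
aligning those two residues
• Score of alignment = Sum of log-odds scores of residues
• Score for each residue pair given by:
\[ s(a, b) = \frac{1}{\lambda} \log \left( \frac{p_{ab}}{f_a f_b} \right) \]
where p_ab is the probability of finding a and b aligned in related
sequences, f_a and f_b are the background frequencies of the two
residues, and \lambda is a scaling factor.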
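A numerical sketch of the log-odds score is given below. The probabilities are invented toy values, and the base of the logarithm and the scaling factor are assumptions (base 2 and a factor of 1 are used here so the results are easy to read).

```python
import math


def log_odds_score(p_ab, f_a, f_b, lam=1.0, base=2):
    """s(a, b) = (1 / lam) * log(p_ab / (f_a * f_b))."""
    return (1.0 / lam) * math.log(p_ab / (f_a * f_b), base)


# Toy values: both residues have a background frequency of 0.1,
# so a chance pairing is expected with probability 0.01.
print(log_odds_score(0.04, 0.1, 0.1))   # pair seen more often than chance
print(log_odds_score(0.01, 0.1, 0.1))   # pair seen exactly as often as chance
print(log_odds_score(0.005, 0.1, 0.1))  # pair seen less often than chance
```

The sign carries the interpretation: positive when the pair occurs more often in related sequences than chance predicts, zero when the two are equal, and negative otherwise.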
Types of Matrices
• Percent Identity
-- Standard scoring matrix to align DNA sequences
• PAM
-- Estimates the rate at which each possible residue in a
sequence changes to each other residue over time
• BLOSUM-X
-- Built from conserved blocks in which sequences are clustered at X%
identity; used to identify sequences that are about X% similar to the
query sequence
Scoring matrices: PAM (Percent Accepted Mutation) and
BLOSUM62 (BLOcks amino acid SUbstitution Matrices)
Amino acids are grouped according to the
chemistry of the side group: (C) sulfhydryl, (STPAG) small hydrophilic,
(NDEQ) acid, acid amide and
hydrophilic, (HRK) basic, (MILV) small hydrophobic,
and (FYW) aromatic. Log-odds values: +10 means
that the probability of a common ancestor is greater than that of a
chance match, 0 means that the
two probabilities are equal, and -4 means that the pairing is more
likely to be random. Thus the score of the alignment YY/YY is
10 + 10 = 20, whereas YY/TP scores -3 - 5 = -8, a rare and
unexpected alignment between homologous sequences.
BLOSUM is based on local alignments. BLOSUM was first
introduced in a paper by Henikoff and Henikoff. They
scanned the BLOCKS database for very conserved regions of protein families
(that do not have gaps in the sequence alignment) and
then counted the relative frequencies of amino acids and
their substitution probabilities. Then, they calculated a log-odds
score for each of the 210 possible substitution pairs of
the 20 standard amino acids.
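The figure of 210 comes from counting unordered pairs of the 20 amino acids, including each amino acid paired with itself (190 + 20). A quick check, using the standard one-letter codes:

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

# Unordered pairs, identical pairs such as (A, A) included.
pairs = list(combinations_with_replacement(AMINO_ACIDS, 2))
print(len(AMINO_ACIDS), len(pairs))
```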
Word methods
• Word methods, also known as k-tuple methods, are
heuristic methods that are not guaranteed to find an
optimal alignment, but are significantly more
efficient than dynamic programming.
• Typical tools that use this approach are BLAST and
FASTA.
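The core idea behind k-tuple methods can be sketched as follows: index all words of length k in one sequence, look up the words of the other sequence, and see which diagonal collects the most hits. This is only an illustration of the seeding idea under assumed names and a small k; it is not how BLAST or FASTA are actually implemented.

```python
from collections import Counter, defaultdict


def word_index(seq, k):
    """Map every word of length k to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def shared_word_diagonals(query, subject, k=2):
    """Count exact word hits per diagonal (query position minus subject position)."""
    index = word_index(query, k)
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            diagonals[i - j] += 1
    return diagonals


hits = shared_word_diagonals("TCATG", "CATTG", k=2)
print(hits.most_common())
```

Only the diagonals with many hits would then be examined more closely, which is where the speed-up over scoring every pair of positions comes from.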
The list of sequence alignment software
• http://en.wikipedia.org/wiki/List_of_sequence_alignment_software