Download 10 - digbio

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Project Phase II Report





Due on 10/20, send me through email
Write on top of Phase I report.
5-20 Pages
Free style in writing (use 11pt font or larger)
Methods
 Overview (high-level description)
 Source of data
 Algorithm (pseudo code)
 Prove or argue why the algorithm will work
 Analyze the computational complexity of the algorithm
 Plan of implementation
Multiple Sequence
Alignment
Dong Xu
Computer Science Department
109 Engineering Building West
E-mail: [email protected]
573-882-7064
http://digbio.missouri.edu
As we proceed …
Warning:
Muddy
Road
Ahead!!!
Outline
 Background
 Scoring
function
 Dynamic
 Star
(what, why)
programming
alignment
 Progressive
 Profile
alignment
and Psi-Blast
Introduction
 The
multiple sequence alignment of a
set of sequences may be viewed as
an evolutionary history of the
sequences.
 No
sequence ordering is required.
An Example of
Multiple Alignment
VTISCTGSESNIGAG-NHVKWYQQLPG
VTISCTGTESNIGS--ITVNWYQQLPG
LRLSCSSSDFIFSS--YAMYWVRQAPG
LSLTCTVSETSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG-ATLVCLISDFYPGA--VTVAWKADS-AALGCLVKDYFPEP--VTVSWNSG--VSLTCLVKEFYPSD--IAVEWWSNG--
Why Multiple Alignment (1)

Natural extension of Pairwise Sequence
Alignment
 “Pairwise
alignment whispers… multiple
alignment shouts out loud” Hubbard et
al 1996

Much more sensitive in detecting sequence
relationship and patterns
Why Multiple Alignment (2)

Give hints about the function and
evolutionary history of a set of sequences

Foundation for phylogenic tree
construction and protein family
classification

Useful for protein structure prediction…
Outline
 Background
 Scoring
function
 Dynamic
 Star
(what, why)
programming
alignment
 Progressive
 Profile
alignment
and Psi-Blast
Idea of Scoring
 Requirement
of a good quality of
alignment measure
Additive function
Function must be independent of order of
arguments
Should reward presence of many equal or
strongly related symbols (in the same
column) and penalize unrelated symbols and
spaces.
Distance from Consensus

Consensus sequnece: single sequence which
represents the most common amino acid/base
in that position
D
D
E
D
E
D

D
G
G
G
G
G
G
G
G
G
G
A
I
I
A
A/I
V
L
L
V
V/L
V
V
V
V
E
E
E
Q
Q
E
A
A
A
A
A
A
L
L
L
V
L
L
distance from consensus: total number of
characters in the alignment that differs from the
consensus character of their columns
Sum-of-pairs’ (SP) measure



The Sum of Pairs (SP) method is as follows:
Given (1) a set of N aligned sequences each
of length L in the form of a L x N MSA
alignment matrix M and (2) a substitution
matrix (PAM or BLOSUM) that gives a cost
c(x,y) for aligning two characters x, y .
The SP score SP(mi) for the I-th column of
M denoted by mi is calculated:
SP(mi) = SUM(j, k) [ c( mij , mik) ]
The SP score for M is SUM(i) [SP(mi) ]
Feature of SP Measure

Theorem: Let alpha be a multiple alignment
of the set of sequences s1, …, sk; and
alpha(I,j) denote the pairwise alignment of si
and sj as induced by alpha. Then
SP-score(alpha) = Sum over i,j [score(alpha(i,j)]
 The above is only true if we have s(-,-) = 0.
 This is because in pairwise alignment the
presence of two aligned spaces (–) in the two
sequences are ignored.
Profile Distance
 In
the minimum entropy approach the
score of multiple alignment is defined
as the sum of entropies of the
columns (the entropy of a column is
related with the frequency of letter x
in the column)
 Information content
 Profile score
Outline
 Background
 Scoring
function
 Dynamic
 Star
(what, why)
programming
alignment
 Progressive
 Profile
alignment
and Psi-Blast
Dynamic Programming

Multiple sequence
alignment can be
done with dynamic
programming, using
(n+1)k memory
cells, where n is the
length of each of
the k sequences to
be aligned.
Sequence 2
Sequence 3
Sequence 1
Complexity Analysis of
Dynamic Programming
 Running
time:
(n+1)k number of entries in the table
For each entry we need to find the maximum
of 2k -1 elements
Finding the SP-score corresponding to each
element means adding O(k2) numbers
Total = O(k22knk) i.e., NP-hard
Outline
 Background
 Scoring
function
 Dynamic
 Star
(what, why)
programming
alignment
 Progressive
 Profile
alignment
and Psi-Blast
Center Star Method
 The
center star method is an
approximation algorithm, if scores
can be decoupled into pairwise
distances.
 minimizing sum of pairwise distances.
 Works well when a distance satisfies
the triangle inequality: dst <= dsc + dct
for all sequences s, t, and c.
Star Alignments
 Select
a sequence sc as the center of
the star
each sequence s1, …, sk such
that index i  c, perform a
Needleman-Wunsch global alignment
 For
 Aggregate
alignments with the
principle “once a gap, always a gap.”
Star Alignments Example
s1
MPE
MSKE
| |
- ||
MKE
MKE
s1: MPE
s2: MKE
s3: MSKE
s4: SKE
s3
s2
SKE
MPE
MKE
||
MKE
s4
-MPE
-MKE
MSKE
-MPE
-MKE
MSKE
-SKE
Choosing a Center
 Try
them all and pick the one with the
best score
all O(k2) alignments, and
pick the sequence sc that maximizes
 Calculate
 score(s , s )
i c
i
c
Performance
 Therefore,
if C is the center star
method SP score, then C/2 is a lower
bound to the SP score of any multiple
alignment.
 C is at most as bad as twice of the
optimum, and this bound on how bad
the result can be makes the center
star method an approximation
algorithm to the NP-hard problem.
Complexity Analysis
 Assuming
k sequences have length n
O(n2) to calculate 1 global alignment
O(k2) global alignments to calculate
O(k2n2) overall cost
Outline
 Background
 Scoring
function
 Dynamic
 Star
(what, why)
programming
alignment
 Progressive
 Profile
alignment
and Psi-Blast
Progressive Alignment
 Devised
by Feng and Doolittle in
1987.
 Essentially a heuristic method, not
guaranteed to find the ‘optimal’
alignment.
 Multiple alignment is achieved by
successive application of pairwise
methods.
Basic Algorithm

Compare all sequences pairwise.

Perform cluster analysis on the pairwise
data to generate a hierarchy for alignment
(guide tree).

Build alignment step by step according to
the guide tree. Build the multiple alignment
by first aligning the most similar pair of
sequences, then add another sequence or
another pairwise alignments.
Steps in Progressive
Multiple Alignment

Compare pairwise sequences

Perform cluster analysis on pairwise data
to generate hierarchy for alignment
Building Guided Tree
Using Weight
Hbb_Human
Hbb_Horse
-
Quick pairwise alignment:
.59 .60 - calculate distance matrix
.17
-
Hba_Horse
.59
.59
.13
-
Myg_Whale
.77
.77
.75
.75
Hba_Human
Hbb_Human
Hbb_Horse
Hba_Human
-
2
3
1
Hba_Horse
Myg_Whale
Neighbor-joining tree (guide tree)
4
Alignment (1)

Build multiple alignments by first aligning most
similar pair, then next similar pair etc.
Alignment (2)
Scoring
PQRRZW
YQRKZX
YZTUOP
TZZ_FO
Total Score =
[w(R, U ) + w(R,  ) + w(K, U) + w(K, ) ] /
4
CLUSTAL

Most successful implementation of
progressive alignment (Des Higgins)

CLUSTAL - gives equal weight to all sequences

CLUSTALW - has the ability to give different
weights to the sequences

CLUSTALX - provides a GUI to CLUSTAL

http://searchlauncher.bcm.tmc.edu/multialign/multi-align.html
Problems with Progressive
Alignments
Greedy Nature: No guarantee for global
optimum
 Depends on pairwise alignments
 If sequences are very distantly related,
much higher likelihood of errors
 Care must be made in choosing scoring
matrices and penalties
 Other approaches using Bayesian
methods such as hidden Markov models

Outline
 Background
 Scoring
function
 Dynamic
 Star
(what, why)
programming
alignment
 Progressive
 Profile
alignment
and Psi-Blast
Concept of Profile
Seq1->
Seq2->
Seq3->
Seq4->
Information about the degree of conservation
of sequence positions is included
Position-specific Score
Matrix (PSSM)

For protein of length L, scoring matrix is L x
20, PSSM(i,j) --“Profile”: specific scores for
each of the 20 amino acids at each position in a
sequence.

For highly conserved residues at a particular
position, a high positive score is assigned, and
others are assigned high negatives.

For a weakly conserved position, a value close
to zero is assigned to all the amino acid types.
Building a Profile
 First, get multiple sequence alignments using
substitution matrix, Sjk.
 Second, count the number of occurrences of amino acid
k at position i, Cik.
 (1) Average-score method:
Wij = Sk CikSjk / N.
 (2) log-odds-ratio formula:
Wij = log(qij/pj).
qij = Cij / N.
pj : background probability of residue j.
Calculating Profiles (1)

Gribskov et al, Proc. Natl. Acad. Sci. USA 84, 4355-4358,
1987
ACGCTAFKI
Wij = Sk CikSjk / N
GCGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
ACACTAFKL
C1A = 4, C1G = 3
W1A = (4  SAA + 3  SAG) / 7
= (4  4 + 3  0) / 7
= 2.3
Calculating profile (2)
Wij = log(qij/pj).
qij = Cij / N.
pj : background probability of residue j.
 For small N, formula qij = Cij / N is not good
 A large set of too closely related sequences
carries little more information than a single
member.
 Absence of Leu does not mean no Leu at
this position when Ile is abundant!
 Pseudocount frequency, gij
Frequency Matrix

Effective frequency, fij
 qij   gij
fij 


Frequency matrix element, fij, is the
probability of amino acid j at position i.
gij = pj Sk qik exp (l Skj )
pj: : background frequency
Frequency matrix, example
 i (position) 1,…,L
j (amino acid type) 1,…,20
0.03 0.02 0.01 0.01 0.01 0.02 0.02 0.02 0.01 0.22 0.27 0.02 0.12 0.03 0.01 0.02 0.03 0.00 0.01 0.08
0.04 0.02 0.17 0.03 0.01 0.02 0.03 0.03 0.01 0.03 0.08 0.03 0.01 0.01 0.02 0.07 0.34 0.00 0.01 0.04
0.07 0.03 0.02 0.03 0.01 0.12 0.13 0.03 0.01 0.01 0.02 0.04 0.01 0.01 0.38 0.07 0.07 0.00 0.01 0.02
0.07 0.03 0.09 0.06 0.01 0.03 0.14 0.03 0.06 0.01 0.02 0.09 0.01 0.01 0.15 0.16 0.05 0.00 0.01 0.02
0.03 0.11 0.08 0.03 0.01 0.12 0.13 0.02 0.01 0.04 0.04 0.13 0.09 0.01 0.02 0.04 0.03 0.00 0.01 0.04
0.03 0.12 0.05 0.03 0.01 0.10 0.05 0.02 0.01 0.04 0.14 0.10 0.16 0.01 0.01 0.04 0.03 0.00 0.01 0.04
0.04 0.04 0.12 0.06 0.00 0.06 0.12 0.03 0.03 0.01 0.02 0.18 0.02 0.02 0.02 0.15 0.04 0.03 0.01 0.02
0.06 0.02 0.08 0.03 0.00 0.08 0.19 0.02 0.04 0.05 0.07 0.09 0.04 0.01 0.02 0.12 0.04 0.00 0.01 0.03
0.03 0.01 0.01 0.01 0.01 0.02 0.01 0.01 0.02 0.15 0.31 0.01 0.02 0.12 0.01 0.02 0.02 0.01 0.02 0.13
0.06 0.03 0.05 0.06 0.03 0.07 0.19 0.02 0.02 0.03 0.06 0.10 0.01 0.05 0.02 0.07 0.07 0.00 0.05 0.03
0.22 0.10 0.03 0.08 0.01 0.03 0.12 0.02 0.01 0.01 0.06 0.11 0.01 0.01 0.03 0.05 0.07 0.00 0.02 0.03
0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.08 0.34 0.01 0.02 0.26 0.01 0.02 0.02 0.01 0.02 0.10
0.03 0.05 0.02 0.03 0.00 0.10 0.29 0.12 0.01 0.01 0.08 0.13 0.02 0.03 0.02 0.03 0.03 0.00 0.01 0.02
0.06 0.10 0.10 0.07 0.00 0.06 0.13 0.03 0.06 0.01 0.02 0.18 0.01 0.01 0.02 0.08 0.04 0.00 0.03 0.01
0.09 0.22 0.07 0.03 0.00 0.06 0.04 0.02 0.02 0.04 0.07 0.18 0.01 0.01 0.01 0.07 0.05 0.00 0.01 0.03
0.03 0.01 0.02 0.01 0.01 0.01 0.02 0.01 0.01 0.17 0.41 0.01 0.03 0.05 0.01 0.03 0.04 0.00 0.04 0.05
0.04 0.02 0.10 0.05 0.00 0.06 0.02 0.48 0.02 0.01 0.02 0.05 0.01 0.01 0.01 0.05 0.03 0.00 0.01 0.01
0.05 0.01 0.01 0.01 0.01 0.02 0.08 0.01 0.06 0.13 0.09 0.02 0.04 0.05 0.01 0.02 0.02 0.01 0.25 0.08
0.06 0.06 0.07 0.05 0.01 0.06 0.09 0.02 0.01 0.02 0.02 0.07 0.01 0.02 0.09 0.17 0.13 0.00 0.02 0.02
0.02 0.01 0.01 0.04 0.00 0.02 0.05 0.01 0.01 0.06 0.11 0.01 0.01 0.43 0.03 0.03 0.03 0.01 0.04 0.03
Profile Alignment (1)
sequence
profile
ACD……VWY
Profile Alignment (2)

alignment of alignments
•
•
Sequence – Profile Alignment.
Profile – Profile Alignment.

Penalize gaps in conserved regions more heavily
than gaps in more variable regions

Dynamic Programming.
(same idea as in Pairwise Sequence Alignment)
Optimal alignment in time O(a2l2)
a = alphabet size, l = sequence length
PsiBlast
Psi (Position Specific Iterated) is an
automatic profile-like search
 The program first performs a gapped blast
search of the database. The information of
the significant alignments is then used to
construct a “position specific” score matrix.
This matrix replaces the query sequence in
the next round of database searching
 The program may be iterated until no new
significant are found

Search (Building Profile)
with PSI-BLAST









Query sequence is “master” for multiple alignment.
Profile length = query sequence length.
Find family members with low E-value (e.g. <0.01).
Exclude sequences with >98% identity.
Build frequency matrix.
Build PSSM.
Build alignment using PSSM.
Add new sequences.
Iterate…
Significance of Psi Blast

PSI-BLAST estimates the statistical
significance of the local alignments found.
Because profile substitution scores are
constructed to a fixed scale, and gap scores
remain independent of position, the statistical
theory and parameters for gapped BLAST
alignments remain applicable to profile
alignments.

Much more sensitive than BLAST

E-values can be misleading!
Reading Assignments

Suggested reading:
 Chapter 4 in “Current Topics in
Computational Molecular Biology, edited by
Tao Jiang, Ying Xu, and Michael Zhang. MIT
Press. 2002.”

Optional reading:
 Chapter 7 in “Pavel Pevzner: Computational
Molecular Biology - An Algorithmic Approach.
MIT Press, 2000.”
Project Assignment
Develop a program that implement the
Center Star algorithm
1.
Modify your code in the first assignment
for global alignment.
2.
Use edit distance (match 1; otherwise
0) with gap penalty –1 – k (k is gap
size).
Related documents