Advanced Tools and Algorithms
in Bioinformatics
Chittibabu Guda
Summer, 2004
UCSD Extension, Department of Biosciences
Clustering Tools
• Clustering is grouping together of related sequences based on
some set thresholds such as length, % identity, composition etc.
• % identity is the most commonly used criterion to remove
redundant sequences in the databases
• Clustering can speed up database searches by orders of magnitude
with minimal loss of content
• The general principle in clustering is pair-wise alignment of
sequences in all-to-all combination
• Most commonly used tools are
• blastclust
• cd-hit
BLASTCLUST
http://www.csc.fi/molbio/progs/blast/blastclust.html
• BLAST score-based single-linkage clustering
• All sequences in the database are compared pair-wise in all-to-all
combinations, based on the BLAST score
• For each pair, the top scoring alignment is evaluated based on two
factors
• Length coverage- L’/L (for one or both sequences)
• Score density – I/AL
• where, L’ is length of sequence in the alignment, L is total length of the
sequence, I is the number of identical residues and AL is the total
alignment length (L’+gaps)
• If both these factors score above the set thresholds, the two
sequences are considered as neighbors
• The default e-value is 1e-6
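The two tests above can be sketched as a small function. This is only an illustration of the criteria as stated on the slide, not BLASTCLUST's actual code; the function name, the threshold values of 0.9, and the use of ">=" for "above the threshold" are assumptions.

```python
def is_neighbor(aligned_len, seq_len, identities, alignment_len,
                min_coverage=0.9, min_density=0.9):
    """Apply the two slide criteria to one top-scoring alignment.

    aligned_len   -- L', length of the sequence inside the alignment
    seq_len       -- L, total length of the sequence
    identities    -- I, number of identical residues
    alignment_len -- AL, total alignment length (L' + gaps)
    The two thresholds are hypothetical example values.
    """
    coverage = aligned_len / seq_len        # length coverage, L'/L
    density = identities / alignment_len    # score density, I/AL
    return coverage >= min_coverage and density >= min_density


close_pair = is_neighbor(aligned_len=95, seq_len=100,
                         identities=90, alignment_len=98)
partial_hit = is_neighbor(aligned_len=50, seq_len=100,
                          identities=50, alignment_len=50)
```

Here the second pair has 100% identity over the aligned region but fails on length coverage, which is why both factors are checked.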
CD-HIT (http://bioinformatics.ljcrf.edu/cd-hi/)
• This program is 20-30 times faster than BLASTCLUST because it avoids all-to-all comparison of pair-wise alignments
• Short word filters are applied to reduce the number of pair-wise alignments
• First index tables are built for short words of 2-5 residues, in all possible
combinations
• (ABC-), a 4-letter alphabet, can make a maximum of 16 two-letter pairs
• AB, AC, A-, BA, CA, -A, BC, B-, CB, -B, C-, -C, AA, BB, CC, --
• So, for (20+1) amino acids, the index table size would be 21^n, where n is the word
size (if n=5, the total number of words would be ~4 million)
• Program compares the type and number of identical peptides between the
representative and the new sequence
• Only those pairs that meet the minimum criterion will be further aligned to
confirm the identity
• Very fast algorithm for clustering larger databases like NR
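A minimal sketch of the short-word idea described above, assuming a simple count of shared words between two sequences. It is not CD-HIT's implementation; the function names and the toy sequences are illustrative only.

```python
from collections import Counter

ALPHABET_SIZE = 21  # (20+1) symbols, as on the slide


def index_table_size(word_size):
    """Number of possible words of the given size: 21^n."""
    return ALPHABET_SIZE ** word_size


def shared_word_count(seq_a, seq_b, word_size=2):
    """Count words of length word_size that both sequences contain,
    respecting how many times each word occurs."""
    def words(seq):
        return Counter(seq[i:i + word_size]
                       for i in range(len(seq) - word_size + 1))
    common = words(seq_a) & words(seq_b)   # multiset intersection
    return sum(common.values())


table_size = index_table_size(5)            # 4084101, i.e. ~4 million
shared = shared_word_count("ABCAB", "ABAB")  # the word "AB" is shared twice
```

A pair would be passed on to full alignment only if its shared-word count reaches some minimum; that cut-off is not given on the slide and is left out here.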
Phylogenetic Analysis
Terminology
• Homologous: Similar owing to common ancestry
• Paralogous : Similar sequences in the same species, originated by gene
duplication
• Orthologous: Similar sequences in different species by divergent
evolution
• Xenologous: Genes acquired by horizontal gene transfer
• Analogous: Similarity by convergent evolution
Methods of building phylogenetic trees
• Based on the data processing
  • Discrete methods
    • Maximum-parsimony method
    • Maximum-Likelihood method
  • Distance-based methods
• Based on the tree-building algorithm
  • Clustering methods
    • UPGMA
    • Neighbor-joining
  • Optimality criterion
Distance-based versus discrete methods
• Distance methods first convert aligned sequences into a pair-wise
distance matrix and then input the matrix into a tree building method
• Discrete methods are based on characters i.e., consider each nucleotide
or amino acid directly
• In distance methods, the biological information is lost once the
distance matrix is built, whereas discrete methods preserve additional
information such as which site contributes to the length of each branch
• Distance based methods are faster and easier to implement than
discrete methods
Clustering versus optimality criteria-based methods
• Clustering methods follow a set of steps and arrive at a single tree,
whereas optimality criteria-based methods build the set of all possible
trees and select the best one based on its score
• Clustering methods do not allow us to evaluate competing
hypotheses
• Clustering methods are faster, easy to implement and produce an
unambiguous output while the other methods are computationally
very expensive
• Optimality methods often result in good quality trees since they
could be interactively corrected
Parsimony Methods: Background
• The Eck and Dayhoff method counts all amino acid substitutions in a
phylogeny, but treats substitutions of high and low probability
(according to the genetic code) equally
• Ex: AAA (K) → CGC (R) vs AAC (N) → AGC (S)
• The Fitch method counts the minimum number of nucleotide changes required
to achieve the observed variation, but treats synonymous and
non-synonymous changes equally
• Ex: UUU (F) → CUU (L) → CUA (L) → CAA (Q)
• The maximum parsimony method takes a moderate approach between the above
two methods. All amino acid changes must be consistent with the genetic
code, and synonymous changes are counted fewer times than non-synonymous
changes.
• In the above example the number of changes from F → Q is counted
as two, not three
Maximum Parsimony Method
• Also called minimum evolution method
• Predicts the tree(s) that minimize the number of steps required to
generate the observed variation in the sequences
• For each aligned column in the multiple alignment, phylogenetic
trees that require smallest number of evolutionary changes to
produce the observed variation are identified
• Finally, those trees that produce the smallest number of changes
overall for all sequence positions are identified
• Very time consuming; not suitable for a large number of sequences or
for sequences with a large amount of variation
• For DNA: DNAPARS
• For proteins: PROTPARS
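The per-column counting of changes can be illustrated with Fitch's set-based counting on a fixed rooted binary tree. This is a simplified sketch, not the DNAPARS or PROTPARS implementation: every change is given equal weight, the tree is supplied rather than searched for, and the toy alignment and names are made up for illustration.

```python
def fitch(node, column):
    """Return (possible_states, changes) for the subtree rooted at node.

    node   -- a leaf name (str) or a pair (left, right) of subtrees
    column -- dict mapping leaf name to its residue in one alignment column
    """
    if isinstance(node, str):                 # leaf: one state, no changes
        return {column[node]}, 0
    left_states, left_cost = fitch(node[0], column)
    right_states, right_cost = fitch(node[1], column)
    common = left_states & right_states
    if common:                                # children agree: no new change
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    n_cols = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_cols):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total


toy = {"a": "AAG", "b": "AAG", "c": "ACG", "d": "ACT"}
score_1 = parsimony_score((("a", "b"), ("c", "d")), toy)   # 2 changes
score_2 = parsimony_score((("a", "c"), ("b", "d")), toy)   # 3 changes
```

Of the two candidate trees, the first needs fewer changes overall and would be the one preferred under parsimony.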
Protpars Example
Distance-based Method
• Distance between pairs of sequences is calculated based on
• Dayhoff’s PAM matrix values
• Fraction of non-identical amino acids between the two
sequences
• Depending on whether the conversion of amino acids is
within the group or to a different group
• An (n x n) distance matrix is calculated for all pair-wise
combinations; the matrix is symmetric, so the halves on either side of
the diagonal are identical
• Distance matrix is used as input in different algorithms to
calculate an optimal evolutionary tree
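As a small illustration of the second option above (fraction of non-identical amino acids), the following sketch builds a symmetric distance matrix from an already aligned set of sequences. It is not Protdist; skipping gapped positions is an assumption, and the sequences are toy data.

```python
def p_distance(seq_a, seq_b):
    """Fraction of non-identical residues between two aligned sequences.
    Positions with a gap ('-') in either sequence are skipped (assumption)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(alignment):
    """Return a nested dict d[x][y] holding all pair-wise distances."""
    names = list(alignment)
    return {x: {y: p_distance(alignment[x], alignment[y]) for y in names}
            for x in names}


toy = {"s1": "MKTA", "s2": "MKSA", "s3": "MRSV"}
d = distance_matrix(toy)
```

The resulting matrix has zeros on the diagonal and `d[x][y] == d[y][x]`, so only one triangle needs to be stored or printed.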
Distance Matrix generated by Protdist
[Figure: Protdist distance matrix for HUMAN, MOUSE, DROME, SOLTU, WHEAT, ARATH, NEUCR and YEAST]
Distance method continued …
• The key is how best the pair-wise distances are made additive on a
predicted evolutionary tree
• Using the distance matrix, several phylogenetic trees are built and
evaluated based on the following criteria
• Goodness of fit methods seek the metric tree that best accounts for
the observed pair-wise distances
• Minimum evolution method: Seeks the tree whose sum of branch
lengths is the minimum (minimum evolution)
• Methods used
• FITCH: Based on Fitch-Margoliash method
• NEIGHBOR: Based on neighbor-joining or UPGMA methods
Feng-Doolittle Method …..
Distance matrix
            Human  Chimp  Gorilla  Orang
A Human         0     88      103    160
B Chimp                0      106    170
C Gorilla                       0    166
D Orang                                0
Tree building using Fitch-Margoliash method (1967)
Da = ( DAB + DAC - DBC ) / 2
Db = ( DAB + DBC - DAC ) / 2
Dc = ( DAC + DBC - DAB ) / 2
Join the first 3 sequences
Da = ( 88 + 103 - 106 ) / 2 = 42.5
Db = ( 88 + 106 - 103 ) / 2 = 45.5
Dc = ( 103 + 106 - 88 ) / 2 = 60.5
[Figure: tree joining A, B and C with branches Da, Db and Dc; branch labels 9.0, 51.5, 42.5, 45.5]
Feng-Doolittle Method …..
Human and Chimp are merged into one cluster; its distance to each remaining
sequence is the average of the two original distances, e.g.
(103 + 106) / 2 = 104.5 for Gorilla and (160 + 170) / 2 = 165 for Orang
              Hum/Chimp  Gorilla  Orang
A Hum/Chimp           0    104.5    165
B Gorilla                      0    166
C Orang                               0
Join the 4th sequence to current tree
Da = ( 104.5 + 165 - 166 ) / 2 = 51.75
Db = ( 104.5 + 166 - 165 ) / 2 = 52.75
Dc = ( 165 + 166 - 104.5 ) / 2 = 113.25
[Figure: resulting tree with nodes A, A’, B and C; branch labels 30.75, 82.5, 9.25, 52.75, 42.5, 45.5]
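The branch-length arithmetic of the two joining steps can be checked with a few lines of code. This sketch only reproduces the three-point formulas and the numbers given on the slides; the function and variable names are illustrative.

```python
def three_point(d_ab, d_ac, d_bc):
    """Branch lengths (Da, Db, Dc) of the three-branch tree for A, B, C."""
    d_a = (d_ab + d_ac - d_bc) / 2
    d_b = (d_ab + d_bc - d_ac) / 2
    d_c = (d_ac + d_bc - d_ab) / 2
    return d_a, d_b, d_c


# Step 1: A = Human, B = Chimp, C = Gorilla
step1 = three_point(d_ab=88, d_ac=103, d_bc=106)

# Step 2: A = Hum/Chimp cluster, B = Gorilla, C = Orang.
# Distances to the merged cluster are averages of the original distances.
hc_gorilla = (103 + 106) / 2   # 104.5
hc_orang = (160 + 170) / 2     # 165.0
step2 = three_point(d_ab=hc_gorilla, d_ac=hc_orang, d_bc=166)
```

`step1` evaluates to (42.5, 45.5, 60.5) and `step2` to (51.75, 52.75, 113.25), matching the values worked out on the slides.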
Maximum-Likelihood Methods
• These methods are discrete methods similar to maximum
parsimony (MP) methods, however probability calculations are
used to find a tree that best accounts for the variation in a set of
sequences
• Analysis is performed on all columns in the multiple alignment
and all possible trees are considered
• Compared to MP methods, more divergent sequences can be
analyzed
• However, the main disadvantage is that these methods are
computationally intensive
Genome-scale Data Analysis
[Flowchart: Sequenced genome → Ensembl/translation → Complete proteome → PDB search. Yes: known structure. No: Interpro/Pfam search. Yes: known function. No: unknown function & structure.]
Finding right tools for right tasks
• Finding paralogues by clustering (BLASTCLUST, CD-HIT)
• Finding homologues and orthologues (BLAST)
• Finding remote homologues (PSI-BLAST)
• Finding functional annotation (PFAM, INTERPRO)
• Finding structural annotation (Blast PDB)
• Finding low-complexity regions (SEG, CAST)
• Finding transmembrane regions (TMHMM)
• Finding disordered regions (COILS, PONDR)
• Finding secondary structure (JPRED, TOPpred)
Accessing Tools and Data
• Web-based tools vs. Standalone tools
• Download
• NCBI: ftp://ftp.ncbi.nih.gov
• EBI: ftp://ftp.ebi.ac.uk
• PDB: ftp://ftp.rcsb.org
• PFAM: ftp://ftp.genetics.wustl.edu
• Local installation and configuration
Structure-based Algorithms
Protein Data Bank (PDB)
http://www.rcsb.org
• About 26000 structures including X-Ray, NMR and models
• Structures include 23597 proteins, 1108 protein/nucleic acid
complexes, 1336 nucleic acids and 18 carbohydrates
• Sequence numbering
• PDB/Atomic numbering
• PDB ID/chain ID
Growth of PDB entries
Growth of new folds in PDB
NIGMS funded Structural Genomics Projects
• Midwest Center for Structural Genomics
• Northeast Structural Genomics Consortium
• New York Structural Genomics Research Consortium
• Southeast Collaboratory for Structural Genomics
• Structural Genomics Center
• Tuberculosis (TB) Structural Genomics Consortium
• Joint Center for Structural Genomics
• Center for Eukaryotic Structural Genomics
• Structural Genomics of Pathogenic Protozoa Consortium
Protein Structure Databases
• SCOP : Structural Classification of Proteins
• CATH : Class, Architecture, Topology & Homologous superfamily
• FSSP/DALI : Fold classification based on Structure-Structure
alignment of Proteins
• HSSP: Homology-derived Secondary Structure of Proteins
• HOMSTRAD : Homologous Structure Alignment Database
• DSSP : Database of Secondary Structure Assignments
• DMAPS : Database of Multiple Alignment for Protein Structures
Structure Alignments
• Protein structures are determined by X-ray crystallography or NMR
methods
• Structural alignment involves establishing equivalencies between
residues in two or more proteins based on their 3D-coordinates
• 3-D coordinates of Cα atoms are most commonly used for calculation
of distance in structural alignments
Methods used for structure alignment
• Dynamic programming (Taylor & Orengo, 1989)
• Combinatorial Extension (Shindyalov & Bourne, 1998)
• Monte Carlo method (Mirny & Shakhnovich, 1998; Guda et al., 2001)
• Environment profile method (Jung & Lee, 2000)
• Genetic Algorithms (May & Johnson, 1995)
Combinatorial Extension (CE) Method
http://cl.sdsc.edu/ce.html
• CE method is based on determining Aligned Fragment Pairs (AFPs) with
local similarities and joining AFPs to form a continuous path
• AFPs are based on the difference in the local geometry of structures being
compared
• For ex., inter-residue distances are calculated between 8 residues in all
possible combinations, except between the neighboring residues ((n-1)(n-2)/2).
This is done for all candidate AFPs in each structure
• Difference(d) in the average distances is calculated and all candidate AFPs
with d under some threshold are considered AFPs
• Consecutive AFPs are selected based on calculation of inter-residue
distances between two AFP members in the same chain in 64 (8x8)
combinations and selecting the ones with minimum average difference (d)
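A rough sketch of the local-geometry comparison described above, assuming Cα coordinates are given as (x, y, z) tuples. It is one possible reading of the slide rather than the CE program's code: the difference d is taken here as the mean absolute difference between corresponding inter-residue distances, and all names and coordinates are illustrative.

```python
import math
from itertools import combinations


def fragment_distances(coords):
    """Inter-residue distances within one fragment, skipping neighbouring
    residues; for n residues this gives (n-1)(n-2)/2 values."""
    return [math.dist(coords[i], coords[j])
            for i, j in combinations(range(len(coords)), 2)
            if j - i > 1]


def fragment_difference(coords_a, coords_b):
    """Mean absolute difference between the two fragments' distance sets
    (an assumed form of the difference d)."""
    dist_a = fragment_distances(coords_a)
    dist_b = fragment_distances(coords_b)
    return sum(abs(x - y) for x, y in zip(dist_a, dist_b)) / len(dist_a)


# Toy 8-residue fragments laid out on a straight line
frag = [(3.8 * k, 0.0, 0.0) for k in range(8)]
stretched = [(4.8 * k, 0.0, 0.0) for k in range(8)]

n_distances = len(fragment_distances(frag))      # 21 for 8 residues
same = fragment_difference(frag, frag)           # 0.0
different = fragment_difference(frag, stretched)
```

A candidate pair would count as an AFP when this difference falls under a chosen threshold; the threshold value is not given on the slide.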
CE Method …
Extending the optimal path
• The alignment path is constructed from AFPs selected from any
position in the similarity matrix and consecutive AFPs are added in
either direction such that,
• two consecutive AFPs are aligned without gaps OR
• two consecutive AFPs are aligned with gaps inserted in either of
the proteins, but not in both
• The maximum allowable size of a gap is capped at 30; as a consequence,
similarities requiring a gap size > 30 are misrepresented by this
algorithm
• A few best alignments are superimposed and r.m.s.d. (Root mean square
deviation) is iteratively optimized using dynamic programming by
adjusting gaps
• Finally, the pair with lowest RMSD value is selected
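For reference, the r.m.s.d. of two equally long, already superimposed sets of coordinates can be computed as below. This is a minimal sketch; the superposition step and the dynamic-programming refinement are not shown, and the candidate names and coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between paired, superimposed coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
candidates = {
    "alignment_1": [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],   # offset by 1 along z
    "alignment_2": [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)],   # offset by 3 along z
}
scores = {name: rmsd(reference, coords) for name, coords in candidates.items()}
best = min(scores, key=scores.get)   # candidate with the lowest RMSD
```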
FSSP/DALI http://www.ebi.ac.uk/dali/fssp/fssp.html
• Fold Classification based on Structure-Structure alignment of Proteins
• All structures in PDB are clustered into families based on 25%
sequence identity and representatives for each family are selected
• FSSP was built using completely automatic method (DALI), based on
all-against-all comparison of representative set of structures
• DALI (Distance matrix ALIgnment) is based on distance maps that
contain all pair-wise distances between residue centers, i.e., Cα atoms
• The distance matrices from each protein are decomposed into
hexapeptide-hexapeptide submatrices. Similar contact patterns are
paired and combined into larger sets of pairs
• A Monte Carlo procedure is used to optimize similarity score
• Multiple structure alignments were built based on pair-wise
comparison of representative and member within the family and
between representatives
HOMSTRAD
http://www-cryst.bioc.cam.ac.uk/homstrad/
• HOMologous STRucture Alignment Database
• 1032 families with 3454 structures
• Structures with only C-alpha values were excluded
• Structurally similar proteins were clustered into
homologous families and alignments were built based on 3-D
coordinate data
• Uses COMPARER and MNYFIT for building structure
alignments
• Multiple alignments were calculated only for representative
members of each family
Limitations of current methods
Most of the multiple alignment methods are based on master-slave or
progressive alignments. These are biased towards the master structure
or the initial alignment
Example: [Figure: alignment built against a master structure]
Monte Carlo Optimization Method
http://cemc.sdsc.edu
http://dmaps.sdsc.edu
Problem: Most of the multiple alignment methods are based on
pair-wise alignment of structures to a Master structure. This leads
to biased alignments towards the master, ignoring the similarities
within the other structures
Essential elements of the Method
• The Target/Scoring function
• The Search Algorithm
• The Search Constraints
• Algorithm
General Monte Carlo Approach
• Compute a distance-based score for the current alignment
• Make a random trial change to the current alignment and compute the
change in the score (ΔS)
• If ΔS > 0, the move is always accepted
• If ΔS <= 0, the move may be accepted by adding an additional score of P
[Equation: P, expressed in terms of C, ΔS and m]
where,
- C is a constant
- m is the trial move count
• Once a move is accepted, the change in the alignment becomes permanent
• This procedure is iterated until there is no further change in the score, i.e.,
the system is converged
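The loop above can be sketched generically as follows. Note the assumptions: the slide's own expression for P is not reproduced; instead a Metropolis-style acceptance probability exp(change × m / C), which becomes stricter as the move count m grows, is swapped in as a stand-in. A fixed move budget replaces the convergence test, and the score and move functions are toy examples.

```python
import math
import random


def monte_carlo(state, score_fn, propose_fn, c=10.0, max_moves=2000, seed=0):
    """Maximise score_fn by random trial moves.

    Returns (best_state, best_score) seen during the run.
    """
    rng = random.Random(seed)
    score = score_fn(state)
    best_state, best_score = state, score
    for m in range(1, max_moves + 1):           # m is the trial move count
        trial = propose_fn(state, rng)
        trial_score = score_fn(trial)
        delta = trial_score - score             # change in the score
        # ASSUMPTION: stand-in acceptance rule, not the slide's formula.
        # Improving moves are always accepted; others with a probability
        # that shrinks as m grows (c plays the role of the constant C).
        if delta > 0 or rng.random() < math.exp(delta * m / c):
            state, score = trial, trial_score   # the change becomes permanent
            if score > best_score:
                best_state, best_score = state, score
    return best_state, best_score


def toy_score(x):
    """Toy target with a single optimum at x = 7."""
    return -(x - 7) ** 2


def toy_move(x, rng):
    """Toy trial move: step one unit left or right."""
    return x + rng.choice((-1, 1))


best_state, best_score = monte_carlo(0, toy_score, toy_move)
```

In the alignment setting, the state would be the multiple alignment, the score the distance-based scoring function, and the trial move one of the random moves from the move set.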
Monte Carlo Simulation ...
Scoring function
(Modified from Levitt & Gerstein, 1998)
- S is the total score for the alignment
- l is the total number of columns and i is the column position in the alignment
- M = 20 (maximum score of a column, chosen arbitrarily)
- di is the average Cα distance between residues in column i:
  $d_i = \frac{1}{N} \sum_{p<q} d_{pq}$
- p and q are residues in column i
- N = m(m-1)/2 (all-to-all combinations)
- m is the residue count in column i
- d0 is a constant (the distance increase that can be tolerated)
- A = 0 if di is within d0, and A = 10 otherwise
- G is the affine gap penalty term (G = I + pE), where I = 15 and E = 7 are the gap
initiation and extension penalties, respectively, and p is the number of gap extensions
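Two of the terms defined above, the column distance di and the gap penalty G, are simple enough to sketch directly. The code below covers only these two pieces, not the full score S; the function names and the toy coordinates are illustrative.

```python
import math
from itertools import combinations


def column_average_distance(coords):
    """di: average distance over all pairs of residues in one column.

    coords -- list of (x, y, z) Calpha coordinates, one per residue in the column
    """
    m = len(coords)                       # residue count in the column
    if m < 2:
        raise ValueError("a column needs at least two residues")
    n_pairs = m * (m - 1) // 2            # N = m(m-1)/2
    total = sum(math.dist(p, q) for p, q in combinations(coords, 2))
    return total / n_pairs


def affine_gap_penalty(extensions, initiation=15, extension=7):
    """G = I + pE with the slide's values I = 15 and E = 7."""
    return initiation + extensions * extension


# Three residues forming a 3-4-5 right triangle: pair distances 3, 4 and 5
d_i = column_average_distance([(0, 0, 0), (3, 0, 0), (0, 4, 0)])
g = affine_gap_penalty(extensions=2)
```

For the toy column the average distance is 4.0, and a gap with two extensions costs 15 + 2 × 7 = 29.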
Monte Carlo Simulation ...
Search Constraints
• Minimum Block length: > 3 (3-6)
• Residue Threshold: 50 % (33-66 %)
[Figure: alignment showing a block and the free pool of residues]
Monte Carlo Simulation ...
Random Trial Move Set
1. Shift Right
2. Shift Left
3. Expand Right
4. Expand Left
5. Shrink Right
6. Shrink Left
7. Split/Shrink
Monte Carlo Simulation ...
Shift Left
Before Accepting Move: Score = 30796, Distance = 3.815
After Accepting Move: Score = 30846, Distance = 3.849
Monte Carlo Simulation ...
Expand Right
Before Accepting Move: Score = 30850, Distance = 3.852
Free pool of residues
After Accepting Move: Score = 31048, Distance = 3.915
Expanded fragment
Monte Carlo Simulation ...
Expand Left
Before Accepting Move: Score = 31093 Distance = 4.042
Free pool of residues
After Accepting Move: Score = 31500, Distance = 4.207
Expanded fragment
Monte Carlo Simulation ...
Shrink
Before shrinking
After shrinking
Monte Carlo Simulation ...
Split and Shrink
Before Split and Shrinking
After Split and Shrinking
Monte Carlo Simulation ...
Typical Monte Carlo behavior
[Plot: number of alignment columns (260 to 320) and average alignment distance (2.5 to 3.2) against move count (0 to 12000)]
Monte Carlo Simulation ...
Relation between alignment improvement and distance increase
[Plot: change in the number of alignment columns (-80 to 60) and change in the average alignment distance (-0.2 to 1.2)]
Monte Carlo Simulation ...
Example 1
[Table columns: ID, A (CE), B (CE+MC), C (HOM.)]
Monte Carlo Simulation ...
Example 2
[Table columns: ID, A (CE), B (CE+MC), C (HOMSTRAD)]
CE-MC Web Server
• Accessible at http://cemc.sdsc.edu
• A web-based facility to perform multiple structure alignments
• Users can upload local coordinate files and compare them against PDB files
• Initial seed alignments are built based on CE algorithm and iteratively
optimized using Monte Carlo Optimization
• Results are emailed upon completion of job
• Output is displayed in four different formats, as follows
• JOY/html
• JOY/post-script
• Text
• FASTA
DMAPS Web Server
• Accessible at http://dmaps.sdsc.edu
• Stores pre-calculated multiple structure alignments for all
structural families in the PDB
• All structure chains in the PDB were clustered into ~1700 families
and multiple structure alignments were performed using the Monte
Carlo algorithm
• Multiple structure alignment for a structure family is accessible
with the PDB chain ID of any member of that family
• Results are retrieved and displayed in 4 different formats, i.e.,
JOY/html, JOY/post-script, Text and FASTA
Final Project Work