Reconstructing phylogenetic trees for protein superfamilies
Structural Phylogenomic Analysis
• Estimate Tree of Life; plot key traits onto tree
• Extend function prediction through inclusion of structure prediction and analysis
  – Anti-fungal defensin (Radish), Scorpion toxin, Drosomycin (Drosophila)
• Predict active site & subfamily specificity positions
• VirB4 model: based on 12% identity to TrwB structure

Annotation transfer by homology
• Status quo approach to protein function prediction
  – Given a gene (or protein) of unknown function
    • Run BLAST to find homologs
    • Identify the top BLAST hit(s)
    • If the score is significant, transfer the annotation
  – If resources permit, predict domains using PFAM or CDD
• Problems:
  – Approach fails completely for ~30% of genes
  – Of those with annotations, only 3% have any supporting experimental evidence
    • 97% have had functions predicted by homology alone*
  – High error rate
* Based on analysis of >300K proteins in the UniProt database

Database annotation errors
Main sources of annotation errors:
1. Domain shuffling
2. Gene duplication (failure to discriminate between orthologs and paralogs)
   – Sub-functionalization
   – Neo-functionalization
3. Existing database annotation errors
• Propagation of existing database annotation errors
• Errors in gene structure
Galperin and Koonin, “Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption.” In Silico Biol. 1998

Tomato Cf-2 Bioinformatics Analysis
• Domain fusion and fission events complicate function prediction by homology, particularly for very common domains (e.g., LRR regions).
• Domain structure analysis (e.g., PFAM) is often critical.
Tomato Cf-2 (GI:1587673)
Dixon, Jones, Keddie, Thomas, Harrison and Jones JDG, Cell (1996)
BLAST against Arabidopsis
• Top BLAST hit in Arabidopsis is an RLK!
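The status quo procedure above (run BLAST, take the top hit, transfer its annotation if the score is significant) can be sketched in a few lines of Python. This is only an illustration: the `BlastHit` record, the `transfer_annotation` function, the hit identifiers and the E-value cutoff of 1e-5 are all assumptions made for the example, not part of the slides or of any BLAST API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlastHit:
    """Hypothetical record for one parsed BLAST hit."""
    subject_id: str
    annotation: str
    evalue: float


def transfer_annotation(hits: List[BlastHit],
                        evalue_cutoff: float = 1e-5) -> Optional[str]:
    """Return the annotation of the best hit if it is significant, else None.

    The cutoff is an assumed value; nothing here checks domain structure
    or orthology, which is exactly the weakness the slides point out.
    """
    if not hits:
        return None  # no homologs found: the approach fails for this gene
    top = min(hits, key=lambda h: h.evalue)  # smallest E-value = top hit
    if top.evalue <= evalue_cutoff:
        return top.annotation
    return None


# Made-up hits for a query such as tomato Cf-2 searched against Arabidopsis.
example_hits = [
    BlastHit("hypothetical_hit_1", "receptor-like kinase", 1e-80),
    BlastHit("hypothetical_hit_2", "LRR protein", 1e-20),
]
predicted = transfer_annotation(example_hits)
```

With these made-up hits the query simply inherits the label of its top hit. Nothing in the procedure asks whether the query shares the hit's full domain architecture, which is why the slides recommend domain structure analysis alongside BLAST.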
Panther, PFAM results, Berkeley Phylogenomics

Plant and Animal Innate Immunity Mediated by Structurally Similar Receptor and Receptor-like molecules
• TM
• Domain fusion/fission
• Cytoplasmic Toll Interleukin 1 Receptor (TIR) domain

Errors due to domain shuffling (sic)

Error presumably due to non-orthology of database hits used for annotation
• The top matching BLAST hits are putative odorant receptors
• Phylogenetic analysis suggests it’s more likely a Biogenic Amine GPCR

Annotation error (source unknown)

Phylogenomic inference
Gene duplication in ancestral organism
H1 C1 M1 R1 F1 W1
H2 C2 M2 R2 F2 W2
Human, Chimp, Mouse, Rat, Fly, Worm
Eisen, 1998; Sjölander, Bioinformatics 2004

SCI-PHY analysis of selected GPCRs
Venter et al., “The sequence of the human genome,” (2001) Science.
Sjölander, “Phylogenomic inference of protein molecular function: advances and challenges,” (2004) Bioinformatics

Phylogenetic reconstruction of protein families is complicated
• Gene duplication
• Domain shuffling
• Lessening of evolutionary pressures associated with speciation and duplication enables significant structural and sequence changes
• Different mutation rates in some lineages
• Different types of constraints at some positions
• Multiple sequence alignment errors
• What members to include? (Some families contain thousands of members)

Caveats
• Sequence “signal” guides the alignment
• If the signal is weak, the alignment can be poor
• As proteins diverge from a common ancestor, their structures and functions can change
  – Even structural superposition can be challenging!
• Repeats, domain shuffling, large insertions or deletions can introduce alignment errors
• If tree construction is the aim, errors in the alignment will affect tree accuracy!
Fundamental mechanisms underlying evolution of gene families

Homology and adaptation among protein families
• 1AGT: Agitoxin 2, Egyptian Scorpion (K+ channel inhibitor)
• Drosomycin: Antifungal protein, Fruit Fly
• 1CN2: Toxin 2, Mexican scorpion (Na+ channel inhibitor)
• 1BK8: Antimicrobial Protein 1 (Ah-Amp1), Common horse chestnut
• 1AYJ: Antifungal protein 1 (RS-AFP1), Radish
Protein superfamilies evolve novel forms and functions: homology may be hard to detect from sequence similarity alone

Homology detection and alignment accuracy (and % superposable positions!) drop with evolutionary distance
Structure can provide clues, but not necessarily exact definition

%ID      #pair   %Superpos
>70      107     90.6
50-70    63      87.2
40-50    46      83.4
30-40    65      85.4
25-30    41      82.1
20-25    53      77.9
15-20    84      73
10-15    151     64.4
5-10     204     50.4
0-5      122     39.5

Pairwise alignment, MSA-pw
BLAST    ClustalW  Tcoffee  ClustalW  MAFFT    MUSCLE
0.954    0.955     0.955    0.955     0.954    0.954
0.862    0.903     0.894    0.901     0.919    0.911
0.862    0.846     0.824    0.872     0.855    0.856
0.811    0.874     0.867    0.87      0.892    0.925
0.779    0.782     0.788    0.795     0.837    0.836
0.678    0.612     0.599    0.627     0.633    0.661
0.381    0.451     0.457    0.49      0.496    0.554
0.16     0.186     0.234    0.302     0.35     0.351
-0.007   -0.014    0        -0.047    0.098    0.075
-0.033   -0.049    -0.051   -0.034    -0.024   -0.022

Not all positions in a molecule are created equal
• Light-blue positions are variable across subfamilies – but can be very conserved within subfamilies.
• These are the hallmarks of binding pockets determining substrate specificity.

Major differences between trees are in the coarse branching order
(Figure: alternative trees over classes A, B and C)
• When classes A, B and C all appear equally similar to one another, the coarse branching order can be difficult to determine.
• In this case, it’s critical to be able to weight the subfamily-defining residues as more important when computing the distance between classes.
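One way to picture the weighting idea above is the toy Python sketch below. It is not the method used in the slides: the scoring rule (within-subfamily conservation times between-subfamily variability), the function names `specificity_weights` and `weighted_distance`, the `boost` parameter and the four toy sequences are all assumptions chosen to keep the example small. Sequences are assumed to be pre-aligned and of equal length.

```python
from collections import Counter


def specificity_weights(seqs, labels):
    """Score each alignment column: high when conserved within subfamilies
    but different between them. The scoring rule is an illustrative assumption."""
    groups = {}
    for seq, label in zip(seqs, labels):
        groups.setdefault(label, []).append(seq)
    weights = []
    for pos in range(len(seqs[0])):
        consensus, conservation = [], []
        for members in groups.values():
            counts = Counter(m[pos] for m in members)
            residue, n = counts.most_common(1)[0]
            consensus.append(residue)
            conservation.append(n / len(members))
        within = sum(conservation) / len(conservation)
        # 0 when every subfamily has the same consensus, 1 when all differ.
        between = (len(set(consensus)) - 1) / max(len(groups) - 1, 1)
        weights.append(within * between)
    return weights


def weighted_distance(a, b, weights, boost=4.0):
    """Fraction of mismatches, with subfamily-defining columns counted more."""
    num = den = 0.0
    for x, y, w in zip(a, b, weights):
        scale = 1.0 + boost * w
        den += scale
        if x != y:
            num += scale
    return num / den


# Toy aligned sequences from two hypothetical subfamilies, A and B.
seqs = ["ACDK", "ACDR", "AGEK", "AGER"]
labels = ["A", "A", "B", "B"]
weights = specificity_weights(seqs, labels)
within_a = weighted_distance(seqs[0], seqs[1], weights)
across = weighted_distance(seqs[0], seqs[2], weights)
```

In this toy case the two middle columns separate the subfamilies, so a mismatch there counts five times as much as a mismatch in the last column, which varies inside each subfamily. Distances within a subfamily shrink and distances across subfamilies grow relative to plain mismatch counting.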
HMM construction using an initial multiple sequence alignment
(Figure: profile HMM with Delete/skip, Insert and Match states, built from an alignment of Seq1 to Seq5)

Profile or HMM parameter estimation using small training sets
Alignment columns: D D D D D | S S T T T | I V I I V | F F W W W | M M M M M | K K K K K
• What other amino acids might be seen at this position among homologs? What are their probabilities?

The context is critical when estimating amino acid distributions
Alignment columns: D D D D D | S S T T T | I V I I L | F F W W W | M M M L L | K K K K R
• This position may be critical for function or structure, and may not allow substitutions.

Dirichlet Mixture Prior “Blocks9”
• Parameters estimated using Expectation Maximization (EM) algorithm.
• Training data: 86,000 columns from BLOCKS alignment database.

Combining Prior Knowledge with Observations using Dirichlet Mixture Densities
• $\hat{p}_i$ = the estimated probability of amino acid $i$
• $\vec{n} = (n_1, \dots, n_{20})$ = the count vector summarizing the observed amino acids at a position
• $\vec{\alpha}_j = (\alpha_{j,1}, \dots, \alpha_{j,20})$ = the parameters of component $j$ of the Dirichlet mixture
Sjölander, Karplus, Brown, Hughey, Krogh, Mian and Haussler, “Dirichlet Mixtures: A Method for Improved Detection of Weak but Significant Protein Sequence Homology,” CABIOS (1996)

SATCHMO: Simultaneous Alignment and Tree Construction using Hidden Markov mOdels
Xia Jiang, Nandini Krishnamurthy, Duncan Brown, Michael Tung, Jake Gunn-Glanville, Bob Edgar
Edgar, R., and Sjölander, K., "SATCHMO: Sequence Alignment and Tree Construction using Hidden Markov models," Bioinformatics. 2003 Jul 22;19(11):1404-11

SATCHMO motivation
• Structural divergence within a superfamily means that…
  – Multiple sequence alignment (MSA) is hard
  – Alignable positions vary according to degree of divergence
• Current MSA methods not designed to handle this variability
• Assume globally alignable, all columns (e.g. ClustalW)…
  – Over-aligns, i.e. aligns regions that are not superposable
• …or identify and align only highly conserved positions (profile HMMs)
  – Discards information important for subfamily specificity
• Reality
  – Different degrees of alignability in different sequence pairs, different regions

Agglomerative clustering
Algorithm:
1. Initialize all objects to be separate classes (leaves in the tree).
2. Join “closest” classes (connecting each by edges to a node).
3. Compute distance between new class and other classes.
4. Join closest two classes.
5. Iterate until all classes are joined into one class (a tree)

SATCHMO output
1. Tree
   • Cluster based on structural “distance”
   • Built simultaneously with alignments
2. Multiple sequence alignments
   • Different alignment for each cluster (=each node in tree)
3. Prediction of alignable / non-alignable regions
• 1, 2, 3 mutually dependent, inform each other
  – Interact each time two clusters are combined
Note: we can assess alignment quality, but tree topology accuracy is not straightforward to assess.

SATCHMO algorithm: Progressive profile-profile alignment
• Typical state: set of subtrees
  – Cluster (=subtree) contains
    • alignment of all subtree sequences
    • profile HMM
  – Initialization: each sequence forms a leaf in tree
• Iterated step
  – Find most closely related pair of subtrees (using HMM scoring)
  – Align the MSAs of the two clusters using profile-profile alignment…
  – …treats MSA column as single “letter”, keeps columns intact
  – Result: new cluster with its own MSA
  – Predict “alignable” columns, and build profile HMM (w/Dirichlet mixture densities).
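The agglomerative loop that drives the iterated step can be sketched as follows. This is a generic illustration, not SATCHMO itself: SATCHMO picks the closest pair by HMM scoring and re-aligns the clusters at every merge, whereas this sketch swaps in a precomputed pairwise distance table with average linkage. The function names and the distances are made up; the leaf labels H1, C1, M1 and F1 are borrowed from the phylogenomic inference slide.

```python
from itertools import combinations


def average_linkage(members_a, members_b, dist):
    """Mean pairwise distance between two clusters (an assumed linkage rule)."""
    total = sum(dist[frozenset((x, y))] for x in members_a for y in members_b)
    return total / (len(members_a) * len(members_b))


def agglomerate(names, dist):
    """Build a tree (nested tuples) by repeatedly joining the closest clusters.

    names: non-empty list of leaf labels.
    dist: dict mapping frozenset({a, b}) to a distance for every pair of leaves.
    """
    # Each cluster is (tree, leaf_members); start with one cluster per leaf.
    clusters = [(name, [name]) for name in names]
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]][1], clusters[p[1]][1], dist),
        )
        (tree_i, mem_i), (tree_j, mem_j) = clusters[i], clusters[j]
        # Join them under a new internal node and drop the two originals.
        merged = ((tree_i, tree_j), mem_i + mem_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Made-up distances between four leaves.
pairs = {
    ("H1", "C1"): 0.1, ("H1", "M1"): 0.4, ("C1", "M1"): 0.4,
    ("H1", "F1"): 0.9, ("C1", "F1"): 0.9, ("M1", "F1"): 0.9,
}
dist = {frozenset(k): v for k, v in pairs.items()}
tree = agglomerate(["H1", "C1", "M1", "F1"], dist)
```

A real implementation would recompute the cluster model (alignment and profile HMM) after each merge instead of averaging fixed leaf distances, which is the step the slides describe as building the tree and alignments simultaneously.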
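The Dirichlet mixture estimate used when building the profile HMM can also be written out. In the usual posterior-mean form, which is my reading of the cited CABIOS (1996) paper rather than something recoverable from these slides, the estimate is $\hat{p}_i = \sum_j \Pr(\vec{\alpha}_j \mid \vec{n}) \, \frac{n_i + \alpha_{j,i}}{|\vec{n}| + |\vec{\alpha}_j|}$, where $\Pr(\vec{\alpha}_j \mid \vec{n})$ is proportional to the mixture coefficient $q_j$ times the probability of the counts under component $j$. The Python sketch below assumes that form; the function names, the three-letter alphabet and the two-component mixture are toy assumptions, not the Blocks9 parameters.

```python
import math


def log_marginal(counts, alpha):
    """log Pr(counts | alpha), omitting the multinomial term that is identical
    for every component and therefore cancels when normalizing."""
    n_total, a_total = sum(counts), sum(alpha)
    value = math.lgamma(a_total) - math.lgamma(n_total + a_total)
    for n_i, a_i in zip(counts, alpha):
        value += math.lgamma(n_i + a_i) - math.lgamma(a_i)
    return value


def posterior_mean(counts, mixture):
    """Estimate residue probabilities at one column.

    mixture: list of (q_j, alpha_j) pairs, with q_j > 0 summing to 1.
    """
    logs = [math.log(q) + log_marginal(counts, alpha) for q, alpha in mixture]
    peak = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in logs]
    norm = sum(weights)
    n_total = sum(counts)
    estimate = [0.0] * len(counts)
    for w, (_, alpha) in zip(weights, mixture):
        a_total = sum(alpha)
        for i, (n_i, a_i) in enumerate(zip(counts, alpha)):
            estimate[i] += (w / norm) * (n_i + a_i) / (n_total + a_total)
    return estimate


# Toy mixture over a 3-letter alphabet: one component for strongly conserved
# columns and one near-uniform component. Values are illustrative only.
toy_mixture = [(0.5, [10.0, 0.1, 0.1]), (0.5, [1.0, 1.0, 1.0])]
conserved = posterior_mean([5, 0, 0], toy_mixture)
mixed = posterior_mean([2, 2, 1], toy_mixture)
```

With only five observations, a fully conserved column pulls the weight toward the conserved component and leaves little probability for unseen residues, while a mixed column leans on the near-uniform component. That behaviour is the context dependence the earlier slides ask for when estimating distributions from small training sets.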
Assessing sequence alignment with respect to structural alignment
Xia Jiang, Duncan Brown, Nandini Krishnamurthy

Alignment accuracy as a function of % ID (including homologs, full-length sequences)
(Figure: average CS score, on an axis from 0 to 1, versus percent ID in bins 10-15%, 15-20%, 20-25%, 25-30%, 30-35% and 35-40%, for CLUSTALW, MUSCLE, MAFFT and SATCHMO)

Alignment of proteins with different overall folds

Summary
• SATCHMO is designed to provide for the assumption of ‘positional homology’ during the tree estimation process
• This assumption -- that we can predict the structurally equivalent positions from sequence information alone -- needs to be tested
• We need a benchmark dataset to evaluate phylogenetic tree topology estimation