Structural Phylogenomic Inference of Protein Function
Kimmen Sjölander
University of California, Berkeley

• Extend function prediction through inclusion of structure prediction and analysis
• Predict active site and subfamily specificity positions
[Figure panels: anti-fungal defensin (radish); drosomycin (Drosophila); scorpion toxin; VirB4]

Annotation transfer by homology
• Status quo approach to protein function prediction
  – Given a gene (or protein) of unknown function:
    • Run BLAST to find homologs
    • Identify the top BLAST hit(s)
    • If the score is significant, transfer the annotation
  – If resources permit, predict domains using PFAM or CDD
• Problems:
  – Approach fails completely for ~30% of genes
  – Of those with annotations, only 3% have any supporting experimental evidence
    • 97% have had functions predicted by homology alone*
  – High error rate
* Based on analysis of >300K proteins in the UniProt database

Tomato Cf-2 Bioinformatics Analysis
Domain fusion and fission events complicate function prediction by homology, especially for very common domains (e.g., LRR regions). Domain structure analysis (e.g., PFAM) is often critical.
Tomato Cf-2 (GI:1587673): Dixon, Jones, Keddie, Thomas, Harrison and Jones JDG, Cell (1996)
• BLAST against Arabidopsis: the top BLAST hit in Arabidopsis is an RLK!
• Panther
• PFAM results

Errors due to domain shuffling (sic)

Error presumably due to non-orthology of database hits used for annotation
Phylogenetic analysis suggests it’s more likely a Biogenic Amine GPCR

Human neutral sphingomyelinase or bacterial isochorismate synthase?

Database annotation errors
Main sources of annotation errors:
1. Domain shuffling
2. Gene duplication (failure to discriminate between orthologs and paralogs)
3. Existing database annotation errors
   – Errors in gene structure
   – Contamination
   – Other…
Propagation of existing database annotation errors
Galperin and Koonin, “Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption,” In Silico Biol. 1998

Phylogenomic inference
Eisen, “Phylogenomics: Improving Functional Predictions for Uncharacterized Genes by Evolutionary Analysis,” Genome Research 1998
Sjölander, “Phylogenomic inference of protein molecular function: advances and challenges,” Bioinformatics 2004

[Figure: Piet Hein, Grooks]

“There is nothing more difficult to take in hand, more perilous to conduct, or more uncertain in its success, than to take the lead in the introduction of a new order of things. Because the innovator has for enemies all those who have done well under the old conditions, and lukewarm defenders in those who may do well under the new. This coolness arises partly from the incredulity of men, who do not readily believe in new things until they have had a long experience of them.” (Niccolò Machiavelli, The Prince)

Construction of genome-scale phylogenomic libraries
• Cluster genome into global homology groups
• Include homologs from other species
• Construct multiple sequence alignment
• Construct phylogenetic trees
• Overlay with annotation data
• Identify subfamilies
• Retrieve key literature
• Predict cellular localization
• Predict protein structure
• Predict key residues
• Deposit book in library
• Construct HMMs for the family and for individual subfamilies
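
For contrast with the library pipeline above, the status quo annotation transfer described earlier (take the top BLAST hit and, if its score is significant, copy its annotation) can be sketched in a few lines. This is a minimal illustration, not the behavior of any particular tool: the function name `transfer_annotation`, the (subject, E-value) hit format, the 1e-5 cutoff and the example identifiers are all assumptions made here.

```python
from typing import Dict, List, Optional, Tuple

Hit = Tuple[str, float]  # (subject identifier, E-value); hypothetical format


def transfer_annotation(hits: List[Hit],
                        annotations: Dict[str, str],
                        max_evalue: float = 1e-5) -> Optional[str]:
    """Copy the annotation of the most significant hit, if any hit is significant.

    The E-value cutoff is an assumed default, not a value from the slides.
    """
    significant = [h for h in hits if h[1] <= max_evalue]
    if not significant:
        return None  # no usable homolog: the gene stays unannotated
    top_id, _ = min(significant, key=lambda h: h[1])
    # No check for orthology or domain architecture is made here,
    # which is exactly the weakness the slides point out.
    return annotations.get(top_id)


# Hypothetical hits for a query resembling the Cf-2 case.
hits = [("arabidopsis_rlk", 1e-80), ("cf2_like", 1e-40)]
annotations = {"arabidopsis_rlk": "receptor-like kinase",
               "cf2_like": "disease resistance protein"}
print(transfer_annotation(hits, annotations))
```

With these made-up inputs the top hit is the kinase, so the kinase annotation is transferred even though a better functional match sits lower in the hit list. Nothing in the procedure looks at domain structure or at the tree relating the hits.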
Berkeley Universal Proteome Phylogenomic Explorer
9,707 protein family “books” and 708K HMMs, and expanding daily
http://phylogenomics.berkeley.edu/UniversalProteome

Protein fold prediction
[Figure: VirB4 and the TrwB structure (1E9RA), 12% identity; active site indicated]

Example Book: Voltage-gated K+ channels
SCI-PHY subfamilies are supported by the ML tree, and are also consistent with subtype and phylogenetic distribution (only one branch of the ML tree is displayed).
GO annotations for Shal subfamily

Database queries
Look up protein family “books” based on the annotations associated with any sequence. Queries can be based on GO biological process, PFAM domains, UniProt accession numbers, etc.

Key algorithms in PhyloFacts library construction
• What clustering methods are appropriate for inference of protein function?
• What alignment methods are accurate?
• How to mask?
• What tree methods to use?
• How to root a tree?
• Can we define functional subfamilies automatically?

Fraction of superposable positions drops with evolutionary divergence

%ID      #pair    %Superpos
>70      107      90.6
50-70    63       87.2
40-50    46       83.4
30-40    65       85.4
25-30    41       82.1
20-25    53       77.9
15-20    84       73
10-15    151      64.4
5-10     204      50.4
0-5      122      39.5

Pairwise alignment              MSA-pw
BLAST     ClustalW  Tcoffee     ClustalW  MAFF
0.954     0.955     0.955       0.955     0.9
0.862     0.903     0.894       0.901     0.9
0.824     0.872     0.855       0.856     0.8
0.811     0.874     0.867       0.87      0.8
0.779     0.782     0.788       0.795     0.8
0.612     0.599     0.627       0.633     0.6
0.381     0.451     0.457       0.49      0.4
0.16      0.186     0.234       0.302     0.
-0.007    -0.014    0           -0.047    0.0
-0.033    -0.049    -0.051      -0.034    -0.0

FlowerPower
• Clustering global (or glocal) homologs
• Minimize profile drift
• Improved alignment accuracy
Nandini Krishnamurthy, Ph.D.

Q = query
Step 1: Construct SearchDB using PSI-BLAST against the target database.
Step 2: Select and align core set.
  Inclusion criteria:
  – E-value 1e-10
  – Bi-directional coverage
  MUSCLE multiple alignment (Edgar, 2003)
Step 3: Run SCI-PHY to identify subfamilies and build subfamily HMMs (SHMMs).
  BETE subfamily identification: Sjölander 1998
  SHMM construction: Brown et al, 2004
Step 4: SHMMs compete for sequences from SearchDB. Sequences meeting criteria are aligned to their closest SHMM.
Step 5: Run SCI-PHY on the extended alignment to identify new subfamilies and construct SHMMs.
Iterate until convergence.

Comparing FlowerPower, BLAST, PSI-BLAST and UCSC T2K
Test: Clustering global homologs
Agreement at domain structure determined by PFAM. SCOP used to cluster PFAM domains into structural equivalence classes.

Subfamily Classification In PHYlogenomics (SCI-PHY)
Nandini Krishnamurthy, Ph.D.; Duncan Brown

Multiple sequence alignment
Seq1  LERY-K
Seq2  LDRFPR
Seq3  IERYGK
Seq4  MDRF-K
Seq5  VERYGK

[Figure: phylogenetic tree and subfamily decomposition; leaf order 5, 3, 1, 4, 2]

Agglomerative clustering
Input: MSA
Initialize: construct profile [1] for each row in MSA
While (#clusters > 1) {
    Join closest [2] pair of clusters
    Re-estimate profile [1]
    Compute encoding cost [3] for this stage
}
/* cut tree using minimum encoding cost */
[1] Use Dirichlet mixture densities
[2] Distance function: relative entropy
[3] Sjölander, K., “Phylogenetic inference in protein superfamilies: Analysis of SH2 domains,” Proceedings of Conference Intelligent Systems for Molecular Biology

Detection of critical positions

Subfamilies identified using minimum encoding cost principles
• Each stage of the algorithm defines a different set of alignments, one for each cluster (“subfamily”).
• Find the point during the clustering where the encoding cost of the alignments is minimal. This defines the subfamily decomposition.
[Figure: encoding cost plotted against number of classes, from 1 to N]
N = number of sequences; S = number of subfamilies; n_c,1 … n_c,S are the amino acids aligned by subfamilies 1 through S at column c; Θ represents the Dirichlet mixture prior.
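
The agglomerative procedure above can be sketched as follows. This is a simplified illustration under stated assumptions, not the SCI-PHY or BETE implementation: a uniform pseudocount (`ALPHA`) stands in for the Dirichlet mixture densities, the distance is a symmetrized relative entropy summed over columns, and `encoding_cost` is a stand-in cost (bits to label each sequence with a subfamily, plus bits to encode each residue under its subfamily profile) rather than the cost function of the cited paper. All function names are hypothetical.

```python
import math
from itertools import combinations

AMINO = "ACDEFGHIKLMNPQRSTVWY"
ALPHA = 0.1  # assumed uniform pseudocount; a stand-in for a Dirichlet mixture


def profile(rows):
    """Per-column amino acid distributions for a cluster of aligned rows."""
    cols = []
    for c in range(len(rows[0])):
        counts = {a: ALPHA for a in AMINO}
        for r in rows:
            if r[c] in counts:  # gap characters contribute no counts
                counts[r[c]] += 1.0
        total = sum(counts.values())
        cols.append({a: v / total for a, v in counts.items()})
    return cols


def distance(p, q):
    """Symmetrized relative entropy between two profiles, summed over columns."""
    return sum((pc[a] - qc[a]) * math.log(pc[a] / qc[a])
               for pc, qc in zip(p, q) for a in AMINO)


def encoding_cost(msa, clusters):
    """Stand-in cost in bits: subfamily labels plus residues under each profile."""
    cost = len(msa) * math.log2(len(clusters))
    for members in clusters:
        rows = [msa[i] for i in members]
        prof = profile(rows)
        for r in rows:
            for c, a in enumerate(r):
                if a in prof[c]:
                    cost -= math.log2(prof[c][a])
    return cost


def subfamilies(msa):
    """Merge the closest clusters until one remains; keep the cheapest stage."""
    clusters = [[i] for i in range(len(msa))]
    best_cost, best = encoding_cost(msa, clusters), [list(c) for c in clusters]
    while len(clusters) > 1:
        profs = [profile([msa[i] for i in c]) for c in clusters]
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: distance(profs[ij[0]], profs[ij[1]]))
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        cost = encoding_cost(msa, clusters)
        if cost < best_cost:  # the cut is the stage with minimum cost
            best_cost, best = cost, [list(c) for c in clusters]
    return sorted(best)


# The five-sequence example alignment from the slide (zero-based indices).
msa = ["LERY-K", "LDRFPR", "IERYGK", "MDRF-K", "VERYGK"]
print(subfamilies(msa))
```

The returned lists hold zero-based row indices, so index 0 is Seq1. Swapping in a real Dirichlet mixture would change only `profile` and the probability used inside `encoding_cost`; the merge loop and the minimum-cost cut would stay the same.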
SCI-PHY analysis of selected GPCRs
Venter et al, “The sequence of the human genome,” (2001) Science
Sjölander, “Phylogenomic inference of protein molecular function: advances and challenges,” (2004) Bioinformatics

Key residue prediction using subfamily and family-wide conservation analysis
Elizabeth Hua-Mei Kellogg, Ryan Ritterson, Nandini Krishnamurthy
[Figure: predicted key residues labelled on structure: Y221, W222, D558, R627, D628, G629, Y743, A744, H745; motif labels D, RD, E, YAH]
Parker JS, Roe SM, Barford D., EMBO J., 2004
Tanaka Hall, T., Structure 2005
Rivas et al, 2005

Function Prediction Using HMMs
[Figure labels: Family, Subfamily, Error]
Family HMMs: 7TM GPCR, ABC Transporter, Amidohydrolase, ATPase
Subfamily HMMs:
  3.5.2.2  Dihydropyrimidinase
  3.5.4.1  Cytosine deaminase
  3.5.2.3  Dihydroorotase
  3.5.1.5  Urease

Subfamily HMM construction
1. At completely conserved positions, and subfamily gapped positions: use match state distributions estimated for the general (family) HMM.
2. At other positions:
   1. Estimate the Dirichlet mixture density posterior for each subfamily at each position separately.
   2. Use Dirichlet density posteriors to weight contributions from other subfamilies.
   3. Compute the amino acid distribution using weighted counts and the standard Dirichlet procedure.
Brown et al, “Subfamily HMMs in functional genomics,” (2005) Pacific Symposium on Biocomputing

Subfamily HMMs increase the separation between true and false positives
• 515 unique SCOP folds
• PFAM full MSAs
• Scored against Astral PDB90
1.5% error rate in subfamily classification using top-scoring SHMM

SATCHMO: Simultaneous Alignment and Tree Construction using Hidden Markov mOdels
Xia Jiang, Nandini Krishnamurthy, Duncan Brown, Michael Tung, Jake Gunn-Glanville, Bob Edgar
Edgar, R., and Sjölander, K., “SATCHMO: Sequence Alignment and Tree Construction using Hidden Markov models,” Bioinformatics. 2003 Jul 22;19(11):1404-11

SATCHMO motivation
• Structural divergence within a superfamily means that…
  – Multiple sequence alignment (MSA) is hard
  – The set of alignable positions varies according to degree of divergence
• Current MSA methods are not designed to handle this variability
  – Assume globally alignable, all columns (e.g., ClustalW)…
    • Over-aligns, i.e., aligns regions that are not superposable
  – …or identify and align only highly conserved positions (e.g., SAM software with HMM “surgery”)
• Challenge
  – Different degrees of alignability in different sequence pairs, different regions
  – Masking protocols are lossy: loop regions may be variable across the family but may be critical for function!

SATCHMO algorithm
• Input: unaligned sequences
• Initialize: a profile HMM is constructed for each sequence.
• While (#clusters > 1) {
  – Use profile-profile scoring to select clusters to join
  – Align clusters to each other, keeping columns fixed
  – Analyze joint MSA to predict which positions appear to be structurally similar; these are retained, the remainder are masked.
  – Construct a profile HMM for the new masked MSA
  }
• Output: Tree and MSA

Alignment of proteins with different overall folds

Assessing sequence alignment with respect to structural alignment
Xia Jiang, Duncan Brown, Nandini Krishnamurthy
[Figure: alignment accuracy as a function of % ID (including homologs, full-length sequences). y-axis: average CS score, 0 to 1; x-axis: percent ID in bins from 10-15% to 35-40%; series: CLUSTALW, MUSCLE, MAFFT, SATCHMO]

Future work: Interactive specificity position identification
• Enable users to select subtrees for analysis
• Identify positions conserved within each subtree, but which differentiate the two**
• Plot over MSA and on structure (if available)
[Figure: catalytic residues colored red]
** Donald and Shakhnovich, NAR 2005

Major challenge: Phylogenetic uncertainty
Given: gene A (unknown function) and genes B and C (characterized function)
Predict function for A.
[Figure: three alternative tree topologies relating A, B and C]
Problem: use three phylogenetic tree methods, get 3 or more trees! Change the MSA, you also change the tree…
Need: Better simulation studies, benchmark datasets

http://phylogenomics.berkeley.edu
Berkeley Phylogenomics Group
PI: Kimmen Sjölander
Nandini Krishnamurthy, Ph.D.
Duncan Brown
Sriram Sankararaman
Xia Jiang
Jake Gunn-Glanville
Lead programmer and web administrator: Dan Kirshner
This work is supported in part by a Presidential Early Career Award for Scientists and Engineers from the NSF, and by an R01 from the NHGRI (NIH).