Download 09_Handelman - Structural Biology Knowledgebase
Functional Classification of PSI Proteins to Support High Throughput Biochemical Characterization: Classes of Reciprocal Sequence Homologs (CRSH)
Samuel Handelman, Nelson Tong, Jon D. Luff, David P. Lee, André Lazar, Paul Smith, Prasanna Gogate, Rohan Mallelwar and John Hunt

Bacterial physiology in the post-genome era
• Exponential growth in sequence information.
• Structural information is more difficult to obtain. Evolution is key to leveraging what we do know.
• Direct functional information is scarcer still: evolution and comparative studies are even more critical.
[Figure: genome images vs. protein structure images. Genome images from BacMap (UAlberta) and VirtualLaboratory; protein structure images from NESG (Columbia/Rutgers).]

Even today, most proteins are of unknown biochemical function
[Figure: two pie charts.
E. coli (~4,200 proteins): 53% “Known”; other label: “hypothetical”, “putative”, “uncharacterized” or “unknown” (01/23/08).
H. sapiens (~27,000 proteins): 54% “Known”; other label: neither identical nor similar to any experimentally validated protein.*]
• Closing this gap lays the groundwork for systems biology.
*Genome Information Integration Project And H-Invitational 2 (2007) Nucleic Acids Research 36:D793-799

CRSH Goal: Group Functionally Equivalent Homologs.
• Homology clusters contain multiple distinct protein functions.
CRSH Approach:
• Identify subclusters such that all members have equivalent function (in bacteria only).

Topic Overview
• CRSH: what they are, why they’re useful
• CRSH Web Interface, merits of mapping of TargetDB to protein functional groups
• Using CRSH and Gene Neighborhood to predict stable tertiary interactions.

Classes of Reciprocal Sequence Homologs (CRSHs)
Predicted proteins from 474 fully sequenced bacterial genomes
Cluster based on BLAST scores; verify clusters on profile scores
Main application: Gene neighborhood method.
Calculate “co-localization” counts for all CRSH pairs (number of times their genes are within 15 kb on chromosomes of fully diverged organisms)
Split into sub-clusters when multiple members come from a single organism (likely paralogs); verify sub-clusters on profile scores
Merge sub-clusters into classes if more similar than expected after accounting for inter-organism distances; verify final classes on profile scores
Result: ~75,000 CRSHs, likely same function

Split into sub-clusters when multiple members come from a single organism
[Figure: example cluster in which each link indicates a pair of reciprocal closest homologs in their respective organisms. Members: M. tuberculosis RV0859, E. coli PaaJ, A. tumefaciens ATU0502, A. tumefaciens PcaF. Group labels: acetyl-CoA acetyltransferases; beta-ketoadipyl CoA thiolases. Courtesy Marco Punta.]

Gene Neighborhood Preview
Each octagon represents a CRSH.
[Figure: octagons 1, 2, 3 and 4 on Genome 1; octagons 1 and 3 recur together on Genome 2, Genome 3, …, Genome N.]
• Stronger neighborhood conservation => better function predictions.
• Insight into function of unknown proteins.
“Co-localized” = within 15 kb

Frequency Distribution of Mean %ID in CRSH
[Figure: histogram. x-axis: Mean %ID (0 to 100); y-axis: Frequency (0 to 0.2).]
• Tremendous range in sequence conservation with more or less equivalent conservation of function.

A Fixed Homology Threshold Fails to Reliably Segregate Functionally Equivalent Proteins
[Figure: x-axis: Length-Normalized Blast Bit-Score (0.00 to 1.50); y-axis: P (Each Gene Neighbor is Conserved) (0.0 to 0.4); series: Orthologs, Paralogs.]

Like Rost clusters, but for function
• Based on sequence information, you can conclude that two proteins have the same structure, even if you don’t know the structure.
• We’re working towards an analogous scheme for protein function, but each functional group needs its own cutoff.
• We propose to do this especially for proteins whose function we do not yet know.
Graph courtesy Burkhard Rost

• We have developed a web interface for these CRSH, which is meant for use by experimentalists.
• Presently hosted in India (at http://61.8.141.68:8080/Columbia/); it will be hosted at the NESG (at www.orthology.org), where CRSH pages will be available for each entry in targetDB.
• The CRSH pages that follow have been mapped to targetDB, so that biologists working in the centers can access them directly.
• Within 2 months we hope for a direct link from the PSI TargetDB gateway to the CRSHs.
• CRSHs already have links to biocyc, a leading bacterial physiology database; links coming to other functional genomics databases.
• A consensus domain architecture schematic will appear shortly.

• The applet on the left provides a graphical display of the phylogenetic distribution.
• In the near future, we’ll add the info from targetDB to this applet and to the table below.
• Known complexes in biocyc are targets for structural genomics efforts to solve multi-protein structures. The genetically co-localizing CRSH are promising secondary targets, as I will explain…

Gene Neighborhood Hypothesis Generation
With suggested applications in structural genomics and functional genomics
OR
Rational ideas have consequences for action; reason necessarily has a constructive function.

• For every pair of CRSH for which complex-membership data is available in biocyc, we count the instances where the two CRSH appear in a putative operon together.
• These counts correlate strongly with well-established, well-studied, stable and definitive physical complexes (drawn in this case from biocyc).
• These probabilities are overestimated due to the methods used.
[Figure: Known Stable Complexes Strongly Correlate with Gene Neighborhood. x-axis: Co-localization counts (logarithmic bins, 0 to 250+); y-axis: P(CRSH together in stable complex) (0 to 1); series: All Hetero-Complexes, Heterodimers Only.]

• For each CRSH, we extract from biocyc a set of known small molecule interaction partners (ligands, substrates, products, etc.)
We excluded very common partners (water, phosphate, ATP, etc.).
• Because proteins together in operons are often part of the same metabolic pathways or respond to similar chemical signals, it is reasonable to extrapolate small molecule interactions to the conserved gene neighbors.
• There is a definite correlation. This graph is preliminary; it is likely an underestimate.
[Figure: Gene Neighborhood has some Correlation with Small Molecule Interaction Partners. x-axis: Aggregate Co-localization counts for CRSH/Small Molecule; y-axis: P (Known Interaction between CRSH Member and Small Molecule).]

• This view, which is still in beta, gives the known small-molecule interactions of all of the gene neighbors for a given CRSH, weighted to reflect the strength of gene neighborhood conservation.
• As well as providing a starting point for interaction screening, this can make the functional insights provided by the gene neighborhood method more accessible.

Salvage Pipeline
• For structural genomics targets which have been cloned and are soluble, but which have failed to crystallize, we introduce a parallel pipeline to salvage them by adding “known” or predicted protein or small molecule binding partners.
[Figure: Crystallize without Partner; Crystallize with Partner.]
• Bonus biology: whole greater than sum of parts.

Concluding Remarks
• We are eager to add links to PSI resources to our CRSH pages; they are intended to facilitate collaboration between structural and functional genomics, in particular.
• Functional information can improve the impact of structural genomics efforts, and may provide new salvage pathways for difficult targets.
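
The co-localization count that drives the gene neighborhood method (how often genes from two CRSHs lie within 15 kb of each other) can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation. The input layout, the function name, the use of gene start coordinates to measure distance, and counting each genome at most once per pair are all assumptions. The restriction to fully diverged organisms and the wrap-around of circular chromosomes are not modeled.

```python
from collections import Counter
from itertools import combinations

WINDOW_BP = 15_000  # "co-localized" = within 15 kb, as stated in the slides


def colocalization_counts(genomes, window_bp=WINDOW_BP):
    """Count, for each CRSH pair, the genomes in which the pair is co-localized.

    `genomes` maps a genome name to a list of (crsh_id, contig, start_bp)
    tuples. This layout is hypothetical and chosen only for illustration.
    """
    counts = Counter()
    for genes in genomes.values():
        pairs_here = set()
        for gene_a, gene_b in combinations(genes, 2):
            crsh_a, contig_a, pos_a = gene_a
            crsh_b, contig_b, pos_b = gene_b
            # Skip same-class pairs and genes on different replicons.
            if crsh_a == crsh_b or contig_a != contig_b:
                continue
            if abs(pos_a - pos_b) <= window_bp:
                pairs_here.add(tuple(sorted((crsh_a, crsh_b))))
        # Assumption: each genome contributes at most 1 to a given pair.
        counts.update(pairs_here)
    return counts


# Toy data (invented): C1 and C3 are neighbors in two of three genomes.
example = {
    "genome_1": [("C1", "chr", 1_000), ("C2", "chr", 9_000),
                 ("C3", "chr", 14_000), ("C4", "chr", 90_000)],
    "genome_2": [("C1", "chr", 50_000), ("C3", "chr", 52_000)],
    "genome_3": [("C1", "chr", 5_000), ("C3", "chr", 400_000)],
}
counts = colocalization_counts(example)
print(counts[("C1", "C3")])  # 2
```

Pairs are stored as sorted tuples so that (C1, C3) and (C3, C1) share one counter. A real pipeline would also need to handle circular chromosomes and weight genomes by evolutionary distance, which this sketch leaves out.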
Thank you
John “The Jersey Eliminator” Hunt; Paul “Schmitty” Smith; Greg “Cassis” Boel; Sai “Full Nelson” Tong; Marco “The Shark” Punta; Burkhard “Wrecking Ball” Rost; Prasanna “Crackerjack” Gogate; Rohan “The Punisher” Mallelwar; Jon “JD” Luff; Liang “Red, White and Thunder” Tong; Howard “Hurricane” Shuman; Dana “Steel Toe” Pe’er; Harmen “H-Bomb” Bussemacher; Larry “The Tank” Chasin; Dre “Enter the Dragon” Lazar; David “Intravenous” Lee; Girish “Bone Breaker” Rao; Stephanie “Bronx” Wong; Diana “1-2-3” Flynn; George “El Pato Loco” Oldan; Allison “Grid Iron” Fay; Jordi “El Chupacabra” Banach; John “Steel” Dworkin; Etay “Aces” Ziv; Chris “Fireball” Wiggins; Gerwald “Sunshine” Jogl; Cal “Howitzer” Lobel; Yongzhao “Downtown” Shao; David “Finger of Death” Draper; Gae “Knuckles” Monteleone; Mike “The Red Baron” Baran; John “Mountain Man” Everett
The Hunt Lab, The NESG
American Heart Association, CF Foundation, NSF.

Consistency in CRSH sequence divergence levels between remote phyla
[Figure: two scatter plots; each dot is a CRSH.
Panel 1: length-normalized blast bit score, D. radiodurans with B. subtilis plotted against E. coli with S. elongatus (axes 0.5 to 2.0).
Panel 2: a.a. %ID with binomial standard error, D. radiodurans with B. subtilis plotted against E. coli with S. elongatus (axes 25 to 85).]

Deviation from Evolutionary Consensus in Protein Complexes
[Figure: histogram. x-axis: Spearman's Rho on Deviation from Consensus Distance (-1 to 1); y-axis: Frequency (0 to 0.4); series: Interaction Pairs from Biocyc; Random Pairs from Biocyc Interaction Set, with two S.D. against hypothesis.]
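
The backup plots report amino-acid %ID with a binomial standard error and summarize deviation from consensus with Spearman's rho. A minimal Python sketch of both quantities follows. It assumes that each aligned position is an independent identical/non-identical trial, and that rho is computed as the Pearson correlation of average ranks. The function names and sample numbers are illustrative and are not taken from the slides.

```python
import math


def percent_identity_with_se(identical, aligned):
    """Return (%ID, binomial standard error), both in percentage points."""
    if aligned <= 0:
        raise ValueError("aligned length must be positive")
    p = identical / aligned
    se = math.sqrt(p * (1.0 - p) / aligned)  # sqrt(p(1-p)/n)
    return 100.0 * p, 100.0 * se


def average_ranks(values):
    """Rank values starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho as the Pearson correlation of the rank-transformed data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)


# Invented numbers: 50 identical residues over 100 aligned positions.
pid, se = percent_identity_with_se(identical=50, aligned=100)
print(pid, se)  # 50.0 5.0

# A monotonic but non-linear relationship still gives rho = 1.
rho = spearman_rho([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 25.0, 90.0])
print(rho)  # 1.0
```

The rank-based statistic is a natural fit for comparing divergence levels across phyla because it only asks whether two sets of values rise and fall together, not whether they are linearly related.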