* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download AFP for Structural Genomics and Metagenomics
		                    
		                    
								Survey							
                            
		                
		                
                            
                            
								Document related concepts							
                        
                        Metagenomics wikipedia , lookup
Histone acetyltransferase wikipedia , lookup
Point mutation wikipedia , lookup
Epigenetics of neurodegenerative diseases wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
						
						
							Transcript						
					
					Predicting the function of a protein form either a sequence or a structure (is not trivial) Adam Godzik The Sanford-Burnham Medical Research Institute Summary - overview     Homology based methods Analogy based methods Physics based methods Why function prediction? What we mean by function  Multilevel definition    Phenotype Cellular function Molecular function (activity)    Substrates Inhibitors cofactors  Several attempts to develop a unified function classification    EC classification for enzymes 4.2.31.101 Merops (proteases), CAZY (hydrolases) Gene ontology Two, complementary views of the evolution and diversity of life Organisms (species) Genes (proteins) Both are amazingly large and diverse  Organisms (species)   About 1.5M known today, 10-100 million species estimated to exists, depending on the definition of species and other assumptions Their relations can be described in a tree of life, at least for eukaryotes.  Proteins   With 20 amino acid alphabet, the number of possible protein sequences is very large (20100 i.e. 1.2*10130 short proteins(!)) Total number: >10billions?    Bacterial and archeal tree of life is much more controversial, some even dispute the concepts of species for bacteria  10-100M species, with ~4K genes in a bacterial and ~10K in an eukaryotic genome Over 25 million known today, i.e. ~0.2%  Representative sample? From the 25 million proteins known today    Direct experimental data is available for few thousand proteins Indirect experimental data are available for perhaps few hundred thousand Structures of ~60 thousands have been solved protein universe seems to be very large. But is it random? Many proteins (like species) are close relatives        Histone H1 (human) - histone H1 (chicken) SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA | | || || || ||| ||| | |||||||||||||||||| ||| |||||| || SKKSTDHPKYSDMIVAAIQAEKNRAGSSRQSIQKYIKSHYKVGENADSQIKLSIKRLVTT similarity: 77% id, BLAST e.value 0.0 function: two H1 histones from different species (orthologs) Their functions and structures are obviously very similar We can organize the protein universe into neighborhoods (families)? How many protein families are still out there? Number of clusters (thousands) Rate of discovery 250 200 size >=3 150 size >=5 size >=10 100 size >=20 50 0 0 1 2 3 4 5 6 7 Number of sequences (millions)  Number of protein clusters (modeling families) grows linearly in number of protein sequences (and exponentially in time) – cumulative total From Yooseph et al, PloS Biology, (2007) 5:e16 How far can we go?        Histone H5 - histone H1 TYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRLA | | | | | | | | | ||| | | | |||| |||||||| SVTELITKAVSASKERKGLSLAALKKALAAGGYDVEKNNSRIKLGLKSLVSKGTLVQTKGTGASGSFRLS similarity: 40% seq id, BLAST e.value 10-15 function: two histones (paralogs) Structures still very similar, functions somewhat different, but obviously similar This is surely too far?  Histone H5 - TRANSCRIPTION FACTOR E2F-4  PTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRL | | | | | GLLTTKFVSLLQEAKD-GVLDLKLAADTLA------VRQKRRIYDITNVLEGIGLIEKKS----KNSIQW  similarity :7% seq id, BLAST e.value 1  Is it?    Structure – obviously similar (2.4 Å RMSD over 80 aa) function – clearly related (both bind DNA) More subtle similarity can be detected with more sophisticated methods We can keep adding more layers most “function assignments” are provided by predicted homology   Unknown protein GLLTTKFVSLLQEAKDGVLDLKL AADTLAVRQKRRIYDITNVLEGIG Similarity LIEKKSKNSIQW   -> homology similarity ? prediction Well studied protein SRRSASHPTYSEMIAAAIRAEKS RGGSSRQSIQKYIKSHYKVGHN ADLQIKLSIRRLLAA Similarity -> homology based annotations  Recognition of close and/or distant homologs based on similarity    Sequence Sequence/profile, profile/profile Structure  Problems  How to predict differences? Even homologous proteins evolve and change! Prediction by homology Recognition Alignment Are there any well characterized proteins similar to my protein? Can we assume they are homologous? What is the position-by-position target/template equivalence Modeling Structure of my protein is similar to the other one Function prediction Function of my protein is similar to the other one We could predict Structure of a complex Role in the whole organism activity 3D structure Important distinction  Similarity    Homology Two proteins have similar sequences/structures/function s if by some metric the s/s/f of one protein is more similar to the s/s/f of another than to a randomly chosen protein  Two proteins are homologous if they have evolved from a common ancestor Common error  Two proteins are 65% homologous   What we really meant The sequences of two proteins are 65% similar, therefore we can safely assume they are homologous, why else they would be so similar? If life would be easy, this is how it would look like similar homologous not similar unrelated Not (obviously) similar, but (probably) homologous  Histon H5 and transcription factor E2F4, identity 7%, similar fold, similar function (DNA binding)  PTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRL | | | | | GLLTTKFVSLLQEAKD-GVLDLKLAADTLA------VRQKRRIYDITNVLEGIGLIEKKS----KNSIQW  Similar, but not homologous   phosphoribosyltransferase and viral coat protein, identity: 42%, different folds, different functions . . . . . 99 IRLKSYCNDQSTGDIKVIGGDDLSTLTGKNVLIVEDIIDTGKTMQTLLSLVRQY.NPKMVKVASLLVKRTPRSVGY 173 : ||. ||| || |. || | : | | | | || | || |:| | ||.| | 214 VPLKTDANDQ.IGDSLY....SAMTVDDFGVLAVRVVNDHNPTKVT..SKVRIYMKPKHVRV...WCPRPPRAVPY 279  Similarity vs. homology similar homologous not similar homologous similar not homologous not similar unrelated Can we return to this simple picture by redefining similarity? similar homologous not similar unrelated Are these two protein families related?   New protein (target)  KAAELEMEKEQILRSLGEISVHNCMFKLEECDREEI  EAITDRLTKRTKTVQVVVETPRNEEQKKALEDATLM IDEVGEMMHSNIEKAKLCLQ Known protein (template) VKKDALENLRVYLCEKIIAERHFDHLRAKKILSREDTEEISCRTS SRKRAGKLLDYLQENPKGLDTLVESIRREKTQNF How to compare two families? Score =  alignment   Fam1 Fam2 Ali MM ij B jm ? Profile-profile similarity Compare as vectors in 21 dimensional space (FFAS) How to validate a protocol 1. Recognition  Folding benchmarks     from structural clustering of PDB (several sets, 700 pairs used here)compared to sequence based clustering of the same group of proteins correct predictions vs. wrong predictions CASP meetings, CAFASP, LiveBench published and/or publicly available predictions, fold prediction servers, available prediction programs Summary - overview     Homology based methods Analogy based methods Physics based methods Why function prediction? Similarity -> analogy based annotations  Recognition of potential analogs based on similarity in     Genome organization (non homologous replacements) Genomic fingerprints Expression patterns Specific features   Charge distribution Presence of specific patterns  Problems  Is this similarity related to function? TM0449 (thy1) – from prediction to proof  TM0449  Hypothetical, uncharacterized protein  Multiple homologs in pathogenic and thermophilic bacteria  Novel fold evidence     Phylogenetic profile complementing thymidylate synthase A homolog complements TS in Dictyostelium Confirmed experimentally 3D motif search finds an identical arrangement binding phosphate in a different protein Summary - overview     Homology based methods Analogy based methods Physics based methods Why function prediction? “Ab initio function prediction” – substrate docking We know the structure of one protein in the family and functions of some others – is the function conserved? Newly solved target Gallery of models We can analyze conservation of surface features by mapping them on the sphere  And then compare maps between homologs And come up with new (predicted) functions Phospholipid vs. retinol vs. short peptide binding Summary - overview     Homology based methods Analogy based methods Physics based methods Why function prediction? Why my interest in function prediction?  Structural genomics: the structure is often the easiest experimental information to obtain (after sequence) Function vs function  We witnessed dramatic technological advances in sequencing and now structure determination, function analysis remain a painstaking, manual effort. We used to know a lot about function even before we started working on a protein. Well, not anymore 1970 1990 2005 2010 ? 1 year  Function discovery Structure determination Sequencing Structure determination is now done on an assembly line target selection expression cloning 1X 3X purification 5X 2X 2X imaging 1X harvesting 1X crystallization 1X 1X bl xtal mounting xtal screening 1X 1X publication annotation data collection tracing 2X 2X struc. validation 1X PDB phasing 1X 1X struc. refinement 7X Even few years ago functional annotation seemed trivial target selection expression cloning 1X 3X purification 5X 2X 2X imaging 1X harvesting 1X crystallization 1X 1X bl xtal mounting xtal screening 1X 1X publication annotation data collection tracing 2X 2X struc. validation 1X PDB phasing 1X 1X struc. refinement 7X After few years, the reality seems to be very different target selection expression cloning 1X purification imaging harvesting crystallization 1X 1X bl xtal mounting xtal screening data collection annotation struc. validation 1X PDB tracing 2X 2X publication phasing 1X 1X struc. refinement 7X “reverse order” of function and structure determination and it’s challenges  The classical way       1. A function is discovered and studied 2. The gene responsible in this function is identified 3. Function is confirmed 4. Product of this gene is isolated, crystallized solved. 5. we have a whole story! Structure “rationalizes” function and provides molecular details  Post-genomic    1. a new, uncharacterized gene is found in a genome 2. predictions or high-throughput methods prioritize this gene for further studies 3. the protein is studied in detail   Structure is solved in a high throughput center Structure is the first experimental information about the “hypothetical” protein We now have hundreds of structures of proteins with unknown functions Summary   For some, function prediction is a practical, day to day problem Analogy based approaches dominate the field     Homology seen from sequence similarity structural similarities Potential active sites, clefts, surface features Many useful tools exists, but they are very scattered and not very user-friendly Summary (2)     Avoid overconfidence - “easy” predictions contain many surprises Only synergy of several independent lines of reasoning can give a correct answer Elimination of “easy”, but inconsistent predictions is critical So far, AFP doesn’t even come close to expert analysis
 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                            