Protein World
SARA, 12-12-2002, Amsterdam
Tim Hulsen

Genome sequencing
• Since 1995: sequencing of complete ‘genomes’ (DNA), i.e. the order of the bases A/C/G/T:
  ACGTCATCGTAGCTAGCTAGTCGTACGTATG
  TGCAGTAGCATCGATCGATCAGCATGCATAC
• At this moment more than 80 genomes have been sequenced and published, of all kinds of organisms:
  – Animals
  – Plants
  – Fungi
  – Bacteria

Genomes

Proteins
• ‘Transcription’ and ‘translation’ of specific regions of the genome lead to proteins, consisting of twenty types of ‘amino acids’:
  ATG ACG CTG AGC TGC GGA CGT TGA -> MTLSCGR (ATG encodes the start methionine M; TGA is a stop codon)
• Proteins are responsible for all kinds of life processes
• All the proteins that can be produced in an organism are together called the ‘proteome’
• Sequence comparison makes it possible to classify proteins

Protein families
• E.g. the GPCR family
• Sequence comparison helps in predicting the function of new proteins

Determining protein functions
• The function of 40-50% of the new proteins is unknown
• Understanding protein functions and relationships is important for:
  – Study of fundamental biological processes
  – Drug design
  – Genetic engineering

Sequence comparison
• Smith-Waterman dynamic programming algorithm (1981): calculates the similarity/distance between two sequences:
  Query   ---PLIT-LETRESV
  Subject NEQPKVTMLETRQTAD
  (bold = similar)
• Results in a SW-score that measures how similar the two sequences are to each other
• Disadvantage: the score depends on sequence length
• After the alignments, the proteins are ‘clustered’ (divided into families) according to their similarity

Existing databases
• Domain-based clusterings: Prosite, Pfam, ProDom, Prints, Domo, Blocks
• Protein-based clusterings: ProtoMap, COGs, Systers, PIR, ClusTr
• Structural classifications: SCOP, CATH, FSSP
Why should there be another database?

Another method
• Enhanced Smith-Waterman algorithm: Monte-Carlo evaluation (Lipman et al., 1984)
• How likely is it that two sequences are similar but not related?
• One of the two sequences is randomized and the score is recalculated (200 times). Randomization yields sequences with the same length and the same composition, but a different order
• The method leads to the calculation of the Z-value:
  $$Z(A,B) = \frac{S(A,B) - \mu}{\sigma}$$
  where S(A,B) is the SW-score of sequences A and B, and µ and σ are the mean and standard deviation of the scores obtained with the randomized sequences

Advantages
• The obtained Z-value is a very reliable measure of sequence similarity, compared to the SW-score:
  – The SW-score depends on sequence length, the Z-value does not
  – Amino acid bias does not affect the Z-value
• Independent of the database size
• Easier updating of the database, without a total recalculation

Disadvantage
• LOTS of calculation time needed, especially when all proteins in all proteomes are compared to each other (“all-against-all”)!

SARA

SARA calculation
• Proteomes of 82 organisms compared ‘all-against-all’ with the Monte Carlo algorithm: more than 400,000 proteins!
• 21,600 CPU days (~520,000 CPU hours)
• = 21,600 PCs running in parallel for 24 hours, or 1 PC running for ~60 years
• Using the supercomputer TERAS (1024-CPU SGI Origin 3800) at SARA: less than two months!
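
The Z-value procedure described above (score the pair, shuffle one sequence 200 times, rescore, then normalize) can be sketched in a few lines of Python. This is only an illustration, not the code used for Protein World: the function names `sw_score` and `z_value` are hypothetical, and the scoring scheme (match = 2, mismatch = -1, linear gap = -2) is an assumption chosen for brevity. Real protein comparisons typically use a substitution matrix and affine gap penalties.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman with a linear gap penalty).

    The scoring parameters are illustrative assumptions, not the ones
    used in the Protein World calculation.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def z_value(a, b, n_shuffles=200, seed=0):
    """Monte Carlo Z-value: (S(A,B) - mu) / sigma.

    mu and sigma are the mean and standard deviation of the scores of A
    against shuffled copies of B (same length and composition, different
    order).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = sw_score(a, b)
    residues = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        shuffled_scores.append(sw_score(a, "".join(residues)))
    mu = statistics.mean(shuffled_scores)
    sigma = statistics.pstdev(shuffled_scores)
    if sigma == 0:
        # Degenerate case (e.g. a homopolymer): Z is undefined, so this
        # sketch reports 0.0 by convention.
        return 0.0
    return (observed - mu) / sigma


if __name__ == "__main__":
    # Ungapped versions of the query and subject shown on the alignment slide.
    query = "PLITLETRESV"
    subject = "NEQPKVTMLETRQTAD"
    print("SW-score:", sw_score(query, subject))
    print("Z-value :", round(z_value(query, subject), 2))
```

Because every pair needs 200 extra Smith-Waterman runs, the cost of the all-against-all comparison grows accordingly, which is why the calculation needed supercomputer time.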
Parties involved
• Gene-IT (Paris, France)
• SARA (Amsterdam, the Netherlands)
• CMBI (Nijmegen, the Netherlands)
• Organon (Oss, the Netherlands)
• EBI (Hinxton, UK)

Supporting parties
• Financed by NCF, foundation in support of supercomputing
• Under the auspices of BioASP, the new Dutch knowledge and service center for Bioinformatics

Results available through BioASP
• http://www.bioasp.nl
• Log in and click on links ‘Research’ and ‘Protein World’
• Organism selection screen
• Results screen
• Alignment screen

Conclusions
• Currently the most comprehensive and most accurate data-set of protein comparisons
• A start for a maintainable and unique database of all proteins currently known
• A rich data-source for clustering, datamining and orthology determination

Orthology determination
• Orthologs: genes/proteins in different species that derive from a common ancestor
• Orthologs often have the same function
• Interesting! Information from other species could help in annotating a protein

Thank you for your attention
Any questions?