Large scale protein sequence clustering

Prof. Dr. Antje Krause
Bioinformatics, Wildau University of Applied Sciences
[email protected]

Abstract

The concept of protein superfamilies, families and domains is one of the oldest in computational biology. Back in the 1960s, when the first protein sequence database was published in printed form, Margaret Dayhoff defined the basic principles of this discipline with only a small number of sequences at hand. Nowadays, with more than a million protein sequences available in public databases, a constantly growing number of uncharacterized proteins from completely sequenced genomes and still a comparatively small number of known protein structures, a systematic grouping and characterization of this data is needed more than ever. This tutorial reviews the different approaches developed during the last decades and points out possible challenges waiting in the future.

Antje Krause, Poznań, 14.07.2006

Margaret O. Dayhoff

“Dr. Margaret Oakley Dayhoff (1925-1983) was a pioneer in the use of computers in chemistry and biology, beginning with her PhD thesis project in 1948. Her work was multi-disciplinary, and used her knowledge of chemistry, mathematics, biology and computer science to develop an entirely new field. She is credited today as a founder of the field of Bioinformatics. This field is defined as the use of computers in solving information problems in the life sciences, mainly involving the creation of extensive electronic databases on protein sequences and genomes. Dr. Dayhoff was the first woman in the field of Bioinformatics.”
http://www.dayhoff.cc/

Margaret O. Dayhoff

• aimed to deduce evolutionary connections of the biological kingdoms, phyla, and other taxa from sequence evidence
• collection of all known protein sequences
• made available to others in 1965 in a small book
• contained sequence information of 65 proteins
• several releases followed
• resulted in the Protein Information Resource (PIR)

Protein sequences

>O54090|O54090_SULAC Hypothetical protein (Fragment).
MKILDYSDLVFFRKLTNKMRDPKTRFDVREFINRGEDYLFNYTNKNVGGVDERRRKFLKS
LIFGMAA
>P70723|P70723_ACIAM Orf-2 (Fragment).
MSKNSLDNLGEKALELLKKYPLCDSCLGRCFAKLGYRFANKERGKAIKTYLVLELDRKIK
DHELEDLNEIKEILFNMGKEYLEYLIYLSNEKFQERT
>sptrembl|Q9V2V9|Q9V2V9_PYRAE Rieske iron sulfur protein (ParR).
MVDENRRNTLKIFLGTTAALGAGMLATPLVASVIGSKAGYIKPEPSGAIPVEICKDVDSC
PKDYGVSLDELRNGPVFKLLKVNTMAIPAVFGIVRAKDGKEYPVAYVAICTHFGCPVNVS
GGKYLIGFNCPCHGSIFAICNDPNGCPDYNAAFLEMYVSGGPAPRSLRAIKVAVKDGVVY
PLVAYI
>O93973|O93973_MALSM Allergen.
MSNVIKKVFNTDKAEAEGSKVADAPQEAGHKGEGFLHDAKDRLQGFAGHGHHNAQNAASG
VAGSAGAGGAPSVPSANVDVTNPVNDASVQGGVEAPRSWSTQLPQSQSVADTTGATSAGR
NNLTQTTSTGSGVNVAAGNVDQDVQHLAPVTRHVHHRHEIEELLREREHHIHQHHIQHHV
QPVVDSEHLAEQIHSRVVPQTTVREVHANTDKDAALMRAVAGNPKDTFTQAAIDRSVIDK
GETVREIVHHHIHNIVQPIIEKETHEYHRIRTTIPTTHITHEAPIVHESTAHQPIRKEDF
LKGGGVLTSTTRSIEEVGLLNLGNNQRTVEGETYTGGLPLSQ
>Q02039|Q02039_RHYSE NIP1 precursor (NIP1 avirulence protein precursor).
MKFLVLPLSLAFLQIGLVFSTPDRCRYTLCCDGALKAVSACLHESESCLVPGDCCRGKSR
LTLCSYGEGGNGFQCPTGYRQC
>Q873M4|Q873M4_MALSM Manganese superoxide dismutase (Fragment).
PFYPIPSALPFPLPIHSLFSRRTRLFRFSRTAARAGTEHTLPPLPYEYNALEPFISADIM
MVHHGKHHQTYVNNLNASTKAYNDAVQAQDVLKQMELLTAVKFNGGGHVNHALFWKTMAP
QSQGGGQLNDGPLKQAIDKEFGDFEKFKAAFTAKALGIQGSGWCWLGLSKTGSLDLVVAK
DQDTLTTHHPIIGWDGWEHAWYLQYKNDKASYLKQWWNVVNWSEAESRYSEGLKASL
>Q2V2P9|Q2V2P9_YEAST Protein YDR119W-A.
MFFSQVLRSSARAAPIKRYTGGRIGESWVITEGRRLIPEIFQWSAVLSVCLGWPGAVYFF
SKARKA

What does a sequence tell us?

MVDENRRNTLKIFLGTTAALGAGMLATPLVASVIGSKAGYIKPEPSGAIPVEICKDVDSC
PKDYGVSLDELRNGPVFKLLKVNTMAIPAVFGIVRAKDGKEYPVAYVAICTHFGCPVNVS
GGKYLIGFNCPCHGSIFAICNDPNGCPDYNAAFLEMYVSGGPAPRSLRAIKVAVKDGVVY
PLVAYI

Function? Diseases? Regulation? Development? Structure? Evolutionary history? Interactions? Cellular location? Tissue?

(Illustration © David S. Goodsell 1999)

Protein structures

• Prediction of protein structure is still not possible from sequence alone
• Not all mechanisms of protein folding are known
• Experimental protein structure determination
  – is time consuming
  – is very expensive
  – is not always possible (the protein must be crystallized)
  – results in only one conformation
  – does not show flexible regions
  – does not show the protein in its natural environment
  – can in practice only be done with globular proteins (difficult with transmembrane proteins)

Different categories of protein databases

• Protein sequence databases:
  – Information about single proteins
• Protein structure databases:
  – Information about single proteins
• Protein domain databases:
  – Information about functional domains
• Protein (sequence) family databases:
  – Information about groups of evolutionarily and functionally related proteins
• Protein (structure) family databases:
  – Information about structural elements
• Gene family databases:
  – Information about groups of evolutionarily and functionally related proteins or genes, mainly of completely sequenced species

Protein sequence databases

• UniProt = Universal Protein Resource
• Integration of Swiss-Prot/TrEMBL and PIR
• http://www.expasy.uniprot.org
• central repository of protein sequence and function
• maintained by
  – European Bioinformatics Institute
  – Swiss Institute of Bioinformatics
  – Georgetown University

Protein sequence databases

• contain experimentally verified entries (Swiss-Prot) ...
• ... and translated entries from DNA databases, namely EMBL (TrEMBL)
  – predicted proteins
  – hypothetical proteins
  – putative proteins
• Problem in the past: no clear distinction between experimentally verified entries/annotation and predicted entries/annotation

Protein sequence databases (Swiss-Prot/TrEMBL), now UniProt!

ExPASy (http://www.expasy.ch)
  Expert Protein Analysis System
SIB (http://www.isb-sib.ch)
  Swiss Institute of Bioinformatics, Geneva, CH
Swiss-Prot (http://www.expasy.ch/sprot)
  Manually curated protein sequence database
TrEMBL (translated EMBL)
  Computer-annotated supplement to Swiss-Prot; contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot

Protein sequence databases (PIR-PSD)

NBRF (http://pir.georgetown.edu/nbrf)
  National Biomedical Research Foundation, Georgetown, Washington DC, USA
JIPID
  Japan International Protein Information Database
MIPS (http://mips.gsf.de)
  Munich Information Center for Protein Sequences, GSF, Neuherberg, Munich
PIR (http://pir.georgetown.edu)
  Protein Information Resource; collaboration of NBRF, JIPID and MIPS
PSD (http://pir.georgetown.edu/pirwww/search/textpsd.shtml)
  Protein Sequence Database; first published in the Atlas of Protein Sequence and Structure (1965-1978), the first systematic collection of protein sequences, generated by Margaret Dayhoff

Pattern construction and pattern search

[SN]-P-x-[LV]-x(2)-H-A-x(3)-F.
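
A pattern in this notation maps almost directly onto a regular expression. The following is a minimal sketch of such a translation, not the PROSITE scanning software itself; the function name `prosite_to_regex` and the short test sequence are hypothetical and chosen only for illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; it covers only the basic syntax
    (x, [..], {..}, repetitions, '<' and '>' anchors).
    """
    pattern = pattern.strip().rstrip(".")      # a period ends the pattern
    parts = []
    for element in pattern.split("-"):         # elements are separated by '-'
        prefix = suffix = ""
        if element.startswith("<"):            # restricted to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):              # restricted to the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (2) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":                        # any amino acid
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"     # listed amino acids are excluded
        # '[..]' and single letters are already valid regex syntax.
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


regex = prosite_to_regex("[SN]-P-x-[LV]-x(2)-H-A-x(3)-F.")
hit = re.search(regex, "AASPALGGHAKKKFAA")    # invented test sequence
print(regex, hit.group(0) if hit else None)
```

A search with the resulting expression is a strict yes/no decision: a sequence that deviates from the pattern in a single position is not reported.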
(Figure: the pattern is derived from a multiple sequence alignment.)

Patterns

• Use of standard IUPAC one-letter codes for amino acids
• Symbol 'x' for a position where any amino acid is possible
• Ambiguities are indicated by listing the acceptable amino acids in square brackets '[ ]'
• Exclusions are indicated by listing the amino acids that are not acceptable in curly brackets '{ }'
• Elements are separated by '-'
• Repetition of an element is indicated by a numerical value or a numerical range in parentheses following that element
• Restriction of the pattern to either the N- or C-terminus of a sequence is indicated by either starting with a '<' symbol or ending with a '>' symbol
• A period ends the pattern

Example leucine zipper

L-x(6)-L-x(6)-L-x(6)-L.
Coiled-coil
PROSITE Entry PDOC00029

Example C2H2 zinc finger

(Figure: schematic of the zinc finger, with two cysteines and two histidines coordinating a zinc ion.)
PROSITE Entry PDOC00028
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.

Pattern

Advantages:
• easy and intuitive definition
• simple to use in automated processing
Disadvantages:
• yes/no decisions: proteins not complying with a certain pattern will never be found although they may contain the domain
• needs multiple alignment

Rule

Advantages:
• easy and intuitive notation
• simple to use in manual processing
• able to model long range dependencies
Disadvantages:
• difficult to use in automated processing

Profile

• position specific scoring/weight matrix with N columns and 20+ rows
• N is the number of columns in a multiple alignment = length of the multiple alignment = length of the profile
• each row holds the information about one amino acid (IUPAC code), about gap penalties or about other properties

Scoring matrices (e.g. BLOSUM62)

Average score method to calculate a profile

Multiple sequence alignment with N = 10 columns and Z = 23 rows
Profile with N = 10 columns (k) and 20 + 1 rows (j)

C_ik: quantity of amino acid i in column k
S_ij: score of amino acid i and amino acid j in the scoring matrix (e.g. BLOSUM62)

The profile entry for amino acid j in column k is the average score over all amino acids observed in that column:
M_jk = sum over i of (C_ik / Z) * S_ij

Example for leucine (L) in column 1, which contains 4 x V and 19 x I:
M_L1 = (C_V1 / Z) * S_VL + (C_I1 / Z) * S_IL = (4 / 23) * 1 + (19 / 23) * 2 = 1.83

Profile

Advantages:
• captures degree of conservation at each position in a multiple alignment
• statistical method
• simple to use in automated processing
Disadvantages:
• difficult to use in manual processing
• needs multiple alignment
• no formal statistical basis

Hidden Markov Model (HMM)

• statistical model where the system being modelled is assumed to be a Markov process (stochastic process)
• the probability of being in one state depends only on the previous state
• in a regular Markov model, the states are directly visible to the observer, and therefore the state transition probabilities are the only parameters
• an HMM adds outputs: each state has a probability distribution over the possible output tokens

Profile HMM architecture

(Figure: profile HMM with Start, Begin, End and Terminal states, Match, Insertion and Delete states, N-terminal and C-terminal unaligned sequence, and a joining segment of unaligned sequence; from the HMMER User Guide, http://hmmer.wustl.edu)

Profile HMM

Advantages:
• same as for profile
• statistical method with well established formal probabilistic basis
• can use unaligned sequences
Disadvantages:
• no manual processing
• needs a higher number of sequences to give a satisfactory result

Domain databases

• describe functional regions of proteins (called domains, motifs, signatures...)
• a protein may consist of several and/or different domains (multi-domain protein)
• domains can be described with
  – patterns (regular expressions)
  – rules
  – profiles
  – Hidden Markov Models

Domain databases

Sanger Institute (http://www.sanger.ac.uk)
  The Wellcome Trust Sanger Institute, Hinxton, GB
Pfam (http://www.sanger.ac.uk/Software/Pfam)
  – Protein FAMilies database of alignments and HMMs
  – sequences from Swiss-Prot and TrEMBL
Prosite (http://www.expasy.ch/prosite)
  – database of protein families and domains
  – consists of biologically significant sites, patterns and profiles
  – sequences from Swiss-Prot and TrEMBL
  – manual annotation made by experts

InterPro

Integrated Resources of Protein Families, Domains and Functional Sites
Collaboration of Pfam, PROSITE, PRINTS, ProDom, SMART and TIGR
Used for automatic annotation of entries in TrEMBL

Domain databases

FHCRC
  Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
BLOCKS (http://www.blocks.fhcrc.org)
  – multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins
  – automatically derived from InterPro
  – originally developed for the creation of scoring matrices (substitution matrices), e.g. BLOSUM62 (BLOcks SUbstitution Matrix)

Domain databases

SMART (http://smart.embl-heidelberg.de/)
  Simple Modular Architecture Research Tool
• identification and annotation of genetically mobile domains and the analysis of domain architectures
• signalling, extracellular and chromatin-associated proteins

Domain and family databases

Suppose we have n homologous protein sequences
• What do they have in common?
• What are the functional regions of these proteins?
• Which regions are conserved, which are not conserved?
• How can we characterize these proteins/their functional domains?
• What distinguishes these proteins/their functional domains from others?

Similarity: Expressed in score, E-value, % sequence identity, etc.
Homology: Relationship due to common ancestry
Orthology: Genes in the genomes of different species with a common ancestor (resulting from a speciation event)
Paralogy: Genes in the same genome with a common ancestor (resulting from a duplication event)

But! Similarity ≠ Homology

• Similarity is a good indicator for homology
• Normally we deduce homology from significant sequence similarity
• But we cannot deduce sequence similarity from homology!
• Thus we also cannot deduce non-homology from a lack of sequence similarity!

Database Search

Transitivity

• use of intermediate sequences to derive knowledge about homology
• if the proteins A and B are homologous and the proteins B and C are homologous, then A and C are homologous, too
• this holds even if there is no sequence similarity detectable between A and C!

Transitivity?

... may be limited to domains!
... but often it is difficult to define domain boundaries!

(Figure: repeated database searches with cutoffs 1e-30, 1e-20 and 1e-10.)

Sequence Clustering: Goals

Biologically meaningful partitioning of the data:
• Functional annotation
• Gain of information
• Reduction of the search space
• Selection of prototypic or representative sequences
• Phylogenetic analyses
• Protein prediction etc.
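
Grouping sequences by transitivity over all database hits below a cutoff amounts to single-linkage clustering. The following is a minimal sketch under stated assumptions: the function name `single_linkage`, the sequence names and the E-values are invented for illustration, and the pairwise hits are assumed to be available already as (sequence, sequence, E-value) tuples.

```python
def single_linkage(sequences, hits, cutoff):
    """Cluster sequences by transitive closure over hits with E-value <= cutoff.

    Hypothetical helper: 'hits' is a list of (seq_a, seq_b, evalue) tuples,
    e.g. parsed from an all-against-all database search.
    """
    parent = {s: s for s in sequences}         # union-find forest

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]      # path compression
            s = parent[s]
        return s

    for a, b, evalue in hits:
        if evalue <= cutoff:                   # keep only significant hits
            parent[find(a)] = find(b)          # merge the two clusters

    clusters = {}
    for s in sequences:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())


# Invented example: A-B is a strong hit, B-C a moderate one, C-D a weak one.
sequences = ["A", "B", "C", "D"]
hits = [("A", "B", 1e-40), ("B", "C", 1e-15), ("C", "D", 1e-5)]
for cutoff in (1e-30, 1e-10, 1e-3):
    print(cutoff, single_linkage(sequences, hits, cutoff))
```

With the weaker cutoffs, A and C end up in the same cluster although no direct hit between them is given, which is exactly the transitivity argument; the result depends entirely on the chosen cutoff.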

Protein Families

• “Protein superfamily” (Dayhoff, 1974): Group of evolutionarily related proteins
• Hierarchy of homology domains, families, and superfamilies (Barker, 1996)
  – Manual classification based on sequence similarity
• Most current proteins are thought to be the descendants of no more than 1,000 (structural) ancestors (Chothia, 1994)
• But no “definition”!

Protein Families

Following M. Dayhoff we can think of a
• Protein superfamily as a group of proteins
  – sharing domains
  – being evolutionarily related
  – showing weak sequence similarity
• Protein family as a group of proteins
  – being (closely) evolutionarily related
  – (showing at least 50% sequence similarity)
• Homeomorphic protein family as a group of proteins
  – having the same domains in the same order

Single Linkage Clustering

(Figure: single linkage hierarchy cut at a weak, a conservative and a stringent cutoff/threshold.)

Test data set

• starting with 171,191 redundant sequences from Swiss-Prot
• after all-against-all BLAST database searches: 19,407,137 pairwise values
• after excluding 27,305 fragments (being 90% identical to another sequence over 95% of their sequence length): 13,083,209 pairwise values
• Reminder: 171,191 sequences give 14,653,093,645 possible pairwise values!
• only 0.132% of the sequence pairs result in an E-value < 10!

143,886 non-redundant sequences and 13,083,209 pairwise values

(Figure panels: 10%, 50%, 75% and 90% sequence overlap.)

Observations

• When doing single-linkage clustering with this data, we can vary the criterion applied to the pairwise results of the BLAST searches, i.e., E-value, % identity, length of the local alignment, alignment length as % of sequence length, score, and all combinations of these!
• With a choice of at least 50% identity we are on the safe side (this was Margaret Dayhoff’s original value for a protein family!)
• Unfortunately (but no surprise) nature does not obey cutoffs
• There are highly conserved protein families (e.g., histones) and fast evolving protein families (e.g., immunoglobulins)
• Every protein family needs its own cutoff

SYSTERS (SYSTEmatic Re-Searching)

Single linkage hierarchy → Superfamilies → Superfamily distance graph → Family clusters

Superfamilies as well as family clusters are derived from the structure generated by the data itself: no need for a user-defined static cutoff

(Figure: single linkage tree annotated with subtree sizes at the merge nodes: 259/212,012, 211,975/37, 1/36, 15/21, 1/18, 2/19, 13/5, 4/1.)

Algorithm 1: Superfamily determination
Input: Tree T = (V, E) with n leaves (sequences)
Output: Superfamilies
 1: for all leaves l_i ∈ V, i ∈ {1, ..., n} do
 2:   q ← l_i
 3:   I ← 0
 4:   sf_i ← l_i
 5:   while (q ≠ T_root) do
 6:     p ← parent(q)
 7:     J ← (subtreesize(p) − subtreesize(q)) / subtreesize(q)
 8:     if (J > I) then
 9:       I ← J
10:       sf_i ← q
11:     end if
12:     q ← p
13:   end while
14: end for
15: Resolve inclusions by keeping the largest superfamilies

About 300,000 non-redundant sequences
• 456 superfamilies with cutoff < 1e-180
• 64,282 superfamilies in 40,288 separate trees

weighted_HCS

(Figure: a weighted graph with |V| = 7 vertices and |E| = 15 edges is split along its minimal cut C; the resulting subgraphs are processed recursively until the stop criterion is met.)

Stop criterion: x > |V| / 2, with
x = |E| * (sum over i ∈ C of w(i)) / (sum over j ∈ E of w(j))

Examples from the figure:
• x = 15 * (6 / 42) = 2.14 < (7 / 2): split graph, process subgraphs
• x = 15 * (6 / 25) = 3.6 > (7 / 2): output graph

Algorithm 2: weighted_HCS, a weighted variant of HCS (Highly Connected Subgraphs; Hartuv & Shamir, 1999)
Input: Connected weighted graph G = (V, E) (original HCS: unweighted)
Output: Cluster graphs
1: (H1, H2, C) ← mincut(G)
2: x ← |E| * (sum over i ∈ C of w(i) / sum over j ∈ E of w(j)) (original HCS: x ← |C|)
3: if (x > (|V| / 2)) then
4:   output G
5: else
6:   weighted_HCS(H1)
7:   weighted_HCS(H2)
8: end if

(Figure: resulting family clusters: Ephrin type A, Ephrin type B, and predicted proteins from C. elegans and Drosophila.)

(Figure legend) About 300,000 non-redundant sequences
SLC: Single Linkage Clustering
SF: Superfamilies
SF+SC: Family clusters derived from superfamilies

(Screenshot: systers.molgen.mpg.de, showing family, superfamily and domains of an entry.)

SYSTERS

• Exploit the self-structuring properties of the data:
  – Determine an individual cutoff for each superfamily based on the single linkage hierarchy
  – Split each superfamily into family clusters based on the superfamily distance graph
• Automated and independent of static user-defined cutoffs
• Results accessible on the Internet

Protein family databases - ProtoNet

• http://www.protonet.cs.huji.ac.il/
• global classification of proteins into hierarchical clusters
• based on Swiss-Prot sequences, with TrEMBL sequences added after clustering
• 3 different hierarchical clustering methods available, depending on the similarity measure (harmonic, geometric or arithmetic average) based on the BLAST E-value
• N. Kaplan et al., NAR, 2005, 33(DB)

Protein family databases - CluSTr

• http://www.ebi.ac.uk/clustr/index.html
• automatic hierarchical classification of all sequences in UniProt
• uses a Z-score based on Smith-Waterman comparison:
  Z-Score = min(Z(A,B), Z(B,A)) with Z(A,B) = (Score(A,B) – M) / σ
  with M: arithmetic mean, σ: standard deviation of all results
• constructs a single-linkage hierarchy
• provides a subset of clusters at several different cutoff values
• R. Petryszak et al., Bioinformatics, 2005, 21(18)

Protein family detection - TribeMCL

• http://www.ebi.ac.uk/research/cgg/tribe/
• uses a Markov Clustering method based on BLAST E-values
• primarily used for comparing protein sequence sets of completely sequenced genomes, e.g. in ENSEMBL
• clustering software available
• provides one set of protein families
• more specific than other methods, but less sensitive
• A.J. Enright et al., NAR, 2002, 30(7)

             Related          Not related
Found        True positive    False positive
Not found    False negative   True negative

But wait a moment...

• Why so many databases?
• Which one is “right”, which one is “wrong”?
• How can we prove that the results are correct?

• We want to answer biological questions with these databases
• Different databases are needed to answer different questions
• There is no “right” or “wrong”
• The benefit highly depends on the questions
• The more concise the question, the more beneficial the answer

Gene family databases

Suppose we have the gene/protein sequences of 2 completely sequenced species
• Which genes/proteins do these species have in common?
• Which genes/proteins are orthologous?
• Where are the differences?
• Which genes/proteins have paralogs in one or the other species?

What happens to a duplicated gene?

Duplication-Degeneration-Complementation Model (DDC)
Lynch & Force (Genetics, 1999/2000)

Pairwise-best-hit method

1. Search with all protein sequences of species A against all protein sequences of species B
2. Remember only the best hits
3. Search with all protein sequences of species B against all protein sequences of species A
4. Remember only the best hits
5. All pairwise best hits are assumed to be orthologs

(Figure: all proteins (genes) of species A compared with all proteins (genes) of species B.)

Gene family databases - InParanoid

• http://inparanoid.cgb.ki.se/
• clustering software available
• after determination of the main orthologs, inparalogs are added to the groups
• inparalogs: duplicated after the speciation event
• outparalogs: speciation event after the duplication
• uses BLAST
• K.O’Brien et al., NAR, 2005, 33 (DB)

Gene family databases - COGs

• http://www.ncbi.nlm.nih.gov/COG/
• Clusters of Orthologous Groups of proteins
• based on all-against-all sequence search
• proteins form a COG if pairwise best hits exist across at least 3 species (figure: Species A, Species B, Species C)
• manual post-processing (alignments, trees) of COGs to split COGs of multi-domain proteins
• R.L. Tatusov et al., 1997, Science, 278

Biological databases in general

• the first issue every year is the database issue
• in 2006 this database collection covered 858 databases
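
The five steps of the pairwise-best-hit method for two species can be sketched as follows. This is a minimal illustration, not the InParanoid or COG implementation; the function names `best_hits` and `pairwise_best_hits`, the protein names and the scores are all invented, and the search results are assumed to be available as dictionaries mapping (query, subject) pairs to a score where higher is better.

```python
def best_hits(scores):
    """Return the best-scoring subject for every query.

    Hypothetical input format: dict mapping (query, subject) -> score,
    e.g. bit scores parsed from a database search.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def pairwise_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)    # steps 1 and 2: species A against species B
    best_ba = best_hits(b_vs_a)    # steps 3 and 4: species B against species A
    # step 5: keep only the hits that are best in both directions
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented scores for three proteins of species A and two of species B.
a_vs_b = {("a1", "b1"): 250, ("a1", "b2"): 90, ("a2", "b2"): 180, ("a3", "b2"): 200}
b_vs_a = {("b1", "a1"): 245, ("b2", "a3"): 198, ("b2", "a2"): 175}
print(pairwise_best_hits(a_vs_b, b_vs_a))
```

In this invented example a2 has b2 as its best hit, but b2 prefers a3, so a2 is left without an assumed ortholog; deciding whether it is a paralog is what the inparalog step of InParanoid addresses.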