Chris Ponting is Group Leader in Bioinformatics at a new MRC-funded unit focused on the determination of gene function with particular emphasis on human disease. His particular scientific interest is the combined use of sequence and experimental data to predict protein evolution, structure and function. To enable findings to be made accessible to biologists generally, he co-devised the SMART web-based tool with Peer Bork and colleagues.

Keywords: sequence analysis, orthology, function prediction, binding site identification, domain families, horizontal gene transfer

Issues in predicting protein function from sequence

Chris P. Ponting
Date received (in revised form): 9th November 2000

C. P. Ponting, MRC Functional Genetics Unit, Department of Human Anatomy and Genetics, University of Oxford, South Parks Road, Oxford OX1 3QX, UK
Tel: +44 (0)1865 272175
Fax: +44 (0)1865 272175/272420
E-mail: [email protected]

Abstract
Identifying homologues, defined as genes that arose from a common evolutionary ancestor, is often a relatively straightforward task, thanks to recent advances made in estimating the statistical significance of sequence similarities found from database searches. The extent to which homologues possess similarities in function, however, is less amenable to statistical analysis. Consequently, predicting function by homology is a qualitative, rather than quantitative, process and requires particular care to be taken. This review focuses on the various approaches that have been developed to predict function from the scale of the atom to that of the organism. Similarities in homologues' functions differ considerably at each of these different scales and also vary for different domain families. It is argued that due attention should be paid to all available clues to function, including orthologue identification, conservation of particular residue types, and the co-occurrence of domains in proteins. Pitfalls in database searching methods arising from amino acid compositional bias and database size effects are also discussed.

THE DESCENT OF GENES
Completion of the human genome draft sequence has created an air of expectation among scientists and the general population alike that knowledge derived from sequence information will precipitate numerous breakthroughs in treating disease. To live up to such expectations will be a tall order, because considerable obstacles remain to be surmounted, but this is a worthwhile challenge that ought to be taken on. Chief among these obstacles is the assignment of function to genes. How does one even begin to predict how a gene affects the well-being of an individual, or his or her cells, or his or her molecular pathways, networks and complexes? Fortunately, this situation is being ameliorated using enhanced understanding of the evolutionary history of genes, as inferred from detailed sequence comparisons.

Homology, the evolutionary descent of genes from a common ancestor,[1] often provides vital evidence in the prediction of molecular function. That two genes are homologous does not necessarily mean that they possess common functions, only that they share a common ancestor. Nevertheless, an assumption often made is that the functions of homologues have remained essentially unchanged since the time of their last common ancestor. This provides a good working hypothesis of function, particularly for those homologues that have most recently diverged. However, a better view is that an evolutionary relationship implies functional similarity but that this may be true to a greater or lesser extent. The extent is crucially dependent on scale, since homologues' functional similarities are greatest when considered at the molecular scale, and least at the scales of cells or organisms.
Although this discussion might be viewed as purely a semantic exercise, its appreciation does lead to a critical question. Have homologues' functions diverged since their last common ancestor? The short answer, derived from many studies, is that some have diverged considerably, and some have remained more or less the same. A long answer to this question will take up much of the rest of this review. The various clues that indicate functional divergence or constancy, and sequence-based evidence that hints at particular functions will be discussed. Much rests upon various paradigms (Table 1) that hold true in most cases.

© Henry Stewart Publications 1467-5463. Briefings in Bioinformatics. Vol 2. No 1. 19–29. March 2001

DETECTING HOMOLOGUES USING PROTEIN DATABASE SEARCHES
First, however, it is appropriate to discuss how sequence similarities, detected in protein database searches, are used to propose homology assignments. Excellent reviews on this subject abound (for example refs 11–15), hence the brevity in its treatment here. The central issue here is whether similarities seen in sequence alignments merit an assignment of homology. Amino acid similarities can be quantified as an alignment score S using a local alignment algorithm, gap penalties and a substitution matrix.[13] From these scores an E-value can be calculated. This value represents the number of different alignments with scores equivalent to, or better than, S that are expected to occur in the database search simply by chance. Thus, an alignment with an associated E-value of 1 is considered not to be biologically significant, one with an E-value of 0.1 possibly to be significant, and another with an E-value less than 0.01 most likely to be significant. The statistics of alignment scores, therefore, are a powerful tool for deciding on homology. As with most aspects of evolution and function prediction, however, there are pitfalls to be avoided.
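The relationship between an alignment score and its E-value can be made concrete with a short sketch. It assumes the standard Karlin-Altschul form E = K·m·n·exp(−λS), which the review does not spell out; the values of K and λ, the query length, the raw score and the mean protein length of 300 residues are illustrative assumptions, not figures from the text. The three-way interpretation follows the thresholds quoted above, with the bin boundaries between them being one possible reading.

```python
import math

# Assumed, illustrative statistical parameters. Real values depend on the
# substitution matrix and gap penalties used by the search program.
LAMBDA = 0.267
K = 0.041


def expected_hits(score, query_length, database_residues, lam=LAMBDA, k=K):
    """Karlin-Altschul estimate E = K * m * n * exp(-lambda * S)."""
    return k * query_length * database_residues * math.exp(-lam * score)


def interpret(e_value):
    """Map an E-value onto the rough categories quoted in the text."""
    if e_value < 0.01:
        return "most likely significant"
    if e_value <= 0.1:
        return "possibly significant"
    return "not significant"


# Same alignment (same score, same query) searched against two databases.
MEAN_LENGTH = 300  # assumed mean protein length in residues
small_db = 6_000 * MEAN_LENGTH     # one large bacterial genome
large_db = 600_000 * MEAN_LENGTH   # all gene products known at the time

e_small = expected_hits(score=100, query_length=250, database_residues=small_db)
e_large = expected_hits(score=100, query_length=250, database_residues=large_db)

# E grows in proportion to the search space, so the ratio is about 100.
print(e_large / e_small)
print(interpret(e_small), "/", interpret(e_large))
```

Because E is proportional to the size of the search space, an E-value is only meaningful alongside the database it was computed against.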
For example, the fact that E-values are strongly dependent on the size of the database being searched is often overlooked. An E-value provided in a search of a large bacterial genome containing about 6,000 gene products will be approximately 100 times smaller than the E-value given for exactly the same alignment in a search of all gene products currently known (approximately 600,000). As databases fill with increasingly redundant sequences, including identical copies, alternatively spliced variants and homologues from closely related species, there has been a need for a truly non-redundant protein sequence database. This has been provided by NRDB90,[16] a database in which no two sequences possess greater than 90 per cent identity. Use of this database, currently approaching 350,000 sequences, is therefore recommended to provide the most robust E-value estimates.

Table 1: Paradigms and exceptions in predicting function and structure from sequence
(1) Paradigm: Orthologues possess similar functions.
    Exception: A PTP-BAS PDZ-Fas receptor interaction occurs in human but not in mice.[2]
(2) Paradigm: Enzyme homologues are enzymes.
    Exception: Many enzyme families possess representatives that are enzymatically inactive and possess substitutions of active site residues.
(3) Paradigm: Regulatory domain homologues are not enzymes.
    Exception: The SH2 domain of human pp60c-src has been found to possess low tyrosine phosphatase activity.[3] Lambda integrases appear to have evolved an enzymatic function from an ancient helix-turn-helix regulatory domain.[4]
(4) Paradigm: Equivalent cellular functions are mediated in different species by orthologues.
    Exception: A gene possessing a particular cellular role in one organism can be displaced by a non-orthologous but functionally equivalent gene in a second organism ('non-orthologous gene displacement').[5]
(5) Paradigm: Gene coding regions mutate slower than non-coding regions.
    Exception: Snake toxin genes, among others, appear to undergo accelerated evolution of their coding regions due to enhanced selective pressures.[6]
(6) Paradigm: Domain homologues are localised to single regions of sequence and 3D space, and possess the same order of secondary structures.
    Exception: Crystal structures have shown that domains can be 'inserted' into other domains. Domains may also contain secondary structures that are circularly permuted in order.[7]
(7) Paradigm: Disulphide bridges are invariant among homologues.
    Exception: Disulphide 'swapping' has been observed to occur in an epidermal growth factor-like (EGF) domain of thrombomodulin.[8]
(8) Paradigm: Although the same function or the same fold may have evolved more than once due to convergence, convergent evolution of sequences does not occur.
    Exception: Small localised structures, such as the helix–hairpin–helix motif, with conserved sequences have been observed in non-homologous contexts.[9] These may have arisen through convergence or else via genetic duplication and insertion events.
(9) Paradigm: Domains possess single conformations.
    Exception: Amyloid proteins are known to undergo conformational changes to β-sheet structures.[10]

A second serious pitfall relates to biases in the amino acid compositions of sequences. E-value calculations assume that database sequences that are unrelated to the query have average amino acid compositions. For the majority of sequences in databases this holds true, but in a minority of cases unrelated sequences can be detected with significant alignment scores (low E-values) because of non-random amino acid compositions (Table 2). Similarly, non-significant low E-values can be generated when a database search is initiated with a compositionally-biased or 'low complexity' query sequence.
Table 2: Examples of proteins with unusually high occurrences of particular amino acids
Amino acid | Proteins
C | Disulphide-rich proteins; metallothioneins; zinc fingers
DE | Acidic proteins (unknown function)
G | Collagens
H | Hisactophilin; histidine-rich glycoprotein
ILMVFYWAC | Transmembrane helices
KR | Nuclear proteins, nuclear-localisation signals
N | Many Dictyostelium proteins
P | Collagens; SH3/WW/EVH1-binding sites; filaments
Q | Triplet repeat disease gene products
SR | RNA-binding motifs containing multiple Ser and Arg residues
ST | Mucins (potential oligosaccharide-attachment sites)
abcdefg | Heptad coiled coils (hydrophobic residues: a and d; hydrophilic bcefg) in, for example, myosins, intermediate filaments and kinesins

A recent example of when compositional bias causes significance estimates to go awry relates to the proposed homology between Wingless/Wnt and secreted phospholipase A2.[17–19] Here, database searches yielded apparently significant E-values, E < 10⁻³, yet much of the alignment score was due to fortuitous matches of cysteines. In this case, three-dimensional structural information was on hand from which to argue that these molecules did not share a common ancestor, and therefore did not possess comparable structures and functions. This example highlights the awareness, which practitioners of database searches must possess, of the effects of compositional bias on alignment scores, and the availability of additional structural and functional data that might profitably be brought to bear on questionable homology assignments.

Filtering of compositionally biased regions using SEG[20] is default in many database-searching algorithms. The PSI-BLAST[12] server at the National Center for Biotechnology Information[21] now uses composition-based statistics for E-value calculation. These approaches markedly improve the discrimination of true positive homologues versus compositionally biased false positives.
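A rough feel for what compositional filtering looks for can be had from a sliding-window entropy calculation. The sketch below is not the SEG algorithm and does not reproduce its statistics; it is a simplified stand-in that flags windows whose Shannon entropy falls below a cut-off. The function names, the window length of 12 and the 2.2-bit cut-off are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def low_complexity_windows(sequence, window=12, max_entropy=2.2):
    """Return (start, end) pairs, 0-based and end-exclusive, for every
    window whose entropy is at or below max_entropy (assumed cut-off)."""
    hits = []
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) <= max_entropy:
            hits.append((start, start + window))
    return hits


def dominant_residue(sequence):
    """Most frequent residue and the fraction of the sequence it makes up."""
    residue, count = Counter(sequence).most_common(1)[0]
    return residue, count / len(sequence)


# A made-up query with a glutamine-rich stretch in the middle.
query = "MKTAYIAKQR" + "Q" * 14 + "DLGHEVNSWT"
print(low_complexity_windows(query))
print(dominant_residue(query))
```

A query that triggers many such windows, or in which a single residue such as cysteine dominates, is a candidate for the kind of spurious low E-value described above and deserves manual inspection of the alignment.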
However, as with all bioinformatics applications, such approaches are not foolproof in all cases, and the user is required to be vigilant in spotting those apparently significant similarities that arise simply due to amino acid bias.

ORTHOLOGY, PARALOGY AND FUNCTION PREDICTION
The assignment of information from experimental data on one homologue to an under-characterised second homologue is the basic principle that underlies function prediction from sequence. The greatest confidence should be placed in such assignments when these two homologues are orthologues. Orthologues are defined as having arisen from speciation events,[22] and may be thought of as 'the same' genes in different species. This contrasts with homologous genes that arose from intragenome duplications; these are termed 'paralogues'. Gene duplications give rise to a pair of genes, of which one is often assumed to retain the function of its parent, whereas the other eventually acquires either a 'new' function or is lost via mutation and selection.

This exposes a certain degree of fickleness in how the term 'function' is currently used. Most homologues possess 'similar' functions, yet paralogues are thought to persist only if they acquire 'novel' functions. This apparent paradox is resolved when it is considered that 'function' is an umbrella term that covers phenomena at the scales of atoms (catalysis, binding events), domains, proteins, complexes, networks, cells and organisms. For example, the molecular functions of paralogues, indeed even their protein sequences, can be identical, while their cellular roles differ owing to differences in cellular expression patterns. Consequently, homologues' functions are often degenerate at certain linear scales, even when they are not at others.
Correct assignment of orthology and paralogy is not straightforward and is most accurate for genes found in completely sequenced genomes. Even for such genomes, past lineage-specific deletions of paralogous genes can result in paralogues falsely being predicted as orthologues. Additional complexities in assignment arise from complete genome, or large-scale gene, duplications that are predicted to have occurred early in the vertebrate lineage.[23] These duplications resulted in chordates possessing more (degenerate) copies of genes than invertebrates. Consequently, the human genome, for example, may possess several paralogous genes that are all orthologues of a single Caenorhabditis elegans gene.

It is worth reiterating that orthologue identification is the most powerful tool in predicting molecular function. Comparisons of complete genomes predict that the presence of orthologues in different genomes indicates functional conservation of proteins, complexes, pathways and cellular processes. Sets of orthologues peculiar to single taxonomic groups, indeed, may best define these organisms,[24] and also may provide the most accurate predictions of genome phylogeny.[25] Paralogue identification provides a less accurate prediction of function, particularly for prokaryotes. By contrast, the functions of paralogous genes in vertebrates are often overlapping, as assessed by gene knockout studies.[26,27]

MULTIDOMAIN PROTEINS
Orthology assignment is often not straightforward for multidomain proteins. Although the term 'domain' is used differently in evolution, genetics and molecular biology, it is described here as a compact unit of structure, often containing a hydrophobic core. Domains are frequently found in assorted combinations with other domains, reflecting the evolutionary ability of (partial) genes to be duplicated and recombined elsewhere. A distinction is made between domains, repeats and motifs.
Repeats are structural and evolutionary entities that are always found in two or more copies. Frequently, repeats assemble into elongated 'rods' or 'superhelices', or else into closed 'barrel' structures, such as β-propellers. Closed structure assemblies of repeats might also be thought of as domains. Motifs are either regions of domains containing conserved active- or binding-site residues, or else conserved sequences, present outside domains, that may adopt folded conformations only in association with their binding ligands. An example of a motif that lacks obvious secondary structures and occurs outside domains is the AT-hook DNA-binding motif.[28]

There is little consensus in the literature on what constitutes orthology for multidomain proteins. For some, a criterion for orthology (and indeed paralogy) is that such proteins must possess identical domain architectures. For others, the concepts of orthology and paralogy should be applied at the domain level. The difference between these definitions results in different orthology designations for genes that have fused, or molecules of slightly differing repeat numbers, such as the spectrin repeats in C. elegans DYS-1 and mammalian dystrophin.[29] The author suggests that orthology be applied both at the protein/gene level and at the domain level. Orthologous proteins must contain orthologous domains and these must be present in the same order (Figure 1).

Notwithstanding such difficulties, sets of orthologous genes have been defined (see COGs[31]) from the completely known genome sequences of archaea, bacteria and eukarya. These and similar studies now allow investigation of functional conservation across divergent species.
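The architecture criterion suggested above, that orthologous proteins contain orthologous domains in the same order, reduces to a comparison of ordered domain lists once domains have been assigned. The sketch below is a minimal illustration under that assumption: it compares domain names only and takes it as given that same-named domains have already been judged orthologous at the domain level. The function names and the two-domain example architectures are hypothetical simplifications, not the real architectures of any protein in Figure 1.

```python
def same_architecture(domains_a, domains_b):
    """True when both proteins have the same domains in the same order."""
    return list(domains_a) == list(domains_b)


def same_composition(domains_a, domains_b):
    """True when both proteins have the same domains, ignoring order."""
    return sorted(domains_a) == sorted(domains_b)


def classify_pair(domains_a, domains_b):
    """Apply the same-domains, same-order rule to two domain lists,
    each given in N- to C-terminal order."""
    if same_architecture(domains_a, domains_b):
        return "candidate orthologues (same domains, same order)"
    if same_composition(domains_a, domains_b):
        return "same domains, different order: not orthologous under this rule"
    return "different domain composition: not orthologous under this rule"


# Hypothetical, simplified architectures used purely for illustration.
protein_x = ["SH3", "PX"]
protein_y = ["PX", "SH3"]
protein_z = ["SH3", "PX"]

print(classify_pair(protein_x, protein_y))
print(classify_pair(protein_x, protein_z))
```

Treating repeat number as part of the list makes the rule strict; a looser variant could collapse consecutive identical repeats before comparing, which is one way to handle cases such as differing spectrin repeat counts.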
An excellent example of this is a recent in-depth study of the protein components of the citric acid cycle.[32] This clearly demonstrated the complexities involved in assigning orthology and paralogy for divergent homologues, and highlighted the incompleteness of portions of the cycle in the majority of species. In the prokaryotes, the absence of some of the key enzymes of the cycle is probably due to these organisms' autotrophic lifestyles. Thus, the absence of genes in a completely sequenced genome can provide insight into function at the organismal level.

Figure 1: The domain architectures of protein pairs that are not orthologous presented using the SMART server.[30] (a) Synechocystis sll0776 and human MST (mixed lineage kinase 2) both contain protein kinase and src homology 3 (SH3) domains, but in different linear orders. Moreover, the bacterial SH3 domain (SH3b) is predicted to be extracellular, whereas MST is predicted to be cytoplasmic. (b) Yeast (Saccharomyces cerevisiae) Bem1p and human p47 phox contain SH3 and PX domains but in different collinear orders. (c) Yeast protein kinase C1 (PKC1p) and human protein kinase Cα (PKCα) are likely to possess similar cellular functions but are not orthologues owing to differing domain compositions and architectures. Abbreviations following those in SMART: STYKc, protein kinase with dual serine/threonine and tyrosine specificity; S_TKc, protein kinase with serine/threonine specificity; SH3, src homology 3 domain; SH3b, bacterial-type SH3 domain; PX, phox homology domain; C1, protein kinase C conserved region 1; C2, protein kinase C conserved region 2; HR1, protein kinase C-related kinase homology region 1; and S_TK_X, serine/threonine-specific protein kinase extension domain.

HORIZONTAL GENE TRANSFER
Horizontal (or 'lateral') gene transfer has also played a major role in prokaryotic evolution.[33] Detecting that a prokaryotic lineage has acquired a 'foreign' gene implicates that gene in the lineage's exploitation of new environments and evolutionary niches. Until recently, horizontal gene transfer was suspected to have been a relatively rare phenomenon. However, more recent studies have estimated that between 0 and 17 per cent of prokaryotic genomes has been acquired from other genetic sources.[33] These sources frequently include eukaryotic genomes. A study of signalling domains revealed many instances of horizontal gene transfer from eukaryotes to bacteria, in particular to Synechocystis PCC6803, but fewer transfers from eukaryotes to archaea.[34] With the exception of genes obtained from the influx of mitochondrial and chloroplast genes into the nuclear genome, metazoan genomes are thought to have acquired relatively few genes via recent horizontal gene transfers.

Bacterial virulence has been attributed in many cases to the acquisition of genes absent from related non-virulent species. Pathogens of eukaryotes, therefore, might obtain significant advantage from genes that arose via horizontal transfer from their hosts, by avoiding eradication by the immune system, or by hijacking host cellular processes for their own gain. For example, the presence of a Sec7 domain-containing protein in Rickettsia prowazekii implicates it in this pathogen's modification of the host's Golgi membranes.[35] Similarly, a mammalian perforin-like domain in the Chlamydia trachomatis CT153 protein might indicate an involvement in pore formation and host cell entry.[36] Detection of horizontally transferred genes in parasitic organisms, therefore, can lead to prediction of molecular and organismal functions.
INFERRING FUNCTION AND LOCALISATION USING DOMAIN DATABASES
Once a protein sequence has been obtained, the first port-of-call in attempts to understand its functions should not be a BLAST or FASTA database search. Rather, the sequence should be scanned for occurrences of well-known domains, repeats, motifs and sorting signals, before embarking on BLAST-like sequence database searches. Several servers exist that provide searches for domain homologues (Table 3). Upon detection of a homologous domain in a query sequence, these servers furnish the user with relevant functional and structural information, as well as appropriate literature references. Although the new conserved domain database (CDD) server draws upon multiple alignments from both SMART and PFAM, the latter source libraries should also be searched from their homepages. This is because SMART and PFAM both use a hidden Markov model method for homologue detection that contrasts with the different BLAST-like methodology of CDD. It also should be noted that some short domains, repeats and motifs detectable using SMART and PFAM web-tools are not detectable using CDD.

Table 3: Useful web-based resources for predicting function from protein sequence
Name | URL | Predicts
SMART | http://smart.embl-heidelberg.de/ | Domains, repeats, motifs, coiled coils, signal sequences
PFAM | http://pfam.wustl.edu/ ; http://www.sanger.ac.uk/Pfam/ ; http://www.cgr.ki.se/Pfam/ | Domains, repeats, motifs, coiled coils, signal sequences
CDD (mostly SMART + PFAM alignments) | http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml | Domains, repeats, motifs, coiled coils, signal sequences
PROSITE | http://www.isrec.isb-sib.ch/software/PFSCAN_form.html | Domains, repeats, motifs
PSORT | http://psort.nibb.ac.jp/ | Localisation signals
SignalP | http://www.cbs.dtu.dk/services/SignalP/ | Signal peptides
big-PI | http://mendel.imp.univie.ac.at/ | GPI anchors
Various | See http://expasy.cbr.nrc.ca/tools/#transmem | Transmembrane helices

Other servers allow prediction of sequence signals that specify the sub- or extracellular localisation of proteins. Thus, signal peptides, nuclear localisation signals, glycosyl-phosphatidyl-inositol (GPI) anchors, transmembrane regions and organelle-targeting sequences can be detected (Table 3). Some information on localisation can also be gleaned from the domain content of eukaryotic proteins. Some domain types appear only in one of the three sets of cytoplasmic, secreted or nuclear proteins. For example, disulphide-rich domains, such as kringle, epidermal growth factor-like and fibronectin type II domains, only occur in secreted proteins, whereas CHROMO, SET and BROMO domains only occur in chromatin-associated proteins. Some 'promiscuous' domain types, such as immunoglobulin, von Willebrand factor A and fibronectin type III domains, are not so exclusive and occur in cytoplasmic, secreted and nuclear proteins.
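The domain-content argument in the preceding paragraph amounts to a lookup: domains exclusive to one compartment are informative, while promiscuous ones are not. The sketch below encodes only the examples named in the text. The set names, the function name and the domain labels are assumptions made for illustration and do not correspond to identifiers in SMART, PFAM or any other database.

```python
# Domain types the text describes as exclusive to one class of proteins.
SECRETED_ONLY = {"kringle", "EGF-like", "fibronectin type II"}
CHROMATIN_ONLY = {"CHROMO", "SET", "BROMO"}

# 'Promiscuous' domain types that carry no localisation information.
PROMISCUOUS = {"immunoglobulin", "von Willebrand factor A", "fibronectin type III"}


def localisation_hints(domains):
    """Return a sorted list of localisation hints supported by the
    domain content; an empty list means no informative domain was found."""
    hints = set()
    for domain in domains:
        if domain in SECRETED_ONLY:
            hints.add("secreted")
        elif domain in CHROMATIN_ONLY:
            hints.add("chromatin-associated")
        # Promiscuous or unknown domains contribute nothing.
    return sorted(hints)


print(localisation_hints(["kringle", "immunoglobulin"]))
print(localisation_hints(["BROMO", "SET"]))
print(localisation_hints(["fibronectin type III"]))
```

If a protein ever returns both hints, that conflict is itself worth investigating, since it suggests either a mis-assigned domain or an exception to the paradigm.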
The recent acceleration in the determination of tertiary structures has prompted renewed interest in the prediction of function directly from structure.[37,38] Identification of binding sites for the majority of molecules is straightforward, since most ligands bind the largest cleft on the molecule's surface.[39] Certain folds also have tendencies to bind ligands in particular locations. For example, binding sites of β-propellers tend to be located at the junction between the β-sheet propeller 'blades'.[40] In addition, protein folds may also hint at molecular function, since protein domains with the same fold, albeit with non-significant sequence similarity, often possess comparable functions.[41] Structural features may also hold the key to function. The presence of helix-turn-helix-like structures indicates DNA binding, whereas conserved and proximal histidine, aspartic acid and serine residues suggest a serine proteinase/lipase-like hydrolase function.

Table 4: Groupings of amino acid residues according to common functions
Description | Amino acids | Function
Polar residues | C, D, E, H, K, N, Q, R, S, T | Active sites
Aromatic residues | F, H, W, Y | Protein ligand-binding sites
Zn2+-coordinating residues | C, D, E, H, N, Q | Active sites, zinc fingers
Ca2+-coordinating residues | D, E, N, Q | Allostery, ligand-binding sites
Magnesium- or manganese-binding residues | D, E, N, S, R, T | Mg2+- or Mn2+-dependent catalysis or ligand-binding
Phosphate-binding residues | H, K, R, S, T | Phosphate and sulphate-binding

CONSERVED POSITIONS IN MULTIPLE ALIGNMENTS
Many domain families exist, however, for which no functional or structural information is yet available. In these cases, one has to resort to prediction by analogy. For this, several 'rules-of-thumb' are suggested (for residue groupings see Table 4).
· Catalytic site residues are almost invariably polar.
· Large aromatic residues are often found to be involved in protein–ligand interactions.
· Zinc ions are coordinated by several residue types and, often, water molecules.
· Calcium ions are often bound by acidic residues and amides, although additional interactions occur with backbone atoms.
· Manganese and/or magnesium ions are, in enzymes such as nucleases[42] and glycosyltransferases,[43] often bound by two acidic residues separated by a hydrophobic residue.
· Phosphate and sulphate groups are found bound to the amino terminus of α-helices in approximately half of all cases.[44]
· Distinction can be made between disulphide-rich secreted proteins and zinc fingers, since the former occur in proteins with signal peptides and never possess substitutions of cysteine for histidine, or vice versa.

HOMOLOGUES WITH DIFFERENT FUNCTIONS
The great majority of enzymes, but not regulatory domains, appear to be ancient, since they are found in modern archaea, bacteria and eukaryotes.[34] Divergent enzyme homologues often possess comparable catalytic mechanisms or reactive intermediates but different substrates and different cellular roles.[45] For example, prokaryotic and eukaryotic homologues of protein kinase[46] and phospholipase D[47,48] families act on different molecular substrates. A similar conclusion is reached for the few regulatory domains that are thought to be ancient. Prokaryotic and eukaryotic PDZ domains bind C-terminal tails of proteins,[49,50] although with very different consequences: prokaryotic ligands are bound to facilitate proteolysis, whereas eukaryotic ligands are bound mostly as part of signal transduction pathways.

More recently derived homologues may also possess different functions. This is most easily seen for enzyme homologues that possess substitutions of critical catalytic residues, but also holds true for non-enzymatic domain families.
β-Trefoil proteins, for example, probably first arose in early eukaryotic evolution and thenceforth diversified into families of extracellular cytokines and sugar-binding proteins, and intracellular actin-binding proteins, each using distinct binding sites.[51]

INFERRING SPECIFICITY
Given the functional diversification of domain homologues, how can one decide on function? An obvious answer to this is that one can predict probable function based on that of the most sequence-similar orthologues or else other homologues that have been characterised experimentally. In other words, one assumes that function partitions with sequence similarity, and that family members on each branch of a dendrogram possess comparable functions. Dendrograms ('phylogenetic trees'), however, are calculated from all amino acids of domains, rather than from only those residues that are essential to impart function. In many cases, substitution of one or more of these essential residues results in homologues whose functions do not partition according to the dendrogram structure. Furthermore, inaccuracies in the calculation of dendrograms might result in inaccurate function prediction.

Methods have been developed, however, that allow accurate predictions of function for multifunctional domain families on the basis of tertiary structures and/or dendrograms.[52,53] More recently, a method has been developed that detects residue types in multiple alignment positions whose conservation correlates with functional sub-types.[54] Application of this method will be of increasing importance as functional specificities of domain families are experimentally derived. For example, recent experiments have shown that single residue substitutions in WW and SH2 domains result in changes in specificities from -PPXY- to -PPLP-, and phospho(Y)XXI to phospho(Y)XXN, respectively;[55,56] here X is any residue.
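The idea of alignment positions whose conservation tracks functional sub-type can be illustrated with a deliberately simple scan. The sketch below is not the published method cited above; it is a crude stand-in that reports columns where every functional group is internally invariant yet each group carries a different residue. The function name, the group labels and the toy alignment are all made up for illustration.

```python
def specificity_positions(groups):
    """Return 0-based alignment columns that are invariant within each
    functional group but carry a different residue in every group.

    groups maps a sub-type label to a list of aligned sequences of equal
    length; columns containing a gap character '-' are skipped.
    """
    labels = list(groups)
    length = len(groups[labels[0]][0])
    hits = []
    for col in range(length):
        consensus = []
        for label in labels:
            column = {seq[col] for seq in groups[label]}
            if len(column) != 1 or "-" in column:
                break  # variable within this group, or gapped
            consensus.append(next(iter(column)))
        else:
            # Every group was invariant; require all groups to differ.
            if len(set(consensus)) == len(labels):
                hits.append(col)
    return hits


# Toy alignment (made-up sequences) with two functional sub-types.
toy_alignment = {
    "subtype_1": ["ACDLK", "ACELK"],
    "subtype_2": ["ACDIK", "ACEIK"],
}
print(specificity_positions(toy_alignment))
```

In the toy data only the fourth column separates the two sub-types; the third column varies within each group and so carries no sub-type signal. Real families need weighting for sequence redundancy and tolerance of conservative substitutions, which this sketch omits.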
In the near future, it is essential that full predictive advantage of these specificity switching positions is taken in annotating genomes.

PREDICTING FUNCTION BY NON-HOMOLOGY METHODS
Three new 'context-dependent' prediction methods have recently come into prominence.[57,58] The functions of domains A and B, found separately in two proteins, are predicted to be linked if domains A and B are found together in a second protein.[59] Similarly, two genes are more likely to be functionally related if they are repeatedly found as gene neighbours in multiple genomes.[60] Finally, correlation of the presence or absence of two genes in multiple genomes might imply their participation in a common cellular function.[61]

COMBINED APPROACHES TO FUNCTION PREDICTION
Predicting function is a thorny problem centred on taking maximal advantage of available information while not resorting to over-prediction. Attributing function from one homologue to another must be achieved with due attention paid to all available evidence. As the examples given here have shown, function assignments can be made based on single base changes in genes, correlations of domains in proteins, similarities in protein structure and correlations of genes in genomes, as well as the more traditional sequence similarity-based identifications of orthologues and close homologues. This is balanced by the inability of sequence analysis to readily provide answers to such questions as the tissue-specific expression patterns of genes, the binding ligands of proteins, and the relationship between genotype and phenotype. As such, these latter issues represent not only the current limits of function prediction, but also some of the future goals of bioinformatics.

Acknowledgement
Thanks are due to Prof. John Mattick for many stimulating discussions.

References
1. Fitch, W. (2000), 'Homology – a personal view on some of the problems', Trends Genet., Vol. 16, pp. 227–231.
2. Cuppen, E. et al. (1997), 'No evidence for involvement of mouse protein-tyrosine phosphatase-BAS-like Fas-associated phosphatase-1 in Fas-mediated apoptosis', J. Biol. Chem., Vol. 272, pp. 30215–30220.
3. Boerner, R. J. et al. (1995), 'Catalytic activity of the SH2 domain of human pp60c-src; evidence from NMR, mass spectrometry, site-directed mutagenesis and kinetic studies for an inherent phosphatase activity', Biochemistry, Vol. 34, pp. 15351–15358.
4. Grishin, N. V. (2000), 'Two tricks in one bundle: helix-turn-helix gains enzymatic activity', Nucleic Acids Res., Vol. 28, pp. 2229–2233.
5. Koonin, E. V., Mushegian, A. R. and Bork, P. (1996), 'Non-orthologous gene displacement', Trends Biochem. Sci., Vol. 12, pp. 334–336.
6. Ohno, M. et al. (1998), 'Molecular evolution of snake toxins: Is the functional diversity of snake toxins associated with a mechanism of accelerated evolution?', Prog. Nucleic Acid Res. Mol. Biol., Vol. 59, pp. 307–364.
7. Russell, R. B. and Ponting, C. P. (1998), 'Protein fold irregularities that hinder sequence analysis', Curr. Opin. Struct. Biol., Vol. 8, pp. 364–371.
8. Sampoli Benitez, B. A. et al. (1997), 'Structure of the fifth EGF-like domain of thrombomodulin: An EGF-like domain with a novel disulfide-bonding pattern', J. Mol. Biol., Vol. 273, pp. 913–926.
9. Doherty, A. J., Serpell, L. C. and Ponting, C. P. (1996), 'The helix–hairpin–helix DNA-binding motif: A structural basis for non-sequence-specific recognition of DNA', Nucleic Acids Res., Vol. 24, pp. 2488–2497.
10. Serpell, L. C. (2000), 'Alzheimer's amyloid fibrils: Structure and assembly', Biochim. Biophys. Acta, Vol. 1502, pp. 16–30.
11. Bork, P. and Gibson, T. J. (1996), 'Applying motif and profile searches', Methods Enzymol., Vol. 266, pp. 162–184.
12. Altschul, S. F. and Koonin, E. V. (1998), 'Iterated profile searches with PSI-BLAST – a tool for discovery in protein databases', Trends Biochem. Sci., Vol. 23, pp. 444–447.
13. Hofmann, K. (2000), 'Sensitive protein comparisons with profiles and hidden Markov models', Briefings Bioinformatics, Vol. 1, pp. 167–178.
14. Bateman, A. and Birney, E. (2000), 'Searching databases to find protein domain organization', Adv. Prot. Chem., Vol. 54, pp. 137–157.
15. Ponting, C. P. et al. (2000), 'Evolution of domain families', Adv. Prot. Chem., Vol. 54, pp. 185–244.
16. Holm, L. and Sander, C. (1998), 'Removing near-neighbour redundancy from large protein sequence collections', Bioinformatics, Vol. 14, pp. 423–429; http://www.embl-ebi.ac.uk/ holm/nrdb90.
17. Reichsman, F., Moore, H. M. and Cumberledge, S. (1999), 'Sequence homology between Wingless/Wnt-1 and a lipid-binding domain in secreted phospholipase A2', Curr. Biol., Vol. 9, pp. R353–355.
18. Barnes, M. R. and Russell, R. B. (1999), 'A lipid-binding domain in Wnt: A case of mistaken identity?', Curr. Biol., Vol. 9, pp. R717–R718.
19. Copley, R. R., Ponting, C. P. and Bork, P. (1999), 'Phospholipases A2 and Wnts are unlikely to share a common ancestor', Curr. Biol., Vol. 9, p. R718.
20. Wootton, J. C. and Federhen, S. (1996), 'Analysis of compositionally biased regions in sequence databases', Methods Enzymol., Vol. 266, pp. 554–571.
21. http://www.ncbi.nlm.nih.gov/blast/psiblast.cgi
22. Fitch, W. M. (1970), 'Distinguishing homologous from analogous proteins', Syst. Zool., Vol. 19, pp. 99–106.
36. Ponting, C. P. (1999), 'Chlamydial homologues of the MACPF (MAC/perforin) domain', Curr. Biol., Vol. 9, pp. R911–R913.
37. Shapiro, L. and Harris, T. (2000), 'Finding function through structural genomics', Curr. Opin. Biotechnol., Vol. 11, pp. 31–35.
38. Eisenstein, E. et al.
(2000), `Biological function made crystal clear ± annotation of hypothetical proteins via structural genomics', Curr. Opin. Biotechnol., Vol. 11, pp. 25±30. 23. Holland, P. W. H. (1999), `Gene duplication: Past, present and future', Semin. Cell Dev. Biol., Vol. 10, pp. 541±547. 39. Laskowski, R. A. et al. (1996), `Protein clefts in molecular recognition and function', Protein Sci., Vol. 5, pp. 2438±2452. 24. Chervitz, S. A. et al. (1998), `Comparison of the complete protein sets of worm and yeast: Orthology and divergence', Science, Vol. 282, pp. 2022±2028. 40. Russell, R. B., Sasieni, P. D. and Sternberg, M. J. E. (1998), `Supersites within superfolds. Binding site similarity in the absence of homology', J. Mol. Biol., Vol. 282, pp. 903±918. 25. Snel, B., Bork, P. and Huynen, M. A. (1999), `Genome phylogeny based on gene content', Nat. Genet., Vol. 21, pp. 108±110. 26. Stein, P. L., Vogel, H. and Soriano, P. (1994), `Combined de®ciencies of Src, Fyn, and Yes tyrosine kinases in mutant mice', Genes Dev., Vol. 8, pp. 1999±2007. 27. Condie, B. G. and Capecchi, M. R. (1994), `Mice with targeted disruptions in the paralogous genes hoxa-3 and hoxd-3 reveal synergistic interactions', Nature, Vol. 370, pp. 304±307. 28. Aravind, L. and Landsman, D. (1998), `AThook motifs identi®ed in a wide variety of DNA-binding proteins', Nucleic Acids Res., Vol. 26, pp. 4413±4421. 29. Bessou, C. et al. (1998), `Mutations in the Caenorhabditis elegans dystrophin-like gene dys-1 lead to hyperactivity and suggest a link with cholinergic transmission', Neurogenetics, Vol. 2, pp. 61±72. 30. http://smart.embl-heidelberg.de 31. Tatusov, R. L., Koonin, E. V. and Lipman, D. J. (1997), `A genomic perspective on protein families', Science, Vol. 278, pp. 631637; http://www.ncbi.nlm.nih.gov/cog. 32. Huynen, M. A., Dandekar, T. and Bork, P. (1999), `Variation and evolution of the citric acid cycle: a genomic perspective', Trends Microbiol., Vol. 7, pp. 281±291. 33. Ochman, H., Lawrence, J. G. 
and Groisman, E. A. (2000), `Lateral gene transfer and the nature of bacterial innovation', Nature, Vol. 405, pp. 299±304. 34. Ponting, C. P. et al. (1999), `Eukaryotic signalling domain homologues in archaea and bacteria. Ancient ancestry and horizontal gene transfer', J. Mol. Biol., Vol. 289, pp. 729±745. 35. Wolf, Y. I., Aravind, L. and Koonin, E. V. (1999), `Rickettsiae and Chlamydiae: Evidence 28 of horizontal gene transfer and gene exchange', Trends Genet., Vol. 15, pp. 173±175. 41. Murzin, A. et al. (1995), `SCOP: A structural classi®cation of proteins database for investigation of sequences and structures', J. Mol. Biol., Vol. 247, pp. 536±540. 42. Ceska, T. A. and Sayers, J. R. (1998), `Structure-speci®c DNA cleavage by 59 nucleases', Trends Biochem. Sci., Vol. 23, pp. 331±336. 43. Busch, C. et al. (1998), `A common motif of eukaryotic glycosyltransferases is essential for the enzyme activity of large clostridial cytotoxins', J. Biol. Chem., Vol. 273, pp. 19566±19572. 44. Copley, R. R. and Barton, G. J. (1994), `A structural analysis of phosphate and sulphate binding sites in proteins. Estimation of propensities for binding and conservation of phosphate binding sites', J. Mol. Biol., Vol. 242, pp. 321±329. 45. Gerlt, J. A. and Babbitt, P. C. (1998), `Mechanistically diverse enzyme superfamilies: The importance of chemistry in the evolution of catalysis', Curr. Opin. Chem. Biol., Vol. 2, pp. 607±612. 46. Leonard, C. J., Aravind, L. and Koonin, E. V. (1998), `Novel families of putative protein kinases in bacteria and archaea: Evolution of the ``eukaryotic'' protein kinase superfamily', Genome Res., Vol. 8, pp. 1038±1047. 47. Koonin, E. V. (1996), `A duplicated catalytic motif in a new superfamily of phosphohydrolases and phospholipid synthases that includes poxvirus envelope proteins', Trends Biochem. Sci., Vol. 21, pp. 242±243. 48. Ponting, C. P. and Kerr, I. D. 
(1996), `A novel family of phospholipase D homologues that includes phospholipid synthases and putative endonucleases: Identi®cation of duplicated repeats and potential active site residues', Protein Sci., Vol. 5, pp. 914±922. & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 1. 19±29. MARCH 2001 Issues in predicting protein function from sequence 49. Beebe, K. D. et al. (2000), `Substrate recognition through a PDZ domain in tailspeci®c protease', Biochemistry, Vol. 39, pp. 3149±3155. 50. Doyle, D. A. et al. (1996), `Crystal structures of a complexed and peptide-free membrane protein-binding domain: Molecular basis of peptide recognition by PDZ', Cell, Vol. 85, pp. 1067±1076. 51. Ponting, C. P. and Russell, R. B. (2000), `Identi®cation of distant homologues of FGFs suggests a common ancestor for all â-trefoil proteins', J. Mol. Biol., Vol. 302, pp. 1041± 1047. 52. Lichtarge, O., Bourne, H. R. and Cohen, F. E. (1996), `An evolutionary trace method de®nes binding surfaces common to protein families', J. Mol. Biol., Vol. 257, pp. 342358. 53. Sjolander, K. (1998), `Phylogenetic inference in protein superfamilies: Analysis of SH2 domains' in `Proceedings of the 6th International Conference on Intelligent Systems for Molecular Biology', AAAI Press, Menlo Park, CA, pp. 165±174. 54. Hannenhalli, S. S. and Russell, R. B. (2000), `Analysis and prediction of functional subtypes from protein sequence alignments', J. Mol. Biol., Vol. 303, pp. 61±76. 55. Espanel, X. and Sudol, M. (1999), `A single point mutation in a group I WW domain shifts its speci®city to that of group II WW domains', J. Biol. Chem., Vol. 274, pp. 17284±17289. 56. Kimber, M. S. et al. (2000), `Structural basis for speci®city switching of the Src SH2 domain', Mol. Cell, Vol. 5, pp. 1043±1049. 57. Marcotte, E. M. (2000), `Computational genetics: Finding protein function by nonhomology methods', Curr. Opin. Struct. Biol., Vol. 10, pp. 359±365. 58. Galperin, M. 
Y. and Koonin, E. V. (2000), `Who's your neighbor? New computational approaches for functional genomics', Nature Biotech., Vol. 18, pp. 609±613. 59. Marcotte, E. M. et al. (1999), `Detecting protein function and protein±protein interactions from genome sequences', Science, Vol. 285, pp. 751±753. 60. Dandekar, T. et al. (1998), `Conservation of gene order: A ®ngerprint of proteins that physically interact', Trends Biochem. Sci., Vol. 23, pp. 324±328. 61. Pellegrini, M. et al. (1999), `Assigning protein functions by comparative genome analysis: protein phylogenetic pro®les', Proc. Natl Acad. Sci. USA, Vol. 96, pp. 4285±4288. & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 1. 19±29. MARCH 2001 29