Chris Ponting
is Group Leader in
Bioinformatics at a new MRC-funded
unit focused on the
determination of gene function
with particular emphasis on
human disease. His particular
scientific interest is the
combined use of sequence and
experimental data to predict
protein evolution, structure
and function. To enable findings
to be made accessible to
biologists generally, he co-devised
the SMART web-based
tool with Peer Bork and
colleagues.
Keywords: sequence
analysis, orthology, function
prediction, binding site
identification, domain families,
horizontal gene transfer
Issues in predicting protein
function from sequence
Chris P. Ponting
Date received (in revised form): 9th November 2000
Abstract
Identifying homologues, defined as genes that arose from a common evolutionary ancestor, is
often a relatively straightforward task, thanks to recent advances made in estimating the
statistical significance of sequence similarities found from database searches. The extent to
which homologues possess similarities in function, however, is less amenable to statistical
analysis. Consequently, predicting function by homology is a qualitative, rather than
quantitative, process and requires particular care to be taken. This review focuses on the
various approaches that have been developed to predict function from the scale of the atom to
that of the organism. Similarities in homologues' functions differ considerably at each of these
different scales and also vary for different domain families. It is argued that due attention
should be paid to all available clues to function, including orthologue identification,
conservation of particular residue types, and the co-occurrence of domains in proteins. Pitfalls
in database searching methods arising from amino acid compositional bias and database size
effects are also discussed.
THE DESCENT OF GENES
C. P. Ponting,
MRC Functional Genetics Unit,
Department of Human Anatomy
and Genetics, University of Oxford,
South Parks Road,
Oxford
OX1 3QX, UK
Tel: +44 (0)1865 272175
Fax: +44 (0)1865 272175/272420
E-mail:
[email protected]
Completion of the human genome draft
sequence has created an air of expectation
among scientists and the general
population alike that knowledge derived
from sequence information will
precipitate numerous breakthroughs in
treating disease. To live up to such
expectations will be a tall order, because
considerable obstacles remain to be
surmounted, but this is a worthwhile
challenge that ought to be taken on.
Chief among these obstacles is the
assignment of function to genes. How
does one even begin to predict how a
gene affects the well-being of an
individual, or his or her cells, or his or her
molecular pathways, networks and
complexes? Fortunately, this situation is
being ameliorated using enhanced
understanding of the evolutionary history
of genes, as inferred from detailed
sequence comparisons.
Homology, the evolutionary descent of
genes from a common ancestor,1 often
provides vital evidence in the prediction
of molecular function. That two genes are
homologous does not necessarily mean
that they possess common functions, only
that they share a common ancestor.
Nevertheless, an assumption often made is
that the functions of homologues have
remained essentially unchanged since the
time of their last common ancestor. This
provides a good working hypothesis of
function, particularly for those
homologues that have most recently
diverged. However, a better view is that
an evolutionary relationship implies
functional similarity but that this may be
true to a greater or lesser extent. The
extent is crucially dependent on scale,
since homologues' functional similarities
are greatest when considered at the
molecular scale, and least at the scales of
cells or organisms.
Although this discussion might be
viewed as purely a semantic exercise, its
appreciation does lead to a critical
question. Have homologues' functions
diverged since their last common
ancestor? The short answer, derived from
many studies, is that some have diverged
considerably, and some have remained
more or less the same. A long answer to
this question will take up much of the rest
of this review. The various clues that
indicate functional divergence or
constancy, and sequence-based evidence
that hints at particular functions will be
discussed. Much rests upon various
paradigms (Table 1) that hold true in most
cases.
© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 2. NO 1. 19–29. MARCH 2001
DETECTING
HOMOLOGUES USING
PROTEIN DATABASE
SEARCHES
First, however, it is appropriate to discuss
how sequence similarities, detected in
protein database searches, are used to
propose homology assignments. Excellent
reviews on this subject abound (for
example refs 11–15), hence the brevity in
its treatment here. The central issue here
is whether similarities seen in sequence
alignments merit an assignment of
homology. Amino acid similarities can be
quantified as an alignment score S using a
local alignment algorithm, gap penalties
and a substitution matrix.13 From these
scores an E-value can be calculated. This
value represents the number of different
alignments with scores equivalent to, or
better than, S that are expected to occur
in the database search simply by chance.
Thus, an alignment with an associated
E-value of 1 is considered not to be
biologically significant, one with an
E-value of 0.1 possibly to be significant,
and another with an E-value less than
0.01 most likely to be significant.
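The relationship between an alignment score and its E-value can be sketched numerically. The example below assumes the standard Karlin-Altschul form E = K·m·n·exp(−λS), where m and n are the query and database lengths; this formula is not given in the text above. The values of LAMBDA and K are illustrative assumptions, since real values depend on the substitution matrix and gap penalties in use. The function names are hypothetical, and the handling of E-values that fall between the thresholds quoted above is likewise an assumption.

```python
import math

# Illustrative Karlin-Altschul parameters (assumptions, not taken from the text);
# real values depend on the substitution matrix and gap penalties.
LAMBDA = 0.267
K = 0.041


def expected_hits(score, query_len, db_len, lam=LAMBDA, k=K):
    """Expected number of chance alignments scoring >= score.

    Uses E = K * m * n * exp(-lambda * S), with m the query length and
    n the total database length in residues.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def interpret(e_value):
    """Coarse reading of an E-value using the thresholds quoted in the text.

    The text only gives E = 1, E = 0.1 and E < 0.01; how values in between
    are binned here is an assumption made for illustration.
    """
    if e_value < 0.01:
        return "most likely significant"
    if e_value <= 0.1:
        return "possibly significant"
    return "not considered significant"


if __name__ == "__main__":
    for s in (60, 80, 100):
        e = expected_hits(s, query_len=300, db_len=100_000_000)
        print(s, e, interpret(e))
```

Because the score enters through an exponential, modest increases in S reduce E by orders of magnitude, whereas E grows only linearly with the database length.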
Table 1: Paradigms and exceptions in predicting function and structure from sequence

(1) Paradigm: Orthologues possess similar functions.
    Exception: A PTP-BAS PDZ-Fas receptor interaction occurs in human but not in mice.2
(2) Paradigm: Enzyme homologues are enzymes.
    Exception: Many enzyme families possess representatives that are enzymatically inactive and
    possess substitutions of active site residues.
(3) Paradigm: Regulatory domain homologues are not enzymes.
    Exception: The SH2 domain of human pp60c-src has been found to possess low tyrosine
    phosphatase activity.3 Lambda integrases appear to have evolved an enzymatic
    function from an ancient helix-turn-helix regulatory domain.4
(4) Paradigm: Equivalent cellular functions are mediated in different species by orthologues.
    Exception: A gene possessing a particular cellular role in one organism can be displaced by a
    non-orthologous but functionally equivalent gene in a second organism
    (`non-orthologous gene displacement').5
(5) Paradigm: Gene coding regions mutate slower than non-coding regions.
    Exception: Snake toxin genes, among others, appear to undergo accelerated evolution of
    their coding regions due to enhanced selective pressures.6
(6) Paradigm: Domain homologues are localised to single regions of sequence and 3D space, and
    possess the same order of secondary structures.
    Exception: Crystal structures have shown that domains can be `inserted' into other domains.
    Domains may also contain secondary structures that are circularly permuted in order.7
(7) Paradigm: Disulphide bridges are invariant among homologues.
    Exception: Disulphide `swapping' has been observed to occur in an epidermal growth
    factor-like (EGF) domain of thrombomodulin.8
(8) Paradigm: Although the same function or the same fold may have evolved more than once due
    to convergence, convergent evolution of sequences does not occur.
    Exception: Small localised structures, such as the helix–hairpin–helix motif, with conserved
    sequences have been observed in non-homologous contexts. These may have
    arisen through convergence or else via genetic duplication and insertion events.9
(9) Paradigm: Domains possess single conformations.
    Exception: Amyloid proteins are known to undergo conformational changes to β-sheet
    structures.10

The statistics of alignment scores,
therefore, are a powerful tool for deciding
on homology. As with most aspects of
evolution and function prediction,
however, there are pitfalls to be avoided.
For example, the fact that E-values are
strongly dependent on the size of the
database being searched is often
overlooked. An E-value provided in a
search of a large bacterial genome
containing about 6,000 gene products will
be approximately 100 times smaller than
the E-value given for exactly the same
alignment in a search of all gene products
currently known (approximately
600,000). As databases fill with
increasingly redundant sequences,
including identical copies, alternatively
spliced variants and homologues from
closely related species, there has been a
need for a truly non-redundant protein
sequence database. This has been
provided by NRDB90,16 a database in
which no two sequences possess greater
than 90 per cent identity. Use of this
database, currently approaching 350,000
sequences, is therefore recommended to
provide the most robust E-value
estimates.
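The database-size effect amounts to simple proportionality, which the following minimal sketch makes explicit. It assumes that E-values scale linearly with database size and, as in the example above, uses the number of sequences as a proxy for size (search programs generally work with total residue counts). The function name is hypothetical.

```python
def rescale_e_value(e_value, searched_size, reference_size):
    """Rescale an E-value from the database actually searched to a reference database.

    Assumes E is directly proportional to database size, so the same
    alignment score gives a larger E-value in a larger database.
    """
    if searched_size <= 0 or reference_size <= 0:
        raise ValueError("database sizes must be positive")
    return e_value * reference_size / searched_size


if __name__ == "__main__":
    # An alignment reported with E = 0.001 in a 6,000-sequence genome search,
    # re-expressed for a 600,000-sequence database (a 100-fold increase).
    print(rescale_e_value(0.001, searched_size=6_000, reference_size=600_000))
```

On these assumptions, a hit that looks most likely significant in a single-genome search (E = 0.001) corresponds to an E-value of about 0.1, only possibly significant, in the larger database. This is why E-values from searches of different databases should not be compared directly.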
A second serious pitfall relates to biases
in the amino acid compositions of
sequences. E-value calculations assume
that database sequences that are unrelated
to the query have average amino acid
compositions. For the majority of
sequences in databases this holds true, but
in a minority of cases unrelated sequences
can be detected with significant alignment
scores (low E-values) because of
non-random amino acid compositions
(Table 2). Similarly, low E-values that are
not biologically significant can be
generated when a database search is
initiated with a compositionally-biased or
`low complexity' query sequence.
Table 2: Examples of proteins with unusually high occurrences of
particular amino acids

Amino acid   Proteins
C            Disulphide-rich proteins; metallothioneins; zinc fingers
DE           Acidic proteins (unknown function)
G            Collagens
H            Hisactophilin; histidine-rich glycoprotein
ILMVFYWAC    Transmembrane helices
KR           Nuclear proteins, nuclear-localisation signals
N            Many Dictyostelium proteins
P            Collagens; SH3/WW/EVH1-binding sites; filaments
Q            Triplet repeat disease gene products
SR           RNA-binding motifs containing multiple Ser and Arg residues
ST           Mucins (potential oligosaccharide-attachment sites)
abcdefg      Heptad coiled coils (hydrophobic residues: a and d; hydrophilic bcefg)
             in, for example, myosins, intermediate filaments and kinesins

A recent example of when
compositional bias causes significance
estimates to go awry relates to the
proposed homology between Wingless/
Wnt and secreted phospholipase A2.17–19
Here, database searches yielded apparently
significant E-values, E < 10⁻³, yet much
of the alignment score was due to
fortuitous matches of cysteines. In this
case, three-dimensional structural
information was on hand from which to
argue that these molecules did not share a
common ancestor, and therefore did not
possess comparable structures and
functions. This example highlights the
awareness, which practitioners of database
searches must possess, of the effects of
compositional bias on alignment scores,
and the availability of additional structural
and functional data that might profitably
be brought to bear on questionable
homology assignments.
Filtering of compositionally biased
regions using SEG20 is default in many
database-searching algorithms. The
PSI-BLAST12 server at the National Center
for Biotechnology Information21 now
uses composition-based statistics for
E-value calculation. These approaches
markedly improve the discrimination of
true positive homologues versus
compositionally biased false positives.
However, as with all bioinformatics
applications, such approaches are not
foolproof in all cases, and the user is
required to be vigilant in spotting those
apparently significant similarities that arise
simply due to amino acid bias.
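To give a feel for what composition-based filtering does, the sketch below masks windows whose residue composition has low Shannon entropy. It is a simplified illustration and not the SEG algorithm itself; the window length and entropy threshold are assumed values chosen for demonstration, and the function names are hypothetical.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (in bits) of the residue composition of one window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_mask(seq, window=12, threshold=2.2):
    """Return seq with every residue that lies in a low-entropy window replaced by 'X'.

    window and threshold are illustrative assumptions; sequences shorter
    than the window are returned unchanged.
    """
    masked = list(seq)
    for start in range(max(len(seq) - window + 1, 0)):
        if window_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


if __name__ == "__main__":
    # Hypothetical query containing a glutamine-rich run (compare Table 2, row Q).
    query = "MKTAYIAKQR" + "Q" * 20 + "LEDVSHGWPF"
    print(low_complexity_mask(query))
```

Masked residues are excluded from scoring, so a biased segment can no longer inflate an alignment score. As the text cautions, a filter of this kind will not catch every biased region, and the thresholds trade sensitivity against over-masking.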
ORTHOLOGY, PARALOGY
AND FUNCTION
PREDICTION
The assignment of information from
experimental data on one homologue to
an under-characterised second
homologue is the basic principle that
underlies function prediction from
sequence. The greatest confidence should
be placed in such assignments when these
two homologues are orthologues.
Orthologues are defined as having arisen
from speciation events,22 and may be
thought of as `the same' genes in different
species. This contrasts with homologous
genes that arose from intragenome
duplications; these are termed
`paralogues'. Gene duplications give rise
to a pair of genes, of which one is often
assumed to retain the function of its
parent, whereas the other eventually
acquires either a `new' function or is lost
via mutation and selection.
This exposes a certain degree of
fickleness in how the term `function' is
currently used. Most homologues possess
`similar' functions, yet paralogues are
thought to persist only if they acquire
`novel' functions. This apparent paradox
is resolved when it is considered that
`function' is an umbrella term that covers
phenomena at the scales of atoms
(catalysis, binding events), domains,
proteins, complexes, networks, cells and
organisms. For example, the molecular
functions of paralogues, indeed even their
protein sequences, can be identical, while
their cellular roles differ owing to
differences in cellular expression patterns.
Consequently, homologues' functions are
often degenerate at certain linear scales,
even when they are not at others.
Correct assignment of orthology and
paralogy is not straightforward and is most
accurate for genes found in completely
sequenced genomes. Even for such
genomes, past lineage-specific deletions of
paralogous genes can result in paralogues
falsely being predicted as orthologues.
Additional complexities in assignment
arise from complete genome, or large-scale
gene, duplications that are predicted
to have occurred early in the vertebrate
lineage.23 These duplications resulted in
chordates possessing more (degenerate)
copies of genes than invertebrates.
Consequently, the human genome, for
example, may possess several paralogous
genes that are all orthologues of a single
Caenorhabditis elegans gene.
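One widely used operational heuristic for proposing orthologues between two complete genomes is the reciprocal best hit. It is not described in this review and is shown here only as a minimal sketch to illustrate the pitfalls just mentioned. The gene names and scores are hypothetical, and the input is assumed to be precomputed pairwise similarity scores.

```python
def best_hits(scores):
    """Map each query gene to its top-scoring subject gene.

    scores: dict mapping (query, subject) pairs to a similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit in genome B and a is b's best hit in genome A."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


if __name__ == "__main__":
    # Hypothetical scores: one worm gene with two human paralogues, plus a 1:1 pair.
    worm_vs_human = {
        ("worm_g1", "human_g1a"): 250,
        ("worm_g1", "human_g1b"): 240,
        ("worm_g2", "human_g2"): 180,
    }
    human_vs_worm = {
        ("human_g1a", "worm_g1"): 250,
        ("human_g1b", "worm_g1"): 240,
        ("human_g2", "worm_g2"): 180,
    }
    print(reciprocal_best_hits(worm_vs_human, human_vs_worm))
```

In this toy case only one of the two human paralogues is paired with the worm gene, although both would be orthologues of it under the definition above. Conversely, if the true orthologue has been deleted in one lineage, a paralogue can become the reciprocal best hit. Results of such a procedure are therefore hypotheses to be checked, for instance against a phylogenetic tree.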
It is worth reiterating that orthologue
identification is the most powerful tool in
predicting molecular function.
Comparisons of complete genomes
predict that the presence of orthologues in
different genomes indicates functional
conservation of proteins, complexes,
pathways and cellular processes. Sets of
orthologues peculiar to single taxonomic
groups, indeed, may best define these
organisms,24 and also may provide the
most accurate predictions of genome
phylogeny.25 Paralogue identification
provides a less accurate prediction of
function, particularly for prokaryotes. By
contrast, the functions of paralogous genes
in vertebrates are often overlapping, as
assessed by gene knockout studies.26,27
MULTIDOMAIN PROTEINS
Orthology assignment is often not
straightforward for multidomain proteins.
Although the term `domain' is used
differently in evolution, genetics and
molecular biology, it is described here as a
compact unit of structure, often
containing a hydrophobic core. Domains
are frequently found in assorted
combinations with other domains,
reflecting the evolutionary ability of
(partial) genes to be duplicated and
recombined elsewhere. A distinction is
made between domains, repeats and
motifs. Repeats are structural and
evolutionary entities that are always found in
two or more copies. Frequently, repeats
assemble into elongated `rods' or
`superhelices', or else into closed `barrel'
structures, such as β-propellers. Closed
structure assemblies of repeats might also
be thought of as domains. Motifs are
either regions of domains containing
conserved active- or binding-site residues,
or else conserved sequences, present
outside domains, that may adopt folded
conformations only in association with
their binding ligands. An example of a
motif that lacks obvious secondary
structures and occurs outside domains is
the AT-hook DNA-binding motif.28
There is little consensus in the
literature on what constitutes orthology
for multidomain proteins. For some, a
criterion for orthology (and indeed
paralogy) is that such proteins must
possess identical domain architectures. For
others, the concepts of orthology and
paralogy should be applied at the domain
level. The difference between these
definitions results in different orthology
designations for genes that have fused, or
molecules of slightly differing repeat
numbers, such as the spectrin repeats in
C. elegans DYS-1 and mammalian
dystrophin for example.29 The author
suggests that orthology be applied both at
the protein/gene level and at the domain
level. Orthologous proteins must contain
orthologous domains and these must be
present in the same order (Figure 1).
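The architecture part of this criterion is easy to express in code. The sketch below compares ordered lists of domain names; the function names and the example architectures are hypothetical simplifications, loosely modelled on the kind of case shown in Figure 1(b). It checks only a necessary condition: whether the individual domains are themselves orthologous must still be established separately.

```python
def same_architecture(domains_a, domains_b):
    """True if both proteins list the same domain families in the same N-to-C order."""
    return list(domains_a) == list(domains_b)


def reordered(domains_a, domains_b):
    """True if both proteins contain the same domain families but in a different order."""
    return (sorted(domains_a) == sorted(domains_b)
            and not same_architecture(domains_a, domains_b))


if __name__ == "__main__":
    # Hypothetical, simplified architectures listed from N- to C-terminus.
    protein_1 = ["SH3", "PX"]
    protein_2 = ["PX", "SH3"]
    protein_3 = ["PX", "SH3"]

    print(same_architecture(protein_2, protein_3))  # True
    print(reordered(protein_1, protein_2))          # True
```

A pair flagged by `reordered` shares homologous domains yet would not be called orthologous at the protein level under the definition suggested above.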
Notwithstanding such difficulties, sets
of orthologous genes have been defined
(see COGs31) from the completely known
genome sequences of archaea, bacteria
and eukarya. These and similar studies
now allow investigation of functional
conservation across divergent species. An
excellent example of this is a recent
in-depth study of the protein components of
the citric acid cycle.32 This clearly
demonstrated the complexities involved
in assigning orthology and paralogy for
divergent homologues, and highlighted
the incompleteness of portions of the
cycle in the majority of species. In the
prokaryotes, the absence of some of the
key enzymes of the cycle is probably due
to these organisms' autotrophic lifestyles.
Thus, the absence of genes in a
completely sequenced genome can
provide insight into function at the
organismal level.
Figure 1: The domain architectures of protein pairs that are not orthologous, presented using
the SMART server.30 (a) Synechocystis sll0776 and human MST (mixed lineage kinase 2) both
contain protein kinase and src homology 3 (SH3) domains, but in different linear orders.
Moreover, the bacterial SH3 domain (SH3b) is predicted to be extracellular, whereas MST is
predicted to be cytoplasmic. (b) Yeast (Saccharomyces cerevisiae) Bem1p and human p47phox
contain SH3 and PX domains but in different collinear orders. (c) Yeast protein kinase C1
(PKC1p) and human protein kinase Cα (PKCα) are likely to possess similar cellular functions
but are not orthologues owing to differing domain compositions and architectures.
Abbreviations following those in SMART: STYKc, protein kinase with dual serine/threonine
and tyrosine specificity; S_TKc, protein kinase with serine/threonine specificity; SH3, src
homology 3 domain; SH3b, bacterial-type SH3 domain; PX, phox homology domain; C1,
protein kinase C conserved region 1; C2, protein kinase C conserved region 2; HR1, protein
kinase C-related kinase homology region 1; and, S_TK_X, serine/threonine-specific protein
kinase extension domain

HORIZONTAL GENE
TRANSFER
Horizontal (or `lateral') gene transfer has
also played a major role in prokaryotic
evolution.33 Detecting that a prokaryotic
lineage has acquired a `foreign' gene
implicates that gene in the lineage's
exploitation of new environments and
evolutionary niches. Until recently,
horizontal gene transfer was suspected to
have been a relatively rare phenomenon.
However, more recent studies have
estimated that between 0 and 17 per cent
of prokaryotic genomes has been acquired
from other genetic sources.33 These
sources frequently include eukaryotic
genomes. A study of signalling domains
revealed many instances of horizontal
gene transfer from eukaryotes to bacteria,
in particular to Synechocystis PCC6803,
but fewer transfers from eukaryotes to
archaea.34 With the exception of genes
obtained from the influx of mitochondrial
and chloroplast genes into the nuclear
genome, metazoan genomes are thought
to have acquired relatively few genes via
recent horizontal gene transfers.
Bacterial virulence has been attributed
in many cases to the acquisition of genes,
absent from related non-virulent species.
Pathogens of eukaryotes, therefore, might
obtain significant advantage from genes
that arose via horizontal transfer from
their hosts, by avoiding eradication by the
immune system, or by hijacking host
cellular processes for their own gain. For
example, the presence of a Sec7
domain-containing protein in Rickettsia prowazekii
implicates it in this pathogen's
modification of the host's Golgi
membranes.35 Similarly, a mammalian
perforin-like domain in the Chlamydia
trachomatis CT153 protein might indicate
an involvement in pore formation and
host cell entry.36 Detection of horizontally
transferred genes in parasitic organisms,
therefore, can lead to prediction of
molecular and organismal functions.
INFERRING FUNCTION
AND LOCALISATION
USING DOMAIN
DATABASES
Once a protein sequence has been
obtained, the first port-of-call in attempts
to understand its functions should not be a
BLAST or FASTA database search.
Rather, the sequence should be scanned
for occurrences of well-known domains,
repeats, motifs and sorting signals, before
embarking on BLAST-like sequence
database searches. Several servers exist that
provide searches for domain homologues
(Table 3). Upon detection of a
homologous domain in a query sequence,
these servers furnish the user with relevant
functional and structural information, as
well as appropriate literature references.
Although the new conserved domain
database (CDD) server draws upon
multiple alignments from both SMART
and PFAM, the latter source libraries
should also be searched from their
homepages. This is because SMART and
PFAM both use a hidden Markov model
method for homologue detection that
contrasts with the different BLAST-like
methodology of CDD. It also should be
noted that some short domains, repeats
and motifs detectable using SMART and
PFAM web-tools are not detectable using
CDD.

Table 3: Useful web-based resources for predicting function from protein sequence

SMART
    URL: http://smart.embl-heidelberg.de/
    Predicts: Domains, repeats, motifs, coiled coils, signal sequences
PFAM
    URL: http://pfam.wustl.edu/
         http://www.sanger.ac.uk/Pfam/
         http://www.cgr.ki.se/Pfam/
    Predicts: Domains, repeats, motifs, coiled coils, signal sequences
CDD (mostly SMART + PFAM alignments)
    URL: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
    Predicts: Domains, repeats, motifs, coiled coils, signal sequences
PROSITE
    URL: http://www.isrec.isb-sib.ch/software/PFSCAN_form.html
    Predicts: Domains, repeats, motifs
PSORT
    URL: http://psort.nibb.ac.jp/
    Predicts: Localisation signals
SignalP
    URL: http://www.cbs.dtu.dk/services/SignalP/
    Predicts: Signal peptides
big-PI
    URL: http://mendel.imp.univie.ac.at/
    Predicts: GPI anchors
Various
    URL: See http://expasy.cbr.nrc.ca/tools/#transmem
    Predicts: Transmembrane helices
Other servers allow prediction of
sequence signals that specify the sub- or
extracellular localisation of proteins. Thus,
signal peptides, nuclear localisation
signals, glycosyl-phosphatidyl-inositol
(GPI) anchors, transmembrane regions
and organelle-targeting sequences can be
detected (Table 3). Some information on
localisation can also be gleaned from the
domain content of eukaryotic proteins.
Some domain types appear only in one of
the three sets of cytoplasmic, secreted or
nuclear proteins. For example,
disulphide-rich domains, such as kringle,
epidermal growth factor-like and
fibronectin type II domains, only occur in
secreted proteins, whereas CHROMO,
SET and BROMO domains only occur
in chromatin-associated proteins. Some
`promiscuous' domain types, such as
immunoglobulin, von Willebrand factor
A and fibronectin type III domains, are
not so exclusive and occur in cytoplasmic,
secreted and nuclear proteins.
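The domain-based localisation clues in this paragraph can be captured as a small lookup. This is a minimal sketch restricted to the domain types named above; the set and function names are hypothetical, and a real annotation pipeline would draw such associations from a curated domain database.

```python
# Domain types named in the text, grouped by the localisation they imply.
SECRETED_ONLY = {"kringle", "EGF-like", "fibronectin type II"}
CHROMATIN_ONLY = {"CHROMO", "SET", "BROMO"}
PROMISCUOUS = {"immunoglobulin", "von Willebrand factor A", "fibronectin type III"}


def localisation_hints(domains):
    """Return the set of localisation hints implied by a protein's domain content.

    Promiscuous and unlisted domains contribute no hint. More than one hint
    in the result signals conflicting evidence that needs manual inspection.
    """
    hints = set()
    for domain in domains:
        if domain in SECRETED_ONLY:
            hints.add("secreted")
        elif domain in CHROMATIN_ONLY:
            hints.add("chromatin-associated")
    return hints


if __name__ == "__main__":
    print(localisation_hints(["kringle", "immunoglobulin"]))
    print(localisation_hints(["fibronectin type III"]))
```

An empty result means only that the domain content is uninformative, not that the protein is cytoplasmic; signal-based predictors such as those in Table 3 remain necessary.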
The recent acceleration in the
determination of tertiary structures has
prompted renewed interest in the
prediction of function directly from
structure.37,38 Identification of binding
sites for the majority of molecules is
straightforward, since most ligands bind
the largest cleft on the molecule's
surface.39 Certain folds also have
tendencies to bind ligands in particular
locations. For example, binding sites of
β-propellers tend to be located at the
junction between the β-sheet propeller
`blades'.40 In addition, protein folds may
also hint at molecular function, since
protein domains with the same fold, albeit
with non-significant sequence similarity,
often possess comparable functions.41
Structural features may also hold the key
to function. The presence of
helix-turn-helix-like structures indicates DNA
binding whereas conserved and proximal
histidine, aspartic acid and serine residues
suggest a serine proteinase/lipase-like
hydrolase function.

Table 4: Groupings of amino acid residues according to common
functions

Description                                Amino acids                    Function
Polar residues                             C, D, E, H, K, N, Q, R, S, T   Active sites
Aromatic residues                          F, H, W, Y                     Protein ligand-binding sites
Zn²⁺-coordinating residues                 C, D, E, H, N, Q               Active sites, zinc fingers
Ca²⁺-coordinating residues                 D, E, N, Q                     Allostery, ligand-binding sites
Magnesium- or manganese-binding residues   D, E, N, S, R, T               Mg²⁺- or Mn²⁺-dependent catalysis or ligand-binding
Phosphate-binding residues                 H, K, R, S, T                  Phosphate and sulphate-binding
CONSERVED POSITIONS IN
MULTIPLE ALIGNMENTS
Many domain families exist, however, for
which no functional or structural
information is yet available. In these cases,
one has to resort to prediction by analogy.
For this, several `rules-of-thumb' are
suggested (for residue groupings see
Table 4).
· Catalytic site residues are almost
invariably polar.
· Large aromatic residues are often found
to be involved in protein–ligand
interactions.
· Zinc ions are coordinated by several
residue types and, often, water
molecules.
· Calcium ions are often bound by acidic
residues and amides, although
additional interactions occur with
backbone atoms.
· Manganese and/or magnesium ions are,
in enzymes such as nucleases42 and
glycosyltransferases,43 often bound by
two acidic residues separated by a
hydrophobic residue.
· Phosphate and sulphate groups are
found bound to the amino terminus of
α-helices in approximately half of all
cases.44
· Distinction can be made between
disulphide-rich secreted proteins and
zinc fingers, since the former occur in
proteins with signal peptides and never
possess substitutions of cysteine for
histidine, or vice versa.
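The rules-of-thumb above can be applied mechanically to a multiple alignment by asking, for each column, which of the Table 4 residue groupings contain every residue observed there. The following is a minimal sketch using the groupings as printed in Table 4. The function names and the toy alignment are hypothetical, and treating gap characters as uninformative is an assumption.

```python
# Residue groupings transcribed from Table 4.
RESIDUE_GROUPS = {
    "polar": set("CDEHKNQRST"),
    "aromatic": set("FHWY"),
    "zinc-coordinating": set("CDEHNQ"),
    "calcium-coordinating": set("DENQ"),
    "Mg/Mn-binding": set("DENSRT"),
    "phosphate-binding": set("HKRST"),
}


def column_groups(column):
    """Names of the Table 4 groupings that contain every residue in the column.

    Gap characters ('-' and '.') are ignored; an all-gap column matches nothing.
    """
    residues = {r for r in column.upper() if r not in "-."}
    if not residues:
        return []
    return [name for name, members in RESIDUE_GROUPS.items() if residues <= members]


def annotate_alignment(rows):
    """Yield (position, column, groups) for columns confined to at least one grouping."""
    for pos, col in enumerate(zip(*rows)):
        column = "".join(col)
        groups = column_groups(column)
        if groups:
            yield pos, column, groups


if __name__ == "__main__":
    # Toy alignment (hypothetical sequences of equal length).
    alignment = ["ADFL", "GEWL", "AEYI"]
    for pos, column, groups in annotate_alignment(alignment):
        print(pos, column, groups)
```

Because the groupings overlap, a column usually matches several of them, so the output is a list of candidate roles rather than an assignment. Positions conserved within a single narrow grouping across a divergent family are the ones most worth examining experimentally.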
HOMOLOGUES WITH
DIFFERENT FUNCTIONS
The great majority of enzymes, but not
regulatory domains, appear to be ancient,
since they are found in modern archaea,
bacteria and eukaryotes.34 Divergent
enzyme homologues often possess
comparable catalytic mechanisms or
reactive intermediates but different
substrates and different cellular roles.45 For
example, prokaryotic and eukaryotic
homologues of protein kinase46 and
phospholipase D47,48 families act on
different molecular substrates. A similar
conclusion is reached for the few
regulatory domains that are thought to be
ancient. Prokaryotic and eukaryotic PDZ
domains bind C-terminal tails of
proteins49,50 although with very different
consequences: prokaryotic ligands are
bound to facilitate proteolysis, whereas
eukaryotic ligands are bound mostly as
part of signal transduction pathways.
More recently derived homologues
may also possess different functions. This
is most easily seen for enzyme
homologues that possess substitutions of
critical catalytic residues, but also holds
true for non-enzymatic domain families.
β-Trefoil proteins, for example, probably
first arose in early eukaryotic evolution
and thenceforth diversified into families of
extracellular cytokines and sugar-binding
proteins, and intracellular actin-binding
proteins, each using distinct binding
sites.51
INFERRING SPECIFICITY
Given the functional diversification of
domain homologues, how can one decide
on function? An obvious answer to this is
that one can predict probable function
based on that of the most
sequence-similar orthologues or else other
homologues that have been characterised
experimentally. In other words, one
assumes that function partitions with
sequence similarity, and that family
members on each branch of a dendrogram
possess comparable functions.
Dendrograms (`phylogenetic trees'),
however, are calculated from all amino
acids of domains, rather than from only
those residues that are essential to impart
function. In many cases, substitution of
one or more of these essential residues
results in homologues whose functions do
not partition according to the dendrogram
structure. Furthermore, inaccuracies in
the calculation of dendrograms might
result in inaccurate function prediction.
Methods have been developed,
however, that allow accurate predictions
of function for multifunctional domain
families on the basis of tertiary structures
and/or dendrograms.52,53 More recently, a
method has been developed that detects
residue types in multiple alignment
positions whose conservation correlates
with functional sub-types.54 Application
of this method will be of increasing
importance as functional specificities of
domain families are experimentally
derived. For example, recent experiments
have shown that single residue
substitutions in WW and SH2 domains
result in changes in specificities from
-PPXY- to -PPLP-, and phospho(Y)XXI
to phospho(Y)XXN, respectively;55,56
here X is any residue. In the near future,
it is essential that full predictive advantage
of these specificity switching positions is
taken in annotating genomes.
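The idea of alignment positions whose conservation tracks functional sub-type can be illustrated with a deliberately strict rule: report columns that are invariant within every sub-type yet differ between sub-types. This is a minimal sketch and not the cited method (ref. 54); the function name, sub-type labels and toy sequences are hypothetical.

```python
from collections import defaultdict


def specificity_positions(sequences, subtypes):
    """Find columns invariant within each sub-type but distinct between sub-types.

    sequences: aligned sequences of equal length.
    subtypes:  one functional sub-type label per sequence.
    Returns a list of (position, {label: residue}) tuples.
    """
    by_subtype = defaultdict(list)
    for seq, label in zip(sequences, subtypes):
        by_subtype[label].append(seq)

    hits = []
    for pos in range(len(sequences[0])):
        consensus = {}
        for label, seqs in by_subtype.items():
            residues = {s[pos] for s in seqs}
            if len(residues) != 1:
                break  # not conserved within this sub-type
            consensus[label] = residues.pop()
        else:
            # Keep the column only if every sub-type has its own residue.
            if len(set(consensus.values())) == len(consensus) > 1:
                hits.append((pos, consensus))
    return hits


if __name__ == "__main__":
    # Toy aligned domain sequences with hypothetical sub-type labels.
    seqs = ["ACDY", "ACEY", "GCDW", "GCEW"]
    labels = ["typeA", "typeA", "typeB", "typeB"]
    print(specificity_positions(seqs, labels))
```

Real families call for a statistical treatment that tolerates occasional substitutions and uneven sampling of sub-types. The strict rule used here merely shows why such positions are invisible to a tree computed from all residues: two discriminating columns carry no more weight than any others.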
PREDICTING FUNCTION
BY NON-HOMOLOGY
METHODS
Three new `context-dependent'
prediction methods have recently come
into prominence.57,58 The functions of
domains A and B, found separately in two
proteins, are predicted to be linked if
domains A and B are found together in a
second protein.59 Similarly, two genes are
more likely to be functionally related if
they are repeatedly found as gene
neighbours in multiple genomes.60
Finally, correlation of the presence or
absence of two genes in multiple genomes
might imply their participation in a
common cellular function.61
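The third of these methods, correlating the presence or absence of genes across genomes, lends itself to a compact sketch. Here each gene is represented by the set of genomes in which it occurs, and pairs with similar sets are proposed as functionally linked. The Jaccard measure, the threshold, and the gene and genome names are assumptions made for illustration and are not taken from the cited work.

```python
from itertools import combinations


def profile_similarity(genomes_a, genomes_b):
    """Jaccard similarity between the sets of genomes containing two genes."""
    union = genomes_a | genomes_b
    if not union:
        return 0.0
    return len(genomes_a & genomes_b) / len(union)


def linked_pairs(presence, threshold=0.8):
    """Gene pairs whose presence/absence profiles are at least `threshold` similar.

    presence: dict mapping gene name -> set of genome names containing it.
    """
    pairs = []
    for gene_1, gene_2 in combinations(sorted(presence), 2):
        similarity = profile_similarity(presence[gene_1], presence[gene_2])
        if similarity >= threshold:
            pairs.append((gene_1, gene_2, similarity))
    return pairs


if __name__ == "__main__":
    # Hypothetical presence data for three genes across four genomes.
    presence = {
        "geneA": {"genome1", "genome2", "genome3"},
        "geneB": {"genome1", "genome2", "genome3"},
        "geneC": {"genome4"},
    }
    print(linked_pairs(presence))
```

Jaccard similarity ignores genomes from which both genes are absent, so it rewards co-occurrence only; a measure that also credits shared absences would be an equally defensible choice. Genes present in nearly all genomes have uninformative profiles whichever measure is used.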
COMBINED APPROACHES
TO FUNCTION
PREDICTION
Predicting function is a thorny problem
centred on taking maximal advantage of
available information while not resorting
to over-prediction. Attributing function
from one homologue to another must be
achieved with due attention paid to all
available evidence. As the examples given
here have shown, function assignments
can be made based on single base changes
in genes, correlations of domains in
proteins, similarities in protein structure
and correlations of genes in genomes, as
well as the more traditional sequence
similarity-based identifications of
orthologues and close homologues. This
is balanced by the inability of sequence
analysis to readily provide answers to such
questions as the tissue-specific expression
patterns of genes, the binding ligands of
proteins, and the relationship between
genotype and phenotype. As such, these
latter issues represent not only the current
limits of function prediction, but also
some of the future goals of bioinformatics.
Acknowledgement
Thanks are due to Prof. John Mattick for many stimulating discussions.
References
1. Fitch, W. (2000), `Homology – a personal view on some of the problems', Trends Genet., Vol. 16, pp. 227–231.
2. Cuppen, E. et al. (1997), `No evidence for involvement of mouse protein-tyrosine phosphatase-BAS-like Fas-associated phosphatase-1 in Fas-mediated apoptosis', J. Biol. Chem., Vol. 272, pp. 30215–30220.
3. Boerner, R. J. et al. (1995), `Catalytic activity of the SH2 domain of human pp60c-src; evidence from NMR, mass spectrometry, site-directed mutagenesis and kinetic studies for an inherent phosphatase activity', Biochemistry, Vol. 34, pp. 15351–15358.
4. Grishin, N. V. (2000), `Two tricks in one bundle: helix-turn-helix gains enzymatic activity', Nucleic Acids Res., Vol. 28, pp. 2229–2233.
5. Koonin, E. V., Mushegian, A. R. and Bork, P. (1996), `Non-orthologous gene displacement', Trends Biochem. Sci., Vol. 12, pp. 334–336.
6. Ohno, M. et al. (1998), `Molecular evolution of snake toxins: Is the functional diversity of snake toxins associated with a mechanism of accelerated evolution?', Prog. Nucleic Acid Res. Mol. Biol., Vol. 59, pp. 307–364.
7. Russell, R. B. and Ponting, C. P. (1998), `Protein fold irregularities that hinder sequence analysis', Curr. Opin. Struct. Biol., Vol. 8, pp. 364–371.
8. Sampoli Benitez, B. A. et al. (1997), `Structure of the fifth EGF-like domain of thrombomodulin: An EGF-like domain with a novel disulfide-bonding pattern', J. Mol. Biol., Vol. 273, pp. 913–926.
9. Doherty, A. J., Serpell, L. C. and Ponting, C. P. (1996), `The helix–hairpin–helix DNA-binding motif: A structural basis for non-sequence-specific recognition of DNA', Nucleic Acids Res., Vol. 24, pp. 2488–2497.
10. Serpell, L. C. (2000), `Alzheimer's amyloid fibrils: Structure and assembly', Biochim. Biophys. Acta, Vol. 1502, pp. 16–30.
11. Bork, P. and Gibson, T. J. (1996), `Applying motif and profile searches', Methods Enzymol., Vol. 266, pp. 162–184.
12. Altschul, S. F. and Koonin, E. V. (1998), `Iterated profile searches with PSI-BLAST – a tool for discovery in protein databases', Trends Biochem. Sci., Vol. 23, pp. 444–447.
13. Hofmann, K. (2000), `Sensitive protein comparisons with profiles and hidden Markov models', Briefings Bioinformatics, Vol. 1, pp. 167–178.
14. Bateman, A. and Birney, E. (2000), `Searching databases to find protein domain organization', Adv. Prot. Chem., Vol. 54, pp. 137–157.
15. Ponting, C. P. et al. (2000), `Evolution of domain families', Adv. Prot. Chem., Vol. 54, pp. 185–244.
16. Holm, L. and Sander, C. (1998), `Removing near-neighbour redundancy from large protein sequence collections', Bioinformatics, Vol. 14, pp. 423–429; http://www.embl-ebi.ac.uk/holm/nrdb90.
17. Reichsman, F., Moore, H. M. and Cumberledge, S. (1999), `Sequence homology between Wingless/Wnt-1 and a lipid-binding domain in secreted phospholipase A2', Curr. Biol., Vol. 9, pp. R353–R355.
18. Barnes, M. R. and Russell, R. B. (1999), `A lipid-binding domain in Wnt: A case of mistaken identity?', Curr. Biol., Vol. 9, pp. R717–R718.
19. Copley, R. R., Ponting, C. P. and Bork, P. (1999), `Phospholipases A2 and Wnts are unlikely to share a common ancestor', Curr. Biol., Vol. 9, p. R718.
20. Wootton, J. C. and Federhen, S. (1996), `Analysis of compositionally biased regions in sequence databases', Methods Enzymol., Vol. 266, pp. 554–571.
21. http://www.ncbi.nlm.nih.gov/blast/psiblast.cgi
22. Fitch, W. M. (1970), `Distinguishing homologous from analogous proteins', Syst. Zool., Vol. 19, pp. 99–106.
23. Holland, P. W. H. (1999), `Gene duplication: Past, present and future', Semin. Cell Dev. Biol., Vol. 10, pp. 541–547.
24. Chervitz, S. A. et al. (1998), `Comparison of the complete protein sets of worm and yeast: Orthology and divergence', Science, Vol. 282, pp. 2022–2028.
25. Snel, B., Bork, P. and Huynen, M. A. (1999), `Genome phylogeny based on gene content', Nat. Genet., Vol. 21, pp. 108–110.
26. Stein, P. L., Vogel, H. and Soriano, P. (1994), `Combined deficiencies of Src, Fyn, and Yes tyrosine kinases in mutant mice', Genes Dev., Vol. 8, pp. 1999–2007.
27. Condie, B. G. and Capecchi, M. R. (1994), `Mice with targeted disruptions in the paralogous genes hoxa-3 and hoxd-3 reveal synergistic interactions', Nature, Vol. 370, pp. 304–307.
28. Aravind, L. and Landsman, D. (1998), `AT-hook motifs identified in a wide variety of DNA-binding proteins', Nucleic Acids Res., Vol. 26, pp. 4413–4421.
29. Bessou, C. et al. (1998), `Mutations in the Caenorhabditis elegans dystrophin-like gene dys-1 lead to hyperactivity and suggest a link with cholinergic transmission', Neurogenetics, Vol. 2, pp. 61–72.
30. http://smart.embl-heidelberg.de
31. Tatusov, R. L., Koonin, E. V. and Lipman, D. J. (1997), `A genomic perspective on protein families', Science, Vol. 278, pp. 631–637; http://www.ncbi.nlm.nih.gov/cog.
32. Huynen, M. A., Dandekar, T. and Bork, P. (1999), `Variation and evolution of the citric acid cycle: a genomic perspective', Trends Microbiol., Vol. 7, pp. 281–291.
33. Ochman, H., Lawrence, J. G. and Groisman, E. A. (2000), `Lateral gene transfer and the nature of bacterial innovation', Nature, Vol. 405, pp. 299–304.
34. Ponting, C. P. et al. (1999), `Eukaryotic signalling domain homologues in archaea and bacteria. Ancient ancestry and horizontal gene transfer', J. Mol. Biol., Vol. 289, pp. 729–745.
35. Wolf, Y. I., Aravind, L. and Koonin, E. V. (1999), `Rickettsiae and Chlamydiae: Evidence of horizontal gene transfer and gene exchange', Trends Genet., Vol. 15, pp. 173–175.
36. Ponting, C. P. (1999), `Chlamydial homologues of the MACPF (MAC/perforin) domain', Curr. Biol., Vol. 9, pp. R911–R913.
37. Shapiro, L. and Harris, T. (2000), `Finding function through structural genomics', Curr. Opin. Biotechnol., Vol. 11, pp. 31–35.
38. Eisenstein, E. et al. (2000), `Biological function made crystal clear – annotation of hypothetical proteins via structural genomics', Curr. Opin. Biotechnol., Vol. 11, pp. 25–30.
39. Laskowski, R. A. et al. (1996), `Protein clefts in molecular recognition and function', Protein Sci., Vol. 5, pp. 2438–2452.
40. Russell, R. B., Sasieni, P. D. and Sternberg, M. J. E. (1998), `Supersites within superfolds. Binding site similarity in the absence of homology', J. Mol. Biol., Vol. 282, pp. 903–918.
41. Murzin, A. et al. (1995), `SCOP: A structural classification of proteins database for investigation of sequences and structures', J. Mol. Biol., Vol. 247, pp. 536–540.
42. Ceska, T. A. and Sayers, J. R. (1998), `Structure-specific DNA cleavage by 5′ nucleases', Trends Biochem. Sci., Vol. 23, pp. 331–336.
43. Busch, C. et al. (1998), `A common motif of eukaryotic glycosyltransferases is essential for the enzyme activity of large clostridial cytotoxins', J. Biol. Chem., Vol. 273, pp. 19566–19572.
44. Copley, R. R. and Barton, G. J. (1994), `A structural analysis of phosphate and sulphate binding sites in proteins. Estimation of propensities for binding and conservation of phosphate binding sites', J. Mol. Biol., Vol. 242, pp. 321–329.
45. Gerlt, J. A. and Babbitt, P. C. (1998), `Mechanistically diverse enzyme superfamilies: The importance of chemistry in the evolution of catalysis', Curr. Opin. Chem. Biol., Vol. 2, pp. 607–612.
46. Leonard, C. J., Aravind, L. and Koonin, E. V. (1998), `Novel families of putative protein kinases in bacteria and archaea: Evolution of the ``eukaryotic'' protein kinase superfamily', Genome Res., Vol. 8, pp. 1038–1047.
47. Koonin, E. V. (1996), `A duplicated catalytic motif in a new superfamily of phosphohydrolases and phospholipid synthases that includes poxvirus envelope proteins', Trends Biochem. Sci., Vol. 21, pp. 242–243.
48. Ponting, C. P. and Kerr, I. D. (1996), `A novel family of phospholipase D homologues that includes phospholipid synthases and putative endonucleases: Identification of duplicated repeats and potential active site residues', Protein Sci., Vol. 5, pp. 914–922.
49. Beebe, K. D. et al. (2000), `Substrate recognition through a PDZ domain in tail-specific protease', Biochemistry, Vol. 39, pp. 3149–3155.
50. Doyle, D. A. et al. (1996), `Crystal structures of a complexed and peptide-free membrane protein-binding domain: Molecular basis of peptide recognition by PDZ', Cell, Vol. 85, pp. 1067–1076.
51. Ponting, C. P. and Russell, R. B. (2000), `Identification of distant homologues of FGFs suggests a common ancestor for all β-trefoil proteins', J. Mol. Biol., Vol. 302, pp. 1041–1047.
52. Lichtarge, O., Bourne, H. R. and Cohen, F. E. (1996), `An evolutionary trace method defines binding surfaces common to protein families', J. Mol. Biol., Vol. 257, pp. 342–358.
53. Sjolander, K. (1998), `Phylogenetic inference in protein superfamilies: Analysis of SH2 domains' in `Proceedings of the 6th International Conference on Intelligent Systems for Molecular Biology', AAAI Press, Menlo Park, CA, pp. 165–174.
54. Hannenhalli, S. S. and Russell, R. B. (2000), `Analysis and prediction of functional sub-types from protein sequence alignments', J. Mol. Biol., Vol. 303, pp. 61–76.
55. Espanel, X. and Sudol, M. (1999), `A single point mutation in a group I WW domain shifts its specificity to that of group II WW domains', J. Biol. Chem., Vol. 274, pp. 17284–17289.
56. Kimber, M. S. et al. (2000), `Structural basis for specificity switching of the Src SH2 domain', Mol. Cell, Vol. 5, pp. 1043–1049.
57. Marcotte, E. M. (2000), `Computational genetics: Finding protein function by nonhomology methods', Curr. Opin. Struct. Biol., Vol. 10, pp. 359–365.
58. Galperin, M. Y. and Koonin, E. V. (2000), `Who's your neighbor? New computational approaches for functional genomics', Nature Biotech., Vol. 18, pp. 609–613.
59. Marcotte, E. M. et al. (1999), `Detecting protein function and protein–protein interactions from genome sequences', Science, Vol. 285, pp. 751–753.
60. Dandekar, T. et al. (1998), `Conservation of gene order: A fingerprint of proteins that physically interact', Trends Biochem. Sci., Vol. 23, pp. 324–328.
61. Pellegrini, M. et al. (1999), `Assigning protein functions by comparative genome analysis: protein phylogenetic profiles', Proc. Natl Acad. Sci. USA, Vol. 96, pp. 4285–4288.