Download Use of mass spectrometry-derived data to annotate nucleotide and

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Minimal genome wikipedia , lookup

Genomic library wikipedia , lookup

Epigenetics of neurodegenerative diseases wikipedia , lookup

Human genome wikipedia , lookup

Pathogenomics wikipedia , lookup

Genome evolution wikipedia , lookup

Gene expression profiling wikipedia , lookup

RNA-Seq wikipedia , lookup

Metagenomics wikipedia , lookup

Polycomb Group Proteins and Cancer wikipedia , lookup

Point mutation wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Helitron (biology) wikipedia , lookup

Genomics wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Protein moonlighting wikipedia , lookup

NEDD9 wikipedia , lookup

Transcript
54
Review
21 Suzuki, N. et al. (1999) Werner syndrome helicase
contains a 5′–3′ exonuclease activity that digests
DNA and RNA strands in DNA/DNA and
RNA/DNA duplexes dependent on unwinding.
Nucleic Acids Res. 27, 2361–2368
22 Shen, J-C. et al. (1998) Characterisation of
Werner syndrome protein DNA helicase activity:
directionality, substrate dependence and
stimulation by replication protein A. Nucleic
Acids Res. 26, 2879–2885
23 Soultanas, P. et al. (2000) Uncoupling DNA
translocation and helicase activity in PcrA : direct
evidence for an active mechanism. EMBO J. 19,
3799–3810
24 Sawaya, M.R. et al. (1999) Crystal structure of
the helicase domain from the replicative
helicase-primase of bacteriophage T7. Cell 99,
167–177
25 Singleton, M.R. et al. (2000) Crystal structure of
T7 gene 4 ring helicase indicates a mechanism for
sequential hydrolysis of nucleotides. Cell 101,
589–600
TRENDS in Biochemical Sciences Vol.26 No.1 January 2001
26 Waksman, G. et al. (2000) Helicases as nucleic
acid unwinding machines. Nat. Struct. Biol. 7 ,
20–22
27 Hingorani, M.M. and Patel, S.S. (1993)
Interactions of bacteriophage T7 DNA
primase/helicase protein with single-stranded
and double-stranded DNAs. Biochemistry 32,
12478–12487
28 Jezewska, M.J. et al. (1996) Binding of
Escherichia coli primary replicative helicase
DnaB protein to single-stranded DNA. Long
range allosteric conformational changes within
the protein hexamer. Biochemistry 35, 2129–2145
29 Yu, X. et al. (1996) DNA is bound within the
central hole to one or two of the six subunits of
the T7 DNA helicase. Nat. Struct. Biol. 3,
740–743
30 Bird, L.E. et al. (1998) Helicases: a unifying
structural theme? Curr. Opin. Struct. Biol. 8, 14–18
31 Abrahams, J.P. et al. (1994) Structure at 2.8 Å
resolution of F1-ATPase from bovine heart
mitochondria. Nature 370, 621–628
32 Boyer, P.D (1993) The binding change mechanism
for ATP synthase some probabilities and
possibilities. Biochim. Biophys. Acta 1140, 215–250
33 Hingorani, M.M. et al. (1997) The dTTPase
mechanism of T7 DNA helicase resembles the
binding change mechanism of the F1-ATPase.
Proc. Natl. Acad. Sci. U. S. A. 94, 5012–5017
34 Stitt, B.L. (1988) Escherichia coli transcription
termination protein rho has three hydrolytic sites
for ATP. J. Biol. Chem. 263, 11130–11137
35 Kim, D. et al. (1999) Transcription termination
factor Rho contains three noncatalytic nucleotide
binding sites. J. Biol. Chem. 274, 11623–11628
36 Hingorani, M.M. and Patel, S.S. (1996)
Cooperative interactions of nucleotide ligands are
linked to oligomerisation and DNA binding in
bacteriophage T7 gene 4 helicases. Biochemistry
35, 2218–2228
37 Bianco, P.R. and Kowalczykowski, S.C. (2000)
Step size measurements on the translocation
mechanism of the RecBC DNA helicase. Nature
405, 368–372
Use of mass spectrometry-derived
data to annotate nucleotide and
protein sequence databases
Matthias Mann and Akhilesh Pandey
Mass spectrometry-based proteomic methodologies can be used to annotate
both nucleotide and protein sequence databases. Because such data have to be
derived from proteins, they can be used to identify coding regions of the
genome as well as provide the complete primary sequence of proteins and their
expression patterns and post-translational modifications.
Matthias Mann
Protein Interaction
Laboratory (PIL), Center
for Experimental
Bioinformatics,
University of Southern
Denmark, Campusvej 55,
Odense M, DK-5230
Denmark;
and MDS-Protana,
Staermosegaardsvej 6,
Odense M, DK-5230
Denmark.
e-mail:
[email protected]
Akhilesh Pandey
Whitehead Institute for
Biomedical Research,
Nine Cambridge Center,
Cambridge, MA 02142;
and Brigham and
Women’s Hospital,
Boston, MA 02115, USA.
e-mail:
[email protected]
Complete sequences of several genomes including
Saccharomyces cerevisiae, Caenorhabditis elegans and
Drosophila melanogaster as well as a draft of the
human genome have already been reported1–3. This
has led to an information explosion and we now have
more data available to us than we can manage (Box 1).
One example of how these data could be handled is
by use of automatic annotation programs such as
Ensembl, a project launched jointly by EMBL-EBI and
the Sanger Centre4. The Ensembl system is fed raw
genomic data and automatically assembles the DNA
fragments, predicts genes, and identifies single
nucleotide polymorphisms (SNPs), repeat regions and
regions homologous to other sequences in public
databases. However, manual confirmation to check for
errors requires that an estimated 300 person-years be
spent on annotating the human genome at the
sequence level5. The need to have more tools to handle
such data systematically is glaringly obvious. For
example, take the issue of the total number of genes in
the human genome. For years it was presumed to be
approximately 80 000 –100 000 genes. It was expected
that once the human genome was fully sequenced, it
would be trivial to calculate the exact number of genes.
Paradoxically, as we approach the finishing line with
respect to sequencing, several groups have reestimated the total number of genes and their latest
speculations now range from 35 000 to 120 000
(Refs 6,7). Where exactly did we go wrong? Given the
stark reality that the prediction accuracy rate of gene
prediction programs is approximately 40%, it is clear
to see where we might have erred (see results of the
Drosophila Genome Annotation Assessment Project in
Ref. 8). The mouse sequencing project, which will be
available to the public this year, is estimated to help
tremendously in annotating coding regions in the
human genome. It should be noted that prediction of
genes is still a much simpler task than predicting, for
example, a phosphorylation site or the function of a
domain or a protein itself. Bork has come up with an
interesting conclusion about most of the
bioinformatics tools in use today – that they have
difficulty exceeding a prediction accuracy of 70%
(Ref. 9). Viewed differently, this amounts to incorrect
predictions in ~30% of the time, which can be time
consuming to address.
http://tibs.trends.com 0968-0004/01/$ – see front matter © 2001 Elsevier Science Ltd. All rights reserved. PII: S0968-0004(00)01726-6
Review
TRENDS in Biochemical Sciences Vol.26 No.1 January 2001
Box 1. Major primary sequence databases
Box 2. Limitations of EST databases
• Nucleotide sequence databases of known/
predicted proteins (e.g. GenBank and EMBL)
Contain nucleotide sequences of all cloned and
well-characterized genes. Increasingly, these
databases contain cDNA sequences corresponding
to proteins that have not yet been characterized
(generally deposited by cDNA sequencing
consortia). A recent inclusion is proteins that are
‘totally hypothetical’ because they are generated by
artificially ‘stitching together’ the exons that are
predicted from the genomic DNA by genefinding
programs (this information must be used with
extreme caution). GenBank includes ‘finished’ highthroughput genomic sequences in this category.
• Short stretches of cDNAs
They might not contain coding sequences and
generally do not contain the full coding sequence.
• Protein sequences of known proteins
(e.g. SWISS-PROT)
Contain protein sequences of all cloned and
characterized proteins. In addition, they contain
translations of open reading frames of cDNAs that
are predicted to code for proteins.The level of
annotation of these depends upon the particular
database being used. Curated databases created by
human reading of the literature and subsequent
entry into a database also exist (i.e. the yeast
protein database;YPD).
• Database of expressed sequence tags (ESTs)
(e.g. dbEST)
Contain single-pass nucleotide sequence reads
from 5′ and 3′ ends of cDNA inserts.
• Genomic databases (e.g. high-throughput
genomic sequence; HTGS)
Contain nucleotide sequences from genomic DNA
obtained by a variety of strategies. HTGS contains
‘unfinished’ genomic sequence deposited by major
sequencing centers.
Expressed sequence tag (EST) databases can be
used with caution to derive useful information but still
suffer from certain drawbacks (Box 2). In addition,
genomic databases have their own limitations (Box 3).
One way to remedy this predicament is to analyze the
proteins directly. To tackle the genome, much attention
has been paid to bioinformatics-based tools and the
proteome has largely been ignored. The term
‘proteomics’ is any large-scale analysis of proteins and
has been applied to studies ranging from identification
of spots from 2D electrophoretic gels to characterization
of protein–protein interactions10,11. Proteomics has
been greatly facilitated by recent advances in biological
mass spectrometry (MS), which has become the
method of choice for protein characterization (Box 4).
The major advantage of proteomic analysis is that the
findings are more bona fide in the sense that they
reflect physical measurements of actual proteins and
http://tibs.trends.com
55
• Sequencing errors
This limits the use of ESTs in establishing the
reading frame that codes for the protein. Often we
have found that the correct protein sequence must
be obtained by ‘stitching together’ the putative
translation products from all three reading frames
of the EST.
• Untranslated regions
Because the ESTs are biased towards the 3′ end of
the gene, they might only contain the 3′ UTR,
precluding their use in searches based on protein
homology.
• Abundance
The occurrence of most ESTs correlates somewhat
with the abundance of mRNA level of the given
gene.Thus transcripts of medium-to-low abundance
might not be present in the EST databases.
• Difficulty in alignment
It can be difficult to align ESTs derived from the
same gene due to alternative splicing or gaps in the
sequence due to the fact that only the ends of
cDNAs are sequenced.This is one of the reasons
why the number of unique EST clusters might
represent fewer genes than the number of clusters.
are not an extrapolation (such as an assertion that a
sequence contains potential phosphorylation sites).
Therefore, the data derived from such studies are an
excellent and necessary complement to information
gathered from computational approaches.
Consequently, in the post-genomic era, mass
spectrometry assumes one more important role – that
of facilitating annotation of genomic as well as EST
and protein sequence databases.
These annotations are either impossible or difficult
to predict using nucleotide sequence information alone
– hence the analysis of proteins is a prerequisite. The
analysis of proteins using mass spectrometry is
generally associated with specific biological
experiments. For example, inherent in the
identification of a phosphorylation site in a peptide is
the information that this protein was phosphorylated
by treatment of a specific cell line, by a specified growth
factor for a given time. See Fig. 1 for a representation of
the various scenarios in which mass spectrometric
analysis interfaces with biological experiments.
Prediction of exons from genomic databases
Until now, identification and characterization of
proteins was difficult for a variety of reasons. First,
56
Review
TRENDS in Biochemical Sciences Vol.26 No.1 January 2001
Box 3. Limitations of genomic databases
• Interrupted coding regions
Because the coding exons in most eukaryotes are
interrupted by introns, the prediction of coding
regions is not straightforward. Human genes
contain the most introns and the largest introns of
most mammals studied, making precise gene
predictions especially difficult42.
• No reading frame
Because the intron–exon junctions do not have any
relation to codons, there is no continuous open
reading frame as is seen in cDNAs.This is because
splitting of exons by introns does not preserve
reading frames.
• Non-directional
The strand of genomic DNA that codes for a gene is
not fixed. Genes can be transcribed from both
strands in opposite directions.
• Repeat elements
The presence of repeated elements can create
serious problems in alignment and analysis.
Repetitive sequences are excluded when an
‘effective size’ of the human genome is calculated.
Edman sequencing, which was commonly used for
identification, required large amounts of purified protein
[Science’s Signal Transduction Knowledge Environment
(STKE): http://www.stke.org/cgi/content/full/
OC_sigtrans;2000/37/pl1]. Second, even when such
amounts were available, the relevant proteins could
often not be identified because of a blocked N terminus.
Mass spectrometry does not suffer from these
drawbacks – it is a sensitive method for identifying
proteins (femtomole levels of proteins from gels)12, can
analyze mixtures (therefore, the protein does not have to
be pure) and is not affected by a blocked N terminus.
Spots excised from 2D gels, bands excised from 1D
gels or a complex mixture of proteins in solution, can be
digested by proteases such as trypsin and the peptides
sequenced by tandem mass spectrometry. These
peptide sequences can then be used to search the
genomic databases13 to identify coding regions.
Although ESTs have been used successfully to identify
novel proteins13,14, it was found that the use of EST
data alone does not improve overall gene prediction8,15.
Some of the factors that contributed to this were
problems with alignment, the occurrence of paralogs
and the low quality of EST sequences. In addition,
ESTs led to an increased prediction of smaller genes,
thereby reducing specificity15. However, when any data
based on proteins were included, a better prediction
accuracy rate was obtained15. This observation makes
mass spectrometry-derived peptide sequence data
even more valuable as a complement to bioinformatics
tools for gene prediction. Because different gene
http://tibs.trends.com
prediction programs predict different exon–intron
structures, such peptide sequences will help establish
which ones are ‘real’. Of course, this method cannot
normally be used to predict every exon, as it is difficult
to achieve 100% coverage for every protein by mass
spectrometry at the low levels usually available in
biological experiments. The major reasons for this are:
(i) some of the peptides derived from a given protein
might not ionize well in the mass spectrometer and
thus result in very small peaks; (ii) the peptides might
be too large or too small for analysis; or (iii) the peptides
might give rise to complex fragmentation spectra that
are hard to interpret. Nevertheless, by using two
different enzymes, close to 100% of the protein primary
sequence can often be verified if sufficient material is
present. Besides prediction of coding exons, it is
possible to find alternatively spliced variants at the
protein level based on peptide sequences that ‘span’
different exons in the genome. Several instances
showing how mass spectrometry-derived data can be
used to annotate genomic sequence are shown in Fig. 2.
For genomic annotation we propose to use a special
format of mass spectrometry in which crude protein
mixtures are first reduced to peptides and the
peptides then fed into the mass spectrometer after
1- or 2D chromatography16–19. Thousands of peptides
can be sequenced in this approach in a fully
automated fashion and with high sensitivity. We then
suggest using database search software that can
directly find the sequenced peptide in genomic
databases, without the need for translation into the
six possible reading frames (B. Küster et al.,
unpublished). This analysis can be done on relatively
large amounts of starting material, enabling the
analysis of low-copy-number proteins.
Low-molecular-weight proteins
A major subset of proteins that are unlikely to be
identified from the genomic sequence solely by gene
prediction are genes that encode proteins containing
less than 100 amino acids20. This is because geneprediction algorithms deliberately avoid prediction of
such short genes by having a minimum gene size of
~100 amino acids. If such a limit is not imposed, then
the number of false positives (i.e. number of potential
coding regions that are actually noncoding) will
increase sharply. As a result, these proteins will have to
be identified by alternative methods. A major fraction of
these small polypeptides are likely to be cytokines and
growth factors that are secreted by cells – such proteins
can be enriched from supernatants of cell lines or
obtained from body fluids such as serum, urine or
cerebrospinal fluid. Thus, direct sequencing of small
secreted proteins by mass spectrometry will allow the
discovery of small open reading frames (ORFs) in the
genome. Another method is to use cDNA-based
techniques to specifically isolate secreted molecules,
such as the signal-sequence trap method21 or other
generic methods that isolate less abundant cDNAs such
as serial analysis of gene expression (SAGE)22.
Review
TRENDS in Biochemical Sciences Vol.26 No.1 January 2001
57
Box 4.Analysis of proteins by mass spectrometry
• Peptide sequencing by tandem mass spectrometry
Proteins are generally digested by trypsin to
generate peptides that are then sequenced by mass
spectrometry.The method used to sequence peptides
is termed tandem mass spectrometry, in which the
peptide to be sequenced is first selected from the
entire peptide mixture and then fragmented by
collision with an inert gas to obtain a partial or
complete sequence.
• Database searching using mass spectrometryderived data
Mass spectrometry of tryptic digests by matrixassisted laser desorption/ionization time of flight
(MALDI-TOF) provides a highly accurate list of
peptide masses.This list of peptide masses is used to
search a theoretical tryptic digest of all the proteins in
the databases to identify the protein.
The data obtained by tandem mass spectrometry
(MS/MS) generally comprise the partial sequence of
a peptide. This partial sequence is referred to as a
‘peptide sequence tag’. It generally contains enough
information to identify the protein from databases
as large as the human genomic database. If a
peptide match for a given peptide sequence tag is
not found, it might mean that the peptide belongs to
an unknown protein or that the peptide is modified
(see below).
• Quantitation of protein levels
Proteins or peptides are difficult to quantify in any
mass spectrometer. However, if two samples are
being compared, relative quantitation can be
obtained by modifying the peptides in both samples
(such as by adding a ‘tag’ of known molecular
mass). One of the samples is modified with a
naturally occurring isotope and the other with a
stable isotope analog, such as deuterium-labeled
tag, followed by mixing of the samples before
analysis. In this case, the corresponding peptides
derived from the two samples can be identified
easily because they have a constant difference in
mass, which is the mass of the stable isotope
Initiation codon
The translational start sites in mRNAs are not well
defined. When the mRNA is translated, the ribosomes
generally choose the first set of nucleotides encoding a
methionine (AUG). The sequences around these AUG
sequences have been proposed to affect the efficiency
of translation from a particular potential start site –
these sequences have been formulated into a
consensus sequence for translation initiation
(Kozak’s consensus)23. However, most AUGs (ATGs in
cDNA) are not in complete, or sometimes any,
agreement with the consensus Kozak sequence. This
means that determination of the initiator ATG that
http://tibs.trends.com
substitution. The difference in peak intensities then
correlates directly with the abundance of the
protein levels.
• Identifying phosphopeptides and phosphorylation
sites
The peptides containing phosphates can be identified
either by observing a change in molecular weight
upon treatment with a phosphatase, or in a mass
spectrometer, by looking for a reporter ion that is
characteristic of phosphate groups (parent or
precursor ion scanning).
Identification of phosphorylation sites can be
done in several ways by mass spectrometry. In a
triple-quadrupole or quadrupole time-of-flight mass
spectrometer, the location of phosphoserines or
phosphothreonines can be identified in the positive
mode if the mass difference between two successive
fragments corresponds to phosphorylated (addition
of HPO3 or 80 Da) serine residues (87 + 80 Da) or to
phosphorylated threonine residues (101 + 80 Da).
These residues can also be identified by a loss of
98 Da resulting from a β-elimination reaction.
Phosphotyrosines are generally more stable, do not
undergo the β-elimination reaction and are thus
found only as a difference of 163 + 80 Da between
two successive fragments. The masses of serine,
threonine and tyrosine residues are 87, 101 and
163 Da, respectively. Fragmentation to obtain
sequence can also be done in an ion-trap mass
spectrometer, although the fragmentation pattern is
slightly different from that described above.
• Other post-translational modifications
All modifications cause a shift in mass and therefore,
in principle, are detectable by mass spectrometry. For
example, the N terminus of a protein is indicated by
an acetylated peptide – in this case, the N-terminal
amino acid contained within the peptide is 42 Da
greater than expected. Similarly, myristoylation
causes an increase by 210 Da, and GalNac or GlcNAc
sugars increase the mass of the corresponding
peptide by 203 Da.
will be translated for a given gene might have to be
done experimentally or other methods might have to
be used. This could involve direct sequencing of the
N terminus of the protein, prediction of initiator
methionine by homology to known proteins, or
guessed by the fact that there are multiple stop
codons or no other ATGs upstream of a suspected
initiator ATG. This implies that either a longer
protein would be predicted, or conversely, a shorter
version of a protein would be predicted when, in
reality, an upstream ATG is being used.
Using mass spectrometry, it is possible to obtain
peptides that are acetylated at the N terminus – this
Review
58
TRENDS in Biochemical Sciences Vol.26 No.1 January 2001
Growth
factor
–
In-solution digest of
crude protein mixtures
followed by liquid
chromatography (LC)
separation of peptides.
+
Excise
bands
Excise
spots
2D gel
1D gel
Immunoprecipitation
with antibodies
Quantitation of protein levels
Condition A
A
B
Excise spots specifically detected by
modification-specific antibody
No treatment
+ TGF-β
Mass spectrometric
analysis
Database
searching
Condition B
Excise differentially
expressed spots
Excise bands that specifically
associate with ‘bait’ protein
Annotation
Sequence databases
Western blotting with
modification specific
(e.g. antiphosphoserine
antibody)
Fig. 1. Overview of various types of experiments that can be coupled to mass spectrometric analysis.
The data that can be used to annotate databases include information on the ‘upfront’ experiment
(such as growth factor used or the antibody used in the immunoprecipitation step), as well as the data
obtained by mass spectrometry. Abbreviation:TGF-β, transforming growth factor β.
establishes it as an N-terminal peptide24. Another
modification that can distinguish an N-terminal peptide
from other internal peptides is a fairly frequent cleavage
by aminopeptidases of the initiator methionine25. In the
post-genomic era, one obvious advantage of recognizing
the initiator methionine is that the translational start of
a gene can then be deduced from the corresponding
entry in the genome database.
Signal peptides
Signal peptides are a hydrophobic stretch of amino
acids that are found at the N termini of most
membrane-bound receptors as well as soluble
ligands. Although computer programs can predict
roughly where the cleavage of this signal sequence
can occur26–28, often the exact site of cleavage must
be determined experimentally. Most experiments in
biological mass spectrometry involve cleavage of the
protein by proteases such as trypsin before analysis.
Owing to the cleavage specificity of trypsin, this
results in peptides that contain an arginine or a
lysine residue at their C termini. This implies that
when the sequence of a peptide is determined, the
http://tibs.trends.com
Precipitation
with control and
‘bait’ proteins
Ti BS
residue in the protein that immediately precedes
the peptide must be an arginine or a lysine. An
instance where this does not occur is when the
peptide is derived from the N terminus of the
protein. In the case of secreted proteins, the signal
peptide is cleaved and the 'mature' protein is found
in the extracellular environment. Therefore, when a
secreted protein is being analyzed and a peptide
that corresponds to the N terminus is found, the site
of cleavage is automatically assigned. This
information is especially critical when recombinant
growth factors are manufactured for human use in
certain heterologous expression systems, because
inclusion of amino acids that are actually cleaved in
the mature protein might elicit an immune
response.
Prediction of correct reading frame and stop codon
If a very short EST sequence is present in databases,
it is not immediately clear whether it contains any
open reading frame (coding region) and, if it does,
what that reading frame is. If a peptide sequence
obtained by mass spectrometry is found to correspond
to an EST sequence, it might be used to assign the
correct reading frame to that clone. Further, as
explained above, a tryptic peptide sequence that does
not end in an arginine or lysine suggests that it is
Review
1
Real protein
TRENDS in Biochemical Sciences Vol.26 No.1 January 2001
23 4
5
6
7
8
5′ UTR
3′ UTR
1
Genomic DNA
2
3
4
5
6
7
8
P
Gene prediction
program 1
1 2
4
5
7
Key:
1
Gene prediction
program 2
1
256
7
23 4
5
8
P
Gene prediction
program 3
6
7
8
Exon 1
Intron
Peptide sequence
Acetylated N-terminal
peptide
Phosphorylated peptide
Glycosylated peptide
Ti BS
Fig. 2. Outputs of gene prediction programs compared with a real protein.The figure shows how
mass spectrometry-derived data can be used for annotation of genomic databases. Any tryptic
peptide whose sequence matches a genomic region localizes a coding exon. In addition, some
peptides might be sequenced that ‘span’ two exons in the genome, thereby predicting two exons
adjacent to each other (exons 6 and 7). A peptide with an acetylated N terminus or corresponding to a
cleaved methionine indicates the N terminus of the protein. Similarly, a peptide that does not contain
an arginine or lysine at its C terminus might indicate the C terminus of the protein.
derived from the C terminus of the protein under
study. In a database search, this would align with a
region just upstream of a stop codon, thus
establishing the reading frame as well as the
translational stop site within that clone.
Post-translational modifications
The majority of proteins in cells undergo posttranslational modifications. Extracellular and
transmembrane proteins are frequently glycosylated,
whereas intracellular signaling molecules can be
basally or inducibly phosphorylated. It is thought
that at any one time approximately one-third of all
proteins in eukaryotic cells are phosphorylated29.
Although it is more difficult to detect and analyze
post-translational modification-containing peptides,
post-translational modifications are frequently
discovered during large-scale proteomics experiments
(reviewed in Ref. 30). At the peptide level, a peptide
can be phosphorylated, acetylated or glycosylated
(Box 4). However, it is more difficult to localize the
exact residue that is modified, and in many instances
it might not be possible at all. Nevertheless,
knowledge of the relevant peptide carrying the
modification might be adequate for the biologist even
if the exact residue of modification is not located.
Because mass spectrometry provides information
about the exact mass of the post-translational
modification in addition to its sequence, such
information could be annotated as such in the
databases for further use by the research community.
Expression data
When a report on a novel gene is published, there is
usually some information on tissue expression by
http://tibs.trends.com
59
northern blotting to detect mRNA. Less commonly,
the cell lines that express that protein are described.
For investigators who are interested in studying that
protein, it is often difficult to find a resource listing
cell lines that express the protein. Because the
samples for many mass spectrometry-based
identification experiments are derived from a cell
line, once the identity of the protein is determined,
the entries for these proteins could be annotated in
sequence databases giving the cell line of origin as the
source of protein. Interesting avenues of investigation
might be initiated in light of such expression
information. For example, if a cytokine receptor
known to be specifically expressed in hematopoietic
cells is found in fibroblasts, its role in the newly
discovered environment can then be tested. Newer
technologies can quantify protein expression between
two states using stable isotope incorporation into one
of the states. Peak ratios between peptides with and
without the isotope tag accurately quantify relative
levels. In a special twist on this technique, using
isotope-coded affinity tags (ICATs)31, hundreds of
proteins can be quantified in a single experiment
using only the cysteine-containing peptides. This
quantitation is an improvement compared with
mRNA data because mRNA levels have been found to
correlate poorly with protein expression and might
not be predictive of protein levels32. Analogous to the
data derived from mRNA expression patterns using
microarrays that are made publicly available, protein
expression data could be linked to the protein
sequence itself (see below). Finally, because a
minority of pseudogenes can be transcribed33, if a
protein is detected for a gene, then it can be presumed
that it is not a pseudogene.
Protein localization
Proteins, especially novel ones, are often assigned
subcellular localization based on homology to other
proteins or by sophisticated computer programs34.
However, these predictions are generally not
reliable and therefore the localization of a given
protein must be confirmed by experimental
methods. An interesting example is the S100
family of calcium-binding proteins that are
involved in regulation of a plethora of intracellular
activities such as protein phosphorylation, enzyme
activities, cell proliferation and differentiation35.
Based on their sequence, they are not predicted to
be secreted molecules. However, some members of
this family were discovered as secreted proteins,
either as part of proteomic approaches to study
secreted molecules or as components of saliva36,37.
Similarly, proteins that are predicted to be nuclear
based on the presence of nuclear localization
signals might turn out to be cytosolic and vice
versa. Mass spectrometric identification of specific
fractions can be used to assist in annotating
proteins localized to organelles such as
mitochondria38 or the chloroplast39.
60
Review
TRENDS in Biochemical Sciences Vol.26 No.1 January 2001
Annotation with protein data – but how?
Acknowledgements
Work at the Protein
Interaction Laboratory
(PIL) is supported by a
generous grant from the
Danish National Research
Foundation to the Center
for Experimental
BioInformatics (CEBI) at
the University of
Southern Denmark. A.P. is
supported by a Howard
Temin Award from the
National Cancer Institute
(KO1 CA75447). We
thank members of
the PIL, especially
A. Podtelejnikov,
H. Steen, J. Andersen,
J. Rappsilber, S-E. Ong
and B. Küster for helpful
discussions. We thank
G. Subramanian, K.
Kristiansen and M.Tewari
for critical reading of the
manuscript.
The submission of sequences of newly identified genes
to databases is now routine. Standard web-based forms
are to be filled out with the fields clearly defined by the
agency that maintains the corresponding database.
However, protein data cannot be submitted by users of
GenBank, which is surprising given the amount of
information being generated directly from proteins.
This means that protein entries that are in the
databases are largely ‘inferred’ from the cDNA or
genomic sequences by conceptual translation or
prediction. GenBank currently only allows the
submitter of a given sequence to modify the data
submitted. There is hardly any provision for
third-party correction, let alone third-party
annotation. Therefore, the data acquired using mass
spectrometry (or by other proteomic methods) cannot
be added to the sequence databases under the current
circumstances. Data derived from analysis of proteins
are dependent upon several factors. For example,
although identification of a secreted protein from a cell
line could be added as a simple annotation of the
protein, modifications such as phosphorylation would
have to specify the exact experimental conditions (such
as growth factor treatment being used). Besides, a vast
amount of data can be accumulated on every protein.
We favor an open system of annotation whereby the
annotation is performed directly by the experimenter,
as in a distributed annotation system being developed
by Lincoln Stein (http://stein.cshl.org/das/). In this
system, a single server is designated as a reference
server that contains primary information about the
DNA and protein sequences as well as authorship
information. Several websites can then serve as
annotation servers that provide additional information
on the retrieved sequence. In this scenario, no attempt
is made to resolve contradictions that exist between
various annotation servers. The uniqueness of this
References
1 Adams, M.D. et al. (2000) The genome sequence
of Drosophila melanogaster. Science 287,
2185–2195
2 The C. elegans Sequencing Consortium (1998)
Genome sequence of the nematode C. elegans: a
platform for investigating biology. Science 282,
2012–2018
3 Goffeau, A. et al. (1996) Life with 6000 genes.
Science 274, 546–567
4 Hubbard, T. and Birney, E. (2000) Open
annotation offers a democratic solution to genome
sequencing. Nature 403, 825
5 Claverie, J.M. (2000) Do we need a huge new
centre to annotate the human genome? Nature
403, 12
6 Liang, F. et al. (2000) Gene index analysis of the
human genome estimates approximately 120,000
genes. Nat. Genet. 25, 239–240
7 Ewing, B. and Green, P. (2000) Analysis of
expressed sequence tags indicates 35,000 human
genes. Nat. Genet. 25, 232–234
8 Reese, M.G. et al. (2000) Genome annotation
assessment in Drosophila melanogaster. Genome
Res. 10, 483–501
http://tibs.trends.com
system lies in the fact that the end-user can decide for
him/herself which data to use. This would allow for
true integration of data, unlike pointers to a weblink
that contains related information on a particular
sequence. One of the ways in which this could be
accomplished would be to use systems similar to the
software that is used to share music files over the
internet40,41. To illustrate this example, the popular
Napster software has a central database and server
that allows the user base to link to information that
still resides on the hard disks of a large user
community. In a more decentralized model, the
Gnutella software dispenses with the central server
and connects the user community directly with each
other. Applying either of these models to biology, the
basic structure would be given by the genome and the
gene names but massive amounts of information could
be available dispersed over the internet. Obviously,
highly sophisticated software would be needed to
retrieve and integrate the disparate pieces of
information.
Conclusions and outlook
It is becoming clear that protein-derived data is
essential for annotation of nucleic acid and protein
sequence databases. Mass spectrometry as the
proteomic tool of choice is poised to contribute greatly
to the wealth of these databases. Besides the
advantage of high-throughput analysis, mass
spectrometry provides experimental data and is thus
a perfect complement to other annotation efforts. An
international initiative and collaboration is,
however, critical for the long-term viability of any
open annotation system. We envision that, like most
other web-based initiatives (electronic publication
being one of them), such a system of annotation will
become popular and indispensable in the not too
distant future.
9 Bork, P. (2000) Powers and pitfalls in sequence
analysis: the 70% hurdle. Genome Res. 10,
398–400
10 Wilkins, M.R. et al. (1996) From proteins to
proteomes: large scale protein identification by
two-dimensional electrophoresis and amino acid
analysis. Biotechnology 14, 61–65
11 Pandey, A. and Mann, M. (2000) Proteomics
to study genes and genomes. Nature 405,
837–846
12 Wilm, M. et al. (1996) Femtomole sequencing of
proteins from polyacrylamide gels by nanoelectrospray mass spectrometry. Nature 379,
466–469
13 Pandey, A. and Lewitter, F. (1999) Nucleotide
sequence databases: a gold mine for biologists.
Trends Biochem. Sci. 24, 276–280
14 Mann, M. (1996) A shortcut to interesting human
genes: peptide sequence tags, expressed-sequence
tags and computers. Trends Biochem. Sci. 21,
494–495
15 Krogh, A. (2000) Using database matches
with for HMMGene for automated gene
detection in Drosophila. Genome Res. 10,
523–528
16 Yates, J.R., 3rd et al. (1997) Direct analysis of
protein mixtures by tandem mass spectrometry.
J. Protein Chem. 16, 495–497
17 Washburn, M.P. and Yates, J.R. (2000) New
methods of proteome analysis: multidimensional
chromatography and mass spectrometry. In
Proteomics: A Trends Guide (Blackstock, W. and
Mann, M., eds.), pp. 27–30, Elsevier
18 Gygi, S.P. et al. (2000) Evaluation of twodimensional gel electrophoresis-based proteome
analysis technology. Proc. Natl. Acad. Sci. U. S. A.
97, 9390–9395
19 Martin, S.E. et al. (2000) Subfemtomole MS and
MS/MS peptide sequence analysis using nanoHPLC micro-ESI fourier transform ion cyclotron
resonance mass spectrometry. Anal. Chem. 72,
4266–4274
20 Rudd, K.E et al. (1998) Low molecular weight
proteins: a challenge for post-genomic research.
Electrophoresis 19, 536–544
21 Tashiro, K. et al. (1993) Signal sequence trap: a
cloning strategy for secreted proteins and type I
membrane proteins. Science 261, 600–603
22 Velculescu, V.E. et al. (1997) Characterization of
the yeast transcriptome. Cell 88, 243–251
Review
23 Kozak, M. (1992) Regulation of translation in
eukaryotic systems. Annu. Rev. Cell Biol. 8, 197–225
24 Wold, F. (1981) In vivo chemical modification of
proteins (post-translational modification). Annu.
Rev. Biochem. 50, 783–814
25 Ben-Bassat, A. et al. (1987) Processing of the
initiation methionine from proteins: properties of
the Escherichia coli methionine aminopeptidase
and its gene structure. J. Bacteriol.169, 751–757
26 von Heijne, G. (1983) Patterns of amino acids near
signal-sequence cleavage sites. Eur. J. Biochem.
133, 17–21
27 Claros, M.G. et al. (1997) Prediction of N-terminal
protein sorting signals. Curr. Opin. Struct. Biol. 7,
394–398
28 Nielsen, H. et al. (1997) Identification of prokaryotic
and eukaryotic signal peptides and prediction of
their cleavage sites. Protein Eng. 10, 1–6
29 Zolnierowicz, S. and Bollen, M. (2000) Protein
phosphorylation and protein phosphatases. De
Panne, Belgium, 19–24 September, 1999.
EMBO J. 19, 483–488
TRENDS in Biochemical Sciences Vol.26 No.1 January 2001
30 Jensen, O.N. (2000) Modification-specific
proteomics: systematic strategies for analysing
post-translationally modified proteins. In
Proteomics: A Trends Guide (Blackstock, W. and
Mann, M., eds), pp. 36–42, Elsevier
31 Gygi, S.P. et al. (1999) Quantitative analysis of
complex protein mixtures using isotope-coded
affinity tags. Nat. Biotechnol. 17, 994–999
32 Gygi, S.P. et al. (1999) Correlation between
protein and mRNA abundance in yeast. Mol. Cell.
Biol. 19, 1720–1730
33 Mighell, A.J. et al. (2000) Vertebrate
pseudogenes. FEBS Lett. 468, 109–114
34 Nakai, K. (2000) Protein sorting signals and
prediction of subcellular localization. Adv. Protein
Chem. 54, 277–344
35 Donato, R. (1999) Functional roles of S100
proteins, calcium-binding proteins of the EF-hand
type. Biochim. Biophys. Acta 1450, 191–231
36 Madsen, P. et al. (1991) Molecular cloning,
occurrence, and expression of a novel partially
secreted protein ‘psoriasin’ that is highly
37
38
39
40
41
42
61
up-regulated in psoriatic skin. J. Invest.
Dermatol. 97, 701–712
Kojima, T. et al. (2000) Human gingival crevicular
fluid contains MRP8 (S100A8) and MRP14
(S100A9), two calcium-binding proteins of the
S100 family. J. Dent. Res. 79, 740–747
Patterson, S.D. et al. (2000) Mass spectrometric
identification of proteins released from
mitochondria undergoing permeability transition.
Cell Death Differ. 7, 137–144
Peltier, J.B. et al. (2000) Proteomics of the
chloroplast: systematic identification and
targeting analysis of lumenal and peripheral
thylakoid proteins. Plant Cell 12, 319–341
Butler, D. (2000) Music software to come to
genome aid? Nature 404, 694
Butler, D. and Smaglik, P. (2000) Draft data leave
geneticists with a mountain still to climb. Nature
405, 984–985
Deutsch, M. and Long, M. (1999) Intron–exon
structures of eukaryotic model organisms. Nucleic
Acids Res. 27, 3219–3228
Life-or-death decisions by the Bcl-2
protein family
Jerry M.Adams and Suzanne Cory
In response to intracellular damage and certain physiological cues, cells enter
the suicide program termed apoptosis, executed by proteases called caspases.
Commitment to apoptosis is typically governed by opposing factions of the
Bcl-2 family of cytoplasmic proteins. Initiation of the proteolytic cascade
requires assembly of certain caspase precursors on a scaffold protein, and the
Bcl-2 family determines whether this complex can form. Its pro-survival
members can act by sequestering the scaffold protein and/or by preventing the
release of apoptogenic molecules from organelles such as mitochondria.
Pro-apoptotic family members act as sentinels for cellular damage: cytotoxic
signals induce their translocation to the organelles where they bind to their
pro-survival relatives, promote organelle damage and trigger apoptosis.
Jerry M.Adams*
Suzanne Cory#
The Walter and Eliza Hall
Institute of Medical
Research, P O Royal
Melbourne Hospital,
Melbourne 3050,
Australia.
*e-mail:
[email protected]
#e-mail:
[email protected]
Apoptosis, the stereotypic program of cellular suicide,
removes unwanted cells throughout life, and its
disrupted regulation is implicated in disorders
ranging from cancer and autoimmune diseases to
degenerative syndromes. The Bcl-2 family of
cytoplasmic proteins plays a central regulatory role.
Its interacting pro- and anti-apoptotic members
integrate diverse upstream survival and distress
signals to determine whether the cellular death
warrant is issued. Many of the key players have been
identified, and the spotlight is now on the stage where
their ‘dance of death’ commences: the surface of
organelles such as the mitochondria where Bcl-2
family members either reside or congregate during
apoptosis. The hotly debated and still unresolved
issue of how their control is exerted is the focus of this
and other recent reviews1–7.
The central pathway to death
The Bcl-2 family regulates an ancient path to cell
death (Fig. 1), found in organisms as diverse as
mammals, nematodes and fruitflies. The route
culminates in the scission of critical target proteins
by proteases of the caspase group, but only after
traversing critical checkpoints. To preclude
unscheduled cell suicide, each caspase is
synthesized as a minimally active precursor and
generation of the active enzyme requires its
processing, at sites of caspase cleavage. The
effector caspases are processed by ‘upstream’
caspases, but the apical caspase that sets up the
execution must process itself. The autocatalysis
requires multimerization, aided by an adaptor or
scaffold protein such as the nematode protein
CED-4, which primes autoactivation of caspase
CED-3, or the mammalian CED-4 homolog Apaf-1,
which primes autoactivation of procaspase-9
(Refs 4,6,7). The Bcl-2 family determines whether
or not the multimeric scaffold/procaspase complex,
often termed an ‘apoptosome’, can assemble. Antiapoptotic members such as Bcl-2 and its nematode
counterpart CED-9 prevent apoptosome formation,
but their life-saving activity can be foiled by other
relatives such as Bim or EGL-1 (see below).
Caspases can also be controlled downstream of
Bcl-2 (Fig. 1). The IAP (inhibitor of apoptosis)
proteins appear to directly block caspase activity
http://tibs.trends.com 0968-0004/01/$ – see front matter © 2001 Elsevier Science Ltd. All rights reserved. PII: S0968-0004(00)01740-0