Download selection of antigens for antibody-based proteomics

Document related concepts

Phosphorylation wikipedia , lookup

Signal transduction wikipedia , lookup

G protein–coupled receptor wikipedia , lookup

Magnesium transporter wikipedia , lookup

Protein phosphorylation wikipedia , lookup

List of types of proteins wikipedia , lookup

Cyclol wikipedia , lookup

Protein wikipedia , lookup

Protein (nutrient) wikipedia , lookup

Intrinsically disordered proteins wikipedia , lookup

Protein moonlighting wikipedia , lookup

Homology modeling wikipedia , lookup

Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup

Protein–protein interaction wikipedia , lookup

Protein structure prediction wikipedia , lookup

Western blot wikipedia , lookup

Proteolysis wikipedia , lookup

Transcript
SELECTION OF ANTIGENS FOR
ANTIBODY-BASED PROTEOMICS
Lisa Berglund
Royal Institute of Technology
School of Biotechnology
Stockholm 2008
© Lisa Berglund, 2008
ISBN 978-91-7178-930-3
TRITA-BIO-Report 2008:5
ISSN 1654-2312
School of Biotechnology
Royal Institute of Technology
AlbaNova University Center
SE-106 91 Stockholm, Sweden
Printed at Universitetsservice US-AB
Box 700 14
Stockholm, Sweden
Ju mer jag vet, desto mer förstår jag att jag inte vet.
ABSTRACT
The human genome is predicted to contain ~20,500 protein-coding genes. The
encoded proteins are the key players in the body, but the functions and
localizations of most proteins are still unknown. Antibody-based proteomics has
great potential for exploration of the protein complement of the human genome,
but there are antibodies only to a very limited set of proteins. The Human
Proteome Resource (HPR) project was launched in August 2003, with the aim to
generate high-quality specific antibodies towards the human proteome, and to use
these antibodies for large-scale protein profiling in human tissues and cells.
The goal of the work presented in this thesis was to evaluate if antigens can be
selected, in a high-throughput manner, to enable generation of specific antibodies
towards one protein from every human gene. A computationally intensive analysis
of potential epitopes in the human proteome was performed and showed that it
should be possible to find unique epitopes for most human proteins. The result
from this analysis was implemented in a new web-based visualization tool for
antigen selection. Predicted protein features important for antigen selection, such
as transmembrane regions and signal peptides, are also displayed in the tool. The
antigens used in HPR are named protein epitope signature tags (PrESTs). A
genome-wide analysis combining different protein features revealed that it should
be possible to select unique, 50 amino acids long PrESTs for ~80% of the human
protein-coding genes.
The PrESTs are transferred from the computer to the laboratory by design of
PrEST-specific PCR primers. A study of the success rate in PCR cloning of the
selected fragments demonstrated the importance of controlled GC-content in the
primers for specific amplification. The PrEST protein is produced in bacteria and
used for immunization and subsequent affinity purification of the resulting sera to
generate mono-specific antibodies. The antibodies are tested for specificity and
approved antibodies are used for tissue profiling in normal and cancer tissues. A
large-scale analysis of the success rates for different PrESTs in the experimental
pipeline of the HPR project showed that the total success rate from PrEST
selection to an approved antibody is 31%, and that this rate is dependent on
PrEST length. A second PrEST on a target protein is somewhat less likely to
succeed in the HPR pipeline if the first PrEST is unsuccessful, but the analysis
shows that it is valuable to select several PrESTs for each protein, to enable
generation of at least two antibodies, which can be used to validate each other.
LIST OF PUBLICATIONS
This thesis is based on the following publications, which in the text will be referred
to by their roman numerals.
I.
Lisa Berglund*, Jorge Andrade*, Jacob Odeberg, and Mathias Uhlén
(2008). The epitope space of the human proteome. Protein Science 17:606613.†
II. Lisa Berglund*, Erik Björling*, Kalle Jonasson, Johan Rockberg, Linn
Fagerberg, Cristina Al-Khalili Szigyarto, Åsa Sivertsson, and Mathias Uhlén
(2008). A whole-genome bioinformatics approach to select antigens for
systematic antibody generation. Submitted.
III. Lisa Berglund, Anja Persson, and Mathias Uhlén (2008). Primer design for
high-throughput PCR cloning. Submitted.
IV. Lisa Berglund, Erik Björling, Marcus Gry, Anna Asplund, Cristina AlKhalili Szigyarto, Anja Persson, Jenny Ottosson, Henrik Wernérus, Peter
Nilsson, Åsa Sivertsson, Kenneth Wester, Caroline Kampf, Sophia Hober,
Fredrik Pontén, and Mathias Uhlén (2008). Generation of validated
antibodies towards the human proteome. Submitted.
RELATED PUBLICATIONS
Jorge Andrade*, Lisa Berglund*, Mathias Uhlén, and Jacob Odeberg (2006).
Using grid technology for computationally intensive applied bioinformatics
analyses. In Silico Biology 6:495-504.
Mathias Uhlén, Erik Björling, Charlotta Agaton, Cristina Al-Khalili Szigyarto,
Bahram Amini, Elisabet Andersen, Ann-Catrin Andersson, Pia Angelidou, Anna
Asplund, Caroline Asplund, Lisa Berglund et al. (2005). A human protein atlas for
normal and cancer tissues based on antibody proteomics. Mol Cell Proteomics
4:1920-1932.
* These authors contributed equally to this work.
† Reproduced by courtesy of Protein Science.
TABLE OF CONTENTS
INTRODUCTION ........................................................................................................13
FROM NUCLEIN TO THE SEQUENCE OF THE HUMAN GENOME .......................................13
DNA - RNA - PROTEIN ...............................................................................................13
GENOMICS AND PROTEOMICS ......................................................................................14
MAPPING THE BUILDING BLOCKS OF LIFE ....................................................................15
BIOINFORMATICS ........................................................................................................25
PRESENT INVESTIGATION ....................................................................................33
OBJECTIVES ................................................................................................................35
BACKGROUND ............................................................................................................36
THE EPITOPE SPACE OF THE HUMAN PROTEOME (PAPER I) ..........................................37
A WHOLE-GENOME BIOINFORMATICS APPROACH TO SELECT ANTIGENS FOR
SYSTEMATIC ANTIBODY GENERATION (PAPER II)........................................................39
PRIMER DESIGN FOR HIGH-THROUGHPUT PCR CLONING (PAPER III) ..........................42
GENERATION OF VALIDATED ANTIBODIES TOWARDS THE HUMAN PROTEOME
(PAPER IV) .................................................................................................................43
CONCLUDING REMARKS AND FUTURE PERSPECTIVES .............................47
ABBREVIATIONS ......................................................................................................51
ACKNOWLEDGMENTS............................................................................................53
REFERENCES .............................................................................................................55
INTRODUCTION
From nuclein to the sequence of the human
genome
In 1869, the Swiss physician and chemist Friedrich Miescher isolated a new
molecule from the nuclei of white blood cells [Dahm 2005]. What he did not know
was that the molecule he called nuclein, later renamed deoxyribonucleic acid
(DNA), was alone the hereditary material. Miescher found that the molecule
contained hydrogen, carbon, oxygen, nitrogen and phosphorus [Miescher 1871].
The actual components of DNA, including guanine, cytosine, adenine and
thymine, were originally described by Phoebus Levene in 1929 [Dahm 2005]. The
first evidence proposing DNA as the genetic material was published 75 years after
the discovery of nuclein, when Avery and his team [Avery et al. 1944] showed that
DNA transforms harmless bacteria into deadly bacteria. In 1977, methods were
published on how to derive the order and identity of the nucleotides in a DNA
molecule [Maxam and Gilbert 1977; Sanger et al. 1977], resulting in a Nobel Prize
in 1980. Rapid development of technologies for in vitro molecular biology made it
possible to start unraveling the sequences of whole genomes, and in 2001, the first
draft of the human genome sequence was published [Lander et al. 2001; Venter et
al. 2001].
DNA - RNA - protein
In 1953, the structure of the DNA molecule was solved by Francis Crick, James
Watson and Maurice Wilkins [Watson and Crick 1953], a discovery that gained a
lot of attention and for which they were awarded the Nobel Prize in 1962. By the
discovery of base pairing, a possible copying mechanism for genetic material could
be suggested. Francis Crick also published two other very important general
principles for molecular biology – the sequence hypothesis [Crick 1958; Crick et al.
1961] and the central dogma [Crick 1958; Crick 1970]. In the sequence hypothesis,
Crick suggested that the DNA sequence is a code for the amino acid sequence of a
particular protein [Crick 1958] and that this code is based on non-overlapping
nucleotide triplets [Crick et al. 1961]. The exact nucleotide triplets coding for each
amino acid were deciphered by Marshall Nirenberg [Nirenberg 1963], who in 1968
was awarded the Nobel Prize in Physiology or Medicine for his discovery. With the
central dogma, Crick described the flow of sequence information from DNA, to
ribonucleic acid (RNA), and finally to protein. The fundamental of the central
dogma is that once sequence information has passed into the protein, it can not
“get out again” [Crick 1958].
13
Proteins were actually discovered long before DNA - correspondence on protein
molecules can be found already from the eighteenth century - but the term protein
was first used by the Swedish scientist Jöns Jacob Berzelius in 1838 to describe a
polymer of amino acids [Tanford and Reynolds 2001]. The first completely
sequenced protein molecule was the B chain of insulin, published in 1951 [Sanger
and Tuppy 1951aa; 1951bb]. In 1958, the first three-dimensional structures of
proteins were solved by X-ray crystallography [Kendrew et al. 1960; Perutz et al.
1960]. The protein sequencing and structure determination were both
achievements awarded the Nobel Prize.
Genomics and proteomics
In the first publications of the human genome sequence, the number of proteincoding genes was estimated to about 30,000 [Lander et al. 2001; Venter et al. 2001].
With refined methods for protein-coding gene prediction, the current estimate is
around 20,500 genes [Clamp et al. 2007]. Only 1.5% of the nucleotides in the
genome encode proteins [Lander et al. 2001], and an equal proportion is nonprotein coding but under evolutionary selection [Waterston et al. 2002]. The
characterization of the conserved non-protein coding part of the genome sequence
is of great importance for genomics research [Waterston et al. 2002], as these parts
of the genome probably are active in the control of gene expression [Collins et al.
2003]. A lot of effort is also put into identifying the role of different parts of, and
individual differences in the genomic sequence in health and disease, to enable
diagnostics and correct therapeutic approaches [Collins et al. 2003].
The next phase in the elucidation of the human genome is the exploration of the
gene products. Although transcriptional profiling can be performed in a highthroughput manner [Schena et al. 1995; Velculescu et al. 1995; Lockhart et al. 1996;
Brenner et al. 2000; Shiraki et al. 2003; Bertone et al. 2004] to give clues to when,
where and to what extent genes are transcribed, it does not give a complete
picture. The central dogma states that sequence information is transferred from a
gene to RNA and, finally, to protein, but the RNA abundance does not have to
correlate with the protein abundance at any given time point [Anderson and
Seilhamer 1997]. Also, most of the genome is transcribed, and only a fraction of
the transcribed molecules seems to code for proteins [Birney et al. 2007].
Furthermore, the RNA molecule alone will not reveal much about the activity and
localization of the encoded protein molecule.
14
The proteome is defined as the protein complement expressed by a genome
[Wilkins et al. 1996]. Proteomics is the large-scale study of gene expression at the
protein level, and includes, for example, identification and quantification of
proteins in different samples, analysis of the structure of proteins, recognition of
post-translational modifications and their effect on protein function, elucidation of
the mechanisms behind protein transportation within the cell, discovery of
signaling and metabolic pathways, and studies of interactions between proteins or
between proteins and other molecules [Naaby-Hansen et al. 2001; Colinge and
Bennett 2007]. Proteomics will not only give us a deeper understanding of proteins
and their functions, but will also help us discover the mechanisms that cause
disease, and hopefully lead to early detection of disease and the development of
new drugs.
Mapping the building blocks of life
Proteomics studies commonly involve protein separation, protein identification,
protein structure determination and protein interaction studies (Table 1). Available
tools frequently used for protein separation are two-dimensional electrophoresis
[O'Farrell 1975] and multi-dimensional liquid chromatography [Erni and Frei
1978], while mass spectrometry [Tanaka et al. 1988], chemical degradation [Edman
1949] or affinity reagents [Köhler and Milstein 1975] can be used for protein
identification. Protein structure can be determined by X-ray crystallography
[Kendrew et al. 1960; Perutz et al. 1960] or nuclear magnetic resonance [Wuthrich
1990], while the yeast-two-hybrid system [Fields and Song 1989] or the tandem
affinity purification system [Rigaut et al. 1999], are used for protein interaction
studies.
Objective
Protein separation
Protein identification
Protein structure determination
Protein interactions
Commonly used tools
x Two-dimensional electrophoresis
x Multi-dimensional liquid chromatography
x Mass spectrometry
x Chemical degradation
x Affinity reagents
x X-ray crystallography
x Nuclear magnetic resonance
x Yeast-two-hybrid
x Tandem affinity purification
Table 1. Examples of tools used in proteomics studies.
15
Proteomics is facing several challenges. As protein-coding genes can give rise to a
number of alternatively spliced transcripts, and the proteins encoded by those
splice variants can be modified after translation, the number of different proteins
in the human proteome is huge. Adding the large group of proteins created by
somatic rearrangement (e. g. immunoglobulins) and amino acid differences due to
polymorphisms in the coding nucleotide sequence, current estimates of the
proteome size range from one million [Humphery-Smith 2004] to more than 10
millions [Uhlén and Pontén 2005].
Many proteins are present at very low levels in cells, while a few have extremely
high concentrations [Miklos and Maleszka 2001]. The proteins in plasma have for
example been shown to have a dynamic range of at least ten orders of magnitude
[Anderson and Anderson 2002]. Moreover, many proteins are fragile and
susceptible to degradation, and therefore careful handling of samples is required to
not lower already weak protein signals further [Falk et al. 2007]. There is no
equivalent to the polymerase chain reaction (PCR) for proteins, and without
enrichment, the low signals are only detectable by a few sensitive experimental
methods [Humphery-Smith 2004].
Affinity-based proteomics
Proteins can differ tremendously from one to another, both as a consequence of
the number of combinations that can be made from the 20 amino acids, and an
effect of post-translational modifications. The difference between individual
proteins can be a great advantage, for example in affinity-based proteomics, where
it allows each protein to be uniquely identified.
Affinity reagents can be used for proteomics studies regarding relative
quantification of proteins (enzyme-linked immunosorbent assay [Engvall and
Perlman 1971; Van Weemen and Schuurs 1971] or protein arrays [Haab et al.
2001]), size analysis (Western blot [Renart et al. 1979]), tissue profiling
(immunohistochemistry [Coons 1941]), subcellular localization studies
(immunofluorescence microscopy [Lazarides and Weber 1974]), interaction studies
(immuno affinity capture [Markham et al. 2007]), and cell studies (flow sorting
[Hulett et al. 1969]). Currently, a limiting factor for affinity proteomics is that the
available affinity reagents cover only a fraction of the human proteome [Taussig et
al. 2007].
16
Affinity reagents
Antibodies are naturally occurring as a part of the adaptive immune response. They
are produced by white blood cells (B-cells) and their primary function in the body
is to recognize foreign particles and to target those particles for elimination
[Travers et al. 2007]. The natural antibody response can be exploited in proteomics
by immunizing an animal with a protein of interest (or parts of that protein),
followed by harvest of the resulting antibodies, for use as affinity reagents in
further studies of the target protein. Antibodies for proteomic studies can be
divided into two groups – polyclonal antibodies and monoclonal antibodies. Both
are generated through immunization, but for polyclonal antibodies, the antisera is
used “as is”, which usually results in a mixture of antibodies recognizing different
parts of the immunized protein, as well as additional antibodies naturally occurring
in the immunized species, recognizing other targets. Monoclonal antibodies are
also generated through immunization of an animal, but B-cells from the
immunized animal are fused with cancer cells, generating separate hybridoma that
can be cultured. The antibodies secreted from each hybridoma can be tested for
specificity to the target protein [Köhler and Milstein 1975]. Antibodies from one
B-cell are always directed to one single part of the immunized protein.
Mono-specific antibodies is an interesting variant of polyclonal antibodies [Olive et
al. 2001]. These antibodies are generated as polyclonal antibodies, but the antisera
are affinity-purified with the immunized protein as a ligand. This procedure
eliminates more than 99% of the antibodies recognizing other targets than the
antigen [Uhlén and Pontén 2005], i. e. by using mono-specific antibodies, specific
recognition of the immunized protein is ensured. In contrast to monoclonal
antibodies, the mono-specific antibodies are not derived from a renewable source,
but the experimental procedures required for generation of monoclonal antibodies
are time-consuming in comparison with the procedures required for mono-specific
antibodies [Nilsson et al. 2005], making mono-specific antibodies an attractive
alternative for high-throughput proteomics efforts. The fact that mono-specific
antibodies are recognizing different parts of the protein makes them more versatile
than monoclonal antibodies. For example, the target protein is denatured in many
functional and detection assays, and having antibodies towards different parts of
the protein will increase the chance of binding [Uhlén and Pontén 2005].
Alternative affinity reagents to antibodies include antibody domains [Better et al.
1988; Bird et al. 1988; Huston et al. 1988], anticalins [Beste et al. 1999], ankyrin
repeat proteins [Binz et al. 2004], Affibody molecules [Nord et al. 1997], aptamers
17
[Ellington and Szostak 1990], and chemically synthesized first generation binders
[Evans et al. 1996]. These reagents are popular because of their small size as
compared to antibodies, and the fact that they can be easily modified [Falk et al.
2007]. Also, they do not require immunization of animals. It remains to be seen if
these alternative binders can be generated in a high-throughput manner.
Antigens
Antigen is short for antibody generation, but an antigen is any molecule that can
bind specifically to an antibody [Travers et al. 2007]. Antigens that actually induce
antibody production are called immunogens. In antibody-based proteomics, three
different antigens are used – full-length proteins, peptides and recombinant protein
fragments. Peptides have a standard size of 15 amino acids [Hancock and O'Reilly
2005]), and since they are short, they are easy to synthesize chemically. They have
readily been used as antigens since the early 1980’s [Sutcliffe et al. 1980; Lerner et al.
1981; Green et al. 1982]. Due to the small size of the peptides, they are not
expected to adopt the same conformation as the native protein. If the goal is to use
the antibody for recognition of a full-length target protein, it is therefore important
to select a part of the protein that is linear and exposed [Uhlén and Pontén 2005].
One way to try to circumvent this problem is to generate several different peptide
antigens for the target protein [Hancock and O'Reilly 2005].
An alternative strategy to peptides is to use full-length proteins as antigens. Fulllength proteins are almost always too large to be chemically synthesized. To
retrieve proteins that are correctly folded, the proteins should either be purified
from their natural environments or expressed in a system where folding and posttranslational modifications can be imitated [Yin et al. 2007]. These are very
laborious processes and are relatively cumbersome for high-throughput proteomics
studies.
The third type of antigens are recombinant proteins, where a part of the target
protein is fused to a tag, allowing for affinity purification of the proteins after
expression in a host cell [Terpe 2003]. The recombinant protein can be designed to
be longer than the peptide antigen and thus have a greater chance of containing
parts that are recognized by the antibody also in the full-length protein. Selection
of fragments with low sequence identity to proteins other than target reduces the
risk for generation of cross-reactive antibodies [Agaton et al. 2003]. Recombinant
antigens have been successfully used for generation of monoclonal and polyclonal
18
antibodies [Harris et al. 2002; Agaton et al. 2003], and are suitable for highthroughput proteomics studies.
Epitopes
An epitope is the part of an antigen to which an individual antibody binds
[Hancock and O'Reilly 2005]. The epitopes are classified as continuous or
discontinuous [Atassi and Smith 1978], based on if the amino acids in the epitope
are contiguous in the protein sequence or not. Although the amino acids in the
epitope are contiguous, some may not be directly involved in the direct binding to
the antibody and can be changed to another amino acid without affecting the
binding [Van Regenmortel 2006]. This makes the definition of continuous epitopes
unclear, as this epitope would functionally be discontinuous. The fact that many
peptide antigens generate antibodies that recognize the full-length protein can
possibly be explained by unfolding of the protein due to denaturation in the
studied sample [Laver et al. 1990], or merely that the folded protein contains
corresponding linear regions.
Classifying protein parts binary as epitopes or non-epitopes may not correspond to
how biology works. Under a given set of conditions (e. g. concentration of
antibody or type of assay), any part of a protein can function as an epitope
[Greenbaum et al. 2007]. Additionally, post-translational modifications can affect
the binding site of an antibody. For example, an oligosaccharide added by
glycosylation of the protein may be part of the epitope, or might mask the epitope
if the antibody was generated towards an unmodified antigen [Lisowska 2001]. In
the same manner, an antibody generated towards an unphosphorylated antigen can
have the epitope destroyed by phosphorylation of one of the amino acids in the
antibody binding site.
The Human Proteome Resource – mono-specific antibodies for
proteome profiling
The Human Proteome Resource (HPR) is a large antibody-based proteomics
project, aiming to profile the human proteome in terms of expression and
localization in human tissues [Uhlén et al. 2005]. With mono-specific antibodies as
affinity reagents, normal and cancer cells are scanned for the presence of target
proteins. One of the most important objectives of the project is to, through
continuous analyses of the results, detect biomarkers to be used for diagnosis and
prognosis of disease, and, hopefully, development of new drugs [Uhlén and
Pontén 2005]. Additionally, the generated antibodies are distributed to
19
collaborating groups to enable further studies of the target proteins. The results
from the HPR project are published on a publicly available website, the Human
Protein Atlas.
Figure 1. The pipeline of the Human Proteome Resource project.
Antibody generation in the HPR project is based on immunization of a
recombinant antigen [Agaton et al. 2003]. In summary (Figure 1), the starting point
for the pipeline is in silico selection of a transcript fragment corresponding to a
suitable antigen in the target protein [Lindskog et al. 2005]. The output from the in
silico selection is the sequence for a pair of oligonucleotide primers to be used for
amplification of the selected fragment from human RNA pools. The fragment is
inserted into a vector and sequence verified for subsequent expression of a
recombinant protein in bacteria. The produced protein is purified, verified by
molecular weight, and immunized into rabbits for generation of polyclonal
antibodies. The polyclonal sera are affinity-purified with the recombinant protein
as ligand, thereby rendering mono-specific antibodies [Agaton et al. 2004]. The
antibodies are quality assured by protein array [Nilsson et al. 2005], Western blot
and immunohistochemistry, where the results are compared to literature. The
20
subcellular localization of the protein is determined by confocal imaging of
immunofluorescently stained cell lines [Barbe et al. 2008].
The recombinant antigens, named protein epitope signature tags (PrESTs), are
protein fragments of 25-150 amino acid residues. The PrESTs are selected
primarily based on protein sequence uniqueness relative to the rest of the human
proteome, to avoid cross-reactivity of the generated antibodies to other proteins
than the target [Lindskog et al. 2005]. Transmembrane regions are avoided in the
PrEST due to inaccessibility of those regions to the antibodies in vivo, as well as
due to difficulties in handling hydrophobic regions in the used bacterial expression
system. Signal peptides are also avoided, as they are cleaved off from the mature
protein. Oligonucleotide primers corresponding to the start and end of the selected
transcript fragment are then designed, and with this the in silico procedure is
completed.
Selected primers are synthesized and used for amplification of the selected
fragment by the reverse transcriptase polymerase chain reaction (RT-PCR), with
pooled human RNA as a template [Agaton et al. 2003]. Successfully amplified
fragments are sequence verified after insertion into a plasmid vector. Strict criteria
are applied in the sequence verification step, ensuring that only clones with
sequence corresponding to the original target are approved. The vector is
transformed into Escherichia coli, for expression of the PrEST as a recombinant
protein in fusion with a dual tag [Larsson et al. 2000], consisting of six histidines
(His6) and part of the immunopotentiating albumin binding protein (ABP) from
Streptococcal protein G [Sjölander et al. 1997].
An automated schema is used for purification of expressed recombinant protein,
based on immobilized metal affinity chromatography with the His6-tag as
purification handle [Steen et al. 2006]. The purity and concentration of the
expressed protein are evaluated and, as a final quality insurance step, the molecular
weight of the protein is determined by mass-spectrometry. The recombinant
protein is immunized into rabbits, followed by three booster injections in four
week intervals [Nilsson et al. 2005]. Retrieved antisera are purified by a three-step
immunoaffinity based method. First, antisera are passed through a column with
His6ABP, to deplete antibodies towards the dual affinity tag in the PrEST. The
flow-through is collected and passed through a second column with the
immunized recombinant protein as affinity ligand. The mono-specific antibodies, i.
e. antibodies recognizing the PrEST, are caught in the column and are, as a final
step in the purification procedure, eluted and buffer-exchanged.
21
The first step in the specificity control of the mono-specific antibodies is the
evaluation of specific binding on a protein array [Nilsson et al. 2005]. Almost 400
PrESTs (recombinant proteins), including the target PrEST, are spotted onto an
epoxide-covered glass slide on which the antibodies are incubated. Fluorescently
labeled anti-rabbit antibodies are used as secondary antibodies for detection of
binding (signal) of the mono-specific antibodies to a PrEST. Unbound PrESTs are
detected by hen antibodies towards the His6ABP part of the recombinant proteins,
with a differently labeled anti-hen secondary antibody. The mono-specific
antibodies are expected to give high signals for the target PrEST and low or no
signals for the other PrESTs on the slide (Figure 2A).
The next step in the specificity control is the Western blot analysis. Here, the
mono-specific antibodies are evaluated for their ability to bind to the original fulllength protein target, as well as for their specificity. Total protein extracts from two
human cell lines (RT-4, U-251MG), human plasma and two human tissues (liver
and tonsil) are used for sodium dodecyl sulfate polyacrylamide gel electrophoresis
(SDS-PAGE) with gradient gels under reducing conditions, and are subsequently
transferred to membranes [Uhlén et al. 2005]. The mono-specific antibodies are
incubated on the membranes and a labeled secondary anti-rabbit antibody is used
for detection of bound antibodies. In the analysis of the Western blot, the
molecular weight of bound proteins can be estimated and compared to the
predicted molecular weight of the full-length target protein (Figure 2B). An
important aspect of Western blotting is that failure to detect bound protein of
“correct size”, or the presence of additional bound proteins, can possibly be
explained by the absence of the target protein in the studied cells, by posttranslational modifications, proteolysis, or unknown splice variants [Uhlén et al.
2005].
The last quality assurance step is immunohistochemistry on a test set of human
tissues and cells. Here, the expression and localization of the target protein can be
studied in situ, and the results can be evaluated and compared to literature and
predictions [Uhlén et al. 2005], for example, prediction of transmembrane regions
in a previously uncharacterized target protein supports membranous staining. The
mono-specific antibodies and a dextran polymer visualization system are incubated
on tissue microarrays (TMAs) [Kononen et al. 1998], allowing for simultaneous
analysis of many samples at the same time. After incubation, the TMAs are
developed with diaminobenzidine as chromogen and hematoxylin as
counterstaining (Figure 2C).
22
Figure 2. Specificity control of mono-specific antibodies. The liver protein arginase
(ARG1) is a protein expressed in liver and in red blood cells [Iyer et al. 1998]. There are three isoforms of
the protein (molecular weight 25, 35, and 36 kDa) [Flicek et al. 2008], created by alternative splicing.
Mono-specific antibodies towards the liver protein arginase have been tested for specificity in three different
platforms: A. Protein array. The green bars indicate the target PrEST (printed in duplicate). Black bars
correspond to other PrESTs. The fluorescence intensity is represented by the height of the bars. The monospecific antibodies tested here are only recognizing the target PrEST. B. Western blot. The leftmost lane
contains a molecular marker for 230, 110, 82, 49.3, 32.2, 25.5, and 17.6 kDa. The lane with strong
signal contains total protein lysate from human liver. The proteins recognized by the antibodies correspond to
the size of the isoforms of arginase. C. Images from immunohistochemistry on liver samples (normal (left)
and cancer (right)). Brown color indicates antibody bound to its antigen. Bone marrow samples are also
positive, as well as the squamous epithelial cells of vulva. No other tested tissues give a positive signal.
Specificity-approved mono-specific antibodies are further used for
immunohistochemistry on a set of 48 normal tissues (three individuals for each
tissue), and 20 different tumor types (for most of them, twelve individuals for each
tumor type) [Kampf et al. 2004]. The TMAs are scanned and the images are stored
in a database. All images are annotated by certified pathologists using a web-based
annotation tool (Oksvold, P., Lindskog, C., Björling, E., Kampf, C., and Pontén,
F., in prep.). The pathologist will specify the intensity, fraction of immunostained
cells, and a low-resolution subcellular localization for each given cell population. A
23
short text summary of the characteristics for each antibody is also recorded.
Furthermore, each mono-specific antibody is used for immunohistochemistry on
cell microarrays containing 47 different human cell lines and twelve clinical cell
samples. The microarrays are subsequently annotated by automated image analysis
[Strömberg et al. 2007]. In total, 708 different annotated images from tissues and
cells are generated for each mono-specific antibody. Additionally, any target
protein showing characteristics of specific interest, for example potential
biomarkers, are evaluated further by special TMAs containing large cohorts of
defined cancers, with associated clinical data.
Finally, the subcellular localization of the target protein is analyzed by confocal
imaging of immunofluorescently stained samples (three different cell lines). The
mono-specific antibodies, as well as two organelle probes specific for the
endoplasmic reticulum and micro-tubules, are fluorescently labeled and the nuclei
of the cells are counterstained with a nuclear probe. Multicolor images are
retrieved for each cell line and antibody, and are manually annotated for subcellular
localization, as well as the characteristics and intensity of the staining, using a webbased tool [Barbe et al. 2008].
All quality assurance data (protein array diagrams, Western blots and
immunohistochemical results) for approved antibodies is publicly available through
the Human Protein Atlas website (http://www.proteinatlas.org). The 708 images
containing the results from the immunohistochemistry, for evaluation of protein
localization and expression, are also published together with immunofluorescence
images of the subcellular localization of the protein. The annotation for each image
is provided, as well as information about the target protein and the antigen. An
advanced search tool is available for systematic exploration of the results [Björling
et al. 2007].
The long term goal of the Human Proteome Resource project is to have one
validated antibody towards at least one protein from each of the ~20,500 human
protein-coding genes, with information on the localization and expression of the
target proteins, both on tissue- and cell level [Uhlén 2007].
24
Bioinformatics
Bioinformatics is an interdisciplinary science. A simplified explanation of the term
could be “the use of computers to handle biological information”. The
bioinformatics definition committee at the National Institutes of Health in
Bethesda, Maryland, USA, did in July 2002 agree on the following definition of
bioinformatics: “Research, development, or application of computational tools and
approaches for expanding the use of biological, medical, behavioral or health data,
including those to acquire, store, organize, archive, analyze, or visualize such
data.”. In practice, bioinformatics has three main goals: 1. Organization of data, 2.
development of tools for analysis of data, and 3. utilization of developed tools for
interpretation of data [Luscombe et al. 2001].
Organization of data
Digital biological information is generated every day, all over the world. This data
should be organized, interconnected and eventually made public in some way, to
enable efficient analysis and to gather all previous knowledge for best possible
interpretation of new information. Examples of data that needs to be handled
include gene-, transcript- and protein sequences, protein structure data, gene
expression data (both on transcript and protein level) and protein interaction data.
There are international efforts specializing in collecting and storing this data,
thereby making sure that the format is consistent, and that the information can be
found easily.
A nucleotide sequence database
The DNA Databank of Japan (DDBJ) [Sugawara et al. 2007] in Japan, the
European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database
[Kulikova et al. 2007] in Europe, and GenBank [Benson et al. 2008] in USA have
since 1987 comprised the International Nucleotide Sequence Database
Collaboration, where nucleotide sequence information, with common formats and
agreed standards for annotation practice, is exchanged (http://www.insdc.org).
The entries in the databases include a description of the sequence, the taxonomy of
the source organism, references to literature, and a standardized table of features,
such as any coding regions, repeat regions, and sites of mutations or modifications
[Benson et al. 2008]. From 1982 and onwards, the number of bases in
DDBJ/EMBL/GenBank has doubled every 18 months. Currently, there are
nucleotide sequences from about 260,000 organisms in the databases,
corresponding to 80 billion bases, contributed from individual laboratories and
25
large-scale sequencing projects [Benson et al. 2008]. The data is accessible through
direct file transfer protocol (FTP)-downloads of the whole data set in flat-file
format, or by web-based database retrieval systems [Kulikova et al. 2007; Wheeler et
al. 2008], where search criteria can be used to find specified information.
A protein sequence database
For protein sequence information, the major repository is the Universal Protein
Resource (UniProt) [UniProt Consortium 2008], which is run in a collaboration
between the European Bioinformatics Institute in UK, the Protein Information
Resource in USA, and the Swiss Institute of Bioinformatics in Switzerland. The
UniProt Knowledgebase (UniProtKB) contains the current versions of protein
sequences, and is divided into two sections, one manually annotated part, SWISSPROT, and one computationally annotated part, TrEMBL. In the entries of
UniProtKB, there is annotation of function, biologically interesting domains,
modifications, subcellular localization, tissue specificity, interactions, splice variants
etc [UniProt Consortium 2008]. The annotations are based on computational
analysis, and, for UniProtKB/SWISS-PROT, on information extracted from
literature. In UniProtKB/SWISSPROT, entries from the same gene are merged to
create a non-redundant database. The sequences in UniProtKB are sequences from
structurally determined proteins, sequences mined from literature, sequences
submitted directly to the database, or, for the majority, translated nucleotide
sequences originally derived from entries with coding sequence annotation in the
GenBank/DDBJ/EMBL databases [UniProt Consortium 2008]. All UniProtKB
entries have an evidence code connected to them, displaying if the protein has
evidence for existence on protein level, transcript level, by homology, by
prediction, or simply if the evidence is uncertain. Evidence at protein level does
not mean that the protein sequence given is guaranteed to be fully accurate, but
that the existence of the protein has been verified, for example by detection by
antibodies, Edman degradation, mass spectrometry, or X-ray crystallography
[UniProt Consortium 2008]. In the current version of UniProtKB (release 13), the
UniProtKB/SWISS-PROT holds 356,194 protein entries from 11,290 different
species and UniProtKB/TrEMBL holds 5,395,414 protein entries from 155,282
species. Out of the 18,609 entries in UniProtKB/SWISS-PROT with human as
source organism, 10,721 have evidence at protein level. The UniProt data can be
accessed via a website supporting different search criteria, but can also be
downloaded in different formats.
26
Integration of data
There are numerous other data resources/repositories for biological information,
e. g. the Worldwide Protein Data Bank [Berman et al. 2003] for protein structures,
and the Gene Expression Omnibus [Edgar et al. 2002] for gene expression data.
Some databases have specialized in integrating data from different sources, making
it easier to find all relevant information on a specific gene or protein. GeneCards
[Rebhan et al. 1997], by the Weizmann Institute in Israel, and SOURCE [Diehn et
al. 2003] by Stanford University in USA, are examples of such databases. Both and
are publicly available and easily accessed via a web-interface.
A genome-centered database
Another way of organizing the biological information is to connect everything to
the source genome. The Ensembl project [Flicek et al. 2008], performs automatic
annotation of eukaryotic genomes, and integrates any biological data that can be
mapped onto features in these genomes [Birney et al. 2004]. The starting point for
the automatic annotation is the assembled genome sequence for a given species.
Protein- and complementary DNA (cDNA) sequences from other databases, like
the UniProtKB, are automatically matched to the genome sequence to create gene
models [Curwen et al. 2004]. The final product from the Ensembl system is an
annotated genome with predicted genes, based on experimental evidence. Ensembl
also provides cross-species information, such as syntenic regions and pairing of
orthologous genes between annotated genomes. The data in Ensembl is generally
updated every second month and, at present, there are 35 fully supported species
and preliminary support for six additional species in the database [Flicek et al.
2008]. According to Ensembl, the human genome (assembly 36, Ensembl release
48) contains 22,997 protein-coding genes, corresponding to 46,591 different
transcripts. Ensembl has put much effort into making the data available to the
public in a convenient way, and provides a website with several visualization
options and search capabilities, as well as downloadable data in different formats.
The project stores all data in a database, for remote access by anonymous login.
Tools for analysis of protein data
Once the biological data has been organized, it is essential to provide means for
easy and efficient public access to the information. As described in the
“Organization of data”-section above, the resources have put a lot of effort into
making the data easy to search and download. In addition to only retrieving
27
information, numerous tools for further analysis of the data, based on direct
comparison to other data and/or predictions, are available.
Sequence similarity
Web-based tools for string searching based on gene or protein identifiers and
descriptions are available for most of the larger databases. However, enabling the
search for objects differing slightly from entries in the database, for example
protein sequences that are similar but not identical to a search sequence, is also of
great importance. For this purpose, there are sequence alignment tools, like the
widely used Basic Local Alignment Search Tool (BLAST) package [Altschul et al.
1990; Altschul et al. 1997]. The BLAST programs can be used to find similarities of
a nucleotide or protein sequence to nucleotide or protein sequences in a database.
As the search space of the databases sometimes is very large, the searches need to
be fast. BLAST has solved this by aligning sequences in a heuristic manner, i. e.
searching for short, matching regions (“words”) between sequences and starting
the alignments from there [McGinnis and Madden 2004]. For blastp, the BLAST
program used for comparison of protein sequences, the algorithm works with
scores, where the score of two aligned amino acid residues is set by the selected
scoring matrix. The commonly used scoring matrices are based on multiple
alignments of similar protein sequences, where evolutionary rare substitutions are
given a low score, and common substitutions are given a high score. The matrices
can also reflect the probability of a pair of identical amino acids being a
coincidence [Eddy 2004], by giving a frequently found amino acid in a pair a lower
score than a rare amino acid in a pair. The size of the words that are used in the
first step of the algorithm is set to three amino acids by default. The total score for
those three amino acids in an alignment has to be over a certain threshold T to be
considered a starting point for further alignment of the sequences. Setting a high
value for T increases speed, as few alignments need to be done, but also increases
the risk of missing weak similarities [Altschul et al. 1997]. Also, an alignment will
only be pursued if there are two approved words within a maximum distance of A
(default value of A = 40 amino acids), the so-called two-hit method [Altschul et al.
1997]. When alignments are reported, they are accompanied by a total score for the
entire alignment and an E-value. The E-value is a statistical measurement, stating
that E is the number of alignments with the same or higher score that could be
found merely by chance in a database of this size [Altschul et al. 1997].
28
Conserved regions
Sequence comparison is the foundation also for finding functional and/or
structural core regions in clustered proteins by detection of conserved regions in
their sequences. The conserved regions can be extrapolated into signatures
[Attwood et al. 2003; Wu et al. 2004; Bru et al. 2005; Mi et al. 2005; Finn et al. 2006;
Letunic et al. 2006; Selengut et al. 2007; Hulo et al. 2008]. If the function is known
for some of the proteins from where a signature is derived, this information can be
used to give clues to the function of an uncharacterized protein with a matching
signature [Quevillon et al. 2005]. InterPro [Mulder et al. 2007] is a database of
protein signatures, which are retrieved from many different databases. The tool
InterProScan [Quevillon et al. 2005] merge the protein function recognition
methods of the member databases into one application, making the search nonredundant and fast.
Membrane protein topology prediction
Membrane proteins are a large group of proteins important in small-molecule
transport, signal transduction and cell-cell interactions, and are also targets for
more than 50% of all pharmaceutical drugs [Klabunde and Hessler 2002]. The
structure of membrane proteins is difficult to analyze experimentally [Jones 2007],
and the topology of membrane proteins is therefore another feature for which it is
highly desired to have tools for prediction. Most membrane proteins are built up
by transmembrane ơ-helices containing primarily hydrophobic amino acid residues,
which is a feature used by the early prediction tools [Kyte and Doolittle 1982]. The
loops between the helices have different composition depending on if they are on
the cytoplasmic side of the membrane or not (known as the positive-inside rule
[von Heijne 1989]). The state-of-the-art methods are based on machine learning
techniques where evolutionary information is included [Tusnady and Simon 2001;
Viklund and Elofsson 2004; Käll et al. 2005; Jones 2007]. In a recent benchmark
study, up to 80% of the topologies in a test set were predicted correctly using these
methods [Jones 2007]. Correctly predicted topology means that the orientations of
the transmembrane regions are correct and that the predicted helices are
approximately correctly positioned. However, the reported accuracy for topology
predictions is largely dependent on the test set used for the benchmark studies
[Käll and Sonnhammer 2002].
Prediction of subcellular localization
Transmembrane regions in the N-terminus of proteins are easily confused with
secretory signal peptides, due to the similar properties of these features. The
29
predictor Phobius combines transmembrane topology and signal peptide
predictions, which results in better discrimination between the two classes as
compared to previous topology prediction methods [Käll et al. 2007]. The secretory
signal peptide predicted by Phobius and other predictors (e. g. SignalP [Bendtsen et
al. 2004]) is directing the protein to the endoplasmic reticulum of the cell,
sometimes for further transport to the extracellular space, by the secretory pathway
[Bendtsen et al. 2004]. In more general terms, subcellular localization prediction
programs rely either on finding a sorting signal or on making the predictions based
on the global properties of the protein [Emanuelsson et al. 2007]. Some programs
also use GeneOntology [GeneOntology Consortium 2008] terms [Chou and Shen
2006] or homology information [Scott et al. 2004] to increase the accuracy of the
predictions. Proteins having >60% identical sequence have been shown to often
share subcellular localization [Nair and Rost 2002]. The specificity and sensitivity
of current subcellular localization prediction methods for uncharacterized proteins
with low sequence identity to characterized proteins is unfortunately low, with the
exception of prediction of proteins destined for the extracellular space, nucleus
and mitochondrion [Sprenger et al. 2006; Casadio et al. 2008], where the sorting
signals are known and the localization can be somewhat more reliably predicted.
Prediction of post-translational modifications
Signal peptides are often cleaved from the protein, which is classified as a posttranslational modification (PTM). PTMs are carried out by enzymatic processes,
where amino acids are removed from the protein or chemical groups are added to
certain parts of the protein. PTMs may alter the structure and function of the
protein [Blom et al. 2004]. Glycosylation is one of the most common PTMs [Hart
1992], and involves the covalent linking of an oligosaccharide to an amino acid.
Phosphorylation, i. e. the addition of a phosphate group to certain amino acids, is
another very common PTM [Zhang et al. 2002]. Phosphorylation is used as a
switch to control the function of a protein [Blom et al. 2004]. Several methods for
prediction of glycosylation and phosphorylation exist, but to achieve high
sensitivity of the methods, a somewhat low specificity must be expected [Blom et
al. 2004]. It should also be noted that since most PTMs are regulatory and
reversible, they are not only dependent on the protein sequence [Mann and Jensen
2003].
Prediction of epitopes
The knowledge of which parts of a protein that will give a robust immune
response can be helpful in the development of vaccines, in the generation of
30
antibodies by immunization and in evaluation of experimental results involving
antibodies. Current methods for prediction of linear epitopes are based on
propensity scales for hydrophilicity [Hopp and Woods 1981; Parker et al. 1986],
flexibility [Karplus and Schulz 1985], secondary structure [Garnier et al. 1978;
Odorico and Pellequer 2003] and solvent accessibility [Emini et al. 1985].
Unfortunately, an evaluation of 484 different propensity scales on a large set of
epitope-mapped proteins did not provide any evidence to support the hypothesis
of correlation between peaks in protein profiles using these scales and known
epitope location [Blythe and Flower 2005]. One explanation for the low accuracy
of continuous epitope predictions may be the presence of expendable amino acids
in the epitope, which introduces noise in the average values [Van Regenmortel and
Pellequer 1994]. Recently, the scales have been combined with machine learning
approaches, which have improved the prediction accuracy to some extent [Larsen
et al. 2006; Sollner 2006; Sollner and Mayer 2006]. When predicting discontinuous
epitopes, attempts have been made to combine residue solvent accessibility and
spatial distribution of a protein structure [Kulkarni-Kale et al. 2005; Haste
Andersen et al. 2006]. This requires a known three dimensional structure of the
analyzed protein, which is currently available only in a few cases. An evaluation of
methods for discontinuous epitope prediction showed that these methods perform
similarly to the prediction methods for continuous epitope prediction
[Ponomarenko and Bourne 2007]. In a report from a recent workshop with
prominent researchers in the field, it was concluded that the current state of B-cell
epitope prediction is far from ideal [Greenbaum et al. 2007].
31
32
PRESENT INVESTIGATION
33
34
Objectives
The starting point for the work presented in this thesis is the amino acid sequences
of the human proteome, and a strategy for systematic mono-specific antibody
generation with a recombinant protein as antigen. The goal is to present a validated
bioinformatics system for high-throughput selection of antigens for the human
proteome.
To reach this goal, a number of questions have to be answered and different
aspects must be considered:
1. Can we expect to find unique antigens to allow for generation of specific
antibodies towards a protein from every human protein-coding gene? (Paper I)
2. What criteria should be used in the selection of antigens, to give high success
rates in the experimental pipeline? How do these criteria affect the possibility to
find antigens? (Paper II)
3. How should protein information be processed, presented, and stored to allow
for high-throughput selection of antigens? (Paper II)
4. How should the transformation from the in silico world to the in vitro world be
done? (Paper III)
5. The results from the selection strategy need to be analyzed (Paper IV), to
provide feedback for optimization of the antigen selection criteria.
35
Background
Affinity proteomics has great potential for exploration of the protein complement
of genomes. The Human Proteome Resource (HPR) project was launched in
August 2003, with the aim to generate high-quality mono-specific antibodies
towards the human proteome, and to use these antibodies for large-scale protein
profiling in human tissues and cells. The high-throughput system for monospecific antibody generation was originally designed by a group of researchers from
the School of Biotechnology at the Royal Institute of Technology, where a pilot
project for antibody generation towards the proteins encoded by genes on
chromosome 21 was performed [Agaton et al. 2003]. Protein fragments (protein
epitope signature tags, or PrESTs) of 100-150 amino acid residues were used as
antigens. The PrESTs were selected from the open reading frames of putative
genes in chromosome 21 [Hattori et al. 2000] with exclusion of transmembrane
regions by using the TMHMM prediction tool [Sonnhammer et al. 1998] for
transmembrane helices. The antigen selection procedure included many manual
steps and was time-consuming.
When the HPR project started, the antigen selection procedure was made more
efficient by automation, via a PrEST selection tool called Bishop [Lindskog et al.
2005]. Bishop performs sliding window blastp [Altschul et al. 1997] sequence
similarity searches, with a default window size of 125 amino acids, against the
protein sequences in a given file. This step is performed to exclude regions with
high similarity to other proteins, with the aim to generate antibodies with high
specificity. The TMHMM prediction tool is integrated into the system, as well as
SignalP [Bendtsen et al. 2004], for prediction of transmembrane regions and signal
peptide, respectively. The sequence similarity search and predictions are started
manually for each protein (or in small batches). The output from the analysis is a
graph displaying the highest blastp score and corresponding E-value for each
window from the sequence similarity search, together with graphical representation
of any predicted transmembrane regions and signal peptide. Bishop will suggest
two PrESTs of the given window size, with as low blastp score as possible,
excluding transmembrane regions and signal peptide. Manual selection of PrESTs
is also possible. The final step in the selection of antigens is PCR primer design,
using basic criteria. Several primers are suggested, and the final choice of primer
pair is made by the user.
Many software systems were running in parallel in the early days of the HPR
project. A laboratory information management system (LIMS) was developed to
36
keep track of the enormous amounts of data that was generated from the
experimental pipeline. To meet the increasing demand for production of in silico
selected antigens to be started in the experimental pipeline, a commercial antigen
selection tool (ProteinWeaver (Affibody AB, Bromma, Sweden)) was purchased to
complement Bishop. Instead of having one system to handle the antigen selection
procedures and the storage of experimental and bioinformatics data, all
information had to be transferred between the different systems. Eventually, the
need for one single, fully integrated system was obvious.
The epitope space of the human proteome
(Paper I)
The goal of the Human Proteome Resource project is to generate at least one
mono-specific antibody towards a protein from every protein-coding human gene.
A prerequisite to reach this goal is that unique epitopes can be found on every one
of these proteins. An epitope can be conformational or linear (consisting of
consecutive amino acid residues) [Atassi and Smith 1978]. It is expected that the
proteins explored in the HPR project are unfolded, due to presence of denaturing
agents in all used applications. Therefore, the generated antibodies should
primarily recognize linear epitopes. No clear consensus exists regarding the size of
a linear epitope. Suggestions range from six to nine amino acid residues [Rodda et
al. 1986; Dunn et al. 1999; Fleury et al. 2000]. There are tools for epitope prediction
based on surface accessibility determinants, but a comparison of experimentally
verified linear epitopes and predicted epitopes showed no significant correlation
[Blythe and Flower 2005].
The epitope prediction method presented in Paper I does not consider the surface
accessibility of the potential epitope. Instead, the uniqueness of the epitope is
important, for prediction of the possibility for cross-reactivity of the
corresponding antibody. The uniqueness can be evaluated based on sequence
similarity searches against the collection of proteins assembled from the coding
parts of the human genome. Although all possible protein sequences are probably
not known and some of the predicted sequences will turn out to be wrong, the
state of the human proteome can be considered “good enough” to perform such
studies. One of the challenges when carrying out epitope-sized similarity searches
on a whole proteome is the time needed to compare all possible epitopes to each
other. To cover all Ensembl [Flicek et al. 2008] human proteins (>40,000 proteins)
using a sliding window (or “epitope”) with the size of 10 amino acid residues, more
than 10 billion comparisons would have to be performed. The analysis aims to find
37
the highest identity (number of identical amino acids) of the query window to any
protein, excluding hits to proteins from the target gene. The time needed to
perform such analysis by an exact comparison method (Hamming distance) can be
estimated to 145 years, using a modern personal computer. This is obviously not a
reasonable alternative. There are other algorithms, such as the blastp, which can
speed up the search process by being specialized in finding short-cuts in the search
for similarities between sequences. When the time consumed for similarity searches
using 18,196 windows with the size of 10 amino acid residues were compared for
Hamming distance and blastp, a 380 times shorter run time was found for blastp.
The drawback of using a heuristic method such as blastp is of course the increased
risk of missing important hits. When the identity values for the 18,196 windows
were compared for blastp and the Hamming distance method, the values agreed
for 98% of the windows. The disagreements were always found for windows with
low identity values, which is reassuring assuming that detection of high identity
windows is the most critical to avoid cross-reactive antibodies.
Although the blastp is 380 times faster than Hamming distance, the run time
corresponds to 137 days on a personal computer. To retrieve the results from a
sliding window similarity search for all human protein sequences in a reasonable
time, a grid-based implementation of the blastp was used [Andrade et al. 2006]. The
similarity searches were thereby distributed over hundreds of computers, allowing
the analysis to finish in one day.
The sliding window blastp similarity searches were performed for all Ensembl
proteins using windows of size 8, 10 and 12 amino acid residues. A comparison of
the protein profiles retrieved for the different window sizes revealed that regions
of high sequence identity are detected independently of window size. When
analyzing the results on a whole-proteome level, it is evident that finding for
example at least five out of eight or six out of ten amino acid residues identical to
another protein is very common, but finding seven out of ten or seven out of eight
identical amino acid residues is more rare. The preferred antigen size for the HPR
project is at least 50 amino acid residues (Paper IV). The analysis of the epitope
space of the human proteome shows that it would be possible to find an antigen of
that size without a single window having more than 8 out of 10 amino acid
residues identical, for 80% of the human genes.
38
A whole-genome bioinformatics approach to
select antigens for systematic antibody
generation (Paper II)
The need for an integrated HPR system, in combination with new features for
antigen selection, triggered the development of a new high-throughput antigen
selection tool (Paper II). The new antigen selection tool, named PRESTIGE,
allows for visualization of protein features and interactive selection of suitable
antigen regions, using the Ensembl database [Flicek et al. 2008] as the primary data
source. PRESTIGE is a web-based extension of the LIMS system, and any work
done using the tool is directly stored in the LIMS database. The LIMS system
handles all logistics involved in keeping track of which genes should be attempted
for antigen selection, and the results of these attempts.
PRESTIGE is divided into three sections – a sequence window, a protein model,
and a button pane (Fig. 3):
In the sequence window, the amino acid sequence for the current protein is given,
combined with the corresponding coding part of the processed transcript. To the
right of the sequence window, the splice variants of the current gene are listed to
allow the user to switch between variants. Protein sequences found in the
UniProtKB/SWISSPROT are marked with an ‘S’ in the list.
In the protein model, the protein is displayed as a scale, from N-terminus (left) to
C-terminus (right), with different features positioned on this scale. InterPro regions
[Mulder et al. 2007] are retrieved directly from the Ensembl database and can give
some clue to the function of the protein. Data on predicted signal peptides
(SignalP [Bendtsen et al. 2004]) is also retrieved from Ensembl, and these are
avoided in the antigen as they are cleaved off from the mature protein. Avoidance
of transmembrane regions in membrane-bound proteins is enabled using
visualization of the results from the TMHMM prediction tool [Sonnhammer et al.
1998]. The primary goal of the antigen selection is to generate antibodies with low
cross-reactivity to other proteins. The results from blastp sliding window sequence
similarity searches with a 50 amino acid residue window [Andrade et al. 2006] are
used to exclude domain-sized regions with high sequence identity to other proteins
in the antigen selection. A color coded curve in the protein model of PRESTIGE
displays the highest sequence identity (%) of each window to other proteins (where
splice variants from the same gene as the query protein are excluded). As
39
previously discussed (Paper I), linear epitopes are likely to be shorter than ten
amino acid residues, and it would therefore be desirable to exclude short windows
with high sequence identity to proteins from other genes, when selecting the
antigen. Hence, the results from sliding window sequence similarity searches using
a ten amino acid residues window are also visualized in the tool. Shared and
exclusive protein regions between splice variants from the same gene is preprocessed by a Java application comparing coding exons, and are displayed to
enable selection of antigens that give rise to antibodies binding to all or only one of
the splice variants.
Figure 3. The antigen selection tool PRESTIGE. The tool is displaying the following features:
Sequence window (A). Button pane (B). Protein scale (C). Restriction sites used in subsequent cloning of
the selected fragment (D). InterPro regions (E). Low complexity regions (F). Exclusive and shared regions
between splice variants (G). Signal peptide (H). Membrane protein topology (I). Sequence identity to
proteins from other genes based on a 50 amino acid sliding window (J) or a 10 amino acid sliding window
(K). Selected antigens (L).
In the button pane, the user can choose to zoom in on the protein model, fail the
gene in the LIMS system (if no antigen can be selected), and decide which type of
selection marker tool he or she wants to use. The selection marker tool is either a
40
frame of given size (for selection of antigens of predetermined size) or a guideline,
for dynamic size selection of antigens. The “Design primers”-button will send the
currently selected antigen to an automated PCR primer design, where different
primer types will be suggested (Paper III). The “Finalize design”-button will add
the selected antigens and primers to the LIMS database, for subsequent ordering
and synthesis of the selected primer pair.
An additional important feature of PRESTIGE is the display of previously selected
PrESTs and their status in the experimental pipeline, in the protein model. This
allows for intelligent choices to be made when iterating a previously attempted
gene. The protein model of PRESTIGE is also displayed in the LIMS user
interface for experimentalists in the HPR pipeline, to support interpretation of
experimental results.
Ensembl stores all data in an open source database. The data, together with the
table definition files, can be downloaded in text format by ftp, which allows for
straightforward setup of a local mirror of their database. Data can be extracted
from the database by an application programming interface (API), developed by
the Ensembl team [Stabenau et al. 2004]. A great advantage of using an API is that
the same programming code can be reused independently of changes in the
structure of the underlying database. The Ensembl database is updated a few times
per year. The data used in PRESTIGE is pre-processed for all proteins at the same
time and when a new version of Ensembl is released, a standard operating
procedure is used to update the data. This includes setting up a local version of the
new Ensembl database, and starting Grid procedures and the membrane protein
topology prediction. The standard operating procedure is uncomplicated and
allows for a new version of the data to be generated within a few days, where the
processing by computers is most time-consuming.
A whole-genome bioinformatics analysis was performed to evaluate the possibility
to find antigens of at least 50 amino acid residues when excluding any 50 amino
acid residues region >60% identical to another protein, any ten amino acid window
with more than eight identical amino acid residues to other proteins, predicted
transmembrane regions (based on TMHMM), and signal peptides. The results
show that at least one antigen can be found for 77% of the human genes, and for
65% of the genes, at least two non-overlapping antigens can be selected. The main
reason for not being able to select an antigen is high sequence identity of the
protein to proteins from other genes.
41
Primer design for high-throughput PCR cloning
(Paper III)
The transformation from the in silico world to the in vitro world of HPR is done
by oligonucleotide primer design for PCR cloning of the selected transcript
fragments from human RNA (Paper III). Consequently, the last step of the antigen
selection procedure in PRESTIGE is the design of primers. As HPR is a large
scale project, the cloning procedures need to be streamlined. Therefore, the RTPCR should be performed using the same protocol for all amplifications and all
primers should be designed to be as similar as possible regarding their melting
temperature (58-62°C). The PRESTIGE primer design is a fully automated
process, and different criteria are used to generate primers of three primer design
types - stringent, semi-stringent and non-stringent. The final selection of primer
pair is made by the user and is based on the length of the resulting PrEST. All
primer design types include the criteria that a guanine (G) or cytosine (C) should
be present in the 3’-end of the primer, and that the primers should have a length of
17-24 nucleotides. With the stringent criteria, no more than three G/C are allowed
among the last six nucleotides in the 3’-end. Additional criteria include checking
for repeats, hairpin loops, and primer dimer formation etc.
The success rate for the three primer design types was measured as the fraction of
clones that passed sequence validation. The analysis of success rates was
performed on a data set containing all PrESTs designed during the year of 2006 (n
= 5998). This ensures that all amplification and cloning have been performed using
the same experimental conditions and that all PrESTs have passed the relevant
experimental steps. The analysis shows clear differences between the primer design
types – the stringent design has the highest success rate (86%), while the semistringent and non-stringent have 67% and 68% success rate, respectively. The total
success rate for all PrESTs, irrespective of primer type, was 77%. The only
criterion separating stringent and semi-stringent design is the GC-content in the 3’end of the primers. The reason for not using stringent primers for all
amplifications is that a significant shortening of the originally selected fragment
often is required to fulfill all criteria. When only a short PrEST can be selected,
avoidance of further shortening of the selected fragment is of great importance
and thus semi-stringent or non-stringent primers may be the only option.
The GC-content, both of the entire primer and of the 3’-end, was shown to have a
strong effect on the success rate. For stringent primers with a total GC-content
<50%, a success rate of >90% could be observed, but for stringent primers with a
42
higher total GC-content (>60%), the success rate was 63%. For stringent and
semi-stringent primers with a total GC-content of 40-60%, the difference in
success rate based on GC-content of the 3’-end was obvious (92% and 74%,
respectively, for primers with 2 G/C or 5 G/C).
Notably, 77% of all in silico selected transcript fragments can be successfully
amplified and sequence verified using human RNA as the template, without the
need for large cDNA collections.
Generation of validated antibodies towards the
human proteome (Paper IV)
The antigen selection tools Bishop and ProteinWeaver were used to select >21,000
PrESTs on 66% of the human genes, allowing for a large scale analysis of the
success rates in the experimental pipeline of HPR. The length of the PrESTs was
in the range of 25-150 amino acid residues, with an average length of 105 amino
acid residues. For 70% of the PrESTs, the sequence identity to proteins from other
genes was less than 40%. Systematic antigen selection on chromosomes 14 and 22
showed that PrESTs could be designed for 86% of the genes, and high sequence
identity of the protein was the main reason for failure in design.
For the analysis of success rates, the experimental pipeline was divided into four
distinct modules: Cloning (I), Protein expression and purification (II), Antibody
generation (immunotechnology and protein array) (III), and Antibody validation
(immunohistochemistry and Western blot) (IV). The total success rate in each
module was analyzed, as well as the success rates based on PrEST length.
In Cloning, a successful result is a sequence-approved PrEST, and this was
achieved for 79% of the selected antigens. In accordance with the results reported
in Paper III, only small differences were detected based on PrEST length. A
successful result in the Protein expression and purification module is confirmed by
the correct molecular weight of the expressed PrEST by mass spectrometry. 72%
of the PrESTs were successful in this module, but a clear trend of decreasing
success rate with increasing PrEST length could be seen. Affinity purified
antibodies, being approved in terms of specificity by protein array analysis, were
obtained for 87% of the PrESTs in the Antibody generation module. The success
rate for PrESTs shorter than 50 amino acid residues was considerably lower than
for longer PrESTs. One might speculate that a possible the reason could be the
43
low number of epitopes present in a short antigen, resulting in poor immune
response.
For modules I-III, the validation is based on objective measurements and
quantitative scores. However, for the last module, Antibody validation, the
validation is a subjective decision. The experimental results from Western blot and
immunohistochemistry are compared to literature and bioinformatics prediction
methods for protein features. Many factors can affect the results in this module,
including post-translational modifications of the protein, unknown splice variants,
and the limited set of tissues and cells analyzed. In total, 63% of the PrESTs in the
Antibody validation module had generated approved antibodies. Similarly to the
Antibody generation module, antibodies generated towards short PrESTs had
lower success rate than antibodies generated towards longer PrESTs.
Overall, the accumulated success rate from cloning of a selected PrEST nucleotide
sequence to an approved antibody is 31%. The length of the antigen is affecting
the success rate by means of shorter PrESTs giving lower success rates in
Antibody generation and Antibody validation but higher success rates in Cloning
and Protein production and purification. The optimal antigen length to obtain a
high total success rate in the experimental pipeline was found to be 75-125 amino
acid residues.
Membrane proteins have previously been reported as being difficult to express
[Cunningham and Deber 2007]. An analysis of success rates based on PrESTs
selected on membrane or soluble proteins showed that these two groups had
similar success rates in the Protein expression and purification module, indicating
that avoidance of transmembrane regions in antigen selection is of importance for
successful protein expression.
To evaluate the subjective validation in the Antibody validation module, the
concordance in immunohistological staining patterns for 93 antibodies generated
to different parts of the same protein was evaluated by hierarchical clustering of
the expression data from a cell microarray. The staining intensity on the cell
microarray is automatically scored, thereby excluding any subjectivity in the
validation. The analysis showed that 73% of the paired antibodies had similar
expression patterns. For the majority of dissimilar pairs, there is a splice variant
unique for one of the antibodies in the pair, which might explain the dissimilarity.
Additionally, a pairwise correlation study was performed, where the correlation of
the expression pattern for randomly paired antibodies (not targeting the same
44
protein) was compared to the correlation of the expression pattern for paired
antibodies. The average correlation for the random set was 0.1, while the
correlation for 40% of the paired antibodies was above 0.5. An additional 42% of
the paired antibodies had a correlation of 0.25-0.5, where many of the pairs with
low correlation can again be explained by splice variants. However, it can not be
excluded that low correlation for some of the true pairs was an effect of crossreactivity of the antibodies.
For a majority of the genes in the HPR pipeline, two PrESTs have been selected.
The success rate in the different modules for a second PrEST on a gene was
analyzed to evaluate the value of iteration in the pipeline. The analysis showed that
the result for the first PrEST was correlated with the result for the second PrEST
in Cloning, Antibody generation and Antibody validation, indicating that the
protein itself is important in these modules. In Protein expression and purification,
each PrEST seems to be independent of the protein used for antigen selection.
Although the success rate for a second PrEST is expected to be lower if the first
PrEST has failed in module I, III or IV, the chance of retrieving an approved
antibody for this second PrEST is still reasonably high and thus supports an
iterative procedure. Optimally, two antibodies to different parts of a protein should
be generated and used to validate each other.
In conclusion, there is a great possibility to generate validated antibodies to at least
one protein from most of the human protein-coding genes, using the highthroughput approach of the HPR project. The high success rates achieved all
through the experimental pipeline is validating the antigen selection procedure, and
by using the results from the success rate analysis based on antigen length, the
antigen selection process can be optimized even further.
45
46
CONCLUDING REMARKS
AND
FUTURE PERSPECTIVES
47
48
The primary conclusion from the work presented in this thesis is that it should be
possible to, in a high-throughput manner, select unique antigens for generation of
specific antibodies towards a protein from most human protein-coding genes. An
analysis of possible epitopes in the human proteome showed that for the majority
of proteins, there are continuous stretches with low local sequence identity to
other proteins (Paper I). To further strengthen the epitope space analysis, the
minimal size of a linear epitope has to be experimentally determined. With new
techniques for epitope mapping, and availability of more antibodies, more reliable
information on linear epitope sizes is expected. This new information will probably
also improve the prediction tools for epitopes, making them more useful for in
silico antigen selection.
The antigen selection and visualization tool PRESTIGE was developed as an
extension to a laboratory management information system to allow for hightroughput selection of antigens based on chosen protein features, such as sequence
identity of the protein to other proteins, transmembrane regions and signal peptide
(Paper II). A whole-genome analysis showed that it should be possible to find a
suitable antigen for 77% of the protein-coding human genes. For the 23% of genes
where no antigen can be selected, the main reason is high sequence identity to
other genes. It is likely that these genes are members of a family of duplicated
genes, and generating antibodies recognizing several members of the family is
therefore an attractive solution. This would require modifications to the antigen
selection tool, as the strategy would be to detect the most similar proteins and
select antigens where this group is as dissimilar as possible to other proteins.
Another modification of PRESTIGE would be to add sequence similarity data for
orthologous genes in model organisms, such as mouse, rat, chimpanzee, fugu,
xenopus, and fruitfly. A visualization of this feature would potentially give clues to
the usability of generated antibodies in proteomics studies of these organisms.
The output from PRESTIGE is a pair of oligonucleotide primers to be used for
PCR cloning and subsequent protein expression of the selected antigen (Paper
III). An analysis of the primers revealed that the GC-content distribution of the
primer is an important factor for high frequency of approval in sequence
verification of the cloned fragments. A future improvement of the primer design
algorithms used in PRESTIGE would be the replacement of the basic “2+4”
melting temperature estimation formula by a more sophisticated formula with
nearest-neighbor thermodynamic parameters.
49
A large-scale analysis of the success rates in the experimental pipeline (Paper IV)
showed that the accumulated success rate from cloning to an approved antibody is
31%. The success rate was dependent on length of the selected antigen, and the
optimal antigen length proved to be 75-125 amino acid residues. Selecting more
than one antigen per protein will increase the chance of generating an approved
antibody. Optimally, several antigens should be selected to different parts of a
protein, to have at least two antibodies validating each other.
50
ABBREVIATIONS
ABP – albumin binding protein
BLAST - basic local alignment search tool
cDNA – complementary deoxyribonucleic acid
DDBJ - deoxyribonucleic acid databank of Japan
DNA – deoxyribonucleic acid
EMBL - european molecular biology laboratory
FTP – file transfer protocol
HPR – human proteome resource
LIMS – laboratory information management system
PCR – polymerase chain reaction
PrEST – protein epitope signature tag
PTM – post translational modification
RNA – ribonucleic acid
RT-PCR - reverse transcriptase polymerase chain reaction
SDS-PAGE - sodium dodecyl sulfate polyacrylamide gel electrophoresis
TMA – tissue microarray
UniProt - universal protein resource
UniProtKB – universal protein resource knowledgebase
51
52
ACKNOWLEDGMENTS
TACK!
Mathias Uhlén – universalgeni och min handledare, som peppat och pressat vid
rätt tillfällen. Kanon! Som de säger i kören.
Hela HPR – för fantastiskt data! Och för platsen i badtunnan.
Åsa Sivertsson – för alla ord. Intressanta, många och klarsynta.
Linn Fagerberg – för hjälp i kritiska lägen. Och för goda äppeldrinkar.
Jorge Andrade – Grid-kungen. Oj, vad snabbt det blev.
Jacob Odeberg – att skriva pek live tillsammans är en alldeles ypperlig idé.
HPR IT – Kalle Jonasson, Per Oksvold, Mattias Forsberg, Martin Zwahlen och
Janne Lund – för givande diskussioner och hjälp med PRESTIGE och allt möjligt
LIMS-relaterat.
Marcus Gry – för fina heatmaps och korrelationer.
Erik Sonnhammer – för värdefull expertgranskning.
Cristina Al-Khalili Szigyarto – en av de riktigt goda på vår planet.
HPR LG – för professionalism men ändå avspändhet, och för trevliga och nyttiga
sammankomster, både i mötesrum och på mer kontorsavlägsna platser.
Professorerna - Sophia Hober, Joakim Lundeberg, Stefan Ståhl, Per-Åke Nygren –
för en alldeles sällsynt positiv attityd.
DNA-klubb-workshopgänget – intensivt men ack så lärorikt.
Johan Rockberg, Basia Hjelm, Pawel Zajac, Carl Hamsten, Lotta Agaton, Mats
Lindskog och Maria Malmqvist – sköna nuvarande och dåvarande rumskamrater!
53
Marica Hamsten, Cecilia Eriksson, Johanna Steen, Caroline Asplund och Anna
Sköllermo – för fina pratstunder.
Malin Andersen – AZ och sen KTH, med lite CSHL som krydda.
Sara Gunnerås – för fina bilder på antikropp och kolonn.
Fredrik Sterky – för att du tog hand om mig så bra där i början när allting var nytt.
Alla på plan 3 – tänk om alla arbetsplatser vore så trevliga.
Niclas Jareborg – för att du öppnade porten till bioinformatikens underbara värld,
och visade att det går att göra en massa med nåt som heter Perl.
Kodaporna och ni andra på gamla Visual – för all kunskap, om allt från hashar till
ironi.
Peter Sigurdson – för att du hjälpte mig att vara lite besvärlig.
Pia Dosenovic – världens finaste vän, i alla väder.
Ina Berglund – a k a Rollan, C2, collekt, bk... Jenkins är en härlig typ.
Mamman och Pappan Berglund – bättre görs inte.
Erik Björling – min främste kritiker, men också mitt största stöd. Jag är såååå
övertygad om att all good things inte come to an end.
Detta arbete har finansierats med anslag från Knut och Alice Wallenbergs Stiftelse.
54
REFERENCES
Agaton, C., Falk, R., Höiden Guthenberg, I., Göstring, L., Uhlén, M., and Hober,
S. (2004). Selective enrichment of monospecific polyclonal antibodies for
antibody-based proteomics efforts. J Chromatogr A 1043: 33-40.
Agaton, C., Galli, J., Höiden Guthenberg, I., Janzon, L., Hansson, M., Asplund, A.,
Brundell, E., Lindberg, S., Ruthberg, I., Wester, K., et al. (2003). Affinity
proteomics for systematic protein profiling of chromosome 21 gene
products in human tissues. Mol Cell Proteomics 2: 405-414.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic
local alignment search tool. J Mol Biol 215: 403-410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and
Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs. Nucleic Acids Res 25: 3389-3402.
Anderson, L., and Seilhamer, J. (1997). A comparison of selected mRNA and
protein abundances in human liver. Electrophoresis 18: 533-537.
Anderson, N.L., and Anderson, N.G. (2002). The human plasma proteome:
history, character, and diagnostic prospects. Mol Cell Proteomics 1: 845-867.
Andrade, J., Berglund, L., Uhlén, M., and Odeberg, J. (2006). Using Grid
technology for computationally intensive applied bioinformatics analyses.
In Silico Biol 6: 495-504.
Atassi, M.Z., and Smith, J.A. (1978). A proposal for the nomenclature of antigenic
sites in peptides and proteins. Immunochemistry 15: 609-610.
Attwood, T.K., Bradley, P., Flower, D.R., Gaulton, A., Maudling, N., Mitchell,
A.L., Moulton, G., Nordle, A., Paine, K., Taylor, P., et al. (2003). PRINTS
and its automatic supplement, prePRINTS. Nucleic Acids Res 31: 400-402.
Avery, O.T., MacLeod, C.M., and McCarty, M. (1944). Studies on the chemical
nature of the substance inducing transformation of pneumococcal types:
induction of transformation by a desoxyribonucleic acid fraction isolated
from pneumococcus type III. Journal of Experimental Medicine 79: 137-158.
Barbe, L., Lundberg, E., Oksvold, P., Stenius, A., Lewin, E., Björling, E., Asplund,
A., Pontén, F., Brismar, H., Uhlén, M., et al. (2008). Toward a confocal
subcellular atlas of the human proteome. Mol Cell Proteomics 7: 499-508.
Bendtsen, J.D., Nielsen, H., von Heijne, G., and Brunak, S. (2004). Improved
prediction of signal peptides: SignalP 3.0. J Mol Biol 340: 783-795.
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., and Wheeler, D.L.
(2008). GenBank. Nucleic Acids Res 36: D25-30.
Berman, H., Henrick, K., and Nakamura, H. (2003). Announcing the worldwide
Protein Data Bank. Nat Struct Biol 10: 980.
Bertone, P., Stolc, V., Royce, T.E., Rozowsky, J.S., Urban, A.E., Zhu, X., Rinn,
J.L., Tongprasit, W., Samanta, M., Weissman, S., et al. (2004). Global
identification of human transcribed sequences with genome tiling arrays.
Science 306: 2242-2246.
55
Beste, G., Schmidt, F.S., Stibora, T., and Skerra, A. (1999). Small antibody-like
proteins with prescribed ligand specificities derived from the lipocalin
fold. Proc Natl Acad Sci U S A 96: 1898-1903.
Better, M., Chang, C.P., Robinson, R.R., and Horwitz, A.H. (1988). Escherichia
coli secretion of an active chimeric antibody fragment. Science 240: 10411043.
Binz, H.K., Amstutz, P., Kohl, A., Stumpp, M.T., Briand, C., Forrer, P., Grutter,
M.G., and Pluckthun, A. (2004). High-affinity binders selected from
designed ankyrin repeat protein libraries. Nat Biotechnol 22: 575-582.
Bird, R.E., Hardman, K.D., Jacobson, J.W., Johnson, S., Kaufman, B.M., Lee,
S.M., Lee, T., Pope, S.H., Riordan, G.S., and Whitlow, M. (1988). Singlechain antigen-binding proteins. Science 242: 423-426.
Birney, E., Andrews, T.D., Bevan, P., Caccamo, M., Chen, Y., Clarke, L., Coates,
G., Cuff, J., Curwen, V., Cutts, T., et al. (2004). An overview of Ensembl.
Genome Res 14: 925-928.
Birney, E., Stamatoyannopoulos, J.A., Dutta, A., Guigo, R., Gingeras, T.R.,
Margulies, E.H., Weng, Z., Snyder, M., Dermitzakis, E.T., Thurman, R.E.,
et al. (2007). Identification and analysis of functional elements in 1% of
the human genome by the ENCODE pilot project. Nature 447: 799-816.
Björling, E., Lindskog, C., Oksvold, P., Linné, J., Kampf, C., Hober, S., Uhlén, M.,
and Pontén, F. (2007). A web-based tool for in silico biomarker discovery
based on tissue-specific protein profiles in normal and cancer tissues. Mol
Cell Proteomics.
Blom, N., Sicheritz-Pontén, T., Gupta, R., Gammeltoft, S., and Brunak, S. (2004).
Prediction of post-translational glycosylation and phosphorylation of
proteins from the amino acid sequence. Proteomics 4: 1633-1649.
Blythe, M.J., and Flower, D.R. (2005). Benchmarking B cell epitope prediction:
underperformance of existing methods. Protein Sci 14: 246-248.
Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D.H., Johnson, D., Luo,
S., McCurdy, S., Foy, M., Ewan, M., et al. (2000). Gene expression analysis
by massively parallel signature sequencing (MPSS) on microbead arrays.
Nat Biotechnol 18: 630-634.
Bru, C., Courcelle, E., Carrere, S., Beausse, Y., Dalmar, S., and Kahn, D. (2005).
The ProDom database of protein domain families: more emphasis on 3D.
Nucleic Acids Res 33: D212-215.
Casadio, R., Martelli, P.L., and Pierleoni, A. (2008). The prediction of protein
subcellular localization from sequence: a shortcut to functional genome
annotation. Brief Funct Genomic Proteomic.
Chou, K.C., and Shen, H.B. (2006). Predicting eukaryotic protein subcellular
location by fusing optimized evidence-theoretic K-Nearest Neighbor
classifiers. J Proteome Res 5: 1888-1897.
Clamp, M., Fry, B., Kamal, M., Xie, X., Cuff, J., Lin, M.F., Kellis, M., LindbladToh, K., and Lander, E.S. (2007). Distinguishing protein-coding and
noncoding genes in the human genome. Proc Natl Acad Sci U S A 104:
19428-19433.
56
Colinge, J., and Bennett, K.L. (2007). Introduction to computational proteomics.
PLoS Comput Biol 3: e114.
Collins, F.S., Green, E.D., Guttmacher, A.E., and Guyer, M.S. (2003). A vision for
the future of genomics research. Nature 422: 835-847.
Coons, A. (1941). Immunological properties of an antibody containing a
fluorescent group. Proc Soc Exp Biol Med 47: 200-202.
Crick, F. (1970). Central dogma of molecular biology. Nature 227: 561-563.
Crick, F.H. (1958). On protein synthesis. The Symposia of the Society for Experimental
Biology 12: 138-163.
Crick, F.H., Barnett, L., Brenner, S., and Watts-Tobin, R.J. (1961). General nature
of the genetic code for proteins. Nature 192: 1227-1232.
Cunningham, F., and Deber, C.M. (2007). Optimizing synthesis and expression of
transmembrane peptides and proteins. Methods 41: 370-380.
Curwen, V., Eyras, E., Andrews, T.D., Clarke, L., Mongin, E., Searle, S.M., and
Clamp, M. (2004). The Ensembl automatic gene annotation system.
Genome Res 14: 942-950.
Dahm, R. (2005). Friedrich Miescher and the discovery of DNA. Dev Biol 278: 274288.
Diehn, M., Sherlock, G., Binkley, G., Jin, H., Matese, J.C., Hernandez-Boussard,
T., Rees, C.A., Cherry, J.M., Botstein, D., Brown, P.O., et al. (2003).
SOURCE: a unified genomic resource of functional annotations,
ontologies, and gene expression data. Nucleic Acids Res 31: 219-223.
Dunn, C., O'Dowd, A., and Randall, R.E. (1999). Fine mapping of the binding
sites of monoclonal antibodies raised against the Pk tag. J Immunol Methods
224: 141-150.
Eddy, S.R. (2004). Where did the BLOSUM62 alignment score matrix come from?
Nat Biotechnol 22: 1035-1036.
Edgar, R., Domrachev, M., and Lash, A.E. (2002). Gene Expression Omnibus:
NCBI gene expression and hybridization array data repository. Nucleic
Acids Res 30: 207-210.
Edman, P. (1949). A method for the determination of amino acid sequence in
peptides. Arch Biochem 22: 475.
Ellington, A.D., and Szostak, J.W. (1990). In vitro selection of RNA molecules that
bind specific ligands. Nature 346: 818-822.
Emanuelsson, O., Brunak, S., von Heijne, G., and Nielsen, H. (2007). Locating
proteins in the cell using TargetP, SignalP and related tools. Nat Protoc 2:
953-971.
Emini, E.A., Hughes, J.V., Perlow, D.S., and Boger, J. (1985). Induction of
hepatitis A virus-neutralizing antibody by a virus-specific synthetic
peptide. J Virol 55: 836-839.
Engvall, E., and Perlman, P. (1971). Enzyme-linked immunosorbent assay
(ELISA). Quantitative assay of immunoglobulin G. Immunochemistry 8: 871874.
57
Erni, F., and Frei, R.W. (1978). Two-dimensional column liquid chromatographic
technique for resolution of complex mixtures. Journal of Chromatography 149:
561-569.
Evans, D.M., Williams, K.P., McGuinness, B., Tarr, G., Regnier, F., Afeyan, N.,
and Jindal, S. (1996). Affinity-based screening of combinatorial libraries
using automated, serial-column chromatography. Nat Biotechnol 14: 504507.
Falk, R., Ramström, M., Ståhl, S., and Hober, S. (2007). Approaches for systematic
proteome exploration. Biomol Eng 24: 155-168.
Fields, S., and Song, O. (1989). A novel genetic system to detect protein-protein
interactions. Nature 340: 245-246.
Finn, R.D., Mistry, J., Schuster-Bockler, B., Griffiths-Jones, S., Hollich, V.,
Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., et al.
(2006). Pfam: clans, web tools and services. Nucleic Acids Res 34: D247251.
Fleury, D., Daniels, R.S., Skehel, J.J., Knossow, M., and Bizebard, T. (2000).
Structural evidence for recognition of a single epitope by two distinct
antibodies. Proteins 40: 572-578.
Flicek, P., Aken, B.L., Beal, K., Ballester, B., Caccamo, M., Chen, Y., Clarke, L.,
Coates, G., Cunningham, F., Cutts, T., et al. (2008). Ensembl 2008. Nucleic
Acids Res 36: D707-714.
Garnier, J., Osguthorpe, D.J., and Robson, B. (1978). Analysis of the accuracy and
implications of simple methods for predicting the secondary structure of
globular proteins. J Mol Biol 120: 97-120.
GeneOntology Consortium. (2008). The Gene Ontology project in 2008. Nucleic
Acids Res 36: D440-444.
Green, N., Alexander, H., Olson, A., Alexander, S., Shinnick, T.M., Sutcliffe, J.G.,
and Lerner, R.A. (1982). Immunogenic structure of the influenza virus
hemagglutinin. Cell 28: 477-487.
Greenbaum, J.A., Andersen, P.H., Blythe, M., Bui, H.H., Cachau, R.E., Crowe, J.,
Davies, M., Kolaskar, A.S., Lund, O., Morrison, S., et al. (2007). Towards
a consensus on datasets and evaluation metrics for developing B-cell
epitope prediction tools. J Mol Recognit 20: 75-82.
Haab, B.B., Dunham, M.J., and Brown, P.O. (2001). Protein microarrays for highly
parallel detection and quantitation of specific proteins and antibodies in
complex solutions. Genome Biol 2: RESEARCH0004.
Hancock, D.C., and O'Reilly, N.J. (2005). Immunochemical Protocols, 3 ed. Humana
Press, Totowa.
Harris, C.L., Lublin, D.M., and Morgan, B.P. (2002). Efficient generation of
monoclonal antibodies for specific protein domains using recombinant
immunoglobulin fusion proteins: pitfalls and solutions. J Immunol Methods
268: 245-258.
Hart, G.W. (1992). Glycosylation. Curr Opin Cell Biol 4: 1017-1023.
58
Haste Andersen, P., Nielsen, M., and Lund, O. (2006). Prediction of residues in
discontinuous B-cell epitopes using protein 3D structures. Protein Sci 15:
2558-2567.
Hattori, M., Fujiyama, A., Taylor, T.D., Watanabe, H., Yada, T., Park, H.S.,
Toyoda, A., Ishii, K., Totoki, Y., Choi, D.K., et al. (2000). The DNA
sequence of human chromosome 21. Nature 405: 311-319.
Hopp, T.P., and Woods, K.R. (1981). Prediction of protein antigenic determinants
from amino acid sequences. Proc Natl Acad Sci U S A 78: 3824-3828.
Hulett, H.R., Bonner, W.A., Barrett, J., and Herzenberg, L.A. (1969). Cell sorting:
automated separation of mammalian cells as a function of intracellular
fluorescence. Science 166: 747-749.
Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L., Cuche, B.A., de Castro, E.,
Lachaize, C., Langendijk-Genevaux, P.S., and Sigrist, C.J. (2008). The 20
years of PROSITE. Nucleic Acids Res 36: D245-249.
Humphery-Smith, I. (2004). A human proteome project with a beginning and an
end. Proteomics 4: 2519-2521.
Huston, J.S., Levinson, D., Mudgett-Hunter, M., Tai, M.S., Novotny, J., Margolies,
M.N., Ridge, R.J., Bruccoleri, R.E., Haber, E., Crea, R., et al. (1988).
Protein engineering of antibody binding sites: recovery of specific activity
in an anti-digoxin single-chain Fv analogue produced in Escherichia coli.
Proc Natl Acad Sci U S A 85: 5879-5883.
Iyer, R., Jenkinson, C.P., Vockley, J.G., Kern, R.M., Grody, W.W., and Cederbaum,
S. (1998). The human arginases and arginase deficiency. J Inherit Metab Dis
21 Suppl 1: 86-100.
Jones, D.T. (2007). Improving the accuracy of transmembrane protein topology
prediction using evolutionary information. Bioinformatics 23: 538-544.
Käll, L., Krogh, A., and Sonnhammer, E.L. (2005). An HMM posterior decoder
for sequence feature prediction that includes homology information.
Bioinformatics 21 Suppl 1: i251-257.
Käll, L., Krogh, A., and Sonnhammer, E.L. (2007). Advantages of combined
transmembrane topology and signal peptide prediction--the Phobius web
server. Nucleic Acids Res 35: W429-432.
Käll, L., and Sonnhammer, E.L. (2002). Reliability of transmembrane predictions
in whole-genome data. FEBS Lett 532: 415-418.
Kampf, C., Andersson, A.C., Wester, K., Björling, E., Uhlén, M., and Pontén, F.
(2004). Antibody-based tissue profiling as a tool for clinical proteomics.
Clin. Proteomics 1: 285-299.
Karplus, P.A., and Schulz, G.E. (1985). Prediction of chain flexibility in proteins. A
tool for the selection of peptide antigens. Naturwissenschaften 72: 212-213.
Kendrew, J.C., Dickerson, R.E., Strandberg, B.E., Hart, R.G., Davies, D.R.,
Phillips, D.C., and Shore, V.C. (1960). Structure of Myoglobin: A ThreeDimensional Fourier Synthesis at 2 Å. Resolution. Nature 185: 422-427.
Klabunde, T., and Hessler, G. (2002). Drug design strategies for targeting Gprotein-coupled receptors. Chembiochem 3: 928-944.
59
Köhler, G., and Milstein, C. (1975). Continuous cultures of fused cells secreting
antibody of predefined specificity. Nature 256: 495-497.
Kononen, J., Bubendorf, L., Kallioniemi, A., Barlund, M., Schraml, P., Leighton,
S., Torhorst, J., Mihatsch, M.J., Sauter, G., and Kallioniemi, O.P. (1998).
Tissue microarrays for high-throughput molecular profiling of tumor
specimens. Nat Med 4: 844-847.
Kulikova, T., Akhtar, R., Aldebert, P., Althorpe, N., Andersson, M., Baldwin, A.,
Bates, K., Bhattacharyya, S., Bower, L., Browne, P., et al. (2007). EMBL
Nucleotide Sequence Database in 2006. Nucleic Acids Res 35: D16-20.
Kulkarni-Kale, U., Bhosle, S., and Kolaskar, A.S. (2005). CEP: a conformational
epitope prediction server. Nucleic Acids Res 33: W168-171.
Kyte, J., and Doolittle, R.F. (1982). A simple method for displaying the
hydropathic character of a protein. J Mol Biol 157: 105-132.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J.,
Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001). Initial
sequencing and analysis of the human genome. Nature 409: 860-921.
Larsen, J.E., Lund, O., and Nielsen, M. (2006). Improved method for predicting
linear B-cell epitopes. Immunome Res 2: 2.
Larsson, M., Gräslund, S., Yuan, L., Brundell, E., Uhlén, M., Höög, C., and Ståhl,
S. (2000). High-throughput protein expression of cDNA products as a
tool in functional genomics. J Biotechnol 80: 143-157.
Laver, W.G., Air, G.M., Webster, R.G., and Smith-Gill, S.J. (1990). Epitopes on
protein antigens: misconceptions and realities. Cell 61: 553-556.
Lazarides, E., and Weber, K. (1974). Actin antibody: the specific visualization of
actin filaments in non-muscle cells. Proc Natl Acad Sci U S A 71: 22682272.
Lerner, R.A., Green, N., Alexander, H., Liu, F.T., Sutcliffe, J.G., and Shinnick,
T.M. (1981). Chemically synthesized peptides predicted from the
nucleotide sequence of the hepatitis B virus genome elicit antibodies
reactive with the native envelope protein of Dane particles. Proc Natl Acad
Sci U S A 78: 3403-3407.
Letunic, I., Copley, R.R., Pils, B., Pinkert, S., Schultz, J., and Bork, P. (2006).
SMART 5: domains in the context of genomes and networks. Nucleic Acids
Res 34: D257-260.
Lindskog, M., Rockberg, J., Uhlén, M., and Sterky, F. (2005). Selection of protein
epitopes for antibody production. Biotechniques 38: 723-727.
Lisowska, E. (2001). Antigenic properties of human glycophorins--an update. Adv
Exp Med Biol 491: 155-169.
Lockhart, D.J., Dong, H., Byrne, M.C., Follettie, M.T., Gallo, M.V., Chee, M.S.,
Mittmann, M., Wang, C., Kobayashi, M., Horton, H., et al. (1996).
Expression monitoring by hybridization to high-density oligonucleotide
arrays. Nat Biotechnol 14: 1675-1680.
Luscombe, N.M., Greenbaum, D., and Gerstein, M. (2001). What is
bioinformatics? A proposed definition and overview of the field. Methods
Inf Med 40: 346-358.
60
Mann, M., and Jensen, O.N. (2003). Proteomic analysis of post-translational
modifications. Nat Biotechnol 21: 255-261.
Markham, K., Bai, Y., and Schmitt-Ulms, G. (2007). Co-immunoprecipitations
revisited: an update on experimental concepts and their implementation
for sensitive interactome investigations of endogenous proteins. Anal
Bioanal Chem 389: 461-473.
Maxam, A.M., and Gilbert, W. (1977). A new method for sequencing DNA. Proc
Natl Acad Sci U S A 74: 560-564.
McGinnis, S., and Madden, T.L. (2004). BLAST: at the core of a powerful and
diverse set of sequence analysis tools. Nucleic Acids Res 32: W20-25.
Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S.,
Guo, N., Muruganujan, A., Doremieux, O., Campbell, M.J., et al. (2005).
The PANTHER database of protein families, subfamilies, functions and
pathways. Nucleic Acids Res 33: D284-288.
Miescher, F. (1871). Ueber die chemische Zusammensetzung der Eiterzellen. Med.Chem. Unters. 4: 441-460.
Miklos, G.L., and Maleszka, R. (2001). Protein functions and biological contexts.
Proteomics 1: 169-178.
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D.,
Bork, P., Buillard, V., Cerutti, L., Copley, R., et al. (2007). New
developments in the InterPro database. Nucleic Acids Res 35: D224-228.
Naaby-Hansen, S., Waterfield, M.D., and Cramer, R. (2001). Proteomics--postgenomic cartography to understand gene function. Trends Pharmacol Sci 22:
376-384.
Nair, R., and Rost, B. (2002). Sequence conserved for subcellular localization.
Protein Sci 11: 2836-2847.
Nilsson, P., Paavilainen, L., Larsson, K., Ödling, J., Sundberg, M., Andersson,
A.C., Kampf, C., Persson, A., Al-Khalili Szigyarto, C., Ottosson, J., et al.
(2005). Towards a human proteome atlas: high-throughput generation of
mono-specific antibodies for tissue profiling. Proteomics 5: 4327-4337.
Nirenberg, M.W. (1963). The genetic code. II. Sci Am 208: 80-94.
Nord, K., Gunneriusson, E., Ringdahl, J., Ståhl, S., Uhlén, M., and Nygren, P.Å.
(1997). Binding proteins selected from combinatorial libraries of an alphahelical bacterial receptor domain. Nat Biotechnol 15: 772-777.
O'Farrell, P.H. (1975). High resolution two-dimensional electrophoresis of
proteins. J Biol Chem 250: 4007-4021.
Odorico, M., and Pellequer, J.L. (2003). BEPITOPE: predicting the location of
continuous epitopes and patterns in proteins. J Mol Recognit 16: 20-22.
Olive, C., Toth, I., and Jackson, D. (2001). Technological advances in antigen
delivery and synthetic peptide vaccine developmental strategies. Mini Rev
Med Chem 1: 429-438.
Parker, J.M., Guo, D., and Hodges, R.S. (1986). New hydrophilicity scale derived
from high-performance liquid chromatography peptide retention data:
correlation of predicted surface residues with antigenicity and X-rayderived accessible sites. Biochemistry 25: 5425-5432.
61
Perutz, M.F., Rossmann, M.G., Cullis, A.F., Muirhead, H., Will, G., and North,
A.C.T. (1960). Structure of Haemoglobin: A Three-Dimensional Fourier
Sythesis at 5.5-Å. Resolution, Obtained by X-ray Analysis. Nature 185:
416-422.
Ponomarenko, J.V., and Bourne, P.E. (2007). Antibody-protein interactions:
benchmark datasets and prediction tools evaluation. BMC Struct Biol 7: 64.
Quevillon, E., Silventoinen, V., Pillai, S., Harte, N., Mulder, N., Apweiler, R., and
Lopez, R. (2005). InterProScan: protein domains identifier. Nucleic Acids
Res 33: W116-120.
Rebhan, M., Chalifa-Caspi, V., Prilusky, J., and Lancet, D. (1997). GeneCards:
integrating information about genes, proteins and diseases. Trends Genet 13:
163.
Renart, J., Reiser, J., and Stark, G.R. (1979). Transfer of proteins from gels to
diazobenzyloxymethyl-paper and detection with antisera: a method for
studying antibody specificity and antigen structure. Proc Natl Acad Sci U S
A 76: 3116-3120.
Rigaut, G., Shevchenko, A., Rutz, B., Wilm, M., Mann, M., and Seraphin, B. (1999).
A generic protein purification method for protein complex
characterization and proteome exploration. Nat Biotechnol 17: 1030-1032.
Rodda, S.J., Geysen, H.M., Mason, T.J., and Schoofs, P.G. (1986). The antibody
response to myoglobin--I. Systematic synthesis of myoglobin peptides
reveals location and substructure of species-dependent continuous
antigenic determinants. Mol Immunol 23: 603-610.
Sanger, F., Nicklen, S., and Coulson, A.R. (1977). DNA sequencing with chainterminating inhibitors. Proc Natl Acad Sci U S A 74: 5463-5467.
Sanger, F., and Tuppy, H. (1951a). The amino-acid sequence in the phenylalanyl
chain of insulin. 2. The investigation of peptides from enzymic
hydrolysates. Biochem J 49: 481-490.
Sanger, F., and Tuppy, H. (1951b). The amino-acid sequence in the phenylalanyl
chain of insulin. I. The identification of lower peptides from partial
hydrolysates. Biochem J 49: 463-481.
Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. (1995). Quantitative
monitoring of gene expression patterns with a complementary DNA
microarray. Science 270: 467-470.
Scott, M.S., Thomas, D.Y., and Hallett, M.T. (2004). Predicting subcellular
localization via protein motif co-occurrence. Genome Res 14: 1957-1966.
Selengut, J.D., Haft, D.H., Davidsen, T., Ganapathy, A., Gwinn-Giglio, M.,
Nelson, W.C., Richter, A.R., and White, O. (2007). TIGRFAMs and
Genome Properties: tools for the assignment of molecular function and
biological process in prokaryotic genomes. Nucleic Acids Res 35: D260-264.
Shiraki, T., Kondo, S., Katayama, S., Waki, K., Kasukawa, T., Kawaji, H., Kodzius,
R., Watahiki, A., Nakamura, M., Arakawa, T., et al. (2003). Cap analysis
gene expression for high-throughput analysis of transcriptional starting
point and identification of promoter usage. Proc Natl Acad Sci U S A 100:
15776-15781.
62
Sjölander, A., Nygren, P.Å., Ståhl, S., Berzins, K., Uhlén, M., Perlmann, P., and
Andersson, R. (1997). The serum albumin-binding region of streptococcal
protein G: a bacterial fusion partner with carrier-related properties. J
Immunol Methods 201: 115-123.
Sollner, J. (2006). Selection and combination of machine learning classifiers for
prediction of linear B-cell epitopes on proteins. J Mol Recognit 19: 209-214.
Sollner, J., and Mayer, B. (2006). Machine learning approaches for prediction of
linear B-cell epitopes on proteins. J Mol Recognit 19: 200-208.
Sonnhammer, E.L., von Heijne, G., and Krogh, A. (1998). A hidden Markov
model for predicting transmembrane helices in protein sequences. Proc Int
Conf Intell Syst Mol Biol 6: 175-182.
Sprenger, J., Fink, J.L., and Teasdale, R.D. (2006). Evaluation and comparison of
mammalian subcellular localization prediction methods. BMC Bioinformatics
7 Suppl 5: S3.
Stabenau, A., McVicker, G., Melsopp, C., Proctor, G., Clamp, M., and Birney, E.
(2004). The Ensembl core software libraries. Genome Res 14: 929-933.
Steen, J., Uhlén, M., Hober, S., and Ottosson, J. (2006). High-throughput protein
purification using an automated set-up for high-yield affinity
chromatography. Protein Expr Purif 46: 173-178.
Strömberg, S., Björklund, M.G., Asplund, C., Sköllermo, A., Persson, A., Wester,
K., Kampf, C., Nilsson, P., Andersson, A.C., Uhlén, M., et al. (2007). A
high-throughput strategy for protein profiling in cell microarrays using
automated image analysis. Proteomics 7: 2142-2150.
Sugawara, H., Abe, T., Gojobori, T., and Tateno, Y. (2007). DDBJ working on
evaluation and classification of bacterial genes in INSDC. Nucleic Acids Res
35: D13-15.
Sutcliffe, J.G., Shinnick, T.M., Green, N., Liu, F.T., Niman, H.L., and Lerner, R.A.
(1980). Chemical synthesis of a polypeptide predicted from nucleotide
sequence allows detection of a new retroviral gene product. Nature 287:
801-805.
Tanaka, K., Waki, H., Ido, Y., Akita, S., Yoshida, Y., and Yoshida, T. (1988).
Protein and polymer analysis up to m/z 100.000 by laser ionisation timeof-flight mass spectrometry. Commun. Mass Spectrom. 2: 151-153.
Tanford, C., and Reynolds, J. (2001). Nature's Robots: A History of Proteins, 1 ed.
Oxford University Press, Oxford.
Taussig, M.J., Stoevesandt, O., Borrebaeck, C.A., Bradbury, A.R., Cahill, D.,
Cambillau, C., de Daruvar, A., Dubel, S., Eichler, J., Frank, R., et al.
(2007). ProteomeBinders: planning a European resource of affinity
reagents for analysis of the human proteome. Nat Methods 4: 13-17.
Terpe, K. (2003). Overview of tag protein fusions: from molecular and
biochemical fundamentals to commercial systems. Appl Microbiol Biotechnol
60: 523-533.
Travers, P., Walport, M., and Kenneth, M. (2007). Janeway's Immunobiology, 7 ed.
Garland Publishing, New York.
63
Tusnady, G.E., and Simon, I. (2001). The HMMTOP transmembrane topology
prediction server. Bioinformatics 17: 849-850.
Uhlén, M. (2007). Mapping the human proteome using antibodies. Mol Cell
Proteomics 6: 1455-1456.
Uhlén, M., Björling, E., Agaton, C., Szigyarto, C.A., Amini, B., Andersen, E.,
Andersson, A.C., Angelidou, P., Asplund, A., Asplund, C., et al. (2005). A
human protein atlas for normal and cancer tissues based on antibody
proteomics. Mol Cell Proteomics 4: 1920-1932.
Uhlén, M., and Pontén, F. (2005). Antibody-based proteomics for human tissue
profiling. Mol Cell Proteomics 4: 384-393.
UniProt Consortium. (2008). The universal protein resource (UniProt). Nucleic
Acids Res 36: D190-195.
Van Regenmortel, M.H. (2006). Immunoinformatics may lead to a reappraisal of
the nature of B cell epitopes and of the feasibility of synthetic peptide
vaccines. J Mol Recognit 19: 183-187.
Van Regenmortel, M.H., and Pellequer, J.L. (1994). Predicting antigenic
determinants in proteins: looking for unidimensional solutions to a threedimensional problem? Pept Res 7: 224-228.
Van Weemen, B.K., and Schuurs, A.H. (1971). Immunoassay using antigenenzyme conjugates. FEBS Lett 15: 232-236.
Velculescu, V.E., Zhang, L., Vogelstein, B., and Kinzler, K.W. (1995). Serial
analysis of gene expression. Science 270: 484-487.
Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith,
H.O., Yandell, M., Evans, C.A., Holt, R.A., et al. (2001). The sequence of
the human genome. Science 291: 1304-1351.
Viklund, H., and Elofsson, A. (2004). Best alpha-helical transmembrane protein
topology predictions are achieved using hidden Markov models and
evolutionary information. Protein Sci 13: 1908-1917.
von Heijne, G. (1989). Control of topology and mode of assembly of a polytopic
membrane protein by positively charged residues. Nature 341: 456-458.
Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P.,
Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. (2002).
Initial sequencing and comparative analysis of the mouse genome. Nature
420: 520-562.
Watson, J.D., and Crick, F.H. (1953). Molecular structure of nucleic acids; a
structure for deoxyribose nucleic acid. Nature 171: 737-738.
Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Chetvernin, V.,
Church, D.M., Dicuccio, M., Edgar, R., Federhen, S., et al. (2008).
Database resources of the National Center for Biotechnology
Information. Nucleic Acids Res 36: D13-21.
Wilkins, M.R., Pasquali, C., Appel, R.D., Ou, K., Golaz, O., Sanchez, J.C., Yan,
J.X., Gooley, A.A., Hughes, G., Humphery-Smith, I., et al. (1996). From
proteins to proteomes: large scale protein identification by twodimensional electrophoresis and amino acid analysis. Biotechnology (N Y) 14:
61-65.
64
Wu, C.H., Nikolskaya, A., Huang, H., Yeh, L.S., Natale, D.A., Vinayaka, C.R., Hu,
Z.Z., Mazumder, R., Kumar, S., Kourtesis, P., et al. (2004). PIRSF: family
classification system at the Protein Information Resource. Nucleic Acids Res
32: D112-114.
Wuthrich, K. (1990). Protein structure determination in solution by NMR
spectroscopy. J Biol Chem 265: 22059-22062.
Yin, J., Li, G., Ren, X., and Herrler, G. (2007). Select what you need: a comparative
evaluation of the advantages and limitations of frequently used expression
systems for foreign genes. J Biotechnol 127: 335-347.
Zhang, H., Zha, X., Tan, Y., Hornbeck, P.V., Mastrangelo, A.J., Alessi, D.R.,
Polakiewicz, R.D., and Comb, M.J. (2002). Phosphoprotein analysis using
antibodies broadly reactive against phosphorylated motifs. J Biol Chem 277:
39379-39387.
65