Proteomics 2004, 4, 1439–1460
DOI 10.1002/pmic.200300680
RELIC – A bioinformatics server for combinatorial
peptide analysis and identification of protein-ligand
interaction sites
Suneeta Mandava, Lee Makowski, Satish Devarapalli, Joseph Uzubell
and Diane J. Rodi
Biosciences Division, Argonne National Laboratory, Argonne, IL, USA
Phage display technology provides a versatile tool for exploring the interactions between
proteins, peptides and small molecule ligands. Quantitative analysis of peptide population
sequence diversity and bias patterns has the power to significantly enhance the impact of
these methods [1, 2]. We have developed a suite of computational tools for the analysis of
peptide populations and made them accessible by integrating fifteen software programs for
the analysis of combinatorial peptide sequences into the REceptor LIgand Contacts (RELIC)
relational database and web-server. These programs have been developed for the analysis of
statistical properties of peptide populations; identification of weak consensus sequences
within these populations; and the comparison of these peptide sequences to those of naturally occurring proteins. RELIC is particularly suited to the analysis of peptide populations
affinity selected with a small molecule ligand such as a drug or metabolite. Within this functional context, the ability to identify potential small molecule binding proteins using combinatorial peptide screening will accelerate as more ligands are screened and more genome
sequences become available. The broader impact of this work is the addition of a novel means
of analyzing peptide populations to the phage display community.
Keywords: Bioinformatics / Database / Phage display / Protein-ligand interactions / Small molecule
Correspondence: Dr. Diane J. Rodi, Biosciences Division,
Argonne National Laboratory, 9700 South Cass Avenue,
Argonne, IL 60439, USA
E-mail: [email protected]
Fax: +1-630-252-5517
Abbreviations: ASP, Active Server Pages; COM+, Component
Object Model; NEB, New England Biolabs; PDB, protein data
bank; RELIC, receptor ligand contacts
Received 25/7/03; Revised 30/10/03; Accepted 3/11/03
© 2004 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
1 Introduction
The need for high-throughput bioinformatic methods to
characterize gene function is being driven by the generation of sequences at a rate far beyond our ability to carry
out experimental functional analyses. In spite of the large
number of analytical tools currently available, typically
about 40% of predicted open reading frames remain
functionally uncharacterized. An important clue to open
reading frame function is the identification of binding partners. Phage display technology is a widely used tool
for identifying either protein- or small molecule-binding
partners [3]. Current methodology aims towards multiple
rounds of affinity selection until a point is reached at
which most of the remaining binding phage consist of a
small number of peptide sequences. The peptide sequences identified in that group of phage clones are then
assumed to reflect affinity to the ligand used for selection.
Affinity, however, is not the only factor contributing to
these results. The process of library construction itself
compounded by affinity selection methods generates a
group of peptide-bearing phage clones whose sequence
properties are statistically quite complex. Although the
initial nucleotide inserts may well be random in sequence,
the combination of random sequence filtration (due to the
low number of initial phage clones generated by electroporation (usually on the order of 10^6 to 10^9) as compared
to the high numbers of sequences theoretically possible
(for a 12 amino acid length set of peptides the theoretical
number of sequences possible is 4.096 × 10^15)) and nonrandom (repetitive or regular pattern) sequence filtration
due to biological selection steps, creates a pool of phage
particles with multiple reasons for continued inclusion
www.proteomics-journal.de
in the population. Statistical dissection of the regular
sequence biases with custom-designed algorithms has
been used by our group to identify inclusion-limiting steps
during library construction [1]. These results, as well as
data from others, demonstrate that alternation of affinity
selection with growth in broth can lead to a group of
clones which are the product of affinity and enhanced
growth properties [1, 4, 5]. Given the random inclusion
of sequences into the library due to the electroporation
cut-off step, the exact peptide sequence which matches
the binding partner being sought may not be in the library,
only a low number of conservatively close sequences.
Affinity selection of a combinatorial peptide library screen
may therefore generate a group of closely related sequences which are functional homologs, but may contain
no obvious consensus sequence discernable by eye.
Optimal use of phage display libraries for the study
of intermolecular interactions, therefore, necessitates a
quantitative approach to data analysis in which close
peptide sequences, subtle motifs and overall trends are
capable of being monitored and studied. For example, is
there a method for estimating the sequence diversity of a
combinatorial peptide library from the sequences of a
practicable number of the members of that library? Does
the biology of the phage-host system result in biases in
amino acid representation that seriously impact the diversity of a phage displayed library and, consequently, the
results of affinity selection experiments? How can these
biases be identified and characterized in order to make
the best use of phage libraries in affinity selection or other
experiments designed to take advantage of the unique
properties of display libraries? How can ligand-binding
motifs within the peptides be readily identified in light of
the biological biases introduced by the system? These
types of questions cannot be answered using the computational tools and databases presently available. While
currently available genomics tools are sources of valuable
information, they do not address many of the specific
issues important to the phage display community.
A presently unmet need in the application of phage display technology is software for the analysis of the statistical properties of peptide libraries which answers these
types of questions by carrying out such procedures as
the estimation of sequence diversity in a library prior to
screening; the identification of weak consensus motifs
within short sequences; and the comparison of the sequences of affinity selected peptides to those of naturally
occurring proteins. We have developed a suite of fifteen
programs for the analysis of populations of peptides in
the context of these three functions. This software has
been incorporated into the publicly available REceptor
LIgand Contacts (RELIC) bioinformatics server (http://relic.bio.anl.gov).
The order of the programs and their
functionalities are specifically designed to aid a
researcher in the combinatorial peptide field from the
early stages of raw data acquisition to the final stage of
protein epitope mapping. The flow of data-processing
software starts with sequence translation programs, followed by physicochemical property mapping, sequence
bias identification algorithms, and finally peptide/protein
similarity mapping both with and in the absence of
three-dimensional coordinates.
The Programs page of RELIC (Fig. 1 and Table 1) groups
these programs into five general categories, each of
which addresses a particular need within the field. The
translation programs, DNA2PRO12 and DNA2PRO7, are
designed to translate raw DNA sequence text output into
peptide sequences for phage clones isolated from the
New England Biolabs (NEB; Beverly, MA, USA) Ph.D.-12 and Ph.D.-C7C libraries, respectively, eliminating
the need for manual translation of large numbers of sequences. These programs operate by scanning for vector
end/beginning sequences and specifically search for an
exact size insert (enabling the flagging of insert sequence
anomalies), and thus can be modified to accommodate
alternative phage display constructs. The characterization of peptide populations suite of programs is a collection of software which is capable of analyzing a population of peptide sequences in terms of their position-dependent abundances of amino acids and their sequence
diversity properties. These four programs are most useful
for analysis of matched sets of peptide sequences, a
recommended minimum of 50 sequences from the naïve
parent library and a recommended minimum of 50 sequences which are the product of an affinity selection
experiment using that parent library (duplicate copies of
peptides will be automatically removed, as RELIC programs assume that multiple occurrences do not reflect
independent events). A statistical shift in any of these
measured properties between the nonselected and
selected sets of peptide sequences is an indication that
the affinity selection process has been effective, even in
the absence of a clear amino acid sequence consensus
in the selected sequence set. Examples of practical
usage of these programs are presented below in
Section 3.2.
Figure 1. A reproduction of the RELIC programs page.

Table 1. List of the programs accessible through the web interface of RELIC

Translation: DNA to protein sequence:
  DNA2PRO12    12 mer peptide translation        peptide sequence from DNA sequence
  DNA2PRO7     7 mer peptide translation         peptide sequence from DNA sequence
Characterization of Peptide populations:
  AAFREQ       amino acid frequency              frequency of each amino acid at each position
  POPDIV       population diversity              diversity of the population and of individual positions
  AADIV        amino acid frequency + diversity  combines two previous calculations
  INFO         peptide-associated information    estimates likelihood of random occurrence of sequence
Peptide Motif Identification:
  MOTIF1       motif identifier                  identifies continuous short motifs within a population
  MOTIF2       Pair correlation analysis 2       identifies discontinuous short motifs within a population
  MOTIF3       Pair correlation analysis 6       identifies discontinuous short motifs and their near neighbors
Comparison of peptide population to sequence of a known structure:
  HETEROalign  peptide-PDB file analysis         aligns peptides to a sequence from a PDB file
  CLOSEcon     PDB file analysis                 identified residues contacting a heterogroup in a PDB file
  DistSim      distance-similarity               computes distance to ligand and similarity to peptide population
Analysis of single or multiple FASTA sequences:
  MATCH        alignment                         aligns peptides with a protein sequence
  FASTAcon     consensus finder                  IDs proteins from a population with short consensus sequences
  FASTAskan    similarity calculation            lists proteins with high similarity to a peptide population

The peptide motif identification program set is specifically
designed to recognize amino acid consensus sequences
within a peptide population which are difficult to extract
by eye. The three MOTIF programs encompass different
combinations of allowed motif properties such as continuous/discontinuous sequence similarity, conservative
amino acid substitutions allowed/disallowed, and minimum sequence length requirements. The fourth software
group, comparison of peptide population to sequence of
a known structure, has been written to construct optimal
sequence alignments between peptide sequences obtained via affinity screening and a protein sequence with
associated three-dimensional coordinates (i.e. a protein
data bank (PDB) file). The combined use of these three
programs allows a user to determine whether any of the
peptides obtained via affinity selection are mimicking one
or more epitopes within a protein structure; facilitates
the visualization of those mimicked epitopes within that
protein structure; and calculates the distance between a
cocrystallized ligand within that protein structure and the
protein sequence epitopes identified via affinity-selected
peptide analysis.
The final suite of programs, analysis of single or multiple
FASTA sequences, similarly carries out optimal sequence
alignments between affinity-selected peptides and protein sequences, but does so for proteins for which there
is only a text sequence (i.e. no coordinates deposited
within the PDB). The MATCH program aligns multiple
peptide sequences to a single protein sequence, with a
text output that allows for easy identification of protein
regions with sequences similar to multiple peptide sequences (i.e. regions of clustering). The FASTAcon program will take a motif sequence (either supplied by the
user or output from any of the three MOTIF programs)
and scan text lists of protein sequences for the occurrence of that motif sequence in those protein sequences.
FASTAskan is a combination program which serially
applies the MATCH algorithm to calculate a cumulative
similarity factor between a single protein and a set of
selected peptides, repeats this calculation for
multiple protein sequences using the same peptide sequence set, and then compiles a list of those proteins
ranked by similarity factor, from most
similar to least similar. This program is capable of sorting
an entire genome-derived list of proteins based upon their
similarity to a group of affinity-selected peptides in the
absence of a clear consensus sequence motif, clustering
those proteins most likely to bind to the affinity target at
the top of the output list.
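The ranking idea behind FASTAskan can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the RELIC implementation: the identity-counting window score, the function names (best_window_score, cumulative_similarity, rank_proteins) and the toy sequences are all hypothetical, and the scoring scheme actually used by MATCH is not reproduced here.

```python
def best_window_score(peptide, protein):
    """Best ungapped score of `peptide` against any same-length window of `protein`.

    Scoring is plain identity counting (an assumption made for this sketch).
    """
    n = len(peptide)
    best = 0
    for start in range(len(protein) - n + 1):
        window = protein[start:start + n]
        score = sum(1 for a, b in zip(peptide, window) if a == b)
        best = max(best, score)
    return best


def cumulative_similarity(peptides, protein):
    """Sum the best window score of every peptide against one protein."""
    return sum(best_window_score(p, protein) for p in peptides)


def rank_proteins(peptides, proteins):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = {name: cumulative_similarity(peptides, seq)
              for name, seq in proteins.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical sequences, not taken from the RELIC database).
peptides = ["GKSTL", "AGKSA"]
proteins = {
    "protA": "MTEYKLVVVGAGGVGKSALTIQ",
    "protB": "MPPPLLLWWWDDDEEEFFF",
}
ranking = rank_proteins(peptides, proteins)
```

In this toy case the protein that shares a stretch with both peptides rises to the top of the list even though neither peptide matches it exactly, which is the behaviour the paragraph above describes for populations without a clear consensus motif.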
RELIC was particularly designed for the study of the interaction of small molecules with proteins, based upon
previous work in which we have shown that the similarity
between the sequence of a protein and the sequences
of small molecule affinity-selected, phage-displayed peptides can be predictive for protein binding to that small
molecule ligand. This technique has been successfully
employed to map the contact residues in the targets of a
variety of drugs, drug candidates, and small molecule
metabolites [6, 7]. The use of affinity selection of phage
displayed peptide libraries to identify binding motifs has
potential as an approach in the annotation of whole
genomes. The rationale is analogous to the use of consensus sequences that provide reliable signatures for
ligand binding. Although a number of consensus binding
motifs for small molecules have been identified and are
compiled within databases of sequence motifs such as
PROSITE [8], small molecule binding motifs have been
characterized for only a small fraction of biological
ligands, and even in those cases are far from exhaustive.
For instance, although the p-loop (GxxxxGK(S/T)) is the
most widely occurring [9] and best characterized [10, 11]
of the small molecule binding motifs and is highly predictive for binding of nucleotide triphosphates, many ATP-binding sites do not in fact contain a p-loop motif. Three-dimensional information from crystallographic structures
provides a strong basis for prediction of ligand binding
which, when combined with multiple sequence alignments, can be highly predictive [12, 13]. Databases specifically addressing protein-small molecule interactions
typically involve three-dimensional information [14], but
even when assignment of a protein family from a sequence is possible and structural data is available,
assignment of the small molecule binding site is challenging.
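As a concrete illustration of consensus-based annotation, the p-loop pattern quoted above, GxxxxGK(S/T), maps directly onto a regular expression. The sketch below uses Python's standard re module; the helper name find_p_loops and the two test strings are assumptions introduced for illustration only.

```python
import re

# P-loop consensus as written in the text: GxxxxGK(S/T), where x is any residue.
P_LOOP = re.compile(r"G.{4}GK[ST]")


def find_p_loops(sequence):
    """Return (start_index, matched_text) for non-overlapping P-loop matches."""
    return [(m.start(), m.group()) for m in P_LOOP.finditer(sequence)]


# Illustrative test strings (chosen for this sketch).
with_motif = "MTEYKLVVVGAGGVGKSALTIQ"
without_motif = "MKTAYIAKQRQISFVKSHFSRQ"

hits = find_p_loops(with_motif)
misses = find_p_loops(without_motif)
```

A scan of this kind only finds sites that obey the consensus, which is the limitation noted above: ATP-binding sites that lack a p-loop are invisible to it.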
By studying the patterns of combinatorial peptides binding to common metabolites such as ATP and glucose,
and correlating those sequences with three-dimensional
structures of known metabolite/protein pairs, we have
created a database within RELIC of peptide sequences
which are predictive for metabolite binding in known
protein sequences, along with the computational tools
required to carry out this analysis. RELIC is intended to
provide the scientific community with access to the sequences of peptides selected for affinity to multiple
ligands, as well as access to the tools described above.
The database currently houses over 5000 peptide sequences that have been selected for affinity to small molecule metabolites such as ATP, GTP and glucose and
drugs such as Taxol and Taxotere, as well as random
clones from parent libraries, thereby providing a unique
source of information for the study of the interaction of
proteins with these ligands.
2 Materials and methods
Data integration plays a central role in bioinformatics.
RELIC is a publicly accessible biotechnology system
that utilizes web technology, traditional programming and
a relational database to process and manipulate the experimental data of affinity-selected peptides. In order to seamlessly integrate that biological data, RELIC is based on an
object-oriented design using a relational database management system. For this particular project, the ORACLE 9i
(Release 9.2) (Oracle, Redwood Shores, CA, USA) database
system was chosen to store experimental data and the
relevant genomic/structure information as it provides a
wide array of database drivers for various programming
languages (both for thin and thick clients); hence, data
can be accessed through various programming languages
like JAVA, C++, Active Server Pages (ASP) or PERL. Figure 2a is a diagram depicting the logical and relational model
of the database by displaying all tables and inter-table relationships. To increase database efficiency, packages and
stored procedures were developed that reduce the amount
of information sent over the network to the database. An
additional benefit is database security, as the stored procedures do not expose the SQL code, making it very difficult
for an intruder to damage or harm the database.
The program architecture for the system, shown in
Figs. 2a and b, was driven by several factors including
the need for efficiency in data storage and retrieval, enforcement of data relationships and use of legacy code. A
crucial factor in the database design was optimization of
data storage and retrieval. The database had to be highly
normalized with little or no data redundancy. A second
key factor was enforcing data relationships through referential integrity (RI). To enforce RI and normalization,
indexes with keys were defined and foreign key constraints were devised. As a result, not only is the number
of orphan records greatly reduced, but the need for creating programs that are designed to look for duplicate data
or orphan records is eliminated. In addition, data retrieval
is facilitated, since the indexes reduce the number and intensity
of table scans that occur in order to retrieve the data.
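The effect of primary keys, uniqueness constraints and foreign keys can be demonstrated with a small stand-in. The sketch below uses SQLite through Python's sqlite3 module rather than Oracle 9i, and the two-table schema (experiment, peptide) is hypothetical; it is not the RELIC schema shown in Fig. 2a.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical schema: each peptide must belong to an existing experiment.
conn.execute("""
    CREATE TABLE experiment (
        experiment_id INTEGER PRIMARY KEY,
        ligand        TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE peptide (
        peptide_id    INTEGER PRIMARY KEY,
        experiment_id INTEGER NOT NULL REFERENCES experiment(experiment_id),
        sequence      TEXT NOT NULL,
        UNIQUE (experiment_id, sequence)
    )
""")

conn.execute("INSERT INTO experiment (experiment_id, ligand) VALUES (1, 'ATP')")
conn.execute("INSERT INTO peptide (experiment_id, sequence) VALUES (1, 'GKSTLAGKSAHW')")


def try_insert(experiment_id, sequence):
    """Attempt an insert; return True on success, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO peptide (experiment_id, sequence) VALUES (?, ?)",
            (experiment_id, sequence),
        )
        return True
    except sqlite3.IntegrityError:
        return False


orphan_accepted = try_insert(99, "AAAAAAAAAAAA")     # no parent experiment 99
duplicate_accepted = try_insert(1, "GKSTLAGKSAHW")   # same peptide entered twice
valid_accepted = try_insert(1, "WHASKGALTSKG")       # new peptide, existing parent
```

Because the database itself rejects the orphan and the duplicate, no separate clean-up program is needed, which is the design benefit described above.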
Figure 2. a) RELIC Database schema. Shown is a simple graphical representation of the RELIC Oracle schema. It displays
the table name, table fields, primary keys, foreign keys and hierarchical table relations that are enforced throughout the
database schema. The primary key fields illustrate how the data are logically and physically arranged for extractions and
insertions and stop duplicate data from being entered into the database. The figure also displays the foreign key relationship between tables; i.e. the logical way each table interacts with one another. The foreign keys enforce referential
integrity by making sure key data exists in the parent table before it allows data insertion into a child table. b) A RELIC
user submits data for processing via a web interface. The user input and job information is stored in a RELIC database. A
job processing service periodically checks for pending jobs and processes them using the scientific algorithms developed
in FORTRAN, using COM+ interfaces. The user is sent an e-mail upon completion of the job with a link to the output.
Another challenge for designing the system was the need
to move text files from a remote client computer into the
database system. This was made possible with ASP.NET
using C# (Microsoft, Redmond, WA, USA). ASP.NET contains mechanisms for directly reading files through a web
browser and storing the information to a remote place; in
this case, the Oracle database.
Driven by legacy code and the need to modernize, the
RELIC system utilizes a myriad of programming languages. RELIC receives user input via a web interface
(ASP) and passes the input to application programs. The
user input directories are uploaded to the server using
JUpload applet (Persits Software, New York, NY, USA).
Chart Director library (Advanced Software Engineering,
Kowloon, Hong Kong) is used to generate plots in real
time. At the time the system was designed, ASP provided
the quickest internet user interface, greatly outperforming
PHP, Java and its derivatives. The ASP is constructed in
HTML, with Dynamic HTML and JavaScript on the client-side and VBScript on the server-side. These programs
interface with the ASP pages through Component Object
Model (COM+) wrappers written in Visual C++ utilizing
the Active Template Library. The COM+ objects interface
with FORTRAN programs that process the data. For
security reasons, data access and data processing have
been encapsulated in Visual Basic COM+ objects. The
processed information is then stored in an Oracle 9i database, where a queuing system has been developed in
which a Windows system service will constantly check
for and process queued jobs. When the system completes the user’s job, an e-mail is sent out to a user-specified address indicating that the job is completed, with
an attached link leading to the results page. The system
retrieves data through the use of ActiveX Data Objects
through an OLE DB access layer. RELIC directly uses
database drivers developed by Oracle.
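The queue-and-notify flow described above can be sketched as follows. This is an in-memory stand-in written for illustration: the Job and JobQueue classes, the handler dictionary and the e-mail address are hypothetical, a list replaces the Oracle job table, and a recorded tuple replaces the completion e-mail.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: int
    program: str   # name of the analysis to run
    payload: str   # user input, e.g. FASTA text
    email: str
    status: str = "pending"
    result: str = ""


@dataclass
class JobQueue:
    jobs: list = field(default_factory=list)
    sent_mail: list = field(default_factory=list)

    def submit(self, job):
        self.jobs.append(job)

    def poll_once(self, handlers):
        """Process every pending job once; return the number processed."""
        processed = 0
        for job in self.jobs:
            if job.status != "pending":
                continue
            handler = handlers.get(job.program)
            if handler is None:
                job.status = "failed"
            else:
                job.result = handler(job.payload)
                job.status = "done"
            # Stand-in for the completion e-mail with a link to the results page.
            self.sent_mail.append((job.email, job.job_id, job.status))
            processed += 1
        return processed


def count_fasta_records(text):
    """Hypothetical handler: counts FASTA headers instead of running an analysis."""
    return str(sum(1 for line in text.splitlines() if line.startswith(">")))


handlers = {"COUNT": count_fasta_records}

queue = JobQueue()
queue.submit(Job(1, "COUNT", ">p1\nGKSTLAGKSAHW\n>p2\nWHASKGALTSKG\n", "user@example.org"))
queue.submit(Job(2, "UNKNOWN", "", "user@example.org"))
first_pass = queue.poll_once(handlers)
second_pass = queue.poll_once(handlers)
```

A long-running service would call poll_once in a loop with a delay between passes; decoupling submission from processing in this way is what lets the web front end return immediately while jobs run elsewhere.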
The original software for RELIC consisted of 15 different
independent programs and was developed in FORTRAN
with a DOS based, command-line user interface. The
DOS based applications were converted into web based
applications by creating COM+ wrappers around the
legacy code. To increase usability, the COM+ objects
allow for a single interface to the different programs. Use
of this configuration will also allow for seamless future
upgrades. The challenge was to incorporate each of the
programs into a single easy to use interface which is
scalable to accommodate a large number of users. One
possible scenario was to rewrite all the programs into
a single program using C++ and Microsoft Foundation
Classes, allowing the user interface to be constructed
using traditional Windows forms. This, however, would
have severely limited distribution and upgrades, and
users would constantly need to check for upgrades before using the software. To avoid this impediment, a three
tier system was created consisting of a web based user
interface; COM+ objects for the processing layer; and a
database for the final layer.
A DELL Power Edge 2500 server, with dual 1.13 GHz processors, 4GB RAM and a RAID5 SCSI disk array with
136GB disk space is used to run the web site and database (Dell, Round Rock, TX, USA). The database is
spread on the disk array avoiding disk contentions and
providing high performance. This server uses Windows
2003 operating system and Microsoft’s Internet Information Server (IIS) 6.x as the web server (Microsoft). To
improve the performance, RELIC jobs are run on a separate server, DELL Power Edge 3250 server, with dual Intel
64-bit 1.3 GHz processors, 4GB RAM and an 18GB
SCSI disk. This server runs the Windows 2003 64-bit
operating system.
3 Results and discussion
3.1 Algorithm and implementation
The RELIC programs page reproduced in Fig. 1 has a
short description of each program along with reference
to more detailed discussions in published papers. RELIC
has a user-friendly interface with copy and paste options,
upload features, and select buttons. The software applications allow for either query of input protein sequences
against the RELIC-stored affinity-selected peptide data,
or entry of a user’s own FASTA formatted peptide sequence data for analysis. An e-mail to return program output is provided for all jobs. Sample applications which
illustrate the use of RELIC software are shown under
Help and include sample input as well as an explanation
of the output format. Each program and the algorithm
behind its operation are described individually in the following section.
3.1.1 DNA2PRO12 and DNA2PRO7
These two programs are designed to scan raw DNA sequence text file output from a chromatogram analysis
program, locate the insert within the vector, and translate
the insert sequences into peptide sequences from the
two NEB phage display libraries Ph.D.-12 and Ph.D.-C7C. The output FASTA format peptide sequence list
can then be used as input for further analysis using other
RELIC software. The programs are designed to flag any
entries which are questionable with regard to poor or
ambiguous sequence data, including deletion mutants
and parental clones. These are the only two RELIC programs in which the input is restricted: the beginning and
end sequences are from the NEB phage vector, and the
resulting peptide output length will be either 12 or 7 amino
acids. Updated software capable of translation from any
insert length DNA sequence within any vector will be
available in the next version, DNA2PRO. Prior to that
upgrade, DNA2PRO12/7 can be modified by request to
the RELIC administrator ([email protected]).
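The scan-and-translate step can be sketched as below. This is a minimal illustration and not the DNA2PRO code: the flank sequences LEFT_FLANK and RIGHT_FLANK are invented placeholders rather than the NEB vector sequences, the read is assumed to be given on the coding strand, and the standard genetic code is used without any host-specific stop codon suppression.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Hypothetical flanking sequences (placeholders, NOT the real vector sequences).
LEFT_FLANK = "AAGGTACC"
RIGHT_FLANK = "GGATCCTT"


def extract_insert(read, insert_len, left=LEFT_FLANK, right=RIGHT_FLANK):
    """Return the insert between the flanks, or None if the read should be flagged."""
    read = read.upper()
    start = read.find(left)
    if start == -1:
        return None                      # left flank not found
    start += len(left)
    end = read.find(right, start)
    if end == -1 or end - start != insert_len:
        return None                      # missing flank or wrong insert size
    insert = read[start:end]
    if set(insert) - set("ACGT"):
        return None                      # ambiguous base calls such as N
    return insert


def translate(dna):
    """Translate complete codons of `dna` into one-letter amino acid codes."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Toy reads: one with a 36-nucleotide insert (12 codons), one with a deletion.
insert_dna = "GCTCATTGGAAACCGTCTACTCTTGGTAAGTCTGCT"
good_read = "TTTT" + LEFT_FLANK + insert_dna + RIGHT_FLANK + "TTTT"
short_read = "TTTT" + LEFT_FLANK + insert_dna[:30] + RIGHT_FLANK + "TTTT"

good_insert = extract_insert(good_read, 36)
peptide = translate(good_insert)
flagged = extract_insert(short_read, 36)
```

Requiring an exact insert length is what allows deletion mutants and insert-free parental clones to be flagged rather than silently translated, and swapping in other flanks and lengths is the kind of change that adapting the approach to another display construct would require.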
3.1.2 AAFREQ
The AAFREQ program calculates the frequency of occurrence of each of the 20 amino acids at each recombinant
insert position, as well as the overall position-independent frequency of each amino acid (i.e. amino acid composition) within that set of peptide sequences. Statistical
significance of the output data has been demonstrated
for input of 50 or more peptide sequences. AAFREQ is a
useful program for pinpointing amino acid sequence
biases, both overall (e.g. a dearth of cysteine residues at
all positions; [15, 16]) and in a particular insert position
(e.g. a bias against arginine residues at position +1 after
the signal peptidase cleavage site due to bias by that processing enzyme; [1]). Use of AAFREQ analysis of a ligand-selected set of peptides in conjunction with an AAFREQ
analysis of at least 50 peptides chosen randomly from the
naïve library can identify position-dependent motifs and
biases within the selected population which are due
solely to the affinity selection process (see [1] and below
for examples of practical applications).
3.1.3 POPDIV
The quality and utility of a display library is a function
of many properties, one of the most important being
sequence representation or diversity. The higher the percentage of all theoretically possible sequences which are
actually physically present within the phage population,
the more likely it is that affinity screening will generate
results which are pertinent to the experimental goals.
The only unequivocal method for measuring the completeness of a library would be to sequence each and every
member subsequent to initial amplification. In place of
this absolute number, however, a well-carried out population survey analogous to a political poll can generate a
rough estimate of the completeness of a display library.
The POPDIV program uses just such a statistical sampling method to estimate the sequence diversity of a
combinatorial peptide library based upon sequences
obtained from a limited number of randomly sampled
members of the library. A minimum of 50 sequences is
required to obtain statistically significant estimation of
the true diversity value. The calculation is premised upon
the assumption that if two peptide populations have the
same sequence diversity, the probability of choosing the
same member twice in a random selection from the two
populations should be the same (see discussion in [2]).
The breadth of the error bars on the generated estimate
is solely dependent on the number of random sequences
analyzed, as the diversity is measured per amino acid
and is thus independent of recombinant insert length.
This feature makes POPDIV both easy to use and to interpret.
As an example, POPDIV was used to calculate the diversity of a random 12 amino acid display library based upon
100 randomly sequenced members of that library. A
100% complete library with this size insert would be
expected to generate a total of 20^12 or 4.096 × 10^15 possible sequences. The 100 members polled gave a diversity
value of 0.04 with a standard deviation of 0.02. This value
indicates that the 12 mer library is statistically indistinguishable from a 12 mer library that contains only 4% of
all possible sequences or 0.04 × (4 × 10^15) or 1.6 × 10^14
peptides, indicating that a number of theoretically possible peptide sequences are either underrepresented or
absent. The sequence diversity of a population of peptides should go down subsequent to a round of affinity
selection, as the population is becoming enriched for a
subset of sequences which exhibit high affinity for the
ligand. This prediction makes POPDIV a useful tool for
the monitoring of multiple rounds of biopanning or for
the comparison of multiple constructed combinatorial
peptide libraries.
3.1.4 AADIV
AADIV is a composite program which combines the algorithms of both AAFREQ and POPDIV into one program for
convenience. The input/output formats are the same as
for the two individual programs. AADIV is a valuable starting point for the analysis of a set of combinatorial peptide
sequences from either a constructed parent library or in
conjunction with those from an affinity screening experiment out of that parent library, as it can quickly indicate
whether or not there is adequate sequence diversity in
the starting set of peptides and/or whether there is any
ligand-selected bias in the sequences pulled out from
that set.
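The two kinds of calculation combined in AADIV can be sketched as follows. The frequency part follows the AAFREQ description directly. The diversity part is only one simple way to turn the "same member twice" premise into a number, assuming independent positions and using the ratio of the uniform coincidence probability (1/20) to the observed one; it should not be read as the POPDIV formula, for which see [2]. All function names and the toy peptides are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def unique_peptides(peptides):
    """Drop duplicate sequences while keeping first-seen order."""
    return list(dict.fromkeys(peptides))


def position_frequencies(peptides):
    """Return one {amino_acid: frequency} dict per insert position."""
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    freqs = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        freqs.append({aa: counts.get(aa, 0) / len(peptides) for aa in AMINO_ACIDS})
    return freqs


def composition(peptides):
    """Position-independent frequency of each amino acid."""
    counts = Counter("".join(peptides))
    total = sum(counts.values())
    return {aa: counts.get(aa, 0) / total for aa in AMINO_ACIDS}


def positional_diversity(freq):
    """1 / (20 * sum of f^2): 1.0 for uniform usage, 0.05 for a single residue."""
    return 1.0 / (len(AMINO_ACIDS) * sum(f * f for f in freq.values()))


def population_diversity(peptides):
    """Product of positional diversities (assumes independent positions)."""
    result = 1.0
    for freq in position_frequencies(peptides):
        result *= positional_diversity(freq)
    return result


# Toy 3-mer population (hypothetical data); the repeated "ACD" is removed first.
sample = unique_peptides(["ACD", "ACE", "GCD", "ACD"])
freqs = position_frequencies(sample)
overall = composition(sample)
diversity = population_diversity(sample)
```

Multiplying a diversity fraction of this kind by the number of theoretically possible sequences gives an effective library size, as in the 0.04 × (4 × 10^15) calculation above. With only a handful of sequences the estimate is strongly biased by sampling, which is consistent with the recommended minimum of 50 sequences.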
3.1.5 INFO
Although measurement of a change in amino acid frequency distribution patterns or sequence
diversity is a sign of successful biopanning, these parameters are characteristics of a set
of peptide sequences, not of individual peptides. INFO is a program that provides a
mathematical measure of the probability of observing a particular peptide sequence by random
chance (i.e. nonspecific binding) as opposed to by selection for a specific property (i.e.
ligand affinity). The statistical calculations carried out by INFO are based in principle upon
the 1948 paper by Shannon [17] on the theory of information. Although Shannon originally
developed the concept of information within the context of communication, the basic element of
this theory, information as a decrease in uncertainty, can also be applied to combinatorial
phage display screening. The original peptide sequence diversity and pattern distribution as it
is laid out within the parent library becomes distorted during multiple rounds of affinity
selection and broth amplification, much like Shannon describes signal perturbation by noise.
Carrying this analogy further, “if the signal is altered in a reasonable way by the noise, the
original can still be recovered” [17]. The wanted signal in which a researcher is interested
within the final set of peptide sequences is that resulting from affinity to a ligand.
Superimposed upon this affinity signal, however, is the unwanted or extraneous information (or
noise) introduced into the peptide sequences by the filtering effects of the various life cycle
stages of the phage vector [1, 4, 5]. As the noise in the sequence patterns is a regular, or
nonrandom but measurable quantity, it can theoretically be subtracted to a certain extent from
the signal, which is the regular but unknown sequence pattern present due to affinity selection.
How can this biological life-cycle-induced noise be measured? The steps involved in the
generation of the library, from recombinant DNA construction to amplified phage particles,
recreate the patterns of amino acid sequence bias within the peptides that are responsible for
what we have defined as noise during multiple rounds of affinity selection (i.e. the biological
step of phage amplification as opposed to the chemical step of affinity selection). In other
words, the selection process for good growth characteristics which occurs during multiple
rounds of amplification is similarly occurring during naïve library amplification. The
selection that occurs during both procedures creates subgroups of phage particles with superior
growth properties within both libraries which are present in many copies (Fig. 3). The sequence
biases present in both pools of phage which are due to positive growth attributes can be
estimated from the properties of the naïve library and thus theoretically subtracted from the
affinity-selected library, leaving only the sequence biases introduced by ligand affinity. This
mathematical subtraction process is used not only by INFO, but is also an option in the RELIC
programs HETEROalign, MATCH, and FASTAskan to reduce the noise in affinity-selected peptide
sequences and amplify the signal, i.e. affinity to ligand characteristics. The phage that are
present in high copy number in the amplified naïve library (to the left in Fig. 3), by
definition have low information content as their presence in the final affinity-selected pool
could easily be the result of inadvertent inclusion via nonspecific or poor binding followed by
a round of good growth. Conversely, phage particles which are low copy or rare within the
original parent library (to the right in Fig. 3), presumably due to poor growth properties,
have high information content, i.e. it is more certain that their presence in the final
affinity-selected pool is due to affinity to ligand.
Figure 3. A plot of the distribution of individual peptide sequences within an imaginary
combinatorial peptide display library arrayed by copy number. Peptide sequences are listed
along the abscissa in order of numerical representation from highest copy number out to lowest
copy number, with copy number along the ordinate. Information theory dictates that those phage
clones present in large numbers (i.e. the left side of the plot) inherently possess less
information than the relatively rare clones at the right-hand side of the plot.
How does the INFO program calculate a numerical quantity for information, i.e. individual amino
acid sequence bias patterns? The number is based upon the estimated occurrence or
representation of that individual peptide sequence within the original unscreened or parent
library. Two input files are required for INFO: a text file with an ideal minimum of 50 peptide
sequences from clones randomly selected from a naïve library, and a second file which contains
one or more peptide sequences affinity-selected from that same library. INFO first uses AAFREQ
to calculate the amino acid frequency distributions at each position of the insert in the 50
peptide sequences from the parent combinatorial library. From the observed position-specific
frequencies of amino acids in the unselected library, the probability of random observation of
any one peptide ($P_N$) can be calculated by multiplying the probability of each amino acid
occurring at each position within the peptide along the length of the sequence
($(P_1)(P_2)(P_3)\cdots(P_N) = P_{\mathrm{total}}$, where N = the number of residues in the
recombinant insert). Since the calculated probability $P_N$ of any one specific peptide is a
very small number (due to serial multiplication of many position-specific probabilities), we
define an associated information parameter for that probability $P_N$ where
information = $-\ln(P_N)$. For example, a fairly common 12 mer peptide may have a probability of
occurrence of $4.6 \times 10^{-13}$, which would translate into an information content of
$-\ln(4.6 \times 10^{-13})$ or 28.4. A rare 12 mer peptide sequence could be less represented
in the parent library by two orders of magnitude or more, such as $4.6 \times 10^{-15}$ with an
associated information content of 33.0. Higher information levels correspond to less probable
amino acid sequences within the insert and typically correspond to sequences less favorable to
viral growth. The smaller the probability of random occurrence (i.e. the larger the associated
information), the greater the chance that this peptide was observed in the affinity-selected
pool due to specific binding to the target (signal) as opposed to positive growth
characteristics (noise). Two different real-life applications of the INFO program to affinity
selection experiments are described in Section 3.2.2. Note that since information content is a
natural logarithm function, a difference of 4 in information between 2 peptides translates to a
factor of 110 difference in their estimated representation within the parent library.
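The calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the INFO source code: the function names `position_frequencies` and `information` are hypothetical, and the pseudocount (used so that a residue never seen at a position in a small naïve sample does not receive a probability of zero) is an assumption of this sketch rather than something described in the text.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_frequencies(peptides, pseudocount=1.0):
    """Position-specific amino acid frequencies for equal-length peptides.

    The pseudocount is an assumption of this sketch, not a documented
    feature of AAFREQ or INFO.
    """
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    total = len(peptides) + pseudocount * len(AMINO_ACIDS)
    freqs = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        freqs.append({aa: (counts.get(aa, 0) + pseudocount) / total
                      for aa in AMINO_ACIDS})
    return freqs

def information(peptide, freqs):
    """Return -ln(P), with P the product of the per-position probabilities."""
    # Summing logarithms avoids underflow from multiplying small numbers.
    return -sum(math.log(freqs[pos][aa]) for pos, aa in enumerate(peptide))

# Toy naive-library sample (far fewer than the recommended 50 sequences).
naive = ["AAA", "AAC", "ACA", "CAA"]
freqs = position_frequencies(naive)
print(information("AAA", freqs))  # common pattern: lower information
print(information("WWW", freqs))  # unseen pattern: higher information
```

Peptides containing an unknown residue (denoted x in the AAFREQ output) would raise a `KeyError` here and would need separate handling.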
3.1.6 MOTIF
A number of algorithmic and heuristic approaches have been taken to detect weak sequence
similarities within practicable computation times, including the Smith-Waterman algorithm [18],
FASTA [19], BLAST [20, 21] and ParAlign [22]. These bioinformatic tools, however, have been
developed with, and optimized for, long protein sequences. They are ill-suited for use in the
analysis of combinatorial phage display data which consist of short peptide lengths. In
addition, given both the low number of peptide sequences present in most phage libraries as
compared to theoretically predicted values and the functional plasticity of amino acid side
chains, the search for consensus sequences needs to be mathematically flexible. The suite of
MOTIF programs is a group of motif-hunting algorithms which score similarity and group
inclusion by tallying different combinations of motif properties, thereby emphasizing different
search goals. Weak sequence motifs within short peptide sequence populations, however, can be
readily identified with the three programs in RELIC that search for motifs within the peptide
population, MOTIF1, MOTIF2 and MOTIF3. MOTIF1 uses as input a peptide sequence file in text or
FASTA format and searches for user-specified contiguous amino acid sequence motifs within that
population such as the group FVS, FLT, FLS, YVT and YLT. Alignment of short stretches such as
these may aid in the identification of weaker consensus sequences on either side of this anchor
sequence within the peptide family. The output from MOTIF1 is a list of motifs with
conservative substitutions and the locations of the motifs within the peptide sequence.
MOTIF2 searches for patterns of 3 amino acids and does not allow conservative amino acid
substitutions, but does allow identical gap lengths, such as the pair SWQLAP and SAQTSP, with
the motif being SXQXXP. A conserved motif of this type is possible for peptides which are long
enough to generate partial secondary structure, thus enabling noncontinuous sequence
conservation but contiguous spatial conservation. MOTIF3 searches for motifs containing 4 amino
acids, with the minimum number of occurrences specified by the user, and outputs peptides that
have identities in at least three of the four amino acids in the motif. Both MOTIF2 and MOTIF3
allow for penalty-free gaps providing the gap length is identical for all motif members; i.e.
G Q A H Q L S and L M A H Q A S. Output for all three programs includes identification of the
parent motif, the length of the motif, and an alignment of all the motif-bearing peptide
sequences (with amino acids within the motif flagged in color) as shown in Fig. 4.
Figure 4. An actual output file from MOTIF2 is depicted. The input file was a text format list
of peptide sequences obtained from two rounds of affinity screening with immobilized GTP. Each
motif is identified separately, with the actual number of peptides possessing that sequence
motif listed on the right-hand side, and the sequences aligned beneath. Individual residues
which have been identified as part of the motif sequence are colored for each input peptide.
3.1.7 MATCH
Previous experience with small molecule binding peptides has demonstrated that even in the
absence of a clearly identifiable consensus sequence motif, weaker conserved sequence patterns
may be embedded within the data. These peptide sequences may not be exact matches for regions
of a protein sequence, but when simultaneously aligned up against the entire protein sequence
may cluster together within one region of the protein. Bioinformatic summation of multiple
binding peptide sequences to generate a type of cumulative sequence signature has been shown to
yield valuable information regarding protein-small molecule interactions [6, 7]. The
calculation of similarity between the collective sequences of a population of relatively short
peptides and the sequence of a naturally occurring protein, however, raises certain algorithmic
problems (manuscript in preparation). Although it can be calculated using a standard similarity
matrix such as BLOSUM62 with a short window (i.e. 5 to 6 amino acids in length), that
calculation produces three problems: (1) occurrences of rare amino acids like tryptophan are
weighted so high that they tend to overwhelm even perfect matches across 4 to 5 amino acid
segments; (2) the cumulative noise levels from poor matches overwhelm less common but
meaningful matches; and (3) sequence biases in combinatorial libraries tend to result in
differential weighting of motifs. The first problem can be minimized by renormalizing the
BLOSUM62 matrix so that the diagonal (or identity) terms are all equal. The second problem can
be minimized by using a cut-off below which the individual calculated similarity is discarded
from the final calculation sum. Extensive analysis of well characterized systems has led to our
use of a 5 amino acid window and an experimentally determined cut-off corresponding to three
identities and one similarity across the window (L. Makowski; manuscript in preparation).
Although numerous strategies for offsetting the biases inherent within the peptide libraries
used have been tested, none have universally improved on the calculations made in the absence
of correction terms.
The programs MATCH, HETEROalign and FASTAskan all implement this algorithm by identifying any
stretches of amino acid residues within a particular protein that exhibit significant
similarity to a group of affinity-selected peptides. The program works as follows: The first
peptide sequence in the input file is lined up against the designated protein sequence at
protein residue 1 (see Fig. 5a, lines 1 and 2). Each residue within that protein sequence which
is being compared to a peptide residue is given a modified BLOSUM62 score. The peptide sequence
is then realigned with the protein sequence starting at protein residue 2 (see lines 3 and 4 in
Fig. 5a). A second set of scores is calculated for each of the protein residues involved in
this second alignment. The peptide sequence is then realigned at protein residue 3, rescored,
and then at protein residue 4, etc. until the first peptide sequence is aligned with the
protein sequence at the carboxy-terminal end of the protein. This same process is subsequently
carried out for each peptide in the user-designated pool of peptide sequences. The software
programs then tally the cumulative score for each amino acid residue within the protein
sequence using the alignments above the cut-off value. If there is a significant fraction of
peptides within the input sequences which exhibit high similarity to a portion of the protein
sequence, even in the absence of exact sequence matches within a single peptide, these peptides
will cluster underneath that region of the protein in the alignment output (see Fig. 5b). The
cumulative score values for those protein residues within that region will be high (see output
b, Fig. 5b). For certain protein/peptide population combinations it has been observed that
chance similarity between target-selected consensus motifs and growth-related motifs can cause
significant noise within the cumulative similarity scores. In these situations, it is possible
to use MATCH, HETEROalign, and FASTAskan to carry out the same process with a set of peptide
sequences randomly selected from the parent library, and then subtract the cumulative residue
scores of this analysis from the affinity-selected scores across the entire length of the
protein sequence. This generates a net similarity score reflecting peptide alignment to the
designated protein with positive growth motifs subtracted out.
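The sliding alignment and cumulative tally used by MATCH can be illustrated with a simplified sketch. Everything specific in it is an assumption: the identity and similarity scores (4 and 1) and the residue groups are placeholders standing in for the renormalized BLOSUM62 matrix, and applying the cut-off to every 5-residue window within each alignment is one plausible reading of the procedure, not a description of the RELIC implementation.

```python
# Placeholder scoring; the real programs use a modified BLOSUM62 matrix.
IDENTITY, SIMILAR = 4, 1
GROUPS = ("ILVM", "FYW", "KRH", "DE", "ST", "NQ", "AG")  # illustrative only
WINDOW = 5
CUTOFF = 3 * IDENTITY + SIMILAR  # three identities plus one similarity

def pair_score(a, b):
    """Score one protein residue against one peptide residue."""
    if a == b:
        return IDENTITY
    if any(a in g and b in g for g in GROUPS):
        return SIMILAR
    return 0

def cumulative_scores(protein, peptides):
    """Per-residue similarity summed over all peptides and all offsets."""
    totals = [0] * len(protein)
    for pep in peptides:
        # Slide the peptide along the protein one residue at a time.
        for start in range(len(protein) - WINDOW + 1):
            n = min(len(pep), len(protein) - start)
            scores = [pair_score(protein[start + i], pep[i]) for i in range(n)]
            kept = set()
            # Keep only residues inside a window that reaches the cut-off.
            for w in range(n - WINDOW + 1):
                if sum(scores[w:w + WINDOW]) >= CUTOFF:
                    kept.update(range(w, w + WINDOW))
            for i in kept:
                totals[start + i] += scores[i]
    return totals

def net_scores(protein, selected, background):
    """Selected-pool scores minus scores from unselected library clones."""
    sel = cumulative_scores(protein, selected)
    bg = cumulative_scores(protein, background)
    return [s - b for s, b in zip(sel, bg)]

protein = "MKTAYIAKQRGGSWLTPE"  # made-up sequence
print(cumulative_scores(protein, ["AYIAK", "AYLAK"]))
```

In this toy run both peptides align at protein residues 4 to 8, so those residues accumulate the highest scores, which is the clustering behavior described for Fig. 5b.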
Figure 5. a) The MATCH program algorithm carries out sequential rounds of similarity calculations
between a single peptide sequence (VITGKKAILLGE in the figure) along the length of a single protein
sequence (top line of each round shown). An identical amino acid match (red) is given a higher score
than a conservative replacement (green). Each round generates a number score for each amino acid
residue within the protein sequence. These sequential similarity calculations are then carried out for all
of the individual peptide sequences within the input file. b) MATCH carries out a final summation step
for each amino acid residue within the input protein sequence to generate an aggregate residue-specific similarity score as shown in output b. These similarity scores reflect the cumulative similarity between all of the input affinity-selected peptide sequences and the input protein sequence per amino
acid residue. Output a depicts the cluster diagram for the peptide sequences in which each peptide is
aligned to the input protein sequence at the position of maximal similarity.
3.1.8 HETEROalign
The protein-length cumulative similarity score output of
MATCH can be visualized in fold space if the three-dimensional coordinates of the protein are available. HETEROalign uses the algorithm from MATCH to compare an input
list of ligand-selected peptide sequences to a protein
sequence embedded within a PDB file. The second output from HETEROalign consists of a cluster diagram
(similar to MATCH output) which depicts the alignment of
peptides to the protein sequence along with the heteroatom-contacting residues demarcated with X along the
top of the protein sequence (see Fig. 6c). Heteroatom
contact distance is defined as the minimum interatomic
distance between a ligand atom center and an amino
acid atom center. An option is available in RELIC to define
the distance by the user at a particular value for a particular application. Although this program was originally
designed to analyze protein-small molecule ligand interactions,
in the absence of a heteroatom within the PDB
file a cluster diagram will still be generated by HETEROalign, albeit with no X marks to indicate contact points
above the protein sequence.
Output 1 from HETEROalign is a PDB file modified from
the input PDB file in which the temperature factor for
each residue has been replaced by the cumulative similarity score. Visualization of this new PDB file with a standard package such as RasMol creates a three-dimensional
rendering of the protein in which red regions are those
with highest similarity to the input peptide sequences
and blue are regions of no similarity (by choosing temperature for color; see Figs. 6a and b). This output format
allows the user to quickly ascertain where the region(s)
of highest similarity to the input peptide sequences are
within the protein structure in terms of spatial proximity
to each other and/or to heteroatoms; and their proximity
to the surface. For example, in the pictures shown in Figs. 6a and b the region of highest
similarity to the input ATP-selected peptide sequences lies within the p-loop region in contact
with the ATP. A third output format from HETEROalign is a list of amino acid residues with
their respective similarity scores, as shown for MATCH in Fig. 5b (b).
Figure 6. Three depictions of the similarity between the sequence of phosphoenol pyruvate
carboxykinase and a set of 400 unconstrained 12 mer peptides selected on the basis of their
affinity to ATP through two rounds of biopanning (Makowski et al., manuscript in preparation).
a) A rendering of the entire protein using RasMol [25] and coding the similarity by color where
red is the highest similarity and blue the lowest. b) A close up view of the ATP binding site
rendered with RasMol with the loop predicted to be involved in ATP binding rendered as a solid
ribbon. c) An alignment of ATP-selected peptides with the portion of the sequence of
phosphoenol pyruvate carboxykinase exhibiting the highest similarity to the ATP-selected
peptides. Residues exhibiting identity or similarity to the protein sequence are highlighted in
red or orange. Most peptides align with the segment just downstream of the p-loop.
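The first HETEROalign output, a PDB file whose temperature factors carry the cumulative similarity scores, can be approximated as follows. This is a hedged sketch, not the HETEROalign code: the function name `write_scores_as_bfactors` and the score dictionary are hypothetical, chain identifiers are ignored for brevity, and the only external fact relied upon is the fixed-column PDB layout (residue number in columns 23-26, temperature factor in columns 61-66).

```python
def write_scores_as_bfactors(pdb_lines, residue_scores):
    """Return PDB lines with the temperature factor replaced by a score.

    residue_scores maps residue sequence number (int) to a similarity
    score; residues without an entry receive 0.00. Chain IDs are ignored,
    which is a simplification of this sketch.
    """
    out = []
    for line in pdb_lines:
        if line.startswith("ATOM") and len(line) >= 66:
            res_seq = int(line[22:26])            # columns 23-26
            score = residue_scores.get(res_seq, 0.0)
            # Columns 61-66 hold the temperature factor (format 6.2f).
            line = line[:60] + f"{score:6.2f}" + line[66:]
        out.append(line)
    return out

# One made-up ATOM record built column by column.
atom = ("ATOM  " + "%5d" % 1 + " " + " N  " + " " + "MET" + " " + "A"
        + "%4d" % 1 + "    " + "%8.3f%8.3f%8.3f" % (27.34, 24.43, 2.614)
        + "%6.2f%6.2f" % (1.0, 9.67) + " " * 10 + " N")
print(write_scores_as_bfactors([atom], {1: 12.5})[0])
```

Coloring the resulting file by temperature in a viewer such as RasMol then maps high scores to one end of the color scale, as in Figs. 6a and b.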
3.1.9 DistSim and CLOSEcon
DistSim is a program which measures the correlation between the HETEROalign-determined
similarity score for a particular amino acid residue within a protein and the distance of that
residue from a ligand bound to the protein. DistSim uses as input the HETEROalign generated PDB
file and plots the cumulative similarity score on the abscissa axis versus the distance from a
ligand within that protein as determined from the cocrystal structure on the ordinate axis. It
was designed as a measure of effectiveness for affinity selection experiments utilizing a
heterogroup as the affinity bait. As an example, in Fig. 7 it can be seen that the seven amino
acid residues within protein 1AYL which exhibit the highest cumulative similarity score to a
pool of ATP-selected peptides (lower right-hand corner of graph) are all within 10 Å of the ATP
in the crystal structure (below the horizontal line in the plot). DistSim can only be used for
PDB files (generated by HETEROalign) with heteroatom coordinates in conjunction with a set of
peptide sequences.
Figure 7. The relationship between similarity and distance to the ATP as calculated by DistSim.
Each data point corresponds to one amino acid. The points in the lower right correspond to the
loop with the highest similarity, colored red in Figs. 6a and b.
3.1.10 CLOSEcon
The related software CLOSEcon is similarly limited in application to PDB files with designated
heteroatom coordinates. CLOSEcon provides a list of all the amino acid residues within a
structure which are in contact (as defined by a user input maximum interatomic distance) with a
heteroatom on the basis of crystallographic coordinates. A PDB file is used as the input, with
the output being a list of contiguous amino acid residue sequences (including single residue or
punctate contacts). Multiple PDB files which contain the same ligand can be analyzed with
CLOSEcon to obtain a population of amino acid residues which make contact with those ligands.
These contact residues can be extracted and used as input for other RELIC programs. For
instance, the amino acid frequencies can be calculated with AAFREQ, or both contiguous and
noncontiguous motifs within these peptide segments can be identified using the MOTIF programs.
This type of analysis allows for comparison of crystallographically determined contact
sequences with phage display-derived affinity selection sequences.
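A contact search of the kind CLOSEcon performs can be sketched as below. The function names, the default 4.0 Å cut-off and the optional `ligand_name` filter are assumptions made for illustration; the sketch returns individual contact residues and does not group them into contiguous segments as the CLOSEcon output does.

```python
import math

def parse_atoms(pdb_lines, record):
    """Extract (residue name, chain, residue number, xyz) per atom record."""
    atoms = []
    for line in pdb_lines:
        if line.startswith(record):
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((line[17:20].strip(), line[21], int(line[22:26]), xyz))
    return atoms

def contact_residues(pdb_lines, max_distance=4.0, ligand_name=None):
    """Residues with any atom within max_distance (in Å) of a heteroatom.

    ligand_name optionally restricts the search to one heterogroup
    (for example to exclude water); both defaults are assumptions.
    """
    protein = parse_atoms(pdb_lines, "ATOM")
    ligand = [a for a in parse_atoms(pdb_lines, "HETATM")
              if ligand_name is None or a[0] == ligand_name]
    contacts = set()
    for name, chain, number, xyz in protein:
        if any(math.dist(xyz, lig[3]) <= max_distance for lig in ligand):
            contacts.add((chain, number, name))
    return sorted(contacts)

def make_record(rec, serial, atom, res, chain, num, x, y, z):
    """Build a made-up fixed-column PDB record for the demonstration."""
    return "%-6s%5d %-4s %3s %s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        rec, serial, atom, res, chain, num, x, y, z, 1.0, 0.0)

demo = [
    make_record("ATOM", 1, "CA", "GLY", "A", 10, 0.0, 0.0, 0.0),
    make_record("ATOM", 2, "CA", "LYS", "A", 11, 10.0, 0.0, 0.0),
    make_record("HETATM", 3, "PG", "ATP", "A", 500, 3.0, 0.0, 0.0),
]
print(contact_residues(demo, max_distance=4.0))
```

The same per-residue minimum distance, paired with the similarity score, would give the two coordinates plotted by DistSim.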
3.1.11 FASTAcon
FASTAcon is a short peptide motif scavenging program
which can search a long input list of protein sequences
in FASTA format for the occurrence of user-defined peptide sequences. The program can be directed to look
for either exact matches or close matches (i.e. the user
can define four out of five amino acid residues as a hit).
On the data entry page the RELIC server has links to
whole genome protein sequences from numerous organisms, including multiple bacterial species, mouse and
human, which can be used as input for FASTAcon and/
or FASTAskan. This software is useful as a downstream
tool from the MOTIF suite of programs to pinpoint proteins which possess an epitope similar or identical to an
affinity search-identified consensus sequence. In the
case of a drug-binding motif, FASTAcon can establish
the uniqueness of that motif within the predicted human
proteome.
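The exact and close-match searches can be illustrated with a short sketch. It is not the FASTAcon implementation: `parse_fasta`, `find_motif` and the `min_matches` parameter are hypothetical names, and a "close match" is modeled simply as a minimum number of identical positions (for example four out of five).

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of {header: sequence}."""
    records, name, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records

def find_motif(records, motif, min_matches=None):
    """Return (protein, 1-based position, matched segment) for each hit.

    With min_matches=None only exact matches count; a lower value allows
    mismatches, e.g. min_matches=4 for a four-out-of-five hit.
    """
    if min_matches is None:
        min_matches = len(motif)
    hits = []
    for name, seq in records.items():
        for i in range(len(seq) - len(motif) + 1):
            segment = seq[i:i + len(motif)]
            matches = sum(a == b for a, b in zip(segment, motif))
            if matches >= min_matches:
                hits.append((name, i + 1, segment))
    return hits

# Two made-up protein records.
fasta = ">p1\nMKHTPHPLL\n>p2\nAAHTAHPGG\n"
records = parse_fasta(fasta)
print(find_motif(records, "HTPHP"))
print(find_motif(records, "HTPHP", min_matches=4))
```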
3.1.12 FASTAskan
FASTAskan is a derivative, iterated version of MATCH
in which the cumulative similarity scores for a list of
affinity-selected peptides are calculated within multiple
protein sequences rather than a single protein sequence.
The input protein sequence list can be as long as for
FASTAcon (i.e. all the predicted proteins from a single
genome). FASTAskan will rank the proteins in descending
order by peak similarity score at any one residue within
the protein sequence. This output is such that the proteins
most likely to bind to the ligand used for affinity screening
will be clustered at the top. The peptide sequence results
from a biopanning experiment can therefore be used directly in a ligand-target hunt without the intermediate step
of consensus motif identification.
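The ranking step can be sketched independently of the scoring scheme. In the illustration below, `rank_by_peak` is a hypothetical name, and `toy_scores` is a deliberately crude stand-in scorer (it counts exact 3-residue matches) used only so that the example runs; it is not the MATCH similarity calculation.

```python
def toy_scores(protein, peptides, k=3):
    """Stand-in scorer: count exact k-residue peptide matches per residue."""
    totals = [0] * len(protein)
    for pep in peptides:
        for j in range(len(pep) - k + 1):
            kmer = pep[j:j + k]
            start = protein.find(kmer)
            while start != -1:
                for i in range(start, start + k):
                    totals[i] += 1
                start = protein.find(kmer, start + 1)
    return totals

def rank_by_peak(proteins, peptides, residue_scores=toy_scores):
    """Rank proteins by their highest single-residue cumulative score."""
    ranked = [(name, max(residue_scores(seq, peptides), default=0))
              for name, seq in proteins.items()]
    # Descending order, so the most similar proteins come first.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked

# Made-up protein sequences and peptides.
proteins = {"p1": "MKTAYIAK", "p2": "GGGSWLTP", "p3": "AYIGGG"}
peptides = ["TAYIA", "AYIAK"]
print(rank_by_peak(proteins, peptides))
```

Any per-residue scoring function with the same signature, such as a window-based similarity score, could be passed as `residue_scores`.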
3.2 Programs and sample applications
The following sample applications are examples which utilize the software available on RELIC.
Sample input and output data, as well as explanations of program operation are available on the
RELIC website. To improve database performance, RELIC jobs are processed on two different
servers. A link is e-mailed to each user-supplied address to facilitate data retrieval.
3.2.1 DNA sequence processing
Application 1: A user has multiple raw DNA sequences from phage particles affinity-selected
from the NEB Ph.D.-12 or -C7C libraries against a protein or small molecule ligand. These can
be used as input into the two programs DNA2PRO12 and DNA2PRO7. The programs are designed to
translate the sequences of DNA inserts from these libraries into peptide sequences, although
the parameters in the programs could easily be modified to accommodate other display library
constructs. Both programs automatically locate the position of the insert, translate the insert
and indicate any possible errors in the insert sequence such as unexpected codons or errors in
the surrounding vector/insert junctions. Analysis of DNA sequence data by eye is thus whittled
down to those samples which exhibit sequencing problems. These programs provide the user with a
FASTA format list of peptide sequences which can then be used as input for further analyses
using RELIC or other web-based software.
3.2.2 Peptide population analysis and motif identification
RELIC has seven programs that are designed either to analyze the statistical properties of a
peptide population or identify weak consensus sequences (short amino acid sequences that are
repeated either exactly or almost exactly within the population). These data are particularly
valuable when calculated in conjunction with randomly chosen members of the unselected library,
as library-specific biases can be identified and subtracted.
Application 2: AAFREQ is a program that calculates the frequency of amino acid occurrence
within a peptide population as a function of position within the recombinant insert, thereby
pinpointing which amino acids are over or underrepresented. This program can determine
position-specific and residue-specific values such as how many threonines occur in position 1
of the inserts. Figure 8 includes a representative sample output. The input data was a text
file of 12 amino acid long peptide sequences affinity selected against an immobilized form of
the sugar galactose. Examination of the data in the table indicates that there are no prolines
at the amino-terminal positions of the inserts (insert position 1), but a significant number at
all the other positions. This frequency distribution pattern indicates significant bias against
proline residues immediately adjacent to the signal peptidase cleavage site. Line 4 shows that
there are no cysteines at all in any of the peptides. The value in the right hand column gives
the position-independent frequency of that particular amino acid in the input peptide
population. The total column indicates that proline and threonine are the two most abundant
residues in the sequences.
Figure 8. Sample output from AAFREQ. In this population of peptides, seven had alanine at
position 1 and 6 had alanine at position 2. An unknown amino acid is denoted by x.
To assess changes in these patterns during biopanning, AAFREQ can be used with 50 to 100 phage
clone sequences from the unselected original library as input and then compared with frequency
values calculated from the affinity-selected pool. AAFREQ output using amino acid sequences of
phage randomly selected from the two NEB display libraries (Ph.D.-12 or Ph.D.-C7C) can be
viewed on the RELIC website. Data for other RELIC peptide sequences can also easily be viewed
for any of the peptide analysis programs from the data entry page. AAFREQ, as well as any of
the other RELIC programs (except the DNA2PRO programs) can be used with input peptide sequences
from any type of display library as long as sequences from the parent library are available.
Application 3: A combinatorial peptide phage display library is only as serviceable as the
complexity or diversity of its sequence members. The diversity as calculated here (as defined
in [2]) is a practical measure of the diversity of the population from which the sequences were
selected. Given the impossibility of sequencing a large percentage of any population, the
measure that POPDIV performs cannot calculate the absolute probability of a particular sequence
being present in the population, but instead gives an indication of the functional diversity
available in that population as a whole. By necessity POPDIV was developed as a method for
estimating the sequence diversity of a peptide library from the sequences of a limited number
of the members of that library. The program can calculate the diversity on the basis of the
sequences of as few as 50 peptides randomly selected from any population. This measure is
particularly useful for comparison of two or more populations.
Table 2. Descriptions of three populations with very simple distributions of peptide abundance.
The corresponding diversities are calculated according to [2].
Population | Composition | Diversity
A | 90% of members are present in equal amounts; 10% of members are not present | 0.9
B | 50% of members are present in equal amounts; the other 50% of members are present at twice that abundance | 0.9
C | 75% of members are present in equal amounts; 25% of members are not present | 0.75
The concept of population diversity when peptides are
present at different abundances is discussed in more
detail elsewhere [1, 2]. As a simple illustration of the concept, consider the three peptide populations described
in Table 2. The measure of diversity utilized by POPDIV
cannot distinguish between library A and library B in
Table 2 because in a sampling experiment, the two
populations are statistically identical. It can, however,
detect the difference between A and C or B and C, with
both A and B behaving as if they have more sequence
diversity than C. This program is a useful tool for rapid
assessment of the relative complexity of two or more
peptide populations and is equally useful to researchers
either constructing or utilizing combinatorial peptide
libraries. For instance, if a user has constructed a random 12 amino acid peptide library and from the
sequences of 75 clones calculates the diversity at 0.01
using POPDIV, then that population functionally behaves
as if it contains approximately $4.096 \times 10^{15}$ times 0.01 or
$4 \times 10^{13}$ peptides.
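The values in Table 2 and the worked example above can be reproduced numerically. The sketch assumes that diversity is computed as $1/(N \sum_i p_i^2)$, where $N$ is the number of possible members and $p_i$ the fractional abundance of member $i$; this form is an assumption chosen because it reproduces the three tabulated values, and the definition actually used by POPDIV is the one given in [2]. The function names are hypothetical.

```python
def diversity(abundances):
    """Assumed diversity measure: 1 / (N * sum(p_i^2)).

    abundances lists the relative abundance of every possible member;
    members that are absent are entered as 0.
    """
    n = len(abundances)
    total = sum(abundances)
    return 1.0 / (n * sum((a / total) ** 2 for a in abundances))

def functional_size(peptide_length, d, alphabet=20):
    """Number of peptides the library behaves as if it contains."""
    return d * alphabet ** peptide_length

# The three populations of Table 2, modeled with 100 possible members.
pop_a = [1] * 90 + [0] * 10   # 90% present in equal amounts, 10% absent
pop_b = [1] * 50 + [2] * 50   # half present at twice the abundance
pop_c = [1] * 75 + [0] * 25   # 75% present in equal amounts, 25% absent

for label, pop in (("A", pop_a), ("B", pop_b), ("C", pop_c)):
    print(label, round(diversity(pop), 3))

# A 12 mer library has 20**12 possible sequences.
print(functional_size(12, 0.01))
```

Under this assumed measure populations A and B give the same value, which is consistent with the statement that a sampling experiment cannot distinguish them.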
Application 4: Affinity biopanning is an iterative process
that alternately selects on the basis of affinity and
growth characteristics. Consequently, a statistical analysis that provides a basis for identifying those peptides
with the highest probability of being selected on the basis
of affinity rather than growth can be of use in determining
the success of a series of biopanning experiments or in
giving any one peptide sequence more statistical weight
than another. On the basis of the observed frequencies of
amino acids in an unselected population, INFO calculates
an information parameter associated with each peptide
which is a measure of the likelihood of observing that
peptide by chance. A peptide selected after multiple
rounds of biopanning but having relatively low information
content has a sequence representative of peptides highly
iterated within the population, suggesting that it is there
on the basis of growth characteristics, not ligand affinity.
A peptide with relatively high information value is unlikely
to be present in a selected population due to chance or to
growth characteristics; rather, it is most likely present due
to its affinity to the target molecule.
One illustration of the shift in information content of a peptide display population during a biopanning experiment is
shown in Fig. 9. Three rounds of biopanning with an
immobilized version of the anticancer drug Taxol were
monitored by isolation and sequencing of single phage
clones at each step of the process. The normalized distribution of information content for each peptide population
was calculated (solid lines) and compared to the distribution in the naïve, i.e. unselected, library (dotted lines) (in
this case the Ph.D.-12 peptide library from NEB). Fig. 9a
contains data obtained after round 1 of affinity screening;
9b after round 2 and 9c after round 3. A comparison of the
three distributions indicated that the peptide population in
round 2 is slightly more enriched for high information content peptides than round 1 and much more so than the
unselected population [6]. Information content values,
however, begin to decrease significantly after three rounds
of biopanning (Fig. 9c), pointing towards a bias for phage
with enhanced growth properties. This trend is more readily visualized in the subtraction plots of Figs. 9a through c.
Calculation of the area under each set of curves shows
that the number of peptides with information content
above 35.0 goes up 12% in rounds 1 and 2, but goes
down 30% in round 3 (as compared to the random curve).
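The information parameter that INFO assigns to each peptide can be illustrated with a short sketch. The exact formula is not restated in this section, so the code assumes that the information of a peptide is the negative natural logarithm of the product of the library frequencies of its residues; this assumption is consistent with the fold-differences quoted in this section, but it is not taken from the RELIC source code. The function names are hypothetical.

```python
import math
from collections import Counter


def residue_frequencies(unselected_peptides):
    """Observed amino acid frequencies in an unselected (naive) population."""
    counts = Counter("".join(unselected_peptides))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def information_content(peptide, freqs):
    """Assumed definition: -ln of the chance probability of the peptide.

    The chance probability is taken as the product of the per-residue
    frequencies, so rare residues raise the information value.
    Raises KeyError for a residue never seen in the unselected population.
    """
    return -sum(math.log(freqs[aa]) for aa in peptide)


# With perfectly uniform frequencies every 12-mer scores 12 * ln(20).
uniform = {aa: 1 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}
print(round(information_content("NFSTRTVAVALF", uniform), 2))
```

Under the uniform-frequency assumption a 12-mer scores about 35.9, which is on the same scale as the information values reported for the Ph.D.-12 library; real libraries deviate from uniformity, which is what spreads the distribution.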
Analysis of all three peptide populations with MOTIF1
demonstrated that of the putative Taxol-binding sequences, only the round 2 sequences contained clone
pairs sharing consensus sequences as long as four and
five residues. Of the two pairs of pentamers identified,
one pair of phage clones shared the pentapeptide HTPHP
at identical positions within the insert peptide, and a
second set shared SHPST at different locations along
the insert sequence. Localization of these four clones
within Fig. 9b indicates that the HTPHP-containing pair
possesses information contents of 35.9 and 36.3 respectively, whereas the SHPST-containing pair possesses
information content values of 31.6 and 31.2 respectively.
These numbers show that the HTPHP match is statistically a less likely occurrence than the SHPST match by
a factor of about 110-fold, strongly suggesting that the
HTPHP motif has high affinity for Taxol [6].
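The factor of about 110 can be reproduced from the quoted information values if information is read as a negative natural logarithm of probability, so that a difference in information corresponds to a probability ratio of e raised to that difference. The sketch below rests on that assumption, and on the further assumption that each clone pair is represented by the mean of its two values; `fold_difference` is a hypothetical helper name.

```python
import math


def fold_difference(info_a, info_b):
    """Probability ratio implied by two information values (assumed -ln scale)."""
    return math.exp(info_a - info_b)


htphp = (35.9 + 36.3) / 2  # mean of the HTPHP-containing pair
shpst = (31.6 + 31.2) / 2  # mean of the SHPST-containing pair
print(round(fold_difference(htphp, shpst)))  # 110
```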
Figure 9. Relative abundances of peptides plotted as a function of associated information for peptides selected for affinity
to Taxol after a) one round of biopanning b) two rounds, and c) three rounds. The three curves shown in d) are the result of
subtraction of the normalized random curve from each of the three Taxol-selected curves. Areas of the curves in d) above
the zero line indicate an upward shift in representation for peptides of that information content within that peptide pool
relative to the parent library. Areas of the curves in d) below the zero line indicate a downward shift in representation for
peptides of that information content within that peptide pool relative to the parent library. A shift to higher information can
be seen in the first two rounds, with the trend shown more clearly in the subtraction curves depicted in d). After two rounds
a significant shift to lower information suggests that selection for good growth is beginning to dominate affinity selection. In
Fig. 9b the positions of peptides containing the SHPST motif are marked by open squares; the positions of peptides containing HTPHP are marked by closed squares.
A sequence search with the OWL database [23] using
the HTPHP consensus sequence identified two human
proteins: Bcl-2, a 239 amino acid antiapoptotic protein,
and ataxin-2, a large 1312 amino acid protein which has
been implicated in the pathogenesis of a neurodegenerative disorder. A similar search with the consensus
sequence SHPST yielded four human megaproteins, with
sizes ranging from 1035 residues to 6669 residues. Use
of FASTAcon yielded identical results. Given the higher statistical significance of the HTPHP-bearing phage, the fact
that the probability of a random hit for a consensus without
accompanying functional significance is proportional to
protein size, and the postulated central role of Bcl-2 in
apoptosis, human Bcl-2 and the SHPST-containing protein
kinase DNA-PKcs (Mr 4096) were tested for binding activity
to Taxol. Bcl-2 was identified as an authentic Taxol-binding protein [6], whereas DNA-PKcs was ruled out (W. S.
Dynan, unpublished results), corroborating the statistical
prediction of the phage information content values.
A second application of the INFO program can be seen
in Fig. 10. A published data set of peptide sequences
obtained via affinity selection of the NEB Ph.D.-12 library
using the apical domain of the chaperonin GroEL as bait
[24] included a number of histidine-containing peptides.
These sequences were excluded from the subsequent
binding studies by the authors, as the peptides were
concluded to have been selected by interaction with
the background Ni2+-NTA resin. A comparison of the information content of the parent library (Fig. 10, red curve)
and the GroEL-selected peptides (Fig. 10, black curve)
demonstrates a significant shift to higher information peptides in the affinity-selected pool. The peptide with the
highest affinity for GroEL as measured by fluorescence
anisotropy, peptide SBP, has an information content of
34.9, a relatively high number for that parent library (see
arrow in Fig. 10). Separation of the poly-his peptide
population info content into a separate curve (green
curve, Fig. 10) demonstrates that the majority of the
his-containing peptides have an information content
centering around 31.5, indicating that they are over
30-fold more abundant in the Ph.D.-12 library than
the SBP peptide.
Figure 10. A graphical representation of the output from
INFO with input data from [24] as described in
Section 3.2.2, application 4. The red curve is the information distribution for randomly chosen peptide sequences
of the parent library Ph.D.-12 (NEB). The black curve is
the same INFO-generated normalized information distribution for the published GroEL apical domain-selected
peptides out of that library. The green curve is the subset
of that affinity-selected group of peptides which contains
multiple histidine residues. A clear shift to higher information content can be seen between the parent library and
the affinity-selected set of sequences. The very low information tail of the black curve, however, is taken up by the
poly-his peptides assumed by the authors to be present
as a result of binding to the nickel protein purification column [24]. The peptide shown by fluorescence anisotropy
as having the highest affinity for GroEL lies within the
highest peak of the selected peptides at the high information end of the black curve as shown by the position of the
arrow (information content = 34.897).
Application 5: The MOTIF1 program uses as input a
peptide sequence file and searches for motifs within
the population. Allowing for conservative substitutions,
motifs of user-specified length are identified. An example
is given here for several motifs found within a population
of peptide sequences that have been affinity selected
using an immobilized version of ATP as the bait.
Peptide   Position   Oligo
1         7          VAVAL
25        5          LALAL

19        6          IPSVQ
59        2          MPTLN

38        3          PSLLS
40        2          PSILS

38        4          SLLST
40        3          SILSS

43        7          PLLLT
82        6          PLLLS
Position refers to the position of the motif in the peptide:
VAVAL for example starts at position 7 in the peptide
NFSTRTVAVALF, which is the first peptide in the input
list. MOTIF2 also uses as input a peptide sequence text
file and searches for motifs within the population. Conservative substitutions are not allowed, whereas nonpenalized gaps are identified in the output, along with a list
of source peptides for each peptide printed in cluster
format. An example is given in Fig. 4 of the output of
MOTIF2-identified consensus motifs found within a population of peptide sequences that have been affinity
selected using an immobilized version of GTP.
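A MOTIF1-style search (shared motifs of fixed length, tolerating conservative substitutions) can be sketched by collapsing each residue to a class label and indexing every window of the chosen length. This is an illustration only, not the RELIC implementation: the substitution groups in `GROUPS` are hypothetical (chosen so that pairs such as VAVAL/LALAL match), and the second test peptide is invented for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical conservative-substitution groups; RELIC's own groups may differ.
GROUPS = ["ILVM", "ST", "NQ", "DE", "KR", "FWY"]
CLASS_OF = {aa: group[0] for group in GROUPS for aa in group}


def reduce_alphabet(seq):
    """Replace each residue by its group representative (others stay as-is)."""
    return "".join(CLASS_OF.get(aa, aa) for aa in seq)


def shared_motifs(peptides, length):
    """Return pairs of (peptide number, 1-based position, oligo) whose
    windows are identical after conservative-substitution reduction."""
    index = defaultdict(list)
    for pep_no, pep in enumerate(peptides, start=1):
        reduced = reduce_alphabet(pep)
        for start in range(len(pep) - length + 1):
            window = pep[start:start + length]
            index[reduced[start:start + length]].append((pep_no, start + 1, window))
    hits = []
    for occurrences in index.values():
        for a, b in combinations(occurrences, 2):
            if a[0] != b[0]:  # only report motifs shared between different peptides
                hits.append((a, b))
    return hits


# NFSTRTVAVALF is the first peptide of the input list quoted in the text;
# the second peptide is a made-up sequence carrying LALAL at position 5.
print(shared_motifs(["NFSTRTVAVALF", "QWERLALALKDP"], 5))
```

Indexing reduced windows in a dictionary avoids comparing every window against every other one directly, which matters once populations reach hundreds of peptides.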
3.2.3 Protein/ligand interaction analysis using
proteins of known structure
Many users obtain combinatorial peptide data in the
study of protein-ligand interactions. RELIC presently has
three programs that use PDB files as a basis for the analysis of protein-ligand interactions in conjunction with
phage display data.
Application 6: The program CLOSEcon uses as input one
or more PDB files (all with the same ligand) and provides a
list of the amino acid residues that are in contact (defined
by a maximum interatomic distance chosen by the user)
with a ligand on the basis of crystallographic coordinates.
Proteins may be downloaded from the PDB at
http://www.rcsb.org/PDB/, and then uploaded onto the data
entry page for either single protein analysis or analysis of
multiple proteins containing the same ligand. The example shown in Fig. 11 uses 1A9C (GTP Cyclohydrolase I)
and extracts the residues within 10 Å of the GTP heteroatom and produces two output files. The first output file
lists the amino acid sequence and number obtained from
the PDB file of the residues that are within 10 Å atom-to-atom of the GTP molecule. The second output file lists the
residues along with the chain and the name of the protein.
Single amino acids indicate punctate contact while the
peptide strings indicate extended contact.
Figure 11. An actual output file from the CLOSEcon program. The input PDB file 1A9C is a GTP
hydrolase mutant cocrystallized with its GTP ligand. Directly below a printout of all the residues
whose coordinates are contained within the input PDB file is a list of all continuous residue strings
within 10 Å of the GTP ligand, in order of amino-terminus to carboxy-terminus.
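The contact calculation described for CLOSEcon can be sketched as follows. The code is a minimal illustration under stated assumptions, not the RELIC program: it reads only the fixed-column ATOM and HETATM records of a PDB file, treats a residue as "in contact" when any of its atoms lies within the cutoff of any ligand atom, and groups consecutive residue numbers into continuous strings. All function names are hypothetical.

```python
import math


def parse_atoms(pdb_lines):
    """Split PDB coordinate records into protein atoms and hetero atoms."""
    protein, hetero = [], []
    for line in pdb_lines:
        record = line[0:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        atom = {
            "res_name": line[17:20].strip(),   # columns 18-20
            "chain": line[21],                 # column 22
            "res_seq": int(line[22:26]),       # columns 23-26
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        }
        (protein if record == "ATOM" else hetero).append(atom)
    return protein, hetero


def residues_in_contact(pdb_lines, ligand_name, cutoff):
    """Sorted (chain, residue number, residue name) within cutoff of the ligand."""
    protein, hetero = parse_atoms(pdb_lines)
    ligand_xyz = [a["xyz"] for a in hetero if a["res_name"] == ligand_name]
    contacts = set()
    for atom in protein:
        if any(math.dist(atom["xyz"], xyz) <= cutoff for xyz in ligand_xyz):
            contacts.add((atom["chain"], atom["res_seq"], atom["res_name"]))
    return sorted(contacts)


def contiguous_runs(contacts):
    """Group sorted contacts into runs of consecutive residues on one chain."""
    runs = []
    for chain, res_seq, res_name in contacts:
        last = runs[-1][-1] if runs else None
        if last and last[0] == chain and last[1] == res_seq - 1:
            runs[-1].append((chain, res_seq, res_name))
        else:
            runs.append([(chain, res_seq, res_name)])
    return runs
```

A run of length one corresponds to what the text calls punctate contact, and longer runs to extended contact. `math.dist` requires Python 3.8 or later.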
A major driving force in the development of the RELIC
database was the need for genomic methodology which
aids in the functional annotation of whole genomes. Our
group has attempted to use combinatorial peptide phage
display to identify small molecule binding sites within the
primary amino acid sequences of proteins. Bioaffinity
screening of immobilized tagged versions of numerous
small molecules such as the metabolite ATP and the
anticancer drug Taxol has generated populations of
peptide sequences with binding affinity for numerous
small molecules [6, 7]. Algorithm development related to
this work has been exploited to generate software with
novel capabilities. Applications 7 through 10 highlight
some of the potential applications of this software.
Application 7: If the user has a PDB file, the program
HETEROalign will predict where in the protein structure
a small molecule ligand is most likely to bind using a
population of peptides selected for binding to that small
molecule. HETEROalign provides three visualizations of
the similarity between a protein sequence and a population of peptides. The first visualization is a three-dimensional representation of the similarity by color. Any
standard three-dimensional visualization package can
be used to show similarity when the colors of the image
are coded to temperature factor as shown in Figs. 6a
and b using RasMol [25]. Figures 6a and b depict the
three-dimensional structure of the ATP-binding protein
1AYL. The ATP molecule is rendered in cpk/spacefill
mode. The PDB output file from HETEROalign demonstrates that maximal similarity to the input peptide
sequences for this protein lies along a stretch of α-helix
which wraps around the ATP molecule. This stretch
contains the canonical ATP-binding motif known as a
p-loop [9–11].
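Coding similarity into the temperature factor field, so that a viewer can color the structure by it, can be sketched as below. This is an assumption-laden illustration rather than HETEROalign's own output routine: it assumes per-residue similarity scores are already available in a dictionary keyed by chain and residue number, and it overwrites columns 61 to 66 of each ATOM and HETATM record, which is where the PDB format stores the temperature factor. The function name is hypothetical.

```python
def write_similarity_as_bfactor(pdb_lines, residue_scores):
    """Return PDB lines whose temperature factor holds a similarity score.

    pdb_lines: lines without trailing newlines.
    residue_scores: maps (chain, residue number) -> score; residues with
    no entry receive 0.0.
    """
    out = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            key = (line[21], int(line[22:26]))
            score = residue_scores.get(key, 0.0)
            line = line.ljust(66)  # make sure the field exists
            line = f"{line[:60]}{score:6.2f}{line[66:]}"
        out.append(line)
    return out
```

Any viewer that can color by temperature factor will then show the regions of highest similarity directly on the three-dimensional structure.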
The second output file is the sequence of the protein in text format, with
those peptides exhibiting significant similarity aligned to
the protein sequence, the similarity
score of each residue in the protein sequence, and conserved residues highlighted in color in the region of maximum similarity. An example of this output type is shown
in Fig. 6c. Bioaffinity screening of combinatorial peptide
display libraries using a purified protein as a target will
produce a population of peptides with affinity to that protein. In this instance HETEROalign can be similarly utilized
to map segments of the binding partner protein with high
similarity to the affinity-selected peptide sequences and
to quickly assess by visualization if they are clustered
and/or on the surface of the protein molecule. DistSim
makes possible the type of plot shown in Fig. 7. Using
the PDB file generated by HETEROalign as data input,
this program demonstrates that the amino acid residues
in this protein most similar to those of the peptide population are also physically closest to the hetero group of the
protein (i.e. ATP).
3.2.4 Protein/ligand interaction analysis using
any protein sequence
The last three programs in the RELIC database use protein sequence FASTA text files as a basis for the analysis
of protein-ligand interactions. Users can apply this software in the analysis of either single proteins or whole
genomes, using either peptides collected from RELIC’s
peptide database or peptides identified by the user from
bioaffinity experiments.
Application 8: A user wishes to assess the probability
that a sequenced protein of unknown function will bind
to ATP. MATCH will carry out a calculation of the similarity
between that protein sequence and a population of
ligand-selected peptides when no PDB file is available.
The FASTA sequence of that protein can be entered and
compared to phage displayed peptides that have been
affinity selected for binding ATP stored in RELIC using
MATCH. The degree of similarity calculated will provide a
measure of the probability of binding which can then be
compared by using MATCH with known ATP-binding proteins as input data. The output is a set of aligned peptides
similar to that shown in Fig. 6c with the protein/ligand
contact points predicted to be at the peptide cluster
points.
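The idea of locating peptide cluster points along a protein sequence can be sketched with a per-residue similarity profile. The scoring used by MATCH is described elsewhere in the paper and is not reproduced here; as a stand-in, the sketch counts exact identities in ungapped alignments and keeps only alignments with at least `min_identities` matches. That scoring rule, the threshold, and the function names are all assumptions made for illustration.

```python
def similarity_profile(protein, peptides, min_identities=3):
    """Count, for each protein residue, how many peptide alignments match it.

    Every peptide is slid along the protein without gaps; an alignment
    contributes only if it has at least min_identities identical residues.
    """
    profile = [0] * len(protein)
    for peptide in peptides:
        width = len(peptide)
        for start in range(len(protein) - width + 1):
            matches = [i for i in range(width) if protein[start + i] == peptide[i]]
            if len(matches) >= min_identities:
                for i in matches:
                    profile[start + i] += 1
    return profile


def peak_region(profile):
    """Peak value and the 1-based residue positions that reach it."""
    peak = max(profile)
    return peak, [i + 1 for i, value in enumerate(profile) if value == peak]


# Toy input: a short made-up protein and two made-up 4-mer peptides.
profile = similarity_profile("MAGKSTLL", ["GKST", "GKSA"])
print(profile, peak_region(profile))
```

In this toy case the profile peaks over residues 3 to 5, the segment both peptides agree on; with real data the predicted contact points would be the positions where many selected peptides pile up.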
The last two programs are designed to search large sets
of protein sequences up to and including whole genomes
where the protein sequences are in FASTA format and
stored in text files. Input file sizes of up to 50 MB can be
accommodated (for example, the International Protein
Index FASTA file of predicted and known human proteins
is at present 26.4 MB).
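Both programs start from a multi-record FASTA text file. A minimal sketch of reading such a file and reporting which entries contain a consensus pattern, the kind of scan FASTAcon performs, is given here. It is not the RELIC code: the function names are hypothetical, and the consensus is written as a Python regular expression, so a degenerate pattern such as A/GxxxxGKS/T becomes `[AG].{4}GK[ST]`.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_consensus(lines, pattern):
    """List (header, 1-based position, matched text) for every match.

    Uses non-overlapping regular expression matches within each sequence.
    """
    regex = re.compile(pattern)
    hits = []
    for header, sequence in read_fasta(lines):
        for match in regex.finditer(sequence):
            hits.append((header, match.start() + 1, match.group()))
    return hits


# Made-up records; real input would be the lines of a proteome FASTA file.
demo = [">p1 demo", "MAAHTPHPQQ", ">p2 demo", "MKKLLL"]
print(find_consensus(demo, "HTPHP"))  # [('p1 demo', 4, 'HTPHP')]
```

Because `read_fasta` is a generator, a file object can be passed directly and a large proteome is processed one record at a time rather than loaded whole.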
Application 9: A user needs to know how many and
which proteins in the Escherichia coli genome have the
consensus ATP binding sequence known as the p-loop
or Walker A box [26]. The p-loop consensus sequence
A/GxxxxGKS/T can be searched for with the program
FASTAcon, which will provide to the user a list of those
proteins in the input protein sequence list containing the
consensus sequence. This program is useful for downstream scanning of genomes using the output from any
of the MOTIF programs. As a second example, using
the IPI human genome obtained from European Bioinformatics Institute (http://www.ebi.ac.uk/proteome/) the
consensus sequence HTPHP was identified in the proteins listed below:
Consensus sequence: HTPHP
#   position accession# databank
1   239.IPI:IPI00020961.1|SWISS-PROT:P10415|REFSEQ_NP:NP_000624|TREMBL:Q96PA0
2   422.IPI:IPI00077228.1|REFSEQ_XP:XP_108348 Tax_Id=9606 hypothetical protein
3   715.IPI:IPI00142188.2|REFSEQ_XP:XP_170134|ENSEMBL:ENSP00000300179 Tax_Id=
4   252.IPI:IPI00164279.1|ENSEMBL:ENSP00000310165 Tax_Id=9606
5   205.IPI:IPI00031177.1|REFSEQ_NP:NP_000648 Tax_Id=9606 B-cell lymphoma pro
6   1318.IPI:IPI00164711.1|REFSEQ_NP:NP_002964|TREMBL:Q99493;Q99700|ENSEMBL:EN
7   359.IPI:IPI00030308.1|REFSEQ_NP:NP_115525|TREMBL:Q9H0D9|ENSEMBL:ENSP00000
8   860.IPI:IPI00058937.2|REFSEQ_XP:XP_067967 Tax_Id=9606 similar to Gp150-P1
9   348.IPI:IPI00161927.1|REFSEQ_XP:XP_173469 Tax_Id=9606 hypothetical protein
# sequences scanned = 46840
# aa in scanned proteins = 18270974
Application 10: Bioaffinity screening of phage display
libraries, as mentioned above, can yield a group of binding peptides with no obvious sequence consensus and
yet potentially rich with information. FASTAskan calculates the cumulative similarity between an entire peptide
population and a large set of protein sequences in
FASTA format stored in text files. Scores are generated
by calculating the similarity (see MATCH above) between
all peptides and each segment of the protein sequence.
The output is a list of the proteins with the highest
peak value of cumulative peptide population similarity,
ordered from highest to lowest according to the value
of the peak score. This allows the user to rank order a
list of proteins against a set of affinity-selected peptides
to predict which proteins are most likely to bind to the
selection bait. Output from FASTAskan includes 5000
proteins in the listing. An example application of FASTAskan uses peptide sequences (stored in RELIC)
selected for binding to ATP. The input of these peptides
plus the predicted proteome sequence of E. coli gave an
output list of the E. coli proteins with the highest peak
value of the similarity, ordered according to the value of
the peak score. This output of FASTAskan is shown in
Fig. 12. The list shows the top 10 scoring E. coli K12 proteins ranked by similarity to a population of 100 unconstrained 12-mer peptides (from NEB Ph.D.-12) affinity
selected for binding to immobilized biotinylated ATP.
Annotation for the entries on the list in Fig. 12 demonstrates that only proteins known to bind ATP or predicted
to bind to ATP on the basis of sequence similarity with
ATP-binding proteins are present.
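The ranking step described for FASTAskan can be sketched as follows. As with the other examples, this is a simplified stand-in rather than the RELIC algorithm: similarity is reduced to a count of identical residues in ungapped alignments, all peptides are assumed to have the same length, and the function names are hypothetical. The default of 5000 entries mirrors the output size stated in the text.

```python
def identity_score(segment, peptide):
    """Number of identical residues between two equal-length strings."""
    return sum(a == b for a, b in zip(segment, peptide))


def peak_cumulative_similarity(protein, peptides):
    """Highest summed similarity of the whole peptide population to any
    single segment of the protein (assumes equal-length peptides)."""
    width = len(peptides[0])
    best = 0
    for start in range(len(protein) - width + 1):
        segment = protein[start:start + width]
        best = max(best, sum(identity_score(segment, p) for p in peptides))
    return best


def rank_proteins(proteins, peptides, top=5000):
    """Return (name, peak score) pairs ordered from highest to lowest peak."""
    scored = [(name, peak_cumulative_similarity(seq, peptides))
              for name, seq in proteins.items()]
    scored.sort(key=lambda item: (-item[1], item[0]))  # ties broken by name
    return scored[:top]


# Made-up proteins and peptides for illustration only.
proteins = {"p2": "MLLLLLLL", "p1": "MAGKSTLL"}
print(rank_proteins(proteins, ["GKST", "GKSA"]))  # [('p1', 7), ('p2', 0)]
```

The cost grows with the product of total proteome length and population size, so a whole-genome scan with a realistic scoring matrix would normally be batched or vectorized rather than run as nested Python loops.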
4 Concluding remarks
A web-based bioinformatics server, RELIC, has been
constructed and contains a suite of bioinformatics programs capable of extracting functional information from
combinatorial peptide phage display data in the presence
or absence of exact sequence consensus motifs. In addition, populations of peptide sequences which bind to
numerous small molecules ranging from ATP to Taxol
can be downloaded from the web site. RELIC seeks to fill
an unmet need by providing to the phage display community a set of appliances for the analysis of populations of
peptides and for the comparison of populations of peptides both to each other and to the sequences of naturally
occurring proteins in an easy-to-use, web-accessible format. As additional molecular ligands are screened,
these peptide sequences will be incorporated into future
versions of RELIC. Bioinformatic analysis of these new
data sets will produce improved, updated versions of the
RELIC programs and expand our ability to identify small
molecule binding proteins as well as pinpoint the ligand-binding sites therein.
The authors wish to thank R. F. Fischetti and M. Scholle for
data processing advice and helpful discussions respectively. This work was funded by a grant from the Office of
Biological and Environmental Research, Department of
Energy under Contract No. W-31-109-Eng-38 to D. J. R.
Figure 12. Sample output from FASTAskan. The protein in E. coli K 12 with the highest similarity to
the ATP-selected peptides had a peak similarity of 22.17 and is a member of the ATP-dependent
helicase superfamily II.
5 References
[1] Rodi, D. J., Soares, A. S., Makowski, L., J. Mol. Biol. 2002,
322, 1039–1052.
[2] Makowski, L., Soares, A., Bioinformatics 2003, 19, 483–489.
[3] Rodi, D. J., Makowski, L., Curr. Opin. Biotechnol. 1999, 10,
87–93.
[4] Zucconi, A., Dente, L., Santonico, E., Castagnoli, L., Cesareni, G., J. Mol. Biol. 2001, 307, 1329–1339.
[5] Iannolo, G., Minenkova, O., Gonfloni, S., Castagnoli, L.,
Cesareni, G., Biol. Chem. 1997, 378, 517–521.
[6] Rodi, D. J., Janes, R. W., Sanganee, H. J., Holton, R. et al.,
J. Mol. Biol. 1999, 285, 197–204.
[7] Rodi, D. J., Agoston, G. E., Manon, R., Lapcevich, R. et al.,
Comb. Chem. High Through. Screen. 2001, 4, 553–572.
[8] Sigrist, C. J., Cerutti, L., Hulo, N., Gattiker, A. et al., Bioinformatics 2002, 3, 265–274.
[9] Wolf, Y. I., Brenner, S. E., Bash, P. A., Koonin, E. V., Genome
Res. 1999, 9, 17–26.
[10] Saraste, M., Sibbald, P. R., Wittinghofer, A., Trends Biochem. Sci. 1990, 11, 430–434.
[11] Kinoshita, K., Sadanami, K., Kidera, A., Go, N., Protein Eng.
1999, 12, 11–14.
[12] Johnson, J. M., Church, G. M., Proc. Natl. Acad. Sci. USA
2000, 97, 3965–3970.
[13] Stuart, A. C., Illyin, V. A., Sali, A., Bioinformatics 2002, 18,
200–201.
[14] Roche, O., Kiyama, R., Brooks, C. L., J. Med. Chem. 2001,
44, 3592–3598.
[15] Kay, B. K., Adey, N. B., He, Y.-S., Manfredi, J. P. et al., Gene
1993, 128, 59–65.
[16] Lowman, H. B., Wells, J. A., J. Mol. Biol. 1993, 234, 564–
578.
[17] Shannon, C. E., The Bell System Technical Journal. 1948,
27, 379–423, 623–656.
[18] Smith, T. F., Waterman, M. S., J. Mol. Biol. 1981, 147, 195–
197.
[19] Pearson, W. R., Lipman, D. J., Proc. Natl. Acad. Sci. USA
1988, 85, 2444–2448.
[20] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman,
D. J., J. Mol. Biol. 1990, 215, 403–410.
[21] Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J. et al.,
Nucleic Acids Res. 1997, 25, 3389–3402.
[22] Rognes, T., Nucleic Acids Res. 2001, 29, 1647–1652.
[23] Bleasby, A. J., Akrigg, D., Attwood, T. K., Nucleic Acids Res.
1994, 22, 3574–3577.
[24] Chen, L., Sigler, P. B., Cell 1999, 99, 757–768.
[25] Sayle, R. A., Milner-White, E. J., Trends Biochem. Sci. 1995,
20, 374.
[26] Walker, J. E., Saraste, M., Runswick, M. J., Gay, N. J.,
EMBO J. 1982, 1, 945–951.