Proteomics 2004, 4, 1439–1460. DOI 10.1002/pmic.200300680

RELIC – A bioinformatics server for combinatorial peptide analysis and identification of protein-ligand interaction sites

Suneeta Mandava, Lee Makowski, Satish Devarapalli, Joseph Uzubell and Diane J. Rodi

Biosciences Division, Argonne National Laboratory, Argonne, IL, USA

Phage display technology provides a versatile tool for exploring the interactions between proteins, peptides and small molecule ligands. Quantitative analysis of peptide population sequence diversity and bias patterns has the power to significantly enhance the impact of these methods [1, 2]. We have developed a suite of computational tools for the analysis of peptide populations and made them accessible by integrating fifteen software programs for the analysis of combinatorial peptide sequences into the REceptor LIgand Contacts (RELIC) relational database and web-server. These programs have been developed for the analysis of statistical properties of peptide populations; identification of weak consensus sequences within these populations; and the comparison of these peptide sequences to those of naturally occurring proteins. RELIC is particularly suited to the analysis of peptide populations affinity selected with a small molecule ligand such as a drug or metabolite. Within this functional context, the ability to identify potential small molecule binding proteins using combinatorial peptide screening will accelerate as more ligands are screened and more genome sequences become available. The broader impact of this work is the addition of a novel means of analyzing peptide populations to the phage display community.

Keywords: Bioinformatics / Database / Phage display / Protein-ligand interactions / Small molecule

1 Introduction

The need for high-throughput bioinformatic methods to characterize gene function is being driven by the generation of sequences at a rate far beyond our ability to carry out experimental functional analyses.
In spite of the large number of analytical tools currently available, typically about 40% of predicted open reading frames remain functionally uncharacterized. An important clue to open reading frame function is the identification of binding partners. Phage display technology is a widely used tool for identifying either protein- or small molecule-binding partners [3]. Current methodology aims towards multiple rounds of affinity selection until a point is reached at which most of the remaining binding phage consist of a small number of peptide sequences. The peptide sequences identified in that group of phage clones are then assumed to reflect affinity to the ligand used for selection. Affinity, however, is not the only factor contributing to these results. The process of library construction itself, compounded by affinity selection methods, generates a group of peptide-bearing phage clones whose sequence properties are statistically quite complex.

Correspondence: Dr. Diane J. Rodi, Biosciences Division, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA. E-mail: [email protected] Fax: +1-630-252-5517

Abbreviations: ASP, Active Server Pages; COM+, Component Object Model; NEB, New England Biolabs; PDB, protein data bank; RELIC, receptor ligand contacts

Received 25/7/03; Revised 30/10/03; Accepted 3/11/03

© 2004 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
Although the initial nucleotide inserts may well be random in sequence, two filtration effects shape the library. The first is random sequence filtration: electroporation generates a low number of initial phage clones (usually on the order of 10^6 to 10^9) compared with the number of sequences theoretically possible (4.096 × 10^15 for a set of peptides 12 amino acids long). The second is nonrandom (repetitive or regular pattern) sequence filtration due to biological selection steps. Together these create a pool of phage particles with multiple reasons for continued inclusion in the population. Statistical dissection of the regular sequence biases with custom-designed algorithms has been used by our group to identify inclusion-limiting steps during library construction [1]. These results, as well as data from others, demonstrate that alternation of affinity selection with growth in broth can lead to a group of clones which are the product of affinity and enhanced growth properties [1, 4, 5]. Given the random inclusion of sequences into the library due to the electroporation cut-off step, the exact peptide sequence which matches the binding partner being sought may not be in the library, only a low number of conservatively close sequences. Affinity selection of a combinatorial peptide library screen may therefore generate a group of closely related sequences which are functional homologs, but may contain no obvious consensus sequence discernible by eye. Optimal use of phage display libraries for the study of intermolecular interactions, therefore, necessitates a quantitative approach to data analysis in which close peptide sequences, subtle motifs and overall trends are capable of being monitored and studied. For example, is there a method for estimating the sequence diversity of a combinatorial peptide library from the sequences of a practicable number of the members of that library?
Does the biology of the phage-host system result in biases in amino acid representation that seriously impact the diversity of a phage displayed library and, consequently, the results of affinity selection experiments? How can these biases be identified and characterized in order to make the best use of phage libraries in affinity selection or other experiments designed to take advantage of the unique properties of display libraries? How can ligand-binding motifs within the peptides be readily identified in light of the biological biases introduced by the system? These types of questions cannot be answered using the computational tools and databases presently available. While currently available genomics tools are sources of valuable information, they do not address many of the specific issues important to the phage display community. A presently unmet need in the application of phage display technology is software for the analysis of the statistical properties of peptide libraries which answers these types of questions by carrying out such procedures as the estimation of sequence diversity in a library prior to screening; the identification of weak consensus motifs within short sequences; and the comparison of the sequences of affinity selected peptides to those of naturally occurring proteins. We have developed a suite of fifteen programs for the analysis of populations of peptides in the context of these three functions. This software has been incorporated into the publicly available REceptor LIgand Contacts (RELIC) bioinformatics server (http://relic.bio.anl.gov). The order of the programs and their functionalities are specifically designed to aid a researcher in the combinatorial peptide field from the early stages of raw data acquisition to the final stage of protein epitope mapping.
The flow of data-processing software starts with sequence translation programs, followed by physicochemical property mapping, sequence bias identification algorithms, and finally peptide/protein similarity mapping both with and in the absence of three-dimensional coordinates. The Programs page of RELIC (Fig. 1 and Table 1) groups these programs into five general categories, each of which addresses a particular need within the field. The translation programs, DNA2PRO12 and DNA2PRO7, are designed to translate raw DNA sequence text output into peptide sequences for phage clones isolated from the New England Biolabs (NEB; Beverly, MA, USA) Ph.D.12 and Ph.D.-C7C libraries respectively, eliminating the need for manual translation of large numbers of sequences. These programs operate by scanning for vector end/beginning sequences and specifically search for an exact size insert (enabling the flagging of insert sequence anomalies), and thus can be modified to accommodate alternative phage display constructs. The characterization of peptide populations suite of programs is a collection of software which is capable of analyzing a population of peptide sequences in terms of their position-dependent abundances of amino acids and their sequence diversity properties. These four programs are most useful for analysis of matched sets of peptide sequences, a recommended minimum of 50 sequences from the naïve parent library and a recommended minimum of 50 sequences which are the product of an affinity selection experiment using that parent library (duplicate copies of peptides will be automatically removed, as RELIC programs assume that multiple occurrences do not reflect independent events). A statistical shift in any of these measured properties between the nonselected and selected sets of peptide sequences is an indication that the affinity selection process has been effective, even in the absence of a clear amino acid sequence consensus in the selected sequence set. 
Examples of practical usage of these programs are presented below in Section 3.2.

Figure 1. A reproduction of the RELIC programs page.

Table 1. List of the programs accessible through the web interface of RELIC

Translation: DNA to protein sequence
  DNA2PRO12     12 mer peptide translation          – peptide sequence from DNA sequence
  DNA2PRO7      7 mer peptide translation           – peptide sequence from DNA sequence

Characterization of peptide populations
  AAFREQ        amino acid frequency                – frequency of each amino acid at each position
  POPDIV        population diversity                – diversity of the population and of individual positions
  AADIV         amino acid frequency + diversity    – combines two previous calculations
  INFO          peptide-associated information      – estimates likelihood of random occurrence of sequence

Peptide motif identification
  MOTIF1        motif identifier                    – identifies continuous short motifs within a population
  MOTIF2        pair correlation analysis 2         – identifies discontinuous short motifs within a population
  MOTIF3        pair correlation analysis 6         – identifies discontinuous short motifs and their near neighbors

Comparison of peptide population to sequence of a known structure
  HETEROalign   peptide-PDB file analysis           – aligns peptides to a sequence from a PDB file
  CLOSEcon      PDB file analysis                   – identifies residues contacting a heterogroup in a PDB file
  DistSim       distance-similarity                 – computes distance to ligand and similarity to peptide population

Analysis of single or multiple FASTA sequences
  MATCH         alignment                           – aligns peptides with a protein sequence
  FASTAcon      consensus finder                    – IDs proteins from a population with short consensus sequences
  FASTAskan     similarity calculation              – lists proteins with high similarity to a peptide population

The peptide motif identification program set is specifically designed to recognize amino acid consensus sequences within a peptide population which are difficult to extract by eye. The three MOTIF programs encompass different combinations of allowed motif properties such as continuous/discontinuous sequence similarity, conservative amino acid substitutions allowed/disallowed, and minimum sequence length requirements. The fourth software group, comparison of peptide population to sequence of a known structure, has been written to construct optimal sequence alignments between peptide sequences obtained via affinity screening and a protein sequence with associated three-dimensional coordinates (i.e. a protein data bank (PDB) file). The combined use of these three programs allows a user to determine whether any of the peptides obtained via affinity selection are mimicking one or more epitopes within a protein structure; facilitates the visualization of those mimicked epitopes within that protein structure; and calculates the distance between a cocrystallized ligand within that protein structure and the protein sequence epitopes identified via affinity-selected peptide analysis. The final suite of programs, analysis of single or multiple FASTA sequences, similarly carries out optimal sequence alignments between affinity-selected peptides and protein sequences, but does so for proteins for which there is only a text sequence (i.e. no coordinates deposited within the PDB). The MATCH program aligns multiple peptide sequences to a single protein sequence, with a text output that allows for easy identification of protein regions with sequences similar to multiple peptide sequences (i.e. regions of clustering). The FASTAcon program will take a motif sequence (either supplied by the user or output from any of the three MOTIF programs) and scan text lists of protein sequences for the occurrence of that motif sequence in those protein sequences.
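The kind of scan described for FASTAcon can be illustrated with a minimal Python sketch. This is not the RELIC implementation (which is written in FORTRAN); the motif syntax (lowercase `x` for any residue, bracketed alternatives), the function names and the example sequences are all assumptions made for illustration.

```python
import re

def motif_to_regex(motif):
    """Compile a motif such as 'GxxxxGK[ST]' (assumed syntax: 'x' = any residue).

    A lookahead is used so that overlapping occurrences are also reported.
    """
    return re.compile("(?=" + motif.replace("x", ".") + ")")

def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

def scan_for_motif(fasta_text, motif):
    """Return {header: [1-based start positions]} for proteins containing the motif."""
    pattern = motif_to_regex(motif)
    hits = {}
    for header, seq in parse_fasta(fasta_text):
        positions = [m.start() + 1 for m in pattern.finditer(seq)]
        if positions:
            hits[header] = positions
    return hits

# Hypothetical input: two made-up protein sequences.
example = ">protA\nMKTAGSSGSGKSTLL\n>protB\nMAAAAKLV\n"
print(scan_for_motif(example, "GxxxxGK[ST]"))
```

Only proteins with at least one occurrence are reported, which mirrors the idea of listing the proteins in a population that carry a short consensus sequence.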
FASTAskan is a combination program which will serially apply the MATCH algorithm to calculate a cumulative similarity factor between a single protein and a set of selected peptides; serially performs this function for multiple protein sequences using the same peptide sequence set, then compiles a list of those proteins so that they are ranked by similarity factor running from most similar to least similar. This program is capable of sorting an entire genome-derived list of proteins based upon their similarity to a group of affinity-selected peptides in the absence of a clear consensus sequence motif, clustering those proteins most likely to bind to the affinity target at the top of the output list. RELIC was particularly designed for the study of the interaction of small molecules with proteins, based upon previous work in which we have shown that the similarity between the sequence of a protein and the sequences of small molecule affinity-selected, phage-displayed peptides can be predictive for protein binding to that small molecule ligand. This technique has been successfully employed to map the contact residues in the targets of a variety of drugs, drug candidates, and small molecule metabolites [6, 7]. The use of affinity selection of phage displayed peptide libraries to identify binding motifs has potential as an approach in the annotation of whole genomes. The rationale is analogous to the use of consensus sequences that provide reliable signatures for ligand binding. Although a number of consensus binding motifs for small molecules have been identified and are compiled within databases of sequence motifs such as PROSITE [8], small molecule binding motifs have been characterized for only a small fraction of biological ligands, and even in those cases are far from exhaustive.
For instance, although the p-loop (GxxxxGK(S/T)) is the most widely occurring [9] and best characterized [10, 11] of the small molecule binding motifs and is highly predictive for binding of nucleotide triphosphates, many ATP-binding sites do not in fact contain a p-loop motif. Three-dimensional information from crystallographic structures provides a strong basis for prediction of ligand binding which, when combined with multiple sequence alignments, can be highly predictive [12, 13]. Databases specifically addressing protein-small molecule interactions typically involve three-dimensional information [14], but even when assignment of a protein family from a sequence is possible and structural data is available, assignment of the small molecule binding site is challenging. By studying the patterns of combinatorial peptides binding to common metabolites such as ATP and glucose, and correlating those sequences with three-dimensional structures of known metabolite/protein pairs, we have created a database within RELIC of peptide sequences which are predictive for metabolite binding in known protein sequences, along with the computational tools required to carry out this analysis. RELIC is intended to provide the scientific community with access to the sequences of peptides selected for affinity to multiple ligands, as well as access to the tools described above. The database currently houses over 5000 peptide sequences that have been selected for affinity to small molecule metabolites such as ATP, GTP and glucose and drugs such as Taxol and Taxotere, as well as random clones from parent libraries, thereby providing a unique source of information for the study of the interaction of proteins with these ligands.

2 Materials and methods

Data integration plays a central role in bioinformatics.
RELIC is a publicly accessible biotechnology system that utilizes web technology, traditional programming and a relational database to process and manipulate the experimental data of affinity-selected peptides. In order to seamlessly integrate that biological data, RELIC is based on an object-oriented design using a relational database management system. For this particular project, the ORACLE 9i (Release 9.2) (Oracle, Redwood Shores, CA, USA) database system was chosen to store experimental data and the relevant genomic/structure information as it provides a wide array of database drivers for various programming languages (both for thin and thick clients); hence, data can be accessed through various programming languages like JAVA, C++, Active Server Pages (ASP) or PERL. Figure 2a is a diagram depicting the logical and relational model of the database by displaying all tables and inter-table relationships. To increase database efficiency, packages and stored procedures were developed that reduce the amount of information sent over the network to the database. An additional benefit is database security, as the stored procedures do not expose the SQL code, making it very difficult for an intruder to damage or harm the database.

The program architecture for the system, shown in Figs. 2a and b, was driven by several factors including the need for efficiency in data storage and retrieval, enforcement of data relationships and use of legacy code. A crucial factor in the database design was optimization of data storage and retrieval. The database had to be highly normalized with little or no data redundancy. A second key factor was enforcing data relationships through referential integrity (RI). To enforce RI and normalization, indexes with keys were defined and foreign key constraints were devised.
As a result, not only is the number of orphan records greatly reduced, but the need for creating programs that are designed to look for duplicate data or orphan records is eliminated. In addition, data retrieval is facilitated, since it reduces the number and intensity of table scans that occur in order to retrieve the data.

Figure 2. a) RELIC Database schema. Shown is a simple graphical representation of the RELIC Oracle schema. It displays the table name, table fields, primary keys, foreign keys and hierarchical table relations that are enforced throughout the database schema. The primary key fields illustrate how the data are logically and physically arranged for extractions and insertions and stop duplicate data from being entered into the database. The figure also displays the foreign key relationship between tables; i.e. the logical way each table interacts with one another. The foreign keys enforce referential integrity by making sure key data exists in the parent table before it allows data insertion into a child table. b) A RELIC user submits data for processing via a web interface. The user input and job information is stored in a RELIC database. A job processing service periodically checks for pending jobs and processes them using the scientific algorithms developed in FORTRAN, using COM+ interfaces. The user is sent an e-mail upon completion of the job with a link to the output.

Another challenge for designing the system was the need to move text files from a remote client computer into the database system. This was made possible with ASP.NET using C# (Microsoft, Redmond, WA, USA). ASP.NET contains mechanisms for directly reading files through a web browser and storing the information to a remote place; in this case, the Oracle database. Driven by legacy code and the need to modernize, the RELIC system utilizes a myriad of programming languages.
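The effect of the key constraints described for the schema can be shown with a small, self-contained sketch. It uses Python's built-in `sqlite3` module as a stand-in for Oracle, and the table names, column names and peptide strings are hypothetical; the actual RELIC schema is the one shown in Fig. 2a.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical parent table: one row per peptide library.
conn.execute(
    "CREATE TABLE library (library_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
)
# Hypothetical child table: the foreign key blocks orphan rows,
# and the UNIQUE constraint blocks duplicate peptides within a library.
conn.execute(
    """CREATE TABLE peptide (
           peptide_id INTEGER PRIMARY KEY,
           library_id INTEGER NOT NULL REFERENCES library(library_id),
           sequence   TEXT NOT NULL,
           UNIQUE (library_id, sequence))"""
)
conn.execute("INSERT INTO library (library_id, name) VALUES (1, 'Ph.D.-12')")

def try_insert(library_id, sequence):
    """Attempt an insert and report whether the constraints accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO peptide (library_id, sequence) VALUES (?, ?)",
                (library_id, sequence),
            )
        return "inserted"
    except sqlite3.IntegrityError:
        return "rejected"

print(try_insert(1, "ACDEFGHIKLMN"))   # parent row exists
print(try_insert(1, "ACDEFGHIKLMN"))   # duplicate within the same library
print(try_insert(99, "NMLKIHGFEDCA"))  # no such library: would be an orphan
```

Because the database itself refuses the second and third inserts, no separate clean-up program for duplicates or orphan records is needed, which is the benefit the text attributes to the key design.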
RELIC receives user input via a web interface (ASP) and passes the input to application programs. The user input directories are uploaded to the server using JUpload applet (Persits Software, New York, NY, USA). Chart Director library (Advanced Software Engineering, Kowloon, Hong Kong) is used to generate plots in real time. At the time the system was designed, ASP provided the quickest internet user interface, greatly outperforming PHP, Java and its derivatives. The ASP is constructed in HTML, with Dynamic HTML and JavaScript on the client-side and VBscript on the server-side. These programs interface with the ASP pages through Component Object Model (COM+) wrappers written in Visual C++ utilizing the Active Template Library. The COM+ objects interface with FORTRAN programs that process the data. For security reasons, data access and data processing have been encapsulated in Visual Basic COM+ objects. The processed information is then stored in an Oracle 9i database, where a queuing system has been developed in which a windows system service will constantly check for and process queued jobs. When the system completes the user’s job, an e-mail is sent out to a user-specified address indicating that the job is completed, with an attached link leading to the results page. The system retrieves data through the use of ActiveX Data Objects through an OLE DB access layer. RELIC directly uses database drivers developed by Oracle.

The original software for RELIC consisted of 15 different independent programs and was developed in FORTRAN with a DOS based, command-line user interface. The DOS based applications were converted into web based applications by creating COM+ wrappers around the legacy code. To increase usability, the COM+ objects allow for a single interface to the different programs. Use of this configuration will also allow for seamless future upgrades. The challenge was to incorporate each of the programs into a single easy to use interface which is scalable to accommodate a large number of users. One possible scenario was to rewrite all the programs into a single program using C++ and Microsoft Foundation Classes, allowing the user interface to be constructed using traditional Windows forms. This, however, would have severely limited distribution and upgrades, and users would constantly need to check for upgrades before using the software. To avoid this impediment, a three tier system was created consisting of a web based user interface; COM+ objects for the processing layer; and a database for the final layer.

A DELL Power Edge 2500 server, with dual 1.13 GHz processors, 4 GB RAM and a RAID5 SCSI disk array with 136 GB disk space, is used to run the web site and database (Dell, Round Rock, TX, USA). The database is spread on the disk array, avoiding disk contentions and providing high performance. This server uses the Windows 2003 operating system and Microsoft’s Internet Information Server (IIS) 6.x as the web server (Microsoft). To improve the performance, RELIC jobs are run on a separate server, a DELL Power Edge 3250 server, with dual Intel 64-bit 1.3 GHz processors, 4 GB RAM and an 18 GB SCSI disk. This server runs the Windows 2003 64-bit operating system.

3 Results and discussion

3.1 Algorithm and implementation

The RELIC programs page reproduced in Fig. 1 has a short description of each program along with reference to more detailed discussions in published papers. RELIC has a user-friendly interface with copy and paste options, upload features, and select buttons. The software applications allow for either query of input protein sequences against the RELIC-stored affinity-selected peptide data, or entry of a user’s own FASTA formatted peptide sequence data for analysis. An e-mail to return program output is provided for all jobs. Sample applications which illustrate the use of RELIC software are shown under Help and include sample input as well as an explanation of the output format. Each program and the algorithm behind its operation are described individually in the following section.

3.1.1 DNA2PRO12 and DNA2PRO7

These two programs are designed to scan raw DNA sequence text file output from a chromatogram analysis program, locate the insert within the vector, and translate the insert sequences into peptide sequences from the two NEB phage display libraries Ph.D.-12 and Ph.D.-C7C. The output FASTA format peptide sequence list can then be used as input for further analysis using other RELIC software. The programs are designed to flag any entries which are questionable with regard to poor or ambiguous sequence data, including deletion mutants and parental clones. These are the only two RELIC programs in which the input is restricted: the beginning and end sequences are from the NEB phage vector, and the resulting peptide output length will be either 12 or 7 amino acids. Updated software capable of translation from any insert length DNA sequence within any vector will be available in the next version, DNA2PRO. Prior to that upgrade, DNA2PRO12/7 can be modified by request to the RELIC administrator ([email protected]).

3.1.2 AAFREQ

The AAFREQ program calculates the frequency of occurrence of each of the 20 amino acids at each recombinant insert position, as well as the overall position-independent frequency of each amino acid (i.e. amino acid composition) within that set of peptide sequences. Statistical significance of the output data has been demonstrated for input of 50 or more peptide sequences. AAFREQ is a useful program for pinpointing amino acid sequence biases, both overall (e.g. a dearth of cysteine residues at all positions; [15, 16]) and in a particular insert position (e.g. a bias against arginine residues at position 11 after the signal peptidase cleavage site due to bias by that processing enzyme; [1]). Use of AAFREQ analysis of a ligand-selected set of peptides in conjunction with an AAFREQ analysis of at least 50 peptides chosen randomly from the naïve library can identify position-dependent motifs and biases within the selected population which are due solely to the affinity selection process (see [1] and below for examples of practical applications).

3.1.3 POPDIV

The quality and utility of a display library is a function of many properties, one of the most important being sequence representation or diversity. The higher the percentage of all theoretically possible sequences which are actually physically present within the phage population, the more likely it is that affinity screening will generate results which are pertinent to the experimental goals. The only unequivocal method for measuring the completeness of a library would be to sequence each and every member subsequent to initial amplification. In place of this absolute number, however, a well-carried out population survey analogous to a political poll can generate a rough estimate of the completeness of a display library. The POPDIV program uses just such a statistical sampling method to estimate the sequence diversity of a combinatorial peptide library based upon sequences obtained from a limited number of randomly sampled members of the library. A minimum of 50 sequences is required to obtain statistically significant estimation of the true diversity value. The calculation is premised upon the assumption that if two peptide populations have the same sequence diversity, the probability of choosing the same member twice in a random selection from the two populations should be the same (see discussion in [2]). The breadth of the error bars on the generated estimate is solely dependent on the number of random sequences analyzed, as the diversity is measured per amino acid and is thus independent of recombinant insert length. This feature makes POPDIV both easy to use and to interpret.

As an example, POPDIV was used to calculate the diversity of a random 12 amino acid display library based upon 100 randomly sequenced members of that library. A 100% complete library with this size insert would be expected to generate a total of 20^12 or 4.096 × 10^15 possible sequences. The 100 members polled gave a diversity value of 0.04 with a standard deviation of 0.02. This value indicates that the 12 mer library is statistically indistinguishable from a 12 mer library that contains only 4% of all possible sequences, i.e. 0.04 × (4 × 10^15) or 1.6 × 10^14 peptides, indicating that a number of theoretically possible peptide sequences are either underrepresented or absent. The sequence diversity of a population of peptides should go down subsequent to a round of affinity selection, as the population is becoming enriched for a subset of sequences which exhibit high affinity for the ligand. This prediction makes POPDIV a useful tool for the monitoring of multiple rounds of biopanning or for the comparison of multiple constructed combinatorial peptide libraries.

3.1.4 AADIV

AADIV is a composite program which combines the algorithms of both AAFREQ and POPDIV into one program for convenience. The input/output formats are the same as for the two individual programs. AADIV is a valuable starting point for the analysis of a set of combinatorial peptide sequences from either a constructed parent library or in conjunction with those from an affinity screening experiment out of that parent library, as it can quickly indicate whether or not there is adequate sequence diversity in the starting set of peptides and/or whether there is any ligand-selected bias in the sequences pulled out from that set.
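The position-dependent and overall frequency tables that AAFREQ is described as producing can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not the RELIC FORTRAN code: the function name, the return format and the toy peptides are hypothetical, and duplicates are removed because the text states that RELIC programs treat repeated peptides as non-independent.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aafreq(peptides):
    """Return (per-position frequency tables, overall composition).

    Assumes all peptides have the same length (e.g. 12 or 7 residues).
    """
    unique = sorted({p.upper() for p in peptides})  # drop duplicate peptides
    lengths = {len(p) for p in unique}
    if len(lengths) != 1:
        raise ValueError("peptides must be non-empty and of equal length")
    n = lengths.pop()

    by_position = []
    for i in range(n):
        counts = Counter(p[i] for p in unique)
        by_position.append({aa: counts[aa] / len(unique) for aa in AMINO_ACIDS})

    total = Counter("".join(unique))
    overall = {aa: total[aa] / (n * len(unique)) for aa in AMINO_ACIDS}
    return by_position, overall

# Toy input (far fewer than the recommended minimum of 50 sequences).
by_position, overall = aafreq(["ACD", "ACE", "ACD", "GCD"])
print(by_position[0]["A"], by_position[1]["C"], overall["C"])
```

Running the same function on a naïve-library sample and on an affinity-selected sample, then comparing the two tables position by position, corresponds to the matched-set comparison recommended in the text.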
3.1.5 INFO

Although measurement of a change in amino acid frequency distribution patterns or sequence diversity is a sign of successful biopanning, these parameters are characteristics of a set of peptide sequences, not of individual peptides. INFO is a program that provides a mathematical measure of the probability of observing a particular peptide sequence by random chance (i.e. nonspecific binding) as opposed to by selection for a specific property (i.e. ligand affinity). The statistical calculations carried out by INFO are based in principle upon the 1948 paper by Shannon [17] on the theory of information. Although Shannon originally developed the concept of information within the context of communication, the basic element of this theory, information as a decrease in uncertainty, can also be applied to combinatorial phage display screening. The original peptide sequence diversity and pattern distribution as it is laid out within the parent library becomes distorted during multiple rounds of affinity selection and broth amplification, much as Shannon describes signal perturbation by noise. Carrying this analogy further, “if the signal is altered in a reasonable way by the noise, the original can still be recovered” [17].

The signal in which a researcher is interested within the final set of peptide sequences is that resulting from affinity to a ligand. Superimposed upon this affinity signal, however, is the unwanted or extraneous information (or noise) introduced into the peptide sequences by the filtering effects of the various life cycle stages of the phage vector [1, 4, 5]. As the noise in the sequence patterns is a regular, or nonrandom but measurable quantity, it can theoretically be subtracted to a certain extent from the signal, which is the regular but unknown sequence pattern present due to affinity selection.

How can this biological life cycle induced noise be measured? The steps involved in the generation of the library, from recombinant DNA construction to amplified phage particles, recreate the patterns of amino acid sequence bias within the peptides that are responsible for what we have defined as noise during multiple rounds of affinity selection (i.e. the biological step of phage amplification as opposed to the chemical step of affinity selection). In other words, the selection process for good growth characteristics which occurs during multiple rounds of amplification is similarly occurring during naïve library amplification. The selection that occurs during both procedures creates subgroups of phage particles with superior growth properties within both libraries which are present in many copies (Fig. 3). The sequence biases present in both pools of phage which are due to positive growth attributes can be estimated from the properties of the naïve library and thus theoretically subtracted from the affinity-selected library, leaving only the sequence biases introduced by ligand affinity. This mathematical subtraction process is used not only by INFO, but is also an option in the RELIC programs HETEROalign, MATCH, and FASTAskan to reduce the noise in affinity-selected peptide sequences and amplify the signal, i.e. the characteristics of affinity to ligand. The phage that are present in high copy number in the amplified naïve library (to the left in Fig. 3) by definition have low information content, as their presence in the final affinity-selected pool could easily be the result of inadvertent inclusion via nonspecific or poor binding followed by a round of good growth. Conversely, phage particles which are low copy or rare within the original parent library (to the right in Fig. 3), presumably due to poor growth properties, have high information content, i.e. it is more certain that their presence in the final affinity-selected pool is due to affinity to ligand.

Figure 3. A plot of the distribution of individual peptide sequences within an imaginary combinatorial peptide display library arrayed by copy number. Peptide sequences are listed along the abscissa in order of numerical representation from highest copy number out to lowest copy number, with copy number along the ordinate. Information theory dictates that those phage clones present in large numbers (i.e. the left side of the plot) inherently possess less information than the relatively rare clones at the right-hand side of the plot.

How does the INFO program calculate a numerical quantity for information, i.e. individual amino acid sequence bias patterns? The number is based upon the estimated occurrence or representation of that individual peptide sequence within the original unscreened or parent library. Two input files are required for INFO: a text file with an ideal minimum of 50 peptide sequences from clones randomly selected from a naïve library, and a second file which contains one or more peptide sequences affinity-selected from that same library. INFO first uses AAFREQ to calculate the amino acid frequency distributions at each position of the insert in the 50 peptide sequences from the parent combinatorial library. From the observed position-specific frequencies of amino acids in the unselected library, the probability of random observation of any one peptide ($P_N$) can be calculated by multiplying the probability of each amino acid occurring at each position within the peptide along the length of the sequence: $(P_1)(P_2)(P_3)\cdots(P_N) = P_{\mathrm{total}}$, where $N$ is the number of residues in the recombinant insert.
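The calculation described for INFO can be sketched in Python as follows. This is a minimal illustration, not the RELIC implementation: the function names are hypothetical, and the `pseudocount` parameter is an assumption added here so that an amino acid never observed at a position in a small naïve sample does not force the product to zero. The last function takes the negative natural logarithm of the product, which is the information parameter that INFO reports.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_frequencies(naive_peptides, pseudocount=1.0):
    """Position-specific amino acid frequencies from naive-library peptides.

    pseudocount is an assumption of this sketch (not described in the text).
    """
    length = len(naive_peptides[0])
    if any(len(p) != length for p in naive_peptides):
        raise ValueError("all peptides must have the same insert length")
    total = len(naive_peptides) + pseudocount * len(AMINO_ACIDS)
    freqs = []
    for pos in range(length):
        counts = Counter(p[pos] for p in naive_peptides)
        freqs.append(
            {aa: (counts.get(aa, 0) + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return freqs


def peptide_probability(peptide, freqs):
    """Product of the position-specific probabilities along the insert."""
    prob = 1.0
    for pos, aa in enumerate(peptide):
        prob *= freqs[pos][aa]  # raises KeyError for non-standard residues
    return prob


def information(peptide, freqs):
    """Negative natural logarithm of the peptide probability."""
    return -math.log(peptide_probability(peptide, freqs))
```

With real data, `naive_peptides` would hold the 50 or more sequences from the unselected library, and `information` would be called once for each affinity-selected peptide.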
Since the calculated probability $P_N$ of any one specific peptide is a very small number (due to serial multiplication of many position-specific probabilities), we define an associated information parameter for that probability $P_N$, where information $= -\ln(P_N)$. For example, a fairly common 12 mer peptide may have a probability of occurrence of $4.66 \times 10^{-13}$, which would translate into an information content of $-\ln(4.66 \times 10^{-13})$ or 28.4. A rare 12 mer peptide sequence could be less represented in the parent library by two orders of magnitude or more, such as $4.66 \times 10^{-15}$ with an associated information content of 33.0. Higher information levels correspond to less probable amino acid sequences within the insert and typically correspond to sequences less favorable to viral growth. The smaller the probability of random occurrence (i.e. the larger the associated information), the greater the chance that this peptide was observed in the affinity-selected pool due to specific binding to the target (signal) as opposed to positive growth characteristics (noise). Two different real-life applications of the INFO program to affinity selection experiments are described in Section 3.2.2. Note that since information content is a natural logarithm function, a difference of about 4.7 in information between two peptides translates to a factor of about 110 difference in their estimated representation within the parent library ($e^{4.7} \approx 110$).

3.1.6 MOTIF

A number of algorithmic and heuristic approaches have been taken to detect weak sequence similarities within practicable computation times, including the Smith-Waterman algorithm [18], FASTA [19], BLAST [20, 21] and ParAlign [22]. These bioinformatic tools, however, have been developed with, and optimized for, long protein sequences. They are ill-suited for use in the analysis of combinatorial phage display data, which consist of short peptides. In addition, given both the low number of peptide sequences present in most phage libraries as compared to theoretically predicted values and the functional plasticity of amino acid side chains, the search for consensus sequences needs to be mathematically flexible. The suite of MOTIF programs is a group of motif-hunting algorithms which score similarity and group inclusion by tallying different combinations of motif properties, thereby emphasizing different search goals. Weak sequence motifs within short peptide sequence populations, however, can be readily identified with the three programs in RELIC that search for motifs within the peptide population: MOTIF1, MOTIF2 and MOTIF3.

MOTIF1 uses as input a peptide sequence file in text or FASTA format and searches for user-specified contiguous amino acid sequence motifs within that population, such as the group FVS, FLT, FLS, YVT and YLT. Alignment of short stretches such as these may aid in the identification of weaker consensus sequences on either side of this anchor sequence within the peptide family. The output from MOTIF1 is a list of motifs with conservative substitutions and the locations of the motifs within the peptide sequence. MOTIF2 searches for patterns of 3 amino acids and does not allow conservative amino acid substitutions, but does allow identical gap lengths, such as the pair SWQLAP and SAQTSP, with the motif being SXQXXP. A conserved motif of this type is possible for peptides which are long enough to generate partial secondary structure, thus enabling noncontinuous sequence conservation but contiguous spatial conservation. MOTIF3 searches for motifs containing 4 amino acids, with the minimum number of occurrences specified by the user, and outputs peptides that have identities in at least three of the four amino acids in the motif. Both MOTIF2 and MOTIF3 allow for penalty-free gaps provided the gap length is identical for all motif members; i.e. G Q A H Q L S and L M A H Q A S. Output for all three programs includes identification of the parent motif, the length of the motif, and an alignment of all the motif-bearing peptide sequences (with amino acids within the motif flagged in color) as shown in Fig. 4.

Figure 4. An actual output file from MOTIF2 is depicted. The input file was a text format list of peptide sequences obtained from two rounds of affinity screening with immobilized GTP. Each motif is identified separately, with the actual number of peptides possessing that sequence motif listed on the right-hand side, and the sequences aligned beneath. Individual residues which have been identified as part of the motif sequence are colored for each input peptide.

3.1.7 MATCH

Previous experience with small molecule binding peptides has demonstrated that even in the absence of a clearly identifiable consensus sequence motif, weaker conserved sequence patterns may be embedded within the data. These peptide sequences may not be exact matches for regions of a protein sequence, but when simultaneously aligned against the entire protein sequence may cluster together within one region of the protein. Bioinformatic summation of multiple binding peptide sequences to generate a type of cumulative sequence signature has been shown to yield valuable information regarding protein-small molecule interactions [6, 7]. The calculation of similarity between the collective sequences of a population of relatively short peptides and the sequence of a naturally occurring protein, however, raises certain algorithmic problems (manuscript in preparation). Although it can be calculated using a standard similarity matrix such as BLOSUM62 with a short window (i.e. 5 to 6 amino acids in length), that calculation produces three problems: (1) occurrences of rare amino
acids like tryptophan are weighted so heavily that they tend to overwhelm even perfect matches across 4 to 5 amino acid segments; (2) the cumulative noise levels from poor matches overwhelm less common but meaningful matches; and (3) sequence biases in combinatorial libraries tend to result in differential weighting of motifs. The first problem can be minimized by renormalizing the BLOSUM62 matrix so that the diagonal (or identity) terms are all equal. The second problem can be minimized by using a cut-off below which the individual calculated similarity is discarded from the final calculation sum. Extensive analysis of well characterized systems has led to our use of a 5 amino acid window and an experimentally determined cut-off corresponding to three identities and one similarity across the window (L. Makowski; manuscript in preparation). Although numerous strategies for offsetting the biases inherent within the peptide libraries used have been tested, none have universally improved on the calculations made in the absence of correction terms.

The programs MATCH, HETEROalign and FASTAskan all implement this algorithm by identifying any stretches of amino acid residues within a particular protein that exhibit significant similarity to a group of affinity-selected peptides. The program works as follows: The first peptide sequence in the input file is lined up against the designated protein sequence at protein residue 1 (see Fig. 5a, lines 1 and 2). Each residue within that protein sequence which is being compared to a peptide residue is given a modified BLOSUM62 score. The peptide sequence is then realigned with the protein sequence starting at protein residue 2 (see lines 3 and 4 in Fig. 5a). A second set of scores is calculated for each of the protein residues involved in this second alignment. The peptide sequence is then realigned at protein residue 3, rescored, and then at protein residue 4, etc. until the first peptide sequence is aligned with the protein sequence at the carboxy-terminal end of the protein. This same process is subsequently carried out for each peptide in the user-designated pool of peptide sequences. The software programs then tally the cumulative score for each amino acid residue within the protein sequence using the alignments above the cut-off value. If there is a significant fraction of peptides within the input sequences which exhibit high similarity to a portion of the protein sequence, even in the absence of exact sequence matches within a single peptide, these peptides will cluster underneath that region of the protein in the alignment output (see Fig. 5b). The cumulative score values for those protein residues within that region will be high (see output b, Fig. 5b). For certain protein/peptide population combinations it has been observed that chance similarity between target-selected consensus motifs and growth-related motifs can cause significant noise within the cumulative similarity scores. In these situations, it is possible to use MATCH, HETEROalign, and FASTAskan to carry out the same process with a set of peptide sequences randomly selected from the parent library, and then subtract the cumulative residue scores of this analysis from the affinity-selected scores across the entire length of the protein sequence. This generates a net similarity score reflecting peptide alignment to the designated protein with positive growth motifs subtracted out.

Figure 5. a) The MATCH program algorithm carries out sequential rounds of similarity calculations between a single peptide sequence (VITGKKAILLGE in the figure) along the length of a single protein sequence (top line of each round shown).
An identical amino acid match (red) is given a higher score than a conservative replacement (green). Each round generates a number score for each amino acid residue within the protein sequence. These sequential similarity calculations are then carried out for all of the individual peptide sequences within the input file. b) MATCH carries out a final summation step for each amino acid residue within the input protein sequence to generate an aggregate residue-specific similarity score as shown in output b. These similarity scores reflect the cumulative similarity between all of the input affinity-selected peptide sequences and the input protein sequence per amino acid residue. Output a depicts the cluster diagram for the peptide sequences in which each peptide is aligned to the input protein sequence at the position of maximal similarity.

3.1.8 HETEROalign

The protein-length cumulative similarity score output of MATCH can be visualized in fold space if the three-dimensional coordinates of the protein are available. HETEROalign uses the algorithm from MATCH to compare an input list of ligand-selected peptide sequences to a protein sequence embedded within a PDB file. The second output from HETEROalign consists of a cluster diagram (similar to MATCH output) which depicts the alignment of peptides to the protein sequence along with the heteroatom-contacting residues demarcated with X along the top of the protein sequence (see Fig. 6c). Heteroatom contact distance is defined as the minimum interatomic distance between a ligand atom center and an amino acid atom center. RELIC provides an option that lets the user set this distance to a particular value for a particular application.
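The sliding-window scoring that MATCH performs, and that HETEROalign reuses, can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the RELIC code: the renormalized BLOSUM62 matrix is replaced by a toy score (1.0 for an identity, 0.5 for membership in the same hypothetical `SIMILAR_GROUPS` class, 0 otherwise), and the cut-off of 3.5 is chosen only because it equals three identities plus one similarity under that toy scoring. The function name `cumulative_similarity` is likewise hypothetical.

```python
# Hypothetical conservative-substitution classes standing in for the
# renormalized BLOSUM62 matrix described in the text.
SIMILAR_GROUPS = ["ILVM", "FWY", "KRH", "DE", "ST", "NQ", "AG"]


def residue_score(a, b):
    """Toy score: identity 1.0, same class 0.5, otherwise 0.0."""
    if a == b:
        return 1.0
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return 0.5
    return 0.0


def cumulative_similarity(protein, peptides, window=5, cutoff=3.5):
    """Per-residue cumulative score of a peptide pool against one protein.

    Every window of every peptide is slid along the protein; windows whose
    summed score falls below the cut-off are discarded, and the residue
    scores of the remaining windows are added to the running totals.
    """
    totals = [0.0] * len(protein)
    for peptide in peptides:
        for p_start in range(len(peptide) - window + 1):
            segment = peptide[p_start:p_start + window]
            for q_start in range(len(protein) - window + 1):
                target = protein[q_start:q_start + window]
                scores = [residue_score(x, y) for x, y in zip(segment, target)]
                if sum(scores) >= cutoff:
                    for offset, score in enumerate(scores):
                        totals[q_start + offset] += score
    return totals
```

Running the same function on peptides sampled from the parent library and subtracting the two result lists position by position would give the net, growth-corrected score the text describes.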
Although this program was originally designed to analyze protein-small molecule ligand interactions, in the absence of a heteroatom within the PDB file a cluster diagram will still be generated by HETEROalign, albeit with no X marks to indicate contact points above the protein sequence. Output 1 from HETEROalign is a PDB file modified from the input PDB file in which the temperature factor for each residue has been replaced by the cumulative similarity score. Visualization of this new PDB file with a standard package such as RasMol creates a three-dimensional rendering of the protein in which red regions are those with highest similarity to the input peptide sequences and blue are regions of no similarity (by choosing temperature for color; see Figs. 6a and b). This output format allows the user to quickly ascertain where the region(s) of highest similarity to the input peptide sequences are within the protein structure in terms of spatial proximity to each other and/or to heteroatoms, and their proximity to the surface. For example, in the pictures shown in Figs. 6a and b the region of highest similarity to the input ATP-selected peptide sequences lies within the p-loop region in contact with the ATP. A third output format from HETEROalign is a list of amino acid residues with their respective similarity scores, as shown for MATCH in Fig. 5b (b).

Figure 6. Three depictions of the similarity between the sequence of phosphoenol pyruvate carboxykinase and a set of 400 unconstrained 12 mer peptides selected on the basis of their affinity to ATP through two rounds of biopanning (Makowski et al., manuscript in preparation). a) A rendering of the entire protein using RasMol [25] and coding the similarity by color where red is the highest similarity and blue the lowest. b) A close-up view of the ATP binding site rendered with RasMol with the loop predicted to be involved in ATP binding rendered as a solid ribbon. c) An alignment of ATP-selected peptides with the portion of the sequence of phosphoenol pyruvate carboxykinase exhibiting the highest similarity to the ATP-selected peptides. Residues exhibiting identity or similarity to the protein sequence are highlighted in red or orange. Most peptides align with the segment just downstream of the p-loop.

3.1.9 DistSim and CLOSEcon

DistSim is a program which measures the correlation between the HETEROalign-determined similarity score for a particular amino acid residue within a protein and the distance of that residue from a ligand bound to the protein. DistSim uses as input the HETEROalign-generated PDB file and plots the cumulative similarity score on the abscissa versus the distance from a ligand within that protein, as determined from the cocrystal structure, on the ordinate. It was designed as a measure of effectiveness for affinity selection experiments utilizing a heterogroup as the affinity bait. As an example, in Fig. 7 it can be seen that the seven amino acid residues within protein 1AYL which exhibit the highest cumulative similarity score to a pool of ATP-selected peptides (lower right-hand corner of graph) are all within 10 Å of the ATP in the crystal structure (below the horizontal line in the plot). DistSim can only be used for PDB files (generated by HETEROalign) with heteroatom coordinates in conjunction with a set of peptide sequences.

3.1.10 CLOSEcon

The related software CLOSEcon is similarly limited in application to PDB files with designated heteroatom coordinates. CLOSEcon provides a list of all the amino acid residues within a structure which are in contact (as defined by a user input maximum interatomic distance) with a heteroatom on the basis of crystallographic coordinates.
A PDB file is used as the input, with the output being a list of contiguous amino acid residue sequences (including single residue or punctate contacts). Multiple PDB files which contain the same ligand can be analyzed with CLOSEcon to obtain a population of amino acid residues which make contact with those ligands. These contact residues can be extracted and used as input for other RELIC programs. For instance, the amino acid frequencies can be calculated with AAFREQ, or both contiguous and noncontiguous motifs within these peptide segments can be identified using the MOTIF programs. This type of analysis allows for comparison of crystallographically determined contact sequences with phage display-derived affinity selection sequences.

Figure 7. The relationship between similarity and distance to the ATP as calculated by DistSim. Each data point corresponds to one amino acid. The points in the lower right correspond to the loop with the highest similarity, colored red in Figs. 6a and b.

3.1.11 FASTAcon

FASTAcon is a short peptide motif scavenging program which can search a long input list of protein sequences in FASTA format for the occurrence of user-defined peptide sequences. The program can be directed to look for either exact matches or close matches (i.e. the user can define four out of five amino acid residues as a hit). On the data entry page the RELIC server has links to whole genome protein sequences from numerous organisms, including multiple bacterial species, mouse and human, which can be used as input for FASTAcon and/or FASTAskan. This software is useful as a downstream tool from the MOTIF suite of programs to pinpoint proteins which possess an epitope similar or identical to an affinity search-identified consensus sequence. In the case of a drug-binding motif, FASTAcon can establish the uniqueness of that motif within the predicted human proteome.
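The kind of search FASTAcon performs can be illustrated with a short Python sketch. It is not the RELIC program: `parse_fasta`, `find_motif_hits` and the `min_identities` parameter are hypothetical names introduced here, and a "close match" is taken to mean a minimum count of identical positions, following the four-out-of-five example in the text.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            name = (line[1:].split() or ["unnamed"])[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def find_motif_hits(sequence, motif, min_identities=None):
    """Return (1-based position, matched segment, identities) for each hit.

    With min_identities left as None only exact matches are reported.
    """
    if min_identities is None:
        min_identities = len(motif)
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        segment = sequence[start:start + len(motif)]
        identities = sum(a == b for a, b in zip(segment, motif))
        if identities >= min_identities:
            hits.append((start + 1, segment, identities))
    return hits


def search_fasta(text, motif, min_identities=None):
    """Search every protein in a FASTA text; keep only proteins with hits."""
    results = {}
    for name, sequence in parse_fasta(text).items():
        hits = find_motif_hits(sequence, motif, min_identities)
        if hits:
            results[name] = hits
    return results
```

Calling `search_fasta(proteome_text, "HTPHP", min_identities=4)` on a whole-proteome FASTA file would list every protein carrying the pentapeptide or a four-out-of-five variant of it.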
3.1.12 FASTAskan

FASTAskan is a derivative iterated version of MATCH in which the cumulative similarity scores for a list of affinity-selected peptides are calculated within multiple protein sequences rather than a single protein sequence. The input protein sequence list can be as long as for FASTAcon (i.e. all the predicted proteins from a single genome). FASTAskan will rank the proteins in descending order by peak similarity score at any one residue within the protein sequence. The output is ordered so that the proteins most likely to bind to the ligand used for affinity screening are clustered at the top. The peptide sequence results from a biopanning experiment can therefore be used directly in a ligand-target hunt without the intermediate step of consensus motif identification.

3.2 Programs and sample applications

The following sample applications are examples which utilize the software available on RELIC. Sample input and output data, as well as explanations of program operation, are available on the RELIC website. To improve database performance, RELIC jobs are processed on two different servers. A link is e-mailed to each user-supplied address to facilitate data retrieval.

3.2.1 DNA sequence processing

Application 1: A user has multiple raw DNA sequences from phage particles affinity-selected from the NEB Ph.D.-12 or -C7C libraries against a protein or small molecule ligand. These can be used as input into the two programs DNA2PRO12 and DNA2PRO7. The programs are designed to translate the sequences of DNA inserts from these libraries into peptide sequences, although the parameters in the programs could easily be modified to accommodate other display library constructs. Both programs automatically locate the position of the insert, translate the insert and indicate any possible errors in the insert sequence such as unexpected codons or errors in the surrounding vector/insert junctions. Analysis of DNA sequence data by eye is thus whittled down to those samples which exhibit sequencing problems. These programs provide the user with a FASTA format list of peptide sequences which can then be used as input for further analyses using RELIC or other web-based software.

3.2.2 Peptide population analysis and motif identification

RELIC has seven programs that are designed either to analyze the statistical properties of a peptide population or identify weak consensus sequences (short amino acid sequences that are repeated either exactly or almost exactly within the population). These data are particularly valuable when calculated in conjunction with randomly chosen members of the unselected library, as library-specific biases can be identified and subtracted.

Application 2: AAFREQ is a program that calculates the frequency of amino acid occurrence within a peptide population as a function of position within the recombinant insert, thereby pinpointing which amino acids are over or underrepresented. This program can determine position-specific and residue-specific values such as how many threonines occur in position 1 of the inserts. Figure 8 includes a representative sample output. The input data was a text file of 12 amino acid long peptide sequences affinity selected against an immobilized form of the sugar galactose. Examination of the data in the table indicates that there are no prolines at the amino-terminal positions of the inserts (insert position 1), but a significant number at all the other positions. This frequency distribution pattern indicates significant bias against proline residues immediately adjacent to the signal peptidase cleavage site. Line 4 shows that there are no cysteines at all in any of the peptides. The value in the right-hand column gives the position-independent frequency of that particular amino acid in the input peptide population. The total column indicates that proline and threonine are the two most abundant residues in the sequences.

Figure 8. Sample output from AAFREQ. In this population of peptides, seven had alanine at position 1 and six had alanine at position 2. An unknown amino acid is denoted by x.

To assess changes in these patterns during biopanning, AAFREQ can be used with 50 to 100 phage clone sequences from the unselected original library as input and then compared with frequency values calculated from the affinity-selected pool. AAFREQ output using amino acid sequences of phage randomly selected from the two NEB display libraries (Ph.D.-12 or Ph.D.-C7C) can be viewed on the RELIC website. Data for other RELIC peptide sequences can also easily be viewed for any of the peptide analysis programs from the data entry page. AAFREQ, as well as any of the other RELIC programs (except the DNA2PRO programs), can be used with input peptide sequences from any type of display library as long as sequences from the parent library are available.

Application 3: A combinatorial peptide phage display library is only as serviceable as the complexity or diversity of its sequence members. The diversity as calculated here (as defined in [2]) is a practical measure of the diversity of the population from which the sequences were selected. Given the impossibility of sequencing a large percentage of any population, the measure that POPDIV performs cannot calculate the absolute probability of a particular sequence being present in the population, but instead gives an indication of the functional diversity available in that population as a whole. By necessity POPDIV was developed as a method for estimating the sequence diversity of a peptide library from the sequences of a limited number of the members of that library. The program can calculate the diversity on the basis of the sequences of as few as 50 peptides randomly selected from any population. This measure is particularly useful for comparison of two or more populations.

Table 2. Descriptions of three populations with very simple distributions of peptide abundance.

Population A: 90% of members are present in equal amounts; 10% of members are not present (diversity = 0.9)
Population B: 50% of members are present in equal amounts; the other 50% of members are present at twice that abundance (diversity = 0.9)
Population C: 75% of members are present in equal amounts; 25% of members are not present (diversity = 0.75)

The corresponding diversities are calculated according to [2].

The concept of population diversity when peptides are present at different abundances is discussed in more detail elsewhere [1, 2]. As a simple illustration of the concept, consider the three peptide populations described in Table 2. The measure of diversity utilized by POPDIV cannot distinguish between library A and library B in Table 2 because in a sampling experiment, the two populations are statistically identical. It can, however, detect the difference between A and C or B and C, with both A and B behaving as if they have more sequence diversity than C. This program is a useful tool for rapid assessment of the relative complexity of two or more peptide populations and is equally useful to researchers either constructing or utilizing combinatorial peptide libraries. For instance, if a user has constructed a random 12 amino acid peptide library and from the sequences of 75 clones calculates the diversity at 0.01 using POPDIV, then that population functionally behaves as if it contains approximately $4.096 \times 10^{15}$ times 0.01 or $4 \times 10^{13}$ peptides.
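The Table 2 values and the closing arithmetic can be checked with a few lines of Python. The definition of diversity is given in [2] and is not restated in this section, so the formula below is an assumption of this sketch: diversity is taken as the inverse of (number of possible members × sum of squared relative abundances), chosen because it reproduces all three values in Table 2. It computes the measure from known abundances and does not attempt the sampling-based estimate that POPDIV makes; the function name `functional_diversity` is hypothetical.

```python
def functional_diversity(abundances):
    """Assumed diversity measure: 1 / (N * sum of squared relative abundances).

    abundances holds one entry per theoretically possible member,
    with 0 for members that are absent from the population.
    """
    total = sum(abundances)
    if total <= 0:
        raise ValueError("population must contain at least one member")
    n_possible = len(abundances)
    sum_of_squares = sum((a / total) ** 2 for a in abundances)
    return 1.0 / (n_possible * sum_of_squares)


# The three populations of Table 2, scaled to 100 possible members.
population_a = [1] * 90 + [0] * 10   # 90% present equally, 10% absent
population_b = [1] * 50 + [2] * 50   # half at twice the abundance of the rest
population_c = [1] * 75 + [0] * 25   # 75% present equally, 25% absent

# Functional size of a random 12-mer library with an estimated diversity of 0.01.
possible_12mers = 20 ** 12           # 4.096e15 possible sequences
effective_size = possible_12mers * 0.01
```

Under this assumed formula populations A and B both evaluate to 0.9 and population C to 0.75, and `effective_size` is about 4.1e13, in line with the figure of roughly $4 \times 10^{13}$ quoted in the text.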
One illustration of the shift in information content of a peptide display population during a biopanning experiment is shown in Fig. 9. Three rounds of biopanning with an immobilized version of the anticancer drug Taxol were monitored by isolation and sequencing of single phage clones at each step of the process. The normalized distribution of information content for each peptide population was calculated (solid lines) and compared to the distribution in the naïve, i.e. unselected, library (dotted lines) (in this case the Ph.D.-12 peptide library from NEB). Fig. 9a contains data obtained after round 1 of affinity screening; 9b after round 2 and 9c after round 3. A comparison of the three distributions indicated that the peptide population in round 2 is slightly more enriched for high information content peptides than round 1 and much more so than the unselected population [6]. Information content values however begin to decrease significantly after three rounds of biopanning (Fig. 9c), pointing towards a bias for phage with enhanced growth properties. This trend is more readily visualized in the subtraction plots of Figs. 9a through c. Calculation of the area under each set of curves shows that that the number of peptides with information content above 35.0 goes up 12% in rounds 1 and 2, but goes down 30% in round 3 (as compared to the random curve). Application 4: Affinity biopanning is an iterative process that alternatively selects on the basis of affinity and growth characteristics. Consequently, a statistical analysis that provides a basis for identifying those peptides with the highest probability of being selected on the basis of affinity rather than growth can be of use in determining the success of a series of biopanning experiments or in giving any one peptide sequence more statistical weight than another. 
On the basis of the observed frequencies of amino acids in an unselected population, INFO calculates an information parameter associated with each peptide, which is a measure of the likelihood of observing that peptide by chance. A peptide selected after multiple rounds of biopanning but having relatively low information content has a sequence representative of peptides highly iterated within the population, suggesting that it is there on the basis of growth characteristics, not ligand affinity. A peptide with relatively high information value is unlikely to be present in a selected population due to chance or to growth characteristics; rather, it is most likely present due to its affinity to the target molecule.
Analysis of all three peptide populations with MOTIF1 demonstrated that, of the putative Taxol-binding sequences, only the round 2 sequences contained clone pairs sharing consensus sequences as long as four and five residues. Of the two pairs of pentamers identified, one pair of phage clones shared the pentapeptide HTPHP at identical positions within the insert peptide, and a second pair shared SHPST at different locations along the insert sequence. Localization of these four clones within Fig. 9b indicates that the HTPHP-containing pair possesses information contents of 35.9 and 36.3, respectively, whereas the SHPST-containing pair possesses information content values of 31.6 and 31.2, respectively. These numbers show that the HTPHP match is statistically a less likely occurrence than the SHPST match by a factor of about 110, strongly suggesting that the HTPHP motif has high affinity for Taxol [6].
A sequence search of the OWL database [23] using the HTPHP consensus sequence identified two human proteins: Bcl-2, a 239 amino acid antiapoptotic protein, and ataxin-2, a large 1312 amino acid protein which has been implicated in the pathogenesis of a neurodegenerative disorder. A similar search with the consensus sequence SHPST yielded four human megaproteins, with sizes ranging from 1035 residues to 6669 residues. Use of FASTAcon yielded identical results. Given the higher statistical significance of the HTPHP-bearing phage, the fact that the probability of a random hit for a consensus without accompanying functional significance is proportional to protein size, and the postulated central role of Bcl-2 in apoptosis, human Bcl-2 and the SHPST-containing protein kinase DNA-PKcs (Mr 4096) were tested for binding activity to Taxol. Bcl-2 was identified as an authentic Taxol-binding protein [6], whereas DNA-PKcs was ruled out (W. S. Dynan, unpublished results), corroborating the statistical prediction of the phage information content values.
Figure 9. Relative abundances of peptides plotted as a function of associated information for peptides selected for affinity to Taxol after a) one round of biopanning, b) two rounds, and c) three rounds. The three curves shown in d) are the result of subtraction of the normalized random curve from each of the three Taxol-selected curves. Areas of the curves in d) above the zero line indicate an upward shift in representation for peptides of that information content within that peptide pool relative to the parent library. Areas of the curves in d) below the zero line indicate a downward shift in representation for peptides of that information content within that peptide pool relative to the parent library. A shift to higher information can be seen in the first two rounds, with the trend shown more clearly in the subtraction curves depicted in d). After two rounds a significant shift to lower information suggests that selection for good growth is beginning to dominate affinity selection. In Fig. 9b the positions of peptides containing the SHPST motif are marked by open squares; the positions of peptides containing HTPHP are marked by closed squares.
A second application of the INFO program can be seen in Fig. 10. A published data set of peptide sequences obtained via affinity selection of the NEB Ph.D.-12 library using the apical domain of the chaperonin GroEL as bait [24] included a number of histidine-containing peptides. These sequences were excluded from the subsequent binding studies by the authors, as the peptides were concluded to have been selected by interaction with the background Ni²⁺-NTA resin. A comparison of the information content of the parent library (Fig. 10, red curve) and the GroEL-selected peptides (Fig. 10, black curve) demonstrates a significant shift to higher information peptides in the affinity-selected pool. The peptide with the highest affinity for GroEL as measured by fluorescence anisotropy, peptide SBP, has an information content of 34.9, a relatively high number for that parent library (see arrow in Fig. 10). Separation of the poly-his peptide population information content into a separate curve (green curve, Fig. 10) demonstrates that the majority of the his-containing peptides have an information content centering around 31.5, indicating that they are over 30-fold more abundant in the Ph.D.-12 library than the SBP peptide.
Figure 10. A graphical representation of the output from INFO with input data from [24] as described in Section 3.2.2, application 4. The red curve is the information distribution for randomly chosen peptide sequences of the parent library Ph.D.-12 (NEB). The black curve is the same INFO-generated normalized information distribution for the published GroEL apical domain-selected peptides out of that library. The green curve is the subset of that affinity-selected group of peptides which contains multiple histidine residues. A clear shift to higher information content can be seen between the parent library and the affinity-selected set of sequences. The very low information tail of the black curve, however, is taken up by the poly-his peptides assumed by the authors to be present as a result of binding to the nickel protein purification column [24]. The peptide shown by fluorescence anisotropy as having the highest affinity for GroEL lies within the highest peak of the selected peptides at the high information end of the black curve, as shown by the position of the arrow (information content = 34.897).
Application 5: The MOTIF1 program uses as input a peptide sequence file and searches for motifs within the population. Allowing for conservative substitutions, motifs of user-specified length are identified. An example is given here for several motifs found within a population of peptide sequences that have been affinity selected using an immobilized version of ATP as the bait.
Peptide      Position     Oligo
 1   25       7   5       VAVAL   LALAL
19   59       6   2       IPSVQ   MPTLN
38   40       3   2       PSLLS   PSILS
38   40       4   3       SLLST   SILSS
43   82       7   6       PLLLT   PLLLS
Position refers to the position of the motif in the peptide: VAVAL, for example, starts at position 7 in the peptide NFSTRTVAVALF, which is the first peptide in the input list.
MOTIF2 also uses as input a peptide sequence text file and searches for motifs within the population. Conservative substitutions are not allowed, whereas nonpenalized gaps are identified in the output, along with a list of source peptides for each peptide printed in cluster format. An example is given in Fig.
4 of the output of MOTIF2-identified consensus motifs found within a population of peptide sequences that have been affinity selected using an immobilized version of GTP.
3.2.3 Protein/ligand interaction analysis using proteins of known structure
Many users obtain combinatorial peptide data in the study of protein-ligand interactions. RELIC presently has three programs that use PDB files as a basis for the analysis of protein-ligand interactions in conjunction with phage display data.
Application 6: The program CLOSEcon uses as input one or more PDB files (all with the same ligand) and provides a list of the amino acid residues that are in contact with a ligand on the basis of crystallographic coordinates, where contact is defined by a maximum interatomic distance chosen by the user. Proteins may be downloaded from the PDB at http://www.rcsb.org/PDB/, and then uploaded onto the data entry page for either single protein analysis or analysis of multiple proteins containing the same ligand. The example shown in Fig. 11 uses 1A9C (GTP cyclohydrolase I) and extracts the residues within 10 Å of the GTP heteroatom, producing two output files. The first output file lists the amino acid sequence and number, obtained from the PDB file, of the residues that are within 10 Å atom-to-atom of the GTP molecule. The second output file lists the residues along with the chain and the name of the protein. Single amino acids indicate punctate contact, while peptide strings indicate extended contact.
Figure 11. An actual output file from the CLOSEcon program. The input PDB file 1A9C is a GTP cyclohydrolase I mutant cocrystallized with its GTP ligand. Directly below a printout of all the residues whose coordinates are contained within the input PDB file is a list of all continuous residue strings within 10 Å of the GTP ligand, in order of amino-terminus to carboxy-terminus.
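The contact search described for CLOSEcon can be sketched in a few lines. This is a minimal illustration, not the RELIC implementation: it assumes standard fixed-column PDB ATOM/HETATM records, identifies the ligand by its three-letter residue name, and groups contact residues into runs of consecutive residue numbers so that isolated residues (punctate contact) can be told apart from strings (extended contact). The function names are hypothetical.

```python
import math

def parse_atoms(pdb_text):
    """Yield (record, res_name, chain, res_seq, xyz) from ATOM/HETATM lines.

    Uses the fixed column layout of the PDB format.
    """
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record in ("ATOM", "HETATM"):
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            yield record, line[17:20].strip(), line[21], int(line[22:26]), xyz

def residues_near_ligand(pdb_text, ligand_name, cutoff):
    """Residues with at least one atom within `cutoff` angstroms of any ligand atom."""
    atoms = list(parse_atoms(pdb_text))
    ligand = [xyz for rec, name, _, _, xyz in atoms
              if rec == "HETATM" and name == ligand_name]
    close = set()
    for rec, name, chain, seq, xyz in atoms:
        if rec == "ATOM" and any(math.dist(xyz, l) <= cutoff for l in ligand):
            close.add((chain, seq, name))
    return sorted(close)

def contiguous_runs(residues):
    """Group sorted (chain, seq, name) tuples into runs of consecutive residues."""
    runs = []
    for res in residues:
        last = runs[-1][-1] if runs else None
        if last and last[0] == res[0] and last[1] == res[1] - 1:
            runs[-1].append(res)
        else:
            runs.append([res])
    return runs
```

A call such as `contiguous_runs(residues_near_ligand(text, "GTP", 10.0))` would then give one list per continuous stretch of contact residues. Alternate locations, insertion codes and multi-model files are ignored in this sketch.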
A major driving force in the development of the RELIC database was the need for genomic methodology which aids in the functional annotation of whole genomes. Our group has attempted to use combinatorial peptide phage display to identify small molecule binding sites within the primary amino acid sequences of proteins. Bioaffinity screening against immobilized, tagged versions of numerous small molecules, such as the metabolite ATP and the anticancer drug Taxol, has generated populations of peptide sequences with binding affinity for these ligands [6, 7]. Algorithm development related to this work has been exploited to generate software with novel capabilities. Applications 7 through 10 highlight some of the potential applications of this software.
Application 7: If the user has a PDB file, the program HETEROalign will predict where in the protein structure a small molecule ligand is most likely to bind, using a population of peptides selected for binding to that small molecule. HETEROalign provides three visualizations of the similarity between a protein sequence and a population of peptides. The first visualization is a three-dimensional representation of the similarity by color. Any standard three-dimensional visualization package can be used to show similarity when the colors of the image are coded to temperature factor, as shown in Figs. 6a and b using RasMol [25]. Figures 6a and b depict the three-dimensional structure of the ATP-binding protein 1AYL. The ATP molecule is rendered in CPK/spacefill mode. The PDB output file from HETEROalign demonstrates that maximal similarity to the input peptide sequences for this protein lies along a stretch of α-helix which wraps around the ATP molecule. This stretch contains the canonical ATP-binding motif known as a p-loop [9–11].
The second output file is the sequence of the protein in text format, with those peptides exhibiting significant similarity aligned to it, together with the similarity score of each residue in the protein sequence; conserved residues are highlighted in color in the region of maximum similarity. An example of this output type is shown in Fig. 6c. Bioaffinity screening of combinatorial peptide display libraries using a purified protein as a target will produce a population of peptides with affinity to that protein. In this instance HETEROalign can be similarly utilized to map segments of the binding partner protein with high similarity to the affinity-selected peptide sequences and to quickly assess by visualization if they are clustered and/or on the surface of the protein molecule. DistSim makes possible the type of plot shown in Fig. 7. Using the PDB file generated by HETEROalign as data input, this program demonstrates that the amino acid residues in this protein most similar to those of the peptide population are also physically closest to the hetero group of the protein (i.e. ATP).
3.2.4 Protein/ligand interaction analysis using any protein sequence
The last three programs in the RELIC database use protein sequence FASTA text files as a basis for the analysis of protein-ligand interactions. Users can apply this software in the analysis of either single proteins or whole genomes, using either peptides collected from RELIC’s peptide database or peptides identified by the user from bioaffinity experiments.
Application 8: A user wishes to assess the probability that a sequenced protein of unknown function will bind to ATP. MATCH will carry out a calculation of the similarity between that protein sequence and a population of ligand-selected peptides when no PDB file is available.
The FASTA sequence of that protein can be entered and compared, using MATCH, to phage-displayed peptides stored in RELIC that have been affinity selected for binding ATP. The degree of similarity calculated will provide a measure of the probability of binding, which can then be compared with the values obtained by using MATCH with known ATP-binding proteins as input data. The output is a set of aligned peptides similar to that shown in Fig. 6c, with the protein/ligand contact points predicted to be at the peptide cluster points.
The last two programs are designed to search large sets of protein sequences, up to and including whole genomes, where the protein sequences are in FASTA format and stored in text files. Input file sizes of up to 50 MB can be accommodated (for example, the International Protein Index FASTA file of predicted and known human proteins is at present 26.4 MB).
Application 9: A user needs to know how many and which proteins in the Escherichia coli genome have the consensus ATP binding sequence known as the p-loop or Walker A box [26]. The p-loop consensus sequence A/GxxxxGKS/T can be searched for with the program FASTAcon, which will provide to the user a list of those proteins in the input protein sequence list containing the consensus sequence. This program is useful for downstream scanning of genomes using the output from any of the MOTIF programs. As a second example, using the IPI human genome obtained from the European Bioinformatics Institute (http://www.ebi.ac.uk/proteome/), the consensus sequence HTPHP was identified in the proteins listed below:
Consensus sequence: HTPHP
#  position accession# databank
1  239.IPI:IPI00020961.1|SWISS-PROT:P10415|REFSEQ_NP:NP_000624|TREMBL:Q96PA0
2  422.IPI:IPI00077228.1|REFSEQ_XP:XP_108348 Tax_Id=9606 hypothetical protein
3  715.IPI:IPI00142188.2|REFSEQ_XP:XP_170134|ENSEMBL:ENSP00000300179 Tax_Id=
4  252.IPI:IPI00164279.1|ENSEMBL:ENSP00000310165 Tax_Id=9606
5  205.IPI:IPI00031177.1|REFSEQ_NP:NP_000648 Tax_Id=9606 B-cell lymphoma pro
6  1318.IPI:IPI00164711.1|REFSEQ_NP:NP_002964|TREMBL:Q99493;Q99700|ENSEMBL:EN
7  359.IPI:IPI00030308.1|REFSEQ_NP:NP_115525|TREMBL:Q9H0D9|ENSEMBL:ENSP00000
8  860.IPI:IPI00058937.2|REFSEQ_XP:XP_067967 Tax_Id=9606 similar to Gp150-P1
9  348.IPI:IPI00161927.1|REFSEQ_XP:XP_173469 Tax_Id=9606 hypothetical protein
# sequences scanned = 46840
# aa in scanned proteins = 18270974
Application 10: Bioaffinity screening of phage display libraries, as mentioned above, can yield a group of binding peptides with no obvious sequence consensus and yet potentially rich with information. FASTAskan calculates the cumulative similarity between an entire peptide population and a large set of protein sequences in FASTA format stored in text files. Scores are generated by calculating the similarity (see MATCH above) between all peptides and each segment of the protein sequence. The output is a list of the proteins with the highest peak value of cumulative peptide population similarity, ordered from highest to lowest according to the value of the peak score. This allows the user to rank order a list of proteins against a set of affinity-selected peptides to predict which proteins are most likely to bind to the selection bait. Output from FASTAskan includes 5000 proteins in the listing. An example application of FASTAskan uses peptide sequences (stored in RELIC) selected for binding to ATP. The input of these peptides plus the predicted proteome sequence of E. coli gave an output list of the E. coli proteins with the highest peak value of the similarity, ordered according to the value of the peak score. This output of FASTAskan is shown in Fig. 12.
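The two genome-scale scans described in Applications 9 and 10 can be illustrated with a short sketch. It is not the RELIC code: the consensus notation is simply translated into a regular expression (with lowercase `x` as a wildcard and `A/G` as a choice), and the similarity used for ranking is a plain count of identical residues, standing in for the similarity measure that MATCH and FASTAskan actually use. All function names, the window width and the default list length are hypothetical.

```python
import re

def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def consensus_to_regex(consensus):
    """Translate e.g. 'A/GxxxxGKS/T' into the pattern '[AG]....GK[ST]'."""
    parts = []
    for token in re.findall(r"[A-Za-z](?:/[A-Za-z])*", consensus):
        options = token.split("/")
        if options == ["x"]:
            parts.append(".")          # lowercase x: any residue
        elif len(options) == 1:
            parts.append(options[0])   # single fixed residue
        else:
            parts.append("[" + "".join(options) + "]")
    return re.compile("".join(parts))

def consensus_hits(records, consensus):
    """(header, 1-based position of first match) for proteins containing the consensus."""
    regex = consensus_to_regex(consensus)
    hits = []
    for header, seq in records:
        match = regex.search(seq)
        if match:
            hits.append((header, match.start() + 1))
    return hits

def best_match(peptide, window):
    """Stand-in similarity: most identities between the window and any peptide segment."""
    w = len(window)
    return max(sum(a == b for a, b in zip(peptide[j:j + w], window))
               for j in range(max(1, len(peptide) - w + 1)))

def peak_score(sequence, peptides, width=5):
    """Highest cumulative similarity of the peptide population to any protein segment."""
    return max(sum(best_match(p, sequence[i:i + width]) for p in peptides)
               for i in range(max(1, len(sequence) - width + 1)))

def rank_proteins(records, peptides, width=5, top=10):
    """Proteins ordered from highest to lowest peak score."""
    scored = [(peak_score(seq, peptides, width), header) for header, seq in records]
    return sorted(scored, reverse=True)[:top]
```

The nested loops make the ranking cost grow with the product of proteome size, peptide count and peptide length, which is acceptable for a sketch but would need optimization for inputs approaching the 50 MB limit mentioned above.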
The list shows the top 10 scoring E. coli K12 proteins ranked by similarity to a population of 100 unconstrained 12-mer peptides (from NEB Ph.D.-12) affinity selected for binding to immobilized biotinylated ATP. Annotation for the entries on the list in Fig. 12 demonstrates that only proteins known to bind ATP or predicted to bind to ATP on the basis of sequence similarity with ATP-binding proteins are present.
Figure 12. Sample output from FASTAskan. The protein in E. coli K12 with the highest similarity to the ATP-selected peptides had a peak similarity of 22.17 and is a member of the ATP-dependent helicase superfamily II.
4 Concluding remarks
A web-based bioinformatics server, RELIC, has been constructed and contains a suite of bioinformatics programs capable of extracting functional information from combinatorial peptide phage display data in the presence or absence of exact sequence consensus motifs. In addition, populations of peptide sequences which bind to numerous small molecules ranging from ATP to Taxol can be downloaded from the web site. RELIC seeks to fill an unmet need by providing to the phage display community a set of appliances for the analysis of populations of peptides and for the comparison of populations of peptides both to each other and to the sequences of naturally occurring proteins in an easy-to-use, web-accessible format. As additional molecular ligands are screened, these peptide sequences will be incorporated into future versions of RELIC. Bioinformatic analysis of these new data sets will produce improved, updated versions of the RELIC programs and expand our ability to identify small molecule binding proteins as well as pinpoint the ligand-binding sites therein.
The authors wish to thank R. F. Fischetti and M. Scholle for data processing advice and helpful discussions, respectively. This work was funded by a grant from the Office of Biological and Environmental Research, Department of Energy under Contract No. W-31-109-Eng-38 to D. J. R.
5 References
[1] Rodi, D. J., Soares, A. S., Makowski, L., J. Mol. Biol. 2002, 322, 1039–1052.
[2] Makowski, L., Soares, A., Bioinformatics 2003, 19, 483–489.
[3] Rodi, D. J., Makowski, L., Curr. Opin. Biotechnol. 1999, 10, 87–93.
[4] Zucconi, A., Dente, L., Santonico, E., Castagnoli, L., Cesareni, G., J. Mol. Biol. 2001, 307, 1329–1339.
[5] Iannolo, G., Minenkova, O., Gonfloni, S., Castagnoli, L., Cesareni, G., Biol. Chem. 1997, 378, 517–521.
[6] Rodi, D. J., Janes, R. W., Sanganee, H. J., Holton, R. et al., J. Mol. Biol. 1999, 285, 197–204.
[7] Rodi, D. J., Agoston, G. E., Manon, R., Lapcevich, R. et al., Comb. Chem. High Through. Screen. 2001, 4, 553–572.
[8] Sigrist, C. J., Cerutti, L., Hulo, N., Gattiker, A. et al., Bioinformatics 2002, 3, 265–274.
[9] Wolf, Y. I., Brenner, S. E., Bash, P. A., Koonin, E. V., Genome Res. 1999, 9, 17–26.
[10] Saraste, M., Sibbald, P. R., Wittinghofer, A., Trends Biochem. Sci. 1990, 11, 430–434.
[11] Kinoshita, K., Sadanami, K., Kidera, A., Go, N., Protein Eng. 1999, 12, 11–14.
[12] Johnson, J. M., Church, G. M., Proc. Natl. Acad. Sci. USA 2000, 97, 3965–3970.
[13] Stuart, A. C., Illyin, V. A., Sali, A., Bioinformatics 2002, 18, 200–201.
[14] Roche, O., Kiyama, R., Brooks, C. L., J. Med. Chem. 2001, 44, 3592–3598.
[15] Kay, B. K., Adey, N. B., He, Y.-S., Manfredi, J. P. et al., Gene 1993, 128, 59–65.
[16] Lowman, H. B., Wells, J. A., J. Mol. Biol. 1993, 234, 564–578.
[17] Shannon, C. E., The Bell System Technical Journal 1948, 27, 379–423, 623–656.
[18] Smith, T. F., Waterman, M. S., J. Mol. Biol. 1981, 147, 195–197.
[19] Pearson, W. R., Lipman, D. J., Proc. Natl. Acad. Sci. USA 1988, 85, 2444–2448.
[20] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J., J. Mol. Biol. 1990, 215, 403–410.
[21] Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J. et al., Nucleic Acids Res. 1997, 25, 3389–3402.
[22] Rognes, T., Nucleic Acids Res. 2001, 29, 1647–1652.
[23] Bleasby, A. J., Akrigg, D., Attwood, T. K., Nucleic Acids Res. 1994, 22, 3574–3577.
[24] Chen, L., Sigler, P. B., Cell 1999, 99, 757–768.
[25] Sayle, R. A., Milner-White, E. J., Trends Biochem. Sci. 1995, 20, 374.
[26] Walker, J. E., Saraste, M., Runswick, M. J., Gay, N. J., EMBO J. 1982, 1, 945–951.