Rochester Institute of Technology
RIT Scholar Works
Theses, Thesis/Dissertation Collections
Summer 2005

Isoelectric point prediction from the amino acid sequence of a protein
Matthew Conte

Recommended Citation
Conte, Matthew, "Isoelectric point prediction from the amino acid sequence of a protein" (2005). Thesis. Rochester Institute of Technology.

THESIS
ISOELECTRIC POINT PREDICTION FROM THE AMINO ACID SEQUENCE OF A PROTEIN
Submitted by Matthew Conte
Department of Biological Sciences
In partial fulfillment of the requirements for the Master of Science degree in Bioinformatics at Rochester Institute of Technology
Summer 2005

To: Head, Department of Biological Sciences
The undersigned state that Matthew Conte, a candidate for the Master of Science degree in Bioinformatics, has submitted his thesis and has satisfactorily defended it. This completes the requirements for the Master of Science degree in Bioinformatics at Rochester Institute of Technology.

Thesis committee members:
Gary R. Skuse (Committee Chair)
Paul A. Craig (Thesis Advisor)
Douglas P. Merrill
Abstract

Proteins often do not migrate as expected in two dimensional electrophoresis based on their primary sequence. The predicted isoelectric point (pI) frequently does not coincide with experimental pI values obtained in the laboratory. The differences between predicted pI and experimental pI led to this study. Initially, 2DE data from the E. coli proteome was collected and formatted. This dataset was split into three parts based on the level of discrepancy (ΔpI), and each subset was run through a pipeline. At each stage of the pipeline the data were analyzed by comparing each of the three ΔpI subsets to one another. The pipeline consisted of a naive approach (considering individual amino acid frequencies), followed by the evaluation of four different alphabets formed by grouping similar amino acids based on their charge, chemical, functional, and hydrophobic properties. The final step in the pipeline involved investigating dipeptide sequences using both the 20 amino acid alphabet and a simpler alphabet consisting of different groupings. An analysis with data for the simplified alphabets demonstrated the existence of certain dipeptide sequences which correlate well with [...].
Table of Contents

1 Introduction
2 Methods
  2.1 Forming the data set
  2.2 Experimental and predicted pI values
  2.3 Extracting useful information from collected subset sequences
    2.3.1 Amino acid frequency analysis (naive approach)
    2.3.2 Frequency of amino acids (alphabets approach)
    2.3.3 Frequency of amino acids (dipeptide approach)
    2.3.4 Pipeline workflow
3 Results
  3.1 Naive approach
  3.2 Alphabets approach
    3.2.1 Charge
    3.2.2 Chemical
    3.2.3 Functional
    3.2.4 Hydrophobic
  3.3 Dipeptide approach
  3.4 Dipeptide threshold
  3.5 Dipeptide using alphabets
    3.5.1 Charge
    3.5.2 Chemical
    3.5.3 Functional
    3.5.4 Hydrophobic
4 Discussion
5 Conclusions
6 References

Introduction

Two-dimensional gel electrophoresis (2DE) has been an important technique for the field of proteomics over the past two decades. 2DE allows the researcher to separate and identify thousands of proteins from a cellular extract in a single experiment. 2DE is difficult and time consuming, as it is necessary to determine ideal initial conditions, wait for results, and possibly change conditions after that (1). In addition, reproducibility of gels and comparison of 2DE results between separate laboratory groups has proved difficult (1). In 2DE, proteins are separated in the first dimension by their isoelectric points (pI) (the pH at which the net charge of the protein is zero) and in the second dimension by their molecular weights. The accurate prediction of protein isoelectric point and molecular weight (MW) using simply the amino acid sequence of the protein would be extremely valuable to researchers who use two-dimensional gel electrophoresis.
Computational procedures for calculating and predicting the pI of a protein based on the amino acid composition of the protein and the dissociation constants of the charged groups have been developed (2-8). The accuracy of these algorithms is limited by the certainty of the values for the dissociation constants and by microenvironmental effects such as charge-charge interactions and post-translational modifications.

To systematically explore the relationship between pI, molecular weight and protein sequence, a data set of proteins was collected and organized from a model organism. The Escherichia coli proteome was chosen since it contains few post-translational modifications such as methylation, acylation, glycosylation, or phosphorylation, which can alter the pI/MW of each protein. The presence of these modifications makes pI/MW predictions much more difficult, since the modifications may cause proteins to migrate to a position on a 2-D gel that is quite different from what is predicted based solely on the amino acid sequence. E. coli is also one of the best characterized prokaryotes, and much more data beyond simply the protein sequence is widely available for it.

At this point it is necessary to consider the basic structural features of proteins and the role of individual amino acids in the structure and function of proteins. Figure 1 below shows the structure of the 20 amino acids with side chain structures shown in red (10). The charge on all proteins arises from the charge on the side chains, as well as the carboxy- and amino-termini, some prosthetic groups, and bound ions. Our pI prediction tool (11) is designed to calculate the charge on amino acid side chains and the carboxy- and amino-termini. The charge on the side chains depends on the pH of the solution and the pKa of the side chains. It is also affected by the environment around a side chain. Our current calculation model uses the following pKa values for the ionizable groups on the protein (Table 1) and does not make any adjustments to the pKa values of the amino acid side chains, regardless of their localized environment within the protein. We also assume that the separation is based on the total charge on the protein, not the mass-to-charge ratio.

[Figure 1: structures of the 20 amino acids]

Figure 1. Structures of amino acids, with side chains shown in red and the amino and carboxylate groups shown in green and blue (10).

The charge on the protein is the sum of the charges on the individual amino acid side chains. However, the charge on individual amino acid side chains can vary when they are near a group of non-polar or highly charged side chains. For example, the normal pKa for glutamic acid is about 4.1. In lysozyme, two glutamic acid residues are in the active site. One is in a polar environment and has a normal pKa value. The other is in a hydrophobic environment, where a negative charge is energetically unfavorable. Therefore the pKa value for this glutamate side chain increases, which then decreases the extent of the deprotonation of that side chain. This is very important in the mechanism of lysozyme activity, which requires that one of the side chains be charged (deprotonated) and the other be uncharged (protonated) at the same time.

In a second example, the active site serine in serine proteases has a much different acid-base behavior than other serines found in most proteins. The normal pKa value for the hydroxyl group of the serine side chain is about 15, meaning that this group is not ionized in the physiological pH range. In the active sites of serine proteases, the interaction of the active site serine with nearby histidine and aspartate side chains (the so-called catalytic triad) leads to the ionization of the serine hydroxyl group. Meanwhile, the pKa value is reduced from 15 to a value closer to 7 or 8. This example makes it clear that the microenvironment of an amino acid side chain can change its ionization behavior.

Other effects on the pKa of an amino acid side chain can be seen when certain residues are positioned next to each other. For example, an arginine residue normally found in proteins will have a typical pKa of about 12.5 (Table 1 below) and carry a full +1 charge. However, when two basic arginine residues are adjacent in a protein sequence, the pKa values will decrease due to repulsion between the positive charges. This reduction in pKa value, in turn, will cause one or both of the arginine side chains to become less ionized and carry only a fractional positive charge. Table 1 below lists the typical pKa values for ionizable groups in proteins (9).
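The calculation model described in this section (fixed pKa values, no microenvironment adjustment, separation governed by total charge) can be sketched in a few lines. The following Python sketch is an illustration only, not the code of the prediction tool (11) or of the ExPASy tool: the function names, the use of the Henderson-Hasselbalch relation for each group, and the bisection search over pH 0 to 14 are assumptions introduced here. The pKa constants are the typical values given in Table 1.

```python
# Typical pKa values from Table 1 (reference 9 in the thesis).
POSITIVE_PKA = {"H": 6.0, "K": 10.8, "R": 12.5}            # +1 when protonated
NEGATIVE_PKA = {"D": 4.1, "E": 4.1, "C": 8.3, "Y": 10.9}   # -1 when deprotonated
N_TERMINUS_PKA = 8.0   # terminal alpha-amino group
C_TERMINUS_PKA = 3.1   # terminal alpha-carboxyl group


def positive_fraction(pka, ph):
    """Fraction of a basic group that is protonated (charge +1) at this pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def negative_fraction(pka, ph):
    """Fraction of an acidic group that is deprotonated (charge -1) at this pH."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


def net_charge(sequence, ph):
    """Sum the fractional charges of both termini and all ionizable side chains."""
    charge = positive_fraction(N_TERMINUS_PKA, ph) - negative_fraction(C_TERMINUS_PKA, ph)
    for residue in sequence.upper():
        if residue in POSITIVE_PKA:
            charge += positive_fraction(POSITIVE_PKA[residue], ph)
        elif residue in NEGATIVE_PKA:
            charge -= negative_fraction(NEGATIVE_PKA[residue], ph)
    return charge


def estimate_pi(sequence, tolerance=1e-4):
    """Find the pH where the net charge is zero by bisection.

    Net charge decreases monotonically with pH, so one root lies in [0, 14].
    """
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


if __name__ == "__main__":
    for peptide in ("G", "DDDD", "KKKK"):
        print(peptide, round(estimate_pi(peptide), 2))
```

Because every residue of a given type receives the same pKa, a sketch like this cannot reproduce the lysozyme, serine protease, or adjacent-arginine effects described above, which is the limitation the rest of the study explores.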
Group                             Typical pKa
Terminal α-carboxyl group         3.1
Aspartic acid, Glutamic acid      4.1
Histidine                         6.0
Terminal α-amino group            8.0
Cysteine                          8.3
Tyrosine                          10.9
Lysine                            10.8
Arginine                          12.5

Table 1. These are pKa values that are commonly found for these side chains when they are part of a protein. The pKa values for these side chains may be quite different for the free amino acid in solution. pKa values also depend on temperature, ionic strength, and the microenvironment of the ionizable group (9).

As we began to consider the impact of amino acid sequence on the ionization behavior of individual amino acid side chains, the need to create groups of amino acids based on their chemical and physical characteristics, rather than concentrating on each individual amino acid, became apparent. We elected to divide the amino acids into groups based on their charge, chemical, functional, and hydrophobic characteristics. Dividing the amino acids into these property groups enables us to use smaller alphabets in our calculations. We used these groups to rewrite protein sequences into an alternative alphabet that is much smaller than the normal amino acid alphabet of 20 characters. Table 2 below describes how each alphabet that was used is categorized (12), based on which amino acids fall under what particular types. The Methods section contains examples of protein sequences that have been translated into these different alphabets.

Alphabet Type (size)   Code   Meaning       Amino Acids with that Code
Charge (3)             A      Negative      D, E
                       C      Positive      H, K, R
                       N      No charge     A, C, F, G, I, L, M, N, P, Q, S, T, V, W, Y
Chemical (8)           A      Acidic        D, E
                       L      Aliphatic     A, G, I, L, V
                       M      Amide         N, Q
                       R      Aromatic      F, W, Y
                       C      Basic         R, H, K
                       H      Hydroxyl      S, T
                       I      Imino         P
                       S      Sulphur       C, M
Functional (4)         A      Acidic        D, E
                       C      Basic         H, K, R
                       H      Hydrophobic   A, F, I, L, M, P, V, W
                       P      Polar         C, G, N, Q, S, T, Y
Hydrophobic (2)        I      Hydrophobic   A, F, I, L, M, P, V, W
                       O      Hydrophilic   C, D, E, G, H, K, N, Q, R, S, T, Y
Table 2. Description of four abbreviated amino acid sequence alphabets: Charge, Chemical, Functional, and Hydrophobic (12). Shown are the new alphabet codes used for each different alphabet, what each code represents in terms of properties of amino acids, and the specific amino acids that are included in each property.

Proteins that have a significant difference between their predicted pI/MW (obtained using similar algorithms as mentioned above) and their experimental pI/MW will be studied. As mentioned before, certain amino acids that occur in succession need to be considered. These trends in the periodicity of certain amino acids of certain proteins (those with large ΔpI values) that do not occur in other proteins (those whose pI values were accurately predicted) are important. They may lead to a more accurate prediction of the pI and MW of all proteins from their amino acid compositions.

Methods

Forming the data set

The ExPASy Server's SWISS-2DPAGE database (13) provides extensive 2-D gel information for human, mouse, Arabidopsis thaliana, Dictyostelium discoideum, E. coli, Saccharomyces cerevisiae, and Staphylococcus aureus (N315) proteins, which are also cross-referenced in Swiss-Prot. [...] The E. coli proteome in the database is a compilation of pI/MW sets contributed by five different research groups (14-18). It was decided that the proteins characterized by each research group should be separated, since experimental conditions varied among them. The proteins characterized by Vanbogelen et al. (14), Pasquali et al. (15), and Phillips et al. (16) were ignored because these [...]. Two sets were created: the proteins denoted by Tonella et al. (17) and the proteins denoted by Yan et al. (18). The Tonella set was also separated according to the pH range used for isoelectric focusing (pH 4-5, 4.5-5.5, 5-6, 5.5-6.7, 6-9, 6-11). We concentrated on the Tonella set because it covered more than 70% of the E. coli proteome and all of the experiments were carried out under the same conditions.

The first step was to retrieve the 2-D gel information for each of the gels. ExPASy provides a way to get the data from each 2-D gel in a tab delimited format that includes each spot on a gel (some proteins were repeated for multiple spots on a gel). The fields contained in these files included: protein name, gene name, protein description, SWISS-2DPAGE Serial Number, SWISS-2DPAGE Accession Number, identification method (gel matching, microsequencing, or peptide mass fingerprinting), experimental pI, experimental MW, and references. Having this data in tab delimited format gave far greater ease of use when later performing any type of analysis on the data (such as comparing experimental pI to predicted pI).

We then matched the experimental pI/MW for each protein with predicted pI/MW values. This allows us to compare experimental pI/MW values with predicted pI/MW values. ExPASy provides its own pI/MW prediction tool (19), which requires a list of Swiss-Prot IDs as its input (e.g. P00274). We have also developed our own tool for predicting pI/MW, which requires a FASTA file or a file in Protein Data Bank format as its input (11). Both of these prediction tools are based on a calculation using pKa values of amino acids, as described earlier in the introduction and developed by Bjellqvist et al. (19).

A list of protein IDs (2DPAGE Accession Number and Swiss-Prot IDs, for both tools) was then made for each gel. This list of proteins was then used to retrieve a FASTA file of the sequences of all of these proteins. The IDs for each gel were submitted in list format to the NCBI tool for retrieving sequences, Batch Entrez, at http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Protein. The sequences were downloaded in FASTA format to be used in our prediction tool. Batch retrieval at NCBI was chosen over the Swiss-Prot tool that ExPASy provides because the latter does not include, for whatever reason, the initial methionine residue when retrieving sequences in FASTA format.

However, using the FASTA file from NCBI caused a few problems when matching the output of our tool to the set of proteins from each gel, since each protein in the output was recorded by Genbank accession number and not by Swiss-Prot ID. This was solved by removing the Genbank entry from each FASTA header (leaving just the Swiss-Prot ID) using a simple regular expression: ":%s/gi|\d*|sp|//" (quotations excluded). This FASTA file was then fed into our tool, where the output can be conveniently imported into a Microsoft Excel file. The output of the ExPASy "Compute pI/MW tool" (19) was not quite as easy to use, since it does not produce a format that can be imported into Excel readily. The output was edited using a few regular expressions, most notably ":%s/\s\s*/\t/g" (quotations excluded), which transformed it into a tab delimited text file, allowing it to be easily imported into Excel. Both sets of output were matched by Swiss-Prot ID to the tab delimited file for each gel in the Excel files. Nevertheless, both tools gave strikingly similar results. The pI/MW predictions derived from the Tonella data (17) and the Yan (DIGE) data (18) were compared with the experimental values, and the results can be seen at http://www.rit.edu/~mac3948/E2D/Ecoli/.

Experimental and predicted pI values

Looking at the compiled data set of experimental pI versus predicted pI, it was noticeable that some predicted pI values were far different from experimental pI values. Some proteins differed in predicted pI by as much as 1.86 pH units (e.g. P06128, Phosphate-binding periplasmic protein (PBP), see Appendix A). However, for other proteins the predicted pI was exactly the same as the experimental pI (e.g. P06960, Ornithine carbamoyltransferase chain F (OTCase-2), see Appendix A). To better characterize these discrepancies across all of the proteins, a simple calculation was performed:

Experimental pI - predicted pI = ΔpI (Delta pI)    (Eq. 1)

The difference in experimental pI and predicted pI will be referred to as ΔpI in this paper. The main focus of this project is to identify potential causes of varying ΔpI values. The data set was then broken down into roughly thirds. The first subset consisted of 60 proteins where the ΔpI value was less than 0.1. Another subset held 58 proteins where the ΔpI value was greater than 0.3, but less than 0.7 (0.3 < ΔpI < 0.7). The last third deals with 50 proteins of ΔpI values greater than 0.7. Refer to the tables in Appendix A for a list of the proteins in each ΔpI subset.

Extracting useful information from collected subset sequences

The following sections describe the sequential steps of the analysis that were performed on each of these ΔpI subsets. It starts with a naive approach to handling the data, focusing on the 20 individual amino acids. The next section explains how we used the four different alphabets to analyze the data subsets, followed by the dipeptide approach. There is a final section that summarizes how the whole process flows together.
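The ΔpI calculation (Eq. 1) and the three-way split of the data set can be expressed compactly. The Python sketch below is illustrative only: the function names and subset labels are hypothetical, the pI values in the usage lines are made-up numbers rather than thesis data, and applying the thresholds to the magnitude of ΔpI is an assumption, since the text does not state how negative differences were handled.

```python
def delta_pi(experimental_pi, predicted_pi):
    """Eq. 1: experimental pI minus predicted pI."""
    return experimental_pi - predicted_pi


def assign_subset(experimental_pi, predicted_pi):
    """Place a protein in one of the three delta-pI subsets, or return None.

    Assumption: thresholds are applied to the absolute value of delta pI.
    Proteins falling in the gaps between the ranges are left unassigned.
    """
    magnitude = abs(delta_pi(experimental_pi, predicted_pi))
    if magnitude < 0.1:
        return "dpi_lt_0.1"
    if 0.3 < magnitude < 0.7:
        return "dpi_0.3_to_0.7"
    if magnitude > 0.7:
        return "dpi_gt_0.7"
    return None


if __name__ == "__main__":
    # Hypothetical (experimental, predicted) pairs for illustration.
    for pair in [(5.00, 5.05), (6.0, 6.5), (5.0, 6.86), (5.5, 5.7)]:
        print(pair, assign_subset(*pair))
```

Leaving the 0.1 to 0.3 range unassigned mirrors the ranges as stated in the text, which keeps the three subsets clearly separated from one another.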
The first Refer to the tables in Appendix A for The = identify potential causes proteins of Apl values greater last third pi experimental pi and predicted pi will of this project The data held 58 pi significant difference between determining the relative 10 each of the counts of each amino acid frequency of occurrence for each amino acid between the Apl between any subsets. If a significant Apl subsets, then this of the would that The first step in going proteins NCBI Apl for each was used subset. sequence Apl to obtain a A Perl from delimited file program can displaying one long each each sequence is of amino would amino acid exist then be frequency to start the batch from the list to count the of sequence retrieval at included in each in number of amino acids frequency of each, outputting each sequence. The a the each tab code of this aacounts.pl. allows one to look subset as a whole removed (see dipeptide individual contained each sequence the (see Appendix B when looking at at instead each protein sequence to be important shortly Charge This Apl that interest. It does experimental values. frequencies for - amino acid program was written which concatenates each separate sequence sequence. also makes sure Frequency all of the on described, written and calculate be found in Appendix B encompassing other then of great naive approach was FASTA file that FASTA file a the As previously program was Another Perl into subset. to were closer about be based possible to adjust a pi prediction algorithm values and predict pi values difference for any is kept - the amino acid frequencies of protein by protein. separate and The program that the header line makeComposite.pl), which will two amino acids that occur be of shown one right after the approach). acids (alphabets approach) alphabet A more sophisticated analysis of amino acid acids are grouped according to the frequency can be properties of their side chains. 
Frequency of amino acids (alphabets approach)

A more sophisticated analysis of amino acid frequency can be done if the amino acids are grouped according to the properties of their side chains. The structures of the side chains of the amino acids can be used to assign them to four alphabets (Charge, Chemical, Functional, Hydrophobic).

Charge alphabet

The Charge alphabet (see Table 2) is based simply on whether the side chain of an amino acid can have a positive or negative charge, or is uncharged (neutral). Aspartic Acid (Asp / D) and Glutamic Acid (Glu / E) are the only amino acids that contain the negatively charged carboxyl group (COO-). Therefore, in the Charge alphabet they are grouped together and given the abbreviated amino acid code A. Likewise, Lysine (Lys / K) and Arginine (Arg / R) have side chains which contain positively charged amino groups (the lysine side chain contains an ε-amino group and arginine has a guanidino group). They are grouped together and given the code C. Histidine (His / H) is also grouped with the positively charged amino acids because protonation of the nitrogen on its side chain occurs easily. The remaining 15 amino acids normally do not demonstrate charge behavior in proteins; they are grouped together in the Charge alphabet with the code N. An example of using the Charge alphabet can be seen below:

ACDEFGH (original sequence)
   ↓
NNAANNC (Charge alphabet sequence)

Chemical alphabet

The Chemical alphabet incorporates two groupings, A (acidic) and C (basic). These groupings are analogous to the groupings with codes A and C in the Charge alphabet, for the same reasons. The Chemical alphabet characterizes the remaining 15 amino acids based on more than their lack of a charge. Asparagine (Asn / N) and Glutamine (Gln / Q) contain an amide (CONH2) and are grouped together accordingly with the code M. Phenylalanine (Phe / F), Tryptophan (Trp / W), and Tyrosine (Tyr / Y) contain aromatic rings (code R). Serine (Ser / S) and Threonine (Thr / T) contain the hydroxyl group (OH) on their side chains (code H). Proline (Pro / P) contains an imino group (>C=NH) on its side chain (code I). Finally, the sulfur containing amino acids Cysteine (Cys / C) and Methionine (Met / M) are grouped together with code S. An example of using the Chemical alphabet can be seen below:

ACDEFGHNPS (original sequence)
    ↓
LSAARLCMIH (Chemical alphabet sequence)

Functional alphabet

The Functional alphabet again incorporates the A (acidic) and C (basic) groups, as did the Charge and Chemical alphabets. The Functional alphabet characterizes the remaining amino acids into 2 groups, H (hydrophobic) and P (polar), based on whether the amino acid is hydrophobic (such as Alanine) or polar (such as Cysteine). An example of using the Functional alphabet can be seen below:

ACDEFGH (original sequence)
   ↓
HPAAHPC (Functional alphabet sequence)

Hydrophobic alphabet

The Hydrophobic alphabet is similar to the latter half of the Functional alphabet. It groups amino acids based only on hydrophobicity. Amino acids that are hydrophobic (such as Alanine) are given the code I. Amino acids that are hydrophilic (such as Cysteine) are given the code O. An example of using the Hydrophobic alphabet can be seen below:

ACDEFGH (original sequence)
   ↓
IOOOIOO (Hydrophobic alphabet sequence)

Perl programs were written that convert normal sequences into each of the four alphabets just described (see charge.pl, chemical.pl, functional.pl, and hydro.pl in Appendix B). The four programs also calculate and display the frequency of each alphabetic code that is chosen.
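The four reduced alphabets amount to lookup tables from residue to group code. The Python sketch below builds those tables directly from the groupings in Table 2; it is an illustration and not the thesis's charge.pl, chemical.pl, functional.pl, or hydro.pl, which call Bio::Tools::OddCodes in Perl. The names ALPHABETS, build_lookup, and translate are hypothetical.

```python
# Group code -> residues, transcribed from Table 2.
ALPHABETS = {
    "charge": {"A": "DE", "C": "HKR", "N": "ACFGILMNPQSTVWY"},
    "chemical": {
        "A": "DE", "L": "AGILV", "M": "NQ", "R": "FWY",
        "C": "RHK", "H": "ST", "I": "P", "S": "CM",
    },
    "functional": {"A": "DE", "C": "HKR", "H": "AFILMPVW", "P": "CGNQSTY"},
    "hydrophobic": {"I": "AFILMPVW", "O": "CDEGHKNQRSTY"},
}


def build_lookup(groups):
    """Invert a code -> residues mapping into a residue -> code mapping."""
    return {residue: code for code, residues in groups.items() for residue in residues}


def translate(sequence, alphabet):
    """Rewrite a protein sequence in one of the reduced alphabets.

    Raises KeyError for characters outside the 20 standard amino acids.
    """
    lookup = build_lookup(ALPHABETS[alphabet])
    return "".join(lookup[residue] for residue in sequence.upper())


if __name__ == "__main__":
    print(translate("ACDEFGH", "charge"))      # NNAANNC
    print(translate("ACDEFGH", "functional"))  # HPAAHPC
    print(translate("ACDEFGHNPS", "chemical"))
    print(translate("ACDEFGH", "hydrophobic"))
```

Keeping the groupings as data rather than as code makes it straightforward to check that each alphabet covers all 20 residues exactly once.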
Frequency of amino acids (dipeptide approach)

The problem that certain side chains had abnormal pKa values affecting the overall charge of a protein still had not been dealt with up until this point. All that had been considered was the sum of a set of strict pKa values for each amino acid, without taking into account any changes in pKa that might occur due to certain amino acids being next to other amino acids in the sequence. The approach to solving this problem was to examine every "dipeptide" in the sequences of the three ΔpI subsets. A sequence of length 7 has 6 dipeptides. For example:

Sequence:    ABCABBC
Dipeptides:  AB BC CA AB BB BC

Dipeptide counts:   Frequency:
AB = 2              0.333
BC = 2              0.333
CA = 1              0.167
BB = 1              0.167

The frequency at which each dipeptide occurs in a particular sequence is of interest, particularly when the dipeptides are considered in each ΔpI subset. A Perl program was written that counts each dipeptide in a sequence and displays the frequency of each dipeptide in the sequences of the FASTA file that is input (see Appendix B - dipeps.pl for dipeptides output ordered by frequency, or dipepsA.pl for dipeptides output alphabetically from AA . . . VV). As was the case earlier with the normal amino acid alphabet, the number of different dipeptides (20 x 20 = 400 for the normal alphabet) became problematic. The same dipeptide technique was applied to sequences after converting them into the Charge, Chemical, Functional, and Hydrophobic alphabets to alleviate this problem.

Combining an entire ΔpI subset of FASTA sequences into one long sequence (using makeComposite.pl - see Appendix B) also became problematic. To count the number of dipeptides in a set of sequences that has been combined into just one sequence, special attention needs to be paid so that the last amino acid in one sequence and the first amino acid in the next sequence are not counted as a dipeptide. The makeComposite.pl program handles this problem by replacing each accession line from the FASTA file with a blank new line. The output file from makeComposite.pl is formatted so that other programs can now use this one long sequence, and the dipeptide counts are as accurate as the naive and alphabet counts.
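The dipeptide count, including the requirement that pairs never span two proteins, can be sketched as follows. This Python sketch is illustrative and is not the code of dipeps.pl or dipepsA.pl; the function names are hypothetical, and keeping the sequences in a list stands in for the blank-line separators that makeComposite.pl writes.

```python
from collections import Counter


def dipeptide_counts(sequences):
    """Count overlapping residue pairs within each sequence.

    Pairs are never formed across two sequences, so the last residue of one
    protein and the first residue of the next are not counted as a dipeptide.
    """
    counts = Counter()
    for sequence in sequences:
        for i in range(len(sequence) - 1):
            counts[sequence[i:i + 2]] += 1
    return counts


def dipeptide_frequencies(sequences):
    """Return each dipeptide's share of all dipeptides, most frequent first."""
    counts = dipeptide_counts(sequences)
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.most_common()}


if __name__ == "__main__":
    # The worked example from the text: a length-7 sequence has 6 dipeptides.
    for pair, frequency in dipeptide_frequencies(["ABCABBC"]).items():
        print(pair, round(frequency, 3))
```

The same functions work unchanged on sequences that have first been rewritten in a reduced alphabet, which shrinks the table from 400 possible pairs to at most 64 for the Chemical alphabet and 4 for the Hydrophobic alphabet.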
go from an initial may set of The flow in taking the dipeptide, examined. amino or grouped The sequences (for each naive approach would go 15 of dipeptide process of appear somewhat confusing. FASTA acid, group transforming Figure 2 below Apl subset) to from FASTA each sequence to makeComposite.pl to aacounts.pl and then analysis. examining dipeptides with a transferring the FASTA functional program see in this is more complex. sequence to makeComposite.pl to dipepsA.pl) followed by analysis. program used alphabet Table 3 below pipeline workflow (for However, gives a a more It begins the flow for by functional.pl to dipeps.pl (or brief description detailed description of each and code of each Appendix B). \ ( Apl sunset charge.pl FASTA file v. [ J i chemical.pl 1 ' \ r ~~~~~ ^-^^^^ dipeps.pl makeComposite.pl i * \ functional.pl or dipepsA.pl ) ^ i hydro.pl " r ir ~\ analysis i' aaco ants.pi ^ Figure 2. Workflow diagram that (naive, shows how to alphabets, dipeptides). 16 get to each stage of analysis ) Program Description aacounts.pl Counts the number of each amino acid from each. frequency a a of Converts the amino acids from the sequences in a FASTA file into a 3-letter alphabet using the charge() method in charge.pl Bio::Tools::OddCodes (12). It code chemical.pl for into an code dipeps.pl from the amino acids 8-letter for using the chemical() alphabet sequence dipepsA.pl number of each in the given from highest Counts the sequence alphabetical order functional.pl Converts the into a . . . amino acids 4-letter alphabet hydro.pl for into a different alphabet makeComposite.pl for functional() (composite) be Table 3. Description counts from the using the sequence. of the programs used and the This in this in a each in FASTA file in number of each in hydrophobic() counts a FASTA file method in the number of each frequency. into composite sequence a single is then able listed here. pipeline workflow. 
Results

Naive approach

The initial naive approach of analyzing the data (using the normal alphabet) was done to determine the counts of each amino acid in each ΔpI subset (ΔpI < 0.1; 0.3 < ΔpI < 0.7; ΔpI > 0.7) and compare the frequency of occurrence for each amino acid between subsets. A comparison of the relative frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 3. A similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 4.

[Chart: Frequencies of Amino Acids in ΔpI < 0.1 and (0.3 < ΔpI < 0.7)]

Figure 3. Frequency of Individual Amino Acids in Two ΔpI Subsets. The X axis labels represent the one letter abbreviations of the amino acids. Shown in blue is the ΔpI < 0.1 subset and shown in yellow is the 0.3 < ΔpI < 0.7 subset. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22472 total amino acids. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17906 total amino acids. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.
[Chart: Frequencies of Amino Acids in ΔpI < 0.1 and ΔpI > 0.7]

Figure 4. Frequency of Individual Amino Acids in Two ΔpI Subsets. The X axis labels represent the one letter abbreviations of the amino acids. Shown in blue is the ΔpI < 0.1 subset and shown in yellow is the ΔpI > 0.7 subset. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22472 total amino acids. The ΔpI > 0.7 subset consists of 50 proteins which comprise 15581 total amino acids. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.

Alphabets approach

-Charge

The next step in analysis was to convert each of the subsets into a sequence that utilizes the four alphabets. This decreases the size of the amino acid alphabet and reduces the number of variables being examined. The different alphabets are summarized in Table 2. Using the Charge alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 5. Again using the Charge alphabet, a similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 6.

[Chart: Frequencies of Amino Acids (Charge alphabet) in ΔpI < 0.1 and (0.3 < ΔpI < 0.7); x axis: Amino Acid (charge alphabet): C, A, N]

Figure 5. Frequency of Amino Acids Using the Charge Alphabet in Two ΔpI Subsets.

[Chart: Frequencies of Amino Acids (Charge alphabet) in ΔpI < 0.1 and ΔpI > 0.7; x axis: Amino Acid (charge alphabet): C, A, N]

Figure 6. Frequency of Amino Acids Using the Charge Alphabet in Two ΔpI Subsets.

-Chemical

Using the Chemical alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 7. Figure 8 displays the same comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.

[Chart: Frequencies of Amino Acids (Chemical alphabet) in ΔpI < 0.1 and (0.3 < ΔpI < 0.7); x axis: Amino Acid (chemical alphabet)]

Figure 7. Frequency of Amino Acids Using the Chemical Alphabet in Two ΔpI Subsets.

[Chart: Frequencies of Amino Acids (Chemical alphabet) in ΔpI < 0.1 and ΔpI > 0.7; x axis: Amino Acid (chemical alphabet)]

Figure 8. Frequency of Amino Acids Using the Chemical Alphabet in Two ΔpI Subsets.

-Functional

Using the Functional alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 9. Again using the Functional alphabet, a similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 10.

[Chart: Frequencies of Amino Acids (Functional alphabet) in ΔpI < 0.1 and (0.3 < ΔpI < 0.7); x axis: Amino Acid (functional alphabet)]

Figure 9. Frequency of Amino Acids Using the Functional Alphabet in Two ΔpI Subsets.

[Chart: Frequencies of Amino Acids (Functional alphabet) in ΔpI < 0.1 and ΔpI > 0.7; x axis: Amino Acid (functional alphabet)]

Figure 10. Frequency of Amino Acids Using the Functional Alphabet in Two ΔpI Subsets.

-Hydrophobic

Using the Hydrophobic alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 11. Again using the Hydrophobic alphabet, a similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 12.

[Chart: Frequencies of Amino Acids (Hydrophobic alphabet) in ΔpI < 0.1 and (0.3 < ΔpI < 0.7); x axis: Amino Acid (hydrophobic alphabet): I, O]

Figure 11. Frequency of Amino Acids Using the Hydrophobic Alphabet in Two ΔpI Subsets.

[Chart: Frequencies of Amino Acids (Hydrophobic alphabet) in ΔpI < 0.1 and ΔpI > 0.7; x axis: Amino Acid (hydrophobic alphabet): I, O]

Figure 12. Frequency of Amino Acids Using the Hydrophobic Alphabet in Two ΔpI Subsets.

Dipeptide approach

Using a more sophisticated method that looks at dipeptides of a sequence gave an entirely new set of results.
The first way of looking at dipeptides of the three ΔpI subsets is similar to the naive approach in that it just examines dipeptides using the normal amino acid alphabet. This results in upwards of 400 different dipeptides (there may be slightly fewer than 400 dipeptides in a given subset owing to the chance that not all possible dipeptides may occur). The difference in frequency of every dipeptide between ΔpI subsets was also calculated ("Delta frequency" or "Delta %"). In other words, a Delta % value of 100 would mean that a certain dipeptide occurred 2 times as much in one subset compared to another subset. The Delta % values when comparing the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset can be seen in Figure 13. Figure 14 shows similar Delta % values when comparing the ΔpI < 0.1 subset and the ΔpI > 0.7 subset. To better explain Figures 13-16, consider the bar indicated by the arrow in Figure 13. This bar represents that there was a Delta % value between 100% and 150% 11 times when comparing dipeptide frequencies in the two different ΔpI sets.

[Chart: Densities of Delta % Values in ΔpI < 0.1 and 0.3 < ΔpI < 0.7 Using a Normal Amino Acid Alphabet; x axis: Delta % range]

Figure 13. Density of Delta % Values of Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17848 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.

[Chart: Densities of Delta % Values in ΔpI < 0.1 and ΔpI > 0.7 Using a Normal Amino Acid Alphabet; x axis: Delta % range (>25, >50, >75, >100, >150, >200, >300, >400)]

Figure 14. Density of Delta % Values of Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The ΔpI > 0.7 subset consists of 50 proteins which comprise 15531 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.

Dipeptide Threshold

A similar analysis was performed on the same ΔpI subsets in which dipeptides that had a very low frequency of occurrence (which may change the Delta % value too rapidly, see Discussion for an elaboration) were monitored. A threshold value of 0.1% frequency had to be met for dipeptides. In other words, if a dipeptide occurred so infrequently (under 0.1% of the total number of dipeptides) then it was eliminated. The remaining dipeptides were counted, and the Delta % values for the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset can be seen in Figure 15. Likewise, the comparison of the ΔpI < 0.1 subset and the ΔpI > 0.7 subset can be seen in Figure 16. Dipeptides that were found in the extreme positive or negative ranges of these figures are indicated by the one letter amino acid codes. For instance, in Figure 15 the dipeptide RR (arginine-arginine) was found much less frequently in the ΔpI < 0.1 dataset than in the 0.3 < ΔpI < 0.7 dataset.

[Chart: Densities of Delta % Values in ΔpI < 0.1 and 0.3 < ΔpI < 0.7 Using a Normal Amino Acid Alphabet (where frequency of dipeptide must be above 0.1); x axis: Delta % range and particular dipeptides (<-50 to >100)]

Figure 15. Density of Delta % Values of Dipeptides in Two ΔpI Subsets with a Threshold of 0.1%.

[Chart: Densities of Delta % Values in ΔpI < 0.1 and ΔpI > 0.7 Using a Normal Amino Acid Alphabet (where frequency of dipeptide must be above 0.1); x axis: Delta % range and particular dipeptides (<-50 to >100)]

Figure 16. Density of Delta % Values of Dipeptides in Two ΔpI Subsets with a Threshold of 0.1%.
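The Delta % comparison and the 0.1% threshold can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than the thesis's own code: the text does not state which subset serves as the reference, so the sketch takes Delta % as the relative change of the frequency in subset A with respect to subset B, which agrees with the statement that a Delta % of 100 means a dipeptide occurred 2 times as much. It also assumes the threshold is checked against subset A. The names delta_percent and delta_table are hypothetical.

```python
def delta_percent(freq_a, freq_b):
    """Relative change, in percent, of a frequency in subset A with respect to subset B."""
    return (freq_a - freq_b) / freq_b * 100.0


def delta_table(counts_a, counts_b, threshold=0.001):
    """Delta % per dipeptide, skipping dipeptides below the frequency threshold.

    counts_a and counts_b map each dipeptide to its raw count in a subset.
    threshold=0.001 corresponds to the 0.1% cut-off (assumed to apply to subset A).
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    table = {}
    for pair, n_a in counts_a.items():
        n_b = counts_b.get(pair, 0)
        # Drop rare dipeptides, and pairs absent from B (the ratio is undefined).
        if n_a / total_a < threshold or n_b == 0:
            continue
        table[pair] = delta_percent(n_a / total_a, n_b / total_b)
    return table
```

Filtering before computing the ratio is the point of the threshold: a dipeptide seen once in one subset and a handful of times in the other produces a very large Delta % from very little evidence.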
Dipeptide using Alphabets

The final step in analysis was to combine the alphabet and dipeptide approaches together. Using the smaller alphabets dramatically reduced and condensed the results as compared to using the normal alphabet, which creates 400 possible dipeptides.

-Charge

Using the Charge alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 17, as well as the Delta % values for each dipeptide. The same comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is shown in Figure 18.

[Chart: Comparison of Dipeptides (based on charge characteristic) taken from ΔpI < 0.1 and 0.3 < ΔpI < 0.7; x axis: Dipeptide (charge alphabet)]

Figure 17. Frequencies of Charge Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset.

[Chart: Comparison of Dipeptides (based on charge characteristic) taken from ΔpI < 0.1 and ΔpI > 0.7; x axis: Dipeptide (charge alphabet)]

Figure 18. Frequencies of Charge Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.

-Chemical

Using the Chemical alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 19, as well as the Delta % values for each dipeptide. The same comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is shown in Figure 20. The Chemical alphabet was sufficiently large that it was not possible to display all possible dipeptide combinations in Figures 19 and 20. Instead only the dipeptide density values were chosen to display.

[Chart: Densities of Delta % Values in ΔpI < 0.1 and 0.3 < ΔpI < 0.7 Using a Chemical Alphabet; x axis: Delta % range and particular dipeptides (<-20 to >60); labeled outlier dipeptides include RR (61%)]

Figure 19. Density of Delta % Values of Chemical Alphabet Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17848 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.

[Chart: Densities of Delta % Values in ΔpI < 0.1 and ΔpI > 0.7 Using a Chemical Alphabet; x axis: Delta % range and particular dipeptides (<-40 to >80)]

Figure 20. Density of Delta % Values of Chemical Alphabet Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The ΔpI > 0.7 subset consists of 50 proteins which comprise 15531 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.

-Functional

Using the Functional alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 21, as well as the Delta % values for each dipeptide. The same comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is shown in Figure 22.

[Chart: Comparison of dipeptides (based on functional characteristic) taken from ΔpI < 0.1 and 0.3 < ΔpI < 0.7; x axis: Dipeptide (functional alphabet)]

Figure 21. Frequencies of Functional Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset.

[Chart: Comparison of dipeptides (based on functional characteristic) taken from ΔpI < 0.1 and ΔpI > 0.7; x axis: Dipeptide (functional alphabet)]

Figure 22. Frequencies of Functional Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.

-Hydrophobic

Using the Hydrophobic alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 23, as well as the Delta % values for each dipeptide. The same comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is shown in Figure 24.

[Chart: Comparison of dipeptides (based on hydrophobic characteristic) taken from ΔpI < 0.1 and 0.3 < ΔpI < 0.7; series: % of Dipeptide in ΔpI < 0.1, Delta %; x axis: Dipeptide (hydrophobicity alphabet)]

Figure 23. Frequencies of Hydrophobic Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset.

[Chart: Comparison of dipeptides (based on hydrophobic characteristic) taken from ΔpI < 0.1 and ΔpI > 0.7; series: % of Dipeptide in ΔpI < 0.1, Delta %; x axis: Dipeptide (hydrophobicity alphabet)]

Figure 24. Frequencies of Hydrophobic Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.
Discussion

When exploring the behavior of proteins undergoing isoelectric focusing, there exists a discrepancy between predicted pI values and experimental pI values for a high percentage of those proteins. In an effort to more accurately predict pI values, a comparison of pI values was performed using predictions based on our algorithm (11) or similar algorithms (19) and experimentally determined pI values (14-18). The regular occurrence of these differences justified an effort to identify underlying patterns that could contribute to these differences. The question now lay in whether there was enough information to be extracted from the protein sequences in the data.

A key element was having a data set that was robust enough to give meaningful results. Simply finding a reliable data set that was robust enough was the first hurdle. A data set that is too small would not maintain statistical validity, since the dipeptides in the lowest frequencies must still be seen in sufficient abundance. A data set that is too diverse would lead to a high level of noise in the data, due to the fact that different organisms have different dipeptide frequencies and post-translational modifications. To overcome both of these hurdles, the study was limited only to E. coli, since it displays very few known post-translational modifications and has a proteome that has been sufficiently documented to do a case study. Unfortunately, the experimental data were determined in different laboratory settings and were not uniform.

Still keeping with the theme of retaining as much robustness as possible while having a data set with as little noise as possible, the usage of this data had to be restricted. Even though there existed 2DE data from 5 different groups (14-18), none of the groups used the same 2DE conditions. To ensure that the experimental pI values were gained using the same conditions, it was decided that it was probably best to limit the data to one or two of these groups. This in turn would reduce as much noise as possible. Only two of these groups performed large scale 2DE studies on the E. coli genome (17 and 18). The Tonella (19) group boasted coverage of over 70% of the entire E. coli proteome, and it was decided that the data from the Tonella (19) group would be the only data used. The primary justification for this decision was that over 70% of the E. coli proteome was being covered in their study. In addition, the fact that their data set was well structured held promise.

Once the data set was selected, it was clear that another decision had to be made about how to separate the data so that significant differences (at the dipeptide level) between ΔpI values could be seen, if they existed. It was necessary to break the data into distinct sets, separating proteins that had very small ΔpI values from those with greater ΔpI values. Arbitrary ΔpI cut-off ranges (ΔpI < 0.1; 0.3 < ΔpI < 0.7; ΔpI > 0.7) were chosen in order to separate the data into a small number of ΔpI subsets of similar size that could be compared with each other.

There was difficulty in deciding how to separate the entire dataset into these three subsets. One possible approach was to separate the dataset into many smaller sized subsets based on a larger number of ΔpI ranges. On one hand, doing this might provide an answer that gives a scaled description of what is happening at each small ΔpI range relative to adjacent ΔpI ranges. On the other hand, by doing it this way, there is a loss of information found in each subset at the dipeptide level due to the smaller number of sequences that would be in each set. This, in turn, would threaten the reliability of our results. Therefore, the dataset had to be separated into subsets of sufficient robustness. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22472 total amino acids or 22412 total dipeptides. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17906 total amino acids or 17848 total dipeptides. The ΔpI > 0.7 subset consists of 50 proteins which comprise 15581 total amino acids or 15531 total dipeptides. More information about each individual protein in these ΔpI subsets, including SWISS-2DPAGE Accession Number, description and ΔpI, can be seen in Appendix A.
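The amino acid and dipeptide totals quoted for the three subsets are tied together by simple arithmetic: a sequence of length L contains L - 1 overlapping dipeptides, so a subset of P proteins loses exactly P pairs relative to its residue count. The symbols below (N_aa, N_dp, P, L_i) are introduced here for illustration only and do not appear in the thesis.

```latex
N_{\mathrm{dp}} \;=\; \sum_{i=1}^{P} (L_i - 1) \;=\; N_{\mathrm{aa}} - P
\qquad
\begin{aligned}
\Delta pI < 0.1 &: \; 22472 - 60 = 22412 \\
0.3 < \Delta pI < 0.7 &: \; 17906 - 58 = 17848 \\
\Delta pI > 0.7 &: \; 15581 - 50 = 15531
\end{aligned}
```

All three reported dipeptide totals match this relation, which is what one would expect if no dipeptide was counted across a sequence boundary.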
The analytical process is best viewed as a pipeline, as seen in Figure 2 in the Methods section. We began our analysis with the most simple method (naive approach), worked our way to more complicated methods (alphabets approach), and ended with the most complicated methods (dipeptides using alphabets approach). Along this path, the handling of the data becomes more complicated, but the relevance of the findings quickly becomes more interesting at the same time (with a few exceptions).

The naive approach did not provide any meaningful results. It was apparent that individual amino acid frequencies in a given set of protein sequences did not vary among the three data sets. In the end, no amino acid frequency characteristics using simply the naive approach were found to be significantly different between the three ΔpI subsets. This can be seen in Figures 3 and 4 when comparing the ΔpI < 0.1 subset with the 0.3 < ΔpI < 0.7 subset and the ΔpI > 0.7 subset, respectively. No significant difference between the blue and yellow frequencies can be seen for any individual amino acid; the values are also nearly identical when Figure 3 and Figure 4 are compared. The lack of a correlation between ΔpI values and the frequency of these individual amino acids showed us that we needed to consider the problem in more depth.

To simplify the analysis, the number of variables was reduced by using the four alphabets described in Table 2. All of the alphabets were employed independently, and comparisons for each alphabet were calculated. Again, the results did not reveal any significant trends. Figures 5 and 6 (Charge alphabet comparisons), Figures 7 and 8 (Chemical alphabet comparisons), Figures 9 and 10 (Functional alphabet comparisons), and Figures 11 and 12 (Hydrophobic alphabet comparisons) are very similar to the results of the naive approach, in that they show no significant increase or decrease in frequency for any particular amino acid group when moving between the three datasets. It was expected that more meaningful results would be obtained at the next stage in the pipeline, dipeptide frequencies.

At this point it is instructive to consider the experimental conditions for isoelectric focusing (IEF). The biological function of proteins normally requires that they maintain their three dimensional structure intact or nearly intact. However, for IEF, we are interested only in separating the proteins, not in observing their biological function. To assure the best separation, reagents such as urea and detergents are added prior to IEF to disrupt any secondary, tertiary or quaternary aspects of protein structure. In these fully denatured proteins, the only interactions that are expected to occur are between amino acid side chains that are close to each other in the primary sequence. Previous pI prediction algorithms (2-8), including ours (11), treat the pKa for each amino acid independently, one amino acid at a time, regardless of its near or distant neighbors. Thus a consideration of the effect of neighboring amino acids on their respective side chain pKa values may prove valuable.
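To make concrete what it means to treat each pKa independently, the following Python sketch computes a pI by summing Henderson-Hasselbalch charges one residue at a time and bisecting on pH. It is a generic illustration, not the algorithm of reference (11); the pKa values are illustrative assumptions and are not taken from the thesis, and the names net_charge and isoelectric_point are hypothetical.

```python
# Illustrative pKa values only (an assumption); not those used in the thesis.
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence, ph):
    """Net charge at a given pH, with every ionizable group treated independently."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))    # free amino terminus
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))   # free carboxyl terminus
    for residue in sequence:
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence, tolerance=1e-4):
    """Bisect on pH until the net charge crosses zero."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        # Net charge falls monotonically as pH rises.
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0
```

Because the loop looks at residues one at a time, any rearrangement of the same residues gives the same predicted pI in this sketch. That is the limitation the dipeptide analysis is meant to probe.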
With respect to each alphabet that was used for dipeptides, the discussion will advance from the least significant alphabet results to the most significant alphabet results. However, the analysis using the normal amino acid alphabet will be discussed first. At first glance Figures 13 and 14 show some very promising results, with Delta % values in the 300 to 400 range, which would seem very significant. The Delta % value represents the change in frequency from one ΔpI subset to another. The problem was that most of the dipeptides that fell into these extreme ranges were dipeptides whose overall frequency was vanishingly small. It would not be wise to rely on very high Delta % values for a dipeptide that occurs only once in one ΔpI subset and multiple times in the ΔpI subset being compared to it. Therefore, the data were reanalyzed with a dipeptide threshold of 0.1%. In other words, if a dipeptide did not occur at least 22 times in the ΔpI < 0.1 dataset (the subset which contained 22412 dipeptides), the dipeptide was not used in the next analysis. The results of this can be seen in Figures 15 and 16. There still exist extreme outliers that have Delta % values in the 100 range when using a dipeptide frequency threshold of 0.1%. These interesting results will later be compared with some of the alphabet results. A redesign of a pI prediction algorithm would otherwise have to negotiate through all of the 400 dipeptides one at a time.

The hydrophobic alphabet was the alphabet that showed the least significant results from the dipeptide approach. Comparisons of the ΔpI subsets for dipeptides using the hydrophobic alphabet can be seen in Figures 23 and 24. The Delta % shown in the yellow bars is negligible in all of the 4 dipeptides (no more than a Delta % of 1.85 was seen in any of the 4 dipeptides).

The analysis using the Charge alphabet showed some more significant results. Delta % values reached into the 30+ % range for dipeptides using the Charge alphabet. The AA dipeptide (negatively charged amino acid followed by negatively charged amino acid; see Table 2 for definitions of the alphabet codes) had a large Delta % of -31.3% going from the ΔpI < 0.1 subset to the ΔpI > 0.7 subset (Figure 18). The Delta % for the AA dipeptide is also slightly large (-18.9%) in the other comparison, of the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset (Figure 17). However, the frequency of occurrence of this AA dipeptide (as shown in the blue bars) is very low in all of the three ΔpI subsets. What we would like to see is a large Delta % value accompanied by a large frequency of occurrence value for a particular dipeptide. This was not apparent in any analysis using the Charge alphabet.

Staying with the theme that the most significant results will combine significantly large Delta % values with significantly large frequencies of occurrence for dipeptides, the Functional alphabet is considered next. Figures 21 and 22, representing the Functional alphabet, show a collection of dipeptides that have both significantly large Delta % values and frequency of occurrence: AA, AH, HA, HP, CP, PH, PP. It was important to refer back to the complete dipeptide analysis that was done using the normal amino acid alphabet. Figures 15 and 16 point out a few extreme outliers: KY, YS (Figure 15) and EE, NN, YT (Figure 16). Converting these dipeptides to the Functional alphabet gives the dipeptides CP, PP and AA, PP, PP, respectively. These three different dipeptides map back to the analysis done using the Functional alphabet dipeptides (Figures 21 and 22).

Using the Chemical alphabet, the set of possible dipeptide combinations was sufficiently large that it was not possible to display all of them in Figures 19 and 20. Instead only the dipeptides with extreme outlier Delta % values were chosen to display. Particular dipeptides with extreme outlier Delta % values are labeled at the top of the column for their respective density.
The significance of these findings is that they may lead to a more accurate calculation of pI than currently existing methods (11, 19). These data clearly support the idea that the pKa value of an amino acid side chain, even when the protein is fully denatured, depends on the microenvironment created by the nearest neighbors of that amino acid. Using the dipeptides that have been identified from this study on E. coli proteins, it may be possible to adjust the algorithms for calculating pI from amino acid sequence to include the effects of adjacent amino acids on the pKa values used in calculations. This will be an empirical process whereby the pKa values used in our algorithm (11) will be modified fractionally and the algorithm rerun to see which changes lead to a better correlation between actual and predicted pI values for the two outlier data sets (0.3 < ΔpI < 0.7; ΔpI > 0.7).

If the improvement of the accuracy of the pI calculation proves to be worthy, there are many future advancements that could be made. The first could be to build a larger data set from all of the annotated E. coli data and rerun the analysis to compare with the data shown here. Beyond the scope of the E. coli proteome, further data that are available at the ExPASy Server's SWISS-2DPAGE database could be used to perform similar analyses on many other microbial proteomes. Another step would be to port the analysis over to eukaryotic proteomes that contain many more post-translational modifications. A lot would have to be done in terms of predicting or categorizing these post-translational modifications, but doing so may lead to an even more powerful approach to better predicting pI in lower and higher organisms as well.
Conclusions

A dataset of E. coli proteins was collected and formatted to study the discrepancy that exists between experimental isoelectric point and predicted isoelectric point (ΔpI). This dataset was then split into three parts depending on the magnitude of ΔpI. Several sequential approaches were taken in an attempt to get a better understanding of what might be causing the varying ΔpI. Each of these stages represented a part of a multi-layered pipeline in which the protein sequence data were analyzed by comparing each of the three ΔpI subsets to one another. The pipeline consisted of a naive approach (considering individual amino acid frequencies), followed by the application of four different alphabets that represent each amino acid in a simpler alphabet by grouping similar amino acids by their charge, chemical, functional, and hydrophobic properties. The final step in the pipeline involved reformatting the sequences to represent each dipeptide and investigating how dipeptides occur in the different ΔpI subsets, using both the 20 amino acid alphabet and the simplified alphabets. The dipeptide approach yielded the most meaningful results, showing that the frequency of certain dipeptides is greatly different between proteins from different ΔpI groupings.

Future studies will attempt to use the results to better predict pI. Using a short list of only the most extreme cases, where a dipeptide frequency is greatly different from one ΔpI subset to the next, modification of our existing pI prediction algorithm to include the effect of adjacent amino acids on side chain pKa values should result in more accurate pI prediction. Once the pI prediction is improved, the next step would be to concentrate on post-translational modifications and how pI prediction can be altered by them.
In addition, similar analyses will be extended to other prokaryotic organisms, and eventually to eukaryotic organisms.

References

1. Fey, S.J. and Larsen, P.M. "2D or not 2D." Current Opinion in Chemical Biology 5: 26-33 (2001).
2. Cargile, B.J., Talley, D.L., Stephenson, J.L. "Immobilized pH gradients as a first dimension in shotgun proteomics and analysis of the accuracy of pI predictability of peptides." Electrophoresis 25: 936-945 (2004).
3. Patrickios, C.S., Yamasaki, E.N. "Polypeptide Amino Acid Composition and Isoelectric Point." Analytical Biochemistry 231: 82-91 (1995).
4. Ribeiro, J.M. and Sillero, A. "An algorithm for the computer calculation of the coefficients of a polynomial that allows determination of isoelectric points of proteins and other macromolecules." Comput. Biol. Med. 20: 235-242 (1990).
5. Ribeiro, J.M. and Sillero, A. "A program to calculate the isoelectric point of macromolecules." Comput. Biol. Med. 21: 131-141 (1991).
6. Ribeiro, J.M., Ruiz A., Sillero, M.A., and Sillero, A. "Theoretical isoelectric points and electric charges of mutated human hemoglobin subunits." Clin. Chim. Acta. 190: 189-197 (1990).
7. Sillero A., Ribeiro, J.M. "Isoelectric points of proteins: theoretical determination." Analytical Biochemistry 179: 319-325 (1989).
8. Righetti, P.G., Caravaggio, T. "Isoelectric points and molecular weights of proteins." Journal of Chromatography 127: 1-28 (1976).
9. Berg J, Tymoczko J, Stryer L. Biochemistry. New York: W. H. Freeman and Co; 2002.
10. "Image: Amino acids 2.png" Wikipedia: The Free Encyclopedia. Found at http://upload.wikimedia.Org/wikipedia/en/c/c5/Amino_acids_2.png
11. Zapoticnyj J., Conte M.C., Craig P.A. "Simulation of 2D Gel Electrophoresis." http://www.rit.edu/~pac8612/2DE/2D_Sim.html
12. Bioperl. Found at http://bioperl.org
13. "SWISS-2DPAGE Two-dimensional polyacrylamide gel electrophoresis database." Found at http://us.expasy.org/ch2d/
14. Phillips T.A., Bloch P.L., Neidhardt F.C. "Protein identifications on O'Farrell two-dimensional gels: locations of 55 additional Escherichia coli proteins." J. Bacteriol. 144: 1024-1033 (1980).
15. Pasquali C., Frutiger S., Wilkins M.R., Hughes G.J., Appel R.D., Bairoch A., Schaller D., Sanchez J.-C., Hochstrasser D.F. "Two-dimensional gel electrophoresis of Escherichia coli homogenates: the Escherichia coli SWISS-2DPAGE database." Electrophoresis 17: 547-555 (1996).
16. Vanbogelen R.A., Abshire K.Z., Pertsemlidis A., Clark R.L., Neidhardt F.C. "Gene-protein database of Escherichia coli K-12, edition 6." (In) Neidhardt et al. (eds.), Escherichia coli and Salmonella: Cellular and Molecular Biology (2nd ed.), pp. 2067-2117, ASM Press, Washington DC (1996).
17. Tonella L., Hoogland C., Binz P.-A., Appel R.D., Hochstrasser D.F., Sanchez J.-C. "New perspectives in the Escherichia coli proteome investigation." Proteomics 1: 409-423 (2001).
18. Yan J.X., Devenish A.T., Wait R., Stone T., Lewis S., Fowler S. "Fluorescence 2-D difference gel electrophoresis and mass spectrometry based proteomic analysis of E. coli." Proteomics 2: 1682-1698 (2002).
19. "Compute pI/Mw for Swiss-Prot/TrEMBL entries or a user-entered sequence." Found at http://us.expasy.org/tools/pi_tool.html
20. Bjellqvist, B., Hughes, G., Pasquali, C., Paquet, N., Ravier, F., Sanchez, J.-C., et al. "The focusing positions of polypeptides in immobilized pH gradients can be predicted from their amino acid sequences." Electrophoresis 14: 1023-1031 (1993).
21. "Get protein list for a reference map." Found at http://www.expasy.org/cgi-bin/get-ch2d-table.pl

Appendix A

The sequences that were used are included in each of the three ΔpI subsets. The ΔpI subsets are displayed below in this order: ΔpI values less than 0.1 ("ΔpI < 0.1"), ΔpI values greater than 0.3 but less than 0.7 ("0.3 < ΔpI < 0.7"), and ΔpI values greater than 0.7 ("ΔpI > 0.7"). Included are the gene name, protein description, SWISS-2DPAGE Accession Number and ΔpI (experimental pI - predicted pI).
0.1 SWISS- Gene Name ACCB 2DPAGE Access # Protein Description Biotin Apl 0.1 (BCCP) ) (Isocitrase) (Isocitratase) (ICL) Aconitate hydratase 2 (EC 4.2.1 (Citrate hydro-lyase 2) (Aconitase 2) Alkyl hydroperoxide reductase subunit C (EC 1.6.4.-) (Alkyl P02905 AHPC hydroperoxide P26427 0.01 AMPC Beta-lactamase (EC P0081 1 0.08 ARGF Ornithine ACEA ACNB carboxyl carrier protein of acetyl-CoA carboxylase Isocitrate lyase (EC 4.1 .3.1 .3) C22) 3.5.2.6) (Cephalosporinase) reductase protein P05313 -0.04 P36683 -0.07 0 P22767 -0.03 ATPD (OTCase-2) Argininosuccinate synthase (EC 6.3.4.5) (Citrulline-aspartate ligase) 3-dehydroquinate dehydratase (EC 4.2.1.10) (3-dehydroquinase) (Type I DHQase) ATP synthase alpha chain (EC 3.6.3.14) ATP synthase beta chain (EC 3.6.3.14) ATP synthase beta chain (EC 3.6.3.14) P06960 CHEY Chemotaxis P06143 0.05 CLPB CIpB P03815 -0.02 CYSM Cysteine sulfhydrylase B) 4.1.2.4) (Phosphodeoxyriboaldolase) (Deoxyriboaldolase) (DERA) Chaperone protein dnaK (Heat shock protein 70) (Heat shock 70 kDa protein) (HSP70) Chaperone protein dnaK (Heat shock protein 70) (Heat shock 70 kDa protein) (HSP70) DNA protection during starvation protein (2-phosphoEnolase (EC 4.2.1 1) (2-phosphoglycerate dehydratase) D-glycerate hydro-lyase) Malonyl CoA-acyl carrier protein transacylase (EC 2.3.1 (MCT) P16703 -0.05 FLIC Flagellin FUSA Elongation factor G FUSA Elongation factor G GALM Aldose 1-epimerase (EC 5.1 GLNK Nitrogen regulatory protein P-ll 2 2,3-bisphosphoglycerate-independent ARGG AROD ATPA ATPD carbamoyltransferase chain protein protein cheY (Heat synthase shock protein B (EC 2.5.1 Deoxyribose-phosphate DEOC DNAK DNAK DPS F (EC 2.1 .3.3) F84.1) .47) aldolase (O-acetylserine P05194 P00822 0.05 -0.01 P00824 0.02 P00824 0.03 (EC P00882 -0.1 P04475 0.08 P04475 0.1 P27430 -0.05 P08324 0.05 P25715 0.04 P04949 0.07 .1 ENO FABD GPMI .39) P02996 (EF-G) (EF-G) .3.3) 5.4.2.1) (Phosphoglyceromutase) (Mutarotase) -0.08 P40681 -0.01 P38504 phosphoglycerate mutase -0.1 P02996 
-0.1 (EC P37689 -0.04 GROL GROL GROS ICD KATG LIVJ chaperonin (Protein Leu/lleA/al-binding (LIV-BP) Leu/lle/Val-binding protein (LIV-BP) Leucine-specific binding protein (LS-BP) (L-BP) S-ribosylhomocysteinase (EC 3.13.1.-) (Autoinducer-2 production protein luxS) (AI-2 synthesis protein) Methionine aminopeptidase (EC 3.4.1 1 (MAP) (Peptidase M) protein LIVJ LIVK LUXS MAP MIND NADE PGK PNP 0.1 P05380 0.03 P08200 -0.05 P13029 -0.03 P02917 0.01 P02917 0.07 P04816 -0.06 -0.08 -0.07 adenosyltransferase) Septum site-determining protein minD (Cell division inhibitor minD) NH(3)-dependent NAD(+) synthetase (EC 6.3.1.5) (Nitrogen-regulatory P04384 -0.02 protein) Phosphoglycerate kinase (EC 2.7.2.3) Polyribonucleotide nucleotidyltransferase (EC P18843 -0.06 P11665 0.03 P05055 -0.02 P17288 -0.01 P09029 -0.1 (EC 2.5.1 synthetase phosphorylase) (PNPase) Inorganic pyrophosphatase (EC .6) (Methionine 2.7.7.8) PURK Phosphoribosylaminoimidazole carboxylase ATPase 4.1.1.21) (AIR carboxylase) (AIRC) RIBH 6,7-dimethyl-8-ribityllumazine RPLL 50S ribosomal protein DNA-directed RNA L7/L12 synthase RPSA RPSA 30S SERC Phosphoserine SSB Single-strand TALB Transaldolase B (EC 2.2.1 TIG Trigger factor .9) (EC 0.01 phospho- subunit (EC (DMRL synthase) (L8) polymerase alpha chain subunit) 30S ribosomal 2.7.7.6) (RNAP P61714 -0.02 P02392 0.09 P00574 0.02 alpha protein S1 P02349 0.08 ribosomal protein S1 P02349 0.08 binding protein (EC 2.6.1 (PSAT) (SSB) (Helix-destabilizing protein) aminotransferase .52) .2) (TF) (TF) TIG Trigger factor TRPA Tryptophan TSF Elongation factor Ts TUFA Elongation factor Tu USPA Universal YCII Protein synthase alpha chain (EC 4.2.1 .20) (EF-Ts) (EF-Tu) (P-43) stress protein A P23721 0 P02339 -0.05 P30148 -0.05 P22257 0.02 P22257 0.01 P00928 -0.02 P02997 -0.05 P02990 0.01 P28242 0.04 ycil P31070 0.01 yfiD P33633 -0.04 P36656 0.02 YFID Protein YJDC Putative HTH-type transcriptional pl< (EC 2.5.1 P18197 (Polynucleotide 3.6.1.1) (Pyrophosphate hydrolase) (PPase) < P06139 
P07906 PPA RPOA -0.02 P45578 .18) S-adenosylmethionine METK 0.3 P06139 Cpn60) (groEL protein) 60 kDa chaperonin (Protein Cpn60) (groEL protein) 10 kDa chaperonin (Protein Cpn10) (groES protein) Isocitrate dehydrogenase [NADP] (EC 1.1.1.42) (Oxalosuccinate decarboxylase) Peroxidase/catalase HPI (EC 1.11.1.6) (Catalase-peroxidase) (Hydroperoxidase I) 60 kDa regulator yjdC 0.7 SWISS- 2DPAGE Gene Access # Name Protein Description ACCB Biotin P02905 0.37 ACKA Acetate kinase (EC P15046 -0.46 ADK Adenylate kinase P05082 -0.67 (BCCP) 2.7.2.1) (Acetokinase) (EC 2.7.4.3) (ATP-AMP transphosphorylase) carboxyl carrier protein of acetyl-CoA carboxylase II A pi ADK P05082 -0.68 AHPF Adenylate kinase (EC 2.7.4.3) (ATP-AMP transphosphorylase) Alkyl hydroperoxide reductase subunit F (EC 1.6.4.-) (Alkyl hydroperoxide reductase F52A protein) P35340 -0.41 ALDA Aldehyde dehydrogenase A (EC P25553 Aldehyde dehydrogenase A (EC P25553 ALDA ARGT 1.2.1.22) (Lactaldehyde dehydrogenase) 1.2.1.22) (Lactaldehyde dehydrogenase) Lysine-arginine-ornithine-binding periplasmic protein (LAO-binding protein) ATP synthase epsilon chain (EC 3.6.3.14) (ATP synthase F1 sector epsilon ATPC subunit) ATPD ATP synthase beta CLPS ATP-dependent Clp protease adaptor protein dpS Dihydrodipicolinate reductase (EC 1 .26) (DHPR) Chaperone protein dnaK (Heat shock protein 70) (Heat chain (EC 3.6.3.14) 0.42 0.32 P09551 -0.48 P00832 -0.33 P00824 0.02 P75832 0.44 P04036 -0.54 P04475 -0.37 P08324 -0.67 ENO hydro-lyase) 1) (2-phosphoglycerate dehydratase) (2-phospho-Dglycerate hydro-lyase) Enoyl-[acyl-carrier-protein] reductase [NADH] (EC 1.3.1.9) (NADH-dependent P08324 -0.41 FABI enoyl-ACP P29132 -0.45 FLIC Flagellin GLNA Glutamine GPT Xanthine-guanine GST Glutathione S-transferase (EC 2.5.1.18) HISJ HISJ Histidine-binding Histidine-binding ILVH Acetolactate synthase isozyme III small subunit (EC 2.2.1.6) (Acetohydroxy-acid synthase III small subunit) (ALS-III) DAPB DNAK .3.1 70 kDa protein) (HSP70) Enolase (EC 4.2.1 ENO 
shock .1 1) (2-phosphoglycerate dehydratase) (2-phospho-D- glycerate Enolase (EC 4.2.1 .1 reductase) synthetase (EC 6.3.1 .2) (Glutamate-ammonia ligase) phosphoribosyltransferase (EC 2.4.2.22) (XGPRT) (HBP) (HBP) periplasmic protein periplasmic protein P04949 0.51 P0671 1 -0.38 P00501 -0.51 P39100 -0.37 P39182 -0.52 P39182 -0.37 P00894 -0.54 P17579 -0.62 (AHAS-III) 2-dehydro-3-deoxyphosphooctonate aldolase (EC 2.5.1.55) (Phospho-2dehydro-3-deoxyoctonate aldolase) (3-deoxy-D-manno-octulosonic acid 8phosphate synthetase) (KDO-8-phosphate synthetase) (KDO 8-P synthase) KDSA MALE (KDOPS) Leu/lle/Val-binding protein (LIV-BP) Leu/lle/Val-binding protein (LIV-BP) Leucine-specific binding protein (LS-BP) (L-BP) Leucine-specific binding protein (LS-BP) (L-BP) Maltose-binding periplasmic protein (Maltodextrin-binding Maltose-binding periplasmic protein (Maltodextrin-binding MDH Malate dehydrogenase (EC 1 MDOG Glucans biosynthesis MGLB D-galactose-binding periplasmic binding protein) (GGBP) LIVJ LIVJ LIVK LIVK MALE protein .1 .1 (Nucleoside-2-P -0.45 P02917 -0.57 P04816 -0.54 P04816 -0.37 P02928 -0.41 P02928 -0.62 P61889 -0.47 P33136 -0.64 P02927 -0.67 P24233 -0.56 .5.1 P38489 -0.57 .5.1 P38489 -0.65 P16921 -0.35 P08312 -0.36 P23861 -0.32 P20752 -0.59 protein) protein) (MMBP) (MMBP) .37) G protein Nucleoside diphosphate kinase (EC NDK P02917 (GBP) (D-galactose/ D-glucose 2.7.4.6) (NDK) (NDP kinase) kinase) NFNB Oxygen-insensitive NAD(P)H nitroreductase (EC 1 .-.-.-) (FMN-dependent .34) nitroreductase) (Dihydropteridine reductase) (EC 1 Oxygen-insensitive NAD(P)H nitroreductase (EC 1 .-.-.-) (FMN-dependent .34) nitroreductase) (Dihydropteridine reductase) (EC 1 NUSG Transcription NFNB antitermination protein nusG Phenylalanyl-tRNA PHES POTD synthetase alpha chain 6.1.1.20) (Phenylalanine- tRNA ligase alpha chain) (PheRS) Spermidine/putrescine-binding periplasmic Peptidyl-prolyl cis-trans isomerase A (EC PPIA (EC (Cyclophilin A) III protein (SPBP) 5.2.1.8) (PPIase A) (Rotamase 
A) PYRI Aspartate RTCB Protein rtcB P46850 -0.6 RTCB Protein rtcB P46850 -0.57 SBP Sulfate-binding protein P06997 -0.59 SERC Phosphoserine aminotransferase SODB Superoxide dismutase carbamoyltransferase regulatory (Sulfate starvation-induced [Fe] (EC (EC 2.6.1 P00478 chain .52) protein 2) (SSI2) (PSAT) 1 1 5. 1 1 ) . . -0.48 P23721 -0.45 P09157 -0.41 SSPB Stringent TOLB TolB TRXA Thioredoxin 1 P00274 0.3 UDP Uridine P12758 -0.47 P12758 -0.42 starvation protein B protein (TRX1) (TRX) phosphorylase (EC 2.4.2.3) (UrdPase) (UPase) phosphorylase (EC 2.4.2.3) (UrdPase) (UPase) UDP Uridine YAET Unknown YCEI Protein YCGK YGIN from 2D-page P25663 0.64 P19935 -0.57 P39170 0.39 ycel P37904 -0.56 Protein ycgK P76002 -0.57 Protein ygiN P40718 -0.38 YHGI Protein yhgl P46847 0.42 ZNUA High-affinity zinc High-affinity zinc ZNUA pi > protein spots M62/M63/03/09/T35 uptake system protein znuA P39172 -0.3 uptake system protein znuA P39172 -0.37 0.7 SWISS- Gene Name Protein Description ARGD Acetylornithine/succinyldiaminopimelate aminotransferase (EC 2.6.1.11) (EC 2.6.1 .17) (ACOAT) (Succinyldiaminopimelate transferase) (DapATase) ARTI Arginine-binding 2DPAGE Access # Apl P18335 -0.94 P30859 -0.91 P11096 -1.09 P11096 -1.07 P11096 -0.92 P16700 -1.75 P09376 -0.9 P23847 -0.85 P23847 -0.75 P27430 -0.78 P08324 1.54 P08324 1.31 FLIY during starvation protein Enolase (EC 4.2.1 1) (2-phosphoglycerate dehydratase) (2-phospho-Dglycerate hydro-lyase) Enolase (EC 4.2.1.11) (2-phosphoglycerate dehydratase) (2-phospho-Dglycerate hydro-lyase) Cystine-binding periplasmic protein (CBP) (fliY protein) (Sulfate starvationinduced protein 7) (SSI7) P39174 -1.35 GAPA Glyceraldehyde-3-phosphate dehydrogenase A (EC 1 .12) P06977 -2.03 GAPA Glyceraldehyde-3-phosphate dehydrogenase A (EC 1 .12) P06977 -1.32 GAPA Glyceraldehyde-3-phosphate dehydrogenase A (EC 1 P06977 -0.95 GLNH Glutamine-binding P 10344 -1.64 Cysteine (Thiol)-lyase A) 5) (O-acetylserine sulfhydrylase A) (O(CSase A) (Sulfate starvation-induced 
protein 5) A (EC 2.5.1 synthase acetylserine (Thiol)-lyase A) .47) (SSI5) Cysteine 2.5.1.47) (O-acetylserine sulfhydrylase A) (O(Thiol)-lyase A) (CSase A) (Sulfate starvation-induced protein 5) A (EC synthase acetylserine CYSK (O-acetylserine sulfhydrylase A) (O.47) (CSase A) (Sulfate starvation-induced protein (SSI5) Cysteine CYSK 1 A (EC 2.5.1 synthase acetylserine CYSK periplasmic protein CYSP (SSI5) Thiosulfate-binding DEGP Protease do (EC 3.4.21 DPPA Periplasmic dipeptide transport protein DPPA Periplasmic dipeptide transport protein DPS DNA protein .-) (Dipeptide-binding protein) (DBP) (Dipeptide-binding protein) (DBP) protection .1 ENO ENO periplasmic protein (GlnBP) IV .2.1 .2.1 .2. 1 . (GAPDH-A) (GAPDH-A) 1 2) (GAPDH-A) GLTI Glutamate/aspartate P37902 -1.01 HDEB Protein hdeB (10K-L protein) Inhibitor of vertebrate lysozyme P26605 -0.94 P45502 -1.37 P61316 -1.19 MANX Outer-membrane lipoprotein carrier protein (P20) PTS system, mannose-specific NAB component (EIIAB-Man) (Mannosepermease IIAB component) (Phosphotransferase enzyme II, AB component) (EC 2.7.1.69) (Elll-Man) P08186 -0.77 MDH Malate dehydrogenase (EC P61889 1.14 MDOG Glucans biosynthesis protein G P33136 0.89 MDOG Glucans biosynthesis protein G P33136 -0.87 P37329 -1.71 P38489 -1.03 IVY LOLA MODA Molybdate-binding periplasmic binding protein 1.1.1.37) periplasmic protein NFNB Oxygen-insensitive NAD(P)H nitroreductase nitroreductase) (Dihydropteridine reductase) Oxygen-insensitive NAD(P)H nitroreductase nitroreductase) (Dihydropteridine reductase) NLPD Lipoprotein NUSG Transcription OMPA Outer OPPA OPPA Periplasmic oligopeptide-binding Periplasmic oligopeptide-binding PANC Pantoate-beta-alanine ligase (EC (Pantoate activating enzyme) NFNB PSTS (EC 1 .-.-.-) (FMN-dependent (EC 1.5.1.34) (EC 1 .-.-.-) (FMN-dependent (EC 1.5.1.34) nlpD antitermination protein nusG membrane protein A (Outer membrane protein II*) P38489 -0.8 P33648 -0.87 P16921 -1.31 P02934 -1.09 protein P23843 -1.26 protein P23843 
-0.74 P31663 -0.93 P06128 -1.86 6.3.2.1) (Pantothenate synthetase) Phosphate-binding periplasmic protein (PBP)
PYRD    Dihydroorotate dehydrogenase (EC 1.3.3.1) (Dihydroorotate oxidase) (DHOdehase) (DHODase) (DHOD)    P05021    -0.82
RPLA    50S ribosomal protein L1    P02384    -1.74
RPLI    50S ribosomal protein L9    P02418    -1.41
RPLY    50S ribosomal protein L25    P02426    0.71
RPME2   50S ribosomal protein L31 type B-1    P71302    -1.21
SUCD    Succinyl-CoA synthetase alpha chain (EC 6.2.1.5) (SCS-alpha)    P07459    -0.99
SUCD    Succinyl-CoA synthetase alpha chain (EC 6.2.1.5) (SCS-alpha)    P07459    -0.86
SUHB    Inositol-1-monophosphatase (EC 3.1.3.25) (IMPase) (Inositol-1-phosphatase) (I-1-Pase)    P22783    -1.17
SUHB    Inositol-1-monophosphatase (EC 3.1.3.25) (IMPase) (Inositol-1-phosphatase) (I-1-Pase)    P22783    -1.06
TPIA    Triosephosphate isomerase (EC 5.3.1.1) (TIM)    P04790    -0.88
TRPB    Tryptophan synthase beta chain (EC 4.2.1.20)    P00932    -0.97
YGFZ    Unknown protein from 2D-page (Spot PR51)    P39179    0.91
YGGX    UPF0269 protein yggX    P52065    -0.78
YLIB    Putative binding protein yliB    P75797    -1.19
YRBC    Protein yrbC    P45390    0.73

Appendix B

The relevant Perl code that was used for any of the analysis described in the Materials and Methods section is listed here. Each Perl program is listed in alphabetical order by the name of the particular program. At the top of each program are comments that explain exactly what each program does, how to run the program (command line arguments), and the output of the program.

aacounts.pl:

#!/bin/perl
use strict;
use Bio::Seq;
use Bio::SeqIO;

# Matthew Conte
#
# This script counts the number of each amino acid in the sequences from a
# FASTA file and determines the frequency of each.
#
# Output is to FASTAfilename.aacounts
#
# Usage: perl aacounts.pl file.FASTA file2.FASTA ...

#initialize count variables
my $n_A = 0; my $n_R = 0; my $n_N = 0; my $n_D = 0; my $n_C = 0;
my $n_E = 0; my $n_Q = 0; my $n_G = 0; my $n_H = 0; my $n_I = 0;
my $n_L = 0; my $n_K = 0; my $n_M = 0; my $n_F = 0; my $n_P = 0;
my $n_S = 0; my $n_T = 0; my $n_W = 0; my $n_Y = 0; my $n_V = 0;
my $n_AA_Total = 0;

#initialize frequency variables
my $f_A = 0; my $f_R = 0; my $f_N = 0; my $f_D = 0; my $f_C = 0;
my $f_E = 0; my $f_Q = 0; my $f_G = 0; my $f_H = 0; my $f_I = 0;
my $f_L = 0; my $f_K = 0; my $f_M = 0; my $f_F = 0; my $f_P = 0;
my $f_S = 0; my $f_T = 0; my $f_W = 0; my $f_Y = 0; my $f_V = 0;

foreach my $file (@ARGV) {
    print STDERR "Reading input file $file...\n";

    #open the FASTA file
    my $FASTA_in = Bio::SeqIO->new(-file => $file);

    #strip the .seq extension to get the file's basename
    $file =~ s/\.seq$//g;

    #open the .aacounts file for writing
    open AACOUNTS, ">$file.aacounts";

    #for each sequence in the FASTA file..
    while (my $FASTA_seq = $FASTA_in->next_seq()) {
        #reset variables
        $n_A = 0; $n_R = 0; $n_N = 0; $n_D = 0; $n_C = 0;
        $n_E = 0; $n_Q = 0; $n_G = 0; $n_H = 0; $n_I = 0;
        $n_L = 0; $n_K = 0; $n_M = 0; $n_F = 0; $n_P = 0;
        $n_S = 0; $n_T = 0; $n_W = 0; $n_Y = 0; $n_V = 0;
        $n_AA_Total = 0;
        $f_A = 0; $f_R = 0; $f_N = 0; $f_D = 0; $f_C = 0;
        $f_E = 0; $f_Q = 0; $f_G = 0; $f_H = 0; $f_I = 0;
        $f_L = 0; $f_K = 0; $f_M = 0; $f_F = 0; $f_P = 0;
        $f_S = 0; $f_T = 0; $f_W = 0; $f_Y = 0; $f_V = 0;

        #output the sequence description
        my $desc = $FASTA_seq->display_id;
        #trim off possible trailing comma and whitespace
        $desc =~ s/,//g;
        $desc =~ s/\s*//g;
        print AACOUNTS $desc . "> ";

        #get the sequence as an upper-case string
        my $sequence = uc $FASTA_seq->seq;

        #get the count of each amino acid
        $n_A = ($sequence =~ tr/A//); $n_R = ($sequence =~ tr/R//);
        $n_N = ($sequence =~ tr/N//); $n_D = ($sequence =~ tr/D//);
        $n_C = ($sequence =~ tr/C//); $n_E = ($sequence =~ tr/E//);
        $n_Q = ($sequence =~ tr/Q//); $n_G = ($sequence =~ tr/G//);
        $n_H = ($sequence =~ tr/H//); $n_I = ($sequence =~ tr/I//);
        $n_L = ($sequence =~ tr/L//); $n_K = ($sequence =~ tr/K//);
        $n_M = ($sequence =~ tr/M//); $n_F = ($sequence =~ tr/F//);
        $n_P = ($sequence =~ tr/P//); $n_S = ($sequence =~ tr/S//);
        $n_T = ($sequence =~ tr/T//); $n_W = ($sequence =~ tr/W//);
        $n_Y = ($sequence =~ tr/Y//); $n_V = ($sequence =~ tr/V//);

        #sum up for total
        $n_AA_Total = $n_A + $n_R + $n_N + $n_D + $n_C + $n_E + $n_Q + $n_G
                    + $n_H + $n_I + $n_L + $n_K + $n_M + $n_F + $n_P + $n_S
                    + $n_T + $n_W + $n_Y + $n_V;

        #calculate frequencies
        $f_A = $n_A / $n_AA_Total; $f_R = $n_R / $n_AA_Total;
        $f_N = $n_N / $n_AA_Total; $f_D = $n_D / $n_AA_Total;
        $f_C = $n_C / $n_AA_Total; $f_E = $n_E / $n_AA_Total;
        $f_Q = $n_Q / $n_AA_Total; $f_G = $n_G / $n_AA_Total;
        $f_H = $n_H / $n_AA_Total; $f_I = $n_I / $n_AA_Total;
        $f_L = $n_L / $n_AA_Total; $f_K = $n_K / $n_AA_Total;
        $f_M = $n_M / $n_AA_Total; $f_F = $n_F / $n_AA_Total;
        $f_P = $n_P / $n_AA_Total; $f_S = $n_S / $n_AA_Total;
        $f_T = $n_T / $n_AA_Total; $f_W = $n_W / $n_AA_Total;
        $f_Y = $n_Y / $n_AA_Total; $f_V = $n_V / $n_AA_Total;

        #round frequencies to three decimal places
        $f_A = sprintf("%.3f", $f_A); $f_R = sprintf("%.3f", $f_R);
        $f_N = sprintf("%.3f", $f_N); $f_D = sprintf("%.3f", $f_D);
        $f_C = sprintf("%.3f", $f_C); $f_E = sprintf("%.3f", $f_E);
        $f_Q = sprintf("%.3f", $f_Q); $f_G = sprintf("%.3f", $f_G);
        $f_H = sprintf("%.3f", $f_H); $f_I = sprintf("%.3f", $f_I);
        $f_L = sprintf("%.3f", $f_L); $f_K = sprintf("%.3f", $f_K);
        $f_M = sprintf("%.3f", $f_M); $f_F = sprintf("%.3f", $f_F);
        $f_P = sprintf("%.3f", $f_P); $f_S = sprintf("%.3f", $f_S);
        $f_T = sprintf("%.3f", $f_T); $f_W = sprintf("%.3f", $f_W);
        $f_Y = sprintf("%.3f", $f_Y); $f_V = sprintf("%.3f", $f_V);

        #write results to file, counts first then frequencies
        print AACOUNTS "\nA(neutral): $n_A \t $f_A\n";
        print AACOUNTS "R(BASIC): $n_R \t $f_R\n";
        print AACOUNTS "N(neutral): $n_N \t $f_N\n";
        print AACOUNTS "D(ACIDIC): $n_D \t $f_D\n";
        print AACOUNTS "C(neutral): $n_C \t $f_C\n";
        print AACOUNTS "E(ACIDIC): $n_E \t $f_E\n";
        print AACOUNTS "Q(neutral): $n_Q \t $f_Q\n";
        print AACOUNTS "G(neutral): $n_G \t $f_G\n";
        print AACOUNTS "H(BASIC): $n_H \t $f_H\n";
        print AACOUNTS "I(neutral): $n_I \t $f_I\n";
        print AACOUNTS "L(neutral): $n_L \t $f_L\n";
        print AACOUNTS "K(BASIC): $n_K \t $f_K\n";
        print AACOUNTS "M(neutral): $n_M \t $f_M\n";
        print AACOUNTS "F(neutral): $n_F \t $f_F\n";
        print AACOUNTS "P(neutral): $n_P \t $f_P\n";
        print AACOUNTS "S(neutral): $n_S \t $f_S\n";
        print AACOUNTS "T(neutral): $n_T \t $f_T\n";
        print AACOUNTS "W(neutral): $n_W \t $f_W\n";
        print AACOUNTS "Y(neutral): $n_Y \t $f_Y\n";
        print AACOUNTS "V(neutral): $n_V \t $f_V\n";
        print AACOUNTS "Total: $n_AA_Total \n";
    }
    close AACOUNTS;
}

changeCode.pl:

#!/bin/perl -w
use strict;
use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::OddCodes;

# Matthew Conte
#
# This script converts the amino acids from the sequences in a FASTA file
# into the alphabet of the user's choice with the methods provided in
# Bio::Tools::OddCodes.
#
# Output is to FASTAfilename.charge (see the NOTE below)
#
# Usage: perl changeCode.pl file.FASTA file2.FASTA ...

foreach my $file (@ARGV) {
    print STDERR "Reading input file $file...\n";

    #open the FASTA file
    my $FASTA_in = Bio::SeqIO->new(-file => $file, -format => 'FASTA');

    #strip the .seq extension to get the file's basename
    $file =~ s/\.seq$//g;

    #open the .charge file for writing
    #NOTE - change .charge to .oddcode for whichever oddcode you decide to
    #use. Options are: charge, chemical, functional, hydrophobic.
    open CHARGECOUNTS, ">$file.charge";

    #print a line at the top so that dipeps.pl understands that it is a long
    #FASTA sequence.
    print CHARGECOUNTS ">gi|our long sequence\n";

    #for each sequence in the FASTA file..
    while (my $FASTA_seq = $FASTA_in->next_seq()) {
        #change the sequence in the file to the alphabet of your choosing
        #in this case charge is chosen
        #Options are: charge, chemical, functional, hydrophobic.
        my $oddcode_obj = Bio::Tools::OddCodes->new(-seq => $FASTA_seq);
        my $sequence = $oddcode_obj->charge();
        print CHARGECOUNTS $$sequence;
    }
    close CHARGECOUNTS;
}

charge.pl:

#!/bin/perl -w
use strict;
use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::OddCodes;

# Matthew Conte
#
# This script converts the amino acids from the sequences in a FASTA file into
# a 3-letter alphabet using the charge() method in Bio::Tools::OddCodes.
# It then counts the number of each code for each sequence as well as each
# frequency.
#
# Alphabet: A (negatively charged), C (positively charged), N (no charge).
#
# Output is to FASTAfilename.charge_counts
# Output is TAB-DELIMITED for import in to Microsoft Excel
#
# Usage: perl charge.pl file.FASTA file2.FASTA ...

#initialize count variables
my ($n_A, $n_C, $n_N, $n_AA_Total) = (0, 0, 0, 0);

#initialize frequency variables
my ($f_A, $f_C, $f_N) = (0, 0, 0);

foreach my $file (@ARGV) {
    print STDERR "Reading input file $file...\n";

    #open the FASTA file
    my $FASTA_in = Bio::SeqIO->new(-file => $file, -format => 'FASTA');

    #strip the .seq extension to get the file's basename
    $file =~ s/\.seq$//g;

    #open the .charge_counts file for writing
    open CHARGECOUNTS, ">$file.charge_counts";

    #top line in the file to see where everything goes
    #(labels follow the alphabet given above: A is negative, C is positive)
    print CHARGECOUNTS "Sequence\tA(negative)\t%A(negative)\tC(positive)\t%C(positive)\tN(no charge)\t%N(no charge)\tTotal\n";

    #for each sequence in the FASTA file..
    while (my $FASTA_seq = $FASTA_in->next_seq()) {
        #reset variables
        $n_A = 0; $n_C = 0; $n_N = 0; $n_AA_Total = 0;
        $f_A = 0; $f_C = 0; $f_N = 0;

        #output the sequence description
        my $desc = $FASTA_seq->display_id;
        #trim off possible trailing comma and whitespace
        $desc =~ s/,//g;
        $desc =~ s/\s*//g;
        print CHARGECOUNTS $desc;

        my $oddcode_obj = Bio::Tools::OddCodes->new(-seq => $FASTA_seq);
        my $sequence = $oddcode_obj->charge();

        #get the count of each code
        $n_A = ($$sequence =~ tr/A//);
        $n_C = ($$sequence =~ tr/C//);
        $n_N = ($$sequence =~ tr/N//);

        #sum up for total
        $n_AA_Total = $n_A + $n_C + $n_N;

        #calculate frequencies
        $f_A = $n_A / $n_AA_Total;
        $f_C = $n_C / $n_AA_Total;
        $f_N = $n_N / $n_AA_Total;

        #round frequencies to 3 decimal places
        $f_A = sprintf("%.3f", $f_A);
        $f_C = sprintf("%.3f", $f_C);
        $f_N = sprintf("%.3f", $f_N);

        #write results to file
        print CHARGECOUNTS "\t$n_A\t$f_A\t$n_C\t$f_C\t$n_N\t$f_N\t$n_AA_Total\n";
    }
    close CHARGECOUNTS;
}

chemical.pl:

#!/bin/perl -w
use strict;
use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::OddCodes;

# Matthew Conte
#
# This script converts the amino acids from the sequences in a FASTA file into
# an 8-letter alphabet using the chemical() method in Bio::Tools::OddCodes.
# It then counts the number of each code for each sequence as well as each
# frequency.
#
# Alphabet: A (acidic), L (aliphatic), M (amide), R (aromatic), C (basic),
# H (hydroxyl), I (imino), S (sulphur).
#
# Output is to FASTAfilename.chem_counts
# Output is TAB-DELIMITED for import in to Microsoft Excel
#
# Usage: perl chemical.pl file.FASTA file2.FASTA ...

#initialize count variables
my ($n_A, $n_L, $n_M, $n_R, $n_C, $n_H, $n_I, $n_S) = (0, 0, 0, 0, 0, 0, 0, 0);
my $n_AA_Total = 0;

#initialize frequency variables
my ($f_A, $f_L, $f_M, $f_R, $f_C, $f_H, $f_I, $f_S) = (0, 0, 0, 0, 0, 0, 0, 0);

foreach my $file (@ARGV) {
    print STDERR "Reading input file $file...\n";

    #open the FASTA file
    my $FASTA_in = Bio::SeqIO->new(-file => $file, -format => 'FASTA');

    #strip the .seq extension to get the file's basename
    $file =~ s/\.seq$//g;

    #open the .chem_counts file for writing
    open CHEMCOUNTS, ">$file.chem_counts";

    #top line in the file to see where everything goes
    print CHEMCOUNTS "Sequence\tA(acidic)\t%A(acidic)\tL(aliphatic)\t%L(aliphatic)\tM(amide)\t%M(amide)\tR(aromatic)\t%R(aromatic)\tC(basic)\t%C(basic)\tH(hydroxyl)\t%H(hydroxyl)\tI(imino)\t%I(imino)\tS(sulphur)\t%S(sulphur)\tTotal\n";

    #for each sequence in the FASTA file..
    while (my $FASTA_seq = $FASTA_in->next_seq()) {
        #reset variables
        $n_A = 0; $n_L = 0; $n_M = 0; $n_R = 0;
        $n_C = 0; $n_H = 0; $n_I = 0; $n_S = 0;
        $n_AA_Total = 0;
        $f_A = 0; $f_L = 0; $f_M = 0; $f_R = 0;
        $f_C = 0; $f_H = 0; $f_I = 0; $f_S = 0;

        #output the sequence description
        my $desc = $FASTA_seq->display_id;
        #trim off possible trailing comma and whitespace
        $desc =~ s/,//g;
        $desc =~ s/\s*//g;
        print CHEMCOUNTS $desc;

        my $oddcode_obj = Bio::Tools::OddCodes->new(-seq => $FASTA_seq);
        my $sequence = $oddcode_obj->chemical();

        #get the count of each code
        $n_A = ($$sequence =~ tr/A//);
        $n_L = ($$sequence =~ tr/L//);
        $n_M = ($$sequence =~ tr/M//);
        $n_R = ($$sequence =~ tr/R//);
        $n_C = ($$sequence =~ tr/C//);
        $n_H = ($$sequence =~ tr/H//);
        $n_I = ($$sequence =~ tr/I//);
        $n_S = ($$sequence =~ tr/S//);

        #sum up for total
        $n_AA_Total = $n_A + $n_L + $n_M + $n_R + $n_C + $n_H + $n_I + $n_S;

        #calculate frequencies
        $f_A = $n_A / $n_AA_Total;
        $f_L = $n_L / $n_AA_Total;
        $f_M = $n_M / $n_AA_Total;
        $f_R = $n_R / $n_AA_Total;
        $f_C = $n_C / $n_AA_Total;
        $f_H = $n_H / $n_AA_Total;
        $f_I = $n_I / $n_AA_Total;
        $f_S = $n_S / $n_AA_Total;

        #round frequencies to 3 decimal places
        $f_A = sprintf("%.3f", $f_A);
        $f_L = sprintf("%.3f", $f_L);
        $f_M = sprintf("%.3f", $f_M);
        $f_R = sprintf("%.3f", $f_R);
        $f_C = sprintf("%.3f", $f_C);
        $f_H = sprintf("%.3f", $f_H);
        $f_I = sprintf("%.3f", $f_I);
        $f_S = sprintf("%.3f", $f_S);

        #write results to file
        print CHEMCOUNTS "\t$n_A\t$f_A\t$n_L\t$f_L\t$n_M\t$f_M\t$n_R\t$f_R\t$n_C\t$f_C\t$n_H\t$f_H\t$n_I\t$f_I\t$n_S\t$f_S\t$n_AA_Total\n";
    }
    close CHEMCOUNTS;
}

dipeps.pl:

#!/bin/perl -w
use strict;
use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::OddCodes;
use Bio::Tools::SeqWords;

# Matthew Conte
#
# This script counts the number of each different amino acid pair for each
# sequence in the given FASTA files
#
# Output is sorted by count, from highest to lowest
# Output is to FASTAfilename.dipep_counts
# Output is TAB-DELIMITED for import in to Microsoft Excel
#
# Usage: perl dipeps.pl file.FASTA file2.FASTA ...

# variable to count the total number of amino acid pairs in each sequence.
my $total = 0;

foreach my $file (@ARGV) {
    print STDERR "Reading input file $file...\n";

    #open the FASTA file
    my $FASTA_in = Bio::SeqIO->new(-file => $file, -format => 'FASTA');

    #strip the .seq extension to get the file's basename
    $file =~ s/\.seq$//g;

    #open the .dipep_counts file for writing
    open DIPEP_COUNTS, ">$file.dipep_counts";

    #for each sequence in the FASTA file..
    while (my $FASTA_seq = $FASTA_in->next_seq()) {
        #reset total
        $total = 0;

        #output the sequence description
        my $desc = $FASTA_seq->display_id;
        #trim off possible trailing comma and whitespace
        $desc =~ s/,//g;
        $desc =~ s/\s*//g;
        print DIPEP_COUNTS "\n$desc";

        my $seq_word = Bio::Tools::SeqWords->new(-seq => $FASTA_seq);
        my $sequence = $seq_word->count_overlap_words(2);

        # display the hashtable
        my %hash = %$sequence;

        #this code will sort the dipeptides alphabetically
        #foreach my $key (sort keys %hash) {
        #    $total = $total + $hash{$key};
        #    print DIPEP_COUNTS "\n$key\t$hash{$key}";
        #}

        # sort the hash by value in descending order (highest to lowest);
        # <=> compares the counts numerically
        foreach my $key (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
            $total = $total + $hash{$key};
            print DIPEP_COUNTS "\n$key\t$hash{$key}";
        }
        print DIPEP_COUNTS "\nTotal: $total";
    }
    close DIPEP_COUNTS;
}

dipepsA.pl:

#!/bin/perl -w
use strict;
use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::OddCodes;
use Bio::Tools::SeqWords;

# Matthew Conte
#
# This script counts the number of each different amino acid pair for each
# sequence in the given FASTA files
#
# Output is sorted alphabetically by amino acid pair
# Output is to FASTAfilename.dipepA_counts
# Output is TAB-DELIMITED for import in to Microsoft Excel
#
# Usage: perl dipepsA.pl file.FASTA file2.FASTA ...

# variable to count the total number of amino acid pairs in each sequence.
my $total = 0;

foreach my $file (@ARGV) {
    print STDERR "Reading input file $file...\n";

    #open the FASTA file
    my $FASTA_in = Bio::SeqIO->new(-file => $file, -format => 'FASTA');

    #strip the .seq extension to get the file's basename
    $file =~ s/\.seq$//g;

    #open the .dipepA_counts file for writing
    open DIPEP_COUNTS, ">$file.dipepA_counts";

    #for each sequence in the FASTA file..
    while (my $FASTA_seq = $FASTA_in->next_seq()) {
        #reset total
        $total = 0;

        #output the sequence description
        my $desc = $FASTA_seq->display_id;
        #trim off possible trailing comma and whitespace
        $desc =~ s/,//g;
        $desc =~ s/\s*//g;
        print DIPEP_COUNTS "\n$desc";

        my $seq_word = Bio::Tools::SeqWords->new(-seq => $FASTA_seq);
        my $sequence = $seq_word->count_overlap_words(2);

        # display the hashtable
        my %hash = %$sequence;

        #this code will sort the dipeptides alphabetically
        foreach my $key (sort keys %hash) {
            $total = $total + $hash{$key};
            print DIPEP_COUNTS "\n$key\t$hash{$key}";
        }

        # sort the hash by value in descending order (highest to lowest)
        #foreach my $key (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
        #    $total = $total + $hash{$key};
        #    print DIPEP_COUNTS "\n$key\t$hash{$key}";
        #}
        print DIPEP_COUNTS "\nTotal: $total";
    }
    close DIPEP_COUNTS;
}

functional.pl:

#!/bin/perl -w
use strict;
use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::OddCodes;

# Matthew Conte
#
# This script converts the amino acids from the sequences in a FASTA file into
# a 4-letter alphabet using the functional() method in Bio::Tools::OddCodes.
# It then counts the number of each code for each sequence as well as each
# frequency.
#
# Alphabet: A (acidic), C (basic), H (hydrophobic), P (polar).
#
# Output is to FASTAfilename.func_counts
# Output is TAB-DELIMITED for import in to Microsoft Excel
#
# Usage: perl functional.pl file.FASTA file2.FASTA ...

#initialize count variables
my ($n_A, $n_C, $n_H, $n_P, $n_AA_Total) = (0, 0, 0, 0, 0);

#initialize frequency variables
my ($f_A, $f_C, $f_H, $f_P) = (0, 0, 0, 0);

foreach my $file (@ARGV) {
    print STDERR "Reading input file $file...\n";

    #open the FASTA file
    my $FASTA_in = Bio::SeqIO->new(-file => $file, -format => 'FASTA');

    #strip the .seq extension to get the file's basename
    $file =~ s/\.seq$//g;

    #open the .func_counts file for writing
    open FUNC_COUNTS, ">$file.func_counts";

    #top line in the file to see where everything goes
    print FUNC_COUNTS "Sequence\tA(Acidic)\t%A(Acidic)\tC(Basic)\t%C(Basic)\tH(Hydrophobic)\t%H(Hydrophobic)\tP(Polar)\t%P(Polar)\tTotal\n";

    #for each sequence in the FASTA file..
    while (my $FASTA_seq = $FASTA_in->next_seq()) {
        #reset variables
        $n_A = 0; $n_C = 0; $n_H = 0; $n_P = 0; $n_AA_Total = 0;
        $f_A = 0; $f_C = 0; $f_H = 0; $f_P = 0;

        #output the sequence description
        my $desc = $FASTA_seq->display_id;
        #trim off possible trailing comma and whitespace
        $desc =~ s/,//g;
        $desc =~ s/\s*//g;
        print FUNC_COUNTS $desc;

        my $oddcode_obj = Bio::Tools::OddCodes->new(-seq => $FASTA_seq);
        my $sequence = $oddcode_obj->functional();

        #get the count of each code
        $n_A = ($$sequence =~ tr/A//);
        $n_C = ($$sequence =~ tr/C//);
        $n_H = ($$sequence =~ tr/H//);
        $n_P = ($$sequence =~ tr/P//);

        #sum up for total
        $n_AA_Total = $n_A + $n_C + $n_H + $n_P;

        #calculate frequencies
        $f_A = $n_A / $n_AA_Total;
        $f_C = $n_C / $n_AA_Total;
        $f_H = $n_H / $n_AA_Total;
        $f_P = $n_P / $n_AA_Total;

        #round frequencies to 3 decimal places
        $f_A = sprintf("%.3f", $f_A);
        $f_C = sprintf("%.3f", $f_C);
        $f_H = sprintf("%.3f", $f_H);
        $f_P = sprintf("%.3f", $f_P);

        #write results to file
        print FUNC_COUNTS "\t$n_A\t$f_A\t$n_C\t$f_C\t$n_H\t$f_H\t$n_P\t$f_P\t$n_AA_Total\n";
    }
    close FUNC_COUNTS;
}

hydro.pl:

#!/bin/perl -w
use strict;
use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::OddCodes;

# Matthew Conte
#
# This script converts the amino acids from the sequences in a FASTA file into
# a 2-letter alphabet using the hydrophobic() method in Bio::Tools::OddCodes.
# It then counts the number of each code for each sequence as well as each
# frequency.
#
# Alphabet: I (hydrophilic), O (hydrophobic).
#
# Output is to FASTAfilename.hydro_counts
# Output is TAB-DELIMITED for import into Microsoft Excel
#
# Usage: perl hydro.pl file.FASTA file2.FASTA ...

#initialize count variables
my $n_I = 0;
my $n_O = 0;
my $n_AA_Total = 0;

#initialize frequency variables
my $f_I = 0;
my $f_O = 0;

foreach my $file (@ARGV) {
    print STDERR "Reading input file $file...\n";

    #open the FASTA file
    my $FASTA_in = Bio::SeqIO->new(-file   => $file,
                                   -format => 'FASTA');

    #get the file's basename
    $file =~ s/\.seq$//g;

    #open the .hydro_counts file for writing
    open HYDRO_COUNTS, ">$file.hydro_counts";

    #top line in the file to see where everything goes
    print HYDRO_COUNTS "Sequence\tI(hydrophilic)\t%I(hydrophilic)\tO(hydrophobic)\t%O(hydrophobic)\tTotal\n";

    #for each sequence in the FASTA file..
    while (my $FASTA_seq = $FASTA_in->next_seq()) {
        #reset variables
        $n_I = 0; $n_O = 0;
        $n_AA_Total = 0;
        $f_I = 0; $f_O = 0;

        #output the sequence description
        my $desc = $FASTA_seq->display_id;
        #trim off possible trailing comma and whitespace
        $desc =~ s/,//g;
        $desc =~ s/\s*//g;
        print HYDRO_COUNTS $desc;

        my $oddcode_obj = Bio::Tools::OddCodes->new(-seq => $FASTA_seq);
        my $sequence = $oddcode_obj->hydrophobic();

        ##get the count of amino acids
        $n_I = ($$sequence =~ tr/I//);
        $n_O = ($$sequence =~ tr/O//);

        #sum up for total
        $n_AA_Total = $n_I + $n_O;

        #calculate frequencies
        $f_I = $n_I / $n_AA_Total;
        $f_O = $n_O / $n_AA_Total;

        #round frequencies to 3 decimal places
        $f_I = sprintf("%.3f", $f_I);
        $f_O = sprintf("%.3f", $f_O);

        #write results to file
        print HYDRO_COUNTS "\t$n_I\t$f_I\t$n_O\t$f_O\t$n_AA_Total\n";
    }
    close HYDRO_COUNTS;
}

makeComposite.pl:

#!/bin/perl
use strict;

# Matthew Conte
#
# This script converts FASTA files of multiple sequences into a single (composite)
# sequence. This composite sequence is then able to be used with other programs
# such as charge.pl, chemical.pl, dipeps.pl, functional.pl, and hydro.pl.
#
# Usage: perl makeComposite.pl file.FASTA > outfile

my $count = 0;

while (<>) {
    if ($count < 1) {
        #keep the first line (the header of the first sequence)
        print $_;
    }
    else {
        if (/^>gi*/) {
            #do nothing
        }
        else {
            print $_;
        }
    }
    $count++;
}

print "$count\n";