Rochester Institute of Technology
RIT Scholar Works
Theses
Thesis/Dissertation Collections
Summer 2005
Isoelectric point prediction from the amino acid
sequence of a protein
Matthew Conte
Follow this and additional works at: http://scholarworks.rit.edu/theses
Recommended Citation
Conte, Matthew, "Isoelectric point prediction from the amino acid sequence of a protein" (2005). Thesis. Rochester Institute of
Technology. Accessed from
This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion
in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact [email protected].
THESIS
ISOELECTRIC POINT PREDICTION FROM THE AMINO ACID SEQUENCE OF A PROTEIN
Submitted by
Matthew Conte
Department of Biological Sciences
In partial fulfillment of the requirements
For the Master of Science degree in Bioinformatics
at Rochester Institute of Technology
Summer 2005
To:
Rochester Institute of Technology
Department of Biological Sciences
Bioinformatics Program
Head, Department of Biological Sciences
The undersigned state that ____________________ (Student Name), ____________________ (Student Number), a candidate for the Master of Science degree in Bioinformatics, has submitted his/her thesis and has satisfactorily defended it.
This completes the requirements for the Master of Science degree in Bioinformatics at
Rochester Institute of Technology.
Thesis committee members:
Name
Date
Gary R. Skuse
(Committee Chair)
Paul A. Craig
(Thesis Advisor)
Name Illegible
Douglas P. Merrill
475-2532 (voice)
[email protected]
Thesis/Dissertation Author Permission Statement
Title of thesis or dissertation: (handwritten, illegible)
Name of author: (handwritten, illegible)
Degree: (handwritten, illegible)
Program: (handwritten, illegible)
College: (handwritten, illegible)
I understand that I must submit a print copy of my thesis or dissertation to the RIT Archives, per current RIT guidelines for the completion of my degree. I hereby grant to the Rochester Institute of Technology and its agents the non-exclusive license to archive and make accessible my thesis or dissertation in whole or in part in all forms of media in perpetuity. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.
Print Reproduction Permission Granted:
I, ____________________, hereby grant permission to the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part. Any reproduction will not be for commercial use or profit.
Signature of Author: Matthew Conte    Date: (illegible)
Print Reproduction Permission Denied:
I, ____________________, hereby deny permission to the RIT Library of the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part.
Signature of Author: ____________________    Date: __________
Inclusion in the RIT Digital Media Library Electronic Thesis & Dissertation (ETD) Archive
I, ____________________, additionally grant to the Rochester Institute of Technology Digital Media Library (RIT DML) the non-exclusive license to archive and provide electronic access to my thesis or dissertation in whole or in part in all forms of media in perpetuity.
I understand that my work, in addition to its bibliographic record and abstract, will be available to the world-wide community of scholars and researchers through the RIT DML. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I am aware that the Rochester Institute of Technology does not require registration of copyright for ETDs.
I hereby certify that, if appropriate, I have obtained and attached written permission statements from the owners of each third party copyrighted matter to be included in my thesis or dissertation. I certify that the version I submitted is the same as that approved by my committee.
Signature of Author: ____________________    Date: __________
Abstract
Proteins often do not migrate as expected in two dimensional electrophoresis based on their primary sequence. The predicted isoelectric point (pI) frequently does not coincide with experimental pI values obtained in the laboratory. The reasons for these differences led to this study. Initially, 2DE data from the E. coli proteome was collected and formatted. This dataset was split into three parts, each consisting of different levels of pI discrepancy (ΔpI). Each ΔpI subset was run through a pipeline. The pipeline consisted of a naive approach (considering individual amino acid frequencies), followed by the application of four different alphabets to represent sequences in a simpler way by grouping similar amino acids based on their charge, chemical, functional, and hydrophobic properties. The final step in the pipeline involved investigating the dipeptides of all of these sequences using both the 20 amino acid alphabet and the simplified alphabets. At each stage of the pipeline the data were analyzed by comparing each of the three ΔpI subsets to one another. An evaluation of the dipeptide data for the alphabet groupings demonstrated the existence of certain dipeptide sequences which correlate well with differences between predicted pI and experimental pI.
Table of Contents
1 Introduction
2 Methods
  2.1 Forming the data set
  2.2 Experimental and predicted pI values
  2.3 Extracting useful information from collected subset sequences
    2.2.1 Amino acid frequency analysis (naive approach)
    2.2.2 Frequency of amino acids (alphabets approach)
    2.2.3 Frequency of amino acids (dipeptide approach)
    2.2.4 Pipeline workflow
3 Results
  3.1 Naive approach
  3.2 Alphabets approach
    3.2.1 Charge
    3.2.2 Chemical
    3.2.3 Functional
    3.2.4 Hydrophobic
  3.3 Dipeptide approach
  3.4 Dipeptide threshold
  3.5 Dipeptide using alphabets
    3.5.1 Charge
    3.5.2 Chemical
    3.5.3 Functional
    3.5.4 Hydrophobic
4 Discussion
5 Conclusions
6 References
Introduction
Two-dimensional gel electrophoresis (2DE) has been an important laboratory technique for the field of proteomics over the past two decades. 2DE allows the researcher to separate and identify thousands of proteins from a cellular extract in a single experiment. 2DE is difficult and time consuming as it is necessary to determine ideal initial conditions, wait for results, and possibly change conditions after that (1). In addition, reproducibility of gels and comparison of 2DE results between separate groups has proved difficult (1). In 2DE, proteins are separated in the first dimension by their isoelectric points (pI) (the pH at which the net charge of the protein is zero) and in the second dimension by their molecular weights. The accurate prediction of protein isoelectric point and molecular weight (MW) using simply the amino acid sequence of the protein would be extremely valuable to researchers who use two-dimensional gel electrophoresis.
Computational procedures for calculating and predicting the pI from the amino acid composition of a protein, based on the dissociation constants of the charged groups within the protein, have been developed (2-8). The accuracy of these algorithms is limited by the certainty of the values for the dissociation constants and by microenvironmental effects such as charge-charge interactions and post-translational modifications.
To systematically explore the relationship between pI, molecular weight and protein sequence, a data set of proteins was collected and organized from a model organism. The Escherichia coli proteome was chosen since it contains few post-translational modifications such as methylation, acylation, glycosylation, or phosphorylation, which can alter the pI/MW; the presence of these modifications makes pI/MW predictions much more difficult since the modifications may cause proteins to migrate to a position on a 2-D gel that is quite different from what is predicted based solely on the amino acid sequence. E. coli is also one of the best characterized prokaryotes, and much more data beyond simply the protein sequence is widely available for each protein.
At this point it is necessary to consider the basic structural features of proteins and the role of individual amino acids in the structure and function of proteins. Figure 1 below shows the structure of the 20 amino acids with side chain structures shown in red (10).
The charge on all proteins arises from the side chains, the carboxy- and amino-termini, some prosthetic groups, and bound ions. Our pI prediction tool (11) is designed to calculate charge based on the amino acid side chains as well as the carboxy- and amino-termini. The charge on amino acid side chains depends on the pKA of the side chains. It is also affected by the pH of the solution and by the localized environment around a side chain. Our current calculation model uses the following pKA values (Table 1) for the ionizable groups on some of the amino acid side chains regardless of their environment within the protein, and does not make any adjustments to the pKA values of the side chains. We also assume that the separation is based on the total charge on the protein, not the mass-to-charge ratios.
[Figure 1: structural diagrams of the 20 amino acids: Alanine (Ala / A), Arginine (Arg / R), Asparagine (Asn / N), Aspartic Acid (Asp / D), Cysteine (Cys / C), Glutamic Acid (Glu / E), Glutamine (Gln / Q), Glycine (Gly / G), Histidine (His / H), Isoleucine (Ile / I), Leucine (Leu / L), Lysine (Lys / K), Methionine (Met / M), Phenylalanine (Phe / F), Proline (Pro / P), Serine (Ser / S), Threonine (Thr / T), Tryptophan (Trp / W), Tyrosine (Tyr / Y), Valine (Val / V)]
Figure 1. Structures of amino acids with side chains shown in green, carboxylate groups in red, and amino groups in blue (10).
The charge on the protein is the sum of the charges on the individual amino acid side chains. However, the charge on individual amino acid side chains can vary when they are near a group of non-polar or highly charged side chains. For example, the normal pKA for glutamic acid is about 4.1. In lysozyme, two glutamic acid residues are in the active site. One is in a polar environment and has a normal pKA value. The other is in a hydrophobic environment, where a negative charge is energetically unfavorable. Therefore the pKA value for this glutamate side chain increases, which then decreases the extent of the deprotonation of that side chain. This is very important in the mechanism of lysozyme activity, which requires that one of the side chains be charged (deprotonated) and the other be uncharged (protonated) at the same time.
In a second example, the serine found in the active sites of serine proteases has a much different acid-base behavior than other serines normally found in proteins (9). The normal pKA value for the hydroxyl group in the serine side chain is greater than 15, meaning that this group is not ionized in the physiological pH range. In serine proteases, the interaction of the active site serine with nearby histidine and aspartate side chains (the so-called catalytic triad) leads to the ionization of the serine hydroxyl group. Meanwhile, the pKA value is reduced from about 15 to a value closer to 7 or 8. This example makes it clear that the microenvironment of an individual amino acid side chain can change its ionization behavior.
Other effects on the pKA of an amino acid side chain can be seen when certain amino acids are positioned next to each other. For example, a typical arginine residue is basic in most proteins, will have a pKA of about 12.5 (Table 1 below) and will carry a full +1 charge in the ionized state. However, when two basic arginine residues are adjacent in a protein sequence, the pKA values of these will decrease, due to repulsion between the positive charges. This reduction in pKA value, in turn, will cause one or both of the arginine side chains to become less ionized and carry only a fractional positive charge. Table 1 below lists the typical pKA values for ionizable groups in proteins (9).

Group                           Typical pKA
Terminal α-carboxyl group       3.1
Aspartic acid, Glutamic acid    4.1
Histidine                       6.0
Terminal α-amino group          8.0
Cysteine                        8.3
Tyrosine                        10.9
Lysine                          10.8
Arginine                        12.5

Table 1. These are pKA values that are commonly found for these side chains when they are part of a protein. The pKA values for these side chains may be quite different for the free amino acid in solution. pKA values also depend on temperature, ionic strength, and the microenvironment of the ionizable group (9).
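The fixed-pKA charge model described in this section can be illustrated with a short sketch. This is not the prediction tool cited as (11); it is a minimal Python illustration that assumes each ionizable group titrates independently (Henderson-Hasselbalch), that the protein has one free amino terminus and one free carboxy terminus, and that cysteine and tyrosine are treated as acidic groups. The function names and the bisection search are illustrative assumptions; only the pKA values come from Table 1.

```python
# Typical pKA values from Table 1. Names and layout are illustrative only.
PKA_POSITIVE = {"Nterm": 8.0, "H": 6.0, "K": 10.8, "R": 12.5}
PKA_NEGATIVE = {"Cterm": 3.1, "D": 4.1, "E": 4.1, "C": 8.3, "Y": 10.9}


def net_charge(sequence, ph):
    """Net charge at a given pH, treating every ionizable group independently."""
    seq = sequence.upper()
    charge = 0.0
    # Basic groups contribute +1 when protonated.
    for group, pka in PKA_POSITIVE.items():
        n = 1 if group == "Nterm" else seq.count(group)
        charge += n / (1.0 + 10 ** (ph - pka))
    # Acidic groups contribute -1 when deprotonated.
    for group, pka in PKA_NEGATIVE.items():
        n = 1 if group == "Cterm" else seq.count(group)
        charge -= n / (1.0 + 10 ** (pka - ph))
    return charge


def isoelectric_point(sequence, lo=0.0, hi=14.0, tol=1e-4):
    """Bisection search for the pH at which the net charge is zero.

    The net charge decreases monotonically with pH, so bisection converges.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(sequence, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Because the model ignores the microenvironment of each side chain, two adjacent arginines are treated exactly like two distant ones, which is the limitation that the examples above describe.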
As we began to consider the impact of amino acid sequence on the ionization behavior of individual amino acid side chains, the need to create groups of amino acids based on their chemical and physical characteristics, rather than concentrating on each individual amino acid, became apparent. We elected to divide the amino acids into groups based on their charge, chemical, functional, and hydrophobic characteristics. Dividing the amino acids into these property groups enables us to use smaller alphabets in our calculations, as opposed to simply using the normal 20 letter amino acid alphabet. We used these characteristics to rewrite a protein sequence into an alternative alphabet that is much smaller than the normal amino acid alphabet of 20 characters. Table 2 below describes how each alphabet that was used is categorized, based on which amino acids fall under what particular types (12). The Methods section contains examples of protein sequences that have been translated into these different alphabets.

Alphabet Type (size)   Code   Meaning        Amino Acids with that Code
Charge (3)             A      Negative       D, E
                       C      Positive       H, K, R
                       N      No charge      A, C, F, G, I, L, M, N, P, Q, S, T, V, W, Y
Chemical (8)           A      Acidic         D, E
                       L      Aliphatic      A, G, I, L, V
                       M      Amide          N, Q
                       R      Aromatic       F, W, Y
                       C      Basic          R, H, K
                       H      Hydroxyl       S, T
                       I      Imino          P
                       S      Sulphur        C, M
Functional (4)         A      Acidic         D, E
                       C      Basic          H, K, R
                       H      Hydrophobic    A, F, I, L, M, P, V, W
                       P      Polar          C, G, N, Q, S, T, Y
Hydrophobic (2)        I      Hydrophobic    A, F, I, L, M, P, V, W
                       O      Hydrophilic    C, D, E, G, H, K, N, Q, R, S, T, Y

Table 2. Description of four abbreviated amino acid sequence alphabets: Charge, Chemical, Functional, and Hydrophobic (12). Shown are the new alphabet codes used for each different alphabet, what each code represents in terms of properties of amino acids, and the specific amino acids that are included in each property.
Proteins that have a significant difference between their predicted pI/MW (obtained using similar algorithms as mentioned above) and their experimental pI/MW will be studied. As mentioned before, certain amino acids that occur in a particular succession need to be considered. These trends in the periodicity of certain amino acids of certain proteins (those with large ΔpI values) that do not occur in the other proteins (those whose pI values were accurately predicted) are important. They may lead to a more accurate prediction of the pI and MW of all proteins from their amino acid compositions.
Methods
Forming the data set
The ExPASy Server's SWISS-2DPAGE database (13) provides extensive 2-D gel information for human, mouse, Arabidopsis thaliana, Dictyostelium discoideum, E. coli, Saccharomyces cerevisiae, and Staphylococcus aureus (N315) proteins, which are also cross-referenced in Swiss-Prot. Each protein in the database is collected and annotated from experimental 2-D gels read by the contributing research groups. The database for this project contains 336 proteins of the E. coli proteome characterized by five different research groups (14-18). It was decided that the proteins should be separated according to each research group since experimental conditions varied among them. The proteins denoted by Vanbogelen et al. (14), Pasquali et al. (15), and Phillips et al. (16) were ignored because these groups were a compilation of pI/MW sets from reference maps. Two sets were created; the first contains 228 proteins of all the proteins characterized by Tonella et al. (17) and the second 153 proteins of all the proteins contributed by Yan et al. (18). The Tonella et al. set was also separated based on the pH range used for isoelectric focusing (pH 4-5, 4.5-5.5, 5-6, 5.5-6.7, 6-9, 6-11). We concentrated on the Tonella et al. set because it covered more than 70% of the E. coli proteome and all of the experiments were carried out under the same conditions.
We then matched the pI/MW data for each protein with its FASTA sequence. This allows us to compare experimental pI/MW values with predicted pI/MW values. ExPASy provides its own tool for predicting pI/MW, which requires a Swiss-Prot ID or a FASTA format sequence as its input (19). We have also developed our own pI/MW prediction tool, which requires input in FASTA format or Protein Data Bank format (11). Both of these prediction tools are based on a calculation developed by Bjellqvist et al. (19) using pKA values of amino acids, as described earlier in the introduction.
The first step was to retrieve the 2-D gel information for each gel in a tab delimited format. The fields in these files included: gene name, protein description, SWISS-2DPAGE Serial Number, SWISS-2DPAGE Accession Number, identification method (gel matching, microsequencing, or peptide mass fingerprinting), experimental pI, experimental MW, and references. Having this data in a tab delimited format gave a far greater ease of use when later performing any type of analysis on the data (such as comparing experimental pI and predicted pI). A list of proteins was then made for each of the gels. This list contained the Swiss-Prot IDs (2DPAGE Accession Number, e.g. P00274) of the proteins from each spot on the gel (some proteins were repeated for multiple spots on a gel).
The Swiss-Prot IDs from each gel were submitted to the NCBI tool for retrieving sequences at http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Protein to retrieve a FASTA file of all of these proteins. Batch retrieval at NCBI was chosen over the batch retrieval tool that ExPASy provides because the latter does not include, for whatever reason, the initial methionine residue when retrieving in FASTA format. The sequences were downloaded in FASTA format to be used in our prediction tool.
The FASTA file for the set of proteins from each gel was then fed into our tool, where the output can be conveniently recorded to a Microsoft Excel file. However, using the FASTA file from NCBI in the "Compute pI/MW tool" at ExPASy (19) was not quite as easy since it does not output each protein by Swiss-Prot ID readily. Problems occurred because the output would be ordered based on Genbank accession number and not by Swiss-Prot ID. This was solved by removing the Genbank accession number (leaving just the Swiss-Prot ID) from each entry in the FASTA file using the following regular expression: ":%s/gi|\d*|sp|//" (quotations excluded). The output file of the ExPASy pI/MW predict tool was then edited by a few regular expressions, most notably ":%s/\s\s*/\t/g" (quotations excluded), which transformed it into a tab delimited text file, allowing it to be easily imported into Excel. This facilitated the match against the tab delimited file for each gel, which was done using a simple Perl script.
Both sets, derived from the Tonella data (17) and the Yan (DIGE) data (18), were compared with our pI/MW prediction tool and with the ExPASy tool (19). Nevertheless, both tools gave strikingly similar results. The experimental and predicted pI values were compiled in the Excel files, and the results can be seen at http://www.rit.edu/~mac3948/E2D/Ecoli/.
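The two editor substitutions quoted in the data set section can be sketched with Python's `re` module. This is only an illustration, not the commands or script used in the thesis: the function names are hypothetical, and the sample header and values in the usage note are made up.

```python
import re


def strip_genbank_prefix(header):
    """Remove 'gi|<digits>|sp|' so the header begins with the Swiss-Prot ID.

    Mirrors the quoted substitution :%s/gi|\\d*|sp|// (the pipes are literal).
    """
    return re.sub(r"gi\|\d*\|sp\|", "", header)


def whitespace_to_tabs(line):
    """Replace each run of whitespace with one tab.

    Mirrors the quoted substitution :%s/\\s\\s*/\\t/g. The trailing newline is
    dropped first so that it is not turned into a tab.
    """
    return re.sub(r"\s+", "\t", line.rstrip("\n"))
```

For example, `strip_genbank_prefix(">gi|12345|sp|P00274|EXAMPLE")` gives `">P00274|EXAMPLE"`, and `whitespace_to_tabs("P00274   5.10  11675")` gives the three fields separated by single tabs, which a spreadsheet can import directly.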
Experimental and predicted pI values
Looking at the predicted pI versus experimental pI data set, it was noticeable that some predicted pI values were far different from experimental pI values. Some proteins differed in predicted pI by as much as 1.86 pH units (e.g. P06128, Phosphate-binding periplasmic protein (PBP), see Appendix A). However, for other proteins the predicted pI was exactly the same as the experimental pI (e.g. P06960, Ornithine carbamoyltransferase chain F (OTCase-2), see Appendix A). To better characterize these discrepancies across all of the proteins a simple calculation was performed:
$\Delta\mathrm{pI} = \mathrm{pI}_{\mathrm{experimental}} - \mathrm{pI}_{\mathrm{predicted}}$ (Eq. 1)
The difference in experimental pI and predicted pI will be referred to as ΔpI in this paper. The main focus of this project is to identify potential causes of varying ΔpI values.
The data set was then broken down into roughly thirds. The first subset held 58 proteins where the ΔpI value was less than 0.1. Another subset of 60 proteins consisted of proteins where the ΔpI value was greater than 0.3, but less than 0.7 (0.3 < ΔpI < 0.7). The last third was put into a subset of 50 proteins of ΔpI values greater than 0.7. Refer to the tables in Appendix A for a list of the proteins in each ΔpI subset.
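The ΔpI calculation and the three ranges can be written as a small sketch. It assumes that the magnitude of the difference is what gets binned, which the text does not state explicitly; the subset labels and function names are hypothetical. Values falling between the stated ranges (for example 0.2) are left unassigned, matching the gap in the quoted thresholds.

```python
def delta_pi(experimental_pi, predicted_pi):
    """Eq. 1: experimental pI minus predicted pI."""
    return experimental_pi - predicted_pi


def assign_subset(dpi):
    """Place a delta-pI value into one of the three ranges given in the text.

    Assumption: the absolute value is used. Returns None for values that fall
    outside the three stated ranges.
    """
    magnitude = abs(dpi)
    if magnitude < 0.1:
        return "small"
    if 0.3 < magnitude < 0.7:
        return "medium"
    if magnitude > 0.7:
        return "large"
    return None
```

A protein whose prediction is off by 1.86 pH units would land in the "large" subset, while one predicted exactly would land in the "small" subset.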
Extracting useful information from collected subset sequences
The following sections explain the sequential steps that were performed on the data to analyze the subsets of ΔpI ranges. The process starts with a naive approach (simply calculating the raw frequencies of the 20 amino acids in each subset). Handling the data subsets with the four different alphabets is described next, followed by a final method focusing on dipeptide frequencies. Another section summarizes how the whole process flows together.
Amino acid frequency analysis (naive approach)
As previously described, the naive approach involves determining the relative counts of each amino acid in each of the ΔpI subsets and comparing the frequency of occurrence for each amino acid between the ΔpI subsets. If a significant difference for any amino acid were to exist between any of the ΔpI subsets, then this would be of great interest. It would then be possible to adjust a pI prediction algorithm based on the amino acid frequencies so that predicted pI values were closer to experimental values.
The first step was to start from the list of proteins included in each ΔpI subset. The batch sequence retrieval at NCBI was used to obtain a FASTA file that contained each sequence for each ΔpI subset. A Perl program was written to count the number of amino acids in each sequence and calculate the frequency of each, outputting a tab delimited file displaying the amino acid frequencies for each sequence. The code of this program can be found in Appendix B - aacounts.pl. This allows one to look at the amino acid frequencies protein by protein. Another Perl program was written which concatenates each separate sequence of a FASTA file into one long sequence encompassing all of the sequences (see Appendix B - makeComposite.pl), which allows one to look at each ΔpI subset as a whole instead of looking at each protein sequence individually. The program also makes sure that the header line is removed and that each sequence is kept separate, which will be shown to be important shortly when looking at two amino acids that occur one right after the other (see dipeptide approach).
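The naive frequency analysis can be sketched as follows. This is a Python illustration, not the thesis's aacounts.pl or makeComposite.pl; the function names and the small FASTA text in the usage note are hypothetical. It shows both views mentioned above: frequencies protein by protein, and frequencies pooled over a whole subset.

```python
from collections import Counter


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def residue_frequencies(sequence):
    """Relative frequency of each residue within one sequence."""
    counts = Counter(sequence)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


def subset_frequencies(fasta_text):
    """Relative frequency of each residue pooled over every sequence in a subset."""
    pooled = Counter()
    for _, seq in read_fasta(fasta_text):
        pooled.update(seq)
    total = sum(pooled.values())
    return {aa: n / total for aa, n in pooled.items()} if total else {}
```

With the made-up input `">p1\nAAC\n>p2\nCD\n"`, the per-protein view gives A a frequency of 2/3 in the first sequence, while the pooled view gives A and C a frequency of 2/5 each and D a frequency of 1/5.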
Frequency of amino acids (alphabets approach)
A more sophisticated analysis of amino acid frequency can be done if the amino acids are grouped according to the properties of their side chains. The structures of the side chains of the amino acids can be used to assign them to four alphabets (Charge, Chemical, Functional, Hydrophobic).
Charge alphabet
The Charge alphabet (see Table 2) is based simply on whether the side chain of an amino acid can have a positive or negative charge, or is uncharged (neutral). Glutamic Acid (Glu / E) and Aspartic Acid (Asp / D) are the only amino acids that contain the negatively charged carboxyl group (COO-). Therefore, in the Charge alphabet they are grouped together and given the abbreviated amino acid code A. Likewise, Lysine (Lys / K) and Arginine (Arg / R) are grouped together and given code C because they have side chains which contain the positively charged amino groups (the lysine side chain contains an ε-amino group and arginine has a guanidino group). Histidine (His / H) is also grouped into the positively charged amino acid group because protonation of the nitrogen on its side chain occurs easily. The remaining 15 amino acids normally do not demonstrate charge behavior in proteins; they are grouped together with the code N. An example of using the Charge alphabet can be seen below:
ACDEFGH (original sequence)
↓
NNAANNC (Charge alphabet sequence)
Chemical alphabet
The Chemical alphabet incorporates two groupings, acidic and basic, with codes A and C, respectively. These groupings are analogous to the A and C groupings in the Charge alphabet for the same reasons. The Chemical alphabet characterizes the remaining 15 amino acids based on more than their lack of a charge. Asparagine (Asn / N) and Glutamine (Gln / Q) are amino acids that contain an amide (CONH2) and are grouped together accordingly with the code M. Phenylalanine (Phe / F), Tryptophan (Trp / W), and Tyrosine (Tyr / Y) contain aromatic rings on their side chains (code R). Serine (Ser / S) and Threonine (Thr / T) contain the hydroxyl group (OH) on their side chains (code H). Proline (Pro / P) contains an imino group (>C=NH) on its side chain (code I). Finally, the sulfur containing amino acids are Cysteine (Cys / C) and Methionine (Met / M), grouped together with code S. An example of using the Chemical alphabet can be seen below:
ACDEFGHNPS (original sequence)
↓
LSAARLCMIH (Chemical alphabet sequence)
Functional alphabet
The Functional alphabet again incorporates the A (acidic) and C (basic) groups as did the Charge and Chemical alphabets. The Functional alphabet characterizes the remaining amino acids into 2 groups: H (hydrophobic) and P (polar), based on whether the amino acid is hydrophobic (such as Alanine) or polar (such as Cysteine). An example of using the Functional alphabet can be seen below:
ACDEFGH (original sequence)
↓
HPAAHPC (Functional alphabet sequence)
Hydrophobic alphabet
The Hydrophobic alphabet groups amino acids based only on hydrophobicity. It is similar to the latter half of the Functional alphabet. Amino acids that are hydrophobic (such as Alanine) are given the code I. Amino acids that are hydrophilic (such as Cysteine) are given the code O. An example of using the Hydrophobic alphabet can be seen below:
ACDEFGH (original sequence)
↓
IOOOIOO (Hydrophobic alphabet sequence)
Perl programs were written that convert normal sequences into each of the four alphabets just described (see charge.pl, chemical.pl, functional.pl, and hydro.pl in Appendix B). The programs also calculate and display the frequency of each alphabetic code that is chosen.
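The recoding step can be illustrated with a short Python sketch built directly from the groupings in Table 2. It is not the thesis's Perl code and does not call Bio::Tools::OddCodes; the dictionary layout and the function names are assumptions made for illustration.

```python
from collections import Counter

# Code -> residues, transcribed from Table 2.
ALPHABETS = {
    "charge": {"A": "DE", "C": "HKR", "N": "ACFGILMNPQSTVWY"},
    "chemical": {
        "A": "DE", "L": "AGILV", "M": "NQ", "R": "FWY",
        "C": "RHK", "H": "ST", "I": "P", "S": "CM",
    },
    "functional": {"A": "DE", "C": "HKR", "H": "AFILMPVW", "P": "CGNQSTY"},
    "hydrophobic": {"I": "AFILMPVW", "O": "CDEGHKNQRSTY"},
}


def recode(sequence, alphabet):
    """Rewrite a one-letter amino acid sequence in one of the reduced alphabets."""
    table = {}
    for code, residues in ALPHABETS[alphabet].items():
        for aa in residues:
            table[ord(aa)] = code
    # Characters outside the 20 standard residues are left unchanged.
    return sequence.upper().translate(table)


def code_frequencies(sequence, alphabet):
    """Relative frequency of each reduced-alphabet code in a sequence."""
    counts = Counter(recode(sequence, alphabet))
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()} if total else {}
```

Building the translation table from the code-to-residue groups keeps the data in the same shape as Table 2, so a grouping can be checked against the table at a glance.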
Frequency of amino acids (dipeptide approach)
The problem that certain amino acids being next to other amino acids might cause abnormal pKA values of side chains, affecting the overall charge of a protein, still had not been dealt with up until this point. All that had been considered was the sum of a set of strict pKA values for each amino acid, without taking into account any changes that might occur due to certain amino acids being next to other amino acids in the sequence. The approach to solving this problem was to examine every "dipeptide" in the three ΔpI subsets. A sequence of length 7 has 6 dipeptides. For example,

Sequence:     ABCABBC
Dipeptides:   AB  BC  CA  AB  BB  BC

Dipeptide counts:   Frequency:
AB = 2              0.333
BC = 2              0.333
CA = 1              0.167
BB = 1              0.167

The frequency at which each dipeptide occurs in a particular sequence is of interest, particularly when the dipeptides are considered in each ΔpI subset. A Perl program was written that counts each dipeptide in a sequence and displays the frequency of each dipeptide in the sequences of a FASTA file that is input (see Appendix B - dipeps.pl for dipeptides output in increasing order or dipepsA.pl for dipeptides output alphabetically from AA ... VV). As was the case earlier with the normal amino acid alphabet, the number of different dipeptides (20 x 20 = 400 for the normal alphabet) became problematic. The same dipeptide technique was applied to sequences after converting them into the Charge, Chemical, Functional, and Hydrophobic alphabets to alleviate this problem.
Combining an entire ΔpI subset of FASTA sequences into one long sequence (using makeComposite.pl - see Appendix B) also became problematic. To count the number of dipeptides in a set of sequences that has been combined into one long sequence, special attention needs to be paid so that the last amino acid in one sequence and the first amino acid in the next sequence are not counted as a dipeptide. makeComposite.pl handles this problem by replacing each accession line of the FASTA file with a blank new line. The other programs can now use the format of the output file from makeComposite.pl so that the dipeptide counts for this formatted long sequence are just as accurate as the naive and alphabet counts.
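Dipeptide counting with the sequence-boundary rule described above can be sketched in a few lines of Python. This is an illustration rather than the thesis's dipeps.pl; the function names are hypothetical, and the sequences are passed as a list so that no pair is ever formed across two different proteins.

```python
from collections import Counter


def dipeptide_counts(sequences):
    """Count overlapping residue pairs within each sequence.

    Pairs are never formed across sequence boundaries, so the last residue of
    one protein and the first residue of the next are not counted together.
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 1):
            counts[seq[i:i + 2]] += 1
    return counts


def dipeptide_frequencies(sequences):
    """Relative frequency of each dipeptide over all supplied sequences."""
    counts = dipeptide_counts(sequences)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}
```

Applied to the single sequence ABCABBC, this reproduces the worked example: AB and BC each occur twice, CA and BB once, out of six dipeptides. The same functions work unchanged on sequences that have already been rewritten in one of the reduced alphabets, where the number of possible pairs drops from 400 to 9, 64, 16, or 4.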
Pipeline Workflow
So far there have been stages at which the frequency of an amino acid, a group of amino acids (coded according to the four alphabets), a dipeptide, or a grouped dipeptide (coded according to the four alphabets) has been examined. The process of transforming the data to reach each of these stages may appear somewhat confusing. Figure 2 below diagrams how to go from an initial set of FASTA sequences (for each ΔpI subset) to each stage of analysis. The flow in taking the naive approach would go from FASTA sequence to makeComposite.pl to aacounts.pl and then analysis. However, the flow for examining dipeptides with a functional alphabet is more complex. It begins by transferring the FASTA sequence to makeComposite.pl to functional.pl to dipeps.pl (or dipepsA.pl) followed by analysis. Table 3 below gives a brief description of each program used in this pipeline workflow (for a more detailed description and code of each program see Appendix B).
[Figure 2 diagram. Boxes shown: ΔpI subset FASTA file; makeComposite.pl; aacounts.pl; charge.pl; chemical.pl; functional.pl; hydro.pl; dipeps.pl or dipepsA.pl; analysis.]
Figure 2. Workflow diagram that shows how to get to each stage of analysis (naive, alphabets, dipeptides).
Program | Description
aacounts.pl | Counts the number of each amino acid (normal alphabet) in each sequence from a FASTA file and determines the frequency of each. Output is to FASTAfilename.aacounts
charge.pl | Converts the amino acids from the sequences in a FASTA file into a 3-letter alphabet using the charge() method in Bio::Tools::OddCodes (12). It then counts the number of each different code in each sequence as well as each code frequency.
chemical.pl | Converts the amino acids from the sequences in a FASTA file into an 8-letter alphabet using the chemical() method in Bio::Tools::OddCodes (12). It then counts the number of each different code in each sequence as well as each code frequency.
dipeps.pl | Counts the number of each amino acid pair in the sequences given in FASTA files. It displays each pair in order from highest frequency to lowest.
dipepsA.pl | Counts the number of each amino acid pair in the sequences given in FASTA files. It displays each pair in alphabetical order (AA ... VV).
functional.pl | Converts the amino acids from the sequences in a FASTA file into a 4-letter alphabet using the functional() method in Bio::Tools::OddCodes (12). It then counts the number of each different code in each sequence as well as each code frequency.
hydro.pl | Converts the amino acids from the sequences in a FASTA file into a 2-letter alphabet using the hydrophobic() method in Bio::Tools::OddCodes (12). It then counts the number of each different code in each sequence as well as each code frequency.
makeComposite.pl | Converts FASTA files of multiple sequences into a single (composite) sequence. This composite sequence is then able to be used with other programs listed here.
Table 3. Description of the programs used in this pipeline workflow. Appendix B provides a longer description and the source code for each program.
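As an illustration of what the alphabet conversion programs do, the following Python sketch recodes a sequence into a 3-letter charge alphabet and counts the resulting codes. It is a sketch under stated assumptions, not the thesis code: the grouping used here (D and E as A, H, K and R as C, everything else as N) is my reading of the charge alphabet and should be checked against Table 2 and the Bio::Tools::OddCodes documentation. The names `CHARGE` and `recode` are hypothetical.

```python
from collections import Counter

# Assumed charge grouping: A = acidic (D, E), C = cationic (H, K, R), N = neutral (all others).
CHARGE = {**dict.fromkeys("DE", "A"), **dict.fromkeys("HKR", "C")}

def recode(seq: str, mapping: dict, default: str = "N") -> str:
    """Replace every residue with its group code; unmapped residues get the default code."""
    return "".join(mapping.get(res, default) for res in seq.upper())

coded = recode("MKDEHA", CHARGE)
code_counts = Counter(coded)
total = len(coded)
print(coded)
for code in sorted(code_counts):
    print(code, code_counts[code], f"{100 * code_counts[code] / total:.1f}%")
```

Because `recode` takes the mapping as a parameter, the same function could be reused for the Chemical, Functional and Hydrophobic alphabets once their groupings are filled in from Table 2.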
Results
Naive approach
The initial naive approach in analyzing the data set was done to determine the relative counts of each amino acid (using the normal alphabet) in each ΔpI subset (ΔpI < 0.1; 0.3 < ΔpI < 0.7; ΔpI > 0.7) and to compare the frequency of occurrence for each amino acid between the ΔpI subsets. A comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 3. A similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 4.
[Figure 3 chart: "Frequencies of Amino Acids in ΔpI < 0.1 and (0.3 < ΔpI < 0.7)".]
Figure 3. Frequency of Individual Amino Acids in Two ΔpI Subsets. The X axis labels represent the one letter abbreviations of the amino acids. Shown in blue is the ΔpI < 0.1 subset and shown in yellow is the 0.3 < ΔpI < 0.7 subset. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22472 total amino acids. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17906 total amino acids. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.
[Figure 4 chart: "Frequencies of Amino Acids in ΔpI < 0.1 and ΔpI > 0.7".]
Figure 4. Frequency of Individual Amino Acids in Two ΔpI Subsets. The X axis labels represent the one letter abbreviations of the amino acids. Shown in blue is the ΔpI < 0.1 subset and shown in yellow is the ΔpI > 0.7 subset. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22472 total amino acids. The ΔpI > 0.7 subset consists of 50 proteins which comprise 15581 total amino acids. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.
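The naive comparison amounts to computing, for each subset, the percentage of all residues accounted for by each amino acid and then placing the subsets side by side. A minimal Python sketch follows. The sequences are toy stand-ins rather than thesis data, and the function name `aa_frequencies` is hypothetical (the thesis uses the Perl program aacounts.pl for this step).

```python
from collections import Counter

def aa_frequencies(sequences):
    """Percent frequency of each residue over all sequences in one subset."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.items()}

# Toy stand-ins for two ΔpI subsets (not real E. coli proteins).
low = aa_frequencies(["MKKA", "AAKE"])
high = aa_frequencies(["MRRA", "GGRE"])

# Side-by-side listing, one row per residue seen in either subset.
for aa in sorted(set(low) | set(high)):
    print(aa, f"{low.get(aa, 0.0):.1f}", f"{high.get(aa, 0.0):.1f}")
```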
Alphabets approach
-Charge
The next step in analysis was to convert each of the subsets into a sequence that utilizes the four alphabets. This decreases the size of the amino acid alphabet and reduces the number of variables being examined. The different alphabets are summarized in Table 2. Using the Charge alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 5. Again using the Charge alphabet, a similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 6.
[Figure 5 chart: "Frequencies of Amino Acids (Charge alphabet) in ΔpI < 0.1 and (0.3 < ΔpI < 0.7)"; x axis: Amino Acid (charge alphabet), categories C, A, N.]
Figure 5. Frequency of Amino Acids Using the Charge Alphabet in Two ΔpI Subsets.
[Figure 6 chart: "Frequencies of Amino Acids (Charge alphabet) in ΔpI < 0.1 and ΔpI > 0.7"; x axis: Amino Acid (charge alphabet), categories C, A, N.]
Figure 6. Frequency of Amino Acids Using the Charge Alphabet in Two ΔpI Subsets.
-Chemical
Using the Chemical alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 7. Figure 8 displays the same comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.
[Figure 7 chart: "Frequencies of Amino Acids (Chemical alphabet) in ΔpI < 0.1 and (0.3 < ΔpI < 0.7)"; x axis: Amino Acid (chemical alphabet).]
Figure 7. Frequency of Amino Acids Using the Chemical Alphabet in Two ΔpI Subsets.
[Figure 8 chart: "Frequencies of Amino Acids (Chemical alphabet) in ΔpI < 0.1 and ΔpI > 0.7"; x axis: Amino Acid (chemical alphabet).]
Figure 8. Frequency of Amino Acids Using the Chemical Alphabet in Two ΔpI Subsets.
-Functional
Using the Functional alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 9. Again using the Functional alphabet, a similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 10.
[Figure 9 chart: "Frequencies of Amino Acids (Functional alphabet) in ΔpI < 0.1 and (0.3 < ΔpI < 0.7)"; x axis: Amino Acid (functional alphabet).]
Figure 9. Frequency of Amino Acids Using the Functional Alphabet in Two ΔpI Subsets.
[Figure 10 chart: "Frequencies of Amino Acids (Functional alphabet) in ΔpI < 0.1 and ΔpI > 0.7"; x axis: Amino Acid (functional alphabet).]
Figure 10. Frequency of Amino Acids Using the Functional Alphabet in Two ΔpI Subsets.
-Hydrophobic
Using the Hydrophobic alphabet, a comparison of the frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 11. Again using the Hydrophobic alphabet, a similar comparison between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset is displayed in Figure 12.
[Figure 11 chart: "Frequencies of Amino Acids (Hydrophobic alphabet) in ΔpI < 0.1 and (0.3 < ΔpI < 0.7)"; x axis: Amino Acid (hydrophobic alphabet), categories I, O.]
Figure 11. Frequency of Amino Acids Using the Hydrophobic Alphabet in Two ΔpI Subsets.
[Figure 12 chart: "Frequencies of Amino Acids (Hydrophobic alphabet) in ΔpI < 0.1 and ΔpI > 0.7"; x axis: Amino Acid (hydrophobic alphabet), categories I, O.]
Figure 12. Frequency of Amino Acids Using the Hydrophobic Alphabet in Two ΔpI Subsets.
Dipeptide approach
Using a more sophisticated method that looks at dipeptides of a sequence gave an entirely new set of results. The first way of looking at dipeptides of the three ΔpI subsets is similar to the naive approach in that it just examines dipeptides using the normal amino acid alphabet. This results in up to 400 different dipeptides (there may be slightly fewer than 400 dipeptides in a given subset owing to the chance that not all possible dipeptides may occur). The difference in frequency of every dipeptide between ΔpI subsets was also calculated ("Delta frequency" or "Delta %" or "Δ%"). Delta % values when comparing the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset can be seen in Figure 13. Figure 14 shows similar Delta % values when comparing the ΔpI < 0.1 subset and the ΔpI > 0.7 subset. To better explain Figures 13-16, consider the bar indicated by the arrow in Figure 13. This bar represents the 11 times that there was a Δ% value between 100% and 150% when comparing dipeptide frequencies in the two different ΔpI sets. In other words, a Δ% value in this range would mean that a certain dipeptide occurred roughly 2 times as much in one subset compared to another subset.
[Figure 13 chart: "Densities of Delta % Values in ΔpI < 0.1 and 0.3 < ΔpI < 0.7 Using a Normal Amino Acid Alphabet"; x axis: Delta % range.]
Figure 13. Density of Delta % Values of Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17848 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.
[Figure 14 chart: "Densities of Delta % Values in ΔpI < 0.1 and ΔpI > 0.7 Using a Normal Amino Acid Alphabet"; x axis: Delta % range, with bins >25, >50, >75, >100, >150, >200, >300, >400.]
Figure 14. Density of Delta % Values of Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The ΔpI > 0.7 subset consists of 50 proteins which comprise 15531 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.
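The Delta % calculation and the density bars of Figures 13 and 14 can be sketched in Python as follows. The exact formula is not spelled out in this section, so the sketch assumes a relative difference, 100 x (f_a - f_b) / f_b, with the second subset as the reference. This is consistent with the statement that a value of 100% corresponds to about twice the frequency, but it remains an assumption. The function names and the toy frequencies are hypothetical.

```python
def delta_percent(freq_a: dict, freq_b: dict) -> dict:
    """Assumed definition: relative difference of subset A against subset B, in percent."""
    out = {}
    for dp, fa in freq_a.items():
        fb = freq_b.get(dp, 0.0)
        if fb > 0:  # undefined when the dipeptide never occurs in the reference subset
            out[dp] = 100.0 * (fa - fb) / fb
    return out

def bin_counts(deltas: dict, edges=(25, 50, 75, 100, 150, 200, 300, 400)) -> dict:
    """Count dipeptides per positive range; each value goes to the largest edge it exceeds."""
    bins = {f">{e}": 0 for e in edges}
    for v in deltas.values():
        passed = [e for e in edges if v > e]
        if passed:
            bins[f">{max(passed)}"] += 1
    return bins

# Toy percent frequencies (not thesis data).
subset_a = {"RR": 3.0, "AK": 5.0, "EE": 2.0}
subset_b = {"RR": 1.0, "AK": 4.0, "EE": 5.0}
deltas = delta_percent(subset_a, subset_b)
bins = bin_counts(deltas)
print(deltas)
print(bins)
```

The bin edges mirror the positive ranges legible on the Figure 14 axis. Negative ranges, such as those shown in Figures 15 and 16, would need a second set of edges.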
Dipeptide Threshold
A similar analysis was performed on the same ΔpI subsets, except that dipeptides with a very low frequency of occurrence were eliminated, since the Delta % value of a rare dipeptide may change too rapidly (see Discussion for an elaboration). A frequency of occurrence threshold value of 0.1% had to be met for dipeptides. In other words, if a dipeptide occurred so infrequently (under 0.1% of the total number of dipeptides) then it was eliminated. The remaining dipeptides were counted and their Delta % values were monitored. Delta % values for the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset can be seen in Figure 15. Likewise, the comparison of the ΔpI < 0.1 subset and the ΔpI > 0.7 subset can be seen in Figure 16. Dipeptides that were found in the extreme positive or negative ranges of the Delta % values are indicated in these figures by the one letter amino acid codes. For instance, the dipeptide RR (arginine-arginine) was seen much less frequently in the ΔpI < 0.1 dataset than in the 0.3 < ΔpI < 0.7 dataset.
[Figure 15 chart: "Densities of Delta % Values in ΔpI < 0.1 and 0.3 < ΔpI < 0.7 Using a Normal Amino Acid Alphabet (where frequency of dipeptide must be above 0.1)"; x axis: Delta % range and particular dipeptides, with bins from <-50 to >100.]
Figure 15. Density of Delta % Values of Dipeptides in Two ΔpI Subsets with a Threshold of 0.1%.
[Figure 16 chart: "Densities of Delta % Values in ΔpI < 0.1 and ΔpI > 0.7 Using a Normal Amino Acid Alphabet (where frequency of dipeptide must be above 0.1)"; x axis: Delta % range and particular dipeptides, with bins from <-50 to >100.]
Figure 16. Density of Delta % Values of Dipeptides in Two ΔpI Subsets with a Threshold of 0.1%.
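The 0.1% threshold can be expressed as a minimum count. For the ΔpI < 0.1 subset, with 22412 total dipeptides, 0.1% is 22.412 occurrences. The Python sketch below rounds this down to 22, an assumption chosen to match the "at least 22 times" wording used in the Discussion. Applying the cut to a single subset's counts, and the name `apply_threshold`, are likewise assumptions made for illustration.

```python
import math

def apply_threshold(counts: dict, threshold: float = 0.001) -> dict:
    """Keep only dipeptides whose count reaches the threshold fraction of all dipeptides."""
    total = sum(counts.values())
    # Assumption: the minimum count is the threshold fraction rounded down.
    min_count = math.floor(threshold * total)
    return {dp: n for dp, n in counts.items() if n >= min_count}

# Minimum count implied for the ΔpI < 0.1 subset (22412 total dipeptides).
min_count_low = math.floor(0.001 * 22412)
print(min_count_low)

# Toy counts (not thesis data): 22001 dipeptides in total, so the cut-off is 22.
toy_counts = {"AA": 21000, "RR": 1000, "WC": 1}
kept = apply_threshold(toy_counts)
print(sorted(kept))
```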
Dipeptide using Alphabets
The final step in analysis was to combine the alphabet and dipeptide approaches together. Using the smaller alphabets dramatically reduced and condensed the results as compared to using the normal alphabet, which creates 400 possible dipeptides.
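The size of the reduction is simple arithmetic: an alphabet of n letters allows n x n ordered pairs. The short Python sketch below tabulates this for the alphabet sizes given in Table 3 and enumerates the pairs for a 3-letter alphabet. The letters A, C and N are taken from the axis labels of the charge alphabet figures, and the variable names are hypothetical.

```python
from itertools import product

# Alphabet sizes as listed in Table 3 (normal alphabet = 20 amino acids).
ALPHABET_SIZES = {"normal": 20, "chemical": 8, "functional": 4, "charge": 3, "hydrophobic": 2}

possible = {name: n * n for name, n in ALPHABET_SIZES.items()}
for name, n_pairs in possible.items():
    print(f"{name:12s} {n_pairs:3d} possible dipeptides")

# All ordered pairs for a 3-letter alphabet such as the charge alphabet.
charge_pairs = ["".join(p) for p in product("ACN", repeat=2)]
print(charge_pairs)
```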
-Charge
Using the Charge alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 17 as well as the Delta % values for each dipeptide. The same comparison is shown between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset in Figure 18.
[Figure 17 chart: "Comparison of Dipeptides (based on charge characteristic) taken from ΔpI < 0.1 and 0.3 < ΔpI < 0.7"; x axis: Dipeptide (charge alphabet).]
Figure 17. Frequencies of Charge Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset.
[Figure 18 chart: "Comparison of Dipeptides (based on charge characteristic) taken from ΔpI < 0.1 and ΔpI > 0.7"; x axis: Dipeptide (charge alphabet).]
Figure 18. Frequencies of Charge Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.
-Chemical
Using the Chemical alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 19 as well as the Delta % values for each dipeptide. The same comparison is shown between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset in Figure 20. The Chemical alphabet with the possible dipeptide combinations was sufficiently large that it was not possible to display all dipeptides in Figures 19 and 20. Instead only the density values were chosen to display.
[Figure 19 chart: "Densities of Delta % Values in ΔpI < 0.1 and (0.3 < ΔpI < 0.7) Using a Chemical Alphabet"; x axis: Delta % range and particular dipeptides, with bins from <-20 to >60. Legible bar annotations include AS (-24%), MS (-22%), IS (-20%) and RR (61%).]
Figure 19. Density of Delta % Values of Chemical Alphabet Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The 0.3 < ΔpI < 0.7 subset consists of 58 proteins which comprise 17848 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.
[Figure 20 chart: "Densities of Delta % Values in ΔpI < 0.1 and ΔpI > 0.7 Using a Chemical Alphabet"; x axis: Delta % range and particular dipeptides, with bins from <-40 to >80.]
Figure 20. Density of Delta % Values of Chemical Alphabet Dipeptides in Two ΔpI Subsets. The ΔpI < 0.1 subset consists of 60 proteins which comprise 22412 total dipeptides. The ΔpI > 0.7 subset consists of 50 proteins which comprise 15531 total dipeptides. More information about each individual protein in these ΔpI subsets can be seen in Appendix A.
-Functional
Using the Functional alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 21 as well as the Delta % values for each dipeptide. The same comparison is shown between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset in Figure 22.
[Figure 21 chart: "Comparison of dipeptides (based on functional characteristic) taken from ΔpI < 0.1 and 0.3 < ΔpI < 0.7"; x axis: Dipeptide (functional alphabet).]
Figure 21. Frequencies of Functional Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset.
[Figure 22 chart: "Comparison of dipeptides (based on functional characteristic) taken from ΔpI < 0.1 and ΔpI > 0.7"; x axis: Dipeptide (functional alphabet).]
Figure 22. Frequencies of Functional Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.
-Hydrophobic
Using the Hydrophobic alphabet, a comparison of the dipeptide frequencies between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset is shown in Figure 23 as well as the Delta % values for each dipeptide. The same comparison is shown between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset in Figure 24.
[Figure 23 chart: "Comparison of dipeptides (based on hydrophobic characteristic) taken from ΔpI < 0.1 and 0.3 < ΔpI < 0.7"; x axis: Dipeptide (hydrophobicity alphabet); legend: % of Dipeptide in ΔpI < 0.1, Delta % (ΔpI < 0.1 - 0.3 < ΔpI < 0.7).]
Figure 23. Frequencies of Hydrophobic Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the 0.3 < ΔpI < 0.7 subset.
[Figure 24 chart: "Comparison of dipeptides (based on hydrophobic characteristic) taken from ΔpI < 0.1 and ΔpI > 0.7"; x axis: Dipeptide (hydrophobicity alphabet); legend: % of Dipeptide in ΔpI < 0.1, Delta % (ΔpI < 0.1 - ΔpI > 0.7).]
Figure 24. Frequencies of Hydrophobic Alphabet Dipeptides in Two ΔpI Subsets. Shown in blue are the frequencies of each dipeptide in the ΔpI < 0.1 subset and shown in yellow is the difference in frequency for each dipeptide between the ΔpI < 0.1 subset and the ΔpI > 0.7 subset.
Discussion
When exploring the behavior
exists a
values
discrepancy between
for
performed
a
high
using
predictions
to
on our algorithm
able
The first
robust enough
to
more
accurately
to give
meaningful
pi and
the
MW.
the
that is too
their
occur
organisms
study
limited only to
be
set
that
offset
have
enough
proteins
has
in E.
a proteome
To
were
both
obtained.
uniform and
would
lead to
dipeptides in
all
robust enough.
high level
are still seen
both
of noise
in
in the data due to
of these
make sure
A data
that the
sufficient abundance
hurdles,
it displays very few
to
the search space
post-translational
that has been sufficiently documented to do a case
study.
in
known
post-translational modifications.
overcome
coli since
that
results
that was
of all
in
these differences. The
dipeptide information to
in the lowest frequencies
size and
post-translational modifications
is certainly
by the
and
of the protein sequences
using the information
set
(19)
(14-18). The
that is too diverse
frequencies
have different
statistical validity.
modifications and
data
set
data
to handle
Simply finding the
small would not
dipeptides that
was
a close
a reliable
question of how
robustness would
the fact that different
maintain
or similar algorithms
information in the
data. A data
protein sequences would provide a
set
(11)
predict pi values
key element was having
complications such as
Unfortunately,
enough
pi
comparison of pi values was
laboratory settings
differences justified
lay in whether there was
to be
predicting
based
experimentally determined
identify underlying patterns that could contribute to
question now
extracted
This
determined in different
regular occurrence of these
undergoing isoelectric focusing, there
predicted pi values and
percentage of those proteins.
experimental pi values
an effort
of proteins
In
still
keeping with the theme
retaining
the usage
and
of this
Tonella
data to
(19)
et al.
from 5 different
existed
one or
two
data. Since
70%
Once the
of the
entire
E.
values and proteins
if significant
existed.
It
was
arbitrary Apl
to
separate
E.
of the
pi and
MW
set was
clear
that had
sequence
selected,
lines
coli proteome
used.
greater
Apl
another
cut-off ranges
(Apl
<
0.1; 0.3
the data into distinct sets
<
into
Apl
et al.
being covered
was
(18)
coli proteome.
in their
decided that the data
using the
was
same conditions.
study.
Doing
proteins
made about
so would make
level)
0.7; Apl
between Apl
that
>
could
0.7)
be
how
that had very small
it
possible
to
subsets
a small number of Apl subsets.
<
of similar size
the E.
decision had to be
between
seen
values.
set
well
The primary justification
for this
differences (at the dipeptide
necessary to break the data
though
yet
In addition, the fact that their data
promise
be
could
possible,
probably best to limit
studies on
values were gained
held
even
was
the same 2DE conditions it
coli genome
data
to separate the data so that
see
2DE
scale
would reduce as much noise as possible.
covered over
Apl
(14-18), it
noise as
(17 and 18). Both the Yan
be the only data
would
to ensure that the experimental
decided that
was
groups
large
70%
over
none of the groups used
from the Tonella (19) group
This in turn
it
of these groups
groups performed
The Tonella (19) group boasted
little
having a data set with as
as much robustness as possible
2DE data
structured
of
were chosen
in
These
order
compared with each
other.
There
subsets.
subsets
One
based
an answer
was
difficulty in deciding how to
possible approach was
on a
that
larger
to
separate
separate
the dataset into many smaller sized
number of Apl ranges.
gives a scaled
description
of what
the entire dataset into these three
On
is
one
hand
doing this might provide
happening at
each small
Apl
range
relative to adjacent
information
found in
each
Therefore,
<
0.1
total
at
the
Apl
ranges.
sequence
data
17906 total
60
amino acids or
information
description
The
about each
and
Apl
<
0.7
individual
subset consists of
We began
section.
relevance of the
The
It
of our
few
data
also
protein
is best
(dipeptides using
more
The Apl
robustness.
amino acids or
>
0.7
22412
subset consists of
can
be
including Apl,
seen
viewed as a pipeline as seen
the
findings.
proteins which comprise
in these Apl subsets,
our analysis with
becomes
50
in Appendix A.
in Figure 2 in the
(naive approach),
most simple method
(alphabets approach),
alphabets approach).
complicated, but more
a
and end with
Along this
the
path, the
interesting at the
same
exceptions).
naive approach
was
58
be
15531 total dipeptides. More
amino acids or
SWISS-2DPAGE Accession Number,
analytical process
a
22472 total
17848 total dipeptides. The Apl
15581 total
most complicated methods
quickly
protein sequences
did
to
not
<
0. 1
set
did
not provide
that individual amino acid
vary among the three data
using simply the
different between the three Apl
comparing the Apl
data
handling the
apparent
frequency characteristics
with
threaten the reliability
would
their way to more complicated methods
results.
way, there is a loss of
smaller number of sequences that would
proteins which comprise
<
proteins which comprise
time (with
by doing it this
the dataset had to be separated into subsets of sufficient
dipeptides. The 0.3
work
hand
other
level due to the
This, in turn,
set.
subset consists of
Methods
On the
subsets.
subset with
the 0.3
the Apl > 0.7 subset, respectively. No
can
<
be
subsets.
Apl
<
seen
0.7
significant
meaningful
frequencies in
In the end,
naive approach were
This
any
a given set of
no amino acid
found to be significantly
in Figures 3
subset and
and
4
the Apl
when
<
0. 1
subset
difference between the blue
and
yellow
frequencies
identical
when
between Apl
can
be
Figure 3
values and
for any individual
seen
Figure 4
and
in
more
To simplify the analysis, the
described in Table 2
reveal
any
6 (Charge
Figures 9
significant
alphabet
and
(Hydrophobic
approach
10 (Functional
alphabet
normally
requires
are
At this
that
they
point
show
no
very
trend
and
of
alphabet
Figures 1 1
increase
or
the
results
did
not
and
comparisons),
and
to that
four
Figures 5
calculated.
similar results
independently,
12
of the naive
decrease in Apl for any
be
focusing
denatured proteins, the only
that are close to
its
near or
not
structure
intact. However, for IEF,
detergents
are added prior
aspects of protein structure.
are expected
in the primary
of proteins
we
observing their biological function. To
interactions
each other
of the
distant
(IEF). The biological function
quaternary
significant
by analysis
the experimental conditions
consider
reagents such as urea and
or
obtained
(2-8), including ours (11)
regardless of
their three dimensional
disrupt any secondary, tertiary
acids side chains
is
pi
time.
by using the
Again,
pipeline.
previous pi prediction algorithms
for isoelectric
the best separation,
one amino acid at a
more meaningful results would
interested only in separating the proteins,
assure
than
8 (Chemical
comparisons),
it is instructive to
maintain
and
of a correlation
moving between the three datasets.
each amino acid
employed
that
alphabet
4. There is
dipeptide frequencies. All
neighbors.
the next stage in the
comparisons)
and
was expected
treat the pKa for
more
-
nearly
amino acids showed us that we
number of variables was reduced
comparisons), Figures 7
in Figures 3
individual
trends that could affect the way that
particular amino acid when
It
at
depth
The lack
as well.
compared,
frequency of these
the
needed to consider the problem
alphabets
are
amino acid; the values are also
to
occur
sequence.
to IEF to
In these
between
Thus
a
fully
amino
consideration of the effect of neighboring amino acids on their respective side chain pKA
values
may
prove valuable.
With
least
respect
significant alphabet
results.
However,
first. At first
Figures 13
significant.
the change in
The
a
0. 1%
amino
normal
time (or at
acids) it
and
16. There
will
later be
The
It
value.
would not
other
by
be
times in
wise
To
to rely
The
<
the hydrophobic
be
least
alphabet.
seen
A dipeptide
Comparisons
be
in Figures 23
and
threshold
Apl
seen
frequency
in
in the 100
using
a
subsets
24. Delta % is
at
least
22412
in Figures 15
dipeptide
when
of the
is going
dipeptide
not occur
values
interesting results
subset
which contained
results of this can
extreme
through all of the 400
was run with a
that have Delta %
Apl
on such
negotiate
0.1 dataset,
the next
ranges would seem
another
comparison with some of the alphabet
that showed the
alphabet can
400
words, if a dipeptide did
analysis.
still exist extreme outliers
alphabet
hydrophobic
for
being compared to
frequency was vanishingly small.
alphabet, the same analysis
was not used
and
The Delta %
results.
that most of the dipeptides that fell into these
least 22 times in the Apl
reanalyzed
approach was
that are in the 300
be discussed
alphabet will
subset
from the
dipeptide
significant alphabet
Apl
subset and multiple
for dipeptides 0.1%. In
of the
one
will advance
very promising
redesign of a pi prediction algorithm.
dipeptides in the
occurrence
show some
whose overall
very high Delta %
frequencies to
14
values
that occurs only once in one Apl
to have
to the most
frequency from
problem was
dipeptides that
ranges were
results
and
Therefore, Delta %
subset.
very
dipeptide
the analysis using the normal amino acid
glance
value represents
Apl
to each alphabet that was used the discussion
range which
analyses.
dipeptide
using the
shown
in the
yellow
bars
and
1.85
was seen
is very
The
negligible
in any
of
Delta %
dipeptide
(negatively
also
0.7
0. 1
to the Apl
large (-18.9%) in the
subset
shown
a
subset
slightly
large Delta %
dipeptide. This
>
0.7
subset
value
along
is
alphabet
Functional
Delta %
followed
bars) is very low in
was not apparent
large
some
value of
all
in any
Figures 2 1
for dipeptide
dipeptides. The AA
by negatively charged amino acid;
a
Delta %
Apl
<
0. 1
three Apl
large
of the
subsets.
of
-31.3%
subset and
see
going from the
of this
What
the 0.3
we would
for
dipeptides using the Charge
significantly large frequencies
<
like to
see
is
a particular
large Delta
for dipeptides, the Functional
22 representing the
alphabet show a collection of dipeptides
Apl
alphabet.
most significant results will combine
and
<
AA dipeptide (as
frequency of occurrence
frequency of occurrence value
considered next.
values and
for
frequency of occurrence
value accompanied with a
with a
%
(Figure 1 8). The Delta % for the AA dipeptide is
other comparison of the
Staying with the theme that the
%
range
the alphabet codes) had
(Figure 17). However, the
in the blue
more significant results
into the 30+
charged amino acid
of all
than a Delta
more
dipeptides).
values reached
Table 2 for definitions
<
the 4
4 dipeptides (no
each of the
charge alphabet showed
anaylsis.
Apl
in
analysis
using the
that have both significantly large
of occurrence:
AA, AH, HA, HP, CP,
PH, PP.
It
on
was
important to
the complete
dipeptide
refer
back to the
amino acid alphabet.
outliers:
Figures 1 5
KY, YS (Figure 15)
dipeptides to the Functional
analysis that was
and
alphabet gives
and
16
done using dipeptides based
point out a
few
extreme
EE, NN, YT (Figure 16). Converting these
the dipeptides:
CP,
PP
and
AA, PP, PP
These three different dipeptides
respectively.
done using the Functional
analysis
large that it
19
and
was not possible
20. Instead only the
dipeptides
are
The
labeled
value
denatured, depends
of
1 80
calculating
algorithm will
>
This
be
will
data
the
set
to
their
respective
by the
in Figures
outlier
Delta %
values.
a more accurate
the
protein
the
support
is
fully
nearest neighbors of that
dipeptides that have been identified from this
possible
for calculating
pi
an empirical process
fractionally to
to
from
adjust
the algorithms for
amino acid sequence
the pKA
whereby the pKA
see which changes
actual and predicted pi values
for the two
lead to
outlier
(11)
values used
values used
a
data
in
better
sets
(0.3
<
Apl
0.7).
of the
advancements
work with and rerun
scope of the
combinations
effects of adjacent amino acids on
be
modified
If the improvement
any many future
algorithm
sufficiently
to display. Particular
even when
chain,
proteins, it may be
to include the
between
Apl
coli
was
(11, 19). These data clearly
the microenvironment created
Our
pi values.
modified
and
an amino acid side
E.
calculations.
0.7;
on
for
data that
findings is that it may lead to
annotated
in the
<
were chosen
from the
22).
and
created
of each column with
extreme outlier
be
correlation
top
Using the
could
the
density values
the
extreme outliers
the possible dipeptide
than currently existing methods
idea that the pKA
amino acid.
dipeptides
with
display all
significance of these
calculation of pi
study
on
to
map back to
(Figures 21
alphabet
Using the Chemical alphabet
all
E.
coli
accuracy
that
the
could
of the pi calculation proves
be
made.
to
analysis
The first
compare
could
to the data
proteome, further data that are available
SWISS-2DPAGE database
could
be
used
to
at
to be worthy there
be to build
shown
the
larger
here. Beyond
ExPASy Server's
perform similar analyses on
a
many
other
microbial proteomes.
Another step
would
be to
port the analysis over to
proteomes that contain much more post-translational modifications.
be done in terms
in
doing so
higher
of predicting or
it may lead to
categorizing these
A lot
eukaryotic
would
have to
post-translational modifications
an even more powerful approach
organisms as well.
lower
to better predicting
pi
in
but
Conclusions
A dataset of E. coli proteins was collected and formatted to study the discrepancy that exists between experimental isoelectric point and predicted isoelectric point (ΔpI). This dataset was then split into three parts depending on the magnitude of ΔpI for each protein. Several multi-layered, sequential approaches were taken in an attempt to get a better understanding of what might be causing the varying ΔpI. Each of these stages represented a different part of a pipeline involved in reformatting the data in a simpler way and comparing each of the three ΔpI subsets to one another. The pipeline consisted of a naive approach (considering only the individual amino acid frequencies), followed by the application of four different alphabets used to represent sequences by grouping similar amino acids based on their charge, chemical, functional, and hydrophobic properties. The final step in the pipeline involved investigating the dipeptides of all of these sequences, where the data were analyzed using both the 20 amino acid alphabet and the simplified alphabet groupings. This approach yielded the most meaningful results, showing that certain dipeptide sequences occur in greatly different frequency between proteins in the different ΔpI subsets. Using a short list of these dipeptide findings, the next step will involve modification of our existing pI prediction algorithm to include the effect of adjacent amino acids on side chain pKa values. Concentrating on only the most extreme cases, where a dipeptide frequency is greatly different from one subset to the next, should show that the results can be used to better predict pI and should result in a more accurate pI prediction value. Once the prediction algorithm is improved, future studies will attempt to concentrate on post-translational modifications and how pI prediction can be altered by them. In addition, similar analyses will be extended to other prokaryotic organisms, and eventually to eukaryotic organisms.
43
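The proposed next step, altering side chain pKa values inside a pI prediction algorithm, can be illustrated with a minimal sketch. It is not the prediction algorithm used in this work: the subroutine names `net_charge` and `predict_pi`, the example peptide, and every pKa value below are assumptions chosen only for illustration. The sketch sums Henderson-Hasselbalch charges over the ionizable groups and finds the pH of zero net charge by bisection, then shows that raising one side chain pKa shifts the estimate.

```perl
use strict;
use warnings;

# Illustrative pKa values only (assumed, NOT the values used in the thesis).
my %PKA_POS = (Nterm => 9.0, K => 10.5, R => 12.5, H => 6.0);           # +1 when protonated
my %PKA_NEG = (Cterm => 2.0, D => 3.9, E => 4.1, C => 8.3, Y => 10.1);  # -1 when deprotonated

# Net charge of a sequence at a given pH, one Henderson-Hasselbalch term per group.
sub net_charge {
    my ($seq, $ph, $pos, $neg) = @_;
    my %n = (Nterm => 1, Cterm => 1);
    $n{$_}++ for split //, uc $seq;
    my $charge = 0;
    for my $g (keys %$pos) {
        $charge += ($n{$g} || 0) / (1 + 10 ** ($ph - $pos->{$g}));
    }
    for my $g (keys %$neg) {
        $charge -= ($n{$g} || 0) / (1 + 10 ** ($neg->{$g} - $ph));
    }
    return $charge;
}

# Bisection on pH: the net charge only decreases as pH rises, so the
# zero crossing between pH 0 and pH 14 is the estimated pI.
sub predict_pi {
    my ($seq, $pos, $neg) = @_;
    $pos ||= \%PKA_POS;
    $neg ||= \%PKA_NEG;
    my ($lo, $hi) = (0, 14);
    while ($hi - $lo > 0.001) {
        my $mid = ($lo + $hi) / 2;
        if (net_charge($seq, $mid, $pos, $neg) > 0) { $lo = $mid } else { $hi = $mid }
    }
    return ($lo + $hi) / 2;
}

my $seq = 'MDDEEK';                      # made-up acidic peptide
my %shifted = (%PKA_NEG, D => 4.5);      # hypothetical altered pKa for aspartate
printf "pI with default pKa values: %.2f\n", predict_pi($seq);
printf "pI with altered D pKa:      %.2f\n", predict_pi($seq, \%PKA_POS, \%shifted);
```

Passing the pKa tables as arguments keeps the charge model separate from the values, so a context-dependent adjustment (for example one triggered by a particular dipeptide) could be tested without touching the solver.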
References
1. Fey, S.J., Larsen, P.M. "2D or not 2D." Current Opinion in Chemical Biology 5: 26-33 (2001).
2. Cargile, B.J., Talley, D.L., Stephenson, J.L. "Immobilized pH gradients as a first dimension in shotgun proteomics and analysis of the accuracy of pI predictability of peptides." Electrophoresis 25: 936-945 (2004).
3. Patrickios, C.S., Yamasaki, E.N. "Polypeptide Amino Acid Composition and Isoelectric point." Analytical Biochemistry 231: 82-91 (1995).
4. Ribeiro, J.M., Sillero, A. "An algorithm for the computer calculation of the coefficients of a polynomial that allows determination of isoelectric points of proteins and other macromolecules." Comput. Biol. Med. 20: 235-242 (1990).
5. Ribeiro, J.M., Sillero, A. "A program to calculate the isoelectric point of macromolecules." Comput. Biol. Med. 21: 131-141 (1991).
6. Ribeiro, J.M., Ruiz, A., Sillero, M.A., Sillero, A. "Theoretical isoelectric points and electric charges of mutated human hemoglobin subunits." Clin. Chim. Acta. 190: 189-197 (1990).
7. Sillero, A., Ribeiro, J.M. "Isoelectric points of proteins: theoretical determination." Analytical Biochemistry 179: 319-325 (1989).
8. Righetti, P.G., Caravaggio, T. "Isoelectric points and molecular weights of proteins." Journal of Chromatography 127: 1-28 (1976).
9. Berg, J., Tymoczko, J., Stryer, L. Biochemistry. New York: W. H. Freeman and Co; 2002.
10. "Image: Amino acids 2.png" Wikipedia: The Free Encyclopedia. Found at http://upload.wikimedia.org/wikipedia/en/c/c5/Amino_acids_2.png
11. Zapoticnyj, J., Conte, M.C., Craig, P.A. "Simulation of 2D Gel Electrophoresis." http://www.rit.edu/~pac8612/2DE/2D_Sim.html
12. Bioperl. Found at http://bioperl.org
13. "SWISS-2DPAGE Two-dimensional polyacrylamide gel electrophoresis database." Found at http://us.expasy.org/ch2d/
14. Phillips, T.A., Bloch, P.L., Neidhardt, F.C. "Protein identifications on O'Farrell two-dimensional gels: locations of 55 additional Escherichia coli proteins." J. Bacteriol. 144: 1024-1033 (1980).
15. Pasquali, C., Frutiger, S., Wilkins, M.R., Hughes, G.J., Appel, R.D., Bairoch, A., Schaller, D., Sanchez, J.-C., Hochstrasser, D.F. "Two-dimensional gel electrophoresis of Escherichia coli homogenates: the Escherichia coli SWISS-2DPAGE database." Electrophoresis 17: 547-555 (1996).
16. Vanbogelen, R.A., Abshire, K.Z., Pertsemlidis, A., Clark, R.L., Neidhardt, F.C. "Gene-protein database of Escherichia coli K-12, edition 6" (In) Neidhardt et al. (eds.) Escherichia coli and Salmonella: Cellular and Molecular Biology (2nd ed.), pp. 2067-2117, ASM Press, Washington DC (1996).
17. Tonella, L., Hoogland, C., Binz, P.-A., Appel, R.D., Hochstrasser, D.F., Sanchez, J.-C. "New perspectives in the Escherichia coli proteome investigation." Proteomics 1: 409-423 (2001).
18. Yan, J.X., Devenish, A.T., Wait, R., Stone, T., Lewis, S., Fowler, S. "Fluorescence 2-D difference gel electrophoresis and mass spectrometry based proteomic analysis of E. coli." Proteomics 2: 1682-1698 (2002).
19. "Compute pI/Mw for Swiss-Prot/TrEMBL entries or a user-entered sequence." Found at http://us.expasy.org/tools/pi_tool.html
20. Bjellqvist, B., Hughes, G., Pasquali, C., Paquet, N., Ravier, F., Sanchez, J.-C., et al. "The focusing positions of polypeptides in immobilized pH gradients can be predicted from their amino acid sequences." Electrophoresis 14: 1023-1031 (1993).
21. "Get protein list for a reference map." Found at http://www.expasy.org/cgi-bin/get-ch2d-table.pl
Appendix A
The sequences included in each of the three ΔpI subsets that were used are listed here. The ΔpI subsets are ΔpI values less than 0.1 ("ΔpI < 0.1"), ΔpI values greater than 0.3, but less than 0.7 ("0.3 < ΔpI < 0.7"), and ΔpI values greater than 0.7 ("ΔpI > 0.7"), displayed below in that order. Included is the gene name, protein description, SWISS-2DPAGE Accession Number and ΔpI (experimental pI - predicted pI).

ΔpI < 0.1
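The subset assignment described above can be sketched as a small Perl subroutine. This is an illustration, not the code used to build the tables: the name `delta_pi_subset` and the example pI values are hypothetical. The sketch assumes that the magnitude of ΔpI decides the subset (the tables list both negative and positive values in each subset) and that values landing exactly on 0.1, 0.3 or 0.7 after rounding to two decimals are kept in the subset shown in the comments. Proteins with a magnitude between 0.1 and 0.3 belong to none of the three subsets.

```perl
use strict;
use warnings;

# Assign a protein to a delta-pI subset from its experimental and predicted pI.
# Assumption: subsets are defined on |delta pI| rounded to two decimals.
sub delta_pi_subset {
    my ($experimental_pi, $predicted_pi) = @_;
    my $delta     = sprintf("%.2f", $experimental_pi - $predicted_pi);
    my $magnitude = abs($delta);
    return 'dpI < 0.1'       if $magnitude <= 0.1;                       # boundary 0.1 kept here
    return '0.3 < dpI < 0.7' if $magnitude >= 0.3 && $magnitude <= 0.7;  # boundaries kept here
    return 'dpI > 0.7'       if $magnitude > 0.7;
    return 'unassigned';                                                 # 0.1 < |dpI| < 0.3
}

# Made-up (experimental, predicted) pairs for illustration.
for my $pair ([5.10, 5.05], [5.00, 5.46], [5.00, 7.03], [5.20, 5.00]) {
    my ($exp, $pred) = @$pair;
    printf "%.2f\t%.2f\t%s\n", $exp, $pred, delta_pi_subset($exp, $pred);
}
```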
SWISS-
Gene
Name
ACCB
2DPAGE
Access #
Protein Description
Biotin
Apl
0.1
(BCCP)
) (Isocitrase) (Isocitratase) (ICL)
Aconitate hydratase 2 (EC 4.2.1
(Citrate hydro-lyase 2) (Aconitase 2)
Alkyl hydroperoxide reductase subunit C (EC 1.6.4.-) (Alkyl
P02905
AHPC
hydroperoxide
P26427
0.01
AMPC
Beta-lactamase (EC
P00811
0.08
ARGF
Ornithine
ACEA
ACNB
carboxyl carrier protein of acetyl-CoA carboxylase
Isocitrate lyase (EC 4.1
.3.1
.3)
C22)
3.5.2.6) (Cephalosporinase)
reductase protein
P05313
-0.04
P36683
-0.07
0
P22767
-0.03
ATPD
(OTCase-2)
Argininosuccinate synthase (EC 6.3.4.5) (Citrulline-aspartate ligase)
3-dehydroquinate dehydratase (EC 4.2.1.10) (3-dehydroquinase) (Type
I DHQase)
ATP synthase alpha chain (EC 3.6.3.14)
ATP synthase beta chain (EC 3.6.3.14)
ATP synthase beta chain (EC 3.6.3.14)
P06960
CHEY
Chemotaxis
P06143
0.05
CLPB
ClpB
P03815
-0.02
CYSM
Cysteine
sulfhydrylase B)
4.1.2.4)
(Phosphodeoxyriboaldolase) (Deoxyriboaldolase) (DERA)
Chaperone protein dnaK (Heat shock protein 70) (Heat shock 70 kDa
protein) (HSP70)
Chaperone protein dnaK (Heat shock protein 70) (Heat shock 70 kDa
protein) (HSP70)
DNA protection during starvation protein
(2-phosphoEnolase (EC 4.2.1
1) (2-phosphoglycerate dehydratase)
D-glycerate hydro-lyase)
Malonyl CoA-acyl carrier protein transacylase (EC 2.3.1
(MCT)
P16703
-0.05
FLIC
Flagellin
FUSA
Elongation factor G
FUSA
Elongation factor G
GALM
Aldose 1-epimerase (EC 5.1
GLNK
Nitrogen regulatory protein P-II 2
2,3-bisphosphoglycerate-independent
ARGG
AROD
ATPA
ATPD
carbamoyltransferase chain
protein
protein cheY
(Heat
synthase
shock protein
B (EC 2.5.1
Deoxyribose-phosphate
DEOC
DNAK
DNAK
DPS
F (EC 2.1 .3.3)
F84.1)
.47)
aldolase
(O-acetylserine
P05194
P00822
0.05
-0.01
P00824
0.02
P00824
0.03
(EC
P00882
-0.1
P04475
0.08
P04475
0.1
P27430
-0.05
P08324
0.05
P25715
0.04
P04949
0.07
.1
ENO
FABD
GPMI
.39)
P02996
(EF-G)
(EF-G)
.3.3)
5.4.2.1) (Phosphoglyceromutase)
(Mutarotase)
-0.08
P40681
-0.01
P38504
phosphoglycerate mutase
-0.1
P02996
-0.1
(EC
P37689
-0.04
GROL
GROL
GROS
ICD
KATG
LIVJ
chaperonin
(Protein
Leu/Ile/Val-binding
(LIV-BP)
Leu/Ile/Val-binding protein (LIV-BP)
Leucine-specific binding protein (LS-BP) (L-BP)
S-ribosylhomocysteinase (EC 3.13.1.-) (Autoinducer-2 production
protein luxS) (AI-2 synthesis protein)
Methionine aminopeptidase (EC 3.4.1 1
(MAP) (Peptidase M)
protein
LIVJ
LIVK
LUXS
MAP
MIND
NADE
PGK
PNP
0.1
P05380
0.03
P08200
-0.05
P13029
-0.03
P02917
0.01
P02917
0.07
P04816
-0.06
-0.08
-0.07
adenosyltransferase)
Septum site-determining protein minD (Cell division inhibitor minD)
NH(3)-dependent NAD(+) synthetase (EC 6.3.1.5) (Nitrogen-regulatory
P04384
-0.02
protein)
Phosphoglycerate kinase (EC 2.7.2.3)
Polyribonucleotide nucleotidyltransferase (EC
P18843
-0.06
P11665
0.03
P05055
-0.02
P17288
-0.01
P09029
-0.1
(EC 2.5.1
synthetase
phosphorylase) (PNPase)
Inorganic pyrophosphatase (EC
.6)
(Methionine
2.7.7.8)
PURK
Phosphoribosylaminoimidazole carboxylase ATPase
4.1.1.21) (AIR carboxylase) (AIRC)
RIBH
6,7-dimethyl-8-ribityllumazine
RPLL
50S
ribosomal protein
DNA-directed RNA
L7/L12
synthase
RPSA
RPSA
30S
SERC
Phosphoserine
SSB
Single-strand
TALB
Transaldolase B (EC 2.2.1
TIG
Trigger factor
.9)
(EC
0.01
phospho-
subunit
(EC
(DMRL synthase)
(L8)
polymerase alpha chain
subunit)
30S ribosomal
2.7.7.6)
(RNAP
P61714
-0.02
P02392
0.09
P00574
0.02
alpha
protein
S1
P02349
0.08
ribosomal protein
S1
P02349
0.08
binding
protein
(EC 2.6.1
(PSAT)
(SSB) (Helix-destabilizing protein)
aminotransferase
.52)
.2)
(TF)
(TF)
TIG
Trigger factor
TRPA
Tryptophan
TSF
Elongation factor Ts
TUFA
Elongation factor Tu
USPA
Universal
YCII
Protein
synthase alpha chain
(EC 4.2.1
.20)
(EF-Ts)
(EF-Tu) (P-43)
stress protein
A
P23721
0
P02339
-0.05
P30148
-0.05
P22257
0.02
P22257
0.01
P00928
-0.02
P02997
-0.05
P02990
0.01
P28242
0.04
yciI
P31070
0.01
yfiD
P33633
-0.04
P36656
0.02
YFID
Protein
YJDC
Putative HTH-type transcriptional
pl<
(EC 2.5.1
P18197
(Polynucleotide
3.6.1.1) (Pyrophosphate
hydrolase) (PPase)
<
P06139
P07906
PPA
RPOA
-0.02
P45578
.18)
S-adenosylmethionine
METK
0.3
P06139
Cpn60) (groEL protein)
60 kDa chaperonin (Protein Cpn60) (groEL protein)
10 kDa chaperonin (Protein Cpn10) (groES protein)
Isocitrate dehydrogenase [NADP] (EC 1.1.1.42) (Oxalosuccinate
decarboxylase)
Peroxidase/catalase HPI (EC 1.11.1.6) (Catalase-peroxidase)
(Hydroperoxidase I)
60 kDa
regulator yjdC
0.7
SWISS-
2DPAGE
Gene
Access #
Name
Protein Description
ACCB
Biotin
P02905
0.37
ACKA
Acetate kinase (EC
P15046
-0.46
ADK
Adenylate kinase
P05082
-0.67
(BCCP)
2.7.2.1) (Acetokinase)
(EC 2.7.4.3) (ATP-AMP transphosphorylase)
carboxyl carrier protein of acetyl-CoA carboxylase
A pi
ADK
P05082
-0.68
AHPF
Adenylate kinase (EC 2.7.4.3) (ATP-AMP transphosphorylase)
Alkyl hydroperoxide reductase subunit F (EC 1.6.4.-) (Alkyl hydroperoxide
reductase F52A protein)
P35340
-0.41
ALDA
Aldehyde dehydrogenase A (EC
P25553
Aldehyde dehydrogenase A (EC
P25553
ALDA
ARGT
1.2.1.22) (Lactaldehyde dehydrogenase)
1.2.1.22) (Lactaldehyde dehydrogenase)
Lysine-arginine-ornithine-binding periplasmic protein (LAO-binding protein)
ATP synthase epsilon chain (EC 3.6.3.14) (ATP synthase F1 sector epsilon
ATPC
subunit)
ATPD
ATP synthase beta
CLPS
ATP-dependent Clp protease adaptor protein clpS
Dihydrodipicolinate reductase (EC 1
.26) (DHPR)
Chaperone protein dnaK (Heat shock protein 70) (Heat
chain
(EC
3.6.3.14)
0.42
0.32
P09551
-0.48
P00832
-0.33
P00824
0.02
P75832
0.44
P04036
-0.54
P04475
-0.37
P08324
-0.67
ENO
hydro-lyase)
1) (2-phosphoglycerate dehydratase) (2-phospho-D-glycerate hydro-lyase)
Enoyl-[acyl-carrier-protein] reductase [NADH] (EC 1.3.1.9) (NADH-dependent
P08324
-0.41
FABI
enoyl-ACP
P29132
-0.45
FLIC
Flagellin
GLNA
Glutamine
GPT
Xanthine-guanine
GST
Glutathione S-transferase (EC 2.5.1.18)
HISJ
HISJ
Histidine-binding
Histidine-binding
ILVH
Acetolactate synthase isozyme III small subunit (EC 2.2.1.6)
(Acetohydroxy-acid synthase III small subunit) (ALS-III)
DAPB
DNAK
.3.1
70 kDa protein)
(HSP70)
Enolase (EC 4.2.1
ENO
shock
.1
1) (2-phosphoglycerate dehydratase)
(2-phospho-D-
glycerate
Enolase (EC 4.2.1
.1
reductase)
synthetase
(EC 6.3.1
.2)
(Glutamate-ammonia ligase)
phosphoribosyltransferase (EC
2.4.2.22) (XGPRT)
(HBP)
(HBP)
periplasmic protein
periplasmic protein
P04949
0.51
P06711
-0.38
P00501
-0.51
P39100
-0.37
P39182
-0.52
P39182
-0.37
P00894
-0.54
P17579
-0.62
(AHAS-III)
2-dehydro-3-deoxyphosphooctonate aldolase (EC 2.5.1.55) (Phospho-2-dehydro-3-deoxyoctonate aldolase) (3-deoxy-D-manno-octulosonic acid 8-phosphate synthetase) (KDO-8-phosphate synthetase) (KDO 8-P synthase)
KDSA
MALE
(KDOPS)
Leu/Ile/Val-binding protein (LIV-BP)
Leu/Ile/Val-binding protein (LIV-BP)
Leucine-specific binding protein (LS-BP) (L-BP)
Leucine-specific binding protein (LS-BP) (L-BP)
Maltose-binding periplasmic protein (Maltodextrin-binding
Maltose-binding periplasmic protein (Maltodextrin-binding
MDH
Malate dehydrogenase (EC 1
MDOG
Glucans biosynthesis
MGLB
D-galactose-binding periplasmic
binding protein) (GGBP)
LIVJ
LIVJ
LIVK
LIVK
MALE
protein
.1
.1
(Nucleoside-2-P
-0.45
P02917
-0.57
P04816
-0.54
P04816
-0.37
P02928
-0.41
P02928
-0.62
P61889
-0.47
P33136
-0.64
P02927
-0.67
P24233
-0.56
.5.1
P38489
-0.57
.5.1
P38489
-0.65
P16921
-0.35
P08312
-0.36
P23861
-0.32
P20752
-0.59
protein)
protein)
(MMBP)
(MMBP)
.37)
G
protein
Nucleoside diphosphate kinase (EC
NDK
P02917
(GBP) (D-galactose/ D-glucose
2.7.4.6) (NDK) (NDP kinase)
kinase)
NFNB
Oxygen-insensitive NAD(P)H nitroreductase (EC 1 .-.-.-) (FMN-dependent
.34)
nitroreductase) (Dihydropteridine reductase) (EC 1
Oxygen-insensitive NAD(P)H nitroreductase (EC 1 .-.-.-) (FMN-dependent
.34)
nitroreductase) (Dihydropteridine reductase) (EC 1
NUSG
Transcription
NFNB
antitermination protein nusG
Phenylalanyl-tRNA
PHES
POTD
synthetase alpha chain
6.1.1.20)
(Phenylalanine-
tRNA ligase alpha chain)
(PheRS)
Spermidine/putrescine-binding periplasmic
Peptidyl-prolyl cis-trans isomerase A (EC
PPIA
(EC
(Cyclophilin
A)
protein
(SPBP)
5.2.1.8) (PPIase A) (Rotamase A)
PYRI
Aspartate
RTCB
Protein
rtcB
P46850
-0.6
RTCB
Protein
rtcB
P46850
-0.57
SBP
Sulfate-binding
protein
P06997
-0.59
SERC
Phosphoserine
aminotransferase
SODB
Superoxide dismutase
carbamoyltransferase
regulatory
(Sulfate starvation-induced
[Fe] (EC
(EC 2.6.1
P00478
chain
.52)
protein
2) (SSI2)
(PSAT)
1.15.1.1)
-0.48
P23721
-0.45
P09157
-0.41
SSPB
Stringent
TOLB
TolB
TRXA
Thioredoxin 1
P00274
0.3
UDP
Uridine
P12758
-0.47
P12758
-0.42
starvation protein
B
protein
(TRX1) (TRX)
phosphorylase (EC 2.4.2.3) (UrdPase)
(UPase)
phosphorylase (EC 2.4.2.3) (UrdPase)
(UPase)
UDP
Uridine
YAET
Unknown
YCEI
Protein
YCGK
YGIN
from 2D-page
P25663
0.64
P19935
-0.57
P39170
0.39
yceI
P37904
-0.56
Protein
ycgK
P76002
-0.57
Protein
ygiN
P40718
-0.38
YHGI
Protein
yhgI
P46847
0.42
ZNUA
High-affinity zinc
High-affinity zinc
ZNUA
pi >
protein
spots
M62/M63/03/09/T35
uptake system protein znuA
P39172
-0.3
uptake system protein znuA
P39172
-0.37
0.7
SWISS-
Gene
Name
Protein Description
ARGD
Acetylornithine/succinyldiaminopimelate aminotransferase (EC 2.6.1.11) (EC
2.6.1 .17) (ACOAT) (Succinyldiaminopimelate transferase) (DapATase)
ARTI
Arginine-binding
2DPAGE
Access #
Apl
P18335
-0.94
P30859
-0.91
P11096
-1.09
P11096
-1.07
P11096
-0.92
P16700
-1.75
P09376
-0.9
P23847
-0.85
P23847
-0.75
P27430
-0.78
P08324
1.54
P08324
1.31
FLIY
during starvation protein
Enolase (EC 4.2.1.11) (2-phosphoglycerate dehydratase) (2-phospho-D-glycerate hydro-lyase)
Enolase (EC 4.2.1.11) (2-phosphoglycerate dehydratase) (2-phospho-D-glycerate hydro-lyase)
Cystine-binding periplasmic protein (CBP) (fliY protein) (Sulfate starvation-induced protein 7) (SSI7)
P39174
-1.35
GAPA
Glyceraldehyde-3-phosphate dehydrogenase A (EC 1
.12)
P06977
-2.03
GAPA
Glyceraldehyde-3-phosphate dehydrogenase A (EC 1
.12)
P06977
-1.32
GAPA
Glyceraldehyde-3-phosphate dehydrogenase A (EC 1
P06977
-0.95
GLNH
Glutamine-binding
P10344
-1.64
Cysteine
(Thiol)-lyase
A)
5)
(O-acetylserine sulfhydrylase A) (O(CSase A) (Sulfate starvation-induced protein 5)
A (EC 2.5.1
synthase
acetylserine
(Thiol)-lyase
A)
.47)
(SSI5)
Cysteine
2.5.1.47) (O-acetylserine sulfhydrylase A) (O(Thiol)-lyase A) (CSase A) (Sulfate starvation-induced protein 5)
A (EC
synthase
acetylserine
CYSK
(O-acetylserine sulfhydrylase A) (O.47)
(CSase A) (Sulfate starvation-induced protein
(SSI5)
Cysteine
CYSK
1
A (EC 2.5.1
synthase
acetylserine
CYSK
periplasmic protein
CYSP
(SSI5)
Thiosulfate-binding
DEGP
Protease do (EC 3.4.21
DPPA
Periplasmic dipeptide transport
protein
DPPA
Periplasmic dipeptide transport
protein
DPS
DNA
protein
.-)
(Dipeptide-binding protein) (DBP)
(Dipeptide-binding protein) (DBP)
protection
.1
ENO
ENO
periplasmic protein
(GlnBP)
.2.1
.2.1
.2.
1
.
(GAPDH-A)
(GAPDH-A)
1 2) (GAPDH-A)
GLTI
Glutamate/aspartate
P37902
-1.01
HDEB
Protein hdeB (10K-L protein)
Inhibitor of vertebrate lysozyme
P26605
-0.94
P45502
-1.37
P61316
-1.19
MANX
Outer-membrane lipoprotein carrier protein (P20)
PTS system, mannose-specific IIAB component (EIIAB-Man) (Mannose-permease IIAB component) (Phosphotransferase enzyme II, AB component)
(EC 2.7.1.69) (EIII-Man)
P08186
-0.77
MDH
Malate dehydrogenase (EC
P61889
1.14
MDOG
Glucans biosynthesis
protein
G
P33136
0.89
MDOG
Glucans biosynthesis
protein
G
P33136
-0.87
P37329
-1.71
P38489
-1.03
IVY
LOLA
MODA
Molybdate-binding
periplasmic
binding
protein
1.1.1.37)
periplasmic protein
NFNB
Oxygen-insensitive NAD(P)H nitroreductase
nitroreductase) (Dihydropteridine reductase)
Oxygen-insensitive NAD(P)H nitroreductase
nitroreductase) (Dihydropteridine reductase)
NLPD
Lipoprotein
NUSG
Transcription
OMPA
Outer
OPPA
OPPA
Periplasmic oligopeptide-binding
Periplasmic oligopeptide-binding
PANC
Pantoate-beta-alanine ligase (EC
(Pantoate activating enzyme)
NFNB
PSTS
(EC 1 .-.-.-) (FMN-dependent
(EC 1.5.1.34)
(EC 1 .-.-.-) (FMN-dependent
(EC 1.5.1.34)
nlpD
antitermination protein nusG
membrane protein
A (Outer
membrane protein
II*)
P38489
-0.8
P33648
-0.87
P16921
-1.31
P02934
-1.09
protein
P23843
-1.26
protein
P23843
-0.74
P31663
-0.93
P06128
-1.86
6.3.2.1) (Pantothenate
synthetase)
PYRD
(PBP)
Phosphate-binding
Dihydroorotate dehydrogenase (EC 1.3.3.1) (Dihydroorotate
(DHOdehase) (DHODase) (DHOD)
P05021
-0.82
RPLA
50S
ribosomal protein
L1
P02384
-1.74
RPLI
50S
ribosomal protein
L9
P02418
-1.41
RPLY
50S
ribosomal protein
L25
P02426
0.71
RPME2
50S
ribosomal protein
L31 type B-1
P71302
-1.21
SUCD
Succinyl-CoA
synthetase alpha chain
P07459
-0.99
SUCD
Succinyl-CoA
synthetase alpha chain
P07459
-0.86
P22783
-1.17
periplasmic protein
Inositol-1-monophosphatase (EC
SUHB
phosphatase)
(EC 6.2.1.5) (SCS-alpha)
(EC 6.2.1.5) (SCS-alpha)
3.1.3.25) (IMPase)
(Inositol-1-
3.1.3.25) (IMPase)
(Inositol-1-
(I-1-Pase)
Inositol-1-monophosphatase (EC
SUHB
oxidase)
TPIA
phosphatase) (I-1-Pase)
Triosephosphate isomerase (EC
TRPB
Tryptophan
YGFZ
Unknown
protein
YGGX
UPF0269
protein yggX
YLIB
Putative
YRBC
Protein
synthase
binding
beta
chain
P22783
-1.06
5.3.1.1) (TIM)
P04790
-0.88
(EC 4.2.1
P00932
-0.97
from 2D-page (Spot
.20)
PR51)
protein yliB
yrbC
P39179
0.91
P52065
-0.78
P75797
-1.19
P45390
0.73
Appendix B
The relevant Perl code for any of the analysis described in the Materials and Methods section. Each Perl program that was used is listed in alphabetical order by the name of the particular program. At the top of each program are comments that explain exactly what each program does, how to run the program (command line arguments), and the output of the program.
aacounts.pl:

#!/bin/perl
use strict;
use Bio::Seq;
use Bio::SeqIO;

# Matthew Conte
#
# This script counts the number of each amino acid in a sequence from a FASTA
# file and determines the frequency of each.
# Output is to FASTAfilename.aacounts
# Usage: perl aacounts.pl file.FASTA file2.FASTA ...

#initialize count variables
my $n_A = 0; my $n_R = 0; my $n_N = 0; my $n_D = 0; my $n_C = 0; my $n_E = 0;
my $n_Q = 0; my $n_G = 0; my $n_H = 0; my $n_I = 0; my $n_L = 0; my $n_K = 0;
my $n_M = 0; my $n_F = 0; my $n_P = 0; my $n_S = 0; my $n_T = 0; my $n_W = 0;
my $n_Y = 0; my $n_V = 0;
my $n_AA_Total = 0;

#initialize frequency variables
my $f_A = 0; my $f_R = 0; my $f_N = 0; my $f_D = 0; my $f_C = 0; my $f_E = 0;
my $f_Q = 0; my $f_G = 0; my $f_H = 0; my $f_I = 0; my $f_L = 0; my $f_K = 0;
my $f_M = 0; my $f_F = 0; my $f_P = 0; my $f_S = 0; my $f_T = 0; my $f_W = 0;
my $f_Y = 0; my $f_V = 0;

foreach my $file (@ARGV) {
    print STDERR "Reading input file $file... \n";

    #open the FASTA file
    my $FASTA_in = Bio::SeqIO->new(-file => $file);

    #get the file's basename
    $file =~ s/\.seq$//g;

    #open the .aacounts file for writing
    open(AACOUNTS, ">$file.aacounts") or die "Cannot write $file.aacounts: $!";

    #for each sequence in the FASTA file..
    while (my $FASTA_seq = $FASTA_in->next_seq()) {

        #reset variables
        $n_A = 0; $n_R = 0; $n_N = 0; $n_D = 0; $n_C = 0; $n_E = 0; $n_Q = 0;
        $n_G = 0; $n_H = 0; $n_I = 0; $n_L = 0; $n_K = 0; $n_M = 0; $n_F = 0;
        $n_P = 0; $n_S = 0; $n_T = 0; $n_W = 0; $n_Y = 0; $n_V = 0;
        $n_AA_Total = 0;
        $f_A = 0; $f_R = 0; $f_N = 0; $f_D = 0; $f_C = 0; $f_E = 0; $f_Q = 0;
        $f_G = 0; $f_H = 0; $f_I = 0; $f_L = 0; $f_K = 0; $f_M = 0; $f_F = 0;
        $f_P = 0; $f_S = 0; $f_T = 0; $f_W = 0; $f_Y = 0; $f_V = 0;

        #output the sequence description
        my $desc = $FASTA_seq->display_id;
        #trim off possible trailing comma and whitespace
        $desc =~ s/,//g;
        $desc =~ s/\s*//g;
        print AACOUNTS $desc."> ";

        #get the sequence as an upper-case string
        my $sequence = uc $FASTA_seq->seq;

        #get the count of amino acids (tr with an empty replacement only counts)
        $n_A = ($sequence =~ tr/A//); $n_R = ($sequence =~ tr/R//);
        $n_N = ($sequence =~ tr/N//); $n_D = ($sequence =~ tr/D//);
        $n_C = ($sequence =~ tr/C//); $n_E = ($sequence =~ tr/E//);
        $n_Q = ($sequence =~ tr/Q//); $n_G = ($sequence =~ tr/G//);
        $n_H = ($sequence =~ tr/H//); $n_I = ($sequence =~ tr/I//);
        $n_L = ($sequence =~ tr/L//); $n_K = ($sequence =~ tr/K//);
        $n_M = ($sequence =~ tr/M//); $n_F = ($sequence =~ tr/F//);
        $n_P = ($sequence =~ tr/P//); $n_S = ($sequence =~ tr/S//);
        $n_T = ($sequence =~ tr/T//); $n_W = ($sequence =~ tr/W//);
        $n_Y = ($sequence =~ tr/Y//); $n_V = ($sequence =~ tr/V//);

        #sum up for total
        $n_AA_Total = $n_A + $n_R + $n_N + $n_D + $n_C + $n_E + $n_Q + $n_G +
                      $n_H + $n_I + $n_L + $n_K + $n_M + $n_F + $n_P + $n_S +
                      $n_T + $n_W + $n_Y + $n_V;

        #calculate frequencies
        $f_A = $n_A / $n_AA_Total; $f_R = $n_R / $n_AA_Total;
        $f_N = $n_N / $n_AA_Total; $f_D = $n_D / $n_AA_Total;
        $f_C = $n_C / $n_AA_Total; $f_E = $n_E / $n_AA_Total;
        $f_Q = $n_Q / $n_AA_Total; $f_G = $n_G / $n_AA_Total;
        $f_H = $n_H / $n_AA_Total; $f_I = $n_I / $n_AA_Total;
        $f_L = $n_L / $n_AA_Total; $f_K = $n_K / $n_AA_Total;
        $f_M = $n_M / $n_AA_Total; $f_F = $n_F / $n_AA_Total;
        $f_P = $n_P / $n_AA_Total; $f_S = $n_S / $n_AA_Total;
        $f_T = $n_T / $n_AA_Total; $f_W = $n_W / $n_AA_Total;
        $f_Y = $n_Y / $n_AA_Total; $f_V = $n_V / $n_AA_Total;

        #round frequencies to 3 decimal places
        $f_A = sprintf("%.3f", $f_A); $f_R = sprintf("%.3f", $f_R);
        $f_N = sprintf("%.3f", $f_N); $f_D = sprintf("%.3f", $f_D);
        $f_C = sprintf("%.3f", $f_C); $f_E = sprintf("%.3f", $f_E);
        $f_Q = sprintf("%.3f", $f_Q); $f_G = sprintf("%.3f", $f_G);
        $f_H = sprintf("%.3f", $f_H); $f_I = sprintf("%.3f", $f_I);
        $f_L = sprintf("%.3f", $f_L); $f_K = sprintf("%.3f", $f_K);
        $f_M = sprintf("%.3f", $f_M); $f_F = sprintf("%.3f", $f_F);
        $f_P = sprintf("%.3f", $f_P); $f_S = sprintf("%.3f", $f_S);
        $f_T = sprintf("%.3f", $f_T); $f_W = sprintf("%.3f", $f_W);
        $f_Y = sprintf("%.3f", $f_Y); $f_V = sprintf("%.3f", $f_V);

        #write results to file, counts first then frequencies
        print AACOUNTS "\nA(neutral): $n_A \t $f_A\n";
        print AACOUNTS "R(BASIC): $n_R \t $f_R\n";
        print AACOUNTS "N(neutral): $n_N \t $f_N\n";
        print AACOUNTS "D(ACIDIC): $n_D \t $f_D\n";
        print AACOUNTS "C(neutral): $n_C \t $f_C\n";
        print AACOUNTS "E(ACIDIC): $n_E \t $f_E\n";
        print AACOUNTS "Q(neutral): $n_Q \t $f_Q\n";
        print AACOUNTS "G(neutral): $n_G \t $f_G\n";
        print AACOUNTS "H(BASIC): $n_H \t $f_H\n";
        print AACOUNTS "I(neutral): $n_I \t $f_I\n";
        print AACOUNTS "L(neutral): $n_L \t $f_L\n";
        print AACOUNTS "K(BASIC): $n_K \t $f_K\n";
        print AACOUNTS "M(neutral): $n_M \t $f_M\n";
        print AACOUNTS "F(neutral): $n_F \t $f_F\n";
        print AACOUNTS "P(neutral): $n_P \t $f_P\n";
        print AACOUNTS "S(neutral): $n_S \t $f_S\n";
        print AACOUNTS "T(neutral): $n_T \t $f_T\n";
        print AACOUNTS "W(neutral): $n_W \t $f_W\n";
        print AACOUNTS "Y(neutral): $n_Y \t $f_Y\n";
        print AACOUNTS "V(neutral): $n_V \t $f_V\n";
        print AACOUNTS "Total: $n_AA_Total \n";
    }
    close AACOUNTS;
}
changeCode.pl:

#!/bin/perl -w
use strict;
use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::OddCodes;

# Matthew Conte
#
# This script converts the amino acids from the sequences in a FASTA file
# into the alphabet of the user's choice with the methods provided
# in Bio::Tools::OddCodes.
#
# Output is to FASTAfilename.charge
#
# Usage: perl changeCode.pl file.FASTA file2.FASTA ...

foreach my $file (@ARGV) {
    print STDERR "Reading input file $file... \n";

    #open the FASTA file
    my $FASTA_in = Bio::SeqIO->new(-file => $file, -format => 'FASTA');

    #get the file's basename
    $file =~ s/\.seq$//g;

    #open the .charge file for writing
    #NOTE - change .charge to .oddcode for whichever oddcode you decide to
    #use. Options are: charge, chemical, functional, hydrophobic.
    open(CHARGE_COUNTS, ">$file.charge") or die "Cannot write $file.charge: $!";

    #print a FASTA line at the top so that dipeps.pl understands that it is
    #a long sequence.
    print CHARGE_COUNTS ">gi|our long sequence\n";

    #for each sequence in the FASTA file..
    while (my $FASTA_seq = $FASTA_in->next_seq()) {

        #change the sequence in the file to the alphabet of your choosing
        #in this case charge is chosen
        #Options are: charge, chemical, functional, hydrophobic.
        my $oddcode_obj = Bio::Tools::OddCodes->new(-seq => $FASTA_seq);
        #the OddCodes methods return a reference to the recoded string
        my $sequence = $oddcode_obj->charge();

        print CHARGE_COUNTS $$sequence;
    }
    close CHARGE_COUNTS;
}
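What changeCode.pl asks Bio::Tools::OddCodes to do can be illustrated without BioPerl. The sketch below recodes a sequence into the 3-letter charge alphabet (A negatively charged, C positively charged, N no charge) with a single `tr`. The subroutine name `to_charge_alphabet` and the example sequence are hypothetical, and the residue assignment (D and E negative; H, K and R positive; everything else neutral) is an assumption based on the usual side chain classification rather than a copy of the library's code.

```perl
use strict;
use warnings;

# Recode a protein sequence into the charge alphabet:
# A = negatively charged, C = positively charged, N = no charge.
# Assumption: D,E are negative and H,K,R are positive.
sub to_charge_alphabet {
    my ($sequence) = @_;
    my $recoded = uc $sequence;
    # tr maps every character in one pass, so the freshly written
    # A/C/N codes are never translated a second time.
    $recoded =~ tr/DEHKRACFGILMNPQSTVWY/AACCCNNNNNNNNNNNNNNN/;
    return $recoded;
}

print to_charge_alphabet('MKDEHRAG'), "\n";   # prints NCAACCNN
```

Doing the mapping in one pass matters because the target letters A, C and N are themselves amino acid codes; translating group by group in the wrong order would recode alanine, cysteine or asparagine twice.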
charge.pl:

#!/bin/perl -w
use strict;
use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::OddCodes;

# Matthew Conte
#
# This script converts the amino acids from the sequences in a FASTA file into
# a 3-letter alphabet using the charge() method in Bio::Tools::OddCodes.
# It then counts the number of each code for each sequence as well as each
# frequency.
#
# Alphabet: A (negatively), C (positively), N (no charge).
#
# Output is to FASTAfilename.charge_counts
# Output is TAB-DELIMITED for import in to Microsoft Excel
#
# Usage: perl charge.pl file.FASTA file2.FASTA ...

#initialize count variables
my $n_A = 0;
my $n_C = 0;
my $n_N = 0;
my $n_AA_Total = 0;

#initialize frequency variables
my $f_A = 0;
my $f_C = 0;
my $f_N = 0;

foreach my $file (@ARGV) {
    print STDERR "Reading input file $file... \n";

    #open the FASTA file
    my $FASTA_in = Bio::SeqIO->new(-file => $file, -format => 'FASTA');

    #get the file's basename
    $file =~ s/\.seq$//g;

    #open the .charge_counts file for writing
    open(CHARGE_COUNTS, ">$file.charge_counts")
        or die "Cannot write $file.charge_counts: $!";

    #top line in the file to see where everything goes
    #(labels follow the alphabet above: A is negative, C is positive)
    print CHARGE_COUNTS
        "Sequence\tA(negative)\t%A(negative)\tC(positive)\t%C(positive)\tN(no charge)\t%N(no charge)\tTotal\n";

    #for each sequence in the FASTA file..
    while (my $FASTA_seq = $FASTA_in->next_seq()) {

        #reset variables
        $n_A = 0; $n_C = 0; $n_N = 0;
        $n_AA_Total = 0;
        $f_A = 0; $f_C = 0; $f_N = 0;

        #output the sequence description
        my $desc = $FASTA_seq->display_id;
        #trim off possible trailing comma and whitespace
        $desc =~ s/,//g;
        $desc =~ s/\s*//g;
        print CHARGE_COUNTS $desc;

        #recode the sequence; charge() returns a reference to the new string
        my $oddcode_obj = Bio::Tools::OddCodes->new(-seq => $FASTA_seq);
        my $sequence = $oddcode_obj->charge();

        #get the count of each code
        $n_A = ($$sequence =~ tr/A//);
        $n_C = ($$sequence =~ tr/C//);
        $n_N = ($$sequence =~ tr/N//);

        #sum up for total
        $n_AA_Total = $n_A + $n_C + $n_N;

        #calculate frequencies
        $f_A = $n_A / $n_AA_Total;
        $f_C = $n_C / $n_AA_Total;
        $f_N = $n_N / $n_AA_Total;

        #round frequencies to 3 decimal places
        $f_A = sprintf("%.3f", $f_A);
        $f_C = sprintf("%.3f", $f_C);
        $f_N = sprintf("%.3f", $f_N);

        #write results to file
        print CHARGE_COUNTS
            "\t$n_A\t$f_A\t$n_C\t$f_C\t$n_N\t$f_N\t$n_AA_Total\n";
    }
    close CHARGE_COUNTS;
}
chemical.pl:

#!/bin/perl -w
use strict;
use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::OddCodes;

# Matthew Conte
#
# This script converts the amino acids from the sequences in a FASTA file into
# a 8-letter alphabet using the chemical() method in Bio::Tools::OddCodes.
# It then counts the number of each code for each sequence as well as each
# frequency.
#
# Alphabet: A (acidic), L (aliphatic), M (amide), R (aromatic), C (basic),
# H (hydroxyl), I (imino), S (sulphur).
#
# Output is to FASTAfilename.chem_counts
# Output is TAB-DELIMITED for import in to Microsoft Excel
#
# Usage: perl chemical.pl file.FASTA file2.FASTA ...

#initialize count variables
my $n_A = 0;
my $n_L = 0;
my $n_M = 0;
my $n_R = 0;
my $n_C = 0;
my $n_H = 0;
my $n_I = 0;
my $n_S = 0;
my $n_AA_Total = 0;

#initialize frequency variables
my $f_A = 0;
my $f_L = 0;
my $f_M = 0;
my $f_R = 0;
my $f_C = 0;
my $f_H = 0;
my $f_I = 0;
my $f_S = 0;

foreach my $file (@ARGV) {
    print STDERR "Reading input file $file... \n";

    #open the FASTA file
    my $FASTA_in = Bio::SeqIO->new(-file => $file, -format => 'FASTA');

    #get the file's basename
    $file =~ s/\.seq$//g;

    #open the .chem_counts file for writing
    open(CHEM_COUNTS, ">$file.chem_counts")
        or die "Cannot write $file.chem_counts: $!";

    #top line in the file to see where everything goes
    print CHEM_COUNTS
        "Sequence\tA(acidic)\t%A(acidic)\tL(aliphatic)\t%L(aliphatic)\tM(amide)\t%M(amide)\t" .
        "R(aromatic)\t%R(aromatic)\tC(basic)\t%C(basic)\tH(hydroxyl)\t%H(hydroxyl)\tI(imino)\t" .
        "%I(imino)\tS(sulphur)\t%S(sulphur)\tTotal\n";

    #for each sequence in the FASTA file..
    while (my $FASTA_seq = $FASTA_in->next_seq()) {

        #reset variables
        $n_A = 0; $n_L = 0; $n_M = 0; $n_R = 0; $n_C = 0; $n_H = 0; $n_I = 0;
        $n_S = 0;
        $n_AA_Total = 0;
        $f_A = 0; $f_L = 0; $f_M = 0; $f_R = 0; $f_C = 0; $f_H = 0; $f_I = 0;
        $f_S = 0;

        #output the sequence description
        my $desc = $FASTA_seq->display_id;
        #trim off possible trailing comma and whitespace
        $desc =~ s/,//g;
        $desc =~ s/\s*//g;
        print CHEM_COUNTS $desc;

        #recode the sequence; chemical() returns a reference to the new string
        my $oddcode_obj = Bio::Tools::OddCodes->new(-seq => $FASTA_seq);
        my $sequence = $oddcode_obj->chemical();

        #get the count of each code
        $n_A = ($$sequence =~ tr/A//);
        $n_L = ($$sequence =~ tr/L//);
        $n_M = ($$sequence =~ tr/M//);
        $n_R = ($$sequence =~ tr/R//);
        $n_C = ($$sequence =~ tr/C//);
        $n_H = ($$sequence =~ tr/H//);
        $n_I = ($$sequence =~ tr/I//);
        $n_S = ($$sequence =~ tr/S//);

        #sum up for total
        $n_AA_Total = $n_A + $n_L + $n_M + $n_R + $n_C + $n_H + $n_I + $n_S;

        #calculate frequencies
        $f_A = $n_A / $n_AA_Total;
        $f_L = $n_L / $n_AA_Total;
        $f_M = $n_M / $n_AA_Total;
        $f_R = $n_R / $n_AA_Total;
        $f_C = $n_C / $n_AA_Total;
        $f_H = $n_H / $n_AA_Total;
        $f_I = $n_I / $n_AA_Total;
        $f_S = $n_S / $n_AA_Total;

        #round frequencies to 3 decimal places
        $f_A = sprintf("%.3f", $f_A);
        $f_L = sprintf("%.3f", $f_L);
        $f_M = sprintf("%.3f", $f_M);
        $f_R = sprintf("%.3f", $f_R);
        $f_C = sprintf("%.3f", $f_C);
        $f_H = sprintf("%.3f", $f_H);
        $f_I = sprintf("%.3f", $f_I);
        $f_S = sprintf("%.3f", $f_S);

        #write results to file
        print CHEM_COUNTS
            "\t$n_A\t$f_A\t$n_L\t$f_L\t$n_M\t$f_M\t$n_R\t$f_R\t$n_C\t$f_C\t$n_H\t$f_H\t$n_I\t" .
            "$f_I\t$n_S\t$f_S\t$n_AA_Total\n";
    }
    close CHEM_COUNTS;
}
dipeps.pl:

#!/bin/perl -w
use strict;
use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::OddCodes;
use Bio::Tools::SeqWords;

# Matthew Conte
#
# This script counts the number of each different amino acid pair for each
# sequence in the given FASTA files
#
# Output is sorted by count, highest to lowest
# Output is to FASTAfilename.dipep_counts
# Output is TAB-DELIMITED for import in to Microsoft Excel
#
# Usage: perl dipeps.pl file.FASTA file2.FASTA ...

# variable to count the total number of amino acid pairs in each sequence.
my $total = 0;

foreach my $file (@ARGV) {
    print STDERR "Reading input file $file... \n";

    #open the FASTA file
    my $FASTA_in = Bio::SeqIO->new(-file => $file, -format => 'FASTA');

    #get the file's basename
    $file =~ s/\.seq$//g;

    #open the .dipep_counts file for writing
    open(DIPEP_COUNTS, ">$file.dipep_counts")
        or die "Cannot write $file.dipep_counts: $!";

    #for each sequence in the FASTA file..
    while (my $FASTA_seq = $FASTA_in->next_seq()) {

        #reset total
        $total = 0;

        #output the sequence description
        my $desc = $FASTA_seq->display_id;
        #trim off possible trailing comma and whitespace
        $desc =~ s/,//g;
        $desc =~ s/\s*//g;
        print DIPEP_COUNTS "\n$desc";

        #count overlapping words of length 2 (dipeptides)
        my $seq_word = Bio::Tools::SeqWords->new(-seq => $FASTA_seq);
        my $sequence = $seq_word->count_overlap_words(2);

        # display the hashtable
        my %hash = %$sequence;

        #this code will sort the dipeptides alphabetically
        #foreach my $key (sort keys %hash) {
        #$total = $total + $hash{$key};
        #print DIPEP_COUNTS "\n$key\t$hash{$key}";
        #}

        # sort the hash by value in descending order (highest to lowest);
        # <=> compares the counts numerically
        foreach my $key (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
            $total = $total + $hash{$key};
            print DIPEP_COUNTS "\n$key\t$hash{$key}";
        }
        print DIPEP_COUNTS "\nTotal: $total";
    }
    close DIPEP_COUNTS;
}
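The overlapping pair count that dipeps.pl obtains from Bio::Tools::SeqWords can be sketched in plain Perl as follows. This is only an illustration of the counting idea, not a replacement for the BioPerl call: the subroutine name `count_dipeptides` and the example sequence are hypothetical.

```perl
use strict;
use warnings;

# Count every overlapping pair of adjacent residues in a sequence.
# A sequence of length L contributes L - 1 pairs.
sub count_dipeptides {
    my ($sequence) = @_;
    $sequence = uc $sequence;
    my %count;
    for my $i (0 .. length($sequence) - 2) {
        $count{ substr($sequence, $i, 2) }++;
    }
    return \%count;
}

# Made-up sequence; pairs are AK, KA, AK, KD, DE.
my $counts = count_dipeptides('AKAKDE');

# Highest count first, ties broken alphabetically so the order is stable.
for my $pair (sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts) {
    print "$pair\t$counts->{$pair}\n";
}
```

The numeric comparison operator `<=>` is used for the counts because a string comparison would place 9 above 10.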
dipepsA.pl:
#!/bin/perl -w

use strict;
use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::OddCodes;
use Bio::Tools::SeqWords;

# Matthew Conte
#
# This script counts the number of each different amino acid pair
# in the sequence given for FASTA files
#
# Output is sorted alphabetically by amino acid pair
# Output is to FASTAfilename.dipepcounts
# Output is TAB-DELIMITED for import in to Microsoft Excel
#
# Usage:
#   perl dipep.pl file.FASTA file2.FASTA ...

# variable to count the total number of amino acids in each sequence.
my $total = 0;

foreach my $file(@ARGV) {
    print STDERR "Reading input file $file... \n";

    #open the FASTA file
    my $FASTA_in = Bio::SeqIO->new(-file   => $file,
                                   -format => 'FASTA');

    #strip the .seq extension to get the file's basename
    $file =~ s/\.seq$//g;

    #open the .dipepA_counts file for writing
    open DIPEP_COUNTS, ">$file.dipepA_counts"
        or die "Cannot open $file.dipepA_counts: $!";

    #for each sequence in the FASTA file..
    while(my $FASTA_seq = $FASTA_in->next_seq()) {
        #reset total
        $total = 0;

        #output the sequence description
        my $desc = $FASTA_seq->display_id;
        #trim off possible trailing comma and whitespace
        $desc =~ s/,//g;
        $desc =~ s/\s*//g;
        print DIPEP_COUNTS "\n$desc";

        my $seq_word = Bio::Tools::SeqWords->new(-seq => $FASTA_seq);
        my $sequence = $seq_word->count_overlap_words(2);

        # display the hashtable
        my %hash = %$sequence;

        #this code will sort the dipeptides alphabetically
        foreach my $key(sort keys %hash) {
            $total = $total + $hash{$key};
            print DIPEP_COUNTS "\n$key\t$hash{$key}";
        }

        # sort the hash by value in descending order (highest to lowest)
        #foreach my $key (sort {$hash{$b} <=> $hash{$a}} keys %hash) {
        #    $total = $total + $hash{$key};
        #    print DIPEP_COUNTS "\n$key\t$hash{$key}";
        #}
        print DIPEP_COUNTS "\nTotal: $total";
    }
    close DIPEP_COUNTS;
}
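
The counting step in the dipeptide scripts can be illustrated without BioPerl. The following is a minimal sketch in core Perl, not the thesis code: the helper name count_dipeptides and the example sequence are hypothetical, and the sliding two-residue window is only assumed to give the same counts as count_overlap_words(2). It shows both output orders used above, alphabetical and descending by count.

```perl
use strict;
use warnings;

# Count every overlapping two-residue window in a sequence string.
# Returns a reference to a hash of dipeptide => count.
# (Hypothetical helper, written for illustration only.)
sub count_dipeptides {
    my ($seq) = @_;
    my %counts;
    $seq = uc $seq;
    # For sequences shorter than 2 residues the range is empty.
    for my $i (0 .. length($seq) - 2) {
        $counts{ substr($seq, $i, 2) }++;
    }
    return \%counts;
}

# Made-up example sequence, not taken from the thesis data.
my $counts = count_dipeptides('MKKLLKK');

# Alphabetical order, as in dipepsA.pl.
my @alphabetical = sort keys %$counts;

# Descending by count; <=> compares numerically, ties fall back to the key.
my @by_count = sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b }
               keys %$counts;

my $total = 0;
$total += $_ for values %$counts;

print "$_\t$counts->{$_}\n" for @by_count;
print "Total: $total\n";
```

A sequence of n residues yields n - 1 overlapping pairs, so the total printed for the seven-residue example is 6.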
functional.pl:
#!/bin/perl -w

use strict;
use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::OddCodes;

# Matthew Conte
#
# This script converts the amino acids from the sequences in a FASTA file into
# a 4-letter alphabet using the functional() method in Bio::Tools::OddCodes.
# It then counts the number of each code for each sequence as well as each
# frequency.
#
# Alphabet: A (acidic), C (basic), H (hydrophobic), P (polar).
#
# Output is to FASTAfilename.functcounts
# Output is TAB-DELIMITED for import in to Microsoft Excel
#
# Usage:
#   perl functionalcounts.pl file.FASTA file2.FASTA ...

#initialize count variables
my $n_A = 0;
my $n_C = 0;
my $n_H = 0;
my $n_P = 0;
my $n_AA_Total = 0;

#initialize frequency variables
my $f_A = 0;
my $f_C = 0;
my $f_H = 0;
my $f_P = 0;

foreach my $file(@ARGV) {
    print STDERR "Reading input file $file... \n";

    #open the FASTA file
    my $FASTA_in = Bio::SeqIO->new(-file   => $file,
                                   -format => 'FASTA');

    #strip the .seq extension to get the file's basename
    $file =~ s/\.seq$//g;

    #open the .func_counts file for writing
    open FUNC_COUNTS, ">$file.func_counts"
        or die "Cannot open $file.func_counts: $!";

    #top line in the file to see where everything goes
    print FUNC_COUNTS "Sequence\tA(Acidic)\t%A(Acidic)\tC(Basic)\t%C(Basic)\tH(Hydrophobic)\t%H(Hydrophobic)\tP(Polar)\t%P(Polar)\tTotal\n";

    #for each sequence in the FASTA file..
    while(my $FASTA_seq = $FASTA_in->next_seq()) {
        #reset variables
        $n_A = 0; $n_C = 0; $n_H = 0; $n_P = 0;
        $n_AA_Total = 0;
        $f_A = 0; $f_C = 0; $f_H = 0; $f_P = 0;

        #output the sequence description
        my $desc = $FASTA_seq->display_id;
        #trim off possible trailing comma and whitespace
        $desc =~ s/,//g;
        $desc =~ s/\s*//g;
        print FUNC_COUNTS $desc;

        #functional() returns a reference to the recoded sequence string
        my $oddcode_obj = Bio::Tools::OddCodes->new(-seq => $FASTA_seq);
        my $sequence = $oddcode_obj->functional();

        ##get the count of amino acids
        $n_A = ($$sequence =~ tr/A//);
        $n_C = ($$sequence =~ tr/C//);
        $n_H = ($$sequence =~ tr/H//);
        $n_P = ($$sequence =~ tr/P//);

        #sum up for total
        $n_AA_Total = $n_A + $n_C + $n_H + $n_P;

        #calculate frequencies
        $f_A = $n_A / $n_AA_Total;
        $f_C = $n_C / $n_AA_Total;
        $f_H = $n_H / $n_AA_Total;
        $f_P = $n_P / $n_AA_Total;

        #round frequencies to 3 decimal places
        $f_A = sprintf("%.3f", $f_A);
        $f_C = sprintf("%.3f", $f_C);
        $f_H = sprintf("%.3f", $f_H);
        $f_P = sprintf("%.3f", $f_P);

        #write results to file
        print FUNC_COUNTS "\t$n_A\t$f_A\t$n_C\t$f_C\t$n_H\t$f_H\t$n_P\t$f_P\t$n_AA_Total\n";
    }
    close FUNC_COUNTS;
}
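
The count-and-frequency step shared by functional.pl and hydro.pl can be sketched on its own, starting from a string that has already been recoded. This is an illustrative sketch in core Perl, not the thesis code: the helper name code_frequencies and the example string are hypothetical, and the guard against an empty sequence is an addition that the scripts themselves do not have.

```perl
use strict;
use warnings;

# Count each code letter in an already-recoded sequence string and return
# per-letter counts, frequencies rounded to 3 decimal places, and the total.
# (Hypothetical helper, written for illustration only.)
sub code_frequencies {
    my ($encoded, @alphabet) = @_;
    my (%n, %f);
    my $total = 0;
    for my $letter (@alphabet) {
        # A global match in list context, assigned to a scalar, gives the
        # number of occurrences of the letter.
        $n{$letter} = () = $encoded =~ /\Q$letter\E/g;
        $total += $n{$letter};
    }
    for my $letter (@alphabet) {
        # Avoid dividing by zero when no code letters were found.
        my $freq = $total ? $n{$letter} / $total : 0;
        $f{$letter} = sprintf("%.3f", $freq);
    }
    return (\%n, \%f, $total);
}

# Made-up string in the 4-letter functional alphabet (A, C, H, P).
my ($n, $f, $total) = code_frequencies('HHPACHPP', qw(A C H P));
print join("\t", map { "$n->{$_}\t$f->{$_}" } qw(A C H P)), "\t$total\n";
```

Passing qw(I O) as the alphabet gives the two-letter case used by hydro.pl.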
hydro.pl:
#!/bin/perl -w

use strict;
use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::OddCodes;

# Matthew Conte
#
# This script converts the amino acids from the sequences in a FASTA file into
# a 2-letter alphabet using the hydrophobic() method in Bio::Tools::OddCodes.
# It then counts the number of each code for each sequence as well as each
# frequency.
#
# Alphabet: I (hydrophilic), O (hydrophobic).
#
# Output is to FASTAfilename.hydrocounts
# Output is TAB-DELIMITED for import in to Microsoft Excel
#
# Usage:
#   perl hydro.pl file.FASTA file2.FASTA ...

#initialize count variables
my $n_I = 0;
my $n_O = 0;
my $n_AA_Total = 0;

#initialize frequency variables
my $f_I = 0;
my $f_O = 0;

foreach my $file(@ARGV) {
    print STDERR "Reading input file $file... \n";

    #open the FASTA file
    my $FASTA_in = Bio::SeqIO->new(-file   => $file,
                                   -format => 'FASTA');

    #strip the .seq extension to get the file's basename
    $file =~ s/\.seq$//g;

    #open the .hydro_counts file for writing
    open HYDRO_COUNTS, ">$file.hydro_counts"
        or die "Cannot open $file.hydro_counts: $!";

    #top line in the file to see where everything goes
    print HYDRO_COUNTS "Sequence\tI(hydrophilic)\t%I(hydrophilic)\tO(hydrophobic)\t%O(hydrophobic)\tTotal\n";

    #for each sequence in the FASTA file..
    while(my $FASTA_seq = $FASTA_in->next_seq()) {
        #reset variables
        $n_I = 0; $n_O = 0;
        $n_AA_Total = 0;
        $f_I = 0; $f_O = 0;

        #output the sequence description
        my $desc = $FASTA_seq->display_id;
        #trim off possible trailing comma and whitespace
        $desc =~ s/,//g;
        $desc =~ s/\s*//g;
        print HYDRO_COUNTS $desc;

        #hydrophobic() returns a reference to the recoded sequence string
        my $oddcode_obj = Bio::Tools::OddCodes->new(-seq => $FASTA_seq);
        my $sequence = $oddcode_obj->hydrophobic();

        ##get the count of amino acids
        $n_I = ($$sequence =~ tr/I//);
        $n_O = ($$sequence =~ tr/O//);

        #sum up for total
        $n_AA_Total = $n_I + $n_O;

        #calculate frequencies
        $f_I = $n_I / $n_AA_Total;
        $f_O = $n_O / $n_AA_Total;

        #round frequencies to 3 decimal places
        $f_I = sprintf("%.3f", $f_I);
        $f_O = sprintf("%.3f", $f_O);

        #write results to file
        print HYDRO_COUNTS "\t$n_I\t$f_I\t$n_O\t$f_O\t$n_AA_Total\n";
    }
    close HYDRO_COUNTS;
}
makeComposite.pl:
#!/bin/perl

use strict;

# Matthew Conte
#
# This script converts FASTA files of multiple sequences into a single (composite)
# sequence. This composite sequence is then able to be used with other programs
# such as charge.pl, chemical.pl, dipeps.pl, functional.pl, and hydro.pl.
#
# Usage:
#   perl makeComposite.pl file.FASTA > outfile

my $count = 0;

while(<>){
    if ($count < 1) {
        #always keep the first line (the first FASTA header)
        print $_;
    }
    else {
        if (/^>gi*/) {
            #do nothing
        }
        else {
            print $_;
        }
    }
    $count++;
}
print "$count\n";
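
The header-dropping logic of makeComposite.pl can be sketched as a function over a list of lines, which makes it easy to try on a small input. This is a minimal sketch, not the thesis code: the helper name make_composite and the example records are hypothetical. Unlike the script, whose pattern matches later headers beginning with ">g", this sketch treats any later line starting with ">" as a header, and it does not print the final line count.

```perl
use strict;
use warnings;

# Keep the first line (the first FASTA header) and every sequence line,
# dropping any later line that starts with '>'.
# (Hypothetical helper, written for illustration only.)
sub make_composite {
    my @lines = @_;
    my @out;
    my $count = 0;
    for my $line (@lines) {
        push @out, $line if $count < 1 or $line !~ /^>/;
        $count++;
    }
    return @out;
}

# Made-up two-record FASTA input.
my @fasta = (
    ">gi|1|first\n",
    "MKV\n",
    ">gi|2|second\n",
    "LLA\n",
);

print make_composite(@fasta);
```

For the example input the result is the first header followed by the two sequence lines, so downstream scripts see one record.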