Coiled-Coil Stability Analysis
and
Hydrophobic Core Characterization
by
David Brinkmann
B.S., University of Colorado at Colorado Springs, 1994
A thesis submitted to the Faculty of Graduate School of the
University of Colorado at Colorado Springs
In partial fulfillment of the
Requirements for the degree of
Master of Science
Department of Computer Science
2003
©Copyright By David C. Brinkmann 2003
All Rights Reserved
This thesis for Master of Science degree by
David Brinkmann
has been approved for the
Department of Computer Science
by
_______________________________________________________
Jugal K. Kalita, Chair
_______________________________________________________
C. Edward Chow
_______________________________________________________
Robert Hodges
_______________________________________________________
Karen Newell
Date___________
CONTENTS
CHAPTER
1. INTRODUCTION
Biology
DNA
The Central Dogma of Molecular Biology
Protein Structure
Coiled-Coil
2. LITERATURE REVIEW
Protein Structure Analysis
Early Proteins Structure Prediction
Coiled-coil Characterizations
Stability
Coiled Coil Stability Using Experimental Data
3. STABLE INPUT
UCHSC
Stable Input Parameters
Helical Propensity
Hydrophobicity
E/G Interactions
Intra-Chain Electrostatic Interactions
Clusters
Entropy
Program Flow
Output Table
Output Graphs
4. COILED-COIL CLUSTER ANALYSIS
Why Coiled-Coils?
Protein Database Analysis
SPTR dataset
Swiss-Prot Coiled-Coils
Stable Coil Pre-Processing
Coil Analysis
Summary of Findings
5. CONCLUSION
GLOSSARY
BIBLIOGRAPHY
APPENDIX A STABLE INPUT GUI
APPENDIX B TABULATED OUTPUT
TABLES
Table 1-1 Non-polar Amino Acids (hydrophobic)
Table 1-2 Polar Amino Acids (hydrophilic)
Table 1-3 Electrically Charged (negative and hydrophilic)
Table 1-4 Electrically Charged (positive and hydrophilic)
Table 2-1 Chou-Fasman Table
Table 3-1 Windowing Algorithm for Window = 7
Table 3-2 Helical Propensity Values
Table 3-3 Hydrophobic Core Values
Table 3-4 Intra-Chain Effect
Table 3-5 Inter-Chain Electrostatics
Table 3-6 File Extensions
Table 4-1 Helical Propensity and Stability Values
Table 4-2 Phe, Ile, Leu, Met, Val, and Tyr Frequency Swiss-Prot
Table 4-3 Phe, Ile, Leu, Met, Val, and Tyr Frequency SPTR
Table 4-4 Clusters 6 Heptad S-P
Table 4-5 Clusters 6 Heptad+1 S-P
Table 4-6 Clusters 7 Heptad S-P
Table 4-7 Cluster 7+1 Heptad S-P
Table 4-8 Clusters 8 Heptad S-P
Table 4-9 Clusters 8+1 Heptad S-P
Table 4-10 Clusters 6 Heptad SPTR
Table 4-11 Clusters 6 Heptad+1 SPTR
Table 4-12 Clusters 7 Heptad SPTR
Table 4-13 Clusters 7+1 Heptad SPTR
Table 4-14 Clusters 8 Heptad SPTR
Table 4-15 Clusters 8+1 Heptad SPTR
Table 4-16 Hydrophobic Cluster Count, Swiss-Prot
Table 4-17 Non-Hydrophobic Cluster Count, Swiss-Prot
Table 4-18 Hydrophobic Cluster Count, SPTR
Table 4-19 Non-Hydrophobic Cluster Count, SPTR
Table 4-20 Stabilizing Cluster, Swiss Prot
Table 4-21 Stabilizing Cluster, SPTR
Table 4-22 Cluster Amino Acids Swiss-Prot
Table 4-23 Cluster Amino Acids SPTR
FIGURES
Figure 1-1 Amino Acid
Figure 1-2 Phi and Psi Angles
Figure 1-3 α-Helices
Figure 1-4 Beta Sheets
Figure 1-5 Heptad Repeat
Figure 1-6 Heptad Positions in a Coiled Coil
Figure 3-1 Coiled Coil A/D and E/G Interactions
Figure 3-2 Lateral View Coiled Coil E/G Interaction
Figure 3-3 Clustered Hydrophobic Core
Figure 3-4 Stable Input Program Flow
Figure 3-5 Clusters
Figure 3-6 Mapping Example
Figure 3-7 Tropomyosin Sequence
Figure 3-8 Summary Output
Figure 3-9 Total Stability
Figure 3-10 A/D Hydrophobic Stability
Figure 3-11 Helical Propensity
Figure 3-12 E/G Electrostatic Interaction
Figure 3-13 Chain Length
Figure 3-14 Density Stability
Figure 4-1 SPTR Protein Entry
Figure 4-2 Coiled-Coil Retrieval
Figure 4-3 Coiled Coil Entry
Figure 4-4 Normalized Length Frequency
Figure 4-5 Amino Acid in A and D positions 6&7 Heptads - SPTR
Figure 4-6 Amino Acid in A and D positions 6&7 Heptads - Swiss-Prot
Figure 4-7 Normalized Amino Acid Distribution Swiss-Prot
Figure 4-8 Normalized Amino Acid Distribution SPTR
Figure 4-9 Normalized Cluster by Heptad Length
Figure 4-10 Total Clusters and Ratio by Heptad Length Swiss-Prot
Figure 4-11 Total Clusters and Ratio by Heptad Length SPTR
Figure 4-12 Total Clusters by Cluster size Swiss-Prot
Figure 4-13 Total Clusters by Cluster size SPTR
Figure 4-14 Hydrophobic Amino Acids in Clusters
Figure 4-15 Non-Hydrophobic Amino Acids in Clusters
CHAPTER 1
INTRODUCTION
The scientific community now has access to the completely sequenced
genomes of several different species, including the human genome. When it comes to
the human genome, however, a complete understanding of the 500,000 proteins encoded
by the 30,000 genes will take many more years of further study. Not only is there a great
volume of data to be interpreted, but the complexities of the biological systems need to be
understood as well. As a complement to the physical genomic research, proteomics, a
discipline of molecular biology, has been initiated for the comparative study of proteomes
under different conditions. Among the research facilities dedicated to the field of
proteomics is the Peptide Chemistry lab of Robert Hodges at the University of Colorado
Health Sciences Center (UCHSC). Dr. Hodges’ group is interested in being able to
determine the absolute stability of the coiled-coil oligomerization domain because the
ability to determine coiled-coil stability would greatly facilitate the prediction of
coiled-coil protein structures and advance protein design.
This thesis explores two areas that are currently being researched at UCHSC.
First, in order to expedite analysis of experimental data, a comprehensive tool is needed to
calculate the relative stability of experimental sequences. Currently, all sequence
stability calculations are done by hand, and because of the length of the sequences and the
calculations involved, this process can take a great deal of time. In addition to performing
the calculations, there is a need for a graphical display of various aspects of the
calculations.
The second part of this thesis examines how hydrophobic amino acids are
grouped in successive heptads of coiled-coil sequences found in the Swiss-Prot protein
database. The hydrophobic residues appear in the ‘a’ and ‘d’ heptad positions in
coiled-coil conformations. It has been proposed that clusters of hydrophobic amino acids in the
‘a’ and ‘d’ positions play an important role in protein folding and other activities. For
example, when all other stability factors are constant and only the hydrophobic cluster
arrangement is altered, two proteins exhibit different levels of stability.
A comprehensive analysis of all coiled-coil regions found is needed
to determine whether clusters exist in nature. In this analysis, answers to the following
questions are sought: first, what is the frequency of the hydrophobic amino acids in the ‘a’
and ‘d’ positions; second, what is the length of the clusters; third, what amino acids are
present in the different cluster lengths; fourth, how many hydrophobic amino acids and
how many other amino acids are present in clusters of various lengths; and fifth, given that
coiled-coils always start with stabilizing clusters, can these be characterized, and if so, how?
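The first two questions can be made concrete with a small sketch. The Python below is illustrative only and is not the analysis program described in this thesis. It assumes the heptad register of every residue is already known, treats the nine non-polar residues of Table 1-1 as hydrophobic, and calls a run of at least three consecutive hydrophobic ‘a’/‘d’ positions a cluster. The function names, the hydrophobic set, and the threshold `min_len` are all assumptions made for the example.

```python
from collections import Counter

# Assumption: hydrophobic set taken from the non-polar residues of Table 1-1.
HYDROPHOBIC = set("GAVLIMFWP")


def core_frequency(sequence, register):
    """Count how often each residue occupies an 'a' or 'd' position."""
    return Counter(res for res, pos in zip(sequence, register) if pos in "ad")


def core_clusters(sequence, register, min_len=3, hydrophobic=HYDROPHOBIC):
    """Return runs of consecutive hydrophobic residues at 'a'/'d' positions.

    sequence -- one-letter amino acid string
    register -- heptad letter (a-g) of each residue, same length as sequence
    min_len  -- hypothetical threshold for calling a run a cluster
    """
    if len(sequence) != len(register):
        raise ValueError("sequence and register must have the same length")
    # Keep only the residues that sit in the core ('a' and 'd') positions.
    core = [res for res, pos in zip(sequence, register) if pos in "ad"]
    clusters, run = [], []
    for res in core:
        if res in hydrophobic:
            run.append(res)
        else:
            # A non-hydrophobic core residue ends the current run.
            if len(run) >= min_len:
                clusters.append("".join(run))
            run = []
    if len(run) >= min_len:
        clusters.append("".join(run))
    return clusters


if __name__ == "__main__":
    seq = "EVGALKAEVGALKAQNGALQKQIGALQK"  # made-up example, not thesis data
    reg = "gabcdef" * 4
    print(core_frequency(seq, reg))
    print(core_clusters(seq, reg))
```

In the made-up example the asparagine at one ‘a’ position splits the core into two runs, which is the kind of interruption a cluster-length analysis has to account for.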
Biology
DNA
Physically, DNA is described as a double helix. The double helix is a
conformation that is made up of two anti-parallel strands connected periodically along
their lengths. The backbone of each strand in the DNA molecule is made of a series of repeated
sugar and phosphate molecules. This repeated pattern is found along the entire length of
the molecule. One of the most important roles DNA plays is that it provides a code that
ultimately leads to the synthesis of proteins in the cell.
The Central Dogma of Molecular Biology
The transfer of information in cells generally goes from DNA to RNA to the
synthesis of a protein. In brief, a single segment of one DNA strand serves as a template
for the synthesis of an RNA molecule. This process is called transcription because during
this phase of gene expression a transfer of information from one nucleic acid type to
another occurs. Next, the RNA molecule is translated into a protein sequence. The RNA
that is translated into a protein is called messenger RNA (mRNA), and the molecular
machinery that carries out this step is called the ribosome [Becker 2000b]. Using
complementary base pairing (3 nucleotides or 1 codon) between a tRNA molecule (which
carries one amino acid) and the mRNA molecule, the ribosome catalyzes the chemical
reaction linking a new incoming amino acid with the previously linked amino acid in the
translated polypeptide chain. Following synthesis, the amino acid sequence can go
through further processing in the endoplasmic reticulum and Golgi complex to acquire
post-translational modifications (e.g. glycosylation) [Becker 2000c] to form the final
synthesized protein.
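The flow from DNA to mRNA to protein described above can be sketched in a few lines of Python. This is a simplified illustration, not part of the thesis software: it assumes the template strand is written 3'->5', uses the standard genetic code, starts at the first AUG, and ignores tRNA chemistry and post-translational processing. The function names are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons enumerated in UCAG order; '*' is a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Base pairing used when RNA is built against a DNA template.
COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_dna):
    """Build the mRNA complementary to a DNA template strand.

    The template is assumed to be written 3'->5', so the mRNA reads 5'->3'.
    """
    return "".join(COMPLEMENT[base] for base in template_dna.upper())


def translate(mrna):
    """Translate from the first AUG, one codon at a time, until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("TACGAAGTTATT")
    print(mrna)             # AUGCUUCAAUAA
    print(translate(mrna))  # MLQ
```

Each three-letter slice of the mRNA plays the role of one codon, which is the unit the text describes as being matched by a tRNA carrying a single amino acid.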
Protein Structure
The Central Dogma of Molecular Biology describes the protein synthesis process.
Although the steps used to synthesize a protein are well known, the processes that cause a
protein to assume a particular physical structure after it is synthesized are not as well
understood. The specific structures and substructures a protein ultimately forms play an
important role in how the protein will function in the cell. Basic protein structure is
determined by the elemental components of the amino acids and can be described using a
four-level hierarchy.
Proteins are generally composed of a linear main amino acid chain, or backbone.
Each of the amino acids has a four-part molecular substructure. The amino acid begins
with an amine group (-NH2) and ends with a carboxylate group (-COOH). In between
these two groups is an α-carbon, Cα. Bonded to the Cα are an R group and a hydrogen
atom. The backbone of the amino acid sequence is formed by a linear combination of
amino acids bonding together so that a repeated link of the individual amino acids' amine
groups, Cα atoms, and carboxylate groups forms a chain. Figure 1-1, Amino Acid, shows the
details.
Figure 1-1 Amino Acid
Amino acids are connected to each other through a peptide bond that forms
between the carboxylate group of one amino acid and the amine group of its neighbor.
Once the bond is formed, the two joined amino acids have only one free amine group, or
N-terminus, and one free carboxylate group, or C-terminus. The relationship between the joined
amino acids is described using two angles, phi and psi. The phi angle is the angle of rotation
about the bond between the amine nitrogen and the Cα, and the psi angle is the angle of
rotation about the bond between the Cα and the former carboxylate carbon. These angles show
the level of twist in the amino acid
backbone and have been used to predict overall structure stability [Gromiha 2002].
Secondary structures are found in globular proteins when the phi and psi angles of
contiguous amino acids in a sequence are repetitive. Figure 1-2, Phi and Psi Angles,
illustrates these relationships.
Figure 1-2 Phi and Psi Angles
The R-group attached to the Cα of each amino acid is called the amino acid
side-chain. Side-chains are what give the amino acids their particular characteristics. It is the
side-chain that makes the amino acid hydrophobic, polar, or non-polar. Side-chains range
in size from a simple hydrogen atom, as in glycine, to relatively large, complex aromatic
groups. Nine of the amino acids have non-polar side-chain groups and form the
hydrophobic amino acids. The remaining 11 amino acids can be further categorized as
hydrophilic charged and hydrophilic uncharged. The different categories of amino acids
are listed in Tables 1-1 through 1-4.
Amino Acid       Three Letter Code   Single Letter Code
Glycine          Gly                 G
Alanine          Ala                 A
Valine           Val                 V
Leucine          Leu                 L
Isoleucine       Ile                 I
Methionine       Met                 M
Phenylalanine    Phe                 F
Tryptophan       Trp                 W
Proline          Pro                 P
Table 1-1 Non-polar Amino Acids (hydrophobic)
Amino Acid       Three Letter Code   Single Letter Code
Serine           Ser                 S
Threonine        Thr                 T
Cysteine         Cys                 C
Tyrosine         Tyr                 Y
Asparagine       Asn                 N
Glutamine        Gln                 Q
Table 1-2 Polar Amino Acids (hydrophilic)
Amino Acid       Three Letter Code   Single Letter Code
Aspartic Acid    Asp                 D
Glutamic Acid    Glu                 E
Table 1-3 Electrically Charged (negative and hydrophilic)
Amino Acid       Three Letter Code   Single Letter Code
Lysine           Lys                 K
Arginine         Arg                 R
Histidine        His                 H
Table 1-4 Electrically Charged (positive and hydrophilic)
Protein structure is influenced by the type and number of side-chains present in its
sequence. Hydrophobic amino acids have side chains that will not form hydrogen bonds
or ionic bonds with other groups. These hydrophobic amino acids tend to be buried in the
center of proteins away from the surrounding aqueous environment. The amino acids in
this category are listed in Table 1-1, Non-polar Amino Acids (hydrophobic). Some
references include glycine in the hydrophobic category, while others consider its side
chain neutral, because this amino acid has no strong hydrophobic or hydrophilic properties.
Amino acids with polar side chains carry no net charge at physiological pH.
These are listed in Table 1-2, Polar Amino Acids (hydrophilic). Amino acids with acidic
side chains have a carboxylic acid group in their side chain and are very hydrophilic.
These amino acids are listed in Table 1-3, Electrically Charged (negative and
hydrophilic). Amino acids with basic side chains have a positive charge on these side
chains that makes them hydrophilic and they are likely to be found at the protein surface.
These are listed in Table 1-4, Electrically Charged (positive and hydrophilic). In addition
to these amino acid characteristics, van der Waals forces, hydrogen bonds,
electrostatic interactions, and the hydrophobic effect also affect protein structure.
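The four categories of Tables 1-1 through 1-4 translate directly into a lookup table. The sketch below is an illustration only: the dictionary contents are transcribed from those tables, while the names `CATEGORIES`, `classify`, and `composition` are hypothetical and do not come from the thesis software.

```python
from collections import Counter

# One-letter codes grouped as in Tables 1-1 through 1-4.
CATEGORIES = {
    "non-polar (hydrophobic)": "GAVLIMFWP",
    "polar (hydrophilic)": "STCYNQ",
    "negatively charged (hydrophilic)": "DE",
    "positively charged (hydrophilic)": "KRH",
}

# Invert the table so each residue maps to its category name.
RESIDUE_CLASS = {
    residue: name
    for name, letters in CATEGORIES.items()
    for residue in letters
}


def classify(sequence):
    """Return the category of every residue in a one-letter sequence."""
    return [RESIDUE_CLASS[residue] for residue in sequence.upper()]


def composition(sequence):
    """Return the fraction of residues that fall in each category."""
    counts = Counter(classify(sequence))
    total = len(sequence)
    return {name: counts[name] / total for name in CATEGORIES}


if __name__ == "__main__":
    for name, fraction in composition("EVGALKA").items():
        print(f"{name}: {fraction:.2f}")
```

Note that glycine is counted as non-polar here because Table 1-1 lists it that way; a different convention only requires moving one letter between entries.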
The van der Waals forces are the attractions and repulsions atoms have for one
another that give matter its general cohesion [Lesk 2002]. These come from the
positively charged nucleus of one atom and the negative charge from the electron cloud
of another. Hydrogen bonds are the weaker attractions between uncharged, yet polarized
atoms. Hydrogen bonds commonly form between the O and H atoms. Electrostatic
interactions form the basis for the van der Waals interactions and the hydrogen bond.
These interactions are common at the N and C termini of the peptide chains. Electrostatic
side chain interactions occur between Lys, Arg, His, Asp, and Glu. These are listed in
Table 1-3, Electrically Charged (negative and hydrophilic) and Table 1-4, Electrically
Charged (positive and hydrophilic).
The hydrophobic effect is the force that is imposed on the overall structure by the
non-polar side chain groups. The association of the non-polar groups reduces the
collective surface area, and therefore the amount of water that can influence the protein’s
structure. This association forces the side-chains closer together.
Protein structures are classified according to a four-level hierarchy. These levels
range from a simple linear arrangement to complex aggregates of multiple substructures.
They are commonly referred to as the protein’s primary, secondary, tertiary, and
quaternary structures. A protein’s primary structure is the linear amino acid sequence
of the amino acid chain or chains. This sequence is what is commonly used to
describe the protein in the various databases. Secondary protein structures are local
structures of linear segments of amino acid backbone atoms that do not take into account
the effects of the side chains. The major arrangements found in the secondary structure
category are turns, sheets, and helices. These account for about 70% of the substructures
present in a protein. Tertiary structures are an organization of secondary structures linked
by weak interactions. These are best thought of as a three-dimensional arrangement of all
atoms in a single polypeptide chain. Quaternary structures are the aggregation of separate
polypeptide chains into the functional protein.
The primary protein structure is the linear arrangement of amino acids in the order
in which they appear in the protein. When describing a protein, the sequence begins at the
N-terminus and ends at the C-terminus. Once assembled into a primary structure, the
individual amino acids are referred to as amino acid residues. Frederick Sanger
reported the first amino acid sequence, that of the hormone insulin [Becker 2000].
The secondary structure of a protein is the result of the local interaction of the
amino acid residues. These interactions form three different structures or conformations.
The α-helix is also known as a repetitive secondary structure; it gets this name because the
relationship of one amino acid to the next is the same. Two parameters, “n” and “r”, are
used to characterize a general helix, and the convention nᵣ is used to
describe the helix. The “n” is the number of residues per turn and the subscript “r” is the
rise per helical residue. An α-helix is designated 3.6₄. It has 3.6 residues per turn and
rises 4 residues in height. In the helix, there is a possible hydrogen bond between every
fourth amino acid. This relationship allows an amino acid to form a bond with the amino
acids “above” it and “below” it. Figure 1-3, Coiled α-Helices [UWK 2003], shows this
relationship and the 3.6 residues per turn.
Figure 1-3 α-Helices
A beta strand is an amino acid string that does not form a coil. It zigzags in a
more extended way than a helix. One of three types of beta-sheets is formed when two or
more beta strands link side by side. The links are hydrogen bonds between the main-chain
carboxylate and amide groups in the amino acid chains. The three types of beta-sheets
are anti-parallel, parallel, and mixed. In anti-parallel sheets the strands run in opposite
directions, in parallel sheets the strands run in the same direction, and in the mixed
conformation there is a mix of anti-parallel and parallel strands. The beta sheet is
characterized by a maximum of hydrogen bonding. Unlike the intra-molecular hydrogen
bonds in the α-helix, the hydrogen bonds in the beta sheet run roughly perpendicular to the
strands, linking amino acids of different amino acid chains or distant members of the
same amino acid chain.
The β-turn is the third type of general secondary structure and involves about
one-third of residues in a globular protein. Turns are important substructures in proteins.
Antibody recognition, phosphorylation, glycosylation, hydroxylation, and intron/exon
splicing are found frequently at or adjacent to turns. It has also been proposed that turns
are a mechanism used for tertiary folding of globular proteins. Turns usually occur
between two anti-parallel beta strands and are generally less than seven residues in
length. The turn enables the amino acid chain to reverse itself by 180°. Turns come in
four types: gamma turns and Type I, Type II, and Type III turns. Turns are distinguished by
the hydrogen bonds between the i, i+1, i+2, and i+3 residues [Brook 2003]. Figure
1-4, Beta Sheets [UOFG 2003], illustrates how beta sheets are organized.
Figure 1-4 Beta Sheets
While the secondary protein structures form because of the repetitive nature of the
amino acid chains and the hydrogen bonds between the amino acids, the tertiary protein
structures develop mainly because of the variety in the amino acid side chains. The
tertiary structure is not a repetitive structure and is highly dependent on the interaction of
the side chains. For example, hydrophobic residues will be drawn to the center of the
protein while hydrophilic residues will seek other polar molecules, including water.
These interactions will force the tertiary structure to fold, bend, and twist in unpredictable
ways.
Stabilizing the tertiary structure is achieved through both covalent and
non-covalent bonds. The non-covalent stabilizers are the hydrogen bond, electrostatic, and
hydrophobic interactions. The most common stabilizing covalent bond is the disulfide
bond. This type of bond is formed between two linearly distant cysteines that are situated
near each other. A protein will maintain its stable shape for a given set of environmental
conditions.
The quaternary protein structure is formed by an aggregation of tertiary
components of the same or different proteins. This form of structure applies to multimeric
proteins. Many proteins belong to this class, particularly those of molecular weight
greater than 50,000 [Becker 2000a]. The same forces that stabilize the tertiary structure in
a particular environment stabilize these structures. Anfinsen proposed in his
"Thermodynamic Hypothesis" that the native conformation of a protein is adopted
spontaneously. In other words, there is sufficient information contained in the protein
sequence to guarantee correct folding from any of a large number of unfolded states
[Anfinsen 1973].
Coiled-Coil
The coiled-coil is a tertiary oligomerization domain that is formed when two or
more α-helices wrap around each other in a left-handed supercoil. Coiled-coils are found
throughout nature, occur in a wide variety of proteins, and play an important role in
basic biology. Two examples are the kinesin [Thormahlen 1998] and myosin
[Tripet 1997] proteins. Kinesin is a molecule that transports cellular components from
place to place in the cell. Its ability to do so is due in part to the coiled-coil.
Myosin, a fundamental protein used in muscle contraction, is another protein that
employs the coiled-coil conformation. Studies have shown that myosin depends, in part,
on the coiled-coil to function properly [Chakrabarty 2002]. In both these proteins, the
ability of the coiled-coil to uncoil, allowing the attached heads to move, gives the protein
the mobility needed to perform its function.
Coiled-coils have hydrophobic amino acids spaced at every third and
then every fourth residue within their sequences. A grouping of seven residues forms a
heptad repeat designated (abcdefg), where the ‘a’ and ‘d’ positions are occupied by
hydrophobic amino acids. An example of the heptad repeat pattern aligned with an amino
acid sequence is shown in Figure 1-5, Heptad Repeat. This figure shows the amino acid
sequence and directly below it the heptad repeat position each residue occupies.
Sequence:        CGG-EVGALKA-EVGALKA-QIGALQK-QIGALQK-EVGALKK
Heptad position:     gabcdef-gabcdef-gabcdef-gabcdef-gabcdef
Figure 1-5 Heptad Repeat
This pattern repeats and on average places a hydrophobic side-chain every 3.5
residues in the sequence. A typical α-helix has 3.6 residues per turn, so two turns span
about 7.2 residues, slightly more than one heptad.
Figure 1-6 Heptad Positions in a Coiled Coil.
In the coiled-coil, the two α-helices bury their hydrophobic residues in the center
of the coil, which causes the coiled-coil itself to form a super coil. These are depicted as
positions a, a`, d, and d` in Figure 1-6. The super coil character of the coiled-coil also
gives rise to other interactions within the individual α-helices and between the α-helices
in the super coil. A portion of a coiled-coil is illustrated in Figure 1-3, α-Helices. This
figure shows the relative position of the different amino acids in their heptad positions.
The ongoing research [Kwok 2003, Tripet 2000, Wagschal 1999] on coiled-coils
at the UCHSC and the University of Alberta has demonstrated that there are a number of
possible factors that determine whether a stable coiled-coil exists. Using these stability factors,
proteins can be evaluated to find possible coiled-coil domains. Once these domains are
found, they can be further studied. Information about the domain’s composition and other
statistics can be gathered and used to predict their presence in newly sequenced proteins.
CHAPTER 2
LITERATURE REVIEW
Protein Structure Analysis
Protein structure analysis is born out of the desire to determine protein
characteristics without doing it experimentally or through crystallography. These two
methods can be expensive and time consuming. Processes based on protein statistics and
past experimental data have been generalized to create methods and algorithms to provide
quick answers to protein structure questions. This chapter describes some of the
approaches that have been used to characterize proteins in general and coiled-coils in
particular.
Early Protein Structure Prediction
Early protein structure prediction algorithms [Chou 1974, Garnier 1978] were
derived by gathering statistics from a relatively small group of proteins. The statistics
related four different protein secondary structures to the amino acids that comprised
them. This information was then generalized in an attempt to predict the secondary
structures of other proteins. These approaches proved to be about 60%-65% accurate and
only considered the local amino acid neighborhood.
Outlined in a 1974 paper, “Conformational Parameters for Amino Acids in
Helical, β-Sheet, and Random Coil Regions Calculated from Proteins”, the Chou-Fasman
algorithm is one of the oldest algorithms that attempted to predict secondary
protein structures using a larger number of proteins [Chou 1974]. Previous attempts used
far fewer than the 15 proteins and 2400 residues used by Chou-Fasman. Up to this point,
the two Zimm-Bragg parameters, σ and s, were investigated for the individual amino
acids. σ is the cooperativity factor for helix initiation and s is the equilibrium constant for
converting a coil residue to a helix. These investigations led to some generalizations
about how some of the amino acids participate in certain conformations in some proteins.
Chou and Fasman studied all 20 amino acids in 15 proteins and compared the frequency
of the amino acids’ occurrence in various conformational states to the σ and s values. The
result of Chou and Fasman’s research was a better understanding of protein structure
prediction, which led them to develop a table of values called the “Frequency of Helical,
Inner Helical, β, and Coil Residues in the 15 Proteins with Their Conformation
Parameters Pα, Pαi, Pβ, and Pc.”
Derived from observed protein structures and their propensity to form different
structures, the Chou-Fasman parameter table consists of seven columns and has twenty
rows. The values assigned to each amino acid in the first three columns, P(α), P(β), and
P(turn), are roughly equivalent to the propensity of an amino acid to form an α-helix,
β-strand and hairpin turn respectively. To provide a sense of the information in the
Chou-Fasman parameter table, the first two rows of the table are listed below in
Table 2-1, Chou-Fasman Table.
Name      P(α)  P(β)  P(turn)  f(i)  f(i+1)  f(i+2)  f(i+3)
Alanine   1.42  0.83  0.66     0.06  0.076   0.035   0.058
Arginine  0.98  0.93  0.95     0.07  0.106   0.099   0.085
Table 2-1 Chou-Fasman Table
The Chou-Fasman algorithm can be explained in three parts. The first part detects
the presence of alpha helices, the second detects the presence of beta sheets and the third
part detects hairpin turns.
The helix detection algorithm starts by dividing the sequence into two regions.
The first region comprises areas where the amino acids have a Pα value greater than 1;
everything else is in the second region. Next, groups of four out of six residues having
Pα values greater than 1 are identified. These form the base regions of the helix. From
these bases, the amino acids immediately before and after are included in the proposed
helix until the region is found to contain four residues that have an average Pα value of
less than 1. These are the regions predicted as alpha helices. Beta sheets are predicted in
a similar fashion. This time regions of four out of six amino acids with Pβ values greater
than 1 are examined. These regions are expanded until four amino acids that average a Pβ
of less than 1 are found. A region is declared a beta sheet if over the entire region the Pβ
average is greater than 1 and the sum of all Pβ is greater than the sum of the Pα’s.
Beta turns are predicted by calculating a turn propensity value, Pt, for each
amino acid in the sequence based on that amino acid and the three that follow. Pt is the
product of the four positional values f(i), f(i+1), f(i+2), and f(i+3). If this product is
greater than 0.000075, the P(turn) average is greater than 1, and the sum of the P(turn)
values is greater than both the Pα sum and the Pβ sum, then the sequence is
predicted to turn at that point.
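The turn test described above can be sketched in a few lines of Python. This is an illustrative sketch, not the original Chou-Fasman implementation: the function names are hypothetical, and the parameter dictionary holds only the two rows shown in Table 2-1, so it handles only sequences made of Ala and Arg. A complete version would need all twenty rows of the published table.

```python
# Rows copied from Table 2-1; the other 18 amino acids are not listed here.
# Columns: P(alpha), P(beta), P(turn), f(i), f(i+1), f(i+2), f(i+3)
PARAMS = {
    "A": (1.42, 0.83, 0.66, 0.06, 0.076, 0.035, 0.058),
    "R": (0.98, 0.93, 0.95, 0.07, 0.106, 0.099, 0.085),
}

TURN_CUTOFF = 0.000075


def turn_product(seq, i, params=PARAMS):
    """Product f(i) * f(i+1) * f(i+2) * f(i+3) for the four residues starting at i."""
    product = 1.0
    for offset in range(4):
        product *= params[seq[i + offset]][3 + offset]
    return product


def is_turn(seq, i, params=PARAMS):
    """Apply the three turn conditions to residues i..i+3 (hypothetical helper)."""
    window = seq[i:i + 4]
    if len(window) < 4:
        return False
    sum_alpha = sum(params[aa][0] for aa in window)
    sum_beta = sum(params[aa][1] for aa in window)
    sum_turn = sum(params[aa][2] for aa in window)
    return (
        turn_product(seq, i, params) > TURN_CUTOFF
        and sum_turn / 4 > 1.0
        and sum_turn > sum_alpha
        and sum_turn > sum_beta
    )
```

Because both Ala and Arg have P(turn) values below 1, no window built only from these two residues passes the average test, which is consistent with their low turn propensity in the table.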
Garnier, Osguthorpe and Robson (GOR) sought to improve on the Chou-Fasman
algorithm in their 1978 paper, “Analysis of the Accuracy and Implications of Simple
Methods for Predicting the Secondary Structure of Globular Proteins” [Garnier
1978]. This paper was an attempt to describe and test the simple statistical procedures for
determining secondary protein structures that had been developed. The GOR paper took
particular interest in the performance of the Chou-Fasman algorithm. At the time, the
Chou-Fasman approach was considered one of the best ways to determine a protein’s
structure using amino acid statistics.
Ultimately, the GOR paper sets forth the GOR algorithm. Over the past 25 years
the GOR algorithm has been improved a number of times. The latest was set forth in
1996 in the form of GOR IV. Today the GOR algorithm is an alternative to the
Chou-Fasman algorithm in the area of statistical models.
The GOR algorithm, like the Chou-Fasman algorithm, seeks to predict four
secondary structures of a protein by evaluating the weighted position of the amino-acid
sequence. GOR divides the predicted structures into four types: helix, extended sheets,
turns and coils. The first three structures have been introduced earlier. The coil or
aperiodic state is defined as not being of the first three conformations. In developing this
method, the GOR algorithm used 30 proteins; the paper did not provide the number of
residues.
The GOR algorithm implementation is straightforward. The paper provides four
tables, one for each secondary structure to be predicted. Each of the tables lists all 20
amino acids and each acid has 17 spatial parameters derived from experimental
observations.
This implementation of the algorithm starts by progressively calculating an
information value, “I”, for each amino acid in the sequence. The “I” value is defined as
I(S_j; R_1, R_2, R_3, R_4, \ldots, R_{last}) = \sum_{m=-8}^{+8} I(S_j; R_{j+m})
where the sum has 17 terms, one for each offset m from -8 to +8.
The “I” value calculated for the jth amino acid is based on the preceding and succeeding
eight residues. In each of the tables, the 17 parameters are based on the acid’s relative
distance from the jth “I” value being calculated. The “I” values for each amino acid in the
sequence are calculated. This is done four times using values from each of the four
different tables. Once the four values are determined the one with greatest value
determines in which of the four structures the amino acid is likely to participate.
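The calculation can be illustrated with a short Python sketch. It assumes a hypothetical layout in which each of the four tables maps a residue to a list of 17 values indexed by m + 8; the function names and that layout are assumptions for illustration, and the real parameter values from the GOR paper are not reproduced here. The decision constant is not applied in this sketch.

```python
STATES = ("helix", "sheet", "turn", "coil")
HALF_WINDOW = 8  # 8 residues on each side of position j gives 17 window positions


def gor_information(seq, j, table):
    """Sum I(S_j; R_{j+m}) for m = -8..+8, skipping offsets past the sequence ends.

    table[residue] is assumed to be a list of 17 values indexed by m + 8.
    """
    total = 0.0
    for m in range(-HALF_WINDOW, HALF_WINDOW + 1):
        k = j + m
        if 0 <= k < len(seq):
            total += table[seq[k]][m + HALF_WINDOW]
    return total


def gor_predict(seq, tables):
    """Return, for each residue, the state whose summed information is largest."""
    return [
        max(STATES, key=lambda state: gor_information(seq, j, tables[state]))
        for j in range(len(seq))
    ]
```

Ties are resolved in favor of the first state listed in `STATES`, which is an arbitrary choice made for this sketch.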
Following the “I” calculation, another statistically determined value can be
applied. The decision constant, DC, can be used to further optimize the evaluation of
each of the four calculated “I” values. The DC values are determined on a
protein-by-protein basis.
There is a second approach outlined in the GOR paper. It is called the “single
residue information method.” As the name suggests, the only information considered is
the information that a residue carries about its own conformation. This approach was not
introduced to provide a simpler approach, but to see how much influence adjacent amino
acids have on the predicted structure.
Coiled-coil Characterizations
Predictions
Proteins can be statistically analyzed for important features like charge-clusters,
repeats, hydrophobic regions, and compositional domains. Because the coiled-coil is one
of the many important structural domains, much attention has been directed at developing
algorithms to determine its characteristics. The basic heptad repeat is what makes the
coiled-coil particularly conducive to computer-based characterization.
PAIRCOIL [Berger 1995] classifies coiled-coils using a statistical approach. It
uses a database of all known coiled-coil sequences from myosin, tropomyosin, and
intermediate filament proteins that was created by extracting sequences from the
GENpept database [OCGC 2003]. These sequences are heptad aligned and form the basis
for PAIRCOIL predictions. From these selected proteins, the conditional probability
that two amino acids are found in any two heptad positions is determined. The frequencies
are normalized and used to determine the probability that a pair of amino acids appears in
a heptad repeat. As a result, PAIRCOIL is able to distinguish two-stranded coiled-coils
from non-coiled-coils and is reported to produce no false positives or false negatives when
tested against the Brookhaven Protein Data Bank [Brook 2003].
A special coiled-coil is the ‘leucine zipper’. Bornberg-Bauer, Rivals and Vingron
[Bornberg 1996] used the Swiss-Prot protein database to retrieve annotated leucine
zippers, leucine-like zippers, and non-leucine zippers. They made the observation that
there can be two general classes of the leucine zipper. The strict zipper is characterized
by sequences that have leucine appearing regularly in the ‘d’ position in four or more
consecutive heptads. The relaxed zipper is a leucine zipper that has had one of the
leucines replaced by Met, Val, or Ile. Using the TRESPASSER program to predict the
presence of leucine zippers, they evaluated the three groups of proteins. Their results
showed that annotated leucine zippers in the Swiss-Prot database are not often predicted
to follow the strict or relaxed definition of leucine zippers. They did observe, however,
that leucine zippers frequently occur together with a DNA binding basic region (bZIP) or a
helix-loop-helix (bHLH-ZIP) domain. These two are hybrid zipper domains and both the
bZIP and bHLH-ZIP regions show coiled-coil characteristics. They concluded that the
presence of a coiled-coil is a better indicator of a leucine zipper than simply the presence
of a leucine repeat.
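The strict and relaxed definitions can be expressed as a simple scan. The sketch below is not the TRESPASSER program; it is a minimal illustration with hypothetical function names that assumes the ‘d’ positions are seven residues apart and tries every possible starting position rather than using a predicted heptad register.

```python
RELAXED_SUBSTITUTES = set("MVI")  # Met, Val, Ile


def classify_zipper(seq, start, heptads=4):
    """Classify the residues at start, start+7, ... (one per heptad, taken as 'd').

    Returns "strict" if all are Leu, "relaxed" if exactly one Leu is replaced by
    Met, Val or Ile, and None otherwise or when the sequence is too short.
    """
    positions = [start + 7 * k for k in range(heptads)]
    if positions[-1] >= len(seq):
        return None
    residues = [seq[p] for p in positions]
    leucines = residues.count("L")
    if leucines == heptads:
        return "strict"
    substitutes = sum(1 for r in residues if r in RELAXED_SUBSTITUTES)
    if leucines == heptads - 1 and substitutes == 1:
        return "relaxed"
    return None


def find_zippers(seq, heptads=4):
    """Scan every start position and report (start, kind) for each hit."""
    hits = []
    for start in range(len(seq)):
        kind = classify_zipper(seq, start, heptads)
        if kind:
            hits.append((start, kind))
    return hits
```

A hit from this scan says only that the leucine spacing is present. As the study above concluded, the presence of a coiled-coil is the better indicator, so a pattern match alone should not be read as a prediction.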
Stability
Coiled-coils have been shown to play an important role in many large proteins.
The coiled-coil conformation is found in elongated or fiber-forming proteins such as
myosin, alpha keratin, tropomyosin, and kinesin. Lauzon [Lauzon 2001] analyzed the role
played by coiled-coils in myosin and Tripet [Tripet 1997] examined the coiled-coil in the
kinesin “neck” region.
Kinesin is a microtubule-dependent motor protein. This type of protein is used to
transport other proteins and vesicles from location to location within cells. Kinesin has
two heads, a linker region, and a stalk. Movement is produced when the leading head
detaches from the microtubule, moves forward, and reattaches. The trailing head then
detaches from the microtubule and reattaches in a location closer to the leading head.
Kinesin travels from the negative to the positive end of the microtubule. Its
counterpart, dynein, travels from the positive to the negative end of the
microtubule.
The two heads of the kinesin are globular ATP-binding sites. These two regions
are joined through an alpha helical linker region to the stalk. The linker regions of the
two heads come together and form a coiled-coil stalk. The end of the kinesin is a light
chain region that is used to attach the kinesin to the vesicle being transported.
The “neck” of the kinesin is where the α-helix linker region joins the stalk. This
region forms a coiled-coil stabilized with the classic stabilizing factors plus additional
interactions. The “neck” can be seen as two separate segments, I and II. Segment I does
not have the classic characteristics of a coiled-coil and is considered to be less stable than
segment II [Thormahlen 1998]. Segment I has charged or hydrophilic residues in the
interface; this departs from the classical definition of a stable coiled-coil. Segment II
forms a more classical coiled-coil where the “a” and “d” positions are occupied by
hydrophobic residues.
The model advanced by Tripet [Tripet 1997] suggests that the coiled-coil region
of the “neck” coils and uncoils in response to binding site changes. Segment I is able to
uncoil more than segment II. The action in the model starts with one head bound to the
microtubule and the second detached. The coiled-coil region of the “neck” prevents
the detached head from finding a binding site on the microtubule. In response to the
leading head’s binding, a conformational change occurs that could cause a portion of the
“neck” coiled-coil to uncoil. This allows the trailing head to rotate and find a new binding
site toward the positive end of the microtubule. In this model the coiled-coil in
the neck is found to be a key element in the performance of the protein.
Myosin II is another protein where the coiled-coil conformation plays an
important role. The myosin II protein plays a fundamental role in muscle contractions and
cellular and intercellular mobility. Structurally, the myosin II protein and kinesin are
similar. Both have separate globular binding sites connected to a stalk or tail through a
polypeptide chain called a “neck”. The tails of myosin and kinesin are formed from
two helices coiled around each other, forming a coiled-coil.
The myosin protein has been studied to determine how the stability of the coiled-coil
neck region impacts the head-to-head interactions, force generation, and regulation
[Chakrabarty 2002]. The authors found that the coiled-coil conformation remains largely
intact in the presence and absence of actin and estimated that it would require about
56kJ/mol per residue to uncoil. Another study tested how important neck flexibility was
to the mechanical performance of the myosin [Lauzon 2001]. Its authors showed that the
presence of a stable coiled-coil region at the neck of the myosin significantly impairs the
mechanical performance of the myosin. They also found that a stable coiled-coil region
needed to be 15 heptads removed from the neck before normal mechanical function is
restored. Although the last two studies cited appear to contradict each other, these studies
demonstrate the important role the coiled-coil plays in different proteins.
Coiled Coil Stability Using Experimental Data
An approach being explored at the UCHSC seeks to determine the relative stability of a
coiled-coil substructure within a protein. It has been shown that the stability of a
coiled-coil varies with the residues that occupy ‘a’ and ‘d’ positions within the
hydrophobic core of a coiled-coil [Tripet 2000, Wagschal 1999]. Core stability may be an
indicator that a coiled-coil may be able to form, but this does not necessarily indicate a
coiled-coil is present. It has also been noted that if the structure within the protein is not
stable, the protein will not fold and function properly. This would naturally
lead one to conclude that the structure is not present.
The hydrophobic core of a protein has a great influence on the overall stability
and folding rate. It has been shown [Baldi 2000] that by modifying a protein’s
hydrophobic core by a single methyl group, the folding rate can be reduced and the
overall stability can be increased by between 0.8 and 2 kcal/mol. It was suggested that
this change is caused by the overall conformational strain within the core because of the
residue changes.
Studies have explored the relationship between selected hydrophobic core amino
acids and coiled-coil stability. Two studies examined the effects of replacing a single
amino acid with each of the 18 other amino acids on the stability and oligomerization
state of the protein. Both take a similar approach, replacing one of the
hydrophobic amino acids in the core of the coiled-coil. The first study replaced the amino
acid at the ‘a’ position. The second study replaced the amino acid in the ‘d’ position.
The results of these studies allowed the generation of a relative thermodynamic stability
scale for the 19 naturally occurring amino acids in the ‘a’ or ‘d’ position of a coiled-coil.
How do the constituent amino acids in the ‘a’ and ‘d’ positions in the
hydrophobic core of adjacent heptads affect overall stability and protein folding? A
hydrophobic cluster is defined as a consecutive string of three hydrophobic non-polar
amino acids in the hydrophobic core of a coiled-coil [Kwok 2003]. Kwok designed two
proteins with identical properties; the only difference between the two was the
number of hydrophobic clusters. Protein P2 had two clusters and protein P3 had three
clusters. The results of this study
showed that the P3 protein folded more often than P2 in benign buffer. It also
showed that P3 was more stable than P2. Kwok suggests that the differences between the
two proteins are due mainly to the burial of the non-polar surface. Kwok further suggests
that clusters may stabilize the proteins in structurally significant regions, while the
non-clustered areas are involved in conformational changes that allow for protein-protein
interactions.
CHAPTER 3
STABLE INPUT
UCHSC
Protein research at the UCHSC has used a model two-stranded, homo-stranded,
parallel coiled-coil protein to determine the effects that replacing different amino acids in
the sequence has on the relative stability of the protein. From this and other work [Kwok
2003], it is hoped that the relative and absolute stability of the protein can be determined.
There are a number of advantages of studying the coiled-coil domain. The advantages are
[TRI2 2003]:
• Abundant motif in proteins
• There is only one type of secondary structure present, i.e. the α-helix
• Only two interacting α-helices are required to introduce tertiary and
  quaternary structure
• Diversity in length makes it an ideal system to test predictions
• All the non-covalent interactions that stabilize the three-dimensional structure
  of proteins are found in coiled-coils
• Experimentally easy to analyze structure and stability
Being able to determine protein stability is important because a minimum
threshold of stability is required to initiate final protein folding and stability is intimately
involved in conformational changes and function of proteins [Kwok 2003, Lauzon 2001,
Chakrabarty 2002]. To expedite this work, an analysis tool is needed to calculate the
stability of an amino acid sequence. The “Stable Input” tool was developed in
conjunction with UCHSC to help the center determine coiled-coil stability over an entire
sequence prior to conducting a lengthy experiment.
Stable Input Parameters
An HTML graphical user interface, available on the University of
Colorado at Colorado Springs Computer Science department Linux server, provides input
to “Stable Input” [SI 2003]. This program gives biologists the opportunity to enter a
sequence, set parameters, and perform calculations based on custom or default parameter
values. The results are provided in the form of up to eight different graphs and a
tab-delimited text file of sequence values in kilocalories per mole (kcal/mol).
The input from the HTML program is parsed and a common gateway interface (CGI)
Perl program called “stable_coiled_sub.pl” calculates the results. The calculations are
based on either user inputs or program defaults. The user-settable inputs are summarized
below.
Sequence Information
1. Sequence
2. Sequence Name
3. Heptad Registry offset
4. Window width
Tabulated Input
1. Helical Propensity
2. Hydrophobic core stability between a and d’ and d and a’ positions
3. Intra-chain (i to i+3 or i to i+4) electrostatic interactions
4. Inter-chain (g-e’ or i to i’+5) electrostatic interactions
5. Hydrophobic Clusters
6. Entropy-Chain Length
The window width allows the user to determine the number of amino acids over which to
calculate the relative stability. There are two options, 7 and 11; a window size of 7 is the
default. The idea is that windowing the results for a
particular amino acid sequence will include the influence of at least one heptad on the
one amino acid being scored. The windowed point, when aligned with the amino acid
sequence, represents the stability trend for that position derived from the amino acids
slightly before and slightly after the current amino acid position. The beginning and end
of the sequence need special handling because there are too few amino acids to populate
a full window. The windowing algorithm is outlined below for a window width of 7. It
takes three parameters, the sequence array, the window array, and the window width, and
returns an array of the same length:
Windowing
INPUT: Raw Sequence Array
       Window Array
       Window width (7)
current position = 0
FOR EACH Amino Acid in ( Raw Sequence Array )
  IF ( current position >= 3 )
     and
     ( current position < (Raw Sequence length) - 3 )
  THEN
    Windowed Array [current position] =
      Σ Raw Sequence [i], i = current position - 3 .. current position + 3   (7 values)
  ELSE IF ( current position == 0 ) THEN
    Windowed Array [current position] =
      Σ Raw Sequence [i], i = 0 .. Window width/2   (4 values)
  ELSE IF ( current position == 1 ) THEN
    Windowed Array [current position] =
      Windowed Array [0] + Raw Sequence [4]
  ELSE IF ( current position == 2 ) THEN
    Windowed Array [current position] =
      Windowed Array [1] + Raw Sequence [5]
  ELSE IF ( current position == two from sequence end ) THEN
    Windowed Array [current position] =
      Σ Raw Sequence [i], i = current position - 3 .. sequence end   (6 values)
  ELSE IF ( current position == one from sequence end ) THEN
    Windowed Array [current position] =
      Σ Raw Sequence [i], i = current position - 3 .. sequence end   (5 values)
  ELSE IF ( current position == sequence end ) THEN
    Windowed Array [current position] =
      Σ Raw Sequence [i], i = current position - 3 .. sequence end   (4 values)
  current position = current position + 1
A similar approach is used to implement the 11 amino acid window width. The
major difference is that the beginning and ending partial windows are extended to include
5 positions before and after the current position.
The “beginning” case is handled by summing the first window/2+1 positions to
produce the 0th windowed result value; summing the first window/2+2 positions produces
the 1st windowed result value. This continues until the value for position window width/2 - 1
is calculated. A similar calculation produces the windowed values for the
“end” corner case. When the number of remaining positions falls below the window width,
the remaining values are used, until only the last four values are used for the last windowed
position. An example of this calculation is illustrated for a small sequence in Table 3-1,
Windowing Algorithm for Window = 7.
Position  Amino Acid  Table Value  Label  Values Used For Window  Windowed Value
0         D           1            A      A B C D                 8
1         F           1            B      A B C D E               9
2         Y           2            C      A B C D E F             11
3         H           4            D      A B C D E F G           14
4         L           1            E      B C D E F G H           14
5         A           2            F      C D E F G H I           15
6         D           3            G      D E F G H I J           14
7         E           1            H      E F G H I J K           13
8         R           2            I      F G H I J K L           13
9         G           1            J      G H I J K L M           14
10        H           3            K      H I J K L M N           12
11        A           1            L      I J K L M N O           12
12        L           3            M      J K L M N O P           12
13        V           1            N      K L M N O P Q           12
14        L           1            O      L M N O P Q             9
15        L           2            P      M N O P Q               8
16        I           1            Q      N O P Q                 5
Table 3-1 Windowing Algorithm for Window = 7
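The windowing rule can also be written as a centered sum that is truncated at the ends of the sequence. The Python sketch below is not the Perl code of the Stable Input tool; the function name is hypothetical. It is checked here against the per-residue values of Table 3-1.

```python
def windowed_sums(raw, width=7):
    """Sum raw values over a centered window, truncated at the sequence ends."""
    half = width // 2
    result = []
    for pos in range(len(raw)):
        lo = max(0, pos - half)
        hi = min(len(raw), pos + half + 1)
        result.append(sum(raw[lo:hi]))
    return result


# Per-residue table values from Table 3-1 (sequence DFYHLADERGHALVLLI).
raw_values = [1, 1, 2, 4, 1, 2, 3, 1, 2, 1, 3, 1, 3, 1, 1, 2, 1]
print(windowed_sums(raw_values))
# -> [8, 9, 11, 14, 14, 15, 14, 13, 13, 14, 12, 12, 12, 12, 9, 8, 5]
```

Passing `width=11` extends the window to five positions on each side of the current position, which matches the description of the 11 amino acid window.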
The heptad registry position parameter sets the heptad registry offset for the input
sequence. This parameter defaults to ‘g’ if not specified by the user. The heptad registry
offset determines the heptad registry position of the first amino acid in the sequence.
Having set the first registry position, the rest of the sequence is set according to the
heptad repeat (abcdefg)n. The registry position of the sequence is stored in a parallel
array and is used in all the calculations performed by the Stable Input tool.
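As an illustration, the parallel registry array can be built with a few lines of Python. This is a sketch under the assumptions stated in the text (default offset ‘g’, repeat abcdefg); the function name is hypothetical and the tool itself is written in Perl.

```python
HEPTAD = "abcdefg"


def heptad_registry(seq, offset="g"):
    """Return one heptad letter per residue, with the first residue at `offset`."""
    start = HEPTAD.index(offset)
    return "".join(HEPTAD[(start + i) % 7] for i in range(len(seq)))


# Two heptads from Figure 1-5, using the default offset 'g'.
print(heptad_registry("EVGALKAEVGALKA"))  # -> gabcdefgabcdef
```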
There are five experimentally determined parameter tables provided by UCHSC
that form the basis of all calculations. The user can override these tables by selecting the
custom radio button for any of the input parameters and providing a complete table of
values in the prescribed format. One or all of the five tables can be customized without
affecting the other tables.
Each of the five tables is formatted according to the information being described.
The helical propensity table contains one helical propensity value for each of the 20
amino acids. The Intra-Chain Electrostatics Interactions table contains a value for a select
set of amino acid pairs and their values are based on the spatial separation of the pair
members. The Inter-Chain E/G Electrostatic Interaction table has two values for each
amino acid. These values are based on whether the amino acid is in the ‘e’ heptad
position or the ‘g’ heptad position. The Hydrophobic core stability table also has two
values per amino acid. These values are based on the relative heptad position, either ‘a’
or ‘d’, for each amino acid. The entropy table has a single entry per amino acid and
represents the amount of energy that should be removed from the final stability
calculation based on the amino acids in the sequence.
All values in the five tables are listed in kcal/mol and
represent the amount of relative stability each of these amino acid interactions contributes
to the overall stability of the sequence. Some of these tables represent a characteristic of
the amino acid, such as helical propensity, whereas others are based on amino acid
interactions that depend not only on the amino acids involved but also on their relative
position to other amino acids within the coiled-coil.
Helical Propensity
The helical propensity value measures the effect a particular amino acid has on
the creation of a helix. The first propensity scale was actually a measure of the statistical
frequency that the different amino acids were found to occur in helices. Ala has the
highest helical propensity while Glu, Met, Leu, and Lys are slightly less helically prone.
Those amino acids with the least helical propensity are Gly, Ser, Thr, and Pro. Pro
actually disrupts helical formations.
Hydrophobicity
Hydrophobicity refers to the tendency of non-polar molecules to associate with
each other rather than with a polar substance such as water. The most hydrophobic
amino acids are those with aliphatic and aromatic non-polar side chains. An aliphatic
compound is one that is not aromatic; i.e., it lacks a particular arrangement of atoms in its
molecular structure. These amino acids are Ile, Met, Leu, and Val. An aromatic molecule
or compound is one that has special stability and properties due to a closed loop of its
electrons. Phe is an aromatic amino acid. The other amino acids Arg, Lys, Tyr, and Trp
have a mixture of hydrophobic, polar and charged characteristics. The experimental
tables used in Stable Input have ‘a’ and ‘d’ hydrophobic core stability values with the
helical propensity component removed. The hydrophobic core of a coiled-coil is depicted
looking down the axis of the coils in Figure 3-1, Coiled Coil A/D and E/G Interactions
[UOFG 2003]. The hydrophobic interaction between amino acids in the ‘a’ and ‘d’
positions is one of two interactions that occur between the amino acids of the different
coils.
Figure 3-1 Coiled Coil A/D and E/G Interactions
E/G Interactions
Also depicted in Figure 3-1, Coiled Coil A/D and E/G Interactions, is the relative
position of the amino acids in the ‘e’ and ‘g’ positions on the different coils. Leu, Ile,
Met, and Val are the only amino acids that affect stability when they occur in the ‘e’ or ‘g’
position. The other amino acids do not contribute to stability when found in
these positions. The arrows in Figure 3-2, Lateral View Coiled Coil E/G Interaction,
depict the relative positions of the ‘e’ and ‘g’ positioned amino acids along the
coiled-coil pair. The E/G interactions add to overall coiled-coil stability by creating bonds
between these amino acids and pulling the two coiled regions together.
Figure 3-2 Lateral View Coiled Coil E/G Interaction
Intra-Chain Electrostatic Interactions
The intra-chain electrostatic interaction is the interaction that occurs between
amino acids in the same coil. These interactions apply only between the charged amino
acids His, Arg, Lys, Asp, and Glu that are found at a distance of i+3, i+4, or i+5 from their
pair partner i. In the case of the pair partner being at a distance of i+5, this interaction is
applied only if the i+5 position is in the heptad position ‘e’ or ‘g’. This calculation
determines the additional stability gained by having charged amino acids above and
below the current amino acid. The spatial relationship is due to the relative positions of
the amino acid around the helix.
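The pairing rule can be sketched as follows. The Python function below only lists the qualifying residue pairs; the energy assigned to each pair would come from the Intra-Chain Electrostatic Interactions table, whose values are not reproduced here. The function name and the use of a registry string parallel to the sequence are assumptions made for illustration.

```python
CHARGED = set("HRKDE")  # His, Arg, Lys, Asp, Glu


def intra_chain_pairs(seq, registry):
    """List (i, j) index pairs of charged residues that qualify as intra-chain partners.

    registry is a string of heptad letters parallel to seq.
    """
    pairs = []
    for i, aa in enumerate(seq):
        if aa not in CHARGED:
            continue
        for gap in (3, 4, 5):
            j = i + gap
            if j >= len(seq) or seq[j] not in CHARGED:
                continue
            # The i+5 partner counts only when it sits at heptad position e or g.
            if gap == 5 and registry[j] not in "eg":
                continue
            pairs.append((i, j))
    return pairs
```

For example, with the sequence `EAAAKAA` and registry `gabcdef`, the Glu at index 0 and the Lys at index 4 form an i to i+4 pair.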
Clusters
When hydrophobic amino acids occupy the hydrophobic core of the coiled-coil in
consecutive heptads, stability increases [Kwok 2003]. The clustering of hydrophobic
amino acids is also considered in the Stable Input program. Considering only heptad
positions ‘a’ and ‘d’, Figure 3-3, Clustered Hydrophobic Core, illustrates clustering in
consecutive heptads. In the figure, the hydrophobic amino acids Phe, Ile, Leu, Met, Val,
and Tyr are shown as darkened circles, while all others are open circles. A cluster is defined
as starting and ending with three or more consecutive hydrophobic amino acids occupying
the hydrophobic core positions, with no more than one of these positions being occupied
by anything else.
Amino Acid Sequence
        gabcdef
Seq 1   EAEALKA-EIEALKA-KAEAAEG-KAEALEG-KIEALEG-KAEAAEG-KAEALEG-EIEALKA
Seq 2   EAEALKA-EAEALKA-KIEAAEG-KAEALEG-KIEALEG-KAEAAEG-KAEALEG-EIEALKA
Schematic Representation of Hydrophobic residue at a and d positions
        adadadadadadadad
Seq 1   3 Clusters
Seq 2   2 Clusters
Figure 3-3 Clustered Hydrophobic Core
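The cluster counts in Figure 3-3 can be reproduced with a short Python sketch. It makes one simplifying assumption that should be noted: it counts only uninterrupted runs of three or more hydrophobic residues at the ‘a’ and ‘d’ positions and does not implement the allowance for a single non-hydrophobic position inside a cluster. The function names are hypothetical.

```python
import re

HYDROPHOBIC = set("FILMVY")  # Phe, Ile, Leu, Met, Val, Tyr


def core_map(seq, registry):
    """Return a 0/1 string with one character per 'a' or 'd' position."""
    return "".join(
        "1" if aa in HYDROPHOBIC else "0"
        for aa, pos in zip(seq, registry)
        if pos in "ad"
    )


def count_clusters(seq, registry, min_run=3):
    """Count runs of at least min_run consecutive hydrophobic core positions."""
    return len(re.findall("1{%d,}" % min_run, core_map(seq, registry)))


# Sequences copied from Figure 3-3, with the heptad separators removed.
seq1 = "EAEALKA-EIEALKA-KAEAAEG-KAEALEG-KIEALEG-KAEAAEG-KAEALEG-EIEALKA".replace("-", "")
seq2 = "EAEALKA-EAEALKA-KIEAAEG-KAEALEG-KIEALEG-KAEAAEG-KAEALEG-EIEALKA".replace("-", "")
registry = "gabcdef" * 8
print(count_clusters(seq1, registry), count_clusters(seq2, registry))  # -> 3 2
```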
Entropy
The entropy table has one experimentally determined entropy value per amino
acid. These values represent the change in system entropy due to the presence of each of
the 20 amino acids. Entropy is a measure of the energy distribution within a system. As
an example, the amino acid Pro does not appear in helical conformations. The entropy
table shows that Pro changes the entropy by 17 kcal/mol; this suggests that a large change
in entropy indicates a decrease in coiled-coil stability.
Program Flow
Program flow is illustrated in Figure 3-4, Stable Input Program Flow. The
program begins by reading the five stability parameter tables into five hash tables. The
keys for the hash tables are either the amino acids or, in the case of the intra-chain
interactions table, the amino acid pairs. Each amino acid is then assigned a heptad
position. The first amino acid position is determined by the user; the rest of the positions
follow the heptad repeat pattern (abcdefg)n. The heptad positions are saved in a parallel
array. Parallel arrays are also used to save the data derived from the five input tables
corresponding to each amino acid.
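The heptad assignment and the table-driven parallel arrays described above can be sketched in a few lines. This is a minimal illustration in Python, not the Stable Input source; the function names are hypothetical, and only three helical propensity values (taken from Table 3-2) are included to keep the example short.

```python
# Hypothetical helper names; only the (abcdefg)n repeat comes from the text.
HEPTAD = "abcdefg"

def assign_heptad_positions(sequence, first_position):
    """Return a list parallel to `sequence` with each residue's heptad position."""
    start = HEPTAD.index(first_position.lower())
    return [HEPTAD[(start + i) % 7] for i in range(len(sequence))]

def parallel_array(sequence, table, default=0.0):
    """Look up every residue in a hash table, giving a sequence-aligned array."""
    return [table.get(residue, default) for residue in sequence]

# Three helical propensity values from Table 3-2, enough for a demonstration.
helical = {"A": 0.53, "L": 0.45, "K": 0.39}

seq = "KLAALKA"
positions = assign_heptad_positions(seq, "g")   # user-supplied first position
scores = parallel_array(seq, helical)
for row in zip(seq, positions, scores):
    print(row)
```

Keeping every attribute in its own sequence-aligned list means any later step can index all of them with the same residue position.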
Input (Sequence, Heptad Offset, Tables, Req. Graphs)
-> Input Tables to Hash Tables
-> Create Heptad Registry Array
-> Create Parallel Arrays from Tables
-> Apply Cluster Algorithm
-> Apply Windowing Algorithm
-> Windowed Stability = ∑ Window Arrays; Non-Windowed Stability = ∑ Non-Window Arrays
-> CGI Output (Table/Graphs)
Figure 3-4 Stable Input Program Flow
A parallel array is also used to save the final cluster map. A cluster map is used in
the program to identify the regions in the sequence that form hydrophobic core clusters
and includes the bridging positions that separate adjacent clusters. The pseudocode,
Cluster Algorithm, below outlines the process of creating the cluster map. This
routine takes the input parameters listed in the pseudocode and returns an array that
includes a 1 in each amino acid position that participates in a cluster.
Cluster Algorithm
INPUT : Raw Sequence Array
      : Parallel Hydrophobe Map Array
      : Parallel Heptad Array
      : Initial Heptad Offset
LOCAL : Cluster Map
      : Position = 0
      : Next = 0
FOR EACH Amino Acid In Raw Sequence Array {
    IF Parallel Heptad Array [Position] = "A" OR "D" THEN
        IF Amino Acid = PHE, ILE, LEU, MET, VAL, TYR THEN
            Cluster Map [Next] = 1
        ELSE
            Cluster Map [Next] = 0
        Next = Next + 1
    Position = Position + 1
}
WHILE Sub Pattern In Cluster Map {
    Sub Pattern = /((1{2,}((\s{1}1{1})*(\s{1}))1{2,})|(1{2,}))/g
    Cluster Bridge = Replace Sub Pattern With All 1's
}
Position = Next = 0
FOR EACH Amino Acid In Raw Sequence Array {
    IF Parallel Heptad Array [Position] = "A" OR "D" THEN
        IF Cluster Bridge [Next] = 1 THEN
            Parallel Hydrophobe Map Array [Position] = 1
        ELSE
            Parallel Hydrophobe Map Array [Position] = 0
        Next = Next + 1
    ELSE
        Parallel Hydrophobe Map Array [Position] = 0
    Position = Position + 1
}
An examination of all the amino acids in the hydrophobic core’s ‘a’ and ‘d’
heptad positions is used to create the cluster map. All hydrophobic amino acids (Phe, Ile,
Leu, Met, Val, or Tyr) that are present in the hydrophobic core are marked with a 1; this
produces the hydrophobe map. After the sequence has been processed, the hydrophobe
map is condensed to remove all positions but those of the hydrophobic core. This is the
cluster map. At this point the cluster map no longer lines up with the sequence positions
and is better suited for cluster pattern searches. Once a cluster pattern is found, the
entire cluster region is marked with 1’s; this produces a bridge map. It is called a bridge
map because the clustered areas are bridged by non-hydrophobic amino acids that will be
included in the cluster. Figure 3-5, Clusters, illustrates which cluster map patterns are
bridged and which are not.
Cluster Map     Cluster Bridge
11011011        11111111
10111011        00111111
11010101        00000000
01011010        00000000
11100111        11100111
Figure 3-5 Clusters
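The bridging step can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than the thesis code: the cluster map is taken to be a string of 1's and 0's (the pseudocode pattern uses a whitespace class instead), single-position gaps between runs of two or more 1's are bridged, and runs shorter than three are then cleared. The last rule is inferred from the rows of Figure 3-5, where an isolated pair of 1's is not reported as a cluster. The name `bridge_clusters` is hypothetical.

```python
import re

# Assumed encoding: '1' = hydrophobic core position, '0' = anything else.
# A run of two or more 1's, any number of "01" steps, one more '0', then two or more 1's.
BRIDGE = re.compile(r"1{2,}(?:01)*01{2,}")

def bridge_clusters(cluster_map):
    """Return the bridge map for a condensed 'a'/'d' cluster map."""
    previous = None
    while previous != cluster_map:          # repeat until no pattern is left to fill
        previous = cluster_map
        cluster_map = BRIDGE.sub(lambda m: "1" * len(m.group()), cluster_map)
    # Assumption inferred from Figure 3-5: runs shorter than three are not clusters.
    return re.sub(
        r"1+",
        lambda m: m.group() if len(m.group()) >= 3 else "0" * len(m.group()),
        cluster_map,
    )

for row in ["11011011", "10111011", "11010101", "01011010", "11100111"]:
    print(row, bridge_clusters(row))
```

Looping until the string stops changing matters for inputs such as 11011011, where the first substitution only fills the first gap and a second pass is needed to join the remaining run.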
After all clusters have been found in the sequence, the bridge map is expanded
back using the starting heptad offset provided by the user. Figure 3-6, Mapping Example,
is an example of this process.
Heptad Position   GABCDEFGABCDEFGABCDEFGABCDEFG
Amino Acid        AMHTISCWHKRLDEKLPAKKRSIKRMKAC
Hydrophobe Map    01001000000100010000001001000
Cluster Map       11011011
Cluster Bridge    11111111
Final Map         01001000100100010010001001000
Figure 3-6 Mapping Example
The final map serves as a per-amino-acid multiplier when the total stability is calculated
for that particular amino acid in the sequence.
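The condensing and expanding steps can be sketched as two small functions. This is an illustrative Python sketch with hypothetical names (`condense`, `expand`), using the heptad string and sequence from the mapping example. Because the bridge map there is all 1's, every ‘a’ and ‘d’ position of the expanded map receives a 1 and every other position a 0.

```python
HYDROPHOBIC = set("FILMVY")   # Phe, Ile, Leu, Met, Val, Tyr

def condense(sequence, heptads):
    """Cluster map: one flag per 'a'/'d' position, in sequence order."""
    return "".join(
        "1" if residue in HYDROPHOBIC else "0"
        for residue, position in zip(sequence, heptads)
        if position in "AD"
    )

def expand(bridge_map, heptads):
    """Final map: bridge flags written back at 'a'/'d' positions, 0 elsewhere."""
    flags = iter(bridge_map)
    return "".join(next(flags) if position in "AD" else "0" for position in heptads)

heptads = "GABCDEFGABCDEFGABCDEFGABCDEFG"
sequence = "AMHTISCWHKRLDEKLPAKKRSIKRMKAC"

cluster_map = condense(sequence, heptads)
final_map = expand("11111111", heptads)   # bridge map from the mapping example
print(cluster_map)
print(final_map)
```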
The five stability parameter tables are used to create sequence-aligned arrays for
the attributes being evaluated. The helical propensity attribute is filled in by a simple
hash look-up. This attribute is not dependent on heptad position or its relation to any other
amino acid. When completed, each amino acid in the sequence has a helical propensity
value. Table 3-2, Helical Propensity Values, shows all values used in the default case; they
were derived experimentally and provided by UCHSC [TRI 2003]. The helical propensity
values listed are the amount of stability these amino acids add to the relative stability of
the protein. Note that Pro has the only negative value and is considered a helix killer
when found in a sequence.
Amino Acid      Single Letter   Helical Propensity Score (kcal/mol)
Alanine         A                0.53
Cysteine        C                0.24
Aspartic acid   D                0.12
Glutamic acid   E                0.18
Phenylalanine   F                0.26
Glycine         G                0.00
Histidine       H                0.18
Isoleucine      I                0.33
Lysine          K                0.39
Leucine         L                0.45
Methionine      M                0.37
Asparagine      N                0.18
Proline         P               -2.5
Glutamine       Q                0.34
Arginine        R                0.50
Serine          S                0.18
Threonine       T                0.15
Valine          V                0.23
Tryptophan      W                0.27
Tyrosine        Y                0.24
Table 3-2 Helical Propensity Values
The hydrophobic core stability between the a and d’ and the d and a’ positions is
dependent on the heptad positions of the amino acids. In this case a sequence-aligned
array is generated that contains a value only for amino acids in the ‘a’ and ‘d’ positions.
The hash lookup for this parameter is keyed on the amino acid and depends on its heptad
position. For example, if an amino acid is in the ‘a’ heptad position it will receive a
different score than the same amino acid in the ‘d’ heptad position. Table 3-3,
Hydrophobic Core Values, shows the default values used in the calculations [TRI 2003].
Amino Acid      Single Letter   Position A   Position D
Alanine         A                0.72         1.27
Cysteine        C                0.72         1.27
Aspartic acid   D               -0.63         0.78
Glutamic acid   E                0.07         0.27
Phenylalanine   F                2.49         2.14
Glycine         G                0.00         0.00
Histidine       H                0.47         1.22
Isoleucine      I                2.87         2.97
Lysine          K                0.66         0.51
Leucine         L                2.55         3.25
Methionine      M                2.58         3.03
Asparagine      N                1.52         1.32
Proline         P               -5.00        -5.00
Glutamine       Q                0.86         1.71
Arginine        R                0.35        -0.15
Serine          S                0.42         0.72
Threonine       T                1.20         1.05
Valine          V                3.07         2.12
Tryptophan      W                1.38         1.48
Tyrosine        Y                2.11         2.26
Table 3-3 Hydrophobic Core Values
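A position-dependent lookup of this kind can be sketched as follows. The Python below is illustrative only; the function name is hypothetical, and just four residues from Table 3-3 are included, so any residue outside that subset scores 0 in this demonstration.

```python
# (position 'a' score, position 'd' score) for four residues from Table 3-3.
CORE = {
    "A": (0.72, 1.27),
    "I": (2.87, 2.97),
    "L": (2.55, 3.25),
    "V": (3.07, 2.12),
}

def hydrophobic_core_scores(sequence, heptads):
    """Sequence-aligned array with values only at 'a' and 'd' positions."""
    scores = []
    for residue, position in zip(sequence, heptads):
        a_score, d_score = CORE.get(residue, (0.0, 0.0))
        if position == "a":
            scores.append(a_score)
        elif position == "d":
            scores.append(d_score)
        else:
            scores.append(0.0)      # positions b, c, e, f, g carry no core score
    return scores

print(hydrophobic_core_scores("LKALEEV", "abcdefg"))
```

The same leucine contributes 2.55 at position ‘a’ and 3.25 at position ‘d’, which is the position dependence the paragraph above describes.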
The third sequence-aligned array is for the intra-chain (i to i+3, i to i+4, and i to
i+5(g)) electrostatic interactions. This calculation is not only sensitive to which heptad
position the amino acid is in, but is also dependent on the amino acids at sequence
distances of i+3, i+4, and i+5. In this calculation, consideration is given only to amino acid
pairs consisting of Asp, Glu, Lys, Arg, and His at the i and i+3 and i+4 positions. If the
amino acid at the ith position is in heptad registry position ‘c’ or ‘a’, then the amino acid
in the i+5 position is considered too, and the same pairing restriction applies. Table 3-4,
Intra-Chain Effect, lists the default values used [TRI 2003].
Residue Pair   i to i+3 Score   i to i+4 Score   i to i+5 Score (e/g)
Lys-Glu         0.2              0.2              0.4
Lys-Asp         0.2              0.2              0.4
Arg-Glu         0.2              0.2              0.4
Arg-Asp         0.2              0.2              0.4
His-Glu         0.2              0.2              0.4
His-Asp         0.2              0.2              0.4
Glu-Lys         0.2              0.2              0.4
Glu-Arg         0.2              0.2              0.4
Glu-His         0.2              0.2              0.4
Asp-Lys         0.2              0.2              0.4
Asp-Arg         0.2              0.2              0.4
Asp-His         0.2              0.2              0.4
Glu-Glu        -0.2             -0.2             -0.4
Glu-Asp        -0.2             -0.2             -0.4
Asp-Asp        -0.2             -0.2             -0.4
Asp-Glu        -0.2             -0.2             -0.4
Lys-Lys        -0.2             -0.2             -0.4
Lys-Arg        -0.2             -0.2             -0.4
Lys-His        -0.2             -0.2             -0.4
Arg-Arg        -0.2             -0.2             -0.4
Arg-Lys        -0.2             -0.2             -0.4
Arg-His        -0.2             -0.2             -0.4
His-His        -0.2             -0.2             -0.4
His-Lys        -0.2             -0.2             -0.4
His-Arg        -0.2             -0.2             -0.4
Table 3-4 Intra-Chain Effect
When scoring the intra-chain interaction, 1/2 of the table score is given to each residue
position. If more than one interaction can occur at any of the pair positions, the values
assigned to the amino acid are added together.
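The half-score bookkeeping can be illustrated with a short Python sketch. It is not the thesis implementation. Table 3-4 is encoded by charge sign, which reproduces every row: opposite charges score positive, like charges negative. The i+5 interaction is counted only when the partner sits in heptad position ‘e’ or ‘g’, which follows the table header; the text also describes a restriction on the position of residue i, so the `i5_partner_positions` parameter is an assumption that can be changed. All names are hypothetical.

```python
POSITIVE = set("KRH")   # Lys, Arg, His
NEGATIVE = set("DE")    # Asp, Glu

def pair_value(first, second, magnitude):
    """Signed Table 3-4 score: attractive pairs positive, repulsive pairs negative."""
    charges = []
    for residue in (first, second):
        if residue in POSITIVE:
            charges.append(1)
        elif residue in NEGATIVE:
            charges.append(-1)
        else:
            return 0.0              # pairs outside Table 3-4 contribute nothing
    return magnitude if charges[0] != charges[1] else -magnitude

def intra_chain_scores(sequence, heptads, i5_partner_positions=("e", "g")):
    """Sequence-aligned array of intra-chain electrostatic contributions."""
    scores = [0.0] * len(sequence)
    for i, residue in enumerate(sequence):
        for offset, magnitude in ((3, 0.2), (4, 0.2), (5, 0.4)):
            j = i + offset
            if j >= len(sequence):
                continue
            if offset == 5 and heptads[j] not in i5_partner_positions:
                continue            # assumed gating of the i+5 interaction
            value = pair_value(residue, sequence[j], magnitude)
            scores[i] += value / 2  # half of the table score to each residue
            scores[j] += value / 2
    return scores

print(intra_chain_scores("EAAAK", "abcde"))
```

Because contributions are accumulated with `+=`, a residue that takes part in several pairs ends up with the sum of its half scores.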
The fourth sequence-aligned array that is created is the inter-chain (g-e’ or i to
i’+5) electrostatic interactions array. This array is created by considering only those
amino acids in the heptad registry positions ‘e’ and ‘g’. Figure 3-1 and Figure 3-2
illustrate the positional interactions between the two amino acids. Hash lookups for this
parameter are straightforward and are dependent only on position. These interactions are
just outside the hydrophobic core and very few amino acids participate. Ile, Leu, Met,
and Val are the amino acids that have been identified as being significant in these
positions. Table 3-5, Inter-Chain Electrostatics, lists the default values used [TRI 2003].
Amino Acid   Position e Score   Position g Score
Ile          0.7                0.8
Leu          0.7                0.8
Met          0.4                0.5
Val          0.4                0.5
Table 3-5 Inter-Chain Electrostatics
Output Table
Appendix A, Tabulated Output, is an example of the 19-column tab-delimited
table produced by Stable Input. The first column is the sequence position number, the
second column is the amino acid in that sequence position, and the third column is the
heptad registry position. Columns 4, 5, 6, and 7 are the values assigned to that amino acid
based on four of the five input parameter tables. The tenth column is the final cluster
map. Clusters can be identified by 1’s marking consecutive ‘a’ and ‘d’ heptad positions.
Columns 11, 12, 13, and 14 are the helical propensity, intra-chain electrostatic interaction,
hydrophobic core, and inter-chain electrostatic interaction values after the windowing
algorithm has been applied. The remaining columns are derived from the values found in
the five parameter tables and the cluster map.
Column 8, Relative Stability, is the position-specific relative stability value for
each amino acid. Amino acids found in clusters are given a full hydrophobicity score in
this calculation. If no clusters are present, no hydrophobicity score is added to the relative
stability score.
\[ \text{Relative Stability}[i] = \text{Heli Propensity}[i] + \text{AD Electro}[i] + \text{EG Electro}[i] + \text{Cluster}[i] \times \text{Hydro}[i] \]
Column 9, Windowed Relative Stability, applies the windowing function to the
position-specific Relative Stability values calculated above.
Total Stability, column 16, takes into account the entropy in the coiled-coil.
Entropy was introduced late in the project to help reconcile the deviation between the
results obtained using only the four parameter tables and the experimental data when
longer coiled-coils were used [TRI 2003]. In his research, Dr. Tripet noticed that as the
coiled-coil length was increased, the experimentally measured stability values differed
from the calculated values. The chain length effect, as it has become known, is an
informal theory advanced to help account for these differences. To assist, the program
was modified to account for entropy in the coiled-coil. The total stability calculation is
made by removing the total accumulated entropy from the total accumulated stability.
\[ \text{Total Stability}[i] = \sum_{j=0}^{i} \left( \text{Relative Stability}[j] - \text{Entropy}[j] \right) \]
Column 17, Running Stability, is the accumulated stability calculated from the
four input parameters. This represents the amount of stability the sequence gains as a
result of the chain length.
\[ \text{Running Stability}[i] = \sum_{j=0}^{i} \text{Relative Stability}[j] \]
Column 18, Density Stability, is an attempt to normalize the Running Stability
value based on the number of amino acids that were used to determine it. This value is
the total accumulated stability divided by the number of residues used to calculate it.
\[ \text{Density Stability}[i] = \frac{1}{i} \sum_{j=0}^{i} \text{Relative Stability}[j] \]
Finally, Density Window, column 19, holds the windowed values obtained by
applying the window function to the Density Stability column. These columns are used in
the graphical output.
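The three accumulated columns follow directly from running sums. The Python sketch below is a hedged illustration with hypothetical names. One choice is an assumption rather than something the text fixes: with zero-based indexing, the density divides by i + 1, the number of residues summed so far, which also avoids a division by zero at the first position.

```python
from itertools import accumulate

def stability_columns(relative, entropy):
    """Return (total, running, density) lists aligned with the sequence."""
    running = list(accumulate(relative))
    total = list(accumulate(r - e for r, e in zip(relative, entropy)))
    # Assumption: divide by the count of residues summed so far (i + 1).
    density = [value / (i + 1) for i, value in enumerate(running)]
    return total, running, density

relative = [1.0, 2.0, 3.0]      # made-up per-residue values for illustration
entropy = [0.5, 0.5, 0.5]
total, running, density = stability_columns(relative, entropy)
print(total, running, density)
```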
Output Graphs
When requested, graphs are generated based on the tabulated values. The graphs
are generated using the Linux-based program GNUPLOT. This program is called directly
from the Stable Input program and stores the .PNG file in the local directory. The
tabulated and graphical output follows a naming convention that prevents the current
operating directory from getting cluttered by old data. This convention is the system time
stamp plus an extension indicating which file it describes. Table 3-6, File Extensions,
shows which files are associated with which data set.
File Extension   Data set associated
*all.png         Graphic with all values graphed
*hel.png         Helical Propensity
*ele.png         Intra-Chain Electrostatic Interactions
*e_g.png         Inter-Chain Electrostatic Interactions
*hyd.png         Hydrophobic Core
*chl.png         Stability with Entropy
*den.png         Stability Density
*sum.png         Total Stability
*ent.png         Windowed Entropy
*.text           Tabulated results
Table 3-6 File Extensions
After all the calculations are complete and the graphs generated, the output is
written to an HTML-formatted page. The text file is written to a local file and an HTML
link is provided in the HTML output. The HTML output is formatted to include the
protein sequence with position markers, the initial heptad offset, and all the graphs
requested by the user. After each run, all the old graphs and tabulated data files are
deleted and replaced with the new files. To further assist the biologist, the individual
graphs of the last run are saved in the cgi-bin directory on the server.
The graphs are produced using the GNUPLOT program installed on the Linux
server. GNUPLOT uses the various columns from the output text file as the data points
for the graphs. The graphs use the amino acid sequence position as the X-axis; the Y-axis
is the column information found in the text file. The summary (*all.png) plot is a
composite of four columns in the text file and requires GNUPLOT to re-plot the graph
for each of the columns used. It has four different line graphs, one for each of the four
input parameters, and a point plot. The point plot marks the sequence positions that
represent the clustered ‘a’ and ‘d’ positions. In all the graphs, a color- and symbol-coded
legend is set in the upper right-hand portion of the graph.
Figure 3-7, Tropomyosin Sequence, shows the sequence used to demonstrate the graphical
output of Stable Input. Figures 3-8 through 3-18 were produced using the tool with the
heptad registry option set to A. Appendix A has the table output generated by the Stable
Input program.
  0   MDAIKKKMQM LKLDKENALD RAEQAEADKK AAEDRSKQLE DELVSLQKKL KGTEDELDKY
 60   SEALKDAQEK LELAEKKATD AEADVASLNR RIQLVEEELD RAQERLATAL QKLEEAEKAA
120   DESERGMKVI ESRAQKDEEK MEIQEIQLKE AKHIAEDADR KYEEVARKLV IIESDLERAE
180   ERAELSEGKC AELEEELKTV TNNLKSLEAQ AEKYSQKEDR YEEEIKVLSD KLKEAETRAE
240   FAERSVTKLE KSIDDLEDEL YAQKLKYKAI SEELDHALND MTSI
Figure 3-7 Tropomyosin Sequence
Figure 3-8, Summary Output, shows the summary plot produced. This plot and
all other plots produced have as the X-axis the sequence position and as the Y-axis the
Relative stability in kilo-calories per mole. The Summary Output plot shows a composite
of the windowed helical propensity, E/G and A/D interactions, and the clustered
positions. The clustered regions are identified as points at their respective positions in the
sequence.
This graph shows that in nature a coiled-coil protein, such as tropomyosin, has
significant regions of high helical propensity and hydrophobic clusters. This graph also
shows a correlation between the clustered regions and A/D stability. Since this graph is
an analysis of the entire protein, there are regions in the protein that are not stable
coiled-coils. In the graph this is shown in the region around position 175, where all the
indicators of coiled-coil stability fall off significantly: no clusters form, helical
propensity is low, and the A/D stability is very low.
Figure 3-8 Summary Output
Figure 3-9 Total Stability
Figure 3-9, Total Stability, is the sum of all stability factors. It shows that the
tropomyosin protein has a great amount of stability in many different regions.
These are regions in the protein that can be examined in greater detail to determine what
amino acids are in these regions.
Figure 3-10 A/D Hydrophobic Stability
Figure 3-10, A/D Hydrophobic Stability, is a graph that shows the amount of
stability gained because of the interaction between the amino acids in the ‘a’ and ‘d’
positions. Positions ~45 to ~75, ~80 to ~115, and ~235 to ~275 are three regions in the
tropomyosin protein that stand out as having high hydrophobic contributions to
stability. There are other regions, but these stand out because they represent large trend
regions.
Figure 3-11 Helical Propensity
Figure 3-11, Helical Propensity, shows the windowed helical propensity values
for the individual amino acids. This graph shows regions where coil formation is
favored. Since a single amino acid cannot form a coil, the windowing of the helical
propensity shows the propensity for a region. The strongest region shown here is
between ~60 and ~120, but to a lesser degree the entire protein shows a propensity
to form coils.
Figure 3-12 E/G Electrostatic Interaction
Figure 3-12, E/G Electrostatic Interaction, windowed or not, is one of the sparsest
graphs generated. The E/G interactions are based on finding two charged (Lys, Glu,
Asp, Arg, or His) amino acids in the ‘e’ or ‘g’ heptad positions. The sparseness indicates
that the tropomyosin protein does not rely heavily on the E/G interactions for stability.
Figure 3-13 Chain Length
Figure 3-13, Chain Length, is a graph that shows the average amount of stability
gained for each additional amino acid. The idea is that as the length of the sequence
increases there would be a corresponding increase in stability. This part of the research
continues to present difficulties, and the theory is not settled [TRI 2003]. Keeping in
mind that all output graphs are optional, this graph was included here because, as the
theory becomes more refined, the algorithm can be changed and this graph can become
meaningful.
Figure 3-14 Density Stability
Figure 3-14, Density Stability, is a graph of Figure 3-13 with the windowing
algorithm applied. This graph shows how the stability changes over the length of the
protein by dividing the relative stability accumulated as each new amino acid is added by
the number of amino acids used to calculate it.
CHAPTER 4
COILED-COIL CLUSTER ANALYSIS
Why Coiled-Coils?
This chapter describes the analysis of hydrophobic amino acid clusters in the ‘a’
and ‘d’ heptad positions in coiled-coil proteins of length 42 amino acids or greater. The
‘a’ and ‘d’ positions are the only significant positions because they form the hydrophobic
core of the coiled coil. The minimum sequence length of 42 was chosen because the
stabilizing effect has been observed when there are at least 3 minimum-length (3 amino
acid) clusters separated by at least one minimum-length destabilizing cluster. The Kwok
cluster experiments used two proteins. The first protein had three 3-amino-acid clusters
and two 3-amino-acid destabilizing clusters, and the second protein had two 3-amino-acid
clusters, one 3-amino-acid destabilizing cluster, and one 2-amino-acid destabilizing
cluster [Kwok 2003]. A sequence length of 42 is the smallest full heptad length in which
3 minimum-length clusters can be observed.
Hydrophobic interactions contribute significantly to protein stability because
the burial of the hydrophobic surfaces is thermodynamically favorable in aqueous
solutions. Hydrophobic core clustering may play an important role in the structure and
the function of long native coiled-coil proteins as well as be an important mechanism for
long coiled-coil proteins to maintain chain integrity. Hydrophobic core clusters can also
serve as “knots” to keep the chain together while allowing flexible regions to
function. The stabilizing regions can control protein stability in structurally important
regions, and destabilizing clusters of a coiled-coil may be involved in conformational
changes that allow protein-to-protein interactions. Finally, the hydrophobic core clusters
are natural nucleation sites for protein folding intermediates. Investigating the structural
and functional roles of hydrophobic clusters will improve the understanding of the
mechanism of coiled-coils and protein folding in general [Kwok 2003]. The hydrophobic
core can also contain non-hydrophobic amino acids, which form destabilizing clusters
that separate stabilizing clusters. Hydrophobic clusters are those clusters of the amino
acids Phe, Ile, Leu, Met, Val, and Tyr, while destabilizing clusters are those clusters of the
amino acids Ala, Ser, Thr, Gln, Asp, Glu, and Lys. Both cluster types are characterized in
this analysis.
Protein Database Analysis
This analysis compares annotated coiled-coil domain data to that of a complete
database of all protein sequences after each has been pre-processed through a modified
coiled-coil prediction algorithm. This analysis will attempt to find answers to five
questions concerning coiled-coil clusters. First, how often do the hydrophobic amino
acids (Phe, Ile, Leu, Met, Val, and Tyr) occur in the ‘a’ and ‘d’ positions; second, what are
the lengths of these clusters; third, what amino acids are present in clusters of different
lengths; fourth, how are the various amino acids distributed in clusters of various
lengths; and fifth, given that coiled-coils always start with stabilizing clusters, can these be
characterized, and if so how? Two sources of data were used to find the answers to these
questions. The first source of data comes from the annotated coiled-coil domain found in
the Swiss-Prot database via SWall on the European Bioinformatics Institute (EBI) servers
[EBI 2003]. The second source of data is the entire protein database found in Swiss-Prot
and TrEMBL [SIB 2003], or SPTR, database via the ExPASy server [EXP 2003]. Since
the SPTR data is a collection of all proteins, a method for determining where in the
proteins coiled-coils may appear is necessary. Both datasets are pre-processed using an
algorithm similar to that found in the Stable Coil program to identify stable coil regions.
After both sets of data have been processed, two working files of data are produced. The
dataset derived from the annotated coiled-coils is referred to as the Swiss-Prot dataset and
the dataset derived from the entire Swiss-Prot TrEMBL database is referred to as the
SPTR dataset.
SPTR dataset
The SPTR dataset was derived from data retrieved from the ExPASy Molecular
Biology Server. The ExPASy (Expert Protein Analysis System) proteomics server of the
Swiss Institute of Bioinformatics is dedicated to the analysis of protein sequences.
ExPASy provides a number of different tools, databases, and other documentation
dedicated to the study of proteins. This server provides access to the Swiss-Prot and
TrEMBL Protein Knowledgebase. Swiss-Prot is a protein sequence database that strives
to provide a high level of annotation (such as the description of the function of a protein,
its domain structure, post-translational modifications, variants, etc.), a minimal level of
redundancy, and a high level of integration with other databases. TrEMBL is a
computer-annotated supplement of the Swiss-Prot database that contains all the translations
of EMBL nucleotide sequence entries not yet integrated in the Swiss-Prot database. As of
mid-August 2003, Swiss-Prot Release 41.20 had 132675 entries.
The ExPASy server provides a file that contains a copy of the latest Swiss-Prot
database. From the Swiss-Prot TrEMBL web page, a link can be followed to the database
file download page. The complete database is available on CD or by FTP. FTP
downloads can be done from seven different mirror sites. The SPTR dataset used in this
analysis was built from the weekly updated, complete, non-redundant database from the US
mirror site. The database contained 132000 formatted protein entries packed into 55
megabytes. The raw data found in this file is in the format shown in Figure 4-1, SPTR
Protein Entry. This entry contains the entry name, 108_LYCES; the primary accession
number, Q43495; and the protein name, Protein 108 precursor - Lycopersicon esculentum
(Tomato). The rest of the entry is the protein sequence.
sp|Q43495|108_LYCES (Q43495) Protein 108 precursor. - Lycopersicon esculentum (Tomato).
MASVKSSSSSSSSSFISLLLLILLVIVLQSQVIECQPQQSCTASLTGLNVCAPFLVPGSPTASTECCNA
VQSINHDCMCNTMRIAAQIPAQCNLPPLSCSAN
Figure 4-1 SPTR Protein Entry
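An entry in this layout can be split into its parts with ordinary string operations. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes the header fields are separated by ‘|’ exactly as in the entry shown, with an optional leading ‘>’ that the figure does not display.

```python
def parse_sptr_entry(lines):
    """Split one entry (header line plus sequence lines) into named fields."""
    header, *sequence_lines = lines
    database, accession, rest = header.lstrip(">").split("|", 2)
    entry_name, _, description = rest.partition(" ")
    return {
        "database": database,
        "accession": accession,
        "entry_name": entry_name,
        "description": description.strip(),
        "sequence": "".join(line.strip() for line in sequence_lines),
    }

entry = parse_sptr_entry([
    "sp|Q43495|108_LYCES (Q43495) Protein 108 precursor. - Lycopersicon esculentum (Tomato).",
    "MASVKSSSSSSSSSFISLLLLILLVIVLQSQVIECQPQQSCTASLTGLNVCAPFLVPGSPTASTECCNA",
    "VQSINHDCMCNTMRIAAQIPAQCNLPPLSCSAN",
])
print(entry["accession"], entry["entry_name"], len(entry["sequence"]))
```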
Swiss-Prot Coiled-Coils
Swiss-Prot TrEMBL is a nearly non-redundant protein database consisting of the
SwissProt, SwissProtNew, TrEMBL, and TrEMBLNEW data repositories [SPTR 2003].
This database will provide the sub-sequences for queried coiled-coil domains. In this
case, the coiled-coil domain is sought for all annotated proteins. What follows is the
method used to retrieve the coiled-coil domain sequences. Unfortunately the coiled-coil
information is not available in a concise format.
The Swiss-Prot database [SWPR 2003] provides two classes of data: the core
protein data and the annotations. The core data has the sequence data, the citation
information (bibliographical references), and the taxonomic data (description of the
biological source of the protein). The annotation data consists of the description of the
functions of the protein, domains and sites, secondary structure, quaternary structure,
similarities to other proteins, diseases associated with any number of deficiencies in the
protein, and sequence conflicts and variants.
A query of the Swiss-Prot TrEMBL database for the coiled-coil domain with the
protein sequence option set will return a list of 70 proteins from SWall (SPTR on the EBI
SRS server), with an additional 2600 proteins available on further linked pages. Figure
4-2, Coiled-Coil Retrieval, shows the method, implemented in PERL, used to retrieve the
coiled-coil sequences.
Open SWall (2643 Proteins)
-> Parse and Save all Protein Accession Numbers
-> Assemble “GET” Command from Protein Accession Data
-> Parse and Save all Coiled Coil Domain Ref’s
-> Assemble “GET” Command from Coiled Coil Refs
-> Parse Coiled Coil Domain Information
-> Save Coiled Coil Sequences
Figure 4-2 Coiled-Coil Retrieval
To get the additional pages, the database is re-queried with the display set to display as
many as 3000 entries on a single page. The retrieved page contains a link to every SWall
entry and another link through the accession number to all 2600 proteins that contain a
coiled-coil reference.
Following one of the SWall entry links opens a detailed page describing the
protein. Near the bottom of the page, the feature section shows the different domain types
identified in the protein. Each of these domains is a link to another page with detailed
information about the particular domain. For the purpose of this analysis, only the Coiled
Coil (POTENTIAL) domain is of interest. Each of the protein links will have at least one
Coiled Coil domain link, but could also have multiple domain links. The coiled-coil
domain link provides a page that greatly simplifies coiled-coil sub-sequence retrieval.
The sub-sequence pages contain only the sequence, sequence ID, length, and start/end
position.
Retrieving the coiled-coil data from each entry is done in a three-step process.
First, a list of HTML links is extracted from the 2600-entry protein query page. Second,
the HTML links are used to form a PERL GET call to retrieve the Swiss-Prot entry page
for each protein. The contents of all the Swiss-Prot pages are parsed to find all the
COILED COIL (POTENTIAL) links. The final step uses the coiled-coil links
to form another PERL GET call to retrieve the page that has only the details of the
coiled region.
The information retrieved from this page is shown in Figure 4-3, Coiled Coil
Entry. This entry has the identifier A2S3_HUMAN_1, and A2S3_HUMAN is the parent
protein in which it was found. This is followed by the domain identification and the
sequence positions in which it was found. Finally, the domain sequence is displayed along
with the length of the region.
ID   A2S3_HUMAN_1; parent: A2S3_HUMAN
FT   DOMAIN      134    354       COILED COIL (POTENTIAL).
SQ   Sequence   221 AA;
     QALLKRNHVL SEQNESLEEQ LGQAFDQVNQ LQHELCKKDE LLRIVSIASE ESETDSSCST
     PLRFNESFSL SQGLLQLEML QEKLKELEEE NMALRSKACH IKTETVTYEE KEQQLVSDCV
     KELRETNAQM SRMTEELSGK SDELIRYQEE LSSLLSQIVD LQHKLKEHVI EKEELKLHLQ
     ASKDAQRQLT MELHELQDRN MECLGMLHES QEEIKELRSR S
//
Figure 4-3 Coiled Coil Entry
From this page, the coiled-coil sub-sequence, name, and start and end position
information are saved to a file.
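Extracting these fields from a retrieved entry can be sketched as below. The thesis used PERL for this step; the Python here is only an illustration, the function name is hypothetical, and the line layout (two-letter tags ID, FT, and SQ, indented sequence lines, and a closing //) is assumed from the coiled-coil entry shown above.

```python
ENTRY = """\
ID   A2S3_HUMAN_1; parent: A2S3_HUMAN
FT   DOMAIN      134    354       COILED COIL (POTENTIAL).
SQ   Sequence   221 AA;
     QALLKRNHVL SEQNESLEEQ LGQAFDQVNQ LQHELCKKDE LLRIVSIASE ESETDSSCST
     PLRFNESFSL SQGLLQLEML QEKLKELEEE NMALRSKACH IKTETVTYEE KEQQLVSDCV
     KELRETNAQM SRMTEELSGK SDELIRYQEE LSSLLSQIVD LQHKLKEHVI EKEELKLHLQ
     ASKDAQRQLT MELHELQDRN MECLGMLHES QEEIKELRSR S
//
"""

def parse_coil_entry(text):
    """Collect name, parent, start/end, length, and sequence from one entry."""
    record = {"sequence": ""}
    for line in text.splitlines():
        tag, body = line[:2], line[2:].strip()
        if tag == "ID":
            name, _, parent = body.partition(";")
            record["name"] = name.strip()
            record["parent"] = parent.replace("parent:", "").strip()
        elif tag == "FT":
            fields = body.split()            # DOMAIN, start, end, description...
            record["start"], record["end"] = int(fields[1]), int(fields[2])
        elif tag == "SQ":
            record["length"] = int(body.split()[1])
        elif tag == "//":
            break                            # end-of-entry marker
        else:
            record["sequence"] += body.replace(" ", "")
    return record

record = parse_coil_entry(ENTRY)
print(record["name"], record["start"], record["end"], len(record["sequence"]))
```

Comparing the parsed sequence length with the stated length and with end minus start plus one gives a cheap consistency check before an entry is saved.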
The saved file is not perfect. There are over 2600 protein links that were followed.
This process took over one hour and forty minutes using a high-speed network
connection. During this process there were a number of “server time-out” errors that
were also written to the output file. Multiple attempts were made to get an error-free run,
but this was not possible, so these entries had to be removed by hand. Of the 2600 links
followed, about nine “time-out” errors were found. Since this was a small proportion of
all the links, removing them should not affect the overall results.
Stable Coil Pre-Processing
Before any analysis can begin, the specific coiled regions of each sequence need
to be determined. Coiled-coils are composed of multiple coils that wrap around
each other. The individual coils are not necessarily aligned on the same heptad registry.
To identify the coils on different heptad alignments, the modified Stable Coil algorithm is
used. Even though the Swiss-Prot dataset has already identified purported coiled-coil
regions, the individual coils have not been identified. Using the modified Stable Coil
algorithm, both datasets can be processed to determine specific coiled regions and the
heptad registry offset in which they exist.
Stable Coil is offered by PENCE, the Canadian Protein Engineering Network [SCP
2003], and is a program designed to predict the location and stability of alpha-helical
coiled-coil conformations within protein sequences. The program uses experimentally
derived alpha-helical propensity and stability coefficients as reported by [Zhou 1994,
Wagschal 1999 and Tripet 2000]. By summing the residue scores over variable window
widths and comparing the total score assigned to each amino acid to those of known
globular and cytoskeletal coiled-coil-containing sequences, the program displays the region
and probability (in kcal/mol) that a particular sequence will adopt a coiled-coil conformation.
The modified version of the algorithm uses a 42 amino acid window, with the threshold
for treating a sequence as a coiled region set to 38 kcal/mol.
The modified Stable Coil analysis algorithm uses coil stability and helical
propensity to identify coiled regions. Each sequence is processed seven times, once for
each heptad position. Each amino acid has the combined helical propensity and stability
coefficient applied to it based on its heptad registry position. The value assigned to the
amino acid position is determined by which heptad position it occupies in the heptad
alignment. The amino acid position is set to one of three different values depending on
whether the amino acid is in the ‘a’, ‘d’, or one of the other five positions. Table 4-1,
Helical Propensity and Stability Values, lists the values that are used.
Amino Acid   Position A   Position D   Other
A             1.245        1.8          0.528
C             1.245        1.8          0.237
D            -0.75         0.9          0.116
E             0.255        0.45         0.176
F             2.75         2.4          0.264
G             0            0            0
H             0.67         1.4          0.182
I             3.185        3.3          0.325
K             1.045        0.9          0.385
L             2.985        3.7          0.446
M             2.96         3.4          0.369
N             1.67         1.5          0.182
P           -10          -10           -5
Q             1.18         2.05         0.336
R             0.86         0.35         0.495
S             0.605        0.9          0.182
T             1.345        1.2          0.154
V             3.295        2.35         0.231
W             1.635        1.75         0.27
Y             2.285        2.5          0.237
Table 4-1 Helical Propensity and Stability Values
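Scoring a sequence in each of the seven heptad registers can be sketched as follows. This Python is an illustration with hypothetical function names, not the modified Stable Coil code. The values are those of Table 4-1, with the second value column taken to be the ‘d’ position, as the text describes.

```python
RESIDUES = "ACDEFGHIKLMNPQRSTVWY"
POS_A = [1.245, 1.245, -0.75, 0.255, 2.75, 0, 0.67, 3.185, 1.045, 2.985,
         2.96, 1.67, -10, 1.18, 0.86, 0.605, 1.345, 3.295, 1.635, 2.285]
POS_D = [1.8, 1.8, 0.9, 0.45, 2.4, 0, 1.4, 3.3, 0.9, 3.7,
         3.4, 1.5, -10, 2.05, 0.35, 0.9, 1.2, 2.35, 1.75, 2.5]
OTHER = [0.528, 0.237, 0.116, 0.176, 0.264, 0, 0.182, 0.325, 0.385, 0.446,
         0.369, 0.182, -5, 0.336, 0.495, 0.182, 0.154, 0.231, 0.27, 0.237]
TABLE = {r: (a, d, o) for r, a, d, o in zip(RESIDUES, POS_A, POS_D, OTHER)}

def score_in_register(sequence, first_position):
    """Per-residue values when the first residue sits at `first_position`."""
    start = "abcdefg".index(first_position)
    scores = []
    for i, residue in enumerate(sequence):
        heptad = "abcdefg"[(start + i) % 7]
        a_value, d_value, other_value = TABLE[residue]
        if heptad == "a":
            scores.append(a_value)
        elif heptad == "d":
            scores.append(d_value)
        else:
            scores.append(other_value)
    return scores

def score_all_registers(sequence):
    """One scored array per heptad register, keyed by the first residue's position."""
    return {position: score_in_register(sequence, position) for position in "abcdefg"}

print(score_in_register("LAAL", "a"))
```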
After applying these values to the sequence, windowing is applied to locate coils.
Starting with the sequence values and a zeroed parallel array, a window of the first 42
values is summed. This sum is written to a position in the parallel array if the present
value in that position is less than the new sum. This process is repeated until the entire
parallel array is set. After the windowing process is complete, the regions that have at
least 3 heptads with a value greater than 38 are deemed to be coiled regions. These
regions are then extracted from the sequences and saved along with the heptad registry
position with which they were found. When preprocessing is complete, all coiled regions
in all the protein sequences are identified and each coiled sequence has a starting heptad
offset assigned to it. These new sequences are placed in one of two new datasets that are
used in this analysis. The first dataset, containing 2817 coil sequences, is the Swiss-Prot
data, which originally came from the Swiss-Prot coiled-coil annotated database. The
second dataset, containing 67358 coil sequences, is the SPTR data, which was derived
from the entire SPTR database.
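The scoring and windowing steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Stable Coil source: the function names are hypothetical, the window is assumed to advance one residue at a time, "at least 3 heptads" is read as a run of at least 21 residues, and the second value column of Table 4-1 is assumed to apply to the ‘d’ position.

```python
# (a, d, other) values transcribed from Table 4-1. Treating the table's second
# column as the 'd' position is an assumption based on the surrounding text.
TABLE_4_1 = {
    "A": (1.245, 1.8, 0.528), "C": (1.245, 1.8, 0.237),
    "D": (-0.75, 0.9, 0.116), "E": (0.255, 0.45, 0.176),
    "F": (2.75, 2.4, 0.264), "G": (0.0, 0.0, 0.0),
    "H": (0.67, 1.4, 0.182), "I": (3.185, 3.3, 0.325),
    "K": (1.045, 0.9, 0.385), "L": (2.985, 3.7, 0.446),
    "M": (2.96, 3.4, 0.369), "N": (1.67, 1.5, 0.182),
    "P": (-10.0, -10.0, -5.0), "Q": (1.18, 2.05, 0.336),
    "R": (0.86, 0.35, 0.495), "S": (0.605, 0.9, 0.182),
    "T": (1.345, 1.2, 0.154), "V": (3.295, 2.35, 0.231),
    "W": (1.635, 1.75, 0.27), "Y": (2.285, 2.5, 0.237),
}
HEPTAD = "abcdefg"
WINDOW = 42        # window width in amino acids
THRESHOLD = 38.0   # cutoff for a coiled region


def score_sequence(seq, offset):
    """Per-residue values when seq[0] sits at heptad position HEPTAD[offset]."""
    values = []
    for i, aa in enumerate(seq):
        pos = HEPTAD[(i + offset) % 7]
        a_val, d_val, other = TABLE_4_1[aa]
        values.append(a_val if pos == "a" else d_val if pos == "d" else other)
    return values


def window_max(values, window=WINDOW):
    """Parallel array holding, per position, the largest window sum covering it."""
    best = [0.0] * len(values)
    for start in range(len(values) - window + 1):
        total = sum(values[start:start + window])
        for i in range(start, start + window):
            if best[i] < total:
                best[i] = total
    return best


def coiled_regions(best, threshold=THRESHOLD, min_len=21):
    """Half-open (start, end) runs above the threshold that span >= min_len residues."""
    regions, start = [], None
    for i, value in enumerate(best + [float("-inf")]):  # sentinel closes a final run
        if value > threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions
```

Under these assumptions a sequence would be scored once per heptad offset (0 through 6) and the regions from each pass kept together with the offset that produced them.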
Coil Analysis
The Swiss-Prot and SPTR datasets have a great variety of sequence lengths. A graph
depicting this variety in total sequence length is shown in Figure 4-4, Normalized Length
Frequency. This graph shows that both datasets have a similar distribution of sequence
length when normalized to the greatest sequence length in the set. Both datasets
recorded their most frequent length at 44 amino acids. The Swiss-Prot dataset was
normalized to 216 sequences and the SPTR dataset was normalized to 8660 sequences.
[Chart: Normalized Length Frequency. Normalized count versus sequence length (42 to 105 amino acids) for the Swiss-Prot and SPTR datasets.]
Figure 4-4 Normalized Length Frequency
Having collected about 70000 coil sequences between the two datasets, the first question to
be answered is: at what frequency do the hydrophobic amino acids Phe, Ile, Leu, Met,
Val, and Tyr occupy the hydrophobic core positions ‘a’ and ‘d’?
The frequencies at which the different amino acids appear in the hydrophobic core
are shown in Figure 4-5, Amino Acid in A and D positions 6&7 Heptads - SPTR, and
Figure 4-6, Amino Acid in A and D positions 6&7 Heptads - Swiss-Prot. These two figures
show the data for sequences that are 6 and 7 heptads in length in both the ‘a’ and ‘d’
heptad positions. Going from left to right, the bars in each graph represent the frequency
of the amino acid in the ‘a’ then the ‘d’ position, with the 6 heptad data first, followed by
the 7 heptad data.
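As an illustration of how such frequencies could be tabulated, the following Python sketch counts the amino acids found at the ‘a’ and ‘d’ positions of a set of coil sequences. The function name and the input format (pairs of residue string and starting heptad offset) are hypothetical, and normalizing by the total count at each position is an assumption about how the charts were normalized.

```python
from collections import Counter

HEPTAD = "abcdefg"


def core_frequencies(coils):
    """coils: iterable of (residues, offset), offset = heptad index of residue 0."""
    counts = {"a": Counter(), "d": Counter()}
    for residues, offset in coils:
        for i, aa in enumerate(residues):
            pos = HEPTAD[(i + offset) % 7]
            if pos in counts:
                counts[pos][aa] += 1
    freqs = {}
    for pos, counter in counts.items():
        total = sum(counter.values())
        # Fraction of all 'a' (or 'd') positions occupied by each amino acid.
        freqs[pos] = {aa: n / total for aa, n in counter.items()} if total else {}
    return freqs
```

Restricting the input to coils of a single length (for example 6 or 7 heptads) would give one set of bars per length.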
[Chart: 6&7 Heptad Amino Acid SPTR Data. Normalized count by amino acid (A through Y) for the ‘a’ and ‘d’ positions in 6 heptad and 7 heptad sequences.]
Figure 4-5 Amino Acid in A and D positions 6&7 Heptads - SPTR
[Chart: 6&7 Heptad Amino Acid Swiss-Prot Data. Normalized count by amino acid (A through Y) for the ‘a’ and ‘d’ positions in 6 heptad and 7 heptad sequences.]
Figure 4-6 Amino Acid in A and D positions 6&7 Heptads - Swiss-Prot
In both sets of data, the raw numbers for the first two full heptads show that
Leu is the dominant amino acid in both the ‘a’ and the ‘d’ position, but Leu is
preferred in the ‘d’ position. Ile and Val are the next two dominant amino acids. These
two are preferred in the ‘a’ position in the Swiss-Prot data, but in the SPTR data the
preference is strong for Val and almost even for Ile. Phe appears to favor the ‘a’ position
in the SPTR data but is very sparse in the Swiss-Prot data. Surprisingly, the
non-hydrophobic amino acid Ala appears in the ‘a’ and ‘d’ positions more often than Tyr in
both datasets and favors the ‘d’ position.
[Chart: Amino Acid Frequency Swiss-Prot. Normalized count by amino acid (A through Y) for heptad position A and heptad position D.]
Figure 4-7 Normalized Amino Acid Distribution Swiss-Prot
[Chart: Amino Acid Frequency SPTR. Normalized count by amino acid (A through Y) for heptad position A and heptad position D.]
Figure 4-8 Normalized Amino Acid Distribution SPTR
Figure 4-7, Normalized Amino Acid Distribution Swiss-Prot, and Figure 4-8,
Normalized Amino Acid Distribution SPTR, show the relative frequency with which the
amino acids appear in the ‘a’ and ‘d’ positions for both sets of data. For the Swiss-Prot
dataset the ‘a’ position is dominated by Leu, Ile, Val, Lys, Asn, Arg, and Ala, and the ‘d’
position by Leu, Ala, Ile, Val, Lys, Gln, and Met. The SPTR data shows that the ‘a’
position is dominated by Leu, Ile, Val, Phe, Ala, and Tyr, and the ‘d’ position by Leu, Ile,
Val, Ala, Phe, Met, and Tyr. Both of these datasets show that Ala competes with the
hydrophobic amino acids in occurrence frequency. Other studies (Tripet 2000, Wagschal
1999, Lupas 1991) have found that Leu is most likely to be found in the ‘a’ and ‘d’
positions, followed by the other hydrophobic amino acids, with a strong showing of Ala in
both the ‘a’ and ‘d’ positions. The strongest disagreement was in the frequency with which
Met occurred. This study showed it was consistently one of the least likely hydrophobic
amino acids to occur in the ‘a’ and ‘d’ positions, but in the other studies Met was the third
most likely hydrophobic amino acid to appear in the ‘a’ and ‘d’ positions.
The stabilizing effect that clusters have on longer sequence chains has been seen
experimentally. Do long protein chains have more clusters? If they do, how are they
characterized?
To answer this question, the clusters found in all the sequences in both datasets are
examined. Sequences with a minimum length of 42 amino acids, or 6 heptads, are examined
and compared. The distribution of the normalized cluster count across all sequence lengths
is shown in Figure 4-9, Normalized Cluster by Heptad Length. This figure shows the
total number of clusters of length three or greater that appear in the various length
sequences. The Swiss-Prot dataset has 5526 clusters and the SPTR dataset has 102718.
Figure 4-9 shows that the Swiss-Prot dataset has a slight propensity for having fewer
clusters in shorter sequences than the SPTR dataset. As the sequences get longer the
cluster count for both sets of data falls off, but the SPTR data diminishes more rapidly
than the Swiss-Prot data. While the SPTR dataset approaches no clusters counted beyond
12 heptads in length, the Swiss-Prot count remains relatively consistent from length 12
through 19 heptads. Since the Swiss-Prot data comes from the coiled-coil data set, this
data seems to suggest that clusters are important in longer coiled-coils.
[Chart: Total Clusters in Heptad Lengths. Normalized cluster count versus sequence length in heptads (6 to 22) for the Swiss-Prot and SPTR datasets.]
Figure 4-9 Normalized Cluster by Heptad Length
When considering the number of hydrophobic amino acids in any given length,
how often do the coiled sequences have them in the hydrophobic core positions ‘a’ and
‘d’? Do the coils in nature have a minimum number of hydrophobic amino acids in their
hydrophobic core?
The frequency of hydrophobic amino acids in the coiled sequences is determined
by counting the number of times a hydrophobic amino acid appears in the ‘a’ and ‘d’
positions for all sequence lengths. The results are summarized in Table 4-2, Phe, Ile, Leu,
Met, Val, and Tyr Frequency Swiss-Prot, and Table 4-3, Phe, Ile, Leu, Met, Val, and Tyr
Frequency SPTR. Both tables are based on the number of ‘a’ and ‘d’ positions found in
the coiled sequence. The percentages are the average number of hydrophobic amino
acids found in all sequences of a given length, divided by the number of ‘a’ and ‘d’
positions available in that heptad length.
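The calculation behind the two tables can be sketched as follows. The helper names are hypothetical and the sketch assumes every coil in a group spans the same number of core positions. For example, a 6 heptad coil has 12 core positions, so an average of 7.75 hydrophobic residues corresponds to 7.75 / 12, or about 0.65.

```python
HYDROPHOBIC = frozenset("FILMVY")  # Phe, Ile, Leu, Met, Val, Tyr
HEPTAD = "abcdefg"


def core_occupancy(residues, offset=0):
    """Return (hydrophobic core residues, total core positions) for one coil."""
    core = [aa for i, aa in enumerate(residues)
            if HEPTAD[(i + offset) % 7] in ("a", "d")]
    return sum(aa in HYDROPHOBIC for aa in core), len(core)


def summarize(coils):
    """coils: list of (residues, offset) pairs of equal core length.

    Returns the average hydrophobic count and that average as a fraction
    of the available 'a' and 'd' positions.
    """
    counts = [core_occupancy(residues, offset) for residues, offset in coils]
    positions = counts[0][1]
    average = sum(hydro for hydro, _ in counts) / len(counts)
    return average, average / positions
```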
Heptad   Ave       Seqs    %
Length   Hydroph   Found   Hydroph
6        7.75      557     0.65
6+1      8.57      311     0.66
7        9.04      309     0.65
7+1      9.71      232     0.65
8        10.16     178     0.63
8+1      10.97     155     0.65
9        11.27     135     0.63
9+1      12.13     112     0.64
10       12.92     75      0.65
10+1     13.33     83      0.63
11       14.04     80      0.64
11+1     14.52     82      0.63
12       15.95     44      0.66
12+1     16.49     41      0.66
13       17.03     29      0.66
13+1     17.04     48      0.63
14       18.03     35      0.64
14+1     18.84     37      0.65
15       18.52     33      0.62
15+1     19.52     23      0.63
16       19.65     20      0.61
16+1     21.11     18      0.64
17       21.06     16      0.62
17+1     21.15     20      0.6
18       23.06     17      0.64
18+1     23.27     15      0.63
19       24        17      0.63
19+1     25.17     6       0.65
20       25.43     7       0.64
20+1     23.56     9       0.57
21       27.4      15      0.65
21+1     25.44     9       0.59
22       28.7      10      0.65
22+1     27.83     6       0.62
23       29.17     12      0.63
23+1     30        1       0.64
24       29        1       0.6
24+1     30.4      5       0.62
25       32        1       0.64
25+1     34.5      2       0.68
Table 4-2 Phe, Ile, Leu, Met, Val, and Tyr Frequency Swiss-Prot
Heptad   Ave       Seqs    %
Length   Hydroph   Found   Hydroph
6        8.2       27039   0.68
6+1      8.89      12137   0.68
7        9.48      7673    0.68
7+1      10.1      5204    0.67
8        10.72     3966    0.67
8+1      11.26     2719    0.66
9        11.97     1978    0.66
9+1      12.69     1500    0.67
10       13.44     1279    0.67
10+1     13.99     892     0.67
11       14.74     736     0.67
11+1     15.49     452     0.67
12       16.7      240     0.7
12+1     17.16     222     0.69
13       17.34     160     0.67
13+1     17.73     175     0.66
14       18.74     123     0.67
14+1     19.51     133     0.67
15       19.49     136     0.65
15+1     20.25     75      0.65
16       20.65     51      0.65
16+1     22.04     52      0.67
17       22.34     50      0.66
17+1     22.34     38      0.64
18       23.48     33      0.65
18+1     24.32     31      0.66
19       24.4      30      0.64
19+1     25.27     15      0.65
20       26.64     11      0.67
20+1     26.71     21      0.65
21       27.79     19      0.66
21+1     26        12      0.6
22       29.38     16      0.67
22+1     30.08     12      0.67
23       28.85     13      0.63
23+1     31        4       0.66
24       30.67     3       0.64
24+1     30.4      5       0.62
25       32        1       0.64
25+1     32        1       0.63
Table 4-3 Phe, Ile, Leu, Met, Val, and Tyr Frequency SPTR
An examination of all sequences of all lengths shows that, on average, the
hydrophobic core of these coiled regions is occupied by hydrophobic amino acids about
66% of the time. The Swiss-Prot dataset had an average of 65% for heptad lengths of 6 to
15. As the sequence length extends and fewer sequences are found, the average falls to
60%. The SPTR dataset shows that in the heptad lengths of 6 to 15 the hydrophobic core
occupancy rate was about 67% and beyond that the average was 65%. This would seem
to suggest that when the Stable Coil algorithm is used to predict coiled-coil regions there
is a roughly constant fraction of hydrophobic amino acids that must reside in the
hydrophobic core. This could prove to be a minimum cutoff for coiled-coil regions.
Knowing that clusters exist in both the Swiss-Prot and SPTR datasets, how many
clusters of any length are in sequences of different lengths? Are hydrophobic clusters
more numerous than non-hydrophobic clusters? This analysis will provide insight into
what separates hydrophobic clusters. The cluster effect can extend beyond the
hydrophobic cluster if two clusters are separated by a single hydrophobic core position
[TRI 2003].
Hydrophobic core clusters are characterized next. In this analysis a hydrophobic
cluster is a run of consecutive ‘a’ and ‘d’ positions occupied by the amino acids Phe, Ile,
Leu, Met, Val, and Tyr, and a non-hydrophobic cluster is a run of two or more consecutive
‘a’ and ‘d’ positions occupied by non-hydrophobic amino acids. Both datasets are
analyzed and are summarized below in the graphs. The first two graphs, Figure 4-10, Total
Clusters and Ratio by Heptad Length Swiss-Prot, and Figure 4-11, Total Clusters and
Ratio by Heptad Length SPTR, show the total number of hydrophobic and
non-hydrophobic clusters that are present for a given sequence length.
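The cluster definition above can be illustrated with a short Python sketch. The function names are hypothetical. The minimum lengths follow the counting used in this chapter, where hydrophobic clusters are counted from length three and non-hydrophobic clusters from length two.

```python
from itertools import groupby

HYDROPHOBIC = frozenset("FILMVY")  # Phe, Ile, Leu, Met, Val, Tyr
HEPTAD = "abcdefg"


def core_residues(residues, offset=0):
    """Amino acids at the 'a' and 'd' positions, in sequence order."""
    return "".join(aa for i, aa in enumerate(residues)
                   if HEPTAD[(i + offset) % 7] in ("a", "d"))


def find_clusters(core, min_hydro=3, min_non_hydro=2):
    """Split a core string into hydrophobic and non-hydrophobic clusters."""
    hydro, non_hydro = [], []
    for is_hydro, run in groupby(core, key=lambda aa: aa in HYDROPHOBIC):
        run = "".join(run)
        if is_hydro and len(run) >= min_hydro:
            hydro.append(run)
        elif not is_hydro and len(run) >= min_non_hydro:
            non_hydro.append(run)
    return hydro, non_hydro
```

For instance, the core string `LLLAKLIVVNL` would yield the hydrophobic clusters `LLL` and `LIVV` and the non-hydrophobic cluster `AK`; the single `N` and the trailing `L` are too short to count.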
[Chart: Total Clusters and Ratio By Length Swiss-Prot. Cluster count (0 to 700) and ratio (0 to 1.4) versus sequence length in heptads (6 to 28); series: Hydro Clusters, Non-Hydro Clusters, Ratio.]
Figure 4-10 Total Clusters and Ratio by Heptad Length Swiss-Prot
[Chart: Total Clusters and Ratio by Heptad Length SPTR. Cluster count (0 to 40000) and ratio (0 to 1.2) versus sequence length in heptads (6 to 25); series: Hydro Clusters, Non-Hydro Clusters, Ratio.]
Figure 4-11 Total Clusters and Ratio by Heptad Length SPTR
Both charts show that the number of clusters, for both the hydrophobic and
non-hydrophobic amino acids, diminishes sharply after sequences grow beyond 12 heptads,
and very few are found beyond 28 heptads. Even though the total numbers diminish, both
datasets show a similar pattern in the ratio of hydrophobic clusters to non-hydrophobic
clusters as the sequence length goes from 6 heptads to over 25 heptads. This indicates that
when hydrophobic clusters are present they are separated by non-hydrophobic clusters
between 60 and 80% of the time.
The next set of charts, Figure 4-12, Total Clusters by Cluster Size Swiss-Prot, and
Figure 4-13, Total Clusters by Cluster Size SPTR, show the size of the clusters of both
types found in both datasets. The non-hydrophobic clusters are counted starting at length
two, while the hydrophobic clusters are counted starting at length 3. Non-hydrophobic
clusters never exceed 6 in length, while the hydrophobic clusters have a diminished
presence beyond length 12.
[Chart: Total Clusters by Cluster Size Swiss-Prot. Count (0 to 3500) versus cluster size (2 to 14); series: Hydro Clusters, Non-Hydro Clusters.]
Figure 4-12 Total Clusters by Cluster Size Swiss-Prot
[Chart: Total Clusters by Cluster Size SPTR. Count (0 to 50000) versus cluster size (2 to 14); series: Hydro Clusters, Non-Hydro Clusters.]
Figure 4-13 Total Clusters by Cluster Size SPTR
Tables 4-4 through 4-15 detail the distribution of the hydrophobic clusters and
non-hydrophobic clusters for 6 specific heptad lengths found in the two datasets. The
first six tables show the analysis for the Swiss-Prot data and the second set of six is for
the SPTR data. Each table represents a different sequence length. The hydrophobic and
non-hydrophobic cluster lengths range from 3 to 10. The first column in each table is the
cluster length. The second and third columns, Hydro and Non-Hydro respectively, give
the number of clusters of each length and type that are found for the sequence length
the table represents.
Cluster Length   Hydro   Non-Hydro
3                343     112
4                164     23
5                84      4
6                35
7                5
8                4
9                1
10
Total            636     139
Table 4-4 Clusters 6 Heptad S-P

Cluster Length   Hydro   Non-Hydro
3                211     48
4                82      10
5                75      7
6                39
7                15
8                8
9                2
10
Total            432     65
Table 4-5 Clusters 6 Heptad+1 S-P

Cluster Length   Hydro   Non-Hydro
3                214     51
4                109     12
5                76      1
6                46
7                10
8                7
9                2
10
Total            464     73
Table 4-6 Clusters 7 Heptad S-P

Cluster Length   Hydro   Non-Hydro
3                139     49
4                85      5
5                69      1
6                33
7                17
8                7
9                4
10               1
Total            355     55
Table 4-7 Clusters 7+1 Heptad S-P

Cluster Length   Hydro   Non-Hydro
3                135     51
4                59      4
5                45      1
6                26
7                7
8                9
9                1
10               2
Total            284     56
Table 4-8 Clusters 8 Heptad S-P

Cluster Length   Hydro   Non-Hydro
3                124     35
4                64      4
5                41      2
6                18
7                12
8                8
9                0
10               1
Total            268     41
Table 4-9 Clusters 8+1 Heptad S-P
Cluster Length   Hydro   Non-Hydro
3                15346   4781
4                9453    921
5                5156    9
6                2778
7                1282
8                574
9                213
10               91
Total            34939   5711
Table 4-10 Clusters 6 Heptad SPTR

Cluster Length   Hydro   Non-Hydro
3                7214    1748
4                4500    303
5                2981    78
6                1642    2
7                842     1
8                842
9                185
10               185
Total            17959   2132
Table 4-11 Clusters 6 Heptad+1 SPTR

Cluster Length   Hydro   Non-Hydro
3                4041    1030
4                2947    212
5                1959    45
6                1220    2
7                726
8                369
9                195
10               19
Total            11571   1289
Table 4-12 Clusters 7 Heptad SPTR

Cluster Length   Hydro   Non-Hydro
3                2756    954
4                1905    119
5                1444    12
6                910
7                545
8                247
9                147
10               60
Total            8071    1085
Table 4-13 Clusters 7+1 Heptad SPTR

Cluster Length   Hydro   Non-Hydro
3                2242    924
4                1237    153
5                1115    14
6                751
7                417
8                257
9                149
10               72
Total            6318    1091
Table 4-14 Clusters 8 Heptad SPTR

Cluster Length   Hydro   Non-Hydro
3                1695    680
4                958     138
5                729     11
6                481
7                331
8                197
9                90
10               52
Total            4582    829
Table 4-15 Clusters 8+1 Heptad SPTR
These tables show that for the Swiss-Prot dataset, as the size of the hydrophobic
cluster increases from 3 to 4, the number found in the sequences declines by 60%,
whereas for the SPTR dataset the number found declines by 38%. This would indicate that
clusters of three are favored over clusters of four in the Swiss-Prot dataset, whereas in
the SPTR dataset clusters of three appear more often than clusters of four but are not as
strongly favored. The most dramatic declines are found in the non-hydrophobic clusters
in both datasets. In the SPTR dataset, when the cluster size is increased from 3 to 4, the
number of clusters found declines by 80%. Similarly, the drop for the Swiss-Prot dataset
is 85%. This data seems to suggest that the presence of small clusters is favored over
large clusters in both the hydrophobic and non-hydrophobic cases. This could suggest
that nature uses small hydrophobic clusters in combination with many small
non-hydrophobic clusters to form longer stable regions in coiled sequences. Nature seems
to favor small stable regions over long stable regions. This may allow flexibility in protein
folding and performance.
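As a worked example of how such declines are computed, the sketch below applies a percent-decline formula to the length 3 and length 4 counts of Table 4-10 (6 heptad SPTR sequences). The function name is hypothetical, and only this one table is used, so the result illustrates the arithmetic rather than reproducing every figure quoted above.

```python
def percent_decline(before, after):
    """Percentage drop going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Counts transcribed from Table 4-10 (clusters of length 3 and 4).
hydro_3, hydro_4 = 15346, 9453
non_hydro_3, non_hydro_4 = 4781, 921

hydro_drop = percent_decline(hydro_3, hydro_4)              # about 38%
non_hydro_drop = percent_decline(non_hydro_3, non_hydro_4)  # about 81%
```

These values are consistent with the approximate 38% and 80% declines reported for the SPTR dataset.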
Counting the clusters found in the different length sequences gives an
appreciation for the differences found in sequences of different lengths, but how are the
various amino acids distributed in various cluster lengths?
Having determined the frequency of the various cluster lengths, the next step is to
attempt to describe the amino acids that participate in the dominant cluster lengths for
both the hydrophobic and non-hydrophobic amino acids. From the tabulated data above,
most hydrophobic clusters are 3 to 4 amino acids in length and appear in sequences
of 6 and 7 heptads. The non-hydrophobic clusters are more selective. These occur mostly
in clusters of two, diminish quickly, and are rare beyond length 6.
Figure 4-14, Hydrophobic Amino Acids in Clusters, is a normalized count of the
hydrophobic amino acids that occur in clusters. Both datasets show a similar trend in that
Leu appears most often and Tyr the least. The only discrepancy is in the appearance of
Phe. Phe appears much more often in clusters in the total SPTR dataset than in the
Swiss-Prot dataset. The non-hydrophobic clusters are not as easily characterized. Figure
4-15, Non-Hydrophobic Amino Acids in Clusters, shows that while the Swiss-Prot
dataset favors Ala, Asn, Thr, Ser, and Gln, the SPTR dataset favors Ala, Lys, Gln, Glu,
and Arg. The only area of agreement between the two datasets is in what does not appear
in non-hydrophobic clusters. Gly, His, Cys, Asp, and Trp do not appear in either dataset.
Of these, Cys, His, and Asp are hydrophilic amino acids.
[Chart: Hydrophobic Cluster. Normalized count (0 to 0.6) by amino acid (L, I, V, M, Y, F) for the Swiss-Prot and SPTR datasets.]
Figure 4-14 Hydrophobic Amino Acids in Clusters
[Chart: Non-Hydrophobic Cluster. Normalized count (0 to 0.25) by amino acid (A, N, T, S, Q, K, R, E, G, H, C, D, W, P) for the Swiss-Prot and SPTR datasets.]
Figure 4-15 Non-Hydrophobic Amino Acids in Clusters
The next 4 tables list the specific amino acids that appear in hydrophobic and
non-hydrophobic clusters of various lengths. Table 4-16, Hydrophobic Cluster Count,
Swiss-Prot, and Table 4-17, Non-Hydrophobic Cluster Count, Swiss-Prot, show the
clusters that occur 6 or more times in the Swiss-Prot dataset. Table 4-18, Hydrophobic
Cluster Count, SPTR, and Table 4-19, Non-Hydrophobic Cluster Count, SPTR, show the
clusters that occur more than 100 times. These cutoff values were chosen first to cut down
on the infrequent data and second to include specific cluster types and their occurrence in
the different length sequences. In this analysis non-hydrophobic clusters with a length of
two amino acids are considered. The tables list the sequence length, the exact cluster
sequence, and the number of clusters of this type found.
Lngth  Cluster  Num   Lngth  Cluster  Num   Lngth  Cluster  Num   Lngth  Cluster  Num
12     IIL      7     13     ILL      16    14     LVL      15    17     ILL      17
12     ILI      10    13     LIL      12    14     LVLL     6     17     LIL      6
12     ILIL     7     13     LLF      7     14     VLL      15    17     LLI      6
12     ILL      40    13     LLI      6     15     ILL      10    17     LLL      16
12     ILLL     9     13     LLL      21    15     ILV      10    17     LVL      7
12     IML      8     13     LLLL     7     15     LIL      21    18     ILI      6
12     IVL      8     13     LLV      13    15     LILL     7     18     ILL      9
12     LIL      21    13     LML      7     15     LLL      16    18     LLL      18
12     LLFV     6     13     LVL      8     15     LLLL     8     18     LVL      8
12     LLI      8     13     VLL      18    15     LLV      6     18     LYL      7
12     LLL      38    13     VLV      8     15     LLY      7     19     ILL      12
12     LLLL     9     13     VLVVV    6     15     LVL      6     19     LILL     6
12     LLM      6     14     ILL      17    15     LVLL     6     19     LLL      14
12     LLV      8     14     LIL      16    15     VLL      8     19     LVL      12
12     LLY      10    14     LIV      8     16     ILI      6     20     LLL      7
12     LML      12    14     LLI      8     16     ILL      12    21     LILL     7
12     LVL      26    14     LLL      42    16     LIL      10    21     LLL      17
12     LYL      6     14     LLLL     10    16     LLI      6     21     LML      7
12     VLL      13    14     LML      10    16     LLL      25    21     LVL      9
13     ILI      11    14     LMLLL    6     16     LVL      16    21     LVLL     8
Table 4-16 Hydrophobic Cluster Count, Swiss-Prot
Table 4-16, Hydrophobic Cluster Count, Swiss-Prot, shows that Leu, Ile and Val are the
dominant amino acids in the clusters from the Swiss-Prot dataset and the majority of the
clusters are 3 and 4 amino acids long. Table 4-17, Non-Hydrophobic Cluster Count,
Swiss-Prot, shows that the amino acid Ala is dominant in clusters of 2 and 3.
Lngth  Cluster  Num   Lngth  Cluster  Num   Lngth  Cluster  Num   Lngth  Cluster  Num
12     AA       9     13     RA       6     16     AQ       8     42     AAA      10
12     AK       8     14     AA       11    16     QN       6     46     EK       6
12     AN       8     14     AE       6     17     AT       8     52     EK       11
12     AQ       13    14     AK       7     17     HE       6     62     ER       13
12     AR       10    14     AQ       7     17     TN       6     62     KT       6
12     EK       11    14     AR       6     18     HK       6     76     ER       11
12     ER       12    14     ER       10    19     AN       10    77     EK       8
12     HA       7     14     KE       6     20     AAA      6     86     ER       6
12     KA       8     14     QK       6     21     AC       6
12     KQ       7     14     TA       6     21     EK       8
12     NA       10    14     TK       8     22     EK       6
12     NK       6     14     TS       6     23     AA       6
12     QA       11    15     AK       11    23     SA       8
12     QKE      7     15     AN       7     23     TK       7
12     QT       7     15     AS       12    24     AK       7
12     RQ       8     15     KA       6     27     AAAN     8
12     SA       7     15     NA       7     27     ENS      11
12     TK       9     15     QK       6     27     TD       9
13     AK       7     15     TE       7     28     TK       7
13     QA       7     16     AK       9     29     AASN     6
Table 4-17 Non-Hydrophobic Cluster Count, Swiss-Prot
Table 4-18, Hydrophobic Cluster Count, SPTR, shows the SPTR dataset is
dominated by 3 and 4 length clusters with Leu, Ile, and Val, but there are more Phe than
in the Swiss-Prot dataset. Table 4-19, Non-Hydrophobic Cluster Count, SPTR, shows
that most of the clusters are 2 amino acids long and composed mainly of Ala and Asn.
Lngth  Cluster  Num   Lngth  Cluster  Num   Lngth  Cluster  Num   Lngth  Cluster  Num
12     FIL      113   12     LLF      189   12     VVL      170   14     LIL      128
12     FLI      123   12     LLI      422   12     YLL      115   14     LLI      127
12     FLL      277   12     LLL      823   12     YYL      118   14     LLL      222
12     III      251   12     LLLL     186   13     FLL      122   14     LLV      102
12     IIL      293   12     LLM      169   13     IIL      146   14     LVL      114
12     IIV      125   12     LLV      282   13     ILI      163   15     LLL      189
12     ILF      225   12     LLY      131   13     ILL      193   16     LLL      146
12     ILI      323   12     LML      196   13     LII      135   17     LLL      116
12     ILL      514   12     LVF      103   13     LIL      207
12     ILLL     128   12     LVI      158   13     LIV      104
12     ILV      230   12     LVL      399   13     LLF      178
12     IVI      133   12     LVV      140   13     LLI      187
12     IVL      184   12     LYL      126   13     LLL      452
12     LFI      116   12     MLL      183   13     LLLL     119
12     LFL      217   12     VII      129   13     LLV      157
12     LIF      113   12     VIL      173   13     LML      105
12     LII      250   12     VLI      220   13     LVL      184
12     LIL      408   12     VLL      372   13     VLI      105
12     LILL     122   12     VLM      107   13     VLL      216
12     LIV      192   12     VLV      166   14     ILL      168
Table 4-18 Hydrophobic Cluster Count, SPTR

Lngth  Cluster  Num   Lngth  Cluster  Num   Lngth  Cluster  Num   Lngth  Cluster  Num
12     AA       516   12     KQ       133   12     SA       251   13     TN       105
12     AE       134   12     KS       147   12     SK       141   14     AA       198
12     AG       147   12     KT       127   12     SN       201   14     AT       113
12     AK       241   12     NA       263   12     SQ       171   14     TA       111
12     AN       225   12     ND       115   12     SS       158   15     AA       117
12     AQ       208   12     NK       153   12     ST       222
12     AR       155   12     NN       247   12     TA       315
12     AS       235   12     NQ       162   12     TE       111
12     AT       328   12     NR       132   12     TK       151
12     CA       120   12     NS       221   12     TN       200
12     EA       125   12     NT       184   12     TQ       181
12     EK       118   12     QA       217   12     TS       215
12     EN       103   12     QH       122   12     TT       204
12     GA       173   12     QK       126   13     AA       251
12     GN       106   12     QN       167   13     AK       113
12     HA       126   12     QQ       163   13     AN       104
12     HT       124   12     QS       122   13     AT       133
12     KA       224   12     QT       150   13     NA       119
12     KK       183   12     RA       171   13     NN       169
12     KN       214   12     RN       113   13     SA       118
Table 4-19 Non-Hydrophobic Cluster Count, SPTR
Finally, do stabilizing clusters exist and, if so, how can they be characterized? It is
thought that each coiled-coil begins with a stabilizing cluster. Each of the coil sequences
in the two datasets is examined, first to find whether there is a cluster beginning each
sequence, and then to find which amino acids populate these clusters. For this part of the
analysis, a convention of 0’s and _’s is used to signify the hydrophobic amino acids and
non-hydrophobic amino acids respectively. Table 4-20, Stabilizing Cluster, Swiss-Prot,
shows that only about 33% of the Swiss-Prot sequences begin with a cluster, and Table
4-21, Stabilizing Cluster, SPTR, shows that 38% of the SPTR sequences begin with
clusters.
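The 0 and _ convention can be illustrated with a short Python sketch that encodes the core (‘a’ and ‘d’) residues of a coil and checks how far into the core the first cluster of three begins. The function names and the limit of four leading non-hydrophobic positions are illustrative assumptions chosen to mirror the patterns listed in the tables.

```python
HYDROPHOBIC = frozenset("FILMVY")  # Phe, Ile, Leu, Met, Val, Tyr


def core_pattern(core):
    """Encode core residues: '0' for hydrophobic, '_' for non-hydrophobic."""
    return "".join("0" if aa in HYDROPHOBIC else "_" for aa in core)


def leading_cluster_offset(core, min_len=3, max_skip=4):
    """Non-hydrophobic core positions preceding the first cluster, or None.

    Returns 0 when the coil begins directly with a cluster (pattern 000),
    1 for _000, and so on up to max_skip; None if no such pattern matches.
    """
    pattern = core_pattern(core)
    for skip in range(max_skip + 1):
        if pattern.startswith("_" * skip + "0" * min_len):
            return skip
    return None
```

For example, the core string `LLLAK` encodes to `000__` and begins with a cluster, while `AKLLL` encodes to `__000` and reaches its first cluster after two non-hydrophobic positions.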
Cluster   Number   Percent
Pattern   Found    of Total
000       591      20.44%
_000      358      12.38%
__000     164      5.67%
___000    65       2.25%
____000   17       0.59%
000__     95       3.29%
000_0     172      5.95%
0000_     153      5.29%
00000     171      5.91%
000___    37       1.28%
000__0    58       2.01%
Table 4-20 Stabilizing Cluster, Swiss-Prot
Cluster   Number   Percent
Pattern   Found    of Total
000       18527    27.51%
_000      7537     11.19%
__000     2776     4.12%
___000    906      1.35%
____000   255      0.38%
000__     2355     3.50%
000_0     4832     7.17%
0000_     4479     6.65%
00000     6861     10.19%
000___    763      1.13%
000__0    1592     2.36%
Table 4-21 Stabilizing Cluster, SPTR
In an attempt to find starting stabilizing clusters, other beginning sequence
patterns were examined. Offsetting the start of the sequences by half a heptad at a time
shows that a beginning cluster does not appear even after offsetting 2 full heptads. The
Swiss-Prot analysis examined 2891 sequences and the SPTR analysis examined 67358
sequences.
The majority of the sequences used in this analysis do not begin with a stabilizing
cluster. To search the beginning of the sequence for a starting cluster, the assumed
beginning of the sequence was advanced in increments of half a heptad. This attempt still
revealed that a majority of the sequences contain no starting stabilizing cluster. At best,
between the two datasets, 35% of the sequences used began with a cluster.
Of the sequences that did begin with a stabilizing cluster, a closer look at the
cluster patterns 000, 0000, and 00000 shows that the dominant amino acids found in these
clusters are Leu and Ile. Most of the clusters appear in the 12 heptads long sequences.
The cluster sequence Leu-Leu-Leu appears most often. Table 4-22, Cluster Amino Acids
Swiss-Prot, shows all the starting cluster combinations that occurred more than once. The
table is sorted on sequence length and lists the clusters that appear and the number that
are found for each sequence length. This table contains 274 entries out of the 591
sequences that begin with clusters of three or more. There are a total of 2891 sequences in
this dataset. When a sequence begins with a stabilizing cluster, the amino acids that
appear at the beginning of those clusters most often are Leu (46%), Ile (19%), and
Val (13.5%).
Len    Cluster  Num   Len    Cluster  Num   Len    Cluster  Num   Len    Cluster  Num
12     ILIL     5     13     FYF      2     15     ILV      2     25     YLI      3
12     ILL      6     13     ILILL    3     15     VLL      3     26     LLL      4
12     ILLL     4     13     ILL      4     16     ILI      4     27     ILILL    5
12     IVLVLL   2     13     ILVLIM   2     16     LLI      2     27     ILVLL    2
12     LFL      2     13     LLFYFL   2     16     LVL      4     27     LLVLL    8
12     LIL      4     13     LLI      2     16     LVLLLL   2     27     VLVLF    2
12     LILL     2     13     LLL      4     17     LLI      2     28     ILVLL    2
12     LIV      2     13     LLLL     2     18     LLL      7     32     LVL      2
12     LIVVL    2     13     LLV      3     19     ILL      2     34     FILILL   4
12     LLFV     6     13     MVL      2     19     ILLV     2     42     MLL      5
12     LLI      3     13     VLI      2     19     LLVL     2     86     MMFVL    3
12     LLILL    3     13     VLILL    2     19     VLL      2     87     MMFVL    3
12     LLL      7     13     VLILLV   2     20     FLL      2     95     MMF      2
12     LLLL     6     13     VLL      2     20     MLL      4     Total           45
12     LLLM     2     13     VLLL     2     21     LLILL    3
12     LLLY     2     13     VLV      2     21     LLL      2
12     LLM      3     14     FILILV   2     21     MMF      6
12     LLMLL    2     14     FLIV     3     22     LLL      4
12     LLV      3     14     IFILM    2     22     LML      2
12     LLVL     2     14     LFLL     2     22     VLIL     2
12     LML      3     14     LIL      2     23     LIVL     2
12     LVL      5     14     LLL      2     24     ILLI     4
12     LYL      2     14     LVLVLL   2     24     LVL      2
12     LYV      3     14     VLL      2     Total           67
12     MIMM     2     14     VLV      2
12     VLFL     2     14     YILILL   4
12     VLI      3     14     YILL     3
12     VLL      2     Total           64
12     VLLL     2
12     VVL      3
12     YLIY     3
Total           98
Table 4-22 Cluster Amino Acids Swiss-Prot
The SPTR database results are shown in Table 4-23, Cluster Amino Acids SPTR,
which lists the number of times the different cluster combinations occur as long as it
appears over 15 times.
Len    Seq      Num   Len    Seq      Num   Len    Seq      Num   Len    Seq      Num
12     FII      23    12     LFV      24    12     MFL      20    13     FLL      27
12     FIL      21    12     LFY      26    12     MII      17    13     FLV      16
12     FLF      17    12     LIF      43    12     MIL      34    13     III      28
12     FLI      31    12     LII      64    12     MLI      26    13     IIL      19
12     FLL      42    12     LIL      114   12     MLL      40    13     ILF      26
12     FLV      16    12     LILL     21    12     MVII     20    13     ILI      41
12     FLY      20    12     LIM      21    12     MVL      30    13     ILL      27
12     FVL      18    12     LIV      71    12     VFL      17    13     ILV      17
12     IFL      19    12     LIY      17    12     VII      31    13     IVF      22
12     IFLFML   37    12     LLF      75    12     VIL      40    13     LFL      19
12     IFLL     16    12     LLFL     16    12     VIV      27    13     LIF      17
12     IFLLMV   25    12     LLI      126   12     VLF      31    13     LII      32
12     III      92    12     LLIL     26    12     VLI      69    13     LIL      30
12     IIL      61    12     LLIV     20    12     VLIL     17    13     LIMII    16
12     IILL     25    12     LLL      192   12     VLL      83    13     LIV      16
12     IIV      37    12     LLLI     23    12     VLLL     17    13     LLF      80
12     IIY      17    12     LLLL     47    12     VLM      30    13     LLI      33
12     ILF      107   12     LLLLL    18    12     VLV      40    13     LLL      48
12     ILI      91    12     LLLV     25    12     VVF      19    13     LLLL     22
12     ILIL     16    12     LLM      50    12     VVL      34    13     LLV      42
12     ILL      130   12     LLMIM    23    12     YFM      21    13     LLY      17
12     ILLL     21    12     LLML     25    12     YIL      24    13     LVI      16
12     ILV      48    12     LLV      72    12     YLI      17    13     LVL      34
12     ILY      18    12     LLVL     20    Total           3557  13     LVV      21
12     IMI      28    12     LLY      33                          13     MLF      21
12     IVF      22    12     LMI      37                          13     VIL      18
12     IVI      20    12     LML      51                          13     VLI      33
12     IVL      33    12     LMV      16                          13     VLL      38
12     IVLL     17    12     LVF      27                          13     VLLIML   16
12     IVV      27    12     LVI      34                          13     YLMYLL   21
12     IYL      16    12     LVL      122                         14     IIL      16
12     LFF      23    12     LVLY     18                          14     ILL      16
12     LFI      32    12     LVV      37                          14     LILV     20
12     LFL      74    12     LVY      22                          16     VLL      21
12     LFLL     32    12     LYL      25                          22     YFL      18
Total           1272  Total           1581                        Total           904
Table 4-23 Cluster Amino Acids SPTR
When a sequence begins with a stabilizing cluster, the amino acids that appear at
the beginning of those clusters most often are Leu (49.5%), Ile (25%), and Val (13%).
These rates are similar to those at which the Swiss-Prot clusters began. The SPTR
analysis is based on 4461 of the 18527 clusters found in all the
sequences. There are a total of 67358 sequences in the SPTR dataset.
These data show that between 30 and 40% of the sequences used in this analysis
do start with a stabilizing cluster. When a sequence does begin with a stabilizing cluster,
those clusters have a 90% chance of beginning with either Leu, Ile, or Val. This may be a
broad marker that signifies the beginning of a coiled sequence. But the definition offered
by Stable Coil may be too broad. Since Stable Coil uses a windowing algorithm to
define the coiled region, it may not define the starting and ending points of the coil well
enough to define a starting stabilizing cluster.
Summary of Findings
In this chapter, an analysis of the hydrophobic core of the coiled regions in
coiled-coil sequences was performed. The hydrophobic core of the sequences was first
characterized to find which amino acids were present and how often they occurred. This
was followed by an analysis of clusters of hydrophobic amino acids that are found in the
hydrophobic core of adjacent heptads. The important findings are listed below.
• Hydrophobic amino acids occupy the hydrophobic ‘a’ and ‘d’ core positions on
average 65% of the time for the Swiss-Prot dataset and 67% for the SPTR dataset.
• The number of hydrophobic clusters decreases by a factor of 2 for each
hydrophobic core position added to the sequence length.
• The number of non-hydrophobic clusters decreases by a factor of 8 for each
hydrophobic core position added to the sequence length.
• Ala is the most likely non-hydrophobic amino acid to appear in the hydrophobic
core in both the Swiss-Prot and SPTR datasets.
• The ratio of hydrophobic clusters to non-hydrophobic clusters is 0.6 to 0.8 for
sequences from 6 heptads to 22 heptads in length.
• Cluster frequency decreases sharply for sequences 6 heptads to 9 heptads in
length.
• In both the Swiss-Prot and SPTR datasets Leu is favored in the ‘d’ hydrophobic
core position and Val and Ile are favored in the ‘a’ hydrophobic core position.
• Stabilizing clusters are found in 39% of the SPTR sequences and 33% of
the Swiss-Prot sequences.
CHAPTER 5
CONCLUSION
This thesis would not have been possible without the direct support and guidance
of Dr. Robert Hodges and Dr. Brian Tripet of the UCHSC and their research on
coiled-coils. Their willingness to help was far beyond anything expected.
Coiled-coil protein domain research has led to a better understanding of this
structural domain in kinesin, myosin, and more recently the SARS virus. Biologists’
ability to create experiments and interpret experimental results rapidly will only benefit
this research. However, if insight into the experiment can be gained prior to the lengthy
experiment, time and resources can be saved. In this thesis, the Stable Input program was
written to give biologists at the UCHSC the ability to determine the relative stability of
the coiled-coil protein domain. This program has the capacity to incorporate 6 different
stability factors and produces output that is easily interpreted and portable to other
platforms. The most important aspect of this work is that research biologists now have the
ability to create theoretical protein sequences and draw initial stability conclusions at the
click of a button.
In addition to the Stable Input tool, the coiled-coil hydrophobic clustering theory
was explored and quantified. Kwok showed that clustering of hydrophobic amino acids in
the hydrophobic core of consecutive heptads leads to greater stability in the overall
coiled-coil. In an attempt to provide more information about clusters in nature, an exhaustive
search was initiated to quantify clusters in the Swiss-Prot database. This research has
led to a better understanding of which amino acids frequent the hydrophobic core in
clustered regions.
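The cluster search described above can be illustrated with a short sketch. It is not the code used for this thesis. It assumes the glossary definitions (hydrophobic amino acids Phe, Ile, Leu, Met, Val, and Tyr; hydrophobic core positions ‘a’ and ‘d’; a hydrophobic cluster of 3 or more consecutive hydrophobic core positions) and assumes the heptad position of the first residue is already known. The function names and the one-letter sequence encoding are illustrative choices.

```python
# Illustrative sketch only; names and encoding are assumptions, not thesis code.
HYDROPHOBIC = set("FILMVY")  # Phe, Ile, Leu, Met, Val, Tyr (glossary definition)
HEPTAD = "abcdefg"


def core_residues(sequence, first_position="a"):
    """Return the residues that fall on 'a' or 'd' heptad positions.

    first_position is the heptad position of the first residue (assumed known).
    """
    start = HEPTAD.index(first_position)
    return [
        residue
        for i, residue in enumerate(sequence)
        if HEPTAD[(start + i) % 7] in "ad"
    ]


def hydrophobic_clusters(sequence, first_position="a", min_length=3):
    """Return runs of min_length or more consecutive hydrophobic core residues."""
    clusters = []
    run = []
    for residue in core_residues(sequence, first_position):
        if residue in HYDROPHOBIC:
            run.append(residue)
        else:
            # A non-hydrophobic core residue ends the current run.
            if len(run) >= min_length:
                clusters.append("".join(run))
            run = []
    if len(run) >= min_length:
        clusters.append("".join(run))
    return clusters
```

Applied to every annotated coiled-coil in a dataset, a routine of this kind would yield the cluster strings whose frequencies are tabulated in the analysis.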
The comparison of the Swiss-Prot annotated coiled-coils with the protein database
as a whole could be improved by using different coiled-coil prediction algorithms. To
perform this comparison, a method was needed that would determine where in a
sequence a coiled-coil might be located and at which heptad registry offset it begins.
The Stable Coil algorithm was chosen because, first, it is based on experimentally
determined stability and helical propensity values rather than statistics and, second,
both datasets were primarily analyzed using the same criteria. The problem with this
approach lies in the windowing function in Stable Coil, which may include many more
heptads at the beginning and end of each sequence. The first and last 42 positions
evaluated by the windowing function were not evaluated as strictly as the
intermediate positions. This may overestimate the extent of the suspected coils at
their starting and ending points. In this analysis, the stabilizing cluster was missing
from over 70% of the coils selected. The true remedy to this problem
is to define the starting heptads of a coiled-coil with greater precision. It seemed apparent
from this analysis that treating the starting and ending heptads of a coiled-coil in the
same way as the intermediate heptads may not be the best approach.
In future analysis, a more selective coiled-coil prediction method could be used to
better identify the coiled-coil regions, thereby eliminating the start heptad and end heptad
problem. However, the same problem may arise when attempting to determine the heptad
registry. One suggestion would be to use the ‘a’ and ‘d’ positioned amino acids found
in this analysis to help determine the offset value. An algorithm could continue to
move the heptad offset until the strongest correlation between the ‘a’ and ‘d’ positions and
the presence of hydrophobic amino acids is found.
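A minimal sketch of such an offset search is given below. It was not implemented in this thesis. As an assumption, the "correlation" is simplified to the fraction of ‘a’ and ‘d’ positions occupied by a hydrophobic amino acid, and ties are resolved in favor of the smallest offset. The function names are hypothetical.

```python
# Illustrative sketch only; the scoring rule is a simplifying assumption.
HYDROPHOBIC = set("FILMVY")  # Phe, Ile, Leu, Met, Val, Tyr (glossary definition)
CORE_INDEXES = (0, 3)        # indexes of 'a' and 'd' within the heptad "abcdefg"


def core_hydrophobic_fraction(sequence, offset):
    """Fraction of 'a'/'d' positions that are hydrophobic.

    offset is the heptad index (0 = 'a' ... 6 = 'g') assigned to the first residue.
    """
    core = [
        residue
        for i, residue in enumerate(sequence)
        if (i + offset) % 7 in CORE_INDEXES
    ]
    if not core:
        return 0.0
    return sum(residue in HYDROPHOBIC for residue in core) / len(core)


def best_heptad_offset(sequence):
    """Try all seven offsets and return (best_offset, scores_by_offset)."""
    scores = {
        offset: core_hydrophobic_fraction(sequence, offset) for offset in range(7)
    }
    # max() returns the first maximum, so ties go to the smallest offset.
    best = max(scores, key=scores.get)
    return best, scores
```

For example, a sequence built from the repeat LAALAAA with its first two residues removed starts at position ‘c’, and the search returns offset 2. A production version would likely weight residues by the stability coefficients rather than counting them equally.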
Another possible improvement to this thesis, and a way to truly gain an appreciation
of the variety of cluster combinations, would be to create a relational database. The data
shown in the tables have been distilled to the point where only the most repetitive
sequences are displayed. A database could allow for absolute queries if particular
sequences are found to be interesting.
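As a rough illustration of this idea, the sketch below stores one row per cluster in a SQLite database and counts occurrences of a given cluster in each dataset. No such database was built for this thesis. The table layout, column names, and function names are assumptions chosen for the example.

```python
# Illustrative sketch only; the schema and names are assumptions.
import sqlite3


def create_cluster_db(path=":memory:"):
    """Create a database with one table holding one row per observed cluster."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE cluster (
            id         INTEGER PRIMARY KEY,
            dataset    TEXT    NOT NULL,  -- e.g. 'Swiss-Prot' or 'SPTR'
            accession  TEXT    NOT NULL,  -- identifier of the source sequence
            core_start INTEGER NOT NULL,  -- index of the first core position
            residues   TEXT    NOT NULL   -- core residues of the cluster
        )
        """
    )
    return conn


def add_cluster(conn, dataset, accession, core_start, residues):
    """Insert one cluster record."""
    conn.execute(
        "INSERT INTO cluster (dataset, accession, core_start, residues) "
        "VALUES (?, ?, ?, ?)",
        (dataset, accession, core_start, residues),
    )


def count_by_residues(conn, residues):
    """Return {dataset: count} for clusters whose core residues match exactly."""
    rows = conn.execute(
        "SELECT dataset, COUNT(*) FROM cluster "
        "WHERE residues = ? GROUP BY dataset ORDER BY dataset",
        (residues,),
    ).fetchall()
    return dict(rows)
```

With the clusters loaded, a question such as "how often does the cluster LVL occur in each dataset" becomes a single call to count_by_residues(conn, "LVL"), rather than a new pass over the raw sequences.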
GLOSSARY
Alpha helix: a repetitive secondary structure that gets its name because the relationship
of one amino acid to the next is the same. See Figure 1-3.
Beta strand: an amino acid string that does not form a coil. It zigzags in a more extended
way than a helix. See Figure 1-4.
Coiled-coil: a tertiary oligomerization domain that is formed when two or more
α-helices wrap around each other in a left-handed supercoil.
Heptad: the repeated seven positions a, b, c, d, e, f, g that characterize coiled-coil
sequences.
Hydrophobic amino acids: for the purpose of this analysis, the amino acids Phe, Ile,
Leu, Met, Val, and Tyr. Some references do not include Tyr among the hydrophobic
amino acids. A complete list of amino acids can be found in Tables 1-2 through 1-5.
Hydrophobic cluster: a sequence of 3 or more consecutive hydrophobic core positions
that have hydrophobic amino acids in them.
Hydrophobic core: the ‘a’ and ‘d’ heptad positions in a coiled-coil.
Kinesin: Similar to myosin, a family of microtubule-associated motor proteins.
Myosin: a mechanoenzyme protein that supports the movement of cellular components
with a characteristic actin binding domain head, a neck and tail.
Non-hydrophobic cluster: a sequence of 2 or more consecutive hydrophobic core
positions that have non-hydrophobic amino acids in them.
Oligomer: A polymer that consists of two, three, or four monomers.
Oligomerization: The process of converting a monomer or a mixture of monomers into
an oligomer.
SPTR dataset: the data derived from the entire Swiss-Prot and TrEMBL
database.
Swiss-Prot dataset: the dataset used in the analysis that came from the Swiss-Prot
annotated coiled-coils.
Tropomyosin: a long, rod-like molecule, similar to myosin, that fits in the groove of
the actin helix.
BIBLIOGRAPHY
[Anfinsen 1973] Anfinsen, C. (1973) Science 181, 223–230.
[Baldi 2000] Baldi, P., and Pollastri, G., Andersen, C., and Brunak, S. (2000). Protein
beta-Sheet Partner Prediction by Neural Networks. Department of Information and
Computer Science, University of California Irvine.
[Becker 2000] Becker, W., Kleinsmith, L., Hardin, J. Chapter 3 in “The World of the
Cell-4th ED”, The Benjamin/Cummings Publishing Company 2000 [p. 49-51]
[Becker 2000a] Becker, W., Kleinsmith, L., Hardin, J. Chapter 3 in “The World of the
Cell-4th ED”, The Benjamin/Cummings Publishing Company 2000 [p. 51-55]
[Becker 2000b] Becker, W., Kleinsmith, L., Hardin, J. Chapter 19 in “The World of the
Cell-4th ED”, The Benjamin/Cummings Publishing Company 2000 [p. 634-645]
[Becker 2000c] Becker, W., Kleinsmith, L., Hardin, J. Chapter 12 in “The World of the
Cell-4th ED”, The Benjamin/Cummings Publishing Company 2000 [p. 341-342]
[Bornberg 1996] Bornberg, E., Rivals, E, and Vingron, M. (1996). Computational
approaches to identify leucine zippers. Nucleic Acids Research, Vol. 26, 2740-2746.
[Brook 2003] Principles of Protein Structure Using the Internet; Brookhaven PDB
Mariusz Jaskólski & Janusz Kazmarek, Center for Biocrystallographic Research, Poznan;
and Clare Sansom, Heiko Schinke, Martin-Luther-University, Dept. of
Biochemistry/Biotechnology, Halle; www.cryst.bbk.ac.uk/PPS2
[Chakrabarty 2002] Chakrabarty, T., Xiao, M., Cooke, R., and Selvin, P. Holding Two
Heads together: Stability of the myosin II rod measured by resonance energy transfer
between the heads. PNAS April 30 2002 Vol. 99 No. 9 pp. 6011-6016.
[Chou 1974] Chou, P., and Fasman, G. (1974). Conformational Parameters for Amino
Acids in Helical, β-Sheet, and Random Coil Regions Calculated from Proteins.
Biochemistry, Vol. 13, No. 2, 222-245.
[Crick 1970] Crick, F., Central Dogma of Molecular Biology. Nature, Vol. 227, pp. 561-563 (August 8, 1970)
[EBI 2003]. European Bioinformatics Institute.
http://www.ebi.ac.uk/Information/index.html
[EXP 2003] ExPASy Molecular Biology Server. http://us.expasy.org/
[Garnier 1978] Garnier, Osguthorpe and Robson (1978). Analysis of the Accuracy and
Implications of Simple Methods for Predicting the Secondary Structure of Globular
Proteins. J Mol Biol. Mar; Vol. 120, 97-120.
[Gromiha 2002] Gromiha, M., Oobatake, M., Kono, H., Uedaira, H., Sarai, A. (2002).
Importance of mutant position in Ramachandran plot for predicting protein stability of
surface mutations. Biopolymers. Aug 5; Vol. 64(4):210-20.
[Lauzon 2001] Lauzon, A., Fagnant, P., Warshaw, D., and Trybus, K. (2001). Coiled Coil
Unwinding at the Smooth Muscle Myosin Head Rod junction is required for optimal
mechanical Performance. Biophysical Journal Vol. 80 April 2001 pp1900-1904
[Lesk 2002] Arthur M. Lesk, “Introduction to Bioinformatics.” New York, NY: Oxford
Press 2002
[Kwok 2003] Kwok, S., Hodges, R. (2003). Hydrophobic Clusters Affect Protein Stability.
Dept. of Biochemistry and Molecular Genetics, Univ. of Colorado Health Sciences
Center, Denver, Co and Dept. of Biochemistry, Univ. of Alberta, Edmonton, AB.
[NCBI 2003] National Center for Biotechnology Information.
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
[OCGC 2003] Ontario Centre for Genomic Computing
http://ocgc.ca/databases/genpept.html
[SCP 2003] Stable Coil; Pence The Canadian Protein Engineering Network.
http://biomol.uchsc.edu/researchFacilities/ComputationalCore/stablecoil/
[SI 2003] Stable Input http://dirac.uccs.edu/~dcbrinkm/thesis/stable_input.html
[SIB 2003] Swiss-Prot Protein knowledgebase TrEMBL Computer-annotated
supplement to Swiss-Prot. http://us.expasy.org/sprot/
[SPTR 2003] SPTR database is found at the web site:
http://www.hgmp.mrc.ac.uk/Bioinformatics/Databases/sptr-help.html
[SWPR 2003] Swiss-Prot Protein Knowledgebase User Manual, Release 41.20 of 16-Aug-2003; Amos Bairoch, Swiss Institute of Bioinformatics (SIB), Centre Medical
Universitaire
[Thormahlen 1998] Thormahlen, M., Marx, A., and Mandelkow, E. (1998). The coiled-coil
helix in the neck of Kinesin. Journal of Structural Biology Vol. 122, 30-41
[TRI 2003] Tripet, B. University of Colorado Health Sciences Center
[TRI2 2003] Coiled-coil Presentation 2003. Tripet, B. University of Colorado Health
Sciences Center
[Tripet 1997] Tripet, B., Vale, R., Hodges, R. (1997). Demonstration of Coiled-coil
interaction within the Kinesin Neck Region using synthetic peptides; Journal of
Biological Chemistry. Vol. 272, No. 14, Issue of April 4, pp. 8946-8956.
[Tripet 1998] Tripet, B., Wagschal, K., Lavigne, P., Mant, C., Hodges, R. (1998). The role
of position a in determining the stability and oligomerization state of α-helical coiled-coils: 20 amino acid stability coefficients in the hydrophobic core of proteins. Protein
Science Vol. 8, 2312-2329.
[Tripet 2000] Tripet, B., Wagschal, K., Lavigne, P., Mant, C., Hodges, R. Effects of Side
Chain Characteristics on Stability and Oligomerization State of a de Novo-designed
Model Coiled-coil: 20 Amino Acid Substitutions in Position “d”. Journal of Molecular
Biology (2000) Vol. 300, p377-402
[UWK 2003] University of Western Kentucky Biotechnology Center
http://bioweb.wku.edu/courses/biol22000/3AAprotein/Fig.html
[UOFG 2003] University of Guelph Chemistry Chem730.
http://www.chembio.uoguelph.ca/educmat/chm730.
[Wagschal 1999] Wagschal, K., Tripet, B., Lavigne, P., Mant, C. & Hodges, R. (1999). The
role of position a in determining the stability and oligomerization state of alpha-helical
coiled coils: 20 amino acid stability coefficients in the hydrophobic core of proteins.
Protein Sci Vol. 8, 2312-2329
[Zhou 1994] Zhou, N.E., Monera, O., Kay, C. & Hodges, R. (1994). alpha-helical
propensities of amino acids in the hydrophobic face of an amphipathic alpha-helix.
Protein and Peptide Letters, 1, 114-119.
APPENDIX A STABLE INPUT GUI
APPENDIX B TABULATED OUTPUT