Jacobs University Bremen
Encoding of Amino Acids and Proteins from a Communications and Information Theoretic Perspective
Semester Project II
By: Dawit Nigatu
Supervisor: Prof. Dr. Werner Henkel
Transmission Systems Group (TrSyS)
School of Engineering and Science
October 2013
JACOBS UNIVERSITY BREMEN
Abstract
School of Engineering and Science
Encoding of Amino Acids and Proteins from a Communications and Information Theoretic Perspective
by Dawit Nigatu
This research contains two separate parts. In the first part, we have used the classical multidimensional scaling (CMD) technique to scale down a 64-dimensional empirical codon mutation (ECM) matrix and a 20-dimensional chemical distance matrix to two dimensions (2-D). The 2-D plots of the ECM matrix show that most mutations occur between codons that encode the same amino acid, i.e., the change from one codon to another will not change the amino acid to be produced. Furthermore, most of the highly probable inter-amino-acid mutations will not result in a dramatic change of chemical properties. However, we have seen some inconsistencies in comparing the 2-D plots of the ECM and chemical distance matrices, in which codons near to each other in mutation distance have a significant difference in chemical properties. Such mutations may have severe effects, and hence the results point out that some protection mechanism is needed to counteract them. In addition, the arrangement of the amino acids is very much in line with the so-called Taylor classification. In the second part of the research, we have focused on investigating the relationship between the Shannon and Boltzmann entropies using the complete genome sequence of the bacterium E. coli. There are positions in which parallel and antiparallel relationships exist. We have found that around the terminus the two entropies seem to have an opposite trend, with high Shannon and low Boltzmann entropies, meaning that the sequence is more random and at the same time less stable. In general, the Boltzmann entropy decreases as we move along the genome from the origin to the terminus. Furthermore, in cooperation with a molecular biology colleague, we have compared the entropies with the numbers of different types of functional genes (anabolic, catabolic, aerobic, and anaerobic) located at the same positions. We have seen that there is a strong similarity between the distribution of anabolic genes and the two entropies.
Contents

Abstract

List of Figures

1 Introduction
1.1 Basic Theoretical Background
1.1.1 DNA
1.1.2 The Central Dogma
1.2 Organization of the Report

2 Dimension Reduction of Evolutionary and Chemical Distance Matrices
2.1 Evolutionary Substitution and Chemical Distance Matrices
2.2 Classical Multidimensional Scaling
2.3 Result and Discussion

3 Relation Between Boltzmann and Shannon Entropy
3.1 Introduction
3.2 Boltzmann Entropy and Distribution
3.2.1 Laws of Thermodynamics
3.2.1.1 First Law of Thermodynamics
3.2.1.2 Second Law of Thermodynamics
3.2.2 Ideal Gas Law
3.2.3 Entropy of a Gas
3.2.3.1 Macroscopic View
3.2.3.2 Microscopic View: Boltzmann Entropy
3.2.4 Boltzmann Distribution
3.2.5 Gibbs Entropy Formula
3.2.6 Entropy of an Ideal Gas
3.3 Entropy of the E. coli Genome
3.4 Result and Discussion

4 Conclusions

A Additional Plots

Bibliography
List of Figures

1.1 The structure of DNA.
1.2 Central dogma of molecular biochemistry with enzymes.
1.3 Codon-amino acid encoding chart.
2.1 2-D plot of the mutation distance matrix.
2.2 2-D plot of the chemical distance matrix.
2.3 Taylor classification of amino acids.
2.4 3-D plot of the mutation distance matrix.
3.1 Adiabatic expansion of a gas at constant temperature.
3.2 Boltzmann and Shannon entropies of E. coli genome, 2bp block.
3.3 Boltzmann and Shannon entropies of E. coli genome, 3bp block.
3.4 Number of anabolic genes with Boltzmann and Shannon entropies.
3.5 Number of anabolic genes with difference of Boltzmann and Shannon entropies.
3.6 Number of catabolic genes with Boltzmann and Shannon entropies.
3.7 Number of catabolic genes with difference of Boltzmann and Shannon entropies.
3.8 Number of aerobic genes with Boltzmann and Shannon entropies.
3.9 Number of aerobic genes with difference of Boltzmann and Shannon entropies.
3.10 Number of anaerobic genes with Boltzmann and Shannon entropies.
3.11 Number of anaerobic genes with difference of Boltzmann and Shannon entropies.
A.1 Boltzmann and Shannon entropies of E. coli genome, 4bp block.
A.2 Boltzmann and Shannon entropies of E. coli genome, 5bp block.
A.3 Boltzmann and Shannon entropies of E. coli genome, 6bp block.
Chapter 1
Introduction
1.1 Basic Theoretical Background

1.1.1 DNA
Deoxyribonucleic acid (DNA) is a double-stranded structure found in all cells, containing the genetic information of the living organism. It consists of building blocks called nucleotides. The nucleotides are made of a sugar-phosphate backbone and one of four nitrogenous bases attached to the sugars. These bases are called Adenine, Thymine, Cytosine, and Guanine (A, T, C, G). To give the DNA its double-helix structure, the nucleotides are linked together into chains. The structure of DNA is shown in Fig. 1.1.
The two strands are complementary to each other. According to the Watson-Crick pairing rule, A is always paired with T and G is always paired with C [2]. This means that if we know the sequence of nucleotides on one strand, the sequence of the complementary strand is known right away. The bases are attached by hydrogen bonds: GC pairs have three hydrogen bonds, whereas AT pairs have two. The additional hydrogen bond makes GC pairs more stable than AT pairs.
1.1.2 The Central Dogma
Francis Crick [3] stated that the flow of biological information is from DNA towards proteins and called this process the central dogma of molecular biology (Fig. 1.2). The sequence of bases in a segment of DNA, called a gene, carries the directions for building proteins that have special functions in the cell. Protein synthesis consists of two steps, transcription and translation.

Figure 1.1: The structure of DNA [1].

The RNA (ribonucleic acid) polymerase enzyme
unwinds the DNA molecule and the transcription process begins. In transcription, the gene sequence is copied into messenger RNA (mRNA) using the template strand of the DNA. Messenger RNA is a single-stranded molecule similar to DNA, except that the base Thymine (T) is replaced by Uracil (U).
In the translation phase, the ribosome translates the sequence of the mRNA molecule into amino acids, reading the sequence in groups of three bases (codons). There are 20 naturally occurring amino acids. The chart in Fig. 1.3 shows the codon to amino acid translation. The process starts when the smaller ribosomal subunit attaches to the translation initiation site, usually AUG. Then, the transfer RNA (tRNA) binds to the mRNA. The tRNA contains an anticodon complementary to the mRNA to which it binds, and the corresponding amino acid is attached to it. Next, the large ribosomal subunit binds to create the P-site (peptidyl) and A-site (aminoacyl). The first tRNA occupies the P-site and the second tRNA enters the A-site. After that, the tRNA at the P-site transfers the amino acid it carries to the second tRNA at the A-site and exits. Finally, the ribosome moves along the mRNA and the next tRNA enters. This process continues until a stop codon (UAG, UAA, or UGA) signals the end of the mRNA molecule. Lastly, the amino acids are connected by peptide bonds and folded in a certain way to create a protein.
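To make the codon-by-codon reading concrete, the following minimal Python sketch translates an mRNA fragment; the table here is a deliberately small excerpt of the chart in Fig. 1.3 (the full chart has 64 codons):

```python
# Excerpt of the standard codon table (Fig. 1.3); the full chart has 64 codons.
CODON_TABLE = {
    "AUG": "Met",  # methionine, also the usual translation start
    "UUU": "Phe", "UUC": "Phe",
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}

def translate(mrna):
    """Read an mRNA string in groups of three bases and translate until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE.get(mrna[i:i + 3], "?")  # '?' marks codons outside this excerpt
        if amino == "Stop":
            break
        protein.append(amino)
    return protein

print(translate("AUGUUUGGAUAA"))  # -> ['Met', 'Phe', 'Gly']
```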
1.2 Organization of the Report
In Chapter 2, we first present the different types of evolutionary substitution matrices and the chemical distance matrix, followed by the mathematics behind classical multidimensional scaling. Then, the results of the dimension reduction are presented and discussed. In Chapter 3, derivations of the Boltzmann entropy and the Boltzmann distribution are described. Thereafter, the Shannon and Boltzmann entropies of the E. coli genome are computed, presented, and discussed. Finally, the conclusions are presented in Chapter 4.
Chapter 2

Dimension Reduction of Evolutionary and Chemical Distance Matrices
2.1 Evolutionary Substitution and Chemical Distance Matrices
There are several substitution matrices describing the mutational change of one amino acid into another inside protein sequences. The first such matrix is the point accepted mutations (PAM) matrix, which is obtained by counting the number of replacements and computing the mutation probabilities from a database of aligned protein sequences [6]. However, if the protein sequences lie on distant parts of the phylogenetic tree, the PAM matrix is not efficient. A second type, which overcomes this shortcoming of the PAM matrix, is the BLOSUM (Block Substitution Matrix), which uses blocks of aligned protein segments [7]. A third type based on amino acid substitutions is the WAG substitution matrix [8]. The WAG matrix utilizes a large database of aligned proteins of different families and uses a maximum-likelihood technique to derive the substitution scores.
The evolutionary models mentioned so far are based on amino acid substitutions. Besides these, there are also models which describe codon-to-codon substitutions. One of them is the 64 × 64 empirical codon mutation (ECM) matrix proposed by Schneider et al. [9]. For developing the ECM matrix, 8.3 million aligned codons from five vertebrates were used to tally the number of substitutions and derive the mutation probabilities. Since transitions to stop codons are not considered, the matrix is block diagonal: a 3 × 3 block for the three stop codons, separated from the 61 × 61 block for the remaining codons. The ECM matrix provides an extra edge by giving the transitions between codons encoding the same amino acid, in addition to transitions leading to different ones. Hence, we have used this matrix for the rest of our work.
Grantham’s chemical distance matrix takes into account three chemical properties (composition, polarity, and molecular volume) which have a strong correlation with the substitution frequencies. The matrix presents a mechanism to quantify the difference between amino acids: the distance between two amino acids is computed by treating the three chemical properties as axes of a Euclidean space.
We would like to compare how these chemical properties relate to the mutation probabilities. Since the matrices are of 64 and 20 dimensions, we have to apply a dimension-reduction technique to bring them down to 2 or 3 dimensions for easy comparison and to see if some kind of clustering appears. More importantly, we would like to see the severity of mutational changes, which is visible in the chemical properties. For reducing the dimensions of the 64 × 64 ECM and 20 × 20 chemical distance matrices, we used a technique called classical multidimensional scaling (CMD), which is presented in the following section.
2.2 Classical Multidimensional Scaling
In this section, the mathematics behind the CMD technique is described. The reference used for this section is [10].

Assume that we have observed an n × n Euclidean distance matrix D = [d_{ij}] derived from a raw n × p data matrix X. With CMD, the aim is to recover the original configuration of n points in p dimensions from the distance matrix. However, since distances are invariant to changes in location, rotation, and reflection, the original data cannot be fully retrieved.
Define an n × n matrix B such that

B = XX^T .  (2.1)

The elements of B are given by

b_{ij} = \sum_{k=1}^{p} x_{ik} x_{jk} .  (2.2)
Similarly, since D is a distance matrix, the squared Euclidean distances can be written as

d_{ij}^2 = \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 = \sum_{k=1}^{p} x_{ik}^2 + \sum_{k=1}^{p} x_{jk}^2 - 2 \sum_{k=1}^{p} x_{ik} x_{jk} = b_{ii} + b_{jj} - 2 b_{ij} .  (2.3)
At this point, if we can rewrite the b_{ij}'s in terms of the d_{ij}'s, X can be derived from B. However, unless a location constraint is introduced, a unique solution cannot be found to determine B from D. Commonly, the centers of the columns of X are set to the origin, i.e.,

\sum_{i=1}^{n} x_{ik} = 0 , \quad \forall k .  (2.4)
The added constraint also means that the sum of the terms in any row of B is zero. Let T be the trace of B and observe that

\sum_{i=1}^{n} d_{ij}^2 = T + n b_{jj} ,  (2.5)

\sum_{j=1}^{n} d_{ij}^2 = n b_{ii} + T ,  (2.6)

\sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}^2 = 2 n T .  (2.7)
Solving for b_{ij},

b_{ij} = -\frac{1}{2} \left( d_{ij}^2 - \frac{1}{n} \sum_{j=1}^{n} d_{ij}^2 - \frac{1}{n} \sum_{i=1}^{n} d_{ij}^2 + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}^2 \right) .  (2.8)
Applying a singular value decomposition (SVD) to B,

B = V \Lambda V' = V \Lambda^{1/2} \Lambda^{1/2} V' .  (2.9)

Using only the 2 (or 3) biggest eigenvalues, λ_1 and λ_2, and the corresponding eigenvectors u_1 and u_2, we obtain

X = V_1 \Lambda_1^{1/2} , \quad \text{where} \quad \Lambda_1 = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} \quad \text{and} \quad V_1 = [\, u_1 \;\; u_2 \,] .  (2.10)
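As a concrete illustration of Eqs. (2.8)–(2.10), the reconstruction can be sketched in a few lines of Python (a minimal sketch assuming NumPy; the double-centering B = -(1/2) J D² J with J = I - (1/n) 11ᵀ used below is equivalent to Eq. (2.8)):

```python
import numpy as np

def classical_mds(D, dims=2):
    """Classical multidimensional scaling of an n x n Euclidean distance matrix D."""
    n = D.shape[0]
    # Double centering, equivalent to Eq. (2.8): B = -1/2 * J * D^2 * J
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # Eigendecomposition of the symmetric matrix B, cf. Eq. (2.9)
    eigvals, eigvecs = np.linalg.eigh(B)
    # Keep the `dims` largest eigenvalues and eigenvectors, cf. Eq. (2.10)
    idx = np.argsort(eigvals)[::-1][:dims]
    scale = np.sqrt(np.maximum(eigvals[idx], 0.0))  # clip tiny negative eigenvalues
    return eigvecs[:, idx] * scale                  # X = V1 * Lambda1^(1/2)
```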
2.3 Result and Discussion
To apply the CMD method, we need to convert the mutation probabilities in the ECM matrix to some form of Euclidean distance measure. To do so, we have assumed a Gaussian model and computed the codon-based distances from the pairwise error probability expression given by

P_{ij} = \frac{1}{2} \, \mathrm{erfc}\!\left( \frac{D_{ij}}{\sqrt{2}\,\sigma} \right) ,  (2.11)

where σ is the standard deviation. We have assumed a constant standard deviation for the mutation distances.
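Inverting Eq. (2.11) gives D_{ij} = \sqrt{2}\,\sigma\,\mathrm{erfc}^{-1}(2 P_{ij}). A sketch of this conversion (assuming SciPy is available; P stands for the ECM mutation probability matrix and σ for our assumed constant standard deviation):

```python
import numpy as np
from scipy.special import erfcinv

def probs_to_distances(P, sigma=1.0):
    """Invert Eq. (2.11): D_ij = sqrt(2) * sigma * erfcinv(2 * P_ij)."""
    P = np.clip(P, 1e-300, 0.5)        # keep erfcinv's argument in (0, 1]
    D = np.sqrt(2.0) * sigma * erfcinv(2.0 * P)
    np.fill_diagonal(D, 0.0)           # a codon is at zero distance from itself
    return D
```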
The two-dimensional (2-D) plots of the mutation and chemical distance matrices are shown in Figures 2.1 and 2.2, respectively. The codons encoding the same amino acid are bundled together. Also, the clustering of amino acids is mostly consistent with the Taylor classification shown in Fig. 2.3, which classifies amino acids based on their physicochemical properties [11]. From these observations we can deduce that most of the mutational changes will not lead to a significant change of chemical properties. However, there are also some inconsistencies, where lower mutation distances come together with higher chemical distances and vice versa. The results can also be used as references for applying some sort of protection for high mutation probabilities with high chemical differences. The inconsistencies are listed below.
Large chemical distance but small mutation distance
• C with "all others"
• G with E
• S with {P,T,A}
• {D,N} with E
• {D,N} with G
• {Q,H} with {W,Y}
• K with N
Small chemical distance but large mutation distance
• {W,Y} with {F,L,M,I,V}
• {P,T,A} with {Q,H,R}
Figure 2.1: 2-D plot of the mutation distance matrix.
Figure 2.2: 2-D plot of the chemical distance matrix.
Figure 2.3: Taylor classification of amino acids [12].
The CMD method works best if the eigenvalues used for reconstruction are very large compared to the unused eigenvalues. However, in our case the eigenvalues do not decay very quickly, and hence the error of the 2-D representation is significant, with a root mean squared error of around half the mean distance. For this reason, we will try to improve the performance by applying a better dimension-reduction and clustering method in future work.
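This error can be quantified directly; a short check (reusing the classical_mds sketch of Section 2.2 on a distance matrix D) compares the reproduced 2-D distances with the originals:

```python
# Embed in 2-D, recompute pairwise distances, and measure the relative RMS error.
X2 = classical_mds(D, dims=2)
D_hat = np.linalg.norm(X2[:, None, :] - X2[None, :, :], axis=-1)
rms = np.sqrt(np.mean((D - D_hat) ** 2))
print(rms / D.mean())  # for our ECM-derived distances this ratio was roughly 0.5
```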
The 3-D plot of the ECM mutation matrix is shown in Fig. 2.4.
Figure 2.4: 3-D plot of the mutation distance matrix.
Chapter 3

Relation Between Boltzmann and Shannon Entropy
3.1 Introduction
DNA is a double sequence of nucleotides based on a 4-letter alphabet, Adenine, Thymine, Cytosine, and Guanine (A, T, C, G), in which the second sequence is complementary to the first one. For such a sequence, the Shannon entropy gives an average measure of information, obtained from the distribution of the symbols (words) of the source. In addition, the order in which these four letters are aligned in the DNA is a major factor determining the stability of the DNA structure [13]. Hence, looking into the information contained in the sequence of nucleotides, along with the stability that comes with it, is important. The Shannon block entropy for a block size of N symbols is mathematically given as
H_N = -\sum_i P_i^{(N)} \log P_i^{(N)} ,  (3.1)

where P_i^{(N)} is the probability (relative frequency) of observing the ith word of block size N. The entropy is maximal when all words occur with equal probability, and it is zero when one of the words occurs with probability one.
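Eq. (3.1) translates directly into code; a minimal Python sketch (non-overlapping words are assumed here, and the logarithm base merely fixes the unit):

```python
import numpy as np
from collections import Counter

def shannon_block_entropy(seq, N):
    """Shannon block entropy H_N of Eq. (3.1) from relative word frequencies."""
    words = (seq[i:i + N] for i in range(0, len(seq) - N + 1, N))  # non-overlapping words
    counts = np.array(list(Counter(words).values()), dtype=float)
    P = counts / counts.sum()              # relative frequencies P_i^(N)
    return float(-np.sum(P * np.log2(P)))  # entropy in bits

print(shannon_block_entropy("ATCGATGGCCTA", 2))
```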
In statistical mechanics and thermodynamics, the Boltzmann-Gibbs entropy has a form very similar to the Shannon entropy measure given in Eq. (3.1). However, it should be properly scaled by the Boltzmann constant k, which gives this entropy a unit of kcal/Kelvin, and the natural logarithm is used:

S = -k \sum_i P_i^{(N)} \ln P_i^{(N)} .  (3.2)
Our aim is to apply the two forms of entropy measures to the complete genome of Escherichia coli (E. coli) and to see how the entropies develop across the genome. Furthermore, we would like to compare them and figure out whether there is some relation between the two.
3.2 Boltzmann Entropy and Distribution

3.2.1 Laws of Thermodynamics
In this section, the two laws of thermodynamics will be presented. The reference used
for this section is [14].
3.2.1.1 First Law of Thermodynamics
For a system undergoing a process, the change in energy is equal to the heat added to the system minus the work done by the system. It simply means that the energy of the universe is conserved. The change in internal energy of the system, dE, is given by the equation

dE = dQ - dW ,  (3.3)

where dQ is the heat transferred into or out of the system and dW is the work done by or on the system. If the work done is mechanical work by an expanding or contracting gas, dW can be derived to be P dV and the equation becomes

dE = dQ - P \, dV .  (3.4)
The negative sign comes from the sign convention for work. The above equation is only valid if the pressure is constant throughout the process. Under such conditions, the heat transfer is called enthalpy (H) and the first law of thermodynamics can be written as

dE = dH - P \, dV .  (3.5)
3.2.1.2 Second Law of Thermodynamics
The second law is about entropy, a quantity which describes the microscopic state of a system in equilibrium. If the system is thermally isolated and undergoes a change of state, the entropy will always increase, i.e.,

\Delta S \geq 0 .  (3.6)

However, if the system is not thermally isolated and the change of state happens in a quasistatic fashion in which a heat dQ is absorbed, then

dS = \frac{dQ}{T} ,  (3.7)
where T is the absolute temperature. Entropy has units of Joule/Kelvin or Cal/Kelvin
and it is a state variable, i.e., it is independent of the path between the initial and final
states.
3.2.2 Ideal Gas Law
The state of a gas is determined by its pressure (P), volume (V), and temperature (T) [14]. The ideal gas law is commonly stated as

P V = nRT ,  (3.8)

where n is the number of moles of the gas and R is the universal gas constant (8.314 J/K·mol). The ideal gas law can also be formulated as

P V = N kT ,  (3.9)

where N is the number of molecules in the gas and k is the Boltzmann constant.
3.2.3 Entropy of a Gas

3.2.3.1 Macroscopic View
Consider an isothermal and adiabatic process, i.e., occurring without exchange of heat
of a system with its environment at constant temperature. Since we considered the
process to be adiabatic and isothermal, dE = 0 and dT = 0 [15].

Figure 3.1: Adiabatic expansion of a gas at constant temperature [15].

Using the laws of thermodynamics (Eqs. (3.4) and (3.7)) and the ideal gas law (Eq. (3.9)),

dQ = dE + P \, dV ,  (3.10)

T \, dS = \frac{NkT}{V} \, dV ,  (3.11)

dS = \frac{Nk}{V} \, dV .  (3.12)
Integrating from the initial state to the final, we obtain

S = Nk \ln \frac{V_2}{V_1} .  (3.13)

In this specific case the volume is doubled, V_2 = 2V_1, and therefore

S = Nk \ln 2 .  (3.14)

3.2.3.2 Microscopic View: Boltzmann Entropy
It was Boltzmann who first gave thermodynamic entropy a meaning in relation to the number of arrangements of the molecules, Ω [15]. In the above process, if we initially assume the number of molecules to be N and the number of arrangements of molecules (number of possible microscopic states) to be Ω, the final system will have 2^N Ω ways of arrangement (each molecule can be either on the left or on the right). Let S_1 and S_2 be the entropies of the first and second states, with Ω_1 and Ω_2 arrangements, respectively. The following proof is taken from [16]. The entropy of the final system will be

S = S_1 + S_2 .  (3.15)

The number of arrangements Ω of the final system will be

\Omega = \Omega_1 \Omega_2 .  (3.16)
Boltzmann postulated the entropy to be a function of Ω,

S \equiv f(\Omega) .  (3.17)

Therefore, S_1 = f(\Omega_1), S_2 = f(\Omega_2), and

f(\Omega_1 \Omega_2) = f(\Omega_1) + f(\Omega_2) .  (3.18)
Differentiating with respect to Ω_1 leads to

\frac{\partial f(\Omega_1 \Omega_2)}{\partial \Omega_1} = \frac{\partial f(\Omega_1 \Omega_2)}{\partial (\Omega_1 \Omega_2)} \, \Omega_2 = \frac{\partial f(\Omega_1)}{\partial \Omega_1}  (3.19)

\Rightarrow \frac{\partial f(\Omega)}{\partial \Omega} \, \Omega_2 = \frac{\partial f(\Omega_1)}{\partial \Omega_1} .  (3.20)
Again differentiating with respect to Ω_2 yields

\frac{\partial f(\Omega_1 \Omega_2)}{\partial \Omega_2} = \frac{\partial f(\Omega_1 \Omega_2)}{\partial (\Omega_1 \Omega_2)} \, \Omega_1 = \frac{\partial f(\Omega_2)}{\partial \Omega_2}  (3.21)

\Rightarrow \frac{\partial f(\Omega)}{\partial \Omega} \, \Omega_1 = \frac{\partial f(\Omega_2)}{\partial \Omega_2} .  (3.22)
Thus,

\frac{1}{\Omega_2} \frac{\partial f(\Omega_1)}{\partial \Omega_1} = \frac{1}{\Omega_1} \frac{\partial f(\Omega_2)}{\partial \Omega_2}  (3.23)

\Omega_1 \frac{\partial f(\Omega_1)}{\partial \Omega_1} = \Omega_2 \frac{\partial f(\Omega_2)}{\partial \Omega_2} = C ,  (3.24)
where C is a constant, by separation of variables. Integrating,

S_1 = f(\Omega_1) = C \ln \Omega_1 + \mathrm{const}  (3.25)

S_2 = f(\Omega_2) = C \ln \Omega_2 + \mathrm{const}  (3.26)

S = C \ln \Omega_1 + C \ln \Omega_2 + \mathrm{const}  (3.27)

S = S_1 + S_2 .  (3.28)

Hence, with const = 0, we obtain

S = C \ln \Omega .  (3.29)
The value of the constant C can be obtained by applying the postulate to the expansion of a gas depicted in Fig. 3.1:

\Delta S = S_2 - S_1  (3.30)

\Delta S = C \ln 2^N \Omega - C \ln \Omega  (3.31)

\Delta S = C N \ln 2 .  (3.32)
Comparing with Eq. (3.14), we obtain C = k. The Boltzmann entropy becomes

S = k \ln \Omega .  (3.33)

3.2.4 Boltzmann Distribution
Consider an isolated system with energy E, volume V, and number of molecules N fixed. The N molecules will be arranged in such a way that n_1 are in the first energy state (ε_1), n_2 in the second (ε_2), n_3 in the third, ..., and n_i in the ith energy state. The number of possible arrangements will be

\Omega = \binom{N}{n_1} \binom{N - n_1}{n_2} \binom{N - n_1 - n_2}{n_3} \cdots = \frac{N!}{\prod_i n_i!} .  (3.34)
When the system under consideration reaches equilibrium, the molecules will disperse and the number of possible arrangements will be maximal [16]. To find the most probable configuration of the molecules, we have to maximize Ω for fixed N and E. The reference for this section is [16].

\underset{n_i}{\text{maximize}} \;\, \Omega \qquad \text{subject to} \qquad \sum_i n_i = N , \quad \sum_i n_i \epsilon_i = E  (3.35)
Reformulating the constraints in terms of probabilities P_i = n_i / N, the constraint \sum_i n_i = N can be replaced by \sum_i P_i = 1, and \sum_i n_i \epsilon_i = E can be replaced by \sum_i P_i \epsilon_i = \bar{E}. Instead of Ω we can also maximize ln Ω, and the problem becomes

\underset{P_i}{\text{maximize}} \;\, \ln \Omega \qquad \text{subject to} \qquad \sum_i P_i = 1 , \quad \sum_i P_i \epsilon_i = \bar{E} .  (3.36)
Using Stirling's approximation for large N,

\ln N! \approx N \ln N - N .  (3.37)

Applying the approximation to ln Ω,

\ln \Omega \approx \ln N! - \sum_i \ln n_i!  (3.38)

= N \ln N - N - \sum_i n_i \ln n_i + \sum_i n_i  (3.39)

= -N \sum_i P_i \ln P_i .  (3.40)
Omitting N, because it has no effect on the maximization, and applying the method of Lagrange multipliers, Eq. (3.40) leads to

L = -\sum_i P_i \ln P_i - \alpha_0 \sum_i P_i - \beta \sum_i P_i \epsilon_i .  (3.41)

Setting \partial L / \partial P_j = 0,

-\ln P_j - 1 - \alpha_0 - \beta \epsilon_j = 0 ,

P_j = e^{-\alpha} e^{-\beta \epsilon_j} ,

P_j = \frac{1}{Z} e^{-\beta \epsilon_j} , \quad \text{where } Z = e^{\alpha} .  (3.42)
Substituting into the constraints,

\sum_j P_j = 1 \implies \frac{1}{Z} \sum_j e^{-\beta \epsilon_j} = 1  (3.43)

\Rightarrow Z(\beta) = \sum_j e^{-\beta \epsilon_j} .

Therefore,

P_j = \frac{e^{-\beta \epsilon_j}}{\sum_j e^{-\beta \epsilon_j}} .  (3.44)
The constant β can be shown to be 1/kT. To do so, one can compare the average energy of a molecule at equilibrium obtained using the Boltzmann distribution, which is 3/(2β), with the average kinetic energy 3kT/2. Another way to derive that β = 1/kT is as follows [17]. From the definition of temperature, we have

\frac{1}{T} = \left( \frac{\partial S}{\partial E} \right)_{V,N} .  (3.45)
Using Boltzmann's entropy definition, S = k ln Ω, and replacing Eq. (3.42) in Eq. (3.40),

S = k \ln \Omega = -kN \sum_i P_i \ln \left( e^{-\alpha} e^{-\beta \epsilon_i} \right)  (3.46)

= -kN \sum_i P_i \left( -\alpha - \beta \epsilon_i \right)  (3.47)

= kN \sum_i P_i \alpha + kN \beta \sum_i P_i \epsilon_i  (3.48)

= kN \alpha + kN \beta \frac{E}{N}  (3.49)

= kN \alpha + k \beta E ,  (3.50)

\frac{1}{T} = \left( \frac{\partial S}{\partial E} \right)_{V,N} = k \beta  (3.51)

\implies \beta = \frac{1}{kT} .  (3.52)
Therefore, the Boltzmann distribution, relating the energy and temperature to the microscopic properties, is given by

P_j = \frac{e^{-\epsilon_j / kT}}{\sum_i e^{-\epsilon_i / kT}} .  (3.53)
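The constrained maximization (3.36) can also be verified numerically against Eq. (3.44); the following sketch (assuming SciPy, with made-up energy levels and mean energy) maximizes the entropy under the two constraints and recovers the Boltzmann weights:

```python
import numpy as np
from scipy.optimize import minimize

energies = np.array([0.0, 1.0, 2.0, 3.0])  # hypothetical energy levels eps_i
E_mean = 1.2                               # hypothetical fixed mean energy

# Maximize -sum P ln P subject to sum P = 1 and sum P*eps = E_mean (Eq. 3.36)
res = minimize(
    lambda P: np.sum(P * np.log(P)),       # minimizing the negative entropy
    x0=np.full(energies.size, 1.0 / energies.size),
    bounds=[(1e-9, 1.0)] * energies.size,
    constraints=[
        {"type": "eq", "fun": lambda P: P.sum() - 1.0},
        {"type": "eq", "fun": lambda P: P @ energies - E_mean},
    ],
)

# With eps_1 - eps_0 = 1 here, ln(P_0 / P_1) = beta, cf. Eq. (3.44)
beta = np.log(res.x[0] / res.x[1])
boltzmann = np.exp(-beta * energies) / np.exp(-beta * energies).sum()
print(res.x, boltzmann)  # the two vectors should agree
```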
3.2.5 Gibbs Entropy Formula
In the Boltzmann definition of the entropy, at a fixed energy, all states resulting in an energy E are assumed to be equally likely [15]. If the states of the thermodynamic system are not equally probable, Gibbs' definition of entropy is used, given by

S = -k \sum_i P_i \ln P_i ,  (3.54)

where the sum is over all microstates and P_i is the probability that the molecule is in the ith microstate [18]. This definition, like Boltzmann's, is a fundamental postulate which can explain the experimental facts accurately [18]. To see that this definition of entropy is more general, consider a system having Ω microstates; if all microstates are equally probable, i.e., P_i = 1/Ω, (3.54) results in

S = -k \sum_{i=1}^{\Omega} \frac{1}{\Omega} \ln \frac{1}{\Omega} = k \ln \Omega ,  (3.55)

which is the Boltzmann definition of entropy.
3.2.6 Entropy of an Ideal Gas
From the first law of thermodynamics (given in Eq. (3.4)) we have

dQ = dE + P \, dV .  (3.56)

For any gas, the change in internal energy dE depends on the change in temperature. Thus, dE = C_v dT per mole of gas, where C_v is the specific heat at constant volume (the amount of heat per unit mass required to raise the temperature by one kelvin). Hence,

dQ = C_v \, dT + \frac{nRT}{V} \, dV  (3.57)

\frac{dQ}{T} = \frac{C_v}{T} \, dT + \frac{nR}{V} \, dV .  (3.58)

Integrating both sides of the equation leads to

S = C_v \ln T + nR \ln V + \mathrm{const} .  (3.59)
Depending on the experimental conditions of the system, the change in entropy will be different [14].

• If the process is done at constant temperature, \Delta S = nR \ln (V_2 / V_1),
• If the process is done at constant volume, \Delta S = n C_v \ln (T_2 / T_1), and
• If the process is done at constant pressure, \Delta S = n C_p \ln (T_2 / T_1).

3.3 Entropy of the E. coli Genome
We have used the 4,639,221 base pair (bp) sequence of the E. coli K-12 strain. First, the data is rearranged to start at the origin of replication. Then, the entropy of chunks of the DNA sequence is computed for different block sizes (2 bp up to 6 bp) in non-overlapping windows containing 100 Kbp. For calculating the Boltzmann entropy, the stacking energies of base pairs obtained from [13] are used. All neighboring base pairs are considered; that is, if the nucleotide sequence is "AGCT", the energies of AG, GC, and CT are taken into account.
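A sketch of this bookkeeping, i.e., the non-overlapping windows and the nearest-neighbor pairs (a plain Python illustration under the stated choices):

```python
from collections import Counter

WINDOW = 100_000  # non-overlapping 100 Kbp windows

def window_word_counts(genome, N):
    """Per-window counters of non-overlapping words of block size N (2..6)."""
    for start in range(0, len(genome) - WINDOW + 1, WINDOW):
        win = genome[start:start + WINDOW]
        yield Counter(win[i:i + N] for i in range(0, len(win) - N + 1, N))

def stacking_pairs(window):
    """All neighboring base pairs: 'AGCT' -> ['AG', 'GC', 'CT']."""
    return [window[i:i + 2] for i in range(len(window) - 1)]
```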
We have assumed that all nearest-neighbor pairs in the window are independent, and we postulated discrete states in which the probabilities for having the corresponding stacking energy are drawn from the Boltzmann distribution. Although we are aware that the Boltzmann distribution gives the most probable distribution of energy for states having a random distribution of energies (e.g., an ideal gas), which is not the case here, we used it to obtain a representation of stability (energy) in an expression that follows the structure of an entropy. The Boltzmann distribution for a state having a stacking energy E_i at an absolute temperature T is

P_i = \frac{e^{-E_i / kT}}{\sum_i e^{-E_i / kT}} .  (3.60)
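Combining Eq. (3.60) with the Gibbs entropy form of Eq. (3.2) gives a per-window Boltzmann entropy. In the sketch below, E_STACK is only a placeholder for the nearest-neighbor stacking energies of [13], and kT ≈ 0.593 kcal/mol at 298 K:

```python
import numpy as np

# Placeholder stacking energies (kcal/mol); the actual 16 nearest-neighbor
# values are taken from SantaLucia [13].
E_STACK = {"AA": -1.0, "AG": -1.3, "GC": -2.2, "CT": -1.3}  # remaining pairs omitted

def boltzmann_entropy(window, kT=0.593):
    """Boltzmann entropy of a window from its stacking energies, in units of k."""
    E = np.array([E_STACK[window[i:i + 2]] for i in range(len(window) - 1)])
    w = np.exp(-E / kT)
    P = w / w.sum()                       # Boltzmann distribution, Eq. (3.60)
    return float(-np.sum(P * np.log(P)))  # Gibbs entropy form, Eq. (3.2)
```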
3.4 Result and Discussion
The result for a block size of 2 bp and a window size of 100 Kbp is shown in Fig. 3.2. The Boltzmann entropy is scaled down to the range of the Shannon entropy to ease visual comparison. Although we could not yet find a single general interpretation relating the two entropies, we can see an opposite trend at some positions (e.g., windows 16 to 25) and parallel tendencies at some others (e.g., windows 40 to 46). The plots for 3 bp, 4 bp, 5 bp, and 6 bp are similar. This shows that the entropies are more or less invariant under a change of block size. Hence, from now on, results with a block size of 3 bp will be plotted. The plots for 4 bp, 5 bp, and 6 bp are presented in Appendix A.
Figure 3.2: Boltzmann and Shannon entropies of E. coli genome, 2bp block.
Figure 3.3: Boltzmann and Shannon entropies of E. coli genome, 3bp block.
Once the results for the Shannon and Boltzmann entropies were obtained, we discussed them with molecular biology colleagues. As a result, we decided to see how the entropies relate to the numbers of four functional classes of genes, namely anabolic, catabolic, aerobic, and anaerobic. Additionally, our colleagues provided us with data for the distribution of these classes of genes in the genome. We used a 500 Kb sliding window, starting with the origin as the center of the first window, and slid it 4 Kb at a time across the complete genome. Then, the number of genes of the corresponding functional class is plotted along with the Shannon and Boltzmann entropies or their difference. The results are presented in Figures 3.4 to 3.11. Interestingly, from Fig. 3.4, we observe that the shapes of the Boltzmann entropy and the number of anabolic genes are strongly related. This suggests that the stability depends on the number of anabolic genes present. Also, the distribution of the aerobic genes has a pattern similar to the difference of the entropies, as shown in Fig. 3.9. All in all, even if there is no straightforward relationship between some of the curves, there seems to be an underlying relation which we should further analyze together with our molecular genetics colleagues.
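The sliding-window counting itself can be sketched as follows (gene_positions is an assumed array of gene coordinates for one functional class; window ends are simply clipped here rather than wrapped around the circular chromosome):

```python
import numpy as np

def sliding_gene_counts(gene_positions, genome_len, window=500_000, step=4_000):
    """Count genes of one functional class in a 500 Kb window slid 4 Kb at a time."""
    positions = np.sort(np.asarray(gene_positions))
    centers = np.arange(0, genome_len, step)  # first window centered at the origin
    half = window // 2
    lo = np.searchsorted(positions, centers - half)
    hi = np.searchsorted(positions, centers + half)
    return centers, hi - lo                   # window centers and gene counts
```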
Figure 3.4: Number of anabolic genes with Boltzmann and Shannon entropies.
Figure 3.5: Number of anabolic genes with difference of Boltzmann and Shannon entropies.
Figure 3.6: Number of catabolic genes with Boltzmann and Shannon entropies.
Figure 3.7: Number of catabolic genes with difference of Boltzmann and Shannon entropies.
Figure 3.8: Number of aerobic genes with Boltzmann and Shannon entropies.
Figure 3.9: Number of aerobic genes with difference of Boltzmann and Shannon entropies.
Figure 3.10: Number of anaerobic genes with Boltzmann and Shannon entropies.
Figure 3.11: Number of anaerobic genes with difference of Boltzmann and Shannon entropies.
Chapter 4
Conclusions
A comparison between the chemical properties of amino acids and the mutation probabilities of codons was carried out using the classical multidimensional scaling method. The results showed that most of the highly probable mutations will not lead to a dramatic change of chemical properties. However, some inconsistencies were also observed. Thus, further studies of the severity of the mutations and of possible protection mechanisms to counteract their effects are required. In addition, the error introduced in representing 64-dimensional data in two dimensions is significant. This is due to the slow decay of the eigenvalues of the data. Therefore, another dimension-reduction and clustering method with better performance can be applied in the future.

Our second task was to look into the relationship between the Shannon and Boltzmann entropies. We have seen that, even though we have not yet found suitable interpretations, at some positions they follow the same pattern and at other positions they tend to move in opposite directions. We further investigated how both entropies are related to the functional classes of genes located at the same positions in the genome. We found interesting correlations, especially with the distribution of anabolic genes.
Appendix A
Additional Plots
Figure A.1: Boltzmann and Shannon entropies of E. coli genome, 4bp block.
Figure A.2: Boltzmann and Shannon entropies of E. coli genome, 5bp block.
Figure A.3: Boltzmann and Shannon entropies of E. coli genome, 6bp block.
Bibliography

[1] "Deoxyribonucleic acid (DNA)." [Online]. Available: http://www.genome.gov/25520880

[2] J. D. Watson, F. H. Crick et al., "Molecular structure of nucleic acids," Nature, vol. 171, no. 4356, pp. 737–738, 1953.

[3] F. H. Crick, "On protein synthesis," in Symposia of the Society for Experimental Biology, vol. 12, 1958, p. 138.

[4] "Central dogma of molecular biochemistry with enzymes." [Online]. Available: http://en.wikipedia.org/wiki/File:Central_Dogma_of_Molecular_Biochemistry_with_Enzymes.jpg

[5] "More non-random DNA wonders." [Online]. Available: http://iaincarstairs.wordpress.com/2011/12/26/more-non-random-dna-wonders/

[6] M. Dayhoff, R. Schwartz, and B. Orcutt, "A model of evolutionary change in proteins," Atlas of Protein Sequence and Structure, vol. 5, p. 345, 1978.

[7] S. Henikoff and J. G. Henikoff, "Amino acid substitution matrices from protein blocks," Proceedings of the National Academy of Sciences, vol. 89, no. 22, pp. 10915–10919, 1992.

[8] S. Whelan and N. Goldman, "A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach," Molecular Biology and Evolution, vol. 18, no. 5, pp. 691–699, 2001.

[9] A. Schneider, G. M. Cannarozzi, and G. H. Gonnet, "Empirical codon substitution matrix," BMC Bioinformatics, vol. 6, no. 1, p. 134, 2005.

[10] S. W. Cheng, "Multidimensional scaling (MDS)." [Online]. Available: http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5191/lecture/06_MDS.pdf

[11] W. R. Taylor, "The classification of amino acid conservation," Journal of Theoretical Biology, vol. 119, no. 2, pp. 205–218, 1986.

[12] "Amino acids Venn diagram." [Online]. Available: http://commons.wikimedia.org/wiki/File:Amino_Acids_Venn_Diagram.png

[13] J. SantaLucia, "A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics," Proceedings of the National Academy of Sciences, vol. 95, no. 4, pp. 1460–1465, 1998.

[14] F. Reif, Fundamentals of Statistical and Thermal Physics, international student ed. McGraw-Hill, 1985.

[15] W. Allison, "Lecture notes on statistical physics." [Online]. Available: http://www-sp.phy.cam.ac.uk/~wa14/camonly/statistical/Lecture2.pdf

[16] A. Huan, "Course notes on statistical mechanics." [Online]. Available: http://www.spms.ntu.edu.sg/PAP/courseware/statmech.pdf

[17] J. Saunders, "Classical and statistical thermodynamics." [Online]. Available: http://personal.rhul.ac.uk/uhap/027/ph2610/PH2610_files/SECT2.pdf

[18] M. Evans, "Statistical physics section 1: Information theory approach to statistical mechanics." [Online]. Available: http://www2.ph.ed.ac.uk/~mevans/sp/sp1.pdf