Download A B C - PuSH - Publikationsserver des Helmholtz Zentrums München

I declare that I have written this master's thesis independently and have used only the sources and aids indicated.
Munich, 16 March 2009
Konrad Schreiber
Acknowledgements
Several people deserve my sincere thanks for their involvement in the completion of this thesis:
I want to thank Karsten Suhre and Fabian Theis for giving me the opportunity to work on this thesis in their groups. Their advice, both scientific and private, during the writing of this thesis was invaluable.
Agnes Fekete and Philippe Schmitt–Kopplin played an important role in
the completion of this thesis. They provided me with valuable information,
data and last but not least encouraging interest in my work. It would have
been a pleasure to meet them earlier during the course of this thesis.
The whole Computational Modelling in Biology (CMB) group deserves
my thanks as excellent colleagues, providing cheerful company and helpful
advice.
Thanks to Elisabeth Altmaier and Brigitte Wägele who were good colleagues and office–mates during my work on this thesis.
Finally I would like to thank all members of the “Institut für Bioinformatik und Systembiologie” who were not mentioned explicitly for their
professional and general support.
Contents
Acknowledgements
Contents
List of Abbreviations
Abstract
Übersicht
1 Introduction
    1.1 Metabolomics and Metabolic Networks
    1.2 Approaches to Metabolic Network Reconstruction
    1.3 Fourier Transform Mass Spectrometry (FTMS)
        1.3.1 Isotopes
        1.3.2 Isotopic mass and mass defect
        1.3.3 Ionization mode
        1.3.4 The principle of FTMS
        1.3.5 Mass resolution and accuracy
        1.3.6 Extract preparation
        1.3.7 Ionization by electrospray ionization (ESI)
    1.4 Metabolic Compound Identification from Molecular Mass
    1.5 Introduction to Ab Initio Metabolic Network Reconstruction
    1.6 Graph Theoretic Aspects of Biological Networks
        1.6.1 Mathematical Definition of Graphs
        1.6.2 Properties of Graphs
        1.6.3 Types of Graphs
        1.6.4 Properties of Biological Networks
    1.7 Scope of this Thesis
2 Materials and Methods
    2.1 Description of the Dataset
        2.1.1 FTMS Mass Spectrometry Data
        2.1.2 The KEGG Database
        2.1.3 Exact Elemental Masses
        2.1.4 Metabolic Transformations
    2.2 Computational Analysis
    2.3 Network Reconstruction
        2.3.1 Filtering of FTMS Data
        2.3.2 Calculation of Mass Differences
        2.3.3 Clustering
        2.3.4 Network Creation
    2.4 Network Analysis
    2.5 Identification of Mass Differences
3 Results
    3.1 Comprehensive Analysis of Pathway Maps from the KEGG Database
    3.2 Metabolic Network Reconstruction for S. cerevisiae
        3.2.1 Determining the Abundance Cutoff
        3.2.2 Evaluation of the Proposed Null Model
        3.2.3 Identification of Motifs
    3.3 Metabolic Network Reconstruction for D. melanogaster
        3.3.1 Determining the Abundance Cutoff
    3.4 Analysis of the Masses
    3.5 Identification of Frequent Mass Differences
4 Discussion
5 Summary
6 Outlook
References
List of Abbreviations
ATP Adenosine 5’-triphosphate
CoA Coenzyme A
e.g. “exempli gratia” for example
EcoCyc Encyclopedia of Escherichia coli K-12 Genes and Metabolism
ERG Exponential Random Graph
et al. “et alii” and others
FTMS Fourier Transform Ion Cyclotron Resonance Mass Spectrometry
FTP File Transfer Protocol
i.e. “id est” that is
KEGG Kyoto Encyclopedia of Genes and Genomes (Database)
LIPIDMAPS LIPID Metabolites And Pathways Strategy (Database)
MassTRIX Mass TRanslator into Pathways
ppm parts per million
u unified atomic mass unit
Abstract
The topic of this work is the ab initio prediction of metabolic mass difference networks from Fourier Transform Mass Spectrometry (FTMS) data. Mass spectrometric measurement of an organism's metabolites yields a snapshot of that organism's metabolism at the time of the experiment. The method assumes that, for a chemical reaction, substrate and product are present in a certain ratio and that both are measurable. The mass difference between substrate and product then identifies the underlying chemical reaction. Frequent reactions will yield frequent mass differences, and the reconstructed networks are based on these frequent mass differences.
It is shown that frequent mass differences have a biochemical meaning. 90 % of the observed mass differences represent a sequence of 1–6 known metabolic transformations. The resulting networks exhibit a hierarchical scale-free topology. This observation, together with the biochemical meaning of the mass differences, indicates that the reconstructed networks carry information. A correlation between “hub” nodes in these networks and certain properties of the underlying metabolites can be shown. Metabolites with many neighbors in the networks are more likely to be identifiable as important known metabolites. Taking into account that 80–90 % of the measured masses in a metabolomics FTMS experiment cannot be identified, the remaining unidentified “hub” nodes are presumably metabolites of special interest. They are proposed as starting points for a deeper analysis and identification.
The thesis concludes with suggestions for future work and for exploiting the information in the networks.
Übersicht
This thesis deals with the reconstruction of metabolic mass difference networks from Fourier Transform Mass Spectrometry (FTMS) data. Mass spectrometric determination of the metabolites in an organism yields a snapshot of its metabolism at the time of measurement. Assuming that the product and substrate of a chemical reaction are present in a certain ratio, and are therefore measurable, the reaction can be inferred from the mass difference between product and substrate. Frequent reactions thus manifest themselves as frequent mass differences. The mass difference networks reconstructed in this thesis are based on these frequent mass differences.
It is shown that the most frequent mass differences do indeed have a biochemical meaning. 90 % of the observed most frequent mass differences can be identified in this thesis as a sequence of 1–6 known metabolic transformations. The resulting networks exhibit a hierarchical scale-free topology. Together with the biological meaning of the mass differences, this is an indication of the information content of the reconstructed networks. The networks are then examined, and a correlation between central “hub” nodes and certain properties of the metabolites they represent can be shown. Metabolites with many neighbors in the networks are more likely to be known as important metabolites. Since 80–90 % of all masses from a metabolomics FTMS experiment cannot be identified, the remaining unidentified “hub” nodes form an interesting starting point for further analysis and identification of the unknown masses.
The last part of the thesis discusses future possibilities for using the method and the networks.
1 Introduction
1.1 Metabolomics and Metabolic Networks
Metabolism is a very complex biological process. Even simple organisms are able to produce an enormous number of different small organic compounds (metabolites), as well as energy, from very simple molecules such as glucose. These small compounds are in turn further metabolized to build even the biggest building blocks of life. The processes governing these reactions are the subject of intense study [1]. Metabolites are the substrates of metabolism. Structurally they cover a wide range from simple molecules like water (H₂O) to complex structures like fatty acids and lipids. In keeping with this variety, they differ in size, number and nature of functional groups, volatility, charge states or electromobility, polarity and other physicochemical parameters [2]. In parallel to the terms transcriptome and proteome, the set of metabolites synthesized by an organism constitutes its metabolome [3].
Metabolomics is the field of investigating an organism's metabolome. The first step in metabolomics would be the exact determination and quantification of an organism's complete metabolome, but even this first step has not been accomplished to a satisfactory degree, leaving apparently simple questions unanswered. For example, the total number of metabolites of a model organism like Arabidopsis thaliana is still not known [2]; the same is true for Escherichia coli [4]. Despite that, metabolomics is able to produce meaningful and important results.
Besides the investigation of the complete metabolome, there exist different strategies to study selected parts of the metabolome of a given organism. The terms used are currently subject to change, and the scientific community may eventually agree on a coherent terminology [5]. In a review article [5], Dunn summarizes six different strategies. According to this definition, (a) Metabolomics refers to the study of the complete metabolome, (b) Metabolic profiling is the untargeted investigation of as large a set of metabolites as possible, (c) Metabolic fingerprinting investigates a snapshot of the metabolome in an organism, (d) Targeted analysis is the quantification and identification of a small set of metabolites, (e) Metabolic footprinting analyzes the extracellular metabolome of an organism, i.e. metabolites that are either not consumed or are excreted by the organism, and (f) Metabonomics is the quantitative measurement and investigation of the dynamics of living systems under pathophysiological stimuli and under genetic modification.
Metabolic processes are naturally described by graphs and can thus be represented in a well-defined mathematical model. The analysis of such metabolic networks has several applications. It was shown, for example, that it is possible to predict lethal gene deletions by simulating them on a metabolic network [6]. Another study focused on how the production of a specific metabolite can be maximized by investigating the underlying metabolic network, with the aim of improving bioethanol production [7]. It is therefore important to have good sources for reliable metabolic networks.
1.2 Approaches to Metabolic Network Reconstruction
Metabolic networks can be reconstructed from a variety of data using a variety of approaches. Several approaches reconstruct organism-specific metabolic networks from genome information, as provided by the KEGG [8] and EcoCyc [9] databases. Many studies which investigate the properties of metabolic networks use these databases and augment the data with information from the literature [10, 11]. This facilitates the construction of many metabolic networks for a comparative analysis. The problem with this approach is that the data are not complete. For instance, [12] states that there are reactions catalyzed by Escherichia coli for which the enzyme has not been identified. Since E. coli is a well-studied model organism, this problem also exists for other organisms and implies that there might be unknown enzymes catalyzing unknown reactions. A more complex approach is followed by the Palsson group [13]: metabolic networks are first reconstructed from genomic information, i.e. the enzymes present in the organism under study are identified and the reactions they catalyze are determined. Based on this information, together with information from the literature, the compartmentalization of the reactions is incorporated into the model. This yields very precise metabolic networks, which are also readily used in studies of metabolic networks [14].
The above methods require precise knowledge about the organism, such as the full genomic sequence. One of the first approaches able to reconstruct metabolic pathways from one given compound to another without prior knowledge of the organism is described by Arita [15]. This approach identifies structurally similar metabolites and suggests a metabolic path between them as the shortest path along 16 typical metabolic reaction rules. There is, however, no evidence that the organism under consideration is actually able to catalyze all postulated reactions.
A feature which all of the above methods lack is a “snapshot view” of metabolism, termed metabolic fingerprinting above. Such a method would yield a metabolic network which does not represent the theoretical capabilities of an organism, but the metabolic pathways actually active at a certain point in time. This can be used to investigate metabolic reactions under different conditions. For this reason it is desirable to obtain metabolic networks “ab initio” for any given organism. This thesis evaluates an approach to reconstruct these networks from mass spectrometric data without any a priori knowledge, as suggested in [16].
1.3 Fourier Transform Mass Spectrometry (FTMS)
The full name for what is usually abbreviated FTMS is Fourier Transform
Ion Cyclotron Resonance Mass Spectrometry. Before the technical basics are
described, an introduction to the measured entities – atoms, isotopes and
their mass – is given.
1.3.1 Isotopes
An element is specified by the number of protons in its nucleus. The number of protons equals the atomic number of the element, which is usually written as a subscript before the element symbol. Atoms with the same atomic number but different numbers of neutrons in the nucleus are termed isotopes. The sum of protons and neutrons is the so-called mass number of an atom. The mass number is usually written as a superscript before the element symbol. In summary, isotopes of the same element are equal in their atomic number and chemical properties, but differ in their mass number [17, p. 67].
Most elements are polyisotopic, i.e. they exist as multiple stable isotopes. This means that there usually is one isotope which is most abundant (e.g. ¹²C for carbon), but further stable isotopes can be encountered (e.g. ¹³C). Some elements naturally occur with only one stable isotope. These elements are termed monoisotopic elements. The most important such elements in biology are ¹⁹F (fluorine), ²³Na (sodium), ³¹P (phosphorus) and ¹²⁷I (iodine). Some elements occur with exactly two stable isotopes and are thus called di-isotopic elements. The most important ones in biology are hydrogen (¹H, ²H), carbon (¹²C, ¹³C) and nitrogen (¹⁴N, ¹⁵N), for which the mass numbers differ by one. One can also regard chlorine (³⁵Cl, ³⁷Cl) as an important di-isotopic element in biology; the mass numbers of the two chlorine isotopes differ by two. Oxygen has three stable isotopes, ¹⁶O (most abundant), ¹⁷O and ¹⁸O (second most abundant). Table 1 summarizes the isotopic abundances for the above elements.
Isotopic abundances are reported either such that they sum to 100 %, or with the most abundant isotope normalized to 100 %. In this thesis the values from reference [17] are used. Because of the low abundances of the heavier isotopes of N, O and H, these elements can be treated as approximately monoisotopic [17, p. 68]. Care has to be taken with ¹³C. Carbon is a ubiquitous element in organic chemistry and, due to the relatively high ¹³C abundance, effects of ¹³C incorporation in biomolecules have to be considered.
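To make the relevance of ¹³C concrete, the carbon ratio from Table 1 (100:1.08) can be turned into a per-atom probability. The following Python sketch is only an illustration: it assumes that each carbon position is occupied independently, and the function name `prob_at_least_one_13c` is hypothetical.

```python
# Relative abundance 12C : 13C = 100 : 1.08 (value from Table 1).
RATIO_12C = 100.0
RATIO_13C = 1.08

# Probability that a single carbon atom is 13C.
P_13C = RATIO_13C / (RATIO_12C + RATIO_13C)


def prob_at_least_one_13c(n_carbons: int) -> float:
    """Probability that a molecule with n_carbons carbon atoms carries
    one or more 13C atoms (assumption: positions are independent)."""
    if n_carbons < 0:
        raise ValueError("n_carbons must be non-negative")
    return 1.0 - (1.0 - P_13C) ** n_carbons


if __name__ == "__main__":
    for n in (1, 6, 20, 50):
        print(f"{n:3d} carbons: {prob_at_least_one_13c(n):.3f}")
```

Under this assumption, roughly one in five molecules with 20 carbon atoms contains at least one ¹³C atom, so the corresponding isotopic peaks can be expected in the measured spectra.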
Element       Isotopic abundance
Carbon        100:1.08
Nitrogen      100:0.369
Oxygen        100:0.205
Hydrogen      100:0.0115
Sulfur        100:4.52
Chlorine      100:31.96
Phosphorus    only one stable isotope
Table 1: Biologically relevant elements and their isotopic abundances. If more than one stable isotope exists, the abundance of the most abundant isotope relative to the second most abundant one is given. Data taken from [17, p. 69].
1.3.2 Isotopic mass and mass defect
Atomic masses are measured in unified atomic mass units, abbreviated u. Protons and neutrons each have a mass of approximately 1 u, and since 1961 one u has been defined as 1/12 of the mass of a ¹²C atom [17, p. 71]. The isotopic mass is the exact mass of an isotope. It is always close to, but never exactly equal to, the mass number of the isotope. Because of the definition of the atomic mass unit, the only exception is ¹²C. The difference between the isotopic mass and the mass number arises from the mass defect. Because a bound system is at a lower energy level than its unbound constituents, according to Einstein's formula E = mc², its mass is less than the total mass of its unbound constituents. The binding energy of protons and neutrons in the nucleus is sufficiently high to cause a measurable mass defect. Thus, the isotopic mass of an isotope is the sum of the masses of its constituents minus the mass equivalent of the nuclear binding energy. E.g. the mass difference between ¹²C and ¹³C is only 1.0034 u, and not the mass of the neutron, 1.0087 u. In theory, the chemical bonds in a molecule also introduce a mass defect. But since the chemical bond energy is much lower than the nuclear binding energy, these effects can be neglected. E.g. the average bond energy of each OH bond in water (H₂O) is 458.9 kJ/mol. The mass defect of the two bonds amounts to about 1.02 · 10⁻⁸ u per molecule:

$$ E = \frac{2 \cdot 458900\ \mathrm{J\,mol^{-1}}}{N_A} = 1.524 \cdot 10^{-18}\ \mathrm{J\ per\ molecule} $$

$$ \Delta m = \frac{E}{c^2} = \frac{1.524 \cdot 10^{-18}\ \mathrm{J}}{(3.0 \cdot 10^{8}\ \mathrm{m\,s^{-1}})^2} = 1.693 \cdot 10^{-35}\ \mathrm{kg} = 1.02 \cdot 10^{-8}\ \mathrm{u} $$

with N_A = 6.022 · 10²³ mol⁻¹ the Avogadro constant, c = 3.0 · 10⁸ m s⁻¹ the speed of light and 1 u = 1.6605402 · 10⁻²⁷ kg.
The relative atomic mass is calculated as the weighted average of the masses of all naturally occurring isotopes of an element. So with the exception of monoisotopic elements, no atoms with a mass equaling the relative atomic mass can be observed in nature. Rather, a spectrum of isotopic masses is present for each element.
The monoisotopic mass of an element is defined as the mass of its most abundant isotope. These definitions for single atoms carry over to molecules: the monoisotopic mass of a molecule is defined as the sum of the monoisotopic masses of the atoms it comprises. The monoisotopic mass does not necessarily arise from the lightest occurring isotope of an element. However, among the elements important in biology, the lightest occurring isotope is usually the most abundant one.
Correspondingly, the relative molecular mass is the sum of the relative atomic masses of the molecule's atoms. If an ion is formed by the removal of one or more electrons from a molecule, the exact ionic mass is the monoisotopic mass of the molecule minus the mass of the removed electrons. For negative ions, the electron mass (0.000548 u) needs to be added accordingly.
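The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the thesis: the isotope masses are standard literature values rounded to eight decimals, the function names are hypothetical, and the parser only handles simple formulas without brackets.

```python
import re

# Monoisotopic masses in u of the most abundant isotopes
# (standard literature values, not taken from the thesis).
MONOISOTOPIC_MASS = {
    "H": 1.00782503,
    "C": 12.0,
    "N": 14.00307401,
    "O": 15.99491462,
    "P": 30.97376163,
    "S": 31.97207100,
}
ELECTRON_MASS = 0.000548  # u, as quoted in the text


def parse_formula(formula: str) -> dict:
    """Parse a simple formula such as 'C6H12O6' into element counts.
    Brackets and charges are not supported in this sketch."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def monoisotopic_mass(formula: str) -> float:
    """Sum of the monoisotopic masses of all atoms in the formula."""
    return sum(MONOISOTOPIC_MASS[el] * n
               for el, n in parse_formula(formula).items())


def ion_mass(formula: str, charge: int) -> float:
    """Exact mass of an ion formed purely by removing (charge > 0)
    or adding (charge < 0) electrons."""
    return monoisotopic_mass(formula) - charge * ELECTRON_MASS


if __name__ == "__main__":
    print(f"H2O neutral: {monoisotopic_mass('H2O'):.4f} u")
    print(f"H2O cation:  {ion_mass('H2O', +1):.4f} u")
```

Elements missing from the small lookup table raise a `KeyError`, which is intentional here: silently guessing a mass would defeat the purpose of an exact mass calculation.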
1.3.3 Ionization mode
FTMS can operate in either positive or negative ionization mode. In positive mode positively charged ions are measured, in negative mode negatively charged ions are measured. Not every metabolite is easily ionized, and some metabolites can be negatively charged but not positively, and vice versa. Because of this, in metabolomics studies it makes sense to combine measurements from positive and negative mode whenever possible to cover a broad spectrum of metabolites [18]. The information whether a chemical species was measured in negative or positive mode can also aid in identifying that species.
1.3.4 The principle of FTMS
FTMS is based on ion cyclotron (IC) motion. Ions moving in a magnetic field are forced into a circular motion in the plane perpendicular to the magnetic field lines. This is because an ion experiences a Lorentz force which is perpendicular to both the direction of the ion's velocity and the magnetic field lines. Ion movement along the axis parallel to the magnetic field lines is unrestricted. To trap the ions completely and measure their mass, an electric field is applied to create a potential well [19]. The ions then exhibit their cyclotron motion in the magnetic field and are additionally trapped in a simple harmonic oscillation along the magnetic field lines by the trapping potential. In the cubic analyzer cell this trapping potential is established by two metal plates which are perpendicular to the magnetic field lines. A small, symmetric positive voltage on both trapping plates will trap positively charged ions, a negative voltage will trap negative ions. Schematics of the mass spectrometer's analyzer cells are given in Figure 1.
[Figure 1 labels: magnetic field, trapping plates, excitation plates (E), detection plates (D), cyclotron movement]
Figure 1: Schematic of the detection cell of an ion cyclotron mass spectrometer, with the cubic version on the left and the cylindrical version on the right. The smaller drawings slightly below depict the view from top to bottom (along the magnetic field). E stands for excitation plate, D for detection plate. The cubic cell is well suited to understanding the principle, but most modern mass spectrometers contain the cylindrical version.
The motion of the ions in the analyzer cell is governed by the magnetic field, the electric field, the charge of the ions and the ions' mass [19, 20, 21]. The ion motion can be divided into the cyclotron motion due to the magnetic field, the trapping motion due to the electric field and the magnetron motion due to the combination of the magnetic and electric fields. Since for this thesis it is only necessary to understand the basic concept of FTMS, the detailed equations from the references are not reproduced here. The focus is on the important cyclotron motion. Because the magnetic field and the electric field are known parameters, the mass over charge ratio of ions can be measured in the mass spectrometer by identifying the frequency of the unperturbed cyclotron motion. This cyclotron motion is governed by an equation that is derived as follows:
The force on an ion with mass m and charge q moving in a magnetic field B, with velocity v perpendicular to the field, is

$$ F = m a = m \frac{dv}{dt} = q v B. $$

Because the centripetal acceleration is |dv/dt| = v²/r, this becomes

$$ \frac{m v^2}{r} = q v B. $$

The angular velocity is defined as ω = v/r, so this becomes

$$ m \omega^2 r = q B \omega r, $$

which is simply

$$ \omega_c = \frac{q B}{m}, $$

with ω_c the angular velocity of the cyclotron motion, q the charge in coulomb, B the magnetic field in tesla and m the mass in kg (1 u ≈ 1.6605 · 10⁻²⁷ kg) [20].
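As a numerical illustration of the cyclotron equation, the following Python sketch converts between ion mass and cyclotron frequency. The function names are hypothetical and the constants are standard SI values, not taken from the thesis; masses are converted from u to kg so that the result comes out in Hz.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19   # C
ATOMIC_MASS_UNIT = 1.66053906660e-27  # kg per u


def cyclotron_frequency(mass_u: float, charges: int, b_tesla: float) -> float:
    """Unperturbed cyclotron frequency in Hz: f = q B / (2 pi m)."""
    q = abs(charges) * ELEMENTARY_CHARGE
    m = mass_u * ATOMIC_MASS_UNIT
    return q * b_tesla / (2.0 * math.pi * m)


def mass_over_charge(frequency_hz: float, b_tesla: float) -> float:
    """Invert the relation: m/z in u per elementary charge."""
    omega = 2.0 * math.pi * frequency_hz
    return ELEMENTARY_CHARGE * b_tesla / (omega * ATOMIC_MASS_UNIT)


if __name__ == "__main__":
    f = cyclotron_frequency(500.0, 1, 12.0)
    print(f"500 u, 1+, 12 T: {f / 1e3:.1f} kHz")
    print(f"recovered m/z:   {mass_over_charge(f, 12.0):.4f}")
```

For a singly charged ion of 500 u in a 12 T field this gives a frequency of a few hundred kHz, and doubling the charge doubles the frequency.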
So the ion's mass over charge ratio is measured as a frequency. In contrast to time-of-flight instruments, where the measurement lasts only as long as the time of flight, the frequency can be measured over a longer period of time and can therefore be determined more precisely than any other directly measured experimental parameter [20]. This accounts for the high precision and resolving power of FTMS instruments.
Before the cyclotron frequency can be measured, the ions need to be excited to a sufficiently large cyclotron radius. In the cubic analyzer cell the ions are excited by a sinusoidal voltage applied to the two excitation plates which, unlike the trapping plates, are oriented parallel to the magnetic field lines. All ions of the same mass over charge ratio are excited coherently and therefore undergo cyclotron motion as a packet [19]. About 100 ions of the same mass over charge ratio are required to induce a measurable signal [20].
The cyclotron frequencies of all ions present are measured by the detection plates and form a signal which is the superposition of all individual frequencies. This signal is Fourier transformed to obtain the individual frequency components of the ions present. These frequencies are finally used to calculate the mass over charge values.
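The decomposition of a superimposed signal into its frequency components can be illustrated with a toy example. The Python sketch below is an assumption-laden illustration: it uses a plain discrete Fourier transform written with the standard library instead of an optimized FFT, the frequencies are arbitrary values far below real cyclotron frequencies, and all function names are hypothetical.

```python
import cmath
import math


def simulate_transient(components, sample_rate, n_samples):
    """Sum of sinusoids; components is a list of (frequency_hz, amplitude)."""
    return [
        sum(a * math.sin(2.0 * math.pi * f * t / sample_rate)
            for f, a in components)
        for t in range(n_samples)
    ]


def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform for the
    non-negative frequency bins below the Nyquist frequency."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2):
        acc = 0j
        for t, x in enumerate(signal):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        magnitudes.append(abs(acc))
    return magnitudes


def dominant_frequencies(signal, sample_rate, count):
    """Frequencies (Hz) of the `count` strongest bins, ignoring bin 0."""
    mags = dft_magnitudes(signal)
    n = len(signal)
    strongest = sorted(range(1, len(mags)),
                       key=lambda k: mags[k], reverse=True)[:count]
    return sorted(k * sample_rate / n for k in strongest)


if __name__ == "__main__":
    # Two "ion packets" oscillating at 50 Hz and 120 Hz (toy values).
    signal = simulate_transient([(50.0, 1.0), (120.0, 0.5)], 1000.0, 500)
    print(dominant_frequencies(signal, 1000.0, 2))
```

Each recovered frequency could then be converted into a mass over charge value via the cyclotron equation. The frequency resolution of the transform is the sample rate divided by the number of samples, which is one way to see why a longer observation time yields a more precise frequency.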
One has to bear in mind that it is not mass as such that is measured, but mass over charge. Multiply charged ions will therefore be detected at lower mass over charge ratios than singly charged ions of the same mass. E.g. a singly charged ion with mass m will have the same mass over charge value as a doubly charged ion with mass 2m or a triply charged ion with mass 3m. However, multiple ionization plays a major role only in proteomics and just a minor role in metabolomics.
1.3.5 Mass resolution and accuracy
The major advantages of Fourier Transform Ion Cyclotron Resonance Mass Spectrometry over any other type of mass spectrometry are its unsurpassed mass resolution and mass accuracy [5, 19]. These two aspects are closely related, because a precise mass measurement requires sufficiently resolved peaks. Resolution is the capability of a mass spectrometer to separate masses which are close to each other. The exact definitions of resolution and resolving power can be found in the literature [17, 20]. Mass accuracy is the difference between the measured mass and the calculated exact mass and is usually given in parts per million (ppm). E.g. a mass spectrometer with an accuracy of 2 ppm will measure a calculated exact mass of 150 u somewhere between 149.9997 u and 150.0003 u most of the time.
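The relation between a ppm accuracy and an absolute mass window is simple arithmetic, sketched here in Python. The function names are hypothetical and only serve as illustration.

```python
def ppm_window(exact_mass: float, ppm: float) -> tuple:
    """Lower and upper mass bound (in u) for a given ppm accuracy."""
    delta = exact_mass * ppm * 1e-6
    return exact_mass - delta, exact_mass + delta


def ppm_error(measured_mass: float, exact_mass: float) -> float:
    """Signed deviation of a measurement from the exact mass in ppm."""
    return (measured_mass - exact_mass) / exact_mass * 1e6


if __name__ == "__main__":
    low, high = ppm_window(150.0, 2.0)
    print(f"2 ppm at 150 u: {low:.4f} u to {high:.4f} u")
```

At 150 u an accuracy of 2 ppm corresponds to ±0.0003 u. Because the window scales with the mass, the same ppm value translates into a wider absolute window for heavier compounds.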
The strength of the magnetic field in the mass spectrometer influences
the resolution and accuracy. The higher the magnetic field, the better the
obtained results. The typical range of magnetic fields is 1 to 9.4 Tesla [20].
In this thesis data from a 12 Tesla instrument is used.
1.3.6 Extract preparation
The preparation of cell extracts is a very important step in any mass spectrometry experiment, and the method chosen has a big impact on the metabolites which can be measured later [3, 5]. The different methods have to compromise between different chemical species. Cell extracts are usually mixed with a methanol/0.1% formic acid solution or an acetonitrile solution [18]. Aharoni et al. found that these two assays lead to the detection of different chemical species in the final FTMS experiment. It follows that no single extraction protocol will yield the full range of metabolites present in the sample. Ideally, different extraction assays are used and the resulting measurements are combined. The same procedure has been proposed above for the ionization mode, and in fact a combination of different extraction protocols, ionization techniques and ionization modes will broaden the spectrum of measured metabolites. However, any analysis method has to account for the incompleteness of the data, and the interpretation of results has to take it into account as well.
1.3.7 Ionization by electrospray ionization (ESI)
Electrospray ionization is considered a soft ionization technique, because the energies during ion formation are low enough that they do not break the chemical bonds in the molecules, as opposed to other ionization techniques such as electron impact ionization. During electrospray ionization the sample is diluted in a volatile solvent and fed into a capillary. In front of the capillary is an electrode, and a high voltage (2–5 kV) is applied between capillary and electrode [5]. Due to the electric potential between the capillary and the electrode, the solution in the capillary is charged and moves out of the capillary, forming a Taylor cone at its tip. From the tip of the Taylor cone a small jet is emitted. The charged molecules in this jet repel each other and thus small drops are formed. These drops further dissociate into smaller droplets due to the same repulsive force. This dissociation continues until the charged molecules are completely separated [22].
Most small metabolites will carry one charge after this process, but it is important to note that ESI can also produce multiply charged ions. In fact the molecules of highest mass usually carry the highest number of charges [20], but ionization mainly depends on the number of ionizable side chains. The protein RNase A (molecular weight 13682 u) clearly shows 20-fold positive ionization in a study by Henry et al. [23]. This feature is used in mass spectrometry of large biomolecules like peptides and proteins, because through multiple ionization these molecules attain mass over charge values which are readily measurable by mass spectrometry. Within the scope of this thesis, as mentioned, multiple ionization plays a minor role.
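The effect of multiple charging on the observed mass over charge value can be sketched as follows. This Python snippet is an illustration under a stated assumption: positive charges are taken to arise from proton attachment, which the text above does not specify, and the function name `mz_of_charge_state` is hypothetical.

```python
PROTON_MASS = 1.007276  # u; assumption: each charge is an attached proton


def mz_of_charge_state(neutral_mass: float, charges: int,
                       carrier_mass: float = PROTON_MASS) -> float:
    """Mass over charge value of a molecule carrying `charges`
    positive charges, each contributed by one charge carrier."""
    if charges < 1:
        raise ValueError("charges must be at least 1")
    return (neutral_mass + charges * carrier_mass) / charges


if __name__ == "__main__":
    # RNase A (13682 u) in several charge states.
    for z in (1, 5, 10, 20):
        print(f"z = {z:2d}: m/z = {mz_of_charge_state(13682.0, z):.1f}")
```

With 20 charges the protein appears below m/z 700, i.e. in the same range as small metabolites, which is why multiple ionization makes such large molecules accessible to the instrument.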
1.4 Metabolic Compound Identification from Molecular Mass
Small metabolites cover a mass range from a few u, like water (18.0106 u), to several hundred u, like coenzyme A (767.1152 u). Bigger metabolites, such as vancomycin and its derivatives (around 1447.4302 u), can have masses of more than 1000 u. The central 90 % of all compounds in the KEGG compound database [8] have a mass between 118 u and 867 u. The median is at 306 u.
If the exact mass of a metabolite is determined with sufficient accuracy, the chemical formula can be calculated by finding the linear combination of elements which precisely fits the exact mass. At infinite accuracy, it would be possible to find exactly one chemical formula for any given mass [17]. Fortunately, biomolecules comprise only a limited number of distinct elements. Therefore one does not need to look for linear combinations of all elements, but only of those present in biomolecules (see above). Obviously, it is easier to find an exactly matching chemical formula for smaller masses than for big masses. If further restrictions from chemistry are introduced, such as the nitrogen rule [24, p. 238], the search space of linear combinations becomes even smaller. If no such limitations are applied, higher mass accuracies are required, which is detailed in section 2.5.
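The search for linear combinations of elements can be sketched as a brute-force enumeration. The following Python example is a simplified illustration and not the procedure used in the cited studies: it is restricted to C, H, N and O, applies no chemical plausibility rules such as the nitrogen rule, and uses hypothetical names and upper bounds.

```python
# Monoisotopic masses in u (standard literature values).
MASS = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}


def candidate_formulas(target_mass, ppm,
                       max_c=40, max_n=10, max_o=20, max_h=100):
    """Enumerate C/H/N/O compositions whose monoisotopic mass lies
    within +/- ppm of target_mass. Returns (c, h, n, o) tuples."""
    tolerance = target_mass * ppm * 1e-6
    hits = []
    for c in range(max_c + 1):
        for n in range(max_n + 1):
            for o in range(max_o + 1):
                rest = target_mass - (c * MASS["C"] + n * MASS["N"]
                                      + o * MASS["O"])
                if rest < -tolerance:
                    break  # adding more oxygen only overshoots further
                # The remaining mass must be made up of hydrogen atoms.
                h = round(rest / MASS["H"])
                if 0 <= h <= max_h and abs(rest - h * MASS["H"]) <= tolerance:
                    hits.append((c, h, n, o))
    return hits


if __name__ == "__main__":
    for ppm in (1.0, 10.0, 100.0):
        hits = candidate_formulas(180.06339, ppm)
        print(f"{ppm:6.1f} ppm: {len(hits)} candidate formula(s)")
```

Running the search with increasingly loose tolerances shows how the number of candidate formulas grows as the accuracy decreases, which is the reason additional chemical constraints or higher mass accuracy are needed for larger masses.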
With the search space limited to chemical formulas plausible for peptides, Zubarev et al. [25] found the upper limit for unique identification of peptides at an accuracy of 1 ppm to be about 700 u. Aharoni et al. [18] use the same approach to identify metabolites from metabolomics data. They likewise assume an error of 1 ppm for a 7 Tesla FTMS and are able to assign a unique chemical formula to more than 50% of their measured metabolites (about 5000 in total). The chemical formula is in turn looked up in a chemical compound database.
Another method to identify metabolites by mass, which is used in this thesis, is a direct database lookup of the mass. This basically follows the above
procedure, but in an inverted fashion. The step of calculating a chemical formula can therefore be omitted. For all compounds in a database, the exact
mass is either obtained by a query, or calculated from the chemical formula,
and each measured mass is compared to the database masses. If the best
hit is within the desired error, an assignment can be made. The MassTRIX
framework [26] uses this method to find and display metabolites from mass
spectrometry experiments in KEGG pathway maps. Using this approach, a
precision of at least 2 ppm is required to uniquely identify metabolites in a
chemical database comprising 72,634 unique chemical formulae [16].
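The direct database lookup can be sketched as a nearest-neighbor search in a sorted list of masses. This is a hedged illustration only: the function `lookup`, the three-entry toy database and its compound labels are assumptions for the example and do not reflect the MassTRIX implementation.

```python
import bisect


def lookup(measured, db, ppm=2.0):
    """Return (name, mass) of the closest database entry within `ppm`, else None.

    `db` is a list of (mass, name) tuples sorted by mass.
    """
    masses = [m for m, _ in db]
    i = bisect.bisect_left(masses, measured)
    best = None
    # Only the entries directly below and above the measured mass can be closest.
    for j in (i - 1, i):
        if 0 <= j < len(db):
            if best is None or abs(db[j][0] - measured) < abs(db[best][0] - measured):
                best = j
    if best is None:
        return None
    mass, name = db[best]
    return (name, mass) if abs(mass - measured) <= measured * ppm * 1e-6 else None


# Toy database (hypothetical selection; masses in u).
toy_db = [(323.051853, "CMP"), (324.035868, "UMP"), (404.002201, "UDP")]
assigned = lookup(324.0360, toy_db)    # within 2 ppm of 324.035868
rejected = lookup(324.0400, toy_db)    # about 13 ppm away from the closest entry
```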
All small metabolites which have a database entry will be identified by the
latter approach. Polymers and fatty acids are problematic, because their
combinatorial composition gives rise to a whole plethora of different masses.
An exact chemical formula assignment might help in this case
to identify unknown compounds.
Special Notation In this thesis not only compounds need to be identified
by their mass, but also mass differences. Mass differences can be explained
by an arbitrary number of atom transfers, e.g. the mass difference from H2O
to CO2 is explained by the subtraction of 2 H atoms and the addition of one
C and one O atom. This transformation will be noted, analogously to well-known
molecular formulas, as H−2 CO or in some cases for clarity as H−2 C+1 O+1.
Mass differences can have a positive or negative sign, but usually their absolute value is of interest. Hence this notation is bidirectional in the sense
that the signs need to be swapped for the transfer in the opposite direction.
Thus the above formula is considered equivalent to H+2 C−1 O−1.
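The notation can be illustrated with signed element counts. The sketch below is only an example under stated assumptions: the function names and the rounded monoisotopic masses are chosen for illustration and are not part of the thesis.

```python
# Monoisotopic masses in u (rounded standard values; an assumption for this example).
MONO = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}


def transformation(substrate, product):
    """Signed element-count change from substrate to product (zero entries dropped)."""
    elements = set(substrate) | set(product)
    delta = {e: product.get(e, 0) - substrate.get(e, 0) for e in elements}
    return {e: n for e, n in delta.items() if n != 0}


def mass_difference(delta):
    """Mass change in u induced by a signed element-count change."""
    return sum(MONO[e] * n for e, n in delta.items())


def inverse(delta):
    """The same transfer in the opposite direction: all signs swapped."""
    return {e: -n for e, n in delta.items()}


water = {"H": 2, "O": 1}
carbon_dioxide = {"C": 1, "O": 2}
forward = transformation(water, carbon_dioxide)   # H-2 C+1 O+1
backward = inverse(forward)                       # H+2 C-1 O-1
```

Both directions have the same absolute mass difference, which is why the two formulas are treated as equivalent.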
1.5 Introduction to Ab Initio Metabolic Network Reconstruction
The approach for metabolic network reconstruction which is evaluated in
this thesis was devised previously [16]. The underlying idea is that for
every metabolic transformation the substrate and the product should be
present in the cell at detectable quantities. The principle is to calculate
all pairwise differences of a set of metabolite masses, and to infer the chemical
transformations that took place from the mass difference of any two
compounds (Figure 2). To build a network, the compounds are connected
according to the chemical reactions in which they appear to participate. The
masses are determined using FTMS, because high mass precision results
in high mass difference precision, which is required for the exact connection
of all compounds.
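The principle can be sketched for a single known transformation. The example below is an illustration under assumptions: the function `connect` and the absolute tolerance of 1e-4 u are hypothetical choices; the three masses and the phosphorylation mass difference are the values shown in Figure 2.

```python
from itertools import combinations

# Mass difference of a phosphorylation (transfer of an HO3P group), in u, as quoted for Figure 2.
PHOSPHORYLATION = 79.966332


def connect(masses, known_difference, tol=1e-4):
    """Return pairs (lighter, heavier) whose difference matches `known_difference` within `tol` u."""
    edges = []
    for a, b in combinations(sorted(masses), 2):
        if abs((b - a) - known_difference) <= tol:
            edges.append((a, b))
    return edges


# Three of the compound masses from Figure 2 (u).
masses = [324.035868, 404.002201, 483.968536]
edges = connect(masses, PHOSPHORYLATION)
```

The lightest and heaviest mass differ by two phosphorylation steps and are therefore not connected directly.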
[Figure 2 (network diagram): Reactions from pyrimidine metabolism (KEGG map 00240). Compound masses (u): 307.056938, 323.051853, 324.035868, 387.023270, 403.018185, 404.002201, 466.989603, 482.984521, 483.968536, 789.993837. Reaction mass differences (u): 79.966332 (six reactions), 15.994914 (two reactions), 0.984015 (one reaction), 306.025304 (one reaction).]
Figure 2: Shown are 10 compounds from the pyrimidine metabolism and the
reactions between them from the KEGG reference pathway 00240. The exact
mass of each compound is printed in italic font; the mass differences resulting
from chemical reactions are printed next to the reaction arrows. Suppose all
these masses are measured in a mass spectrometry experiment. Together with the
knowledge that a phosphorylation reaction (transfer of an HO3P group) introduces
a mass difference of 79.966332 u, one can easily connect the compounds which
have this mass difference and conclude that they have undergone phosphorylation.
known  mass difference  occurrence  transformation
X      0.9840155930     3           H−1 N−1 O+1 (ammonia ligase)
X      15.994914640     3           O+1 (hydroxylation)
-      16.978930233     3           H−1 N−1 O+2
-      62.987402124     2           -
-      63.971417717     2           -
-      78.982316764     2           -
X      79.966332357     6           HO3P (phosphate)
-      80.950347950     2           -
-      95.961246997     2           -
-      96.945262590     2           -
-      142.953734481    1           -
-      143.937750074    1           -
-      158.948649121    1           -
-      159.932664714    3           H2 O6 P2
-      160.916680307    1           -
-      175.927579354    1           -
-      176.911594947    1           -
X      306.025303947    1           (nucleotidohydrolase)
-      307.009319540    1           -
-      323.004234180    1           -
-      385.991636304    1           -
-      386.975651897    1           -
-      402.970566537    1           -
-      465.957968661    1           -
-      466.941984254    1           -
-      482.936898894    1           -
Table 2: The pairwise mass differences between all 10 compounds from Figure 2
are shown. Checkmarks (X) mark mass differences that correspond to reactions
which are known to exist between the compounds. Frequent mass differences
are stressed by bold text. The most common mass difference actually represents
the most common chemical reaction in this example, the phosphorylation. The
mass differences corresponding to hydroxylation and the rather complex ammonia ligase reaction also occur three times. The nucleotidohydrolase reaction, however, is observed only once and is
not detected as a frequent mass difference. Instead, an unknown mass difference
(159.932664714), which corresponds to the transfer of 2 H, 6 O and 2 P, occurs three
times, as does another unknown mass difference (16.978930233) corresponding
to the transfer of H−1 N−1 O+2.
[Figure 3 (histogram): x-axis: mass difference, 0 to 500; y-axis: count, 0 to 6.]
Figure 3: The data from Table 2 as histogram.
The study distinguishes two approaches to identify meaningful mass differences, i.e. mass differences corresponding to actual chemical transformations. The 1st approach is based solely on the measured masses, and can
be considered “ab initio”. Mass differences which occur significantly
more often than others are considered meaningful. The 2nd approach takes into account
knowledge about common metabolic transformations and as such incorporates a priori knowledge. Only mass differences which correspond to one
of the common transformations published in [16] are considered meaningful. Consider the example in Figure 2. If all ten masses are measured and
the four occurring mass differences are known, all edges would be
reconstructed, plus one false positive edge between the masses 323.051853
and 307.056938 (at the bottom). The difference between these two masses is
again 15.99491464.
If the 1st approach is applied to this example, the situation is a little different. Table 2 shows all pairwise mass differences between these 10 compounds;
for clarity the same data is shown as a histogram in Figure 3. For this
example a frequent mass difference is defined as one occurring 3 times or more.
As can be seen, the most frequent mass difference actually originates from
the most frequent known reaction in the example, the phosphorylation or
dephosphorylation respectively. But not all known reactions are recovered as
frequent mass differences; instead, two new frequent mass differences occur.
One new mass difference comes from the difference of 2 hydrogen, 6 oxygen and 2 phosphorus atoms (H2 O6 P2 in Table 2); the other one is explained by an even more
complex relationship, i.e. the addition of 2 oxygen atoms and the subtraction of
one hydrogen and one nitrogen atom. This simple example already shows
that not all frequently occurring mass differences are necessarily related to
actual chemical transformations. They can, among other possibilities, originate from two or more consecutive common metabolic steps (in this example
the mass difference 159.932664714; two phosphorylation reactions), or from
compounds which share a common scaffold but contain different side groups
(e.g. mass difference 16.978930233).
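The counting of frequent mass differences in this example can be reproduced with a short sketch. It is an illustration under assumptions: the ten masses are read from Figure 2, while the function `cluster_differences` and the grouping rule (a new group starts when two consecutive sorted differences are more than 1e-4 u apart) are hypothetical simplifications and not the clustering procedure of this thesis.

```python
from itertools import combinations

# Compound masses read from Figure 2 (u).
MASSES = [307.056938, 323.051853, 324.035868, 387.023270, 403.018185,
          404.002201, 466.989603, 482.984521, 483.968536, 789.993837]


def cluster_differences(masses, gap=1e-4):
    """Group the sorted pairwise differences; a gap larger than `gap` u starts a new group."""
    diffs = sorted(b - a for a, b in combinations(sorted(masses), 2))
    clusters = [[diffs[0]]]
    for d in diffs[1:]:
        if d - clusters[-1][-1] > gap:
            clusters.append([d])
        else:
            clusters[-1].append(d)
    return clusters


clusters = cluster_differences(MASSES)
# Frequent mass differences: groups with 3 or more members, keyed by their rounded mean.
frequent = {round(sum(c) / len(c), 3): len(c) for c in clusters if len(c) >= 3}
```

With these inputs the 45 pairwise differences fall into 26 groups, five of which are frequent, in agreement with Table 2.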
If the most frequent mass differences (Table 2) are used to reconstruct a
metabolic network, this results in the graph depicted in Figure 4. The reconstructed network contains not 10 edges, as the reference network does,
but 18. Nine of these correspond to edges in the reference network;
the remaining 9 are additionally introduced.
In the work in [16] the method was also used on a mass spectrum from
the parasitic organism Trypanosoma brucei. From this spectrum a maximum
of 399 identified masses was obtained. Using the two mentioned approaches,
about 25,000 (1st approach) or 1438 (2nd approach) meaningful mass differences were calculated respectively. As a parasite, Trypanosoma brucei has
a very small set of metabolic enzymes. Most other organisms will therefore
yield mass spectra with significantly more masses.
The numbers for the reconstructed networks suggest very densely connected graphs already; an undirected graph with 399 vertices can contain at
most 399 · 398 / 2 = 79401 unique edges. In this case it would be completely connected and called a clique. The first approach reaches 39% of this, the second
approach 2%. This is a lot, compared to actual metabolic networks. The
metabolic network of Escherichia coli derived by enzyme gene mapping [27]
contains 628 metabolites and 788 connections (reactions). This amounts to only
0.4% of all possible connections. Ouzounis et al. used a similar approach to
reconstruct the E. coli metabolic network; their reconstruction comprises 791
compounds and 744 reactions (0.24% of all possible connections) [12]. The
metabolic network of another species with a very streamlined metabolome,
the γ-proteobacterium Buchnera aphidicola, contains 240 compounds and 263
reactions [28]. This equals 0.9% of all theoretically possible connections.
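The percentages for the three reference networks follow directly from the number of possible vertex pairs. The short sketch below recomputes them; the function names are assumptions for illustration, the vertex and edge counts are those quoted in the text.

```python
def max_edges(n_vertices):
    """Number of vertex pairs in an undirected graph without self loops and multiple edges."""
    return n_vertices * (n_vertices - 1) // 2


def density_percent(n_vertices, n_edges):
    """Share of realized edges among all possible edges, in percent."""
    return 100.0 * n_edges / max_edges(n_vertices)


# (vertices, edges) as quoted in the text.
networks = {
    "E. coli [27]": (628, 788),
    "E. coli [12]": (791, 744),
    "B. aphidicola [28]": (240, 263),
}
for name, (vertices, edges) in networks.items():
    print(f"{name}: {density_percent(vertices, edges):.2f} %")
```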
It is also well known that metabolic networks exhibit a certain topology [29, 30]. This will be detailed in chapter 1.6.4. The authors of [16]
claim that the generated networks are broadly consistent with these previous
findings and explain observed deviations with the limited number of measured metabolites.
1.6 Graph Theoretic Aspects of Biological Networks
This thesis deals mainly with the analysis of biological networks. This section
will give an introduction to biological networks represented as graphs, their
properties, ways to analyze them in a descriptive way and some examples of
studies on biological networks, especially metabolic networks.
[Figure 4 (network diagram): Reactions from pyrimidine metabolism (KEGG map 00240), reconstructed from the most frequent mass differences. Edge labels (u): 79.966332, 15.994914, 0.984015, 159.932664 and 306.025304; curved lines: 16.978930.]
Figure 4: Shown are the same 10 compounds from pyrimidine metabolism. All
connections between them have been reconstructed from the most frequent pairwise
mass differences. The solid lines depict correctly reconstructed connections (true
positives), the dashed lines stand for connections which are predicted but not
present in the reference pathway (false positives), and the gray arrow represents a
connection which was not reconstructed but is present in the original pathway (false
negative). False positive in this context does not mean that the reaction does not
exist among biochemical reactions in general; it was just not present in the
reference network used. If less frequent mass differences are also used to reconstruct
the network, more edges will be found, up to a full saturation of the network.
1.6.1 Mathematical Definition of Graphs
The definitions below are taken from the book “Handbook of Graph Theory” [24]. The definitions given there are very comprehensive and some are
irrelevant to this thesis. Definitions relevant to this thesis are reproduced,
some are extended to fit the need of this thesis. For any further information
please refer to [24] or any other graph theoretic publication.
In mathematical terms a graph is a pair of sets G = (V, E) where V is
a set of vertices and E is a set of edges {v1 , v2 } , vi ∈ V which connect two
elements of V . One edge can also connect a vertex to itself, in this case it is
referred to as a self loop.
In an undirected graph, the edges carry no further information, in a directed graph, each edge has a direction. A directed edge is an edge e, in which
one of the endpoints is designated as tail, the other one as head. The edge is
directed from the tail to the head.
A vertex v is incident to an edge e, if v is an endpoint of e. Two vertices
are adjacent, if they are incident to the same edge. Two adjacent vertices
are also called neighbors.
The degree k of a vertex v in an undirected graph is the number of edges
incident to v (note that self loops are incident to the same vertex twice
and therefore add 2 to the degree). In a graph without self loops and multiple
edges between vertices, the degree equals the number of neighbors. The
average degree ⟨k⟩ of a graph is defined as the mean degree of all its vertices.
The indegree of a vertex v in a directed graph is the number of edges
which are directed to v. The outdegree is defined analogously for edges directed
from v.
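The degree definitions above can be sketched on edge lists. The function names and the toy edge list are assumptions made for this illustration.

```python
from collections import Counter


def degrees(edges):
    """Degree of each vertex in an undirected graph given as (u, v) pairs; a self loop adds 2."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def in_out_degrees(directed_edges):
    """Indegree and outdegree for directed edges given as (tail, head) pairs."""
    indeg, outdeg = Counter(), Counter()
    for tail, head in directed_edges:
        outdeg[tail] += 1
        indeg[head] += 1
    return indeg, outdeg


# Toy graph with one self loop at vertex "c".
toy_edges = [("a", "b"), ("b", "c"), ("c", "c")]
deg = degrees(toy_edges)
average_degree = sum(deg.values()) / len(deg)
indeg, outdeg = in_out_degrees(toy_edges)
```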
The degree density of a graph G is defined as the fraction of edges present
in G, as compared to the maximal number of edges which can theoretically
exist in a graph, expressed in percent. Let |V| be the number of vertices and
|E| the number of edges in G. The maximal number of edges, excluding the
possibility of multiple edges and self loops, is
\[ |E_{max}| = \frac{|V| (|V| - 1)}{2} . \]
The degree density is
\[ 100 \cdot \frac{|E|}{|E_{max}|} . \]
A path in a graph G is a sequence of vertices, such that two adjacent
vertices in the sequence are neighbors in G. The edges in the path are the
edges connecting each two adjacent vertices in the sequence. Furthermore no
vertex or edge in the path may occur twice in the path. The only exception
are the first and last vertex in the sequence, which may be the same.
The path length of a path is the number of edges in the path. If the first
and last vertex in a path are the same, the path is called a cycle if it has at
least length one.
The shortest path between two vertices v1 and v2 is a path from v1 to v2
with length l such that there is no other path from v1 to v2 with length < l.
There may exist more than one shortest path with the same length but
different sequences of vertices.
A subgraph of a graph G = (V, E) is a graph G′ = (V′, E′) with
any subset V′ ⊂ V and subset E′ ⊂ E, with all elements in E′ connecting
elements in V′. The induced subgraph of G with respect to a set of vertices
Vi ⊂ V is the graph Gi = (Vi, Ei) with Ei ⊂ E and Ei containing all edges
from E whose endpoints both lie in Vi; in words, the subgraph of G
containing all vertices in Vi and all edges connecting them, or the graph which results
from removing all vertices not in Vi and keeping all edges with no “loose ends”.
A connected component is a set of vertices in which a path of finite length
exists between each pair of vertices. A graph may consist of a single
connected component; in this case it is called a connected graph.
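Connected components can be found with a simple graph traversal. The sketch below is one common way to do this (depth-first search on an adjacency structure); the function name and the toy graph are assumptions for illustration.

```python
from collections import defaultdict


def connected_components(vertices, edges):
    """Return the connected components of an undirected graph as a list of vertex sets."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        # Collect everything reachable from `start`.
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            v = stack.pop()
            component.add(v)
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(component)
    return components


components = connected_components([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])
is_connected = len(components) == 1
```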
1.6.2 Properties of Graphs
There exist several concepts and measures to describe graphs. Some of them
appear recurrently in the analysis
of complex networks: graphs can be characterized by measures like their small
world property, degree distribution, diameter and clustering coefficient [31].
The small world property describes the fact, that even in large graphs
containing many vertices often short paths between any pair of vertices exist. [31]
The degree distribution of a graph G is a function P(k), which gives
the probability that a randomly selected vertex in G has degree k [31]. In a
graph with a finite number of vertices, as observed for metabolic networks,
it is simply the fraction of vertices in G with degree k.
The diameter of a graph G can be defined as the longest of all shortest
paths between any two vertices in G. Another concept is the average path
length, which is the average length of all shortest paths between any two
vertices in G. [31]
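Diameter and average path length can both be derived from breadth-first searches started at every vertex. The sketch below assumes a connected, unweighted graph given as a dictionary of neighbor sets; the function names and the toy graph are illustrative assumptions.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest path lengths (in edges) from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def diameter_and_average_path_length(adjacency):
    """Longest and mean shortest path length over all pairs of distinct, mutually reachable vertices."""
    lengths = []
    for v in adjacency:
        for w, d in bfs_distances(adjacency, v).items():
            if w != v:
                lengths.append(d)   # every pair is counted in both directions
    return max(lengths), sum(lengths) / len(lengths)


# Toy graph: a chain a - b - c - d.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
diameter, average_path_length = diameter_and_average_path_length(chain)
```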
The clustering coefficient Cv is primarily a measure for a single vertex
v. If v has kv neighbors, there exist at most kv (kv − 1)/2 edges between these
neighbors. The clustering coefficient of v is defined as
\[ C_v = \frac{2 E_v}{k_v (k_v - 1)} , \]
with Ev the number of edges that actually exist between the neighbors.
It is therefore a measure for the connectedness in the neighborhood of
v. Cv = 1 if the neighborhood of v is completely connected and Cv = 0 if
there is no edge between the neighbors of v. The clustering coefficient of
a graph G is defined as the average clustering coefficient of all vertices in
G. The distribution of clustering coefficients C(k) is defined as the average
clustering coefficient for all vertices with degree k.
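The three quantities (per vertex, per graph and per degree) can be sketched as follows. The function names, the toy graph and the convention of returning 0 for vertices with fewer than two neighbors are assumptions made for this illustration.

```python
from collections import defaultdict
from itertools import combinations


def clustering_coefficient(adjacency, v):
    """C_v = 2 E_v / (k_v (k_v - 1)); defined here as 0.0 for k_v < 2 (an assumed convention)."""
    neighbors = adjacency[v] - {v}
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
    return 2.0 * links / (k * (k - 1))


def graph_clustering_coefficient(adjacency):
    """Average clustering coefficient of all vertices."""
    return sum(clustering_coefficient(adjacency, v) for v in adjacency) / len(adjacency)


def clustering_by_degree(adjacency):
    """C(k): average clustering coefficient of all vertices with degree k."""
    by_degree = defaultdict(list)
    for v in adjacency:
        by_degree[len(adjacency[v] - {v})].append(clustering_coefficient(adjacency, v))
    return {k: sum(cs) / len(cs) for k, cs in by_degree.items()}


# Toy graph: triangle a-b-c with an extra vertex d attached to a.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
c_graph = graph_clustering_coefficient(toy)
c_of_k = clustering_by_degree(toy)
```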
1.6.3 Types of Graphs
Networks, especially biological networks, are usually modeled using the following three graph models: random, scale free and hierarchical graphs [32].
Recent studies have shown that this classification is able to explain biological data well.
Because of this, this classification is applied in this work. What
follows is a brief introduction to these three kinds of
graphs.
Saul and Filkov [33] recently presented a new and interesting method to
classify biological networks using exponential random graph (ERG) models.
These models have been used before to classify social networks, which are
usually smaller than biological networks. But advances in electronics and parallel computing begin to provide the necessary computational power to apply
the fitting methods for ERG models to biological data. The advantage of
ERG models is that parameters for a multitude of descriptive variables can
be fit to known networks or, inverting this approach, networks can be easily
simulated or constructed from a set of arbitrary descriptive parameters. In
their work they criticize the conventional classification as too restrictive with
respect to the employed descriptive variables (degree distribution and clustering coefficients). It is, however, not known which descriptive parameters
best describe biological and especially metabolic networks. Therefore, and
due to the computational issues, these models are not employed here.
[Figure 5: three panels A, B and C, each showing a graph sketch with its degree distribution P(k) and its distribution of clustering coefficients C(k) plotted against the degree k.]
Figure 5: Different graph models and their properties. A: Random graph with
typical degree distribution P(k) and distribution of clustering coefficients C(k).
B: Scale free graph with a degree distribution P(k) following a power law but
distribution of clustering coefficients C(k) following a uniform distribution. The
bold circles represent the hubs in this graph. C: Hierarchical graph. Both the
degree distribution P(k) and the distribution of clustering coefficients C(k) follow a
power law. There are also hubs present in these networks, but additionally the
vertices which are not hubs and have a lower degree tend to be more densely
connected within their neighborhood.
Random Graphs This kind of graph was first analyzed in 1959 by Erdős
and Rényi [31]. The vertices in the graph are connected in a random fashion,
i.e. all pairs of vertices have the same probability to be connected. Because
of this, all vertices tend to have a similar degree, and no local structures
like clusters (tightly connected areas in the graph) arise. This leads to a
degree distribution which is shaped like a Poisson distribution and clustering
coefficients which are independent of the vertices' degrees. These graphs
exhibit the small world property, i.e. despite the large number of vertices,
the average path length is relatively short.
Scale Free Graphs In these graphs there are different roles for vertices. A
few vertices are heavily connected to other vertices, but most of the vertices
show a small degree. The high degree vertices are called hubs. The hubs play
an important role in these scale free graphs, because most of the shortest
paths travel through them. A scale free graph is very vulnerable to targeted
attacks on these hubs. Among the low-degree vertices, however, there are
no different roles. This leads to a degree distribution P(k) which follows the
relationship \( P(k) = a k^{\gamma} \), commonly known as a power law with constant a and
scaling exponent γ. The distribution of clustering coefficients is still uniform.
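On a double logarithmic scale a power law becomes a straight line, so a and γ can be estimated with a linear fit of log P(k) against log k. The sketch below shows this simple estimator on synthetic data; it is an illustration only, the function name is an assumption, and it is not claimed to be the fitting procedure used in this thesis.

```python
import math


def fit_power_law(ks, ps):
    """Least-squares line through (log k, log P(k)); returns (a, gamma) for P(k) = a * k**gamma."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in ps]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    gamma = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_y - gamma * mean_x)
    return a, gamma


# Synthetic degree distribution that follows P(k) = 0.5 * k**-2 exactly.
ks = [1, 2, 4, 8, 16]
ps = [0.5 * k ** -2.0 for k in ks]
a, gamma = fit_power_law(ks, ps)
```

For a distribution that decays with increasing degree, the fitted exponent γ is negative in this notation.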
Hierarchical Graphs The last kind of graphs discussed here also contains
vertices with different roles. Again hubs are present as in scale free graphs.
The important difference to scale free graphs is, that the low-degree vertices
tend to form clusters. That means, that there are local structures of a few
vertices which are more densely connected among each other than to other
vertices. These clusters are linked to each other through the hub vertices.
Due to the hubs, the degree distribution again follows a power law; and
because of the clusters, the distribution of clustering coefficients also exhibits
power law behavior.
1.6.4 Properties of Biological Networks
It was shown in several studies that biological networks (footnote 1) frequently exhibit a
scale free and hierarchical structure and are not randomly connected [11, 29,
34, 35]. One important implication is that the construction of null models for
the statistical analysis of biological networks must preserve these
structures [14].
Arita et al. argue that the small world property does not hold true
for metabolic networks [4]. Therefore it might be appropriate to consider
different models for this kind of network. Despite this, the
hierarchical model for metabolic networks is employed in this thesis, because their major
argument is a structural consideration. The networks reconstructed in this
thesis do not incorporate these structural implications, which will become
clear in the methods section (chapter 2.1.2).
Footnote 1: comprising metabolic, protein interaction, regulatory and other networks
1.7 Scope of this Thesis
A previous study [16] outlined the general practicability of the approach
described in chapter 1.5 and gave an example how to use the method on
a small organism. Since metabolomics aims for the understanding of complete metabolomes, the aim of this thesis is to evaluate the said approach on
real world metabolomics data, i.e. high precision mass spectra of two larger
organisms. Data from Saccharomyces cerevisiae and Drosophila melanogaster are available at the Helmholtz–Zentrum München, so these datasets are
investigated.
Before work on the datasets is performed, the theoretic capabilities of
the method are assessed. In order to do this, known biochemical pathways
from the KEGG database are investigated on how they theoretically can be
reconstructed using the proposed method.
The identification of frequent mass differences plays an important role,
not only for reconstructing the networks, but also in other fields of metabolomics, so the frequent mass differences are investigated in more detail. Currently a frequent mass difference is hypothesized to represent a chemical
reaction, but no profound knowledge about its actual meaning exists. This
aspect will be elucidated by explicitly looking for the chemical reaction
each mass difference represents.
Finally the reconstructed mass difference networks are analyzed to assess
whether they show characteristics of “real” metabolic networks, and what kind
of information can be extracted from them. To date no deeper analysis of
the structure of mass difference networks has been performed. The study in [16]
proposes that they are metabolic networks, which cannot be validated in
this work. Instead their usability in research is showcased by examining
the vertices' properties and by deducing information about the underlying
compounds.
2 Materials and Methods
2.1 Description of the Dataset
All data, programs and scripts used here are available at the cited resources
or as supplementary materials to this thesis.
2.1.1 FTMS Mass Spectrometry Data
The handling of raw data from the spectrometer is done by specialized software
and is outside the scope of this thesis. Further information about this issue can
be found in [36]. After the mass over charge values have been calculated,
each one is present together with a raw peak height. An example of this can
be seen in Figure 6. The subsequent step is to select peaks with a desired
signal–to–noise ratio, usually about 3:1. The work in this thesis is performed
on data right after this step, i.e. a list of mass to charge (m/z) values with a
peak height exceeding a certain signal–to–noise ratio. The charge (z) is from
now on given in multiples of the elementary charge, which is the charge carried
by a single proton.
Figure 6: Example for a FTMS mass spectrum, taken from [37]. On the x–axis
are the mass to charge (m/z ) values, the peak height is the actual signal. The m/z
for some high peaks are noted next to the peaks. The data used in this thesis are
m/z values with a sufficient signal–to–noise ratio selected by the experimenter.
The mass data for two organisms, Saccharomyces cerevisiae and Drosophila melanogaster, are used to reconstruct metabolic networks. The data
were taken from the MassTRIX server [26] and created by the group of
Schmitt-Kopplin at the Helmholtz–Zentrum München using a 12 Tesla mass
spectrometer, ideally providing a relative accuracy of 0.2 ppm. The data are
available on the MassTRIX server under the job-IDs “EXAMPLE Yeast”
(S. cerevisiae) and “08093013010514720” (D. melanogaster). Both were measured in negative mode. For D. melanogaster, data measured in positive
ionization mode became available at a later stage of this work and are not incorporated fully into the analysis; for S. cerevisiae no data in positive mode
are available. This implies that there are definitely compounds which will not
be detected. However, the scope of detected metabolites can always be improved by combining further extraction protocols and ionization techniques,
so that the full picture of all metabolites in the sample will most likely never
be achieved. Histograms of the data used are shown in Figure 7.
The S. cerevisiae cells were grown in a medium of nutrients with optimal proportions for growing most S. cerevisiae strains. This leads to an
exponential growth of the organism. The experimental background of the D.
melanogaster dataset cannot be disclosed at the time of writing.
[Figure 7 (two histograms, one per organism): x-axis: m/z, 0 to 2000; y-axis: count.]
Figure 7: Histograms of the masses in S. cerevisiae (top) and D. melanogaster
(bottom). Most of the measured masses are closer to the smaller side of the
detection range. For S.c. 3101 masses are available, for D.m. 1965 masses. The
mass spectrometer was optimized to detect masses smaller than 600 u.
Even though the D. melanogaster dataset became available only later
during the course of this thesis, it is of higher quality with respect to extraction protocols and mass spectrometry (footnote 2). Furthermore, more metabolites could
be identified in the data using MassTRIX.
Footnote 2: Personal communication: Agnes Fekete, Helmholtz–Zentrum München
The raw mass is corrected by adding a proton mass (1.0072764668813 u),
because the sample was measured in negative mode and the metabolites are
therefore negatively ionized.
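The correction amounts to a single addition per measured value. The sketch below assumes singly charged ions; the function name and the example values are hypothetical.

```python
PROTON_MASS = 1.0072764668813  # u, value quoted in the text


def corrected_masses(mz_values):
    """Add one proton mass to each m/z value measured in negative mode (single charge assumed)."""
    return [mz + PROTON_MASS for mz in mz_values]


# Hypothetical m/z values for illustration.
masses = corrected_masses([322.044577, 403.010909])
```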
2.1.2 The KEGG Database
KEGG is a database of biological systems that integrates genomic, chemical
and systemic functional information [8]. It is used in Release 48.0 from
October 1, 2008. Information was downloaded as flat files via FTP to speed
up the data retrieval process.
Data from KEGG compound were used to identify compounds by mass
as described in chapter 1.4. Because the exact masses of compounds are only
given to a precision of 10^-4 u, i.e. 4 decimal places, precise
masses are calculated from their chemical formulas using a Perl script
written by the author.
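The calculation of an exact mass from a chemical formula can be sketched as follows. This is not the Perl script mentioned above but a minimal Python illustration: the function names and the rounded monoisotopic masses are assumptions, and only plain formulas such as C6H12O6 are handled (no brackets, generic residues or repeat units).

```python
import re

# Monoisotopic masses in u (rounded standard values; an assumption for this example).
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401,
        "O": 15.99491462, "P": 30.97376163, "S": 31.97207100}


def parse_formula(formula):
    """Split a plain formula like 'C6H12O6' into element counts."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def exact_mass(formula):
    """Sum of monoisotopic element masses; raises KeyError for elements missing in MONO."""
    return sum(MONO[e] * n for e, n in parse_formula(formula).items())


glucose_mass = exact_mass("C6H12O6")
phosphate_group_mass = exact_mass("HO3P")
```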
Information from KEGG pathway and KEGG reaction was used to create reference metabolic pathways and networks to finally validate the reconstructed networks. To make the validation not overly stringent, the reference
pathways were used and not the organism specific pathways. This accounts
for the fact, that many metabolites and reactions in a specific organism are
unknown [2]. KEGG reaction pairs (or reactant pairs) are used to reconstruct
connections between compounds. A reaction pair lists two compounds, one
of which is transformed into the other by a single reaction in KEGG.
Creation of Networks from the KEGG Database To create a representation of the pathway maps stored in the KEGG database, individual
maps from KEGG pathway are considered. For each map the participating
compounds and reaction pairs are taken from the database and assembled to
form a graph with compounds as vertices and reaction pairs as edges.
To build not single maps, but a complete network all reaction pairs in
the KEGG database are considered. All compounds occurring in the reaction pairs are added to the graph as vertices and connected by edges which
represent the reaction pairs. This network is called the reference network.
This reconstruction, however, is a crude one, which does not take into account structural relationships between compounds [4]. Because of this, a path
in the reference network does not necessarily represent a metabolic path. For example,
in the reference network water (H2O) is connected to 815 other compounds
just because it plays a role in these reactions. It is, however, obvious that
not all of these 815 compounds are interconvertible by being transformed into
water and back into the other compound. The network has to be seen as a
relationship network rather than a reaction network.
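The construction of such a network from reaction pairs, and the effect of a highly connected compound like water, can be sketched as follows. The compound names and pairs are hypothetical placeholders and not KEGG data; the function name is an assumption.

```python
from collections import defaultdict


def build_reference_network(reaction_pairs):
    """Undirected graph as a dict of neighbor sets, one edge per reaction pair."""
    adjacency = defaultdict(set)
    for a, b in reaction_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


# Hypothetical reaction pairs for illustration.
pairs = [("water", "compound_A"), ("water", "compound_B"), ("compound_A", "compound_C")]
network = build_reference_network(pairs)

# Vertices two steps away from compound_A: compound_B is reached only via water,
# although nothing in the pairs says that A can be converted into B.
two_steps_from_A = set().union(*(network[n] for n in network["compound_A"])) - {"compound_A"}
```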
2.1.3 Exact Elemental Masses
The exact elemental masses to calculate exact molecular masses are taken
from excel elements (footnote 3). The results of exact mass calculations for KEGG compounds are compared against the original masses from the KEGG database
to spot inconsistencies. No such inconsistencies were found.
2.1.4 Metabolic Transformations
The list of common metabolic transformations is taken from reference [16].
The most recent version used by its authors in their research was obtained through
personal communication (footnote 4) and is available in the supplementary materials to
this thesis. It comprises 109 frequently occurring metabolic transformations
and their induced mass differences.
2.2 Computational Analysis
The computational analysis was performed using self-written software in
Java (footnote 5) and R (footnote 6) [38], aided by Perl (footnote 7) scripts and shell scripts written by the
author. Software was run on a Linux (footnote 8) machine with an Intel Pentium processor (footnote 9)
and 2 GB memory. Special use was made of the Java library JUNG (footnote 10) to
represent and handle graphs.
A metabolic network in this thesis is defined as an undirected graph with
a set of vertices, representing metabolites, and a set of edges, representing
chemical transformations between these metabolites. Even though the chemical reactions in metabolism can be directed, it is not within the scope of this
work to investigate the direction of reactions in the reconstructed networks.
Although software design was no major aspect of this thesis, the Java
software was developed to fit into a three-tier application architecture. Since
no data tier and presentation tier were required, the logic tier was developed to interface easily with these other tiers, so the developed software
can be easily reused if desired. A small graphical user interface was developed, mainly for visualizing the reconstructed networks. This graphical user
interface can be extended in the future to build a presentation tier.
Footnote 3: M. Selmke and A. Selmke, http://www.chemlin.de/download/excelelements.htm
Footnote 4: [email protected], June 9th 2008
Footnote 5: Java(TM) SE Runtime Environment (build 1.6.0 02-b05)
Footnote 6: R version 2.5.1 (2007-06-27)
Footnote 7: Perl v5.8.8; Copyright 1987-2006, Larry Wall
Footnote 8: Linux version 2.6.22.9-91.fc7 Fedora Release 7
Footnote 9: Intel Pentium 4 CPU 3.00 GHz
Footnote 10: Jung 1.7.6 “Java Universal Network/Graph Framework”
2.3 Network Reconstruction
The workflow of network reconstruction using the method based on frequent
mass differences and the method based on the list of common metabolic
transformations is depicted in Figure 8. Details for the individual steps are
given in the sections below.
2.3.1 Filtering of FTMS Data
To reconstruct reaction pathways, it is necessary to work with data which
are as unambiguous as possible. Ideally the data would only comprise
monoisotopic masses of all measured compounds. To get as close to this as
possible, data are filtered for the most important known superfluous masses.
These are firstly the masses emerging due to 13C isotope insertion [18] and
secondly masses derived from m/z values of doubly ionized molecules, i.e.
molecules which carry not a single but a double charge.
Identification of multiply charged ions Mass to charge peaks from
multiply charged ions can only be detected reliably with the aid of isotope
peaks. The monoisotopic peak and the peak of an isotope with one more
neutron will not differ by the neutron mass, but by the neutron mass divided
by the charge [20]. To check whether the singly and doubly ionized forms of
a compound are measured at the same time, the following procedure can be
applied: If for any pair of masses a and b the relation a = 2 · b (or vice
versa) is found, i.e. one molecule has an m/z ratio twice as high, the “lighter”
molecule is considered doubly ionized and the respective mass is removed
from the data. This procedure poses a problem if two compounds have
this relationship due to their atomic composition, as for example Fructose
1,6-bisphosphate and Glyceraldehyde 3-phosphate.
Since in the observed samples double ionization plays no major role and
m/z = m/1 most of the time, the term mass is often used when actually using
the term m/z value would be more accurate. This habit is also employed in
the literature.
Identification of isotopic peaks. 13C isotope masses are identified by mass
difference. If two masses differ by 1.0034 u ± 0.1 ppm, the heavier mass is
considered non-monoisotopic and removed from the dataset. This is a crude
method, which can be extended by incorporating the peak ratios of isotope
peaks for a more exact determination. For the observed data and the aim of the
following work, the described method is sufficient.
[Figure 8, flowchart with two columns.
Left, using frequent mass differences: Masses from FTMS → Preprocessing →
Filtered Masses → Pairwise comparison → Mass Differences → Clustering →
Clustered Mass Differences → Select heaviest clusters → Frequent Mass
Differences → Connect → Network.
Right, using known mass differences: Masses from FTMS → Preprocessing →
Filtered Masses → Pairwise comparison → Mass Differences, combined with the
Known Mass Differences → Connect → Network.]
Figure 8: The workflow from FTMS data to reconstructed metabolic networks
is shown. Square boxes represent data, rounded boxes represent processes. On
the left the steps necessary for network reconstruction using internally determined
frequent mass differences are depicted, on the right the steps for reconstruction
using a set of a priori known mass differences are shown. Preprocessing summarizes
all filtering steps. Pairwise comparison calculates all pairwise mass differences.
Clustering groups mass differences together so that frequent mass differences can be
identified in the next step, Select heaviest clusters. Finally, Connect builds the
networks from the generated data. For the method depicted on the right, the
pairwise comparison has to be performed to be able to connect the masses to a
network, but no clustering needs to be done.
It is important to bear in mind that the non-monoisotopic masses usually
are an important and regularly employed tool to identify a molecule, for
example in protein mass spectrometry [39]. But in building ab initio reaction
networks these masses would only introduce noise and have to be removed.
If the identification of compounds is of importance, the information gain
from non-monoisotopic masses has to be used. In fact, in this case it would
be helpful to also use peaks with a lower signal to noise ratio, because the
isotopic peaks might fall below the chosen signal to noise cutoff.
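The two filtering rules described in this section can be sketched as follows. This is a minimal illustration and not the thesis software: the function name `filter_masses` and the absolute tolerance `tol` (in u) are assumptions made for the example, whereas the text specifies the isotope tolerance in ppm.

```python
C13_SHIFT = 1.0034  # u, 13C isotope mass difference as given in the text


def filter_masses(masses, tol=1e-4):
    """Return the masses without putative 13C and doubly charged peaks.

    tol is an absolute tolerance in u (an assumption for this sketch).
    """
    ordered = sorted(masses)
    removed = set()
    for i, light in enumerate(ordered):
        for heavy in ordered[i + 1:]:
            # Heavier partner of a 13C pair is considered non-monoisotopic.
            if abs((heavy - light) - C13_SHIFT) <= tol:
                removed.add(heavy)
            # If heavy = 2 * light, the lighter value is considered the
            # m/z of a doubly ionized molecule.
            if abs(heavy - 2 * light) <= tol:
                removed.add(light)
    return [m for m in ordered if m not in removed]
```

For example, `filter_masses([100.0, 101.0034, 150.0, 200.0])` keeps only 150.0 and 200.0: 101.0034 is dropped as the isotope peak of 100.0, and 100.0 is dropped because 200.0 is exactly twice its value.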
2.3.2 Calculation of Mass Differences
The pairwise mass differences are calculated using a simple algorithm with
quadratic runtime complexity. Each mass is compared against each other
mass once and the differences and constituents of the difference are stored
using a hash map, so that for any difference the constituents can be easily
obtained.
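A minimal sketch of this quadratic pairwise comparison is given below. The function name and the rounding of the map keys to a fixed number of decimals are assumptions made for the illustration; the thesis does not state how keys are formed.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_differences(masses, decimals=6):
    """Map each mass difference to the list of (lighter, heavier) pairs."""
    diff_to_pairs = defaultdict(list)
    # Each mass is compared against each other mass exactly once.
    for light, heavy in combinations(sorted(masses), 2):
        diff = round(heavy - light, decimals)  # rounded key: an assumption
        diff_to_pairs[diff].append((light, heavy))
    return diff_to_pairs
```

With the masses 100.0, 102.0 and 104.0 the map holds two pairs under the key 2.0 and one pair under the key 4.0, so the constituents of any difference can be looked up directly.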
In a correctly calibrated mass spectrometer, there is no common shift of
measurements in one direction, i.e. each mass measurement can lie above
or below the exact mass independently. Therefore the precision of mass
differences is worse than the precision of individual measurements.
This becomes clear if the mass measurements are seen as single independent
samples from normal distributions with mean $\mu$ = exact mass and standard
deviation $\sigma$ proportional to the accuracy. If the difference of two
measurements a and b is calculated, this is as if the difference were drawn
from a normal distribution with mean $\mu_{\text{diff}} = \mu_a - \mu_b$ but
standard deviation $\sigma_{\text{diff}} = \sqrt{\sigma_a^2 + \sigma_b^2}$,
which is larger than either $\sigma_a$ or $\sigma_b$; in the worst case the
two absolute errors simply add up.
E.g. the exact masses 180 u and 200 u measured at a precision of 1 %
(footnote 11) might yield the measurements 181.80 and 198.02, which have a
difference of 16.22. The difference of the exact masses is 20, so this is a
deviation of 18.9 %. To see how far this influences precision in the real
datasets under the actual circumstances, an FTMS relative accuracy of 0.2 ppm
is assumed. Furthermore a mass at the upper end of the detection range
(2000 u) is considered. The resulting absolute deviation is:
2000 u × 0.2 ppm = 0.0004 u
So the absolute deviation of a difference of measured masses from the
difference of the exact masses is at most 2 × 0.0004 u = 0.0008 u.
Of course this is an extreme value. But there will be some variation between
the mass differences, such that two mass differences which are the same when
considering exact masses will differ slightly for measured masses. This
discrepancy has to be considered when frequent mass differences have to be
determined from measured masses. The study in [16] does not present these
calculations, but devises the following rule: Any 5 mass differences which are
closer to each other than 0.0001 u are considered frequent mass differences.
This value is acceptable if not the extreme, but an average absolute deviation
is calculated on the basis that the median mass of typical metabolites is
306 u (compare chapter 1.4). Analogous to the above calculation this leads
to an absolute deviation of 0.000122 u, so the previously proposed value is
adopted in this thesis.
11 Percent is used instead of parts per million in this example for clarity.
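The arithmetic of this section can be reproduced with a few lines. The function names are chosen for this illustration only; the sketch computes the worst-case bound used in the text (both absolute errors add up) and, for comparison, the standard deviation of a difference of two independent measurements.

```python
import math


def absolute_error(mass_u, ppm):
    """Absolute error in u of a mass measured at a relative accuracy in ppm."""
    return mass_u * ppm * 1e-6


def worst_case_difference_error(mass_a, mass_b, ppm):
    """Upper bound for the error of a mass difference: both errors add up."""
    return absolute_error(mass_a, ppm) + absolute_error(mass_b, ppm)


def sigma_of_difference(sigma_a, sigma_b):
    """Standard deviation of the difference of two independent measurements."""
    return math.hypot(sigma_a, sigma_b)
```

`worst_case_difference_error(2000, 2000, 0.2)` gives 0.0008 u, and the same call with two masses of 306 u gives about 0.000122 u, the value that motivates the 0.0001 u rule.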
2.3.3 Clustering
The above considerations on the calculation of mass differences carry over to
the clustering of mass differences. To identify frequent mass differences, it is
necessary to treat several mass differences that are almost the same as
one cluster. The cluster arises from the same mass difference, and is spread
out around the exact mass difference due to measurement errors (precision).
Using the experience of others [16], the rule that a difference of 0.0001 u is
significant in separating these clusters is incorporated.
To achieve this computationally, hierarchical single linkage clustering
seems appropriate. The two closest elements (mass differences) will be
clustered together in a recursive hierarchic fashion until all mass differences
are clustered. Within the hierarchy those subclusters have to be identified
in which the within distance is below 0.0001 u and the between distance
exceeds this value. Because the data under consideration are one-dimensional,
this is more easily achieved by the following algorithm:
1. Sort all mass differences
2. Start at the first mass difference
3. Open a new cluster and put the current mass difference into it
4. Scan through the sorted mass differences from the next position
5. If the difference between this and the previous mass difference exceeds
0.0001
(a) return the current cluster
(b) Continue at 3
6. If not, put the current mass difference into the current cluster and
continue at 4
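The steps above can be sketched as a single pass over the sorted values. The function name and the default `max_gap` of 0.0001 (the threshold from the text) are the only choices made for this illustration.

```python
def cluster_sorted_gaps(values, max_gap=0.0001):
    """Group one-dimensional values; a gap above max_gap starts a new cluster."""
    clusters = []
    for value in sorted(values):
        # Compare with the previous (largest so far) member of the open cluster.
        if clusters and value - clusters[-1][-1] <= max_gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters
```

Sorting dominates the cost, so the procedure runs in O(n log n) time, whereas a general hierarchical clustering needs at least quadratic time.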
For a better understanding refer to Figure 9. The procedure creates a
structure similar to a histogram with dynamic bin widths; each bin contains
the elements of one cluster. Of course problems known in single linkage
clustering apply to this method as well, especially “cluster elongation”
(chaining), where the cluster extends in one direction due to elements whose
distance to one of the cluster's elements is just below the threshold
[40, p. 197]. This has to be considered in the next step.
Figure 9: An example for the clustering algorithm. Each vertical bar represents
a single mass measurement. d is a chosen distance, in this work 0.0001 u. Emerging
clusters are marked by curly brackets. A break between two clusters is inserted
as soon as the distance between two consecutive masses exceeds d. Single linkage
clustering would cluster all elements under the curly brackets first, so the results
of the employed method are equivalent to single linkage clustering.
Following the formation of clusters, the central element closest to the
exact mass difference has to be selected. The median of the mass differences
is appropriate in this case, because it firstly represents one actual member
of the cluster and secondly is less influenced by outliers, as they might occur
due to the above mentioned chaining.
The selection of “heavy” clusters is straightforward. The clusters containing the most elements are considered “heavy” and their medians are the
identified frequent mass differences. A threshold has to be defined, determining how many clusters are selected. Breitling et al. [16] select all clusters
with 5 or more elements. For bigger datasets this becomes infeasible and a
dynamic approach has to be used, as will be shown in the Results section.
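Selecting the heavy clusters and their representatives could look like the sketch below. The function name and the use of `statistics.median_low`, which always returns an actual member of the cluster (also for clusters of even size), are assumptions of this illustration; the default `min_size=5` is the cutoff of Breitling et al. [16].

```python
from statistics import median_low


def frequent_mass_differences(clusters, min_size=5):
    """Return (median, size) of every cluster with at least min_size elements."""
    heavy = [c for c in clusters if len(c) >= min_size]
    heavy.sort(key=len, reverse=True)  # heaviest cluster first
    return [(median_low(c), len(c)) for c in heavy]
```

Raising `min_size` corresponds to raising the abundance cutoff.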
2.3.4 Network Creation
To work with graphs in JAVA the JUNG package is used. A network is
represented as a SparseGraph. For each mass a vertex (SparseVertex) is
created and added to the SparseGraph. The mapping from mass to Vertex
is kept in a hash map and each vertex is assigned a UserDatum containing
the mass it is representing. So queries from mass to vertex and from vertex
to mass are always possible.
For each mass difference, whether determined intrinsically or taken from
the file of known transformations, vertices exhibiting this mass difference
are connected. To account for the error due to measurement precision, not
only exact matches, but all pairs with a mass difference within the desired
precision are connected by an edge (UndirectedSparseEdge). The precision
used in both datasets was 0.75 ppm. The information which pairs of vertices
are to be connected by an edge can be easily obtained by querying the map
from mass difference to mass pairs and subsequently querying the map from
mass to vertex. This allows for an efficient and quick network construction.
Each inserted edge is annotated with a UserDatum containing the mass
difference it represents.
The final graph can be stored in various file formats if desired.
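The edge insertion can be illustrated without JUNG by a plain dictionary of edges. This sketch is not the thesis implementation: the function name is hypothetical, and it assumes that the ppm tolerance is taken relative to the heavier mass of a pair, which the text does not specify.

```python
def build_network(masses, target_differences, ppm=0.75):
    """Return {(lighter, heavier): matched difference} for all matching pairs."""
    ordered = sorted(masses)
    edges = {}
    for i, light in enumerate(ordered):
        for heavy in ordered[i + 1:]:
            observed = heavy - light
            # Assumption: tolerance relative to the heavier mass of the pair.
            tolerance = heavy * ppm * 1e-6
            for target in target_differences:
                if abs(observed - target) <= tolerance:
                    edges[(light, heavy)] = target  # edge annotation
                    break
    return edges
```

For brevity the sketch loops over all pairs directly; the software described above avoids this by querying the precomputed map from mass differences to mass pairs.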
2.4 Network Analysis
For all networks, i.e. the reference network and the reconstructed networks,
descriptive parameters (degree distribution, distribution of clustering
coefficients and degree density) are calculated and plotted. Compounds are
identified using the KEGG and LIPIDMAPS (footnote 12) databases. KEGG in the
version used contains 11615 and LIPIDMAPS 10199 compounds; the sets show
some overlap.
The reconstruction of mass difference networks heavily depends on the
number of mass differences which are incorporated. The number of emerging
edges can range from several hundred to complete saturation of the network
if all mass differences are used. The most abundant mass differences (the
ones that are observed at a high frequency) are hypothesized to represent
chemical reactions [16]. The abundance cutoff, i.e. the abundance above which
a mass difference is considered frequent, is an important parameter. The
authors of [16] use an abundance cutoff of 5 without giving any rationale for
this. Their datasets are substantially smaller than the ones in this thesis,
hence choosing another (higher) abundance cutoff might be necessary. To
determine an appropriate abundance cutoff, several networks are
reconstructed for S. cerevisiae and their characteristics are compared.
After this general assessment, the reconstructed D. melanogaster network
is investigated in more detail. The degree of a vertex is an important descriptor, e.g. hub vertices play a major role in scale free networks. Therefore
the hub vertices are identified. Furthermore the correlation between a vertex’s degree and its mass and tendency to represent isomers is calculated.
Pearson’s correlation coefficient is calculated for both, degree versus average
mass and degree versus average number of isomers.
12 http://www.lipidmaps.org/
To assess the significance of the two correlations obtained, a stochastic
null model is constructed. This null model is based on the assumption that
vertices are not determined by their degree, but by chance. For each degree
present, not the vertices with this degree but a random sample of the same
size is drawn. From this sample the average mass and the average number of
isomers are calculated, respectively. This is repeated Ω = 100000 times. For
each of these 100000 random samples the correlation coefficient is determined.
The p–value for the initial correlation is estimated by the fraction of
simulated correlations whose absolute values are greater than or equal to the
absolute initial correlation. In formal terms this becomes:
Let
n = the number of different degrees observed,
$d_{1..n}$ = degrees,
$a_{1..n}$ = average masses,
$b_{1..n}$ = observations, such that $b_i$ = number of vertices with degree $d_i$.
The initial correlation $c_{\text{init}}$ is the correlation between d and a, so
$c_{\text{init}} = \mathrm{cor}(d, a)$.
$r_{j,1..n}$ is sampled such that $r_{j,i} = \mathrm{mean}(m_1, m_2, \ldots, m_{b_i})$,
where M is the set of all masses and all $m_k$ are randomly chosen from M.
The index $j \in [1 \ldots \Omega]$ runs over the random samples; Ω is the
size of the null model.
For all $j \in [1 \ldots \Omega]$ the correlation $c_j = \mathrm{cor}(d, r_j)$
is determined. Now e is the number of $c_j$ such that $|c_j| \ge |c_{\text{init}}|$.
The estimate of the p–value for $c_{\text{init}}$ is e/Ω.
Calculations for the correlation between degree and average number of
isomers are analogous. Sampling is not done from the masses, but from the
numbers of annotated isomers. Because only about 10 % of the masses can
be annotated, this dataset is smaller; still Ω = 100000 is chosen.
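The null model defined above can be sketched as follows. The argument names follow the symbols d, a and b of the text, while `pool` stands for the set M (or the isomer counts). Drawing each sample without replacement and fixing a random seed are assumptions of this illustration, and the default `omega` is reduced to keep the example fast.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson's correlation coefficient; 0.0 if one input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((u - mx) * (v - my) for u, v in zip(x, y))
    var_x = sum((u - mx) ** 2 for u in x)
    var_y = sum((v - my) ** 2 for v in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def null_model_p_value(d, a, b, pool, omega=1000, seed=0):
    """Estimate the p-value e / omega of the initial correlation cor(d, a)."""
    rng = random.Random(seed)
    c_init = pearson(d, a)
    e = 0
    for _ in range(omega):
        # For each degree d_i, average a random sample of size b_i.
        r_j = [mean(rng.sample(pool, b_i)) for b_i in b]
        if abs(pearson(d, r_j)) >= abs(c_init):
            e += 1
    return e / omega
```

Because `random.sample` draws without replacement, every `b_i` must not exceed the size of `pool`. With Ω = 100000 as in the text, the smallest p-value that can be resolved is 1/Ω = 0.00001.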
2.5 Identification of Mass Differences
The frequent mass differences observed in the networks are expected to have
a biological meaning. Because of this it is of interest to identify the metabolic
transformation underlying these mass differences. To accomplish this task,
two approaches analogous to the identification of compounds by their mass are
employed (compare chapter 1.4). Firstly, one can check whether the observed
mass difference is caused by a known transformation. To do this, the mass
differences are compared both to the list of common metabolic transformations
and to mass differences calculated from KEGG reaction pairs. The mass
differences from KEGG reaction pairs are determined by calculating the mass
difference of the two compounds present in the same reaction pair. Mass
differences are again compared to these values with an error tolerance of
0.75 ppm.
It is now possible that a frequent mass difference does not arise from
one reaction, but from a whole series of reactions or other relationships.
This was already discussed in the introduction (chapter 1.5) and demonstrated
in Figure 4. Additionally, multiple reaction steps might not only lead to the
subtraction or addition of functional groups, but to subtracting and adding
atoms at the same time. So the second approach to identify these mass
differences follows the second way to identify compounds by their mass: the
possible elemental combinations for the mass difference are calculated, this
time also allowing for negative values. E.g. the mass difference 0.9840155930
is explained by the combination $\mathrm{H_{-1}N_{-1}O_{+1}}$. These
combinations are obtained through integer linear programming. A linear
programming problem is a problem of the form
Maximize $c^T x$
Subject to $Ax \le b$
where x is the vector of variables to be determined (scalars, one for each
element), c is a vector of known coefficients (the masses of the elements),
A is a matrix of known coefficients and b is a vector of bounds, which
together form the constraint equations $Ax \le b$. $c^T x$ is called the
objective function, which is maximized (or minimized) such that the
constraints are not violated. Integer linear programming has one additional
constraint: all components of x have to be integers.
To apply integer linear programming to the above problem, the objective
function has to be maximized, subject to the only constraint that it does not
exceed the mass under investigation. So $A = c^T$ and b = mass. With further
constraint equations the maximal number of atoms of each element can be
limited, which shrinks the search space. The set of elements which are
considered has to be as limited as possible, too. Here only H, C, N, O, P and
S are used. The problem can now be stated as follows.
Maximize: $x_1 M_H + x_2 M_C + x_3 M_N + x_4 M_O + x_5 M_P + x_6 M_S$
Subject to: $x_1 M_H + x_2 M_C + x_3 M_N + x_4 M_O + x_5 M_P + x_6 M_S \le \mathit{mass}$
where the $M_{\text{element}}$ are the masses of the elements.
It is possible, that a linear combination within the bounds is slightly
greater than mass, but is a closer match than the solution obtained by the
linear program above. To overcome this problem, the inverted linear program
has to be formulated, i.e. minimize the objective function, with the constraint
not to fall below mass. The outcome of both linear programs has to be
compared and the better match is the final result.
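Written out in the notation used above, the inverted program and the final selection step read as follows. This is only a restatement of the paragraph; the symbols $x^{\le}$ and $x^{\ge}$ for the two optima are introduced here for illustration.

```latex
\begin{aligned}
\text{Minimize:}\quad   & x_1 M_H + x_2 M_C + x_3 M_N + x_4 M_O + x_5 M_P + x_6 M_S \\
\text{Subject to:}\quad & x_1 M_H + x_2 M_C + x_3 M_N + x_4 M_O + x_5 M_P + x_6 M_S \ge \mathit{mass},
                          \qquad x_i \in \mathbb{Z} \\[4pt]
\text{Result:}\quad     & x^{*} = \operatorname*{arg\,min}_{x \in \{x^{\le},\, x^{\ge}\}}
                          \bigl|\, c^T x - \mathit{mass} \,\bigr|
\end{aligned}
```

Here $x^{\le}$ is the optimum of the maximization problem and $x^{\ge}$ the optimum of the minimization problem, both solved under the same bounds on the $x_i$.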
Integer linear programming is in the NP-hard class of computational problems.
For solving integer linear programming problems, R [38] in version 2.8.1
is used together with the Rglpk package (footnote 13) [41], which is an
interface to the GNU Linear Programming Kit (footnote 14).
This method is a purely mathematical approach, which relies on very high
mass accuracy. It does not incorporate any heuristic to filter improbable
combinations. E.g. the mass difference of 16.978930233 should
be explained by the transfer of $\mathrm{H_{-1}N_{-1}O_{+2}}$. Within an error
tolerance of only 0.007 ppm it is also possible to explain this mass difference
by $\mathrm{H_{4}C_{-15}N_{5}O_{4}P_{5}S_{-3}}$, which looks quite implausible
in metabolism. But even $\mathrm{H_{5}C_{1}N_{-8}O_{7}}$, which at least looks
a bit more sparse, deviates by only 0.3 ppm from the exact mass. A full list of
mathematically possible formulas within a 0.5 ppm range for this mass is shown
in Table 3. To conclude these thoughts, the results from integer linear
programming for the identification of mass differences have to be interpreted
carefully, because it is very unlikely that the measured mass difference is
close enough to the exact mass difference to single out the correct
combination.
An improved method for determining the elemental composition of a mass
difference would be to enumerate all possible combinations under certain
maximal and minimal bounds for each element (footnote 15). Subsequently, the
combination is chosen which firstly is within an accepted error tolerance (e.g.
1 ppm) and secondly, in accordance with the well known principle of Occam's
razor, explains the mass difference with the smallest number of subtracted
and added atoms. Since this is not the focus of this thesis, no efficient
algorithm for this task is presented. However, in the results section this
method is showcased to identify frequent mass differences which cannot be
identified by other means.
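A small, non-optimized sketch of this enumeration is shown below. It is not the thesis implementation: the monoisotopic element masses are standard reference values supplied for the example, the function name is hypothetical, and the hydrogen count is solved for directly instead of being enumerated, which is a shortcut of this sketch. With the full bounds of the thesis this loop would be slow in pure Python, so small bounds are used in the example call.

```python
from itertools import product

# Monoisotopic masses in u (standard reference values, assumed here).
MASS = {"H": 1.00782503207, "C": 12.0, "N": 14.0030740048,
        "O": 15.99491461956, "P": 30.97376163, "S": 31.97207100}


def explain_difference(target, bounds, ppm=1.0):
    """List (atom count, |error in ppm|, formula), fewest atoms first."""
    others = [e for e in MASS if e != "H"]
    ranges = [range(-bounds[e], bounds[e] + 1) for e in others]
    hits = []
    for counts in product(*ranges):
        rest = sum(n * MASS[e] for n, e in zip(counts, others))
        # Shortcut: take the hydrogen count that best fills the remainder.
        h = round((target - rest) / MASS["H"])
        if abs(h) > bounds["H"]:
            continue
        error_ppm = (rest + h * MASS["H"] - target) / target * 1e6
        if abs(error_ppm) <= ppm:
            formula = dict(zip(others, counts), H=h)
            atoms = sum(abs(n) for n in formula.values())
            hits.append((atoms, abs(error_ppm), formula))
    hits.sort(key=lambda hit: (hit[0], hit[1]))  # Occam's razor first
    return hits
```

For instance, `explain_difference(16.978930233, {"H": 4, "C": 3, "N": 3, "O": 3, "P": 1, "S": 1}, ppm=0.5)` returns the combination H −1, N −1, O +2 with 4 transferred atoms as its first entry.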
A method to identify mass differences which are caused by two or more
reactions through a database lookup depends on the identification of
compounds. Because less than 10 % of the measured masses could be identified,
conclusions derived from this method can only be weak. However, it is a
useful feature of the developed software, which can aid the manual analysis
of metabolomics data. For each edge in the reconstructed network the two
incident vertices a and b are determined. All possible compounds $C_a$ and
$C_b$ for the masses of a and b are retrieved. Now for all pairwise
combinations $c_a$ and $c_b$, $c_a \in C_a$ and $c_b \in C_b$, the shortest
path in the reference network is calculated. The overall shortest path is the
difference between the two masses in terms of known reactions. By identifying
the reaction pairs in the overall shortest path, one can determine the actual
reactions. An example of this is given in Figure 10.
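The lookup can be sketched with a breadth-first search over an adjacency-list representation of the reference network. The function names and the dictionary-based graph are assumptions of this illustration, not the JUNG-based implementation.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a vertex list or None if unreachable."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour in previous:
                continue
            previous[neighbour] = node
            if neighbour == goal:
                # Walk back from the goal to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


def overall_shortest_path(graph, candidates_a, candidates_b):
    """Shortest path over all pairs of candidate compounds of two masses."""
    paths = [shortest_path(graph, ca, cb)
             for ca in candidates_a for cb in candidates_b]
    paths = [p for p in paths if p is not None]
    return min(paths, key=len) if paths else None
```

The number of reaction pairs between the two masses is the length of the returned path minus one.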
13 Version 0.2-8
14 Version 4.36; http://www.gnu.org/software/glpk/
15 The bounds used in this thesis are $H^{\pm 40}, C^{\pm 30}, N^{\pm 30}, O^{\pm 20}, P^{\pm 10}, S^{\pm 6}$.
The numbers in superscript denote the minimal or maximal number of the element in the formula.
Mass            H    C    N    O    P    S    Σ    ppm
16.978921871  -17  -23   18   13  -10    5   86   0.49
16.978922753   29    6  -16   -9    4    5   69   0.44
16.978922878   34   -9  -10   -7    9    2   71   0.43
16.978923524   12  -14    8  -10    3    4   51   0.40
16.978923649   17  -29   14   -8    8    1   77   0.39
16.978924110   10   19  -18   -2    1    1   51   0.36
16.978924170  -10  -19   26  -13   -3    6   77   0.36
16.978924235   15    4  -12    0    6   -2   39   0.35
16.978924631  -17   29   -6   -7  -10    6   75   0.33
16.978924756  -12   14    0   -5   -5    3   39   0.32
16.978924881   -7   -1    6   -3    0    0   17   0.32
16.978925006   -2  -16   12   -1    5   -3   39   0.31
16.978925527  -29   -6   24   -6   -6    2   73   0.28
16.978925592   -4   17  -14    7    3   -6   51   0.27
16.978925652  -24  -21   30   -4   -1   -1   81   0.27
16.978926113  -31   27   -2    2   -8   -1   71   0.24
16.978926238  -26   12    4    4   -3   -4   53   0.24
16.978926452    6   -2  -13   19   -9    6   55   0.22
16.978927223  -11  -22   11   18  -10    5   77   0.18
16.978927891    3   21   -6  -19   10   -5   64   0.14
16.978928105   35    7  -23   -4    4    5   78   0.13
16.978928230   40   -8  -17   -2    9    2   78   0.12
16.978928662  -14    1   18  -20    9   -6   68   0.09
16.978928876   18  -13    1   -5    3    4   44   0.08
16.978929001   23  -28    7   -3    8    1   70   0.07
16.978929462   16   20  -25    3    1    1   66   0.05
16.978929522   -4  -18   19   -8   -3    6   58   0.04
16.978929587   21    5  -19    5    6   -2   58   0.04
16.978929983  -11   30  -13   -2  -10    6   72   0.01
16.978930108   -6   15   -7    0   -5    3   36   0.007
16.978930233   -1    0   -1    2    0    0    4   0
16.978930358    4  -15    5    4    5   -3   36  -0.007
16.978930483    9  -30   11    6   10   -6   72  -0.01
16.978930879  -23   -5   17   -1   -6    2   54  -0.04
16.978930944    2   18  -21   12    3   -6   62  -0.04
16.978931004  -18  -20   23    1   -1   -1   64  -0.05
16.978931465  -25   28   -9    7   -8   -1   78  -0.07
16.978931590  -20   13   -3    9   -3   -4   52  -0.08
16.978932361  -37   -7   21    8   -4   -5   82  -0.13
16.978933243    9   22  -13  -14   10   -5   73  -0.18
16.978933889  -13   17    5  -17    4   -3   59  -0.22
16.978934014   -8    2   11  -15    9   -6   51  -0.22
16.978934228   24  -12   -6    0    3    4   49  -0.24
16.978934353   29  -27    0    2    8    1   67  -0.24
16.978934535  -35   12   23  -20   -2   -1   93  -0.25
16.978934660  -30   -3   29  -18    3   -4   87  -0.26
16.978934874    2  -17   12   -3   -3    6   43  -0.27
16.978934939   27    6  -26   10    6   -2   77  -0.28
16.978935460    0   16  -14    5   -5    3   43  -0.31
16.978935585    5    1   -8    7    0    0   21  -0.32
16.978935710   10  -14   -2    9    5   -3   43  -0.32
16.978935835   15  -29    4   11   10   -6   75  -0.33
16.978936231  -17   -4   10    4   -6    2   43  -0.35
16.978936296    8   19  -28   17    3   -6   81  -0.36
16.978936356  -12  -19   16    6   -1   -1   55  -0.36
16.978936817  -19   29  -16   12   -8   -1   85  -0.39
16.978936942  -14   14  -10   14   -3   -4   59  -0.40
16.978937113   29   25  -24  -18    8    2  106  -0.41
16.978937588  -36    9    8   11   -9   -2   75  -0.43
16.978937713  -31   -6   14   13   -4   -5   73  -0.44
16.978937884   12    5    0  -19    7    1   44  -0.45
16.978938595   15   23  -20   -9   10   -5   82  -0.49
Table 3: Linear combinations of elements and their resulting mass in a 0.5 ppm
range around 16.978930233 u. The column Σ gives the sum of transferred atoms.
The search space of the elements was limited to
$H^{\pm 40}, C^{\pm 30}, N^{\pm 30}, O^{\pm 20}, P^{\pm 10}, S^{\pm 6}$; the
numbers in superscript denote the minimal or maximal number of the element in
the formula. The individual combinations are very close to each other with
respect to their mass. The wrong combinations, however, all lead to a high sum
of transferred atoms, usually much more than 15. The right combination
explains the mass difference with only 4 atoms.
Figure 10: Screenshot of the annotated part of the D. melanogaster network.
The bright vertices in the bottom are identified masses from the KEGG pathway 00052 (galactose metabolism). The edges are reconstructed using frequent
mass differences. The numbers next to the edges represent the minimal possible
distance measured in reaction pairs between two vertices. If it is “1”, the edge
represents an actual reaction pair and in this case is also printed slightly thicker.
Most of the reconstructed edges represent 2 reaction pairs. Next to each vertex
the mass and its corresponding KEGG compound identifiers are printed.
3 Results
3.1 Comprehensive Analysis of Pathway Maps from the KEGG Database
As a first step the reference network from all KEGG masses and reaction
pairs as described in the methods (chapter 2.1.2) is constructed. This network
comprises 5765 vertices and 11112 edges. The degree density for this network
is 0.067 %. To check whether the network structure is consistent with previous
findings, the degree distribution and distribution of clustering coefficients
are calculated and plotted in log-log plots. For both distributions the
described power law can be observed (Figure 11).
[Figure 11, two plots. Left: “log log Plot of Degree Distribution” (Count versus
Degree). Right: “log log Plot of C(k)” (Average C(k) versus Degree).]
Figure 11: Degree distribution (left) and distribution of clustering
coefficients C(k) (right) of the reference network created from KEGG data.
C(k) follows a power law with scaling exponent −0.88, indicated by the
regression line. The degree distribution is heavy tailed and follows a power
law in its middle section. The regression line was fitted only to this section
and has slope −2.27.
The hub vertices in this network correspond to so-called currency
metabolites [11]. In fact the vertex with the highest degree (815) is water,
followed by Oxygen (590), S-Adenosylmethionine (270), ATP (262), CO2 (253),
Ammonia (233), Phosphate (158) and Acetyl-CoA (136). Interestingly, most of
these metabolites cannot be measured in FTMS because of their small mass. ATP,
S-Adenosylmethionine and Acetyl-CoA are the exceptions. Hence this network
cannot be seen as a reaction network where two connected compounds
are interconvertible, but rather as a network representing transformation and
requirement relationships.
[Figure 12, diagram. Left: part of the glycolysis pathway with compound names
and masses. Right: the collapsed version, whose vertices carry the masses
180.0634, 260.0297, 339.996, 169.998, 265.9593, 185.9929, 167.9824 and
88.016 u.]
Figure 12: Part of the glycolysis pathway on the left, comprising 15 compounds.
Given are compound names and masses. To create the collapsed version on the
right, all compounds with the same mass are collapsed into one vertex, preserving
the edges between compounds. A path with 8 nodes emerges. This demonstrates
how FTMS gets a more limited view of metabolic networks. On top of the fact
that isomers are measured as one mass, the small mass of Pyruvate cannot be
detected at all.
Analogous to the construction of the reference network, smaller networks
are created from individual KEGG pathway maps as described in the methods
section. Only maps which actually contain reaction pairs are considered, so
140 maps are investigated. For these maps it is assessed what effect arises
if isomers (which have the same mass) are collapsed into one vertex, because
this is what happens when the compounds in the map are determined using mass
spectrometry. An example of this process is displayed in Figure 12. Each
pathway map contains on average 36.3 vertices and 38.6 edges. Now vertices
with the same mass (i.e. vertices representing isomers) are collapsed into
one vertex, preserving the edges of both vertices. Double edges are removed.
These collapsed pathway maps comprise on average 28.4 vertices and 31.4
edges. The results are presented in the following table and indicate that
FTMS has a rather limited view of these pathway maps.
            KEGG maps   collapsed maps   difference
Vertices    36.3        28.4             7.9
Edges       38.6        31.4             7.2
Table 4: Size of KEGG pathway maps before and after collapsing equal masses.
On average each pathway map loses 7.9 vertices and 7.2 edges. This amount of
entities is at least lost during any FTMS experiment, not yet accounting for not
measured compounds.
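The collapsing step can be sketched as below. The function name is hypothetical, and dropping edges between two isomers (which would become self-loops after collapsing) is an assumption of this illustration. The compound names and masses in the usage note are taken from Figure 12; the edges between them are made up for the example.

```python
def collapse_by_mass(vertex_mass, edges):
    """Merge vertices of equal mass; returns (masses, undirected edges)."""
    collapsed_edges = set()
    for u, v in edges:
        mass_u, mass_v = vertex_mass[u], vertex_mass[v]
        # Assumption: an edge between two isomers becomes a self-loop
        # after collapsing and is dropped.
        if mass_u != mass_v:
            # A set of frozensets removes double edges automatically.
            collapsed_edges.add(frozenset((mass_u, mass_v)))
    return set(vertex_mass.values()), collapsed_edges
```

Given the five vertices α-D-Glucose and β-D-Glucose (180.0634 u) and α-D-Glucose-6P, β-D-Glucose-6P and β-D-Fructose-6P (260.0297 u) with any edges between the two groups, the result has two vertices and a single edge.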
Figure 13: Histograms of the mass differences from the S. cerevisiae (top) and
D. melanogaster (bottom) dataset. The inset shows the 88.006 to 88.020 u interval
of the D. melanogaster data at higher resolution. The width of one box in the inset
is approximately 0.0001 u, so consecutive boxes are likely to be clustered together
by the employed clustering algorithm.
3.2 Metabolic Network Reconstruction for S. cerevisiae
To reproduce and verify the results from Breitling et al. [16], the metabolic
network from the Saccharomyces cerevisiae data is reconstructed using both
methods, the one based on a priori known mass differences and the one based
on frequent mass differences determined from the data. The histogram of
mass differences is shown in Figure 13 (top).
Initially the dataset contains 3101 mass values. There are 7 masses with
putative double ionization peaks, but no corresponding 13C peaks could be
detected. After filtering for 13C mass peaks, 3027 masses remain. A total of
164 masses can be identified as compounds using the KEGG database. An
additional 62 masses can be identified using the LIPIDMAPS database. This makes
a total of 226 identified masses at a precision of 0.75 ppm. The remaining
92 % of the masses are unidentified.
Using a priori known mass differences from the list of 109 frequently
occurring transformations is straightforward. The reconstructed network
contains 1998 edges and hence has a degree density of 0.04 %. The clustering
coefficient is 0.23 and the distribution of clustering coefficients is plotted in
Figure 14 in the lower right. The degree distribution follows a power law
(not shown). The network's properties are comparable to the reconstructed
networks using 5 and 10 different mass differences.
Clustering of the mass differences results in 1918640 clusters with on
average 2.4 mass differences per cluster. The frequent mass differences
accumulate in a relatively small proportion of clusters. The width of the
resulting clusters, defined as the difference between the biggest and smallest
value in the cluster, is on average $0.033 \cdot 10^{-3}$ u. This is heavily
biased by many one-element clusters of width 0 u. The widest observed cluster
spans a range of $0.19 \cdot 10^{-3}$ u. After determining the abundance cutoff
(see next section) the selected clusters span ranges from $0.09 \cdot 10^{-3}$ u
to $0.11 \cdot 10^{-3}$ u. The mean cluster width is $0.098 \cdot 10^{-3}$ u.
3.2.1 Determining the Abundance Cutoff
To get an overview of the possible outcomes of different abundance cutoffs, a total of 13 networks is created, covering a wide range of cutoffs. The selected cutoffs lead to the incorporation of 5 to 3841 different mass differences, resulting in 1133 to 114644 edges. A complete overview, together with the average clustering coefficient for each network, is given in Table 5. Each of the reconstructed networks has a degree distribution that follows a power law. To assess the modularity of the networks, the distributions of clustering coefficients for each network are plotted in Figure 14.
To pick an abundance cutoff for further work, a compromise has to be found between the network’s degree density, its average clustering coefficient ⟨C(k)⟩ and the distribution of clustering coefficients. The rationale is that the network might not contain any information if it is too sparsely connected, but loses information again if it is too densely connected and the formation of clusters in the network is blurred by too many edges of doubtful origin. Networks 5–10 have the highest clustering coefficients, but towards the lower abundance cutoffs the degree density rises and the distribution of C(k) loses the power law characteristic that is described for biological networks. For these reasons network 6 was chosen for further investigation.
The resulting network’s structure should be invariant to slight variations of the abundance cutoff; nevertheless, an improved method to optimize this parameter is desirable. Due to time limitations no exhaustive method could be tested, but one idea is to optimize the cutoff with respect to a quality measure on the networks that combines the degree distribution’s quality of fit to a power law with the average clustering coefficient, which should be maximized.
Thus the network is finally reconstructed with mass differences that occur at least 96 times. This applies to 100 frequent mass differences. Edges are inserted at a precision of 0.75 ppm. The resulting network contains 9615 edges and 3027 vertices. The degree density is 0.21 % and the average clustering coefficient is 0.33.
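The degree density quoted above can be reproduced from the edge and vertex counts. The following is a minimal sketch, assuming that "degree density" means the fraction of all possible undirected edges that are present (the definition is not restated in this section); the function name `degree_density` is chosen for illustration only.

```python
def degree_density(num_edges: int, num_vertices: int) -> float:
    """Fraction of all possible undirected edges that are present.

    Assumes a simple undirected graph (no self-loops, no multi-edges).
    """
    possible_edges = num_vertices * (num_vertices - 1) / 2
    return num_edges / possible_edges


# Network 6 of the S. cerevisiae series: 9615 edges, 3027 vertices.
density = degree_density(9615, 3027)
print(f"{100 * density:.2f} %")
```

Under this assumed definition the result agrees with the 0.21 % given in the text and in Table 5.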
[Figure 14: 14 panels, each a log-log plot of the average C(k) (y-axis) against the degree (x-axis). Panels 1–13 are labelled with the abundance cutoff of the respective network; panel T shows the network reconstructed with the list of known transformations.]
Figure 14: Evolution of the network’s distribution of clustering coefficients C(k)
for an increasing number of edges. From the top left to the lower right the abundance cutoff (“Abundance”) is lowered. The plot in the lower right is C(k) for the
network reconstructed with the list of known transformations.
N       Abundance   Mass Diff.s   Edges     Degree-Dens.   ⟨C(k)⟩
1       304         5             1133      0.03 %         0.22
2       217         10            1965      0.04 %         0.28
3       164         25            3808      0.08 %         0.31
4       126         49            5627      0.12 %         0.32
5       111         73            7674      0.17 %         0.34
6       96          100           9615      0.21 %         0.33
7       83          147           12562     0.27 %         0.34
8       73          199           15911     0.35 %         0.35
9       64          294           19977     0.44 %         0.34
10      52          488           28490     0.62 %         0.34
11      41          960           45198     0.99 %         0.32
12      30          1994          74600     1.63 %         0.30
13      23          3841          114644    2.50 %         0.29
Trans.  -           109           1998      0.04 %         0.23
Table 5: Parameters for the reconstructed S. cerevisiae networks. The abundance cutoff (2nd column) determines how many mass differences are considered
“frequent” (3rd column). The following columns show the number of edges, the
degree density and the average clustering coefficient ⟨C(k)⟩ in the resulting network.
“N” is just a number assigned to each network. Trans. is the network reconstructed
using the list of common metabolic transformations.
3.2.2 Evaluation of the Proposed Null Model
A showcase null model as proposed in [16] is constructed by sampling 3027
random “masses” from a uniform distribution between 150 and 2000. The
previous findings can be confirmed. The mass differences do not accumulate,
so the clustering only finds clusters with a maximal size of 8. By chance 13
of these masses can be identified using KEGG, another 6 masses are present
in LIPIDMAPS.
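The sampling step of this null model can be sketched as follows. This is only an illustration under stated assumptions: the helper names `sample_null_masses` and `pairwise_differences` are hypothetical, the fixed seed is there for reproducibility, and the clustering of the differences (described elsewhere in the thesis) is not reproduced.

```python
import random
from itertools import combinations


def sample_null_masses(n, low=150.0, high=2000.0, seed=0):
    """Draw n random 'masses' from a uniform distribution on [low, high]."""
    rng = random.Random(seed)
    return sorted(rng.uniform(low, high) for _ in range(n))


def pairwise_differences(masses):
    """All differences between pairs of masses (larger minus smaller).

    Expects the masses in ascending order, as returned above.
    """
    return [b - a for a, b in combinations(masses, 2)]


# A small sample keeps the example fast; the thesis uses 3027 masses.
masses = sample_null_masses(200)
diffs = pairwise_differences(masses)
print(len(diffs))  # n * (n - 1) / 2 differences
```

With 3027 masses the same code yields roughly 4.6 million differences, which would then be passed to the clustering procedure.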
The uniform distribution is however inappropriate, since the original data clearly are not uniformly distributed (compare Figure 7). Furthermore, most of the random masses will not represent any chemically possible combination of elements. The only conclusion this null model permits is that frequent mass differences do not occur in uniformly distributed masses. But it is very likely that certain patterns among the mass differences arise in any set of chemical compounds, simply because chemistry limits the possible masses. These mass differences are themselves a focus of research (P. Schmitt-Kopplin, Helmholtz–Zentrum München, personal communication), and it is known that molecules consisting of C, H and O form clearly separated “nominal mass” clusters [42]. An improved null model would sample from these masses, ideally extended by more elements, instead of taking a completely random approach. The employed null model therefore does not allow the conclusion that the investigated metabolomics data behave in any special way compared to a random composition of chemically plausible masses. The total space of molecular structures within the given mass range is so large (estimated at 10^60 to 10^200 [36]) that the construction of a proper null model falls outside the scope of this thesis.
For these reasons this null model is not elaborated further; in particular, no repeated sampling of the 3027 masses is performed. The significance of the mass differences and networks from real-world data is assessed by identifying their biological meaning.
3.2.3 Identification of Motifs
The square motif or 4-cycle is expected to occur in the reconstructed networks. A 4-cycle is a closed path of length 4, i.e. the start vertex equals the end vertex. What is special about the 4-cycles of interest here is that opposing edges have the same mass difference. This motif arises when a compound a can be chemically altered to form a′ and both compounds, a and a′, undergo the same chemical reaction. The result is two compounds b and b′ which exhibit the same mass difference as a and a′. Furthermore, the difference between a and b is the same as between a′ and b′. This can happen because metabolites which share a common scaffold but have different side groups are often transformed in the same way, sometimes even by the same enzymes. A good example is the metabolism of corticosteroids [43]. The example from the introduction in Figure 4 also shows this motif among its 6 metabolites at the bottom. These squares extend into a “ladder” structure, because b and b′ might be further transformed to c and c′ and so forth. This process is hypothesized to happen in lipid metabolism. An example is β-oxidation. The first step, oxidation, introduces a double bond and therefore removes 2 hydrogen atoms from the acyl–CoA molecule. This reaction happens for all conceivable acyl–CoA molecules, independent of their chain length. Hence the described ladder structure should emerge. Unfortunately these structures could not be found in significant numbers in the reconstructed networks, probably because certain metabolites were not detected, so that the “ladder” structure breaks up. This investigation was therefore not pursued any further.
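The square motif described above can be searched for with a simple enumeration. The sketch below is not the procedure used in the thesis; it assumes a graph without self-loops whose edges carry a label identifying the mass difference, and the function name `find_squares` as well as the toy fatty-acid labels are hypothetical.

```python
from collections import defaultdict


def find_squares(edges):
    """Find 4-cycles whose opposing edges carry the same label.

    edges: dict mapping (u, v) tuples to a mass-difference label.
    Returns a set of frozensets, each holding the 4 vertices of a square.
    """
    label = {}
    neighbours = defaultdict(set)
    for (u, v), lab in edges.items():
        label[frozenset((u, v))] = lab
        neighbours[u].add(v)
        neighbours[v].add(u)

    squares = set()
    for a in list(neighbours):
        for a2 in neighbours[a]:
            for b in neighbours[a]:
                if b == a2:
                    continue
                # b2 closes the cycle a - a2 - b2 - b - a.
                for b2 in neighbours[a2] & neighbours[b]:
                    if b2 == a:
                        continue
                    same_side = label[frozenset((a, a2))] == label[frozenset((b, b2))]
                    same_rung = label[frozenset((a, b))] == label[frozenset((a2, b2))]
                    if same_side and same_rung:
                        squares.add(frozenset((a, a2, b, b2)))
    return squares


# Toy example: two chain lengths, each saturated and unsaturated.
edges = {
    ("14:0", "14:1"): "H2",
    ("16:0", "16:1"): "H2",
    ("14:0", "16:0"): "C2H4",
    ("14:1", "16:1"): "C2H4",
}
print(find_squares(edges))
```

Chaining such squares along a shared edge gives the ladder structure; a missing vertex removes every square it takes part in, which is one way the ladder can break up.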
3.3 Metabolic Network Reconstruction for D. melanogaster
The proof of concept was repeated on the S. cerevisiae dataset in the previous chapter. To analyze the structure and information content of the networks yielded by the reconstruction method, metabolic networks from the fruit fly Drosophila melanogaster are reconstructed. The D. melanogaster dataset became available later but is of higher quality than the S. cerevisiae dataset (compare chapter 2.1.1). Therefore further analysis is performed on these data.
[Figure 15: two log-log plots. Left: degree distribution (count against degree). Right: average C(k) against degree.]
Figure 15: The degree distribution (left) and distribution of clustering coefficients (right) for the reconstructed D. melanogaster network.
The raw dataset contains 1965 mass values. There are 5 masses with putative double ionization peaks, but again no corresponding ¹³C peaks can be detected. After removing ¹³C mass peaks, 1919 masses remain. 182 masses can be identified as compounds using the KEGG database. An additional 32 masses can be identified using the LIPIDMAPS database. In total 214 masses are identified at a precision of 0.75 ppm. The remaining 89 % of the masses are unidentified.
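A database lookup at a given ppm tolerance can be sketched as below. This is a minimal illustration, not the lookup code of the thesis: the two-entry `REFERENCE` table is a hypothetical stand-in for KEGG or LIPIDMAPS, and the sketch assumes that the ppm error is taken relative to the reference mass.

```python
def ppm_error(measured: float, reference: float) -> float:
    """Signed deviation of a measured mass from a reference mass in ppm."""
    return (measured - reference) / reference * 1e6


# Hypothetical miniature reference table (monoisotopic masses in u).
REFERENCE = {
    "D-Glucose": 180.06339,
    "Tetradecanoic acid": 228.20893,
}


def identify(measured, reference=REFERENCE, tol_ppm=0.75):
    """Return all reference entries within tol_ppm of the measured mass."""
    return [name for name, mass in reference.items()
            if abs(ppm_error(measured, mass)) <= tol_ppm]


print(identify(180.063391))  # matches D-Glucose
print(identify(180.064000))  # about 3.4 ppm off, no match
```

Because isomers share one mass, a single measured mass can match several database entries; the list return value reflects that.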
Clustering of the mass differences results in 1227662 clusters with on
average 1.5 mass differences per cluster. The average width of the resulting
clusters is 0.016 · 10⁻³ u. This is again mainly due to many one–element
clusters. The widest observed cluster spans a range of 0.19 · 10⁻³ u. The finally
selected clusters (with 15 or more elements, see next section) span ranges
from 0.04 · 10⁻³ u to 0.15 · 10⁻³ u. The mean cluster width is 0.097 · 10⁻³ u.
3.3.1 Determining the Abundance Cutoff
To determine an abundance cutoff for further work, the method described
for the S. cerevisiae network is repeated, and a total of 13 networks with
different abundance cutoffs are reconstructed. Their parameters are shown in
Table 6. The distributions of clustering coefficients are not printed, because
they behave like the ones from S. cerevisiae. An abundance cutoff of 15 is
chosen for the D. melanogaster data.
N       Abundance   Mass Diff.s   Edges     Degree-Dens.   ⟨C(k)⟩
1       41          9             280       0.02 %         0.11
2       25          23            550       0.03 %         0.13
3       18          57            1057      0.06 %         0.15
4       17          73            1273      0.07 %         0.15
5       16          83            1366      0.07 %         0.16
6       15          105           1564      0.08 %         0.16
7       14          148           1984      0.11 %         0.18
8       13          209           2450      0.13 %         0.21
9       12          322           3057      0.17 %         0.25
10      11          542           3998      0.22 %         0.28
11      10          988           5572      0.30 %         0.24
12      9           1929          8222      0.45 %         0.16
13      8           3915          12976     0.71 %         0.14
Trans.  -           109           772       0.04 %         0.16
Table 6: Parameters for the reconstructed D. melanogaster networks. The abundance cutoff (2nd column) determines how many mass differences are considered
“frequent” (3rd column). The following columns show the number of edges, the
degree density and the average clustering coefficient ⟨C(k)⟩ in the resulting network.
“N” is just a number assigned to each network. Trans. is the network reconstructed
using the list of common metabolic transformations. The 6th network is chosen for
further investigation, because the distribution of C(k) loses its power law property
in the subsequent networks.
The network is reconstructed with mass differences that occur 15 or more
times. This threshold yields 105 frequent mass differences. A precision of
0.75 ppm is applied for the insertion of edges. With these parameters 1564
edges emerge, connecting the 1919 vertices. This equals a degree density of
0.08 % (compare Table 6). The degree distribution and the distribution of clustering coefficients for
this network are shown in Figure 15.
3.4 Analysis of the Masses
The hub vertices in the D. melanogaster network are identified, and all vertices
with a degree ≥ 13 are shown in Table 7. This applies to 36 vertices. A full
list can be found in the supplementary materials to this thesis. 15 of these
masses can be identified by the KEGG database, that is 42 %. This is a
substantially higher rate than for the compounds overall. Assuming that
important metabolites are more likely to be known and annotated in public
databases like KEGG, this indicates that important compounds accumulate
in the set of hub vertices.
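The comparison of identification rates can be made explicit with the counts reported in this chapter (36 hubs with 15 KEGG hits; 1919 masses with 182 KEGG hits). The snippet is plain arithmetic on those numbers; the variable names are illustrative.

```python
# Counts taken from the text: hub vertices and all masses in the network.
hubs_total, hubs_in_kegg = 36, 15
masses_total, masses_in_kegg = 1919, 182

hub_rate = hubs_in_kegg / hubs_total
overall_rate = masses_in_kegg / masses_total

print(f"hubs: {hub_rate:.1%}, all masses: {overall_rate:.1%}")
print(f"ratio: {hub_rate / overall_rate:.1f}")
```

This only restates the reported proportions; it is not a significance test.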
Deg.  Mass        Isom.  Formula      Name
22    228.208931  1      C14H28O2     Tetradecanoic acid
21    260.029701  38     C6H13O9P     D-Fructose 6-phosphate
20    226.193281  1      C14H26O2     Myristoleic acid
19    227.196631
18    357.251521
18    268.240231  1      C17H32O2     Cyclohexaneundecanoic acid
17    255.227931
17    240.208921
17    229.212281
17    200.177631  1      C12H24O2     Dodecanoic acid
16    284.271521  1      C18H36O2     Stearic acid
16    242.224571
15    440.093091
15    329.220211
15    280.240221  5      C18H32O2     Linoleic Acid
15    270.219491
15    246.050451  1      C6H15O8P     Glycerophosphoglycerol
15    215.055861  1      C5H14NO6P    Glycerophosphoethanolamine
15    180.063391  41     C6H12O6      D-Glucose
14    411.298461
14    383.267181
14    377.108721
14    363.129441
14    270.255881
14    257.243581
13    395.119281
13    352.077071
13    342.116201  30     C12H22O11    Saccharose
13    334.066491  4      C9H19O11P    sn-glycero-3-Phospho-1-inositol
13    321.082481
13    285.274871
13    283.259221
13    278.040281
13    214.156891  1      C12H22O3     3-Oxododecanoic acid
13    198.161961  2      C12H22O2     Menthyl acetate
13    196.058301  8      C6H12O7      D-Gluconic acid
Table 7: Hub vertices in the D. melanogaster network reconstructed by frequent
mass differences. Masses are sorted with respect to their degree (Deg.), Isom.
gives the number of isomers for that mass according to the KEGG database. The
molecular formula and compound name for one isomer are also given. Masses with
no further information could not be identified by means of a database lookup of
the mass.
[Figure 16: four plots. Top left: number of isomers against degree. Top right: mass against degree. Bottom left: number of isomers against degree, with null model. Bottom right: mass against degree, with null model.]
Figure 16: Top: Correlations between the vertex degree and the average number of isomers these vertices represent (left), and between the vertex degree and the average mass of these vertices (right), in the reconstructed D. melanogaster network. The x–axis shows the degree, the y–axis shows the respective parameter. The mean values are connected by a line. In the left plot each dot represents one individual vertex; in the right plot each box represents all vertices with the same degree. Bottom: The same values together with their respective null models. Vertices with a degree ≥ 15 were collapsed into one group. The mean number of isomers (left) and mean mass (right) from 100000 random samples are represented as box plots, such that each box represents 100000 values; whiskers extend to the most extreme simulation results observed. The actually observed values from the respective networks are plotted as a line.
The hubs tend to have a lower mass than expected by chance and also represent more isomers. To see if this trend continues through the whole range of observed degrees, the correlation between degree and these two parameters is calculated, as described in the methods section (chapter 2.4). Only few vertices have a degree of 15 or more, therefore all these vertices are collapsed into one group for the correlation analysis. The trend and the null models used for hypothesis testing are visualized in Figure 16.
The correlation coefficient between degree and average number of isomers is 0.84. The p-value for this correlation is 0.0001 as estimated by the randomization test. The correlation coefficient between degree and average mass is -0.62. The p-value for this correlation is 0.0118 as determined by randomization.
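A randomization test for a correlation coefficient can be sketched generically as follows. The exact procedure of chapter 2.4 is not reproduced in this section, so this is an assumption-laden illustration: it shuffles one variable, recomputes the Pearson coefficient each time, and reports the fraction of shuffles that are at least as extreme as the observed value. The names `pearson` and `permutation_p_value` and the toy data are hypothetical.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, n_samples=10000, seed=0):
    """Two-sided p-value estimated by shuffling ys n_samples times."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_samples):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # The add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_samples + 1)


# Toy data: one value per degree group 0..15 (not the thesis data).
degrees = list(range(16))
values = [2 * d + 1 for d in degrees]
print(permutation_p_value(degrees, values, n_samples=2000))
```

With 100000 random samples, as used for the null models in Figure 16, the smallest p-value such an estimate can resolve is on the order of 0.00001.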
Additionally, the fraction of identified compounds among all vertices with the same degree is plotted in Figure 17. Most of the hub vertices are identified, and it can be seen that metabolites with a higher degree in the reconstructed network are more likely to be known metabolites. Assuming that the most important metabolites are known, it can be concluded that higher degree vertices represent more important metabolites.
The conclusion from these findings is that the reconstructed network indeed contains information about its constituent compounds. The exact meaning of this information remains to be elucidated. Any deeper investigation should start with the identification of the remaining compounds.
[Figure 17: fraction of identified compounds (y-axis) against degree (x-axis); a color scale indicates the count of observations.]
Figure 17: For each degree the fraction of identified compounds is shown. A clear trend is recognizable: compounds with a higher degree in the network tend to be known compounds and therefore seem to be important in metabolism. The color of each point indicates the number of actual observations. Hollow circles represent a single compound; the darker a circle, the more compounds it stands for.
[Figure 18: two plots, each showing the fraction of identified masses (y-axis) against the number of observed mass differences (x-axis).]
Figure 18: The right plot shows the relationship between all frequent mass
differences used for the construction of the D. melanogaster network and their
annotation. Everything right of the dotted line is collapsed into one point to form
the left plot. The x–axis shows the abundance of a mass difference, the y–axis
shows the fraction of positively identified mass differences. Darker dots represent
more observations, completely empty dots represent one observation. The star in
the left plot represents the collapsed points.
3.5 Identification of Frequent Mass Differences
To elucidate the chemical and biological meaning of the observed frequent mass differences, each is identified by database lookup against KEGG and the list of common metabolic transformations at an error tolerance of 0.75 ppm. More frequently occurring mass differences are expected to be identified as known metabolic transformations. The chosen tolerance is very strict and yields a pessimistic estimate. Nevertheless Figure 18 supports the hypothesis: the more frequent a mass difference is, the more likely it is to be positively identified.
To illustrate the method, the top 10 mass differences are identified. The results are shown in Table 8. Five mass differences can be identified at 0.75 ppm error tolerance, which shows that the method is quite pessimistic, as mentioned above. If the error tolerance were 0.85 ppm or higher, one more mass difference could be identified.
The remaining 5 unidentified mass differences are examined with the combinatoric approach. The assigned combinations of elements explain the mass difference by as few element additions / subtractions as possible and are within a precision of 2 ppm. One additional mass difference can be assigned to a chemical formula (C4H6), but the formula apparently does not represent a known metabolic transformation. For the mass differences 1.00335, 2.01567 and 2.01562 only combinations with a high number of transfers can be found. One can assume that these mass differences have no underlying chemical meaning and have emerged due to noise in the mass difference data, which caused false predictions in the clustering procedure. It is however possible that they arise from compounds with different elemental compositions that exhibit exactly this mass difference.
count  mass diff.  common transformation / assigned formula    identified
90     1.00335     H14 C−1 N−4 O−6 P9 S−4 (0.04 ppm) ∗∗∗
90     28.0313     ethyl addition C2H4                         X
71     2.01567     H−2 C−2 N−3 O12 P−6 S2 (1.6 ppm) ∗∗∗
51     2.01562     H1 C−2 N5 O−5 P−3 S4 (0.55 ppm) ∗∗∗
50     14.01564    methanol CH2                                X
49     26.01565    C2H2 transfer                               X
46     18.01058    H2O (0.85 ppm) ∗
42     162.05278   monosaccharide C6H10O5                      X
41     74.03677    C3H6O2 (0.13 ppm)                           X
36     54.04693    C4H6 (0.4 ppm) ∗∗
cpd. / r.p. / identified marks in extraction order (row assignment of the cpd. and r.p. entries unclear): yes yes X yes yes yes X X yes yes yes yes X X
Table 8: Top 10 mass differences from the D. melanogaster data. The first two columns show the count (frequency) and the mass difference itself. If the mass difference was found in the list of common metabolic transformations, the respective transformation is written in the third column. If it was identified as corresponding to a compound (cpd.) or reaction pair (r.p.) in the KEGG database, this is indicated in the respective column. If the last column is checked, the mass difference was identified by either method within 0.75 ppm error. Of the top 10 mass differences, 5 can be successfully identified. The sum formulas in small font (here: those followed by a ppm value) were assigned using the combinatoric approach described in chapter 2.5. Two of them were not identified by other means, because ∗ had only a precision of 0.85 ppm and ∗∗ appears neither as reaction pair nor compound nor common metabolic transformation. Three of them, marked with ∗∗∗, can only be explained by transfers of many elements. They may be noise or arise from compounds with a different elemental composition. In fact, 2.01567 is 10 ppm away from 2.01565, the mass difference of a simple H2 transfer. The first value (1.00335) is 0.44 % smaller than the mass difference of a simple H transfer.
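The mass of an assigned element combination can be checked by summing monoisotopic element masses. The sketch below is only a verification aid, not the combinatoric search of chapter 2.5; the function names are hypothetical, the element masses are standard monoisotopic values rounded to 7 decimals, and the ppm deviation is assumed to be taken relative to the theoretical mass difference.

```python
# Monoisotopic masses in u of the elements considered.
MONOISOTOPIC = {
    "H": 1.00782503, "C": 12.0, "N": 14.0030740,
    "O": 15.9949146, "P": 30.9737620, "S": 31.9720707,
}


def delta_mass(composition):
    """Mass of an element combination; negative counts denote losses."""
    return sum(MONOISOTOPIC[element] * count
               for element, count in composition.items())


def ppm_deviation(observed, theoretical):
    """Signed deviation of an observed value from a theoretical one in ppm."""
    return (observed - theoretical) / theoretical * 1e6


c2h4 = delta_mass({"C": 2, "H": 4})
h2 = delta_mass({"H": 2})
print(round(c2h4, 4))
print(round(ppm_deviation(2.01567, h2), 1))
```

The second value reproduces the remark in the caption of Table 8 that 2.01567 lies about 10 ppm away from a simple H2 transfer.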
Finally the reconstructed edges restricted to the identified compounds were investigated as described in the methods section (chapter 2.5). For any two identified compounds which are directly connected in the reconstructed network, a shortest path between them is searched in the reference network. There are 239 edges between the identified compounds, representing 77 distinct mass differences. 171 of these edges (71.5 %) have a corresponding path in the reference network. These 171 edges still represent 57 distinct mass differences. The remaining 68 edges connect components which are unconnected in the reference network.
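The shortest-path lookup described above can be sketched with a breadth-first search. The following is an illustration under stated assumptions: the reference network is treated as an unweighted, undirected graph stored as an adjacency dictionary, and the toy vertices A to D as well as the function name are hypothetical.

```python
from collections import deque


def shortest_path_length(graph, source, target):
    """Number of edges on a shortest path, or None if none exists.

    graph: dict mapping each vertex to an iterable of its neighbours.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        vertex, dist = queue.popleft()
        for nxt in graph.get(vertex, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # source and target lie in different components


# Toy reference network: A - B - C, and an isolated vertex D.
reference = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
reconstructed_edges = [("A", "C"), ("A", "D")]

lengths = {edge: shortest_path_length(reference, *edge)
           for edge in reconstructed_edges}
print(lengths)
```

Edges that map to `None` correspond to the case of vertices in unconnected components of the reference network.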
The average length of the shortest paths is 2.31 (±1.07), so the edges in the reconstructed network represent on average about 2 reactions. The longest shortest path comprises 6 reactions. The full histogram is given in Table 9. This is evidence for the assumption that the edges in the reconstructed mass difference networks can represent a whole series of reactions.
Length:  1   2   3   4   5   6
Count:   36  74  41  13  6   1
Table 9: Histogram of the number of reactions (i.e. the shortest path length in
the reference network) each edge represents in the reconstructed network. Only
edges between identified compounds are considered.
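The average path length can be recomputed directly from the histogram in Table 9. The snippet below uses only the counts shown there; the variable names are illustrative.

```python
# Table 9: shortest path length -> number of reconstructed edges.
histogram = {1: 36, 2: 74, 3: 41, 4: 13, 5: 6, 6: 1}

n_edges = sum(histogram.values())
mean_length = sum(length * count for length, count in histogram.items()) / n_edges

print(n_edges, round(mean_length, 2))
```

The edge total matches the 171 edges with a corresponding path in the reference network, and the mean matches the reported 2.31.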
4 Discussion
Mass Difference Networks are not Metabolic Networks
FTMS gives a snapshot picture of an organism's metabolism, ideally highlighting active pathways by detecting their compounds. In reality, it is almost impossible to detect all compounds from an active metabolic pathway. Even if this happens, the reactions between most of them cannot be resolved in the reconstructed mass difference networks because of collapsed isomers, as was demonstrated in chapter 3.1. Furthermore, as the analysis of the most frequent mass differences has shown (chapter 3.5), the edges in these networks do not necessarily represent actual chemical reactions, but arbitrary combinations of reactions. It is also possible that a frequent mass difference arises from chemically very different compounds whose mass difference nevertheless appears frequently enough to become significant. If an edge represents a chemical reaction, it is still unknown whether the organism can catalyze this reaction or not. The direction of an edge cannot be determined from mass differences alone, so all edges in the reconstructed networks are undirected. The compartmentalization of reactions cannot be detected by this reconstruction method, so one must assume that the edges are not spatially separated. Finally, metabolic networks are better described as hypergraphs, and the hypergraph structure cannot be reconstructed from mass differences alone. Several studies emphasize the importance of direction information, compartmentalization information and hypergraph structure in metabolic networks [4, 11, 13, 44].
Due to the above effects many edges which are expected to be present are actually missing from the observed data, and many edges which have no chemical meaning and probably no biological background emerge in the difference networks. The conclusion is that the reconstructed networks are not metabolic networks in the common sense. Hence common techniques like flux balance analysis or elementary flux modes, which have been successfully applied to metabolic models [45], are not applicable to these networks.
It is well known that network motifs occur in complex networks [46]. They usually comprise very few vertices and are determined by their connectivity structure. In this thesis an attempt was made to identify a “ladder” motif which was hypothesized to emerge from the mass difference method. The fact that it was not found in significant quantities can very well arise from missing (not measured) compounds, which break up the ladder structure. For the mentioned β-oxidation this might be explained by the mass of the acyl–CoA being too high for the compound to be detected reliably. The disruption by missing (not measured) compounds also applies to other subgraph structures, such as individual pathways. So to validate the reconstructed network against known pathways or to find motifs, one must not look for the exact pathway or motif, but for a subgraph which is homomorphic to it. Determining the distance (in terms of reaction pairs) for reconstructed edges as described in chapter 2.5 does exactly this, but is limited to identified compounds. Extending this to all vertices in the reconstructed networks might finally confirm the presence of the ladder motif. If such “fuzzy” motif matching is applied, the null model for testing the significance of course has to be modified to account for the fuzziness.
Not all enzymes are known, therefore more reactions exist than can be reconstructed by enzyme gene mapping [12]. An advantage of the reconstructed networks is that they may contain edges with a chemical and biological meaning which is not known yet. Identification of these edges with current knowledge is however difficult. A good starting point would be edges which are positively confirmed to represent known biochemical transformations, or at least a transformation which does not look like noise. Despite this, there is some intrinsic information in these networks. For example, the degree of the vertices in the networks has a meaning, as has been demonstrated. This information content can be exploited in future work using data mining techniques.
Frequent Mass Differences are not Common Metabolic Transformations
Using the D. melanogaster and S. cerevisiae datasets, the list of known metabolic transformations yields sparsely connected networks which are comparable to the ones created with the top 5–20 frequent mass differences (compare Tables 5 and 6), even though this list actually contains 109 entries. Additionally, the common metabolic transformations are present in the determined frequent mass differences, but at a lower rate than the KEGG reaction pairs. Some of the common metabolic reactions only appear as less frequent mass differences or not at all. This is another indicator for the absence of actual metabolic transformations in the data and hence in the reconstructed networks.
The hypothesis that many of the frequent mass differences actually represent two or more consecutive metabolic transformations was already presented. Evidence for this is given in chapter 3.5, where the actual number of known reactions is computed for reconstructed edges. Even though only a subset of all edges is used (the ones in the induced subgraph of identified compounds), these edges already represent 54 % of all used frequent mass differences. 93 % of the observed mass differences could be assigned a sequence of 1–6 reactions from the KEGG database. This equals 50 % of all mass differences in the reconstructed network. An estimate based on the above information is that about 90 % of all frequent mass differences can be explained by a sequence of 1–6 reactions from the KEGG database.
Using a list of common metabolic transformations is however promising if the goal is to create metabolic reaction networks which are also suited for the well known analysis methods mentioned above. The list has to be extended by combinations of these reactions to account for compounds that were not measured. In doing so, it is very likely that metabolic pathways emerge as heavily connected clusters in the resulting reconstructed networks, as Figure 19 shows. Edges representing multiple reactions are also incorporated when using the frequent mass differences, but the advantage of a well defined set of metabolic reactions is the absence of noise due to the precision shift (discussed below) and clustering.
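[Figure 19: theoretical pathway (top) and reconstructed pathway (bottom); the edges are labelled with the number of reactions they represent.]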
Figure 19: Theoretical pathway (top) and reconstructed pathway (bottom).
The hollow circle represents a not measured compound. The numbers on the
edges represent the number of reactions which explain the mass difference. A
metabolic pathway consists of a sequence of reactions, so if edges are introduced,
which themselves represent a sequence of reactions, a single metabolic pathway
should be very densely connected by such edges.
The Clustering of Mass Differences is Complicated by a “Precision Shift”, but Frequent Mass Differences Tend to Have a Meaning
The precision of mass differences decreases compared to the precision of individual masses, especially if the mass difference is the difference of two large masses. This was demonstrated in chapter 2.3.2. This “precision shift” of course has implications for the clustering of mass differences. Especially in the region of small mass differences around 1 u a lot of noise emerges due to this error. For example, the three unidentified mass differences classified as noise (compare Table 8) are also at about 1 and 2 u. They are very close to a simple H or H2 transfer, and it is very likely to observe this mass difference in metabolism. So in this case the precision shift caused a wrong selection by the clustering routine. Additionally this is an area where many mass differences accumulate (compare Figure 13) and clustering becomes even more difficult. This problem could be overcome by determining the elemental composition of individual masses and subsequently calculating with the exact masses, but then other problems related to exhaustive elemental composition determination would arise.
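The size of the precision shift can be illustrated with a simple worst-case error propagation. This is a sketch under assumptions, not the derivation of chapter 2.3.2: each mass is taken to carry the same relative error in ppm, the absolute errors of the two masses are assumed to add up in the worst case, and the function name is hypothetical.

```python
def worst_case_difference_ppm(m1: float, m2: float, mass_ppm: float) -> float:
    """Worst-case relative error (in ppm) of the difference m1 - m2.

    Each mass is assumed to be accurate to mass_ppm; in the worst case
    the absolute errors of both masses add up.
    """
    abs_error = (m1 + m2) * mass_ppm * 1e-6
    return abs_error / abs(m1 - m2) * 1e6


# Two masses around 500 u that differ by 1 u, each measured at 0.75 ppm.
print(worst_case_difference_ppm(500.0, 499.0, 0.75))

# The same precision for a difference of 200 u.
print(worst_case_difference_ppm(500.0, 300.0, 0.75))
```

Under these assumptions a difference of about 1 u between two large masses is several hundred times less precise, in relative terms, than the individual masses, whereas a large difference is affected far less.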
Despite this, the more frequently a mass difference occurs, the more likely
an annotation for this mass difference can be found. Hence the biological
meaning of the frequent mass differences is sound. This also indicates that the
selection method for frequent mass differences, i.e. clustering and picking
the median, works well within its bounds. The mentioned issue of cluster
elongation might have an effect, but does not disturb the overall selection of
the correct mass difference. However, as mentioned above, edges which are
hypothesized to be “noise edges” emerge. In the D. melanogaster network
these are, for example, edges 1, 3 and 4 from Table 8. Each of these mass
differences could only be resolved to a very unlikely transformation. A more
likely transformation might be assigned to these edges by incorporating more
chemical elements into the search space, even though the considered elements
already are the most common elements in biology. Especially for the top mass
differences, the incorporation of less common elements is not expected.
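The selection step discussed here (clustering and picking the median) can be sketched for one-dimensional data as follows. This is a minimal illustration and not the thesis implementation: it uses the fact that single linkage clustering of one-dimensional values, cut at a given height, is equivalent to splitting the sorted values wherever two neighbours are further apart than that height. The parameters `max_gap` and `min_count` and the example values are hypothetical.

```python
from statistics import median


def single_linkage_1d(values, max_gap):
    """Group 1D values: neighbours in sorted order stay together if their gap is <= max_gap."""
    clusters = []
    current = []
    for value in sorted(values):
        if current and value - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(value)
    if current:
        clusters.append(current)
    return clusters


def frequent_differences(values, max_gap, min_count):
    """Return (median, cluster size) for every cluster with at least min_count members."""
    return [
        (median(cluster), len(cluster))
        for cluster in single_linkage_1d(values, max_gap)
        if len(cluster) >= min_count
    ]


# Hypothetical mass differences in u.
differences = [2.0155, 2.0157, 2.0156, 15.9949, 15.9950, 7.3]
print(frequent_differences(differences, max_gap=0.001, min_count=2))
```

The sketch also shows where cluster elongation comes from: a chain of values in which each neighbour is closer than `max_gap` ends up in one cluster, no matter how wide the cluster becomes.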
Emerging Hubs are Compounds of Interest
In the reconstructed networks hub vertices emerge. These hubs are not the
current metabolites known to be hubs in metabolic networks, but must share
some other common properties which make them important in this kind of
network. It was already mentioned that the hub vertices are more likely to
be identified by a database lookup and therefore were targets of research in
the past. It has been shown that there is a correlation between a vertex’s degree and the number of isomers of its elemental composition that are present
in metabolism. This is plausible, since a vertex which represents
many different compounds summarizes all connections of the individual compounds. It has also been shown that there is a negative correlation between
a vertex’s degree and its mass, i.e. vertices with a higher degree tend towards
smaller masses.
The two correlations above imply a possible relationship between a metabolite’s mass and the number of isomers of this mass which are
present in metabolism. This relationship has not been investigated explicitly
on real-world data. It is, however, remarkable, since simple combinatorial considerations suggest that larger masses should allow for more possible isomers.
Metabolism, in contrast, tends to use rather small compounds and their
isomers.
Since only correlations were observed, no firm conclusions can be drawn regarding
their origin. Especially with three correlations, it is not clear which
correlation arises for what reason. However, the possibility to identify a
special class of vertices, whether low degree or high degree, together with the
fact that the degree carries some information will aid research on FTMS
metabolomics data.
Half of the identified hub vertices in the D. melanogaster network represent at least one isomer that carries an acid group (compare Table 7). Acids
play an important role in metabolism and are a source of energy. The identification of all hub vertices and their roles in metabolism is an important
next step.
Determining an Appropriate Abundance Cutoff
It was mentioned in chapter 3 that slight variations in the abundance cutoff
should not heavily alter the network’s structure. This can be seen in Tables 5
and 6 and in Figure 14. The parameters and the form of the
distributions change slowly and not abruptly. Therefore it is likely
that the overall structure of the network remains stable. However, cutoffs
that differ more than slightly, applied to the same data, lead to differences
in network structure and, more importantly, datasets of different size require
different cutoffs as well: for the S. cerevisiae dataset an abundance cutoff of
96 was estimated to yield an optimal network, and for the D. melanogaster
dataset the value was 15. An outline of how to determine a proper abundance
cutoff automatically is presented in the “Outlook” chapter.
5 Summary
In this thesis metabolomics FTMS data were used to reconstruct networks
based on metabolites’ mass differences. The basic idea behind this approach
is that the substrate and product of any metabolic reaction should be present
in a cell. If the masses of both are measured, the resulting mass difference
represents the transformation of elements which took place in the reaction.
For a whole set of masses in a dataset the pairwise connection by meaningful
mass differences yields, in theory, a metabolic reaction network. Frequent
reactions in metabolism should lead to frequently occurring mass differences,
such that the reactions can be identified within the data and no external knowledge
is required. If one wants to incorporate external data, the mass differences
of common metabolic transformations can be used to reconstruct the network. In this work the focus was on networks reconstructed without external
information.
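The basic construction summarized above can be sketched in a few lines. This is an illustrative sketch only and not the software developed for this thesis: the function names, the matching tolerance `tol` and the example masses are hypothetical, and the list of frequent mass differences is assumed to have been selected beforehand.

```python
from itertools import combinations


def pairwise_differences(masses):
    """Yield (i, j, |m_j - m_i|) for every unordered pair of measured masses."""
    for i, j in combinations(range(len(masses)), 2):
        yield i, j, abs(masses[j] - masses[i])


def build_network(masses, frequent_diffs, tol):
    """Connect two masses if their difference matches a frequent mass difference.

    Returns edges as (index_a, index_b, index_of_matching_difference).
    """
    edges = []
    for i, j, diff in pairwise_differences(masses):
        for k, frequent in enumerate(frequent_diffs):
            if abs(diff - frequent) <= tol:
                edges.append((i, j, k))
                break  # one matching difference per pair is enough for this sketch
    return edges


# Hypothetical masses in u; 2.01565 u corresponds to an H2 transfer.
masses = [100.0, 102.01565, 116.0, 118.01565]
print(build_network(masses, frequent_diffs=[2.01565], tol=1e-4))
```

All pairs are compared, so the effort grows quadratically with the number of masses; for large datasets a sorted search would be a natural refinement.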
This thesis has shown that the reconstructed networks are not
metabolic reaction networks and therefore cannot be analyzed with the
standard methods available for metabolic networks. However, the reconstructed networks exhibit a non-random topology, and the intermediate step,
the calculation of pairwise mass differences, is also of scientific interest.
It was shown that frequently occurring mass differences have a biochemical
meaning; the most frequent mass differences can usually be explained by a sequence of 1–6 chemical reactions. Therefore the reconstructed networks
should also contain information. Elaborating on this further led to the findings that
network hubs (i.e. vertices with many neighbors) (a) tend to be known compounds, (b) comprise rather small metabolites and (c) represent a higher
number of isomers than vertices with a low degree.
Many masses (about 80–90 %) in the original metabolomics datasets cannot
be identified. The mass difference networks can aid the investigation
and eventually the identification of these mass measurements. Additionally, ideas are
developed in this thesis to further investigate the reconstructed networks and
possibly even extract metabolic reaction networks from the given data.
6 Outlook
The logical next step after this work is the identification of as many masses
and observed frequent mass differences as possible. Some interesting targets
to start from have been highlighted in this work, and at the time of writing
some of the created data are subject to analysis. The networks are probably
invariant to minimal cut sets [44], such that this analysis can be done on
these networks. This needs to be assessed in more detail.
The work itself might be extended by some of the following proposals.
The rationale to optimize the reconstructed network for known parameters of biological networks was already presented in chapter 3.2.1. Instead of
doing this manually, an automated procedure would calculate a “goodness
value” for each abundance cutoff. This can be done efficiently, because lower
abundance cutoffs only introduce more edges, but do not alter the set of
already present edges. With some more programming effort the time-consuming step of calculating the clustering coefficients for all vertices can also
be optimized, such that clustering coefficients are not recalculated if not required.
To assess the power-law property of each network’s distributions, a power law
can be fitted to the distributions and the quality of fit can be determined with
the Kolmogorov–Smirnov test. A procedure which then maximizes
the goodness of fit and the overall clustering coefficient can automatically
yield abundance cutoffs for arbitrary data.
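The incremental part of such a procedure might look like the following sketch. It rests on several assumptions that are not confirmed by the text: the abundance of a mass difference is taken to be the number of mass pairs showing it, the “goodness value” is reduced to the average clustering coefficient alone (the power-law fit and the Kolmogorov–Smirnov statistic are left out), and the clustering coefficients are simply recomputed at every step rather than updated selectively. All names and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def average_clustering(adjacency):
    """Mean local clustering coefficient; vertices with degree < 2 count as 0."""
    if not adjacency:
        return 0.0
    total = 0.0
    for neighbours in adjacency.values():
        degree = len(neighbours)
        if degree < 2:
            continue
        links = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
        total += 2.0 * links / (degree * (degree - 1))
    return total / len(adjacency)


def scan_cutoffs(edge_groups):
    """Lower the abundance cutoff step by step, only ever adding edges.

    edge_groups maps a mass difference label to the list of edges it explains.
    Returns a list of (cutoff, average clustering coefficient).
    """
    by_abundance = defaultdict(list)
    for edges in edge_groups.values():
        by_abundance[len(edges)].extend(edges)

    adjacency = defaultdict(set)
    results = []
    for cutoff in sorted(by_abundance, reverse=True):
        for u, v in by_abundance[cutoff]:
            adjacency[u].add(v)
            adjacency[v].add(u)
        results.append((cutoff, average_clustering(adjacency)))
    return results


# Toy data: three mass differences with abundances 3, 2 and 1.
groups = {"A": [(0, 1), (1, 2), (3, 4)], "B": [(0, 2), (2, 3)], "C": [(1, 3)]}
for cutoff, value in scan_cutoffs(groups):
    print(cutoff, round(value, 3))
```

Because the edge set only grows as the cutoff decreases, the network is built once and extended, instead of being reconstructed from scratch for every cutoff.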
Even though positive results could be obtained using the employed clustering method, the quality of frequent mass difference determination might
be improved by other clustering algorithms. As mentioned, the clustering
method used (single linkage hierarchical clustering) tends to elongate clusters.
Since the data are one-dimensional, this problem becomes even more severe.
Complete linkage clustering instead of single linkage would probably result in
narrower clusters and a more precise allocation of the mass difference. There
exist other distance measures for clusters which can be applied to the data,
and a comprehensive analysis of clustering results might lead to an improved
clustering method for the present data. In addition to improved clustering, the method can be improved by filtering for “noise” edges. This would
require the identification of frequent mass differences in an exhaustive way,
such that unidentified edges can be removed. A pure database lookup is infeasible.
Instead, the enumerative combinatorial method to identify a mass difference
described in chapter 2.5 has to be employed. An efficient algorithm for this
problem might be obtained by modifying existing software which calculates
elemental compositions for a given mass. Using the heuristic that fewer
atom transfers are more likely to explain a reaction or sequence of reactions,
mass differences which are more likely to represent chemical and biological
causalities can be identified.
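The heuristic that fewer atom transfers are more likely can be illustrated with a brute-force enumeration. This sketch is not the method of chapter 2.5 and not an efficient algorithm: it simply tries all signed atom counts within a small bound, restricted to C, H, N and O, and ranks the matching transfers by the total number of atoms moved. The function name, the bound `max_atoms` and the tolerance are hypothetical.

```python
from itertools import product

# Monoisotopic masses in u of the elements considered in this sketch.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def explain_difference(diff_u, tol_u, max_atoms=3):
    """List atom transfers whose mass matches diff_u, fewest transferred atoms first.

    Counts may be negative (atoms lost) or positive (atoms gained).
    Returns a list of (number_of_atoms, {element: signed_count}).
    """
    elements = list(MONOISOTOPIC)
    hits = []
    for counts in product(range(-max_atoms, max_atoms + 1), repeat=len(elements)):
        n_atoms = sum(abs(c) for c in counts)
        if n_atoms == 0:
            continue
        mass = sum(c * MONOISOTOPIC[e] for c, e in zip(counts, elements))
        if abs(mass - diff_u) <= tol_u:
            transfer = {e: c for e, c in zip(elements, counts) if c != 0}
            hits.append((n_atoms, transfer))
    hits.sort(key=lambda hit: hit[0])
    return hits


# A difference of about 2.01565 u should be explained first by an H2 transfer.
print(explain_difference(2.01565, tol_u=0.0005)[:3])
```

Edges whose mass difference yields no hit, or only hits with many transferred atoms, would be candidates for the “noise” filter described above.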
Another remaining but interesting analysis is to confirm that the
estimated 90 % of frequent mass differences actually represent a whole sequence of reactions. This can be done by calculating linear combinations of 1–6 (± some
allowance) KEGG reaction pairs and matching these distances to the frequent mass differences. In this case reaction pairs involving transformations
through current metabolites [11] should be removed to ensure that only chemically
possible reaction pathways are found.
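A possible shape of this analysis is sketched below. It is a hypothetical illustration: the single-step mass differences are a small hand-made list and not KEGG reaction pairs, each step may be applied in either direction, and combinations in which a step and its reverse cancel each other are skipped. The function name and the tolerance are assumptions.

```python
from itertools import combinations_with_replacement


def explain_by_reaction_sequences(target_u, step_diffs, tol_u, max_steps=6):
    """Find combinations of 1..max_steps signed single-step differences summing to target_u."""
    signed = {}
    for name, diff in step_diffs.items():
        signed["+" + name] = diff    # step applied in forward direction
        signed["-" + name] = -diff   # step applied in reverse direction
    names = sorted(signed)

    def cancels(combo):
        # True if the combination contains a step together with its reverse.
        return any(s.startswith("+") and "-" + s[1:] in combo for s in combo)

    hits = []
    for n_steps in range(1, max_steps + 1):
        for combo in combinations_with_replacement(names, n_steps):
            if cancels(combo):
                continue
            if abs(sum(signed[s] for s in combo) - target_u) <= tol_u:
                hits.append(combo)
    return hits


# Hypothetical single-step mass differences in u (not taken from KEGG).
steps = {"H2": 2.01565, "H2O": 18.01056, "CO2": 43.98983, "CH2": 14.01565}
print(explain_by_reaction_sequences(20.02621, steps, tol_u=0.001, max_steps=3))
```

Since the combinations are generated in order of increasing length, the shortest explanations appear first in the result.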
The clustering coefficients and their distribution suggest the presence
of clusters in the networks. Communities in complex networks are always
of interest and members of the communities share common properties in
biological networks [30, 47]. In the reconstructed networks these communities
might represent actual metabolic pathways. This hypothesis arises from the
estimate that most of the mass differences represent a sequence of reactions
and the considerations depicted in Figure 19. A logical follow-up project
would be to employ common graph clustering methods to find either
evidence for this hypothesis or at least another possible explanation
for such communities.
Regardless of any deeper information content in the reconstructed networks, the mere calculation of mass differences and their representation as a network
can aid mass spectrometric research, and there is a great demand for software in the mass spectrometry and metabolomics community [3, 21, 48]. The
developed software has to be extended to be more user-friendly and ideally
made available online. The networks need to be visualized in a dynamic and
easily modifiable way, and information about vertices and edges has to be
clearly presented.
The mentioned exponential random graph (ERG) models could be used to
reassess the similarity between reconstructed networks and known metabolic
networks. If a set of feasible descriptors for metabolic networks can be identified, the reconstruction method could be re-evaluated and possibly improved.
Concluding this thesis, the originally assessed method from [16] turned
out to be a naive idea, not feasible for metabolic network reconstruction.
Nevertheless, many new and interesting starting points for the analysis of
metabolomics FTMS data have been found, employing mass differences and
constructing networks out of them.
References
[1] V. Hatzimanikatis, C. Li, J. A. Ionita, and L. J. Broadbelt, “Metabolic
networks: enzyme function and metabolite structure.,” Curr Opin
Struct Biol, vol. 14, pp. 300–306, Jun 2004.
[2] O. Fiehn and W. Weckwerth, “Deciphering metabolic networks.,” Eur
J Biochem, vol. 270, pp. 579–588, Feb 2003.
[3] O. Fiehn, “Metabolomics–the link between genotypes and phenotypes.,”
Plant Mol Biol, vol. 48, pp. 155–171, Jan 2002.
[4] M. Arita, “The metabolic world of escherichia coli is not small.,” Proc
Natl Acad Sci U S A, vol. 101, pp. 1543–1547, Feb 2004.
[5] W. B. Dunn, “Current trends and future requirements for the
mass spectrometric investigation of microbial, mammalian and plant
metabolomes.,” Phys Biol, vol. 5, no. 1, p. 11001, 2008.
[6] J. Förster, I. Famili, B. O. Palsson, and J. Nielsen, “Large-scale evaluation of in silico gene deletions in saccharomyces cerevisiae.,” OMICS,
vol. 7, no. 2, pp. 193–202, 2003.
[7] C. Bro, B. Regenberg, J. Förster, and J. Nielsen, “In silico
aided metabolic engineering of saccharomyces cerevisiae for improved
bioethanol production.,” Metab Eng, vol. 8, pp. 102–111, Mar 2006.
[8] M. Kanehisa, M. Araki, S. Goto, M. Hattori, M. Hirakawa, M. Itoh,
T. Katayama, S. Kawashima, S. Okuda, T. Tokimatsu, and Y. Yamanishi, “Kegg for linking genomes to life and the environment.,” Nucleic
Acids Res, vol. 36, pp. D480–D484, Jan 2008.
[9] P. D. Karp, I. M. Keseler, A. Shearer, M. Latendresse, M. Krummenacker, S. M. Paley, I. Paulsen, J. Collado-Vides, S. Gama-Castro,
M. Peralta-Gil, A. Santos-Zavaleta, M. I. Peñaloza-Spínola, C. Bonavides-Martinez, and J. Ingraham, “Multidimensional annotation of the escherichia coli k-12 genome.,” Nucleic Acids Res, vol. 35, no. 22, pp. 7577–
7590, 2007.
[10] R. Guimerà and L. A. N. Amaral, “Functional cartography of complex
metabolic networks.,” Nature, vol. 433, pp. 895–900, Feb 2005.
[11] H. Ma and A. Zeng, “Reconstruction of metabolic networks from genome
data and analysis of their global structure for various organisms.,” Bioinformatics, vol. 19, pp. 270–277, Jan 2003.
[12] C. A. Ouzounis and P. D. Karp, “Global properties of the metabolic
map of escherichia coli.,” Genome Res, vol. 10, pp. 568–576, Apr 2000.
[13] N. C. Duarte, M. J. Herrgård, and B. Ø. Palsson, “Reconstruction and
validation of saccharomyces cerevisiae ind750, a fully compartmentalized
genome-scale metabolic model.,” Genome Res, vol. 14, pp. 1298–1309,
Jul 2004.
[14] A. G. Smart, L. A. N. Amaral, and J. M. Ottino, “Cascading failure and
robustness in metabolic networks.,” Proc Natl Acad Sci U S A, vol. 105,
pp. 13223–13228, Sep 2008.
[15] M. Arita, “Metabolic reconstruction using shortest paths,” Simulation
Practice and Theory, vol. 8, pp. 109–125, 2000.
[16] R. Breitling, S. Ritchie, D. Goodenowe, M. L. Stewart, and M. P. Barrett, “Ab initio prediction of metabolic networks using fourier transform
mass spectrometry data,” Metabolomics, vol. 2, pp. 155–164, 2006.
[17] D. J. H. Gross, Mass Spectrometry - A Text Book. Springer, 2004.
[18] A. Aharoni, C. H. R. de Vos, H. A. Verhoeven, C. A. Maliepaard,
G. Kruppa, R. Bino, and D. B. Goodenowe, “Nontargeted metabolome
analysis by use of fourier transform ion cyclotron mass spectrometry.,”
OMICS, vol. 6, no. 3, pp. 217–234, 2002.
[19] J. Amster, “Fourier transform mass spectrometry,” Journal of Mass
Spectrometry, vol. 31, pp. 1325–1337, 1996.
[20] A. G. Marshall, C. L. Hendrickson, and G. S. Jackson, “Fourier transform ion cyclotron resonance mass spectrometry: a primer.,” Mass Spectrom Rev, vol. 17, no. 1, pp. 1–35, 1998.
[21] R. M. A. Heeren, A. J. Kleinnijenhuis, L. A. McDonnell, and T. H. Mize,
“A mini-review of mass spectrometry using high-performance fticr-ms
methods.,” Anal Bioanal Chem, vol. 378, pp. 1048–1058, Feb 2004.
[22] W. J. Griffiths, A. P. Jonsson, S. Liu, D. K. Rai, and Y. Wang, “Electrospray and tandem mass spectrometry in biochemistry.,” Biochem J,
vol. 355, pp. 545–561, May 2001.
[23] K. D. Henry, E. R. Williams, B. H. Wang, F. W. McLafferty, J. Shabanowitz, and D. F. Hunt, “Fourier-transform mass spectrometry of
large molecules by electrospray ionization.,” Proc Natl Acad Sci U S A,
vol. 86, pp. 9075–9078, Dec 1989.
[24] J. L. Gross and J. Yellen, Handbook of Graph Theory. CRC Press, 2003.
[25] R. A. Zubarev, P. Håkansson, and B. Sundqvist, “Accuracy requirements
for peptide characterization by monoisotopic molecular mass measurements,” Anal. Chem., vol. 68, pp. 4060–4063, 1996.
[26] K. Suhre and P. Schmitt-Kopplin, “Masstrix: mass translator into pathways.,” Nucleic Acids Res, vol. 36, pp. W481–W484, Jul 2008.
[27] A. S. N. Seshasayee, G. M. Fraser, M. M. Babu, and N. M. Luscombe,
“Principles of transcriptional regulation and evolution of the metabolic
system in e. coli.,” Genome Res, vol. 19, pp. 79–91, Jan 2009.
[28] G. Thomas, J. Zucker, S. Macdonald, A. Sorokin, I. Goryanin, and
A. Douglas, “A fragile metabolic network adapted for cooperation in
the symbiotic bacterium buchnera aphidicola.,” BMC Syst Biol, vol. 3,
p. 24, Feb 2009.
[29] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A. L. Barabási,
“The large-scale organization of metabolic networks.,” Nature, vol. 407,
pp. 651–654, Oct 2000.
[30] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási,
“Hierarchical organization of modularity in metabolic networks.,” Science, vol. 297, pp. 1551–1555, Aug 2002.
[31] R. Albert and A.-L. Barabási, “Statistical mechanics of complex networks,” Reviews of Modern Physics, vol. 74, pp. 47–97, 2002.
[32] A.-L. Barabási and Z. N. Oltvai, “Network biology: understanding the
cell’s functional organization.,” Nat Rev Genet, vol. 5, pp. 101–113, Feb
2004.
[33] Z. M. Saul and V. Filkov, “Exploring biological network structure using
exponential random graph models.,” Bioinformatics, vol. 23, pp. 2604–
2611, Oct 2007.
[34] A. Kreimer, E. Borenstein, U. Gophna, and E. Ruppin, “The evolution
of modularity in bacterial metabolic networks.,” Proc Natl Acad Sci U
S A, vol. 105, pp. 6976–6981, May 2008.
[35] M. Sales-Pardo, R. Guimerà, A. A. Moreira, and L. A. N. Amaral, “Extracting the hierarchical organization of complex systems.,” Proc Natl
Acad Sci U S A, vol. 104, pp. 15224–15229, Sep 2007.
[36] S. Daunert, P. Garrigues, G. Gauglitz, K. G. Heumann, K. Jinno, A. Sanz-Medel, and S. A. Wise, “Analytical and bioanalytical chemistry,” Nov.
2007. Vol. 389 No. 5.
[37] X. Feng and M. M. Siegel, “Fticr-ms applications for the structure determination of natural products.,” Anal Bioanal Chem, vol. 389, pp. 1341–
1363, Nov 2007.
[38] R. D. C. Team, R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2007.
ISBN 3-900051-07-0.
[39] L. Chen, S. K. Sze, and H. Yang, “Automated intensity descent algorithm for interpretation of complex high-resolution mass spectra.,” Anal
Chem, vol. 78, pp. 5006–5018, Jul 2006.
[40] A. Nayak and I. Stojmenovic, Handbook of applied algorithms: solving
scientific, engineering, and practical problems. John Wiley, 2008.
[41] S. Theussl and K. Hornik, Rglpk: R/GNU Linear Programming Kit
Interface, 2009.
[42] N. Hertkorn, M. Frommberger, M. Witt, B. Koch, P. Schmitt-Kopplin,
and E. Perdue, “Natural organic matter and the event horizon of mass
spectrometry.,” Anal Chem, Oct 2008.
[43] K. Ishimura and H. Fujita, “Light and electron microscopic immunohistochemistry of the localization of adrenal steroidogenic enzymes.,”
Microsc Res Tech, vol. 36, pp. 445–453, Mar 1997.
[44] S. Klamt and E. D. Gilles, “Minimal cut sets in biochemical reaction
networks.,” Bioinformatics, vol. 20, pp. 226–234, Jan 2004.
[45] R. Steuer, “Computational approaches to the topology, stability and
dynamics of metabolic networks.,” Phytochemistry, vol. 68, no. 16-18,
pp. 2139–2151, 2007.
[46] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and
U. Alon, “Network motifs: simple building blocks of complex networks.,”
Science, vol. 298, pp. 824–827, Oct 2002.
[47] K. Schreiber, “Correlation of modular structures in biological networks.,” Bachelor’s thesis, Ludwig-Maximilians-Universität and Technische
Universität München, 2005.
[48] P. C. Dorrestein and N. L. Kelleher, “Dissecting non-ribosomal
and polyketide biosynthetic machineries using electrospray ionization
fourier-transform mass spectrometry.,” Nat Prod Rep, vol. 23, pp. 893–
918, Dec 2006.