I declare that I have written this master's thesis independently and have used only the sources and aids indicated.

Munich, 16 March 2009
Konrad Schreiber

Acknowledgements

Several people deserve my sincere thanks for their involvement in the completion of this thesis. I want to thank Karsten Suhre and Fabian Theis for giving me the opportunity to work on this thesis in their groups. Their advice, scientific and private, during the writing of this thesis was invaluable. Agnes Fekete and Philippe Schmitt-Kopplin played an important role in the completion of this thesis. They provided me with valuable information, data and, last but not least, encouraging interest in my work. It would have been a pleasure to meet them earlier during the course of this thesis. The whole Computational Modelling in Biology (CMB) group deserves my thanks as excellent colleagues, providing cheerful company and helpful advice. Thanks to Elisabeth Altmaier and Brigitte Wägele, who were good colleagues and office-mates during my work on this thesis. Finally, I would like to thank all members of the “Institut für Bioinformatik und Systembiologie” who were not mentioned explicitly for their professional and general support.

Contents

Acknowledgements
Contents
List of Abbreviations
Abstract
Übersicht

1 Introduction
    1.1 Metabolomics and Metabolic Networks
    1.2 Approaches to Metabolic Network Reconstruction
    1.3 Fourier Transform Mass Spectrometry (FTMS)
        1.3.1 Isotopes
        1.3.2 Isotopic mass and mass defect
        1.3.3 Ionization mode
        1.3.4 The principle of FTMS
        1.3.5 Mass resolution and accuracy
        1.3.6 Extract preparation
        1.3.7 Ionization by electrospray ionization (ESI)
    1.4 Metabolic Compound Identification from Molecular Mass
    1.5 Introduction to Ab Initio Metabolic Network Reconstruction
    1.6 Graph Theoretic Aspects of Biological Networks
        1.6.1 Mathematical Definition of Graphs
        1.6.2 Properties of Graphs
        1.6.3 Types of Graphs
        1.6.4 Properties of Biological Networks
    1.7 Scope of this Thesis
2 Materials and Methods
    2.1 Description of the Dataset
        2.1.1 FTMS Mass Spectrometry Data
        2.1.2 The KEGG Database
        2.1.3 Exact Elemental Masses
        2.1.4 Metabolic Transformations
    2.2 Computational Analysis
    2.3 Network Reconstruction
        2.3.1 Filtering of FTMS Data
        2.3.2 Calculation of Mass Differences
        2.3.3 Clustering
        2.3.4 Network Creation
    2.4 Network Analysis
    2.5 Identification of Mass Differences
3 Results
    3.1 Comprehensive Analysis of Pathway Maps from the KEGG Database
    3.2 Metabolic Network Reconstruction for S. cerevisiae
        3.2.1 Determining the Abundance Cutoff
        3.2.2 Evaluation of the Proposed Null Model
        3.2.3 Identification of Motifs
    3.3 Metabolic Network Reconstruction for D. melanogaster
        3.3.1 Determining the Abundance Cutoff
    3.4 Analysis of the Masses
    3.5 Identification of Frequent Mass Differences
4 Discussion
5 Summary
6 Outlook
References

List of Abbreviations

ATP         Adenosine 5'-triphosphate
CoA         Coenzyme A
e.g.        “exempli gratia”, for example
EcoCyc      Encyclopedia of Escherichia coli K-12 Genes and Metabolism
ERG         Exponential Random Graph
et al.      “et alii”, and others
FTMS        Fourier Transform Ion Cyclotron Resonance Mass Spectrometry
FTP         File Transfer Protocol
i.e.        “id est”, that is
KEGG        Kyoto Encyclopedia of Genes and Genomes (database)
LIPIDMAPS   LIPID Metabolites And Pathways Strategy (database)
MassTRIX    Mass TRanslator into Pathways
ppm         parts per million
u           unified atomic mass unit

Abstract

The topic of this work is the ab initio prediction of metabolic mass difference networks from Fourier Transform Mass Spectrometry (FTMS) data. Mass spectrometric measurement of an organism's metabolites yields a snapshot of that organism's metabolism at the time of the experiment. The employed method assumes that, for a chemical reaction, substrate and product are present in a certain ratio and that both are measurable. The mass difference between substrate and product then identifies the underlying chemical reaction. Frequent reactions will yield frequent mass differences, and the reconstructed networks are based on these frequent mass differences. It is shown that frequent mass differences have a biochemical meaning: 90 % of the observed mass differences represent a sequence of 1 to 6 known metabolic transformations. The resulting networks exhibit a hierarchical scale-free topology. This observation, together with the biochemical meaning of the mass differences, indicates that information is present in the reconstructed networks.
A correlation of “hub” nodes in these networks with certain properties of the underlying metabolites can be shown. Metabolites with many neighbors in the networks are more likely to be identifiable as important known metabolites. Taking into account that 80-90 % of the measured masses in a metabolomics FTMS experiment cannot be identified, the remaining unidentified “hub” nodes are presumed to be metabolites of special interest. They are proposed as starting points for a deeper analysis and identification. The thesis concludes with suggestions for future work and for exploiting the information contained in the networks.

Übersicht (Summary)

This thesis deals with the reconstruction of metabolic mass difference networks from Fourier Transform Mass Spectrometry (FTMS) data. The mass spectrometric determination of the metabolites in an organism provides a snapshot of its metabolism at the time of measurement. Assuming that the product and the substrate of a chemical reaction are present in a certain ratio, and are therefore both measurable, the reaction can be inferred from the mass difference between product and substrate. Frequent reactions thus manifest themselves as frequent mass differences. The mass difference networks reconstructed in this thesis are based on these frequent mass differences. It is shown that the most frequent mass differences do in fact have a biochemical meaning. In this thesis, 90 % of the observed most frequent mass differences can be identified as a sequence of 1 to 6 known metabolic transformations. The resulting networks exhibit a hierarchical scale-free topology. This, together with the biological meaning of the mass differences, is an indication of the information content of the reconstructed networks. The networks are then examined, and a correlation between central “hub” nodes and certain properties of the metabolites they represent can be shown. Metabolites that have many neighbors in the networks are more likely to be known as important metabolites. Since 80-90 % of all masses from a metabolomics FTMS experiment cannot be identified, the remaining unidentified “hub” nodes form an interesting starting point for further analysis and identification of the unknown masses. The last part of the thesis discusses future possibilities for using the method and the networks.

1 Introduction

1.1 Metabolomics and Metabolic Networks

Metabolism is a very complex biological process. Even simple organisms are able to produce an enormous number of different small organic compounds (metabolites), as well as energy, from very simple molecules such as glucose. These small compounds are in turn further metabolized to build even the biggest building blocks of life. The processes governing these reactions are the subject of intense study [1].

Metabolites are the substrates of metabolism. Structurally they cover a wide range, from simple molecules like water (H2O) to complex structures like fatty acids and lipids. Accordingly, they differ in size, in the number and nature of their functional groups, in volatility, charge state or electromobility, polarity and other physicochemical parameters [2].

In parallel to the terms transcriptome and proteome, the set of metabolites synthesized by an organism constitutes its metabolome [3]. Metabolomics is the field that investigates an organism's metabolome. The first step in metabolomics would be the exact determination and quantification of an organism's complete metabolome, but even this first step has not been accomplished to a satisfactory degree, leaving apparently simple questions unanswered. For example, the total number of metabolites of a model organism like Arabidopsis thaliana is still not known [2]; the same is true for Escherichia coli [4].
Despite that, metabolomics is able to produce meaningful and important results. Besides the investigation of the complete metabolome, there exist different strategies to study selected parts of the metabolome of a given organism. The terminology is still in flux, and the scientific community has yet to agree on a coherent set of terms [5]. In a review article, Dunn [5] summarizes six different strategies. According to this definition,

(a) Metabolomics refers to the study of the complete metabolome,
(b) Metabolic profiling is the untargeted investigation of as large a set of metabolites as possible,
(c) Metabolic fingerprinting investigates a snapshot of the metabolome in an organism,
(d) Targeted analysis is the quantification and identification of a small set of metabolites,
(e) Metabolic footprinting analyzes the extracellular metabolome of an organism, i.e. metabolites excreted or not consumed by the organism, and
(f) Metabonomics is the quantitative measurement and investigation of the dynamics of living systems under pathophysiological stimuli and under genetic modification.

Metabolic processes are best described by graphs and are thereby represented in a well-defined mathematical model. The analysis of such metabolic networks has several applications. It was shown, for example, that lethal gene deletions can be predicted by simulating them on a metabolic network [6]. Another study investigated the underlying metabolic network in order to maximize the production of a specific metabolite, with the aim of improving bioethanol production [7]. It is therefore important to have good sources for reliable metabolic networks.

1.2 Approaches to Metabolic Network Reconstruction

Metabolic networks can be reconstructed from a variety of data using a variety of approaches. Several approaches reconstruct organism-specific metabolic networks from genome information, as provided by the KEGG [8] and EcoCyc [9] databases.
Many studies that investigate the properties of metabolic networks use these databases and augment the data with information from the literature [10, 11]. This facilitates the construction of many metabolic networks for comparative analysis. The problem with this approach is that the data are incomplete. For instance, [12] states that there are reactions catalyzed by Escherichia coli for which the enzyme has not been identified. Since E. coli is a well-studied model organism, the problem can be expected to be at least as severe for other organisms, and it implies that there might be unknown enzymes catalyzing unknown reactions.

Another, more complex approach is followed by the Palsson group [13]: metabolic networks are first reconstructed from genomic information, i.e. the enzymes present in the organism under study are identified and the reactions they catalyze are determined. Using this information together with information from the literature, the compartmentalization of the reactions is incorporated into the model. This yields very precise metabolic networks, which are also readily used in studies of metabolic networks [14].

The above methods require precise knowledge about the organism, such as the full genomic sequence. One of the first approaches able to reconstruct metabolic pathways from one given compound to another without prior knowledge of the organism is described by Arita [15]. This approach identifies structurally similar metabolites and suggests a metabolic path between them as the shortest path along 16 typical metabolic reaction rules. There is, however, no evidence that the organism under consideration is actually able to catalyze all postulated reactions.

A feature which all of the above methods lack is a “snapshot view” of metabolism, termed metabolic fingerprinting above. Such a method would yield a metabolic network that represents not the theoretical capabilities of an organism, but the metabolic pathways actually active at a certain point in time.
This can be used to investigate metabolic reactions under different conditions. For this reason it is desirable to obtain metabolic networks “ab initio” for any given organism. This thesis evaluates an approach to reconstruct these networks from mass spectrometric data without any a priori knowledge, as suggested in [16].

1.3 Fourier Transform Mass Spectrometry (FTMS)

The full name for what is usually abbreviated FTMS is Fourier Transform Ion Cyclotron Resonance Mass Spectrometry. Before the technical basics are described, an introduction to the measured entities (atoms, isotopes and their masses) is given.

1.3.1 Isotopes

An element is specified by the number of protons in its nucleus. The number of protons equals the atomic number of the element, which is usually written as a subscript before the element symbol. Atoms with the same atomic number but different numbers of neutrons in the nucleus are termed isotopes. The sum of protons and neutrons is the so-called mass number of an atom. The mass number is usually written as a superscript before the element symbol. In summary, isotopes of the same element are equal in their atomic number and chemical properties, but differ in their mass number [17, p. 67].

Most elements are polyisotopic, i.e. they exist as multiple stable isotopes. Usually one isotope is the most abundant (e.g. 12C for carbon), but further stable isotopes can be encountered (e.g. 13C). Some elements naturally occur with only one stable isotope. These elements are termed monoisotopic elements. The most important such elements in biology are 19F (fluorine), 23Na (sodium), 31P (phosphorus) and 127I (iodine). Other elements occur with exactly two stable isotopes and are thus called di-isotopic elements. The most important ones in biology are hydrogen (1H, 2H), carbon (12C, 13C) and nitrogen (14N, 15N), for which the mass numbers differ by one.
One can also regard chlorine (35Cl, 37Cl) as an important di-isotopic element in biology; the mass numbers of the two chlorine isotopes differ by two. Oxygen has three stable isotopes: 16O (most abundant), 17O and 18O (second most abundant). Table 1 summarizes the isotopic abundances for the above elements. Isotopic abundances are reported either so that they sum to 100 %, or with the most abundant isotope normalized to 100 %. In this thesis the values from reference [17] are used. Because the heavier isotopes of N, O and H have low abundances, these elements can be treated as approximately monoisotopic [17, p. 68]. Care has to be taken with 13C: carbon is ubiquitous in organic chemistry, and due to the relatively high 13C abundance the effects of 13C incorporation into biomolecules have to be considered.

Element       Isotopic abundance
Carbon        100:1.08
Nitrogen      100:0.369
Oxygen        100:0.205
Hydrogen      100:0.0115
Sulfur        100:4.52
Chlorine      100:31.96
Phosphorus    monoisotopic (no second stable isotope)

Table 1: Biologically relevant elements and their isotopic abundances. If more than one stable isotope exists, the abundance of the most abundant relative to the second most abundant is given. Data taken from [17, p. 69].

1.3.2 Isotopic mass and mass defect

Atomic masses are measured in unified atomic mass units, abbreviated u. Protons and neutrons each have a mass of approximately 1 u, and since 1961 one u has been defined as 1/12 of the mass of a 12C atom [17, p. 71]. The isotopic mass is the exact mass of an isotope. It is always close to, but never exactly equal to, the mass number of the isotope; because of the definition of the atomic mass unit, the only exception is 12C. The difference between the isotopic mass and the mass number arises from the mass defect. Because a bound system is at a lower energy level than its unbound constituents, according to Einstein's formula $E = mc^2$ its mass is less than the total mass of its unbound constituents.
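As a rough numerical illustration of the mass defect, the following sketch compares the mass of a 12C atom with the summed masses of its unbound protons, neutrons and electrons. The particle masses are rounded literature values and the function name `mass_defect_u` is hypothetical; neither is taken from the thesis.

```python
# Rest masses in unified atomic mass units (u); rounded literature values
# (assumed here for illustration, not taken from the thesis).
PROTON_U = 1.007276
NEUTRON_U = 1.008665
ELECTRON_U = 0.000549
U_TO_MEV = 931.494  # energy equivalent of 1 u in MeV


def mass_defect_u(protons, neutrons, isotopic_mass_u):
    """Mass of the unbound constituents of a neutral atom minus its isotopic mass."""
    unbound = protons * (PROTON_U + ELECTRON_U) + neutrons * NEUTRON_U
    return unbound - isotopic_mass_u


# 12C has an isotopic mass of exactly 12 u by definition of the unit.
defect_12c = mass_defect_u(protons=6, neutrons=6, isotopic_mass_u=12.0)
print(f"12C mass defect: {defect_12c:.5f} u "
      f"(about {defect_12c * U_TO_MEV:.1f} MeV binding energy)")
```

The defect of roughly 0.1 u is several orders of magnitude larger than the mass precision discussed later, which is why isotopic masses, not mass numbers, have to be used throughout.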
The binding energy of protons and neutrons in the nucleus is sufficiently high to cause a measurable mass defect. Thus, the isotopic mass of an isotope is the sum of the masses of its constituents minus the mass equivalent of the nuclear binding energy. For example, the mass difference between 12C and 13C is only 1.0034 u, and not the mass of the neutron, 1.0087 u.

In theory, the chemical bonds in a molecule also introduce a mass defect. But since chemical bond energies are much lower than nuclear binding energies, these effects can be neglected. For example, the average bond energy of each O-H bond in water (H2O) is 458.9 kJ/mol. The mass defect of the two bonds amounts to about $1.02 \cdot 10^{-8}$ u per molecule:

\[ E = \frac{2 \cdot 458900\ \mathrm{J\,mol^{-1}}}{N_A} = 1.524 \cdot 10^{-18}\ \mathrm{J\ per\ molecule} \]

\[ \Delta m = \frac{E}{c^2} = 1.696 \cdot 10^{-35}\ \mathrm{kg} = 1.02 \cdot 10^{-8}\ \mathrm{u} \]

with $N_A = 6.022 \cdot 10^{23}\ \mathrm{mol^{-1}}$ the Avogadro constant, $c = 2.998 \cdot 10^{8}\ \mathrm{m\,s^{-1}}$ the speed of light and $1\ \mathrm{u} = 1.6605 \cdot 10^{-27}\ \mathrm{kg}$. This is less than one part per billion of the mass of a water molecule.

The relative atomic mass is calculated as the abundance-weighted average of the masses of all naturally occurring isotopes of an element. So, with the exception of monoisotopic elements, no atom with a mass equal to the relative atomic mass can be observed in nature. Rather, a spectrum of isotopic masses is present for each element. The monoisotopic mass of an element is defined as the mass of its most abundant isotope. These definitions for single atoms carry over to molecules: the monoisotopic mass of a molecule is defined as the sum of the monoisotopic masses of the atoms it comprises. The monoisotopic mass does not necessarily arise from the lightest occurring isotope of an element. However, among the elements important in biology, the lightest occurring isotope is usually the most abundant one. Correspondingly, the relative molecular mass is the sum of the relative atomic masses of the molecule's atoms. If an ion is formed by the removal of one or more electrons from a molecule, the exact ionic mass is the monoisotopic mass of the molecule minus the mass of the removed electrons.
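The definitions of monoisotopic molecular mass and exact ionic mass can be sketched in a few lines of Python. This is a minimal illustration only: the element masses are rounded literature values, the parser handles only simple formulas without brackets, and the names `monoisotopic_mass` and `ionic_mass` are hypothetical. A negative number of removed electrons stands for electrons added to form a negative ion.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes; rounded literature
# values assumed for illustration.
MONOISOTOPIC_U = {"H": 1.007825, "C": 12.0, "N": 14.003074,
                  "O": 15.994915, "P": 30.973762, "S": 31.972071}
ELECTRON_U = 0.000549


def monoisotopic_mass(formula):
    """Sum the monoisotopic element masses of a simple formula such as 'C6H12O6'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC_U[element] * (int(count) if count else 1)
    return total


def ionic_mass(formula, electrons_removed):
    """Exact ionic mass; use a negative count for electrons that were added."""
    return monoisotopic_mass(formula) - electrons_removed * ELECTRON_U


water = monoisotopic_mass("H2O")
print(f"H2O neutral:  {water:.6f} u")
print(f"H2O cation:   {ionic_mass('H2O', 1):.6f} u")
print(f"H2O anion:    {ionic_mass('H2O', -1):.6f} u")
```

The electron mass shifts the result only in the fourth decimal place, but at the ppm-level accuracy discussed in this chapter that shift is no longer negligible for small molecules.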
For negative ions, the electron mass (0.000549 u) needs to be added accordingly.

1.3.3 Ionization mode

FTMS can operate in either positive or negative ionization mode. In positive mode positively charged ions are measured, in negative mode negatively charged ions. Not every metabolite is easily ionized, and some metabolites can be negatively charged but not positively, and vice versa. Because of this, it makes sense in metabolomics studies to combine measurements from positive and negative mode whenever possible in order to cover a broad spectrum of metabolites [18]. The information whether a chemical species was measured in negative or positive mode can also aid in identifying that species.

1.3.4 The principle of FTMS

FTMS is based on ion cyclotron (IC) motion. Ions moving in a magnetic field are forced into a circular motion in the plane perpendicular to the magnetic field lines. This is because an ion experiences a Lorentz force which is perpendicular to both the direction of the ion's velocity and the magnetic field lines. Ion movement along the axis parallel to the magnetic field lines is unrestricted. To trap the ions completely and measure their mass, an electric field is applied to create a potential well [19]. The ions then exhibit their cyclotron motion in the magnetic field and, due to the trapping potential, are confined to a simple harmonic oscillation along the magnetic field lines. In the cubic analyzer cell this trapping potential is established by two metal plates which are perpendicular to the magnetic field lines. A small, symmetric positive voltage on both trapping plates will trap positively charged ions, a negative voltage will trap negative ions. Schematics of the mass spectrometer's analyzer cells are given in Figure 1.
[Figure 1: Schematic of the detection cell of an ion cyclotron mass spectrometer, showing the magnetic field, the trapping plates, the excitation plates and the cyclotron movement of the ions. The cubic version is on the left and the cylindrical version on the right. The smaller drawings slightly below depict the view from top to bottom (along the magnetic field). E stands for excitation plate, D stands for detection plate. The cubic cell is well suited to explain the principle, but most modern mass spectrometers contain the cylindrical version.]

The motion of the ions in the analyzer cell is governed by the magnetic field, the electric field, the charge of the ions and the ions' mass [19, 20, 21]. The ion motion can be divided into the cyclotron motion due to the magnetic field, the trapping motion due to the electric field, and the magnetron motion due to the combination of the magnetic and electric fields. Since for this thesis it is only necessary to understand the basic concept of FTMS, the detailed equations from the references are not reproduced here. The focus is on the important cyclotron motion. Because the magnetic field and the electric field are known parameters, the mass over charge ratio of the ions can be measured in the mass spectrometer by identifying the frequency of the unperturbed cyclotron motion. This cyclotron motion is governed by an equation that is derived as follows. The force on an ion with mass $m$ and charge $q$ moving in a magnetic field $B$ with velocity $v$ perpendicular to the field is

\[ \text{Force} = \text{mass} \cdot \text{acceleration} = m \frac{dv}{dt} = qvB. \]

Because the centripetal acceleration is $|dv/dt| = v^2/r$, this becomes

\[ \frac{mv^2}{r} = qvB. \]

The angular velocity is defined as $\omega = v/r$, so this becomes

\[ m\omega^2 r = qB\omega r, \]

which simplifies to

\[ \omega_c = \frac{qB}{m}, \]

with $\omega_c$ the angular velocity of the cyclotron motion, $q$ the charge in coulomb, $B$ the magnetic field in tesla and $m$ the mass in kilograms (1 u = $1.6605 \cdot 10^{-27}$ kg) [20].
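To get a feeling for the magnitudes involved, the cyclotron equation can be evaluated numerically. The sketch below is illustrative only: the physical constants are CODATA values, the function names are hypothetical, and the example ion (300 u, singly charged, 12 T field) is an assumption chosen to lie in the typical metabolite mass range.

```python
import math

ELEMENTARY_CHARGE_C = 1.602176634e-19     # coulomb
ATOMIC_MASS_UNIT_KG = 1.66053906660e-27   # kilograms per u


def cyclotron_frequency_hz(mass_u, charges, field_tesla):
    """Unperturbed cyclotron frequency f = omega_c / (2 pi) with omega_c = qB/m."""
    omega_c = (charges * ELEMENTARY_CHARGE_C * field_tesla
               / (mass_u * ATOMIC_MASS_UNIT_KG))
    return omega_c / (2.0 * math.pi)


def mass_over_charge(frequency_hz, field_tesla):
    """Invert the relation: m/z in u per elementary charge from a frequency."""
    omega_c = 2.0 * math.pi * frequency_hz
    return ELEMENTARY_CHARGE_C * field_tesla / (omega_c * ATOMIC_MASS_UNIT_KG)


f = cyclotron_frequency_hz(mass_u=300.0, charges=1, field_tesla=12.0)
print(f"cyclotron frequency: {f / 1e3:.1f} kHz")
print(f"recovered m/z:       {mass_over_charge(f, 12.0):.4f}")
```

The frequency is inversely proportional to m/z, so heavier ions circulate more slowly, and doubling both mass and charge leaves the frequency unchanged.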
So the ion's mass over charge ratio is measured as a frequency. Compared to time-of-flight instruments, where the measurement takes only as long as the time of flight, the frequency can be measured over a longer period of time and can therefore be determined more precisely than any other experimental parameter [20]. This accounts for the high precision and resolving power of FTMS instruments.

Before the cyclotron frequency can be measured, the ions need to be excited to a sufficiently large cyclotron radius. In the cubic analyzer cell the ions are excited by a sinusoidal voltage applied to the two excitation plates which, unlike the trapping plates, are oriented parallel to the magnetic field lines. All ions of the same mass over charge ratio are excited coherently and therefore undergo cyclotron motion as a packet [19]. About 100 ions of the same mass over charge ratio are required to induce a measurable signal [20]. The cyclotron frequencies of all ions present are measured by the detection plates and form a signal which is the superposition of all individual frequencies. This signal is Fourier transformed to obtain the individual frequency components of every ion present. These frequencies are finally used to calculate the mass over charge values.

One has to bear in mind that it is not mass as such that is measured, but mass over charge. Multiply charged ions will therefore be detected at lower mass over charge ratios than singly charged ions of the same mass. For example, a singly charged ion with mass m will have the same mass over charge value as a doubly charged ion with mass 2m or a triply charged ion with mass 3m. However, multiple ionization plays a major role only in proteomics and just a minor role in metabolomics.

1.3.5 Mass resolution and accuracy

The major advantages of Fourier Transform Ion Cyclotron Resonance Mass Spectrometry over other types of mass spectrometry are its unsurpassed mass resolution and mass measurement precision [5, 19].
These two aspects are closely related, because a precise mass measurement requires sufficiently resolved peaks. Resolution is the capability of a mass spectrometer to separate masses which are close to each other. The exact definitions of resolution and resolving power can be found in the literature [17, 20]. Mass accuracy is the relative difference between the measured mass and the calculated exact mass and is usually given in parts per million (ppm). For example, a mass spectrometer with an accuracy of 2 ppm will, for a calculated exact mass of 150 u, most of the time measure a value between 149.9997 u and 150.0003 u. The strength of the magnetic field in the mass spectrometer influences the resolution and accuracy: the higher the magnetic field, the better the obtained results. The typical range of magnetic fields is 1 to 9.4 tesla [20]. In this thesis data from a 12 tesla instrument are used.

1.3.6 Extract preparation

The preparation of cell extracts is a very important step in any mass spectrometry experiment, and the method chosen has a big impact on the metabolites which can be measured later [3, 5]. Each method has to compromise between different chemical species. Cell extracts are usually mixed with a methanol/0.1 % formic acid solution or an acetonitrile solution [18]. Aharoni et al. found that these two assays lead to the detection of different chemical species in the final FTMS experiment. It follows that no single extraction protocol will yield the full scope of all metabolites present in the sample. Ideally, different extraction assays are used and the resulting measurements are combined. The same procedure has been proposed above for the ionization mode, and in fact a combination of different extraction protocols, ionization techniques and ionization modes will broaden the spectrum of measured metabolites.
However, any analysis method has to account for the incompleteness of the data, and the results have to be interpreted with this in mind as well.

1.3.7 Ionization by electrospray ionization (ESI)

Electrospray ionization is considered a soft ionization technique, because the energies during ion formation are low enough not to break the chemical bonds in the molecules, as opposed to other ionization techniques such as electron impact ionization. During electrospray ionization the sample is diluted in a volatile solvent and fed into a capillary. An electrode is placed in front of the capillary, and a high voltage (2-5 kV) is applied between capillary and electrode [5]. Due to the electric potential between the capillary and the electrode, the solution in the capillary is charged and moves out of the capillary, forming a Taylor cone at its tip. From the tip of the Taylor cone a small jet is emitted. The charged molecules in this jet repel each other, and thus small drops are formed. These drops further dissociate into smaller droplets due to the same repulsive force. This dissociation continues until the charged molecules are completely separated [22]. Most small metabolites will carry one charge after this process, but it is important to note that ESI can also produce multiply charged ions. In fact the molecules with the highest mass usually carry the highest number of charges [20], but ionization mainly depends on the number of ionizable side chains. The protein RNase A (molecular weight 13682 u) clearly shows 20-fold positive ionization in a study by Henry et al. [23]. This feature is used in mass spectrometry of large biomolecules like peptides and proteins, because through multiple ionization these molecules obtain mass over charge values which are readily measurable by mass spectrometry. Within the scope of this thesis, as mentioned, multiple ionization plays a minor role.
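The effect of multiple charging on the measured value can be illustrated with the RNase A example. The sketch below assumes that in positive-mode ESI each charge is carried by one attached proton; this charging mechanism, the proton mass constant and the function name `protonated_mz` are assumptions made for illustration and are not stated in the thesis.

```python
PROTON_U = 1.007276  # proton mass in u (rounded literature value)


def protonated_mz(neutral_mass_u, charges):
    """Mass over charge of a molecule carrying `charges` attached protons."""
    if charges < 1:
        raise ValueError("at least one charge is required")
    return (neutral_mass_u + charges * PROTON_U) / charges


# RNase A, molecular weight 13682 u, at different charge states.
for z in (1, 10, 20):
    print(f"z = {z:2d}: m/z = {protonated_mz(13682.0, z):.2f}")
```

At 20 charges the protein appears below m/z 700, i.e. inside the range in which small metabolites are measured, which is why charge states have to be kept in mind when interpreting spectra.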
1.4 Metabolic Compound Identification from Molecular Mass

Small metabolites cover a mass range from a few u, like water (18.0106 u), to several hundred u, like coenzyme A (767.1152 u). Bigger metabolites, such as vancomycin and its derivatives (around 1447.4302 u), can have masses of more than 1000 u. The central 90 % of all compounds in the KEGG compound database [8] have a mass between 118 u and 867 u, with a median of 306 u.

If the exact mass of a metabolite is determined with sufficient accuracy, the chemical formula can be calculated by finding the linear combination of elements which precisely fits the exact mass. At infinite accuracy, it would be possible to find exactly one chemical formula for any given mass [17]. Fortunately, biomolecules comprise only a limited number of distinct elements. Therefore one does not need to search linear combinations of all elements, but only of those present in biomolecules (see above). Obviously, it is easier to find an exactly matching chemical formula for small masses than for large ones. If further restrictions from chemistry are introduced, such as the nitrogen rule [24, p. 238], the search space of linear combinations becomes even smaller. If no such limitations are applied, higher mass accuracies are required, which will be detailed in a later section (2.5). With the search space limited to chemical formulas plausible for peptides, Zubarev et al. [25] found the upper limit for unique identification of peptides at an accuracy of 1 ppm to be about 700 u. Aharoni et al. [18] use the same approach to identify metabolites from metabolomics data. They likewise assume an error of 1 ppm for a 7 tesla FTMS and are able to assign a unique chemical formula to more than 50 % of their measured metabolites (about 5000 in total). The chemical formula is in turn looked up in a chemical compound database.
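The search for linear combinations of elements described above can be sketched as a brute-force enumeration. This is a naive illustration under stated assumptions, not the procedure used by the cited authors: it is restricted to C, H, N and O, applies no chemical plausibility rules such as the nitrogen rule, and the element limits, the example mass and the function names are hypothetical.

```python
from itertools import product

# Monoisotopic masses in u (rounded literature values).
MONOISOTOPIC_U = {"C": 12.0, "H": 1.007825032,
                  "N": 14.003074005, "O": 15.994914620}


def ppm_error(measured, exact):
    """Relative deviation of a measured mass from an exact mass in ppm."""
    return (measured - exact) / exact * 1e6


def candidate_formulas(measured_mass, tolerance_ppm, max_counts):
    """Enumerate element compositions whose exact mass lies within the tolerance."""
    elements = list(max_counts)
    hits = []
    for counts in product(*(range(max_counts[e] + 1) for e in elements)):
        exact = sum(n * MONOISOTOPIC_U[e] for e, n in zip(elements, counts))
        if exact > 0 and abs(ppm_error(measured_mass, exact)) <= tolerance_ppm:
            formula = "".join(f"{e}{n}" for e, n in zip(elements, counts) if n)
            hits.append((formula, exact))
    return hits


# A measured mass close to that of a hexose (C6H12O6), searched at 2 ppm.
hits = candidate_formulas(180.0634, 2.0, {"C": 15, "H": 30, "N": 5, "O": 12})
for formula, exact in hits:
    print(f"{formula:12s} {exact:.6f} u  ({ppm_error(180.0634, exact):+.2f} ppm)")
```

Widening the tolerance or raising the element limits quickly increases the number of candidates, which reflects the statement above that unique assignment becomes harder for larger masses.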
Another method to identify metabolites by mass, which is used in this thesis, is a direct database lookup of the mass. This basically follows the above procedure, but in an inverted fashion, so that the step of calculating a chemical formula can be omitted. For all compounds in a database, the exact mass is either obtained by a query or calculated from the chemical formula, and each measured mass is compared to the database masses. If the best hit is within the desired error, an assignment can be made. The MassTRIX framework [26] uses this method to find and display metabolites from mass spectrometry experiments in KEGG pathway maps. Using this approach, a precision of at least 2 ppm is required to uniquely identify metabolites in a chemical database comprising 72,634 unique chemical formulae [16]. All small metabolites which have a database entry will be identified by the latter approach. Polymers and fatty acids are problematic, because their combinatorial composition gives rise to a whole plethora of different masses. In this case an exact chemical formula assignment might help to identify unknown compounds.

Special Notation

In this thesis not only compounds, but also mass differences need to be identified by their mass. Mass differences can be explained by an arbitrary number of atom transfers; e.g. the mass difference from H2O to CO2 is explained by the subtraction of two H atoms and the addition of one C and one O atom. This transformation will be written, analogously to well-known molecular formulas, as $\mathrm{H_{-2}CO}$ or, in some cases for clarity, as $\mathrm{H_{-2}C_{+1}O_{+1}}$. Mass differences can have a positive or negative sign, but usually their absolute value is of interest. Hence this notation is bidirectional in the sense that the signs need to be swapped for the transfer in the opposite direction. Thus the above formula is considered equivalent to $\mathrm{H_{+2}C_{-1}O_{-1}}$.
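The signed notation for mass differences maps naturally onto a dictionary of signed atom counts. The following sketch is one possible representation, assumed here for illustration; the element masses are rounded literature values and the names `transformation_mass` and `reverse` are hypothetical.

```python
# Monoisotopic masses in u (rounded literature values).
MONOISOTOPIC_U = {"H": 1.007825032, "C": 12.0, "N": 14.003074005,
                  "O": 15.994914620, "P": 30.973761998}


def transformation_mass(atom_changes):
    """Signed mass difference of a transformation given as {element: signed count}."""
    return sum(count * MONOISOTOPIC_U[element]
               for element, count in atom_changes.items())


def reverse(atom_changes):
    """Swap all signs to describe the transfer in the opposite direction."""
    return {element: -count for element, count in atom_changes.items()}


water_to_co2 = {"H": -2, "C": +1, "O": +1}      # H(-2) C(+1) O(+1)
phosphorylation = {"H": +1, "O": +3, "P": +1}   # HO3P

print(f"H2O -> CO2:       {transformation_mass(water_to_co2):+.6f} u")
print(f"CO2 -> H2O:       {transformation_mass(reverse(water_to_co2)):+.6f} u")
print(f"phosphorylation:  {transformation_mass(phosphorylation):+.6f} u")
```

Because only the absolute value of a mass difference is usually of interest, a transformation and its reverse can be treated as the same entry when mass differences are counted.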
1.5 Introduction to Ab Initio Metabolic Network Reconstruction

The approach for metabolic network reconstruction which is evaluated in this thesis was devised previously [16]. The underlying idea is that for every metabolic transformation the substrate and the product should be present in the cell at detectable quantities. The principle is to calculate all pairwise differences of a set of metabolite masses and to infer the chemical transformations that took place from the mass difference of any two compounds (Figure 2). To build a network, the compounds are connected according to the chemical reactions in which they appear to participate. The masses are determined using FTMS, because high mass precision results in high mass difference precision, which is required for the exact connection of all compounds.

[Figure 2: Reactions from pyrimidine metabolism (KEGG map 00240). Compound masses: 324.035868, 404.002201, 483.968536, 789.993837, 482.984521, 403.018185, 323.051853, 466.989603, 387.023270, 307.056938. Mass differences on the reaction arrows: 79.966332 (six arrows), 15.994914 (two arrows), 0.984015, 306.025304.]

Figure 2: Shown are 10 compounds from the pyrimidine metabolism and the reactions between them from the KEGG reference pathway 00240. The exact mass of each compound is printed in italic font; the mass differences resulting from chemical reactions are printed next to the reaction arrows. Suppose all these masses are measured in a mass spectrometry experiment. Together with the knowledge that a phosphorylation reaction (transfer of an HO₃P group) introduces a mass difference of 79.966332 u, one can easily connect the compounds which have this mass difference and conclude that they have undergone phosphorylation.
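The reasoning in the caption of Figure 2 can be sketched in a few lines. The ten masses are read from Figure 2; their assignment as compound masses (rather than edge labels), the function name `connect` and the tolerance of 0.00001 u are assumptions made for this illustration.

```python
from itertools import combinations

# Compound masses in u as printed in Figure 2 (assignment assumed, see lead-in).
MASSES = [
    324.035868, 404.002201, 483.968536, 789.993837, 482.984521,
    403.018185, 323.051853, 466.989603, 387.023270, 307.056938,
]

PHOSPHATE = 79.966332  # mass difference of an HO3P transfer, from the caption


def connect(masses, known_difference, tolerance=1e-5):
    """Return all mass pairs whose difference matches `known_difference`."""
    edges = []
    for a, b in combinations(sorted(masses), 2):
        # b >= a because the masses are sorted, so b - a is non-negative.
        if abs((b - a) - known_difference) <= tolerance:
            edges.append((a, b))
    return edges


for light, heavy in connect(MASSES, PHOSPHATE):
    print(f"{light:.6f} -> {heavy:.6f}")
```

Under these assumptions the sketch connects exactly the pairs that differ by one phosphate group. The tolerance has to absorb rounding and measurement error; how to choose it for real data is a separate question.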
known  mass difference   occurrence  transformation
X        0.9840155930    3           H₋₁N₋₁O₊₁ (ammonia ligase)
X       15.994914640     3           O₊₁ (hydroxylation)
        16.978930233     3           H₋₁N₋₁O₊₂
        62.987402124     2
        63.971417717     2
        78.982316764     2
X       79.966332357     6           HO₃P (phosphate)
        80.950347950     2
        95.961246997     2
        96.945262590     2
       142.953734481     1
       143.937750074     1
       158.948649121     1
       159.932664714     3           H₂O₆P₂
       160.916680307     1
       175.927579354     1
       176.911594947     1
X      306.025303947     1           (nucleotidohydrolase)
       307.009319540     1
       323.004234180     1
       385.991636304     1
       386.975651897     1
       402.970566537     1
       465.957968661     1
       466.941984254     1
       482.936898894     1

Table 2: The pairwise mass differences between all 10 compounds from Figure 2 are shown. Checkmarks (X) mark mass differences that correspond to reactions which are known to exist between the compounds. Frequent mass differences are stressed by bold text. The most common mass difference actually represents the most common chemical reaction in this example, the phosphorylation. The mass differences corresponding to hydroxylation and to the rather complex ammonia ligase also occur three times. The nucleotidohydrolase, however, is observed only once and is not detected as a frequent mass difference. Instead an unknown mass difference (159.932664714), which corresponds to the transfer of 2 H, 6 O and 2 P, occurs three times, as does another unknown mass difference (16.978930233) corresponding to the transfer of H₋₁N₋₁O₊₂.

[Figure 3: histogram; x-axis: mass difference (0 to 500), y-axis: count (0 to 6)]

Figure 3: The data from Table 2 as histogram.

The study distinguishes two approaches to identify meaningful mass differences, i.e. mass differences corresponding to actual chemical transformations. The 1st approach is based solely on the measured masses and can be considered "ab initio". Mass differences which occur at a significantly higher rate are considered meaningful. The 2nd approach takes into account knowledge about common metabolic transformations and as such incorporates a priori knowledge.
Only mass differences which correspond to one of the common transformations published in [16] are considered meaningful.

Consider the example in Figure 2. If all ten masses are measured and the four occurring mass differences are known, all edges would be reconstructed, plus one false positive edge between the masses 323.051853 and 307.056938 (at the bottom). The difference between these two masses is again 15.99491464. If the 1st approach is applied to this example, the situation is a little different. Table 2 shows all pairwise mass differences of these 10 compounds; for clarity the same data are shown as a histogram in Figure 3. For this example a frequent mass difference is defined as one occurring 3 times or more. As can be seen, the most frequent mass difference indeed originates from the most frequent known reaction in the example, the phosphorylation (or dephosphorylation, respectively). But not all known reactions are recovered as frequent mass differences; instead two new frequent mass differences occur. One of them corresponds to a difference of 2 hydrogen, 6 oxygen and 2 phosphorus atoms; the other is explained by an even more complex relationship, i.e. the addition of 2 oxygen atoms and the subtraction of one hydrogen and one nitrogen atom.

This simple example already shows that not all frequently occurring mass differences are necessarily related to actual chemical transformations. They can, among other possibilities, originate from two or more common metabolic steps in succession (in this example the mass difference 159.932664714; two phosphorylation reactions), or from compounds which share a common scaffold but contain different side groups (e.g. mass difference 16.978930233). If the most frequent mass differences (Table 2) are used to reconstruct a metabolic network, this results in the graph depicted in Figure 4. The reconstructed network contains not 10 edges like the reference network, but 18.
Nine of these actually correspond to edges in the reference network; the remaining 9 are additionally introduced.

In [16] the method was also applied to a mass spectrum of the parasitic organism Trypanosoma brucei. From this spectrum a maximum of 399 identified masses was obtained. Using the two approaches described above, about 25,000 (1st approach) or 1438 (2nd approach) meaningful mass differences were calculated, respectively. As a parasite, Trypanosoma brucei has a very small set of metabolic enzymes. Most other organisms will therefore yield mass spectra with significantly more masses. The numbers for the reconstructed networks already suggest very densely connected graphs; an undirected graph with 399 vertices can contain at most 399 · 398 / 2 = 79,401 unique edges. In this case it would be completely connected and called a clique. The first approach reaches 39 % of this, the second approach 2 %. This is a lot compared to actual metabolic networks. The metabolic network of Escherichia coli derived by enzyme gene mapping [27] contains 628 metabolites and 788 connections (reactions). This amounts to only 0.4 % of all possible connections. Ouzounis et al. used a similar approach to reconstruct the E. coli metabolic network; their reconstruction comprises 791 compounds and 744 reactions (0.24 % of all possible connections) [12]. The metabolic network of another species with a very streamlined metabolome, the γ-proteobacterium Buchnera aphidicola, contains 240 compounds and 263 reactions [28]. This equals 0.9 % of all theoretically possible connections.

It is also well known that metabolic networks exhibit a certain topology [29, 30]. This is detailed in chapter 1.6.4. The authors of [16] claim that the generated networks conform overall to these previous findings and explain observed deviations by the limited number of measured metabolites.
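The percentages quoted for the literature networks follow from the maximal number of edges in an undirected graph without self loops and multiple edges. A minimal sketch of that arithmetic, with hypothetical function names `max_edges` and `density_percent`:

```python
def max_edges(n_vertices):
    """Maximal number of edges in a simple undirected graph with n_vertices vertices."""
    return n_vertices * (n_vertices - 1) // 2


def density_percent(n_vertices, n_edges):
    """Fraction of the possible edges that are present, in percent."""
    return 100.0 * n_edges / max_edges(n_vertices)


print(max_edges(399))  # upper bound for the T. brucei example

# Vertex and edge counts as quoted in the text above.
networks = [
    ("E. coli [27]", 628, 788),
    ("E. coli [12]", 791, 744),
    ("B. aphidicola [28]", 240, 263),
]
for name, vertices, edges in networks:
    print(f"{name}: {density_percent(vertices, edges):.2f} %")
```

All three reference networks stay below one percent of the possible connections, which is the basis for calling the reconstructed mass difference networks densely connected by comparison.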
1.6 Graph Theoretic Aspects of Biological Networks

This thesis deals mainly with the analysis of biological networks. This section gives an introduction to biological networks represented as graphs, their properties, ways to analyze them descriptively and some examples of studies on biological networks, especially metabolic networks.

[Figure 4: reconstructed network for the pyrimidine metabolism example (KEGG map 00240); edges are labelled with the mass differences 0.984015, 15.994914, 16.978930 (curved lines), 79.966332, 159.932664 and 306.025304]

Figure 4: Shown are the same 10 compounds from pyrimidine metabolism. All connections between them have been reconstructed from the most frequent pairwise mass differences. The solid lines depict correctly reconstructed connections (true positives), the dashed lines stand for connections which are predicted but not present in the reference pathway (false positives), and the gray arrow represents a connection which was not reconstructed but is present in the original pathway (false negative). False positive in this context does not mean that the reaction does not exist among known biochemical reactions; it was just not present in the reference network used. If less frequent mass differences are also used to reconstruct the network, more edges will be found, up to full saturation of the network.

1.6.1 Mathematical Definition of Graphs

The definitions below are taken from the book "Handbook of Graph Theory" [24]. The definitions given there are very comprehensive and some are irrelevant to this thesis. Definitions relevant to this thesis are reproduced; some are extended to fit the needs of this thesis. For any further information please refer to [24] or any other graph theoretic publication.

In mathematical terms a graph is a pair of sets G = (V, E), where V is a set of vertices and E is a set of edges {v1, v2}, vi ∈ V, each of which connects two elements of V.
An edge can also connect a vertex to itself; in this case it is referred to as a self loop. In an undirected graph the edges carry no further information; in a directed graph each edge has a direction. A directed edge is an edge e in which one of the endpoints is designated as tail, the other one as head. The edge is directed from the tail to the head. A vertex v is incident to an edge e if v is an endpoint of e. Two vertices are adjacent if they are incident to the same edge. Two adjacent vertices are also called neighbors.

The degree k of a vertex v in an undirected graph is the number of edges incident to v (note that a self loop is incident to the same vertex twice and therefore adds 2 to the degree). In a graph without self loops and multiple edges between vertices, the degree equals the number of neighbors. The average degree ⟨k⟩ of a graph is defined as the mean degree of all its vertices. The indegree of a vertex v in a directed graph is the number of edges which are directed to v. The outdegree is defined analogously for edges directed from v.

The degree density of a graph G is defined as the fraction of edges present in G, compared to the maximal number of edges which can theoretically exist in the graph, expressed in percent. Let |V| be the number of vertices and |E| the number of edges in G. The maximal number of edges, excluding multiple edges and self loops, is

$|E_{max}| = \frac{|V|(|V| - 1)}{2}.$

The degree density is

$100 \cdot \frac{|E|}{|E_{max}|}.$

A path in a graph G is a sequence of vertices such that every two consecutive vertices in the sequence are neighbors in G. The edges in the path are the edges connecting each two consecutive vertices in the sequence. Furthermore no vertex or edge may occur twice in the path. The only exception are the first and last vertex in the sequence, which may be the same. The path length of a path is the number of edges in the path.
If the first and last vertex in a path are the same, the path is called a cycle, provided it has a length of at least one. The shortest path between two vertices v1 and v2 is a path from v1 to v2 with length l such that there is no other path from v1 to v2 with length < l. There may exist more than one shortest path with the same length but different sequences of vertices.

A subgraph of a graph G = (V, E) is a graph G′ = (V′, E′) with any subset V′ ⊂ V and subset E′ ⊂ E such that all elements of E′ connect elements of V′. The induced subgraph of G with respect to a set of vertices Vi ⊂ V is the graph Gi = (Vi, Ei) with Ei ⊂ E, where Ei contains all edges from E whose endpoints both lie in Vi. In words, it is the subgraph of G containing all vertices in Vi and all edges connecting them, or the graph which results from removing all vertices not in Vi and keeping all edges with no "loose ends". A connected component is a set of vertices in which a path of finite length exists between each pair of vertices. A graph may consist of a single connected component; in this case it is called a connected graph.

1.6.2 Properties of Graphs

There exist several concepts and measures to describe graphs. Some of them are particularly important and recur in the analysis of complex networks: such networks can be characterized by their small world property, degree distribution, diameter and clustering coefficient [31]. The small world property describes the fact that even in large graphs containing many vertices, short paths often exist between any pair of vertices [31]. The degree distribution of a graph G is a function P(k), which gives the probability that a randomly selected vertex in G has degree k [31]. In a graph with a finite number of vertices, as observed for metabolic networks, it simply gives the number of vertices in G with degree k. The diameter of a graph G can be defined as the longest of all shortest paths between any two vertices in G.
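For an unweighted graph, the diameter as defined here can be computed with a breadth-first search from every vertex. The following is a minimal sketch; the adjacency-dictionary representation and the function names are assumptions for illustration (the thesis software itself uses the JUNG library), and vertex pairs without a connecting path are simply ignored.

```python
from collections import deque


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: shortest path length from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def diameter(adjacency):
    """Longest of all shortest paths; unreachable pairs are ignored (assumption)."""
    return max(
        max(shortest_path_lengths(adjacency, v).values()) for v in adjacency
    )


# A small undirected example graph: the chain a - b - c - d.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(diameter(graph))
```

Running one search per vertex is adequate for graphs with a few thousand vertices; the same distance tables can be reused for other path-based measures.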
Another concept is the average path length, which is the average length of all shortest paths between any two vertices in G [31].

The clustering coefficient Cv is primarily a measure for a single vertex v. If v has kv neighbors, there exist at most kv(kv − 1)/2 edges between these neighbors. The clustering coefficient of v is defined as

$C_v = \frac{2 E_v}{k_v (k_v - 1)},$

with Ev the number of edges that actually exist between the neighbors. It is therefore a measure for the connectedness of the neighborhood of v. Cv = 1 if the neighborhood of v is completely connected and Cv = 0 if there is no edge between the neighbors of v. The clustering coefficient of a graph G is defined as the average clustering coefficient of all vertices in G. The distribution of clustering coefficients C(k) is defined as the average clustering coefficient of all vertices with degree k.

1.6.3 Types of Graphs

Networks, especially biological networks, are usually modeled using the following three graph models: random, scale free and hierarchical graphs [32]. As recent studies have shown, this classification is able to explain biological data well. Because of this, these measures are applied in this work. What follows is a brief introduction to these three kinds of graphs.

Saul and Filkov [33] recently presented a new and interesting method to classify biological networks using exponential random graph (ERG) models. These models have been used before to classify social networks, which are usually smaller than biological networks. But advances in electronics and parallel computing begin to provide the necessary computational power to apply the fitting methods for ERG models to biological data. The advantage of ERG models is that parameters for a multitude of descriptive variables can be fit to known networks or, inverting this approach, networks can be easily simulated or constructed from a set of arbitrary descriptive parameters.
In their work they criticize the conventional classification as too restrictive with respect to the employed descriptive variables (degree distribution and clustering coefficients). It is, however, not known which descriptive parameters best describe biological and especially metabolic networks. For this reason, and because of the computational issues, these models are not employed here.

Random Graphs

This kind of graph was first analyzed in 1959 by Erdős and Rényi [31]. The vertices in the graph are connected in a random fashion, i.e. all pairs of vertices have the same probability of being connected. Because of this, all vertices tend to have the same degree, and no local structures like clusters (tightly connected areas in the graph) arise. This leads to a degree distribution which is shaped like a Poisson distribution and to clustering coefficients which are independent of the vertex degrees. These graphs exhibit the small world property, i.e. despite the large number of vertices, the average path length is relatively short.

[Figure 5: panels A, B and C, each showing an example graph together with plots of its degree distribution P(k) and its distribution of clustering coefficients C(k) against the degree k]

Figure 5: Different graph models and their properties. A: Random graph with typical degree distribution P(k) and distribution of clustering coefficients C(k). B: Scale free graph with a degree distribution P(k) following a power law but a distribution of clustering coefficients C(k) following a uniform distribution. The bold circles represent the hubs in this graph. C: Hierarchical graph. Both the degree distribution P(k) and the distribution of clustering coefficients C(k) follow a power law. There are also hubs present in these networks, but additionally the vertices which are not hubs and have a lower degree tend to be more densely connected within their neighborhood.

Scale Free Graphs

In these graphs there are different roles for vertices.
A few vertices are heavily connected to other vertices, but most of the vertices show a small degree. The high degree vertices are called hubs. The hubs play an important role in scale free graphs, because most of the shortest paths pass through them. A scale free graph is therefore very vulnerable to targeted attacks on these hubs. Among the low-degree vertices, however, there are no different roles. This leads to a degree distribution P(k) which follows the relationship P(k) = a · k^γ, commonly known as a power law, with constant a and scaling exponent γ. The distribution of clustering coefficients is still uniform.

Hierarchical Graphs

The last kind of graph discussed here also contains vertices with different roles. Again hubs are present, as in scale free graphs. The important difference to scale free graphs is that the low-degree vertices tend to form clusters. That means that there are local structures of a few vertices which are more densely connected among each other than to other vertices. These clusters are linked to each other through the hub vertices. Due to the hubs, the degree distribution again follows a power law; and because of the clusters, the distribution of clustering coefficients also exhibits power law behavior.

1.6.4 Properties of Biological Networks

It was shown in several studies that biological networks¹ frequently exhibit a scale free and hierarchical structure and are not randomly connected [11, 29, 34, 35]. One important implication is that null models constructed for the statistical analysis of biological networks must preserve these structures [14]. Arita et al. argue that the small world property does not hold true for metabolic networks [4]. Therefore it might be appropriate to consider different models for this kind of network. Despite this, in this thesis the hierarchical model for metabolic networks is employed, because their major argument is a structural consideration.
The networks reconstructed in this thesis do not incorporate these structural implications, which will become clear in the methods section (chapter 2.1.2).

¹ comprising metabolic, protein interaction, regulatory and other networks

1.7 Scope of this Thesis

A previous study [16] outlined the general practicability of the approach described in chapter 1.5 and gave an example of how to use the method on a small organism. Since metabolomics aims at the understanding of complete metabolomes, the aim of this thesis is to evaluate this approach on real world metabolomics data, i.e. high precision mass spectra of two larger organisms. Data from Saccharomyces cerevisiae and Drosophila melanogaster are available at the Helmholtz-Zentrum München, so these datasets are investigated.

Before work on the datasets is performed, the theoretical capabilities of the method are assessed. To do this, known biochemical pathways from the KEGG database are examined with respect to how well they can theoretically be reconstructed using the proposed method.

The identification of frequent mass differences plays an important role, not only for reconstructing the networks but also in other fields of metabolomics, so the frequent mass differences are investigated in more detail. Currently a frequent mass difference is hypothesized to represent a chemical reaction, but no profound knowledge about its actual meaning exists. This aspect is elucidated by explicitly looking for the chemical reaction each mass difference represents.

Finally the reconstructed mass difference networks are analyzed to assess whether they show characteristics of "real" metabolic networks, and what kind of information can be extracted from them. To date no deeper analysis of the structure of mass difference networks has been performed. The study in [16] proposes that they are metabolic networks, which cannot be validated in this work.
Instead their usability in research is showcased by examining the properties of the vertices and by deducing information about the underlying compounds.

2 Materials and Methods

2.1 Description of the Dataset

All data, programs and scripts used here are available at the cited resources or as supplementary materials to this thesis.

2.1.1 FTMS Mass Spectrometry Data

The handling of raw data from the spectrometer is done by special software and is outside the scope of this thesis. Further information about this issue can be found in [36]. After the mass over charge values have been calculated, each one is present together with a raw peak height. An example of this can be seen in Figure 6. The subsequent step is to select peaks with a desired signal-to-noise ratio, usually about 3:1. The work in this thesis is performed on data right after this step, i.e. a list of mass to charge (m/z) values with a peak height above a certain signal-to-noise ratio. The charge (z) is from now on given in multiples of the elementary charge, which is the charge carried by a single proton.

Figure 6: Example of an FTMS mass spectrum, taken from [37]. The x-axis shows the mass to charge (m/z) values, the peak height is the actual signal. The m/z values of some high peaks are noted next to the peaks. The data used in this thesis are m/z values with a sufficient signal-to-noise ratio selected by the experimenter.

The mass data for two organisms, Saccharomyces cerevisiae and Drosophila melanogaster, are used to reconstruct metabolic networks. The data were taken from the MassTRIX server [26] and created by the group of Schmitt-Kopplin at the Helmholtz-Zentrum München using a 12 Tesla mass spectrometer, ideally providing a relative accuracy of 0.2 ppm. The data are available on the MassTRIX server under the job IDs "EXAMPLE Yeast" (S. cerevisiae) and "08093013010514720" (D. melanogaster). Both are measured in negative mode. For D.
melanogaster, data measured in positive ionization mode became available at a later stage of this work and are not fully incorporated into the analysis; for S. cerevisiae no data in positive mode are available. This implies that there are definitely compounds which will not be detected. However, the scope of detected metabolites can always be improved by combining further extraction protocols and ionization techniques, so that the full picture of all metabolites in the sample will most likely never be achieved. Histograms of the data used are shown in Figure 7.

The S. cerevisiae cells were grown in a medium of nutrients with optimal proportions for growing most S. cerevisiae strains. This leads to exponential growth of the organism. The experimental background of the D. melanogaster dataset cannot be disclosed at the time of writing.

[Figure 7: two histograms; x-axis: m/z (0 to 2000), y-axis: count]

Figure 7: Histograms of the masses in S. cerevisiae (top) and D. melanogaster (bottom). Most of the measured masses lie at the lower end of the detection range. For S. cerevisiae 3101 masses are available, for D. melanogaster 1965 masses. The mass spectrometer was optimized to detect masses smaller than 600 u.

Even though the D. melanogaster dataset became available only later during the course of this thesis, it is of higher quality with respect to extraction protocols and mass spectrometry². Furthermore, more metabolites could be identified in the data using MassTRIX.

² Personal communication: Agnes Fekete, Helmholtz-Zentrum München

The raw mass is corrected by adding a proton mass (1.0072764668813 u), because the sample was measured in negative mode and the metabolites are therefore negatively ionized.

2.1.2 The KEGG Database

KEGG is a database of biological systems that integrates genomic, chemical and systemic functional information [8]. It is used in Release 48.0 from October 1, 2008.
Information was downloaded as flat files via FTP to speed up the data retrieval process. Data from KEGG compound were used to identify compounds by mass as described in chapter 1.4. Because the exact masses of compounds are only given to a precision of 10⁻⁴ u, i.e. 4 decimal places, precise masses are calculated from their chemical formulas using a Perl script written by the author. Information from KEGG pathway and KEGG reaction was used to create reference metabolic pathways and networks against which the reconstructed networks are finally validated. To make the validation not overly stringent, the reference pathways were used and not the organism specific pathways. This accounts for the fact that many metabolites and reactions in a specific organism are unknown [2]. KEGG reaction pairs (or reactant pairs) are used to reconstruct connections between compounds. A reaction pair lists two compounds, one of which is transformed into the other by a single reaction in KEGG.

Creation of Networks from the KEGG Database

To create a representation of the pathway maps stored in the KEGG database, individual maps from KEGG pathway are considered. For each map the participating compounds and reaction pairs are taken from the database and assembled to form a graph with compounds as vertices and reaction pairs as edges.

To build not single maps but a complete network, all reaction pairs in the KEGG database are considered. All compounds occurring in the reaction pairs are added to the graph as vertices and connected by edges which represent the reaction pairs. This network is called the reference network. This reconstruction, however, is a crude one, which does not take into account structural relationships between compounds [4]. Because of this, a path in the reference network does not necessarily represent a metabolic path. E.g. in the reference network water (H₂O) is connected to 815 other compounds just because it plays a role in these reactions.
It is, however, obvious that not all these 815 compounds are interconvertible by being transformed into water and back into the other compound. The network has to be seen as a relationship network rather than a reaction network.

2.1.3 Exact Elemental Masses

The exact elemental masses used to calculate exact molecular masses are taken from excel elements³. The results of the exact mass calculations for KEGG compounds are compared against the original masses from the KEGG database to spot inconsistencies. No such inconsistencies were found.

2.1.4 Metabolic Transformations

The list of common metabolic transformations is taken from reference [16]. The most recent version used by its authors in their research was obtained through personal communication⁴ and is available in the supplementary materials to this thesis. It comprises 109 frequently occurring metabolic transformations and the mass differences they induce.

2.2 Computational Analysis

The computational analysis was performed using self-written software in Java⁵ and R⁶ [38], aided by Perl⁷ scripts and shell scripts written by the author. Software was run on a Linux⁸ machine with an Intel Pentium processor⁹ and 2 GB of memory. Special use was made of the Java library JUNG¹⁰ to represent and handle graphs.

A metabolic network in this thesis is defined as an undirected graph with a set of vertices representing metabolites and a set of edges representing chemical transformations between these metabolites. Even though the chemical reactions in metabolism can be directed, it is not within the scope of this work to investigate the direction of reactions in the reconstructed networks.

Although software design was not a major aspect of this thesis, the Java software was developed to fit into a three-tier application architecture. Since no data tier and presentation tier were required, only the logic tier was developed, in a way that it interfaces easily with these other tiers. So the developed software can be easily reused if desired.
A small graphical user interface was developed, mainly for visualizing the reconstructed networks. This graphical user interface can be extended in the future to build a presentation tier.

³ M. Selmke and A. Selmke, http://www.chemlin.de/download/excelelements.htm
⁴ [email protected], June 9th 2008
⁵ Java(TM) SE Runtime Environment (build 1.6.0_02-b05)
⁶ R version 2.5.1 (2007-06-27)
⁷ Perl v5.8.8; Copyright 1987-2006, Larry Wall
⁸ Linux version 2.6.22.9-91.fc7, Fedora Release 7
⁹ Intel Pentium 4 CPU 3.00 GHz
¹⁰ JUNG 1.7.6, "Java Universal Network/Graph Framework"

2.3 Network Reconstruction

The workflow of network reconstruction using the method based on frequent mass differences and the method based on the list of common metabolic transformations is depicted in Figure 8. Details of the individual steps are given in the sections below.

2.3.1 Filtering of FTMS Data

To reconstruct reaction pathways, it is necessary to work with data that are as clean as possible. Ideally the data would only comprise the monoisotopic masses of all measured compounds. To get as close to this as possible, the data are filtered for the most important known superfluous masses. These are firstly the masses arising from ¹³C isotope incorporation [18] and secondly masses derived from m/z values of doubly ionized molecules, i.e. molecules which carry not a single but a double charge.

Identification of multiply charged ions

Mass to charge peaks from multiply charged ions can only be detected reliably with the aid of isotope peaks. The monoisotopic peak and the peak of an isotope with one more neutron will not differ by the neutron mass, but by the neutron mass divided by the charge [20]. To check whether the singly and the doubly ionized form of a compound are measured at the same time, the following procedure can be applied: If for any two m/z values a and b the relation a = 2 · b (or vice versa) is found, i.e.
one molecule has an m/z ratio twice as high as the other, the "lighter" molecule is considered doubly ionized and the respective mass is removed from the data. This procedure poses a problem if two compounds have this relationship due to their atomic composition, as for example fructose 1,6-bisphosphate and glyceraldehyde 3-phosphate.

Since double ionization plays no major role in the observed samples and m/z = m/1 most of the time, the term mass is often used where the term m/z value would actually be more accurate. This habit is also common in the literature.

Identification of isotopic peaks

¹³C isotope masses are identified by mass difference. If two masses have a difference of 1.0034 u ± 0.1 ppm, the heavier mass is considered non-monoisotopic and removed from the dataset. This is a crude method, which can be extended by incorporating the peak ratios of isotope peaks for a more exact determination. For the observed data and the aim of the following work it is sufficient to employ the described method.

[Figure 8: two workflows. Using frequent mass differences: Masses from FTMS → Preprocessing → Filtered Masses → Pairwise comparison → Mass Differences → Clustering → Clustered Mass Differences → Select heaviest clusters → Frequent Mass Differences → Connect → Network. Using known mass differences: Masses from FTMS → Preprocessing → Filtered Masses → Pairwise comparison → Mass Differences, combined with Known Mass Differences → Connect → Network.]

Figure 8: The workflow from FTMS data to reconstructed metabolic networks is shown. Square boxes represent data, rounded boxes represent processes. On the left the steps necessary for network reconstruction using internally determined frequent mass differences are depicted; on the right the steps for reconstruction using a set of a priori known mass differences are shown. Preprocessing summarizes all filtering steps. Pairwise comparison calculates all pairwise mass differences. Clustering groups mass differences together such that
frequent mass differences can be identified in the next step, Select heaviest clusters. Finally, Connect builds the networks from the generated data. For the method depicted on the right, the pairwise comparison has to be performed to be able to connect the masses to a network, but no clustering needs to be done.

It is important to bear in mind that the non-monoisotopic masses are usually an important and regularly employed tool for identifying a molecule, for example in protein mass spectrometry [39]. In building ab initio reaction networks, however, these masses would only introduce noise and have to be removed. If the identification of compounds is of importance, the information gained from non-monoisotopic masses has to be used. In that case it would in fact be helpful to also use peaks with a lower signal-to-noise ratio, because the isotopic peaks might fall below the chosen signal-to-noise cutoff.

2.3.2 Calculation of Mass Differences

The pairwise mass differences are calculated using a simple algorithm with quadratic runtime complexity. Each mass is compared against each other mass once, and the differences and the constituents of each difference are stored in a hash map, so that for any difference the constituents can be obtained easily.

In a correctly calibrated mass spectrometer there is no common shift of the measurements in one direction, i.e. each mass measurement can lie above or below the exact mass independently. Therefore the precision of mass differences is worse than the precision of individual measurements. This becomes clear if the mass measurements are seen as single independent samples from normal distributions with mean µ = exact mass and standard deviation σ proportional to the accuracy. If the difference of two measurements a and b is calculated, this is as if the difference were drawn from a normal distribution with mean $\mu_{diff} = \mu_a - \mu_b$ and standard deviation $\sigma_{diff} = \sqrt{\sigma_a^2 + \sigma_b^2}$, which is larger than either individual standard deviation; in the worst case the absolute errors of the two measurements simply add up. For example,
the exact masses 180 u and 200 u, measured at a precision of 1 % (footnote 11), might yield the measurements 181.80 and 198.02, which have a difference of 16.22. The difference of the exact masses is 20, so this is a deviation of 18.9 %.

To see how far this influences precision in the real datasets under the actual circumstances, an FTMS relative accuracy of 0.2 ppm is assumed. Furthermore a mass at the upper end of the detection range (2000 u) is considered. The resulting absolute deviation is calculated as:

2000 u ∗ 0.2 ppm = 0.0004 u

So the absolute deviation of a mass difference of measured masses from the mass difference of the exact masses is, in the extreme case, at most 2 ∗ 0.0004 u = 0.0008 u. Of course this is an extreme value. But there will be some variation between the mass differences, such that two mass differences which are the same when considering exact masses will differ slightly on measured masses. This discrepancy has to be considered when frequent mass differences are determined from measured masses. The study in [16] does not present these calculations, but devises the following rule: Any 5 mass differences which are closer to each other than 0.0001 u are considered frequent mass differences. This value is acceptable if not the extreme but an average absolute deviation is calculated, on the basis that the median mass of typical metabolites is 306 u (compare chapter 1.4). Analogous to the above calculation, this leads to an absolute deviation of 2 ∗ 306 u ∗ 0.2 ppm ≈ 0.000122 u, so the previously proposed value is adopted in this thesis.

11 Percent is used instead of parts per million in this example for clarity.

2.3.3 Clustering

The above considerations on the calculation of mass differences extend to the clustering of mass differences. To identify frequent mass differences, it is necessary to treat a group of mass differences that are almost the same as one cluster.
The cluster arises from the same mass difference and is spread out around the exact mass difference due to measurement errors (precision). Following the experience of others [16], the rule that a difference of 0.0001 u is significant in separating these clusters is incorporated. To achieve this computationally, hierarchical single linkage clustering seems appropriate. The two closest elements (mass differences) are clustered together in a recursive hierarchical fashion until all mass differences are clustered. Within the hierarchy, those subclusters have to be identified in which the within distance is below 0.0001 u and the between distance exceeds this value. Because the data under consideration are one-dimensional, this is achieved more easily by the following algorithm:

1. Sort all mass differences
2. Start at the first mass difference
3. Open a new cluster and put the current mass difference into it
4. Scan through the sorted mass differences from the next position
5. If the difference between this and the previous mass difference exceeds 0.0001
   (a) return the current cluster
   (b) continue at 3
6. If not, put the current mass difference into the current cluster and continue at 4

For a better understanding refer to Figure 9. The procedure creates a structure similar to a histogram with dynamic bin widths; each bin contains the elements of one cluster.

Figure 9: An example for the clustering algorithm. Each vertical bar represents a single mass measurement. d is a chosen distance, in this work 0.0001 u. Emerging clusters are marked by curly brackets. A break between two clusters is inserted as soon as the distance between two consecutive masses exceeds d. Single linkage clustering would cluster all elements under the curly brackets first, so the results of the employed method are equivalent to single linkage clustering.
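The sort-and-scan procedure listed above can be sketched in a few lines of Python. This is a minimal illustration and not the thesis implementation; the function name `cluster_sorted_gaps` and the example values are assumptions made for this sketch.

```python
def cluster_sorted_gaps(values, max_gap=0.0001):
    """Group one-dimensional values into clusters.

    A new cluster is opened whenever the gap between two consecutive
    sorted values exceeds max_gap (0.0001 u in the text above).
    """
    clusters = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= max_gap:
            # Close enough to the previous value: stay in the current cluster.
            clusters[-1].append(value)
        else:
            # First value, or gap too large: open a new cluster.
            clusters.append([value])
    return clusters


# Made-up mass differences in u, for illustration only.
diffs = [15.99490, 2.01565, 18.01060, 2.01578, 15.99493, 2.01570]
clusters = cluster_sorted_gaps(diffs)
for cluster in clusters:
    print(len(cluster), cluster)
```

In this example the first cluster is wider than 0.0001 u although every consecutive gap is below the threshold, which is the chaining behaviour of single linkage clustering.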
Of course problems known from single linkage clustering apply to this method as well, especially “cluster elongation”, where the cluster extends in one direction due to elements whose distance to one of the cluster's elements is just below the threshold [40, p. 197]. This has to be considered in the next step.

Following the formation of clusters, the central element closest to the exact mass difference has to be selected. The median of the mass differences is appropriate in this case, because it firstly represents one actual member of the cluster and secondly is less influenced by outliers, as they might occur due to the above-mentioned chaining.

The selection of “heavy” clusters is straightforward. The clusters containing the most elements are considered “heavy” and their medians are the identified frequent mass differences. A threshold has to be defined which determines how many clusters are selected. Breitling et al. [16] select all clusters with 5 or more elements. For bigger datasets this becomes infeasible and a dynamic approach has to be used, as will be shown in the Results section.

2.3.4 Network Creation

To work with graphs in Java, the JUNG package is used. A network is represented as a SparseGraph. For each mass a vertex (SparseVertex) is created and added to the SparseGraph. The mapping from mass to vertex is kept in a hash map, and each vertex is assigned a UserDatum containing the mass it represents. So queries from mass to vertex and from vertex to mass are always possible. For each mass difference, whether determined intrinsically or taken from the file of known transformations, vertices exhibiting this mass difference are connected. To account for the error due to measurement precision, not only exact matches but all pairs with a mass difference within the desired precision are connected by an edge (UndirectedSparseEdge). The precision used in both datasets was 0.75 ppm.
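The selection of heavy clusters and the connection step can be illustrated with plain Python data structures in place of the JUNG classes. This is a sketch under stated assumptions: the function names are hypothetical, the cluster values are made up, and the ppm tolerance is taken relative to the heavier mass of a pair, which the text does not specify. `median_low` is used so that the representative is always an actual cluster member.

```python
from statistics import median_low


def frequent_differences(clusters, min_size):
    """Return the (low) median of every cluster with at least min_size members."""
    return [median_low(c) for c in clusters if len(c) >= min_size]


def connect(masses, differences, ppm=0.75):
    """Return edges (lighter mass, heavier mass, matched difference).

    Assumption: the tolerance is ppm of the heavier mass of the pair.
    """
    ordered = sorted(masses)
    edges = []
    for i, light in enumerate(ordered):
        for heavy in ordered[i + 1:]:
            observed = heavy - light
            tolerance = heavy * ppm * 1e-6
            for difference in differences:
                if abs(observed - difference) <= tolerance:
                    edges.append((light, heavy, difference))
    return edges


# Made-up clusters; only the first one is "heavy" for min_size=3.
clusters = [[79.96625, 79.96630, 79.96634], [2.01560]]
frequent = frequent_differences(clusters, min_size=3)

# Masses as printed in Figure 12 (hexose, hexose phosphate, hexose bisphosphate).
masses = [180.0634, 260.0297, 339.9960]
edges = connect(masses, frequent)
print(frequent)
print(edges)
```

The nested loops mirror the quadratic pairwise comparison; a lookup from mass difference to mass pairs, as used in the thesis, would avoid recomputing the differences for every query.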
The information which pairs of vertices are to be connected by an edge can be obtained easily by querying the map from mass difference to mass pairs and subsequently querying the map from mass to vertex. This allows for an efficient and quick network construction. Each inserted edge is annotated with a UserDatum containing the mass difference it represents. The final graph can be stored in various file formats if desired.

2.4 Network Analysis

For all networks, i.e. the reference network and the reconstructed networks, descriptive parameters (degree distribution, distribution of clustering coefficients and degree density) are calculated and plotted. Compounds are identified using the KEGG and LIPIDMAPS (footnote 12) databases. In the versions used, KEGG contains 11615 and LIPIDMAPS 10199 compounds; the two sets show some overlap.

The reconstruction of mass difference networks depends heavily on the number of mass differences which are incorporated. The number of resulting edges can range from several hundred to complete saturation of the network if all mass differences are used. The most abundant mass differences (the ones that are observed at a high frequency) are hypothesized to represent chemical reactions [16]. The abundance cutoff, i.e. the abundance above which a mass difference is considered frequent, is an important parameter. The authors of [16] use an abundance cutoff of 5 without giving any rationale for this. Their datasets are substantially smaller than the ones in this thesis, hence choosing another (higher) abundance cutoff might be necessary. To determine an appropriate abundance cutoff, several networks are reconstructed for S. cerevisiae and their characteristics are compared.

After this general assessment, the reconstructed D. melanogaster network is investigated in more detail. The degree of a vertex is an important descriptor, e.g. hub vertices play a major role in scale-free networks. Therefore the hub vertices are identified.
Furthermore, the correlation between a vertex's degree and its mass, and between its degree and its tendency to represent isomers, is calculated. Pearson's correlation coefficient is calculated for both degree versus average mass and degree versus average number of isomers. To assess the significance of the two correlations obtained, a stochastic null model is constructed. This null model is based on the assumption that vertices are not determined by their degree, but by chance. For each degree that is present, not the vertices with this degree but a random sample of the same size is drawn. From this sample the average mass and the average number of isomers are calculated, respectively. This is repeated Ω = 100000 times. For each of these 100000 random samples the correlation coefficient is determined. The p-value for the initial correlation is estimated by the fraction of simulated correlations whose absolute values are greater than or equal to the absolute value of the initial correlation.

12 http://www.lipidmaps.org/

In formal terms this becomes: Let

$n$ = the number of different degrees observed,
$d_{1..n}$ = degrees,
$a_{1..n}$ = average masses,
$b_{1..n}$ = observations, such that $b_i$ = number of vertices with degree $d_i$.

The initial correlation $c_{init}$ is the correlation between $d$ and $a$, so $c_{init} = \mathrm{cor}(d, a)$. $r_{j,1..n}$ is sampled such that

$r_{j,i} = \mathrm{mean}(m_1, m_2, \ldots, m_{b_i})$,

where $M$ is the set of all masses and all $m_k \in M$ are chosen randomly from $M$. The index $j \in [1 \ldots \Omega]$ runs over the samples, so Ω determines the size of the null model. For all $j \in [1 \ldots \Omega]$ the correlation $c_j = \mathrm{cor}(d, r_j)$ is determined. Now $e$ is the number of $c_j$ such that $|c_j| \ge |c_{init}|$. The estimate of the p-value for $c_{init}$ is $e/\Omega$.

Calculations for the correlation between degree and average number of isomers are analogous. Sampling is then not done from the masses, but from the numbers of annotated isomers. Because only about 10 % of the masses can be annotated, this dataset is smaller; still Ω = 100000 is chosen.
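The null model above translates fairly directly into code. The following Python sketch is illustrative only: the function names, the toy data and the much smaller Ω are assumptions, and sampling is done without replacement, which the text does not specify.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def null_model_p_value(d, a, b, population, omega=1000, seed=0):
    """Estimate the p-value e / omega for c_init = cor(d, a).

    d: observed degrees, a: average value per degree,
    b: number of vertices per degree, population: all masses
    (or all isomer counts) to sample from.
    """
    rng = random.Random(seed)
    c_init = pearson(d, a)
    e = 0
    for _ in range(omega):
        # One random sample of size b_i per degree, reduced to its mean.
        r_j = [mean(rng.sample(population, size)) for size in b]
        if abs(pearson(d, r_j)) >= abs(c_init):
            e += 1
    return c_init, e / omega


# Toy data, made up for illustration.
degrees = [1, 2, 3, 4, 5]
average_masses = [500.0, 450.0, 400.0, 350.0, 300.0]
vertices_per_degree = [10, 6, 4, 2, 1]
all_masses = [150.0 + 7.0 * i for i in range(50)]

c_init, p_value = null_model_p_value(
    degrees, average_masses, vertices_per_degree, all_masses
)
print(c_init, p_value)
```

With Ω samples the smallest non-zero estimate is 1/Ω, so the resolution of the estimated p-value is bounded by the size of the null model.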
2.5 Identification of Mass Differences

The frequent mass differences observed in the networks are expected to have a biological meaning. It is therefore of interest to identify the metabolic transformation underlying these mass differences. To accomplish this task, two approaches analogous to the identification of compounds by their mass are employed (compare chapter 1.4).

Firstly one can check whether the observed mass difference is caused by a known transformation. To do this, the mass differences are compared both to the list of common metabolic transformations and to mass differences calculated from KEGG reaction pairs. The mass differences from KEGG reaction pairs are determined by calculating the mass difference of the two compounds present in the same reaction pair. Mass differences are again compared to these values with an error tolerance of 0.75 ppm.

It is possible that a frequent mass difference does not arise from one reaction, but from a whole series of reactions or from other relationships. This was already discussed in the introduction (chapter 1.5) and demonstrated in Figure 4. Additionally, multiple reaction steps might not only lead to the subtraction or addition of functional groups, but to subtracting and adding atoms at the same time. So the second approach to identifying these mass differences follows the second way of identifying compounds by their mass: The possible elemental combinations for the mass difference are calculated, this time also allowing for negative values. E.g. the mass difference 0.9840155930 u is explained by the combination H−1 N−1 O+1. These combinations are obtained through integer linear programming. A linear programming problem is a problem of the form

Maximize $c^T x$
Subject to $Ax \le b$

where $x$ is the vector of variables to be determined (one scalar for each element), $c$ is a vector of known coefficients (the masses of the elements), and the matrix $A$ and the vector $b$ form the constraint inequalities $Ax \le b$.
$c^T x$ is called the objective function, which is maximized (or minimized) such that the constraints are not violated. Integer linear programming has one additional constraint: all $x$ have to be integers. To apply integer linear programming to the above problem, the objective function has to be maximized subject to the only constraint that it does not exceed the mass under investigation. So $A = c^T$ and $b = mass$. With further constraint inequalities the maximal number of atoms of each element can be limited, which shrinks the search space. The set of elements which are considered has to be as limited as possible, too. Here only H, C, N, O, P and S are used. The problem can now be stated as follows.

Maximize: $x_1 M_H + x_2 M_C + x_3 M_N + x_4 M_O + x_5 M_P + x_6 M_S$
Subject to: $x_1 M_H + x_2 M_C + x_3 M_N + x_4 M_O + x_5 M_P + x_6 M_S \le mass$

where the $M_{element}$ are the elementary masses. It is possible that a linear combination within the bounds is slightly greater than mass, but is a closer match than the solution obtained by the linear program above. To overcome this problem, the inverted linear program has to be formulated, i.e. minimize the objective function with the constraint not to fall below mass. The outcomes of both linear programs have to be compared, and the better match is the final result.

Integer linear programming is in the NP-hard class of computational problems. For solving integer linear programming problems, R [38] in version 2.8.1 is used together with the Rglpk package (footnote 13) [41], which is an interface to the GNU Linear Programming Kit (footnote 14).

This method is a purely mathematical approach, which relies on very accurate masses. It does not incorporate any heuristic to filter improbable combinations. E.g. the mass difference of 16.978930233 u should be explained by the transfer of H−1 N−1 O+2. Within an error tolerance of only 0.007 ppm it is also possible to explain this mass difference by H4 C−15 N5 O4 P5 S−3, which looks quite implausible in metabolism.
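To make this ambiguity concrete, candidate formulas can also be listed by brute force and ranked by the number of transferred atoms. The Python sketch below is a stand-in for illustration and not the integer linear program solved with Rglpk. It assumes a much smaller search space (only H, C, N and O, each between −8 and +8), monoisotopic masses taken from standard tables, and a hypothetical function name.

```python
from itertools import product

# Monoisotopic masses in u (assumed standard values, not the table used in the thesis).
MASS = {"H": 1.00782503207, "C": 12.0, "N": 14.0030740048, "O": 15.99491461956}


def candidate_formulas(target, bound=8, ppm=0.5):
    """List all element combinations within ppm of target.

    Returns tuples (number of atoms, |deviation in ppm|, formula),
    sorted by the number of transferred atoms, then by deviation.
    """
    elements = list(MASS)
    hits = []
    for counts in product(range(-bound, bound + 1), repeat=len(elements)):
        mass = sum(n * MASS[e] for n, e in zip(counts, elements))
        deviation = (target - mass) / target * 1e6
        if abs(deviation) <= ppm:
            atoms = sum(abs(n) for n in counts)
            hits.append((atoms, abs(deviation), dict(zip(elements, counts))))
    hits.sort(key=lambda hit: (hit[0], hit[1]))
    return hits


hits = candidate_formulas(16.978930233)
for atoms, deviation, formula in hits:
    print(atoms, round(deviation, 3), formula)
best = hits[0]
```

Sorting by atom count puts the chemically sparse explanation first; alternatives that match the mass almost as well only appear further down the list because they move many more atoms.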
Even H5 C1 N−8 O7, which at least looks somewhat sparser, deviates by only 0.3 ppm from the exact mass difference of 16.978930233 u. A full list of mathematically possible formulas within a 0.5 ppm range for this mass is shown in Table 3. To conclude these thoughts, the results from integer linear programming for the identification of mass differences have to be interpreted carefully, because a measured mass difference is very unlikely to be close enough to the exact mass difference to single out one formula.

An improved method for determining the elemental composition of a mass difference would be to enumerate all possible combinations under certain maximal and minimal bounds for each element (footnote 15). Subsequently the combination is chosen which firstly is within an accepted error tolerance (e.g. 1 ppm) and secondly, in accordance with the well-known principle of Occam's razor, explains the mass difference with the smallest number of subtracted and added atoms. Since this is not the focus of this thesis, no efficient algorithm for this task is presented. However, in the results section this method is showcased to identify frequent mass differences which cannot be identified by other means.

A method to identify mass differences which are caused by two or more reactions through a database lookup depends on the identification of compounds. Because less than 10 % of the measured compounds could be identified, conclusions derived from this method can only be weak. However, it is a useful feature of the developed software, which can aid the manual analysis of metabolomics data. For each edge in the reconstructed network the two incident vertices a and b are determined. All possible compounds Ca and Cb for the masses of a and b are retrieved. Now for all pairwise combinations of ca and cb, with ca ∈ Ca and cb ∈ Cb, the shortest path in the reference network is calculated. The overall shortest path is the difference between the two masses in terms of known reactions.
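The lookup of the overall shortest path can be sketched with a breadth-first search over an adjacency map. Everything in this Python example is illustrative: the function names, the compound identifiers and the toy reference network are hypothetical, and the reference network is treated as unweighted and undirected.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the vertex list of a shortest path or None."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour in previous:
                continue
            previous[neighbour] = node
            if neighbour == goal:
                # Walk back from the goal to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


def overall_shortest_path(graph, candidates_a, candidates_b):
    """Shortest path over all pairs (ca, cb) of candidate compounds."""
    paths = [shortest_path(graph, ca, cb) for ca in candidates_a for cb in candidates_b]
    paths = [p for p in paths if p is not None]
    return min(paths, key=len) if paths else None


# Hypothetical reference network: A1 - X - Y - B1 and A2 - B2.
reference = {
    "A1": ["X"], "X": ["A1", "Y"], "Y": ["X", "B1"], "B1": ["Y"],
    "A2": ["B2"], "B2": ["A2"],
}
# A1 and A2 share the mass of vertex a, B1 and B2 the mass of vertex b.
path = overall_shortest_path(reference, ["A1", "A2"], ["B1", "B2"])
print(path, len(path) - 1)
```

The number of reaction pairs on the path is the number of vertices minus one, which corresponds to the edge labels described for Figure 10.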
Identifying the reaction pairs in the overall shortest path, one can determine the actual reactions. An example of this is given in Figure 10.

13 Version 0.2-8
14 Version 4.36; http://www.gnu.org/software/glpk/
15 The bounds used in this thesis are H ±40, C ±30, N ±30, O ±20, P ±10, S ±6. The numbers denote the minimal or maximal number of the element in the formula.

Mass [u]        H    C    N    O    P    S    Σ    ppm
16.978921871  -17  -23   18   13  -10    5   86   0.49
16.978922753   29    6  -16   -9    4    5   69   0.44
16.978922878   34   -9  -10   -7    9    2   71   0.43
16.978923524   12  -14    8  -10    3    4   51   0.40
16.978923649   17  -29   14   -8    8    1   77   0.39
16.97892411    10   19  -18   -2    1    1   51   0.36
16.97892417   -10  -19   26  -13   -3    6   77   0.36
16.978924235   15    4  -12    0    6   -2   39   0.35
16.978924631  -17   29   -6   -7  -10    6   75   0.33
16.978924756  -12   14    0   -5   -5    3   39   0.32
16.978924881   -7   -1    6   -3    0    0   17   0.32
16.978925006   -2  -16   12   -1    5   -3   39   0.31
16.978925527  -29   -6   24   -6   -6    2   73   0.28
16.978925592   -4   17  -14    7    3   -6   51   0.27
16.978925652  -24  -21   30   -4   -1   -1   81   0.27
16.978926113  -31   27   -2    2   -8   -1   71   0.24
16.978926238  -26   12    4    4   -3   -4   53   0.24
16.978926452    6   -2  -13   19   -9    6   55   0.22
16.978927223  -11  -22   11   18  -10    5   77   0.18
16.978927891    3   21   -6  -19   10   -5   64   0.14
16.978928105   35    7  -23   -4    4    5   78   0.13
16.97892823    40   -8  -17   -2    9    2   78   0.12
16.978928662  -14    1   18  -20    9   -6   68   0.09
16.978928876   18  -13    1   -5    3    4   44   0.08
16.978929001   23  -28    7   -3    8    1   70   0.07
16.978929462   16   20  -25    3    1    1   66   0.05
16.978929522   -4  -18   19   -8   -3    6   58   0.04
16.978929587   21    5  -19    5    6   -2   58   0.04
16.978929983  -11   30  -13   -2  -10    6   72   0.01
16.978930108   -6   15   -7    0   -5    3   36   0.007
16.978930233   -1    0   -1    2    0    0    4   0
16.978930358    4  -15    5    4    5   -3   36  -0.007
16.978930483    9  -30   11    6   10   -6   72  -0.01
16.978930879  -23   -5   17   -1   -6    2   54  -0.04
16.978930944    2   18  -21   12    3   -6   62  -0.04
16.978931004  -18  -20   23    1   -1   -1   64  -0.05
16.978931465  -25   28   -9    7   -8   -1   78  -0.07
16.97893159   -20   13   -3    9   -3   -4   52  -0.08
16.978932361  -37   -7   21    8   -4   -5   82  -0.13
16.978933243    9   22  -13  -14   10   -5   73  -0.18
16.978933889  -13   17    5  -17    4   -3   59  -0.22
16.978934014   -8    2   11  -15    9   -6   51  -0.22
16.978934228   24  -12   -6    0    3    4   49  -0.24
16.978934353   29  -27    0    2    8    1   67  -0.24
16.978934535  -35   12   23  -20   -2   -1   93  -0.25
16.97893466   -30   -3   29  -18    3   -4   87  -0.26
16.978934874    2  -17   12   -3   -3    6   43  -0.27
16.978934939   27    6  -26   10    6   -2   77  -0.28
16.97893546     0   16  -14    5   -5    3   43  -0.31
16.978935585    5    1   -8    7    0    0   21  -0.32
16.97893571    10  -14   -2    9    5   -3   43  -0.32
16.978935835   15  -29    4   11   10   -6   75  -0.33
16.978936231  -17   -4   10    4   -6    2   43  -0.35
16.978936296    8   19  -28   17    3   -6   81  -0.36
16.978936356  -12  -19   16    6   -1   -1   55  -0.36
16.978936817  -19   29  -16   12   -8   -1   85  -0.39
16.978936942  -14   14  -10   14   -3   -4   59  -0.40
16.978937113   29   25  -24  -18    8    2  106  -0.41
16.978937588  -36    9    8   11   -9   -2   75  -0.43
16.978937713  -31   -6   14   13   -4   -5   73  -0.44
16.978937884   12    5    0  -19    7    1   44  -0.45
16.978938595   15   23  -20   -9   10   -5   82  -0.49

Table 3: Linear combinations of elements and their resulting mass in a 0.5 ppm range around 16.978930233 u. The column Σ gives the sum of the absolute atom counts of a combination. The search space of the elements was limited to H ±40, C ±30, N ±30, O ±20, P ±10, S ±6; the numbers denote the minimal or maximal number of the element in the formula. The individual combinations are very close to each other with respect to their mass. The wrong combinations, however, all lead to a high sum of transferred atoms, usually much more than 15. The right combination explains the mass difference with only 4 atoms.

Figure 10: Screenshot of the annotated part of the D. melanogaster network. The bright vertices at the bottom are identified masses from the KEGG pathway 00052 (galactose metabolism). The edges are reconstructed using frequent mass differences. The numbers next to the edges represent the minimal possible distance, measured in reaction pairs, between two vertices. If it is “1”, the edge represents an actual reaction pair and in this case is also printed slightly thicker. Most of the reconstructed edges represent 2 reaction pairs.
Next to each vertex the mass and its corresponding KEGG compound identifiers are printed.

3 Results

3.1 Comprehensive Analysis of Pathway Maps from the KEGG Database

As a first step the reference network is constructed from all KEGG masses and reaction pairs as described in the methods (chapter 2.1.2). This network comprises 5765 vertices and 11112 edges. The degree density for this network is 0.067 %. To check whether the network structure is consistent with previous findings, the degree distribution and the distribution of clustering coefficients are calculated and plotted in log-log plots. For both distributions the described power law can be observed (Figure 11).

[Figure 11: two log-log plots. Left: “log log Plot of Degree Distribution”, Count against Degree. Right: “log log Plot of C(k)”, Average C(k) against Degree.]

Figure 11: Degree distribution (left) and distribution of clustering coefficients C(k) (right) of the reference network created from KEGG data. C(k) follows a power law with scaling exponent −0.88, indicated by the regression line. The degree distribution is heavy-tailed and follows a power law in its middle section. The regression line was fitted only to this section and has slope −2.27.

The hub vertices in this network correspond to so-called current metabolites [11]. In fact the vertex with the highest degree (815) is water, followed by Oxygen (590), S-Adenosylmethionine (270), ATP (262), CO2 (253), Ammonia (233), Phosphate (158) and Acetyl-CoA (136). Interestingly, most of these metabolites cannot be measured in FTMS because of their small mass. ATP, S-Adenosylmethionine and Acetyl-CoA are the exceptions.
Hence this network cannot be seen as a reaction network where two connected compounds are interconvertible, but rather as a network representing transformation and requirement relationships.

Analogous to the construction of the reference network, smaller networks are created from individual KEGG pathway maps as described in the methods section. Only those maps are considered which actually contain reaction pairs, so 140 maps are investigated. For these maps it is assessed what effect arises if isomers (which have the same mass) are collapsed into one vertex, because this is what happens when the compounds in the map are determined using mass spectrometry. An example of this process is displayed in Figure 12.

[Figure 12: glycolysis pathway with compound names and masses (left) and its collapsed version (right). Masses of the collapsed path: 180.0634 (α-D-Glucose, β-D-Glucose), 260.0297 (hexose monophosphates, among them α-D-Glucose-1P, α-D-Glucose-6P, β-D-Glucose-6P and β-D-Fructose-6P), 339.996 (hexose bisphosphates), 169.998 (Glycerone-P, Glyceraldehyde-3P), 265.9593 (Glycerate-1,3P2, Glycerate-2,3P2), 185.9929 (Glycerate-3P, Glycerate-2P), 167.9824 (Phosphoenolpyruvate), 88.016 (Pyruvate).]

Figure 12: Part of the glycolysis pathway on the left, comprising 15 compounds. Given are compound names and masses. To create the collapsed version on the right, all compounds with the same mass are collapsed into one vertex, preserving the edges between compounds. A path with 8 nodes emerges. This demonstrates how FTMS gets a more limited view of metabolic networks. On top of the fact that isomers are measured as one mass, the small mass of Pyruvate cannot be detected at all.

Each pathway map contains on average 36.3 vertices and 38.6 edges. Now vertices with the same mass (i.e.
vertices representing isomers) are collapsed into one vertex, preserving the edges of both vertices. Double edges are removed. These collapsed pathway maps comprise on average 28.4 vertices and 31.4 edges. The results are presented in the following table and indicate that FTMS has a rather limited view of these pathway maps.

                  Vertices   Edges
KEGG maps           36.3      38.6
collapsed maps      28.4      31.4
difference           7.9       7.2

Table 4: Size of KEGG pathway maps before and after collapsing equal masses. On average each pathway map loses 7.9 vertices and 7.2 edges. At least this number of entities is lost in any FTMS experiment, not yet accounting for compounds that are not measured.

Figure 13: Histograms of the mass differences from the S. cerevisiae (top) and D. melanogaster (bottom) dataset. The inset shows the 88.006 to 88.020 u interval of the D. melanogaster data at higher resolution. The width of one box in the inset is approximately 0.0001 u, so consecutive boxes are likely to be clustered together by the employed clustering algorithm.

3.2 Metabolic Network Reconstruction for S. cerevisiae

To reproduce and verify the results from Breitling et al. [16], the metabolic network is reconstructed from the Saccharomyces cerevisiae data using both methods, the one based on a priori known mass differences and the one based on frequent mass differences determined from the data. The histogram of mass differences is shown in Figure 13 (top). Initially the dataset contains 3101 mass values. There are 7 masses with putative double ionization peaks, but no corresponding 13C peaks could be detected. After filtering for 13C mass peaks, 3027 masses remain. A total of 164 masses can be identified as compounds using the KEGG database. An additional 62 masses can be identified using the LIPIDMAPS database. This makes a total of 226 identified masses at a precision of 0.75 ppm. The remaining 92 % of the masses remain unidentified.
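Identification at a given ppm tolerance amounts to a simple lookup, sketched below in Python. The function name, the tiny database and the measured value are assumptions made for illustration; the masses are the rounded values printed in Figure 12, and the tolerance is assumed to be relative to the measured mass.

```python
def match_by_mass(measured, database, ppm=0.75):
    """Return the names of all database compounds within ppm of the measured mass."""
    tolerance = measured * ppm * 1e-6
    return sorted(
        name for name, mass in database.items()
        if abs(mass - measured) <= tolerance
    )


# Illustrative excerpt; masses in u as printed in Figure 12.
database = {
    "alpha-D-Glucose": 180.0634,
    "beta-D-Glucose": 180.0634,
    "Phosphoenolpyruvate": 167.9824,
    "Pyruvate": 88.016,
}

matches = match_by_mass(180.06345, database)
print(matches)
```

A single measured mass can match several isomers, which is why an identified mass does not necessarily correspond to one compound.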
Using a priori known mass differences from the list of 109 frequently occurring transformations is straightforward. The reconstructed network contains 1998 edges and hence has a degree density of 0.04 %. The clustering coefficient is 0.23, and the distribution of clustering coefficients is plotted in Figure 14 in the lower right. The degree distribution follows a power law (not shown). The network's properties are comparable to those of the networks reconstructed using 5 and 10 different mass differences.

Clustering of the mass differences results in 1918640 clusters with on average 2.4 mass differences per cluster. The frequent mass differences accumulate in a relatively small proportion of clusters. The width of the resulting clusters, defined as the difference between the biggest and the smallest value in the cluster, is on average 0.033 · 10^-3 u. This is heavily biased by the many one-element clusters of width 0 u. The widest observed cluster spans a range of 0.19 · 10^-3 u. After determining the abundance cutoff (see next section), the selected clusters span ranges from 0.09 · 10^-3 u to 0.11 · 10^-3 u. Their mean cluster width is 0.098 · 10^-3 u.

3.2.1 Determining the Abundance Cutoff

To get an overview of the possible outcomes of different abundance cutoffs, a total of 13 networks is created, covering a wide range of abundance cutoffs. The selected cutoffs lead to the incorporation of 5 to 3841 different mass differences, resulting in 1133 to 114644 edges. A complete overview, together with the clustering coefficient for each network, is given in Table 5. Each of the reconstructed networks has a degree distribution following a power law. To assess the modularity of the networks, the distributions of clustering coefficients for each network are plotted in Figure 14.

To pick an abundance cutoff for further work, a compromise between the network's degree density, its average clustering coefficient ⟨C(k)⟩ and the distribution of clustering coefficients has to be found.
The rationale behind this is that the network might not contain any information if it is too sparsely connected, but will lose information again if it is too densely connected and the formation of clusters in the network is blurred by too many edges of doubtful origin. Networks 5–10 have the highest clustering coefficients, but towards the lower abundance cutoffs the degree density rises and the distribution of C(k) loses the power law characteristic that is described for biological networks. For these reasons network 6 was chosen for further investigation.

The resulting network's structure should be invariant to slight variations of the abundance cutoff; nevertheless an improved method to optimize this parameter is desirable. Due to time limitations no exhaustive method could be tested, but one idea is to optimize the cutoff with respect to a quality measure on the networks, which should include the quality of fit of the degree distribution to a power law and should maximize the average clustering coefficient of the network.

Thus the network is finally reconstructed with mass differences that occur at least 96 times. This applies to 100 frequent mass differences. Edges are inserted at a precision of 0.75 ppm. The resulting network contains 9615 edges and 3027 vertices. The degree density is 0.21 % and the average clustering coefficient is 0.33.
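The reported degree densities can be checked with a one-line calculation. The sketch below assumes that degree density means the number of edges divided by the number of possible vertex pairs in an undirected graph; this definition is not stated in this section, but it is consistent with the values reported for the reference network and for network 6. The function name is hypothetical.

```python
def degree_density_percent(vertices, edges):
    """Edges as a percentage of all possible undirected vertex pairs (assumed definition)."""
    possible_pairs = vertices * (vertices - 1) / 2
    return 100.0 * edges / possible_pairs


reference_density = degree_density_percent(5765, 11112)  # KEGG reference network
network6_density = degree_density_percent(3027, 9615)    # S. cerevisiae, cutoff 96
print(round(reference_density, 3), round(network6_density, 2))
```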
[Figure 14: log-log plots of C(k), one panel per network in Table 5; x-axis: Degree, y-axis: Average C(k).]

Figure 14: Evolution of the network's distribution of clustering coefficients C(k) for an increasing number of edges. From the top left to the lower right the abundance cutoff ("Abundance") is lowered. The plot in the lower right is C(k) for the network reconstructed with the list of known transformations.

N       Abundance  Mass Diff.s  Edges   Degree-Dens.  ⟨C(k)⟩
1       304        5            1133    0.03 %        0.22
2       217        10           1965    0.04 %        0.28
3       164        25           3808    0.08 %        0.31
4       126        49           5627    0.12 %        0.32
5       111        73           7674    0.17 %        0.34
6       96         100          9615    0.21 %        0.33
7       83         147          12562   0.27 %        0.34
8       73         199          15911   0.35 %        0.35
9       64         294          19977   0.44 %        0.34
10      52         488          28490   0.62 %        0.34
11      41         960          45198   0.99 %        0.32
12      30         1994         74600   1.63 %        0.30
13      23         3841         114644  2.50 %        0.29
Trans.  -          109          1998    0.04 %        0.23

Table 5: Parameters for the reconstructed S. cerevisiae networks. The abundance cutoff (2nd column) determines how many mass differences are considered "frequent" (3rd column). The following columns show the number of edges, the degree density and the average clustering coefficient ⟨C(k)⟩ of the resulting network. "N" is just a number assigned to each network. "Trans." is the network reconstructed using the list of common metabolic transformations.

3.2.2 Evaluation of the Proposed Null Model

A showcase null model as proposed in [16] is constructed by sampling 3027 random "masses" from a uniform distribution between 150 and 2000. The previous findings can be confirmed: the mass differences do not accumulate, so the clustering only finds clusters with a maximal size of 8. By chance, 13 of these masses can be identified using KEGG, and another 6 masses are present in LIPIDMAPS. The uniform distribution is however inappropriate, since the original data are clearly not uniformly distributed (compare Figure 7). Furthermore, most of the random masses will not represent any chemically possible combination of elements. The only conclusion this null model permits is that frequent mass differences do not occur in uniformly distributed masses. It is very likely, however, that in any set of chemical compounds certain patterns among the mass differences arise simply because chemistry limits the possible masses. These mass differences are actually in the focus of research (see footnote 16), and it is known that molecules consisting of C, H and O form clearly separated "nominal mass" clusters [42].
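The showcase null model described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the fixed seed and the `gap` threshold are hypothetical, and the gap-based grouping is a simple stand-in for the clustering procedure of the thesis, which is defined elsewhere.

```python
import random

def sample_null_masses(n=3027, low=150.0, high=2000.0, seed=0):
    """Draw n pseudo-masses uniformly from [low, high] (showcase null model)."""
    rng = random.Random(seed)
    return sorted(rng.uniform(low, high) for _ in range(n))

def largest_difference_cluster(masses, gap=1e-5):
    """Size of the largest run of sorted pairwise differences whose
    neighbouring values lie closer than `gap`.

    `gap` is a hypothetical threshold; this gap-based grouping only
    approximates a real clustering of mass differences.
    """
    ms = sorted(masses)
    diffs = sorted(b - a for i, a in enumerate(ms) for b in ms[i + 1:])
    if not diffs:
        return 0
    best = run = 1
    for prev, cur in zip(diffs, diffs[1:]):
        run = run + 1 if cur - prev < gap else 1
        best = max(best, run)
    return best
```

Calling `largest_difference_cluster(sample_null_masses())` builds roughly 4.6 million pairwise differences in memory, so a smaller `n` is advisable for quick experiments.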
An improved null model would sample from such chemically constrained masses (for example the C, H and O compositions mentioned above), ideally extended by more elements, instead of taking a completely random approach. The employed null model therefore does not allow the conclusion that the investigated metabolomics data behave in any special way compared to a random composition of chemically plausible masses. The total space of molecular structures within the given mass range is so large (estimated 10⁶⁰ to 10²⁰⁰ [36]) that the construction of a proper null model is beyond the scope of this thesis. For these reasons this null model is not elaborated further; in particular, no repeated sampling of the 3027 masses is performed. The significance of the mass differences and networks from real-world data is assessed by identifying their biological meaning.

Footnote 16: P. Schmitt-Kopplin, Helmholtz-Zentrum München, personal communication.

3.2.3 Identification of Motifs

The square motif or 4-cycle is expected to occur in the reconstructed networks. A 4-cycle is a path of length 4 where the start vertex equals the end vertex. What is special about the 4-cycles considered here is that opposing edges have the same mass difference. This motif arises when a compound a can be chemically altered to form a′ and both compounds, a and a′, undergo the same chemical reaction. The result is two compounds b and b′ which exhibit the same mass difference as a and a′. Furthermore, the difference between a and b is the same as between a′ and b′. This can happen because metabolites which share a common scaffold but have different side groups are often transformed in the same way, sometimes even by the same enzymes. A good example is the metabolism of corticosteroids [43]. The example from the introduction in Figure 4 also shows this motif among the 6 metabolites at its bottom. These squares extend into a "ladder" structure, because b and b′ might be further transformed to c and c′ and so forth. This process is hypothesized to happen in lipid metabolism.
An example is β-oxidation. The first step, an oxidation, introduces a double bond and therefore removes 2 hydrogen atoms from the acyl-CoA molecule. This reaction happens for all conceivable acyl-CoA molecules, independent of their chain length. Hence the described ladder structure should emerge. Unfortunately these structures could not be found in significant numbers in the reconstructed networks, probably because certain metabolites were not detected, so that the "ladder" structure breaks up. This investigation was therefore not pursued any further.

3.3 Metabolic Network Reconstruction for D. melanogaster

The proof of concept was repeated on the S. cerevisiae dataset in the previous chapter. To analyze the structure and information content of the networks yielded by the reconstruction method, metabolic networks from the fruit fly Drosophila melanogaster are reconstructed. The D. melanogaster dataset became available later but is of higher quality than the S. cerevisiae dataset (compare chapter 2.1.1). Therefore further analysis is performed on these data.

[Figure 15: two log-log plots; left: Count versus Degree (degree distribution), right: Average C(k) versus Degree.]

Figure 15: The degree distribution (left) and distribution of clustering coefficients (right) for the reconstructed D. melanogaster network.

The raw dataset contains 1965 mass values. There are 5 masses with putative double ionization peaks, but again no corresponding ¹³C peaks can be detected. After removing ¹³C mass peaks, 1919 masses remain. 182 masses can be identified as compounds using the KEGG database. An additional 32 masses can be identified using the LIPIDMAPS database. In total 214 masses are identified at a precision of 0.75 ppm. The remaining 89 % of the masses remain unidentified.
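A database lookup at a fixed ppm tolerance, as used for the identification above, can be sketched as follows. The function names and the tiny reference list are hypothetical and only serve as illustration; the sketch also assumes that the tolerance is taken relative to the measured mass, which the text does not specify.

```python
from bisect import bisect_left, bisect_right

def ppm_window(mass, tol_ppm=0.75):
    """Return the (lower, upper) mass bounds for a relative tolerance in ppm."""
    delta = mass * tol_ppm * 1e-6
    return mass - delta, mass + delta

def lookup(measured, reference, tol_ppm=0.75):
    """Names of all reference entries within tol_ppm of the measured mass.

    `reference` is a list of (monoisotopic_mass, name) tuples sorted by mass.
    """
    ref_masses = [m for m, _ in reference]
    low, high = ppm_window(measured, tol_ppm)
    i = bisect_left(ref_masses, low)
    j = bisect_right(ref_masses, high)
    return [name for _, name in reference[i:j]]

# Hypothetical two-entry reference list (monoisotopic masses in u).
REFERENCE = [
    (180.063388, "D-Glucose"),
    (228.208930, "Tetradecanoic acid"),
]
```

With this toy list, `lookup(180.063391, REFERENCE)` returns the glucose entry, whereas a mass of 180.0640 falls outside the 0.75 ppm window and returns an empty list. Isomers share one mass, so a real lookup generally returns several names per hit.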
Clustering of the mass differences results in 1227662 clusters with on average 1.5 mass differences per cluster. The average width of the resulting clusters is 0.016 · 10⁻³ u. This is again mainly due to the many one-element clusters. The widest observed cluster spans a range of 0.19 · 10⁻³ u. The finally selected clusters (with 15 or more elements, see next section) span ranges from 0.04 · 10⁻³ u to 0.15 · 10⁻³ u; their mean width is 0.097 · 10⁻³ u.

3.3.1 Determining the Abundance Cutoff

To determine an abundance cutoff for further work, the method described for the S. cerevisiae network is repeated, and a total of 13 networks with different abundance cutoffs are reconstructed. Their parameters are shown in Table 6. The distributions of clustering coefficients are not printed, because they behave like the ones from S. cerevisiae. An abundance cutoff of 15 is chosen for the D. melanogaster data.

N       Abundance  Mass Diff.s  Edges   Degree-Dens.  ⟨C(k)⟩
1       41         9            280     0.02 %        0.11
2       25         23           550     0.03 %        0.13
3       18         57           1057    0.06 %        0.15
4       17         73           1273    0.07 %        0.15
5       16         83           1366    0.07 %        0.16
6       15         105          1564    0.08 %        0.16
7       14         148          1984    0.11 %        0.18
8       13         209          2450    0.13 %        0.21
9       12         322          3057    0.17 %        0.25
10      11         542          3998    0.22 %        0.28
11      10         988          5572    0.30 %        0.24
12      9          1929         8222    0.45 %        0.16
13      8          3915         12976   0.71 %        0.14
Trans.  -          109          772     0.04 %        0.16

Table 6: Parameters for the reconstructed D. melanogaster networks. The abundance cutoff (2nd column) determines how many mass differences are considered "frequent" (3rd column). The following columns show the number of edges, the degree density and the average clustering coefficient ⟨C(k)⟩ of the resulting network. "N" is just a number assigned to each network. "Trans." is the network reconstructed using the list of common metabolic transformations.

The 6th network is chosen for further investigation, because the distribution of C(k) loses its power law property in the subsequent networks. The network is reconstructed with mass differences that occur 15 or more times.
This threshold yields 105 frequent mass differences. A precision of 0.75 ppm is applied for the insertion of edges. With these parameters 1564 edges emerge, connecting the 1919 vertices. This corresponds to a degree density of about 0.09 % (listed as 0.08 % in Table 6). The degree distribution and the distribution of clustering coefficients for this network are shown in Figure 15.

3.4 Analysis of the Masses

The hub vertices in the D. melanogaster network are identified, and all vertices with a degree ≥ 13 are shown in Table 7. This applies to 36 vertices. A full list can be found in the supplementary materials to this thesis. 15 of these masses (42 %) can be identified by the KEGG database. This is a substantially higher rate than for the compounds overall. Assuming that important metabolites are more likely to be known and annotated in public databases like KEGG, this indicates that important compounds accumulate in the set of hub vertices.

Deg.  Mass        Isom.  Formula    Name
22    228.208931  1      C14H28O2   Tetradecanoic acid
21    260.029701  38     C6H13O9P   D-Fructose 6-phosphate
20    226.193281  1      C14H26O2   Myristoleic acid
19    227.196631
18    357.251521
18    268.240231  1      C17H32O2   Cyclohexaneundecanoic acid
17    255.227931
17    240.208921
17    229.212281
17    200.177631  1      C12H24O2   Dodecanoic acid
16    284.271521  1      C18H36O2   Stearic acid
16    242.224571
15    440.093091
15    329.220211
15    280.240221  5      C18H32O2   Linoleic Acid
15    270.219491
15    246.050451  1      C6H15O8P   Glycerophosphoglycerol
15    215.055861  1      C5H14NO6P  Glycerophosphoethanolamine
15    180.063391  41     C6H12O6    D-Glucose
14    411.298461
14    383.267181
14    377.108721
14    363.129441
14    270.255881
14    257.243581
13    395.119281
13    352.077071
13    342.116201  30     C12H22O11  Saccharose
13    334.066491  4      C9H19O11P  sn-glycero-3-Phospho-1-inositol
13    321.082481
13    285.274871
13    283.259221
13    278.040281
13    214.156891  1      C12H22O3   3-Oxododecanoic acid
13    198.161961  2      C12H22O2   Menthyl acetate
13    196.058301  8      C6H12O7    D-Gluconic acid

Table 7: Hub vertices in the D. melanogaster network reconstructed by frequent mass differences. Masses are sorted with respect to their degree (Deg.). Isom. gives the number of isomers for that mass according to the KEGG database. The molecular formula and compound name for one isomer are also given. Masses with no further information could not be identified by means of a database lookup of the mass.

[Figure 16: four panels; top left: Number of Isomers versus Degree, top right: Mass versus Degree, bottom left and bottom right: the same quantities together with box plots of their null models.]

Figure 16: Top: Correlations between the vertices' degree and the average number of isomers these vertices represent (left), and between the vertices' degree and their average mass (right), in the reconstructed D. melanogaster network. The x-axis shows the degree, the y-axis shows the respective parameter. The mean values are connected by a line. In the left plot each dot represents one individual vertex, in the right plot each box represents all vertices with the same degree. Bottom: The same values together with their respective null models. Vertices with a degree ≥ 15 were collapsed into one group. The mean number of isomers (left) and mean mass (right) from 100000 random samples are represented as box plots, such that each box represents 100000 values; whiskers extend to the maximal observed simulation results. The values actually observed in the respective networks are plotted as a line.

The hubs tend to have a lower mass than expected by chance and also represent more isomers.
To see whether this trend continues through the whole range of observed degrees, the correlation between degree and these two parameters is calculated as described in the methods section (chapter 2.4). Only few vertices have a degree of 15 or more, so for the correlation analysis all these vertices are collapsed into one group. The trend and the null models used for hypothesis testing are visualized in Figure 16. The correlation coefficient between degree and average number of isomers is 0.84. The p-value for this correlation is 0.0001 as estimated by the randomization test. The correlation coefficient between degree and average mass is -0.62. The p-value for this correlation is 0.0118 as determined by randomization.

Additionally, the fraction of identified compounds among all vertices with the same degree is plotted in Figure 17. Most of the hub vertices are identified, and it can be seen that metabolites with a higher degree in the reconstructed network are more likely to be known metabolites. Assuming that the most important metabolites are known, it can be concluded that higher degree vertices represent more important metabolites.

The conclusion of these findings is that the reconstructed network indeed contains information about its constituent compounds. The exact meaning of this information remains to be elucidated. Any deeper investigation should start with the identification of the remaining compounds.

[Figure 17: scatter plot; x-axis: Degree, y-axis: Fraction of identified compounds; color key: Count (1, 10, 100, 1000).]

Figure 17: For each degree the fraction of identified compounds is shown. A clear trend is recognizable: compounds with a higher degree in the network tend to be known compounds and therefore seem to be important in metabolism. The color of each point gives information about the number of actual observations. Hollow circles represent a single compound; the darker a circle, the more compounds it stands for.
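A randomization test of the kind used for the correlations above can be sketched as follows. This is a generic permutation test and not necessarily the exact procedure of chapter 2.4: the function names, the two-sided comparison and the shuffling of attribute values across vertices are assumptions made for illustration.

```python
import random
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    denom = (sxx * syy) ** 0.5
    return sxy / denom if denom > 0 else 0.0

def degree_group_correlation(degrees, values, cap=15):
    """Correlate degree with the mean attribute value per degree group.

    Degrees >= cap are collapsed into one group, as described in the text.
    """
    groups = defaultdict(list)
    for d, v in zip(degrees, values):
        groups[min(d, cap)].append(v)
    ks = sorted(groups)
    means = [sum(groups[k]) / len(groups[k]) for k in ks]
    return pearson(ks, means)

def permutation_p_value(degrees, values, n_perm=10000, cap=15, seed=0):
    """Two-sided permutation p-value: shuffle the attribute values across
    vertices and count how often the shuffled correlation is at least as
    extreme as the observed one (assumed test design)."""
    rng = random.Random(seed)
    observed = degree_group_correlation(degrees, values, cap)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(degree_group_correlation(degrees, shuffled, cap)) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

The `+ 1` in numerator and denominator keeps the estimated p-value strictly positive, so its smallest possible value is bounded by the number of permutations.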
[Figure 18: two scatter plots; x-axis: Number of observed mass differences, y-axis: Fraction of identified masses.]

Figure 18: The right plot shows the relationship between all frequent mass differences used for the construction of the D. melanogaster network and their annotation. Everything right of the dotted line is collapsed into one point to form the left plot. The x-axis shows the abundance of a mass difference, the y-axis shows the fraction of positively identified mass differences. Darker dots represent more observations, completely empty dots represent one observation. The star in the left plot represents the collapsed points.

3.5 Identification of Frequent Mass Differences

To elucidate the chemical and biological meaning of the observed frequent mass differences, each is identified by database lookup against KEGG and the list of common metabolic transformations at an error tolerance of 0.75 ppm. More frequently occurring mass differences are expected to be identified as known metabolic transformations. The chosen tolerance is very strict and yields a pessimistic estimate. Nevertheless, Figure 18 shows that the hypothesis holds and frequent mass differences are likely to be positively identified.

To illustrate the method, the top 10 mass differences are identified. The results are shown in Table 8. Five mass differences can be identified at an error tolerance of 0.75 ppm. The example shows that the method is quite pessimistic, as mentioned above: only 5 mass differences can be identified positively. If the error tolerance were 0.85 ppm or higher, one more could be identified. The remaining 5 unidentified mass differences are identified by the combinatoric approach.
The assigned combinations of elements are those which explain the mass difference by as few element additions or subtractions as possible and are within a precision of 2 ppm. One additional mass difference can be assigned to a chemical formula (C4H6), but the formula apparently does not represent a known metabolic transformation. For the masses 1.00335, 2.01567 and 2.01562 only combinations with a high number of transfers can be found. One can assume that these masses have no underlying chemical meaning and have emerged due to noise in the mass difference data, which caused false predictions in the clustering procedure. It is however possible that they arise from compounds with different elemental compositions that exhibit exactly this mass difference.

Count  Mass diff.  Common transformation / assigned formula
90     1.00335     (H14 C−1 N−4 O−6 P9 S−4) (0.04 ppm) ∗∗∗
90     28.0313     ethyl addition, C2H4
71     2.01567     H−2 C−2 N−3 O12 P−6 S2 (1.6 ppm) ∗∗∗
51     2.01562     H1 C−2 N5 O−5 P−3 S4 (0.55 ppm) ∗∗∗
50     14.01564    methanol, CH2
49     26.01565    C2H2 transfer
46     18.01058    H2O (0.85 ppm) ∗
42     162.05278   monosaccharide, C6H10O5
41     74.03677    C3H6O2 (0.13 ppm)
36     54.04693    C4H6 (0.4 ppm) ∗∗

[Columns "cpd.", "r.p." and "identified": the row alignment of the marks is not recoverable here; in extraction order they read: yes yes X yes yes yes X X yes yes yes yes X X.]

Table 8: Top 10 mass differences from the D. melanogaster data. The first two columns show the count (frequency) and the mass difference itself. If the mass difference was found in the list of common metabolic transformations, the respective transformation is given in the third column. If it was identified as corresponding to a compound (cpd.) or reaction pair (r.p.) in the KEGG database, this is indicated in the respective column. If the last column is checked, the mass difference was identified by either method within 0.75 ppm error. Of the top 10 mass differences, 5 can be successfully identified. The sum formulas given with a ppm value were assigned using the combinatoric approach described in chapter 2.5. Two of them were not identified by other means, because ∗ had only a precision of 0.85 ppm and ∗∗ appears neither as reaction pair nor as compound nor as common metabolic transformation. Three of them, marked with ∗∗∗, can only be explained by transfers of many elements. They may be noise or arise from compounds with a different elemental composition. In fact 2.01567 is 10 ppm away from 2.01565, which is the mass difference of a simple H2 transfer. The first value (1.00335) is 0.44 % smaller than the mass difference of a simple H transfer.

Finally, the reconstructed edges restricted to the identified compounds were investigated as described in the methods section (chapter 2.5). For any two identified compounds that are directly connected in the reconstructed network, a shortest path between them can be searched in the reference network. There are 239 edges between the identified compounds, representing 77 distinct mass differences. 171 of these edges (71.5 %) have a corresponding path in the reference network. These 171 edges still represent 57 distinct mass differences. The remaining 68 edges connect compounds that are not connected in the reference network.

The average length of the shortest paths is 2.31 (±1.07), so the edges in the reconstructed network represent on average 2 reactions. The longest shortest path comprises 6 reactions. The full histogram is given in Table 9. This is evidence for the assumption that the edges in the reconstructed mass difference networks actually represent a whole series of reactions.

Length:  1   2   3   4   5  6
Count:   36  74  41  13  6  1

Table 9: Histogram of the number of reactions (i.e. the shortest path length in the reference network) each edge represents in the reconstructed network. Only edges between identified compounds are considered.

4 Discussion

Mass Difference Networks are not Metabolic Networks

FTMS gives a snapshot of an organism's metabolism, ideally highlighting active pathways by detecting their compounds.
In reality, it is almost impossible to detect all compounds of an active metabolic pathway. Even if this succeeds, the reactions between most of them cannot be resolved in the reconstructed mass difference networks because of collapsed isomers, as was demonstrated in chapter 3.1. Furthermore, as the analysis of the most frequent mass differences has shown (chapter 3.5), the edges in these networks do not necessarily represent actual chemical reactions, but arbitrary combinations of reactions. It is also possible that a frequent mass difference arises from chemically very different compounds whose mass difference nevertheless appears frequently enough to become significant. If an edge does represent a chemical reaction, it is still unknown whether the organism can catalyze this reaction or not. The direction of an edge cannot be determined from mass differences alone, so all edges in the reconstructed networks are undirected. The compartmentalization of reactions cannot be detected by this reconstruction method, so one must assume that the edges are not spatially separated. Finally, metabolic networks are better described as hypergraphs, and the hypergraph structure cannot be reconstructed from mass differences alone. Several studies emphasize the importance of direction information, compartmentalization information and hypergraph structure in metabolic networks [4, 11, 13, 44].

Due to the above effects many edges which are expected to be present are actually missing from the observed data, and many edges which have no chemical meaning and probably no biological background emerge in the difference networks. The conclusion is that the reconstructed networks are not metabolic networks in the common sense. Hence common techniques like flux balance analysis or elementary flux modes, which have been successfully applied to metabolic models [45], are not applicable to these networks.
It is well known that network motifs occur in complex networks [46]. They usually comprise very few vertices and are determined by their connectivity structure. In this thesis an attempt was made to identify a "ladder" motif which was hypothesized to emerge from the mass difference method. The fact that it was not found in significant quantities may well be due to missing (not measured) compounds, which break up the ladder structure. For the mentioned β-oxidation this might be explained by the mass of the acyl-CoA being too high for the compound to be detected reliably.

The disturbance by missing (not measured) compounds also applies to other subgraph structures, such as individual pathways. To validate the reconstructed network against known pathways or to find motifs, it is therefore necessary to look not for the exact pathway or motif, but for a subgraph which is homomorphic to it. Determining the distance (in terms of reaction pairs) for reconstructed edges as described in chapter 2.5 does exactly this, but is limited to identified compounds. Extending this to all vertices in the reconstructed networks might finally confirm the presence of the ladder motif. If such "fuzzy" motif matching is applied, the null model for testing the significance has to be modified to account for the fuzziness.

Not all enzymes are known, therefore more reactions exist than can be reconstructed by enzyme gene mapping [12]. An advantage of the reconstructed networks is that they may contain edges which have a chemical and biological meaning that is not yet known. Identifying these edges with current knowledge is however difficult. A good starting point would be edges which are positively confirmed to represent known biochemical transformations, or at least a transformation which does not look like noise. Despite this, there is some intrinsic information in these networks. For example,
the degree of the vertices in the networks carries meaning, as has been demonstrated. This information content can be exploited in future work using data mining techniques.

Frequent Mass Differences are not Common Metabolic Transformations

For the D. melanogaster and S. cerevisiae datasets, the list of known metabolic transformations yields sparsely connected networks which are comparable to the ones created with the top 5–20 frequent mass differences (compare Tables 5 and 6), even though this list actually contains 109 entries. Additionally, the common metabolic transformations are present among the determined frequent mass differences, but at a lower rate than the KEGG reaction pairs. Some of the common metabolic reactions only appear as less frequent mass differences or not at all. This is another indicator for the absence of actual metabolic transformations in the data and hence in the reconstructed networks.

The hypothesis that many of the frequent mass differences actually represent two or more consecutive metabolic transformations was already presented. Evidence for this is given in chapter 3.5, where the actual number of known reactions is computed for reconstructed edges. Even though only a subset of all edges is used (the ones in the induced subgraph of identified compounds), these edges already represent 54 % of all used frequent mass differences. 93 % of the observed mass differences could be assigned a sequence of 1–6 reactions from the KEGG database. This equals 50 % of all mass differences in the reconstructed network. An estimate based on the above information is that about 90 % of all frequent mass differences can be explained by a sequence of 1–6 reactions from the KEGG database.

Using a list of common metabolic transformations is however promising if the goal is to create metabolic reaction networks which are also suited for the well-known analysis methods mentioned above.
The list has to be extended by combinations of these reactions to account for compounds that were not measured. In doing so, it is very likely that metabolic pathways emerge as heavily connected clusters in the resulting reconstructed networks, as Figure 19 shows. Edges representing multiple reactions are also incorporated when using the frequent mass differences, but the advantage of a well-defined set of metabolic reactions is the absence of noise due to the precision shift (discussed below) and clustering.

[Figure 19: diagram of a linear pathway (top) and the densely connected reconstructed pathway (bottom), with edges labeled by their number of reactions.]

Figure 19: Theoretical pathway (top) and reconstructed pathway (bottom). The hollow circle represents a compound that was not measured. The numbers on the edges give the number of reactions which explain the mass difference. A metabolic pathway consists of a sequence of reactions, so if edges are introduced which themselves represent a sequence of reactions, a single metabolic pathway should be very densely connected by such edges.

The Clustering of Mass Differences is Complicated by a "Precision Shift", but Frequent Mass Differences Tend to Have a Meaning

The precision of mass differences decreases compared to the precision of individual masses, especially if the mass difference is the difference of two large masses. This was demonstrated in chapter 2.3.2. This "precision shift" has implications for the clustering of mass differences. Especially in the region of small mass differences around 1 u, a lot of noise emerges due to this error. For example, the three unidentified mass differences classified as noise (compare Table 8) are also at about 1 and 2 u. They are very close to a simple H or H2 transfer, and it is very likely to observe this mass difference in metabolism. So in this case the precision shift caused a wrong selection by the clustering routine. Additionally, this is a region where many mass differences accumulate (compare Figure 13), so that clustering becomes even more difficult.
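The size of the precision shift can be illustrated with a simple worst-case bound. The symbols m_1, m_2 and ε are introduced here for illustration only and are not taken from chapter 2.3.2; the sketch assumes that every measured mass carries a relative error of at most ε.

```latex
\Delta m = m_1 - m_2 \quad (m_1 > m_2), \qquad
|\delta(\Delta m)| \le \varepsilon\,(m_1 + m_2), \qquad
\frac{|\delta(\Delta m)|}{\Delta m} \le \varepsilon\,\frac{m_1 + m_2}{m_1 - m_2}
```

With an assumed ε of 1 ppm, m_1 = 501 u and m_2 = 500 u, the absolute error of the difference can reach about 1.0 · 10⁻³ u, which is roughly 1000 ppm of the 1 u difference. The same absolute error on a difference of 100 u would amount to only about 10 ppm, which is why small mass differences between large masses are affected most.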
This clustering problem could be overcome by determining the elemental composition of the individual masses and subsequently calculating with the exact masses, but then other problems related to exhaustive elemental composition determination would arise.

Nevertheless, the more frequently a mass difference occurs, the more likely it is that an annotation for it can be found. Hence the biological meaning of the frequent mass differences is sound. It also indicates that the selection method for frequent mass differences, i.e. clustering and picking the median, seems to work well within its bounds. The mentioned issue of cluster elongation might have an effect, but does not disturb the overall selection of the correct mass difference.

However, as mentioned above, edges which are hypothesized to be "noise edges" emerge. In the D. melanogaster network these are for example the edges belonging to mass differences 1, 3 and 4 from Table 8. Each of these mass differences could only be resolved to a very unlikely transformation. A more likely transformation might be assigned to the edges by incorporating more chemical elements into the search space, even though the considered elements are already the most likely elements in biology. Especially for the top mass differences, the incorporation of more uncommon elements is not expected.

Emerging Hubs are Compounds of Interest

In the reconstructed networks hub vertices emerge. These hubs are not the currency metabolites known to be hubs in metabolic networks, but must share some other common properties which make them important in this kind of network. It was already mentioned that the hub vertices are more likely to be identified by a database lookup and therefore were targets of research in the past.

It has been shown that there is a correlation between a vertex's degree and the number of isomers of its elemental composition that are present in metabolism.
This makes sense insofar as a vertex which represents many different compounds accumulates all connections of the individual compounds. It has also been shown that there is a negative correlation between a vertex's degree and its mass, i.e. vertices with a higher degree tend towards smaller masses.

The two correlations above imply a possible relationship between a metabolite's mass and the number of isomers of this mass which are present in metabolism. This correlation has not been investigated explicitly on real-world data. It is however remarkable, since simple combinatoric considerations suggest that bigger masses should allow for more possible isomers. Metabolism, in contrast, tends to use rather small compounds and their isomers.

From the observation of correlations alone, no hard conclusions can be drawn regarding their origin. Especially with three correlations, it is not clear which correlation arises for what reason. However, the possibility to identify a special class of vertices, whether of low or high degree, together with the fact that the degree carries some information, will aid research on FTMS metabolomics data.

Half of the identified hub vertices in the D. melanogaster network represent at least one isomer that carries an acid group (compare Table 7). Acids play an important role in metabolism and are a source of energy. The identification of all hub vertices and their roles in metabolism is an important next step.

Determining an Appropriate Abundance Cutoff

It was mentioned in chapter 3 that slight variations in the abundance cutoff should not heavily alter the network's structure. This can be seen in Tables 5 and 6 and in Figure 14. The parameters and the form of the distributions change slowly rather than abruptly. Therefore it is likely that the overall structure of the network remains stable.
However, cutoffs that differ more than slightly lead to differences in network structure when applied to the same data and, more importantly, datasets of different size require different cutoffs as well: for the S. cerevisiae dataset an abundance cutoff of 96 was estimated to yield an optimal network, and for the D. melanogaster dataset the value was 15. An outline of how to determine a proper abundance cutoff automatically is presented in the “Outlook” chapter.

5 Summary

In this thesis metabolomics FTMS data were used to reconstruct networks based on metabolites’ mass differences. The basic idea behind this approach is that the substrate and product of any metabolic reaction should be present in a cell. If the masses of both are measured, the resulting mass difference represents the transformation of elements which took place in the reaction. For a whole set of masses in a dataset, the pairwise connection by meaningful mass differences yields, in theory, a metabolic reaction network. Frequent reactions in metabolism should lead to frequently occurring mass differences, such that the reactions can be identified within the data and no external knowledge is required. If one wants to incorporate external data, the mass differences of common metabolic transformations can be used to reconstruct the network. In this work the focus was on networks reconstructed without external information.

It has been shown in this thesis that the reconstructed networks are not metabolic reaction networks and therefore cannot be analyzed with the standard methods available for metabolic networks. However, the reconstructed networks exhibit a non-random topology, and the intermediate step, the calculation of pairwise mass differences, is also of scientific interest. It was shown that frequently occurring mass differences have a biochemical meaning; the most frequent mass differences can usually be explained by a sequence of 1–6 chemical reactions.
Therefore the reconstructed networks should also contain information. Elaborating on this further led to the findings that network hubs (i.e. vertices with many neighbors) (a) tend to be known compounds, (b) comprise rather small metabolites and (c) represent a higher number of isomers than vertices with a low degree.

Many masses (about 80–90 %) in the original metabolomics datasets cannot be identified. The mass difference networks can aid the investigation and eventually the identification of these mass measurements. Additionally, ideas are developed in this thesis to further investigate the reconstructed networks and possibly even extract metabolic reaction networks from the given data.

6 Outlook

The logical next step after this work is the identification of as many masses and observed frequent mass differences as possible. Some interesting targets to start from have been highlighted in this work, and at the time of writing some of the created data are subject to analysis. The networks are probably invariant to minimal cut sets [44], such that this analysis can be done on these networks. This needs to be assessed in more detail. The work itself might be extended by some of the following proposals.

The rationale to optimize the reconstructed network for known parameters of biological networks was already presented in chapter 3.2.1. Instead of doing this manually, an automated procedure would calculate a “goodness value” for each abundance cutoff. This can be done efficiently, because lower abundance cutoffs only introduce more edges but do not alter the set of edges already present. With some more programming effort, the time-consuming step of calculating the clustering coefficients for all vertices can also be optimized, such that clustering coefficients are only recalculated when required.
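The nesting of edge sets that makes such a cutoff scan cheap can be illustrated with a minimal sketch. It assumes that the abundance cutoff is the minimum number of times a mass difference must occur before its edges are kept, and it groups mass differences by simple rounding instead of the hierarchical clustering used in this work. The function names and the toy masses are hypothetical.

```python
from itertools import combinations

def group_pairs_by_difference(masses, decimals=3):
    """Map each rounded mass difference to the vertex pairs that show it.

    Rounding is a stand-in for the clustering step (assumption).
    """
    groups = {}
    for low, high in combinations(sorted(masses), 2):
        groups.setdefault(round(high - low, decimals), []).append((low, high))
    return groups

def edges_for_cutoff(groups, cutoff):
    """Keep the pairs whose mass difference occurs at least `cutoff` times."""
    return {pair
            for pairs in groups.values() if len(pairs) >= cutoff
            for pair in pairs}

# Hypothetical toy masses, not measured data.
masses = [100.000, 102.016, 104.032, 118.011, 120.027, 150.000]

# The pairwise differences are computed once and reused for every cutoff.
groups = group_pairs_by_difference(masses)
networks = {cutoff: edges_for_cutoff(groups, cutoff) for cutoff in (3, 2, 1)}

# Lowering the cutoff only adds edges; existing edges are never removed.
assert networks[3] <= networks[2] <= networks[1]
```

Because each network is a superset of the one obtained with the next higher cutoff, a scan from high to low cutoffs only has to process the newly added edges, which is what would allow per-vertex quantities to be updated incrementally.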
To assess the power-law property of each network’s distributions, a power law can be fitted to the distributions and the quality of fit can be determined with the Kolmogorov–Smirnov test. A procedure which then maximizes the goodness of fit and the overall clustering coefficient can automatically yield abundance cutoffs for arbitrary data.

Even though positive results could be obtained using the employed clustering method, the determination of frequent mass differences might be improved by other clustering algorithms. As mentioned, the clustering method used (single linkage hierarchical clustering) tends to elongate clusters. Since the data are one-dimensional, this problem becomes even more severe. Complete linkage clustering instead of single linkage would probably result in narrower clusters and a more precise allocation of the mass difference. Other distance measures for clusters exist which can be applied to the data, and a comprehensive analysis of clustering results might lead to an improved clustering method for the present data.

In addition to improved clustering, the method can be improved by filtering for “noise” edges. This would require the exhaustive identification of frequent mass differences, such that unidentified edges can be removed. A pure database lookup is infeasible. Instead, the enumerative combinatorial method for identifying a mass difference described in chapter 2.5 has to be employed. An efficient algorithm for this problem might be obtained by modifying existing software which calculates elemental compositions for a given mass. Using the heuristic that fewer atom transfers are more likely to explain a reaction or sequence of reactions, mass differences which are more likely to represent chemical and biological causalities can be identified.

Another remaining but interesting analysis is to confirm that the estimated 90 % of frequent mass differences actually represent a whole sequence of reactions.
This can be done by calculating linear combinations of 1–6 (± some allowance) KEGG reaction pairs and matching these distances to the frequent mass differences. In this case, reaction pairs involving transformations through current metabolites [11] should be removed to ensure that only chemically possible reaction pathways are found.

The clustering coefficients and their distribution suggest the presence of clusters in the networks. Communities in complex networks are always of interest, and members of a community share common properties in biological networks [30, 47]. In the reconstructed networks these communities might represent actual metabolic pathways. This hypothesis arises from the estimate that most of the mass differences represent a sequence of reactions and from the considerations depicted in Figure 19. A logical follow-up project would be to employ common graph clustering methods to find either scientific evidence for this hypothesis or at least another possible explanation for such communities.

Regardless of any deeper information content in the reconstructed networks, the mere calculation of mass differences and their representation as a network can aid mass spectrometric research, and there is a great demand for software in the mass spectrometry and metabolomics community [3, 21, 48]. The developed software has to be extended to be more user-friendly and ideally made available online. The networks need to be visualized in a dynamic and easily modifiable way, and information about vertices and edges has to be clearly presented.

The mentioned exponential random graph (ERG) models could be used to reassess the similarity between reconstructed networks and known metabolic networks. If a set of feasible descriptors for metabolic networks can be identified, the reconstruction method could be re-evaluated and possibly improved.
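The check proposed earlier in this chapter, matching combinations of 1–6 reaction-pair mass differences against a frequent mass difference, might be sketched as follows. This is only an illustration under stated assumptions: the four step masses are monoisotopic values of common small transformations used as stand-ins and are not taken from KEGG, each step may be added or removed, and the function name `explain` and the tolerance are hypothetical.

```python
from itertools import combinations_with_replacement

# Illustrative monoisotopic mass differences in Da (stand-ins, not KEGG data).
STEPS = {
    "H2": 2.015650,
    "CH2": 14.015650,
    "H2O": 18.010565,
    "CO2": 43.989830,
}

def explain(target, steps=STEPS, max_len=6, tol=0.001):
    """Return the shortest step combinations whose signed sum matches target."""
    # Every step can be applied forwards (+) or backwards (-).
    signed = [(sign + name, factor * mass)
              for name, mass in steps.items()
              for sign, factor in (("+", 1.0), ("-", -1.0))]
    for length in range(1, max_len + 1):
        hits = [tuple(name for name, _ in combo)
                for combo in combinations_with_replacement(signed, length)
                if abs(sum(mass for _, mass in combo) - target) <= tol]
        if hits:
            # Fewer steps are assumed to be the more likely explanation.
            return hits
    return []

print(explain(32.026215))  # candidate explanations for one mass difference
```

Returning only the shortest combinations mirrors the heuristic stated above that fewer transformations are the more likely explanation. A real analysis would replace `STEPS` with the reaction-pair mass differences after removing pairs that run through current metabolites.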
Concluding this thesis, the originally assessed method from [16] turned out to be a naive idea, not feasible for metabolic network reconstruction. Nevertheless, many new and interesting starting points for the analysis of metabolomics FTMS data have been found, employing mass differences and constructing networks out of them.

References

[1] V. Hatzimanikatis, C. Li, J. A. Ionita, and L. J. Broadbelt, “Metabolic networks: enzyme function and metabolite structure,” Curr Opin Struct Biol, vol. 14, pp. 300–306, Jun 2004.
[2] O. Fiehn and W. Weckwerth, “Deciphering metabolic networks,” Eur J Biochem, vol. 270, pp. 579–588, Feb 2003.
[3] O. Fiehn, “Metabolomics – the link between genotypes and phenotypes,” Plant Mol Biol, vol. 48, pp. 155–171, Jan 2002.
[4] M. Arita, “The metabolic world of Escherichia coli is not small,” Proc Natl Acad Sci U S A, vol. 101, pp. 1543–1547, Feb 2004.
[5] W. B. Dunn, “Current trends and future requirements for the mass spectrometric investigation of microbial, mammalian and plant metabolomes,” Phys Biol, vol. 5, no. 1, p. 11001, 2008.
[6] J. Förster, I. Famili, B. O. Palsson, and J. Nielsen, “Large-scale evaluation of in silico gene deletions in Saccharomyces cerevisiae,” OMICS, vol. 7, no. 2, pp. 193–202, 2003.
[7] C. Bro, B. Regenberg, J. Förster, and J. Nielsen, “In silico aided metabolic engineering of Saccharomyces cerevisiae for improved bioethanol production,” Metab Eng, vol. 8, pp. 102–111, Mar 2006.
[8] M. Kanehisa, M. Araki, S. Goto, M. Hattori, M. Hirakawa, M. Itoh, T. Katayama, S. Kawashima, S. Okuda, T. Tokimatsu, and Y. Yamanishi, “KEGG for linking genomes to life and the environment,” Nucleic Acids Res, vol. 36, pp. D480–D484, Jan 2008.
[9] P. D. Karp, I. M. Keseler, A. Shearer, M. Latendresse, M. Krummenacker, S. M. Paley, I. Paulsen, J. Collado-Vides, S. Gama-Castro, M. Peralta-Gil, A. Santos-Zavaleta, M. I. Peñaloza-Spínola, C. Bonavides-Martinez, and J.
Ingraham, “Multidimensional annotation of the Escherichia coli K-12 genome,” Nucleic Acids Res, vol. 35, no. 22, pp. 7577–7590, 2007.
[10] R. Guimerà and L. A. N. Amaral, “Functional cartography of complex metabolic networks,” Nature, vol. 433, pp. 895–900, Feb 2005.
[11] H. Ma and A. Zeng, “Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms,” Bioinformatics, vol. 19, pp. 270–277, Jan 2003.
[12] C. A. Ouzounis and P. D. Karp, “Global properties of the metabolic map of Escherichia coli,” Genome Res, vol. 10, pp. 568–576, Apr 2000.
[13] N. C. Duarte, M. J. Herrgård, and B. Ø. Palsson, “Reconstruction and validation of Saccharomyces cerevisiae iND750, a fully compartmentalized genome-scale metabolic model,” Genome Res, vol. 14, pp. 1298–1309, Jul 2004.
[14] A. G. Smart, L. A. N. Amaral, and J. M. Ottino, “Cascading failure and robustness in metabolic networks,” Proc Natl Acad Sci U S A, vol. 105, pp. 13223–13228, Sep 2008.
[15] M. Arita, “Metabolic reconstruction using shortest paths,” Simulation Practice and Theory, vol. 8, pp. 109–125, 2000.
[16] R. Breitling, S. Ritchie, D. Goodenowe, M. L. Stewart, and M. P. Barrett, “Ab initio prediction of metabolic networks using Fourier transform mass spectrometry data,” Metabolomics, vol. 2, pp. 155–164, 2006.
[17] D. J. H. Gross, Mass Spectrometry - A Text Book. Springer, 2004.
[18] A. Aharoni, C. H. R. de Vos, H. A. Verhoeven, C. A. Maliepaard, G. Kruppa, R. Bino, and D. B. Goodenowe, “Nontargeted metabolome analysis by use of Fourier transform ion cyclotron mass spectrometry,” OMICS, vol. 6, no. 3, pp. 217–234, 2002.
[19] J. Amster, “Fourier transform mass spectrometry,” Journal of Mass Spectrometry, vol. 31, pp. 1325–1337, 1996.
[20] A. G. Marshall, C. L. Hendrickson, and G. S. Jackson, “Fourier transform ion cyclotron resonance mass spectrometry: a primer,” Mass Spectrom Rev, vol. 17, no. 1, pp. 1–35, 1998.
[21] R. M. A. Heeren, A. J.
Kleinnijenhuis, L. A. McDonnell, and T. H. Mize, “A mini-review of mass spectrometry using high-performance FTICR-MS methods,” Anal Bioanal Chem, vol. 378, pp. 1048–1058, Feb 2004.
[22] W. J. Griffiths, A. P. Jonsson, S. Liu, D. K. Rai, and Y. Wang, “Electrospray and tandem mass spectrometry in biochemistry,” Biochem J, vol. 355, pp. 545–561, May 2001.
[23] K. D. Henry, E. R. Williams, B. H. Wang, F. W. McLafferty, J. Shabanowitz, and D. F. Hunt, “Fourier-transform mass spectrometry of large molecules by electrospray ionization,” Proc Natl Acad Sci U S A, vol. 86, pp. 9075–9078, Dec 1989.
[24] J. L. Gross and J. Yellen, Handbook of Graph Theory. CRC Press, 2003.
[25] R. A. Zubarev, P. Håkansson, and B. Sundqvist, “Accuracy requirements for peptide characterization by monoisotopic molecular mass measurements,” Anal. Chem., vol. 68, pp. 4060–4063, 1996.
[26] K. Suhre and P. Schmitt-Kopplin, “MassTRIX: mass translator into pathways,” Nucleic Acids Res, vol. 36, pp. W481–W484, Jul 2008.
[27] A. S. N. Seshasayee, G. M. Fraser, M. M. Babu, and N. M. Luscombe, “Principles of transcriptional regulation and evolution of the metabolic system in E. coli,” Genome Res, vol. 19, pp. 79–91, Jan 2009.
[28] G. Thomas, J. Zucker, S. Macdonald, A. Sorokin, I. Goryanin, and A. Douglas, “A fragile metabolic network adapted for cooperation in the symbiotic bacterium Buchnera aphidicola,” BMC Syst Biol, vol. 3, p. 24, Feb 2009.
[29] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A. L. Barabási, “The large-scale organization of metabolic networks,” Nature, vol. 407, pp. 651–654, Oct 2000.
[30] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási, “Hierarchical organization of modularity in metabolic networks,” Science, vol. 297, pp. 1551–1555, Aug 2002.
[31] R. Albert and A.-L. Barabási, “Statistical mechanics of complex networks,” Reviews of Modern Physics, vol. 74, pp. 47–97, 2002.
[32] A.-L. Barabási and Z. N.
Oltvai, “Network biology: understanding the cell’s functional organization,” Nat Rev Genet, vol. 5, pp. 101–113, Feb 2004.
[33] Z. M. Saul and V. Filkov, “Exploring biological network structure using exponential random graph models,” Bioinformatics, vol. 23, pp. 2604–2611, Oct 2007.
[34] A. Kreimer, E. Borenstein, U. Gophna, and E. Ruppin, “The evolution of modularity in bacterial metabolic networks,” Proc Natl Acad Sci U S A, vol. 105, pp. 6976–6981, May 2008.
[35] M. Sales-Pardo, R. Guimerà, A. A. Moreira, and L. A. N. Amaral, “Extracting the hierarchical organization of complex systems,” Proc Natl Acad Sci U S A, vol. 104, pp. 15224–15229, Sep 2007.
[36] S. Daunert, P. G. ang G. Gauglitz, K. G. Heumann, K. Jinno, A. Sanz-Medel, and S. A. Wise, “Analytical and bioanalytical chemistry,” Nov. 2007. Vol. 389 No. 5.
[37] X. Feng and M. M. Siegel, “FTICR-MS applications for the structure determination of natural products,” Anal Bioanal Chem, vol. 389, pp. 1341–1363, Nov 2007.
[38] R Development Core Team, R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2007. ISBN 3-900051-07-0.
[39] L. Chen, S. K. Sze, and H. Yang, “Automated intensity descent algorithm for interpretation of complex high-resolution mass spectra,” Anal Chem, vol. 78, pp. 5006–5018, Jul 2006.
[40] A. Nayak and I. Stojmenovic, Handbook of applied algorithms: solving scientific, engineering, and practical problems. John Wiley, 2008.
[41] S. Theussl and K. Hornik, Rglpk: R/GNU Linear Programming Kit Interface, 2009.
[42] N. Hertkorn, M. Frommberger, M. Witt, B. Koch, P. Schmitt-Kopplin, and E. Perdue, “Natural organic matter and the event horizon of mass spectrometry,” Anal Chem, Oct 2008.
[43] K. Ishimura and H. Fujita, “Light and electron microscopic immunohistochemistry of the localization of adrenal steroidogenic enzymes,” Microsc Res Tech, vol. 36, pp. 445–453, Mar 1997.
[44] S. Klamt and E. D.
Gilles, “Minimal cut sets in biochemical reaction networks,” Bioinformatics, vol. 20, pp. 226–234, Jan 2004.
[45] R. Steuer, “Computational approaches to the topology, stability and dynamics of metabolic networks,” Phytochemistry, vol. 68, no. 16-18, pp. 2139–2151, 2007.
[46] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: simple building blocks of complex networks,” Science, vol. 298, pp. 824–827, Oct 2002.
[47] K. Schreiber, “Correlation of modular structures in biological networks,” Bachelor’s thesis, Ludwig Maximilians Universität and Technische Universität München, 2005.
[48] P. C. Dorrestein and N. L. Kelleher, “Dissecting non-ribosomal and polyketide biosynthetic machineries using electrospray ionization Fourier-transform mass spectrometry,” Nat Prod Rep, vol. 23, pp. 893–918, Dec 2006.