Download the use of bayesian statistics in mass spectrometry data

THE USE OF BAYESIAN STATISTICS IN MASS SPECTROMETRY DATA Literature research Elli Karampini Supervisor: Gabriel Vivό-Truyols CONTENTS PART I CHAPTER 1: BAYESIAN STATISTICS 1.1. Bayes’ Theorem 1.2. Bayesian network CHAPTER 2: MASS SPECTROMETRY 2.1. Introduction to mass spectrometry 2.2. Instrumentation of mass spectrometry PART II CHAPTER 3: BAYESIAN STATISTICS IN PROTEOMIC STUDIES 3.1. Data pretreatment 3.2. Data treatment CHAPTER 4: BAYESIAN STATISTICS IN METABOLOMIC STUDIES 4.1. Applications of Bayesian approach in metabolomics CHAPTER 5: BAYESIAN STATISTICS IN FORENSIC SCIENCES 5.1. Bayes’ Theorem and forensic evidence 5.2. Mass spectrometry and Bayesian statistics in forensic sciences CHAPTER 6: BAYESIAN STATISTICS IN VARIOUS APPLICATIONS PART III CHAPTER 7: A CRITICAL REVIEW ACKNOWLEDGEMENTS REFERENCES 1 To my parents, Alexandros and Mary, and Achilles 2 PART I 3 CHAPTER 1: BAYESIAN STATISTICS Bayesian statistics is used to differentiate the sub-category of statistics in which the evidence of the true state of a hypothesis is given in degrees of belief and more specifically in Bayesian probabilities. The term Bayesian has been adopted to honor Thomas Bayes, although Pierre-Simon Laplace had independently worked on the same subject under the name the probability cause. There are two main schools in calculating probabilities. From a frequentist approach, the definition of probability value is the chance of observing this data or more extreme given the fact that the hypothesis is true. This type of probability is calculated through a frequentist test, such as a t-test of one or two means. However, it does not answer the question “is my hypothesis correct?” but rather takes it as a given. On the other hand, frequentist approach ensures a certain probability of failure of the whole procedure. In other words, although it cannot answer the question “is my hypothesis correct?”, it can assure that, following the procedures of the frequentist test, the frequency of wrong answers is under control. By contrast, the answer to “is my hypothesis correct?” can be given through a Bayesian approach. More specifically, Bayes’ theorem determines the probability that the hypothesis is correct, using the mathematical formula of Bayes’ theorem. This process combines the background information I with the outcome (data) D of an experiment and the probability of the hypothesis before considering the data in order to assess the probability value of the hypothesis itself P(𝜃|𝐷, 𝐼) [1]. The purpose of the thesis is to make a critical review concerning the recent developments (around the last 5 years) in applying Bayesian statistics with respect to analysis with mass spectrometry instruments. Firstly, a short introduction about Bayesian statistics and mass spectrometry will be given in part I. In part II different articles, relative to the topic, from a broad spectrum of analytical fields, such as proteomics, metabolomics and forensic sciences, will be discussed and finally, a critical review in part III will be presented. 4 1.1. Bayes’ theorem Bayes’ theorem is a rule which indicates how to treat conditional probabilities. Conditional probability of an event is defined as the probability obtained with the additional information that another event has already occurred [1]. It was first introduced by Thomas Bayes (1701-1761) and published under the name “An Essay towards solving a Problem in the Doctrine of Chances” in 1763, two years after his death. It was his friend Richard Price (1723-1791) who communicated the paper through John Canton (1718-1772) to the Royal Society [2]. Bayes’ theorem is based on the two fundamental rules of probability theory, the product and the addition rule. The first one defines the joint probability of two or more propositions, by means of the following equation P(𝑥, 𝑦|𝐼) = P(𝑥|𝐼) P(𝑦|𝑥, 𝐼) = P(𝑦|𝐼) P(𝑥|𝑦, 𝐼) where x and y are the propositions, which are interchangeable , I the background information and Pr(𝑥, 𝑦|𝐼) is the probability of x and y conditional on I. The later one can be expressed as 𝑀 𝑀 P({𝑦𝑗 }|𝐼) = ∑ P({𝑦𝑗 }, {𝑥𝑖 }|𝐼) = ∑ P({𝑥𝑖 }|𝐼) P({𝑦𝑖 }|{𝑥𝑖 }, 𝐼) 𝑖=1 𝑖=1 𝑁 𝑁 and, due to symmetry P({𝑥𝑖 }|𝐼) = ∑ P({𝑥𝑖 }, {𝑦𝑖 }|𝐼) = ∑ P({𝑦𝑖 }|𝐼) P({𝑥𝑖 }|{𝑦𝑖 }, 𝐼) 𝑗=1 𝑗=1 where {𝑥𝑖 : 𝑖 = 1,2,3, … , 𝑀} and {𝑦𝑖 : 𝑗 = 1,2,3, … , 𝑁} are sets of propositions with M≠N. In this case the marginalization of joint probabilities of a discrete set of variables is defined, also known as Total Probability Theorem. The same applies for continuous variables, such that 𝑥 ∈ 𝑋, 𝑦 ∈ 𝑌 and 𝑋, 𝑌 ⊆ ℝ, P(𝑥|𝐼) = ∫P(𝑥, 𝑦|𝐼) 𝑑𝑦 𝑌 and P(𝑦|𝐼) = ∫ P(𝑥, 𝑦|𝐼) 𝑑𝑥 𝑋 where, as explained in Armstrong et al., “marginalization can be considered as integrating out unnecessary variables” [1]. 5 By putting the above mentioned probability rules together, we arrive at the Bayes theorem’s mathematical formula. If the purpose is to determine the probability of a continuous parameter θ, (for instance the mean μ and/or the standard deviation σ), given the data D and the background information I, then we can express the joint probability of the θ and D, given I as so: P(𝜃, 𝐷|𝐼) = P(𝜃|𝐼) P(𝐷|𝜃, 𝐼) = P(𝐷|𝛪) P(𝜃|𝐷, 𝐼) Due to the equality of the right part of the equation, we can arrive at: P(𝜃|𝐷, 𝐼) = P(𝜃|𝛪) P(𝐷|𝜃, 𝐼) P(𝐷|𝐼) where P(𝐷|𝐼) = ∫𝜃 P(𝜃, 𝐷|𝐼)𝑑𝜃 with 𝜃 ⊆ ℝ. The above formula is the Bayes’ theorem mathematical equation and each term has its own meaning. Firstly, P(𝜃|𝐷, 𝐼) is known as the posterior probability which asserts the plausibility of θ, given D and I. Secondly, P(𝜃|𝛪) is called prior probability, which is the initial probability value before any additional information, e.g. the data D, is taken into account. In other words, it is the plausibility of θ before conducting the experiment. The numerator of the fraction P(𝐷|𝜃, 𝐼) is the likelihood probability and quantifies the plausibility of D given the θ and I, while the denominator P(𝐷|𝐼) plays the role of normalization factor [1]. In the case of hypothesis testing, where there are two hypotheses, 𝐻1 and 𝐻2 under investigation, given a set of data D, the Bayesian formula for each one is as follows: 𝑃(𝐻1 |𝐷) = 𝑃(𝐻1 ) 𝑃(𝐷|𝐻1 ) 𝑃(𝐷) 𝑃(𝐻2 |𝐷) = 𝑃(𝐻2 ) 𝑃(𝐷|𝐻2 ) 𝑃(𝐷) and By dividing these two equations, we get: 𝑃(𝐻1 |𝐷) 𝑃(𝐻1 ) 𝑃(𝐷|𝐻1 ) = ∗ 𝑃(𝐻2 |𝐷) 𝑃(𝐻2 ) 𝑃(𝐷|𝐻2 ) which can be summarized as: 𝑝𝑜𝑠𝑡𝑒𝑟𝑖𝑜𝑟 𝑜𝑑𝑑𝑠 = 𝑝𝑟𝑖𝑜𝑟 𝑜𝑑𝑑𝑠 ∗ 𝑙𝑖𝑘𝑒𝑙𝑖ℎ𝑜𝑜𝑑 𝑟𝑎𝑡𝑖𝑜 [2, 3] Although Bayes’ theorem was known from the 18th century, a remarkable increase in the employment of this theory in different fields—such as Chemistry and Physics—has been recorded in the last two decades. To illustrate this, a 6 search for the terms “Bayesian” or “Bayes” was conducted in the scientific search engine Web of Science. It was found that 106,585 out of 110,663 relevant articles were dated after 1990 and from those 90,013 are after 2000. This sudden increase in interest might be due to the rise of computational power in the last decades. 1.2. Bayesian network The probabilities play a crucial role in pattern recognition and it is highly beneficial to improve the analysis using diagrammatic representations of dependences between variables, known as probabilistic graphical models. The graphic models offer useful properties such as a simple way of a probabilistic model’s visualization and insights into the model’s properties, including conditional and independent properties. The Bayesian network, also called directed graphical model, belongs to this category. As far the graphic illustration is concerned, the diagram consists of nodes, which represent random variables or sets of variables, connected by arcs, which express probabilistic relationships between these variables (4). Consider for example the joint probability distribution over three variables, a, b, c e.g. 𝑝(𝑎, 𝑏, 𝑐). By applying the product rule to the joint distribution the result is: 𝑝(𝑎, 𝑏, 𝑐) = 𝑝(𝑐|𝑎, 𝑏) ∗ 𝑝(𝑎, 𝑏) And by applying the product rule for the second time, the right-part of the above equation becomes: 𝑝(𝑎, 𝑏, 𝑐) = 𝑝(𝑐|𝑎, 𝑏) ∗ 𝑝(𝑏|𝑎) ∗ 𝑝(𝑎) The latter equation can be described as a graphical model by first introducing nodes for each variable and associating each one with the corresponding conditional distribution. For 𝑝(𝑐|𝑎, 𝑏) there will be arcs from a and b to node c, for 𝑝(𝑏|𝑎) there will be an arc from a to b, and finally for 𝑝(𝑎) Figure 1: Bayesian model for three variables a, b, c with their conditional distribution, represented as arcs [4]. there will be no incoming links. In the case that there is a link from node x to node y, 7 then the node x is called parent of node y, while the node y is called child of node x. The link indicates that the probability of y is dependent on x, in other words, that 𝑝(𝑦|𝑥) can adopt some value. The example discussed above, can be extended for K variables. The joint probability over K variables will be given by: 𝑝(𝑥1 , 𝑥2 , … , 𝑥𝑘 ) = 𝑝(𝑥𝑘 |𝑥1 , … , 𝑥𝑘−1 ) ∗ … ∗ 𝑝(𝑥2 |𝑥1 ) ∗ 𝑝(𝑥1 ) The equation can be presented as a direct graph with K nodes, one for each conditional distribution on the right-hand side. Each node will have incoming arcs from the lower numbered nodes, therefore, this graph is called fully connected [4]. There also cases where there is absence of arcs, as shown in Fig.2. 𝑥1 , 𝑥2 and 𝑥3 are parent nodes and there is no direct linkage between 𝑥6 or 𝑥7 with the parent nodes. This absence provides interesting information about the properties of the class of distributions that the graph represents. The decomposition of joint distribution of these seven variables is given by: Figure 2: Directed acyclic graph of seven variables, with three parent nodes and no direct linkage between 𝑥6 𝑜𝑟 𝑥7 to the parent nodes[4]. 𝑝(𝑥1 , 𝑥2 , … , 𝑥7 ) = 𝑝(𝑥1 ) ∗ 𝑝(𝑥2 ) ∗ 𝑝(𝑥3 ) ∗ 𝑝(𝑥4 |𝑥1 , 𝑥2 , 𝑥3 ) ∗ 𝑝(𝑥5 |𝑥1 , 𝑥3 ) ∗ 𝑝(𝑥6 |𝑥4 ) ∗ 𝑝(𝑥7 |𝑥4 , 𝑥5 ) The rule can be generalized, thus, for a graph with K nodes, the joint distribution is given by: 𝐾 𝑝(𝑥) = ∏ 𝑝(𝑥𝑘 |𝑝𝑎𝑘 ) 𝑘=1 where 𝑝𝑎𝑘 denotes the set of parents of 𝑥𝑘 and 𝑥 = {𝑥1 , … , 𝑥𝑘 }. At this point an important restriction should be mentioned. There are no closed paths within the graph such that the movement from node to node following the direction of the arcs and ending back to the starting node is not possible. These graphs are commonly referred to as Directed Acyclic Graphs (DAGs) [4]. 8 CHAPTER 2: MASS SPECTROMETRY 2.1. Introduction to mass spectrometry Mass spectrometry (MS) is an analytical technique that enables the identification and quantification of compounds of interest by measuring the massover-charge ratio (m/z) and abundance of ions in gas phase. Mass spectrometry started to be employed in experiments over a century ago. Joseph John Thomson was the first to discover in 1910 that each charged particle followed its own path to the detector, which was a photographic plate. The first experiments were done on hydrogen and later on other atoms and molecules of carbon, oxygen and nitrogen were used. He argued that none of the particles, unless they share the same velocity and charge-over-mass ratio (e/m), would strike the detector’s plate at the same time. By inspecting the plate and knowing one of the parabolic paths that a set of particles, which share the same velocity and e/m, had followed, the e/m of the other particles would be deducted. This is considered to be the birth of mass spectrometry, and important features discussed by Thomson still remain relevant [5]. Through mass spectrometry, scientists are able to measure atomic and molecular weights of species in complex samples, analyze compounds at low level samples, as well as analyze compounds without the initial steps of sample purification. These advantages are more significant than the few disadvantages, such as the loss of the sample once it is analyzed through MS [5]. Mass spectrometry (MS) can be coupled to several different techniques, such as liquid chromatography and gas chromatography, to gain more information about the sample. MS can be coupled also to itself with the result being tandem mass spectrometry (MS/MS). Tandem mass spectrometry was developed in order to acquire structural information about the analyte by coupling mass spectrometers in series Figure 3: Schematic representation of a tandem mass spectrometry experiment [5]. between the ion source and the detector [5, 6]. The principle 9 behind this type of experiment is simple. The targeted compound is ionized and its characteristic ions are separated from the mixture in the first mass spectrometer. The selected primary ions, also known as parent or precursor ions, are then decomposed in the dissociation region, resulting in fragment (historically known as daughter) ions, which are analyzed by the last mass spectrometer reaching the detector. Tandem mass spectrometry can achieve high specificity and sensitivity retaining the advantage of a fast response time [6]. 2.2. Instrumentation of mass spectrometry A mass spectrometer has three main compartments: a) the ionization source, b) the mass analyzer and c) the ion detectors [5]. One of the key performance indicators of spectrometer resolving a mass is the power, which means the minimum mass difference that can be separated from a given mass. Other performance indicators are the mass Figure 4: Characteristics of the performance of a mass spectrometer [7]. accuracy, which indicates the accuracy of the determination of the real mass of a compound, the sensitivity, which is expressed as the signal-to-noise ratio and finally the linear dynamic range (fig. 4)[7]. The ionization source serves the role of converting the analytes (M) into ions. The ionization occurs either when an electron is removed or added yielding a radical (M·+ or M·- respectively). It can also take place when charged species (e.g. H+) are added or subtracted resulting in [M+H]+ or [M-H]-. The ionization source is also responsible for transferring the ions into gas phase before they are introduced into the mass analyzer. There are different types of ionization sources: a) electron (impact) ionization (EI), b) chemical ionization (CI), c) electrospray ionization (ESI), d) atmospheric pressure chemical ionization (APCI), e) atmospheric pressure photoionization (APPI) and f) matrix-assisted laser desorption ionization (MALDI). EI is considered as “hard” ionization, resulting in 10 considerable fragmentation of the molecular ion, while ESI and MALDI are mostly employed when the analytes are biological macromolecules, which are easily degraded under harsh conditions [5, 7]. The mass analyzer is the central part of the mass spectrometer, with its role being to separate the ions with respect to their mass-over-charge ratio, providing a defined mass accuracy and resolution. There are various types of mass analyzers with differences in concept and performance, such as the quadrupole mass filter, the time-of-flight analyzer (ToF), the ion trap analyzer and the Orbitrap system. Often, they work as autonomous mass analyzers, however, the current trend points to the direction of hyphenated systems in order to increase the strength of the mass spectrometer by combining their advantages. Some of these systems are the triple-quadrupole mass filter, the quadrupole- time-of-flight (Q-ToF) system and the time-of-flight-time-of-flight (ToF-ToF) analyzer, with the last one being strongly connected to the analysis of biomolecules [7]. Finally, the ion detector is a device that is able to generate electrical current, whose intensity is proportional to the abundance of the ions. Ions exiting the mass analyzer can either be directed to a single-channel detector or dispersed to a focalplate (array) detector. Quadrupole or ion trap instruments are equipped with single-channel detectors, whereas time-of-flight systems use focal-plate detectors. The generated electrical current is subsequently digitized and the signal is transferred to a computer, where the data can be managed and stored [7]. 11 PART II 12 CHAPTER 3: BAYESIAN STATISTICS IN PROTEOMIC STUDIES Proteomics is the large-scale study of the proteome. The term proteome refers to the entire set of proteins, including their modifications, produced by an organism or a cellular system. The main goal of proteomics is the comprehensive and quantitative description of protein expression levels under the influence of environmental changes, such as drug treatment or disease [8]. The main analytical approach involves Mass Spectrometry (MS) with mild ionization. For the purification or the separation of the sample, prior to the analysis, usually Liquid Chromatography (LC) is selected. The data, derived from a MS system or LC-MS system, are very complex and their further analysis can be divided into two categories: data pretreatment and data treatment. Each one, with respect to Bayesian statistics, is discussed in the following sections. 3.1. Data pretreatment The data pretreatment is an essential step before the data can be available for further analysis, which in case of proteomic studies is strongly related to the discovery of biomarkers, drug development and disease classification. The data pretreatment usually consists of several stages, including denoising, peptide detection and spectra alignment. All mass spectra and tandem mass spectra contain, apart from the peaks of peptide fragments that are considered as useful signals, peaks of instrument noise and contaminants. This is especially the case when the data arrive from complex samples. So, the noise peaks should be removed for optimal matching to be successful [9]. Different publications can be found in which Bayesian statistics is used in the data pre-treatment step. For example, Shao et al. in 2013 proposed an approach based on Bayesian inference for denoising spectra, to build spectral libraries. They build a Bayesian classifier to make the distinction between signal (S) and noise (N) and train it so that no assumptions of peptide fragmentation behavior or instrumental settings are needed [9]. 13 The authors selected four different features that are peak’s characteristics and serve as good indicators of whether the peak 𝑖 (𝑖 = 1, 2, … , 𝑁) is signal or noise. The first feature was the rank, 𝐹𝑟 (𝑖), which is simply the intensity of the peak. For the most intense 𝐹𝑟 = 1; for the second most intense 𝐹𝑟 = 2 and so on. Since the intensities vary significantly from spectrum to spectrum, the team used the intensity rank as surrogate in order to avoid large changes in scale. The second selected feature was the m/z, i.e. 𝐹𝑚 (𝑖), which measures the relevant position of a peak in the spectrum. It is obvious that the probability of finding signal is not constant throughout the m/z range, but the exact trend is unknown prior to the experiment and it is discovered from the data. Finally, the compliment features, 𝐹𝑐,𝑍 , records whether a complimentary fragment ion can be found in the same spectrum, where the Z parameter denotes the sum of the assumed charges of the complimentary pair, and the sister features, 𝐹𝑠,𝛥 , determine the existence of a sister fragment ion in the spectrum. The sister peak is located at a distance Δ away from the peak of interest. This captures information concerning common neutral loss or isotopic ions [9]. Each peak is categorized as signal (S) or noise (N) by a consensus algorithm, according to whether or not it is consistently found across replicates. Since the peaks are labeled as S or N, the conditional probabilities 𝑃(𝐹𝑘 |𝑆) and 𝑃(𝐹𝑘 |𝑁) can be calculated for each feature 𝐹𝑘 mentioned above. These conditional probabilities constitute the Bayesian classifier and they can immediately be used to denoise singleton spectra or be written in a parameter file for future use. Given the conditional probabilities, by means of Bayes’ theorem, the computation of posterior probability of unlabeled peak is possible: 𝑃(𝑆|{𝐹𝑘 }) = 𝑃(𝑆)[∏𝑘 𝑃(𝐹𝑘 |𝑆)] 𝑃(𝑆)[∏𝑘 𝑃(𝐹𝑘 |𝑆)] + 𝑃(𝑁)[∏𝑘 𝑃(𝐹𝑘 |𝑁)] where 𝑃(𝑆) is the prior probability of peak to be a signal without any additional information and 𝑃(𝑁) = 1 − 𝑃(𝑆). The prior probability was predicted by the team, using a linear regression model with 16 features for any given spectrum. As in typical Bayesian classifier, the posterior probability can be subjected to an appropriate threshold to make a decision about the peak’s preservation in the denoised spectrum [9]. 14 The researchers claimed that the computed probabilities were reasonably accurate and their “denoiser” showed that the filtered spectra retained signal peaks and exhibit high similarity to their replicates, which indicates that their method would be a useful tool in spectral libraries. Additionally, the classifier is very flexible and can be subjected to further improvement by adding or modifying the selected features [9]. Peptide detection, whose main goal is to convert the raw spectra into a list of peptides and find the existence probability of each peptide candidate, has a direct effect on the subsequent analysis such as protein identification and quantification, biomarker discovery and classification of different samples. The difficulty is that peptides usually give several peaks in the spectra due to different charge states during ionization and isotopic peaks. To address this issue, in 2010, Sun et al. proposed a Bayesian approach for peptide detection (BPDA), which can be applied to MS data that have been generated by instruments with high enough resolution [10]. The authors used one-dimension mass spectra (1D MS) and the proposed approach can be considered as a three-step approach. The first step is to obtain a list of peptide candidates from the observed peaks and the second is to model the observed spectra, taking into account the N peptide candidates’ signal. The final step is to apply the algorithm on the fitted MS model to infer the best fitting peptide signals [10]. In a more detailed way, the spectra were first baseline corrected, the noise was filtered, peaks were detected using “mspeaks” (Matlab function) and a list of peptide candidates was generated. The mass of each peptide was produced by means of the following equation: 𝑚𝑎𝑠𝑠 = 𝑖(𝑑 − 𝑚𝑝𝑐 ) − 𝑗𝑚𝑛𝑡 , 𝑖 = 1, 2, … , 𝑐𝑠; 𝑗 = 1, 2, … , 𝑖𝑠𝑜 where mass is the mass of one peptide candidate, d denotes the value of m/z of a detected peak, 𝑚𝑝𝑐 is the mass of one positive charge, 𝑚𝑛𝑡 is the mass shift due to the addition of one neutron and the parameters 𝑖 and 𝑗 represent the charge state and isotopic positions respectively. Then, the researchers modeled the spectra, taking into account the different charge states and isotopic positions for each candidate and it also incorporates the probability of candidates’ existence and the 15 thermal noise. The signal of N peptide candidates was given by the following equation: 𝑁 𝑁 𝑐𝑠 𝑖𝑠𝑜 𝑦𝑚 = ∑ 𝜆𝑘 𝑔𝑘 (𝑥𝑚 ) + 𝜀𝑚 = ∑ 𝜆𝑘 ∑ ∑ 𝑐𝑘,𝑖𝑗 𝑓(𝑥𝑚 ; 𝜌𝑘,𝑖𝑗 ; 𝑎𝑘,𝑖𝑗 ) + 𝜀𝑚 𝑘=1 𝑘=1 𝑖=1 𝑗=0 𝑚 = 1,2, … , 𝑀 where 𝑥𝑚 is the m-th m/z in the spectrum, 𝑦𝑚 is the intensity at 𝑥𝑚 , M is the number of observations, 𝜀𝑚 is the noise (𝜀𝑚 ~𝑁(0, 𝜎 2 ), 𝑓(𝑥𝑚 ; 𝜌𝑘,𝑖𝑗 ; 𝑎𝑘,𝑖𝑗 ) is peak shape function, taken as Gaussian-shaped with 𝑎𝑘,𝑖𝑗 being the theoretical m/z value of the peak for the k-th candidate and 𝜌𝑘,𝑖𝑗 the peak’s width, 𝑐𝑘,𝑖𝑗 is the height of the peak of peptide k at i charge state and j isotopic position and, finally, 𝜆𝑘 is an indicator random variable, which is 1 if the peptide truly exists and 0 otherwise. The goal is to determine all the unknown parameters in the model (𝜃 ≜ {𝜆𝑘 , 𝑐𝑘,𝑖𝑗 ; 𝑘 = 1, … , 𝑁; 𝑖 = 1, … , 𝑐𝑠; 𝑗 = 1, … , 𝑖𝑠𝑜}), based on the observed spectrum 𝑦 = [𝑦1 , … , 𝑦𝑀 ]𝑇 and especially 𝜆𝑘 . Therefore, the Bayesian approach was employed to obtain the posterior probabilities of θ, 𝑃(𝜃|𝑦). The posterior probability of 𝜆𝑘 can be obtained by integrating the joint posterior probability over all parameter except for 𝜆𝑘 : 𝑃(𝜆𝑘 |𝑦, 𝜃−𝜆𝑘 ) ∝ 𝑝(𝜆𝑘 )𝑝(𝑦|𝜃) , where 𝜃−𝜆𝑘 ≜ 𝜃\𝜆𝑘 and for the computation the team chose Gibbs sampling method, which is a variant of Monde Carlo Markov Chain (MCMC) [10]. According to the authors, BPDA, their proposed model, which considers the charge state and isotopic positions, was positively compared to commercial and open-source software in terms of peptide detection, but it lacked in terms of computational time, since it was found time-consuming, especially when running under raw data mode [10]. Two years later, in 2012, the same team published a new paper, which can be described as the continuation of their first one. This time, their proposal concerned a peptide detection approach, but LC-MS data were used as input. They presented BPDA2d, a two-dimension (2D) Bayesian peptide detection algorithm to process the data more efficiently. BPDA2d shared the same core as BPDA, which is to evaluate all possible combinations of peptide candidates in order to minimize the mean square error (MSE) between inferred and observed spectra. The 16 difference between these algorithms is that BPDA models spectra along m/z dimension, while BPDA2d models spectra along both m/z and retention time (RT) dimensions [11]. After baseline correction, noise filtration and peak detection along the m/z axis, the authors added one more step before obtaining the list of peptide candidates. In this case, the detected 1D peaks were connected along the RT dimension. The 1D peaks were sorted according to their RT positions and if there were multiple peaks connected to the same RT only the one with the larger intensity was retained. Then they proceeded to the generation of peptide candidates followed their previously mentioned method. The model used to formulate the spectra was substantially the same apart from the addition of time as a parameter: 𝑁 𝑁 𝑐𝑠 𝑖𝑠𝑜 𝑦(𝑥𝑚 , 𝑡) = ∑ 𝜆𝑘 𝑔𝑘 (𝑥𝑚 , 𝑡) + 𝜀(𝑡) = ∑ 𝜆𝑘 ∑ ∑ 𝑐𝑘,𝑖𝑗 𝑙𝑘 (𝑡)𝐼𝑥𝑚=𝑎𝑘,𝑖𝑗 + 𝜀(𝑡) 𝑘=1 𝑘=1 𝑖=1 𝑗=0 𝑚 = 1, 2, … , 𝑀 and 𝑡 = 1, 2, … , 𝑇 where I is and indicator function (𝐼𝐴 = 1 if 𝐴 ≠ 0, 𝐼𝐴 = 0 otherwise) and 𝑙𝑘 is the normalized elution profile of the k-th peptide candidate. As in the previous article, the model takes into account charge state and isotopic position, but includes peptides’ elution peaks. It also incorporates existence’s probability of candidates and thermal noise. The authors calculated posterior probabilities of the unknown parameters (θ) of the model, using Bayes’ theorem, and focused on 𝜆𝑘 , the indicator random variable (𝜆𝑘 = 1 if the peptide truly exists, 𝜆𝑘 = 0 if not) by integrating the joint posterior probability, 𝑃(𝜃|𝑦), over the other parameters except for 𝜆𝑘 [11]. The authors claimed that their model, BPDA2d surpassed advanced software, such as msInspect and their previous proposal BPDA, in terms of sensitivity and detection accuracy. They also mentioned that their proposal is better suited for time-of-flight (TOF) data [11]. The alignment of spectra, which is necessary for the correction of experimental variations, a common problem in mass spectrometry (MS), is the basic step in the comparison of spectra. The alignment approaches can be divided into two categories: 1) the feature-based and 2) the profile-based approach. The first one is based on the distinction between signals from analytes and irrelevant 17 noise, which is the key point for a successful approach, also known as peak detection, and then the direct alignment of the detected peaks. On the other hand, the later one uses the whole spectrum to evaluate the experimental variation and adjust each one accordingly. The attempt is to find an alignment that minimizes the difference between all spectra and the reference spectrum [12, 13, 14]. The alignment of MS spectra was the purpose of a paper, written by Kong et al. in 2009, with an overall common goal to compare mean spectra across different patient populations, helpful for the biomarker discovery. The authors, advancing a previously developed method (Reilly et al. 2004), proposed a profile-based approach, using a parametric model in conjunction with Bayesian inference [12]. Firstly, the team started with normalization. The alignment model depends on the spectra’s abundances and, therefore, their variations, due to differences in sample preparation or matrix crystallization, needed to be minimized. The chosen method was one scaling factor to each MS run, following the notation of Wu (2004). The second step was the alignment model, given by the following equation: 𝑥𝑖 (𝑡) = 𝜃(𝜉𝑖 (𝑡)) + 𝜀𝑖 (𝑡) where 𝑥𝑖 (𝑡) denotes the height/intensity (on the log-scale) of each sample i at time t (which corresponds to a certain m/z in a ToF instrument), 𝜃(𝑡) is the average spectrum for this patient population at t, 𝜉𝑖 (𝑡) is the deforming function for the sample i at time t and, finally, 𝜀𝑖 (𝑡) is the random error. The restriction is that 𝜉𝑖 (𝑡) is monotone increasing, otherwise it is able to erase observed peaks in the spectra. The 𝜉𝑖 (𝑡) function is parameterized as a piecewise linear function with knots positioned at the locations (or subset of locations) where there are the observed data. The estimation of posterior modes of 𝜉1 (𝑡), 𝜉2 (𝑡), …, 𝜉𝑛 (𝑡) is done by minimizing 𝑛 𝑇−1 ∑∑ 𝑖=1 𝑗=1 1 |𝐸𝑗 | {∫ [𝑥𝑖 (𝑡) − 𝜃(𝜉𝑖 (𝑡))]2 𝑑𝑡 + 𝐸𝑗 𝜎2 ∫ [𝜉 (𝑡) − 𝑡]2 𝑑𝑡} 𝜏 2 𝛦𝑗 𝑖 where the team assumed that the least squares of 𝑥𝑖 (𝑡) − 𝜃(𝜉𝑖 (𝑡)) over 𝐸𝑗 (∫𝐸 [𝑥𝑖 (𝑡) − 𝜃(𝜉𝑖 (𝑡))]2 𝑑𝑡) are independent distributed as 𝜎 2 |𝛦𝑗 |𝜒𝑙2 and the least 𝑗 squares of 𝜉𝑖 (𝑡) − 𝑡 over 𝐸𝑗 are also independent distributed as 𝜏 2 |𝛦𝑗 |𝜒𝑙2. The 𝜉𝑖 (𝑡) function needs only to be defined for t∈T (𝑡1 , 𝑡2 , … , 𝑡𝑇 ), 𝐸𝑗 is the partition of [𝑡1 = 𝑚𝑖𝑛𝑡∈𝑇 , 𝑡𝑇 = 𝑚𝑎𝑥𝑡∈𝑇 ] defined by the location of knots of 𝜉𝑖 (𝑡). The above 18 equation is subjected to two conditions: 1) 𝜉𝑖 (𝑡𝑗 ) < 𝜉𝑖 (𝑡𝑗+1 ) in order to guarantee that 𝜉𝑖 (𝑡)is strictly monotone increasing and 2) if 𝑚𝑖𝑛𝑗 (𝑥𝑖 (𝑡𝑗+1 ), 𝑥𝑖 (𝑡𝑗 )) > 𝑟𝑖 (𝑡𝑗 ) then|𝜉𝑖 (𝑡𝑗+1 ) − 𝜉𝑖 (𝑡𝑗 )| = 𝑡𝑗+1 − 𝑡𝑗 , with 𝑖 = 1,2, … 𝑛 and , 𝑗 = 1, 2, … , 𝑇 − 1 so as to maintain the shape of the peak along the ToF axis during the alignment process. Lastly, for the computations, the researchers followed the approach of Reilly et al. (2004) and they applied the dynamic programming (DP) algorithm to relate the minimized value with the approximation to objective function [12]. The results of this method showed that the model is very efficient for low mass accuracy data. In contrast, for MS spectra with 5-50 ppm accuracy, the method shows an improvement respect other conventional methods, although is not that efficient as with low mass accuracy. An interesting idea, mentioned by the authors, is that this model can be used to align spectra from different laboratories, which are severely misaligned, since their method can handle these misalignments [12]. The alignment of LC-MS data was addressed in two different papers in 2013, published by the same team, proposing the same approach with slight differences. The approach was tested, in one case, on proteomic and metabolomics data [13] and in another case, on proteomic, metabolomic and glycomic data [14]. The team’s proposal is a Bayesian Alignment Model (BAM), which is a profile-based approach. BAM is a model that performs RT alignment based on multiple chromatograms of LC-MS runs. The model is based on alignment of the total ion chromatogram (TIC) or the base-peak chromatogram, thus reducing the order of the instrument (from 2nd to 1st). The model has two major components: the prototype and the mapping function. The prototype function, m(t), characterizes the part of the spectro-chromatograms that is shared by the different samples, and for the ith chromatogram at RT t, the intensity is referred to as prototype function indexed by the mapping function, ui(t), m(ui(t)). Each chromatogram (y for sample 𝑖) is modeled as: 𝑦𝑖 (𝑡) = 𝑐𝑖 + 𝑎𝑖 ∗ 𝑚(𝑢𝑖 (𝑡)) + 𝑒𝑖 (𝑡) for i=1, 2, 3…,N observed chromatograms [𝑦𝑖 (𝑡)], where 𝑐𝑖 ~𝑁(𝑐0 , 𝜎𝑐2 ) and 𝑎𝑖 ~𝑁(𝑎0 , 𝜎𝑎2 ) are parameters, ei(t) is the error, which is considered independent 19 and normally distributed 𝑒𝑖 (𝑡)~𝑁(0, 𝜎𝑒2 ). The prototype function is modeled with a B-spline regression: 𝒎 = 𝑩𝒎 𝝍, where ψ is a vector of elements drawn from normal distribution with a specific mean (𝜓𝑙 ~𝛮(𝜓𝑙−1 , 𝜎𝜓2 ), where 𝜓0 = 0 ) and the mapping function is a piecewise linear function, characterized by a set of knots τ,𝜏 = (𝜏0 , 𝜏1 , 𝜏2 , … , 𝜏𝑘+1 ), and their corresponding indices φi, 𝜑𝑖 = (𝜑𝑖,0 , 𝜑𝑖,1 , 𝜑𝑖,2 , … , 𝜑𝑖,𝑘+1 ). By following this process, the alignment problem is transformed into an inference task, where given the chromatograms,𝒚 = {𝑦1 , 𝑦2 , … , 𝑦𝑁 },the model parameters 𝜃 = {𝑎, 𝑐, 𝜓, 𝑎0 , 𝑐0 , 𝜎𝑎2 , 𝜎𝑐2 𝜎𝜀2 𝜎𝜓2 } need to be estimated. The authors used Markov Chain Monde Carlo (MCMC) methods to draw inference for the parameters. Once the inference is complete, the alignment is carried out by applying the inverse mapping function to each chromatogram, i.e. 𝑦̂(𝑡) = 𝑦𝑖 (𝑢̂𝑖 −1 (𝑡)) [13, 14]. 𝑖 Although both papers share the same overall approach, one can claim that the second paper is the advanced version of the first one. In the first article, the authors used a single ion chromatogram to estimate the prototype and mapping functions for RT alignment. As it is claimed, the model showed better performance than other profile-based methods, such as Dynamic Time Wrapping (DTW) model (J. Chemometrics (2004) 18: 231–241) [13]. However, as it is stated in the second article, their first model lacked in integration of prior knowledge, e.g. internal standards, and assumed the existence of a pattern only based on a single ion chromatogram. So, they introduced an advanced version, which incorporates multiple representative chromatograms and internal standards [14]. As far as the multiple representative chromatograms are concerned, the authors proposed a clustering approach for the identification of the chromatograms, where the chromatograms are simultaneously considered in the profile-based alignment to assist the progress of the estimation of the prototype and mapping functions. As for the internal standards, based on their peaks, the research team was able to evaluate the RT variations, by means of Gaussian Process (GP) regression, and this information is used as the prior of the mapping function [14]. Below, the figures illustrate graphically the models in the first second article. 20 Figure 5: Bayesian alignment model, profile-based approach [13]. Figure 6: Bayesian alignment model, profile-based approach with the incorporation of internal standards and chromatographic clustering [14]. 3.2. Data treatment Identifying proteins in a complex mixture is common task that is usually performed by tandem MS, since it is becoming a very useful tool. Typically, the protein identification process consists of two main stages. In the first stage, the observed spectra and precursors’ m/z values are matched to peptides by a database search tool and in the second stage proteins are scored and ranked using the scored peptides. The large amount of data produced by the MS, on the one hand can be considered beneficial, but, on the other leads to false matches between peptides and spectra, lowering the specificity. One of the most common problems is the degeneracy of peptides, i.e. the possibility of match a peptide to multiple proteins. These degenerate peptides are responsible for the difficulty in calculating the posterior probabilities of proteins, since the posterior probability of one protein depends on the presence of another, especially when the peptide is matched to both of them. With respect to this problem, Serang et al. proposed in 2010 a Bayesian approach for computing posterior probabilities [15]. 21 The suggested model uses three parameters and allows a protein with strong independent evidence to minimize the significance of supporting evidence that is shared with other proteins. The model is based on seven assumptions: 1) the recovery of one peptide by precursor scan does not influence the retrieval of other peptides given the set of proteins in the sample, 2) the creation and observation of one spectrum does not influence the creation and observation of other spectra given the set of peptides selected by the precursor scan, 3) the emission of a peptide is associated with the present protein with probability α, 4) the wrong detection of peptides from noisy signals (the probability that a truly absent peptide, not created by an associated protein, is falsely observed) has a constant probability, β, 5) the prior belief that a protein is present in the sample has a probability γ, 6) the prior probabilities are independent and finally, 7) each spectrum depends only on the peptide that is most excellently matched. From the probability model, the team was able to compute the likelihood of a set of proteins, which is proportional to the probability that these proteins would create the observed spectra, as follows: 𝐿(𝑅 = 𝑟|𝐷) ∝ 𝑃(𝐷|𝑅 = 𝑟) = ∑ ∏ 𝑃(𝐷𝜀 |𝛦𝜀 = 𝑒𝜀 )𝑃(𝛦𝜀 = 𝑒𝜀 |𝑅 = 𝑟) ∀𝑒 𝜀 = ∑ ∑ ∏ 𝑃(𝐷𝜀 |𝛦𝜀 = 𝑒𝜀 )𝑃(𝛦𝜀 = 𝑒𝜀 |𝑅 = 𝑟) ∀𝑒1 ∀𝑒:𝑒1 𝜀 = ∑ 𝑃(𝐷𝜀 |𝛦1 = 𝑒1)𝑃(𝛦1 = 𝑒1 |𝑅 = 𝑟) ∗ ∑ ∏ 𝑃(𝐷𝜀 |𝛦𝜀 = 𝑒𝜀 )𝑃(𝛦𝜀 = 𝑒𝜀 |𝑅 = 𝑟) ∀𝑒1 ∀𝑒:𝑒1 𝜀≠1 = ∏ ∑ 𝑃(𝐷𝜀 |𝛦𝜀 = 𝑒𝜀 )𝑃(𝛦𝜀 = 𝑒𝜀 |𝑅 = 𝑟) 𝜀 ∀𝑒𝜀 where R denotes the present proteins, E is the set of present peptides, ε is for the peptides’ index, D is the observed spectra. R and E are random variables that represent the truly presence of proteins and peptides and r and e are specific values for these variables. The 𝑃(𝐷𝜀 |𝛦𝜀 = 𝑒𝜀 ) was calculated by PeptideProphet and the 𝑃(𝛦𝜀 = 𝑒𝜀 |𝑅 = 𝑟) by the proposed model. The team also proposed a process about making the computations of posterior probabilities of large data set 22 more feasible. The process involves three steps: 1) partitioning, 2) clustering, 3) pruning. The three steps are presented in fig. 7 [15]. Figure 7: (A) partitioning: a protein is dependent on other proteins within connected sub-graphs and not dependent on proteins that share no peptides with the proteins in the connected sub-graphs. (B) clustering: proteins with identical connectivity, e.g. protein 1 and 2, can be clustered together to compute their posterior probabilities more efficiently. (C) pruning: within the connected sub-graph, e.g. protein 4 and 5, proteins that are connected only by peptides with zero probability can be divided into two sub-graphs that do not connect [15]. As mentioned in the paper, the model assumptions are not ideal. However, it is possible to evaluate their accuracy and replace them by others. For instance, assumption 5 can be improved by introducing more complex priors. The model, depending on the data and comparing the results with other methods, showed a similar and sometimes better performance. Additionally, it can be positively compared to other method in cases of high-scoring degenerate peptides [15]. The validation of database search results is an aspect of interest in protein identification. As indicated above, the protein identification has usually two stages: 1) matching peptides with database for their identification and 2) scoring and ranking of proteins based on the identified peptides [15]. The peptide identification is achieved by the combination of tandem MS and database search. The validation of these results is still developing in aspects such as specificity and sensitivity. In 2009, Zhang et al. proposed a Bayesian non-parametric model (BNP) for the validation of database results that incorporates popular methods in statistical learning, as the Bayesian method for posterior probabilities calculations [16]. The model integrates an extended set of features, which were selected from 23 literature, including peptide fragmentation knowledge and chemical or physical properties of the peptide. After the tandem MS spectra are searched against the database, the first stage is to construct two subsets, one includes the decoy matches and the other consists of the matches validated by the cutoff-based method. Based on these sets, the coefficients of LDF (linear discriminative function) score can be calculated by means of multivariate linear regression. In the second step the LDF score (x) distribution is fitted by a non-parametric PDF (probability density function) with the maximum likelihood parameter estimation. The formulation of the hypothesis mixture PDF is given by: 𝑝(𝑥) = 𝑃𝑝𝑜𝑠 𝑓(𝑥) + 𝑃𝑛𝑒𝑔 𝑔(𝑥), based on the theory that the random and correct matches can be grouped into subcategories and that the LDF score of each subcategory should have a simple distribution, e.g. normal distribution. The negative component 𝑃𝑛𝑒𝑔 𝑔(𝑥), which contributes to random matches, can be estimated by the fully non-parametric density function estimate procedure that is carried out by the maximum likelihood estimation with EM algorithm, as indicated by Duda et al. in 2001 and Archambeau et al. in 2003. The positive component 𝑃𝑝𝑜𝑠 𝑓(𝑥), which contributes to the correct matches, can be estimated by a restricted fully non-parametric density function estimate. After the estimation of the conditional PDF the correct probability of a match with a LDF score x can be calculated 𝑝𝑐𝑜𝑟 = by the following formulation: 𝑝𝑝𝑜𝑠 𝑓(𝑥) 𝑝𝑝𝑜𝑠 𝑓(𝑥) + 𝑝𝑛𝑒𝑔 𝑔(𝑥) This formulation can be explained as follows: 𝑝(𝑝𝑜𝑠|𝐷 ) 𝑝(𝑛𝑒𝑔|𝐷 ) 𝑝(𝑝𝑜𝑠) 𝑝(𝐷 |𝑝𝑜𝑠) = 𝑝(𝑛𝑒𝑔) 𝑝(𝐷|𝑛𝑒𝑔) (𝐼) , from Bayes’ theorem 𝑝(𝑝𝑜𝑠|𝐷) + 𝑝(𝑛𝑒𝑔|𝐷) = 1, ℎ𝑒𝑛𝑐𝑒 𝑝(𝑛𝑒𝑔|𝐷) = 1 − 𝑝(𝑝𝑜𝑠|𝐷) (𝐼𝐼), as normalization factor. By introducing (II) into (I) we get: 𝑝(𝑝𝑜𝑠|𝐷) 𝑝(𝑝𝑜𝑠)𝑝(𝐷|𝑝𝑜𝑠) = 1 − 𝑝(𝑝𝑜𝑠|𝐷) 𝑝(𝑛𝑒𝑔)𝑝(𝐷|𝑛𝑒𝑔) and then, after some rearrangements: 𝑝(𝑝𝑜𝑠|𝐷) = 𝑝(𝑝𝑜𝑠)𝑝(𝐷|𝑝𝑜𝑠) 𝑝(𝑛𝑒𝑔)𝑝(𝐷|𝑛𝑒𝑡) + 𝑝(𝑝𝑜𝑠)𝑝(𝐷|𝑝𝑜𝑠) 24 In this equation, 𝑝(𝑝𝑜𝑠|𝐷) can be considered as 𝑝𝑐𝑜𝑟 in the above formula, 𝑝(𝐷|𝑝𝑜𝑠) and 𝑝(𝐷|𝑛𝑒𝑔) are 𝑓(𝑥) and 𝑔(𝑥) respectively, ending up to the formulation the team proposed. Finally, the authors were able to make a decision in accordance with the cost function, presented as FDR (false discovery rate) [16]. As claimed by the research team, the model can provide a correct probability of each assignment that promotes the subsequent analysis. The model was able to identify more high confident proteins from a MS/MS data set compared to other methods, such as ProteinProphet. The stronger aspect, as it is indicated in the paper, is the confirmation of a larger number of confident peptides. Thus, it can give more information for later biological analysis [16]. The identification of proteins issue was also assessed by LeDuc and coworkers in 2014. In this case “top down” experiments were employed to identify and characterize whole proteins. The main characteristic of a “top down” experiment is that the precursor ion is an intact proteoform and not small peptides produced form enzymatic digestion prior to mass spectrometry (shotgun or bottom-up experiment). Thus, the mass of the precursor ion represents a native whole protein and its fragment ions support the characterization and verification of the primary structure. In order to capture better the information given from top down proteomics, the team suggested a scoring system for protein identification, based on Bayesian statistics, under the name C-score [17]. The authors started with the basic formulation of the Bayes’ theorem: 𝑃(𝑃𝑟𝑜𝑡𝑒𝑜𝑓𝑜𝑟𝑚𝑖 |𝑑𝑎𝑡𝑎) = 𝑃(𝑃𝑟𝑜𝑡𝑒𝑜𝑓𝑜𝑟𝑚𝑖 )𝑃(𝑑𝑎𝑡𝑎𝑀𝑆⁄𝑀𝑆 |𝑃𝑟𝑜𝑡𝑒𝑜𝑓𝑜𝑟𝑚𝑖 ) 𝑃(𝑑𝑎𝑡𝑎𝑀𝑆⁄𝑀𝑆 ) where 𝑃(𝑃𝑟𝑜𝑡𝑒𝑜𝑓𝑜𝑟𝑚𝑖 |𝑑𝑎𝑡𝑎) is the posterior probability (the probability of the 𝑖th proteoform given the MS/MS data), 𝑃(𝑃𝑟𝑜𝑡𝑒𝑜𝑓𝑜𝑟𝑚𝑖 ) is the prior probability of proteoform 𝑖, 𝑃(𝑑𝑎𝑡𝑎𝑀𝑆⁄𝑀𝑆 |𝑃𝑟𝑜𝑡𝑒𝑜𝑓𝑜𝑟𝑚𝑖 ) is the likelihood, reading the probability of the data given the proteoform 𝑖 and 𝑃(𝑑𝑎𝑡𝑎𝑀𝑆⁄𝑀𝑆 ) is known as the probability of the data. The team restated the above equation by defining variables, arriving at: 𝑃(𝜑𝑞 |𝑀𝑂 , {𝑚𝑖 }) = 𝑃(𝜑𝑞 )𝑃(𝑀𝑂 , {𝑚𝑖 }|𝜑𝑞 ) 𝑃(𝑀𝑂 , {𝑚𝑖 }) 25 where 𝑀𝑂 is the observed mass of the precursor ion, 𝑚𝑖 is the mass of the 𝑖th of the n fragment ions, so {𝑚𝑖 }𝑛𝑖=1 is the set of all the observed ions and 𝜑𝑞 is the 𝑞th candidate (from k candidate proteoforms in the database). The prior probability, 𝑃(𝜑𝑞 ), can be taken as “all the hypotheses are equal” or one can assign higher/ lower prior probabilities to a candidate. The interesting of this scoring model is that, in contrast to other Bayesian methodologies (also presented in this literature thesis), the model has no unknown parameters and instead of inferring values from the data collected for the study in question, the values are taken either from the team’s knowledge of mass spectrometry or from prior studies that focused on determining the needed values [17]. In order to calculate the likelihood, 𝑃(𝑀𝑂 , {𝑚𝑖 }|𝜑𝑞 ), LeDuc et al. assumed independence of all 𝑚𝑖 , and thus: 𝑛 𝑃(𝑀𝑂 , {𝑚𝑖 }|𝜑𝑞 ) = 𝑃(𝑀𝑂 |𝜑𝑞 ) ∏ 𝑃(𝑚𝑖 |𝜑𝑞 ) 𝑖=1 And for avoiding issues, such as the larger the list of fragment ions the lower the calculated value, the research team developed the aforementioned equation into: 1⁄ 𝑛 𝑛 𝑃(𝑀𝑂 , {𝑚𝑖 }|𝜑𝑞 ) = 𝑓 (𝑃(𝑀𝑂 |𝜑𝑞 )) 𝑔((∏ 𝑖=1 𝑃(𝑚𝑖 |𝜑𝑞 ) ) where the f function is a simple identity function for the precursor ion and g is a linear function on the logarithm base 10 of the probability of the fragment ion [17]. As stated by the authors, the proposed model showed high specificity and sensitivity, compared to the other methods. It is also mentioned that when the data are sufficient enough, the C-score demonstrated high characterization power and the method is flexible for having scoring system appropriate to the experimental procedure [17]. Spectral counting is a free-labeled quantitative approach in shotgun proteomics, which is defined as measuring the abundance of a given protein based on the number of tandem mass spectral observations of its identified peptides. Spectral counts (SPCs) have shown a good correlation with the abundance of the corresponding protein. Since SPCs can be extracted from any database search engine, it makes spectral counting a flexible and straight- forward technique. In 2011, Booth et al. proposed a Bayesian model for comparing SPCs under two 26 treatments, which allows the simultaneous classification of thousands of proteins [18]. The classification is conducted by calculating the posterior odds: 𝑂𝑖 = 𝑃(𝐼𝑖 = 1|𝑑𝑎𝑡𝑎) 𝑃(𝐼𝑖 = 0|𝑑𝑎𝑡𝑎) 𝑖 = 1, … , 𝑝 where Ii defines the indicator for non-null status of the i-th protein. A “non-null status” indicates that the protein has been affected by the treatment. If 𝑂𝑖 > 𝑐, a suitable large c, then the protein is classified as non-null. The choice of the threshold c is based on the control of false discovery rate (FDR), i.e. the protein is classified as non-null where in fact there is no treatment effect. In order to compute the posterior odds, the research team considered the following model: log𝜇𝑖𝑗 = 𝛽0 + 𝛽1 𝑇𝑗 + 𝑏0𝑖 + 𝑏1𝑖 𝐼𝑖 𝑇𝑗 + log𝐿𝑖 + log𝑁𝑗 where μij denotes the expected count for protein i in replicate j, β0 is the overall mean for the control replicates, β1 is an overall treatment effect, b0i and b1i are the corresponding protein specific effects, Li and Nj are the offset, account for the protein length and the replicate effect respectively and Tj is a binary indicator of the treatment. The model is completed by placing prior distributions for the model parameters. The authors considered three different prior distributions for the protein specific coefficients. One allows for potential correlation between the protein specific coefficients, while the other two assume they are independent and one allows the posterior mean of the protein specific treatment effects to be different in the null and non-null groups. The necessary computations were performed by means of Markov Chain Monde Carlo (MCMC) methods [18]. The results, announced by the team, showed that the proposed model is more statistically coherent and valid than other approaches they compared it to (Bayes factors), lead to a simple classification, however, it is significantly slower process than the one-at-a-time methods, such as the score test, were the results are instantaneous [18]. For the quantification of a peptide, apart from free-labeling techniques, there are also labeling techniques such as 18O-labeling approach. In the enzymatic 18Olabeling, the two oxygen atoms in the carboxyl-terminus of a peptide are replaced with oxygen isotopes form heavy-oxygen-water. The result is a m/z shift by 4Da for the labeled peptide and, thus, the labeled and unlabeled peptides are separated 27 with respect to their m/z in a spectrum, which allows the comparison of two samples. However, in practice due to water impurities (presence of 16O and 17O) and mislabeling, not all the labeled peptide receive two 18O, which results in a complex mixture of shifted and overlapping isotopic peaks of the labeled and unlabeled samples. In order to estimate the relative abundance of the peptide in two samples with respect to the aforementioned problem, Zhu et al. proposed a model with Bayesian framework for MALDI-TOF data, in 2011 [19]. The suggested model is an extension of a previously modeling approach (Valkerborg 2008, Zhu et al. 2010), where random effects of technical/biological variability were included. The model is given by: 𝑦𝑖𝑗 = 𝜇𝑖𝑗 + 𝜀𝑖𝑗 where 𝑦𝑖𝑗 is the experimental intensity obtained at the 𝑖 𝑡ℎ spectrum (or sample) 2𝜃 and 𝑗 𝑡ℎ monoisotopic peak, 𝜀𝑖𝑗 ~𝑁(0, 𝜎 2 𝜇𝑖𝑗 ) are independent and the θ parameter is the power parameter for the variance function to account for the heteroscedastic nature of the MS data. The mean intensity is the quantity 𝜇𝑖𝑗 of the j-th (j=1 denotes the monoisotopic of the unlabeled peptide) peak of the i-th spectrum. This can be expressed as: min(4𝑗−1) 𝐻𝑖 𝑅𝑗 + 𝑄𝑖 𝐻𝑖 + ∑ 𝑃𝑘 𝑅𝑗−𝑘 𝑖𝑓 1 ≤ 𝑗 ≤ 𝑙 𝑘=0 4 𝜇𝑖𝑗 ≡ 𝐸(𝑦𝑖𝑗 ) = 𝑄𝑖 𝐻𝑖 ∑ 𝑃𝑘 𝑅𝑗−𝑘 { 𝑖𝑓 𝑙 + 1 ≤ 𝑗 ≤ 𝑙 + 4 𝑘=𝑗−1 where a peptide has 𝑙 ≥ 5 isotopic variants and a 18O-labeled peptide has 𝑙 + 4. Hi(𝐻𝑖 ~𝑁(𝐻, 𝜎𝛨2 )) is the unobserved abundance of the peptide in the unlabeled sample (Sample I) in the 𝑖-th spectrum and 𝑄𝑖 (𝑄𝑖 ~𝛮(𝑄, 𝜎𝑄2 )) is the relative abundance of the peptide from the labeled sample (Sample II) with respect to the Sample I. 𝑃𝑘 is the m/z shift probability , which is calculated via a MCMC model ℎ and 𝑅𝑗 is the isotopic ratios, which is defined as 𝑅𝑗 = 𝑗⁄ℎ , 𝑗 = 1, … , 𝑙, of the 𝑗-th 1 isotopic variant, where ℎ1 , ℎ2 , etc. denotes the probabilities of occurrence of the first, second, etc. isotopic variant. The terms 𝐻𝑄𝑃𝑘 𝑅𝑗−1−𝑘 show the contribution to the mean value of the observed peaks from the isotopic variants of the peptide from Sample II [19]. 28 The team suggested that by using a Bayesian approach the incorporation of prior information could be advantageous for the analysis of the data. Although similar approaches have been published in previous years (Eckel-Passow, 2006) this method takes into account the presence of 17O atoms in heavy water and allows the isotopic distribution to be determined by the data. As it is pointed out, there are topics concerning the extension of the approach that are under further research, such as the incorporation of informative prior for the Bayesian model, which would yield a gain in precision for the estimation of the parameters [19]. Current methods for data analysis with the purpose of biomarker discovery, e.g. cancer diagnosis in an early stage, can be divide into two categories: 1) profiling, where the input is a list of peaks and 2) whole-spectrum, in which the entire spectrum is the input. It is argued that the profiling method is of greater importance, since the detected peaks are more meaningful in the way that they represent species that can be identified and further studied. The profiling method mainly consists of eight steps, which are: i) resampling, ii) noise reduction, iii) baseline correction, iv) normalization, v) peak detection, vi) peak alignment, vii) feature (peak) selection and viii) classification. As part of a profiling study, a study driven by the purpose of the discovery of biomarker that can distinguish cancer and normal samples, He et al. proposed Bayesian additive decision trees (BART) to build a classification model [20]. As shown in the schematic presentation of the proposed profiling method, after the data were baseline corrected and normalized, the next step was the peak detection. The authors used a smooth non-linear energy operator (SNEO) for the first Figure 8: The proposed method in a schematic illustration [20]. time, a method that has been successfully used in 29 electroencephalography and electrocardiogram, which was modified to be suitable for the peak detection in MS data. After the peak detection, a correlationbased peak selection was applied and the selected small peak set was used as input for BART in order to build a prediction model. BART was applied to classify samples and identify biomarkers. It is akin to the method that constructs a set of classifiers, e.g. a decision tree, to classify new data points. BART is defined by a prior and a likelihood and reports inference as the summary of all relevant uncertainties. It can be considered as a Bayesian “sum-of-tree” model, where each tree is constrained by a prior to be a weak learner. For the biomarker’s identification, the first step is to rank the selected peak according to their contribution to the classification. In this case, the contribution is determined by the calculation of the times the peak is used in the BART model. At first, the model was built on a few top ranking peaks and then progressively the number of peaks was increased [20]. As it is asserted in the paper, the method showed an excellent classification performance and the obtained results could be subjected to further research and validation. It is also mentioned that BART was accurate and the results better interpretable. Finally, by using the built-in partial dependent plot function of the BART model, it was possible to examine the effect of each biomarker to the cancer’s identification, as well as the interaction between these biomarkers [20]. As mentioned in the description of the previous paper, feature selection is a pre-step in the data analysis, with the final goal of building a classifier for biomarkers discovery. It is a common phenomenon that the initial discovery results in a relative large collection of biomarkers, but only a few remain as relevant after subsequent testing with new data. The main problem is the overfitting of classifiers that, due to small sample size and large amount of variables, result in high false positive rate of biomarker candidates. To overcome this problem, Kuschner et al. developed a method for feature selection, based on a Bayesian network (BN), in 2010 [21]. To build the BN, the authors used a model-free test for independence, which is called mutual information. Mutual information (MI) can be described as the measure of the information gained about a variable, knowing the value of another variable, and is calculated by the equation: 30 𝑀𝐼(𝑋; 𝑌) = ∑ 𝑃(𝑥, 𝑦)𝑙𝑜𝑔2 𝑥,𝑦 𝑃(𝑥, 𝑦) 𝑃(𝑥)𝑃(𝑦) where X and Y are two variables, x and y represent all the possible values that X and Y can take, respectively, and 𝑃(𝑥, 𝑦) denotes the joint probability that X takes on the value x and Y takes the value y. A value of MI equal to 0 indicates that the variables are independent. So, the first step of the method is the estimation of the variables that show dependency with class by calculating the 𝑀𝐼(𝑐𝑙𝑎𝑠𝑠; 𝑓𝑒𝑎𝑡𝑢𝑟𝑒). All features with 𝑀𝐼(𝑐𝑙𝑎𝑠𝑠; 𝑓𝑒𝑎𝑡𝑢𝑟𝑒) higher than a threshold are considered to have a connection with the class variable. For the graphic illustration of the BN that means that a directed arc is created from the class node to the node that represents the selected feature. In this way, the set of first level features is established and once it is done they are tested against all the other features individually, to determine connections between features. This is done by calculating the mutual information between first-level features and all the features. If the 𝑀𝐼(𝑓𝑖𝑟𝑠𝑡 𝑙𝑒𝑣𝑒𝑙 𝑓𝑒𝑎𝑡𝑢𝑟𝑒; 𝑓𝑒𝑎𝑡𝑢𝑟𝑒) is above the threshold (equal to the threshold used in the previous step), an directed arc is created to represent this dependency. When the connection is between two first-level features, an additional test is required, to determine the direction of this arc. Such test is based on computing the remaining mutual information between the class (C) and one of the first-level variables (F1) when the other first-level variable (F2) is known: 𝑀𝐼(𝐶; 𝐹1|𝐹2) = ∑ 𝑃(𝐶, 𝐹1, 𝐹2)𝑙𝑜𝑔2 𝑥,𝑦,𝑧 𝑃(𝐶, 𝐹1|𝐹2) 𝑃(𝐶|𝐹2)𝑃(𝐹1|𝐹2) when this mutual dependency is 0, the initial link between the class and the (firstlevel) feature 2 is removed [21]. Similar tests on mutual information are done to simplify the BN by the elimination of non-relevant connections. For instance, if the connection to the class C is of the form 𝐶 → 𝐹1 → 𝐹2 then the feature F2 will become independent of the class C when the data is partitioned on the values of F1 and the 𝑀𝐼(𝐶; 𝐹2|𝐹1) will drop to zero (0). This indicates that the F2 is independent of the class and the initial link 𝐶 → 𝐹2 will be eliminated, providing a means to organize the first-level features with their in-between dependencies. 31 Fig. 9 illustrates the first BN for leukemia data. The first level features are those showing dependency to class, when conditioned to other features, while the second level features had a low MI with the class, when they were conditioned to the parent feature. As stated in the article, the Bayesian network/mutual Figure 9: Bayesian network for leukemia data. First level features have high mutual information with the class, when conditioned to other features. Second level features showed little mutual information with class, when conditioned to the parent features [21]. information approach was able to provide a more distinct division between stable and unstable features and give the opportunity to examine relationships between features that may be useful in identification. However, testing the model with the artificial data, the method was not able to recreate the intended network completely, due to limitations of the algorithm [21]. Proteomic data treatment also refers to the post-translational modifications (PTMs) prediction, for example, the identification of peptide sequences and PTMs associated with the peptide in a biological sample [22]. PTMs are chemical modifications, which involve an enzymatic addition of a small chemical group or a larger moiety on the side chain of one or more amino acids. They regulate the activity of a protein and may occur either after or during the translation [23]. The analytical method usually chosen for the PTMs prediction is tandem MS followed by an analysis by means of a “blind” (unrestrictive) PTM search engine. This type of search engines are used because they can predict known and novel PTMs, but they are usually subjected to the noise in mass spectrometric data and result in false predictions, due to their containing inaccurate modification masses and incorrect modification positions [22]. To avoid the above mentioned issue, Chung et al. proposed a machine learning algorithm, PTMclust, in 2011 [23]. However, in 2013, the team, addressing limitations of PTMclust, they proposed a new method, iPTMclust, which introduces prior probabilities on the model variables and parameters, in order to overcome these disadvantages [22]. 32 PTMclust is applied on the results given by “blind” search engines and improves the predictions by suppressing the noise in the data and clustering the peptides with the same modification to form PTM groups. Based on the group, the algorithm finds the most probable modification mass and position [23]. Nonetheless, PTMclust showed some limitations, such as the greedy method for selecting the number of PTM clusters, the need for manual parameter tuning and the lack of confidence score per modification position. In order to overcome these limitations, the team extended PTMclust by making use of an infinite nonparametric mixture model, and so the infinite-PTMclust (iPTMclust) was developed [22]. The core of the model in iPTMclust remained the same as in PTMclust and describes how a modification mass and a modification position are generated. In the extended version (iPTMclust), priors were introduced on model variables and parameters that Figure 10: Bayesian network showing the relationship between variables of the model [22]. control the choice of a PTM group from a very large number of PTM groups. The relationship between variables can be illustrated by a Bayesian network (fig. 10). First, the shade nodes correspond to the observed variables, the unshaded nodes in the bottom plate indicate the latent variables, while the unshaded nodes in the upper plates are the model’s parameters and hyper-parameters are shown outside the plates (hyper-parameter: a parameter of a prior distribution). The model’s parameters are mixing coefficient (𝑎), modification mass means (𝜇𝑘 ), modification mass variances (𝛴𝑘 ) and probability of modification occurring on an amino acid (𝛽𝑘𝑗 ), and the observed variables are the observed modification mass (𝑚𝑛 ) and position (𝑥𝑛 ) and peptide sequence (𝑆𝑛 ). By combining the structure of the Bayesian network and the conditional distributions the joint probability can be written as: 33 𝑃(𝑐, 𝑎, 𝑧, 𝑥, 𝑚, 𝜇, 𝛴, 𝛽, 𝜆, 𝜐, 𝜑, 𝜉, 𝛾, 𝜔|𝑆, 𝛹) = 𝑃(𝛾|𝛹)𝑃(𝜆|𝛹)𝑃(𝜐|𝛹)𝑃(𝜑|𝛹)𝑃(𝜉|𝛹)𝑃(𝜔|𝛹) ∗ ∏𝑁 𝑛=1[𝑃(𝑐𝑛 |𝛾)𝑃(𝑚𝑛 |𝑐𝑛 , 𝜆, 𝜐, 𝜑, 𝜉)𝑃(𝑎𝑛 |𝑐𝑛 , 𝜔)𝑃(𝑧𝑛 |𝑎𝑛 , 𝑆𝑛 , 𝛹)𝑃(𝑥𝑛 |𝑧𝑛 , 𝛹)] where Ψ represents the model hyper-parameters for hyper-priors placed on 𝛾, 𝜆, 𝜐, 𝜑, 𝜉 and ω. The combination of latent variables and prior distributions leads to complex joint probability distribution over high-dimensional spaces. Therefore, a MCMC (Markov Chain Monte Carlo) method was employed for the necessary computations [22]. The authors claimed that iPTMclust outperformed their previous method (PTMclust) and other PTM algorithms. Since iPTMclust provides the user with modification position-level confidence scores, the result’s quality could be evaluated and further refinement of analysis could be performed [22]. 34 CHAPTER 4: BAYESIAN STATISTICS IN METABOLOMIC STUDIES Metabolomics is a term used to describe the emerging field of the study and measurement of metabolites. Metabolites are the products produced during the metabolism, the total amount of chemical reaction within a living organism. Metabolites can be considered as the “spoken language” of the genetic material (genome), and therefore, metabolomics is treated as the “read-out” of the state of the organism in study. Mass spectrometry coupled to different chromatographic methods, such as liquid or gas chromatography are the major techniques, employed for the analysis of vast array of metabolites simultaneously [24]. 4.1. Applications of Bayesian approach in metabolomics Profiling metabolites is the identification and quantification of small compounds up to 1000Da. These compounds constitute the products of the metabolic pathway. For metabolite profiling, mass spectrometry is a popular technique, which is employed to generate fingerprint spectra of the separated metabolites via chromatographic methods. These spectra are then compared with each spectrum from spectrum library, using a numerical score that characterizes the similarity between the spectra, as described for proteomics in chapter 3.2. . However, the identifications through this process are subjected to errors due to incomplete libraries, experimental noise and technical limitations. For the improvement of the accuracy of metabolite identification, Jeong et al., in 2011, proposed a Bayesian model, which analyze similarity score distribution for GCxGC/ToF-MS data [25]. The model has four layers that target on four fundamental variables, relevant to metabolite identifications. The variables are the presence/absence of a metabolite in the sample (𝑌), matching or not of a metabolite to any sample spectrum (𝑍), the correct/incorrect match (𝑊) and the similarity score (𝑆). In Layer 1, the marginal probability that every metabolite in the spectrum library is present in a sample is considered: 𝑃(𝑌𝑗 = 1) = 𝜌 𝑗 = 1, … , 𝑁, where 𝑁 is the number of the spectra in the library. In Layer 2, 𝑍 represents the observation of a match (𝑍𝑗 = 1 if there is a match and 𝑍𝑗 = 0 is there is no match for metabolite j). 35 This variable gives information about the unobserved 𝑌. Due to the nature of the metabolite and the library, each metabolite has some tendency to be matched to some sample spectrum (for a given library). So if the spectrum of a metabolite shares high level similarity Figure 11: Schematic representation of the model. Z and S observed, Wand Y unobserved [25]. with other metabolites’ spectra, there is high probability that it will be mistakenly matched to some other sample spectrum, although this metabolite might be absent. In this layer a competition score, 𝑏𝑗 , for each metabolite 𝑗 in the library is introduced and it is calculated as: 𝑏𝑗 = ∑𝑘≠𝑗,𝑘∈𝐶,𝐼(𝑟𝑘𝑗 <ℎ) 1⁄𝑎𝑘 , where 𝑎𝑘 is the similarity score between spectra of metabolite 𝑞 and 𝑘 in the library, 𝐶 is the set of spectra in the library and 𝐼(·) is the indicator function. In Layer 3, the accuracy of the match is considered for the metabolites that have been matched (𝑍𝑗 = 1) to at least one sample spectrum: 𝑃(𝑊𝑗𝑙 |𝑌𝑗 = 1, 𝑍𝑗 = 1) = 𝜏 (if 𝑌𝑗 = 0 the match is obviously incorrect). Finally, in Layer 4, a mixture model is used to characterize the distribution of similarity score (𝑆). By considering the four layers, the joint distribution of the variable can be expressed as: [𝑌, 𝑍, 𝑊, 𝑆] = [𝑌][𝑍|𝑌][𝑊|𝑍, 𝑌][𝑆|𝑊] = (∏[𝑌𝑗 ]) (∏[𝑍𝑗 |𝑌𝑗 ]) ( ∏ ∏[𝑊𝑗𝑙 |𝑍𝑗 , 𝑌𝑗 ][𝑆𝑗𝑙 |𝑊𝑗𝑙 ]) 𝑗 𝑗 𝑗:𝑍𝑗 =1 𝑙 By treating 𝑌 and 𝑊 as the unobserved variables, Expectation-Maximization (EM) algorithm was used to estimate the parameters of the model, ̂𝜃. The confidence of each metabolite 𝑗 can, then, be estimated as the posterior probability of 𝑌𝑗 : 𝑃𝑗 { 𝑃(𝑌𝑗 = 1|𝑍𝑗 = 1, 𝑆𝑗 ; 𝜃̂) [25]. 𝑃(𝑌𝑗 = 1|𝑍𝑗 = 0; 𝜃̂) The authors stated that the method is a novel model-approach to the metabolite identification problem. Thus, the comparison was performed with 36 different type of methods, where the results showed that the proposed model was more accurate [25]. Suvitaival et al., in 2014, presented a Bayesian approach for integrating data of multiple detected peaks, connected to one compound [26]. This is important in metabolomics’ studies are also concerned with the change in the levels of metabolites, providing information about the physiological state of an organism. For these changes to be discovered, comparative analysis of spectral profiles is the main approach, through the inference of covariate effects, meaning the differences between sample groups determined by the controlled covariates of an experiment. Although the data might be very complex and noisy, they can provide strong informative structure. For instance, each compound may generate multiple adduct peaks and a specific isotopic pattern, whose position and shape may be helpful during the identification of the analyte [26]. The suggested approach consists of two stages: 1) clustering the peaks that are generated by the same compound, by applying a non-parametric Bayesian Dirichlet process model on the data, and 2) the responses to the covariates of the experiment are inferred on the clusters by means of a Bayesian multi-way model. For stage 1, clustering the peaks, the team assumed that the peaks are generated through a Dirichlet process: there is an unknown number of clusters and an unknown number of peaks emerge from each cluster, where each can only be assigned to one cluster. The probability of assigning peak 𝑗 to cluster 𝑘 can be expressed as: 𝑃(𝑣𝑗𝐾 = 1|𝑄, 𝑅, 𝑉) ∝ 𝑎𝐷𝑃 𝐿(𝑄, 𝑅|𝑉−𝑗 , 𝑣𝑗𝐾 = 1) where the value 𝑣𝑗𝐾 = 1 in the clustering matrix assigns peak 𝑗 to the cluster 𝑘, 𝑄 represents the data, 𝑉 the clustering matrix, 𝑅 ∈ {0,1}𝑁𝑥𝑃𝑥𝑃 is a mask with binary values 𝑟𝑖𝑗𝑗′ indicating whether the peak pair 𝑗 𝑗′ in the sample 𝑖 appears together and whether both peaks are observed, 𝑎𝐷𝑃 is the Dirichlet process concentration parameter, which determines the prior probability mass outside the clusters and weights the likelihood term 𝐿(𝑄, 𝑅|𝑉−𝑗 , 𝑣𝑗𝐾 = 1). The inference of the posterior probability distribution was performed via Gibbs sampling. For stage 2, the research team inferred the differences in concentrations between sample groups for each cluster, which is related to one compound, given the peak height 𝑋 ∈ ℝ𝑁𝑥𝑃 and the clustering 𝑉 [26]. 37 The results, as the authors claimed, showed that by including multiple peaks could lead to an improvement of covariate effects inference and by introducing additional data describing the compound, the inadequate sample-size problem can be successfully addressed [26]. Studies concerning the identification and classification of bacteria, based on their characteristic metabolomics profiles, is a part of metabolomics, which is very important as a rapid diagnostic tool. Two different articles addressed the same issue- the identification/classification problem- for different type of bacteria, applying Bayesian statistics in the data analysis. Correa et al. in 2011 presented a research concerning Bacillus spores and species, while Oliver et al. proposed an approach for identification of various Mycobacterium species [27, 28]. In [27] the bacterium under study is Bacillus, as mentioned above. Bacillus and Clostridium are species that can adapt to changes in the environment and starvation rapidly due to their ability to develop spores. Because of their resistant spores, members of the genus Bacillus are widely distributed in the environment and their control is considerable in the food industry. Some of these bacteria are pathogenic, causing food poising, but the most notorious member of this genus is B. anthracis, which cause anthrax. Therefore, the rapid identification of spores and bacteria is of a great importance, because of their potential use as a biological warfare agent. The authors developed a genetic algorithm-Bayesian network algorithm to identify biomarkers and by means of these biomarkers they built a Bayesian network model for the identification of Bacillus spores and species [27]. The proposed analytical approach is a two-step based identification that classifies the bacilli into one of the respective species. The first step involves the reduction of the dimensions of data. A genetic algorithm (GA) is employed for feature selection with classification of either spores versus vegetative biomass or speciation to one of the seven different species. The classifier is a Bayesian network (BN) that best fits the best solution given by a GA. In the second stage, after fitting a new BN model to the best solution found by the GA in first step, the model is used for the statistical analysis in order to determine probabilistic relationships between mass-over-charge ratio intensities, selected by GA, and the classification [27]. 38 The authors reported that the classification accuracy of the suggested approach was superior to the partial least squares-discriminant analysis (PLS-DA) and that it is fast and provides easy interpretation of the relationships between the selected biomarkers. It is also mentioned that it is possible to develop predictive models that will allow inference of biological properties of the bacilli [27]. In the second article from 2012, the research team focused on the Mycobacterium species. Various species of this genus are related to tuberculosis (TB), and only in 2008, 1.8 million deaths were reported due to this disease. Although the current TB diagnostic method is considered to be very sensitive, it suffers from major limitations, such as the high rate (15-20%) of false negative in adult cases and the culturing time (2-6 weeks), which leads in unnecessary delay in the patient’s treatment. The authors investigated the potential use of GC-MS and the subsequent data analysis, in order to build a classification model for various TB causing and non-TB species [28]. The proposed approach involved different statistical analysis methods. First, principal component analysis (PCA) was employed for the determination of the existence of a natural grouping between the various sample groups. Furthermore, a PLS-DA model in order to identify the compounds that contribute most to the separation of the sample groups by ranking the compound according to the variable influence on the projection (VIP) parameter, which indicates the importance of the metabolite in the classification model. Subsequently, the authors created combining list of the biomarkers with the highest modelling powers (PCA) and VIP values, to reduce the dataset into a set of relative metabolite markers. The identities of the markers were determined through the GC retention times and the fragmentation pattern generated by MS compared to previously injected standards’ libraries. Based only the first three PCAs as input, a discriminative model based on Bayes’ theorem was built for the purpose of estimating the class membership probabilities of an unknown bacterial sample [28]. The authors claimed that their method was able to identify metabolite markers that are related to various species (TB causing and non-causing), the classification was achieved in less than 16 hours with a good detection limit. The 39 team, also, suggested that their method could possibly become a tool in TB diagnostics and disease characterization [28]. 40 CHAPTER 5: BAYESIAN STATISTICS IN FORENSIC SCIENCES 5.1. Bayes’ theorem and forensic evidence Bayes’ theorem is commonly used in the forensic sciences and especially in the Court of Law. In the forensics context, there are two hypotheses for a criminal case:  𝐻0 : the suspect is innocent  𝐻1 : the suspect is guilty In the court the question around the innocence (or guilt) should be answered. The answer can be given if the problem is treated as a probability problem. So, for each hypothesis, its probability (P), given the data (D), can be expressed as (according to Bayes’ theorem): 𝑃(𝐻0 |𝐷) = 𝑃(𝐻0 ) ∗ 𝑃(𝐷|𝐻0 ) (1) 𝑃(𝐻1 |𝐷) = 𝑃(𝐻1 ) ∗ 𝑃(𝐷|𝐻1 ) (2) By diving (1) and (2) we get: 𝑃(𝐻0 |𝐷) 𝑃(𝐻0 ) 𝑃(𝐷|𝐻0 ) = ∗ 𝑃(𝐻1 |𝐷) 𝑃(𝐻1 ) 𝑃(𝐷|𝐻1 ) where the left part of the equation is the posterior odds (the probability of the suspect to be innocent divided by the probability of the suspect to be guilty), while the right part read the prior odds and the likelihood ratio, respectively. The forensic scientist can provide the court with the likelihood ratio, i.e. the probability of the data (evidence) given the hypothesis. The job of the forensic expert is not to determine the probability of the accused to be innocent/guilty, but he can only present the evidence, including the probability of being wrong. The judge is the one that make a decision concerning the innocence of the defendant (posterior probability), based on the prior “preference” (prior probability) and the value of evidence (likelihood ratio). 5.2. Mass spectrometry and Bayesian statistics in forensic sciences An interesting Bayesian approach was published by Sottas et al. in 2007, concerning the testosterone abuse by athletes in elite sports. Testing of 41 testosterone abuse is mainly based on the testosterone over epitestosterone (T/E) ratio. Since 2004, urine samples are subjected to GC-MS and if the T/E ratio is equal or greater than 4 then the samples are submitted to IRMS (Isotope Ratio Mass Spectrometry) for determination of 13C/12C. If the IRMS analysis does not indicate an exogenous administration, then a longitudinal study is conducted. The proposal of the authors is a Bayesian screening test, which, by processing the GCMS data of a subject, answers the question whether the athlete tests positively or negatively for testosterone abuse. What is of great importance is that in this particular case the T/E ratio variance within the male population is taken into account by changing the threshold of the test from a population basis to a subject basis, when the number of individual test results has been increased [29]. The approach can be presented as a Bayesian network with four nodes (Fig. 12), where μ is the distribution of mean values of T/E for the different individuals, CV represents the distribution of coefficient of variation (CV) among different individuals, σ is obtained by multiplication of the variables μ Figure 12: Bayesian network model of T/E. It consist of 4 nodes. T/E reads the distribution of expected values of T/E returned by the model [28]. and CV, and finally T/E node is the T/E ratio values, which are assumed to be distributed following a normal distribution with μ and σ. The model works in two steps. First, from a set of prior distributions, which express the knowledge of which μ and CV is physiologically relevant, the network returns a distribution of the expected values of T/E. The next step is to infer, through Bayesian statistics, new distribution for μ and CV, since new data/test results are taken into account. In other words, the posterior probabilities of step 2 become prior probabilities for step 1 as soon as new test results are applied to the model. In this way, the distributions develop into unique and individual values for the mean and coefficient of variation [29]. 42 The results of this study were remarkable, with the authors claiming that the Bayesian interpretation of the T/E time-profiles showed significant sensitivity and returned less false positives than the other tests [29]. In another paper from 2013, written by Bettencourt da Silva et al [30]., Bayes’ theorem was used to evaluate the examination uncertainty of the presented strategy for identification of tear gas weapons’ active substances. The analysis of the samples was performed by GC-MS instrumentation. Therefore, the data consisted of retention time (τ) and ratio of abundances, noted as α, between the molecular ion A and the fragment ions B and C, which are the most abundant. The team used Bayes’ theorem, following the theory and notation presented by Ellison et al. (1998). At first the odds of an active substance A was calculated by means of the following formula: 𝑂(𝐴|𝜏, 𝛼 𝐴𝐵 , 𝛼 𝛢𝐶 ) = 𝑂(𝐴) ∗ 𝐿𝑅(𝜏) ∗ 𝐿𝑅(𝛼 𝐴𝐵 ) ∗ 𝐿𝑅(𝛼 𝛢𝐶 ) where O(A) is the known prior odds for the A to be present respect to A to be absent [𝑂(𝐴) = 𝑃(𝐴)/𝑃(−𝐴)] and LR(τ), LR(αΑΒ) LR(αΑC) are the likelihood ratio of retention time and the ratio of abundances respectively [𝐿𝑅(𝑥) = 𝑃(𝑥|𝐴)/ 𝑃(𝑥| − 𝐴), where 𝑃(𝑥|𝐴) is the chance of evidence x being observed given A and 𝑃(𝑥| − 𝐴) is the chance of evidence x when A is absent . Then the odds can be converted into the probability by using this simple equation: Pr(𝐴|𝜏, 𝛼 𝐴𝐵 , 𝛼 𝛢𝐶 ) = 𝑂(𝐴|𝜏, 𝛼 𝐴𝐵 , 𝛼 𝛢𝐶 ) 𝑂(𝐴|𝜏, 𝛼 𝐴𝐵 , 𝛼 𝛢𝐶 ) + 1 [30] The results of the evaluation of examination uncertainty indicated that strong evidence can be gathered when the signal of the peaks of the standard and the sample is consistent, taking into account the retention time and the ratio of abundances [30]. In the forensic sciences, the likelihood ratio concerning the prosecutor’s and the defendant’s hypothesis is commonly used to evaluate the evidential strength. In the article written by Farmer et al. (2009) the likelihood ratio (LR) is employed to measure the evidential strength for Isotope Ratio Mass Spectrometry (IRMS) results from white paints. Architectural white paints are a type of trace evidence, found in crime scenes, usually in cases of trespass. The paint is composed of four different parts: the pigment, the liquid, the binder and the additives. The binder is a partly-polymeric compound that when dries helps the paint to stay on the 43 surface. The isotopic analysis of 13C, 18O and 2H was primarily performed on the binder through IRMS [31]. From the IRMS data, the values for δ for13C, 18O and 2H was calculated by using the following equation: 𝛿[‰] = ⌊ 𝑅𝑠𝑎𝑚𝑝𝑙𝑒 − 𝑅𝑠𝑡𝑟𝑑 ⌋ ∗ 1000 𝑅𝑠𝑡𝑟𝑑 where Rsample is the measured isotopic ratio of the heavier isotope over the lighter and Rstrd is the measured isotopic ratio for the corresponding international reference material. The values of δ were then used to accumulate the Stable Isotope Profile (SIP) and finally the LR was calculated for the SIP, following the method proposed by Aitken et al. (2004). In this case, the LR indicates the ratio of the probabilities of observing the SIP of the controlled and recovered specimens when they come from the same source respect to the probabilities of observing the SIP from different sources. 51 paints were used as population and the comparison was done pair-wise. It was the first time that LR was applied to IRMS observations and the results, according to the authors, showed considerable forensic potential. The method gave approximately a ~2% false positive rate and a similar false negative rate, where 𝐿𝑅 > 1 indicates that the paints originate from the same source. As suggested by the team, the discriminatory power of this method would make it a promising candidate as a forensic tool [31]. Illicit drug consumption can be estimated through the analysis of communal sewage water. The analysis is based on mass spectrometric methods that allow the measurement of concentration of drug target residues (DTRs), such as metabolites, with relatively high accuracy and precision. The consumption of parent drugs can be “back-calculated” from these concentrations. However, these estimations of concentrations are subjected to many sources of uncertainty. As part of a bigger research, which also included a Monte Carlo simulation, Jones et al. proposed a Bayesian framework computed using Markov Chain Monte Carlo (MCMC), which combines the estimation of parameters and uncertainty propagation is a single step [32]. First the “back-calculations” were presented by the team in a sister paper (Baker et al, 2014), based on the previous work of Zuccato et al. (2005), in order to estimate per capita consumption from DTRs: 44 1) Load of DTR (gr/day) : 𝐿𝑜𝑎𝑑 = 𝐶𝑜𝑛𝑐𝑒𝑛𝑡𝑟𝑎𝑡𝑖𝑜𝑛∗𝐹𝑙𝑜𝑤 100 1000∗( ) 100−𝑆𝑜𝑟𝑝𝑡𝑖𝑜𝑛 100 ∗ (100+𝑆𝑡𝑎𝑏𝑖𝑙𝑖𝑡𝑦), where Concentration is the DTR concentration in wastewater (𝑛𝑔/𝐿), Flow is the volume of flow to the wastewater after a 24hours period (𝑚𝑖𝑙𝑖𝑜𝑛𝑠 𝑜𝑓 𝑙𝑖𝑡𝑒𝑟𝑠/𝑑𝑎𝑦), Stability indicates the percentage of DTR that changes in the wastewater because of the conditions (pH, temperature, time) and Sorption shows the percentage of sorption to suspended particular matter (SPM) in wastewater. 2) Estimation of drug consumption per 100 people (mg/day): 𝐿𝑜𝑎𝑑 𝑀𝑊𝑝𝑎𝑟 𝐶𝑜𝑛𝑠𝑢𝑚𝑝𝑡𝑖𝑜𝑛 = (𝑃𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛∗𝐸𝑥𝑐𝑟𝑒𝑡𝑖𝑜𝑛) ∗ (𝑀𝑊 𝐷𝑇𝑅 ) − 𝑂𝑆, where Excretion is the parent drug dose’s proportion that is excreted as DTR, MWpar and MWDTR are the molecular weights of the parent drug and DTR respectively, Population counts the size of the population, and OS is the amount of DTR in wastewater due to source different from consumption, such as hospitals and prescription usage. It should be mentioned that for drugs that can be administrated through different route, e.g. cocaine, the metabolic profile can vary enormously, but it is possible to be included in the Excretion term as: 𝐸𝑥𝑐𝑟𝑒𝑡𝑖𝑜𝑛 = ∑𝑅[(𝑝𝑟𝑜𝑝𝑜𝑟𝑡𝑖𝑜𝑛 𝑜𝑓 𝑝𝑎𝑟𝑒𝑛𝑡 𝑑𝑟𝑢𝑔 𝑚𝑎𝑠𝑠 𝑡ℎ𝑎𝑡 𝑐𝑎𝑛 𝑏𝑒 𝑎𝑑𝑚𝑖𝑛𝑖𝑠𝑡𝑟𝑎𝑡𝑒𝑑 𝑏𝑦 𝑟𝑜𝑢𝑡𝑒 𝑅) ∗ (𝑝𝑟𝑜𝑝𝑜𝑟𝑡𝑖𝑜𝑛 𝑜𝑓 𝑎 𝑑𝑜𝑠𝑒 𝑜𝑓 𝑝𝑎𝑟𝑒𝑛𝑡 𝑑𝑟𝑢𝑔 𝑒𝑥𝑐𝑟𝑒𝑡𝑒𝑑 𝑎𝑠 𝐷𝑇𝑅 𝑓𝑜𝑙𝑙𝑜𝑤𝑖𝑛𝑔 𝑡ℎ𝑒 𝑟𝑜𝑢𝑡𝑒 𝑅)] [32]. The Bayesian approach can be presented as directed acyclic graph (DAG), but the “back-calculations” are written in a way that show a more natural ‘forward’ formulation (consumption determines the load and concentration of DTR) (fig. 13). The first stage of the simulation is to characterize each parameter in terms of distributions, which in this case a prior distribution is specified for each parameter, which is updated to posterior distribution by the addition of new data to the model. Some of the parameters are known (excretion, flow, stability and population), also referred to as informative prior distributions (and they are assumed to follow a normal distribution), while others (daily consumption) are unknown (uninformative priors) and their posterior distribution is driven by the observed data. The researchers considered that the logarithmic daily consumption follows a normal distribution, with mean μ and standard deviation τ, and for these 45 parameters broad prior distributions were assumed. After fitting the model posterior estimates were available and for seven days of sampling the estimation of the mean drug consumption was 1260mg/1000 people [32]. The authors suggested that this type of analysis may allow more sophisticated statistical analyses, such as modeling the variability over time. They also mention that the model can be furthered in way that would incorporate the “weekend effect”, where the drug consumption seems to increase, however, longer periods of data should be available [32]. The role of a forensic expert is to evaluate evidence (E) in the context of two propositions: the prosecution (H1: the control and the recovered samples originate from the same source) and the defense proposition (H2: the control and recovered samples originate from different sources). Although the most common approach of the evaluation of physicochemical data is the likelihood ratio method (LR), Bayesian networks (BNs) are of growing interest in forensic sciences and have showed that their application can be a promising tool in the evaluation of the problem. Zadora presented a report on the evaluation of physicochemical data through Bayesian networks, focusing on classification of diesel fuel and kerosene samples, comparison of car paints, taking into account the polymer binder and the pigment, and comparison of fibers, which relies on the morphology of the fiber and in case of dyed fibers the comparison of the colour, in 2010 [33]. The classification problem of diesel fuel and kerosene samples, whose identification relies on automated thermal desorption-gas chromatography-mass spectrometry (ATD-GC-MS), can be solved with a BN with two nodes. One discretetype node is H, which represents the two hypotheses and the other one is Vd, which has seven ranks related to all the Figure 13: Bayesian networks for discrete variables (right) and continuous-type variables (left) [33]. considered discrete variables. If the variables are considered as continuoustype, then the node Vd changes into Vc, where the variables are expressed by 46 normal distribution. Conditional probabilities in the node Vd and parameters (means and variances) in the node Vc are calculated through the available data in databases. By entering hard evidence in the node Vd or Vc the Bayesian network model returns a result about node H, which represents the posterior probabilities ratio and in this figure is equal to LR on the basis of {1}[33]. The comparison of samples, either car paints or fibers, can also be approached by BN method. The node H has to states: H1 is the hypothesis that the compared samples (control and recovered) originate from the same object, and H2 is the hypothesis that the compared samples are from different objects. The model can have as input either continuous-type of data or discrete-type. The discrete-type Figure 14: BN for discrete-type of data (car paint problem) [33]. model (fig.14) was used for the comparison of paints, which can be described by qualitative-type data, i.e. the presence or absence of a particular compound, similar to the diesel and kerosene samples. The parent node B allows the expression of the state (rank) of a particular variable for the control sample. The prior probabilities of this node are calculated based on the background data, for example how often a certain state is observed within a particular variable in a database of analyzed car paints. It can be said that this node contains information about the rarity of a state, something that be included in the evaluation of evidential value. The node E contains the conditional probabilities that a certain combination of ranks of discrete variables in the recovered (𝐸 = 𝑉𝑟 ) and control (𝐵 = 𝑉𝑐 ) samples can occur under each of the considered propositions: a)𝑃(𝐸 = 𝑉𝑟 |𝐵 = 𝑉𝑐 , 𝐻1 ), which expresses the conditional probability that the combination of ranks 𝑉𝑟 and 𝑉𝑐 of a particular variable is observed when the samples originate from the same object, and b) 𝑃(𝐸 = 𝑉𝑟 |𝐵 = 𝑉𝑐 , 𝐻2 ) is the conditional probability which expresses that the combination of ranks 𝑉𝑟 and 𝑉𝑐 of a particular variable is observed when the samples originate from different objects [33]. 47 For the continuous-type of data (fig.15), the BN model was applied for car paints and fiber cases. The model was first proposed by Taroni et al. in 2006. The continuous nodes V_B, V_C and V_R are assumed to follow Figure 15: BN for continuous-type of data (car paint and fibers cases) [33]. normal distribution, whose parameters (mean and variance) are estimated considering rules of propagation of information in the network: a) V_B~𝑁(𝜇, 𝜏 2 ) represents the background information, where μ is the mean of all objects in a suitable database and 𝜏 2 is the between source variability, b) V_C represents the measurement done on the control sample with (𝑋|𝜃, 𝜎 2 )𝛮(𝜃, 𝜎 2 ), where 𝜎 2 is the within source variability and c) V_R is the node that presents the measurements 𝜎2 𝜏2 2 made on the recovery sample with (𝑌|𝑋 = 𝑥)~𝑁(𝑦 ̅̅̅, 1 𝜎 + 𝜎2 +𝜏2 ) when H1 is true and ̅̅̅ 𝑦1 is the mean of the analyzed parameter in the recovery sample and (𝑌|𝜃)~𝛮(𝜃, 𝜎 2 + 𝜏 2 ) when H2 is true [33]. The author claimed that the use of BNs require people with no knowledge of classical programming and people with limited understanding of evaluation procedures can benefit from the visualization of factors considered in the process of evaluation of evidence, that is provided by a BN model. However, more data should be collected and additional experiments should be carried out with the aim of assessing how these models perform, before any trial in a real casework [33]. 48 CHAPTER 6: BAYESIAN STATISTICS IN VARIOUS APPLICATIONS Bayesian statistics can be applied in many different fields that may not fall into the same category that have already been presented. A field that has a great impact on daily life is the food industry. An interesting research, which included the combination of a frequentist approach and Bayes’ theorem, was presented by Hejazi et al. in 2010. The research concerned unsaturated fatty acids (FA) that are associated with several health issues, such as cardiovascular diseases, obesity and inflammation [34]. Unsaturated FAs can exhibit a cis or trans geometry, due to the double bonds. cis-FAs are naturally present in foods, while the trans-FAs are produced by heat and hydrogenation of vegetables and marine oils or by bacteria. The latter ones are associated with aforementioned health conditions. The separation and identification of these isomers, usually carried out by gas chromatography-mass spectrometry (GC-MS), is crucial, but it is also a difficult task. The team proposed an approach, which included Bayes’ theorem, for the distinction of isomers of αlinolenic acid [34]. Linolenic acids has 3 double bonds, which gives 23 different possible geometries. The team, having previous knowledge that the central bond can be assigned as cis or trans (xcx or xtx respectively) by m/z ratio, started to distinguish the geometries for the other two bonds. A series of t-tests on the intensity ratios (peak of interest/base peak) was conducted in order to make a decision about the geometry and the m/z ratio with the smallest probability (H0: there is no difference in ratios between isomers) was selected. Given these estimations of normal probability density functions, Bayes’ theorem was then applied to obtain the probability that the molecule has one or another geometry [34]. As an example of illustration the authors presented the identification of cct isomer. The base peak in this case is m/z=79, which implies that the central bond is cis (xcx). The Student’s t-tests gave as discriminatory peaks m/z=236 for xcc and xct and m/z= 93 for tct and cct. The probability can be formulated for the intensity ratios as follows: 49 𝑃(𝑐𝑐𝑡|𝑅93 , 𝑅236 ) = 𝑃(𝑐𝑐𝑡|𝑅93 , 𝑥𝑐𝑡) ∗ 𝑃(𝑥𝑐𝑡|𝑅236 ), where 𝑅93 = 𝐼93 /𝐼79 and 𝑅236 = 𝐼236 /𝐼79 . For each position there are two possible geometries and if the central position is decided, the Bayes’ theorem formulation gives: 𝑃(𝑅93 |𝑐𝑐𝑡 )𝑃(𝑐𝑐𝑡) | )𝑃(𝑐𝑐𝑡)+𝑃(𝑅93 |𝑡𝑐𝑡 )𝑃(𝑡𝑐𝑡) 𝑐𝑐𝑡 93 𝑃(𝑐𝑐𝑡|𝑅93 , 𝑅236 ) = 𝑃(𝑅 𝑃(𝑅236 |𝑥𝑐𝑡 )𝑃(𝑥𝑐𝑡) 𝑃(𝑅236 |𝑥𝑐𝑡 )𝑃(𝑥𝑐𝑡)+𝑃(𝑅236 |𝑥𝑐𝑡 )𝑃(𝑥𝑐𝑡) ∗ , where 𝑃(𝑅93 |𝑐𝑐𝑡) and 𝑃(𝑅236 |𝑥𝑐𝑡) are also known as the likelihood ratio and they were calculated by means of one-tailed ttest and 𝑃(𝑐𝑐𝑡) is the prior probability, where a ‘flat prior’ was adopted (therefore equal to 1/8 for a sample with an unknown origin) [34]. The authors claimed that the method allowed an accurate identification of unknown isomers of α-linolenic acid, and they also suggested that this approach would be adopted in the determination of the fragmentation mechanism of general polyenes [34]. Another role of Bayesian statistics and especially Bayes’ theorem was presented by Jackson et al. in 2013. In this article, a comparison between Lawrence Livermore National Laboratory (LLNL) and the Purdue Rare Isotope Laboratory (PRIME Lab) with respect to the 41Ca analysis was demonstrated [35]. 41Ca has a long half-life (𝑇1/2 ~105 years) and its analysis with accelerator mass spectrometry (AMS) has led to a significant rise of applications in biomedical research for osteoporosis and calcium metabolism. However, an AMS measurement of 41Ca is subjected to many errors, mainly due to many interferences with other ions. The research team made a comparison between these two laboratories to report about the uncertainties due to chemical preparation procedures and bias from AMS measurements [35]. The study was divided into two components: 1) chemically prepared samples at Prime Lab were measured at both Prime Lab and LLNL to check the consistency of the AMS measurements, and 2) AMS samples from the same source were prepared at both facilities and measured at LLNL in order to compare the preparation procedures. The reported results were used to estimate if the laboratories were over- or underestimating their reported uncertainties. For the statistical calculation, Bayes’ theorem was applied to the data, following the approach presented by Dose (2003): 50 𝑃(𝜈, 𝑤|𝑑𝑎𝑡𝑎) = 𝑃(𝜈, 𝑤) ∗ 𝑃(𝑑𝑎𝑡𝑎|𝜈, 𝑤), where 𝑃(𝑑𝑎𝑡𝑎) can be omitted if the formulation is normalized. 𝜈, 𝑤 are two multipliers, such that the true uncertainties in the data should be 𝜈𝜎𝐿𝑖 and 𝑤𝜎𝑃𝑖 with respect to the corresponding results {𝑥𝑖 , 𝜎𝐿𝑖 } from LLNL and {𝑦𝑖 , 𝜎𝑃𝑖 } from Prime Lab [35]. If 𝑧𝑖 is the unknown true value of sample 𝑖 and represents the 41 Ca/40Ca ratio then 𝑃(𝑑𝑎𝑡𝑎|𝜈, 𝑤) = ∏𝑖 𝑃(𝑥𝑖 , 𝑦𝑖 |𝜈, 𝑤) and 𝑃(𝑥𝑖 , 𝑦𝑖 |𝜈, 𝑤) = ∫ 𝑑𝑧𝑖 𝑃(𝑥𝑖 , 𝑦𝑖 |𝜈, 𝑤, 𝑧𝑖 )𝑃(𝑧𝑖 |𝜈, 𝑤), resulting in the final equation 𝑃(𝑑𝑎𝑡𝑎|𝜈, 𝑤) = 𝑐(∏𝑖 𝐺𝑖 )𝑃(𝜈, 𝑤). The prior 𝑃(𝜈, 𝑤) can be used to imput additional information, such as constrains on the 𝜈 and 𝑤 values. If 𝜈 and 𝑤 are equal to 1 then the laboratories estimate their uncertainty perfectly and a value more or less 1 indicates that the facilities underestimate or overestimate their uncertainties respectively [35]. The authors concluded that the multiplier for Prime Lab, 𝑤 = 1.32 ± 0.20 , and for LLNL, 𝜈 = 1.15 ± 0.69, which indicates that both laboratories performed well in estimating their uncertainty, with a slight indication of underestimation [35]. 51 PART III 52 CHAPTER 7: A CRITICAL REVIEW This chapter is dedicated to general and critical comments on the applications of Bayesian statistics to mass spectrometric data or data produced by hyphenate systems such as liquid chromatography- mass spectrometry (LC-MS). The focus will be on the advantages and limitations, derived from the papers discussed in Part II, as well as personal impressions on the statistics applied in these articles. First of all, what should be stressed is the significant rise of the use of Bayesian statistics in different scientific fields, including analytical and bioanalytical chemistry. Fig. 16 illustrates the results of a small research conducted by counting Number of publications Increase of Bayesian statistics 120000 the 100000 publications concerning 80000 Bayesian statistics by of decade on the search 60000 engine Web of Science. It 40000 is clear that the last 20000 0 1940 number decade the number of 1960 1980 2000 2020 Year of publications Figure 16: Research conducted on the Web of Science, concerning the number of publications compared to the year of publication publications on this subject has been grown exponentially, and this can be explained by the advances in computer power, making it possible to handle large data sets nowadays. Bayesian statistics -- although seemingly based on a simple formula, 𝑃(𝜃|𝐷) ∝ 𝑃(𝜃)𝑃(𝐷|𝜃) where the answer is given by computing priors and likelihoods -- are usually applied to large data sets, e.g. proteomic data. Monte Carlo Markov Chain (MCMC) is usually employed enabling the computation of these probabilities in situations in which the integration over a large amount of parameters is required. Therefore, computational power is one of the key elements to get “answers” in a reasonable amount of time. 53 Secondly, in this thesis, out of the 26 discussed papers, 19 were related to life sciences, e.g. proteomics and metabolomics, which can be translated into a 73% with separate percentages being 58% and 15% respectively. 5 out of 26 (19%) were dedicated to forensic sciences and only 8% (2 articles) developed a Bayesian approach to treat data that did not fall into any of the aforementioned categories, such as the inter-laboratory comparison based on the 41Ca data [35]. There is a correlation between the rise of proteomics and Bayesian statistics the last 15 years, which can be explained by taking into consideration the requirements in a proteomic study. Just from a single experiment, large datasets can be produced. The usual request in this type of experiments is biomarker discovery, drug development and classification of sample between a health and a diseased state. Handling these data requires a high computational power, which is achievable nowadays, and a statistical method, which answers questions such as whether this is a peak or noise [9] or identification of proteins [17] through a database research. Bayesian statistics is the selected method, since enables the researchers to have a probability for each hypothesis being true, in comparison to a frequentist test, which allows only a control for the false positives and the false negatives. Moreover, Bayesian statistics provides a lot of flexibility in any application. Bayesian methods can be applied to many different scientific fields. As it is shown in Part II, the last five years, applications can be found in many different fields, where MS is used as an analytical tool. This can be explained by the fact that Bayesian statistics is a model-dependent method, and the right-fitted model can be selected to address the problem in question. However, this model dependency can also be considered as a limitation, because it is susceptible to bias. The right model formulation has to be selected but in a way that do not affect the results favorably. As an example of illustration, in [12], the research team chose 𝜉𝑖 (𝑡) as the deforming function for the sample i at time t with the restriction being monotone increasing. The team mentioned that this restriction was inevitable because otherwise it would erase observed peaks in the spectra. However, considering a function that is monotone increasing in a chromatogram means that it is impossible to have peak crossings across different chromatograms. This premise might be too strong, raising many questions. 54 Last but not least, as a general remark, Bayesian approaches are statistical methods that should be handled by personnel with enough background knowledge on the subject and practical experience in the field. It is a conceptually difficult subject that requires training. Experts should be employed for such type of data treatment in order to avoid inherently biased model and provide with the best non-controversial results. In conclusion, although Bayes’ Theorem was first published in 1763, it has only been employed systematically the last decade. It is a very interesting statistical method that answers critical questions, e.g. “how correct is my hypothesis?”. Nevertheless, there are limitations such as the model dependency that should be taken into account before the employment of Bayes’ theorem for data treatment. 55 AKNOWLEGDEMENTS I would like to thank my project supervisor Dr. G. Vivó Truyols from the University of Amsterdam for the support and guidance and the University of Amsterdam for excess to academic resources. 56 REFERENCES [1] N. Armstrong, D. B. Hibbert, “An introduction to Bayesian methods for analyzing chemistry data: Part 1: An introduction to Bayesian theory and methods”, Chemometrics and Intelligent Laboratory Systems 97 (2004) 194-210 [2] J. M. Bernardo, A. F. M. Smith, Bayesian Theory, John Wiley & Sons Ltd., Chichester, 1994 [3] D. S. Sivia, Data Analysis: A Bayesian Tutorial, Oxford University Press, Oxford, 1996 [4] C. M. Bishop, Pattern recognition and machine learning, Springer, New York, 2006 [5] K. Downards, Mass spectrometry: A foundation course, The Royal Society of Chemistry, Cambridge, 2004 [6] F. W. McLafferty, “Tandem mass spectrometry”, Science 214:4518 (1981) 280-287 [7] L. B. Fay, M. Kussmann, “ Chapter 1: Mass spectrometry Technologies”, RSC Food Analysis Monographs 9, The Royal Society of Chemistry, Cambridge, 2010 [8] N.L. Anderson, N.G. Anderson, “Proteome and proteomics: new technologies, new concepts, and new words”, Electrophoresis 19:11 (1998) 1853-61 [9] W. Shao, H. Lam, “Denoising Peptide Tandem Mass Spectra for Spectral Libraries: A Bayesian Approach”, Journal of Proteome Research 12 (2013) 3223−3232 [10] Y. Sun, J. Zhang, U. Braga-Neto, E. R. Dougherty, “BPDA- A Bayesian peptide detection algorithm for mass spectrometry”, BMC Bioinformatics 11:490 (2010), (http://www.biomedcentral.com/1471-2105/11/490) [11] Y. Sun, J. Zhang, U. Braga-Neto, E. R. Dougherty, “BPDA2d- a 2D global optimizationbased Bayesian peptide detection algorithm for liquid chromatograph–mass spectrometry”, Bioinformatics 28:4 (2012) 564-572 [12] X. Kong, C. Reilly, “A Bayesian approach to the alignment of mass spectra”, Bioinformatics 25:24 (2009) 3213-3220 [13] T. H. Tsai, M. G. Tadesse, Y. Wang, H. W. Ressom, “Profile-Based LC-MS Data Alignment- A Bayesian Approach”, ACM Transactions on Computational Biology and Bioinformatics 10:2 (2013) [14] T. H. Tsai, M. G. Tadesse, C. Di Poto, L. K. Pannell et al., “Multi-profile Bayesian alignment model for LC-MS data analysis with integration of internal standards”, Bioinformatics 29:21 (2013) 2774-2780 [15] O. Serang, M. J. MacCoss, W. S. Noble, “Efficient Marginalization to Compute Protein Posterior Probabilities from Shotgun Mass Spectrometry Data”, Journal of Proteome Research 9 (2010) 5346–5357 [16] J. Zhang, J. Ma, L. Dou, S. Wu et al., “Bayesian Nonparametric Model for the Validation of Peptide Identification in Shotgun Proteomics”, Molecular & Cellular Proteomics 8:3 (2009) 547-557 57 [17] R. D. LeDuc, R. T. Fellers, B. P. Early, J. B. Greer et al., “The C‑Score: A Bayesian Framework to Sharply Improve Proteoform Scoring in High-Throughput Top Down Proteomics”, Journal of Proteome Research 13 (2014) 3231−3240 [18] J. G. Booth, K. E. Eilertson, P. Dominic, B. Olinares et al., “A Bayesian Mixture Model for Comparative Spectral Count Data in Shotgun Proteomics”, Molecular & Cellular Proteomics 10:8 (2011) 1-6 [19] Q. Zhu, T. Burzykowski, “A Bayesian Markov-Chain-Based Heteroscedastic Regression Model for the Analysis of 18O-Labeled Mass Spectra”, Journal of American Society of Mass Spectrometry 22 (2011) 499-507 [20] S. He, X. Li, M. R. Viant, X. Yao, “Profiling MS proteomics data using smoothed nonlinear energy operator and Bayesian additive regression trees”, Proteomics 9 (2009) 4176–4191 [21] K. W. Kuschner, D. I. Malyarenko, W. E. Cooke, L. H. Cazares et al., “A Bayesian network approach to feature selection in mass spectrometry data”, BMC Bioinformatics 11:177 (2010), (http://www.biomedcentral.com/1471-2105/11/177) [22] C. Chung, A. Emili, B. J. Frey, “Non-parametric Bayesian approach to post-translational modification refinement of prediction from tandem mass spectrometry”, Bioinformatics 29:7 (2013) 821-829 [23] C. Chung, J. Liu, A. Emili, B.J. Frey, “Computational refinement of post-translational modifications predicted from tandem mass spectrometry”, Bioinformatics 27:6 (2011) 797-806 [24] U. Roessner, J. Bowne, “What is metabolomics all about?”, BioTechniques 46 (2009) 363-365 [25] J. Jeong, X. Shi, X. Zhang, S. Kim et al., “An empirical Bayes model using a competition score for metabolite identification in gas chromatography mass spectrometry”, BMC Bioinformatics 12:392 (2011), (http://www.biomedcentral.com/14712105/12/392) [26] T. Suvitaival, S. Rogers, S. Kaski, “Stronger findings from mass spectral data through multi-peak modeling”, BMC Bioinformatics, 15:208 (2014), (http://www.biomedcentral.com/1471-2105/15/208) [27] E. Correa, R. Goodacre, “A genetic algorithm-Bayesian network approach for the analysis of metabolomics and spectroscopic data: application to the rapid identification of Bacillus spores and classification of Bacillus species”, BMC Bioinformatics, 12:33 (2011), (http://www.biomedcentral.com/1471-2105/12/33) [28] I. Olivier, D. T. Loots, “A metabolomics approach to characterise and identify various Mycobacterium species”, Journal of Microbiological Methods 88 (2012) 419–426 [29] P. E. Sottas a, C. Saudan, C. Schweizer, N. Baume et al., “From population- to subjectbased limits of T/E ratio to detect testosterone abuse in elite sports”, Forensic Science International 174 (2008) 166–172 [30] R. J. N. Bettencourt da Silva, D. M. Silveira a , M.F. G. F. C. Camões, C. M. F. Borges et al., “Validation, Uncertainty, and Quality Control of Qualitative Analysis of Tear Gas Weapons by Gas Chromatography-Mass Spectrometry”, Analytical Letters 47 (2014) 250–267 [31] N. Farmer, W. Meier-Augenstein, D. Lucy, “Stable isotope analysis of white paints and likelihood ratios”, Science and Justice 49 (2009) 114–119 [32] H. E. Jones, M. Hickman, B. Kasprzyk-Hordern, N. J. Welton, “Illicit and pharmaceutical drug consumption estimated via wastewater analysis. Part B: Placing back- 58 calculations in a formal statistical framework”, Science of the Total Environment 487 (2014) 642–650 [33] G. Zadora, “Evaluation of the evidential value of physicochemical data by a Bayesian network approach”, Journal of Chemometrics 24 (2010) 346–366 [34] L. Hejazia, D. B. Hibbert, D. Ebrahimia, “Identification of the geometrical isomers of _linolenic acid using gas chromatography/mass spectrometry with a binary decision tree”, Talanta 83 (2011) 1233–1238 [35] G. S. Jackson, D. J. Hillegonds, P. Muzikar, B. Goehring, “Ultra-trace analysis of 41Ca in urine by accelerator mass spectrometry: An inter-laboratory comparison”, Nuclear Instruments and Methods in Physics Research B 313 (2013) 14–20 59

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download the use of bayesian statistics in mass spectrometry data