THE USE OF BAYESIAN
STATISTICS IN MASS
SPECTROMETRY DATA
Literature research
Elli Karampini
Supervisor: Gabriel Vivó-Truyols
CONTENTS
PART I
CHAPTER 1: BAYESIAN STATISTICS
1.1. Bayes' Theorem
1.2. Bayesian network
CHAPTER 2: MASS SPECTROMETRY
2.1. Introduction to mass spectrometry
2.2. Instrumentation of mass spectrometry
PART II
CHAPTER 3: BAYESIAN STATISTICS IN PROTEOMIC STUDIES
3.1. Data pretreatment
3.2. Data treatment
CHAPTER 4: BAYESIAN STATISTICS IN METABOLOMIC STUDIES
4.1. Applications of Bayesian approach in metabolomics
CHAPTER 5: BAYESIAN STATISTICS IN FORENSIC SCIENCES
5.1. Bayes' Theorem and forensic evidence
5.2. Mass spectrometry and Bayesian statistics in forensic sciences
CHAPTER 6: BAYESIAN STATISTICS IN VARIOUS APPLICATIONS
PART III
CHAPTER 7: A CRITICAL REVIEW
ACKNOWLEDGEMENTS
REFERENCES
To my parents, Alexandros and Mary,
and Achilles
PART I
CHAPTER 1: BAYESIAN STATISTICS
Bayesian statistics denotes the sub-category of statistics in which the evidence about the true state of a hypothesis is expressed in degrees of belief, and more specifically in Bayesian probabilities. The term Bayesian was adopted to honor Thomas Bayes, although Pierre-Simon Laplace had independently worked on the same subject under the name "the probability of causes".
There are two main schools of thought in calculating probabilities. In the frequentist approach, the p-value is defined as the probability of observing the data at hand, or data more extreme, given that the hypothesis is true. This type of probability is calculated through a frequentist test, such as a t-test of one or two means. However, such a test does not answer the question "is my hypothesis correct?" but rather takes the hypothesis as a given. On the other hand, the frequentist approach ensures a certain probability of failure of the whole procedure: although it cannot answer the question "is my hypothesis correct?", it guarantees that, if the procedures of the frequentist test are followed, the frequency of wrong answers is kept under control. By contrast, the answer to "is my hypothesis correct?" can be given through a Bayesian approach. More specifically, the probability that the hypothesis is correct is determined using the mathematical formula of Bayes' theorem. This process combines the background information I with the outcome (data) D of an experiment, together with the probability of the hypothesis before considering the data, in order to assess the probability of the hypothesis itself, P(θ|D, I) [1].
The purpose of this thesis is to provide a critical review of the recent developments (roughly the last five years) in applying Bayesian statistics to the analysis of data from mass spectrometry instruments. Firstly, a short introduction to Bayesian statistics and mass spectrometry is given in Part I. In Part II, articles relevant to the topic from a broad spectrum of analytical fields, such as proteomics, metabolomics and forensic sciences, are discussed, and finally a critical review is presented in Part III.
1.1. Bayes’ theorem
Bayes' theorem is a rule which indicates how to treat conditional probabilities. The conditional probability of an event is defined as the probability obtained with the additional information that another event has already occurred [1]. It was first introduced by Thomas Bayes (1701-1761) and
published under the name “An Essay towards solving a Problem in the Doctrine of
Chances” in 1763, two years after his death. It was his friend Richard Price
(1723-1791) who communicated the paper through John Canton (1718-1772)
to the Royal Society [2].
Bayes' theorem is based on the two fundamental rules of probability theory, the product rule and the addition rule. The first defines the joint probability of two or more propositions by means of the following equation

$$P(x, y \mid I) = P(x \mid I)\, P(y \mid x, I) = P(y \mid I)\, P(x \mid y, I)$$

where x and y are the propositions, which are interchangeable, I is the background information and P(x, y|I) is the probability of x and y conditional on I. The latter rule can be expressed as
$$P(\{y_j\} \mid I) = \sum_{i=1}^{M} P(\{y_j\}, \{x_i\} \mid I) = \sum_{i=1}^{M} P(\{x_i\} \mid I)\, P(\{y_j\} \mid \{x_i\}, I)$$

and, due to symmetry,

$$P(\{x_i\} \mid I) = \sum_{j=1}^{N} P(\{x_i\}, \{y_j\} \mid I) = \sum_{j=1}^{N} P(\{y_j\} \mid I)\, P(\{x_i\} \mid \{y_j\}, I)$$
where {x_i : i = 1, 2, 3, …, M} and {y_j : j = 1, 2, 3, …, N} are sets of propositions, in general with M ≠ N. In this case the marginalization of joint probabilities of a discrete set of variables is defined, also known as the Total Probability Theorem. The same applies for continuous variables, such that x ∈ X, y ∈ Y and X, Y ⊆ ℝ,
$$P(x \mid I) = \int_{Y} P(x, y \mid I)\, dy$$

and

$$P(y \mid I) = \int_{X} P(x, y \mid I)\, dx$$
where, as explained in Armstrong et al., “marginalization can be considered as
integrating out unnecessary variables” [1].
By putting the above-mentioned probability rules together, we arrive at the mathematical formula of Bayes' theorem. If the purpose is to determine the probability of a continuous parameter θ (for instance the mean μ and/or the standard deviation σ), given the data D and the background information I, then we can express the joint probability of θ and D, given I, as follows:

$$P(\theta, D \mid I) = P(\theta \mid I)\, P(D \mid \theta, I) = P(D \mid I)\, P(\theta \mid D, I)$$

Due to the equality of the right-hand part of the equation, we arrive at:

$$P(\theta \mid D, I) = P(\theta \mid I)\, \frac{P(D \mid \theta, I)}{P(D \mid I)}$$

where $P(D \mid I) = \int_{\Theta} P(\theta, D \mid I)\, d\theta$ with $\Theta \subseteq \mathbb{R}$. The above formula is the mathematical equation of Bayes' theorem, and each term has its own meaning. Firstly, P(θ|D, I) is known as the posterior probability, which asserts the plausibility of θ given D and I. Secondly, P(θ|I) is called the prior probability, which is the probability value before any additional information, e.g. the data D, is taken into account; in other words, it is the plausibility of θ before conducting the experiment. The numerator of the fraction, P(D|θ, I), is the likelihood and quantifies the plausibility of D given θ and I, while the denominator P(D|I) plays the role of a normalization factor [1].
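To make these terms concrete, the posterior of a continuous parameter can be approximated numerically on a grid. The following is a minimal sketch (not taken from [1]; the data, the noise level σ and the flat prior are hypothetical) that computes the posterior of a mean μ from Gaussian data via exactly the prior × likelihood / evidence structure above.

```python
import numpy as np

# Hypothetical data D: replicate measurements with known noise sigma = 1.0
D = np.array([4.8, 5.1, 5.3, 4.9])
sigma = 1.0

# Grid over the parameter theta (here: the mean mu)
theta = np.linspace(0.0, 10.0, 1001)

# Prior P(theta|I): flat over the grid (any normalized prior would do)
prior = np.ones_like(theta) / theta.size

# Likelihood P(D|theta, I): product of Gaussian densities over the data
likelihood = np.prod(
    np.exp(-0.5 * ((D[:, None] - theta[None, :]) / sigma) ** 2), axis=0
)

# Evidence P(D|I): marginalize the joint over theta (the normalization factor)
evidence = np.sum(prior * likelihood)

# Posterior P(theta|D, I) via Bayes' theorem
posterior = prior * likelihood / evidence

print("posterior mean:", np.sum(theta * posterior))
```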
In the case of hypothesis testing, where there are two hypotheses, H₁ and H₂, under investigation, given a set of data D, the Bayesian formula for each one is as follows:

$$P(H_1 \mid D) = P(H_1)\, \frac{P(D \mid H_1)}{P(D)}$$

and

$$P(H_2 \mid D) = P(H_2)\, \frac{P(D \mid H_2)}{P(D)}$$
By dividing these two equations, we get:

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(H_1)}{P(H_2)} \cdot \frac{P(D \mid H_1)}{P(D \mid H_2)}$$

which can be summarized as:

posterior odds = prior odds × likelihood ratio [2, 3]
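As a small worked illustration (with made-up numbers): if the prior odds P(H₁)/P(H₂) are 1/4 and the likelihood ratio P(D|H₁)/P(D|H₂) is 20, the posterior odds become 20 × (1/4) = 5, i.e. after seeing the data H₁ is five times as plausible as H₂.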
Although Bayes' theorem has been known since the 18th century, a remarkable increase in the employment of this theory in different fields, such as Chemistry and Physics, has been recorded in the last two decades. To illustrate this, a search for the terms "Bayesian" or "Bayes" was conducted in the scientific search engine Web of Science. It was found that 106,585 out of 110,663 relevant articles were dated after 1990, and of those 90,013 were dated after 2000. This sudden increase in interest might be due to the rise of computational power in the last decades.
1.2. Bayesian network
Probabilities play a crucial role in pattern recognition, and it is highly beneficial to support the analysis with diagrammatic representations of the dependences between variables, known as probabilistic graphical models. Graphical models offer useful properties, such as a simple way of visualizing a probabilistic model and insights into the model's properties, including conditional independence properties. The Bayesian network, also called a directed graphical model, belongs to this category. As far as the graphical illustration is concerned, the diagram consists of nodes, which represent random variables or sets of variables, connected by arcs, which express probabilistic relationships between these variables [4].
Consider for example the joint probability distribution over three variables a, b, c, i.e. p(a, b, c). By applying the product rule to the joint distribution, the result is:

$$p(a, b, c) = p(c \mid a, b)\, p(a, b)$$

and by applying the product rule a second time, the right-hand part of the above equation becomes:

$$p(a, b, c) = p(c \mid a, b)\, p(b \mid a)\, p(a)$$
The latter equation can be described as a graphical model by first introducing a node for each variable and associating each one with the corresponding conditional distribution. For p(c|a, b) there will be arcs from nodes a and b to node c, for p(b|a) there will be an arc from a to b, and finally for p(a) there will be no incoming links.

Figure 1: Bayesian model for three variables a, b, c with their conditional distributions, represented as arcs [4].

If there is a link from node x to node y, then node x is called the parent of node y, while node y is called the child of node x. The link indicates that the probability of y is dependent on x; in other words, that p(y|x) can adopt some value.
The example discussed above can be extended to K variables. The joint probability over K variables is given by:

$$p(x_1, x_2, \ldots, x_K) = p(x_K \mid x_1, \ldots, x_{K-1}) \cdots p(x_2 \mid x_1)\, p(x_1)$$

This equation can be presented as a directed graph with K nodes, one for each conditional distribution on the right-hand side. Each node has incoming arcs from the lower-numbered nodes; such a graph is therefore called fully connected [4].
There are also cases where arcs are absent, as shown in fig. 2: x₁, x₂ and x₃ are parent nodes, and there is no direct link between x₆ or x₇ and the parent nodes. This absence provides interesting information about the properties of the class of distributions that the graph represents.

Figure 2: Directed acyclic graph of seven variables, with three parent nodes and no direct link between x₆ or x₇ and the parent nodes [4].

The decomposition of the joint distribution of these seven variables is given by:

$$p(x_1, x_2, \ldots, x_7) = p(x_1)\, p(x_2)\, p(x_3)\, p(x_4 \mid x_1, x_2, x_3)\, p(x_5 \mid x_1, x_3)\, p(x_6 \mid x_4)\, p(x_7 \mid x_4, x_5)$$
The rule can be generalized: for a graph with K nodes, the joint distribution is given by:

$$p(x) = \prod_{k=1}^{K} p(x_k \mid \mathrm{pa}_k)$$

where pa_k denotes the set of parents of x_k and x = {x₁, …, x_K}. At this point an important restriction should be mentioned: the graph must contain no closed paths, i.e. it must be impossible to move from node to node following the direction of the arcs and end up back at the starting node. Such graphs are commonly referred to as Directed Acyclic Graphs (DAGs) [4].
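As a minimal sketch of this factorization (using the structure of fig. 2, binary variables, and purely hypothetical conditional probability tables), the joint probability of a full assignment can be evaluated by multiplying each node's conditional probability given its parents:

```python
# Joint probability of a DAG as the product of p(x_k | pa_k), here for
# binary variables with the structure of fig. 2 (hypothetical CPT values).

# Each node is listed with its parents, following fig. 2.
parents = {
    "x1": [], "x2": [], "x3": [],
    "x4": ["x1", "x2", "x3"],
    "x5": ["x1", "x3"],
    "x6": ["x4"],
    "x7": ["x4", "x5"],
}

def p_node(node: str, value: int, assignment: dict) -> float:
    """P(node = value | parents), from a toy CPT: a base rate nudged by parents."""
    base = 0.5 + 0.1 * sum(assignment[p] for p in parents[node])  # hypothetical
    return base if value == 1 else 1.0 - base

def joint(assignment: dict) -> float:
    """p(x) = prod_k p(x_k | pa_k) for a full assignment of all nodes."""
    prob = 1.0
    for node in parents:
        prob *= p_node(node, assignment[node], assignment)
    return prob

example = {"x1": 1, "x2": 0, "x3": 1, "x4": 1, "x5": 0, "x6": 1, "x7": 0}
print(joint(example))
```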
CHAPTER 2: MASS SPECTROMETRY
2.1. Introduction to mass spectrometry
Mass spectrometry (MS) is an analytical technique that enables the identification and quantification of compounds of interest by measuring the mass-over-charge ratio (m/z) and abundance of ions in the gas phase.
Mass spectrometry started to be employed in experiments over a century ago. Joseph John Thomson was the first to discover, in 1910, that each charged particle followed its own path to the detector, which was a photographic plate. The first experiments were done on hydrogen, and later atoms and molecules of carbon, oxygen and nitrogen were used. He argued that no particles, unless they shared the same velocity and charge-over-mass ratio (e/m), would strike the detector's plate at the same point. By inspecting the plate and knowing one of the parabolic paths that a set of particles sharing the same velocity and e/m had followed, the e/m of the other particles could be deduced. This is considered to be the birth of mass spectrometry, and important features discussed by Thomson still remain relevant [5].
Through mass spectrometry, scientists are able to measure atomic and molecular weights of species in complex samples, analyze compounds at low levels, and analyze compounds without preliminary sample-purification steps. These advantages outweigh the few disadvantages, such as the loss of the sample once it has been analyzed through MS [5].
Mass spectrometry (MS) can be coupled to several different techniques, such
as liquid chromatography and gas chromatography, to gain more information
about the sample. MS can be coupled also to itself with the result being tandem
mass spectrometry (MS/MS).
Tandem mass spectrometry
was developed in order to
acquire structural information
about the analyte by coupling
mass spectrometers in series
Figure 3: Schematic representation of a tandem mass
spectrometry experiment [5].
between the ion source and the
detector [5, 6]. The principle
9
behind this type of experiment is simple. The targeted compound is ionized and
its characteristic ions are separated from the mixture in the first mass
spectrometer. The selected primary ions, also known as parent or precursor ions,
are then decomposed in the dissociation region, resulting in fragment (historically
known as daughter) ions, which are analyzed by the last mass spectrometer before reaching the detector. Tandem mass spectrometry can achieve high specificity and sensitivity while retaining the advantage of a fast response time [6].
2.2. Instrumentation of mass spectrometry
A mass spectrometer has three main compartments: a) the ionization source, b) the mass analyzer and c) the ion detector [5]. One of the key performance indicators of a mass spectrometer is the resolving power, i.e. the minimum mass difference that can be separated at a given mass. Other performance indicators are the mass accuracy, which indicates the accuracy of the determination of the real mass of a compound; the sensitivity, which is expressed as the signal-to-noise ratio; and finally the linear dynamic range (fig. 4) [7].

Figure 4: Characteristics of the performance of a mass spectrometer [7].
The ionization source serves the role of converting the analytes (M) into ions.
The ionization occurs either when an electron is removed or added yielding a
radical (M·+ or M·- respectively). It can also take place when charged species (e.g.
H+) are added or subtracted resulting in [M+H]+ or [M-H]-. The ionization source
is also responsible for transferring the ions into gas phase before they are
introduced into the mass analyzer. There are different types of ionization sources:
a) electron (impact) ionization (EI), b) chemical ionization (CI), c) electrospray
ionization (ESI), d) atmospheric pressure chemical ionization (APCI), e)
atmospheric pressure photoionization (APPI) and f) matrix-assisted laser
desorption ionization (MALDI). EI is considered a "hard" ionization, resulting in considerable fragmentation of the molecular ion, while ESI and MALDI are mostly employed when the analytes are biological macromolecules, which are easily degraded under harsh conditions [5, 7].
The mass analyzer is the central part of the mass spectrometer, with its role
being to separate the ions with respect to their mass-over-charge ratio, providing
a defined mass accuracy and resolution. There are various types of mass analyzers
with differences in concept and performance, such as the quadrupole mass filter,
the time-of-flight analyzer (ToF), the ion trap analyzer and the Orbitrap system.
Often they work as autonomous mass analyzers; however, the current trend points in the direction of hyphenated systems, which increase the strength of the mass spectrometer by combining their advantages. Some of these systems are the triple-quadrupole mass filter, the quadrupole-time-of-flight (Q-ToF) system and the time-of-flight-time-of-flight (ToF-ToF) analyzer, the last of which is strongly connected to the analysis of biomolecules [7].
Finally, the ion detector is a device that generates an electrical current whose intensity is proportional to the abundance of the ions. Ions exiting the mass analyzer can either be directed to a single-channel detector or dispersed to a focal-plate (array) detector. Quadrupole and ion trap instruments are equipped with single-channel detectors, whereas time-of-flight systems use focal-plate detectors. The generated electrical current is subsequently digitized and the signal is transferred to a computer, where the data can be managed and stored [7].
PART II
CHAPTER 3: BAYESIAN STATISTICS IN PROTEOMIC
STUDIES
Proteomics is the large-scale study of the proteome. The term proteome
refers to the entire set of proteins, including their modifications, produced by an
organism or a cellular system. The main goal of proteomics is the comprehensive
and quantitative description of protein expression levels under the influence of
environmental changes, such as drug treatment or disease [8]. The main analytical
approach involves Mass Spectrometry (MS) with mild ionization. For the purification or separation of the sample prior to the analysis, Liquid Chromatography (LC) is usually selected.
The data derived from an MS or LC-MS system are very complex, and their further analysis can be divided into two categories: data pretreatment and data treatment. Each one, with respect to Bayesian statistics, is discussed in the following sections.
3.1. Data pretreatment
Data pretreatment is an essential step before the data become available for further analysis, which in the case of proteomic studies is strongly related to the discovery of biomarkers, drug development and disease classification. The data pretreatment usually consists of several stages, including denoising, peptide detection and spectra alignment.

All mass spectra and tandem mass spectra contain, apart from the peaks of peptide fragments that are considered useful signal, peaks from instrument noise and contaminants. This is especially the case when the data come from complex samples. The noise peaks should therefore be removed for matching to be successful [9].
Different publications can be found in which Bayesian statistics is used in the data pretreatment step. For example, Shao et al. in 2013 proposed an approach based on Bayesian inference for denoising spectra, in order to build spectral libraries. They built a Bayesian classifier to distinguish between signal (S) and noise (N) and trained it in such a way that no assumptions about peptide fragmentation behavior or instrumental settings are needed [9].
The authors selected four different features that are characteristics of a peak and serve as good indicators of whether peak i (i = 1, 2, …, N) is signal or noise. The first feature was the rank, F_r(i), which is simply the intensity rank of the peak: for the most intense peak F_r = 1, for the second most intense F_r = 2, and so on. Since the intensities vary significantly from spectrum to spectrum, the team used the intensity rank as a surrogate in order to avoid large changes in scale. The second selected feature was the m/z, i.e. F_m(i), which measures the relative position of a peak in the spectrum. The probability of finding signal is obviously not constant throughout the m/z range, but the exact trend is unknown prior to the experiment and is discovered from the data. Finally, the complement features, F_{c,Z}, record whether a complementary fragment ion can be found in the same spectrum, where the parameter Z denotes the sum of the assumed charges of the complementary pair, and the sister features, F_{s,Δ}, determine the existence of a sister fragment ion in the spectrum. The sister peak is located at a distance Δ away from the peak of interest; this captures information concerning common neutral losses or isotopic ions [9].
Each peak is categorized as signal (S) or noise (N) by a consensus algorithm,
according to whether or not it is consistently found across replicates. Once the peaks are labeled as S or N, the conditional probabilities P(F_k|S) and P(F_k|N) can
be calculated for each feature 𝐹𝑘 mentioned above. These conditional probabilities
constitute the Bayesian classifier and they can immediately be used to denoise
singleton spectra or be written in a parameter file for future use. Given the
conditional probabilities, by means of Bayes’ theorem, the computation of
posterior probability of an unlabeled peak is possible:

$$P(S \mid \{F_k\}) = \frac{P(S) \prod_k P(F_k \mid S)}{P(S) \prod_k P(F_k \mid S) + P(N) \prod_k P(F_k \mid N)}$$
where P(S) is the prior probability of a peak being signal, without any additional information, and P(N) = 1 − P(S). The prior probability was predicted by the team using a linear regression model with 16 features for any given spectrum. As in a typical Bayesian classifier, the posterior probability can be subjected to an appropriate threshold to decide whether the peak is preserved in the denoised spectrum [9].
The researchers claimed that the computed probabilities were reasonably accurate, and their "denoiser" showed that the filtered spectra retained signal peaks and exhibited high similarity to their replicates, which indicates that their method would be a useful tool for spectral libraries. Additionally, the classifier is very flexible and can be further improved by adding or modifying the selected features [9].
Peptide detection, whose main goal is to convert the raw spectra into a list
of peptides and find the existence probability of each peptide candidate, has a
direct effect on the subsequent analysis such as protein identification and
quantification, biomarker discovery and classification of different samples. The
difficulty is that peptides usually give several peaks in the spectra due to different
charge states during ionization and isotopic peaks. To address this issue, in 2010,
Sun et al. proposed a Bayesian approach for peptide detection (BPDA), which can
be applied to MS data that have been generated by instruments with high enough
resolution [10].
The authors used one-dimensional mass spectra (1D MS), and the proposed method can be considered a three-step approach. The first step is to obtain a list of peptide candidates from the observed peaks, and the second is to model the observed spectra, taking into account the signals of the N peptide candidates. The final step is to apply the algorithm to the fitted MS model to infer the best-fitting peptide signals [10].
In more detail, the spectra were first baseline corrected, the noise was filtered, peaks were detected using "mspeaks" (a Matlab function) and a list of peptide candidates was generated. The mass of each peptide candidate was produced by means of the following equation:

$$\mathrm{mass} = i(d - m_{pc}) - j\, m_{nt}, \quad i = 1, 2, \ldots, cs;\ j = 1, 2, \ldots, iso$$
where mass is the mass of one peptide candidate, d denotes the m/z value of a detected peak, m_pc is the mass of one positive charge, m_nt is the mass shift due to the addition of one neutron, and the parameters i and j represent the charge state and isotopic position respectively. The researchers then modeled the spectra, taking into account the different charge states and isotopic positions for each candidate; the model also incorporates the existence probability of the candidates and the thermal noise. The signal of N peptide candidates was given by the following equation:
$$y_m = \sum_{k=1}^{N} \lambda_k g_k(x_m) + \varepsilon_m = \sum_{k=1}^{N} \lambda_k \sum_{i=1}^{cs} \sum_{j=0}^{iso} c_{k,ij}\, f(x_m; \rho_{k,ij}, a_{k,ij}) + \varepsilon_m, \quad m = 1, 2, \ldots, M$$
where x_m is the m-th m/z value in the spectrum, y_m is the intensity at x_m, M is the number of observations, ε_m is the noise (ε_m ~ N(0, σ²)), f(x_m; ρ_{k,ij}, a_{k,ij}) is the peak shape function, taken as Gaussian-shaped, with a_{k,ij} being the theoretical m/z value of the peak for the k-th candidate and ρ_{k,ij} the peak's width, c_{k,ij} is the height of the peak of peptide k at charge state i and isotopic position j and, finally, λ_k is an indicator random variable, which is 1 if the peptide truly exists and 0 otherwise. The goal is to determine all the unknown parameters of the model, θ ≜ {λ_k, c_{k,ij}; k = 1, …, N; i = 1, …, cs; j = 1, …, iso}, and especially λ_k, based on the observed spectrum y = [y₁, …, y_M]ᵀ. Therefore, the Bayesian approach was employed to obtain the posterior probabilities of θ, P(θ|y). The posterior probability of λ_k can be obtained by integrating the joint posterior probability over all parameters except λ_k:

$$P(\lambda_k \mid y, \theta_{-\lambda_k}) \propto p(\lambda_k)\, p(y \mid \theta), \quad \text{where } \theta_{-\lambda_k} \triangleq \theta \setminus \lambda_k$$

and for the computation the team chose the Gibbs sampling method, a variant of Markov Chain Monte Carlo (MCMC) [10].
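To make the sampling step concrete, the following is a toy Gibbs sampler for the indicator variables only (it is not the authors' implementation: peak templates, heights and the noise level are fixed and hypothetical, whereas BPDA also samples the peak heights):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy version of the indicator part: y = sum_k lambda_k * g_k + noise,
# with fixed, known Gaussian peak templates g_k.
M, sigma, prior = 200, 0.1, 0.5          # grid size, noise sd, P(lambda_k = 1)
x = np.linspace(0, 20, M)
centers = [5.0, 9.0, 14.0]               # hypothetical candidate m/z positions
G = np.stack([np.exp(-0.5 * ((x - c) / 0.3) ** 2) for c in centers])

truth = np.array([1, 0, 1])              # which candidates truly exist
y = truth @ G + rng.normal(0, sigma, M)  # simulated spectrum

def log_lik(lam):
    resid = y - lam @ G
    return -0.5 * np.sum(resid ** 2) / sigma ** 2

lam = np.ones(3)                          # initial state
samples = []
for sweep in range(2000):                 # Gibbs sweeps over the indicators
    for k in range(3):
        logp = []
        for v in (0.0, 1.0):              # full conditional of lambda_k
            lam[k] = v
            logp.append(np.log(prior if v else 1 - prior) + log_lik(lam))
        p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))
        lam[k] = float(rng.random() < p1)
    samples.append(lam.copy())

# Posterior existence probabilities after burn-in
print("posterior P(lambda_k = 1):", np.mean(samples[500:], axis=0))
```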
According to the authors, BPDA, their proposed model, which considers charge states and isotopic positions, compared favorably to commercial and open-source software in terms of peptide detection, but lagged in terms of computational time, since it was found time-consuming, especially when running in raw data mode [10].
Two years later, in 2012, the same team published a new paper, which can be described as the continuation of the first one. This time their proposal again concerned a peptide detection approach, but LC-MS data were used as input. They presented BPDA2d, a two-dimensional (2D) Bayesian peptide detection algorithm, to process such data more efficiently. BPDA2d shares the same core as BPDA, which is to evaluate all possible combinations of peptide candidates in order to minimize the mean square error (MSE) between inferred and observed spectra. The difference between the algorithms is that BPDA models spectra along the m/z dimension only, while BPDA2d models spectra along both the m/z and retention time (RT) dimensions [11].
After baseline correction, noise filtering and peak detection along the m/z axis, the authors added one more step before obtaining the list of peptide candidates: the detected 1D peaks were connected along the RT dimension. The 1D peaks were sorted according to their RT positions, and if multiple peaks were connected at the same RT, only the one with the largest intensity was retained. The generation of peptide candidates then proceeded following their previously described method. The model used to formulate the spectra was substantially the same, apart from the addition of time as a parameter:
$$y(x_m, t) = \sum_{k=1}^{N} \lambda_k g_k(x_m, t) + \varepsilon(t) = \sum_{k=1}^{N} \lambda_k \sum_{i=1}^{cs} \sum_{j=0}^{iso} c_{k,ij}\, l_k(t)\, I_{x_m = a_{k,ij}} + \varepsilon(t), \quad m = 1, \ldots, M;\ t = 1, \ldots, T$$
where I is an indicator function (I_A = 1 if condition A holds, I_A = 0 otherwise) and l_k is the normalized elution profile of the k-th peptide candidate. As in the previous article, the model takes into account charge state and isotopic position, but now also includes the peptides' elution peaks; it likewise incorporates the existence probability of the candidates and the thermal noise. The authors calculated the posterior probabilities of the unknown parameters θ of the model using Bayes' theorem, and focused on λ_k, the indicator random variable (λ_k = 1 if the peptide truly exists, λ_k = 0 if not), by integrating the joint posterior probability P(θ|y) over all other parameters except λ_k [11].
The authors claimed that their model, BPDA2d, surpassed advanced software, such as msInspect, and their previous proposal, BPDA, in terms of sensitivity and detection accuracy. They also mentioned that their proposal is better suited for time-of-flight (ToF) data [11].
The alignment of spectra, which is necessary to correct for experimental variations, a common problem in mass spectrometry (MS), is the basic step in the comparison of spectra. Alignment approaches can be divided into two categories: 1) feature-based and 2) profile-based approaches. The first is based on the distinction between signals from analytes and irrelevant noise, which is the key point for a successful approach, also known as peak detection, followed by the direct alignment of the detected peaks. The latter, on the other hand, uses the whole spectrum to evaluate the experimental variation and adjusts each spectrum accordingly; the attempt is to find an alignment that minimizes the difference between all spectra and a reference spectrum [12, 13, 14].
The alignment of MS spectra was the purpose of a paper by Kong et al. in 2009, with the overall goal of comparing mean spectra across different patient populations, which is helpful for biomarker discovery. The authors, advancing a previously developed method (Reilly et al. 2004), proposed a profile-based approach, using a parametric model in conjunction with Bayesian inference [12]. Firstly, the team started with normalization: the alignment model depends on the spectra's abundances, and therefore their variations, due to differences in sample preparation or matrix crystallization, needed to be minimized. The chosen method was to apply one scaling factor to each MS run, following the notation of Wu (2004). The second step was the alignment model, given by the following equation:
𝑥𝑖 (𝑡) = 𝜃(𝜉𝑖 (𝑡)) + 𝜀𝑖 (𝑡)
where x_i(t) denotes the height/intensity (on the log scale) of sample i at time t (which corresponds to a certain m/z in a ToF instrument), θ(t) is the average spectrum of the patient population at t, ξ_i(t) is the deforming function for sample i at time t, and finally ε_i(t) is the random error. The restriction is that ξ_i(t) must be monotone increasing; otherwise it could erase observed peaks in the spectra. The ξ_i(t) function is parameterized as a piecewise linear function with knots positioned at the locations (or a subset of the locations) of the observed data. The posterior modes of ξ₁(t), ξ₂(t), …, ξ_n(t) are estimated by minimizing
$$\sum_{i=1}^{n} \sum_{j=1}^{T-1} \frac{1}{|E_j|} \left\{ \int_{E_j} \left[ x_i(t) - \theta(\xi_i(t)) \right]^2 dt + \frac{\sigma^2}{\tau^2} \int_{E_j} \left[ \xi_i(t) - t \right]^2 dt \right\}$$
where the team assumed that the least squares of x_i(t) − θ(ξ_i(t)) over E_j, i.e. ∫_{E_j}[x_i(t) − θ(ξ_i(t))]² dt, are independently distributed as σ²|E_j|χ²_l, and that the least squares of ξ_i(t) − t over E_j are independently distributed as τ²|E_j|χ²_l. The ξ_i(t) function needs only to be defined for t ∈ T = (t₁, t₂, …, t_T); E_j is the partition of [t₁ = min_{t∈T}, t_T = max_{t∈T}] defined by the locations of the knots of ξ_i(t). The above
equation is subject to two conditions: 1) ξ_i(t_j) < ξ_i(t_{j+1}), in order to guarantee that ξ_i(t) is strictly monotone increasing, and 2) if min(x_i(t_{j+1}), x_i(t_j)) > r_i(t_j) then |ξ_i(t_{j+1}) − ξ_i(t_j)| = t_{j+1} − t_j, with i = 1, 2, …, n and j = 1, 2, …, T−1, so as to maintain the shape of the peaks along the ToF axis during the alignment process. Lastly, for the computations, the researchers followed the approach of Reilly et al. (2004) and applied a dynamic programming (DP) algorithm to relate the minimized value to the approximation of the objective function [12].
The results of this method showed that the model is very efficient for low-mass-accuracy data. For MS spectra with 5-50 ppm accuracy, the method shows an improvement with respect to other conventional methods, although it is not as efficient as with low-mass-accuracy data. An interesting idea mentioned by the authors is that this model can be used to align spectra from different laboratories, which are severely misaligned, since the method can handle such misalignments [12].
The alignment of LC-MS data was addressed in two different papers in 2013, published by the same team, proposing the same approach with slight differences. The approach was tested, in one case, on proteomic and metabolomic data [13] and, in the other, on proteomic, metabolomic and glycomic data [14]. The team's proposal is a Bayesian Alignment Model (BAM), which is a profile-based approach.
BAM performs RT alignment based on multiple chromatograms of LC-MS runs. The model is based on alignment of the total ion chromatogram (TIC) or the base-peak chromatogram, thus reducing the order of the data (from second to first order). The model has two major components: the prototype function and the mapping function. The prototype function, m(t), characterizes the part of the spectro-chromatograms that is shared by the different samples; for the i-th chromatogram at RT t, the intensity is referred to the prototype function indexed by the mapping function u_i(t), i.e. m(u_i(t)). Each chromatogram y_i (for sample i) is modeled as:

$$y_i(t) = c_i + a_i\, m(u_i(t)) + e_i(t)$$

for i = 1, 2, 3, …, N observed chromatograms y_i(t), where c_i ~ N(c₀, σ_c²) and a_i ~ N(a₀, σ_a²) are parameters and e_i(t) is the error, which is considered independent
and normally distributed, e_i(t) ~ N(0, σ_e²). The prototype function is modeled with a B-spline regression, m = B_m ψ, where ψ is a vector of elements drawn from a normal distribution with a specific mean (ψ_l ~ N(ψ_{l−1}, σ_ψ²), with ψ₀ = 0), and the mapping function is a piecewise linear function characterized by a set of knots τ = (τ₀, τ₁, τ₂, …, τ_{k+1}) and their corresponding indices φ_i = (φ_{i,0}, φ_{i,1}, φ_{i,2}, …, φ_{i,k+1}). Following this process, the alignment problem is transformed into an inference task where, given the chromatograms y = {y₁, y₂, …, y_N}, the model parameters θ = {a, c, ψ, a₀, c₀, σ_a², σ_c², σ_e², σ_ψ²} need to be estimated. The authors used Markov Chain Monte Carlo (MCMC) methods to draw inference for the parameters. Once the inference is complete, the alignment is carried out by applying the inverse mapping function to each chromatogram, i.e. ŷ_i(t) = y_i(û_i⁻¹(t)) [13, 14].
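The final alignment step can be sketched compactly, assuming the inference has already produced the knots τ and the sample-specific indices φ_i (all values below are hypothetical): the mapping function is piecewise linear, and, being monotone, its inverse is obtained by swapping the roles of the two knot vectors.

```python
import numpy as np

# Aligning one chromatogram, assuming inference has already produced the
# mapping-function knots: u_i maps the knot times tau to the sample-specific
# positions phi_i, linearly interpolated in between (values are hypothetical).
t = np.linspace(0, 100, 501)                    # retention-time grid
y_i = np.exp(-0.5 * ((t - 42.0) / 2.0) ** 2)    # toy chromatogram, peak at 42

tau = np.array([0.0, 25.0, 50.0, 75.0, 100.0])  # knots of u_i(t)
phi = np.array([0.0, 27.0, 54.0, 76.0, 100.0])  # inferred indices for run i

def u(times):            # piecewise-linear mapping function u_i(t)
    return np.interp(times, tau, phi)

def u_inv(times):        # inverse mapping (monotone, so swap the knot roles)
    return np.interp(times, phi, tau)

# Aligned chromatogram: y_hat_i(t) = y_i(u_i^{-1}(t))
y_aligned = np.interp(u_inv(t), t, y_i)

print("peak before:", t[np.argmax(y_i)], "after:", t[np.argmax(y_aligned)])
```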
Although both papers share the same overall approach, one can say that the second paper is an advanced version of the first. In the first article, the authors used a single ion chromatogram to estimate the prototype and mapping functions for RT alignment. As claimed, the model showed better performance than other profile-based methods, such as the Dynamic Time Warping (DTW) model (J. Chemometrics (2004) 18: 231-241) [13]. However, as stated in the second article, their first model lacked integration of prior knowledge, e.g. internal standards, and assumed the existence of a pattern based only on a single ion chromatogram. So they introduced an advanced version, which incorporates multiple representative chromatograms and internal standards [14].
As far as the multiple representative chromatograms are concerned, the authors proposed a clustering approach for the identification of the chromatograms, where the chromatograms are simultaneously considered in the profile-based alignment to assist the estimation of the prototype and mapping functions. As for the internal standards, based on their peaks the research team was able to evaluate the RT variations by means of Gaussian Process (GP) regression, and this information is used as the prior of the mapping function [14]. The figures below graphically illustrate the models of the first and second articles.
Figure 5: Bayesian alignment model, profile-based approach [13].
Figure 6: Bayesian alignment model, profile-based approach with the incorporation of internal
standards and chromatographic clustering [14].
3.2. Data treatment
Identifying the proteins in a complex mixture is a common task that is usually performed by tandem MS, which has become a very useful tool for this purpose. Typically, the protein identification process consists of two main stages. In the first stage, the observed spectra and precursor m/z values are matched to peptides by a database search tool, and in the second stage proteins are scored and ranked using the scored peptides. The large amount of data produced by the MS can be considered beneficial on the one hand but, on the other, leads to false matches between peptides and spectra, lowering the specificity. One of the most common problems is the degeneracy of peptides, i.e. the possibility of matching a peptide to multiple proteins. These degenerate peptides are responsible for the difficulty in calculating the posterior probabilities of proteins, since the posterior probability of one protein depends on the presence of another, especially when a peptide is matched to both of them. With respect to this problem, Serang et al. proposed in 2010 a Bayesian approach for computing posterior probabilities [15].
The suggested model uses three parameters and allows a protein with strong independent evidence to lower the significance of supporting evidence that is shared with other proteins. The model is based on seven assumptions: 1) the recovery of one peptide by the precursor scan does not influence the retrieval of other peptides, given the set of proteins in the sample; 2) the creation and observation of one spectrum does not influence the creation and observation of other spectra, given the set of peptides selected by the precursor scan; 3) the emission of a peptide is associated with a present protein with probability α; 4) the wrong detection of peptides from noisy signals (the probability that a truly absent peptide, not created by an associated protein, is falsely observed) has a constant probability β; 5) the prior belief that a protein is present in the sample has probability γ; 6) the prior probabilities are independent; and finally 7) each spectrum depends only on the peptide to which it is best matched. From this probability model, the team was able to compute the likelihood of a set of proteins, which is proportional to the probability that these proteins would create the observed spectra, as follows:
$$L(R = r \mid D) \propto P(D \mid R = r) = \sum_{\forall e} \prod_{\varepsilon} P(D_\varepsilon \mid E_\varepsilon = e_\varepsilon)\, P(E_\varepsilon = e_\varepsilon \mid R = r)$$

$$= \sum_{\forall e_1} \sum_{\forall e : e_1} \prod_{\varepsilon} P(D_\varepsilon \mid E_\varepsilon = e_\varepsilon)\, P(E_\varepsilon = e_\varepsilon \mid R = r)$$

$$= \sum_{\forall e_1} P(D_1 \mid E_1 = e_1)\, P(E_1 = e_1 \mid R = r) \cdot \sum_{\forall e : e_1} \prod_{\varepsilon \neq 1} P(D_\varepsilon \mid E_\varepsilon = e_\varepsilon)\, P(E_\varepsilon = e_\varepsilon \mid R = r)$$

$$= \prod_{\varepsilon} \sum_{\forall e_\varepsilon} P(D_\varepsilon \mid E_\varepsilon = e_\varepsilon)\, P(E_\varepsilon = e_\varepsilon \mid R = r)$$
where R denotes the present proteins, E is the set of present peptides, ε indexes the peptides and D is the observed spectra. R and E are random variables that represent the true presence of proteins and peptides, and r and e are specific values of these variables. P(D_ε|E_ε = e_ε) was calculated by PeptideProphet and P(E_ε = e_ε|R = r) by the proposed model. The team also proposed a procedure for making the computation of posterior probabilities for large data sets more feasible. The procedure involves three steps: 1) partitioning, 2) clustering and 3) pruning, presented in fig. 7 [15].
Figure 7: (A) partitioning: a protein is dependent on other proteins within connected sub-graphs and
not dependent on proteins that share no peptides with the proteins in the connected sub-graphs. (B)
clustering: proteins with identical connectivity, e.g. protein 1 and 2, can be clustered together to
compute their posterior probabilities more efficiently. (C) pruning: within the connected sub-graph,
e.g. protein 4 and 5, proteins that are connected only by peptides with zero probability can be divided
into two sub-graphs that do not connect [15].
As mentioned in the paper, the model assumptions are not ideal. However, it is possible to evaluate their accuracy and replace them with others; for instance, assumption 5 can be improved by introducing more complex priors. Depending on the data, the model showed similar and sometimes better performance compared with other methods. Additionally, it compares favorably to other methods in cases of high-scoring degenerate peptides [15].
The validation of database search results is an aspect of interest in protein identification. As indicated above, protein identification usually has two stages: 1) matching peptides against a database for their identification and 2) scoring and ranking of proteins based on the identified peptides [15]. Peptide identification is achieved by the combination of tandem MS and a database search. The validation of these results is still developing in aspects such as specificity and sensitivity. In 2009, Zhang et al. proposed a Bayesian non-parametric (BNP) model for the validation of database results that incorporates popular methods from statistical learning, such as the Bayesian method for posterior probability calculations [16]. The model integrates an extended set of features selected from the literature, including peptide fragmentation knowledge and chemical or physical properties of the peptide.
After the tandem MS spectra are searched against the database, the first stage is to construct two subsets: one includes the decoy matches and the other consists of the matches validated by the cutoff-based method. Based on these sets, the coefficients of the LDF (linear discriminant function) score can be calculated by means of multivariate linear regression. In the second step, the distribution of the LDF score x is fitted by a non-parametric PDF (probability density function) with maximum likelihood parameter estimation. The hypothesis mixture PDF is formulated as p(x) = P_pos f(x) + P_neg g(x), based on the theory that the random and correct matches can be grouped into subcategories and that the LDF score of each subcategory should have a simple distribution, e.g. a normal distribution. The negative component P_neg g(x), which accounts for the random matches, can be estimated by the fully non-parametric density estimation procedure carried out by maximum likelihood estimation with the EM algorithm, as indicated by Duda et al. in 2001 and Archambeau et al. in 2003. The positive component P_pos f(x), which accounts for the correct matches, can be estimated by a restricted fully non-parametric density function estimate.
After the estimation of the conditional PDFs, the correct probability of a match with LDF score x can be calculated by the following formulation:

$$p_{cor} = \frac{p_{pos}\, f(x)}{p_{pos}\, f(x) + p_{neg}\, g(x)}$$
This formulation can be explained as follows. From Bayes' theorem,

$$\frac{p(pos \mid D)}{p(neg \mid D)} = \frac{p(pos)}{p(neg)} \cdot \frac{p(D \mid pos)}{p(D \mid neg)} \quad (I)$$

and, as a normalization condition,

$$p(pos \mid D) + p(neg \mid D) = 1, \ \text{hence}\ p(neg \mid D) = 1 - p(pos \mid D) \quad (II)$$

By introducing (II) into (I) we get:

$$\frac{p(pos \mid D)}{1 - p(pos \mid D)} = \frac{p(pos)\, p(D \mid pos)}{p(neg)\, p(D \mid neg)}$$

and then, after some rearrangement:

$$p(pos \mid D) = \frac{p(pos)\, p(D \mid pos)}{p(neg)\, p(D \mid neg) + p(pos)\, p(D \mid pos)}$$
In this equation, p(pos|D) can be identified with p_cor in the formula above, and p(D|pos) and p(D|neg) are f(x) and g(x) respectively, which leads to the formulation the team proposed. Finally, the authors were able to make a decision in accordance with the cost function, expressed as the FDR (false discovery rate) [16].
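As a small numerical sketch of this posterior (with simple Gaussian stand-ins for the fitted non-parametric densities f and g, and made-up mixing weights):

```python
import numpy as np

def gauss(x, mu, sd):
    """Gaussian density, a hypothetical stand-in for the fitted PDFs."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# Mixing weights and component densities of the hypothesis mixture PDF:
p_pos, p_neg = 0.4, 0.6
f = lambda x: gauss(x, 3.0, 1.0)   # correct-match LDF score density (made up)
g = lambda x: gauss(x, 0.0, 1.0)   # random-match LDF score density (made up)

def p_cor(x):
    """Posterior probability that a match with LDF score x is correct."""
    num = p_pos * f(x)
    return num / (num + p_neg * g(x))

for x in (0.0, 1.5, 3.0):
    print(x, round(p_cor(x), 3))
```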
As claimed by the research team, the model can provide a correct-match probability for each assignment, which supports the subsequent analysis. The model was able to identify more high-confidence proteins from an MS/MS data set than other methods, such as ProteinProphet. Its strongest aspect, as indicated in the paper, is the confirmation of a larger number of confident peptides; thus, it can give more information for later biological analysis [16].
The protein identification issue was also addressed by LeDuc and co-workers in 2014. In this case, "top-down" experiments were employed to identify and characterize whole proteins. The main characteristic of a "top-down" experiment is that the precursor ion is an intact proteoform and not small peptides produced from enzymatic digestion prior to mass spectrometry (shotgun or bottom-up experiments). Thus, the mass of the precursor ion represents a native whole protein, and its fragment ions support the characterization and verification of the primary structure. In order to better capture the information given by top-down proteomics, the team suggested a scoring system for protein identification, based on Bayesian statistics, under the name C-score [17].
The authors started with the basic formulation of Bayes' theorem:

$$P(\mathrm{Proteoform}_i \mid \mathrm{data}) = \frac{P(\mathrm{Proteoform}_i)\, P(\mathrm{data}_{MS/MS} \mid \mathrm{Proteoform}_i)}{P(\mathrm{data}_{MS/MS})}$$
where P(Proteoform_i|data) is the posterior probability (the probability of the i-th proteoform given the MS/MS data), P(Proteoform_i) is the prior probability of proteoform i, P(data_MS/MS|Proteoform_i) is the likelihood, read as the probability of the data given proteoform i, and P(data_MS/MS) is known as the probability of the data. The team restated the above equation by defining variables, arriving at:

$$P(\varphi_q \mid M_O, \{m_i\}) = \frac{P(\varphi_q)\, P(M_O, \{m_i\} \mid \varphi_q)}{P(M_O, \{m_i\})}$$
where M_O is the observed mass of the precursor ion, m_i is the mass of the i-th of the n fragment ions, so {m_i}ᵢ₌₁ⁿ is the set of all observed fragment ions, and φ_q is the q-th candidate (out of k candidate proteoforms in the database). The prior probability P(φ_q) can be taken as "all hypotheses are equal", or one can assign higher or lower prior probabilities to a candidate. The interesting aspect of this scoring model is that, in contrast to other Bayesian methodologies (also presented in this literature thesis), the model has no unknown parameters: instead of inferring values from the data collected for the study in question, the values are taken either from the team's knowledge of mass spectrometry or from prior studies focused on determining the needed values [17].
In order to calculate the likelihood, P(M_O, {m_i}|φ_q), LeDuc et al. assumed independence of all m_i, and thus:

$$P(M_O, \{m_i\} \mid \varphi_q) = P(M_O \mid \varphi_q) \prod_{i=1}^{n} P(m_i \mid \varphi_q)$$
To avoid issues such as the calculated value decreasing as the list of fragment ions grows, the research team developed the aforementioned equation into:

$$P(M_O, \{m_i\} \mid \varphi_q) = f\!\left(P(M_O \mid \varphi_q)\right)\, g\!\left(\left(\prod_{i=1}^{n} P(m_i \mid \varphi_q)\right)^{1/n}\right)$$
where the f function is a simple identity function for the precursor ion and g is a
linear function on the logarithm base 10 of the probability of the fragment ion [17].
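The construction can be sketched as follows; the probabilities are made up, f is taken as the identity and g as a base-10 log-linear function per the description above, with a hypothetical slope a and intercept b (the authors' actual coefficients are not given here).

```python
import numpy as np

def likelihood(p_precursor: float, p_fragments, a: float = 1.0, b: float = 1.0):
    """P(M_O, {m_i} | phi_q) per the f/g construction: f is the identity on the
    precursor term; g is linear in log10 of the geometric mean of the fragment
    probabilities (slope a and intercept b are hypothetical)."""
    geo_mean = np.prod(p_fragments) ** (1.0 / len(p_fragments))
    return p_precursor * (a * np.log10(geo_mean) + b)

# Two candidate proteoforms scored on the same hypothetical data:
print(likelihood(0.9, [0.8, 0.7, 0.9]))   # well-supported candidate
print(likelihood(0.9, [0.2, 0.1, 0.3]))   # poorly supported candidate
```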
As stated by the authors, the proposed model showed high specificity and sensitivity compared to other methods. It is also mentioned that, when the data are sufficient, the C-score demonstrates high characterization power, and the method is flexible enough to allow a scoring system appropriate to the experimental procedure [17].
Spectral counting is a label-free quantitative approach in shotgun proteomics, defined as measuring the abundance of a given protein based on the number of tandem mass spectral observations of its identified peptides. Spectral counts (SPCs) have shown a good correlation with the abundance of the corresponding protein. Since SPCs can be extracted from any database search engine, spectral counting is a flexible and straightforward technique. In 2011, Booth et al. proposed a Bayesian model for comparing SPCs under two treatments, which allows the simultaneous classification of thousands of proteins [18].
The classification is conducted by calculating the posterior odds:

$$O_i = \frac{P(I_i = 1 \mid \mathrm{data})}{P(I_i = 0 \mid \mathrm{data})}, \quad i = 1, \ldots, p$$
where I_i is the indicator of the non-null status of the i-th protein; a "non-null status" indicates that the protein has been affected by the treatment. If O_i > c, for a suitably large c, then the protein is classified as non-null. The choice of the threshold c is based on controlling the false discovery rate (FDR), i.e. the rate at which proteins are classified as non-null when in fact there is no treatment effect. In order to compute the posterior odds, the research team considered the following model:
$$\log \mu_{ij} = \beta_0 + \beta_1 T_j + b_{0i} + b_{1i} I_i T_j + \log L_i + \log N_j$$
where μ_ij denotes the expected count for protein i in replicate j, β₀ is the overall mean for the control replicates, β₁ is the overall treatment effect, b_{0i} and b_{1i} are the corresponding protein-specific effects, L_i and N_j are offsets accounting for the protein length and the replicate effect respectively, and T_j is a binary indicator of the treatment. The model is completed by placing prior distributions on the model parameters. The authors considered three different prior distributions for the protein-specific coefficients: one allows for potential correlation between the protein-specific coefficients, while the other two assume they are independent, and one of the latter allows the posterior mean of the protein-specific treatment effects to differ between the null and non-null groups. The necessary computations were performed by means of Markov Chain Monte Carlo (MCMC) methods [18].
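A small sketch of the classification step, assuming the MCMC run has already produced the posterior non-null probabilities P(I_i = 1|data) (the numbers below are hypothetical): proteins are ranked by posterior odds, and the call set is chosen so that the estimated Bayesian FDR stays below a target.

```python
import numpy as np

# Hypothetical posterior probabilities P(I_i = 1 | data) from the MCMC run:
post = np.array([0.99, 0.95, 0.90, 0.70, 0.40, 0.10])
odds = post / (1.0 - post)                 # posterior odds O_i

# Rank by odds; the Bayesian FDR of calling the top k proteins non-null is
# the average posterior probability of being null within the called set.
order = np.argsort(-odds)
fdr = np.cumsum(1.0 - post[order]) / np.arange(1, post.size + 1)

target = 0.05
k = int(np.sum(fdr <= target))             # largest call set meeting the target
if k:
    print("call proteins:", order[:k].tolist(), "estimated FDR:", round(fdr[k - 1], 3))
else:
    print("no proteins pass the FDR target")
```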
The results announced by the team showed that the proposed model is more statistically coherent and valid than other approaches they compared it to (e.g. Bayes factors) and leads to a simple classification; however, it is a significantly slower process than one-at-a-time methods, such as the score test, where the results are instantaneous [18].
For the quantification of a peptide, apart from label-free techniques, there are also labeling techniques, such as the 18O-labeling approach. In enzymatic 18O-labeling, the two oxygen atoms in the carboxyl terminus of a peptide are replaced with oxygen isotopes from heavy-oxygen water. The result is an m/z shift of 4 Da for the labeled peptide; thus, the labeled and unlabeled peptides are separated with respect to their m/z in a spectrum, which allows the comparison of two samples. In practice, however, due to water impurities (presence of 16O and 17O) and mislabeling, not all labeled peptides receive two 18O atoms, which results in a complex mixture of shifted and overlapping isotopic peaks of the labeled and unlabeled samples. In order to estimate the relative abundance of a peptide in two samples in the face of this problem, Zhu et al. proposed a model with a Bayesian framework for MALDI-TOF data in 2011 [19].
The suggested model is an extension of a previous modeling approach (Valkenborg 2008, Zhu et al. 2010), to which random effects for technical/biological variability were added. The model is given by:

$$y_{ij} = \mu_{ij} + \varepsilon_{ij}$$

where y_ij is the experimental intensity obtained at the j-th monoisotopic peak of the i-th spectrum (or sample), the ε_ij ~ N(0, σ²μ_ij^{2θ}) are independent, and the parameter θ is the power parameter of the variance function, accounting for the heteroscedastic nature of MS data. The mean intensity μ_ij of the j-th peak (j = 1 denotes the monoisotopic peak of the unlabeled peptide) of the i-th spectrum can be expressed as:
$$\mu_{ij} \equiv E(y_{ij}) = \begin{cases} H_i R_j + Q_i H_i \sum_{k=0}^{\min(4,\, j-1)} P_k R_{j-k} & \text{if } 1 \le j \le l \\[4pt] Q_i H_i \sum_{k=j-l}^{4} P_k R_{j-k} & \text{if } l+1 \le j \le l+4 \end{cases}$$
where a peptide has l ≥ 5 isotopic variants and an 18O-labeled peptide has l + 4. H_i (H_i ~ N(H, σ_H²)) is the unobserved abundance of the peptide in the unlabeled sample (Sample I) in the i-th spectrum, and Q_i (Q_i ~ N(Q, σ_Q²)) is the relative abundance of the peptide from the labeled sample (Sample II) with respect to Sample I. P_k is the m/z shift probability, which is calculated via a MCMC model, and R_j is the isotopic ratio of the j-th isotopic variant, defined as R_j = h_j / h_1, j = 1, …, l, where h_1, h_2, etc. denote the probabilities of occurrence of the first, second, etc. isotopic variant. The terms H Q P_k R_{j-k} show the contribution to the mean value of the observed peaks from the isotopic variants of the peptide from Sample II [19].
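A direct transcription of this mean model (with hypothetical values for H, Q, the shift probabilities P_k and the isotopic ratios R_j, and dropping the spectrum index i) might look as follows:

```python
import numpy as np

# Expected peak intensities mu_j for one spectrum of the 18O model
# (hypothetical H, Q, P_k and R_j; l = 5 isotopic variants).
l = 5
H, Q = 100.0, 0.8                               # abundance and relative abundance
P = np.array([0.05, 0.05, 0.15, 0.05, 0.70])    # m/z shift probabilities P_0..P_4
R = np.array([1.0, 0.55, 0.25, 0.10, 0.03])     # isotopic ratios R_1..R_l

def ratio(j):               # R_j with 1-based index, zero outside 1..l
    return R[j - 1] if 1 <= j <= l else 0.0

def mu(j):
    """E(y_j) for peak j of the combined unlabeled + labeled pattern."""
    if 1 <= j <= l:
        s = sum(P[k] * ratio(j - k) for k in range(0, min(4, j - 1) + 1))
        return H * ratio(j) + Q * H * s
    elif l + 1 <= j <= l + 4:
        s = sum(P[k] * ratio(j - k) for k in range(j - l, 5))
        return Q * H * s
    return 0.0

print([round(mu(j), 2) for j in range(1, l + 5)])
```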
The team suggested that using a Bayesian approach makes the incorporation of prior information advantageous for the analysis of the data. Although similar approaches had been published in previous years (Eckel-Passow, 2006), this method takes into account the presence of 17O atoms in the heavy water and allows the isotopic distribution to be determined by the data. As pointed out, there are topics concerning the extension of the approach that are under further research, such as the incorporation of an informative prior into the Bayesian model, which would yield a gain in precision for the estimation of the parameters [19].
Current methods for data analysis with the purpose of biomarker discovery, e.g. cancer diagnosis at an early stage, can be divided into two categories: 1) profiling, where the input is a list of peaks, and 2) whole-spectrum, in which the entire spectrum is the input. It is argued that the profiling method is of greater importance, since the detected peaks are more meaningful in the sense that they represent species that can be identified and further studied. The profiling method mainly consists of eight steps: i) resampling, ii) noise reduction, iii) baseline correction, iv) normalization, v) peak detection, vi) peak alignment, vii) feature (peak) selection and viii) classification. As part of a profiling study driven by the purpose of discovering biomarkers that can distinguish cancer and normal samples, He et al. proposed Bayesian additive regression trees (BART) to build a classification model [20].
Figure 8: The proposed method in a schematic illustration [20].

As shown in the schematic illustration of the proposed profiling method (fig. 8), after the data were baseline corrected and normalized, the next step was peak detection. The authors used a smooth non-linear energy operator (SNEO) for the first time, a method that had been used successfully in electroencephalography and electrocardiography, modified here to be suitable for peak detection in MS data. After the peak detection, a correlation-based peak selection was applied, and the selected small peak set was used as input
for BART in order to build a prediction model. BART was applied to classify samples and identify biomarkers. It is akin to methods that construct a set of classifiers, e.g. decision trees, to classify new data points. BART is defined by a prior and a likelihood and reports inference as the summary of all relevant uncertainties. It can be considered a Bayesian "sum-of-trees" model, where each tree is constrained by a prior to be a weak learner. For biomarker identification, the first step is to rank the selected peaks according to their contribution to the classification; in this case, the contribution is determined by counting how many times a peak is used in the BART model. At first the model was built on a few top-ranking peaks, and then the number of peaks was progressively increased [20].
As asserted in the paper, the method showed excellent classification performance, and the obtained results could be subjected to further research and validation. It is also mentioned that BART was accurate and its results more interpretable. Finally, by using the built-in partial dependence plot function of the BART model, it was possible to examine the effect of each biomarker on the cancer identification, as well as the interactions between biomarkers [20].
As mentioned in the description of the previous paper, feature selection is a preliminary step in the data analysis, with the final goal of building a classifier for biomarker discovery. It is a common phenomenon that the initial discovery yields a relatively large collection of biomarkers, but only a few remain relevant after subsequent testing with new data. The main problem is the over-fitting of classifiers which, due to small sample sizes and large numbers of variables, results in a high false-positive rate among biomarker candidates. To overcome this problem, Kuschner et al. developed a method for feature selection, based on a Bayesian network (BN), in 2010 [21].
To build the BN, the authors used a model-free test for independence, based on mutual information. Mutual information (MI) can be described as the measure of the information gained about one variable by knowing the value of another, and is calculated by the equation:

$$MI(X; Y) = \sum_{x,y} P(x, y) \log_2 \frac{P(x, y)}{P(x)\, P(y)}$$
where X and Y are two variables, x and y represent all the possible values that X and Y can take, respectively, and P(x, y) denotes the joint probability that X takes the value x and Y takes the value y. An MI value equal to 0 indicates that the variables are independent. So, the first step of the method is to find the variables that show dependency on the class by calculating MI(class; feature). All features with MI(class; feature) higher than a threshold are considered to have a connection with the class variable; in the graphical illustration of the BN, this means that a directed arc is created from the class node to the node that represents the selected feature. In this way the set of first-level features is established and, once this is done, these features are tested against all the other features individually, to determine connections between features. This is done by calculating the mutual information between first-level features and all the features. If MI(first-level feature; feature) is above the threshold (equal to the threshold used in the previous step), a directed arc is created to represent this dependency. When the connection is between two first-level features, an additional test is required to determine the direction of the arc. Such a test is based on computing the remaining mutual information between the class (C) and one of the first-level variables (F1) when the other first-level variable (F2) is known:
$$MI(C; F1 \mid F2) = \sum_{C, F1, F2} P(C, F1, F2) \log_2 \frac{P(C, F1 \mid F2)}{P(C \mid F2)\, P(F1 \mid F2)}$$
when this mutual dependency is 0, the initial link between the class and the (firstlevel) feature 2 is removed [21]. Similar tests on mutual information are done to
simplify the BN by the elimination of non-relevant connections. For instance, if the
connection to the class C is of the form 𝐶 → 𝐹1 → 𝐹2 then the feature F2 will
become independent of the class C when the data is partitioned on the values of F1
and the 𝑀𝐼(𝐶; 𝐹2|𝐹1) will drop to zero (0). This indicates that the F2 is
independent of the class and the initial link 𝐶 → 𝐹2 will be eliminated, providing a
means to organize the first-level features with their in-between dependencies.
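Because the whole construction rests on thresholded MI estimates, a minimal sketch may make the first step concrete. The Python snippet below (all function names, the binning assumption and the threshold value are illustrative, not taken from the paper) estimates MI(class; feature) from already-discretized peak intensities and keeps the features that pass the threshold:

```python
import numpy as np

def mutual_information(x, y):
    """Estimate MI(X;Y) in bits between two discrete (binned) variables."""
    n = len(x)
    joint, px, py = {}, {}, {}
    for xi, yi in zip(x, y):
        joint[(xi, yi)] = joint.get((xi, yi), 0) + 1
        px[xi] = px.get(xi, 0) + 1
        py[yi] = py.get(yi, 0) + 1
    mi = 0.0
    for (xi, yi), c in joint.items():
        p_xy = c / n
        mi += p_xy * np.log2(p_xy / ((px[xi] / n) * (py[yi] / n)))
    return mi

# First-level selection: keep features whose MI with the class label
# exceeds a threshold (0.1 here is an arbitrary placeholder).
def first_level_features(labels, features, threshold=0.1):
    return [j for j in range(features.shape[1])
            if mutual_information(features[:, j], labels) > threshold]
```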
Fig. 9 illustrates the first BN for the leukemia data. The first-level features are those showing dependency on the class when conditioned on other features, while the second-level features had a low MI with the class when conditioned on the parent feature.

Figure 9: Bayesian network for leukemia data. First level features have high mutual information with the class, when conditioned to other features. Second level features showed little mutual information with class, when conditioned to the parent features [21].

As stated in the article, the Bayesian network/mutual information approach was able to provide a more distinct division between stable and unstable features and to give the opportunity to examine relationships between features that may be useful in identification. However, when testing the model with the artificial data, the method was not able to recreate the intended network completely, due to limitations of the algorithm [21].
Proteomic data treatment also refers to the post-translational modifications
(PTMs) prediction, for example, the identification of peptide sequences and PTMs
associated with the peptide in a biological sample [22]. PTMs are chemical
modifications, which involve an enzymatic addition of a small chemical group or a
larger moiety on the side chain of one or more amino acids. They regulate the
activity of a protein and may occur either after or during the translation [23]. The
analytical method usually chosen for the PTMs prediction is tandem MS followed
by an analysis by means of a “blind” (unrestrictive) PTM search engine. This type
of search engine is used because it can predict known and novel PTMs, but it is
usually affected by the noise in mass spectrometric data and produces false
predictions, containing inaccurate modification masses and incorrect modification
positions [22]. To avoid the above-mentioned issue, Chung et al. proposed a
machine learning algorithm, PTMclust, in 2011 [23]. However, in 2013, addressing
the limitations of PTMclust, the team proposed a new method, iPTMclust, which
introduces prior probabilities on the model variables and parameters, in order to
overcome these disadvantages [22].
PTMclust is applied on the results given by “blind” search engines and
improves the predictions by suppressing the noise in the data and clustering the
peptides with the same modification to form PTM groups. Based on the group, the
algorithm finds the most probable modification mass and position [23].
Nonetheless, PTMclust showed some limitations, such as the greedy method for
selecting the number of PTM clusters, the need for manual parameter tuning and
the lack of confidence score per modification position. In order to overcome these
limitations, the team extended PTMclust by making use of an infinite non-parametric mixture model, and so the infinite-PTMclust (iPTMclust) was
developed [22].
The core of the model in iPTMclust remained the same as in PTMclust and describes how a modification mass and a modification position are generated. In the extended version (iPTMclust), priors were introduced on the model variables and parameters that control the choice of a PTM group from a very large number of PTM groups.

Figure 10: Bayesian network showing the relationship between variables of the model [22].

The relationship between the variables can be illustrated by a Bayesian network (fig. 10). First, the shaded nodes correspond to the observed
variables, the unshaded nodes in the bottom plate indicate the latent variables,
while the unshaded nodes in the upper plates are the model’s parameters and
hyper-parameters are shown outside the plates (hyper-parameter: a parameter of
a prior distribution). The model’s parameters are mixing coefficient (𝑎),
modification mass means (𝜇𝑘 ), modification mass variances (𝛴𝑘 ) and probability
of modification occurring on an amino acid (𝛽𝑘𝑗 ), and the observed variables are
the observed modification mass (𝑚𝑛 ) and position (𝑥𝑛 ) and peptide sequence (𝑆𝑛 ).
By combining the structure of the Bayesian network and the conditional
distributions the joint probability can be written as:
$$P(c,a,z,x,m,\mu,\Sigma,\beta,\lambda,\upsilon,\varphi,\xi,\gamma,\omega \mid S,\Psi) = P(\gamma|\Psi)\,P(\lambda|\Psi)\,P(\upsilon|\Psi)\,P(\varphi|\Psi)\,P(\xi|\Psi)\,P(\omega|\Psi) \times \prod_{n=1}^{N}\big[P(c_n|\gamma)\,P(m_n|c_n,\lambda,\upsilon,\varphi,\xi)\,P(a_n|c_n,\omega)\,P(z_n|a_n,S_n,\Psi)\,P(x_n|z_n,\Psi)\big]$$
where Ψ represents the model hyper-parameters for hyper-priors placed on
𝛾, 𝜆, 𝜐, 𝜑, 𝜉 and ω. The combination of latent variables and prior distributions leads
to a complex joint probability distribution over high-dimensional spaces. Therefore,
a MCMC (Markov Chain Monte Carlo) method was employed for the necessary
computations [22].
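The paper does not spell out its sampler here, but the general MCMC idea can be illustrated with a minimal random-walk Metropolis sketch (the function names and the toy posterior are invented for the example, and this is not the algorithm used in iPTMclust): it draws samples whose long-run distribution approximates a posterior known only up to a normalizing constant.

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(log_posterior, theta0, n_steps=5000, step=0.5):
    """Random-walk Metropolis: samples from an unnormalized log-posterior."""
    theta = np.asarray(theta0, dtype=float)
    lp = log_posterior(theta)
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.normal(scale=step, size=theta.shape)
        lp_prop = log_posterior(proposal)
        if np.log(rng.random()) < lp_prop - lp:  # accept with prob min(1, ratio)
            theta, lp = proposal, lp_prop
        samples.append(theta.copy())
    return np.array(samples)

# Toy example: posterior of a normal mean under a flat prior.
data = np.array([4.9, 5.2, 5.1, 4.8])
log_post = lambda th: -0.5 * np.sum((data - th[0]) ** 2)
draws = metropolis(log_post, theta0=[0.0])
```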
The authors claimed that iPTMclust outperformed their previous method
(PTMclust) and other PTM algorithms. Since iPTMclust provides the user with
modification position-level confidence scores, the quality of the results could be
evaluated and further refinement of analysis could be performed [22].
CHAPTER 4: BAYESIAN STATISTICS IN METABOLOMIC STUDIES
Metabolomics is a term used to describe the emerging field of the study and
measurement of metabolites. Metabolites are the products produced during
metabolism, the total set of chemical reactions within a living organism.
Metabolites can be considered as the “spoken language” of the genetic material
(genome), and therefore, metabolomics is treated as the “read-out” of the state of
the organism under study. Mass spectrometry coupled to different chromatographic
methods, such as liquid or gas chromatography, is the major technique employed
for the simultaneous analysis of a vast array of metabolites [24].
4.1. Applications of Bayesian approach in metabolomics
Profiling metabolites is the identification and quantification of small
compounds up to 1000Da. These compounds constitute the products of the
metabolic pathway. For metabolite profiling, mass spectrometry is a popular
technique, which is employed to generate fingerprint spectra of the separated
metabolites via chromatographic methods. These spectra are then compared with
each spectrum from a spectral library, using a numerical score that characterizes
the similarity between the spectra, as described for proteomics in chapter 3.2.
However, the identifications through this process are subjected to errors due to
incomplete libraries, experimental noise and technical limitations. For the
improvement of the accuracy of metabolite identification, Jeong et al., in 2011,
proposed a Bayesian model, which analyzes the similarity score distribution for
GCxGC/ToF-MS data [25].
The model has four layers that target four fundamental variables relevant
to metabolite identification. The variables are the presence/absence of a
metabolite in the sample (𝑌), matching or not of a metabolite to any sample
spectrum (𝑍), the correct/incorrect match (𝑊) and the similarity score (𝑆). In
Layer 1, the marginal probability that every metabolite in the spectrum library is
present in a sample is considered: P(Y_j = 1) = ρ, j = 1, …, N, where N is the
number of spectra in the library. In Layer 2, Z represents the observation of a
match (Z_j = 1 if there is a match and Z_j = 0 if there is no match for metabolite j).
Figure 11: Schematic representation of the model. Z and S observed, W and Y unobserved [25].

This variable gives information about the unobserved Y. Due to the nature of the metabolite and the library, each metabolite has some tendency to be matched to some sample spectrum (for a given library). So, if the spectrum of a metabolite shares a high level of similarity with other metabolites' spectra, there is a high probability that it will be mistakenly matched to some other sample spectrum, although this metabolite
might be absent. In this layer a competition score, b_j, for each metabolite j in the library is introduced, calculated as

$$b_j = \sum_{k \neq j,\; k \in C,\; I(r_{kj} < h)} 1/a_k$$

where a_k is the similarity score between the spectra of metabolites j and k in the library, C is the set of spectra in the library and I(·) is the indicator function. In Layer 3, the
accuracy of the match is considered for the metabolites that have been matched
(𝑍𝑗 = 1) to at least one sample spectrum: 𝑃(𝑊𝑗𝑙 |𝑌𝑗 = 1, 𝑍𝑗 = 1) = 𝜏 (if 𝑌𝑗 = 0 the
match is obviously incorrect). Finally, in Layer 4, a mixture model is used to
characterize the distribution of similarity score (𝑆). By considering the four layers,
the joint distribution of the variable can be expressed as:
$$[Y,Z,W,S] = [Y][Z|Y][W|Z,Y][S|W] = \Big(\prod_j [Y_j]\Big)\Big(\prod_j [Z_j|Y_j]\Big)\Big(\prod_{j:Z_j=1}\prod_l [W_{jl}|Z_j,Y_j]\,[S_{jl}|W_{jl}]\Big)$$
By treating Y and W as the unobserved variables, the Expectation-Maximization (EM) algorithm was used to estimate the parameters of the model, θ̂. The confidence of each metabolite j can then be estimated as the posterior probability of Y_j:

$$P_j = \begin{cases} P(Y_j = 1 \mid Z_j = 1, S_j;\, \hat{\theta}) \\ P(Y_j = 1 \mid Z_j = 0;\, \hat{\theta}) \end{cases}$$

[25].
The authors stated that the method is a novel model-based approach to the metabolite identification problem. Thus, the comparison was performed with different types of methods, and the results showed that the proposed model was more accurate [25].
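To make the EM step concrete, a minimal sketch is given below for a two-component Gaussian mixture over similarity scores, standing in for Layer 4 of the model (the Gaussian parameterization and all names are illustrative assumptions; the paper's full model has a richer latent structure):

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(s, n_iter=100):
    """EM for a two-component Gaussian mixture over similarity scores s:
    one component standing in for incorrect matches, one for correct ones."""
    mu = np.array([s.min(), s.max()])
    sd = np.array([s.std(), s.std()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities, i.e. P(component | score)
        dens = np.vstack([wk * norm.pdf(s, m, v)
                          for wk, m, v in zip(w, mu, sd)])
        r = dens / dens.sum(axis=0)
        # M-step: re-estimate weights, means and standard deviations
        nk = r.sum(axis=1)
        w = nk / len(s)
        mu = (r * s).sum(axis=1) / nk
        sd = np.sqrt((r * (s - mu[:, None]) ** 2).sum(axis=1) / nk)
    return w, mu, sd
```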
Suvitaival et al., in 2014, presented a Bayesian approach for integrating data of multiple detected peaks connected to one compound [26]. This is important because metabolomics studies are also concerned with the change in the levels of metabolites, providing information about the physiological state of an organism.
For these changes to be discovered, comparative analysis of spectral profiles is the
main approach, through the inference of covariate effects, meaning the differences
between sample groups determined by the controlled covariates of an experiment.
Although the data might be very complex and noisy, they can provide strong
informative structure. For instance, each compound may generate multiple adduct
peaks and a specific isotopic pattern, whose position and shape may be helpful
during the identification of the analyte [26].
The suggested approach consists of two stages: 1) clustering the peaks that
are generated by the same compound, by applying a non-parametric Bayesian
Dirichlet process model on the data, and 2) the responses to the covariates of the
experiment are inferred on the clusters by means of a Bayesian multi-way model.
For stage 1, clustering the peaks, the team assumed that the peaks are generated
through a Dirichlet process: there is an unknown number of clusters and an
unknown number of peaks emerge from each cluster, where each can only be
assigned to one cluster. The probability of assigning peak 𝑗 to cluster 𝑘 can be
expressed as:
$$P(v_{jk} = 1 \mid Q, R, V) \propto a_{DP}\, L(Q, R \mid V_{-j}, v_{jk} = 1)$$

where the value v_{jk} = 1 in the clustering matrix assigns peak j to cluster k, Q represents the data, V the clustering matrix, R ∈ {0,1}^{N×P×P} is a mask with binary values r_{ijj'} indicating whether the peak pair j, j' in sample i appears together and whether both peaks are observed, and a_DP is the Dirichlet process concentration parameter, which determines the prior probability mass outside the clusters and weights the likelihood term L(Q, R | V_{-j}, v_{jk} = 1). The inference of the posterior
probability distribution was performed via Gibbs sampling. For stage 2, the
research team inferred the differences in concentrations between sample groups
for each cluster, which is related to one compound, given the peak heights X ∈ ℝ^{N×P}
and the clustering 𝑉 [26].
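The Dirichlet process clustering step (stage 1) can be illustrated with the standard Chinese restaurant process form of the assignment probabilities. The sketch below is a simplification under standard CRP assumptions, and the variable names are not the paper's notation:

```python
import numpy as np

def crp_assignment_probs(cluster_sizes, alpha_dp, cluster_likelihoods, new_cluster_likelihood):
    """Assignment probabilities for one peak under a Dirichlet process
    (Chinese restaurant process) prior: existing clusters are weighted by
    their size times the likelihood of the peak joining them; opening a
    new cluster is weighted by the concentration parameter alpha_dp."""
    weights = [n * lik for n, lik in zip(cluster_sizes, cluster_likelihoods)]
    weights.append(alpha_dp * new_cluster_likelihood)
    weights = np.array(weights)
    return weights / weights.sum()

# Example: three existing clusters plus the "new cluster" option.
print(crp_assignment_probs([5, 2, 1], alpha_dp=0.5,
                           cluster_likelihoods=[0.9, 0.1, 0.05],
                           new_cluster_likelihood=0.2))
```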
The results, as the authors claimed, showed that including multiple peaks could lead to an improvement of the covariate-effect inference, and that by introducing additional data describing the compound, the inadequate sample-size problem can be successfully addressed [26].
Studies concerning the identification and classification of bacteria based on their characteristic metabolic profiles are a part of metabolomics that is very important as a rapid diagnostic tool. Two different articles addressed the same issue, the identification/classification problem, for different types of bacteria, applying Bayesian statistics in the data analysis. Correa et al. in 2011 presented research concerning Bacillus spores and species, while Olivier et al. proposed an approach for the identification of various Mycobacterium species [27, 28].
In [27] the bacterium under study is Bacillus, as mentioned above. Bacillus
and Clostridium are genera that can adapt rapidly to changes in the environment and to starvation due to their ability to develop spores. Because of their resistant
spores, members of the genus Bacillus are widely distributed in the environment
and their control is a considerable concern in the food industry. Some of these bacteria are pathogenic, causing food poisoning, but the most notorious member of this genus is B. anthracis, which causes anthrax. Therefore, the rapid identification of spores and
bacteria is of a great importance, because of their potential use as a biological
warfare agent. The authors developed a genetic algorithm-Bayesian network
algorithm to identify biomarkers and by means of these biomarkers they built a
Bayesian network model for the identification of Bacillus spores and species [27].
The proposed analytical approach is a two-step identification that classifies
the bacilli into one of the respective species. The first step involves the reduction
of the dimensions of data. A genetic algorithm (GA) is employed for feature
selection with classification of either spores versus vegetative biomass or
speciation to one of the seven different species. The classifier is a Bayesian
network (BN) fitted to the best solution given by the GA. In the second stage, after fitting a new BN model to the best solution found by the GA in the first step, the
model is used for the statistical analysis in order to determine probabilistic
relationships between mass-over-charge ratio intensities, selected by GA, and the
classification [27].
The authors reported that the classification accuracy of the suggested
approach was superior to the partial least squares-discriminant analysis (PLS-DA)
and that it is fast and provides easy interpretation of the relationships between
the selected biomarkers. It is also mentioned that it is possible to develop
predictive models that will allow inference of biological properties of the bacilli
[27].
In the second article from 2012, the research team focused on the
Mycobacterium species. Various species of this genus are related to tuberculosis
(TB); in 2008 alone, 1.8 million deaths were reported due to this disease. Although the current TB diagnostic method is considered to be very sensitive, it suffers from major limitations, such as the high rate (15-20%) of false negatives in adult cases and the culturing time (2-6 weeks), which leads to unnecessary delays
in the patient’s treatment. The authors investigated the potential use of GC-MS and
the subsequent data analysis, in order to build a classification model for various
TB causing and non-TB species [28].
The proposed approach involved different statistical analysis methods. First, principal component analysis (PCA) was employed to determine whether a natural grouping exists between the various sample groups. Furthermore, a PLS-DA model was built in order to identify the compounds that contribute most to the separation of the sample groups, by ranking the compounds according to the variable influence on the projection (VIP) parameter, which indicates the importance of a metabolite in the classification model. Subsequently, the authors created a combined list of the biomarkers with the highest modelling powers (PCA) and VIP values, to reduce the dataset to a set of relative metabolite markers. The identities of the markers were determined through the GC retention times and the fragmentation patterns generated by MS, compared to libraries of previously injected standards. Using only the first three principal components as input, a
discriminative model based on Bayes’ theorem was built for the purpose of
estimating the class membership probabilities of an unknown bacterial sample
[28].
The authors claimed that their method was able to identify metabolite
markers that are related to the various species (TB causing and non-causing), and that classification was achieved in less than 16 hours with a good detection limit. The
team, also, suggested that their method could possibly become a tool in TB
diagnostics and disease characterization [28].
CHAPTER 5: BAYESIAN STATISTICS IN FORENSIC SCIENCES
5.1. Bayes’ theorem and forensic evidence
Bayes’ theorem is commonly used in the forensic sciences and especially in
the Court of Law. In the forensics context, there are two hypotheses for a criminal
case:

𝐻0 : the suspect is innocent

𝐻1 : the suspect is guilty
In court, the question of innocence (or guilt) must be answered. The
answer can be given if the problem is treated as a probability problem. So, for each
hypothesis, its probability (P), given the data (D), can be expressed as (according
to Bayes’ theorem):
$$P(H_0 \mid D) \propto P(H_0)\,P(D \mid H_0) \quad (1)$$
$$P(H_1 \mid D) \propto P(H_1)\,P(D \mid H_1) \quad (2)$$

By dividing (1) by (2) we get:

$$\frac{P(H_0 \mid D)}{P(H_1 \mid D)} = \frac{P(H_0)}{P(H_1)} \cdot \frac{P(D \mid H_0)}{P(D \mid H_1)}$$
where the left part of the equation is the posterior odds (the probability of the suspect being innocent divided by the probability of the suspect being guilty), while the right part reads the prior odds and the likelihood ratio, respectively. The forensic scientist can provide the court with the likelihood ratio, i.e. the probability of the data (evidence) given each hypothesis. The job of the forensic expert is not to determine the probability of the accused being innocent or guilty; he can only present the evidence, including the probability of being wrong. The judge is the one who makes a decision concerning the innocence of the defendant (posterior probability), based on the prior "preference" (prior probability) and the value of the evidence (likelihood ratio).
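As a worked illustration (with invented numbers): suppose the prior odds are even, P(H_0)/P(H_1) = 1, and the evidence is 100 times more likely if the suspect is guilty, i.e. a likelihood ratio P(D|H_0)/P(D|H_1) = 0.01. Then the posterior odds are 1 × 0.01 = 0.01, and the posterior probability of innocence is 0.01/(1 + 0.01) ≈ 0.01. The evidence alone does not decide the case, but it shifts the odds by exactly the factor the forensic expert reports.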
5.2. Mass spectrometry and Bayesian statistics in forensic
sciences
An interesting Bayesian approach was published by Sottas et al. in 2007,
concerning the testosterone abuse by athletes in elite sports. Testing of
testosterone abuse is mainly based on the testosterone over epitestosterone (T/E) ratio. Since 2004, urine samples are subjected to GC-MS and, if the T/E ratio is equal to or greater than 4, the samples are submitted to IRMS (Isotope Ratio Mass Spectrometry) for determination of 13C/12C. If the IRMS analysis does not indicate an exogenous administration, then a longitudinal study is conducted. The proposal of the authors is a Bayesian screening test, which, by processing the GC-MS data of a subject, answers the question whether the athlete tests positively or
negatively for testosterone abuse. What is of great importance is that in this
particular case the T/E ratio variance within the male population is taken into
account by changing the threshold of the test from a population basis to a subject
basis, when the number of individual test results has been increased [29].
Figure 12: Bayesian network model of T/E. It consists of 4 nodes. T/E reads the distribution of expected values of T/E returned by the model [29].

The approach can be presented as a Bayesian network with four nodes (Fig. 12), where μ is the distribution of mean values of T/E for the different individuals, CV represents the distribution of the coefficient of variation (CV) among different individuals, σ is obtained by multiplication of the variables μ and CV, and finally the T/E node contains the T/E ratio values, which are assumed to follow a normal distribution with parameters μ and σ. The model
works in two steps. First, from a set of prior distributions, which express the
knowledge of which values of μ and CV are physiologically relevant, the network returns a distribution of the expected values of T/E. The next step is to infer, through Bayesian statistics, new distributions for μ and CV, as new data/test results are
taken into account. In other words, the posterior probabilities of step 2 become
prior probabilities for step 1 as soon as new test results are applied to the model.
In this way, the distributions develop into unique and individual values for the
mean and coefficient of variation [29].
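A minimal sketch of this kind of sequential updating is shown below for the mean node alone, using a conjugate normal update. This is a simplification: the actual network also updates the CV node, and all the numbers used here are hypothetical.

```python
def update_mean(prior_mu, prior_var, obs, obs_var):
    """Conjugate normal update of the expected T/E value for one subject."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mu = post_var * (prior_mu / prior_var + obs / obs_var)
    return post_mu, post_var

# Start from a population-based prior and sharpen it as individual
# test results arrive (all values hypothetical).
mu, var = 1.3, 0.8 ** 2
for te in [0.9, 1.1, 1.0]:
    mu, var = update_mean(mu, var, te, obs_var=0.3 ** 2)
print(mu, var)
```

With each new result, the posterior becomes the prior for the next test, so the population-based limits gradually become subject-based, exactly as described above.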
The results of this study were remarkable, with the authors claiming that the
Bayesian interpretation of the T/E time-profiles showed significant sensitivity and
returned less false positives than the other tests [29].
In another paper from 2013, written by Bettencourt da Silva et al. [30],
Bayes’ theorem was used to evaluate the examination uncertainty of the presented
strategy for identification of tear gas weapons’ active substances. The analysis of
the samples was performed by GC-MS instrumentation. Therefore, the data
consisted of retention time (τ) and ratio of abundances, noted as α, between the
molecular ion A and the fragment ions B and C, which are the most abundant. The
team used Bayes’ theorem, following the theory and notation presented by Ellison
et al. (1998). First, the odds of an active substance A were calculated by means of
the following formula:
$$O(A \mid \tau, \alpha^{AB}, \alpha^{AC}) = O(A) \cdot LR(\tau) \cdot LR(\alpha^{AB}) \cdot LR(\alpha^{AC})$$
where O(A) is the known prior odds of A being present with respect to A being absent [O(A) = P(A)/P(−A)], and LR(τ), LR(α^{AB}), LR(α^{AC}) are the likelihood ratios of the retention time and of the ratios of abundances, respectively [LR(x) = P(x|A)/P(x|−A), where P(x|A) is the chance of evidence x being observed given A and P(x|−A) is the chance of evidence x when A is absent]. Then the odds can be converted into a probability by using this simple equation:
converted into the probability by using this simple equation:
Pr(𝐴|𝜏, 𝛼 𝐴𝐵 , 𝛼 𝛢𝐶 ) =
𝑂(𝐴|𝜏, 𝛼 𝐴𝐵 , 𝛼 𝛢𝐶 )
𝑂(𝐴|𝜏, 𝛼 𝐴𝐵 , 𝛼 𝛢𝐶 ) + 1
[30]
The results of the evaluation of examination uncertainty indicated that
strong evidence can be gathered when the signal of the peaks of the standard and
the sample is consistent, taking into account the retention time and the ratio of
abundances [30].
In the forensic sciences, the likelihood ratio concerning the prosecutor’s and
the defendant’s hypothesis is commonly used to evaluate the evidential strength.
In the article written by Farmer et al. (2009) the likelihood ratio (LR) is employed
to measure the evidential strength for Isotope Ratio Mass Spectrometry (IRMS)
results from white paints. Architectural white paints are a type of trace evidence,
found in crime scenes, usually in cases of trespass. The paint is composed of four
different parts: the pigment, the liquid, the binder and the additives. The binder is
a partly-polymeric compound that, when it dries, helps the paint stay on the
surface. The isotopic analysis of 13C, 18O and 2H was primarily performed on the
binder through IRMS [31].
From the IRMS data, the δ values for 13C, 18O and 2H were calculated using the following equation:

$$\delta[\text{‰}] = \left[\frac{R_{sample} - R_{strd}}{R_{strd}}\right] \times 1000$$
where Rsample is the measured isotopic ratio of the heavier isotope over the lighter
and Rstrd is the measured isotopic ratio for the corresponding international
reference material. The values of δ were then used to accumulate the Stable
Isotope Profile (SIP) and finally the LR was calculated for the SIP, following the
method proposed by Aitken et al. (2004). In this case, the LR indicates the ratio of
the probabilities of observing the SIP of the controlled and recovered specimens
when they come from the same source with respect to the probability of observing
the SIP from different sources. 51 paints were used as population and the
comparison was done pair-wise. It was the first time that LR was applied to IRMS
observations and the results, according to the authors, showed considerable
forensic potential. The method gave approximately a 2% false positive rate and
a similar false negative rate, where 𝐿𝑅 > 1 indicates that the paints originate from
the same source. As suggested by the team, the discriminatory power of this
method would make it a promising candidate as a forensic tool [31].
Illicit drug consumption can be estimated through the analysis of communal
sewage water. The analysis is based on mass spectrometric methods that allow the
measurement of concentration of drug target residues (DTRs), such as
metabolites, with relatively high accuracy and precision. The consumption of
parent drugs can be “back-calculated” from these concentrations. However, these
estimations of concentrations are subjected to many sources of uncertainty. As
part of a bigger research, which also included a Monte Carlo simulation, Jones et
al. proposed a Bayesian framework computed using Markov Chain Monte Carlo
(MCMC), which combines the estimation of parameters and uncertainty propagation in a single step [32].
First the “back-calculations” were presented by the team in a sister paper
(Baker et al, 2014), based on the previous work of Zuccato et al. (2005), in order
to estimate per capita consumption from DTRs:
1) Load of DTR (g/day):

$$\text{Load} = \frac{\text{Concentration} \times \text{Flow}}{1000 \times \left(\frac{100}{100 - \text{Sorption}}\right)} \times \left(\frac{100}{100 + \text{Stability}}\right)$$

where Concentration is the DTR concentration in wastewater (ng/L), Flow is the volume of flow to the wastewater over a 24-hour period (millions of liters/day), Stability indicates the percentage of DTR that changes in the wastewater because of the conditions (pH, temperature, time) and Sorption shows the percentage of sorption to suspended particulate matter (SPM) in wastewater.
2) Estimation of drug consumption per 1000 people (mg/day):

$$\text{Consumption} = \frac{\text{Load}}{\text{Population} \times \text{Excretion}} \times \frac{MW_{par}}{MW_{DTR}} - OS$$

where Excretion is the proportion of the parent drug dose that is excreted as DTR, MW_par and MW_DTR are the molecular weights of the parent drug and the DTR respectively, Population is the size of the population, and OS is the amount of DTR in wastewater due to sources other than consumption, such as hospital and prescription usage.
It should be mentioned that for drugs that can be administered through different routes, e.g. cocaine, the metabolic profile can vary enormously, but this can be included in the Excretion term as:

$$\text{Excretion} = \sum_{R} \big[(\text{proportion of parent drug mass administered by route } R) \times (\text{proportion of a dose of parent drug excreted as DTR following route } R)\big]$$

[32].
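Since the two back-calculation formulas above are fully specified, they can be transcribed directly; the sketch below does so in Python (function and argument names are illustrative, and the units follow the definitions given above):

```python
def dtr_load(concentration, flow, stability, sorption):
    """Load of DTR in g/day; concentration in ng/L, flow in millions of
    liters/day, stability and sorption in percent, as defined above."""
    return (concentration * flow
            / (1000 * (100 / (100 - sorption)))
            * (100 / (100 + stability)))

def consumption(load, population, excretion, mw_parent, mw_dtr, other_sources=0.0):
    """Drug consumption (mg/day per 1000 people) from the DTR load."""
    return load / (population * excretion) * (mw_parent / mw_dtr) - other_sources
```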
The Bayesian approach can be presented as a directed acyclic graph (DAG), but the “back-calculations” are written in a way that shows a more natural ‘forward’ formulation (consumption determines the load and concentration of DTR) (fig. 13). The first stage of the simulation is to characterize each parameter in terms of distributions; in this case, a prior distribution is specified for each parameter, which is updated to a posterior distribution by the addition of new data to the model. Some of the parameters are known (excretion, flow, stability and population), with informative prior distributions (assumed to follow a normal distribution), while others (daily consumption) are unknown (uninformative priors), and their posterior distribution is driven by the observed data. The researchers considered that the logarithmic daily consumption follows a normal distribution, with mean μ and standard deviation τ, and for these
parameters broad prior distributions were assumed. After fitting the model, posterior estimates were available, and for seven days of sampling the estimated mean drug consumption was 1260 mg/1000 people [32].
The authors suggested that this type of analysis may allow more
sophisticated statistical analyses, such as modeling the variability over time. They
also mention that the model could be extended in a way that would incorporate the “weekend effect”, where drug consumption seems to increase; however, longer periods of data should be available [32].
The role of a forensic expert is to evaluate evidence (E) in the context of two
propositions: the prosecution (H1: the control and the recovered samples originate
from the same source) and the defense proposition (H2: the control and recovered
samples originate from different sources). Although the most common approach
of the evaluation of physicochemical data is the likelihood ratio method (LR),
Bayesian networks (BNs) are of growing interest in the forensic sciences, and it has been shown that their application can be a promising tool for evaluating this problem. In 2010, Zadora presented a report on the evaluation of physicochemical data through Bayesian networks, focusing on the classification of diesel fuel and kerosene samples; the comparison of car paints, taking into account the polymer binder and the pigment; and the comparison of fibers, which relies on the morphology of the fiber and, in the case of dyed fibers, the comparison of the colour [33].
The classification problem of diesel fuel and kerosene samples, whose identification relies on automated thermal desorption-gas chromatography-mass spectrometry (ATD-GC-MS), can be solved with a BN with two nodes. One discrete-type node is H, which represents the two hypotheses, and the other one is Vd, which has seven ranks related to all the considered discrete variables. If the variables are considered as continuous-type, then the node Vd changes into Vc, where the variables are expressed by a normal distribution.

Figure 13: Bayesian networks for discrete variables (right) and continuous-type variables (left) [33].

Conditional probabilities in the node Vd and parameters
(means and variances) in the node Vc are calculated through the available data in
databases. By entering hard evidence in the node Vd or Vc, the Bayesian network model returns a result for node H, which represents the posterior probabilities ratio and, in this figure, is equal to the LR on the basis of {1} [33].
The comparison of samples, either car paints or fibers, can also be approached by the BN method. The node H has two states: H1 is the hypothesis that the compared samples (control and recovered) originate from the same object, and H2 is the hypothesis that the compared samples are from different objects. The model can have as input either continuous-type or discrete-type data.

Figure 14: BN for discrete-type of data (car paint problem) [33].

The discrete-type model (fig. 14) was used for the comparison of paints, which can be described by qualitative-type data, i.e. the presence or absence of a particular
compound, similar to the diesel and kerosene samples. The parent node B allows
the expression of the state (rank) of a particular variable for the control sample.
The prior probabilities of this node are calculated based on the background data,
for example how often a certain state is observed within a particular variable in a
database of analyzed car paints. It can be said that this node contains information
about the rarity of a state, something that can be included in the evaluation of
evidential value. The node E contains the conditional probabilities that a certain
combination of ranks of discrete variables in the recovered (𝐸 = 𝑉𝑟 ) and control
(B = V_c) samples can occur under each of the considered propositions: a) P(E = V_r | B = V_c, H_1), which expresses the conditional probability that the combination of ranks V_r and V_c of a particular variable is observed when the samples originate from the same object, and b) P(E = V_r | B = V_c, H_2), the conditional probability that the combination of ranks V_r and V_c of a particular variable is observed when the samples originate from different objects [33].
Figure 15: BN for continuous-type of data (car paint and fibers cases) [33].

For the continuous-type of data (fig. 15), the BN model was applied to the car paint and fiber cases. The model was first proposed by Taroni et al. in 2006. The continuous nodes V_B, V_C and V_R are assumed to follow a normal distribution, whose parameters (mean and variance) are estimated considering the rules of propagation of information in the network: a) V_B ~ N(μ, τ²) represents the background information, where μ is the mean of all objects in a suitable database and τ² is the between-source variability; b) V_C represents the measurements made on the control sample, with (X|θ, σ²) ~ N(θ, σ²), where σ² is the within-source variability; and c) V_R is the node that represents the measurements made on the recovered sample, with

$$(Y \mid X = x) \sim N\!\left(\bar{y}_1,\; \sigma^2 + \frac{\sigma^2 \tau^2}{\sigma^2 + \tau^2}\right)$$

when H1 is true, where ȳ₁ is the mean of the analyzed parameter in the recovered sample, and (Y|θ) ~ N(θ, σ² + τ²) when H2 is true [33].
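A sketch of how such a continuous model yields an evidential value is given below: it evaluates the two normal densities described above at the recovered-sample mean and takes their ratio as the LR. Centering the numerator density on the control-sample mean (rather than on ȳ₁) is a simplifying assumption of this sketch, as are all the names and numbers:

```python
from scipy.stats import norm

def lr_continuous(y_bar_recovered, y_bar_control, mu, sigma2, tau2):
    # Variance of (Y | X = x) under H1, as given in the text.
    var_h1 = sigma2 + sigma2 * tau2 / (sigma2 + tau2)
    # H1: same source -- density centered (here) on the control mean.
    num = norm.pdf(y_bar_recovered, loc=y_bar_control, scale=var_h1 ** 0.5)
    # H2: different sources -- density N(mu, sigma2 + tau2) from the database.
    den = norm.pdf(y_bar_recovered, loc=mu, scale=(sigma2 + tau2) ** 0.5)
    return num / den

# Hypothetical values: control and recovered means close together,
# relative to the between-source spread, give LR > 1.
print(lr_continuous(y_bar_recovered=5.1, y_bar_control=5.0, mu=4.0,
                    sigma2=0.05, tau2=1.0))
```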
The author claimed that the use of BNs requires no knowledge of classical programming, and that people with limited understanding of evaluation procedures can benefit from the visualization, provided by a BN model, of the factors considered in the process of evaluation of evidence. However, more data should be collected and additional experiments should be carried out with the aim of assessing how these models perform, before any trial in real casework [33].
CHAPTER 6: BAYESIAN STATISTICS IN VARIOUS APPLICATIONS
Bayesian statistics can be applied in many different fields that may not fall into the categories that have already been presented. A field that has a great impact on daily life is the food industry. An interesting study, which included the combination of a frequentist approach and Bayes' theorem, was presented by
Hejazi et al. in 2010. The research concerned unsaturated fatty acids (FA) that are
associated with several health issues, such as cardiovascular diseases, obesity and
inflammation [34].
Unsaturated FAs can exhibit a cis or trans geometry, due to the double bonds.
cis-FAs are naturally present in foods, while the trans-FAs are produced by heat
and hydrogenation of vegetable and marine oils or by bacteria. The latter are associated with the aforementioned health conditions. The separation and
identification of these isomers, usually carried out by gas chromatography-mass
spectrometry (GC-MS), is crucial, but it is also a difficult task. The team proposed
an approach, which included Bayes' theorem, for the distinction of the isomers of α-linolenic acid [34].
Linolenic acid has 3 double bonds, which gives 2³ = 8 different possible geometries. The team, having prior knowledge that the central bond can be assigned as cis or trans (xcx or xtx, respectively) from the m/z ratio, set out to distinguish the geometries of the other two bonds. A series of t-tests on the intensity ratios (peak of interest/base peak) was conducted in order to make a decision about the geometry, and the m/z ratio with the smallest probability (H0: there is no difference in ratios between isomers) was selected. Given these estimations of normal probability density functions, Bayes' theorem was then applied to obtain the probability that the molecule has one or another geometry [34].
As an illustration, the authors presented the identification of the cct
isomer. The base peak in this case is m/z=79, which implies that the central bond
is cis (xcx). The Student’s t-tests gave as discriminatory peaks m/z=236 for xcc and
xct and m/z= 93 for tct and cct. The probability can be formulated for the intensity
ratios as follows:
$$P(cct \mid R_{93}, R_{236}) = P(cct \mid R_{93}, xct) \cdot P(xct \mid R_{236})$$

where R_93 = I_93/I_79 and R_236 = I_236/I_79. For each position there are two possible geometries and, once the central position is decided, the Bayes' theorem formulation gives:

$$P(cct \mid R_{93}, R_{236}) = \frac{P(R_{93} \mid cct)\,P(cct)}{P(R_{93} \mid cct)\,P(cct) + P(R_{93} \mid tct)\,P(tct)} \cdot \frac{P(R_{236} \mid xct)\,P(xct)}{P(R_{236} \mid xct)\,P(xct) + P(R_{236} \mid xcc)\,P(xcc)}$$

where P(R_93|cct) and P(R_236|xct) are the likelihoods, calculated by means of a one-tailed t-test, and P(cct) is the prior probability, for which a ‘flat prior’ was adopted (therefore equal to 1/8 for a sample with an unknown origin) [34].
The authors claimed that the method allowed an accurate identification of unknown isomers of α-linolenic acid, and they also suggested that this approach could be adopted in the determination of the fragmentation mechanisms of general polyenes [34].
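The two-factor posterior above is easy to reproduce once the likelihood values are available from the t-test-based density estimates; a minimal sketch follows (all names are illustrative, and the reconstruction of the second denominator term as the xcc alternative is an assumption of this sketch):

```python
def posterior_cct(p_r93_cct, p_r93_tct, p_r236_xct, p_r236_xcc):
    """P(cct | R93, R236) as a product of two Bayes updates, one per
    uncertain double-bond position. With the flat priors used in the
    paper, the prior terms cancel within each factor."""
    p_pos1 = p_r93_cct / (p_r93_cct + p_r93_tct)      # cct vs tct, given R93
    p_pos3 = p_r236_xct / (p_r236_xct + p_r236_xcc)   # xct vs xcc, given R236
    return p_pos1 * p_pos3

# Hypothetical likelihood values taken from the fitted densities:
print(posterior_cct(p_r93_cct=0.8, p_r93_tct=0.1,
                    p_r236_xct=0.7, p_r236_xcc=0.2))
```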
Another role of Bayesian statistics and especially Bayes’ theorem was
presented by Jackson et al. in 2013. In this article, a comparison between Lawrence
Livermore National Laboratory (LLNL) and the Purdue Rare Isotope Laboratory
(PRIME Lab) with respect to 41Ca analysis was demonstrated [35]. 41Ca has a long half-life (T1/2 ≈ 10^5 years) and its analysis with accelerator mass spectrometry (AMS) has led to a significant rise of applications in biomedical research on osteoporosis and calcium metabolism. However, an AMS measurement of 41Ca is subject to many errors, mainly due to interferences with other ions. The research team made a comparison between these two laboratories to report on the uncertainties due to the chemical preparation procedures and the bias from the AMS measurements [35].
The study was divided into two components: 1) chemically prepared
samples at Prime Lab were measured at both Prime Lab and LLNL to check the
consistency of the AMS measurements, and 2) AMS samples from the same source
were prepared at both facilities and measured at LLNL in order to compare the
preparation procedures. The reported results were used to estimate if the
laboratories were over- or underestimating their reported uncertainties. For the
statistical calculation, Bayes’ theorem was applied to the data, following the
approach presented by Dose (2003):
$$P(\nu, w \mid data) \propto P(\nu, w)\,P(data \mid \nu, w)$$

where P(data) can be omitted if the formulation is normalized. ν and w are two multipliers, such that the true uncertainties in the data should be νσ_{Li} and wσ_{Pi} with respect to the corresponding results {x_i, σ_{Li}} from LLNL and {y_i, σ_{Pi}} from Prime Lab [35]. If z_i is the unknown true value of sample i and represents the 41Ca/40Ca ratio, then

$$P(data \mid \nu, w) = \prod_i P(x_i, y_i \mid \nu, w), \qquad P(x_i, y_i \mid \nu, w) = \int dz_i\, P(x_i, y_i \mid \nu, w, z_i)\,P(z_i \mid \nu, w)$$

resulting in the final equation P(ν, w | data) = c (∏_i G_i) P(ν, w). The prior P(ν, w) can be used to input additional information, such as constraints on the ν and w values. If ν and w are equal to 1, then the laboratories estimate their uncertainties perfectly; a value greater or less than 1 indicates that the facilities underestimate or overestimate their uncertainties, respectively [35].
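A grid-based sketch of this posterior is shown below. It uses the fact that, for a flat prior over the unknown true value z_i, the difference x_i − y_i is normally distributed with variance (νσ_Li)² + (wσ_Pi)²; treating this as the marginalized likelihood G_i is an assumption of the sketch, standing in for the paper's integral over z_i, and all names are illustrative:

```python
import numpy as np
from scipy.stats import norm

def posterior_nu_w(x, sigma_l, y, sigma_p, grid=np.linspace(0.5, 2.0, 151)):
    """Grid approximation of P(nu, w | data) under a flat prior."""
    post = np.ones((len(grid), len(grid)))        # flat prior P(nu, w)
    for xi, sl, yi, sp in zip(x, sigma_l, y, sigma_p):
        for a, nu in enumerate(grid):
            # x_i - y_i ~ N(0, (nu*sl)^2 + (w*sp)^2), vectorized over w
            sd = np.sqrt((nu * sl) ** 2 + (grid * sp) ** 2)
            post[a, :] *= norm.pdf(xi - yi, scale=sd)
    return grid, post / post.sum()
```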
The authors concluded that the multiplier for Prime Lab was w = 1.32 ± 0.20, and for LLNL ν = 1.15 ± 0.69, which indicates that both laboratories performed well in estimating their uncertainties, with a slight indication of underestimation [35].
PART III
CHAPTER 7: A CRITICAL REVIEW
This chapter is dedicated to general and critical comments on the
applications of Bayesian statistics to mass spectrometric data or data produced by
hyphenated systems such as liquid chromatography-mass spectrometry (LC-MS).
The focus will be on the advantages and limitations, derived from the papers
discussed in Part II, as well as personal impressions on the statistics applied in
these articles.
First of all, what should be stressed is the significant rise of the use of
Bayesian statistics in different scientific fields, including analytical and
bioanalytical chemistry. Fig. 16 illustrates the results of a small survey conducted by counting the number of publications concerning Bayesian statistics by decade on the search engine Web of Science. It is clear that in the last decade the number of publications on this subject has grown exponentially, and this can be explained by the advances in computer power, making it possible to handle large data sets nowadays.

Figure 16: Research conducted on the Web of Science, concerning the number of publications compared to the year of publication.

Bayesian statistics
-- although seemingly based on a simple formula, 𝑃(𝜃|𝐷) ∝ 𝑃(𝜃)𝑃(𝐷|𝜃) where the
answer is given by computing priors and likelihoods -- are usually applied to large
data sets, e.g. proteomic data. Markov Chain Monte Carlo (MCMC) is usually employed, enabling the computation of these probabilities in situations in which integration over a large number of parameters is required. Therefore,
computational power is one of the key elements to get “answers” in a reasonable
amount of time.
Secondly, in this thesis, out of the 26 discussed papers, 19 were related to life
sciences, e.g. proteomics and metabolomics, which can be translated into a 73%
with separate percentages being 58% and 15% respectively. 5 out of 26 (19%)
were dedicated to forensic sciences and only 8% (2 articles) developed a Bayesian
approach to treat data that did not fall into any of the aforementioned categories,
such as the inter-laboratory comparison based on the 41Ca data [35]. There is a correlation between the rise of proteomics and of Bayesian statistics over the last 15 years, which can be explained by taking into consideration the requirements of a proteomic study. From a single experiment alone, large datasets can be produced. The usual request in this type of experiment is biomarker discovery, drug development and the classification of samples between a healthy and a diseased state. Handling these data requires high computational power, which is achievable nowadays, and a statistical method which answers questions such as whether a signal is a peak or noise [9], or the identification of proteins [17] through a database search. Bayesian statistics is the selected method, since it enables the researchers to obtain a probability for each hypothesis being true, in contrast to a frequentist test, which allows only the control of the false positives and the false negatives.
Moreover, Bayesian statistics provides a lot of flexibility in any application.
Bayesian methods can be applied to many different scientific fields. As shown in Part II, over the last five years applications can be found in many different fields where MS is used as an analytical tool. This can be explained by the fact that Bayesian statistics is a model-dependent method, and the right-fitting model can be selected to address the problem in question. However, this model dependency can also be considered a limitation, because it is susceptible to bias. The right model formulation has to be selected, but in a way that does not favorably affect the results. As an illustration, in [12] the research team chose ξ_i(t) as the deforming function for sample i at time t, with the restriction that it be monotonically increasing. The team mentioned that this restriction was inevitable
because otherwise it would erase observed peaks in the spectra. However,
considering a function that is monotone increasing in a chromatogram means that
it is impossible to have peak crossings across different chromatograms. This
premise might be too strong, raising many questions.
Last but not least, as a general remark, Bayesian approaches are statistical
methods that should be handled by personnel with enough background
knowledge on the subject and practical experience in the field. It is a conceptually
difficult subject that requires training. Experts should be employed for this type of data treatment, in order to avoid inherently biased models and to provide the best, non-controversial results.
In conclusion, although Bayes’ Theorem was first published in 1763, it has
only been employed systematically in the last decade. It is a very interesting
statistical method that answers critical questions, e.g. “how correct is my
hypothesis?”. Nevertheless, there are limitations such as the model dependency
that should be taken into account before the employment of Bayes’ theorem for
data treatment.
ACKNOWLEDGEMENTS
I would like to thank my project supervisor Dr. G. Vivó-Truyols from the University of Amsterdam for the support and guidance, and the University of Amsterdam for access to academic resources.
REFERENCES
[1] N. Armstrong, D. B. Hibbert, “An introduction to Bayesian methods for analyzing
chemistry data: Part 1: An introduction to Bayesian theory and methods”,
Chemometrics and Intelligent Laboratory Systems 97 (2004) 194-210
[2] J. M. Bernardo, A. F. M. Smith, Bayesian Theory, John Wiley & Sons Ltd., Chichester,
1994
[3] D. S. Sivia, Data Analysis: A Bayesian Tutorial, Oxford University Press, Oxford, 1996
[4] C. M. Bishop, Pattern recognition and machine learning, Springer, New York, 2006
[5] K. Downard, Mass spectrometry: A foundation course, The Royal Society of Chemistry,
Cambridge, 2004
[6] F. W. McLafferty, “Tandem mass spectrometry”, Science 214:4518 (1981) 280-287
[7] L. B. Fay, M. Kussmann, “Chapter 1: Mass Spectrometry Technologies”, RSC Food
Analysis Monographs 9, The Royal Society of Chemistry, Cambridge, 2010
[8] N.L. Anderson, N.G. Anderson, “Proteome and proteomics: new technologies, new
concepts, and new words”, Electrophoresis 19:11 (1998) 1853-61
[9] W. Shao, H. Lam, “Denoising Peptide Tandem Mass Spectra for Spectral Libraries: A
Bayesian Approach”, Journal of Proteome Research 12 (2013) 3223−3232
[10] Y. Sun, J. Zhang, U. Braga-Neto, E. R. Dougherty, “BPDA- A Bayesian peptide detection
algorithm for mass spectrometry”, BMC Bioinformatics 11:490 (2010),
(http://www.biomedcentral.com/1471-2105/11/490)
[11] Y. Sun, J. Zhang, U. Braga-Neto, E. R. Dougherty, “BPDA2d - a 2D global optimization-based Bayesian peptide detection algorithm for liquid chromatograph–mass
spectrometry”, Bioinformatics 28:4 (2012) 564-572
[12] X. Kong, C. Reilly, “A Bayesian approach to the alignment of mass spectra”,
Bioinformatics 25:24 (2009) 3213-3220
[13] T. H. Tsai, M. G. Tadesse, Y. Wang, H. W. Ressom, “Profile-Based LC-MS Data
Alignment- A Bayesian Approach”, ACM Transactions on Computational Biology and
Bioinformatics 10:2 (2013)
[14] T. H. Tsai, M. G. Tadesse, C. Di Poto, L. K. Pannell et al., “Multi-profile Bayesian
alignment model for LC-MS data analysis with integration of internal standards”,
Bioinformatics 29:21 (2013) 2774-2780
[15] O. Serang, M. J. MacCoss, W. S. Noble, “Efficient Marginalization to Compute Protein
Posterior Probabilities from Shotgun Mass Spectrometry Data”, Journal of Proteome
Research 9 (2010) 5346–5357
[16] J. Zhang, J. Ma, L. Dou, S. Wu et al., “Bayesian Nonparametric Model for the Validation
of Peptide Identification in Shotgun Proteomics”, Molecular & Cellular Proteomics 8:3
(2009) 547-557
[17] R. D. LeDuc, R. T. Fellers, B. P. Early, J. B. Greer et al., “The C‑Score: A Bayesian
Framework to Sharply Improve Proteoform Scoring in High-Throughput Top Down
Proteomics”, Journal of Proteome Research 13 (2014) 3231−3240
[18] J. G. Booth, K. E. Eilertson, P. Dominic, B. Olinares et al., “A Bayesian Mixture Model for
Comparative Spectral Count Data in Shotgun Proteomics”, Molecular & Cellular
Proteomics 10:8 (2011) 1-6
[19] Q. Zhu, T. Burzykowski, “A Bayesian Markov-Chain-Based Heteroscedastic
Regression Model for the Analysis of 18O-Labeled Mass Spectra”, Journal of American
Society of Mass Spectrometry 22 (2011) 499-507
[20] S. He, X. Li, M. R. Viant, X. Yao, “Profiling MS proteomics data using smoothed nonlinear energy operator and Bayesian additive regression trees”, Proteomics 9 (2009)
4176–4191
[21] K. W. Kuschner, D. I. Malyarenko, W. E. Cooke, L. H. Cazares et al., “A Bayesian network
approach to feature selection in mass spectrometry data”, BMC Bioinformatics 11:177
(2010), (http://www.biomedcentral.com/1471-2105/11/177)
[22] C. Chung, A. Emili, B. J. Frey, “Non-parametric Bayesian approach to post-translational
modification refinement of prediction from tandem mass spectrometry”,
Bioinformatics 29:7 (2013) 821-829
[23] C. Chung, J. Liu, A. Emili, B.J. Frey, “Computational refinement of post-translational
modifications predicted from tandem mass spectrometry”, Bioinformatics 27:6
(2011) 797-806
[24] U. Roessner, J. Bowne, “What is metabolomics all about?”, BioTechniques 46 (2009)
363-365
[25] J. Jeong, X. Shi, X. Zhang, S. Kim et al., “An empirical Bayes model using a competition
score for metabolite identification in gas chromatography mass spectrometry”, BMC
Bioinformatics 12:392 (2011), (http://www.biomedcentral.com/1471-2105/12/392)
[26] T. Suvitaival, S. Rogers, S. Kaski, “Stronger findings from mass spectral data through
multi-peak modeling”, BMC Bioinformatics 15:208 (2014), (http://www.biomedcentral.com/1471-2105/15/208)
[27] E. Correa, R. Goodacre, “A genetic algorithm-Bayesian network approach for the
analysis of metabolomics and spectroscopic data: application to the rapid
identification of Bacillus spores and classification of Bacillus species”, BMC
Bioinformatics, 12:33 (2011), (http://www.biomedcentral.com/1471-2105/12/33)
[28] I. Olivier, D. T. Loots, “A metabolomics approach to characterise and identify various
Mycobacterium species”, Journal of Microbiological Methods 88 (2012) 419–426
[29] P. E. Sottas, C. Saudan, C. Schweizer, N. Baume et al., “From population- to subject-based limits of T/E ratio to detect testosterone abuse in elite sports”, Forensic Science International 174 (2008) 166–172
[30] R. J. N. Bettencourt da Silva, D. M. Silveira, M. F. G. F. C. Camões, C. M. F. Borges et al.,
“Validation, Uncertainty, and Quality Control of Qualitative Analysis of Tear Gas
Weapons by Gas Chromatography-Mass Spectrometry”, Analytical Letters 47 (2014)
250–267
[31] N. Farmer, W. Meier-Augenstein, D. Lucy, “Stable isotope analysis of white paints and
likelihood ratios”, Science and Justice 49 (2009) 114–119
[32] H. E. Jones, M. Hickman, B. Kasprzyk-Hordern, N. J. Welton, “Illicit and pharmaceutical
drug consumption estimated via wastewater analysis. Part B: Placing back-calculations in a formal statistical framework”, Science of the Total Environment 487 (2014) 642–650
[33] G. Zadora, “Evaluation of the evidential value of physicochemical data by a Bayesian
network approach”, Journal of Chemometrics 24 (2010) 346–366
[34] L. Hejazi, D. B. Hibbert, D. Ebrahimi, “Identification of the geometrical isomers of α-linolenic acid using gas chromatography/mass spectrometry with a binary decision tree”, Talanta 83 (2011) 1233–1238
[35] G. S. Jackson, D. J. Hillegonds, P. Muzikar, B. Goehring, “Ultra-trace analysis of 41Ca in
urine by accelerator mass spectrometry: An inter-laboratory comparison”, Nuclear
Instruments and Methods in Physics Research B 313 (2013) 14–20