IT 16 011
Examensarbete 30 hp (degree project, 30 credits)
February 2016

A classification approach to solving the cosmic reionization puzzle

Kristiina Ausmees

Department of Information Technology
Abstract
Cosmic reionization is a phase in the history of the universe that is still not
understood completely. Among the different theories regarding what could have been
the cause of this process, young star-forming galaxies stand out as the most likely
candidate. In order to determine if the galaxies of the reionization era could have
been the source of the required radiation, it is necessary to determine their escape
fraction of Lyman continuum photons. Since this key property cannot be measured
for the galaxies of interest, methods of probing it indirectly are required.
The foundation of such indirect estimation is the fact that the escape fraction has an
effect on the spectra of galaxies at wavelengths that are observable. This thesis
proposes a quantitative and data-driven approach in which the spectra of simulated
galaxies are used to train machine learning models to predict escape fractions. The
goal is to evaluate support vector machines as a method for predicting escape
fractions of observed galaxies.
The results indicate that the proposed method is promising. Escape fractions are
predicted with over 85 percent accuracy in most cases and the method shows a high
level of robustness to the effects of varying simulation assumptions and disturbances
in the data. Inspection of the models also gives an idea of the information content of
the spectra and its correlation to the escape fraction. A comprehensive analysis of
the classification performance is also performed, which highlights some of the main
difficulties and lays a foundation for future work.
Supervisor: Kristiaan Pelckmans
Subject reviewer: Erik Zackrisson
Examiner: Edith Ngai
IT 16 011
Printed by: Reprocentralen ITC
Contents

1 Introduction
2 Data
3 Methods
   3.1 Principal Component Analysis
   3.2 Classification
      3.2.1 Linear support vector machine
      3.2.2 Soft-margin support vector machine
      3.2.3 L1-norm support vector machine
   3.3 Evaluation
      3.3.1 Model tuning and testing
      3.3.2 Robustness
      3.3.3 Implementation details
4 Results
   4.1 Understanding the data
   4.2 Classification of the main data set
   4.3 Robustness of the main model
   4.4 The effects of simulation assumptions
   4.5 Analysis of classification performance
   4.6 Comparison to other methods
5 Conclusions and discussion
1 Introduction
One of the mysteries of the early universe that still has not been fully explained is that of cosmic
reionization. After the Big Bang, as the universe expanded and cooled off, neutral hydrogen
was able to form in the intergalactic medium (IGM). Cosmic reionization refers to the process in
which this neutral hydrogen was put in an ionized state again. It is known that this reionization
happened during the first billion years of cosmic history and that it was caused by the presence
of high-energy photons, but the source of the ionizing radiation remains unknown.
There are several candidates for what could have caused this process to occur, the most likely
one being young star-forming galaxies (e.g. Alvarez et al.; 2012; Robertson et al.; 2015a). While
it is in theory possible that the galaxies at the time in question could have produced enough
radiation to reionize the entire universe (Alexandroff et al.; 2015), whether or not it was possible
in practice depends on the fraction of hydrogen-ionizing photons that was actually able to escape
from galaxies into the IGM. This quantity is known as the Lyman continuum (LyC) escape fraction
fesc and is a central factor in determining whether or not early galaxies could have been the cause
of cosmic reionization.
It is possible to measure the escape fraction of less distant galaxies, but as the relevant wavelengths are absorbed by the neutral gas in the IGM (Inoue et al.; 2014), this key property cannot
be directly measured for galaxies in the reionization epoch. Existing methods for estimating the
escape fraction of these galaxies therefore rely on probing it indirectly. Methods have been proposed for obtaining constraints on the typical fesc values in this epoch (e.g. Fernandez et al.; 2013;
Finkelstein et al.; 2011) but these can only be used to gain information about the galaxy population as a whole. Zackrisson et al. (2013) suggests that there is a correlation between a galaxy’s
fesc and its spectral energy distribution (SED) at wavelengths that do reach us, indicating the
possibility to infer it for individual galaxies. Their method uses two spectral features to estimate
fesc: the slope of the UV continuum and the relative width of the Hβ emission line. A qualitative
approach is also taken by Alexandroff et al. (2015) who suggest observing less distant galaxies that
are likely similar to those in the reionization era and investigating properties of the SED related
to physical qualities that they argue affect the escape fraction.
The observation that fesc affects the SED of galaxies is also the basis for the work in this
thesis. However, instead of choosing a few diagnostics, the entire spectrum is analyzed using
machine learning techniques to develop models that can learn to identify galaxies with different
escape fractions. By using computers and data-driven predictions there is the possibility to capture
more complicated correlations in the data and possibly to avoid some problems that the simpler
models had, such as ambiguities caused by different ages, metallicities and effects of dust on the
spectra (Zackrisson et al.; 2013).
Principal component analysis (PCA) is a method of re-expressing data to gain knowledge of
underlying dynamics of a system that may be too complex to observe directly. It allows one to
extract features of the data that are important and disregard noise or redundancy. In this thesis
it is used to gain knowledge of the information content in the SED of galaxies and how the fesc is
related to it.
Support vector machines (SVMs) are supervised learning models that can be used for pattern
recognition and classification of data. Here they are used to classify galaxies as having either
higher or lower escape fraction than a given threshold. According to Robertson et al. (2015b) an
escape fraction value fesc ≈ 0.2 of gas-rich star-forming galaxies during the period of reionization
could be enough to account for the required amount of Lyman continuum photons. The main
goal is therefore to evaluate the ability of the models to separate galaxies based on fesc thresholds
around this size.
Another component in the larger project of which this thesis is a part considers lasso regression
as a method of predicting escape fractions (Jensen et al.; 2016; Lundholm; 2016). In that approach,
the goal is to predict a continuous value for each sample, rather than placing them in classes. Both
of these methods are considered since it is possible that classification may have an advantage by
excluding some of the more difficult aspects of the modeling. A comparison of the two models is
therefore also performed to see if there are differences in performance or robustness.
The spectral features that contain information about fesc have not yet been possible to observe
for galaxies in the reionization epoch, but with the upcoming launch of the James Webb Space
Telescope (JWST) in 2018 this is about to change. The Near Infrared Spectrograph (NIRSpec) on
board will be able to provide spectra of previously unseen quality from some of the first galaxies to
form in the universe. Since the data is not available yet, simulated SED of galaxies with different
fesc are used to train and evaluate the models. We use different simulations and modify the SED
in order to imitate realistic JWST/NIRSpec observations, but it is unavoidable that the results
will depend on simulation assumptions to some degree. An important aspect of the evaluation
is therefore to investigate how sensitive the models are to this in order to get an idea of their
applicability to actual observations.
2 Data
The data used to train and test the models is a collection of simulated electromagnetic spectra of
galaxies with corresponding fesc values. These are obtained by adding the effects of varying escape
fractions to galaxies generated from different sets of assumptions and calculating the resulting
spectra using the LYman Continuum ANalysis (LYCAN) simulation project (Zackrisson et
al.; 2016). The galaxy samples are obtained using different numerical simulations and modified to
imitate realistic observations from the JWST. Figure 1 shows the spectrum of a simulated galaxy
with the effects of different fesc values applied to it.
Figure 1: Simulated spectrum of a galaxy with the effects of different fesc values applied to it. Labels point out the main emission lines.
The effects of modifying the escape fraction are visible in the spectra. The emission lines, or
peaks in flux that are labeled in the image, are higher for lower fesc values, and an effect on the
slope in the beginning of the spectra is also visible. The astronomical explanation of this is a
process in which the ionizing radiation produced within galaxies is affected by the surrounding
gas. As the radiation travels through the gas, some of it is absorbed and re-emitted as photons
with different wavelengths. In this way the ionizing radiation is transformed into effects on other
parts of the spectra, leaving traces on wavelengths that are observable.
According to the knowledge of such processes, only information in certain parts of the spectrum
should be correlated to fesc . The emission lines are the strongest indicator with a clear correlation.
These are the result of ionizing photons affecting matter in the galaxy, causing it to emit radiation.
For example, the H-lines in Figure 1 correspond to the spectral emissions of the hydrogen atom.
Another indicator is the overall slope of the spectrum, as the emission of ionizing radiation tends
to cause a reddening of the spectra that gives it a flatter slope (Raiter et al.; 2010). This is
particularly visible at the bluer end of the spectrum, i.e. at the lower wavelengths. Using the
slope to infer fesc is slightly more problematic than using the emission lines, however,
since other properties of the galaxies such as age, metallicity and dust can also have a similar
effect, causing ambiguity in the data (Zackrisson et al.; 2013).
It should be pointed out that Figure 1 is a simplification in several ways. Since all spectra are
from the same galaxy with effects of different escape fractions applied to it, they have very similar
shapes. Also, each spectrum has been normalized to have mean 1, which explains why they are all
centered around this value. This normalization also affects the blue wavelengths, causing the flux
to appear lower for lower escape fractions. However, the figure still suffices to illustrate
the idea behind using this part of the spectra to infer the escape fraction.
As there is no way to measure fesc of the relevant galaxies, the proposed method will be
dependent on simulated train data even after real observations are obtained. The simulated spectra
are generated using different models and assumptions and therefore have varying properties. This
is of great significance for the task of inferring escape fraction because the different assumptions
can affect the relationship between fesc and the spectra. For example, in reality the strength of the
emission lines does not depend only on the escape fraction but also on certain physical properties of
the galaxy. There must be young stars that emit ionizing photons as well as gas to
be ionized. Since some simulations predict more varied star formation histories than others, they
result in more heterogeneous sets of galaxies where the correlation between fesc and the emission
lines is weaker.
Although the simulations used have been evaluated with respect to how realistic they are (e.g.
Shimizu et al.; 2014), at the moment it is not known which assumptions are most representative.
When observations are made there may be more indications of how well they correspond to reality,
but simulation assumptions will likely remain important. By investigating how sensitive the
method is to this, we can get an idea of its applicability in the general case, in what circumstances it
is likely to perform well and what difficulties may arise when predicting fesc for actual observations.
The remainder of this section explains the different simulation techniques and assumptions as well
as which combinations will be considered in this thesis.
Simulation and stellar tracks
Sets of galaxies are obtained from simulations made by Gnedin (2014), Shimizu et al. (2014), and
Finlator et al. (2013). The simulations model the evolution of galaxies and processes within them,
each of them making different assumptions in the calculations. The simulations provide a set of
galaxy properties but to generate spectra from this data a model of stellar evolution is required.
In this thesis the stellar tracks considered are Geneva and BPASS2 (Eldridge and Stanway; 2009).
The Shimizu set contains 406 initial galaxies, which results in a total of 4466 samples after
the effects of 11 fesc values have been added to the original spectra. For Finlator and Gnedin the
numbers are 106 and 100 initial galaxies, respectively. Figure 2 shows the spectra of 40 galaxies
per fesc value for the Finlator and Gnedin simulations with the Geneva stellar tracks.
Figure 2: Spectra of 40 galaxies each for 5 different fesc values from the Finlator (left) and Gnedin (right) simulations with Geneva stellar tracks.
Resolution and Noise
The simulations provide high quality spectra of galaxies, but since the goal is to apply this model
to actual observations, they have to be modified according to the NIRSpec/JWST specifications
to make them realistic. The NIRSpec website [1] lists the different possible resolutions as well
as minimum continuum flux observable at a given signal-to-noise ratio and exposure time as a
function of wavelength. This is used to re-bin the simulated spectra and develop noise models for
simulating different degrees of detector noise.
The possible resolutions of the NIRSpec spectrograph are denoted R=100, R=1000 and R=2700.
A high resolution results in a higher information content in the spectra but also more noise per
wavelength bin. Lower resolutions allow dimmer objects to be observed and are less expensive
to operate. For these reasons, the simulated spectra in this project are rebinned to correspond
to R=100 and R=1000 only. These correspond to 140 and 1478 wavelength bins per spectrum,
respectively.
The signal-to-noise ratio per flux and exposure time is calculated using the official NIRSpec
specification, making it possible to add the corresponding amount of noise to each spectral bin
of a galaxy. This makes assumptions about how the noise scales with exposure time and ignores
some noise sources, but is considered as an adequate approximation.
[1] http://www.stsci.edu/jwst/instruments/nirspec
Two different methods of approximating detector noise are used. The first one fixes the exposure time for all samples and adds noise to each galaxy accordingly. Because of the differing
apparent magnitudes, or how bright the objects appear, this leads to some galaxies having more
noise than others. Although this model is the closest to how noise in actual observations would
behave, the second way of adding noise is useful for other aspects of this project. This method
instead defines the signal-to-noise ratio of a spectrum as the signal-to-noise of the spectral bin
containing the wavelength 1500 Å and fixes this over all the samples. The advantage of this is that
it gives less of a difference in noise level between the samples. Although the noise will still vary
over each spectrum, they will not differ as much overall since they are scaled to be the same at a
given wavelength. This reduces the effect that some galaxies will be "more important" than others
for the classification, and may be more suitable for investigating which features of the spectra are
related to fesc and evaluating how the noise level affects the model's performance. A set of galaxies
with a noise level scaled so that S/N = x at 1500 Å will be denoted as having noise level snx and a
set with a fixed exposure time of x hours will be denoted as xh.
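To make the snx notation concrete, the second scheme can be sketched as follows. This is a minimal illustration assuming simple Gaussian noise with a single amplitude per spectrum, derived from the flux in the bin containing 1500 Å; the actual noise models are built from the NIRSpec sensitivity tables and vary over the spectrum.

```python
import numpy as np

def add_noise_snx(wavelengths, flux, x, rng=np.random.default_rng(0)):
    """Add Gaussian noise so that the spectrum has S/N = x in the bin at 1500 A.

    Illustrative simplification: one noise amplitude for all bins, derived
    from the flux at 1500 A; the real model varies over the spectrum.
    """
    i = np.argmin(np.abs(wavelengths - 1500.0))  # bin containing 1500 A
    sigma = flux[i] / x                          # S/N = flux / sigma at 1500 A
    return flux + rng.normal(0.0, sigma, size=flux.shape)
```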
Figure 3 shows the effects of varying levels of noise on the same set of galaxies from the Shimizu
simulation with 40 spectra each for five fesc values. It is clearly visible how the noise distorts some
of the information correlated with fesc , mainly by obscuring weaker emission lines.
Figure 3: The effects of different noise levels on the spectra of galaxies. Each plot contains the same 40 galaxies from the Shimizu simulation with three fesc values applied to them.
Dust
When the radiation from a galaxy travels through dust in the interstellar medium, parts of it are
absorbed and re-emitted at higher wavelengths, causing a reddening of the spectra. Dust effects
are problematic because they can be similar to those of varying escape fractions, as discussed
above, causing uncertainty in interpreting the data. There are differing opinions regarding the
dust content of early galaxies and it is therefore not known how suitable it is to use dust effects in
training models to predict fesc . There have been results that indicate little or no dust effects in
the relevant galaxies (Bouwens et al.; 2009) whereas other studies show significantly larger dust
contents (Schaerer and de Barros; 2010).
The effects of dust can be added to the simulated spectra using different recipes. For the
Shimizu simulations, the recipe used is the one defined in Shimizu et al. (2014). For the Finlator
and Gnedin galaxies, the dust recipe from Finlator et al. (2006) is used. In all cases, the dust
reddening laws by Pei (1992) are used in combination with the dust recipe. Figure 4 shows the
effects of dust on galaxies from the Shimizu simulation. The reddening effect mainly results in a
larger spread of the fluxes, especially at the lower wavelengths.

Figure 4: The spectra of 40 galaxies per fesc value from the Shimizu simulation without (left) and with the effects of dust (right).
By varying the properties described in this section it is possible to create data sets with different
simulation methods, stellar tracks, resolutions, noise levels and with or without the effects of dust.
In order to avoid displaying the results for all combinations, a subset of configurations is chosen
for illustration purposes. A main data set consisting of galaxies from the Shimizu simulation and
Geneva stellar track with no dust effects is used to evaluate the proposed classification method
thoroughly. Data sets with other configurations are used for evaluating the models’ sensitivity to
different parameters. For all experiments data without dust effects, with noise level sn5, resolution
R100 and Geneva stellar tracks is used unless otherwise stated.
3 Methods

3.1 Principal Component Analysis
Principal component analysis (PCA) introduced by Pearson (1901) is a method of re-expressing
data in a way that reveals which parts of it are significant and which are redundant. Each data
point can be seen as a vector in q-dimensional space and can therefore be described as a linear
combination of a set of basis vectors. Like vectors, they can be projected onto other vector spaces
spanned by different bases. The PCA algorithm finds a new basis, itself a linear combination of
the original one, that lets the data be expressed in a way that reveals hidden dynamics. It could, for
example, be the case that several features of the data represent the same signal. In such cases it
is preferable to use a linear combination of these features instead. This is a more concise representation of the data and requires fewer dimensions.
A way to measure the redundancy between two features a and b is their covariance
$$\mathrm{cov}(a, b) = \langle a_i b_i \rangle_i \tag{1}$$

where $\langle z_i \rangle_i$ denotes the average value over i. The covariance measures how spread out the data
points are with respect to the dimensions a and b, and a high covariance suggests a strong correlation between features.
Let X ⊂ Rq be a set of m samples each described by q features. For the features f1j . . . fqj of all the
samples xj , j ∈ {1 . . . m}, let

$$X = \begin{pmatrix} f_{11} & \cdots & f_{1m} \\ \vdots & \ddots & \vdots \\ f_{q1} & \cdots & f_{qm} \end{pmatrix} \tag{2}$$
X ∈ Rq×m is thus a matrix where each row contains all measurements of one feature and each
column corresponds to a sample.
10
Then the covariance matrix SX is defined as:
$$S_X = \frac{1}{m-1} X X^T \tag{3}$$
The covariance matrix contains information about correlations between all pairs of features in the
data. The diagonal terms tii correspond to the variance of feature i and the off-diagonal terms tij
contain the covariance of features fi and fj .
The PCA algorithm finds a new basis B such that the covariance matrix of the projection BX,
or SBX is diagonalized. Thus the covariance between features is removed by the projection. It can
be shown that this is achieved by choosing B such that each row bi is an eigenvector of XX T . The
principal components of X are then the rows bi of B and the diagonal terms tii (the eigenvalues)
of SBX contain the variance of X along the principal component bi . The principal components
with the largest variance are those that represent the data the most, and if there is a subset with
much higher variances than the others, these are the dimensions which contain the essence of the
data. This is the idea behind using PCA for dimensionality reduction.
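As a concrete illustration, the procedure above can be written in a few lines of NumPy. This is a sketch rather than the implementation used in the thesis (which relies on scikit-learn, see section 3.3.3); X is laid out as in equation (2), one feature per row and one sample per column, and centering each feature is assumed as a preprocessing step.

```python
import numpy as np

def pca(X):
    """Principal components of X (q features x m samples), following eqs. (2)-(3)."""
    Xc = X - X.mean(axis=1, keepdims=True)      # center each feature on zero mean
    S = Xc @ Xc.T / (X.shape[1] - 1)            # covariance matrix S_X, eq. (3)
    eigvals, eigvecs = np.linalg.eigh(S)        # eigendecomposition of the symmetric S
    order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
    return eigvals[order], eigvecs[:, order].T  # rows of B are the principal components

# Dimensionality reduction: project onto the k components with highest variance,
# e.g. eigvals, B = pca(X); Y = B[:3] @ X for a three-dimensional visualization.
```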
The motivation for using PCA is mainly to gain insight into the data at hand. The
knowledge of astronomical processes that cause the spectra to have certain properties
is one part of this, whereas using PCA gives a more data-oriented and exploratory analysis of the
information content. By investigating which parts of the spectra have large variance it is possible to
gain an idea of which features are important to represent the properties that set individual samples
apart. As there has not been much previous work done in applying data-driven predictions using
the entire spectra of galaxies for inferring fesc , this is an important indication of what machine
learning techniques may be suitable for the task. The second reason for using PCA is that it
allows the data to be visualized. By projecting the data onto the principal components that have
the highest variance, it is possible to reduce the number of dimensions used to represent it. This
allows a visualization of it in three dimensions, for example, which can give an indication of the
general layout of the samples with respect to fesc .
3.2 Classification
Classification in machine learning refers to the problem of determining to which of a given set of
categories a new observation belongs.
Let
• X ⊂ Rq be a set of samples where each sample is represented by a vector of q features
• Y be a set of class labels (for binary classification Y = {−1, 1} )
• S ⊂ X × Y be a train set consisting of samples and their corresponding labels
• S′ be a set of samples with unknown labels

The goal of a classification model is to use S to train a classifier y : X → Y that can correctly
label s′ ∈ S′.
3.2.1 Linear support vector machine
The linear support vector machine (SVM) introduced by Vapnik and Chervonenkis (1964) defines
a classifier by regarding each sample in S as a point in q-dimensional space and trying to find a
hyperplane that separates the points according to their labels. Such a hyperplane can be characterized by a vector w ∈ Rq and an intercept b, and therefore the classifier can be defined as:
$$y(x) = \mathrm{sign}(w^T x + b) \tag{4}$$

for some hyperplane (w, b), where

$$y_i(w^T x_i + b) \ge 1 \quad \forall (x_i, y_i) \in S \tag{5}$$
The margin of a hyperplane is defined as the smallest distance from any point x0 to the hyperplane
and can be shown to be equal to $\frac{1}{\|w\|_2}$ [2].

[2] $\|x\|_2$ denotes the Euclidean, or L2, norm defined by $\|x\|_2 = \sqrt{\sum_{k=1}^{n} x_k \cdot x_k}$.
In order to get a solution that is as safe as possible the goal is to maximize the margin. This
together with (5) gives the following objective:
$$\min_{w,b} \|w\|_2 \tag{6}$$

subject to

$$y_i(w^T x_i - b) \ge 1 \quad \forall (x_i, y_i) \in S$$
This is a convex optimization problem, and in order to put it in a form suitable for quadratic
programming, the objective function is often rewritten in the equivalent form of:
$$\min_{w,b} \frac{1}{2}\|w\|_2^2 \tag{7}$$

subject to

$$y_i(w^T x_i - b) \ge 1 \quad \forall (x_i, y_i) \in S$$
The samples that lie exactly on the margin are called the support vectors. The support vectors
have a large effect on the SVM since they decide where the boundary goes; a change in one of
these samples would result in a change in the decision function. They are also an indication of
the complexity of the learning task. If there are few support vectors, then a smaller fraction of the
samples is needed to achieve high accuracy.
3.2.2 Soft-margin support vector machine
The soft-margin SVM proposed by Cortes and Vapnik (1995) can find a boundary even when the
data is not completely linearly separable. This is achieved by introducing slack variables ξ that
allow data points to be mislabeled:
$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{m} \xi_i \tag{8}$$

subject to

$$y_i(w^T x_i - b) \ge 1 - \xi_i \quad \forall (x_i, y_i) \in S, \qquad \xi_i \ge 0 \;\; \forall \xi_i, \qquad C > 0$$
The slack variables ξi measure how misclassified each sample xi is and C is a penalty parameter
for the misclassification error. C can be tuned to find the optimal model by controlling the tradeoff between achieving a maximal margin and avoiding misclassification. Having a high penalty
reduces the number of data points that are on the wrong side of the hyperplane, but it can also
lead to a smaller margin and therefore a less secure solution. This could in some cases lead to
overfitting and poor performance on data that was not part of the train set. A low value for C
could on the other hand lead to an underfitted classifier which does not separate the data correctly.
There are several reasons why using support vector machines is suitable for this problem.
Firstly, they are a well-established method, developed since the 1960s, with successful applications
in many fields. The optimization problem is convex and therefore allows for efficient implementation. Another desirable property of the linear SVM is that it is possible to interpret how the data
affects the decision function. The vector that defines the separating hyperplane contains information about how each feature of the samples affects the classification. This means that by inspecting
these feature weights, one can learn how much and in what way each part of the spectrum is
related to the escape fraction. Many other classification algorithms work like black boxes that
provide a prediction but no information about how it was reached. This is also the reason why only
linear support vector machines are considered. While it is possible to add kernels that project the
data onto other dimensions in which they are easier to classify, this also removes the possibility of
an astronomical interpretation of the resulting model. A model based on the soft-margin linear
support vector machine with an L2-norm will be denoted as an L2-SVM in this project.
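In scikit-learn, the library used for the implementation (section 3.3.3), such an L2-SVM corresponds to the LinearSVC class; note that LinearSVC optimizes a closely related soft-margin objective (squared hinge loss by default). A minimal sketch with stand-in data, where the array shapes and the value of C are illustrative only:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 140))                    # stand-in for 100 spectra, 140 bins (R=100)
y = np.where(X @ rng.normal(size=140) > 0, 1, -1)  # stand-in labels in {-1, 1}

clf = LinearSVC(penalty='l2', C=1.0)    # C is the misclassification penalty of eq. (8)
clf.fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]  # hyperplane (w, b); prediction is sign(w^T x + b)
```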
3.2.3 L1-norm support vector machine
The L1-SVM proposed by Bradley and Mangasarian (1998) instead uses the L1-norm in the
optimization problem:

$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|_1 + C \sum_{i=1}^{m} \xi_i \tag{9}$$

subject to

$$y_i(w^T x_i - b) \ge 1 - \xi_i \quad \forall (x_i, y_i) \in S, \qquad \xi_i \ge 0 \;\; \forall \xi_i, \qquad C > 0$$

The L1-norm is defined by $\|x\|_1 = \sum_{k=1}^{n} |x_k|$ and has the property that it results in a sparse
model where some feature weights are exactly zero. In this case, C affects how many nonzero
features the model has, and therefore which subset of features is used in the classification.
The data suggests that the L1-SVM may be suitable for this application. As discussed in
section 2, both the simulated galaxy spectra and what is known of the astronomical processes
involved indicate that only a small part of a galaxy’s SED is correlated with its fesc. This suggests
that the L1-SVM may be more effective by ignoring the features that are not relevant. It can also
help avoid overfitting by having a less complex model and therefore less risk of fitting redundant
information. This can increase accuracy by not letting noise and other disturbances in the data
affect the classification. Finally, it is easier to interpret the astronomical significance of a model
with fewer nonzero feature weights. A model based on the soft-margin linear support vector
machine with an L1-norm will be denoted as an L1-SVM in this thesis.
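The same class covers the L1-SVM, where scikit-learn requires solving the primal formulation. A sketch of fitting it and inspecting the sparsity (stand-in data again; C is illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 140))                   # stand-in spectra
y = np.where(X[:, 3] > 0.0, 1, -1)                # labels depending on a single feature

clf = LinearSVC(penalty='l1', dual=False, C=0.1)  # L1 norm gives a sparse weight vector
clf.fit(X, y)
print(np.count_nonzero(clf.coef_))                # number of nonzero feature weights
```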
3.3 Evaluation

3.3.1 Model tuning and testing
The classification models are fitted to different data sets and evaluated on their accuracy and
robustness using a test set. It is important to distinguish between the model tuning and testing
processes. Tuning refers to the parameter selection and fitting and is always performed on a subset
of the data. The entire data set is split into a train, validation and test set. The model is fitted to
the train set for different values of the tuning parameter C and the performance of each resulting
model on the validation set is evaluated. The model that had best validation performance is chosen
as the optimal model. This model is subsequently evaluated using the test set. It is important
that the test set is comprised of samples that are not part of the train and validation sets since
otherwise the model would be biased and the test results would not give an indication of the
model’s ability to generalize to new data.
In order to avoid possible bias introduced by the choice of validation set, cross-validation (e.g.
Mohri et al.; 2012) is used to find optimal values for the penalty parameter C. The samples that
are not in the test set are initially divided into k subsets, or folds, and the fitting and validation is
done k times, where each subset gets used as validation set once and the union of the other sets is
used as the train set. The performance of a value for C is defined as the average over the k folds.
This process is known as k-fold cross-validation.
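A minimal sketch of this selection procedure, written against the current scikit-learn API (the grid of C values is a placeholder):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

def select_C(X, y, C_grid, k=5):
    """Return the C with the best average validation score over k folds."""
    avg_scores = []
    for C in C_grid:
        fold_scores = []
        for tr, val in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
            clf = LinearSVC(C=C).fit(X[tr], y[tr])         # fit on k-1 folds
            fold_scores.append(clf.score(X[val], y[val]))  # validate on the held-out fold
        avg_scores.append(np.mean(fold_scores))
    return C_grid[int(np.argmax(avg_scores))]
```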
Since both tuning and testing need to assess a model's ability to predict, the choice of performance metric is an important aspect. The difference between the true and predicted labels has to
be represented as a single value that describes how good the model is and allows for comparison.
This value is referred to as a model’s score, and the method used to obtain it the scoring metric.
For classification there are several ways to measure performance and which one is most suitable
for a given case is not always straightforward.
Let
• P (positive) denote the number of samples with true label 1
• N (negative) denote the number of samples with true label −1
• TP (true positive) denote the number of samples that were correctly given label 1
• TN (true negative) denote the number of samples that were correctly given label −1
• FP (false positive) denote the number of samples that were incorrectly given label 1
• FN (false negative) denote the number of samples that were incorrectly given label −1
The simplest scoring metric is the accuracy (acc), which is defined as the percentage of cases
that were correctly classified:

$$acc = \frac{TP + TN}{P + N}$$
An issue with the accuracy is that it does not take false negatives or false positives into account.
This is especially an issue if either P or N is significantly smaller than the other as misclassification
of all the samples in the small set would not affect the accuracy much. To take the misclassification
per class into account a confusion matrix can be used:
                Actually −1    Actually 1
Predicted −1        TN             FN
Predicted 1         FP             TP

Table 1: Layout of a confusion matrix
The confusion matrix allows one to compute the average precision per class (apc) to get a score
that reflects the accuracy of both classes. Since the class sizes can be different the weighted average
of the two is used to avoid the accuracy of one having larger impact:
$$apc = \frac{P \times \frac{TP}{TP+FP} + N \times \frac{TN}{TN+FN}}{P + N}$$
There are, however, issues with the apc score as well. For example, if one class has a small number
of samples, the results will have large spread. This makes them less reliable and the overall average
will be over two quantities with very different degrees of variance.
Two other relevant properties of the classification are precision and recall. Intuitively, recall is
the ability to find all positive samples and precision is how good the model is at not mislabeling
negative ones:
• precision = TP / (TP + FP)
• recall = TP / P
A model ideally has high values of both and by taking their harmonic mean they are combined
into a single scoring metric, the f1-score:

$$f_1 = 2 \cdot \frac{precision \times recall}{precision + recall}$$
Another way to measure mispredictions is the Receiver Operating Characteristic curve (ROC).
This curve plots the rate of true positives against the rate of false positives and thus shows how
many correct positive classifications can be gained as more false positives are allowed. To use this
as a scoring metric the area under the curve is measured, as a larger area means a higher rate of
true positives. This results in the ROC area-under-the-curve score rauc, which is the final scoring
metric that will be considered in this thesis.
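All of these metrics are available in scikit-learn. A sketch with toy labels; note that scikit-learn orients its confusion matrix with true labels as rows, the transpose of Table 1:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, roc_auc_score)

y_true = np.array([-1, -1, -1, 1, 1, 1, 1, 1])               # toy true labels
y_pred = np.array([-1, 1, -1, 1, 1, -1, 1, 1])               # toy predictions
dec = np.array([-0.9, 0.2, -0.4, 0.8, 0.6, -0.1, 0.7, 0.9])  # decision values w^T x + b

acc = accuracy_score(y_true, y_pred)                         # (TP + TN) / (P + N)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()    # per-class error counts
apc = precision_score(y_true, y_pred, average='weighted')    # class-weighted precision
f1 = f1_score(y_true, y_pred)                                # harmonic mean of prec/recall
rauc = roc_auc_score(y_true, dec)                            # area under the ROC curve
```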
As discussed previously, the choice of scoring metric is not straightforward and depends on the
nature of the data itself. Generally the best choice will become evident after some results and
knowledge of the data have already been gathered. For this reason the initial model tuning and
evaluation will use the simplest metric acc. Although this does not take class imbalance into
account it is still considered a good starting point.
Class imbalance will only be the case for the extreme thresholds, and although 0.2 is the most
relevant one for investigating cosmic reionization, the performance for all thresholds is of interest
in evaluating the model itself as well as properties of the data that are relevant for classification.
Furthermore, it is not certain that class imbalance will be an issue. The data as well as what is
known of the astronomical processes involved indicate that the galaxies have properties that are
correlated to fesc and should therefore be possible to classify accordingly. Also, it is not the case
that one class is more important than the other for our purposes. The goal is to separate the
samples according to a threshold, not to pick out ones that belong to a certain class and therefore
a measure of total performance is appropriate. Sections 4.2 through 4.4 will therefore use accuracy
when calculating a model’s score. In section 4.5 the classification performance is investigated in
detail and the effects of using different scoring metrics are evaluated.
3.3.2 Robustness
An important aspect of the evaluation of a model is its robustness. This is a measure of how
sensitive it is to varying properties of the data and is an important indication of its applicability
to the problem. Different types of robustness are considered and a variety of methods are used in
the estimation.
Robustness of accuracy is a measure of how sensitive the classification performance of the
models is. The sensitivity of the score of a model when applied to a test set is of interest. Does it
differ for varying assumptions of the data or for input selection? Since the goal is to use simulated
data to classify actual observations, it is also important to investigate how sensitive the model is
to "wrong" assumptions for the train set.
To see the effects of varying assumptions of the data, the models are trained and tested
on different data sets. The test scores for different noise levels, simulations, stellar tracks and
dust effects are compared to see if there are certain assumptions that cause more problems in
classification than others.
The sensitivity of a model to assumptions about the data that do not correspond to the
properties of the galaxies to be classified is investigated by using different types of data sets for
training and testing. For varying noise levels this is done by fixing the noise in the test set
and evaluating the model performance when varying the train set noise level. The sensitivity
to different simulations is evaluated by comparing the test performance of all combinations of
simulation method for the train and test sets, and likewise with stellar tracks.
It is also possible that the data may contain errors in the labeling of the training samples,
resulting perhaps from measurement errors or incorrect modeling of certain properties in a simulation. In order to get a measure of how robust a model is to this type of disturbance it is fitted
to a series of manipulated train sets in which the label of each sample has been flipped with a
certain probability, and used to classify an unmodified test set. How the performance degrades
with higher probability of label-flipping indicates the sensitivity of the model to this type of error.
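The label manipulation itself is a one-liner; a sketch, assuming labels in {−1, 1} so that flipping amounts to negation:

```python
import numpy as np

def flip_labels(y, p, rng=np.random.default_rng(0)):
    """Flip each label with probability p to simulate mislabeled train data."""
    return np.where(rng.random(y.shape) < p, -y, y)
```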
Robustness of selection refers to how the selection of features is affected by varying properties
of the data. This is related to robustness of accuracy since it is a measure of the model’s ability to
generalize, but here the focus is to see if there is large variance in which features are chosen or the
weights that they are given. This is related to the astronomical interpretation of which spectral
properties are correlated to fesc . It also gives an indication of whether the actual signal is being
captured by a model or if it is fitted to noise or other irrelevant parts of the spectra.
Bootstrapping is an algorithm used to give an indication of the stability of the model with
respect to input choice. A collection of train sets T′_i of size n′ < n is chosen with replacement
from the original train set T, where |T| = n. The tuning and fitting process is then performed
for each T′_i and the variance of the resulting models is investigated. The distribution of
the weights of each feature over all T′_i gives an indication of the stability of the feature weights.
The robustness of feature selection is shown by the frequency that each feature is chosen to be
significant over all runs. The idea of these tests is to see how much the choice of samples affects
the trained model, and it is considered more stable if changes in the train set have a small effect
on the resulting classifier.
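A sketch of the bootstrap loop; for brevity C is fixed here, whereas in the thesis the full tuning process is repeated for each resampled set:

```python
import numpy as np
from sklearn.svm import LinearSVC

def bootstrap_weights(X, y, n_rounds=1000, frac=0.8, C=1.0,
                      rng=np.random.default_rng(0)):
    """Collect the feature weights of models fitted to resampled train sets."""
    n = len(y)
    W = np.empty((n_rounds, X.shape[1]))
    for i in range(n_rounds):
        idx = rng.integers(0, n, size=int(frac * n))    # draw with replacement
        W[i] = LinearSVC(C=C).fit(X[idx], y[idx]).coef_[0]
    return W   # per-feature spread of W indicates weight stability;
               # (W != 0).mean(axis=0) gives each feature's selection frequency
```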
Another way to investigate the robustness of selection is to see which features are chosen to be
included in the optimal model when only using a certain number of features in the classification.
This is investigated by an exhaustive search over all feature combinations of size 1, 2, . . . , 7 and
plotting which ones are part of the model with the best test performance for each subset size. This
gives an indication of which features are the most important for classification as well as how stable
this selection is.
The metrics above are related to two aspects of robustness, that of a model and that of the
method itself. Robustness of a model measures how stable a given model is in its performance
depending on varying input and gives an indication of how well we can expect a certain model
fitted to a data set to behave. Robustness of the method instead suggests how suitable support
vector machines are for solving this particular problem. This is more related to the nature of the
data and relationship between fesc and the spectrum of a galaxy. The goal in this thesis is to get
an idea of both these aspects of robustness.
3.3.3 Implementation details
The software package scikit-learn (Pedregosa et al.; 2011) for Python 3.4 is used to implement
the classification models and the PCA algorithm. Five-fold cross-validation is used in parameter
selection. The classification is always performed by choosing an escape fraction threshold
t ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} and defining the negative class as the set of samples with
fesc < t and the positive class as the set of samples with fesc ≥ t. Bootstrapping is performed by
randomly selecting 80 percent of the train samples with replacement and iterating 1000 times.
When the train and test sets originate from different simulations all samples are included in
both sets. Otherwise 20% of the samples are set aside to be used for testing, with an equal
distribution of samples with different fesc values. All spectra are normalized to have mean 1 so
that the luminosity of a galaxy has no effect on the classification.
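Putting these pieces together, the overall setup can be sketched end to end as follows. The stand-in data, the C grid and the seeds are illustrative; the real spectra come from the LYCAN simulations, and the imports follow the current scikit-learn module layout rather than the version used in the thesis.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
spectra = rng.lognormal(size=(500, 140))                # stand-in spectra (140 bins, R=100)
fesc = rng.choice(np.linspace(0.0, 1.0, 11), size=500)  # 11 escape fraction values

t = 0.2                                            # escape fraction threshold
X = spectra / spectra.mean(axis=1, keepdims=True)  # normalize each spectrum to mean 1
y = np.where(fesc < t, -1, 1)                      # negative class: fesc < t

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=fesc, random_state=0)   # 20% test, even fesc spread

search = GridSearchCV(LinearSVC(), {'C': np.logspace(-3, 2, 11)}, cv=5)
search.fit(X_tr, y_tr)                                # 5-fold cross-validation over C
print(search.best_params_, search.score(X_te, y_te))  # test score of the tuned model
```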
4 Results
In this section the results of applying the methods described above are presented. In section 4.1
the information content in general as well as with respect to fesc is investigated using PCA. In
section 4.2 support vector machines are used to classify galaxies from the main data set. The
model tuning process as well as the classification performance is evaluated thoroughly. This is
followed by an investigation of the robustness of the same models in section 4.3. Section 4.4 is also
concerned with robustness but here the effects of different data assumptions on the method are
investigated. In section 4.5 an extensive evaluation of the problems encountered with classification
is performed and the effects of different scoring metrics are investigated. Both the results of the
L2-SVM and the L1-SVM are shown throughout to provide a comparison of the two, and several
noise levels are considered to assess the sensitivity of the models to this. Finally, the classification
performance of the proposed method is compared to that of two other models: the one defined by
Zackrisson et al. (2013) that uses two properties derived from the SED and the lasso regression
approach from Lundholm (2016) which also uses the entire spectra in predicting fesc .
4.1 Understanding the data
PCA is applied to investigate the information content in the data; which parts of the spectra have
the highest variance as well as how the data points are laid out with respect to fesc . Figure 5
shows the eigenvalues of the 30 principal components with highest variance in sorted order for
four different data sets. The top panel shows results for the Shimizu set with and without dust
effects, and the bottom panel the corresponding plots for the Gnedin set. In all cases there is a
significant drop-off after roughly 5 eigenvalues, indicating that a large part of the information in
the spectra can be expressed by a few dimensions. All experiments also showed that the slope was
more or less unchanged after 20 principal components.
Figure 5: The resulting eigenvalues after performing PCA on the Shimizu and Gnedin simulations with and without the effects of dust.
The results indicate that representing a spectrum as a linear combination of only a few components should retain much of the information in it. By projecting a spectrum onto a subset of the
principal components and then reconstructing it, the loss of information in such a partial representation can be illustrated. Figure 6 shows how this differs when using 5, 10, 20 and 40 principal
components to represent a sample from the Shimizu simulation. The original spectrum is illustrated
by a solid line and the reconstructed one by a dotted line. The red line shows the difference
between them and the legend in each plot displays the mean squared error of this difference.
Figure 6: The effects of projecting a spectrum from the Shimizu simulation onto different numbers of principal components. Solid lines indicate the original spectra and dotted the reconstructed ones after the projection. Red lines indicate the difference between the two, and the mean squared error is displayed in the legend.
The shapes of the original and reconstructed spectra are very similar in all plots, even when only
using 5 principal components. This indicates that the information correlated to fesc, the emission
lines in particular, is possible to capture using only part of the data. This can be seen as further
motivation for using a linear model to predict fesc, and a sparse one like the L1-SVM in particular.
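The projection-and-reconstruction experiment of Figure 6 can be sketched with scikit-learn's PCA class (here with samples as rows, following the library's convention rather than the layout of equation (2)):

```python
import numpy as np
from sklearn.decomposition import PCA

def reconstruction_mse(spectra, spectrum, k):
    """Project one spectrum onto the k strongest principal components of a
    sample set and return the mean squared reconstruction error."""
    p = PCA(n_components=k).fit(spectra)    # spectra: samples x wavelength bins
    approx = p.inverse_transform(p.transform(spectrum[None, :]))[0]
    return np.mean((spectrum - approx) ** 2)

# Example with stand-in data:
# rng = np.random.default_rng(0); S = rng.normal(size=(200, 140))
# print([reconstruction_mse(S, S[0], k) for k in (5, 10, 20, 40)])
```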
PCA is also used to show where in the spectra the information is collected. Figure 7 shows the
four eigenvectors with highest variance for the Shimizu and Gnedin simulations with and without
dust when effects of noise are not added. The wavelength bins containing emission lines contribute
to a large part of the variance in all cases, especially the ones centered around 5000 Å. The lowest
wavelengths also show large variance but other than that most parts of the spectra do not seem to
contain much information. This is consistent with the effects of fesc on galaxy spectra discussed
in section 2.

Figure 7: The four principal components with highest variance for the Shimizu and Gnedin simulations with and without the effects of dust and no noise effects.
So far the PCA results have only been concerned with the variance in the data in general, but
it is also possible to investigate how the structure of the data relates to fesc . By projecting the
data onto three dimensions it is possible to plot the layout of the samples. The idea is that if the
points are organized in clusters then projection onto the dimensions with the highest variance is
likely to show this structure. Here the goal is to see how they are spread out with respect to
fesc , so the samples are colored accordingly.
Figure 8 shows the positions of the samples from the main data set in three-dimensional space
after projection onto two different combinations of eigenvectors. The left plot shows this for the
first, second and third eigenvectors with the highest variance, and the right plot for the second,
fourth and fifth eigenvectors.
Figure 8: Samples from the Shimizu simulation after projection onto two different combinations of three principal components. The title of each plot indicates the choice of principal components, with 1 being the one with highest variance. Samples with 4 escape fractions are plotted and they are colored accordingly.
It is encouraging that there is a clear shift from red to purple, indicating that the data is separable
according to fesc to some degree, even though there is significant overlap between the colors. In
general it seems that the points are clustered together tightly with few outliers.
Figure 9 shows two plots with the three-dimensional layout of galaxies from the Finlator simulation after projection onto the first, second and third as well as the first, fifth and sixth principal
components with the highest variance. These plots show more spread out samples and less clustering according to fesc than the Shimizu simulation. This is because the Finlator simulations
model a larger range of star formation histories and therefore result in spectra that differ more
from each other.
Figure 9: Samples from the Finlator simulation after projection onto two different combinations of three principal components. The title of each plot indicates the choice of principal components, with 1 being the one with highest variance. Samples with 4 escape fractions are plotted and they are colored accordingly.
The combinations of principal components that resulted in the clearest separation were chosen
for this illustration, as most combinations showed very little clustering according to color for the
Finlator model. For the Shimizu simulation a large number of combinations showed some degree
of such separation. This indicates that the Finlator samples could be harder to classify according
to fesc .
4.2 Classification of the main data set
In this section the results of parameter selection and model evaluation for the main data set
defined in section 2 are presented. The optimal value for the penalty parameter C is determined
using 5-fold cross-validation as defined in section 3.3.1. Results are shown for both the L1-SVM
and the L2-SVM and for different noise levels.
Figures 10 and 11 show the cross-validation scores for different values of the regularization
parameter C for the L2-SVM and the L1-SVM, respectively. The results for noise levels sn3, sn5,
sn10 and no noise are shown. Each plot contains the validation scores for threshold 0.2 separately
as well as the average score over all thresholds. The maximum validation score is highlighted by
a vertical line, and in the case of the L1-SVM it is labeled with the number of nonzero features in
the resulting model.
Figure 10: Validation score as a function of the regularization parameter C for the L2-SVM for different noise levels of the main data set. The solid line shows the score for threshold 0.2 and the dotted line the average over all thresholds. Vertical lines indicate the maximum validation score.
Figure 11: Validation score as a function of the regularization parameter C for the L1-SVM for different noise levels of the main data set. The solid line shows the score for threshold 0.2 and the dotted line the average over all thresholds. Vertical lines indicate the maximum validation score and are labeled with the number of nonzero features in the resulting model.
For both models there is a sharp increase in validation score at low values of C, indicating
that a few features are crucial for representing a large part of the information about fesc . This is
consistent with the results of the PCA in section 4.1 in which a large part of the variation in the
data was concentrated to a few principal components.
The effects of noise are similar in the two models. The value of C that results in the highest
validation score tends to be lower for the noisier data sets. This is because the model is more
prone to overfitting with higher noise as it causes some signals to be drowned out. In the case with
no noise the data is clearly easily separable and a higher misclassification penalty is beneficial.
Although the noise-free case is not a realistic representation of actual observations it gives an
indication of how much noise affects the classification.
As expected, lower noise results in higher scores overall. Comparison of the two models shows
that the L1-SVM has higher validation scores for the optimal value of C on the noisy data sets,
although the differences are only in the order of 0.01. In the noise-free case, however, the L2-SVM
has a validation score of 0.99 compared to 0.95 of the L1-SVM. This is probably because the
benefits of a simpler model are not as significant when no noise is present. In all experiments it
was the case that the validation score was lower for threshold 0.2 than for the average over all
thresholds, indicating that this choice of threshold is particularly difficult to separate the samples
according to.
The validation score is used to select the optimal parameters. However, the test score is a
more accurate measure of a model’s ability to classify as the test samples are independent from
the train set. The test scores per fesc threshold for the two models are shown in Figure 12.
Figure 12: Test score per threshold for the main data set with different noise levels. The solid line shows the scores for the L2-SVM and the dotted one for the L1-SVM.
The results show quite good overall performance. For the more realistic cases with noise, the
scores are between 84 and 92 percent correctly labeled samples. It is interesting that the choice of
threshold has a large effect on the test score, with the lower thresholds that are of most interest
having the worst performance. It is expected that the models perform worse for extremely high
and low thresholds because the distribution of labels in the training data is skewed, but this is not
the pattern that the results show. The higher thresholds generally have the highest test scores,
and threshold 0.1 also gives better performance than 0.2 in all cases. The L1-SVM has slightly
better scores for higher noise levels, which is probably due to the fact that it is better at selecting
relevant features and avoiding fitting to the noise. The test scores are not significantly lower than the corresponding validation scores for either model, which also indicates that there is not a great deal of bias due to overfitting.
It is also interesting to investigate which features are being chosen by the models. Figures 13
and 14 show the weights attached to each feature in the decision function of the L2-SVM and the
L1-SVM, respectively. The top leftmost plot shows the features chosen when C has the lowest
value of 0.001 and then for increasing C up to the optimal value for each model. In both cases
the threshold used is 0.2.
It is clear that in all cases there is a small subset of features that have the largest weights and therefore the most impact on the classification. These correspond to those seen in the PCA plots:
the emission lines and the lower wavelengths of the spectra. The latter are likely chosen as an
indicator of the slope of the spectra, as discussed in section 2. It is interesting that for low values
of C it is mainly the emission line centered at 5000 Å that is chosen by both models, but with
increasing C the line just below 4000 Å becomes more significant. Since both are oxygen lines this
shows that the optimal models chose to give important features high weights even though they
are correlated. The fact that the third oxygen line, OIII4959, is not as significant indicates that it may be more strongly correlated to the other two than they are to each other. The oxygen emissions also seem to be a stronger indicator of fesc than the hydrogen lines in this case.
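For reference, the weights plotted in figures 13 and 14 can be read directly off a fitted linear model. A minimal sketch, assuming X, y and a wavelength grid are prepared as above and with an illustrative value of C:

    import numpy as np
    from sklearn.svm import LinearSVC

    clf = LinearSVC(penalty='l1', dual=False, C=0.1, max_iter=10000).fit(X, y)
    weights = clf.coef_.ravel()              # one weight per spectral bin
    for i in np.flatnonzero(weights):        # only the features actually used
        print(f'{wavelengths[i]:8.1f} A  weight {weights[i]:+.4f}')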
Figure 13: The feature weights of the L2-SVM for different values of the regularization parameter C when trained
on the main data set with threshold 0.2. The bottom right plot shows the optimal value of C.
Figure 14: The feature weights of the L1-SVM for different values of the regularization parameter C when trained
on the main data set with threshold 0.2. The bottom right plot shows the optimal value of C.
The effects of different noise levels on the selection of features are also investigated. Figures 15 and 16 show the feature weights of the optimal L2-SVM and L1-SVM, respectively, for different noise levels with threshold 0.2. Both models have higher weights in the blue part of the spectra for lower noise levels, indicating that noise removes some information about the slope. The strength of the emission lines seems to be less affected, although the weaker hydrogen lines are more pronounced the less noise there is. As the noise decreases the L1-SVM increasingly picks out only one of the oxygen lines instead of both. This is not the case for the other model, indicating that the L1-SVM is better at selecting relevant features and avoiding correlated ones.
Figure 15: The feature weights of the optimal L2-SVM for the main data set with different noise levels.
The results confirm that the emission lines are strongly correlated with fesc, and the fact that the weights are negative further indicates that strong emission lines are signs of a low fesc value. Since these results were for threshold 0.2, the weights indicate that the stronger a sample's emission lines are, the more likely it is to have fesc < 0.2.
Figure 16: The feature weights of the optimal L1-SVM for the main data set with different noise levels.
4.3 Robustness of the main model
In this section the robustness of the models from the previous section is investigated. Figures 17
and 18 show how the test score is affected by mislabeled samples in the train set. The x-axis shows
the probability of a label being flipped in the train set and the y-axis the score when predicting
an unmodified test set. The scores for different thresholds are indicated by colored lines.
Figure 17: Test score of the L2-SVM as a function of the probability of flipping each label in the train set for the main data set with different noise levels. The test scores for each threshold are represented by lines of different colors.
Figure 18: Test score of the L1-SVM as a function of the probability of flipping each label in the train set for the main data set with different noise levels. The test scores for each threshold are represented by lines of different colors.
Both models seem to be quite robust to this type of error. The test score declines very little up to a 40 percent chance of flipping each label. At 50 percent the model understandably has a one in two chance of predicting correctly, and as the label flipping probability increases further the model is trained on data that is increasingly labeled in the opposite way from the test set, causing a symmetric decline in score.
The L1-SVM has a slighter decline in score between 0.0 and 0.4 flipping probability and less difference between the thresholds, indicating that it is slightly more robust than the L2-SVM. It also seems to be more robust with respect to noise. For the L2-SVM the scores decline more steeply with higher noise and the difference between thresholds increases with less noise, but for the L1-SVM the plots look largely the same for all noise levels. One possible explanation is that since the L1-SVM uses fewer features to perform the prediction, mislabeled samples have less impact on the classification.
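A minimal sketch of the label-flipping experiment follows, assuming train and test splits are prepared as before; the seed and the value of C are illustrative.

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    for p in np.linspace(0.0, 1.0, 11):
        flip = rng.random(len(y_train)) < p            # flip each label with probability p
        y_flipped = np.where(flip, -y_train, y_train)  # labels are coded -1/+1
        clf = LinearSVC(penalty='l1', dual=False, C=0.1, max_iter=10000)
        clf.fit(X_train, y_flipped)
        print(f'p = {p:.1f}  test score = {clf.score(X_test, y_test):.3f}')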
The next indicator of robustness is how consistently features are chosen to be used in the classification. This test is performed on the L1-SVM, since only a subset of its features have nonzero weights, which makes it better suited for illustration. Figure 19 shows the results of performing 1000 bootstrap iterations on the main data set with threshold 0.2 for different noise levels. The x-axis shows the wavelengths and the y-axis the frequency with which they are chosen to be part of the optimal model over the bootstrap iterations. Vertical lines mark the features that were chosen with a frequency over 90 percent.
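A sketch of the bootstrap frequency estimate is given below; for brevity C is held fixed here, whereas the full procedure re-tunes the model in each iteration. All names are illustrative.

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    n_boot = 1000
    counts = np.zeros(X.shape[1])
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))     # resample with replacement
        clf = LinearSVC(penalty='l1', dual=False, C=0.1, max_iter=10000)
        clf.fit(X[idx], y[idx])
        counts += clf.coef_.ravel() != 0               # record which features were used

    frequency = counts / n_boot
    stable = np.flatnonzero(frequency >= 0.9)          # the consistently chosen features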
Figure 19: The frequency with which each feature has a nonzero weight when performing 1000 bootstrap iterations
on the L1-SVM model with threshold 0.2 on the main data set with different noise levels. Vertical lines indicate
the features that have nonzero weight with a frequency of 90 percent or more.
The oxygen and hydrogen lines as well as several features at the start of the spectra are chosen consistently at all noise levels. As seen with the feature weights previously, only one of the oxygen lines is required when the noise level is low. A higher noise level seems to require more features at the blue end of the spectra, possibly because the slope is more difficult to infer.
Noise level sn5 seems to have slightly less variance in the feature selection than the others. This could be due to two opposing effects that noise has on the model. On the one hand, less noise results in data that is simpler to classify, so a simpler model with fewer features can suffice. On the other hand, if the spectra are not affected by noise then most wavelengths have a clear correlation to fesc and it can be beneficial to include more features. This was also indicated by the feature weights in figure 16, since the noise-free case had almost no zero features while the model trained on sn10 data was the most sparse.
Next, the variance of the feature weights is considered. Figure 20 shows the distribution of
the feature weights over all bootstrap iterations. Apart from a few features in the middle of the
spectra that are likely noise, the emission lines show the highest variance. The fact that input
selection affects these the most could be because the strength of emission lines can differ between
individual galaxies even when they have the same escape fraction. It is also interesting that the
nonzero features at the start of the spectra have very little variance, even in the noisiest case. One
explanation could be that the noise has a similar effect on all samples at these wavelengths, which could give an indication of why the models so consistently consider these wavelengths to be relevant.
Figure 20: The mean and standard deviation of feature weights over 1000 bootstrap iterations on the L1-SVM
model with threshold 0.2 on the main data set with different noise levels.
The robustness of feature selection is also investigated by seeing which features are part of the optimal model when the number of features that may be used is restricted. Figure 21 plots which features are chosen to be part of the optimal model with 1, 2, ..., 7 features. The x-axis shows the wavelengths, the y-axis the number of features allowed in the model, and a bar at position (x, y) indicates that wavelength x is chosen to be part of the optimal model with y features.
Figure 21: The features that are chosen to have nonzero weight in the optimal L1-SVM model when limiting the
number of nonzero features and training on the main data set with threshold 0.2. A bar at position (x, y) indicates
that feature x is chosen to be part of the optimal model with y nonzero features.
The oxygen line centered near 5000 Å is chosen to be part of all models. In almost all cases
wavelengths at the start of the spectra are also chosen. The exception is the model with only one
feature, indicating that the emission line is more important than the indicator of slope. This is
consistent with the astronomical interpretation since the emission line is considered more strongly
correlated to fesc than the slope of the spectrum. Other emission lines start to be chosen when
three or more features are allowed. The two main oxygen lines are present at three features
already; the only hydrogen line present is Hγ, which appears at six features. These results
are consistent with what the feature selection and bootstrapping showed and also indicate a high
level of robustness of selection for the L1-SVM.
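One way to obtain a model with a prescribed number of nonzero features is to scan the regularization path until the desired sparsity is reached. The sketch below illustrates this idea; it is an assumed procedure for demonstration, not necessarily identical to the one used here.

    import numpy as np
    from sklearn.svm import LinearSVC

    def model_with_k_features(X, y, k, C_grid=np.logspace(-3, 1, 50)):
        """Scan C from small (few features) to large and return the first
        fitted L1-SVM whose weight vector has exactly k nonzeros."""
        for C in C_grid:
            clf = LinearSVC(penalty='l1', dual=False, C=C, max_iter=10000).fit(X, y)
            if np.count_nonzero(clf.coef_) == k:
                return clf
        return None     # no C on the grid produced exactly k features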
4.4 The effects of simulation assumptions
In this section the robustness of the method with respect to varying data assumptions is evaluated.
Figure 22 shows how the test results of the two models vary with noise level for resolution R100
and R1000. The left figure shows the test score as a function of a fixed exposure time and the
right figure for a fixed signal-to-noise ratio.
Figure 22: The effects of resolution on the test score of the two models on the main data set. Red lines indicate
R1000 and black R100; solid lines show the scores of the L2-SVM and dotted lines the L1-SVM. Left: the test score as
a function of noise calculated with fixed exposure time. Right: the test score as a function of noise calculated by
fixing the signal-to-noise ratio at 1500 Å.
With fixed signal-to-noise ratio the models trained on higher resolution data perform better.
With this noise model the samples of different resolutions have similar noise levels and it is clear
that some information relating to fesc is lost in the lower resolution data. With fixed exposure
time it is the models trained on lower resolution data that have higher test scores. This is because
a given exposure time results in a lower noise level for the lower resolution data. The fixed
exposure time case corresponds more closely to how the noise level of actual observations will
behave and therefore the models are considered to perform better on the lower resolution. For
this reason the resolution R100 is mainly considered in this thesis.
The sensitivity of the models to different levels of noise on the train and test sets is considered
next. The noise level is fixed for the test set and the score is calculated when training on data with
different noise levels. Figure 23 shows the test score as a function of train set noise for different
test set noise levels. The main data set with threshold 0.2 is used for both models.
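A sketch of the noise-mismatch experiment is shown below; load_spectra is a hypothetical helper returning data degraded to a given signal-to-noise level, and the noise level names follow those used in the text.

    from sklearn.svm import LinearSVC

    noise_levels = ['sn3', 'sn5', 'sn10', 'none']
    for test_noise in noise_levels:
        X_test, y_test = load_spectra(test_noise, split='test')      # hypothetical loader
        for train_noise in noise_levels:
            X_train, y_train = load_spectra(train_noise, split='train')
            clf = LinearSVC(penalty='l1', dual=False, C=0.1, max_iter=10000)
            score = clf.fit(X_train, y_train).score(X_test, y_test)
            print(f'test {test_noise:>4}  train {train_noise:>4}  score {score:.3f}')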
Figure 23: Test scores of the L1-SVM and L2-SVM as a function of noise level of the train set for fixed test set
noise level on the main data set for threshold 0.2. Lines are annotated with the test set noise level.
In general it seems that training on a lower noise level than the test set gives poor performance, whereas training on a higher noise level is in some cases beneficial. The scores of the L1-SVM are higher overall and more robust to training on data with the "wrong" noise level. For the L2-SVM most lines show a peak in test score when the train and test sets have the same noise level, but for the L1-SVM the scores are more constant for varying train set noise levels.
To test model sensitivity to different simulations we investigate the effects of varying simulation
for the train and test sets. Figure 24 shows the test scores per threshold for all combinations of
simulation method of train and test sets for the L1-SVM and L2-SVM. Table 2 lists the test
scores for threshold 0.2 as well as the average over all thresholds for each model for the same
combinations.
The diagonal plots show the results when the train and test sets come from the same simulation. The Gnedin simulation has the highest test scores overall, with only threshold 0.2 having
less than 90 percent correctness. The Shimizu simulations have similar performance, but for
Finlator the scores are significantly lower with only thresholds 0.1, 0.8 and 0.9 having over 85
percent correctness. This is because of the larger spread in shape of galaxy spectra in the Finlator
simulation that was also indicated by the results of PCA in section 4.1. These results show that
this does indeed make classification more difficult.
However, the scores seem to depend more on the test set than the train set simulation choice. The worst results overall are when testing on the Finlator set, for all choices of train simulation, but training on Finlator does not seem to give much worse performance when testing on Shimizu or Gnedin. Since the homogeneous train sets give significantly worse performance when testing on the more varied one but not conversely, it seems the best choice of train set for future observations would be one containing a wide range of galaxy types. Keeping this in mind, the classification model seems to be quite robust to choosing the "wrong" simulation for training. The results also indicate that the performance of the two classification models does not differ much between the simulation methods at this noise level. All combinations show varying performance for different thresholds, as was the case for the main data set in section 4.2, although which thresholds are the most problematic varies with the test set simulation.
Figure 24: Test score per threshold for the L1-SVM and L2-SVM for all combinations of simulation for the train
and test set. All data sets have noise level sn5 and are without dust effects.
                    Test: Shimizu       Test: Gnedin        Test: Finlator
                      L2      L1          L2      L1          L2      L1
Train: Shimizu
  t = 0.2            0.87    0.89        0.89    0.89        0.86    0.85
  avg all            0.89    0.90        0.92    0.92        0.85    0.85
Train: Gnedin
  t = 0.2            0.87    0.86        0.85    0.87        0.85    0.85
  avg all            0.89    0.89        0.92    0.93        0.84    0.83
Train: Finlator
  t = 0.2            0.86    0.86        0.85    0.87        0.84    0.83
  avg all            0.88    0.89        0.90    0.90        0.85    0.86
Table 2: Test scores for threshold 0.2 and average over all thresholds for the L1-SVM and L2-SVM for all
combinations of simulation for the train and test set.
As discussed in section 2, adding the effects of dust causes a reddening of the spectra that
results in a larger spread in the fluxes. This can cause features related to fesc to be distorted and
increase uncertainty by introducing properties that could falsely be interpreted as correlated to
fesc . Figure 25 shows a comparison of the performance of the classification models on the Shimizu
and Gnedin simulations with and without dust effects with noise level sn5. The top row contains
the test scores per threshold for the two models. The dashed lines represent data with dust effects
and solid lines without for the L1-SVM in black and the L2-SVM in red. The bottom panel shows
the feature weights of the L1-SVM with and without dust effects for the two simulations.
Figure 25: Top panel: test score per threshold for the L1-SVM (black lines) and the L2-SVM (red lines) for
data with dust effects (dotted lines) and without dust effects (solid lines) for the Shimizu and Gnedin simulations.
Bottom panel: feature weights of the L1-SVM when trained on data with dust effects (dotted lines) and without
dust effects (solid lines) for the same simulations.
The effects of dust on test performance are larger for the Shimizu simulation, with lower
scores on all thresholds except one. For the Gnedin simulations the scores differ less when adding
dust. This could be because of the different dust recipes used or because of the properties of the
simulations themselves, or a combination of both. The fact that there are fewer nonzero features
in the optimal model for Gnedin data without dust than for Shimizu suggests that it is a less
complex data set that is easier to classify.
The effect that dust has on the feature weights differs between the two simulations. In the Shimizu case the number of nonzero features increases significantly when adding dust. There are features chosen throughout the spectra, at wavelengths that were never chosen in any case without dust. Since the spectra with dust have a larger spread the model needs to include more
features to produce a good fit. The feature weights for Gnedin do not increase as much with the
addition of dust. The emission lines and the beginning of the spectra, although a little shifted
to the red side, are still the main parts used. It could also be the case that dust effects make it
more difficult to infer the slope, causing more parts of the spectra to be included. The fact that
this effect is more prominent for the Shimizu simulation could be because the effects of dust are
stronger in this case.
Finally, the effects of applying different stellar tracks are investigated. Figure 26 shows the
test score per threshold for different combinations of stellar tracks for the train and test sets for
the Shimizu simulation.
Figure 26: Test score per threshold for the L2-SVM and L1-SVM when varying the stellar track used on the train
and test sets for the Shimizu simulation.
This assumption clearly has a large effect on classification performance. The test scores are
significantly lower for the two cases with different stellar track on the train and test sets, with
testing on BPASS2 resulting in particular difficulties around threshold 0.2 and testing on Geneva
around 0.6. This is because the different stellar evolution models predict different strengths of
emission lines for a given fesc value. BPASS2 predicts stronger lines than Geneva, meaning that a
model trained on BPASS2 will characterize low fesc values with emission lines of strengths that do
not occur in the Geneva set. This causes samples with true label -1 to be given label 1. Conversely,
a model trained on Geneva will overestimate fesc values for a BPASS2 test set.
4.5 Analysis of classification performance
So far the evaluation has only been concerned with the overall test performance of the models,
but it is also interesting to further analyze the nature of the errors. Which samples are being
incorrectly labeled? Are there any patterns or particular parts of the data that are especially
difficult? Since it is required to separate galaxies according to low escape fractions to investigate
cosmic reionization, it is particularly interesting to see what kind of errors the models make when
trained on these thresholds.
The test results from sections 4.2 and 4.4 showed poorer test scores overall for the low thresholds, with some dependence on simulation and noise. For the Shimizu simulations the lowest test
scores were for thresholds between 0.2 and 0.4. Figure 24 shows that for the Gnedin set the results
were worse for 0.2 and that the choice of train set has little effect on which threshold has the lowest score. The Finlator set has the lowest performance around threshold 0.5, and as seen in
table 2 it is the only test set that does not have a lower score for threshold 0.2 than the average
over all thresholds.
As was discussed earlier, some decrease in performance is expected for the extreme thresholds
due to imbalance of classes in the training data. But since the pattern of test scores is not symmetric in this way, it indicates other issues. To investigate this we look at the nature of the mispredictions. Figure 27 plots the number of incorrectly labeled samples per threshold
in the Shimizu and Finlator sets for the two models. The dark bars show samples that are overpredicted and the light bars show underpredicted samples. These correspond to false positives and false negatives from the discussion in section 3, but since the goal here is to estimate fesc and classify accordingly, and not to pick out certain samples of interest, the terms over- and underpredicted are more suitable. The y-axis shows the absolute number of samples in each category and the number above each bar indicates their percentage of the total. For example, 50 percent underpredicted samples for threshold 0.2 means that half of the samples with true label 1 were given label -1 for that threshold.
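The bookkeeping behind figure 27 amounts to counting the two kinds of errors separately. A minimal sketch, assuming y_true and y_pred are -1/+1 label arrays for one threshold:

    import numpy as np

    over = np.sum((y_true == -1) & (y_pred == 1))      # fesc overestimated
    under = np.sum((y_true == 1) & (y_pred == -1))     # fesc underestimated
    pct_over = 100.0 * over / np.sum(y_true == -1)     # share of the true negatives
    pct_under = 100.0 * under / np.sum(y_true == 1)    # share of the true positives
    print(f'overpredicted: {over} ({pct_over:.0f}%)  underpredicted: {under} ({pct_under:.0f}%)')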
Figure 27: The frequency of mispredicted samples per threshold. Underpredicted samples are indicated by light
bars and correspond to false negatives. Overpredictions, or false positives, are shown by darker bars. Numbers
above each bar indicate the percentage the mispredictions make of the total number of samples with their true
label. Results are shown for the L2-SVM and L1-SVM for both the Shimizu and Finlator simulations.
The results show that overprediction is a larger problem than underprediction. For thresholds
0.1 to 0.4 a very large part of the samples are incorrectly given label 1. The relatively high
test scores for threshold 0.1 are actually misleading, as nearly all samples with lower fesc are being mispredicted. The fact that there are so few such samples and that almost none are underpredicted
is what gives the appearance of a good test score.
For the Shimizu simulation the percentage and absolute number of overpredicted samples decrease steadily with increased threshold. A corresponding but opposite effect is seen in the underpredictions, but since the percentages are much lower the total number of mispredictions decreases with higher thresholds, explaining the increase in test score. The two models have similar numbers of underpredicted samples in this case, but the L1-SVM has fewer overpredictions for all thresholds.
The Finlator simulations have even larger issues with overprediction than Shimizu, and also
have higher percentages of underpredicted samples for higher thresholds. The L1-SVM has fewer
overpredictions but slightly more underpredictions than the L2-SVM, with threshold 0.5 being
particularly problematic for both. Earlier results showed that the Finlator samples are more
difficult to predict than Shimizu, and these results show that this is true for high as well as low
fesc values.
The results clearly indicate that estimating the low fesc samples is particularly difficult for
both simulations. To investigate this further we take a look at the distribution of values generated
by the decision function. As described in section 3.2.1, the label of a sample is decided by the sign
of the result of applying the decision function to it. However, it is also interesting to look at this
value itself. Its absolute value is an indicator of how "sure" the classification is, and the distribution of these values can give further indication of the nature of the mispredictions.
Figures 28 and 29 show the distributions of predicted values for thresholds 0.2 and 0.8 for
the Shimizu and Finlator simulations for the L2-SVM and the L1-SVM, respectively. The y-axis
shows the number of samples per predicted value and the bars are colored according to positive
(sign of decision function value should be positive) and negative class.
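The raw decision values histogrammed in figures 28 and 29 can be obtained as follows; a minimal sketch with illustrative parameter values:

    import numpy as np
    from sklearn.svm import LinearSVC

    clf = LinearSVC(penalty='l1', dual=False, C=0.1, max_iter=10000)
    clf.fit(X_train, y_train)
    d = clf.decision_function(X_test)    # signed distance to the separating hyperplane
    # sign(d) is the predicted label; |d| near zero marks uncertain classifications
    for cls in (-1, 1):
        vals = d[y_test == cls]
        print(f'class {cls:+d}: mean value {vals.mean():+.2f}, '
              f'share near zero {np.mean(np.abs(vals) < 0.5):.2f}')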
Figure 28: The frequency of predicted values for the L2-SVM when used on the Shimizu and Finlator simulations.
The left panel shows the predictions when using threshold 0.2 and the right panel for 0.8. True positives are shown
in light gray and true negatives in dark.
Both simulations show issues with threshold 0.2. There is a peak in frequencies centered at 0
for the negative class, indicating large uncertainty in predicting these samples. For the opposite
threshold 0.8 there are not equally big problems with underprediction, although the Finlator
samples show more uncertainty in this as well. This could be because of the difficulties of Finlator
samples discussed earlier, but it could also be an effect of the uneven distribution of labels in the
training data, especially since the Finlator set has significantly fewer samples than Shimizu.
Figure 29: The frequency of predicted values for the L1-SVM when used on the Shimizu and Finlator simulations.
The left panel shows the predictions when using threshold 0.2 and the right panel for 0.8. True positives are shown
in light gray and true negatives in dark.
We may ask ourselves whether this classification bias is due to poor model selection or to inherent issues with the data. Perhaps a scoring metric that values class balance over prediction accuracy would result in fewer problems with overprediction. To answer this, the model tuning process is performed again on the main data set using different scoring metrics. The average precision per class (apc), the harmonic mean of precision and recall (f1), and the area under the ROC curve (rauc) are used in the cross-validation process to find the optimal value of C, and the resulting model's predicted labels are compared to the true labels. Figure 30 shows the under- and overpredictions of the optimal model according to all four scoring metrics for the L1-SVM and the main data set.
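A sketch of the re-tuning with alternative metrics, under the assumption that apc, f1 and rauc map onto the scikit-learn scorers average_precision, f1 and roc_auc:

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import GridSearchCV

    grid = {'C': np.logspace(-3, 2, 20)}
    for scoring in ('accuracy', 'average_precision', 'f1', 'roc_auc'):
        search = GridSearchCV(LinearSVC(penalty='l1', dual=False, max_iter=10000),
                              grid, scoring=scoring, cv=5)
        search.fit(X, y)
        print(scoring, 'optimal C =', search.best_params_['C'])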
Figure 30: The frequency of mispredicted samples per threshold when using different scoring metrics to find
the optimal model. Underpredicted samples are indicated by light bars and correspond to false negatives. Overpredictions, or false positives, are shown by darker bars. Numbers above each bar indicate the percentage the
mispredictions make of the total number of samples with their true label. Results are shown for the L1-SVM used
on the main data set.
The effects of the scoring metric on model misprediction are not very significant. The overpredictions for low thresholds are slightly lower for the metrics that take class imbalance into account, but in these cases there are also more underpredictions for those thresholds. To be better at estimating the escape fraction of low fesc samples, it seems the model has to trade away some of its ability to classify higher fesc samples. These results suggest that the misclassification issues, particularly the overestimation, are not dependent on model selection but rather on the data itself.
4.6 Comparison to other methods
In this section the proposed method’s performance is compared to two other methods of estimating
fesc . First we consider the model introduced by Zackrisson et al. (2013) in which only two
properties derived from the SED are used: the slope and the relative width of the Hβ emission
line. To compare the two approaches an L1-SVM is fitted using only these two features and the
classification performance on the main data set is compared to the L1-SVM fitted to the entire
spectra. The scoring metric acc is used for both models. The test scores per threshold for three
noise levels are plotted in figure 31, with the leftmost plot showing the scores of the two-feature
model and the rightmost the scores of the full-spectra approach considered in this thesis.
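A sketch of this comparison, where uv_slope and hbeta_width are hypothetical arrays holding the two derived features and X holds the full spectra as before:

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    X_two = np.column_stack([uv_slope, hbeta_width])   # the two-feature representation
    for name, features in (('two-feature', X_two), ('full spectrum', X)):
        clf = LinearSVC(penalty='l1', dual=False, C=0.1, max_iter=10000)
        print(name, cross_val_score(clf, features, y, cv=5).mean())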
Figure 31: Test score per threshold for different noise levels on the main data set. Left: scores of the model that
uses only the slope and Hβ emission line as features. Right: scores of the full-spectra model introduced in this
thesis.
The test scores are higher for the full-spectra model for all noise levels and thresholds. This is to
be expected since the results showed that several features of the spectra were of great importance
for inferring fesc , and the advantages of considering all wavelengths are clear. The differences
between noise levels are also much larger for the two-feature model, indicating that using more
features gives a more robust model.
Secondly, a comparison to the lasso regression model described in Lundholm (2016) is performed. To get a comparable assessment of performance, the lasso model is used to predict fesc for the main data set, and the resulting values are placed into classes defined by the threshold 0.2, in the same manner as the classification model works. The confusion matrices of the two models are shown in tables 3 and 4.
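Turning the regression estimates into class labels is a simple thresholding step. A minimal sketch, assuming fesc_train and fesc_test hold the continuous escape fractions and with an illustrative regularization strength:

    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.metrics import confusion_matrix

    lasso = Lasso(alpha=0.01).fit(X_train, fesc_train)
    y_pred = np.where(lasso.predict(X_test) < 0.2, -1, 1)  # classify at threshold 0.2
    y_true = np.where(fesc_test < 0.2, -1, 1)
    print(confusion_matrix(y_true, y_pred, labels=[-1, 1]))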
sn3:
               True: -1   True: 1
Predicted: -1        82        43
Predicted:  1        72       697

sn5:
               True: -1   True: 1
Predicted: -1        91        47
Predicted:  1        63       693

sn10:
               True: -1   True: 1
Predicted: -1       111        47
Predicted:  1        43       693
Table 3: Confusion matrices when using lasso regression to classify according to threshold 0.2 on the main data
set with different noise levels.
sn3:
               True: -1   True: 1
Predicted: -1        70        33
Predicted:  1        85       706

sn5:
               True: -1   True: 1
Predicted: -1        90        36
Predicted:  1        65       703

sn10:
               True: -1   True: 1
Predicted: -1        98        39
Predicted:  1        57       700
Table 4: Confusion matrices when using the L1-SVM to classify according to threshold 0.2 on the main data set
with different noise levels.
The confusion matrices do not display large differences in prediction performance. The lasso
model has slightly fewer false positives, but it also has more false negatives. It seems the two
approaches struggle with the same aspects of classification, and that there are no clear advantages
of either model.
5 Conclusions and discussion
The approach of this thesis has been quite general, with the goal of providing a starting point for using machine learning as a tool for gaining insight into cosmic reionization. The results indicate that support vector machines are promising for this application. The models capture the relationship between a galaxy's spectrum and escape fraction adequately in most cases, often with over 85 percent correctness when classifying galaxies according to an fesc threshold.
More importantly, the approach is quite robust. Disturbances in the data of varying types do not
affect performance to a great extent and the selection of which features are important is relatively
stable. The fact that parameter selection and cross-validation error are quite stable and that
the model’s performance is not sensitive to input selection or scoring metric indicates that the
method itself is quite robust for this application. Comparison of the two types considered did
not show significant differences in performance, but still indicated the advantages of the L1-SVM
in some aspects. The sparse model showed slightly higher accuracy but more importantly it had
more consistent behavior when used on data with more noise. A higher degree of robustness was
also indicated by less sensitivity to disturbances in the form of flipped labels in the train set, as
well as different levels of noise on the train and test sets. The analysis of mispredictions also
showed that the L1-SVM has slightly lower rates of overpredictions of low fesc samples, as well as
underpredictions of high ones. These results are likely due to the advantages of a simpler model
that relies less on unnecessary features and noise.
In section 4.4 the effects of using different sets of assumptions on the train and test sets were
investigated. Since simulated data must be used to train models that are to classify actual observations, it is important to gauge the sensitivity of the approach to using sets of assumptions with
different properties. The technique proved quite robust to differing levels of noise on the train and
test data, further indicating a good representation of the signal and a low degree of overfitting. Of
the different simulations Finlator proved to be the most problematic, which was expected since it
models a wider range of star formation histories and therefore has a more heterogeneous collection
of galaxies. The fact that training on this set did not significantly decrease test performance for
the other simulations is positive and indicates that the correlation between fesc and the SED is
still possible to capture with these models when it is weaker. As mentioned previously, there are
differing opinions regarding the dust content of the early galaxies, and it is not certain how appropriate it is to train on data with dust effects. Although the effects on the spectra are significant,
particularly in the low wavelengths, the results did not show significantly worse performance for
data with dust. The assumption that had the largest negative effect on classification was using the
wrong stellar tracks. Since Geneva and BPASS2 result in emission lines of very different strengths,
training on one is misleading when testing on the other. Overall, the results of model sensitivity to data assumptions suggest that the approach is relatively robust. Some insights into what needs
to be considered when selecting appropriate data for training have also been gained.
The results have also provided some understanding of the data and the nature of the relationship between a galaxy’s spectrum and fesc . PCA showed that the parts of the spectra that
are known to be affected by escape fraction indeed have the highest information content and the
different simulations showed varying degrees of clustering according to fesc when projected onto
the principal components with highest variance. Inspection of feature selection showed that the
emission lines are among the main features used, with the oxygen lines OII and OIII5007 being
the most significant ones. The former is preferred for low noise levels, but higher noise seems to
require the models to rely on both. The hydrogen lines are also prominent, with Hβ generally
being stronger but Hγ being the only hydrogen line chosen in the optimal model when limiting
the number of features.
The blue wavelengths were also consistently used in the classification, indicating that the slope
of the spectra is important for inferring the escape fraction. However, as discussed previously,
this relationship could be problematic. There are concerns that other properties of the galaxies
affect the slope and that there is uncertainty in the relationship. Still, the models always relied
on some of these wavelengths, even in the case when only two features were to be used. The
feature weights of the model trained on data with dust effects also indicated this. The effects of
dust cause a higher spread in the blue wavelengths, and it seemed this caused the models to use
previously unused parts of the spectra to infer the slope. This effect was not as significant in the
Gnedin set, which also had less degradation in performance due to dust.
Another important result was the fact that the low fesc samples were especially prone to
overestimation. Since the problems with threshold 0.2 remained when using different models and scoring metrics, and since the opposite threshold of 0.8 did not give the same degree of uncertainty and underprediction for high fesc samples, it seems that this is not just a problem of class imbalance but an issue with the data itself.
The performance results and insights gained about the data provide a first step towards evaluating the applicability of support vector machines to the problem of estimating fesc . The actual
classification performance has been acceptable in most tests performed and the analysis of data
assumptions has provided some indication of under what circumstances the method may be useful.
The size of the data sets seems to be sufficient in most cases, since performance depends more on data assumptions than on the number of galaxies. This is shown by the fact that the Gnedin simulation had higher overall performance even though it had significantly fewer samples than Shimizu. Bootstrapping also showed a stable model tuning process that was not very dependent
on input selection. It is also useful knowledge that mismatches in noise level do not have very large
significance, but that using the wrong stellar tracks can reduce performance significantly. The fact
that training on Finlator did not reduce test performance of the other simulations indicates that
a train set with varied samples is preferable over more homogeneous sets.
Comparison to a previous model based on using only the slope and the Hβ emission line
showed that using the whole spectra gives significantly higher performance, further illustrating
the advantages of a quantitative and data-oriented approach. The support vector machine did not,
however, prove to have an advantage in accuracy or robustness over the lasso regression model.
Further analysis is required to determine the optimal machine learning technique for the task of
predicting the escape fraction of distant galaxies.
With these results as a basis, more specific measures can be suggested for future work. The
fact that indicators of the slope are consistently chosen by the classifiers would be interesting
to study in more detail. It is likely that the models considered have fewer problems due to the ambiguity of the slope than some of the simpler diagnostics previously considered, since the entire spectrum is taken into account, but more investigation is necessary to say this with certainty. However, it is also possible that the use of the slope is more misleading than helpful and that the classification would benefit from removing it from the spectra. Another
possibility would be to use the knowledge of the data to introduce extra weights to the features
to make certain ones more important than others and to investigate the effects of different types
of normalization and scaling.
The fact that the low escape fraction samples are particularly problematic to predict is also
worth investigating. It could be the case that these samples are more similar to each other or
that the correlation between fesc and the slope is weaker for such galaxies. A first approach could
then be to simply use a larger train set. The models could also be tuned to compensate by sample
weighting or, if it could be known with certainty that the higher fesc values will not occur in
observations, excluding them from the train set. If the issues are due to a complicated correlation, a more complex model could be considered. Although the linear support vector machine has good
qualities and a high interpretability it could be worth investigating if the use of kernels would give
better classification performance for the particularly difficult samples. A nonlinear model could
perhaps also give better performance on the more heterogeneous Finlator simulation. Since the low
escape fraction thresholds are of particular interest to the study of cosmic reionization the model
improvement process should include a qualitative aspect as well.
In conclusion, the work in this thesis has shown that data-driven predictions can be a suitable
tool for investigating cosmic reionization. The results have suggested what aspects of the data
need further study and paved the way for further improvements to the method. The importance of
certain simulation assumptions has also been indicated as well as the fact that further knowledge
of distant galaxies will be of importance for inferring their escape fractions.
References
Alexandroff, R., Heckman, T. M., Borthakur, S. and Overzier, R. (2015). Indirect Evidence
for Escaping Lyman Continuum Photons in Local Lyman Break Galaxy Analogs, American
Astronomical Society Meeting Abstracts, Vol. 225 of American Astronomical Society Meeting
Abstracts, p. 251.09.
Alvarez, M. A., Finlator, K. and Trenti, M. (2012). Constraints on the ionizing efficiency of the first galaxies, ArXiv e-prints. URL: http://arxiv.org/abs/1209.1387
Bouwens, R. J., Illingworth, G. D., Franx, M., Chary, R.-R., Meurer, G. R., Conselice, C. J., Ford,
H., Giavalisco, M. and van Dokkum, P. (2009). UV Continuum Slope and Dust Obscuration
from z ~ 6 to z ~ 2: The Star Formation Rate Density at High Redshift, Astrophysical Journal
705: 936–961.
Bradley, P. S. and Mangasarian, O. L. (1998). Feature selection via concave minimization and
support vector machines, Proceedings of the Fifteenth International Conference on Machine
Learning, ICML ’98, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 82–90.
Cortes, C. and Vapnik, V. (1995). Support-vector networks, Machine Learning 20(3): 273–297.
Eldridge, J. J. and Stanway, E. R. (2009). Spectral population synthesis including massive binaries,
Monthly Notices of the Royal Astronomical Society 400: 1019–1028.
Fernandez, E. R., Dole, H. and Iliev, I. T. (2013). A novel approach to constrain the escape fraction
and dust content at high redshift using the cosmic infrared background fractional anisotropy,
The Astrophysical Journal 764(1): 56.
Finkelstein, S. L., Papovich, C., Salmon, B., Finlator, K., Dickinson, M., Ferguson, H. C., Giavalisco, M., Koekemoer, A. M., Reddy, N. A., Bassett, R., Conselice, C. J., Dunlop, J. S.,
Faber, S. M., Grogin, N. A., Hathi, N. P., Kocevski, D. D., Lai, K., Lee, K.-S., McLure, R. J.,
Mobasher, B. and Newman, J. A. (2011). CANDELS: The Evolution of Galaxy Rest-Frame
Ultraviolet Colors from z = 8 to 4.
Finlator, K., Davé, R., Papovich, C. and Hernquist, L. (2006). The Physical and Photometric
Properties of High-Redshift Galaxies in Cosmological Hydrodynamic Simulations, Astrophysical
Journal 639: 672–694.
Finlator, K., Muñoz, J. A., Oppenheimer, B. D., Oh, S. P., Özel, F. and Davé, R. (2013). The host
haloes of O I absorbers in the reionization epoch, Monthly Notices of the Royal Astronomical
Society 436: 1818–1835.
Gnedin, N. Y. (2014). Cosmic Reionization on Computers. I. Design and Calibration of Simulations, Astrophysical Journal 793: 29.
Inoue, A. K., Shimizu, I., Iwata, I. and Tanaka, M. (2014). An updated analytic model for
attenuation by the intergalactic medium, Monthly Notices of the Royal Astronomical Society
442: 1805–1820.
Jensen, H., Zackrisson, E., Pelckmans, K., Ausmees, K., Lundholm, U. and Binggeli, C. (2016).
Measuring the escape fraction of ionizing photons in high-redshift galaxies using machine learning, Master Thesis, Department of .
Lundholm, U. (2016). Modeling the escape fraction of ionizing photons in high-redshift galaxies using statistical learning techniques, Master thesis, Department of Engineering Sciences,
Uppsala University .
Mohri, M., Rostamizadeh, A. and Talwalkar, A. (2012). Foundations of Machine Learning.
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space, Philosophical
Magazine 2: 559–572.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher,
M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python, Journal of
Machine Learning Research 12: 2825–2830.
Pei, Y. C. (1992). Interstellar dust from the Milky Way to the Magellanic Clouds, The Astrophysical Journal 395: 130–139.
Raiter, A., Schaerer, D. and Fosbury, R. A. E. (2010). Predicted UV properties of very metal-poor
starburst galaxies, Astronomy & Astrophysics 523: A64.
Robertson, B. E., Ellis, R. S., Furlanetto, S. R. and Dunlop, J. S. (2015a). Cosmic Reionization
and Early Star-forming Galaxies: A Joint Analysis of New Constraints from Planck and the
Hubble Space Telescope, Astrophysical Journal Letters 802: L19.
Robertson, B. E., Ellis, R. S., Furlanetto, S. R. and Dunlop, J. S. (2015b). Cosmic reionization
and early star-forming galaxies: A joint analysis of new constraints from planck and the hubble
space telescope, The Astrophysical Journal Letters 802(2): L19.
Schaerer, D. and de Barros, S. (2010). On the physical properties of z ≈ 6-8 galaxies, Astronomy & Astrophysics 515: A73.
Shimizu, I., Inoue, A. K., Okamoto, T. and Yoshida, N. (2014). Physical properties of UDF12 galaxies in cosmological simulations, Monthly Notices of the Royal Astronomical Society 440: 731–745.
Vapnik, V. and Chervonenkis, A. (1964). On a class of perceptrons, Automation and Remote
Control 25: 103–109.
Zackrisson, E. et al. (2016), ArXiv e-prints.
Zackrisson, E., Inoue, A. K. and Jensen, H. (2013). The Spectral Evolution of the First Galaxies.
II. Spectral Signatures of Lyman Continuum Leakage from Galaxies in the Reionization Epoch,
Astrophysical Journal 777: 39.