IT 16 011
Examensarbete 30 hp (degree project, 30 credits), February 2016

A classification approach to solving the cosmic reionization puzzle

Kristiina Ausmees

Department of Information Technology, Uppsala University

Abstract

Cosmic reionization is a phase in the history of the universe that is still not completely understood. Among the different theories regarding what could have been the cause of this process, young star-forming galaxies stand out as the most likely candidate. In order to determine whether the galaxies of the reionization era could have been the source of the required radiation, it is necessary to determine their escape fraction of Lyman continuum photons. Since this key property cannot be measured for the galaxies of interest, methods of probing it indirectly are required. The foundation of such indirect estimation is the fact that the escape fraction has an effect on the spectra of galaxies at wavelengths that are observable. This thesis proposes a quantitative and data-driven approach in which the spectra of simulated galaxies are used to train machine learning models to predict escape fractions. The goal is to evaluate support vector machines as a method for predicting the escape fractions of observed galaxies. The results indicate that the proposed method is promising. Escape fractions are predicted with over 85 percent accuracy in most cases, and the method shows a high level of robustness to the effects of varying simulation assumptions and disturbances in the data. Inspection of the models also gives an idea of the information content of the spectra and its correlation to the escape fraction.
A comprehensive analysis of the classification performance is also performed, which highlights some of the main difficulties and lays a foundation for future work.

Supervisor (Handledare): Kristiaan Pelckmans
Subject reader (Ämnesgranskare): Erik Zackrisson
Examiner (Examinator): Edith Ngai

Contents

1 Introduction
2 Data
3 Methods
  3.1 Principal Component Analysis
  3.2 Classification
    3.2.1 Linear support vector machine
    3.2.2 Soft-margin support vector machine
    3.2.3 L1-norm support vector machine
  3.3 Evaluation
    3.3.1 Model tuning and testing
    3.3.2 Robustness
    3.3.3 Implementation details
4 Results
  4.1 Understanding the data
  4.2 Classification of the main data set
  4.3 Robustness of the main model
  4.4 The effects of simulation assumptions
  4.5 Analysis of classification performance
  4.6 Comparison to other methods
5 Conclusions and discussion

1 Introduction

One of the mysteries of the early universe that still has not been fully explained is that of cosmic reionization.
After the Big Bang, as the universe expanded and cooled off, neutral hydrogen was able to form in the intergalactic medium (IGM). Cosmic reionization refers to the process in which this neutral hydrogen was put in an ionized state again. It is known that this reionization happened during the first billion years of cosmic history and that it was caused by the presence of high-energy photons, but the source of the ionizing radiation remains unknown. There are several candidates for what could have caused this process to occur, the most likely one being young star-forming galaxies (e.g. Alvarez et al., 2012; Robertson et al., 2015a). While it is in theory possible that the galaxies at the time in question could have produced enough radiation to reionize the entire universe (Alexandroff et al., 2015), whether or not it was possible in practice depends on the fraction of hydrogen-ionizing photons that was actually able to escape from galaxies into the IGM. This quantity is known as the Lyman continuum (LyC) escape fraction fesc and is a central factor in determining whether or not early galaxies could have been the cause of cosmic reionization. It is possible to measure the escape fraction of less distant galaxies, but as the relevant wavelengths are absorbed by the neutral gas in the IGM (Inoue et al., 2014), this key property cannot be directly measured for galaxies in the reionization epoch. Existing methods for estimating the escape fraction of these galaxies therefore rely on probing it indirectly. Methods have been proposed for obtaining constraints on the typical fesc values in this epoch (e.g. Fernandez et al., 2013; Finkelstein et al., 2011), but these can only be used to gain information about the galaxy population as a whole. Zackrisson et al. (2013) suggest that there is a correlation between a galaxy's fesc and its spectral energy distribution (SED) at wavelengths that do reach us, indicating the possibility of inferring fesc for individual galaxies.
Their method uses two spectral features to estimate fesc: the slope of the UV continuum and the equivalent width of the Hβ emission line. A qualitative approach is also taken by Alexandroff et al. (2015), who suggest observing less distant galaxies that are likely similar to those in the reionization era and investigating properties of the SED related to physical qualities that they argue affect the escape fraction. The observation that fesc affects the SED of galaxies is also the basis for the work in this thesis. However, instead of choosing a few diagnostics, the entire spectrum is analyzed using machine learning techniques to develop models that can learn to identify galaxies with different escape fractions. Data-driven prediction makes it possible to capture more complicated correlations in the data and possibly to avoid some of the problems of the simpler models, such as ambiguities caused by different ages, metallicities and the effects of dust on the spectra (Zackrisson et al., 2013). Principal component analysis (PCA) is a method of re-expressing data to gain knowledge of the underlying dynamics of a system that may be too complex to observe directly. It allows one to extract the features of the data that are important and to disregard noise or redundancy. In this thesis it is used to gain knowledge of the information content in the SEDs of galaxies and how fesc is related to it. Support vector machines (SVMs) are supervised learning models that can be used for pattern recognition and classification of data. Here they are used to classify galaxies as having either a higher or lower escape fraction than a given threshold. According to Robertson et al. (2015b), an escape fraction fesc ≈ 0.2 for gas-rich star-forming galaxies during the period of reionization could be enough to account for the required amount of Lyman continuum photons.
The main goal is therefore to evaluate the ability of the models to separate galaxies based on fesc thresholds around this size. Another component of the larger project of which this thesis is a part considers lasso regression as a method of predicting escape fractions (Jensen et al., 2016; Lundholm, 2016). In that approach, the goal is to predict a continuous value for each sample, rather than placing samples in classes. Both of these methods are considered since it is possible that classification has an advantage by excluding some of the more difficult aspects of the modeling. A comparison of the two models is therefore also performed, to see if there are differences in performance or robustness. The spectral features that contain information about fesc have not yet been possible to observe for galaxies in the reionization epoch, but with the upcoming launch of the James Webb Space Telescope (JWST) in 2018 this is about to change. The Near Infrared Spectrograph (NIRSpec) on board will be able to provide spectra of previously unseen quality from some of the first galaxies to form in the universe. Since such data is not available yet, simulated SEDs of galaxies with different fesc are used to train and evaluate the models. We use different simulations and modify the SEDs in order to imitate realistic JWST/NIRSpec observations, but it is unavoidable that the results will depend on simulation assumptions to some degree. An important aspect of the evaluation is therefore to investigate how sensitive the models are to this, in order to get an idea of their applicability to actual observations.

2 Data

The data used to train and test the models is a collection of simulated electromagnetic spectra of galaxies with corresponding fesc values.
These are obtained by adding the effects of varying escape fractions to galaxies generated from different sets of assumptions and calculating the resulting spectra using the LYman Continuum ANalysis (LYCAN) simulation project (Zackrisson et al., 2016). The galaxy samples are obtained using different numerical simulations and modified to imitate realistic observations from the JWST. Figure 1 shows the spectrum of a simulated galaxy with the effects of different fesc values applied to it.

Figure 1: Simulated spectrum of a galaxy with the effects of different fesc values applied to it. Labels point out the main emission lines.

The effects of modifying the escape fraction are visible in the spectra. The emission lines, the peaks in flux that are labeled in the figure, are higher for lower fesc values, and an effect on the slope at the beginning of the spectra is also visible. The astronomical explanation of this is a process in which the ionizing radiation produced within galaxies is affected by the surrounding gas. As the radiation travels through the gas, some of it is absorbed and re-emitted as photons with different wavelengths. In this way the ionizing radiation is transformed into effects on other parts of the spectra, leaving traces at wavelengths that are observable. According to the knowledge of such processes, only information in certain parts of the spectrum should be correlated with fesc. The emission lines are the strongest indicator, with a clear correlation. These are the result of ionizing photons affecting matter in the galaxy, causing it to emit radiation. For example, the H-lines in Figure 1 correspond to the spectral emissions of the hydrogen atom.
Another indicator is the overall slope of the spectrum, as the emission of ionizing radiation tends to cause a reddening of the spectrum that gives it a flatter slope (Raiter et al., 2010). This is particularly visible at the bluer end of the spectrum, i.e. the lower wavelengths. Using the slope to infer fesc is somewhat more problematic than using the emission lines, however, since other properties of the galaxies such as age, metallicity and dust can have a similar effect, causing ambiguity in the data (Zackrisson et al., 2013). It should be pointed out that Figure 1 is a simplification in several ways. Since all spectra are from the same galaxy with the effects of different escape fractions applied to it, they have very similar shapes. Also, each spectrum has been normalized to have mean 1, which explains why they are all centered around this value. This normalization also has an effect at the blue wavelengths, making the flux there lower for lower escape fractions. However, the figure still suffices to illustrate the idea behind using this part of the spectrum to infer the escape fraction. As there is no way to measure the fesc of the relevant galaxies, the proposed method will depend on simulated training data even after real observations are obtained. The simulated spectra are generated using different models and assumptions and therefore have varying properties. This is of great significance for the task of inferring the escape fraction, because the different assumptions can affect the relationship between fesc and the spectra. For example, in reality the strength of the emission lines does not depend only on the escape fraction but also on certain physical properties of the galaxy: there must be newly formed stars that emit ionizing photons, as well as gas to be ionized.
Since some simulations predict more varied star formation histories than others, they result in more heterogeneous sets of galaxies where the correlation between fesc and the emission lines is weaker. Although the simulations used have been evaluated with respect to how realistic they are (e.g. Shimizu et al., 2014), at the moment it is not known which assumptions are most representative. When observations are made there may be more indications of how well they correspond to reality, but simulation assumptions will likely remain important. By investigating how sensitive the method is to this, we can get an idea of its applicability in the general case, in what circumstances it is likely to perform well, and what difficulties may arise when predicting fesc for actual observations. The remainder of this section explains the different simulation techniques and assumptions, as well as which combinations are considered in this thesis.

Simulation and stellar tracks

Sets of galaxies are obtained from simulations made by Gnedin (2014), Shimizu et al. (2014), and Finlator et al. (2013). The simulations model the evolution of galaxies and processes within them, each making different assumptions in the calculations. The simulations provide a set of galaxy properties, but to generate spectra from this data a model of stellar evolution is required. In this thesis the stellar tracks considered are Geneva and BPASS2 (Eldridge and Stanway, 2009). The Shimizu set contains 406 initial galaxies, which results in a total of 4466 samples after the effects of 11 fesc values have been added to the original spectra. For Finlator and Gnedin the numbers are 106 and 100 initial galaxies, respectively. Figure 2 shows the spectra of 40 galaxies per fesc value for the Finlator and Gnedin simulations with the Geneva stellar tracks.
Figure 2: Spectra of 40 galaxies each for 5 different fesc values from the Finlator (left) and Gnedin (right) simulations with Geneva stellar tracks.

Resolution and Noise

The simulations provide high-quality spectra of galaxies, but since the goal is to apply the models to actual observations, the spectra have to be modified according to the JWST/NIRSpec specifications to make them realistic. The NIRSpec website (http://www.stsci.edu/jwst/instruments/nirspec) lists the different possible resolutions, as well as the minimum continuum flux observable at a given signal-to-noise ratio and exposure time as a function of wavelength. This is used to re-bin the simulated spectra and to develop noise models for simulating different degrees of detector noise. The possible resolutions of the NIRSpec spectrograph are denoted R=100, R=1000 and R=2700. A high resolution results in a higher information content in the spectra but also more noise per wavelength bin. Lower resolutions allow dimmer objects to be observed and are less expensive to operate. For these reasons, the simulated spectra in this project are re-binned to correspond to R=100 and R=1000 only. These correspond to 140 and 1478 wavelength bins per spectrum, respectively. The signal-to-noise ratio per flux and exposure time is calculated using the official NIRSpec specification, making it possible to add the corresponding amount of noise to each spectral bin of a galaxy. This makes assumptions about how the noise scales with exposure time and ignores some noise sources, but is considered an adequate approximation. Two different methods of approximating detector noise are used. The first fixes the exposure time for all samples and adds noise to each galaxy accordingly.
Because of the differing apparent magnitudes, or how bright the objects appear, this leads to some galaxies having more noise than others. Although this model is the closest to how noise in actual observations would behave, the second way of adding noise is useful for other aspects of this project. This method instead defines the signal-to-noise ratio of a spectrum as the signal-to-noise of the spectral bin containing the wavelength 1500 Å and fixes this value over all samples. The advantage of this is that it gives less of a difference in noise level between the samples. Although the noise will still vary over each spectrum, the spectra will not differ as much overall, since they are scaled to be the same at a given wavelength. This reduces the effect that some galaxies are "more important" than others for the classification, and may be more suitable for investigating which features of the spectra are related to fesc and for evaluating how the noise level affects the model's performance. A set of galaxies with a noise level scaled so that S/N = x at 1500 Å will be denoted as having noise level snx, and a set with a fixed exposure time of x hours will be denoted as xh. Figure 3 shows the effects of varying levels of noise on the same set of galaxies from the Shimizu simulation, with 40 spectra each for five fesc values. It is clearly visible how the noise distorts some of the information correlated with fesc, mainly by obscuring weaker emission lines.

Figure 3: The effects of different noise levels on the spectra of galaxies.
Each plot contains the same 40 galaxies from the Shimizu simulation with three fesc values applied to them.

Dust

When the radiation from a galaxy travels through dust in the interstellar medium, parts of it are absorbed and re-emitted at higher wavelengths, causing a reddening of the spectra. Dust effects are problematic because they can be similar to those of varying escape fractions, as discussed above, causing uncertainty in interpreting the data. There are differing opinions regarding the dust content of early galaxies, and it is therefore not known how suitable it is to use dust effects in training models to predict fesc. There have been results that indicate little or no dust effects in the relevant galaxies (Bouwens et al., 2009), whereas other studies show significantly larger dust contents (Schaerer and de Barros, 2010). The effects of dust can be added to the simulated spectra using different recipes. For the Shimizu simulations, the recipe used is the one defined in Shimizu et al. (2014). For the Finlator and Gnedin galaxies, the dust recipe from Finlator et al. (2006) is used. In all cases, the dust reddening laws by Pei (1992) are used in combination with the dust recipe. Figure 4 shows the effects of dust on galaxies from the Shimizu simulation. The reddening effect mainly results in a larger spread of the fluxes, especially at the lower wavelengths.

Figure 4: The spectra of 40 galaxies per fesc value from the Shimizu simulation without (left) and with the effects of dust (right).

By varying the properties described in this section it is possible to create data sets with different simulation methods, stellar tracks, resolutions, noise levels, and with or without the effects of dust.
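As an illustration of the snx noise levels described earlier in this section, the scaling can be sketched in a few lines. This is a minimal sketch under simplifying assumptions: Gaussian noise with a single standard deviation across all bins, chosen so that S/N equals the target in the bin containing 1500 Å (the actual pipeline derives per-bin noise from the NIRSpec sensitivity tables), and the helper name `add_noise_sn` is hypothetical.

```python
import numpy as np

def add_noise_sn(flux, wavelengths, target_sn, ref_wavelength=1500.0, seed=None):
    """Add Gaussian noise scaled so the signal-to-noise ratio equals
    target_sn in the spectral bin containing ref_wavelength (in Angstrom).

    flux and wavelengths are 1-D arrays of equal length (one spectrum).
    """
    rng = np.random.default_rng(seed)
    # Bin whose wavelength is closest to the reference point.
    ref_bin = np.argmin(np.abs(wavelengths - ref_wavelength))
    # Noise standard deviation giving the requested S/N at that bin.
    sigma = flux[ref_bin] / target_sn
    return flux + rng.normal(0.0, sigma, size=flux.shape)
```

A data set at noise level sn5 would then correspond to applying this with `target_sn=5` to every spectrum.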
In order to avoid displaying the results for all combinations, a subset of configurations is chosen for illustration purposes. A main data set consisting of galaxies from the Shimizu simulation and the Geneva stellar track with no dust effects is used to evaluate the proposed classification method thoroughly. Data sets with other configurations are used for evaluating the models' sensitivity to different parameters. For all experiments, data without dust effects, with noise level sn5, resolution R100 and Geneva stellar tracks is used unless otherwise stated.

3 Methods

3.1 Principal Component Analysis

Principal component analysis (PCA), introduced by Pearson (1901), is a method of re-expressing data in a way that reveals which parts of it are significant and which are redundant. Each data point can be seen as a vector in q-dimensional space and can therefore be described as a linear combination of a set of basis vectors. Like any vectors, they can be projected onto vector spaces spanned by different bases. The PCA algorithm finds a basis, formed as a linear combination of the original one, that lets the data be expressed in a way that reveals hidden dynamics. It could, for example, be the case that several features of the data represent the same signal. In such cases it is preferable to use a linear combination of these features instead. This is a more concise representation of the data and requires fewer dimensions. A way to measure the redundancy between two (zero-mean) features a and b is their covariance

cov(a, b) = ⟨a_i b_i⟩_i    (1)

where ⟨z_i⟩_i denotes the average value over i. The covariance measures how spread out the data points are with respect to the dimensions a and b, and a high covariance suggests a strong correlation between the features.

Let X ⊂ R^q be a set of m samples, each described by q features. For the features f_1j, ..., f_qj of each sample x_j, j ∈ {1, ..., m}, let

X = ( f_11 ... f_1m )
    (  ⋮   ⋱   ⋮  )    (2)
    ( f_q1 ... f_qm )

X ∈ R^(q×m) is thus a matrix where each row contains all measurements of one feature and each column corresponds to a sample. The covariance matrix S_X is then defined as:

S_X = (1 / (m − 1)) X X^T    (3)

The covariance matrix contains information about the correlations between all pairs of features in the data. The diagonal terms t_ii correspond to the variance of feature i, and the off-diagonal terms t_ij contain the covariance of features f_i and f_j. The PCA algorithm finds a new basis B such that the covariance matrix of the projection BX, that is S_BX, is diagonal. Thus the covariance between features is removed by the projection. It can be shown that this is achieved by choosing B such that each row b_i is an eigenvector of X X^T. The principal components of X are then the rows b_i of B, and the diagonal terms t_ii (the eigenvalues) of S_BX contain the variance of X along the principal components b_i. The principal components with the largest variance are those that represent the data the most, and if there is a subset with much higher variances than the others, these are the dimensions that contain the essence of the data. This is the idea behind using PCA for dimensionality reduction.

The motivation for using PCA is mainly to gain insight into the data that is being dealt with. The knowledge of the astronomical processes that cause the spectra to have certain properties is one part of this, whereas PCA gives a more data-oriented and exploratory analysis of the information content. By investigating which parts of the spectra have large variance, it is possible to gain an idea of which features are important for representing the properties that set individual samples apart. As there has not been much previous work on applying data-driven predictions using the entire spectra of galaxies for inferring fesc, this is an important indication of what machine learning techniques may be suitable for the task. The second reason for using PCA is that it allows the data to be visualized.
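The procedure above (center the features, form the covariance matrix, and take its eigenvectors) can be sketched with NumPy. This is a minimal illustration, not the thesis implementation; the function name `pca` is a choice of this sketch, and `np.cov` treats rows as variables and normalizes over the number of samples.

```python
import numpy as np

def pca(X):
    """Principal components of X, whose rows are features and whose
    columns are samples, following the layout of equation (2).

    Returns (components, variances), with the rows of `components`
    being the principal directions sorted by decreasing variance.
    """
    # Center each feature so the covariances are taken about the mean.
    Xc = X - X.mean(axis=1, keepdims=True)
    # Covariance matrix between all pairs of features.
    S = np.cov(Xc)
    # S is symmetric, so eigh applies; its eigenvectors diagonalize
    # the covariance matrix of the projected data.
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order].T, eigvals[order]
```

Projecting the centered data onto the first three rows of `components` then gives the three-dimensional visualization mentioned above.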
By projecting the data onto the principal components that have the highest variance, it is possible to reduce the number of dimensions used to represent it. This allows, for example, a visualization in three dimensions, which can give an indication of the general layout of the samples with respect to fesc.

3.2 Classification

Classification in machine learning refers to the problem of determining to which of a given set of categories a new observation belongs. Let

• X ⊂ R^q be a set of samples, where each sample is represented by a vector of q features
• Y be a set of class labels (for binary classification, Y = {−1, 1})
• S ⊂ X × Y be a train set consisting of samples and their corresponding labels
• S′ be a set of samples with unknown labels

The goal of a classification model is to use S to train a classifier y : X → Y that can correctly label s′ ∈ S′.

3.2.1 Linear support vector machine

The linear support vector machine (SVM), introduced by Vapnik and Chervonenkis (1964), defines a classifier by regarding each sample in S as a point in q-dimensional space and trying to find a hyperplane that separates the points according to their labels. Such a hyperplane can be characterized by a vector w ∈ R^q and an intercept b, and the classifier can therefore be defined as:

y(x) = sign(w^T x + b)    (4)

for some hyperplane (w, b), where

y_i (w^T x_i + b) ≥ 1  ∀(x_i, y_i) ∈ S    (5)

The margin of a hyperplane is defined as the smallest distance from any point x_0 to the hyperplane and can be shown to be equal to 1 / ||w||_2, where ||x||_2 denotes the Euclidean, or L2, norm defined by

||x||_2 = sqrt( Σ_{k=1}^n x_k · x_k )

In order to get a solution that is as safe as possible, the goal is to maximize the margin.
This together with (5) gives the following objective:

min_{w,b} ||w||_2    (6)
subject to y_i (w^T x_i + b) ≥ 1  ∀(x_i, y_i) ∈ S

This is a convex optimization problem, and in order to put it in a form suitable for quadratic programming, the objective function is often rewritten in the equivalent form:

min_{w,b} (1/2) ||w||_2^2    (7)
subject to y_i (w^T x_i + b) ≥ 1  ∀(x_i, y_i) ∈ S

The samples that lie exactly on the margin are called the support vectors. The support vectors have a large effect on the SVM since they decide where the boundary goes; a change in one of these samples would result in a change in the decision function. They are also an indication of the complexity of the learning task: if there are few support vectors, then only a small fraction of the samples is needed to achieve high accuracy.

3.2.2 Soft-margin support vector machine

The soft-margin SVM proposed by Cortes and Vapnik (1995) can find a boundary even when the data is not completely linearly separable. This is achieved by introducing slack variables ξ that allow data points to be mislabeled:

min_{w,b,ξ} (1/2) ||w||_2^2 + C Σ_{i=1}^m ξ_i    (8)
subject to y_i (w^T x_i + b) ≥ 1 − ξ_i  ∀(x_i, y_i) ∈ S
           ξ_i ≥ 0  ∀i
           C > 0

The slack variables ξ_i measure how misclassified each sample x_i is, and C is a penalty parameter for the misclassification error. C can be tuned to find the optimal model by controlling the trade-off between achieving a maximal margin and avoiding misclassification. A high penalty reduces the number of data points that end up on the wrong side of the hyperplane, but it can also lead to a smaller margin and therefore a less secure solution. This could in some cases lead to overfitting and poor performance on data that was not part of the train set. A low value of C could, on the other hand, lead to an underfitted classifier that does not separate the data correctly. There are several reasons why using support vector machines is suitable for this problem.
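As an illustration of the soft-margin formulation, scikit-learn's `LinearSVC` solves a closely related problem (by default it uses a squared hinge loss, an implementation detail that differs from equation (8)). The toy data below is an assumption standing in for spectra, not thesis data.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Toy stand-in for spectra: 200 samples with 3 features, where only
# the first feature carries the class signal (like an emission line).
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] > 0, 1, -1)

# C controls the margin/misclassification trade-off: a large C
# penalizes slack heavily, a small C allows a wider, softer margin.
clf = LinearSVC(C=1.0)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]  # hyperplane normal and intercept
```

Inspecting `w` afterwards shows, per feature, how strongly each input dimension pulls the decision towards one class, which is the interpretability argument made below.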
Firstly, they are a well-established method, developed since the 1960s, with successful applications in many fields. The optimization problem is convex and therefore allows for efficient implementation. Another desirable property of the linear SVM is that it is possible to interpret how the data affects the decision function. The vector that defines the separating hyperplane contains information about how each feature of the samples affects the classification. This means that by inspecting these feature weights, one can learn how much, and in what way, each part of the spectrum is related to the escape fraction. Many other classification algorithms work like black boxes that provide a prediction but no information about how it was reached. This is also the reason why only linear support vector machines are considered. While it is possible to add kernels that project the data into other dimensions in which the samples are easier to classify, this also removes the possibility of an astronomical interpretation of the resulting model. A model based on the soft-margin linear support vector machine with the L2-norm will be denoted an L2-SVM in this thesis.

3.2.3 L1-norm support vector machine

The L1-SVM proposed by Bradley and Mangasarian (1998) instead uses the L1-norm in the optimization problem:

min_{w,b,ξ} (1/2) ||w||_1 + C Σ_{i=1}^m ξ_i    (9)
subject to y_i (w^T x_i + b) ≥ 1 − ξ_i  ∀(x_i, y_i) ∈ S
           ξ_i ≥ 0  ∀i
           C > 0

The L1-norm is defined by ||x||_1 = Σ_{k=1}^n |x_k| and has the property that it results in a sparse model where some feature weights are exactly zero. In this case, C affects how many nonzero features the model has, and therefore which subset of features is used in the classification. The data suggests that the L1-SVM may be suitable for this application. As discussed in section 2, both the simulated galaxy spectra and what is known of the astronomical processes involved indicate that only a small part of a galaxy's SED is correlated with its fesc.
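The sparsity effect of the L1 penalty can be demonstrated on synthetic data; this is a minimal sketch (the data is a hypothetical stand-in, and scikit-learn's L1-penalized `LinearSVC` is an approximation of formulation (9), using a squared hinge loss):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
# 50 features but only the first two are informative, mimicking spectra
# in which only a few wavelength bins carry information about f_esc.
X = rng.normal(size=(300, 50))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# penalty="l1" yields a sparse weight vector; scikit-learn requires
# dual=False for the L1 penalty. A smaller C gives a sparser model.
clf = LinearSVC(penalty="l1", dual=False, C=0.1)
clf.fit(X, y)

n_nonzero = int(np.sum(clf.coef_ != 0))  # far fewer than 50 weights survive
```

Varying `C` here directly changes `n_nonzero`, which is the mechanism by which `C` selects the feature subset used in the classification.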
This suggests that the L1-SVM may be more effective by ignoring the features that are not relevant. It can also help avoid overfitting, since a less complex model carries less risk of fitting redundant information. This can increase accuracy by not letting noise and other disturbances in the data affect the classification. Finally, it is easier to interpret the astronomical significance of a model with fewer nonzero feature weights. A model based on the soft-margin linear support vector machine with the L1-norm will be denoted an L1-SVM in this thesis.

3.3 Evaluation

3.3.1 Model tuning and testing

The classification models are fitted to different data sets and evaluated on their accuracy and robustness using a test set. It is important to distinguish between the model tuning and testing processes. Tuning refers to the parameter selection and fitting, and is always performed on a subset of the data. The entire data set is split into a train, validation and test set. The model is fitted to the train set for different values of the tuning parameter C, and the performance of each resulting model on the validation set is evaluated. The model with the best validation performance is chosen as the optimal model. This model is subsequently evaluated using the test set. It is important that the test set is comprised of samples that are not part of the train and validation sets, since otherwise the model would be biased and the test results would not give an indication of the model's ability to generalize to new data. In order to avoid possible bias introduced by the choice of validation set, cross-validation (e.g. Mohri et al., 2012) is used to find optimal values for the penalty parameter C. The samples that are not in the test set are initially divided into k subsets, or folds, and the fitting and validation is done k times, where each subset is used as the validation set once and the union of the other subsets is used as the train set.
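The tuning procedure just described, k-fold cross-validation over a grid of C values with a held-out test set, can be sketched with scikit-learn. The toy data below is an assumption; `GridSearchCV` performs the fold-averaged scoring and refits the best model on all non-test samples.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
y = np.where(X[:, 0] > 0, 1, -1)

# Hold out a test set that plays no part in the tuning.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation over a grid of C values: each fold serves as
# the validation set once, and the score for a C is the fold average.
search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_rest, y_rest)

best_C = search.best_params_["C"]        # C with best average validation score
test_acc = search.score(X_test, y_test)  # unbiased estimate on unseen data
```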
The performance of a value of C is defined as the average over the k folds. This process is known as k-fold cross-validation.

Since both tuning and testing need to assess a model's ability to predict, the choice of performance metric is an important aspect. The difference between the true and predicted labels has to be represented as a single value that describes how good the model is and allows for comparison. This value is referred to as a model's score, and the method used to obtain it as the scoring metric. For classification there are several ways to measure performance, and which one is most suitable for a given case is not always straightforward. Let

• P (positive) denote the number of samples with true label 1
• N (negative) denote the number of samples with true label −1
• TP (true positive) denote the number of samples that were correctly given label 1
• TN (true negative) denote the number of samples that were correctly given label −1
• FP (false positive) denote the number of samples that were incorrectly given label 1
• FN (false negative) denote the number of samples that were incorrectly given label −1

The simplest scoring metric is the accuracy (acc), which is defined as the fraction of cases that were correctly classified:

    acc = (TP + TN) / (P + N)

An issue with the accuracy is that it does not take false negatives or false positives into account. This is especially a problem if either P or N is significantly smaller than the other, as misclassification of all the samples in the small set would not affect the accuracy much. To take the misclassification per class into account, a confusion matrix can be used:

                    Actually −1    Actually 1
    Predicted −1        TN             FN
    Predicted 1         FP             TP

Table 1: Layout of a confusion matrix

The confusion matrix allows one to compute the average precision per class (apc) to get a score that reflects the accuracy of both classes.
Since the class sizes can differ, the weighted average of the two is used to avoid the accuracy of one class having a larger impact:

    apc = (P × TP/(TP + FP) + N × TN/(TN + FN)) / (P + N)

There are, however, issues with the apc score as well. For example, if one class has a small number of samples, its results will have a large spread. This makes them less reliable, and the overall average will be over two quantities with very different degrees of variance. Two other relevant properties of the classification are precision and recall. Intuitively, recall is the ability to find all positive samples and precision is how good the model is at not mislabeling negative ones:

• precision = TP / (TP + FP)
• recall = TP / P

A model ideally has high values of both, and by taking their harmonic mean they are combined into a single scoring metric, the f1-score:

    f1 = 2 × (precision × recall) / (precision + recall)

Another way to measure mispredictions is the Receiver Operating Characteristic (ROC) curve. This curve plots the rate of true positives against the rate of false positives and thus shows how many correct positive classifications can be gained as more false positives are allowed. To use this as a scoring metric, the area under the curve is measured, as a larger area means a higher rate of true positives. This results in the ROC area-under-the-curve score rauc, which is the final scoring metric that will be considered in this thesis. As discussed previously, the choice of scoring metric is not straightforward and depends on the nature of the data itself. Generally the best choice becomes evident only after some results and knowledge of the data have been gathered. For this reason the initial model tuning and evaluation will use the simplest metric, acc. Although this does not take class imbalance into account, it is still considered a good starting point.
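For concreteness, all of the count-based metrics above can be computed directly from the confusion-matrix entries. The counts in the sketch below are hypothetical, with an imbalanced negative class, to show how acc and the per-class scores can disagree:

```python
def classification_scores(tp, tn, fp, fn):
    """Compute the scoring metrics defined above from confusion-matrix counts."""
    P, N = tp + fn, tn + fp                  # true class sizes
    acc = (tp + tn) / (P + N)                # overall accuracy
    # average precision per class, weighted by class size
    apc = (P * tp / (tp + fp) + N * tn / (tn + fn)) / (P + N)
    precision = tp / (tp + fp)
    recall = tp / P
    f1 = 2 * precision * recall / (precision + recall)
    return {"acc": acc, "apc": apc, "precision": precision,
            "recall": recall, "f1": f1}

# hypothetical imbalanced example: P = 10 positives, N = 90 negatives
s = classification_scores(tp=8, tn=85, fp=5, fn=2)
print({k: round(v, 3) for k, v in s.items()})
```

Here the accuracy is high because the large negative class is mostly correct, while the precision reveals that a sizable fraction of positive predictions are wrong; the ROC AUC is omitted since it requires continuous decision scores rather than counts.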
Class imbalance will only occur for the extreme thresholds, and although 0.2 is the most relevant threshold for investigating cosmic reionization, the performance for all thresholds is of interest in evaluating the model itself as well as the properties of the data that are relevant for classification. Furthermore, it is not certain that class imbalance will be an issue. The data, as well as what is known of the astronomical processes involved, indicate that the galaxies have properties that are correlated to fesc and should therefore be possible to classify accordingly. Also, neither class is more important than the other for our purposes. The goal is to separate the samples according to a threshold, not to pick out samples that belong to a certain class, so a measure of total performance is appropriate. Sections 4.2 through 4.4 will therefore use accuracy when calculating a model's score. In section 4.5 the classification performance is investigated in detail and the effects of using different scoring metrics are evaluated.

3.3.2 Robustness

An important aspect of the evaluation of a model is its robustness. This is a measure of how sensitive the model is to varying properties of the data and is an important indication of its applicability to the problem. Different types of robustness are considered, and a variety of methods are used in the estimation. Robustness of accuracy is a measure of how sensitive the classification performance of the models is. The sensitivity of the score of a model when applied to a test set is of interest: does it differ for varying assumptions about the data or for input selection? Since the goal is to use simulated data to classify actual observations, it is also important to investigate how sensitive the model is to "wrong" assumptions for the train set. To see the effects of varying assumptions about the data, the models are trained and tested on different data sets.
The test scores for different noise levels, simulations, stellar tracks and dust effects are compared to see if there are certain assumptions that cause more problems in classification than others. The sensitivity of a model to assumptions about the data that do not correspond to the properties of the galaxies to be classified is investigated by using different types of data sets for training and testing. For varying noise levels this is done by fixing the noise in the test set and evaluating the model performance while varying the train set noise level. The sensitivity to different simulations is evaluated by comparing the test performance of all combinations of simulation method for the train and test sets, and likewise for stellar tracks. It is also possible that the data contains errors in the labeling of the training samples, resulting perhaps from measurement errors or incorrect modeling of certain properties in a simulation. In order to measure how robust a model is to this type of disturbance, it is fitted to a series of manipulated train sets in which the label of each sample has been flipped with a certain probability, and used to classify an unmodified test set. How the performance degrades with a higher probability of label-flipping indicates the sensitivity of the model to this type of error. Robustness of selection refers to how the selection of features is affected by varying properties of the data. This is related to robustness of accuracy, since it is a measure of the model's ability to generalize, but here the focus is on whether there is large variance in which features are chosen or in the weights they are given. This is related to the astronomical interpretation of which spectral properties are correlated to fesc. It also gives an indication of whether the actual signal is being captured by a model or whether it is fitted to noise or other irrelevant parts of the spectra.
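The label-flipping manipulation described above can be sketched as follows; the flip probabilities and sample size are arbitrary illustrative choices, and any classifier can then be trained on the manipulated labels:

```python
import numpy as np

def flip_labels(y, p, rng):
    # flip each +/-1 label independently with probability p
    flip = rng.random(y.shape[0]) < p
    return np.where(flip, -y, y)

rng = np.random.default_rng(1)
y = rng.choice([-1, 1], size=1000)

# empirical fraction of labels flipped at each probability level
for p in (0.0, 0.2, 0.5, 1.0):
    yp = flip_labels(y, p, rng)
    print(p, round(float(np.mean(yp != y)), 2))
```

At p = 0.5 the manipulated labels carry no information about the true ones, and beyond 0.5 the labels are increasingly anti-correlated with the originals, which is why the test-score curve is expected to be symmetric around that point.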
Bootstrapping is an algorithm used to give an indication of the stability of the model with respect to input choice. A collection of train sets T′_i of size n′ < n is drawn with replacement from the original train set T, where |T| = n. The tuning and fitting process is then performed for each T′_i and the variance of the resulting models is investigated. The distribution of the weights of each feature over all T′_i gives an indication of the stability of the feature weights. The robustness of feature selection is shown by the frequency with which each feature is chosen to be significant over all runs. The idea of these tests is to see how much the choice of samples affects the trained model; the model is considered more stable if changes in the train set have a small effect on the resulting classifier. Another way to investigate the robustness of selection is to see which features are included in the optimal model when only a certain number of features is used in the classification. This is investigated by exhaustive search over all feature combinations of size 1, 2, ..., 7, plotting which features are part of the model with the best test performance for each subset size. This gives an indication of which features are the most important for classification as well as how stable this selection is. The metrics above are related to two aspects of robustness: that of a model and that of the method itself. Robustness of a model measures how stable a given model is in its performance under varying input, and gives an indication of how well we can expect a certain model fitted to a data set to behave. Robustness of the method instead suggests how suitable support vector machines are for solving this particular problem; this is more related to the nature of the data and the relationship between fesc and the spectrum of a galaxy. The goal in this thesis is to get an idea of both these aspects of robustness.
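The bootstrap procedure can be sketched as below. A hypothetical correlation-threshold selector stands in for the fitted L1-SVM, whose nonzero weights define the selection in the thesis; the data, threshold and iteration count are illustrative choices:

```python
import numpy as np

def bootstrap_selection_frequency(X, y, select, n_iter=200, frac=0.8, seed=0):
    # draw bootstrap train sets (with replacement) and record how often
    # each feature is selected by the model fitted to each set
    rng = np.random.default_rng(seed)
    n, d = X.shape
    counts = np.zeros(d)
    size = int(frac * n)
    for _ in range(n_iter):
        idx = rng.integers(0, n, size=size)   # sample with replacement
        counts += select(X[idx], y[idx])
    return counts / n_iter

def corr_select(X, y, thresh=0.5):
    # stand-in selector: a feature is "selected" if its absolute
    # correlation with the label exceeds a threshold
    c = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return (c > thresh).astype(float)

rng = np.random.default_rng(3)
n, d = 300, 8
y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, d))
X[:, 0] = y + 0.3 * rng.normal(size=n)  # feature 0 carries the signal

freq = bootstrap_selection_frequency(X, y, corr_select)
print(freq.round(2))
```

A feature that is selected in nearly every bootstrap iteration (here feature 0) is considered a stable part of the model, while features selected only occasionally are likely fitted to noise in particular draws of the train set.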
3.3.3 Implementation details

The software package scikit-learn (Pedregosa et al.; 2011) for Python 3.4 is used to implement the classification models and the PCA algorithm. Five-fold cross-validation is used in parameter selection. The classification is always performed by choosing an escape fraction threshold t ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} and defining the negative class as the set of samples with fesc < t and the positive class as the set of samples with fesc ≥ t. Bootstrapping is performed by randomly selecting 80 percent of the train samples with replacement and iterating 1000 times. When the train and test sets originate from different simulations, all samples are included in both sets. Otherwise 20 percent of the samples are set aside to be used for testing, with an equal distribution of samples over the different fesc values. All spectra are normalized to have mean 1 so that the luminosity of a galaxy has no effect on the classification.

4 Results

In this section the results of applying the methods described above are presented. In section 4.1 the information content in general, as well as with respect to fesc, is investigated using PCA. In section 4.2 support vector machines are used to classify galaxies from the main data set; the model tuning process as well as the classification performance is evaluated thoroughly. This is followed by an investigation of the robustness of the same models in section 4.3. Section 4.4 is also concerned with robustness, but there the effects of different data assumptions on the method are investigated. In section 4.5 an extensive evaluation of the problems encountered with classification is performed and the effects of different scoring metrics are investigated. The results of both the L2-SVM and the L1-SVM are shown throughout to provide a comparison of the two, and several noise levels are considered to assess the sensitivity of the models to noise.
Finally, the classification performance of the proposed method is compared to that of two other models: the one defined by Zackrisson et al. (2013), which uses two properties derived from the SED, and the lasso regression approach from Lundholm (2016), which also uses the entire spectrum in predicting fesc.

4.1 Understanding the data

PCA is applied to investigate the information content in the data: which parts of the spectra have the highest variance, and how the data points are laid out with respect to fesc. Figure 5 shows the eigenvalues of the 30 principal components with highest variance in sorted order for four different data sets. The top panel shows results for the Shimizu set with and without dust effects, and the bottom panel the corresponding plots for the Gnedin set. In all cases there is a significant drop-off after roughly 5 eigenvalues, indicating that a large part of the information in the spectra can be expressed by a few dimensions. All experiments also showed that the slope was more or less unchanged after 20 principal components.

Figure 5: The resulting eigenvalues after performing PCA on the Shimizu and Gnedin simulations with and without the effects of dust.

The results indicate that representing a spectrum as a linear combination of only a few components should retain much of its information. By projecting a spectrum onto a subset of the principal components and then reconstructing it, the loss of information in such a partial representation can be illustrated. Figure 6 shows how this differs when using 5, 10, 20 and 40 principal components to represent a sample from the Shimizu simulation.
The original spectrum is illustrated by a solid line and the reconstructed one by a dotted line. The red line shows the difference between them, and the legend in each plot displays the mean squared error of this difference.

Figure 6: The effects of projecting a spectrum from the Shimizu simulation onto different numbers of principal components. Solid lines indicate the original spectra and dotted lines the reconstructed ones after the projection. Red lines indicate the difference between the two, and the mean squared error is displayed in the legend.

The shapes of the original and reconstructed spectra are very similar in all plots, even when using only 5 principal components. This indicates that the information correlated to fesc, emission lines in particular, can be captured using only part of the data. This can be seen as further motivation for using a linear model to predict fesc, and a sparse one like the L1-SVM in particular. PCA is also used to show where in the spectra the information is concentrated. Figure 7 shows the four eigenvectors with highest variance for the Shimizu and Gnedin simulations with and without dust when no noise is added. The wavelength bins containing emission lines contribute a large part of the variance in all cases, especially the ones centered around 5000 Å. The lowest wavelengths also show large variance, but other than that most parts of the spectra do not seem to contain much information.
This is consistent with the effects of fesc on galaxy spectra discussed in section 2.

Figure 7: The four principal components with highest variance for the Shimizu and Gnedin simulations, with and without the effects of dust and with no noise added.

So far the PCA results have only concerned the variance in the data in general, but it is also possible to investigate how the structure of the data relates to fesc. By projecting the data onto three dimensions, the layout of the samples can be plotted. The idea is that if the points are organized in clusters, then projection onto the dimensions with the highest variance is likely to reveal this structure. Here the goal is to see how the samples are spread out with respect to fesc, so they are colored accordingly. Figure 8 shows the positions of the samples from the main data set in three-dimensional space after projection onto two different combinations of eigenvectors. The left plot shows this for the first, second and third eigenvectors with the highest variance, and the right plot for the second, fourth and fifth eigenvectors.

Figure 8: Samples from the Shimizu simulation after projection onto two different combinations of three principal components. The title of each plot indicates the choice of principal components, with 1 being the one with highest variance.
Samples with 4 escape fractions are plotted and they are colored accordingly. Encouragingly, there is a clear shift from red to purple, indicating that the data is separable according to fesc to some degree, even though there is significant overlap between the colors. In general the points appear tightly clustered, with few outliers. Figure 9 shows two plots with the three-dimensional layout of galaxies from the Finlator simulation after projection onto the first, second and third, as well as the first, fifth and sixth, principal components with the highest variance. These plots show more spread-out samples and less clustering according to fesc than the Shimizu simulation. This is because the Finlator simulations model a larger range of star formation histories and therefore result in spectra that differ more from each other.

Figure 9: Samples from the Finlator simulation after projection onto two different combinations of three principal components. The title of each plot indicates the choice of principal components, with 1 being the one with highest variance. Samples with 4 escape fractions are plotted and they are colored accordingly.

The combinations of principal components that resulted in the clearest separation were chosen for this illustration, as most combinations showed very little clustering according to color for the Finlator model. For the Shimizu simulation a large number of combinations showed some degree of such separation. This indicates that the Finlator samples could be harder to classify according to fesc.

4.2 Classification of the main data set

In this section the results of parameter selection and model evaluation for the main data set defined in section 2 are presented.
The optimal value for the penalty parameter C is determined using 5-fold cross-validation as defined in section 3.3.1. Results are shown for both the L1-SVM and the L2-SVM and for different noise levels. Figures 10 and 11 show the cross-validation scores for different values of the regularization parameter C for the L2-SVM and the L1-SVM, respectively. The results for noise levels sn3, sn5, sn10 and no noise are shown. Each plot contains the validation scores for threshold 0.2 separately as well as the average score over all thresholds. The maximum validation score is highlighted by a vertical line, and in the case of the L1-SVM it is labeled with the number of nonzero features in the resulting model.

Figure 10: Validation score as a function of the regularization parameter C for the L2-SVM for different noise levels of the main data set. The solid line shows the score for threshold 0.2 and the dotted line the average over all thresholds. Vertical lines indicate the maximum validation score.

Figure 11: Validation score as a function of the regularization parameter C for the L1-SVM for different noise levels of the main data set. The solid line shows the score for threshold 0.2 and the dotted line the average over all thresholds.
Vertical lines indicate the maximum validation score and are labeled with the number of nonzero features in the resulting model.

For both models there is a sharp increase in validation score at low values of C, indicating that a few features are crucial for representing a large part of the information about fesc. This is consistent with the results of the PCA in section 4.1, in which a large part of the variation in the data was concentrated in a few principal components. The effects of noise are similar for the two models. The value of C that results in the highest validation score tends to be lower for the noisier data sets. This is because the model is more prone to overfitting with higher noise, as the noise causes some signals to be drowned out. In the case with no noise the data is clearly easily separable and a higher misclassification penalty is beneficial. Although the noise-free case is not a realistic representation of actual observations, it gives an indication of how much noise affects the classification. As expected, lower noise results in higher scores overall. Comparison of the two models shows that the L1-SVM has higher validation scores for the optimal value of C on the noisy data sets, although the differences are only on the order of 0.01. In the noise-free case, however, the L2-SVM has a validation score of 0.99 compared to 0.95 for the L1-SVM. This is probably because the benefits of a simpler model are not as significant when no noise is present. In all experiments the validation score was lower for threshold 0.2 than for the average over all thresholds, indicating that the samples are particularly difficult to separate at this threshold. The validation score is used to select the optimal parameters. However, the test score is a more accurate measure of a model's ability to classify, as the test samples are independent from the train set.
The test scores per fesc threshold for the two models are shown in figure 12.

Figure 12: Test score per threshold for the main data set with different noise levels. The solid line shows the scores for the L2-SVM and the dotted one for the L1-SVM.

The results show quite good overall performance. For the more realistic cases with noise, the scores correspond to between 84 and 92 percent correctly labeled samples. It is interesting that the choice of threshold has a large effect on the test score, with the lower thresholds that are of most interest having the worst performance. The models are expected to perform worse for extremely high and low thresholds, because the distribution of labels in the training data is skewed there, but this is not the pattern that the results show. The higher thresholds generally have the highest test scores, and threshold 0.1 also gives better performance than 0.2 in all cases. The L1-SVM has slightly better scores for higher noise levels, probably because it is better at selecting relevant features and avoiding fitting to the noise. The test scores are not significantly lower than the corresponding validation scores for either model, which also indicates that there is not a great deal of bias due to overfitting. It is also interesting to investigate which features are chosen by the models. Figures 13 and 14 show the weights attached to each feature in the decision function of the L2-SVM and the L1-SVM, respectively.
The top leftmost plot shows the features chosen when C has the lowest value of 0.001, followed by increasing values of C up to the optimal value for each model. In both cases the threshold used is 0.2. It is clear that in all cases there is a small subset of features that have the largest weights and therefore the most impact on the classification. These correspond to those seen in the PCA plots: the emission lines and the lower wavelengths of the spectra. The latter are likely chosen as an indicator of the slope of the spectra, as discussed in section 2. It is interesting that for low values of C it is mainly the emission line centered at 5000 Å that is chosen by both models, but with increasing C the line just below 4000 Å becomes more significant. Since both are oxygen lines, this shows that the optimal models gave important features high weights even though they are correlated. The fact that the third oxygen line, OIII4959, is not as significant indicates that it may be more strongly correlated with the other two than they are with each other. The oxygen emissions also seem to be a stronger effect of fesc than the hydrogen lines in this case.

Figure 13: The feature weights of the L2-SVM for different values of the regularization parameter C when trained on the main data set with threshold 0.2.
The bottom right plot corresponds to the optimal value of C.

Figure 14: The feature weights of the L1-SVM for different values of the regularization parameter C when trained on the main data set with threshold 0.2. The bottom right plot corresponds to the optimal value of C.

The effects of different noise levels on the selection of features are also investigated. Figures 15 and 16 show the feature weights of the optimal L1-SVM and L2-SVM for different noise levels with threshold 0.2. Both models have higher weights in the blue part of the spectra for lower noise levels, indicating that noise removes some information about the slope. The strength of the emission lines seems to be less affected, although the weaker hydrogen lines are more pronounced the less noise there is. As the noise decreases, the L1-SVM increasingly picks out only one of the oxygen lines instead of both. This is not the case for the other model, indicating that the L1-SVM is better at selecting relevant features and avoiding correlated ones.
Figure 15: The feature weights of the optimal L2-SVM for the main data set with different noise levels.

The results confirm that the emission lines are strongly correlated with fesc, and the fact that the weights are negative further indicates that strong emission lines are a sign of a low fesc value. Since these results were for threshold 0.2, the weights indicate that the stronger the emission lines of a sample, the more likely it is to have fesc < 0.2.

Figure 16: The feature weights of the optimal L1-SVM for the main data set with different noise levels.

4.3 Robustness of the main model

In this section the robustness of the models from the previous section is investigated. Figures 17 and 18 show how the test score is affected by mislabeled samples in the train set. The x-axis shows the probability of a label being flipped in the train set and the y-axis the score when predicting an unmodified test set. The scores for different thresholds are indicated by colored lines.
Figure 17: Test score of the L2-SVM as a function of the probability of flipping each label in the train set, for the main data set with different noise levels. The test scores for each threshold are represented by lines of different colors.

Figure 18: Test score of the L1-SVM as a function of the probability of flipping each label in the train set, for the main data set with different noise levels. The test scores for each threshold are represented by lines of different colors.

Both models seem to be quite robust to this type of error. The test score declines very little up to a 40 percent chance of flipping each label. At 50 percent the model understandably has a one in two chance of predicting correctly, and as the label-flipping probability increases further, the model is trained on data that is increasingly labeled in the opposite way to the test set, causing a symmetric decline in score. The L1-SVM shows a gentler decline in score between flip probabilities 0.0 and 0.4 and less difference between the thresholds, indicating that it is slightly more robust than the L2-SVM. It also seems to be more robust with respect to noise.
For the L2-SVM the scores decline more steeply with higher noise and the difference between thresholds increases with less noise, but for the L1-SVM the plots look largely the same for all noise levels. One possible explanation for this is that since the L1-SVM uses fewer features to perform the prediction, mislabeled samples have less impact on the classification.

The next indicator of robustness is how consistently features are chosen to be used in the classification. This test is performed on the L1-SVM, as only a subset of its features have nonzero weights and it is therefore better for illustration purposes. Figure 19 shows the results of performing 1000 bootstrap iterations on the main data set with threshold 0.2 for different noise levels. The x-axis shows the wavelengths and the y-axis the frequency with which they are chosen to be part of the optimal model over the bootstrap iterations. Vertical lines mark the features that were chosen with a frequency over 90 percent.

Figure 19: The frequency with which each feature has a nonzero weight when performing 1000 bootstrap iterations on the L1-SVM model with threshold 0.2 on the main data set with different noise levels. Vertical lines indicate the features that have nonzero weight with a frequency of 90 percent or more.

The oxygen and hydrogen lines as well as several features at the start of the spectra are chosen consistently at all noise levels. As seen with the feature weights previously, only one of the oxygen lines is required when the noise level is low. A higher noise level seems to require more features at the blue end of the spectra, possibly because the slope is more difficult to infer.
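The bootstrap procedure behind figure 19 can be sketched as follows. This is a minimal illustration assuming scikit-learn's LinearSVC with an L1 penalty; the data arrays and the value of C are placeholders, not the thesis' tuned values.

```python
import numpy as np
from sklearn.svm import LinearSVC

def nonzero_feature_frequency(X, y, n_iter=1000, C=1.0, seed=0):
    """For each feature, the fraction of bootstrap fits in which its weight is nonzero."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    counts = np.zeros(n_features)
    for _ in range(n_iter):
        idx = rng.integers(0, n_samples, size=n_samples)  # resample with replacement
        clf = LinearSVC(penalty="l1", dual=False, C=C)
        clf.fit(X[idx], y[idx])
        counts += clf.coef_.ravel() != 0
    return counts / n_iter

# Features selected with frequency >= 0.9 (the vertical lines in figure 19):
# stable = np.flatnonzero(nonzero_feature_frequency(X, y) >= 0.9)
```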
Noise level sn5 seems to have slightly less variance in the feature selection than the others. This could be due to two opposing effects that noise has on the model. On the one hand, less noise results in data that is simpler to classify, so a simpler model with fewer features can suffice. On the other hand, if the spectra are not affected by noise then most wavelengths have a clear correlation to fesc and it can be beneficial to include more features. This was also indicated by the feature weights in figure 16, since the noise-free case had almost no zero features while the model trained on sn10 data was the most sparse.

Next, the variance of the feature weights is considered. Figure 20 shows the distribution of the feature weights over all bootstrap iterations. Apart from a few features in the middle of the spectra that are likely noise, the emission lines show the highest variance. The fact that input selection affects these the most could be because the strength of emission lines can differ between individual galaxies even when they have the same escape fraction. It is also interesting that the nonzero features at the start of the spectra have very little variance, even in the noisiest case. One explanation could be that the noise has a similar effect on all samples at these wavelengths, which could give an indication of why the models so consistently consider these wavelengths to be relevant.
Figure 20: The mean and standard deviation of feature weights over 1000 bootstrap iterations on the L1-SVM model with threshold 0.2 on the main data set with different noise levels.

The robustness of feature selection is also investigated by seeing which features are part of the optimal model when restricting the number of them that may be used. Figure 21 plots which features are chosen to be part of the optimal model with 1, 2, ..., 7 features. The x-axis shows the wavelengths, the y-axis the number of features that are allowed in the model, and a bar at position (x, y) indicates that wavelength x is chosen to be part of the optimal model with y features.

Figure 21: The features that are chosen to have nonzero weight in the optimal L1-SVM model when limiting the number of nonzero features and training on the main data set with threshold 0.2. A bar at position (x, y) indicates that feature x is chosen to be part of the optimal model with y nonzero features.

The oxygen line centered near 5000 Å is chosen to be part of all models. In almost all cases wavelengths at the start of the spectra are also chosen. The exception is the model with only one feature, indicating that the emission line is more important than the indicator of slope. This is consistent with the astronomical interpretation, since the emission line is considered more strongly correlated to fesc than the slope of the spectrum. Other emission lines start to be chosen when three or more features are allowed.
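One way to realize such limited-feature models is to use the L1 regularization strength to control sparsity. The helper below is a hypothetical sketch, not necessarily the thesis' exact procedure: it sweeps C and keeps the best-scoring model whose number of nonzero weights does not exceed a given limit k.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_with_at_most_k_features(X, y, k, C_grid=np.logspace(-3, 1, 40)):
    """Sweep the L1 penalty strength C, keeping the best-scoring model
    that uses at most k nonzero feature weights."""
    best_clf, best_score = None, -np.inf
    for C in C_grid:
        clf = LinearSVC(penalty="l1", dual=False, C=C)
        clf.fit(X, y)
        n_nonzero = int(np.count_nonzero(clf.coef_))
        if 0 < n_nonzero <= k:
            score = clf.score(X, y)  # a cross-validated score would be used in practice
            if score > best_score:
                best_clf, best_score = clf, score
    return best_clf

# The features chosen for the k-feature model (the bars in figure 21):
# chosen = np.flatnonzero(fit_with_at_most_k_features(X, y, 3).coef_)
```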
The two main oxygen lines are present at three features already, and the only hydrogen line present is Hγ, which appears at six features. These results are consistent with what the feature selection and bootstrapping showed and also indicate a high level of robustness of selection for the L1-SVM.

4.4 The effects of simulation assumptions

In this section the robustness of the method with respect to varying data assumptions is evaluated. Figure 22 shows how the test results of the two models vary with noise level for resolutions R100 and R1000. The left panel shows the test score as a function of noise calculated with a fixed exposure time and the right panel with a fixed signal-to-noise ratio.

Figure 22: The effects of resolution on the test score of the two models on the main data set. Red lines indicate R1000 and black R100; solid lines show the scores of the L2-SVM and dotted the L1-SVM. Left: the test score as a function of noise calculated with fixed exposure time. Right: the test score as a function of noise calculated by fixing the signal-to-noise ratio at 1500 Å.

With a fixed signal-to-noise ratio the models trained on higher resolution data perform better. With this noise model the samples of different resolutions have similar noise levels, and it is clear that some information relating to fesc is lost in the lower resolution data. With fixed exposure time it is instead the models trained on lower resolution data that have higher test scores. This is because a given exposure time results in a lower noise level for the lower resolution data. The fixed exposure time case corresponds more closely to how the noise level of actual observations will behave, and therefore the models are considered to perform better on the lower resolution.
For this reason the resolution R100 is mainly considered in this thesis.

The sensitivity of the models to different levels of noise on the train and test sets is considered next. The noise level is fixed for the test set and the score is calculated when training on data with different noise levels. Figure 23 shows the test score as a function of train set noise for different test set noise levels. The main data set with threshold 0.2 is used for both models.

Figure 23: Test scores of the L1-SVM and L2-SVM as a function of the noise level of the train set for a fixed test set noise level, on the main data set with threshold 0.2. Lines are annotated with the test set noise level.

In general it seems that training on a lower noise level than that of the test set gives poor performance, whereas training on a higher noise level is in some cases beneficial. The scores of the L1-SVM are higher overall and seem to be more robust to training on data with the "wrong" noise level. For the L2-SVM most lines show a peak in test score when train and test sets have the same noise level, but for the L1-SVM the scores are more constant for varying train set noise levels.

To test model sensitivity to different simulations we investigate the effects of varying the simulation used for the train and test sets. Figure 24 shows the test scores per threshold for all combinations of simulation method for the train and test sets for the L1-SVM and L2-SVM. Table 2 lists the test scores for threshold 0.2 as well as the average over all thresholds for each model for the same combinations. The diagonal plots show the results when the train and test sets come from the same simulation.
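These mismatch experiments share a common shape: train on one variant of the data, test on every other variant. A minimal generic sketch, assuming scikit-learn's LinearSVC and hypothetical per-variant (X, y) pairs, could look like this:

```python
import numpy as np
from sklearn.svm import LinearSVC

def mismatch_score_grid(datasets):
    """Train on each variant of the data and test on every variant.

    `datasets` maps a name (e.g. a noise level or simulation) to an (X, y)
    pair sharing the same feature layout. Returns {(train, test): score}."""
    scores = {}
    for train_name, (X_tr, y_tr) in datasets.items():
        clf = LinearSVC(penalty="l1", dual=False, C=1.0)
        clf.fit(X_tr, y_tr)
        for test_name, (X_te, y_te) in datasets.items():
            scores[(train_name, test_name)] = clf.score(X_te, y_te)
    return scores

# e.g. mismatch_score_grid({"sn3": (X3, y3), "sn5": (X5, y5), "sn10": (X10, y10)})
# or with the Shimizu, Gnedin and Finlator sets to reproduce the layout of figure 24.
```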
The Gnedin simulation has the highest test scores overall, with only threshold 0.2 having lower than 90 percent correctness. The Shimizu simulations have similar performance, but for Finlator the scores are significantly lower, with only thresholds 0.1, 0.8 and 0.9 having over 85 percent correctness. This is because of the larger spread in the shapes of the galaxy spectra in the Finlator simulation, which was also indicated by the results of PCA in section 4.1. These results show that this spread does indeed make classification more difficult. However, the scores seem to depend more on the test set than the train set simulation choice. The worst results overall are obtained when testing on the Finlator set, for all choices of train simulation, but training on Finlator does not seem to give much worse performance when testing on Shimizu or Gnedin. Since the homogeneous train sets give significantly worse performance when testing on the more varied one but not conversely, it seems the best choice of train set for future observations would be a set containing a wide range of galaxy types. Keeping this in mind, the classification model seems to be quite robust to choosing the "wrong" simulation for training. The results also indicate that the performance of the different classification models does not differ much between the simulation methods for this noise level. All combinations show varying performance for different thresholds, as was the case for the main data set in section 4.2, although which thresholds are the most problematic varies with the test set simulation.
Figure 24: Test score per threshold for the L1-SVM and L2-SVM for all combinations of simulation for the train and test set. All data sets have noise level sn5 and are without dust effects.

Train      Test       L2 (t=0.2)  L1 (t=0.2)  L2 (avg all)  L1 (avg all)
Shimizu    Shimizu    0.87        0.89        0.89          0.90
Shimizu    Gnedin     0.89        0.89        0.92          0.92
Shimizu    Finlator   0.86        0.85        0.85          0.85
Gnedin     Shimizu    0.87        0.86        0.89          0.89
Gnedin     Gnedin     0.85        0.87        0.92          0.93
Gnedin     Finlator   0.85        0.85        0.84          0.83
Finlator   Shimizu    0.86        0.86        0.88          0.89
Finlator   Gnedin     0.85        0.87        0.90          0.90
Finlator   Finlator   0.84        0.83        0.85          0.86

Table 2: Test scores for threshold 0.2 and averages over all thresholds for the L1-SVM and L2-SVM for all combinations of simulation for the train and test set.
As discussed in section 2, adding the effects of dust causes a reddening of the spectra that results in a larger spread in the fluxes. This can cause features related to fesc to be distorted and increase uncertainty by introducing properties that could falsely be interpreted as correlated to fesc. Figure 25 shows a comparison of the performance of the classification models on the Shimizu and Gnedin simulations with and without dust effects at noise level sn5. The top row contains the test scores per threshold for the two models, with dashed lines representing data with dust effects and solid lines without, for the L1-SVM in black and the L2-SVM in red. The bottom panel shows the feature weights of the L1-SVM with and without dust effects for the two simulations.

Figure 25: Top panel: test score per threshold for the L1-SVM (black lines) and the L2-SVM (red lines) for data with dust effects (dashed lines) and without dust effects (solid lines) for the Shimizu and Gnedin simulations. Bottom panel: feature weights of the L1-SVM when trained on data with dust effects (dashed lines) and without dust effects (solid lines) for the same simulations.

The effects of dust on test performance are larger for the Shimizu simulation, with lower scores on all thresholds except one. For the Gnedin simulations the scores differ less when adding dust. This could be because of the different dust recipes used, because of the properties of the simulations themselves, or a combination of both.
The fact that there are fewer nonzero features in the optimal model for Gnedin data without dust than for Shimizu suggests that it is a less complex data set that is easier to classify. The effect that dust has on the feature weights differs between the two simulations. In the Shimizu case the number of nonzero features increases significantly when adding dust. Features are chosen throughout the spectra, at wavelengths that were never chosen in any case without dust. Since the spectra with dust have a larger spread, the model needs to include more features to produce a good fit. The feature weights for Gnedin do not change as much with the addition of dust. The emission lines and the beginning of the spectra, although shifted a little to the red side, are still the main parts used. It could also be the case that dust effects make it more difficult to infer the slope, causing more parts of the spectra to be included. The fact that this effect is more prominent for the Shimizu simulation could be because the effects of dust are stronger in this case.

Finally, the effects of applying different stellar tracks are investigated. Figure 26 shows the test score per threshold for different combinations of stellar tracks for the train and test sets for the Shimizu simulation.

Figure 26: Test score per threshold for the L2-SVM and L1-SVM when varying the stellar track used on the train and test sets for the Shimizu simulation.

This assumption clearly has a large effect on classification performance.
The test scores are significantly lower for the two cases with different stellar tracks on the train and test sets, with testing on BPASS2 resulting in particular difficulties around threshold 0.2 and testing on Geneva around 0.6. This is because the different stellar evolution models predict different strengths of emission lines for a given fesc value. BPASS2 predicts stronger lines than Geneva, meaning that a model trained on BPASS2 will characterize low fesc values with emission line strengths that do not occur in the Geneva set. This causes samples with true label -1 to be given label 1. Conversely, a model trained on Geneva will overestimate fesc values for a BPASS2 test set.

4.5 Analysis of classification performance

So far the evaluation has only been concerned with the overall test performance of the models, but it is also interesting to further analyze the nature of the errors. Which samples are being incorrectly labeled? Are there any patterns or particular parts of the data that are especially difficult? Since galaxies must be separated according to low escape fractions to investigate cosmic reionization, it is particularly interesting to see what kind of errors the models make when trained on these thresholds. The test results from sections 4.2 and 4.4 showed poorer test scores overall for the low thresholds, with some dependence on simulation and noise. For the Shimizu simulations the lowest test scores were for thresholds between 0.2 and 0.4. Figure 24 shows that for the Gnedin set the results were worse for 0.2 and that the choice of train set does not much affect which threshold has the lowest score. The Finlator set has the lowest performance around threshold 0.5, and as seen in table 2 it is the only test set that does not have a lower score for threshold 0.2 than the average over all thresholds.
As was discussed earlier, some decrease in performance is expected for the extreme thresholds due to the imbalance of classes in the training data. But since the pattern of test scores is not symmetric in this way, it indicates other issues. To investigate this we look at the nature of the mispredictions. Figure 27 plots the number of incorrectly labeled samples per threshold in the Shimizu and Finlator sets for the two models. The dark bars show samples that are overpredicted and the light bars show underpredicted samples. These correspond to false positives and false negatives from the discussion in section 3, but since the goal here is to estimate fesc and classify accordingly, rather than to pick out certain samples of interest, the terms over- and underpredicted are more suitable. The y-axis shows the absolute number of samples in each category and the number above each bar indicates their percentage of the total. For example, 50 percent underpredicted samples for threshold 0.2 means that half of the samples with true label 1 were given label -1 for that threshold.

Figure 27: The frequency of mispredicted samples per threshold. Underpredicted samples are indicated by light bars and correspond to false negatives. Overpredictions, or false positives, are shown by darker bars.
Numbers above each bar indicate the percentage the mispredictions make up of the total number of samples with their true label. Results are shown for the L2-SVM and L1-SVM for both the Shimizu and Finlator simulations.

The results show that overprediction is a larger problem than underprediction. For thresholds 0.1 to 0.4 a very large part of the samples are incorrectly given label 1. The relatively high test scores for threshold 0.1 are actually misleading, as nearly all samples with lower fesc are being mispredicted. The fact that there are so few such samples, and that almost none are underpredicted, is what gives the appearance of a good test score. For the Shimizu simulation the percentage and absolute number of overpredicted samples decrease steadily with increased threshold. A corresponding but opposite effect is seen in the underpredictions, but since the percentages are much lower the total number of mispredictions decreases with higher thresholds, explaining the increase in test score. The two models have similar numbers of underpredicted samples in this case, but the L1-SVM has fewer overpredictions for all thresholds. The Finlator simulations have even larger issues with overprediction than Shimizu, and also have higher percentages of underpredicted samples for higher thresholds. The L1-SVM has fewer overpredictions but slightly more underpredictions than the L2-SVM, with threshold 0.5 being particularly problematic for both. Earlier results showed that the Finlator samples are more difficult to predict than Shimizu, and these results show that this holds for high as well as low fesc values.

The results clearly indicate that estimating the low fesc samples is particularly difficult for both simulations. To investigate this further we take a look at the distribution of values generated by the decision function. As described in section 3.2.1, the label of a sample is decided by the sign of the result of applying the decision function to it.
However, it is also interesting to look at this value itself. Its absolute value is an indicator of how "sure" the classification is, and the distribution of these values can give further indication of the nature of the mispredictions. Figures 28 and 29 show the distributions of predicted values for thresholds 0.2 and 0.8 for the Shimizu and Finlator simulations for the L2-SVM and the L1-SVM, respectively. The y-axis shows the number of samples per predicted value, and the bars are colored according to the positive class (for which the sign of the decision function value should be positive) and the negative class.

Figure 28: The frequency of predicted values for the L2-SVM when used on the Shimizu and Finlator simulations. The left panel shows the predictions when using threshold 0.2 and the right panel for 0.8. True positives are shown in light gray and true negatives in dark.

Both simulations show issues with threshold 0.2. There is a peak in frequencies centered at 0 for the negative class, indicating large uncertainty in predicting these samples. For the opposite threshold of 0.8 there are not equally big problems with underprediction, although the Finlator samples show more uncertainty here as well. This could be because of the difficulties with Finlator samples discussed earlier, but it could also be an effect of the uneven distribution of labels in the training data, especially since the Finlator set has significantly fewer samples than Shimizu.
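Extracting the decision values per true class can be sketched as follows, assuming scikit-learn's LinearSVC (whose decision_function returns the signed distance used for classification) and hypothetical data arrays with labels in {-1, 1}:

```python
import numpy as np
from sklearn.svm import LinearSVC

def decision_values_by_class(clf, X, y):
    """Split decision-function values by true class; sign(value) is the
    predicted label and a small |value| means the classifier is unsure."""
    values = clf.decision_function(X)
    return values[y == 1], values[y == -1]

# clf = LinearSVC().fit(X_train, y_train)
# pos_vals, neg_vals = decision_values_by_class(clf, X_test, y_test)
# A pile-up of neg_vals near 0, as in figure 28, signals uncertain negatives.
```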
Figure 29: The frequency of predicted values for the L1-SVM when used on the Shimizu and Finlator simulations. The left panel shows the predictions when using threshold 0.2 and the right panel for 0.8. True positives are shown in light gray and true negatives in dark.

We may ask ourselves if this classification bias is due to poor model selection or to inherent issues with the data. Perhaps using a scoring metric that values class balance over prediction accuracy would result in fewer problems with overprediction. To answer this, the model tuning process is performed again on the main data set using different scoring metrics. The average precision per class (apc), the F1 score combining precision and recall (f1), and the area under the ROC curve (rauc) are used in the cross-validation process to find the optimal value of C, and the resulting model's predicted labels are compared to the true labels. Figure 30 shows the under- and overpredictions of the optimal model according to all four scoring metrics for the L1-SVM and the main data set.
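Tuning C under different scoring metrics can be sketched with scikit-learn's GridSearchCV. This is a minimal illustration; the metric names below follow scikit-learn's built-in scorer strings, which stand in for, but are not necessarily identical to, the thesis' apc, f1 and rauc metrics.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def tune_C_per_metric(X, y, metrics=("accuracy", "f1", "roc_auc")):
    """Find the cross-validated optimal C for each scoring metric."""
    param_grid = {"C": np.logspace(-2, 2, 9)}
    best = {}
    for metric in metrics:
        search = GridSearchCV(LinearSVC(penalty="l1", dual=False),
                              param_grid, scoring=metric, cv=5)
        search.fit(X, y)
        best[metric] = search.best_params_["C"]
    return best

# e.g. tune_C_per_metric(X_train, y_train)
# -> {"accuracy": ..., "f1": ..., "roc_auc": ...}, one tuned model per metric,
# whose mispredictions can then be compared as in figure 30.
```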
Figure 30: The frequency of mispredicted samples per threshold when using different scoring metrics to find the optimal model. Underpredicted samples are indicated by light bars and correspond to false negatives. Overpredictions, or false positives, are shown by darker bars. Numbers above each bar indicate the percentage the mispredictions make up of the total number of samples with their true label. Results are shown for the L1-SVM used on the main data set.

The effects of the scoring metric on model misprediction are not very significant. The overpredictions for low thresholds are slightly lower for the metrics that take class imbalance into account, but in these cases there are also more underpredictions for those thresholds. In order to become better at estimating the escape fraction of low fesc samples, it seems the model has to trade off its ability to classify higher fesc samples. These results suggest that the misclassification issues, particularly the overestimation, are not dependent on model selection but rather on the data itself.

4.6 Comparison to other methods

In this section the proposed method's performance is compared to two other methods of estimating fesc. First we consider the model introduced by Zackrisson et al. (2013), in which only two properties derived from the SED are used: the slope and the relative width of the Hβ emission line.
To compare the two approaches, an L1-SVM is fitted using only these two features and its classification performance on the main data set is compared to that of the L1-SVM fitted to the entire spectra. The scoring metric acc is used for both models. The test scores per threshold for three noise levels are plotted in figure 31, with the left plot showing the scores of the two-feature model and the right the scores of the full-spectra approach considered in this thesis.

Figure 31: Test score per threshold for different noise levels on the main data set. Left: scores of the model that uses only the slope and Hβ emission line as features. Right: scores of the full-spectra model introduced in this thesis.

The test scores are higher for the full-spectra model for all noise levels and thresholds. This is to be expected, since the results showed that several features of the spectra were of great importance for inferring fesc, and the advantages of considering all wavelengths are clear. The differences between noise levels are also much larger for the two-feature model, indicating that using more features gives a more robust model.

Secondly, a comparison to the lasso regression model described in Lundholm (2016) is performed. To get a comparable assessment of performance, the lasso model is used to predict fesc for the main data set, and the resulting values are placed into classes defined by the threshold 0.2 in the same manner as the classification model. The confusion matrices of the two models are shown in tables 3 and 4.
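Turning the regression output into class labels and tabulating a confusion matrix can be sketched as follows. This is a minimal illustration assuming scikit-learn's Lasso as the regression model (not Lundholm's exact implementation); the convention that a predicted fesc below the threshold maps to label -1 follows the classification setup described earlier.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import confusion_matrix

def lasso_classify(X_train, fesc_train, X_test, threshold=0.2, alpha=0.01):
    """Fit a lasso regression to escape fractions, then bin the predictions
    into classes: -1 if the predicted fesc is below the threshold, else 1."""
    model = Lasso(alpha=alpha)
    model.fit(X_train, fesc_train)
    return np.where(model.predict(X_test) < threshold, -1, 1)

# y_true = np.where(fesc_test < 0.2, -1, 1)
# y_pred = lasso_classify(X_train, fesc_train, X_test)
# cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])  # rows: true, cols: predicted
```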
sn3
True \ Predicted    -1      1
-1                  82     43
 1                  72    697

sn5
True \ Predicted    -1      1
-1                  91     47
 1                  63    693

sn10
True \ Predicted    -1      1
-1                 111     47
 1                  43    693

Table 3: Confusion matrices when using lasso regression to classify according to threshold 0.2 on the main data set with different noise levels.

sn3
True \ Predicted    -1      1
-1                  70     33
 1                  85    706

sn5
True \ Predicted    -1      1
-1                  90     36
 1                  65    703

sn10
True \ Predicted    -1      1
-1                  98     39
 1                  57    700

Table 4: Confusion matrices when using the L1-SVM to classify according to threshold 0.2 on the main data set with different noise levels.

The confusion matrices do not display large differences in prediction performance. The lasso model has slightly fewer false positives, but it also has more false negatives. It seems the two approaches struggle with the same aspects of classification, and there are no clear advantages to either model.

5 Conclusions and discussion

The approach of this thesis has been quite general, with the goal of providing a starting point for using machine learning as a tool for gaining insight into cosmic reionization. The results indicate that support vector machines are promising for this application. The models capture the relationship between a galaxy's spectrum and escape fraction in an adequate way in most cases, often with over 85 percent correctness when classifying galaxies according to an fesc threshold. More importantly, the approach is quite robust. Disturbances in the data of varying types do not affect performance to a great extent, and the selection of which features are important is relatively stable. The fact that parameter selection and cross-validation error are quite stable, and that the model's performance is not sensitive to input selection or scoring metric, indicates that the method itself is quite robust for this application.
Comparison of the two model types considered did not show significant differences in performance, but still indicated advantages of the L1-SVM in some respects. The sparse model showed slightly higher accuracy, but more importantly it behaved more consistently when used on data with more noise. A higher degree of robustness was also indicated by less sensitivity to disturbances in the form of flipped labels in the train set, as well as to different levels of noise on the train and test sets. The analysis of mispredictions also showed that the L1-SVM has slightly lower rates of overprediction of low fesc samples, as well as of underprediction of high ones. These results are likely due to the advantages of a simpler model that relies less on unnecessary features and noise.

In section 4.4 the effects of using different sets of assumptions for the train and test sets were investigated. Since simulated data must be used to train models that are to classify actual observations, it is important to gauge the sensitivity of the approach to using sets of assumptions with different properties. The technique proved quite robust to differing levels of noise on the train and test data, further indicating a good representation of the signal and a low degree of overfitting. Of the different simulations, Finlator proved to be the most problematic, which was expected since it models a wider range of star formation histories and therefore has a more heterogeneous collection of galaxies. The fact that training on this set did not significantly decrease test performance for the other simulations is positive, and indicates that the correlation between fesc and the SED can still be captured with these models even when it is weaker. As mentioned previously, there are differing opinions regarding the dust content of the early galaxies, and it is not certain how appropriate it is to train on data with dust effects.
Although the effects on the spectra are significant, particularly at the low wavelengths, the results did not show significantly worse performance for data with dust. The assumption that had the largest negative effect on classification was using the wrong stellar tracks. Since Geneva and BPASS2 result in emission lines of very different strengths, training on one is misleading when testing on the other. Overall, the results of model sensitivity to data assumptions suggest that the approach is relatively robust, and some insight has been gained into what needs to be considered when selecting appropriate training data.

The results have also provided some understanding of the data and the nature of the relationship between a galaxy's spectrum and fesc. PCA showed that the parts of the spectra that are known to be affected by escape fraction indeed have the highest information content, and the different simulations showed varying degrees of clustering according to fesc when projected onto the principal components with highest variance. Inspection of feature selection showed that the emission lines are among the main features used, with the oxygen lines OII and OIII5007 being the most significant ones. The former is preferred at low noise levels, but higher noise seems to require the models to rely on both. The hydrogen lines are also prominent, with Hβ generally being stronger but Hγ being the only hydrogen line chosen in the optimal model when the number of features was limited. The blue wavelengths were also consistently used in the classification, indicating that the slope of the spectra is important for inferring the escape fraction. However, as discussed previously, this relationship could be problematic: there are concerns that other properties of the galaxies affect the slope, and there is uncertainty in the relationship. Still, the models always relied on some of these wavelengths, even when only two features were to be used.
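The two inspection steps described above, projecting spectra onto the leading principal components and ranking spectral bins by the magnitude of a sparse model's weights, can be sketched as follows. The spectra here are synthetic stand-ins in which two arbitrarily chosen bins (indices 10 and 40) carry the signal:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 80))            # stand-in spectra, 80 wavelength bins
y = np.sign(X[:, 10] - X[:, 40] + 0.1 * rng.normal(size=300))

# PCA projection: the 2D coordinates used for the clustering scatter plots
pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)                      # shape (300, 2)

# Feature ranking by |weight| of an L1-SVM, mimicking the emission-line inspection
svm = LinearSVC(penalty="l1", loss="squared_hinge", dual=False).fit(X, y)
top = np.argsort(np.abs(svm.coef_[0]))[::-1][:10]  # strongest spectral bins
```

In the real analysis the highly weighted bins correspond to physically meaningful features such as the oxygen and hydrogen lines; here they simply recover the two informative bins.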
The feature weights of the model trained on data with dust effects also indicated this. The effects of dust cause a higher spread in the blue wavelengths, and this seemed to cause the models to use previously unused parts of the spectra to infer the slope. This effect was not as significant in the Gnedin set, which also showed less degradation in performance due to dust. Another important result was that the low fesc samples were especially prone to overestimation. Since the problems with threshold 0.2 remained when using different models and scoring metrics, and since the opposite threshold of 0.8 did not give the same degree of uncertainty and underprediction for high fesc samples, it seems that this is not just a problem of class imbalance but an issue with the data itself.

The performance results and the insights gained about the data provide a first step towards evaluating the applicability of support vector machines to the problem of estimating fesc. The actual classification performance has been acceptable in most tests performed, and the analysis of data assumptions has given some indication of the circumstances under which the method may be useful. The size of the data sets seems to be sufficient in most cases, since performance depends more on data assumptions than on the number of galaxies. This is shown by the fact that the Gnedin simulation had higher overall performance even though it had significantly fewer samples than Shimizu. Bootstrapping also showed a stable model tuning process that was not very dependent on input selection. It is also useful knowledge that mismatches in noise level do not matter very much, but that using the wrong stellar tracks can reduce performance significantly. The fact that training on Finlator did not reduce test performance on the other simulations indicates that a train set with varied samples is preferable to more homogeneous sets.
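The bootstrap check on tuning stability mentioned above can be sketched as follows: refit the model on resampled train sets and record whether the selected regularization strength stays the same. The data and the grid of C values are illustrative, not those of the thesis:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 30))
y = np.sign(X[:, 0] + X[:, 1] + 0.2 * rng.normal(size=150))

chosen_C = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap resample with replacement
    search = GridSearchCV(LinearSVC(dual=False),
                          {"C": [0.01, 0.1, 1.0, 10.0]}, cv=3)
    search.fit(X[idx], y[idx])
    chosen_C.append(search.best_params_["C"])

# A stable tuning process selects the same C on most resamples
print(chosen_C)
```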
Comparison to a previous model based on using only the slope and the Hβ emission line showed that using the whole spectrum gives significantly higher performance, further illustrating the advantages of a quantitative and data-oriented approach. The support vector machine did not, however, prove to have an advantage in accuracy or robustness over the lasso regression model. Further analysis is required to determine the optimal machine learning technique for the task of predicting the escape fraction of distant galaxies.

With these results as a basis, more specific measures can be suggested for future work. The fact that indicators of the slope are consistently chosen by the classifiers would be interesting to study in more detail. It is likely that the models considered suffer less from the ambiguity of the slope than some of the simpler diagnostics previously considered, since the entire spectrum is taken into account, but more investigation is necessary to say this with certainty. However, it is also possible that the use of the slope is more misleading than helpful, and that the classification would benefit from removing it from the spectra. Another possibility would be to use knowledge of the data to introduce extra weights that make certain features more important than others, and to investigate the effects of different types of normalization and scaling. The fact that the low escape fraction samples are particularly problematic to predict is also worth investigating. It could be that these samples are more similar to each other, or that the correlation between fesc and the slope is weaker for such galaxies. A first approach could then be to simply use a larger train set. The models could also be tuned to compensate by sample weighting or, if it could be known with certainty that the higher fesc values will not occur in observations, by excluding them from the train set.
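The weighting compensation suggested above can be sketched with scikit-learn's class_weight option, which rescales the loss so that the rare class is not drowned out. The imbalanced data below is synthetic, with label -1 standing in for the rare low-fesc galaxies:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 20))
# Few "low fesc" galaxies (label -1), mirroring the class imbalance discussed above
y = np.where(X[:, 0] + 0.5 * rng.normal(size=400) > 1.5, -1, 1)

plain = LinearSVC(dual=False).fit(X, y)
balanced = LinearSVC(dual=False, class_weight="balanced").fit(X, y)

# The reweighted model typically recovers more of the minority class
minority = y == -1
recall_plain = (plain.predict(X[minority]) == -1).mean()
recall_bal = (balanced.predict(X[minority]) == -1).mean()
print(f"minority recall: plain={recall_plain:.2f} balanced={recall_bal:.2f}")
```

Per-sample weights (the `sample_weight` argument of `fit`) would allow finer control than the two-class rebalancing shown here.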
If the issues are due to a complicated correlation, a more complex model could be considered. Although the linear support vector machine has good qualities and high interpretability, it could be worth investigating whether the use of kernels would give better classification performance on the particularly difficult samples. A nonlinear model might also give better performance on the more heterogeneous Finlator simulation. Since the low escape fraction thresholds are of particular interest to the study of cosmic reionization, the model improvement process should include a qualitative aspect as well.

In conclusion, the work in this thesis has shown that data-driven predictions can be a suitable tool for investigating cosmic reionization. The results have suggested which aspects of the data need further study and paved the way for further improvements to the method. The importance of certain simulation assumptions has also been indicated, as well as the fact that further knowledge of distant galaxies will be of importance for inferring their escape fractions.

References

Alexandroff, R., Heckman, T. M., Borthakur, S. and Overzier, R. (2015). Indirect Evidence for Escaping Lyman Continuum Photons in Local Lyman Break Galaxy Analogs, American Astronomical Society Meeting Abstracts, Vol. 225, p. 251.09.

Alvarez, M. A., Finlator, K. and Trenti, M. (2012). Constraints on the ionizing efficiency of the first galaxies, ArXiv e-prints. URL: http://arxiv.org/abs/1209.1387

Bouwens, R. J., Illingworth, G. D., Franx, M., Chary, R.-R., Meurer, G. R., Conselice, C. J., Ford, H., Giavalisco, M. and van Dokkum, P. (2009). UV Continuum Slope and Dust Obscuration from z ~ 6 to z ~ 2: The Star Formation Rate Density at High Redshift, Astrophysical Journal 705: 936–961.

Bradley, P. S. and Mangasarian, O. L. (1998).
Feature selection via concave minimization and support vector machines, Proceedings of the Fifteenth International Conference on Machine Learning, ICML '98, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 82–90.

Cortes, C. and Vapnik, V. (1995). Support-vector networks, Machine Learning 20(3): 273–297.

Eldridge, J. J. and Stanway, E. R. (2009). Spectral population synthesis including massive binaries, Monthly Notices of the Royal Astronomical Society 400: 1019–1028.

Fernandez, E. R., Dole, H. and Iliev, I. T. (2013). A novel approach to constrain the escape fraction and dust content at high redshift using the cosmic infrared background fractional anisotropy, The Astrophysical Journal 764(1): 56.

Finkelstein, S. L., Papovich, C., Salmon, B., Finlator, K., Dickinson, M., Ferguson, H. C., Giavalisco, M., Koekemoer, A. M., Reddy, N. A., Bassett, R., Conselice, C. J., Dunlop, J. S., Faber, S. M., Grogin, N. A., Hathi, N. P., Kocevski, D. D., Lai, K., Lee, K.-S., McLure, R. J., Mobasher, B. and Newman, J. A. (2011). CANDELS: The Evolution of Galaxy Rest-Frame Ultraviolet Colors from z = 8 to 4.

Finlator, K., Davé, R., Papovich, C. and Hernquist, L. (2006). The Physical and Photometric Properties of High-Redshift Galaxies in Cosmological Hydrodynamic Simulations, Astrophysical Journal 639: 672–694.

Finlator, K., Muñoz, J. A., Oppenheimer, B. D., Oh, S. P., Özel, F. and Davé, R. (2013). The host haloes of O I absorbers in the reionization epoch, Monthly Notices of the Royal Astronomical Society 436: 1818–1835.

Gnedin, N. Y. (2014). Cosmic Reionization on Computers. I. Design and Calibration of Simulations, Astrophysical Journal 793: 29.

Inoue, A. K., Shimizu, I., Iwata, I. and Tanaka, M. (2014). An updated analytic model for attenuation by the intergalactic medium, Monthly Notices of the Royal Astronomical Society 442: 1805–1820.

Jensen, H., Zackrisson, E., Pelckmans, K., Ausmees, K., Lundholm, U. and Binggeli, C. (2016).
Measuring the escape fraction of ionizing photons in high-redshift galaxies using machine learning, Master Thesis, Department of .

Lundholm, U. (2016). Modeling the escape fraction of ionizing photons in high-redshift galaxies using statistical learning techniques, Master thesis, Department of Engineering Sciences, Uppsala University.

Mohri, M., Rostamizadeh, A. and Talwalkar, A. (2012). Foundations of Machine Learning.

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space, Philosophical Magazine 2: 559–572.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12: 2825–2830.

Pei, Y. C. (1992). Interstellar dust from the Milky Way to the Magellanic Clouds, The Astrophysical Journal 395: 130–139.

Raiter, A., Schaerer, D. and Fosbury, R. A. E. (2010). Predicted UV properties of very metal-poor starburst galaxies, Astronomy & Astrophysics 523: A64.

Robertson, B. E., Ellis, R. S., Furlanetto, S. R. and Dunlop, J. S. (2015a). Cosmic Reionization and Early Star-forming Galaxies: A Joint Analysis of New Constraints from Planck and the Hubble Space Telescope, Astrophysical Journal Letters 802: L19.

Robertson, B. E., Ellis, R. S., Furlanetto, S. R. and Dunlop, J. S. (2015b). Cosmic Reionization and Early Star-forming Galaxies: A Joint Analysis of New Constraints from Planck and the Hubble Space Telescope, The Astrophysical Journal Letters 802(2): L19.

Schaerer, D. and de Barros, S. (2010). On the physical properties of z ~ 6-8 galaxies, Astronomy & Astrophysics 515: A73.

Shimizu, I., Inoue, A. K., Okamoto, T. and Yoshida, N. (2014). Physical properties of UDF12 galaxies in cosmological simulations, Monthly Notices of the Royal Astronomical Society 440: 731–745.

Vapnik, V.
and Chervonenkis, A. (1964). On a class of perceptrons, Automation and Remote Control 25: 103–109.

Zackrisson, E. et al. (2016). ArXiv e-prints.

Zackrisson, E., Inoue, A. K. and Jensen, H. (2013). The Spectral Evolution of the First Galaxies. II. Spectral Signatures of Lyman Continuum Leakage from Galaxies in the Reionization Epoch, Astrophysical Journal 777: 39.