MORPHOLOGICAL CLASSIFICATION OF GALAXIES INTO SPIRALS AND NON-SPIRALS

Devendra Singh Dhami

Submitted to the faculty of the University Graduate School in partial fulfillment of the requirements for the degree Master of Sciences in the School of Informatics and Computing, Indiana University, May 2015.

Accepted by the Graduate Faculty, Indiana University, in partial fulfillment of the requirements for the degree of Master of Sciences.

Master's Thesis Committee: Professor David J. Crandall, Professor David B. Leake, Professor Sriraam Natarajan

Copyright © 2015 Devendra Singh Dhami

ABSTRACT

The aim of this master's thesis is the classification of images of galaxies according to their morphological features using computer vision and artificial intelligence techniques. We deal specifically with the shape of the galaxy in this project. Galaxies are broadly categorized into three categories according to their shape: circular, elliptical, and spiral. Of these three possible shapes, the spiral is the most challenging to classify correctly. This is mostly due to the noisy images of the galaxies and partly due to the shape itself, as a spiral can easily be mistaken for an ellipse or even a circle. Thus we focus on classifying the images into only two categories: spiral and non-spiral. The first phase of the thesis addresses the process of feature extraction from images of the galaxies, and the second phase uses artificial intelligence and machine learning methods to create a system that categorizes galaxies based on the extracted features. The specific methods used for classification are boosting, logistic regression, and deep neural networks. We evaluate these techniques on freely available data from the Galaxy Zoo project [1]. The languages used are C++ (OpenCV) and Python.

TABLE OF CONTENTS

Chapter 1. Introduction
 1.1 Overview and Motivation
 1.2 Related Work
  1.2.1 Galaxy Classification Problem
  1.2.2 Other Classification Problems
 1.3 The Thesis
 1.4 Novel Features
 1.5 Adopted Features
 1.6 Outline and Contribution of this Thesis

Chapter 2. Feature Extraction
 2.1 Detection of Bar
 2.2 Dark matter density (gray scale and binary)
 2.3 Disk / Bulge ratio
 2.4 Circularity
 2.5 Black / White pixel ratio
 2.6 Convexity
 2.7 Form Factor
 2.8 Bounding rectangle to fill factor
 2.9 Count of line intersection of the shape
 2.10 Maximum values of Red and Blue channels
 2.11 Concentration index
 2.12 Aspect Ratio
 2.13 Extent
 2.14 Red to Blue color intensity ratio
 2.15 Fitted ellipse angle
 2.16 Fitted Ellipse Height to Width ratio
 2.17 Color coherence vector (CCV)

Chapter 3. Classification Algorithms
 3.1 Introduction to algorithms
  3.1.1 Logistic Regression
  3.1.2 Boosting
 3.2 Advantages and disadvantages of these algorithms

Chapter 4. Deep Learning
 4.1 Introduction to Deep Learning
  4.1.1 Deep Neural Network
  4.1.2 Deep Belief Network
  4.1.3 Restricted Boltzmann Machines
   4.1.3.1 Energy based models
   4.1.3.2 Boltzmann Machines
 4.2 Success of Deep Learning

Chapter 5. Results
 5.1 Available data
 5.2 Logistic Regression and Boosting
 5.3 Deep Learning
 5.4 Feature Analysis

Chapter 6. Unsuccessful Experiments
 6.1 Blob Detection
 6.2 Laplacian of Gaussian (LoG)
 6.3 Discrete Fourier Transform (DFT)
 6.4 Case Based Reasoning (CBR)
Chapter 7. Conclusion and Future Work
 7.1 Contribution
 7.2 Future Work
 7.3 Summary

References

LIST OF FIGURES

Figure 1. Hubble Tuning Fork.
Figure 2. Example galaxy images from SDSS database.
Figure 3. Approach to the classification task.
Figure 4. Barred and unbarred galaxies.
Figure 5. A galaxy viewed edge-on showing the bulge, and a galaxy where the bulge is undetectable.
Figure 6. Bar detection process.
Figure 7. Thresholding of image to get the shape.
Figure 8. Process of cropping the image to the size of the bounding box.
Figure 9. Discretizing and finding the coherent regions in CCV.
Figure 10. CCV visualization for T = 500.
Figure 11. Best fit line for a set of data points (regression).
Figure 12. A sigmoid function example.
Figure 13. The AdaBoost algorithm.
Figure 14. a) An Artificial Neural Network and b) A neuron.
Figure 15. An EBM that measures the compatibility between observed variables X and variables to be predicted Y using the energy function E(X, Y).
Figure 16. A Boltzmann machine in graphical format.
Figure 17. A restricted Boltzmann machine in graphical format.
Figure 18. A deep belief network with stacked RBMs.
Figure 19. Training algorithm for DBNs.
Figure 20. SGD convergence time.
Figure 21. The Galaxy Zoo classification decision tree.
Figure 22. A snapshot of the decision tree in question form.
Figure 23. The human annotated data.
Figure 24. Splitting the dataset into two different sets.
Figure 25. N-fold cross validation.
Figure 26. Logistic Regression WEKA parameters.
Figure 27. AdaBoost WEKA parameters.
Figure 28. The confusion matrix for logistic regression.
Figure 29. The confusion matrix for AdaBoost.
Figure 30. Precision and Recall.
Figure 31. ROC curves for Logistic Regression and AdaBoost.
Figure 32. Comparing ROC curves for Logistic Regression and Boosting.
Figure 33. Dataset format for feeding into DBN.
Figure 34. DBN results for a small dataset of 2983 images.
Figure 35. DBN results for the complete dataset of 61578 images.
Figure 36. Performance of all bad performing features together.
Figure 37. Performance of all bad performing features combined with the best feature.
Figure 38. Performance of all bad performing features combined with the second best performing feature.
Figure 39. Performance of all bad performing features together using MLP.
Figure 40. Images having bar feature at different angles.
Figure 41. An edge-on galaxy image.
Figure 42. CBR process.
Figure 43. CBR process relationships.

LIST OF TABLES

Table 1. The accuracy of the various parameter sets adopted from [7].
Table 2. Comparison of error rates of the 3 methods adopted from [11].
Table 3. The resulting vector for CCV.
Table 4. Logistic Regression Results.
Table 5. Boosting Results.

CHAPTER 1
INTRODUCTION

1.1. Overview and Motivation

As late as the 1990s, due to the lack of appropriate technology, the task of seeing a faraway object was challenging, and thus the discovery of new galaxies was a slow process. The number of galaxies we could photograph was very small, and owing to this slow pace their classification was not a demanding task. With the advancement of technology there has been a burst in the number of galaxies being found.
We are now able to photograph many more distant galaxies than in the past, but most of the classification still depends on human effort, where thousands of volunteers around the world classify these galaxies into their respective classes manually. If we look into the history of classification we find several systems that were designed by astronomers for the purpose of classification of galaxies. Edwin Hubble classified galaxies into something called the Hubble sequence, also known as the Hubble tuning fork (Figure 1), in 1936 [2]. There are a few other systems, like the de Vaucouleurs system and the Yerkes (or Morgan) schemes [3]. In 2007 a citizen science project called the Galaxy Zoo was launched, whose aim was to involve human volunteers in the classification of the galaxy images obtained from the Sloan Digital Sky Survey (SDSS) database [4]. This approach was very successful in the beginning: within 24 hours of launch the project was receiving almost 70,000 classifications an hour. In the end, more than 50 million classifications were received by the project during its first year, contributed by more than 150,000 people [5].

Figure 1. Hubble Tuning Fork adopted from [2].

The project is still ongoing, but recently a need has arisen to automate the classification process, for several reasons. First, images of the galaxies are noisy, and thus the classification of the same galaxy varies among the human volunteers. Figure 2 shows some example images from the SDSS database. Second, the size of the SDSS database is ever increasing. New galaxies are being found virtually every day, and thus a system that can automate the classification process is very much desired. Third, detection of the shapes of galaxies is a difficult problem and thus presents an interesting research challenge for automatic image analysis.

Figure 2. Example galaxy images from SDSS database.

The field of computer vision plays a major role in automating galaxy classification. We need to rely on the images obtained from the SDSS database, and since galaxies are far-away objects, the images obtained are low resolution and noisy. The area of computer vision that holds special importance here is feature extraction. If we can obtain meaningful features from images, then various machine learning algorithms, both supervised and unsupervised, can be used for classification. Unfortunately, feature extraction is difficult. We not only need to find features, but they should be as general as possible so that they apply to all applicable images in the dataset in question, which here is the images of spiral and non-spiral galaxies. This feature extraction process forms a major part of this thesis.

1.2. Related Work

There has been a significant amount of research work in the area of automatic galaxy classification, although this area can still be called relatively new. The main focus has been on applying various machine learning algorithms, with a special focus on neural networks, to the classification task. However, feature extraction has not been studied as extensively, since it has mostly been treated as a part of image preprocessing.

1.2.1. Galaxy Classification Problem

The two most famous papers on this subject are undoubtedly de la Calleja et al. [6] and Banerji et al. [7]. In [6] the galaxy classification method is divided into three stages: image analysis, data compression, and machine learning.
The authors applied three machine learning methods to galaxy image classification and carried out a comparison study of the three algorithms. These algorithms, Naive Bayes, C4.5 (an extension of the ID3 algorithm) [35], and Random Forest, were tested on the New General Catalog (NGC) released by the Astronomical Society of the Pacific [34]. In the image analysis step, they applied Principal Component Analysis (PCA) to make the galaxies position, scale, and rotation invariant. This was done because the galaxies in the images were not centered, a criterion which the SDSS database already fulfills. PCA was also used to reduce the dimensionality of the data (data compression), and the principal components of the image projection were then used as a set of features for the classification phase. They found that Random Forest performed better than Naive Bayes or C4.5.

In [7], the authors applied neural networks to classify images into three classes: early types, spirals, and point sources/artifacts. The neural network was trained on 75,000 galaxy images obtained from the Sloan Digital Sky Survey [4]. These training images are associated with features already annotated by humans through the Galaxy Zoo project. The test data comprised one million galaxy images. They trained and tested the neural network using three sets of input parameters: (a) Colors and profile fitting: this parameter set refers to the colors of galaxies or any parameter associated with profile fitting, like the Hubble profile [2] or de Vaucouleurs profile [3] described in Section 1.1, for morphological classification. (b) Concentration and adaptive moments: this parameter set refers to the concentration index [20], which will be defined in a later section and is used as a feature in this project, as well as other texture parameters. (c) The combination of both (a) and (b). Their results show that the color or the shape parameters, when taken individually, are not sufficient to capture the morphological features of the galaxy. However, combining those parameters increased the accuracy remarkably (Table 1).

Parameters                                Early Types    Spirals    Point Sources/Artifacts
(a) Colors and profile fitting            87%            86%        95%
(b) Concentration and adaptive moments    84%            87%        28%
(c) Combining (a) + (b)                   92%            92%        96%

Table 1. The accuracy of the various parameter sets adopted from [7].

In Yin Cui et al. [8] a system was created where a galaxy is queried by providing a galaxy image as input, after which the system retrieves and ranks the most similar galaxies. In order to accurately detect galaxies, the input images must be invariant to rotation and scale. To find the rotation angle, the second moment of inertia was applied. A spatial-color layout descriptor was proposed to encode both local and global morphological features. The descriptor was then combined with Kernelized Locality Sensitive Hashing for retrieval and ranking. Experiments were carried out by applying three kernels: Histogram Intersection, Chi-Square, and Jensen-Shannon Divergence. Of the three, Histogram Intersection produced the best results with 95.8% accuracy.

1.2.2. Other Classification Problems

Although not directly related to the topic at hand, Eskandari & Kouchaki [9] present an important paper in shape detection. The authors propose a novel method to distinguish regular and irregular shapes/regions in satellite and aerial images. Wu et al.
[10] define a regular shape as "one that possesses the characteristic that within the shape there is a point that has an equal distance to each side or the boundary of the shape." The authors of [9] use a more general definition in their paper. They define regularity as "the whole shape is formed by a repetition of a particular part, in the same direction, at the sides of a regular polygon or by the repetition of this part, in the same direction, at the two sides of a line." The authors use the Discrete Fourier Transform (DFT) and present a measure called Reference Counter-to-Noise Ratio (RCNR) to quantify the regularity of a shape. For their experiments they use three different satellite images from Google Earth and find that their approach is quite successful in detecting regular shapes in these images.

The authors in [11] trained a convolutional neural network to classify the 1.2 million images of the ImageNet LSVRC-2010 contest into 1000 different classes. This was a deep learning approach, as the network consisted of five convolutional layers and had 60 million parameters and 650,000 neurons. Their method does not use any feature extraction at all, and the only preprocessing they do is to scale down the variable-sized images of the dataset into fixed-size 256 x 256 images. The neurons are trained on the raw RGB values of the pixels. The authors reported results using the two error metrics that are mandatory for the ImageNet database, the top-1 and top-5 error rates. The top-5 error rate is defined as "the fraction of test images for which the correct label is not among the five labels considered most probable by the model." They compare their results to a sparse-coding approach implemented in [12] and an approach that averages Fisher vectors over SIFT features implemented in [13]. The results (Table 2) show that the neural network method works best among the compared methods.

Model                Top-1    Top-5
Sparse Coding [12]   47.1%    28.2%
SIFT + FV [13]       45.7%    25.7%
CNN [11]             37.5%    17%

Table 2. Comparison of error rates of the 3 methods adopted from [11].

1.3. The Thesis

Most of the work discussed above has one disadvantage in common: there is very little focus on designing the image features. The neural networks do learn features implicitly, but do not explicitly reveal anything about the features being learned. As neural networks work well in such classification problems, the need to learn features explicitly did not arise. Thus neural networks answer the question "what particular class is the image classified as?" but hide the answer to "why is the image classified as a particular class?" If we can design and extract some useful features, perhaps informed by our prior knowledge of astronomy, then we can learn more about the images and the classification will become an easier task. Of course, this approach also has a different but complementary weakness: feature extraction has the disadvantage of not being scalable to other problems. In other words, features that work really well for one dataset may become less important or completely irrelevant for some other dataset from a different domain.

My approach in this thesis is more focused on the task of feature extraction and can thus be described by a simple diagrammatic representation (Figure 3):

Images -> Feature Extraction -> Classification

Figure 3. Approach to the classification task.

In the next two sections we present an introduction to the various features that were extracted for this project; we discuss them in greater detail in Chapter 2.
Most of these features are based on knowledge of astronomy and are novel in being used for this particular problem.

1.4 Novel Features

1.4.1. Detection of bar

Surveys show that almost two-thirds of the observed spiral galaxies are barred, i.e. have a bar at the center of the galaxy. This phenomenon is absent in other classes of galaxies, and thus the presence of a bar serves as an important feature for separating spiral galaxies from the others. Figure 4 shows an example of a barred and an unbarred galaxy.

1.4.2. Dark matter density (gray scale and binary)

Spiral galaxies have a substantial amount of dark matter. Galaxies do not generally follow the rotation laws of a solid body here on Earth: the angular velocity of the rotation typically decreases with radius. Until the early 1970s, most of the rotation data for spirals came from optical observations which did not extend beyond the luminous inner regions [15]. But in later years, when the observations extended to farther regions of the galaxy, the data showed the rotational velocity to be almost constant even with increasing radius. This meant that the total mass of the spiral within some radius increases linearly with the radius while the total luminosity approaches a finite asymptotic limit, and thus a large amount of invisible mass must be present [15]. Due to the presence of this mass, the image should have higher pixel values in the halo. Thus the grayscale values of the dark matter halo of spiral galaxies should be in a higher range compared to their counterparts, and could be a useful feature in differentiating between the two.

Figure 4. Barred (left) and unbarred (right) galaxies.

1.4.3. Disk / Bulge ratio

Most spiral galaxies have a bulge at the center, but this can be observed only when the galaxy is viewed edge-on. As most of the galaxies we encounter are viewed head-on, detecting the bulge becomes impossible. Figure 5 demonstrates an example of this. However, because the Bulge / Disk ratio can be written in terms of the surface brightness of the galaxy, we can extract that ratio as a feature for our classification.

Figure 5. A galaxy viewed edge-on showing the bulge (left) and a galaxy where the bulge is undetectable (right).

1.4.4. Circularity

The circularity parameter defines how close to a circular shape an object is. It is defined by the isometric index equation. The value of the parameter is near 1 for a circle and lower for other shapes. As mentioned in [16], the value is "much less than one for a starfish footprint." Since a spiral shape is very close to a starfish shape, it should also have a low value for this parameter.

1.4.5. Black to White pixel ratio

This parameter measures the ratio of the number of black pixels to the number of white pixels in the binary form of the input image. In the SDSS database, all of the galaxies are centered and are roughly of the same size. Thus this parameter should have a higher value for non-spiral galaxies than for spiral galaxies. Although this measure might not prove useful for larger non-spiral galaxies, it still forms an interesting and simple shape detection feature.

1.4.6. Count of line intersection of the shape

Spiral galaxies have arms and other galaxies do not. This simple property can be very useful in differentiating between spiral and non-spiral shapes.
If we draw a line from the center of the galaxy towards its edge and count the number of times this line intersects the galaxy, we get a fair idea of the shape, as the line will intersect a spiral shape more than once but a circular or elliptical shape only once.

1.4.7. Maximum values of Red and Blue channels

Spiral galaxies are middle-aged galaxies, i.e. they are in between the newer galaxies (lots of star formation and generally irregularly shaped) and old galaxies (almost no star formation and elliptically shaped). The old galaxies are red in color due to the lack of any gas used for star formation, and the new galaxies are blue in color due to an abundance of gas and dust for star formation [18]. Spiral galaxies have all star formation in the arms and none in the disk, so the arms are bluish and the disk is reddish in appearance. Thus for elliptically shaped galaxy images the maximum value of the red channel should be relatively high, and for irregularly shaped galaxy images the maximum value of the blue channel should be relatively high. For spiral galaxy images, neither the red nor the blue channel should have a high value.

1.4.8. Concentration index

This parameter is related to the radial profile of the galaxy. Before defining the concentration index, the following definitions [20] are necessary:

a) Petrosian radius [29]: the radius where the intensity of the light from the galaxy is equal to a predefined value, usually 0.2 [21].
b) Petrosian flux: the sum of all the flux within k times the Petrosian radius.
c) R90: the Petrosian ninety-percent radius, i.e. the radius which contains 90% of the Petrosian flux.
d) R50: the Petrosian half-light radius, i.e. the radius which contains half of the Petrosian flux.

The concentration index is defined as the ratio between R90 and R50.

1.4.9. Aspect Ratio

Aspect ratio is defined as "a function of the largest diameter and the smallest diameter orthogonal to it." We can interpret it as the ratio between the width and height of the bounding rectangle of the galaxy.

1.4.10. Extent

Extent is defined as the ratio of contour area to bounding rectangle area.

1.4.11. Red to Blue color intensity ratio

As explained in Section 1.4.7 above, spiral galaxies are middle-aged galaxies and thus have a reddish disk and bluish arms. The difference between this feature and the feature of Section 1.4.7 is that here the galaxy image is more carefully cropped and we find the mean color of the RGB channels of this cropped image. For spiral galaxies the value of this ratio should be near 1-1.2, and for non-spiral galaxies it should be higher, as red dominates blue in such galaxies.

For the next two features, galaxy images are characterized by fitting an ellipse to them.

1.4.12. Fitted ellipse angle

This feature calculates the angle of the rotated rectangle that best fits the galaxy. For non-spiral galaxies the fitted ellipse (rectangle) should have a large angle, as the rectangle is nearly upright, and for a spiral galaxy the angle should be relatively low.

1.4.13. Fitted Ellipse Height to Width ratio

For spiral galaxies this ratio should have higher values due to the spiral shape, and non-spiral galaxies (i.e. circular and elliptical) should have lower values.

1.5 Adopted Features

The next three shape features (1.5.1-1.5.3) have been adopted from [17].

1.5.1. Convexity

As the name suggests, this parameter measures how convex a particular object is.
According to [17], "For jagged regions like spiral galaxies convexity is very large, whereas for elliptical galaxies it is very small."

1.5.2. Form Factor

Goderya & Lolling [17] define Form Factor as "a ratio of area and square of the perimeter of the galaxy." Elliptical galaxies have a higher value for this parameter, as star formation is low and thus most areas are equally bright, i.e. the luminosity of elliptical galaxies is approximately uniformly distributed. For spiral galaxies the values are low, as their "perimeter per unit area is relatively large" [17] and the luminosity is not uniformly distributed.

1.5.3. Bounding rectangle to fill factor

This parameter relates the area of the galaxy to the area of the bounding rectangle. It shows "how many pixels in the bounding rectangle belong to the galaxy in reference to the total number of pixels in the bounding rectangle" [17].

1.5.4. Color coherence vector (CCV)

The color coherence vector (CCV) is a method developed for content-based image retrieval [19]. The idea of CCV is to mark each pixel as coherent or incoherent. A coherent pixel belongs to a region of pixels that share the same color value. Connected pixels are formally defined as follows [19]: for a region R to be considered a region of connected pixels, it should hold that for each p1, p2 ∈ R there exists a path of adjacent pixels from p1 to p2 (the path traversal can be horizontal, vertical, or diagonal). For this feature the color space is discretized into 64 colors and then each pixel is checked for its membership in a coherent region.

1.6 Outline and Contributions of this Thesis

In Chapter 2, the various features extracted for this thesis are described in detail. The process followed for obtaining the features from the input images is presented. In Chapter 3, the machine learning algorithms used for classification of the images are presented. A brief introduction to the algorithms is followed by their advantages and disadvantages. In Chapter 4, the concept of deep learning is introduced and its success in tackling this kind of problem is discussed. In Chapter 5, the results are presented. In Chapter 6, some experiments are presented which were tried during the course of this thesis but failed or did not work as intended. Finally, Chapter 7 gives the conclusion and the scope of future work for this problem.

CHAPTER 2
FEATURE EXTRACTION

This chapter deals with the technical details of the feature extraction process. The SDSS images [4] that we are using are color images of size 424 x 424 in JPEG format. The centers of the galaxies are located at the center of the image, which is helpful as we do not need to design feature detectors that are position invariant. Before the feature detection process, the images are cropped to a size of 180 x 180 to remove some of the background noise.

2.1 Detection of bar

Most spiral galaxies have bars at their center, which emit brighter light than the rest of the galaxy. The first step, then, in exploiting the brightness of a potential bar, is to increase the contrast of the galaxy. It was found that enhancing the image contrast in the HSV (Hue-Saturation-Value) color space instead of the RGB color space produced better results for the purpose of applying a threshold to the image to convert it into a binary image. Before moving ahead it is important to describe the HSV color space. The definitions below are taken from [27].
H: The hue (H) of a color refers to which pure color it resembles.
S: The saturation (S) of a color refers to the amount of gray or white in the color.
V: The value (V) of a color, also called its lightness, describes how dark the color is. It is also used to define the brightness of a color, which is the definition we use here.

The threshold for the bar may be set very high, as the bar is the brightest part of the galaxy image. A threshold value of 255 was used for this purpose, and the pixels which satisfy the threshold were then extracted from the contrast-enhanced image. A contour was drawn around the mass of brightest pixels from the galaxy image. This contour was then made rotation invariant by calculating the maximum moment of the contour, which gives the orientation of the shape in the image, and then rotating the contour by the obtained orientation. The next step is to determine the width and height of the shape bounded by the contour. This is done by calculating the semi-major and semi-minor axes, which give the width and height respectively. If the width is much greater than the height, the presence of a bar structure in the galaxy is confirmed. Figure 6 shows this process.

a) Original Image b) Increased Contrast Image c) Extracted Contour d) Rotated Contour

Figure 6. Bar detection process.

2.2 Dark matter density (gray scale and binary)

Dark matter is attributed to be the source of much of the brightness observed in a galaxy. The disk of a galaxy is said to be surrounded by a dark matter halo whose density as a function of radius is given by the Navarro-Frenk-White profile [22]:

\rho(r) = \rho(0) / [ (r / R_s) (1 + r / R_s)^2 ]

where \rho(r) is the dark matter density at radius r, \rho(0) is the central dark matter density, and R_s is the scale radius. The scale radius is defined in [23] as "the radius at which the brightness of the galaxy has fallen off by a factor of e (~2.71828) from the center."

The first step is to convert the input image into grayscale and calculate the central brightness of the galaxy. Here the following assumption is made: the value of a pixel of the grayscale image is considered to be the brightness of the image at that particular pixel. The main question to answer here is the definition of "center." We cannot simply take the central pixel of the image as the center of the galaxy, as the center of every galaxy will be different. According to [24] the central brightness of the galaxy is given by:

\Sigma_0 = 5.3567 \Sigma_h

where \Sigma_h is the surface brightness at the half-light radius, i.e. the radius within which half of the light is contained.

Once we have the central brightness of the galaxy we can estimate the scale radius of the galaxy. This is done by first calculating the brightness of the galaxy at the scale radius, which is the central brightness reduced by a factor of e [23], and then starting from the center of the image and moving through the image in a box of incrementally growing size (1 x 1 at a time), summing the pixel values until we get close to the brightness of the galaxy at the scale radius. One half the size of the box gives us the scale radius.

For the dark matter density in the binary form of the input image, the only change from the above method is the definition of the center of the galaxy. Here we follow the method described in Section 2.1 to estimate the central contour, which is then used as the center of the galaxy.
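The growing-box search can be transcribed almost literally into Python. The sketch below is our own illustration, not the thesis implementation: the function name is hypothetical, NumPy is assumed, and the central brightness is assumed to have been computed beforehand as described above.

import numpy as np

def estimate_scale_radius(gray, central_brightness):
    # Brightness at the scale radius: the central brightness reduced by a factor of e
    target = central_brightness / np.e
    cy, cx = gray.shape[0] // 2, gray.shape[1] // 2
    # Grow a box around the center, summing pixel values, until the sum
    # approaches the target; half the box size is the scale radius
    for half in range(1, min(cy, cx)):
        box = gray[cy - half:cy + half + 1, cx - half:cx + half + 1]
        if box.sum(dtype=np.float64) >= target:
            return half
    return min(cy, cx)  # fall back to the largest box that fits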
2.3 Disk / Bulge ratio

As mentioned before, detecting the bulge in an image that is viewed head-on (Figure 5b) is virtually impossible. In [25] the Disk / Bulge ratio is defined as:

D/B = 0.28 (R_s / R_h)^2 \Sigma(R_s) / \Sigma(R_h)

where R_s is the scale radius, R_h is the half-light radius, \Sigma(R_s) is the surface brightness at the scale radius, and \Sigma(R_h) is the surface brightness at the half-light radius.

The first step is to convert the input image into grayscale and calculate the central brightness of the galaxy as in Section 2.2 above. Once we have the central brightness of the galaxy we can calculate the scale radius by finding the radius where the central brightness is reduced by a factor of e, and the half-light radius by finding the radius where the central brightness is reduced by a factor of 2. We then start from the center of the image and move through the image in a box of incrementally growing size, summing the pixel values until we get close to the brightness of the galaxy at the scale and half-light radii. One half the size of the box gives us the scale radius and the half-light radius respectively.

2.4 Circularity

To calculate the circularity [16] of a shape in the image, we first need to extract the shape with as little information loss as possible. We focus on two types of circularity in this project: galaxy central circularity and complete galaxy circularity. Although the central question is about the shape of the complete galaxy, the central part of the galaxy is also considered because the central part of most spirals is barred (Section 2.1), and for circular or elliptical galaxies the central part resembles the overall galaxy shape. The circularity of a shape is described by the equation:

C = 4 \pi A / P^2

where A is the area of the shape and P is the perimeter of the shape.

To estimate the central part of the galaxy we follow the method described in Section 2.1 and then calculate the area and arc length of the contour obtained. For calculating the circularity parameter of the complete galaxy we must first convert the image into its binary form. Directly thresholding the image to convert it into binary format is not considered here for two reasons:

1. In some cases it results in the distortion of the shape, and in some cases the loss of the entire shape.
2. For every image the threshold value might be different.

OpenCV provides an implementation of the Otsu method [26] for estimating an ideal threshold, which works by calculating the optimal threshold between the foreground and the background pixels. This project takes a slightly different approach for more effective thresholding. The image is first converted into its grayscale format and then the Laplacian of Gaussian (LoG) of the image is calculated. The LoG is then subtracted from the grayscale image to remove background noise. The resulting image is then converted to HSV, and a value (V) parameter between 20 and 255 is used to define the shape in the binary image, i.e. the pixels in the HSV image having a value parameter between 20 and 255 are set to 255 (white) in the binary image. The binary image obtained usually gives a reasonable estimate of the shape of the galaxy, but has a few disconnected points. The approach taken to connect the image is to perform a morphological dilation: scan through all the pixels in the binary image, and if any pixel has a white pixel in its neighborhood, it is also set to be a white pixel. Figure 7 shows this process.

a) Original Image b) Image - LoG c) Binary Image d) Filled Binary Image

Figure 7. Thresholding of image to get the shape.
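The thresholding pipeline just described can be sketched in Python with OpenCV. The thesis implementation used the C++ API; the function name below is our own, and the Gaussian and dilation kernel sizes are illustrative assumptions rather than the values used in the thesis.

import cv2
import numpy as np

def galaxy_binary_mask(bgr):
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    # Laplacian of Gaussian: smooth first, then apply the Laplacian
    log = cv2.Laplacian(cv2.GaussianBlur(gray, (5, 5), 0), cv2.CV_8U)
    # Subtract the LoG from the grayscale image to remove background noise
    denoised = cv2.subtract(gray, log)
    # Convert to HSV and keep pixels whose value (V) channel lies in [20, 255]
    hsv = cv2.cvtColor(cv2.cvtColor(denoised, cv2.COLOR_GRAY2BGR),
                       cv2.COLOR_BGR2HSV)
    binary = cv2.inRange(hsv[:, :, 2], 20, 255)
    # Morphological dilation connects the few remaining disconnected points
    return cv2.dilate(binary, np.ones((3, 3), np.uint8))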
To calculate the area of the resulting image, the moment M00 of the image [28] is calculated. To estimate the perimeter, the arcLength() function of OpenCV is used.

2.5 Black / White pixel ratio

To calculate the B/W pixel ratio parameter we follow the same process defined in Section 2.4 to get the binary image shown in Figure 7c). The numbers of black and white pixels in the image are counted and the ratio obtained.

2.6 Convexity

The convexity of a shape is defined by [17]:

C_x = P / (2H + 2W)

where P is the perimeter of the shape, H is the height of the bounding rectangle, and W is the width of the bounding rectangle.

To calculate the convexity we follow the same process defined in Section 2.4 to produce the binary image shown in Figure 7c). We find the bounding rectangle of the obtained contour using the OpenCV function boundingRect() and calculate the height and width of the rectangle. To obtain the perimeter we use the arcLength() function as in Section 2.4.

2.7 Form Factor

The Form Factor of a shape is defined in [17] as:

F = A / P^2

where A is the area of the shape and P is the perimeter of the shape.

As described in Section 2.4 above, the area of the shape obtained in Figure 7c) is computed from the moment M00, and the perimeter with the arcLength() function.

2.8 Bounding rectangle to fill factor

After obtaining the binary image as described in Section 2.4 and shown in Figure 7c), and calculating the bounding rectangle using boundingRect() as described in Section 2.6, we obtain the bounding rectangle to fill factor parameter, which is described in [17] by the equation:

B_x = A / (H \cdot W)

where A is the area of the shape and H * W is the area of the bounding rectangle.

2.9 Count of line intersection of the shape

After obtaining the binary image as described in Section 2.4 and shown in Figure 7c), we obtain the contour with the largest area from the image. A line is then drawn starting from the center of the shape at every degree of angle from 0° to 360°, and we count the number of times the binary intensity changes along each line.

2.10 Maximum values of Red and Blue channels

To calculate this parameter the image is split into its Red, Green, and Blue channels. The green channel of the image is then set to zero and the red and blue channels are added together. This is done because spirals have red disks and blue arms, while non-spirals are generally red. The maximum value of each of the two channels is then calculated using the minMaxLoc() function of OpenCV.

2.11 Concentration index

The concentration index [20] can be expressed as:

CI = R_{90} / R_{50}

where R90 is the radius which contains 90% of the Petrosian flux, and R50 is the radius which contains half of the Petrosian flux.

For this parameter we do not need any preprocessing and can work directly with the RGB image converted to grayscale. Our first aim is to find the Petrosian radius, which is defined as the radius where the Petrosian ratio of the light from the galaxy is equal to a predefined value, usually 0.2 [21]; this project takes the predefined value to be between 0.17 and 0.22. The intensity of light from the galaxy at the Petrosian radius is given by the equation [20]:

I(R_P) = k \int_0^{R_P} I(r) \, 2\pi r \, dr \, / \, (\pi R_P^2)

where k is a constant (1 for this project).
The Petrosian ratio R_P(r) at a radius r from the center of an object is defined in [30] as "the ratio of the local surface brightness in an annulus at r to the mean surface brightness within r." In equation form:

R_P(r) = \left[ \int_{0.8r}^{1.25r} 2\pi r' I(r') \, dr' \, / \, \left( \pi (1.25^2 - 0.8^2) r^2 \right) \right] \Big/ \left[ \int_0^r 2\pi r' I(r') \, dr' \, / \, (\pi r^2) \right]

The Petrosian radius R_P is then the radius at which this ratio equals the predefined value. To calculate R_P(r) we start from a radius of 20 px in the image and go up to 45 px, computing the intensity within the radius. We also compute the intensity of the image within two more radii: an upper radius of 1.25 times the radius, and a lower radius of 0.8 times the radius. We then apply the above equation to the calculated values and compute the Petrosian ratio. The radii at which the ratio lies between 0.17 and 0.22 are stored, and the maximum such radius is taken to be the Petrosian radius. If we do not find any such radius in the image, we set the Petrosian radius to 44.8998, the maximum value of the radius that we loop through in the image when calculating R_P(r).

We then calculate the Petrosian flux, given by:

F_P = \int_0^{2 R_P} 2\pi r' I(r') \, dr'

where R_P is the Petrosian radius.

We then calculate the values of the parameters R90 and R50 by passing through the image, starting from the center, and calculating the flux for incrementally larger radii until we find the flux that is around 90% and 50% of the Petrosian flux respectively. We can then calculate the concentration index.

2.12 Aspect Ratio

As described in the previous chapter, the aspect ratio can be interpreted as the ratio between the width and height of the bounding rectangle of the galaxy. We use the method described in Section 2.4 to obtain the binary image of Figure 7c). The contour with the maximum area and maximum arc length is then calculated, which gives the galaxy as a standalone object. The bounding rectangle is then calculated, and the aspect ratio is defined as:

AR = W / H

2.13 Extent

The extent of a shape is given by the equation:

Extent = CA / (H \cdot W)

where CA is the contour area and H * W is the bounding rectangle area.

As described in Section 2.12, we obtain the contour area and the bounding rectangle, from which the bounding rectangle area can be calculated.

2.14 Red to Blue color intensity ratio

We obtain the bounding rectangle for the image as described in the previous section. The original image is then cropped to the size of this bounding rectangle. Figure 8 shows the process. As can be seen in Figure 8c), we obtain a representation of the galaxy without most of the background noise, which in these images is generally stars and dust. After we obtain the modified image, the mean value of the intensities of each of the channels, i.e. Red, Green and Blue, is calculated, and then the ratio between the values obtained for the Red and Blue channels is computed.

2.15 Fitted ellipse angle

To calculate this parameter we follow the process defined in the previous section to obtain the contour shown in Figure 8b). A best-fit ellipse is then calculated for the obtained shape using the OpenCV function fitEllipse(). The angle member of the returned structure gives the value of the parameter.

a) Original Image b) Obtained Contour c) New Cropped Image
Figure 8. Process of cropping the image to the size of the bounding box.

2.16 Fitted Ellipse Height to Width ratio

To calculate this parameter we follow the process defined in Section 2.14 to obtain the contour shown in Figure 8b). A best-fit ellipse is then calculated for the obtained shape using the OpenCV function fitEllipse(). The height and width of the fitted ellipse can be read from the size member of the returned structure, and thus the parameter can be calculated.

2.17 Color coherence vector (CCV)

Consider the following example to understand the CCV concept more clearly. For this example the color space is discretized to define six colors, and we set T = 3, where T is the minimum number of connected pixels that share the same color.

1 1 1 1 4
2 1 4 4 4
2 4 4 6 6
1 4 4 5 5
1 1 3 5 5

Figure 9. Discretizing and finding the coherent regions.

In Figure 9, the pixels numbered 1, 4 and 5 are considered coherent because the number of pixels in each connected region equals or exceeds T = 3. On the other hand, the pixels numbered 2, 3 and 6 are considered incoherent. The descriptor for this example is shown in Table 3, where C and I denote the number of coherent and incoherent pixels respectively.

Color 1    Color 2    Color 3    Color 4    Color 5    Color 6
C    I     C    I     C    I     C    I     C    I     C    I
8    0     0    2     0    1     8    0     4    0     0    2

Table 3. The resulting vector.

The first step is to increase the contrast of the input image to enhance some of the regions with sparse patterns. The image is then converted to the HSV color space. The next step is to discretize this HSV color space into 64 colors using the Hue, Saturation, and Value parameters of the color space. To check whether each pixel belongs to a coherent region or not, we count the number of pixels in each bin against the T parameter, as required by the definition of a coherent region. The value of T was varied from 0 to the size of the image (i.e. 180 x 180 = 32400). For this parameter we modify the approach suggested in [19] in two ways:

1. [19] blurs the image before starting the discretization process to eliminate slight variations between adjacent pixels. For this project we do not blur the image, as this would discard the peak intensities and impact the parameter value adversely.
2. [19] sets the value of the parameter T to around 1% of the size of the image. This project tested that setting but found that the best results are obtained with a slightly higher value of T; the optimum value was found to be 1.54% of the size of the image.

Figure 10 shows the resulting images for different values of T.

a) Spiral Image b) Circular Image

Figure 10. CCV visualization for T = 500.

This chapter completes the first step, feature extraction, as shown in Figure 3. After the features are designed, we need to combine them to form a dataset that can be provided to the different machine learning classifiers, thereby completing the second step of the classification system. The next chapter describes this second step: it presents two machine learning algorithms applied to the obtained dataset, along with the reasons for choosing them.

CHAPTER 3
CLASSIFICATION ALGORITHMS

This chapter describes the different machine learning algorithms that were used to classify the galaxy images from the extracted features. Since we are dealing with a binary classification problem, i.e.
the classification of the examples into positive (spiral) and negative (non-spiral) classes, we take a supervised learning approach and use two algorithms:

1. Logistic Regression.
2. Boosting, using decision stumps as weak classifiers.

3.1 Introduction to algorithms

3.1.1 Logistic Regression

The term regression refers to finding a best-fit line for provided data points, i.e. a line that gives the best approximation of the data based on some parameters or features. Figure 11 visualizes this idea for the data points (0, 0), (1, 4.1), (2, 9.7), (3, 8.8), (4, 4.2), (5, 6.1), (6, 7.8).

Figure 11. Best fit line for a set of data points.

The data points that lie near or at the best-fit line can be predicted reliably. As we move to data points lying away from the line, their prediction becomes less reliable. The term logistic refers to the logistic function, which is a sigmoid with the equation:

\sigma(t) = 1 / (1 + e^{-t})

The logistic function is useful due to the property (mentioned in [31]) that its input can take any value from -∞ to +∞ but its output always lies between 0 and 1, as shown in Figure 12. The input t can also be viewed as a linear combination of different features, so for n features t can be written as:

t = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n

Figure 12. A sigmoid function example.

As can be seen, irrespective of the input value the logistic function always outputs values between 0 and 1. Thus in logistic regression we obtain the best fit for a set of data points by learning the sigmoid function. In practice we learn the conditional distribution P(y|x), where x is the input (example) and y is the class label, 0 or 1 in the case of binary classification. If we consider v to be the weight vector over the features, we get the following two equations:

P(y = 1 | x; v) = 1 / (1 + e^{-v \cdot x})
P(y = 0 | x; v) = 1 - P(y = 1 | x; v)

The first equation gives the probability that an example belongs to class 1, and the second the probability that an example belongs to class 0.

3.1.2 Boosting

The drawback of the logistic regression technique is the absence of an implicit feature selection process, so "bad" features can affect the accuracy of the algorithm negatively. Thus logistic regression on a classification problem with a large number of features, as in the case of this thesis, can result in low accuracy. Boosting addresses this problem, as the AdaBoost algorithm [33] implicitly contains a feature selection process called "feature boosting." Michael Kearns in [32] tries to answer the hypothesis boosting problem, which asks whether the existence of an efficient learning algorithm whose output hypothesis performs only slightly better than random guessing implies the existence of an efficient learning algorithm whose output hypothesis achieves high accuracy. In simpler terms, this problem asks whether "a set of weak learners can be combined into a strong learner."

In this project we make use of the AdaBoost algorithm as described in [33], which tries to find a weighted combination of classifiers that fits the data well. It iteratively applies a weak classifier (for this project, a decision stump, i.e. a decision tree of unit height) to the dataset and maintains a distribution of weights over every example in the dataset.
Initially all the examples are assigned the same weight, which is generally 1 / (number of examples in the dataset). After every call to the weak classifier, the weights of the incorrectly classified examples are increased and the weights of the correctly classified examples are decreased, so that the weak classifier focuses more on the incorrectly classified examples in every round. Thus we can think of boosting as an algorithm that tries to rectify the mistakes of the previous step in the immediately following step. The algorithm is shown in Figure 13.

Figure 13. The AdaBoost algorithm adopted from [33].

3.2 Advantages and disadvantages of these algorithms

The reasons for choosing these algorithms for this project are as follows:

1. Logistic Regression is well suited to binary classification, as the sigmoid function naturally creates a single decision boundary.
2. Logistic regression has low variance and so is less prone to over-fitting.
3. Boosting reduces both variance and bias. The bias is reduced in the initial iterations and the variance in later iterations.
4. Boosting has the concept of "feature boosting" intrinsic to it, which resembles a feature selection process and thus forces the classification algorithm to focus on the features that are more important with respect to the data.

Logistic Regression and Boosting also have some drawbacks:

1. Noise and outliers in the data affect boosting negatively, as it may keep trying to classify the outliers, thereby increasing the convergence time.
2. Boosting has a long training time.
3. Logistic Regression fails for prediction of continuous outcomes.
4. Unlike boosting, Logistic Regression does not automatically perform feature selection.
5. Logistic Regression does not handle missing values.

CHAPTER 4
DEEP LEARNING

This chapter introduces the concept of deep learning and explains why it has been so successful in image classification tasks.

4.1 Introduction to Deep Learning

The basic idea of deep learning emerged from the concept of artificial neural networks (ANN), which in turn are inspired by the biological neurons in the human brain that carry electric signals to and from the brain. An ANN connects several layers of neurons (Figure 14b) to form an artificial network (Figure 14a) and is typically defined by three types of parameters:

1. Patterns: the pattern of connections between the different layers of neurons.
2. Learning: the process of learning used for updating the weights of the neural connections.
3. Activation Function: the function that converts a neuron's weighted input into the neuron's output.

Deep learning refers to a class of machine learning techniques where the input is passed through multiple layers of processing for feature learning. Several techniques like deep neural networks (DNN), deep belief networks (DBN), and Restricted Boltzmann Machines (RBM), which are described below, are specific implementations of deep learning [37].

Figure 14. a) An Artificial Neural Network and b) A neuron.

4.1.1 Deep Neural Network

A deep neural network (DNN), as the name suggests, is an ANN with multiple hidden layers between the input and output layers. As every layer in the network identifies features present in its input, the extra layers in a DNN create features from the lower layers, thereby modeling complex data with fewer parameters than a similarly performing ANN [37].
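To make the ideas of layers and activation functions concrete, the following minimal NumPy sketch (our own illustration; the layer sizes and random weights are placeholders, not a trained model) passes an input vector through two hidden layers using the sigmoid activation function:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 1]   # input, two hidden layers, output (illustrative)
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = rng.normal(size=4)            # an example input vector
for W, b in zip(weights, biases):
    x = sigmoid(x @ W + b)        # each layer: weighted input -> activation
print(x)                          # the output lies in (0, 1)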
4.1.2 Deep Belief Network

A deep belief network (DBN) is a type of DNN in which connections exist only between the visible and hidden layers, not among visible-visible units or hidden-hidden units within a layer. The main idea behind the DBN is that a preceding hidden layer serves as the visible layer for the next hidden layer. As shown by Hinton et al. [38], DBNs can be trained one layer at a time, stacking every trained layer over the previous one, thereby giving the network a deep hierarchical architecture. Every layer of the DBN is constructed from Restricted Boltzmann Machines (RBM), which are described in the next section.

4.1.3 Restricted Boltzmann Machines

Before describing Restricted Boltzmann Machines (RBM), it is necessary to understand the following concepts.

4.1.3.1 Energy based models

Energy based models (EBM) associate a cost function, termed the energy, with every variable of interest in the system, or, as LeCun, Chopra et al. define it in [38], "Energy-Based Models (EBMs) capture dependencies between variables by associating a scalar energy to each configuration of the variables." These models learn by minimizing the energy function associated with the system. Figure 15 shows an example of an EBM, where the output shows the correspondence between X and Y.

Figure 15. An EBM that measures the compatibility between observed variables X and variables to be predicted Y using the energy function E(X, Y) [38].

4.1.3.2 Boltzmann Machines

Hinton & Sejnowski introduced the idea of Boltzmann machines in their 1986 paper [39] as a neural network of stochastic units with an energy associated with the network. Figure 16 shows an example of a Boltzmann machine consisting of four visible units v1, v2, v3 and v4 and three hidden units h1, h2 and h3. All the units are connected to each other, subject to the following restrictions:

1. A unit cannot be connected to itself.
2. All connections are symmetrical.

Figure 16. A Boltzmann machine in graphical format adopted from [40].

The energy function of a Boltzmann machine is defined in [40] as:

E = -\left( \sum_{i<j} w_{ij} s_i s_j + \sum_i \theta_i s_i \right)

where w_{ij} is the connection strength between units i and j, s_i is the state of unit i with s_i ∈ {0, 1}, and \theta_i is the bias of unit i.

The probability of unit i taking the value 1 is given by the equation below, as defined in [40]:

p_{i=1} = 1 / (1 + \exp(-\Delta E_i / T))

where \Delta E_i is the energy gap of unit i and T is the temperature of the system.

Restricted Boltzmann machines are a variant of the Boltzmann machines described above, the difference being the absence of connections between visible-visible and hidden-hidden units. Figure 17 shows an example of a restricted Boltzmann machine consisting of three visible units and four hidden units.

Figure 17. A restricted Boltzmann machine in graphical format adopted from [41].

The energy function of an RBM is defined in [41] as:

E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_i \sum_j v_i w_{ij} h_j

where w_{ij} is the weight associated with the connection between hidden unit h_j and visible unit v_i, and a_i, b_j are the bias weights of the visible and hidden units respectively.

Hinton & Salakhutdinov showed in [42] that RBMs can be trained individually and then stacked on top of each other to form a deep belief network, described in Section 4.1.2, and thus can be represented as shown in Figure 18.

Figure 18. A deep belief network with stacked RBMs adopted from [43].

Hinton et al.
proposed a fast training algorithm for DBNs in [44], which is summarized in Figure 19.

Figure 19. Training algorithm for DBNs adopted from [45].

4.2 Success of Deep Learning

The factors that have contributed to the success of deep belief networks range from the huge increase in the size of datasets at one end of the spectrum to the fine-tuning of the algorithms themselves at the other. The following factors have contributed to the success of the deep learning approach, especially in object recognition problems:

1. Huge Datasets: The size of datasets has increased drastically. To understand why this has contributed to the success of the deep learning approach we need to go back to the paper by Valiant [46], which shows that a machine having the following three properties can be designed:

(A) The machines can provably learn whole classes of concepts. Furthermore, these classes can be characterized.
(B) The classes of concepts are appropriate and nontrivial for general-purpose knowledge.
(C) The computational process by which the machines deduce the desired programs requires a feasible (i.e. polynomial) number of steps.

Point (C) is of the utmost importance here, as it can be interpreted as follows, as given in [47]: "if you have a finite number of functions, say N, then every training error will be close to every test error once you have more than log N training cases by a small constant factor and thus there will be practically no overfitting." For example, our dataset has images of size 424 x 424, and thus a perfect machine which takes the raw pixels as input has to choose among around 2^180000 possible models to learn a perfect one, which is a huge number. We can do some preprocessing and down-sample the images to a size of around 60 x 60, thereby reducing this number to 2^3600, which is still very large, and thus obtaining a perfect model is still very difficult. However, this example, as shown in [47], can make a case for neural networks being a good fit for large amounts of data. If we consider a neural network with X parameters and take every parameter to be of type float (32 bits), then the total number of bits in the neural network is 32X and there can be 2^32X distinct neural networks (as a bit is binary). When we have more than 32X training examples, the chances of overfitting are drastically reduced, as described above. Thus we need a deep neural network with a large number of parameters.

2. Faster Computers: With the advent of Graphics Processing Units (GPUs) we can now build large neural networks and still have relatively fast training times.

3. Fine-tuning of the training algorithm: Stochastic Gradient Descent (SGD) has been very successful in the training of deep neural networks, since the SGD algorithm does not need to remember the examples visited in previous iterations and thus can converge faster when training on large datasets, as shown in Figure 20.

Figure 20. SGD convergence time adopted from [48].

CHAPTER 5
RESULTS

In this chapter we first analyze the available data and discuss the process of obtaining labels for the provided examples. Then the results obtained from applying machine learning and deep learning algorithms to our data are presented.

5.1 Available data

The SDSS database [4] provides us with the images of the galaxies and the human annotated data. The total number of images provided is 61578. Before proceeding further we need to understand the human annotated data.
Chapter 5. Results

In this chapter we first analyze the available data and discuss the process of obtaining labels for the provided examples. Then the results obtained from applying machine learning and deep learning algorithms to our data are presented.

5.1 Available data

The SDSS database [4] provides us with the images of the galaxies and the human-annotated data. The total number of images provided is 61578. Before proceeding further we need to understand the human-annotated data.

In the Galaxy Zoo project [5], volunteers were asked to classify the galaxy images into elliptical, spiral and mergers (if the image contains merging galaxies). Figure 21 shows the decision tree that was used to guide the classification process. An example walk through the decision tree is as follows:

Is the galaxy simply smooth and rounded, with no sign of a disk?
  (yes) How rounded is it? (yes) .... (no) ......
  (no) Could this be a disk viewed edge-on?
    (yes) Is there a sign of a bar feature through the centre of the galaxy?
    (yes) / (no) Is there any sign of a spiral arm pattern?

Figure 21. The Galaxy Zoo classification decision tree, adopted from [49].

The last question is the one we are interested in answering in this thesis. The volunteers classified the galaxy images into several classes. A snapshot of the various classes and of the actual human-annotated data is presented in Figures 22 and 23 respectively. In Figure 22, the Task column refers to the classes that the images were classified into, with the responses being the subclasses. As an example, for question 4, if the answer to the question "is there any sign of a spiral arm pattern?" is "yes", then the image belongs to class 4.1, and if "no", it belongs to class 4.2. So one image can belong to several classes.

Figure 22. A snapshot of the decision tree in question form, adopted from [49].

Figure 23. The human-annotated data. The values under the various classes give the fraction of volunteers who answered that the galaxy belonged to the given class. As an example, for galaxy id 100008 the value in class 4.1 is 0.418398 and in class 4.2 is 0.198455, i.e. about 41% of volunteers considered the image to have a spiral arm pattern and about 20% did not. Since it is a decision tree and a question might not be reached depending on the responses to the previous questions, the values need not sum to 1, i.e. we cannot treat them as pure probabilities.

5.2 Logistic Regression and Boosting

As we are not given labels, only the fraction of people answering "yes" to each question, we need to create labels before building a classification model. For the application of the machine learning algorithms we apply the following condition (Condition 1) to create labels for the images:

Condition 1: If the difference between the fraction of people answering question 4 (is there any sign of a spiral arm pattern?) in the positive and in the negative is more than 60% (0.6), the label is assigned 1 (positive) or 0 (negative) accordingly. In other words, if the positives outnumber the negatives by 60% or more, the label assigned is positive, and vice versa. Other images are not taken into account.

By using the above condition we obtain 2983 examples: 1774 positive and 1209 negative. Thus, as a baseline, if we classify all examples as positive we still achieve 1774/(1774+1209), or about 59%, accuracy.
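A minimal pandas sketch of Condition 1 follows; the file name and the column names Class4.1 and Class4.2 are assumptions about the layout of the Galaxy Zoo solutions file, not verbatim from the thesis.

```python
import pandas as pd

# Hypothetical solutions file: one row per galaxy, with the fraction of
# volunteers answering "spiral arm pattern: yes" (Class4.1) and "no" (Class4.2).
df = pd.read_csv("galaxy_zoo_solutions.csv")

diff = df["Class4.1"] - df["Class4.2"]
spirals = df[diff >= 0.6].assign(label=1)       # positives outnumber by >= 0.6
non_spirals = df[diff <= -0.6].assign(label=0)  # negatives outnumber by >= 0.6
labeled = pd.concat([spirals, non_spirals])     # all other images are dropped
print(len(labeled), "labeled examples")
```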
We use 10-fold cross validation for both algorithms. Before going further, it is important to discuss the cross validation technique and why it is used in this thesis. Generally, to test a machine learning method the dataset is divided into two parts as shown in Figure 24: a training set (to train the classifier) and a test set (to test the performance of the trained classifier on new examples).

Figure 24. Splitting the dataset into two different sets.

In the N-fold cross validation technique, we run the experiment on the dataset N times, and for every run the training and test sets are picked randomly according to a user-specified split. Figure 25 shows the process more clearly.

Figure 25. N-fold cross validation (here N = 5). The unfilled regions represent the training set for each iteration and the filled region the test set.

The advantage of using a large N is that every example in the dataset is used for training or testing at least once, so bias is reduced drastically; however, the variance increases, as does the computation time, since the experiment must be run N times. With a small N the bias increases but the variance and the computation time are reduced.

Using this method, we use the logistic regression and boosting implementations of WEKA [50], with the parameters shown in Figure 26 for logistic regression and Figure 27 for boosting.

Figure 26. Logistic Regression WEKA parameters.

Figure 27. AdaBoost WEKA parameters.

We use 10-fold cross validation, with every step having a 70%-30% training-to-testing split. We obtain the confusion matrices shown in Figure 28 for logistic regression and Figure 29 for boosting.

                     Predicted class
Actual class     Yes        No
Yes              1502       272
No               375        834

Figure 28. The confusion matrix for logistic regression.

                     Predicted class
Actual class     Yes        No
Yes              1520       254
No               428        781

Figure 29. The confusion matrix for AdaBoost.

In this layout, the confusion matrix reads as:

                     Predicted class
Actual class     Yes                 No
Yes              True Positive       False Negative
No               False Positive      True Negative

We measure the quality of the model by the following measures:

1. Accuracy: the ratio of the number of correctly classified instances, both positive and negative, to the total number of examples.

2. F1 measure: the harmonic mean of precision and recall. Precision is the number of correct positive results divided by the number of all positive results, in other words the fraction of retrieved instances that are relevant. Recall is the number of correct positive results divided by the number of positive results that should have been returned, in other words the fraction of relevant instances that are retrieved. The concepts of precision and recall are illustrated in Figure 30.

Figure 30. Precision and Recall, adopted from [51].

3. AUC-ROC curve: a receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate for every possible classification threshold. Before continuing it is important to understand the meaning of "for every possible classification threshold". Fawcett [52] explains this concept in detail, and the following explanation is based on his work. We deal with two kinds of classifiers: discrete and probabilistic. A discrete classifier only produces the label for a given example, e.g. 0 or 1 (negative or positive), and thus produces only a single point in ROC space. A probabilistic classifier produces a numeric score, which may or may not be a probability, representing the "degree to which a given example is an instance of a particular class." If the score is above a threshold, the classification threshold introduced above, the classifier outputs 1 (positive), else 0 (negative). A simple method to obtain a ROC curve under cross validation, implemented in WEKA, is to collect the probabilities for the instances in the test folds and sort them. The area under the ROC curve (AUC) measures the ability of the classifier to correctly classify the examples in question: a perfect classifier has an ROC area equal to 1, a near-optimal classifier has a value approaching 1, and a value of 0.5 is comparable to random guessing.
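As a check on these definitions, the following snippet recomputes the Table 4 numbers from the logistic regression confusion matrix in Figure 28; it is a worked example of the formulas above, not code used in the thesis.

```python
# Figure 28 confusion matrix for logistic regression.
tp, fn = 1502, 272   # actual "yes" row
fp, tn = 375, 834    # actual "no" row

precision = tp / (tp + fp)                    # 1502 / 1877 -> ~0.800
recall = tp / (tp + fn)                       # 1502 / 1774 -> ~0.847
f1 = 2 * precision * recall / (precision + recall)   # ~0.823
accuracy = (tp + tn) / (tp + fn + fp + tn)           # 2336 / 2983 -> ~78.31%

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"F1={f1:.3f} accuracy={accuracy:.2%}")
```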
The results of logistic regression and boosting are shown in Tables 4 and 5 respectively.

                 Precision   Recall   F-Measure   ROC Area
Class = yes      0.800       0.847    0.823       0.849
Class = no       0.754       0.690    0.721       0.849
Accuracy (correctly classified instances): 78.31%

Table 4. Logistic Regression results.

                 Precision   Recall   F-Measure   ROC Area
Class = yes      0.780       0.857    0.817       0.831
Class = no       0.755       0.646    0.696       0.831
Accuracy (correctly classified instances): 77.13%

Table 5. Boosting results.

As can be seen from the tables, logistic regression performs better than AdaBoost. The ROC curves for both algorithms, as obtained from WEKA, are shown in Figure 31, and a comparison between the two curves is shown in Figure 32.

Figure 31. ROC curves for Logistic Regression (above) and AdaBoost (below); both plot the true positive rate against the false positive rate.

Figure 32. Comparing ROC curves for Logistic Regression and Boosting. Logistic Regression is the upper curve, with AUC = 0.849, compared to AUC = 0.831 for AdaBoost.

5.3 Deep Learning

For deep belief networks we used the DBN implementation of the Python nolearn module [53]. The first step was to create a dataset compatible with the DBN. Initially, the same dataset of 2983 examples that was used for the machine learning algorithms is used. Each image is converted to grayscale, cropped to 180 x 180, and then down-sampled to 69 x 69 to keep the size of the network manageable. Since the network takes raw pixels as input, the number of units in the input layer equals the number of pixels in the image, which makes the down-sampling step important. The pixels are then extracted from the images; the resulting data format is shown in Figure 33.

Figure 33. Dataset format for feeding into the DBN. Every row represents an image, every number in a row (separated by semicolons) gives a pixel value, and the label for the image is appended at the end.

We then convert the data into two numpy arrays, one for the pixel values and one for the labels, divide the data into training and testing sets with a 70%-30% split, and train a network with three hidden layers of 1000, 500 and 500 units, with learning rates of 0.15, 0.12 and 0.1 respectively. The results are shown in Figure 34.

                 Precision   Recall   F1-score   Support
Class = 0        0.60        0.52     0.56       374
Class = 1        0.69        0.74     0.71       521
Average / Total  0.65        0.65     0.65       895

Figure 34. DBN results for a small dataset of 2983 images.

Since DBNs require a large amount of data to train, the low precision and recall values are expected. The final column, "support", shows the number of images classified as positive (1) and negative (0). Since the total number of images was 2983 and the dataset was split in a 70:30 ratio, the test set has 895 images, of which 393 were labelled 0 and 502 were labelled 1; the DBN labels 521 images as 1 and 374 as 0.
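A sketch of this preprocessing and training pipeline is given below, assuming OpenCV for the image operations and the nolearn 0.5-era dbn module from [53]; the file paths, epoch count, and the use of a single learning rate are illustrative assumptions rather than the exact code used in the thesis.

```python
import glob
import cv2
import numpy as np
from nolearn.dbn import DBN  # nolearn 0.5-era API, as in [53]

def preprocess(path):
    """Grayscale, centre-crop to 180x180, down-sample to 69x69, flatten."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # 424x424 input image
    cy, cx = img.shape[0] // 2, img.shape[1] // 2
    crop = img[cy - 90:cy + 90, cx - 90:cx + 90]   # central 180x180 region
    small = cv2.resize(crop, (69, 69))
    return small.flatten() / 255.0                 # scale pixels to [0, 1]

paths = sorted(glob.glob("images/*.jpg"))          # hypothetical image folder
X = np.array([preprocess(p) for p in paths])
y = np.load("labels.npy")                          # labels from Condition 1 (hypothetical file)

# 69*69 = 4761 input units, three hidden layers, 2 output classes.
net = DBN([X.shape[1], 1000, 500, 500, 2],
          learn_rates=0.1,   # the thesis used per-layer rates 0.15/0.12/0.1
          epochs=10)         # illustrative epoch count
net.fit(X[:2088], y[:2088])                        # ~70% train
preds = net.predict(X[2088:])                      # ~30% test
print((preds == y[2088:]).mean())
```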
The DBN is then given all 61578 images, but the condition for obtaining the labels is changed from Condition 1, described in section 5.2, to Condition 2 below.

Condition 2: Labels are assigned in three passes. First, if the difference between the fraction of people answering question 4 (is there any sign of a spiral arm pattern?) in the positive and in the negative is more than 60% (0.6), the label is assigned 1 (positive) or 0 (negative) accordingly. Next, if more than 50% of the people answered either positive or negative, the label is assigned 1 or 0 accordingly. For the remaining images, if the difference between positive and negative answers is more than 25% (0.25), the label is assigned 1 or 0 accordingly.

Here the split is 67%-33%, leaving us with 20321 test images, of which 16445 were labelled 0 and 3876 were labelled 1, making the majority-class baseline about 0.8. The results are shown in Figure 35.

                 Precision   Recall   F1-score   Support
Class = 0        0.84        0.96     0.90       16452
Class = 1        0.57        0.23     0.33       3869
Average / Total  0.79        0.82     0.79       20321

Figure 35. DBN results for the complete dataset of 61578 images.

The results show a jump in the overall score from 0.7 to 0.8, but one concern is that the dataset is class-imbalanced, so the result is not much of an improvement over the majority-class baseline. This concern is not addressed in this thesis.

5.4 Feature Analysis

In this section the features are analyzed separately to determine whether all the features behave similarly in the classification algorithm, or whether some features dominate the others in determining the results. The steps followed are (a Python sketch of this loop is given after the results figures below):

a) Find the features that do not perform well in the classification task by themselves.
b) Find the features that do perform well in the classification task by themselves.
c) Combine all the features obtained in a) into a single feature set.
d) Try combinations of the features obtained in step b) with the combined set from c) to determine how much the performance degrades or improves, i.e. whether the features that perform well dominate the rest or vice versa.

Logistic regression, using 10-fold cross validation with a 70%-30% split, is used to determine the performance of the features. The experiments yield the following results: when the worst performing features are combined together, the results are as shown in Figure 36. When they are combined with the best performing and the second best performing feature, the accuracy jumps, as shown in Figures 37 and 38 respectively.

Figure 36. Performance of all badly performing features together.

Figure 37. Performance of all badly performing features combined with the best performing feature.

Figure 38. Performance of all badly performing features combined with the second best performing feature.

As can be seen from the results, the better performing features dominate the badly performing ones, at least when logistic regression is used. The so-called bad performing features can also be classifier dependent: when tested on a multilayer perceptron, as an experiment, the same features give an accuracy of 67.38%, as shown in Figure 39.

Figure 39. Performance of all badly performing features together using an MLP.
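The following is a rough scikit-learn sketch of the loop in steps a)-d); the thesis used WEKA's logistic regression, and the feature names and synthetic data here are placeholders for the extracted features of Chapter 2.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
# Illustrative stand-ins for the extracted features (one column each here;
# real features such as CCV span several columns).
features = {name: rng.normal(size=(300, 1)) for name in
            ["bar", "disk_bulge", "ccv", "circularity", "aspect_ratio"]}

def cv_accuracy(cols):
    """Mean 10-fold CV accuracy of logistic regression on stacked columns."""
    X = np.hstack(cols)
    return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10).mean()

# Steps a), b): rank each feature by its accuracy when used on its own.
scores = {n: cv_accuracy([c]) for n, c in features.items()}
ranked = sorted(scores, key=scores.get)
bad, good = ranked[:-2], ranked[-2:]      # worst features vs the best two

# Step c): combine all badly performing features into one feature set.
base = [features[n] for n in bad]
print("bad combined:", cv_accuracy(base))

# Step d): add one good feature at a time to see which dominates.
for n in good:
    print(f"bad + {n}:", cv_accuracy(base + [features[n]]))
```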
The best performing features were CCV (Section 2.17); a combination of aspect ratio (Section 2.12), extent (Section 2.13), fitted ellipse angle (Section 2.15), fitted ellipse height to width ratio (Section 2.16) and red to blue color intensity ratio (Section 2.14); and circularity (Section 2.4). At the other end of the spectrum, the features that performed poorly were detection of bar (Section 2.1) and disk/bulge ratio (Section 2.3).

Chapter 6. Unsuccessful Experiments

This chapter summarizes some of the experiments that did not give the expected results for the problem at hand.

6.1 Blob Detection

Blob detection was tried in order to detect the presence of a bar in galaxy images. When we convert the images to their contrasted versions, the bar feature actually outshines the rest of the galaxy. As blob detection finds regions that differ in properties such as brightness, this method should be able to extract the area at the center of the image, since this area appeared to be of higher intensity than the rest of the image; contours could then be used for shape detection. This method was not adopted for the following reasons:

a) When the center of the contrasted image was examined pixel by pixel, it was found that the intensities were not equal.
b) The bars in the images lie at different angles, as shown in Figure 40, so a rotation-invariant feature detector would be needed.

Figure 40. Images having the bar feature at different angles.

6.2 Laplacian of Gaussian (LoG)

The LoG method was tried in order to count the number of arms of a given galaxy. Since arms are present only when the galaxy is spiral, this count could serve as a feature for our classifier. Because LoG is a negative bell-shaped function, we expected it to amplify the values at the ends of the contour image. That way we could find the points at the far ends of a spiral image, giving a good indication of the presence of arms, since for a spiral image the far ends are farther apart. The planned process was to find the distance between the farthest points in the image, giving a numeric value; a value above a certain threshold distance would indicate that the galaxy has arms and is thus spiral. This method was not adopted for the following reasons:

a) Some spiral galaxies have tightly bound arms, so even the farthest points relative to the arms might not cross the threshold.
b) The contour images were not accurate.
c) Some circular and elliptical galaxies cover a large area of the image, so the distance between their farthest points might easily exceed the threshold.

6.3 Discrete Fourier Transform (DFT)

The Fourier transform converts an image from the spatial domain to the frequency domain, and the DFT is a popular technique for obtaining the geometrical characteristics of an image. This method was considered for the edge-on galaxy images, an example of which is shown in Figure 41, and for images with galaxies rotated at an angle.

Figure 41. An edge-on galaxy image.

This method was not adopted for the following reasons:

a) The Fourier-transformed images do not give the desired information due to the noise in the images.
b) Most of the images contain face-on galaxies.
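For reference, the kind of spectrum we inspected can be computed as below; this is a generic NumPy/OpenCV sketch of the 2D DFT magnitude spectrum, with the file names as placeholders.

```python
import cv2
import numpy as np

img = cv2.imread("edge_on_galaxy.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path

# 2D DFT, shifted so the zero-frequency component sits at the centre.
spectrum = np.fft.fftshift(np.fft.fft2(img))

# Log-scaled magnitude; bright ridges reflect dominant orientations in the
# image, which is what we hoped would reveal edge-on disks.
magnitude = 20 * np.log(np.abs(spectrum) + 1)
out = cv2.normalize(magnitude, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
cv2.imwrite("spectrum.png", out)
```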
6.4 Case Based Reasoning (CBR)

Leake [54] states that reasoning has often been defined as a process that "draws conclusions by chaining together generalized rules, starting from scratch". CBR differs in two ways: first, the rules are replaced by cases, and second, chaining is replaced by retrieval and adaptation of cases. A simple CBR process is shown in Figure 42.

Figure 42. CBR process, adopted from [55].

This method was not adopted for the following reasons:

a) It is difficult to represent the data and to decide on the structure of the cases.
b) As described in [54], CBR exploits the relationship between the "space of problem descriptions" and the "space of problem solutions", as shown in Figure 43. This is a very interesting method and will be tried in future work.

Figure 43. CBR process relationships, adopted from [54].

Chapter 7. Conclusion and Future Work

In this chapter the contributions of the thesis are highlighted and the scope of future work is presented.

7.1 Contribution

The main contribution of this thesis is a set of novel features for classifying galaxy images. Given the noisy nature of the images and the limited interpretability of neural networks, this thesis provides features that can be used for the classification of galaxies into spiral and non-spiral. This feature-design approach also provides a better understanding of the images. We designed 22 different features, used simple machine learning techniques for classification, and obtained good results. This thesis shows that hand-designed features can give results comparable to those obtained by deep belief networks in this problem domain. The features extracted range from simple ones, such as the color ratio of the image, to more complex ones, such as the concentration index and dark matter density, and the results highlight that even the simplest of features can be critical for classification. Our approach also overcomes the lack of interpretability of deep belief networks.

7.2 Future Work

Future work to extend this thesis could include:

a) There is scope for designing more features for the problem and for extending the designed ones. Although this thesis looks extensively at the features that can be derived, both from the computer vision side (for example, aspect ratio and red to blue color intensity ratio) and from the astronomy side (for example, detection of bar and concentration index), there are many other features that could be designed.
b) The features designed here are specific to this problem, but there is scope for designing generalized features that extend to other domains as well. That approach would be more computer vision oriented.
c) The human annotation data can be treated as a feature for classification, and we can consider human advice based learning [56]. Seeking a domain expert's advice can help the system perform better.
d) This thesis tackles binary classification, i.e. we classify the images into only two classes. Multi-class classification could also be considered; as can be seen in Figure 21, the Galaxy Zoo decision tree, every task can be interpreted as a class. Other machine learning algorithms can also be experimented with.

7.3 Summary

Overall, we have seen that feature design coupled with basic machine learning algorithms can produce a general classification system. Feature design is also an important part of interpreting the results obtained, which a deep belief network fails to provide, or provides to a lesser extent. Our system gives good initial results and serves as a starting point for future work.
The shape of a galaxy reveals a lot of information about the age and the type of stars present in it. There are roughly one hundred billion galaxies scattered throughout the observable universe. This is a huge amount of prospective data, so this problem holds great potential for the feature-design approach. Features can be designed that differentiate not only between galaxies of different shapes but also between galaxies of similar shapes; for example, a single feature could capture the shape, the number of arms of the galaxy, and how tightly the arms are bound. Different machine learning algorithms can also be compared to determine which works best with this kind of data, given even the simplest of features. To actually solve the problem, very powerful features need to be designed. By powerful we mean that the features need to be as general as possible, so that there is no need to design a large number of features, and yet discriminative enough to provide a perfect classification. More sophisticated neural networks might also solve the problem, but this seems unlikely given the nature of the images. Thus feature design is the way to go in the future.

REFERENCES

[1] http://data.galaxyzoo.org/
[2] http://en.wikipedia.org/wiki/Hubble_sequence
[3] http://en.wikipedia.org/wiki/Galaxy_morphological_classification
[4] http://www.sdss.org/
[5] http://www.galaxyzoo.org/#/story
[6] de la Calleja, Jorge, and Olac Fuentes (2004). "Automated classification of galaxy images." In Knowledge-Based Intelligent Information and Engineering Systems, pp. 411-418. Springer Berlin Heidelberg.
[7] Banerji, Manda, Ofer Lahav, Chris J. Lintott, Filipe B. Abdalla, Kevin Schawinski, Steven P. Bamford, Dan Andreescu et al. (2010). "Galaxy Zoo: reproducing galaxy morphologies via machine learning." Monthly Notices of the Royal Astronomical Society 406, no. 1: 342-353.
[8] Yin Cui, Yongzhou Xiang, Kun Rong, Rogerio Feris, and Liangliang Cao (2014). "A Spatial-Color Layout Feature for Content-based Galaxy Image Retrieval". In: IEEE Winter Conference on Applications of Computer Vision (WACV).
[9] Eskandari, A. R., & Kouchaki, Z. (2013). Regular shapes detection in satellite images. Malaysian Journal of Computer Science, 25(1).
[10] G. Wu, W. Liu, X. Xie, Q. Wei (2007). "A Shape Detection Method Based on the Radial Symmetry Nature and Direction-Discriminated Voting". ICIP (6), pp. 169-172.
[11] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
[12] A. Berg, J. Deng, and L. Fei-Fei (2010). Large scale visual recognition challenge. www.imagenet.org/challenges.
[13] J. Sánchez and F. Perronnin (2011). High-dimensional signature compression for large-scale image classification. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1665-1672. IEEE.
[14] Eskridge, P. B., & Frogel, J. A. (2000). What is the true fraction of barred spiral galaxies? In Toward a New Millennium in Galaxy Morphology (pp. 427-430). Springer Netherlands.
[15] http://www.astro.caltech.edu/~george/ay20/eaa-darkmatter-obs.pdf
[16] http://en.wikipedia.org/w/index.php?title=Shape_factor_(image_analysis_and_microscopy)&oldid=572283723
[17] Goderya, S. N., & Lolling, S. M. (2002). Morphological classification of galaxies using computer vision and artificial neural networks: A computational scheme.
Astrophysics and Space Science, 279(4), 377-387.
[18] R. Freedman and W. J. Kaufmann (2007). Universe.
[19] Pass, Greg, Ramin Zabih, and Justin Miller (1997). "Comparing images using color coherence vectors." In Proceedings of the fourth ACM international conference on Multimedia, pp. 65-73.
[20] http://spiff.rit.edu/classes/phys443/lectures/gal_1/petro/petro.html
[21] Shimasaku, K., Fukugita, M., Doi, M., Hamabe, M., Ichikawa, T., Okamura, S., & York, D. G. (2001). Statistical properties of bright galaxies in the Sloan Digital Sky Survey photometric system. The Astronomical Journal, 122(3), 1238.
[22] http://en.wikipedia.org/wiki/Navarro%E2%80%93Frenk%E2%80%93White_profile
[23] http://astronomy.swin.edu.au/cosmos/S/Scale+Length
[24] http://www.mattiavaccari.net/research/masterthesis/masterthesis/node34.html
[25] http://www.astro.caltech.edu/~aam/science/thesis/total/node20.html
[26] Otsu, N. (1975). A threshold selection method from gray-level histograms. Automatica, 11(285-296), 23-27.
[27] http://infohost.nmt.edu/tcc/help/pubs/colortheory/web/hsv.html
[28] http://en.wikipedia.org/wiki/Image_moment
[29] Petrosian, V. (1976). Surface brightness and evolution of galaxies. The Astrophysical Journal, 209, L1-L5.
[30] Blanton, M. R., Dalcanton, J., Eisenstein, D., Loveday, J., Strauss, M. A., SubbaRao, M., & Schlegel, D. J. (2001). The luminosity function of galaxies in SDSS commissioning data. The Astronomical Journal, 121(5), 2358.
[31] Hosmer, David W.; Lemeshow, Stanley (2000). Applied Logistic Regression (2nd ed.). Wiley.
[32] Kearns, M. (1988). Thoughts on hypothesis boosting. Unpublished manuscript, 45, 105.
[33] Freund, Y., Schapire, R., & Abe, N. (1999). A short introduction to boosting. Journal of the Japanese Society for Artificial Intelligence, 14(771-780), 1612.
[34] www.apsky.org/ngc/ngc.html
[35] Quinlan, J. R. (2014). C4.5: Programs for Machine Learning. Elsevier.
[36] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1-127.
[37] Hinton, G., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527-1554.
[38] LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., & Huang, F. (2006). A tutorial on energy-based learning. Predicting Structured Data, 1, 0.
[39] Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. MIT Press, Cambridge, Mass., 1, 282-317.
[40] http://en.wikipedia.org/wiki/Boltzmann_machine
[41] http://en.wikipedia.org/wiki/Restricted_Boltzmann_machine
[42] G. E. Hinton and R. R. Salakhutdinov (2006). Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507.
[43] deeplearning.net/tutorial/deeplearning.pdf
[44] Hinton, G., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527-1554.
[45] http://en.wikipedia.org/wiki/Deep_belief_network
[46] Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134-1142.
[47] http://yyue.blogspot.com/2015/01/a-brief-overview-of-deep-learning.html
[48] Bottou, L. (2012). Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade (pp. 421-436). Springer Berlin Heidelberg.
[49] Willett, K. W., Lintott, C. J., Bamford, S. P., Masters, K. L., Simmons, B. D., Casteels, K. R., ... & Thomas, D. (2013). Galaxy Zoo 2: detailed morphological classifications for 304,122 galaxies from the Sloan Digital Sky Survey.
Monthly Notices of the Royal Astronomical Society, stt1458.
[50] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1), 10-18.
[51] http://en.wikipedia.org/wiki/Precision_and_recall
[52] Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874.
[53] https://pypi.python.org/pypi/nolearn
[54] Leake, D. B. (1996). CBR in context: the present and future. Case-Based Reasoning, Experiences, Lessons & Future Directions, 1-30.
[55] You, C. F., Jeng, C. R., & Liu, K. Y. (2013). Design for Automotive Panels Supported by an Expert System. INTECH Open Access Publisher.
[56] Kunapuli, G., Odom, P., Shavlik, J. W., & Natarajan, S. (2013, December). Guiding autonomous agents to better behaviors through human advice. In Data Mining (ICDM), 2013 IEEE 13th International Conference on (pp. 409-418). IEEE.

In collaboration with Tasneem Alowaisheq.

Resume

Devendra Singh Dhami
3801 S Sharon Dr., Bloomington, Indiana 47403
[email protected] | (812) 369-3967

OBJECTIVE
Given my great interest in the field of Artificial Intelligence and my love for Astronomy, I decided to merge the two and am pursuing my MS at Indiana University, Bloomington. I would like to extend my Masters to a PhD in the field of Artificial Intelligence, with a focus on the applications of AI in Astronomy and other fields.

Research Interests:
1. Artificial Intelligence
2. Computer Vision
3. Machine Learning
4. Astronomy

EDUCATION
Indiana University, Bloomington, USA
Master of Science in Computer Science, August 2013 - Present
Expected Graduation Date: May 2015
Current GPA: 3.80/4

Visvesvaraya Technological University, Bangalore, India
Bachelor of Engineering in Information Science & Engineering, August 2006 - June 2010
Undergraduate Percentage: 79.82%

PROJECTS UNDERTAKEN
Master Thesis (in progress): Morphological Classification of galaxies into spirals and non-spirals.
Thesis Advisor: Prof. David J. Crandall
Thesis Committee: Prof. David J. Crandall, Prof. David B. Leake, Prof. Sriraam Natarajan

Graduate Research Projects:
1. Morphological Classification of galaxies using Computer Vision Techniques and Case Based Reasoning. (Class project for CSCI-B657 and CSCI-B552)
2. Prediction of Indian Tiger population increase/decrease in the future using Machine Learning techniques. (Class project for I 526/CSCI-B 659)
3. Gamification of the Optional Practical Training Application Process for International Students. (Class project for ILS Z 516)
4. Prevention of man-in-the-middle attacks on the SSL layer in cryptographic systems. (Class project for CSCI-B649)

Undergraduate Research Project:
1. Designed "Sentiment Analyzer", a tool to classify users' sentiments on a topic (e.g. any device, such as a mobile, or any event, such as a movie).

PROFESSIONAL EXPERIENCE
Indiana University Bloomington (2013 - present)
Associate Instructor
1. Course Name: CSCI A348: Mastering the World-Wide Web. Instructor: Prof. Dan-Adrian German. Semester: Fall 2013
2. Course Name: CSCI B551: Elements of Artificial Intelligence. Instructor: Prof. David Leake. Semester: Fall 2014
3. Course Name: CSCI A201/597: Introduction to Programming. Instructor: Prof. Eric Wennstrom. Semester: Spring 2015

Hewlett Packard, Bangalore, India (August 2010 - July 2013)
Technology Consultant
1. Project: Project and Portfolio Management (PPM) Automation
Client: HP internal
Platform & Skills: Eclipse, Remote Desktop Server, PL/SQL Developer, Toad
Duration: August 2010 - Dec 2010
Description: Automated the various functionalities of the PPM tool by writing scripts and Java libraries to be implemented in the front end of the tool.
Responsibilities:
• Understanding and building the automation framework
• Developing the code for implementation
• Code review
• Designing automation scripts for backend verification
• Reporting defects

2. Project: PPM Upgrade
Client: AXA Technologies
Platform & Skills: Eclipse, Remote Desktop Server, PL/SQL Developer
Duration: Jan 2011 - till date
Description: Wrote the code and libraries to upgrade the PPM tool from version 7.1 to version 8.0, and developed reusable components (scripts) to generate various reports from the tool's database.
Responsibilities:
• Analyzing business requirements (e.g. business logic, process flows, and reporting requirements)
• Preparing the development plan
• Preparing daily, weekly and monthly reports as well as progress reports
• Responsible for functionality testing, integration testing and regression testing
• Preparing test cases based on Software Requirement Specifications

TECHNICAL SKILLS
Languages: Java, C/C++, Python, OpenCV, SQL
Platforms: UNIX, Microsoft Windows and DOS
Databases: Structured Query Language, Oracle
Web development: JavaScript, HTML, PHP

MISCELLANEOUS DETAILS
I presented a paper on "Sentiment Analysis" at Tech-On, the international technical conference of Hewlett-Packard, where the paper ranked 149 out of 2000 papers. I also won 1st prize in an IEEE paper presentation with a paper titled "Image Processing" during my undergraduate studies, and presented several other papers, on "Cloud Computing" and "Information Security" among others, in various technical competitions. I have also won the "Star of the Horizon Award" and the "Best team in HP award" during my tenure at HP, as well as a gold medal and a Certificate of Academic Excellence from my undergraduate school. I was on the editorial team of my annual college magazine during my undergraduate years, with activities ranging from collecting the articles to be printed to overseeing the actual printing of the magazine.