MORPHOLOGICAL CLASSIFICATION OF GALAXIES INTO SPIRALS AND NON-SPIRALS
Devendra Singh Dhami
Submitted to the faculty of the University Graduate School in partial fulfillment of the
requirements for the degree
Master of Sciences
in the School of Informatics and Computing
Indiana University
May 2015
Accepted by the Graduate Faculty, Indiana University, in partial fulfillment of the requirements
for the degree of Master of Sciences.
Master's Thesis Committee
Professor David J. Crandall
Professor David B. Leake
Professor Sriraam Natarajan
Copyright © 2015
Devendra Singh Dhami
Devendra Singh Dhami
MORPHOLOGICAL CLASSIFICATION OF GALAXIES INTO SPIRALS AND NON-SPIRALS
The aim of this master’s thesis is the classification of images of galaxies according to their
morphological features using computer vision and artificial intelligence techniques. We deal
specifically with the shape of the galaxy in this project. The galaxies are broadly categorized into three classes according to their shape: circular, elliptical, and spiral. Of these three shapes, correctly classifying the spiral shape is the most challenging. This is mostly due to the noisy images of the galaxies and partly due to the shape itself, as a spiral can easily be mistaken for an ellipse or even a circle. Thus we focus on classifying the images into only two categories: spiral and non-spiral.
The first phase of the thesis addresses the process of feature extraction from images of the
galaxies, and the second phase uses artificial intelligence and machine learning methods to
create a system that categorizes galaxies based on the extracted features. The specific methods
used for classification are boosting, logistic regression and deep neural networks.
We evaluate these techniques on data from the Galaxy Zoo project [1] that is freely available to
anyone. The languages used are C++ (OpenCV) and Python.
Table of Contents

Chapter 1. Introduction
1.1 Overview and Motivation
1.2 Related Work
1.2.1 Galaxy Classification Problem
1.2.2 Other Classification Problems
1.3 The Thesis
1.4 Novel Features
1.5 Adopted Features
1.6 Outline and Contributions of this Thesis
Chapter 2. Feature Extraction
2.1 Detection of Bar
2.2 Dark matter density (gray scale and binary)
2.3 Disk / Bulge ratio
2.4 Circularity
2.5 Black / White pixel ratio
2.6 Convexity
2.7 Form Factor
2.8 Bounding rectangle to fill factor
2.9 Count of line intersection of the shape
2.10 Maximum values of Red and Blue channels
2.11 Concentration index
2.12 Aspect Ratio
2.13 Extent
2.14 Red to Blue color intensity ratio
2.15 Fitted ellipse angle
2.16 Fitted Ellipse Height to Width ratio
2.17 Color coherence vector (CCV)
Chapter 3. Classification Algorithms
3.1 Introduction to algorithms
3.1.1 Logistic Regression
3.1.2 Boosting
3.2 Advantages and disadvantages of these algorithms
Chapter 4. Deep Learning
4.1 Introduction to Deep Learning
4.1.1 Deep Neural Network
4.1.2 Deep Belief Network
4.1.3 Restricted Boltzmann Machines
4.1.3.1 Energy based models
4.1.3.2 Boltzmann Machines
4.2 Success of Deep Learning
Chapter 5. Results
5.1 Available data
5.2 Logistic Regression and Boosting
5.3 Deep Learning
5.4 Feature Analysis
Chapter 6. Unsuccessful Experiments
6.1 Blob Detection
6.2 Laplacian of Gaussian (LoG)
6.3 Discrete Fourier Transform (DFT)
6.4 Case Based Reasoning (CBR)
Chapter 7. Conclusion and Future Work
7.1 Contribution
7.2 Future Work
7.3 Summary
References
List of Figures

Figure 1. Hubble Tuning Fork.
Figure 2. Example galaxy images from SDSS database.
Figure 3. Approach to the classification task.
Figure 4. Barred and unbarred galaxies.
Figure 5. A galaxy viewed edge-on showing the bulge and where the bulge is undetectable.
Figure 6. Bar detection process.
Figure 7. Thresholding of image to get the shape.
Figure 8. Process of cropping the image to size of bounding box.
Figure 9. Discretizing and finding the coherent regions in CCV.
Figure 10. CCV visualization for T = 500.
Figure 11. Best fit line for a set of data points (regression).
Figure 12. A sigmoid function example.
Figure 13. The AdaBoost algorithm.
Figure 14. a) An Artificial Neural Network and b) A neuron.
Figure 15. An EBM that measures the compatibility between observed variables X and variables to be predicted Y using the energy function E(X, Y).
Figure 16. A Boltzmann machine in graphical format.
Figure 17. A restricted Boltzmann machine in graphical format.
Figure 18. A deep belief network with stacked RBMs.
Figure 19. Training algorithm for DBNs.
Figure 20. SGD convergence time.
Figure 21. The Galaxy Zoo classification decision tree.
Figure 22. A snapshot of the decision tree in question form.
Figure 23. The human annotated data.
Figure 24. Splitting the dataset into two different sets.
Figure 25. N-fold cross validation.
Figure 26. Logistic Regression WEKA parameters.
Figure 27. AdaBoost WEKA parameters.
Figure 28. The confusion matrix for logistic regression.
Figure 29. The confusion matrix for AdaBoost.
Figure 30. Precision and Recall.
Figure 31. ROC curves for Logistic Regression and AdaBoost.
Figure 32. Comparing ROC curves for Logistic Regression and Boosting.
Figure 33. Dataset format for feeding into DBN.
Figure 34. DBN results for a small dataset of 2983 images.
Figure 35. DBN results for the complete dataset of 61578 images.
Figure 36. Performance of all bad performing features together.
Figure 37. Performance of all bad performing features combined with the best feature.
Figure 38. Performance of all bad performing features combined with the second best performing feature.
Figure 39. Performance of all bad performing features together using MLP.
Figure 40. Images having bar feature at different angles.
Figure 41. An edge-on galaxy image.
Figure 42. CBR process.
Figure 43. CBR process relationships.
List of Tables

Table 1. The accuracy of the various parameter sets adopted from [7].
Table 2. Comparison of error rates of the 3 methods adopted from [11].
Table 3. The resulting vector for CCV.
Table 4. Logistic Regression Results.
Table 5. Boosting Results.
CHAPTER 1
INTRODUCTION
1.1. Overview and Motivation
As late as the 1990s, due to the lack of appropriate technology, observing faraway objects was challenging and the discovery of new galaxies was a slow process. The number of galaxies we could photograph was very small, and owing to this slow pace, their classification was not a demanding task. With the advancement of technology there has been a burst in the number of galaxies being found. We are now able to photograph many more distant galaxies than in the past, but most of the classification still depends on human effort, where thousands of volunteers around the world classify these galaxies into their respective classes manually.
If we look into the history of classification we find several systems that were designed by
astronomers for the purpose of classification of galaxies. Edwin Hubble classified galaxies into
something called the Hubble sequence, also known as the Hubble tuning fork (Figure 1), in 1936
[2]. There are a few other systems, like the de Vaucouleurs system and the Yerkes (or Morgan) schemes [3].
In 2007 a citizen science project called Galaxy Zoo was launched with the aim of involving human volunteers in the classification of galaxy images obtained from the Sloan Digital Sky Survey (SDSS) database [4]. This approach was highly successful from the beginning: within 24 hours of launch the project received almost 70,000 classifications an hour, and more than 50 million classifications were received during its first year, contributed by more than 150,000 people [5].
Figure 1. Hubble Tuning Fork adopted from [2].
The project is still ongoing, but recently a need has arisen to automate the classification process for several reasons. First, images of the galaxies are noisy, so the classification of the same galaxy varies among human volunteers. Figure 2 shows some example images from the SDSS database. Second, the size of the SDSS database is ever increasing; new galaxies are being found virtually every day, and thus a system that can automate the classification process is very much desired. Third, detecting the shape of a galaxy is a difficult problem and thus presents an interesting research challenge for automatic image analysis.
Figure 2. Example galaxy images from SDSS database.
The field of computer vision plays a major role in automating galaxy classification. We must rely on the images obtained from the SDSS database, and since galaxies are faraway objects, the images are low resolution and noisy. The area of computer vision of special importance here is feature extraction. If we can obtain meaningful features from the images, then various machine learning algorithms, both supervised and unsupervised, can be used for classification. Unfortunately, feature extraction is difficult. We not only need to find features, but they should be as general as possible so that they apply to all applicable images in the dataset in question, which here is the images of spiral and non-spiral galaxies. This feature extraction process forms a major part of this thesis.
1.2. Related Work
There has been a significant amount of research work in the area of automatic galaxy
classification, although this area can still be called relatively new. The main focus has been on
applying various machine learning algorithms with a special focus on neural networks for the
classification task. However, feature extraction has not been studied as extensively, since it has
mostly been treated as a part of image preprocessing.
1.2.1. Galaxy Classification Problem
The two most famous papers on this subject are undoubtedly de la Calleja et al. [6] and Banerji et al. [7]. In [6] the galaxy classification method is divided into three stages: image analysis, data compression, and machine learning. The authors applied three machine learning methods to galaxy image classification and carried out a comparison study of the three algorithms. These algorithms are Naive Bayes, C4.5 (an extension of the ID3 algorithm) [35], and Random Forest, and they were tested on the New General Catalog (NGC) released by the Astronomical Society of the Pacific [34]. In the image analysis step, the authors applied Principal Component Analysis (PCA) to make the galaxies position, scale, and rotation invariant. This was done because the galaxies in the images were not centered, a criterion that the SDSS database already fulfills. PCA was also used to reduce the dimensionality of the data (data compression), and the principal components of the image projection were then used as the feature set for the classification phase. They found that Random Forest performed better than Naive Bayes or C4.5.
In [7], the authors applied neural networks to classify images into three classes: early types, spirals, and point sources/artifacts. The neural network was trained on 75,000 galaxy images obtained from the Sloan Digital Sky Survey [4]. These training images are associated with features already annotated by humans through the Galaxy Zoo project. The test data comprised one million galaxy images. They trained and tested the neural network using three sets of input parameters:
(a) Colors and profile fitting: the colors of galaxies or any parameter associated with profile fitting, like the Hubble profile [2] or de Vaucouleurs profile [3] as described in Section 1.1 for morphological classification.
(b) Concentration and adaptive moments: the concentration index [20], which is defined in a later section and is used as a feature in this project, as well as other texture parameters.
(c) The combination of both (a) and (b).
Their results show that the color or the shape parameters, when taken individually, are not
sufficient to capture the morphological features of the galaxy. However, combining those
parameters increased the accuracy remarkably (Table 1).
Yin Cui et al. [8] created a system in which a galaxy is queried by providing a galaxy image as input, after which the system retrieves and ranks the most similar galaxies. In order to accurately detect galaxies, the input images must be invariant to rotation and scale. To find the rotation angle, the second moment of inertia was applied. A spatial-color layout descriptor was proposed to encode both local and global morphological features. The descriptor was then combined with Kernelized Locality Sensitive Hashing for retrieval and ranking.
Parameters                                Early Types   Spirals   Point Source/Artifacts
(a) Colors and profile fitting                87%          86%              95%
(b) Concentration and adaptive moments        84%          87%              28%
(c) Combining (a) + (b)                       92%          92%              96%

Table 1. The accuracy of the various parameter sets, adopted from [7].
Experiments were carried out by applying three kernels: Histogram Intersection, Chi-Square and
Jensen-Shannon Divergence kernels. Out of the three, Histogram Intersection produced the
best results with 95.8% accuracy.
1.2.2. Other Classification Problems
Although not directly related to the topic at hand, Eskandari & Kouchaki [9] present an
important paper in shape detection. The authors propose a novel method to distinguish regular
and irregular shapes/regions in satellite and aerial images. Wu et al. [10] define a regular shape
as “one that possesses the characteristic that within the shape there is a point that has an equal
distance to each side or the boundary of the shape.” The authors use a more general definition
in their paper. They define regularity in [9] as “the whole shape is formed by a repetition of a
particular part, in the same direction, at the sides of a regular polygon or by the repetition of
this part, in the same direction, at the two sides of a line.” The authors use the Discrete Fourier
Transform (DFT) and present a measure called Reference Counter-to-Noise Ratio (RCNR) to
6
define the regularity in a shape. To experiment they use three different satellite images from
Google Earth and found that their approach was quite successful in detecting regular shapes in
these images.
The authors in [11] trained a convolutional neural network to classify the 1.2 million images in the ImageNet LSVRC-2010 contest into 1000 different classes. This was a deep learning approach, as the network consisted of five convolutional layers and had 60 million parameters and 650,000 neurons. Their method does not use any feature extraction at all; the only preprocessing is to scale the variable-sized images of the dataset down to a fixed size of 256 x 256. The neurons are trained on the raw RGB values of the pixels. The authors reported results using the two error metrics that are mandatory for the ImageNet database, the top-1 and top-5 error rates. The top-5 error rate is defined as "the fraction of test images for which the correct label is not among the five labels considered most probable by the model." They compare their results to a sparse-coding approach implemented in [12] and an approach that averages Fisher vectors over SIFT features implemented in [13]. The results (Table 2) show that the neural network method works best among the compared methods.
Model                 Top-1    Top-5
Sparse Coding [12]    47.1%    28.2%
SIFT + FV [13]        45.7%    25.7%
CNN [11]              37.5%    17%

Table 2. Comparison of error rates of the 3 methods, adopted from [11].
1.3. The Thesis
Most of the work discussed above has one disadvantage in common: there is very little focus on
designing the image features. The neural networks do learn the features implicitly but do not
explicitly reveal anything about the features being learned. Because neural networks work well in such classification problems, the need to learn features explicitly has not arisen. Thus neural
networks answer the question ‘what particular class is the image classified as?’ but hide the
answer to ‘why is the image classified as a particular class?’ If we can design and extract some
useful features, perhaps informed by our prior knowledge of astronomy, then we can learn
more about the images and the classification will become an easier task. Of course, this
approach will also have a different but complementary weakness: feature extraction has its
own disadvantage of not being scalable to other problems. In other words, features that work
really well for one dataset may become less important or completely irrelevant for some other
dataset of a different domain.
My approach in this thesis is more focused on the task of feature extraction and thus can be
described in a simple diagrammatic representation (Figure 3).
Images
Feature Extraction
Classification
Figure 3. Approach to the classification task.
In the next two sections we present an introduction to the various features that were extracted
for this project, and we discuss them in greater detail in Chapter 2. Most of these features are based on knowledge of astronomy and are novel in their application to this particular problem.
1.4 Novel Features
1.4.1. Detection of bar
Surveys show that almost two-thirds of the observed spiral galaxies are barred, i.e. have a bar
at the center of the galaxy. This phenomenon is absent in other classes of galaxies and thus the
presence of a bar serves as an important feature to separate spiral from other galaxies. Figure 4
shows an example of a barred and unbarred galaxy.
1.4.2. Dark matter density (gray scale and binary)
Spiral galaxies have a substantial amount of dark matter. The galaxies do not generally follow
the rotation laws of a solid body here on Earth. The angular velocity of the rotation typically
decreases with radius. Until the early 1970s, most of the rotation data for spirals came
from optical observations, which did not extend beyond the luminous inner regions [15]. But in later years, when observations extended to farther regions of the galaxy, the data showed the rotational velocity to be almost constant even as the radius increased. This
meant that the total mass of the spiral within some radius increases linearly with the radius
while the total luminosity approaches a finite asymptotic limit and thus a large amount of
invisible mass must be present [15]. Due to the presence of this mass the image should have
higher pixel values in the halo. Thus the grayscale values of the dark matter halo of the spiral
galaxies should be in a higher range as compared to their counterparts and could be a useful
feature in differentiating between the two.
Figure 4. Barred (left) and unbarred (right) galaxies.
1.4.3. Disk / Bulge ratio
Most spiral galaxies have a bulge at the center, but this can be observed only when the galaxy is viewed edge-on. As most of the galaxies we encounter are viewed head-on, detecting the bulge directly becomes impossible. Figure 5 demonstrates an example of this. However, because the Disk / Bulge ratio can be written in terms of the surface brightness of the galaxy, we can extract that ratio as a feature for our classification.
Figure 5. A galaxy viewed edge-on showing the bulge (left) and a galaxy viewed head-on where the bulge is undetectable (right).
1.4.4. Circularity
The circularity parameter defines how close to a circular shape an object is. This is defined by
the isometric index equation. The value of the parameter is near 1 for a circle and in the lower
range for other shapes. As mentioned in [16], the value is “much less than one for
a starfish footprint.” Since a spiral shape is very close to a starfish shape, it should also have a
low value for this parameter.
1.4.5. Black to White pixel ratio
This parameter measures the ratio of the number of black pixels to the number of white pixels
in the binary form of the input image. In the SDSS database, all of the galaxies are centered and
are roughly of the same size. Thus this parameter should have a higher value for non-spiral
galaxies as compared to spiral galaxies. Although this measure might not prove useful for larger
non-spiral galaxies, it still forms an interesting and simple shape detection feature.
1.4.6. Count of line intersection of the shape
Spiral galaxies have arms and other galaxies do not have arms. This simple property can be a
very useful one in differentiating between spiral and non-spiral shapes. If we draw a line from the center of the galaxy towards its edge and count the number of times this line intersects the galaxy, we get a fair idea of the shape, as the line will intersect a spiral shape more than once but a circular or elliptical shape only once.
1.4.7. Maximum values of Red and Blue channels
Spiral galaxies are middle-aged galaxies, i.e. they are in between the newer galaxies (lots of star
formation and generally irregular shaped) and old galaxies (almost no star formation and
elliptically shaped). The old galaxies are red in color due to the lack of any gas used for star
formation, and the new galaxies are blue in color due to an abundance of gas and dust for star
formation [18]. Spiral galaxies have all their star formation in the arms and none in the disk, so the arms are bluish and the disk is reddish in appearance. Thus for elliptical galaxy images the maximum value of the red channel should be relatively high, and for irregular galaxy images the maximum value of the blue channel should be relatively high. For spiral galaxy images, neither the red nor the blue channel should have a particularly high value.
1.4.8. Concentration index
This parameter is related to the radial profile of the galaxy. Before defining the concentration
index the following definitions [20] are necessary:
a) Petrosian radius [29]: the radius at which the Petrosian ratio of the galaxy's light profile (defined in Section 2.11) equals a predefined value, usually 0.2 [21].
b) Petrosian flux: the sum of all the flux within k times the Petrosian radius.
c) R90: the Petrosian ninety-percent radius, the radius which contains 90% of the Petrosian flux.
d) R50: the Petrosian half-light radius, the radius which contains half of the Petrosian flux.
The concentration index is defined as the ratio between R90 and R50.
1.4.9. Aspect Ratio
Aspect ratio is defined as “a function of the largest diameter and the smallest diameter
orthogonal to it.” We can interpret it as the ratio between the width and height of the
bounding rectangle of the galaxy.
1.4.10. Extent
Extent is defined as the ratio of contour area to bounding rectangle area.
1.4.11. Red to Blue color intensity ratio
As described in Section 1.4.7 above, spiral galaxies are middle-aged galaxies and thus have a reddish disk and bluish arms. The difference between this feature and the feature of Section 1.4.7 is that here the galaxy image is more carefully cropped, and we compute the mean color of the RGB channels for the cropped image. For spiral galaxies the value of this ratio should be near 1 - 1.2, and for non-spiral galaxies it should be higher, as red dominates blue in such galaxies.
For the next two features, galaxy images are characterized by fitting an ellipse to them.
1.4.12. Fitted ellipse angle
This feature calculates the angle of the rotated rectangle that best fits the galaxy. For non-spiral
galaxies the fitted ellipse (rectangle) should have a large angle as the rectangle is nearly upright
and for a spiral galaxy the angle should be relatively low.
1.4.13. Fitted Ellipse Height to Width ratio
For spiral galaxies this ratio should have higher values due to the spiral shape, and non-spiral
galaxies (i.e. circular and elliptical) should have lower values.
1.5 Adopted Features
The next three shape features (1.5.1-1.5.3) have been adopted from [17].
1.5.1. Convexity
As the name suggests this parameter measures how convex a particular object is. According to
[17], “For jagged regions like spiral galaxies convexity is very large, whereas for elliptical
galaxies it is very small.”
1.5.2. Form Factor
Goderya & Lolling [17] define Form Factor as “a ratio of area and square of the perimeter of the
galaxy.” Elliptical galaxies have a higher value for this parameter as the star formation is low
and thus most of the areas are equally bright, i.e. the luminosity in case of elliptical galaxies is
approximately uniformly distributed. In case of spiral galaxies, the values are low as their
“perimeter per unit area is relatively large” [17] and the luminosity is not uniformly distributed.
1.5.3. Bounding rectangle to fill factor
This parameter defines the area of the galaxy to the area of the bounding rectangle. It shows
“how many pixels in the bounding rectangle belong to the galaxy in reference to the total
number of pixels in the bounding rectangle” [17].
1.5.4. Color coherence vector (CCV)
Color coherence vector (CCV) is a method developed for content-based image retrieval [19].
The idea of CCV is to mark each pixel as coherent or incoherent. A coherent pixel belongs to a
region of pixels that share the same color value. Connected pixels are formally defined as [19]:
For a region R to be considered a region of connected pixels, it should satisfy the following
property: for each p1, p2 ∈ R, there exists a path of adjacent pixels from p1 to p2. (The path traversal can be horizontal, vertical, or diagonal.)
For this feature the color space is discretized into 64 colors and then each pixel is checked for
its membership to a coherent region.
1.6 Outline and Contributions of this Thesis
In Chapter 2, the various features extracted for this thesis are described in detail. The process
followed for obtaining the features from the input images is presented. In Chapter 3, the
machine learning algorithms used for classification of the images are presented. A brief
introduction to the algorithms is followed by their advantages and disadvantages. In Chapter 4, the concept of deep learning is introduced and its success in tackling this kind of problem is discussed. In Chapter 5, the results are presented. In Chapter 6, some experiments are presented that were tried during the course of this thesis but failed or did not work as intended. Finally, Chapter 7 gives the conclusion and the scope of future work for this problem.
Chapter 2
Feature Extraction
This chapter deals with the technical details of the feature extraction process. The SDSS images [4] that we are using are color images of size 424 x 424 in JPEG format. The centers of the galaxies are located at the center of the image, which is helpful because we do not need to design feature detectors that are position invariant. Before the feature detection process, the images are cropped to a size of 180 x 180 to remove some of the background noise.
2.1 Detection of bar
Most of the spiral galaxies have bars in their center, which emit brighter light than the rest of
the galaxy. The first step then, in exploiting the brightness of a potential bar, is to increase the
contrast of the galaxy. It was found that enhancing the image contrast in the HSV (Hue-Saturation-Value) color space instead of the RGB color space produced better results for the purpose of thresholding the image to convert it into a binary image. Before moving ahead it is important to describe the HSV color space. The definitions below are taken from [27].
H: The hue (H) of a color refers to which pure color it resembles.
S: The saturation (S) of a color refers to the amount of gray or white in a color.
V: The value (V) of a color, also called its lightness, describes how dark the color is. It is also
used to define the brightness of a color which is the definition we use here.
The threshold for the bar may be set very high, as the bar is the brightest part of the galaxy image. A threshold value of 255 was used for this purpose, and the pixels that satisfy the threshold were extracted from the contrast-enhanced image. A contour was drawn around this mass of brightest pixels. The contour was then made rotation invariant by calculating its maximum moment, which gives the orientation of the shape in the image, and then rotating the contour by the obtained orientation. The next step is to determine the width and height of the shape bounded by the contour. This is done by calculating the semi-major and semi-minor axes, which give the width and height respectively. If the width is much greater than the height, the presence of a bar structure in the galaxy is confirmed.
Figure 6 shows this process.
a) Original Image
b) Increased Contrast Image
c) Extracted Contour
d) Rotated Contour
Figure 6. Bar detection process.
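To make the pipeline concrete, the following Python/OpenCV sketch implements the bar test under stated assumptions: the input is a cropped, centered BGR galaxy image, contrast is boosted by histogram equalization of the V channel, and the elongation test reads the axis lengths from a fitted ellipse rather than rotating the contour by its maximum moment as described above. The function name and the axis-ratio threshold are illustrative choices, not the thesis code.

import cv2

def has_bar(img_bgr, axis_ratio=2.0):
    # Return True if an elongated bright central blob (a candidate bar) is found.
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    v = cv2.equalizeHist(v)                                    # contrast boost on V
    _, binary = cv2.threshold(v, 254, 255, cv2.THRESH_BINARY)  # brightest pixels only
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return False
    central = max(contours, key=cv2.contourArea)               # mass of brightest pixels
    if len(central) < 5:
        return False                                           # fitEllipse needs >= 5 points
    (_, _), (w, ht), _ = cv2.fitEllipse(central)               # blob axis lengths
    major, minor = max(w, ht), min(w, ht)
    return minor > 0 and major > axis_ratio * minor            # much wider than tall => bar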
2.2 Dark matter density (gray scale and binary)
Dark matter is attributed to be the source of much of the brightness observed in a galaxy. The
disc of a galaxy is said to be surrounded by a dark matter halo whose density is given as a function of its radius by the Navarro-Frenk-White profile [22]:

\[ \rho(r) = \frac{\rho_0}{(r/R_s)\,(1 + r/R_s)^2} \]

where \(\rho(r)\) is the dark matter density at radius r, \(\rho_0\) is the central dark matter density, and \(R_s\) is the scale radius.
The scale radius is defined in [23] as “the radius at which the brightness of the galaxy has fallen
off by a factor of e (~2.71828) from the center.”
The first step is to convert the input image into grayscale and calculate the central brightness of
the galaxy. Here the following assumption is made:
“The value of a pixel of the grayscale of the image is considered to be the brightness of the
image at that particular pixel.”
The main question to answer here is the definition of "center." We cannot consider only the central pixel of the image as the center of the galaxy, since the center of every galaxy will be different. According to [24] the central brightness of the galaxy is given by:

\[ \Sigma_0 = 5.3567\,\Sigma_e \]

where \(\Sigma_e\) is the surface brightness at the half-light radius, i.e. the radius within which half of the light is contained.
Once we have the central brightness of the galaxy we can estimate its scale radius. This is done by first calculating the brightness of the galaxy at the scale radius, which is the central brightness reduced by a factor of e [23]. Starting from the center of the image, we then grow a box in increments of 1 x 1 pixel, summing the pixel values, until we get close to the brightness of the galaxy at the scale radius. Half the size of the box gives us the scale radius.
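As a concrete illustration, here is a minimal Python sketch of this box-growing search. It assumes a grayscale image centered on the galaxy and follows the stopping rule of the text literally; the function name and the fallback return are illustrative.

import numpy as np

def scale_radius(gray, central_brightness):
    target = central_brightness / np.e          # brightness fallen off by a factor of e [23]
    cy, cx = gray.shape[0] // 2, gray.shape[1] // 2
    for half in range(1, min(cy, cx)):
        # grow a centred box and accumulate pixel values
        box = gray[cy - half:cy + half + 1, cx - half:cx + half + 1]
        if box.sum() >= target:
            return half                         # half the box size is the scale radius
    return min(cy, cx)                          # no radius found: fall back to image edge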
For the dark matter density in the binary form of the input image, the only change from the above method is the definition of the center of the galaxy. Here we follow the method described in Section 2.1 to estimate the central contour, which is then used as the center of the galaxy.
2.3 Disk / Bulge ratio
As mentioned before, detecting the bulge in an image that is viewed head-on (Figure 5b) is
virtually impossible. In [25] the Disk / Bulge ratio is defined as:
\[ \frac{D}{B} = 0.28 \left(\frac{r_s}{r_e}\right)^2 \frac{\Sigma(s)}{\Sigma(e)} \]

where \(r_s\) is the scale radius, \(r_e\) is the half-light radius, \(\Sigma(s)\) is the surface brightness at the scale radius, and \(\Sigma(e)\) is the surface brightness at the half-light radius.
The first step is to convert the input image into grayscale and calculate the central brightness of
the galaxy as in Section 2.2 above. Once we have the central brightness of the galaxy we can find the scale radius, the radius at which the central brightness is reduced by a factor of e, and the half-light radius, the radius at which the central brightness is reduced by a factor of 2. As before, we start from the center of the image and grow a box in increments of 1 x 1 pixel, summing the pixel values until we get close to the brightness of the galaxy at the scale and half-light radius; half the size of the box gives the scale radius and the half-light radius respectively.
2.4 Circularity
To calculate the circularity [16] of a shape in the image, we first need to extract the shape with
as little information loss as possible. We focus on two types of circularity in this project: galaxy
central circularity and complete galaxy circularity. Although the central question is about the
shape of the complete galaxy, the central part of the galaxy is also considered because the
central part of most of the spirals is barred (Feature 2.1) and for circular or elliptical galaxies the
central part resembles the overall galaxy shape. The circularity of a shape can be described by
the equation:

\[ C = 4\pi A / P^2 \]

where A is the area of the shape and P is the perimeter of the shape.
To estimate the central part of the galaxy we follow the method described in Section 2.1 and
then calculate the area and arc length of the contour obtained. For calculating the circularity
parameter for the complete galaxy we must first convert the image into its binary form. Direct
thresholding an image to convert into the binary format is not considered here for two reasons:
1. In some cases it results in the distortion of the shape, and in some cases, the loss of the
entire shape.
2. For every image the threshold value might be different.
OpenCV provides an implementation of the Otsu method [26] for estimating an ideal threshold,
which works by calculating the optimal threshold between the foreground and the background
pixels. This project takes a slightly different approach for more effective thresholding. The
image is first converted into its grayscale format and then the Laplacian of Gaussian (LoG) of
the image is calculated. The LoG is then subtracted from the grayscale image to remove
background noise. The resulting image is then converted to HSV, and the value (V) parameter
between 20 and 255 is used to define the shape in the binary image, i.e. the pixels in HSV image
having the value parameter between 20 and 255 are set to 255 (white) in the binary image. The
binary image obtained usually gives a reasonable estimate of the shape of the galaxy, but has a
few disconnected points. The approach taken to connect the image is to perform a morphological dilation: scan through all the pixels in the binary image, and if any pixel has a white pixel in its neighborhood, it is also set to white. Figure 7 shows this process. To calculate the area of the resulting shape, the moment (M00) of the image [28] is calculated. To estimate the perimeter, the arcLength() function of OpenCV is used.
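A minimal Python sketch of this thresholding pipeline and of the circularity computation is shown below. The V range [20, 255] follows the text; the Gaussian and dilation kernel sizes are illustrative assumptions, not the exact values used in the thesis.

import cv2
import numpy as np

def galaxy_binary(img_bgr):
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    log = cv2.Laplacian(cv2.GaussianBlur(gray, (5, 5), 0), cv2.CV_8U)
    cleaned = cv2.subtract(gray, log)                          # grayscale minus LoG
    hsv = cv2.cvtColor(cv2.cvtColor(cleaned, cv2.COLOR_GRAY2BGR), cv2.COLOR_BGR2HSV)
    binary = cv2.inRange(hsv, (0, 0, 20), (180, 255, 255))     # keep V in [20, 255]
    return cv2.dilate(binary, np.ones((3, 3), np.uint8))       # connect stray pixels

def circularity(binary):
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    shape = max(contours, key=cv2.contourArea)                 # the galaxy contour
    area = cv2.moments(shape)["m00"]                           # M00 is the area
    perimeter = cv2.arcLength(shape, True)
    return 4 * np.pi * area / perimeter ** 2                   # C = 4*pi*A / P^2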
2.5 Black/ White pixel ratio
To calculate the B/W pixel ratio parameter we follow the same process defined in Section 2.4 to
get the binary image shown in Figure 7 c). The number of black and white pixels in the image is
counted and the ratio obtained.
a) Original Image
b) Image – LoG
c) Binary Image
d) Filled Binary Image
Figure 7. Thresholding of image to get the shape.
2.6 Convexity
The convexity of a shape is defined by [17]:

\[ C_x = P / (2H + 2W) \]

where P is the perimeter of the shape, H is the height of the bounding rectangle, and W is the width of the bounding rectangle.
To calculate the convexity we follow the same process defined in Section 2.4 to produce the
binary image shown in Figure 7 c). We find the bounding rectangle of the contour obtained by
the OpenCV function boundingRect() and calculate the height and width of the rectangle. To
obtain the perimeter we use the arcLength() function as in Section 2.4.
2.7 Form Factor
The Form Factor of a shape is defined in [17]:

\[ F = A / P^2 \]

where A is the area of the shape and P is the perimeter of the shape.
As described in Section 2.4 above the area of the shape obtained in Figure 7c) is obtained by
calculating M00 and perimeter by the arcLength() function.
2.8 Bounding rectangle to fill factor
After obtaining the binary image as described in Section 2.4 and shown in Figure 7c) and
calculating the bounding rectangle using boundingRect() as described in Section 2.6 we obtain
the bounding rectangle to fill factor parameter, which is described in [17] by the equation:

\[ B_x = A / (H * W) \]

where A is the area of the shape and H * W is the area of the bounding rectangle.
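The three bounding-rectangle features of Sections 2.6-2.8 can be computed together, as in the following sketch; it reuses the galaxy_binary() helper from the circularity sketch and is illustrative rather than the thesis code.

import cv2

def rectangle_features(binary):
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    shape = max(contours, key=cv2.contourArea)   # galaxy = largest contour
    x, y, w, h = cv2.boundingRect(shape)
    P = cv2.arcLength(shape, True)               # perimeter
    A = cv2.moments(shape)["m00"]                # area (M00)
    convexity = P / (2 * h + 2 * w)              # Cx = P / (2H + 2W)
    form_factor = A / P ** 2                     # F = A / P^2
    fill_factor = A / (h * w)                    # Bx = A / (H * W)
    return convexity, form_factor, fill_factor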
2.9 Count of line intersection of the shape
After obtaining the binary image as described in Section 2.4 and shown in Figure 7c) we obtain
the contour with the largest area from the image. A line is then drawn from the center of the shape at every degree of angle from 0° to 360°, and we count the number of times the binary intensity changes along each line.
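A sketch of the radial counting loop follows; walking each ray pixel by pixel and counting black/white transitions is one simple way to realize the description above, with the maximum search radius an illustrative assumption.

import numpy as np

def max_line_intersections(binary, max_r=90):
    cy, cx = binary.shape[0] // 2, binary.shape[1] // 2
    best = 0
    for deg in range(360):                        # one ray per degree
        theta = np.deg2rad(deg)
        prev, transitions = 0, 0
        for r in range(max_r):
            y = int(cy + r * np.sin(theta))
            x = int(cx + r * np.cos(theta))
            if not (0 <= y < binary.shape[0] and 0 <= x < binary.shape[1]):
                break
            cur = 1 if binary[y, x] > 0 else 0
            if cur != prev:                       # black/white transition on the ray
                transitions += 1
            prev = cur
        best = max(best, transitions)
    return best                                   # spirals produce more transitions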
2.10 Maximum values of Red and Blue channels
For calculating this parameter the image is split into its Red, Green and Blue channels. The
green channel of the image is then set to zero and the red and blue channels are added
together. This is done because spirals have red discs and blue arms and non-spirals are
generally red. Then the maximum value for both the channels is calculated using the
minMaxLoc() function of OpenCV.
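A minimal sketch of this feature: with OpenCV the image loads in BGR order, so the blue and red channels are split and their maxima read with minMaxLoc(). Zeroing the green channel, as the text describes, does not change the two maxima and is omitted here.

import cv2

def max_red_blue(img_bgr):
    b, g, r = cv2.split(img_bgr)                  # OpenCV stores images as B, G, R
    _, max_blue, _, _ = cv2.minMaxLoc(b)          # minMaxLoc returns (min, max, ...)
    _, max_red, _, _ = cv2.minMaxLoc(r)
    return max_red, max_blue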
2.11 Concentration index
The concentration index [20] can be expressed as:

\[ CI = R_{90} / R_{50} \]

where \(R_{90}\) is the radius which contains 90% of the Petrosian flux, and \(R_{50}\) is the radius which contains half of the Petrosian flux.
For this parameter we do not need any pre-processing and can work directly with the RGB image converted to grayscale. Our first aim is to find the Petrosian radius. The mean intensity of the light from the galaxy within the Petrosian radius \(R_p\) is given by the equation [20]:

\[ I(R_p) = \eta \, \frac{\int_0^{R_p} I(r)\, 2\pi r \, dr}{\pi R_p^2} \]

where \(\eta\) is a constant (1 for this project).

We adopt the definition of the Petrosian radius \(R_p\) as the radius where the value of the Petrosian ratio [30] is equal to 0.2; this project accepts values between 0.17 and 0.22. The Petrosian ratio \(R_p(r)\) at a radius r from the center of an object is defined in [30] as "the ratio of the local surface brightness in an annulus at r to the mean surface brightness within r." This can be written in equation form as:

\[ R_p(r) = \left( \frac{\int_{0.8r}^{1.25r} 2\pi r' I(r') \, dr'}{\pi \left(1.25^2 - 0.8^2\right) r^2} \right) \Bigg/ \left( \frac{\int_0^{r} 2\pi r' I(r') \, dr'}{\pi r^2} \right) \]
To calculate \(R_p(r)\) we start from a radius of 20 px in the image and go up to 45 px, computing the intensity within each radius. We also compute the intensity of the image within two more bounds: an upper radius of 1.25 times the radius, and a lower radius of 0.8 times the radius. We then apply the above equation to the calculated values to compute the Petrosian ratio. The radii whose Petrosian ratio falls between 0.17 and 0.22 are stored, and the maximum such radius is taken to be the Petrosian radius. If we do not find any such radius in the image, we set the Petrosian radius to 44.8998, the maximum value of the radius that we loop through in the image when calculating \(R_p(r)\). We then calculate the Petrosian flux, given by:
\[ F_p = \int_0^{2 r_p} 2\pi r' \, I(r') \, dr' \]

where \(r_p\) is the Petrosian radius.
We then calculate the values of the parameters 𝑅𝑅90 and 𝑅𝑅50 by passing through the image
starting from the center and calculating the flux for incrementing radii until we find the flux that
is around 90% and 50% of the Petrosian flux respectively. We can then calculate the
concentration index.
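The following Python sketch mirrors this procedure numerically, approximating the flux integrals by summing grayscale pixels inside circular masks. The 20-45 px search range, the 0.17-0.22 acceptance band, and the 44.8998 fallback come from the text; the helper names and the mask-based integration are illustrative.

import numpy as np

def ring_flux(gray, r_in, r_out):
    # total flux in the annulus r_in <= r < r_out around the image center
    cy, cx = gray.shape[0] // 2, gray.shape[1] // 2
    yy, xx = np.ogrid[:gray.shape[0], :gray.shape[1]]
    dist = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
    return gray[(dist >= r_in) & (dist < r_out)].sum()

def concentration_index(gray):
    rp = 44.8998                                          # fallback from the text
    for r in range(20, 45):
        annulus = ring_flux(gray, 0.8 * r, 1.25 * r)
        area_ann = np.pi * (1.25 ** 2 - 0.8 ** 2) * r ** 2
        mean_inside = ring_flux(gray, 0, r) / (np.pi * r ** 2)
        ratio = (annulus / area_ann) / mean_inside        # Petrosian ratio at r
        if 0.17 <= ratio <= 0.22:
            rp = r                                        # keep the largest such radius
    flux_p = ring_flux(gray, 0, 2 * rp)                   # Petrosian flux within 2*rp
    r50 = r90 = None
    for r in range(1, int(2 * rp) + 1):                   # grow circles from the center
        f = ring_flux(gray, 0, r)
        if r50 is None and f >= 0.5 * flux_p:
            r50 = r
        if r90 is None and f >= 0.9 * flux_p:
            r90 = r
    return r90 / r50                                      # CI = R90 / R50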
2.12 Aspect Ratio
As described in the previous chapter, aspect ratio can be interpreted as the ratio between the width and height of the bounding rectangle of the galaxy. We use the method described in Section 2.4 to obtain the binary image (Figure 7c). The contour with the maximum area and maximum arc length is then extracted, which gives the galaxy as a standalone object. The bounding rectangle is then calculated, and the aspect ratio is defined as:

\[ AR = W / H \]
2.13 Extent
The extent of a shape is given by the equation:

\[ \mathrm{Extent} = CA / (H * W) \]

where CA is the contour area and H * W is the bounding rectangle area. As described in Section 2.12, we obtain the contour and bounding rectangle, from which the contour area and bounding rectangle area can be calculated.
2.14 Red to Blue color intensity ratio
We obtain the bounding rectangle for the image as described in the previous section. The
original image is then cropped to the size of this bounding rectangle. Figure 8 shows the
process. As can be seen in Figure 8c), we obtain a representation of the galaxy without most of the background noise, which in these images generally consists of stars and dust. After we obtain the modified image, the mean intensity of each channel (Red, Green, and Blue) is calculated, and then the ratio between the values obtained for the Red and Blue channels is computed.
2.15 Fitted ellipse angle
To calculate this parameter we follow the process defined in the previous section to obtain the contour shown in Figure 8b). A best-fit ellipse is then calculated for the obtained shape using the OpenCV function fitEllipse(). The angle member of the returned object gives the value of the parameter.
a) Original Image
b) Obtained Contour
c) New Cropped Image
Figure 8. Process of cropping the image to size of bounding box.
2.16 Fitted Ellipse Height to Width ratio
To calculate this parameter we follow the process defined in Section 2.14 to obtain the contour shown in Figure 8b). A best-fit ellipse is then calculated for the obtained shape using the OpenCV function fitEllipse(). The height and width of the fitted ellipse can be read from the size member of the returned object, from which the parameter is calculated.
2.17 Color coherence vector (CCV)
Consider the following example to understand the CCV concept more clearly. For this example the color space is discretized into six colors, and we set T = 3, where T is the minimum number of connected pixels that share the same color.

1 1 1 1 4
2 1 4 4 4
2 4 4 6 6
1 4 4 5 5
1 1 3 5 5

Figure 9. Discretizing and finding the coherent regions in CCV.
In Figure 9, the pixels numbered 1, 4 and 5 are considered coherent because the number of
pixels in the connected region equals or exceeds T=3. On the other hand, the pixels numbered
2, 3 and 6 are considered incoherent. The descriptor for this example will look as shown in
Table 3, where C and I denote the number of coherent and incoherent pixels respectively.
     Color 1   Color 2   Color 3   Color 4   Color 5   Color 6
     C    I    C    I    C    I    C    I    C    I    C    I
     8    0    0    2    0    1    8    0    4    0    0    2

Table 3. The resulting vector for CCV.
The first step is to increase the contrast of the input image to enhance some of the regions with sparse patterns. The image is then converted to the HSV color space. The next step is to discretize this HSV color space into 64 colors using the Hue, Saturation, and Value parameters. To check whether each pixel belongs to a coherent region, we count the number of connected pixels in each color bin and compare it against the T parameter, as required by the definition of a coherent region. The value of T was varied from 0 to the size of the image (i.e. 180 x 180 = 32400).
For this parameter we modify the approach suggested in [19] in two ways:
1. [19] blurs the image before starting the discretization process to eliminate slight variations between adjacent pixels. For this project we do not blur the image, as this would discard the peak intensities and adversely impact the parameter value.
2. [19] sets the value of the parameter T to around 1% of the size of the image. This project tested that value but found that the best results are obtained with a slightly higher value of T; the optimum was found to be 1.54% of the size of the image.
Figure 10 shows the obtained resulting images with different T values.
a) Spiral Image
b) Circular Image
Figure 10. CCV visualization for T = 500.
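For concreteness, here is a small Python sketch of the CCV computation that uses OpenCV's connected-components labelling in place of a hand-written region grower. The 64-colour discretization and the threshold T follow the text; the per-channel quantization levels are an illustrative assumption.

import cv2
import numpy as np

def ccv(img_bgr, T=500):
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    # quantize H (0-179), S and V (0-255) to 4 levels each: 4*4*4 = 64 colours
    qh = (hsv[:, :, 0] // 45).astype(np.int32)
    qs = (hsv[:, :, 1] // 64).astype(np.int32)
    qv = (hsv[:, :, 2] // 64).astype(np.int32)
    colour = qh * 16 + qs * 4 + qv
    vector = []
    for c in range(64):
        mask = (colour == c).astype(np.uint8)
        n, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
        coherent = incoherent = 0
        for lbl in range(1, n):                   # label 0 is the background
            size = stats[lbl, cv2.CC_STAT_AREA]
            if size >= T:
                coherent += size                  # region large enough: coherent
            else:
                incoherent += size
        vector.extend([coherent, incoherent])
    return vector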
This chapter completes the first step, feature extraction, as shown in Figure 3. Once the features are computed, we combine them to form a dataset that can be provided to different machine learning classifiers, completing the second step of the classification system. The next chapter describes this second step: it presents the two machine learning algorithms applied to the obtained dataset and the reasons for choosing them.
Chapter 3
Classification Algorithms
This chapter describes the different machine learning algorithms that were used to classify the
galaxy images from the extracted features.
Since we are dealing with a binary classification problem, i.e. the classification of the examples
into positive (spiral) and negative (non-spiral) classes, we take a supervised learning approach
and use two algorithms:
1. Logistic Regression.
2. Boosting, using decision stumps as weak classifiers.
3.1 Introduction to algorithms
3.1.1 Logistic Regression
The term regression refers to finding a best fit line for provided data points, i.e. a line that gives
the best approximation of the data based on some parameters or features. Figure 11 visualizes
this idea. The data points used are:

    x: 0    1    2    3    4    5    6
    y: 0    4.1  9.7  8.8  4.2  6.1  7.8
Figure 11. Best fit line for a set of data points.
The data points that lie near or at the best fit line can be predicted reliably. As we move to data
points lying away from the line, their prediction becomes less reliable.
The term logistic refers to the logistic function, a sigmoid with the equation:

\[ f(t) = \frac{1}{1 + e^{-t}} \]

A logistic function is useful due to the property (mentioned in [31]) that its input can take any value from \(-\infty\) to \(+\infty\) but the output always lies between 0 and 1, as shown in Figure 12. The input t can also be viewed as a linear combination of features; for n features t can be written as:

\[ t = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_n x_n \]
Figure 12. A sigmoid function example.
As can be seen, irrespective of the input value the logistic function always outputs a value between 0 and 1.
Thus in logistic regression we obtain the best fit for a set of data points by learning the parameters of the sigmoid function. In practice we learn the conditional distribution \(P(y|x)\), where x is the input (example) and y is the class label, 0 or 1 in the case of binary classification. If we consider v to be the weight vector we get the following two equations:

\[ p(y=1 \mid x; v) = \frac{1}{1 + e^{-v \cdot x}} \]
\[ p(y=0 \mid x; v) = 1 - p(y=1 \mid x; v) \]
The first equation refers to the probability that an example belongs to class 1 and the second
equation refers to the probability that an example belongs to class 0.
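As an illustration, a compact gradient-descent sketch of this model is given below; it is a generic logistic regression trainer on the extracted feature vectors, not the WEKA implementation actually used in the experiments (Figure 26), and the learning rate and epoch count are illustrative.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_logistic(X, y, lr=0.1, epochs=1000):
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # column of ones for the bias a0
    v = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ v)                        # p(y = 1 | x; v) for every example
        v -= lr * X.T @ (p - y) / len(y)          # gradient step on the log-loss
    return v

# prediction for one feature vector f: spiral if sigmoid(np.hstack([1.0, f]) @ v) > 0.5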
3.1.2 Boosting
The drawback of the logistic regression technique is the absence of an implicit feature selection
process and thus ‘bad’ features can affect the accuracy of the algorithm negatively. Thus logistic
regression for a classification problem with a large number of features, as in the case of this
thesis, can result in low accuracy. Boosting addresses this problem as the AdaBoost algorithm
[33] contains a feature selection process called ‘feature boosting’ implicitly.
Michael Kearns in [32] posed the 'hypothesis boosting problem', which asks whether the existence of an efficient learning algorithm whose output hypothesis performs only slightly better than random guessing implies the existence of an efficient learning algorithm whose output hypothesis achieves high accuracy. In simpler terms, this problem asks whether 'a set of weak learners can be combined into a strong learner'.
In this project we use the AdaBoost algorithm as described in [33], which tries to find a weighted combination of classifiers that fits the data well. It applies a weak classifier, a decision stump (a decision tree of unit height) in this project, iteratively to the dataset and maintains a distribution of weights over the examples. Initially all examples are assigned the same weight, generally \(1 / (\text{number of examples in the dataset})\). After every call to the weak classifier, the weights of the incorrectly classified examples are increased and the weights of the correctly classified examples are decreased, so that in each round the weak classifier focuses more on the examples misclassified so far. Thus we can think of boosting as an algorithm that tries to rectify the mistakes of the previous step in the immediate next step. The algorithm is shown below in Figure 13.
Figure 13. The AdaBoost algorithm adopted from [33].
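A minimal scikit-learn sketch of this setup is shown below; the thesis itself used the WEKA AdaBoostM1 implementation (Figure 27), so the library choice and the number of rounds here are illustrative assumptions.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# a decision stump is a decision tree of unit height
model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100)
# X: (n_examples, n_features) matrix of the extracted features
# y: 1 for spiral, 0 for non-spiral
# model.fit(X, y); predictions = model.predict(X_test)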
3.2 Advantages and disadvantages of these algorithms
The reasons for choosing these algorithms for this project are mentioned below.
1. Logistic Regression is great for binary classification as the sigmoid function naturally creates a
single decision boundary.
2. Logistic regression has low variance and so is less prone to over-fitting.
3. Boosting reduces both variance and bias. The bias is reduced in initial iterations and variance
in later iterations.
4. Boosting has the concept of ‘feature boosting’ intrinsic to it, which resembles the feature
selection process and thus actually forces the classification algorithm to focus on the more
important features with respect to the data.
Logistic Regression and Boosting also have some drawbacks, mentioned below.
1. Noise and outliers in the data affect boosting negatively, as the algorithm may keep trying to classify the outliers, thereby increasing the convergence time.
2. Boosting has a long training time.
3. Logistic Regression cannot predict continuous outcomes.
4. Unlike boosting, Logistic Regression does not automatically perform feature selection.
5. Logistic Regression does not handle missing values.
Chapter 4
Deep Learning
This chapter introduces the concept of deep learning and why it has been so successful in image
classification tasks.
4.1 Introduction to Deep Learning
The basic idea of deep learning emerged from the concept of artificial neural networks (ANN), which in turn are inspired by the biological neurons in the human brain that carry electric signals to and from the brain. An ANN consists of several layers of interconnected neurons (Figure 14 b) forming an artificial network (Figure 14 a), and is typically defined by three types of parameters:
1. Patterns: the pattern of connections between the different layers of neurons.
2. Learning: the process used for updating the weights of the neural connections.
3. Activation Function: the function that converts a neuron's weighted input to its output.
Deep learning refers to a class of machine learning techniques in which the input is passed through multiple layers of processing for feature learning. Several techniques, like deep neural networks (DNN), deep belief networks (DBN), and Restricted Boltzmann Machines (RBM), which are described below, are specific implementations of deep learning [37].
Input
Output
Hidden
a)
b)
Figure 14 a) An Artificial Neural Network and b) A neuron.
4.1.1 Deep Neural Network
A deep neural network (DNN), as the name suggests, is an ANN with multiple hidden layers
between the input and output layers. As every layer in the network identifies features present in the input, the extra layers in a DNN compose features from the lower levels, thereby modeling complex data with fewer parameters than a similarly performing shallow ANN [37].
4.1.2 Deep Belief Network
A deep belief network (DBN) is a type of DNN in which connections exist only between the visible and hidden layers, and not among visible-visible or hidden-hidden units within a layer. The main idea behind the DBN is that each hidden layer serves as the visible layer for the next hidden layer. As shown by Hinton et al. [38], DBNs can be trained one layer at a time, stacking every trained layer on top of the previous one, thereby giving the network a deep hierarchical architecture. Every layer of the DBN is constructed from Restricted Boltzmann Machines (RBM), which are described in the next section.
4.1.3 Restricted Boltzmann Machines
Before describing the Restricted Boltzmann Machines (RBM), it is necessary to understand the
following terms:
4.1.3.1 Energy based models
Energy based models (EBM) associate a cost function, termed the energy, with every configuration of the variables of interest, or as LeCun, Chopra et al. define it in [38]: "Energy-Based Models (EBMs) capture dependencies between variables by associating a scalar energy to each configuration of the variables." These models learn by minimizing the energy function associated with the system. Figure 15 shows an example of an EBM, where the output shows the correspondence between X and Y.
Figure 15. An EBM that measures the compatibility between observed variables X and variables
to be predicted Y using the energy function E(X, Y) [38].
4.1.3.2 Boltzmann Machines
Hinton & Sejnowski introduced Boltzmann machines in their 1986 paper [39] as a neural network of stochastic units with an energy associated with the network. Figure 16 below shows an example of a Boltzmann machine consisting of 4 visible units v1, v2, v3, and v4 and 3 hidden units h1, h2, and h3. All the units are connected to each other, subject to the following restrictions:
1. A unit cannot be connected to itself.
2. All connections are symmetrical.
Figure 16. A Boltzmann machine in graphical format adopted from [40].
The energy function of a Boltzmann machine is defined in [40] as:

\[ E = -\left( \sum_{i<j} w_{ij} s_i s_j + \sum_i \theta_i s_i \right) \]

where \(w_{ij}\) is the connection strength between units i and j, \(s_i \in \{0, 1\}\) is the state of unit i, and \(\theta_i\) is the bias of unit i.
The probability of unit i taking the value 1 is given by the equation below, as defined in [40]:

\[ p_{i=1} = \frac{1}{1 + \exp(-\Delta E_i / T)} \]

where T is the temperature of the system.
Restricted Boltzmann machines are a variant of the Boltzmann machines described above, the difference being the absence of connections between visible-visible and hidden-hidden units. Figure 17 shows an example of a restricted Boltzmann machine consisting of 3 visible units and 4 hidden units.
Figure 17. A restricted Boltzmann machine in graphical format adopted from [41].
The energy function of an RBM is defined in [41] as:

\[ E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_i \sum_j v_i w_{i,j} h_j \]

where \(w_{i,j}\) is the weight associated with the connection between hidden unit \(h_j\) and visible unit \(v_i\), and \(a_i, b_j\) are the bias weights of the visible and hidden units respectively.
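For concreteness, the RBM energy can be evaluated directly from this definition, as in the small numpy sketch below; the shapes match the 3-visible, 4-hidden example of Figure 17, and the random values are purely illustrative.

import numpy as np

def rbm_energy(v, h, W, a, b):
    # E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i W_ij h_j
    return -(a @ v) - (b @ h) - (v @ W @ h)

rng = np.random.default_rng(0)
v = rng.integers(0, 2, 3)          # 3 binary visible units
h = rng.integers(0, 2, 4)          # 4 binary hidden units
W = rng.normal(size=(3, 4))        # visible-to-hidden weights
a, b = np.zeros(3), np.zeros(4)    # bias weights
print(rbm_energy(v, h, W, a, b))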
Hinton & Salakhutdinov showed in [42] that RBMs can be trained individually and then stacked on top of each other to form a deep belief network, as described in Section 4.1.2, and thus can be represented as shown in Figure 18 below.
Figure 18. A deep belief network with stacked RBMs adopted from [43].
Hinton et al. proposed a fast training algorithm for DBNs in [44] which can be summarized in
Figure 19.
Figure 19. Training algorithm for DBNs adopted from [45].
4.2 Success of Deep Learning
The factors that have contributed to the success of deep belief networks range from the huge increase in dataset sizes at one end of the spectrum to the fine-tuning of the algorithms themselves at the other. The following factors have contributed to the success of the deep learning approach, especially in object recognition problems:
1. Huge Datasets: The size of datasets has increased drastically. To understand why this has contributed to the success of the deep learning approach, we need to go back to the paper by Valiant [46], which shows that it is possible to design a machine with the following three properties:
(A) The machines can provably learn whole classes of concepts. Furthermore these classes can
be characterized.
(B) The classes of concepts are appropriate and nontrivial for general purpose knowledge, and
(C) The computational process by which the machines deduce the desired programs requires a
feasible (i.e. polynomial) number of steps.
Point (C) is of the utmost importance here, as it can be interpreted as follows, as given in [47]: "if you have a finite number of functions, say N, then every training error will be close to every test error once you have more than log N training cases by a small constant factor and thus there will be practically no overfitting." For example, our dataset has images of size 424 x 424, so a perfect machine that takes the raw pixels as input must distinguish on the order of \(2^{18000}\) possible models, which is a huge number. We can do some preprocessing and down-sample the images to around 60 x 60, reducing this number to around \(2^{3600}\), which is still very large; obtaining a perfect model therefore remains very difficult. However, as shown in [47], this argument makes a case for neural networks being a good fit for large amounts of data: consider a neural network with X parameters, each of type float (32 bits). The total number of bits in the network is then 32X, and there can be \(2^{32X}\) distinct neural networks (as a bit is binary). When we have more than about 32X training examples, the chances of overfitting are drastically reduced, as described above. Thus we need a deep neural network with a large number of parameters.
2. Faster Computers: With the advent of Graphics Processing Units (GPUs), we can now build
large neural networks and still have relatively fast training times.
3. Fine Tuning of the training algorithm: Stochastic Gradient Descent (SGD) has been very
successful in training deep neural networks, since the SGD algorithm does not need to
remember examples visited during previous iterations and thus converges faster when
training on large datasets, as shown in Figure 20 below.
Figure 20. SGD convergence time adopted from [48].
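As a small illustration of why SGD needs no memory of earlier examples, the following sketch (a toy logistic-regression learner, not the training code used in this thesis) updates the weights one example at a time:

```python
import numpy as np

def sgd_logistic(X, y, lr=0.01, epochs=5, rng=np.random.default_rng(0)):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):      # visit examples in random order
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))
            w += lr * (y[i] - p) * X[i]        # update from this example only
    return w
```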
Chapter 5
Results
In this chapter we first analyze the available data and discuss the process of obtaining the labels
for the provided examples. Then the results obtained from applying machine learning and deep
learning algorithms on our data are presented.
5.1 Available data
The SDSS database [4] provides the images of the galaxies and the human annotated
data. The total number of images provided is 61578. Before proceeding further we need to
understand the human annotated data. In the Galaxy Zoo project [5], volunteers classified the
galaxy images into ellipticals, spirals and mergers (if the image contains merging galaxies). Figure
21 shows the decision tree that was used to guide the classification process. An example path
through the decision tree is as follows:
Is the galaxy simply smooth and rounded, with no sign of a disk?
  (yes) How rounded is it?
    (yes) .......
    (no) ........
  (no) Could this be a disk viewed edge-on?
    (yes) Is there a sign of a bar feature through the centre of the galaxy?
      (yes) / (no) Is there any sign of a spiral arm pattern?
Figure 21. The Galaxy Zoo classification decision tree adopted from [49].
The last question is the one we are interested in answering in this thesis. The volunteers
classified the galaxy images into several classes. A snapshot of the various classes and of the
actual human annotated data is presented in Figures 22 and 23 respectively. In Figure 22, the
Task column refers to the classes that the images were classified into, with the responses being
the subclasses. As an example, for question 4, if the answer to the question “is there any sign of
a spiral arm pattern?” is “yes”, the image belongs to class 4.1, and if “no”, it belongs to class 4.2.
So one image can belong to several classes.
Figure 22. A snapshot of the decision tree in question form adopted from [49].
Figure 23. The human annotated data.
The values under the various classes are the percentages of volunteers who answered that the
galaxy belonged to the given class. As an example, for galaxy id 100008, the value in class 4.1 is
0.418398 and in class 4.2 is 0.198455, i.e. about 41% of volunteers considered the image to have
a spiral arm pattern and about 20% did not. Since it is a decision tree and a question might not be
reached depending on the responses to the previous questions, the sum of the values may or
may not equal 1, i.e. we cannot treat these values as pure probabilities.
5.2 Logistic Regression and Boosting
As we are not given labels, only the percentage of people answering “yes”
to each question, we need to create the labels before building a classification model. For the
application of the machine learning algorithms we apply the following condition (Condition 1)
to create labels for the images:
Condition 1: If the difference between the fraction of people answering question 4 (“is there any
sign of a spiral arm pattern?”) in the positive and in the negative is more than 60% (0.6), then the
label is assigned 1 (positive) or 0 (negative) accordingly. In other words, if the positives outnumber the
negatives by 60 percentage points or more, the label assigned is positive, and vice versa. Other images are not
taken into account.
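A minimal sketch of Condition 1 (the two arguments are the class 4.1 and 4.2 fractions from the annotation table; the function name is ours):

```python
def label_condition1(p_yes, p_no):
    """Return 1, 0, or None (image discarded) according to Condition 1."""
    if p_yes - p_no >= 0.6:
        return 1
    if p_no - p_yes >= 0.6:
        return 0
    return None

assert label_condition1(0.75, 0.10) == 1
assert label_condition1(0.05, 0.70) == 0
assert label_condition1(0.418398, 0.198455) is None  # e.g. galaxy 100008 above
```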
By using the above condition we obtain 2983 examples: 1774 positive and 1209 negative. Thus
as a baseline, if we classify all examples as positive we still achieve
1774/(1774 + 1209) ≈ 59% accuracy. We use 10-fold cross validation for both algorithms. Before
going further, it is important to discuss the cross validation technique and why it is used in this
thesis. Generally to test a machine learning method the dataset is divided into two parts as
shown in Figure 24, namely, training set (to train the classifier) and test set (to test the
performance of the trained classifier on a new example).
Figure 24. Splitting the dataset into two different sets.
In the N-fold cross validation technique, we run the experiment on the dataset N times, and
for every run the training and test sets are picked randomly according to the split
that the user provides. Figure 25 shows the process more clearly.
Figure 25. N-fold cross validation (here N = 5). The unfilled regions represent the training set for
each iteration and the filled region represents the test set.
The advantage of using a large N is that all the examples in the dataset are
used in the training or test set at least once, so the bias is reduced drastically; however, the
variance increases, as does the computation time, since we have to run the experiment N
times. If a small N is used, the bias increases but the variance is reduced,
and so is the computation time.
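For illustration, the equivalent evaluation in Python with scikit-learn would look like the sketch below (the experiments in this thesis were run in WEKA; the file name and column layout are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = np.loadtxt("features.csv", delimiter=",")  # hypothetical feature file
X, y = data[:, :-1], data[:, -1]                  # label in the last column
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(scores.mean(), scores.std())                # accuracy over the 10 folds
```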
Using this method, we use the logistic regression and boosting implementations of WEKA [50]
with the parameters shown in Figure 26 for logistic regression and Figure 27 for boosting.
Figure 26. Logistic Regression WEKA parameters.
Figure 27. AdaBoost WEKA parameters.
We use 10-fold cross validation, with every step having a 70%-30% training to testing split. We
obtain the confusion matrices shown in Figure 28 for logistic regression and Figure 29
for boosting, below.
                 Predicted: Yes   Predicted: No
Actual: Yes           1502             272
Actual: No             375             834
Figure 28. The confusion matrix for logistic regression.
                 Predicted: Yes   Predicted: No
Actual: Yes           1520             254
Actual: No             428             781
Figure 29. The confusion matrix for AdaBoost.
The confusion matrix is laid out as follows:
                 Predicted: Yes    Predicted: No
Actual: Yes      True Positive     False Negative
Actual: No       False Positive    True Negative
We measure the performance of the models using the following measures:
1. Accuracy: The ratio of the number of correctly classified instances, both positive and negative,
to the total number of examples.
2. F1 measure: The harmonic mean of precision and recall. Precision is the number of correct
positive results divided by the number of all positive results, i.e. the fraction of retrieved
instances that are relevant. Recall is the number of correct positive results divided by the
number of positive results that should have been returned, i.e. the fraction of relevant
instances that are retrieved. The concepts of precision and recall are illustrated in Figure 30.
Figure 30. Precision and Recall adopted from [51].
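Computing these measures from the logistic-regression confusion matrix in Figure 28 reproduces the numbers reported below:

```python
tp, fn, fp, tn = 1502, 272, 375, 834                 # counts from Figure 28
accuracy = (tp + tn) / (tp + fn + fp + tn)           # 0.783
precision = tp / (tp + fp)                           # 0.800
recall = tp / (tp + fn)                              # 0.847
f1 = 2 * precision * recall / (precision + recall)   # 0.823
print(accuracy, precision, recall, f1)
```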
3. AUC-ROC curve: A Receiver Operating Characteristic (ROC) curve plots the true positive rate
against the false positive rate for every possible classification threshold. Before continuing, it is
important to understand the meaning of ‘for every possible classification threshold’.
Fawcett [52] explains this concept in a detailed manner and the following explanation is based
on his work. We deal with two kinds of classifiers: discrete and probabilistic. A discrete classifier
only produces the label for a given example, e.g. 0 or 1 (negative or positive), and thus a
discrete classifier will produce only one point in the ROC space. A probabilistic classifier
produces a numeric score, which can be a probability or not, that represents the ‘degree to
which a given example is an instance of a particular class.’ If the obtained score is above a
threshold, which is the classification threshold introduced above, then the classifier produces 1
(positive), else 0 (negative).
A simple method to obtain a ROC curve using cross validation, implemented in WEKA, is to collect
the probabilities for the instances in the test folds and sort them. The area under the ROC
curve (AUC) measures the ability of the classifier to correctly classify the examples in
question. A perfect classifier has an AUC of 1, a good classifier has a value approaching 1,
and a value of 0.5 is comparable to random guessing.
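A sketch of the threshold sweep with scikit-learn on toy scores (the curves in this thesis came from WEKA, so this is purely illustrative):

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                     # toy labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]   # classifier scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)     # one point per threshold
print(roc_auc_score(y_true, y_score))                 # area under the curve
```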
The results of logistic regression and boosting are shown in Tables 4 and 5 respectively.
              Precision   Recall   F-Measure   ROC Area
Class = yes     0.800     0.847      0.823      0.849
Class = no      0.754     0.690      0.721      0.849
Accuracy (correctly classified instances): 78.31%
Table 4. Logistic Regression Results.
              Precision   Recall   F-Measure   ROC Area
Class = yes     0.780     0.857      0.817      0.831
Class = no      0.755     0.646      0.696      0.831
Accuracy (correctly classified instances): 77.13%
Table 5. Boosting Results.
As can be seen in the tables, logistic regression performs better than AdaBoost. The ROC curves
obtained from WEKA for both logistic regression and boosting are shown in Figure 31, and a
comparison between the two curves is shown in Figure 32.
Figure 31. ROC curves (true positive rate vs. false positive rate) for Logistic Regression (above) and AdaBoost (below).
Figure 32. Comparing the ROC curves (true positive rate vs. false positive rate) for Logistic Regression and Boosting. Logistic Regression is the upper curve, with AUC = 0.849, compared to AUC = 0.831 for AdaBoost.
5.3 Deep Learning
For deep belief networks we used the DBN implementation of the Python nolearn module [53].
The first step was to create a dataset compatible with the DBN. Initially the same dataset
that was used for the machine learning algorithms, consisting of 2983 examples, is used. Each
image is first converted to grayscale and cropped to 180 x 180 pixels, then down-sampled to
69 x 69 to keep the size of the network manageable. Since the network takes raw pixels as
input, the number of units in the input layer equals the number of pixels in the image, which
makes the down-sampling step important. The pixel values are then extracted from the images;
the resulting data format is shown in Figure 33 below.
Figure 33. Dataset format for feeding into DBN.
Every row represents an image, and each semicolon-separated number in a row gives a pixel
value of that image, with the label appended at the end. We then convert the data into two
numpy arrays, one for the pixel values and one for the labels, divide the data into training and
testing sets with a 70%-30% split, and train a network with 3 hidden layers of 1000, 500 and
500 units, with learning rates of 0.15, 0.12 and 0.1 respectively.
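A sketch of this preprocessing pipeline with OpenCV (the file name is hypothetical; the crop and resize sizes are those given above):

```python
import cv2
import numpy as np

img = cv2.imread("galaxy.jpg", cv2.IMREAD_GRAYSCALE)      # hypothetical file
h, w = img.shape
crop = img[h//2 - 90:h//2 + 90, w//2 - 90:w//2 + 90]      # central 180 x 180
small = cv2.resize(crop, (69, 69), interpolation=cv2.INTER_AREA)
pixels = small.flatten().astype(np.float32) / 255.0       # 4761 input units
```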
The result obtained is shown in Figure 34 below.
              Precision   Recall   F1-score   Support
Class = 0       0.60       0.52      0.56        374
Class = 1       0.69       0.74      0.71        521
Avg / Total     0.65       0.65      0.65        895
Figure 34. DBN results for a small dataset of 2983 images.
Since DBNs require a large amount of data to train, the low precision and recall values are
expected. The final column, ‘support’, shows the number of test images in each class
(1 = positive, 0 = negative). Since there were 2983 images in total and the dataset was split in a
70:30 ratio, the test set has 895 images, of which 393 were labelled 0 and 502 were labelled 1;
the DBN predicted the label 1 for 521 images and the label 0 for 374 images.
The DBN is then given all 61578 images, but the condition for obtaining the label is changed from
Condition 1, as described in section 5.2, to Condition 2 described below.
Condition 2: If the difference between the fraction of people answering question 4 (“is there
any sign of a spiral arm pattern?”) in the positive and in the negative is more than 60% (0.6), the
label is assigned 1 (positive) or 0 (negative) accordingly. Failing that, if more than 50% of
the people answered either positive or negative, the label is assigned 1 (positive) or
0 (negative) accordingly. For the remaining images, if the difference between the fractions
is more than 25% (0.25), the label is assigned 1 (positive) or 0 (negative) accordingly.
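A sketch of the tiered Condition 2, in the same hypothetical form as the Condition 1 sketch in section 5.2:

```python
def label_condition2(p_yes, p_no):
    """Apply the three tiers of Condition 2 in order; None = discarded."""
    if p_yes - p_no >= 0.6:
        return 1
    if p_no - p_yes >= 0.6:
        return 0
    if p_yes > 0.5:              # second tier: absolute majority
        return 1
    if p_no > 0.5:
        return 0
    if p_yes - p_no >= 0.25:     # third tier: weaker margin
        return 1
    if p_no - p_yes >= 0.25:
        return 0
    return None
```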
Here the split is 67%-33%, leaving 20321 test images, of which 16445 were
labelled 0 and 3876 were labelled 1, so the majority-class baseline is about 0.8. The result
obtained is shown in Figure 35.
              Precision   Recall   F1-score   Support
Class = 0       0.84       0.96      0.90      16452
Class = 1       0.57       0.23      0.33       3869
Avg / Total     0.79       0.82      0.79      20321
Figure 35. DBN results for the complete dataset of 61578 images.
The results show a jump in the overall score from about 0.7 to about 0.8, but one concern is that
this dataset is class imbalanced and the result does not show much improvement over the
majority-class baseline. This concern is not addressed in this thesis.
5.4 Feature Analysis
In this section the features are analyzed separately to determine whether all the features behave
similarly in the classification algorithm or whether some features dominate the others in
determining the results. The steps followed are:
a) Find the features that do not perform well in the classification task by themselves.
b) Find the features that do perform well in the classification task by themselves.
c) Combine all the features obtained in a) into a single combined feature set.
d) Try combinations of the features from step b) with the combined set from c) to
determine how much the performance degrades or improves, i.e. whether the features that
perform well dominate the rest or vice versa; a sketch of this procedure follows the list.
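A sketch of this ablation procedure (scikit-learn shown for illustration, with a synthetic stand-in feature matrix and hypothetical feature indices; the thesis experiments used WEKA):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def score(cols, X, y):
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X[:, cols], y, cv=10).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                       # toy feature matrix
y = (X[:, 5] + 0.1 * rng.normal(size=200) > 0).astype(int)
weak, strong = [0, 2], [5, 7]                        # hypothetical indices
base = score(weak, X, y)                             # step c)
for f in strong:                                     # step d)
    print(f, score(weak + [f], X, y) - base)         # gain from adding f
```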
Logistic Regression, using 10-fold cross validation with a 70%-30% split, is used to determine the
performance of the features. The experiments yield the following results: when the worst
performing features are combined together, the results obtained are shown in Figure 36. When
they are combined with the best and the second best performing features, the accuracy jumps
as shown in Figure 37 and Figure 38 respectively.
Figure 36. Performance of all poorly performing features together.
Figure 37. Performance of the poorly performing features combined with the best performing feature.
Figure 38. Performance of the poorly performing features combined with the second best
performing feature.
As can be seen from the results, the better performing features dominate the poorly performing
ones, at least when logistic regression is used. The so-called poorly performing features can
also be classifier dependent: when tested on a Multilayer Perceptron, as an experiment, the
same features give an accuracy of 67.38%, as shown in Figure 39.
Figure 39. Performance of all poorly performing features together using MLP.
The best performing features were CCV (Section 2.17); a combination of aspect ratio (Section
2.12), extent (Section 2.13), fitted ellipse angle (Section 2.15), fitted ellipse height to width ratio
(Section 2.16) and red to blue color intensity ratio (Section 2.14); and circularity (Section 2.4).
At the other end of the spectrum, the features that performed poorly were detection of bar
(Section 2.1) and disk/bulge ratio (Section 2.3).
Chapter 6
Unsuccessful Experiments
This chapter summarizes some of the experiments that did not give the expected results for the
problem at hand.
6.1 Blob Detection
The blob detection method was tried to detect the presence of a bar in galaxy images. When
we convert the images to their contrast-enhanced versions, the bar feature actually outshines
the rest of the galaxy. Since blob detection finds regions that differ in properties such as
brightness, this method should be able to extract the area at the center of the image, which
appears to be of higher intensity than the rest of the image. Contours could then be used for
shape detection.
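A minimal OpenCV sketch of the idea (hypothetical file name; detector parameters untuned):

```python
import cv2

img = cv2.imread("galaxy.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical file
params = cv2.SimpleBlobDetector_Params()
params.filterByColor = True
params.blobColor = 255                 # look for bright blobs (the bar region)
detector = cv2.SimpleBlobDetector_create(params)
keypoints = detector.detect(img)       # candidate bright central regions
```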
This method was not adopted due to the following reasons:
a) When the center of the contrasted image was examined pixel by pixel, the intensities were
found not to be uniform.
b) The bars in the images lie at different angles, as shown in Figure 40, and thus a rotation-invariant
feature detector was needed.
Figure 40. Images with the bar feature at different angles.
6.2 Laplacian of Gaussian (LoG)
The LoG method was tried to compute the number of arms of a given galaxy. Since arms are
present only in spiral galaxies, this could be used as a feature for our classifier.
Because the LoG is a negative bell-shaped function, we believed it would help maximize the
values at the ends of the contour image. That way we could obtain the points at the far ends of
a spiral image, giving a good indication of the presence of arms, since for a spiral image
the far ends of the image are farther apart. The planned process was to find the distance
between the farthest points in the image, giving a numeric value; a computed
value above a certain threshold distance would indicate that the galaxy has arms and is thus
spiral.
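A sketch of the planned LoG computation (hypothetical file name and threshold; the farthest-pair search is brute force, which is acceptable for the small number of strong responses):

```python
import cv2
import numpy as np

img = cv2.imread("galaxy.jpg", cv2.IMREAD_GRAYSCALE)        # hypothetical file
log = cv2.Laplacian(cv2.GaussianBlur(img, (5, 5), 0), cv2.CV_64F)
ys, xs = np.where(np.abs(log) > 0.8 * np.abs(log).max())    # strong responses
pts = np.column_stack([xs, ys]).astype(float)
dist = max(np.linalg.norm(p - q) for p in pts for q in pts) # farthest pair
# dist above a chosen threshold would have indicated the presence of arms
```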
This method was not adopted due to the following reasons:
a) Some of the spiral galaxies have tightly bound arms; even the farthest points in the image
relative to the arms might not cross the threshold.
b) The contour images were not accurate.
c) Some of the circular and elliptical galaxies cover a large area of the image, so the
distance between their farthest points can easily exceed the threshold.
6.3 Discrete Fourier Transform (DFT)
The Fourier transform converts an image from the spatial domain to the frequency domain, and
the DFT is a popular technique for obtaining the geometrical characteristics of an image. This
method was considered for edge-on galaxy images, an example of which is shown in Figure 41
below, and for images with galaxies rotated at an angle.
Figure 41. An edge-on galaxy image.
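A sketch of the frequency-domain inspection that was attempted (hypothetical file name):

```python
import cv2
import numpy as np

img = cv2.imread("galaxy.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical file
spectrum = np.fft.fftshift(np.fft.fft2(img))           # centre the zero frequency
magnitude = 20 * np.log(np.abs(spectrum) + 1)          # log magnitude spectrum
```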
This method was not adopted due to the following reasons:
a) The Fourier-transformed images do not give the desired information due to the noise in the
images.
b) Most of the images contain face-on galaxies.
6.4 Case Based Reasoning (CBR)
Leake [54] states that reasoning has often been defined as a process that ‘draws conclusions by
chaining together generalized rules, starting from scratch’. CBR differs in two ways: first, the
rules are replaced by cases, and second, chaining is replaced by retrieval and adaptation of
cases. A simple CBR process is shown in Figure 42 below.
Figure 42. CBR process adopted from [55].
This method was not adopted due to the following reasons:
a) It is difficult to represent the data and to decide the structure of the cases.
b) As described in [54], CBR exploits the relationship between the ‘space of problem descriptions’
and the ‘space of problem solutions’, as shown in Figure 43, and establishing this relationship
for our problem was not straightforward.
This is a method that is very interesting and will be tried in future work.
Figure 43. CBR process relationships adopted from [54].
Chapter 7
Conclusion and Future Work
In this chapter the contributions of the thesis are highlighted and then the scope of future
work is presented.
7.1 Contribution
The main contribution of this thesis is the generation of novel features for classifying galaxy
images. Given the noisy nature of the images and the limited interpretability of neural
networks, this thesis provides features that can be used to classify galaxies into spiral and
non-spiral, and this feature-design approach provides a better understanding of the images. We
designed 22 different features, used simple machine learning techniques for classification,
and obtained good results. This thesis highlights that hand-designed features can give results
comparable to those obtained by deep belief networks in this problem domain. The features
extracted range from simple ones, such as the color ratio of the image, at one end to more
complex ones, such as the concentration index and dark matter density, at the other end of the
spectrum. This thesis shows that even the simplest of features can be critical for
classification. Our approach also overcomes the lack of interpretability of deep belief networks.
7.2 Future Work
Future work to extend this thesis could include:
a) There is scope for designing more features for the problem and for extending the existing
ones. Although this thesis looks extensively at the features that can be derived both from the
computer vision side (for example, aspect ratio and red to blue color intensity ratio) and from
the astronomy side (for example, detection of bar and concentration index), many other
features could be designed.
b) The features designed here are specific to this problem, but there is scope for designing
generalized features that extend to other domains as well; that approach would be more
computer vision oriented.
c) The human annotation data can be considered as a feature for classification, and human
advice based learning [56] could be used; seeking a domain expert’s advice can help the
system perform better.
d) This thesis tackles binary classification, i.e. we classify the images into only two classes.
Multi-class classification could also be considered: as can be seen in Figure 21, the Galaxy Zoo
decision tree, every task could be interpreted as a class. Other machine learning algorithms
could also be experimented with.
7.3 Summary
Overall, we have seen that feature design coupled with basic machine learning algorithms
can produce a general system for classification. Feature design becomes an important part
of a system for interpreting the results obtained, which a deep belief network fails to provide,
or provides only to a lesser extent. Our system gives good initial results and serves as a starting
point for future work.
The shape of a galaxy reveals a lot of information about the age and the types of stars present
in it. There are roughly one hundred billion galaxies scattered throughout the observable
universe. This is a huge amount of prospective data, so this problem holds great potential for
feature design: designing features that differentiate not only between the different galaxy
shapes but also between galaxies with similar shapes. For example, a single feature capturing
the shape, the number of arms of the galaxy and how tightly the arms are bound could be
designed. Different machine learning algorithms can also be compared to determine which
works best with this kind of data, given even the simplest of features. To actually solve the
problem, very powerful features need to be designed; by powerful we mean features that are
as general as possible, so that there is no need to design a large number of them, yet
discriminative enough to provide near-perfect classification. More sophisticated neural
networks might also solve the problem, but this seems unlikely given the nature of the images.
Thus feature design is the way to go in the future.
REFERENCES
[1] http://data.galaxyzoo.org/
[2] http://en.wikipedia.org/wiki/Hubble_sequence
[3] http://en.wikipedia.org/wiki/Galaxy_morphological_classification
[4] http://www.sdss.org/
[5] http://www.galaxyzoo.org/#/story
[6] de la Calleja, Jorge, and Olac Fuentes (2004). "Automated classification of galaxy images." In
Knowledge-Based Intelligent Information and Engineering Systems, pp. 411-418. Springer Berlin
Heidelberg.
[7] Banerji, Manda, Ofer Lahav, Chris J. Lintott, Filipe B. Abdalla, Kevin Schawinski, Steven P.
Bamford, Dan Andreescu et al. (2010). "Galaxy Zoo: reproducing galaxy morphologies via
machine learning." Monthly Notices of the Royal Astronomical Society, 406(1), 342-353.
[8] Yin Cui, Yongzhou Xiang, Kun Rong, Rogerio Feris, and Liangliang Cao (2014). “A Spatial-Color
Layout Feature for Content-based Galaxy Image Retrieval”. In: IEEE Winter Conference on
Applications of Computer Vision (WACV).
[9] Eskandari, A. R., & Kouchaki, Z. (2013). Regular Shapes Detection in Satellite
Images. Malaysian Journal of Computer Science, 25(1).
[10] G. Wu, W. Liu, X. Xie, Q. Wei (2007). “A Shape Detection Method Based on the Radial
Symmetry Nature and Direction-Discriminated Voting”. ICIP (6), pp169-172.
[11] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
[12] A. Berg, J. Deng, and L. Fei-Fei (2010). Large scale visual recognition challenge.
www.imagenet.org/challenges.
[13] J. Sánchez and F. Perronnin (2011). High-dimensional signature compression for large-scale
image classification. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference
on, pages 1665–1672. IEEE.
[14] Eskridge, P. B., & Frogel, J. A. (2000). What is the True Fraction of Barred Spiral Galaxies?
In Toward a New Millennium in Galaxy Morphology (pp. 427-430). Springer Netherlands.
[15] http://www.astro.caltech.edu/~george/ay20/eaa-darkmatter-obs.pdf
[16] http://en.wikipedia.org/w/index.php?title=Shape_factor_
(image_analysis_and_microscopy)&oldid=572283723
[17] Goderya, S. N., & Lolling, S. M. (2002). Morphological classification of galaxies using
computer vision and artificial neural networks: A computational scheme. Astrophysics and
space science, 279(4), 377-387.
[18] R. Freedman and W. J. Kaufmann (2007). Universe.
[19] Pass, Greg, Ramin Zabih and Justin Miller (1997). "Comparing images using color coherence
vectors." In Proceedings of the Fourth ACM International Conference on Multimedia, pp. 65-73.
[20] http://spiff.rit.edu/classes/phys443/lectures/gal_1/petro/petro.html
[21] Shimasaku, K., Fukugita, M., Doi, M., Hamabe, M., Ichikawa, T., Okamura, S., & York, D. G.
(2001). Statistical properties of bright galaxies in the sloan digital sky survey photometric
system. The Astronomical Journal, 122(3), 1238.
[22] http://en.wikipedia.org/wiki/Navarro%E2%80%93Frenk%E2%80%93White_profile
[23] http://astronomy.swin.edu.au/cosmos/S/Scale+Length
[24] http://www.mattiavaccari.net/research/masterthesis/masterthesis/node34.html
[25] http://www.astro.caltech.edu/~aam/science/thesis/total/node20.html
[26] Otsu, N. (1975). A threshold selection method from gray-level histograms.
Automatica, 11(285-296), 23-27.
[27] http://infohost.nmt.edu/tcc/help/pubs/colortheory/web/hsv.html
[28] http://en.wikipedia.org/wiki/Image_moment
[29] Petrosian, V. (1976). Surface brightness and evolution of galaxies. The Astrophysical
Journal, 209, L1-L5.
[30] Blanton, M. R., Dalcanton, J., Eisenstein, D., Loveday, J., Strauss, M. A., SubbaRao, M. &
Schlegel, D. J. (2001). The Luminosity function of galaxies in SDSS commissioning data. The
Astronomical Journal, 121(5), 2358.
[31] Hosmer, David W.; Lemeshow, Stanley (2000). Applied Logistic Regression (2nd ed.). Wiley.
[32] Kearns, M. (1988). Thoughts on hypothesis boosting. Unpublished manuscript,45, 105.
[33] Freund, Y., Schapire, R., & Abe, N. (1999). A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5), 771-780.
[34] www.apsky.org/ngc/ngc.html
[35] Quinlan, J. R. (2014). C4. 5: programs for machine learning. Elsevier.
[36] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and trends® in Machine
Learning, 2(1), 1-127.
[37] Hinton, G., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief
nets. Neural computation, 18(7), 1527-1554.
[38] LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., & Huang, F. (2006). A tutorial on energy-based learning. In Predicting Structured Data. MIT Press.
[39] Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann
machines. MIT Press, Cambridge, Mass, 1, 282-317.
[40] http://en.wikipedia.org/wiki/Boltzmann_machine
[41] http://en.wikipedia.org/wiki/Restricted_Boltzmann_machine
[42] G.E. Hinton and R.R. Salakhutdinov (2006). Reducing the Dimensionality of Data with
Neural Networks, Science, Vol. 313. no. 5786, pp. 504 - 507.
[43] deeplearning.net/tutorial/deeplearning.pdf
[44] Hinton, G., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief
nets. Neural computation, 18(7), 1527-1554.
[45] http://en.wikipedia.org/wiki/Deep_belief_network
[46] Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134-1142.
[47] http://yyue.blogspot.com/2015/01/a-brief-overview-of-deep-learning.html
[48] Bottou, L. (2012). Stochastic gradient descent tricks. In Neural Networks: Tricks of the
Trade (pp. 421-436). Springer Berlin Heidelberg.
[49] Willett, K. W., Lintott, C. J., Bamford, S. P., Masters, K. L., Simmons, B. D., Casteels, K. R., ...
& Thomas, D. (2013). Galaxy Zoo 2: detailed morphological classifications for 304 122 galaxies
from the Sloan Digital Sky Survey. Monthly Notices of the Royal Astronomical Society, stt1458.
[50] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The
WEKA data mining software: an update. ACM SIGKDD explorations newsletter, 11(1), 10-18.
[51] http://en.wikipedia.org/wiki/Precision_and_recall
[52] Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874.
[53] https://pypi.python.org/pypi/nolearn
[54] Leake, D. B. (1996). CBR in context: the present and future. Case-Based Reasoning,
Experiences, Lessons & Future Directions, 1-30.
[55] You, C. F., Jeng, C. R., & Liu, K. Y. (2013). Design for Automotive Panels Supported by an
Expert System. INTECH Open Access Publisher.
[56] Kunapuli, G., Odom, P., Shavlik, J. W., & Natarajan, S. (2013, December). Guiding
Autonomous Agents to Better Behaviors through Human Advice. Data Mining (ICDM), 2013 IEEE
13th International Conference on (pp. 409-418). IEEE.
In collaboration with Tasneem Alowaisheq.
Resume
Devendra Singh Dhami
3801 S Sharon Dr., Bloomington, Indiana 47403
[email protected] | (812)369-3967
OBJECTIVE
Given my great interest in the field of Artificial Intelligence and my love for Astronomy, I decided
to merge the two and am pursuing my MS at Indiana University, Bloomington. I would like to
extend my Master's to a PhD in the field of Artificial Intelligence, with a focus on applications
of AI in Astronomy and other fields.
Research Interests:
1. Artificial Intelligence
2. Computer Vision
3. Machine Learning
4. Astronomy
EDUCATION
Indiana University, Bloomington, USA
Master of Science in Computer Science August 2013- Present
Expected Graduation Date: May 2015
Current GPA = 3.80/4
Visvesvaraya Technological University, Bangalore, India
Bachelor of Engineering in Information Science & Engineering August 2006-June 2010
Undergraduate Percentage = 79.82%
PROJECTS UNDERTAKEN
Master Thesis (in progress):
Morphological Classification of galaxies into spirals and non-spirals.
Thesis Advisor: Prof. David J. Crandall
Thesis Committee: Prof. David J. Crandall, Prof. David B. Leake, Prof. Sriraam Natarajan
Graduate Research Projects:
1. Morphological Classification of galaxies using Computer Vision Techniques and Case Based
Reasoning. (Class project for CSCI-B657 and CSCI-B552)
2. Prediction of Indian Tiger population increase/decrease in future using Machine Learning
techniques. (Class project for I 526/CSCI-B 659)
3. Gamification of the Optional Practical Training Application Process for International Students.
(Class project for ILS Z 516)
4. Prevention of man-in-the-middle attacks on the SSL layer in cryptographic systems (Class
project for CSCI-B649)
Undergraduate Research Project:
1. Designed ‘Sentiment Analyzer’, a tool to classify user’s sentiments on a topic (e.g. any device
such as a mobile or any event such as a movie etc.)
PROFESSIONAL EXPERIENCE
Indiana University Bloomington (2013- present)
Associate Instructor
1. Course Name: CSCI A348: Mastering the World-Wide Web
Instructor: Prof. Dan-Adrian German
Semester: Fall 2013
2. Course Name: CSCI B551: Elements of Artificial Intelligence
Instructor: Prof. David Leake
Semester: Fall 2014
3. Course Name: CSCI A201/597: Introduction to programming
Instructor: Prof. Eric Wennstrom
Semester: Spring 2015
Hewlett Packard, Bangalore, India (August 2010 – July 2013)
Technology Consultant
1. Project: Project and Portfolio Management (PPM) Automation
Client: HP internal
Platform & Skills: Eclipse, Remote desktop Server, PL/SQL Developer, Toad
Duration: August 2010 – Dec 2010
Description:
To automate various functionalities of the PPM tool by writing scripts, and to write
Java libraries to be integrated in the front end of the tool.
Responsibilities:
• Understanding and Building Automation Framework.
• Developing the code for implementation
• Code Review
• Designed automation scripts for backend verification
• Reported the defects
2. Project: PPM Upgrade
Client: AXA Technologies
Platform & Skills: Eclipse, Remote desktop Server, PL/SQL Developer
Duration: Jan 2011 – July 2013
Description:
To write the code and libraries to upgrade the PPM tool from version 7.1 to
version 8.0, and to develop reusable components (scripts) to generate various
reports from the tool’s database.
Responsibilities:
• Analyze business requirements (e.g., business logic, process flows, and
reporting requirements).
• Preparing Development Plan.
• Prepared daily, weekly and monthly reports as well as progress reports
• Responsible for functionality testing, integration testing and regression testing
• Preparing test cases based on Software Requirement Specifications.
TECHNICAL SKILLS
Languages: Java, C/C++, Python, OpenCV, SQL
Platforms: UNIX, Microsoft Windows and DOS
Databases: Structured Query Language, Oracle
Web development: Java Script, HTML, PHP
MISCELLANEOUS DETAILS
I presented a paper on ‘Sentiment Analysis’ at Tech-On, the international technical conference
of Hewlett-Packard, where it ranked 149 out of 2000 papers. I also won 1st prize in an IEEE
paper presentation with a paper titled ‘Image Processing’ during my undergraduate studies,
and presented several other papers, on ‘Cloud Computing’ and ‘Information Security’ among
others, in various technical competitions. I won the ‘Star of the Horizon’ award and the ‘Best
Team in HP’ award during my tenure at HP.
I also won a gold medal and a Certificate of Academic Excellence from my undergraduate
school.
I was on the editorial team of my college's annual magazine during my undergraduate years,
with activities ranging from collecting the articles to be printed to overseeing the actual
printing of the magazine.