Object Recognition Using Discriminative Features and
Linear Classifiers
Karishma Agrawal
kdagrawa
Soumya Shyamasundar
sshyamas
Pradheep Shanmugam
pshanmug
ABSTRACT
Amongst the prevailing methods used for object recognition,
deep learning methods are the most favored. Rather than
continuing the trend of improving those methods, in this
paper we explore the time-honoured linear classifiers,
Support Vector Machines (SVM) and Multinomial Logistic
Regression (MLR). We demonstrate how simple transformations
can make the data more responsive to these straightforward
linear separation rules and produce results comparable to
generic feature-based classifiers with complex frameworks.
To delve further into the relationship between the
transformations and the performance of the linear
classifiers, we employ various techniques, leading up to
ensembling, on the CIFAR-10 dataset and discuss the results
and accuracies obtained.
Keywords
SVM, MLR, Principal Component Analysis, ZCA whitening,
Noise, KMeans, Ensemble, Feature Engineering
1. INTRODUCTION
Classification of data is of paramount importance and
relevance in the social media driven world. Current methods
to improve the accuracy of classification involve using large
datasets. The given CIFAR-10 dataset has 4000 images as
the training data and 15000 images in the test dataset. Due
to the immense complexity of the object recognition task,
there is a need for models that compensate for the
insufficient data. Convolutional Neural Networks (CNN) and
Deep Neural Networks typically give the best accuracies, but
at the cost of high computation time and memory
requirements. We have focussed on two conceptually simple
yet powerful methods, Support Vector Machines (SVM) and
Multinomial Logistic Regression (MLR), to train and test on
the original dataset. As classification on the original data
was not satisfactory, we decided to transform the dataset
using techniques such as Principal Component Analysis (PCA),
whitening the images, adding Gaussian noise during training,
including mirror images of all the datapoints, and ensembling
using kmeans, thus making the data more suitable for SVM and
MLR. By tuning the parameters of the classifiers we obtained
accuracies comparable to those of elementary CNNs and Deep
Neural Networks.
The rest of the paper is organized as follows. Section 2
provides the preliminaries, with a brief description of the
core Machine Learning concepts used in the experiments.
Section 3 describes the methods and the experiments we
conducted to implement the learning techniques. Subsequently,
we report the results and draw inferences on the several
datasets used in Section 4. We conclude and discuss scope
for future work in Section 5.
2. BACKGROUND
2.1 Dataset
The CIFAR-10 dataset is a labeled dataset consisting of
50,000 training images, each belonging to one of 10 classes,
as shown in Figure 1.[12] The test set consists of 10,000 new
images from the same categories, which have to be classified
into their respective classes. For the class project, we
were given a subset of CIFAR-10 comprising 4000 training
images and 15000 test images. The aim of the project was to
experiment with various algorithms and apply the numerous
concepts learnt in the course.[11]
Figure 1: CIFAR-10 Dataset[2]
2.2 Support Vector Machines
Our focus was on linear classification methods, chiefly
Support Vector Machines (SVMs) and Multinomial Logistic
Regression (MLR). Both of these algorithms deliver empirically
good performance while being relatively less computationally
intensive than deep learning networks.[5] An SVM focuses only
on the points that are nearest to the boundary of the
separating hyperplane, i.e. the support vectors. The obvious
points are ignored altogether. Training an SVM amounts to
solving a convex quadratic problem, and hence is guaranteed
to find a hyperplane that linearly separates the data, if
one exists. SVMs employ a technique called the kernel trick
to transform the data and then find the optimal boundary
between the possible outputs. The underlying idea is to map
the current features into a feature space with higher
dimensions such that our data points become linearly
separable. However, producing such a higher-dimensional
representation can be computationally taxing. The kernel
trick offers a neat solution to this problem by letting us
bypass the mapping cheaply. Say S is the similarity function
in the new higher-dimensional feature space; the kernel
trick is to define S in terms of the original space itself,
without ever computing the transformation explicitly. [14]
We also tried the off-the-shelf SVM implementation LIBSVM[6].
It can be used to classify multiple classes using a technique
called 'One Vs Rest', also known as 'One Vs All'. In this
method, an SVM model is created for each class, so we need 10
SVM models, one for each class in the CIFAR-10 dataset. Each
model treats one class as positive data points and the
remaining classes as negative data points, and calculates
the probability of its class. In LIBSVM this is done using
the function "ovrtrain". The following parameters are passed
to the function:
(trainLabel, trainData, '-c 1 -g 0.00154 -t 2 -b 1') where
g = γ in the kernel function
c = cost parameter C
t = kernel type (2 selects the radial basis function kernel)
b = probability estimates (1 enables them)
Once the models are available, each test data point is run
through all 10 models and the probability of each class is
calculated. The class whose model outputs the maximum
probability is chosen as the class of the test point. In
LIBSVM, this is done using the function "ovrpredict". The
following parameters are passed to the function: (testLabel,
testData, model) where model is the struct created by
"ovrtrain".
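The decision rule described above can be sketched in a few lines. This is only an illustration of the one-vs-rest argmax step, not the LIBSVM code itself; the function name and the probability values are hypothetical.

```python
def one_vs_rest_predict(class_probabilities):
    """Return the index of the class whose binary model is most confident.

    class_probabilities[k] is the probability reported by the k-th
    one-vs-rest model that the test point belongs to class k.
    """
    best_class = 0
    for k, p in enumerate(class_probabilities):
        if p > class_probabilities[best_class]:
            best_class = k
    return best_class


# Hypothetical outputs of the 10 per-class models for one test image.
probs = [0.05, 0.10, 0.02, 0.60, 0.08, 0.30, 0.01, 0.04, 0.12, 0.07]
predicted = one_vs_rest_predict(probs)
```

Ties are resolved in favour of the lowest class index here; that choice is an assumption of the sketch.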
2.3 Multinomial Logistic Regression
MLR generalizes the logistic regression approach to
classification problems where the output can take more than
two possible values. Like logistic regression, it uses
maximum likelihood estimation to evaluate the probability
of a data point belonging to a category. It uses an
iterative algorithm to estimate the parameters of the model,
and hence it is necessary to consider the sample size and
outlying cases. MLR is often considered an attractive
option since it does not assume normality, linearity or
homoscedasticity; however, it does assume independence
between the categorical choices. Furthermore, MLR assumes
non-perfect separation. If groups of classes are perfectly
separated by the classifier, then unrealistic coefficients
will be learnt and overfitting might occur.
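To make the probability model concrete, the following minimal sketch computes MLR class probabilities with the softmax function. The weights, biases and input are illustrative values only, and the function names are hypothetical; the starter code used in the experiments is not reproduced here.

```python
import math


def softmax(scores):
    """Convert per-class linear scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mlr_predict_proba(weights, biases, x):
    """weights[k] is the weight vector of class k; x is one feature vector."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return softmax(scores)


# Hypothetical 3-class model on 2 features (illustrative values only).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
probs = mlr_predict_proba(W, b, [2.0, 1.0])
```

The predicted class is the index of the largest probability; training would fit the weights by maximizing the likelihood, which is omitted here.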
2.4 Pre-processing
The task of object recognition is too complex to be solved
straightforwardly by a linear classifier. Each image is
represented as a flattened vector of pixel values in the
RGB representation. Soon after the initial runs we learnt
that we needed to transform the dataset to make it more
suitable for our linear classifiers, and hence improve the
classification accuracy. We used various methods to obtain
a better representation of the dataset: converting the data
to grayscale, principal component analysis (PCA), zero-phase
component analysis whitening (ZCA whitening), mirroring
(which increases the training set by flipping the training
images horizontally), adding Gaussian noise to the data, and
kmeans clustering.
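As one concrete instance of these transformations, grayscale conversion by the average method (the method used in this work) could be sketched as follows. The sketch assumes a channel-planar layout of the flattened vector (all red values, then all green, then all blue); the layout of the course data is not stated in the text, so this is an assumption, as is the function name.

```python
def to_grayscale_average(flat_rgb, pixels=1024):
    """Average the R, G and B values of each pixel.

    Assumes a channel-planar layout: all red values first, then all
    green values, then all blue values (an assumption of this sketch).
    """
    if len(flat_rgb) != 3 * pixels:
        raise ValueError("expected 3 * pixels values")
    red = flat_rgb[:pixels]
    green = flat_rgb[pixels:2 * pixels]
    blue = flat_rgb[2 * pixels:]
    return [(r + g + b) / 3.0 for r, g, b in zip(red, green, blue)]


# Tiny 2-pixel example: pixel 0 is (30, 60, 90), pixel 1 is (0, 0, 255).
gray = to_grayscale_average([30, 0, 60, 0, 90, 255], pixels=2)
```

For a 32 x 32 image the default of 1024 pixels reduces the 3072 features to 1024.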
2.4.1 Principal Component Analysis
[18]Principal component analysis re-expresses the data as
a sum of uncorrelated components.[16] PCA rotates the
datapoints so that the directions of maximum variability
become visible; that is, it identifies the most important
gradients in the data, which makes it an excellent
preprocessing technique.
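A minimal NumPy sketch of this idea is given below, assuming the data is stored as one row per image. The experiments themselves were run in MATLAB, so the function names and the synthetic data are illustrative assumptions rather than the code used for the reported results.

```python
import numpy as np


def pca_fit(X, n_components):
    """Return the data mean and the top principal directions (as columns)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order]


def pca_transform(X, mean, components):
    """Project centered data onto the principal directions."""
    return (X - mean) @ components


# Synthetic correlated data standing in for the image features.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.column_stack([base[:, 0], base[:, 0] + 0.5 * base[:, 1], 2.0 * base[:, 2]])
mean, components = pca_fit(X, n_components=2)
Z = pca_transform(X, mean, components)
```

The projected features in Z are uncorrelated, and the first column carries the largest variance.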
2.4.2 Zero-phase Component Analysis
[15]ZCA is closely related to PCA, but aims to decorrelate
the features from one another while making sure that all
the features have the same variance, so that the resulting
covariance matrix is the identity matrix. Unlike PCA, ZCA
whitening does not reduce the dimension of the dataset.
Its benefit is that it removes the redundancy caused by
correlated features.
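The whitening step can be sketched in NumPy as follows. The regularization constant eps and the synthetic data are assumptions of this sketch, not values taken from the experiments.

```python
import numpy as np


def zca_whiten(X, eps=1e-5):
    """ZCA-whiten the rows of X; eps is an assumed regularization constant."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate to the eigenbasis, rescale to unit variance, rotate back.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W


# Synthetic correlated data standing in for the image features.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
X = np.column_stack([base[:, 0], base[:, 0] + 0.5 * base[:, 1], 2.0 * base[:, 2]])
Xw = zca_whiten(X)
cov_w = Xw.T @ Xw / Xw.shape[0]
```

Rotating back to the original axes is what distinguishes ZCA from PCA whitening, and it is why the output keeps the same dimension as the input.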
2.4.3 KMeans
K-means clustering aims to partition n observations into
k clusters in which each observation belongs to the cluster
with the nearest mean, which serves as a prototype of the
cluster. Here we use K-means to learn k centroids and
discover features from the data. Essentially, we form a new
dataset from these features and then train a linear
classifier on it to predict the labels.
The above techniques took about 2-5 minutes to run, with
the exception of KMeans, which took about 35-75 minutes.
All the experiments were run in MATLAB on a system with
8 GB of RAM.
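A minimal sketch of the clustering step (Lloyd's algorithm) is shown below. The random initialization, the fixed iteration count and the toy data are assumptions of this illustration; the MATLAB code used in the experiments is not reproduced.

```python
import numpy as np


def kmeans(X, k, n_iters=50, seed=0):
    """Plain Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Squared distance from every point to every centroid, shape (n, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)


# Two well-separated synthetic blobs as toy data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
               rng.normal(5.0, 0.1, size=(50, 2))])
centroids, labels = kmeans(X, k=2)
```

The distance computation materializes an (n, k, d) array, which is fine for a toy example but would need batching for thousands of centroids.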
3. EXPERIMENTAL METHODOLOGY
3.1 Original Dataset
MLR and SVM did not perform well on the original dataset.
We performed cross validation by holding out 300 random
datapoints from the training set as the validation set and
training the SVM and MLR on the remaining 3700 datapoints.
This validation was repeated at every step while tuning
our parameters to ensure that there was no overfitting.
SVM was carried out using LibSVM and grid search. The best
tuning parameters were λ = 0.01 and γ = 0.001, which
achieved an accuracy of 41.87%. MLR was carried out using
the starter code and tuned to λ = 0.1, yielding an accuracy
of 42.971%. We realized that the dataset in its current
form is not suited to the linear classifiers we were
working with. Hence we decided to transform the dataset
with the methods described below to make it suitable for
the SVM and MLR classifiers.
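The 3700/300 split described in this subsection can be sketched as below. The function name and the fixed seed are hypothetical; only the split sizes come from the text.

```python
import random


def holdout_split(n_samples, n_validation, seed=0):
    """Shuffle the sample indices and hold out n_validation of them."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_validation:], indices[:n_validation]


# 4000 training images: 300 held out for validation, 3700 for training.
train_idx, val_idx = holdout_split(4000, 300)
```

The returned index lists can then be used to slice the training data and labels consistently.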
3.2 Grayscaling
Grayscaling[9] an image maps the RGB values of each pixel
to a single grayscale value. We used the average method to
obtain the grayscaled dataset. This was our first attempt
at feature reduction. Grayscaling gave very poor results,
because most of the information is lost when converting a
colored image to a grayscale image. Hence we did not
venture into tuning the parameters, as this technique
discards the required information.
3.3 Principal Component Analysis
Another popular dimensionality reduction technique we
explored was Principal Component Analysis. PCA reduces the
dimensionality of the data while retaining most of the
variation in the dataset. This is done by identifying the
directions along which the variation of the data is
maximal. These directions are the principal components. On
the basis of intensity variation in the images, we can
keep only those components which give us maximum
information while discarding a lot of redundant data. PCA
is purely descriptive and does not make any prediction
whatsoever. Figure 3 shows a plot of lambda versus the
test accuracies for MLR conducted on 128 principal
components. We achieved our best accuracy of 43.28% for
milestone one with MLR performed on 128 PCA components.
The tuning parameters were λ = 0.1 and γ = 0.001. Reducing
the dimensions proved to be beneficial, as it gave us
slightly better accuracy than the original dataset (43.28%
versus 42.971% for MLR). PCA made a difference in the MLR
experiments, as reducing the features helped reduce the
noise in the system and increase accuracy. It did not make
a lot of difference for SVM[1], as SVM considers only the
boundary points while ignoring the more obvious datapoints.
Figure 3: Test accuracy v/s lambda values for MLR with
PC=128
3.4 Zero-phase Component Analysis
[15]The next transformation technique that we explored
was Zero-phase Component Analysis (ZCA) whitening. This is
very similar to PCA, except that the features obtained are
fully decorrelated from one another. This transformation
preserves edge information but sets to zero the regions of
relatively correlated feature points. In the context of
image data, this translates into eliminating the redundant
pixel values that contain the same information or uniform
color. [13] Whitening did not help either SVM or MLR,
because independence between features is not of much
importance to either classifier. A linearly separable
dataset is all that is required.
3.5 Mirroring
After sufficiently tuning the parameters with the reduced
datasets, we inferred that further tuning would yield only
a marginal increase in accuracy. We needed more training
datapoints to improve performance. We therefore increased
the number of datapoints with a technique known as
mirroring. Mirroring helps to offset the shortage of
datapoints to some extent, as it creates additional
orientations of the data. As expected, the accuracy
improvement for both SVM and MLR was significant. The
training set grew to 8000 × 3072. We achieved an accuracy
of 46.152% with MLR (λ = 0.1 and γ = 0.001) and an accuracy
of 44.86% with SVM (λ = 0.055 and γ = 0.008). PCA performed
on the mirrored dataset also gave good accuracies, as
expected.
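The doubling of the training set by mirroring can be sketched as follows. The sketch assumes each row is a flattened image stored channel by channel in row-major order, which the text does not specify; the function name is hypothetical.

```python
import numpy as np


def add_mirrored(X, labels, height=32, width=32, channels=3):
    """Append a left-right flipped copy of every image to the dataset.

    Assumes each row of X is a flattened image stored channel by
    channel in row-major order (an assumption of this sketch).
    """
    n = X.shape[0]
    images = X.reshape(n, channels, height, width)
    flipped = images[:, :, :, ::-1].reshape(n, -1)  # reverse the column axis
    return np.vstack([X, flipped]), np.concatenate([labels, labels])


# Two tiny single-channel 2 x 3 "images" for illustration.
X_small = np.arange(12).reshape(2, 6)
y_small = np.array([0, 1])
X_aug, y_aug = add_mirrored(X_small, y_small, height=2, width=3, channels=1)
```

Applied to a 4000 × 3072 training matrix with the default arguments, the same function would return the 8000 × 3072 matrix mentioned above.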
Figure 2: Effect of various transformations on the dataset;
data point 120 from the training set is shown. Panels:
(a) Original image, (b) Grayscaled image, (c) Reconstructed
image using 256 principal components, (d) Reconstructed
image using 512 principal components, (e) ZCA whitened
image, (f) Original image with Gaussian noise N (0, 0.01),
(g) Mirrored image.
3.6 Gaussian Noise
[3]Building on the same idea of increasing the datapoints,
we looked into adding external noise to the original
dataset. Gaussian noise was added mainly to prevent
overfitting and to increase the flexibility of the
algorithms. Usually it is added as prior knowledge to
allow for some variation in the data, which helps to
increase the generality of the model used. Adding noise
also accounts for hardware errors and signal and
communication mismatches, which are typically ignored.
Considering these factors when training is meant to make
the classification more accurate. Although there was no
increase in accuracy, reasonable values of 44.724%
(λ = 0.111, γ = 0.007) and 43.305% (C = 1, γ = 0.00154)
were obtained for MLR and SVM respectively.
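A sketch of the noise augmentation is given below. It reads N (0, 0.01) as zero mean with variance 0.01, which is an assumption: the text does not say whether 0.01 is the variance or the standard deviation, nor how the pixel values were scaled.

```python
import numpy as np


def add_gaussian_noise(X, variance=0.01, seed=0):
    """Return a copy of X with zero-mean Gaussian noise added to every feature.

    The variance of 0.01 is an assumed reading of N(0, 0.01).
    """
    rng = np.random.default_rng(seed)
    return X + rng.normal(loc=0.0, scale=np.sqrt(variance), size=X.shape)


# All-zero placeholder data, so the output equals the noise itself.
X = np.zeros((1000, 100))
X_noisy = add_gaussian_noise(X)
```

The noisy copies would be used only for training; the test data is left unchanged.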
3.7 Ensemble
Towards the end, our approach culminated in ensemble
methods. Ensemble learning [17, 10, 7] is a process by
which multiple classifiers are strategically combined to
classify the given data. It is primarily used to improve
the performance of a model, or to reduce the likelihood of
an unfortunate selection of a poor one. We used KMeans as
the unsupervised learning algorithm that discovers features
from the unlabeled data. Although it has been less widely
used in 'deep learning' work, KMeans has shown a lot of
promise in computer vision. Hence, used as the unsupervised
learning module in an ensemble, it has the potential to
produce results similar to those of deep learning and CNNs.
[8]Our first step was to extract random patches from our
dataset and run ZCA whitening on them. The whitened patches
were then used to learn k centroids via KMeans clustering.
Using these centroids we learnt a better representation of
the given dataset. We carried out this process for various
values of k in KMeans. We observed that results with 50
iterations were the best. This newly constructed dataset
was fed to the L2-SVM classifier. The values of C and gamma
were tuned to achieve accuracies as high as 55.276% with a
patch size of 6, 1600 centroids and C=30. This pipeline was
run many times while varying parameters such as the patch
size and the number of centroids. We carried out multiple
runs with exhaustive combinations of patch sizes (4, 6, 8,
9, 10), numbers of centroids ranging from 1000 to 3000, and
cost parameter values of 12, 25, 30, 35, 50 and 100. The
trends are plotted in Figure 8. We improved the accuracy by
using an ensemble to train classifiers on different subsets
of the available features.
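The first step of this pipeline, random patch extraction, could look like the sketch below. The image array layout (n, height, width, channels), the function name and the number of patches are assumptions of this illustration; only the patch size of 6 comes from the text.

```python
import numpy as np


def extract_random_patches(images, patch_size, n_patches, seed=0):
    """Cut n_patches square patches at random positions from random images.

    images has shape (n, height, width, channels); each patch is
    returned flattened, one patch per row.
    """
    rng = np.random.default_rng(seed)
    n, h, w, c = images.shape
    patches = np.empty((n_patches, patch_size * patch_size * c), dtype=images.dtype)
    for i in range(n_patches):
        img = rng.integers(n)
        top = rng.integers(h - patch_size + 1)
        left = rng.integers(w - patch_size + 1)
        patch = images[img, top:top + patch_size, left:left + patch_size, :]
        patches[i] = patch.reshape(-1)
    return patches


# Placeholder images standing in for the 32 x 32 RGB training data.
images = np.zeros((10, 32, 32, 3), dtype=np.float64)
patches = extract_random_patches(images, patch_size=6, n_patches=100)
```

With a patch size of 6 each patch has 6 × 6 × 3 = 108 values; these rows would then be whitened and clustered as described above.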
3.8 Voting
Voting was used as the final step of the ensemble
methods[4]. For each image, we took the majority vote over
the labels predicted by the selected ensemble models. This
gave us a new set of labels, which increased the
classification accuracy. Our best accuracy, 56.876%, was
obtained via majority voting.
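Majority voting over several label predictions can be sketched as follows. The three prediction lists are made-up values for illustration, and the tie-breaking behaviour is a property of this sketch rather than of the method used in the experiments.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label predictions from several classifiers.

    predictions is a list of label lists, one per classifier, all of
    the same length. Ties go to the label seen first among the voters.
    """
    n_images = len(predictions[0])
    voted = []
    for i in range(n_images):
        votes = Counter(p[i] for p in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted


# Hypothetical predictions of three classifiers for three images.
preds = [[3, 1, 2],
         [3, 5, 2],
         [4, 5, 2]]
final_labels = majority_vote(preds)
```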
4. RESULTS
In this section, we present our experimental results, the
impact of the tuning parameters, and the performance of
MLR and SVM on the different datasets. Figure 4 shows a
comparison of MLR performed on the PCA datasets. There was
no noticeable difference between the accuracies obtained
with the original dataset, 128 components or even 512
components.
Figure 4: MLR with PCA components
Next, Figure 5 compares SVM on datasets with various
numbers of PCA components. Here too, there is no
significant improvement in the accuracy values. It did,
however, perform better than on MLR, since an SVM considers
only the boundary datapoints, or support vectors, and
discards the obvious points. In contrast, MLR takes the
entire dataset into consideration. Thus, as expected,
logistic regression works better with fewer features, and
SVM is more fitting with higher dimensions.
Figure 5: SVM on PCA components
Figure 6 shows MLR on all the datasets obtained. As can be
seen, MLR worked best with the mirrored dataset. This is
because logistic regression considers all the datapoints
when predicting the classes. Hence it is always better to
have more points when working with MLR.
Figure 6: MLR on various datasets
In Figure 7, we show a comparison of SVM on the various
datasets. As the mirrored dataset has the most training
data, a clearer hyperplane can be defined to separate the
datapoints. The number of support vectors also increases,
and hence we achieved the highest accuracy with this
dataset.
Figure 7: SVM on various datasets
Figure 8 shows KMeans performed on the dataset, followed
by SVM run on the transformed dataset. The accuracy
increases as the cost parameter decreases, and the trend
is nearly the same across the various patch sizes
considered.
Figure 8: Comparing Kmeans with SVM
5. DISCUSSIONS AND CONCLUSIONS
Our aim was to use a classifier other than the routinely
favored object recognition methods, such as deep belief
networks and convolutional neural networks, and obtain a
comparable accuracy. We realized that although simpler
algorithms can work well and are easily scaled to adapt to
the required scenario, they do require a certain amount of
preprocessing. Polishing the dataset helped us improve our
work at each step. We started with the original dataset
and tuned our classifier parameters. After this we tried
grayscaling, which reduced the accuracy and strengthened
our belief that colour conveys important information for
our data. We observed that reducing the number of features
via PCA gives comparable and at times even better
accuracies. Thus we decided to find a good representation
of the data as required by each of the linear classifiers
used. We then explored ZCA whitening, but that did not
provide satisfactory results. We inferred that the
shortage of training data was contributing to the mediocre
results. Hence we doubled the size of the training set by
adding the horizontal mirror image of each image. This
gave a significant improvement, from 43% to 46%. We found
that there were cases where our models would overfit, so
we decided to add some noise to the training set to
improve flexibility. Though the test accuracy did not
improve, overfitting reduced marginally. Finally we came
across a paper by Adam Coates, Andrew Ng and Honglak Lee
[8], wherein kmeans was used to learn a new feature
representation. We used kmeans clustering on random
patches of our data, creating a receptive field, and
improved our accuracy up to 54%. Further tuning of
parameters and voting helped us reach our current accuracy
of 56.8%. We learnt that both data preprocessing and
parameter tuning can make a huge difference in the
efficiency and accuracy of the classifier.
Development          Classifier    Accuracy (%)
Baseline (MS1)       MLR           43.28
                     SVM           42.038
Improved Baseline    Ensemble      56.876
Table 1: Evaluation data
Accuracy values are certainly a measure of the performance
of the classifier, although obtaining very high accuracies
was not our primary goal. We have shown a clear improvement
over our baseline values by making the dataset more
conducive and receptive to SVM, MLR and subsequently
ensemble methods, as seen in Table 1.
6. REFERENCES
[1] PCA on SVM. Retrieved Dec 08, 2014 from http://www.quora.com/Is-it-worth-trying-PCAon-your-data-before-feeding-to-SVM.
[2] CIFAR10 - Object Recognition in Images. from https://www.kaggle.com/c/cifar-10/data.
[3] Gaussian noise. Retrieved Dec 09, 2014 from http://stats.stackexchange.com/questions/29819/how-adding-covariance-noise-in-gaussianprocesses-to-prevent-overfitting.
[4] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36:105–139, 1999.
[5] C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
[6] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, pages 27:1–27:27, 2011.
[7] A. Coates and A. Ng. Learning feature representations with k-means. In G. Montavon, G. Orr, and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, volume 7700 of Lecture Notes in Computer Science, pages 561–580. Springer Berlin Heidelberg, 2012.
[8] A. Coates, A. Y. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.
[9] J. D. Cook. Three algorithms for converting color to grayscale. Retrieved Dec 08, 2014 from http://www.johndcook.com/blog/2009/08/24/algorithms-convert-color-grayscale/.
[10] T. G. Dietterich. Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems, MCS '00, pages 1–15, London, UK, 2000. Springer-Verlag.
[11] A. Karpathy. Lessons learned from manually classifying CIFAR-10. Retrieved Dec 08, 2014 from http://karpathy.ca/myblog/?p=160.
[12] A. Krizhevsky. The CIFAR-10 dataset. Retrieved Dec 08, 2014 from http://www.cs.toronto.edu/~kriz/cifar.html.
[13] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 2009.
[14] G. Lamp. Why use SVM? Retrieved Dec 08, 2014 from http://www.yaksis.com/posts/why-usesvm.html.
[15] A. Ng, J. Ngiam, C. Y. Foo, Y. Mai, and C. Suen. Ufldl tutorial. Retrieved Dec 08, 2014 from http://ufldl.stanford.edu/wiki/index.php/Whitening.
[16] M. Palmer. Principal components analysis. Retrieved Dec 08, 2014 from http://ordination.okstate.edu/PCA.htm.
[17] R. Polikar. Ensemble learning. Scholarpedia, 4(1):2776, 2009.
[18] C. Shalizi. The truth about principal components and factor analysis data mining. September 2009.
7. APPENDIX
This class project gave us a substantial opportunity to
put to use the knowledge we acquired during the course
10-601. Although we chose to work with uncomplicated
algorithms, the approach needed to obtain the results is
quite involved. We learnt that the nature of the dataset
plays a predominant role when working with linear
classifiers, and that it is always better to transform the
dataset as required. Given that there are numerous
parameters that can be adjusted to get suitable results,
how the dataset is customized depends mainly on the final
target. We understood that there is no general method for
obtaining optimal classification, and that more often than
not we have to test the performance to determine the
adequacy of the classification. Further fine tuning and
tweaking of the parameters will enhance the predictions to
a reasonable degree.
The transformation of the dataset, SVM and the ensemble
method with KMeans clustering were done by Karishma
Agrawal. Pradheep Shanmugam looked after LibSVM. MLR and
preparing the report were done by Soumya Shyamasundar. Our
Kaggle team name is Tardis. We conducted a host of
experiments to reach the conclusion that, with the right
kind of data transformation, linear classifiers can
produce results comparable to those of complex deep neural
networks.