Object Recognition Using Discriminative Features and Linear Classifiers

Karishma Agrawal (kdagrawa), Soumya Shyamasundar (sshyamas), Pradheep Shanmugam (pshanmug)

ABSTRACT
Among the prevailing methods used for object recognition, deep learning methods are the most favored. Rather than following the trend of improving those methods, in this paper we explore time-honoured linear classifiers such as the Support Vector Machine (SVM) and Multinomial Logistic Regression (MLR). We demonstrate how simple transformations can make the data more responsive to these straightforward linear separation rules and produce results comparable to those of generic feature-based classifiers with complex frameworks. To delve further into the relationship between the transformations and the performance of the linear classifiers, we employed various techniques leading up to ensembling on the CIFAR-10 dataset, and we discuss the results and the accuracies obtained.

Keywords
SVM, MLR, Principal Component Analysis, ZCA whitening, Noise, KMeans, Ensemble, Feature Engineering

1. INTRODUCTION
Classification of data is of paramount importance and relevance in the social media driven world. Current methods to improve the accuracy of classification involve using large datasets. The given CIFAR-10 dataset has 4000 images as training data and 15000 images in the test dataset. Due to the immense complexity of the object recognition task, there is a need for models that compensate for the insufficient data. Traditionally, robust algorithms like Convolutional Neural Networks (CNN) and Deep Neural Networks give the best accuracies, but at the cost of high computation time, memory and space requirements.

We have focussed on two conceptually simple yet powerful methods, Support Vector Machines (SVM) and Multinomial Logistic Regression (MLR), to train and test on the original dataset. As classification on the raw data was not straightforward, we decided to transform the original dataset using various techniques: extracting principal components (PCA), whitening the images, introducing Gaussian noise while training, including mirror images of all the datapoints, and ensembling using KMeans, thus making the data suitable for SVM and MLR to perform better. By tuning the parameters of the classifiers we obtained accuracies comparable to those of elementary CNNs and Deep Neural Networks.

The rest of the paper is organized as follows. We provide the preliminaries in Section 2, which has a brief description of the core Machine Learning concepts used in the experiments. Section 3 describes the methods and the experiments we conducted to implement the learning techniques. Subsequently, we report the results and draw inferences on the several datasets used in Section 4. We conclude and additionally discuss scope for future work in Section 5.

2. BACKGROUND

2.1 Dataset
The CIFAR-10 dataset is a labeled dataset consisting of 50,000 training images, each belonging to one of 10 classes, as shown in Figure 1 [12]. The test set consists of 10,000 new images from the same categories, which have to be classified into their respective classes. For the class project, we were given a subset of CIFAR-10 comprising 4000 training images and 15000 test images. The aim of the project was to experiment with various algorithms and implement the numerous concepts learnt in the course. [11]

Figure 1: CIFAR-10 Dataset [2]

2.2 Support Vector Machines
Our focus was on linear classification methods, chiefly Support Vector Machines (SVMs) and Multinomial Logistic Regression (MLR). Both of these algorithms deliver empirically good performance while being relatively less computationally intensive than deep learning networks. [5] SVMs focus only on the points that are nearest to the boundary of the linearly separating hyperplane, i.e. the support vectors. The obvious points are altogether ignored.
Training an SVM amounts to solving a convex quadratic optimization problem, and hence it is guaranteed to find a hyperplane that linearly separates the data, if one exists. SVMs employ a technique called the kernel trick to transform the data and then find the optimal boundary between the possible outputs. Conceptually, the current features are mapped into a feature space with higher dimensions such that our data points become linearly separable. However, explicitly producing such a higher-dimensional representation can be computationally taxing. The kernel trick offers a neat solution to this problem by letting us bypass the mapping cheaply. Say S is the similarity function in the new higher-dimensional feature space; the kernel trick is then to define S in terms of the original space itself, without ever defining what the transformation is. [14]

We also tried classification with an off-the-shelf implementation of SVM, LIBSVM [6]. It can be used to classify multiple classes using a technique called 'One Vs Rest', also known as 'One Vs All'. In this method, one SVM model is created for each class, so we need 10 SVM models, one for each class in the CIFAR-10 dataset. Each model treats one class as the positive data points and the rest of the classes as negative data points, and calculates the probability of its class. In LIBSVM, training is done using the function "ovrtrain". The following parameters are passed to the function:

(trainLabel, trainData, '-c 1 -g 0.00154 -t 2 -b 1')

where
-g = γ in the kernel function
-c = cost parameter C
-t 2 = radial basis function kernel
-b 1 = probability estimates

Once the models are available, the test dataset is run on the models created. Each test sample is run through all the models and the probability of each class is calculated. The classifier which outputs the maximum probability determines the class of the test sample.
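The one-vs-rest decision rule described above can be sketched in a few lines. This is only an illustration, not the LIBSVM implementation: the per-class models are stood in for by hypothetical logistic scorers built with `make_toy_model`, and the names `ovr_predict` and `ovr_predict_all` are our own.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for one trained binary classifier: it maps a feature
# vector to the estimated probability that the sample belongs to its class.
BinaryModel = Callable[[Sequence[float]], float]


def make_toy_model(weights: Sequence[float], bias: float) -> BinaryModel:
    """Build an illustrative logistic scorer (an assumption, not an SVM)."""
    def predict_probability(x: Sequence[float]) -> float:
        score = sum(w * v for w, v in zip(weights, x)) + bias
        return 1.0 / (1.0 + math.exp(-score))
    return predict_probability


def ovr_predict(models: Dict[int, BinaryModel], x: Sequence[float]) -> int:
    """Return the label whose one-vs-rest model reports the highest probability."""
    probabilities = {label: model(x) for label, model in models.items()}
    return max(probabilities, key=probabilities.get)


def ovr_predict_all(models: Dict[int, BinaryModel],
                    samples: Sequence[Sequence[float]]) -> List[int]:
    """Classify every sample by running it through all per-class models."""
    return [ovr_predict(models, x) for x in samples]
```

With ten classes, `models` would hold ten entries, one per CIFAR-10 class, and each test sample would be scored ten times before the maximum is taken.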
In LIBSVM, prediction is done using the function "ovrpredict". The following parameters are passed to the function: (testLabel, testData, model), where model is the struct created by "ovrtrain".

2.3 Multinomial Logistic Regression
MLR generalizes the logistic regression approach to classification problems where the output can take more than two possible values. Like logistic regression, it uses maximum likelihood estimation to evaluate the probability of a data point belonging to a category. It uses an iterative algorithm to estimate the parameters of the model, and hence it is necessary to consider the sample size and outlying cases. MLR is often considered an attractive option since it does not assume normality, linearity or homoscedasticity; however, it does assume independence between the categorical value choices. Furthermore, MLR also assumes non-perfect separation. If groups of classes are perfectly separated by the classifier, then unrealistic coefficients will be learnt and overfitting might occur.

2.4 Pre-processing
The task of object recognition is too complex to be solved straightforwardly by a linear classifier. Each image is represented as a flattened set of pixels, corresponding to the RGB representation. Soon after the initial runs we learnt that we needed to transform the dataset to make it more suitable for our linear classifiers, and hence improve the classification accuracy. We used various methods to obtain a better representation of the dataset: converting the data to grayscale, principal component analysis (PCA), zero-phase component analysis whitening (ZCA whitening), mirroring (which increases the training set by flipping the training images horizontally), adding Gaussian noise to the data, KMeans clustering, etc.
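Two of the simpler transformations listed above, average-method grayscaling and horizontal mirroring, can be sketched directly on a flattened image. The sketch assumes a channel-major layout (all red values, then all green, then all blue, each plane stored row by row); that layout and the function names are assumptions, since the text only says that each image is a flattened set of RGB pixels.

```python
from typing import List, Sequence, Tuple


def to_grayscale(pixels: Sequence[float], height: int, width: int) -> List[float]:
    """Average the R, G and B planes of a flattened image (average method)."""
    n = height * width
    if len(pixels) != 3 * n:
        raise ValueError("expected 3 * height * width values")
    red, green, blue = pixels[:n], pixels[n:2 * n], pixels[2 * n:]
    return [(red[i] + green[i] + blue[i]) / 3.0 for i in range(n)]


def mirror_horizontal(pixels: Sequence[float], height: int, width: int) -> List[float]:
    """Flip an image left to right by reversing every row of every plane."""
    n = height * width
    if len(pixels) != 3 * n:
        raise ValueError("expected 3 * height * width values")
    mirrored: List[float] = []
    for channel in range(3):
        plane = pixels[channel * n:(channel + 1) * n]
        for row in range(height):
            start = row * width
            mirrored.extend(reversed(plane[start:start + width]))
    return mirrored


def augment_with_mirrors(images: Sequence[Sequence[float]], labels: Sequence[int],
                         height: int, width: int) -> Tuple[List[Sequence[float]], List[int]]:
    """Append a mirrored copy of every image, doubling the training set."""
    mirrored = [mirror_horizontal(image, height, width) for image in images]
    return list(images) + mirrored, list(labels) + list(labels)
```

For 32 x 32 colour images, `augment_with_mirrors` turns a 4000 x 3072 training matrix into an 8000 x 3072 one, while `to_grayscale` reduces each row from 3072 to 1024 values.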
2.4.1 Principal Component Analysis
Principal component analysis re-expresses the data as a sum of uncorrelated components [16, 18]. PCA rotates the datapoints such that the maximum variability is visible; that is to say, it identifies the most important gradients, which makes it an excellent preprocessing technique.

2.4.2 Zero-phase Component Analysis
ZCA [15] is closely related to PCA, but formally aims to de-correlate the features from one another and makes sure that all the features have the same variance. The resulting covariance matrix is an identity matrix. ZCA whitening does not reduce the dimension of the dataset. It helps in eliminating redundant information.

2.4.3 KMeans
K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. Here we use K-means to learn k centroids and discover features from the data. Essentially we are forming a new dataset and then training a linear classifier to predict the labels.

The above techniques took about 2-5 minutes to run, with the exception of KMeans, which took about 35-75 minutes. All the investigation was done in MATLAB on a system with 8 GB of RAM.

3. EXPERIMENTAL METHODOLOGY

3.1 Original Dataset
MLR and SVM did not perform well on the original dataset. We performed validation by splitting the training set, selecting 300 random datapoints to form the validation set, and ran the SVM and MLR on the remaining 3700 datapoints. This validation was done at every step while tuning our parameters to ensure that there was no overfitting. SVM was carried out using LIBSVM and grid search. The best tuning parameters were λ = 0.01 and γ = 0.001, which achieved an accuracy of 41.87%. MLR was carried out using the starter code and tuned to λ = 0.1, yielding an accuracy of 42.971%.
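The validation procedure above (hold out 300 of the 4000 training points, then grid-search the tuning parameters) can be sketched as follows. The helper names, the fixed seed and the `evaluate` callback are hypothetical; in practice `evaluate` would train a classifier on the 3700 remaining points and return its accuracy on the 300 held-out points.

```python
import itertools
import random
from typing import Callable, Dict, List, Sequence, Tuple


def holdout_split(n_samples: int, n_validation: int,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly pick n_validation indices for validation; the rest are for training."""
    if not 0 < n_validation < n_samples:
        raise ValueError("n_validation must be between 1 and n_samples - 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return sorted(indices[n_validation:]), sorted(indices[:n_validation])


def grid_search(grid: Dict[str, Sequence[float]],
                evaluate: Callable[[Dict[str, float]], float]
                ) -> Tuple[Dict[str, float], float]:
    """Try every parameter combination and keep the one with the best score."""
    names = list(grid)
    best_params: Dict[str, float] = {}
    best_score = float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # e.g. validation accuracy for these settings
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Calling `holdout_split(4000, 300)` yields 3700 training indices and 300 validation indices, matching the split sizes used here.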
We realized that the dataset in its current form is not fit for the linear classifiers we were working with. Hence we decided to transform the dataset with the methods described below to make it suitable for the SVM and MLR classifiers.

3.2 Grayscaling
Grayscaling [9] an image maps the RGB values of each pixel to a single number giving a grayscale value. We used the average method to obtain the grayscaled dataset. This was our first attempt at feature reduction. Grayscaling gave very poor results, because most of the information was lost when converting a colored image to a grayscale image. Hence we did not venture into tuning the parameters, as this technique discarded the required information.

3.3 Principal Component Analysis
Another popular dimensionality reduction technique we explored was Principal Component Analysis. PCA reduces the dimensionality of the data while retaining most of the variation in the dataset. This is done by identifying the directions along which the variation of the data is maximal. These directions are the principal components. On the basis of intensity variation in images, we can choose only those values which give us maximum information while discarding a lot of redundant data. PCA is purely descriptive and does not make any prediction whatsoever.

Figure 3 shows a plot of lambda versus the test accuracies for MLR conducted on 128 principal components. We achieved our best accuracy of 43.28% for milestone one with MLR performed on 128 PCA components. The tuning parameters were λ = 0.1 and γ = 0.001. Reduction of the dimensions proved to be beneficial, as it gave us a better accuracy than the original dataset (43.28% versus 42.971% for MLR). PCA made a difference in the MLR experiments, as reducing the features helped reduce the noise in the system and increase accuracy. It did not make a lot of difference for SVM [1], as SVM considers only the boundary points while ignoring the more obvious datapoints.

Figure 3: Test accuracy v/s lambda values for MLR with PC=128

3.4 Zero-phase Component Analysis
The next transformation technique that we explored was Zero-phase Component Analysis (ZCA) whitening [15]. This is very similar to PCA, except that the features obtained become completely decorrelated. This transformation preserves edge information but sets to zero the regions of relatively correlated feature points. In the context of image data, this translates into eliminating the redundant pixel values that contain the same information or uniform color. [13] Whitening did not help either SVM or MLR, because independence of the features is not of much importance to either classifier; a linearly separable dataset is all that is required.

3.5 Mirroring
After sufficiently tuning the parameters with the reduced datasets, we inferred that there would be only a marginal increase in the accuracy with further tuning. We needed more training datapoints to improve performance. Intuitively, we decided that we could increase the number of datapoints by implementing a technique known as mirroring. Mirroring helps to offset the inadequacy of datapoints to some extent, as it creates more orientations of the data. As expected, the accuracy improvement for both SVM and MLR was significant. The training set was increased to 8000 × 3072. We achieved an accuracy of 46.152% with MLR (λ = 0.1 and γ = 0.001) and an accuracy of 44.86% with SVM (λ = 0.055 and γ = 0.008). PCA performed on the mirrored dataset also gave us good accuracies, as expected.

3.6 Gaussian Noise
Building on similar lines of increasing the datapoints, we looked into adding external noise to the original dataset [3]. Gaussian noise was added mainly to prevent overfitting and to increase the flexibility of the algorithms.
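A minimal sketch of this kind of noise augmentation is given below. It makes several assumptions not stated in the text: pixel values are scaled to [0, 1], the 0.01 in N(0, 0.01) is a variance (so the standard deviation is 0.1), noisy copies are appended to the clean images rather than replacing them, and results are clipped back into range. The function names are illustrative.

```python
import math
import random
from typing import List, Sequence, Tuple


def add_gaussian_noise(image: Sequence[float], std: float, rng: random.Random,
                       low: float = 0.0, high: float = 1.0) -> List[float]:
    """Add zero-mean Gaussian noise to every pixel and clip to [low, high]."""
    return [min(high, max(low, pixel + rng.gauss(0.0, std))) for pixel in image]


def augment_with_noise(images: Sequence[Sequence[float]], labels: Sequence[int],
                       variance: float = 0.01, seed: int = 0
                       ) -> Tuple[List[Sequence[float]], List[int]]:
    """Append one noisy copy of every image; labels are duplicated unchanged."""
    rng = random.Random(seed)
    std = math.sqrt(variance)  # assumption: 0.01 is a variance, not a std
    noisy = [add_gaussian_noise(image, std, rng) for image in images]
    return list(images) + noisy, list(labels) + list(labels)
```

Fixing the seed keeps the augmented training set reproducible between runs, which makes it easier to compare tuning parameters on equal terms.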
Noise is usually added as prior knowledge to allow for some variation in the data, which helps to increase the generality of the model used. Adding noise also accounts for hardware errors and signal and communication mismatches, which are typically ignored. Considering these factors when training makes the classification more robust. Although there wasn't an increase in the accuracy, reasonable values of 44.724% (λ = 0.111, γ = 0.007) and 43.305% (C = 1, γ = 0.00154) were obtained for MLR and SVM respectively.

Figure 2: Effect of various transformations on the dataset; data point 120 from the training set is shown. (a) Original image (b) Grayscaled image (c) Reconstructed image using 256 principal components (d) Reconstructed image using 512 principal components (e) ZCA whitened image (f) Original image with Gaussian noise N(0, 0.01) (g) Mirrored image

3.7 Ensemble
Towards the end, our approach culminated in ensemble methods. Ensemble learning [17, 10, 7] is a process by which multiple classifiers are strategically combined to classify the given data. It is primarily used to improve the performance of a model, or to reduce the likelihood of an unfortunate selection of a poor one. We used KMeans as the unsupervised learning algorithm that discovers features from the unlabeled data. Although it has been less widely used in 'deep learning' work, KMeans has shown a lot of promise in computer vision. Hence, when used as an unsupervised learning module in an ensemble, it has the potential to produce results similar to those of deep learning and CNNs.

Our first step was to extract random patches from our dataset and run ZCA whitening on them [8]. After this, the whitened data was used to learn k centroids via KMeans clustering. Using these centroids we learnt a better representation of the given dataset. We carried out this process for various values of k in KMeans. We observed that results with 50 iterations were the best.
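The centroid-learning step can be sketched with a plain Lloyd's-algorithm KMeans followed by a feature encoding. Patch extraction and ZCA whitening are omitted, so `points` stands for already-whitened patches. The encoding shown is the "triangle" activation, max(0, mean distance - distance to centroid j), which is one of the options analysed in [8]; the text does not say which encoding was actually used, so treat that choice and all function names as assumptions.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Vector, centroids: Sequence[Vector]) -> int:
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


def kmeans(points: Sequence[Vector], k: int, iterations: int = 50,
           seed: int = 0) -> List[List[float]]:
    """Lloyd's algorithm: alternate between assignment and mean update."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(list(points), k)]
    for _ in range(iterations):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for point in points:
            clusters[nearest(point, centroids)].append(point)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def encode(point: Vector, centroids: Sequence[Vector]) -> List[float]:
    """Triangle encoding: how much closer than average the point is to each centroid."""
    distances = [math.sqrt(squared_distance(point, c)) for c in centroids]
    mean_distance = sum(distances) / len(distances)
    return [max(0.0, mean_distance - d) for d in distances]
```

Encoding every patch of an image with `encode` and pooling the results gives the new feature vector that a linear classifier is then trained on. A pure-Python loop like this is far too slow for 1600 centroids on real image patches and is meant only to show the structure of the computation.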
The dataset constructed from the KMeans features was fed to the L2-SVM classifier. Values for C and gamma were tuned to achieve accuracies as high as 55.276% with a patch size of 6, 1600 centroids and C = 30. This pipeline was run many times while varying parameters such as the patch size and the number of centroids. We carried out multiple runs with exhaustive combinations of patch sizes (4, 6, 8, 9, 10), centroid counts ranging from 1000 to 3000, and cost parameter values of 12, 25, 30, 35, 50 and 100. A graph illustrating the trends is shown in Figure 8. We improved the accuracy by using an ensemble to train classifiers on different subsets of the available features.

3.8 Voting
Voting was used as a final step for the ensemble methods [4]. Selective voting on the results obtained by the ensemble methods was done by taking the majority vote of all the predicted labels for each image. This gave us a new set of labels, which achieved an increase in the classification accuracy. Our best accuracy, 56.876%, was obtained via majority voting.

4. RESULTS
In this section, we present our experimental results, the impact of the tuning parameters, and the performance of MLR and SVM on the different datasets.

A comparative graph of MLR performed on the PCA datasets is shown in Figure 4. There was no noticeable difference in the accuracies obtained with the original dataset, 128 components or even 512 components.

Figure 4: MLR with PCA components

Next, in Figure 5, we compare the various numbers of PCA components used with SVM. Here too, there isn't any significant improvement in the accuracy values, although it did perform better than with MLR, as an SVM considers only the boundary datapoints, or support vectors, and discards the obvious points. In contrast, MLR takes the entire dataset into consideration. Thus, as expected, logistic regression works better with fewer features and SVM is better suited to higher dimensions.
Figure 5: SVM on PCA components

In Figure 6, we see a representation of MLR on all the various datasets obtained. As can be seen, MLR worked best with the mirrored dataset. This is because logistic regression considers all the datapoints when predicting the classes. Hence it is always better to have more points when working with MLR.

Figure 6: MLR on various datasets

In Figure 7, we show a comparison of SVM run on the various datasets. As the training data is largest in the mirrored dataset, we can define a clear hyperplane to separate the datapoints. The number of support vectors also increases, and hence we achieved the highest accuracy with this dataset.

Figure 7: SVM on various datasets

A graph of KMeans performed on the dataset, with SVM subsequently run on the transformed dataset, is shown in Figure 8. The accuracy increases as the cost parameter decreases, and the trend is almost proportional across the various patch sizes considered.

Figure 8: Comparing Kmeans with SVM

5. DISCUSSIONS AND CONCLUSIONS
Our aim was to use a classifier other than the routinely favored object recognition methods, like deep belief networks, convolutional neural networks etc., and obtain a comparable accuracy. We realized that although simpler algorithms can work well and are easily scaled to adapt to the required scenario, they do require a certain amount of preprocessing. Polishing the dataset helped us improve our work at each step. We started with the original dataset and tuned our classifier parameters. After this we tried grayscaling, which reduced the accuracy and strengthened our belief that colour conveys important information in our data. We observed that reducing the number of features via PCA gives comparable and at times even better accuracies. Thus we decided to find a good representation of the data as required by each of the linear classifiers used. We then explored independent component analysis, but that did not provide satisfactory results.
We inferred that the inadequacy of training data was contributing to the mediocre results. Hence we doubled the size of the training set by taking the reflection of each image about the horizontal axis. This gave a significant improvement, from 43% to 46%. We found that there were cases where our model would overfit the dataset, so we decided to add some noise to the training set to improve flexibility. Though the test accuracy did not improve, overfitting reduced marginally. Finally we came across a paper by Adam Coates, Andrew Ng and Honglak Lee [8], wherein KMeans was used to learn a new feature representation. We used KMeans clustering on random patches of our data, creating a receptive field, and improved our accuracy up to 54%. Further tuning of parameters and voting helped us reach our current accuracy of 56.8%. We learnt that both data pre-processing and parameter tuning can make a huge difference to the efficiency and accuracy of the classifier.

Development          Classifier   Accuracy (%)
Baseline (MS1)       MLR          43.28
Baseline (MS1)       SVM          42.038
Improved Baseline    Ensemble     56.876

Table 1: Evaluation data

Accuracy values are certainly a measure of the performance of a classifier, although obtaining very high accuracies was not our primary goal. We have shown a distinguishable improvement over our baseline values by making the dataset more conducive and receptive to SVM, MLR and subsequently ensemble methods, as seen in the above table.

6. REFERENCES
[1] PCA on SVM. Retrieved Dec 08, 2014 from http://www.quora.com/Is-it-worth-trying-PCAon-your-data-before-feeding-to-SVM.
[2] CIFAR10 - Object Recognition in Images. Retrieved from https://www.kaggle.com/c/cifar-10/data.
[3] Gaussian noise. Retrieved Dec 09, 2014 from http://stats.stackexchange.com/questions/29819/how-adding-covariance-noise-in-gaussianprocesses-to-prevent-overfitting.
[4] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36:105–139, 1999.
[5] C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
[6] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, pages 27:1–27:27, 2011.
[7] A. Coates and A. Ng. Learning feature representations with k-means. In G. Montavon, G. Orr, and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, volume 7700 of Lecture Notes in Computer Science, pages 561–580. Springer Berlin Heidelberg, 2012.
[8] A. Coates, A. Y. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.
[9] J. D. Cook. Three algorithms for converting color to grayscale. Retrieved Dec 08, 2014 from http://www.johndcook.com/blog/2009/08/24/algorithms-convert-color-grayscale/.
[10] T. G. Dietterich. Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems, MCS '00, pages 1–15, London, UK, 2000. Springer-Verlag.
[11] A. Karpathy. Lessons learned from manually classifying CIFAR-10. Retrieved Dec 08, 2014 from http://karpathy.ca/myblog/?p=160.
[12] A. Krizhevsky. The CIFAR-10 dataset. Retrieved Dec 08, 2014 from http://www.cs.toronto.edu/~kriz/cifar.html.
[13] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 2009.
[14] G. Lamp. Why use SVM? Retrieved Dec 08, 2014 from http://www.yaksis.com/posts/why-usesvm.html.
[15] A. Ng, J. Ngiam, C. Y. Foo, Y. Mai, and C. Suen. UFLDL tutorial. Retrieved Dec 08, 2014 from http://ufldl.stanford.edu/wiki/index.php/Whitening.
[16] M. Palmer. Principal components analysis. Retrieved Dec 08, 2014 from http://ordination.okstate.edu/PCA.htm.
[17] R. Polikar. Ensemble learning. 4(1):2776, 2009.
[18] C. Shalizi. The truth about principal components and factor analysis data mining. September 2009.

7. APPENDIX
This class project gave us a substantial opportunity to put to use the knowledge we acquired during the course of 10-601. Although we chose to work with uncomplicated algorithms, the approach applied to actualize the results is quite involved. We learnt that the nature of the dataset plays a predominant role when working with linear classifiers, and it is always better to transform the dataset as per the requirement. Considering that there are numerous parameters that can be tuned to get suitable results, how the dataset is customized depends mainly on the final target. We understood that there is no general method for obtaining optimal classification, and that more often than not we have to test the performance to determine the adequacy of the classification. Further fine tuning and tweaking of the parameters will enhance the predictions to a reasonable degree.

The transformation of the dataset, SVM and the ensemble method with KMeans clustering were done by Karishma Agrawal. Pradheep Shanmugam looked after LibSVM. MLR and getting the report ready were done by Soumya Shyamasundar. Our Kaggle team name is Tardis.

We conducted a host of experiments to reach the conclusion that linear classifiers can work as well as complex deep neural networks, given the right kind of data transformation.