14th International Conference on Optimization of Electrical and Electronic Equipment OPTIM 2014
May 22-24, 2014, Brasov, Romania

Feature Extraction, Feature Selection and Machine Learning for Image Classification: A Case Study

Mădălina C. Popescu (1), Lucian M. Sasu (2), IEEE Member
(1) Transilvania University of Braşov; (2) Transilvania University of Braşov, and Siemens Corporate Technology RTC, Braşov, Romania
Abstract
This paper presents feature extraction, feature selection and machine learning-based classification techniques for pollen recognition from images. The number of images is small compared both to the number of derived quantitative features and to the number of classes. The main subject is an investigation of the effectiveness of 11 feature extraction/feature selection algorithms and of 12 machine learning-based classifiers. It is found that some of the feature extraction/selection algorithms and some of the classifiers behave consistently, for better or worse, on this dataset.
In palynology, a specific task is recognizing plants from their pollen grains. Although domain expertise, as provided by highly qualified palynologists, is for now the standard approach, there is increasing interest in applying automatic learning approaches to plant recognition.
We followed the classical machine learning approach: a) use a chain of image-processing techniques to extract quantitative features from each image; b) apply automatic feature extraction/selection techniques; c) apply several classifiers to the resulting data subsets.
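The paper implements this chain in Weka (see below); purely as an illustration, a minimal scikit-learn sketch of steps b) and c) wired together could look as follows, with X and y being hypothetical names for the per-image quantitative features and the species labels.

    # Minimal sketch of the feature extraction/selection + classification
    # chain; illustrative only, the paper itself used Weka, not scikit-learn.
    from sklearn.pipeline import Pipeline
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier

    pipeline = Pipeline([
        ("reduce", PCA(n_components=15)),      # step b): feature extraction
        ("classify", KNeighborsClassifier()),  # step c): one of the classifiers
    ])
    # pipeline.fit(X_train, y_train); print(pipeline.score(X_test, y_test))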
The image dataset is publicly available and contains 38 species of plants collected in a total of 181 images. Starting from those images, we applied several image enhancement techniques (stretching, histogram equalization, gamma correction, etc.) to improve image quality before data extraction. This step is followed by the computation of 48 attributes, which are the initial predictive variables. The question was whether feature extraction/selection could improve the classification result; additionally, we were interested in which type of classifier provides the highest percentage of correct classification (PCC).
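The abstract does not detail the exact enhancement chain or its parameters; a plausible sketch of the three named operations, here using scikit-image (our choice of library, not necessarily the authors'), is:

    # Sketch of the named enhancement steps (parameters are assumptions).
    import numpy as np
    from skimage import exposure

    def enhance(image):
        """Contrast stretching, histogram equalization, gamma correction."""
        # Contrast stretching: map the 2nd-98th percentiles to the full range.
        p2, p98 = np.percentile(image, (2, 98))
        stretched = exposure.rescale_intensity(image, in_range=(p2, p98))
        # Histogram equalization: spread out the intensity distribution.
        equalized = exposure.equalize_hist(stretched)
        # Gamma correction: gamma < 1 brightens dark regions.
        return exposure.adjust_gamma(equalized, gamma=0.8)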
We considered one feature extraction method (principal component analysis) and ten feature selection methods (removal of highly linearly correlated attributes, best-first forward/backward search, reranking/tabu/greedy search, evolutionary/genetic algorithms, linear forward selection, and particle swarm optimization); together with the dataset containing all 48 attributes, this yielded 12 datasets with 15-48 predictive features each. Twelve classification methods were used for species discrimination: naive Bayes, multinomial logistic regression, multilayer perceptron, radial basis function networks, k-nearest neighbors, the rule-based classifier PART, functional trees, best-first decision tree, Hoeffding tree, logistic model trees, C4.5, and random forest. The performance assessments were carried out in the Weka Experiment Environment, with 4-fold stratified cross-validation and 10 random shufflings of the data.
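An equivalent evaluation protocol, sketched here in scikit-learn with two illustrative classifiers standing in for the Weka implementations, repeats stratified 4-fold cross-validation over 10 shuffles and records the PCC (accuracy) of every fold:

    # Sketch of the protocol: stratified 4-fold CV repeated over 10 shuffles.
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier

    cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=10, random_state=0)
    classifiers = {"naive Bayes": GaussianNB(),
                   "k-nearest neighbors": KNeighborsClassifier()}
    # X: 181 x d feature matrix of one of the 12 datasets; y: species labels.
    # for name, clf in classifiers.items():
    #     scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    #     print(name, scores.mean())  # 40 per-fold PCC values per classifier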
The resulting 12 datasets × 10 shuffles × 4 folds × 12 classifiers = 5760 measurements allowed us to draw statistically relevant conclusions using the corrected resampled t-test [1]. Both the feature selection/extraction algorithms and the classification methods were provided by the Weka Experiment Environment [2].
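The corrected resampled t-test adjusts the paired t-statistic for the dependence between cross-validation runs that share training data. A compact implementation of the statistic from [1] (our sketch; diffs, n_train and n_test are hypothetical names for the paired per-fold PCC differences and the train/test set sizes) is:

    # Corrected resampled t-test of Nadeau and Bengio [1], as used by
    # Weka's paired corrected t-tester; diffs holds the n paired per-fold
    # PCC differences between two classifiers.
    import numpy as np
    from scipy import stats

    def corrected_resampled_ttest(diffs, n_train, n_test):
        diffs = np.asarray(diffs, dtype=float)
        n = diffs.size
        mean, var = diffs.mean(), diffs.var(ddof=1)
        # The variance term is inflated by n_test/n_train to account for
        # the overlap between training sets across folds and shuffles.
        t = mean / np.sqrt((1.0 / n + n_test / n_train) * var)
        p = 2.0 * stats.t.sf(abs(t), df=n - 1)
        return t, p

    # For 4-fold CV on 181 images: n = 40, n_test ~ 45, n_train ~ 136.
    # t, p = corrected_resampled_ttest(diffs, n_train=136, n_test=45)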
Greedy search, best-first search with a backward strategy, and principal component analysis produced the lowest PCC values for all classifiers. The largest PCC values were produced by the evolutionary algorithms, reranking search, best-first search with a forward strategy, linear forward selection, and the genetic algorithm.
To compare classifier effectiveness, naive Bayes was taken as the baseline classifier, as its PCC is close to the median of the results. The corrected resampled t-test [1] was used to identify four groups of performance results. The best results were provided by two local models: k-nearest neighbors and the radial basis function network; the same performance group also contains the multilayer perceptron and functional trees. The second performance group, still with better results than naive Bayes, consists of logistic model trees and multinomial logistic regression. The remaining classifiers are statistically equivalent or inferior to the baseline model in terms of PCC.
One of the lessons learned is that greedy search, best-first search with a backward strategy, and principal component analysis should be avoided for this problem, as the performance they induce is consistently poor across all classifiers. Another remark is that it is worth investigating the potential benefits of feature selection: some feature selection techniques (evolutionary and genetic algorithms, reranking search, best-first search with a forward strategy, and linear forward selection) repeatedly allowed the classifiers to outperform the baseline model. Finally, the two classifiers with the largest numbers of good results rely on local models, building predictions from data points that cluster together.
References
[1] C. Nadeau and Y. Bengio, "Inference for the generalization error," Mach. Learn., vol. 52, no. 3, pp. 239–281, Sep. 2003.
[2] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.