Random projections versus random selection of
features for classification of high dimensional data
Sachin Mylavarapu
Ata Kaban
School of Computer Science
The University of Birmingham
Edgbaston, B15 2TT
Birmingham, UK
School of Computer Science
The University of Birmingham
Edgbaston, B15 2TT
Birmingham, UK
Abstract—Random projections and random subspace
methods are very simple and computationally efficient techniques
to reduce dimensionality for learning from high dimensional
data. Since high dimensional data tends to be prevalent in many
domains, such techniques are the subject of much recent interest.
Random projections (RP) are motivated by their proven ability
to preserve inter-point distances. By contrast, the random
selection of features (RF) appears to be a heuristic, which
nevertheless exhibits good performance in previous studies. In
this paper we conduct a thorough empirical comparison between
these two approaches in a variety of data sets with different
characteristics. We also extend our study to multi-class
problems. We find that RP tends to perform better than RF in
terms of the classification accuracy in small sample settings,
although RF is surprisingly good as well in many cases.
Keywords—random projections, random subspace, high dimensions, dimensionality reduction, classification
I. INTRODUCTION
Increasingly there are a number of application domains which
emphasize the need for analysing high dimensional data.
Bioinformatics, e-commerce etc., are a few domains in which
data is very high dimensional. Gene expression microarray
data sets could contain tens of thousands of dimensions, each
of which corresponds to a gene expression level in some
experimental condition. Hundreds of thousands of products
may need to be considered in consumer purchase behaviour
data sets. There is a growing interest in industry and academia
to analyse such datasets [2].
One of the main challenges associated with analysing
high dimensional data is the curse of dimensionality. That is,
the complexity of many existing data mining algorithms
increases exponentially with the increase in the number of
dimensions. With increasing dimensionality the algorithms
become computationally intractable and thus inapplicable
in many applications.
In this work we focus on supervised learning for
classification of high dimensional data. Supervised learning
for classification is a powerful data analysis tool. The
objective of classification is to build a prediction model that
captures inherent relations between the attributes and the class
identity so that an unknown object can be accurately classified
[18]. The classifier is built using a labeled training set and its
accuracy may be validated using a test set. There are several
algorithms in use for supervised learning for classification:
Naïve Bayes, Neural Networks, Decision Trees, Support
Vector Machines (SVM), Fisher Linear Discriminant Analysis
(LDA) and others [18].
When the data is high dimensional, dimensionality
reduction techniques are required. Feature selection is a way
to reduce dimensionality so that only a subset of features,
which are deemed to be important, is selected for the
classification. Another way of dimensionality reduction is
feature construction, e.g. by a projection of the feature space
into a lower dimensional space. An example of the latter is to
perform Principal Component Analysis (PCA) on the high
dimensional data and subsequently perform classification in
the space spanned by the prominent principal components
identified. These techniques are computationally expensive in
general, and in some cases they may lead to inaccurate
classification [1]. Thus there is a need to find alternative
solutions to efficiently and accurately classify high
dimensional data.
A computationally inexpensive technique of
dimensionality reduction is the so-called Random Projections
(RP) method. This is a non-adaptive feature construction
approach that received much attention in recent years from
rigorous theoretical studies to experimental investigations. An
experimental study by Fradkin and Madigan [7] compared the
performance and efficiency of PCA and RP in conjunction
with a number of classifiers, for 2-class problems: C4.5, 1-Nearest Neighbor, 5-Nearest Neighbor and SVM. They found
that classification performance in a random projection space is
not far away from that obtained on the full data. The
theoretical justification of RP was initially the Johnson-Lindenstrauss Lemma, which guarantees a global preservation
of inter-point distances. Stronger guarantees have been
obtained more recently for the Fisher Linear Discriminant
classifier (FLD) [14], which more closely reflect the good
performance observed in practice. On a more heuristic basis, a
random selection of features (RF) is also in use in practice
[23][24].
However, not much is known about the comparative
strengths and weaknesses of these two seemingly analogous
methods. In this paper we aim to fill this gap.
II. BACKGROUND AND RELATED WORK
A. Dimensionality Reduction
Dimensionality reduction techniques preserve the
structure of the data as much as possible while reducing the
number of dimensions. The execution time of the algorithms
is then reduced, since they operate in a lower dimensional
space. Also, since the structure is preserved, the
results obtained can be a reliable approximation of the original
data space.
1) Principal Component Analysis
Principal Component Analysis (PCA) is a widely
used and popular technique for dimensionality reduction. PCA
finds the directions of maximal variance in the data. PCA is
done with the hope that the components with the lowest
variation in the data are irrelevant or represent noise and hence
are better discarded [10].
One of the drawbacks of PCA is that maximal
variation does not always preserve the class structure. Even
when it does, PCA requires the eigenvalue decomposition of the
covariance matrix, which is a time consuming task. The runtime
complexity of PCA is O(np^2) + O(p^3) for a p-dimensional data set with n points,
which may make it infeasible for data with very high
dimensionality [7].
2) Random Projections
Another method for dimensionality reduction is
Random Projections (RP), a very simple yet powerful technique.
In this method the data is projected onto a random subspace,
which approximately preserves the Euclidean distances between all
pairs of points after the projection [5]. The Johnson-Lindenstrauss
lemma (JLL) guarantees that for a set of N points in p dimensions
there is a linear transformation to a q-dimensional random subspace
that preserves the Euclidean distance between any two data points
up to a factor of $1 \pm \epsilon$, provided the number of projected
dimensions satisfies $q = O(\epsilon^{-2} \log N)$, where $\epsilon$
is a small constant such that $0 < \epsilon < 1$ [10]. This result implies
that the original dimensionality is irrelevant as far as the
distance preservation is concerned. What matters is the
number of points that get projected and the accuracy with
which we want to preserve the distances. An important thing
to note is that the bounds provided by Johnson-Lindenstrauss
are rather loose, and in practice the number of dimensions to
project to in order to preserve the relevant distances may be
much lower [14,7].
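To make the flavour of this guarantee concrete, the following MATLAB sketch (our illustration, not taken from the paper's experiments) projects a small synthetic data set and reports the worst relative distortion of pairwise distances. The 1/sqrt(q) scaling of the N(0,1) entries is the standard choice that makes the projected distances approximately unbiased, and all sizes below are made up.

% Empirical check of Johnson-Lindenstrauss distance preservation (illustrative sketch).
N = 50;  p = 5000;  q = 200;          % number of points, original and projected dimensions
X = randn(p, N);                      % synthetic data, one p-dimensional point per column
R = randn(q, p) / sqrt(q);            % scaled Gaussian random projection matrix
Y = R * X;                            % projected data, q x N
maxDistortion = 0;
for i = 1:N-1
    for j = i+1:N
        dOrig = norm(X(:,i) - X(:,j));            % original Euclidean distance
        dProj = norm(Y(:,i) - Y(:,j));            % distance after projection
        maxDistortion = max(maxDistortion, abs(dProj/dOrig - 1));
    end
end
fprintf('Worst relative distance distortion: %.3f\n', maxDistortion);

The reported distortion shrinks as q grows and is unaffected by p, in line with the lemma.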
The original result of Johnson & Lindenstrauss was
an existence result that did not say how to get the linear
transform. Later work by Dasgupta [6] has shown that certain
random matrices fulfil the JLL guarantee with high
probability, and there are several ways to generate a random
projection matrix [10]. The method used in this paper is to
generate a random matrix with i.i.d. Gaussian entries. There
are certain properties of this matrix, which may help us
intuitively understand this: Any two rows in the random
projection matrix are approximately orthogonal to each other,
and have approximately the same length. In essence the
random projection is an approximate isometry. Thus, in practice there is
no need to normalize the rows to unit length or to
orthogonalise the random projection matrix [10].
The following are the steps to reduce the dimensionality of the data by random projections. Suppose that we have a data set $X = \{x_1, \ldots, x_n\}$, where each data point is a p-dimensional vector, $x_i \in \mathbb{R}^p$, and we need to reduce the data to a q-dimensional space such that $1 \le q < p$.
1) Arrange the data into a $p \times n$ matrix, where p is the dimensionality of the data and n is the number of data points.
2) Generate a $q \times p$ random projection matrix R using the MATLAB randn(q, p) function.
3) Multiply the random projection matrix with the original data in order to project the data down into a random projection space.
Thus we can see that transforming the data to a
random projection space is a simple matrix multiplication with
the guarantees of distance preservation. Hence, the random
projection technique is much more efficient than PCA with a
run time complexity of only O(pqn).
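As a minimal MATLAB sketch of these three steps (ours; the data matrix and the sizes below are placeholders, with p and n chosen to resemble a gene expression set):

% Random projection of a p x n data matrix down to q dimensions (illustrative sketch).
p = 7129;  n = 72;  q = 100;   % placeholder sizes; q is the chosen target dimension
X = randn(p, n);               % stands in for the real p x n data matrix of step 1
R = randn(q, p);               % step 2: Gaussian random projection matrix
Xq = R * X;                    % step 3: projected data, now q x n
% The classifier is then trained on the columns of Xq instead of X.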
3) Random Selection of Features
Another technique explored in this work is random
feature selection (RF). This means that the features are selected
uniformly at random out of the original features, and the
classification algorithm is run in the resulting smaller feature
space. This method has also been used in the literature as a
means of building an ensemble of classifiers [23][24], each of
which receives a random subset of the original features.
However, in this work we aim to assess the RF method with a
single classifier in order to compare it to the RP approach.
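A correspondingly minimal MATLAB sketch of random feature selection (ours, with the same placeholder sizes as above) only needs a random index set:

% Random selection of k out of p features (illustrative sketch).
p = 7129;  n = 72;  k = 100;   % placeholder sizes
X = randn(p, n);               % stands in for the real p x n data matrix
perm = randperm(p);            % random permutation of the feature indices 1..p
idx = perm(1:k);               % k features chosen uniformly at random, without replacement
Xk = X(idx, :);                % reduced k x n data matrix; the classifier is trained on Xk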
B. Supervised learning
Supervised learning is the task of learning to predict
the class labels of unclassified objects based on their attributes
(features). Usually a supervised learning algorithm is trained
with a finite number of labeled examples that are available, and
it is expected to generalize on the underlying distribution of
the data such that it can perform prediction on previously
unseen data.
Formally, given a set of training examples of the
form {(x1, y1), ..., (xn, yn)}, a supervised algorithm seeks a
hypothesis h: X → Y by training the classifier, where X is the
input space and Y the output space. For an unlabeled example
xk the prediction of the class label is performed by presenting
the input vector xk to the hypothesis to obtain h(xk) → yk.
a) The Fisher Linear Discriminant Analysis classifier
Fisher’s Linear Discriminant Analysis (FLD, LDA),
first introduced by Fisher, is a machine-learning algorithm
which aims at finding the linear combination of features which
best separates two or more classes of objects [16]. LDA is
simple, easy to implement and efficient. We have chosen to
employ it in this work, and this decision was driven by its
simplicity and the detailed theoretical understanding of its
working in the RP space [14]. FLD was found to give
comparable results to those of Support Vector Machines (SVM)
in some cases [15]. In the next section we will review the
basics of Fisher linear discriminant analysis for binary and
multi-class problems.
Two-Class LDA.
Suppose that we have a set of n p-dimensional vectors x1, x2, ..., xn, where xi is given by xi = (xi1, ..., xip) and each point belongs to one of two different classes, namely c1 or c2. The scatter matrices for the two classes are given by
$$ S_i = \sum_{x \in c_i} (x - \mu_i)(x - \mu_i)^T, \quad i = 1, 2, $$
where class $c_i$ contains $n_i$ points and its center or mean is $\mu_i = \frac{1}{n_i} \sum_{x \in c_i} x$. The total intra class scatter is $S_W = S_1 + S_2$ and the inter class scatter matrix is $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$.
The objective is to obtain a scalar y by projecting the vectors onto the direction w, i.e. $y = w^T x$ [16],[4]. Fisher’s criterion suggested choosing the linear transformation w to maximize the ratio of the inter class scatter of the projected samples to the intra class scatter of the projected samples [15][16]:
$$ J(w) = \frac{w^T S_B w}{w^T S_W w}. $$
We take the derivative and equate it to zero. This leads to [4]
$$ w \propto S_W^{-1}(\mu_1 - \mu_2). $$
Classification with two-class LDA is then a decision on a threshold constant c: for any vector $x_k$, class c1 is decided if $w^T x_k > c$, and class c2 otherwise. Under the assumption that both classes have the same covariance shape, the choice of the hyper-plane would be half-way between the projections of the two class means $\mu_1$ and $\mu_2$ [9]. So the parameter c is given by
$$ c = \frac{w^T(\mu_1 + \mu_2)}{2}. $$
Multi-Class LDA.
Multiple discriminant analysis is a natural extension of Fisher linear discriminant analysis for cases where the number of classes is more than two. The data is projected from a high dimensional space to a lower dimensional space where the transformation maximizes the ratio of inter class scatter to intra class scatter. In this case the maximization will lead us to decide among several competing classes.
When there are c classes, the intra class scatter matrix is calculated similarly to the two-class case [15]:
$$ S_W = \sum_{i=1}^{c} \sum_{x \in c_i} (x - \mu_i)(x - \mu_i)^T \qquad (1) $$
where class $c_i$ contains $n_i$ points, and its center or mean is
$$ \mu_i = \frac{1}{n_i} \sum_{x \in c_i} x. $$
The inter class scatter matrix is given by
$$ S_B = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^T \qquad (2) $$
where $n_i$ is the number of training samples for each class, $\mu_i$ is the mean of each class, and $\mu$ is the total mean vector, i.e. the average of the entire data set,
$$ \mu = \frac{1}{n} \sum_{k=1}^{n} x_k. $$
Once the intra class scatter $S_W$ and inter class scatter $S_B$ have been obtained, the task is again to maximize the Rayleigh quotient [15]:
$$ J(W) = \frac{|W^T S_B W|}{|W^T S_W W|} \qquad (3) $$
Maximising equation (3) can be solved as a generalized eigenvalue problem, and the columns of W are given by the eigenvectors of the matrix $S_W^{-1} S_B$ [15]. Equivalently, the transformation W can be obtained by solving the generalized eigenvalue problem [15]
$$ S_B w = \lambda S_W w \qquad (4) $$
The solution [4][16] to the generalized eigenvalue problem of equation (4) is the set of eigenvectors associated with the largest eigenvalues, which form the columns of W.
Classification with multi-class LDA is then performed in the transformed space based on the cosine measure [15],
$$ \cos(a, b) = \frac{\sum_j a_j b_j}{\sqrt{\sum_j a_j^2}\,\sqrt{\sum_j b_j^2}}. $$
A new instance $x_k$ is classified as
$$ \arg\min_i \; d(W^T x_k, W^T \mu_i), $$
where $d(a, b) = 1 - \cos(a, b)$ is the cosine distance and $\mu_i$ is the mean of the i-th class.
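For concreteness, the following MATLAB sketch (ours, on made-up data) assembles the scatter matrices of equations (1) and (2), solves the generalized eigenvalue problem of equation (4), and classifies a test point by cosine distance to the projected class means. It assumes $S_W$ is nonsingular; the regularised variant actually used in this paper is described below.

% Multi-class LDA on a q x n data matrix Xq with labels y in {1,...,c} (illustrative sketch).
q = 20;  c = 3;  n = 90;
Xq = randn(q, n);  y = randi(c, 1, n);          % placeholder data and labels
mu = mean(Xq, 2);                               % total mean vector
Sw = zeros(q);  Sb = zeros(q);
for i = 1:c
    Xi  = Xq(:, y == i);                        % points of class i (columns)
    ni  = size(Xi, 2);
    mui = mean(Xi, 2);                          % class mean
    Di  = Xi - repmat(mui, 1, ni);
    Sw  = Sw + Di * Di';                        % intra class scatter, eq. (1)
    Sb  = Sb + ni * (mui - mu) * (mui - mu)';   % inter class scatter, eq. (2)
end
[V, D] = eig(Sb, Sw);                           % generalized eigenvalue problem, eq. (4)
[~, order] = sort(diag(D), 'descend');
W = V(:, order(1:c-1));                         % keep the c-1 leading eigenvectors
% Classify a new point xk by cosine distance to the projected class means
xk = randn(q, 1);  zk = W' * xk;
best = 1;  bestDist = inf;
for i = 1:c
    zi = W' * mean(Xq(:, y == i), 2);
    d  = 1 - (zk' * zi) / (norm(zk) * norm(zi));   % cosine distance
    if d < bestDist, bestDist = d; best = i; end
end
fprintf('Predicted class: %d\n', best);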
Handling the Singularity of the Intra Class Scatter Matrix.
In situations where the number of available points is much smaller than the number of dimensions, the intra class scatter matrix $S_W$ becomes singular [15]. There are several applications where the number of data points is much less than the number of dimensions; examples include e-commerce data, gene expression data, text categorization data, etc. In these applications the intra class scatter matrix is usually singular. There are various methods to deal with singularity. Regularization was the method we used in this work to deal with the singularity of the intra class scatter matrix [15]. Regularization is achieved by biasing the diagonal elements of the intra class scatter matrix by adding a small parameter $\epsilon$:
$$ S_W \leftarrow S_W + \epsilon I, $$
where $\epsilon$ is a relatively small constant which makes the intra class scatter matrix positive definite [15]. In this work $\epsilon$ was chosen in two ways:
1) For binary classification, $\epsilon$ was a small constant chosen empirically -- e.g. 0.01, based on observations from the experiments when varying the regularization parameter.
2) For multi-class classification, $\epsilon$ was chosen as the average of the diagonal elements of $S_W$ multiplied by a small constant, e.g. 0.01, as suggested in [15].
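As a sketch of how the regularisation enters the two-class classifier (ours; the sizes and data below are placeholders, and epsilon = 0.01 is the empirically chosen constant mentioned above):

% Regularised two-class FLD (illustrative sketch).
% Xtrain is q x n (columns are points), ytrain contains labels 1 or 2.
q = 25;  n = 40;
Xtrain = randn(q, n);  ytrain = [ones(1, n/2), 2*ones(1, n/2)];   % placeholder data
X1 = Xtrain(:, ytrain == 1);   X2 = Xtrain(:, ytrain == 2);
mu1 = mean(X1, 2);             mu2 = mean(X2, 2);
D1 = X1 - repmat(mu1, 1, size(X1, 2));
D2 = X2 - repmat(mu2, 1, size(X2, 2));
Sw = D1 * D1' + D2 * D2';                  % intra class scatter
epsilon = 0.01;                            % empirically chosen constant (binary case)
% epsilon = 0.01 * mean(diag(Sw));         % alternative used for the multi-class experiments
SwReg = Sw + epsilon * eye(q);             % regularisation: bias the diagonal
w = SwReg \ (mu1 - mu2);                   % Fisher direction
cThresh = w' * (mu1 + mu2) / 2;            % threshold half-way between projected class means
xk = randn(q, 1);                          % a test point
if w' * xk > cThresh
    predicted = 1;
else
    predicted = 2;
end
fprintf('Predicted class: %d\n', predicted);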
Covariance Structures for LDA.
As already mentioned, LDA makes the assumption that the
classes have the same shape, and the shape of a class may be
characterised by a covariance matrix. There are three types of
covariance in use: full, diagonal and spherical. The full
covariance is of course the least restrictive; however in order
to estimate it accurately we would need a larger number of
points than what is available, hence a more restrictive form is
sometimes preferable.
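A brief sketch (ours) of how the three structures can be imposed on an estimated covariance; the normalisation by n is one convention among several, and the scatter matrix here is a placeholder:

% Full, diagonal and spherical covariance structures (illustrative sketch).
q = 25;  n = 40;
Sw = cov(randn(n, q)) * (n - 1);                % placeholder intra class scatter, q x q
Sigma_full = Sw / n;                            % unconstrained (full) covariance
Sigma_diagonal = diag(diag(Sw / n));            % keep the variances, discard the covariances
Sigma_spherical = mean(diag(Sw / n)) * eye(q);  % a single shared variance for all features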
C. Related Work
We already mentioned the experimental comparison
of classifiers working in RP spaces by Fradkin and Madigan
[7] where they compare PCA and RP using C4.5, 1NN, 5NN
and SVM classifiers. We will use the data sets from this work
in our study. The outcome of their experiments was that PCA
tends to perform better than RP when the target dimension is
chosen to be small. But it was also observed that PCA is
computationally expensive as compared to RP.
Deegalla and Bostrom [12] compared PCA and RP
with respect to the performance of the nearest neighbor classifier
on five image data sets and five microarray data sets to
investigate the effectiveness of either technique with the
classifier. Their experimental results indicate that PCA
performed better than RP for all the data sets used in the study.
Their experiments also show that PCA is sensitive to the
number of reduced dimensions chosen. They notice that when
the number of dimensions is increased in PCA the accuracy
increases but after reaching a certain point the performance
begins to deteriorate. In contrast they observed that in RP
space the accuracy increased with an increase in dimensions.
Bingham and Mannila [5] compared RP with several
other dimensionality reduction techniques such as PCA,
Singular Value Decomposition (SVD), Latent Semantic
Indexing (LSI) and Discrete Cosine Transform (DCT). The
data sets used were image and text data. They also
experimented to determine the effects of noisy and noiseless
data. They found that RP is not sensitive to noisy data. The
criterion was to measure the amount of distortion caused by
each method and they found that the distortion caused by RP
and PCA were nearly the same on the data sets tested.
Fern and Brodley [3] investigated how RP can be
used in conjunction with clustering of high dimensional data.
In their approach, a single run consists of projecting the data
into a random subspace and clustering it in that subspace using
EM. Multiple such runs are performed and the results are
aggregated, and a final clustering of the aggregated results
produces the final clusters. This approach is a cluster ensemble framework. The
results show that the cluster ensemble approach outperformed
both individual runs of random projection/clustering and EM
clustering with PCA reduction techniques for all the three data
sets used in the study.
Several other works have been conducted using the
RP technique. Kaski [11] employed RP in the context of
WEBSOM. Dasgupta [6] has shown that data for a mixture of
Gaussians can be projected into just O(log #clusters)
dimensions while still maintaining the approximate level of
separation between the clusters. Dasgupta also shows that
when the data clusters are eccentric in nature, RP works
consistently better than PCA, which is less reliable in this
case. Likewise, [14] have shown that O(log #classes) dimensions
are sufficient for FLD to achieve good classification.
III. EXPERIMENTAL SETUP
A. Data Sets
We use eight data sets. Five of these represent binary
classification problems and three others are multi-class
problems. The data sets we use for binary classification are
taken from [7]. For multi-class problems we employ the data
sets in Dettling [17], who used different classifiers on the same
data sets in the data space. A brief description of these data is given
below. We have chosen these data so that they cover a range
of different settings: The first two data sets have fewer points
than dimensions. The next two have more points than
dimensions, and one other has a large number of points and
large number of dimensions. The first of these categories turns
out to be the hardest, and the one where RP and RF can help
substantially. The multi-class data sets were therefore chosen
from this category.
Leukemia. The leukemia data set contains expression levels
of 7129 genes taken over 72 samples. Labels indicate which of
two variants of leukemia is present in the sample (AML, 25
samples, or ALL, 47 samples).
Colon. The Colon data set contains expression levels of
2000 genes taken in 62 different samples. For each sample it is
indicated whether it came from a tumor biopsy or not. The
data set contains 40 cancerous and 22 non-cancerous colon
tissue samples [19].
Spambase. Spam email is the well-known phenomenon of
unsolicited emails advertising products/web sites,
make-money-fast schemes, chain letters, pornography,
etc. The data set is a collection of such spam and non-spam
emails. It contains 4601 instances with 57 attributes and the
last attribute is the class attribute. Of the 4601 samples 1813
are spam emails and 2788 are non-spam emails [19].
Ionosphere. The Ionosphere data set is a binary
classification problem in which radar signals target free
electrons in the ionosphere. Returns that show evidence of
some structure are classified as good and those that do
not are classified as bad. The data set contains 351 instances
with 34 attributes classified as good or bad. Of the 351
instances 126 instances are bad instances and 225 instances
are good instances [19].
Internet Ads. This dataset represents a set of possible
advertisements on Internet pages. The attributes are the
geometry of the image as well as phrases occurring in the
URL, the image's URL and alt text, the anchor text, and words
occurring near the anchor text. There are two class labels:
advertisement ("ad") and not advertisement ("nonad"). The
data set contains 3279 instances and 1558 attributes. Of these,
2821 instances are non-ads and 458 are ads [19].
Lymphoma. The lymphoma dataset consists of 62 samples
of 4206 gene expressions. Of the 62 samples, 42 are of diffuse
large B-cell lymphoma (DLBCL), 9 samples are of follicular
lymphoma (FL), and 11 samples are of chronic lymphocytic
leukemia (CLL) [22].
Brain Tumor. The Brain Tumor data set contains 42 tissue
samples divided in 5 classes. Primitive neuroectodermal
tumors (PNETs): 8 samples, atypical teratoid/rhabdoid tumors
(Rhad): 10 samples, malignant gliomas (Mglio): 10 samples,
medulloblastomas (MD): 10 samples and normal tissues
(Ncer): 4 samples. There are a total of 5597 genes in the data
set [20].
Srbct. The Small Round Blue Cell Tumors dataset contains
information on 63 samples and 2308 genes. The samples are
distributed in four classes as follows: 8 Burkitt Lymphoma
(BL), 23 Ewing Sarcoma (EWS), 12 neuroblastoma (NB), and
20 rhabdomyosarcoma (RMS) [21].
B. Data pre-processing
For the binary classification data sets there was no
preprocessing done on the data (as the ranges of the features
appeared to be fairly similar), with the exception of the
Internet Ads data set, which had 4 attributes with missing
values; these attributes were removed from the data set.
For the multi-class data, standardisation was done by
subtracting the mean and dividing by the standard deviation of
the features.
C. Experimental protocol
We split the data randomly into a training set and a test set, in
the same way as it was done in [7]. For each experiment, we
repeat this 100 times in order to get reliable results.
The algorithms we used are detailed below. Algorithms 1 and 2
give the full procedure for RP-based and RF-based
classification respectively, for the sake of completeness and
reproducibility of our results.
Algorithm 1 (Evaluating RP-based classification)
Requires: Dataset D, original dimension d, projection dimension k
1. For i = 1:100 (splits)
2. Split D into a training set and a test set at random
3. Train and test in the original data space and record the accuracy
4. Generate a (k, d) random matrix R using the MATLAB randn(k, d) function
5. Project the training and test data into the random projection space: R*D’
6. Train and test in the reduced space and record the accuracy
7. end for
Algorithm 2 (Evaluating RF-based classification)
Requires: Dataset D, original dimension d, number of randomly selected dimensions k
1. For i = 1:100 (splits)
2. Split D into a training set and a test set at random
3. Train and test in the original data space and record the accuracy
4. Select a random set of k features from the feature space
5. Train and test in the reduced space and record the accuracy
6. end for
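The sketch below is our compact re-expression of Algorithms 1 and 2 for a single two-class data set, written so that it runs as-is: the data and labels are placeholders, the 2/3-1/3 split ratio is an assumption, and a nearest-class-mean rule stands in for the regularised FLD purely to keep the example short.

% Comparing RP-based and RF-based classification over random splits (illustrative sketch).
d = 2000;  n = 62;  k = 50;  nSplits = 100;          % placeholder sizes; k is the target dimension
D = randn(d, n);  y = [ones(1, 40), 2*ones(1, 22)];  % placeholder data (columns are points) and labels
acc = zeros(2, nSplits);                             % row 1: RP accuracy, row 2: RF accuracy
for s = 1:nSplits
    perm = randperm(n);                              % random split into training and test sets
    trainIdx = perm(1:round(2*n/3));
    testIdx = perm(round(2*n/3)+1:end);
    R = randn(k, d);                                 % RP: Gaussian random projection matrix
    idx = randperm(d);  idx = idx(1:k);              % RF: k features chosen uniformly at random
    reduced = {R * D, D(idx, :)};                    % the two reduced representations of the data
    for m = 1:2
        X = reduced{m};
        mu1 = mean(X(:, trainIdx(y(trainIdx) == 1)), 2);   % class means on the training part
        mu2 = mean(X(:, trainIdx(y(trainIdx) == 2)), 2);
        correct = 0;
        for t = testIdx
            xk = X(:, t);
            pred = 1 + (norm(xk - mu2) < norm(xk - mu1));  % nearest class mean stands in for FLD
            correct = correct + (pred == y(t));
        end
        acc(m, s) = correct / numel(testIdx);
    end
end
fprintf('RP accuracy over %d splits: %.3f\n', nSplits, mean(acc(1, :)));
fprintf('RF accuracy over %d splits: %.3f\n', nSplits, mean(acc(2, :)));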
The next section will summarise and discuss the results
obtained, followed by the conclusions of our study.
IV. RESULTS
We are interested to answer the following questions:
• How is the classification performance affected by the choice
of the target dimension (k)?
• To what extent does the regularization parameter affect the
performance, as the target dimension (k) varies?
• How does the form of covariance (i.e. full, diagonal,
spherical) affect the performance in the reduced
space?
• How do the two different dimensionality reduction
techniques, i.e. RP and RF, compare in terms of the
achieved classification accuracy?
To answer these questions we carried out a set of
experiments, and the results are summarised in Figures 1-3.
Figure 1 shows results on the high dimensional small
sample size data sets ‘leukaemia’ and ‘colon’. The leftmost
column shows the sensitivity of full-covariance-FLD to the
regularization parameter for RP-space classifiers in
various target dimensions (k). We see that working in
lower dimensional projections makes FLD less sensitive to
the choice of this parameter. This is not surprising, if we
think about the effect of RP geometrically: The smaller the
target dimension, the more spherical the covariance
becomes in the projected space. Therefore the projected
covariance has a good chance to be non-singular, and
hence insensitive to the regularisation parameter. In turn,
the full-covariance classifier in the high dimensional data
space is extremely sensitive, and a bad choice of this
parameter value can lead to poor performance close to
random guessing.
The rightmost three columns show the classification
performance (average and one standard error over 100
independent splits of the data into training and testing sets)
comparatively for data-space-FLD, RP-FLD and RF-FLD in
the case of three covariance types (full, diagonal, spherical).
We see that RP tends to outperform RF in most cases, and RP-FLD can achieve performance comparable to the full data-space
FLD, despite working in a space of much reduced dimension.
The type of covariance has relatively little effect on these
results. This is most likely because the number of data points
is so small that an accurate estimation of covariances between
the original features is hardly possible and it is either the
explicit regularisation that keeps the accuracy from falling
down, or the implicit regularisation produced by the random
projections. The random feature approach also seems to have a
regularisation effect, albeit with fewer guarantees to this effect.
We should comment on the drop and subsequent rise of
the performance at around k=50 dimensions with the full
covariance approach (second column on Fig.1). This is not an
artefact but a systematic effect. This happens exactly when k
equals the rank of the maximum likelihood covariance
estimate. Although a full explanation of why this happens is
beyond the scope of this paper, the theoretical studies of [8]
and [13] imply that there is a systematic underestimation of
the smallest eigenvalue of this sample-covariance, which
becomes the worst when the dimensionality approaches the
sample size. Once this point is passed and we increase k
further, the underestimation of the smallest eigenvalue
remains unchanged, but the distance between class centers
gets larger, and this explains why the performance rises back
at larger values of k.
Figure 2 presents results for data sets that contain
more points than dimensions. Since we now have a sufficient
number of points, we can use a full covariance model more
safely, and we see very little sensitivity to the choice of the
regularisation parameter. More interestingly, on these data
sets the random selection of features turns out to be no worse
than random projection.
Finally, Figure 3 gives classification results for three
multi-class data sets. Again, with sufficient dimensions that
are still significantly less than the original dimensionality, the
classification performance approaches that of the full data-space classifier. Interestingly, and perhaps contrary to
expectation, the RF approach performs just as well as RP. The
reasons for this remain for further investigations, although we
suspect this could be due to our use of relatively large values
for the regularisation parameter.
V. CONCLUSIONS
We empirically investigated random projections and random
feature selection methods for their ability to serve as a means
of dimensionality reduction for classification. These
techniques are computationally efficient and hold the promise
to enable learning from high dimensional data sets. We found
that RP tends to perform better than RF in terms of the
classification accuracy in settings where the dimensionality far
exceeds the number of data points. Quite interestingly, RF was
also quite competitive, and perhaps surprisingly good, in
some cases.
ACKNOWLEDGMENT
This work was performed as part of the first author’s MSc
summer project at the School of Computer Science at
Birmingham.
REFERENCES
[1] T. Birch, Supervised classification of high-dimensional gene expression data via random projection and boosting. MSc Project Report, University of Birmingham, 2009, unpublished.
[2] W. Wang, J. Yang, Mining high dimensional data. Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, Kluwer Academic Publishers, 2005.
[3] X.Z. Fern, C. Brodley, Random projection for high dimensional data clustering: A cluster ensemble approach. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.
[4] K. Chen, Lecture notes, unpublished. Retrieved June 25, 2012, from http://www.cs.manchester.ac.uk/pgt/COMP61021/lectures/LDA.pdf
[5] E. Bingham, H. Mannila, Random projection in dimensionality reduction: applications to image and text data. KDD'01 Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 245-250, 2001.
[6] S. Dasgupta, Experiments with random projection. Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pp. 143-151, 2000.
[7] D. Fradkin, D. Madigan, Experiments with random projections for machine learning. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 517-522, 2003.
[8] D. Hoyle, Accuracy of pseudo-inverse covariance learning: A random matrix theory analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1470-1481, 2011.
[9] T. Hofmann, Introduction to machine learning: lecture notes, 2003, unpublished. Retrieved June 21, 2012, from http://users.cecs.anu.edu.au/~daa/courses/GSAC6017/CS195-6a.pdf
[10] A.K. Menon, Random projections and applications to dimensionality reduction. Honours Thesis, University of Sydney, 2007.
[11] S. Kaski, Dimensionality reduction by random mapping: Fast similarity computation for clustering. In Proceedings of IJCNN'98 International Joint Conference on Neural Networks, 1, pp. 413-418, 1998.
[12] S. Deegalla, H. Bostrom, Reducing high-dimensional data by principal component analysis versus random projection for nearest neighbor classification. In ICMLA'06: Proceedings of the 5th International Conference on Machine Learning and Applications, pp. 245-250, 2006.
[13] S.R. Raudys, Expected classification error of the Fisher linear classifier with pseudo-inverse covariance matrix. Pattern Recognition Letters, pp. 385-392, 1998.
[14] R.J. Durrant, A. Kaban, Compressed Fisher linear discriminant analysis: classification of randomly projected data. Proceedings of KDD'10, pp. 1119-1128, 2010.
[15] T. Li, S. Zhu, M. Ogihara, Using discriminant analysis for multi-class classification. Proceedings of the Third IEEE International Conference on Data Mining (IEEE ICDM 2003), pp. 589-592.
[16] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2011.
[17] M. Dettling, BagBoosting for tumor classification with gene expression data. Bioinformatics, pp. 3583-3593, 2005.
[18] T. Mitchell, Machine Learning. WCB McGraw Hill, 1997.
[19] K. Bache, M. Lichman (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[20] J. Wang, T.H. Bo, I. Jonassen, O. Myklebost, E. Hovig, Tumor classification and marker gene prediction by feature selection and fuzzy c-means clustering using microarray data. BMC Bioinformatics 4:60, 2003.
[21] Khan et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7(6).
[22] D. Chung, S. Keles, Sparse partial least squares classification for high dimensional data. Statistical Applications in Genetics and Molecular Biology, Vol. 9, Article 17, 2010. http://artax.karlin.mff.cuni.cz/r-help/library/spls/html/lymphoma.html
[23] C.P. Kamath, L.O. Hall, T. Yeatman, S. Eschrich, Multivariate Feature Selection using Random Subspace Classifiers for Gene Expression Data. Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE 2007, October 14-17, 2007.
[24] G. Armano, C. Chira, B. Hatami, A new gene selection method based on random subspace ensemble for microarray cancer classification. Pattern Recognition in Bioinformatics, LNCS Vol. 7036, pp. 191-201, 2011.
Fig. 1. Results on high dimensional small sample size data. With lower target dimension (k) the classifier becomes less sensitive to the regularization
parameter. RP-FLD tends to outperform RF-FLD.
Fig. 2. Results on data with more points than dimensions. In the first
column of plots, the legend is the same as in Figure 1. Interestingly, RP-FLD and RF-FLD perform similarly.
Fig. 3. Results on multi-class data sets.