Appl Intell (2014) 41:196–211
DOI 10.1007/s10489-013-0502-0
A K-Farthest-Neighbor-based approach for support vector data description
Yanshan Xiao · Bo Liu · Zhifeng Hao · Longbing Cao
Published online: 15 February 2014
© Springer Science+Business Media New York 2014
Abstract Support vector data description (SVDD) is a well-known technique for one-class classification problems. However, it incurs high time complexity in handling large-scale datasets. In this paper, we propose a novel approach, named K-Farthest-Neighbor-based Concept Boundary Detection (KFN-CBD), to improve the training efficiency of SVDD. KFN-CBD aims at identifying the examples lying close to the boundary of the target class, and these examples, instead of the entire dataset, are then used to learn the classifier. Extensive experiments have shown that KFN-CBD obtains substantial speedup compared to standard SVDD, and meanwhile maintains classification accuracy comparable to training on the entire dataset.
Keywords Support vector data description · K-Farthest Neighbors
Y. Xiao (B) · Z. Hao
School of Computers, Guangdong University of Technology,
Guangzhou, Guangdong, China
e-mail: [email protected]
Z. Hao
e-mail: [email protected]
B. Liu
School of Automation, Guangdong University of Technology,
Guangzhou, Guangdong, China
e-mail: [email protected]
L. Cao
Advanced Analytics Institute, University of Technology, Sydney,
Australia
e-mail: [email protected]
1 Introduction
Support vector machine (SVM) [1–4] is a powerful technique in machine learning and data mining. It is based on
the idea of structural risk minimization, which shows that
the generalization error is bounded by the sum of the training error and a term depending on the Vapnik-Chervonenkis
dimension. By minimizing this bound, high generalization
performance can be achieved. Support vector data description (SVDD) [5–7] is an extension of SVM for one-class
classification problems where only examples from one class
(target class) are involved in the training phase. In SVDD,
the training data is transformed from the original data space
(input space) into a higher dimensional feature space via a
mapping function. After being mapped into the feature space
via a proper mapping function, the data becomes more compact and distinguishable in the feature space. A hyper-sphere
is then determined to enclose the training data. A test example is classified to the target class if it lies on or inside
the hyper-sphere. Otherwise, it is classified to the non-target
class. To date, SVDD has been used in a wide variety of real-world applications, from machine fault detection and credit card fraud detection to network intrusion [8–10].
The success of SVDD in machine learning leads to its extension to the classification problems for mining large-scale
data. However, despite the promising properties of SVDD,
it incurs a critical computational efficiency issue when it is
applied on large-scale datasets. A typical example is outlier
detection of handwritten digit data. The handwritten digit
classification system deals with the handwritten digit images
collected from people and builds a classifier to detect the
anomalous handwritten digit images. Since different people
have different handwriting and the handwriting may even
change according to people’s emotions or situations, a large
amount of handwritten digit images need to be trained for an
appropriate learning of the handwritten digit classification
system. Another example is credit card transactional fraud
detection. It usually contains large-scale records in the system and the number of examples involved during the training
stage can be considerably large. The training time on these
data can be too expensive to make SVDD a practical solution. Hence, it is essential to explore novel approaches to
improve the training efficiency of SVDD.
This paper proposes a novel approach, called K-Farthest-Neighbor-based Concept Boundary Detection (KFN-CBD). The main motivation for proposing KFN-CBD is the desire to further speed up SVDD on large-scale datasets by eliminating the examples which are most likely to be non-SVs. It is known that the SVDD classifier is decided by the support vectors (SVs) which lie on or outside the hyper-sphere, and removing the non-SVs does not change the classifier. Hence, KFN-CBD aims at identifying the examples
lying close to the boundary of the target class. These examples are called boundary examples in this paper. By using
only the boundary examples to train the classifier, the training set size is largely cut down and the computational cost
is substantially reduced.
Specifically, KFN-CBD consists of two steps: concept
boundary pre-processing and concept boundary learning.
In the concept boundary pre-processing step, the k-farthest
neighbors of each example are identified. To accelerate
the k-farthest neighbor query, M-tree [11], which is originally designed for k-nearest neighbor search, is extended
for k-farthest neighbor learning by performing a novel tree
query strategy. To the best of our knowledge, it is the first
learning method for efficient k-farthest neighbor query. In
the concept boundary learning step, each example is associated with a ranking score and a subset of top ranked examples is used to train the SVDD classifier. To do this, a scoring function is proposed to evaluate the examples based on
their proximity to the data boundary. According to the
scoring function, the examples with larger scores are more
likely to be SVs than those with smaller scores. Compared
to SVDD, KFN-CBD only includes a subset of informative
examples, instead of the whole dataset, to learn the classifier,
so that a large number of examples, which are likely to be
non-SVs, are eliminated from the training phase. Extensive experiments have shown that KFN-CBD obtains substantial speedup compared to SVDD, and meanwhile maintains classification accuracy comparable to training on the entire dataset.
The rest of this paper is organized as follows. Section 2
introduces previous work related to our study. Section 3
presents the details of the proposed approach. Experimental results are reported in Sect. 4. Section 5 concludes the
paper and puts forward the future work.
2 Related work
SVDD is an extension of SVM on one-class classification
problems. In this section, we first review the previous studies
on speeding up SVM and SVDD in Sect. 2.1. Then, the basic
idea of SVDD is briefly introduced in Sect. 2.2.
2.1 Speeding up SVM and SVDD
Previous work on speeding up SVM and SVDD can be
roughly classified into two categories: (1) algorithmic approaches [12–20], where optimization methods are used
to solve the quadratic programming (QP) problems1 faster; (2) data-processing approaches [21–28], which reduce the training set size using data down-sampling techniques. In practice, the data-processing approaches always
run on top of the algorithmic approaches as an ultimate way
to accelerate the training efficiency of SVM and SVDD.
Our method falls into the category of data-processing
approaches. The related work in this category will be discussed. Based on the variety of sampling techniques, the
data-processing approaches can be further divided into random sampling-based approaches and selective sampling-based approaches.
2.1.1 Random sampling-based approaches
These approaches randomly divide the training set into a
number of subsets and train the classifier on each subset. For
a test example, the results of sub-classifiers are combined for
predictions. Bagging [21] and Cascade SVM [22] are typical examples. In Bagging [21], the training data is randomly
divided into a series of subsets and each subset is trained
to obtain a “weak classifier”. The test data is then predicted
by voting on the “weak classifiers”. [22] proposes “Cascade
SVM”, which randomly decomposes the learning problem
into a number of independent smaller problems and the partial results are combined in a hierarchical way. Recently, [29]
extends Cascade SVM from SVM to SVDD. Bagging [21]
and Cascade SVM [22] have shown their capability in accelerating the training efficiency. However, there is a limitation
to these methods: random sampling may not select the most
informative examples for each sub-problem and much training time is consumed by a large number of less informative
examples.
1 A quadratic programming (QP) problem optimizes a quadratic function of several variables subject to linear constraints on these variables. The optimization problems of SVM and SVDD are QP problems.
2.1.2 Selective sampling-based approaches
In order to choose the informative examples from the training set, selective sampling-based approaches have been proposed. Instead of learning on the whole dataset, selective
sampling-based approaches aim at selecting a subset of informative examples and use the selected examples to train
the classifier. In the following, we review the selective
sampling-based approaches for SVM and SVDD, respectively.
Considerable effort has been devoted to speeding up SVM
by using selective sampling [23–25]. For example, CB-SVM [30] clusters the training data into a number of micro-clusters, and selects the centroids of the clusters as the representatives along the hierarchical clustering tree. The examples near the hyper-plane are used to train the classifier.
[31] utilizes k-means clustering to cluster the training data,
and the data on the cluster boundaries is selected for SVM
learning. In [32], the potential SVs are identified by an approximate classifier trained on the centroids of the clusters,
and the clusters are then replaced by the SVs. IVM [33] uses
posterior probability estimates to choose a subset of examples for training.
In contrast with SVM, there are only a few studies
on SVDD with selective sampling. Among these studies,
KSVDD [34] adopts a divide-and-conquer strategy to speed
up SVDD. Firstly, it decomposes the training data into several subsets using k-means clustering. Secondly, each subset
is trained by SVDD. Lastly, the global classifier is trained
by using the SVs obtained in all sub-classifiers. However,
the efficiency of KSVDD may be limited since it is computationally expensive to partition the dataset using k-means
clustering. Moreover, it is time consuming to train the subsets for obtaining the SVs.
Our proposed approach complements the algorithmic approaches. In the experiments, we explicitly compare our approach with representative random sampling-based and selective sampling-based approaches. The experimental results demonstrate that our approach obtains markedly better speedup than the compared methods.
2.2 Support vector data description
Suppose that the training dataset is S = {x1 , x2 , . . . , xl },
where l is the number of training examples. The basic idea
of SVDD is to enclose the training data in a hyper-sphere
classifier. The objective function of SVDD is given by:
$$\min_{R,\,o,\,\xi}\;\; R^2 + C\sum_{i=1}^{l}\xi_i \qquad (1)$$
$$\text{s.t.}\quad \|x_i - o\|^2 \le R^2 + \xi_i, \quad \xi_i \ge 0,\; i = 1, 2, \ldots, l$$
where o is the center and R is the radius of the hyper-sphere;
ξi are slack variables; C is a user pre-defined parameter
which controls the tradeoff between the hyper-sphere radius
and the training errors.
By introducing the Lagrange multipliers αi , we have the
following findings [5]:
– For xi which lies outside the hyper-sphere, i.e., ‖xi − o‖² > R², it has αi = C.
– For xi which lies on the hyper-sphere, i.e., ‖xi − o‖² = R², it has 0 < αi < C.
– For xi which lies inside the hyper-sphere, i.e., ‖xi − o‖² < R², it has αi = 0.
In addition, assuming that x is an example lying on the
hyper-sphere, the center o and the radius R of the hyper-sphere can be calculated by [5]:
$$o = \sum_{i=1}^{l} \alpha_i x_i. \qquad (2)$$
$$R^2 = (x \cdot x) - 2\sum_{i} \alpha_i (x_i \cdot x) + \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j). \qquad (3)$$
From (2) and (3), we can see that the hyper-sphere center o and the radius R are determined by the examples xi with αi ≠ 0. It is known that the examples lying on or outside the hyper-sphere have αi ≠ 0, and those located inside the hyper-sphere have αi = 0. Therefore, the hyper-sphere classifier is decided by the examples which lie on or outside the hyper-sphere, and eliminating the examples which are located inside the hyper-sphere does not change the classifier.
Moreover, the examples lying on or outside the hyper-sphere
are called SVs, and those located inside the hyper-sphere are
called non-SVs. The distribution of SVs for SVDD is shown
in Fig. 1. A test example x is classified as a normal example
if its distance to the hyper-sphere center o is no larger than
the radius R. Otherwise, it is considered as an outlier.
In SVDD, the classifier is constructed as a hyper-sphere.
However, the training data may not be compactly distributed
in the input space. In order to obtain better data description,
the original training data is mapped from the input space
into a higher dimensional feature space via a mapping function φ(·). By using the mapping function φ(·), the training data can be more compact and distinguishable in the
feature space. The inner product of two vectors in the feature space can be calculated by using the kernel function as
K(x, xi ) = φ(x) · φ(xi ) [35–37].
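To make the formulation above concrete, the following sketch trains an SVDD model by numerically solving the dual of (1) on a toy dataset. It is a minimal illustration rather than the solver used in the paper: it relies on scipy's general-purpose SLSQP routine instead of a dedicated QP/SMO solver, and the RBF kernel width sigma, the penalty C and the random toy data are placeholder choices.

import numpy as np
from scipy.optimize import minimize

def rbf_kernel(A, B, sigma=1.0):
    # K(a, b) = exp(-||a - b||^2 / sigma^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def train_svdd(X, C=0.1, sigma=1.0):
    # Dual of (1): max sum_i a_i K(x_i, x_i) - sum_ij a_i a_j K(x_i, x_j)
    # subject to 0 <= a_i <= C and sum_i a_i = 1.
    n = len(X)
    K = rbf_kernel(X, X, sigma)
    kdiag = np.diag(K)
    neg_dual = lambda a: a @ K @ a - kdiag @ a          # negated objective for minimization
    res = minimize(neg_dual, np.full(n, 1.0 / n), method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}])
    alpha = res.x
    # Radius from an unbounded support vector (0 < alpha_i < C); assumes one exists.
    sv = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
    R2 = K[sv, sv] - 2 * alpha @ K[:, sv] + alpha @ K @ alpha
    return alpha, R2

def predict(X_train, alpha, R2, X_test, sigma=1.0):
    # Accept a test point if its squared feature-space distance to the centre is <= R^2.
    K = rbf_kernel(X_train, X_train, sigma)
    Kxt = rbf_kernel(X_train, X_test, sigma)
    d2 = 1.0 - 2 * alpha @ Kxt + alpha @ K @ alpha      # K(x, x) = 1 for the RBF kernel
    return d2 <= R2

X = np.random.RandomState(0).randn(60, 2)               # toy target class
alpha, R2 = train_svdd(X, C=0.1, sigma=2.0)
print(predict(X, alpha, R2, np.array([[0.0, 0.0], [6.0, 6.0]]), sigma=2.0))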
3 Proposed approach
In this section, we present the proposed approach in detail.
Given a training dataset S which consists of l target examples, the objective of SVDD is to build a hyper-sphere classifier by enclosing the target examples and the classifier is
Algorithm 1 The proposed KFN-CBD framework
Step 1 (Boundary Concept Pre-processing): Given a training set S, the k-farthest neighbors of each example are identified. To accelerate the k-farthest neighbor search, M-tree is
used and a novel tree query strategy is proposed.
Step 2 (Boundary Concept Learning): Each example is assigned a ranking score based on its k-farthest neighbors. A subset of top ranked examples is selected and used to learn the SVDD classifier.
learning method is proposed to determine the boundary examples by considering the distances between the examples
and their k-farthest neighbors. The framework of our proposed approach is presented in Algorithm 1.
3.1 Boundary concept pre-processing
Fig. 1 The distribution of SVs in SVDD. The examples lying on the
hyper-sphere surface and outside the hyper-sphere are called “boundary SVs” and “outlying SVs”, respectively. Those located inside the
hyper-sphere are non-SVs. The SVDD classifier is decided by the
“boundary SVs” and “outlying SVs”
thereafter applied to classify the test examples. However,
SVDD incurs a critical computation issue when it is applied
on large-scale datasets.
To speed up SVDD on large-scale datasets, we propose
to select a subset of examples from the datasets and train
the SVDD classifier using the selected examples, instead
of the whole dataset. In this case, the training set size is
largely cut down and the computational cost of SVDD can
be substantially reduced. The motivation behind our proposed approach is illustrated in Fig. 1. The SVDD classifier
is decided by the SVs which lie on or outside the hyper-sphere, while the non-SVs, which are located inside the hyper-sphere, have no influence on the SVDD classifier, as discussed in Sect. 2.2. That is to say, removing the non-SVs does not change the SVDD classifier [5]. Motivated
by this observation, an attractive approach for accelerating
SVDD is to exclude the non-SVs from training the classifier so that the training set size can be reduced. However,
the exact SVs cannot be known until the SVDD classifier is
trained.
To overcome this difficulty, we propose to determine
the SVs by identifying the examples lying close to the
boundary of the target class, namely boundary examples.
To determine the boundary examples, we put forward a k-farthest-neighbor-based learning method. In SVDD, after
being mapped from the input space into the feature space
via an appropriately chosen kernel function, the target data
becomes more compact and distinguishable in the feature
space. A hyper-sphere is then constructed by enclosing the
target data with good performance. Based on this geometrical characteristic of SVDD, the k-farthest-neighbor-based
The main task of this step is to determine the k-farthest
neighbors for each example. This pre-processing step is performed only once for the whole dataset. Compared to k-nearest neighbor query, k-farthest neighbor learning aims at
identifying k-farthest examples from the query point. At the
end of this stage, we obtain a data structure which consists
of the indexes and distances of the k-farthest neighbors for
each example in the whole dataset.
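For reference, the k-farthest neighbor table described above can be computed naively from pairwise distances, as in the sketch below (kfn_bruteforce is an illustrative helper, not part of the paper). Its O(n²) cost in time and memory is exactly what the M-tree-based query developed in the rest of this section is intended to avoid.

import numpy as np

def kfn_bruteforce(X, k):
    # For every example, return the indexes and distances of its k farthest
    # neighbors, using all O(n^2) pairwise distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    idx = np.argsort(-d, axis=1)[:, :k]                 # k largest distances per row
    return idx, np.take_along_axis(d, idx, axis=1)

X = np.random.RandomState(1).randn(500, 5)
fn_idx, fn_dist = kfn_bruteforce(X, k=10)
print(fn_idx.shape, fn_dist.shape)                      # (500, 10) (500, 10)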
Similar to k-nearest neighbor query [38, 39], obtaining
the k-farthest neighbors of each example can be computationally expensive on large-scale datasets. Hence, M-tree
[11, 40], originally designed for efficient k-nearest neighbor learning, is extended to speed up the k-farthest neighbor
search. Specifically, M-tree for k-farthest neighbor search
consists of two steps: tree construction and tree query. In
the tree construction step, a hierarchical tree is built. The
tree construction process of k-farthest neighbor learning is
the same as k-nearest neighbor query. In the tree query step,
the tree is queried to obtain the k-farthest neighbors. Different from k-nearest neighbors, we propose a tree query
scheme for efficient k-farthest neighbor search.
The k-farthest neighbor query algorithm uses a priority
PR queue to store pending requests, and a k-element FN array to contain the k-farthest neighbor candidates, which are
the results at the end of the algorithm. To do this, a branch-and-bound heuristic algorithm is adopted. In the following,
we first introduce several important notations and the others
are defined in Table 1.
PR (Priority) queue The PR queue stores the pending requests of the form [ptr(T(Oi)), dmax(T(Oi))], where dmax(T(Oi)) = d(Oi, Q) + rOi is the upper-bound distance, representing the maximum distance between Q and T(Oi); d(Oi, Q) is the distance between the query point Q and object Oi, and rOi is the radius of subtree T(Oi).
Table 1 Symbol definition

Symbol          Definition
Q               Query point
Oi              The i-th data object
T(Oi)           Covering tree of object Oi
rQ              Query radius of Q
d(Oi, Q)        Distance between the data object Oi and the query point Q
dmin(T(Oi))     The smallest possible distance between the query point Q and the objects in subtree T(Oi); the lower-bound distance is defined as dmin(T(Oi)) = max{d(Oi, Q) − rOi, 0}
dmax(T(Oi))     Upper-bound distance between the query point Q and the objects in subtree T(Oi); defined as dmax(T(Oi)) = d(Oi, Q) + rOi
dk              Minimum lower-bound distance dmin(T(Oi)) of the k objects in the FN array; if there is no object in the FN array, dk = 0
PR queue        Queue of pending requests which have not been filtered from the search; each entry has the form [T(Oi), dmax(T(Oi))]
FN array        During the query process, the FN array stores entries of the form [Oi, d(Q, Oi)] or [−, dmin(T(Oi))]; at the end of the query, the FN array contains the results, i.e., the k-farthest neighbors
KFN(φ(xi))      Top k-farthest neighbor list of example φ(xi)
f(φ(x))         Ranking score of example φ(x)
The pending requests in the PR queue are sorted by dmax(T(Oi)) in descending order. Hence, the request with the maximum dmax(T(Oi)) is inquired first.
FN (Farthest Neighbor) array The FN array stores the k-farthest neighbors or the lower-bound distances of subtrees. It contains k entries of the form [oid(Oi), d(Q, Oi)] or [−, dmin(T(Oi))], where dmin(T(Oi)) = max{d(Oi, Q) − rOi, 0} is the lower-bound distance, representing the minimum distance between Q and T(Oi). The entries in the FN array are sorted by d(Q, Oi) or dmin(T(Oi)) in descending order. The k-farthest neighbors are in the FN array at the end of the query.
rQ (Query Range) At the beginning, the query range rQ is set to 0. When a new entry is inserted into the FN array, rQ is enlarged to the distance d(Q, Oi) or dmin(T(Oi)) stored in the last entry FN[k].
Based on the above notations, the basic idea of efficient
k-farthest neighbor query is as follows. The query starts at
the root node, and [root, 0] is the first and only request in
the PR queue. All entries in the FN array are set to [−, 0],
and the query range rQ is set to be 0. During the query, the
entry in the PR queue with max{dmax (T (Oi ))} is searched
first. This is because k-farthest neighbors are more likely
to be acquired from the farthest subtree and dmax (T (Oi ))
is the largest possible distance between the query point Q
and the objects in subtree T (Oi ). Let dk equal the distance
d(Q, Oi ) or dmin (T (Oi )) stored in the last entry FN[k]. For
the entry inquired, if dmin (T (Oi )) > dk holds, it is inserted
into the FN array, and dk is updated according to the last
entry FN[k].
After inserting the entry into the FN array and updating dk , we let rQ = dk and the query region enlarges. Based
on this region, the entries with dmax(T(Oi)) < dk are removed from the PR queue, i.e., excluded from the search. This is because if the upper-bound distance dmax(T(Oi)) between Q and T(Oi) is smaller than the minimum distance in the FN array, i.e., dk, then T(Oi) cannot contain any of the k-farthest neighbors and can be safely pruned. When no pending request remains, the query stops and the k-farthest neighbors are in the FN array.
Take Figs. 2(a)–(d) as an example to further illustrate the
query process of k-farthest neighbors. Here, k = 2 is set.
– Fig. 2(a): The query starts at the root level and the PR
queue is set to be [root, 0]. rQ is equal to 0, and all entries
in the FN array are [−, 0].
– Fig. 2(b): Subtrees II and I are waiting for inquiring,
and hence put into the pending PR queue. Since it has
dmax (II) > dmax (I ), subtree II is the first entry in the
PR queue, and is searched before subtree I . Moreover,
dmin (II) and dmin (I ) are put into the FN array. Considering that it has dmin (II) > dmin (I ), dmin (II) is placed in
front of dmin (I ), and dmin (I ) is the last entry in the FN
array. rQ stays 0, since rQ is equal to the minimum lower-bound distance in the FN array, i.e., rQ = dmin(I), while
it has dmin (I ) = 0.
– Fig. 2(c): Inquire the subtree in the first entry of the PR
queue, i.e., subtree II, and then subtrees D, I and C
are inserted into the PR queue by their upper-bound distances in descending order. Since dmin (D) > dmin (C) and
dmin (C) > dmin (I ) hold, the two largest lower-bound distances, i.e., dmin (D) and dmin (C), are put into the FN array. rQ increases from 0 to dmin (C).
– Fig. 2(d): The subtree in the first entry of the PR queue, i.e., subtree D, is inquired. O8, O7, I and C are sorted
by their upper-bound distances in descending order and
placed into PR. Then, O8 and O7 are inquired. Entries
[O8 , d(Q, O8 )] and [O7 , d(Q, O7 )] are put into the FN
array. This is because d(Q, O8 ) and d(Q, O7 ) are larger
than dmin (D) and dmin (C), and hence d(Q, O8 ) and
d(Q, O7 ) are in the FN array. After the entries in the
FN array are updated, the query range rQ is enlarged and
set to be the distance in the last entry of the FN array,
i.e., rQ = d(Q, O7 ). Since rQ = d(Q, O7 ) holds, it has
rQ > dmax (I ) and rQ > dmax (C). Considering that rQ is
equal to the minimum distance stored in the FN array, if
Fig. 2 An example of 2-farthest neighbor query using M-tree. Here, I, II, A, B, C and D represent different subtrees. O1 to O8 are data objects
the upper-bound distance dmax (T (Oi )) between Q and
T (Oi ) is smaller than the minimum distance stored in the
FN array, i.e., rQ , it can be safely pruned. Hence, subtrees
I and C can be pruned and removed from the PR queue.
Lastly, the query stops since the PR queue is empty and
the 2-farthest neighbors of the query point Q, namely O7
and O8 , are contained in the FN array.
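The sketch below mirrors the pruning idea of the query just illustrated, under simplifying assumptions: the tree is a hand-built toy hierarchy rather than a real M-tree, and the FN structure keeps only exact point distances (not subtree lower bounds), so it prunes less aggressively than the FN-array scheme described above. Node and kfn_query are illustrative names, not the paper's implementation.

import heapq
import numpy as np

class Node:
    # A metric-tree node: a routing object, a covering radius and either
    # child nodes (internal node) or stored points (leaf node).
    def __init__(self, center, radius, children=None, points=None):
        self.center, self.radius = center, radius
        self.children, self.points = children or [], points

def kfn_query(root, q, k):
    dist = lambda a, b: float(np.linalg.norm(a - b))
    fn = []            # min-heap of (distance, point): the k farthest candidates so far
    dk = 0.0           # distance of the current k-th farthest candidate
    tie = 0            # tie-breaker so heapq never compares Node objects
    pr = [(-(dist(q, root.center) + root.radius), tie, root)]   # max-heap on dmax
    while pr:
        neg_dmax, _, node = heapq.heappop(pr)
        if len(fn) == k and -neg_dmax <= dk:
            continue                       # dmax(T) <= dk: the subtree can be pruned
        if node.points is not None:        # leaf: examine the stored objects
            for p in node.points:
                d = dist(q, p)
                if len(fn) < k:
                    heapq.heappush(fn, (d, tuple(p)))
                elif d > fn[0][0]:
                    heapq.heapreplace(fn, (d, tuple(p)))
            if len(fn) == k:
                dk = fn[0][0]
        else:                              # internal node: enqueue promising subtrees
            for child in node.children:
                dmax = dist(q, child.center) + child.radius
                if len(fn) < k or dmax > dk:
                    tie += 1
                    heapq.heappush(pr, (-dmax, tie, child))
    return sorted(fn, reverse=True)        # (distance, point) pairs, farthest first

# Toy two-level tree: two subtrees, each holding a few 2-D points
left = Node(np.array([0.0, 0.0]), 1.5, points=np.array([[0.1, 0.2], [0.8, -0.9]]))
right = Node(np.array([5.0, 5.0]), 1.5, points=np.array([[4.5, 5.2], [5.8, 4.9]]))
root = Node(np.array([2.5, 2.5]), 6.0, children=[left, right])
print(kfn_query(root, np.array([0.0, 0.0]), k=2))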
3.2 Boundary concept learning
After the k-farthest neighbors are identified, the next step is
to assign each example a ranking score based on its k-farthest neighbors, such that a subset of top ranked examples
is chosen to train the classifier. To do this, we propose a
scoring function. The objective of the scoring function is to
assign higher scores to examples which are more likely to
lie around the boundary.
Given l target examples S = {φ(x1 ), φ(x2 ), . . . , φ(xl )},
we first compute the k-farthest neighbors of each example φ(xi ) (i = 1, . . . , l), as presented in Sect. 3.1. Let
KFN(φ(xi )) be the top k-farthest neighbor list of example φ(xi ). In the following, when we express φ(xj ) ∈
KFN(φ(xi)), it means that example φ(xj) is on the top k-farthest neighbor list of example φ(xi). Based on the top k-farthest neighbor list of each target example, our proposed
scoring function is given by:
$$f\bigl(\phi(x)\bigr) = 1 - \frac{1}{k}\sum_{\phi(x_i)\in \mathrm{KFN}(\phi(x))} e^{-\|\phi(x)-\phi(x_i)\|^2} \qquad (4)$$
where ‖φ(x) − φ(xi)‖² is the square of the distance from φ(x) to φ(xi), and it has ‖φ(x) − φ(xi)‖² = φ(x) · φ(x) + φ(xi) · φ(xi) − 2φ(x) · φ(xi). Here, φ(x) · φ(xi) represents the inner product of φ(x) and φ(xi), and it can be replaced by the kernel function K(x, xi). To illustrate (4), we assume that k is equal to 1 and the top 1-farthest neighbor list contains only one example φ(xi). Hence, (4) can be rewritten as f(φ(x)) = 1 − e^(−‖φ(x)−φ(xi)‖²), where e^(−·) is an exponentially decaying function. By utilizing this function, when the distance between φ(x) and φ(xi) is sufficiently small, the value of e^(−‖φ(x)−φ(xi)‖²) is approximately 1 and f(φ(x)) is close to 0. Conversely, when the distance between φ(x) and φ(xi) is large enough, the value of e^(−‖φ(x)−φ(xi)‖²) approaches 0 and f(φ(x)) approaches 1. That is to say, when φ(x) is close to its k-farthest neighbors, a small ranking score f(φ(x)) is obtained. When φ(x) is far from its k-farthest neighbors, a large ranking score f(φ(x)) is attained. Moreover, ‖·‖² is always no less than 0 and 0 < e^(−‖·‖²) ≤ 1 holds. Hence, the ranking score f(φ(x)) falls into the range [0, 1).
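A small sketch of how (4) can be evaluated is given below. Two caveats: the k-farthest neighbors are found here by brute force in the kernel-induced feature space, whereas the paper uses the extended M-tree, and the RBF kernel width sigma, k and the toy data are placeholders.

import numpy as np

def ranking_scores(X, k=100, sigma=1.0):
    # f(x) = 1 - (1/k) * sum over the k farthest neighbors of exp(-||phi(x) - phi(x_i)||^2),
    # with the feature-space distance expanded through the RBF kernel as 2 - 2 K(x, x_i).
    K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / sigma ** 2)
    d2 = 2.0 - 2.0 * K
    kfn = np.argsort(-d2, axis=1)[:, :k]                # indexes of the k farthest neighbors
    far = np.take_along_axis(d2, kfn, axis=1)
    return 1.0 - np.exp(-far).mean(axis=1)              # larger score -> closer to the boundary

X = np.random.RandomState(2).randn(300, 4)
scores = ranking_scores(X, k=50, sigma=3.0)
print(scores.min(), scores.max())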
It is known that after the target data is mapped from the
input space into the feature space via a proper kernel function, the target data becomes more compact in the feature
space, so that a hyper-sphere classifier is obtained by enclosing the target data with satisfactory accuracy, as shown
in Fig. 3(a). Considering this geometrical characteristic of
SVDD, the scoring function in (4) is capable of determining
the examples which are likely to lie around the boundary of
the target class. Let us take Figs. 3(a)–(b) as an example to
Algorithm 2 The calculation of ranking scores with M-tree
Input: Q (Query point), k (Integer), T (Tree built);
Output: f (Q) (Ranking score);
1: Let FN = ∅; /* FN is an empty array */
2: FN = KFN_Query(Q, k, T ); /* Obtaining the k-farthest
neighbors */
3: f (Q) = RankingScore(Q, FN); /* Calculating the ranking scores */
4: return f (Q);
Fig. 3 Illustration of the scoring function with 3-farthest neighbors
illustrate how the scoring function contributes to the boundary example determination.
Figure 3(a) shows that the SVDD classifier is a hyper-sphere and the SVs are those examples lying on the hyper-sphere surface or outside the hyper-sphere. To make the
illustration clearer, we only keep examples φ(x1 )–φ(x8 )
and remove the other examples, as presented in Fig. 3(b).
φ(x1 ) is an SV lying near the hyper-sphere surface, while
φ(x5 ) is a non-SV located deep inside the hyper-sphere.
φ(x2 ), φ(x3 ) and φ(x4 ) are the 3-farthest neighbors of φ(x1 ),
while the 3-farthest neighbors of φ(x5 ) are φ(x6 ), φ(x7 ) and
φ(x8 ). Then, we consider the corresponding ranking scores
of φ(x1 ) and φ(x5 ), as follows:
$$f\bigl(\phi(x_1)\bigr) = 1 - \frac{1}{3}\sum_{i=2}^{4} e^{-\|\phi(x_1)-\phi(x_i)\|^2}, \qquad (5)$$
$$f\bigl(\phi(x_5)\bigr) = 1 - \frac{1}{3}\sum_{i=6}^{8} e^{-\|\phi(x_5)-\phi(x_i)\|^2}. \qquad (6)$$
For f (φ(x1 )) and f (φ(x5 )), it is observed from Fig. 3(b)
that the distances between φ(x1 ) and its k-farthest neighbors
are larger than those of φ(x5 ). Hence, the ranking score of
φ(x1) is greater than that of φ(x5), namely f(φ(x1)) > f(φ(x5)).
From this example, it is seen that by tuning an appropriate k, the examples near the data boundary can have larger
distances from their k-farthest neighbors than those lying
deep inside the hyper-sphere, and hence the ranking scores
of boundary examples are roughly larger than those inside
the hyper-sphere. From this aspect, the scoring function can
be considered as a way to determine the boundary examples approximately. The pseudocode to calculate the ranking scores with M-tree is presented in Algorithm 2. The pseudocode for the functions in Algorithm 2 is shown in
Algorithms 2.1, 2.1.1, 2.1.2 and 2.2, respectively.
Selection of subset sizes After the k-farthest neighbors
have been determined by the extended M-tree, each example in the dataset is assigned a ranking score according
Algorithm 2.1 Pseudocode for the “KFN_Query” function
Input: Q (Query point), k (Integer), T (Tree built);
Output: FN (k-farthest neighbor array);
1: PR = [T, _];
2: for i = 1 to k do
3:    FN[i] = [0, _];
4: while PR ≠ ∅ do
5:    Next_Node = ChooseNode(PR);
6:    FN = KFN_NodeSearch(Next_Node, Q, k);
Algorithm 2.1.1 Pseudocode for the “ChooseNode” function
Input: PR (Priority queue);
Output: Next_Node (The next node to be searched);
1: Let dmax(T(Or*)) = max{dmax(T(Or))}, considering all the entries in PR;
2: Remove [T(Or*), dmax(T(Or*))] from PR;
3: Next_Node = T(Or*);
4: return Next_Node;
to the scoring function (4). The examples with large ranking scores tend to lie near the data boundary and those with
small ranking scores are usually located deeply inside the
hyper-sphere. Thereafter, we perform the following operations to select a subset of examples for learning the SVDD
classifier. Firstly, we sort the training examples according to their ranking scores and obtain the sum H of all scores, where $H = \sum_{i=1}^{l} f(\phi(x_i))$. Secondly, we re-index the sorted examples as {φ(x1), φ(x2), . . . , φ(xl)}, where the example with the highest score is represented as φ(x1).
Thirdly, let P denote a user pre-specified percentage. Starting from example φ(x1 ), we keep picking up the examples
until the sum of chosen examples reaches the pre-defined
percentage P of H . Based on this, suppose that a subset of
top ranked examples is selected out and contained in subset Ssub . For the examples in subset Ssub , the sum of their
Algorithm 2.1.2 Pseudocode for the “KFN_NodeSearch” function
Input: Q (Query point), k (Integer), Next_Node (Next node to be searched);
Output: FN (k-farthest neighbor array);
1: Let N = Next_Node;
2: if N is not a leaf then
3:    for each entry Or in N do
4:       if d(Or, Q) + rOr ≥ dk then
5:          Add [T(Or), dmax(T(Or))] to PR;
6:          if dmin(T(Or)) > dk then
7:             dk = FN_Update([−, dmin(T(Or))]);
8:             for each entry in PR with dmax(T(Or)) < dk do
9:                Remove the entry from PR;
10: else
11:    for all Oj in N do
12:       if d(Q, Oj) ≥ dk then
13:          dk = FN_Update([Oj, d(Oj, Q)]);
14:          for each entry in PR with dmax(T(Or)) < dk do
15:             Remove the entry from PR;
scores satisfies:
$$\frac{1}{H}\sum_{i=1}^{|S_{\mathrm{sub}}|-1} f\bigl(\phi(x_i)\bigr) < P \quad \text{and} \quad \frac{1}{H}\sum_{i=1}^{|S_{\mathrm{sub}}|} f\bigl(\phi(x_i)\bigr) \ge P, \qquad (7)$$
where |Ssub| denotes the number of examples in Ssub.
Based on the selected examples in Ssub , the problem is
transformed into:
$$\max_{\alpha}\;\; \sum_{i} \alpha_i K(x_i, x_i) - \sum_{i}\sum_{j} \alpha_i \alpha_j K(x_i, x_j) \qquad (8)$$
$$\text{s.t.}\quad 0 \le \alpha_i \le C, \quad \sum_{i} \alpha_i = 1, \quad i = 1, 2, \ldots, |S_{\mathrm{sub}}|.$$
From the optimization function (8), it can be seen that
the original learning problem is transformed into a smaller
problem. Compared to standard SVDD which is trained on
the entire dataset, our proposed approach learns the classifier by using only a subset of examples located near the data boundary. As a result, the computational cost of learning the classifier is largely reduced, while the classification accuracy remains comparable to that obtained with the entire dataset.
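The selection rule in (7) amounts to sorting the examples by score and keeping the shortest prefix whose cumulative share of H reaches P. A minimal sketch is shown below; select_subset is an illustrative helper, and train_svdd refers to the SVDD sketch given at the end of Sect. 2.2.

import numpy as np

def select_subset(scores, P=0.8):
    # Keep the top-ranked examples whose scores make up the leading fraction P
    # of H = sum of all scores, following (7).
    order = np.argsort(-scores)                         # re-index by descending score
    cum = np.cumsum(scores[order]) / scores.sum()
    m = int(np.searchsorted(cum, P)) + 1                # smallest m whose cumulative share >= P
    return order[:m]

scores = np.random.RandomState(3).rand(1000)            # toy ranking scores
keep = select_subset(scores, P=0.8)
print(len(keep), "of", len(scores), "examples retained")
# X_sub = X[keep]; alpha, R2 = train_svdd(X_sub, C=0.1, sigma=2.0)   # train on the subset only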
In summary, the kernel function is usually used to map the training data from the input space into a higher dimensional feature space. By adopting a proper kernel function, the training data in the feature space can become more compact and
Algorithm 2.2 Pseudocode for the “RankingScore” function
Input: Q (Query point), FN (k-farthest neighbor array);
Output: f(Q) (Ranking score of query point Q);
1: Sum_NormalizedDis(Q) = 0;
2: for all O in FN do
3:    Dis(Q, O) = ‖Q − O‖²;
4:    NormalizedDis(Q, O) = exp(−Dis(Q, O));
5:    Sum_NormalizedDis(Q) = Sum_NormalizedDis(Q) + NormalizedDis(Q, O);
6: f(Q) = 1 − Sum_NormalizedDis(Q)/k;
7: return f(Q);
distinguishable. Based on the transformed data, the scoring
function is presented to weight the examples based on the
distances to their k-farthest neighbors. By tuning the parameter k, we can make the scoring function discriminative
enough to determine the boundary examples. At the same
time, we can adjust the value of P , making Ssub include
more informative examples which are likely to be SVs, so
that Ssub can be utilized to train a more accurate classifier.
The learning time of our method consists of two parts:
data pre-processing and subset training. On the one hand,
the data pre-processing time refers to that taken to obtain the
k-farthest neighbors and compute the ranking scores. The
time complexity of building a balanced M-tree is around
O(n log n), where n is the number of training examples
[11, 40]. For each example, retrieving the M-tree for its k-farthest neighbors is approximately O(log n) on average, which is affected by the amount of overlap among the metric regions [11, 41]. Hence, for a training set with n examples, the total time of building the M-tree and retrieving the k-farthest neighbors is O(2n log n). Based on the k-farthest neighbors, the ranking scores are computed and it
takes time O(nk). Hence, the computational cost in the data
pre-processing part is O(2n log n) + O(nk). On the other
hand, the subset training time refers to that taken to train
the SVDD classifier on the selected examples. Assume that
the fraction of examples which make up the top P percentage of H is p; p is usually smaller than P. Considering that the time complexity of SVDD is O(n²), training the SVDD classifier on this fraction p of the examples takes time O((pn)²). Therefore, the time complexity of our method is O(2n log n) + O(nk) + O((pn)²), where it has 0 < p < 1
and k is far less than n. When the training set size is large,
our method achieves markedly better speedup than SVDD.
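As a rough back-of-the-envelope illustration of this complexity argument, the sketch below compares the two cost estimates while ignoring constant factors; p = 0.3 and k = 100 are placeholder values.

import numpy as np

def cost_svdd(n):
    return float(n) ** 2                                 # O(n^2) training on the full set

def cost_kfn_cbd(n, k=100, p=0.3):
    return 2 * n * np.log2(n) + n * k + (p * n) ** 2     # tree build/query + scoring + subset training

for n in (10_000, 100_000):
    print(n, round(cost_svdd(n) / cost_kfn_cbd(n), 1))   # rough speedup ratio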
3.3 Discussion of the proposed approach
3.3.1 KFN-CBD for classifying datasets with outliers
In SVDD [5], some training examples may be located outside the hyper-sphere, so that the objective function can be
Fig. 4 KFN-CBD on the noisy
normal distribution dataset
optimized by seeking a tradeoff between the hyper-sphere
volume and the training errors. The examples lying outside the hyper-sphere are known as outliers. These outliers can be determined by our proposed KFN-CBD method.
This is because the outliers are always located outside the
hyper-sphere, and naturally have larger distances from the
k-farthest neighbors than the examples inside the hyper-sphere. As a result, the outliers are usually associated with
larger ranking scores. Take Fig. 3(b) as an example. φ(x1 ) is
an outlier and φ(x5 ) is an example inside the hyper-sphere.
It is seen that the distances from φ(x1 ) to its k-farthest
neighbors are generally larger than those of φ(x5 ). Hence,
the outlier φ(x1 ) has a larger ranking score than φ(x5 ).
Figures 4(a)–(c) illustrate the classification boundary obtained by KFN-CBD on a noisy normal distribution dataset
when different values of P are adopted to learn the SVDD
classifier. In Fig. 4(a), the entire dataset is used for training.
In Fig. 4(b), P = 30 % is set and the examples making up the
top 30 % of H are chosen by KFN-CBD to train the SVDD
classifier. It can be seen that although only a subset of examples is used, the classification boundary is exactly drawn
out. Figure 4(c) shows the classification boundary when P is
set to be 40 %. It is not difficult to see that the classification
boundary stays unchanged and the newly added examples
fall inside the hyper-sphere, which confirms the effectiveness of KFN-CBD in selecting the SVs.
Additionally, the existence of outliers may be controlled
by the penalty coefficient C [5]. When 0 < C < 1 holds,
there may exist some examples (namely outliers) lying outside the hyper-sphere. In our experiments, the datasets implicitly contain outliers since we choose the value of C from
2⁻⁵ to 2⁵. The experimental results show that KFN-CBD
can obtain classification accuracy comparable to standard
SVDD, which indicates the capability of KFN-CBD in handling the datasets with outliers.
3.3.2 KFN-CBD for classifying multi-distributed datasets
In the multi-distributed dataset, the examples come from
more than one distribution. That is to say, in the input space,
the dataset consists of several clusters and each cluster represents one of the data distributions. As discussed in [42],
the formulation of SVDD can be used to cluster the multi-distributed unlabelled data (known as support vector clustering). In support vector clustering, the unlabelled data is mapped into the feature space and enclosed by a hyper-sphere. Then, the unlabelled data is mapped back to the input space. The data which is originally located on the hyper-sphere surface in the feature space is re-distributed on the
boundaries of different clusters in the input space. By doing
this, the data with multiple distributions can be identified.
The process of SVDD to classify multi-distributed data
is similar to that of support vector clustering. In SVDD, the
multi-distributed data is mapped into the feature space and
the classifier is obtained by enclosing the data in a hyper-sphere. By choosing an appropriate kernel function, the examples which originally lie on the boundaries of different
clusters in the input space are relocated on the hyper-sphere
surface in the feature space. Similar to the single distributed
datasets, the SVs of multi-distributed datasets lie on or outside the hyper-sphere surface. Therefore, KFN-CBD can be
applied on multi-distributed datasets by capturing the examples near the hyper-sphere boundary.
Figures 5(a)–(c) show the classification boundary of
KFN-CBD on a multi-distributed dataset which consists of
two banana shaped clusters. In Fig. 5(a), the whole dataset is
trained. In Fig. 5(b), P = 30 % is set and the examples making up the top 30 % of H are selected to learn the SVDD
classifier. It can be seen that some of the SVs have been
detected. Figure 5(c) illustrates the classification boundary
when P is equal to 40 %. It is observed that all SVs have
been determined and the classification boundary is explicitly
drawn out. From the above illustrations, it can be seen that
for the dataset whose data structure is not a hyper-sphere,
e.g., the multi-distributed dataset in Fig. 5(a), our method
can still obtain satisfactory classification results by utilizing
the kernel function.
In the experiments, we explicitly investigate the performance of KFN-CBD on multi-distributed datasets. Among
the experimental datasets, as shown in Table 2, the target
classes of Mnist(1), Mnist(2) and Covertype(3) sub-datasets
are made up of multi-distributed data. For example, the target class of Mnist(1) sub-dataset consists of the data from
five classes, i.e., classes 1, 2, 3, 4 and 5, and each class
Fig. 5 KFN-CBD on the
multi-distributed dataset
comes from one data distribution. Furthermore, the experimental results demonstrate that KFN-CBD is effective in
dealing with the multi-distributed datasets. For example,
KFN-CBD obtains comparable classification accuracy to
standard SVDD on Mnist(1) sub-dataset, as shown in Table 3.
4 Experiment
We investigate the accuracy, recall, precision and efficiency
of KFN-CBD on large-scale datasets with example sizes
from 23523 to 226641. For comparison, standard SVDD
and two representative data-processing methods, i.e., random sampling [21] and selective sampling [34], are used as
baselines.
4.1 Dataset descriptions
The large-sized datasets used in the experiments include
Mnist, IJCNN and Covertype datasets. The general descriptions of these datasets are given as follows:
1. Mnist dataset2 contains 70000 examples with 10 classes and 780 features. PCA is used to select 60 features for the experiments. The
example sizes from class 1 to 10 are 5923, 6742, 5958,
6131, 5842, 5421, 5918, 6265, 5851 and 5949, respectively.
2. IJCNN dataset3 has 141690 examples with 2 classes and
22 features. The example sizes of class 1 and class 2 are
128126 and 13564, respectively.
3. Covertype dataset4 is a relatively large dataset, containing 581012 examples with 7 classes and 54 features. The
example sizes from class 1 to class 7 are 211840, 283301,
35754, 2747, 9493, 17367 and 20510, respectively.
2 Available at http://yann.lecun.com/exdb/mnist/.
3 Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
4 Available at http://archive.ics.uci.edu.
Table 2 Descriptions of the sub-datasets used for one-class classification

Name          Target class     Training set size   Testing set size   # Features
Mnist(1)      1, 2, 3, 4, 5    24477               35523              60
Mnist(2)      6, 7, 8, 9, 10   23523               36477              60
IJCNN(1)      1                102501              39189              22
Covertype(1)  1                169472              411540             54
Covertype(2)  2                226641              354371             54
Covertype(3)  3, 4, 5, 6, 7    68697               512315             54
Since these datasets are designed for binary class or
multi-class classification, we follow the operations in [5]
to obtain sub-datasets for one-class classification. That is,
choose one class as the target class and treat the other classes
as the non-target class at each round. To ensure that the obtained sub-datasets are large, before performing the operations in [5], we re-organize each dataset as follows. For the
Mnist dataset, the first five classes are grouped into one class
and the last five classes are combined into another class. For
the Covertype dataset, the last five classes are considered as
one class. For the IJCNN dataset, since the first class occupies over 90 % of the examples, we only consider the first
class as the target class.
By doing this, we acquire 6 sub-datasets, i.e., Mnist(1),
Mnist(2), IJCNN(1), Covertype(1), Covertype(2) and Covertype(3), for the task of one-class classification. Each number in parentheses represents the class number after some
classes are combined. Table 2 shows the details of these
datasets. In Table 2, “Target class” lists the classes of the original dataset which are treated as the target class in our experiments.
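A sketch of this re-organization is given below; X_mnist and y_mnist are placeholder arrays standing in for the loaded dataset.

import numpy as np

def make_one_class_subset(X, y, target_labels):
    # Examples whose label is in target_labels form the target class;
    # everything else becomes the non-target class used only for testing.
    is_target = np.isin(y, target_labels)
    return X[is_target], X[~is_target]

# e.g. Mnist(1): the first five classes form the target class
# X_target, X_nontarget = make_one_class_subset(X_mnist, y_mnist, [1, 2, 3, 4, 5])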
4.2 Experimental settings
All the experiments run on a Linux machine with 2.66 GHz
processor and 4 GB DRAM. The training of SVDD is performed by modifying LibSVM,5 and the RBF kernel is
5 Available at www.csie.ntu.edu.tw/~cjlin/libsvm/.
Table 3 Results on large-scale sub-datasets (RS-random sampling, SS-selective sampling) (k = 100, P = 80 %)

              Accuracy                         Precision          Recall
              SVDD    RS      SS      KFN-CBD  SVDD     KFN-CBD   SVDD    KFN-CBD
Mnist(1)      85.74   80.66   82.29   85.73    88.28    88.26     86.89   86.88
Mnist(2)      81.64   77.42   76.21   81.62    88.58    88.57     81.71   81.69
IJCNN(1)      89.15   76.85   81.71   89.12    85.76    85.73     90.22   90.18
Covertype(1)  64.49   55.59   58.41   64.48    68.99    68.98     73.11   73.09
Covertype(2)  62.36   52.57   56.66   62.34    73.54    73.53     70.76   70.74
Covertype(3)  60.58   55.93   57.59   60.65    71.05    71.02     65.64   65.62
adopted. As in [8, 29, 43], we select the kernel parameter σ from 10⁻⁵ to 10⁵, and the penalty coefficient C from 2⁻⁵ to 2⁵. The optimal parameters are selected by conducting five-fold cross validation on the target class ten times. For each
of the 6 sub-datasets, the target class is partitioned into five
splits with equal sizes. In each round, four of the five splits
are used to form the training set; the remaining split and the
non-target class are used as the testing set. We repeat this
procedure ten times and compute the average results. It is
seen that the training set sizes vary from 23523 to 226641,
which guarantees that the experimental results are obtained
based on relatively large-sized datasets.
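The protocol above can be summarized by the following sketch. The grid ranges follow the text, while the fold construction and the random seed are illustrative assumptions.

import numpy as np
from itertools import product

sigmas = [10.0 ** e for e in range(-5, 6)]               # kernel width grid, 1e-5 .. 1e5
Cs = [2.0 ** e for e in range(-5, 6)]                    # penalty grid, 2^-5 .. 2^5
grid = list(product(sigmas, Cs))                         # every (sigma, C) pair is cross-validated

def five_fold_splits(n_target, seed=0):
    # Split the target class into five equal folds: four folds train the one-class
    # model, and the held-out fold (plus all non-target data) forms the test set.
    idx = np.random.RandomState(seed).permutation(n_target)
    folds = np.array_split(idx, 5)
    for i in range(5):
        train = np.concatenate([folds[j] for j in range(5) if j != i])
        yield train, folds[i]

print(len(grid), "parameter pairs,", sum(1 for _ in five_fold_splits(1000)), "folds per run")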
In Sect. 4.3, we fix k = 100 and P = 80 %, and compare KFN-CBD with the baselines. Here, k is the number
of farthest neighbors considered in calculating the scoring
function. The examples making up the top P percentage of
H are selected to train the SVDD classifier. Then, we analyze the performance variations of KFN-CBD when different P and k values are adopted in Sect. 4.4. It is known that
the upper bound on the number of outliers is 1/C [5], where C is
the penalty coefficient. For the selection of k, we make it no
less than 1/C. This is because, in addition to outliers, the
examples near the hyper-sphere surface can also be utilized
to make the scoring function more distinguishable. Empirically, when k reaches 100, satisfactory classification results
can be obtained.
4.3 Result comparison
We investigate the performance of KFN-CBD with k = 100
and P = 80 %. Although larger values of k and P can improve the classification results, it is found that when k = 100
and P = 80 % are set, KFN-CBD reaches classification quality comparable to using the whole dataset. The performance variations of KFN-CBD under different values of k
and P will be discussed later.
Table 3 reports the experimental results of SVDD and
KFN-CBD in terms of accuracy, precision and recall. It is
clearly shown that KFN-CBD obtains comparable classification quality to SVDD. Taking Mnist(1) sub-dataset as an
example, the accuracy, precision and recall of SVDD are
85.74 %, 88.28 % and 86.89 % respectively, while KFN-CBD obtains very close classification results with 85.73 %,
88.26 % and 86.88 %, which indicates that KFN-CBD is
effective for determining the SVs. In Table 3, the accuracy of random sampling and selective sampling on the 6
sub-datasets is also given. It can be easily seen that KFN-CBD markedly outperforms random sampling and selective
sampling. For example, the accuracy of KFN-CBD on the
Mnist(1) sub-dataset is 85.73 %, which is higher than that of random sampling (80.66 %) and selective sampling (82.29 %).
The training time of KFN-CBD consists of the data
pre-processing time Tp and the subset learning time Tl .
Tp mainly refers to the k-farthest neighbor query and score
evaluation, while Tl is the time taken by learning the classifier on the selected subset. For all results related to the
speedup, we have taken both Tp and Tl into account. Let
T(KFN) = Tp + Tl be the total training time of KFN-CBD on the experimental dataset, and T(SVDD) be the learning time of SVDD. For each experimental dataset, the speedup of KFN-CBD is computed by dividing T(SVDD) by T(KFN). Similarly, the speedup of random sampling and selective sampling is calculated by dividing T(SVDD) by their corresponding training time. In this way, we can investigate the different improvements of KFN-CBD, random sampling and selective sampling over SVDD. Table 4
presents the training time and SV numbers for KFN-CBD
and SVDD, as well as the speedup for KFN-CBD, random sampling and selective sampling. The specific training time of random sampling and selective sampling can
be obtained by dividing T(SVDD) by the corresponding speedup; T(SVDD) can be read from Table 4. It is seen
that KFN-CBD is around 6 times faster than SVDD, while selective sampling obtains approximately 2 times speedup over
SVDD. Moreover, KFN-CBD is 3 times faster than selective
sampling on average.
From the above analysis, it is clear that compared to SVDD, KFN-CBD obtains marked speedup and meanwhile preserves the classification quality. In addition, KFN-CBD turns out to be clearly faster than both random sampling and selective sampling, and demonstrates even better classification accuracy.
Table 4 Results on large-scale sub-datasets (RS-random sampling, SS-selective sampling, Tp-time taken by data pre-processing, Tl-time taken by subset learning) (k = 100, P = 80 %)

              Training time                    Speedup                  SV
              SVDD    KFN-CBD (Tp + Tl)        RS     SS     KFN-CBD    SVDD   KFN-CBD
Mnist(1)      1645    85 + 218                 1.42   1.94   5.43       1040   558
Mnist(2)      1583    58 + 201                 1.25   2.03   6.11       1037   517
IJCNN(1)      3969    195 + 471                1.39   1.85   5.96       3452   2078
Covertype(1)  10180   397 + 1242               1.33   1.93   6.21       7222   4668
Covertype(2)  35836   1327 + 4494              1.27   2.05   6.16       9833   7388
Covertype(3)  1904    117 + 208                1.17   1.60   5.86       3048   1604
Fig. 6 Performance of
KFN-CBD on the Covertype(1),
(2) and (3) sub-datasets.
(a)–(d) show the results when k
is fixed to be 100 and P varies
from 30 % to 90 %. (e) and (f)
present the results when P is
fixed to be 80 % and k varies
from 100 to 500
4.4 Performance variations
In this section, the performance variations of KFN-CBD under different k, P and dimension numbers are investigated.
Moreover, the classification accuracy of KFN-CBD with different distance measurements is also reported. Figures 6(a)–
(d) present the results of KFN-CBD with k = 100 and P
varying from 30 % to 90 % on the Covertype(1), (2) and
(3) sub-datasets. In order to better reflect the relative performance of KFN-CBD compared to standard SVDD, the
accuracy rate, precision rate and recall rate are given. The
accuracy rate is the ratio of KFN-CBD’s accuracy to that of
SVDD. The precision rate and recall rate are obtained in a
similar way. The specific values of accuracy, precision and
Fig. 7 Accuracy rates of KFN-CBD on the Mnist(1) dataset with different dimension numbers
recall for KFN-CBD can be obtained by multiplying the corresponding rates with those of standard SVDD, as shown in
Table 3.
In Figs. 6(a)–(c), it is seen that the accuracy rate, recall rate and precision rate improve synchronously as P increases. It is noted that when P reaches 80 %, KFN-CBD
obtains comparable accuracy as standard SVDD. Meanwhile, the speedup decreases from about 46.7 to around 4.65
on average, as shown in Fig. 6(d), since more examples are
included in the training phase.
Figures 6(e) and (f) show the accuracy rate and speedup
when P is set to be 80 % and k varies from 100 to 500.
We observe that with the increase of k, the accuracy rate remains almost unchanged, while the speedup dips since more
processing time is needed to obtain the k-farthest neighbors.
From these experimental results, it is seen that the classification quality are not very sensitive to k and even a small
number of k (e.g., k = 100) can suffice.
Moreover, we evaluate the performance of KFN-CBD
with different dimension numbers. To do this, we form five
sub-datasets by selecting the first 20 %, 40 %, 60 %, 80 %
and 100 % dimensions from the Mnist(1) dataset, respectively. The Mnist(1) dataset used in the experiments contains 60 dimensions, and hence the five sub-datasets have
12, 24, 36, 48 and 60 dimensions, respectively. Based on
the five sub-datasets, KFN-CBD is performed to learn the
classifiers, and the accuracy rates are recorded. Let Acc(1)
denote the classification accuracy of KFN-CBD on the sub-dataset with 12 dimensions, and Acc(SVDD) represent the
accuracy of SVDD on the Mnist(1) dataset with all 60 dimensions. To conduct a clear comparison, the accuracy rate
is computed by dividing Acc(1) by Acc(SVDD). The specific value of Acc(1) can be obtained by multiplying the accuracy rate by Acc(SVDD), whose value is 85.74 %, as
seen from Table 3. In this way, the accuracy rates for the
five sub-datasets with different dimensions can be obtained,
as shown in Fig. 7. Here, P and k are set to 80 % and 100,
respectively. It can be seen that as the number of dimensions increases, the accuracy rate goes up. This is because
the data dimensions carry classification information which is valuable for constructing the classifier. When
Table 5 Classification accuracy with different distance measurements

                        Mnist(1)   IJCNN(1)   Covertype(1)
Euclidean distance      85.73      89.12      64.48
Manhattan distance      85.72      89.09      64.48
Mahalanobis distance    85.79      89.08      64.46
more dimensions are contained in the datasets, more classification information is available to learn an accurate
classifier. When the dimension number reaches 60, the accuracy rate is about 100 %. That is to say, when all dimensions
are included in the training phase, KFN-CBD can achieve
comparable accuracy to SVDD, which can be also observed
from Table 3. More experiments on the Mnist and IJCNN
sub-datasets are presented in Figs. 8 and 9, respectively, and
similar findings can be observed.
Above, we evaluate the performance of KFN-CBD using
the Euclidean distance. It is also interesting to investigate its
performance under different distance measurements. Table 5
presents the accuracy of KFN-CBD by employing the Euclidean, Manhattan and Mahalanobis distances, respectively.
It is seen that KFN-CBD obtains relatively stable classification accuracy with different distance measurements, which
verifies the robustness of our method. KFN-CBD is developed based on M-tree which is a metric tree and just requires the distance function to be a metric, namely satisfying the non-negative, identity of indiscernibles, symmetry
and triangle inequality properties [11]. Hence, KFN-CBD is
applicable to the distance measurement which is a metric.
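Swapping the distance measurement therefore only amounts to passing a different metric into the k-farthest neighbor query. The sketch below collects the three metrics of Table 5 using scipy's distance functions, with an estimated inverse covariance for the Mahalanobis case; make_distances is an illustrative helper.

import numpy as np
from scipy.spatial.distance import euclidean, cityblock, mahalanobis

def make_distances(X):
    # The three metrics of Table 5; each satisfies the metric axioms required
    # by M-tree, so the k-farthest neighbor query can use any of them unchanged.
    VI = np.linalg.inv(np.cov(X, rowvar=False))          # inverse covariance for Mahalanobis
    return {
        "euclidean": euclidean,
        "manhattan": cityblock,
        "mahalanobis": lambda a, b: mahalanobis(a, b, VI),
    }

X = np.random.RandomState(4).randn(200, 5)
dists = make_distances(X)
print({name: round(fn(X[0], X[1]), 3) for name, fn in dists.items()})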
5 Conclusions and future work
5.1 Contribution of this work
SVDD faces a critical computation issue in handling large-scale datasets. This paper proposes a novel approach, named
K-Farthest-Neighbor-based Concept Boundary Detection
(KFN-CBD), to provide a reliable and scalable solution for
speeding up SVDD on large-scale datasets. KFN-CBD holds
the following characteristics: (1) the k-farthest-neighbor-based scoring function is proposed to weight the examples according to their proximity to the data boundary; (2) by adopting a novel tree query strategy, M-tree is extended to speed up the k-farthest neighbor search; to the best of our knowledge, this is the first learning method for efficient k-farthest neighbor query; (3) it is able to acquire comparable classification accuracy with significantly reduced computational cost.
Fig. 8 Performance of
KFN-CBD on the Mnist(1) and
(2) sub-datasets. (a) and (b)
show the results when k is fixed
to be 100 and P varies from
30 % to 90 %. (c) and (d)
present the results when P is
fixed to be 80 % and k varies
from 100 to 500
Fig. 9 Performance of
KFN-CBD on the IJCNN(1)
sub-dataset. (a) and (b) show
the results when k is set to be
100 and P varies from 30 % to
90 %. (c) and (d) present the
results when P is set to be 80 %
and k varies from 100 to 500
5.2 Limitations and future work
There are several limitations that may restrict the use of our
method in applications and we plan to extend this work in
the following directions in the future.
(1) Farthest neighbor query in high dimensions: Our
work for determining the k-farthest neighbors is based
on M-tree, which is a distance-based tree learning
method and inevitably suffers from the “curse of dimensionality” [11]. When the training data is in high dimen-
210
sions, the k-farthest neighbor query may become impractical. To deal with similarity search in high dimensions, some novel learning methods, such as localitysensitive hashing [44] and probably approximately correct NN search [41], are proposed. These methods
are put forward for k-nearest neighbor search, and we
would like to extend them to deal with k-farthest neighbor query in high dimensions.
(2) Selection of k and P: k is the number of farthest neighbors used in computing the ranking score. The experiments show that as k rises, the classification accuracy increases, but the learning time grows as well. Hence, the user should trade off classification accuracy against computational effort according to the application at hand. In our experiments, we observe that satisfactory classification results can be obtained once k reaches 100. In the future, we will investigate how to select an appropriate k according to different data characteristics and dataset sizes. Furthermore, the examples making up the top P percentage of H are selected to train the classifier. When P is larger, more boundary examples are likely to be included in training and better classification accuracy can be attained, but the learning time is longer. Together with the parameter k, we will investigate the selection of P on more real-world datasets; a small parameter-sweep sketch is given after this list.
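To make the trade-off above concrete, the following hedged sketch sweeps a small grid of k and P values, using scikit-learn's OneClassSVM with an RBF kernel as a stand-in for SVDD (closely related to, but not identical with, the formulation used in this paper); kfn_scores is the illustrative helper sketched in Sect. 5.1, and the parameter grids are arbitrary.

```python
# Hedged sketch of the k/P trade-off: OneClassSVM (RBF kernel) stands in for
# SVDD, and kfn_scores is the illustrative helper sketched in Sect. 5.1.
import time
import numpy as np
from sklearn.svm import OneClassSVM


def sweep_k_and_p(X_train, X_test, y_test, D,
                  ks=(100, 300, 500), ps=(0.3, 0.5, 0.8)):
    """Return (k, P, accuracy, training seconds) for each parameter pair."""
    results = []
    for k in ks:
        order = np.argsort(kfn_scores(D, k))[::-1]  # boundary examples first
        for p in ps:
            subset = X_train[order[: int(np.ceil(p * len(X_train)))]]
            start = time.time()
            clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(subset)
            elapsed = time.time() - start
            acc = np.mean(clf.predict(X_test) == y_test)  # labels in {+1, -1}
            results.append((k, p, acc, elapsed))
    return results
```

Larger k and larger P both tend to raise accuracy and training time, which mirrors the behaviour described above.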
In this paper, we mainly focus on one-class classification
problems without the rare class. We will extend our method
to one-class classification problems with the rare class and
evaluate its performance on real-world applications in the
future.
Acknowledgements This work is supported by Natural Science
Foundation of China (61070033, 61203280, 61202270), Guangdong
Natural Science Funds for Distinguished Young Scholar
(S2013050014133), Natural Science Foundation of Guangdong province (9251009001000005, S2011040004187, S2012040007078), Specialized Research Fund for the Doctoral Program of Higher Education
(20124420120004), Scientific Research Foundation for the Returned
Overseas Chinese Scholars, State Education Ministry, GDUT Overseas Outstanding Doctoral Fund (405120095), Science and Technology Plan Project of Guangzhou City (12C42111607, 201200000031,
2012J5100054), Science and Technology Plan Project of Panyu District Guangzhou (2012-Z-03-67), Australian Research Council Discovery Grant (DP1096218, DP130102691), and ARC Linkage Grant
(LP100200774, LP120100566).
References
1. Vapnik VN (1998) Statistical learning theory. Wiley, New York
2. Yu H, Hsieh C, Chang K, Lin C (2012) Large linear classification
when data cannot fit in memory. ACM Trans Knowl Discov from
Data 5(4):23210–23230
3. Diosan L, Rogozan A, Pecuchet JP (2012) Improving classification performance of support vector machine by genetically optimising kernel shape and hyper-parameters. Appl Intell 36(2):280–
294
4. Shao YH, Wang Z, Chen WJ, Deng NY (2013) Least squares twin
parametric-margin support vector machine for classification. Appl
Intell 39(3):451–464
5. Tax DMJ, Duin RPW (2004) Support vector data description.
Mach Learn 54(1):45–66
6. Wang Z, Zhang Q, Sun X (2010) Document clustering algorithm
based on NMF and SVDD. In: Proceedings of the international
conference on communication systems, networks and applications
7. Wang D, Tan X (2013) Centering SVDD for unsupervised feature
representation in object classification. In: Proceedings of the international conference on neural information processing
8. Tax DMJ, Duin RPW (1999) Support vector data description applied to machine vibration analysis. In: Proceedings of the fifth
annual conference of the ASCI, pp 398–405
9. Lee SW, Park J (2006) Low resolution face recognition based
on support vector data description. Pattern Recognit 39(9):1809–
1812
10. Brunner C, Fischer A, Luig K, Thies T (2012) Pairwise support
vector machines and their application to large scale problems.
J Mach Learn Res 13(1):2279–2292
11. Ciaccia P, Patella M, Zezula P (1997) M-tree: an efficient access
method for similarity search in metric spaces. In: Proceedings of
the international conference on very large data bases, pp 426–435
12. Debnath R, Muramatsu M, Takahashi H (2005) An efficient support vector machine learning method with second-order cone programming for large-scale problems. Appl Intell 23(3):219–239
13. Joachims T (1998) Making large-scale support vector machine
learning practical. Advances in Kernel methods. MIT Press, Cambridge
14. Osuna E, Freund R, Girosi F (1997) An improved training algorithm for support vector machines. In: Proceedings of the IEEE
workshop on neural networks for signal processing, pp 276–285
15. Platt J (1998) Fast training of support vector machines using sequential minimal optimization. Advances in Kernel methods. MIT
Press, Cambridge
16. Keerthi S, Shevade S, Bhattacharyya C, Murthy K (2001) Improvements to Platt’s SMO algorithm for SVM classifier design.
Neural Comput 13(3):637–649
17. Chang CC, Lin CJ (2001) LibSVM: a library for support vector
machines. Software available at http://www.csie.ntu.edu.tw/cjlin/
libsvm
18. Fine S, Scheinberg K (2001) Efficient SVM training using lowrank kernel representation. J Mach Learn Res 2:243–264
19. Cossalter M, Yan R, Zheng L (2011) Adaptive kernel approximation for large-scale non-linear SVM prediction. In: Proceedings of
the international conference on machine learning
20. Chen M-S, Lin K-P (2011) Efficient kernel approximation for
large-scale support vector machine classification. In: Proceedings
of the SIAM international conference on data mining
21. Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
22. Graf H, Cosatto E, Bottou L, Dourdanovic I, Vapnik V (2005) Parallel support vector machines: the cascade SVM. In: Proceedings
of the advances in neural information processing systems, pp 521–
528
23. Smola A, Schölkopf B (2000) Sparse greedy matrix approximation for machine learning. In: Proceedings of the international conference on machine learning, pp 911–918
24. Pavlov D, Chudova D, Smyth P (2000) Towards scalable support
vector machines using squashing. In: Proceedings of the ACM
SIGKDD international conference on knowledge discovery and
data mining, pp 295–299
25. Li B, Chi M, Fan J, Xue X (2007) Support cluster machine. In:
Proceedings of the international conference on machine learning,
pp 505–512
26. Zanni L, Serafini T, Zanghirati G (2006) Parallel software for
training large scale support vector machines on multiprocessor
systems. J Mach Learn Res 7:1467–1492
27. Collobert R, Bengio S, Bengio Y (2002) A parallel mixture of
SVMs for very large scale problems. Neural Comput 14(5):1105–
1114
28. Gu Q, Han J (2013) Clustered support vector machines. In: Proceedings of the international conference on artificial intelligence
and statistics
29. Yuan Z, Zheng N, Liu Y (2005) A cascaded mixture SVM classifier for object detection. In: Proceedings of the international conference on advances in neural networks, pp 906–912
30. Yu H, Yang J, Han J (2003) Classifying large data sets using svm
with hierarchical clusters. In: Proceedings of the ACM SIGKDD
international conference on knowledge discovery and data mining,
pp 306–315
31. Sun S, Tseng C, Chen Y, Chuang S, Fu H (2004) Cluster-based
support vector machines in text-independent speaker identification. In: Proceedings of the international conference on neural networks
32. Boley D, Cao D (2004) Training support vector machine using
adaptive clustering. In: Proceedings of the SIAM international
conference on data mining
33. Lawrence N, Seeger M, Herbrich R (2003) Fast sparse Gaussian
process methods: the informative vector machine. In: Proceedings
of the advances in neural information processing systems, pp 609–
616
34. Kim P, Chang H, Song D, Choi J (2007) Fast support vector data
description using k-means clustering. In: Proceedings of the international symposium on neural networks: advances in neural networks, part III, pp 506–514
35. Wang CW, You WH (2013) Boosting-SVM: effective learning
with reduced data dimension. Appl Intell 39(3):465–474
36. Li C, Liu K, Wang H (2011) The incremental learning algorithm
with support vector machine based on hyperplane-distance. Appl
Intell 34(1):19–27
37. Lee LH, Lee CH, Rajkumar R, Isa D (2012) An enhanced support vector machine classification framework by using Euclidean
distance function for text document categorization. Appl Intell
37(1):80–99
38. Garcia V, Debreuve E, Barlaud M (2008) Fast k nearest neighbor
search using GPU. In: Proceedings of the IEEE computer society
conference on computer vision and pattern recognition workshops
39. Menendez HD, Barrero DF, Camacho D (2013) A multi-objective
genetic graph-based clustering algorithm with memory optimization. In: Proceedings of the IEEE Congress on evolutionary computation
40. Ciaccia P, Patella M, Zezula P (1998) Bulk loading the M-tree. In:
Proceedings of the Australasian database conference, pp 15–26
41. Ciaccia P, Patella M (2000) PAC nearest neighbor queries: approximate and controlled search in high-dimensional and metric
spaces. In: Proceedings of the international conference on data engineering
42. Ben-Hur A, Horn D, Siegelmann H, Vapnik V (2001) Support vector clustering. J Mach Learn Res 2:125–137
43. Davy M, Godsill S (2002) Detection of abrupt spectral changes using support vector machines—an application to audio signal segmentation. In: Proceedings of the IEEE international conference
on acoustics, speech, and signal processing
44. Mayur D, Nicole I, Piotr I, Vahab SM (2004) Locality-sensitive
hashing scheme based on P-stable distributions. In: Proceedings
of the annual ACM symposium on computational geometry
Yanshan Xiao received the Ph.D.
degree in computer science from the
Faculty of Engineering and Information Technology, University of
Technology, Sydney, Australia, in
2011. She is with the Faculty of
Computer, Guangdong University
of Technology. Her research interests include multiple-instance learning, support vector machine and
data mining.
Bo Liu is with the Faculty of Automation, Guangdong University
of Technology. His research interests include machine learning and
data mining. He has published papers in IEEE Transactions on Neural Networks, IEEE Transactions
on Knowledge and Data Engineering, Knowledge and Information
Systems, International Joint Conferences on Artificial Intelligence
(IJCAI), IEEE International Conference on Data Mining (ICDM),
SIAM International Conference on
Data Mining (SDM) and ACM International Conference on Information and Knowledge Management
(CIKM).
Zhifeng Hao is a professor with the
Faculty of Computer, Guangdong
University of Technology. His research interests include design and
analysis of algorithms, mathematical modeling and combinatorial optimization.
Longbing Cao is a professor with
the Faculty of Information Technology, University of Technology, Sydney. His research interests include
data mining, multi-agent technology, and agent and data mining integration. He is a senior member of
the IEEE.