Optimization-based Data Mining
Techniques with Applications
Proceedings of a Workshop held in Conjunction with
2005 IEEE International Conference on Data Mining
Houston, USA, November 27, 2005
Edited by
Yong Shi
Chinese Academy of Sciences Research Center on Data Technology & Knowledge Economy
Graduate University of the Chinese Academy of Sciences
Beijing, China
ISBN 0-9738918-1-5
The papers appearing in this book reflect the authors’ opinions and are published in the
interests of timely dissemination based on review by the program committee or volume
editors. Their inclusion in this publication does not necessarily constitute endorsement by
the editors.
©2005 by the authors and editors of this book.
No part of this work can be reproduced without permission except as indicated by the
“Fair Use” clause of the copyright law. Passages, images, or ideas taken from this work
must be properly credited in any written or published materials.
ISBN 0-9738918-0-7
Printed by Saint Mary’s University, Canada.
CONTENTS
Introduction…………………………………………..………..…………..II
Novel Quadratic Programming Approaches for Feature Selection
and Clustering with Applications
W. Art Chaovalitwongse……………………………………………………………....…..1
Fuzzy Support Vector Classification Based on Possibility Theory
Zhimin Yang, Yingjie Tian, Naiyang Deng……………………………………………….8
DEA-based Classification for Finding Performance Improvement
Direction
Shingo Aoki, Yusuke Nishiuchi, Hiroshi Tsuji……………………………………..……16
Multi-Viewpoint Data Envelopment Analysis for Finding Efficiency
and Inefficiency
Shingo AOKI, Kiyosei MINAMI, Hiroshi TSUJI………………………………...……..21
Mining Valuable Stocks with Genetic Optimization Algorithm
Lean Yu, Kin Keung Lai and Shouyang Wang……………………………...…………..27
A Comparison Study of Multiclass Classification between
Multiple Criteria Mathematical Programming and Hierarchical
Method for Support Vector Machines
Yi Peng, Gang Kou, Yong Shi, Zhenxing Chen and Hongjin Yang…………………….30
Pattern Recognition for Multimedia Communication Networks
Using New Connection Models between MCLP and SVM
Jing HE, Wuyi YUE, Yong SHI…………………………………………………...…….37
Introduction
For the last ten years, researchers have extensively applied quadratic programming
to classification, best known through V. Vapnik's Support Vector Machine, as well as to various
applications. However, the use of optimization techniques for data separation and
data analysis goes back more than thirty years. According to O. L. Mangasarian,
his group formulated linear programming as a large margin classifier in the 1960s. In
the 1970s, A. Charnes and W. W. Cooper initiated Data Envelopment Analysis, in which
fractional programming is used to evaluate decision making units, the economically
representative data in a given training dataset. From the 1980s to the 1990s, F. Glover proposed
a number of linear programming models to solve discriminant problems with small
sample sizes. Then, since 1998, the organizer and his colleagues have extended this line of
research into classification via multiple criteria linear programming (MCLP) and
multiple criteria quadratic programming (MCQP). All of these methods differ from
statistics, decision tree induction, and neural networks. There are now numerous
scholars around the world actively working in the field of using
optimization techniques to handle data mining problems. This workshop intends to
promote research interest in the connection between optimization and data mining, as well
as in real-life applications, among the growing data mining communities. All seven
papers accepted by the workshop reflect the findings of researchers in this
interface field.
Yong Shi
Beijing, China
Novel Quadratic Programming Approaches for Feature Selection and Clustering
with Applications
W. Art Chaovalitwongse
Department of Industrial and Systems Engineering
Rutgers, The State University of New Jersey
Piscataway, New Jersey 08854
Email: [email protected]
Abstract
In this paper, we focus our main application on epilepsy
research. Epilepsy is the second most common brain
disorder after stroke. Epilepsy is a chronic medical condition produced by temporary
changes in the electrical function of the brain; its most disabling aspect is the
uncertainty of recurrent seizures. The aim of this research is to develop and apply
a new DM paradigm used to predict seizures based on the
study of neurological brain functions through quantitative
analyses of electroencephalograms (EEGs), which is a tool
for evaluating the physiological state of the brain. Although
EEGs offer excellent spatial and temporal resolution to
characterize rapidly changing electrical activity of brain
activation, it is not an easy task to excavate hidden patterns
or relationships in massive data with properties in time and
space like EEG time series. This paper involves research
activities directed toward the development of mathematical
models and optimization techniques for DM problems. The
primary goal of this paper is to incorporate novel optimization methods with DM techniques. Specifically, novel
feature selection and clustering techniques are proposed
in this paper. The proposed techniques will enhance the
ability to provide more precise data characterization, more
accurate prediction/classification, and greater understanding of EEG time series.
Uncontrolled epilepsy poses a significant burden to
society due to associated healthcare cost to treat and
control the unpredictable and spontaneous occurrence of
seizures. The main objective of this paper is to develop and
apply novel optimization-based data mining approaches
to the study of brain physiology, which might be able to
revolutionize current diagnosis and treatment of epilepsy.
Through quantitative analyses of electroencephalogram
(EEG) recordings, a new data mining paradigm for feature
selection and clustering is developed based on mathematical models and optimization techniques proposed in this
paper. The experimental results in this study demonstrate
that the proposed techniques can be used as a feature
(electrode) selection technique to capture seizure precursors. In addition, the proposed techniques will not only
excavate hidden patterns/relationships in EEGs, but also
will give a greater understanding of brain functions (as
well as other complex systems) from a system perspective.
I. Introduction and Background
Most data mining (DM) tasks fundamentally involve
discrete decisions based on numerical analyses of data
(e.g., the number of clusters, the number of classes, the
class assignment, the most informative features, the outlier
samples, the samples capturing the essential information).
These techniques are combinatorial in nature and can naturally be formulated as discrete optimization problems. The
goal of most DM tasks naturally lends itself to a discrete
NP-hard optimization problem. Aside from the complexity
issue, the massive scale of real-life DM problems is another
difficulty arising in optimization-based DM research.
A. Feature/Sample Selection
Although the brain is considered to be the largest
interconnected network, neurologists believe that seizures
represent the spontaneous formation of self-organizing
spatiotemporal patterns that involve only some parts (electrodes) of the brain network. The localization of epileptogenic zones is one of the proofs of this concept. Therefore,
feature selection techniques have become a very essential
tool for selecting the critical brain areas participating in
,&'0:RUNVKRS2SWLPL]DWLRQEDVHG'DWD0LQLQJ7HFKQLTXHVZLWK$SSOLFDWLRQV
identify the best k clusters that minimize the distance of
the points assigned in the cluster from the center of the
cluster. A very well-known example of the distance-based
method is k-mean clustering. Another clustering method is
a model-based method, which assumes a functional model
expression that describes each of the clusters and then
searches for the best parameter to fit the cluster model by
minimizing a likelihood measure. Most clustering methods
attempt to identify the best k clusters that minimize the
distance of the points assigned in the cluster from the
center of the cluster. k-median clustering is another widely
studied clustering technique, which can be modeled as
a concave minimization problem and reformulated as a
minimization problem of a bilinear function over a polyhedral set [3]. Although these clustering techniques are well
studied and robust, they still require a priori knowledge of
the data (e.g., the number of clusters, the most informative
features).
the epileptogenesis process during seizure development.
In addition, graph theoretical approaches appear to fit
very well as a model of a brain structure [12]. Feature
selection will be very useful in selecting/identifying the
brain areas correlated to the pathway to seizure onset.
In general, feature/sample selection is considered to be a
dimensionality reduction technique within the framework
of classification and clustering. This problem can naturally
be defined as a binary optimization problem. The notion
of selecting a subset of variables out of a superset of possible
alternatives naturally lends itself to a combinatorial
(discrete) optimization problem.
In general, depending on the model used to describe the
data, the feature selection problem ends up being a
(non)linear mixed integer programming (MIP) problem.
The most difficult issue in DM problems arises when one
has to deal with spatial and temporal data. It is extremely
critical to be able to identify the best features in a timely
fashion. To overcome this difficulty, the feature selection
problem in seizure prediction research is modeled as a
Multi-Quadratic Integer Programming (MQIP) problem.
MQIPs are very difficult to solve. Although many efficient
reformulation-linearization techniques (RLTs) have been
used to linearize QP and nonlinear integer programming
problems [1], [14], additional quadratic constraints make
MQIP problems much more difficult to solve, and current
RLTs fail to solve MQIP problems effectively. A fast and
scalable RLT that can be used to solve MQIPs for feature
selection is herein proposed based on our preliminary studies
in [7], [24]. In addition, a novel framework applying
graph theory to feature selection, based on the
preliminary study in [28], is also proposed in this paper.
B. Clustering

The elements and dynamical connections of the brain can portray the characteristics of a group of neurons and synapses, or of neuronal populations, driven by the epileptogenic process. Therefore, clustering the brain areas portraying similar structural and functional relationships will give us insight into the mechanisms of epileptogenesis and an answer to the question of how seizures are generated, developed, and propagated, and how they can be disrupted and treated. The goal of clustering is to find the best segmentation of raw data into the most common/similar groups. The similarity measure is therefore the most important property in clustering. The difficulty in clustering arises from the fact that clustering is unsupervised learning, in which the properties or the expected number of groups (clusters) are not known ahead of time. The search for the optimal number of clusters is parametric in nature. The distance-based method is the most commonly studied clustering technique, which attempts to identify the best k clusters that minimize the distance of the points assigned to a cluster from the center of that cluster. A very well-known example of the distance-based method is k-means clustering. Another clustering method is the model-based method, which assumes a functional model expression that describes each of the clusters and then searches for the best parameters to fit the cluster model by minimizing a likelihood measure. k-median clustering is another widely studied clustering technique, which can be modeled as a concave minimization problem and reformulated as a minimization problem of a bilinear function over a polyhedral set [3]. Although these clustering techniques are well studied and robust, they still require a priori knowledge of the data (e.g., the number of clusters, the most informative features).

II. Data Mining in EEGs

Recent quantitative EEG studies previously reported in [5], [11], [10], [8], [16], [24] suggest that seizures are deterministic rather than random and that it may be possible to predict the onset of epileptic seizures based on quantitative analysis of the brain's electrical activity through EEGs. Seizure predictability has also been confirmed by several other groups [13], [29], [20], [21]. The analysis proposed in this research was motivated by mathematical models from chaos theory used to characterize multidimensional complex systems and reduce the dimensionality of EEGs [19], [31]. These techniques demonstrate dynamical changes of epileptic activity that involve the gradual transition from a state of spatiotemporal chaos to spatial order and temporal chaos [4], [27]. Such a transition, which precedes seizures for periods on the order of minutes to hours, is detectable in the EEG by the convergence in value of chaos measures (i.e., the short-term maximum Lyapunov exponent, $STL_{max}$) among critical electrode sites on the neocortex and hippocampus [10]. The T-statistical distance was proposed to estimate the pairwise difference (similarity) of the dynamics of EEG time series between brain electrode pairs. The T-index measures the degree of convergence of chaos measures among critical electrode sites. The T-index at time $t$ between electrode sites $i$ and $j$ is defined as
$$T_{i,j}(t) = \sqrt{N} \times \left| E\{STL_{max,i} - STL_{max,j}\} \right| / \sigma_{i,j}(t),$$
where $E\{\cdot\}$ is the sample average of the differences $STL_{max,i} - STL_{max,j}$ estimated over a moving window $w_t(\lambda)$ defined as
$$w_t(\lambda) = \begin{cases} 1 & \text{if } \lambda \in [t-N-1,\, t] \\ 0 & \text{if } \lambda \notin [t-N-1,\, t], \end{cases}$$
where $N$ is the length of the moving window. Then $\sigma_{i,j}(t)$ is the sample standard deviation of the $STL_{max}$ differences between electrode sites $i$ and $j$ within the moving window $w_t(\lambda)$. The T-index thus defined follows a t-distribution with $N-1$ degrees of freedom. A novel feature selection technique based on optimization techniques to select critical electrode sites minimizing the T-index similarity measure was proposed in [4], [24]. The results of that study demonstrated that the spatiotemporal dynamical properties of EEGs manifest patterns corresponding to specific clinical states [6], [4], [17], [24]. In spite of promising signs of seizure predictability, research in epilepsy is still far from complete. The existence of seizure pre-cursors remains to be further investigated with respect to parameter settings, accuracy, sensitivity, and specificity. Essentially, there is a need for new feature selection and clustering techniques that systematically identify the brain areas underlying seizure evolution as well as the epileptogenic zones (the areas initiating the habitual seizures).
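As a concrete illustration of the measure above, the following minimal Python sketch (our own, not from the paper) computes $T_{i,j}(t)$ for two hypothetical $STL_{max}$ profiles; the array names, the window indexing, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def t_index(stlmax_i, stlmax_j, t, N):
    """T_{i,j}(t) = sqrt(N) * |mean(d)| / std(d), with d the STLmax differences
    over the moving window of length N ending at time t (Section II)."""
    d = stlmax_i[t - N + 1 : t + 1] - stlmax_j[t - N + 1 : t + 1]
    return np.sqrt(N) * abs(d.mean()) / d.std(ddof=1)  # sample std, N-1 dof

# Toy usage on synthetic profiles for two electrode sites:
rng = np.random.default_rng(0)
site_a = rng.normal(5.0, 1.0, 500)   # hypothetical STLmax series
site_b = rng.normal(5.2, 1.0, 500)
print(t_index(site_a, site_b, t=499, N=60))
```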
III. Feature Selection
The concept of optimization models for feature selection used to select/identify the brain areas correlated with the pathway to seizure onset came from the Ising model, which has been a powerful tool for studying phase transitions in statistical physics. Such an Ising model can be described by a graph $G(V,E)$ having $n$ vertices $\{v_1, \ldots, v_n\}$, with each edge $(i,j) \in E$ having a weight (interaction energy) $J_{ij}$. Each vertex $v_i$ has a magnetic spin variable $\sigma_i \in \{-1,+1\}$ associated with it. An optimal spin configuration of minimum energy is obtained by minimizing the Hamiltonian
$$H(\sigma) = -\sum_{1 \le i \le j \le n} J_{ij}\,\sigma_i \sigma_j$$
over all $\sigma \in \{-1,+1\}^n$. This problem is equivalent to the combinatorial problem of quadratic 0-1 programming [15]. This has motivated us to use quadratic 0-1 (integer) programming to select the critical cortical sites, where each electrode has only two states, and to determine the minimal-average-T-index state. In addition, we also introduce extensions of quadratic integer programming for electrode selection, namely Feature Selection via Multi-Quadratic Programming and Feature Selection via Graph Theory.
A. Feature Selection via Quadratic Integer Programming (FSQIP)

FSQIP is a novel mathematical model for selecting critical features (electrodes) of the brain network. It can be modeled as a quadratic 0-1 knapsack problem whose objective function minimizes the average T-index (a measure of statistical distance between the mean values of $STL_{max}$) among electrode sites, with a knapsack constraint to identify the number of critical cortical sites. A powerful quadratic 0-1 programming technique proposed in [25] is employed to solve this problem. Next we demonstrate how to reduce a quadratic program with a knapsack constraint to an unconstrained quadratic 0-1 program. In order to formalize the notion of equivalence, we propose the following definitions.

Definition 1: We say that problem $P$ is "polynomially reducible" to problem $P_0$ if, given an instance $I(P)$ of problem $P$, we can in polynomial time obtain an instance $I(P_0)$ of problem $P_0$ such that solving $I(P)$ will solve $I(P_0)$.

Definition 2: Two problems $P_1$ and $P_2$ are called "equivalent" if $P_1$ is "polynomially reducible" to $P_2$ and $P_2$ is "polynomially reducible" to $P_1$.

Consider the following three problems:

$P_1$: $\min f(x) = x^T A x$, $x \in \{0,1\}^n$, $A \in \mathbb{R}^{n \times n}$.

$\bar{P}_1$: $\min f(x) = x^T A x + c^T x$, $x \in \{0,1\}^n$, $A \in \mathbb{R}^{n \times n}$, $c \in \mathbb{R}^n$.

$\hat{P}_1$: $\min f(x) = x^T A x$, $x \in \{0,1\}^n$, $A \in \mathbb{R}^{n \times n}$, $\sum_{i=1}^n x_i = k$, where $0 \le k \le n$ is a constant.

Define $A$ as an $n \times n$ T-index pairwise distance matrix, and let $k$ be the number of selected electrode sites. Problems $P_1$, $\bar{P}_1$, and $\hat{P}_1$ can be shown to be all "equivalent" by proving that $P_1$ is "polynomially reducible" to $\bar{P}_1$, $\bar{P}_1$ is "polynomially reducible" to $P_1$, $\hat{P}_1$ is "polynomially reducible" to $P_1$, and $P_1$ is "polynomially reducible" to $\hat{P}_1$. For more details, see [4], [6].
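For intuition, here is a small brute-force sketch of $\hat{P}_1$ (ours, not the authors' solver from [25]): it enumerates all $k$-subsets of electrodes and picks the one minimizing $x^T A x$, which is viable only for the tiny instance sizes used in this illustration; the random matrix $A$ is a stand-in for a real T-index matrix.

```python
import itertools
import numpy as np

def solve_p1_hat(A, k):
    """Brute-force min x^T A x over x in {0,1}^n with sum(x) = k."""
    n = A.shape[0]
    best_sites, best_val = None, np.inf
    for sites in itertools.combinations(range(n), k):
        idx = list(sites)
        val = A[np.ix_(idx, idx)].sum()      # equals x^T A x for this subset
        if val < best_val:
            best_sites, best_val = sites, val
    return best_sites, best_val

rng = np.random.default_rng(1)
M = rng.uniform(0.0, 10.0, (8, 8))
A = (M + M.T) / 2.0                          # symmetric pairwise "T-index" matrix
np.fill_diagonal(A, 0.0)
print(solve_p1_hat(A, k=3))                  # the k most mutually "converged" sites
```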
B. Feature Selection via Multi-Quadratic Integer Programming (FSMQIP)

FSMQIP is a novel mathematical model for selecting critical features (electrodes) of the brain network, which can be modeled as an MQIP problem given by:
$$\min x^T A x, \quad \text{s.t. } \sum_{i=1}^n x_i = k;\quad x^T C x \ge T_\alpha k(k-1);\quad x \in \{0,1\}^n,$$
where $A$ is an $n \times n$ matrix of pairwise similarity of chaos measures before a seizure, $C$ is an $n \times n$ matrix of pairwise similarity of chaos measures after a seizure, and $k$ is the pre-determined number of selected electrodes. This problem has been proved to be NP-hard in [24]. The objective function minimizes the average T-index distance (similarity) of chaos measures among the critical electrode sites. The knapsack constraint identifies the number of critical cortical sites. The quadratic constraint ensures the divergence of chaos measures among the critical electrode sites after a seizure. A novel RLT to reformulate this MQIP problem as a MIP problem was proposed in [7], which demonstrated the equivalence of the following two problems:

$P_2$: $\min_x f(x) = x^T A x$, s.t. $Bx \ge b$, $x^T C x \ge \alpha$, $x \in \{0,1\}^n$, where $\alpha$ is a positive constant.

$\bar{P}_2$: $\min_{x,y,s,z} g(s) = e^T s$, s.t. $Ax - y - s = 0$, $Bx \ge b$, $y \le M(e - x)$, $Cx - z \ge 0$, $e^T z \ge \alpha$, $z \le \bar{M} x$, $x \in \{0,1\}^n$, $y_i, s_i, z_i \ge 0$, where $M = \|A\|_\infty$ and $\bar{M} = \|C\|_\infty$.

Proposition 1: $P_2$ is equivalent to $\bar{P}_2$ if every entry in matrices $A$ and $C$ is non-negative.

Proof: It has been shown in [9], [7] that $P_2$ has an optimal solution $x^0$ iff there exist $y^0, s^0, z^0$ such that $(x^0, y^0, s^0, z^0)$ is an optimal solution to $\bar{P}_2$.
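The linearization in $\bar{P}_2$ can be written down directly with an off-the-shelf MIP modeler. The sketch below is a hedged illustration (not the RLT code of [7]): it builds $\bar{P}_2$ with PuLP for a small random instance whose knapsack constraint is $\sum_i x_i = k$; for non-negative $A$, the optimal $e^T s$ coincides with $x^{*T} A x^*$, as Proposition 1 states.

```python
import numpy as np
import pulp

rng = np.random.default_rng(2)
n, k, alpha = 6, 3, 5.0
A = rng.uniform(0, 4, (n, n)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
C = rng.uniform(0, 4, (n, n)); C = (C + C.T) / 2; np.fill_diagonal(C, 0)
M_A = max(abs(A[i]).sum() for i in range(n))   # ||A||_inf, big-M for y
M_C = max(abs(C[i]).sum() for i in range(n))   # ||C||_inf, big-M for z

prob = pulp.LpProblem("P2_bar", pulp.LpMinimize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(n)]
y = [pulp.LpVariable(f"y{i}", lowBound=0) for i in range(n)]
s = [pulp.LpVariable(f"s{i}", lowBound=0) for i in range(n)]
z = [pulp.LpVariable(f"z{i}", lowBound=0) for i in range(n)]

prob += pulp.lpSum(s)                                   # min e^T s
prob += pulp.lpSum(x) == k                              # knapsack constraint
for i in range(n):
    prob += pulp.lpSum(float(A[i][j]) * x[j] for j in range(n)) - y[i] - s[i] == 0
    prob += y[i] <= M_A * (1 - x[i])                    # y <= M(e - x)
    prob += pulp.lpSum(float(C[i][j]) * x[j] for j in range(n)) - z[i] >= 0
    prob += z[i] <= M_C * x[i]                          # z <= M_bar * x
prob += pulp.lpSum(z) >= alpha                          # e^T z >= alpha

prob.solve(pulp.PULP_CBC_CMD(msg=0))
chosen = [i for i in range(n) if x[i].value() == 1]
print(chosen, pulp.value(prob.objective))               # e^T s equals x^T A x here
```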
C. Feature Selection via Maximum Clique (FSMC)

FSMC is a novel mathematical model based on graph theory for selecting critical features (electrodes) of the brain network [9]. The brain connectivity can be rigorously modeled as a brain graph as follows: consider a brain network of electrodes as a weighted graph, where each node represents an electrode and the weights of edges between nodes represent T-statistical distances of chaos measures between electrodes. Three possible weighted graphs are proposed: GRAPH-I is a complete graph (the graph with all possible edges); GRAPH-II is a graph induced from the complete one by deleting edges whose T-index before a seizure is greater than the T-test confidence level; GRAPH-III is a graph induced from the complete one by deleting edges whose T-index before a seizure is greater than the T-test confidence level or whose T-index after a seizure is less than the T-test confidence level. Maximum cliques of these graphs will be investigated, as the hypothesis is that a group of physiologically connected electrodes constitutes a critical largest connected network of seizure evolution and pathway. The Maximum Clique Problem (MCP) is NP-hard [26]; therefore, solving MCPs is not an easy task. Nevertheless, the RLT in [7] provides a very compact formulation of the maximum clique problem (MCP). This compact formulation has theoretical and computational advantages over traditional formulations and provides tighter relaxation bounds.

Consider a maximum clique problem defined as follows. Let $G = G(V,E)$ be an undirected graph where $V = \{1, \ldots, n\}$ is the set of vertices (nodes), and $E$ denotes the set of edges. Assume that there are no parallel edges (and no self-loops joining the same vertex) in $G$. Denote an edge joining vertices $i$ and $j$ by $(i,j)$.

Definition 3: A clique of $G$ is a subset $C$ of vertices with the property that every pair of vertices in $C$ is connected by an edge; that is, $C$ is a clique if the subgraph $G(C)$ induced by $C$ is complete.

Definition 4: The maximum clique problem is the problem of finding a clique set $C$ of maximal cardinality (size) $|C|$.
The maximum clique problem can be represented in many equivalent formulations (e.g., an integer programming problem, a continuous global optimization problem, and an indefinite quadratic programming problem) [22]. Consider the following indefinite quadratic programming formulation of the MCP. Let $A_G = (a_{ij})_{n \times n}$ be the adjacency matrix of $G$ defined by
$$a_{ij} = \begin{cases} 1 & \text{if } (i,j) \in E \\ 0 & \text{if } (i,j) \notin E. \end{cases}$$
The matrix $A_G$ is symmetric and all its eigenvalues are real numbers. Generally, $A_G$ has positive and negative (and possibly zero) eigenvalues, and the sum of the eigenvalues is zero since the main diagonal entries are zero [15]. Consider the following indefinite QIP problem and MIP problem for the MCP:

$P_3$: $\max \frac{1}{2} x^T A x$, s.t. $x \in \{0,1\}^n$, where $A = A_{\bar{G}} - I$ and $A_G$ is the adjacency matrix of the graph $G$.

$\bar{P}_3$: $\min \sum_{i=1}^n s_i$, s.t. $\sum_{j=1}^n a_{ij} x_j - s_i - y_i = 0$, $y_i - M(1 - x_i) \le 0$, $x_i \in \{0,1\}$, $s_i, y_i \ge 0$, where $M = \max_i \sum_{j=1}^n |a_{ij}| = \|A\|_\infty$.

Proposition 2: $P_3$ is equivalent to $\bar{P}_3$. If $x^*$ solves problems $P_3$ and $\bar{P}_3$, then the set $C$ defined by $C = t(x^*)$ is a maximum clique of graph $G$ with $|C| = -f_G(x^*)$.

Proof: It has been shown in [9], [7] that $P_3$ has an optimal solution $x^0$ iff there exist $y^0, s^0$ such that $(x^0, y^0, s^0)$ is an optimal solution to $\bar{P}_3$.
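As an illustration of how GRAPH-II might be formed and searched (our sketch, not the compact RLT formulation of [7]), the following uses networkx to threshold a hypothetical pre-seizure T-index matrix and enumerate maximal cliques; the threshold value and the random matrix are assumptions.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(3)
n, t_crit = 10, 2.66                      # assumed T-test critical value
P = rng.uniform(0.0, 6.0, (n, n))
T_pre = (P + P.T) / 2.0                   # hypothetical pre-seizure T-index matrix

G2 = nx.Graph()                           # GRAPH-II: drop "diverged" pairs
G2.add_nodes_from(range(n))
for i in range(n):
    for j in range(i + 1, n):
        if T_pre[i, j] <= t_crit:         # keep edges below the confidence level
            G2.add_edge(i, j)

largest = max(nx.find_cliques(G2), key=len)   # maximal cliques of GRAPH-II
print(sorted(largest))
```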
IV. Clustering Techniques
The neurons in the cerebral cortex maintain thousands
of input and output connections with other groups of neurons, which form a dense network of connectivity spanning
the entire thalamocortical system. Despite this massive
connectivity, cortical networks are exceedingly sparse, with
respect to the number of connections present out of all
possible connections. This indicates that brain networks are
not random, but form highly specific patterns. Networks in
the brain can be analyzed at multiple levels of scale. Novel
clustering techniques are herein proposed to construct the
temporal and spatial mechanistic basis of the epileptogenic
models based on the brain dynamics of EEGs and capture
the patterns or hierarchical structure of the brain connectivity from statistical dependence among brain areas. The
proposed hierarchical clustering techniques, which do not
require a priori knowledge of the data (number of clusters),
include Clustering via Concave Quadratic Programming
and Clustering via MIP with Quadratic Constraint.
,&'0:RUNVKRS2SWLPL]DWLRQEDVHG'DWD0LQLQJ7HFKQLTXHVZLWK$SSOLFDWLRQV
A. Clustering via Concave Quadratic Programming (CCQP)

CCQP is a novel clustering mathematical model used to formulate a clustering problem as a QIP problem [9]. Given $n$ data points to be clustered, we can formulate a clustering problem as follows: $\min_x f(x) = x^T (A - \lambda I) x$, s.t. $x \in \{0,1\}^n$, where $A$ is an $n \times n$ Euclidean matrix of pairwise distances, $I$ is an identity matrix, $\lambda$ is a parameter adjusting the degree of similarity within a cluster, and $x_i$ is a 0-1 decision variable indicating whether or not point $i$ is selected to be in the cluster. Note that $\lambda I$ is an offset added to the objective function to avoid the optimal solution in which all $x_i$ are zero, which would otherwise occur because every entry $a_{ij}$ of the Euclidean matrix $A$ is positive and the diagonal is zero. Although this clustering problem is formulated as a large QIP problem, in instances where $\lambda$ is large enough to make the quadratic function concave, this problem can be converted to a continuous problem (minimizing a concave quadratic function over a sphere) [9]. The reduction to a continuous problem is the main advantage of CCQP. This property holds because a concave function $f: S \to \mathbb{R}$ over a compact convex set $S \subset \mathbb{R}^n$ attains its global minimum at one of the extreme points of $S$ [15]. Two equivalent forms of the CCQP problem are given by:

$P_4$: $\min_x f(x) = x^T A x$, s.t. $x \in \{0,1\}^n$, where $A$ is an $n \times n$ Euclidean matrix.

$\bar{P}_4$: $\min_x \bar{f}(x) = x^T \bar{A} x$, s.t. $0 \le x \le e$, where $\bar{A} = A + \lambda I$, $\lambda$ is any real number, and $I$ is a diagonal matrix.

Proposition 3: $P_4$ is equivalent to $\bar{P}_4$.

Proof: We demonstrate that $P_4$ has an optimal solution $x^0$ iff $x^0$ is an optimal solution to $\bar{P}_4$ as follows. If we choose $\lambda$ such that $\bar{A} = A + \lambda I$ becomes a negative semidefinite matrix (e.g., $\lambda = -\mu$, where $\mu$ is the largest eigenvalue of $A$), then the objective function $\bar{f}(x)$ becomes concave and the constraints can be replaced by $0 \le x \le e$. Thus, the discrete problem $P_4$ is equivalent to the continuous problem $\bar{P}_4$ [9].

One of the advantages of CCQP is the ability to systematically determine the optimal number of clusters. Although CCQP has to solve $m$ clustering problems iteratively (where $m$ is the final number of clusters at the termination of the CCQP algorithm), it is efficient enough to solve large-scale clustering problems because only one continuous problem is solved in each iteration. After each iteration, the problem size becomes significantly smaller [9]. Figure 1 presents the procedure of CCQP.

Fig. 1. Procedure of CCQP algorithm:
  CCQP
  Input: All n unassigned data points in set S
  Output: The number of clusters and the cluster assignment for all n data points
  WHILE S ≠ ∅ DO
    - Construct a Euclidean matrix A from pairwise distances of the data points in S
    - Solve CCQP in problem P̄4
    - IF optimal solution x_i = 1 THEN remove point i from set S
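A minimal sketch of the Fig. 1 loop, under stated assumptions: the inner problem $\min x^T(A+\lambda I)x$ is solved here by brute-force enumeration over $\{0,1\}^n$ (viable only for a handful of points), whereas the paper solves the continuous concave relaxation $\bar{P}_4$; the toy coordinates are invented.

```python
import itertools
import numpy as np

def ccqp_clusters(points):
    """Peel off one cluster per iteration, as in Fig. 1."""
    S = list(range(len(points)))
    clusters = []
    while S:
        P = points[S]
        A = np.linalg.norm(P[:, None] - P[None, :], axis=2)   # Euclidean matrix
        lam = -np.linalg.eigvalsh(A).max()    # shift makes A + lam*I negative semidefinite
        A_bar = A + lam * np.eye(len(S))
        best_x, best_val = None, np.inf
        for bits in itertools.product([0, 1], repeat=len(S)):
            x = np.array(bits)
            if x.sum() == 0:                  # the offset lam*I is meant to rule this out
                continue
            val = x @ A_bar @ x
            if val < best_val:
                best_x, best_val = x, val
        chosen = [S[i] for i in range(len(S)) if best_x[i] == 1]
        clusters.append(chosen)
        S = [i for i in S if i not in chosen]
    return clusters

pts = np.array([[0, 0], [0.1, 0.2], [5, 5], [5.2, 4.9], [9, 0]], dtype=float)
print(ccqp_clusters(pts))                     # e.g. [[0, 1], [2, 3], [4]]
```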
B. Clustering via MIP with Quadratic Constraint (CMIPQC)

CMIPQC is a novel clustering mathematical model in which a clustering problem is formulated as a mixed-integer programming problem with a quadratic constraint [9]. The goal of CMIPQC is to maximize the number of data points in a cluster such that the similarity degrees among the data points in the cluster are less than a pre-determined parameter $\alpha$. This technique can be incorporated with hierarchical clustering methods as follows: (a) Initialization: assign all data points to one cluster; (b) Partition: use CMIPQC to divide the big cluster into smaller clusters; (c) Repetition: repeat the partition process until the stopping criterion is reached or a cluster contains a single point. The novel mathematical formulation for CMIPQC is given by:
$$\max_x \sum_{i=1}^n x_i, \quad \text{s.t. } x^T C x \le \alpha,\quad x \in \{0,1\}^n,$$
where $n$ is the number of data points to be clustered, $C$ is an $n \times n$ Euclidean matrix of pairwise distances, $\alpha$ is a predetermined parameter of the similarity degree within each cluster, and $x_i$ is a 0-1 decision variable indicating whether or not point $i$ is selected to be in the cluster. The objective of this model is to maximize the number of data points in a cluster such that the average pairwise distance among those points is less than $\alpha$. The difficulty of this problem comes from the quadratic constraint; however, this quadratic constraint can be efficiently linearized by the RLT described in [7]. The CMIPQC problem is then much easier to solve, as it can be reduced to an equivalent MIP problem. Similar to CCQP, the CMIPQC algorithm has the ability to systematically determine the optimal number of clusters and only needs to solve $m$ MIP problems (see Figure 2 for the CMIPQC algorithm). Two equivalent forms of CMIPQC are given by:

$P_5$: $\max_x f(x) = \sum_{i=1}^n x_i$, s.t. $x^T C x \le \alpha$, $x \in \{0,1\}^n$.

$\bar{P}_5$: $\max_{x,z} \bar{f}(x,z) = \sum_{i=1}^n x_i$, s.t. $Cx - z \ge 0$, $e^T z \ge \alpha$, $z \le M x$, $x \in \{0,1\}^n$, $z_i \ge 0$, where $M = \|C\|_\infty$.

Proposition 4: $P_5$ is equivalent to $\bar{P}_5$.

Proof: The proof that $P_5$ has an optimal solution $x^0$ iff there exists $z^0$ such that $(x^0, z^0)$ is an optimal solution to $\bar{P}_5$ is very similar to the one in [9], [7], since $P_5$ is a special case of $P_2$.

Fig. 2. Procedure of CMIPQC algorithm:
  CMIPQC
  Input: All n unassigned data points in set S
  Output: The number of clusters and the cluster assignment for all n data points
  WHILE S ≠ ∅ DO
    - Construct a Euclidean matrix A from pairwise distances of the data points in S
    - Solve CMIPQC in problem P̄5
    - IF optimal solution x_i = 1 THEN remove point i from set S
V. Materials and Methods

The data used in our studies consist of continuous intracranial EEGs from 3 patients with temporal lobe epilepsy. FSQIP was previously used to demonstrate the predictability of epileptic seizures [4]. In this research, we extend our previous findings of seizure predictability by using FSMQIP to select the critical cortical sites. The FSMQIP problem is formulated as an MQIP problem whose objective function minimizes the average T-index (a measure of statistical distance between the mean values of $STL_{max}$) among electrode sites, with the knapsack constraint identifying the number of critical cortical sites [18] and an additional quadratic constraint ensuring that the optimal group of critical sites shows the divergence in $STL_{max}$ profiles after a seizure. The experiment in this study tests the hypothesis that FSMQIP can be used to select critical features (electrodes) that are most likely to manifest pre-cursor patterns prior to a seizure. The results of this study will demonstrate that if one can select critical electrodes that will manifest seizure pre-cursors, it may be possible to predict a seizure in time to warn of an impending seizure [6]. To test this hypothesis, we designed an experiment to compare the probability of detecting seizure pre-cursor patterns from critical electrodes selected by FSMQIP with that from randomly selected electrodes. In this experiment, testing on 3 patients with 20 seizures, we randomly selected 5,000 groups of electrodes and used FSMQIP to select the critical electrodes. The experiment in this study is conducted in the following steps:
1) The estimation of $STL_{max}$ profiles [2], [19], [23], [30], [31] is used to measure the degree of order or disorder (chaos) of the EEG signals.
2) FSMQIP selects the critical electrodes based upon the behavior of $STL_{max}$ profiles before and after each preceding seizure.
3) A seizure pre-cursor is detected when the brain dynamics from the critical electrodes manifest a pattern of transitional convergence in the similarity degree of chaos. This pattern can be viewed as a synchronization of the brain dynamics from the critical electrodes.

VI. Results

The results show that the probability of detecting seizure pre-cursor patterns from the critical electrodes selected by FSMQIP is approximately 83%, which is significantly better than that from randomly selected electrodes (p-value < 0.07). The histogram of the probability of detecting seizure pre-cursor patterns from randomly selected electrodes versus that from the critical electrodes is illustrated in Figure 3. The results of this study can be used as a criterion to pre-select the critical electrode sites that can be used to predict epileptic seizures.

Fig. 3. Histogram of Seizure Prediction Sensitivities based on Randomly Selected Electrodes versus Electrodes Selected by the Proposed Feature Selection Technique
VII. Conclusions
This paper proposes a theoretical foundation of optimization techniques for feature selection and clustering with an application in epilepsy research. Empirical investigations of the proposed feature selection techniques demonstrate their effectiveness and their utility in selecting the critical brain areas associated with the epileptogenic process. Thus, advances in feature selection and clustering techniques will result in the future development of a novel DM paradigm to predict impending seizures from multichannel EEG recordings. Prediction is possible because, for the vast majority of seizures, the spatio-temporal dynamical features of seizure pre-cursors are sufficiently similar to those of the preceding seizure. Mathematical formulations for novel clustering techniques are also proposed in this paper. These techniques are theoretically fast and scalable. The results from this preliminary research suggest that empirical studies of the proposed clustering techniques should be investigated in future research.
References
[1] W. Adams and H. Sherali, “Linearization strategies for a class
of zero-one mixed integer programming problems,” Operations
Research, vol. 38, pp. 217–226, 1990.
[2] A. Babloyantz and A. Destexhe, “Low dimensional chaos in an
instance of epilepsy,” Proc. Natl. Acad. Sci. USA, vol. 83, pp. 3513–
3517, 1986.
[3] P. Bradley, O. Mangasarian, and W. Street, “Clustering via concave minimization,” in Advances in Neural Information Processing
Systems, M. Mozer, M. Jordan, and T. Petsche, Eds. MIT Press,
1997.
[4] W. Chaovalitwongse, “Optimization and dynamical approaches in
nonlinear time series analysis with applications in bioengineering,”
Ph.D. dissertation, University of Florida, 2003.
[5] W. Chaovalitwongse, L. Iasemidis, P. Pardalos, P. Carney, D.S. Shiau, and J. Sackellares, “Performance of a seizure warning
algorithm based on the dynamics of intracranial EEG,” Epilepsy
Research, vol. 64, pp. 93–133, 2005.
[6] W. Chaovalitwongse, P. Pardalos, L. Iasemidis, D.-S. Shiau, and
J. Sackellares, “Applications of global optimization and dynamical
systems to prediction of epileptic seizures,” in Quantitative Neuroscience, P. Pardalos, J. Sackellares, L. Iasemidis, and P. Carney,
Eds. Kluwer, 2003, pp. 1–36.
[7] W. Chaovalitwongse, P. Pardalos, and O. Prokopyev, “Reduction of
multi-quadratic 0–1 programming problems to linear mixed 0–1
programming problems,” Operations Research Letters, vol. 32(6),
pp. 517–522, 2004.
[8] W. Chaovalitwongse, O. Prokopyev, and P. Pardalos, “Electroencephalogram (EEG) time series classification: Applications in
epilepsy,” Annals of Operations Research, to appear, 2005.
[9] W. A. Chaovalitwongse, “A robust clustering technique via
quadratic programming,” Department of Industrial and Systems
Engineering, Rutgers University, Tech. Rep., 2005.
[10] W. A. Chaovalitwongse, P. Pardalos, L. Iasemidis, D.-S. Shiau, and
J. Sackellares, “Dynamical approaches and multi-quadratic integer
programming for seizure prediction,” Optimization Methods and
Software, vol. 20(2–3), pp. 383–394, 2005.
[11] W. Chaovalitwongse, P. Pardalos, L. Iasemidis, J. Sackellares, and
D.-S. Shiau, “Optimization of spatio-temporal pattern processing
for seizure warning and prediction,” U.S. Patent application filed
August 2004, Attorney Docket No. 028724–150, 2004.
[12] C. Cherniak, Z. Mokhtarzada, and U. Nodelman, “Optimal-wiring
models of neuroanatomy,” in Computational Neuroanatomy, G. A.
Ascoli, Ed. Humana Press, 2002.
[13] C. Elger and K. Lehnertz, “Seizure prediction by non-linear time
series analysis of brain electrical activity,” European Journal of
Neuroscience, vol. 10, pp. 786–789, 1998.
[14] F. Glover, “Improved linear integer programming formulations of
nonlinear integer programs,” Management Science, vol. 22, pp. 455–
460, 1975.
,&'0:RUNVKRS2SWLPL]DWLRQEDVHG'DWD0LQLQJ7HFKQLTXHVZLWK$SSOLFDWLRQV
Fuzzy Support Vector Classification Based on Possibility Theory*
Zhimin Yang¹, Yingjie Tian², Naiyang Deng³**
¹ College of Economics & Management, China Agriculture University, 100083, Beijing, China
² Chinese Academy of Sciences Research Center on Data Technology & Knowledge Economy, 100080, Beijing, China
³ College of Science, China Agriculture University, 100083, Beijing, China
Abstract

This paper is concerned with fuzzy support vector classification in which the type of both the output of a training point and the value of the final fuzzy classification function is a triangle fuzzy number. First, the fuzzy classification problem is formulated as a fuzzy chance constrained program. Then we transform this program into its equivalent quadratic program. As a result, we propose a fuzzy support vector classification algorithm. In order to show the rationality of the algorithm, an example is presented.

Keywords: machine learning, fuzzy support vector classification, possibility measure, triangle fuzzy number
1. INTRODUCTION

Support vector machines (SVMs), proposed by Vapnik, are a powerful tool for machine learning (Vapnik 1995, Vapnik 1998, Cristianini 2000, Mangasarian 1999, Deng 2004), and constitute one of the most interesting topics in this field. Lin and Wang (Lin, 2002) investigated a classification problem with fuzzy information, where the training set is $S = \{(x_1, \tilde{y}_1), \ldots, (x_l, \tilde{y}_l)\}$ and each output $\tilde{y}_j$ ($j = 1, \ldots, l$) is a fuzzy number. This paper studies this problem in a different way. We formulate it as a fuzzy chance constrained program. Then we transform this program into its equivalent quadratic program. Assume that the training points contain complete fuzzy information, i.e., the sum of the positive membership degree and the negative membership degree of each output is 1. We propose a fuzzy support vector classification algorithm. Given an arbitrary test input, its corresponding output obtained by the algorithm is a triangle fuzzy number.

2. FUZZY SUPPORT VECTOR CLASSIFICATION MACHINE

As an extension of the positive symbol 1 and the negative symbol -1, we introduce triangle fuzzy numbers and define the corresponding outputs by them. For an input of a training point which belongs to the positive class with membership degree $\delta$ ($0.5 \le \delta \le 1$), the triangle fuzzy number is
$$\tilde{y} = (r_1, r_2, r_3) = \left( \frac{2\delta^2 + \delta - 2}{\delta},\; 2\delta - 1,\; \frac{2\delta^2 - 3\delta + 2}{\delta} \right), \quad 0.5 \le \delta \le 1. \tag{1}$$
Similarly, for an input of a training point which belongs to the negative class with membership degree $\delta$ ($0.5 \le \delta \le 1$), the triangle fuzzy number is
$$\tilde{y} = (r_1, r_2, r_3) = \left( \frac{-2\delta^2 + 3\delta - 2}{\delta},\; 1 - 2\delta,\; \frac{-2\delta^2 - \delta + 2}{\delta} \right), \quad 0.5 \le \delta \le 1. \tag{2}$$
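A small Python sketch (ours, not the authors') implementing the maps (1) and (2); note that the negative-class number in (2) is the mirror image $(-r_3, -r_2, -r_1)$ of the positive-class number in (1) for the same $\delta$.

```python
def triangle_output(delta, positive=True):
    """Map a membership degree delta in [0.5, 1] to the triangle fuzzy number
    of (1) (positive class) or (2) (negative class)."""
    assert 0.5 <= delta <= 1.0
    r1 = (2 * delta**2 + delta - 2) / delta
    r2 = 2 * delta - 1
    r3 = (2 * delta**2 - 3 * delta + 2) / delta
    return (r1, r2, r3) if positive else (-r3, -r2, -r1)

print(triangle_output(1.0))                   # (1.0, 1.0, 1.0): a crisp positive label
print(triangle_output(0.8))                   # (0.1, 0.6, 1.1)
print(triangle_output(0.51, positive=False))  # about (-1.94, -0.02, 1.90)
```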
Thus we use $(x, \tilde{y})$ to express a training point, where $\tilde{y}$ is a triangle fuzzy number (1) or (2). We can also use $(x, \delta)$ to express a training point, where $\delta$ is the signed membership degree, taking values in $[0.5, 1]$ for a fuzzy positive point and in $[-1, -0.5]$ for a fuzzy negative point. (3)

The given training set for classification is
$$S = \{(x_1, \tilde{y}_1), \ldots, (x_l, \tilde{y}_l)\}, \tag{4}$$
where $x_j \in R^n$ is a usual input and $\tilde{y}_j$ ($j = 1, \ldots, l$) is a triangle fuzzy number (1) or (2). According to (1), (2) and (3), the training set (4) can take another form
$$S_\delta = \{(x_1, \delta_1), \ldots, (x_l, \delta_l)\}, \tag{5}$$
where $x_j$ is the same as in (4), while $\delta_j$ is as in (3), $j = 1, \ldots, l$.

Definition 1: $(x_j, \tilde{y}_j)$ in (4) and $(x_j, \delta_j)$ in (5) are called fuzzy training points, $j = 1, \ldots, l$, and $S$ and $S_\delta$ are called fuzzy training sets.

Definition 2: A fuzzy training point $(x_j, \tilde{y}_j)$ or $(x_j, \delta_j)$ is called a fuzzy positive point if it corresponds to (1); similarly, a fuzzy training point $(x_j, \tilde{y}_j)$ or $(x_j, \delta_j)$ is called a fuzzy negative point if it corresponds to (2).

Note: In this paper, the case of either $\delta_j = 0.5$ or $\delta_j = -0.5$ is omitted, because the corresponding triangle fuzzy number $\tilde{y}_j = (-2, 0, 2)$ cannot provide any information.

We rearrange the fuzzy training points in the fuzzy training set (4) or (5), so that the new fuzzy training set
$$S = \{(x_1, \tilde{y}_1), \ldots, (x_p, \tilde{y}_p), (x_{p+1}, \tilde{y}_{p+1}), \ldots, (x_l, \tilde{y}_l)\} \tag{6}$$
or
$$S_\delta = \{(x_1, \delta_1), \ldots, (x_p, \delta_p), (x_{p+1}, \delta_{p+1}), \ldots, (x_l, \delta_l)\} \tag{7}$$
has the following property: $(x_t, \tilde{y}_t)$ and $(x_t, \delta_t)$ are fuzzy positive points ($t = 1, \ldots, p$), and $(x_i, \tilde{y}_i)$ and $(x_i, \delta_i)$ are fuzzy negative points ($i = p+1, \ldots, l$).

Definition 3: Suppose a fuzzy training set (6), or equivalently (7), and a confidence level $\lambda$ ($0 < \lambda \le 1$) are given. If there exist $w \in R^n$ and $b \in R$ such that
$$Pos\{\tilde{y}_j((w \cdot x_j) + b) \ge 1\} \ge \lambda, \quad j = 1, \ldots, l, \tag{8}$$
then the fuzzy training set (6) or (7) is fuzzy linearly separable; moreover, the corresponding fuzzy classification problem is fuzzy linearly separable.

Note: 1° Fuzzy linear separability can be understood, roughly speaking, as meaning that the inputs of fuzzy positive points and fuzzy negative points can be separated at least with possibility degree $\lambda$ ($0 < \lambda \le 1$).

2° Fuzzy linear separability is a generalization of linear separability of a usual training set. In fact, if $\delta_t = 1$ ($t = 1, \ldots, p$) and $\delta_i = -1$ ($i = p+1, \ldots, l$) in training set (7), the fuzzy training set degenerates to a usual training set, so fuzzy linear separability of the fuzzy training set degenerates to linear separability of the usual training set. Suppose $\delta_t < 1$ ($t = 1, \ldots, p$) or $\delta_i > -1$ ($i = p+1, \ldots, l$); then it is possible that, on one hand, $x_1, \ldots, x_p$ and $x_{p+1}, \ldots, x_l$ are not linearly separable in the usual meaning while, on the other hand, they are fuzzy linearly separable. For example, consider the case shown in the following figure.
[Figure: three training inputs x1, x2 and x3 in the plane, labeled δ2 = 1 (y2 = 1), δ3 = -1 (y3 = -1), and fuzzy δ1.]
Suppose there are three fuzzy training points $(x_1, y_1)$, $(x_2, y_2)$ and $(x_3, y_3)$. The fuzzy training points $(x_2, y_2)$ and $(x_3, y_3)$ are certain, with $\delta_2 = 1$ ($y_2 = 1$) and $\delta_3 = -1$ ($y_3 = -1$). The first fuzzy training point $(x_1, y_1)$ is fuzzy, with two possible negative membership degrees $\delta_1 = -0.51$ and $\delta_1 = -0.6$.

(i) $\delta_1 = -0.51$. According to (2), the triangle fuzzy number of $(x_1, \delta_1)$ is $\tilde{y}_1 = (-1.94, -0.02, 1.9)$, so the fuzzy training set is $S = \{(x_1, \tilde{y}_1), (x_2, y_2), (x_3, y_3)\}$. Suppose $\lambda = 0.7$ and the classification hyperplane is $x = 0$; then $(w \cdot x_1) + b = -2$, so $Pos\{\tilde{y}_1((w \cdot x_1) + b) \ge 1\} = 0.722 > 0.7$, and moreover $Pos\{y_2((w \cdot x_2) + b) \ge 1\} = 1 > 0.7$ and $Pos\{\tilde{y}_3((w \cdot x_3) + b) \ge 1\} = 1 > 0.7$. Therefore the fuzzy training set $S$ is fuzzy linearly separable at the confidence level $\lambda = 0.7$.

(ii) $\delta_1 = -0.6$. According to (2), the triangle fuzzy number of $(x_1, \delta_1)$ is $\tilde{y}_1 = (-1.53, -0.2, 1.13)$, so the fuzzy training set is $S = \{(x_1, \tilde{y}_1), (x_2, y_2), (x_3, y_3)\}$. Suppose $\lambda = 0.47$ and the classification hyperplane is $x = 0$; then $(w \cdot x_1) + b = -2$, so $Pos\{\tilde{y}_1((w \cdot x_1) + b) \ge 1\} = 0.47 \ge 0.47$, and moreover $Pos\{y_2((w \cdot x_2) + b) \ge 1\} = 1 > 0.47$ and $Pos\{\tilde{y}_3((w \cdot x_3) + b) \ge 1\} = 1 > 0.47$. Therefore the fuzzy training set $S$ is fuzzy linearly separable at the confidence level $\lambda = 0.47$. Supposing $\lambda = 0.72$, however, we will find no classification hyperplane such that
$$Pos\{\tilde{y}_1((w \cdot x_1) + b) \ge 1\} \ge 0.72, \tag{9}$$
so the fuzzy training set $S$ is not fuzzy linearly separable at the confidence level $\lambda = 0.72$.

Generally speaking, a possibility measure inequality on a fuzzy event can be equivalently transformed into real inequalities, as shown in (10).

Theorem 1: (8) in Definition 3 is equivalent to the real inequalities
$$\begin{cases} ((1-\lambda)r_{t3} + \lambda r_{t2})((w \cdot x_t) + b) \ge 1, & t = 1, \ldots, p \\ ((1-\lambda)r_{i1} + \lambda r_{i2})((w \cdot x_i) + b) \ge 1, & i = p+1, \ldots, l. \end{cases} \tag{10}$$
Proof: $\tilde{y}_j = (r_{j1}, r_{j2}, r_{j3})$ is a triangle fuzzy number, so $1 - \tilde{y}_j((w \cdot x_j) + b)$ is also a triangle fuzzy number by the triangle fuzzy number operation rules. More concretely, if $(w \cdot x_t) + b > 0$, then
$$1 - \tilde{y}_t((w \cdot x_t) + b) = \big(1 - r_{t3}((w \cdot x_t) + b),\; 1 - r_{t2}((w \cdot x_t) + b),\; 1 - r_{t1}((w \cdot x_t) + b)\big), \quad t = 1, \ldots, p.$$
For a triangle fuzzy number $\tilde{a} = (r_1, r_2, r_3)$ and an arbitrary given confidence level $\lambda$ ($0 < \lambda \le 1$), we have
$$Pos\{\tilde{a} \le 0\} \ge \lambda \iff (1-\lambda)r_1 + \lambda r_2 \le 0.$$
Therefore, if $(w \cdot x_t) + b > 0$, then
$$Pos\{1 - \tilde{y}_t((w \cdot x_t) + b) \le 0\} \ge \lambda \iff (1-\lambda)\big(1 - r_{t3}((w \cdot x_t) + b)\big) + \lambda\big(1 - r_{t2}((w \cdot x_t) + b)\big) \le 0, \quad t = 1, \ldots, p,$$
or
$$Pos\{\tilde{y}_t((w \cdot x_t) + b) \ge 1\} \ge \lambda \iff ((1-\lambda)r_{t3} + \lambda r_{t2})((w \cdot x_t) + b) \ge 1, \quad t = 1, \ldots, p.$$
Similarly, if $(w \cdot x_i) + b < 0$, then
$$Pos\{\tilde{y}_i((w \cdot x_i) + b) \ge 1\} \ge \lambda \iff ((1-\lambda)r_{i1} + \lambda r_{i2})((w \cdot x_i) + b) \ge 1, \quad i = p+1, \ldots, l.$$
Therefore (8) in Definition 3 is equivalent to (10).

In (10), suppose
$$k_t = \frac{1}{(1-\lambda)r_{t3} + \lambda r_{t2}}, \quad t = 1, \ldots, p; \qquad l_i = \frac{1}{(1-\lambda)r_{i1} + \lambda r_{i2}}, \quad i = p+1, \ldots, l; \tag{11}$$
then (10) can be rewritten as
$$(w \cdot x_t) + b \ge k_t, \quad t = 1, \ldots, p; \qquad (w \cdot x_i) + b \le l_i, \quad i = p+1, \ldots, l.$$

Definition 4: For a fuzzy linearly separable problem with fuzzy training set (6) or (7), the two parallel hyperplanes $(w \cdot x) + b = k$ and $(w \cdot x) + b = l$ are support hyperplanes for the fuzzy training set (6) or (7) if
$$(w \cdot x_t) + b \ge k_t,\; t = 1, \ldots, p, \quad \min_{t=1,\ldots,p}\{(w \cdot x_t) + b\} = k;$$
$$(w \cdot x_i) + b \le l_i,\; i = p+1, \ldots, l, \quad \max_{i=p+1,\ldots,l}\{(w \cdot x_i) + b\} = l,$$
where $k_t$ ($t = 1, \ldots, p$) and $l_i$ ($i = p+1, \ldots, l$) are the same as in (11), $k = \min_{t=1,\ldots,p}\{k_t\}$ and $l = \max_{i=p+1,\ldots,l}\{l_i\}$.

The distance between the two support hyperplanes $(w \cdot x) + b = k$ and $(w \cdot x) + b = l$ is
$$\frac{|k - l|}{\|w\|},$$
and we call this distance the margin ($k > 0$ and $l < 0$ are constants). Following the essential idea of support vector machines, our goal is to maximize the margin. At the confidence level $\lambda$ ($0 < \lambda \le 1$), the fuzzy linearly separable problem with fuzzy training set (6) or (7) can be transformed into a fuzzy chance constrained program with decision variable $(w, b)^T$:
$$\min_{w,b} \frac{1}{2}\|w\|^2, \quad \text{s.t. } Pos\{\tilde{y}_j((w \cdot x_j) + b) \ge 1\} \ge \lambda, \quad j = 1, \ldots, l, \tag{12}$$
where $Pos\{\cdot\}$ is the possibility measure of the fuzzy event $\{\cdot\}$.

Theorem 2: At the confidence level $\lambda$ ($0 < \lambda \le 1$), the certain equivalence program (the usual program equivalent to (12)) of the fuzzy chance constrained program (12) is the quadratic program
$$\min_{w,b} \frac{1}{2}\|w\|^2, \quad \text{s.t. } ((1-\lambda)r_{t3} + \lambda r_{t2})((w \cdot x_t) + b) \ge 1,\; t = 1, \ldots, p; \quad ((1-\lambda)r_{i1} + \lambda r_{i2})((w \cdot x_i) + b) \ge 1,\; i = p+1, \ldots, l. \tag{13}$$

Proof: The result follows directly from Theorem 1.

Theorem 3: There exists an optimal solution of the quadratic program (13).

Proof: omitted (see Deng 2004).

We now solve the dual program of the quadratic program (13).

Theorem 4: The dual program of the quadratic program (13) is the quadratic program with decision variable $(\beta, \alpha)^T$:
$$\min_{\beta,\alpha} \frac{1}{2}(A + 2B + C) - \Big(\sum_{t=1}^{p} \beta_t + \sum_{i=p+1}^{l} \alpha_i\Big), \quad \text{s.t. } \sum_{t=1}^{p} \beta_t((1-\lambda)r_{t3} + \lambda r_{t2}) + \sum_{i=p+1}^{l} \alpha_i((1-\lambda)r_{i1} + \lambda r_{i2}) = 0, \quad \beta_t \ge 0,\; t = 1, \ldots, p, \quad \alpha_i \ge 0,\; i = p+1, \ldots, l, \tag{14}$$
where
$$A = \sum_{t=1}^{p}\sum_{s=1}^{p} \beta_t \beta_s ((1-\lambda)r_{t3} + \lambda r_{t2})((1-\lambda)r_{s3} + \lambda r_{s2})(x_t \cdot x_s),$$
$$B = \sum_{t=1}^{p}\sum_{i=p+1}^{l} \beta_t \alpha_i ((1-\lambda)r_{t3} + \lambda r_{t2})((1-\lambda)r_{i1} + \lambda r_{i2})(x_t \cdot x_i),$$
$$C = \sum_{i=p+1}^{l}\sum_{q=p+1}^{l} \alpha_i \alpha_q ((1-\lambda)r_{i1} + \lambda r_{i2})((1-\lambda)r_{q1} + \lambda r_{q2})(x_i \cdot x_q),$$
$\beta = (\beta_1, \ldots, \beta_p)^T \in R^p$, $\alpha = (\alpha_{p+1}, \ldots, \alpha_l)^T \in R^{l-p}$, and $(\beta, \alpha)^T$ is the decision variable.

Proof: omitted (see Deng 2004).

Program (14) is convex. After getting its optimal solution $(\beta^*, \alpha^*)^T = (\beta_1^*, \ldots, \beta_p^*, \alpha_{p+1}^*, \ldots, \alpha_l^*)^T$, we find an optimal solution $(w^*, b^*)^T$ of the fuzzy coefficient program (12):
$$w^* = \sum_{t=1}^{p} \beta_t^*((1-\lambda)r_{t3} + \lambda r_{t2})x_t + \sum_{i=p+1}^{l} \alpha_i^*((1-\lambda)r_{i1} + \lambda r_{i2})x_i,$$
$$b^* = \frac{1}{(1-\lambda)r_{s3} + \lambda r_{s2}} - \sum_{t=1}^{p} \beta_t^*((1-\lambda)r_{t3} + \lambda r_{t2})(x_t \cdot x_s) - \sum_{i=p+1}^{l} \alpha_i^*((1-\lambda)r_{i1} + \lambda r_{i2})(x_i \cdot x_s), \quad s \in \{s \mid \beta_s^* > 0\},$$
or
$$b^* = \frac{1}{(1-\lambda)r_{q1} + \lambda r_{q2}} - \sum_{t=1}^{p} \beta_t^*((1-\lambda)r_{t3} + \lambda r_{t2})(x_t \cdot x_q) - \sum_{i=p+1}^{l} \alpha_i^*((1-\lambda)r_{i1} + \lambda r_{i2})(x_i \cdot x_q), \quad q \in \{q \mid \alpha_q^* > 0\}.$$

So we can get the certain optimal classification hyperplane (see Deng 2004):
$$(w^* \cdot x) + b^* = 0, \quad x \in R^n. \tag{15}$$
Defining the function $g(x) = (w^* \cdot x) + b^*$, we construct the membership function
$$\delta = \delta(u) = \begin{cases} M^+(u), & 0 < u \le (M^+)^{-1}(1) \\ 1, & u > (M^+)^{-1}(1) \\ M^-(u), & (M^-)^{-1}(-1) \le u < 0 \\ -1, & u < (M^-)^{-1}(-1), \end{cases} \tag{16}$$
where $(M^+)^{-1}(u)$ and $(M^-)^{-1}(u)$ are respectively the inverse functions of $M^+(u)$ and $M^-(u)$. Both $M^+(u)$ and $M^-(u)$ are regression functions (monotone in $u$) obtained in the following way.

Computation of $M^+(u)$:
(i) Construct the regression training set
$$\{(g(x_1), \delta_1), \ldots, (g(x_p), \delta_p)\}. \tag{17}$$
(ii) Using (17) as the training set and selecting appropriate $\varepsilon > 0$ and $C > 0$, execute $\varepsilon$-support vector regression with a linear kernel.

Computation of $M^-(u)$:
(i) Construct the regression training set
$$\{(g(x_{p+1}), \delta_{p+1}), \ldots, (g(x_l), \delta_l)\}. \tag{18}$$
(ii) Using (18) as the training set and selecting the same $\varepsilon > 0$ and $C > 0$, execute $\varepsilon$-support vector regression with a linear kernel.

Note: The above construction has the following explanation. Consider an input $x'$. It seems natural that the larger $g(x')$ is, the larger the corresponding membership degree of being a fuzzy positive point is; and the smaller $g(x')$ is, the larger the corresponding membership degree of being a fuzzy negative point is. The regression functions $M^+(\cdot)$ and $M^-(\cdot)$ reflect exactly this idea.
The above discussion leads to the following algorithm.

Algorithm (fuzzy support vector classification)

(1) Given a fuzzy training set (6) or (7), select an appropriate confidence level $\lambda$ ($0 < \lambda \le 1$), a parameter $C > 0$ and a kernel function $K(x, x')$, then construct the quadratic program
$$\min_{\beta,\alpha} \frac{1}{2}(A_K + 2B_K + C_K) - \Big(\sum_{t=1}^{p} \beta_t + \sum_{i=p+1}^{l} \alpha_i\Big), \quad \text{s.t. } \sum_{t=1}^{p} \beta_t((1-\lambda)r_{t3} + \lambda r_{t2}) + \sum_{i=p+1}^{l} \alpha_i((1-\lambda)r_{i1} + \lambda r_{i2}) = 0, \quad 0 \le \beta_t \le C,\; t = 1, \ldots, p, \quad 0 \le \alpha_i \le C,\; i = p+1, \ldots, l, \tag{18$'$}$$
where
$$A_K = \sum_{t=1}^{p}\sum_{s=1}^{p} \beta_t \beta_s ((1-\lambda)r_{t3} + \lambda r_{t2})((1-\lambda)r_{s3} + \lambda r_{s2})K(x_t, x_s),$$
$$B_K = \sum_{t=1}^{p}\sum_{i=p+1}^{l} \beta_t \alpha_i ((1-\lambda)r_{t3} + \lambda r_{t2})((1-\lambda)r_{i1} + \lambda r_{i2})K(x_t, x_i),$$
$$C_K = \sum_{i=p+1}^{l}\sum_{q=p+1}^{l} \alpha_i \alpha_q ((1-\lambda)r_{i1} + \lambda r_{i2})((1-\lambda)r_{q1} + \lambda r_{q2})K(x_i, x_q),$$
$\beta = (\beta_1, \ldots, \beta_p)^T \in R^p$, $\alpha = (\alpha_{p+1}, \ldots, \alpha_l)^T \in R^{l-p}$, and $(\beta, \alpha)^T$ is the decision variable.

(2) Solve the quadratic program (18$'$), and get its optimal solution $(\beta^*, \alpha^*)^T = (\beta_1^*, \ldots, \beta_p^*, \alpha_{p+1}^*, \ldots, \alpha_l^*)^T$.

(3) Select $\beta_s^* \in (0, C)$ in $\beta^*$ or $\alpha_q^* \in (0, C)$ in $\alpha^*$, then compute
$$b^* = \frac{1}{(1-\lambda)r_{s3} + \lambda r_{s2}} - \sum_{t=1}^{p} \beta_t^*((1-\lambda)r_{t3} + \lambda r_{t2})K(x_t, x_s) - \sum_{i=p+1}^{l} \alpha_i^*((1-\lambda)r_{i1} + \lambda r_{i2})K(x_i, x_s)$$
or
$$b^* = \frac{1}{(1-\lambda)r_{q1} + \lambda r_{q2}} - \sum_{t=1}^{p} \beta_t^*((1-\lambda)r_{t3} + \lambda r_{t2})K(x_t, x_q) - \sum_{i=p+1}^{l} \alpha_i^*((1-\lambda)r_{i1} + \lambda r_{i2})K(x_i, x_q).$$

(4) Construct the function
$$g(x) = \sum_{t=1}^{p} \beta_t^*((1-\lambda)r_{t3} + \lambda r_{t2})K(x, x_t) + \sum_{i=p+1}^{l} \alpha_i^*((1-\lambda)r_{i1} + \lambda r_{i2})K(x, x_i) + b^*.$$

(5) Consider $\{(g(x_1), \delta_1), \ldots, (g(x_p), \delta_p)\}$ and $\{(g(x_{p+1}), \delta_{p+1}), \ldots, (g(x_l), \delta_l)\}$ as training sets respectively, and construct the regression functions $M^+(u)$ and $M^-(u)$ by $\varepsilon$-support vector regression with a linear kernel.

(6) According to (1), (2) and (3), transform the function $\delta = \delta(g(x))$ in (16) into a triangle fuzzy number $\tilde{y} = \tilde{y}(x)$; this gives the fuzzy optimal classification function.

Note: 1° If the outputs of all fuzzy training points in the fuzzy training set (6) or (7) are the real numbers 1 or -1, then the fuzzy training set degenerates to a normal training set, and the fuzzy support vector classification machine degenerates to the support vector classification machine.

2° The selection of the confidence level $\lambda$ ($0 < \lambda \le 1$) in the fuzzy support vector classification machine can be seen as a parameter selection problem, so we can use parameter selection methods such as the LOO error and LOO error bounds (Deng 2004).
parameter selection problem, so we can use
methods in parameter selection such as LOO
error and LOO error bound (Deng 2004).
G ( g ( x ))
3. Numerical Experiments
In order to show the rationality of our
algorithm, we give a simple example.
Suppose the fuzzy training set contains three fuzzy positive points and three fuzzy negative points. According to (6) and (7), this fuzzy training set can be expressed as

S = {(x_1, ỹ_1), …, (x_3, ỹ_3), (x_4, ỹ_4), …, (x_6, ỹ_6)},
S_δ = {(x_1, δ_1), …, (x_3, δ_3), (x_4, δ_4), …, (x_6, δ_6)},

where x_1 = (2, 2)^T, x_2 = (1.7, 2)^T, x_3 = (1.5, 1)^T, x_4 = (0, 0)^T, x_5 = (0.8, 0.5)^T, x_6 = (1, 0.5)^T; ỹ_1 = 1 = (1, 1, 1), ỹ_2 = 1 = (1, 1, 1), ỹ_3 = (0.1, 0.6, 1.1), ỹ_4 = −1 = (−1, −1, −1), ỹ_5 = −1 = (−1, −1, −1), ỹ_6 = (−1.1, −0.6, −0.1); and δ_1 = 1, δ_2 = 1, δ_3 = 0.8, δ_4 = −1, δ_5 = −1, δ_6 = −0.8.

Suppose the confidence level λ = 0.8, C = 10 and the kernel function K(x, x′) = x · x′. We use the Algorithm (fuzzy support vector classification) and get the function g(x) = 2[x]_1 + 2[x]_2 − 4.

We then establish the function δ = δ(g(x)). Taking S_1 = {(4, 1), (3.4, 1), (1, 0.8)} as a training set, and selecting ε = 0.1, C = 10 and the linear kernel, we construct a support vector regression and get the regression function φ⁺(u) = 0.08u + 0.72. Taking S_2 = {(−4, −1), (−1.4, −1), (−1, −0.8)} as a training set, with ε = 0.1, C = 10 and the linear kernel, we get the regression function φ⁻(u) = 0.07u − 0.73. So the membership function is

δ(g(x)) = 1,                      if g(x) > 3.50;
δ(g(x)) = 0.08 g(x) + 0.72,       if 0 ≤ g(x) ≤ 3.50;
δ(g(x)) = 0.07 g(x) − 0.73,       if −3.86 ≤ g(x) < 0;
δ(g(x)) = −1,                     if g(x) < −3.86.

Suppose the test points have inputs x_7 = (1, 2)^T and x_8 = (1, 0)^T. Through g(x) and δ(g(x)) we get g(x_7) = 2 > 0, g(x_8) = −2 < 0, δ(g(x_7)) = 0.88 and δ(g(x_8)) = −0.87. According to (1), (2) and (3), we can get ỹ_7 = (0.49, 0.76, 1.03) and ỹ_8 = (−0.9, −0.74, −0.44) (triangle fuzzy numbers).
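These reported values can be checked mechanically. A tiny sketch (not the authors' code) that reproduces g and the piecewise membership function above:

```python
# Reproduces the worked example: g(x) = 2*x1 + 2*x2 - 4 and delta(g(x)).
def g(x):
    return 2 * x[0] + 2 * x[1] - 4

def delta(u):
    if u >= 0:
        return min(1.0, 0.08 * u + 0.72)   # capped at 1 once u > 3.50
    return max(-1.0, 0.07 * u - 0.73)      # floored at -1 once u < -3.86

for x in [(1, 2), (1, 0)]:                 # the test points x7 and x8
    print(x, g(x), round(delta(g(x)), 2))  # -> 2, 0.88 and -2, -0.87
```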
In order to find the relationship and the difference between fuzzy support vector classification and support vector classification, we give three alternative outputs to the third fuzzy training point in the fuzzy training set S_δ; more concretely, δ_3 = 1, δ_3 = −0.8 and δ_3 = −1, while the output of the sixth fuzzy training point is taken as δ_6 = −1. The fuzzy training set S_δ therefore becomes three sets respectively:

S_δ1 = {(x_1, δ_1), …, (x_3, δ_3), (x_4, δ_4), …, (x_6, δ_6)} with x_3 = (1.5, 1)^T, δ_3 = 1 and x_6 = (1, 0.5)^T, δ_6 = −1; the inputs and outputs of the other fuzzy training points are the same as those in S_δ.

S_δ2 = {(x_1, δ_1), …, (x_3, δ_3), (x_4, δ_4), …, (x_6, δ_6)} with x_3 = (1.5, 1)^T, δ_3 = −0.8 and x_6 = (1, 0.5)^T, δ_6 = −1; the other fuzzy training points are the same as those in S_δ.

S_δ3 = {(x_1, δ_1), …, (x_3, δ_3), (x_4, δ_4), …, (x_6, δ_6)} with x_3 = (1.5, 1)^T, δ_3 = −1 and x_6 = (1, 0.5)^T, δ_6 = −1; the other fuzzy training points are the same as those in S_δ.

So we observe the change of the optimal classification hyperplanes as the output of the third fuzzy training point varies:

δ_3 = 1 → δ_3 = 0.8 → δ_3 = −0.8 → δ_3 = −1.   (19)

When all the outputs of the training points are 1 or −1, the fuzzy training set degenerates to a usual training set, as with S_δ1 and S_δ3; at the same time, fuzzy support vector classification degenerates to support vector classification. Suppose λ = 0.8, C = 10, and the kernel function K(x, x′) = x · x′. Using the algorithm (fuzzy support vector classification), we get the corresponding certain optimal classification hyperplanes:

L1: [x]_1 + [x]_2 = 2,
L2: [x]_1 + [x]_2 = 2.4,
L3: 0.385[x]_1 + 1.923[x]_2 = 1.4,
L4: 0.385[x]_1 + 1.923[x]_2 = 1.76,

as shown in the following figure.
[Figure: the certain optimal classification hyperplanes L1-L4 plotted over the training points, with the horizontal axis from 0 to 2 and the vertical axis from 0 to 2.5.]

(19) illuminates that the membership degree of the fuzzy training point x_3 changes: its negative membership degree gets bigger and its positive membership degree gets smaller. The movement of the corresponding certain optimal classification hyperplane is L1 → L2 → L3 → L4; thus it can be seen that the result agrees with intuitive judgment.

References

Cristianini, N. and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press.
Deng, N. Y. and Zhu, M. F. (1987), Optimal Methods, Education Press, Shenyang.
Deng, N. Y. and Tian, Y. J. (2004), The New Method in Data Mining, Science Press, Beijing.
Lin, C. F. and Wang, S. D. (2002), Fuzzy Support Vector Machines, IEEE Transactions on Neural Networks, (2).
Liu, B. D. (1998), Random Programming and Fuzzy Programming, Tsinghua University Press, Beijing.
Liu, B. et al. (1998), Chance Constrained Programming with Fuzzy Parameters, Fuzzy Sets and Systems, (2).
Mangasarian, O. L. (1999), Generalized Support Vector Machines, Advances in Large Margin Classifiers, MIT Press, Boston.
Vapnik, V. N. (1995), The Nature of Statistical Learning Theory, Springer-Verlag, New York.
Vapnik, V. N. (1998), Statistical Learning Theory, Wiley, New York.
Yuan, Y. X. and Sun, W. Y. (1997), Optimal Theories and Methods, Science Press, Beijing.
Zadeh, L. A. (1965), Fuzzy Sets, Information and Control.
Zadeh, L. A. (1978), Fuzzy Sets as a Basis for a Theory of Possibility, Fuzzy Sets and Systems.
Zhang, W. X. (1995), Foundation of Fuzzy Mathematics, Xi'an Jiaotong University Press, Xi'an.
DEA-based Classification for Finding
Performance Improvement Direction
Shingo Aoki, Member, IEEE, Yusuke Nishiuchi, Non-Member, Hiroshi Tsuji, Member, IEEE
Abstract—In order to find the performance improvement direction for a DMU (Decision Making Unit), this paper proposes a new classification technique. The proposed method consists of two stages: (1) DEA (Data Envelopment Analysis) for evaluating DMUs by their inputs/outputs, and (2) GT (Group Technology) for finding clusters among DMUs. A case study of twelve DMUs with two inputs and two outputs shows that the proposed technique obtains four clusters, where each cluster has its own performance improvement direction. This paper also compares the traditional clustering with the proposed clustering.

Index Terms—Data Envelopment Analysis, Clustering methods, Data mining, Decision-making, Linear programming.

I. INTRODUCTION

Under the condition that there are a great number of competitors in a general marketplace, a company should find out its own advantages compared with others and extend them [2]. For this reason, concern with mathematical approaches has been growing [5] [11] [16]. In particular, this paper concentrates on the following issues: (1) characterize each company in the marketplace by its activity and define groups by similarity, and (2) compare a company to others and find the performance improvement direction [3] [4].

As for the former issue, many cluster analyses have been developed in recent years. Cluster analysis is a method for classifying samples characterized by multiple property values [5] [6]. It allows us to find common characteristics within a group, in other words, the reason why a sample belongs to a group. However, the traditional analysis treats all property values alike. Therefore, it often produces rules based on absolute property values, which makes it difficult to find the performance improvement direction for each sample.

As for the latter issue, DEA has been developed and applied to a variety of managerial and economic problem situations [8]. By comparison with the Pareto optimal solution, the so-called "efficiency frontier", the performance of DMUs is measured relatively. However, DEA considers only the subset of DMUs that forms the efficiency frontier. Therefore, little attention has been given to clustering techniques for classifying all DMUs.

In order to address these problems, this paper proposes a new classification technique. The proposed method consists of two stages: (1) DEA for evaluating DMUs by their inputs/outputs, and (2) GT for finding clusters among DMUs.

The remaining structure of this paper is organized as follows: Section 2 describes DEA as the basis of this research. Section 3 proposes the DEA-based classification method. Section 4 illustrates a numerical simulation using the proposed method and the traditional method, and discusses the difference between their classification results. Section 5 discusses general prospects of the two methods. Finally, conclusions and future extensions are summarized in Section 6.
II. DATA ENVELOPMENT ANALYSIS (DEA)

A. An overview of DEA
Data Envelopment Analysis, initiated by Charnes et al. (1978) [7], has been widely applied to efficiency (productivity) analysis, and more than fifteen hundred studies have been performed in the past twenty years [8].

DEA assumes that a DMU's activity uses multiple inputs to yield multiple outputs, and defines the process which changes multiple inputs into multiple outputs as an "efficiency score". By comparison with the Pareto optimal solution, the so-called "efficiency frontier", the efficiency score of a DMU is measured relatively.

B. Efficiency frontier
This section illustrates the efficiency frontier visually using an exercise with a sample data set. In Figure 1, suppose that there are seven DMUs which have one input and two outputs, where the X-axis is the amount of sales (output 1) over the number of shops (input) and the Y-axis is the number of visitors (output 2) over the number of shops (input). So, if a DMU is located in the upper-right region, the DMU has high productivity.

Line B-C-F-G is the efficiency frontier in Figure 1. The DMUs on this frontier are considered to perform an "efficient" activity.
Manuscript received October 1, 2005, DEA-based Classification for Finding
Performance Improvement Direction.
S. Aoki is with the Graduate School of Engineering, Osaka Prefecture
University, 1-1 Gakuencho, Sakai, Osaka 599-8531, Japan (corresponding
author to provide phone: +81-72-254-9354; fax: +81-72-254-9915; e-mail:
[email protected])
Y. Nishiuchi is with the Graduate School of Engineering, Osaka Prefecture
University, 1-1 Gakuencho, Sakai, Osaka 599-8531, Japan (e-mail:
[email protected])
H. Tsuji is with the Graduate School of Engineering, Osaka Prefecture
University, 1-1 Gakuencho, Sakai, Osaka 599-8531, Japan (e-mail:
[email protected])
The other DMUs are considered to perform an "inefficient" activity, and there is room to improve their activities. For instance, DMUE's efficiency score equals OE/OE1. Thus the range of the efficiency score is [0, 1]. The efficiency scores for DMUB, DMUC, DMUF and DMUG are equal to 1.

[Fig. 1. Graphical description of efficiency measurement: seven DMUs A-G plotted with the amount of sales per number of shops (X-axis) against the number of visitors per number of shops (Y-axis); line B-C-F-G is the efficiency frontier, and E1 is the point where the ray OE crosses the frontier.]
C. DEA model
When there are n DMUs (DMU1, …, DMUk, …, DMUn), and each DMU is characterized by its own performance with m inputs (x_{1k}, x_{2k}, …, x_{mk}) and s outputs (y_{1k}, y_{2k}, …, y_{sk}), the DEA model is mathematically expressed by the following formulation [11] [12]:

Minimize  θ_k
subject to  Σ_{j=1}^{n} x_{ij} λ_j − θ x_{ik} ≤ 0   (i = 1, 2, …, m),
            Σ_{j=1}^{n} y_{rj} λ_j ≥ y_{rk}   (r = 1, 2, …, s),
            L ≤ Σ_{j=1}^{n} λ_j ≤ U,
            λ_j ≥ 0 (j = 1, 2, …, n),  θ: free.                    (1)

In Formulation (1), L and U are the lower and upper bounds of Σ_{j=1}^{n} λ_j. If L = 0 and U = ∞, Formulation (1) is called "the CCR model", and if L = U = 1, Formulation (1) is called "the BCC model" [13] [14] [15]. This paper uses the CCR model.

θ_k is the efficiency score, in the manner that θ_k = 1 (100%) means the DMU is "efficient", while θ_k < 1 means it is "inefficient". The λ_j (j = 1, 2, …, n) can be considered to form the efficiency frontier for DMUk; in particular, if λ_j > 0, then DMUj is on the efficiency frontier. The set of these DMUs is the so-called "reference set" (Rk) for DMUk, expressed as follows:

R_k = { j | λ_j* > 0, j = 1, …, n }                                 (2)

Using the reference set, this paper re-defines the set R_k as a vector a_k, which is shown as follows:

a_k = { λ_1*, λ_2*, …, λ_n* }                                       (3)

In Formulation (1), for instance, when a_k = { λ_1*, …, λ_{v−1}* = 0, λ_v* = 0.7, λ_{v+1}*, …, λ_{w−1}* = 0, λ_w* = 0.3, λ_{w+1}*, …, λ_n* = 0 } and θ_k* = 0.85, the reference set of DMUk is { DMUv, DMUw }. In Fig. 2, the point k′ is the nearest point to DMUk on the efficiency frontier, and the efficiency value of DMUk is the ratio 0.85 to 1.

[Fig. 2. Reference set for DMUk: DMUk is projected onto the point k′ on its efficiency frontier, formed by DMUv (weight 0.7) and DMUw (weight 0.3).]

What is important is that this research obtains the segment connecting the origin with k′ not by the researchers' subjectivity but by the intention of making the efficiency of DMUk as high as possible. The efficiency score of DMUk+1 is obtained by replacing "k" with "k+1" in Formulation (1).

III. DEA-BASED CLASSIFICATION METHOD

Let us propose the method, which consists of the following steps:
A: Divide a data set into input items and output items.
B: For each DMU, solve Formulation (1) to get the efficiency score and the λ_j values. Then we get a similarity coefficient matrix S.
C: Apply the rank order algorithm to the similarity coefficient matrix. Then we get clusters.

A. Select input and output items
For the first step, there is a guideline for defining a data set as follows [9]:
1. Each data item is numeric, and its value is greater than zero,
2. In order to show the feature of the DMU's activity, the analyst should divide the data set into input items and output items,
3. As input items, the analyst should choose data which are used for investment, such as the amount of capital stock, the number of employees, and the amount of advertising investment,
4. As output items, the analyst should choose data which are used for return, such as the amount of sales and the number of visitors.
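The CCR model (1) used throughout these steps is an ordinary linear program. The following is a minimal sketch (not the authors' code) with SciPy, where X is the m-by-n input matrix, Y the s-by-n output matrix, and k the index of the evaluated DMU:

```python
# A sketch of the CCR model (1) (L = 0, U = infinity); not the authors' code.
import numpy as np
from scipy.optimize import linprog

def ccr_score(X, Y, k):
    m, n = X.shape
    s = Y.shape[0]
    c = np.r_[1.0, np.zeros(n)]                     # minimize theta
    A_ub = np.block([[-X[:, [k]], X],               # sum_j x_ij*lam_j <= theta*x_ik
                     [np.zeros((s, 1)), -Y]])       # sum_j y_rj*lam_j >= y_rk
    b_ub = np.r_[np.zeros(m), -Y[:, k]]
    bounds = [(None, None)] + [(0, None)] * n       # theta free, lam_j >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[0], res.x[1:]                      # efficiency score, lambda values
```

A positive lambda value marks a member of the reference set R_k of (2), and the full lambda vector is the a_k of (3).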
B. Create similarity coefficient matrix
As the second step, the proposed method calculates an efficiency score (θ_k) for each DMUk by Formulation (1), and a vector a_k by formula (3). Then the proposed method creates the similarity coefficient matrix S as follows:

S = { a_1, a_2, …, a_n }                                            (4)

C. Classify DMUs by rank order algorithm
As the last step, DMUs are classified into groups by Group Technology (GT) [18], handling the similarity coefficient matrix S. In this classification, the rank order algorithm by King [19] is employed. The rank order algorithm consists of four steps:

Step 1. Calculate the total weight of each column, w_j = Σ_i 2^i M_ij,
Step 2. Arrange the columns by ascending weight,
Step 3. Calculate the total weight of each row, w_i = Σ_j 2^j M_ij,
Step 4. If the rows are in ascending order by weight, STOP;
        else arrange the rows by ascending weight and GOTO Step 1.
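A compact sketch (not the authors' code) of these four steps for a 0/1 matrix M, following the column and row weights defined above:

```python
import numpy as np

def rank_order_clustering(M):
    """King's rank order algorithm on a 0/1 matrix, per Steps 1-4 above."""
    M = np.asarray(M, dtype=int)
    while True:
        col_w = (M * (1 << np.arange(M.shape[0]))[:, None]).sum(axis=0)  # Step 1
        M = M[:, np.argsort(col_w, kind="stable")]                       # Step 2
        row_w = (M * (1 << np.arange(M.shape[1]))).sum(axis=1)           # Step 3
        if np.all(np.diff(row_w) >= 0):                                  # Step 4: stop
            return M
        M = M[np.argsort(row_w, kind="stable")]                          # else reorder rows
```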
IV. A CASE STUDY

In order to verify the usefulness of the proposed method, let us illustrate a numerical simulation.

A. A data set
A sample data set is shown in Table I. The data set concerns the performance of 12 DMUs (DMUA, …, DMUL), and each DMU has four data items: the number of employees, the number of shops, the number of visitors, and the amount of sales.

B. Traditional cluster analysis
B.1. METHOD OF CLUSTERING ANALYSIS. Cluster analysis is an exploratory data analysis method which aims at sorting different objects into groups in such a way that the degree of association between objects is maximal if they belong to the same group and minimal otherwise [5] [20]. The degree of association is estimated by the distance, which here is calculated by Ward's method [21].

Ward's method is distinct from other methods because it uses an analysis-of-variance approach to evaluate the distances between clusters. When a new cluster c is created by combining cluster a and cluster b, the distance between cluster x and cluster c is mathematically expressed by the following formulation:

d_xc^2 = (n_x + n_a)/(n_x + n_c) d_xa^2 + (n_x + n_b)/(n_x + n_c) d_xb^2 − n_x/(n_x + n_c) d_ab^2   (5)

where d_xm is the distance between clusters x and m, and n_m is the number of individuals in cluster m. In general, this method is computationally simple, while it tends to create small clusters.

B.2. CLASSIFICATION RESULT. Classifying the data set with Ward's clustering method yields a dendrogram (see Fig. 3), also called a tree diagram. In Fig. 3, when two individuals are combined on the left, the two individuals belong to the same group. The final number of clusters depends on the position where the dendrogram is cut off. To get four clusters, for example, (A, J, E), (B, F, I), (K, L) and (C, G, D, H) are obtained by cutting the dendrogram at (1) in Fig. 3.

[Fig. 3. Dendrogram by Ward's method over the 12 DMUs (distance among DMUs on the horizontal axis), with cut-off positions (1) and (2) marked.]
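As a quick illustration, the Lance-Williams style update (5) in code (a sketch, with hypothetical squared distances and cluster sizes as inputs):

```python
def ward_update(d_xa2, d_xb2, d_ab2, n_x, n_a, n_b):
    """Squared distance d_xc^2 of formula (5) after merging clusters a and b into c."""
    n_c = n_a + n_b
    return ((n_x + n_a) * d_xa2 + (n_x + n_b) * d_xb2 - n_x * d_ab2) / (n_x + n_c)
```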
TABLE I. DATA SET FOR NUMERICAL STUDIES

       Inputs                   Outputs
DMU    Number of   Number of    Number of visitors    Amount of sales
       employees   shops        (K persons/month)     (M yen/month)
A      10          8            23                    21
B      26          10           37                    32
C      40          15           80                    68
D      35          28           76                    60
E      30          21           23                    20
F      33          10           38                    41
G      37          12           78                    65
H      50          22           68                    77
I      31          15           48                    33
J      12          10           16                    36
K      20          12           64                    23
L      45          26           72                    35
From this classification result and Table I, the feature of each group is considered as follows:
(i) Group (A, J, E) consists of "small scale" DMUs,
(ii) Group (B, F, I) consists of "lower middle scale" DMUs,
(iii) Group (K, L) consists of "larger middle scale" DMUs whose visitor unit price is very low,
(iv) Group (C, G, D, H) consists of "large scale" DMUs.
Fig. 4 illustrates the classification result of the traditional method.

[Fig. 4. Traditional classification result: a conceptual map with the numbers of employees and shops on the X-axis and the number of visitors and amount of sales on the Y-axis, showing the small scale, lower middle scale, larger middle scale (with a very low visitor unit price) and large scale groups.]

V. DEA-BASED CLASSIFICATION

This section describes the process of the proposed method.

Step 1: Select inputs and outputs. According to Step 1 in Section 3, the number of employees and the number of shops are selected as input values, and the number of visitors and the amount of sales are selected as output values.

Step 2: Create a similarity coefficient matrix. By Formulations (1), (3) and (4), the similarity coefficient matrix S is obtained as shown in Table II.

TABLE II. SIMILARITY COEFFICIENT MATRIX S
(λ_j columns are shown for the "efficient" DMUs A, G, J and K; the entries for all other DMUs are zero)

DMU   Efficiency score   A     G       J       K
A     1                  1     0       0       0
B     0.674              0     0.404   0.124   0.054
C     0.943              0     0.889   0.21    0.113
D     0.885              1     0       0       0.265
E     0.331              0     0.007   0.38    0.256
F     0.757              0     0.631   0       0
G     1                  0     1       0       0
H     0.755              0     0.789   0.715   0
I     0.638              0     0.276   0.184   0.368
J     1                  0     0       1       0
K     1                  0     0       0       1
L     0.556              0     0.103   0.176   0.956

Let us note S in Table II. The λ_j values of DMUA, DMUG, DMUJ and DMUK, which lie on the efficiency frontier, are greater than zero, while those of all the other DMUs equal zero. This means that each DMU is characterized by a combination of the "efficient" DMUs' features. The proposed method focuses attention on this DEA contribution and finds the performance improvement direction for each DMU.

Step 3: Classify DMUs by the rank order algorithm. The rank order algorithm applied to the similarity coefficient matrix S generates the classification shown in Fig. 5. The matrix S in Fig. 5 is binarized as follows: if S_ij > 0, it is considered that there is relevance between DMUi and DMUj, and the entry is 1; if S_ij = 0, it is considered that there is no relevance between DMUi and DMUj, and the entry is empty.

[Fig. 5. Classification demonstration by the rank order algorithm: the binarized matrix in its initial state and, after handling, its final state.]

Then, four clusters, (A, D), (B, C, E, I, K, L), (F, G, H) and (J), are obtained as shown in Fig. 5. The feature of each group is considered as follows:
(i) Group (A, D) consists of DMUs which get many visitors and a large amount of sales with a few employees and a few shops,
(ii) Group (B, C, E, I, K, L) consists of DMUs whose employees are clever at attracting visitors (the "Marquee Side" in Fig. 6),
(iii) Group (F, G, H) consists of DMUs which are managed with large-sized shops,
(iv) Group (J) consists of the DMU which has many visitors who purchase a lot.

From the above analysis, Fig. 6 is illustrated as a conceptual diagram which shows the situation of the classification.

[Fig. 6. Proposed classification result: DMUs arranged among a Profit Side (many visitors and large sales with a few employees and shops: A, D), a Brand Side (large sales with a few visitors: J), a Marquee Side (many visitors with a few employees: K, I, L, B, E, C), and a Shop Scale Side (many visitors and large sales with many employees: H, F, G).]
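Putting the two stages together, a sketch of the whole Section III pipeline (assuming the ccr_score and rank_order_clustering helpers sketched earlier), e.g., for the 12 DMUs of Table I:

```python
import numpy as np

def dea_based_classification(X, Y):
    """Stage 1: one CCR LP per DMU; Stage 2: binarize and apply rank order."""
    n = X.shape[1]
    S = np.zeros((n, n))
    for k in range(n):
        _, lam = ccr_score(X, Y, k)     # Step B: efficiency score and lambda values
        S[k] = lam                      # row k is the vector a_k of formula (3)
    M = (S > 0).astype(int)             # entry 1 if S_ij > 0, else 0 (as in Step 3)
    return rank_order_clustering(M)     # Step C
```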
VI. DISCUSSION

From the results of Section 4.2, two characteristics of the traditional clustering analysis are considered as follows:
(a) The classification result is based on the scale of management,
(b) The number of clusters can be assigned according to the purpose.
Thus the traditional method does not require preparation in advance. However, its demerit is that it is difficult to find the performance improvement direction for a DMU, since the classification result is only based on the scale of management.

On the other hand, the DEA-based classification has three characteristics as follows:
(a) The classification result is based on the direction of management,
(b) The number of classified groups is the same as the number of "efficient" DMUs,
(c) Every group has at least one "efficient" DMU.
Since the λ_j values in the similarity coefficient matrix S (Table II) are positive only if the corresponding DMU is "efficient", (b) is true. As shown in Fig. 5, since there is one "efficient" DMU in every classified group, (c) is also true.

Then, the merits and demerits of the proposed method are described. It is easy to find the performance improvement direction for a DMU: for example, even if a DMU is evaluated as "inefficient", it is possible to refer to the features of the "efficient" DMU which belongs to the same group. However, it is necessary to select the right inputs and the right outputs in preparation.

VII. CONCLUSIONS AND FUTURE EXTENSIONS

This paper has described the issues of the traditional classification method and proposed a new classification method which finds the performance improvement direction. The case study has shown that the classification by cluster analysis was based on the scale of management, while the classification by the proposed method was based on the direction of management.

Future extensions of this research include the following:
(a) Application to a large-scale practical problem,
(b) A method for assigning meaning to the derived groups,
(c) Investigating the reliability of the performance improvement direction,
(d) Establishment of a one-step application of the proposed method.

REFERENCES

[1] Y. Hirose et al., Brand value evaluation paper group report, the Ministry of Economy, Trade and Industry, 2002.
[2] Y. Hirose et al., Brand value that on-balance-ization is hurried, Weekly Economist special issue, Vol.24, 2001.
[3] S. Aoki, Y. Naito, and H. Tsuji, "DEA-based Indicator for Performance Improvement", Proceedings of the 2005 International Conference on Active Media Technology, 2005.
[4] Y. Taniguchi, H. Mizuno and H. Yajima, "Visual Decision Support System", Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC97), 1997, pp.554-558.
[5] S. Miyamoto, Fuzzy Sets in Information Retrieval and Cluster Analysis, Kluwer Academic Publishers, Dordrecht/Boston, 1990.
[6] M.R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.
[7] A. Charnes, W.W. Cooper, and E. Rhodes, "Measuring the efficiency of decision-making units", European Journal of Operational Research, vol.2, 1978, pp.429-444.
[8] T. Sueyoshi, Management Efficiency Analysis (in Japanese), Asakura Shoten, Tokyo, 2001.
[9] K. Tone, Measurement and Improvement of Management Efficiency (in Japanese), JUSE Press, Tokyo, 1993.
[10] M.J. Farrell, "The Measurement of Productive Efficiency", Journal of the Royal Statistical Society, Series A, vol.120, 1957, pp.253-281.
[11] D.L. Adolphson, G.C. Cornia, and L.C. Walters, "A Unified Framework for Classifying DEA Models", Operational Research '90, edited by E.E. Bradley, Pergamon Press, 1991, pp.647-657.
[12] A. Boussofiane, R.G. Dyson, and E. Thanassoulis, "Invited Review: Applied Data Envelopment Analysis", European Journal of Operational Research, vol.52, 1991, pp.1-15.
[13] R.D. Banker and R.C. Morey, "The use of categorical variables in Data Envelopment Analysis", Management Science, vol.32, 1984, pp.1613-1627.
[14] R.D. Banker, A. Charnes, and W.W. Cooper, "Some models for estimating technical and scale inefficiencies in data envelopment analysis", Management Science, vol.30, 1984, pp.1078-1092.
[15] R.D. Banker, "Estimating Most Productive Scale Size Using Data Envelopment Analysis", European Journal of Operational Research, vol.17, 1984, pp.35-44.
[16] W.A. Kamakura, "A note on the use of categorical variables in Data Envelopment Analysis", Management Science, vol.34, 1988, pp.1273-1276.
[17] J.J. Rousseau and J. Semple, "Categorical outputs in Data Envelopment Analysis", Management Science, vol.39, 1993, pp.384-386.
[18] J.R. King and V. Nakornchai, "Machine-component group formation in group technology: review and extension", International Journal of Production Research, vol.20, 1982, pp.117-133.
[19] J.R. King, "Machine-Component Grouping in Production Flow Analysis: An Approach Using a Rank Order Clustering Algorithm", International Journal of Production Research, vol.18, 1980, pp.213-232.
[20] J.G. Hirschberg and D.J. Aigner, "A Classification for Medium and Small Firms by Time-of-Day Electricity Usage", Papers and Proceedings of the Eighth Annual North American Conference of the International Association of Energy Economists, 1986, pp.253-257.
[21] J. Ward, "Hierarchical grouping to optimize an objective function", Journal of the American Statistical Association, vol.58, 1963, pp.236-244.
Multi-Viewpoint Data Envelopment Analysis for
Finding Efficiency and Inefficiency
Shingo AOKI, Member, IEEE, Kiyosei MINAMI, Non-Member, Hiroshi TSUJI, Member, IEEE
Abstract—This paper proposes a decision support method for measuring productive efficiency based on DEA (Data Envelopment Analysis). The decision support method, called the Multi-Viewpoint DEA model, integrates the efficiency analysis and the inefficiency analysis, and makes it possible to identify the performance of a DMU (Decision Making Unit) between its strong points and weak points by changing a viewpoint parameter. A case study of twenty-five Japanese baseball players shows that the proposed model is robust in its evaluation values.

Index Terms—Data Envelopment Analysis, Decision-Making, Linear programming, Productivity.

I. INTRODUCTION

DEA [1] is a nonparametric method for finding the relative efficiency of DMUs, each of which is a company responsible for converting multiple inputs into multiple outputs. DEA has been applied to a variety of managerial and economic problem situations in both the public and private sectors [5, 9, 13, 14]. DEA summarizes the process which changes multiple inputs into multiple outputs as one evaluation value.

The decision method based on DEA induces two kinds of approaches. One is the efficiency analysis based on the Pareto optimal solution, from the aspect only of the strong points [1, 5]. The other is the inefficiency analysis based on the Pareto optimal solution, from the aspect only of the weak points [7]. The evaluation values of the two approaches are inconsistent [8]. However, analysts have evaluated DMUs only from an extreme aspect, either the strong points or the weak points. Thus, the two traditional analyses lack flexibility and robustness [17].

In fact, while there are many inputs and outputs in the DEA framework, these items are not fully used in the previous approaches. This type of DEA problem has usually been tackled by multiplier restriction approaches [15] and cone ratio approaches [16]. While such multiplier restrictions usually reduce the number of zero weights, they often produce an infeasible solution in DEA. Therefore, a new DEA model which has robustness in the evaluation values is required.

This paper proposes a decision support technique referred to as the Multi-Viewpoint DEA model. The remaining structure of this paper is organized as follows: the next section reviews the traditional DEA models. Section 3 proposes a new model. The proposed model integrates the efficiency analysis and the inefficiency analysis into one mathematical formulation, and allows us to analyze the performance of a DMU from multiple viewpoints between its strong points and weak points. Section 4 verifies the proposed model through a case study. The case study shows that the proposed model has two desirable features: (1) robustness of the evaluation value, and (2) unification of the efficiency analysis and the inefficiency analysis. Finally, conclusions and future study are summarized in Section 5.
II. DEA-BASED EFFICIENCY AND INEFFICIENCY ANALYSES

A. DEA: Data Envelopment Analysis
In order to describe the mathematical structure of the evaluation value, this paper assumes that there are n DMUs (DMU_1, …, DMU_k, …, DMU_n), where each DMU is characterized by m inputs (x_{1k}, …, x_{ik}, …, x_{mk}) and s outputs (y_{1k}, …, y_{rk}, …, y_{sk}). The evaluation value of DMU_k is mathematically formulated by

Evaluation value of DMU_k = (u_1 y_{1k} + u_2 y_{2k} + … + u_s y_{sk}) / (v_1 x_{1k} + v_2 x_{2k} + … + v_m x_{mk})   (1)

Here u_r is the multiplier weight given to the r-th output, and v_i is the multiplier weight given to the i-th input. From the analysis concept, there are two decision methods for calculating these weights. One is the efficiency analysis based on the Pareto optimal solution, from the aspect only of the strong points [1, 5]. The other is the inefficiency analysis based on the Pareto optimal solution, from the aspect only of the weak points [7, 8].

Fig 1 visually represents the difference between the two methods. Suppose that there are nine DMUs which have one input and two outputs, where the X-axis is output 1 over the input and the Y-axis is
Manuscript received September 28, 2005, Multi-Viewpoint Data
Envelopment Analysis for Finding Efficiency and Inefficiency.
S. Aoki is with the Graduate School of Engineering, Osaka Prefecture
University, 1-1 Gakuencho, Sakai, Osaka 599-8531, Japan (corresponding
author to provide phone: +81-72-254-9354; fax: +81-72-254-9915; e-mail:
[email protected])
K. Minami is with the Graduate School of Engineering, Osaka Prefecture
University, 1-1 Gakuencho, Sakai, Osaka 599-8531, Japan
H. Tsuji is with the Graduate School of Engineering, Osaka Prefecture
University, 1-1 Gakuencho, Sakai, Osaka 599-8531, Japan (e-mail:
[email protected])
output 2 over the input. So, if a DMU is located in the upper-right region, the DMU has high productivity. The efficiency analysis finds the efficiency frontier, which indicates the best practice line (B-C-D-E-F in Fig 1), and evaluates the relative evaluation value from the aspect only of the strong points. On the other hand, the inefficiency analysis finds the inefficiency frontier, which indicates the worst practice line (B-I-H-G-F in Fig 1), and evaluates the relative evaluation value from the aspect only of the weak points.

[Fig 1. Efficiency analysis and inefficiency analysis: nine DMUs A-I plotted as output 1 / input (X-axis) against output 2 / input (Y-axis), with the efficiency frontier B-C-D-E-F and the inefficiency frontier B-I-H-G-F.]

B. Efficiency Analysis
The efficiency analysis measures the efficiency level of a specific DMU_k by relatively comparing its performance to the efficiency frontier. This paper is based on the CCR model [1], while there are other models [5, 11]. The efficiency analysis can be mathematically formulated by

Max  Σ_{r=1}^{s} u_r y_{rk}  ( = θ_k^E )                             (2-1)
s.t.  Σ_{i=1}^{m} v_i x_{ij} − Σ_{r=1}^{s} u_r y_{rj} ≥ 0  (j = 1, 2, …, n)   (2-2)
      Σ_{i=1}^{m} v_i x_{ik} = 1                                     (2-3)
      v_i ≥ 0, u_r ≥ 0                                               (2)

Here formula (2-2) is a restriction condition under which the productivity of every DMU (formula (1)) becomes 100% or less. The objective function (2-1) represents the maximization of the sum of the virtual outputs of DMU_k, given that the virtual inputs of DMU_k equal 1 (formula (2-3)). Therefore, the optimal solution (v_i, u_r) represents the convenient weights for DMU_k. The optimal objective function value indicates the evaluation value (θ_k^E) for DMU_k. This evaluation value by the convenient weights is called the "efficiency score", in the manner that θ_k^E = 1 (100%) means the state of efficiency, while θ_k^E < 1 (100%) means the state of inefficiency.

C. Inefficiency analysis
There is another analysis which measures the inefficiency level of a specific DMU_k, based on the Inverted DEA model [7]. The inefficiency analysis can be mathematically formulated by

Min  Σ_{r=1}^{s} u_r y_{rk}  ( = 1/θ_k^IE )                          (3-1)
s.t.  Σ_{i=1}^{m} v_i x_{ij} − Σ_{r=1}^{s} u_r y_{rj} ≤ 0  (j = 1, 2, …, n)   (3-2)
      Σ_{i=1}^{m} v_i x_{ik} = 1                                     (3-3)
      v_i ≥ 0, u_r ≥ 0                                               (3)

Again, formula (3-2) is a restriction condition under which the productivity of every DMU (formula (1)) becomes 100% or more. The objective function (3-1) represents the minimization of the virtual outputs of DMU_k, given that the virtual inputs of DMU_k equal 1 (formula (3-3)). Therefore, the optimal solution (v_i, u_r) represents the inconvenient weights for DMU_k. The inverse of the optimal objective function value indicates the "inefficiency score", in the manner that θ_k^IE = 1 (100%) means the state of inefficiency, while θ_k^IE > 1 (100%) means the state of efficiency.

D. Requirement for Multi-Viewpoint DEA
As shown in Fig 1, DMU_B and DMU_F are evaluated as being in both states of "efficiency (θ_k^E = 1)" and "inefficiency (θ_k^IE = 1)". This result clearly shows the mathematical difference between the two analyses. For example, DMU_B has the best productivity for output 2 / input, while it has the worst productivity for output 1 / input. In the efficiency analysis, the weights of DMU_B are evaluated from the aspect of the strong points; therefore, the weight of output 2 / input becomes a positive value and the weight of output 1 / input becomes zero. On the other hand, in the inefficiency analysis, the weights of DMU_B are evaluated from the aspect of the weak points; therefore, the weight of output 2 / input becomes zero and the weight of output 1 / input becomes a positive value. This difference in the weight estimation causes the following mathematical problems:

a) No robustness of the evaluation value
Both analyses may produce zero weights for most inputs and outputs. A zero weight indicates that the corresponding inputs or outputs are not used for the evaluation value. Moreover, if specific input or output items are removed from the analysis, the evaluation value may change greatly [17]. This type of DEA problem is usually tackled by multiplier restriction approaches [15] and cone ratio approaches [16]. Such multiplier restrictions usually reduce the number of zero weights, but these analyses often produce an infeasible solution. The development of a DEA model which has robustness of the evaluation value is required.

b) Lack of unification between efficiency analysis and inefficiency analysis
Fundamentally, an efficient DMU cannot be inefficient, while an inefficient DMU cannot be efficient. However, the evaluation values may be inconsistent, as for DMU_B and DMU_F in Fig 1, which are in both states of "efficiency" and "inefficiency". Thus, it is not easy for analysts to understand the difference between the evaluation values. A basis of the evaluation value which unifies the efficiency analysis and the inefficiency analysis is required.
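Under the reconstructed formulations (2) and (3), both scores come from small linear programs. A sketch (not the authors' code), where X is the m-by-n input matrix and Y the s-by-n output matrix:

```python
# A sketch of the efficiency score (2) and inefficiency score (3) in multiplier form.
import numpy as np
from scipy.optimize import linprog

def multiplier_scores(X, Y, k):
    m, n = X.shape
    s = Y.shape[0]
    A_eq = [np.r_[X[:, k], np.zeros(s)]]          # sum_i v_i x_ik = 1
    b_eq = [1.0]
    ratio = np.hstack([-X.T, Y.T])                # rows: sum_r u_r y_rj - sum_i v_i x_ij
    bounds = [(0, None)] * (m + s)
    # (2): max sum_r u_r y_rk subject to productivity <= 100% for every DMU
    eff = linprog(np.r_[np.zeros(m), -Y[:, k]], A_ub=ratio, b_ub=np.zeros(n),
                  A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    # (3): min sum_r u_r y_rk subject to productivity >= 100% for every DMU
    # (as noted above, this second LP can turn out infeasible in practice)
    ineff = linprog(np.r_[np.zeros(m), Y[:, k]], A_ub=-ratio, b_ub=np.zeros(n),
                    A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return -eff.fun, 1.0 / ineff.fun              # theta_E and theta_IE
```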
III. INTEGRATING EFFICIENT AND INEFFICIENT VIEW

A. Two DEA models based on GP technique
Let us propose a new decision support technique referred to as the Multi-Viewpoint DEA model. The proposed model is a re-formulation of the efficiency analysis and the inefficiency analysis into one mathematical formulation. This paper applies the following formula (4), which adds the variables (d_j^+, d_j^-) to formula (2-2):

Σ_{i=1}^{m} v_i x_{ij} − Σ_{r=1}^{s} u_r y_{rj} − d_j^+ + d_j^- = 0  (j = 1, 2, …, n)   (4)

Here d_j^+ indicates the slack variables, and d_j^- indicates the artificial variables. Therefore, the objective function (2-1) can be replaced, using a sufficiently big M, as follows:

Max  Σ_{r=1}^{s} u_r y_{rk} − M Σ_{j=1}^{n} d_j^-                    (5)

From formula (4) and formula (2-3), the objective function (5) can be rewritten as follows:

Σ_{r=1}^{s} u_r y_{rk} − M Σ_{j=1}^{n} d_j^-
  = ( Σ_{i=1}^{m} v_i x_{ik} − d_k^+ + d_k^- ) − M Σ_{j=1}^{n} d_j^-
  = 1 − d_k^+ + d_k^- − M Σ_{j=1}^{n} d_j^-
  = 1 − d_k^+ + (1 − M) d_k^- − M Σ_{j=1, j≠k}^{n} d_j^-             (6)

Using the GP (Goal Programming) technique, the DEA efficiency model (formula (2)) can be replaced by the following linear programming:

Max  1 − d_k^+ + (1 − M) d_k^- − M Σ_{j=1, j≠k}^{n} d_j^-
s.t.  Σ_{i=1}^{m} v_i x_{ij} − Σ_{r=1}^{s} u_r y_{rj} − d_j^+ + d_j^- = 0  (j = 1, 2, …, n)
      Σ_{i=1}^{m} v_i x_{ik} = 1
      v_i ≥ 0, u_r ≥ 0, d_j^+ ≥ 0, d_j^- ≥ 0                         (7)

The efficiency score (θ_k^E) of DMU_k is then

θ_k^E = 1 − d_k^{+*}  ( = Σ_{r=1}^{s} u_r^* y_{rk}, since Σ_{i=1}^{m} v_i^* x_{ik} = 1 )   (8)

where the superscript "*" indicates the optimal solution of formula (7).

Let us likewise apply formula (4), with the variables (d_j^+, d_j^-), to formula (3-2). This paper notes that d_j^+ now indicates the artificial variables and d_j^- the slack variables in the inefficiency analysis. Using the GP technique, the inefficiency analysis (formula (3)) can be replaced by the following linear programming:

Min  1 + (M − 1) d_k^+ + d_k^- + M Σ_{j=1, j≠k}^{n} d_j^+
s.t.  Σ_{i=1}^{m} v_i x_{ij} − Σ_{r=1}^{s} u_r y_{rj} − d_j^+ + d_j^- = 0  (j = 1, 2, …, n)
      Σ_{i=1}^{m} v_i x_{ik} = 1
      v_i ≥ 0, u_r ≥ 0, d_j^+ ≥ 0, d_j^- ≥ 0                         (9)

The inefficiency score (θ_k^IE) of DMU_k is then

θ_k^IE = 1 / (1 + d_k^{-*})                                          (10)

where the superscript "*" indicates the optimal solution of formula (9).

B. Mathematical integration of the efficiency and inefficiency models
In order to integrate the two DEA analyses into one formula mathematically, this paper introduces the slack variables. As seen in formulas (7) and (9), both analyses have the same restriction conditions. This paper therefore combines the objective functions of (7) and (9) with constants (α, β):

α{1 − d_k^+ + (1 − M) d_k^- − M Σ_{j≠k} d_j^-} − β{1 + (M − 1) d_k^+ + d_k^- + M Σ_{j≠k} d_j^+}
  = (α − β) − {α + β(M − 1)} d_k^+ + {α(1 − M) − β} d_k^- − ( α M Σ_{j≠k} d_j^- + β M Σ_{j≠k} d_j^+ )   (11)
When formula (11) is divided by the sufficiently big M, it reduces to

−( β d_k^+ + α d_k^- ) − ( β Σ_{j≠k} d_j^+ + α Σ_{j≠k} d_j^- ) = −β Σ_{j=1}^{n} d_j^+ − α Σ_{j=1}^{n} d_j^-   (12)

where the constants can be set with α + β = 1, because the constants (α, β) indicate the relative ratios of the efficiency analysis and the inefficiency analysis. Then the proposed model is formulated as the following linear programming:

Max  −(1 − α) Σ_{j=1}^{n} d_j^+ − α Σ_{j=1}^{n} d_j^-
s.t.  Σ_{i=1}^{m} v_i x_{ij} − Σ_{r=1}^{s} u_r y_{rj} − d_j^+ + d_j^- = 0  (j = 1, 2, …, n)
      Σ_{i=1}^{m} v_i x_{ik} = 1
      v_i ≥ 0, u_r ≥ 0, d_j^+ ≥ 0, d_j^- ≥ 0                         (13)

where x_{ij} is the i-th input value of the j-th DMU, y_{rj} is the r-th output value of the j-th DMU, v_i and u_r are the input and output weights, and d_j^+, d_j^- are the slack variables.

Formula (13) includes the viewpoint parameter α, and allows us to analyze the performance of a DMU by changing the parameter between the strong points (if α = 1 the optimal solution is the same as that of the efficiency analysis) and the weak points (if α = 0 the optimal solution is the same as that of the inefficiency analysis). For α = α', this paper defines the evaluation value (θ_k^{MVP,α'}) of DMU_k as follows:

θ_k^{MVP,α'} = α' θ_k^E − (1 − α') θ_k^IE = α' (1 − d_k^{+*}) − (1 − α') · 1/(1 + d_k^{-*})   (14)

where the superscript "*" indicates the optimal solution of formula (13). The first term of formula (14) indicates the evaluation value from the aspect of the strong points, and the second term indicates it from the aspect of the weak points. Therefore, the evaluation value (θ_k^{MVP,α'}) is measured on the range between −1 (−100%: inefficiency) and 1 (100%: efficiency).

IV. CASE STUDY

A. A data set
The data set used in this paper is shown in Table I (the source of this data set is the Internet site YAHOO! SPORTS (in Japanese), 2005). Twenty-five batters are selected for our performance evaluation. This paper uses "bats" and "walks" as input items, and "singles", "doubles", "triples", "homeruns", "runs batted in" and "steals" as output items.

TABLE I. OFFENSIVE RECORDS OF JAPANESE BASEBALL PLAYERS IN 2005
       Inputs          Outputs
DMU    bats   walks    singles   doubles   triples   homeruns   runs batted in   steals
1      577    96       89        37        1         44         120              2
2      452    73       91        19        2         18         70               3
3      498    71       82        25        1         36         91               6
4      574    56       110       34        2         24         89               18
5      503    38       111       29        1         6          51               11
6      473    75       74        21        1         30         89               6
7      431    46       77        27        1         15         63               10
8      552    91       77        31        1         35         100              8
9      569    57       105       42        1         11         73               2
10     529    64       92        22        1         26         75               7
11     420    33       75        27        2         14         75               8
12     530    84       67        24        0         44         108              0
13     549    41       122       25        2         2          34               22
14     633    51       140       19        8         4          45               42
15     580    66       107       27        0         20         88               7
16     544    24       95        28        3         24         79               1
17     473    53       88        20        0         6          40               0
18     526    47       86        28        0         24         71               4
19     559    50       92        22        3         27         90               18
20     559    51       110       24        1         9          62               4
21     452    40       68        19        2         26         84               2
22     580    61       89        23        1         33         94               5
23     542    82       74        18        0         37         100              1
24     503    78       79        20        0         18         74               1
25     424    36       74        18        7         6          39               10
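A minimal sketch (not the authors' code) of evaluating the reconstructed Multi-Viewpoint LP (13) and the evaluation value (14), e.g., on the 2-input/6-output data of Table I:

```python
# A sketch of the Multi-Viewpoint model (13)-(14); X is m-by-n, Y is s-by-n.
import numpy as np
from scipy.optimize import linprog

def mvp_score(X, Y, k, alpha):
    m, n = X.shape
    s = Y.shape[0]
    nv = m + s + 2 * n                     # variables: v, u, d+, d-
    c = np.zeros(nv)
    c[m + s:m + s + n] = 1 - alpha         # minimize (1-alpha)*sum(d+) + alpha*sum(d-)
    c[m + s + n:] = alpha
    A_eq = np.hstack([X.T, -Y.T, -np.eye(n), np.eye(n)])   # constraints of (4)
    b_eq = np.zeros(n)
    norm = np.zeros(nv)
    norm[:m] = X[:, k]                     # normalization: sum_i v_i x_ik = 1
    A_eq = np.vstack([A_eq, norm])
    b_eq = np.r_[b_eq, 1.0]
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * nv)
    d_plus = res.x[m + s + k]
    d_minus = res.x[m + s + n + k]
    return alpha * (1 - d_plus) - (1 - alpha) / (1 + d_minus)   # formula (14)
```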
B. Multi-Viewpoint DEA's result
Table II shows the evaluation values of the Multi-Viewpoint DEA model. This paper calculates eleven patterns of the viewpoint parameter between α = 1 and α = 0. In particular, if the parameter α is set to 1, the evaluation value (θ_k^{MVP,1}) is calculated by the efficiency analysis (formula (2)); and if the parameter is set to 0, the evaluation value (θ_k^{MVP,0}) is calculated by the inefficiency analysis (formula (3)).

1) Efficiency Analysis Result
This analysis finds that there are 14 batters whose evaluation value is 1 (efficiency). In Table I, these batters include DMU_1, which captured the triple crown, and DMU_14, which captured the steal crown in 2005. It is understood that DEA evaluates many evaluation axes equally. However, because the evaluation value is estimated only from the aspect of the strongest point of each DMU, the multiplicity of strong points, as with DMU_1, is not considered. Therefore, superiority cannot be established between these batters in this analysis.

2) Inefficiency Analysis Result
This analysis finds that there are 10 batters whose evaluation value is −1 (inefficiency). Because the evaluation value is estimated only from the aspect of the weak points, these batters include those which have few steals even if they excel in long hits, like DMU_12 and DMU_23. As with the efficiency analysis, superiority cannot be established between these batters.

3) Proposed Model's Result
The proposed model allows us to analyze the performance of a DMU between efficiency and inefficiency. To clarify the change of the evaluation value when the viewpoint parameter α is shifted from 1 to 0, let us focus not on the evaluation value but on the rank. Fig 2 shows the change of rank for the four specific batters (DMU_12, DMU_13, DMU_14, DMU_25) which were estimated to be in both states of "efficiency" and "inefficiency".

a) Robustness of the evaluation value
Although DMU_25 has a high rank in the case α = 1, the rank of DMU_25 drops rapidly in the other cases. Thinking about its strong points in Table I, it is understood that it has superiority in the ratio of doubles (output) / bats (input), but the other ratios are not excellent. That is to say, DMU_25 has a limited strong point. Conversely, as seen in Table II, for batters who are all-round like DMU_1 and DMU_2, the rank does not change easily. Because the proposed model allows us to know whether a DMU has a multiplicity of strong points or a limited strong point, it is possible to evaluate the DMU with robustness.

b) Unification between the DEA-efficiency and DEA-inefficiency models
In the cases α = 1, 0.8, 0.7, 0.4, 0.2, the rank of DMU_14 changes to 25, 12, 19, 11 and 24; thus the change of the rank is large. As shown in Table I, because DMU_14 has a multiplicity of strong points such as singles, triples and steals, it is understood that DMU_14 roughly holds a high rank. However, this result indicates that the rank does not change linearly from the aspect of the strong points to that of the weak points. Although the efficiency analysis and the inefficiency analysis are integrated into one mathematical formulation, how to assign the viewpoint parameter α still remains an open issue.
TABLE II. PARAMETER AND ESTIMATION VALUE (θ_k^MVP)

DMU   α=1     α=0.9   α=0.8   α=0.7   α=0.6   α=0.5    α=0.4    α=0.3    α=0.2    α=0.1    α=0
1     1       0.787   0.592   0.414   0.226   0.043    -0.138   -0.320   -0.504   -0.695   -0.931
2     1       0.805   0.606   0.422   0.231   0.052    -0.135   -0.315   -0.491   -0.634   -0.988
3     1       0.801   0.605   0.419   0.228   0.048    -0.144   -0.327   -0.502   -0.696   -0.890
4     1       0.803   0.610   0.417   0.226   0.045    -0.143   -0.328   -0.509   -0.661   -0.846
5     1       0.800   0.609   0.412   0.220   0.035    -0.158   -0.350   -0.526   -0.656   -0.881
6     0.980   0.749   0.557   0.373   0.196   0.021    -0.172   -0.355   -0.543   -0.720   -0.933
7     0.989   0.743   0.553   0.381   0.185   0.012    -0.184   -0.377   -0.569   -0.718   -0.909
8     0.981   0.683   0.499   0.319   0.150   -0.034   -0.209   -0.405   -0.604   -0.793   -1
9     1       0.710   0.507   0.353   0.159   -0.023   -0.218   -0.405   -0.603   -0.732   -0.963
10    0.947   0.746   0.559   0.373   0.197   0.012    -0.187   -0.377   -0.555   -0.733   -0.930
11    1       0.803   0.612   0.423   0.228   0.037    -0.163   -0.349   -0.501   -0.674   -0.905
12    1       0.714   0.515   0.330   0.177   -0.020   -0.211   -0.386   -0.578   -0.799   -1
13    1       0.738   0.557   0.392   0.177   -0.009   -0.198   -0.372   -0.541   -0.701   -1
14    1       0.748   0.554   0.405   0.209   0.006    -0.201   -0.376   -0.499   -0.677   -1
15    0.955   0.762   0.563   0.389   0.212   0.022    -0.176   -0.362   -0.550   -0.696   -1
16    1       0.797   0.570   0.398   0.216   0.011    -0.185   -0.377   -0.545   -0.722   -0.922
17    0.851   0.640   0.445   0.277   0.126   -0.054   -0.239   -0.427   -0.622   -0.802   -1
18    0.926   0.716   0.520   0.359   0.165   -0.029   -0.213   -0.409   -0.609   -0.787   -1
19    1       0.758   0.573   0.383   0.182   -0.006   -0.198   -0.385   -0.558   -0.733   -0.935
20    0.926   0.729   0.527   0.373   0.176   -0.012   -0.207   -0.406   -0.587   -0.716   -0.946
21    1       0.764   0.570   0.377   0.176   -0.009   -0.197   -0.382   -0.572   -0.744   -0.961
22    0.934   0.731   0.539   0.353   0.172   -0.017   -0.209   -0.404   -0.588   -0.773   -0.971
23    0.916   0.696   0.499   0.326   0.161   -0.033   -0.215   -0.411   -0.608   -0.800   -1
24    0.849   0.644   0.456   0.276   0.117   -0.069   -0.251   -0.427   -0.619   -0.806   -1
25    1       0.632   0.451   0.294   0.109   -0.071   -0.252   -0.436   -0.624   -0.805   -1
[Fig 2. Rank of the four players DMU_12, DMU_13, DMU_14 and DMU_25 (rank from 0 to 25 on the vertical axis) as the parameter α varies from 1 to 0.]

V. CONCLUSION

This paper has proposed a new decision support method, called the Multi-Viewpoint DEA model, which integrates the efficiency analysis and the inefficiency analysis into one mathematical formulation. The proposed model allows us to analyze the performance of a DMU by changing the viewpoint parameter between the strong points (if α = 1 it becomes the efficiency analysis) and the weak points (if α = 0 it becomes the inefficiency analysis). Regarding twenty-five Japanese baseball players as DMUs, a case study has shown that the proposed model has two desirable features: (a) robustness of the evaluation value, and (b) unification between the efficiency analysis and the inefficiency analysis. For future study, we will analytically compare our method to the traditional approaches [15, 16] and explore how to set the viewpoint parameter.

REFERENCES

[1] A. Charnes, W.W. Cooper, and E. Rhodes, "Measuring the efficiency of decision making units", European Journal of Operational Research, Vol.2, 1978, pp.429-444.
[2] T. Sueyoshi and S. Aoki, "A use of a nonparametric statistic for DEA frontier shift: the Kruskal and Wallis rank test", OMEGA: The International Journal of Management Science, Vol.29, No.1, 2001, pp.1-18.
[3] T. Sueyoshi, K. Onishi, and Y. Kinase, "A Benchmark Approach for Baseball Evaluation", European Journal of Operational Research, Vol.115, 1999, pp.429-448.
[4] T. Sueyoshi, Y. Kinase and S. Aoki, "DEA Duality on Returns to Scale in Production and Cost Analysis", Proceedings of the Sixth Asia Pacific Management Conference 2000, 2000, pp.1-7.
[5] W.W. Cooper, L.M. Seiford, and K. Tone, Data Envelopment Analysis: A Comprehensive Text with Models, Applications, References and DEA-Solver Software, Kluwer Academic Publishers, 2000.
[6] R. Coombs, P. Sabiotti and V. Walsh, Economics and Technological Change, Macmillan, 1987.
[7] Y. Yamada, T. Matui and M. Sugiyama, "An inefficiency measurement method for management systems", Journal of the Operations Research Society of Japan, Vol.37, 1994, pp.158-168 (in Japanese).
[8] Y. Yamada, T. Sueyoshi, M. Sugiyama, T. Nukina and T. Makino, "The DEA Method for Japanese Management: The Evaluation of Local Governmental Investments to the Japanese Economy", Journal of the Operations Research Society of Japan, Vol.38, No.4, 1995, pp.381-396.
[9] S. Aoki, K. Mishima, and H. Tsuji, "Two-Staged DEA Model with Malmquist Index for Brand Value Estimation", The 8th World Multiconference on Systemics, Cybernetics and Informatics, Vol.10, 2004, pp.1-6.
[10] R.D. Banker, A. Charnes, and W.W. Cooper, "Some Models for Estimating Technical and Scale Inefficiencies in Data Envelopment Analysis", Management Science, Vol.30, 1984, pp.1078-1092.
[11] R.D. Banker and R.M. Thrall, "Estimation of Returns to Scale Using Data Envelopment Analysis", European Journal of Operational Research, Vol.62, 1992, pp.74-82.
[12] H. Nakayama, M. Arakawa, and Y.B. Yun, "Data Envelopment Analysis in Multicriteria Decision Making", in M. Ehrgott and X. Gandibleux (eds.), Multiple Criteria Optimization: State of the Art Annotated Bibliographic Surveys, Kluwer Academic Publishers, 2002.
[13] E.W.N. Bernroider and V. Stix, "The Evaluation of ERP Systems Using Data Envelopment Analysis", Information Technology and Organizations, Idea Group Publishing, 2003, pp.283-286.
[14] Y. Zhou and Y. Chen, "DEA-based Performance Predictive Design of Complex Dynamic System Business Process Improvement", Proceedings of the 2003 IEEE International Conference on Systems, Man and Cybernetics, 2003, pp.3008-3013.
[15] R.G. Thompson, L.N. Langemeier, C.T. Lee, and R.M. Thrall, "The Role of Multiplier Bounds in Efficiency Analysis with Application to Kansas Farming", Journal of Econometrics, Vol.46, 1990, pp.93-108.
[16] W.W. Cooper, W. Quanling and G. Yu, "Using Displaced Cone Representation in DEA Models for Nondominated Solutions in Multiobjective Programming", Systems Science and Mathematical Sciences, Vol.10, 1997, pp.41-49.
[17] S. Aoki, Y. Naito, and H. Tsuji, "DEA-based Indicator for Performance Improvement", Proceedings of the Third International Conference on Active Media Technology, 2005, pp.327-330.
Mining Valuable Stocks with Genetic
Optimization Algorithm
Lean Yu, Kin Keung Lai and Shouyang Wang
Abstract—In this study, we utilize the genetic algorithm (GA) to mine high quality stocks for investment. Given the fundamental financial and price information of stock trading, we attempt to use the GA to identify stocks that are likely to outperform the market by having excess returns. To evaluate the efficiency of the GA for stock selection, the return of an equally weighted portfolio formed by the stocks selected by the GA is used as the evaluation criterion. Experimental results reveal that the proposed GA for stock selection provides a very flexible and useful tool to assist investors in selecting valuable stocks.
Index Terms—Genetic algorithms; Portfolio optimization; Data mining; Stock selection

I. INTRODUCTION

In the stock market, investors are often faced with a large number of stocks. A crucial part of their investment decision process is the selection of stocks. From a data-mining perspective, the problem of stock selection is to identify good quality stocks that have the potential to outperform the market by having excess returns in the future. Given the fundamental accounting and price information of stock trading, it is a prediction problem that involves discovering useful patterns or relationships in the data, and applying that information to identify whether a stock is of good quality.

Obviously, this is not an easy task for many investors when they are faced with an enormous number of stocks in the market. With a focus on business computing, applying artificial intelligence to portfolio selection and optimization is one way to meet the challenge. Some research has been presented to solve the asset selection problem. Levin [1] applied an artificial neural network to select valuable stocks. Chu [2] used fuzzy multiple attribute decision analysis to select stocks for a portfolio. Similarly, Zargham [3] used a fuzzy rule-based system to evaluate the listed stocks and realize stock selection. Recently, Fan [4] utilized support vector machines to train universal feedforward neural networks to perform stock selection.

However, these approaches have some drawbacks in solving the stock selection problem. For example, the fuzzy approaches [2-3] usually lack learning ability, while the neural network approaches [1, 4] suffer from the overfitting problem and often fall into local minima. In order to overcome these shortcomings, a GA is used to perform this task. Typical related literature is referred to in [5-7] for more details.

The main aim of this study is to mine valuable stocks using a GA and to test the efficiency of the GA for stock selection. The rest of the study is organized as follows. Section 2 describes the mining process based on the genetic algorithm in detail. Section 3 presents a simulation experiment, and Section 4 concludes the paper.
II. GA-BASED STOCK SELECTION PROCESS
Generally, a GA imitates the natural selection process of biological evolution with selection, crossover and mutation, and the sequence of the different operations of a genetic algorithm is shown in the left part of Figure 1. That is, a GA is a procedure modeled after genetics and evolution. Genetics provides the chromosomal representation to encode the solution space of the problem, while the evolutionary procedures are designed to search efficiently for attractive solutions to large and complex problems. Usually, a GA works in a survival-of-the-fittest fashion by gradually manipulating the potential problem solutions to obtain superior solutions in the population. Optimization is performed in the representation rather than in the problem space directly. To date, the GA has become a popular optimization method, as it often succeeds in finding the best optimum by global search, in contrast to most common optimization algorithms. Interested readers are referred to [8-9] for more details.
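As a generic illustration of that selection/crossover/mutation cycle, the following is a sketch, not the authors' implementation; the population size, mutation rate and generation count are illustrative choices only.

```python
# A generic GA skeleton matching the cycle described above; not the authors' code.
import random

def genetic_algorithm(fitness, n_bits, pop_size=40, pm=0.01, gens=100):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness)                        # lower fitness (e.g. RMSE) is better
        parents = pop[:pop_size // 2]                # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_bits)        # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ (random.random() < pm) for g in child]   # bit-flip mutation
            children.append(child)
        pop = parents + children
    return min(pop, key=fitness)
```

With the four 3-bit genes used below, n_bits would be 12.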
The aim of this study is to identify the quality of each stock using a GA so that investors can choose good ones for investment. Here we use a stock ranking to determine the quality of a stock: stocks with a high rank are regarded as good quality stocks. In this study, some financial indicators of the listed companies are employed to determine and identify the quality of each stock. That is, the financial indicators of the companies are used as input variables, while a score is given to rate the stocks; the output variable is the stock ranking. Throughout the study, four important financial indicators, return on capital employed (ROCE), price/earnings ratio (P/E ratio), earnings per share (EPS) and liquidity ratio, are utilized
Manuscript received July 30, 2005. This work was supported in part by the
SRG of City University of Hong Kong under Grant No. 7001806.
Lean Yu is with the Institute of Systems Science, Academy of Mathematics
and Systems Science, Chinese Academy of Sciences, Beijing, 100080, China
(e-mail: [email protected]).
Kin Keung Lai is with the Department of Management Science, City
University of Hong Kong and is also with the College of Business
Administration, Hunan University, 410082, China (phone: 852-2788-8563;
fax:852-2788-8560; e-mail: [email protected]).
Shouyang Wang is with the Institute of Systems Science, Academy of
Mathematics and Systems Science, Chinese Academy of Sciences, Beijing,
100080, China (e-mail: [email protected]).
in this study. Their meanings are formulated as

ROCE = (Profit) / (Shareholder's equity) × 100%                     (1)
P/E ratio = (Stock price) / (Earnings per share) × 100%             (2)
EPS = (Net income) / (Number of ordinary shares)                    (3)
Liquidity ratio = (Current assets) / (Current liabilities)          (4)

When the input variables are determined, we can use the GA to distinguish and identify the quality of each stock, as illustrated in Fig. 1.
Fig. 1 Stock selection with genetic algorithm
First of all, a population, which consists of a given number of chromosomes, is initially created by randomly assigning "1" and "0" to all genes. In the case of stock ranking, a gene contains only a single bit string for the status of an input variable. The top right part of Fig. 1 shows a population with four chromosomes; each chromosome includes different genes. In this study, the initial population of the GA is generated by encoding the four input variables. For the case of ROCE, we design 8 statuses representing different qualities in terms of different intervals, varying from 0 (extremely poor) to 7 (very good). An example of encoding ROCE is shown in Table I. The other input variables are encoded by the same principle; that is, the binary string of a gene consists of three single bits, as illustrated by Fig. 1.
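As a small illustration of this initialization step, the following Python sketch (ours, not from the paper; the helper name init_population is hypothetical) creates a population of random bit-string chromosomes with four 3-bit genes, one gene per financial indicator:

    import random

    def init_population(pop_size, n_genes=4, bits_per_gene=3):
        # Randomly assign "1" and "0" to all genes of every chromosome.
        return ["".join(random.choice("01") for _ in range(n_genes * bits_per_gene))
                for _ in range(pop_size)]

    # e.g. init_population(4) returns four 12-bit chromosomes such as "101001110010"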
TABLE I
AN EXAMPLE OF ENCODING ROCE

ROCE value       Status   Encoding
(-∞, -30%]       0        000
(-30%, -20%]     1        001
(-20%, -10%]     2        010
(-10%, 0%]       3        011
(0%, 10%]        4        100
(10%, 20%]       5        101
(20%, 30%]       6        110
(30%, +∞)        7        111
It is worth noting that 3-digit encoding is used for simplicity in this study. A 4-digit encoding could also be adopted, but the computation would be rather more complex.
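The interval-to-gene mapping of Table I can be sketched as follows (our illustration; the function name encode_roce is hypothetical):

    def encode_roce(roce_percent):
        # Map a ROCE value (in %) to its status 0-7 and 3-bit gene per Table I.
        upper_ends = [-30, -20, -10, 0, 10, 20, 30]
        status = sum(roce_percent > u for u in upper_ends)
        return format(status, "03b")

    # encode_roce(15.0) -> "101" (status 5); encode_roce(-35.0) -> "000" (status 0)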
The subsequent work is to evaluate the chromosomes generated by the previous operation with a so-called fitness function. The design of the fitness function is a crucial point in using GA, because it determines what the GA should optimize. Since the output is an estimated stock ranking of the designated testing companies, an actual stock ranking should be defined in advance for designing the fitness function. Here we use the annual price return (APR) to rank the listed stocks; the APR is represented as

$APR_n = (ASP_n - ASP_{n-1}) / ASP_{n-1}$,  (5)

where $APR_n$ is the annual price return for year $n$ and $ASP_n$ is the annual stock price for year $n$. Usually, stocks with a high annual price return are regarded as good stocks. With the value of APR evaluated for each of the $N$ trading stocks, each stock is assigned a rank $r$ ranging from 1 to $N$, where 1 corresponds to the highest value of the APR and $N$ to the lowest. For convenience of comparison, the stock's rank $r$ is mapped linearly into a stock ranking ranging from 0 to 7 according to the following equation:

$ranking = 7 \times (N - r) / (N - 1)$.  (6)
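Under the definitions above, the fitness evaluation can be sketched in a few lines of Python (ours; the helper names are hypothetical, and the RMSE fitness follows the evaluation function described in the experiment section):

    import numpy as np

    def apr(asp_now, asp_prev):
        # Annual price return, Eq. (5).
        return (asp_now - asp_prev) / asp_prev

    def actual_rankings(aprs):
        # Rank N stocks by APR (r = 1 for the highest APR) and map the
        # rank linearly onto the 0-7 ranking scale of Eq. (6).
        aprs = np.asarray(aprs, dtype=float)
        r = (-aprs).argsort().argsort() + 1
        N = len(aprs)
        return 7.0 * (N - r) / (N - 1)

    def fitness(derived, actual):
        # RMSE between GA-derived and actual rankings (to be minimized).
        d, a = np.asarray(derived, float), np.asarray(actual, float)
        return np.sqrt(np.mean((d - a) ** 2))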
The sample data are collected from the Shanghai Stock Exchange (http://www.sse.com.cn) and span the period from January 2, 2002 to December 31, 2004. The monthly and yearly data used in this study are computed from the daily data. For the simulation, 100 stocks are selected from the Shanghai A-share market; their stock codes range from 600000 to 600100.
First of all, the companies' financial information, as the input variables, is fed into the GA to obtain a derived company ranking. This output is compared with the actual stock ranking in terms of APR, as indicated by Equations (5) and (6). In the GA optimization process, the RMSE between the derived and the actual ranking of each stock is calculated and serves as the evaluation function of the GA. The best chromosome obtained is used to rank the stocks, and the top n stocks are chosen for the portfolio. For experimental purposes, the top 10 and the top 20 stocks according to the GA ranking of stock quality are chosen for testing, and each set is used to construct a portfolio. For convenience, equally weighted portfolios are built for comparison purposes.
In order to evaluate the usefulness of the GA optimization, we compare the net accumulated return generated by the stocks selected by the GA with a benchmark. The benchmark return is determined by an equally weighted portfolio of all the stocks available in the experiment. Fig. 2 shows the results for the different portfolios.
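The benchmark comparison can be reproduced with a sketch of this form (ours; it assumes simple compounding of the equally weighted average per-period return, and the variable names are hypothetical):

    import numpy as np

    def accumulated_return(returns):
        # returns: (periods x stocks) array of per-period simple returns.
        portfolio = np.asarray(returns, dtype=float).mean(axis=1)  # equal weights
        return np.cumprod(1.0 + portfolio) - 1.0

    # benchmark_curve = accumulated_return(all_returns)             # all stocks
    # ga_curve = accumulated_return(all_returns[:, top10_indices])  # GA top-10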
Fig. 2 Accumulated return for different portfolios
From Fig. 2, we can see that the net accumulated return of the equally weighted portfolio formed by the stocks selected by the GA significantly outperforms the benchmark. In addition, the performance of the portfolio of 10 stocks is better than that of the portfolio of 20 stocks. As we know, a portfolio does not only focus on the expected return but also on risk minimization. The larger the number of stocks in a portfolio, the more flexibility the portfolio has to achieve the composition that best avoids risk. However, selecting good-quality stocks is the prerequisite for obtaining a good portfolio. That is, although a portfolio with a large number of stocks can lower the risk to some extent, some bad-quality stocks may be included in the portfolio, which harms its performance. Meanwhile, this result also demonstrates that if investors select good-quality stocks, a portfolio with a large number of stocks does not necessarily outperform a portfolio with a small number of stocks. It is therefore wise for investors to select a limited number of good-quality stocks when constructing a portfolio.

IV. CONCLUSIONS

This study uses a genetic optimization algorithm to perform stock selection for portfolio construction. The experimental results reveal that the GA optimization approach is useful for the problem of stock selection and can mine the most valuable stocks for investors.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. Their comments have improved the quality of the paper immensely.

REFERENCES
[1] A. U. Levin, "Stock selection via nonlinear multi-factor models," Advances in Neural Information Processing Systems, 1995, pp. 966-972.
[2] T. C. Chu, C. T. Tsao, and Y. R. Shiue, "Application of fuzzy multiple attribute decision making on company analysis for stock selection," Proceedings of Soft Computing in Intelligent Systems and Information Processing, 1996, pp. 509-514.
[3] M. R. Zargham and M. R. Sayeh, "A web-based information system for stock selection and evaluation," Proceedings of the First International Workshop on Advance Issues of E-Commerce and Web-Based Information Systems, 1999, pp. 81-83.
[4] A. Fan and M. Palaniswami, "Stock selection using support vector machines," Proceedings of the International Joint Conference on Neural Networks, 2001, pp. 1793-1798.
[5] L. Lin, L. Cao, J. Wang, and C. Zhang, "The applications of genetic algorithms in stock market data mining optimization," in Data Mining V, A. Zanasi, N. F. F. Ebecken, and C. A. Brebbia, Eds. WIT Press, 2004.
[6] S. H. Chen, Genetic Algorithms and Genetic Programming in Computational Finance. Dordrecht: Kluwer Academic Publishers, 2002.
[7] J. Thomas and K. Sycara, "The importance of simplicity and validation in genetic programming for data mining in financial data," Proceedings of the Joint AAAI-1999 and GECCO-1999 Workshop on Data Mining with Evolutionary Algorithms, 1999.
[8] J. H. Holland, "Genetic algorithms," Scientific American, vol. 267, pp. 66-72, 1992.
[9] D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, MA, 1989.
A Comparison Study of Multiclass Classification between Multiple Criteria
Mathematical Programming and Hierarchical Method for Support Vector
Machines
Yi Peng 1, Gang Kou 1, Yong Shi 1,2,3, Zhenxing Chen 1 and Hongjin Yang 2

1 College of Information Science & Technology, University of Nebraska at Omaha, Omaha, NE 68182, USA
{ypeng, gkou, zchen}@mail.unomaha.edu
2 Chinese Academy of Sciences Research Center on Data Technology & Knowledge Economy, Graduate University of the Chinese Academy of Sciences, Beijing 100080, China
{yshi, hjyang}@gucas.ac.cn
3 The corresponding author
Abstract

Multiclass classification refers to classifying data objects into more than two classes. The purpose of this paper is to compare two multiclass classification approaches: Multiple Criteria Mathematical Programming (MCMP) and the hierarchical method for Support Vector Machines (SVM). While MCMP considers all classes at once, SVM was initially designed for binary classification. It is still an ongoing research issue to extend SVM from two-class to multiclass classification, and many proposed approaches use a hierarchical method. In this paper, we focus on one common hierarchical method, pairwise classification. We compare the performance of MCMP and the SVM pairwise approach using KDD99, a large network intrusion dataset. Results show that MCMP achieves better multiclass classification accuracy than pairwise SVM.

Keywords: classification, multi-group classification, multiple criteria mathematical programming (MCMP), pairwise classification

1. INTRODUCTION
As one of the major data mining
functionalities, classification has broad
applications such as credit card portfolio
management, medical diagnosis, and fraud
detection. Based on historical information,
classification builds classifiers to predict
categorical class labels for unknown data.
Classification methods can be classified in
various ways, and one distinction is between
binary and multiclass classification. Binary classification, as the name indicates, classifies data into two classes. Multiclass classification refers to classifying data objects into more than two classes. Many real-life applications require multiclass classification. For example, a multiclass classifier that is capable of predicting subtypes of cancer is more helpful than a binary classifier that can only predict cancer or non-cancer.
Researchers have suggested various
multiclass classification methods. Multiple
Criteria Mathematical Programming (MCMP)
and Hierarchical Method for Support Vector
Machines (SVM) are two of them. MCMP
and SVM are both based on mathematical
programming, and no comparison study between them has been conducted to date. The purpose of this paper is to compare these two multiclass classification approaches. While MCMP considers all classes at once, SVM was initially designed for binary classification. It is still an ongoing research issue to extend SVM from two-class to multiclass classification, and many proposed approaches use a hierarchical approach. In this paper, we focus on one common hierarchical method, pairwise classification. We first introduce MCMP and SVM pairwise classification, and then implement an experiment to compare their performance using KDD99, a large network intrusion dataset.

This paper is structured as follows. The next section discusses the formulation of the multiple-group multiple criteria mathematical programming classification model. The third section describes the pairwise SVM multiclass classification method. The fourth section compares the performance of MCMP and pairwise SVM using KDD99. The last section concludes the paper.
2. MULTI-GROUP MULTI-CRITERIA MATHEMATICAL PROGRAMMING MODEL
This section introduces an MCMP model for multiclass classification. Simply speaking, this method classifies observations into distinct groups according to two criteria. The following models represent this concept mathematically.

Given an r-dimensional attribute vector $a = (a_1, \ldots, a_r)$, let $A_i = (A_{i1}, \ldots, A_{ir}) \in \mathbb{R}^r$ be one of the sample records, where $i = 1, \ldots, n$ and $n$ represents the total number of records in the dataset. Suppose $k$ groups $G_1, G_2, \ldots, G_k$ are predefined, with $G_i \cap G_j = \emptyset$ for $i \neq j$, $1 \le i, j \le k$, and $A_i \in \{G_1 \cup G_2 \cup \ldots \cup G_k\}$, $i = 1, \ldots, n$. A series of boundary scalars $b_1 < b_2 < \ldots < b_{k-1}$ can be set to separate these $k$ groups; the boundary $b_j$ is used to separate $G_j$ and $G_{j+1}$. Let $X = (x_1, \ldots, x_r)^T \in \mathbb{R}^r$ be a vector of real numbers to be determined. Thus, we can establish the following linear inequalities (Fisher 1936, Shi et al. 2001):

$A_i X < b_1$, for all $A_i \in G_1$;  (1)
$b_{j-1} \le A_i X < b_j$, for all $A_i \in G_j$;  (2)
$A_i X \ge b_{k-1}$, for all $A_i \in G_k$;  (3)
where $2 \le j \le k-1$ and $1 \le i \le n$.

A mathematical function $f$ can be used to describe the summation of total overlapping, while another mathematical function $g$ represents the aggregation of all distances. The final classification accuracy of this multi-group classification problem depends on simultaneously minimizing $f$ and maximizing $g$. Thus, a generalized bi-criteria programming method for classification can be formulated as:

(Generalized Model) Minimize $f$ and Maximize $g$
Subject to: (1), (2) and (3).

To formulate the criteria and complete constraints for data separation, some variables need to be introduced. In the classification problem, $A_i X$ is the score for the $i$th data record. If an element $A_i \in G_j$ is misclassified into a group other than $G_j$, let $\alpha_{i,j}$ (with $p$-norm $\|\alpha_{i,j}\|_p$, $1 \le p \le \infty$) be the Euclidean distance from $A_i$ to $b_j$, with $A_i X = b_j + \alpha_{i,j}$, $1 \le j \le k-1$, and let $\alpha_{i,j-1}$ be the Euclidean distance from $A_i \in G_j$ to $b_{j-1}$, with $A_i X = b_{j-1} - \alpha_{i,j-1}$, $2 \le j \le k$. Otherwise, $\alpha_{i,j} = 0$, $1 \le j \le k$, $1 \le i \le n$. Therefore, the function $f$ of the total overlapping of data can be represented as

$f(\alpha) = \sum_{j=1}^{k} \sum_{i=1}^{n} \|\alpha_{i,j}\|_p$.
If an element $A_i \in G_j$ is correctly classified into $G_j$, let $\zeta_{i,j}$ be the Euclidean distance from $A_i$ to $b_j$, with $A_i X = b_j - \zeta_{i,j}$, $1 \le j \le k-1$, and let $\zeta_{i,j-1}$ be the Euclidean distance from $A_i \in G_j$ to $b_{j-1}$, with $A_i X = b_{j-1} + \zeta_{i,j-1}$, $2 \le j \le k$. Otherwise, $\zeta_{i,j} = 0$, $1 \le j \le k$, $1 \le i \le n$. Thus, the objective is to maximize the distance $\zeta_{i,j}$ from $A_i$ to the boundary if $A_i \in G_1$ or $G_k$, and to minimize the distance $\|\zeta_{i,j} - \frac{b_j - b_{j-1}}{2}\|_p$ from $A_i$ to the middle of the two adjunct boundaries $b_{j-1}$ and $b_j$ if $A_i \in G_j$, $2 \le j \le k-1$. So the function $g$ of the distances of every data record to its class boundary or boundaries can be represented as

$g(\zeta) = \sum_{j=1 \text{ or } j=k} \sum_{i=1}^{n} \|\zeta_{i,j}\|_p - \sum_{j=2}^{k-1} \sum_{i=1}^{n} \left\|\zeta_{i,j} - \frac{b_j - b_{j-1}}{2}\right\|_p$.

Furthermore, to transform the generalized bi-criteria classification model into a single-criterion problem, weights $w_\alpha > 0$ and $w_\zeta > 0$ are introduced for $f(\alpha)$ and $g(\zeta)$, respectively. The values of $w_\alpha$ and $w_\zeta$ can be pre-defined in the process of identifying the optimal solution. As a result, the generalized model can be converted into the following single-criterion mathematical programming model:

(Model 1) Minimize $w_\alpha \sum_{j=1}^{k} \sum_{i=1}^{n} \|\alpha_{i,j}\|_p - w_\zeta \left[\sum_{j=1 \text{ or } j=k} \sum_{i=1}^{n} \|\zeta_{i,j}\|_p - \sum_{j=2}^{k-1} \sum_{i=1}^{n} \left\|\zeta_{i,j} - \frac{b_j - b_{j-1}}{2}\right\|_p\right]

Subject to:
$A_i X = b_j + \alpha_{i,j} - \zeta_{i,j}$, $1 \le j \le k-1$,  (4)
$A_i X = b_{j-1} - \alpha_{i,j-1} + \zeta_{i,j-1}$, $2 \le j \le k$,  (5)
$\zeta_{i,j} \le b_j - b_{j-1}$, $2 \le j \le k$,  (a)
$\zeta_{i,j} \le b_{j+1} - b_j$, $1 \le j \le k-1$,  (b)
where $A_i$, $i = 1, \ldots, n$, are given, $X$ and $b_j$ are unrestricted, and $\alpha_{i,j}, \zeta_{i,j} \ge 0$, $1 \le i \le n$.

Constraints (a) and (b) are defined as such because the distances from any correctly classified data point ($A_i \in G_j$, $2 \le j \le k-1$) to its two adjunct boundaries $b_{j-1}$ and $b_j$ must be less than $b_j - b_{j-1}$. A better separation of two adjunct groups may be achieved by the following constraints instead of (a) and (b), because (c) and (d) set up stronger limitations on $\zeta_{i,j}$:

$\zeta_{i,j} \le (b_{j+1} - b_j)/2 + \varepsilon$, $1 \le j \le k-1$,  (c)
$\zeta_{i,j} \le (b_j - b_{j-1})/2 + \varepsilon$, $2 \le j \le k$,  (d)
where $\varepsilon \in \mathbb{R}$ is a small positive real number.

Let $p = 2$. The objective function in Model 1 then becomes quadratic, and we have:

(Model 2) Minimize $w_\alpha \sum_{j=1}^{k} \sum_{i=1}^{n} \alpha_{i,j}^2 - w_\zeta \left[\sum_{j=1 \text{ or } j=k} \sum_{i=1}^{n} \zeta_{i,j}^2 - \sum_{j=2}^{k-1} \sum_{i=1}^{n} \left(\zeta_{i,j}^2 - (b_j - b_{j-1})\zeta_{i,j}\right)\right]  (6)

Subject to: (4), (5), (c) and (d).

Note that the constant $\left(\frac{b_j - b_{j-1}}{2}\right)^2$ is omitted from (6) without any effect on the solution.

A version of Model 2 for three predefined classes is given in Figure 1. The stars represent group 1 data objects, the black dots represent group 2 data objects, and the white circles represent group 3 data objects.
Figure 1. A Three-class Model: groups $G_1$ (stars), $G_2$ (black dots) and $G_3$ (white circles) separated by boundaries $b_1$ and $b_2$ (with tolerance $\varepsilon$), where $A_i X = b_j + \alpha_{i,j} - \zeta_{i,j}$, $j = 1, 2$, and $A_i X = b_{j-1} - \alpha_{i,j-1} + \zeta_{i,j-1}$, $j = 2, 3$.
These models can be used in multiclass classification, and the applicability of the models depends on the nature of the given dataset. If the adjunct groups in the dataset do not have any overlapping data, Model 3 or Model 4 is more appropriate; otherwise, Model 2 can generate better results.

Model 2 can be regarded as a "weak separation formula" since it allows overlapping. In addition, a "medium separation formula" can be constructed on the absolute class boundaries (Model 3), without any overlapping data. Furthermore, a "strong separation formula" (Model 4), which requires a non-zero distance between the boundaries of two adjunct groups, emphasizes the non-overlapping characteristic between adjunct groups.

(Model 3) Minimize (6)
Subject to: (c) and (d),
$A_i X \le b_j - \zeta_{i,j}$, $1 \le j \le k-1$,
$A_i X \ge b_{j-1} + \zeta_{i,j-1}$, $2 \le j \le k$,
where $A_i$, $i = 1, \ldots, n$, are given, $X$ and $b_j$ are unrestricted, and $\alpha_{i,j}, \zeta_{i,j} \ge 0$, $1 \le i \le n$.

(Model 4) Minimize (6)
Subject to: (c) and (d),
$A_i X \le b_j - \alpha_{i,j} - \zeta_{i,j}$, $1 \le j \le k-1$,
$A_i X \ge b_{j-1} + \alpha_{i,j-1} + \zeta_{i,j-1}$, $2 \le j \le k$,
where $A_i$, $i = 1, \ldots, n$, are given, $X$ and $b_j$ are unrestricted, and $\alpha_{i,j}, \zeta_{i,j} \ge 0$, $1 \le i \le n$.
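To make the deviation variables in Eqs. (4) and (5) concrete, the following Python sketch (our illustration, not part of the paper; the helper name mcmp_deviations is hypothetical) computes the alpha and zeta values for a single record from its score $A_i X$, its group and the boundaries:

    def mcmp_deviations(score, g, b):
        # score: A_i X for one record; g: its group index in 1..k;
        # b: boundaries [b_1, ..., b_{k-1}], so b[j-1] is the paper's b_j.
        # Returns dicts alpha[j], zeta[j] for the boundaries adjacent to G_g,
        # satisfying A_i X = b_g + alpha_g - zeta_g at the upper boundary and
        # A_i X = b_{g-1} - alpha_{g-1} + zeta_{g-1} at the lower boundary.
        k = len(b) + 1
        alpha, zeta = {}, {}
        if g <= k - 1:                   # upper boundary b_g exists
            d = score - b[g - 1]
            alpha[g] = max(d, 0.0)       # overlap: record crossed above b_g
            zeta[g] = max(-d, 0.0)       # interior distance below b_g
        if g >= 2:                       # lower boundary b_{g-1} exists
            d = b[g - 2] - score
            alpha[g - 1] = max(d, 0.0)   # overlap: record crossed below b_{g-1}
            zeta[g - 1] = max(-d, 0.0)   # interior distance above b_{g-1}
        return alpha, zeta

    # With k = 3 and b = [2.0, 4.0], a G2 record scoring 4.5 gets
    # alpha[2] = 0.5 (overlap past b_2) and zeta[1] = 2.5 (distance above b_1).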
3. SVM PAIRWISE MULTICLASS CLASSIFICATION

Statistical Learning Theory was proposed by Vapnik and Chervonenkis in the 1960s. The Support Vector Machine (SVM) is one of the kernel-machine-based statistical learning methods; it can be applied to various types of data and can detect the internal relations among the data objects. Given a set of data, one can define a kernel matrix to construct an SVM and compute an optimal hyperplane in the feature space induced by the kernel (Vapnik, 1995). Different multiclass training strategies exist for SVM, such as one-against-rest classification, one-against-one (pairwise) classification, and error-correcting output codes (ECOC).
LIBSVM is a well-known free software
package for support vector classification. We
use the latest version, LIBSVM 2.8, in our experimental study. This software uses the one-against-one (pairwise) method for multiclass SVM (Chang and Lin, 2001). The one-against-one method was first proposed by Knerr et al. in 1990. It constructs a total of $k(k-1)/2$ binary SVM classifiers, where each classifier is trained on two distinct classes of the total $k$ classes (Hsu and Lin, 2002). The following quadratic program is used $k(k-1)/2$ times to generate the multi-category SVM classifiers:
Minimize $(W/2)\|E\|^2 + (1/2)\|(x, b)\|^2$
Subject to: $D(AX - eb) \ge e - E$, where $e$ is a vector of ones.

After the $k(k-1)/2$ SVM classifiers have been produced, a majority-vote strategy is applied to them: each classifier casts one vote, and every data point is predicted to belong to the class with the largest number of votes.
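The voting scheme just described can be sketched in a few lines of Python; this is our own illustration using scikit-learn's SVC rather than LIBSVM, and the helper name one_against_one is hypothetical:

    from itertools import combinations
    import numpy as np
    from sklearn.svm import SVC

    def one_against_one(X_train, y_train, X_test, classes):
        # Train k(k-1)/2 binary SVMs; each casts one vote per test point.
        votes = np.zeros((len(X_test), len(classes)), dtype=int)
        for a, b in combinations(range(len(classes)), 2):
            mask = np.isin(y_train, [classes[a], classes[b]])
            clf = SVC(kernel="rbf").fit(X_train[mask], y_train[mask])
            pred = clf.predict(X_test)
            votes[pred == classes[a], a] += 1
            votes[pred == classes[b], b] += 1
        # majority vote: assign each point to the class with the largest vote
        return np.asarray(classes)[votes.argmax(axis=1)]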
4. EXPERIMENTAL COMPARISON OF MCMP AND PAIRWISE SVM
The KDD99 dataset was provided by the Defense Advanced Research Projects Agency (DARPA) in 1998 for the competitive evaluation of intrusion detection approaches. The KDD99 dataset contains 9 weeks of raw TCP data from a simulation of a typical U.S. Air Force LAN. A version of this dataset was used in the 1999 KDD-CUP intrusion detection contest (Stolfo et al. 2000). After the contest, KDD99 has become a de facto standard dataset for intrusion detection experiments. There are four main categories of attacks: denial-of-service (DOS); unauthorized access from a remote machine (R2L); unauthorized access to local root privileges (U2R); and surveillance and other probing (Probe). Because the number of U2R attacks is too small (52 records), only three types of attacks, DOS, R2L and Probe, are used in this experiment. The KDD99 dataset used in this experiment has 4,898,430 records, of which 1,071,730 are distinct.

MCMP was solved by LINGO 8.0, a software tool for solving nonlinear models (LINDO Systems Inc.). LIBSVM version 2.8 (Chang and Lin, 2001), an integrated software package that uses the pairwise approach to support multiclass SVM classification, was applied to the KDD99 data, and the classification results of LIBSVM were compared with MCMP's.

The four-group classification results of MCMP and LIBSVM on the KDD99 data are summarized in Table 1 and Table 2, respectively. The classification results are displayed in the format of confusion matrices, which pinpoint the kinds of errors made. From the confusion matrices in Tables 1 and 2, we observe that: (1) LIBSVM achieves perfect classification on the training data (100% accuracy), and the training results of MCMP are almost perfect (100% accuracy for "Probe" and "DOS", 99% for "Normal" and "R2L"); (2) contrasting LIBSVM's training accuracies with its testing accuracies, its performance is unstable: LIBSVM achieves almost perfect classification for the "Normal" class (99.99% accuracy) but poor performance on the three attack types (44.48% for "Probe", 53.17% for "R2L" and 74.49% for "DOS"); (3) MCMP has a stable performance on the testing data: 97.20% accuracy for "Probe", 99.07% for "DOS", 88.43% for "R2L" and 97.05% for "Normal".
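For reference, the per-class rates reported in Tables 1 and 2 can be recomputed from a confusion matrix as follows. This sketch is ours, and it assumes that the tables' false alarm rate is the share of records assigned to a class that do not actually belong to it, which matches the printed numbers:

    import numpy as np

    def per_class_rates(cm):
        # cm[i, j] = number of class-i records classified as class j.
        cm = np.asarray(cm, dtype=float)
        tp = np.diag(cm)
        accuracy = tp / cm.sum(axis=1)            # per true class (recall)
        false_alarm = 1.0 - tp / cm.sum(axis=0)   # wrong share per predicted class
        return accuracy, false_alarm

    # e.g. the MCMP test matrix of Table 1 yields 97.20%/99.07%/88.43%/97.05%
    # accuracy and 7.88%/6.32%/91.86%/0.02% false alarm rates.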
Table 1. MCMP KDD99 Classification Results

Evaluation on training data (400 cases); rows are true classes, columns are classified-as:

                (1)     (2)     (3)     (4)     Accuracy    False Alarm Rate
(1): Probe      100       0       0       0     100.00%      0.99%
(2): DOS          0     100       0       0     100.00%      0.00%
(3): R2L          0       0      99       1      99.00%      0.00%
(4): Normal       1       0       0      99      99.00%      1.00%

Evaluation on test data (1071330 cases):

                (1)      (2)      (3)      (4)     Accuracy    False Alarm Rate
(1): Probe    13366      216      145       24      97.20%      7.88%
(2): DOS       1084   244867     1202       14      99.07%      6.32%
(3): R2L          1        4      795       99      88.43%     91.86%
(4): Normal      59    16313     7623   788718      97.05%      0.02%
Table 2. LIBSVM KDD99 Classification Results

Evaluation on training data (400 cases); rows are true classes, columns are classified-as:

                (1)     (2)     (3)     (4)     Accuracy    False Alarm Rate
(1): Probe      100       0       0       0     100.00%      0.00%
(2): DOS          0     100       0       0     100.00%      0.00%
(3): R2L          0       0     100       0     100.00%      0.00%
(4): Normal       0       0       0     100     100.00%      0.00%

Evaluation on test data (1071330 cases):

                (1)      (2)      (3)      (4)     Accuracy    False Alarm Rate
(1): Probe     6117      569        0     7065      44.48%     67.84%
(2): DOS      12861   184107        0    50199      74.49%      0.31%
(3): R2L          0        0      478      421      53.17%      6.64%
(4): Normal      41        0       34   812638      99.99%      6.63%
5. CONCLUSION

This is the first time the differences between MCMP and pairwise SVM for multiclass classification have been investigated using a large network intrusion dataset. The results indicate that MCMP achieves better classification accuracy than pairwise SVM. In our future research, we will focus on the theoretical differences between these two multiclass approaches.
References

Bradley, P. S., Fayyad, U. M. and Mangasarian, O. L. (1999) Mathematical programming for data mining: formulations and challenges. INFORMS Journal on Computing, 11, 217-238.

Chang, C. C. and Lin, C. J. (2001) LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Hsu, C. W. and Lin, C. J. (2002) A comparison of methods for multi-class support vector machines, IEEE Transactions on Neural Networks, 13(2), 415-425.

Knerr, S., Personnaz, L. and Dreyfus, G. (1990) "Single-layer learning revisited: a stepwise procedure for building and training a neural network", in Neurocomputing: Algorithms, Architectures and Applications, J. Fogelman, Ed. New York: Springer-Verlag.

Kou, G., Peng, Y., Shi, Y., Chen, Z. and Chen, X. (2004b) "A Multiple-Criteria Quadratic Programming Approach to Network Intrusion Detection", in Y. Shi et al. (Eds.): CASDMKM 2004, LNAI 3327, Springer-Verlag Berlin Heidelberg, 145-153.

LINDO Systems Inc., An overview of LINGO 8.0, http://www.lindo.com/cgi/frameset.cgi?leftlingo.html;lingof.html.

Stolfo, S. J., Fan, W., Lee, W., Prodromidis, A. and Chan, P. K. (2000) Cost-based Modeling and Evaluation for Data Mining with Application to Fraud and Intrusion Detection: Results from the JAM Project, DARPA Information Survivability Conference.

Vapnik, V. N. and Chervonenkis, A. (1964) On one class of perceptrons, Automation and Remote Control, 25(1).

Vapnik, V. N. (1995) The Nature of Statistical Learning Theory, Springer, New York.

Zhu, D., Premkumar, G., Zhang, X. and Chu, C. H. (2001) Data Mining for Network Intrusion Detection: A Comparison of Alternative Methods, Decision Sciences, 32(4), Fall 2001.
Pattern Recognition for Multimedia Communication
Networks Using New Connection Models
between MCLP and SVM
Jing HE
Institute of Intelligent Information
and Communication Technology
Konan University
Kobe 658-8501, Japan
Email: [email protected]
Wuyi YUE
Department of Information Science
and Systems Engineering
Konan University
Kobe 658-8501, Japan
Email: [email protected]
Yong SHI
Chinese Academy of Sciences
Research Center on Data Technology
and Knowledge Economy
Beijing 100080, China
Email: [email protected]
Abstract— A data mining system for performance evaluation of multimedia communication networks (MCNs) is a challenging research and development issue. The data mining system offers techniques for discovering patterns in voluminous databases. By dividing the performance data into usual and unusual categories, we try to find the category corresponding to the data mining system. Many pattern recognition algorithms for data mining systems have been developed and explored in recent years, such as rough sets, rough-fuzzy hybridization, granular computing, artificial neural networks, support vector machines (SVM), and multiple criteria linear programming (MCLP). In this paper, a new connection model between MCLP and SVM is employed to identify performance data. In addition to the theoretical foundations, the paper also includes experimental results. Some real-time and nontrivial examples for MCNs given in this paper show how MCLP and SVM work and how they can be combined to be used at the same time in reality. The advantages that each algorithm offers are compared with those of the other methods.

I. INTRODUCTION

A data mining system for performance evaluation of multimedia communication networks (MCNs) is a challenging research and development issue. The data mining system offers techniques for discovering patterns in voluminous databases. Fraudulent activity costs the telecommunication industry millions of dollars a year.

It is important to identify potentially fraudulent users and their typical usage patterns, and to detect their attempts to gain fraudulent entry in order to perpetrate illegal activity. Several ways of identifying unusual patterns can be used, such as multidimensional analysis, cluster analysis and outlier analysis [1].

By dividing the performance data into usual and unusual categories, we try to find the category corresponding to the data mining system. Many pattern recognition algorithms for data mining have been developed and explored in recent years, such as rough sets, rough-fuzzy hybridization, granular computing, artificial neural networks, support vector machines (SVM), multiple criteria linear programming (MCLP) and so on [2].

SVM has been gaining popularity as one of the effective methods for machine learning in recent years. In pattern
classification problems with two class sets, SVM generalizes linear classifiers into high-dimensional feature spaces through non-linear mappings. The non-linear mappings are defined implicitly by kernels in the Hilbert space. This means SVM may produce non-linear classifiers in the original data space. Linear classifiers are then optimized to give the maximal margin separation between the classes [3]-[5].

Research on the linear programming (LP) approach to classification problems was initiated in [6]-[8]; [9] and [10] applied the compromise solution of MCLP to deal with the same question. In [11], an analysis using fuzzy linear programming (FLP) for the classification of credit card holder behaviors was presented. During the calculations in [11], we found that, except for some approaches such as MCLP and SVM, many data mining algorithms try to minimize the influence of outliers or eliminate them altogether.

In other words, the unusual outliers may be of particular interest, as in the case of unusual pattern detection, where unusual outliers may indicate fraudulent activities. Thus the identification of usual and unusual patterns is an interesting data mining task, referred to as "pattern recognition".

In this paper, by dividing the performance data into usual and unusual categories, we try to find the category corresponding to the data mining system. The new pattern recognition model, which connects MCLP and SVM, is employed to identify performance data.

Some real-time and non-trivial examples for MCNs with different pattern recognition approaches, such as SVM, LP and MCLP, are given to show how the different techniques work and how they can be used in reality. The advantages that the different algorithms offer are compared with each other, and the results of the comparisons are reported in this paper.

In Section II, we describe the basic formulas of MCLP and SVM. Connection models between MCLP and SVM are presented in Section III. The real-time data experiments of pattern recognition for MCNs are given in Section IV. Finally, we conclude the paper with a brief summary in Section V.
II. BASIC FORMULAS OF SVM AND MCLP

Support Vector Machines (SVMs) were developed in [3], [12], and their main features are as follows:
(1) SVM maps the original data set into a high-dimensional feature space by a non-linear mapping implicitly defined by kernels in the Hilbert space.
(2) SVM finds linear classifiers with the maximal margins in the feature space.
(3) SVM provides an evaluation of the generalization ability.
A. Hard Margin SVM
We define two classes A and B among the training data set $(x_i, y_i)$, $i = 1, \ldots, n$. We use a variable $y_i$, with the two values 1 and -1, to represent which of the classes A and B a training sample belongs to: namely, if $x_i \in A$, then $y_i = 1$; if $x_i \in B$, then $y_i = -1$.

Let $w$ be a separating hyperplane parameter and $b$ be a separating parameter, where $w \in \mathbb{R}^r$ and $r$ is the attribute size. Then we use the separating hyperplane $w^T x = b$ to separate the samples, where $b$ is a boundary value. From the above definition, we know that $w^T x_i > b$ for $x_i \in A$ and $w^T x_i < b$ for $x_i \in B$. Such a method of separating the samples is called classification.

The separating hyperplane with maximal margins can be given by solving the problem with the normalization $y_i(w^T x_i - b) = 1$ at the points with the minimum interior deviation:

(M1) Minimize $\|w\|$
subject to: $y_i(w^T x_i - b) \ge 1$, $i = 1, \ldots, n$,  (1)
where $\|\cdot\|$ represents the norm function, $(x_i, y_i)$ is given, and $w$ and $b$ are unrestricted.

Several norms are possible: when $\|\cdot\|_2$ is used, the problem is reduced to quadratic programming, while the problem with $\|\cdot\|_1$ or $\|\cdot\|_\infty$ is reduced to linear programming [13]. The SVM method which can separate the two classes A and B completely is called the hard margin SVM method, but the hard margin SVM method tends to cause over-learning. The hard margin SVM method with the quadratic norm is given as follows:

(M2) Minimize $\frac{1}{2}\|w\|_2^2$
subject to: $y_i(w^T x_i - b) \ge 1$, $i = 1, \ldots, n$,  (2)
where $(x_i, y_i)$ is given, and $w$ and $b$ are unrestricted.

The aim of machine learning is to predict which class new patterns belong to on the basis of the given training data set.

B. Soft Margin SVM

The hard margin SVM method is easily affected by noise. In order to overcome this shortcoming, the soft margin SVM method is introduced. The soft margin SVM method allows some slight errors, which are represented by the slack variables (exterior deviations) $\xi_i \ge 0$, $i = 1, \ldots, n$. Using a trade-off parameter $C$ between minimizing $\|w\|$ and minimizing $\sum_{i=1}^{n} \xi_i$, we have the soft margin SVM method as follows:

(M3) Minimize $\frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \xi_i$
subject to: $y_i(w^T x_i - b) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, \ldots, n$,  (3)
where $(x_i, y_i)$ and $C$ are given, and $w$ and $b$ are unrestricted.

It can be seen that the idea of the soft margin SVM method is the same as that of the linear programming approach to linear classifiers. This idea was used in an extension by [14]: not only exterior deviations but also interior deviations can be considered in SVM. We therefore consider algorithms of SVM with both slack variables for misclassified data points (i.e., exterior deviations) and surplus variables for correctly classified data points (i.e., interior deviations).

In order to minimize the slackness and to maximize the surplus, the surplus variable (interior deviation) $\eta_i$, $i = 1, \ldots, n$, is used. The trade-off parameter $C_1$ is used for the slackness variables, and another trade-off parameter $C_2$ is used for the surplus variables. Then we have the following optimization problem:

(M4) Minimize $\frac{1}{2}\|w\|_2^2 + C_1 \sum_{i=1}^{n} \xi_i - C_2 \sum_{i=1}^{n} \eta_i$
subject to: $y_i(w^T x_i - b) \ge 1 - \xi_i + \eta_i$, $i = 1, \ldots, n$,  (4)
where $(x_i, y_i)$, $C_1$ and $C_2$ are given, $\xi_i, \eta_i \ge 0$, and $w$ and $b$ are unrestricted.

C. MCLP

For the classification explained in Subsection A, the multiple criteria linear programming (MCLP) model is used. We want to determine the best coefficients of the variables $w = (w_1, \ldots, w_r)^T$, where $r$ is the attribute size, obtained from the following Eq. (5). A boundary value $b$ is used to separate the two classes A and B:

$w^T x_i \ge b$, $x_i \in A$; $w^T x_i < b$, $x_i \in B$.  (5)

Let $\alpha_i$, $i = 1, \ldots, n$, denote the exterior deviation, which is a deviation from the hyperplane $w^T x = b$ for a misclassified point; similarly, let $\beta_i$, $i = 1, \ldots, n$, denote the interior deviation, which is a deviation from the hyperplane for a correctly classified point. Eq. (5) is equal to the following equation:

$w^T x_i = b - \alpha_i + \beta_i$, $x_i \in A$; $w^T x_i = b + \alpha_i - \beta_i$, $x_i \in B$,  (6)
where $x_i$ is defined in Subsection A and given, $w$ and $b$ are unrestricted, and $\alpha_i, \beta_i \ge 0$.

Our purposes are as follows: (1) to minimize the maximum exterior deviation (decrease the errors as much as possible); (2) to maximize the minimum interior deviation (i.e., maximize the margins); (3) to minimize the weighted sum of exterior deviations (MSD); and (4) to maximize the weighted sum of interior deviations (MMD). MSD can be written as follows:

(M5) Minimize $\sum_{i=1}^{n} \alpha_i$  (8)
subject to the constraints of Eq. (6), where $x_i$ is given, $w$ and $b$ are unrestricted, and $\alpha_i, \beta_i \ge 0$.

The alternative of the above model is to find MMD as follows:

(M6) Maximize $\sum_{i=1}^{n} \beta_i$  (10)
subject to the constraints of Eq. (6), where $x_i$ is given, $w$ and $b$ are unrestricted, and $\alpha_i, \beta_i \ge 0$.
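To make these LP formulations concrete, the following Python sketch (ours; the helper name mclp is hypothetical) solves a weighted combination of MSD and MMD, i.e. minimize $c_1\sum\alpha_i - c_2\sum\beta_i$ subject to Eq. (6), with the boundary $b$ fixed in advance. As noted in Section III for Eq. (8), a normality condition on $w$ may be needed in practice to avoid trivial or unbounded solutions:

    import numpy as np
    from scipy.optimize import linprog

    def mclp(X, y, b=1.0, c1=1.0, c2=0.1):
        # min c1*sum(alpha) - c2*sum(beta)
        # s.t. w.x_i + y_i*alpha_i - y_i*beta_i = b   (Eq. (6); y_i = +1 for A, -1 for B)
        #      alpha, beta >= 0; w free; b fixed here for simplicity.
        # Note: with c2 > 0 the LP can be unbounded without a normality condition.
        n, r = X.shape
        c = np.concatenate([np.zeros(r), c1 * np.ones(n), -c2 * np.ones(n)])
        A_eq = np.hstack([X, np.diag(y).astype(float), -np.diag(y).astype(float)])
        b_eq = np.full(n, b)
        bounds = [(None, None)] * r + [(0, None)] * (2 * n)
        res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
        return res.x[:r], res.x[r:r + n], res.x[r + n:]   # w, alpha, beta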
[11] applied the compromise solution of multiple criteria linear programming to minimize the sum of $\alpha_i$ and maximize the sum of $\beta_i$ simultaneously. A two-criteria linear programming model is given as follows:

(M7) Minimize $\sum_{i=1}^{n} \alpha_i$ and Maximize $\sum_{i=1}^{n} \beta_i$  (11)
subject to the constraints of Eq. (6), where $x_i$ is given, $w$ and $b$ are unrestricted, and $\alpha_i, \beta_i \ge 0$.

A hybrid model presented in [8] that combines Eq. (8) and Eq. (10) is given as follows:

Minimize $c_1 \sum_{i=1}^{n} \alpha_i - c_2 \sum_{i=1}^{n} \beta_i$  (12)
subject to the constraints of Eq. (6), where $x_i$ is given, $w$ and $b$ are unrestricted, and $\alpha_i, \beta_i \ge 0$.

III. CONNECTION MODELS BETWEEN MCLP AND SVM

A. Linearly Separable Examples

It should be noted that the LP of Eq. (8) may yield some unacceptable solutions, such as $w = 0$, as well as unbounded solutions in the goal programming approach. Therefore, some appropriate normality condition must be imposed on $w$ in order to provide a bounded nontrivial optimal solution. One such normality condition is $\|w\| = 1$.

If the classification is linearly separable, then, using the normalization $\|w\| = 1$, the separating hyperplane with the maximal margins can be given by solving the following problem:

(M8) Maximize $\eta$
subject to: $y_i(w^T x_i - b) \ge \eta$, $i = 1, \ldots, n$; $\|w\| = 1$,  (13)
where $x_i$ and $y_i$ are defined in Section II and given, $w$ and $b$ are unrestricted, and $\eta$ denotes the minimum interior deviation.

However, this normality condition makes the problem a non-linear optimization model. Instead of maximizing the minimum interior deviation in Eq. (13), we can use the following equivalent formulation with the normalization $y_i(w^T x_i - b) = 1$ at the points with the minimum interior deviation [15].

Theorem. The discrimination problem of Eq. (13) is equivalent to the formulation used in Eq. (1):

(M1) Minimize $\|w\|$
subject to: $y_i(w^T x_i - b) \ge 1$, $i = 1, \ldots, n$,
where $(x_i, y_i)$ is given, and $w$ and $b$ are unrestricted.

Proof: The discrimination problem of Eq. (13) can be rewritten with an explicit margin variable as:

Maximize $\eta$
subject to: $y_i(w^T x_i - b) \ge \eta$, $i = 1, \ldots, n$; $\|w\|_2 = 1$,  (14)
where $w \in \mathbb{R}^r$, $r$ is the attribute size, $(x_i, y_i)$ is given, and $w$, $b$ and $\eta$ are unrestricted.
First notice that any optimal solution of Eq. (1) must satisfy $y_i(w^T x_i - b) = 1$ for at least one $i$; otherwise $(w, b)$ could be scaled down to reduce $\|w\|$, an impossibility since $\|w\|$ is minimal at the optimum in the strictly convex case. Similarly, $y_i(w^T x_i - b) = \eta$ for at least one $i$ at the optimum of Eq. (14).

Let $(w^*, b^*)$ be an optimal solution of Eq. (1). Then $(w^*/\|w^*\|, b^*/\|w^*\|)$, with $\eta = 1/\|w^*\|$, is well defined and feasible for Eq. (14). Assume it is not the optimal solution of Eq. (14), and let $(\bar{w}, \bar{b}, \bar{\eta})$ with $\bar{\eta} > 1/\|w^*\|$ be the optimal solution instead. Then $(\bar{w}/\bar{\eta}, \bar{b}/\bar{\eta})$ is feasible for Eq. (1) and $\|\bar{w}/\bar{\eta}\| = 1/\bar{\eta} < \|w^*\|$ (the constraint being tight at the optimum), in contradiction with the optimality of $w^*$. Hence $(w^*/\|w^*\|, b^*/\|w^*\|, 1/\|w^*\|)$ is the optimal solution of Eq. (14).

Now let $(\bar{w}, \bar{b}, \bar{\eta})$ be the optimal solution of Eq. (14). Then $(\bar{w}/\bar{\eta}, \bar{b}/\bar{\eta})$ is well defined and feasible for Eq. (1). Again, assume that it is suboptimal, and let $(w^*, b^*)$ be the optimal solution of Eq. (1) with $\|w^*\| < 1/\bar{\eta}$. Then $(w^*/\|w^*\|, b^*/\|w^*\|, 1/\|w^*\|)$ is feasible for Eq. (14) with $1/\|w^*\| > \bar{\eta}$, in contradiction with the optimality of $(\bar{w}, \bar{b}, \bar{\eta})$.

Then M1 and M8 are the same, and the Theorem is proved.
B. Linearly Unseparable Examples

As mentioned for Eq. (8), MSD is as follows:

(M5) Minimize $\sum_{i=1}^{n} \alpha_i$
subject to the constraints of Eq. (6), where $x_i$ and $y_i$ are defined in Section II and given, $w$ and $b$ are unrestricted, and $\alpha_i, \beta_i \ge 0$.

The above equation, like Eq. (8), can be combined with Eq. (1) according to the Theorem:

(M1) Minimize $\|w\|$
subject to: $y_i(w^T x_i - b) \ge 1$, $i = 1, \ldots, n$,
where $(x_i, y_i)$ is given, and $w$ and $b$ are unrestricted.

Then we use $\|\cdot\|_2$ as the norm in Eq. (1). A trade-off parameter $C$ is chosen between minimizing $\sum_{i=1}^{n} \alpha_i$ and minimizing $\|w\|_2$, and we have the formulation of the soft margin SVM method combining Eq. (8) with Eq. (1) as follows:

Minimize $\frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \alpha_i$
subject to: $y_i(w^T x_i - b) \ge 1 - \alpha_i$, $\alpha_i \ge 0$, $i = 1, \ldots, n$,  (15)
where $(x_i, y_i)$ and $C$ are given, and $w$ and $b$ are unrestricted.

Eq. (15) is the same as the SVM formula in Eq. (3).
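For reference, the soft margin SVM of Eq. (3)/Eq. (15) can be run directly with an off-the-shelf solver. The sketch below is our illustration using scikit-learn's SVC, not the Lingo set-up used in the experiments; the parameter C plays the role of the trade-off parameter:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.2, 1.0], [0.4, 0.8], [1.0, 0.2], [0.9, 0.1]])
    y = np.array([1, 1, -1, -1])              # +1 = class A, -1 = class B
    clf = SVC(kernel="linear", C=10.0).fit(X, y)
    w, b = clf.coef_[0], -clf.intercept_[0]   # hyperplane w.x = b
    print(w, b, clf.predict([[0.3, 0.9]]))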
IV. PATTERN RECOGNITION FOR MCNs

A. Real-time Experiment Data

A set of attributes for MCNs, such as throughput capacity, package forwarding rate, response time, connection attempts, delay time and transfer rate, together with the criteria for "unusual patterns", is designed. In these real-time experiments, the two classes of training data sets in MCNs are A and B, as defined in Section II: class A represents the usual pattern, and class B represents the unusual pattern.

The purpose of pattern recognition techniques for MCNs is to find a better classifier through a training data set and to use the classifier to predict all other performance data of MCNs. The most frequently used pattern recognition in the telecommunication industry is still the two-class separation technique. The key question of two-class separation is to separate the "unusual" patterns, called fraudulent activity, from the "usual" patterns, called normal activity. The pattern recognition model is to identify as many MCNs as possible; this is also known as the method of "detecting the fraudulent list". In this section, a real-time performance data mart with 65 derived attributes and 1000 records from a major CHINA TELECOM MCN database is first used to train the different classifiers. Then, the training solutions are employed to predict the performance of another 5000 MCNs. Finally, the classification results of the different models are compared with each other.

B. Accuracy Measure

We would like to be able to assess how well the classifier can recognize "usual" samples (referred to as positive samples) and how well it can recognize "unusual" samples (referred to as negative samples). The sensitivity and specificity measures can be used, respectively, for this purpose. In addition, we may use precision to assess the percentage of samples labeled as "usual" that actually are "usual". These measures are defined as follows:

Sensitivity = t_pos / pos,
Specificity = t_neg / neg,
Precision = t_pos / (t_pos + f_pos),

where t_pos is the number of true positive samples ("usual" samples that were correctly classified as such), pos is the number of positive ("usual") samples, t_neg is the number of true negative samples ("unusual" samples that were correctly classified as such), neg is the number of negative ("unusual") samples, and f_pos is the number of false positive samples ("unusual" samples that were incorrectly labeled as "usual"). It can be shown that Accuracy is a function of Sensitivity and Specificity:

Accuracy = Sensitivity × pos / (pos + neg) + Specificity × neg / (pos + neg).

The higher the four rates (Sensitivity, Specificity, Precision, Accuracy) are, the better the classification results. A threshold is defined in this paper against specificity and precision, depending on the required performance evaluation of the MCNs.
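The four rates can be computed directly from the counts defined above; a minimal sketch (ours):

    def rates(t_pos, pos, t_neg, neg, f_pos):
        sensitivity = t_pos / pos
        specificity = t_neg / neg
        precision = t_pos / (t_pos + f_pos)
        accuracy = (sensitivity * pos + specificity * neg) / (pos + neg)
        return sensitivity, specificity, precision, accuracy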
C. Experiment Results
Previous experience with classification tests showed that the training results of a data set with balanced records (the number of usual samples is equal to the number of unusual samples) may be different from those of an unbalanced data set (the number of usual samples is not equal to the number of unusual samples). Of the 1000 unbalanced training records given, 860 samples are usual and 140 are unusual. Models M1 to M8, given in Sections II and III, can be used for the test.

Namely, M1 is the SVM model with the objective function Minimize $\|w\|$. M2 is the SVM model with the quadratic objective function. M3 is the soft margin SVM model, whose objective adds the slackness term. M4 is the SVM model that minimizes the slackness and maximizes the surplus. M5 is the linear programming model with the objective function Minimize $\sum_i \alpha_i$, called the MSD model. M6 is the linear programming model with the objective function Maximize $\sum_i \beta_i$, called the MMD model. M7 is the MCLP model, and M8 is the MCLP model using the normalization. b is the boundary value for each model; several values of b are used in the calculation for models M1 to M8.

A well-known commercial software package, Lingo [16], has been used to perform the training and predicting processes. The learning results for the unbalanced 1000 records in Sensitivity and Specificity are shown in Table 1, where the columns H are the Sensitivity rates for the usual pattern and the columns K are the Specificity rates for the unusual pattern.
Table 1: Learning Results of Unbalanced 1000 Records in Sensitivity and Specificity.

Table 1 shows the learning results of models M1 to M8 for different values of the boundary b. If a threshold for the specificity rate K is predetermined, then models M1 and M8, as well as M3, M4, M6 and M7 with appropriate boundary values, are satisfied as better classifiers. M1 and M8 have the same results for H and K at all values of b.

The best specificity-rate models at the threshold in the learning results for unusual patterns in K are M1 and M8. The order of the learning results for unusual patterns in the specificity K is M8 = M1, M6, M3, M7, M4, M2, M5.
Table 2 shows the predicting results for the unbalanced 5000 records in Precision with models M1 to M8 for different values of the boundary b.

Table 2: Predicting Results of Unbalanced 5000 Records in Precision.
The Precision rates of models M3, M7 and M4 are as high as in the learning results. M1 and M8 have the same results for H and K at all values of b. If the threshold of the precision of pattern recognition is predetermined as 0.9, then models M3 and M8 with appropriate boundary values are satisfied as better classifiers; the best model at this threshold in the learning results is M3. The order of average predicting precision is M3, M7, M4, M2, M5, M6, M1, M8.

In the data mart of Table 2, M1 and M8 have similar structures and solution characterizations, due to the formulation presented in Section III. When the classification aims at higher specificity, M1 or M8 gives the better results; when the classification aims at higher precision, M3, M4 or M7 gives the better results.

V. CONCLUSION

In this paper, we have proposed a heuristic connection classification method to recognize unusual patterns in multimedia communication networks (MCNs). The algorithm is based on the connection model between multiple criteria linear programming (MCLP) and support vector machines (SVM). Although the mathematical modeling is not new, the framework of the connection configuration is innovative. In addition, empirical training sets and prediction results on real-time MCNs from a major company, CHINA TELECOM, were presented. Comparison studies have shown that the connection model combining MCLP and SVM achieves better learning results with respect to predicting the future performance pattern of MCNs. The connection model also has a great deal of potential to be used in various data mining tasks. Since the connection model is readily implemented by non-linear programming, any available non-linear programming package, such as Lingo, can be used to conduct the data analysis. In the meantime, we have explored other possible connections between SVM and MCLP. The results of ongoing projects solving more complex problems will be reported in the near future.
ACKNOWLEDGMENT

This work was supported in part by the GRANT-IN-AID FOR SCIENTIFIC RESEARCH (No. 16560350) and MEXT.ORC (2004-2008), Japan, and in part by the NSFC (No. 70472074), China.
REFERENCES
[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, An Imprint of Academic Press, San Francisco, 2003.
[2] S. Pal and P. Mitra, Pattern Recognition Algorithms for Data Mining, CRC Press, 2004.
[3] V. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, 1998.
[4] O. Mangasarian, Linear and Nonlinear Separation of Patterns by Linear Programming, Operations Research, 13: 444-452, 1965.
[5] O. Mangasarian, Multisurface Method of Pattern Separation, IEEE Transactions on Information Theory, IT-14: 801-807, 1968.
[6] N. Freed and F. Glover, Simple but Powerful Goal Programming Models for Discriminant Problems, European Journal of Operational Research, 7: 44-60, 1981.
[7] N. Freed and F. Glover, Evaluating Alternative Linear Programming Models to Solve the Two-group Discriminant Problem, Decision Sciences, 17(1): 151-162, 1986.
[8] F. Glover, Improved Linear Programming Models for Discriminant Analysis, Decision Sciences, 21(3): 771-785, 1990.
[9] G. Kou, X. Liu, Y. Peng, Y. Shi, M. Wise and W. Xu, Multiple Criteria Linear Programming Approach to Data Mining: Models, Algorithm Designs and Software Development, Optimization Methods and Software, 18(4): 453-473, 2003.
[10] G. Kou and Y. Shi, Linux based Multiple Linear Programming Classification Program: Version 1.0, College of Information Science and Technology, University of Nebraska-Omaha, U.S.A., 2002.
[11] J. He, X. Liu, Y. Shi, W. Xu and N. Yan, Classification of Credit Cardholder Behavior by using Fuzzy Linear Programming, International Journal of Information Technology & Decision Making, 3(4): 223-229, 2004.
[12] C. Cortes and V. Vapnik, Support Vector Networks, Machine Learning, 20(3): 273-297, 1995.
[13] O. Mangasarian, Arbitrary-Norm Separating Plane, Operations Research Letters, 24: 15-23, 1999.
[14] K. Bennett and O. Mangasarian, Robust Linear Programming Discrimination of Two Linearly Inseparable Sets, Optimization Methods and Software, 1: 23-34, 1992.
[15] P. Marcotte and G. Savard, Novel Approaches to the Discrimination Problem, ZOR - Methods and Models of Operations Research, 36: 517-545, 1992.
[16] http://www.lindo.com/.
[17] J. He, W. Yue and Y. Shi, Identification Mining of Unusual Patterns for Multimedia Communication Networks, Abstract Proc. of Autumn Conference 2005 of Operations Research Society of Japan, 262-263, 2005.
[18] Y. Shi and J. He, Computer-based Algorithms for Multiple Criteria and Multiple Constraint Level Integer Linear Programming, Computers and Mathematics with Applications, 49(5): 903-921, 2005.
[19] T. Asada and H. Nakayama, SVM using Multi Objective Linear Programming and Goal Programming, in T. Tanino, T. Tanaka and M. Inuiguchi (eds.), Multi-objective Programming and Goal Programming, 93-98, 2003.
[20] H. Nakayama and T. Asada, Support Vector Machines Formulated as Multi Objective Linear Programming, Proc. of ICOTA 2001, 1171-1178, 2001.
[21] M. Yoon, Y. B. Yun and H. Nakayama, A Role of Total Margin in Support Vector Machines, Proc. of IJCNN03, 7(4): 2049-2053, 2003.
[22] W. Yue, J. Gu and X. Tang, A Performance Evaluation Index System for Multimedia Communication Networks and Forecasting for Web-based Network Traffic, Journal of Systems Science and Systems Engineering, 13(1): 78-97, 2002.
[23] J. He, Y. Shi and W. Xu, Classifications of Credit Cardholder Behavior by using Multiple Criteria Non-linear Programming, Conference Proc. of the International Conference on Data Mining and Knowledge Management, Lecture Notes in Computer Science series, Springer-Verlag, 2004.
[24] http://www.rulequest.com/see5-info.html/.
[25] http://www.sas.com/.
[26] Y. Shi, M. Wise, M. Luo and Y. Lin, Data Mining in Credit Card Portfolio Management: a Multiple Criteria Decision Making Approach, Multiple Criteria Decision Making in the New Millennium, Springer, Berlin, 2001.
Published by
Department of Mathematics and Computing Science
Technical Report Number: 2005-05 November, 2005
ISBN 0-9738918-1-5