Optimization-based Data Mining Techniques with Applications

Proceedings of a Workshop held in Conjunction with the 2005 IEEE International Conference on Data Mining
Houston, USA, November 27, 2005

Edited by Yong Shi
Chinese Academy of Sciences Research Center on Data Technology & Knowledge Economy
Graduate University of the Chinese Academy of Sciences, Beijing, China

ISBN 0-9738918-1-5

The papers appearing in this book reflect the authors' opinions and are published in the interests of timely dissemination based on review by the program committee or volume editors. Their inclusion in this publication does not necessarily constitute endorsement by the editors. ©2005 by the authors and editors of this book. No part of this work can be reproduced without permission except as indicated by the "Fair Use" clause of the copyright law. Passages, images, or ideas taken from this work must be properly credited in any written or published materials.

ISBN 0-9738918-0-7
Printed by Saint Mary's University, Canada.

CONTENTS

Introduction ............................................................... II
Novel Quadratic Programming Approaches for Feature Selection and Clustering with Applications
  W. Art Chaovalitwongse ..................................................... 1
Fuzzy Support Vector Classification Based on Possibility Theory
  Zhimin Yang, Yingjie Tian, Naiyang Deng .................................... 8
DEA-based Classification for Finding Performance Improvement Direction
  Shingo Aoki, Yusuke Nishiuchi, Hiroshi Tsuji .............................. 16
Multi-Viewpoint Data Envelopment Efficiency and Inefficiency Analysis for Finding
  Shingo Aoki, Kiyosei Minami, Hiroshi Tsuji ................................ 21
Mining Valuable Stocks with Genetic Optimization Algorithm
  Lean Yu, Kin Keung Lai and Shouyang Wang .................................. 27
A Comparison Study of Multiclass Classification between Multiple Criteria Mathematical Programming and Hierarchical Method for Support Vector Machines
  Yi Peng, Gang Kou, Yong Shi, Zhenxing Chen and Hongjin Yang ............... 30
Pattern Recognition for Multimedia Communication Networks Using New Connection Models between MCLP and SVM
  Jing He, Wuyi Yue, Yong Shi ............................................... 37

I

Introduction

For the last ten years, researchers have extensively applied quadratic programming to classification, best known as V. Vapnik's Support Vector Machine, as well as to various applications. However, the use of optimization techniques for data separation and data analysis goes back more than thirty years. According to O. L. Mangasarian, his group formulated linear programming as a large-margin classifier in the 1960s. In the 1970s, A. Charnes and W. W. Cooper initiated Data Envelopment Analysis, in which a fractional program is used to evaluate decision making units, which are economically representative data in a given training dataset. From the 1980s to the 1990s, F. Glover proposed a number of linear programming models to solve discriminant problems with small sample sizes. Since 1998, the organizer and his colleagues have extended this line of research to classification via multiple criteria linear programming (MCLP) and multiple criteria quadratic programming (MCQP). All of these methods differ from statistics, decision tree induction, and neural networks.
So far, numerous scholars around the world have been actively working in the field of using optimization techniques to handle data mining problems. This workshop intends to promote research interest in the connection of optimization and data mining, as well as real-life applications, among the growing data mining communities. All seven papers accepted by the workshop reflect the findings of researchers in these interface fields.

Yong Shi
Beijing, China

II

Novel Quadratic Programming Approaches for Feature Selection and Clustering with Applications

W. Art Chaovalitwongse
Department of Industrial and Systems Engineering, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854
Email: [email protected]

Abstract: In this paper, we focus our main application on epilepsy research. Epilepsy is the second most common brain disorder after stroke. Its most disabling aspect is the uncertainty of recurrent seizures, which can be characterized as a chronic medical condition produced by temporary changes in the electrical function of the brain. The aim of this research is to develop and apply a new DM paradigm to predict seizures based on the study of neurological brain functions through quantitative analyses of electroencephalograms (EEGs), a tool for evaluating the physiological state of the brain. Although EEGs offer excellent spatial and temporal resolution for characterizing rapidly changing electrical activity in the brain, it is not an easy task to excavate hidden patterns or relationships in massive data with properties in time and space like EEG time series. This paper involves research activities directed toward the development of mathematical models and optimization techniques for DM problems. The primary goal is to incorporate novel optimization methods into DM techniques. Specifically, novel feature selection and clustering techniques are proposed. The proposed techniques will enhance the ability to provide more precise data characterization, more accurate prediction/classification, and a greater understanding of EEG time series. Uncontrolled epilepsy poses a significant burden to society due to the associated healthcare costs of treating and controlling the unpredictable and spontaneous occurrence of seizures. The main objective of this paper is to develop and apply novel optimization-based data mining approaches to the study of brain physiology, which might revolutionize the current diagnosis and treatment of epilepsy. Through quantitative analyses of electroencephalogram (EEG) recordings, a new data mining paradigm for feature selection and clustering is developed based on the mathematical models and optimization techniques proposed in this paper. The experimental results in this study demonstrate that the proposed techniques can be used as a feature (electrode) selection technique to capture seizure precursors. In addition, the proposed techniques will not only excavate hidden patterns/relationships in EEGs, but also give a greater understanding of brain functions (as well as other complex systems) from a system perspective.

I. Introduction and Background

Most data mining (DM) tasks fundamentally involve discrete decisions based on numerical analyses of data (e.g., the number of clusters, the number of classes, the class assignment, the most informative features, the outlier samples, the samples capturing the essential information).
These techniques are combinatorial in nature and can naturally be formulated as discrete optimization problems. The goal of most DM tasks naturally lends itself to a discrete NP-hard optimization problem. Aside from the complexity issue, the massive scale of real-life DM problems is another difficulty arising in optimization-based DM research.

A. Feature/Sample Selection

Although the brain is considered to be the largest interconnected network, neurologists believe that seizures represent the spontaneous formation of self-organizing spatiotemporal patterns that involve only some parts (electrodes) of the brain network. The localization of epileptogenic zones is one proof of this concept. Therefore, feature selection techniques have become an essential tool for selecting the critical brain areas participating in the epileptogenesis process during seizure development. In addition, graph theoretical approaches appear to fit very well as a model of brain structure [12]. Feature selection will be very useful in selecting/identifying the brain areas correlated with the pathway to seizure onset. In general, feature/sample selection is considered a dimensionality reduction technique within the framework of classification and clustering. This problem can naturally be defined as a binary optimization problem: the notion of selecting a subset of variables out of a superset of possible alternatives lends itself to a combinatorial (discrete) optimization problem. Depending on the model used to describe the data, the feature selection problem ends up being a (non)linear mixed integer programming (MIP) problem. The most difficult issue in DM problems arises when one has to deal with spatial and temporal data, where it is extremely critical to identify the best features in a timely fashion. To overcome this difficulty, the feature selection problem in seizure prediction research is modeled as a Multi-Quadratic Integer Programming (MQIP) problem. MQIP is very difficult to solve. Although many efficient reformulation-linearization techniques (RLTs) have been used to linearize QP and nonlinear integer programming problems [1], [14], additional quadratic constraints make MQIP problems much more difficult to solve, and current RLTs fail to solve MQIP problems effectively. A fast and scalable RLT that can be used to solve MQIPs for feature selection is herein proposed based on our preliminary studies in [7], [24].

B. Clustering

The elements and dynamical connections of the brain dynamics can portray the characteristics of a group of neurons and synapses or of neuronal populations driven by the epileptogenic process. Therefore, clustering the brain areas that portray similar structural and functional relationships will give us insight into the mechanisms of epileptogenesis and an answer to the question of how seizures are generated, developed, and propagated, and how they can be disrupted and treated. The goal of clustering is to find the best segmentation of raw data into the most common/similar groups; the similarity measure is therefore the most important property in clustering. The difficulty of clustering arises from the fact that clustering is unsupervised learning, in which the properties or the expected number of groups (clusters) are not known ahead of time. The search for the optimal number of clusters is parametric in nature. The distance-based method is the most commonly studied clustering technique; it attempts to identify the best k clusters that minimize the distance of the points assigned to each cluster from the center of that cluster. A very well-known example of the distance-based method is k-means clustering (a minimal sketch follows below). Another clustering method is the model-based method, which assumes a functional model expression describing each of the clusters and then searches for the best parameters to fit the cluster model by minimizing a likelihood measure. k-median clustering is another widely studied clustering technique, which can be modeled as a concave minimization problem and reformulated as the minimization of a bilinear function over a polyhedral set [3]. Although these clustering techniques are well studied and robust, they still require a priori knowledge of the data (e.g., the number of clusters, the most informative features).
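As a concrete point of reference for the distance-based methods just described, here is a minimal Python sketch of Lloyd's k-means algorithm (not part of the paper's proposal; the two-blob data at the bottom is made up for illustration). Note that the number of clusters k must be supplied a priori, which is precisely the limitation the techniques proposed later aim to remove.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign each point to the nearest
    center, then move each center to the mean of its cluster."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # distance of every point to every center
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# made-up data: two well-separated blobs
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5.0])
labels, centers = kmeans(X, k=2)
```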
In addition, a novel framework applying graph theory to feature selection, based on the preliminary study in [28], is also proposed in this paper.

II. Data Mining in EEGs

Recent quantitative EEG studies reported in [5], [11], [10], [8], [16], [24] suggest that seizures are deterministic rather than random, and that it may be possible to predict the onset of epileptic seizures based on quantitative analysis of the brain's electrical activity through EEGs. Seizure predictability has also been confirmed by several other groups [13], [29], [20], [21]. The analysis proposed in this research was motivated by mathematical models from chaos theory used to characterize multidimensional complex systems and reduce the dimensionality of EEGs [19], [31]. These techniques demonstrate dynamical changes of epileptic activity that involve a gradual transition from a state of spatiotemporal chaos to spatial order and temporal chaos [4], [27]. Such a transition, which precedes seizures for periods on the order of minutes to hours, is detectable in the EEG by the convergence in value of chaos measures (i.e., the short-term maximum Lyapunov exponent, STLmax) among critical electrode sites on the neocortex and hippocampus [10]. The T-statistical distance was proposed to estimate the pair-wise difference (similarity) of the dynamics of EEG time series between brain electrode pairs. The T-index measures the degree of convergence of chaos measures among critical electrode sites. The T-index at time t between electrode sites i and j is defined as

T_ij(t) = sqrt(N) * |E{STLmax,i - STLmax,j}| / sigma_ij(t),

where E{.} is the sample average of the difference STLmax,i - STLmax,j estimated over a moving window w_t(lambda) defined as

w_t(lambda) = 1 if lambda in [t - N - 1, t], and 0 if lambda not in [t - N - 1, t],

where N is the length of the moving window, and sigma_ij(t) is the sample standard deviation of the STLmax differences between electrode sites i and j within the moving window w_t(lambda). The T-index so defined follows a t-distribution with N - 1 degrees of freedom. A novel optimization-based feature selection technique that selects the critical electrode sites minimizing the T-index similarity measure was proposed in [4], [24]. The results of that study demonstrated that the spatiotemporal dynamical properties of EEGs manifest patterns corresponding to specific clinical states [6], [4], [17], [24].
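For concreteness, here is a sketch of the T-index computation as defined above, assuming the two STLmax profiles are available as numpy arrays and taking the moving window to be the last N samples; the function name and exact indexing convention are illustrative, not from the paper.

```python
import numpy as np

def t_index(stl_i, stl_j, t, N):
    """T-index between electrode sites i and j at time t, computed
    over a moving window of the last N samples of the two STLmax
    profiles (a sketch of the statistic defined in the text)."""
    window = slice(t - N + 1, t + 1)          # assumes t >= N - 1
    diff = stl_i[window] - stl_j[window]
    sigma = diff.std(ddof=1)                  # sample standard deviation
    return np.sqrt(N) * abs(diff.mean()) / sigma
```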
In spite of promising signs of seizure predictability, research in epilepsy is still far from complete. The existence of seizure precursors remains to be further investigated with respect to parameter settings, accuracy, sensitivity, and specificity. Essentially, there is a need for new feature selection and clustering techniques that can systematically identify the brain areas underlying seizure evolution as well as the epileptogenic zones (the areas initiating the habitual seizures).

III. Feature Selection

The concept of optimization models for feature selection used to select/identify the brain areas correlated with the pathway to seizure onset came from the Ising model, which has been a powerful tool for studying phase transitions in statistical physics. Such an Ising model can be described by a graph G(V, E) having n vertices {v1, ..., vn}, with each edge (i, j) in E having a weight (interaction energy) J_ij. Each vertex v_i has a magnetic spin variable sigma_i in {-1, +1} associated with it. An optimal spin configuration of minimum energy is obtained by minimizing the Hamiltonian

H(sigma) = - sum_{1 <= i <= j <= n} J_ij sigma_i sigma_j over all sigma in {-1, +1}^n.

This problem is equivalent to the combinatorial problem of quadratic 0-1 programming [15]. This has motivated us to use quadratic 0-1 (integer) programming to select the critical cortical sites, where each electrode has only two states, and to determine the minimal-average-T-index state. In addition, we also introduce extensions of quadratic integer programming for electrode selection: Feature Selection via Multi-Quadratic Programming and Feature Selection via Graph Theory.

A. Feature Selection via Quadratic Integer Programming (FSQIP)

FSQIP is a novel mathematical model for selecting critical features (electrodes) of the brain network. It can be modeled as a quadratic 0-1 knapsack problem whose objective function minimizes the average T-index (a measure of statistical distance between the mean values of STLmax) among electrode sites, with a knapsack constraint fixing the number of critical cortical sites. A powerful quadratic 0-1 programming technique proposed in [25] is employed to solve this problem. Next we demonstrate how to reduce a quadratic program with a knapsack constraint to an unconstrained quadratic 0-1 program. To formalize the notion of equivalence, we propose the following definitions.

Definition 1: Problem P is "polynomially reducible" to problem P0 if, given an instance I(P) of problem P, we can in polynomial time obtain an instance I(P0) of problem P0 such that solving I(P0) will solve I(P).

Definition 2: Two problems P1 and P2 are called "equivalent" if P1 is polynomially reducible to P2 and P2 is polynomially reducible to P1.

Consider the following three problems:

P1: min f(x) = x^T A x, x in {0,1}^n, A in R^{n x n}.
Pbar1: min f(x) = x^T A x + c^T x, x in {0,1}^n, A in R^{n x n}, c in R^n.
Phat1: min f(x) = x^T A x, x in {0,1}^n, A in R^{n x n}, sum_{i=1}^n x_i = k, where 0 <= k <= n is a constant.

Define A as an n x n T-index pair-wise distance matrix, and k as the number of selected electrode sites. Problems P1, Pbar1, and Phat1 can be shown to be all equivalent by proving that P1 is polynomially reducible to Pbar1, Pbar1 is polynomially reducible to P1, Phat1 is polynomially reducible to P1, and P1 is polynomially reducible to Phat1. For more details, see [4], [6].

B. Feature Selection via Multi-Quadratic Integer Programming (FSMQIP)

FSMQIP is a novel mathematical model for selecting critical features (electrodes) of the brain network, which can be modeled as the MQIP problem

min x^T A x, s.t. sum_{i=1}^n x_i = k; x^T C x >= T_alpha k(k-1); x in {0,1}^n,

where A is an n x n matrix of pairwise similarity of chaos measures before a seizure, C is an n x n matrix of pairwise similarity of chaos measures after a seizure, and k is the pre-determined number of selected electrodes. This problem has been proved to be NP-hard in [24].
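The following sketch makes the FSMQIP model concrete by exhaustive search over small instances; it is not the paper's RLT method, which is what makes realistic sizes tractable. The matrices A and C and the right-hand side rhs = T_alpha * k * (k - 1) are assumed to be given.

```python
import numpy as np
from itertools import combinations

def fsmqip_brute_force(A, C, k, rhs):
    """Exhaustive solution of the MQIP above for small n: pick the
    k-subset of electrodes minimizing x'Ax subject to x'Cx >= rhs.
    Illustration only; enumeration is exponential in n."""
    n = A.shape[0]
    best, best_val = None, np.inf
    for subset in combinations(range(n), k):
        x = np.zeros(n)
        x[list(subset)] = 1.0
        if x @ C @ x >= rhs:            # divergence (quadratic) constraint
            val = x @ A @ x             # total T-index before the seizure
            if val < best_val:
                best, best_val = subset, val
    return best, best_val
```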
The objective function minimizes the average T-index distance (similarity) of chaos measures among the critical electrode sites. The knapsack constraint fixes the number of critical cortical sites. The quadratic constraint ensures the divergence of chaos measures among the critical electrode sites after a seizure. A novel RLT to reformulate this MQIP problem as a MIP problem was proposed in [7], which demonstrated the equivalence of the following two problems:

P2: min f(x) = x^T A x, s.t. Bx >= b, x^T C x >= alpha, x in {0,1}^n, where alpha is a positive constant.

Pbar2: min g(s) = e^T s, s.t. Ax - y - s = 0, Bx >= b, y <= M_A(e - x), Cx - z >= 0, e^T z >= alpha, z <= M_C x, x in {0,1}^n, y_i, s_i, z_i >= 0, where the big-M constants are the matrix infinity-norms M_A = ||A||_inf and M_C = ||C||_inf.

Proposition 1: P2 is equivalent to Pbar2 if every entry in matrices A and C is non-negative.
Proof: It has been shown in [9], [7] that P2 has an optimal solution x0 iff there exist y0, s0, z0 such that (x0, y0, s0, z0) is an optimal solution to Pbar2.

C. Feature Selection via Maximum Clique (FSMC)

FSMC is a novel mathematical model based on graph theory for selecting critical features (electrodes) of the brain network [9]. Brain connectivity can be rigorously modeled as a brain graph as follows: consider the brain network of electrodes as a weighted graph, where each node represents an electrode and the weight of an edge between two nodes is the T-statistical distance of chaos measures between the corresponding electrodes. Three possible weighted graphs are proposed: GRAPH-I is the complete graph (the graph with all possible edges); GRAPH-II is the graph induced from the complete one by deleting edges whose T-index before a seizure is greater than the T-test confidence level; GRAPH-III is the graph induced from the complete one by deleting edges whose T-index before a seizure is greater than the T-test confidence level or whose T-index after a seizure is less than the T-test confidence level. Maximum cliques of these graphs will be investigated, the hypothesis being that a group of physiologically connected electrodes constitutes the critical largest connected network of seizure evolution and pathway. The Maximum Clique Problem (MCP) is NP-hard [26]; therefore, solving MCPs is not an easy task.
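The following is a sketch of how the three graphs might be built from T-index matrices computed before and after a seizure; the threshold t_crit stands in for the T-test critical value, and the T-index matrices are assumed symmetric. This illustrates the definitions above and is not code from the paper.

```python
import numpy as np

def build_graphs(T_before, T_after, t_crit):
    """Boolean adjacency matrices for GRAPH-I, GRAPH-II, GRAPH-III.
    GRAPH-I keeps every edge; GRAPH-II drops edges already dissimilar
    before the seizure; GRAPH-III additionally drops edges that stay
    similar after the seizure."""
    n = T_before.shape[0]
    complete = ~np.eye(n, dtype=bool)                 # GRAPH-I
    graph2 = complete & (T_before <= t_crit)          # GRAPH-II
    graph3 = graph2 & (T_after >= t_crit)             # GRAPH-III
    return complete, graph2, graph3
```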
Nevertheless, the RLT in [7] provides a very compact formulation of the maximum clique problem (MCP). This compact formulation has theoretical and computational advantages over traditional formulations, and it provides tighter relaxation bounds. Consider a maximum clique problem defined as follows. Let G = G(V, E) be an undirected graph, where V = {1, ..., n} is the set of vertices (nodes) and E denotes the set of edges. Assume that there are no parallel edges (and no self-loops joining the same vertex) in G. Denote an edge joining vertices i and j by (i, j).

Definition 3: A clique of G is a subset C of vertices with the property that every pair of vertices in C is connected by an edge; that is, C is a clique if the subgraph G(C) induced by C is complete.

Definition 4: The maximum clique problem is the problem of finding a clique set C of maximal cardinality (size) |C|.

The maximum clique problem can be represented by many equivalent formulations (e.g., an integer programming problem, a continuous global optimization problem, and an indefinite quadratic program) [22]. Consider the following indefinite quadratic programming formulation of the MCP. Let A_G = (a_ij)_{n x n} be the adjacency matrix of G, defined by a_ij = 1 if (i, j) in E and a_ij = 0 if (i, j) not in E. The matrix A_G is symmetric and all its eigenvalues are real. Generally, A_G has positive and negative (and possibly zero) eigenvalues, and the sum of the eigenvalues is zero since the main diagonal entries are zero [15]. Consider the following indefinite QIP problem and MIP problem for the MCP:

P3: max f_G(x) = (1/2) x^T A x, s.t. x in {0,1}^n, where A = A_Gbar - I and A_Gbar is the adjacency matrix of the complement graph Gbar.

Pbar3: min sum_{i=1}^n s_i, s.t. sum_{j=1}^n a_ij x_j - s_i - y_i = 0, y_i - M(1 - x_i) <= 0, x_i in {0,1}, s_i, y_i >= 0, where M = max_i sum_{j=1}^n |a_ij| = ||A||_inf.

Proposition 2: P3 is equivalent to Pbar3. If x* solves problems P3 and Pbar3, then the set C defined by C = t(x*) is a maximum clique of graph G, with |C| = -f_G(x*).
Proof: It has been shown in [9], [7] that P3 has an optimal solution x0 iff there exist y0, s0 such that (x0, y0, s0) is an optimal solution to Pbar3.
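For small graphs, the maximum clique of Definitions 3-4 can be found by enumeration, as in the sketch below; the point of the formulations P3/Pbar3 is precisely to avoid this exponential search on realistic instances.

```python
import numpy as np
from itertools import combinations

def max_clique(adj):
    """Brute-force maximum clique for a small undirected graph given
    by a boolean adjacency matrix (illustration of Definitions 3-4)."""
    n = adj.shape[0]
    for size in range(n, 0, -1):                     # largest first
        for nodes in combinations(range(n), size):
            if all(adj[i, j] for i, j in combinations(nodes, 2)):
                return list(nodes)
    return []
```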
IV. Clustering Techniques

The neurons in the cerebral cortex maintain thousands of input and output connections with other groups of neurons, forming a dense network of connectivity that spans the entire thalamocortical system. Despite this massive connectivity, cortical networks are exceedingly sparse with respect to the number of connections present out of all possible connections. This indicates that brain networks are not random but form highly specific patterns. Networks in the brain can be analyzed at multiple levels of scale. Novel clustering techniques are herein proposed to construct the temporal and spatial mechanistic basis of epileptogenic models from the brain dynamics of EEGs, and to capture the patterns or hierarchical structure of brain connectivity from the statistical dependence among brain areas. The proposed hierarchical clustering techniques, which do not require a priori knowledge of the data (number of clusters), include Clustering via Concave Quadratic Programming and Clustering via MIP with a Quadratic Constraint.

A. Clustering via Concave Quadratic Programming (CCQP)

CCQP is a novel clustering mathematical model that formulates a clustering problem as a QIP problem [9]. Given n data points to be clustered, we can formulate the clustering problem as

min f(x) = x^T (A + lambda I) x, s.t. x in {0,1}^n,

where A is an n x n Euclidean matrix of pairwise distances, I is the identity matrix, lambda is a parameter adjusting the degree of similarity within a cluster, and x_i is a 0-1 decision variable indicating whether or not point i is selected to be in the cluster. The term lambda x^T I x (note that on binary vectors x^T I x = sum_i x_i) is an offset added to the objective function to avoid the trivial optimal solution in which all x_i are zero; that solution would otherwise always be optimal, because every entry a_ij of the Euclidean matrix A is non-negative and its diagonal is zero. Although this clustering problem is formulated as a large QIP problem, when lambda is chosen so that the quadratic function becomes concave, the problem can be converted to a continuous problem (minimizing a concave quadratic function over a compact convex set) [9]. The reduction to a continuous problem is the main advantage of CCQP. This property holds because a concave function f: S -> R over a compact convex set S in R^n attains its global minimum at one of the extreme points of S [15]. Two equivalent forms of the CCQP problem are given by:

P4: min f(x) = x^T A x, s.t. x in {0,1}^n, where A is an n x n Euclidean matrix.
Pbar4: min fbar(x) = x^T Abar x, s.t. 0 <= x <= e, where Abar = A + lambda I, lambda is any real number, and I is the identity matrix.

Proposition 3: P4 is equivalent to Pbar4.
Proof: We demonstrate that P4 has an optimal solution x0 iff x0 is an optimal solution to Pbar4 as follows. If we choose lambda such that Abar = A + lambda I becomes a negative semidefinite matrix (e.g., lambda = -mu, where mu is the largest eigenvalue of A), then the objective function fbar(x) becomes concave and the binary constraints can be relaxed to 0 <= x <= e. Thus, the discrete problem P4 is equivalent to the continuous problem Pbar4 [9].

One advantage of CCQP is its ability to systematically determine the optimal number of clusters. Although CCQP has to solve m clustering problems iteratively (where m is the final number of clusters at the termination of the CCQP algorithm), it is efficient enough to solve large-scale clustering problems because only one continuous problem is solved in each iteration, and after each iteration the problem size becomes significantly smaller [9]. Figure 1 presents the procedure of CCQP.

Fig. 1. Procedure of the CCQP algorithm.
Input: all n unassigned data points in set S.
Output: the number of clusters and the cluster assignment for all n data points.
WHILE S is not empty DO:
  construct a Euclidean matrix A from the pairwise distances of the data points in S;
  solve CCQP in problem Pbar4;
  IF optimal solution x_i = 1 THEN remove point i from set S.
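A sketch of the continuous reduction behind Pbar4: shift the distance matrix by its largest eigenvalue so the quadratic becomes concave, then minimize over the unit box. A general-purpose local solver is used here, so this finds a vertex of the box but not necessarily the global minimum; it is illustrative only.

```python
import numpy as np
from scipy.optimize import minimize

def ccqp_cluster(D):
    """One CCQP pass (sketch of problem Pbar4): with lambda = -mu,
    where mu is the largest eigenvalue of the Euclidean distance
    matrix D, the shifted quadratic is concave, so its minimum over
    the box 0 <= x <= 1 lies at a 0-1 vertex selecting one cluster."""
    mu = np.linalg.eigvalsh(D).max()
    A_bar = D - mu * np.eye(len(D))        # negative semidefinite shift
    obj = lambda x: x @ A_bar @ x
    x0 = np.full(len(D), 0.5)
    res = minimize(obj, x0, bounds=[(0.0, 1.0)] * len(D))
    return res.x.round().astype(int)       # indicator of cluster members
```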
B. Clustering via MIP with Quadratic Constraint (CMIPQC)

CMIPQC is a novel clustering mathematical model in which a clustering problem is formulated as a mixed-integer programming problem with a quadratic constraint [9]. The goal of CMIPQC is to maximize the number of data points in a cluster such that the degree of similarity among the data points in the cluster stays below a pre-determined parameter alpha. This technique can be incorporated into hierarchical clustering as follows: (a) Initialization: assign all data points to one cluster; (b) Partition: use CMIPQC to divide the big cluster into smaller clusters; (c) Repetition: repeat the partition process until the stopping criterion is reached or a cluster contains a single point. The mathematical formulation of CMIPQC is

max sum_{i=1}^n x_i, s.t. x^T C x <= alpha, x in {0,1}^n,

where n is the number of data points to be clustered, C is an n x n Euclidean matrix of pairwise distances, alpha is a pre-determined parameter bounding the degree of similarity within each cluster, and x_i is a 0-1 decision variable indicating whether or not point i is selected to be in the cluster. The objective is to maximize the number of data points in a cluster such that the pairwise distances among those points total less than alpha. The difficulty of this problem comes from the quadratic constraint; however, this constraint can be efficiently linearized by the RLT described in [7]. The CMIPQC problem is then much easier to solve, as it reduces to an equivalent MIP problem. Similar to CCQP, the CMIPQC algorithm can systematically determine the optimal number of clusters and only needs to solve m MIP problems (see Figure 2 for the CMIPQC algorithm). Two equivalent forms of CMIPQC are given by:

P5: max f(x) = sum_{i=1}^n x_i, s.t. x^T C x <= alpha, x in {0,1}^n.
Pbar5: max fbar(x, z) = sum_{i=1}^n x_i, s.t. Cx - z >= 0, e^T z >= alpha, z <= Mx, x in {0,1}^n, z_i >= 0, where M = ||C||_inf.

Proposition 4: P5 is equivalent to Pbar5.
Proof: The proof that P5 has an optimal solution x0 iff there exists z0 such that (x0, z0) is an optimal solution to Pbar5 is very similar to the one in [9], [7], since P5 is a special case of P2.

Fig. 2. Procedure of the CMIPQC algorithm.
Input: all n unassigned data points in set S.
Output: the number of clusters and the cluster assignment for all n data points.
WHILE S is not empty DO:
  construct a Euclidean matrix C from the pairwise distances of the data points in S;
  solve CMIPQC in problem Pbar5;
  IF optimal solution x_i = 1 THEN remove point i from set S.
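Below is a brute-force sketch of one CMIPQC partition step and the surrounding hierarchical loop of Figure 2, workable only for tiny instances; the linearized MIP of [7] is what the paper actually proposes for realistic sizes. The parameter alpha is assumed given.

```python
import numpy as np
from itertools import combinations

def cmipqc_step(C, alpha):
    """Brute-force CMIPQC (sketch of problem P5): the largest subset
    whose total pairwise distance x'Cx stays within alpha."""
    n = C.shape[0]
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            x = np.zeros(n)
            x[list(subset)] = 1.0
            if x @ C @ x <= alpha:
                return list(subset)
    return []

def cmipqc_clusters(points, alpha):
    """Hierarchical loop of Figure 2: peel off one cluster at a time
    until every point is assigned."""
    remaining = list(range(len(points)))
    clusters = []
    while remaining:
        P = points[remaining]
        C = np.linalg.norm(P[:, None] - P[None, :], axis=2)
        chosen = cmipqc_step(C, alpha)
        clusters.append([remaining[i] for i in chosen])
        remaining = [r for i, r in enumerate(remaining) if i not in chosen]
    return clusters
```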
V. Materials and Methods

The data used in our studies consist of continuous intracranial EEGs from 3 patients with temporal lobe epilepsy. FSQIP was previously used to demonstrate the predictability of epileptic seizures [4]. In this research, we extend our previous findings of seizure predictability by using FSMQIP to select the critical cortical sites. The FSMQIP problem is formulated as a MQIP problem whose objective function minimizes the average T-index (a measure of statistical distance between the mean values of STLmax) among electrode sites, with a knapsack constraint fixing the number of critical cortical sites [18] and an additional quadratic constraint ensuring that the optimal group of critical sites shows divergence in STLmax profiles after a seizure. The experiment in this study tests the hypothesis that FSMQIP can be used to select the critical features (electrodes) that are most likely to manifest precursor patterns prior to a seizure. The results will demonstrate that if one can select critical electrodes that manifest seizure precursors, it may be possible to predict a seizure in time to warn of an impending seizure [6]. To test this hypothesis, we designed an experiment comparing the probability of detecting seizure precursor patterns from the critical electrodes selected by FSMQIP with that from randomly selected electrodes.

In this experiment, testing on 3 patients with 20 seizures, we randomly selected 5,000 groups of electrodes and used FSMQIP to select the critical electrodes. The experiment is conducted in the following steps: 1) STLmax profiles [2], [19], [23], [30], [31] are estimated to measure the degree of order or disorder (chaos) of the EEG signals. 2) FSMQIP selects the critical electrodes based upon the behavior of the STLmax profiles before and after each preceding seizure. 3) A seizure precursor is detected when the brain dynamics from the critical electrodes manifest a pattern of transitional convergence in the similarity degree of chaos. This pattern can be viewed as a synchronization of the brain dynamics from the critical electrodes.

VI. Results

The results show that the probability of detecting seizure precursor patterns from the critical electrodes selected by FSMQIP is approximately 83%, which is significantly better than that from randomly selected electrodes (p-value < 0.07). The histogram of the probability of detecting seizure precursor patterns from randomly selected electrodes and from the critical electrodes is illustrated in Figure 3. The results of this study can be used as a criterion to pre-select the critical electrode sites for predicting epileptic seizures.

Fig. 3. Histogram of seizure prediction sensitivities based on randomly selected electrodes versus electrodes selected by the proposed feature selection technique.

VII. Conclusions

This paper proposes a theoretical foundation of optimization techniques for feature selection and clustering, with an application in epilepsy research. Empirical investigations of the proposed feature selection techniques demonstrate their effectiveness in selecting the critical brain areas associated with the epileptogenic process. Advances in feature selection and clustering techniques will thus drive the future development of a novel DM paradigm to predict impending seizures from multichannel EEG recordings. Prediction is possible because, for the vast majority of seizures, the spatio-temporal dynamical features of seizure precursors are sufficiently similar to those of the preceding seizure. Mathematical formulations for novel clustering techniques are also proposed in this paper. These techniques are theoretically fast and scalable. The results of this preliminary research suggest that empirical studies of the proposed clustering techniques should be investigated in future research.
References

[1] W. Adams and H. Sherali, "Linearization strategies for a class of zero-one mixed integer programming problems," Operations Research, vol. 38, pp. 217–226, 1990.
[2] A. Babloyantz and A. Destexhe, "Low dimensional chaos in an instance of epilepsy," Proc. Natl. Acad. Sci. USA, vol. 83, pp. 3513–3517, 1986.
[3] P. Bradley, O. Mangasarian, and W. Street, "Clustering via concave minimization," in Advances in Neural Information Processing Systems, M. Mozer, M. Jordan, and T. Petsche, Eds. MIT Press, 1997.
[4] W. Chaovalitwongse, "Optimization and dynamical approaches in nonlinear time series analysis with applications in bioengineering," Ph.D. dissertation, University of Florida, 2003.
[5] W. Chaovalitwongse, L. Iasemidis, P. Pardalos, P. Carney, D.-S. Shiau, and J. Sackellares, "Performance of a seizure warning algorithm based on the dynamics of intracranial EEG," Epilepsy Research, vol. 64, pp. 93–133, 2005.
[6] W. Chaovalitwongse, P. Pardalos, L. Iasemidis, D.-S. Shiau, and J. Sackellares, "Applications of global optimization and dynamical systems to prediction of epileptic seizures," in Quantitative Neuroscience, P. Pardalos, J. Sackellares, L. Iasemidis, and P. Carney, Eds. Kluwer, 2003, pp. 1–36.
[7] W. Chaovalitwongse, P. Pardalos, and O. Prokopyev, "Reduction of multi-quadratic 0–1 programming problems to linear mixed 0–1 programming problems," Operations Research Letters, vol. 32(6), pp. 517–522, 2004.
[8] W. Chaovalitwongse, O. Prokopyev, and P. Pardalos, "Electroencephalogram (EEG) time series classification: Applications in epilepsy," Annals of Operations Research, to appear, 2005.
[9] W. A. Chaovalitwongse, "A robust clustering technique via quadratic programming," Department of Industrial and Systems Engineering, Rutgers University, Tech. Rep., 2005.
[10] W. A. Chaovalitwongse, P. Pardalos, L. Iasemidis, D.-S. Shiau, and J. Sackellares, "Dynamical approaches and multi-quadratic integer programming for seizure prediction," Optimization Methods and Software, vol. 20(2–3), pp. 383–394, 2005.
[11] W. Chaovalitwongse, P. Pardalos, L. Iasemidis, J. Sackellares, and D.-S. Shiau, "Optimization of spatio-temporal pattern processing for seizure warning and prediction," U.S. Patent application filed August 2004, Attorney Docket No. 028724–150, 2004.
[12] C. Cherniak, Z. Mokhtarzada, and U. Nodelman, "Optimal-wiring models of neuroanatomy," in Computational Neuroanatomy, G. A. Ascoli, Ed. Humana Press, 2002.
[13] C. Elger and K. Lehnertz, "Seizure prediction by non-linear time series analysis of brain electrical activity," European Journal of Neuroscience, vol. 10, pp. 786–789, 1998.
[14] F. Glover, "Improved linear integer programming formulations of nonlinear integer programs," Management Science, vol. 22, pp. 455–460, 1975.
[15] R. Horst, P. Pardalos, and N. Thoai, Introduction to Global Optimization. Kluwer Academic Publishers, 1995.
[16] L. Iasemidis, P. Pardalos, D.-S. Shiau, W. Chaovalitwongse, K. Narayanan, A. Prasad, K. Tsakalis, P. Carney, and J. Sackellares, "Long term prospective on-line real-time seizure prediction," Journal of Clinical Neurophysiology, vol. 116(3), pp. 532–544, 2005.
[17] L. Iasemidis, D.-S. Shiau, W. Chaovalitwongse, J. Sackellares, P. Pardalos, P. Carney, J. Principe, A. Prasad, B. Veeramani, and K. Tsakalis, "Adaptive epileptic seizure prediction system," IEEE Transactions on Biomedical Engineering, vol. 5(5), pp. 616–627, 2003.
[18] L. Iasemidis, D.-S. Shiau, J. Sackellares, and P. Pardalos, "Transition to epileptic seizures: Optimization," in DIMACS Series in Discrete Mathematics and Theoretical Computer Science, D. Du, P. Pardalos, and J. Wang, Eds. American Mathematical Society, 1999, pp. 55–74.
[19] L. Iasemidis, H. Zaveri, J. Sackellares, and W. Williams, "Phase space analysis of EEG in temporal lobe epilepsy," in IEEE Eng. in Medicine and Biology Society, 10th Ann. Int. Conf., 1988, pp. 1201–1203.
[20] B. Litt, R. Esteller, J. Echauz, D. Maryann, R. Shor, T. Henry, P. Pennell, C. Epstein, R. Bakay, M. Dichter, and G. Vachtservanos, "Epileptic seizures may begin hours in advance of clinical onset: A report of five patients," Neuron, vol. 30, pp. 51–64, 2001.
[21] F. Mormann, T. Kreuz, C. Rieke, R. Andrzejak, A. Kraskov, P. David, C. Elger, and K. Lehnertz, "On the predictability of epileptic seizures," Journal of Clinical Neurophysiology, vol. 116(3), pp. 569–587, 2005.
[22] T. Motzkin and E. Strauss, "Maxima for graphs and a new proof of a theorem of Turán," Canadian Journal of Mathematics, vol. 17, pp. 533–540, 1965.
[23] N. Packard, J. Crutchfield, and J. Farmer, "Geometry from time series," Phys. Rev. Lett., vol. 45, pp. 712–716, 1980.
[24] P. Pardalos, W. Chaovalitwongse, L. Iasemidis, J. Sackellares, D.-S. Shiau, P. Carney, O. Prokopyev, and V. Yatsenko, "Seizure warning algorithm based on spatiotemporal dynamics of intracranial EEG," Mathematical Programming, vol. 101(2), pp. 365–385, 2004.
[25] P. Pardalos and G. Rodgers, "Computational aspects of a branch and bound algorithm for quadratic zero-one programming," Computing, vol. 45, pp. 131–144, 1990.
[26] P. Pardalos and J. Xue, "The maximum clique problem," Journal of Global Optimization, vol. 4, pp. 301–328, 1992.
[27] P. Pardalos, V. Yatsenko, J. Sackellares, D.-S. Shiau, W. Chaovalitwongse, and L. Iasemidis, "Analysis of EEG data using optimization, statistics, and dynamical system techniques," Computational Statistics & Data Analysis, vol. 44(1–2), pp. 391–408, 2003.
[28] O. Prokopyev, V. Boginski, W. Chaovalitwongse, P. Pardalos, J. Sackellares, and P. Carney, "Network-based techniques in EEG data analysis and epileptic brain modeling," in Data Mining in Biomedicine, P. Pardalos and A. Vazacopoulos, Eds. Springer, 2005, to appear.
[29] M. L. V. Quyen, J. Martinerie, M. Baulac, and F. Varela, "Anticipating epileptic seizures in real time by non-linear analysis of similarity between EEG recordings," NeuroReport, vol. 10, pp. 2149–2155, 1999.
[30] P. Rapp, I. Zimmerman, and A. M. Albano, "Experimental studies of chaotic neural behavior: cellular activity and electroencephalographic signals," in Nonlinear Oscillations in Biology and Chemistry, H. Othmer, Ed. Springer-Verlag, 1986, pp. 175–205.
[31] F. Takens, "Detecting strange attractors in turbulence," in Dynamical Systems and Turbulence, Lecture Notes in Mathematics, D. Rand and L. Young, Eds. Springer-Verlag, 1981.
Fuzzy Support Vector Classification Based on Possibility Theory*

Zhimin Yang (1), Yingjie Tian (2), Naiyang Deng (3)**
(1) College of Economics & Management, China Agriculture University, 100083, Beijing, China
(2) Chinese Academy of Sciences Research Center on Data Technology & Knowledge Economy, 100080, Beijing, China
(3) College of Science, China Agriculture University, 100083, Beijing, China

Abstract: This paper is concerned with fuzzy support vector classification in which both the output of each training point and the value of the final fuzzy classification function are triangle fuzzy numbers. First, the fuzzy classification problem is formulated as a fuzzy chance constrained program. Then we transform this program into its equivalent quadratic program. As a result, we propose a fuzzy support vector classification algorithm. To show the rationality of the algorithm, an example is presented.

Keywords: machine learning, fuzzy support vector classification, possibility measure, triangle fuzzy number

1. INTRODUCTION

Support vector machines (SVMs), proposed by Vapnik, are a powerful tool for machine learning (Vapnik 1995, Vapnik 1998, Cristianini 2000, Mangasarian 1999, Deng 2004), and one of the most interesting topics in this field. Lin and Wang (Lin, 2002) investigated a classification problem with fuzzy information, where the training set is S = {(x1, y~1), ..., (xl, y~l)} with each output y~j (j = 1, ..., l) a fuzzy number. This paper studies this problem in a different way. We formulate it as a fuzzy chance constrained program, and then transform this program into its equivalent quadratic program. Assuming that the training points contain complete fuzzy information, i.e., the sum of the positive membership degree and the negative membership degree of each output is 1, we propose a fuzzy support vector classification algorithm. For an arbitrary test input, the output obtained by the algorithm is a triangle fuzzy number.

2. FUZZY SUPPORT VECTOR CLASSIFICATION MACHINE

As an extension of the positive label 1 and the negative label -1, we introduce triangle fuzzy numbers and define the corresponding outputs by them. For an input of a training point that belongs to the positive class with membership degree delta (0.5 <= delta <= 1), the triangle fuzzy number is y~ = (r1, r2, r3), given by
y~ = ((2 delta^2 + delta - 2)/delta, 2 delta - 1, (2 delta^2 - 3 delta + 2)/delta), 0.5 <= delta <= 1. (1)

Similarly, for an input of a training point that belongs to the negative class with membership degree delta (0.5 <= delta <= 1), the triangle fuzzy number is

y~ = (r1, r2, r3) = (-(2 delta^2 - 3 delta + 2)/delta, 1 - 2 delta, -(2 delta^2 + delta - 2)/delta), 0.5 <= delta <= 1. (2)

Thus we use (x, y~) to express a training point, where y~ is a triangle fuzzy number (1) or (2). We can also use (x, delta) to express a training point, where delta is the membership degree appearing in (1) or (2); write this correspondence between delta and y~ as (3).

The given classification training set is

S = {(x1, y~1), ..., (xl, y~l)}, (4)

where xj in R^n is a usual input and y~j (j = 1, ..., l) is a triangle fuzzy number (1) or (2). According to (1), (2), and (3), the training set (4) can be written in another form,

S_delta = {(x1, delta1), ..., (xl, deltal)}, (5)

where the xj are the same as in (4) and the deltaj are those in (3), j = 1, ..., l.

Definition 1: (xj, y~j) in (4) and (xj, deltaj) in (5) are called fuzzy training points, j = 1, ..., l, and S and S_delta are called fuzzy training sets.

Definition 2: A fuzzy training point (xj, y~j) or (xj, deltaj) is called a fuzzy positive point if it corresponds to (1); similarly, a fuzzy training point (xj, y~j) or (xj, deltaj) is called a fuzzy negative point if it corresponds to (2).

Note: In this paper, the case deltaj = 0.5 is omitted, because the corresponding triangle fuzzy number y~j = (-2, 0, 2) cannot provide any information.

We rearrange the fuzzy training points in fuzzy training set (4) or (5), so that the new fuzzy training set

S = {(x1, y~1), ..., (xp, y~p), (xp+1, y~p+1), ..., (xl, y~l)} (6)

or

S_delta = {(x1, delta1), ..., (xp, deltap), (xp+1, deltap+1), ..., (xl, deltal)} (7)

has the following property: (xt, y~t) and (xt, deltat) are fuzzy positive points (t = 1, ..., p), and (xi, y~i) and (xi, deltai) are fuzzy negative points (i = p+1, ..., l).

Definition 3: Suppose a fuzzy training set (6), or equivalently (7), and a confidence level lambda (0 < lambda <= 1) are given. If there exist w in R^n and b in R such that

Pos{y~j((w . xj) + b) >= 1} >= lambda, j = 1, ..., l, (8)

then the fuzzy training set (6) or (7) is fuzzy linearly separable, and the corresponding fuzzy classification problem is fuzzy linearly separable.

Note: 1. Fuzzy linear separability can be understood, roughly speaking, as meaning that the inputs of fuzzy positive points and fuzzy negative points can be separated with possibility degree at least lambda (0 < lambda <= 1). 2. Fuzzy linear separability generalizes the linear separability of a usual training set. In fact, if deltat = 1 (t = 1, ..., p) and deltai = 1 (i = p+1, ..., l) in training set (7), the fuzzy training set degenerates to a usual training set, and fuzzy linear separability of the fuzzy training set degenerates to the usual linear separability. If deltat < 1 (t = 1, ..., p) or deltai < 1 (i = p+1, ..., l), it is possible that, on the one hand, x1, ..., xp and xp+1, ..., xl are not linearly separable in the usual sense, while on the other hand they are fuzzy linearly separable. For example, consider the case shown in the following figure:

[Figure: the inputs x1, x2, x3 placed on the real line around the separating point x = 0.]
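The triangle fuzzy outputs (1)-(2) are easy to compute; the sketch below reconstructs them from the membership degree delta and reproduces the worked values that appear later in the paper (e.g., delta = 0.8 on the positive class gives (0.1, 0.6, 1.1), and delta = 1 gives the crisp label (1, 1, 1)).

```python
def triangle_fuzzy_output(delta, positive=True):
    """Triangle fuzzy number (r1, r2, r3) for a training point with
    membership degree delta (0.5 < delta <= 1), per equations (1)-(2);
    the negative class mirrors the positive one."""
    r1 = (2 * delta**2 + delta - 2) / delta
    r2 = 2 * delta - 1
    r3 = (2 * delta**2 - 3 * delta + 2) / delta
    return (r1, r2, r3) if positive else (-r3, -r2, -r1)

print(triangle_fuzzy_output(0.8))          # (0.1, 0.6, 1.1)
print(triangle_fuzzy_output(1.0))          # (1.0, 1.0, 1.0)
print(triangle_fuzzy_output(0.51, False))  # approx (-1.94, -0.02, 1.90)
```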
Suppose there are three fuzzy training points (x1, y~1), (x2, y2), and (x3, y3). The fuzzy training points (x2, y2) and (x3, y3) are certain, with delta2 = 1 (y2 = 1) and delta3 = 1 (y3 = -1). The first fuzzy training point (x1, y~1) is fuzzy, with two possible negative membership degrees, delta1 = 0.51 and delta1 = 0.6.

(i) delta1 = 0.51. According to (2), the triangle fuzzy number of (x1, delta1) is y~1 = (-1.94, -0.02, 1.90), so the fuzzy training set is S = {(x1, y~1), (x2, y2), (x3, y3)}. Suppose lambda = 0.72 and the classification hyperplane x = 0; then (w . x1) + b = 2, so Pos{y~1((w . x1) + b) >= 1} = 0.722 >= 0.72; moreover Pos{y2((w . x2) + b) >= 1} = 1 > 0.72 and Pos{y3((w . x3) + b) >= 1} = 1 > 0.72. Therefore the fuzzy training set S is fuzzy linearly separable at the confidence level lambda = 0.72.

(ii) delta1 = 0.6. According to (2), the triangle fuzzy number of (x1, delta1) is y~1 = (-1.53, -0.2, 1.13), so the fuzzy training set is S = {(x1, y~1), (x2, y2), (x3, y3)}. Suppose lambda = 0.47 and the classification hyperplane x = 0; then (w . x1) + b = 2, so Pos{y~1((w . x1) + b) >= 1} = 0.47 >= 0.47; moreover Pos{y2((w . x2) + b) >= 1} = 1 > 0.47 and Pos{y3((w . x3) + b) >= 1} = 1 > 0.47. Therefore the fuzzy training set S is fuzzy linearly separable at the confidence level lambda = 0.47. Supposing lambda = 0.72, however, we can find no classification hyperplane such that

Pos{y~1((w . x1) + b) >= 1} >= 0.72, (9)

so the fuzzy training set S is not fuzzy linearly separable at the confidence level lambda = 0.72.

Generally speaking, a possibility measure inequality for a fuzzy event can be equivalently transformed into real inequalities, as shown in (10).

Theorem 1: Constraint (8) in Definition 3 is equivalent to the real inequalities

((1 - lambda) r_t3 + lambda r_t2)((w . xt) + b) >= 1, t = 1, ..., p,
((1 - lambda) r_i1 + lambda r_i2)((w . xi) + b) >= 1, i = p+1, ..., l. (10)

Proof: y~j = (r_j1, r_j2, r_j3) is a triangle fuzzy number, so 1 - y~j((w . xj) + b) is also a triangle fuzzy number by the operation rules for triangle fuzzy numbers. More concretely, if (w . xt) + b > 0, then 1 - y~t((w . xt) + b) = (1 - r_t3((w . xt) + b), 1 - r_t2((w . xt) + b), 1 - r_t1((w . xt) + b)), t = 1, ..., p. For a triangle fuzzy number a~ = (r1, r2, r3) and an arbitrary confidence level lambda (0 < lambda <= 1), we have Pos{a~ <= 0} >= lambda iff (1 - lambda) r1 + lambda r2 <= 0. Therefore, if (w . xt) + b > 0, then Pos{1 - y~t((w . xt) + b) <= 0} >= lambda iff (1 - lambda)(1 - r_t3((w . xt) + b)) + lambda(1 - r_t2((w . xt) + b)) <= 0; that is, Pos{y~t((w . xt) + b) >= 1} >= lambda iff ((1 - lambda) r_t3 + lambda r_t2)((w . xt) + b) >= 1, t = 1, ..., p. Similarly, if (w . xi) + b < 0, then Pos{y~i((w . xi) + b) >= 1} >= lambda iff ((1 - lambda) r_i1 + lambda r_i2)((w . xi) + b) >= 1, i = p+1, ..., l. Therefore (8) in Definition 3 is equivalent to (10).
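The possibility computations in the example can be checked directly. The helper below evaluates Pos{b~ <= 0} for a triangle fuzzy number as the supremum of its membership function over (-inf, 0], which is the computation behind the criterion (1 - lambda) r1 + lambda r2 <= 0 used in Theorem 1; the numbers reproduce case (i) above.

```python
def pos_leq_zero(b1, b2, b3):
    """Pos{b~ <= 0} for a triangle fuzzy number b~ = (b1, b2, b3):
    the supremum of the membership function over (-inf, 0]."""
    if b2 <= 0:
        return 1.0
    if b1 >= 0:
        return 0.0
    return -b1 / (b2 - b1)        # left slope evaluated at zero

# Case (i): delta_1 = 0.51 gives y1 = (-1.94, -0.02, 1.90), and the
# hyperplane x = 0 gives (w . x1) + b = 2.
y1, u = (-1.94, -0.02, 1.90), 2.0
b_tilde = (1 - y1[2] * u, 1 - y1[1] * u, 1 - y1[0] * u)   # 1 - y1 * u
print(pos_leq_zero(*b_tilde))   # ~0.72, consistent with case (i)
```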
event ^` Theorem 2 In the confidence level O (0 O d 1) , the certain equivalence programming (usual programming equivalent with (12) )of fuzzy chance constrained programming ˄ 13 ˅ is the quadratic programming below: 1 min || w || 2 °° w ,b 2 ®s.t.((1 O )rt 3 Ort 2 )((w xt ) b) t 1, t 1, ", p ° °¯((1 O )ri1 Ori 2 )((w xi ) b) t 1, i p 1, " , l. ˄13˅ Proof: The result can be got directly with Theorem 1. Theorem 3. There exists an optimal solution of quadratic programming (13). Proof: omitted. (see Deng 2004) We will solve the dual programming of quadratic programming (13). Theorem 4. The dual programming of quadratic programming (13) is quadratic programming with decision variable is ( E ,D )T : min {k t } ˈ t 1,", p max {l i } . i p 1,",l Distance of two support hyperplanes ( w x) b k and ( w x) b l is ˖ p l 1 min ( 2 ) ( A B C E ¦ ¦ Di ) t ° E ,D 2 t 1 i p 1 ° p ° s t E t ((1 O ) rt 3 O rt 2 ) . . ° ¦ ° t 1 ® l ° ¦ D i ((1 O ) ri1 O ri 2 ) 0 ° i p 1 ° E t t 0, t 1," , p ° ° Di t 0, i p 1," , l䯸 ¯ | k l | ˈ || w || and we call the distance with margin ˄ k ! 0 and l 0 are constant ˅ . Due to essence idea of Support Vector Machine, our goal is to maximize margin. In the confidence level O (0 O d 1) , fuzzy linearly separable problem with fuzzy training set (6)or (7) can ˄14˅ ,&'0:RUNVKRS2SWLPL]DWLRQEDVHG'DWD0LQLQJ7HFKQLTXHVZLWK$SSOLFDWLRQV So we can get certain optimal classification hyperplane(see Deng 2004) : ( w* x) b * 0, x R n . (15) Defining the function: g ( x) ( w* x) b * ˈ where p A p ¦¦ E E ((1 O )r t s t3 O rt 2 ) ˈ t 1 s 1 *((1 O ) rs 3 O rs 2 )( xt xs ) p B l ¦ ¦ E D ((1 O )r t i t3 O rt 2 ) ˈ t 1 i p 1 M (u ),0 u d M 1 (1) ° 1 °1, u ! M (1) ,˄16˅ G G (u ) ® 1 ° M (u ), M (1) d u 0 ° 1, u M 1 (1) ¯ 1 Where M (u ) and M 1 (u ) are respectively the inverse function of M (u ) and M (u ) . Both M (u ) and M (u ) are regression function (monotonously on u ) obtained by the following way: Computation of M (u ) : ˄ ˅ Construct training set of regression problem ˄17˅ {( g ( x1 ), G 1 ),", ( g ( x p ), G p )} *((1 O ) ri1 O ri 2 )( xt xi ) l C l ¦ ¦ D D ((1 O )r i q i1 O ri 2 ) i p 1 q p 1 ˈ *((1 O ) rq1 O ri 2 )( xi xq ) E ( E 1 ," , E p ) T Rp D (D p 1 , " ,D l ) T Rl p ˈ ˈ ( E ,D )T is decision variable. Proof: omitted. (see Deng 2004) Programming ˄ 14 ˅ is convex. After getting its optimal solution * * T * * T * * ( E ,D ) ( E 1 ," , E p , D p 1 ,"D l ) , we find a J optimal solution ( w* , b * ) T of fuzzy coefficient programming (12) is: p w* ¦E * t ((1 O ) rt 3 O rt 2 )xt t 1 ˄ ˅ Using (17) as training set , and selecting appropriate H ! 0 , C ! 0 , H support vector regression machine with linear kernel are executed. Computation of M (u ) : ˄ ˅ Construct training set of regression problem ˄18˅ {( g ( x p 1 ),G p 1 )," , ( g ( xl ),G l )} . ˈ l ¦ D ((1 O ) ri1 O ri 2 )xi * i i p 1 * b ((1 O ) rs 3 O rs 2 ) P ¦ E t* ((1 O ) rt 3 O rt 2 )( xt xs ) ˈ t 1 l ¦ Di* ((1 O ) ri1 O ri 2 )( xi xs ) ˄ ˅ Using (13) as training set, and selecting the same H ! 0 , C ! 0 , H support vector regression machine with linear kernel are executed. i p 1 s {s | E s* ! 0} ˈ or b* ((1 O ) rqi O rq 2 ) Note: The equation (11) has the following explanation: Consider an input x c .It seems natural that the larger g (x c) is, the larger the corresponding membership degree to be a fuzzy positive point is; the smaller g (x c) is, the larger the corresponding membership degree to be a fuzzy negative point is. 
The above discussion leads to the following algorithm (fuzzy support vector classification).

(i) Given a fuzzy training set (6) or (7), select an appropriate confidence level lambda (0 < lambda <= 1), a penalty parameter C > 0, and a kernel function K(x, x'); then construct the quadratic program

min over (beta, alpha) of (1/2)(A_K + 2B_K + C_K) - (sum_{t=1}^p beta_t + sum_{i=p+1}^l alpha_i),
s.t. sum_{t=1}^p beta_t((1 - lambda) r_t3 + lambda r_t2) + sum_{i=p+1}^l alpha_i((1 - lambda) r_i1 + lambda r_i2) = 0,
0 <= beta_t <= C, t = 1, ..., p; 0 <= alpha_i <= C, i = p+1, ..., l, (18)

where A_K, B_K, and C_K are obtained from A, B, and C in (14) by replacing each inner product (x . x') with K(x, x'); beta = (beta_1, ..., beta_p)^T in R^p, alpha = (alpha_{p+1}, ..., alpha_l)^T in R^{l-p}, and (beta, alpha)^T is the decision variable.

(ii) Solve quadratic program (18), obtaining the optimal solution (beta*, alpha*)^T = (beta*_1, ..., beta*_p, alpha*_{p+1}, ..., alpha*_l)^T.

(iii) Select beta*_s in (0, C) in beta* or alpha*_q in (0, C) in alpha*, and compute

b* = 1/((1 - lambda) r_s3 + lambda r_s2) - sum_{t=1}^p beta*_t((1 - lambda) r_t3 + lambda r_t2) K(xt, xs) - sum_{i=p+1}^l alpha*_i((1 - lambda) r_i1 + lambda r_i2) K(xi, xs),

or

b* = 1/((1 - lambda) r_q1 + lambda r_q2) - sum_{t=1}^p beta*_t((1 - lambda) r_t3 + lambda r_t2) K(xt, xq) - sum_{i=p+1}^l alpha*_i((1 - lambda) r_i1 + lambda r_i2) K(xi, xq).

(iv) Construct the function

g(x) = sum_{t=1}^p beta*_t((1 - lambda) r_t3 + lambda r_t2) K(x, xt) + sum_{i=p+1}^l alpha*_i((1 - lambda) r_i1 + lambda r_i2) K(x, xi) + b*.

(v) Taking {(g(x1), delta1), ..., (g(xp), deltap)} and {(g(xp+1), deltap+1), ..., (g(xl), deltal)} as regression training sets respectively, construct the regression functions M(u) and Mbar(u) by epsilon-support vector regression with a linear kernel.

(vi) According to (1), (2), and (3), transform the function delta = G(g(x)) in (16) into the triangle fuzzy number y~ = y~(x); this gives the fuzzy optimal classification function.

Note: 1. If the outputs of all fuzzy training points in fuzzy training set (6) or (7) are the real numbers 1 or -1, the fuzzy training set degenerates to a normal training set, and the fuzzy support vector classification machine degenerates to the ordinary support vector classification machine. 2. The selection of the confidence level lambda (0 < lambda <= 1) in the fuzzy support vector classification machine can be seen as a parameter selection problem, so we can use parameter selection methods such as the LOO error and LOO error bounds (Deng 2004).
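The dual program in step (i) has the structure of a standard SVM dual with the labels replaced by the coefficients above. The sketch below again uses a generic solver; the kernel, the box bound C, and the coefficient vector are assumptions passed in by the caller.

```python
import numpy as np
from scipy.optimize import minimize

def fuzzy_svc_dual(X, coeffs, C=10.0, kernel=np.dot):
    """Sketch of the dual QP (18): u stacks the multipliers
    (beta, alpha); coeffs[j] is (1-lam)*r_j3 + lam*r_j2 for fuzzy
    positive points and (1-lam)*r_j1 + lam*r_j2 for fuzzy negative
    ones."""
    n = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    Q = np.outer(coeffs, coeffs) * K
    obj = lambda u: 0.5 * u @ Q @ u - u.sum()
    cons = ({"type": "eq", "fun": lambda u: coeffs @ u},)
    res = minimize(obj, np.full(n, 1e-3), bounds=[(0.0, C)] * n,
                   constraints=cons, method="SLSQP")
    return res.x
```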
In order to find relationship and difference between fuzzy support vector classification and support vector classification, we will have three respective outputs of the third fuzzy training point in fuzzy training set S G ,more (1.5,1) T ˈ 1 ˈ G2 ˈ test points T (1,0) , and we get g ( x8 ) 2 0 ˈ G ( g ( x8 )) y 2 1 (1,1,1) ˈ 1 (1,1,1) ˈ 1 (1.1,0.6,0.1) ˗ G 1 0.08 g ( x ) 0.72,0 g ( x ) d 3.50 °1, g ( x ) ! 3.50 ° . ® °0.07 g ( x ) 0.73, 3.86 d g ( x ) 0 °¯ 1, g ( x ) 3.86 1 concretely, G 3 1 , G 3 0.8 ˈ G 3 1 . While output of the sixth fuzzy training point is G 6 1 , therefore fuzzy training set S G will become three sets respectively: S G1 {( x1 , G 1 )," , ( x3 , G 3 ), ( x 4 , G 4 )," , ( x6 , G 6 )} 1 ˈ G6 0.8 . Suppose a confidence kernel level O 0.8 , C 10 and c c function K ( x, x ) x x . We use the Algorithm (fuzzy support vector classification), so that we get function g ( x) 2[ x]1 2[ x]2 4 . We will establish function G G ( g ( x)) : Look S1 {( 4,1), (3.4,1), (1,0.8)} as fuzzy training set, and select H 0.1, C 10 and linear kernel. Construct support vector regression, and we get regression function M (u ) 0.08u 0.72 ; Look S 2 {(4,1), (1.4,1), (1,0.8)} as fuzzy training set, and select H 0.1, C 10 and linear kernel. Construct support vector regression, and we get regression function M (u ) 0.07u 0.73 ; So we get membership function is: (1.5,1) T ˈ G 3 1 ˈ x6 (1,0.5) T ˈ G 6 1 .The inputs and outputs of other fuzzy training points is the same to those in S G . x3 S G2 {( x1 , G 1 )," , ( x3 , G 3 ), ( x 4 , G 4 )," , ( x6 , G 6 )} (1.5,1) T ˈ G 3 0.8 ˈ x6 (1,0.5) T ˈ G 6 1 . The inputs and outputs of other fuzzy training points is the same to those in S G . x3 S G3 {( x1 , G 1 )," , ( x3 , G 3 ), ( x 4 , G 4 )," , ( x6 , G 6 )} (1.5,1) T ˈ G 3 1 ˈ x6 (1,0.5) T ˈ G 6 1 . The inputs and outputs of other fuzzy training points is the same to those in S G . x3 ,&'0:RUNVKRS2SWLPL]DWLRQEDVHG'DWD0LQLQJ7HFKQLTXHVZLWK$SSOLFDWLRQV So we observe the change of optimal classification hyperplanes with the variety of output of the third fuzzy training point: G 3 1 o G 3 0.8 o G 3 0.8 o G 3 1 (19) When all the outputs of training points in training set are 1 or -1,fuzzy training set degenerate to usual training set such as S G1 , S G3 . At the same time, fuzzy support vector classification degenerates to support vector classification. Suppose O 0.8 , C 10 , and kernel function K ( x, x c) x x c .We use the algorithm(fuzzy support vector classification)and get certain optimal classification hyperplanes respectively: L1 : [ x1 ] [ x 2 ] 2 ˈ L2 : [ x1 ] [ x2 ] 2.4 ˈ ˈ L3 : 0.385[ x1 ] 1.923[ x 2 ] 1.4 L4 : 0.385[ x1 ] 1.923[ x 2 ] 1.76 . show in the follow figure: Other Kernelbased Learning Methods. Cambridge University Press. Deng, N. Y. and Zhu, M. F.(1987), Optimal Methods, Education Press, Shenyang. Deng, N. Y. and Tian, Y. J.(2004), The New Method in Data Mining, Science Press, Beijing. Lin, C. F., Wang, S. D.(2002), Fuzzy Support Vector Machines, IEEE Transactions on Neural Networks,᧤2᧥. Liu, B. D.(1998), Random Programming and Fuzzy Programming, Tsinghua University Press, Beijing. Liu, B. et al.(1998) Chance Constrained Programming with Fuzzy Parameters, Fuzzy Sets and Systems, (2). Mangasarian, O. L.(1999), Generalized Support Vector Machines. Advances in Large Margin Classifiers, MIT Press, Boston. Vapnik, V. N.(1995), The Nature of Statistical 2.5 Learning 2 1.5 Theory, Springer-Verlag, New York. L4 Vapnik, V. N. (1998), Statistical Learning L3 1 L2 Theory. Wiley, New York. 
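The reported values for the test point x7 can be reproduced from the published g(x) and fitted membership function, together with the reconstruction of equations (1)-(3):

```python
def g(x):                      # decision function reported in Section 3
    return 2 * x[0] + 2 * x[1] - 4

def G(u):                      # fitted membership function from Section 3
    if u > 3.50:
        return 1.0
    if u > 0:
        return 0.08 * u + 0.72
    if u >= -3.86:             # u = 0 is left undefined in the text
        return 0.07 * u - 0.73
    return -1.0

x7 = (1.0, 2.0)
delta7 = G(g(x7))              # 0.88, as reported
r1 = (2 * delta7**2 + delta7 - 2) / delta7
r2 = 2 * delta7 - 1
r3 = (2 * delta7**2 - 3 * delta7 + 2) / delta7
print(g(x7), round(delta7, 2), (round(r1, 2), round(r2, 2), round(r3, 2)))
# -> 2.0 0.88 (0.49, 0.76, 1.03), matching the fuzzy output y~7 above
```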
In order to find the relationship and difference between fuzzy support vector classification and standard support vector classification, we vary the output of the third fuzzy training point in the fuzzy training set S_δ; more concretely, δ_3 = 1, δ_3 = −0.8 and δ_3 = −1, while the output of the sixth fuzzy training point is δ_6 = −1. The fuzzy training set S_δ thus becomes three sets:

  S_δ1 = {(x_1, δ_1), …, (x_3, δ_3), (x_4, δ_4), …, (x_6, δ_6)} with x_3 = (1.5, 1)^T, δ_3 = 1, x_6 = (1, 0.5)^T, δ_6 = −1;
  S_δ2 = {(x_1, δ_1), …, (x_3, δ_3), (x_4, δ_4), …, (x_6, δ_6)} with x_3 = (1.5, 1)^T, δ_3 = −0.8, x_6 = (1, 0.5)^T, δ_6 = −1;
  S_δ3 = {(x_1, δ_1), …, (x_3, δ_3), (x_4, δ_4), …, (x_6, δ_6)} with x_3 = (1.5, 1)^T, δ_3 = −1, x_6 = (1, 0.5)^T, δ_6 = −1.

In each case the inputs and outputs of the other fuzzy training points are the same as those in S_δ. So we observe the change of the optimal classification hyperplanes as the output of the third fuzzy training point varies:

  δ_3 = 1 → δ_3 = 0.8 → δ_3 = −0.8 → δ_3 = −1.   (19)

When all the outputs of the training points are 1 or −1, the fuzzy training set degenerates to a usual training set, as with S_δ1 and S_δ3; at the same time, fuzzy support vector classification degenerates to support vector classification. Suppose λ = 0.8, C = 10 and the kernel function K(x, x′) = x · x′. Using the algorithm (fuzzy support vector classification), we get the corresponding certain optimal classification hyperplanes:

  L1: [x]_1 + [x]_2 = 2, L2: [x]_1 + [x]_2 = 2.4,
  L3: 0.385[x]_1 + 1.923[x]_2 = 1.4, L4: 0.385[x]_1 + 1.923[x]_2 = 1.76,

shown in the following figure.

(Figure: the four certain optimal classification hyperplanes L1 to L4.)

(19) illustrates how the membership degree of the fuzzy training point x_3 changes: as its negative membership degree gets bigger and its positive membership degree gets smaller, the corresponding certain optimal classification hyperplane moves L1 → L2 → L3 → L4. It can thus be seen that the result agrees with intuitive judgment.

References

Cristianini, N. and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press.
Deng, N. Y. and Zhu, M. F. (1987), Optimal Methods, Education Press, Shenyang.
Deng, N. Y. and Tian, Y. J. (2004), The New Method in Data Mining, Science Press, Beijing.
Lin, C. F. and Wang, S. D. (2002), Fuzzy Support Vector Machines, IEEE Transactions on Neural Networks, (2).
Liu, B. D. (1998), Random Programming and Fuzzy Programming, Tsinghua University Press, Beijing.
Liu, B. et al. (1998), Chance Constrained Programming with Fuzzy Parameters, Fuzzy Sets and Systems, (2).
Mangasarian, O. L. (1999), Generalized Support Vector Machines, in Advances in Large Margin Classifiers, MIT Press, Boston.
Vapnik, V. N. (1995), The Nature of Statistical Learning Theory, Springer-Verlag, New York.
Vapnik, V. N. (1998), Statistical Learning Theory, Wiley, New York.
Yuan, Y. X. and Sun, W. Y. (1997), Optimal Theories and Methods, Science Press, Beijing.
Zadeh, L. A. (1965), Fuzzy Sets, Information and Control.
Zadeh, L. A. (1978), Fuzzy Sets as a Basis for a Theory of Possibility, Fuzzy Sets and Systems.
Zhang, W. X. (1995), Foundation of Fuzzy Mathematics, Xi'an Jiaotong University Press, Xi'an.

DEA-based Classification for Finding Performance Improvement Direction

Shingo Aoki, Member, IEEE, Yusuke Nishiuchi, Non-Member, Hiroshi Tsuji, Member, IEEE

(Manuscript received October 1, 2005. The authors are with the Graduate School of Engineering, Osaka Prefecture University, 1-1 Gakuencho, Sakai, Osaka 599-8531, Japan; corresponding author S. Aoki, phone: +81-72-254-9354, fax: +81-72-254-9915, e-mail: [email protected].)

Abstract—In order to find the performance improvement direction for a DMU (Decision Making Unit), this paper proposes a new classification technique. The proposed method consists of two stages: (1) DEA (Data Envelopment Analysis) for evaluating DMUs by their inputs/outputs, and (2) GT (Group Technology) for finding clusters among DMUs. A case study for twelve DMUs with two inputs and two outputs shows that the proposed technique obtains four clusters, where each cluster has its own performance improvement direction. This paper also compares the traditional clustering and the proposed clustering.

Index Terms—Data Envelopment Analysis, Clustering methods, Data mining, Decision-making, Linear programming.

I. INTRODUCTION

Under the condition that there are a great number of competitors in a general marketplace, a company should find out its own advantages compared with others and extend them [2]. For this reason, concern with mathematical approaches has been growing [5] [11] [16]. In particular, this paper concentrates on the following issues: (1) characterize each company in the marketplace by its activity and define groups by similarity, and (2) compare a company to others and find the performance improvement direction [3] [4].
As to the former issue, many cluster analyses have been developed in recent years. Cluster analysis is a method for classifying samples that are characterized by multiple property values [5] [6]. It allows us to find common characteristics within a group, in other words, the reason why a sample belongs to a group. However, the traditional analysis regards all property values as equivalent in kind. It therefore often produces rules based on absolute property values, which makes it difficult to find the performance improvement direction for each sample.

As to the latter issue, DEA has been developed and applied to a variety of managerial and economic problem situations [8]. By comparison with the Pareto optimal solutions, the so-called "efficiency frontier", the performance of DMUs is measured relatively. In other words, however, DEA considers only the part of the DMUs used to form the efficiency frontier; little attention has been given to clustering techniques for classifying all DMUs.

In order to overcome these problems, this paper proposes a new classification technique. The proposed method consists of two stages: (1) DEA for evaluating DMUs by their inputs/outputs, and (2) GT for finding clusters among DMUs.

The remainder of this paper is organized as follows: section 2 describes DEA as the basis of this research. Section 3 proposes the DEA-based classification method. Section 4 illustrates a numerical simulation using the proposed method and the traditional method, and discusses the difference between their classification results. Section 5 draws general prospects for the two methods. Finally, conclusions and future extensions are summarized in section 6.

II. DATA ENVELOPMENT ANALYSIS (DEA)

A. An overview of DEA

Data Envelopment Analysis, initiated by Charnes et al. (1978) [7], has been widely applied to efficiency (productivity) analysis, and more than fifteen hundred studies have been performed in the past twenty years [8]. DEA assumes that each DMU's activity uses multiple inputs to yield multiple outputs, and defines the process that changes multiple inputs into multiple outputs as an "efficiency score". By comparison with the Pareto optimal solutions, the so-called "efficiency frontier", the efficiency score of a DMU is measured relatively.

B. Efficiency frontier

This section illustrates the efficiency frontier visually using an exercise with a sample data set. In Figure 1, suppose that there are seven DMUs which have one input and two outputs, where the X-axis is the amount of sales (output 1) over the number of shops (input) and the Y-axis is the number of visitors (output 2) over the number of shops (input). If a DMU is located in the upper-right region, the DMU has high productivity. The line B-C-F-G is the efficiency frontier in Figure 1. The DMUs on this frontier are considered to perform an "efficient" activity; the other DMUs are considered to perform an "inefficient" activity, and there is room to improve their activities. For instance, DMU E's efficiency score equals OE/OE1, where E1 is the projection of E onto the frontier, so the range of the efficiency score is [0, 1]. The efficiency scores of DMU B, DMU C, DMU F and DMU G are equal to 1.

(Fig. 1. Graphical description of efficiency measurement.)

C. DEA model

When there are n DMUs (DMU_1, …, DMU_k, …, DMU_n), and each DMU is characterized by its own performance with m inputs (x_1k, x_2k, …, x_mk) and s outputs (y_1k, y_2k, …, y_sk), the DEA model is mathematically expressed by the following formulation [11] [12]:

  Minimize θ_k
  Subject to Σ_{j=1}^{n} x_ij λ_j ≤ θ x_ik (i = 1, 2, …, m),
       Σ_{j=1}^{n} y_rj λ_j ≥ y_rk (r = 1, 2, …, s),
       L ≤ Σ_{j=1}^{n} λ_j ≤ U,
       λ_j ≥ 0 (j = 1, 2, …, n), θ free.   (1)

In Formulation (1), L and U are the lower and upper bounds of Σ_{j=1}^{n} λ_j. If L = 0 and U = ∞, Formulation (1) is called "the CCR model", and if L = U = 1, it is called "the BCC model" [13] [14] [15]. This paper uses the CCR model. θ_k is the efficiency score, in the manner that θ_k = 1 (100%) means the DMU is "efficient", while θ_k < 1 means it is "inefficient". The λ_j (j = 1, 2, …, n) can be considered to form the efficiency frontier for DMU_k; in particular, if λ_j > 0, then DMU_j is on the efficiency frontier. The set of these DMUs is called the "Reference set (R_k)" for DMU_k and is expressed as follows:

  R_k = {j | λ_j > 0, j = 1, …, n}.   (2)

Using the reference set, this paper re-defines the set R_k as a vector a_k:

  a_k = {λ*_1, λ*_2, …, λ*_n}.   (3)

In Formulation (1), for instance, when a_k has λ*_1 = … = λ*_{v−1} = 0, λ*_v = 0.7, λ*_{v+1} = … = λ*_{w−1} = 0, λ*_w = 0.3, λ*_{w+1} = … = λ*_n = 0 and θ*_k = 0.85, the reference set of DMU_k is {DMU_v, DMU_w}. In Fig. 2, the point k′ is the nearest point to DMU_k on the efficiency frontier, and the efficiency value of DMU_k is given by the ratio of 0.85 to 1.

(Fig. 2. Reference set for DMU_k: the frontier segment for DMU_k is spanned by DMU_v with weight 0.7 and DMU_w with weight 0.3, and k′ is the projection of DMU_k onto it.)

What is important is that this research obtains the segment connecting the origin with k′ not by the researchers' subjectivity, but by the intention of making the efficiency of DMU_k as large as possible. The efficiency score of DMU_{k+1} is obtained by replacing "k" with "k+1" in Formulation (1).
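For readers who want to experiment with Formulation (1), here is a minimal sketch (ours, not the authors' code) of the input-oriented CCR model (L = 0, U = ∞) solved with scipy.optimize.linprog; X, Y and k are assumed to be a prepared input matrix, output matrix and DMU index.

```python
# X is (m, n): inputs of all n DMUs; Y is (s, n): outputs; k is the DMU index.
import numpy as np
from scipy.optimize import linprog

def ccr_efficiency(X, Y, k):
    m, n = X.shape
    s, _ = Y.shape
    c = np.zeros(n + 1)
    c[0] = 1.0                                   # minimize theta
    # Input constraints: sum_j x_ij * lambda_j - theta * x_ik <= 0.
    A_in = np.hstack([-X[:, [k]], X])
    # Output constraints: -sum_j y_rj * lambda_j <= -y_rk.
    A_out = np.hstack([np.zeros((s, 1)), -Y])
    res = linprog(c,
                  A_ub=np.vstack([A_in, A_out]),
                  b_ub=np.concatenate([np.zeros(m), -Y[:, k]]),
                  bounds=[(None, None)] + [(0, None)] * n)  # theta free
    theta, lam = res.x[0], res.x[1:]
    return theta, lam           # lam > 0 marks the reference set R_k of (2)
```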
III. DEA-BASED CLASSIFICATION METHOD

Let us propose a method which consists of the following steps:

A: Divide the data set into input items and output items.
B: For each DMU, solve Formulation (1) to get the efficiency score and the λ_j values. This yields a similarity coefficient matrix S.
C: Apply the rank order algorithm to the similarity coefficient matrix. This yields the clusters.

A. Select input and output items

For the first step, there is a guideline for defining a data set as follows [9]:
1. Each data item is numeric, and its value is more than zero.
2. In order to show the features of a DMU's activity, the analyst should divide the data set into input items and output items.
3. As input items, the analyst should choose data that represent investment, such as the amount of capital stock, the number of employees, and the amount of advertising investment.
4. As output items, the analyst should choose data that represent returns, such as the amount of sales and the number of visitors.

B. Create the similarity coefficient matrix

As the second step, the proposed method calculates the efficiency score θ_k for each DMU_k by Formula (1), and the vector a_k in Formula (3). Then the proposed method creates the similarity coefficient matrix S as follows:

  S = {a_1, a_2, …, a_n}.   (4)

C. Classify DMUs by the rank order algorithm

As the last step, the DMUs are classified into groups by Group Technology (GT) [18], handling the similarity coefficient matrix S. For this classification, the rank order algorithm by King, J. R. [19] is employed. The rank order algorithm consists of four steps:

Step 1. Calculate the total weight of each column, w_j = Σ_i 2^i M_ij.
Step 2. Arrange the columns by ascending weight.
Step 3. Calculate the total weight of each row, w_i = Σ_j 2^j M_ij.
Step 4. If the rows are in ascending order by weight, STOP; else arrange the rows by ascending weight and GOTO Step 1.
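A compact sketch of Steps 1 to 4 (our illustration; the paper gives only the pseudo-steps above) on a binary matrix M follows.

```python
import numpy as np

def rank_order_cluster(M, max_iter=100):
    """King's rank order algorithm on a 0/1 matrix (DMUs x efficient DMUs)."""
    M = np.asarray(M, dtype=int)
    rows, cols = np.arange(M.shape[0]), np.arange(M.shape[1])
    for _ in range(max_iter):
        # Steps 1-2: column weights w_j = sum_i 2^i * M_ij, sorted ascending.
        w_col = (2 ** np.arange(M.shape[0])) @ M
        order = np.argsort(w_col, kind="stable")
        M, cols = M[:, order], cols[order]
        # Steps 3-4: row weights; stop once rows are already in ascending order.
        w_row = M @ (2 ** np.arange(M.shape[1]))
        order = np.argsort(w_row, kind="stable")
        if np.all(order == np.arange(M.shape[0])):
            break
        M, rows = M[order, :], rows[order]
    return M, rows, cols   # blocks along the diagonal suggest the clusters
```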
IV. A CASE STUDY

In order to verify the usefulness of the proposed method, let us illustrate a numerical simulation.

A. A data set

A sample data set is shown in Table I. The data set concerns the performance of 12 DMUs (DMU_A, …, DMU_L), and each DMU has four data items: number of employees, number of shops, number of visitors and amount of sales.

TABLE I. DATA SET FOR NUMERICAL STUDIES
DMU | Number of employees | Number of shops | Number of visitors (K person/month) | Amount of sales (M yen/month)
A | 10 | 8 | 23 | 21
B | 26 | 10 | 37 | 32
C | 40 | 15 | 80 | 68
D | 35 | 28 | 76 | 60
E | 30 | 21 | 23 | 20
F | 33 | 10 | 38 | 41
G | 37 | 12 | 78 | 65
H | 50 | 22 | 68 | 77
I | 31 | 15 | 48 | 33
J | 12 | 10 | 16 | 36
K | 20 | 12 | 64 | 23
L | 45 | 26 | 72 | 35

B. Traditional cluster analysis

B1. Method of clustering analysis. Cluster analysis is an exploratory data analysis method which aims at sorting different objects into groups in such a way that the degree of association between objects is maximal if they belong to the same group and minimal otherwise [5] [20]. The degree of association is estimated here by the distance calculated with Ward's method [21]. Ward's method is distinct from other methods because it uses an analysis-of-variance approach to evaluate the distances between clusters. When a new cluster c is created by combining cluster a and cluster b, the distance between cluster x and cluster c is expressed by the following formulation:

  d_xc² = ((n_x + n_a)/(n_x + n_c)) d_xa² + ((n_x + n_b)/(n_x + n_c)) d_xb² − (n_x/(n_x + n_c)) d_ab²,   (5)

where d_mn is the distance between cluster m and cluster n, and n_m is the number of individuals in cluster m. In general, this method is computationally simple, while it tends to create small clusters.

B2. Classification result. Classifying the data set with Ward's clustering method yields a dendrogram, also called a tree diagram (see Fig. 3). In Fig. 3, when two individuals are combined on the left, the two individuals belong to the same group. The final number of clusters depends on the position where the dendrogram is cut off. To get four clusters, for example, (A, J, E), (B, F, I), (K, L) and (C, G, D, H) are obtained by cutting the dendrogram at position (1) in Fig. 3.

(Fig. 3. Dendrogram by Ward's method, with cut-off positions (1) and (2).)

From this classification result and Table I, the features of the groups are as follows: (i) group (A, J, E) consists of "small scale" DMUs; (ii) group (B, F, I) consists of "lower middle scale" DMUs; (iii) group (K, L) consists of "larger middle scale" DMUs whose visitor unit price is very low; (iv) group (C, G, D, H) consists of "large scale" DMUs. Fig. 4 illustrates the classification by the traditional method: the groups line up along a single axis from small scale to large scale in terms of employees and shops versus visitors and sales.

(Fig. 4. Traditional classification result.)
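The traditional stage of this case study can be reproduced with standard tooling. The following sketch (ours) applies Ward's method to the raw items of Table I with SciPy; the paper does not state any preprocessing, so the resulting four clusters may differ slightly from Fig. 3.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Rows are DMUs A..L, columns are the four items of Table I.
data = np.array([
    [10,  8, 23, 21], [26, 10, 37, 32], [40, 15, 80, 68], [35, 28, 76, 60],
    [30, 21, 23, 20], [33, 10, 38, 41], [37, 12, 78, 65], [50, 22, 68, 77],
    [31, 15, 48, 33], [12, 10, 16, 36], [20, 12, 64, 23], [45, 26, 72, 35],
])
Z = linkage(data, method="ward")                 # Ward's minimum-variance rule
labels = fcluster(Z, t=4, criterion="maxclust")  # cut the dendrogram at 4 clusters
print(dict(zip("ABCDEFGHIJKL", labels)))
```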
V. DEA-BASED CLASSIFICATION

This section describes the process of the proposed method.

Step 1: Select inputs and outputs. According to Step 1 in Section 3, the number of employees and the number of shops are selected as input values, and the number of visitors and the amount of sales are selected as output values.

Step 2: Create a similarity coefficient matrix. By Formulations (1), (3) and (4), the similarity coefficient matrix S is obtained as shown in Table II.

TABLE II. SIMILARITY COEFFICIENT MATRIX S
(The columns for the "inefficient" DMUs B, C, D, E, F, H, I and L are all zero and are omitted.)
DMU | Efficiency score | λ_A | λ_G | λ_J | λ_K
A | 1 | 1 | 0 | 0 | 0
B | 0.674 | 0 | 0.404 | 0.124 | 0.054
C | 0.943 | 0 | 0.889 | 0.21 | 0.113
D | 0.885 | 1 | 0 | 0 | 0.265
E | 0.331 | 0 | 0.007 | 0.38 | 0.256
F | 0.757 | 0 | 0.631 | 0 | 0
G | 1 | 0 | 1 | 0 | 0
H | 0.755 | 0 | 0.789 | 0.715 | 0
I | 0.638 | 0 | 0.276 | 0.184 | 0.368
J | 1 | 0 | 0 | 1 | 0
K | 1 | 0 | 0 | 0 | 1
L | 0.556 | 0 | 0.103 | 0.176 | 0.956

Let us note S in Table II. Only the λ_j columns of DMU_A, DMU_G, DMU_J and DMU_K, which are on the efficiency frontier, contain values greater than zero; the columns of all other DMUs are zero. This means that each DMU is characterized by a combination of the "efficient" DMUs' features. The proposed method focuses attention on this DEA contribution and finds the performance improvement direction for each DMU.

Step 3: Classify DMUs by the rank order algorithm. The rank order algorithm applied to the similarity coefficient matrix S generates the classification shown in Fig. 5. The matrix S in Fig. 5 is binarized as follows: if S_ij > 0, it is considered that there is relevance between DMU_i and DMU_j, and the entry is 1; if S_ij = 0, it is considered that there is no relevance between DMU_i and DMU_j, and the entry is left empty.

(Fig. 5. Classification demonstration by the rank order algorithm: the initial and final states of the binarized matrix S.)

Then four clusters, (A, D), (B, C, E, I, K, L), (F, G, H) and (J), are obtained as shown in Fig. 5. The features of the groups are as follows: (i) group (A, D) consists of DMUs which get many visitors and a large amount of sales with few employees and few shops; (ii) group (B, C, E, I, K, L) consists of DMUs whose employees are clever at attracting customers; (iii) group (F, G, H) consists of DMUs which are managed with large-sized shops; (iv) group (J) consists of the DMU whose visitors purchase a lot. From the above analysis, Fig. 6 illustrates a conceptual diagram of the classification: a profit side (A, D: many visitors and large sales with few employees and shops), a brand side (J: large sales with few visitors), a marquee side (B, C, E, I, K, L: many visitors with few employees) and a shop-scale side (F, G, H: many visitors and large sales with many employees).

(Fig. 6. Proposed classification result.)

VI. DISCUSSION

From the result of Section 4.2, two characteristics of clustering analysis can be observed: (a) the classification result is based on the scale of management; (b) the number of clusters can be assigned according to the purpose, so the traditional method requires no preparation in advance. However, there is the demerit that it is difficult to find the performance improvement direction for a DMU, since the classification result is based only on the scale of management.

On the other hand, DEA-based classification has three characteristics: (a) the classification result is based on the direction of management; (b) the number of groups equals the number of "efficient" DMUs; (c) every group has at least one "efficient" DMU.
Since the λ_j values in the similarity coefficient matrix S (Table II) are positive only if the corresponding DMU is "efficient", (b) is true. As shown in Fig. 5, since there is one "efficient" DMU in every classified group, (c) is also true. As for the merits and demerits of the proposed method: it is easy to find the performance improvement direction for a DMU. For example, even if a DMU is evaluated as "inefficient", it is possible to refer to the features of the "efficient" DMU which belongs to the same group. However, it is necessary to select the right inputs and outputs in advance.

VII. CONCLUSIONS AND FUTURE EXTENSIONS

This paper has described issues of the traditional classification method and proposed a new classification method which finds the performance improvement direction. The case study has shown that the classification by cluster analysis is based on the scale of management, while the classification by the proposed method is based on the direction of management. Future extensions of this research include: (a) application to large-scale practical problems; (b) a method for assigning meaning to the derived groups; (c) investigating the reliability of the performance improvement direction; (d) establishment of a one-step application of the proposed method.

REFERENCES

[1] Y. Hirose et al., Brand value evaluation paper group report, the Ministry of Economy, Trade and Industry, 2002.
[2] Y. Hirose et al., "Brand value that on-balance-ization is hurried", Weekly Economist special issue, Vol.24, 2001.
[3] S. Aoki, Y. Naito, and H. Tsuji, "DEA-based Indicator for Performance Improvement", Proceedings of the 2005 International Conference on Active Media Technology, 2005.
[4] Y. Taniguchi, H. Mizuno and H. Yajima, "Visual Decision Support System", Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC97), 1997, pp.554-558.
[5] S. Miyamoto, Fuzzy Sets in Information Retrieval and Cluster Analysis, Kluwer Academic Publishers, Dordrecht: Boston, 1990.
[6] M.R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, USA, 1973.
[7] A. Charnes, W.W. Cooper, and E. Rhodes, "Measuring the efficiency of decision-making units", European Journal of Operational Research, vol.2, 1978, pp.429-444.
[8] T. Sueyoshi, Management Efficiency Analysis (in Japanese), Asakura Shoten Co., Ltd, Tokyo, 2001.
[9] K. Tone, Measurement and Improvement of Management Efficiency (in Japanese), JUSE Press, Ltd, Tokyo, 1993.
[10] M.J. Farrell, "The Measurement of Productive Efficiency", Journal of the Royal Statistical Society (Series A), vol.120, 1957, pp.253-281.
[11] D.L. Adolphson, G.C. Cornia, and L.C. Walters, "A Unified Framework for Classifying DEA Models", Operational Research '90, edited by E.E. Bradley, Pergamon Press, 1991, pp.647-657.
[12] A. Boussofiane, R.G. Dyson, and E. Thanassoulis, "Invited Review: Applied Data Envelopment Analysis", European Journal of Operational Research, vol.52, 1991, pp.1-15.
[13] R.D. Banker and R.C. Morey, "The use of categorical variables in Data Envelopment Analysis", Management Science, vol.32, 1986, pp.1613-1627.
[14] R.D. Banker, A. Charnes, and W.W. Cooper, "Some models for estimating technical and scale inefficiencies in data envelopment analysis", Management Science, vol.30, 1984, pp.1078-1092.
[15] R.D. Banker, "Estimating Most Productive Scale Size Using Data Envelopment Analysis", European Journal of Operational Research, vol.17, 1984, pp.35-44.
[16] W.A. Kamakura, "A note on the use of categorical variables in Data Envelopment Analysis", Management Science, vol.34, 1988, pp.1273-1276.
[17] J.J. Rousseau and J. Semple, "Categorical outputs in Data Envelopment Analysis", Management Science, vol.39, 1993, pp.384-386.
[18] J.R. King and V. Nakornchai, "Machine-component group formation in group technology: review and extension", International Journal of Production Research, vol.20, 1982, pp.117-133.
[19] J.R. King, "Machine-Component Grouping in Production Flow Analysis: An Approach Using a Rank Order Clustering Algorithm", International Journal of Production Research, vol.18, 1980, pp.213-232.
[20] J.G. Hirschberg and D.J. Aigner, "A Classification for Medium and Small Firms by Time-of-Day Electricity Usage", Papers and Proceedings of the Eighth Annual North American Conference of the International Association of Energy Economists, 1986, pp.253-257.
[21] J. Ward, "Hierarchical grouping to optimize an objective function", Journal of the American Statistical Association, vol.58, 1963, pp.236-244.

Multi-Viewpoint Data Envelopment Analysis for Finding Efficiency and Inefficiency

Shingo AOKI, Member, IEEE, Kiyosei MINAMI, Non-Member, Hiroshi TSUJI, Member, IEEE

(Manuscript received September 28, 2005. The authors are with the Graduate School of Engineering, Osaka Prefecture University, 1-1 Gakuencho, Sakai, Osaka 599-8531, Japan; corresponding author S. Aoki, phone: +81-72-254-9354, fax: +81-72-254-9915, e-mail: [email protected].)

Abstract—This paper proposes a decision support method for measuring productivity efficiency based on DEA (Data Envelopment Analysis). The method, called the Multi-Viewpoint DEA model, integrates the efficiency analysis and the inefficiency analysis, and makes it possible to assess the performance of a DMU (Decision Making Unit) between its strong points and weak points by changing a viewpoint parameter. A case study of twenty-five Japanese baseball players shows that the proposed model is robust in its evaluation values.

Index Terms—Data Envelopment Analysis, Decision-making, Linear programming, Productivity.

I. INTRODUCTION

DEA [1] is a nonparametric method for finding the relative efficiency of DMUs, each of which is a unit responsible for converting multiple inputs into multiple outputs. DEA has been applied to a variety of managerial and economic problem situations in both public and private sectors [5, 9, 13, 14]. DEA defines the process which changes multiple inputs into multiple outputs as one evaluation value. Decision methods based on DEA take two kinds of approaches. One is the efficiency analysis based on the Pareto optimal solutions for the aspect only of the strong points [1, 5].
The other is the inefficiency analysis based on the Pareto optimal solutions for the aspect only of the weak points [7]. The evaluation values of the two approaches are therefore inconsistent [8], and analysts have evaluated DMUs only from one extreme aspect: either the strong points or the weak points. Thus, the two traditional analyses lack flexibility and robustness [17]. In fact, while there are many inputs and outputs in the DEA framework, these items are not fully used in the previous approaches. This type of DEA problem has usually been tackled by multiplier restriction approaches [15] and cone ratio approaches [16]. While such multiplier restrictions usually reduce the number of zero weights, they often produce an infeasible solution in DEA. Therefore, a new DEA model which is robust in its evaluation values is required. This paper proposes a decision support technique referred to as the Multi-Viewpoint DEA model.

The remaining structure of this paper is organized as follows: the next section reviews the traditional DEA models. Section 3 proposes a new model, which integrates the efficiency analysis and the inefficiency analysis into one mathematical formulation and allows us to analyze the performance of a DMU between its strong points and weak points. Section 4 verifies the proposed model through a case study, which shows that the proposed model has two desirable features: (1) robustness of the evaluation value, and (2) unification of the efficiency analysis and the inefficiency analysis. Finally, conclusions and future study are summarized in section 5.

II. DEA-BASED EFFICIENCY AND INEFFICIENCY ANALYSES

A. DEA: Data Envelopment Analysis

In order to describe the mathematical structure of the evaluation value, this paper assumes that there are n DMUs (DMU_1, …, DMU_k, …, DMU_n), where each DMU is characterized by m inputs (x_1k, …, x_ik, …, x_mk) and s outputs (y_1k, …, y_rk, …, y_sk). The evaluation value of DMU_k is mathematically formulated as

  Evaluation Value of DMU_k = (u_1 y_1k + u_2 y_2k + … + u_s y_sk) / (v_1 x_1k + v_2 x_2k + … + v_m x_mk).   (1)

Here u_r is the multiplier weight given to the r-th output, and v_i is the multiplier weight given to the i-th input. From the analysis concept, there are two decision methods for calculating these weights. One is the efficiency analysis based on the Pareto optimal solutions for the aspect only of the strong points [1, 5]. The other is the inefficiency analysis based on the Pareto optimal solutions for the aspect only of the weak points [7, 8]. Fig. 1 visually represents the difference between the two methods. Suppose that there are nine DMUs which have one input and two outputs, where the X-axis is output 1 over the input and the Y-axis is output 2 over the input. If a DMU is located in the upper-right region, the DMU has high productivity. The efficiency analysis finds the efficiency frontier, which indicates the best practice line (B-C-D-E-F in Fig. 1), and evaluates the relative evaluation value from the aspect only of the strong points. On the other hand, the inefficiency analysis finds the inefficiency frontier, which indicates the worst practice line (B-I-H-G-F in Fig. 1), and evaluates the relative evaluation value from the aspect only of the weak points.

(Fig. 1. Efficiency analysis and inefficiency analysis: the efficiency frontier B-C-D-E-F and the inefficiency frontier B-I-H-G-F for nine DMUs.)

B. Efficiency Analysis

The efficiency analysis measures the efficiency level of a specific DMU_k by relatively comparing its performance to the efficiency frontier. This paper is based on the CCR model [1], while there are other models [5, 11]. The efficiency analysis can be mathematically formulated as

  Max θ_E^k = Σ_{r=1}^{s} u_r y_rk   (2-1)
  s.t. Σ_{i=1}^{m} v_i x_ij − Σ_{r=1}^{s} u_r y_rj ≥ 0 (j = 1, 2, …, n),   (2-2)
     Σ_{i=1}^{m} v_i x_ik = 1,   (2-3)
     v_i ≥ 0, u_r ≥ 0.   (2)

Here formula (2-2) is a restriction condition under which the productivity of every DMU (formula (1)) becomes 100% or less. The objective function (2-1) represents the maximization of the sum of the virtual outputs of DMU_k, given that the virtual inputs of DMU_k equal 1 (formula (2-3)). Therefore, the optimal solution (v_i, u_r) represents the weights most convenient for DMU_k, and the optimal objective function value indicates the evaluation value θ_E^k for DMU_k. This evaluation value by the convenient weights is called the "efficiency score", in the manner that θ_E^k = 1 (100%) means the state of efficiency, while θ_E^k < 1 means the state of inefficiency.
C. Inefficiency analysis

There is another analysis which measures the inefficiency level of a specific DMU_k, based on the Inverted DEA model [7]. The inefficiency analysis can be mathematically formulated as

  Min Σ_{r=1}^{s} u_r y_rk   (3-1)
  s.t. Σ_{i=1}^{m} v_i x_ij − Σ_{r=1}^{s} u_r y_rj ≤ 0 (j = 1, 2, …, n),   (3-2)
     Σ_{i=1}^{m} v_i x_ik = 1,   (3-3)
     v_i ≥ 0, u_r ≥ 0.   (3)

Again, formula (3-2) is a restriction condition under which the productivity of every DMU (formula (1)) becomes 100% or more. The objective function (3-1) represents the minimization of the virtual outputs of DMU_k, given that the virtual inputs of DMU_k equal 1 (formula (3-3)). Therefore, the optimal solution (v_i, u_r) represents the weights most inconvenient for DMU_k. The inverse of the optimal objective function value indicates the "inefficiency score", in the manner that θ_IE^k = 1 (100%) means the state of inefficiency, while θ_IE^k < 1 means the state of efficiency.
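As a concrete reading of formulations (2) and (3) (a sketch of ours, not the authors' code), both multiplier-form problems can be solved with scipy.optimize.linprog; a flag switches between the efficiency and inefficiency versions.

```python
import numpy as np
from scipy.optimize import linprog

def multiplier_dea(X, Y, k, efficiency=True):
    """Formulation (2) if efficiency=True, formulation (3) otherwise."""
    m, n = X.shape
    s, _ = Y.shape
    # Decision vector z = (v_1..v_m, u_1..u_s); linprog minimizes,
    # so the efficiency case negates the objective (2-1).
    sign = -1.0 if efficiency else 1.0
    c = np.concatenate([np.zeros(m), sign * Y[:, k]])
    # (2-2): v.x_j - u.y_j >= 0 for all j; (3-2) flips the inequality.
    A = np.hstack([-X.T, Y.T])
    res = linprog(c,
                  A_ub=A if efficiency else -A, b_ub=np.zeros(n),
                  A_eq=np.concatenate([X[:, k], np.zeros(s)]).reshape(1, -1),
                  b_eq=[1.0],                    # (2-3)/(3-3): v.x_k = 1
                  bounds=[(0, None)] * (m + s))
    score = -res.fun if efficiency else 1.0 / res.fun  # (3): inverse of optimum
    return score, res.x[:m], res.x[m:]
```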
D. Requirement for Multi-Viewpoint DEA

As shown in Fig. 1, DMU_B and DMU_F are evaluated as being in both states, "efficiency (θ_E = 1)" and "inefficiency (θ_IE = 1)". This result clearly shows the mathematical difference between the two analyses. For example, DMU_B has the best productivity for output 2/input, while it has the worst productivity for output 1/input. In the efficiency analysis, the weights of DMU_B are evaluated from the aspect of the strong points: the weight of output 2/input becomes a positive value and the weight of output 1/input becomes zero. In the inefficiency analysis, the weights of DMU_B are evaluated from the aspect of the weak points: the weight of output 2/input becomes zero and the weight of output 1/input becomes a positive value. This difference in weight estimation causes the following mathematical problems.

a) No robustness of the evaluation value. Both analyses may produce zero weights for most inputs and outputs. A zero weight indicates that the corresponding input or output is not used in the evaluation value. Moreover, if specific input or output items are removed from the analysis, the evaluation value may change greatly [17]. This type of DEA problem is usually tackled by multiplier restriction approaches [15] and cone ratio approaches [16]. Such multiplier restrictions usually reduce the number of zero weights, but these analyses often produce an infeasible solution. The development of a DEA model which is robust in its evaluation values is therefore required.

b) Lack of unification between efficiency analysis and inefficiency analysis. Fundamentally, an efficient DMU cannot be inefficient, while an inefficient DMU cannot be efficient. However, the evaluation values may be inconsistent, as for DMU_B and DMU_F in Fig. 1, which are in both the state of "efficiency" and the state of "inefficiency". Thus, it is not easy for analysts to understand the difference between evaluation values. A basis for the evaluation value which unifies the efficiency analysis and the inefficiency analysis is required.

III. INTEGRATING THE EFFICIENT AND INEFFICIENT VIEWS

A. Two DEA models based on the GP technique

Let us propose a new decision support technique referred to as the Multi-Viewpoint DEA model. The proposed model is a re-formulation of the efficiency analysis and the inefficiency analysis into one mathematical formulation. This paper applies the following formula (4), which adds the deviation variables (d_j⁺, d_j⁻) to formula (2-2):

  Σ_{i=1}^{m} v_i x_ij − Σ_{r=1}^{s} u_r y_rj − d_j⁺ + d_j⁻ = 0 (j = 1, 2, …, n), d_j⁺, d_j⁻ ≥ 0.   (4)

Here d_j⁺ indicates the slack variables and d_j⁻ indicates the artificial variables. Therefore, using a sufficiently big M, the objective function (2-1) can be replaced mathematically as

  Max Σ_{r=1}^{s} u_r y_rk − M Σ_{j=1}^{n} d_j⁻.   (5)

From formula (4) and formula (2-3), the objective function (5) can be rewritten as

  Σ_{r=1}^{s} u_r y_rk − M Σ_{j=1}^{n} d_j⁻ = (Σ_{i=1}^{m} v_i x_ik − d_k⁺ + d_k⁻) − M Σ_{j=1}^{n} d_j⁻ = 1 − d_k⁺ + (1 − M)d_k⁻ − M Σ_{j=1, j≠k}^{n} d_j⁻.   (6)

Using the GP (Goal Programming) technique, the DEA efficiency model (formula (2)) can thus be replaced by the following linear program:

  Max 1 − d_k⁺ + (1 − M)d_k⁻ − M Σ_{j=1, j≠k}^{n} d_j⁻
  s.t. Σ_{i=1}^{m} v_i x_ij − Σ_{r=1}^{s} u_r y_rj − d_j⁺ + d_j⁻ = 0 (j = 1, 2, …, n),
     Σ_{i=1}^{m} v_i x_ik = 1,
     v_i ≥ 0, u_r ≥ 0, d_j⁺, d_j⁻ ≥ 0.   (7)

The efficiency score θ_E^k of DMU_k is then

  θ_E^k = Σ_{r=1}^{s} u_r* y_rk / Σ_{i=1}^{m} v_i* x_ik (= 1) = 1 − d_k⁺*,   (8)

where the superscript "*" indicates the optimal solution of formula (7). Let us likewise apply formula (4), with the variables (d_j⁺, d_j⁻), to formula (3-2); this paper notes that in the inefficiency analysis d_j⁺ indicates the artificial variables and d_j⁻ indicates the slack variables. Using the GP technique, the inefficiency analysis (formula (3)) can be replaced by the following linear program:

  Min 1 + (M − 1)d_k⁺ + d_k⁻ + M Σ_{j=1, j≠k}^{n} d_j⁺
  s.t. Σ_{i=1}^{m} v_i x_ij − Σ_{r=1}^{s} u_r y_rj − d_j⁺ + d_j⁻ = 0 (j = 1, 2, …, n),
     Σ_{i=1}^{m} v_i x_ik = 1,
     v_i ≥ 0, u_r ≥ 0, d_j⁺, d_j⁻ ≥ 0.   (9)

The inefficiency score θ_IE^k of DMU_k is then

  θ_IE^k = 1 / (1 + d_k⁻*),   (10)

where the superscript "*" indicates the optimal solution of formula (9).
B. Mathematical integration of the efficiency and inefficiency models

In order to integrate the two DEA analyses into one formula mathematically, this paper uses the slack variables. As seen in formulas (7) and (9), the two analyses have the same restriction conditions. This paper therefore combines their objective functions with constants (α, β):

  α{1 − d_k⁺ + (1 − M)d_k⁻ − M Σ_{j≠k} d_j⁻} − β{1 + (M − 1)d_k⁺ + d_k⁻ + M Σ_{j≠k} d_j⁺}
  = (α − β) − {α + β(M − 1)}d_k⁺ + {α(1 − M) − β}d_k⁻ − M(α Σ_{j≠k} d_j⁻ + β Σ_{j≠k} d_j⁺).   (11)

When formula (11) is divided by the sufficiently big M, it can be developed as

  −(β d_k⁺ + α d_k⁻) − (β Σ_{j≠k} d_j⁺ + α Σ_{j≠k} d_j⁻) = −(β Σ_{j=1}^{n} d_j⁺ + α Σ_{j=1}^{n} d_j⁻),   (12)

where the constants can be normalized as α + β = 1, because (α, β) indicate the relative ratios of the efficiency analysis and the inefficiency analysis. The proposed model is then formulated as the following linear program:

  Max −(1 − α) Σ_{j=1}^{n} d_j⁺ − α Σ_{j=1}^{n} d_j⁻
  s.t. Σ_{i=1}^{m} v_i x_ij − Σ_{r=1}^{s} u_r y_rj − d_j⁺ + d_j⁻ = 0 (j = 1, 2, …, n),
     Σ_{i=1}^{m} v_i x_ik = 1,
     v_i ≥ 0, u_r ≥ 0, d_j⁺, d_j⁻ ≥ 0,   (13)

where x_ij is the i-th input value of the j-th DMU, y_rj is the r-th output value of the j-th DMU, v_i and u_r are the input and output weights, and d_j⁺, d_j⁻ are the deviation variables. Formula (13) includes the viewpoint parameter α, and allows us to analyze the performance of a DMU by changing the parameter between the strong points (in particular, if α = 1 the optimal solution is the same as that of the efficiency analysis) and the weak points (if α = 0 the optimal solution is the same as that of the inefficiency analysis). For α = α′, this paper defines the evaluation value θ_k^{MVP,α′} of DMU_k as

  θ_k^{MVP,α′} = α′ θ_E^k − (1 − α′) θ_IE^k = α′(1 − d_k⁺*) − (1 − α′) · 1/(1 + d_k⁻*),   (14)

where the superscript "*" indicates the optimal solution of formula (13). The first term of formula (14) indicates the evaluation value from the aspect of the strong points and the second term indicates it from the aspect of the weak points. Therefore, the evaluation value θ_k^{MVP,α′} is measured on the range between −1 (−100%: inefficiency) and 1 (100%: efficiency).
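A tiny sketch (ours) of how (14) behaves: the deviations d_k⁺*, d_k⁻* must come from solving (13) at each α, and are passed in here as precomputed numbers.

```python
# Evaluation value (14) for DMU_k, given the optimal deviations of (13):
# d_plus is d_k^+* (efficiency side), d_minus is d_k^-* (inefficiency side).
def mvp_value(alpha, d_plus, d_minus):
    theta_eff = 1.0 - d_plus              # efficiency score, formula (8)
    theta_ineff = 1.0 / (1.0 + d_minus)   # inefficiency score, formula (10)
    return alpha * theta_eff - (1.0 - alpha) * theta_ineff

# A DMU with zero deviations is rated 1 at alpha=1 and -1 at alpha=0,
# exactly the "both efficient and inefficient" pattern seen in Table II.
for alpha in (1.0, 0.5, 0.0):
    print(alpha, mvp_value(alpha, 0.0, 0.0))
```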
That is to say, DMU 25 has a limited strong point. Oppositely, as seen in TABLE II, for the batters who is as almighty like DMU1 and DMU 2 , the rank does not change easily. Because the proposed model allows us to know whether DMU has the multiplicity or limit of strong points, it is possible to evaluate the DMU with robustness. superiority can not be applied between these batters in this analysis. 2) Inefficiency Analysis’s Result This analysis finds that there are 10 batters whose evaluation value is -1 (inefficiency). Because the evaluation value is estimated only by the aspect of weak points, these batters are included in the batters which have a little steals even if excelling in the long hits like DMU12 and DMU 23 . As well as efficiency analysis, superiority can not be applied between these batters. b) 3) Proposed Model’s Result The proposed model allows to analyze the performance of DMU between efficiency and inefficiency. To clarify the change of the evaluation value when the view point’s parameter D is shifted from 1 to 0, let us not focus on evaluation value but on rank. Fig 2 shows the change of rank for the specific four batters ( DMU12 , DMU13 , DMU14 , DMU 25 ) which estimated the both states of “efficiency” and “inefficiency”. a) Unification between DEA-efficiency and DEA-inefficiency model In the case ( D 1, 0.8, 0.7, 0.4, 0.2), the rank of DMU14 are changed to 25, 12, 19, 11 and 24. Thus, the change of the rank is large. As shown in TABLE I, because DMU14 has the multiplicity of strong points such as singles, triples and steals, it is understood that DMU14 has high rank roughly. However, this result indicates that the rank does not change linear from the aspect of strong to weak points. Although the efficiency analysis and the inefficiency analysis are integrated into one mathematical formulation, how to assign the view point’s parameter D still remains. Robustness of the evaluation value Although DMU 25 has high rank (25) in the case ( D 1 ), the rank of DMU 25 is rapidly lower in the other cases. Where, thinking about strong points, in TABLE I, it is understood that TABLE II. 
TABLE II. PARAMETER AND ESTIMATION VALUE θ^{MVP}
DMU | α=1 | α=0.9 | α=0.8 | α=0.7 | α=0.6 | α=0.5 | α=0.4 | α=0.3 | α=0.2 | α=0.1 | α=0
1 | 1 | 0.787 | 0.592 | 0.414 | 0.226 | 0.043 | -0.138 | -0.320 | -0.504 | -0.695 | -0.931
2 | 1 | 0.805 | 0.606 | 0.422 | 0.231 | 0.052 | -0.135 | -0.315 | -0.491 | -0.634 | -0.988
3 | 1 | 0.801 | 0.605 | 0.419 | 0.228 | 0.048 | -0.144 | -0.327 | -0.502 | -0.696 | -0.890
4 | 1 | 0.803 | 0.610 | 0.417 | 0.226 | 0.045 | -0.143 | -0.328 | -0.509 | -0.661 | -0.846
5 | 1 | 0.800 | 0.609 | 0.412 | 0.220 | 0.035 | -0.158 | -0.350 | -0.526 | -0.656 | -0.881
6 | 0.980 | 0.749 | 0.557 | 0.373 | 0.196 | 0.021 | -0.172 | -0.355 | -0.543 | -0.720 | -0.933
7 | 0.989 | 0.743 | 0.553 | 0.381 | 0.185 | 0.012 | -0.184 | -0.377 | -0.569 | -0.718 | -0.909
8 | 0.981 | 0.683 | 0.499 | 0.319 | 0.150 | -0.034 | -0.209 | -0.405 | -0.604 | -0.793 | -1
9 | 1 | 0.710 | 0.507 | 0.353 | 0.159 | -0.023 | -0.218 | -0.405 | -0.603 | -0.732 | -0.963
10 | 0.947 | 0.746 | 0.559 | 0.373 | 0.197 | 0.012 | -0.187 | -0.377 | -0.555 | -0.733 | -0.930
11 | 1 | 0.803 | 0.612 | 0.423 | 0.228 | 0.037 | -0.163 | -0.349 | -0.501 | -0.674 | -0.905
12 | 1 | 0.714 | 0.515 | 0.330 | 0.177 | -0.020 | -0.211 | -0.386 | -0.578 | -0.799 | -1
13 | 1 | 0.738 | 0.557 | 0.392 | 0.177 | -0.009 | -0.198 | -0.372 | -0.541 | -0.701 | -1
14 | 1 | 0.748 | 0.554 | 0.405 | 0.209 | 0.006 | -0.201 | -0.376 | -0.499 | -0.677 | -1
15 | 0.955 | 0.762 | 0.563 | 0.389 | 0.212 | 0.022 | -0.176 | -0.362 | -0.550 | -0.696 | -1
16 | 1 | 0.797 | 0.570 | 0.398 | 0.216 | 0.011 | -0.185 | -0.377 | -0.545 | -0.722 | -0.922
17 | 0.851 | 0.640 | 0.445 | 0.277 | 0.126 | -0.054 | -0.239 | -0.427 | -0.622 | -0.802 | -1
18 | 0.926 | 0.716 | 0.520 | 0.359 | 0.165 | -0.029 | -0.213 | -0.409 | -0.609 | -0.787 | -1
19 | 1 | 0.758 | 0.573 | 0.383 | 0.182 | -0.006 | -0.198 | -0.385 | -0.558 | -0.733 | -0.935
20 | 0.926 | 0.729 | 0.527 | 0.373 | 0.176 | -0.012 | -0.207 | -0.406 | -0.587 | -0.716 | -0.946
21 | 1 | 0.764 | 0.570 | 0.377 | 0.176 | -0.009 | -0.197 | -0.382 | -0.572 | -0.744 | -0.961
22 | 0.934 | 0.731 | 0.539 | 0.353 | 0.172 | -0.017 | -0.209 | -0.404 | -0.588 | -0.773 | -0.971
23 | 0.916 | 0.696 | 0.499 | 0.326 | 0.161 | -0.033 | -0.215 | -0.411 | -0.608 | -0.800 | -1
24 | 0.849 | 0.644 | 0.456 | 0.276 | 0.117 | -0.069 | -0.251 | -0.427 | -0.619 | -0.806 | -1
25 | 1 | 0.632 | 0.451 | 0.294 | 0.109 | -0.071 | -0.252 | -0.436 | -0.624 | -0.805 | -1

(Fig. 2. Rank of the four players DMU_12, DMU_13, DMU_14 and DMU_25 as the parameter α moves from 1 through 0.5 to 0.)

V. CONCLUSION

This paper has proposed a new decision support method, called the Multi-Viewpoint DEA model, which integrates the efficiency analysis and the inefficiency analysis in one mathematical formulation. The proposed model allows us to analyze the performance of a DMU by changing the viewpoint parameter between the strong points (if α = 1 it becomes the efficiency analysis) and the weak points (if α = 0 it becomes the inefficiency analysis). Regarding twenty-five Japanese baseball players as DMUs, a case study has shown that the proposed model has two desirable features: (a) robustness of the evaluation value, and (b) unification of the efficiency analysis and the inefficiency analysis. For future study, we will analytically compare our method to the traditional approaches [15, 16] and explore how to set the viewpoint parameter.

REFERENCES

[1] A. Charnes, W.W. Cooper, and E. Rhodes, "Measuring the efficiency of decision making units", European Journal of Operational Research, Vol.2, 1978, pp.429-444.
[2] T. Sueyoshi and S. Aoki, "A use of a nonparametric statistic for DEA frontier shift: the Kruskal and Wallis rank test", OMEGA: The International Journal of Management Science, Vol.29, No.1, 2001, pp.1-18.
[3] T. Sueyoshi, K. Onishi, and Y. Kinase, "A Benchmark Approach for Baseball Evaluation", European Journal of Operational Research, Vol.115, 1999, pp.429-448.
[4] T. Sueyoshi, Y. Kinase and S. Aoki, "DEA Duality on Returns to Scale in Production and Cost Analysis", Proceedings of the Sixth Asia Pacific Management Conference 2000, 2000, pp.1-7.
[5] W.W. Cooper, L.M. Seiford, K. Tone, Data Envelopment Analysis: A Comprehensive Text with Models, Applications, References and DEA-Solver Software, Kluwer Academic Publishers, 2000.
[6] R. Coombs, P. Saviotti and V. Walsh, Economics and Technological Change, Macmillan, 1987.
[7] Y. Yamada, T. Matui and M. Sugiyama, "An inefficiency measurement method for management systems", Journal of the Operations Research Society of Japan, Vol.37, 1994, pp.158-168 (in Japanese).
[8] Y. Yamada, T. Sueyoshi, M. Sugiyama, T. Nukina and T. Makino, "The DEA Method for Japanese Management: The Evaluation of Local Governmental Investments to the Japanese Economy", Journal of the Operations Research Society of Japan, Vol.38, No.4, 1995, pp.381-396.
[9] S. Aoki, K. Mishima, H. Tsuji, "Two-Staged DEA Model with Malmquist Index for Brand Value Estimation", The 8th World Multiconference on Systemics, Cybernetics and Informatics, Vol.10, 2004, pp.1-6.
[10] R.D. Banker, A. Charnes, W.W. Cooper, "Some Models for Estimating Technical and Scale Inefficiencies in Data Envelopment Analysis", Management Science, Vol.30, 1984, pp.1078-1092.
[11] R.D. Banker and R.M. Thrall, "Estimation of Returns to Scale Using Data Envelopment Analysis", European Journal of Operational Research, Vol.62, 1992, pp.74-82.
[12] H. Nakayama, M. Arakawa, Y.B. Yun, "Data Envelopment Analysis in Multicriteria Decision Making", in M. Ehrgott and X. Gandibleux (eds.), Multiple Criteria Optimization: State of the Art Annotated Bibliographic Surveys, Kluwer Academic Publishers, 2002.
[13] E.W.N. Bernroider, V. Stix, "The Evaluation of ERP Systems Using Data Envelopment Analysis", Information Technology and Organizations, Idea Group Pub, 2003, pp.283-286.
[14] Y. Zhou, Y. Chen, "DEA-based Performance Predictive Design of Complex Dynamic System Business Process Improvement", Proceedings of the 2003 IEEE International Conference on Systems, Man and Cybernetics, 2003, pp.3008-3013.
[15] R.G. Thompson, L.N. Langemeier, C.T. Lee, and R.M. Thrall, "The Role of Multiplier Bounds in Efficiency Analysis with Application to Kansas Farming", Journal of Econometrics, Vol.46, 1990, pp.93-108.
[16] W.W. Cooper, W. Quanling and G. Yu, "Using Displaced Cone Representation in DEA Models for Nondominated Solutions in Multiobjective Programming", Systems Science and Mathematical Sciences, Vol.10, 1997, pp.41-49.
[17] S. Aoki, Y. Naito, and H. Tsuji, "DEA-based Indicator for Performance Improvement", Proceedings of the Third International Conference on Active Media Technology, 2005, pp.327-330.

Mining Valuable Stocks with Genetic Optimization Algorithm

Lean Yu, Kin Keung Lai and Shouyang Wang

(Manuscript received July 30, 2005. This work was supported in part by the SRG of City University of Hong Kong under Grant No. 7001806. Lean Yu and Shouyang Wang are with the Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100080, China (e-mail: [email protected]; [email protected]). Kin Keung Lai is with the Department of Management Science, City University of Hong Kong, and also with the College of Business Administration, Hunan University, 410082, China (phone: 852-2788-8563; fax: 852-2788-8560; e-mail: [email protected]).)

Abstract—In this study, we utilize the genetic algorithm (GA) to mine high quality stocks for investment. Given the fundamental financial and price information of stock trading, we attempt to use GA to identify stocks that are likely to outperform the market by having excess returns. To evaluate the efficiency of the GA for stock selection, the return of an equally weighted portfolio formed by the stocks selected by GA is used as the evaluation criterion. Experiment results reveal that the proposed GA for stock selection provides a very flexible and useful tool to assist investors in selecting valuable stocks.

Index Terms—Genetic algorithms; Portfolio optimization; Data mining; Stock selection
I. INTRODUCTION

In the stock market, investors are often faced with a large number of stocks. A crucial part of their investment decision process is the selection of stocks. From a data-mining perspective, the problem of stock selection is to identify good quality stocks that have the potential to outperform the market by having excess returns in the future. Given the fundamental accounting and price information of stock trading, it is a prediction problem that involves discovering useful patterns or relationships in the data, and applying that information to identify whether a stock is of good quality. Obviously, this is not an easy task for many investors when they are faced with the enormous number of stocks in the market.

With a focus on business computing, applying artificial intelligence to portfolio selection and optimization is one way to meet the challenge. Some research has been presented to solve the asset selection problem. Levin [1] applied an artificial neural network to select valuable stocks. Chu [2] used fuzzy multiple attribute decision analysis to select stocks for a portfolio. Similarly, Zargham [3] used a fuzzy rule-based system to evaluate the listed stocks and realize stock selection. Recently, Fan [4] utilized support vector machines to train universal feedforward neural networks to perform stock selection. However, these approaches have some drawbacks in solving the stock selection problem. For example, the fuzzy approaches [2-3] usually lack learning ability, while the neural network approach [1, 4] has an overfitting problem and is often easily trapped in local minima. In order to overcome these shortcomings, GA is used to perform this task; some related typical literature can be found in [5-7]. The main aim of this study is to mine valuable stocks using GA and to test the efficiency of the GA for stock selection. The rest of the study is organized as follows. Section 2 describes the mining process based on the genetic algorithm in detail. Section 3 presents a simulation experiment, and Section 4 concludes the paper.

II. GA-BASED STOCK SELECTION PROCESS

Generally, GA imitates the natural selection process in biological evolution with selection, crossover and mutation; the sequence of the different operations of a genetic algorithm is shown in the left part of Figure 1. That is, GA consists of procedures modeled after genetics and evolution. Genetics provides the chromosomal representation to encode the solution space of the problem, while the evolutionary procedures are designed to search efficiently for attractive solutions to large and complex problems. Usually, GA works in a survival-of-the-fittest fashion, gradually manipulating the potential problem solutions to obtain superior solutions in the population. Optimization is performed in the representation rather than in the problem space directly. To date, GA has become a popular optimization method, as it often succeeds in finding the best optimum by global search, in contrast to most common optimization algorithms. Interested readers are referred to [8-9] for more details.
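The loop just described is the standard GA cycle; a minimal generic sketch (ours, with an arbitrary fitness argument, not the paper's exact operators or parameters) looks like this. A chromosome here is simply a list of 0/1 genes such as the encodings described below.

```python
import random

def evolve(population, fitness, generations=50, p_mutation=0.05):
    """Plain GA loop: selection, one-point crossover, bit-flip mutation."""
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: len(population) // 2]   # truncation selection
        children = []
        while len(children) < len(population) - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))          # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([g ^ 1 if random.random() < p_mutation else g
                             for g in child])          # bit-flip mutation
        population = parents + children
    return max(population, key=fitness)
```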
The aim of this study is to identify the quality of each stock using GA so that investors can choose good stocks for investment. Here we use stock ranking to determine the quality of a stock: the stocks with a high rank are regarded as good quality stocks. In this study, some financial indicators of the listed companies are employed to determine and identify the quality of each stock. That is, the financial indicators of the companies are used as input variables, while a score is given to rate the stocks, and the output variable is the stock ranking. Throughout the study, four important financial indicators, return on capital employed (ROCE), price/earnings ratio (P/E ratio), earnings per share (EPS) and liquidity ratio, are utilized. Their meaning is formulated as

  ROCE = (Profit) / (Shareholder's equity) × 100%,   (1)
  P/E ratio = (Stock price) / (Earnings per share) × 100%,   (2)
  EPS = (Net income) / (Number of ordinary shares),   (3)
  Liquidity ratio = (Current assets) / (Current liabilities).   (4)

When the input variables are determined, we can use GA to distinguish and identify the quality of each stock, as illustrated in Fig. 1.

(Fig. 1. Stock selection with genetic algorithm.)

First of all, a population, which consists of a given number of chromosomes, is initially created by randomly assigning "1" and "0" to all genes. In the case of stock ranking, a gene contains a single bit string for the status of an input variable. The top right part of Figure 1 shows a population with four chromosomes, where each chromosome includes different genes. In this study, the initial population of the GA is generated by encoding the four input variables. For the case of ROCE, we design 8 statuses representing different qualities in terms of different intervals, varying from 0 (extremely poor) to 7 (very good). An example of encoding ROCE is shown in Table 1; the other input variables are encoded by the same principle. That is, the binary string of a gene consists of three single bits, as illustrated by Fig. 1.

TABLE 1. AN EXAMPLE OF ENCODING ROCE
ROCE value | Status | Encoding
(−∞, −30%] | 0 | 000
(−30%, −20%] | 1 | 001
(−20%, −10%] | 2 | 010
(−10%, 0%] | 3 | 011
(0%, 10%] | 4 | 100
(10%, 20%] | 5 | 101
(20%, 30%] | 6 | 110
(30%, +∞) | 7 | 111

It is worth noting that the 3-digit encoding is used for simplicity in this study. Of course, a 4-digit encoding could also be adopted, but the computations would be rather more complex. The subsequent work is to evaluate the chromosomes generated by the previous operations with a so-called fitness function. The design of the fitness function is a crucial point in using GA, since it determines what a GA should optimize. Since the output is an estimated stock ranking of the designated testing companies, an actual stock ranking should be defined in advance for designing the fitness function. Here we use the annual price return (APR) to rank the listed stocks; the APR is represented as

  APR_n = (ASP_n − ASP_{n−1}) / ASP_{n−1},   (5)

where APR_n is the annual price return for year n and ASP_n is the annual stock price for year n. Usually, the stocks with a high annual price return are regarded as good stocks. With the value of the APR evaluated for each of the N trading stocks, each stock is assigned a rank r ranging from 1 to N, where 1 corresponds to the highest value of the APR and N to the lowest. For convenience of comparison, the stock's rank r is mapped linearly onto a stock ranking ranging from 0 to 7 according to the following equation:

  Ranking = 7 × (N − r) / (N − 1).   (6)
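To tie (5), (6) and the RMSE evaluation together, here is a hedged sketch (ours): how the paper turns a chromosome into a derived ranking is not fully specified in the text, so `derived_ranking` is left as a caller-supplied function.

```python
import numpy as np

def apr(asp_now, asp_prev):
    return (asp_now - asp_prev) / asp_prev                  # formula (5)

def target_ranking(aprs):
    """Map APR ranks 1..N linearly onto the 7..0 ranking scale of (6)."""
    order = np.argsort(np.argsort(-np.asarray(aprs))) + 1   # rank 1 = best APR
    N = len(aprs)
    return 7 * (N - order) / (N - 1)

def fitness(chromosome, stocks, derived_ranking, target):
    """Negative RMSE between derived and actual rankings, to be maximized."""
    derived = np.array([derived_ranking(chromosome, s) for s in stocks])
    return -np.sqrt(np.mean((derived - target) ** 2))
```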
III. SIMULATION EXPERIMENT

The sample data are collected from the Shanghai Stock Exchange (http://www.sse.com.cn) and span the period from January 2, 2002 to December 31, 2004. The monthly and yearly data in this study are obtained by computation from the daily data. For the simulation, 100 stocks are randomly selected from the Shanghai A shares; their stock codes vary from 600000 to 600100.

First of all, the company financial information, as the input variables, is fed into the GA to obtain the derived company ranking. This output is compared with the actual stock ranking in terms of the APR, as indicated by Equations (5) and (6). In the process of GA optimization, the RMSE between the derived and the actual ranking of each stock is calculated and serves as the evaluation function of the GA process. The best chromosome obtained is then used to rank the stocks, and the top n stocks are chosen for the portfolio. For experiment purposes, the top 10 and top 20 stocks are chosen according to the ranking of stock quality by the GA, and equally weighted portfolios are built from them for comparison purposes. In order to evaluate the usefulness of the GA optimization, we compare the net accumulated return generated by the stocks selected by the GA with a benchmark; the benchmark return is determined by an equally weighted portfolio of all the stocks available in the experiment. Fig. 2 reveals the results for the different portfolios.

(Fig. 2. Accumulated return for different portfolios.)

From Fig. 2, we find that the net accumulated return of the equally weighted portfolios formed by the stocks selected by GA significantly outperforms the benchmark. In addition, the performance of the portfolio of the top 10 stocks is better than that of the top 20 stocks. As we know, a portfolio does not only focus on the expected return but also on risk minimization. The larger the number of stocks in the portfolio, the more flexibly the portfolio can be composed to avoid risk. However, selecting good quality stocks is the prerequisite for obtaining a good portfolio. That is, although a portfolio with a large number of stocks can lower the risk to some extent, some bad quality stocks may be included in the portfolio, which hurts the portfolio performance. Meanwhile, this result also demonstrates that if investors select good quality stocks, a portfolio with a large number of stocks does not necessarily outperform a portfolio with a small number of stocks. It is therefore wise for investors to select a limited number of good quality stocks when constructing a portfolio.

IV. CONCLUSIONS

This study uses a genetic optimization algorithm to perform stock selection for portfolios. Experiment results reveal that the GA optimization approach is useful for the problem of stock selection and can mine the most valuable stocks for investors.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. Their comments have improved the quality of the paper immensely.

REFERENCES

[1] A.U. Levin, "Stock selection via nonlinear multi-factor models," Advances in Neural Information Processing Systems, 1995, pp. 966-972.
[2] T.C. Chu, C.T. Tsao, and Y.R. Shiue, "Application of fuzzy multiple attribute decision making on company analysis for stock selection," Proceedings of Soft Computing in Intelligent Systems and Information Processing, 1996, pp. 509-514.
[3] M.R. Zargham and M.R. Sayeh, "A web-based information system for stock selection and evaluation," Proceedings of the First International Workshop on Advance Issues of E-Commerce and Web-Based Information Systems, 1999, pp. 81-83.
[4] A. Fan and M. Palaniswami, "Stock selection using support vector machines," Proceedings of the International Joint Conference on Neural Networks, 2001, pp. 1793-1798.
In this study, we select 100 stocks from the Shanghai A-share market (http://www.sse.com.cn); their stock codes range from 600000 to 600100. The sample data span the period from January 2, 2002 to December 31, 2004, and monthly and yearly data are obtained by computation from the daily data. First of all, the company financial information is fed into the GA as the input variables to obtain a derived company ranking. This output is compared with the actual stock ranking in terms of APR, as given by Equations (5) and (6). In the GA optimization process, the RMSE between the derived and the actual ranking of each stock is calculated and serves as the evaluation function of the GA. The best chromosome obtained is then used to rank the stocks, and the top n stocks are chosen for the portfolio. For experimental purposes, the top 10 and top 20 stocks according to the GA ranking of stock quality are chosen to construct portfolios; for convenience of comparison, equally weighted portfolios are built. In order to evaluate the usefulness of the GA optimization, we compare the net accumulated return generated by the stocks selected by the GA with a benchmark, namely the return of an equally weighted portfolio of all the stocks available in the experiment. Fig. 2 reveals the results for the different portfolios.

Fig. 2 Accumulated return for different portfolios

From Fig. 2, we can see that the net accumulated return of the equally weighted portfolio formed by the stocks selected by the GA significantly outperforms the benchmark. In addition, the performance of the portfolio of 10 stocks is better than that of the 20 stocks. As we know, a portfolio focuses not only on the expected return but also on risk minimization. The larger the number of stocks in a portfolio, the more flexibility the portfolio has in composing itself so as to avoid risk. However, selecting good-quality stocks is the prerequisite for obtaining a good portfolio. That is, although a portfolio with a large number of stocks can lower risk to some extent, some bad-quality stocks may be included in the portfolio, which hurts the portfolio's performance. Meanwhile, this result also demonstrates that if investors select good-quality stocks, a portfolio with a large number of stocks does not necessarily outperform a portfolio with a small number of stocks. Therefore it is wise for investors to select a limited number of good-quality stocks when constructing a portfolio.

IV. CONCLUSIONS

This study uses a genetic optimization algorithm to perform stock selection for portfolios. Experimental results reveal that the GA optimization approach is useful for the problem of stock selection and can mine the most valuable stocks for investors.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. Their comments have improved the quality of the paper immensely.

A Comparison Study of Multiclass Classification between Multiple Criteria Mathematical Programming and Hierarchical Method for Support Vector Machines

Yi Peng 1, Gang Kou 1, Yong Shi 1,2,3, Zhenxing Chen 1 and Hongjin Yang 2
1 College of Information Science & Technology, University of Nebraska at Omaha, Omaha, NE 68182, USA {ypeng, gkou, zchen}@mail.unomaha.edu
2 Chinese Academy of Sciences Research Center on Data Technology & Knowledge Economy, Graduate University of the Chinese Academy of Sciences, Beijing 100080, China {yshi, hjyang}@gucas.ac.cn
3 The corresponding author

Abstract

Multiclass classification refers to classifying data objects into more than two classes. The purpose of this paper is to compare two multiclass classification approaches: Multiple Criteria Mathematical Programming (MCMP) and the hierarchical method for Support Vector Machines (SVM). While MCMP considers all classes at once, SVM was initially designed for binary classification; extending SVM from two-class to multiclass classification is still an ongoing research issue, and many proposed approaches use a hierarchical method. In this paper, we focus on one common hierarchical method, pairwise classification. We compare the performance of MCMP and the SVM pairwise approach using KDD99, a large network intrusion dataset. Results show that MCMP achieves better multiclass classification accuracies than pairwise SVM.

Keywords: classification, multi-group classification, multiple criteria mathematical programming (MCMP), pairwise classification

1. INTRODUCTION

As one of the major data mining functionalities, classification has broad applications such as credit card portfolio management, medical diagnosis, and fraud detection.
Based on historical information, classification builds classifiers to predict categorical class labels for unknown data. Classification methods can be divided in various ways, and one distinction is between binary and multiclass classification. Binary classification, as the name indicates, classifies data into two classes. Multiclass classification refers to classifying data objects into more than two classes, and many real-life applications require it. For example, a multiclass classifier that is capable of predicting subtypes of cancer is more helpful than a binary classifier that can only predict cancer or non-cancer. Researchers have suggested various multiclass classification methods; Multiple Criteria Mathematical Programming (MCMP) and the hierarchical method for Support Vector Machines (SVM) are two of them. MCMP and SVM are both based on mathematical programming, and no comparison study of the two has been conducted to date. The purpose of this paper is to compare these two multiclass classification approaches. While MCMP considers all classes at once, SVM was initially designed for binary classification; extending SVM from two-class to multiclass classification is still an ongoing research issue, and many proposed approaches use a hierarchical approach. In this paper, we focus on one common hierarchical method, pairwise classification. We first introduce MCMP and SVM pairwise classification, and then implement an experiment to compare their performance using KDD99, a large network intrusion dataset.

This paper is structured as follows. The next section discusses the formulation of the multiple-group multiple criteria mathematical programming classification model. The third section describes the pairwise SVM multiclass classification method. The fourth section compares the performance of MCMP and pairwise SVM using KDD99. The last section concludes the paper.

2. MULTI-GROUP MULTI-CRITERIA MATHEMATICAL PROGRAMMING MODEL

This section introduces a MCMP model for multiclass classification. Simply speaking, this method classifies observations into distinct groups based on two criteria. The following models represent this concept mathematically.

Given an r-dimensional attribute vector a = (a_1, ..., a_r), let A_i = (A_{i1}, ..., A_{ir}) ∈ R^r be one of the sample records, where i = 1, ..., n and n represents the total number of records in the dataset. Suppose k groups G_1, G_2, ..., G_k are predefined, with G_i ∩ G_j = ∅ for i ≠ j, 1 ≤ i, j ≤ k, and A_i ∈ {G_1 ∪ G_2 ∪ ... ∪ G_k}, i = 1, ..., n. A series of boundary scalars b_1 < b_2 < ... < b_{k-1} can be set to separate these k groups; the boundary b_j is used to separate G_j and G_{j+1}. Let X = (x_1, ..., x_r)^T ∈ R^r be a vector of real numbers to be determined. Thus, we can establish the following linear inequalities (Fisher 1936, Shi et al. 2001):

A_i X < b_1, for A_i ∈ G_1 (1)
b_{j-1} ≤ A_i X < b_j, for A_i ∈ G_j, 2 ≤ j ≤ k-1 (2)
A_i X ≥ b_{k-1}, for A_i ∈ G_k (3)

where 1 ≤ i ≤ n.

A mathematical function f can be used to describe the summation of total overlapping, while another mathematical function g represents the aggregation of all distances. The final classification accuracy of this multi-group classification problem depends on simultaneously minimizing f and maximizing g. Thus, a generalized bi-criteria programming method for classification can be formulated as:

(Generalized Model) Minimize f and Maximize g
Subject to: (1), (2) and (3)

To formulate the criteria and complete constraints for data separation, some variables need to be introduced. In the classification problem, A_i X is the score for the i-th data record.
If an element A_i ∈ G_j is misclassified into a group other than G_j, let α_{i,j} be the overlapping distance from A_i to b_j, with A_i X = b_j + α_{i,j}, 1 ≤ j ≤ k-1, and let α_{i,j-1} be the overlapping distance from A_i ∈ G_j to b_{j-1}, with A_i X = b_{j-1} - α_{i,j-1}, 2 ≤ j ≤ k; otherwise α_{i,j} = 0, 1 ≤ j ≤ k, 1 ≤ i ≤ n. Distances are measured by the p-norm ||·||_p, 1 ≤ p ≤ ∞. Therefore the function f of total overlapping of data can be represented as

f(α) = Σ_{j=1}^{k} Σ_{i=1}^{n} ||α_{i,j}||_p

If an element A_i ∈ G_j is correctly classified into G_j, let ζ_{i,j} be the distance from A_i to b_j, with A_i X = b_j - ζ_{i,j}, 1 ≤ j ≤ k-1, and let ζ_{i,j-1} be the distance from A_i ∈ G_j to b_{j-1}, with A_i X = b_{j-1} + ζ_{i,j-1}, 2 ≤ j ≤ k; otherwise ζ_{i,j} = 0, 1 ≤ j ≤ k, 1 ≤ i ≤ n. The objective is thus to maximize the distance ζ_{i,j} from A_i to the boundary if A_i ∈ G_1 or G_k, and to minimize the distance |(b_j - b_{j-1})/2 - ζ_{i,j}| from A_i to the middle of the two adjacent boundaries b_{j-1} and b_j if A_i ∈ G_j, 2 ≤ j ≤ k-1. So the function g of the distances of every data record to its class boundary or boundaries can be represented as

g(ζ) = Σ_{j=1 or k} Σ_{i=1}^{n} ||ζ_{i,j}||_p - Σ_{j=2}^{k-1} Σ_{i=1}^{n} ||(b_j - b_{j-1})/2 - ζ_{i,j}||_p

Furthermore, to transform the generalized bi-criteria classification model into a single-criterion problem, weights w_α > 0 and w_ζ > 0 are introduced for f(α) and g(ζ), respectively; the values of w_α and w_ζ can be pre-defined in the process of identifying the optimal solution. As a result, the generalized model can be converted into a single-criterion mathematical programming model:

(Model 1) Minimize w_α Σ_{j=1}^{k} Σ_{i=1}^{n} ||α_{i,j}||_p - w_ζ ( Σ_{j=1 or k} Σ_{i=1}^{n} ||ζ_{i,j}||_p - Σ_{j=2}^{k-1} Σ_{i=1}^{n} ||(b_j - b_{j-1})/2 - ζ_{i,j}||_p )

Subject to:
A_i X = b_j + α_{i,j} - ζ_{i,j}, A_i ∈ G_j, 1 ≤ j ≤ k-1 (4)
A_i X = b_{j-1} - α_{i,j-1} + ζ_{i,j-1}, A_i ∈ G_j, 2 ≤ j ≤ k (5)
ζ_{i,j} ≤ b_j - b_{j-1}, 2 ≤ j ≤ k (a)
ζ_{i,j} ≤ b_{j+1} - b_j, 1 ≤ j ≤ k-1 (b)

where A_i, i = 1, ..., n, are given, X and b_j are unrestricted, and α_{i,j}, ζ_{i,j} ≥ 0, 1 ≤ i ≤ n, 1 ≤ j ≤ k.

Constraints (a) and (b) are defined as such because the distances from any correctly classified data record (A_i ∈ G_j, 2 ≤ j ≤ k-1) to its two adjacent boundaries b_{j-1} and b_j must be less than b_j - b_{j-1}. A better separation of two adjacent groups may be achieved by the following constraints instead of (a) and (b), because (c) and (d) set up stronger limitations on ζ_{i,j}:

ζ_{i,j} ≤ (b_j - b_{j-1})/2 + ε, 2 ≤ j ≤ k (c)
ζ_{i,j} ≤ (b_{j+1} - b_j)/2 + ε, 1 ≤ j ≤ k-1 (d)

where ε is a small positive real number.

Let p = 2; the objective function in Model 1 then becomes quadratic and we have:

(Model 2) Minimize w_α Σ_{j=1}^{k} Σ_{i=1}^{n} (α_{i,j})² - w_ζ ( Σ_{j=1 or k} Σ_{i=1}^{n} (ζ_{i,j})² - Σ_{j=2}^{k-1} Σ_{i=1}^{n} [(ζ_{i,j})² - (b_j - b_{j-1}) ζ_{i,j}] ) (6)

Subject to: (4), (5), (c) and (d)

Note that the constant ((b_j - b_{j-1})/2)² is omitted from (6) without any effect on the solution. A version of Model 2 for three predefined classes is depicted in Figure 1: the stars represent group 1 data objects, the black dots group 2, and the white circles group 3, with the deviations α_{i,j} and ζ_{i,j} shown around the boundaries b_1 and b_2 under constraints (4) and (5). A toy implementation of this formulation is sketched below.

Figure 1. A Three-class Model
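To make the formulation concrete, the sketch below (ours, not from the paper) solves a small three-group instance of the p = 1 analogue of Model 1 with scipy's LP solver. Several simplifying assumptions keep the toy bounded and self-contained: the boundaries are fixed at b1 = 0 and b2 = 1, a constant attribute is appended to each record to restore the shift freedom lost by fixing them, the interior deviations are capped as in constraints (a)-(b), and the objective is simplified to weighted exterior deviations minus weighted interior deviations:

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
# Three synthetic groups, roughly ordered along the first attribute.
G1 = rng.normal([-2.0, 0.0], 0.5, (10, 2))
G2 = rng.normal([ 0.0, 0.0], 0.5, (10, 2))
G3 = rng.normal([ 2.0, 0.0], 0.5, (10, 2))
A = np.vstack([G1, G2, G3])
A = np.hstack([A, np.ones((len(A), 1))])   # constant attribute (toy normalization)
grp = np.array([1]*10 + [2]*10 + [3]*10)
n, r = A.shape

b1, b2 = 0.0, 1.0                # boundaries fixed to remove scale/shift freedom
w_alpha, w_zeta = 10.0, 1.0
cap = b2 - b1                    # interior deviations capped as in (a)-(b)

# One equality A_i X = b +/- alpha -/+ zeta per record and adjacent boundary.
rows = []                        # (record, boundary, side): side=+1 means "below"
for Ai, g in zip(A, grp):
    if g == 1: rows.append((Ai, b1, +1))
    if g == 2: rows.extend([(Ai, b1, -1), (Ai, b2, +1)])
    if g == 3: rows.append((Ai, b2, -1))
m = len(rows)

A_eq = np.zeros((m, r + 2*m)); b_eq = np.zeros(m)
for t, (Ai, bb, side) in enumerate(rows):
    A_eq[t, :r] = Ai
    A_eq[t, r + t] = -side       # alpha: exterior deviation, penalized
    A_eq[t, r + m + t] = side    # zeta: interior deviation, rewarded
    b_eq[t] = bb
c = np.concatenate([np.zeros(r), w_alpha*np.ones(m), -w_zeta*np.ones(m)])
bounds = [(None, None)]*r + [(0, None)]*m + [(0, cap)]*m

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
X = res.x[:r]
pred = np.digitize(A @ X, [b1, b2]) + 1  # <b1 -> G1, [b1,b2) -> G2, >=b2 -> G3
print("training accuracy:", (pred == grp).mean())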
These models can be used in multiclass classification, and the applicability of each model depends on the nature of the given dataset. If the adjacent groups in a dataset do not have any overlapping data, Model 3 or Model 4 is more appropriate; otherwise, Model 2 can generate better results. Model 2 can be regarded as a "weak separation formula" since it allows overlapping. In addition, a "medium separation formula" can be constructed on the absolute class boundaries without any overlapping data (Model 3). Furthermore, a "strong separation formula" that requires a non-zero distance between the boundaries of two adjacent groups (Model 4) emphasizes the non-overlapping characteristic between adjacent groups.

(Model 3) Minimize (6)
Subject to: (c) and (d)
A_i X ≤ b_j - ζ_{i,j}, 1 ≤ j ≤ k-1
A_i X ≥ b_{j-1} + ζ_{i,j-1}, 2 ≤ j ≤ k
where A_i, i = 1, ..., n, are given, X and b_j are unrestricted, and α_{i,j}, ζ_{i,j} ≥ 0, 1 ≤ i ≤ n.

(Model 4) Minimize (6)
Subject to: (c) and (d)
A_i X ≤ b_j - α_{i,j} - ζ_{i,j}, 1 ≤ j ≤ k-1
A_i X ≥ b_{j-1} + α_{i,j-1} + ζ_{i,j-1}, 2 ≤ j ≤ k
where A_i, i = 1, ..., n, are given, X and b_j are unrestricted, and α_{i,j}, ζ_{i,j} ≥ 0, 1 ≤ i ≤ n.

3. SVM PAIRWISE MULTICLASS CLASSIFICATION

Statistical Learning Theory was proposed by Vapnik and Chervonenkis in the 1960s. The Support Vector Machine (SVM) is a kernel-machine-based statistical learning method that can be applied to various types of data and can detect the internal relations among the data objects. Given a set of data, one can define a kernel matrix to construct the SVM and compute an optimal hyperplane in the feature space induced by the kernel (Vapnik, 1995). There exist different multiclass training strategies for SVM, such as one-against-rest classification, one-against-one (pairwise) classification, and error-correcting output codes (ECOC).

LIBSVM is a well-known free software package for support vector classification. We use the latest version, LIBSVM 2.8, in our experimental study. This software uses the one-against-one (pairwise) method for multiclass SVM (Chang and Lin, 2001). The one-against-one method was first proposed by Knerr et al. in 1990. It constructs a total of k(k-1)/2 binary SVM classifiers, each trained on two distinct classes of the total k classes (Hsu and Lin, 2002). The following quadratic program is used k(k-1)/2 times to generate the multi-category SVM classifiers:

Minimize (W/2) ||ξ||² + (1/2) ||x, b||²
Subject to: D(Ax - eb) ≥ e - ξ

where e is a vector of ones, D is the diagonal matrix of class labels, and ξ is the vector of slack variables. After the k(k-1)/2 SVM classifiers are produced, a majority vote strategy is applied: each classifier has one vote, and every data point is predicted to belong to the class with the largest vote. A sketch of this pairwise voting scheme is given below.
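The pairwise construction and the majority vote can be spelled out in a few lines. The sketch below (ours) uses scikit-learn's SVC, which itself wraps LIBSVM and already applies one-against-one internally, but builds the k(k-1)/2 binary classifiers and the voting explicitly for clarity; the data is a synthetic stand-in, since KDD99 itself is not bundled here:

from itertools import combinations
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic four-class stand-in for the intrusion data.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
classes = np.unique(y)
col = {c: i for i, c in enumerate(classes)}

# One binary SVM per pair of classes: k(k-1)/2 classifiers in total.
pair_clfs = {}
for a, b in combinations(classes, 2):
    mask = np.isin(y, [a, b])
    pair_clfs[(a, b)] = SVC(kernel='rbf').fit(X[mask], y[mask])

# Majority vote: each pairwise classifier casts one vote per sample.
votes = np.zeros((len(X), len(classes)), dtype=int)
for (a, b), clf in pair_clfs.items():
    pred = clf.predict(X)
    votes[:, col[a]] += (pred == a)
    votes[:, col[b]] += (pred == b)
y_pred = classes[votes.argmax(axis=1)]
print("training accuracy:", (y_pred == y).mean())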
4. EXPERIMENTAL COMPARISON OF MCMP AND PAIRWISE SVM

The KDD99 dataset was provided by the Defense Advanced Research Projects Agency (DARPA) in 1998 for the competitive evaluation of intrusion detection approaches. The KDD99 dataset contains 9 weeks of raw TCP data from a simulation of a typical U.S. Air Force LAN. A version of this dataset was used in the 1999 KDD-CUP intrusion detection contest (Stolfo et al. 2000). After the contest, KDD99 became a de facto standard dataset for intrusion detection experiments. There are four main categories of attacks: denial-of-service (DOS); unauthorized access from a remote machine (R2L); unauthorized access to local root privileges (U2R); and surveillance and other probing (Probe). Because the number of U2R attacks is too small (52 records), only three types of attacks, DOS, R2L, and Probe, are used in this experiment. The KDD99 dataset used in this experiment has 4898430 records and contains 1071730 distinct records.

MCMP was solved by LINGO 8.0, a software tool for solving nonlinear models (LINDO Systems Inc.). LIBSVM version 2.8 (Chang and Lin, 2001), an integrated software package which uses the pairwise approach to support multiclass SVM classification, was applied to the KDD99 data, and the classification results of LIBSVM were compared with MCMP's. The four-group classification results of MCMP and LIBSVM on the KDD99 data are summarized in Table 1 and Table 2, respectively. The classification results are displayed in the format of confusion matrices, which pinpoint the kinds of errors made.

From the confusion matrices in Tables 1 and 2, we observe that: (1) LIBSVM achieves perfect classification on the training data (100% accuracy), and the training results of MCMP are almost perfect (100% accuracy for "Probe" and "DOS" and 99% for "R2L" and "Normal"); (2) contrasting LIBSVM's training accuracies with its testing accuracies, its performance is unstable: LIBSVM achieves almost perfect classification for the "Normal" class (99.99% accuracy) but poor performance on the three attack types (44.48% for "Probe", 53.17% for "R2L", and 74.49% for "DOS"); (3) MCMP has a stable performance on the testing data: 97.20% accuracy for "Probe", 99.07% for "DOS", 88.43% for "R2L", and 97.05% for "Normal". These rates can be recomputed directly from the confusion matrices, as sketched after Table 2.

Table 1. MCMP KDD99 Classification Results

Evaluation on training data (400 cases):
  (1)  (2)  (3)  (4)  <-classified as   Accuracy   False Alarm Rate
  100    0    0    0  (1): Probe        100.00%    0.99%
    0  100    0    0  (2): DOS          100.00%    0.00%
    0    0   99    1  (3): R2L           99.00%    0.00%
    1    0    0   99  (4): Normal        99.00%    1.00%

Evaluation on test data (1071330 cases):
    (1)     (2)    (3)     (4)  <-classified as   Accuracy   False Alarm Rate
  13366     216    145      24  (1): Probe         97.20%    7.88%
   1084  244867   1202      14  (2): DOS           99.07%    6.32%
      1       4    795      99  (3): R2L           88.43%   91.86%
     59   16313   7623  788718  (4): Normal        97.05%    0.02%

Table 2. LIBSVM KDD99 Classification Results

Evaluation on training data (400 cases):
  (1)  (2)  (3)  (4)  <-classified as   Accuracy   False Alarm Rate
  100    0    0    0  (1): Probe        100.00%    0.00%
    0  100    0    0  (2): DOS          100.00%    0.00%
    0    0  100    0  (3): R2L          100.00%    0.00%
    0    0    0  100  (4): Normal       100.00%    0.00%

Evaluation on test data (1071330 cases):
    (1)     (2)    (3)     (4)  <-classified as   Accuracy   False Alarm Rate
   6117     569      0    7065  (1): Probe         44.48%   67.84%
  12861  184107      0   50199  (2): DOS           74.49%    0.31%
      0       0    478     421  (3): R2L           53.17%    6.64%
     41       0     34  812638  (4): Normal        99.99%    6.63%
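The per-class accuracies and false alarm rates in Tables 1 and 2 follow directly from the confusion matrices. For instance, the numpy sketch below (ours) reproduces the test-data rates of Table 1 for MCMP:

import numpy as np

# MCMP test confusion matrix from Table 1 (rows = true class, cols = predicted).
labels = ["Probe", "DOS", "R2L", "Normal"]
cm = np.array([[13366,    216,  145,     24],
               [ 1084, 244867, 1202,     14],
               [    1,      4,  795,     99],
               [   59,  16313, 7623, 788718]])

per_class_acc = cm.diagonal() / cm.sum(axis=1)        # row-wise accuracy
# False alarm rate for class j: fraction of class-j predictions that are wrong.
false_alarm = 1 - cm.diagonal() / cm.sum(axis=0)
for name, acc, fa in zip(labels, per_class_acc, false_alarm):
    print(f"{name:>6}: accuracy {acc:7.2%}  false alarm {fa:7.2%}")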
5. CONCLUSION

This is the first time that the differences between MCMP and pairwise SVM for multiclass classification have been investigated using a large network intrusion dataset. The results indicate that MCMP achieves better classification accuracy than pairwise SVM. In our future research, we will focus on the theoretical differences between these two multiclass approaches.

References

Bradley, P.S., Fayyad, U.M. and Mangasarian, O.L. (1999) Mathematical programming for data mining: Formulations and challenges. INFORMS Journal on Computing, 11, 217-238.
Chang, C.C. and Lin, C.J. (2001) LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Hsu, C.W. and Lin, C.J. (2002) A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2), 415-425.
Knerr, S., Personnaz, L. and Dreyfus, G. (1990) Single-layer learning revisited: A stepwise procedure for building and training a neural network. In Neurocomputing: Algorithms, Architectures and Applications, J. Fogelman, Ed. New York: Springer-Verlag.
Kou, G., Peng, Y., Shi, Y., Chen, Z. and Chen, X. (2004b) A multiple-criteria quadratic programming approach to network intrusion detection. In Y. Shi et al. (Eds.): CASDMKM 2004, LNAI 3327, Springer-Verlag, Berlin Heidelberg, 145-153.
LINDO Systems Inc. An overview of LINGO 8.0, http://www.lindo.com/cgi/frameset.cgi?leftlingo.html;lingof.html.
Stolfo, S.J., Fan, W., Lee, W., Prodromidis, A. and Chan, P.K. (2000) Cost-based modeling and evaluation for data mining with application to fraud and intrusion detection: Results from the JAM project. DARPA Information Survivability Conference.
Vapnik, V.N. and Chervonenkis, A.Y. (1964) On one class of perceptrons. Automation and Remote Control, 25(1).
Vapnik, V.N. (1995) The Nature of Statistical Learning Theory. Springer, New York.
Zhu, D., Premkumar, G., Zhang, X. and Chu, C.H. (2001) Data mining for network intrusion detection: A comparison of alternative methods. Decision Sciences, 32(4), Fall 2001.

Pattern Recognition for Multimedia Communication Networks Using New Connection Models between MCLP and SVM

Jing HE, Institute of Intelligent Information and Communication Technology, Konan University, Kobe 658-8501, Japan. Email: [email protected]
Wuyi YUE, Department of Information Science and Systems Engineering, Konan University, Kobe 658-8501, Japan. Email: [email protected]
Yong SHI, Chinese Academy of Sciences Research Center on Data Technology and Knowledge Economy, Beijing 100080, China. Email: [email protected]

Abstract— A data mining system for performance evaluation of multimedia communication networks (MCNs) is a challenging research and development issue. The data mining system offers techniques for discovering patterns in voluminous databases. By dividing the performance data into usual and unusual categories, we try to find the category corresponding to the data mining system. Many pattern recognition algorithms for data mining systems have been developed and explored in recent years, such as rough sets, rough-fuzzy hybridization, granular computing, artificial neural networks, support vector machines (SVM), and multiple criteria linear programming (MCLP). In this paper, a new connection model between MCLP and SVM is employed to identify performance data. In addition to theoretical foundations, the paper also includes experimental results. Some real-time and non-trivial examples for MCNs given in this paper show how MCLP and SVM work and how they can be combined in practice. The advantages that each algorithm offers are compared with those of the other methods.

I. INTRODUCTION

A data mining system for performance evaluation of multimedia communication networks (MCNs) is a challenging research and development issue. The data mining system offers techniques for discovering patterns in voluminous databases. Fraudulent activity costs the telecommunication industry millions of dollars a year. It is important to identify potentially fraudulent users and their typical usage patterns, and to detect their attempts to gain fraudulent entry in order to perpetrate illegal activity. Several ways of identifying unusual patterns can be used, such as multidimensional analysis, cluster analysis and outlier analysis [1]. By dividing the performance data into usual and unusual categories, we try to find the category corresponding to the data mining system. Many pattern recognition algorithms for data mining have been developed and explored in recent years, such as rough sets, rough-fuzzy hybridization, granular computing, artificial neural networks, support vector machines (SVM), multiple criteria linear programming (MCLP) and so on [2].

SVM has been gaining popularity as one of the effective methods for machine learning in recent years. In pattern classification problems with two class sets, SVM generalizes linear classifiers into high-dimensional feature spaces through non-linear mappings, which are defined implicitly by kernels in the Hilbert space. This means SVM may produce non-linear classifiers in the original data space. The linear classifiers are then optimized to give the maximal margin separation between the classes [3]-[5]. Research on the linear programming (LP) approach to classification problems was initiated in [6]-[8]; [9], [10] applied the compromise solution of MCLP to deal with the same question. In [11], an analysis of fuzzy linear programming (FLP) for classification of credit card holder behaviors was presented. During the calculations in [11], we found that, apart from some approaches such as MCLP and SVM, many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. However, the unusual outliers may be of particular interest, as in the case of unusual pattern detection, where unusual outliers may indicate fraudulent activities. Thus the identification of usual and unusual patterns is an interesting data mining task, referred to as "pattern recognition". In this paper, by dividing the performance data into usual and unusual categories, we try to find the category corresponding to the data mining system. A new pattern recognition model, which connects MCLP and SVM, is employed to identify performance data. Some real-time and non-trivial examples for MCNs with different pattern recognition approaches, such as SVM, LP, and MCLP, are given to show how the different techniques work and can be used in reality. The advantages that the different algorithms offer are compared with each other.
The results of the comparisons are listed in this paper. In Section II, we describe the basic formulas of MCLP and SVM. Connection models between MCLP and SVM are presented in Section III. The real-time data experiments of pattern recognition for MCNs are given in Section IV. Finally, we conclude the paper with a brief summary in Section V.

II. BASIC FORMULAS OF SVM AND MCLP

Support Vector Machines (SVMs) were developed in [3], [12], and their main features are as follows: (1) SVM maps the original data set into a high-dimensional feature space by a non-linear mapping implicitly defined by kernels in the Hilbert space; (2) SVM finds linear classifiers with maximal margins in the feature space; (3) SVM provides an evaluation of the generalization ability.

A. Hard Margin SVM

We define two classes A and B among the training data sets x_i, i = 1, ..., l. We use a variable y_i with two values, 1 and -1, to represent which of the two classes a training data set belongs to: if x_i ∈ A, then y_i = 1; if x_i ∈ B, then y_i = -1. Let w be a separating hyperplane parameter and b be a separating parameter, where w ∈ R^n, b ∈ R, and n is the attribute size. Then we use the separating hyperplane w^T x = b to separate the samples, where b is a boundary value. From the above definition, we know that w^T x_i > b for x_i ∈ A and w^T x_i < b for x_i ∈ B. Such a method for separating the samples is called classification. The separating hyperplane with maximal margins can be given by solving the problem with the normalization y_i(w^T x_i - b) ≥ 1 at points with the minimum interior deviation:

(M1) Minimize ||w||
Subject to: y_i (w^T x_i - b) ≥ 1, i = 1, ..., l (1)

where ||·|| represents a norm; x_i is given, and w and b are unrestricted. Several norms are possible: when the L2 norm is used, the problem reduces to quadratic programming, while with the L1 or L∞ norm the problem reduces to linear programming [13].

The SVM method which can separate the two classes A and B completely is called the hard margin SVM method, but the hard margin SVM method tends to cause over-learning. The hard margin SVM method with the L2 norm is given as follows:

(M2) Minimize (1/2) ||w||²₂
Subject to: y_i (w^T x_i - b) ≥ 1, i = 1, ..., l (2)

where x_i is given, and w and b are unrestricted. The aim of machine learning is to predict which class new patterns belong to on the basis of the given training data set.
B. Soft Margin SVM

The hard margin SVM method is easily affected by noise. In order to overcome this shortcoming, the soft margin SVM method is introduced. The soft margin SVM method allows some slight errors, which are represented by slack variables (exterior deviations) ξ_i, i = 1, ..., l. Using a trade-off parameter C between minimizing ||w|| and minimizing Σξ_i, we have the soft margin SVM method:

(M3) Minimize (1/2) ||w||²₂ + C Σ_{i=1}^{l} ξ_i
Subject to: y_i (w^T x_i - b) ≥ 1 - ξ_i, ξ_i ≥ 0, i = 1, ..., l (3)

where x_i and C are given, and w and b are unrestricted. It can be seen that the idea of the soft margin SVM method is the same as that of the linear programming approach to linear classifiers; this idea was used in an extension by [14].

Not only exterior deviations but also interior deviations can be considered in SVM. We therefore propose SVM algorithms considering both slack variables for misclassified data points (i.e., exterior deviations) and surplus variables for correctly classified data points (i.e., interior deviations). In order to minimize the slackness and to maximize the surplus, a surplus variable (interior deviation) η_i, i = 1, ..., l, is used; the trade-off parameter C1 is used for the slack variables, and another trade-off parameter C2 is used for the surplus variables. Then we have the optimization problem:

(M4) Minimize ||w|| + C1 Σ_{i=1}^{l} ξ_i - C2 Σ_{i=1}^{l} η_i
Subject to: y_i (w^T x_i - b) ≥ 1 - ξ_i + η_i, ξ_i, η_i ≥ 0, i = 1, ..., l (4)

where x_i, C1 and C2 are given, and w and b are unrestricted. A sketch of the soft margin formulation (M3) is given below.
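As a concrete rendering of (M3), the sketch below (ours) solves the soft margin problem with the L2 norm on synthetic two-class data using cvxpy; the hard margin method (M2) is approached as the trade-off parameter C grows large:

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
XA = rng.normal([ 1.0,  1.0], 0.6, (20, 2))   # class A, y = +1
XB = rng.normal([-1.0, -1.0], 0.6, (20, 2))   # class B, y = -1
X = np.vstack([XA, XB])
y = np.array([1]*20 + [-1]*20)

C = 1.0                                        # trade-off: margin vs. slack
w = cp.Variable(2)
b = cp.Variable()
xi = cp.Variable(40, nonneg=True)              # exterior deviations (slack)

# (M3): minimize ||w||^2/2 + C * sum(slack), with margin constraints
# y_i (w^T x_i - b) >= 1 - xi_i relaxed by the exterior deviations xi_i.
obj = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
cons = [cp.multiply(y, X @ w - b) >= 1 - xi]
cp.Problem(obj, cons).solve()

pred = np.sign(X @ w.value - b.value)
print("training accuracy:", (pred == y).mean())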
C. MCLP

For the classification explained in Subsection A, the multiple criteria linear programming (MCLP) model can be used. We want to determine the best coefficients of the variables, X = (x_1, ..., x_r)^T, where r is the attribute size, and a boundary value b to separate the two classes A and B:

A_i X ≥ b, A_i ∈ A; A_i X ≤ b, A_i ∈ B (5)

where A_i is given, and X and b are unrestricted. Eq. (5) is equal to the following equation:

y_i (A_i X - b) ≥ 0, i = 1, ..., l (6)

where y_i is defined in Subsection A.

Let α_i denote the exterior deviation, which is a deviation from the hyperplane A_i X = b for misclassified records; similarly, let β_i denote the interior deviation, which is a deviation from the hyperplane A_i X = b for correctly classified records. Our purposes are as follows: (1) to minimize the maximum exterior deviation (decrease errors as much as possible); (2) to maximize the minimum interior deviation (i.e., maximize the margins); (3) to minimize the weighted sum of exterior deviations (MSD); (4) to maximize the weighted sum of interior deviations (MMD). MSD can be written as follows:

(M5) Minimize Σ_{i=1}^{l} α_i (7)
Subject to: y_i (A_i X - b) ≥ -α_i, α_i ≥ 0, i = 1, ..., l (8)

where A_i is given, and X and b are unrestricted. The alternative to the above model is to find MMD:

(M6) Maximize Σ_{i=1}^{l} β_i (9)
Subject to: y_i (A_i X - b) ≥ β_i, β_i ≥ 0, i = 1, ..., l (10)

where A_i is given, and X and b are unrestricted.

[11] applied the compromise solution of multiple criteria linear programming to minimize the sum of the α_i and maximize the sum of the β_i simultaneously. A two-criteria linear programming model is given as follows:

(M7) Minimize Σ_{i=1}^{l} α_i and Maximize Σ_{i=1}^{l} β_i (11)
Subject to: A_i X = b - α_i + β_i, A_i ∈ A; A_i X = b + α_i - β_i, A_i ∈ B; α_i, β_i ≥ 0, i = 1, ..., l

where A_i is given, and X and b are unrestricted.

A hybrid model presented in [8] that combines Eq. (8) and Eq. (10) is given as follows:

Minimize w_α Σ_{i=1}^{l} α_i - w_β Σ_{i=1}^{l} β_i (12)
Subject to: A_i X = b - α_i + β_i, A_i ∈ A; A_i X = b + α_i - β_i, A_i ∈ B; α_i, β_i ≥ 0, i = 1, ..., l

where A_i, w_α and w_β are given, and X and b are unrestricted. A toy implementation of this compromise model is given below.
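The compromise model (12) reduces to an ordinary LP once the weights are fixed. The toy sketch below (ours) solves it with scipy, fixing the boundary at b = 1 to rule out the trivial zero solution and capping the interior deviations to keep the LP bounded; both are assumptions of this toy, in the spirit of the normality condition discussed in Section III:

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
GA_ = rng.normal([ 1.0,  1.0], 0.7, (15, 2))   # class A records
GB_ = rng.normal([-1.0, -1.0], 0.7, (15, 2))   # class B records
A = np.vstack([GA_, GB_])
n, r = A.shape
is_A = np.array([True]*15 + [False]*15)

b = 1.0                      # boundary fixed (toy normalization)
w_alpha, w_beta = 2.0, 1.0   # weights on exterior and interior deviations

# Variables z = [X (r), alpha (n), beta (n)].
# Class A: A_i X = b - alpha_i + beta_i ;  class B: A_i X = b + alpha_i - beta_i.
A_eq = np.zeros((n, r + 2*n)); b_eq = np.full(n, b)
for i in range(n):
    s = 1.0 if is_A[i] else -1.0
    A_eq[i, :r] = A[i]
    A_eq[i, r + i] = s           # alpha coefficient
    A_eq[i, r + n + i] = -s      # beta coefficient
c = np.concatenate([np.zeros(r), w_alpha*np.ones(n), -w_beta*np.ones(n)])
bounds = [(None, None)]*r + [(0, None)]*n + [(0, 1.0)]*n  # beta capped (toy)

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
Xc = res.x[:r]
pred_A = A @ Xc >= b             # score above the boundary -> class A
print("training accuracy:", (pred_A == is_A).mean())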
III. CONNECTION BETWEEN MCLP AND SVM

A. Linearly Separable Examples

It should be noted that the LP of Eq. (8) may yield some unacceptable solutions, such as X = 0, as well as unbounded solutions in the goal programming approach. Therefore, some appropriate normality condition must be imposed on X in order to provide a bounded nontrivial optimal solution; one such normality condition is ||X|| = 1. If the classification is linearly separable, then using this normalization the separating hyperplane with the maximal margins can be given by solving the following problem:

(M8) Maximize η
Subject to: y_i (A_i X - b) ≥ η, i = 1, ..., l, ||X|| = 1 (13)

where y_i and A_i are defined in Section II; A_i is given, and X, b and the minimum interior deviation η are unrestricted. However, this normality condition makes the problem a non-linear optimization model. Instead of maximizing the minimum interior deviation in Eq. (13), we can use the following equivalent formulation with the normalization y_i(A_i X - b) ≥ 1 at points with the minimum interior deviation [15].

Theorem. The discrimination problem of Eq. (13) is equivalent to the formulation used in Eq. (1):

(M1) Minimize ||X||
Subject to: y_i (A_i X - b) ≥ 1, i = 1, ..., l (14)

where A_i is given, and X and b are unrestricted.

Proof: First notice that any optimal solution of Eq. (14) must satisfy min_i y_i (A_i X - b) = 1; otherwise X and b could be scaled down, reducing ||X||, an impossibility at the optimum. Similarly, the constraint of Eq. (13) is tight at the optimum. Let X* be an optimal vector for Eq. (14). Then (X*/||X*||, b*/||X*||) is feasible for Eq. (13) with minimum interior deviation 1/||X*||. Assume it is not optimal for Eq. (13), and let (X', b') with η' > 1/||X*|| be optimal instead. Then (X'/η', b'/η') is feasible for Eq. (14) and ||X'/η'|| = 1/η' < ||X*||, in contradiction with the optimality of X*. Hence X*/||X*|| is an optimal solution for Eq. (13). Conversely, if (X', b') with margin η' is optimal for Eq. (13), the same rescaling argument shows that (X'/η', b'/η') is optimal for Eq. (14). Then M1 and M8 are the same, and the Theorem is proved.

B. Linearly Unseparable Examples

As mentioned above, MSD of Eq. (8) is:

(M5) Minimize Σ_{i=1}^{l} α_i
Subject to: y_i (A_i X - b) ≥ -α_i, α_i ≥ 0, i = 1, ..., l

where y_i and A_i are defined in Section II; A_i is given, and X and b are unrestricted. According to the Theorem, the above can be rewritten in the normalized form of Eq. (1). We then use the L2 norm in Eq. (1), and choose C as the trade-off parameter between minimizing ||X|| and minimizing Σα_i, which gives the formulation of the soft margin SVM method combining Eq. (8) with Eq. (1):

Minimize (1/2) ||X||²₂ + C Σ_{i=1}^{l} α_i
Subject to: y_i (A_i X - b) ≥ 1 - α_i, α_i ≥ 0, i = 1, ..., l (15)

where A_i and C are given, and X and b are unrestricted. Eq. (15) is the same as the SVM formula in Eq. (3).

IV. PATTERN RECOGNITION FOR MCNs

A. Real-time Experiments Data

A set of attributes for MCNs, such as throughput capacity, package forwarding rate, response time, connection attempts, delay time and transfer rate, together with the criteria for "unusual patterns", is designed. In these real-time experiments, the two classes of training data sets in MCNs are A and B, as defined in Section II: class A represents the usual pattern, and class B represents the unusual pattern. The purpose of pattern recognition techniques for MCNs is to find a good classifier from a training data set and to use that classifier to predict all other performance data of the MCNs. The pattern recognition technique frequently used in the telecommunication industry is still the two-class separation technique; its key question is to separate the "unusual" patterns (fraudulent activity) from the "usual" patterns (normal activity), and the pattern recognition model is to identify as many such MCNs as possible, which is also known as the method of "detecting the fraudulent list". In this section, a real-time performance data mart with 65 derived attributes and 1000 records from a major CHINA TELECOM MCN database is first used to train the different classifiers. The training solutions are then employed to predict the performance of another 5000 MCNs. Finally, the classification results of the different models are compared with each other.

B. Accuracy Measure

We would like to be able to assess how well a classifier can recognize "usual" samples (referred to as positive samples) and how well it can recognize "unusual" samples (referred to as negative samples). The sensitivity and specificity measures can be used, respectively, for this purpose. In addition, we may use precision to assess the percentage of samples labeled as "usual" that actually are "usual". These measures are defined as follows:

Sensitivity = t_pos / pos
Specificity = t_neg / neg
Precision = t_pos / (t_pos + f_pos)

where t_pos is the number of true positives ("usual" samples that were correctly classified as such), pos is the number of positive ("usual") samples, t_neg is the number of true negatives ("unusual" samples that were correctly classified as such), neg is the number of negative ("unusual") samples, and f_pos is the number of false positives ("unusual" samples that were incorrectly labeled as "usual"). It can be shown that accuracy is a function of sensitivity and specificity:

Accuracy = Sensitivity × pos/(pos + neg) + Specificity × neg/(pos + neg)

The higher the four rates (sensitivity, specificity, precision, accuracy) are, the better the classification results are. In this paper a threshold is set against specificity and precision, depending on the required performance evaluation of the MCNs. A short sketch of these measures is given below.
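The four measures are straightforward to compute. The short sketch below (ours, with hypothetical counts for an 860/140 split such as the training mart described in this section) also illustrates the accuracy identity:

def rates(t_pos, pos, t_neg, neg, f_pos):
    """Sensitivity, specificity, precision, and accuracy as defined above."""
    sensitivity = t_pos / pos
    specificity = t_neg / neg
    precision = t_pos / (t_pos + f_pos)
    accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg)
    return sensitivity, specificity, precision, accuracy

# Hypothetical counts: 860 usual and 140 unusual samples; the 20 misclassified
# unusual samples are exactly the false positives (f_pos = neg - t_neg).
print(rates(t_pos=820, pos=860, t_neg=120, neg=140, f_pos=20))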
M7 is the MCLP model. M8 is the MCLP model using the normalization. is the boundary value for each model. Here we use to calculate for models M1 to M8. A well-known commercial soft package, Lingo [16] has been used to perform the training and predicting processes. The learning results of unbalanced 1000 records in Sensitivity and Specificity are shown in Table 1, where the columns of H are the Sensitivity rates for the usual pattern, the columns of K are the Specificity rates for the unusual pattern. Table 2: Predicting Results of Unbalanced 5000 Records in Precision. The Precision rates in models M3, M7, M4 are as high as the learning results. M1 and M8 have the same results of H and K with all values of b. If the threshold of the precision of pattern recognition is predetermined as 0.9. Then the model , , , , , , M8 with , , , M3 with , are satisfied as better classifiers. The best model of the threshold in the learning results is M3 with . The order of average predicting precision is M3, M7, M4, M2, M5, M6, M1, M8. In this data mart of Table 2, M1 and M8 have similar structures and solution characterizations due to the formula presented in Section III. When the classification is to find the higher specificity, M1 or M8 can give the better results. When the classification is to find the higher precision, M3, M4, M7 can give the better results. Table 1: Learning Results of Unbalanced 1000 Records in Sensitivity and Specificity. V. C ONCLUSION In this paper, we have proposed a heuristic connection classification method to recognize unusual patterns of multimedia communication networks (MCNs). This algorithm is based on the connection model between multiple criteria linear programming (MCLP) with support vector machines (SVM). Although the mathematical modeling is not new, the framework of connection configuration is innovative. In addition, empirical training sets and the prediction results on the realtime MCNs from a major company, CHINA TELECOM, were listed out. Comparison studies have shown that the connection model combining MCLP and SVM has the performed better learning results with an aspect to predicting the future performance pattern of MCNs. The connection model also has a great deal of potential to be used in various data mining tasks. Since the connection model is readily implemented by nonlinear programming, any available non-linear programming packages, such as Lingo, can be used to conduct the data analysis. In the meantime, we explored the other possible connections between SVM and MCLP. The results of ongoing projects to solve more complex problem will be reported in the near future. Table. 1 shows the learning results of models M1 to M8 for different values of the boundary . If the threshold of the specificity rate K is predetermined as , then the models M1, M8 with , , , , , , , , M3 with , M4 with , , , M6 with , , , M7 with , , , are satisfied as better classifiers. M1 and M8 have the same results of H and K with all values of . The best specificity rate model of the threshold in the learning result of unusual patterns in K is M1, M8 with . The order in the learning result of unusual patterns in the specificity K is M8 = M1, M6, M3, M7, M4, M2, M5. Table. 2 shows the predicting results of unbalanced 5000 records in Precision with models M1 to M8 for different values ACKNOWLEDGMENT This work was supported in part by GRANT-IN-AID FOR SCIENTIFIC RESEARCH (No. 
ACKNOWLEDGMENT

This work was supported in part by GRANT-IN-AID FOR SCIENTIFIC RESEARCH (No. 16560350) and MEXT.ORC (2004-2008), Japan, and in part by NSFC (No. 70472074), China.

REFERENCES

[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, An Imprint of Academic Press, San Francisco, 2003.
[2] S. Pal and P. Mitra, Pattern Recognition Algorithms for Data Mining, CRC Press, 2004.
[3] V. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, 1998.
[4] O. Mangasarian, Linear and Nonlinear Separation of Patterns by Linear Programming, Operations Research, 13(3): 444-452, 1965.
[5] O. Mangasarian, Multisurface Method for Pattern Separation, IEEE Transactions on Information Theory, IT-14: 801-807, 1968.
[6] N. Freed and F. Glover, Simple but Powerful Goal Programming Models for Discriminant Problems, European Journal of Operational Research, 7(3): 44-60, 1981.
[7] N. Freed and F. Glover, Evaluating Alternative Linear Programming Models to Solve the Two-group Discriminant Problem, Decision Sciences, 17(1): 151-162, 1986.
[8] F. Glover, Improved Linear Programming Models for Discriminant Analysis, Decision Sciences, 21(3): 771-785, 1990.
[9] G. Kou, X. Liu, Y. Peng, Y. Shi, M. Wise and W. Xu, Multiple Criteria Linear Programming Approach to Data Mining: Models, Algorithm Designs and Software Development, Optimization Methods and Software, 18(4): 453-473, 2003.
[10] G. Kou and Y. Shi, Linux-based Multiple Linear Programming Classification Program: Version 1.0, College of Information Science and Technology, University of Nebraska at Omaha, USA, 2002.
[11] J. He, X. Liu, Y. Shi, W. Xu and N. Yan, Classification of Credit Cardholder Behavior by Using Fuzzy Linear Programming, International Journal of Information Technology and Decision Making, 3(4): 223-229, 2004.
[12] C. Cortes and V. Vapnik, Support Vector Networks, Machine Learning, 20(3): 273-297, 1995.
[13] O. Mangasarian, Arbitrary-Norm Separating Plane, Operations Research Letters, 24: 15-23, 1999.
[14] K. Bennett and O. Mangasarian, Robust Linear Programming Discrimination of Two Linearly Inseparable Sets, Optimization Methods and Software, 1: 23-34, 1992.
[15] P. Marcotte and G. Savard, Novel Approaches to the Discrimination Problem, ZOR - Methods and Models of Operations Research, 36: 517-545, 1992.
[16] http://www.lindo.com/.
[17] J. He, W. Yue and Y. Shi, Identification Mining of Unusual Patterns for Multimedia Communication Networks, Abstract Proc. of the 2005 Autumn Conference of the Operations Research Society of Japan, 262-263, 2005.
[18] Y. Shi and J. He, Computer-based Algorithms for Multiple Criteria and Multiple Constraint Level Integer Linear Programming, Computers and Mathematics with Applications, 49(5): 903-921, 2005.
[19] T. Asada and H. Nakayama, SVM Using Multi-Objective Linear Programming and Goal Programming, in T. Tanino, T. Tanaka and M. Inuiguchi (eds.), Multi-objective Programming and Goal Programming, 93-98, 2003.
[20] H. Nakayama and T. Asada, Support Vector Machines Formulated as Multi-Objective Linear Programming, Proc. of ICOTA 2001, 1171-1178, 2001.
[21] M. Yoon, Y.B. Yun and H. Nakayama, A Role of Total Margin in Support Vector Machines, Proc. of IJCNN'03, 2049-2053, 2003.
[22] W. Yue, J. Gu and X. Tang, A Performance Evaluation Index System for Multimedia Communication Networks and Forecasting for Web-based Network Traffic, Journal of Systems Science and Systems Engineering, 13(1): 78-97, 2002.
[23] J. He, Y. Shi and W. Xu, Classification of Credit Cardholder Behavior by Using Multiple Criteria Non-linear Programming, Proc. of the International Conference on Data Mining and Knowledge Management, Lecture Notes in Computer Science, Springer-Verlag, 2004.
[24] http://www.rulequest.com/see5-info.html.
[25] http://www.sas.com/.
[26] Y. Shi, M. Wise, M. Luo and Y. Lin, Data Mining in Credit Card Portfolio Management: A Multiple Criteria Decision Making Approach, in Multiple Criteria Decision Making in the New Millennium, Springer, Berlin, 2001.

Published by the Department of Mathematics and Computing Science
Technical Report Number: 2005-05
November 2005
ISBN 0-9738918-1-5