Preprints of the
9th International Symposium on Advanced Control of Chemical Processes
The International Federation of Automatic Control
June 7-10, 2015, Whistler, British Columbia, Canada
MoM3.6
Nearest Correlation Louvain Method for
Fast and Good Selection of Input Variables
of Statistical Model ⋆
Taku Uchimaru∗, Koji Hazama∗, Koichi Fujiwara∗, Manabu Kano∗

∗ Department of Systems Science, Kyoto University, Kyoto, Japan
(e-mail: [email protected]).
Abstract: In the present work, a new input variable selection method for building linear
regression models is proposed. The proposed method is referred to as nearest correlation Louvain
method based variable selection (NCLM-VS). NCLM-VS is a correlation-based group-wise
method; it constructs an affinity matrix of input variables by the nearest correlation (NC)
method, partitions the affinity matrix by the Louvain method (LM), consequently clusters
input variables into multiple variable classes, and finally selects variable classes according
to their contribution to estimates. LM is very fast and optimizes the number of classes
automatically unlike spectral clustering (SC). The advantage of NCLM-VS over conventional
methods including NCSC-VS is demonstrated through their applications to soft-sensor design for
an industrial chemical process and calibration modeling based on near-infrared (NIR) spectra.
In particular, it is confirmed that NCLM-VS is significantly faster than the recently proposed
NCSC-VS while NCLM-VS can achieve as good estimation performance as NCSC-VS.
Keywords: Input variable selection, parameter estimation, prediction methods, modelling,
spectral clustering, Louvain method, soft-sensor, near-infrared (NIR) spectroscopy
1. INTRODUCTION
In the manufacturing industry, real-time control of product quality is crucial to supplying high-quality products.
However, it is often difficult to measure product quality in real time because of the high investment and maintenance costs of analyzers and also measurement delays. Thus,
methods for estimating difficult-to-measure product quality from easy-to-measure process variables have been
investigated. For example, models to estimate product quality are known as soft-sensors in the chemical industry,
where partial least squares (PLS) regression is widely used because PLS can avoid the problem of collinearity
(Kano and Ogawa (2010); Kano and Fujiwara (2013)).
In the semiconductor manufacturing industry, the models
are called virtual metrology (VM), and PLS is regularly
employed (Lynn et al. (2012)).
The estimation accuracy of soft-sensors deteriorates when
input variables which do not contribute toward estimating
output variables are used. To construct a highly accurate
soft-sensor, appropriate input variables must be selected.
So far, many input variable selection methods have been proposed. The simplest method is the R method, which is
based on the correlation coefficients between input variables and the output variable. When PLS is adopted,
one can use PLS-based variable selection methods such
as the PLS-Beta method and the variable influence on
projection (VIP) method. Lately, many other PLS-based
⋆ This study is partially supported by the Japan Society for the Promotion of Science (JSPS), Grant-in-Aid for Scientific Research (C) 24560940.
methods have been proposed: genetic algorithm combined
with PLS regression (GA-PLS), sub-window permutation
analysis coupled with PLS (SwPA-PLS), Sparse-PLS, and
Powered PLS (PPLS) (Mehmood et al. (2012)). Recently,
nearest correlation spectral clustering-based variable selection (NCSC-VS) was proposed (Fujiwara et al. (2012)).
NCSC-VS clusters variables into some variable classes
based on the correlation between variables by using NCSC
(Fujiwara et al. (2010, 2011)), then each variable class is
examined as to whether it should be used as input variables. It was demonstrated that NCSC-VS was superior
to other conventional methods through applications to
soft-sensor design problems. However, NCSC-VS requires
heavy computational load to optimize parameters such as
the number of variable classes.
In the present work, to solve the problem of NCSC-VS, a new input variable selection method is proposed.
To reduce the computational load and to determine the
number of variable classes automatically, the Louvain
method (LM) is used instead of spectral clustering (SC).
LM can partition an affinity matrix faster than SC and
can optimize the number of clusters (Blondel et al. (2008)).
The proposed method is called nearest correlation Louvain
method-based variable selection (NCLM-VS). NCLM-VS
clusters input variables into some variable classes based on
NCLM, and selects variable classes according to their contribution to estimates. Thus, NCLM-VS is a kind of
correlation-based group-wise method for selecting input variables. The usefulness of NCLM-VS is demonstrated
through applications to soft-sensor design for an industrial chemical process and calibration modeling based on near-infrared (NIR) spectra.
2. CONVENTIONAL METHODS
This section briefly explains conventional input variable selection methods. Two of them are based on PLS; therefore
PLS is briefly explained at the beginning of this section.
2.1 Partial Least Squares (PLS)
In PLS with one output variable, an input variable matrix X ∈ R^{N×M} and an output variable vector y ∈ R^{N} are decomposed as follows:

X = T P^T + E    (1)

y = T b + f    (2)

where T ∈ R^{N×K} is the latent variable score matrix whose columns are latent variables (score vectors) t_k ∈ R^N (k = 1, 2, · · · , K), P ∈ R^{M×K} is the loading matrix of X whose columns are loading vectors p_k ∈ R^M, and b ∈ R^K is the loading vector of y whose elements are b_k (k = 1, 2, · · · , K). N and M are the numbers of samples and input variables, respectively. K denotes the number of adopted latent variables. E and f are errors.
The latent variable t is a linear combination of the columns of X:

t = X w    (3)

where the weighting vector w ∈ R^{M} is determined so that the covariance between y and t is maximized under the condition ‖w‖ = 1.

2.2 R

The R method selects input variables based on the absolute value of the correlation coefficient r_m between each input variable and the output variable:

r_m = \frac{s_{x_m y}^2}{\sqrt{s_{x_m}^2 s_y^2}}    (4)

where s_{x_m y}^2 is the covariance of x_m and y, and s_{x_m}^2 and s_y^2 are the variances of x_m and y, respectively. The input variables satisfying |r_m| > µ are selected for modeling.
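The R method reduces to a one-line correlation filter. A minimal Python sketch (the helper name, the placeholder threshold µ = 0.3, and the data layout are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def r_select(X, y, mu=0.3):
    # |r_m| > mu selection of Eq. (4); mu = 0.3 is an arbitrary placeholder
    r = np.array([np.corrcoef(X[:, m], y)[0, 1] for m in range(X.shape[1])])
    return np.where(np.abs(r) > mu)[0]  # column indices of selected variables
```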
2.3 PLS-Beta

The PLS-Beta method selects input variables based on the magnitude of their regression coefficients. The estimate ŷ derived by PLS is expressed by

\hat{y} = T (T^T T)^{-1} T^T y = X \beta_{pls}    (5)

and the regression coefficients β_pls are described as

\beta_{pls} = W (P^T W)^{-1} (T^T T)^{-1} T^T y    (6)

where W = [w_1, · · · , w_K] ∈ R^{M×K}. Input variables can be selected individually in descending order of the magnitude of their regression coefficients until

\frac{\| \beta_{select} \|}{\| \beta_{pls} \|} > \mu    (7)

is achieved. Here β_select denotes the vector of the regression coefficients corresponding to the selected variables.

2.4 Variable influence on projection (VIP)

The VIP score expresses the contribution of each input variable to the output variable (Wold et al. (2001)). The VIP score of the jth input variable is defined as

V_j = \sqrt{\frac{M \sum_{k=1}^{K} b_k^2 \| t_k \|^2 (w_{jk} / \| w_k \|)^2}{\sum_{k=1}^{K} b_k^2 \| t_k \|^2}}    (8)

where w_{jk} is the jth element of the weighting vector of the kth latent variable w_k. The input variables satisfying V_j > µ are selected for modeling.
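Eq. (8) can be computed directly from the weights, scores, and y-loadings of a fitted PLS model. A sketch assuming scikit-learn's PLSRegression, whose attributes x_weights_, x_scores_, and y_loadings_ are taken here to play the roles of w_k, t_k, and b_k (this mapping, the synthetic data, and the cutoff µ = 1 are assumptions, not from the paper):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls):
    # V_j of Eq. (8) from a fitted PLSRegression (single output assumed)
    W = pls.x_weights_                          # M x K, columns w_k
    T = pls.x_scores_                           # N x K, columns t_k
    b = pls.y_loadings_.ravel()                 # b_k
    M = W.shape[0]
    ssy = (b ** 2) * np.sum(T ** 2, axis=0)     # b_k^2 ||t_k||^2
    Wn2 = (W / np.linalg.norm(W, axis=0)) ** 2  # (w_jk / ||w_k||)^2
    return np.sqrt(M * (Wn2 @ ssy) / np.sum(ssy))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=100)
pls = PLSRegression(n_components=3).fit(X, y)
selected = np.where(vip_scores(pls) > 1.0)[0]   # mu = 1 is a common cutoff
```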
3. NEAREST CORRELATION SPECTRAL CLUSTERING (NCSC)

NCSC was originally proposed for sample clustering based on the correlation among variables without assuming any distribution (Fujiwara et al. (2010)). Later, NCSC was used for variable clustering (Fujiwara et al. (2012)).

NCSC integrates the nearest correlation (NC) method, which detects samples whose correlation is similar to that of the query, and the spectral clustering (SC) method, which partitions a weighted graph into subgraphs. NC constructs the weighted graph that expresses the correlation-based similarities between samples, and SC partitions the constructed graph.

3.1 Spectral clustering (SC)

SC is a clustering method based on graph theory (Ding et al. (2001); Ng et al. (2002)). It partitions a weighted graph, whose weights express affinities between nodes, into subgraphs by cutting some of their arcs. Although several spectral clustering algorithms have been proposed, the min-max cut (Mcut) method is described in this subsection.

Given a weighted graph G and its affinity matrix (adjacency matrix) W, G is partitioned into two subgraphs A and B. The affinity between A and B is defined as

\mathrm{cut}(A, B) \equiv W(A, B) = \sum_{a \in A, b \in B} W_{a,b}    (9)

W(A) \equiv W(A, A)    (10)

where a and b denote nodes of subgraphs A and B, and W_{a,b} is the sum of weights of arcs between a and b. That is, the affinity between subgraphs cut(A, B) is the sum of weights of arcs between the subgraphs. The Mcut method searches for subgraphs A and B that minimize cut(A, B) and maximize W(A) and W(B) simultaneously:

\min J = \frac{\mathrm{cut}(A, B)}{W(A)} + \frac{\mathrm{cut}(A, B)}{W(B)}    (11)

Since the Mcut method partitions a weighted graph into two subgraphs, this procedure is repeated when three or more subgraphs are needed. In SC, the definition of an affinity is arbitrary and affects the results. SC runs in O(n³) time complexity, where n is the number of nodes in G.
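For a given bipartition, the Mcut criterion of Eq. (11) is straightforward to evaluate; a minimal sketch (SC minimizes J via an eigenvalue problem rather than by enumerating partitions, so this is only the objective, not the solver):

```python
import numpy as np

def mcut_objective(W, in_A):
    # J of Eq. (11) for the bipartition given by the boolean mask in_A
    A, B = in_A, ~in_A
    cut = W[np.ix_(A, B)].sum()              # cut(A, B), Eq. (9)
    return cut / W[np.ix_(A, A)].sum() + cut / W[np.ix_(B, B)].sum()
```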
3.2 Nearest correlation (NC) method
The NC method can detect samples whose correlation is
similar to the query without any teacher signals (Fujiwara
et al. (2010)).
The concept of NC is as follows. Suppose that the affine
subspace P in Fig. 1 (left) shows the correlation among
variables and all the samples on P have the same correlation. Although x1 , x2 , · · · , x5 have the same correlation,
x6 and x7 have a different correlation from the others.
The NC method aims to detect samples whose correlation
is similar to the query x1 . In this example, x2 , x3 , x4 , x5
on P should be detected.
First, the whole space is translated so that the query
becomes the origin as shown in Fig. 1 (right). That is, x1
is subtracted from all other samples xi (i = 2, 3, · · · , 7).
Since the translated affine subspace contains the origin, it
becomes the linear subspace V .
Next, a line connecting each sample and the origin is drawn
virtually. Then, it is checked whether another sample can
be found on this line. In this case, x2 –x5 and x3 –x4 satisfy
such a relationship. The correlation coefficients of these
pairs must be 1 or −1. On the other hand, x6 and x7 that
are not the elements of V cannot make such pairs. The
pairs whose correlation coefficients are ±1 are thought to
have the same correlation as x1 .
In practice, a threshold γ (0 < γ ≤ 1) on the correlation coefficients has to be used, since there are no pairs whose correlation coefficient is strictly ±1. Thus, pairs are selected when the absolute values of their correlation coefficients are larger than γ. The pairs whose correlation is similar to that of the query can be detected through the above procedure. The NC method runs in O(n²) time complexity, where n is the number of samples.

Fig. 1. An example of the procedure of the NC method: the original space (left) and the translated space (right) (Fujiwara et al. (2012))
3.3 Nearest correlation spectral clustering (NCSC)
A correlation-based affinity matrix for spectral clustering can be constructed by using the NC method. Suppose that samples x_n ∈ R^M (n = 1, 2, · · · , N) are stored in the database.
The NCSC method is illustrated through a simple example (Fujiwara et al. (2010)). An objective data set consists of nine samples in the two-dimensional space: x1, x2, x3, x4 are on the line l and x5, x6, x7, x8 are on the line k, while x9 is an outlier, as shown in Fig. 2. In addition, samples x2, x7, and x9 are on one line by chance.

Fig. 2. An objective data set consisting of two classes and one outlier (Fujiwara et al. (2012))
First, the zero matrix S ∈ R^{9×9} is defined, and the correlations of all possible pairs of samples are checked by the NC method. For example, when x1 is the query, x2–x3, x2–x4, and x3–x4 are detected as pairs whose correlations can describe x1. The detected pairs each get one point: S_{2,3} = S_{3,2} = 1, S_{2,4} = S_{4,2} = 1, and S_{3,4} = S_{4,3} = 1.
Similarly, since x1–x3, x1–x4, x3–x4, and x7–x9 are detected when x2 is the query, one is added to the elements of S corresponding to these pairs. This procedure is repeated until every sample has served as the query. Finally, the affinity matrix S becomes as follows.
S = \begin{pmatrix}
0 & 2 & 2 & 2 & 0 & 0 & 0 & 0 & 0 \\
2 & 0 & 2 & 2 & 0 & 0 & 1 & 0 & 1 \\
2 & 2 & 0 & 2 & 0 & 0 & 0 & 0 & 0 \\
2 & 2 & 2 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 2 & 2 & 2 & 0 \\
0 & 0 & 0 & 0 & 2 & 0 & 2 & 2 & 0 \\
0 & 1 & 0 & 0 & 2 & 2 & 0 & 2 & 1 \\
0 & 0 & 0 & 0 & 2 & 2 & 2 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0
\end{pmatrix}    (12)
By applying SC to S, the nodes (samples) are classified
into {x1 , x2 , x3 , x4 }, {x5 , x6 , x7 , x8 } and {x9 }. This example clearly shows that NCSC can classify samples on
the basis of the correlation among variables.
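The NC procedure of Sections 3.2 and 3.3 condenses into a short routine. The sketch below builds the affinity matrix S by translating each query to the origin and giving one point to every near-collinear pair; the normalized inner product is used here as the collinearity check, which is one reading of the paper's pairwise correlation coefficient, and the naive pair loop is not optimized to the O(n²) complexity reported above. Given coordinates matching Fig. 2, it would reproduce the pattern of Eq. (12).

```python
import numpy as np

def nc_affinity(X, gamma=0.999):
    # Build S: for each query q, translate so q becomes the origin, then
    # score every pair of translated samples that is nearly collinear
    # with the origin (|cosine| > gamma).
    N = X.shape[0]
    S = np.zeros((N, N))
    for q in range(N):
        others = [i for i in range(N) if i != q]
        Z = X[others] - X[q]                  # translated samples
        nrm = np.linalg.norm(Z, axis=1)
        for a in range(len(others)):
            for b in range(a + 1, len(others)):
                if nrm[a] > 0 and nrm[b] > 0 and \
                        abs(Z[a] @ Z[b]) / (nrm[a] * nrm[b]) > gamma:
                    i, j = others[a], others[b]
                    S[i, j] += 1              # the detected pair gets one point
                    S[j, i] += 1
    return S
```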
3.4 Nearest correlation spectral clustering-based variable
selection (NCSC-VS)
Although NCSC was originally proposed for sample clustering, it can be used for variable clustering (Fujiwara
et al. (2012)). NCSC-VS clusters input variables into some
variable classes based on their correlation by using NCSC,
and evaluates each variable class for variable selection.
First, NCSC-VS clusters input variables into J variable classes v_j = {x_m | m ∈ V_j} (j = 1, 2, · · · , J) based on their correlation, where V_j is the jth subset of variable indexes and V = ∪_j V_j. To classify input variables by NCSC, an affinity matrix is constructed from the transposed variable matrix X^T by using NC. Then, the jth PLS model f_j is constructed from the jth variable class matrix X_j, whose columns consist of the variables belonging to the jth variable class, and the validity of the constructed PLS model is examined on the basis of the contribution ratio defined as
C_{j,k} = 1 - \frac{\| y - \hat{y}_{j,k} \|^2}{\| y \|^2}    (13)

where ŷ_{j,k} denotes the estimates by f_j with k latent variables. Finally, variable classes are selected in descending order of C_{j,k} so that the estimation performance is maximized. A statistical model can be constructed by using the variables belonging to the selected variable classes.
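The class-wise evaluation can be sketched as one PLS model per variable class, scored by the contribution ratio of Eq. (13). The helper below is hypothetical and assumes scikit-learn's PLSRegression and a centered output y:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def class_contributions(X, y, classes, n_components=2):
    # classes: dict mapping class label -> list of column indices of X
    scores = {}
    for label, idx in classes.items():
        k = min(n_components, len(idx))          # components <= class size
        f_j = PLSRegression(n_components=k).fit(X[:, idx], y)
        y_hat = f_j.predict(X[:, idx]).ravel()
        scores[label] = 1.0 - np.sum((y - y_hat) ** 2) / np.sum(y ** 2)
    return sorted(scores.items(), key=lambda kv: -kv[1])  # descending C_{j,k}
```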
4. NEAREST CORRELATION LOUVAIN METHOD
(NCLM)
In the present work, the nearest correlation Louvain method (NCLM) is proposed by integrating the NC method and the Louvain method (LM). LM can partition the affinity matrix much faster than SC, and it can optimize the number of classes within its algorithm. Thus, NCLM can divide input variables into classes based on the correlation among variables more efficiently than NCSC.
4.1 Louvain method (LM)
The Louvain method (LM) was originally developed in the field of community detection (Blondel et al. (2008)). The objective of community detection is to find communities such that nodes within the same community are densely connected while nodes in different communities are sparsely connected. Among the indexes used to evaluate detected communities, modularity has received a lot of attention. Modularity is a scalar value between −1 and 1, and it measures the density of links inside communities as compared to links between communities. In the case of weighted networks, the modularity score is defined as
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)    (14)

k_i = \sum_j A_{ij}, \qquad m = \frac{1}{2} \sum_{i,j} A_{ij}    (15)
where Aij represents the weight of the link between nodes
i and j, ki is the sum of weights of links attached to node
i, and ci is the community to which node i is assigned.
The δ-function δ(u, v) is 1 if u = v and 0 otherwise.
Exact modularity optimization is computationally hard. Thus, several approximation algorithms that optimize modularity faster have been proposed. Among such algorithms, LM is a well-known method that can optimize modularity in a short time.
The algorithm of LM consists of two phases, which are repeated iteratively. Assume that a weighted network has N nodes and that each node is initially assigned to its own community. For each node i, the gain of modularity obtained by removing node i from its community and placing it in the community of a neighboring node j is evaluated. Node i is then placed in the community for which the gain of modularity is maximum, but only if the gain is positive. If there is no positive gain, node i stays in its original community. This process is applied to all nodes and repeated until no further improvement is achieved. In this first phase, the gain of modularity ∆Q, which is obtained by moving node i into community C, is calculated as follows.
\Delta Q = \left[ \frac{\Sigma_{in} + k_{i,in}}{2m} - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^2 \right] - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right]    (16)

where Σ_in is the sum of weights of links inside C, Σ_tot is the sum of weights of links attached to nodes in C, and k_{i,in} is the sum of weights of links from node i to nodes in C.
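Eq. (16) translates directly into code; a minimal transcription in the paper's notation:

```python
def modularity_gain(sigma_in, sigma_tot, k_i, k_i_in, m):
    # Delta Q of Eq. (16) for moving node i into community C
    after = (sigma_in + k_i_in) / (2 * m) - ((sigma_tot + k_i) / (2 * m)) ** 2
    before = (sigma_in / (2 * m) - (sigma_tot / (2 * m)) ** 2
              - (k_i / (2 * m)) ** 2)
    return after - before
```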
In the second phase, a new network is generated. The
network consists of nodes that are communities generated
in the first phase. Weights between new nodes are defined
as the sum of weights of links between nodes in the
corresponding two communities.
These two phases are iterated as long as a significant improvement of the network modularity is obtained. LM runs in O(n log n) time complexity, where n is the number of nodes in the weighted network.
LM is extremely fast because the gain of modularity is easy to compute and the number of communities decreases drastically after a few iterations. As a result, LM is faster than SC, which needs to compute an eigenvalue decomposition. By using LM instead of SC, NCLM can execute clustering significantly faster than NCSC. In the following example and case studies, the generalized Louvain method is used (De Meo et al. (2011); Jutla et al. (2011)).
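Off-the-shelf implementations of LM exist; a minimal sketch assuming a recent networkx (2.8 or later), whose louvain_communities is one such implementation (an assumption about the environment; the paper itself uses the generalized Louvain code for MATLAB):

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
S = rng.random((30, 30))
S = (S + S.T) / 2                       # symmetric affinity matrix
np.fill_diagonal(S, 0)
G = nx.from_numpy_array(S)              # weighted graph from the matrix
communities = nx.community.louvain_communities(G, seed=0)
print(len(communities))                 # number of classes chosen by LM
```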
4.2 Numerical example
The discrimination performance of NCLM was compared
with that of the k-means method (KM) and NCSC through
a numerical example. The data set consists of three classes of samples, which have different correlations and intersect at the origin, as shown in Fig. 3. For each class, 100 samples were generated through
x = s a_i + n \quad (i = 1, 2, 3)    (17)

where x = [x1 x2]^T is the measurement vector of two variables and n = [n1 n2]^T is measurement noise. s ∼ N(0, 1) and n_i ∼ N(0, 0.01) are variables following the normal distribution N(m, σ) with mean m and standard deviation σ. The coefficient vectors are a1 = [1 3]^T, a2 = [2 2]^T, and a3 = [2 1]^T. Before clustering, x1 and x2 are normalized (auto-scaled).
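The data set of Eq. (17) is easy to reproduce; a sketch of the generation and auto-scaling steps (the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 3.0], [2.0, 2.0], [2.0, 1.0]])    # a_1, a_2, a_3
X, labels = [], []
for i, a in enumerate(A):
    s = rng.normal(0.0, 1.0, size=(100, 1))           # s ~ N(0, 1)
    n = rng.normal(0.0, 0.01, size=(100, 2))          # n ~ N(0, 0.01)
    X.append(s * a + n)                               # x = s a_i + n, Eq. (17)
    labels += [i] * 100
X = np.vstack(X)
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)      # auto-scaling
```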
In NCSC and NCLM, the parameter γ of the NC method
was determined by trial and error and set as γ = 0.999.
The clustering results of KM, NCSC, and NCLM are shown in Fig. 3. The conventional distance-based method, KM, did not classify the samples correctly, but both NCSC and NCLM successfully classified them except around the origin. Table 1 compares the computational time of each method. It was confirmed that NCLM required less computational time than NCSC.

Fig. 3. Classification results of the numerical example when three classes intersect at the origin: true classes (top-left), k-means (top-right), NCSC (bottom-left), and NCLM (bottom-right)

Table 1. Comparison of clustering methods in computational time

Method     Time [s]
k-means    0.0056
NCSC       0.1554
NCLM       0.0959
4.3 NCLM-based variable selection

The present work aims to develop a new correlation-based group-wise method for selecting input variables of
expected to be faster than NCSC-VS. This goal would
be achieved by using NCLM for variable selection. This
method is called NCLM-VS.
NCLM-VS clusters input variables into variable classes
based on their correlation and evaluates the contribution of
each variable class to estimates. Unlike NCSC-VS, NCLM-VS optimizes the number of classes automatically, and it does not require a heavy computational load.
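Putting the pieces together, the NCLM-VS procedure can be sketched end-to-end using the hypothetical helpers nc_affinity and class_contributions from the earlier snippets, together with a Louvain implementation (again an illustrative sketch under those assumptions, not the authors' code; the default γ is a placeholder):

```python
import networkx as nx
import numpy as np

def nclm_vs(X, y, gamma=0.8, n_components=2):
    # Cluster the input variables (columns of X) by applying NC to X^T,
    # partition the resulting affinity matrix with the Louvain method,
    # and rank the variable classes by their contribution ratio.
    S = nc_affinity(X.T, gamma)                      # variable-wise affinity
    G = nx.from_numpy_array(S)
    comms = nx.community.louvain_communities(G, seed=0)
    classes = {j: sorted(c) for j, c in enumerate(comms)}
    ranked = class_contributions(X, y, classes, n_components)
    # Classes would then be adopted in descending order of contribution,
    # with the stopping point chosen by validation data.
    return ranked, classes
```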
5. APPLICATIONS
In this section, the proposed NCLM-VS is validated
through two case studies. One is the soft-sensor design for
an industrial chemical process, and the other is the calibration modeling from near-infrared (NIR) spectroscopy
data.
5.1 Soft-sensor design for chemical process
NCLM-VS and conventional methods were compared through their applications to soft-sensor design. The objective was to estimate the concentration of unreacted NaOH in an alkali washing tower. Taking account of the process dynamics, 50 process variables such as pressure, temperature, and flow rate were used as input variables.
Table 2. Comparison of input variable selection methods: estimation of the concentration of unreacted NaOH in the alkali washing tower

Method     RMSE   r      #Var   #LV   Time [s]
ALL        0.87   0.71   50     4     —
R          0.70   0.78   43     4     0.74
PLS-Beta   0.91   0.72   14     7     0.83
VIP        0.99   0.63   36     4     1.14
NCSC-VS    0.65   0.79   27     7     9.51
NCLM-VS    0.64   0.79   34     4     0.21
Samples were divided into three sets; 150 samples were
used for constructing models (Model set), 19 samples were
used for tuning parameters (Parameter set), and 40 samples were used for evaluating the estimation accuracy of
PLS models (Test set). Parameters were optimized so that
root mean square error (RMSE) for the Parameter set was
minimized. RMSE is defined as
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2}    (18)
where N is the number of samples in the data set, yn is the
nth reference value of the output variable, and ŷn is the
estimated value of yn . The number of latent variables of
the final PLS model was determined by cross validation.
Finally, the estimation accuracy was evaluated by using
the Test set.
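Eq. (18) in code, for completeness (a trivial sketch; the helper name is illustrative):

```python
import numpy as np

def rmse(y, y_hat):
    # Eq. (18): root mean square error over a data set
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))
```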
The threshold of NC was set at 0.8 in both NCSC-VS and
NCLM-VS, which was determined by trial and error. The
number of classes J of SC was optimized within the range
from 1 to 11; this setting contributes toward reducing the
computational time of NCSC-VS. The specifications of the computer used in this work are as follows: OS: Windows 7 Enterprise; CPU: Intel(R) Core(TM) i7-2600K 3.40 GHz; memory: 8 GB.
In this case study, six variable selection methods were compared: ALL (no selection), R, PLS-Beta, VIP, NCSC-VS, and NCLM-VS. Table 2 shows RMSE for the Test set,
the correlation coefficient between the measured NaOH
concentration and the corresponding estimates (r), the
number of selected input variables (#Var), the number
of adopted latent variables (#LV), and the computational
time required to optimize parameters and select input
variables.
The conventional method R slightly improved the estimation accuracy compared with ALL. However, the other conventional methods such as VIP and PLS-Beta did not improve the estimation accuracy compared with ALL. The correlation-based group-wise methods, i.e., NCSC-VS and NCLM-VS, achieved significantly better performance than the others. In particular, NCLM-VS was the best in estimation accuracy among all methods. Furthermore, the computational load of NCLM-VS was drastically lighter than that of NCSC-VS as well as the other methods.
The optimal number of generated variable classes was 11 in NCSC-VS and 15 in NCLM-VS. On the basis of the contribution ratio, both NCSC-VS and NCLM-VS adopted only one variable class. Here it is noted that the number of generated variable classes in NCSC-VS was optimized within the range from 1 to 11. The computational time of NCSC-VS strongly depends on the maximum number of generated variable classes.
The computational time of R, PLS-Beta, and VIP was long compared with that of NCLM-VS because they treat 50 classes (individual input variables). These methods add each variable incrementally for model construction; each of them finds the optimal number of selected input variables within the range from 1 to 50, while NCSC-VS and NCLM-VS need to find the optimal number of selected variable classes within the range from 1 to 11 and from 1 to 15, respectively. Moreover, NCLM-VS optimizes the number of generated variable classes without any limitation. Hence, the computational efficiency of NCLM-VS is remarkable.
5.2 Calibration modeling for NIR spectroscopy
In this case study, calibration models were constructed
by using NIR spectral data. The output variable is the
freezing point of diesel fuel, and the input variables are
absorbances at 401 wavelengths of NIR spectroscopy. Samples were divided into three sets; 133 samples were used
for constructing models (Model set), 56 samples were used
for tuning parameters (Parameter set), and 56 samples
were used for evaluating the estimation accuracy of PLS
models (Test set). The procedure and parameter settings
were the same as those in the previous case study except
that the number of generated variable classes in NCSC-VS
was optimized within the range from 1 to 40.
The results are summarized in Table 3. The difference in RMSE was not significant, but it is confirmed that NCLM-VS achieved better estimation accuracy than NCSC-VS and reduced the computational time significantly. The computational time of NCSC-VS is extremely long mainly because the maximum number of generated variable classes was 40 (the number was 11 in the previous case study).

Table 3. Comparison of input variable selection methods: estimation of the freezing point of diesel fuel

Method     RMSE   r      #Var   #LV   Time [s]
ALL        2.47   0.80   401    6     —
R          2.12   0.84   204    7     59
PLS-Beta   2.48   0.78   5      5     59
VIP        2.27   0.81   155    6     59
NCSC-VS    2.22   0.82   60     7     226
NCLM-VS    2.15   0.84   195    7     3.0
The case studies clearly demonstrate the advantage of the proposed NCLM-VS method over the conventional variable selection methods in both computational load and estimation performance.
6. CONCLUSION

To construct highly accurate statistical models, a new input variable selection method, NCLM-VS, was proposed. NCLM-VS is a correlation-based group-wise method integrating the nearest correlation method and the Louvain method. NCLM-VS is remarkably faster than conventional methods including NCSC-VS. The advantage of NCLM-VS over conventional methods was demonstrated through two case studies.

REFERENCES
Blondel, V., Guillaume, J.L., Lambiotte, R., and Lefebvre,
E. Fast unfolding of communities in large networks.
Journal of Statistical Mechanics: Theory and Experiment, P10008, 2008.
De Meo, P., Ferrara, E., Fiumara, G., and Provetti, A. Generalized Louvain method for community detection in large networks. International Conference on Intelligent Systems Design and Applications (ISDA), 88–93, 2011.
Ding, C., He, X., Zha, H., Gu, M., and Simon, H. A min-max cut algorithm for graph partitioning and data clustering. In Proceedings of IEEE International Conference on Data Mining (ICDM), 107–114, 2001.
Fujiwara, K., Kano, M., and Hasebe, S. Development of
correlation-based clustering method and its application
to software sensing. Chemometrics and Intelligent Laboratory Systems, 101, 130–138, 2010.
Fujiwara, K., Kano, M., and Hasebe, S. Correlation-based spectral clustering for flexible process monitoring.
Journal of Process Control, 21, 1438–1448, 2011.
Fujiwara, K., Sawada, H., and Kano, M. Input variable
selection for PLS modeling using nearest correlation
spectral clustering. Chemometrics and Intelligent Laboratory Systems, 118, 109–119, 2012.
Jutla, I.S., Jeub, L.G., and Mucha, P.J. A generalized Louvain method for community detection implemented in MATLAB. http://netwiki.amath.unc.edu/GenLouvain, 2011.
Kano, M. and Fujiwara, K. Virtual sensing technology
in process industries: Trends and challenges revealed
by recent industrial applications. Journal of Chemical
Engineering of Japan, 46, 1–17, 2013.
Kano, M. and Ogawa, M. The state of the art in chemical process control in Japan: Good practice and questionnaire survey. Journal of Process Control, 20(9), 969–982, 2010.
Lynn, S., Ringwood, J., and MacGearailt, N. Global and
local virtual metrology models for a plasma etch process.
IEEE Transactions on Semiconductor Manufacturing,
25, 94–103, 2012.
Mehmood, T., Liland, K.H., Snipen, L., and Sæbø, S. A review of variable selection methods in partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 118, 62–69, 2012.
Ng, A., Jordan, M., and Weiss, Y. On spectral clustering:
Analysis and an algorithm. Advances in Neural Information Processing Systems, 2002.
Wold, S., Sjöström, M., and Eriksson, L. PLS-regression:
A basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58, 109–130, 2001.