In Proceedings of the First International Workshop on Multiple Classifier
Systems, June 2000, Sardinia
A hybrid projection based and radial basis
function architecture*
Shimon Cohen
Nathan Intrator
Computer Science Department
Tel-Aviv University
www.math.tau.ac.il/∼nin
Abstract. A hybrid architecture that includes Radial Basis Functions
(RBF) and projection based hidden units is introduced together with a
simple gradient based training algorithm. Classification and regression
results are demonstrated on various data sets and compared with several
variants of RBF networks. In particular, best classification results are
achieved on the vowel classification data [1].
1 Introduction
The duality between projection-based approximation and radial kernel methods has been explored theoretically [2]. It was shown that a function can be decomposed into two mutually exclusive parts, a radial part and a ridge (projection based) part. It is difficult, however, to separate the radial portion of a function from its projection based portion before they are estimated, and sequential methods which attempt to first find the radial part and then proceed with a projection based approximation are likely to get stuck in non-optimal local minima.
Earlier approaches to kernel based estimation were based on Volterra and Wiener kernels [17, 19], but they failed to produce a practical optimization algorithm that can compete with MLPs or RBFs. The relevant statistical framework is that of Generalized Additive Models (GAM) [6, 7]. In that framework, the hidden units (the components of the additive model) have some parametric form, usually polynomial, which is estimated from the data. While this model has nice statistical properties [18], the additional degrees of freedom require strong regularization to avoid overfitting.

* This work was partially supported by the Israeli Ministry of Science and by the Israel Academy of Sciences and Humanities — Center of Excellence Program. Part of this work was done while N. I. was affiliated with the Institute for Brain and Neural Systems at Brown University and supported in part by ONR grants N00014-98-1-0663 and N00014-99-1-0009.
Higher order networks, which have at least quadratic terms in addition to the linear term of the projections [9], are a special case of GAM. While they present a powerful extension of MLPs and can form local or global features, they do so at the cost of squaring the number of input weights to the hidden nodes. Flake has suggested an architecture similar to GAM where each hidden unit has a parametric activation function which can change from a projection based to a radial function in a continuous way [4]. This architecture uses a squared activation function, and is thus called Squared MLP (SMLP); it only doubles the number of free parameters. It achieved very good results on some data sets and was only outperformed by our proposed architecture (see below). The use of B-splines was also suggested in this context [8], with the argument that such a network can combine properties of global and local receptive fields. These networks are more general than the proposed model. In high dimensional problems, a network that is too general has more free parameters and is more likely to over-fit the data. We shall compare the proposed model to the current state of the art on the data sets that we use, to show that it does not suffer from such an over-fitting problem.
This paper introduces a simple extension to both MLP and RBF networks by combining RBF and Perceptron units in the same hidden layer. Unlike the previously described methods, this does not increase the number of parameters in the model, at the cost of predetermining the number of RBF and Perceptron units in the network. The new hybrid architecture, which we call Perceptron Radial Basis Net (PRBFN), automatically finds the relevant functional parts from the data concurrently, thus avoiding possible local minima that result from sequential methods. It leads to superior results on data sets on which radial basis functions have so far produced the best results, in particular the vowel classification data [1], where the current best results are obtained.
2 The hybrid RBF/FF architecture and training
For simplicity, we shall consider a single hidden layer architecture, since the extension to a multi-layer net is straightforward. In the hybrid architecture, some of the hidden units are radial functions and the others are of projection type. All the hidden units are connected via a set of weights to the output layer, which can be linear for regression problems or non-linear for classification problems.
Figure 1 presents the proposed hybrid architecture. There is a set of weights connecting the hidden units to the output units, and a set of weights for the hidden units, which are the projection vectors in the case of the perceptron units and the centers in the case of the RBF units.
Fig. 1. The PRBFN hybrid neural network with M = M1 + M2 hidden units: the inputs x1, ..., xd (plus a bias) feed a hidden layer composed of M1 ridge units and M2 RBF units, which is fully connected to the output units Y1, ..., Yc for the c output classes.
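To make the data flow concrete, the following sketch (our own illustrative NumPy code, not the authors' implementation; all names are ours) computes the output of such a hybrid layer for a single input pattern, with sigmoidal ridge units, Gaussian RBF units and a linear output layer.

```python
import numpy as np

def hybrid_forward(x, W_ridge, b_ridge, centers, radii, W_out, b_out):
    """Forward pass of a PRBFN-style hybrid layer (illustrative sketch).

    x        : (d,) input pattern
    W_ridge  : (M1, d) projection weights,  b_ridge : (M1,) biases
    centers  : (M2, d) RBF centers,         radii   : (M2,) RBF widths
    W_out    : (c, M1 + M2) output weights, b_out   : (c,) output biases
    """
    # Ridge (projection based) units: sigmoid of a linear projection.
    a_ridge = W_ridge @ x + b_ridge
    z_ridge = 1.0 / (1.0 + np.exp(-a_ridge))

    # Radial units: Gaussian of the squared distance to each center.
    d2 = np.sum((centers - x) ** 2, axis=1)
    z_rbf = np.exp(-d2 / (2.0 * radii ** 2))

    # Both kinds of hidden unit feed the same output layer.
    z = np.concatenate([z_ridge, z_rbf])
    return W_out @ z + b_out   # linear outputs (regression case)
```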
There are several steps in estimating the parameters of the hybrid architecture. First, the number of cluster centers is determined from the data and the number of RBF hidden units is chosen accordingly. Each RBF unit is assigned to one of the cluster centers. The clustering can be done
by a k-means procedure [3]. A discussion about the benefits of more recent approaches to clustering is beyond the scope of this paper. Unlike
Orr [11], we assume that the clusters are symmetric, although each cluster
may have a different radius. This is done in order to reduce the number
of free parameters, while it is likely that in data-sets where Orr’s method
outperforms other RBF methods (see results on Friedman’s data below),
this assumption is not valid. The weights for the hidden projection based
units can be set randomly using a certain weight distribution. The second
layer of weights can then be found using a pseudo-inverse of the activity
matrix or via a least mean square procedure.
The last step in the parameter estimation is to refine the weights of
the hybrid network via some form of gradient descent minimization on
the full architecture.
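A rough sketch of these estimation steps is given below, using the scikit-learn KMeans implementation for the clustering; the variable names, the Gaussian weight distribution and the per-cluster radius estimate are our own illustrative choices rather than prescriptions from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_prbfn(X, T, M1, M2, rng=None):
    """Initialize the hybrid net: k-means RBF centers, random ridge weights,
    and a pseudo-inverse solution for the output layer (regression case)."""
    rng = np.random.default_rng(0) if rng is None else rng
    N, d = X.shape

    # 1. Cluster the inputs; the cluster centers become the RBF centers and
    #    each radius is set to the RMS distance of the cluster members.
    km = KMeans(n_clusters=M2, n_init=10).fit(X)
    centers = km.cluster_centers_
    radii = np.array([np.sqrt(((X[km.labels_ == j] - centers[j]) ** 2)
                              .sum(axis=1).mean()) for j in range(M2)])
    radii = np.maximum(radii, 1e-3)          # guard against degenerate clusters

    # 2. Draw the projection weights of the ridge units at random.
    W_ridge = rng.normal(scale=1.0 / np.sqrt(d), size=(M1, d))
    b_ridge = np.zeros(M1)

    # 3. Solve the second layer by a pseudo-inverse of the hidden activity matrix.
    z_ridge = 1.0 / (1.0 + np.exp(-(X @ W_ridge.T + b_ridge)))   # (N, M1)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)    # (N, M2)
    z_rbf = np.exp(-d2 / (2.0 * radii ** 2))
    H = np.hstack([z_ridge, z_rbf, np.ones((N, 1))])  # last column = output bias
    W_out = np.linalg.pinv(H) @ T
    return W_ridge, b_ridge, centers, radii, W_out
```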
2.1 Gradient based parameter optimization
We start by deriving the gradient of the full architecture. The output of a radial basis unit is given by:

φ(x, w_i) = exp(−‖x − w_i‖² / (2r_i²)).
The output of a projection based unit is given by g(a_j), with linear activation

a_j = Σ_i w_ji z_i,

where z is the output of the previous layer (or the input to the hidden layer) and w is the weight vector associated with the unit. The transfer function g is monotone and smooth, such as a sigmoid; it is linear in the case of regression. The total error is the sum of the errors over all patterns:
E = Σ_{n=1}^{N} E^n.

The error on the K outputs for the n-th pattern is given by:

E^n = (1/2) Σ_{k=1}^{K} (y_k^n − t_k^n)²,
where t_k^n and y_k^n are the target and output values of the k-th output for the n-th pattern, respectively. The partial derivative of the error function with respect to the output weights is given by:

∂E^n/∂w_kj = g'(a_k)(y_k^n − t_k^n) z_j,

where z_j is the output of unit j in the previous layer and g'(a_k) is the derivative of the transfer function at the linear activation a_k.
The error term δ for the output units is given by:

δ_k^n = (y_k^n − t_k^n) g'(a_k),

and for the projection hidden units by:

δ_j^n = g'(a_j) Σ_{k=1}^{K} δ_k^n w_kj.
Using this notation, the partial derivative of the error function with respect to the first layer of weights (from the input patterns to the ridge units) is given by:

∂E^n/∂w_ji = δ_j^n x_i^n.
The partial derivative of the error function with respect to the centers of the RBFs is given by:

∂E^n/∂m_j = Σ_{k=1}^{K} δ_k^n w_kj φ(x^n, w_j) (x^n − m_j) / r_j².
The partial derivative of the error function with respect to the radii is given by:

∂E^n/∂r_j = Σ_{k=1}^{K} δ_k^n w_kj φ(x^n, w_j) ‖x^n − m_j‖² / r_j³.
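For illustration, the per-pattern derivatives above can be collected as in the following sketch (our own code, with linear output units so that g'(a_k) = 1; a non-linear output layer only changes that factor).

```python
import numpy as np

def prbfn_gradients(x, t, W_ridge, b_ridge, centers, radii, W_out, b_out):
    """Per-pattern gradients of the squared error for the hybrid net
    (linear output units assumed, i.e. g'(a_k) = 1)."""
    M1 = W_ridge.shape[0]

    # Forward pass.
    a_ridge = W_ridge @ x + b_ridge
    z_ridge = 1.0 / (1.0 + np.exp(-a_ridge))
    diff = x - centers                          # (M2, d) differences x - m_j
    d2 = np.sum(diff ** 2, axis=1)
    phi = np.exp(-d2 / (2.0 * radii ** 2))
    z = np.concatenate([z_ridge, phi])
    y = W_out @ z + b_out

    # Output deltas: delta_k = (y_k - t_k) * g'(a_k), with g' = 1 here.
    delta_k = y - t
    g_W_out = np.outer(delta_k, z)              # dE/dw_kj = delta_k * z_j
    g_b_out = delta_k

    # Back-propagated term sum_k delta_k w_kj for every hidden unit j.
    back = W_out.T @ delta_k                    # (M1 + M2,)

    # Ridge units: delta_j = g'(a_j) * back_j, dE/dw_ji = delta_j * x_i.
    delta_ridge = z_ridge * (1.0 - z_ridge) * back[:M1]
    g_W_ridge = np.outer(delta_ridge, x)
    g_b_ridge = delta_ridge

    # RBF units: center and radius gradients, following the formulas above.
    g_centers = (back[M1:] * phi / radii ** 2)[:, None] * diff
    g_radii = back[M1:] * phi * d2 / radii ** 3
    return g_W_out, g_b_out, g_W_ridge, g_b_ridge, g_centers, g_radii
```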
A momentum term can be added to the gradient; however, it was not found to be useful with the hybrid gradient. A Levenberg-Marquardt updating rule was found to be very useful for updating the weights, the centers and the radii. It is given by

w_new = w_old − (Z^T Z + λI)^{−1} Z^T e,

where e is the vector of output errors y − t and the matrix Z of output derivatives is given by:

(Z)_ni = ∂y^n / ∂w_i.
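A minimal sketch of this update step is shown below (our own code; it solves the damped normal equations instead of forming the inverse explicitly and assumes the Jacobian Z and the error vector e have already been computed).

```python
import numpy as np

def levenberg_marquardt_step(w, Z, e, lam):
    """One Levenberg-Marquardt update of the flattened parameter vector w.

    w   : (p,)   current parameters (weights, centers and radii flattened)
    Z   : (n, p) Jacobian of the network outputs with respect to the parameters
    e   : (n,)   output errors y - t on the training patterns
    lam : damping factor lambda
    """
    p = Z.shape[1]
    # Solve (Z^T Z + lambda I) step = Z^T e rather than inverting the matrix.
    step = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ e)
    return w - step
```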
3 Eliminating un-needed RBF units
It is likely that after the clustering algorithm some cluster centers do not represent real clusters that could be approximated by an RBF; such cluster centers should therefore be eliminated, in the hope that the projection based units will be more useful in approximating the function in those regions. We have used several criteria for eliminating such clusters.
Size of a cluster
We discard clusters whose size in terms of the number of patterns is too small, or whose size in terms of scatter is too large. Given a certain threshold a, we consider only clusters which satisfy (N/k ≥ a), where N is the total number of patterns and k is the number of clusters. The scatter of cluster i is given by:

J_i = (1/N_i) Σ_{k=1}^{N_i} ‖X_k − M_i‖².

We discard clusters for which J_i ≥ a for a certain threshold a. This criterion, however, is not very effective for non-radial clusters; in that case the Mahalanobis distance should be used.
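A sketch of this filtering step is given below; the exact form of the size threshold is not fully clear from the text, so the reading used here (a minimum number of member patterns) and the function and threshold names are our own.

```python
import numpy as np

def keep_cluster(X_i, a_size, a_scatter):
    """Decide whether the cluster with member patterns X_i keeps its RBF unit.

    X_i       : (N_i, d) patterns assigned to cluster i
    a_size    : minimum acceptable number of member patterns (one reading of
                the size criterion in the text)
    a_scatter : maximum acceptable scatter J_i
    """
    if X_i.shape[0] < a_size:
        return False                                    # too few patterns
    m_i = X_i.mean(axis=0)                              # cluster mean M_i
    J_i = np.mean(np.sum((X_i - m_i) ** 2, axis=1))     # scatter J_i
    return J_i < a_scatter                              # discard overly spread clusters
```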
Spectrum criterion
Clusters with a large difference between the highest and lowest eigenvalues indicate that some directions have little variation, which can be attributed to noise only. Such clusters should be projected onto the more meaningful directions. At this point, we simply discard such clusters and leave the approximation in this region to the MLP.
4 Experimental results
This section describes regression and classification results for several variants of RBF networks and the proposed PRBFN architecture on a number of data sets. The results, which are given for the test data only, are averages over 10 runs and include the standard deviation. We start with a comparison on four simulated regression data sets that were used by Orr to assess the performance of RBF networks. The results are summarized in Table 1.
The first data set is a 1D sine wave [11],

y = sin(12x),

with x ∈ [0, 1]. Gaussian noise with a standard deviation of σ = 0.1 was added to the outputs. 10 sets of 50 training and 50 test patterns, randomly sampled from the data with the additive noise, were used.
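For reference, one train/test replication of this protocol can be generated along the following lines (our own sketch of the described setup).

```python
import numpy as np

def make_1d_sine(n_train=50, n_test=50, sigma=0.1, seed=0):
    """One replication of the 1D sine benchmark: y = sin(12x) on [0, 1]
    with additive Gaussian output noise of standard deviation sigma."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n_train + n_test)
    y = np.sin(12.0 * x) + rng.normal(scale=sigma, size=x.shape)
    return (x[:n_train], y[:n_train]), (x[n_train:], y[n_train:])
```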
The second data-set is a 2D sine wave,
y = 0. sin(x1 /4) sin(x2 /2),
with 200 training patterns sampled at random from an input range x1 ∈
[0, 10] and x2 ∈ [−5, 5]. The clean data was corrupted by additive
Gaussian noise with σ = 0.1. The test set contains 400 noiseless samples
arranged as a 20 by 20 grid pattern, covering the same input ranges. Orr
measured the error as the total squared error over the 400 samples. We
report the error as simply the MSE on the test set.
The third data set [10, 12] is based on a one dimensional Hermite polynomial,

y = (1 + x + 2x²) exp(−x²).

100 input values were sampled randomly in −4 < x < 4, and Gaussian noise of standard deviation σ = 0.1 was added to the output.
The fourth data-set is a simulated alternating current circuit with four input dimensions (resistance R, frequency ω, inductance L and capacitance C) and one output, the impedance Z = √(R² + (ωL − 1/(ωC))²). Each training set contained 200 points sampled at random from a certain region [13, for further details]. Again, additive noise was added to the outputs.
The experimental design is the same as the one used by Friedman in the
evaluation of MARS [5]. Friedman’s results include a division by the variance of the test set targets. We do not make this division and report the
MSE on the test set. Orr’s regression trees method [13] outperforms the
other methods on this data set.
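This benchmark corresponds to Friedman's second example, which is available, for instance, as make_friedman2 in scikit-learn; a minimal way to generate comparable data is sketched below (the noise level and the test-set design are illustrative choices, since the paper defers the exact region and noise to [13]).

```python
from sklearn.datasets import make_friedman2

# 200 noisy training points of the impedance benchmark
# Z = sqrt(R^2 + (wL - 1/(wC))^2); the noise level is illustrative only.
X_train, y_train = make_friedman2(n_samples=200, noise=125.0, random_state=0)
X_test, y_test = make_friedman2(n_samples=200, noise=0.0, random_state=1)
```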
Algorithm     MacKay             1D Sine            2D Sine            Friedman
RBF-Orr       1.7e-3 ± 0         1.2e-3 ± 0         7.4e-3 ± 6e-3      7.0 ± 0.1
RBF-Matlab    2.1e-3 ± 9.5e-5    8.0e-4 ± 6.0e-3    8.3e-3 ± 5e-4      10.9 ± 0.4
RBF-Bishop    1.8e-2 ± 9.7e-6    1.7e-2 ± 1.5e-5    4.9e-3 ± 1.4e-3    11.7 ± 0.7
PRBFN         1.5e-3 ± 9.1e-4    1.1e-3 ± 2.0e-4    6.8e-3 ± 1.8e-4    10.6 ± 0.6
Table 1. Comparison of mean squared error results on the four regression data sets (see [13] for details). Results on the test set are given for several variants of RBF networks that were also used by Orr to assess RBFs. MSE results are averages over 10 runs and include the standard deviation.
4.1 Classification
We have used several data sets to compare the classification performance of the proposed method to that of other RBF networks. The sonar data set poses the task of distinguishing between a mine and a rock. It was used by Gorman
and Sejnowski [14] in their study of the classification of sonar signals using neural networks. The data has 60 continuous inputs and one binary
output for the two classes. It is divided into 104 training patterns and
104 test patterns. The task is to train a network to discriminate between
sonar signals reflected from a metal cylinder and those reflected from a similarly shaped rock. There are no results for Bishop's
algorithm as we were not able to get it to reduce the output error. Gorman and Sejnowski report on results with feed-forward architectures [16]
using 12 hidden units. They achieved 84.7% correct classification on the
test data. This result outperforms the results obtained by the different
RBF methods, and is only surpassed by the proposed hybrid RBF/FF
network.
The Deterding vowel recognition data [1, 4] is a widely studied benchmark. This problem may be more indicative of the type of problems that
a real neural network could be faced with. The data consists of auditory
features of steady state vowels spoken by British English speakers. There
are 528 training patterns and 462 test patterns. Each pattern consists
of 10 features and it belongs to one of 11 classes that correspond to the
spoken vowel. The speakers are of both genders. The best score so far
was reported by Flake using his SMLP units. His average best score was
60.6% [4] and was achieved with 44 hidden units. Our algorithm achieved
65.7% correct classification with only 19 hidden units. As far as we know,
it is the best result that was achieved on this data set.
The seismic1 and seismic2 data sets are two different representations of seismic data. The data sets include waveforms from two types of explosions, and the task is to distinguish between the two types; each pattern belongs to one of these two classes. This data was used in a "Learning" course in the last two years for the performance evaluation of many different classifiers1. The dimensionality of seismic1 is 352, representing 32 time frames of 11 frequency bands, and the dimensionality of seismic2 patterns is 242, representing 22 time frames of 11 frequency bands. Principal Component Analysis (PCA) was used to reduce the data representation to 12 dimensions. Both data-sets have 65 training patterns and 19 test patterns, which were chosen to be the most difficult for the desired discrimination.
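The PCA step described here can be reproduced along these lines (a sketch using scikit-learn; the placeholder arrays and the choice to fit the projection on the training patterns only are ours).

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder arrays standing in for the seismic1 representation
# (65 training and 19 test waveforms, 352 features each).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(65, 352))
X_test = rng.normal(size=(19, 352))

# Reduce the representation to 12 principal components, fitted on the training set.
pca = PCA(n_components=12)
X_train_12 = pca.fit_transform(X_train)   # (65, 12)
X_test_12 = pca.transform(X_test)         # (19, 12)
```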
Table 2 summarizes the percent correct classification results on the different data sets for the different RBF classifiers and the proposed hybrid architecture. As in the regression case, the STD is also given; however, on the seismic data, due to the use of a single test set (as we wanted to see the performance on this particular data set only), the STD is often zero, as the same classification of the data was obtained in all 10 runs.

1 For details see http://www.math.tau.ac.il/∼nin/learn98,9
Algorithm     Sonar        Vowel        Seismic1   Seismic2
RBF-Orr       71.7 ± 0.5   –            63 ± 0     79 ± 0
RBF-Matlab    82.3 ± 2.4   51.6 ± 2.9   73 ± 4     81 ± 3
RBF-Bishop    –            48.4 ± 2.4   60 ± 4     77 ± 5
PRBFN         87.1 ± 3.3   65.7 ± 1.9   89 ± 0     85 ± 3
Table 2. Percent classification results of different RBF variants on four data sets.
5 Discussion
The general idea of a hybrid architecture is not new and is covered by the theory of generalized additive models [7]. The novelty of the proposed architecture and training algorithm lies in the simplicity of training and in the fact that the number of model parameters is not increased. The drawback compared with more flexible methods is that the number of RBFs and ridge functions has to be pre-determined.
In the extensively studied vowel data set, our proposed hybrid architecture achieved average results which are superior to the best known
results [15]. Moreover, this result was achieved with a smaller number of
hidden units, suggesting that the internal representation found by the hybrid architecture is richer and generalizes better. The proposed method
also outperformed a feed-forward architecture and RBF architectures on
the sonar data [14]. This architecture is thus a viable alternative to the
use of either projection based or radial basis functions. It shares the good
convergence properties of both, and with a good parameter estimation
procedure it is expected to outperform either on difficult tasks.
References
1. D. H. Deterding. Speaker Normalisation for Automatic Speech Recognition. PhD
thesis, University of Cambridge, 1989.
2. D. L. Donoho and I. M. Johnstone. Projection-based approximation and a duality
with kernel methods. Annals of Statistics, 17:58–106, 1989.
3. R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley,
New York, 1973.
4. G. W. Flake. Square unit augmented, radially extended, multilayer perceptrons.
In G. B. Orr and K. Müller, editors, Neural Networks: Tricks of the Trade, pages
145–163. Springer, 1998.
5. J. H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics,
19:1–141, 1991.
6. T. Hastie and R. Tibshirani. Generalized additive models. Statistical Science,
1:297–318, 1986.
7. T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman and Hall,
London, 1990.
8. S. Lane, D. Handelman, J. Gelfand, and M. Flax. Function approximation using
multi-layered neural networks and b-spline receptive fields. In R. P. Lippmann,
J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems, volume 3, pages 684–693, San Mateo, CA, 1991. Morgan Kaufmann.
9. Y. C. Lee, G. Doolen, H. H. Chen, G. Z. Sun, T. Maxwell, H. Y. Lee, and C. L. Giles.
Machine learning using higher order correlation networks. Physica D, 22:276–306,
1986.
10. D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.
11. M. J. Orr. Introduction to Radial Basis Function networks. Technical report, 1996.
http://www.anc.ed.ac.uk/∼mjo/rbf.html.
12. M. J. Orr. Recent advances in Radial Basis Function networks. Technical report
www.anc.ed.ac.uk/∼mjo/papers/recad.ps.gz, 1999.
13. M. J. Orr, J. Hallman, K. Takezawa, A. Murray, S. Ninomiya, M. Oide, and
T. Leonard. Combining regression trees and radial basis functions. Division of
Informatics, Edinburgh University, 1999. Submitted to IJNS.
14. R. P. Gorman and T. J. Sejnowski. Analysis of hidden units in a layered network
trained to classify sonar targets. Neural Networks, 1:75–89, 1988.
15. A. J. Robinson. Dynamic Error Propagation Networks. PhD thesis, University of
Cambridge, 1989.
16. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors,
Parallel Distributed Processing, volume 1, pages 318–362. MIT Press, Cambridge,
MA, 1986.
17. M. Schetzen. The Volterra and Wiener Theories of Nonlinear Systems. John
Wiley and Sons, New York, 1980.
18. C. J. Stone. The dimensionality reduction principle for generalized additive models.
The Annals of Statistics, 14:590–606, 1986.
19. V. Volterra. Theory of Functionals and of Integral and Integro-Differential Equations. Dover,
1959.