THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS
IEICE Technical Report
IBISML2014-73 (2014-11)
Combination of LSTM and CNN for recognizing mathematical symbols
HAI Nguyen Dai†, ANH Le Duc‡, and Masaki NAKAGAWA*
Department of Computer and Information Science, Tokyo University of Agriculture and Technology
2-24-16 Naka-cho, Koganei-shi, Tokyo 184-8588
E-mail: †[email protected], ‡[email protected], *[email protected]
Abstract  Combining classifiers is an approach that has been shown to be useful on numerous occasions when striving for further improvement over the performance of individual classifiers. In this paper we present a combination of CNN and LSTM, two sophisticated neural networks that are good classifiers for offline and online handwritten mathematical character recognition, respectively. In addition, we employ the dropout technique and gradient-based local features to improve the accuracy of CNN and LSTM, respectively. The best combination ensemble achieves a recognition rate higher than that of the best individual classifier on the MathBrush database.
Keywords  LSTM, directional features, CNN, dropout, linear combination
1. Introduction

Online mathematical symbol recognition is an essential component of any pen-based mathematical formula recognition system. With the increasing availability of touch-based and pen-based devices such as smart phones, tablet PCs, and smart boards, interest in this area has been growing. However, existing systems are still far from perfect because of challenges that arise from the two-dimensional nature of mathematical input and the large symbol set with many similar-looking symbols. In this work we address only the problem of mathematical symbol recognition.

Long short-term memory (LSTM) and conventional recurrent neural networks (RNN) have been successfully applied to sequence prediction and sequence labeling tasks. For online handwriting recognition, bidirectional LSTM networks with a connectionist temporal classification (CTC) output layer, which process the input in both forward and backward directions, have been shown to outperform state-of-the-art HMM-based systems [3].

Recently, Álvaro et al. proposed a set of hybrid features that combine both on-line and off-line information, using either HMM or LSTM for classification [5]. The symbol recognition rate achieved by LSTM using raw images as local off-line features significantly outperformed HMM. However, the hybrid features in combination with LSTM did not produce as great an improvement as the same features did with HMM. In section 3 we describe the use of directional (or gradient) features as local off-line features following the method of Kawamura et al. [4] and show that the recognition rate is improved when these gradient features are combined with LSTM.

For off-line handwriting recognition, the convolutional neural network (CNN) has been proven a state-of-the-art method. CNNs combine three architectural ideas to ensure some degree of shift, scale, and distortion invariance: local receptive fields, shared weights, and spatial or temporal sub-sampling [2]. The network is usually trained like a standard neural network by back-propagation. Against over-fitting, we use two techniques: (1) enlarging the training dataset by linear transformations such as translation, rotation, and shear; and (2) the dropout technique introduced in [1]. Rectified Linear Units (ReLU) have been highly successful for computer vision tasks and have proven faster to train than other standard units such as sigmoid or tanh. ReLUs are thus a good choice to combine with dropout for the MathBrush database.
The combination of multiple classifiers has been shown to be suitable for improving recognition performance in difficult classification problems [6-7]. Classifier combination has also been applied in handwriting recognition. In this work, we simply deploy a linear combination of CNN and LSTM. Our experiments show that the best combination ensemble has a recognition rate roughly 1% higher than the rate achieved by the best individual classifier on the MathBrush database.
The rest of this paper is organized as follows. Section 2 gives an overview of the recognition system. Section 3 describes the hybrid feature, which includes online features and local gradient features, used with the LSTM classifier for on-line handwritten mathematical symbol recognition. Section 4 details the architecture of the CNN and the dropout technique. We detail the combination of LSTM and CNN in section 5. Section 6 reports our experimental results and section 7 provides the conclusion.

2. System overview

The online mathematical symbol recognition system is depicted in Fig. 1. It consists of two pre-processing steps (trajectory smoothing and pattern normalization), feature extraction for LSTM, two neural networks (CNN and LSTM), and a final linear combiner. Pre-processing regulates the pattern shape to reduce the within-class shape variation. The linear combiner combines the classification results of both CNN and LSTM to improve the accuracy rate.

Figure 1. Diagram of the online mathematical symbol recognition system: input pattern → trajectory smoothing → pattern normalization → context feature extraction → CNN / LSTM → linear combination → output.
The input pattern, composed of the coordinates of sampled pen-down points, is smoothed by replacing the coordinates of each point with a weighted average of the current point and its two neighboring points.

In normalization, the coordinates are transformed such that the pattern size is standardized and the shape is regulated. We adopt a global curve-fitting-based normalization method, named bi-moment normalization [8].
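The smoothing step just described is simple to state in code. A minimal sketch follows; the paper does not give the averaging weights, so the (0.25, 0.5, 0.25) kernel here is an assumption:

```python
import numpy as np

def smooth_trajectory(points, w=(0.25, 0.5, 0.25)):
    """Replace each pen point with a weighted average of itself and
    its two neighbors; the two endpoints are left unchanged."""
    points = np.asarray(points, dtype=float)   # shape (N, 2): x, y per pen-down point
    out = points.copy()
    out[1:-1] = w[0] * points[:-2] + w[1] * points[1:-1] + w[2] * points[2:]
    return out
```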
For feature extraction, the local histogram of stroke direction (direction feature) has been proven very effective in character recognition. We use the directional decomposition technique of Kawamura et al. [4] to extract directional features from normalized patterns as offline features for LSTM.

We build two neural networks, CNN and LSTM, as individual classifiers for recognizing an offline pattern and an online pattern, respectively. One effective approach to improving the performance of handwriting recognition is to combine multiple classifiers. In our case, we take advantage of both methods: one is LSTM for online recognition and the other is CNN for offline recognition. We tried a linear combination of both neural networks for simplicity instead of using any sophisticated combining scheme.

3. LSTM and gradient features
3.1. Overview of LSTM

This section outlines the principle of the LSTM RNN that is used for the online symbol classification task and combined with the CNN in section 5, together with our proposed local gradient features for improving the accuracy of LSTM.

An LSTM layer is composed of recurrently connected memory blocks, each of which contains one or more memory cells, along with three multiplicative "gate" units: input, output, and forget gates. The gates perform functions analogous to reading, writing, and resetting operations. More specifically, the cell input is multiplied by the activation of the input gate, the cell output by that of the output gate, and the previous cell values by the forget gate (see Fig. 2). The overall effect is to allow the network to store and retrieve information over long periods of time. For example, as long as the input gate remains closed, the activation of the cell will not be overwritten by new inputs and can therefore be made available to the net much later in the sequence by opening the output gate.
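The gating behavior described above can be summarized in a short sketch of a single LSTM time step. This is an illustrative formulation, not the exact network used in the paper (peephole connections and gate ordering vary between LSTM variants):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [x, h_prev] to the stacked
    pre-activations of the input, forget, and output gates and the
    cell input; b is the matching bias vector."""
    z = W @ np.concatenate([x, h_prev]) + b
    n = len(c_prev)
    i = sigmoid(z[0:n])          # input gate: how much new input to write
    f = sigmoid(z[n:2*n])        # forget gate: how much old state to keep
    o = sigmoid(z[2*n:3*n])      # output gate: how much state to expose
    g = np.tanh(z[3*n:4*n])      # candidate cell input
    c = f * c_prev + i * g       # previous cell values gated by the forget gate
    h = o * np.tanh(c)           # cell output gated by the output gate
    return h, c
```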
Figure 2. LSTM memory block consisting of one memory cell: the input, output, and forget gates collect activations from inside and outside the block and control the cell through multiplicative units (depicted as small circles).

Another problem with standard RNNs is that they have access to past but not to future context. This can be overcome by using bidirectional RNNs [10], where two separate recurrent hidden layers scan the input sequence in opposite directions. The two hidden layers are connected to the same output layer, which therefore has access to context information in both directions. The amount of context information that the network actually uses is learned during training, and does not have to be specified beforehand. Fig. 3 shows the structure of a simple bidirectional network.

Figure 3. Structure of a bidirectional network with input i, output o, and two hidden layers for forward and backward processing.

3.2. Feature extraction for LSTM

We extract online features and low-level context information around each point to train the BLSTM. This approach was presented by Álvaro et al. [5]. They used raw images centered at each point to represent the context information. However, the recognition rate was not improved when these were used with LSTM together with online features, since the classifier may not process raw-image features efficiently. In this work, we instead employed the gradient feature, which has performed efficiently in character recognition.

Firstly, we linearly normalize each online character pattern to a standard size (64x64). Then for each point p = (x, y), we use 6 time-based features (a code sketch follows the list):
- end-point flag: 1 if this point is an end point, 0 otherwise
- normalized coordinates: (x, y)
- normalized angle: (sin θ, cos θ)
- distance between points i and i+1
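For concreteness, a sketch of these six per-point features. The exact definition of the writing angle (here, the direction from point i to point i+1) and the handling of the last point are our assumptions:

```python
import numpy as np

def online_features(points, end_flags):
    """Compute the 6 time-based features per point: end-point flag,
    normalized (x, y), (sin, cos) of the writing direction, and the
    distance to the next point."""
    pts = np.asarray(points, dtype=float) / 63.0   # assumes coordinates already on the 64x64 grid
    feats = []
    for i in range(len(pts)):
        j = min(i + 1, len(pts) - 1)               # the last point pairs with itself
        dx, dy = pts[j] - pts[i]
        dist = np.hypot(dx, dy)
        s, c = (dy / dist, dx / dist) if dist > 0 else (0.0, 0.0)
        feats.append([end_flags[i], pts[i][0], pts[i][1], s, c, dist])
    return np.array(feats)                          # shape (N, 6)
```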
In order to combine these online features with context information around each point, we employ the gradient direction feature as context features.

Regarding gradient features: for each point p = (x, y), we employ a context window centered at p. From the context window, gradient direction features are decomposed into components in 8 chain-code directions (depicted in figure 4). We partition the context window into 9 sub-windows of equal size 5x5. We calculate the value for each block by using a Gaussian blurring mask of size 10x10, and use PCA to reduce the dimensionality to 10 dimensions.

Figure 4. Extracting gradient features: each of the 8 chain-code directions contributes 9 sub-window features, giving 72 features in total.
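The decomposition can be sketched as follows. The 15x15 window size is implied by the 9 sub-windows of 5x5; plain sum-pooling stands in for the paper's 10x10 Gaussian blurring mask, and PCA (fit on the training set) would follow to reduce the 72 values to 10 dimensions:

```python
import numpy as np

def gradient_direction_features(img, cx, cy, win=15, sub=5, n_dir=8):
    """Decompose gradients in a win x win window centered at (cx, cy)
    into n_dir chain-code directions, pooled over (win//sub)**2 sub-windows."""
    h = win // 2
    patch = img[cy - h:cy + h + 1, cx - h:cx + h + 1]   # assumes the window fits inside the image
    gy, gx = np.gradient(patch.astype(float))
    angle = np.arctan2(gy, gx) % (2 * np.pi)
    mag = np.hypot(gx, gy)
    d = angle / (2 * np.pi / n_dir)                     # continuous direction index in [0, 8)
    lo = np.floor(d).astype(int) % n_dir                # split each gradient vector between
    hi = (lo + 1) % n_dir                               # its two nearest chain-code directions
    w_hi = d - np.floor(d)
    planes = np.zeros((n_dir,) + patch.shape)
    for k in range(n_dir):
        planes[k] = mag * ((lo == k) * (1 - w_hi) + (hi == k) * w_hi)
    feats = [planes[k, i:i + sub, j:j + sub].sum()      # 8 directions x 9 sub-windows
             for k in range(n_dir)
             for i in range(0, win, sub)
             for j in range(0, win, sub)]
    return np.array(feats)                              # 72 features before PCA
```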
3.3. Details of learning

In this work, we use the BLSTM architecture. The size of the input layer is 6 for online features and 16 for the combination of online and gradient features. We use 2 hidden BLSTM layers with 32 and 128 memory blocks per layer. The output layer is a softmax layer whose size is 93, the number of mathematical symbol classes. We initialized the weights of the network from a zero-mean Gaussian distribution with standard deviation 0.1. We trained our network using online stochastic gradient descent with a learning rate of 0.0001 and a momentum of 0.9. Training is stopped when the error rate has not improved for 20 epochs.
4. Architecture of CNN

The architecture of our CNN is summarized in figure 5. It contains 9 learned layers: 6 convolution + sub-pooling layers, 2 fully-connected layers, and finally 1 softmax layer.

Figure 5. Architecture of CNN.
4.1. ReLU Nonlinearity

Following Nair and Hinton [9], we refer to neurons with Rectified Linear Units (ReLU), defined as f(x) = max(x, 0). The advantages of using ReLU in neural networks are: (1) used as the activation function, the hard max induces sparsity in the hidden units; (2) ReLU does not face the gradient vanishing problem as the sigmoid and tanh functions do; (3) it has been shown that deep networks can be trained efficiently using ReLU even without pre-training.
4.2. Dropout

The recently introduced technique called "dropout" [1] consists of setting to zero the output of each hidden neuron with probability 0.5. The neurons which are dropped in this way do not contribute to the forward pass and do not participate in back-propagation. This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. At test time, we use all neurons but halve their outgoing weights, which is a reasonable approximation to taking the geometric mean of the predictive distributions produced by the exponentially many dropout networks. We use dropout in the first two fully-connected layers as in Figure 6.

Figure 6. Dropout in the first two fully-connected layers with a drop rate of 0.5; dropped nodes do not contribute to the forward and backward passes.
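A minimal sketch of this train/test behavior; halving the outgoing weights at test time is folded into an equivalent scaling of the layer's output:

```python
import numpy as np

def dropout_layer(x, p_drop=0.5, train=True, rng=None):
    """Classic (non-inverted) dropout. During training, each unit is
    zeroed with probability p_drop; at test time all units are kept and
    the output is scaled by (1 - p_drop), which is equivalent to halving
    the outgoing weights when p_drop = 0.5."""
    if train:
        rng = rng or np.random.default_rng()
        mask = rng.random(x.shape) >= p_drop   # dropped units become 0
        return x * mask
    return x * (1.0 - p_drop)                  # test-time scaling
```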
4.3. Details of learning

We trained our models using stochastic gradient descent with a batch size of 64 samples and a momentum of 0.95. The update rule for weight w was:

m_{i+1} = 0.95 · m_i − 0.0005 · ε · w_i − ε · Grad_i
w_{i+1} = w_i + m_{i+1}

where i is the iteration index, m is the momentum variable, ε is the learning rate, and Grad_i is the average over the i-th batch of the derivative of the objective with respect to w, evaluated at w_i.

We initialized the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01 and the biases in each layer with the constant 0. We used an equal learning rate for all layers, which we adjusted manually throughout training. The heuristic we followed was to multiply the learning rate by 0.5 when the validation error rate stopped improving at the current learning rate. The learning rate was initialized at 0.01. We trained the network for 100 epochs, stopping when the error rate did not improve for 20 epochs, which took 2 to 3 hours on an NVIDIA Quadro K600 4GB GPU.
-290-
and 'e' due to the same stroke d irections. In this case,
off-line classification is more appropriate.
Regarding the dropout technique, it provided a great
improvement in CNN increasing the symbol recognition
error rate from 12.67% down to 11.01% (approximately
Combining multiple individual classifiers is proven to
1.5%) in average. It shows that dropout improves the
be a suitable way to improve recognition rate for difficult
performance of CNN on handwritten mathematical symbol
classification problems. There are many sophisticated
recognition tasks.
methods to combine but in this work we deploy linear
Regarding local gradient features, they provided a
combination of CNN and LSTM for simplicity as the
significant improvement in LSTM compared to using only
following formulation:
on-line features
𝐒𝐜𝐨𝐫𝐞𝐂𝐨𝐦𝐛𝐢𝐧𝐚𝐭𝐢𝐨𝐧 = 𝛂 𝐒𝐜𝐨𝐫𝐞𝐂𝐍𝐍 + 𝟏 − 𝛂 𝐒𝐜𝐨𝐫𝐞𝐋𝐒𝐓𝐌
Where α is the weighting parameter for combination.
This
parameter
is
estimated
by
an
experiment
on
validation set.
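A sketch of the combination and of estimating α on a validation set; the grid granularity and the use of top-1 accuracy as the selection criterion are our assumptions:

```python
import numpy as np

def combine_scores(score_cnn, score_lstm, alpha):
    """Linear combination of the two classifiers' class scores."""
    return alpha * score_cnn + (1.0 - alpha) * score_lstm

def estimate_alpha(val_cnn, val_lstm, val_labels, grid=np.linspace(0, 1, 101)):
    """Pick the alpha with the highest top-1 accuracy on validation data.
    val_cnn, val_lstm: (N, n_classes) score matrices; val_labels: (N,)."""
    best_alpha, best_acc = 0.0, -1.0
    for a in grid:
        pred = combine_scores(val_cnn, val_lstm, a).argmax(axis=1)
        acc = (pred == val_labels).mean()
        if acc > best_acc:
            best_alpha, best_acc = a, acc
    return best_alpha
```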
6. Experiment and results

6.1. Dataset

We use MathBrush as the database for our experiments. The database contains 4,654 annotated mathematical expressions written by 20 different writers. The number of mathematical symbols is 26K, distributed over 100 different classes.

As previous approaches have also used this database, we followed the same experimentation described in [5] in order to obtain comparable results, so we discarded 6 symbol classes (≤, ≠, <, λ, Ω and 'comma'). As a result, 93 different classes were considered. For each class, we used 90% of the samples as the training set and the remaining 10% as the testing set. Finally, we carried out 20 trials where the training and testing sets were chosen randomly, and symbol recognition results are stated as the mean ± the standard error.

To make the neural networks learn better, we use linear transformations including translation, rotation, and shear to artificially enlarge the dataset. This also helps reduce overfitting.

Besides that, MathBrush is a dataset of online handwritten symbols, so we had to convert patterns from online format to offline format. We could simply use a linear normalization method to convert the patterns, but in our experiments we used the bi-moment method mentioned in [8], which yields higher accuracy.
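A rough sketch of bi-moment normalization in the spirit of [8], applied to map an online pattern into a normalized 64x64 frame; the one-sided second moments and the ±2√μ bounds follow our reading of that method and should be treated as assumptions:

```python
import numpy as np

def bimoment_bounds(v):
    """One-dimensional bi-moment bounds: one-sided second moments
    around the centroid define the effective ink extent."""
    c = v.mean()
    left, right = v[v < c], v[v >= c]
    mu_l = ((left - c) ** 2).mean() if len(left) else 0.0
    mu_r = ((right - c) ** 2).mean() if len(right) else 0.0
    return c - 2 * np.sqrt(mu_l), c + 2 * np.sqrt(mu_r)

def bimoment_normalize(points, size=64):
    """Map pen points into a size x size box using bi-moment bounds."""
    pts = np.asarray(points, dtype=float)
    out = np.empty_like(pts)
    for k in range(2):                       # x and y are handled independently
        lo, hi = bimoment_bounds(pts[:, k])
        out[:, k] = (pts[:, k] - lo) / max(hi - lo, 1e-6) * (size - 1)
    return np.clip(out, 0, size - 1)
```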
To study the performance of the combination of CNN and LSTM, we first study the accuracy of each individual network and then that of their combination.
6.2. Experiments on CNN and LSTM

We deployed 5 trials on the MathBrush database to test the effectiveness of the dropout technique mentioned in [1] and of our proposed method using local gradient features, for CNN and LSTM respectively.

Regarding the dropout technique, it provided a great improvement in CNN, decreasing the symbol recognition error rate from 12.67% down to 11.01% (approximately 1.5%) on average. This shows that dropout improves the performance of CNN on handwritten mathematical symbol recognition tasks.

Regarding local gradient features, they provided a significant improvement in LSTM compared to using only on-line features (error rate down from 11.12% to 10.3%). So we found that our proposed local gradient features improve over the results obtained with on-line features better than the contextual features proposed in [5] do.

Table 1. Experimental results of CNN and LSTM on 5 trials (error rate, %).

Trial   CNN     CNN+dropout   LSTM+online   LSTM+online+gradient
MB1     13.19   11.06         11.3          11.0
MB2     13.31   11.18         11.4          10.5
MB3     12.32   10.07         10.9          9.95
MB4     12.48   10.74         11.1          9.83
MB5     12.05   11.41         10.9          10.5
Avg     12.67   11.01         11.12         10.3
6.4. Combination of CNN and LSTM

In this section we report our experimental results on 20 trials and compare the recognition rate of our combination system with the previous work of Álvaro et al. [5], which had obtained the highest rate in the mathematical symbol classification task until now.

Table 2 shows that the best combination ensemble has a recognition rate of 90.3%, which is higher than the rate of 89.43% achieved by the best individual classifier on the MathBrush database over 20 random trials (approximately 1 point). In addition, our combination method also obtained a better recognition rate than the one in [5].

Table 2. Symbol recognition results for different classifiers on 20 trials.

Classifier                          Top-1 (%)    Top-5 (%)
Álvaro et al. [5] (highest rate)    89.4±0.3     99.3±0.2
CNN+dropout                         88.5±0.7     99.5±0.2
LSTM+gradient hybrid                89.43±0.4    99.4±0.1
Combination                         90.3±0.5     99.5±0.1
7. Conclusions

In this paper we considered the online handwritten mathematical symbol classification problem. A hybrid feature, a combination of online features and local gradient features, is deployed to improve the recognition accuracy of LSTM. We applied the dropout technique mentioned in [1] to the first two fully-connected layers of the CNN. These approaches proved effective in our experiments with 5 trials: for LSTM, local gradient features increased the symbol recognition rate by 0.82 points, and for CNN, dropout increased it by 1.57 points.

We then deployed a simple linear combination of CNN and LSTM. Our experiments also show that the best combination ensemble has a recognition rate higher than the rate achieved by the best individual classifier by about 1 point.

Finally, our future work will focus on the integration of this classifier into a mathematical expression recognition system, where the recognition of the whole expression will help to solve the problems of similarly shaped classes.
Acknowledgement
The authors would like to thank Mr. Phan Van Truyen and other members of the Nakagawa laboratory for their helpful comments and suggestions.
References
[1] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," The Computing Research Repository (CoRR), vol. abs/1207.0580, 2012.
[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, 86(11):2278-2324, 1998.
[3] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855-868, 2009.
[4] A. Kawamura et al., "On-line recognition of freely handwritten Japanese characters using directional feature densities," Proc. 11th ICPR, The Hague, 1992, vol. 2, pp. 183-186.
[5] F. Álvaro, J.-A. Sanchez, and J.-M. Benedi, "Classification of on-line mathematical symbols with hybrid features and recurrent neural networks," Proc. 12th International Conference on Document Analysis and Recognition (ICDAR), 2013.
[6] Y. Huang and C. Suen, "A method of combining multiple experts for the recognition of unconstrained handwritten numerals," IEEE Trans. Pattern Analysis and Machine Intelligence, 16(1):66-75, 1994.
[7] J. Kittler, M. Hatef, R. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Analysis and Machine Intelligence, 20(3):226-239, Mar. 1998.
[8] C.-L. Liu, K. Nakashima, H. Sako, and H. Fujisawa, "Handwritten digit recognition: alternatives to nonlinear normalization," Proc. 7th ICDAR, Edinburgh, Scotland, 2003, pp. 524-528.
[9] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," Proc. 27th International Conference on Machine Learning, 2010.
[10] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Trans. Signal Processing, vol. 45, no. 11, pp. 2673-2681, Nov. 1997.