Download Combination of LSTM and CNN for recognizing mathematical symbols

THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS IEICE Technical Report IBISML2014-73(2014-11) THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS TECHNICAL REPORT OF IEICE IBISML2014-73(2014-11) Combination of LSTM and CNN for recognizing mathematical symbols HAI Nguyen Dai † ANH Le Duc‡ and Masaki NAKAGAWA* Department of Computer and InformationScience, Tokyo University of A&T 2-24-16 Naka-cho, Koganei-shi, Tokyo 184-8588 † E-mail: ‡ [email protected] [email protected] * [email protected] Abstract Combining classifiers is an approach that has been shown to be useful on numerous occasions when striving for further improvement over the performance of individual classifiers. In this paper we present a combination of CNN and LSTM, which are both sophisticated neural networks and good classifiers for offline and online handwriting mathematical character recognition, respectively. In addition, we employ the dropout technique and gradient based local features to improve accuracy of CNN and LSTM, respectively. The best combination ensemble has a recognition rate which is higher than the rate achieved by the best individual classifier on the Mathbrush database. Keyword LSTM, directional features, CNN, Drop out, linear combination local 1. Introduction off-line features by LSTM significantly Online mathematical symbol recognition is an essential outperformed HMM. However, the hybrid features in component of any pen-based mathematical formulation combination with LSTM did not produce the great recognition systems. With the increasing availability of improvement from the same features with touch-based or pen-based devices such as smart phones, section 3 we describe the way of using directional (or tablet PCs and smart boards, interest in this area has been gradient) features as local off-line features following the growing. However, the existing system are still far from method of Kawamura et al. [4] and show that the perfection because of challenges that arise from the recognition rate by adding more gradient features is two-dimensional nature of mathematical input and the improved when combined with LSTM. HMM. In large symbol set with many similar looking symbols. In For off-line handwriting recognition, convolutional this work we just address the problem of mathematical neural network (CNN) has been proven as a state of the art symbol recognition. method. CNNs combine three architectural ideas to ensure Long short term memory (LSTM) and conventional some degree of shift, scale, and distor tion invariance: recurrent neural networks (RNN) have been successfully local receptive fields, shared weights, and spa tial or applied to sequence prediction and sequence labeling tasks. temporal sub-sampling [2]. The network is usually trained For online handwriting recognition, bidirectional LSTM like a standard neural network by back propagation. For networks with a connectionist temporal classification over-fitting problem, we use two choices: 1. enlarging the (CTC) output layer using forward and backward types of training dataset by using linear transformations like algorithm have been shown to outperform a state o f the art translation, rotation or sheer; as well as 2. Using Drop out HMM-based systems in handwriting recognition in [3]. technique Recently, Álvaro et al. proposed a set of hybrid features Units(RELU) have been highly successful for computer that combine both on-line and off-line information and tasks and proved faster to train than other standard units using either HMM or LSTM for classification in [5]. The such that sigmoid or tanh . RELUs are thus a good choice symbol recognition rate achieved using raw images as to combine with dropout for the Mathbrush database. introduced in [1].. Rectified Linear -287Copyright ©2014 by IEICE The combination of multiple classifiers has been shown The input pattern, composed of the coordinates of to be suitable for improving the recognition performance sampled pen-down points, is smoothed by replacing the in coordinates of each point with a weighted average of the difficult classification problems [6-7]. Also in handwritten recognition, classifier combination has been appliedt.In this work, we deploy simply current point and two neighboring points. linear In normalization, the coordinates are transformed such combination of CNN and LSTM. Our experiments also that the pattern size is standardized and the shape is show that the best combination ensemble has a recognition regulated. rate which is higher than relatively 1% the rate achieved normalization method, named bi-moment normalization by the best individual classifier on [8]. the Mathbrush database. We adopt a global curve fitting-based For feature extraction, the local histogram of stroke The rest of this paper is organized as follows. Section 2 direction (direction feature) has been proven very gives an overview of the recognition system. Section 3 effective in character recognition. We use the directional describes online decomposition technique of Kawamura et al [10] to extract features and local gradient feature when using with LSTM the hybrid feature which includes directional features in normalized patterns as offline classifier for on-line handwritten mathematical symbol features for LSTM. recognition. Section 4 details architecture of CNN and We build 2 neural networks: CNN and LSTM as the dropout technique. We detail the combination of LSTM individual classifiers for recognizing an offline pattern and CNN in section 5. Section 6 reports our experimental and an online pattern, respectively. One effective approach results and section 7 provides the conclusion. to improving the performance of handwriting recognition 2. System overview is to combine multiple classifiers. In our c ase, The online mathematical symbol recognition system is we take advance of both methods: one is LSTM for online depicted in Fig. 1. It consists of 2 pre -processing recognition and steps(trajectory smoothing and pattern normalization), We tried linear combination of both neural networks for feature extraction for LSTM, 2 neural networks: CNN and simplicity instead of using any sophisticated schemes of LSTM and a final linear combiner. Pre-processing is to combining. regulate the pattern shape for reducing the within -class 3. LSTM and gradient features shape variation. The linear combiner is to combine the classification results of both CNN and LSTM for improving accuracy rate. the other isCNN for offline recognition. 3.1. Overview of LSTM This section outlines the principle of LSTM RNN that is used for online symbol classification task and combined with CNN in section 5 and our proposed local gradient Input pattern features to improve the accuracy of LSTM. An LSTM layer is composed of recurrently connected Trajectory smoothing memory blocks, each of which contains one or more memory cells, along with three multipli cative "gate" units: input, output, and forget gates. The gates perform Pattern normalization functions analogous to reading, writing, and reseting operation. More specifically, the cell input is multiplied Context Feature Extraction CNN by the activation of the input gat e, the cell output by that of the output gate, and the previous cell values by the forget gate (see Fig. 2). The overall effect is to allow the LSTM network to store and retrieve information over long periods of time. For example, as long as the input gate remains closed, the activation of the cell will not be Linear Combination overwritten by new inputs and can therefore be made available to the net much later in the sequence by opening the output gate. Output Figure 1.Diagram of the online mathematical symbol recognition system. -288- improved when used with LSTM and online features since classifier may not process efficiently features of raw image. In this work, we employed the gradient feature which performed efficiently in character recognition. Firstly, we normalize linearly each online character pattern to standard size (64x64). Then for each point p = (x, y), we use 6 time-based features:  This point is end point (1), otherwise (0).  normalized coordinates: (x, y)  normalized angle: (sin , cos)  distance between point(i, i+1) Figure 2. LSTM memory block consisting of one In order to combine these online features with memory cell: the input, output, and forget gates collect context information around each point, we employ the activations from inside and outside the block which gradient direction feature as context features. control the cell through multiplicative units (depicted as small circles) Regarding gradient features: For each point p = (x, y), we employ context window centered at p. From context window, gradient direction features are decomposed into Another problem with standard RNNs is that they have components in 8 chain code directions (depicted as figure access to pass but not to future context. This can be 4). We partitioned context window into 9 sub-windows of overcome by using bidirectional RNNs [10], where two equal-sizes 5x5. We calculated the value for each block separate recurrent hidden layers scan the input sequences by using a Gaussian blurring mask of size 10x10. We use in PCA to reduce dimension into 10 dimensions opposite directions. The two hidden layers are connected to the same output layer, which therefore has access to context information in both directions. The Direction 1 9 features amount of context information that the network actually uses is learned during training, and does not have to be Direction 2 9 features specified beforehand. Fig. 3 shows the structure of a simple bidirectional network 72 features Direction 7 9 features Direction 8 9 features Figure 4. Extracting gradient features. 3.3. Details of learning In this work, we use the BLSTM architecture. The size Figure 3. Structure of a bidirectional network with input of the input layer is 6 for online features and 16 for the i, output o, and two hidden layers for forward and combination of online and gradient features. We use 2 backward processing hidden BLSTM layers with 32 and 128 memory blocks per layer. The output layer is softmax layer, the size of which 3.2. Feature extraction for LSTM size is 93, the number of mathematical symbol classes. We extract online features and low level context We initialized weights of the network from a zero-mean information around each point to train BLSTM. This Gaussian distribution with standard deviation 0.1. We approach was presented by Alvaro at el. [5]. They used trained our network using online stochastic gradient raw images centered at each point to present the context descent with a learning rate of 0.0001 and a momentum of information. 0.9. The training algorithm is stopped when error rate was However the recognition rate was not -289- not improved for 20 epochs. exponentially-many dropout networks. We use dropout in the first two fully-connected layers as in Figure 6. 4. Architecture of CNN The architecture of our CNN is summarized in figure 5 . It contains 9 learned layers - 6 convolution+sub-pooling layers, 2 full-connected layers and 1 softmax layer finally. Figure 6. Drop out 2 first full-connected layers with drop rate of 0.5 - dropped nodes do not contribute to forward and backward processes. 4.3. Details of learning We trained our models using stochastic gradient descent with a batch size of 64 samples, momentum of 0.95. The update rule for weight w was: 𝐦𝐢+𝟏 = 𝟎. 𝟗𝟓𝐦𝐢 − 𝟎. 𝟎𝟎𝟎𝟓. 𝛜. 𝐰𝐢 − 𝛜. 𝐆𝐫𝐚𝐝𝐢 𝐰𝐢+𝟏 = 𝐰𝐢+𝟏 + 𝐦𝐢+𝟏 Figure 5.Architecture of CNN. Where i is the iteration index, m is the momentum variable, ϵ is the learning rate, and Gradi is the average 4.1. ReLU Nonlinearity Following Nair and Hinton[9], we refer to neurons with Rectified Linear Units(ReLU) defined as over the ith batch of the derivative of the objective with respect to w, evaluated at wi follows:f x = max⁡ (x, 0). The advantages of using ReLU in We initialized the weights in each layer fr om a neural networks are:(1) if hard max function is used as zero-mean Gaussian distribution with standard deviation activation function, it induces the sparsity in the hidden 0.01 and the biases in each layer with the constant 0. units. (2) ReLU does not face gradient vanishing problem s We used an equal learning rate for all layers, which we as with sigmoid and tanh function s. (3) Also, it has been adjusted manually throughout training. The heuristic shown that deep networks can be trained efficiently using which we followed was to multip ly the learning rate by 0.5 ReLU even without pre-training when the validation error rate stopped improving with the current learning rate. The learning rate was initialized at 4.2. Dropout 0.01. We trained the network for 100 epochs and stopped The recently introduced technique, called "dropout"[1], when error rate did not improve for 20 epochs which took consists of setting to zero the output of each hidden 2 to 3 hours on NVIDIA Quadro K600 4GB GPU. neuron with probability 0.5. The neurons which are 5. Combination of CNN and LSTM dropped in this way do not contribute to the forward pass In mathematical symbol database, there are many and do not participate in back-propagation. This technique symbols having similar shape such as 'p' and ' ρ', or 'B' and reduces complex co-adaptations of neurons since a neuron 'β' which get confused to off-line classifiers. Fortunately, cannot rely on the presence of some other neurons. At test these similar symbols can be more easily recognized by time, we use all neurons but double weights, which is a on-line classifiers because of their different stroke orders reasonable approximation to taking the geometric mean of and directions. On the other hand, some symbols are the difficult for on-line classifiers such that '0' and '6' or 'l' predictive distributions produced by the -290- and 'e' due to the same stroke d irections. In this case, off-line classification is more appropriate. Regarding the dropout technique, it provided a great improvement in CNN increasing the symbol recognition error rate from 12.67% down to 11.01% (approximately Combining multiple individual classifiers is proven to 1.5%) in average. It shows that dropout improves the be a suitable way to improve recognition rate for difficult performance of CNN on handwritten mathematical symbol classification problems. There are many sophisticated recognition tasks. methods to combine but in this work we deploy linear Regarding local gradient features, they provided a combination of CNN and LSTM for simplicity as the significant improvement in LSTM compared to using only following formulation: on-line features 𝐒𝐜𝐨𝐫𝐞𝐂𝐨𝐦𝐛𝐢𝐧𝐚𝐭𝐢𝐨𝐧 = 𝛂 𝐒𝐜𝐨𝐫𝐞𝐂𝐍𝐍 + 𝟏 − 𝛂 𝐒𝐜𝐨𝐫𝐞𝐋𝐒𝐓𝐌 Where α is the weighting parameter for combination. This parameter is estimated by an experiment on validation set. (Error rate down from 11.12 % to 10.3%).So we found that our proposed local gradient features present differences with respect to the results obtained with on-line features better than contextual features proposed in [5] . 6. Experiment and results 6.1. Dataset We use MathBrush as a database for our experiments. The database contains 4,654 annotated mathematical expressions written by 20 different writers. The number of Table 1. Experimental results of CNN and LSTM on 5 trials. Error rate CNN (%) CNN +dropout (%) LSTM+ Online feature (%) MB1 MB2 MB3 MB4 MB5 13.19 13.31 12.32 12.48 12.05 11.06 11.18 10.07 10.74 11.41 11.3 11.4 10.9 11.1 10.9 LSTM +Online feature +local gradient(%) 11.0 10.5 9.95 9.83 10.5 Avg 12.67 11.01 11.12 10.3 mathematical symbols is 26K and they are distributed in 100 different classes. As previous approaches have also used this database, we followed the same experimentation described in [5] in order to obtain comparable resultsso we discarded 6 symbol classes (≤, ≠, <, λ , Ω and ′comma′). As a result, 93 different classes were considered. For each class, we used 90% samples as the training set and the remaining 10% as 6.4. Combination of CNN and LSTM the testing set. Finally, we carried out 20 trials where the In this section we report our experimental results on 20 training and testing sets were chosen randomly, and trials and compare the recognition rate of our combination symbol recognition system with the previous work of Alvaro et el. which results are stated as the mean ± the standard error. obtained To make the neural network learn better, we use the highest rate in mathematical symbol classification task until now. linear transformation including translation, rotation, shear Table 2 show that t he best combination ensemble has a to artificially enlarge the dataset. Also it make s us reduce recognition rate of 90.3% which is higher thanthe rate of overfitting. 89.43% achieved by the best individual classifier on Besides that, MathBrush is the dataset containing online database Mathbrush over 20 random trials (approxi mately handwritten symbols so we had to convert patterns into 1%). In addition, our combination method also obtained offline format from online format. We can use linear better recognition rate than the one in [5] . normalization method to convert pattern for simply but in our experiment we used Bi-moment method mentioned in [8] which yield high accuracy. To study the performance of the combination of CNN and LSTM, we will study the accuracy of each network Table 2: Symbol recognition results for different classifiers on 20 trials. classifiers Top-1(%) Top-5(%) Alvaro at el in [5] 89.4±0.3 99.3±0.2 (highest rate) first and combination of them after that. 6.2. Experiments on CNN and LSTM We deployed 5 trials on the Mathbrush database to test CNN+dropOut LSTM+Grad Hybrid Combination 88.5±0.7 89.43±0.4 90.3±𝟎. 𝟓 99.5±0.2 99.4±0.1 99.5±𝟎. 𝟏 the effectiveness of the dropout technique mentioned in [1] and our proposed method using local gradient features for CNN and LSTM, respectively. 7. Conclusions In this paper we consider the online handwritten -291- mathematical symbol classification problem. Hybrid feature which is a combination of online features and local gradient features is deployed to improve the recognition accuracy of LSTM. We applied the dropout technique mentioned in [1] to two first full -connected layers of CNN. These approaches have been proved efficiently through our experiments with 5 trials: Using LSTM, local gradient features increased the symbol recognition rate to 0.82 point from % and for CNN, the drop out increased to 1.57 point from %. After that we deploy simply linear combination of CNN and LSTM. Our experiments also show that the best combination ensemble has a recognition rate which is significantly higher than the rate achieved by the best individual classifier by about 1point. Finally our future work will be focused on the integration of this classifier in a mathematical expression recognition system, where the recognition of the whole expression will help to solve the problems of similar shaped classes. Acknowledgement The author would like to thank Mr. Phan Van Truyen and other members in Nakagawa laboratory for giving helpful comments and suggestions References [1] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky,IlyaSutskever, and RuslanSalakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors, " The computing research repository (CoRR), vol. abs/1270.0580, 2012 [2] LeCun Y., Bottou L., Bengio Y., Haffner P. (1998), ‘Gradient-based learning applied to document recognition’, Proc. IEEE, 86(11):2278 –2324. [3] Alex Graves, Marcus Liwicki, Santiago Fernandez, Roman Bertolami, Horst Bunke, and JurgenSchmidhuber, “A novelconnectionist system for unconstrained handwriting recognition, "Pattern Analysis and Machine Intelligence, IEEE Transaction on, vol. 31, no. 5, pp.855-868, 2009 [4] A. Kawamura, et al., On-line recognition of freelyhandwritten Japanese characters using directionalfeature densities, Proc. 11th ICPR, The Hague,1992, Vol.2, pp.183-186 [5] Francisco Álvaro, Joan-Andreu Sanchez, José-Miguel Benedi, "Classification of On-line Mathematical Symbols with Hybrid Features and Recurrent Neural Networks", 2013 12th International Conference on Document Analysis and Recognition [6] Y. Huang and C.Suen, "A method of combining multiple experts for the recognition of unconstrained handwritten numerals", IEEE Trans. Pattern Analysis -292- and Machine Intelligence, 16(1): 6 6-75, 1994 [7] J. Kittler, M. Hatef, R. Duin, and J. Matas, "On Combining classiffiers", IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3): 226-239, Mar. 1998 [8] C.-L. Liu, K. Nakashima, H. Sako, H. Fujisawa, "Handwritten digit recognition: Al ternatives to nonlinear normalization", Pro. 7th ICDAR, Edinburgh, Scotland, 2003, pp.524 -528 [9] V. Nair and G. E. Hinton. "Rectified linear units improve restricted boltzmann machines ", In Proc. 27thInternational Conference on Machine Learning, 2010. [10] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks, " IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673 -2681, Nov. 1997 [11] A Graves, M Liwicki, S Fernández, R Bertolami, H Bunke, J Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition", In Pattern Analysis and Machine Intelligence, IEEE Transaction on 31(5), 855-868.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Combination of LSTM and CNN for recognizing mathematical symbols