LARGE SCALE MACHINE LEARNING FOR PRACTICAL
NATURAL LANGUAGE PROCESSING
大規模機械学習による現実的な自然言語処理
by
Daisuke Okanohara
岡野原 大輔
A Doctoral Thesis
博士論文
Submitted to
the Graduate School of the University of Tokyo
on
in Partial Fulfillment of the Requirements
for the Degree of Doctor of Information Science and Technology
in Computer Science
Thesis Supervisor: Jun’ichi Tsujii
辻井 潤一
Prof. of Computer Science
ABSTRACT
I present several efficient, scalable frameworks for large-scale natural language processing
(NLP). Corpus-oriented NLP has succeeded in a wide range of tasks such as machine translation,
information extraction, syntactic parsing and information retrieval. As very large corpora become available, NLP systems should offer not only high performance but also efficiency and
scalability. To achieve these goals, I propose to combine several techniques and methods from
different fields: online learning algorithms, string algorithms, data structures and sparse parameter
learning.
The difficulties in large-scale NLP can be decomposed into the following: (1) a massive number
of training examples, (2) a massive number of candidate features and (3) a massive number of candidate outputs. Since solutions for (1) have already been studied (e.g. online learning algorithms),
I focus on problems (2) and (3).
An example of problem (2) is document classification with “all substring features”. Although all substring features can be effective for determining the label of a document, the number of candidate substrings is quadratic in the length of a document. Therefore a naive optimization
with all substrings requires a prohibitively large computational cost. I show that the statistics of substring
features (e.g. frequency) can be summarized into a number of equivalence classes much smaller than the
total length of the documents. Moreover, by using auxiliary data structures called enhanced suffix arrays,
these effective features can be found exhaustively in linear time without enumerating all substring
features. The experimental results show that the proposed algorithm achieved higher accuracy
than state-of-the-art methods. Moreover, the results also show the scalability of our algorithm:
effective substrings can be enumerated from one million documents in 20 minutes.
Another important example of (2) is “combination features”. In NLP, a combination of original
features is often the most effective for classification. Although there are exponentially many candidate combination features, the effective ones are very few. I present a method that can efficiently find all such
effective combination features. This method relies on a grafting algorithm, which incrementally
adds features starting from the most effective one. Although this procedure looks like a greedy algorithm, it
converges to the global optimum. To find such effective features, I propose a space-efficient
online algorithm that calculates the statistics of combination features with a simple filtering method.
Experimental results show that the proposed algorithm achieved results comparable to or better than
those of other methods, and the resulting model is very compact and easy to interpret.
For problem (3), I consider language modeling tasks, in which we discriminate correct sentences
from incorrect ones or predict the next word given previous words as context. Since there are a
very large number of candidate words, only simple generative models (e.g. N-gram models) are used in practice. I
first propose a Discriminative Language Model (DLM) that directly classifies a sentence as correct
or incorrect. Since the DLM is a discriminative model, it can use any type of features, such as the
existence of a verb in the sentence. To obtain negative examples for training, I propose to use
pseudo-negative examples sampled from generative language models. Experimental results show
that the DLM achieved 75% accuracy in the task of discriminating positive and negative sentences,
even though N-gram models and syntactic parsing cannot discriminate them at all.
The second language model I propose is a Hierarchical Exponential Model (HEM). In an HEM,
we build a hierarchical tree where each candidate word corresponds to a leaf, and a binary logistic
regression model is associated with each internal node. The probability of a word is then given by
the product of the probabilities of the internal nodes on the path from the root to the corresponding
leaf. While the HEM can use any type of features, it supports efficient inference. Moreover, it supports
an operation that finds the most probable word efficiently, which is fundamental for efficient LMs.
I conducted experiments using the HEM and show that this model achieved higher performance
than other language models while supporting efficient inference.
ABSTRACT (JAPANESE)
We propose efficient and scalable methods for realizing large-scale natural language processing systems. Corpus-based NLP systems have succeeded in many problems such as machine translation, information extraction, syntactic parsing and information retrieval. As very large corpora have become available in recent years, systems are required to be not only accurate but also highly efficient and scalable. By integrating techniques such as online learning, string algorithms, data structures and sparse parameter learning, the systems we propose achieve both high efficiency and high accuracy.
The difficulties of large-scale NLP fall roughly into three cases: (1) a large number of examples, (2) a large number of feature types, and (3) a large number of candidate outputs. Various methods, such as online learning, have already been proposed for (1); this thesis therefore focuses mainly on the remaining problems (2) and (3).
First, as an example of (2), we consider substring features in documents. In document classification and document clustering, the occurrence of any substring appearing in a document can be an effective feature for determining its label, but the number of such substrings is proportional to the square of the document length, and handling them directly is extremely costly. We show that the number of substrings with distinct statistics (e.g. document frequency) is at most the document length, and propose a method that exhaustively and efficiently searches for effective substrings using enhanced suffix arrays. Applying this method to document classification and clustering tasks, we show that it outperforms existing methods and that it scales well, for example extracting effective substrings from a collection of one million documents in 20 minutes.
Next, as another example of (2), we consider combination features. In NLP, combinations of basic features are often effective. However, although the effective combinations are few, the number of candidate combination features is very large, so extracting effective combination features has been an important problem. We adopt the grafting algorithm, which adds features to the optimization problem in order of effectiveness, and learn efficiently while guaranteeing the optimal solution. Furthermore, to compute the statistics of combination features efficiently, we propose an algorithm that combines simple filtering with online computation of the statistics. Experimental results on dependency parsing show that the proposed method processes a huge number of combination features efficiently and obtains a very compact model while achieving accuracy comparable to existing methods.
Next, as an example of (3), we consider language modeling. Language modeling is the task of judging whether a given sentence is correct, or of predicting the next word from a given context, and it plays an important role in many applications such as machine translation, speech recognition and handwriting recognition. Because the number of candidate words is enormous, conventional machine-learning methods cannot be applied directly, and simple frequency-based statistical models (e.g. N-gram models) have been used. We first propose a discriminative language model, which builds a model that directly classifies a given sentence as correct or incorrect. The incorrect sentences needed to train this model are generally not available from corpora, so we propose to train it using sentences generated from probabilistic language models as negative examples. Experimental results show that a model trained in this way classifies, with 75% accuracy, sentences that existing language models and syntactic parsers cannot discriminate at all.
As a different language model, we propose a hierarchical logistic regression model, a learning model for efficiently solving problems with a huge number of candidates. In this model, we build a hierarchical tree in which each candidate (word) corresponds to a leaf and a logistic regression model is attached to each internal node. The probability of a candidate is then defined as the product of the classification results at the nodes on the path from the root to the leaf corresponding to that candidate. In addition to allowing arbitrary features, this model can efficiently find the candidate with the largest probability without enumerating all candidates. Comparing the proposed method with existing methods, we show that it predicts words accurately and infers high-probability words efficiently.
Acknowledgements
My thesis work has benefited from the support of many colleagues, friends, and family.
I am deeply grateful to Professor Jun’ichi Tsujii for his valuable advice and encouragement. He invited me to the field of computational linguistics. I learned from him how
to solve a problem, how to present work, and especially how to enjoy research.
I would also like to express my gratitude to Dr. Yusuke Miyao and Dr. Takuya Matsuzaki. I
always enjoy discussing research topics with them, which led to most of the work
in this thesis.
I would like to acknowledge Professor Yoshimasa Tsuruoka. He is the first person
who taught me computational linguistics and machine learning. He also gave me much
valuable advice and encouragement even after he left the laboratory.
Many thanks also to Professor Kunihiko Sadakane for pushing me toward the field of
string algorithms and data structures. I always enjoy thinking and discussing with him.
I am grateful to Dr. Jin-Dong Kim, Tomoko Ohta, Rune Sætre, Yoshinobu Kano,
Naoyoshi Okazaki, Makoto Miwa, and Tadayoshi Hara for various combinations of help,
support and inspiration.
I also thank the lab members, Mr. Sun Xu, Mr. Wu Xianchao, Mr. Yuichiro Matsubayashi, Mr. Yusuke Matsubara, Ms. Sumire Uematsu, and Mr. Hiroki Hanaoka, for valuable
discussions. I am also grateful to my fellows, Mr. Daiki Kojima and Mr. Junpei Takeuchi;
I enjoyed life with them. I much appreciated the help of the secretaries, Ms. Minako
Ito, Ms. Noriko Katsu, and Ms. Mika Tarukawa. I would also like to convey my appreciation
to all members of the Tsujii laboratory.
Many thanks also to my colleagues at our company, Mr. Toru Nishikawa, Mr. Jiro
Nishitoba, Mr. Yuichi Yoshida, Mr. Hideyuki Tanaka, Mr. Takayuki Muranushi, Mr.
Kazuki Ohta, Mr. Hiroyuki Tokunaga, Mr. Nobuyuki Kubota, Mr. Jun Watanabe and Mr.
Ebihara. They gave me valuable advice from the perspectives of different fields, and ongoing
encouragement.
I also thank Ms. Naoko Nishikawa for her invaluable support. Finally, I thank my
parents for their support. They set me on the path to the field of computer science and
always encouraged me.
Contents
1 Introduction 7
  1.1 Post-Corpus Oriented NLP 7
  1.2 Machine Learning and NLP 9
  1.3 Difficulties in Dealing with Very Large Data 9
    1.3.1 Example Problem 1: Document Classification 10
    1.3.2 Example Problem 2: Language Modeling 10
  1.4 Contribution 10
  1.5 Tools for Large-Scale Machine Learning 12
    1.5.1 Online Learning 12
    1.5.2 Sparse Priors: L1 regularization 13
    1.5.3 Stringology and Succinct Data Structure 13
  1.6 Overview of This Thesis 14

2 Background 16
  2.1 Definition 16
  2.2 General Framework of Machine Learning 16
  2.3 Linear Classifier 17
  2.4 Regularization 21
  2.5 Batch Learning 22
  2.6 Online Learning 24
    2.6.1 Experiment 28
  2.7 Kernel Trick 30
  2.8 Storing Sparse Vector 31

I Learning with Massive Number of Features 34

3 Learning with All Substring Features 35
  3.1 All Substring Features 36
  3.2 Data Structure for Strings 39
  3.3 Grafting 42
  3.4 Statistics Computation with Maximal Substring 43
    3.4.1 Equivalent Class 43
    3.4.2 Document Statistics with Equivalent Classes 46
    3.4.3 Enumeration of Equivalent Classes 46
    3.4.4 External Information 48
  3.5 Document Classification 48
    3.5.1 Document Classification Model 50
    3.5.2 Efficient Learning Algorithm with All Substring Features 51
    3.5.3 Inference 54
    3.5.4 Experiments 55
  3.6 Document Clustering 57
    3.6.1 Logistic Regression Clustering 58
    3.6.2 Experiments 60
  3.7 Discussion and Conclusion 62

4 Learning with Combination Features 63
  4.1 Linear Classifier and Combination Features 63
  4.2 Learning Model 65
  4.3 Extraction of Combination Features 66
  4.4 Experiments 68
  4.5 Discussion and Conclusion 71

II Learning with Massive Number of Outputs 72

5 Discriminative Language Models with Pseudo-Negative Examples 73
  5.1 Language Modeling 73
  5.2 Previous Language Models 75
    5.2.1 N-gram Language Model 76
    5.2.2 Topic-based Language Models 78
    5.2.3 Maximum Entropy Language Models 78
    5.2.4 Whole Sentence Maximum Entropy Model 79
    5.2.5 Discriminative Language Models 80
  5.3 Discriminative Language Model with Pseudo-Negative samples 81
  5.4 Fast Kernel Computation 82
  5.5 Latent Features by Semi-Markov Class Model 83
    5.5.1 Class Model 84
  5.6 Improvement of Exchange Algorithm with Filters and Bottom-up Clustering 87
    5.6.1 Semi-Markov Class Model 87
  5.7 Experiments 88
    5.7.1 Experimental Setup 88
    5.7.2 Experiments on Pseudo-Samples 89
    5.7.3 Experiments on DLM-PN 89
  5.8 Discussion 91
  5.9 Conclusion 92

6 Hierarchical Exponential Models for Problem with Many Classes 94
  6.1 Problems of Previous Language Models 94
  6.2 Hierarchical Exponential Model 95
    6.2.1 Learning 97
    6.2.2 Features 98
    6.2.3 Inference 99
  6.3 Experiments 100
  6.4 Discussion and Conclusion 101

7 Conclusion 102
  7.1 Learning with “All Substrings” 103
  7.2 Learning with “Combination Features” 104
  7.3 Discriminative Language Model with Pseudo-Negative Examples 104
  7.4 Hierarchical Exponential Models 104
  7.5 Future Work 105

References 107
List of Figures
1.1 The comparison of the number of words in different corpora. 8
2.1 A plot of several loss functions 19
2.2 A plot of L2 regularization and L1 regularization (above). A plot of the partial derivatives of L2 regularization and L1 regularization (bottom). 23
3.1 An example of bag-of-word representation (BOW). 36
3.2 An example of all substrings representation (ALLSTR). 37
3.3 An example of data structures for a text T = abracadabra$. 40
3.4 The substrings and their classes for a text “T = abracadabra$”. 45
3.5 An example of the computation of the gradient corresponding to a substring “book”. 53
3.6 The time for finding all maximal substrings. 56
4.1 An example of filtering a candidate combination feature. 68
5.1 Example of a sentence sampled by PLMs. 82
5.2 Framework of our classification process. 83
5.3 Example of assignment in SMCM. A sentence is partitioned into variable-length chunks and each chunk is assigned a unique class. 88
5.4 Margin distribution using SMCM bi-gram features. 92
5.5 A learning curve for SMCM (∥C∥ = 500). The accuracy is the performance on the evaluation set. 93
6.1 An example of a hierarchical tree in a hierarchical exponential model 96
6.2 Path information in a hierarchical tree for predicting the next word. 99
List of Tables
2.1 A comparison of online learning methods. 29
2.2 Performance of online learning methods (I = 10). 29
2.3 Performance of online learning methods (I = 1). 30
3.1 The data set in a document classification task 55
3.2 Result of the document classification task 56
3.3 The result of clustering accuracy (%) 60
3.4 Examples of substrings which have the largest weight in each cluster 61
4.1 The performance of the Japanese dependency task on the test set. The active features column shows the number of nonzero weight features. 70
4.2 Document classification results for the Tech-TC-300 data set. The column F2 shows the average of F2 scores for each method of classification. 71
5.1 A comparison of language models. 76
5.2 Performance of language models on the evaluation data. 90
5.3 The number of features of DLM. 90
5.4 Comparison between classification performance with/without PKI index 91
6.1 Corpus statistics in HEM 100
6.2 Results of HEM and Baseline 101
Chapter 1
Introduction
I present several efficient, practical frameworks for natural language processing. They
achieve state-of-the-art performance on many problems including document classification, document clustering, dependency parsing and language modeling. The key issue is
how to process very large texts efficiently with the help of sophisticated machine learning
and algorithms.
1.1 Post-Corpus Oriented NLP
To date, the main goal of natural language processing (NLP) has been to build systems that predict linguistic events/outputs accurately. Nowadays systems frequently
utilize a corpus to obtain rich lexicographic or syntactic information provided by humans. This is called corpus-oriented NLP. In early times, since the amount of available text corpora was small, efficiency was not as important as accuracy. Therefore we
could use any method to build highly accurate systems, even if they required large computational resources, for example support vector machines with kernel methods. It is
definitely true that corpus-oriented NLP opened the door to many high-performance systems,
ranging widely from machine translation [Brown et al., 1990, Och and Ney, 2003] and
syntactic parsing [Collins, 2003, Charniak and Johnson, 2005, Miyao and Tsujii, 2008]
to information extraction [McCallum et al., 2000, Lafferty et al., 2001], information retrieval [Zhai and Lafferty, 2004], summarization [Knight and Marcu, 2002] and document
classification [Joachims, 1998, Pang et al., 2002], to name a few.
Figure 1.1: The comparison of the number of words in different corpora.
However, thanks to the rapid development of the Internet and computers, we can currently
obtain much larger corpora than those used previously in the NLP community. For example, the Penn Treebank [Marcus et al., 1994], released in the early 90’s, is one of the most widely used corpora in the NLP
community. It consists of almost one million words with syntactic
information annotated by humans. On the other hand, we can currently use the following
corpora: the Google N-gram corpus [1] is one of the biggest corpora, built by processing one
trillion words and one hundred billion sentences; PubMed [2] is a collection of biomedical
articles, consisting of several million articles; and Wikipedia [3] is the biggest online encyclopedia, including ten million entries in more than one hundred languages. Figure 1.1
shows the number of words in these corpora. The size of current corpora is about 10^6
times larger than that of the previous ones.
I will argue the following: more data gives better NLP systems. This can be restated as
follows: a simple NLP system built using very large amounts of data often beats a complex,
sophisticated NLP system built using only small amounts of data. Many studies support
this rule: a machine translation system [Brants et al., 2007] with a simple language model
beats systems with a complex language model when very large amounts of data are
available, and a sequential labeling task [J. Suzuki, 2008] could also be much improved over
the state-of-the-art systems by using large amounts of raw data.
[1] LDC2006T13
[2] http://www.ncbi.nlm.nih.gov/pubmed/
[3] http://en.wikipedia.org/wiki/
Text resources are now extraordinarily abundant, and the key issue of NLP becomes how
to process all this data efficiently with simple methods. My proposal goes beyond this:
constructing an NLP system that employs sophisticated methods and can process large
amounts of data. To achieve these conflicting goals at the same time, I combine several
techniques and methods from different fields.
1.2 Machine Learning and NLP
Roughly speaking, an NLP task is to find a mapping function from a linguistic event to
another linguistic event. For example, in a machine translation task, the input and output
are the source and target texts, and in a syntactic parsing task, they are the word sequence
and the syntactic tree respectively. Since the goal of (supervised) Machine Learning (ML)
is to create such a function using training data, it is natural to employ ML in NLP. The
training of an NLP system is decomposed into the following two steps: (1) represent the
linguistic event/input as a vector of computed values called a feature vector, and (2) find
a mapping function from the feature vector to the correct output. Since all linguistic information can be represented as feature vectors, researchers in NLP can focus on
problem (1), while researchers in ML can focus on solving problem (2) separately.
Although this division has promoted the use of ML in NLP, I revise this and instead treat
(1) and (2) together to achieve more efficient systems.
1.3 Difficulties in Dealing with Very Large Data
When constructing an NLP system based on sophisticated ML and utilizing
very large amounts of data, difficulties arise not only from the massive number of examples but also from the massive numbers of candidate features and outputs.
To process a large number of examples efficiently, we can currently
use online learning algorithms [Rosenblatt, 1958, Collins, 2002, Dredze et al., 2008,
Crammer et al., 2008, Crammer et al., 2009, Shalev-Shwartz, 2007] and space-efficient
data structures [Navarro and Mäkinen, 2007, Yan et al., 2009b]. Therefore, the remaining tasks are how to deal with massive numbers of features and outputs. For clarity, I introduce two
running example problems, which I will deal with throughout this thesis.
1.3.1 Example Problem 1: Document Classification
The first example problem is document classification. Given a document as input, the task
is to predict its class or category. In previous studies, a document was represented as a
bag-of-words (BOW) feature vector, in which each value corresponds to the occurrence
of a word. Since the BOW representation ignores the order and positions of words, this
representation loses much information from the original document. However, word
sequences or substrings are known to be effective for classifying a document. For example,
when some templates are used only in spam mails, the occurrence of such a template can
correctly identify a document as spam, although the BOW representation cannot capture
this clearly. However, since the number of distinct word sequences appearing in a document is quadratic in the length of the document, the computational cost and working space
become prohibitively large if we deal with these features naively.
1.3.2 Example Problem 2: Language Modeling
The second example is language modeling. Given context information as input, the task is
to predict the most probable next word. In another setting, given a whole sentence as input,
the task is to classify it as a correct or incorrect sentence. Since the number of candidate
words is much larger than those considered in previous ML systems, previous language
models are defined on a simple probabilistic model called the N-gram model, and cannot
utilize rich features.
In both problems, current systems can only use simple models (e.g. the BOW model and
the N-gram model) when they deal with large amounts of data. In the next section I propose
several frameworks that can use more powerful models and efficiently solve these problems.
1.4 Contribution
The primary contribution of this thesis is the development of several frameworks for large-scale NLP that deeply connect ML and NLP. In addition, I show how recent
machine learning, data structures, and string algorithms can be used to make NLP useful
given the large amount of text available.
I propose four novel methods to deal with different types of problems.
• Learning with “all substring features” (Chapter 3, [Okanohara and Tsujii, 2009b])
• Learning with “combination features” (Chapter 4, [Okanohara and Tsujii, 2009a])
• Discriminative Language Model (Chapter 5, [Okanohara and Tsujii, 2007])
• Hierarchical Exponential Model (Chapter 6)
The first and second methods consider the problem of a massive number of candidate
features. They use the same framework, which can efficiently solve a problem with
many candidate features given an algorithm for finding the most effective features. Intuitively, a feature is called effective when the current system can classify many training
examples correctly with that feature. The precise definition of effective will be given
in the following chapters.
The first tackles the problem with “all substring features”, in which all substrings appearing in a document are candidate features. Although the number of substrings is prohibitively large to enumerate and optimize, our algorithm can find the optimal classifier in
linear time in the total length of the documents by summarizing substring information into
equivalence classes. The second tackles learning with “combination features”, in which
all combinations of original features are candidate features. Our algorithm computes the
statistics of combination features in an online manner with filtering, and efficiently finds
the few effective combination features. I applied these methods to document classification and clustering tasks.
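To make the filtering idea concrete, the following is a minimal sketch (not the thesis implementation, which is described in Chapter 4) of collecting co-occurrence statistics of feature pairs while keeping only pairs whose component features are individually frequent. The thresholds and feature names are illustrative, and the sketch uses two passes over an in-memory list for clarity, whereas the actual algorithm works online.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(examples, min_single=2, min_pair=2):
    """Count feature-pair co-occurrences with a simple frequency filter.

    examples: list of sets of active (binary) features.
    A pair is only counted if both of its features are individually frequent,
    which keeps the table of candidate combination features small.
    """
    single = Counter()
    for feats in examples:
        single.update(feats)
    frequent = {f for f, c in single.items() if c >= min_single}

    pair = Counter()
    for feats in examples:
        kept = sorted(f for f in feats if f in frequent)
        for a, b in combinations(kept, 2):
            pair[(a, b)] += 1
    return {p: c for p, c in pair.items() if c >= min_pair}

if __name__ == "__main__":
    data = [{"w=bank", "pos=NN"}, {"w=bank", "pos=NN", "prev=the"},
            {"w=run", "pos=VB"}, {"w=bank", "pos=NN", "prev=the"}]
    print(frequent_pairs(data))
```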
The third and fourth methods consider the problem of a massive number of candidate
outputs, namely the language modeling task. There are two types of language modeling
tasks: the first is to discriminate whether a sentence is correct or not, and the second is to predict
the next word given a context. The third and fourth methods solve these two language
modeling tasks, respectively.
In detail, I propose a discriminative language model with pseudo-negative examples
(DLM-PN), which directly discriminates between correct and incorrect sentences. Since
there are infinitely many candidate outputs (correct/incorrect sentences can be expanded without limit), direct discrimination seems difficult. To solve this, DLM-PN samples pseudo-negative
examples and learns to discriminate between sentences in the corpora and the
pseudo-negative examples. The fourth offers a classification model called a
hierarchical exponential model (HEM), where the search space of candidate outputs is represented as a hierarchical tree. With this tree, the algorithm can estimate the probability of
an output in O(log K) time, where K is the number of possible outputs. Moreover, it can also
find the output with the largest probability in O(log K) time. Since HEM uses exponential
models inside, it can use any type of features.
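As a toy illustration of the pseudo-negative idea only (not the experimental setup of Chapter 5, which samples from much larger probabilistic language models), the sketch below estimates a bigram model from a few correct sentences and samples "sentences" from it; such samples tend to be locally fluent but globally incorrect, and serve as negative training data. All data and names here are made up.

```python
import random
from collections import defaultdict

def train_bigram(sentences):
    # Maximum-likelihood bigram counts with <s>/</s> boundary markers.
    counts = defaultdict(lambda: defaultdict(int))
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    return counts

def sample_sentence(counts, max_len=20, rng=random):
    # Sample words one by one from the conditional bigram distribution.
    word, out = "<s>", []
    while len(out) < max_len:
        nexts = counts[word]
        word = rng.choices(list(nexts), weights=list(nexts.values()))[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

if __name__ == "__main__":
    corpus = ["the cat sat on the mat", "the dog sat on the rug",
              "a cat saw the dog"]
    model = train_bigram(corpus)
    random.seed(0)
    for _ in range(3):
        print(sample_sentence(model))  # used as a pseudo-negative example
```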
1.5 Tools for Large-Scale Machine Learning
In addition to the proposals above, my frameworks heavily utilize recent research from
various fields, which is fundamental to achieving large-scale NLP. I briefly introduce that
research here; the details will be explained later in the relevant chapters.
1.5.1 Online Learning
In ML, learning is typically equivalent to solving a convex optimization problem. To solve this,
many different optimization methods have been proposed, most of which compute the
gradient and an (approximate) Hessian matrix of the objective function, and then update the
parameters according to them. Thus, the learning step usually requires time and space quadratic or super-linear in the number of examples/features, and cannot be directly applied to the
large-scale problems that appear in NLP. Another problem of existing convex optimization
solvers is that they require a large working space. In large-scale NLP problems, it is even difficult
to store all example information in memory. We call this type of optimization batch learning
because the parameters are updated only after seeing all the training examples.
Recently, the research field of online learning or stochastic convex optimization [Shalev-Shwartz, 2007, Cesa-Bianchi and Logosi, 2006] has emerged. These
methods look at examples one by one and update the parameters immediately. The
simplest algorithm is the perceptron algorithm [Rosenblatt, 1958], which was proposed
half a century ago, but its usefulness was rediscovered recently [Collins, 2002].
Online learning has several advantages over batch learning. First, not all examples need
to be stored in memory, since each example can be read sequentially. Second,
online learning converges to the optimum faster than batch learning. This is because many
training examples in NLP are redundant, and an online learning algorithm updates the
parameters more often. Therefore, the parameters for frequent features are tuned in early
steps, and those for infrequent features are tuned more carefully in later steps. Recent
online learning algorithms can converge to the optimum after looking at the training examples
only once.
1.5.2 Sparse Priors: L1 regularization
In many NLP tasks, although there are many candidate features, most of them are irrelevant to the task, and the effective features are very few. For example, in a document
classification task, the number of words related to the category of a document is often
only two or three, although the number of words appearing in a document is large, say a hundred. To extract these few effective features, we can use a sparse prior on the parameters during
learning. The result is then a set of sparse parameters; here, sparse means that many parameters
are exactly zero or take a default value. Such sparse parameters make inference extremely
efficient and the model very compact.
In particular, I use a sparse prior called L1 regularization at training time. The
optimization with L1 regularization is still convex, but non-smooth, so many specialized optimization methods have been proposed. Recent studies [Tsuruoka et al., 2009,
Duchi and Singer, 2009] show how to apply L1 regularization to online learning. I will
combine a sparse prior with a search algorithm that enables us to consider only effective features without enumerating all possible candidate features.
1.5.3 Stringology and Succinct Data Structure
String algorithm, or stringology, research has advanced recently to deal with
large-scale text. For example, the number of occurrences of any substring in a document
can be computed in constant time by using recent compressed full-text indices, while the
working space is less than that of the original text [Navarro and Mäkinen, 2007]. Since NLP deals
with strings as input and output, it is desirable to use these data structures, especially when
the data is very large.
For example, in this thesis, I present an algorithm for finding the most effective features when all substrings are candidate features. Although the number of possible candidate
features is $O(N^2)$, the proposed algorithm can find them in $O(N)$ time without any approximation, where $N$ is the total length of all the documents. To achieve this, I heavily
use several data structures to search for effective features efficiently.
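As a small illustration of why string indices help (a naive sketch, not the compressed full-text index cited above, and not the enhanced suffix arrays of Chapter 3), the following builds a plain suffix array by sorting suffixes and counts the occurrences of any substring with two binary searches.

```python
from bisect import bisect_left, bisect_right

def suffix_array(text):
    # O(N^2 log N) toy construction; practical indices are far more compact.
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text, sa, pattern):
    # All suffixes starting with `pattern` form a contiguous block in the
    # suffix array; locate that block with binary search.
    suffixes = [text[i:] for i in sa]   # materialized only for clarity
    lo = bisect_left(suffixes, pattern)
    hi = bisect_right(suffixes, pattern + "\uffff")
    return hi - lo

if __name__ == "__main__":
    t = "abracadabra"
    sa = suffix_array(t)
    for p in ["abra", "ra", "cad", "zzz"]:
        print(p, count_occurrences(t, sa, p))
```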
1.6 Overview of This Thesis
This thesis consists of two parts.
The first part, Chapter 3 and Chapter 4, considers problems with a massive number
of features. I especially study the problems of “all substring features” and “combination features”, which are important types of features in NLP. Although these features are
too numerous to handle in a naive way, I present learning frameworks that handle them
efficiently. The second part, Chapter 5 and Chapter 6, considers problems with a massive
number of outputs. In particular, I consider language modeling, and provide two different
language models, which can use more powerful features.
The chapters in this thesis are organized as follows:
Part I. Learning with Massive Number of Features
Chapter 3 focuses on how to deal with “all substring features”. This chapter presents
a learning framework based on L1 regularization, efficient calculation of the gradients of features, and a grafting algorithm. Applications of this framework in NLP, including
document classification and clustering, are also presented.
In learning with all substring features, all substrings appearing in a document are
candidate features. Recent studies show that the string kernel can consider all subsequence information and achieve the highest accuracy in the document classification task.
However, the string kernel method requires time almost quadratic in the document lengths and
a large working space, not only at training but also at inference time.
I show that the statistics of substrings can be summarized into those of maximal substrings, which can be exhaustively determined in linear time by using auxiliary data structures. Then, I can search for the most effective substring in linear time. Moreover, we can obtain a compact model by using L1 regularization at training time, which allows extremely
efficient inference in time and space during the application of the model.
Chapter 4 focuses on learning with combination features. In many NLP tasks, a user
defines feature templates to specify a set of original features. For example, in the part-of-speech tagging task, the current/previous/next word and their prefixes/suffixes would
be useful for selecting the part-of-speech tag, and all of these are defined as original features.
Combinations of these features are often much more important. Obviously, learning with all
possible combination features requires a prohibitively large computational cost. I propose
an algorithm for efficient computation of combination statistics in an online manner with
filtering.
Part II. Learning with Massive Number of Outputs
Chapter 5 focuses on language models (LMs). The most widely used LM is the Probabilistic Language Model (PLM), which assigns a probability of correctness to a sentence.
In particular, N-gram Language Models (NLMs) are popular because they are very simple and can make use of large amounts of training data. However, many studies show
that LMs with rich information can achieve much more accurate results.
To build a more accurate and efficient LM, I present a discriminative language model
with pseudo-negative samples. The problem with treating language modeling as a discriminative task is that no negative examples are available for training. To produce negative samples, we generate pseudo-negative examples from PLMs and use them as negative
examples. We also employ an online margin-based learning algorithm with fast kernel
computation. Finally, we capture latent information by using hidden semi-Markov models, which reduces the computational cost and improves the generalization ability.
Chapter 6 presents a hierarchical exponential model (HEM). In an HEM, we build a
hierarchical binary tree whose leaves correspond to candidate outputs and whose internal nodes
are binary classifiers. The probability of an output is calculated as the product of the probabilities
of the classification results along the path from the root to the corresponding leaf. An HEM
supports an efficient arg max operation, which returns the output with the largest probability in O(log K) time, where K is the number of labels. In experiments on language
modeling, HEM is compared to an N-gram model and achieves higher performance
by a large margin.
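A minimal sketch of this idea, with an arbitrary toy tree and made-up branch probabilities (in an HEM these come from per-node logistic regression models conditioned on the context): the probability of a word is the product of branch probabilities on its root-to-leaf path, and a best-first search over path probabilities is one way to realize an efficient arg max. This is an illustration under those assumptions, not necessarily the exact procedure of Chapter 6.

```python
import heapq

# Toy hierarchical tree over 4 words. Each internal node stores the
# probability of taking its *left* branch (made-up constants here).
#   node = ("leaf", word)  or  ("node", p_left, left_child, right_child)
TREE = ("node", 0.6,
        ("node", 0.5, ("leaf", "cat"), ("leaf", "dog")),
        ("node", 0.9, ("leaf", "the"), ("leaf", "of")))

def word_probability(tree, word, prob=1.0):
    """p(word) = product of branch probabilities on its root-to-leaf path."""
    if tree[0] == "leaf":
        return prob if tree[1] == word else None
    _, p_left, left, right = tree
    return (word_probability(left, word, prob * p_left)
            or word_probability(right, word, prob * (1.0 - p_left)))

def most_probable_word(tree):
    """Best-first search: the path probability never increases going down,
    so the first leaf popped from the heap is the arg max.
    (A purely greedy descent could miss it, e.g. it returns 'cat' here.)"""
    heap = [(-1.0, 0, tree)]          # (-probability, tiebreak id, node)
    counter = 1
    while heap:
        neg_p, _, node = heapq.heappop(heap)
        if node[0] == "leaf":
            return node[1], -neg_p
        _, p_left, left, right = node
        heapq.heappush(heap, (neg_p * p_left, counter, left)); counter += 1
        heapq.heappush(heap, (neg_p * (1.0 - p_left), counter, right)); counter += 1

if __name__ == "__main__":
    print(word_probability(TREE, "of"))   # 0.4 * 0.1 = 0.04
    print(most_probable_word(TREE))       # ('the', 0.36)
```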
Chapter 2
Background
To achieve large-scale NLP, I employ diverse ideas from machine learning, data structures,
string algorithms and optimization techniques. In this chapter, I explain the basics of these
ideas and postpone the details to the relevant chapters.
2.1 Definition
Let $\mathbb{R}$ be the set of real numbers, $\mathbb{R}^+$ the set of positive real numbers, and $\mathbb{R}^m$ the set of $m$-dimensional real vectors. A set $C$ is called convex when, for any $x_1, x_2 \in C$ and $0 \le \theta \le 1$,
$\theta x_1 + (1-\theta) x_2 \in C$. A function $f$ is convex when its domain is a convex set and, for all
$x, y$ in the domain of $f$ and $0 \le \theta \le 1$, $f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y)$.
2.2 General Framework of Machine Learning
Typically, the goal of machine learning (ML) is to create a function $y = f(x; w)$ that
predicts the correct output $y \in Y$ given input $x \in X$, where $w$ is a parameter defining the
behavior of the function $f$, and $X$ and $Y$ are the sets of candidate inputs and outputs respectively.
In this thesis, we assume that the parameter is a real vector $w \in \mathbb{R}^m$.
While ML has been applied to problems in many fields like biology, vision analysis and
economics, I focus on how ML is used in NLP. Many NLP problems are represented as a
classification task in which the outputs are structured discrete objects, like a category in document
classification or the next word in language modeling. We call this discrete value a label.
I first show a general supervised learning framework. In this framework, a parameter $w$ is estimated using training examples $\{(x_i, y_i)\}$ so that the function
$f(x; w)$ correctly classifies the training examples. However, this estimation can overfit on finite data. Therefore regularization is usually applied in the estimation. Formally, we solve
the following optimization problem to estimate the parameter:
$$w^* = \arg\min_{w} \; L(w) + C R(w) \qquad (2.1)$$
$$L(w) = \sum_i l(x_i, y_i, w), \qquad (2.2)$$
where the function $L(w)$ is the empirical risk, which measures how well the function with the
parameter $w$ predicts the labels of the training examples, and $l(x, y, w)$ is a loss function, which
measures the loss suffered on each example. The term $R(w)$ is a regularization term on the parameter that controls over-fitting to the training data, and $C > 0$ is a trade-off parameter:
when $C$ is large, regularization is strengthened, and vice versa. The parameter $C$ is often estimated by cross-validation. We should use an adequate loss function and regularization
for different problems.
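For concreteness, the following is a small sketch of evaluating the objective of Eqs. (2.1)–(2.2) for a linear model; the concrete losses used in this thesis appear in Section 2.3, and the function names and data here are illustrative.

```python
import numpy as np

def objective(w, X, y, loss, C, reg):
    """Regularized empirical risk of Eqs. (2.1)-(2.2):
    L(w) + C * R(w), with L(w) the sum of per-example losses."""
    return sum(loss(x, t, w) for x, t in zip(X, y)) + C * reg(w)

# Illustrative choices: log-loss of a linear binary classifier, L2 regularizer.
log_loss = lambda x, t, w: np.log1p(np.exp(-t * w.dot(x)))
l2_reg = lambda w: np.sum(w ** 2)

if __name__ == "__main__":
    X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    y = np.array([+1, -1, +1])
    w = np.array([0.5, -0.5])
    print(objective(w, X, y, log_loss, C=0.1, reg=l2_reg))
```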
2.3 Linear Classifier
I first explain the simpler case of binary classification, and then extend it to the multi-class case.
In binary classification, the task is to predict a binary label $y \in \{+1, -1\}$ for input $x$,
where the label $+1$ is called positive and the label $-1$ is called negative.
We represent the information of input $x$ by a real vector $\phi(x) \in \mathbb{R}^m$, called a feature
vector. Each dimension of this feature vector captures a characteristic of the input, defined
by a feature function $f_i(x)$: $\phi_i(x) = f_i(x)$. In NLP, a feature function usually corresponds
to some linguistic event, for example $f_i(x) = I(x$ is the word “University”$)$, where $I(a)$
is an indicator function: $I(a) = 1$ if $a$ is true and $I(a) = 0$ otherwise. We enumerate all
possible events, like the occurrence of some word $x$ in the two preceding positions. A feature
template defines all such events and generates all possible feature functions. Therefore,
a feature vector $\phi(x)$ tends to be very sparse, in that many elements are exactly zero. This
characteristic will be used for building efficient systems in this thesis. Note that a bias
for each label can be incorporated by expanding the feature vector as $\phi'(x) := \{\phi(x), 1\}$.
Then, a binary linear classifier predicts the output as follows:
$$f(x; w) = \begin{cases} +1 & s(x; w) \ge 0 \\ -1 & s(x; w) < 0 \end{cases} \qquad (2.3)$$
$$s(x; w) = w^T \phi(x) \qquad (2.4)$$
where $w \in \mathbb{R}^m$ is a weight vector, each element of which corresponds to the weight of a feature
function. Therefore, this classifier uses weighted majority voting to decide the label.
The function $s(x; w)$ is called a score function, and its absolute value $|s(x; w)|$ is called a
margin, which measures the confidence of the classifier.
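A minimal sketch of the classifier of Eqs. (2.3)–(2.4) with sparse feature vectors stored as dictionaries (feature name to value); the feature names and weights are illustrative.

```python
def score(w, phi):
    # s(x; w) = w^T phi(x), iterating only over the non-zero features.
    return sum(w.get(f, 0.0) * v for f, v in phi.items())

def predict(w, phi):
    # f(x; w) = +1 if s(x; w) >= 0, else -1.
    return +1 if score(w, phi) >= 0.0 else -1

if __name__ == "__main__":
    # Sparse feature vector of a document plus a bias feature.
    phi = {"word=University": 1.0, "word=Tokyo": 1.0, "bias": 1.0}
    w = {"word=University": 0.7, "word=spam-offer": -2.0, "bias": -0.1}
    print(score(w, phi), predict(w, phi))  # |s| measures the confidence
```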
For the loss function used in Eq. (2.2), the most straightforward choice in binary classification is the number of misclassified training examples,
$$L_{0/1}(w) = \sum_i l_{0/1}(x_i, y_i, f) \qquad (2.5)$$
$$l_{0/1}(x, y, f) = I(y f(x; w) < 0), \qquad (2.6)$$
where $l_{0/1}(x, y, f)$ returns 1 if the current classifier $f$ misclassifies the example $(x, y)$, and
returns 0 otherwise.
However, the optimization with the $l_{0/1}$ function is not convex, and indeed very hard to
optimize. Another problem is that the resulting scores can always be close to 0, and
thus the prediction will suffer from input noise.
Instead, the following three loss functions have been proposed [Collins et al., 2002]:
$$l_{\mathrm{log}}(x, y, f) = \log(1 + \exp(-y w^T \phi(x))) \quad \text{(log-loss)} \qquad (2.7)$$
$$l_{\mathrm{hinge}}(x, y, f) = \left[1 - y w^T \phi(x)\right]_+ \quad \text{(hinge-loss)} \qquad (2.8)$$
$$l_{\mathrm{exp}}(x, y, f) = \exp(-y w^T \phi(x)) \quad \text{(exp-loss)} \qquad (2.9)$$
where $[a]_+ = \max(a, 0)$. The function $l_{\mathrm{log}}(x, y, f)$ is called the log-loss, $l_{\mathrm{hinge}}(x, y, f)$ the hinge-loss, and $l_{\mathrm{exp}}(x, y, f)$ the exp-loss. These functions are convex and are
upper bounds of the $l_{0/1}$ function. Therefore the learning problem (2.1) is also a convex optimization
problem. Figure 2.1 shows a plot of these functions and the $l_{0/1}$ function. The log-loss is often
used in probabilistic models, the hinge-loss is used for support vector machines, and the
exp-loss is used for boosting methods.
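The three surrogate losses of Eqs. (2.7)–(2.9) as a direct transcription (a sketch; log1p is used for numerical stability):

```python
import math

def log_loss(margin):      # l_log   = log(1 + exp(-y w^T phi(x)))
    return math.log1p(math.exp(-margin))

def hinge_loss(margin):    # l_hinge = [1 - y w^T phi(x)]_+
    return max(0.0, 1.0 - margin)

def exp_loss(margin):      # l_exp   = exp(-y w^T phi(x))
    return math.exp(-margin)

if __name__ == "__main__":
    for m in (-1.0, 0.0, 0.5, 2.0):   # margin = y * w^T phi(x)
        print(m, log_loss(m), hinge_loss(m), exp_loss(m))
```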
Figure 2.1: A plot of several loss functions
In particular, I give another interpretation of the log-loss: it is equivalent to maximum likelihood estimation of a logistic regression model. In a logistic regression
model, the probability of a label $y$ given input $x$ is defined as
$$p(y|x; w) = \frac{\exp(y w^T \phi(x))}{\exp(y w^T \phi(x)) + \exp(-y w^T \phi(x))} \qquad (2.10)$$
$$= \frac{1}{1 + \exp(-2 y w^T \phi(x))}. \qquad (2.11)$$
The parameter $w$ is then estimated by maximum likelihood estimation, in which the
log-likelihood of the examples is maximized,
$$\max_w \sum_i \log p(y_i|x_i; w) = \max_w \; -\sum_i \log\left(1 + \exp(-2 y_i w^T \phi(x_i))\right). \qquad (2.12)$$
Thus, estimation with the log-loss corresponds to maximum likelihood estimation
of a logistic regression model. Note that this model is also identical to the result of
maximum likelihood estimation of maximum entropy models [Jaynes, 1957].
Next, we consider classification when the number of labels is larger than two,
called multi-class classification.
Let us represent the information of input $x$ and a label $y$ by a feature vector $\phi(x, y) \in \mathbb{R}^m$. Each dimension of $\phi(x, y)$ is the result of a function of $x$ and $y$, $\phi(x, y)_i = f_i(x, y)$,
capturing a characteristic of $x$ and $y$. An example of such a function in NLP is $f_i(x, y)$
$= I($“$x$ is the word money and $y$ is the topic Business”$)$.
In many cases, feature functions are defined by a cross-product of input-dependent
features and candidate labels as follows. Let $\phi(x) \in \mathbb{R}^{m'}$ be an input-dependent feature
vector, in which the value of each element is determined by the input only. Then, we build
$\phi(x, y) \in \mathbb{R}^m$ by concatenating $\phi(x) I(y' = y)$ for each $y' \in Y$, where $m = m' \times |Y|$. For
example, given $\phi(x) = (1, 0.5, 2)$ and $y \in \{0, 1, 2\}$, $\phi(x, 1) = (0, 0, 0, 1, 0.5, 2, 0, 0, 0)$.
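The cross-product construction above as a short sketch reproducing the worked example (a dense version for clarity; in practice $\phi(x, y)$ would be kept sparse):

```python
def cross_product_features(phi_x, y, labels):
    """Build phi(x, y) by placing phi(x) in the block of label y
    and zeros elsewhere, so that dim = len(phi_x) * len(labels)."""
    vec = []
    for label in labels:
        vec.extend(phi_x if label == y else [0.0] * len(phi_x))
    return vec

if __name__ == "__main__":
    phi_x = [1.0, 0.5, 2.0]
    print(cross_product_features(phi_x, 1, labels=[0, 1, 2]))
    # -> [0.0, 0.0, 0.0, 1.0, 0.5, 2.0, 0.0, 0.0, 0.0], as in the text
```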
As in the case of binary classification, a score $s(x, y) \in \mathbb{R}$ for input $x$ and a label $y$ is
assigned using a linear function
$$s(x, y; w) = w^T \phi(x, y), \qquad (2.13)$$
where $w$ is a weight vector, each element of which corresponds to the weight of a feature function.
Then, the classifier predicts the label that maximizes the score,
$$y^* = f(x; w) \qquad (2.14)$$
$$= \arg\max_y s(x, y; w) \qquad (2.15)$$
$$= \arg\max_y w^T \phi(x, y). \qquad (2.16)$$
To estimate the parameter, we use loss functions similar to those of binary classification.
First we define the difference of scores between the correct label and other labels:
$$\psi_{i,y} = s(x_i, y_i; w) - s(x_i, y; w). \qquad (2.17)$$
We also define $y_i^* = f(x_i; w)$, and omit the subscript $i$ when there is no confusion.
Then, the loss functions are defined as
$$l_{\mathrm{log}}(x, y, f) = \log \sum_{y'} \exp(-\psi_{i,y'}) \quad \text{(log-loss)} \qquad (2.18)$$
$$l_{\mathrm{hinge}}(x, y, f) = \left[ m(y_i, y^*) - \psi_{i,y^*} \right]_+ \quad \text{(hinge-loss)} \qquad (2.19)$$
$$l_{\mathrm{exp}}(x, y, f) = \exp(-\psi_{i,y^*}) \quad \text{(exp-loss)} \qquad (2.20)$$
where $m(y, y^*) \in \mathbb{R}$ is the misclassification penalty when the correct label is $y$ and the
prediction of the classifier is $y^*$. For $m(y, y^*)$, an indicator function $m(y, y^*) = I(y \ne y^*)$ is often used. The intuition of the hinge loss is that we prefer a parameter such that the
score for the correct label is larger than that for all other labels by a margin $m(y, y^*)$. The
parameter is then estimated by solving the optimization (2.1) with the above loss functions.
Note that since these loss functions are also convex, the optimization is again a convex
optimization problem.
The log-loss function for the multi-class case corresponds to maximum likelihood estimation in multi-class logistic regression models. In a multi-class logistic regression model, the
conditional probability of a label $y$ given input $x$ is defined as follows,
$$p(y|x; w) = \frac{1}{Z(x)} \exp(s(x, y)), \qquad (2.21)$$
where $Z(x) = \sum_{y'} \exp(s(x, y'))$ is a normalization term or partition function, so that
$\sum_{y'} p(y'|x; w) = 1$.
An important class of structured classification arises when a label consists of an undirected graph: a node in the graph corresponds to a variable and an edge between
nodes represents a dependency between variables. Examples of such outputs are
a sequence of part-of-speech tags and a syntactic tree. Conditional Random Fields
(CRFs) [Lafferty et al., 2001] (log-loss, multi-class logistic regression) and Max-Margin
Markov Networks (M$^3$N) [Taskar et al., 2004] (hinge-loss) are examples of this class. In
this case, although the number of possible outputs is exponential in the input size, learning
and inference can be done efficiently by using dynamic programming techniques.
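To illustrate how dynamic programming keeps inference tractable despite the exponential number of outputs, here is a minimal Viterbi arg max for a linear-chain model with made-up node and transition scores (a generic sketch of chain inference, not a full CRF and not tied to any model in this thesis):

```python
def viterbi(node_score, trans_score, tags, n):
    """Find the highest-scoring tag sequence of length n.
    node_score(i, t): score of tag t at position i.
    trans_score(s, t): score of the transition s -> t."""
    best = {t: node_score(0, t) for t in tags}
    back = []
    for i in range(1, n):
        new_best, ptr = {}, {}
        for t in tags:
            s, prev = max((best[p] + trans_score(p, t), p) for p in tags)
            new_best[t] = s + node_score(i, t)
            ptr[t] = prev
        best, back = new_best, back + [ptr]
    # Reconstruct the best path by following the back-pointers.
    tag = max(best, key=best.get)
    path = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        path.append(tag)
    return list(reversed(path)), max(best.values())

if __name__ == "__main__":
    tags = ["DT", "NN", "VB"]
    words = ["the", "dog", "runs"]
    node = lambda i, t: {("the", "DT"): 2.0, ("dog", "NN"): 2.0,
                         ("runs", "VB"): 2.0}.get((words[i], t), 0.0)
    trans = lambda s, t: 1.0 if (s, t) in {("DT", "NN"), ("NN", "VB")} else 0.0
    print(viterbi(node, trans, tags, len(words)))  # (['DT', 'NN', 'VB'], 8.0)
```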
2.4 Regularization
This section introduces regularization terms and explains their characteristics.
In training (Eq. (2.1)), regularization is added to prevent over-fitting. The regularization term is often a norm of the parameters. The first is L2 regularization,
$$R(w) = \|w\|_2^2 = \sum_i w_i^2, \qquad (2.22)$$
and the second is L1 regularization, also called lasso regularization,
$$R(w) = \|w\|_1 = \sum_i |w_i|. \qquad (2.23)$$
They look very similar, but the results of using these regularizations are totally different:
the result of optimization with L1 regularization is often a sparse vector, in which
many of the parameters are exactly zero, while that of L2 regularization is not. In other words, learning
with L1 regularization naturally has a feature selection effect, which results in
efficient and interpretable inference. For example, Gao et al. [2007a] compared L1-regularized logistic regression models with other learning methods including L2-regularized logistic
regression models. Even though the performances of these methods are almost identical,
the number of non-zero weights with L1 regularization is approximately 1/10 of that with L2 regularization.
Figure 2.2 explains this effect. While the partial derivative of L2 regularization goes to 0 quickly as the parameter $w$ goes to 0, that of L1 regularization is constant
even when the parameter $w$ goes to 0. Therefore, with L1 regularization, the parameter is
pushed to exactly 0 if the corresponding feature is irrelevant to the objective function.
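A tiny numerical sketch of this effect: one proximal (soft-thresholding) step for L1 against the corresponding shrinkage step for L2, showing that L1 drives small weights exactly to zero. The step size and regularization strength are illustrative.

```python
def l1_prox(w, eta_c):
    # Soft-thresholding: the proximal step for eta*C*||w||_1;
    # weights within eta_c of zero are set to exactly zero.
    return [0.0 if abs(x) <= eta_c else (x - eta_c if x > 0 else x + eta_c)
            for x in w]

def l2_shrink(w, eta_c):
    # A gradient step on eta*C*||w||_2^2 only rescales; it never reaches zero.
    return [x * (1.0 - 2.0 * eta_c) for x in w]

if __name__ == "__main__":
    w = [0.8, 0.05, -0.03, -1.2]
    print("L1:", l1_prox(w, 0.1))    # small weights become exactly 0.0
    print("L2:", l2_shrink(w, 0.1))  # small weights merely shrink
```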
From the point of view of Maximum A Posteriori (MAP) estimation, L2 regularization
corresponds to the case where the prior of the parameter is a Gaussian distribution with zero mean, $p(w) \propto \exp(-\|w\|_2^2/\sigma)$, and L1 regularization corresponds to the case where
the prior of the parameter is a Laplace distribution, $p(w) \propto \exp(-\|w\|_1/\sigma)$.
Recently, a mixed norm or elastic net has also been proposed [Chen, 2009b, Chen, 2009a],
$$R(w) = \sum_i w_i^2 + C_1 \sum_i |w_i|, \qquad (2.24)$$
where $C_1 > 0$ is the trade-off parameter between L2 regularization and L1 regularization.
This norm has both the strengths and weaknesses of L1 and L2 regularization.
2.5 Batch Learning
Recall that the parameter is estimated by solving the convex optimization problem (Eq.
(2.1)). The objective function is convex when all the loss functions and the regularization term
are convex, as explained above. A convex optimization problem has several good characteristics [Boyd and Vandenberghe, 2004]: (1) the global minimum is always unique (although
the minimizer is not always unique), and (2) gradient-based optimization algorithms converge
to the global minimum. In gradient-based algorithms, we first compute the gradient of the
objective function,
$$v := \frac{\partial \left( L(w) + C R(w) \right)}{\partial w}. \qquad (2.25)$$
Figure 2.2: A plot of L2 regularization and L1 regularization (above). A plot of the partial
derivatives of L2 regularization and L1 regularization (bottom).
Then we update $w := w + \mu v$, where $\mu < 0$ is the update width, determined by
binary search or specialized methods.
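A compact sketch of one such full-batch gradient computation and update for an L2-regularized log-loss objective (a fixed step size replaces the step-width search, purely for illustration; data and names are made up):

```python
import numpy as np

def batch_gradient_descent(X, y, C=0.1, step=0.1, iters=100):
    """Minimize sum_i log(1+exp(-y_i w^T x_i)) + C * ||w||^2
    by full-batch gradient steps (the gradient of Eq. (2.25))."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)
        # Gradient of the summed log-loss plus the L2 regularizer.
        grad = -(X.T @ (y * (1.0 / (1.0 + np.exp(margins))))) + 2.0 * C * w
        w -= step * grad
    return w

if __name__ == "__main__":
    X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    y = np.array([+1, +1, -1, -1])
    w = batch_gradient_descent(X, y)
    print(w, np.sign(X @ w))  # the learned w separates the two classes
```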
In particular, since the training data in NLP are very sparse and have a very large number of
dimensions (e.g. $10^5$), the characteristics of the optimization are different from those appearing
in other fields. Since there is a significant amount of work on the optimization (2.1), I can
hardly enumerate all the prior work, and I provide here a few references to the research I
believe is most related to NLP.
For the L2-regularized optimization problem, L-BFGS [Liu and Nocedal, 1989] and
the exponentiated gradient method are often used. L-BFGS is a quasi-Newton method that
computes an approximation of the Hessian matrix from a subset of gradients. The exponentiated
gradient method [Collins, 2002] uses a dual representation of (2.1) and optimizes the
equivalent convex dual.
For the L1-regularized optimization problem, since the objective function is not differentiable when $w_i = 0$, several specialized optimization methods have been proposed.
Kazama and Tsujii [2005] proposed a method for optimizing the L1-regularized logistic regression model by replacing each weight with a pair of non-negative weights, $w_i = w_i^+ - w_i^-$ with $w_i^+, w_i^- \ge 0$. The regularization term is then represented as $R(w) = \sum_i (w_i^+ + w_i^-)$,
which can be optimized efficiently using general gradient-based optimization methods at
the expense of doubling the number of parameters. Andrew et al. [2007] proposed the
orthant-wise L-BFGS (OWL-QN), where the orthant of the parameters is fixed at the time of
updating; this was recently generalized in [Yu et al., 2008]. Sha et al. [2007] proposed
applying a multiplicative update to L1 optimization with a local quadratic approximation, which can be done using a very simple update formula. Koh et al. [2007] proposed
an interior-point method and an approximation of the entire regularization path.
We call this batch learning because the parameters are updated after looking at all the
training examples.
However, online learning algorithms have recently been found to be much more efficient than these batch learning methods, especially for very large-scale NLP problems.
2.6 Online Learning
Since the number of parameters and the number of terms in the optimization are extremely large
for large-scale NLP, direct optimization requires a prohibitively large cost. To tackle this
problem, online learning or stochastic convex optimization has been proposed, in which we
iteratively optimize the parameters against one example or a small number of examples at a time.
In online learning, a learner takes examples one by one, checks whether the current
classifier classifies each example well, and updates the parameter if the example is misclassified or the margin (confidence) is small.
Online learning has the following characteristics compared with batch learning. First, online
learning looks at each example one by one, so not all examples need to be
stored in memory. Second, online learning updates the parameters more often than
batch learning. Data in NLP are often redundant and the distribution of feature frequencies is heavily skewed; therefore online learning can tune the parameters for frequent
features in early steps and focus on the parameters for rare features in later steps.
I introduce several online learning algorithms. Table 2.1 summarizes these methods for the case of binary classification. The variable $s := y w^T \phi(x)$ denotes the margin (score) of a training example $(x, y)$, and $w \mathrel{+}= v$ means that the parameter $w$ is updated as $w := w + v$.
Perceptron (P) [Rosenblatt, 1958] The perceptron algorithm is the first online learning
algorithm, proposed half a century ago [Rosenblatt, 1958]. First, the parameter vector is
initialized as a zero vector, $w = 0$. At each step, it checks whether the current parameters
correctly classify the training example. If so, it proceeds to the next example without an
update. If not, it updates the weight vector $w$ toward the current example,
$$w := w + y \phi(x), \qquad (2.26)$$
where $x$ and $y \in \{-1, +1\}$ are the input and the binary label respectively.
The algorithm repeatedly runs over the training data. It can be shown that, if the examples
can be separated by a hyperplane (i.e., there is a binary linear classifier that can classify
all examples correctly), the perceptron algorithm finds a parameter that correctly
classifies the entire data set. This is not the case for most NLP data.
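A direct transcription of the binary perceptron (a sketch with dense NumPy vectors and made-up data; the thesis works with sparse feature vectors):

```python
import numpy as np

def perceptron(examples, n_features, epochs=10):
    """Binary perceptron: update w := w + y*phi(x) on every mistake (Eq. (2.26))."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for phi, y in examples:
            if y * w.dot(phi) <= 0:   # misclassified (or zero score)
                w += y * phi
    return w

if __name__ == "__main__":
    data = [(np.array([1.0, 0.0, 1.0]), +1),
            (np.array([0.0, 1.0, 1.0]), -1),
            (np.array([1.0, 1.0, 1.0]), +1)]
    w = perceptron(data, n_features=3)
    print(w, [int(np.sign(w.dot(phi))) for phi, _ in data])
```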
For multi-class classification, a variant of the perceptron algorithm was proposed, called
the structured perceptron [Collins, 2002]. First, the parameter vector is initialized as
a zero vector, $w = 0$. For each training example $(x, y)$, the system takes input $x$ and
predicts a label using the current parameter, $y^* = \arg\max_{y'} s(x, y')$. Then, the parameter
is updated so that the score for the true label $y$ is increased and that for the wrong label is
decreased. This is achieved by a simple calculation,
$$w := w + \psi_{y^*}, \qquad (2.27)$$
where $\psi_{y^*} = \phi(x, y) - \phi(x, y^*)$. Note that this update has no effect when the system
classifies the training example correctly ($y = y^*$).
Averaged Perceptron (AP) [Collins, 2002] The original perceptron algorithm often
leads to poor generalization, especially when the training data are noisy, as in NLP. Collins
et al. [2002, 1999] show that averaging the weights over all steps improves the generalization
ability. In practice, we do not need to keep the weight vectors of all steps; it is enough to
keep two weight vectors $w$ and $w^a$ as follows:
$$w := w + y \phi(x) \qquad (2.28)$$
$$w^a := w^a + t\, y \phi(x) \qquad (2.29)$$
$$t := t + 1, \qquad (2.30)$$
where, at the beginning, the variable $t$ is initialized to 1 and both $w$ and $w^a$ are initialized
to 0. The final weight vector is obtained as $w - w^a/t$, which is identical to the
averaged one.
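The averaging trick of Eqs. (2.28)–(2.30) in code form (a sketch that increments the counter on every example, which is the common convention; only two weight vectors and a counter are kept, and the averaged weights are recovered at the end as w − w^a/t):

```python
import numpy as np

def averaged_perceptron(examples, n_features, epochs=10):
    w = np.zeros(n_features)    # current weights
    wa = np.zeros(n_features)   # t-weighted sum of the updates
    t = 1
    for _ in range(epochs):
        for phi, y in examples:
            if y * w.dot(phi) <= 0:
                w += y * phi          # Eq. (2.28)
                wa += t * y * phi     # Eq. (2.29)
            t += 1                    # Eq. (2.30)
    return w - wa / t                 # averaged weight vector

if __name__ == "__main__":
    data = [(np.array([1.0, 0.0, 1.0]), +1),
            (np.array([0.0, 1.0, 1.0]), -1),
            (np.array([1.0, 1.0, 1.0]), +1)]
    print(averaged_perceptron(data, n_features=3))
```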
Passive Aggressive (PA) [Crammer et al., 2006] The problem with the perceptron is that
it ignores the degree of misclassification; it always updates the parameter with the same
update width whenever it makes an error. Therefore, even after the update, the classifier may
not classify the previous example correctly. The Passive Aggressive (PA) algorithm updates
the parameter so that the updated classifier classifies the example correctly while its
parameter stays close to the previous parameter.
Let $w_t$ be the weight vector at the $t$-th step. The weight vector $w_1$ is initialized to the
zero vector $0$, and in each round the algorithm takes a training example $x_t$ and predicts
its label $y_t'$. After the prediction is made, the result is compared with the true label $y_t$
and the learner suffers a hinge loss, which reflects the degree to which its prediction
was wrong. If the prediction is wrong or the margin is small, the parameter $w$ is updated.
There are three update strategies according to how the noise of the training data is treated, using
a slack variable $\xi \in \mathbb{R}$. The first, PA, ignores the effect of noise:
$$w_{t+1} = \arg\min_w \frac{1}{2}\|w - w_t\|^2 \quad \text{subject to } l_{\mathrm{hinge}}(x_t, y_t, w) = 0. \qquad \text{(PA)} \quad (2.31)$$
The second, PA-I, treats the noise in an L1 manner:

w_{t+1} = arg min_w (1/2)∥w − w_t∥² + Cξ   subject to   ℓ(w; (x_t, y_t)) ≤ ξ and ξ ≥ 0,   (PA-I)   (2.32)
where C ∈ R+ is a parameter that controls the trade-off between the slack term and the
distance. If C is large, a more aggressive update step is taken. The third, PA-II, treats the
noise in an L2 manner:

w_{t+1} = arg min_w (1/2)∥w − w_t∥² + (C/2)ξ²   subject to   ℓ(w; (x_t, y_t)) ≤ ξ.   (PA-II)   (2.33)

The parameter C ∈ R+ plays the same role as in PA-I.
These problems can be solved in closed form,

w_{t+1} = w_t + τ_t y_t ϕ(x_t),   (2.34)

where the update width τ_t is

τ_t = l_t / ∥ϕ(x_t)∥²                         (PA)
τ_t = min{ C, l_t / ∥ϕ(x_t)∥² }               (PA-I)
τ_t = l_t / ( ∥ϕ(x_t)∥² + 1/(2C) )            (PA-II)   (2.35)

and l_t is the hinge loss at step t.
Interestingly, the update formula of the passive aggressive algorithms and that of the perceptron algorithm differ only in the update width; the update width of passive aggressive
is the loss normalized by the squared norm of the feature vector.
The multi-class PAs are obtained by replacing y_tϕ(x_t) with ψ_{y*} and ∥ϕ(x_t)∥² with ∥ψ_{y*}∥²
in (2.34) and (2.35), respectively.
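For example, a single PA-I update following (2.34)-(2.35) can be sketched as below (an illustrative sketch only; dropping the min over C gives PA, and replacing the denominator by ∥ϕ(x)∥² + 1/(2C) gives PA-II):

def pa1_update(w, phi_x, y, C=1.0):
    score = sum(wi * xi for wi, xi in zip(w, phi_x))
    loss = max(0.0, 1.0 - y * score)                        # hinge loss l_t
    if loss == 0.0:
        return w                                            # passive step: margin already large enough
    tau = min(C, loss / sum(xi * xi for xi in phi_x))       # aggressive step width (PA-I)
    return [wi + tau * y * xi for wi, xi in zip(w, phi_x)]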
Confidence Weighted (CW) [Dredze et al., 2008, Crammer et al., 2008, Crammer et al., 2009]
A confidence-weighted (CW) learning algorithm maintains a notion of confidence in the weights of a linear classifier. To represent this confidence, it
uses a Gaussian distribution with mean µ ∈ Rm and covariance matrix Σ ∈ Rm×m.
Intuitively, the value Σ_{i,i} indicates the confidence in the i-th weight; the larger Σ_{i,i}, the
less confidence we have, because the variance is large. Therefore, the i-th weight is updated
more aggressively when Σ_{i,i} is large, and less aggressively when it is small.
The CW update rule is obtained by solving the following constrained optimization:

(µ_{t+1}, Σ_{t+1}) = arg min_{µ,Σ} D_KL( N(µ, Σ) || N(µ_t, Σ_t) )   (2.36)
subject to   P( y_t wᵀϕ(x_t) ≥ 0 ) ≥ η,   (2.37)

where η ∈ [0.5, 1] is a hyper-parameter, N(µ, Σ) is the Gaussian distribution with
mean µ and covariance matrix Σ, w ∼ N(µ, Σ), and D_KL is the KL divergence.
Because this problem is not convex in Σ, the first paper [Dredze et al., 2008] replaces the standard deviation with
the variance in (2.36), while the second paper [Crammer et al., 2008] solves it by representing Σ as the square of a positive semi-definite matrix. Then (2.36) can be solved in closed form as follows,
v_t = ϕ(x_t)ᵀ Σ_t ϕ(x_t),   (2.38)
m_t = y_t µ_tᵀ ϕ(x_t),   (2.39)
α_t = max{ 0, ( −m_t ψ + √( m_t² ρ⁴/4 + v_t ρ² ξ ) ) / (v_t ξ) },   (2.40)
u_t = (1/4) ( −α_t v_t ρ + √( α_t² v_t² ρ² + 4 v_t ) )²,   (2.41)
β_t = α_t ρ / ( √u_t + v_t α_t ρ ),   (2.42)
µ_{t+1} = µ_t + α_t y_t Σ_t ϕ(x_t),   (2.43)
Σ_{t+1} = Σ_t − β_t Σ_t ϕ(x_t) ϕ(x_t)ᵀ Σ_t,   (2.44)

where ρ = Φ⁻¹(η), ψ = 1 + ρ²/2, and ξ = 1 + ρ², and Φ is the cumulative
distribution function of the standard Gaussian.
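A rough sketch of one CW update with a diagonal covariance, following the closed form (2.38)-(2.44), is shown below (an illustration under the diagonal-Σ simplification, not the thesis implementation; eta is the confidence hyper-parameter, and the feature vector is assumed to be non-zero):

import math
from statistics import NormalDist

def cw_update(mu, sigma, x, y, eta=0.9):
    """mu, sigma: lists of per-weight means/variances; x: feature values; y in {-1, +1}."""
    rho = NormalDist().inv_cdf(eta)                     # rho = Phi^{-1}(eta)
    psi = 1.0 + rho * rho / 2.0
    xi = 1.0 + rho * rho
    v = sum(s * xv * xv for s, xv in zip(sigma, x))     # v_t = phi(x)^T Sigma phi(x)   (2.38)
    m = y * sum(mi * xv for mi, xv in zip(mu, x))       # m_t = y mu^T phi(x)           (2.39)
    alpha = max(0.0, (-m * psi + math.sqrt(m * m * rho ** 4 / 4.0
                                           + v * rho * rho * xi)) / (v * xi))           # (2.40)
    u = 0.25 * (-alpha * v * rho
                + math.sqrt(alpha ** 2 * v ** 2 * rho ** 2 + 4.0 * v)) ** 2             # (2.41)
    beta = alpha * rho / (math.sqrt(u) + v * alpha * rho)                               # (2.42)
    new_mu = [mi + alpha * y * s * xv for mi, s, xv in zip(mu, sigma, x)]               # (2.43)
    new_sigma = [s - beta * (s * xv) ** 2 for s, xv in zip(sigma, x)]                   # (2.44), diagonal only
    return new_mu, new_sigma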
2.6.1 Experiment
To examine these online learning algorithms, I implemented them and conducted an experiment on a simple document classification task¹. I used the news20.binary² data set from libsvm's binary data sets. The number of classes is 2, the number of examples is 19,996, and the number of features is 1,355,191.

¹ oll: http://code.google.com/p/oll/wiki/OllMainJa
² http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#news20.binary
Table 2.1: A comparison of online learning methods.

Method                       | Update condition | Update rule                                    | Prediction
P [Rosenblatt, 1958]         | s < 0            | w += yϕ(x)                                     | wᵀϕ(x)
AP [Collins, 2002]           | s < 0            | w += yϕ(x),  wa += t·yϕ(x)                     | (w − wa/t)ᵀϕ(x)
PA [Crammer et al., 2006]    | s < 1            | w += y·(1−s)/∥ϕ(x)∥²·ϕ(x)                      | wᵀϕ(x)
PA-I [Crammer et al., 2006]  | s < 1            | w += y·min(C, (1−s)/∥ϕ(x)∥²)·ϕ(x)              | wᵀϕ(x)
PA-II [Crammer et al., 2006] | s < 1            | w += y·(1−s)/(∥ϕ(x)∥² + 1/(2C))·ϕ(x)           | wᵀϕ(x)
CW [Dredze et al., 2008]     | γ > 0            | w += yγΣϕ(x),  Σ⁻¹ += 2γC·diag(ϕ(x))           | wᵀϕ(x)
Table 2.2: Performance of online learning methods (I = 10).

Method        | Training time (sec.) | Accuracy (%)
P             | 0.54                 | 94.7
AP            | 0.56                 | 95.3
PA            | 0.58                 | 96.5
PA-I          | 0.59                 | 96.5
PA-II         | 0.60                 | 96.5
CW            | 1.39                 | 96.5
SVM (linear)  | 1122.60              | 96.2
I shuffled the data and divided it into training data of 15,000 examples and test data of 4,996 examples. I compared the online methods with a batch learning method, SVM³. I did not tune the hyper-parameters of PA-I, PA-II, CW, or SVM (C = 1.0 for all methods).
Table 2.2 shows the results when the number of training iterations is 10. All methods achieved similar accuracies. The training times are also very short compared to the batch learning method (SVM).
Table 2.3 shows the results when the number of training iterations is 1. This is the special case in which the training examples need not be stored. The results show that all methods achieved performance similar to the previous results.
Note that in all the above online learning algorithms except CW, the final weight vector
can be represented as a linear combination of training examples, as in SVMs. Therefore
the kernel trick can be applied as
wᵀϕ(x) = Σ_i τ_i ϕ(x_i)ᵀϕ(x) = Σ_i τ_i K(x_i, x),   (2.45)

³ http://chasen.org/~taku/software/TinySVM/
Table 2.3: Performance of online learning methods (I = 1).

Method  | Training time (sec.) | Accuracy (%)
P       | 0.05                 | 93.4
AP      | 0.09                 | 94.0
PA      | 0.07                 | 96.2
PA-I    | 0.08                 | 96.1
PA-II   | 0.08                 | 96.0
CW      | 0.21                 | 96.4
where K is the kernel function. Using this formulation, the inner product can be replaced
with a general Mercer kernel K(x_i, x), such as a polynomial kernel or a Gaussian kernel.
A theoretical analysis of online learning can be found in [Shalev-Shwartz, 2007].
2.7 Kernel Trick
Since real-world problems generally do not have a linear structure, linear classifiers are
sometimes insufficient. To overcome this problem, one can use the kernel trick, in which all
feature vectors are mapped into a high-dimensional feature space by a non-linear mapping
Φ so that they can be separated by a hyperplane. The problem is that the computational cost of Φ
and of an optimization problem in a high-dimensional space is very high. However, when we
solve the problem in the dual representation, we do not need to compute Φ(x) explicitly,
because we only need inner products in the mapped feature space. We call the function
K(x₁, x₂) = Φ(x₁) · Φ(x₂) a kernel function. By selecting Φ carefully, we can compute
K(x₁, x₂) at small computational cost.
For example, consider a two-dimensional input space together with the feature map:

ϕ : x = (x₁, x₂) ↦ ϕ(x) = (x₁², x₂², √2·x₁x₂).   (2.46)
The inner product in the feature space can be evaluated as follows:

⟨ϕ(x) · ϕ(z)⟩ = ⟨(x₁², x₂², √2·x₁x₂) · (z₁², z₂², √2·z₁z₂)⟩   (2.47)
             = x₁²z₁² + x₂²z₂² + 2x₁x₂z₁z₂   (2.48)
             = (x₁z₁ + x₂z₂)²   (2.49)
             = ⟨x · z⟩².   (2.50)
Thus K(x, z) = Φ(x) · Φ(z) = ⟨x · z⟩². Many kernel functions have been proposed; examples include
K_poly(x₁, x₂) = (x₁ · x₂ + 1)^d,   (2.51)
K_rbf(x₁, x₂) = exp(−a∥x₁ − x₂∥²),   (2.52)
K_sigmoid(x₁, x₂) = tanh(s(x₁ · x₂) + c).   (2.53)
A kernel function can be defined not only on vector data but also on strings or graph-structured data. More details on kernel functions can be found in [Taylor and Cristianini, 2004].
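The identity (2.46)-(2.50) is easy to verify numerically; the following small check (an illustrative example only) compares the explicit feature map with the squared inner product:

import math

def phi_map(v):                           # the feature map of (2.46)
    x1, x2 = v
    return (x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2)

x, z = (3.0, -1.0), (2.0, 5.0)
lhs = sum(a * b for a, b in zip(phi_map(x), phi_map(z)))   # <phi(x), phi(z)>
rhs = (x[0] * z[0] + x[1] * z[1]) ** 2                     # <x, z>^2
print(lhs, rhs)                                            # both equal 1.0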
2.8 Storing Sparse Vector
In NLP, most data is represented by sparse binary vectors, because a linguistic event has
many candidates (words, strings, prefixes) and only a few of them are observed. For example,
a document is often represented by a feature vector each value of which corresponds to the
occurrence of a word in the document, and for a part-of-speech (POS) tagging task, a feature
vector for predicting the POS consists of the occurrences of the current/previous/next
words and their prefixes/suffixes. We process these sparse binary vectors during learning and
inference. Since the data are very large, we need to store these vectors carefully so that
all processing can be done in memory.
Let us see how to store a binary vector of length n with m 1's, where m ≪ n (sparse).
The lower bound for storing such a vector is lg(n choose m) bits⁴, which can be approximated by
m(lg e + lg(n/m)) = 1.44m + m·lg(n/m) bits. A straightforward method that stores the binary vector explicitly requires n bits, which is much larger than this lower bound. Let
(x₁, x₂, . . . , x_m) be the positions of the ones. To represent the vector, we can instead store
only these positions, using lg n bits each and m·lg n bits in total. This is about
m·lg m bits more than the lower bound; therefore, it is very close to optimal when m is small.
For the case when m is not very small but m ≪ n, a variable byte code, or VarByte,
is effective. VarByte is very simple and supports fast encoding and decoding. In VarByte,
we use a difference representation of the position indexes, defined as (d₁, d₂, . . . , d_m) with
d₁ = x₁ and d_i = x_i − x_{i−1} − 1 for i > 1. Decoding from the difference representation
back to the original positions is trivial. Each d_i is then stored in a variable number of bytes. The first
bit of each byte denotes whether the code finishes at this byte or not.

⁴ lg x denotes ⌈log₂ x⌉
Algorithm 1 VarByte Encode
Input: integer d
  while d ≥ 128 do
    put(d & 127)        // output the 7 lower bits (continuation byte)
    d := d >> 7         // 7-bit right shift
  end while
  put(d + 128)          // output the remaining bits with the top bit (128) set
Algorithm 2 VarByte Decode
  d := 0, count := 0
  loop
    c := get()
    if c ≥ 128 then
      d := d + ((c − 128) << count)
      break
    end if
    d := d + (c << count)
    count := count + 7
  end loop
  Return d
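A runnable Python version of Algorithms 1 and 2 is sketched below (an illustration, not the thesis code); the last lines show how a sparse vector can be stored as VarByte-encoded gaps between the positions of its ones:

def varbyte_encode(d):
    out = bytearray()
    while d >= 128:
        out.append(d & 127)          # 7 lower bits, continuation byte (top bit clear)
        d >>= 7
    out.append(d + 128)              # final byte with the top bit set
    return bytes(out)

def varbyte_decode(buf):
    d, shift = 0, 0
    for c in buf:
        if c >= 128:
            return d + ((c - 128) << shift)
        d += c << shift
        shift += 7

positions = [3, 10, 1000, 1000000]
gaps = [positions[0]] + [b - a - 1 for a, b in zip(positions, positions[1:])]
encoded = b"".join(varbyte_encode(g) for g in gaps)
print(len(encoded), [varbyte_decode(varbyte_encode(g)) for g in gaps] == gaps)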
Algorithm 1 shows how an integer d is encoded in VarByte, and Algorithm 2 shows how it is decoded. Next, I analyze the working space of VarByte. Since each integer d_i requires at most 8 + lg d_i bits⁵, the total size of VarByte is

Σ_{i=1}^{m} (8 + lg d_i) ≤ 8m + m·lg(n/m).   (2.54)

The inequality holds because Σ_i lg d_i is maximized when d_i = n/m for all i. Therefore,
VarByte requires about 6.5m bits more than the optimal.

⁵ We assume that 1 byte = 8 bits.
Since there are many other integer set representations, I give only pointers to them:
RICE CODING, SIMPLE9 [Anh and Moffat, 2005], SIMPLE16 [Yan et al., 2009b],
and NEW-PFOR [Yan et al., 2009b]. A comparison of these data structures can be found
in [Yan et al., 2009b], which shows that VarByte is not significantly different
from these data structures in terms of working space and computational cost.
Part I

Learning with Massive Number of Features
Chapter 3
Learning with All Substring Features
This chapter presents an algorithm for processing a document set with all substrings as features. The applications I consider are document classification and document clustering.
Although tokenized words are often not enough for determining the class of a document, processing with all substrings has a prohibitive computational cost, because the number of
candidate substrings can be very large.
I first present a general learning framework to deal with all substring features. In this
framework all substrings are considered as distinct features and are allowed to have different weights. Although the number of substrings and features is prohibitively large
to enumerate and optimize, we can find the optimal classifier in time linear in the total
length of the documents by factorizing the substring information into equivalence classes.
I also use L1 regularization to obtain a compact classification model. Since the number of
non-zero weights is very small, the model is easy to interpret, and the inference is extremely
efficient in time and working space. Moreover, the model is robust even when all substrings
are considered. By combining this with the grafting algorithm [Perkins et al., 2003], the weight
vector can be optimized in time proportional to the number of non-zero weights. This is
achieved by traversing the internal nodes of a suffix tree, which requires time and space linear
in the total length of the training document set. I will show that we can find the optimum by checking only the
features corresponding to maximal substrings. The number of maximal substrings is not quadratic but linear in the document length, so we can
train the weight vector efficiently. I apply this framework to a document classification
task [Okanohara and Tsujii, 2009b].
Figure 3.1: An example of bag-of-word representation (BOW).
I compared the method with other approaches: document classification with a string kernel, and a logistic regression model with variable-length
N-grams. The results show that the proposed method achieved the highest performance on
almost all tasks, and that its model is very compact.
Next, I extend this work to a document clustering task. For this task, I propose a novel
clustering algorithm called logistic regression clustering (LRC). This model solves an optimization problem similar to that of document classification with logistic regression models,
so we can apply the same techniques as in document classification. The method can
assign a conditional probability of a cluster given a document. Moreover, characteristic substrings are extracted for each cluster as an effect of the L1 regularization, and these
substrings can be used as labels of the clusters. Experimental results show that the proposed
algorithm achieved comparable or better results than those of other methods, and its
result is very compact and easy to interpret.
3.1 All Substring Features
Generally, a document d is represented as a feature vector f(d) ∈ Rm in which each dimension corresponds to the occurrence of a word in the document. Since this representation
ignores the order and the positions of the words, it is called a bag of words
(BOW) (Figure 3.1).
Although a BOW representation loses much of the document information, it often
achieves high performance, because the occurrence of a few keywords can often determine
the characteristics of the document.
Figure 3.2: An example of all substrings representation (ALLSTR).
However, the BOW representation still suffers from the following three problems.
The first is errors in the conversion from a document to a set of words. For example,
several languages, such as Japanese and Chinese, do not represent word boundary information explicitly. The word identification task itself is not easy, and its result includes
many errors. What is worse, the keywords for document classification are often unknown words such as person names (e.g. Shaquille O'Neal), and the BOW representation
loses this information due to errors in the analysis.
The second is that, for some data, it is difficult to define what a word is, as in log
data and bioinformatics data.
The third is the most important problem. Word units are often inappropriate for document classification/clustering, whereas word N-grams are effective. For example, the
occurrence of a movie title is effective for determining that the label of a document is
movie. However, many movie titles consist of several common words, which are lost in
a BOW representation. The spam mail detection task is another example: signature and
template information is important, but it is not word information.
To solve these problems, I propose to use a representation in which all substrings are features. This
can be considered a bag of N-grams with N = 1 . . . ∞. Although the number of
features (substrings) is quadratic in the document length, we can find the optimal
solution in time linear in the length of the document by summarizing equivalent substrings.
Formally, we represent a document as a bag of all substrings, where each substring
corresponds to a dimension of the feature vector. We call this representation
ALLSTR. Figure 3.2 shows an example of ALLSTR.
The cost of learning a model with the ALLSTR representation is prohibitively large. However, we show that the effective substrings can be found exhaustively by enumerating all maximal substrings. Note that this is not an approximation but an exact solution.
To achieve this, we summarize substrings into classes so that the substrings in
the same class have the same statistics; each class is represented by its maximal substring. The same
idea was proposed in [Yamamoto and Church, 2001], which calculates term frequencies
and document frequencies for all substrings. I extend and simplify this concept to find
effective substrings efficiently. The differences will be discussed.
Many previous studies have used all-substring information for document classification. Among them, string kernels [Vishwanathan and Smola, 2004] are the most popular;
they define a kernel between two documents d₁ and d₂ as

K(d₁, d₂) = Σ_{s∈Σ*} r_s · s(d₁) · s(d₂),   (3.1)

where Σ is the alphabet, Σ* is the set of all strings over Σ, r_s is a weight
parameter for s (which is not determined by learning), and s(d) is the frequency of the substring
s in a document d.
By incorporating this string kernel into SVM learning, we can classify a document
according to all the substring information in the document. Teo [2006] showed that by using
suffix arrays and auxiliary data structures, a kernel value can be calculated in O(|d₁| + |d₂|)
time, and inference for a test document d can be done in O(|d|) time, where |d| is the
length of the document.
However, string kernels require a large amount of working space, not only at training
time but also at inference time. For example, they require almost 20N bytes, where N
is the total length of the documents. Therefore, such a method cannot be applied to a large
document set.
Moreover, kernel methods cannot control each feature weight independently; they can
only control a weight for each training example. In general, very few features contribute
to the label decision, and a string kernel cannot capture these features efficiently.
Also, string kernels tend to be affected by noise, so we may need to cut off long
substrings. Therefore, it is very difficult to consider all substrings in a string kernel.
Very recently, Ifrim et al. [2008] proposed a logistic regression model with variable-length N-gram features (structured logistic regression: SLR). In their model, a different
weight can be assigned to each feature. They showed that N-gram information is important for document classification and is more accurate than the BOW representation. However,
because effective N-grams are searched for greedily, important N-gram phrases can be lost.
Another problem is that Ifrim's method suffers from over-fitting due to the lack of regularization, and it is difficult to decide when the search process should stop during training.
3.2 Data Structure for Strings
To handle a large number of substrings, I make heavy use of data structures for strings. I introduce
several data structures that will be used in our algorithm.
Let Σ be a finite ordered alphabet, and σ = |Σ|. Let T[1 . . . n] be an input text of
length n drawn from Σ⁺. For technical convenience, we assume that T is
terminated by $ (T[n] = $), a character from Σ that is lexicographically smaller
than all other characters and appears nowhere else in T. The suffixes of T are then defined
as S_i = T[i, n] for i = 1, . . . , n.
First, I introduce the suffix tree. Although a suffix tree is not used directly in our algorithm, I
explain it here to clarify the idea behind the algorithm. A suffix tree is a compact
trie that contains all suffixes of T and can be stored in O(n log n) bits. Suffix trees
support various complex string operations [Gusfield, 1997]. A suffix tree has n leaves, each of which
corresponds to a suffix of T. Internal nodes with only one child are removed, so the
length of an edge label can be larger than 1. Each edge is labeled by a string, called the edge label.
The concatenation of labels on the path from the root to a node is called the path label of
the node. The path label of each leaf coincides with a suffix. For each internal node, its
children are sorted in the alphabetical order of the first characters of their edge labels (Fig. 3.4).
An interesting property of a suffix tree is that although the number of distinct substrings
appearing in T is O(n²), the number of internal nodes is at most n − 1. To see this,
consider building the suffix tree by inserting the suffixes from the shortest one;
at each insertion, at most one internal node is created, except for the first insertion.
Figure 3.3: An example of data structures for a text T = abracadabra$.
Similarly, we can prove that the number of edges between internal nodes is also at most
n − 2.
Suffix trees are useful for many string problems [Gusfield, 1997]. However, suffix
trees require a very large working space; for example, the most space-efficient implementations
require about 10n ∼ 20n bytes.
Therefore, I instead use a space-efficient data structure, a variant of the enhanced suffix array [Abouelhoda et al., 2004].
I use the following data structures.
• SA: Suffix array
• H: Height array
• B: Burrows-Wheeler transformed text
and auxiliary data structures to support operations on them efficiently. I
explain these in turn. An example of these data structures for T = abracadabra$ is
shown in Figure 3.3.
Suffix array  A suffix array of T, SA[1, n], is an integer array of
length n such that S_{SA[i]} < S_{SA[i+1]} for all i = 1, . . . , n − 1, where < between strings
denotes lexicographical order. SA requires n·lg n bits of space.
Height array  A height array H[1, n − 1] for T is defined by H[i] = lcp(S_{SA[i]}, S_{SA[i+1]}),
where lcp(S, U) is the length of the longest common prefix of S and U.
That is, H contains the lengths of the longest common prefixes of suffixes of T that are
consecutive in lexicographic order.
Burrows-Wheeler transformed text  The Burrows-Wheeler Transform (BWT) of a text
T, B[1, n], is defined as

B[i] = T[SA[i] − 1]  if SA[i] > 1,  and  B[i] = T[n]  if SA[i] = 1.   (3.2)

I support the following operation on B: given B[1, n] of length n drawn from an alphabet
Σ and a query consisting of a pair of positions (l, r), check whether B[l, r]
contains more than one character type. I present a data structure that solves this
problem in O(1) time using n + o(n) bits of working space. Here, we assume a RAM
model in which all operations on log n-bit words can be done in constant time.
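For concreteness, the following small sketch builds SA, H, and B directly from their definitions in quadratic time with 0-based indexing (an illustration only; it is far from the space-efficient construction used in the thesis):

def build_sa_h_b(T):
    n = len(T)
    SA = sorted(range(n), key=lambda i: T[i:])                       # suffix array
    def lcp(a, b):
        k = 0
        while a + k < n and b + k < n and T[a + k] == T[b + k]:
            k += 1
        return k
    H = [lcp(SA[i], SA[i + 1]) for i in range(n - 1)]                # height array
    B = "".join(T[SA[i] - 1] if SA[i] > 0 else T[n - 1] for i in range(n))   # BWT
    return SA, H, B

print(build_sa_h_b("abracadabra$"))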
Let R[1, . . . , n − 1] be a bit vector storing the run information of B, that is, R[i] = 0
if B[i] = B[i + 1] and R[i] = 1 otherwise. Then B[l, r] consists of only one character
type if and only if R[l, r − 1] contains only 0's. Using rank dictionaries, which I explain
below, this check can be done in constant time with n + o(n) bits as follows.
A rank dictionary supports the operation rank(R, c, p), which returns the number of occurrences of c ∈
{0, 1} in R[1, . . . , p]. The problem is then solved by checking whether

rank(R, 1, r − 1) − rank(R, 1, l − 1) > 0.   (3.3)
To support this, we conceptually divide the array R into large blocks of l = lg² n
bits, and divide each large block into small blocks of s = (lg n)/2 bits¹. We keep the
values rank(R, 1, i·l) in an array L[n/l], and the number of 1's from the beginning of
each large block up to each small block boundary in an array S[n/s]. We also precompute the rank results for all
bit patterns of (lg n)/2 bits in a table using 2^{(lg n)/2} = O(√n) bits of space. Then rank(R, 1, i) =
L[⌊i/l⌋] + S[⌊i/s⌋] + popcount(R, s·⌊i/s⌋ + 1, i), where popcount(R, i, j) returns the number
of 1's in R[i, j] in constant time by table lookup.

¹ lg x = ⌈log₂ x⌉
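The two-level scheme can be sketched as follows (0-based indexing and small illustrative block sizes instead of lg² n and (lg n)/2; the trailing bits are counted directly, standing in for the precomputed table):

class RankDict:
    def __init__(self, R, small=8):
        self.R, self.small, self.large = R, small, small * small
        self.L, self.S = [], []            # counts at large-block / small-block boundaries
        total = in_block = 0
        for i, bit in enumerate(R):
            if i % self.large == 0:
                self.L.append(total)
                in_block = 0
            if i % self.small == 0:
                self.S.append(in_block)
            total += bit
            in_block += bit

    def rank1(self, i):
        """Number of 1's in R[0:i]."""
        if i <= 0:
            return 0
        lb, sb = i // self.large, i // self.small
        return self.L[lb] + self.S[sb] + sum(self.R[sb * self.small:i])

bits = [1, 0, 0, 1, 1, 0, 1, 0] * 20
rd = RankDict(bits)
print(rd.rank1(50), sum(bits[:50]))        # the two counts agree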
3.3 Grafting
In this section, we consider maximum likelihood estimation of a multi-class logistic regression model with L1 regularization.
To maximize the effect of L1 regularization, the grafting algorithm [Perkins and Theiler, 2003] can be used. In grafting, we begin with an
empty feature set and incrementally add effective features to the current problem. Note
that although this is similar to a boosting algorithm, the obtained result is
always optimal. The grafting algorithm is summarized in Algorithm 3. We assume that
the objective function is the log-likelihood of a multi-class logistic regression model.
I now explain the grafting algorithm formally. The algorithm maintains two variables: w
stores the current weight vector, and H stores the set of features with non-zero weight.
Initially, we set w = 0 and H = {}. At each step, the feature with the
largest absolute value of the gradient of the likelihood is selected. Let v_k = ∂L(w)/∂w_k be the gradient
of the objective with respect to a feature k, i.e., the gradient of the sum of loss functions (or of the likelihood in the case of maximum likelihood estimation). Following the definition, v_k can be calculated
as

v_k = ∂L(w)/∂w_k   (3.4)
    = Σ_{i,y} α_{i,y} ϕ_k(x_i, y),   (3.5)

where α_{i,y} = I(y_i = y) − p(y|x_i; w) and I(a) is 1 if a is true and 0 otherwise. Then we
add k* = arg max_k |v_k| to H and optimize the objective function with respect to H only.
The obtained solution w is used in the next search. The iteration continues until
|v_{k*}| < C.
We briefly explain why this algorithm finds the optimal weights. Suppose
that we optimize the objective function with all features, initializing the weights with
the result obtained by the grafting algorithm. Since all gradients of the likelihood satisfy
|v_k| ≤ C, and the regularization term pushes each weight toward 0 with strength C, no change of
the weight vector can decrease the L1-regularized objective value. Since this is a convex
optimization problem, a local optimum is always the global optimum, and therefore the
obtained solution is the global optimum.
Algorithm 3 Grafting
Input: training data (x_i, y_i) (i = 1, · · · , n) and parameter C
  H = {}, w = 0
  loop
    v = ∂L(w)/∂w   (L(w) is the log-likelihood term)
    k* = arg max_k |v_k|
    if |v_{k*}| < C then break
    H = H ∪ {k*}
    Optimize w with regard to H
  end loop
  Output w and H
The point is that, given an efficient method to find k* without enumerating all features,
we can solve the optimization in time proportional to the number of active features,
regardless of the number of candidate features.
3.4 Statistics Computation with Maximal Substring
This section presents an efficient algorithm for computing the statistics of all substring
features. The key idea is the use of equivalence classes of substrings.
3.4.1 Equivalent Class
First, let us explain the idea of equivalence classes of substrings using suffix trees; this
is an extension of the equivalence classes in [Yamamoto and Church, 2001].
Let T [1, n] be a text to be processed and q be a substring. Denote by P (T, q) the list of
all occurrence positions of q in T . We will omit T if there is no confusion. For example,
P (T, “a”) = {1, 4, 6, 8, 11} for T = “abracadabra”.
Recall that the list of all occurrence positions of q can be obtained by traversing the suffix
tree of T from the root along the edge characters. Note that a suffix tree stores
all suffixes of T, and any substring occurring in T corresponds to some position in the suffix tree.
Let t(q) be the position in the suffix tree corresponding to q; note that
t(q) may be at an internal node or on an edge. Then the descendant leaves of t(q)
give the occurrence positions of q. For example, in Figure 3.4, for q = "ab", t(q) lies on
the edge between internal node 1 and internal node 0, and therefore P(q) = {8, 1}.
Similarly, "abr" and "abra" are again on the edge between internal node 1 and internal
node 0, and they also occur at {8, 1}. From this observation, it is easy to show
that when two substrings q₁ and q₂ lie on the same edge of the suffix tree, their occurrence
positions are the same.
Definition 1  Two substrings q and r are said to be in the same left-equivalent class,
denoted q =_P r, if and only if P(q) = P(r). This is equivalent to t(q) and t(r) being
on the same edge of the suffix tree.
The number of edges between internal nodes is at most n − 2, and the number of edges between an internal
node and a leaf is at most n − 1 (ignoring the leaf corresponding to the special character), where n is the length of the input text. Hence, the number of
different classes is at most n − 2 + n − 1 = 2n − 3. Since every substring appearing in T
at least once is mapped to some position in the suffix tree of T, all substrings can be
factorized into 2n − 3 = O(n) left-equivalent classes.
We can summarize the substring information further by considering left expansion,
which was not discussed in [Yamamoto and Church, 2001]. In the example
of Figure 3.4, the occurrence positions of "bra" are {9, 2}, which seem to be different
from those of "abra", {8, 1}. However, these positions are the same up to a
constant shift of 1 (9 = 8 + 1, 2 = 1 + 1).
Figure 3.4 shows some examples for T = abracadabra$. The suffix tree of T is
shown on the left of the figure, and all substrings appearing in T are shown on the right. Substrings
of the same color belong to the same class. We call the longest substring in each
class (the topmost one of the class in Figure 3.4) the maximal substring. In this example,
the maximal substrings are "a", "abra", and the suffixes of T. The substrings
"ab", "abr", "b", "br", "bra", "ra", and "abra" appear at the same positions up to a constant
shift, and all of them are substrings of "abra". The only other class that appears more
than once is that of "a". All other substrings appear once, and the maximal
substrings of their classes are suffixes, such as "abracadabra$" and "bracadabra".
We now define the equivalence class formally.
Figure 3.4: The substrings and their classes for the text T = abracadabra$.
Definition 2  Given two substrings q and r, q <_P r if and only if (1) q is a substring of
r, (2) |P(q)| = |P(r)|, and (3) there exists c ≥ 0 such that P(r)[i] + c = P(q)[i] for all
1 ≤ i ≤ |P(q)|.
Definition 3  Two substrings q and r are in the same equivalence class if and only if q <_P r
or r <_P q.
This <_P relation satisfies all the properties in Lemma 1, but the number of classes is much
smaller than the number of left-equivalent classes. Note that q <_P r holds if q =_P r, but
the reverse does not always hold. For example, in T = abracadabra, bra <_P abra, but
bra ≠_P abra.
Intuitively, q <_P r means that whenever the substring q appears, r also appears at the
corresponding position.
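The following brute-force sketch (quadratic in the text length and purely for illustration; the thesis algorithm instead uses the enhanced suffix array) reports the maximal substrings that occur more than once, by checking that a substring cannot be extended by one identical character on the left or on the right at every one of its occurrences:

from collections import defaultdict

def maximal_substrings(T):
    occ = defaultdict(list)
    n = len(T)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occ[T[i:j]].append(i)                        # occurrence positions of every substring
    result = []
    for q, pos in occ.items():
        if len(pos) < 2:
            continue
        left = {T[p - 1] if p > 0 else None for p in pos}
        right = {T[p + len(q)] if p + len(q) < n else None for p in pos}
        if (len(left) > 1 or None in left) and (len(right) > 1 or None in right):
            result.append(q)                             # cannot be extended uniformly: maximal
    return result

print(sorted(maximal_substrings("abracadabra$")))        # ['a', 'abra']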
3.4.2 Document Statistics with Equivalent Classes
First, we extend the idea of equivalence classes to a set of documents. Given a document set
(x_i, y_i) (i = 1, . . . , n′), let T be the concatenation of the documents, T[1, n] = x₁$₁x₂$₂ . . . x_{n′}$_{n′},
where the $_i are special characters that do not appear in the original text, and let n be the length
of T. We then build a suffix tree for T (called a generalized suffix tree). Since a substring
spanning a document boundary always contains a special character, we need not care about
the document boundaries, and all the issues discussed in the previous section still hold.
In this subsection, we see that many document statistics can be calculated efficiently
using the idea of equivalence classes.
Let tf(q, d) be the term frequency, i.e., the number of occurrences of a substring q in
a document d, and let df(q) be the document frequency, i.e., the number of documents
that include q. Then the idea of equivalence classes of substrings is summarized as follows [Yamamoto and Church, 2001].
Lemma 1  If two substrings q and r are in the same equivalence class, then tf(q, x_i) =
tf(r, x_i) for all 1 ≤ i ≤ n′, and df(q) = df(r).
Proof 1  Let q and r be in the same equivalence class. Then their occurrence positions in T are the same up to a constant shift. Since the documents are delimited by special characters, the occurrences of
q and r do not cross document boundaries, so each occurrence of q and the corresponding
occurrence of r fall within the same document. Therefore, the number of occurrences in
each document, and hence the document frequency, is the same for q and r.
Note that the previous study [Yamamoto and Church, 2001] considered suffix-equivalent classes only; we extend these to stronger classes here.
In the next subsection we see how to enumerate only the interesting equivalence classes.
3.4.3 Enumeration of Equivalent Classes
A naive way to enumerate these equivalence classes is to enumerate all substrings and
then group them into equivalence classes using Definition 2.
However, since the number of substrings is much larger than the number of equivalence
classes, this procedure would be very redundant.
Instead, we present an algorithm that enumerates all equivalence classes directly.
First we define a maximal substring.
Definition 4  A maximal substring p is a substring such that there is no substring q with
p <_P q.
Next, we prove a simple lemma.
Lemma 2  In each equivalence class, exactly one maximal substring exists.
Proof 2  Assume that there exist two distinct maximal substrings t and u in the same equivalence class. Then there exists a substring v such that
v <_P t and v <_P u. From the definition, t and u both include v as a substring,
and therefore they can be written as t = t₁vt₂ and u = u₁vu₂. Since their occurrences
overlap, t₁ is a suffix of u₁ or u₁ is a suffix of t₁, and similarly t₂ is a prefix of u₂ or u₂
is a prefix of t₂. Let q be the longer of t₁ and u₁, and r
be the longer of t₂ and u₂. Then the substring s = qvr is at least as long as
t and u, and its occurrence positions are the same as those of t and u. Therefore t <_P s and
u <_P s, which contradicts the maximality of t and u.
These maximal substrings can be enumerated efficiently by using an enhanced suffix array (ESA) (Sec. 3.5.2).
All maximal substrings correspond to internal nodes or leaves of the suffix tree. In particular, a maximal substring that occurs more than once corresponds to an internal
node.
Internal nodes and leaves of a suffix tree can be expressed as a pair of indexes [l, r],
indicating that the corresponding substring of length d appears at T[p, p + d − 1] for
p ∈ {SA[l], . . . , SA[r]}.
The enumeration of all leaves and internal nodes can be done in time linear in the
document length [Kasai et al., 2001]. The working space for this enumeration is 10|T| +
O(n) bits.
However, not every internal node corresponds to a maximal substring. For example, in
Figure 3.4, although the substring "ra" corresponds to an internal node, it is not a maximal
one, because "abra" is the maximal substring of its class.
To filter out these redundant internal nodes, we use the Burrows-Wheeler transformed text
B[1, . . . , s]. The following lemma holds.
Lemma 3  The necessary and sufficient condition for a substring q = [l, r] to be a maximal substring is that q corresponds to an internal node or a leaf and B[l, r] has more
than one character type.
We can check whether B[l, r] has more than one character type in constant time using
n + o(n) bits of working space (Section 3.2).
The pseudo code of this algorithm is shown in Algorithm 4. With it, we can obtain
all maximal substrings in time linear in the total length of the documents.
3.4.4 External Information
Finally, let us consider the case where external information is available, such as
word/phrase boundaries, so that the extracted substrings should begin and end at these
boundaries. This case can also be handled by the same algorithm.
First, we replace the input T with a new input T′ in which a special character # is
inserted at the word boundaries, and then we apply the algorithm as above. We then
deal only with the maximal substrings whose first character is #.
This conversion does not increase the computational cost, since the new input is at
most twice the size of the original input and the number of maximal substrings is much smaller.
3.5 Document Classification
The first application of maximal substrings is document classification: given a document, we
assign a label, such as a category (sports, money) or a polarity (positive or negative opinion),
according to the content of the document.
For this task, rule-based methods were applied first, but more recently machine learning
methods using support vector machines (SVMs) or logistic regression (LR) have been applied,
because they are robust, easy to adapt to a new domain, and achieve more accurate
results.
In this study I employ a multi-class logistic regression model (LR) (Section 2.3). I
restate some of the definitions here for the explanation.
For an input document x, and an output label y ∈ Y , we define a feature vector
ϕ(x, y) ∈ Rm capturing the characteristic of the document and the label. In LR, the
probability of a label y given an input x is defined as

p(y|x; w) = (1/Z(x)) exp(wᵀϕ(x, y)),   Z(x) = Σ_{y′} exp(wᵀϕ(x, y′)),   (3.6)

where w ∈ Rm is the weight vector. The most probable label is the one that maximizes
the score,

y* = arg max_y p(y|x; w) = arg max_y wᵀϕ(x, y),   (3.7)

because exp is a monotonically increasing function.
The parameter w is estimated by maximum likelihood estimation (MLE) using
training examples {(x_i, y_i)} (i = 1, . . . , n) as follows,

w* = arg min_w L(w),   (3.8)
L(w) = − Σ_i log p(y_i|x_i; w),   (3.9)

where L(w) is the negative log-likelihood of the training data.
However, MLE tends to over-fit the training data when the number of training
examples is insufficient for the number of parameters. To avoid this problem, a regularization term R(w) : Rm → R is added to the likelihood term in (3.9). Applying L1
regularization (also called Lasso regularization), the weight vector is estimated
as

w*_{MAP} = arg min_w L(w) + C|w|₁,   (3.10)

where |w|₁ = |w₁| + |w₂| + . . . + |w_m|, and C > 0 is the trade-off parameter between the
likelihood term and the regularization term; a small C emphasizes the training data, and
a large C emphasizes the regularization. This L1 regularization corresponds to maximum a posteriori (MAP) estimation with a Laplace prior on w. We call this estimation
L1-LR; it can be optimized with specialized solvers such as OWL-QN [Andrew and Gao, 2007].
It is known that the result of L1-LR is a sparse parameter vector in which many of the
parameters are exactly zero. In other words, learning with L1 regularization naturally has
a feature-selection effect, which results in efficient and interpretable inference.
To optimize (3.10), gradient-based optimization cannot be used directly, since the objective function is not differentiable at w_i = 0. Therefore, several specialized methods
have been proposed for L1-LR optimization.
For learning L1-LR, we adopt the grafting algorithm [Perkins et al., 2003] (Section 3.3)
to improve the training efficiency. In the algorithm, we keep the current weight vector w
and the set of active features H (features with non-zero weights). At the beginning,
we initialize the parameters as w = 0 and H = {}. Let v ∈ Rm be the gradient of the
likelihood term with respect to the parameters w,

v = ∂L(w)/∂w   (3.11)
  = Σ_{i,y} ( −I(y = y_i) + p(y|x_i; w) ) ϕ(x_i, y),   (3.12)

where I(a) is 1 if a is true and 0 otherwise. Let k* be the feature for which |v_{k*}| is the
largest. We add k* to H and optimize w over H only, using an L1-LR solver
such as OWL-QN [Andrew and Gao, 2007], in which the weights w_k with k ∉ H are fixed
to 0. We repeat this process until |v_{k*}| < C. The obtained weight
vector is then identical to the optimal weight vector [Perkins et al., 2003]. The point is that the
training time is almost proportional to the number of active features if we can compute
arg max_k |v_k| efficiently, even if the number of candidate features is very large.
3.5.1 Document Classification Model
In this section, we show that in L1-regularized learning the optimal weights for substring features can be determined
by considering only the maximal substrings.
We assume that the feature type is tf (term frequency), but this is not the only
possible choice. Recall that since we use tf, the feature values of two substrings q and r with
q <_P r are the same for all documents.
In L1 regularization, if features have equal values in all training examples, then the
set of optimal weight vectors is convex. In other words, if w and w′ are
minimizers of (3.10), then w″ := αw + (1 − α)w′ for 0 ≤ α ≤ 1 also minimizes (3.10).
This can be explained from the viewpoint of equivalence classes.
Lemma 4  In the L1-regularized problem, let E = {f_i} be a set of feature indexes that belong
to the same equivalence class, and let w* be a weight vector that minimizes (3.10). Then any weight vector
w′ such that Σ_{i∈E} w′_i = Σ_{i∈E} w*_i, w*_i·w′_i > 0 for all i ∈ E, and w′_i = w*_i for all
i ∉ E also minimizes (3.10).
Proof 3  From the definition, we have

|w′|₁ = Σ_{i∉E} |w′_i| + Σ_{i∈E} |w′_i| = Σ_{i∉E} |w*_i| + Σ_{i∈E} |w*_i| = |w*|₁.   (3.13)

Moreover, since w′ᵀϕ(x_i, y) = w*ᵀϕ(x_i, y) for all 1 ≤ i ≤ n and y ∈ Y, we also have
L(w′) = L(w*) for the likelihood term.
Therefore, when a set of features belongs to the same equivalence class, it is sufficient to
keep the sum of their weights. In summary, for training the L1-regularized
problem, we deal only with the features that correspond to maximal substrings, and each
obtained weight corresponds to the sum of the weights in its equivalence class.
3.5.2 Efficient Learning Algorithm with All Substring Features
In the previous subsection, we saw that it is enough to consider the features corresponding to maximal substrings to find the optimal weights in the ALLSTR representation. However, the computational cost is still large; the number of maximal substrings is linear in the total length
of the documents. In this section, we show how to deal with these maximal substrings
without generating feature vectors explicitly.
Recall that the grafting algorithm (Algorithm 3) only requires finding the feature
whose gradient of the likelihood has the largest absolute value (k* =
arg max_k |v_k|). We show that k* can be estimated efficiently using auxiliary data structures.
In particular, we consider the following feature types and combinations of these feature
types.
• tf (q, d) : the frequency of q in a document d.
• bin(q, d) : 1 if q appears in d and 0 otherwise.
• idf(q) : log(n/df(q)), where n is the number of documents and df(q) is the number of documents that include q.
Note that our method is not limited to these feature types. In general, we can compute the gradient efficiently if the feature functions depend only on the position information. If a feature function depends on other information, such as orthographic
features, then we cannot summarize the substring information, and different techniques are required for efficient computation.
First, let us consider the case where the feature type is tf. Let g(l, r, y) be the gradient
value of the feature corresponding to the substring q that appears at T[p, p + d − 1] for
p ∈ {SA[l], . . . , SA[r]}, together with a label y:

g(l, r, y) = Σ_{i=1}^{n} ( P(y|x_i; w) − I(y_i = y) ) tf(q, x_i).   (3.14)
Remember that the occurrences of any substring are stored in a consecutive region of SA.
Let D[i] be the index of the document that contains position SA[i]. Let α be a two-dimensional
array of size |Y| × (s + 1) defined as

α[y][i] = Σ_{j=1}^{i−1} ( P(y|x_{D[j]}; w) − I(y_{D[j]} = y) ).   (3.15)

Then we can calculate g(l, r, y) using α as

g(l, r, y) = α[y][r + 1] − α[y][l].   (3.16)
This is because

α[y][r + 1] − α[y][l] = Σ_{j=l}^{r} ( P(y|x_{D[j]}; w) − I(y_{D[j]} = y) ) = Σ_{i=1}^{n} ( P(y|x_i; w) − I(y_i = y) ) tf(q, x_i).

Therefore, the gradient of any substring can be calculated in constant time by table lookup, using a table of size O(|Y|·s).
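A small sketch of this cumulative-array trick, under one consistent 0-based indexing (illustrative code, not the thesis implementation), is given below; D[j] is the document index of SA[j], P holds the current model probabilities, and labels holds the gold labels:

import numpy as np

def build_alpha(D, P, labels):
    """alpha[y][i] accumulates P(y|x_{D[j]}; w) - I(y_{D[j]} = y) over the first i suffix positions."""
    s, k = len(D), P.shape[1]
    alpha = np.zeros((k, s + 1))
    for j, d in enumerate(D):
        contrib = P[d].copy()
        contrib[labels[d]] -= 1.0
        alpha[:, j + 1] = alpha[:, j] + contrib
    return alpha

def gradient(alpha, l, r, y):
    """Gradient of the tf feature whose occurrences fill SA[l..r], for label y."""
    return alpha[y, r + 1] - alpha[y, l]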
Figure 3.5: An example of the computation of the gradient corresponding to a substring "book".
Figure 3.5 shows an example of the computation of the gradient corresponding to the
substring "book". The columns y = 1, y = 2, and y = 3 hold the values I(y_{D[i]} =
y) − P(y|x_{D[i]}; w). For example, the cell in row i = 4 and column y = 2 shows the
value I(y_{D[4]} = 2) − P(2|x_{D[4]}; w). The gradient value corresponding to the substring
"book" and the label y = 2 is then the sum of the values in the cells with i = 3, . . . , 6 and y = 2.
For the feature types idf(q) and bin(q, d), we can compute the gradient of any substring
in constant time using auxiliary data structures [Sadakane, 2007]. In this case, we need
to remove duplicated documents when enumerating the occurrences of q in the consecutive region. In practice, the auxiliary data structures require a large working space, so
we adopt a simpler strategy: first enumerate all positions containing q, and then remove
the duplicated documents among these positions.
When we use len(q) features, the gradient values of substrings in the same class are
different. In this case, we enumerate the substrings in each class from the longest one.
In summary, we state the following theorem.
Theorem 1  Given training documents whose total length is s, we can train an L1-regularized logistic regression model using all substring features in O(s) time.
Finally, Algorithm 4 shows the overall framework to compute arg max_k |v_k|.
This is the same as the bottom-up traversal of all nodes of the suffix tree using the height array [Kasai et al., 2001], except that we compute the gradient value of each feature by using g(l, r, y) as discussed above.
Algorithm 4 The calculation of the gradients of all maximal substrings
Input: H[0, s], SA[0, s], B[0, s], D[0, s]
  S: a stack storing pairs (pos: the beginning position in SA, len: the length of the substring)
  v_{k*} = 0 : stores the largest gradient value found so far (together with its feature k*)
  for i = 0 to n + 1 do
    cur = (i, H[i])
    cand = top(S)
    while cand.len > cur.len do
      pop(S)
      if B[cand.pos, . . . , i] has more than one character type then
        v_k := g(cand.pos, i, y) for each y ∈ Y   // estimate the gradient of the feature (Section 3.5.2)
        if |v_k| > |v_{k*}| then
          v_{k*} = v_k   // also store k
        end if
      end if
      cand = top(S)
    end while
    if cand.len < cur.len then
      push(S, (cand.pos, cur.len))   // internal node
    end if
    push(S, (i, n − SA[i] + 1))   // leaf
  end for
  Output v_{k*} and the corresponding feature k*
3.5.3 Inference
We now explain how to classify a test document using the result of our algorithm.
After training, we have a set of substrings H and their weights. We first build a trie
data structure for H and assign a weight to each leaf or internal node. Then we find
all matches of H in the test document using the Aho-Corasick method [Aho and Corasick, 1975]. This is
done in time linear in the length of the test document.
Table 3.1: The data sets in the document classification task

Corpus   | Num. docs | Total len. (bytes) | Num. word types | Num. maximal strings
MOVIE-A  | 2000      | 7786004            | 38187           | 1685037
MOVIE-B  | 7440      | 213970             | 55764           | 713229
TC300-A  | 200       | 1953894            | 16655           | 378673
TC300-B  | 200       | 1424566            | 14430           | 236220
Note that, unlike string kernels, for which we have to keep the whole document set, we
keep only a few substrings thanks to the L1 regularization. Therefore, the working space is very
small.
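A toy scoring sketch of this inference step is shown below; for brevity it scans every starting position against the weighted substrings instead of building an Aho-Corasick automaton (so it is quadratic in the worst case), and the substrings and weights are made-up examples rather than learned ones:

from collections import defaultdict

def classify(document, weights):
    """weights: dict mapping substring -> (label, weight)."""
    scores = defaultdict(float)
    max_len = max(len(s) for s in weights)
    for i in range(len(document)):
        window = document[i:i + max_len]
        for sub, (label, w) in weights.items():
            if window.startswith(sub):
                scores[label] += w
    return max(scores, key=scores.get) if scores else None

print(classify("this hockey game went to overtime",
               {"hockey": ("rec.sport.hockey", 1.3), "a car": ("rec.autos", 1.1)}))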
3.5.4 Experiments
We conducted a series of document classification experiments on two data sets, MOVIE
and TechTC-300.
MOVIE is a sentiment classification task: given a review, we classify it
as positive or negative. There are two versions of the data set: one provided by
Bo Pang³ (MOVIE-A), and the other provided by Ifrim⁴ (MOVIE-B), which was used
in [Ifrim et al., 2008]. TechTC-300 consists of 300 binary classification tasks. The original
categories come from the Open Directory Project. Among the 300 tasks, we used two tasks for
which SVM classification achieves only about 70% accuracy⁵. The details of each data
set are described in Table 3.1.
We examined the performance using 5-fold cross-validation. We determined the hyper-parameters by using a development set.
We compared our method (Proposed in Table 3.2) with L1-LR with BOW (BOW+L1),
LR with variable-length N-grams [Ifrim et al., 2008] (SLR), and BOW with SVM (SVM).
For SVM we used a third-degree polynomial kernel because it achieved the highest accuracy among the kernels we tried (including string kernels).
³ http://www.cs.cornell.edu/People/pabo/movie-review-data/, Polarity dataset v2.0
⁴ http://www.mpi-inf.mpg.de/~ifrim/data/kdd08-datasets.zip, KDD08-datasets
⁵ http://techtc.cs.technion.ac.il/techtc300/techtc300.html, A: 10341-14271, B: 10539-194915
Table 3.2: Results of the document classification task

Corpus   | Proposed | BOW+L1 | SLR   | SVM
MOVIE-A  | 86.5%    | 83.0%  | 81.6% | 87.2%
MOVIE-B  | 75.1%    | 71.0%  | 74.0% | 69.1%
TC300-A  | 80.0%    | 66.7%  | 80.0% | 73.1%
TC300-B  | 86.7%    | 86.7%  | 73.3% | 71.9%
Figure 3.6: The time for finding all maximal substrings (x-axis: input size (MB), y-axis: time (sec.)).
For the feature types, we compared the results using bin, tf, idf, len, and their combinations. For word-based BOW, tf achieved the best performance, and for the proposed
method, idf achieved the best performance⁶. We used these feature types in the following
experiments.
In practice, the most time-consuming part of our method was the calculation of
arg max_k |v_k|, because we need to access the whole data sequentially. In the original grafting
algorithm, only one feature is added at a time from the feature candidates. We instead chose a predefined number of features with the largest gradients and added them to H. Note that even if we include
several features at once, the algorithm converges to the global optimum.
All experiments were conducted on a 3.0 GHz Xeon processor with 32 GB of main memory. The operating system was Linux version 2.6.9. The compiler was g++ (gcc version
4.0.3) executed with the -O3 option.
Table 3.2 shows the accuracy results. The proposed method achieved the highest or
second-highest accuracy on all data sets. SVM achieved the highest performance on the MOVIE-A corpus,

⁶ There is no significant difference between idf, len, tf-idf, idf-len, tf-len, and tf-idf-len.
but very low performance on the other corpora, because SVM suffered from
noise words (irrelevant to the class information). The methods with L1 regularization
could filter out ineffective substrings and achieved high performance on TC300-B. The
results for SLR were always equal to or worse than those of our method, because SLR searches
effective substrings in a greedy manner and in some cases cannot find the effective
substrings. The proposed method could successfully find the effective substrings among all
substrings.
Finally, we examined the scalability of the proposed method. We varied the length of
the input text and measured the time to enumerate all maximal substrings. Note that this
part is the most time-consuming and dominant part of our algorithm. Figure 3.6 shows
the results: the x-axis shows the input size, and the y-axis shows the time for reporting all
maximal substrings. The results indicate that our method scales in proportion
to the text size, even for texts as large as 1 GB.
3.6 Document Clustering
The next application of all substring features is document clustering. Clustering is the
task of splitting examples into clusters so that the examples in the same cluster are similar
and those in different clusters are dissimilar. Unlike document classification, there
is no training data in document clustering; thus it is unsupervised learning. There
are many variations of clustering depending on how the (dis-)similarity and the preference over results are defined, such as K-means, Gaussian mixture models, and spectral clustering
methods [Ding et al., 2001].
Recently, maximum margin clustering (MMC) has been proposed [Xu et al., 2004].
Since MMC produces more accurate results and its formulation is very similar to margin-based supervised learning, many researchers have studied MMC [Zhang et al., 2007,
Zhao et al., 2008a, Zhao et al., 2008b, Li et al., 2009, Gieseke et al., 2009].
The MMC approach is based on the multi-class SVM [Crammer and Singer, 2001], a supervised learning method. In the multi-class SVM, the parameters are optimized so that the
margin is maximized for each class. In MMC, the system also determines the partition
of the examples so that the margin is maximized in each cluster. A trivial solution of this
problem is that all examples belong to the same class. To obtain a meaningful result, a constraint is added that specifies the minimum and the maximum size of a cluster [Xu et al., 2004]. Alternatively, the constraint is relaxed so that the sum of margins is
restricted [Zhao et al., 2008a, Zhao et al., 2008b].
Since the optimization problem in MMC is non-convex, like those in other clustering
methods, specialized optimization techniques have been proposed. Zhao [2008a, 2008b] proposed
to apply a cutting-plane algorithm to the problem: constraints are added starting from the ones with
the largest effect. They then solve a convex-concave problem in which the objective function
is represented as a difference of convex functions. This solver is extremely fast,
even compared to K-means clustering.
Instead, we propose a clustering model based on a logistic regression model. Since
this model is probabilistic, it can assign a conditional probability of
a class given a document, which MMC cannot. Moreover, we apply L1 regularization
to the objective function so that the features related to the clustering result become
sparse. Since in L1 regularization only a few features receive a large weight, the substrings with
large weights can be used as labels of the clusters. By combining this model with the idea
of maximal substrings, the clustering can consider all substring information, and its result is
compact.
3.6.1 Logistic Regression Clustering
I propose a clustering algorithm that aims at splitting input examples {x_i} (i =
1, . . . , n) into k clusters. The number of clusters k is given by the user, and determining the optimal number of clusters is future work.
As in a classification problem, let us represent the information of a pair of an input x and an
output y by a feature vector ϕ(x, y) ∈ Rm, each value of which is determined by a feature
function f_i, that is, ϕ(x, y)_i = f_i(x, y). The cluster number of the i-th input is denoted by
y_i ∈ {1, . . . , k} (i = 1, . . . , n), and the set of cluster numbers of all examples is denoted
by y. Note that unlike in a supervised classification problem, y_i is not given by the training
data beforehand, and the clustering task is to find y so that examples in the same cluster
are similar and those in different clusters are dissimilar.
Recall that in a multi-class logistic regression model, the conditional probability of an
output y given an input x is defined as

p(y|x; w) = (1/Z(x)) exp( wᵀϕ(x, y) ),   (3.17)

where w is a weight vector and Z(x) = Σ_y exp( wᵀϕ(x, y) ) is a normalization term, or
partition function.
Then, clustering is performed by finding the label set y and the weight vector w
that maximize the likelihood of the examples, or equivalently minimize the negative
log-likelihood,

(y*, w) = arg min_{y,w} − Σ_i log p(y_i|x_i; w).   (3.18)
We also add L1 regularization to the objective function to obtain a sparse weight vector.
Finally, we obtain the following objective function for the clustering problem,

(y*, w) = arg min_{y,w} − Σ_i log p(y_i|x_i; w) + C∥w∥₁,   (3.19)

where C ∈ R is a trade-off parameter that determines the sparseness of w; when C is large,
many values in w become exactly 0.
Since the direct optimization of y is hard, we solve a different optimization problem
that is equivalent to (3.19) [Zhao et al., 2008b]:

w* = arg min_w − Σ_{i,y} M(i, y; w) log p(y|x_i; w) + C∥w∥₁,   (3.20)
where M(i, y; w) is defined as

M(i, y; w) = Π_{y′≠y} I( wᵀϕ(x_i, y) > wᵀϕ(x_i, y′) ).   (3.21)

Intuitively, M(i, y; w) returns 1 if y gives the largest probability for x_i, and returns 0
otherwise. The optima of (3.19) and (3.20) are equivalent, and their optimal w* is the
same.
Then, the cluster assignment for each example is easily estimated by

y_i = arg max_y wᵀϕ(x_i, y).   (3.22)
The trivial solution of (3.20) is obtained by assigning the same cluster to all
examples. We also want to avoid the case where a cluster consists only of outliers.
Table 3.3: The result of clustering accuracy (%)

Data     | LRC+ALLSTR | LRC+BOW | K-means | NC    | MMC
20-NEWS  | 71.12      | 69.15   | 35.27   | 41.89 | 70.63
WK-CL    | 71.05      | 65.38   | 55.71   | 61.43 | 71.95
WK-TX    | 64.50      | 60.15   | 45.05   | 35.38 | 69.29
WK-WT    | 74.05      | 73.12   | 53.52   | 32.85 | 77.96
WK-WC    | 71.20      | 71.05   | 49.53   | 33.31 | 73.88
Therefore, we add a constraint to the problem so that each cluster should not be larger
or smaller than some threshold. For example, in [Zhao et al., 2008a, Zhao et al., 2008b],
for all pairs of clusters p and q, the following constraint is added:

−l ≤ Σ_{i=1}^{n} wᵀϕ(x_i, p) − Σ_{i=1}^{n} wᵀϕ(x_i, q) ≤ l.   (3.23)

In this thesis, we use a simpler constraint,

W(w) = Σ_y ( Σ_i w_yᵀϕ(x_i) )² = Σ_y ( wᵀϕ(all, y) )²,   (3.24)

where ϕ(all, y) = Σ_i ϕ(x_i, y). This function gives a smaller value when all clusters have
the same size.
the same size.
In summary, we solve the following optimization problem,
w∗ = arg min −
w
∑
M (i, y; w) log p(y|xi ; w)
(3.25)
i,y
+C∥w∥1 + C1 W (w)
where C1 determines the size of clusters.
To optimize (3.25), we alternately assign labels to the examples and optimize w, and repeat this until convergence. The label assignment step corresponds to fixing M(i, y; w) in (3.25).
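To make the alternating procedure concrete, the following is a minimal sketch in Python (my illustration, not the thesis implementation). It assumes a dense feature matrix and a hypothetical helper fit_l1_multiclass that solves the inner L1-regularized multi-class logistic regression for fixed labels; the size-balancing term W(w) is omitted for brevity.

    import numpy as np

    def lrc_cluster(X, k, fit_l1_multiclass, n_iter=20, seed=0):
        """Alternating optimization for logistic regression clustering (sketch).

        X : (n, m) feature matrix, one row per example.
        k : number of clusters.
        fit_l1_multiclass : callable(X, y, k) -> W of shape (k, m); assumed to
            solve the inner L1-regularized multi-class LR problem for fixed y.
        """
        rng = np.random.default_rng(seed)
        y = rng.integers(0, k, size=X.shape[0])      # random initial assignment
        for _ in range(n_iter):
            W = fit_l1_multiclass(X, y, k)           # optimize w with labels fixed
            scores = X @ W.T                         # (n, k) scores w^T phi(x, y)
            y_new = scores.argmax(axis=1)            # Eq. (3.22): reassign clusters
            if np.array_equal(y_new, y):             # converged: labels unchanged
                break
            y = y_new
        return y, W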
3.6.2 Experiments
We conducted document clustering experiments on two data sets, 20 Newsgroups (20news-18828)7 and WebKB. For the 20 Newsgroups data set, we selected the topic rec.
7 http://people.csail.mit.edu/jrennie/20Newsgropus
Table 3.4: Examples of substrings that have the largest weight in each cluster

CLUSTER              SUBSTRING
rec.autos            "the ford", "a car"
rec.motorcycles      "a bike"
rec.sport.baseball   "the yankees", "an era"
rec.sport.hockey     "an NHL", "Hockey League"
There are four subtopics in the topic rec: autos, motorcycles, baseball, and hockey. WebKB is the crawling result of four university websites, and there are seven topics, such as student and faculty.
It is difficult to measure the accuracy of clustering because there is no single correct answer. We adopt the method used in [Xu et al., 2004, Zhao et al., 2008b]: (1) remove the labels from the data set; (2) run the clustering algorithm on the unlabeled data, with the number of clusters set to the number of labels in the original data set; (3) for each cluster, examine which label is assigned most often in the original data set; and (4) for each cluster, count how many examples carry that majority label.
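A possible implementation of this evaluation, assuming integer-coded labels and cluster ids (a sketch, not the original evaluation script):

    import numpy as np

    def clustering_accuracy(true_labels, cluster_ids):
        """Majority-label clustering accuracy as described above.

        For each cluster, the most frequent original label is counted as correct;
        the accuracy is the fraction of examples covered by these majority labels."""
        true_labels = np.asarray(true_labels)
        cluster_ids = np.asarray(cluster_ids)
        correct = 0
        for c in np.unique(cluster_ids):
            labels_in_c = true_labels[cluster_ids == c]
            # count of the most frequent original label inside cluster c
            correct += np.bincount(labels_in_c).max()
        return correct / len(true_labels)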
Our proposed method was compared with K-means, Normalized Cut (NC) [Shi and Malik, 2000], and maximum margin clustering (MMC) [Zhao et al., 2008b]. The results of the other methods are taken from [Zhao et al., 2008b]8. We compared the proposed method with the usual bag-of-words representation (LRC+BOW) and with the all-substring representation (LRC+ALLSTR). Note that the representation of LRC+ALLSTR also includes the word features.
The results are shown in Table 3.3. On most data sets, the proposed methods achieved results similar to those of MMC. In particular, when BOW features are used, the result of LRC+BOW is slightly worse than that of MMC. This is probably because LRC optimizes the likelihood of the examples, while MMC optimizes the classification accuracy directly. Therefore, our method would be preferable when probabilistic information is required. LRC+ALLSTR achieved the highest accuracy on many corpora, partly because it can use substring features that are not available to the other methods.
8 I conducted experiments on K-means and Normalized Cut, and obtained similar but slightly worse accuracy results.
Table 3.4 shows examples of substrings that have the largest weight in each cluster. The column “Cluster” shows the label that is most often assigned in the original data set. Other highly weighted features are often parts of a signature or affiliation of the writer, such as “[email protected]” or “University of xxx”.
To examine why the proposed algorithm succeeds, I sorted the maximal substrings in decreasing order of tf log(df). The top 500 maximal substrings each appeared only in documents of a single label. Such key substrings, which determine the clusters, can be found efficiently by using the maximal substring algorithm.
3.7 Discussion and Conclusion
In this chapter I proposed a novel algorithm that considers all substrings as features, and showed that we can train a document classification model with all substrings, without approximation, in linear training time. The experimental results showed that our method achieves the highest performance on several tasks compared to other document classification methods: word-based BOW and the very recent variable-length N-gram logistic regression model [Ifrim et al., 2008]. Our training results are represented as a very compact set of substrings, and the inference time is very short both in theory and in practice.
Next, I applied this algorithm to a document clustering task. To achieve this, I developed a novel document clustering model, called logistic regression clustering (LRC). The experimental results show that its accuracy matches that of state-of-the-art methods. Moreover, our algorithm gives a compact result in that only a few features are related to the clustering decision, and therefore these features can be used as labels of the clusters.
Chapter 4
Learning with Combination Features
In this chapter, I propose an algorithm for learning an L1-regularized logistic regression model with combination features [Okanohara and Tsujii, 2009a]. With this algorithm, we can extract effective combination features without enumerating all of the candidate features. The method relies on the grafting algorithm [Perkins and Theiler, 2003], which incrementally adds features as in boosting, but converges to the global optimum.
The heart of our algorithm is a way to find the feature that has the largest gradient value of the likelihood from among the huge set of candidates. To solve this problem, we propose an example-wise algorithm with filtering. This algorithm is very simple and easy to implement, yet effective in practice.
I applied the proposed method to NLP tasks, and found that it can achieve the same high performance as kernel methods, while the number of active combination features remains relatively small, on the order of several thousand.
4.1 Linear Classifier and Combination Features
A linear classifier, including the logistic regression model (LR), is a fundamental tool for many NLP applications; its score is determined by the inner product of a feature vector and a weight vector, and is thus a linear combination of features and their weights. Although a linear classifier is very simple, it can achieve high performance on many NLP tasks, partly because many problems are described with very high-dimensional data, and high-dimensional weight vectors are effective in discriminating among examples.
However, when the original problem cannot be handled linearly, combination features are often added to the feature set, where a combination feature is the product of several original features. Examples are word pairs in document classification and part-of-speech pairs of head and modifier words in dependency analysis. However, determining effective combination features, namely feature engineering, requires domain-specific knowledge and hard work.
For example, in document classification, a document is often represented by a bag of words, where each feature ϕw corresponds to the occurrence of a word w in a document. If the co-occurrence of words w1 and w2 represents the class label of a document well, then a linear classifier with the combination feature ϕw12 := ϕw1 ϕw2 can separate the data well. In other tasks, combination features are essential for discriminating the labels; in dependency analysis, for instance, the original features by themselves cannot discriminate the labels, while part-of-speech pairs of head and modifier words can.
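As a concrete illustration (a sketch of my own, not part of the thesis; the feature-name scheme is hypothetical), degree-2 combination features for a binary bag-of-words example can be generated as follows:

    from itertools import combinations

    def add_pair_features(features):
        """Given the set of active binary features of an example (e.g. the words
        in a document), return the original features plus all degree-2
        combination features. A combination feature 'w1&w2' fires iff both w1
        and w2 fire, i.e. it is the product of the two binary original features."""
        combined = {f"{a}&{b}" for a, b in combinations(sorted(features), 2)}
        return set(features) | combined

    # Example: a tiny "document" with three active word features
    print(add_pair_features({"car", "ford", "engine"}))
    # {'car', 'engine', 'ford', 'car&engine', 'car&ford', 'engine&ford'}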
Such non-linear phenomena can be implicitly captured by using the kernel trick (2.7). For example, a third-order polynomial kernel can be regarded as an inner product over features consisting of combinations of up to three original features. However, its computational cost is very high, not only during training but also at inference time. Moreover, the model is not interpretable, in that effective features are not represented explicitly. Many kernel methods employ L2 regularization, which implicitly assumes that many features are equally relevant to the task [Ng, 2004]. Therefore, kernels cannot handle the case in which very few features are relevant to the class1. Although a sparse kernel logistic regression [Hérault and Grandvalet, 2007] has been proposed, in which only a few training examples are selected as support vectors, these problems of the kernel trick have not yet been solved.
There have been several studies on efficient ways to obtain (combination) features. In the context of boosting, Kudo [2004] proposed a method to extract complex features that is similar to item set mining algorithms. In the context of L1 regularization, Dudík [2007], Gao [2006], and Tsuda [2007] have also proposed methods by which effective features are extracted from huge sets of feature candidates. However, their methods
1 With L1 regularization, we cannot directly apply the kernel method, because the optimal parameters cannot be represented as a combination of the inputs.
are still computationally expensive, and we cannot directly apply this kind of method to a large-scale NLP problem. Very recently, Yoshinaga [2009] proposed a learning algorithm with combination features that uses pre-calculated weights of (partial) feature vectors stored in a feature sequence trie.
4.2 Learning Model
I consider a multi-class logistic regression model (LR) (Section 2.3). For input x, and an
output label y ∈ Y , we define a feature vector ϕ(x, y) ∈ Rm .
In LR, the probability of a label y given an input x is defined as follows:

p(y|x; w) = (1/Z(x, w)) exp(w^T ϕ(x, y)),    (4.1)

where w ∈ R^m is a weight vector corresponding to each input dimension, and Z(x, w) = Σ_y exp(w^T ϕ(x, y)) is the partition function.
Since the number of combination features is very large, the training procedure can easily overfit the training data. One way to control the degree to which the parameters are fitted to the data is to introduce a penalty term that controls the complexity of the set of possible models. We use L1 regularization because it yields a sparse parameter vector, in which many of the parameter values are exactly zero. In other words, learning with L1 regularization has an intrinsic feature selection effect, which results in efficient and interpretable inference with almost the same performance as L2 regularization [Gao et al., 2007b].
We estimate the parameter w by maximum likelihood estimation (MLE) with L1 regularization using training examples {(x1, y1), . . . , (xn, yn)} as follows:

w* = arg min_w L(w) + C Σ_i |wi|,    (4.2)

L(w) = − Σ_{i=1}^{n} log p(yi|xi; w),

where C > 0 is the trade-off parameter between the likelihood term and the regularization term. This estimation is a convex optimization problem.
As in the previous section, we adopt the grafting algorithm [Perkins et al., 2003] (Section 3.3), so that the training cost is proportional to the number of active features. Recall that the key operation required by the grafting algorithm is to find the most effective feature, namely the feature that has the largest absolute value of the gradient.
4.3 Extraction of Combination Features
This section presents an algorithm to find the most effective combination feature, that is, the feature k* whose gradient vk has the largest absolute value. To solve this problem, we propose a perceptron-like algorithm with filtering. This algorithm is very simple and easy to implement.
Let k be a new feature to be tested. Then the gradient of the likelihood with respect to wk is calculated as

vk = ∂L(w)/∂wk = Σ_{i,y} αi,y ϕk(xi, y),    (4.3)

αi,y = I(yi = y) − p(y|xi; w),    (4.4)

where I(a) is 1 if a is true and 0 otherwise. In the grafting algorithm, we then add k* = arg max_k |vk| to H and optimize (4.2) with respect to H only. The solution w that is obtained is used in the next search. The iteration continues until |vk*| < C.
We can examine whether a feature k is effective by checking the value of vk. That is, at the optimum, a feature can have a non-zero weight only if |vk| ≥ C. Therefore, we can safely filter out a feature if we know that |vk| cannot be larger than C.
Moreover, we make use of the sparseness of the training data and compute the values vk in an example-wise manner. This is similar to an online learning algorithm, but our method is used for finding effective features, not for optimizing the objective function.
We assume that the value of a combination feature is less than or equal to the values of its original features. This assumption is typical in NLP; for example, it holds when we use binary values for both original and combination features. We can relax this constraint at the price of increased computational cost.
First, we sort the examples in descending order of their αi,yi = Σ_{y≠yi} p(y|xi; w) values, the sum of the probabilities of the wrong labels. Then, we look at the examples one by one. Let us assume that r examples have been examined so far.
We keep three vectors during the computation of v,

t   = Σ_{i≤r, y} αi,y ϕ(xi, y),    (4.5)
t−  = Σ_{i>r, y} α−i,y ϕ(xi, y),    (4.6)
t+  = Σ_{i>r, y} α+i,y ϕ(xi, y),

where α−i,y = min(αi,y, 0) and α+i,y = max(αi,y, 0).
Then, simple calculus shows that the gradient value vk of a combination feature k, whose original features are k1 and k2, is bounded below and above as follows:

tk + t−k < vk < tk + t+k,    (4.7)
tk + max(t−k1, t−k2) < vk < tk + min(t+k1, t+k2).

Intuitively, the upper bound in (4.7) corresponds to the case where the combination feature fires only for the examples with αi,y ≥ 0, and the lower bound to the case where it fires only for the examples with αi,y ≤ 0. The second inequality follows from the fact that the value of a combination feature is equal to or less than the values of its original features. Therefore, we evaluate (4.7) and check whether |vk| can be larger than C. If not, we can remove the feature safely.
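A minimal sketch of this pruning test (my illustration; the variable layout is hypothetical, with the three vectors kept as dictionaries keyed by original feature id):

    def can_prune(t_k, k1, k2, t_minus, t_plus, C):
        """Return True if combination feature k = (k1, k2) can be safely skipped.

        t_k     : gradient mass of k accumulated over the examples examined so far
        t_minus : dict, remaining negative gradient mass of each original feature
                  over the not-yet-examined examples (values <= 0)
        t_plus  : dict, remaining positive gradient mass (values >= 0)
        Eq. (4.7): t_k + max(t-_k1, t-_k2) < v_k < t_k + min(t+_k1, t+_k2).
        If this interval lies strictly inside (-C, C), |v_k| can never reach C."""
        lower = t_k + max(t_minus.get(k1, 0.0), t_minus.get(k2, 0.0))
        upper = t_k + min(t_plus.get(k1, 0.0), t_plus.get(k2, 0.0))
        return -C < lower and upper < C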
Since the examples are sorted in the order of their Σ_y |αi,y| values, the bound becomes tight quickly. Therefore, many combination features are filtered out in the early steps. In the experiments, the weights of the original features are optimized first, and then the weights of the combination features are optimized. This significantly reduces the number of candidate combination features.
Figure 4.1 shows an example of filtering. The red line shows the upper bound tk + t+k and the green line shows the lower bound tk + t−k.
Algorithm 5 presents the details of the overall algorithm for extracting effective combination features. Note that many candidate features are filtered out just before being added.
We maximize the effect of filtering as follows. First, we solve the problem with only the original features, and then we check the combination features. After training with only the original features, many αi,yi are small, and many combinations can be filtered out without being inserted into H. Note that even in this case, we obtain the optimal weights.
Figure 4.1: An example of filtering a candidate combination feature.
In an approximated version, we keep only the top K features whose |vk| are largest. In this case, some effective features may be filtered out2.
4.4 Experiments
To measure the effectiveness of the proposed method (called L1 -Comb), we conducted
experiments on the dependency analysis task, and the document classification task. In all
experiments, the parameter C was tuned using the development data set.
In the first experiment, we performed Japanese dependency analysis. We used the Kyoto Text Corpus (Version 3.0); Jan. 1 and Jan. 3–8 were used as the training data, Jan. 10 as the development data, and Jan. 9 as the test data, so that the results can be compared with those of previous studies [Sassano, 2004]3. We used the shift-reduce dependency algorithm [Sassano, 2004]. The number of training events was 113,332, each of which consisted of two word positions as inputs and y ∈ {0, 1} as an output indicating the dependency relation. We used the same feature set as in [Sassano, 2004], but we used only atomic features and did not use any combination features explicitly, because we wanted to test the effectiveness of the combination features.
2 In the experiments, I examined only the exact version.
3 The data set is different from that in the CoNLL shared task; this data set is more difficult.
Algorithm 5 Algorithm to return the combination feature that has the largest gradient value.
Input: training data (xi, yi) and their αi,y values (i = 1, . . . , n, y = 1, . . . , |Y|), and the parameter C. Examples are sorted with respect to their Σ_y |αi,y| values.
t+ := Σ_{i=1}^{n} Σ_y max(αi,y, 0) ϕ(xi, y)
t− := Σ_{i=1}^{n} Σ_y min(αi,y, 0) ϕ(xi, y)
H := {}, vk := 0 for all candidates k  // active combination features and their accumulated gradients
for i = 1 to n and y ∈ Y do
  for all combination features k in xi do
    if |vk| can be larger than C (checked by using Eq. (4.7)) then
      vk := vk + αi,y ϕk(xi, y)
      H := H ∪ {k}
    end if
  end for
  t+ := t+ − max(αi,y, 0) ϕ(xi, y)
  t− := t− − min(αi,y, 0) ϕ(xi, y)
end for
Output: arg max_{k∈H} |vk|
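For concreteness, a rough Python rendering of Algorithm 5 is given below (my sketch, not the thesis implementation). It simplifies the setting to binary features that do not depend on the label, and its data layout (lists of feature sets and per-example α dictionaries) is an assumption.

    from collections import defaultdict
    from itertools import combinations

    def find_best_combination(example_feats, alpha, C):
        """example_feats[i]: set of binary original features firing in example i,
        assumed pre-sorted by decreasing sum_y |alpha[i][y]|.
        alpha[i][y] = I(y_i = y) - p(y | x_i; w)."""
        t_plus, t_minus = defaultdict(float), defaultdict(float)
        for feats, a in zip(example_feats, alpha):      # remaining +/- gradient mass
            for a_y in a.values():
                for f in feats:
                    if a_y > 0: t_plus[f] += a_y
                    else:       t_minus[f] += a_y
        v, H = defaultdict(float), set()
        for feats, a in zip(example_feats, alpha):
            for a_y in a.values():
                for k in combinations(sorted(feats), 2):
                    lower = v[k] + max(t_minus[k[0]], t_minus[k[1]])
                    upper = v[k] + min(t_plus[k[0]], t_plus[k[1]])
                    if upper >= C or lower <= -C:       # bound (4.7): k may still reach C
                        v[k] += a_y                     # binary features: phi_k = 1
                        H.add(k)
                for f in feats:                         # shrink the remaining mass
                    if a_y > 0: t_plus[f] -= a_y
                    else:       t_minus[f] -= a_y
        return max(H, key=lambda k: abs(v[k]), default=None)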
For example, we used the POS of headwords as features, but did not use the pair of the POS of head and modifier words as a feature. We expect that our algorithm can automatically extract such features from the training examples. For the training data, the number of original features was 78,570, and the numbers of combination features of degrees 2 and 3 were 5,787,361 and 169,430,335, respectively. Note that our algorithm does not need to examine all of them.
In all experiments, combination features of degrees 2 and 3 (the products of two or
three original features) were used.
We compared our method with LR with L1 regularization using only the original features (L1-Original), an SVM with a third-order polynomial kernel, LR with L2 regularization using combination features of degree up to 3 (L2-Comb3), and an averaged perceptron with the original features (Ave. Perceptron).
Table 4.1 shows the results of the Japanese dependency task. The accuracy result
Table 4.1: The performance of the Japanese dependency task on the test set. The active features column shows the number of features with nonzero weight.

              DEP. ACC. (%)  TRAIN TIME (S)  ACTIVE FEAT.
L1-COMB       89.03          605             78002
L1-ORIG       88.50          35              29166
SVM 3-POLY    88.72          35720           (KERNEL)
L2-COMB3      89.52          22197           91477782
AVE. PERCE.   87.23          5               45089
indicates that the accuracy was improved by the automatically extracted combination features. The active features column lists the number of active features; it shows that L1 regularization automatically selects very few effective features. Note that, during training, L1-Comb used around 100 MB of memory, while L2-Comb3 used more than 30 GB. The most time-consuming part of L1-Comb was the optimization of the L1-LR problem.
Examples of extracted combination features include POS pairs of head and modifier words, such as Head/Noun–Modifier/Noun, and combinations of distance features with the POS of the head.
Note that the accuracy of our method is still lower than the current best result for the Japanese dependency analysis task, which uses manually selected combination features or polynomial kernels. Previous work [Uchimoto et al., 1999] suggested that combination features of degree 4 or 5 would improve the accuracy; this is left as future work.
For the second experiment, we performed the document classification task using the Tech-TC-300 data set [Davidov et al., 2004]4. We used tf-idf scores as feature values and did not filter out any words beforehand. The Tech-TC-300 data set consists of 295 binary classification tasks. We divided each document set into a training set and a test set; the ratio of the test set to the training set was 1 : 4. The average number of features per task was 25,389.
Table 4.2 shows the results for L1-LR with combination features and an SVM with a linear kernel5.
4 http://techtc.cs.technion.ac.il/techtc300/techtc300.html
Table 4.2: Document classification results for the Tech-TC-300 data set. The column F2 shows the average of the F2 scores for each classification method.

                       F2
L1-COMB                0.949
L1-ORIG                0.917
SVM (LINEAR KERNEL)    0.896
The column ACTIVE FEATURES shows the average number of active features for each task. The results indicate that the combination features are effective.
4.5 Discussion and Conclusion
I have presented a method to extract effective combination features for the L1-regularized logistic regression model. I have shown that a simple filtering technique is effective for enumerating effective combination features within the grafting algorithm, even for large-scale problems. Experimental results show that an L1-regularized logistic regression model with combination features can achieve comparable or better results than those from other methods, and its result is very compact and easy to interpret.
5 SVM with a polynomial kernel did not achieve a significant improvement.
Part II
Learning with Massive Number of
Outputs
Chapter 5
Discriminative Language Models with
Pseudo-Negative Examples
Language models (LM) are fundamental tools for many applications, such as speech
recognition, machine translation, spelling correction, etc. This chapter and the following
chapter present LMs for different tasks. This chapter presents a discriminative language model with pseudo-negative samples (DLM-PN), which directly classifies a given sentence as correct or incorrect, while the next chapter presents a hierarchical exponential model (HEM) that predicts the next word given a context, together with its probability. These LMs are suited to different tasks: DLM-PN can use features of the whole sentence (e.g. the sentence length), which HEM cannot; on the other hand, HEM provides probabilistic information and supports efficient inference of the most probable word.
5.1 Language Modeling
Language modeling (LM) aims at determining whether a given sentence is correct or not. For example, in machine translation an input sentence is translated into several candidate sentences in the target language, and LMs choose the best sentence among them by comparing their fluency. In particular, in statistical machine translation [Brown et al., 1990], given a sentence E in the original language, a machine translation system finds the best sentence F̂ in the target language as

F̂ = arg max_F P(F|E) = arg max_F P(E|F) P(F).    (5.1)

This problem is thus decomposed into the translation problem P(E|F) and the target-language-dependent problem P(F). The language model solves the latter problem.
Among several types of language models, probabilistic language models (PLMs) are widely used; they estimate the probability of word sequences or sentences. An example of such a PLM is the N-gram language model (NLM).
However, PLMs cannot determine by themselves whether a sentence is correct or not, because the probability depends on the length of the sentence and the global frequency of each word. For example, p(S1) < p(S2), where p(S) is the probability of a sentence S given by a PLM, does not always mean that S2 is more correct or plausible; this can happen simply because S2 is shorter than S1, or because S2 contains more common words than S1. Another problem is that PLMs cannot easily include overlapping or non-local information, which is important for classifying sentences more finely.
These problems have not been discussed much before, because LMs have been used in applications where they select the best sentence among candidates whose lengths and word tendencies are similar. However, the problem becomes apparent when the sentences to be compared are not similar. For example, such language models cannot identify incorrect sentences among sentences written by non-native writers. These problems will become more pressing as LMs are increasingly used in language generation applications.
It is therefore reasonable to consider that most sentences can be classified as correct or incorrect in terms of grammar, pragmatics, and plausibility, and that a sentence which is, say, 70% correct is very rare or does not exist. It would therefore be better to treat language modeling as binary classification in some applications.
Discriminative language models (DLMs) have been proposed to classify sentences into correct and incorrect ones. DLMs [Gao et al., 2005, Roark et al., 2007] can include both non-local and overlapping information directly. However, the DLMs in previous studies assume specific applications. For example, the training examples are candidate sentences generated by another application, together with one correct sentence, and the parameters are estimated by minimizing the sample risk, i.e. the error in choosing the best sentence among candidates that may include plausible and even correct sentences. Therefore, the resulting model cannot be used for other applications. More importantly, the amount of training data in previous DLMs is limited, unlike the effectively unlimited data available to PLMs. If we had unlimited negative examples, the models could be trained directly to discriminate correct sentences from incorrect
sentences.
In this chapter, I propose a general-purpose DLM that, like PLMs, is not tied to a specific application. To achieve this goal, I need to solve the following two problems. The first is that we cannot obtain negative examples (incorrect sentences). The second is the prohibitive computational cost, because the numbers of features and examples are very large. In previous studies this problem was not obvious, because the amount of training data was limited and combinations of features were not used, so the computational cost remained tractable.
For the first problem, I propose to sample incorrect sentences from a probabilistic language model and then train a model to discriminate between correct and incorrect sentences. I call these examples Pseudo-Negative because they are not real negative sentences, and I call the method DLM-PN (DLM with Pseudo-Negative samples).
For the second problem, I apply online max-margin learning with fast kernel computation. I will show that a non-linear model is essential for discriminating between correct and incorrect sentences. I also estimate latent information of sentences using a semi-Markov class model and extract features from it. Although the number of latent features is substantially smaller than that of explicit features such as words or phrases, the latent features capture essential information for sentence classification.
Experimental results show that these pseudo-negative samples can indeed be regarded as incorrect examples, and that the proposed learning method is able to discriminate between correct and incorrect sentences. I also show that DLM-PN can correctly classify sentences that cannot be classified by N-gram models, syntactic parsers, or non-native speakers.
5.2 Previous Language Models
I now explain several previous language models and their characteristics.
Table 5.1 compares the previous language models and the proposed language models. ME denotes a maximum entropy language model, and WSME denotes a whole-sentence maximum entropy language model. P(w) denotes the operation that returns the probability of a word w, and arg max_w P(w) the operation that returns the most probable word. |W| denotes the number of distinct candidate words.
Table 5.1: A comparison of language models.

Language Models                 Training    Complex   Time of      Time of
                                Efficiency  Features  P(w)         arg max_w P(w)
N-gram                          XX          -         O(1)         O(1) ~ O(|W|)
ME [Rosenfeld, 1994]            X           X         O(|W|)       O(|W|)
WSME [Rosenfeld et al., 2001]   -           XX        -            -
DLM-PN (Section 5)              X           XX        -            -
HEM (Section 6)                 X           X         O(log |W|)   O(log |W|)
In this table, LMs are characterized by the following points: (1) training efficiency: whether the training step is efficient and scalable; (2) complex features: whether the model can use complex features such as suffixes/prefixes of previous words and long-distance information; (3) time of P(w): the computational cost of computing P(w); and (4) time of arg max_w P(w): the computational cost of computing arg max_w P(w).
5.2.1 N-gram Language Model
N-gram language models (NLMs), which estimate the probability of a sentence, are the most widely used language models in many applications.
Given a sentence S of t words, S := w_1^t,1 its probability P(S) is decomposed by the chain rule into probabilities of each word conditioned on the preceding words,

P(S) = P(w_1^t) = Π_{i=1...t} P(wi | w_1^{i−1}).

These parameters P(wi | w_1^{i−1}) can be estimated by maximum likelihood estimation as

P′(wi | w_1^{i−1}) := C(w_1^i) / C(w_1^{i−1}),    (5.2)

where C(w_1^j) is the frequency of w_1^j in the training corpus. However, since the size of the training corpus is finite, we cannot estimate these parameters accurately.
1 w_i^j denotes the subsequence of words from the i-th word to the j-th word inclusive, wi, wi+1, . . . , wj.
NLMs approximate each probability by conditioning only on the preceding N − 1 words:

P_NLM(S) := Π_{i=1...t} P(wi | w_{i−N+1}^{i−1}).    (5.3)
Since the number of parameters in an NLM is still large, several smoothing methods are applied to produce more accurate probabilities and to assign nonzero probabilities to any word string. I explain additive smoothing and interpolated smoothing here; other, more sophisticated smoothing methods can be found in [Rice, 1998]. Additive smoothing is the simplest smoothing [Lidstone, 1920]:

P(wi | w_{i−N+1}^{i−1}) = (C(w_{i−N+1}^{i}) + σ) / (C(w_{i−N+1}^{i−1}) + σ|W|),    (5.4)

where σ is an additive parameter, usually σ = 1 or σ = 1/2, and |W| is the number of distinct words. The performance of this smoothing is very poor when N is large.
Interpolated smoothing mixes the lower-order distributions with fixed parameters as

P(wi | w_{i−N+1}^{i−1}) = Σ_{j=0...N−1} λj C(w_{i−j}^{i}) / C(w_{i−j}^{i−1}),    (5.5)

where Σ_j λj = 1. The parameters λj (j = 0 . . . N − 1) are estimated by using held-out training data.
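A minimal sketch of this interpolated estimate in Python (my illustration, not the thesis code), assuming n-gram counts are stored in a dictionary keyed by word tuples, with the empty tuple holding the total token count:

    def interpolated_prob(word, history, counts, lambdas):
        """Interpolated N-gram probability in the spirit of Eq. (5.5).

        counts  : dict mapping word tuples (including the empty tuple, whose
                  value is the total number of tokens) to corpus frequencies.
        lambdas : mixing weights [lambda_0, ..., lambda_{N-1}] summing to 1;
                  lambda_j weights the (j+1)-gram estimate.
        """
        prob = 0.0
        for j, lam in enumerate(lambdas):
            context = tuple(history[-j:]) if j > 0 else ()   # last j words of the history
            num = counts.get(context + (word,), 0)
            den = counts.get(context, 0)
            if den > 0:
                prob += lam * num / den
        return prob

    # Tiny usage example with made-up counts (bigram model, lambdas = [0.3, 0.7])
    counts = {(): 5, ("i",): 2, ("am",): 1, ("i", "am"): 1, ("fine",): 1, ("am", "fine"): 1}
    print(interpolated_prob("fine", ("i", "am"), counts, [0.3, 0.7]))   # ~0.76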
NLMs are widely used in many applications because of their simplicity and efficiency. However, several drawbacks of NLMs have been reported, and I describe those related to my work.
The first is that the probabilities given by NLMs strongly depend on the length of the sentence: P(w_1^t) ∝ c^t for some constant c, so two sentences of different lengths cannot be compared directly. For example, the probability of an incorrect sentence “I are fine” is much higher than the probability of a correct sentence “I am very fine today”.
The second is that NLMs cannot include overlapping or non-local information; an NLM may, for instance, give a high probability to a sentence without a verb. To overcome this, factored language models [Bilmes and Kirchhoff, 2003] decompose the conditioning context into finer elements to handle detailed features such as the suffix of a word. Although such improvements over NLMs have been proposed, the modifications are limited and more detailed features cannot be included easily.
5.2.2 Topic-based Language Models
In earlier language models, correlated information was handled in a naive way, for example by a trigger model [Tillmann and Ney, 1996]. Recently, novel language models have been presented in which correlated word information can be considered; for example, such a model can capture the information that “nurse” is likely to appear in a sentence containing “doctor” or “hospital”. Many topic-based language models capture topic information by changing or choosing the probability distribution.
Probabilistic latent semantic indexing (PLSI) [Gildea and Hofmann, 1999] assigns a probability to a sentence or document as follows,

P(w_1^t) = Π_{w=1}^{t} Σ_{c=1}^{C} λc φ_{c,w},    (5.6)

where λc is the topic probability and φ_{c,w} is the word probability depending on the topic. In this model, a topic is chosen for each word according to λc.
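For illustration only (the parameter layout is an assumption of mine), the mixture in (5.6) can be computed as follows:

    def plsi_sentence_prob(words, topic_prior, word_given_topic):
        """Sentence probability under a PLSI-style mixture, in the spirit of Eq. (5.6).

        words            : list of word tokens
        topic_prior      : list, topic_prior[c] = lambda_c (sums to 1)
        word_given_topic : list of dicts, word_given_topic[c][w] = phi_{c,w}
        """
        prob = 1.0
        for w in words:
            # each word is generated by first choosing a topic, then the word
            prob *= sum(lam * word_given_topic[c].get(w, 0.0)
                        for c, lam in enumerate(topic_prior))
        return prob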
Latent Dirichlet allocation (LDA) [Blei et al., 2003] is an extension of PLSI, which assigns a probability to a sentence or document as follows,

P(w_1^t) = ∫ P_Dir(λ; α) Π_{w=1}^{t} Σ_{c=1}^{C} λc φ_{c,w} dλ,    (5.7)

where P_Dir(λ; α) = Z(α) Π_{c=1}^{C} λc^{αc−1} is a Dirichlet distribution, used for defining a distribution over distributions. LDA determines the topic distribution λ stochastically.
However, topic-based language models cannot easily deal with syntactic or sentence-level information, because of the difficulty of combining syntactic and topic information.
5.2.3 Maximum Entropy Language Models
A maximum entropy language model, or exponential language model [Rosenfeld, 1994], overcomes these problems in that it can use any type of feature. This model is equivalent to a multi-class logistic regression model. Given context information h and a next word c, we define a feature vector ϕ(c, h) ∈ R^m, each dimension of which corresponds to a feature function f(c, h) such as f(c, h) = I(c = “Tokyo” and h = “I live in”). Then,
we define the probability of a word c given a context h as

P(c|h) = (1/Z(h)) exp(w^T ϕ(c, h)),    (5.8)

where Z(h) = Σ_c exp(w^T ϕ(c, h)) is the normalization term and w ∈ R^m is the weight vector. An NLM is a special case of ME in which the feature functions correspond to the occurrences of N-grams.
We estimate the parameters of ME by maximum likelihood estimation. Let (ci, hi) (i = 1, · · · , N) be the examples appearing in the corpus. Then we solve the convex optimization problem

w* = arg min_w − Σ_i log P(ci|hi) + C · R(w),    (5.9)

where R(w) is the regularization term and C > 0 is the parameter that controls the trade-off between the likelihood term and the regularization term. We estimate C by cross-validation. For the regularization term, L2^2 regularization R(w) := Σ_i wi^2, L1 regularization R(w) := Σ_i |wi|, and their combination Σ_i wi^2 + C′ Σ_i |wi| (where C′ is the trade-off parameter between L1 and L2^2 regularization) are often used.
The problem of ME is its large training cost. Even at inference time, it requires a computational cost proportional to the number of candidate words.
5.2.4 Whole Sentence Maximum Entropy Model
A whole-sentence maximum entropy model (WSME) [Rosenfeld et al., 2001] assigns a probability to a sentence with a single exponential model; therefore it does not require calculating the normalization term Z(h) for each word. Given a feature vector ϕ(S) of a sentence S, WSME assigns a probability to S using a logistic regression model as follows,

p(S) = (1/Z) · p0(S) · exp(w^T ϕ(S)),    (5.10)

where Z is a normalization constant, p0(S) is an initial distribution of S obtained by another PLM such as an N-gram model, w are the parameters of the model, and ϕ(S) are arbitrary computable properties, or features, of the sentence S. For example, f1234(S) := “1 if S has you at the front of the sentence, and 0 otherwise”.
To estimate w, we again employ maximum likelihood estimation. In this estimation we need to compute the expected value of each feature ϕ(S)i under the current model. Since it is impossible to enumerate all possible sentences, sentences are sampled according to the distribution given by the current parameters, and the parameters are then updated using these samples.
The problem of WSME is its large computational cost at training; in particular, it requires sampling of sentences. Another problem is that it cannot assign a probability to each word, and therefore WSME can only be used for re-ranking sentences. Moreover, [Rosenfeld et al., 2001] reported that the improvement over previous language models is modest if only simple features are used.
5.2.5 Discriminative Language Models
In another direction, discriminative language models directly determine whether a given sentence is correct or incorrect. For example, in speech recognition we are given a set of candidate sentences, and the correct (or better) sentences are known. In this case, we can directly learn a model that distinguishes correct sentences from incorrect ones [Roark et al., 2007, Gao et al., 2005].
Formally, a discriminative language model (DLM) assigns a score f(S) ∈ R to a sentence S, measuring the correctness of the sentence in terms of grammar and pragmatics, so that f(S) > 0 implies S is correct and f(S) < 0 implies S is incorrect. PLMs can be considered a kind of DLM if f(S) := f′(P(S)), where f′ is a monotonically increasing function. For example, we can use f(S) = P(S)/|S| − c, where c is some threshold and |S| is the length of S.
In this thesis, I consider the case where the function is a linear function, as in many previous studies [Roark et al., 2007, Gao et al., 2005]. Given a sentence S, we compute a feature vector ϕ(S) from it using a predefined set of feature functions {ϕj}_{j=1}^{m}. The form of the function f we use is

f(S) = w^T ϕ(S),    (5.11)

where w is a feature weight vector.
Since there is no restriction on the design of ϕ(S), DLMs can employ any overlapping or non-local information in S. I estimate w using training samples {(Si, yi)} for i = 1 . . . t, where yi = 1 if Si is correct and yi = −1 if Si is incorrect.
However, it is hard to obtain incorrect sentences, because only correct sentences are available from a corpus. This was not a problem in previous studies, because they assume specific applications and can therefore obtain real negative examples easily. For example, Roark [2007] proposed a discriminative language model in which the model is trained using sets of outputs produced by a speech recognition system. The difference between their approach and ours is that we do not assume any particular application. Moreover, they always have a group consisting of one correct sentence and many incorrect sentences, which are very similar to each other because they are generated from the same input. In contrast, our framework does not assume any such groups in the training data and treats correct and incorrect examples independently during training.
5.3 Discriminative Language Model with Pseudo-Negative samples
I propose a novel discriminative language model, the Discriminative Language Model with Pseudo-Negative samples (DLM-PN). In this model, pseudo-negative examples are sampled from a PLM, and all of them are assumed to be incorrect.
First, a probabilistic language model is built from the training data, and then examples, almost all of which are incorrect, are sampled independently from it. The DLM is trained using correct sentences from the corpus and negative examples from this pseudo-negative generator.
An advantage of sampling is that as many negative examples as correct ones can be collected, and that the difference between truly correct sentences and incorrect sentences that are correct only in a local sense can be clarified.
For sampling, any PLM can be used as long as the model supports a sentence sampling procedure. In particular, I used an NLM with interpolated smoothing because it supports efficient sentence sampling. Algorithm 6 describes the sampling procedure.
Since the focus is on discriminating between correct sentences from the corpus and incorrect sentences sampled from the N-gram model, DLM-PN may not be able to identify incorrect sentences that would not be generated by an N-gram model. However, this is not a serious problem, because such sentences can be filtered out by NLMs even if they exist.
Algorithm 6 Sampling procedure for pseudo-negative examples from an NLM.
for i = 1, 2, . . . do
  Sample wi according to p(wi | wi−N+1, . . . , wi−1)
  if wi is the end-of-sentence symbol (EOS) then
    break
  end if
end for
Output: w1, w2, . . ., EOS
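A minimal sketch of this sampler (my illustration; it assumes N ≥ 2 and that the conditional distributions are given as a dict from context tuples, padded with "<s>" at the sentence start, to lists of (word, probability) pairs):

    import random

    def sample_sentence(cond_dist, n, eos="</s>", max_len=100):
        """Sample one pseudo-negative sentence from an N-gram model (cf. Algorithm 6)."""
        words = ["<s>"] * (n - 1)
        while len(words) < max_len:
            context = tuple(words[-(n - 1):])
            cands, probs = zip(*cond_dist[context])
            w = random.choices(cands, weights=probs, k=1)[0]   # sample the next word
            if w == eos:
                break
            words.append(w)
        return words[n - 1:]          # strip the <s> padding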
We know of no program , and animated discussions about
prospects for trade barriers or regulations on the rules
of the game as a whole , and elements of decoration of
this peanut-shaped to priorities tasks across both
target countries
Figure 5.1: Example of a sentence sampled by PLMs.
Although DLM-PN can be trained with any binary classification method, the number of training examples is very large, and batch training suffers from prohibitively large computational cost in terms of time and memory. I therefore used the passive-aggressive algorithm, an online learning algorithm that requires a much smaller computational cost (Section 2.6).
5.4 Fast Kernel Computation
In DLMs, correlation information, i.e. combinations of features, is important for capturing non-local information. If the kernel trick is applied to online max-margin learning, a subset of the observed examples, called the active set, needs to be stored. In contrast to the support set in SVMs, an example is added to the active set every time the online algorithm makes a prediction mistake or its confidence in a prediction is too low. Therefore the active set becomes significantly large, and the total computational cost becomes quadratic in the number of training examples. Since the number of training examples is very large, the computational cost is prohibitive even with the kernel trick.
[Figure 5.2 shows the overall framework: a probabilistic LM (e.g. an N-gram LM) is built from the corpus; sentences sampled from it are used as (pseudo-)negative examples and corpus sentences as positive examples; a binary classifier is trained on these examples and returns a positive/negative label or a score (margin) for test sentences.]
Figure 5.2: Framework of our classification process.
The inner product between two examples can be computed by intersecting the sets of fired features in each example. This is similar to a merge in merge sort and can be done in O(m) time, where m is the average number of fired features in an example. When the number of examples in the active set is a, the total computational cost is O(m · a) time. For kernel computation, PKI (Polynomial Kernel Inverted) has been proposed [Kudo and Matsumoto, 2003]. PKI is an extension of the inverted index used in information retrieval: in the online setting, we maintain, for each feature, the list of active-set examples in which it fires. Algorithm 7 shows pseudocode for PKI.
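For concreteness, a small Python sketch of this inverted-index kernel evaluation (my illustration; the polynomial-kernel parameters c and d and the data layout are assumptions):

    from collections import defaultdict

    def pki_score(x, inverted, alpha, c=1.0, d=3):
        """Score a sparse example x against the active set with a polynomial kernel.

        x        : dict feature_id -> value for the new example
        inverted : dict feature_id -> list of (example_id, value) pairs, i.e. the
                   inverted index over active-set examples
        alpha    : dict example_id -> learned coefficient alpha_j
        Returns sum_j alpha_j * (<x, x_j> + c)^d over active examples sharing a feature."""
        dot = defaultdict(float)                   # C[j]: inner product with example j
        for i, xi in x.items():
            for j, hij in inverted.get(i, ()):     # only examples where feature i fires
                dot[j] += xi * hij
        return sum(alpha[j] * (cj + c) ** d for j, cj in dot.items())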
5.5 Latent Features by Semi-Markov Class Model
Another problem of DLMs is that the number of features becomes very large, since all possible N-grams are used as features. In particular, the memory requirement is a serious problem, because many active-set examples with many features have to be stored, not only at training time but also at inference time. To deal with this, filtering out low-confidence features
Algorithm 7 Pseudocode for PKI.
Input: x
C: an array storing the inner products between x and each active-set example
for i ∈ {i | xi ≠ 0} do
  for j ∈ {j | h(i)j ≠ 0} do
    C[j] := C[j] + xi h(i)j
  end for
end for
r := 0
for j ∈ {j | C[j] ≠ 0} do
  r := r + αj (C[j] + c)^d
end for
Output: r
would be effective, but it is difficult to decide which features are important during online learning. Therefore, similar N-grams are instead clustered to reduce the number of features, by using a semi-Markov class model.
5.5.1 Class Model
The class model was originally proposed by [Martin et al., 1998]. In the class model, deterministic word-to-class mappings are estimated, keeping the number of classes much smaller than the number of distinct words. The class model was extended into a semi-Markov class model (SMCM), a part of which was proposed by [Deligne and Bimbot, 1995]. I generalize it as a class model in which a word sequence is partitioned into a variable-length chunk sequence, and each chunk is clustered into a class depending on the adjacent chunks. As an example, I explain the use of a bi-gram class model. The probability of a sentence w1, . . . , wt in a bi-gram class model is calculated by

P(w1, . . . , wt) = Π_i P(wi+1 | ci+1) P(ci+1 | ci).    (5.12)
On the other hand, the probability in a bi-gram semi-Markov class model is calculated by

P(w1, . . . , wt) = Σ_s Π_i P(ci | ci−1) · P(w_{s(i), s(i)+1, ..., e(i)} | ci),    (5.13)
where s ranges over all possible partitions of the sentence, s(i) denotes the beginning position and e(i) the end position of the i-th chunk in partition s, and s(i + 1) = e(i) + 1 for all i. Note that each word and variable-length chunk belongs to only one class, unlike in a hidden Markov model, where a word can belong to several classes. Using a training corpus, the mapping from chunks to classes is obtained by maximum likelihood estimation. The log-likelihood of the training corpus (w1, . . . , wn) in a bi-gram class model can be calculated as
log Π_i P(wi+1 | wi)
  = Σ_i log P(wi+1 | ci+1) P(ci+1 | ci)
  = Σ_i log [ C(wi+1, ci+1)/C(ci+1) · C(ci+1, ci)/C(ci) ]
  = Σ_{c1,c2} C(c1, c2) log [ C(c1, c2) / (C(c1) C(c2)) ] + Σ_w C(w) log C(w).    (5.14)
In (5.14), only the first term is used, since the second term does not depend on the class allocation. The allocation problem is solved by an exchange algorithm as follows: for each word, we move it to the class in which the log-likelihood is maximized, and we continue this until no movement occurs. A naive implementation of this exchange algorithm scales quadratically with the number of classes, since each time a word is moved to another class, all class bi-gram counts are potentially affected. However, by only considering those counts that actually change, the algorithm can be made to scale somewhere between linearly and quadratically with the number of classes [Martin et al., 1998]. I explain the details of the implementation that relate to my improvement; other details can be found in [Martin et al., 1998].
In this algorithm we keep the following data structures, where frequency means frequency in the training corpus.
• wuni: wuni[i] keeps the frequency of wi.
• wbi: wbi[i][j] keeps the frequency of wi wj.
• cmap: cmap[i] keeps the class assigned to word wi.
• cw: cw[i][j] keeps the frequency of ci followed by wj.
• wc: wc[i][j] keeps the frequency of wi followed by cj.
• cbi: cbi[i][j] keeps the frequency of ci cj.
• cuni: cuni[i] keeps the frequency of ci.
Recall that the first term of (5.14), and thus the quality of the current class allocation, can be calculated using cbi and cuni.
We then update the class bi-gram counts after moving a word w from its current class cold to a class cnew as follows2. For c ∉ {cold, cnew},

cbi[c][cold] −= cw[c][w]
cbi[cold][c] −= wc[w][c]
cbi[c][cnew] += cw[c][w]
cbi[cnew][c] += wc[w][c],

and for cold and cnew themselves,

cbi[cold][cold] += −cw[cold][w] − wc[w][cold] + wbi[w][w]
cbi[cold][cnew] += +cw[cold][w] − wc[w][cnew] − wbi[w][w]
cbi[cnew][cold] += −cw[cnew][w] + wc[w][cold] − wbi[w][w]
cbi[cnew][cnew] += +cw[cnew][w] + wc[w][cnew] + wbi[w][w].

Using these updates, we can check the difference in log-likelihood for one trial move in O(|C|) time, and thus one iteration can be performed in O(|C|^2 |W|) time.
2 a += b denotes a := a + b, and a −= b denotes a := a − b.
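As a rough illustration of the exchange algorithm's objective (my sketch, not the thesis implementation, which updates the counts incrementally as above), the first term of (5.14) can be evaluated directly from class bi-gram and uni-gram counts:

    import math
    from collections import Counter

    def class_bigram_objective(words, cmap):
        """First term of Eq. (5.14): sum over class bigrams of
        C(c1,c2) * log( C(c1,c2) / (C(c1) * C(c2)) ), given a word-to-class map."""
        classes = [cmap[w] for w in words]
        cuni = Counter(classes)
        cbi = Counter(zip(classes, classes[1:]))
        return sum(n * math.log(n / (cuni[c1] * cuni[c2]))
                   for (c1, c2), n in cbi.items())

A naive exchange step then tries moving each word to every class, keeps the move with the best objective, and stops when no move improves it; the incremental cbi updates above reduce the cost of each trial from recounting the corpus to O(|C|).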
5.6 Improvement of Exchange Algorithm with Filters and Bottom-up Clustering
Although the exchange algorithm is efficient enough for a class model, it is not suitable for a semi-Markov class model, because the number of chunks is much larger than the number of words. I therefore propose two improvements to the exchange algorithm. The first approximates the computation in the exchange algorithm, and the second is a bottom-up clustering that improves convergence.
The first technique is an approximation of the difference in log-likelihood before and after moving a word to another class. This difference becomes small for almost all trial moves in later iterations, so by using an approximated value we can filter out many useless trials. For each word w, t classes are sampled from cw and wc according to their frequencies, and two arrays cw′ and wc′ of length t are built. Using these arrays, an approximate value of the log-likelihood difference is computed, and the original exchange step is applied only if the approximated value is larger than a pre-defined threshold. Since the two arrays are built only once before each word's iteration, the cost of building them is negligible.
The second technique concerns memory rather than time complexity. Since the count matrices can become very large (for example, the number of chunks in the experiments is about 3 million and the number of classes is 500), instead of clustering chunks directly into the pre-defined number of clusters, we first cluster them into 2 classes and then cluster each of these two classes independently into 2 again, giving 4 classes in total. This procedure is applied recursively until the number of classes reaches the pre-defined number.
5.6.1 Semi-Markov Class Model
In SMCM, we need to decide not only the chunk-to-class mapping but also the word-to-chunk partition of each sentence. I employ Viterbi decoding to estimate the partition of a sentence, then treat the resulting chunks as words in a class model and apply the exchange algorithm. These two steps are iterated until the change in the objective value becomes smaller than a pre-defined threshold.
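The decoding step can be sketched as a semi-Markov Viterbi search over chunk boundaries. This is an illustration under simplifying assumptions of my own: the chunk-to-class map is fixed, start transitions are stored under the previous class None, log_p_emit is defined for every chunk in cmap, and every sentence is assumed to be segmentable into known chunks.

    import math

    def smcm_viterbi(words, cmap, log_p_trans, log_p_emit, max_chunk_len=4):
        """Semi-Markov Viterbi segmentation sketch for the SMCM decoding step.

        cmap        : dict chunk (tuple of words) -> class id (deterministic mapping)
        log_p_trans : dict (prev_class, class) -> log P(class | prev_class)
        log_p_emit  : dict chunk -> log P(chunk | its class)
        Returns the best-scoring partition of `words` into known chunks."""
        n = len(words)
        # best[j] maps the class of the last chunk -> (score, backpointer)
        best = [dict() for _ in range(n + 1)]
        best[0][None] = (0.0, None)
        for j in range(1, n + 1):
            for l in range(1, min(max_chunk_len, j) + 1):
                chunk = tuple(words[j - l:j])
                if chunk not in cmap:
                    continue
                c = cmap[chunk]
                for prev_c, (score, _) in best[j - l].items():
                    s = score + log_p_trans.get((prev_c, c), -math.inf) + log_p_emit[chunk]
                    if c not in best[j] or s > best[j][c][0]:
                        best[j][c] = (s, (j - l, prev_c, chunk))
        # backtrace from the best final state
        c = max(best[n], key=lambda k: best[n][k][0])
        chunks, j = [], n
        while j > 0:
            _, (i, prev_c, chunk) = best[j][c]
            chunks.append(chunk)
            j, c = i, prev_c
        return list(reversed(chunks))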
Figure 5.3: Example of assignment in SMCM. A sentence is partitioned into variable-length chunks and each chunk is assigned a unique class.
Figure 5.3 shows an example of the semi-Markov class model. In this example, the first class c1 corresponds to a chunk of length 2 (w1 w2).
5.7 Experiments
5.7.1 Experimental Setup
I partitioned the BNC corpus into model-train, DLM-train-positive, and DLM-test sets. The number of sentences is 4500 × 10^3 in model-train, 500 × 10^3 in DLM-train-positive, and 10 × 10^3 in DLM-test. An NLM was built from model-train, and pseudo-negative examples were sampled from it. The numbers of positive and pseudo-negative examples are equal3. I mixed the sentences from DLM-train-positive with the pseudo-negative examples and shuffled their order to build DLM-train. I call the sentences from DLM-train-positive “positive” examples and the sentences from the pseudo-negative generator “negative” examples. I eliminated sentences of fewer than 5 words from these corpora, because it is difficult to decide whether such sentences are correct or not (e.g. compound words). Next, semi-Markov classes were extracted using model-train; the number of extracted chunks is 2.76 × 10^6.
3 The expected length of the pseudo-negative examples is the same as that of the positive examples.
5.7.2 Experiments on Pseudo-Samples
I examined the properties of the pseudo-negative samples to justify our framework. I sampled 100 sentences from DLM-train, and a native English speaker was asked to assign correct/incorrect labels to them4. All positive examples were labeled as correct, and all negative examples except one were labeled as incorrect. From this result, I can say that the sampling method is able to generate incorrect sentences, and that if a classifier can discriminate these samples, it can discriminate correct from incorrect sentences. Note that the annotator took an average of 25 seconds per sentence, which suggests that it is difficult even for a human to identify incorrect sentences.
I also examined whether correct and incorrect sentences can be discriminated by syntactic parsers; if so, parsing could be used as a classification tool. I applied a phrase structure parser [Charniak and Johnson, 2005] and an HPSG parser [Yusuke and Tsujii, 2005] to the 100 sentences. All sentences were parsed successfully except one positive example. This result indicates that the difference between correct sentences and pseudo-negative examples does not lie in syntactic errors.
5.7.3 Experiments on DLM-PN
I then examined the effect of each feature type in the DLM. For words and part-of-speech (POS) tags, I used trigram features; from the SMCM mapping function, I used class bi-grams as features. I used DLM-train as the training set. In all experiments, the hyper-parameter of the classifier was set to C = 50.0 (Section 5.3), and in all results with kernels, PKI was used to compute the kernel values. Table 5.2 shows the accuracy results with different features; the rows SMCM (|C| = 100) and SMCM (|C| = 500) show the accuracy when the number of SMCM classes is 100 and 500, respectively. The results show that the kernel is indeed important for achieving high performance. Note that the performance of SMCM features is comparable to that of word features.
Table 5.3 shows the number of features for each method. Note that a new feature is added only when the classifier needs to update its parameters, so these numbers are smaller than the total number of candidate features. For example, the number of possible
4 Since the PLM is also built from the BNC corpus, the samples cannot be classified simply by the tendency of their word content.
Table 5.2: Performance of language models on the evaluation data.

                              Accuracy (%)   Training time (s)
Linear classifier
  word                        51.28          137.1
  POS                         52.64          85.0
  SMCM (|C| = 100)            51.79          304.9
  SMCM (|C| = 500)            54.45          422.1
3rd-order polynomial kernel
  word                        73.65          20143.7
  POS                         66.58          29622.9
  SMCM (|C| = 100)            67.11          37181.6
  SMCM (|C| = 500)            74.11          34474.7
Table 5.3: The number of features of the DLM.

                     # of distinct features
word tri-gram        15773230
POS tri-gram         35376
SMCM (|C| = 100)     9335
SMCM (|C| = 500)     199745
distinct features in SMCM (|C| = 100) is 10000 = 100 · 100. These results show that SMCM achieves high performance with very few features.
I then examined the effect of PKI (the inverted index method), using SMCM bi-gram features and a third-order polynomial kernel. Table 5.4 shows the results. In this experiment, 200 × 10^3 sentences from DLM-train were used in both settings, because training on all the training data would have required more time than was available for our experimental setup. The results indicate that the index greatly reduces the computational cost, and that inference can be performed in less than 0.1 seconds, which is reasonable for
Table 5.4: Comparison of classification performance with/without the PKI index.

             Training time (s)   Inference time (ms)
Baseline     37665.5             370.6
with Index   4664.9              47.8
real world applications.
Finally, I examined learning curves to measure the effect of the amount of training data on performance. Figure 5.5 shows the results of the classification task using SMCM bi-gram features. The results suggest that the performance can still be improved by increasing the training data.
Figure 5.4 shows the margin distribution for correct sentences and pseudo-negative examples using SMCM bi-gram features. Although many examples are close to the decision boundary (margin = 0), positive and negative examples are distributed on the > 0 and < 0 sides, respectively. Therefore, higher recall or precision can be achieved by setting a margin threshold other than 0.
5.8 Discussion
More sophisticated sampling techniques are promising; for example, sentences could be sampled from a whole-sentence maximum entropy model [Rosenfeld et al., 2001], which is future work.
The results without kernels indicate that a non-linear model is important for discriminating correct from incorrect sentences. Therefore, it would be helpful to use such feature combinations in PLMs as well; recent successes of topic-based language models [Blei et al., 2003, Wang et al., 2005] also indicate the importance of this phenomenon.
Contrastive estimation [Smith and Eisner, 2005] is similar to our approach in that it creates pseudo-negative examples. It builds a neighborhood of each input example, for example by changing or deleting one word, represents it as a lattice, and then estimates parameters efficiently for unsupervised learning. In contrast, I generate independent pseudo-negative examples to make discriminative training possible.
Figure 5.4: Margin distribution using SMCM bi-gram features.
5.9 Conclusion
In this chapter I have presented a novel discriminative language model using pseudo-negative examples. I have also shown that an online max-margin learning method enables us to handle one million sentences, achieving about 75% accuracy in the task of discriminating positive from pseudo-negative examples. The experiments indicate that pseudo-negative examples are incorrect and yet close to correct sentences, since parsers cannot distinguish them. The experimental results also show that combinations of features are effective for discriminating correct from incorrect sentences, a point that has not been discussed for probabilistic language models.
To scale up the problem in terms of the number of examples and features, more refined kernel-based learning methods such as [Cheng et al., 2006] and [Dekel et al., 2005] would be required. Another interesting direction is to handle the probabilistic language model directly, without sampling, and to learn the discriminative model more efficiently and accurately.
Figure 5.5: A learning curve for SMCM (|C| = 500). The accuracy is the performance on the evaluation set.
I am also interested in applications of this model, not only to machine translation and speech recognition, but also to identifying incorrect sentences written by non-native speakers, as an extended version of a spelling correction tool.
Chapter 6
Hierarchical Exponential Models for Problems with Many Classes
This chapter presents another novel language model, called a Hierarchical Exponential Model (HEM). In language modeling, existing multi-class classifiers cannot be used directly because the number of candidate outputs is very large, and they therefore require impractical computational cost at training and inference time. A HEM reduces this cost by structuring the search space hierarchically.
6.1 Problems of Previous Language Models
In this chapter, we consider the task of predicting the next word c given context information h. Examples of context information are the previous words and long-range information (e.g., some word x appeared earlier in the context). Moreover, we also estimate the conditional probability p(c|h), which is useful for other applications.
We often estimate p(c|h) by maximum likelihood estimation using a training corpus,

p_MLE(c|h) = C(c, h) / C(h),    (6.1)

where C(c, h) is the number of times c appears in the context h in the training corpus, and C(h) is the number of times the context h appears in the training corpus. However, this estimate is very unstable even if we use a very large corpus. For example, we cannot estimate the probability at all if the context h is unseen in the training data. Therefore, some smoothing method should be applied to estimate p(c|h).
An N-gram language model (NLM) approximates this probability by letting h correspond to the preceding N − 1 words w only:

p_NLM(c|h) = C(w, c) / C(w),    (6.2)

where C(w, c) is the frequency of the word sequence w, c in the training corpus, and C(w) is the frequency of the word sequence w in the training corpus.
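As an illustration of Equation (6.2), the following minimal sketch (Python; the toy corpus is hypothetical) estimates N-gram probabilities directly from counts collected over a training corpus.

```python
from collections import Counter

def ngram_mle(corpus, n=3):
    """Estimate p_NLM(c | w) = C(w, c) / C(w), where w is the
    preceding n-1 words (Equation 6.2)."""
    context_counts, event_counts = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] * (n - 1) + sentence + ["</s>"]
        for i in range(n - 1, len(tokens)):
            w = tuple(tokens[i - n + 1:i])  # preceding n-1 words
            c = tokens[i]                   # next word
            context_counts[w] += 1
            event_counts[(w, c)] += 1

    def prob(context, word):
        w = tuple(context[-(n - 1):])
        if context_counts[w] == 0:
            return 0.0  # unseen context: this is where smoothing is needed
        return event_counts[(w, word)] / context_counts[w]

    return prob

# Toy usage with a tiny hypothetical corpus.
prob = ngram_mle([["the", "cat", "sat"], ["the", "cat", "ran"]], n=3)
print(prob(["the", "cat"], "sat"))  # 0.5
```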
This simple approximation performs very well in many applications. However, there are many other useful features for predicting the next word. For example, a suffix (or prefix) of the previous word is effective for determining the next word, and cache or trigger features (the occurrence of some word earlier in the context) are also effective.
To exploit these features freely, we can use a maximum entropy language model (ME, Section 2.3), also called an exponential language model in [Rosenfeld, 1994]. Recall that an ME is defined as

p(c|h; w) = (1 / Z(h)) exp(w^T ϕ(c, h)),    (6.3)

where ϕ(c, h) is a feature vector for the pair of input h and output c, w is the weight vector, and Z(h) = Σ_{c′} exp(w^T ϕ(c′, h)) is a normalization term, or partition function. We can use any information about the context h and the next word c in defining ϕ(c, h).
In ME, the most probable word can be found by

c* = arg max_c p(c|h; w)    (6.4)
   = arg max_c w^T ϕ(c, h),    (6.5)

since exp is a monotonically increasing function.
The problem with ME is its computational cost: it increases proportionally with the number of labels, because the computation of Z(h) requires a summation over all candidate words. Since the number of candidate words is very large, ME is impractical for language modeling. Moreover, since the context h differs at each word position, we need to recalculate Z(h) at each position, and pre-computation is difficult.
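The following minimal sketch (Python; illustrative only, with a hypothetical feature function phi and vocabulary list) shows where this cost comes from: evaluating Equation (6.3) at a single position requires summing exponentiated scores over the entire vocabulary to obtain Z(h).

```python
import numpy as np

def me_probability(w, phi, context, word, vocabulary):
    """Naive maximum entropy probability p(word | context; w) (Equation 6.3).

    The partition function Z(h) sums over every candidate word, so a single
    query costs O(|W|) feature evaluations; this is the bottleneck that the
    hierarchical model is designed to avoid."""
    scores = np.array([w @ phi(c, context) for c in vocabulary])
    z = np.exp(scores).sum()  # O(|W|) work at every word position
    return np.exp(w @ phi(word, context)) / z
```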
6.2 Hierarchical Exponential Model
I present a novel language model called a Hierarchical Exponential Model (HEM). While this model can use any type of features, as an ME does, it can efficiently infer the probability of a word. Moreover, it supports an efficient arg max operation that returns the most probable word among the candidate words.

Figure 6.1: An example of a hierarchical tree in a hierarchical exponential model
First, I give an overview of a HEM. We represent the set of candidate words as a binary tree. This tree can be estimated by any method, such as hierarchical clustering. A different binary logistic regression model is associated with each internal node in the tree; the output −1 corresponds to the left child, and the output +1 corresponds to the right child. Note that we can use any type of features in these binary logistic regression models. Then, the probability of a word w is defined as the product of the probabilities given by the binary logistic regression models on the path from the root to the leaf corresponding to w. Obviously, the sum of the probabilities of the candidate words equals 1, and therefore this binary tree assigns a probability distribution over the candidate words. By restricting the height of the tree to O(log |W|), where |W| is the number of candidate words, we can efficiently estimate the probability without enumerating all candidate words.
Figure 6.1 shows an example of a HEM when the candidate words are {a, b, c, d, e}.
A HEM is similar to a decision tree in that a word is predicted by a sequence of binary decisions. However, in a HEM, each decision is associated with a probability assigned by a binary logistic model that can use any type of features, and therefore it defines a probability distribution over the candidate words.
Let us now define a HEM formally. Let C be the set of candidate words, and let T_C be a binary tree with |C| leaves and |C| − 1 internal nodes, each leaf of which corresponds to a word in C. We call this tree a label tree. Each word c is associated with a binary code Bc, which is constructed as follows: on the path from the root to the leaf corresponding to c, we append 0 if we go to the left child and 1 if we go to the right child. An example of such a tree is a Huffman tree used in data compression. Let Bc[j] ∈ {0, 1} be the j-th bit of Bc, and let N(Bc[j]) denote the j-th internal node on the path from the root to the leaf corresponding to c.
For example, in Figure 6.1, the binary code for the word a is 011, so Ba[1] = 0, Ba[2] = 1, and Ba[3] = 1.
A feature vector for context information h is denoted as ϕ(h) ∈ R^m. Then we associate a binary logistic regression model with each internal node v:

Pv(1|h) = 1 / (1 + exp(−wv^T ϕ(h))),    (6.6)
Pv(0|h) = 1 − Pv(1|h) = 1 / (1 + exp(wv^T ϕ(h))),    (6.7)

where wv ∈ R^m is the weight vector corresponding to the internal node v.
Then, the probability for a word c is given by the product of the classification results
from the root to the leaf node corresponding to c,
P(c|h) = ∏_{j=1}^{|Bc|} P_{N(Bc[j])}(Bc[j] | h).    (6.8)
This can be viewed as follows: starting from the interval [0, 1], each internal node recursively divides its region into two disjoint sub-regions.
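To make Equation (6.8) concrete, the following minimal sketch (Python; the data layout, a list of per-node weight vectors aligned with the bits of the code, is an assumption for illustration) computes P(c|h) as a product of logistic outputs along the path from the root to the leaf.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hem_probability(binary_code, path_weights, phi_h):
    """P(c|h) = prod_j P_{N(Bc[j])}(Bc[j] | h)  (Equation 6.8).

    binary_code  -- bits Bc[1..|Bc|] of word c, e.g. [0, 1, 1]
    path_weights -- weight vector w_v of each internal node on the path
    phi_h        -- feature vector phi(h) of the context"""
    prob = 1.0
    for bit, w_v in zip(binary_code, path_weights):
        p_right = sigmoid(w_v @ phi_h)       # P_v(1|h), Equation (6.6)
        prob *= p_right if bit == 1 else 1.0 - p_right
    return prob

# Toy usage: a word with code 011 and three internal nodes on its path.
phi_h = np.array([1.0, 0.0, 1.0])
weights = [np.zeros(3), np.ones(3), np.ones(3)]
print(hem_probability([0, 1, 1], weights, phi_h))
```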
6.2.1 Learning
In a HEM, we need to decide both the structure of the label tree and the parameters of the classifiers. Although we might obtain a more accurate model by estimating the tree structure and the parameters jointly, I take a simpler approach: the tree structure is estimated first, and then the parameters of each internal node are estimated.
First, to estimate the structure of the label tree, we begin with the set of all candidate words and recursively split it into two disjoint sets so that the words in
the same set have similar context information. To achieve this, I adapted a one-side class model [Whittaker and Woodland, 2001] because it is simple and efficient.
Let r_i be the class of a word c_i. Then, in a one-side class model, the probability of a word c_i given context information h_i is defined as

P_oscm(c_i | h_i) = P(c_i | r_{i−1}).    (6.9)
That is, the probability depends only on the class of the immediately preceding word. We then split a set of words so that the likelihood of the training corpus is maximized, as follows. First, every word is randomly assigned to one of the two candidate classes. Next, for each word, we compute the change in likelihood that would result from moving the word to the other class. The word is actually moved if this change is positive. We continue this procedure until no more movements occur, and we apply the process recursively to both classes until every class contains only one word. A simplified sketch of this splitting step is given below.
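The following simplified sketch (Python; not the thesis implementation) illustrates one such splitting step. The log-likelihood of the one-side class model in Equation (6.9) is recomputed from class-to-word bigram counts after each tentative move, and the move is kept only if the likelihood increases; recomputing the likelihood from scratch is inefficient but keeps the idea clear.

```python
import math
import random
from collections import Counter

def split_words(words, corpus, iterations=10, seed=0):
    """Split `words` into two classes by exchange moves that increase the
    log-likelihood of a one-side class model P(c_i | r_{i-1})."""
    rng = random.Random(seed)
    assign = {w: rng.randint(0, 1) for w in words}  # random initial classes

    def log_likelihood(assign):
        pair, ctx = Counter(), Counter()
        for sent in corpus:
            for prev, cur in zip(sent, sent[1:]):
                if prev in assign:              # class of the preceding word
                    r = assign[prev]
                    pair[(r, cur)] += 1
                    ctx[r] += 1
        return sum(n * math.log(n / ctx[r]) for (r, _), n in pair.items())

    best = log_likelihood(assign)
    for _ in range(iterations):
        moved = False
        for w in words:
            assign[w] ^= 1                      # tentatively move w
            cand = log_likelihood(assign)
            if cand > best:
                best, moved = cand, True        # keep the move
            else:
                assign[w] ^= 1                  # undo the move
        if not moved:
            break
    return assign
```

Applying this split recursively to each of the two resulting classes, until every class contains a single word, yields the label tree.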
Given a label tree, we can estimate the weight vector of each internal node independently. Therefore, we can parallelize the estimation of these weight vectors, which is impossible in previous MEs.
6.2.2 Features
In a HEM, any information in the context can be used as a feature. For example, we can use the previous N − 1 words, as in an N-gram model. In another case, the occurrence of some word in a long context (e.g., the previous 100 words) can be used, as in a trigger language model [Tillmann and Ney, 1996]. Other examples are a suffix and a prefix of a word, an orthographic feature, and the position in the document. Note that the weights of all these features are estimated automatically.
More importantly, we can use the path information of the previous words as features. An example of such a feature is shown in Figure 6.2. For example, when the previous word is “a”, whose binary code is 011, the prefixes of this code, 0, 01, and 011, can be used as features. This can be regarded as a hierarchical class model [Martin et al., 1998], and we can expect a smoothing effect. A virtue of this feature function, compared to the original class model [Martin et al., 1998], is that the optimal number of classes need not be determined beforehand, because the weights of the prefixes are determined automatically; a small sketch of extracting these prefix features is given below.
Figure 6.2: Path information in a hierarchical tree for predicting the next word.
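The following small sketch (Python; the feature-name format is an illustrative assumption) extracts these prefix-of-path features from the binary code of the previous word.

```python
def path_prefix_features(binary_code, feature_prefix="prev_path="):
    """Turn the binary code of the previous word into hierarchical path
    features: every prefix of the code becomes one feature string."""
    code = "".join(str(b) for b in binary_code)
    return [feature_prefix + code[:k] for k in range(1, len(code) + 1)]

# The previous word "a" has code 011, so its features are the prefixes.
print(path_prefix_features([0, 1, 1]))
# ['prev_path=0', 'prev_path=01', 'prev_path=011']
```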
Another virtue of this feature function is that the prediction becomes more robust. When the previous prediction fails or an unknown word appears, previous LMs cannot use the context information and their accuracy drops. However, even if the final prediction is incorrect, some prefixes of the path information are still correct, and these prefixes remain useful for estimating the next word.
6.2.3 Inference
In a HEM, given context information h and a candidate word c, the conditional probability P(c|h) is estimated in O(log |W|) time, where |W| is the number of distinct words. This is because we only examine the classification results along the path from the root to the leaf corresponding to c, and the height of the tree is O(log |W|). This is a large improvement over MEs, which require O(|W|) time in the worst case.
Moreover, a HEM supports an efficient argmax operation, c* = arg max_c P(c|h), for a context h. Previous language models cannot support this operation efficiently; in the worst case they require O(|W|) time. This problem can be viewed as finding the maximum-weight path from the root to a leaf, where the weight of an edge is the logarithm of the corresponding probability. By applying a branch-and-bound method during the search, the argmax can be computed in O(log |W|) time in many cases, although the worst-case time is still O(|W|).
Table 6.1: Corpus statistics in HEM

                        CSJ           BNC
Number of Words         8.85 × 10^6   5.38 × 10^7
Number of Word Types    2.81 × 10^4   1.16 × 10^6
In particular, in many cases, the probability distribution over candidate words is very skewed, and a beam search is sufficient. In a beam search, we keep the top K partial paths with the highest probability; at each step we pop the most probable partial path, expand it to its children, and push the resulting candidates back into the queue, as sketched below.
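The following is a minimal sketch (Python; the dictionary-based tree representation is an assumption for illustration) of such a best-first beam search over the label tree, where partial paths are kept in a priority queue keyed by their accumulated negative log-probability.

```python
import heapq
import math
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def beam_argmax(root, phi_h, beam_width=8):
    """Approximate c* = arg max_c P(c|h) over a label tree.

    An internal node is assumed to be {"w": weight_vector, "left": child,
    "right": child}; a leaf is {"word": str}. The queue stores
    (-log P of the partial path, tiebreak, node)."""
    heap = [(0.0, 0, root)]
    counter = 1
    best_word, best_logp = None, -math.inf
    while heap:
        neg_logp, _, node = heapq.heappop(heap)
        if "word" in node:                       # reached a leaf
            if -neg_logp > best_logp:
                best_word, best_logp = node["word"], -neg_logp
            continue
        p_right = sigmoid(float(np.dot(node["w"], phi_h)))
        for prob, child in ((1.0 - p_right, node["left"]), (p_right, node["right"])):
            if prob > 0.0:
                heapq.heappush(heap, (neg_logp - math.log(prob), counter, child))
                counter += 1
        if len(heap) > beam_width:               # prune to the beam width
            heap = heapq.nsmallest(beam_width, heap)
            heapq.heapify(heap)
    return best_word, math.exp(best_logp)
```

Without pruning, the first leaf popped would already be the exact argmax, since the accumulated negative log-probability never decreases along a path; the beam width trades this exactness for speed.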
6.3 Experiments
We conducted experiments on a Japanese spoken language corpus (CSJ) and the English BNC corpus (BNC). We divided each corpus into training and test sets with a 5 : 1 ratio. Statistics of these corpora are shown in Table 6.1.
We used a trigram model with modified Kneser-Ney smoothing as a baseline; this is a state-of-the-art language model. We trained the proposed model on the training corpus, estimating both the tree structure and the parameters. As features, we used the previous word and the word before it, which is the same information as used by trigram models. Moreover, we used prefix-of-path features of the previous words with prefix lengths of 1, 2, and 4. When training the binary logistic regression models, we applied L1 regularization; the hyper-parameter was manually adjusted using samples from the training corpus.
To measure the performance of the LMs, we used the perplexity, defined as 2^H, where H is the average negative log2 conditional probability of the next words.
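For reference, the following minimal sketch (Python; illustrative only) computes the perplexity from the conditional probabilities that a model assigns to the next words of an evaluation corpus.

```python
import math

def perplexity(probabilities):
    """Perplexity = 2^H, where H is the average negative log2
    conditional probability of the next words."""
    h = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2.0 ** h

# A model that always assigns probability 1/8 has perplexity 8.
print(perplexity([0.125] * 100))  # 8.0
```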
Table 6.2 shows the perplexity results. We can see that the proposed method achieved lower perplexities on both corpora.
Table 6.2: Results of HEM and Baseline

Method                            CSJ (Perplexity)   BNC (Perplexity)
Trigram with KN                   133.5              230.3
Hierarchical Exponential Model    120.3              216.9
6.4 Discussion and Conclusion
Although the application of maximum entropy models to language modeling is not new [Rosenfeld, 1994], such models have not been widely used in the NLP community, because their computational cost is prohibitively large and their improvement over simple language models (e.g., N-gram) is modest.
In this chapter, we have presented a language model that can use any type of features and whose training and inference are efficient. We showed that the large computational cost can be reduced by structuring the search space hierarchically, as in supertagging of lexical entries [Matsuzaki et al., 2007].
The environment for language modeling has changed. Since much faster computers and learning algorithms (e.g., online learning) are now available, language models based on logistic regression have become practical. Moreover, by applying sparse priors (e.g., L1 regularization), the model can be made compact, even smaller than N-gram models. By combining our algorithm with fast learning and parallelized processing (GPUs, cloud computing), language models can be improved considerably.
Chapter 7
Conclusion
In this thesis, I have proposed methods to build a large-scale natural language processing
(NLP) system. The methods are efficient, practical and scalable. To build this system,
three problems must be solved, as described in the following three paragraphs.
The first problem was related to the massive number of training examples. This problem has already been studied by others, so many solutions exist. For example, online learning algorithms can train the model in time linear in the number of training examples. In addition, by using L1 regularization, the model can be made compact and its inference becomes extremely fast. I introduced several online algorithms and data structures in Chapter 2 and used them in most of my implementations.
The second problem was about the massive number of features. The examples I considered in this thesis were “all substring features” and “combination features”. I presented an efficient algorithm for finding the effective features among the massive number of candidate features. I also showed that we can learn an exactly optimal classification model without enumerating all candidate features, by using a grafting algorithm, a sparse prior, and the search algorithm for finding effective features.
The third problem was about the massive number of output candidates. The example considered in this thesis was language modeling. I tackled this problem with two different approaches. The first was to discriminate correct and incorrect sentences directly; the problem here is how to learn the discriminative model when only positive examples are available. I proposed to employ pseudo-negative examples sampled from a probabilistic language model. The second language model involved building a hierarchical tree for candidate
outputs, which enabled us to search for the most probable output in an efficient manner.
These proposals were implemented with the help of recent on-line learning algorithms
and data structures, and we achieved state-of-the-art performance over a wide range of
NLP tasks, including document classification, document clustering, and language modeling.
I summarize the achievements of this thesis for each sub-topic below.
7.1 Learning with “All Substrings”
I presented a learning algorithm with “all substring features”, where all substrings appearing in a document are used as candidate features. I showed that, although the number of candidate features is O(n^2), where n is the length of a document, a classifier can be trained in O(n) time, and the required working space is also O(n). This is because many substrings carry the same information (e.g., they have the same frequency in a document), so these substrings can be summarized into a much smaller number of equivalence classes. Since many feature functions depend only on these statistics, the computed statistics corresponding to these features (such as the gradient value of an objective function with regard to a feature) are also the same. By using enhanced suffix arrays and auxiliary data structures, the statistics of these classes can be enumerated efficiently. I combined this efficient enumeration algorithm with L1 regularization and a grafting algorithm: at each step, the most effective feature is searched for among the O(n) equivalence classes and is added to the candidates for optimization. The properties of L1 regularization and the grafting algorithm ensure that this greedy algorithm converges to the global optimum.
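As a high-level illustration of that loop, the following sketch (Python) shows the grafting procedure under L1 regularization; find_most_effective_feature and optimize are hypothetical stand-ins for the suffix-array-based feature search and the L1-regularized optimizer described above, and reg_c stands for the regularization constant.

```python
def grafting(find_most_effective_feature, optimize, reg_c, max_iter=1000):
    """Incrementally grow the active feature set (grafting).

    find_most_effective_feature(weights) -- returns (feature, |gradient|) of
        the inactive feature whose loss gradient has the largest magnitude
        (found over the equivalence classes in the thesis setting).
    optimize(active_features) -- returns L1-regularized optimal weights
        restricted to the active features.
    reg_c -- the L1 regularization constant."""
    active, weights = [], {}
    for _ in range(max_iter):
        feature, grad_abs = find_most_effective_feature(weights)
        # Optimality condition of the L1-regularized objective: if no inactive
        # feature has |gradient| greater than reg_c, the solution is optimal.
        if grad_abs <= reg_c:
            break
        active.append(feature)
        weights = optimize(active)
    return weights
```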
I showed two applications of the algorithm described above: one for document classification and one for document clustering. I conducted experiments and compared the results to other state-of-the-art methods. The results showed that the accuracy of my algorithm was the highest in many tasks, especially when substring information is important for deciding the category of a document. The proposed algorithm will be useful for other applications dealing with string information, such as genome sequence analysis and web-log mining. This algorithm is also very scalable: the time and space consumption is linearly proportional to the text size, even when the number of documents is over one million.
7.2 Learning with “Combination Features”
I showed how the search for effective combination features can be done. Since the number of candidate features grows exponentially with the number of original features, direct optimization over all combination features is not feasible because of its prohibitively large cost. I showed that a simple filtering technique can be used to enumerate (only) the effective combination features, even for large-scale problems. Experimental results indicate that an L1-regularized logistic regression model with combination features can achieve comparable or better results than those from other methods. Furthermore, the resulting model is very compact and easy to interpret.
7.3 Discriminative Language Model with Pseudo-Negative Examples
I proposed a discriminative learning algorithm for problems where only a generative model
is given. An example is language modeling with only positive examples available. I
proposed to sample (pseudo-) negative examples from the generative model. I applied
this method to language modeling to enable the classifier to directly discriminate correct
sentences from incorrect sentences. I also showed that an on-line max-margin learning
method enables us to handle one million sentences and achieves 75% accuracy in the
task of discriminating the positive and pseudo-negative examples. Other experimental
results revealed that although a pseudo-negative example is incorrect, a syntactic parser
cannot discriminate it from correct examples. However, native speakers can easily discriminate it from correct examples. Experimental results also showed that kernel methods
can effectively discriminate correct and incorrect sentences. This has not been discussed
previously in probabilistic language model research.
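As an illustration of the sampling step, the following sketch (Python; not the thesis implementation) draws one pseudo-negative sentence word by word from an N-gram model's conditional distributions; the table cond_prob, mapping a context tuple to a word distribution, is an assumed layout for this example, and every reachable context is assumed to appear in the table.

```python
import random

def sample_pseudo_negative(cond_prob, max_len=30, bos="<s>", eos="</s>", seed=None):
    """Sample one pseudo-negative sentence from a probabilistic language model.

    cond_prob maps a context tuple (the previous N-1 words) to a dict
    {word: probability}; words are drawn one at a time from p(c | h)."""
    rng = random.Random(seed)
    n_minus_1 = len(next(iter(cond_prob)))      # context length N-1
    sentence, context = [], (bos,) * n_minus_1
    for _ in range(max_len):
        dist = cond_prob[context]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        if word == eos:
            break
        sentence.append(word)
        context = context[1:] + (word,)
    return sentence
```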
7.4 Hierarchical Exponential Models
Finally, I proposed a hierarchical exponential model (HEM) for problems where the number of candidate outputs is very large. In a HEM, the output candidates are hierarchically clustered, and the probability of an output is given by the product of the probabilities of the classification results associated with the nodes on the path from the root to the leaf. A HEM supports an efficient arg max operation, which returns the most probable output in
O(log K) time, where K is the number of labels. I applied HEM to the language modeling
problem and the experimental results show that this model achieved higher performance in
terms of the perplexity measure, compared to previous state-of-the-art language models,
such as N-gram with Kneser-Ney smoothing.
7.5 Future Work
There are several topics I did not consider in this thesis, even though they are fundamental
for practical NLP in the future. Three such topics are parallelization, randomization, and
non-linear representation.
Parallelization has become important because current processors are highly parallel and many clusters are now available for heavy computation. In existing NLP systems, including ours, only serialized processing is considered, and it is not trivial to do the processing in parallel. For example, since an online learning algorithm updates a parameter every time a mistake occurs, each learning operation depends on the previous operations, so parallelization is difficult. However, for simpler tasks, parallelization is very effective and can easily be applied [Asuncion et al., 2007, Asuncion et al., 2008]; examples are N-gram language models [Brants et al., 2007] and simple machine learning [Chu et al., 2006]. Another possibility is the use of general-purpose graphics processing units (GPGPU) for NLP [Zein et al., 2008, Yan et al., 2009a], since they are highly optimized for parallel computation.
Randomization (or randomized algorithms) has also become important for large-scale NLP. The benefit of a randomized algorithm is that we can expect good performance from a simpler algorithm. Recent studies have shown that randomization is especially effective in language modeling [Talbot and Brants, 2008, A. Levenberg, 2009], document clustering [Ravichandran et al., 2005], the calculation of expectations of features [Bouchard-Cote et al., 2009], and the computation of the singular value decomposition of a matrix [Halko et al., 2009].
Non-linear representations using deep neural networks have recently received renewed attention in NLP. Although, in the history of machine learning, neural networks were once displaced by linear classifiers and kernel machines, non-linear representations using deep neural networks have recently achieved many successes in NLP [Collobert and Weston, 2009, Hinton et al., 2006].
This is because research on online algorithms (or greedy updates) and regularization has matured, so parameter estimation can be performed accurately. Since there exist many layers in NLP, such as part-of-speech tags, syntactic trees, and semantic information, joint inference will become more important, and the ideas behind neural networks will become more important as well.
To conclude, this thesis has presented several large-scale NLP systems, which are based on recent research on data structures, machine learning, string algorithms, and optimization techniques. While previous NLP systems consider only how to obtain accurate results, the proposed systems also consider other important factors, such as efficiency in terms of speed and resources, and scalability. The most significant contribution of this thesis is to make practical NLP systems available by showing the possibility of integrating ideas from different fields. I hope that the proposed algorithms will also be applied to problems in other fields, such as biology, computer vision, and data mining.
References
[A. Levenberg, 2009] Levenberg, A. and M. Osborne. 2009. Stream-based randomized language models for SMT. In Proc. of EMNLP, pages 756–764.
[Abouelhoda et al., 2004] Abouelhoda, M. I., S. Kurtz, and E. Ohlebusch. 2004. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algs, 2:53–86.
[Aho and Corasick, 1975] Aho, A. V. and M. J. Corasick. 1975. Efficient string matching:
An aid to bibliographic search. Communications of the ACM, 18(6):333–340.
[Andrew and Gao, 2007] Andrew, G. and J. Gao. 2007. Scalable training of L1-regularized log-linear models. In Proc. of ICML.
[Anh and Moffat, 2005] Anh, V. N. and A. Moffat. 2005. Inverted index compression
using word-aligned binary codes. Information Retrieval, 8(1):151–166.
[Asuncion et al., 2007] Asuncion, A., P. Smyth, and M. Welling. 2007. Distributed inference for latent dirichlet allocation. In Proc. of NIPS.
[Asuncion et al., 2008] Asuncion, A., P. Smyth, and M. Welling. 2008. Asynchronous
distributed learning of topic models. In Proc. of NIPS.
[Bilmes and Kirchhoff, 2003] Bilmes, J. A. and K. Kirchhoff. 2003. Factored language
models and generalized parallel backoff. In Proc. of HLT/NACCL, pages 4–6.
[Blei et al., 2003] Blei, D. M., A. Y. Ng, and M. I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research., 3:993–1022.
[Bouchard-Cote et al., 2009] Bouchard-Cote, A., S. Petrov, and D. Klein. 2009. Randomized pruning: Efficiently calculating expectations in large dynamic programs. In
Proc. of NIPS.
[Boyd and Vandenberghe, 2004] Boyd, S. and L. Vandenberghe. 2004. Convex Optimization. Cambridge University Press.
[Brants et al., 2007] Brants, T., A. C. Popat, P. Xu, F. J. Och, and J. Dean. 2007. Large
language models in machine translation. In Proc. of EMNLP, pages 858–867.
[Brown et al., 1990] Brown, P. F., J. Cocke, S. Pietra, V. Pietra, F. Jelinek, J. Lafferty,
R. L. Mercer, and P. S. Roossin. 1990. A statistical approach to machine translation.
Comput. Linguist., 16(2):79–85.
[Cesa-Bianchi and Logosi, 2006] Cesa-Bianchi, N. and G. Logosi. 2006. Prediction,
learning, and games. Cambridge University Press.
[Charniak and Johnson, 2005] Charniak, E. and M. Johnson. 2005. Coarse-to-fine N-best
parsing and maxent discriminative reranking. In Proc. of ACL.
[Chen, 2009a] Chen, S. F. 2009a. Performance prediction for exponential language models. In Proc. of NAACL, pages 450–458.
[Chen, 2009b] Chen, S. F. 2009b. Shrinking exponential language models. In Proc. of
NAACL, pages 468–476, Morristown, NJ, USA. Association for Computational Linguistics.
[Cheng et al., 2006] Cheng, L., S. V. N. Vishwanathan, D. Schuurmans, S. W., and
T. Caelli. 2006. Implicit online learning with kernels. In Proc. of NIPS.
[Chu et al., 2006] Chu, C., S. K. Kim, Y. Lin, G. Bradski, Y. Yu, A. Y. Ng, and K. Olukotun. 2006. MapReduce for machine learning on multicore. In Proc. of NIPS.
[Collins et al., 2002] Collins, M., R. E. Schapire, and Y. Singer. 2002. Logistic regression,
adaboost and bregman distances. Machine Learning, 48(1-3):253–285.
[Collins, 2002] Collins, Michael. 2002. Discriminative training methods for hidden
markov models: Theory and experiments with perceptron algorithms. In Proc. of
EMNLP.
[Collins, 2003] Collins, M. 2003. Head-driven statistical models for natural language
parsing. Computational Linguistics, 29(4), December.
[Collobert and Weston, 2009] Collobert, R. and Jason Weston. 2009. Deep learning in
natural language processing. NIPS Tutorial.
[Crammer and Singer, 2001] Crammer, K. and Y. Singer. 2001. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265–292.
[Crammer et al., 2006] Crammer, K., O. Dekel, J. Keshet, S. Shalev-Shwartz, and
Y. Singer. 2006. Online passive-aggressive algorithms. Journal of Machine Learning Research.
[Crammer et al., 2008] Crammer, K., M. Dredze, and F. Pereira. 2008. Exact convex
confidence-weighted learning. In Proc. of NIPS, pages 345–352. MIT Press.
[Crammer et al., 2009] Crammer, K., M. Dredze, and A. Kulesza. 2009. Multi-class
confidence weighted algorithms. In Proc. of EMNLP, pages 496–504.
[Davidov et al., 2004] Davidov, D., E. Gabrilovich, and S. Markovitch. 2004. Parameterized generation of labeled datasets for text categorization based on a hierarchical
directory. In Proc. of SIGIR.
[Dekel et al., 2005] Dekel, O., S. Shalev-Shwartz, and Y. Singer. 2005. The forgetron: A
kernel-based perceptron on a fixed budget. In Proc. of NIPS.
[Deligne and Bimbot, 1995] Deligne, S. and F. Bimbot. 1995. Language modeling by
variable length sequences: Theoretical formulation and evaluation of multigrams. In
Proc. ICASSP ’95, pages 169–172, Detroit, MI.
[Ding et al., 2001] Ding, C. H. Q., X. He, H. Zha, M. Gu, and H. D. Simon. 2001. A
min-max cut algorithm for graph partitioning and data clustering. In ICDM, pages
107–114.
[Dredze et al., 2008] Dredze, M., K. Crammer, and F. Pereira. 2008. Confidence-weighted linear classification. In Proc. of ICML, pages 264–271.
[Duchi and Singer, 2009] Duchi, J. and Y. Singer. 2009. Online and batch learning using
forward backward splitting. In Proc. of NIPS.
[Dudı́k et al., 2007] Dudı́k, M., S. J. Phillips, and R. E. Schapire. 2007. Maximum entropy density estimation with generalized regularization and an application to species
distribution modeling. JMLR, 8:1217–1260.
[Freund and Schapire, 1999] Freund, Y. and R. E. Schapire. 1999. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296.
[Gao et al., 2005] Gao, J., H. Yu, W. Yuan, and P. Xu. 2005. Minimum sample risk
methods for language modeling. In Proc. of HLT/EMNLP.
[Gao et al., 2006] Gao, J., H. Suzuki, and B. Yu. 2006. Approximation lasso methods for
language modeling. In Proc. of ACL/COLING.
[Gao et al., 2007a] Gao, J., G. Andrew, M. Johnson, and K. Toutanova. 2007a. A comparative study of parameter estimation methods for statistical natural language processing.
In Proc. of ACL, pages 824–831.
[Gao et al., 2007b] Gao, J., G. Andrew, M. Johnson, and K. Toutanova. 2007b. A comparative study of parameter estimation methods for statistical natural language processing.
In Proc. of ACL, pages 824–831.
[Gieseke et al., 2009] Gieseke, F., T. Pahikkala, and O. Kramer. 2009. Fast evolutionary
maximum margin clustering. In ICML, pages 361–368.
[Gildea and Hofmann, 1999] Gildea, D. and T. Hofmann. 1999. Topic-based language
models using em. In Proceedings of the 6th European Conference on Speech Communication and Technology(EUROSPEECH).
[Gusfield, 1997] Gusfield, D. 1997. Algorithms on Strings, Trees and Sequences. Cambridge University Press.
[Halko et al., 2009] Halko, N., P. G. Martinsson, and J. Tropp. 2009. Finding structure
with randomness: Stochastic algorithms for constructing approximate matrix decompositions. arXiv 0909.4061.
[Hérault and Grandvalet, 2007] Hérault, R. and Y. Grandvalet. 2007. Sparse probabilistic
classifiers. In Proc. of ICML, pages 337–344.
[Hinton et al., 2006] Hinton, G.E., S. Osindero, and Y.W. Teh. 2006. A fast learning
algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.
[Ifrim et al., 2008] Ifrim, G., G. Bakir, and G. Weikum. 2008. Fast logistic regression for
text categorization with variable-length n-grams. In Proc. of SIGKDD.
[J. Suzuki, 2008] J. Suzuki, H. Isozaki. 2008. Semi-supervised sequential labeling and
segmentation using giga-word scale unlabeled data. In Proc. of ACL, pages 665–673.
[Jaynes, 1957] Jaynes, E. T. 1957. Information theory and statistical mechanics. Physical
Review Series II, 106(4):620–630.
[Joachims, 1998] Joachims, T. 1998. Text categorization with Support Vector Machines
learning with many relevant features. In Proceedings of 10th European Conference on
Machine Learning (ECML), pages 137–142.
[Kasai et al., 2001] Kasai, T., G. Lee, H. Arimura, S. Arikawa, and K. Park. 2001. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proc.
of CPM, pages 181–192.
[Kazama and Tsujii, 2005] Kazama, J. and J. Tsujii. 2005. Maximum entropy models
with inequality constraints: A case study on text categorization. Machine Learning Journal special issue on Learning in Speech and Language Technologies, 60((13)):169–194.
[Knight and Marcu, 2002] Knight, K. and D. Marcu. 2002. Summarization beyond
sentence extraction: a probabilistic approach to sentence compression. Artif. Intell.,
139(1):91–107.
[Koh et al., 2007] Koh, K., S. J. Kim, and S. Boyd. 2007. An interior-point method for
large-scale l1 -regularized logistic regression. JMLR, 8.
[Kudo and Matsumoto, 2003] Kudo, Taku and Yuji Matsumoto. 2003. Fast methods for
kernel-based text analysis. In Proc. of ACL.
[Kudo and Matsumoto, 2004] Kudo, T. and Y. Matsumoto. 2004. A boosting algorithm
for classification of semi-structured text. In Proc. of EMNLP.
[Lafferty et al., 2001] Lafferty, J., A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc.
of ICML, pages 282–289.
[Li et al., 2009] Li, Yu-F., I. W. Tsang, J. T. Kwok, and Z-H Zhou. 2009. Tigher and
convex maximum margin clustering. In In Proc. of AISTATS, pages 344–351.
[Lidstone, 1920] Lidstone, G.J. 1920. Note on the general case of the bayes-laplace formula for inductive or a posteriori probabilities. Transactions of the Faculty of Actuaries,
8:182–192.
[Liu and Nocedal, 1989] Liu, D. C. and J. Nocedal. 1989. On the limited memory method
for large scale optimization. Mathematical Programming B, 45(3):503–528.
[Marcus et al., 1994] Marcus, M., G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies,
M. Ferguson, K. Katz, and B. Schasberger. 1994. The Penn treebank: Annotating
predicate argument structure. In ARPA Human Language Technology Workshop.
[Martin et al., 1998] Martin, S., J. Liermann, and H. Ney. 1998. Algorithms for bigram
and trigram word clustering. Speech Communication, 24(1):19–37.
[Matsuzaki et al., 2007] Matsuzaki, T., Y. Miyao, and J. Tsujii. 2007. Efficient hpsg
parsing with supertagging and cfg-filtering. In Proc. of IJCAI, pages 1671–1676, San
Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
[McCallum et al., 2000] McCallum, A., D. Freitag, and F. C. N. Pereira. 2000. Maximum
entropy markov models for information extraction and segmentation. In Proc. of ICML,
pages 591–598, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
[Miyao and Tsujii, 2008] Miyao, Y. and J. Tsujii. 2008. Feature forest models for probabilistic HPSG parsing. Computational Linguistics., 34(1):35–80, March.
[Navarro and Mäkinen, 2007] Navarro, G. and V. Mäkinen. 2007. Compressed full-text
indexes. ACM Comput. Surv., 39(1):2.
[Ng, 2004] Ng, A. Y. 2004. Feature selection, L1 vs. L2 regularization, and rotational
invariance. In Proc. of ICML, page 78, New York, NY, USA. ACM.
[Och and Ney, 2003] Och, F. J. and H. Ney. 2003. A systematic comparison of various
statistical alignment models. Comput. Linguist., 29(1):19–51.
[Okanohara and Tsujii, 2007] Okanohara, D. and J. Tsujii. 2007. A discriminative language model with pseudo-negative samples. In ACL, pages 73–80.
[Okanohara and Tsujii, 2009a] Okanohara, D. and J. Tsujii. 2009a. Learning combination
features with L1-regularization. In Proc. of NAACL, pages 97–100.
[Okanohara and Tsujii, 2009b] Okanohara, D. and J. Tsujii. 2009b. Text categorization
with all substring features. In Proc. of SDM, pages 838–846.
[Pang et al., 2002] Pang, B., L. Lee, and S. Vaithyanathan. 2002. Thumbs up? sentiment
classification using machine learning techniques. In Proc. of EMNLP, pages 79–86.
[Perkins and Theeiler, 2003] Perkins, S. and J. Theeiler. 2003. Online feature selection
using grafting. ICML.
[Perkins et al., 2003] Perkins, S., K. Lacker, and J. Theiler. 2003. Grafting: Fast, incremental feature selection by gradient descent in function space. JMLR, 3:1333–1356.
[Ravichandran et al., 2005] Ravichandran, D., P. Pantel, and E. Hovy. 2005. Randomized algorithms and nlp: Using locality sensitive hash function for high speed noun
clustering. In Proc. of EMNLP, pages 622–629.
[Rice, 1998] Rice, R. F. 1998. An empirical study of smoothing techniques for language
modeling. Technical report, Harvard Computer Science Technical report TR-10-98.
[Roark et al., 2007] Roark, B., M. Saraclar, and M. Collins. 2007. Discriminative n-gram
language modeling. computer speech and language. Computer Speech and Language,
21(2):373–392.
[Rosenblatt, 1958] Rosenblatt, F. 1958. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408.
[Rosenfeld et al., 2001] Rosenfeld, R., S. F. Chen, and X. Zhu. 2001. Whole-sentence
exponential language models: a vehicle for linguistic-statistical integration. Computers
Speech and Language, 15(1).
[Rosenfeld, 1994] Rosenfeld, R. 1994. Adaptive Statistical Language Modeling: A Maximum Entropy Approach. Ph.D. thesis, Carnegie Mellon University.
[Sadakane, 2007] Sadakane, K. 2007. Succinct data structures for flexible text retrieval
systems. Journal of Discrete Algorithms, 5(1):12–22.
[Saigo et al., 2007] Saigo, H., T. Uno, and K. Tsuda. 2007. Mining complex genotypic
features for predicting HIV-1 drug resistance. Bioinformatics, 23:2455–2462.
[Sassano, 2004] Sassano, Manabu. 2004. Linear-time dependency analysis for japanese.
In Proc. of COLING.
[Sha et al., 2007] Sha, F., Y. A. Park, and L. K. Saul. 2007. Multiplicative updates for l1
regularized linear and logistic regression. In Proc. of IDA.
[Shalev-Shwartz, 2007] Shalev-Shwartz, S. 2007. Online Learning: Theory, Algorithms,
and Applications. Ph.D. thesis, The Hebrew University of Jerusalem, July.
[Shi and Malik, 2000] Shi, J. and J. Malik. 2000. Normalized cuts and image segmentation. PAMI.
[Smith and Eisner, 2005] Smith, N. A. and J. Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proc. of ACL.
[Talbot and Brants, 2008] Talbot, D. and T. Brants. 2008. Randomized language models
via perfect hash functions. In Proc. of ACL, pages 505–513.
[Taskar et al., 2004] Taskar, B., C. Guestrin, and D. Koller. 2004. Max margin markov
networks. In Proc. of NIPS.
[Taylor and Cristianini, 2004] Taylor, J. S. and N. Cristianini. 2004. Kernel Methods for
Pattern Analysis. Cambridge University Press.
[Teo and Vishwanathan, 2006] Teo, C. H. and S. V. N. Vishwanathan. 2006. Fast and
space efficient string kernels using suffix arrays. In Proc. of ICML, pages 929–936.
[Tillmann and Ney, 1996] Tillmann, C. and H. Ney. 1996. Selection criteria for word
trigger pairs in language modeling. In In ICGI ’96, pages 95–106. Springer.
[Tsuruoka et al., 2009] Tsuruoka, Y., J. Tsujii, and S. Ananiadou. 2009. Stochastic gradient descent training for l1-regularized log-linear models with cumulative penalty. In
Proc. of ACL-IJCNLP, pages 477–485.
[Uchimoto et al., 1999] Uchimoto, K., S. Sekine, and H. Isahara. 1999. Japanese dependency structure analysis based on maximum entropy models. In Proc. of EACL, pages
196–203.
[Vishwanathan and Smola, 2004] Vishwanathan, S. V. N and A. J. Smola. 2004. Fast
kernels for string and tree matching. Kernels and Bioinformatics.
[Wang et al., 2005] Wang, S., S. Wang, R. Greiner, D. Schuurmans, and L. Cheng. 2005.
Exploiting syntactic, semantic and lexical regularities in language modeling via directed markov random fields. In Proc. of ICML.
[Whittaker and Woodland, 2001] Whittaker, E. W. D. and R. C. Woodland. 2001. Efficient class-based language modelling for very large vocabularies. Acoustics, Speech,
and Signal Processing, IEEE International Conference on, 1:545–548.
[Xu et al., 2004] Xu, L., J. Neufeld, B. Larson, and D. Schuurmans. 2004. Maximum
margin clustering. In NIPS 17.
[Yamamoto and Church, 2001] Yamamoto, M. and K. W. Church. 2001. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus.
Comput. Linguist., 27(1):1–30.
[Yan et al., 2009a] Yan, F., N. XU, and Y. Qi. 2009a. Parallel inference for latent dirichlet
allocation on graphics processing units. In Proc. of NIPS.
[Yan et al., 2009b] Yan, H., S. Ding, and T. Suel. 2009b. Inverted index compression
and query processing with optimized document ordering. In Proc. of WWW, pages
401–410, New York, NY, USA. ACM.
[Yoshinaga and Kitsuregawa, 2009] Yoshinaga, N. and M. Kitsuregawa. 2009. Polynomial to linear: Efficient classification with conjunctive features. In Proc. of EMNLP,
pages 1542–1551.
[Yu et al., 2008] Yu, J., S. V. N. Vishwanathan, S. Guenter, and N. Schraudolph. 2008. A
quasi-Newton approach to nonsmooth convex optimization. In Proc. of ICML.
[Yusuke and Tsujii, 2005] Yusuke, M. and J. Tsujii. 2005. Probabilistic disambiguation
models for wide-coverage HPSG parsing. In Proc. of ACL 2005., pages 83–90, Ann
Arbor, Michigan, June.
[Zein et al., 2008] El Zein, A., E. McCreath, A. P. Rendell, and A. J. Smola. 2008.
Performance evaluation of the NVIDIA GeForce 8800 GTX GPU for machine learning.
In International Conference Computational Science.
[Zhai and Lafferty, 2004] Zhai, C. and J. Lafferty. 2004. A study of smoothing methods
for language models applied to information retrieval. ACM Trans. Inf. Syst., 22(2):179–
214.
[Zhang et al., 2007] Zhang, K., I. W. Tsang, and J. T. Kowk. 2007. Maximum margin
clustering made practical. In ICML 24.
[Zhao et al., 2008a] Zhao, B., F. Wang, and C. Zhang. 2008a. Efficient maximum margin
clustering via cutting plane algorithm. In SDM, pages 751–762.
[Zhao et al., 2008b] Zhao, B., F. Wang, and C. Zhang. 2008b. Efficient multiclass maximum margin clustering. In ICML.