LARGE SCALE MACHINE LEARNING FOR PRACTICAL
NATURAL LANGUAGE PROCESSING
大規模機械学習による現実的な自然言語処理
by
Daisuke Okanohara
岡野原 大輔
A Doctoral Thesis
博士論文
Submitted to
the Graduate School of the University of Tokyo
on
in Partial Fulfillment of the Requirements
for the Degree of Doctor of Information Science and Technology
in Computer Science
Thesis Supervisor: Jun’ichi Tsujii
辻井 潤一
Prof. of Computer Science
ABSTRACT
I present several efficient, scalable frameworks for large-scale natural language processing
(NLP). Corpus-oriented NLP has succeeded in a wide range of tasks such as machine translation,
information extraction, syntactic parsing and information retrieval. As very large corpora become available, NLP systems should offer not only high performance but also efficiency and
scalability. To achieve these goals, I propose to combine several techniques and methods from
different fields: online learning algorithms, string algorithms, data structures and sparse parameter
learning.
The difficulties in large-scale NLP can be decomposed into the following: (1) a massive number
of training examples, (2) a massive number of candidate features and (3) a massive number of candidate outputs. Since solutions for (1) have already been studied (e.g. online learning algorithms),
I focus on problems (2) and (3).
An example of problem (2) is document classification with “all substring features”. Although all substring features can be effective for determining the label of a document, the number of candidate substrings is quadratic in the length of a document. Therefore a naive optimization
with all substrings requires a prohibitively large computational cost. I show that the statistics of substring
features (e.g. frequency) can be summarized into a number of equivalence classes much smaller than the
total length of the documents. Moreover, by using auxiliary data structures called enhanced suffix arrays,
these effective features can be found exhaustively in linear time without enumerating all substring
features. The experimental results show that the proposed algorithm achieved higher accuracy
than state-of-the-art methods. Moreover, the results also show the scalability of our algorithm:
effective substrings can be enumerated from one million documents in 20 minutes.
Another important example of (2) is “combination features”. In NLP, a combination of original
features is often the most effective for classification. Although there are exponentially many candidate combination features, the effective ones are very few. I present a method that can efficiently find all such
effective combination features. This method relies on a grafting algorithm, which incrementally
adds features starting from the most effective one. Although this procedure looks like a greedy algorithm, it
converges to the global optimum. To find such effective features, I propose a space-efficient
online algorithm that calculates the statistics of combination features with a simple filtering method.
Experimental results show that the proposed algorithm achieved results comparable to or better than
those of other methods, and the resulting model is very compact and easy to interpret.
For problem (3), I consider language modeling tasks, in which we discriminate correct sentences
from incorrect ones or predict the next word given previous words as context. Since there are a
very large number of candidate words, only simple generative models (e.g. N-gram models) are used in practice. I
first propose a Discriminative Language Model (DLM) that directly classifies a sentence as correct
or incorrect. Since the DLM is a discriminative model, it can use any type of features, such as the
existence of a verb in the sentence. To obtain negative examples for training, I propose to use
pseudo-negative examples sampled from generative language models. Experimental results show
that the DLM achieved 75% accuracy in the task of discriminating positive and negative sentences,
even though N-gram models and syntactic parsing cannot discriminate them at all.
The second language model I propose is a Hierarchical Exponential Model (HEM). In an HEM,
we build a hierarchical tree where each candidate word corresponds to a leaf, and a binary logistic
regression model is associated with each internal node. The probability of a word is then given by
the product of the probabilities of the internal nodes on the path from the root to the corresponding
leaf. While the HEM can use any type of features, it supports efficient inference. Moreover, it supports
an operation that finds the most probable word efficiently, which is fundamental for efficient LMs.
I conducted experiments using the HEM and show that this model achieved higher performance
than other language models while supporting efficient inference.
ABSTRACT (JAPANESE)
We propose efficient and scalable methods for realizing large-scale natural language processing systems. Corpus-based NLP systems have succeeded in many problems such as machine translation, information extraction, syntactic parsing and information retrieval. As very large corpora have become available in recent years, systems are required to be not only accurate but also highly efficient and scalable. By integrating techniques such as online learning, string algorithms, data structures and sparse parameter learning, the systems we propose achieve both high efficiency and high accuracy.
The difficulties of large-scale NLP fall roughly into three cases: (1) a large number of examples, (2) a large number of feature types, and (3) a large number of candidate outputs. Various methods, such as online learning, have already been proposed for (1); this thesis therefore focuses mainly on the remaining problems (2) and (3).
First, as an example of (2), we consider substring features in documents. In document classification and document clustering, the occurrence of any substring appearing in a document can be an effective feature for determining its label, but the number of such substrings is proportional to the square of the document length, and handling them directly is extremely costly. We show that the number of substrings with distinct statistics (e.g. document frequency) is at most the document length, and propose a method that exhaustively and efficiently searches for effective substrings using enhanced suffix arrays. Applying this method to document classification and clustering tasks, we show that it outperforms existing methods and that it scales well, for example extracting effective substrings from a collection of one million documents in 20 minutes.
Next, as another example of (2), we consider combination features. In NLP, combinations of basic features are often effective. However, although the effective combinations are few, the number of candidate combination features is very large, so extracting effective combination features has been an important problem. We adopt the grafting algorithm, which adds features to the optimization problem in order of effectiveness, and learn efficiently while guaranteeing the optimal solution. Furthermore, to compute the statistics of combination features efficiently, we propose an algorithm that combines simple filtering with online computation of the statistics. Experimental results on dependency parsing show that the proposed method processes a huge number of combination features efficiently and obtains a very compact model while achieving accuracy comparable to existing methods.
Next, as an example of (3), we consider language modeling. Language modeling is the task of judging whether a given sentence is correct, or of predicting the next word from a given context, and it plays an important role in many applications such as machine translation, speech recognition and handwriting recognition. Because the number of candidate words is enormous, conventional machine-learning methods cannot be applied directly, and simple frequency-based statistical models (e.g. N-gram models) have been used. We first propose a discriminative language model, which builds a model that directly classifies a given sentence as correct or incorrect. The incorrect sentences needed to train this model are generally not available from corpora, so we propose to train it using sentences generated from probabilistic language models as negative examples. Experimental results show that a model trained in this way classifies, with 75% accuracy, sentences that existing language models and syntactic parsers cannot discriminate at all.
As a different language model, we propose a hierarchical logistic regression model, a learning model for efficiently solving problems with a huge number of candidates. In this model, we build a hierarchical tree in which each candidate (word) corresponds to a leaf and a logistic regression model is attached to each internal node. The probability of a candidate is then defined as the product of the classification results at the nodes on the path from the root to the leaf corresponding to that candidate. In addition to allowing arbitrary features, this model can efficiently find the candidate with the largest probability without enumerating all candidates. Comparing the proposed method with existing methods, we show that it predicts words accurately and infers high-probability words efficiently.
Acknowledgements
My thesis work has benefited from the support of many colleagues, friends, and family.
I am deeply grateful to Professor Jun’ichi Tsujii for his valuable advice and encouragement. He invited me to the field of computational linguistics. I learned from him how
to solve a problem, how to present work, and especially how to enjoy research.
I would also like to express my gratitude to Dr. Yusuke Miyao and Dr. Takuya Matsuzaki. I
always enjoy discussing research topics with them, which led to most of the work
in this thesis.
I would like to acknowledge Professor Yoshimasa Tsuruoka. He is the first person
who taught me computational linguistics and machine learning. He also gave me much
valuable advice and encouragement even after he left the laboratory.
Many thanks also to Professor Kunihiko Sadakane for pushing me toward the field of
string algorithms and data structures. I always enjoy thinking and discussing with him.
I am grateful to Dr. Jin-Dong Kim, Tomoko Ohta, Rune Sætre, Yoshinobu Kano,
Naoyoshi Okazaki, Makoto Miwa, and Tadayoshi Hara for various combinations of help,
support and inspiration.
I also thank the lab members, Mr. Sun Xu, Mr. Wu Xianchao, Mr. Yuichiro Matsubayashi, Mr. Yusuke Matsubara, Ms. Sumire Uematsu, and Mr. Hiroki Hanaoka, for valuable
discussions. I am also grateful to my fellows, Mr. Daiki Kojima and Mr. Junpei Takeuchi;
I enjoyed life with them. I much appreciated the help of the secretaries, Ms. Minako
Ito, Ms. Noriko Katsu, and Ms. Mika Tarukawa. I would also like to convey my appreciation
to all members of the Tsujii laboratory.
Many thanks also to my colleagues at our company, Mr. Toru Nishikawa, Mr. Jiro
Nishitoba, Mr. Yuichi Yoshida, Mr. Hideyuki Tanaka, Mr. Takayuki Muranushi, Mr.
Kazuki Ohta, Mr. Hiroyuki Tokunaga, Mr. Nobuyuki Kubota, Mr. Jun Watanabe and Mr.
Ebihara. They gave me valuable advice from the perspectives of different fields, and ongoing
encouragement.
I also thank Ms. Naoko Nishikawa for her invaluable support. Finally, I thank my
parents for their support. They set me on the path to the field of computer science and
always encouraged me.
Contents
1 Introduction 7
  1.1 Post-Corpus Oriented NLP 7
  1.2 Machine Learning and NLP 9
  1.3 Difficulties in Dealing with Very Large Data 9
    1.3.1 Example Problem 1: Document Classification 10
    1.3.2 Example Problem 2: Language Modeling 10
  1.4 Contribution 10
  1.5 Tools for Large-Scale Machine Learning 12
    1.5.1 Online Learning 12
    1.5.2 Sparse Priors: L1 regularization 13
    1.5.3 Stringology and Succinct Data Structure 13
  1.6 Overview of This Thesis 14

2 Background 16
  2.1 Definition 16
  2.2 General Framework of Machine Learning 16
  2.3 Linear Classifier 17
  2.4 Regularization 21
  2.5 Batch Learning 22
  2.6 Online Learning 24
    2.6.1 Experiment 28
  2.7 Kernel Trick 30
  2.8 Storing Sparse Vector 31

I Learning with Massive Number of Features 34

3 Learning with All Substring Features 35
  3.1 All Substring Features 36
  3.2 Data Structure for Strings 39
  3.3 Grafting 42
  3.4 Statistics Computation with Maximal Substring 43
    3.4.1 Equivalent Class 43
    3.4.2 Document Statistics with Equivalent Classes 46
    3.4.3 Enumeration of Equivalent Classes 46
    3.4.4 External Information 48
  3.5 Document Classification 48
    3.5.1 Document Classification Model 50
    3.5.2 Efficient Learning Algorithm with All Substring Features 51
    3.5.3 Inference 54
    3.5.4 Experiments 55
  3.6 Document Clustering 57
    3.6.1 Logistic Regression Clustering 58
    3.6.2 Experiments 60
  3.7 Discussion and Conclusion 62

4 Learning with Combination Features 63
  4.1 Linear Classifier and Combination Features 63
  4.2 Learning Model 65
  4.3 Extraction of Combination Features 66
  4.4 Experiments 68
  4.5 Discussion and Conclusion 71

II Learning with Massive Number of Outputs 72

5 Discriminative Language Models with Pseudo-Negative Examples 73
  5.1 Language Modeling 73
  5.2 Previous Language Models 75
    5.2.1 N-gram Language Model 76
    5.2.2 Topic-based Language Models 78
    5.2.3 Maximum Entropy Language Models 78
    5.2.4 Whole Sentence Maximum Entropy Model 79
    5.2.5 Discriminative Language Models 80
  5.3 Discriminative Language Model with Pseudo-Negative samples 81
  5.4 Fast Kernel Computation 82
  5.5 Latent Features by Semi-Markov Class Model 83
    5.5.1 Class Model 84
  5.6 Improvement of Exchange Algorithm with Filters and Bottom-up Clustering 87
    5.6.1 Semi-Markov Class Model 87
  5.7 Experiments 88
    5.7.1 Experimental Setup 88
    5.7.2 Experiments on Pseudo-Samples 89
    5.7.3 Experiments on DLM-PN 89
  5.8 Discussion 91
  5.9 Conclusion 92

6 Hierarchical Exponential Models for Problem with Many Classes 94
  6.1 Problems of Previous Language Models 94
  6.2 Hierarchical Exponential Model 95
    6.2.1 Learning 97
    6.2.2 Features 98
    6.2.3 Inference 99
  6.3 Experiments 100
  6.4 Discussion and Conclusion 101

7 Conclusion 102
  7.1 Learning with “All Substrings” 103
  7.2 Learning with “Combination Features” 104
  7.3 Discriminative Language Model with Pseudo-Negative Examples 104
  7.4 Hierarchical Exponential Models 104
  7.5 Future Work 105

References 107
List of Figures
1.1 The comparison of the number of words in different corpora. 8
2.1 A plot of several loss functions 19
2.2 A plot of L2 regularization and L1 regularization (above). A plot of the partial derivatives of L2 regularization and L1 regularization (bottom). 23
3.1 An example of bag-of-word representation (BOW). 36
3.2 An example of all substrings representation (ALLSTR). 37
3.3 An example of data structures for a text T = abracadabra$. 40
3.4 The substrings and their classes for a text “T = abracadabra$”. 45
3.5 An example of the computation of the gradient corresponding to a substring “book”. 53
3.6 The time for finding all maximal substrings. 56
4.1 An example of filtering a candidate combination feature. 68
5.1 Example of a sentence sampled by PLMs. 82
5.2 Framework of our classification process. 83
5.3 Example of assignment in SMCM. A sentence is partitioned into variable-length chunks and each chunk is assigned a unique class. 88
5.4 Margin distribution using SMCM bi-gram features. 92
5.5 A learning curve for SMCM (∥C∥ = 500). The accuracy is the performance on the evaluation set. 93
6.1 An example of a hierarchical tree in a hierarchical exponential model 96
6.2 Path information in a hierarchical tree for predicting the next word. 99
List of Tables
2.1 A comparison of online learning methods. 29
2.2 Performance of online learning methods (I = 10). 29
2.3 Performance of online learning methods (I = 1). 30
3.1 The data set in a document classification task 55
3.2 Result of the document classification task 56
3.3 The result of clustering accuracy (%) 60
3.4 Examples of substrings which have the largest weight in each cluster 61
4.1 The performance of the Japanese dependency task on the test set. The active features column shows the number of nonzero weight features. 70
4.2 Document classification results for the Tech-TC-300 data set. The column F2 shows the average of F2 scores for each method of classification. 71
5.1 A comparison of language models. 76
5.2 Performance of language models on the evaluation data. 90
5.3 The number of features of DLM. 90
5.4 Comparison between classification performance with/without PKI index 91
6.1 Corpus statistics in HEM 100
6.2 Results of HEM and Baseline 101
Chapter 1
Introduction
I present several efficient, practical frameworks for natural language processing. They
achieve state-of-the-art performance on many problems including document classification, document clustering, dependency parsing and language modeling. The key issue is
how to process very large texts efficiently with the help of sophisticated machine learning
and algorithms.
1.1 Post-Corpus Oriented NLP
To date, the main goal of natural language processing (NLP) has been to build systems that predict linguistic events/outputs accurately. Nowadays systems frequently
utilize a corpus to obtain rich lexicographic or syntactic information provided by humans. This is called corpus-oriented NLP. In early times, since the amount of available text corpora was small, efficiency was not as important as accuracy. Therefore we
could use any method to build highly accurate systems, even if they required large computational resources, for example support vector machines with kernel methods. It is
definitely true that corpus-oriented NLP opened the door to many high-performance systems,
ranging widely from machine translation [Brown et al., 1990, Och and Ney, 2003] and
syntactic parsing [Collins, 2003, Charniak and Johnson, 2005, Miyao and Tsujii, 2008]
to information extraction [McCallum et al., 2000, Lafferty et al., 2001], information retrieval [Zhai and Lafferty, 2004], summarization [Knight and Marcu, 2002] and document
classification [Joachims, 1998, Pang et al., 2002], to name a few.
Figure 1.1: The comparison of the number of words in different corpora.
However, thanks to the rapid development of the Internet and computers, we can currently
obtain much larger corpora than those used previously in the NLP community. For example, the Penn Treebank [Marcus et al., 1994], released in the early 90’s, is one of the most widely used corpora in the NLP
community. It consists of almost one million words with syntactic
information annotated by humans. On the other hand, we can currently use the following
corpora: the Google N-gram corpus [1] is one of the biggest corpora, built by processing one
trillion words and one hundred billion sentences; PubMed [2] is a collection of biomedical
articles, consisting of several million articles; and Wikipedia [3] is the biggest online encyclopedia, including ten million entries in more than one hundred languages. Figure 1.1
shows the number of words in these corpora. The size of current corpora is about 10^6
times larger than that of the previous ones.
I will argue the following: more data gives better NLP systems. This can be restated as
follows: a simple NLP system built using very large amounts of data often beats a complex,
sophisticated NLP system built using only small amounts of data. Many studies support
this rule: a machine translation system [Brants et al., 2007] with a simple language model
beats systems with a complex language model when very large amounts of data are
available, and a sequential labeling task [J. Suzuki, 2008] could also be much improved over
the state-of-the-art systems by using large amounts of raw data.
[1] LDC2006T13
[2] http://www.ncbi.nlm.nih.gov/pubmed/
[3] http://en.wikipedia.org/wiki/
Text resources are now extraordinarily abundant, and the key issue of NLP becomes how
to process all this data efficiently with simple methods. My proposal goes beyond this:
constructing an NLP system that employs sophisticated methods and can process large
amounts of data. To achieve these conflicting goals at the same time, I combine several
techniques and methods from different fields.
1.2 Machine Learning and NLP
Roughly speaking, an NLP task is to find a mapping function from a linguistic event to
another linguistic event. For example, in a machine translation task, the input and output
are the source and target texts, and in a syntactic parsing task, they are the word sequence
and the syntactic tree respectively. Since the goal of (supervised) Machine Learning (ML)
is to create such a function using training data, it is natural to employ ML in NLP. The
training of an NLP system is decomposed into the following two steps: (1) represent the
linguistic event/input as a vector of computed values called a feature vector, and (2) find
a mapping function from the feature vector to the correct output. Since all linguistic information can be represented as feature vectors, researchers in NLP can focus on
problem (1), while researchers in ML can focus on solving problem (2) separately.
Although this division has promoted the use of ML in NLP, I revise this and instead treat
(1) and (2) together to achieve more efficient systems.
1.3 Difficulties in Dealing with Very Large Data
When constructing an NLP system based on sophisticated ML and utilizing
very large amounts of data, difficulties arise not only from the massive number of examples but also from the massive numbers of candidate features and outputs.
To process a large number of examples efficiently, we can currently
use online learning algorithms [Rosenblatt, 1958, Collins, 2002, Dredze et al., 2008,
Crammer et al., 2008, Crammer et al., 2009, Shalev-Shwartz, 2007] and space-efficient
data structures [Navarro and Mäkinen, 2007, Yan et al., 2009b]. Therefore, the remaining tasks are how to deal with massive numbers of features and outputs. For clarity, I introduce two
running example problems, which I will deal with throughout this thesis.
1.3.1 Example Problem 1: Document Classification
The first example problem is document classification. Given a document as input, the task
is to predict its class or category. In previous studies, a document was represented as a
bag-of-words (BOW) feature vector, in which each value corresponds to the occurrence
of a word. Since the BOW representation ignores the order and positions of words, this
representation loses much information from the original document. However, word
sequences or substrings are known to be effective for classifying a document. For example,
when some templates are used only in spam mails, the occurrence of such a template can
correctly identify a document as spam, although the BOW representation cannot capture
this clearly. However, since the number of distinct word sequences appearing in a document is quadratic in the length of the document, the computational cost and working space
become prohibitively large if we deal with these features naively.
1.3.2 Example Problem 2: Language Modeling
The second example is language modeling. Given context information as input, the task is
to predict the most probable next word. In another setting, given a whole sentence as input,
the task is to classify it as a correct or incorrect sentence. Since the number of candidate
words is much larger than those considered in previous ML systems, previous language
models are defined on a simple probabilistic model called the N-gram model, and cannot
utilize rich features.
In both problems, current systems can only use simple models (e.g. the BOW model and
the N-gram model) when they deal with large amounts of data. In the next section I propose
several frameworks that can use more powerful models and efficiently solve these problems.
1.4 Contribution
The primary contribution of this thesis is the development of several frameworks for large-scale NLP that deeply connect ML and NLP. In addition, I show how recent
machine learning, data structures, and string algorithms can be used to make NLP useful
given the large amount of text available.
I propose four novel methods to deal with different types of problems.
• Learning with “all substring features” (Chapter 3, [Okanohara and Tsujii, 2009b])
• Learning with “combination features” (Chapter 4, [Okanohara and Tsujii, 2009a])
• Discriminative Language Model (Chapter 5, [Okanohara and Tsujii, 2007])
• Hierarchical Exponential Model (Chapter 6)
The first and second methods consider the problem of a massive number of candidate
features. They use the same framework, which can efficiently solve a problem with
many candidate features given an algorithm for finding the most effective features. Intuitively, a feature is called effective when the current system can classify many training
examples correctly with that feature. The precise definition of effective will be given
in the following chapters.
The first tackles the problem with “all substring features”, in which all substrings appearing in a document are candidate features. Although the number of substrings is prohibitively large to enumerate and optimize, our algorithm can find the optimal classifier in
linear time in the total length of the documents by summarizing substring information into
equivalence classes. The second tackles learning with “combination features”, in which
all combinations of original features are candidate features. Our algorithm computes the
statistics of combination features in an online manner with filtering, and efficiently finds
the few effective combination features. I applied these methods to document classification and clustering tasks.
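To make the filtering idea concrete, the following is a minimal sketch (not the thesis implementation, which is described in Chapter 4) of collecting co-occurrence statistics of feature pairs while keeping only pairs whose component features are individually frequent. The thresholds and feature names are illustrative, and the sketch uses two passes over an in-memory list for clarity, whereas the actual algorithm works online.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(examples, min_single=2, min_pair=2):
    """Count feature-pair co-occurrences with a simple frequency filter.

    examples: list of sets of active (binary) features.
    A pair is only counted if both of its features are individually frequent,
    which keeps the table of candidate combination features small.
    """
    single = Counter()
    for feats in examples:
        single.update(feats)
    frequent = {f for f, c in single.items() if c >= min_single}

    pair = Counter()
    for feats in examples:
        kept = sorted(f for f in feats if f in frequent)
        for a, b in combinations(kept, 2):
            pair[(a, b)] += 1
    return {p: c for p, c in pair.items() if c >= min_pair}

if __name__ == "__main__":
    data = [{"w=bank", "pos=NN"}, {"w=bank", "pos=NN", "prev=the"},
            {"w=run", "pos=VB"}, {"w=bank", "pos=NN", "prev=the"}]
    print(frequent_pairs(data))
```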
The third and fourth methods consider the problem of a massive number of candidate
outputs, namely the language modeling task. There are two types of language modeling
tasks: the first is to discriminate whether a sentence is correct or not, and the second is to predict
the next word given a context. The third and fourth methods solve these two language
modeling tasks, respectively.
In detail, I propose a discriminative language model with pseudo-negative examples
(DLM-PN), which directly discriminates between correct and incorrect sentences. Since
there are infinitely many candidate outputs (correct/incorrect sentences can be expanded without limit), direct discrimination seems difficult. To solve this, DLM-PN samples pseudo-negative
examples and learns to discriminate between sentences in the corpora and the
pseudo-negative examples. The fourth offers a classification model called a
hierarchical exponential model (HEM), where the search space of candidate outputs is represented as a hierarchical tree. With this tree, the algorithm can estimate the probability of
an output in O(log K) time, where K is the number of possible outputs. Moreover, it can also
find the output with the largest probability in O(log K) time. Since HEM uses exponential
models inside, it can use any type of features.
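As a toy illustration of the pseudo-negative idea only (not the experimental setup of Chapter 5, which samples from much larger probabilistic language models), the sketch below estimates a bigram model from a few correct sentences and samples "sentences" from it; such samples tend to be locally fluent but globally incorrect, and serve as negative training data. All data and names here are made up.

```python
import random
from collections import defaultdict

def train_bigram(sentences):
    # Maximum-likelihood bigram counts with <s>/</s> boundary markers.
    counts = defaultdict(lambda: defaultdict(int))
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    return counts

def sample_sentence(counts, max_len=20, rng=random):
    # Sample words one by one from the conditional bigram distribution.
    word, out = "<s>", []
    while len(out) < max_len:
        nexts = counts[word]
        word = rng.choices(list(nexts), weights=list(nexts.values()))[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

if __name__ == "__main__":
    corpus = ["the cat sat on the mat", "the dog sat on the rug",
              "a cat saw the dog"]
    model = train_bigram(corpus)
    random.seed(0)
    for _ in range(3):
        print(sample_sentence(model))  # used as a pseudo-negative example
```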
1.5 Tools for Large-Scale Machine Learning
In addition to the proposals above, my frameworks heavily utilize recent research from
various fields, which is fundamental to achieving large-scale NLP. I briefly introduce that
research here; the details will be explained later in the relevant chapters.
1.5.1 Online Learning
In ML, learning is typically equivalent to solving a convex optimization problem. To solve this,
many different optimization methods have been proposed, most of which compute the
gradient and an (approximate) Hessian matrix of the objective function, and then update the
parameters according to them. Thus, the learning step usually requires time and space quadratic or super-linear in the number of examples/features, and cannot be directly applied to the
large-scale problems that appear in NLP. Another problem of existing convex optimization
solvers is that they require a large working space. In large-scale NLP problems, it is even difficult
to store all example information in memory. We call this type of optimization batch learning
because the parameters are updated only after seeing all the training examples.
Recently, the research field of online learning or stochastic convex optimization [Shalev-Shwartz, 2007, Cesa-Bianchi and Logosi, 2006] has emerged. These
methods look at examples one by one and update the parameters immediately. The
simplest algorithm is the perceptron algorithm [Rosenblatt, 1958], which was proposed
half a century ago, but its usefulness was rediscovered recently [Collins, 2002].
Online learning has several advantages over batch learning. First, not all examples need
to be stored in memory, since each example can be read sequentially. Second,
online learning converges to the optimum faster than batch learning. This is because many
training examples in NLP are redundant, and an online learning algorithm updates the
parameters more often. Therefore, the parameters for frequent features are tuned in early
steps, and those for infrequent features are tuned more carefully in later steps. Recent
online learning algorithms can converge to the optimum after looking at the training examples
only once.
1.5.2 Sparse Priors: L1 regularization
In many NLP tasks, although there are many candidate features, most of them are irrelevant to the task, and the effective features are very few. For example, in a document
classification task, the number of words related to the category of a document is often
only two or three, although the number of words appearing in a document is large, say a hundred. To extract these few effective features, we can use a sparse prior on the parameters during
learning. The result is then a set of sparse parameters; here, sparse means that many parameters
are exactly zero or take a default value. Such sparse parameters make inference extremely
efficient and the model very compact.
In particular, I use a sparse prior called L1 regularization at training time. The
optimization with L1 regularization is still convex, but non-smooth, so many specialized optimization methods have been proposed. Recent studies [Tsuruoka et al., 2009,
Duchi and Singer, 2009] show how to apply L1 regularization to online learning. I will
combine a sparse prior with a search algorithm that enables us to consider only effective features without enumerating all possible candidate features.
1.5.3 Stringology and Succinct Data Structure
String algorithm, or stringology, research has advanced recently to deal with
large-scale text. For example, the number of occurrences of any substring in a document
can be computed in constant time by using recent compressed full-text indices, while the
working space is less than that of the original text [Navarro and Mäkinen, 2007]. Since NLP deals
with strings as input and output, it is desirable to use these data structures, especially when
the data is very large.
For example, in this thesis, I present an algorithm for finding the most effective features when all substrings are candidate features. Although the number of possible candidate
features is $O(N^2)$, the proposed algorithm can find them in $O(N)$ time without any approximation, where $N$ is the total length of all the documents. To achieve this, I heavily
use several data structures to search for effective features efficiently.
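As a small illustration of why string indices help (a naive sketch, not the compressed full-text index cited above, and not the enhanced suffix arrays of Chapter 3), the following builds a plain suffix array by sorting suffixes and counts the occurrences of any substring with two binary searches.

```python
from bisect import bisect_left, bisect_right

def suffix_array(text):
    # O(N^2 log N) toy construction; practical indices are far more compact.
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text, sa, pattern):
    # All suffixes starting with `pattern` form a contiguous block in the
    # suffix array; locate that block with binary search.
    suffixes = [text[i:] for i in sa]   # materialized only for clarity
    lo = bisect_left(suffixes, pattern)
    hi = bisect_right(suffixes, pattern + "\uffff")
    return hi - lo

if __name__ == "__main__":
    t = "abracadabra"
    sa = suffix_array(t)
    for p in ["abra", "ra", "cad", "zzz"]:
        print(p, count_occurrences(t, sa, p))
```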
1.6 Overview of This Thesis
This thesis consists of two parts.
The first part, Chapter 3 and Chapter 4, considers problems with a massive number
of features. I especially study the problems of “all substring features” and “combination features”, which are important types of features in NLP. Although these features are
too numerous to handle in a naive way, I present learning frameworks that handle them
efficiently. The second part, Chapter 5 and Chapter 6, considers problems with a massive
number of outputs. In particular, I consider language modeling, and provide two different
language models, which can use more powerful features.
The chapters in this thesis are organized as follows:
Part I. Learning with Massive Number of Features
Chapter 3 focuses on how to deal with “all substring features”. This chapter presents
a learning framework based on L1 regularization, efficient calculation of the gradients of features, and a grafting algorithm. Applications of this framework in NLP, including
document classification and clustering, are also presented.
In learning with all substring features, all substrings appearing in a document are
candidate features. Recent studies show that the string kernel can consider all subsequence information and achieve the highest accuracy in the document classification task.
However, the string kernel method requires time almost quadratic in the document lengths and
a large working space, not only at training but also at inference time.
I show that the statistics of substrings can be summarized into those of maximal substrings, which can be exhaustively determined in linear time by using auxiliary data structures. Then, I can search for the most effective substring in linear time. Moreover, we can obtain a compact model by using L1 regularization at training time, which allows extremely
efficient inference in time and space during the application of the model.
Chapter 4 focuses on learning with combination features. In many NLP tasks, a user
defines feature templates to specify a set of original features. For example, in the part-of-speech tagging task, the current/previous/next word and their prefixes/suffixes would
be useful for selecting the part-of-speech tag, and all of these are defined as original features.
Combinations of these features are often much more important. Obviously, learning with all
possible combination features requires a prohibitively large computational cost. I propose
an algorithm for efficient computation of combination statistics in an online manner with
filtering.
Part II. Learning with Massive Number of Outputs
Chapter 5 focuses on language models (LMs). The most widely used LM is the Probabilistic Language Model (PLM), which assigns a probability of correctness to a sentence.
In particular, N-gram Language Models (NLMs) are popular because they are very simple and can make use of large amounts of training data. However, many studies show
that LMs with rich information can achieve much more accurate results.
To build a more accurate and efficient LM, I present a discriminative language model
with pseudo-negative samples. The problem with treating language modeling as a discriminative task is that no negative examples are available for training. To produce negative samples, we generate pseudo-negative examples from PLMs and use them as negative
examples. We also employ an online margin-based learning algorithm with fast kernel
computation. Finally, we capture latent information by using hidden semi-Markov models, which reduces the computational cost and improves the generalization ability.
Chapter 6 presents a hierarchical exponential model (HEM). In an HEM, we build a
hierarchical binary tree whose leaves correspond to candidate outputs and whose internal nodes
are binary classifiers. The probability of an output is calculated as the product of the probabilities
of the classification results along the path from the root to the corresponding leaf. An HEM
supports an efficient arg max operation, which returns the output with the largest probability in O(log K) time, where K is the number of labels. In experiments on language
modeling, HEM is compared to an N-gram model and achieves higher performance
by a large margin.
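A minimal sketch of this idea, with an arbitrary toy tree and made-up branch probabilities (in an HEM these come from per-node logistic regression models conditioned on the context): the probability of a word is the product of branch probabilities on its root-to-leaf path, and a best-first search over path probabilities is one way to realize an efficient arg max. This is an illustration under those assumptions, not necessarily the exact procedure of Chapter 6.

```python
import heapq

# Toy hierarchical tree over 4 words. Each internal node stores the
# probability of taking its *left* branch (made-up constants here).
#   node = ("leaf", word)  or  ("node", p_left, left_child, right_child)
TREE = ("node", 0.6,
        ("node", 0.5, ("leaf", "cat"), ("leaf", "dog")),
        ("node", 0.9, ("leaf", "the"), ("leaf", "of")))

def word_probability(tree, word, prob=1.0):
    """p(word) = product of branch probabilities on its root-to-leaf path."""
    if tree[0] == "leaf":
        return prob if tree[1] == word else None
    _, p_left, left, right = tree
    return (word_probability(left, word, prob * p_left)
            or word_probability(right, word, prob * (1.0 - p_left)))

def most_probable_word(tree):
    """Best-first search: the path probability never increases going down,
    so the first leaf popped from the heap is the arg max.
    (A purely greedy descent could miss it, e.g. it returns 'cat' here.)"""
    heap = [(-1.0, 0, tree)]          # (-probability, tiebreak id, node)
    counter = 1
    while heap:
        neg_p, _, node = heapq.heappop(heap)
        if node[0] == "leaf":
            return node[1], -neg_p
        _, p_left, left, right = node
        heapq.heappush(heap, (neg_p * p_left, counter, left)); counter += 1
        heapq.heappush(heap, (neg_p * (1.0 - p_left), counter, right)); counter += 1

if __name__ == "__main__":
    print(word_probability(TREE, "of"))   # 0.4 * 0.1 = 0.04
    print(most_probable_word(TREE))       # ('the', 0.36)
```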
Chapter 2
Background
To achieve large-scale NLP, I employ diverse ideas from machine learning, data structures,
string algorithms and optimization techniques. In this chapter, I explain the basics of these
ideas and postpone the details to the relevant chapters.
2.1 Definition
Let $\mathbb{R}$ be the set of real numbers, $\mathbb{R}^+$ the set of positive real numbers, and $\mathbb{R}^m$ the set of $m$-dimensional real vectors. A set $C$ is called convex when, for any $x_1, x_2 \in C$ and $0 \le \theta \le 1$,
$\theta x_1 + (1-\theta) x_2 \in C$. A function $f$ is convex when its domain is a convex set and, for all
$x, y$ in the domain of $f$ and $0 \le \theta \le 1$, $f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y)$.
2.2 General Framework of Machine Learning
Typically, the goal of machine learning (ML) is to create a function $y = f(x; w)$ that
predicts the correct output $y \in Y$ given input $x \in X$, where $w$ is a parameter defining the
behavior of the function $f$, and $X$ and $Y$ are the sets of candidate inputs and outputs respectively.
In this thesis, we assume that the parameter is a real vector $w \in \mathbb{R}^m$.
While ML has been applied to problems in many fields like biology, vision analysis and
economics, I focus on how ML is used in NLP. Many NLP problems are represented as a
classification task in which the outputs are structured discrete objects, like a category in document
classification or the next word in language modeling. We call this discrete value a label.
I first show a general supervised learning framework. In this framework, a parameter $w$ is estimated using training examples $\{(x_i, y_i)\}$ so that the function
$f(x; w)$ correctly classifies the training examples. However, this estimation can overfit on finite data. Therefore regularization is usually applied in the estimation. Formally, we solve
the following optimization problem to estimate the parameter:
$$w^* = \arg\min_{w} \; L(w) + C R(w) \qquad (2.1)$$
$$L(w) = \sum_i l(x_i, y_i, w), \qquad (2.2)$$
where the function $L(w)$ is the empirical risk, which measures how well the function with the
parameter $w$ predicts the labels of the training examples, and $l(x, y, w)$ is a loss function, which
measures the loss suffered on each example. The term $R(w)$ is a regularization term on the parameter that controls over-fitting to the training data, and $C > 0$ is a trade-off parameter:
when $C$ is large, regularization is strengthened, and vice versa. The parameter $C$ is often estimated by cross-validation. We should use an adequate loss function and regularization
for different problems.
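For concreteness, the following is a small sketch of evaluating the objective of Eqs. (2.1)–(2.2) for a linear model; the concrete losses used in this thesis appear in Section 2.3, and the function names and data here are illustrative.

```python
import numpy as np

def objective(w, X, y, loss, C, reg):
    """Regularized empirical risk of Eqs. (2.1)-(2.2):
    L(w) + C * R(w), with L(w) the sum of per-example losses."""
    return sum(loss(x, t, w) for x, t in zip(X, y)) + C * reg(w)

# Illustrative choices: log-loss of a linear binary classifier, L2 regularizer.
log_loss = lambda x, t, w: np.log1p(np.exp(-t * w.dot(x)))
l2_reg = lambda w: np.sum(w ** 2)

if __name__ == "__main__":
    X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    y = np.array([+1, -1, +1])
    w = np.array([0.5, -0.5])
    print(objective(w, X, y, log_loss, C=0.1, reg=l2_reg))
```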
2.3 Linear Classifier
I first explain the simpler case of binary classification, and then extend it to the multi-class case.
In binary classification, the task is to predict a binary label $y \in \{+1, -1\}$ for input $x$,
where the label $+1$ is called positive and the label $-1$ is called negative.
We represent the information of input $x$ by a real vector $\phi(x) \in \mathbb{R}^m$, called a feature
vector. Each dimension of this feature vector captures a characteristic of the input, defined
by a feature function $f_i(x)$: $\phi_i(x) = f_i(x)$. In NLP, a feature function usually corresponds
to some linguistic event, for example $f_i(x) = I(x$ is the word “University”$)$, where $I(a)$
is an indicator function: $I(a) = 1$ if $a$ is true and $I(a) = 0$ otherwise. We enumerate all
possible events, like the occurrence of some word $x$ in the two preceding positions. A feature
template defines all such events and generates all possible feature functions. Therefore,
a feature vector $\phi(x)$ tends to be very sparse, in that many elements are exactly zero. This
characteristic will be used for building efficient systems in this thesis. Note that a bias
for each label can be incorporated by expanding the feature vector as $\phi'(x) := \{\phi(x), 1\}$.
Then, a binary linear classifier predicts the output as follows:
$$f(x; w) = \begin{cases} +1 & s(x; w) \ge 0 \\ -1 & s(x; w) < 0 \end{cases} \qquad (2.3)$$
$$s(x; w) = w^T \phi(x) \qquad (2.4)$$
where $w \in \mathbb{R}^m$ is a weight vector, each element of which corresponds to the weight of a feature
function. Therefore, this classifier uses weighted majority voting to decide the label.
The function $s(x; w)$ is called a score function, and its absolute value $|s(x; w)|$ is called a
margin, which measures the confidence of the classifier.
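A minimal sketch of the classifier of Eqs. (2.3)–(2.4) with sparse feature vectors stored as dictionaries (feature name to value); the feature names and weights are illustrative.

```python
def score(w, phi):
    # s(x; w) = w^T phi(x), iterating only over the non-zero features.
    return sum(w.get(f, 0.0) * v for f, v in phi.items())

def predict(w, phi):
    # f(x; w) = +1 if s(x; w) >= 0, else -1.
    return +1 if score(w, phi) >= 0.0 else -1

if __name__ == "__main__":
    # Sparse feature vector of a document plus a bias feature.
    phi = {"word=University": 1.0, "word=Tokyo": 1.0, "bias": 1.0}
    w = {"word=University": 0.7, "word=spam-offer": -2.0, "bias": -0.1}
    print(score(w, phi), predict(w, phi))  # |s| measures the confidence
```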
For the loss function used in Eq. (2.2), the most straightforward choice in binary classification is the number of misclassified training examples,
$$L_{0/1}(w) = \sum_i l_{0/1}(x_i, y_i, f) \qquad (2.5)$$
$$l_{0/1}(x, y, f) = I(y f(x; w) < 0), \qquad (2.6)$$
where $l_{0/1}(x, y, f)$ returns 1 if the current classifier $f$ misclassifies the example $(x, y)$, and
returns 0 otherwise.
However, the optimization with the $l_{0/1}$ function is not convex, and indeed very hard to
optimize. Another problem is that the resulting scores can always be close to 0, and
thus the prediction will suffer from input noise.
Instead, the following three loss functions have been proposed [Collins et al., 2002]:
$$l_{\mathrm{log}}(x, y, f) = \log(1 + \exp(-y w^T \phi(x))) \quad \text{(log-loss)} \qquad (2.7)$$
$$l_{\mathrm{hinge}}(x, y, f) = \left[1 - y w^T \phi(x)\right]_+ \quad \text{(hinge-loss)} \qquad (2.8)$$
$$l_{\mathrm{exp}}(x, y, f) = \exp(-y w^T \phi(x)) \quad \text{(exp-loss)} \qquad (2.9)$$
where $[a]_+ = \max(a, 0)$. The function $l_{\mathrm{log}}(x, y, f)$ is called the log-loss, $l_{\mathrm{hinge}}(x, y, f)$ the hinge-loss, and $l_{\mathrm{exp}}(x, y, f)$ the exp-loss. These functions are convex and are
upper bounds of the $l_{0/1}$ function. Therefore the learning problem (2.1) is also a convex optimization
problem. Figure 2.1 shows a plot of these functions and the $l_{0/1}$ function. The log-loss is often
used in probabilistic models, the hinge-loss is used for support vector machines, and the
exp-loss is used for boosting methods.
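The three surrogate losses of Eqs. (2.7)–(2.9) as a direct transcription (a sketch; log1p is used for numerical stability):

```python
import math

def log_loss(margin):      # l_log   = log(1 + exp(-y w^T phi(x)))
    return math.log1p(math.exp(-margin))

def hinge_loss(margin):    # l_hinge = [1 - y w^T phi(x)]_+
    return max(0.0, 1.0 - margin)

def exp_loss(margin):      # l_exp   = exp(-y w^T phi(x))
    return math.exp(-margin)

if __name__ == "__main__":
    for m in (-1.0, 0.0, 0.5, 2.0):   # margin = y * w^T phi(x)
        print(m, log_loss(m), hinge_loss(m), exp_loss(m))
```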
Figure 2.1: A plot of several loss functions
In particular, I give another interpretation of the log-loss: it is equivalent to maximum likelihood estimation of a logistic regression model. In a logistic regression
model, the probability of a label $y$ given input $x$ is defined as
$$p(y|x; w) = \frac{\exp(y w^T \phi(x))}{\exp(y w^T \phi(x)) + \exp(-y w^T \phi(x))} \qquad (2.10)$$
$$= \frac{1}{1 + \exp(-2 y w^T \phi(x))}. \qquad (2.11)$$
The parameter $w$ is then estimated by maximum likelihood estimation, in which the
log-likelihood of the examples is maximized,
$$\max_w \sum_i \log p(y_i|x_i; w) = \max_w \; -\sum_i \log\left(1 + \exp(-2 y_i w^T \phi(x_i))\right). \qquad (2.12)$$
Thus, estimation with the log-loss corresponds to maximum likelihood estimation
of a logistic regression model. Note that this model is also identical to the result of
maximum likelihood estimation of maximum entropy models [Jaynes, 1957].
Next, we consider classification when the number of labels is larger than two,
called multi-class classification.
Let us represent the information of input $x$ and a label $y$ by a feature vector $\phi(x, y) \in \mathbb{R}^m$. Each dimension of $\phi(x, y)$ is the result of a function of $x$ and $y$, $\phi(x, y)_i = f_i(x, y)$,
capturing a characteristic of $x$ and $y$. An example of such a function in NLP is $f_i(x, y)$
$= I($“$x$ is the word money and $y$ is the topic Business”$)$.
In many cases, feature functions are defined by a cross-product of input-dependent
features and candidate labels as follows. Let $\phi(x) \in \mathbb{R}^{m'}$ be an input-dependent feature
vector, in which the value of each element is determined by the input only. Then, we build
$\phi(x, y) \in \mathbb{R}^m$ by concatenating $\phi(x) I(y' = y)$ for each $y' \in Y$, where $m = m' \times |Y|$. For
example, given $\phi(x) = (1, 0.5, 2)$ and $y \in \{0, 1, 2\}$, $\phi(x, 1) = (0, 0, 0, 1, 0.5, 2, 0, 0, 0)$.
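The cross-product construction above as a short sketch reproducing the worked example (a dense version for clarity; in practice $\phi(x, y)$ would be kept sparse):

```python
def cross_product_features(phi_x, y, labels):
    """Build phi(x, y) by placing phi(x) in the block of label y
    and zeros elsewhere, so that dim = len(phi_x) * len(labels)."""
    vec = []
    for label in labels:
        vec.extend(phi_x if label == y else [0.0] * len(phi_x))
    return vec

if __name__ == "__main__":
    phi_x = [1.0, 0.5, 2.0]
    print(cross_product_features(phi_x, 1, labels=[0, 1, 2]))
    # -> [0.0, 0.0, 0.0, 1.0, 0.5, 2.0, 0.0, 0.0, 0.0], as in the text
```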
As in the case of binary classification, a score $s(x, y) \in \mathbb{R}$ for input $x$ and a label $y$ is
assigned using a linear function
$$s(x, y; w) = w^T \phi(x, y), \qquad (2.13)$$
where $w$ is a weight vector, each element of which corresponds to the weight of a feature function.
Then, the classifier predicts the label that maximizes the score,
$$y^* = f(x; w) \qquad (2.14)$$
$$= \arg\max_y s(x, y; w) \qquad (2.15)$$
$$= \arg\max_y w^T \phi(x, y). \qquad (2.16)$$
To estimate the parameter, we use loss functions similar to those of binary classification.
First we define the difference of scores between the correct label and other labels:
$$\psi_{i,y} = s(x_i, y_i; w) - s(x_i, y; w). \qquad (2.17)$$
We also define $y_i^* = f(x_i; w)$, and omit the subscript $i$ when there is no confusion.
Then, the loss functions are defined as
$$l_{\mathrm{log}}(x, y, f) = \log \sum_{y'} \exp(-\psi_{i,y'}) \quad \text{(log-loss)} \qquad (2.18)$$
$$l_{\mathrm{hinge}}(x, y, f) = \left[ m(y_i, y^*) - \psi_{i,y^*} \right]_+ \quad \text{(hinge-loss)} \qquad (2.19)$$
$$l_{\mathrm{exp}}(x, y, f) = \exp(-\psi_{i,y^*}) \quad \text{(exp-loss)} \qquad (2.20)$$
where $m(y, y^*) \in \mathbb{R}$ is the misclassification penalty when the correct label is $y$ and the
prediction of the classifier is $y^*$. For $m(y, y^*)$, an indicator function $m(y, y^*) = I(y \ne y^*)$ is often used. The intuition of the hinge loss is that we prefer a parameter such that the
score for the correct label is larger than that for all other labels by a margin $m(y, y^*)$. The
parameter is then estimated by solving the optimization (2.1) with the above loss functions.
Note that since these loss functions are also convex, the optimization is again a convex
optimization problem.
The log-loss function for the multi-class case corresponds to maximum likelihood estimation in multi-class logistic regression models. In a multi-class logistic regression model, the
conditional probability of a label $y$ given input $x$ is defined as follows,
$$p(y|x; w) = \frac{1}{Z(x)} \exp(s(x, y)), \qquad (2.21)$$
where $Z(x) = \sum_{y'} \exp(s(x, y'))$ is a normalization term or partition function, so that
$\sum_{y'} p(y'|x; w) = 1$.
An important class of structured classification arises when a label consists of an undirected graph: a node in the graph corresponds to a variable and an edge between
nodes represents a dependency between variables. Examples of such outputs are
a sequence of part-of-speech tags and a syntactic tree. Conditional Random Fields
(CRFs) [Lafferty et al., 2001] (log-loss, multi-class logistic regression) and Max-Margin
Markov Networks (M$^3$N) [Taskar et al., 2004] (hinge-loss) are examples of this class. In
this case, although the number of possible outputs is exponential in the input size, learning
and inference can be done efficiently by using dynamic programming techniques.
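To illustrate how dynamic programming keeps inference tractable despite the exponential number of outputs, here is a minimal Viterbi arg max for a linear-chain model with made-up node and transition scores (a generic sketch of chain inference, not a full CRF and not tied to any model in this thesis):

```python
def viterbi(node_score, trans_score, tags, n):
    """Find the highest-scoring tag sequence of length n.
    node_score(i, t): score of tag t at position i.
    trans_score(s, t): score of the transition s -> t."""
    best = {t: node_score(0, t) for t in tags}
    back = []
    for i in range(1, n):
        new_best, ptr = {}, {}
        for t in tags:
            s, prev = max((best[p] + trans_score(p, t), p) for p in tags)
            new_best[t] = s + node_score(i, t)
            ptr[t] = prev
        best, back = new_best, back + [ptr]
    # Reconstruct the best path by following the back-pointers.
    tag = max(best, key=best.get)
    path = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        path.append(tag)
    return list(reversed(path)), max(best.values())

if __name__ == "__main__":
    tags = ["DT", "NN", "VB"]
    words = ["the", "dog", "runs"]
    node = lambda i, t: {("the", "DT"): 2.0, ("dog", "NN"): 2.0,
                         ("runs", "VB"): 2.0}.get((words[i], t), 0.0)
    trans = lambda s, t: 1.0 if (s, t) in {("DT", "NN"), ("NN", "VB")} else 0.0
    print(viterbi(node, trans, tags, len(words)))  # (['DT', 'NN', 'VB'], 8.0)
```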
2.4 Regularization
This section introduces regularization terms and explains their characteristics.
In training (Eq. (2.1)), regularization is added to prevent over-fitting. The regularization term is often a norm of the parameters. The first is L2 regularization,
$$R(w) = \|w\|_2^2 = \sum_i w_i^2, \qquad (2.22)$$
and the second is L1 regularization, also called lasso regularization,
$$R(w) = \|w\|_1 = \sum_i |w_i|. \qquad (2.23)$$
They look very similar, but the results of using these regularizations are totally different:
the result of optimization with L1 regularization is often a sparse vector, in which
many of the parameters are exactly zero, while that of L2 regularization is not. In other words, learning
with L1 regularization naturally has a feature selection effect, which results in
efficient and interpretable inference. For example, Gao et al. [2007a] compared L1-regularized logistic regression models with other learning methods including L2-regularized logistic
regression models. Even though the performances of these methods are almost identical,
the number of non-zero weights with L1 regularization is approximately 1/10 of that with L2 regularization.
Figure 2.2 explains this effect. While the partial derivative of L2 regularization goes to 0 quickly as the parameter $w$ goes to 0, that of L1 regularization is constant
even when the parameter $w$ goes to 0. Therefore, with L1 regularization, the parameter is
pushed to exactly 0 if the corresponding feature is irrelevant to the objective function.
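A tiny numerical sketch of this effect: one proximal (soft-thresholding) step for L1 against the corresponding shrinkage step for L2, showing that L1 drives small weights exactly to zero. The step size and regularization strength are illustrative.

```python
def l1_prox(w, eta_c):
    # Soft-thresholding: the proximal step for eta*C*||w||_1;
    # weights within eta_c of zero are set to exactly zero.
    return [0.0 if abs(x) <= eta_c else (x - eta_c if x > 0 else x + eta_c)
            for x in w]

def l2_shrink(w, eta_c):
    # A gradient step on eta*C*||w||_2^2 only rescales; it never reaches zero.
    return [x * (1.0 - 2.0 * eta_c) for x in w]

if __name__ == "__main__":
    w = [0.8, 0.05, -0.03, -1.2]
    print("L1:", l1_prox(w, 0.1))    # small weights become exactly 0.0
    print("L2:", l2_shrink(w, 0.1))  # small weights merely shrink
```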
From the point of view of Maximum A Posteriori (MAP) estimation, L2 regularization
corresponds to the case where the prior of the parameter is a Gaussian distribution with zero mean, $p(w) \propto \exp(-\|w\|_2^2/\sigma)$, and L1 regularization corresponds to the case where
the prior of the parameter is a Laplace distribution, $p(w) \propto \exp(-\|w\|_1/\sigma)$.
Recently, a mixed norm or elastic net has also been proposed [Chen, 2009b, Chen, 2009a],
$$R(w) = \sum_i w_i^2 + C_1 \sum_i |w_i|, \qquad (2.24)$$
where $C_1 > 0$ is the trade-off parameter between L2 regularization and L1 regularization.
This norm has both the strengths and weaknesses of L1 and L2 regularization.
2.5 Batch Learning
Recall that the parameter is estimated by solving the convex optimization problem (Eq.
(2.1)). The objective function is convex when all the loss functions and the regularization term
are convex, as explained above. A convex optimization problem has several good characteristics [Boyd and Vandenberghe, 2004]: (1) the global minimum is always unique (although
the minimizer is not always unique), and (2) gradient-based optimization algorithms converge
to the global minimum. In gradient-based algorithms, we first compute the gradient of the
objective function,
$$v := \frac{\partial \left( L(w) + C R(w) \right)}{\partial w}. \qquad (2.25)$$
Figure 2.2: A plot of L2 regularization and L1 regularization (above). A plot of the partial
derivatives of L2 regularization and L1 regularization (bottom).
Then we update $w := w + \mu v$, where $\mu < 0$ is the update width, determined by
binary search or specialized methods.
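A compact sketch of one such full-batch gradient computation and update for an L2-regularized log-loss objective (a fixed step size replaces the step-width search, purely for illustration; data and names are made up):

```python
import numpy as np

def batch_gradient_descent(X, y, C=0.1, step=0.1, iters=100):
    """Minimize sum_i log(1+exp(-y_i w^T x_i)) + C * ||w||^2
    by full-batch gradient steps (the gradient of Eq. (2.25))."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)
        # Gradient of the summed log-loss plus the L2 regularizer.
        grad = -(X.T @ (y * (1.0 / (1.0 + np.exp(margins))))) + 2.0 * C * w
        w -= step * grad
    return w

if __name__ == "__main__":
    X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    y = np.array([+1, +1, -1, -1])
    w = batch_gradient_descent(X, y)
    print(w, np.sign(X @ w))  # the learned w separates the two classes
```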
In particular, since the training data in NLP are very sparse and have a very large number of
dimensions (e.g. $10^5$), the characteristics of the optimization are different from those appearing
in other fields. Since there is a significant amount of work on the optimization (2.1), I can
hardly enumerate all the prior work, and I provide here a few references to the research I
believe is most related to NLP.
For the L2-regularized optimization problem, L-BFGS [Liu and Nocedal, 1989] and
the exponentiated gradient method are often used. L-BFGS is a quasi-Newton method that
computes an approximation of the Hessian matrix from a subset of gradients. The exponentiated
gradient method [Collins, 2002] uses a dual representation of (2.1) and optimizes the
equivalent convex dual.
For the L1-regularized optimization problem, since the objective function is not differentiable when $w_i = 0$, several specialized optimization methods have been proposed.
Kazama and Tsujii [2005] proposed a method for optimizing the L1-regularized logistic regression model by replacing each weight with a pair of non-negative weights, $w_i = w_i^+ - w_i^-$ with $w_i^+, w_i^- \ge 0$. The regularization term is then represented as $R(w) = \sum_i (w_i^+ + w_i^-)$,
which can be optimized efficiently using general gradient-based optimization methods at
the expense of doubling the number of parameters. Andrew et al. [2007] proposed the
orthant-wise L-BFGS (OWL-QN), where the orthant of the parameters is fixed at the time of
updating; this was recently generalized in [Yu et al., 2008]. Sha et al. [2007] proposed
applying a multiplicative update to L1 optimization with a local quadratic approximation, which can be done using a very simple update formula. Koh et al. [2007] proposed
an interior-point method and an approximation of the entire regularization path.
We call this batch learning because the parameters are updated after looking at all the
training examples.
However, online learning algorithms have recently been found to be much more efficient than these batch learning methods, especially for very large-scale NLP problems.
2.6 Online Learning
Since the number of parameters and the number of terms in the optimization are extremely large
for large-scale NLP, direct optimization requires a prohibitively large cost. To tackle this
problem, online learning or stochastic convex optimization has been proposed, in which we
iteratively optimize the parameters against one example or a small number of examples at a time.
In online learning, a learner takes examples one by one, checks whether the current
classifier classifies each example well, and updates the parameter if the example is misclassified or the margin (confidence) is small.
Online learning has the following characteristics compared with batch learning. First, online
learning looks at each example one by one, so not all examples need to be
stored in memory. Second, online learning updates the parameters more often than
batch learning. Data in NLP are often redundant and the distribution of feature frequencies is heavily skewed; therefore online learning can tune the parameters for frequent
features in early steps and focus on the parameters for rare features in later steps.
I introduce several online learning algorithms. Table 2.1 summarizes these methods for the case of binary classification. The variable $s := y w^T \phi(x)$ denotes the margin (score) of a training example $(x, y)$, and $w \mathrel{+}= v$ means that the parameter $w$ is updated as $w := w + v$.
Perceptron (P) [Rosenblatt, 1958] The perceptron algorithm is the first online learning
algorithm, proposed half a century ago [Rosenblatt, 1958]. First, the parameter vector is
initialized as a zero vector, $w = 0$. At each step, it checks whether the current parameters
correctly classify the training example. If so, it proceeds to the next example without an
update. If not, it updates the weight vector $w$ toward the current example,
$$w := w + y \phi(x), \qquad (2.26)$$
where $x$ and $y \in \{-1, +1\}$ are the input and the binary label respectively.
The algorithm repeatedly runs over the training data. It can be shown that, if the examples
can be separated by a hyperplane (i.e., there is a binary linear classifier that can classify
all examples correctly), the perceptron algorithm finds a parameter that correctly
classifies the entire data set. This is not the case for most NLP data.
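A direct transcription of the binary perceptron (a sketch with dense NumPy vectors and made-up data; the thesis works with sparse feature vectors):

```python
import numpy as np

def perceptron(examples, n_features, epochs=10):
    """Binary perceptron: update w := w + y*phi(x) on every mistake (Eq. (2.26))."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for phi, y in examples:
            if y * w.dot(phi) <= 0:   # misclassified (or zero score)
                w += y * phi
    return w

if __name__ == "__main__":
    data = [(np.array([1.0, 0.0, 1.0]), +1),
            (np.array([0.0, 1.0, 1.0]), -1),
            (np.array([1.0, 1.0, 1.0]), +1)]
    w = perceptron(data, n_features=3)
    print(w, [int(np.sign(w.dot(phi))) for phi, _ in data])
```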
For multi-class classification, a variant of the perceptron algorithm was proposed, called
the structured perceptron [Collins, 2002]. First, the parameter vector is initialized as
a zero vector, $w = 0$. For each training example $(x, y)$, the system takes input $x$ and
predicts a label using the current parameter, $y^* = \arg\max_{y'} s(x, y')$. Then, the parameter
is updated so that the score for the true label $y$ is increased and that for the wrong label is
decreased. This is achieved by a simple calculation,
$$w := w + \psi_{y^*}, \qquad (2.27)$$
where $\psi_{y^*} = \phi(x, y) - \phi(x, y^*)$. Note that this update has no effect when the system
classifies the training example correctly ($y = y^*$).
Averaged Perceptron (AP) [Collins, 2002] The original perceptron algorithm often
leads to poor generalization, especially when the training data are noisy, as in NLP. Collins
et al. [2002, 1999] show that averaging the weights over all steps improves the generalization
ability. In practice, we do not need to keep the weight vectors of all steps; it is enough to
keep two weight vectors $w$ and $w^a$ as follows:
$$w := w + y \phi(x) \qquad (2.28)$$
$$w^a := w^a + t\, y \phi(x) \qquad (2.29)$$
$$t := t + 1, \qquad (2.30)$$
where, at the beginning, the variable $t$ is initialized to 1 and both $w$ and $w^a$ are initialized
to 0. The final weight vector is obtained as $w - w^a/t$, which is identical to the
averaged one.
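The averaging trick of Eqs. (2.28)–(2.30) in code form (a sketch that increments the counter on every example, which is the common convention; only two weight vectors and a counter are kept, and the averaged weights are recovered at the end as w − w^a/t):

```python
import numpy as np

def averaged_perceptron(examples, n_features, epochs=10):
    w = np.zeros(n_features)    # current weights
    wa = np.zeros(n_features)   # t-weighted sum of the updates
    t = 1
    for _ in range(epochs):
        for phi, y in examples:
            if y * w.dot(phi) <= 0:
                w += y * phi          # Eq. (2.28)
                wa += t * y * phi     # Eq. (2.29)
            t += 1                    # Eq. (2.30)
    return w - wa / t                 # averaged weight vector

if __name__ == "__main__":
    data = [(np.array([1.0, 0.0, 1.0]), +1),
            (np.array([0.0, 1.0, 1.0]), -1),
            (np.array([1.0, 1.0, 1.0]), +1)]
    print(averaged_perceptron(data, n_features=3))
```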
Passive Aggressive (PA) [Crammer et al., 2006] The problem with the perceptron is that
it ignores the degree of misclassification; it always updates the parameter with the same
update width whenever it makes an error. Therefore, even after the update, the classifier may
not classify the previous example correctly. The Passive Aggressive (PA) algorithm updates
the parameter so that the updated classifier classifies the example correctly while its
parameter stays close to the previous parameter.
Let $w_t$ be the weight vector at the $t$-th step. The weight vector $w_1$ is initialized to the
zero vector $0$, and in each round the algorithm takes a training example $x_t$ and predicts
its label $y_t'$. After the prediction is made, the result is compared with the true label $y_t$
and the learner suffers a hinge loss, which reflects the degree to which its prediction
was wrong. If the prediction is wrong or the margin is small, the parameter $w$ is updated.
There are three update strategies according to how the noise of the training data is treated, using
a slack variable $\xi \in \mathbb{R}$. The first, PA, ignores the effect of noise:
$$w_{t+1} = \arg\min_w \frac{1}{2}\|w - w_t\|^2 \quad \text{subject to } l_{\mathrm{hinge}}(x_t, y_t, w) = 0. \qquad \text{(PA)} \quad (2.31)$$
The second, PA-I, treats the noise in an L1 manner:

w_{t+1} = arg min_w (1/2)∥w − w_t∥² + Cξ   subject to   ℓ(w; (x_t, y_t)) ≤ ξ and ξ ≥ 0,   (PA-I)   (2.32)
where C ∈ R+ is a parameter that controls the trade-off between the slack term and the
distance. If C is large, a more aggressive update step is taken. The third, PA-II, treats the
noise in an L2 manner:

w_{t+1} = arg min_w (1/2)∥w − w_t∥² + (C/2)ξ²   subject to   ℓ(w; (x_t, y_t)) ≤ ξ.   (PA-II)   (2.33)

The parameter C ∈ R+ plays the same role as in PA-I.
These problems can be solved in closed form,

w_{t+1} = w_t + τ_t y_t ϕ(x_t),   (2.34)

where the update width τ_t is

τ_t = l_t / ∥ϕ(x_t)∥²                         (PA)
τ_t = min{ C, l_t / ∥ϕ(x_t)∥² }               (PA-I)
τ_t = l_t / ( ∥ϕ(x_t)∥² + 1/(2C) )            (PA-II)   (2.35)

and l_t is the hinge loss at step t.
Interestingly, the update formula of the passive aggressive algorithms and that of the perceptron algorithm differ only in the update width; the update width of passive aggressive
is the loss normalized by the squared norm of the feature vector.
The multi-class PAs are obtained by replacing y_tϕ(x_t) with ψ_{y*} and ∥ϕ(x_t)∥² with ∥ψ_{y*}∥²
in (2.34) and (2.35), respectively.
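For example, a single PA-I update following (2.34)-(2.35) can be sketched as below (an illustrative sketch only; dropping the min over C gives PA, and replacing the denominator by ∥ϕ(x)∥² + 1/(2C) gives PA-II):

def pa1_update(w, phi_x, y, C=1.0):
    score = sum(wi * xi for wi, xi in zip(w, phi_x))
    loss = max(0.0, 1.0 - y * score)                        # hinge loss l_t
    if loss == 0.0:
        return w                                            # passive step: margin already large enough
    tau = min(C, loss / sum(xi * xi for xi in phi_x))       # aggressive step width (PA-I)
    return [wi + tau * y * xi for wi, xi in zip(w, phi_x)]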
Confidence Weighted (CW) [Dredze et al., 2008, Crammer et al., 2008, Crammer et al., 2009]
A confidence-weighted (CW) learning algorithm maintains a notion of confidence in the weights of a linear classifier. To represent this confidence, it
uses a Gaussian distribution with mean µ ∈ Rm and covariance matrix Σ ∈ Rm×m.
Intuitively, the value Σ_{i,i} indicates the confidence in the i-th weight; the larger Σ_{i,i}, the
less confidence we have, because the variance is large. Therefore, the i-th weight is updated
more aggressively when Σ_{i,i} is large, and less aggressively when it is small.
The CW update rule is obtained by solving the following constrained optimization:

(µ_{t+1}, Σ_{t+1}) = arg min_{µ,Σ} D_KL( N(µ, Σ) || N(µ_t, Σ_t) )   (2.36)
subject to   P( y_t wᵀϕ(x_t) ≥ 0 ) ≥ η,   (2.37)

where η ∈ [0.5, 1] is a hyper-parameter, N(µ, Σ) is the Gaussian distribution with
mean µ and covariance matrix Σ, w ∼ N(µ, Σ), and D_KL is the KL divergence.
Because this problem is not convex in Σ, the first paper [Dredze et al., 2008] replaces the standard deviation with
the variance in (2.36), while the second paper [Crammer et al., 2008] solves it by representing Σ as the square of a positive semi-definite matrix. Then (2.36) can be solved in closed form as follows,
v_t = ϕ(x_t)ᵀ Σ_t ϕ(x_t),   (2.38)
m_t = y_t µ_tᵀ ϕ(x_t),   (2.39)
α_t = max{ 0, ( −m_t ψ + √( m_t² ρ⁴/4 + v_t ρ² ξ ) ) / (v_t ξ) },   (2.40)
u_t = (1/4) ( −α_t v_t ρ + √( α_t² v_t² ρ² + 4 v_t ) )²,   (2.41)
β_t = α_t ρ / ( √u_t + v_t α_t ρ ),   (2.42)
µ_{t+1} = µ_t + α_t y_t Σ_t ϕ(x_t),   (2.43)
Σ_{t+1} = Σ_t − β_t Σ_t ϕ(x_t) ϕ(x_t)ᵀ Σ_t,   (2.44)

where ρ = Φ⁻¹(η), ψ = 1 + ρ²/2, and ξ = 1 + ρ², and Φ is the cumulative
distribution function of the standard Gaussian.
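A rough sketch of one CW update with a diagonal covariance, following the closed form (2.38)-(2.44), is shown below (an illustration under the diagonal-Σ simplification, not the thesis implementation; eta is the confidence hyper-parameter, and the feature vector is assumed to be non-zero):

import math
from statistics import NormalDist

def cw_update(mu, sigma, x, y, eta=0.9):
    """mu, sigma: lists of per-weight means/variances; x: feature values; y in {-1, +1}."""
    rho = NormalDist().inv_cdf(eta)                     # rho = Phi^{-1}(eta)
    psi = 1.0 + rho * rho / 2.0
    xi = 1.0 + rho * rho
    v = sum(s * xv * xv for s, xv in zip(sigma, x))     # v_t = phi(x)^T Sigma phi(x)   (2.38)
    m = y * sum(mi * xv for mi, xv in zip(mu, x))       # m_t = y mu^T phi(x)           (2.39)
    alpha = max(0.0, (-m * psi + math.sqrt(m * m * rho ** 4 / 4.0
                                           + v * rho * rho * xi)) / (v * xi))           # (2.40)
    u = 0.25 * (-alpha * v * rho
                + math.sqrt(alpha ** 2 * v ** 2 * rho ** 2 + 4.0 * v)) ** 2             # (2.41)
    beta = alpha * rho / (math.sqrt(u) + v * alpha * rho)                               # (2.42)
    new_mu = [mi + alpha * y * s * xv for mi, s, xv in zip(mu, sigma, x)]               # (2.43)
    new_sigma = [s - beta * (s * xv) ** 2 for s, xv in zip(sigma, x)]                   # (2.44), diagonal only
    return new_mu, new_sigma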
2.6.1 Experiment
To examine these online learning algorithms, I implemented them and conducted an experiment on a simple document classification task¹. I used the news20.binary² data set from libsvm's binary data sets. The number of classes is 2, the number of examples is 19,996, and the number of features is 1,355,191.

¹ oll: http://code.google.com/p/oll/wiki/OllMainJa
² http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#news20.binary
Table 2.1: A comparison of online learning methods.

Method                       | Update condition | Update rule                                    | Prediction
P [Rosenblatt, 1958]         | s < 0            | w += yϕ(x)                                     | wᵀϕ(x)
AP [Collins, 2002]           | s < 0            | w += yϕ(x),  wa += t·yϕ(x)                     | (w − wa/t)ᵀϕ(x)
PA [Crammer et al., 2006]    | s < 1            | w += y·(1−s)/∥ϕ(x)∥²·ϕ(x)                      | wᵀϕ(x)
PA-I [Crammer et al., 2006]  | s < 1            | w += y·min(C, (1−s)/∥ϕ(x)∥²)·ϕ(x)              | wᵀϕ(x)
PA-II [Crammer et al., 2006] | s < 1            | w += y·(1−s)/(∥ϕ(x)∥² + 1/(2C))·ϕ(x)           | wᵀϕ(x)
CW [Dredze et al., 2008]     | γ > 0            | w += yγΣϕ(x),  Σ⁻¹ += 2γC·diag(ϕ(x))           | wᵀϕ(x)
Table 2.2: Performance of online learning methods (I = 10).

Method        | Training time (sec.) | Accuracy (%)
P             | 0.54                 | 94.7
AP            | 0.56                 | 95.3
PA            | 0.58                 | 96.5
PA-I          | 0.59                 | 96.5
PA-II         | 0.60                 | 96.5
CW            | 1.39                 | 96.5
SVM (linear)  | 1122.60              | 96.2
I shuffled the data and divided it into training data of 15,000 examples and test data of 4,996 examples. I compared the online methods with a batch learning method, SVM³. I did not tune the hyper-parameters of PA-I, PA-II, CW, or SVM (C = 1.0 for all methods).
Table 2.2 shows the results when the number of training iterations is 10. All methods achieved similar accuracies. The training times are also very short compared to the batch learning method (SVM).
Table 2.3 shows the results when the number of training iterations is 1. This is the special case in which the training examples need not be stored. The results show that all methods achieved performance similar to the previous results.
Note that in all the above online learning algorithms except CW, the final weight vector
can be represented as a linear combination of training examples, as in SVMs. Therefore
the kernel trick can be applied as
wᵀϕ(x) = Σ_i τ_i ϕ(x_i)ᵀϕ(x) = Σ_i τ_i K(x_i, x),   (2.45)

³ http://chasen.org/~taku/software/TinySVM/
Table 2.3: Performance of online learning methods (I = 1).

Method  | Training time (sec.) | Accuracy (%)
P       | 0.05                 | 93.4
AP      | 0.09                 | 94.0
PA      | 0.07                 | 96.2
PA-I    | 0.08                 | 96.1
PA-II   | 0.08                 | 96.0
CW      | 0.21                 | 96.4
where K is the kernel function. Using this formulation, the inner product can be replaced
with a general Mercer kernel K(x_i, x), such as a polynomial kernel or a Gaussian kernel.
A theoretical analysis of online learning can be found in [Shalev-Shwartz, 2007].
2.7 Kernel Trick
Since real-world problems generally do not have a linear structure, linear classifiers are
sometimes insufficient. To overcome this problem, one can use the kernel trick, in which all
feature vectors are mapped into a high-dimensional feature space by a non-linear mapping
Φ so that they can be separated by a hyperplane. The problem is that the computational cost of Φ
and of an optimization problem in a high-dimensional space is very high. However, when we
solve the problem in the dual representation, we do not need to compute Φ(x) explicitly,
because we only need inner products in the mapped feature space. We call the function
K(x₁, x₂) = Φ(x₁) · Φ(x₂) a kernel function. By selecting Φ carefully, we can compute
K(x₁, x₂) at small computational cost.
For example, consider a two-dimensional input space together with the feature map:

ϕ : x = (x₁, x₂) ↦ ϕ(x) = (x₁², x₂², √2·x₁x₂).   (2.46)
The inner product in the feature space can be evaluated as follows:

⟨ϕ(x) · ϕ(z)⟩ = ⟨(x₁², x₂², √2·x₁x₂) · (z₁², z₂², √2·z₁z₂)⟩   (2.47)
             = x₁²z₁² + x₂²z₂² + 2x₁x₂z₁z₂   (2.48)
             = (x₁z₁ + x₂z₂)²   (2.49)
             = ⟨x · z⟩².   (2.50)
Thus K(x, z) = Φ(x) · Φ(z) = ⟨x · z⟩². Many kernel functions have been proposed; examples include
K_poly(x₁, x₂) = (x₁ · x₂ + 1)^d,   (2.51)
K_rbf(x₁, x₂) = exp(−a∥x₁ − x₂∥²),   (2.52)
K_sigmoid(x₁, x₂) = tanh(s(x₁ · x₂) + c).   (2.53)
A kernel function can be defined not only on vector data but also on strings or graph-structured data. More details on kernel functions can be found in [Taylor and Cristianini, 2004].
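The identity (2.46)-(2.50) is easy to verify numerically; the following small check (an illustrative example only) compares the explicit feature map with the squared inner product:

import math

def phi_map(v):                           # the feature map of (2.46)
    x1, x2 = v
    return (x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2)

x, z = (3.0, -1.0), (2.0, 5.0)
lhs = sum(a * b for a, b in zip(phi_map(x), phi_map(z)))   # <phi(x), phi(z)>
rhs = (x[0] * z[0] + x[1] * z[1]) ** 2                     # <x, z>^2
print(lhs, rhs)                                            # both equal 1.0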
2.8 Storing Sparse Vector
In NLP, most data is represented by sparse binary vectors, because a linguistic event has
many candidates (words, strings, prefixes) and only a few of them are observed. For example,
a document is often represented by a feature vector each value of which corresponds to the
occurrence of a word in the document, and for a part-of-speech (POS) tagging task, a feature
vector for predicting the POS consists of the occurrences of the current/previous/next
words and their prefixes/suffixes. We process these sparse binary vectors during learning and
inference. Since the data are very large, we need to store these vectors carefully so that
all processing can be done in memory.
Let us see how to store a binary vector of length n with m 1's, where m ≪ n (sparse).
The lower bound for storing such a vector is lg(n choose m) bits⁴, which can be approximated by
m(lg e + lg(n/m)) = 1.44m + m·lg(n/m) bits. A straightforward method that stores the binary vector explicitly requires n bits, which is much larger than this lower bound. Let
(x₁, x₂, . . . , x_m) be the positions of the ones. To represent the vector, we can instead store
only these positions, using lg n bits each and m·lg n bits in total. This is about
m·lg m bits more than the lower bound; therefore, it is very close to optimal when m is small.
For the case when m is not very small but m ≪ n, a variable byte code, or VarByte,
is effective. VarByte is very simple and supports fast encoding and decoding. In VarByte,
we use a difference representation of the position indexes, defined as (d₁, d₂, . . . , d_m) with
d₁ = x₁ and d_i = x_i − x_{i−1} − 1 for i > 1. Decoding from the difference representation
back to the original positions is trivial. Each d_i is then stored in a variable number of bytes. The first
bit of each byte denotes whether the code finishes at this byte or not.

⁴ lg x denotes ⌈log₂ x⌉
Algorithm 1 VarByte Encode
Input: integer d
  while d ≥ 128 do
    put(d & 127)        // output the 7 lower bits (continuation byte)
    d := d >> 7         // 7-bit right shift
  end while
  put(d + 128)          // output the remaining bits with the top bit (128) set
Algorithm 2 VarByte Decode
  d := 0, count := 0
  loop
    c := get()
    if c ≥ 128 then
      d := d + ((c − 128) << count)
      break
    end if
    d := d + (c << count)
    count := count + 7
  end loop
  Return d
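A runnable Python version of Algorithms 1 and 2 is sketched below (an illustration, not the thesis code); the last lines show how a sparse vector can be stored as VarByte-encoded gaps between the positions of its ones:

def varbyte_encode(d):
    out = bytearray()
    while d >= 128:
        out.append(d & 127)          # 7 lower bits, continuation byte (top bit clear)
        d >>= 7
    out.append(d + 128)              # final byte with the top bit set
    return bytes(out)

def varbyte_decode(buf):
    d, shift = 0, 0
    for c in buf:
        if c >= 128:
            return d + ((c - 128) << shift)
        d += c << shift
        shift += 7

positions = [3, 10, 1000, 1000000]
gaps = [positions[0]] + [b - a - 1 for a, b in zip(positions, positions[1:])]
encoded = b"".join(varbyte_encode(g) for g in gaps)
print(len(encoded), [varbyte_decode(varbyte_encode(g)) for g in gaps] == gaps)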
Algorithm 1 shows how an integer d is encoded in VarByte, and Algorithm 2 shows how it is decoded. Next, I analyze the working space of VarByte. Since each integer d_i requires at most 8 + lg d_i bits⁵, the total size of VarByte is

Σ_{i=1}^{m} (8 + lg d_i) ≤ 8m + m·lg(n/m).   (2.54)

The inequality holds because Σ_i lg d_i is maximized when d_i = n/m for all i. Therefore,
VarByte requires about 6.5m bits more than the optimal.

⁵ We assume that 1 byte = 8 bits.
Since there are many other integer set representations, I give only pointers to them:
RICE CODING, SIMPLE9 [Anh and Moffat, 2005], SIMPLE16 [Yan et al., 2009b],
and NEW-PFOR [Yan et al., 2009b]. A comparison of these data structures can be found
in [Yan et al., 2009b], which shows that VarByte is not significantly different
from these data structures in terms of working space and computational cost.
Part I

Learning with Massive Number of Features
Chapter 3
Learning with All Substring Features
This chapter presents an algorithm for processing a document set with all substrings as features. The applications I consider are document classification and document clustering.
Although tokenized words are often not enough for determining the class of a document, processing with all substrings has a prohibitive computational cost, because the number of
candidate substrings can be very large.
I first present a general learning framework to deal with all substring features. In this
framework all substrings are considered as distinct features and are allowed to have different weights. Although the number of substrings and features is prohibitively large
to enumerate and optimize, we can find the optimal classifier in time linear in the total
length of the documents by factorizing the substring information into equivalence classes.
I also use L1 regularization to obtain a compact classification model. Since the number of
non-zero weights is very small, the model is easy to interpret, and the inference is extremely
efficient in time and working space. Moreover, the model is robust even when all substrings
are considered. By combining this with the grafting algorithm [Perkins et al., 2003], the weight
vector can be optimized in time proportional to the number of non-zero weights. This is
achieved by traversing the internal nodes of a suffix tree, which requires time and space linear
in the total length of the training document set. I will show that we can find the optimum by checking only the
features corresponding to maximal substrings. The number of maximal substrings is not quadratic but linear in the document length, so we can
train the weight vector efficiently. I apply this framework to a document classification
task [Okanohara and Tsujii, 2009b].
Figure 3.1: An example of bag-of-word representation (BOW).
I compared the method with other approaches: document classification with a string kernel, and a logistic regression model with variable-length
N-grams. The results show that the proposed method achieved the highest performance on
almost all tasks, and that its model is very compact.
Next, I extend this work to a document clustering task. For this task, I propose a novel
clustering algorithm called logistic regression clustering (LRC). This model solves an optimization problem similar to that of document classification with logistic regression models,
so we can apply the same techniques as in document classification. The method can
assign a conditional probability of a cluster given a document. Moreover, characteristic substrings are extracted for each cluster as an effect of the L1 regularization, and these
substrings can be used as labels of the clusters. Experimental results show that the proposed
algorithm achieved comparable or better results than those of other methods, and its
result is very compact and easy to interpret.
3.1 All Substring Features
Generally, a document d is represented as a feature vector f(d) ∈ Rm in which each dimension corresponds to the occurrence of a word in the document. Since this representation
ignores the order and the positions of the words, it is called a bag of words
(BOW) (Figure 3.1).
Although a BOW representation loses much of the document information, it often
achieves high performance, because the occurrence of a few keywords can often determine
the characteristics of the document.
Figure 3.2: An example of all substrings representation (ALLSTR).
However, the BOW representation still suffers from the following three problems.
The first is errors in the conversion from a document to a set of words. For example,
several languages, such as Japanese and Chinese, do not represent word boundary information explicitly. The word identification task itself is not easy, and its result includes
many errors. What is worse, the keywords for document classification are often unknown words such as person names (e.g. Shaquille O'Neal), and the BOW representation
loses this information due to errors in the analysis.
The second is that, for some data, it is difficult to define what a word is, as in log
data and bioinformatics data.
The third is the most important problem. Word units are often inappropriate for document classification/clustering, whereas word N-grams are effective. For example, the
occurrence of a movie title is effective for determining that the label of a document is
movie. However, many movie titles consist of several common words, which are lost in
a BOW representation. The spam mail detection task is another example: signature and
template information is important, but it is not word information.
To solve these problems, I propose to use a representation in which all substrings are features. This
can be considered a bag of N-grams with N = 1 . . . ∞. Although the number of
features (substrings) is quadratic in the document length, we can find the optimal
solution in time linear in the length of the document by summarizing equivalent substrings.
Formally, we represent a document as a bag of all substrings, where each substring
corresponds to a dimension of the feature vector. We call this representation
ALLSTR. Figure 3.2 shows an example of ALLSTR.
The cost of learning a model with the ALLSTR representation is prohibitively large. However, we show that the effective substrings can be found exhaustively by enumerating all maximal substrings. Note that this is not an approximation but an exact solution.
To achieve this, we summarize substrings into classes so that the substrings in
the same class have the same statistics; each class is represented by its maximal substring. The same
idea was proposed in [Yamamoto and Church, 2001], which calculates term frequencies
and document frequencies for all substrings. I extend and simplify this concept to find
effective substrings efficiently. The differences will be discussed.
Many previous studies have used all-substring information for document classification. Among them, string kernels [Vishwanathan and Smola, 2004] are the most popular;
they define a kernel between two documents d₁ and d₂ as

K(d₁, d₂) = Σ_{s∈Σ*} r_s · s(d₁) · s(d₂),   (3.1)

where Σ is the alphabet, Σ* is the set of all strings over Σ, r_s is a weight
parameter for s (which is not determined by learning), and s(d) is the frequency of the substring
s in a document d.
By incorporating this string kernel into SVM learning, we can classify a document
according to all the substring information in the document. Teo [2006] showed that by using
suffix arrays and auxiliary data structures, a kernel value can be calculated in O(|d₁| + |d₂|)
time, and inference for a test document d can be done in O(|d|) time, where |d| is the
length of the document.
However, string kernels require a large amount of working space, not only at training
time but also at inference time. For example, they require almost 20N bytes, where N
is the total length of the documents. Therefore, such a method cannot be applied to a large
document set.
Moreover, kernel methods cannot control each feature weight independently; they can
only control a weight for each training example. In general, very few features contribute
to the label decision, and a string kernel cannot capture these features efficiently.
Also, string kernels tend to be affected by noise, so we may need to cut off long
substrings. Therefore, it is very difficult to consider all substrings in a string kernel.
Very recently, Ifrim et al. [2008] proposed a logistic regression model with variable-length N-gram features (structured logistic regression: SLR). In their model, a different
weight can be assigned to each feature. They showed that N-gram information is important for document classification and is more accurate than the BOW representation. However,
because effective N-grams are searched for greedily, important N-gram phrases can be lost.
Another problem is that Ifrim's method suffers from over-fitting due to the lack of regularization, and it is difficult to decide when the search process should stop during training.
3.2 Data Structure for Strings
To handle a large number of substrings, I make heavy use of data structures for strings. I introduce
several data structures that will be used in our algorithm.
Let Σ be a finite ordered alphabet, and σ = |Σ|. Let T[1 . . . n] be an input text of
length n drawn from Σ⁺. For technical convenience, we assume that T is
terminated by $ (T[n] = $), a character from Σ that is lexicographically smaller
than all other characters and appears nowhere else in T. The suffixes of T are then defined
as S_i = T[i, n] for i = 1, . . . , n.
First, I introduce the suffix tree. Although a suffix tree is not used directly in our algorithm, I
explain it here to clarify the idea behind the algorithm. A suffix tree is a compact
trie that contains all suffixes of T and can be stored in O(n log n) bits. Suffix trees
support various complex string operations [Gusfield, 1997]. A suffix tree has n leaves, each of which
corresponds to a suffix of T. Internal nodes with only one child are removed, so the
length of an edge label can be larger than 1. Each edge is labeled by a string, called the edge label.
The concatenation of labels on the path from the root to a node is called the path label of
the node. The path label of each leaf coincides with a suffix. For each internal node, its
children are sorted in the alphabetical order of the first characters of their edge labels (Fig. 3.4).
An interesting property of a suffix tree is that although the number of distinct substrings
appearing in T is O(n²), the number of internal nodes is at most n − 1. To see this,
consider building the suffix tree by inserting the suffixes from the shortest one;
at each insertion, at most one internal node is created, except for the first insertion.
Figure 3.3: An example of data structures for a text T = abracadabra$.
Similarly, we can prove that the number of edges between internal nodes is also at most
n − 2.
Suffix trees are useful for many string problems [Gusfield, 1997]. However, suffix
trees require a very large working space; for example, the most space-efficient implementations
require about 10n ∼ 20n bytes.
Therefore, I instead use a space-efficient data structure, a variant of the enhanced suffix array [Abouelhoda et al., 2004].
I use the following data structures.
• SA: Suffix array
• H: Height array
• B: Burrows-Wheeler transformed text
and auxiliary data structures to support operations on them efficiently. I
explain these in turn. An example of these data structures for T = abracadabra$ is
shown in Figure 3.3.
Suffix array  A suffix array of T, SA[1, n], is an integer array of
length n such that S_{SA[i]} < S_{SA[i+1]} for all i = 1, . . . , n − 1, where < between strings
denotes lexicographical order. SA requires n·lg n bits of space.
Height array  A height array H[1, n − 1] for T is defined by H[i] = lcp(S_{SA[i]}, S_{SA[i+1]}),
where lcp(S, U) is the length of the longest common prefix of S and U.
That is, H contains the lengths of the longest common prefixes of suffixes of T that are
consecutive in lexicographic order.
Burrows-Wheeler transformed text  The Burrows-Wheeler Transform (BWT) of a text
T, B[1, n], is defined as

B[i] = T[SA[i] − 1]  if SA[i] > 1,  and  B[i] = T[n]  if SA[i] = 1.   (3.2)

I support the following operation on B: given B[1, n] of length n drawn from an alphabet
Σ and a query consisting of a pair of positions (l, r), check whether B[l, r]
contains more than one character type. I present a data structure that solves this
problem in O(1) time using n + o(n) bits of working space. Here, we assume a RAM
model in which all operations on log n-bit words can be done in constant time.
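For concreteness, the following small sketch builds SA, H, and B directly from their definitions in quadratic time with 0-based indexing (an illustration only; it is far from the space-efficient construction used in the thesis):

def build_sa_h_b(T):
    n = len(T)
    SA = sorted(range(n), key=lambda i: T[i:])                       # suffix array
    def lcp(a, b):
        k = 0
        while a + k < n and b + k < n and T[a + k] == T[b + k]:
            k += 1
        return k
    H = [lcp(SA[i], SA[i + 1]) for i in range(n - 1)]                # height array
    B = "".join(T[SA[i] - 1] if SA[i] > 0 else T[n - 1] for i in range(n))   # BWT
    return SA, H, B

print(build_sa_h_b("abracadabra$"))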
Let R[1, . . . , n − 1] be a bit vector storing the run information of B, that is, R[i] = 0
if B[i] = B[i + 1] and R[i] = 1 otherwise. Then B[l, r] consists of only one character
type if and only if R[l, r − 1] contains only 0's. Using rank dictionaries, which I explain
below, this check can be done in constant time with n + o(n) bits as follows.
A rank dictionary supports the operation rank(R, c, p), which returns the number of occurrences of c ∈
{0, 1} in R[1, . . . , p]. The problem is then solved by checking whether

rank(R, 1, r − 1) − rank(R, 1, l − 1) > 0.   (3.3)
To support this, we conceptually divide the array R into large blocks of l = lg² n
bits, and divide each large block into small blocks of s = (lg n)/2 bits¹. We keep the
values rank(R, 1, i·l) in an array L[n/l], and the number of 1's from the beginning of
each large block up to each small block boundary in an array S[n/s]. We also precompute the rank results for all
bit patterns of (lg n)/2 bits in a table using 2^{(lg n)/2} = O(√n) bits of space. Then rank(R, 1, i) =
L[⌊i/l⌋] + S[⌊i/s⌋] + popcount(R, s·⌊i/s⌋ + 1, i), where popcount(R, i, j) returns the number
of 1's in R[i, j] in constant time by table lookup.

¹ lg x = ⌈log₂ x⌉
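The two-level scheme can be sketched as follows (0-based indexing and small illustrative block sizes instead of lg² n and (lg n)/2; the trailing bits are counted directly, standing in for the precomputed table):

class RankDict:
    def __init__(self, R, small=8):
        self.R, self.small, self.large = R, small, small * small
        self.L, self.S = [], []            # counts at large-block / small-block boundaries
        total = in_block = 0
        for i, bit in enumerate(R):
            if i % self.large == 0:
                self.L.append(total)
                in_block = 0
            if i % self.small == 0:
                self.S.append(in_block)
            total += bit
            in_block += bit

    def rank1(self, i):
        """Number of 1's in R[0:i]."""
        if i <= 0:
            return 0
        lb, sb = i // self.large, i // self.small
        return self.L[lb] + self.S[sb] + sum(self.R[sb * self.small:i])

bits = [1, 0, 0, 1, 1, 0, 1, 0] * 20
rd = RankDict(bits)
print(rd.rank1(50), sum(bits[:50]))        # the two counts agree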
3.3 Grafting
In this section, we consider maximum likelihood estimation of a multi-class logistic regression model with L1 regularization.
To maximize the effect of L1 regularization, the grafting algorithm [Perkins and Theiler, 2003] can be used. In grafting, we begin with an
empty feature set and incrementally add effective features to the current problem. Note
that although this is similar to a boosting algorithm, the obtained result is
always optimal. The grafting algorithm is summarized in Algorithm 3. We assume that
the objective function is the log-likelihood of a multi-class logistic regression model.
I now explain the grafting algorithm formally. The algorithm maintains two variables: w
stores the current weight vector, and H stores the set of features with non-zero weight.
Initially, we set w = 0 and H = {}. At each step, the feature with the
largest absolute value of the gradient of the likelihood is selected. Let v_k = ∂L(w)/∂w_k be the gradient
of the objective with respect to a feature k, i.e., the gradient of the sum of loss functions (or of the likelihood in the case of maximum likelihood estimation). Following the definition, v_k can be calculated
as

v_k = ∂L(w)/∂w_k   (3.4)
    = Σ_{i,y} α_{i,y} ϕ_k(x_i, y),   (3.5)

where α_{i,y} = I(y_i = y) − p(y|x_i; w) and I(a) is 1 if a is true and 0 otherwise. Then we
add k* = arg max_k |v_k| to H and optimize the objective function with respect to H only.
The obtained solution w is used in the next search. The iteration continues until
|v_{k*}| < C.
We briefly explain why this algorithm finds the optimal weights. Suppose
that we optimize the objective function with all features, initializing the weights with
the result obtained by the grafting algorithm. Since all gradients of the likelihood satisfy
|v_k| ≤ C, and the regularization term pushes each weight toward 0 with strength C, no change of
the weight vector can decrease the L1-regularized objective value. Since this is a convex
optimization problem, a local optimum is always the global optimum, and therefore the
obtained solution is the global optimum.
Algorithm 3 Grafting
Input: training data (x_i, y_i) (i = 1, · · · , n) and parameter C
  H = {}, w = 0
  loop
    v = ∂L(w)/∂w   (L(w) is the log-likelihood term)
    k* = arg max_k |v_k|
    if |v_{k*}| < C then break
    H = H ∪ {k*}
    Optimize w with regard to H
  end loop
  Output w and H
The point is that, given an efficient method to find k* without enumerating all features,
we can solve the optimization in time proportional to the number of active features,
regardless of the number of candidate features.
3.4 Statistics Computation with Maximal Substring
This section presents an efficient algorithm for computing the statistics of all substring
features. The key idea is the use of equivalence classes of substrings.
3.4.1 Equivalent Class
First, let us explain the idea of equivalence classes of substrings using suffix trees; this
is an extension of the equivalence classes in [Yamamoto and Church, 2001].
Let T [1, n] be a text to be processed and q be a substring. Denote by P (T, q) the list of
all occurrence positions of q in T . We will omit T if there is no confusion. For example,
P (T, “a”) = {1, 4, 6, 8, 11} for T = “abracadabra”.
Recall that the list of all occurrence positions of q can be obtained by traversing the suffix
tree of T from the root along the edge characters. Note that a suffix tree stores
all suffixes of T, and any substring occurring in T corresponds to some position in the suffix tree.
Let t(q) be the position in the suffix tree corresponding to q; note that
t(q) may be at an internal node or on an edge. Then the descendant leaves of t(q)
give the occurrence positions of q. For example, in Figure 3.4, for q = "ab", t(q) lies on
the edge between internal node 1 and internal node 0, and therefore P(q) = {8, 1}.
Similarly, "abr" and "abra" are again on the edge between internal node 1 and internal
node 0, and they also occur at {8, 1}. From this observation, it is easy to show
that when two substrings q₁ and q₂ lie on the same edge of the suffix tree, their occurrence
positions are the same.
Definition 1  Two substrings q and r are said to be in the same left-equivalent class,
denoted q =_P r, if and only if P(q) = P(r). This is equivalent to t(q) and t(r) being
on the same edge of the suffix tree.
The number of edges between internal nodes is at most n − 2, and the number of edges between an internal
node and a leaf is at most n − 1 (ignoring the leaf corresponding to the special character), where n is the length of the input text. Hence, the number of
different classes is at most n − 2 + n − 1 = 2n − 3. Since every substring appearing in T
at least once is mapped to some position in the suffix tree of T, all substrings can be
factorized into 2n − 3 = O(n) left-equivalent classes.
We can summarize the substring information further by considering left expansion,
which was not discussed in [Yamamoto and Church, 2001]. In the example
of Figure 3.4, the occurrence positions of "bra" are {9, 2}, which seem to be different
from those of "abra", {8, 1}. However, these positions are the same up to a
constant shift of 1 (9 = 8 + 1, 2 = 1 + 1).
Figure 3.4 shows some examples for T = abracadabra$. The suffix tree of T is
shown on the left of the figure, and all substrings appearing in T are shown on the right. Substrings
of the same color belong to the same class. We call the longest substring in each
class (the topmost one of the class in Figure 3.4) the maximal substring. In this example,
the maximal substrings are "a", "abra", and the suffixes of T. The substrings
"ab", "abr", "b", "br", "bra", "ra", and "abra" appear at the same positions up to a constant
shift, and all of them are substrings of "abra". The only other class that appears more
than once is that of "a". All other substrings appear once, and the maximal
substrings of their classes are suffixes, such as "abracadabra$" and "bracadabra".
We now define the equivalence class formally.
Figure 3.4: The substrings and their classes for the text T = abracadabra$.
Definition 2  Given two substrings q and r, q <_P r if and only if (1) q is a substring of
r, (2) |P(q)| = |P(r)|, and (3) there exists c ≥ 0 such that P(r)[i] + c = P(q)[i] for all
1 ≤ i ≤ |P(q)|.
Definition 3  Two substrings q and r are in the same equivalence class if and only if q <_P r
or r <_P q.
This <_P relation satisfies all the properties in Lemma 1, but the number of classes is much
smaller than the number of left-equivalent classes. Note that q <_P r holds if q =_P r, but
the reverse does not always hold. For example, in T = abracadabra, bra <_P abra, but
bra ≠_P abra.
Intuitively, q <_P r means that whenever the substring q appears, r also appears at the
corresponding position.
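The following brute-force sketch (quadratic in the text length and purely for illustration; the thesis algorithm instead uses the enhanced suffix array) reports the maximal substrings that occur more than once, by checking that a substring cannot be extended by one identical character on the left or on the right at every one of its occurrences:

from collections import defaultdict

def maximal_substrings(T):
    occ = defaultdict(list)
    n = len(T)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occ[T[i:j]].append(i)                        # occurrence positions of every substring
    result = []
    for q, pos in occ.items():
        if len(pos) < 2:
            continue
        left = {T[p - 1] if p > 0 else None for p in pos}
        right = {T[p + len(q)] if p + len(q) < n else None for p in pos}
        if (len(left) > 1 or None in left) and (len(right) > 1 or None in right):
            result.append(q)                             # cannot be extended uniformly: maximal
    return result

print(sorted(maximal_substrings("abracadabra$")))        # ['a', 'abra']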
3.4.2 Document Statistics with Equivalent Classes
First, we extend the idea of equivalence classes to a set of documents. Given a document set
(x_i, y_i) (i = 1, . . . , n′), let T be the concatenation of the documents, T[1, n] = x₁$₁x₂$₂ . . . x_{n′}$_{n′},
where the $_i are special characters that do not appear in the original text, and let n be the length
of T. We then build a suffix tree for T (called a generalized suffix tree). Since a substring
spanning a document boundary always contains a special character, we need not care about
the document boundaries, and all the issues discussed in the previous section still hold.
In this subsection, we see that many document statistics can be calculated efficiently
using the idea of equivalence classes.
Let tf(q, d) be the term frequency, i.e., the number of occurrences of a substring q in
a document d, and let df(q) be the document frequency, i.e., the number of documents
that include q. Then the idea of equivalence classes of substrings is summarized as follows [Yamamoto and Church, 2001].
Lemma 1  If two substrings q and r are in the same equivalence class, then tf(q, x_i) =
tf(r, x_i) for all 1 ≤ i ≤ n′, and df(q) = df(r).
Proof 1  Let q and r be in the same equivalence class. Then their occurrence positions in T are the same up to a constant shift. Since the documents are delimited by special characters, the occurrences of
q and r do not cross document boundaries, so each occurrence of q and the corresponding
occurrence of r fall within the same document. Therefore, the number of occurrences in
each document, and hence the document frequency, is the same for q and r.
Note that the previous study [Yamamoto and Church, 2001] considered suffix-equivalent classes only; we extend these to stronger classes here.
In the next subsection we see how to enumerate only the interesting equivalence classes.
3.4.3 Enumeration of Equivalent Classes
A naive way to enumerate these equivalence classes is to enumerate all substrings and
then group them into equivalence classes using Definition 2.
However, since the number of substrings is much larger than the number of equivalence
classes, this procedure would be very redundant.
Instead, we present an algorithm that enumerates all equivalence classes directly.
First we define a maximal substring.
Definition 4  A maximal substring p is a substring such that there is no substring q with
p <_P q.
Next, we prove a simple lemma.
Lemma 2  In each equivalence class, exactly one maximal substring exists.
Proof 2  Assume that there exist two distinct maximal substrings t and u in the same equivalence class. Then there exists a substring v such that
v <_P t and v <_P u. From the definition, t and u both include v as a substring,
and therefore they can be written as t = t₁vt₂ and u = u₁vu₂. Since their occurrences
overlap, t₁ is a suffix of u₁ or u₁ is a suffix of t₁, and similarly t₂ is a prefix of u₂ or u₂
is a prefix of t₂. Let q be the longer of t₁ and u₁, and r
be the longer of t₂ and u₂. Then the substring s = qvr is at least as long as
t and u, and its occurrence positions are the same as those of t and u. Therefore t <_P s and
u <_P s, which contradicts the maximality of t and u.
These maximal substrings can be enumerated efficiently by using an enhanced suffix array (ESA) (Sec. 3.5.2).
All maximal substrings correspond to internal nodes or leaves of the suffix tree. In particular, a maximal substring that occurs more than once corresponds to an internal
node.
Internal nodes and leaves of a suffix tree can be expressed as a pair of indexes [l, r],
indicating that the corresponding substring of length d appears at T[p, p + d − 1] for
p ∈ {SA[l], . . . , SA[r]}.
The enumeration of all leaves and internal nodes can be done in time linear in the
document length [Kasai et al., 2001]. The working space for this enumeration is 10|T| +
O(n) bits.
However, not every internal node corresponds to a maximal substring. For example, in
Figure 3.4, although the substring "ra" corresponds to an internal node, it is not a maximal
one, because "abra" is the maximal substring of its class.
To filter out these redundant internal nodes, we use the Burrows-Wheeler transformed text
B[1, . . . , s]. The following lemma holds.
Lemma 3  The necessary and sufficient condition for a substring q = [l, r] to be a maximal substring is that q corresponds to an internal node or a leaf and B[l, r] has more
than one character type.
We can check whether B[l, r] has more than one character type in constant time using
n + o(n) bits of working space (Section 3.2).
The pseudo code of this algorithm is shown in Algorithm 4. With it, we can obtain
all maximal substrings in time linear in the total length of the documents.
3.4.4 External Information
Finally, let us consider the case where external information is available, such as
word/phrase boundaries, so that the extracted substrings should begin and end at these
boundaries. This case can also be handled by the same algorithm.
First, we replace the input T with a new input T′ in which a special character # is
inserted at the word boundaries, and then we apply the algorithm as above. We then
deal only with the maximal substrings whose first character is #.
This conversion does not increase the computational cost, since the new input is at
most twice the size of the original input and the number of maximal substrings is much smaller.
3.5 Document Classification
The first application of maximal substrings is document classification: given a document, we
assign a label, such as a category (sports, money) or a polarity (positive or negative opinion),
according to the content of the document.
For this task, rule-based methods were applied first, but more recently machine learning
methods using support vector machines (SVMs) or logistic regression (LR) have been applied,
because they are robust, easy to adapt to a new domain, and achieve more accurate
results.
In this study I employ a multi-class logistic regression model (LR) (Section 2.3). I
restate some of the definitions here for the explanation.
For an input document x, and an output label y ∈ Y , we define a feature vector
ϕ(x, y) ∈ Rm capturing the characteristic of the document and the label. In LR, the
probability of a label y given an input x is defined as

p(y|x; w) = (1/Z(x)) exp(wᵀϕ(x, y)),   Z(x) = Σ_{y′} exp(wᵀϕ(x, y′)),   (3.6)

where w ∈ Rm is the weight vector. The most probable label is the one that maximizes
the score,

y* = arg max_y p(y|x; w) = arg max_y wᵀϕ(x, y),   (3.7)

because exp is a monotonically increasing function.
The parameter w is estimated by maximum likelihood estimation (MLE) using
training examples {(x_i, y_i)} (i = 1, . . . , n) as follows,

w* = arg min_w L(w),   (3.8)
L(w) = − Σ_i log p(y_i|x_i; w),   (3.9)

where L(w) is the negative log-likelihood of the training data.
However, MLE tends to over-fit the training data when the number of training
examples is insufficient for the number of parameters. To avoid this problem, a regularization term R(w) : Rm → R is added to the likelihood term in (3.9). Applying L1
regularization (also called Lasso regularization), the weight vector is estimated
as

w*_{MAP} = arg min_w L(w) + C|w|₁,   (3.10)

where |w|₁ = |w₁| + |w₂| + . . . + |w_m|, and C > 0 is the trade-off parameter between the
likelihood term and the regularization term; a small C emphasizes the training data, and
a large C emphasizes the regularization. This L1 regularization corresponds to maximum a posteriori (MAP) estimation with a Laplace prior on w. We call this estimation
L1-LR; it can be optimized with specialized solvers such as OWL-QN [Andrew and Gao, 2007].
It is known that the result of L1-LR is a sparse parameter vector in which many of the
parameters are exactly zero. In other words, learning with L1 regularization naturally has
a feature-selection effect, which results in efficient and interpretable inference.
To optimize (3.10), gradient-based optimization cannot be used directly, since the objective function is not differentiable at w_i = 0. Therefore, several specialized methods
have been proposed for L1-LR optimization.
For learning L1-LR, we adopt the grafting algorithm [Perkins et al., 2003] (Section 3.3)
to improve the training efficiency. In the algorithm, we keep the current weight vector w
and the set of active features H (features with non-zero weights). At the beginning,
we initialize the parameters as w = 0 and H = {}. Let v ∈ Rm be the gradient of the
likelihood term with respect to the parameters w,

v = ∂L(w)/∂w   (3.11)
  = Σ_{i,y} ( −I(y = y_i) + p(y|x_i; w) ) ϕ(x_i, y),   (3.12)

where I(a) is 1 if a is true and 0 otherwise. Let k* be the feature for which |v_{k*}| is the
largest. We add k* to H and optimize w over H only, using an L1-LR solver
such as OWL-QN [Andrew and Gao, 2007], in which the weights w_k with k ∉ H are fixed
to 0. We repeat this process until |v_{k*}| < C. The obtained weight
vector is then identical to the optimal weight vector [Perkins et al., 2003]. The point is that the
training time is almost proportional to the number of active features if we can compute
arg max_k |v_k| efficiently, even if the number of candidate features is very large.
3.5.1 Document Classification Model
In this section, we show that in L1-regularized learning the optimal weights for substring features can be determined
by considering only the maximal substrings.
We assume that the feature type is tf (term frequency), but this is not the only
possible choice. Recall that since we use tf, the feature values of two substrings q and r with
q <_P r are the same for all documents.
In L1 regularization, if features have equal values in all training examples, then the
set of optimal weight vectors is convex. In other words, if w and w′ are
minimizers of (3.10), then w″ := αw + (1 − α)w′ for 0 ≤ α ≤ 1 also minimizes (3.10).
This can be explained from the viewpoint of equivalence classes.
Lemma 4  In the L1-regularized problem, let E = {f_i} be a set of feature indexes that belong
to the same equivalence class, and let w* be a weight vector that minimizes (3.10). Then any weight vector
w′ such that Σ_{i∈E} w′_i = Σ_{i∈E} w*_i, w*_i·w′_i > 0 for all i ∈ E, and w′_i = w*_i for all
i ∉ E also minimizes (3.10).
Proof 3  From the definition, we have

|w′|₁ = Σ_{i∉E} |w′_i| + Σ_{i∈E} |w′_i| = Σ_{i∉E} |w*_i| + Σ_{i∈E} |w*_i| = |w*|₁.   (3.13)

Moreover, since w′ᵀϕ(x_i, y) = w*ᵀϕ(x_i, y) for all 1 ≤ i ≤ n and y ∈ Y, we also have
L(w′) = L(w*) for the likelihood term.
Therefore, when a set of features belongs to the same equivalence class, it is sufficient to
keep the sum of their weights. In summary, for training the L1-regularized
problem, we deal only with the features that correspond to maximal substrings, and each
obtained weight corresponds to the sum of the weights in its equivalence class.
3.5.2 Efficient Learning Algorithm with All Substring Features
In the previous subsection, we saw that it is enough to consider the features corresponding to maximal substrings to find the optimal weights in the ALLSTR representation. However, the computational cost is still large; the number of maximal substrings is linear in the total length
of the documents. In this section, we show how to deal with these maximal substrings
without generating feature vectors explicitly.
Recall that the grafting algorithm (Algorithm 3) only requires finding the feature
whose gradient of the likelihood has the largest absolute value (k* =
arg max_k |v_k|). We show that k* can be estimated efficiently using auxiliary data structures.
In particular, we consider the following feature types and combinations of these feature
types.
• tf (q, d) : the frequency of q in a document d.
• bin(q, d) : 1 if q appears in d and 0 otherwise.
• idf(q) : log(n/df(q)), where n is the number of documents and df(q) is the number of documents that include q.
Note that our method is not limited to these feature types. In general, we can compute the gradient efficiently if the feature functions depend only on the position information. If a feature function depends on other information, such as orthographic
features, then we cannot summarize the substring information, and different techniques are required for efficient computation.
First, let us consider the case where the feature type is tf. Let g(l, r, y) be the gradient
value of the feature corresponding to the substring q that appears at T[p, p + d − 1] for
p ∈ {SA[l], . . . , SA[r]}, together with a label y:

g(l, r, y) = Σ_{i=1}^{n} ( P(y|x_i; w) − I(y_i = y) ) tf(q, x_i).   (3.14)
Remember that the occurrences of any substring are stored in a consecutive region of SA.
Let D[i] be the index of the document that contains position SA[i]. Let α be a two-dimensional
array of size |Y| × (s + 1) defined as

α[y][i] = Σ_{j=1}^{i−1} ( P(y|x_{D[j]}; w) − I(y_{D[j]} = y) ).   (3.15)

Then we can calculate g(l, r, y) using α as

g(l, r, y) = α[y][r + 1] − α[y][l].   (3.16)
This is because

α[y][r + 1] − α[y][l] = Σ_{j=l}^{r} ( P(y|x_{D[j]}; w) − I(y_{D[j]} = y) ) = Σ_{i=1}^{n} ( P(y|x_i; w) − I(y_i = y) ) tf(q, x_i).

Therefore, the gradient of any substring can be calculated in constant time by table lookup, using a table of size O(|Y|·s).
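A small sketch of this cumulative-array trick, under one consistent 0-based indexing (illustrative code, not the thesis implementation), is given below; D[j] is the document index of SA[j], P holds the current model probabilities, and labels holds the gold labels:

import numpy as np

def build_alpha(D, P, labels):
    """alpha[y][i] accumulates P(y|x_{D[j]}; w) - I(y_{D[j]} = y) over the first i suffix positions."""
    s, k = len(D), P.shape[1]
    alpha = np.zeros((k, s + 1))
    for j, d in enumerate(D):
        contrib = P[d].copy()
        contrib[labels[d]] -= 1.0
        alpha[:, j + 1] = alpha[:, j] + contrib
    return alpha

def gradient(alpha, l, r, y):
    """Gradient of the tf feature whose occurrences fill SA[l..r], for label y."""
    return alpha[y, r + 1] - alpha[y, l]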
Figure 3.5: An example of the computation of the gradient corresponding to a substring "book".
Figure 3.5 shows an example of the computation of the gradient corresponding to the
substring "book". The columns y = 1, y = 2, and y = 3 hold the values I(y_{D[i]} =
y) − P(y|x_{D[i]}; w). For example, the cell in row i = 4 and column y = 2 shows the
value I(y_{D[4]} = 2) − P(2|x_{D[4]}; w). The gradient value corresponding to the substring
"book" and the label y = 2 is then the sum of the values in the cells with i = 3, . . . , 6 and y = 2.
For the feature types idf(q) and bin(q, d), we can compute the gradient of any substring
in constant time using auxiliary data structures [Sadakane, 2007]. In this case, we need
to remove duplicated documents when enumerating the occurrences of q in the consecutive region. In practice, the auxiliary data structures require a large working space, so
we adopt a simpler strategy: first enumerate all positions containing q, and then remove
the duplicated documents among these positions.
When we use len(q) features, the gradient values of substrings in the same class are
different. In this case, we enumerate the substrings in each class from the longest one.
In summary, we state the following theorem.
Theorem 1  Given training documents whose total length is s, we can train an L1-regularized logistic regression model using all substring features in O(s) time.
Finally, Algorithm 4 shows the overall framework to compute arg max_k |v_k|.
This is the same as the bottom-up traversal of all nodes of the suffix tree using the height array [Kasai et al., 2001], except that we compute the gradient value of each feature by using g(l, r, y) as discussed above.
Algorithm 4 The calculation of the gradients of all maximal substrings
Input: H[0, s], SA[0, s], B[0, s], D[0, s]
  S: a stack storing pairs (pos: the beginning position in SA, len: the length of the substring)
  v_{k*} = 0 : stores the largest gradient value found so far (together with its feature k*)
  for i = 0 to n + 1 do
    cur = (i, H[i])
    cand = top(S)
    while cand.len > cur.len do
      pop(S)
      if B[cand.pos, . . . , i] has more than one character type then
        v_k := g(cand.pos, i, y) for each y ∈ Y   // estimate the gradient of the feature (Section 3.5.2)
        if |v_k| > |v_{k*}| then
          v_{k*} = v_k   // also store k
        end if
      end if
      cand = top(S)
    end while
    if cand.len < cur.len then
      push(S, (cand.pos, cur.len))   // internal node
    end if
    push(S, (i, n − SA[i] + 1))   // leaf
  end for
  Output v_{k*} and the corresponding feature k*
3.5.3 Inference
We now explain how to classify a test document using the result of our algorithm.
After training, we have a set of substrings H and their weights. We first build a trie
data structure for H and assign a weight to each leaf or internal node. Then we find
all matches of H in the test document using the Aho-Corasick method [Aho and Corasick, 1975]. This is
done in time linear in the length of the test document.
Table 3.1: The data sets in the document classification task

Corpus   | Num. docs | Total len. (bytes) | Num. word types | Num. maximal strings
MOVIE-A  | 2000      | 7786004            | 38187           | 1685037
MOVIE-B  | 7440      | 213970             | 55764           | 713229
TC300-A  | 200       | 1953894            | 16655           | 378673
TC300-B  | 200       | 1424566            | 14430           | 236220
Note that, unlike string kernels, for which we have to keep the whole document set, we
keep only a few substrings thanks to the L1 regularization. Therefore, the working space is very
small.
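A toy scoring sketch of this inference step is shown below; for brevity it scans every starting position against the weighted substrings instead of building an Aho-Corasick automaton (so it is quadratic in the worst case), and the substrings and weights are made-up examples rather than learned ones:

from collections import defaultdict

def classify(document, weights):
    """weights: dict mapping substring -> (label, weight)."""
    scores = defaultdict(float)
    max_len = max(len(s) for s in weights)
    for i in range(len(document)):
        window = document[i:i + max_len]
        for sub, (label, w) in weights.items():
            if window.startswith(sub):
                scores[label] += w
    return max(scores, key=scores.get) if scores else None

print(classify("this hockey game went to overtime",
               {"hockey": ("rec.sport.hockey", 1.3), "a car": ("rec.autos", 1.1)}))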
3.5.4 Experiments
We conducted a series of document classification experiments on two data sets, MOVIE
and TechTC-300.
MOVIE is a sentiment classification task: given a review, we classify it
as positive or negative. There are two versions of the data set: one provided by
Bo Pang³ (MOVIE-A), and the other provided by Ifrim⁴ (MOVIE-B), which was used
in [Ifrim et al., 2008]. TechTC-300 consists of 300 binary classification tasks. The original
categories come from the Open Directory Project. Among the 300 tasks, we used two tasks for
which SVM classification achieves only about 70% accuracy⁵. The details of each data
set are described in Table 3.1.
We examined the performance using 5-fold cross-validation. We determined the hyper-parameters by using a development set.
We compared our method (Proposed in Table 3.2) with L1-LR with BOW (BOW+L1),
LR with variable-length N-grams [Ifrim et al., 2008] (SLR), and BOW with SVM (SVM).
For SVM we used a third-degree polynomial kernel because it achieved the highest accuracy among the kernels we tried (including string kernels).
³ http://www.cs.cornell.edu/People/pabo/movie-review-data/, Polarity dataset v2.0
⁴ http://www.mpi-inf.mpg.de/~ifrim/data/kdd08-datasets.zip, KDD08-datasets
⁵ http://techtc.cs.technion.ac.il/techtc300/techtc300.html, A: 10341-14271, B: 10539-194915
Table 3.2: Results of the document classification task

Corpus   | Proposed | BOW+L1 | SLR   | SVM
MOVIE-A  | 86.5%    | 83.0%  | 81.6% | 87.2%
MOVIE-B  | 75.1%    | 71.0%  | 74.0% | 69.1%
TC300-A  | 80.0%    | 66.7%  | 80.0% | 73.1%
TC300-B  | 86.7%    | 86.7%  | 73.3% | 71.9%
Figure 3.6: The time for finding all maximal substrings (x-axis: input size (MB), y-axis: time (sec.)).
For the feature types, we compared the results using bin, tf, idf, len, and their combinations. For word-based BOW, tf achieved the best performance, and for the proposed
method, idf achieved the best performance⁶. We used these feature types in the following
experiments.
In practice, the most time-consuming part of our method was the calculation of
arg max_k |v_k|, because we need to access the whole data sequentially. In the original grafting
algorithm, only one feature is added at a time from the feature candidates. We instead chose a predefined number of features with the largest gradients and added them to H. Note that even if we include
several features at once, the algorithm converges to the global optimum.
All experiments were conducted on a 3.0 GHz Xeon processor with 32 GB of main memory. The operating system was Linux version 2.6.9. The compiler was g++ (gcc version
4.0.3) executed with the -O3 option.
Table 3.2 shows the accuracy results. The proposed method achieved the highest or
second-highest accuracy on all data sets. SVM achieved the highest performance on the MOVIE-A corpus,

⁶ There is no significant difference between idf, len, tf-idf, idf-len, tf-len, and tf-idf-len.
but very low performance on the other corpora, because SVM suffered from
noise words (irrelevant to the class information). The methods with L1 regularization
could filter out ineffective substrings and achieved high performance on TC300-B. The
results for SLR were always equal to or worse than those of our method, because SLR searches
effective substrings in a greedy manner and in some cases cannot find the effective
substrings. The proposed method could successfully find the effective substrings among all
substrings.
Finally, we examined the scalability of the proposed method. We varied the length of
the input text and measured the time to enumerate all maximal substrings. Note that this
part is the most time-consuming and dominant part of our algorithm. Figure 3.6 shows
the results: the x-axis shows the input size, and the y-axis shows the time for reporting all
maximal substrings. The results indicate that our method scales in proportion
to the text size, even for texts as large as 1 GB.
3.6 Document Clustering
The next application of all substring features is document clustering. Clustering is the
task of splitting examples into clusters so that the examples in the same cluster are similar
and those in different clusters are dissimilar. Unlike document classification, there
is no training data in document clustering; thus it is unsupervised learning. There
are many variations of clustering depending on how the (dis-)similarity and the preference over results are defined, such as K-means, Gaussian mixture models, and spectral clustering
methods [Ding et al., 2001].
Recently, maximum margin clustering (MMC) has been proposed [Xu et al., 2004].
Since MMC produces more accurate results and its formulation is very similar to margin-based supervised learning, many researchers have studied MMC [Zhang et al., 2007,
Zhao et al., 2008a, Zhao et al., 2008b, Li et al., 2009, Gieseke et al., 2009].
The MMC approach is based on the multi-class SVM [Crammer and Singer, 2001], a supervised learning method. In the multi-class SVM, the parameters are optimized so that the
margin is maximized for each class. In MMC, the system also determines the partition
of the examples so that the margin is maximized in each cluster. A trivial solution of this
problem is that all examples belong to the same class. To obtain a meaningful result, a constraint is added that specifies the minimum and the maximum size of a cluster [Xu et al., 2004]. Alternatively, the constraint is relaxed so that the sum of margins is
restricted [Zhao et al., 2008a, Zhao et al., 2008b].
Since the optimization problem in MMC is non-convex, like those in other clustering
methods, specialized optimization techniques have been proposed. Zhao [2008a, 2008b] proposed
to apply a cutting-plane algorithm to the problem: constraints are added starting from the ones with
the largest effect. They then solve a convex-concave problem in which the objective function
is represented as a difference of convex functions. This solver is extremely fast,
even compared to K-means clustering.
Instead, we propose a clustering model based on a logistic regression model. Since
this model is probabilistic, it can assign a conditional probability of
a class given a document, which MMC cannot. Moreover, we apply L1 regularization
to the objective function so that the features related to the clustering result become
sparse. Since in L1 regularization only a few features receive a large weight, the substrings with
large weights can be used as labels of the clusters. By combining this model with the idea
of maximal substrings, the clustering can consider all substring information, and its result is
compact.
3.6.1 Logistic Regression Clustering
I propose a clustering algorithm that aims at splitting input examples {x_i} (i =
1, . . . , n) into k clusters. The number of clusters k is given by the user, and determining the optimal number of clusters is future work.
As in a classification problem, let us represent the information of a pair of an input x and an
output y by a feature vector ϕ(x, y) ∈ Rm, each value of which is determined by a feature
function f_i, that is, ϕ(x, y)_i = f_i(x, y). The cluster number of the i-th input is denoted by
y_i ∈ {1, . . . , k} (i = 1, . . . , n), and the set of cluster numbers of all examples is denoted
by y. Note that unlike in a supervised classification problem, y_i is not given by the training
data beforehand, and the clustering task is to find y so that examples in the same cluster
are similar and those in different clusters are dissimilar.
Recall that in a multi-class logistic regression model, the conditional probability of an
output y given an input x is defined as

p(y|x; w) = (1/Z(x)) exp( wᵀϕ(x, y) ),   (3.17)

where w is a weight vector and Z(x) = Σ_y exp( wᵀϕ(x, y) ) is a normalization term, or
partition function.
Then, clustering is performed by finding the label set y and the weight vector w
that maximize the likelihood of the examples, or equivalently minimize the negative
log-likelihood,

(y*, w) = arg min_{y,w} − Σ_i log p(y_i|x_i; w).   (3.18)
We also add L1 regularization to the objective function to obtain a sparse weight vector.
Finally, we obtain the following objective function for the clustering problem,

(y*, w) = arg min_{y,w} − Σ_i log p(y_i|x_i; w) + C∥w∥₁,   (3.19)

where C ∈ R is a trade-off parameter that determines the sparseness of w; when C is large,
many values in w become exactly 0.
Since the direct optimization of y is hard, we solve a different optimization problem
that is equivalent to (3.19) [Zhao et al., 2008b]:

w* = arg min_w − Σ_{i,y} M(i, y; w) log p(y|x_i; w) + C∥w∥₁,   (3.20)
where M(i, y; w) is defined as

M(i, y; w) = Π_{y′≠y} I( wᵀϕ(x_i, y) > wᵀϕ(x_i, y′) ).   (3.21)

Intuitively, M(i, y; w) returns 1 if y gives the largest probability for x_i, and returns 0
otherwise. The optima of (3.19) and (3.20) are equivalent, and their optimal w* is the
same.
Then, the cluster assignment for each example is easily estimated by

y_i = arg max_y wᵀϕ(x_i, y).   (3.22)
The trivial solution of (3.20) is obtained by assigning the same cluster to all
examples. We also want to avoid the case where a cluster consists only of outliers.
Table 3.3: The result of clustering accuracy (%)

Data     | LRC+ALLSTR | LRC+BOW | K-means | NC    | MMC
20-NEWS  | 71.12      | 69.15   | 35.27   | 41.89 | 70.63
WK-CL    | 71.05      | 65.38   | 55.71   | 61.43 | 71.95
WK-TX    | 64.50      | 60.15   | 45.05   | 35.38 | 69.29
WK-WT    | 74.05      | 73.12   | 53.52   | 32.85 | 77.96
WK-WC    | 71.20      | 71.05   | 49.53   | 33.31 | 73.88
Therefore, we add a constraint to the problem so that each cluster should not be larger
or smaller than some threshold. For example, in [Zhao et al., 2008a, Zhao et al., 2008b],
for all pairs of clusters p and q, the following constraint is added:

−l ≤ Σ_{i=1}^{n} wᵀϕ(x_i, p) − Σ_{i=1}^{n} wᵀϕ(x_i, q) ≤ l.   (3.23)

In this thesis, we use a simpler constraint,

W(w) = Σ_y ( Σ_i w_yᵀϕ(x_i) )² = Σ_y ( wᵀϕ(all, y) )²,   (3.24)

where ϕ(all, y) = Σ_i ϕ(x_i, y). This function gives a smaller value when all clusters have
the same size.
the same size.
In summary, we solve the following optimization problem,
w∗ = arg min −
w
∑
M (i, y; w) log p(y|xi ; w)
(3.25)
i,y
+C∥w∥1 + C1 W (w)
where C1 determines the size of clusters.
To optimize (3.25), we alternately assign labels to the examples and optimize w, and repeat this until convergence. The label assignment step corresponds to fixing M(i, y; w) in (3.25).
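To make the alternating procedure concrete, the following is a minimal sketch in Python (my illustration, not the thesis implementation). It assumes a dense feature matrix and a hypothetical helper fit_l1_multiclass that solves the inner L1-regularized multi-class logistic regression for fixed labels; the size-balancing term W(w) is omitted for brevity.

    import numpy as np

    def lrc_cluster(X, k, fit_l1_multiclass, n_iter=20, seed=0):
        """Alternating optimization for logistic regression clustering (sketch).

        X : (n, m) feature matrix, one row per example.
        k : number of clusters.
        fit_l1_multiclass : callable(X, y, k) -> W of shape (k, m); assumed to
            solve the inner L1-regularized multi-class LR problem for fixed y.
        """
        rng = np.random.default_rng(seed)
        y = rng.integers(0, k, size=X.shape[0])      # random initial assignment
        for _ in range(n_iter):
            W = fit_l1_multiclass(X, y, k)           # optimize w with labels fixed
            scores = X @ W.T                         # (n, k) scores w^T phi(x, y)
            y_new = scores.argmax(axis=1)            # Eq. (3.22): reassign clusters
            if np.array_equal(y_new, y):             # converged: labels unchanged
                break
            y = y_new
        return y, W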
3.6.2 Experiments
We conducted document clustering experiments on two data sets, 20 Newsgroups (20news-18828)7 and WebKB. For the 20 Newsgroups data set, we selected the topic rec.
7 http://people.csail.mit.edu/jrennie/20Newsgropus
Table 3.4: Examples of substrings that have the largest weight in each cluster

CLUSTER              SUBSTRING
rec.autos            "the ford", "a car"
rec.motorcycles      "a bike"
rec.sport.baseball   "the yankees", "an era"
rec.sport.hockey     "an NHL", "Hockey League"
There are four subtopics in the topic rec: autos, motorcycles, baseball, and hockey. WebKB is the crawling result of four university websites, and there are seven topics, such as student and faculty.
It is difficult to measure the accuracy of clustering because there is no single correct answer. We adopt the method used in [Xu et al., 2004, Zhao et al., 2008b]: (1) remove the labels from the data set; (2) run the clustering algorithm on the unlabeled data, with the number of clusters set to the number of labels in the original data set; (3) for each cluster, examine which label is assigned most often in the original data set; and (4) for each cluster, count how many examples carry that majority label.
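A possible implementation of this evaluation, assuming integer-coded labels and cluster ids (a sketch, not the original evaluation script):

    import numpy as np

    def clustering_accuracy(true_labels, cluster_ids):
        """Majority-label clustering accuracy as described above.

        For each cluster, the most frequent original label is counted as correct;
        the accuracy is the fraction of examples covered by these majority labels."""
        true_labels = np.asarray(true_labels)
        cluster_ids = np.asarray(cluster_ids)
        correct = 0
        for c in np.unique(cluster_ids):
            labels_in_c = true_labels[cluster_ids == c]
            # count of the most frequent original label inside cluster c
            correct += np.bincount(labels_in_c).max()
        return correct / len(true_labels)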
Our proposed method was compared with K-means, Normalized Cut (NC) [Shi and Malik, 2000], and maximum margin clustering (MMC) [Zhao et al., 2008b]. The results of the other methods are taken from [Zhao et al., 2008b]8. We compared the proposed method with the usual bag-of-words representation (LRC+BOW) and with the all-substring representation (LRC+ALLSTR). Note that the representation of LRC+ALLSTR also includes the word features.
The results are shown in Table 3.3. On most data sets, the proposed methods achieved results similar to those of MMC. In particular, when BOW features are used, the result of LRC+BOW is slightly worse than that of MMC. This is probably because LRC optimizes the likelihood of the examples, while MMC optimizes the classification accuracy directly. Therefore, our method would be preferable when probabilistic information is required. LRC+ALLSTR achieved the highest accuracy on many corpora, partly because it can use substring features that are not available to the other methods.
8 I conducted experiments on K-means and Normalized Cut, and obtained similar but slightly worse accuracy results.
Table 3.4 shows examples of substrings that have the largest weight in each cluster. The column “Cluster” shows the label that is most often assigned in the original data set. Other highly weighted features are often parts of a signature or affiliation of the writer, such as “[email protected]” or “University of xxx”.
To examine why the proposed algorithm succeeds, I sorted the maximal substrings in decreasing order of tf log(df). The top 500 maximal substrings each appeared only in documents of a single label. Such key substrings, which determine the clusters, can be found efficiently by using the maximal substring algorithm.
3.7 Discussion and Conclusion
In this chapter I proposed a novel algorithm that considers all substrings as features, and showed that we can train a document classification model with all substrings, without approximation, in linear training time. The experimental results showed that our method achieves the highest performance on several tasks compared to other document classification methods: word-based BOW and the very recent variable-length N-gram logistic regression model [Ifrim et al., 2008]. Our training results are represented as a very compact set of substrings, and the inference time is very short both in theory and in practice.
Next, I applied this algorithm to a document clustering task. To achieve this, I developed a novel document clustering model, called logistic regression clustering (LRC). The experimental results show that its accuracy matches that of state-of-the-art methods. Moreover, our algorithm gives a compact result in that only a few features are related to the clustering decision, and therefore these features can be used as labels of the clusters.
Chapter 4
Learning with Combination Features
In this chapter, I propose an algorithm for learning an L1-regularized logistic regression model with combination features [Okanohara and Tsujii, 2009a]. With this algorithm, we can extract effective combination features without enumerating all of the candidate features. The method relies on the grafting algorithm [Perkins and Theiler, 2003], which incrementally adds features as in boosting, but converges to the global optimum.
The heart of our algorithm is a way to find the feature that has the largest gradient value of the likelihood from among the huge set of candidates. To solve this problem, we propose an example-wise algorithm with filtering. This algorithm is very simple and easy to implement, yet effective in practice.
I applied the proposed method to NLP tasks, and found that it can achieve the same high performance as kernel methods, while the number of active combination features remains relatively small, on the order of several thousand.
4.1 Linear Classifier and Combination Features
A linear classifier, including the logistic regression model (LR), is a fundamental tool for many NLP applications; its score is determined by the inner product of a feature vector and a weight vector, and is thus a linear combination of features and their weights. Although a linear classifier is very simple, it can achieve high performance on many NLP tasks, partly because many problems are described with very high-dimensional data, and high-dimensional weight vectors are effective in discriminating among examples.
However, when the original problem cannot be handled linearly, combination features are often added to the feature set, where a combination feature is the product of several original features. Examples are word pairs in document classification and part-of-speech pairs of head and modifier words in dependency analysis. However, determining effective combination features, namely feature engineering, requires domain-specific knowledge and hard work.
For example, in document classification, a document is often represented by a bag of words, where each feature ϕw corresponds to the occurrence of a word w in a document. If the co-occurrence of words w1 and w2 represents the class label of a document well, then a linear classifier with the combination feature ϕw12 := ϕw1 ϕw2 can separate the data well. In other tasks, combination features are essential for discriminating the labels; in dependency analysis, for instance, the original features by themselves cannot discriminate the labels, while part-of-speech pairs of head and modifier words can.
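As a concrete illustration (a sketch of my own, not part of the thesis; the feature-name scheme is hypothetical), degree-2 combination features for a binary bag-of-words example can be generated as follows:

    from itertools import combinations

    def add_pair_features(features):
        """Given the set of active binary features of an example (e.g. the words
        in a document), return the original features plus all degree-2
        combination features. A combination feature 'w1&w2' fires iff both w1
        and w2 fire, i.e. it is the product of the two binary original features."""
        combined = {f"{a}&{b}" for a, b in combinations(sorted(features), 2)}
        return set(features) | combined

    # Example: a tiny "document" with three active word features
    print(add_pair_features({"car", "ford", "engine"}))
    # {'car', 'engine', 'ford', 'car&engine', 'car&ford', 'engine&ford'}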
Such non-linear phenomena can be implicitly captured by using the kernel trick (2.7). For example, a third-order polynomial kernel can be regarded as an inner product over features consisting of combinations of up to three original features. However, its computational cost is very high, not only during training but also at inference time. Moreover, the model is not interpretable, in that effective features are not represented explicitly. Many kernel methods employ L2 regularization, which implicitly assumes that many features are equally relevant to the task [Ng, 2004]. Therefore, kernels cannot handle the case in which very few features are relevant to the class1. Although a sparse kernel logistic regression [Hérault and Grandvalet, 2007] has been proposed, in which only a few training examples are selected as support vectors, these problems of the kernel trick have not yet been solved.
There have been several studies on efficient ways to obtain (combination) features. In the context of boosting, Kudo [2004] proposed a method to extract complex features that is similar to item set mining algorithms. In the context of L1 regularization, Dudík [2007], Gao [2006], and Tsuda [2007] have also proposed methods by which effective features are extracted from huge sets of feature candidates. However, their methods
1 With L1 regularization, we cannot directly apply the kernel method, because the optimal parameters cannot be represented as a combination of the inputs.
are still computationally expensive, and we cannot directly apply this kind of method to a large-scale NLP problem. Very recently, Yoshinaga [2009] proposed a learning algorithm with combination features that uses pre-calculated weights of (partial) feature vectors stored in a feature sequence trie.
4.2 Learning Model
I consider a multi-class logistic regression model (LR) (Section 2.3). For input x, and an
output label y ∈ Y , we define a feature vector ϕ(x, y) ∈ Rm .
In LR, the probability of a label y given an input x is defined as follows:

p(y|x; w) = (1/Z(x, w)) exp(w^T ϕ(x, y)),    (4.1)

where w ∈ R^m is a weight vector corresponding to each input dimension, and Z(x, w) = Σ_y exp(w^T ϕ(x, y)) is the partition function.
Since the number of combination features is very large, the training procedure can easily overfit the training data. One way to control the degree to which the parameters are fitted to the data is to introduce a penalty term that controls the complexity of the set of possible models. We use L1 regularization because it yields a sparse parameter vector, in which many of the parameter values are exactly zero. In other words, learning with L1 regularization has an intrinsic feature selection effect, which results in efficient and interpretable inference with almost the same performance as L2 regularization [Gao et al., 2007b].
We estimate the parameter w by maximum likelihood estimation (MLE) with L1 regularization using training examples {(x1, y1), . . . , (xn, yn)} as follows:

w* = arg min_w L(w) + C Σ_i |wi|,    (4.2)

L(w) = − Σ_{i=1}^{n} log p(yi|xi; w),

where C > 0 is the trade-off parameter between the likelihood term and the regularization term. This estimation is a convex optimization problem.
As in the previous section, we adopt the grafting algorithm [Perkins et al., 2003] (Section 3.3), so that the training cost is proportional to the number of active features. Recall that the key operation required by the grafting algorithm is to find the most effective feature, namely the feature that has the largest absolute value of the gradient.
4.3 Extraction of Combination Features
This section presents an algorithm to find the most effective combination feature, that is, the feature k* whose gradient vk has the largest absolute value. To solve this problem, we propose a perceptron-like algorithm with filtering. This algorithm is very simple and easy to implement.
Let k be a new feature to be tested. Then the gradient of the likelihood with respect to wk is calculated as

vk = ∂L(w)/∂wk = Σ_{i,y} αi,y ϕk(xi, y),    (4.3)

αi,y = I(yi = y) − p(y|xi; w),    (4.4)

where I(a) is 1 if a is true and 0 otherwise. In the grafting algorithm, we then add k* = arg max_k |vk| to H and optimize (4.2) with respect to H only. The solution w that is obtained is used in the next search. The iteration continues until |vk*| < C.
We can examine whether a feature k is effective by checking the value of vk. That is, at the optimum, a feature can have a non-zero weight only if |vk| ≥ C. Therefore, we can safely filter out a feature if we know that |vk| cannot be larger than C.
Moreover, we make use of the sparseness of the training data and compute the values vk in an example-wise manner. This is similar to an online learning algorithm, but our method is used for finding effective features, not for optimizing the objective function.
We assume that the value of a combination feature is less than or equal to the values of its original features. This assumption is typical in NLP; for example, it holds when we use binary values for both original and combination features. We can relax this constraint at the price of increased computational cost.
First, we sort the examples in descending order of their αi,yi = Σ_{y≠yi} p(y|xi; w) values, the sum of the probabilities of the wrong labels. Then, we look at the examples one by one. Let us assume that r examples have been examined so far.
We keep three vectors during the computation of v,

t   = Σ_{i≤r, y} αi,y ϕ(xi, y),    (4.5)
t−  = Σ_{i>r, y} α−i,y ϕ(xi, y),    (4.6)
t+  = Σ_{i>r, y} α+i,y ϕ(xi, y),

where α−i,y = min(αi,y, 0) and α+i,y = max(αi,y, 0).
Then, simple calculus shows that the gradient value vk of a combination feature k, whose original features are k1 and k2, is bounded below and above as follows:

tk + t−k < vk < tk + t+k,    (4.7)
tk + max(t−k1, t−k2) < vk < tk + min(t+k1, t+k2).

Intuitively, the upper bound in (4.7) corresponds to the case where the combination feature fires only for the examples with αi,y ≥ 0, and the lower bound to the case where it fires only for the examples with αi,y ≤ 0. The second inequality follows from the fact that the value of a combination feature is equal to or less than the values of its original features. Therefore, we evaluate (4.7) and check whether |vk| can be larger than C. If not, we can remove the feature safely.
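A minimal sketch of this pruning test (my illustration; the variable layout is hypothetical, with the three vectors kept as dictionaries keyed by original feature id):

    def can_prune(t_k, k1, k2, t_minus, t_plus, C):
        """Return True if combination feature k = (k1, k2) can be safely skipped.

        t_k     : gradient mass of k accumulated over the examples examined so far
        t_minus : dict, remaining negative gradient mass of each original feature
                  over the not-yet-examined examples (values <= 0)
        t_plus  : dict, remaining positive gradient mass (values >= 0)
        Eq. (4.7): t_k + max(t-_k1, t-_k2) < v_k < t_k + min(t+_k1, t+_k2).
        If this interval lies strictly inside (-C, C), |v_k| can never reach C."""
        lower = t_k + max(t_minus.get(k1, 0.0), t_minus.get(k2, 0.0))
        upper = t_k + min(t_plus.get(k1, 0.0), t_plus.get(k2, 0.0))
        return -C < lower and upper < C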
Since the examples are sorted in the order of their Σ_y |αi,y| values, the bound becomes tight quickly. Therefore, many combination features are filtered out in the early steps. In the experiments, the weights of the original features are optimized first, and then the weights of the combination features are optimized. This significantly reduces the number of candidate combination features.
Figure 4.1 shows an example of filtering. The red line shows the upper bound tk + t+k and the green line shows the lower bound tk + t−k.
Algorithm 5 presents the details of the overall algorithm for extracting effective combination features. Note that many candidate features are filtered out just before being added.
We maximize the effect of filtering as follows. First, we solve the problem with only the original features, and then we check the combination features. After training with only the original features, many αi,yi are small, and many combinations can be filtered out without being inserted into H. Note that even in this case, we obtain the optimal weights.
Figure 4.1: An example of filtering a candidate combination feature.
In an approximated version, we keep only the top K features whose |vk| are largest. In this case, some effective features may be filtered out2.
4.4 Experiments
To measure the effectiveness of the proposed method (called L1 -Comb), we conducted
experiments on the dependency analysis task, and the document classification task. In all
experiments, the parameter C was tuned using the development data set.
In the first experiment, we performed Japanese dependency analysis. We used the Kyoto Text Corpus (Version 3.0); Jan. 1 and Jan. 3–8 were used as the training data, Jan. 10 as the development data, and Jan. 9 as the test data, so that the results can be compared with those of previous studies [Sassano, 2004]3. We used the shift-reduce dependency algorithm [Sassano, 2004]. The number of training events was 113,332, each of which consisted of two word positions as inputs and y ∈ {0, 1} as an output indicating the dependency relation. We used the same feature set as in [Sassano, 2004], but we used only atomic features and did not use any combination features explicitly, because we wanted to test the effectiveness of the combination features.
2 In the experiments, I examined only the exact version.
3 The data set is different from that in the CoNLL shared task; this data set is more difficult.
Algorithm 5 Algorithm to return the combination feature that has the largest gradient value.
Input: training data (xi, yi) and their αi,y values (i = 1, . . . , n, y = 1, . . . , |Y|), and the parameter C. Examples are sorted with respect to their Σ_y |αi,y| values.
t+ := Σ_{i=1}^{n} Σ_y max(αi,y, 0) ϕ(xi, y)
t− := Σ_{i=1}^{n} Σ_y min(αi,y, 0) ϕ(xi, y)
H := {}, vk := 0 for all candidates k  // active combination features and their accumulated gradients
for i = 1 to n and y ∈ Y do
  for all combination features k in xi do
    if |vk| can be larger than C (checked by using Eq. (4.7)) then
      vk := vk + αi,y ϕk(xi, y)
      H := H ∪ {k}
    end if
  end for
  t+ := t+ − max(αi,y, 0) ϕ(xi, y)
  t− := t− − min(αi,y, 0) ϕ(xi, y)
end for
Output: arg max_{k∈H} |vk|
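For concreteness, a rough Python rendering of Algorithm 5 is given below (my sketch, not the thesis implementation). It simplifies the setting to binary features that do not depend on the label, and its data layout (lists of feature sets and per-example α dictionaries) is an assumption.

    from collections import defaultdict
    from itertools import combinations

    def find_best_combination(example_feats, alpha, C):
        """example_feats[i]: set of binary original features firing in example i,
        assumed pre-sorted by decreasing sum_y |alpha[i][y]|.
        alpha[i][y] = I(y_i = y) - p(y | x_i; w)."""
        t_plus, t_minus = defaultdict(float), defaultdict(float)
        for feats, a in zip(example_feats, alpha):      # remaining +/- gradient mass
            for a_y in a.values():
                for f in feats:
                    if a_y > 0: t_plus[f] += a_y
                    else:       t_minus[f] += a_y
        v, H = defaultdict(float), set()
        for feats, a in zip(example_feats, alpha):
            for a_y in a.values():
                for k in combinations(sorted(feats), 2):
                    lower = v[k] + max(t_minus[k[0]], t_minus[k[1]])
                    upper = v[k] + min(t_plus[k[0]], t_plus[k[1]])
                    if upper >= C or lower <= -C:       # bound (4.7): k may still reach C
                        v[k] += a_y                     # binary features: phi_k = 1
                        H.add(k)
                for f in feats:                         # shrink the remaining mass
                    if a_y > 0: t_plus[f] -= a_y
                    else:       t_minus[f] -= a_y
        return max(H, key=lambda k: abs(v[k]), default=None)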
For example, we used the POS of headwords as features, but did not use the pair of the POS of head and modifier words as a feature. We expect that our algorithm can automatically extract such features from the training examples. For the training data, the number of original features was 78,570, and the numbers of combination features of degrees 2 and 3 were 5,787,361 and 169,430,335, respectively. Note that our algorithm does not need to examine all of them.
In all experiments, combination features of degrees 2 and 3 (the products of two or
three original features) were used.
We compared our method with LR with L1 regularization using only the original features (L1-Original), an SVM with a third-order polynomial kernel, LR with L2 regularization using combination features of degree up to 3 (L2-Comb3), and an averaged perceptron with the original features (Ave. Perceptron).
Table 4.1 shows the results of the Japanese dependency task. The accuracy result
Table 4.1: The performance of the Japanese dependency task on the test set. The active features column shows the number of features with nonzero weight.

              DEP. ACC. (%)  TRAIN TIME (S)  ACTIVE FEAT.
L1-COMB       89.03          605             78002
L1-ORIG       88.50          35              29166
SVM 3-POLY    88.72          35720           (KERNEL)
L2-COMB3      89.52          22197           91477782
AVE. PERCE.   87.23          5               45089
indicates that the accuracy was improved by the automatically extracted combination features. The active features column lists the number of active features; it shows that L1 regularization automatically selects very few effective features. Note that, during training, L1-Comb used around 100 MB of memory, while L2-Comb3 used more than 30 GB. The most time-consuming part of L1-Comb was the optimization of the L1-LR problem.
Examples of extracted combination features include POS pairs of head and modifier words, such as Head/Noun–Modifier/Noun, and combinations of distance features with the POS of the head.
Note that the accuracy of our method is still lower than the current best result for the Japanese dependency analysis task, which uses manually selected combination features or polynomial kernels. Previous work [Uchimoto et al., 1999] suggested that combination features of degree 4 or 5 would improve the accuracy; this is left as future work.
For the second experiment, we performed the document classification task using the Tech-TC-300 data set [Davidov et al., 2004]4. We used tf-idf scores as feature values and did not filter out any words beforehand. The Tech-TC-300 data set consists of 295 binary classification tasks. We divided each document set into a training set and a test set; the ratio of the test set to the training set was 1 : 4. The average number of features per task was 25,389.
Table 4.2 shows the results for L1-LR with combination features and an SVM with a linear kernel5.
4 http://techtc.cs.technion.ac.il/techtc300/techtc300.html
Table 4.2: Document classification results for the Tech-TC-300 data set. The column F2 shows the average of the F2 scores for each classification method.

                       F2
L1-COMB                0.949
L1-ORIG                0.917
SVM (LINEAR KERNEL)    0.896
The column ACTIVE FEATURES shows the average number of active features for each task. The results indicate that the combination features are effective.
4.5 Discussion and Conclusion
I have presented a method to extract effective combination features for the L1-regularized logistic regression model. I have shown that a simple filtering technique is effective for enumerating effective combination features within the grafting algorithm, even for large-scale problems. Experimental results show that an L1-regularized logistic regression model with combination features can achieve comparable or better results than those from other methods, and its result is very compact and easy to interpret.
5 SVM with a polynomial kernel did not achieve a significant improvement.
Part II
Learning with Massive Number of
Outputs
Chapter 5
Discriminative Language Models with
Pseudo-Negative Examples
Language models (LM) are fundamental tools for many applications, such as speech
recognition, machine translation, spelling correction, etc. This chapter and the following
chapter present LMs for different tasks. This chapter presents a discriminative language model with pseudo-negative samples (DLM-PN), which directly classifies a given sentence as correct or incorrect, while the next chapter presents a hierarchical exponential model (HEM) that predicts the next word given a context, together with its probability. These LMs are suited to different tasks: DLM-PN can use features of the whole sentence (e.g. the sentence length), which HEM cannot; on the other hand, HEM provides probabilistic information and supports efficient inference of the most probable word.
5.1 Language Modeling
Language modeling (LM) aims at determining whether a given sentence is correct or not. For example, in machine translation an input sentence is translated into several candidate sentences in the target language, and LMs choose the best sentence among them by comparing their fluency. In particular, in statistical machine translation [Brown et al., 1990], given a sentence E in the original language, a machine translation system finds the best sentence F̂ in the target language as

F̂ = arg max_F P(F|E) = arg max_F P(E|F) P(F).    (5.1)

This problem is thus decomposed into the translation problem P(E|F) and the target-language-dependent problem P(F). The language model solves the latter problem.
Among several types of language models, probabilistic language models (PLMs) are widely used; they estimate the probability of word sequences or sentences. An example of such a PLM is the N-gram language model (NLM).
However, PLMs cannot determine by themselves whether a sentence is correct or not, because the probability depends on the length of the sentence and the global frequency of each word. For example, p(S1) < p(S2), where p(S) is the probability of a sentence S given by a PLM, does not always mean that S2 is more correct or plausible; this can happen simply because S2 is shorter than S1, or because S2 contains more common words than S1. Another problem is that PLMs cannot easily include overlapping or non-local information, which is important for classifying sentences more finely.
These problems have not been discussed much before, because LMs have been used in applications where they select the best sentence among candidates whose lengths and word tendencies are similar. However, the problem becomes apparent when the sentences to be compared are not similar. For example, such language models cannot identify incorrect sentences among sentences written by non-native writers. These problems will become more pressing as LMs are increasingly used in language generation applications.
It is therefore reasonable to consider that most sentences can be classified as correct or incorrect in terms of grammar, pragmatics, and plausibility, and that a sentence which is, say, 70% correct is very rare or does not exist. It would therefore be better to treat language modeling as binary classification in some applications.
Discriminative language models (DLMs) have been proposed to classify sentences into correct and incorrect ones. DLMs [Gao et al., 2005, Roark et al., 2007] can include both non-local and overlapping information directly. However, the DLMs in previous studies assume specific applications. For example, the training examples are candidate sentences generated by another application, together with one correct sentence, and the parameters are estimated by minimizing the sample risk, i.e. the error in choosing the best sentence among candidates that may include plausible and even correct sentences. Therefore, the resulting model cannot be used for other applications. More importantly, the amount of training data in previous DLMs is limited, unlike the effectively unlimited data available to PLMs. If we had unlimited negative examples, the models could be trained directly to discriminate correct sentences from incorrect
sentences.
In this chapter, I propose a general-purpose DLM that, like PLMs, is not tied to a specific application. To achieve this goal, I need to solve the following two problems. The first is that we cannot obtain negative examples (incorrect sentences). The second is the prohibitive computational cost, because the numbers of features and examples are very large. In previous studies this problem was not obvious, because the amount of training data was limited and combinations of features were not used, so the computational cost remained tractable.
For the first problem, I propose to sample incorrect sentences from a probabilistic language model and then train a model to discriminate between correct and incorrect sentences. I call these examples Pseudo-Negative because they are not real negative sentences, and I call the method DLM-PN (DLM with Pseudo-Negative samples).
For the second problem, I apply online max-margin learning with fast kernel computation. I will show that a non-linear model is essential for discriminating between correct and incorrect sentences. I also estimate latent information of sentences using a semi-Markov class model and extract features from it. Although the number of latent features is substantially smaller than that of explicit features such as words or phrases, the latent features capture essential information for sentence classification.
Experimental results show that these pseudo-negative samples can indeed be regarded as incorrect examples, and that the proposed learning method is able to discriminate between correct and incorrect sentences. I also show that DLM-PN can correctly classify sentences that cannot be classified by N-gram models, syntactic parsers, or non-native speakers.
5.2 Previous Language Models
I now explain several previous language models and their characteristics.
Table 5.1 compares the previous language models and the proposed language models. ME denotes a maximum entropy language model, and WSME denotes a whole-sentence maximum entropy language model. P(w) denotes the operation that returns the probability of a word w, and arg max_w P(w) the operation that returns the most probable word. |W| denotes the number of distinct candidate words.
Table 5.1: A comparison of language models.

Language Models                 Training    Complex   Time of      Time of
                                Efficiency  Features  P(w)         arg max_w P(w)
N-gram                          XX          -         O(1)         O(1) ~ O(|W|)
ME [Rosenfeld, 1994]            X           X         O(|W|)       O(|W|)
WSME [Rosenfeld et al., 2001]   -           XX        -            -
DLM-PN (Section 5)              X           XX        -            -
HEM (Section 6)                 X           X         O(log |W|)   O(log |W|)
In this table, LMs are characterized by the following points: (1) training efficiency: whether the training step is efficient and scalable; (2) complex features: whether the model can use complex features such as suffixes/prefixes of previous words and long-distance information; (3) time of P(w): the computational cost of computing P(w); and (4) time of arg max_w P(w): the computational cost of computing arg max_w P(w).
5.2.1 N-gram Language Model
N-gram language models (NLMs), which estimate the probability of a sentence, are the most widely used language models in many applications.
Given a sentence S of t words, S := w_1^t,1 its probability P(S) is decomposed by the chain rule into probabilities of each word conditioned on the preceding words,

P(S) = P(w_1^t) = Π_{i=1...t} P(wi | w_1^{i−1}).

These parameters P(wi | w_1^{i−1}) can be estimated by maximum likelihood estimation as

P′(wi | w_1^{i−1}) := C(w_1^i) / C(w_1^{i−1}),    (5.2)

where C(w_1^j) is the frequency of w_1^j in the training corpus. However, since the size of the training corpus is finite, we cannot estimate these parameters accurately.
1 w_i^j denotes the subsequence of words from the i-th word to the j-th word inclusive, wi, wi+1, . . . , wj.
NLMs approximate each probability by conditioning only on the preceding N − 1 words:

P_NLM(S) := Π_{i=1...t} P(wi | w_{i−N+1}^{i−1}).    (5.3)
Since the number of parameters in an NLM is still large, several smoothing methods are applied to produce more accurate probabilities and to assign nonzero probabilities to any word string. I explain additive smoothing and interpolated smoothing here; other, more sophisticated smoothing methods can be found in [Rice, 1998]. Additive smoothing is the simplest smoothing [Lidstone, 1920]:

P(wi | w_{i−N+1}^{i−1}) = (C(w_{i−N+1}^{i}) + σ) / (C(w_{i−N+1}^{i−1}) + σ|W|),    (5.4)

where σ is an additive parameter, usually σ = 1 or σ = 1/2, and |W| is the number of distinct words. The performance of this smoothing is very poor when N is large.
Interpolated smoothing mixes the lower-order distributions with fixed parameters as

P(wi | w_{i−N+1}^{i−1}) = Σ_{j=0...N−1} λj C(w_{i−j}^{i}) / C(w_{i−j}^{i−1}),    (5.5)

where Σ_j λj = 1. The parameters λj (j = 0 . . . N − 1) are estimated by using held-out training data.
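A minimal sketch of this interpolated estimate in Python (my illustration, not the thesis code), assuming n-gram counts are stored in a dictionary keyed by word tuples, with the empty tuple holding the total token count:

    def interpolated_prob(word, history, counts, lambdas):
        """Interpolated N-gram probability in the spirit of Eq. (5.5).

        counts  : dict mapping word tuples (including the empty tuple, whose
                  value is the total number of tokens) to corpus frequencies.
        lambdas : mixing weights [lambda_0, ..., lambda_{N-1}] summing to 1;
                  lambda_j weights the (j+1)-gram estimate.
        """
        prob = 0.0
        for j, lam in enumerate(lambdas):
            context = tuple(history[-j:]) if j > 0 else ()   # last j words of the history
            num = counts.get(context + (word,), 0)
            den = counts.get(context, 0)
            if den > 0:
                prob += lam * num / den
        return prob

    # Tiny usage example with made-up counts (bigram model, lambdas = [0.3, 0.7])
    counts = {(): 5, ("i",): 2, ("am",): 1, ("i", "am"): 1, ("fine",): 1, ("am", "fine"): 1}
    print(interpolated_prob("fine", ("i", "am"), counts, [0.3, 0.7]))   # ~0.76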
NLMs are widely used in many applications because of their simplicity and efficiency. However, several drawbacks of NLMs have been reported, and I describe those related to my work.
The first is that the probabilities given by NLMs strongly depend on the length of the sentence: P(w_1^t) ∝ c^t for some constant c, so two sentences of different lengths cannot be compared directly. For example, the probability of an incorrect sentence “I are fine” is much higher than the probability of a correct sentence “I am very fine today”.
The second is that NLMs cannot include overlapping or non-local information; an NLM may, for instance, give a high probability to a sentence without a verb. To overcome this, factored language models [Bilmes and Kirchhoff, 2003] decompose the conditioning context into finer elements to handle detailed features such as the suffix of a word. Although such improvements over NLMs have been proposed, the modifications are limited and more detailed features cannot be included easily.
5.2.2 Topic-based Language Models
In earlier language models, correlated information was handled in a naive way, for example by a trigger model [Tillmann and Ney, 1996]. Recently, novel language models have been presented in which correlated word information can be considered; for example, such a model can capture the information that “nurse” is likely to appear in a sentence containing “doctor” or “hospital”. Many topic-based language models capture topic information by changing or choosing the probability distribution.
Probabilistic latent semantic indexing (PLSI) [Gildea and Hofmann, 1999] assigns a probability to a sentence or document as follows,

P(w_1^t) = Π_{w=1}^{t} Σ_{c=1}^{C} λc φ_{c,w},    (5.6)

where λc is the topic probability and φ_{c,w} is the word probability depending on the topic. In this model, a topic is chosen for each word according to λc.
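For illustration only (the parameter layout is an assumption of mine), the mixture in (5.6) can be computed as follows:

    def plsi_sentence_prob(words, topic_prior, word_given_topic):
        """Sentence probability under a PLSI-style mixture, in the spirit of Eq. (5.6).

        words            : list of word tokens
        topic_prior      : list, topic_prior[c] = lambda_c (sums to 1)
        word_given_topic : list of dicts, word_given_topic[c][w] = phi_{c,w}
        """
        prob = 1.0
        for w in words:
            # each word is generated by first choosing a topic, then the word
            prob *= sum(lam * word_given_topic[c].get(w, 0.0)
                        for c, lam in enumerate(topic_prior))
        return prob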
Latent Dirichlet allocation (LDA) [Blei et al., 2003] is an extension of PLSI, which assigns a probability to a sentence or document as follows,

P(w_1^t) = ∫ P_Dir(λ; α) Π_{w=1}^{t} Σ_{c=1}^{C} λc φ_{c,w} dλ,    (5.7)

where P_Dir(λ; α) = Z(α) Π_{c=1}^{C} λc^{αc−1} is a Dirichlet distribution, used for defining a distribution over distributions. LDA determines the topic distribution λ stochastically.
However, topic-based language models cannot easily deal with syntactic or sentence-level information, because of the difficulty of combining syntactic and topic information.
5.2.3 Maximum Entropy Language Models
A maximum entropy language model, or exponential language model [Rosenfeld, 1994], overcomes these problems in that it can use any type of feature. This model is equivalent to a multi-class logistic regression model. Given context information h and a next word c, we define a feature vector ϕ(c, h) ∈ R^m, each dimension of which corresponds to a feature function f(c, h) such as f(c, h) = I(c = “Tokyo” and h = “I live in”). Then,
we define the probability of a word c given a context h as

P(c|h) = (1/Z(h)) exp(w^T ϕ(c, h)),    (5.8)

where Z(h) = Σ_c exp(w^T ϕ(c, h)) is the normalization term and w ∈ R^m is the weight vector. An NLM is a special case of ME in which the feature functions correspond to the occurrences of N-grams.
We estimate the parameters of ME by maximum likelihood estimation. Let (ci, hi) (i = 1, · · · , N) be the examples appearing in the corpus. Then we solve the convex optimization problem

w* = arg min_w − Σ_i log P(ci|hi) + C · R(w),    (5.9)

where R(w) is the regularization term and C > 0 is the parameter that controls the trade-off between the likelihood term and the regularization term. We estimate C by cross-validation. For the regularization term, L2^2 regularization R(w) := Σ_i wi^2, L1 regularization R(w) := Σ_i |wi|, and their combination Σ_i wi^2 + C′ Σ_i |wi| (where C′ is the trade-off parameter between L1 and L2^2 regularization) are often used.
The problem of ME is its large training cost. Even at inference time, it requires a computational cost proportional to the number of candidate words.
5.2.4 Whole Sentence Maximum Entropy Model
A whole-sentence maximum entropy model (WSME) [Rosenfeld et al., 2001] assigns a probability to a sentence with a single exponential model; therefore it does not require calculating the normalization term Z(h) for each word. Given a feature vector ϕ(S) of a sentence S, WSME assigns a probability to S using a logistic regression model as follows,

p(S) = (1/Z) · p0(S) · exp(w^T ϕ(S)),    (5.10)

where Z is a normalization constant, p0(S) is an initial distribution of S obtained by another PLM such as an N-gram model, w are the parameters of the model, and ϕ(S) are arbitrary computable properties, or features, of the sentence S. For example, f1234(S) := “1 if S has you at the front of the sentence, and 0 otherwise”.
To estimate w, we again employ maximum likelihood estimation. In this estimation we need to compute the expected value of each feature ϕ(S)i under the current model. Since it is impossible to enumerate all possible sentences, sentences are sampled according to the distribution given by the current parameters, and the parameters are then updated using these samples.
The problem of WSME is its large computational cost at training; in particular, it requires sampling of sentences. Another problem is that it cannot assign a probability to each word, and therefore WSME can only be used for re-ranking sentences. Moreover, [Rosenfeld et al., 2001] reported that the improvement over previous language models is modest if only simple features are used.
5.2.5 Discriminative Language Models
In another direction, discriminative language models directly determine whether a given sentence is correct or incorrect. For example, in speech recognition we are given a set of candidate sentences, and the correct (or better) sentences are known. In this case, we can directly learn a model that distinguishes correct sentences from incorrect ones [Roark et al., 2007, Gao et al., 2005].
Formally, a discriminative language model (DLM) assigns a score f(S) ∈ R to a sentence S, measuring the correctness of the sentence in terms of grammar and pragmatics, so that f(S) > 0 implies S is correct and f(S) < 0 implies S is incorrect. PLMs can be considered a kind of DLM if f(S) := f′(P(S)), where f′ is a monotonically increasing function. For example, we can use f(S) = P(S)/|S| − c, where c is some threshold and |S| is the length of S.
In this thesis, I consider the case where the function is a linear function, as in many previous studies [Roark et al., 2007, Gao et al., 2005]. Given a sentence S, we compute a feature vector ϕ(S) from it using a predefined set of feature functions {ϕj}_{j=1}^{m}. The form of the function f we use is

f(S) = w^T ϕ(S),    (5.11)

where w is a feature weight vector.
Since there is no restriction on the design of ϕ(S), DLMs can employ any overlapping or non-local information in S. I estimate w using training samples {(Si, yi)} for i = 1 . . . t, where yi = 1 if Si is correct and yi = −1 if Si is incorrect.
However, it is hard to obtain incorrect sentences, because only correct sentences are available from a corpus. This was not a problem in previous studies, because they assume specific applications and can therefore obtain real negative examples easily. For example, Roark [2007] proposed a discriminative language model in which the model is trained using sets of outputs produced by a speech recognition system. The difference between their approach and ours is that we do not assume any particular application. Moreover, they always have a group consisting of one correct sentence and many incorrect sentences, which are very similar to each other because they are generated from the same input. In contrast, our framework does not assume any such groups in the training data and treats correct and incorrect examples independently during training.
5.3 Discriminative Language Model with Pseudo-Negative samples
I propose a novel discriminative language model, the Discriminative Language Model with Pseudo-Negative samples (DLM-PN). In this model, pseudo-negative examples are sampled from a PLM, and all of them are assumed to be incorrect.
First, a probabilistic language model is built from the training data, and then examples, almost all of which are incorrect, are sampled independently from it. The DLM is trained using correct sentences from the corpus and negative examples from this pseudo-negative generator.
An advantage of sampling is that as many negative examples as correct ones can be collected, and that the difference between truly correct sentences and incorrect sentences that are correct only in a local sense can be clarified.
For sampling, any PLM can be used as long as the model supports a sentence sampling procedure. In particular, I used an NLM with interpolated smoothing because it supports efficient sentence sampling. Algorithm 6 describes the sampling procedure.
Since the focus is on discriminating between correct sentences from the corpus and incorrect sentences sampled from the N-gram model, DLM-PN may not be able to identify incorrect sentences that would not be generated by an N-gram model. However, this is not a serious problem, because such sentences can be filtered out by NLMs even if they exist.
Algorithm 6 Sampling procedure for pseudo-negative examples from an NLM.
for i = 1, 2, . . . do
  Sample wi according to p(wi | wi−N+1, . . . , wi−1)
  if wi is the end-of-sentence symbol (EOS) then
    break
  end if
end for
Output: w1, w2, . . ., EOS
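A minimal sketch of this sampler (my illustration; it assumes N ≥ 2 and that the conditional distributions are given as a dict from context tuples, padded with "<s>" at the sentence start, to lists of (word, probability) pairs):

    import random

    def sample_sentence(cond_dist, n, eos="</s>", max_len=100):
        """Sample one pseudo-negative sentence from an N-gram model (cf. Algorithm 6)."""
        words = ["<s>"] * (n - 1)
        while len(words) < max_len:
            context = tuple(words[-(n - 1):])
            cands, probs = zip(*cond_dist[context])
            w = random.choices(cands, weights=probs, k=1)[0]   # sample the next word
            if w == eos:
                break
            words.append(w)
        return words[n - 1:]          # strip the <s> padding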
We know of no program , and animated discussions about
prospects for trade barriers or regulations on the rules
of the game as a whole , and elements of decoration of
this peanut-shaped to priorities tasks across both
target countries
Figure 5.1: Example of a sentence sampled by PLMs.
Although DLM-PN can be trained with any binary classification method, the number of training examples is very large, and batch training suffers from prohibitively large computational cost in terms of time and memory. I therefore used the passive-aggressive algorithm, an online learning algorithm that requires a much smaller computational cost (Section 2.6).
5.4 Fast Kernel Computation
In DLMs, correlation information, i.e. combinations of features, is important for capturing non-local information. If the kernel trick is applied to online max-margin learning, a subset of the observed examples, called the active set, needs to be stored. In contrast to the support set in SVMs, an example is added to the active set every time the online algorithm makes a prediction mistake or its confidence in a prediction is too low. Therefore the active set becomes significantly large, and the total computational cost becomes quadratic in the number of training examples. Since the number of training examples is very large, the computational cost is prohibitive even with the kernel trick.
[Figure 5.2 shows the overall framework: a probabilistic LM (e.g. an N-gram LM) is built from the corpus; sentences sampled from it are used as (pseudo-)negative examples and corpus sentences as positive examples; a binary classifier is trained on these examples and returns a positive/negative label or a score (margin) for test sentences.]
Figure 5.2: Framework of our classification process.
The inner product between two examples can be computed by intersecting the sets of fired features in each example. This is similar to a merge in merge sort and can be done in O(m) time, where m is the average number of fired features in an example. When the number of examples in the active set is a, the total computational cost is O(m · a) time. For kernel computation, PKI (Polynomial Kernel Inverted) has been proposed [Kudo and Matsumoto, 2003]. PKI is an extension of the inverted index used in information retrieval: in the online setting, we maintain, for each feature, the list of active-set examples in which it fires. Algorithm 7 shows pseudocode for PKI.
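For concreteness, a small Python sketch of this inverted-index kernel evaluation (my illustration; the polynomial-kernel parameters c and d and the data layout are assumptions):

    from collections import defaultdict

    def pki_score(x, inverted, alpha, c=1.0, d=3):
        """Score a sparse example x against the active set with a polynomial kernel.

        x        : dict feature_id -> value for the new example
        inverted : dict feature_id -> list of (example_id, value) pairs, i.e. the
                   inverted index over active-set examples
        alpha    : dict example_id -> learned coefficient alpha_j
        Returns sum_j alpha_j * (<x, x_j> + c)^d over active examples sharing a feature."""
        dot = defaultdict(float)                   # C[j]: inner product with example j
        for i, xi in x.items():
            for j, hij in inverted.get(i, ()):     # only examples where feature i fires
                dot[j] += xi * hij
        return sum(alpha[j] * (cj + c) ** d for j, cj in dot.items())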
5.5 Latent Features by Semi-Markov Class Model
Another problem of DLMs is that the number of features becomes very large, since all possible N-grams are used as features. In particular, the memory requirement is a serious problem, because many active-set examples with many features have to be stored, not only at training time but also at inference time. To deal with this, filtering out low-confidence features
Algorithm 7 Pseudocode for PKI.
Input: x
C: an array storing the inner products between x and each active-set example
for i ∈ {i | xi ≠ 0} do
  for j ∈ {j | h(i)j ≠ 0} do
    C[j] := C[j] + xi h(i)j
  end for
end for
r := 0
for j ∈ {j | C[j] ≠ 0} do
  r := r + αj (C[j] + c)^d
end for
Output: r
would be effective, but it is difficult to decide which features are important during online learning. Therefore, similar N-grams are instead clustered to reduce the number of features, by using a semi-Markov class model.
5.5.1 Class Model
The class model was originally proposed by [Martin et al., 1998]. In the class model, deterministic word-to-class mappings are estimated, keeping the number of classes much smaller than the number of distinct words. The class model was extended into a semi-Markov class model (SMCM), a part of which was proposed by [Deligne and Bimbot, 1995]. I generalize it as a class model in which a word sequence is partitioned into a variable-length chunk sequence, and each chunk is clustered into a class depending on the adjacent chunks. As an example, I explain the use of a bi-gram class model. The probability of a sentence w1, . . . , wt in a bi-gram class model is calculated by

P(w1, . . . , wt) = Π_i P(wi+1 | ci+1) P(ci+1 | ci).    (5.12)
On the other hand, the probability in a bi-gram semi-Markov class model is calculated by

P(w1, . . . , wt) = Σ_s Π_i P(ci | ci−1) · P(w_{s(i), s(i)+1, ..., e(i)} | ci),    (5.13)
where s ranges over all possible partitions of the sentence, s(i) denotes the beginning position and e(i) the end position of the i-th chunk in partition s, and s(i + 1) = e(i) + 1 for all i. Note that each word and variable-length chunk belongs to only one class, unlike in a hidden Markov model, where a word can belong to several classes. Using a training corpus, the mapping from chunks to classes is obtained by maximum likelihood estimation. The log-likelihood of the training corpus (w1, . . . , wn) in a bi-gram class model can be calculated as
log Π_i P(wi+1 | wi)
  = Σ_i log P(wi+1 | ci+1) P(ci+1 | ci)
  = Σ_i log [ C(wi+1, ci+1)/C(ci+1) · C(ci+1, ci)/C(ci) ]
  = Σ_{c1,c2} C(c1, c2) log [ C(c1, c2) / (C(c1) C(c2)) ] + Σ_w C(w) log C(w).    (5.14)
In (5.14), only the first term is used, since the second term does not depend on the class allocation. The allocation problem is solved by an exchange algorithm as follows: for each word, we move it to the class in which the log-likelihood is maximized, and we continue this until no movement occurs. A naive implementation of this exchange algorithm scales quadratically with the number of classes, since each time a word is moved to another class, all class bi-gram counts are potentially affected. However, by only considering those counts that actually change, the algorithm can be made to scale somewhere between linearly and quadratically with the number of classes [Martin et al., 1998]. I explain the details of the implementation that relate to my improvement; other details can be found in [Martin et al., 1998].
In this algorithm we keep the following data structures, where frequency means frequency in the training corpus.
• wuni: wuni[i] keeps the frequency of wi.
• wbi: wbi[i][j] keeps the frequency of wi wj.
• cmap: cmap[i] keeps the class assigned to word wi.
• cw: cw[i][j] keeps the frequency of ci followed by wj.
• wc: wc[i][j] keeps the frequency of wi followed by cj.
• cbi: cbi[i][j] keeps the frequency of ci cj.
• cuni: cuni[i] keeps the frequency of ci.
Recall that the first term of (5.14), and thus the quality of the current class allocation, can be calculated using cbi and cuni.
We then update the class bi-gram counts after moving a word w from its current class cold to a class cnew as follows2. For c ∉ {cold, cnew},

cbi[c][cold] −= cw[c][w]
cbi[cold][c] −= wc[w][c]
cbi[c][cnew] += cw[c][w]
cbi[cnew][c] += wc[w][c],

and for cold and cnew themselves,

cbi[cold][cold] += −cw[cold][w] − wc[w][cold] + wbi[w][w]
cbi[cold][cnew] += +cw[cold][w] − wc[w][cnew] − wbi[w][w]
cbi[cnew][cold] += −cw[cnew][w] + wc[w][cold] − wbi[w][w]
cbi[cnew][cnew] += +cw[cnew][w] + wc[w][cnew] + wbi[w][w].

Using these updates, we can check the difference in log-likelihood for one trial move in O(|C|) time, and thus one iteration can be performed in O(|C|^2 |W|) time.
2 a += b denotes a := a + b, and a −= b denotes a := a − b.
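As a rough illustration of the exchange algorithm's objective (my sketch, not the thesis implementation, which updates the counts incrementally as above), the first term of (5.14) can be evaluated directly from class bi-gram and uni-gram counts:

    import math
    from collections import Counter

    def class_bigram_objective(words, cmap):
        """First term of Eq. (5.14): sum over class bigrams of
        C(c1,c2) * log( C(c1,c2) / (C(c1) * C(c2)) ), given a word-to-class map."""
        classes = [cmap[w] for w in words]
        cuni = Counter(classes)
        cbi = Counter(zip(classes, classes[1:]))
        return sum(n * math.log(n / (cuni[c1] * cuni[c2]))
                   for (c1, c2), n in cbi.items())

A naive exchange step then tries moving each word to every class, keeps the move with the best objective, and stops when no move improves it; the incremental cbi updates above reduce the cost of each trial from recounting the corpus to O(|C|).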
5.6 Improvement of Exchange Algorithm with Filters and Bottom-up Clustering
Although the exchange algorithm is efficient enough for a class model, it is not suitable for a semi-Markov class model, because the number of chunks is much larger than the number of words. I therefore propose two improvements to the exchange algorithm. The first approximates the computation in the exchange algorithm, and the second is a bottom-up clustering that improves convergence.
The first technique is an approximation of the difference in log-likelihood before and after moving a word to another class. This difference becomes small for almost all trial moves in later iterations, so by using an approximated value we can filter out many useless trials. For each word w, t classes are sampled from cw and wc according to their frequencies, and two arrays cw′ and wc′ of length t are built. Using these arrays, an approximate value of the log-likelihood difference is computed, and the original exchange step is applied only if the approximated value is larger than a pre-defined threshold. Since the two arrays are built only once before each word's iteration, the cost of building them is negligible.
The second technique concerns memory rather than time complexity. Since the count matrices can become very large (for example, the number of chunks in the experiments is about 3 million and the number of classes is 500), instead of clustering chunks directly into the pre-defined number of clusters, we first cluster them into 2 classes and then cluster each of these two classes independently into 2 again, giving 4 classes in total. This procedure is applied recursively until the number of classes reaches the pre-defined number.
5.6.1 Semi-Markov Class Model
In SMCM, we need to decide not only the chunk-to-class mapping but also the word-to-chunk partition of each sentence. I employ Viterbi decoding to estimate the partition of a sentence, then treat the resulting chunks as words in a class model and apply the exchange algorithm. These two steps are iterated until the change in the objective value becomes smaller than a pre-defined threshold.
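The decoding step can be sketched as a semi-Markov Viterbi search over chunk boundaries. This is an illustration under simplifying assumptions of my own: the chunk-to-class map is fixed, start transitions are stored under the previous class None, log_p_emit is defined for every chunk in cmap, and every sentence is assumed to be segmentable into known chunks.

    import math

    def smcm_viterbi(words, cmap, log_p_trans, log_p_emit, max_chunk_len=4):
        """Semi-Markov Viterbi segmentation sketch for the SMCM decoding step.

        cmap        : dict chunk (tuple of words) -> class id (deterministic mapping)
        log_p_trans : dict (prev_class, class) -> log P(class | prev_class)
        log_p_emit  : dict chunk -> log P(chunk | its class)
        Returns the best-scoring partition of `words` into known chunks."""
        n = len(words)
        # best[j] maps the class of the last chunk -> (score, backpointer)
        best = [dict() for _ in range(n + 1)]
        best[0][None] = (0.0, None)
        for j in range(1, n + 1):
            for l in range(1, min(max_chunk_len, j) + 1):
                chunk = tuple(words[j - l:j])
                if chunk not in cmap:
                    continue
                c = cmap[chunk]
                for prev_c, (score, _) in best[j - l].items():
                    s = score + log_p_trans.get((prev_c, c), -math.inf) + log_p_emit[chunk]
                    if c not in best[j] or s > best[j][c][0]:
                        best[j][c] = (s, (j - l, prev_c, chunk))
        # backtrace from the best final state
        c = max(best[n], key=lambda k: best[n][k][0])
        chunks, j = [], n
        while j > 0:
            _, (i, prev_c, chunk) = best[j][c]
            chunks.append(chunk)
            j, c = i, prev_c
        return list(reversed(chunks))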
Figure 5.3: Example of assignment in SMCM. A sentence is partitioned into variable-length chunks and each chunk is assigned a unique class.
Figure 5.3 shows an example of the semi-Markov class model. In this example, the first class c1 corresponds to a chunk of length 2 (w1 w2).
5.7 Experiments
5.7.1 Experimental Setup
I partitioned the BNC corpus into model-train, DLM-train-positive, and DLM-test sets. The number of sentences is 4500 × 10^3 in model-train, 500 × 10^3 in DLM-train-positive, and 10 × 10^3 in DLM-test. An NLM was built from model-train, and pseudo-negative examples were sampled from it. The numbers of positive and pseudo-negative examples are equal3. I mixed the sentences from DLM-train-positive with the pseudo-negative examples and shuffled their order to build DLM-train. I call the sentences from DLM-train-positive “positive” examples and the sentences from the pseudo-negative generator “negative” examples. I eliminated sentences of fewer than 5 words from these corpora, because it is difficult to decide whether such sentences are correct or not (e.g. compound words). Next, semi-Markov classes were extracted using model-train; the number of extracted chunks is 2.76 × 10^6.
3 The expected length of the pseudo-negative examples is the same as that of the positive examples.
5.7.2 Experiments on Pseudo-Samples
I examined the properties of the pseudo-negative samples to justify our framework. I sampled 100 sentences from DLM-train, and a native English speaker was asked to assign correct/incorrect labels to them4. All positive examples were labeled as correct, and all negative examples except one were labeled as incorrect. From this result, I can say that the sampling method is able to generate incorrect sentences, and that if a classifier can discriminate these samples, it can discriminate correct from incorrect sentences. Note that the annotator took an average of 25 seconds per sentence, which suggests that it is difficult even for a human to identify incorrect sentences.
I also examined whether correct and incorrect sentences can be discriminated by syntactic parsers; if so, parsing could be used as a classification tool. I applied a phrase structure parser [Charniak and Johnson, 2005] and an HPSG parser [Yusuke and Tsujii, 2005] to the 100 sentences. All sentences were parsed successfully except one positive example. This result indicates that the difference between correct sentences and pseudo-negative examples does not lie in syntactic errors.
5.7.3 Experiments on DLM-PN
I then examined the effect of each feature type in the DLM. For words and part-of-speech (POS) tags, I used trigram features; from the SMCM mapping function, I used class bi-grams as features. I used DLM-train as the training set. In all experiments, the hyper-parameter of the classifier was set to C = 50.0 (Section 5.3), and in all results with kernels, PKI was used to compute the kernel values. Table 5.2 shows the accuracy results with different features; the rows SMCM (|C| = 100) and SMCM (|C| = 500) show the accuracy when the number of SMCM classes is 100 and 500, respectively. The results show that the kernel is indeed important for achieving high performance. Note that the performance of SMCM features is comparable to that of word features.
Table 5.3 shows the number of features for each method. Note that a new feature is added only when the classifier needs to update its parameters, so these numbers are smaller than the total number of candidate features. For example, the number of possible
4 Since the PLM is also built from the BNC corpus, the samples cannot be classified simply by the tendency of their word content.
Table 5.2: Performance of language models on the evaluation data.

                              Accuracy (%)   Training time (s)
Linear classifier
  word                        51.28          137.1
  POS                         52.64          85.0
  SMCM (|C| = 100)            51.79          304.9
  SMCM (|C| = 500)            54.45          422.1
3rd-order polynomial kernel
  word                        73.65          20143.7
  POS                         66.58          29622.9
  SMCM (|C| = 100)            67.11          37181.6
  SMCM (|C| = 500)            74.11          34474.7
Table 5.3: The number of features of the DLM.

                     # of distinct features
word tri-gram        15773230
POS tri-gram         35376
SMCM (|C| = 100)     9335
SMCM (|C| = 500)     199745
distinct features in SMCM (|C| = 100) is 10000 = 100 · 100. These results show that SMCM achieves high performance with very few features.
I then examined the effect of PKI (the inverted index method), using SMCM bi-gram features and a third-order polynomial kernel. Table 5.4 shows the results. In this experiment, 200 × 10^3 sentences from DLM-train were used in both settings, because training on all the training data would have required more time than was available for our experimental setup. The results indicate that the index greatly reduces the computational cost, and that inference can be performed in less than 0.1 seconds, which is reasonable for
Table 5.4: Comparison of classification performance with/without the PKI index.

             Training time (s)   Inference time (ms)
Baseline     37665.5             370.6
with Index   4664.9              47.8
real world applications.
Finally, I examined learning curves to measure the effect of the amount of training data on performance. Figure 5.5 shows the results of the classification task using SMCM bi-gram features. The results suggest that the performance can still be improved by increasing the training data.
Figure 5.4 shows the margin distribution for correct sentences and pseudo-negative examples using SMCM bi-gram features. Although many examples are close to the decision boundary (margin = 0), positive and negative examples are distributed on the > 0 and < 0 sides, respectively. Therefore, higher recall or precision can be achieved by setting a margin threshold other than 0.
5.8 Discussion
More sophisticated sampling techniques are promising; for example, sentences could be sampled from a whole-sentence maximum entropy model [Rosenfeld et al., 2001], which is future work.
The results without kernels indicate that a non-linear model is important for discriminating correct from incorrect sentences. Therefore, it would be helpful to use such feature combinations in PLMs as well; recent successes of topic-based language models [Blei et al., 2003, Wang et al., 2005] also indicate the importance of this phenomenon.
Contrastive estimation [Smith and Eisner, 2005] is similar to our approach in that it creates pseudo-negative examples. It builds a neighborhood of each input example, for example by changing or deleting one word, represents it as a lattice, and then estimates parameters efficiently for unsupervised learning. In contrast, I generate independent pseudo-negative examples to make discriminative training possible.
Figure 5.4: Margin distribution using SMCM bi-gram features.
5.9 Conclusion
In this chapter I have presented a novel discriminative language model using pseudo-negative examples. I have also shown that an online max-margin learning method enables us to handle one million sentences, achieving about 75% accuracy in the task of discriminating positive from pseudo-negative examples. The experiments indicate that pseudo-negative examples are incorrect and yet close to correct sentences, since parsers cannot distinguish them. The experimental results also show that combinations of features are effective for discriminating correct from incorrect sentences, a point that has not been discussed for probabilistic language models.
To scale up the problem in terms of the number of examples and features, more refined kernel-based learning methods such as [Cheng et al., 2006] and [Dekel et al., 2005] would be required. Another interesting direction is to handle the probabilistic language model directly, without sampling, and to learn the discriminative model more efficiently and accurately.
Figure 5.5: A learning curve for SMCM (|C| = 500). The accuracy is the performance on the evaluation set.
I am also interested in applications of this model, not only to machine translation and speech recognition, but also to identifying incorrect sentences written by non-native speakers, as an extended version of a spelling correction tool.
Chapter 6
Hierarchical Exponential Models for Problems with Many Classes
This chapter presents another novel language model, called a Hierarchical Exponential Model (HEM). In language modeling, existing multi-class classifiers cannot be used directly because the number of candidate outputs is very large, and they therefore require impractical computational cost at training and inference time. A HEM reduces this cost by structuring the search space hierarchically.
6.1 Problems of Previous Language Models
In this chapter, we consider the task of predicting the next word c given context information h. Examples of context information are the previous words and long-range information (e.g., some word x appeared earlier in the context). Moreover, we also estimate the conditional probability p(c|h), which is useful for other applications.
We often estimate p(c|h) by maximum likelihood estimation using a training corpus,

p_MLE(c|h) = C(c, h) / C(h),    (6.1)

where C(c, h) is the number of times c appears in the context h in the training corpus, and C(h) is the number of times the context h appears in the training corpus. However, this estimate is very unstable even if we use a very large corpus. For example, we cannot estimate the probability at all if the context h is unseen in the training data. Therefore, some smoothing method should be applied to estimate p(c|h).
An N-gram language model (NLM) approximates this probability by letting h correspond to the preceding N − 1 words w only:

p_NLM(c|h) = C(w, c) / C(w),    (6.2)

where C(w, c) is the frequency of the word sequence w, c in the training corpus, and C(w) is the frequency of the word sequence w in the training corpus.
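As an illustration of Equation (6.2), the following minimal sketch (Python; the toy corpus is hypothetical) estimates N-gram probabilities directly from counts collected over a training corpus.

```python
from collections import Counter

def ngram_mle(corpus, n=3):
    """Estimate p_NLM(c | w) = C(w, c) / C(w), where w is the
    preceding n-1 words (Equation 6.2)."""
    context_counts, event_counts = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] * (n - 1) + sentence + ["</s>"]
        for i in range(n - 1, len(tokens)):
            w = tuple(tokens[i - n + 1:i])  # preceding n-1 words
            c = tokens[i]                   # next word
            context_counts[w] += 1
            event_counts[(w, c)] += 1

    def prob(context, word):
        w = tuple(context[-(n - 1):])
        if context_counts[w] == 0:
            return 0.0  # unseen context: this is where smoothing is needed
        return event_counts[(w, word)] / context_counts[w]

    return prob

# Toy usage with a tiny hypothetical corpus.
prob = ngram_mle([["the", "cat", "sat"], ["the", "cat", "ran"]], n=3)
print(prob(["the", "cat"], "sat"))  # 0.5
```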
This simple approximation performs very well in many applications. However, there are many other useful features for predicting the next word. For example, a suffix (or prefix) of the previous word is effective for determining the next word, and cache or trigger features (the occurrence of some word earlier in the context) are also effective.
To exploit these features freely, we can use a maximum entropy language model (ME, Section 2.3), also called an exponential language model in [Rosenfeld, 1994]. Recall that an ME is defined as

p(c|h; w) = (1 / Z(h)) exp(w^T ϕ(c, h)),    (6.3)

where ϕ(c, h) is a feature vector for the pair of input h and output c, w is the weight vector, and Z(h) = Σ_{c′} exp(w^T ϕ(c′, h)) is a normalization term, or partition function. We can use any information about the context h and the next word c in defining ϕ(c, h).
In ME, the most probable word can be found by

c* = arg max_c p(c|h; w)    (6.4)
   = arg max_c w^T ϕ(c, h),    (6.5)

since exp is a monotonically increasing function.
The problem with ME is its computational cost: it increases proportionally with the number of labels, because the computation of Z(h) requires a summation over all candidate words. Since the number of candidate words is very large, ME is impractical for language modeling. Moreover, since the context h differs at each word position, we need to recalculate Z(h) at each position, and pre-computation is difficult.
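The following minimal sketch (Python; illustrative only, with a hypothetical feature function phi and vocabulary list) shows where this cost comes from: evaluating Equation (6.3) at a single position requires summing exponentiated scores over the entire vocabulary to obtain Z(h).

```python
import numpy as np

def me_probability(w, phi, context, word, vocabulary):
    """Naive maximum entropy probability p(word | context; w) (Equation 6.3).

    The partition function Z(h) sums over every candidate word, so a single
    query costs O(|W|) feature evaluations; this is the bottleneck that the
    hierarchical model is designed to avoid."""
    scores = np.array([w @ phi(c, context) for c in vocabulary])
    z = np.exp(scores).sum()  # O(|W|) work at every word position
    return np.exp(w @ phi(word, context)) / z
```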
6.2 Hierarchical Exponential Model
I present a novel language model called a Hierarchical Exponential Model (HEM). While this model can use any type of features, as an ME does, it can efficiently infer the probability of a word. Moreover, it supports an efficient arg max operation that returns the most probable word among the candidate words.

Figure 6.1: An example of a hierarchical tree in a hierarchical exponential model
First, I give an overview of a HEM. We represent the set of candidate words as a binary tree. This tree can be estimated by any method, such as hierarchical clustering. A different binary logistic regression model is associated with each internal node in the tree; the output −1 corresponds to the left child, and the output +1 corresponds to the right child. Note that we can use any type of features in these binary logistic regression models. Then, the probability of a word w is defined as the product of the probabilities given by the binary logistic regression models on the path from the root to the leaf corresponding to w. Obviously, the sum of the probabilities of the candidate words equals 1, and therefore this binary tree assigns a probability distribution over the candidate words. By restricting the height of the tree to O(log |W|), where |W| is the number of candidate words, we can efficiently estimate the probability without enumerating all candidate words.
Figure 6.1 shows an example of a HEM when the candidate words are {a, b, c, d, e}.
A HEM is similar to a decision tree in that a word is predicted by a sequence of binary decisions. However, in a HEM, each decision is associated with a probability assigned by a binary logistic model that can use any type of features, and therefore it defines a probability distribution over the candidate words.
Let us now define a HEM formally. Let C be the set of candidate words, and let T_C be a binary tree with |C| leaves and |C| − 1 internal nodes, each leaf of which corresponds to a word in C. We call this tree a label tree. Each word c is associated with a binary code Bc, which is constructed as follows: on the path from the root to the leaf corresponding to c, we append 0 if we go to the left child and 1 if we go to the right child. An example of such a tree is a Huffman tree used in data compression. Let Bc[j] ∈ {0, 1} be the j-th bit of Bc, and let N(Bc[j]) denote the j-th internal node on the path from the root to the leaf corresponding to c.
For example, in Figure 6.1, the binary code for the word a is 011, so Ba[1] = 0, Ba[2] = 1, and Ba[3] = 1.
A feature vector for context information h is denoted as ϕ(h) ∈ R^m. Then we associate a binary logistic regression model with each internal node v:

Pv(1|h) = 1 / (1 + exp(−wv^T ϕ(h))),    (6.6)
Pv(0|h) = 1 − Pv(1|h) = 1 / (1 + exp(wv^T ϕ(h))),    (6.7)

where wv ∈ R^m is the weight vector corresponding to the internal node v.
Then, the probability for a word c is given by the product of the classification results
from the root to the leaf node corresponding to c,
P(c|h) = ∏_{j=1}^{|Bc|} P_{N(Bc[j])}(Bc[j] | h).    (6.8)
This can be viewed as follows: starting from the interval [0, 1], each internal node recursively divides its region into two disjoint sub-regions.
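To make Equation (6.8) concrete, the following minimal sketch (Python; the data layout, a list of per-node weight vectors aligned with the bits of the code, is an assumption for illustration) computes P(c|h) as a product of logistic outputs along the path from the root to the leaf.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hem_probability(binary_code, path_weights, phi_h):
    """P(c|h) = prod_j P_{N(Bc[j])}(Bc[j] | h)  (Equation 6.8).

    binary_code  -- bits Bc[1..|Bc|] of word c, e.g. [0, 1, 1]
    path_weights -- weight vector w_v of each internal node on the path
    phi_h        -- feature vector phi(h) of the context"""
    prob = 1.0
    for bit, w_v in zip(binary_code, path_weights):
        p_right = sigmoid(w_v @ phi_h)       # P_v(1|h), Equation (6.6)
        prob *= p_right if bit == 1 else 1.0 - p_right
    return prob

# Toy usage: a word with code 011 and three internal nodes on its path.
phi_h = np.array([1.0, 0.0, 1.0])
weights = [np.zeros(3), np.ones(3), np.ones(3)]
print(hem_probability([0, 1, 1], weights, phi_h))
```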
6.2.1 Learning
In a HEM, we need to decide both the structure of the label tree and the parameters of the classifiers. Although we might obtain a more accurate model by estimating the tree structure and the parameters jointly, I take a simpler approach: the tree structure is estimated first, and then the parameters of each internal node are estimated.
First, to estimate the structure of the label tree, we begin with the set of all candidate words and recursively split it into two disjoint sets so that the words in
the same set have similar context information. To achieve this, I adapted a one-side class model [Whittaker and Woodland, 2001] because it is simple and efficient.
Let r_i be the class of a word c_i. Then, in a one-side class model, the probability of a word c_i given context information h_i is defined as

P_oscm(c_i | h_i) = P(c_i | r_{i−1}).    (6.9)
That is, the probability depends only on the class of the immediately preceding word. We then split a set of words so that the likelihood of the training corpus is maximized, as follows. First, every word is randomly assigned to one of the two candidate classes. Next, for each word, we compute the change in likelihood that would result from moving the word to the other class. The word is actually moved if this change is positive. We continue this procedure until no more movements occur, and we apply the process recursively to both classes until every class contains only one word. A simplified sketch of this splitting step is given below.
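The following simplified sketch (Python; not the thesis implementation) illustrates one such splitting step. The log-likelihood of the one-side class model in Equation (6.9) is recomputed from class-to-word bigram counts after each tentative move, and the move is kept only if the likelihood increases; recomputing the likelihood from scratch is inefficient but keeps the idea clear.

```python
import math
import random
from collections import Counter

def split_words(words, corpus, iterations=10, seed=0):
    """Split `words` into two classes by exchange moves that increase the
    log-likelihood of a one-side class model P(c_i | r_{i-1})."""
    rng = random.Random(seed)
    assign = {w: rng.randint(0, 1) for w in words}  # random initial classes

    def log_likelihood(assign):
        pair, ctx = Counter(), Counter()
        for sent in corpus:
            for prev, cur in zip(sent, sent[1:]):
                if prev in assign:              # class of the preceding word
                    r = assign[prev]
                    pair[(r, cur)] += 1
                    ctx[r] += 1
        return sum(n * math.log(n / ctx[r]) for (r, _), n in pair.items())

    best = log_likelihood(assign)
    for _ in range(iterations):
        moved = False
        for w in words:
            assign[w] ^= 1                      # tentatively move w
            cand = log_likelihood(assign)
            if cand > best:
                best, moved = cand, True        # keep the move
            else:
                assign[w] ^= 1                  # undo the move
        if not moved:
            break
    return assign
```

Applying this split recursively to each of the two resulting classes, until every class contains a single word, yields the label tree.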
Given a label tree, we can estimate the weight vector of each internal node independently. Therefore, we can parallelize the estimation of these weight vectors, which is impossible in previous MEs.
6.2.2 Features
In a HEM, any information in the context can be used as a feature. For example, we can use the previous N − 1 words, as in an N-gram model. In another case, the occurrence of some word in a long context (e.g., the previous 100 words) can be used, as in a trigger language model [Tillmann and Ney, 1996]. Other examples are a suffix and a prefix of a word, an orthographic feature, and the position in the document. Note that the weights of all these features are estimated automatically.
More importantly, we can use the path information of the previous words as features. An example of such a feature is shown in Figure 6.2. For example, when the previous word is “a”, whose binary code is 011, the prefixes of this code, 0, 01, and 011, can be used as features. This can be regarded as a hierarchical class model [Martin et al., 1998], and we can expect a smoothing effect. A virtue of this feature function, compared to the original class model [Martin et al., 1998], is that the optimal number of classes need not be determined beforehand, because the weights of the prefixes are determined automatically; a small sketch of extracting these prefix features is given below.
Figure 6.2: Path information in a hierarchical tree for predicting the next word.
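The following small sketch (Python; the feature-name format is an illustrative assumption) extracts these prefix-of-path features from the binary code of the previous word.

```python
def path_prefix_features(binary_code, feature_prefix="prev_path="):
    """Turn the binary code of the previous word into hierarchical path
    features: every prefix of the code becomes one feature string."""
    code = "".join(str(b) for b in binary_code)
    return [feature_prefix + code[:k] for k in range(1, len(code) + 1)]

# The previous word "a" has code 011, so its features are the prefixes.
print(path_prefix_features([0, 1, 1]))
# ['prev_path=0', 'prev_path=01', 'prev_path=011']
```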
Another virtue of this feature function is that the prediction becomes more robust. When the previous prediction fails or an unknown word appears, previous LMs cannot use the context information and their accuracy drops. However, even if the final prediction is incorrect, some prefixes of the path information are still correct, and these prefixes remain useful for estimating the next word.
6.2.3 Inference
In a HEM, given context information h and a candidate word c, the conditional probability P(c|h) is estimated in O(log |W|) time, where |W| is the number of distinct words. This is because we only examine the classification results along the path from the root to the leaf corresponding to c, and the height of the tree is O(log |W|). This is a large improvement over MEs, which require O(|W|) time in the worst case.
Moreover, a HEM supports an efficient argmax operation, c* = arg max_c P(c|h), for a context h. Previous language models cannot support this operation efficiently; in the worst case they require O(|W|) time. This problem can be viewed as finding the maximum-weight path from the root to a leaf, where the weight of an edge is the logarithm of the corresponding probability. By applying a branch-and-bound method during the search, the argmax can be computed in O(log |W|) time in many cases, although the worst-case time is still O(|W|).
Table 6.1: Corpus statistics in HEM

                        CSJ           BNC
Number of Words         8.85 × 10^6   5.38 × 10^7
Number of Word Types    2.81 × 10^4   1.16 × 10^6
In particular, in many cases, the probability distribution over candidate words is very skewed, and a beam search is sufficient. In a beam search, we keep the top K partial paths with the highest probability; at each step we pop the most probable partial path, expand it to its children, and push the resulting candidates back into the queue, as sketched below.
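The following is a minimal sketch (Python; the dictionary-based tree representation is an assumption for illustration) of such a best-first beam search over the label tree, where partial paths are kept in a priority queue keyed by their accumulated negative log-probability.

```python
import heapq
import math
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def beam_argmax(root, phi_h, beam_width=8):
    """Approximate c* = arg max_c P(c|h) over a label tree.

    An internal node is assumed to be {"w": weight_vector, "left": child,
    "right": child}; a leaf is {"word": str}. The queue stores
    (-log P of the partial path, tiebreak, node)."""
    heap = [(0.0, 0, root)]
    counter = 1
    best_word, best_logp = None, -math.inf
    while heap:
        neg_logp, _, node = heapq.heappop(heap)
        if "word" in node:                       # reached a leaf
            if -neg_logp > best_logp:
                best_word, best_logp = node["word"], -neg_logp
            continue
        p_right = sigmoid(float(np.dot(node["w"], phi_h)))
        for prob, child in ((1.0 - p_right, node["left"]), (p_right, node["right"])):
            if prob > 0.0:
                heapq.heappush(heap, (neg_logp - math.log(prob), counter, child))
                counter += 1
        if len(heap) > beam_width:               # prune to the beam width
            heap = heapq.nsmallest(beam_width, heap)
            heapq.heapify(heap)
    return best_word, math.exp(best_logp)
```

Without pruning, the first leaf popped would already be the exact argmax, since the accumulated negative log-probability never decreases along a path; the beam width trades this exactness for speed.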
6.3 Experiments
We conducted experiments on a Japanese spoken language corpus (CSJ) and the English BNC corpus (BNC). We divided each corpus into training and test sets with a 5 : 1 ratio. Statistics of these corpora are shown in Table 6.1.
We used a trigram model with modified Kneser-Ney smoothing as a baseline; this is a state-of-the-art language model. We trained the proposed model on the training corpus, estimating both the tree structure and the parameters. As features, we used the previous word and the word before it, which is the same information as used by trigram models. Moreover, we used prefix-of-path features of the previous words with prefix lengths of 1, 2, and 4. When training the binary logistic regression models, we applied L1 regularization; the hyper-parameter was manually adjusted using samples from the training corpus.
To measure the performance of the LMs, we used the perplexity, defined as 2^H, where H is the average negative log2 conditional probability of the next words.
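For reference, the following minimal sketch (Python; illustrative only) computes the perplexity from the conditional probabilities that a model assigns to the next words of an evaluation corpus.

```python
import math

def perplexity(probabilities):
    """Perplexity = 2^H, where H is the average negative log2
    conditional probability of the next words."""
    h = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2.0 ** h

# A model that always assigns probability 1/8 has perplexity 8.
print(perplexity([0.125] * 100))  # 8.0
```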
Table 6.2 shows the perplexity results. We can see that the proposed method achieved lower perplexities on both corpora.
Table 6.2: Results of HEM and Baseline

Method                            CSJ (Perplexity)   BNC (Perplexity)
Trigram with KN                   133.5              230.3
Hierarchical Exponential Model    120.3              216.9
6.4 Discussion and Conclusion
Although the application of maximum entropy models to language modeling is not new [Rosenfeld, 1994], such models have not been widely used in the NLP community, because their computational cost is prohibitively large and their improvement over simple language models (e.g., N-gram) is modest.
In this chapter, we have presented a language model that can use any type of features and whose training and inference are efficient. We showed that the large computational cost can be reduced by structuring the search space hierarchically, as in supertagging of lexical entries [Matsuzaki et al., 2007].
The environment for language modeling has changed. Since much faster computers and learning algorithms (e.g., online learning) are now available, language models based on logistic regression have become practical. Moreover, by applying sparse priors (e.g., L1 regularization), the model can be made compact, even smaller than N-gram models. By combining our algorithm with fast learning and parallelized processing (GPUs, cloud computing), language models can be improved considerably.
Chapter 7
Conclusion
In this thesis, I have proposed methods to build a large-scale natural language processing
(NLP) system. The methods are efficient, practical and scalable. To build this system,
three problems must be solved, as described in the following three paragraphs.
The first problem was related to the massive number of training examples. This problem has already been studied by others, so many solutions exist. For example, online learning algorithms can train the model in time linear in the number of training examples. In addition, by using L1 regularization, the model can be made compact and its inference becomes extremely fast. I introduced several online algorithms and data structures in Chapter 2 and used them in most of my implementations.
The second problem was about the massive number of features. The examples I considered in this thesis were “all substring features” and “combination features”. I presented an efficient algorithm for finding the effective features among the massive number of candidate features. I also showed that we can learn an exactly optimal classification model without enumerating all candidate features, by using a grafting algorithm, a sparse prior, and the search algorithm for finding effective features.
The third problem was about the massive number of output candidates. The example considered in this thesis was language modeling. I tackled this problem with two different approaches. The first was to discriminate correct and incorrect sentences directly; the problem here is how to learn the discriminative model when only positive examples are available. I proposed to employ pseudo-negative examples sampled from a probabilistic language model. The second language model involved building a hierarchical tree for candidate
outputs, which enabled us to search for the most probable output in an efficient manner.
These proposals were implemented with the help of recent on-line learning algorithms
and data structures, and we achieved state-of-the-art performance over a wide range of
NLP tasks, including document classification, document clustering, and language modeling.
I summarize the achievements of this thesis for each sub-topic below.
7.1 Learning with “All Substrings”
I presented a learning algorithm with “all substring features”, where all substrings appearing in a document are used as candidate features. I showed that, although the number of candidate features is O(n^2), where n is the length of a document, a classifier can be trained in O(n) time, and the required working space is also O(n). This is because many substrings carry the same information (e.g., they have the same frequency in a document), so these substrings can be summarized into a much smaller number of equivalence classes. Since many feature functions depend only on these statistics, the computed statistics corresponding to these features (such as the gradient value of an objective function with regard to a feature) are also the same. By using enhanced suffix arrays and auxiliary data structures, the statistics of these classes can be enumerated efficiently. I combined this efficient enumeration algorithm with L1 regularization and a grafting algorithm: at each step, the most effective feature is searched for among the O(n) equivalence classes and is added to the candidates for optimization. The properties of L1 regularization and the grafting algorithm ensure that this greedy algorithm converges to the global optimum.
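As a high-level illustration of that loop, the following sketch (Python) shows the grafting procedure under L1 regularization; find_most_effective_feature and optimize are hypothetical stand-ins for the suffix-array-based feature search and the L1-regularized optimizer described above, and reg_c stands for the regularization constant.

```python
def grafting(find_most_effective_feature, optimize, reg_c, max_iter=1000):
    """Incrementally grow the active feature set (grafting).

    find_most_effective_feature(weights) -- returns (feature, |gradient|) of
        the inactive feature whose loss gradient has the largest magnitude
        (found over the equivalence classes in the thesis setting).
    optimize(active_features) -- returns L1-regularized optimal weights
        restricted to the active features.
    reg_c -- the L1 regularization constant."""
    active, weights = [], {}
    for _ in range(max_iter):
        feature, grad_abs = find_most_effective_feature(weights)
        # Optimality condition of the L1-regularized objective: if no inactive
        # feature has |gradient| greater than reg_c, the solution is optimal.
        if grad_abs <= reg_c:
            break
        active.append(feature)
        weights = optimize(active)
    return weights
```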
I showed two applications of the algorithm described above: one for document classification and one for document clustering. I conducted experiments and compared the results to other state-of-the-art methods. The results showed that the accuracy of my algorithm was the highest in many tasks, especially when substring information is important for deciding the category of a document. The proposed algorithm will be useful for other applications dealing with string information, such as genome sequence analysis and web-log mining. This algorithm is also very scalable: the time and space consumption is linearly proportional to the text size, even when the number of documents is over one million.
7.2 Learning with “Combination Features”
I showed how the search for effective combination features can be done. Since the number of candidate features grows exponentially with the number of original features, direct optimization over all combination features is not feasible because of its prohibitively large cost. I showed that a simple filtering technique can be used to enumerate (only) the effective combination features, even for large-scale problems. Experimental results indicate that an L1-regularized logistic regression model with combination features can achieve comparable or better results than those from other methods. Furthermore, the resulting model is very compact and easy to interpret.
7.3 Discriminative Language Model with Pseudo-Negative Examples
I proposed a discriminative learning algorithm for problems where only a generative model
is given. An example is language modeling with only positive examples available. I
proposed to sample (pseudo-) negative examples from the generative model. I applied
this method to language modeling to enable the classifier to directly discriminate correct
sentences from incorrect sentences. I also showed that an on-line max-margin learning
method enables us to handle one million sentences and achieves 75% accuracy in the
task of discriminating the positive and pseudo-negative examples. Other experimental
results revealed that although a pseudo-negative example is incorrect, a syntactic parser
cannot discriminate it from correct examples. However, native speakers can easily discriminate it from correct examples. Experimental results also showed that kernel methods
can effectively discriminate correct and incorrect sentences. This has not been discussed
previously in probabilistic language model research.
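As an illustration of the sampling step, the following sketch (Python; not the thesis implementation) draws one pseudo-negative sentence word by word from an N-gram model's conditional distributions; the table cond_prob, mapping a context tuple to a word distribution, is an assumed layout for this example, and every reachable context is assumed to appear in the table.

```python
import random

def sample_pseudo_negative(cond_prob, max_len=30, bos="<s>", eos="</s>", seed=None):
    """Sample one pseudo-negative sentence from a probabilistic language model.

    cond_prob maps a context tuple (the previous N-1 words) to a dict
    {word: probability}; words are drawn one at a time from p(c | h)."""
    rng = random.Random(seed)
    n_minus_1 = len(next(iter(cond_prob)))      # context length N-1
    sentence, context = [], (bos,) * n_minus_1
    for _ in range(max_len):
        dist = cond_prob[context]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        if word == eos:
            break
        sentence.append(word)
        context = context[1:] + (word,)
    return sentence
```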
7.4 Hierarchical Exponential Models
Finally, I proposed a hierarchical exponential model (HEM) for problems where the number of candidate outputs is very large. In a HEM, the output candidates are hierarchically clustered, and the probability of an output is given by the product of the probabilities of the classification results associated with the nodes on the path from the root to the leaf. A HEM supports an efficient arg max operation, which returns the most probable output in
O(log K) time, where K is the number of labels. I applied HEM to the language modeling
problem and the experimental results show that this model achieved higher performance in
terms of the perplexity measure, compared to previous state-of-the-art language models,
such as N-gram with Kneser-Ney smoothing.
7.5 Future Work
There are several topics I did not consider in this thesis, even though they are fundamental
for practical NLP in the future. Three such topics are parallelization, randomization, and
non-linear representation.
Parallelization has become important because current processors are highly parallel and many clusters are now available for heavy computation. In existing NLP systems, including ours, only serialized processing is considered, and it is not trivial to do the processing in parallel. For example, since an online learning algorithm updates a parameter every time a mistake occurs, each learning operation depends on the previous operations, so parallelization is difficult. However, for simpler tasks, parallelization is very effective and can easily be applied [Asuncion et al., 2007, Asuncion et al., 2008]; examples are N-gram language models [Brants et al., 2007] and simple machine learning [Chu et al., 2006]. Another possibility is the use of general-purpose graphics processing units (GPGPU) for NLP [Zein et al., 2008, Yan et al., 2009a], since they are highly optimized for parallel computation.
Randomization (or randomized algorithms) has also become important for large-scale NLP. The benefit of a randomized algorithm is that we can expect good performance from a simpler algorithm. Recent studies have shown that randomization is especially effective in language modeling [Talbot and Brants, 2008, A. Levenberg, 2009], document clustering [Ravichandran et al., 2005], the calculation of expectations of features [Bouchard-Cote et al., 2009], and the computation of the singular value decomposition of a matrix [Halko et al., 2009].
Non-linear representations using deep neural networks have recently received renewed attention in NLP. Although, in the history of machine learning, neural networks were once displaced by linear classifiers and kernel machines, non-linear representations using deep neural networks have recently achieved many successes in NLP [Collobert and Weston, 2009, Hinton et al., 2006].
This is because research on online algorithms (or greedy updates) and regularization has matured, so parameter estimation can be performed accurately. Since there exist many layers in NLP, such as part-of-speech tags, syntactic trees, and semantic information, joint inference will become more important, and the ideas behind neural networks will become more important as well.
To conclude, this thesis has presented several large-scale NLP systems, which are based on recent research on data structures, machine learning, string algorithms, and optimization techniques. While previous NLP systems consider only how to obtain accurate results, the proposed systems also consider other important factors, such as efficiency in terms of speed and resources, and scalability. The most significant contribution of this thesis is to make practical NLP systems available by showing the possibility of integrating ideas from different fields. I hope that the proposed algorithms will also be applied to problems in other fields, such as biology, computer vision, and data mining.
References
[A. Levenberg, 2009] Levenberg, A. and M. Osborne. 2009. Stream-based randomized language models for SMT. In Proc. of EMNLP, pages 756–764.
[Abouelhoda et al., 2004] Abouelhoda, M. I., S. Kurtz, and E. Ohlebusch. 2004. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algs, 2:53–86.
[Aho and Corasick, 1975] Aho, A. V. and M. J. Corasick. 1975. Efficient string matching:
An aid to bibliographic search. Communications of the ACM, 18(6):333–340.
[Andrew and Gao, 2007] Andrew, G. and J. Gao. 2007. Scalable training of L1-regularized log-linear models. In Proc. of ICML.
[Anh and Moffat, 2005] Anh, V. N. and A. Moffat. 2005. Inverted index compression
using word-aligned binary codes. Information Retrieval, 8(1):151–166.
[Asuncion et al., 2007] Asuncion, A., P. Smyth, and M. Welling. 2007. Distributed inference for latent dirichlet allocation. In Proc. of NIPS.
[Asuncion et al., 2008] Asuncion, A., P. Smyth, and M. Welling. 2008. Asynchronous
distributed learning of topic models. In Proc. of NIPS.
[Bilmes and Kirchhoff, 2003] Bilmes, J. A. and K. Kirchhoff. 2003. Factored language
models and generalized parallel backoff. In Proc. of HLT/NACCL, pages 4–6.
[Blei et al., 2003] Blei, D. M., A. Y. Ng, and M. I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research., 3:993–1022.
[Bouchard-Cote et al., 2009] Bouchard-Cote, A., S. Petrov, and D. Klein. 2009. Randomized pruning: Efficiently calculating expectations in large dynamic programs. In
Proc. of NIPS.
[Boyd and Vandenberghe, 2004] Boyd, S. and L. Vandenberghe. 2004. Convex Optimization. Cambridge University Press.
[Brants et al., 2007] Brants, T., A. C. Popat, P. Xu, F. J. Och, and J. Dean. 2007. Large
language models in machine translation. In Proc. of EMNLP, pages 858–867.
[Brown et al., 1990] Brown, P. F., J. Cocke, S. Pietra, V. Pietra, F. Jelinek, J. Lafferty,
R. L. Mercer, and P. S. Roossin. 1990. A statistical approach to machine translation.
Comput. Linguist., 16(2):79–85.
[Cesa-Bianchi and Logosi, 2006] Cesa-Bianchi, N. and G. Logosi. 2006. Prediction,
learning, and games. Cambridge University Press.
[Charniak and Johnson, 2005] Charniak, E. and M. Johnson. 2005. Coarse-to-fine N-best
parsing and maxent discriminative reranking. In Proc. of ACL.
[Chen, 2009a] Chen, S. F. 2009a. Performance prediction for exponential language models. In Proc. of NAACL, pages 450–458.
[Chen, 2009b] Chen, S. F. 2009b. Shrinking exponential language models. In Proc. of
NAACL, pages 468–476, Morristown, NJ, USA. Association for Computational Linguistics.
[Cheng et al., 2006] Cheng, L., S. V. N. Vishwanathan, D. Schuurmans, S. W., and
T. Caelli. 2006. Implicit online learning with kernels. In Proc. of NIPS.
[Chu et al., 2006] Chu, C., S. K. Kim, Y. Lin, G. Bradski, Y. Yu, A. Y. Ng, and K. Olukotun. 2006. MapReduce for machine learning on multicore. In Proc. of NIPS.
[Collins et al., 2002] Collins, M., R. E. Schapire, and Y. Singer. 2002. Logistic regression,
adaboost and bregman distances. Machine Learning, 48(1-3):253–285.
[Collins, 2002] Collins, Michael. 2002. Discriminative training methods for hidden
markov models: Theory and experiments with perceptron algorithms. In Proc. of
EMNLP.
[Collins, 2003] Collins, M. 2003. Head-driven statistical models for natural language
parsing. Computational Linguistics, 29(4), December.
[Collobert and Weston, 2009] Collobert, R. and Jason Weston. 2009. Deep learning in
natural language processing. NIPS Tutorial.
[Crammer and Singer, 2001] Crammer, K. and Y. Singer. 2001. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265–292.
[Crammer et al., 2006] Crammer, K., O. Dekel, J. Keshet, S. Shalev-Shwartz, and
Y. Singer. 2006. Online passive-aggressive algorithms. Journal of Machine Learning Research.
[Crammer et al., 2008] Crammer, K., M. Dredze, and F. Pereira. 2008. Exact convex
confidence-weighted learning. In Proc. of NIPS, pages 345–352. MIT Press.
[Crammer et al., 2009] Crammer, K., M. Dredze, and A. Kulesza. 2009. Multi-class
confidence weighted algorithms. In Proc. of EMNLP, pages 496–504.
[Davidov et al., 2004] Davidov, D., E. Gabrilovich, and S. Markovitch. 2004. Parameterized generation of labeled datasets for text categorization based on a hierarchical
directory. In Proc. of SIGIR.
[Dekel et al., 2005] Dekel, O., S. Shalev-Shwartz, and Y. Singer. 2005. The forgetron: A
kernel-based perceptron on a fixed budget. In Proc. of NIPS.
[Deligne and Bimbot, 1995] Deligne, S. and F. Bimbot. 1995. Language modeling by
variable length sequences: Theoretical formulation and evaluation of multigrams. In
Proc. ICASSP ’95, pages 169–172, Detroit, MI.
[Ding et al., 2001] Ding, C. H. Q., X. He, H. Zha, M. Gu, and H. D. Simon. 2001. A
min-max cut algorithm for graph partitioning and data clustering. In ICDM, pages
107–114.
[Dredze et al., 2008] Dredze, M., K. Crammer, and F. Pereira. 2008. Confidence-weighted linear classification. In Proc. of ICML, pages 264–271.
[Duchi and Singer, 2009] Duchi, J. and Y. Singer. 2009. Online and batch learning using
forward backward splitting. In Proc. of NIPS.
[Dudı́k et al., 2007] Dudı́k, M., S. J. Phillips, and R. E. Schapire. 2007. Maximum entropy density estimation with generalized regularization and an application to species
distribution modeling. JMLR, 8:1217–1260.
[Freund and Schapire, 1999] Freund, Y. and R. E. Schapire. 1999. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296.
[Gao et al., 2005] Gao, J., H. Yu, W. Yuan, and P. Xu. 2005. Minimum sample risk
methods for language modeling. In Proc. of HLT/EMNLP.
[Gao et al., 2006] Gao, J., H. Suzuki, and B. Yu. 2006. Approximation lasso methods for
language modeling. In Proc. of ACL/COLING.
[Gao et al., 2007a] Gao, J., G. Andrew, M. Johnson, and K. Toutanova. 2007a. A comparative study of parameter estimation methods for statistical natural language processing.
In Proc. of ACL, pages 824–831.
[Gao et al., 2007b] Gao, J., G. Andrew, M. Johnson, and K. Toutanova. 2007b. A comparative study of parameter estimation methods for statistical natural language processing.
In Proc. of ACL, pages 824–831.
[Gieseke et al., 2009] Gieseke, F., T. Pahikkala, and O. Kramer. 2009. Fast evolutionary
maximum margin clustering. In ICML, pages 361–368.
[Gildea and Hofmann, 1999] Gildea, D. and T. Hofmann. 1999. Topic-based language
models using em. In Proceedings of the 6th European Conference on Speech Communication and Technology(EUROSPEECH).
[Gusfield, 1997] Gusfield, D. 1997. Algorithms on Strings, Trees and Sequences. Cambridge University Press.
[Halko et al., 2009] Halko, N., P. G. Martinsson, and J. Tropp. 2009. Finding structure
with randomness: Stochastic algorithms for constructing approximate matrix decompositions. arXiv 0909.4061.
[Hérault and Grandvalet, 2007] Hérault, R. and Y. Grandvalet. 2007. Sparse probabilistic
classifiers. In Proc. of ICML, pages 337–344.
[Hinton et al., 2006] Hinton, G.E., S. Osindero, and Y.W. Teh. 2006. A fast learning
algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.
[Ifrim et al., 2008] Ifrim, G., G. Bakir, and G. Weikum. 2008. Fast logistic regression for
text categorization with variable-length n-grams. In Proc. of SIGKDD.
[J. Suzuki, 2008] J. Suzuki, H. Isozaki. 2008. Semi-supervised sequential labeling and
segmentation using giga-word scale unlabeled data. In Proc. of ACL, pages 665–673.
[Jaynes, 1957] Jaynes, E. T. 1957. Information theory and statistical mechanics. Physical
Review Series II, 106(4):620–630.
[Joachims, 1998] Joachims, T. 1998. Text categorization with Support Vector Machines
learning with many relevant features. In Proceedings of 10th European Conference on
Machine Learning (ECML), pages 137–142.
[Kasai et al., 2001] Kasai, T., G. Lee, H. Arimura, S. Arikawa, and K. Park. 2001. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proc.
of CPM, pages 181–192.
[Kazama and Tsujii, 2005] Kazama, J. and J. Tsujii. 2005. Maximum entropy models
with inequality constraints: A case study on text categorization. Machine Learning Journal special issue on Learning in Speech and Language Technologies, 60((13)):169–194.
[Knight and Marcu, 2002] Knight, K. and D. Marcu. 2002. Summarization beyond
sentence extraction: a probabilistic approach to sentence compression. Artif. Intell.,
139(1):91–107.
[Koh et al., 2007] Koh, K., S. J. Kim, and S. Boyd. 2007. An interior-point method for
large-scale l1 -regularized logistic regression. JMLR, 8.
[Kudo and Matsumoto, 2003] Kudo, Taku and Yuji Matsumoto. 2003. Fast methods for
kernel-based text analysis. In Proc. of ACL.
[Kudo and Matsumoto, 2004] Kudo, T. and Y. Matsumoto. 2004. A boosting algorithm
for classification of semi-structured text. In Proc. of EMNLP.
[Lafferty et al., 2001] Lafferty, J., A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc.
of ICML, pages 282–289.
[Li et al., 2009] Li, Yu-F., I. W. Tsang, J. T. Kwok, and Z-H Zhou. 2009. Tigher and
convex maximum margin clustering. In In Proc. of AISTATS, pages 344–351.
[Lidstone, 1920] Lidstone, G.J. 1920. Note on the general case of the bayes-laplace formula for inductive or a posteriori probabilities. Transactions of the Faculty of Actuaries,
8:182–192.
[Liu and Nocedal, 1989] Liu, D. C. and J. Nocedal. 1989. On the limited memory method
for large scale optimization. Mathematical Programming B, 45(3):503–528.
[Marcus et al., 1994] Marcus, M., G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies,
M. Ferguson, K. Katz, and B. Schasberger. 1994. The Penn treebank: Annotating
predicate argument structure. In ARPA Human Language Technology Workshop.
[Martin et al., 1998] Martin, S., J. Liermann, and H. Ney. 1998. Algorithms for bigram
and trigram word clustering. Speech Communication, 24(1):19–37.
[Matsuzaki et al., 2007] Matsuzaki, T., Y. Miyao, and J. Tsujii. 2007. Efficient hpsg
parsing with supertagging and cfg-filtering. In Proc. of IJCAI, pages 1671–1676, San
Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
[McCallum et al., 2000] McCallum, A., D. Freitag, and F. C. N. Pereira. 2000. Maximum
entropy markov models for information extraction and segmentation. In Proc. of ICML,
pages 591–598, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
[Miyao and Tsujii, 2008] Miyao, Y. and J. Tsujii. 2008. Feature forest models for probabilistic HPSG parsing. Computational Linguistics., 34(1):35–80, March.
[Navarro and Mäkinen, 2007] Navarro, G. and V. Mäkinen. 2007. Compressed full-text
indexes. ACM Comput. Surv., 39(1):2.
[Ng, 2004] Ng, A. Y. 2004. Feature selection, L1 vs. L2 regularization, and rotational
invariance. In Proc. of ICML, page 78, New York, NY, USA. ACM.
[Och and Ney, 2003] Och, F. J. and H. Ney. 2003. A systematic comparison of various
statistical alignment models. Comput. Linguist., 29(1):19–51.
[Okanohara and Tsujii, 2007] Okanohara, D. and J. Tsujii. 2007. A discriminative language model with pseudo-negative samples. In ACL, pages 73–80.
[Okanohara and Tsujii, 2009a] Okanohara, D. and J. Tsujii. 2009a. Learning combination
features with L1-regularization. In Proc. of NAACL, pages 97–100.
[Okanohara and Tsujii, 2009b] Okanohara, D. and J. Tsujii. 2009b. Text categorization
with all substring features. In Proc. of SDM, pages 838–846.
[Pang et al., 2002] Pang, B., L. Lee, and S. Vaithyanathan. 2002. Thumbs up? sentiment
classification using machine learning techniques. In Proc. of EMNLP, pages 79–86.
[Perkins and Theeiler, 2003] Perkins, S. and J. Theeiler. 2003. Online feature selection
using grafting. ICML.
[Perkins et al., 2003] Perkins, S., K. Lacker, and J. Theiler. 2003. Grafting: Fast, incremental feature selection by gradient descent in function space. JMLR, 3:1333–1356.
[Ravichandran et al., 2005] Ravichandran, D., P. Pantel, and E. Hovy. 2005. Randomized algorithms and nlp: Using locality sensitive hash function for high speed noun
clustering. In Proc. of EMNLP, pages 622–629.
[Rice, 1998] Rice, R. F. 1998. An empirical study of smoothing techniques for language
modeling. Technical report, Harvard Computer Science Technical report TR-10-98.
[Roark et al., 2007] Roark, B., M. Saraclar, and M. Collins. 2007. Discriminative n-gram
language modeling. computer speech and language. Computer Speech and Language,
21(2):373–392.
[Rosenblatt, 1958] Rosenblatt, F. 1958. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408.
[Rosenfeld et al., 2001] Rosenfeld, R., S. F. Chen, and X. Zhu. 2001. Whole-sentence
exponential language models: a vehicle for linguistic-statistical integration. Computers
Speech and Language, 15(1).
[Rosenfeld, 1994] Rosenfeld, R. 1994. Adaptive Statistical Language Modeling: A Maximum Entropy Approach. Ph.D. thesis, Carnegie Mellon University.
[Sadakane, 2007] Sadakane, K. 2007. Succinct data structures for flexible text retrieval
systems. Journal of Discrete Algorithms, 5(1):12–22.
[Saigo et al., 2007] Saigo, H., T. Uno, and K. Tsuda. 2007. Mining complex genotypic
features for predicting HIV-1 drug resistance. Bioinformatics, 23:2455–2462.
[Sassano, 2004] Sassano, Manabu. 2004. Linear-time dependency analysis for japanese.
In Proc. of COLING.
[Sha et al., 2007] Sha, F., Y. A. Park, and L. K. Saul. 2007. Multiplicative updates for l1
regularized linear and logistic regression. In Proc. of IDA.
[Shalev-Shwartz, 2007] Shalev-Shwartz, S. 2007. Online Learning: Theory, Algorithms,
and Applications. Ph.D. thesis, The Hebrew University of Jerusalem, July.
[Shi and Malik, 2000] Shi, J. and J. Malik. 2000. Normalized cuts and image segmentation. PAMI.
[Smith and Eisner, 2005] Smith, N. A. and J. Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proc. of ACL.
[Talbot and Brants, 2008] Talbot, D. and T. Brants. 2008. Randomized language models
via perfect hash functions. In Proc. of ACL, pages 505–513.
[Taskar et al., 2004] Taskar, B., C. Guestrin, and D. Koller. 2004. Max margin markov
networks. In Proc. of NIPS.
[Taylor and Cristianini, 2004] Taylor, J. S. and N. Cristianini. 2004. Kernel Methods for
Pattern Analysis. Cambridge University Press.
[Teo and Vishwanathan, 2006] Teo, C. H. and S. V. N. Vishwanathan. 2006. Fast and
space efficient string kernels using suffix arrays. In Proc. of ICML, pages 929–936.
[Tillmann and Ney, 1996] Tillmann, C. and H. Ney. 1996. Selection criteria for word
trigger pairs in language modeling. In In ICGI ’96, pages 95–106. Springer.
[Tsuruoka et al., 2009] Tsuruoka, Y., J. Tsujii, and S. Ananiadou. 2009. Stochastic gradient descent training for l1-regularized log-linear models with cumulative penalty. In
Proc. of ACL-IJCNLP, pages 477–485.
[Uchimoto et al., 1999] Uchimoto, K., S. Sekine, and H. Isahara. 1999. Japanese dependency structure analysis based on maximum entropy models. In Proc. of EACL, pages
196–203.
[Vishwanathan and Smola, 2004] Vishwanathan, S. V. N and A. J. Smola. 2004. Fast
kernels for string and tree matching. Kernels and Bioinformatics.
[Wang et al., 2005] Wang, S., S. Wang, R. Greiner, D. Schuurmans, and L. Cheng. 2005.
Exploiting syntactic, semantic and lexical regularities in language modeling via directed markov random fields. In Proc. of ICML.
[Whittaker and Woodland, 2001] Whittaker, E. W. D. and R. C. Woodland. 2001. Efficient class-based language modelling for very large vocabularies. Acoustics, Speech,
and Signal Processing, IEEE International Conference on, 1:545–548.
[Xu et al., 2004] Xu, L., J. Neufeld, B. Larson, and D. Schuurmans. 2004. Maximum
margin clustering. In NIPS 17.
[Yamamoto and Church, 2001] Yamamoto, M. and K. W. Church. 2001. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus.
Comput. Linguist., 27(1):1–30.
[Yan et al., 2009a] Yan, F., N. XU, and Y. Qi. 2009a. Parallel inference for latent dirichlet
allocation on graphics processing units. In Proc. of NIPS.
[Yan et al., 2009b] Yan, H., S. Ding, and T. Suel. 2009b. Inverted index compression
and query processing with optimized document ordering. In Proc. of WWW, pages
401–410, New York, NY, USA. ACM.
[Yoshinaga and Kitsuregawa, 2009] Yoshinaga, N. and M. Kitsuregawa. 2009. Polynomial to linear: Efficient classification with conjunctive features. In Proc. of EMNLP,
pages 1542–1551.
[Yu et al., 2008] Yu, J., S. V. N. Vishwanathan, S. Guenter, and N. Schraudolph. 2008. A
quasi-Newton approach to nonsmooth convex optimization. In Proc. of ICML.
[Yusuke and Tsujii, 2005] Yusuke, M. and J. Tsujii. 2005. Probabilistic disambiguation
models for wide-coverage HPSG parsing. In Proc. of ACL 2005., pages 83–90, Ann
Arbor, Michigan, June.
[Zein et al., 2008] El Zein, A., E. McCreath, A. P. Rendell, and A. J. Smola. 2008.
Performance evaluation of the NVIDIA GeForce 8800 GTX GPU for machine learning.
In International Conference Computational Science.
[Zhai and Lafferty, 2004] Zhai, C. and J. Lafferty. 2004. A study of smoothing methods
for language models applied to information retrieval. ACM Trans. Inf. Syst., 22(2):179–
214.
[Zhang et al., 2007] Zhang, K., I. W. Tsang, and J. T. Kowk. 2007. Maximum margin
clustering made practical. In ICML 24.
[Zhao et al., 2008a] Zhao, B., F. Wang, and C. Zhang. 2008a. Efficient maximum margin
clustering via cutting plane algorithm. In SDM, pages 751–762.
[Zhao et al., 2008b] Zhao, B., F. Wang, and C. Zhang. 2008b. Efficient multiclass maximum margin clustering. In ICML.