Prediction with Local Patterns using Cross-Entropy
Paper ID: Number 430
Heikki Mannila*, Dmitry Pavlov+, and Padhraic Smyth+
*Microsoft Research
Redmond, WA
+Information and Computer Science
University of California at Irvine
Abstract
Sets of local patterns in the form of rules and co-occurrence counts are a common type of output from many
data mining algorithms, such as association rule algorithms. A known problem with such sets of patterns is how
to use them for prediction, i.e., how to synthesize local sparse information into a coherent global predictive
model. In this paper we investigate the use of a cross-entropy approach to combining local patterns. Each local
pattern is viewed as a constraint on an appropriate high-order joint distribution of interest. Typically, a set of
patterns returned by a data mining algorithm under-constrains the high-order model. The cross-entropy
criterion is then used to select a specific distribution in this constrained family relative to a prior. We review the
iterative-scaling algorithm which is an iterative technique for finding a joint distribution given constraints. We
then illustrate the application of this method to two specific problems. The first problem is combining
information about frequent itemsets. We show that the cross-entropy approach can be used for query
selectivity estimation for 0/1 data sets. The results show that we can accurately answer a large class of queries
using just a small set of aggregate information. The second problem involves sequence prediction given
historical rules, with an application to protein modeling. We conclude that viewing local patterns as constraints
on a high-order probability model is a useful and principled framework for prediction based on large sets of
mined patterns.
1. Introduction
In this paper we investigate the problem of how to combine local partial information
specified as conditional probabilities (e.g., probabilistic rules) or joint probabilities (e.g.,
frequent itemsets) into a single global model which can then be used for prediction. This is
an important problem in data mining. There are many algorithms in existence for finding
local patterns. A common complaint is that these algorithms return large sets of “rules” but
it is difficult to evaluate the properties of these rules as a set, e.g., it is difficult to perform
prediction or to compare the quality of one set against another in an objective manner.
While there are already some approaches to combining local patterns for specific problems
like classification (e.g., Liu, Hsu, and Ma (1998)), here we focus explicitly on a probabilistic
framework. Note that we will not focus at any length in this paper on how the patterns or
rules can be found from the data (there is a large body of techniques available to do this
already). Instead we focus on the following question: once the patterns are given, how can
one combine them for prediction?
In this paper we will consider the discrete-valued problem; thus, we will assume that a
vector-valued random variable x = (x1, x2,…,xn) takes values from some finite alphabet. We
are interested in estimating the joint distribution P(x1, x2,…,xn) = P(x), or some function of
P(x) such as a conditional distribution, given sets of local rules containing information of the
form p(xi= a|xj= b, xk= c) and/or p(xi= a, xj= b, xk= c). Let us first consider some general
issues in learning P(x) from data. In general, if for example each xi can take m distinct values,
specification of P(x) will require O(m^n) different entries in the joint probability table. It is
clear that estimating the full joint distribution from data is often impractical given even
moderate values of m and n, because of the exponential growth of the number of different
probabilities which must be specified.
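A one-line arithmetic sketch of this blow-up (the function name is illustrative):

```python
# Minimal sketch of the exponential growth argument: a full joint table over
# n discrete variables with m values each needs m**n entries.
def joint_table_size(m, n):
    return m ** n

print(joint_table_size(2, 10))   # 10 binary variables: 1024 entries
print(joint_table_size(20, 10))  # 10 variables over a 20-symbol alphabet: 20**10 entries
```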
One general approach is to reduce the complexity of the joint distribution model by making
structural assumptions about the independence relations between the x variables. For
example, if we are willing to assume that all variables are independent of each other (typically
not a very useful model) then we only need O(mn) parameters to specify the joint
distribution. For a first-order Markov assumption (given some ordering defined on the
variables) we would need O(m^2 n) parameters, and so forth. Graphical models or belief networks provide
a general framework for specifying and manipulating a very general class of such
independence models.
These Markov assumptions typically model dependence only at the variable level. An example
of a variable level dependency would be to say that each letter in a sequence of text depends
only on the immediately preceding letter. Variable-level models do not easily allow selective
encoding of dependencies at the value level. A value level dependency only specifies a
dependence between particular values of two variables, rather than assuming dependence for
all pairs of values. For example, a value-level dependence model for text could specify that
certain letters depend on certain preceding letters, e.g., “h” follows “t” with some
probability, but there is no explicit statement made in the model about the dependence of
“h” on other letters which precede it. For every pairwise direct variable-level dependency in
the model, we need O(m^2) parameters to specify this dependency (i.e., all the possible m x m
conditional probabilities between the pairs of values). For large alphabets (i.e., large m) this
parameterization may be expensive in terms of the data required to justify it. For example, in
language modeling at the symbol level, even specifying bigram models (i.e., first-order
Markov dependence), with a typical vocabulary on the order of m=10,000 symbols, can be a
tremendously difficult task requiring vast training data sets.
If we go beyond the first-order dependence at the variable level, the parameterization
problem becomes worse, i.e., we need O(m^k) parameters for a (k-1)th-order model. Again,
using text sequences as an example, for higher level dependencies we may be more interested
in value-specific dependencies than variable-level dependencies, e.g., the “ion” trigram of 3
letters is quite specific to these letters, while combinations such as “tyw” can be modeled as
largely independent. We can generalize the idea of value-specific dependencies to allow more
general events, such as if the word “data” or “statistics” have occurred in the past 20 words,
the probability of getting the word “hypothesis” as the next word has increased (e.g., this is
captured in the notion of “triggers,” Rosenfeld, 1996).
A related idea is to approximate high-order dependencies by combinations of low-order
models, e.g., p(x4|x3,x2,x1) as a linear combination of p(x4|x3) , p(x4|x2) and p(x4|x1) (e.g., see
Raftery and Tavaré (1995) and Saul and Jordan (in press) in a Markov sequence modeling
context). Note that such a model reduces the number of parameters from O(m^k) to O(km^2),
but of course true high-order interactions may not be accurately captured. More generally,
we want to combine various “low-order” pieces of information in the form of rules into a
coherent high-order model.
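A minimal sketch of such a combination for a single conditional distribution, with illustrative (randomly generated) component tables and mixing weights:

```python
import numpy as np

# Hypothetical sketch: approximate p(x4 | x3, x2, x1) as a linear combination
# of pairwise conditionals p(x4 | x3), p(x4 | x2), p(x4 | x1). The weights
# lam are non-negative and sum to 1; all tables here are toy examples.
m = 3  # alphabet size
rng = np.random.default_rng(0)

def random_conditional(m, rng):
    t = rng.random((m, m))
    return t / t.sum(axis=1, keepdims=True)  # row i is p(x4 | parent = i)

p4_given_3 = random_conditional(m, rng)
p4_given_2 = random_conditional(m, rng)
p4_given_1 = random_conditional(m, rng)
lam = np.array([0.5, 0.3, 0.2])

def approx_p4(x3, x2, x1):
    return (lam[0] * p4_given_3[x3]
            + lam[1] * p4_given_2[x2]
            + lam[2] * p4_given_1[x1])

dist = approx_p4(0, 1, 2)
print(dist.sum())  # a convex combination of distributions: sums to 1
```

Because the weights are convex, the approximation is always a valid distribution, but any genuinely third-order interaction among (x1, x2, x3) is lost.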
The primary focus of many current data mining algorithms is to extract such low-order
information in an efficient manner. For example, association rule algorithms essentially pull
out all conditional probabilities of the form p(xi= a|xj= b, xk= c) which are above some
threshold and for which p(xj= b, xk= c), the support, is also above a prespecified threshold.
These types of patterns are “local” to small sets of specific variables and specific values of
these variables. There are numerous similar “rule-finding” algorithms that can efficiently
produce large lists of such local patterns. Typically the patterns are chosen based on the fact
that they differ from the expected norm, i.e., they are informative in some sense relative to a
uniform prior distribution on P(x1, x2,…,xn). Potentially these algorithms can pull out value-level dependencies in the data which might be “missed” by a variable-level model.
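For instance, a minimal sketch of this kind of rule extraction from a 0/1 data matrix (the data and thresholds are illustrative, and this brute-force scan is not the efficient counting strategy real association-rule algorithms use):

```python
import numpy as np

# Sketch: extract "rules" p(x_i=1 | x_j=1, x_k=1) from a 0/1 data matrix,
# keeping only those whose confidence and support exceed given thresholds,
# in the spirit of association-rule mining. Data below is hypothetical.
def rules_from_01_data(data, min_support=0.2, min_confidence=0.6):
    S, n = data.shape
    rules = []
    for j in range(n):
        for k in range(j + 1, n):
            body = (data[:, j] == 1) & (data[:, k] == 1)
            support = body.mean()           # p(x_j=1, x_k=1)
            if support < min_support:
                continue
            for i in range(n):
                if i in (j, k):
                    continue
                confidence = data[body, i].mean()  # p(x_i=1 | x_j=1, x_k=1)
                if confidence >= min_confidence:
                    rules.append(((j, k), i, support, confidence))
    return rules

data = np.array([[1, 1, 1, 0],
                 [1, 1, 1, 0],
                 [1, 1, 0, 1],
                 [0, 1, 0, 1],
                 [1, 0, 1, 1]])
for body, head, sup, conf in rules_from_01_data(data):
    print(f"x{head}=1 <- x{body[0]}=1, x{body[1]}=1  support={sup:.2f} conf={conf:.2f}")
```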
Combining local patterns to form a global model for P(x1, x2,…,xn) is worthwhile from two
different perspectives. Firstly, from a data mining perspective, it would offer a framework
for linking sets of local patterns with the more well-established world of global models,
where evaluation, prediction, estimation, model selection, and so forth, can be handled in a
coherent and consistent manner. Secondly, from an applications perspective, the ability to
quickly find local patterns and then use these to construct a model for P(x1, x2,…,xn) could
offer distinct advantages in accuracy, efficiency, and interpretability compared to more
conventional approaches.
In this paper we focus on the first aspect of the problem, namely illustrating how local sets
of patterns typically uncovered by data mining algorithms can be combined to form global
models. In Section 2 we formally define the problem and then show that it is equivalent to a
well-known problem in probabilistic modeling, namely, finding the closest distribution to a
prior distribution in a constrained family. In Section 3 we review the well-known iterative
scaling algorithm which is an iterative technique for finding the distribution of interest.
Section 4 contains an application of this approach to the problem of query selectivity
estimation, where one wants to estimate how many records in a database satisfy a certain
“high-order” query and one already has low-order counts. In Section 5 we outline the use of
this technique on sequential data (which is where it has most often been used in the past,
particularly for text sequences), and apply the method to characterizing dependence structure
in protein sequences. Section 6 contains conclusions.
2. Background and General Statement of the Problem
We begin with a fairly abstract statement of the problem and then show how this statement
relates to the discussion in the previous section. Suppose that the following requirements are
imposed on P(x) (keep in mind that P(x) is unknown; we wish to estimate it from the data):
1. P(x) should satisfy certain linear constraints, where the constraints come from the
training data;
2. P(x) should be chosen in all other respects in accordance with our ignorance about
everything these constraints do not specify.
Requirement (1) can be rewritten mathematically as

    Σ_x P(x) · k(x|i) = d(i),                                         (1)
where k(x|i) denotes the i-th constraint function (which for our purposes will be an
indicator function, taking the value 1 iff x satisfies the i-th constraint), d(i) is a given constraint
target, and i varies over all given constraints, from 1 to C, where C is the total number of
constraints. In a data mining framework, we will assume that our patterns and rules are
statements about frequency counts in the data that can be represented as constraints. For
example, if we know that the number of co-occurrences of (xj= 1, xk= 1) in the training data
is njk, then the indicator function would be 1 when (xj= 1, xk= 1) and 0 otherwise, so that the
left-hand side of Equation (1) would sum to P(xj= 1, xk= 1). The right-hand side constraint
d(i) could be set to something like njk/S (where S is the total number of data points), thus,
constraining the probability on the left to take this maximum-likelihood value.
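As a small sketch of this setup (variable names and the target value are illustrative), a single frequent-itemset constraint over three binary variables:

```python
import itertools
import numpy as np

# Sketch of Equation (1) for three binary variables: each local pattern
# becomes an indicator constraint k(x|i), and the left-hand side of (1) is
# just the probability mass on the states the indicator selects.
n = 3
states = list(itertools.product([0, 1], repeat=n))

def k(x):                       # constraint: the itemset (x0=1, x1=1)
    return 1 if (x[0] == 1 and x[1] == 1) else 0

d = 0.4                         # illustrative target, e.g. n_jk / S

P = np.full(len(states), 1.0 / len(states))   # uniform P over the 8 states
lhs = sum(P[s] * k(x) for s, x in enumerate(states))
print(lhs)       # 0.25 under the uniform distribution
print(lhs == d)  # False: the uniform distribution violates this constraint
```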
The cross-entropy is defined as

    D(P||Q) = Σ_x P(x) · log( P(x) / Q(x) )                           (2)
and provides a strictly non-negative distance measure between two densities P(x) and Q(x).
The cross-entropy is zero if and only if P(x) = Q(x) for all x. For our purposes Q(x) is a
prior distribution for x which we assume is known, and we seek that P(x) which obeys the
constraints in (1) and which is closest to Q(x) in terms of minimizing the cross-entropy
distance. In this paper, and in many data mining applications, there is no well-defined prior
Q(x) which is known, and thus, we choose a completely uniform prior as Q(x). Since Q(x) is
then a constant as a function of x, minimizing the cross-entropy is in this case equivalent to
maximizing a function of the form –P log P, i.e., maximizing the entropy.
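A quick numerical sketch of this equivalence (the distributions below are illustrative):

```python
import numpy as np

# Sketch: D(P||Q) from Equation (2), and a check that with a uniform Q the
# cross-entropy differs from the negative entropy of P only by a constant,
# so minimizing D(P||Q) is the same as maximizing the entropy of P.
def cross_entropy(P, Q):
    mask = P > 0                 # 0 * log 0 is taken as 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

P = np.array([0.5, 0.25, 0.125, 0.125])
Q = np.full(4, 0.25)             # uniform prior
neg_entropy = float(np.sum(P * np.log(P)))
print(cross_entropy(P, Q))                 # D(P||Q) >= 0
print(cross_entropy(P, Q) - neg_entropy)   # equals log(4), independent of P
```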
Thus, we can state our problem as follows: from the family of constrained distributions P
that satisfy (1), choose that particular distribution which has the maximum entropy.
Maximum entropy has a long history as a criterion for model selection that we will not dwell
on here: essentially it picks the distribution which makes the fewest independence
assumptions.
One of the constraints in (1) (let it be constraint 0) should be k(x|0)=1 for all x and d(0)=1
to ensure that the distribution sums up to 1. The method of undetermined Lagrangian
multipliers can be used to determine the general functional form of the solution:
    P(x) = Q(x) · μ_0 · Π_i μ_i^k(x|i)                                (3)

with constraints

    μ_0 · Σ_x Q(x) · [ Π_i μ_i^k(x|i) ] · k(x|j) = d(j)               (4)

with j varying over all possible constraints and {μ_i, i = 1, …, C} being a set of positive numbers which
must be found. Since the indicator function k(x|j) participates as a factor in (4), we will only
sum over those x that satisfy the j-th constraint.
The general problem in (1) and (2) is now reduced to the problem of finding a set of positive
numbers {μ_i, i = 1, …, C} for the problem (3) and (4). It can be shown that problem (3),(4) (and,
hence, problem (1),(2)) has a solution, and that it is unique, iff the constraints are
consistent (Darroch and Ratcliff (1972)). Consistency means that the constraints in (1) do
not contradict each other. For example, if all the constraints only consist of relative
frequency counts in the data, then they will be consistent automatically. However, one might
want to use better estimators for marginal probabilities than simply raw counts, especially for
small numbers of counts. In this case it may be difficult to prove consistency. Nonetheless it
has been found that even when consistency is not present, a reasonable solution can still be
found (e.g., Rosenfeld, 1996). The requirement of consistency and limited amounts of
training data complicates the choice of constraints as discussed in Jelinek (1998). The space
complexity of the problem will depend on the number of imposed constraints. Expression
(3) contains (C+1) unknown parameters and thus, the space complexity is O(C). For each
application we discuss below we will give bounds on C. The time complexity is a function of
the algorithm used to arrive at a specific solution to (3) and (4), which we discuss in the next
section. Note in general that while the original data set may not reside in main memory (i.e.,
the pattern finding algorithm may involve access of disk-resident data for example), it is
quite reasonable to assume that the set of patterns returned by the algorithm will themselves
reside in main memory as will any model we construct based on those patterns. Thus, once
the patterns (constraints) are determined, no further disk access is necessary.
3. The Generalized Iterative Scaling Algorithm
The generalized iterative scaling algorithm is well known in the statistical literature as a technique
for solving problems of the general nature presented in the last section (see, for instance, the
original paper by Darroch and Ratcliff (1972), and Csiszár (1984, 1989)). The algorithm can
be used to obtain a numerical solution for the parameters of the function (3), satisfying
constraints (4) plus a normalization constraint. The iterative scaling algorithm produces a
sequence of properly normalized probability distributions P_N(x) such that, if P(x) is the desired
distribution, then

    lim_{N→∞} max_x | P(x) − P_N(x) | = 0.
Thus, the algorithm will converge to a unique solution of the problem (3),(4) provided that
constraints (4) are consistent. A high-level outline of the algorithm is:
1. Choose an initial approximation to P(x) coinciding with Q(x)
   (typically uniform).
2. Iteration = 0;
3. While (Not all Constraints are Sufficiently Satisfied) do
   Begin
     Iteration++;
     For (i varying over all constraints in (4)) do
     Begin
       Update μ_0
       Update μ_i
     End
   End
4. Output Probability Distribution in the form (3)
The update rules for the normalizing constant μ_0 and for the μ_i corresponding to constraint i at
iteration N, with i varying over the constraints in (4), are (from Jelinek, 1998):

    S_i = Σ_x P^(N−1)(x) · k(x|i)                                     (5)

    μ_0^(N) = μ_0^(N−1) · (1 − d_i) / (1 − S_i)                       (6)

    μ_i^(N) = μ_i^(N−1) · d_i (1 − S_i) / ((1 − d_i) S_i)             (7)
In order to compute the right-hand sides of rules (6) and (7), one needs to sum the
distribution P^(N−1)(x) obtained in the previous step over all possible values of x satisfying
constraint i.
The time complexity of the algorithm depends on the number of multiplications that are
performed in one iteration of the algorithm, i.e., used to update all C constraints. The
number of summands in (5) equals the number Ni of occurrences of x in the training data
that satisfy the i-th constraint. Ni is bounded by the total number of data points in the
training set, S. Each summand is a function P ( N 1 ) ( x ) evaluated for a specific x. This
evaluation takes O(C) multiplications, where C is the total number of constraints. Hence, the
complexity of carrying out step (5) is bounded by O(SC). Thus, the overall complexity of
carrying out one iteration of the algorithm is bounded by O(SC^2), since updates are
performed for every constraint.
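As a concrete sketch of these updates, the following toy implementation runs the per-constraint rules (5)-(7) over a small binary joint space; the constraints, targets, and variable names are illustrative, not taken from the paper's experiments:

```python
import itertools
import numpy as np

# A toy sketch of generalized iterative scaling for binary data, using the
# update rules (5)-(7) quoted from Jelinek (1998). The two constraints and
# their target values below are illustrative.
n = 3
states = list(itertools.product([0, 1], repeat=n))

# each constraint: (indicator function k(x|i), target value d(i))
constraints = [
    (lambda x: 1 if x[0] == 1 else 0, 0.7),                # P(x0=1) = 0.7
    (lambda x: 1 if x[0] == 1 and x[1] == 1 else 0, 0.4),  # P(x0=1, x1=1) = 0.4
]

Q = np.full(len(states), 1.0 / len(states))  # uniform prior
mu0 = 1.0
mu = np.ones(len(constraints))

def current_P():
    # the product form (3), renormalized for numerical safety
    P = Q * mu0
    for i, (k, _) in enumerate(constraints):
        P = P * np.array([mu[i] if k(x) else 1.0 for x in states])
    return P / P.sum()

for _ in range(100):                    # sweeps over all constraints
    for i, (k, d) in enumerate(constraints):
        P = current_P()
        S = sum(P[s] for s, x in enumerate(states) if k(x))  # Equation (5)
        mu0 *= (1 - d) / (1 - S)                             # Equation (6)
        mu[i] *= d * (1 - S) / ((1 - d) * S)                 # Equation (7)

P = current_P()
fitted = [sum(P[s] for s, x in enumerate(states) if k(x)) for k, _ in constraints]
print([round(f, 4) for f in fitted])  # converges to the targets [0.7, 0.4]
```

Since the two constraints here are consistent (0.4 ≤ 0.7), the cycle of updates converges to the unique distribution satisfying both, as guaranteed by Darroch and Ratcliff's result.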
4. Heikki's Section Here
5. Sequence Models combining Local Patterns and Markov Models
5.1 Local Trigger-based Patterns for Sequence Prediction
Let us consider a specific application domain, in which the training data is a sequence
consisting of symbols from a fixed alphabet A. For any variable w we define the history H as
the HL symbols preceding w in the text where HL is the length of history. One can allow a
gap of GL symbols separating w and its history. Thus, effectively x=(w, H), and we are
interested in estimating P(x)=P(w, H). Estimating the probability distribution P(x) is a
difficult task, as discussed earlier, since in general one may want to consider long-term
dependencies, which increases the history length HL and hence the number of variables in
the distribution. In practice, it is often only feasible to accurately estimate bigram or trigram
models, i.e., models in which HL is equal to 1 or 2.
We can introduce a notion of long-term dependency by incorporating probabilistic rules or
triggers (Rosenfeld, 1996). A trigger is simply the occurrence of a particular symbol in the
history H of a particular variable. A useful trigger is one whose presence in the history
significantly modifies the distribution of the variable w. In general, we will call the triple
(α, β, P_Trigger = P(w=β | α∈H)) a trigger. Here α is the trigger symbol
(i.e., the symbol in the history), and β is a value of the variable whose probability we are
currently interested in. For this paper it is not so important how we find these triggers from
the data. One relatively straightforward technique is to select triggers based on the average
mutual information between the events {w=β} and {α∈H}. Thus, all triggers can be ranked
according to their information content and only top triggers retained for incorporation into a
global model.
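One possible sketch of this ranking criterion (the text, history length, and gap below are illustrative; the paper does not prescribe this exact implementation):

```python
import math
from collections import Counter

# Average mutual information between the events {trigger symbol a occurred
# in the history} and {w = b}, estimated from a toy text by frequency counts.
def average_mi(text, a, b, hl=1, gap=1):
    joint = Counter()
    for t in range(gap + hl, len(text)):
        in_hist = a in text[t - gap - hl : t - gap]  # history starts 2 back
        joint[(in_hist, text[t] == b)] += 1
    total = sum(joint.values())
    mi = 0.0
    for (u, v), c in joint.items():
        p_uv = c / total
        p_u = sum(cc for (uu, _), cc in joint.items() if uu == u) / total
        p_v = sum(cc for (_, vv), cc in joint.items() if vv == v) / total
        mi += p_uv * math.log2(p_uv / (p_u * p_v))
    return mi

text = "the cat sat on the mat " * 50
# ("t", "e") is a strong trigger in this text; ("t", "z") carries no information
print(average_mi(text, "t", "e") > average_mi(text, "t", "z"))  # True
```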
Let us now examine what a set of constraints (3) will look like when we combine bigrams
and triggers. Here the history H (of some specific length) is defined to start 2 positions back
in the sequence and the bigram depends on the symbol 1 position back. We define the
concept of history equivalence classes. Suppose that we have pulled N_T triggers from the training
text, (α_k, β_k, P_k^Trigger = P(w=β_k | α_k∈H)), with k varying from 1 to N_T, and the probabilities P_k
estimated from the training text using frequency counts. Let us now list in the set {Unique}
all the distinct trigger symbols from the list α_1, α_2, …, α_{N_T}. Note that the same α_k can
trigger different β_k, so that card{Unique} = N_TW is in general less than N_T. We can now define the
equivalent history for the bigram-trigger model as

    H* = ( w_{-1}, I(Unique_1 ∈ H), …, I(Unique_{N_TW} ∈ H) )         (P5)

where w_{-1} is the symbol immediately preceding w, and I(Unique_k ∈ H) is an indicator function
that returns 1 iff the condition inside the parentheses is satisfied.
The equivalent history H* is distinct from the previously defined history H in the following
ways:
1. We include the symbol w_{-1} from the gap in the equivalent history;
2. In the remaining part of the equivalent history we no longer consider symbols from the
alphabet A but indicators of whether certain trigger symbols occurred in the history.
Thus, under the equivalent history framework, we do not distinguish between the pasts of
two symbols in the training text unless they are both preceded by the same symbol and the
same trigger symbols occur in their histories.
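A minimal sketch of computing H* for one position (the trigger set, window parameters, and sequence are illustrative):

```python
# Sketch of the equivalent history H*: the immediately preceding symbol w_-1
# plus one presence indicator per unique trigger symbol. The trigger symbols,
# window length, and example sequence below are illustrative.
def equivalent_history(text, t, unique_triggers, hl=5, gap=1):
    w_minus_1 = text[t - 1]
    history = text[max(0, t - gap - hl) : t - gap]  # window starting 2 back
    indicators = tuple(int(a in history) for a in unique_triggers)
    return (w_minus_1,) + indicators

unique_triggers = ["N", "W", "L"]
text = "MVLSPADKTN"
print(equivalent_history(text, 9, unique_triggers))  # ('T', 0, 0, 0)
print(equivalent_history(text, 6, unique_triggers))  # ('A', 0, 0, 1): "L" occurred
```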
There is a nice decision tree representation of an equivalent history. For a training text
containing S symbols from the alphabet A there are S histories which actually occur. For
each symbol let us identify its preceding symbol and history, then put it down the decision
tree. The tree first tests what the symbol w_{-1} is, which is O(|A|) comparisons and branches,
and for each k, level (k+1) tests whether I(Unique_k ∈ H) is 1, thus making one comparison.
All levels of this tree but the first one have a branching factor of 2, and the tree has a depth of
1 + N_TW. On the very bottom level we can accumulate the number of "real" training text
histories falling into each leaf - most leaves will have no histories, some might have more
than one. All histories falling into one leaf of the tree will be indistinguishable under the
history equivalence classes paradigm.
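In practice the leaves can be accumulated without materializing the tree, e.g. with a dictionary keyed by the equivalent history (the toy sequence, trigger set, and window parameters are illustrative):

```python
from collections import Counter

# Sketch: one leaf per distinct equivalent history H* that actually occurs
# in the training text, accumulated in a Counter keyed by H*.
def leaf_counts(text, unique_triggers, hl=3, gap=1):
    counts = Counter()
    for t in range(gap + hl, len(text)):
        history = text[t - gap - hl : t - gap]
        h_star = (text[t - 1],) + tuple(int(a in history) for a in unique_triggers)
        counts[h_star] += 1
    return counts

counts = leaf_counts("ABABABCAB", ["A", "C"])
for leaf, c in sorted(counts.items()):
    print(leaf, c)  # most possible leaves are absent; some hold several histories
```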
Now, we are ready to define the set of all constraints that we will impose on the probability
distribution for combining bigrams and triggers:
    P(H*) = f(H*)                                                     (P6)

    P(w, w_{-1}) = f(w, w_{-1})            if C(w, w_{-1}) ≥ K        (P7)

    P(w) = f(w)                            if C(w) ≥ L                (P8)

    P(w=β_k, α_k∈H) = P_k^Trigger          if C(w=β_k, α_k∈H) ≥ M     (P9)

    P(w≠β_k, α_k∈H) = f(α_k) − P_k^Trigger   if C(w=β_k, α_k∈H) ≥ M   (P10)
where f and C are relative frequencies and counts obtained from the training data. We will
refer to condition (P8) as simply the “priors” (i.e., the unigram or univariate marginal
constraints). Since we want to take into account the effect of both presence and absence of
the trigger symbols in the history of w, we include both constraints (P9) and (P10).
It can be easily shown (Jelinek, 1998) that given the constraints (P6)-(P10), and the goal of
minimizing cross-entropy, the general solution can be rewritten as
    P(w, H*) = Q(w, H*) · μ_0 · μ_1(w)^I(C(w)≥L) · μ_2(H*) · μ_3(w, w_{-1})^I(C(w,w_{-1})≥K) ·

               · Π_{k=1..N_T}  λ_k^I(C(w=β_k, α_k∈H)≥M) · ν_k^I(C(w≠β_k, α_k∈H)≥M)        (P11)
where we note that now the second parameter is the equivalent history. The constants K, L,
and M are thresholds that can be set to determine the minimum frequency count in the data
necessary for a constraint to be included. Note that if the count for any of the constraints is
less than the constant on the right then there will be no factor corresponding to this symbol
in the product. This means that we can reduce the complexity of both the model and the
iterative scaling algorithm by increasing the value of these constants (Jelinek, 1998, discusses
what problems this increase may cause for the test text). Here we simply set the constants K,
L, and M all to be 1.
Now we bound the number of parameters C in function (P11). There are at most |A|
parameters that correspond to the constraints introduced by the priors (P8). There are at
most |A|^2 parameters that correspond to constraints introduced by bigrams (P7). The
number of bigrams satisfying the condition C(w, w_{-1}) ≥ K = 1 is also bounded by S (the total number
of symbols in the training text); the upper bound is, thus, min(S, |A|^2). There are at most S
parameters that correspond to constraints introduced by relative frequencies of equivalent
histories (P6). Finally, there is a total of 2·N_T terms corresponding to the trigger constraints (P9)
and (P10). The number of triggers N_T is in turn bounded by |A|^2, since the latter is the total
possible number of pairs of symbols in the alphabet A. Thus, the overall memory complexity is
bounded by 1 + |A| + S + 2·|A|^2 + min(S, |A|^2), which is O(S + |A|^2) parameters.
In the appendix, we provide details on how to perform step (5) of the generalized iterative
scaling algorithm in order to estimate parameters of the distribution. Here, we note that the
overall time complexity is O(S·N_T·|A|^2) per iteration. In all of the experiments below the
algorithm converged to within 1% of the specified constraints in about 20 iterations.
5.2 Experimental Results on Protein Sequence Prediction with Triggers
We have applied the sequential triggers model and iterative scaling algorithm to modeling of
sequential protein data. Proteins are large macromolecules responsible for many fundamental
physiological processes in living organisms. Their primary structure is defined by a sequence
of amino acids. Understanding the sequential dependence of amino acids in protein
sequences is an important first step in developing an improved understanding of the role and
function of different proteins (Setubal and Meidanis, 1997). The purpose of this experiment
was to explore sequence structure beyond simple Markov modeling, in particular the use of
trigger models.
For this set of experiments we used a well known set of 585 hemoglobin protein sequences,
available on-line at http://www.isb-sib.ch/. We used various combinations of prior, bigram, and
trigger models for modeling the joint distribution of the current symbol and its equivalent
history. The size of the alphabet |A|=20 since there are 20 amino acids. For all of our
experiments we performed 10-fold cross-validation. We estimated the mean and standard deviation
of the decrease in empirical cross-entropy on the training and test sets for a variety of different
models which were fit to the training data. The following modeling strategies were used,
where in all cases priors, bigrams, and triggers were estimated based on simple frequency
counts from the training data:
1. A model based only on priors (this is a rather simple baseline model).
2. A model based on all priors and bigrams: this model is fully specified and we do not
need to use iterative scaling to find the joint distribution (i.e., it can be directly specified).
3. A model based on all priors and selected bigrams. This requires the use of iterative
scaling to estimate P(w|H*)=P(w|w-1).
4. A model based on combinations of priors, bigrams and triggers for different choices of
lengths of history (from 1 to 20). The constants K=L=M had a fixed value of 1.
For the last experiment we also varied the number of T top triggers and B top bigrams
chosen per symbol of the alphabet using the mutual information criterion. There is a
tradeoff here. If we have too few bigrams or triggers then we will not get a good probability
model, but if we have too many the estimates will be noisy and the model will not be as
accurate on test data.
Tables 1 and 2 show the top 10 bigrams and triggers found in the training data with a history
length of 1 (i.e., the triggers are based on the occurrence of a symbol 2 positions back, and
the bigrams are based on symbols 1 position back). The tables illustrate how the mutual
information method works for bigrams and triggers. The average mutual information selects
pairs for which the conditional probability P(B|A) is substantially different from the prior
P(B). Note also that apart from the top bigram, many of the triggers have a higher mutual
information than the bigrams, indicating that there may be more information in the sequence
further “back” than in the first-order Markov bigram window.
Table 1. Top 10 bigrams for the training data, history length is 1.

Symbol A, w-1=A | Prior(A) | Symbol B, w=B | Prior(B) | P(B|A) | Aver. MI(w-1=A, w=B)
L | 0.117 | S | 0.066 | 0.221 | 0.02625
R | 0.027 | M | 0.012 | 0.142 | 0.00956
H | 0.060 | C | 0.011 | 0.077 | 0.00841
Y | 0.024 | R | 0.027 | 0.188 | 0.00810
R | 0.027 | V | 0.087 | 0.305 | 0.00787
N | 0.033 | F | 0.052 | 0.208 | 0.00742
D | 0.058 | K | 0.077 | 0.200 | 0.00688
F | 0.052 | P | 0.038 | 0.134 | 0.00653
H | 0.060 | G | 0.062 | 0.166 | 0.00619
Y | 0.024 | F | 0.052 | 0.216 | 0.00574

Table 2. Top 10 triggers for the training data, history length is 1.

Symbol A, A∈H | Prior(A) | Symbol B, w=B | Prior(B) | P(B|A∈H) | Aver. MI(A∈H, w=B)
N | 0.033 | K | 0.078 | 0.327 | 0.01321
W | 0.011 | K | 0.078 | 0.582 | 0.01303
L | 0.117 | L | 0.117 | 0.018 | 0.01291
H | 0.060 | D | 0.058 | 0.207 | 0.01219
R | 0.027 | F | 0.053 | 0.260 | 0.00959
P | 0.038 | N | 0.034 | 0.169 | 0.00880
K | 0.078 | R | 0.027 | 0.096 | 0.00716
V | 0.088 | P | 0.038 | 0.108 | 0.00661
G | 0.062 | E | 0.041 | 0.129 | 0.00646
P | 0.038 | T | 0.054 | 0.184 | 0.00621
For the training and test data sets, the empirical cross-entropy can be defined as

    CE(Model) = −(1/S) · Σ_{k=1..S} log( P(w = w_k | H*, Model) )     (P13)
where S as before is the total number of symbols in the text and wk is the observed symbol
that appears in the k-th position in the text. Cross-entropy is non-negative and measures the
empirical conditional uncertainty (entropy, in bits) in the model’s predictions: the lower the
better. There is also an intuitive interpretation of this logP score. If we consistently assign
observed symbols high probabilities we will do well, whereas if we consistently assign
observed symbols low probabilities, we will do poorly (as reflected in the cross entropy
score).
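A small sketch of evaluating (P13) for a priors-only predictor on a toy two-symbol text (illustrative; the real experiments condition on the equivalent history H*):

```python
import math

# Sketch of the empirical cross-entropy (P13): the average negative log2
# probability the model assigns to the observed symbols. The model here is
# an illustrative unigram ("priors only") predictor.
def empirical_ce(text, prob_of):
    return -sum(math.log2(prob_of(w)) for w in text) / len(text)

text = "AABAB"
priors = {w: text.count(w) / len(text) for w in set(text)}
ce = empirical_ce(text, lambda w: priors[w])
print(ce)  # below the uniform baseline of log2(2) = 1 bit for this alphabet

baseline = math.log2(2)
print(baseline - ce)  # decrease in CE relative to the uniform baseline
```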
Our training and test data sets are characterized as follows:
• Size of the alphabet |A| = 20,
• Total number of symbols in the training data set = 57103,
• Total number of symbols in the test data set = 28590,
• A very simple baseline score can be defined as the cross-entropy of a model with a
uniform distribution for P, namely log2|A| = log2(20) = 4.3219 bits. Thus, all decreases in
reported cross-entropy are relative to this baseline uncertainty.
The termination accuracy of the approximation in all experiments was 1%, i.e., the iterative
scaling was halted when the distribution came within 1% of the desired constraints. One of the
important modeling questions was what the number of triggers and bigrams should be. In
the experiments we looked at both the top 3 triggers and bigrams per symbol (Table 3) and the top 6
(Table 4). We also added an additional requirement that the average mutual information of a
trigger or of a bigram should be more than 0.001 bits for it to be incorporated into the
model.
We first ran simple tests in which we estimated priors and bigrams from the training data
and evaluated decrease in cross entropy due to them (columns 2 and 3 of Tables 3 and 4).
As expected, bigrams do better than priors alone on both the training and test data, and
there is a smaller improvement (decrease in uncertainty) on the test data than on the training
data. This indicates that there is at least some memory in the protein sequences. The next
column in each Table contains the results for the models incorporating top 3 or top 6
bigrams (per symbol) with priors. These models have fewer parameters than the full bigram
model, are better than the priors alone, but not as accurate as the full bigram model.
The last two columns in each Table show the results for triggers and priors, and triggers,
priors and bigrams, respectively. There are a few different things worth noting here:
• The best model overall (on either test or training data, for both 3 or 6 bigrams/triggers)
is the priors+some bigrams+some triggers model with a history length of one (rightmost
column, top row). The fact that this holds up out-of-sample indicates that the trigger
“rules” are in fact capturing some true regularities in the sequences.

A history of length 1 appears to perform best. As the width of the history window
increases we are allowing more long-range memory into the model, but the precision (in
terms of where the symbol occurs in the window) is reduced. It appears that precision is
more important than length (refer to tables 6 and 7 below for experiments with location
specific generalized triggers).

Priors+bigrams outperform priors+bigrams for all history lengths. Models based on
priors+triggers for short histories outperform models based on top 3 or 6 bigrams.
11

The best model overall is for 6 bigrams and triggers, combining triggers, bigrams, and
priors. In principle, one could keep adding more triggers, in this case more information
will be incorporated in the model. Table 5 shows that using all triggers and bigrams with
mutual information greater than 0.001 bit gives results comparable to the that of a model
based on 6 top triggers and bigrams (especially for the test set) which essentially means
that the latter model already absorbs all major information.

Standard deviation of scores on the test set is substantially higher than on the training
set. Analysis of performance on different folds reveals that there were two folds with
sequences of symbols strikingly different from the rest of the family so that neither
triggers nor bigrams could improve scoring of the model based on priors.
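The train/test pattern discussed above can be illustrated at toy scale. The sketch below is our own illustrative code (Laplace smoothing stands in for whatever estimation procedure the paper used): it measures the decrease in cross-entropy of a bigram model relative to the uniform baseline, on the data it was fit to and on a held-out sequence.

```python
import math
from collections import Counter, defaultdict

def bigram_model(train, alpha=1.0):
    """Laplace-smoothed bigram model P(cur | prev) estimated from `train`."""
    alphabet = sorted(set(train))
    counts = defaultdict(Counter)
    for prev, cur in zip(train, train[1:]):
        counts[prev][cur] += 1
    v = len(alphabet)
    def prob(cur, prev):
        return (counts[prev][cur] + alpha) / (sum(counts[prev].values()) + alpha * v)
    return prob

def ce_decrease(prob, seq, alphabet_size):
    """Decrease in cross-entropy (bits/symbol) relative to the uniform baseline."""
    ce = -sum(math.log2(prob(cur, prev))
              for prev, cur in zip(seq, seq[1:])) / (len(seq) - 1)
    return math.log2(alphabet_size) - ce

model = bigram_model("ABABABABABAB")
train_gain = ce_decrease(model, "ABABABABABAB", 2)
test_gain = ce_decrease(model, "ABABABBAABAB", 2)  # held-out toy sequence
```

As in the tables, the decrease in uncertainty is larger on the sequence the model was estimated from than on held-out data.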
Table 3. Decrease in CE for different models. |A|=20, K=L=M=1; at most 3 top bigrams and at most
3 top triggers with mutual information greater than 0.001 bits are retained (unless otherwise specified).
History   Priors           All Bigrams       Bigrams w.        Priors and        Priors, Bigrams
Length                     w/o estimation    estimation        Triggers          and Triggers
          Train    Test    Train    Test     Train    Test     Train    Test     Train    Test
(std)     (.005)   (.047)  (.022)   (.220)   (.014)   (.130)   (.034)   (.146)   (.038)   (.230)
1         .257     .256    .564     .511     .437     .399     .464     .424     .636     .567
2         "        "       "        "        "        "        .447     .424     .614     .561
5         "        "       "        "        "        "        .428     .381     .601     .521
10        "        "       "        "        "        "        .417     .390     .594     .532
15        "        "       "        "        "        "        .394     .363     .572     .507
20        "        "       "        "        "        "        .371     .351     .546     .495
(Ditto marks: the priors and bigram models do not use the history window, so their scores do not
vary with history length. Parenthesized values in the (std) row are standard deviations.)
Table 4. Decrease in CE for different models. |A|=20, K=L=M=1; at most 6 top bigrams and 6 top
triggers with mutual information greater than 0.001 bits are retained (unless otherwise specified).
History   Priors           All Bigrams       Bigrams w.        Priors and        Priors, Bigrams
Length                     w/o estimation    estimation        Triggers          and Triggers
          Train    Test    Train    Test     Train    Test     Train    Test     Train    Test
(std)     (.005)   (.047)  (.022)   (.220)   (.021)   (.159)   (.045)   (.177)   (.050)   (.280)
1         .257     .256    .564     .511     .478     .425     .512     .461     .722     .629
2         "        "       "        "        "        "        .499     .463     .701     .622
5         "        "       "        "        "        "        .482     .421     .694     .590
10        "        "       "        "        "        "        .460     .416     .678     .588
15        "        "       "        "        "        "        .433     .392     .651     .560
20        "        "       "        "        "        "        .398     .369     .630     .539
(Ditto marks: the priors and bigram models do not use the history window, so their scores do not
vary with history length. Parenthesized values in the (std) row are standard deviations.)
Table 5. Decrease in CE for different models. |A|=20, K=L=M=1; all bigrams and all
triggers with mutual information greater than 0.001 bits are retained (unless otherwise specified).
History   Priors           All Bigrams       Bigrams w.        Priors and        Priors, Bigrams
Length                     w/o estimation    estimation        Triggers          and Triggers
          Train    Test    Train    Test     Train    Test     Train    Test     Train    Test
(std)     (.005)   (.047)  (.022)   (.220)   (.021)   (.170)   (.027)   (.210)   (.068)   (.320)
1         .257     .256    .564     .511     .497     .447     .528     .478     .757     .665
(Parenthesized values in the (std) row are standard deviations.)
The results with the history length suggest one specific refinement to the method, namely to
allow multiple histories of different lengths and different locations. We now discuss
how the results above change if we make triggers location-specific, i.e., if we no longer
consider a trigger as merely belonging to the history but as occupying a specific location in the
history, say k symbols back from the symbol being predicted. We will call these triggers
generalized. To illustrate the idea of generalized triggers, let us as before treat the history of a
symbol w as the L symbols preceding w, separated from w by a gap, as shown in the
figure below. To define the concept of an equivalent history for standard triggers we only had
to know which of the trigger symbols occurred in the history of a symbol. For
example, if the symbol "A" was a trigger symbol, then histories were
discriminated based only on whether they contained "A", regardless of the position of "A"
in them.
Under the generalized-triggers framework, we redefine the notion of a trigger in the following
manner: we call a four-tuple $(\alpha, \beta, k, P)$ a generalized trigger if
$P(w = \beta,\, w_{-k} = \alpha) = P$. An equivalent history of any symbol can now be defined similarly to
what we did for conventional triggers.
Note that for a history of length 1 the generalized and regular models coincide. For longer
histories our expectation is that the generalized model provides more valuable
information about the symbol being predicted, since location-specific triggers at distances
greater than 1 in many cases have higher mutual information with the predicted symbol
than conventional triggers. To illustrate this point, consider Tables 6 and 7.
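The difference between the two indicator types can be made concrete. The sketch below is our own illustrative code: it builds (indicator, next-symbol) pairs for a conventional trigger ("A anywhere in the history") and a generalized one ("A exactly k positions back"), and scores each by its estimated mutual information with the predicted symbol.

```python
import math
from collections import Counter

def mi_bits(pairs):
    """I(X;Y) in bits, estimated from observed (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def trigger_pairs(seq, a, hist_len, k=None):
    """Pair each predicted symbol with a trigger indicator for symbol `a`:
    conventional (k is None): does `a` occur anywhere in the last hist_len symbols?
    generalized: is the symbol exactly k positions back equal to `a`?"""
    pairs = []
    for i in range(hist_len, len(seq)):
        fired = (a in seq[i - hist_len:i]) if k is None else (seq[i - k] == a)
        pairs.append((fired, seq[i]))
    return pairs

seq = "AXTZZAXTZZAXTZZ"  # toy sequence, not protein data
conventional = mi_bits(trigger_pairs(seq, "A", hist_len=5))
generalized = mi_bits(trigger_pairs(seq, "A", hist_len=5, k=2))
```

The two scores can then be compared candidate-by-candidate, exactly as Tables 6 and 7 compare conventional and generalized triggers for the symbol "T".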
Table 6. Top 5 triggers for the symbol "T" in the training data, history length is 5.
Symbol A, A in H   Prior(A)   Symbol B, w=B   Prior(B)   P(B|A)   Avg. MI(A in H, w=B)
W                  0.001      T               0.054      0.274    0.00392
H                  0.060      T               0.054      0.007    0.00309
D                  0.058      T               0.054      0.006    0.00306
A                  0.116      T               0.054      0.022    0.00232
F                  0.053      T               0.054      0.108    0.00186
Table 7. Top 5 Generalized triggers for the symbol "T" in the training data, history length is 5.
Symbol A, w(-k)=A   Prior(A)   Symbol B, w=B   Prior(B)   P(B|A)   Avg. MI(w(-k)=A, w=B)
P (k=-4)            0.038      T               0.054      0.222    0.00958
P (k=-2)            0.038      T               0.054      0.184    0.00621
F (k=-5)            0.053      T               0.054      0.151    0.00526
F (k=-6)            0.053      T               0.054      0.141    0.00432
C (k=-4)            0.012      T               0.054      0.252    0.00370
It is remarkable that the top 5 generalized trigger symbols are all different from the top 5
conventional trigger symbols (column 1 of Tables 6 and 7). Moreover, for all but the last entry in Table 7 and
the first entry in Table 6, the average mutual information scores for generalized triggers are
higher than those for conventional triggers, and the top score for generalized triggers is almost
3 times greater than the top score in Table 6. As Tables 8 and 9 show, we also obtain larger
decreases in cross-entropy from this generalization of the notion of a trigger.
Table 8. Generalized Triggers. Decrease in CE for different models. |A|=20, K=L=M=1; at most 3
top bigrams and at most 3 top triggers with mutual information greater than 0.001 bits are retained (unless
otherwise specified).
History   Priors           All Bigrams       Bigrams w.        Priors and        Priors, Bigrams
Length                     w/o estimation    estimation        Triggers          and Triggers
          Train    Test    Train    Test     Train    Test     Train    Test     Train    Test
(std)     (.005)   (.047)  (.022)   (.220)   (.014)   (.130)   (.091)   (.235)   (.080)   (.305)
1         .257     .256    .564     .511     .437     .399     .464     .424     .636     .567
2         "        "       "        "        "        "        .548     .510     .706     .642
5         "        "       "        "        "        "        .625     .535     .769     .666
10        "        "       "        "        "        "        .672     .575     .812     .703
15        "        "       "        "        "        "        .698     .601     .838     .725
20        "        "       "        "        "        "        .718     .626     .854     .746
(Ditto marks: the priors and bigram models do not use the history window, so their scores do not
vary with history length. Parenthesized values in the (std) row are standard deviations.)
Table 9. Generalized Triggers. Decrease in CE for different models. |A|=20, K=L=M=1; at most 6 top
bigrams and at most 6 top triggers with mutual information greater than 0.001 bits are retained (unless otherwise
specified).
History   Priors           All Bigrams       Bigrams w.        Priors and        Priors, Bigrams
Length                     w/o estimation    estimation        Triggers          and Triggers
          Train    Test    Train    Test     Train    Test     Train    Test     Train    Test
(std)     (.005)   (.047)  (.022)   (.220)   (.021)   (.159)   (.045)   (.177)   (.050)   (.280)
1         .257     .256    .564     .511     .478     .425
2         "        "       "        "        "        "
5         "        "       "        "        "        "
10        "        "       "        "        "        "
15        "        "       "        "        "        "
20        "        "       "        "        "        "
(Ditto marks: the priors and bigram models do not use the history window, so their scores do not
vary with history length. Parenthesized values in the (std) row are standard deviations.)
One may notice that the results in Tables 8 and 9 are somewhat different from those reported
in Tables 3 and 4. For models based on generalized triggers, performance
improves as the history length increases. The best model corresponds to history length 20, and
this is not surprising, since longer histories give a better chance of finding a higher-mutual-information
trigger for a symbol (Tables 6 and 7 illustrate this). Also, the best model
for the top 3 triggers gives at least a 12% improvement in score over the model based on all
conventional triggers and all bigrams (Table 5 versus the last row of Table 8). We also notice that
generalized-trigger models with long histories (without bigrams) outperform models based
on all bigrams, which was never the case with conventional triggers. All this suggests that the
concept of generalized triggers and its other possible extensions might help further improve
performance.
An important feature of the overall approach is that including such "patterns" in the
model can be carried out in an entirely automatic manner using the iterative scaling
procedure. In fact, there are numerous generalizations of the basic idea presented here, such
as allowing a more general language for trigger rules with multiple symbols, disjunctions, and
so forth. Parameters such as history lengths and the numbers of triggers and bigrams can all,
in principle, be chosen automatically using cross-validated estimates of cross-entropy.
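The cross-validated selection mentioned above can be sketched generically. This is our own illustrative code, not the paper's: `score_model` is a hypothetical callback that fits a model on the training folds and returns the cross-entropy decrease on the held-out fold.

```python
def k_fold_cv(sequences, candidate_params, score_model, k=5):
    """Pick the parameter value (e.g. a history length) whose mean held-out
    cross-entropy decrease across k folds is largest."""
    folds = [sequences[i::k] for i in range(k)]

    def mean_cv_score(param):
        scores = []
        for i in range(k):
            held_out = folds[i]
            train = [s for j, f in enumerate(folds) if j != i for s in f]
            scores.append(score_model(train, held_out, param))
        return sum(scores) / k

    return max(candidate_params, key=mean_cv_score)
```

The same loop applies equally well to choosing the number of triggers or the mutual-information threshold.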
6. Conclusion
Data mining algorithms are very good at efficiently finding patterns in the form of
conjunctive expressions (“rules”) with attached frequencies of occurrence in large data sets.
However, the problem of combining these patterns into a coherent global model has been
relatively unexplored. In this paper we proposed a straightforward approach to this problem
by viewing the local patterns as constraints on an unknown high-order distribution. The
iterative scaling procedure can then be used to find the specific high-order distribution that
is closest in a cross-entropy sense to a specified prior. We illustrated the application of this
idea on two different problems, query selectivity estimation and protein sequence prediction,
and in both cases the method using local patterns provided significant improvements over
conventional alternatives.
Acknowledgments
The work of DP and PS described in this paper was supported by the National Science
Foundation under Grant IRI-9703120.
References
A.L. Berger, S.A. Della Pietra and V.J. Della Pietra. A Maximum Entropy Approach to
Natural Language Processing. Computational Linguistics, March 1996, Vol. 22, No. 1, 39-72.
I. Csiszár, G. Tusnády. Information Geometry and Alternating Minimization Procedures.
Statistics and Decisions, R. Oldenbourg Verlag, Munich, 1984, Supp. Issue No. 1, 205-237.
I. Csiszár. A Geometric Interpretation of Darroch and Ratcliff's Generalized Iterative
Scaling. The Annals of Statistics, 1989, Vol. 17, No. 3, 1409-1413.
J.N. Darroch, D. Ratcliff. Generalized Iterative Scaling for Log-Linear Models. The Annals of
Mathematical Statistics, 1972, Vol. 43, No. 5, 1470-1480.
F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, 1998.
S. Kullback. Information Theory and Statistics. Wiley, New York, 1959.
B. Liu, W. Hsu, Y. Ma, Integrating Classification and Association Rule Mining, Proceedings
of the Fourth International Conference on Knowledge Discovery and Data Mining, R.
Agrawal and P. Stolorz (eds.), Menlo Park, CA: AAAI Press, 80-96, 1998.
E. Raftery and S. Tavere, ????
R. Rosenfeld. A Maximum Entropy Approach to Adaptive Statistical Language Modelling.
Computer Speech and Language, July 1996, Vol. 10, No. 3, 187-228.
L. Saul and M. I. Jordan, Mixed memory Markov models, Machine Learning, to appear.
J.C. Setubal, J.M. Meidanis. Introduction to Computational Molecular Biology. PWS Publishing
Company, 1997.
Appendix: The Iterative Scaling Algorithm for the Bigram Model with Triggers
Let us first rewrite the update rules (P5)-(P7) for the normalizing constant and for the
multiplier $\mu_{ij}$ corresponding to constraint $j$ in the $i$-th group ($i$ ranges over the
constraint groups (P6)-(P10)). For iteration $N$ they are:

$S_{ij} = \sum_{(w,H^*)} P^{(N-1)}(w,H^*)\, k(w,H^* \mid (i,j))$   (P5)

$\mu_0^{(N)} = \mu_0^{(N-1)}\, \dfrac{1-d_{ij}}{1-S_{ij}}$   (P6)

$\mu_{ij}^{(N)} = \mu_{ij}^{(N-1)}\, \dfrac{d_{ij}\,(1-S_{ij})}{(1-d_{ij})\, S_{ij}}$   (P7)

In order to compute the right-hand sides of rules (P6) and (P7), one needs to sum the
distribution obtained on the previous iteration over all possible values of $(w,H^*)$ satisfying
constraint $j$ in the group number $i$. We denote this sum $S_{ij}$, as above, and show how
to carry out this computation for a constraint in each group. To keep the notation
simpler we omit the superscript $N$ standing for the current iteration number.

1. Group (P7), bigrams: let $w=A$, $w_{-1}=B$, with $C(w{=}A, w_{-1}{=}B) \ge K$. Then

$S_{(7),\,w=A,\,w_{-1}=B} = \mu_0\; \mu_1^{I(C(w=A)\ge L)}(w{=}A)\; \mu_3(w{=}A, w_{-1}{=}B)
\sum_{H} \mu_2(H, w_{-1}{=}B) \prod_{k=1}^{N_T} \mu_{t_k}^{I(C(w=A,\,\alpha_k\in H)\ge M)}\;
\bar{\mu}_{t_k}^{I(C(w=A,\,\alpha_k\notin H)\ge M)}$

2. Group (P8), single symbols: let $w=A$ with $C(w{=}A)\ge L$. Then

$S_{(8),\,w=A} = \mu_0\; \mu_1(w{=}A) \sum_{w_{-1}} \mu_3^{I(C(w=A,\,w_{-1})\ge K)}(w{=}A, w_{-1})
\sum_{H} \mu_2(H, w_{-1}) \prod_{k=1}^{N_T} \mu_{t_k}^{I(C(w=A,\,\alpha_k\in H)\ge M)}\;
\bar{\mu}_{t_k}^{I(C(w=A,\,\alpha_k\notin H)\ge M)}$

3. Group (P6), histories: the history equivalence class $H^*$ identifies $w_{-1}=B$ and, for any
specific symbol $w$, which triggers fired and which did not. Then

$S_{(6),\,H^*} = \mu_0\; \mu_2(H^*) \sum_{w} \mu_1^{I(C(w)\ge L)}(w)\;
\mu_3^{I(C(w,\,w_{-1}=B)\ge K)}(w, w_{-1}{=}B)
\prod_{\text{fired } t_k} \mu_{t_k}^{I(C(w,\,\alpha_k\in H)\ge M)}
\prod_{\text{not fired } t_k} \bar{\mu}_{t_k}^{I(C(w,\,\alpha_k\notin H)\ge M)}$

4. Groups (P9)-(P10), for triggers and "anti"-triggers, are similar; the following is the summation
for group (P9). Suppose we update the $p$-th trigger, $C(w{=}\alpha_p, \alpha_p\in H)\ge M$, and we know
that it has fired, so that

$S_{(9),\,w=\alpha_p,\,\alpha_p\in H} = \mu_0\; \mu_1^{I(C(w=\alpha_p)\ge L)}(w{=}\alpha_p)
\sum_{w_{-1}} \mu_3^{I(C(w=\alpha_p,\,w_{-1})\ge K)}(w{=}\alpha_p, w_{-1})
\sum_{H:\,\alpha_p\in H} \mu_2(H, w_{-1}) \prod_{k=1}^{N_T} \mu_{t_k}^{I(C(w=\alpha_p,\,\alpha_k\in H)\ge M)}\;
\bar{\mu}_{t_k}^{I(C(w=\alpha_p,\,\alpha_k\notin H)\ge M)}$

The time complexity of performing steps 1, 2, and 3 is $O(S \cdot |A| \cdot N_T)$, and that of the last
step is $O(S \cdot N_T \cdot N_T) = O(S \cdot N_T \cdot |A|^2)$, so the overall time complexity is
$O(S \cdot N_T \cdot |A|^2)$.
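As a concrete illustration of this style of multiplicative update, here is a minimal iterative-proportional-fitting sketch on a two-variable toy problem. This is our own simplified variant (plain d/S updates against a uniform starting distribution, with the 1% stopping rule), not the paper's exact (P5)-(P7) rules:

```python
import itertools
import math

def iterative_scaling(n_vars, features, targets, rel_tol=0.01, max_iter=1000):
    """Fit multipliers mu_j so that the expectation of each indicator feature
    matches its target, starting from the uniform distribution over binary
    tuples. Stops when every expectation is within rel_tol of its target."""
    mus = [1.0] * len(features)
    states = list(itertools.product([0, 1], repeat=n_vars))
    p = [1.0 / len(states)] * len(states)
    for _ in range(max_iter):
        # Product-form distribution induced by the current multipliers.
        w = [math.prod(m for m, f in zip(mus, features) if f(x)) for x in states]
        z = sum(w)
        p = [wi / z for wi in w]
        # S_j: current expectation of feature j under p.
        S = [sum(pi for pi, x in zip(p, states) if f(x)) for f in features]
        if all(abs(s - d) <= rel_tol * d for s, d in zip(S, targets)):
            break
        mus = [m * d / s for m, d, s in zip(mus, targets, S)]
    return dict(zip(states, p))

# Constrain P(x0 = 1) = 0.3 and P(x1 = 1) = 0.6 over two binary variables.
features = [lambda x: x[0] == 1, lambda x: x[1] == 1]
dist = iterative_scaling(2, features, [0.3, 0.6])
```

The per-constraint sums computed in steps 1-4 above play the role of `S` here; the cost of each real iteration is dominated by computing those sums over the state space.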