Prediction with Local Patterns using Cross-Entropy

Paper ID: Number 430

Heikki Mannila*, Dmitry Pavlov+, and Padhraic Smyth+
* Microsoft Research, Redmond, WA
+ Information and Computer Science, University of California at Irvine

Abstract

Sets of local patterns in the form of rules and co-occurrence counts are a common type of output from many data mining algorithms, such as association rule algorithms. A known problem with such sets of patterns is how to use them for prediction, i.e., how to synthesize local sparse information into a coherent global predictive model. In this paper we investigate the use of a cross-entropy approach to combining local patterns. Each local pattern is viewed as a constraint on an appropriate high-order joint distribution of interest. Typically, a set of patterns returned by a data mining algorithm under-constrains the high-order model. The cross-entropy criterion is then used to select a specific distribution in this constrained family relative to a prior. We review the iterative scaling algorithm, an iterative technique for finding a joint distribution given constraints. We then illustrate the application of this method to two specific problems. The first problem is combining information about frequent itemsets. We show that the cross-entropy approach can be used for query selectivity estimation for 0/1 data sets. The results show that we can accurately answer a large class of queries using just a small set of aggregate information. The second problem involves sequence prediction given historical rules, with an application to protein modeling. We conclude that viewing local patterns as constraints on a high-order probability model is a useful and principled framework for prediction based on large sets of mined patterns.

1. Introduction

In this paper we investigate the problem of how to combine local partial information specified as conditional probabilities (e.g., probabilistic rules) or joint probabilities (e.g., frequent itemsets) into a single global model which can then be used for prediction. This is an important problem in data mining. There are many algorithms in existence for finding local patterns. A common complaint is that these algorithms return large sets of "rules" but that it is difficult to evaluate the properties of these rules as a set, e.g., it is difficult to perform prediction or to compare the quality of one set against another in an objective manner. While there are already some approaches to combining local patterns for specific problems like classification (e.g., Liu, Hsu, and Ma (1998)), here we focus explicitly on a probabilistic framework. Note that we will not focus at any length in this paper on how the patterns or rules can be found from the data (there is a large body of techniques available to do this already). Instead we focus on the problem of, once given the patterns, how can one combine them for prediction?

In this paper we will consider the discrete-valued problem; thus, we will be assuming that a vector-valued random variable x = (x1, x2, ..., xn) takes values from some finite alphabet. We are interested in estimating the joint distribution P(x1, x2, ..., xn) = P(x), or some function of P(x) such as a conditional distribution, given sets of local rules containing information of the form p(xi=a | xj=b, xk=c) and/or p(xi=a, xj=b, xk=c). Let us first consider some general issues in learning P(x) from data. In general, if for example each xi can take m distinct values, specification of P(x) will require O(m^n) different entries in the joint probability table.
It is clear that estimating the full joint distribution from data is often impractical given even moderate values of m and n, because of the exponential growth in the number of different probabilities which must be specified. One general approach is to reduce the complexity of the joint distribution model by making structural assumptions about the independence relations between the x variables. For example, if we are willing to assume that all variables are independent of each other (typically not a very useful model), then we only need O(mn) parameters to specify the joint distribution. For a first-order Markov assumption (given some ordering defined on the variables) we would need O(m^2 n), and so forth. Graphical models or belief networks provide a general framework for specifying and manipulating a very general class of such independence models. These Markov assumptions typically model dependence only at the variable level. An example of a variable-level dependency would be to say that each letter in a sequence of text depends only on the immediately preceding letter. Variable-level models do not easily allow selective encoding of dependencies at the value level. A value-level dependency specifies a dependence only between particular values of two variables, rather than assuming dependence for all pairs of values. For example, a value-level dependence model for text could specify that certain letters depend on certain preceding letters, e.g., "h" follows "t" with some probability, but there is no explicit statement made in the model about the dependence of "h" on other letters which precede it. For every pairwise direct variable-level dependency in the model, we need O(m^2) parameters to specify this dependency (i.e., all the possible m × m conditional probabilities between the pairs of values). For large alphabets (i.e., large m) this parameterization may be expensive in terms of the data required to justify it.
For example, in language modeling at the symbol level, even specifying bigram models (i.e., first-order Markov dependence), with a typical vocabulary on the order of m = 10,000 symbols, can be a tremendously difficult task requiring vast training data sets. If we go beyond first-order dependence at the variable level, the parameterization problem becomes worse, i.e., we need O(m^k) parameters for a (k−1)-th-order model. Again using text sequences as an example, for higher-order dependencies we may be more interested in value-specific dependencies than variable-level dependencies: e.g., the trigram "ion" is quite specific to these 3 letters, while combinations such as "tyw" can be modeled as largely independent. We can generalize the idea of value-specific dependencies to allow more general events, such as: if the word "data" or "statistics" has occurred in the past 20 words, the probability of seeing the word "hypothesis" as the next word is increased (this is captured in the notion of "triggers," Rosenfeld, 1996). A related idea is to approximate high-order dependencies by combinations of low-order models, e.g., p(x4|x3,x2,x1) as a linear combination of p(x4|x3), p(x4|x2), and p(x4|x1) (e.g., see Raftery and Tavaré (1995) and Saul and Jordan (in press) in a Markov sequence modeling context). Note that such a model reduces the number of parameters from O(m^k) to O(km^2), but of course true high-order interactions may not be accurately captured. More generally, we want to combine various "low-order" pieces of information in the form of rules into a coherent high-order model. The primary focus of many current data mining algorithms is to extract such low-order information in an efficient manner. For example, association rule algorithms essentially pull out all conditional probabilities of the form p(xi=a | xj=b, xk=c) which are above some threshold and for which p(xj=b, xk=c), the support, is also above a prespecified threshold.
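As an illustration of the kind of local patterns such algorithms return, the following sketch extracts pairwise conditional-probability rules from a 0/1 data set. The function name and thresholds are hypothetical, and a real association-rule miner (e.g., Apriori) searches higher-order itemsets far more efficiently; this is only meant to make the "confidence above a threshold, support above a threshold" idea concrete.

```python
from itertools import combinations

def mine_rules(rows, min_support, min_conf):
    """Extract pairwise rules p(x_b=1 | x_a=1) from 0/1 rows.

    A rule (a -> b) is kept when the joint support p(x_a=1, x_b=1)
    and the confidence p(x_b=1 | x_a=1) both clear the thresholds.
    Illustrative sketch only: real miners prune the search and
    handle itemsets of arbitrary size.
    """
    n, total = len(rows[0]), len(rows)
    rules = []
    for j, i in combinations(range(n), 2):
        for a, b in ((j, i), (i, j)):
            joint = sum(1 for r in rows if r[a] and r[b]) / total
            pa = sum(1 for r in rows if r[a]) / total
            if pa > 0 and joint >= min_support and joint / pa >= min_conf:
                rules.append((a, b, joint / pa))
    return rules

# Four records over three binary attributes (made-up data):
rows = [(1, 1, 0), (1, 1, 1), (0, 1, 1), (1, 0, 0)]
print(mine_rules(rows, min_support=0.5, min_conf=0.6))
```

The rule (2, 1, 1.0), for instance, says that whenever attribute 2 is 1, attribute 1 is also 1 in this toy data set.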
These types of patterns are "local" to small sets of specific variables and specific values of those variables. There are numerous similar "rule-finding" algorithms that can efficiently produce large lists of such local patterns. Typically the patterns are chosen based on the fact that they differ from the expected norm, i.e., they are informative in some sense relative to a uniform prior distribution on P(x1, x2, ..., xn). Potentially these algorithms can pull out value-level dependencies in the data which might be "missed" by a variable-level model. Combining local patterns to form a global model for P(x1, x2, ..., xn) is worthwhile from two different perspectives. Firstly, from a data mining perspective, it would offer a framework for linking sets of local patterns with the more well-established world of global models, where evaluation, prediction, estimation, model selection, and so forth can be handled in a coherent and consistent manner. Secondly, from an applications perspective, the ability to quickly find local patterns and then use these to construct a model for P(x1, x2, ..., xn) could offer distinct advantages in accuracy, efficiency, and interpretability compared to more conventional approaches. In this paper we focus on the first aspect of the problem, namely illustrating how local sets of patterns typically uncovered by data mining algorithms can be combined to form global models. In Section 2 we formally define the problem and then show that it is equivalent to a well-known problem in probabilistic modeling, namely finding the closest distribution to a prior distribution in a constrained family. In Section 3 we review the well-known iterative scaling algorithm, an iterative technique for finding the distribution of interest.
Section 4 contains an application of this approach to the problem of query selectivity estimation, where one wants to estimate how many records in a database satisfy a certain "high-order" query and one already has low-order counts. In Section 5 we outline the use of this technique on sequential data (which is where it has most often been used in the past, particularly for text sequences), and apply the method to characterizing dependence structure in protein sequences. Section 6 contains conclusions.

2. Background and General Statement of the Problem

We begin with a fairly abstract statement of the problem and then show how this statement relates to the discussion in the last section. Consider that the following requirements are imposed on P(x) (keep in mind that P(x) is unknown; we wish to estimate it given the data):

1. P(x) should satisfy certain linear constraints, where the constraints come from the training data;
2. P(x) should be chosen in all other respects in accordance with our ignorance about everything these constraints do not specify.

Requirement (1) can be written mathematically as

    Σ_x P(x) · k(x|i) = d(i),                                    (1)

where k(x|i) denotes the i-th constraint function (which for our purposes will be an indicator function, taking the value 1 iff x satisfies the i-th constraint), d(i) is a given constraint target, and i varies over all given constraints, from 1 to C, where C is the total number of constraints. In a data mining framework, we will assume that our patterns and rules are statements about frequency counts in the data that can be represented as constraints. For example, if we know that the number of co-occurrences of (xj=1, xk=1) in the training data is njk, then the indicator function would be 1 when (xj=1, xk=1) and 0 otherwise, so that the left-hand side of Equation (1) would sum to P(xj=1, xk=1).
The right-hand side constraint d(i) could be set to something like njk/S (where S is the total number of data points), thus constraining the probability on the left to take this maximum-likelihood value. The cross-entropy is defined as

    D(P || Q) = Σ_x P(x) · log( P(x) / Q(x) )                    (2)

and provides a strictly non-negative distance measure between two densities P(x) and Q(x). The cross-entropy is zero if and only if P(x) = Q(x) for all x. For our purposes Q(x) is a prior distribution for x which we assume is known, and we seek that P(x) which obeys the constraints in (1) and which is closest to Q(x) in terms of minimizing the cross-entropy distance. In this paper, and in many data mining applications, there is no well-defined prior Q(x) which is known, and thus we choose a completely uniform prior as Q(x). Since Q(x) is then a constant as a function of x, minimizing the cross-entropy is in this case equivalent to maximizing a function of the form −Σ P log P, i.e., maximizing the entropy. Thus, we can state our problem as follows: from the family of constrained distributions P that satisfy (1), choose that particular distribution which has the maximum entropy. Maximum entropy has a long history as a criterion for model selection that we will not dwell on here: essentially it picks the distribution which makes the fewest independence assumptions. One of the constraints in (1) (let it be constraint 0) should be k(x|0) = 1 for all x and d(0) = 1, to ensure that the distribution sums to 1. The method of undetermined Lagrangian multipliers can be used to determine the general functional form of the solution:

    P(x) = Q(x) · μ_0 · Π_i μ_i^k(x|i)                           (3)

with constraints

    μ_0 · Σ_x Q(x) · [ Π_i μ_i^k(x|i) ] · k(x|j) = d(j),         (4)

with j varying over all possible constraints and {μ_i}, i = 1, ..., C, being a set of positive numbers which must be found. Since the indicator function k(x|j) participates as a factor in (4), we will only sum over such x that satisfy the j-th constraint.
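To make the cross-entropy criterion (2) concrete, the following small sketch computes D(P || Q) for discrete distributions (in bits) and checks numerically that, for a uniform prior Q, minimizing D(P || Q) is the same as maximizing the entropy of P. The function names and example distributions are illustrative, not from the paper.

```python
import math

def cross_entropy(p, q):
    """D(P || Q) = sum_x P(x) log2(P(x)/Q(x)), as in Equation (2)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(P) = -sum_x P(x) log2 P(x)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]   # a candidate distribution
q = [0.25] * 4                  # uniform prior Q

# With uniform Q over 4 states, D(P||Q) = log2(4) - H(P), so
# minimizing cross-entropy is maximizing entropy:
print(cross_entropy(p, q))          # 0.25
print(math.log2(4) - entropy(p))    # 0.25
```

Both quantities agree, and both vanish exactly when P is itself uniform, i.e., when P achieves the maximum possible entropy.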
The general problem in (1) and (2) is now reduced to the problem of finding a set of positive numbers {μ_i}, i = 1, ..., C, for the problem (3) and (4). It can be shown that problem (3),(4) (and, hence, problem (1),(2)) has a solution, and that it is unique, iff the constraints are consistent (Darroch and Ratcliff (1972)). Consistency means that the constraints in (1) do not contradict each other. For example, if all the constraints consist only of relative frequency counts in the data, then they will automatically be consistent. However, one might want to use better estimators for marginal probabilities than simply raw counts, especially for small numbers of counts. In this case it may be difficult to prove consistency. Nonetheless it has been found that even when consistency is not present, a reasonable solution can still be found (e.g., Rosenfeld, 1996). The requirement of consistency and limited amounts of training data complicate the choice of constraints, as discussed in Jelinek (1998). The space complexity of the problem will depend on the number of imposed constraints. Expression (3) contains (C+1) unknown parameters and thus the space complexity is O(C). For each application we discuss below we will give bounds on C. The time complexity is a function of the algorithm used to arrive at a specific solution to (3) and (4), which we discuss in the next section. Note in general that while the original data set may not reside in main memory (i.e., the pattern-finding algorithm may involve access of disk-resident data, for example), it is quite reasonable to assume that the set of patterns returned by the algorithm will itself reside in main memory, as will any model we construct based on those patterns. Thus, once the patterns (constraints) are determined, no further disk access is necessary.

3. The Generalized Iterative Scaling Algorithm

The generalized iterative scaling algorithm is well known in the statistical literature as a technique for solving problems of the general nature presented in the last section (see, for instance, the original paper by Darroch and Ratcliff (1972), and Csiszar (1984, 1989)). The algorithm can be used to obtain a numerical solution for the parameters of the function (3), satisfying constraints (4) plus a normalization constraint. The iterative scaling algorithm produces a sequence of properly normalized probability distributions P_N(x) such that if P(x) is the desired distribution then

    lim_{N→∞} max_x | P(x) − P_N(x) | = 0.

Thus, the algorithm will converge to the unique solution of the problem (3),(4) provided that the constraints (4) are consistent. A high-level outline of the algorithm is:

1. Choose an initial approximation to P(x) coinciding with Q(x) (typically uniform).
2. Iteration = 0;
3. While (not all constraints are sufficiently satisfied) do
     Iteration++;
     For (i varying over all constraints in (4)) do
       Update μ_0
       Update μ_i
4. Output the probability distribution in the form (3).

The update rules for the normalizing constant μ_0 and the μ_i corresponding to constraint i at iteration N, with i varying over the constraints in (4), are (from Jelinek, 1998):

    S_i = Σ_x P^(N−1)(x) · k(x|i)                                (5)

    μ_0(N) = μ_0(N−1) · (1 − d_i) / (1 − S_i)                    (6)

    μ_i(N) = μ_i(N−1) · [ d_i · (1 − S_i) ] / [ (1 − d_i) · S_i ]    (7)

In order to compute the right-hand sides of rules (6) and (7) one needs to sum the distribution obtained at the previous step, P^(N−1)(x), over all possible values of x satisfying constraint i. The time complexity of the algorithm depends on the number of multiplications performed in one iteration of the algorithm, i.e., used to update all C constraints. The number of summands in (5) equals the number N_i of occurrences of x in the training data that satisfy the i-th constraint.
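A minimal sketch of the updates above, assuming binary variables and an explicitly enumerated joint table (feasible only for small n; all names are illustrative). Applying rules (6) and (7) together rescales the current distribution by d_i/S_i on the set satisfying constraint i and by (1−d_i)/(1−S_i) elsewhere, so the sketch folds the two rules into a single multiplier per state:

```python
import itertools

def iterative_scaling(n_vars, constraints, tol=1e-6, max_iter=200):
    """Fit a maximum-entropy joint P over n_vars binary variables.

    `constraints` is a list of (indicator, target) pairs: indicator(x)
    is 0/1 and target is the desired probability mass d_i on the
    satisfying set.  Each pass enforces each constraint in turn by
    rescaling: mass d_i/S_i on the satisfying set, (1-d_i)/(1-S_i)
    elsewhere -- rules (6)-(7) folded into one per-state multiplier.
    The full table is exponential in n_vars: a didactic sketch only.
    """
    states = list(itertools.product((0, 1), repeat=n_vars))
    p = {x: 1.0 / len(states) for x in states}  # uniform prior Q
    for _ in range(max_iter):
        worst = 0.0
        for k, d in constraints:
            s = sum(p[x] for x in states if k(x))  # S_i under current P
            worst = max(worst, abs(s - d))
            for x in states:
                p[x] *= (d / s) if k(x) else ((1 - d) / (1 - s))
        if worst < tol:
            break
    return p

# Constrain two pairwise marginals of a 3-bit joint distribution:
cons = [(lambda x: x[0] == 1 and x[1] == 1, 0.30),
        (lambda x: x[1] == 1 and x[2] == 1, 0.25)]
p = iterative_scaling(3, cons)
print(round(sum(v for x, v in p.items() if x[0] == 1 and x[1] == 1), 3))  # 0.3
```

Each rescaling preserves normalization (total mass d_i + (1 − d_i) = 1), and with consistent constraints the passes converge to the unique solution, as guaranteed by Darroch and Ratcliff (1972).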
N_i is bounded by the total number of data points in the training set, S. Each summand is a function P^(N−1)(x) evaluated for a specific x. This evaluation takes O(C) multiplications, where C is the total number of constraints. Hence the complexity of carrying out step (5) is bounded by O(SC). Thus, the overall complexity of carrying out one iteration of the algorithm is bounded by O(SC^2), since updates are performed for every constraint.

4. Heikki's Section Here

5. Sequence Models combining Local Patterns and Markov Models

5.1 Local Trigger-based Patterns for Sequence Prediction

Let us consider a specific application domain, in which the training data is a sequence consisting of symbols from a fixed alphabet A. For any variable w we define the history H as the HL symbols preceding w in the text, where HL is the length of the history. One can allow a gap of GL symbols separating w and its history. Thus, effectively x = (w, H), and we are interested in estimating P(x) = P(w, H). Estimating the probability distribution P(x) is a difficult task, as discussed earlier, since in general one may want to consider long-term dependencies, which increases the history length HL and hence the number of variables in the distribution. In practice, it is often only feasible to accurately estimate bigram or trigram models, i.e., models in which HL is equal to 1 or 2. We can introduce a notion of long-term dependency by incorporating probabilistic rules or triggers (Rosenfeld, 1996). A trigger is simply the occurrence of a particular symbol in the history H of a particular variable. A useful trigger is one whose presence in the history significantly modifies the distribution of the variable w. In general, we will call the triple (α, β, P^Trigger = P) a trigger, where P(w=β, α∈H) = P. Here α is the trigger symbol (i.e., the symbol in the history), and β is a value of the variable whose probability we are currently interested in. For this paper it is not so important how we find these triggers from the data.
One relatively straightforward technique is to select triggers based on the average mutual information between the events {w=β} and {α∈H}. Thus, all triggers can be ranked according to their information content and only the top triggers retained for incorporation into a global model. Let us now examine what a set of constraints (3) will look like when we combine bigrams and triggers. Here the history H (of some specific length) is defined to start 2 positions back in the sequence, and the bigram depends on the symbol 1 position back. We define the concept of history equivalence classes. Suppose that we have pulled N_T triggers from the training text: (α_k, β_k, P_k^Trigger = P_k), with k varying from 1 to N_T, where the probabilities P_k are estimated from the training text using frequency counts. Let us now list in the set {Unique} all the different trigger symbols from the list α_1, α_2, ..., α_{N_T}. Note that the same α_k can trigger different β_k, so that card{Unique} = N_TW is in general less than N_T. We can now define the equivalent history for the bigram-trigger model as

    H* = ( w_{-1}, I(α_1^{Unique} ∈ H), ..., I(α_{N_TW}^{Unique} ∈ H) )      (5)

where w_{-1} is the symbol immediately preceding w, and I(α_k^{Unique} ∈ H) is an indicator function that returns 1 iff the condition inside the parentheses is satisfied. The equivalent history H* is distinct from the previously defined history H in the following ways:

1. We include the symbol w_{-1} from the gap in the equivalent history.
2. In the remaining part of the equivalent history we no longer consider symbols from the alphabet A but indicators of whether certain trigger symbols occurred in the history.

Thus, under the equivalent history framework, we do not distinguish between the pasts of two symbols in the training text unless they are both preceded by the same symbol and the same trigger symbols occur in their histories. There is a nice decision tree representation of an equivalent history.
For a training text containing S symbols from the alphabet A there are S histories which actually occur. For each symbol let us identify its preceding symbol and history, and then put it down the decision tree. The tree first tests what the symbol w_{-1} is, which takes O(|A|) comparisons and branches; then, for each k, level (k+1) tests whether I(α_k^{Unique} ∈ H) is 1, thus making one comparison. All levels of this tree but the first have a branching factor of 2, and the tree has a depth of 1 + N_TW. At the very bottom level we can accumulate the number of "real" training text histories falling into each leaf; most leaves will have no histories, and some might have more than one. All histories falling into one leaf of the tree are indistinguishable under the history equivalence classes paradigm. Now we are ready to define the set of all constraints that we will impose on the probability distribution for combining bigrams and triggers:

    P(H*) = f(H*)                                                        (P6)
    P(w, w_{-1}) = f(w, w_{-1})               if C(w, w_{-1}) ≥ K        (P7)
    P(w) = f(w)                               if C(w) ≥ L                (P8)
    P(w=β_k, α_k ∈ H) = P_k^Trigger           if C(w=β_k, α_k ∈ H) ≥ M   (P9)
    P(w=β_k, α_k ∉ H) = f(β_k) − P_k^Trigger  if C(w=β_k, α_k ∉ H) ≥ M   (P10)

where f and C are relative frequencies and counts obtained from the training data. We will refer to condition (P8) as simply the "priors" (i.e., the unigram or univariate marginal constraints). Since we want to take into account the effect of both the presence and the absence of the trigger symbols in the history of w, we include both constraints (P9) and (P10). It can easily be shown (Jelinek, 1998) that, given the constraints (P6)–(P10) and the goal of minimizing cross-entropy, the general solution can be written as

    P(w, H*) = Q(w, H*) · μ_0 · μ_1(w)^{I(C(w) ≥ L)} · μ_2(H*) · μ_3(w, w_{-1})^{I(C(w, w_{-1}) ≥ K)}
               · Π_{k=1}^{N_T} μ_{t_k}^{I(C(w=β_k, α_k ∈ H) ≥ M)} · ν_{t_k}^{I(C(w=β_k, α_k ∉ H) ≥ M)}   (P11)

where we note that the second argument is now the equivalent history.
The constants K, L, and M are thresholds that can be set to determine the minimum frequency count in the data necessary for a constraint to be included. Note that if the count for any of the constraints is less than the corresponding constant then there will be no factor corresponding to this constraint in the product. This means that we can reduce the complexity of both the model and the iterative scaling algorithm by increasing the value of these constants (Jelinek, 1998, discusses what problems this increase may cause for the test text). Here we simply set the constants K, L, and M all to 1. Now we bound the number of parameters C in the function (P11). There are at most |A| parameters that correspond to the constraints introduced by the priors (P8). There are at most |A|^2 parameters that correspond to constraints introduced by bigrams (P7); the number of bigrams satisfying the condition C(w, w_{-1}) ≥ K = 1 is also bounded by S (the total number of symbols in the training text), so the upper bound is min(S, |A|^2). There are at most S parameters that correspond to constraints introduced by relative frequencies of equivalent histories (P6). Finally, there is a total of 2·N_T terms corresponding to the trigger constraints (P9) and (P10); the number of triggers N_T is in turn bounded by |A|^2, since the latter is the total possible number of pairs of symbols in the alphabet A. Thus, the overall memory complexity is bounded by 1 + |A| + S + 2·|A|^2 + min(S, |A|^2), which is O(S + |A|^2) parameters. In the appendix, we provide details on how to perform step (5) of the generalized iterative scaling algorithm in order to estimate the parameters of the distribution. Here, we note that the overall time complexity is O(S·N_T·|A|^2) per iteration. In all of the experiments below the algorithm converged to within 1% of the specified constraints in about 20 iterations.
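The construction of the equivalent history H* in Equation (5) of Section 5.1 can be sketched as follows; the sequence and trigger symbols below are made-up inputs for illustration, not data from the paper.

```python
def equivalent_history(seq, pos, trigger_symbols, hl, gap=0):
    """Equivalent history H* for the symbol at `pos`.

    H* pairs the immediately preceding symbol w_{-1} with one 0/1
    indicator per unique trigger symbol, recording whether that
    symbol occurs anywhere in the HL-symbol history window, which
    starts `gap` + 2 positions back from the predicted symbol.
    """
    w_prev = seq[pos - 1] if pos >= 1 else None
    start = max(0, pos - 1 - gap - hl)
    window = seq[start: max(0, pos - 1 - gap)]
    flags = tuple(int(t in window) for t in trigger_symbols)
    return (w_prev,) + flags

seq = "MVLSPADKTN"          # a short amino-acid string (illustrative)
triggers = ("A", "K")       # hypothetical trigger symbols
print(equivalent_history(seq, 8, triggers, hl=5))  # ('K', 1, 0)
```

For position 8 the history window is "LSPAD": "A" occurs in it and "K" does not, so two positions with the same preceding symbol and the same indicator pattern would fall into the same history equivalence class.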
5.2 Experimental Results on Protein Sequence Prediction with Triggers

We have applied the sequential trigger model and the iterative scaling algorithm to the modeling of sequential protein data. Proteins are large macromolecules responsible for many fundamental physiological processes in living organisms. Their primary structure is defined by a sequence of amino acids. Understanding the sequential dependence of amino acids in protein sequences is an important first step in developing an improved understanding of the role and function of different proteins (Setubal and Meidanis, 1997). The purpose of this experiment was to explore sequence structure beyond simple Markov modeling, in particular via the use of trigger models. For this set of experiments we used a well-known set of 585 hemoglobin protein sequences, available on-line at http://www.isb-sib.ch/. We used various combinations of prior, bigram, and trigger models for modeling the joint distribution of the current symbol and its equivalent history. The size of the alphabet is |A| = 20, since there are 20 amino acids. For all of our experiments we performed 10-fold cross-validation. We estimated the mean and standard deviation of the decrease in empirical cross-entropy on the training and test sets for a variety of different models which were fit to the training data. The following modeling strategies were used, where in all cases priors, bigrams, and triggers were estimated based on simple frequency counts from the training data:

1. A model based only on priors (a rather simple baseline model).
2. A model based on all priors and bigrams: this model is fully specified and we do not need to use iterative scaling to find the joint distribution (i.e., it can be directly specified).
3. A model based on all priors and selected bigrams. This requires the use of iterative scaling to estimate P(w|H*) = P(w|w_{-1}).
4. A model based on combinations of priors, bigrams, and triggers for different choices of the length of the history (from 1 to 20).
The constants K = L = M had a fixed value of 1. For the last experiment we also varied the number T of top triggers and B of top bigrams chosen per symbol of the alphabet using the mutual information criterion. There is a tradeoff here: if we have too few bigrams or triggers then we will not get a good probability model, but if we have too many, the estimates will be noisy and the model will not be as accurate on test data. Tables 1 and 2 show the top 10 bigrams and triggers found in the training data with a history length of 1 (i.e., the triggers are based on the occurrence of a symbol 2 positions back, and the bigrams are based on symbols 1 position back). The tables illustrate how the mutual information method works for bigrams and triggers. The average mutual information selects pairs for which the conditional probability P(B|A) is substantially different from the prior P(B). Note also that, apart from the top bigram, many of the triggers have a higher mutual information than the bigrams, indicating that there may be more information in the sequence further "back" than in the first-order Markov bigram window.

Table 1. Top 10 bigrams for the training data, history length is 1.

    Symbol A (w-1=A)   Prior(A)   Symbol B (w=B)   Prior(B)   P(B|A)   Aver. MI(w-1=A, w=B)
    L                  0.117      S                0.066      0.221    0.02625
    R                  0.027      M                0.012      0.142    0.00956
    H                  0.060      C                0.011      0.077    0.00841
    Y                  0.024      R                0.027      0.188    0.00810
    R                  0.027      V                0.087      0.305    0.00787
    N                  0.033      F                0.052      0.208    0.00742
    D                  0.058      K                0.077      0.200    0.00688
    F                  0.052      P                0.038      0.134    0.00653
    H                  0.060      G                0.062      0.166    0.00619
    Y                  0.024      F                0.052      0.216    0.00574

Table 2. Top 10 triggers for the training data, history length is 1.

    Symbol A (A∈H)     Prior(A)   Symbol B (w=B)   Prior(B)   P(B|A)   Aver. MI(A∈H, w=B)
    N                  0.033      K                0.078      0.327    0.01321
    W                  0.011      K                0.078      0.582    0.01303
    L                  0.117      L                0.117      0.018    0.01291
    H                  0.060      D                0.058      0.207    0.01219
    R                  0.027      F                0.053      0.260    0.00959
    P                  0.038      N                0.034      0.169    0.00880
    K                  0.078      R                0.027      0.096    0.00716
    V                  0.088      P                0.038      0.108    0.00661
    G                  0.062      E                0.041      0.129    0.00646
    P                  0.038      T                0.054      0.184    0.00621

For the training and test data sets, the empirical cross-entropy can be defined as

    CE(Model) = −(1/S) · Σ_{k=1}^{S} log( P(w = w_k | H*, Model) )       (P13)

where S, as before, is the total number of symbols in the text and w_k is the observed symbol that appears in the k-th position in the text. Cross-entropy is non-negative and measures the empirical conditional uncertainty (entropy, in bits) in the model's predictions: the lower the better. There is also an intuitive interpretation of this log P score: if we consistently assign observed symbols high probabilities we will do well, whereas if we consistently assign observed symbols low probabilities we will do poorly (as reflected in the cross-entropy score). Our training and test data sets are characterized as follows: the size of the alphabet is |A| = 20, the total number of symbols in the training data set is 57103, and the total number of symbols in the test data set is 28590. A very simple baseline score can be defined as the cross-entropy of a model with a uniform distribution for P, namely log|A| = log(20) = 4.3219 bits. Thus, all decreases in reported cross-entropy are relative to this baseline uncertainty. The termination accuracy of the approximation in all experiments was 1%, i.e., the iterative scaling was halted when the distribution came within 1% of the desired constraints. One of the important modeling questions was what the number of triggers and bigrams should be. In the experiments we looked at both the top 3 triggers and bigrams per symbol (Table 3) and the top 6 (Table 4). We also added the additional requirement that the average mutual information of a trigger or of a bigram should be more than 0.001 bits for it to be incorporated into the model.
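The empirical cross-entropy (P13) is straightforward to compute once the model's per-symbol predictive probabilities are at hand. A small sketch with base-2 logarithms, matching the 4.3219-bit uniform baseline above (the inputs are illustrative):

```python
import math

def empirical_cross_entropy(probs):
    """CE = -(1/S) * sum_k log2 P(w = w_k | H*, Model), per (P13).

    `probs` holds the probability the model assigned to each observed
    symbol in the text; lower CE means better predictions.
    """
    return -sum(math.log2(p) for p in probs) / len(probs)

# A uniform model over |A| = 20 amino acids reproduces the baseline:
baseline = empirical_cross_entropy([1 / 20] * 5)
print(round(baseline, 4))  # 4.3219

# A model that puts more mass on the observed symbols scores lower:
print(empirical_cross_entropy([0.3, 0.1, 0.25, 0.05, 0.2]) < baseline)  # True
```

The reported "decrease in CE" for a model is then simply the baseline value minus the model's empirical cross-entropy.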
We first ran simple tests in which we estimated priors and bigrams from the training data and evaluated the decrease in cross-entropy due to them (columns 2 and 3 of Tables 3 and 4). As expected, bigrams do better than priors alone on both the training and test data, and there is a smaller improvement (decrease in uncertainty) on the test data than on the training data. This indicates that there is at least some memory in the protein sequences. The next column in each table contains the results for the models incorporating the top 3 or top 6 bigrams (per symbol) with priors. These models have fewer parameters than the full bigram model and are better than the priors alone, but not as accurate as the full bigram model. The last two columns in each table show the results for triggers and priors, and for triggers, priors, and bigrams, respectively. There are a few different things worth noting here:

- The best model overall (on either test or training data, for both 3 and 6 bigrams/triggers) is the priors + some bigrams + some triggers model with a history length of one (rightmost column, top row). The fact that this holds up out-of-sample indicates that the trigger "rules" are in fact capturing some true regularities in the sequences.
- A history of length 1 appears to perform best. As the width of the history window increases we are allowing more long-range memory into the model, but the precision (in terms of where the symbol occurs in the window) is reduced. It appears that precision is more important than length (refer to Tables 6 and 7 below for experiments with location-specific generalized triggers).
- Priors + bigrams + triggers outperform priors + triggers for all history lengths.
- Models based on priors + triggers for short histories outperform models based on the top 3 or 6 bigrams.
- The best model overall is for 6 bigrams and triggers, combining triggers, bigrams, and priors.

In principle one could keep adding more triggers, in which case more information would be incorporated into the model.
Table 5 shows that using all triggers and bigrams with mutual information greater than 0.001 bit gives results comparable to those of the model based on the 6 top triggers and bigrams (especially for the test set), which essentially means that the latter model already absorbs all of the major information. The standard deviation of scores on the test set is substantially higher than on the training set. Analysis of performance on the different folds reveals that there were two folds with sequences of symbols strikingly different from the rest of the family, so that neither triggers nor bigrams could improve the scoring of the model based on priors.

Table 3. Decrease in CE for different models. |A|=20, K=L=M=1; at most 3 top Bigrams and at most 3 top Triggers with Mutual Information greater than 0.001 bit are retained (unless otherwise specified). Standard deviations are shown in parentheses.

Models independent of the history length:
  Priors:                       Train .257 (.005)   Test .256 (.047)
  All Bigrams w/o estimation:   Train .564 (.022)   Test .511 (.220)
  Bigrams w. estimation:        Train .437 (.014)   Test .399 (.130)

History   Priors and Triggers         Priors, Bigrams and Triggers
Length    Train (.034)  Test (.146)   Train (.038)  Test (.230)
1         .464          .424          .636          .567
2         .447          .424          .614          .561
5         .428          .381          .601          .521
10        .417          .390          .594          .532
15        .394          .363          .572          .507
20        .371          .351          .546          .495

Table 4. Decrease in CE for different models. |A|=20, K=L=M=1; at most 6 top Bigrams and at most 6 top Triggers with Mutual Information greater than 0.001 bit are retained (unless otherwise specified). Standard deviations are shown in parentheses.

Models independent of the history length:
  Priors:                       Train .257 (.005)   Test .256 (.047)
  All Bigrams w/o estimation:   Train .564 (.022)   Test .511 (.220)
  Bigrams w. estimation:        Train .478 (.021)   Test .425 (.159)

History   Priors and Triggers         Priors, Bigrams and Triggers
Length    Train (.045)  Test (.177)   Train (.050)  Test (.280)
1         .512          .461          .722          .629
2         .499          .463          .701          .622
5         .482          .421          .694          .590
10        .460          .416          .678          .588
15        .433          .392          .651          .560
20        .398          .369          .630          .539

Table 5. Decrease in CE for different models. |A|=20, K=L=M=1; all top Bigrams and all top Triggers with Mutual Information greater than 0.001 bit are retained (unless otherwise specified).
History length 1; standard deviations are shown in parentheses.

  Priors:                        Train .257 (.005)   Test .256 (.047)
  All Bigrams w/o estimation:    Train .564 (.022)   Test .511 (.220)
  Bigrams w. estimation:         Train .497 (.021)   Test .447 (.170)
  Priors and Triggers:           Train .528 (.027)   Test .478 (.210)
  Priors, Bigrams and Triggers:  Train .757 (.068)   Test .665 (.320)

The results with the history length suggest one specific refinement to the method, namely to allow multiple histories of different lengths and different locations. In this paper we discuss how the results above would change if we make triggers location-specific, i.e., if we no longer consider triggers as just belonging to the history but as occupying a specific location in the history, say k symbols back from the symbol being predicted. We will call these triggers generalized. To illustrate the idea of generalized triggers, let us as before treat the history of the symbol w as the HL symbols preceding w and separated from w by the gap. To define the concept of equivalent history for standard triggers we only had to know which of the trigger symbols occurred in the document history of a symbol. For example, if symbol "A" was a trigger symbol then all document histories were discriminated only on the basis of whether they contained "A" or not, regardless of the position of "A" in them. Under the generalized-triggers framework, we redefine the notion of a trigger in the following manner: we will call a four-tuple (α, σ, k, P) a generalized trigger if P(w = σ, w_{-k} = α) = P. An equivalent history of any symbol can now be defined similarly to what we did for conventional triggers. Note that for a history of length 1 the generalized and regular models coincide; for longer histories our expectation would be that the generalized model provides more valuable information about the symbol being predicted, since location-specific triggers at a distance of more than 1 in many cases do have higher mutual information with the predicted symbol than conventional triggers.
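The two notions of trigger can be contrasted by the joint probabilities they constrain: a conventional trigger constrains P(w = σ, α ∈ H), while a generalized trigger constrains P(w = σ, w_{-k} = α). A rough sketch, under simplifying assumptions (toy sequence; the gap between history and predicted symbol is ignored; offsets are counted with positive k rather than the negative k of Tables 6 and 7):

```python
from collections import Counter

def trigger_stats(seq, hist_len):
    """Empirical joint probabilities used as trigger constraints.
    conventional[(alpha, sigma)]    = P(w = sigma and alpha occurs anywhere
                                        in the hist_len symbols before w)
    generalized[(alpha, sigma, k)]  = P(w = sigma and w_{-k} = alpha),
                                      i.e. alpha exactly k positions back
                                      (the four-tuple definition above)."""
    conventional, generalized = Counter(), Counter()
    positions = range(hist_len, len(seq))
    for i in positions:
        sigma, history = seq[i], seq[i - hist_len:i]
        for alpha in set(history):          # fires once per distinct symbol
            conventional[(alpha, sigma)] += 1
        for k in range(1, hist_len + 1):    # one event per exact location
            generalized[(seq[i - k], sigma, k)] += 1
    n = len(positions)
    return ({key: c / n for key, c in conventional.items()},
            {key: c / n for key, c in generalized.items()})

conv, gen = trigger_stats("ABCABDABC", hist_len=2)
# gen[("A", "C", 2)] = P(w = "C" and w_{-2} = "A")
```

Ranking either kind of event by its mutual information with the predicted symbol then yields tables of the form shown in Tables 6 and 7.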
To illustrate this point let us refer to Tables 6 and 7.

Table 6. Top 5 triggers for the symbol "T" in the training data; history length is 5.

Symbol A, A∈H   Prior(A)   Symbol B, w=B   Prior(B)   P(B|A)   Aver. MI(A∈H, w=B)
W               0.001      T               0.054      0.274    0.00392
H               0.060      T               0.054      0.007    0.00309
D               0.058      T               0.054      0.006    0.00306
A               0.116      T               0.054      0.022    0.00232
F               0.053      T               0.054      0.108    0.00186

Table 7. Top 5 generalized triggers for the symbol "T" in the training data; history length is 5.

Symbol A, w(-k)=A   Prior(A)   Symbol B, w=B   Prior(B)   P(B|A)   Aver. MI(w(-k)=A, w=B)
P (k=-4)            0.038      T               0.054      0.222    0.00958
P (k=-2)            0.038      T               0.054      0.184    0.00621
F (k=-5)            0.053      T               0.054      0.151    0.00526
F (k=-6)            0.053      T               0.054      0.141    0.00432
C (k=-4)            0.012      T               0.054      0.252    0.00370

It is remarkable that the top 5 generalized trigger symbols are all different from the top 5 trigger symbols (column 1 in Tables 6 and 7). Moreover, setting aside the last entry in Table 7 and the first entry in Table 6, the average mutual information scores for generalized triggers are higher than those for conventional triggers, and the top score for generalized triggers is almost 2.5 times greater than the top score in Table 6. As Tables 8 and 9 show, we also gain larger decreases in cross-entropy from this generalization of the notion of a trigger.

Table 8. Generalized Triggers, Decrease in CE for different models. |A|=20, K=L=M=1; at most 3 top Bigrams and at most 3 top Triggers with Mutual Information greater than 0.001 bit are retained (unless otherwise specified). Standard deviations are shown in parentheses.

Models independent of the history length:
  Priors:                       Train .257 (.005)   Test .256 (.047)
  All Bigrams w/o estimation:   Train .564 (.022)   Test .511 (.220)
  Bigrams w. estimation:        Train .437 (.014)   Test .399 (.130)

History   Priors and Triggers         Priors, Bigrams and Triggers
Length    Train (.091)  Test (.235)   Train (.080)  Test (.305)
1         .464          .424          .636          .567
2         .548          .510          .706          .642
5         .625          .535          .769          .666
10        .672          .575          .812          .703
15        .698          .601          .838          .725
20        .718          .626          .854          .746

Table 9. Generalized Triggers, Decrease in CE for different models.
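The average mutual information used to rank the triggers in Tables 6 and 7 is, for two binary events, the standard quantity computed below. This sketch assumes a binary-indicator formulation (trigger condition holds / predicted symbol observed); it is meant to show the formula behind the "Aver. MI" column, not to reproduce the paper's exact computation.

```python
import math

def avg_mutual_information(p_xy, p_x, p_y):
    """Average mutual information (bits) between two binary events,
    e.g. X = 'trigger condition holds in the history' and Y = 'w = B'.
    Arguments: p_xy = P(X=1, Y=1), p_x = P(X=1), p_y = P(Y=1)."""
    mi = 0.0
    for x in (0, 1):
        for y in (0, 1):
            # joint probability of each of the four binary outcomes
            pj = {(1, 1): p_xy,
                  (1, 0): p_x - p_xy,
                  (0, 1): p_y - p_xy,
                  (0, 0): 1 - p_x - p_y + p_xy}[(x, y)]
            px = p_x if x else 1 - p_x
            py = p_y if y else 1 - p_y
            if pj > 0:
                mi += pj * math.log2(pj / (px * py))
    return mi

# Dependent events carry positive information; a 0.001-bit threshold
# like the one used in the experiments would then prune weak triggers.
mi_dep = avg_mutual_information(0.05, 0.1, 0.2)
mi_ind = avg_mutual_information(0.02, 0.1, 0.2)  # independent: 0.02 = 0.1 * 0.2
```

A candidate trigger would be retained only when this score exceeds the 0.001-bit threshold mentioned in the text.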
|A|=20, K=L=M=1; at most 6 top Bigrams and at most 6 top Triggers with Mutual Information greater than 0.001 bit are retained (unless otherwise specified). Standard deviations are shown in parentheses.

Models independent of the history length:
  Priors:                       Train .257 (.005)   Test .256 (.047)
  All Bigrams w/o estimation:   Train .564 (.022)   Test .511 (.220)
  Bigrams w. estimation:        Train .478 (.021)   Test .425 (.159)

History   Priors and Triggers         Priors, Bigrams and Triggers
Length    Train (.045)  Test (.177)   Train (.050)  Test (.280)
1-20      …             …             …             …

One may notice that the results in Tables 8 and 9 are somewhat different from those reported in Tables 3 and 4. For models based on generalized triggers, performance improves as the history length increases. The best model corresponds to a history length of 20, and this is not surprising, since longer histories allow a better chance of finding a higher-mutual-information trigger for a symbol (Tables 6 and 7 illustrate this). Also, the best model with the 3 top triggers gives at least a 12% improvement in score over the model based on all conventional triggers and all bigrams (Table 5 versus the last row in Table 8). We also notice that long-history generalized-trigger-based models (without bigrams) outperform models based on all bigrams, which was never the case with conventional triggers. All this suggests that the concept of generalized triggers and its other possible extensions might help further improve performance. An important feature of the overall approach is that including such "patterns" into the model can be carried out in an entirely automatic manner using the iterative scaling procedure. In fact there are numerous generalizations of the basic idea presented here, such as allowing a more general language for trigger rules with multiple symbols, disjunctions, and so forth. Parameters such as history lengths, numbers of triggers, bigrams, etc., can in principle all be chosen automatically using cross-validated estimates of cross-entropy.

6.
Conclusion

Data mining algorithms are very good at efficiently finding patterns in the form of conjunctive expressions ("rules") with attached frequencies of occurrence in large data sets. However, the problem of combining these patterns into a coherent global model has been relatively unexplored. In this paper we proposed a straightforward approach to this problem: viewing the local patterns as constraints on an unknown high-order distribution. The iterative scaling procedure can then be used to find the specific high-order distribution that is closest, in a cross-entropy sense, to a specified prior. We illustrated the application of this idea on two different problems, query selectivity estimation and protein sequence prediction; in both cases the method using local patterns provided significant improvements over conventional alternatives.

Acknowledgments

The work of DP and PS described in this paper was supported by the National Science Foundation under Grant IRI-9703120.

References

A.L. Berger, S.A. Della Pietra, and V.J. Della Pietra. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, March 1996, Vol. 22, No. 1, 39-72.

I. Csiszar and G. Tusnady. Information Geometry and Alternating Minimization Procedures. Statistics and Decisions, R. Oldenbourg Verlag, Munich, 1984, Supplementary Issue No. 1, 205-237.

I. Csiszar. A Geometric Interpretation of Darroch and Ratcliff's Generalized Iterative Scaling. The Annals of Statistics, 1989, Vol. 17, No. 3, 1409-1413.

J.N. Darroch and D. Ratcliff. Generalized Iterative Scaling for Log-Linear Models. The Annals of Mathematical Statistics, 1972, Vol. 43, No. 5, 1470-1480.

F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, 1998.

S. Kullback. Information Theory and Statistics. Wiley, New York, 1959.

B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, R. Agrawal and P.
Stolorz (eds.), Menlo Park, CA: AAAI Press, 80-96, 1998.

E. Raftery and S. Tavere, ????

R. Rosenfeld. A Maximum Entropy Approach to Adaptive Statistical Language Modelling. Computer Speech and Language, July 1996, Vol. 10, No. 3, 187-228.

L. Saul and M.I. Jordan. Mixed Memory Markov Models. Machine Learning, to appear.

J.C. Setubal and J.M. Meidanis. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997.

Appendix: The Iterative Scaling Algorithm for the Bigram Model with Triggers

Let us first rewrite the update rules (P5)-(P7) for the normalizing constant μ0 and for the factor μ_ij corresponding to constraint j in the i-th group (i varies over the constraint groups (P6)-(P10)). At iteration N they become

    S_ij = Σ_{(w,H*)} P^(N-1)(w, H*) · χ(w, H* | (i,j))                      (P5)

    μ0^(N) = μ0^(N-1) · (1 - d_ij) / (1 - S_ij)                              (P6)

    μ_ij^(N) = μ_ij^(N-1) · [d_ij · (1 - S_ij)] / [(1 - d_ij) · S_ij]        (P7)

where χ(w, H* | (i,j)) indicates whether the pair (w, H*) satisfies constraint j in group i, and d_ij is the desired constraint probability. In order to compute the right-hand sides of rules (P6) and (P7) one needs to sum the distribution obtained on the previous iteration over all possible values of (w, H*) satisfying constraint j in group i. We denote this sum S_ij, as above, and show how to carry out this computation for a constraint in each specific group. To keep the notation simpler we omit the superscript N standing for the current iteration number.

1. Group (P7), bigrams: let w=A, w_{-1}=B, with C(w=A, w_{-1}=B) ≥ K. We have

    S_(7),w=A,w_{-1}=B = μ0 · μ1(w=A)^I(C(w=A)≥L) · μ3(w=A, w_{-1}=B) ·
        Σ_H μ2(H, w_{-1}=B) · Π_{k=1}^{NT} μ_tk^I(σ_k∈H, C(w=A, σ_k∈H)≥M) · μ̄_tk^I(σ_k∉H, C(w=A, σ_k∉H)≥M)

2. Group (P8), symbols w: let w=A with C(w=A) ≥ L. We have

    S_(8),w=A = μ0 · μ1(w=A) · Σ_{w_{-1}} μ3(w=A, w_{-1})^I(C(w=A, w_{-1})≥K) ·
        Σ_H μ2(H, w_{-1}) · Π_{k=1}^{NT} μ_tk^I(σ_k∈H, C(w=A, σ_k∈H)≥M) · μ̄_tk^I(σ_k∉H, C(w=A, σ_k∉H)≥M)

3. Group (P6), history equivalence classes: the class H* identifies w_{-1}=B and, for any specific symbol w, which triggers fired and which did not.
    S_(6),w_{-1}=B, fired triggers known = μ0 · μ2(H*) · Σ_w μ1(w)^I(C(w)≥L) · μ3(w, w_{-1}=B)^I(C(w, w_{-1}=B)≥K) ·
        Π_{fired triggers k} μ_tk^I(C(w, σ_k∈H)≥M) · Π_{non-fired triggers k} μ̄_tk^I(C(w, σ_k∉H)≥M)

4. Groups (P9) and (P10), for triggers and "anti"-triggers, are similar; the following is the summation for group (P9). Suppose that we update the p-th trigger, with C(w=w_p, σ_p∈H) ≥ M, and we know that it has fired, so that

    S_(9),w=w_p, σ_p∈H = μ0 · μ1(w=w_p)^I(C(w=w_p)≥L) · Σ_{w_{-1}} μ3(w=w_p, w_{-1})^I(C(w=w_p, w_{-1})≥K) ·
        Σ_{H: σ_p∈H} μ2(H, w_{-1}) · Π_{k=1}^{NT} μ_tk^I(C(w=w_p, σ_k∈H)≥M) · μ̄_tk^I(C(w=w_p, σ_k∉H)≥M)

The time complexity of performing steps 1, 2, and 3 is O(S·|A|·NT), and the last step is O(S·NT·NT) = O(S·NT·|A|²), so that the overall time complexity is O(S·NT·|A|²).
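The (P5)-(P7) updates can be sketched on a toy distribution. The version below folds the normalizer update (P6) into an explicit renormalization (so only the factor update (P7) appears), and uses the same 1% relative termination criterion mentioned earlier; the two-attribute state space and the marginal constraints are illustrative assumptions, not the paper's bigram-with-triggers model.

```python
import math

def iterative_scaling(states, constraints, desired, tol=0.01, max_iter=1000):
    """Iterative scaling over binary constraints, starting from a
    uniform prior on `states`. `constraints[j]` is a predicate on a
    state; `desired[j]` is its target probability d_j. Stops when every
    constraint's model mass S_j is within `tol` (relative) of d_j."""
    mu = [1.0] * len(constraints)

    def prob():
        # Renormalization here plays the role of the mu_0 update (P6).
        weights = [math.prod(mu[j] for j, c in enumerate(constraints) if c(s))
                   for s in states]
        z = sum(weights)
        return [w / z for w in weights]

    for _ in range(max_iter):
        worst = 0.0
        for j, c in enumerate(constraints):
            p = prob()
            s_j = sum(pi for pi, s in zip(p, states) if c(s))           # (P5)
            worst = max(worst, abs(s_j - desired[j]) / desired[j])
            mu[j] *= desired[j] * (1 - s_j) / ((1 - desired[j]) * s_j)  # (P7)
        if worst < tol:  # the 1% termination criterion from the text
            break
    return dict(zip(states, prob()))

# Two binary attributes; constrain the marginals P(x=1)=0.3 and P(y=1)=0.6.
states = [(0, 0), (0, 1), (1, 0), (1, 1)]
dist = iterative_scaling(states,
                         [lambda s: s[0] == 1, lambda s: s[1] == 1],
                         [0.3, 0.6])
```

Each pass rescales one multiplier so that its constraint is met exactly under the current model, and the passes are repeated until all constraints hold simultaneously to within the termination accuracy.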