World Wide Web: Internet and Web Information Systems, 5, 181–191, 2002
© 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.
User Intention Modeling in Web Applications Using
Data Mining
ZHENG CHEN
Microsoft Research Asia, 49 Zhichun Road, Beijing 100080, PR China
[email protected]
FAN LIN
[email protected]
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, PR China
HUAN LIU
Arizona State University, PO Box 875406 Tempe, AZ 85287-5406, USA
[email protected]
YIN LIU
[email protected]
Department of Computer Science and Engineering, Tongji University, Shanghai, PR China
WEI-YING MA
Microsoft Research Asia, 49 Zhichun Road, Beijing 100080, PR China
[email protected]
LIU WENYIN
[email protected]
Department of Computer Science, City University of Hong Kong, Hong Kong SAR, PR China
Abstract
The problem of inferring a user’s intentions in Machine–Human Interaction has been a key research issue in providing personalized experiences and services. In this paper, we propose novel approaches to modeling and inferring a user’s actions on a computer. Two linguistic features – keyword and concept features – are extracted from the semantic context for intention modeling. Concept features are conceptual generalizations of keywords. Association rule mining is used to find the proper concept for each corresponding keyword. A modified Naïve Bayes classifier is used in our intention modeling. Experimental results show that our proposed approach achieves 84% average accuracy in predicting the user’s intention, which is close to the precision (92%) of human prediction.
Keywords: intention modeling, user modeling, machine learning, data mining, Web navigation
1. Introduction
The rapid growth of the Internet has resulted in an exponential growth of information. People often get lost when trying to find information, even with the help of search engines [4]. Meanwhile, software applications aim to provide ever more powerful functionalities to satisfy the needs of different users. For example, in order to compose a multimedia document, the user must learn how to insert different media objects and format them appropriately. To better assist users in searching for what they want more efficiently and in learning new software tools more effectively, the computer needs to understand the user’s intention.
In our opinion, the user’s intention can be classified into two levels: action intention and semantic intention. Action intentions are at the lower level, such as mouse clicks, keyboard typing, and other basic actions performed on a computer. Semantic intentions correspond to what the user wants to achieve at a high level, which may involve several basic actions on a computer to accomplish. For example, “I want to buy a book from Amazon,” “I want to find some papers on data mining,” and “I want to attach an image file to the email I am composing” [12] are semantic intentions.
In this paper, we mainly focus on predicting action intention based on features extracted from the user’s interactions, such as the user’s typed sentences and viewed content.
Although not explicitly designed to predict the user’s semantic intention, the predicted actions together constitute a high-level goal that the user intends to achieve. It has
been shown that the assistance is helpful for the user when the user’s intention is predicted
by observing the user’s behaviors [10]. For example, in Web surfing, a user may conduct
a series of actions including clicking (the hyperlinks), saving (the pages), and closing (the
browser). Suppose a user wants to buy a digital camera, which is his semantic intention; he may do the following: First, open a Web browser. Second, type www.amazon.com in
the address bar. Third, after the page is returned, type in digital camera in the search box.
Fourth, click on one of the objects contained in the page. Fifth, click on the buy button to
confirm. Last, after the transaction is finished, close the browser. Our goal is to predict
this series of basic actions that the user will be conducting in a system to accomplish his
intention to buy a digital camera on the Web. A software agent may automatically highlight
the hyperlinks that the user may click on based on the prediction.
In this paper, we use a modified Naïve Bayes classifier to model the user’s action intention on a computer. The model can be trained incrementally and used to predict the user’s
next action. Besides keyword features, our algorithm also utilizes WordNet to generalize
the learned rules to similar words. Our experiments show that our prediction algorithm can reach an accuracy of 85%.
The rest of the paper is organized as follows. Section 2 is a brief overview of related
work in this field. Section 3 describes our algorithm in detail. Section 4 shows the experiments and evaluations. We conclude our work in Section 5.
2. Related work
Predicting user’s intention based on a sequence of user’s actions or other related semantic
information is an interesting and challenging task. In fact, predicting the user’s intention
is an important functionality of agents, which are known as intelligent, frequently autonomous, and mobile software systems. Hence, various agent systems have been designed
with their own methods to achieve this goal. Bauer et al. [3] have already introduced some
typical notions and procedures on how to construct and train such an agent. However,
most of these works mainly focused on the user’s preference and did not address the difference between the user’s intention and the user’s preference.
Office Assistant [10] is perhaps the most widely used agent. It uses the approach of
Bayesian Networks for analyzing and predicting the user goals. A temporal reasoning
method is used to handle the changing of the user’s goals. This work, to the best of our knowledge, is probably the only one that has studied the user’s intention in depth. However, we
think the Office Assistant’s predictions can be further improved if semantic contexts are
also used in addition to action sequences for mining user intention.
For Syskill & Webert, the authors of [15] compared several learning algorithms for user preference modeling and prediction, including the Bayesian classifier, Nearest Neighbor, PEBLS, Decision Trees, TF*IDF, and Neural Nets. However, their methods require
manual tagging and no incremental algorithms are considered.
For Retriever, the authors of [8] have used the TF*IDF weighting scheme, which is a
traditional Information Retrieval approach, to automatically mine the user’s preference from the semantic information the user is involved with. Furthermore, they analyze the queries that the user typed to improve precision, using a so-called query domain analysis approach.
Other information agents of this kind, such as WebWatcher [2], WebMate [5], and WAIR [17], are based on similar approaches. For instance, in WebWatcher, hyperlink information is considered, and in WAIR, the user’s feedback is also added as an important source for analysis.
In contrast with most existing works [6,7], our approach not only considers keyword features extracted from the text but also tries to form a concept hierarchy of keywords to improve the prediction performance. On the other hand, research efforts on incremental algorithms have also been made in this field. Somlo and Howe [9] gave a detailed discussion of incremental clustering; different algorithms were presented and compared, including explicit cluster assignment, greedy clustering, and the doubling algorithm. These algorithms decrease the space needed to maintain the source documents and speed up the process of model rebuilding. Inspired by their success, in this paper we also extend the Naïve Bayes algorithm to support incremental learning in the intention modeling process.
3. Mining user’s intention
3.1. Linguistic features
Linguistic features are features in the text that may indicate a user’s intention. Two types of linguistic features are considered in this paper: keyword and concept features. A keyword feature is a single word extracted from the text, which may be stemmed [16] with stop-words excluded. For example, the sentence “Attached word file is a map of our office” is parsed into the keyword features “attach word file map office.”
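As an illustration of this keyword-level parsing, the sketch below uses NLTK’s implementation of the Porter stemmer [16]; the stop-word list is a toy stand-in, since the paper does not specify which list was used.

```python
# A minimal sketch of keyword-feature extraction: lowercase, drop stop-words,
# and stem with NLTK's PorterStemmer (an implementation of [16]).
from nltk.stem import PorterStemmer

STOP_WORDS = {"is", "a", "of", "our", "the", "an", "to", "and"}  # illustrative only

def keyword_features(sentence):
    stemmer = PorterStemmer()
    words = (w.strip('.,!?"').lower() for w in sentence.split())
    return [stemmer.stem(w) for w in words if w and w not in STOP_WORDS]

print(keyword_features("Attached word file is a map of our office"))
# -> ['attach', 'word', 'file', 'map', 'offic']  (stems, not surface forms)
```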
Although the keywords carry the important meaning of the sentences, they are too specific to represent the user’s intention. For example, “road” and “way” sometimes have the same meaning; however, they are different keyword features. To solve this problem, we introduce the concept hierarchy of keywords, in which concepts are more general than keywords. WordNet [13] is a tool that represents underlying lexical concepts in various forms: synonyms (similar terms), hypernyms (broader terms), hyponyms (narrower terms), etc. Among them, the hypernym relation is a good form of concept hierarchy for keywords. For instance, the hypernyms of the word “dog” in the hierarchy are “dog ⇒ canine ⇒ ··· ⇒ mammal ⇒ ··· ⇒ animal ⇒ life form ⇒ entity,” which means “the dog is a kind of ...,” and the hypernym ladder of the word “cat” is “cat ⇒ feline ⇒ ··· ⇒ mammal ⇒ ··· ⇒ animal ⇒ life form ⇒ entity.” If we roll up the concept hierarchy, the words “dog” and “cat” can be merged into “mammal,” “animal,” or even “entity.” This operation results in the generalization of keywords.

FEATURE_EXTRACTION(Record, α, β)
For each Ri in the Record set (R1, ..., Rn)
    Extract (Ki1, ..., Kim) from Ri and add to F
    For each Kij
        Generate C of Kij using WordNet
        Add C and the action tag of Ri to T
Given α, β, generate the Association Rules (Ci1, ..., Cim) from T using the Apriori algorithm and add to F
Return F

Figure 1. Algorithm for extracting linguistic features.
Not all concept generalizations are suitable as linguistic features. “Entity” may be too abstract a concept for the keyword “dog” and is not representative; “animal” is a good concept in the case mentioned above. We use WordNet to extract the concept hierarchy of each keyword and select the most representative concept (i.e., hypernym) as the concept feature by means of Association Rules. As can be seen from the experiments, user intention prediction based on the concept features outperforms prediction based on pure keyword features.
3.2. Feature extraction algorithms
At the keyword level of feature extraction, the text part is parsed such that all words are extracted from the sentences and stemmed, with stop-words excluded. Each keyword is
a feature and will be added to the keyword feature set.
In Section 3.1, we mentioned that the concept hierarchy of the keywords may be selected for better representation. Given the α and β thresholds, which will be explained in
detail in the next paragraph, the association rules are employed to mine the most popular
concept of the keywords among the training data, that is, to generalize the various keywords to the same concept level. An algorithm to extract linguistic features is presented in
Figure 1.
In Figure 1, α and β are the thresholds for rule generation; F is the feature set; C is the concept hierarchy; T is the transaction set for the Association Rules. In Section 3.1 we mentioned that we need to select an appropriate concept for each keyword as its concept feature. To make the selection process automatic, we chose a rule generation method to mine the proper concept. The Apriori algorithm proposed by Agrawal et al. [1] was adopted to generate the association rules. The rules represent the association relationship between features and intentions; e.g., given a feature, a rule indicates whether a click action is intended. The association rules were further constrained by two parameters: α (support of the item sets) and β (confidence of the association rule) [1]. The first parameter, α, which depicts the scope of a rule, is the percentage of records that contain both the feature and the corresponding intention. The second parameter, β, depicts the probability that the rule holds, i.e., the probability of the intention given the appearance of the feature. We evaluate the generated rules based on these two parameters: those rules whose parameters exceed certain thresholds are selected as concept features.

<?xml version="1.0" encoding="UTF-8"?>
<Result>
  <IE>
    <Action Type="...">
      <Title>...</Title>
      <Body>...</Body>
      <Links>
        <Link>
          <URL>...</URL>
          <Text>...</Text>
        </Link>
        ...
      </Links>
    </Action>
  </IE>
</Result>

Figure 2. A skeleton of the XML user log data.
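To make the support/confidence filter described before Figure 2 concrete, the following simplified sketch scores single-feature rules (concept ⇒ intention) only; the full Apriori algorithm [1] mines arbitrary itemsets, and the transaction format shown is our assumption.

```python
# Simplified concept-feature selection: keep concept C when
# support(C and intention) >= alpha and confidence(C => intention) >= beta.
from collections import Counter

def select_concept_features(transactions, alpha, beta):
    """transactions: list of (set_of_concepts, intention_tag) pairs."""
    n = len(transactions)
    concept_count, pair_count = Counter(), Counter()
    for concepts, intention in transactions:
        for c in concepts:
            concept_count[c] += 1
            pair_count[(c, intention)] += 1
    selected = set()
    for (c, intention), k in pair_count.items():
        support = k / n                    # alpha: scope of the rule
        confidence = k / concept_count[c]  # beta: P(intention | concept)
        if support >= alpha and confidence >= beta:
            selected.add(c)
    return selected

logs = [({"mammal", "animal"}, "click"), ({"mammal"}, "click"),
        ({"entity"}, "close"), ({"entity"}, "click")]
print(select_concept_features(logs, alpha=0.005, beta=0.6))
# -> {'mammal', 'animal'}; "entity" fails the confidence test (0.5 < 0.6)
```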
3.3. Intention modeling
Without loss of generality, we focus on modeling and inferring the user’s intention in the Web browser environment, because training and testing data are easy to collect in such an environment. Initially, a user intention model is empty, but it can be learned from the user logs. Each user action record contains a text part and an action tag, as well as other important information that may reflect the user’s intention. The XML format is adopted for recording the user’s log data, as shown in Figure 2. “Action Type” is one of the five intentions mentioned in Section 4. “Title” and “Body” contain the corresponding content of the HTML file. “Links” records every URL link that appears in the HTML file.
Naïve Bayes classifier and Bayesian Belief Network are two machine learning algorithms widely used in recent years because they provide a probabilistic approach to inference [14]. The Naïve Bayes classifier is based on the simplified assumption that the
features are conditionally independent and this assumption dramatically reduces the complexity of learning the target function. In contrast to the Naïve Bayes classifier, Bayesian
Belief Network describes the joint probability distribution for a set of features. In general, the Bayesian Belief Network provides better classification performance than the Naïve Bayes classifier. However, the computational complexity of building a Bayesian Belief Network becomes impractical when the training data is large. Therefore, we chose the Naïve Bayes classifier to build intention models, and the algorithm was revised to support incremental learning.

The training algorithm for intention modeling is depicted in Figure 3. Record (R1, ..., Rn) is a set of log data tagged with the user’s intention; n is the number of records; Model (M) is the intention model trained before (initially empty); Kij is a keyword feature; Cij is a concept feature; m is the dimension of the features; IG is the Information Gain algorithm [18]; U is the percentage of untrained records. To support incremental learning, we store extra count information for the features in a counting set: N+_old is the count of previous positive training examples, N+_new is the count of positive examples added since the last refresh, and N_total is the total count of the training data. With this counting set, there is no need to keep the training data, and the probability table can be updated incrementally.

INTENTION_MODELING(Record, M)
This algorithm incrementally builds the intention model from training data.
call FEATURE_EXTRACTION(Record, α, β)
For each Ri in the Record set (R1, ..., Rn)
    Use IG to select (Ki1, ..., Kij, Ci1, ..., Cij)
    Add (Ki1, ..., Kij, Ci1, ..., Cij) to the counting set
    If U ≥ 0.01 (threshold for incremental learning)
        Recalculate the probability distribution of the intention model and empty the counting set:
            P* = (P · N+_old + N+_new) / (N_total + 1)
    Else keep the probability distribution unchanged
Return M

Figure 3. Algorithm for intention modeling.
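The refresh step reconstructed in Figure 3 fits in a few lines; the bookkeeping below is our reading of the counting-set trick, not the authors’ code.

```python
# Incremental refresh of one probability-table entry, following the
# reconstructed update P* = (P * N+_old + N+_new) / (N_total + 1).

def refresh(p_old, n_old_pos, n_new_pos, n_total):
    """Fold the counting set into the stored probability estimate."""
    return (p_old * n_old_pos + n_new_pos) / (n_total + 1)

# Toy run: P(f | click) was 0.5 after 10 positive examples; 3 new positives
# arrive among 5 new records, so U = 5/15 > 0.01 triggers a refresh.
print(refresh(p_old=0.5, n_old_pos=10, n_new_pos=3, n_total=15))
# -> 0.5, i.e. (0.5 * 10 + 3) / 16
```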
Furthermore, we use Information Gain [18] to select the most discriminative features (both keyword and concept) in order to reduce the size of the dictionary and improve the performance when the training set is small. The Information Gain measures the significance of the information obtained for intention prediction by knowing the presence or absence of a feature in a record. Let $\{V_i\}_{i=1}^{m}$ denote the predefined set of intentions. The information gain of a feature $f$ is defined as follows:

$$IG(f) = -\sum_{i=1}^{m} P(V_i)\log P(V_i) + P(f)\sum_{i=1}^{m} P(V_i \mid f)\log P(V_i \mid f) + P(\bar{f})\sum_{i=1}^{m} P(V_i \mid \bar{f})\log P(V_i \mid \bar{f}), \qquad (1)$$

where $\bar{f}$ denotes the absence of feature $f$. Given a training set, we compute the information gain for each unique feature and remove from the feature set those features whose information gain is less than a certain predetermined threshold.
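Equation (1) transcribes directly into code; the record representation below (feature set plus intention tag) is our assumption.

```python
# Information gain of a feature, per Equation (1).
import math
from collections import Counter

def entropy_sum(labels):
    """sum_i P(V_i) log P(V_i) over a list of intention labels (0 if empty)."""
    n = len(labels)
    return sum((c / n) * math.log(c / n) for c in Counter(labels).values()) if n else 0.0

def information_gain(records, feature):
    """records: list of (feature_set, intention) pairs."""
    n = len(records)
    with_f = [v for fs, v in records if feature in fs]
    without_f = [v for fs, v in records if feature not in fs]
    return (-entropy_sum([v for _, v in records])
            + (len(with_f) / n) * entropy_sum(with_f)
            + (len(without_f) / n) * entropy_sum(without_f))

# Features scoring below a threshold (0.02 in Section 4) would be dropped.
```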
Table 1. Probability distribution of the trained model.

P(click) = 0.5, P(save) = 0.2, P(close) = 0.3

          Learn    Research    Cognition
Click     0.5      0.6         0.7
Save      0.2      0.2         0.1
Close     0.3      0.2         0.2
3.4. Predicting user intention
Once training is completed, the obtained intention model for the user is used to predict the
user’s intention in the future. The prediction process is as follows. A set of linguistic features, represented by $(f_1, f_2, \ldots, f_n)$, is extracted from the text typed or used by the user.
Assuming conditional independence among features, the prediction module calculates the
probabilities of all predefined user intentions (V ) and chooses the one with the maximum
probability (vNB ) based on the following equation:
$$v_{NB} = \arg\max_{v_j \in V} P(v_j \mid f_1, f_2, \ldots, f_n) = \arg\max_{v_j \in V} P(v_j)\,P(f_1, f_2, \ldots, f_n \mid v_j) = \arg\max_{v_j \in V} P(v_j)\,P(f_1 \mid v_j)\,P(f_2 \mid v_j)\cdots P(f_n \mid v_j). \qquad (2)$$
For example, given the title of a hyperlink “Learning & Research,” we can apply the
model to infer the user’s intention as follows:
$$v_{NB} = \arg\max_{v_j \in \{\mathrm{click},\,\mathrm{save},\,\mathrm{close}\}} P(v_j)\,P(f_1 = \text{“learn”} \mid v_j)\,P(f_2 = \text{“research”} \mid v_j)\,P(f_3 = \text{“cognition”} \mid v_j). \qquad (3)$$
Among the features, “learn” and “research” are keyword features, and “cognition” is the
concept feature of these two keywords. To calculate vNB using the above expression, we
require estimates for probability terms P (vj ) and P (fi = wk |vj ) (here we introduce wk
to indicate the kth feature in the feature dictionary). Suppose we have a trained model and
Table 1 provides the partial probability distribution.
Using probabilities in Table 1, we calculate vNB as follows:
$$P(\mathrm{click})\,P(f_1 = \text{“learn”} \mid \mathrm{click})\,P(f_2 = \text{“research”} \mid \mathrm{click})\,P(f_3 = \text{“cognition”} \mid \mathrm{click}) = 0.105,$$
$$P(\mathrm{save})\,P(f_1 = \text{“learn”} \mid \mathrm{save})\,P(f_2 = \text{“research”} \mid \mathrm{save})\,P(f_3 = \text{“cognition”} \mid \mathrm{save}) = 0.0008,$$
$$P(\mathrm{close})\,P(f_1 = \text{“learn”} \mid \mathrm{close})\,P(f_2 = \text{“research”} \mid \mathrm{close})\,P(f_3 = \text{“cognition”} \mid \mathrm{close}) = 0.0036. \qquad (4)$$
Based on the above data, we can predict that the user may want to follow the hyperlink,
that is, click the hyperlink.
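The worked example in Equations (3) and (4) takes only a few lines of code, with the Table 1 probabilities hard-coded:

```python
# Equations (3)-(4) in code, using the probabilities of Table 1.
import math

priors = {"click": 0.5, "save": 0.2, "close": 0.3}
cond = {  # P(feature | intention)
    "click": {"learn": 0.5, "research": 0.6, "cognition": 0.7},
    "save":  {"learn": 0.2, "research": 0.2, "cognition": 0.1},
    "close": {"learn": 0.3, "research": 0.2, "cognition": 0.2},
}
features = ["learn", "research", "cognition"]

scores = {v: priors[v] * math.prod(cond[v][f] for f in features) for v in priors}
print(scores)                       # click: 0.105, save: 0.0008, close: 0.0036
print(max(scores, key=scores.get))  # -> 'click', the predicted intention
```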
4. Experiments and discussion
We have developed a tool to automatically collect the user’s log data in the IE environment.
The following five user’s actions are recorded:
(1) browse (view a Web page),
(2) click (follow a hyperlink in a Web page),
(3) query (type query words in a search box),
(4) save (save pages or the objects in a page),
(5) close (close the IE browsing window).
Each action was stored in the XML format (cf. Figure 2). Five users’ data in a period of
one month were logged and approximately 15,000 pages and their corresponding actions
were recorded. We randomly selected some of the pages as the training data and the rest as the testing data according to a training ratio, and repeated the split 10 times in one test to calculate the t-test value [11]. From the training data, keyword and concept features were extracted and
stored in their corresponding feature set. The Information Gain is used to select the most
useful features and, hence, can dramatically reduce the feature set. In the training stage,
the model is built incrementally according to the feature set. From the testing data, each action record is predicted using the trained model, and the predicted intention is compared with the action performed by the user. In our experiment, α and β mentioned in
Section 3.2 are 0.005 and 0.6, respectively.
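The evaluation protocol can be sketched as follows; scipy is an assumed dependency here, and the two precision lists are placeholders rather than the measured values.

```python
# Repeat the random split ten times and compare two feature configurations
# with a one-tailed paired t-test, as in the protocol described above.
import random
from scipy import stats

def random_split(pages, ratio):
    shuffled = random.sample(pages, len(pages))
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]  # (training, testing)

# One precision value per repeated split for each configuration (placeholders).
keyword_prec = [0.78, 0.80, 0.79, 0.81, 0.77, 0.80, 0.79, 0.78, 0.82, 0.80]
concept_prec = [0.82, 0.84, 0.83, 0.85, 0.81, 0.84, 0.83, 0.82, 0.86, 0.84]
t_stat, p_two_sided = stats.ttest_rel(concept_prec, keyword_prec)
print(p_two_sided / 2)  # one-tailed P value, cf. the 0.0079 reported below
```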
In the experiments, we used keyword features, concept features, and their combination
to compare their performance for intention modeling. The threshold of Information Gain
is 0.02. Because most of the related work did not give a prediction of user action intention, and the Office Assistant [10] only achieved a precision of 40% (while predicting many kinds of intentions), we compared our method with human prediction. We randomly chose 150 records from the log data and predicted the user’s intention manually: we read the training data to become familiar with the user’s behavior and then used the testing data for prediction.
The comparison of these methods is shown in Figure 4.
The keyword features are considered the baseline of our approach’s configuration and
the human prediction precision is the upper bound of benchmarking intention prediction
approaches, because our goal is to help the user as a human assistant would. As we can see, the precision using concept features is better than that using keyword features. The data were analyzed with a one-tailed, paired-samples t-test; the P value is 0.0079 (< 0.05), which shows a significant improvement from the concept features. The association rules help
improve the performance, because concept features provide more generalized descriptions
of a Web page, as discussed in Sections 3.1 and 3.3. Furthermore, when the two features
are combined, the performance improves slightly compared with concept features alone. The P value of the t-test is 0.0097, which supports this conclusion. The performance of our
algorithm is close to the human prediction when the training ratio is 50%.
Figure 4. Comparison of different features for intention modeling.
Figure 5. Precision comparison among different intentions.
In the experiments, we found that the precision of different intentions varies to some
extent, as shown in Figure 5.
We used the combined keyword and concept features. “Browse,” “Click,” and so on are the intentions predicted in this experiment. The “browse” action is easy to predict because of its large proportion (almost 60%) in the training data; in the Naïve Bayes model, such an intention is easy to predict once the model is fully trained. The “query” action is predicted with higher precision because the concept features give a well-generalized description of pages containing search results. For instance, when the user types a query in Google, the concept generalization of the keywords in the returned pages is denser than the keywords themselves. However, the “close” action has a lower prediction precision, indicating that it is probably unpredictable based on the text information alone. For example, the user may close a window that he is interested in because something urgent comes up, or simply close an uninteresting window. The average precision is the weighted sum over these five intentions, which can be calculated as
$$Pr_{\mathrm{average}} = \sum_{i} P(Act_i) \cdot Pr_{Act_i}. \qquad (5)$$
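Numerically, Equation (5) is just a weighted sum; all figures in this small example are invented for illustration.

```python
# Equation (5): proportion-weighted average precision over the five intentions.
proportion = {"browse": 0.60, "click": 0.15, "query": 0.10, "save": 0.10, "close": 0.05}
precision  = {"browse": 0.90, "click": 0.80, "query": 0.88, "save": 0.75, "close": 0.50}
print(round(sum(proportion[a] * precision[a] for a in proportion), 3))  # 0.848
```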
An experiment on the impact of Information Gain (IG) was also conducted. Figure 6
shows the performance using different thresholds of Information Gain. The IG algorithm gives better performance when the training data is small: in this situation, many keywords appear only once or twice and contribute little to the intention model, and the IG algorithm is a good way to remove such data. However, if too few features are selected, the model will give poor performance. When the training data is large enough, whether to use IG makes little difference, because in either condition the model is fully trained and noisy data will not influence the whole model. Even then, IG generates a much smaller dictionary and Naïve Bayes model, for a more compact representation.
We have also implemented and tested our algorithm on the prediction of Email attachment insertion, i.e., whether to insert an attachment while the user is typing the Email. After the intention is inferred, we further predict which file should be inserted according to the user’s preference model. Another application is input-method switching: in a multi-lingual country, it is tedious to switch input methods when writing multi-language documents. With the help of our algorithms, we can monitor the user’s intention and switch automatically; what the user has typed so far is the semantic context, and the intention has two states: switch or not.
5. Conclusion
In this paper, we presented our methods for modeling and inferring the user’s intention via data mining. We defined two levels of intention (action intention and semantic intention) and differentiated the user’s intentions from the user’s preferences. Two linguistic features (keyword and concept features) are extracted for intention modeling. We used association rules to mine the proper concepts for the corresponding keywords. In our experiments, concept features are more effective than keyword features because they are generalizations of the keywords. The Naïve Bayes classifier was chosen as the learned intention model due to its simplicity and speed, and we modified the algorithm to support incremental learning. Experiments have shown the usefulness and effectiveness of the developed algorithms. In order to take advantage of the vast body of work on user preference, our future work will concentrate on using both the user’s intention and the user’s preference in Web applications.
References
[1] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, “Fast discovery of association rules,” in Advances in Knowledge Discovery and Data Mining, AAAI Press, California, 1996, pp. 307–328.
[2] R. Armstrong, D. Freitag, T. Joachims, and T. Mitchell, “WebWatcher: A learning apprentice for the World Wide Web,” in Proceedings of the AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, 1995.
[3] M. Bauer, D. Dengler, and G. Paul, “Instructible information agents for Web mining,” in Proceedings of the
2000 International Conference on Intelligent User Interfaces, 2000, pp. 21–28.
[4] H. Chen, Y. Chung, and M. Ramsey, “A smart itsy bitsy spider for the Web,” Journal of the American Society
for Information Science 49(7), 1998, 604–618.
[5] L. Chen and K. Sycara, “WebMate: A personal agent for browsing and searching,” in Proceedings of the
Second International Conference on Autonomous Agents, 1998, pp. 132–139.
[6] Z. Chen, W. Liu, F. Zhang, M. Li, and H. J. Zhang, “ Web mining for Web image retrieval,” Journal of the
American Society for Information Science and Technology 52(10), 2001, 831–839.
[7] F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell, “‘Is this document relevant? ... Probably’: A survey of probabilistic models in information retrieval,” ACM Computing Surveys 30(4), 1998, 528–552.
[8] D. Fragoudis and S. D. Likothanassis, “Retriever: An agent for intelligent information recovery,” in Proceedings of the 20th International Conference on Information Systems, 1999, pp. 422–427.
[9] G. L. Somlo and A. E. Howe, “Incremental clustering for profile maintenance in information gathering
Web agents,” in Proceedings of the Fifth International Conference on Autonomous Agents, 2001, pp. 262–269.
[10] E. Horvitz, J. Breese, D. Heckerman, D. Hovel, and K. Rommelse, “The Lumiere project: Bayesian user
modeling for inferring the goals and needs of software users,” in Proceedings of the 14th Conference on
Uncertainty in Artificial Intelligence, 1998, pp. 256–265.
[11] R. E. Kirk, Statistics: An Introduction, Baylor University, 1999.
[12] F. Lin, W. Liu, Z. Chen, H. J. Zhang, and L. Tang, “User modeling for efficient use of multimedia files,” in
Proceedings of Second IEEE Pacific-Rim Conference on Multimedia. Beijing, October 2001. Lecture Notes in
Computer Science, Vol. 2175, Springer, 2001, pp. 182–189.
[13] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller, “Introduction to WordNet: An on-line
lexical database,” International Journal of Lexicography 3(4), 1990, 235–244.
[14] T. Mitchell, Machine Learning, McGraw-Hill, New York, 1997, pp. 154–200.
[15] M. Pazzani, J. Muramatsu, and D. Billsus, “Syskill & Webert: Identifying interesting Web sites,” in Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96), 1996, pp. 54–61.
[16] M. F. Porter, “An algorithm for suffix stripping,” Program 14(3), 1980, 130–137.
[17] Y. W. Seo and B. T. Zhang, “A reinforcement learning agent for personalized information filtering,” in
Proceedings of the 2000 International Conference on Intelligent User Interfaces, 2000, pp. 248–251.
[18] Y. Yang and J. Pedersen, “A comparative study on feature selection in text categorization,” in Proceedings
of the 14th International Conference on Machine Learning, 1997.