Dissertation Proposal

WEB MINING FOR KNOWLEDGE DISCOVERY

Zhongming Ma
Ph.D. Candidate in Information Systems
School of Accounting and Information Systems
David Eccles School of Business
The University of Utah

Co-chairs: Dr. Gautam Pant and Dr. Olivia Sheng
Committee members: Dr. Paul Hu, Dr. Ellen Riloff, Dr. Wei Gao

1 DISSERTATION PROPOSAL

1.1 Knowledge Discovery on the Web

Knowledge discovery from databases (KDD) refers to “the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” [Fayyad et al. 1996]. KDD has found a broad range of applications, including pattern recognition and predictive analytics, in areas such as engineering, business, and science. Knowledge discovery has two types of goals: verification and discovery. In general, the former refers to verifying a user’s hypothesis, and the latter can be further divided into prediction (i.e., predicting unknown or future values) and description (i.e., presenting identified results such as patterns in a human-understandable form) [Fayyad et al. 1996].

The Web has become a universal repository holding a tremendous amount of data that can be accessed from anywhere in the world, and it has experienced continuous growth in both its content and its users. The Web therefore presents immense opportunities for discovering knowledge. However, unlike in conventional databases, the data on the Web is mostly unstructured, which makes knowledge discovery on the Web more challenging than KDD on traditional databases. On the Web, the knowledge discovery process requires considerable effort in identifying, selecting, and processing data, possibly from multiple sources and in different (often free-form text) formats. Manual analysis that turns such large volumes of Web data into knowledge is impractical, and thus knowledge discovery on the Web becomes an attempt to address the accentuated problem of data overload.
We adapt the KDD process presented in [Fayyad et al. 1996] for Web mining and present the process of Web mining for knowledge discovery as follows.

[Figure 1. Process of Web mining for knowledge discovery: Web data → (Gathering) → Target data → (Preprocessing) → Processed data → (Transformation) → Transformed data → (Web mining/Prediction) → Patterns → (Interpretation) → Knowledge]

Web mining is a step in the KDD process, and it aims to analyze data and discover knowledge from the Web. Web data includes all kinds of Web documents, hyperlinks among Web pages, and Web usage logs. Depending on the type of Web data being mined, Web mining can be broadly divided into three categories: Web content mining, Web structure mining, and Web usage mining [Srivastava et al. 2000].

Web content mining is the process of discovering knowledge from Web page content (most often text), and it often uses techniques based on data mining and text mining [Liu 2006]. Important Web content mining problems include data/information extraction [e.g. Hammer et al. 1997], Web information integration [e.g. Knoblock et al. 1998], online opinion extraction, Web search [e.g. Brin and Page 1998], and processing (e.g., clustering or categorizing) search results according to page content [e.g. Zamir and Etzioni 1999; Dumais and Chen 2001] [Liu 2006].

Web structure mining tries to discover useful information, such as the importance of pages, from the structure of hyperlinks on the basis of social network analysis (SNA) techniques and graph theory. Its research topics include ranking pages [e.g. Brin and Page 1998; Chakrabarti et al. 1999] and finding Web communities [e.g. Gibson et al. 1998].

Web usage mining is the automatic discovery of user access patterns from Web logs [Cooley et al. 1997]. The identified visit patterns can help in understanding the overall access patterns and trends for all users [e.g. Zaïane et al.
1998] and allow for Web site design that is responsive to business goals and customer needs, such as user-level customization [e.g. Eirinaki and Vazirgiannis 2003].

My dissertation consists of two related topics: personalized search and business relationship discovery, both of which are in the area of Web mining for knowledge discovery. The first topic presents and evaluates an automatic personalized search framework that categorizes search results under a user’s interests, in order to examine how the proposed personalized search approach outperforms non-categorized and non-personalized baseline systems. This research falls under Web content mining. The second topic proposes an approach to identifying an intercompany network using company citations from Web content (more specifically, online news stories) and discovers business relationships between companies from the network on the basis of SNA and machine learning techniques. The second topic therefore covers both Web content mining and Web structure mining. The main research question we explore is whether structural attributes derived from the intercompany network, which in turn is derived from company citations in online news, can identify business relationships.

As shown in Figure 2, at a high level, the first topic connects Web content to people, and the second uses Web content to discover connections between companies. Thus the two topics are connected through the mining of Web content. However, the two topics generate different types of knowledge (interest-based personalized search results versus news-driven intercompany relationships) and hence draw on Web data, processing, and Web mining in different ways. In the next two sections we briefly introduce the two topics.
[Figure 2. Process View of the Two Topics of the Dissertation. Both topics start from Web content. Personalized search elements: taxonomy, user interests, construction of text classifiers for user interests, search query, search engine, search results, search results personalized to users. Business relationship discovery elements: online news, company citations, construction of weighted, directed intercompany network, network analysis, identification of network structural attributes, classification methods, discovery of business relationships between companies.]

1.2 Personalized Search

Most search engines, including the most popular ones such as Google and Yahoo!, ignore users’ search context, such as users’ interests. As a result, the same query from different users with different information needs retrieves the same search results displayed in the same way. Hence, they use a “one size fits all” [Lawrence 2000] approach. We note that Google is currently attempting to address this problem with some level of voluntary personalization. Personalization techniques that consider users’ context during search can improve search efficiency [Pitkow et al. 2002].

We propose and implement an automatic approach to categorizing search results according to a user’s interests, to help users find relevant information and find it more quickly. Our approach is particularly well suited to a workplace scenario, where much of the information the proposed system needs about the professional interests and skills of knowledge workers is available to the employer. Personalizing on the basis of such information within an organization can be expected to raise fewer privacy concerns than a general-purpose search engine gathering data on user interests. Moreover, unlike other approaches, our approach does not impose any burden of implicit or explicit feedback on the user.

[Figure 3. Knowledge Discovery Process for Interest-Based Personalized Search. Elements: user's interests, ODP, mapping framework, categories mapped to interests, category profiles, search query, search engine, Web pages, search results, classification, search results categorized under interests, evaluation (LIST, CAT, PCAT), improved search efficiency. Stages marked by the horizontal double-arrow lines: gathering, preprocessing, Web mining, interpretation.]

We customize the general process of Web mining for KDD in Figure 1 and present the process of interest-based personalized search for knowledge discovery in Figure 3, where the processes covered by the horizontal double-arrow lines correspond to the equivalent ones in Figure 1. The proposed approach includes a mapping framework that automatically maps user interests into a group of categories from the Open Directory Project (ODP) taxonomy. A text classifier is built from the content of the mapped ODP categories and is later used at query time to categorize search results under user interests. In a workplace scenario where employees’ professional interests and skills can be automatically extracted from their resumes or the company’s database, this approach is fully automatic in that users do not need to provide implicit or explicit feedback during the search. The use of ODP is also transparent to the users. The absence of explicit or implicit feedback and the use of the ODP taxonomy without the user being aware of it differentiate this work from many others, such as [Gauch et al. 2003; Liu et al. 2004; Chirita et al. 2005].

In addition, we study three search systems with different interfaces for displaying search results. The first system (LIST) shows search results in a page-by-page list. The second (CAT) categorizes and displays results under certain ODP categories. The third (PCAT) is the system we propose; it categorizes and displays results under user interests. We compare PCAT with LIST and PCAT with CAT on the basis of different query lengths and different types of search tasks.

The contribution of this research is an automatic approach to personalizing Web searches given a set of user interests.
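To make the query-time categorization step concrete, the following is a minimal sketch rather than the classifier used in this work. It assumes, purely for illustration, that each user interest is represented by a bag-of-words profile (standing in for the text of the mapped ODP categories), that results are assigned by cosine similarity, and that results below a hypothetical `threshold` fall into an "Other" group. The function names, the threshold, and the sample data are all assumptions.

```python
import math
from collections import Counter


def term_vector(text):
    """Lowercased bag-of-words term counts (whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def categorize(results, interest_profiles, threshold=0.1):
    """Group result snippets under the most similar interest profile.

    interest_profiles maps an interest name to profile text (assumed here to
    come from the mapped ODP categories). Results whose best similarity is
    below the threshold go to "Other" (an illustrative choice).
    """
    profiles = {name: term_vector(text) for name, text in interest_profiles.items()}
    grouped = {name: [] for name in profiles}
    grouped["Other"] = []
    for snippet in results:
        vec = term_vector(snippet)
        best, score = max(
            ((name, cosine(vec, prof)) for name, prof in profiles.items()),
            key=lambda pair: pair[1],
            default=("Other", 0.0),
        )
        grouped[best if score >= threshold else "Other"].append(snippet)
    return grouped


# Hypothetical interests and search-result snippets.
interests = {
    "Databases": "database sql query index transaction storage",
    "Machine Learning": "learning classifier training model neural feature",
}
snippets = [
    "sql query tuning and index design",
    "training a neural classifier model",
    "cheap flights to paris",
]
grouped_results = categorize(snippets, interests)
```

A display layer in the style of PCAT would then list the snippets under each interest heading instead of in a single page-by-page list.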
The main findings are that (1) PCAT is better than LIST for one-word queries and for the Information Gathering type of task, and (2) PCAT outperforms CAT for free-form queries and for both the Information Gathering and Finding types of tasks, in terms of the time spent on finding relevant results. We conclude that no system is universally better than the others; the performance of a system depends on parameters such as query length and type of task.

1.3 Business Relationship Discovery

Business news contains rich and current information about companies and the relationships among them. Reading news is very time consuming and requires a reader to possess certain skills, the most basic of which is a good understanding of the language in which the news is written. The huge volume of news stories makes the manual identification of relationships among a large number of companies nontrivial and unscalable. The previous literature on using news to automatically discover business relationships among companies is sparse. Many researchers in areas such as organizational behavior and sociology employ SNA techniques to investigate the nature and implications of business relationships on the basis of explicitly specified company relationships provided by reliable data sources [e.g. Levine 1972; Walker et al. 1997; Uzzi 1999; Gulati and Gargiulo 1999]. In contrast, researchers in bibliometrics and computer science tend to identify links between nodes using implicit signals, such as article citations, URL links, and email communications, derived from large and noisy data sources. They study problems such as identifying the importance of individual nodes (e.g., Web pages, journal articles) in a network [e.g. Garfield 1979; Brin and Page 1998; Kleinberg 1999] and finding communities on the Web [e.g. Kautz et al. 1997; Gibson et al. 1998], rather than discovering business relationships between companies.
We present an approach to the automatic discovery of company relationships from online business news using machine learning and SNA techniques. Figure 4 illustrates the knowledge discovery process for business relationship discovery from Web data (i.e., online news).

[Figure 4. Knowledge Discovery Process for Business Relationship Discovery. Elements: online news, company citations, network analysis, directed, weighted intercompany network, company pairs and network attributes, classification, classified business relationships, gold standards, evaluation, discovered business relationships. Stages: gathering, preprocessing, Web mining, prediction.]

Given that a news story pertaining to a company often cites one or more other companies, we construct a directed and weighted intercompany network from the citations in a large amount of online news, treating company citations as directed links from the focal companies to the cited companies. We then identify four types of attributes from the network structure using SNA techniques: dyadic degree-based, node degree-based, node centrality-based, and structural equivalence-based attributes. These attributes differ in their coverage of the network. With these network attributes, we study two types of company relationships using machine learning methods. This news-driven, SNA-based business relationship discovery approach is scalable and language-neutral. Research along this line consists of two studies that differ in their target business relationships, and we describe them as follows.

The first study concentrates on predicting the company revenue relation (CRR). Given a pair of companies, CRR refers to the relative size of the two companies’ annual revenues. We find that degree-based and centrality-based attributes derived from the network structure can predict CRR with reasonable precision, recall, and accuracy (all above 70%) for all directly linked company pairs in the network.
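The network construction described above can be illustrated with a short sketch. It is a simplified illustration and not the procedure used in the studies: it assumes citations have already been extracted as (focal company, cited company) pairs, that a link's weight is the citation count, and that self-citations are ignored. It computes only simple dyadic and node degree attributes; the exact attribute definitions used in the dissertation, and the centrality and structural equivalence attributes, are not reproduced here. All function names and the sample company labels are hypothetical.

```python
from collections import defaultdict


def build_network(citations):
    """Build a directed, weighted network from (focal, cited) pairs.

    Returns a dict mapping (focal, cited) to the number of citations.
    Self-citations are skipped (an assumption made for this sketch).
    """
    weights = defaultdict(int)
    for focal, cited in citations:
        if focal != cited:
            weights[(focal, cited)] += 1
    return dict(weights)


def node_degrees(weights):
    """Weighted out-degree and in-degree of every company in the network."""
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for (focal, cited), weight in weights.items():
        out_deg[focal] += weight
        in_deg[cited] += weight
    return dict(out_deg), dict(in_deg)


def dyadic_attributes(weights, a, b):
    """Link weights in both directions for the company pair (a, b)."""
    return {
        "a_to_b": weights.get((a, b), 0),
        "b_to_a": weights.get((b, a), 0),
    }


# Hypothetical citations: each pair means a news story about the first
# company cites the second company.
sample_citations = [
    ("A", "B"), ("A", "B"), ("B", "A"), ("A", "C"), ("C", "B"), ("A", "A"),
]
network = build_network(sample_citations)
out_degree, in_degree = node_degrees(network)
pair_ab = dyadic_attributes(network, "A", "B")
```

Attribute vectors of this kind, computed for each linked company pair, would then serve as the input to a classifier.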
The contributions of this study are as follows. (1) Our approach can serve as a data filtering step for studying the revenue relations among a very large number of companies. (2) Since revenue information for public companies is available quarterly, our approach can be used as a prediction tool for revenues. (3) Our approach can also be applied to discover the revenue relations for private or foreign companies.

In the second study we examine the competitor relationship between companies. We discover the competitor relationship between a pair of connected companies in the intercompany network on the basis of the four types of attributes. In particular, we study the classification of company pairs for an imbalanced dataset, in which the number of competitor pairs is much smaller than the number of non-competitor pairs. To evaluate the classification performance of our approach, we use two gold standards, Hoovers.com and Mergentonline.com, which are professional company profile websites that contain manually identified competitors for each company. Given that neither of the gold standards is complete in its coverage of competitors, we estimate the coverage of each gold standard. Finally, we present metrics to estimate how much our approach can extend each of the gold standards.

The contributions of this work include an automatic approach to discovering competitor relationships between companies. Our approach is particularly useful as an initial data filtering step to identify a group of potential competitors for each of many companies. We study an imbalanced dataset problem and report the classification performance for competitor pairs in both the imbalanced dataset and the whole dataset. Most importantly, we report the estimated extension that our approach provides to each of the two gold standards.

1.4 Overview of Dissertation

At a high level, the dissertation is organized as follows.
Part I, which consists of Chapters 2 to 5, covers the first topic of the dissertation: Interest-Based Personalized Search. Part II, which includes Chapters 6 to 9, covers the two related studies in business relationship discovery. More specifically, we highlight each chapter as follows. Chapter 2 introduces the research on personalized search and reviews related prior work. We detail our approach to personalized search in Chapter 3. Experiments are covered in Chapter 4, and result analyses and conclusions are discussed in Chapter 5. For the topic of business relationship discovery, we introduce it and review prior literature in Chapter 6. Chapter 7 describes how to identify attributes from the network structure and explains the data and data processing procedures. We concentrate on predicting CRR in Chapter 8 and on discovering competitor relationships in Chapter 9. Finally, we conclude the dissertation in Chapter 10.

1.5 Proposed Plan

The timeline of my dissertation is as follows.

Feb. 13, 2007        Proposal defense
Mar. 16, 2007        Sending dissertation draft to committee members and to Thesis Office for format approval
Mar. 30, 2007        Update on the dissertation draft
Apr. 3 or 10, 2007   Dissertation defense