SYNOPSIS

TEXT MINING FOR INFORMATION RETRIEVAL

Introduction

Nowadays, large quantities of data are being accumulated in data repositories. Usually there is a huge gap between the stored data and the knowledge that could be constructed from it. This transition does not occur automatically; that is where Data Mining comes into the picture. In Exploratory Data Analysis, some initial knowledge about the data is assumed, whereas Data Mining can yield a more in-depth understanding of it. Extracting knowledge from massive data is one of the most desired capabilities of Data Mining. Manual data analysis has been practised for some time, but it becomes a bottleneck for large-scale analysis. Rapidly developing computer science and engineering techniques and methodologies generate new demands to mine complex data types. A number of Data Mining techniques (such as association, clustering and classification) have been developed to mine this vast amount of data. Previous studies [18] on Data Mining focus on structured data, such as relational and transactional data. However, in reality, a substantial portion of the available information is stored in text databases (or document databases), which consist of large collections of documents from various sources, such as news articles, books, digital libraries and Web pages. Text databases are growing rapidly due to the increasing amount of information available in electronic form, such as electronic publications, e-mail, CD-ROMs, and the World Wide Web (which can itself be viewed as a huge, interconnected, dynamic text database). Data stored in text databases is mostly semi-structured, i.e., it is neither completely unstructured nor completely structured. For example, a document may contain a few structured fields, such as title, authors, publication date, length and category, but also some largely unstructured text components, such as the abstract and contents.
In recent database research, studies have been carried out to model and implement semi-structured data. Information Retrieval techniques, such as text indexing, have been developed to handle unstructured documents. However, traditional Information Retrieval techniques become inadequate for the increasingly vast amount of text data. Typically, only a small fraction of the many available documents will be relevant to a given individual or user. Without knowing what could be in the documents, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users need tools to compare different documents, rank the importance and relevance of documents, or find patterns and trends across multiple documents. Thus, Text Mining has become an increasingly popular and essential theme in Data Mining. Text Mining, also known as knowledge discovery from text or document information mining, refers to the process of extracting interesting patterns from very large text corpora for the purpose of discovering knowledge. It is an interdisciplinary field involving Information Retrieval, Text Understanding, Information Extraction, Clustering, Categorization, Topic Tracking, Concept Linkage, Computational Linguistics, Visualization, Database Technology, Machine Learning, and Data Mining [25]. Text Mining tools and applications aim to capture relationships within the data. They can be roughly organized into two groups. One group focuses on document exploration functions that organize documents based on their content and provide an environment for a user to navigate and browse a document or concept space; it includes Clustering, Visualization, and Navigation. The other group focuses on text analysis functions that analyze the content of documents and discover relationships between the concepts or entities described in them.
These are mainly based on natural language processing techniques, including Information Retrieval, Information Extraction, Text Categorization, and Summarization [27], [28]. Content-based text selection techniques have been extensively evaluated in the context of Information Retrieval. Every approach to text selection has four basic components:

- some technique for representing the documents;
- some technique for representing the information needed (i.e., profile construction);
- some way of comparing the profiles with the document representation;
- some way of using the results of the comparison.

Issues in Information Retrieval

The main motivation of our research is to study existing tools and techniques of Text Mining for Information Retrieval (IR). The search engine is the best-known Information Retrieval tool. Applying Text Mining techniques to Information Retrieval can improve the precision of retrieval systems by filtering relevant documents for a given search query. Electronic information on the Web is a useful resource for users to obtain a variety of information. The process of manually compiling text pages according to a user's needs and preferences into actionable reports is very labour-intensive, and the effort is greatly amplified when the reports need to be updated frequently. Updating what has been collected often requires repeated searching, filtering of previously retrieved web pages and reorganizing them. To harness this information, various search engines and Text Mining techniques have been developed to gather and organize web pages. Retrieving relevant text pages on a topic from a large page collection is a challenging task. Some issues identified in the Information Retrieval process are given below:

Issue (1): Traditional Information Retrieval techniques become inadequate for handling large text databases containing a high volume of text documents.
To search for relevant documents in a large document collection, a vocabulary is used that maps each term given in the search query to the address of the corresponding inverted file; the inverted files are then read from disk and merged, taking the intersection or union of the sets of documents for AND, OR and NOT operations [8], [24], [30], [31]. To support the retrieval process, the inverted file requires several additional structures, such as the document frequency of each lexicon entry in the vocabulary and the term frequency of each term in each document. The principal costs of the searching process are the memory space required to hold inverted file entries and the time spent processing large inverted files, which maintain a record of every document of the corpus as a potential answer. More terms in the query mean more disk accesses into the inverted file and more time spent merging the obtained lists.

Issue (2): Presently, while performing query-based searching, search engines return a set of web pages containing both relevant and non-relevant pages, sometimes assigning non-relevant pages a higher rank score. These search engines use one of the following approaches to organize, search and analyze information on the web. In the first approach [30], the ranking algorithm uses term frequency to select the terms of the page for indexing a web page (after filtering out common or meaningless words). In the second approach [5], [9], [11], [19], [20], the structure of links between pages is considered to identify pages that are often referenced by other pages. By analyzing the density, direction and clustering of links, such methods are capable of identifying pages that are likely to contain valuable information. Another approach [9], [26], [29] analyzes the content of the pages linked to or from the page of interest.
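The inverted-file search described under issue (1) can be sketched as a minimal in-memory toy (real systems keep the postings compressed on secondary storage; the function names here are illustrative, not from the thesis):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of documents containing it,
    plus the auxiliary structures mentioned above: document
    frequency (df) and per-document term frequency (tf)."""
    postings = defaultdict(set)
    tf = defaultdict(dict)          # term -> {doc_id: occurrences}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
            tf[term][doc_id] = tf[term].get(doc_id, 0) + 1
    df = {term: len(ids) for term, ids in postings.items()}
    return postings, df, tf

def search(postings, terms, op="AND"):
    """Merge the inverted lists: set intersection for AND, union for OR."""
    lists = [postings.get(t, set()) for t in terms]
    if not lists:
        return set()
    return set.intersection(*lists) if op == "AND" else set.union(*lists)

docs = {
    1: "text mining for information retrieval",
    2: "data mining of large text databases",
    3: "information extraction from web pages",
}
postings, df, tf = build_inverted_index(docs)
print(sorted(search(postings, ["text", "mining"], "AND")))   # [1, 2]
```

Each additional query term adds one more list to fetch and merge, which is exactly the cost the issue describes.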
These methods analyze the similarity of word usage at different link distances from the page of interest and demonstrate that the structure of words used by the linked pages enables more efficient indexing and searching. The anchor text [15] of a hyperlink is considered to describe its target page, so target pages can be represented by their corresponding anchor text. But the nature of the Web search environment is such that retrieval approaches based on a single source of evidence suffer from weaknesses that can hurt retrieval performance. For example, the content-based Information Retrieval approach does not consider the link information of a page while ranking it, which affects the quality of the retrieved web documents, while link-based approaches [6], [19], [20] can suffer from incomplete or noisy link topology. The inadequacy of singular Web Information Retrieval approaches makes a strong argument for combining multiple sources of evidence as a potentially advantageous retrieval strategy for Web Information Retrieval.

Issue (3): A common problem in Information Retrieval is that users have to browse a large number of documents, containing both relevant and non-relevant ones, before finding the relevant documents. Clustering keeps similar documents together in a single group and hence speeds up Information Retrieval by retrieving documents from the cluster whose centroid best matches the query vector. Many clustering algorithms are available, such as K-means, Bisecting K-means, HFTC (Hierarchical Document Clustering Using Frequent Itemsets), the hybrid PSO+K-means method and Global K-means [3], [7], [16], [17], [23]. But there are many challenges in applying these existing clustering techniques to the domain of text documents. Bisecting K-means produces a deep hierarchy, which is difficult to browse if one makes an incorrect selection while navigating it.
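For orientation, the baseline K-means procedure that these variants build on can be sketched as follows (an illustrative toy with plain numeric vectors standing in for document vectors; the sensitivity to the random initial centers is the weakness discussed here):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(group):
    """Component-wise mean of a non-empty group of vectors."""
    return [sum(p[i] for p in group) / len(group)
            for i in range(len(group[0]))]

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means: pick k initial centers at random, then alternate
    assignment and centroid-update steps. The result can vary with the
    initial centers, so K-means may settle in a local optimum."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist2(p, centers[c]))
            groups[i].append(p)
        centers = [centroid(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return groups

groups = kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], 2)
print([len(g) for g in groups])
```

On well-separated data like this toy set the two natural groups are recovered; on real document collections the choice of k and of the initial centers matters, which motivates the variants listed above.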
Although HFTC [4] produces a relatively flat hierarchy compared to Bisecting K-means, it is expensive in terms of calculating the global frequent itemsets needed to create the clusters. Global K-means [22], unlike K-means, is insensitive to the choice of the initial k cluster centers and thus gives a globally optimal solution. But it requires executing the K-means method nk times for a document set of size n to generate k clusters, giving a time complexity of O(nk). In the hybrid PSO+K-means method [12], the PSO (Particle Swarm Optimization) module is executed for a short period to search for the optimum cluster centroid locations, and the K-means module is then used for refining and generating the final optimal clustering solution. This method produces a globally optimal solution like Global K-means, but it also requires assuming the initial value of k.

Issue (4): To speed up document retrieval, text summarization techniques are used [1], [2], [10], [13]. Documents are ranked based on the summary or abstract provided by their authors. But this is not always possible, as not all documents come with an abstract or summary. Also, when different summarization tools such as Copernic, SweSum, Extractor, MSWord, Intelligent, Brevity and Pertinence are used to summarize a document, not all the topics covered within the document are reflected in its summary.

Proposed Solutions

The aim of this thesis is to address the issues discussed above in order to improve the accuracy of the Information Retrieval process. The approaches proposed in this thesis contribute to the field of text Information Retrieval and provide more relevant documents (web pages) for a given search query, accurately and efficiently in less time.

1) To speed up text document retrieval and utilize memory space effectively, as discussed in issue (1), we propose an algorithm based on the inverted index file.
By using the range partition feature of Oracle, the memory space requirement is reduced considerably, as the inverted index file is stored on secondary storage and only the required portion of it is maintained in main memory. Fuzzy logic is applied to retrieve the selected documents, and suffix tree clustering is then used to group similar documents.

2) To handle the problem discussed in issue (2), a method is proposed that learns the web structure to classify web documents and demonstrates the usefulness of considering the text content of backward and forward links when computing the page rank score. The similarity of word usage at a single-level link distance from the target page is analyzed, which shows that the content of the linked pages enables more efficient indexing and searching. The proposed method reduces the limitations of existing Link Analysis algorithms such as Kleinberg's HITS algorithm and SALSA [14], [21] while computing the rank scores of retrieved web pages, and the results obtained are not biased towards the in-degree or out-degree of the target page. Also, the rank scores obtained are non-zero, which helps to rank the web pages more accurately.

3) An approach to text document clustering is proposed that overcomes the drawbacks of K-means and Global K-means discussed in issue (3); it gives a globally optimal solution with time complexity O(lk) to obtain k clusters from an initial set of l starting clusters.

4) A new method of building a generic, extract-based, fixed-length summary of a single text document is proposed to handle the limitations of the existing summarization algorithms discussed in issue (4). The index term(s) of the document are identified based on the keyterms of each sentence and paragraph within the document. The rank of each sentence is computed from the number of matching terms between the document index terms and the sentence index terms.
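A toy sketch of this sentence-ranking scheme follows (the helper names and the frequency-based approximation of index terms are illustrative assumptions, not the thesis's actual keyterm procedure):

```python
from collections import Counter

# A small stop list; the thesis does not specify one.
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to", "are", "from"}

def index_terms(text, k=5):
    """Approximate the document's index terms by its k most frequent
    content words (the thesis derives them from keyterms of each
    sentence and paragraph; this is a simplification)."""
    words = [w.strip(".,").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return {w for w, _ in counts.most_common(k)}

def rank_sentences(sentences, terms):
    """Score each sentence by how many document index terms it shares."""
    scored = [(len({w.strip(".,").lower() for w in s.split()} & terms), s)
              for s in sentences]
    return sorted(scored, key=lambda x: -x[0])

def summarize(text, n=1):
    """Extract the n highest-ranked sentences as the summary."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s for _, s in rank_sentences(sentences, index_terms(text))[:n]]

doc = ("Text mining extracts patterns from text. "
       "Clustering groups documents. "
       "Text mining supports retrieval of text documents.")
print(summarize(doc, 1))
```

Note how sentence position plays no role in the score: a sentence late in the document can outrank earlier ones if it shares more index terms.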
Sentences having a high rank score are extracted for inclusion in the final summary.

Results and Findings of the Work Done

The conclusions and findings derived from applying the proposed methods to the issues in Information Retrieval discussed above are as follows:

1) Compression and partitioning of the inverted file reduce the memory space requirement and allow the entire compressed inverted file to be stored in secondary storage. It is this use of compression and partitioning that yields the superior performance of the retrieval process, as shown in Fig. 1. Through the range partitioning feature of Oracle, a smaller, faster representation of the sorted lexicon of interest can be achieved. To process a query, only a small portion of the compressed inverted index file is cached in memory. The input/output (I/O) time required for loading a much smaller compressed postings list is small, although it adds some decompression cost. Hence, the retrieval system runs faster on compressed posting lists than on uncompressed ones. Also, the sorted lexicon permits rapid binary search for matching strings. Conjunctive queries are easily handled through fuzzy logic by retrieving documents having a high value of the α-cut (threshold value) over the document set for the AND operation and a non-zero value of the α-cut for the OR operation. The proposed method also retrieves documents containing synonyms of the searched query terms rather than only the query terms themselves.
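Under the usual fuzzy-retrieval formulation, the α-cut behaviour just described can be sketched as follows (the membership degrees and the function name `fuzzy_retrieve` are illustrative assumptions, not the thesis's actual implementation):

```python
def fuzzy_retrieve(memberships, terms, op="AND", alpha=0.5):
    """memberships: {doc_id: {term: degree in [0, 1]}}.
    AND keeps documents whose minimum membership over the query terms
    meets the alpha-cut; OR keeps documents with any non-zero maximum."""
    results = {}
    for doc_id, degrees in memberships.items():
        vals = [degrees.get(t, 0.0) for t in terms]
        score = min(vals) if op == "AND" else max(vals)
        if (op == "AND" and score >= alpha) or (op == "OR" and score > 0):
            results[doc_id] = score
    return results

memberships = {
    "d1": {"mining": 0.9, "text": 0.7},
    "d2": {"mining": 0.4, "text": 0.8},
    "d3": {"retrieval": 0.6},
}
# AND with alpha = 0.5: only d1 survives, since min(0.9, 0.7) = 0.7
print(fuzzy_retrieve(memberships, ["mining", "text"], "AND", 0.5))
# OR: every document with a non-zero degree for some query term
print(fuzzy_retrieve(memberships, ["mining", "retrieval"], "OR"))
```

Synonym support would amount to giving a document a non-zero membership degree for a term when it contains a synonym of that term.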
[Fig. 1: Compressed inverted index file structure showing range partitioning — the text document collection is stored on disk with a compressed, range-partitioned vocabulary of words, posting lists containing document ids, and posting lists containing synonyms.]

2) On the basis of the results obtained, it is found that the text content of both backward and forward links is useful for ranking the target page. It is also observed that using only the extended anchor text from documents that link to the target document, or only the words and phrases on the target page itself (full text), does not yield very accurate results. We analyze the similarity of word usage at a single-level link distance from the page of interest and demonstrate that the content of the linked pages enables more efficient indexing and searching. The proposed method reduces the limitations of the existing Link Analysis algorithms (HITS, pSALSA, SALSA, HubAvg, AThresh, HThresh, FThresh, BFS) while computing the rank of a web page, and its results are not biased towards the in-degree or out-degree of the target page, as the links to the target page can be easily differentiated into navigational, functional and noisy links. While computing the rank score of the target page, only functional links are considered. Also, the non-zero rank scores obtained by the proposed method help to rank the web pages accurately, whereas other Link Analysis algorithms sometimes compute a zero rank score for some pages.
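One simple way to realize "rank by the text content of linked pages" is to score a target page by its mean cosine similarity to the pages one link away; the sketch below is my illustrative reading of that idea (pooled backward and forward links, assumed already filtered to functional links), not the thesis's exact scoring formula:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def content_link_score(target_text, linked_texts):
    """Mean content similarity between a target page and the pages one
    link away. Only links classified as functional should be passed in."""
    target = Counter(target_text.lower().split())
    if not linked_texts:
        return 0.0
    sims = [cosine(target, Counter(t.lower().split())) for t in linked_texts]
    return sum(sims) / len(sims)

on_topic = content_link_score(
    "text mining retrieval",
    ["text mining methods", "retrieval of text"])
off_topic = content_link_score(
    "text mining retrieval",
    ["cooking recipes", "garden tools"])
print(on_topic, off_topic)
```

A page surrounded by topically related neighbours scores higher than one whose links share no vocabulary with it, independently of how many links it has.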
3) The following weaknesses of K-means and Global K-means clustering are removed by the proposed clustering method:

(i) The number of clusters k need not be assumed initially, as required in K-means; it is determined by the proposed clustering method itself. The required clusters are then obtained iteratively by combining similar clusters based on their inter-cluster distance (between two cluster centroids) and minimum intra-cluster distance (between a cluster centroid and its member documents).

(ii) The proposed method always obtains the same clusters from the same data, even if the documents are processed in a different sequence, which is not the case for K-means.

(iii) Different initial conditions (k cluster centers) produce the same clustering results. Hence the algorithm is not trapped in a local optimum, as K-means can be.

(iv) In the proposed method, it is not necessary to know which terms in a document contribute more to the grouping process, since the TF-IDF weight of each term determines its importance in the clustering process. Hence the final clusters produced are independent of any initial assumptions made at the start of the clustering process.

(v) The time complexity of the proposed clustering method, starting with l initial clusters, is O(lk), which is less than the O(nk) time complexity of Global K-means for a document set of size n, at some cost in cluster quality. But the gain from the reduced time complexity outweighs this slight degradation in cluster quality.

Experimental evaluation on Reuters newsfeeds (Reuters-21578) shows that the clustering results (entropy, purity, F-measure) obtained by the proposed method are comparable with those of K-means and Global K-means.

4) It has been observed that the summaries produced by the proposed method for generating a generic, extract-based single text document summary clearly depict the topic(s) discussed in the text document and show linking between the sentences of the summary.
Unlike other summarization methods, the method is independent of the structure of the text document and the position of a sentence within the document. A sentence appearing later in the document can be included in the summary according to its importance within its paragraph. The proposed text summarizer avoids redundant information in the summary by excluding sentences conveying the same information, and hence improves quality by packing more information into the fixed-length generated summary. We evaluated our approach on the DUC-2002 corpus (the dataset contains two baseline 100-word extract summaries) and it shows satisfactory results when compared with all the reported summarization systems in terms of ROUGE-N (N = 1 to 8).

Chapter Details of the Thesis

The thesis is organized chapter-wise as follows:

Chapter 1: This chapter is devoted to an introduction to Data Mining, Text Mining and Information Retrieval. The techniques, application areas and architectures of Data Mining and Text Mining are discussed. Basic concepts, models and techniques of Information Retrieval, such as the extraction of index terms and retrieval models, are also discussed. At the end of the chapter, the different Information Retrieval evaluation techniques and the framework of Information Retrieval are explained.

Chapter 2: This chapter discusses related work on document indexing, the hyperlink structure of web pages, clustering and text document summarization. Based on the literature survey on each topic, the problems and challenges identified in existing tools and techniques are discussed briefly, providing the basis for the work carried out.

Chapter 3: This chapter presents the Quick Text Retrieval Algorithm Supporting Synonyms Based on Fuzzy Logic.
Different compression algorithms (such as the Elias gamma code, Elias delta code and Fibonacci code) are studied for storing the inverted index file, and the concept of Fuzzy Information Retrieval is discussed along with suffix tree clustering.

Chapter 4: This chapter is about Web Page Ranking Based on Text Content of Linked Pages. Different link analysis ranking algorithms (HITS, pSALSA, SALSA, HubAvg, AThresh, HThresh, FThresh, BFS) are discussed, and the ranking scores computed by these algorithms are compared with the proposed ranking approach, which computes the rank score of the target web page based on content analysis of the pages linked to it.

Chapter 5: This chapter discusses the problem of Automatic Generation of the Initial Value k to Apply the K-means Method for Text Document Clustering. Different clustering techniques and their limitations are discussed, including K-means, Bisecting K-means, HFTC (Hierarchical Document Clustering Using Frequent Itemsets), the hybrid PSO+K-means method and the Global K-means method. A clustering method is then proposed to overcome the limitations of these existing methods.

Chapter 6: This chapter contains two sections. In the first section, Document Summarization Based on Sentence Ranking Using a Vector Space Model is discussed, and different summarization tools (Copernic, SweSum, Extractor, MSWord, Intelligent, Brevity, Pertinence) are analyzed. The summaries obtained from these tools are compared with the proposed summarizer on the DUC-2002 dataset using the ROUGE package. In the second section, a method is suggested for query-based text summarization using clustering (for both single and multiple documents).

Chapter 7: This is the last chapter of the thesis, in which the conclusion and future scope are discussed.
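As a small illustration of the posting-list compression codes studied in Chapter 3, the Elias gamma code of a positive integer writes the number in binary, preceded by one zero per bit beyond the first (a generic sketch of the standard code, not the thesis's implementation):

```python
def elias_gamma_encode(n):
    """Elias gamma code: n in binary, preceded by (length - 1) zeros."""
    assert n >= 1, "gamma codes are defined for positive integers"
    binary = bin(n)[2:]                  # e.g. 9 -> '1001'
    return "0" * (len(binary) - 1) + binary

def elias_gamma_decode(bits):
    """Count the leading zeros, then read that many bits plus one."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    return int(bits[zeros:zeros + zeros + 1], 2)

print(elias_gamma_encode(9))   # '0001001'
```

Small numbers get short codes, which suits the small gaps that dominate sorted posting lists.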
Keywords: Information Retrieval, Suffix-tree clustering, Fuzzy logic, Query processing, Vector Space Model, Index compression, Inverted Index file, Backward links, Forward links, Link Structure Analysis, Web page ranking, Clustering, K-means clustering, Global K-means clustering, Extract Summary, ROUGE tool, Text Summarization

List of Author's Publications

1. Saxena P.C., and Gupta N., "Quick Text Retrieval Algorithm Supporting Synonyms Based on Fuzzy Logic", Computing Multimedia and Intelligent Techniques (CMIT), vol. 2, no. 1, pp. 7-24, 2006.
2. Saxena P.C., Gupta J.P., Gupta N., "Web Page Ranking Based on Text Content of Linked Pages", International Journal of Computer Theory and Engineering (IJCTE), vol. 2, no. 1, pp. 42-51, Feb. 2010.
3. Gupta N., Saxena P.C., Gupta J.P., "Automatic generation of initial value k to apply k-means method for text documents clustering", International Journal of Data Mining, Modelling and Management (IJDMMM), vol. 3, no. 1, pp. 18-41, 2011.
4. Gupta N., Saxena P.C., Gupta J.P., "Document Summarization based on Sentence Ranking Using Vector Space Model", accepted in International Journal of Data Mining, Modelling and Management (IJDMMM), 2011.

Major References

[1] Arora R., and Ravindran B., "Latent Dirichlet Allocation and Singular Value Decomposition based Multi-Document Summarization", Proc. Eighth IEEE International Conference on Data Mining (ICDM 2008), IEEE Press, pp. 713-718, 2008.
[2] Basagic R., Krupic D., Suzic B., "Automatic Text Summarization, Information Search and Retrieval", WS 2009, Institute for Information Systems and Computer Media, Graz University of Technology, Graz, 2009.
[3] Bellot P., and El-Beze M., "A Clustering Method for Information Retrieval", Laboratoire d'Informatique d'Avignon, France, Tech. Rep. IR-0199, 1999.
[4] Fung B.C.M., Wang K., Ester M., "Hierarchical document clustering using frequent itemsets", Proc.
Third SIAM International Conference on Data Mining 2003, pp. 59-70, 2003.
[5] Borodin A., Roberts Gareth O., Rosenthal J.S., Tsaparas P., "Finding Authorities and Hubs from Link Structures on the World Wide Web", Proc. 10th WWW Conference, Hong Kong, pp. 415-429, 2001.
[6] Borodin A., Roberts Gareth O., Rosenthal J.S., Tsaparas P., "Link Analysis Ranking: Algorithms, Theory, and Experiments", ACM Transactions on Internet Technology, vol. 5, no. 1, pp. 231-297, 2005.
[7] Bouguettaya A., "On-Line Clustering", IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 2, pp. 333-339, 1996.
[8] Bratley P., and Choueka Y., "Processing Truncated Terms in Document Retrieval Systems", Information Processing and Management, vol. 18, no. 5, pp. 257-266, 1982.
[9] Chakrabarti S., Dom B., Gibson D., Kleinberg Jon M., Raghavan P., Rajagopalan S., "Automatic Resource List Compilation by Analyzing Hyperlink Structure and Associated Text", Proc. 7th International WWW Conference, pp. 65-74, 1998.
[10] Chatterjee N., and Mohan S., "Extraction-Based Single-Document Summarization Using Random Indexing", Proc. 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007), vol. 2, pp. 448-455, 2007.
[11] Chen Z., Liu S., Wenyin L., Pu G., Ma W., "Building a Web Thesaurus from Web Link Structure", Proc. 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 48-55, 2003.
[12] Cui X., and Potok Thomas E., "Document Clustering Analysis Based on Hybrid PSO+K-means Algorithm", Journal of Computer Sciences (Special Issue), pp. 27-33, 2005.
[13] Dehkordi P., Khosravi H., Kumarci F., "Text Summarization Based on Genetic Programming", International Journal of Computing and ICT Research, vol. 3, no. 1, pp. 57-64, 2009.
[14] Farahat A., Lofaro T., Miller Joel C., Rae G., Ward L.A., "Authority Rankings from HITS, PageRank, and SALSA: Existence, Uniqueness, and Effect of Initialization", SIAM J.
Scientific Computing, vol. 27, no. 4, pp. 1181-1201, 2006.
[15] Eiron N., and McCurley K.S., "Analysis of Anchor Text for Web Search", Proc. 26th Annual International ACM SIGIR Conference on Research and Development in IR, pp. 459-460, 2003.
[16] Fisher D.H., "Knowledge Acquisition via Incremental Conceptual Clustering", Machine Learning, vol. 2, pp. 139-172, 1987.
[17] Florian B., Martin E., Xiaowei X., "Frequent Term-Based Text Clustering", Proc. 8th International Conference on Knowledge Discovery and Data Mining (KDD 2002), Canada, pp. 436-442, 2002.
[18] Han J., and Kamber M., "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2001.
[19] Henzinger Monika R., and Bharat K., "Improved Algorithms for Topic Distillation in a Hyperlinked Environment", Proc. 21st International ACM SIGIR Conference on Research and Development in IR, pp. 104-111, 1998.
[20] Kleinberg Jon M., "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM, vol. 46, no. 5, pp. 604-632, 1999.
[21] Lempel R., and Moran S., "The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect", Proc. 9th International World Wide Web Conference, Amsterdam, Netherlands, pp. 387-401, 2000.
[22] Likas A., Vlassis N., Verbeek J.J., "The Global K-means Clustering Algorithm", Pattern Recognition, vol. 36, no. 2, pp. 451-461, 2003.
[23] Melnik S., Raghavan S., Yang B., Garcia-Molina H., "Building a Distributed Full-Text Index for the Web", ACM Transactions on Information Systems, vol. 19, no. 3, pp. 217-241, 2001.
[24] Moffat A., and Zobel J., "Self-Indexing Inverted Files for Fast Text Retrieval", presented in preliminary form at the 1994 Australian Database Conference and at the 1994 IEEE Conference on Data Engineering, Feb. 1994.
[25] Stavrianou A., Andritsos P., Nicoloyannis N., "Overview and Semantic Issues of Text Mining", ACM SIGMOD Record, vol. 36, no. 3, pp. 23-34, 2007.
[26] Szymanski Boleslaw K., and Chung M., "A Method for Indexing Web Pages Using Web Bots", Proc. International Conference on Info-Tech and Info-Net (ICII 2001), Beijing, China, IEEE CS Press, pp. 1-6, 2001.
[27] Tan A., "Text Mining: Promises and Challenges", Proc. South East Asia Regional Computer Confederation (SEARCC'99), 1999.
[28] Tan A., "Text Mining: The State of the Art and the Challenges", Proc. PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases, pp. 65-70, 1999.
[29] Yang K., "Combining Text- and Link-Based Retrieval Methods for Web IR", Proc. 10th Text REtrieval Conference, pp. 609-618, 2001.
[30] Zobel J., and Moffat A., "Inverted Files for Text Search Engines", ACM Computing Surveys, vol. 38, no. 2, pp. 1-56, 2006.
[31] Zobel J., Moffat A., Sacks-Davis R., "Searching Large Lexicons for Partially Specified Terms Using Compressed Inverted Files", Proc. 19th VLDB Conference, Dublin, Ireland, pp. 290-301, 1993.