Auto-Survey: An Automatic Survey Generation Tool

Arash Termehchy
University of Illinois at Urbana-Champaign
[email protected]

Abstract: Providing professionals and researchers with a good survey of the documents and papers about a particular subject has long been a problem in those communities. In this paper, we formalize the problem and introduce a method to generate survey papers automatically using the referential text, or context, found in other papers that refer to them. A tool has also been developed based on this method.

1. Introduction

A survey paper is a basic need in every branch of every field of study and of professional activities such as engineering, medicine, or law. Such a paper should provide the user with an overview of the important issues in the field, the approaches taken to those issues, the methods of implementing each approach, and, most importantly, clear information on their results in terms of performance, applicability, or popularity. So far, it has been the task of knowledgeable individuals to seek out the appropriate data sources such as papers and documents, select the most appropriate and representative ones, and organize them into the above structure. Needless to say, the quality of a survey is always considered to depend on the knowledge and neutrality of the person providing it.

To the best of our knowledge, there is no automated tool that produces tutorial or survey papers. The closest automated tools addressing this requirement are multiple-document summarizers [RD]. Multiple-document summarizers extract the most important and representative text fragments, mostly sentences, from the input documents and compose them into a new short document. The key issues are to define appropriate importance criteria and to keep similar sentences, even highly important ones, from both appearing in the final document. Summarizers such as Multi-Gen [MG] and News-In-Essence [NIS] have been developed on this basis. The main differences between multiple-document summarization and survey generation, which prevent us from using those tools in this domain, are:

1. Multiple-document summarizers give very good results on small documents such as web pages or news articles, but so far they have not achieved considerable superiority over baseline naïve methods such as choosing the first sentences of the abstract or introduction of each paper [RD].

2. There are fewer common entities among technical, scientific, or more complicated professional documents than among web pages or news documents [NSH]. This makes current multiple-document summarization methods, which use IR or NLP techniques for extraction, far less effective.

3. One of the most important roles of a survey paper is to judge the different aspects of the methods and to compare them. This information is either not available in the document content, not thorough enough, or simply cannot be trusted scientifically. The quality of a survey paper depends heavily on how authoritative its writers are.

There are some approaches in the community toward abstractive summarization, which composes new sentences using AI and NLP techniques. Such methods, if applied to multiple documents, could address the first two issues; however, no such approach exists for multiple-document summarization, and even in single-document summarization the results are still far from applicable [DBM].
In this paper, instead of the target documents' own contents, we use the referential text, or context, available in the other documents that cite them to build a survey about them. In every scientific, technical, or professional document there are parts where the authors refer to another paper or document for various reasons, as when the authors of a scientific paper compare their work with similar earlier work. These parts, called contexts [BS], are a precious data source for building survey papers. There have been some efforts to use this information to mine the technical and scientific literature. In [BS], the authors index and query scientific papers using the terms occurring in their contexts instead of their contents, and show better precision for informational queries and better ranking for navigational queries than classical indexing methods. There are also paper classification methods based on papers' contexts in [NKO]. [NSH] proposes using context for multi-paper summarization of scientific literature but does not propose a method for doing so. Another closely related area is anchor text analysis on the World Wide Web [CHR] [DBM] [EM]. For instance, in [DBM] the authors suggest collecting the text around links to a page from other pages and using it to summarize that page. Although our problem shares some issues with anchor text analysis, it differs in the techniques used to extract the context parts, preprocess them, select the appropriate sentences, and organize them. These differences arise because the structure of a web page differs from that of a paper, and because we work with multiple documents.

The paper is organized as follows. Section 2 provides a general view of the proposed method and the tool architecture. Section 3 addresses data collection and preprocessing. Section 4 presents our methods for selecting the appropriate text fragments and organizing the data. The summarization step is discussed in Section 5. The evaluation and discussion section examines the main issues in our results and proposes ways to improve them further.

2. General Method

To generate the survey paper for a particular topic, we first find all the related documents and papers in the corpus. Then, for each paper, all of its context paragraphs are extracted, whether or not the document containing them is in the retrieved set. After preprocessing, the contexts of each paper are organized into a separate document. By extracting the proper amount of text from each context paragraph, we can use a hierarchical text clustering method to distinguish the different sub-topics within the retrieved documents. We then apply multi-document summarization techniques to summarize each cluster. In the last step, the system assembles the different clusters, together with the references, into the final paper. The developed tool therefore has the following pipeline architecture: the query is handed to a data collector that gathers documents from the WWW; a data preprocessing module cleans them; a document clustering module groups the context documents; and a summarization module produces the survey, all under the control of a system configuration module.

We have developed the system configuration, data collector, and data preprocessing modules in Java. The document clustering module is based on the Judge library [JL], a text mining package built on the Weka 3 data mining library [WL]. The summarization module is built on MEAD [ME], a multi-document summarizer written in Perl. The different modules are integrated through files, as the sketch below illustrates.
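The following is a minimal sketch of this control flow in Java, the language of our own modules. The method names are illustrative stubs rather than the actual module interfaces; in the real system each stage reads its input files and writes output files, and MEAD is invoked as an external Perl process (assuming its mead.pl driver script).

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    public class AutoSurveyPipeline {

        public static void main(String[] args) throws Exception {
            String query = args.length > 0 ? args[0] : "parallel database";

            List<File> contexts = collectContexts(query);     // crawl Citeseer (Section 3)
            List<File> docsents = preprocess(contexts);       // clean text, write .docsent files
            List<File> clusters = clusterDocuments(docsents); // find sub-topics, write .cluster files

            // Each cluster becomes one section of the survey. MEAD (written in
            // Perl) is run as an external process; files are the only interface
            // between the Java modules and the summarizer.
            for (File cluster : clusters) {
                new ProcessBuilder("perl", "mead.pl", cluster.getPath())
                        .inheritIO()
                        .start()
                        .waitFor();
            }
        }

        // Stubs standing in for the real data collector, preprocessing, and
        // clustering modules; keeping the stages file-based leaves them
        // loosely coupled, so any module can be replaced independently.
        private static List<File> collectContexts(String query) { return new ArrayList<>(); }

        private static List<File> preprocess(List<File> contexts) { return new ArrayList<>(); }

        private static List<File> clusterDocuments(List<File> docsents) { return new ArrayList<>(); }
    }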
As an example of this file-based integration, the document clustering module writes its output in a file with the appropriate format for the summarization module (MEAD), and so on. A sample screenshot of the system's user interface is shown in Fig. 1 in the appendix.

3. Data Collection and Preprocessing

Since digital libraries are accessible through the internet, we decided to reuse an existing digital library instead of creating and populating one ourselves. For this purpose we selected the Citeseer digital library [CS], which has a large corpus of scientific papers in the computer science domain as well as other capabilities such as a search engine. Citeseer also provides the context of each paper, but with two limitations: first, at most two contexts per referring paper are given; second, the paragraph containing the context is not shown in full. To remove these constraints, we had to implement some extra text extraction routines.

The data collection module consists of a multi-threaded crawler and an HTML parser/wrapper for each type of returned page. The crawler sends the query to the server as an ordinary HTTP request, imitating normal user input. It then creates threads to interpret and extract the data from the returned result pages. Currently we use ten extractor threads and one main thread that schedules the page queue and assigns pages to the extractor threads in first-come-first-served order. Each page can be a result page, a document detail page, an overview page, or a context page; for each page type, the assigned thread calls the right wrapper/extractor classes to extract the data and store it in the appropriate files. Since all wrappers share the same inheritance hierarchy, most of the wrapping logic is reused. The full text of each paper is also extracted, and, using the primary context hint on the Citeseer context page, the system extracts the whole context paragraph. We use titles and new-line paragraph markers to delimit the paragraphs. Because sometimes no context is available for a document, either because it is new or because it has no citations, the system also extracts the text summary that Citeseer provides for each document. The bibliographic entry of each cited paper is stored in a reference file as well.

To cluster and summarize the data we need clean sentences. The first stage of preprocessing removes the dots from dot-containing tokens such as abbreviations and numbers, and then partitions each paragraph into pseudo-sentences at the remaining dots, as in [DBM]. The system remembers the sentence containing the reference because, as we will see, this sentence determines the text span. A minimum required length (currently 30) is then applied to the pseudo-sentences of each paragraph; a paragraph whose main sentence has been removed is removed entirely. The preprocessing module also replaces citation flags such as [1] or [YUH] with more meaningful phrases such as "in this paper" or "in those papers". A further filter removes repeated sentences and paragraphs, which arise from multiple versions of the same paper (conference and journal versions) in the digital library. We do not include summaries when at least one context is available, although by changing a system parameter the user can set a threshold on the number of available contexts below which summaries are still included. The paragraphs referring to a given document are stored in one file per document, called its context document, and each context document is given an id number.
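The cleaning steps above can be sketched as follows. This is a simplified, self-contained version: the regular expressions and the single replacement phrase are illustrative rather than the exact ones in our module, though the length threshold of 30 matches the current system setting.

    import java.util.ArrayList;
    import java.util.List;

    public class ContextPreprocessor {

        private static final int MIN_LENGTH = 30; // minimum pseudo-sentence length

        public static List<String> toPseudoSentences(String paragraph) {
            // Remove dots that do not end a sentence: single-letter
            // abbreviations ("e.g." -> "eg") and decimal numbers ("3.5" -> "35").
            String text = paragraph
                    .replaceAll("\\b([A-Za-z])\\.", "$1")
                    .replaceAll("(\\d)\\.(\\d)", "$1$2");

            // Replace citation flags like [1] or [YUH] with a readable phrase.
            text = text.replaceAll("\\[(\\d+|[A-Z]{2,4})\\]", "in this paper");

            // Split at the remaining dots and drop fragments that are too short.
            List<String> sentences = new ArrayList<>();
            for (String s : text.split("\\.")) {
                String trimmed = s.trim();
                if (trimmed.length() >= MIN_LENGTH) {
                    sentences.add(trimmed + ".");
                }
            }
            return sentences;
        }

        public static void main(String[] args) {
            String par = "In [1] the authors propose a parallel join. " +
                    "E.g. hash joins scale well. Short. " +
                    "We compare our method with [YUH] on 3.5 GB of data.";
            toPseudoSentences(par).forEach(System.out::println);
        }
    }

Running the example keeps the first and last sentences and drops the two short fragments, demonstrating the length filter and the citation-flag replacement.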
Since the position of each paragraph in the generated documents is very important for the next steps, the system determines the position of each paragraph based on the number of citations of the referring paper that contains the paragraph, and on its publication date. The system asks the user to enter two factors expressing how important the recency and the rank (citation count) of the context are, and then calculates the position of each paragraph in the context document using the following formula:

Position_rank = recency_factor * Normalized_Time(referring_paper) + rank_factor * Citations(referring_paper)

We normalize the time by subtracting 1960 (suitable for computer science papers). The system uses default values (1980 for the time and 0 for the citations) when this information is unavailable. The same formula with the same factors is also computed for each context document as a whole, to be used in the summarization step when the global ordering of a paragraph is determined. All the position information and text are stored in the .docsent file format used by the MEAD summarizer.

4. Text Span and Clustering

The important problem when dealing with a context is to choose the appropriate number of sentences before and after the main sentence of the paragraph. Selecting the whole paragraph gives high recall but low precision; using just the main sentence gives high precision at the price of low recall. Previous work has proposed different approaches to choosing the appropriate text span from a context or anchor text. [DBM] uses a window-size-based approach to extract the text around an anchor in a web page. [NKO] uses clue words such as "therefore" and "however" to include a sentence before or after the main sentence, and phrases such as "In this paper, we" to exclude sentences; this is a rather ad hoc and restricted method, and it can only include one sentence before or after the main sentence. [NSH] introduces an NLP-based approach that extracts the main entities from the paragraph, discovers the relations among them using NLP techniques, and composes a new sentence; its flaws are that the entities must be given by the user and that the algorithm has not been completed.

We propose another technique for extracting the appropriate number of sentences from the paragraph. As [WN] notes, there are 15 reasons to cite a paper, and only in some of them does the referring paper describe the content of the referred one (positively or negatively). Sometimes the context is too general or has very little semantic connection to the cited paper. Clearly, the more descriptive the context is, the more similar it is to the content of the cited paper. We therefore devised an algorithm that examines every window of consecutive sentences surrounding the main sentence and keeps the one that maximizes the TF-IDF cosine similarity between the context and the content of the cited paper. For papers whose content is not available, we use their summaries. Because author names are very likely to appear in descriptive context sentences, we assign them the same weight as the highest-weighted term of the paper content. We have not encountered a situation in which neither the content nor the summary of the original paper was available; in such a case, however, we could use the union of the contexts from the different referring documents in place of the content, on the assumption that the words common among the contexts represent the actual content of the document.
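A minimal sketch of this span-selection algorithm follows. The class and method names are illustrative; the TF-IDF weighting and tokenization are deliberately simplified (the IDF map is assumed to be precomputed over the corpus, with unseen terms defaulting to weight 1.0), and the author-name boosting described above is omitted.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class SpanSelector {

        // Among all windows of consecutive sentences that contain the main
        // (citing) sentence, return the window most similar to the cited
        // paper's content (or summary) under TF-IDF cosine similarity.
        public static List<String> bestSpan(List<String> sentences, int mainIdx,
                                            String citedContent, Map<String, Double> idf) {
            Map<String, Double> citedVec = tfidf(citedContent, idf);
            double best = -1.0;
            List<String> bestSpan = Collections.singletonList(sentences.get(mainIdx));

            for (int start = 0; start <= mainIdx; start++) {
                for (int end = mainIdx; end < sentences.size(); end++) {
                    String window = String.join(" ", sentences.subList(start, end + 1));
                    double sim = cosine(tfidf(window, idf), citedVec);
                    if (sim > best) {
                        best = sim;
                        bestSpan = new ArrayList<>(sentences.subList(start, end + 1));
                    }
                }
            }
            return bestSpan;
        }

        // Bag-of-words vector with weight tf * idf per term.
        private static Map<String, Double> tfidf(String text, Map<String, Double> idf) {
            Map<String, Double> vec = new HashMap<>();
            for (String tok : text.toLowerCase().split("\\W+")) {
                if (!tok.isEmpty()) {
                    vec.merge(tok, idf.getOrDefault(tok, 1.0), Double::sum);
                }
            }
            return vec;
        }

        private static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0, na = 0, nb = 0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
                na += e.getValue() * e.getValue();
            }
            for (double v : b.values()) nb += v * v;
            return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
        }
    }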
The system then clusters the context documents using the COBWEB [HK] clustering algorithm available in the Judge library [JL]. COBWEB is a conceptual hierarchical clustering algorithm that arranges the documents in a decision-tree-like structure. This method suits a survey paper, which has the same structure, so the clustering gives the survey the desired whole-to-part organization. The number of levels in the tree is limited to two (a good estimate for most hand-written survey papers), and the number of clusters at each level is determined by the number of context documents in each cluster (we currently require each cluster to contain at least 20% of the context documents). After summarization, each cluster becomes a separate section of the final paper. To pass the clustering information along, the document ids of each cluster are stored in cluster files (.cluster) in a format usable by the MEAD summarizer.

5. Summarization

In the final step the system summarizes each cluster separately. We use a sentence-based multi-document summarization method, the centroid-based method, as implemented in the MEAD tool. The summarization process involves three steps:

Feature selection. First we select the features on whose basis the sentences are classified and ranked. Our system uses three features available in the MEAD summarizer: Centroid, a vector containing the most common terms of the cluster, which we also use to put a title on each section; Position, the position of the sentence in the context document, determined by the time/rank formula above; and Query, a vector containing the original query terms entered by the user. In addition, we added two features of our own: the context document weight calculated in the data processing step, and a vector of synonyms of "survey" such as "overview" and "review", which gives more weight to sentences that describe a hand-written survey paper in the domain.

Classification. The MEAD classifier classifies and ranks the sentences based on the above features.

Re-ranking. Since sentences that are too similar to each other might otherwise be added to the summary, the re-ranker detects sentences that are too similar to already added ones and keeps them out of the summarized text. We added a hook-up routine that attaches the reference information of the removed similar text to the sentences that were kept. To preserve the cohesion of each paragraph, we convert the input paragraphs to sentence format, so that MEAD treats each whole paragraph as a single sentence. The compression rate of the final summary, which determines the number of output sentences relative to the input size, is set by the user.
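The re-ranking step, together with our hook-up routine, can be sketched as follows. This is a generic reimplementation of the idea rather than MEAD's actual re-ranker; the similarity threshold and the plain term-frequency cosine are illustrative.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ReRanker {

        static final double SIM_THRESHOLD = 0.7; // illustrative redundancy cutoff

        // A candidate sentence (here, a whole context paragraph treated as one
        // sentence) together with the references it carries.
        static class Sentence {
            final String text;
            final List<String> refs = new ArrayList<>();
            Sentence(String text, List<String> refs) {
                this.text = text;
                this.refs.addAll(refs);
            }
        }

        // Candidates arrive ranked best-first. A candidate too similar to an
        // already kept sentence is dropped, but its reference information is
        // attached to the kept sentence (the "hook-up" routine).
        static List<Sentence> rerank(List<Sentence> rankedBestFirst) {
            List<Sentence> kept = new ArrayList<>();
            for (Sentence cand : rankedBestFirst) {
                Sentence duplicateOf = null;
                for (Sentence k : kept) {
                    if (cosine(termFreq(cand.text), termFreq(k.text)) > SIM_THRESHOLD) {
                        duplicateOf = k;
                        break;
                    }
                }
                if (duplicateOf == null) {
                    kept.add(cand);
                } else {
                    duplicateOf.refs.addAll(cand.refs);
                }
            }
            return kept;
        }

        static Map<String, Integer> termFreq(String s) {
            Map<String, Integer> tf = new HashMap<>();
            for (String t : s.toLowerCase().split("\\W+")) {
                if (!t.isEmpty()) tf.merge(t, 1, Integer::sum);
            }
            return tf;
        }

        static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
            double dot = 0, na = 0, nb = 0;
            for (Map.Entry<String, Integer> e : a.entrySet()) {
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
                na += e.getValue() * e.getValue();
            }
            for (int v : b.values()) nb += (double) v * v;
            return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
        }
    }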
6. Discussion and Future Work

A sample output of the system for the query "parallel database" is shown in Fig. 2 in the appendix. The important papers, including the important survey papers, are detected very well. The sentences are informative enough, and some of the general concerns of the research community, such as architecture and metrics, are addressed very well. The cohesion within each paragraph is good enough; the most important cohesion problem is the relation between different paragraphs. Since the clustering method gathers all related paragraphs into the same section, the difference between Fig. 2 and Fig. 3, which was produced without clustering, is obvious. However, the cohesion among paragraphs that come from different documents is still poor. The title of each cluster is generated from the most common words of its centroid (as in the example), which is not very meaningful to the user; an interesting direction for future work is to find a sentence containing most of these words and use it as the title. Our text span selection method gives considerably more meaningful paragraphs than the baseline method (Fig. 3); it also provides better cohesion and removes more unrelated sentences. In general, the selected text span was two to three sentences. The weakness we found in our method is that it eliminates context paragraphs that discuss the application of a method; these paragraphs are removed because they are not similar enough to the content or summary of the original document. An interesting open problem is to categorize and summarize such sentences in a separate section.

References

[BS] Bradshaw, S. "Reference Directed Indexing: Redeeming Relevance for Subject Search in Citation Indexes." In Proceedings of the 7th European Conference on Research and Advanced Technology for Digital Libraries, 2003.
[CHR] Craswell, N., Hawking, D., and Robertson, S. "Effective Site Finding Using Link Anchor Information." In Proceedings of ACM SIGIR, 2001.
[CS] Citeseer digital library, http://citeseer.ist.psu.edu
[DBM] Delort, J.Y., Bouchon-Meunier, B., and Rifqi, M. "Enhanced Web Document Summarization Using Hyperlinks." In Proceedings of the 14th ACM Conference on Hypertext and Hypermedia, pages 208-215, New York, NY, USA, 2003.
[EM] Eiron, N. and McCurley, K.S. "Analysis of Anchor Text for Web Search." In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 459-460. ACM Press, 2003.
[HK] Han, J. and Kamber, M. Data Mining: Concepts and Techniques, second edition. Morgan Kaufmann, 2006.
[JL] Judge Library, http://www3.dfki.uni-kl.de/judge
[MG] Multi-Gen: a general multi-document summarizer, http://www.cs.columbia.edu/~regina/demo4/
[ME] MEAD: a public domain portable multi-document summarization system, http://www.summarization.com/mead/
[NSH] Nakov, P., Schwartz, A., and Hearst, M. "Citances: Citation Sentences for Semantic Analysis of Bioscience Text." In Workshop on Search and Discovery in Bioinformatics at SIGIR'04, Sheffield, UK, July 2004.
[NKO] Nanba, H., Kando, N., and Okumura, M. "Towards Multi-Paper Summarization Using Reference Information." In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 926-931, 1999.
[NIS] News-In-Essence, http://www.newsinessence.com/
[RD] Radev, D.R. "Text Summarization." Tutorial at ACM SIGIR, Sheffield, UK, 2004.
[WL] Weka 3: Data Mining Software in Java, http://www.cs.waikato.ac.nz/ml/weka/
[WN] Weinstock, N. "Citation Indexes." In Kent, A. (Ed.), Encyclopedia of Library and Information Science, Vol. 5, pages 16-41. New York: Marcel Dekker, 1971.

Fig. 1. Input screenshot.
Fig. 2. Sample output with clustering and text span selection.
Fig. 3. Sample output without clustering and text span selection.