Auto-Survey: An Automatic Survey Generation Tool
Arash Termehchy
University of Illinois at Urbana-Champaign
[email protected]
Abstract: Providing professionals and researchers with a good survey of the documents
and papers on a particular subject has long been a problem in those communities. In this
paper, we formalize the problem and introduce a method to generate survey papers
automatically, using the referential text, or context, from other papers that refer to them. A
tool has also been developed based on this method.
1. Introduction
A survey paper is a basic need in any branch of any field of study or professional
activity, such as engineering, medicine, or law. Such a paper should provide the user with
an overview of the important issues in the field, the approaches taken to those problems,
the methods of implementing each approach, and, most importantly, clear information on
their results in terms of performance, applicability, or popularity. So far, it has been the
task of knowledgeable individuals to seek out the appropriate data sources, such as papers
and documents, select the most representative documents, and organize them into the
structure above. Needless to say, the quality of a survey is always considered to depend on
the knowledge and neutrality of the person providing it. To the best of our knowledge,
there is no automated tool for producing tutorial or survey papers.
The closest automated tools addressing this requirement are multiple-document
summarizers [RD]. Multiple-document summarizers extract the most important and
representative text fragments, mostly sentences, from the input documents and compose
them into a new short document. The key issues are to define appropriate importance
criteria and to keep similar sentences, even highly important ones, out of the final
document. Several summarizers have been developed on this basis, such as Multi-Gen
[MG] and News-In-Essence [NIS]. The main differences between multiple-document
summarization and survey generation, which prevent us from using those tools in this
domain, are:
1. Multiple-document summarizers give very good results when summarizing small
documents such as web pages or news articles, but so far they have not achieved
considerable superiority over naïve baseline methods such as choosing the first
sentences of the Abstract/Introduction sections of the papers [RD].
2. There are fewer common entities among technical, scientific, or other complex
professional documents than among web pages or news documents [NSH]. This
makes current multiple-document summarization methods that rely on IR or NLP
techniques for extraction far less effective.
3. One of the most important roles of a survey paper is to judge different aspects of
the methods and to compare them. This information is either not available in the
document content, not thorough enough, or simply cannot be trusted scientifically.
The quality of a survey paper depends heavily on how authoritative its writer(s) are.
There are some approaches in the community toward abstractive summarization, which
composes new sentences using AI and NLP techniques. If applied to multiple documents,
these methods could address the first two issues; however, no such approach exists yet for
multiple-document summarization, and even for single-document summarization the
results are still far from applicable [DBM].
In this paper, we provide a survey of the target documents using the referential text, or
context, available in the other documents that cite them, instead of their own contents. In
every scientific, technical, or professional document there are parts where the author(s),
for different reasons, refer to another paper or document, for example when the authors of
a scientific paper compare their work with similar earlier work. These parts, called
contexts [BS], are a very precious data source for building survey papers. There have been
some efforts to use this information to mine technical or scientific literature. In [BS], the
authors index and query scientific papers using the terms occurring in their contexts rather
than their contents, and show higher precision for informational queries and better ranking
for navigational queries than classical indexing methods. There are also paper
classification methods based on papers' contexts in [NKO]. [NSH] proposes using
contexts for multi-paper summarization in the scientific literature but does not propose
any method for doing so.
Perhaps another area very close to context analysis is anchor text analysis on the World
Wide Web [CHR] [DBM] [EM]. For instance, in [DBM] the authors suggest collecting
the text around links to a page from other pages and using it to summarize the web page.
Although our problem shares some issues with this one, it differs in the techniques
required to extract the context parts, preprocess them, select the appropriate sentences,
and organize them. These differences arise because the structure of a web page differs
from that of a paper, and because we are working with multiple documents.
The organization of the paper is as follows. Section 2 provides a general view of the
proposed method and the tool architecture. Section 3 addresses the data collection and
preprocessing issues. The next section presents the methods for selecting appropriate text
fragments and organizing the data. The summarization step is discussed in Section 5. The
evaluation and discussion section covers the main issues in our results and proposes ways
for further improvement.
2. General Method
To generate a survey paper for a particular topic, we first find all the related documents
and papers in the corpus. Then, for each paper, all of its paragraph contexts are extracted,
regardless of whether the containing document is in the retrieved document set. After
preprocessing, these contexts are organized into separate documents. By extracting a
proper amount of text from each context paragraph, we can use a hierarchical text
clustering method to distinguish different sub-topics within the found documents. We
then apply multi-document summarization techniques to summarize each cluster. In the
last step, the system gathers the different clusters as well as the references and assembles
the paper.
Therefore, the developed tool has the following architecture (the block diagram is
reduced here to its components): a Data Collector that crawls the WWW for the user's
Query, a Data Preprocessing module, a Document Clustering module, and a
Summarization module, all governed by a System Configuration module.
We have developed the system configuration, data collector, and data preprocessing
modules in Java. The document clustering module is based on the Judge library [JL], a
text mining package built on the Weka 3 data mining library [WL]. The summarization
module is built with MEAD [ME], a multi-document summarization system written in
Perl. The integration of the different modules is performed through files: for instance, the
document clustering module writes its data to a file in the appropriate format for the
summarization module (MEAD), and so on. A sample screenshot of the system's user
interface is shown in Fig 1 in the appendix.
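To make the file-based integration concrete, the following minimal sketch shows how
such a driver might chain the modules. All class, method, and file names here are
hypothetical stand-ins for the real modules, and the MEAD command line is an
assumption rather than the tool's documented interface.

    import java.nio.file.*;

    // Minimal sketch of the file-based integration: each stage reads the files
    // written by the previous one. All names below are hypothetical stand-ins.
    public class AutoSurveyPipeline {
        public static void main(String[] args) throws Exception {
            Path work = Files.createDirectories(Paths.get("work"));
            collect("parallel database", work.resolve("raw"));         // Data Collector
            preprocess(work.resolve("raw"), work.resolve("docsent"));  // Data Preprocessing
            cluster(work.resolve("docsent"), work.resolve("clusters"));// Document Clustering
            // Summarization: hand each .cluster file to the external MEAD (Perl) tool.
            try (DirectoryStream<Path> cs =
                    Files.newDirectoryStream(work.resolve("clusters"), "*.cluster")) {
                for (Path c : cs) {
                    // Hypothetical command line; MEAD's actual interface may differ.
                    new ProcessBuilder("perl", "mead.pl", c.toString())
                            .inheritIO().start().waitFor();
                }
            }
        }
        // Stubs standing in for the real modules described in the text.
        static void collect(String query, Path out) throws Exception { Files.createDirectories(out); }
        static void preprocess(Path in, Path out) throws Exception { Files.createDirectories(out); }
        static void cluster(Path in, Path out) throws Exception { Files.createDirectories(out); }
    }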
3. Data Collection and Preprocessing
Since digital libraries are accessible through the Internet, we decided to reuse an existing
digital library instead of creating one and feeding articles into it ourselves. For this
purpose, the Citeseer [CS] digital library was selected; it has a large corpus of scientific
papers in the computer science domain as well as other capabilities such as a search
engine. Citeseer also provides the contexts of a paper, but those contexts have two
limitations. First, at most two contexts per referring paper are given. Second, the
paragraph containing the context is not shown in full. To remove these constraints, we
had to implement some extra text extraction routines.
The data collection module includes a multi-threaded crawler and an HTML
parser/wrapper for each returned page type. The crawler sends the query to the server
using an ordinary HTTP request, imitating normal user input. It then creates several
threads to interpret and extract the data from the returned result pages. Currently we use
10 extractor threads and one main thread that schedules the page queue and assigns pages
to the extractor threads in a first-come-first-served manner. Each page can be a result
page, a document detail page, an overview page, or a context page. For each type of
page, the thread assigned to it calls the right wrapper/extractor classes to extract the data
and store it in the appropriate files. Because all wrappers share the same inheritance
hierarchy, most of the wrapping logic is reused. The full text of the paper is also
extracted, and using the primary context hint on the Citeseer context page, the system
extracts the whole context paragraph. We use titles and new-line paragraph markers to
extract the paragraphs.
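A rough sketch of this scheduling follows, assuming pages are identified by their URLs;
the extract method here is a hypothetical stand-in for the per-page-type wrapper dispatch.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch of the crawler's scheduling: the main thread enqueues result pages,
    // and 10 extractor threads consume them first-come-first-served.
    public class CrawlerScheduler {
        private static final int EXTRACTOR_THREADS = 10;
        // A FIFO queue realizes the first-come-first-served page assignment.
        private final BlockingQueue<String> pageQueue = new LinkedBlockingQueue<>();
        private final ExecutorService extractors =
                Executors.newFixedThreadPool(EXTRACTOR_THREADS);

        public void start() {
            for (int i = 0; i < EXTRACTOR_THREADS; i++) {
                extractors.submit(() -> {
                    try {
                        while (true) {
                            extract(pageQueue.take()); // blocks until a page arrives
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt(); // shut the worker down
                    }
                });
            }
        }

        // Called by the main crawler thread for every fetched page.
        public void enqueue(String pageUrl) {
            pageQueue.offer(pageUrl);
        }

        // Stand-in for dispatching to the wrapper/extractor class matching the
        // page type (result, document detail, overview, or context page).
        private void extract(String pageUrl) {
            System.out.println("extracting " + pageUrl);
        }
    }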
Because sometimes no context is available for a document, either because it is new or
because it has no citations, the system also extracts the text summary provided for each
document on the Citeseer site. The bibliographic entry of the cited paper is stored in a
reference file as well.
To cluster and summarize the data, we need clean sentences. The first stage of data
preprocessing is to remove the dots from dot-containing tokens such as abbreviations and
numbers, and then to partition the paragraph into pseudo-sentences using the dot as a
separator, as in [DBM]. The system remembers the sentence containing the reference,
because, as we will see, this sentence determines the text span. A minimum required
length (currently 30) is then applied to the pseudo-sentences of each paragraph. A
paragraph whose main sentence is removed is removed as well. The data preprocessing
module also replaces citation flags such as [1] or [YUH] with more meaningful phrases
such as "in this paper" or "in those papers". Another filter removes repeated
sentences/paragraphs, which arise from multiple versions of the same paper (conference
version/journal version) in the digital library. We do not include summaries when at least
one context is available, although by changing a system parameter the user can set a
threshold on the maximum number of available contexts up to which summaries are still
included.
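The preprocessing steps above might look roughly like this. This is a minimal sketch: the
abbreviation list is illustrative, the citation-flag pattern is simplified, and the unit of the
length threshold is assumed to be characters.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    // Sketch of the preprocessing: protect dotted tokens, split on dots,
    // drop short pseudo-sentences, and rewrite citation flags.
    public class ContextPreprocessor {
        // Simplified citation-flag pattern matching forms like [1] or [YUH].
        private static final Pattern CITATION = Pattern.compile("\\[(\\d+|[A-Z]{2,4})\\]");
        private static final int MIN_LENGTH = 30; // minimum pseudo-sentence length

        public static List<String> toPseudoSentences(String paragraph) {
            // Remove dots inside dotted tokens (abbreviations, numbers) so that
            // only sentence-ending dots remain as separators.
            String cleaned = paragraph
                    .replace("e.g.", "eg").replace("i.e.", "ie").replace("et al.", "et al")
                    .replaceAll("(\\d)\\.(\\d)", "$1$2");

            List<String> sentences = new ArrayList<>();
            for (String s : cleaned.split("\\.")) {
                s = s.trim();
                if (s.length() < MIN_LENGTH) continue; // drop too-short pseudo-sentences
                // Substitute citation flags with a readable phrase, as in the text.
                sentences.add(CITATION.matcher(s).replaceAll("in this paper"));
            }
            return sentences;
        }

        public static void main(String[] args) {
            System.out.println(toPseudoSentences(
                    "Our approach clearly outperforms the method of [YUH]. "
                    + "It processes 2.5 documents per second on average."));
        }
    }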
These paragraphs are stored in one file per referred document, called a context document,
and each of these documents gets an id number. Since the position of each paragraph in
the generated documents is very important for the next steps, the system determines the
position of each paragraph based on the number of citations of the referring paper that
contains the paragraph and on its publication date. The system asks the user to enter two
factors indicating how important the recency and the rank (citation count) of the context
are, and then calculates the position of each paragraph in the context document based on
the following formula:

position_rank = recency_factor * Normalized_Time(referring_paper) + rank_factor * Citations(referring_paper)

We normalize the time by subtracting 1960 (for computer science papers). The system
also uses default values (1980 for the year and 0 for the citation count) when this
information is not available. The same formula with the same factors is calculated for
each context document, to be used in the summarization step when the global ordering of
a paragraph is determined. All the position information and text are stored in the .docsent
file format used by the MEAD summarizer.
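As a worked illustration of this scoring (a minimal sketch; the user-supplied weights in
the example are arbitrary):

    // Sketch of the paragraph-ordering score described above.
    public class PositionRank {
        static final int TIME_ORIGIN = 1960;   // computer science papers
        static final int DEFAULT_YEAR = 1980;  // default when the date is unknown
        static final int DEFAULT_CITATIONS = 0;

        // year/citations belong to the referring paper containing the paragraph.
        static double positionRank(Integer year, Integer citations,
                                   double recencyFactor, double rankFactor) {
            int y = (year == null) ? DEFAULT_YEAR : year;
            int c = (citations == null) ? DEFAULT_CITATIONS : citations;
            double normalizedTime = y - TIME_ORIGIN; // "normalize by subtracting 1960"
            return recencyFactor * normalizedTime + rankFactor * c;
        }

        public static void main(String[] args) {
            // A 1999 referring paper with 120 citations, user weights 0.5/0.5.
            System.out.println(positionRank(1999, 120, 0.5, 0.5));
        }
    }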
4. Text Span and Clustering
An important problem when dealing with contexts is choosing the appropriate number of
sentences before and after the main sentence of the paragraph. Selecting the whole
paragraph gives high recall but low precision; on the other hand, using just the main
sentence gives high precision at the price of low recall. Different approaches to choosing
the appropriate text span from contexts or anchor texts have been proposed in previous
work. [DBM] uses a window-size-based approach to extract the text around an anchor in
web pages. [NKO] uses clue words such as "therefore" and "however" to choose a
sentence before or after the main sentence, and phrases such as "In this paper, we" to
exclude sentences. This is a rather trial-and-error, restricted method; moreover, it can
extract only one sentence before or after the main sentence. In [NSH], an NLP-based
approach is introduced to extract the main entities from the paragraph; using NLP
techniques, it discovers the relations among those entities and composes a new sentence.
The flaws here are that the entities must be given by the user and that the algorithm has
not been completed.
We propose another technique to extract the appropriate number of sentences from the
paragraph. As [WN] mentions, there are 15 reasons to cite a paper, and only in some of
them does the referring paper describe the content of the referred one (positively or
negatively). Sometimes the context may be too general or have very little semantic
connection to the cited paper. Obviously, the more descriptive the context is, the more
similar it will be to the content of the cited paper. Therefore, we devise an algorithm that
checks each combination of consecutive sentences surrounding the main sentence so as
to maximize the TF/IDF-based cosine similarity between the content of the cited paper
and the context. For papers whose content is not available, we use their summaries.
Because the names of the authors are very likely to occur in descriptive context
sentences, we assign them the same weight as the highest-weighted term of the paper
content.
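A bare-bones sketch of this span search, assuming sentence and document term vectors
are already TF/IDF-weighted (the vector construction and the author-name boosting are
omitted):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch: pick the contiguous window of sentences around the citing ("main")
    // sentence whose TF/IDF vector is most similar to the cited paper's content.
    public class SpanSelector {
        static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0, na = 0, nb = 0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                Double w = b.get(e.getKey());
                if (w != null) dot += e.getValue() * w;
                na += e.getValue() * e.getValue();
            }
            for (double w : b.values()) nb += w * w;
            return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        // sentences.get(i) is the TF/IDF vector of sentence i; mainIdx is the
        // citing sentence, which every candidate span must contain.
        static int[] bestSpan(List<Map<String, Double>> sentences, int mainIdx,
                              Map<String, Double> citedContent) {
            int bestStart = mainIdx, bestEnd = mainIdx;
            double best = -1;
            for (int start = 0; start <= mainIdx; start++) {
                for (int end = mainIdx; end < sentences.size(); end++) {
                    Map<String, Double> span = new HashMap<>();
                    for (int i = start; i <= end; i++)
                        sentences.get(i).forEach((t, w) -> span.merge(t, w, Double::sum));
                    double sim = cosine(span, citedContent);
                    if (sim > best) { best = sim; bestStart = start; bestEnd = end; }
                }
            }
            return new int[] { bestStart, bestEnd }; // inclusive sentence indices
        }
    }

The quadratic scan over start/end positions is affordable here because a context paragraph
rarely contains more than a handful of sentences.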
We have not faced any situation in which neither the content nor the summary of the
original paper was available. In such a case, however, we could use all the contexts from
the different referring documents instead of the content, on the assumption that the words
common among the contexts represent the actual content of the document.
Then, using the Cobweb [HK] clustering algorithm available in the Judge library [JL], the
system clusters the context documents. Cobweb is a conceptual hierarchical clustering
algorithm that arranges the documents in a decision-tree-like structure. This method is
appropriate for a survey paper, since a survey paper has the same structure; with this
clustering, the survey paper acquires the desired whole-to-part organization. The number
of levels in the tree is limited to two (a good estimate for most handwritten survey
papers), and the number of clusters at each level is determined by the number of context
documents in each cluster (right now we use a threshold of at least 20% of the context
documents per cluster). Each cluster becomes a separate section in the final paper after
summarization.
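Since Judge is built on Weka, the clustering step is roughly equivalent to running Weka's
Cobweb clusterer on bag-of-words instances, as in the following sketch (the ARFF input
file with one string attribute per context document is an assumption, not part of the
actual tool):

    import weka.clusterers.Cobweb;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    // Sketch: cluster context documents with Weka's Cobweb, the hierarchical
    // conceptual clusterer underlying the Judge library.
    public class ContextClustering {
        public static void main(String[] args) throws Exception {
            // Each instance is one context document stored as a string attribute.
            Instances docs = DataSource.read("contexts.arff");

            StringToWordVector toVector = new StringToWordVector(); // bag of words
            toVector.setInputFormat(docs);
            Instances vectors = Filter.useFilter(docs, toVector);

            Cobweb cobweb = new Cobweb();
            cobweb.buildClusterer(vectors);

            for (int i = 0; i < vectors.numInstances(); i++)
                System.out.println("doc " + i + " -> cluster "
                        + cobweb.clusterInstance(vectors.instance(i)));
        }
    }

Limiting the resulting hierarchy to two levels, as described above, then amounts to
cutting the Cobweb tree at depth two.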
To pass the clustering information along, the document ids of each cluster are stored in
cluster files (.cluster) in a format that the MEAD summarizer can use.
5. Summarization
At the final step, the system summarizes each cluster separately. We use a multi-document,
sentence-based summarization method, the centroid-based method implemented in the
MEAD tool. The summarization process involves three steps:
• Feature Selection
First, we select the features on the basis of which the sentences are classified and
ranked. Our system uses three features available in the MEAD summarizer tool:
Centroid, a vector containing the most common terms of the cluster (we also use
this vector to put a title on each section); Position, the position of the sentence in
the context document, determined by the time/rank formula mentioned above; and
Query, a vector containing the original query terms entered by the user.
In addition to the above features, we added two features for our purposes: the
context document weight calculated in the data preprocessing step, and a vector of
words containing synonyms of "survey", such as "overview" and "review", to give
more weight to sentences that describe a handwritten survey paper in the domain.
• Classification
The MEAD classifier classifies and ranks the sentences based on the above
features.
• Re-ranking
Since sentences that are too similar to one another might be added to the
summarized text, the re-ranker detects sentences too similar to already-added ones
and keeps them out of the summarized text (a simple sketch of this idea follows
below). We added a hook-up routine that attaches the reference information of a
removed similar sentence to the retained one.
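A simple filter in the spirit of this re-ranking step (illustrative only; MEAD's actual
re-ranker, its similarity measure, and the threshold value are more elaborate):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch of redundancy-based re-ranking: walk the ranked sentences and keep
    // one only if it is not too similar to any sentence already selected.
    public class Reranker {
        static final double MAX_SIMILARITY = 0.7; // illustrative threshold

        static List<String> rerank(List<String> ranked) {
            List<String> selected = new ArrayList<>();
            for (String s : ranked) {
                boolean tooSimilar = false;
                for (String kept : selected)
                    if (overlap(s, kept) > MAX_SIMILARITY) { tooSimilar = true; break; }
                if (!tooSimilar) selected.add(s);
                // else: a hook here could move s's reference info onto the kept sentence
            }
            return selected;
        }

        // Word-overlap similarity as a stand-in for MEAD's sentence similarity.
        static double overlap(String a, String b) {
            Set<String> wa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
            Set<String> wb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
            if (wa.isEmpty() || wb.isEmpty()) return 0;
            Set<String> inter = new HashSet<>(wa);
            inter.retainAll(wb);
            return inter.size() / (double) Math.min(wa.size(), wb.size());
        }
    }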
To preserve the cohesion of each paragraph, we convert the input paragraph to sentence
format, so that the MEAD system treats the whole paragraph as a single sentence.
The compression rate of the final summarization is determined by the user; it fixes the
number of output sentences relative to the input size. For example, at a 20% compression
rate, a cluster of 50 input sentences yields a summary of about 10 sentences.
6. Discussion and Future Work
A sample output of the system for the "parallel database" query is shown in Fig 2 in the
appendix. The detection of the important papers and the important survey papers is very
good. The sentences are also informative enough, and some of the general issues of the
research community, such as architecture and metrics, are addressed very well. The
cohesion within each paragraph is good enough, but the most important problem in terms
of cohesion is the relation between different paragraphs. Since the clustering method
gathers all related paragraphs into the same section, the difference between Fig 2 and
Fig 3, which was produced without clustering, is obvious. However, the cohesion among
paragraphs from different documents is still poor. The title of each cluster is derived from
the most common words in its centroid (as in the example), which is not very meaningful
for the user. An interesting piece of future work would be to find a sentence including
most of these words and use it as the title.
Our method of text span selection gives considerably more meaningful paragraphs than
the baseline method (Fig 3). It also provides better cohesion and removes more unrelated
sentences. Generally, the selected text span was 2-3 sentences. The weakness we found in
our method is that it eliminates context paragraphs that discuss the application of a
method; these paragraphs are removed because they are not similar enough to the content
or summary of the original document. An interesting problem would be to categorize and
summarize these sentences in a separate section.
References
[BS] Bradshaw, S., "Reference Directed Indexing: Redeeming Relevance for Subject
Search in Citation Indexes", Proceedings of the 7th European Conference on Research
and Advanced Technology for Digital Libraries, 2003.
[CHR] Craswell, N., Hawking, D., and Robertson, S., "Effective Site Finding Using Link
Anchor Information", Proceedings of ACM SIGIR 2001.
[CS] "Citeseer digital library", http://citeseer.ist.psu.edu
[DBM] Delort, J.Y., Bouchon-Meunier, B., and Rifqi, M., "Enhanced Web Document
Summarization Using Hyperlinks", Proceedings of the 14th ACM Conference on
Hypertext and Hypermedia, pages 208-215, New York, NY, USA, 2003.
[EM] Eiron, N. and McCurley, K.S., "Analysis of Anchor Text for Web Search",
Proceedings of the 26th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, pages 459-460, ACM Press, 2003.
[HK] Han, J. and Kamber, M., "Data Mining: Concepts and Techniques", Morgan
Kaufmann Publishers, second edition, 2006.
[JL] "Judge Library", http://www3.dfki.uni-kl.de/judge
[MG] "Multi-Gen: a general multi-document summarizer",
http://www.cs.columbia.edu/~regina/demo4/
[ME] "MEAD: a public domain portable multi-document summarization system",
http://www.summarization.com/mead/
[NSH] Nakov, P., Schwartz, A., and Hearst, M., "Citances: Citation Sentences for
Semantic Analysis of Bioscience Text", Workshop on Search and Discovery in
Bioinformatics at SIGIR'04, Sheffield, UK, July 2004.
[NKO] Nanba, H., Kando, N., and Okumura, M., "Towards Multi-Paper Summarization
Using Reference Information", Proceedings of the Sixteenth International Joint
Conference on Artificial Intelligence, pages 926-931, 1999.
[NIS] "News-In-Essence", http://www.newsinessence.com/
[RD] Radev, D.R., "Text Summarization, Tutorial", ACM SIGIR 2004, Sheffield, UK.
[WL] "Weka 3: Data Mining Software in Java", http://www.cs.waikato.ac.nz/ml/weka/
[WN] Weinstock, N., "Citation Indexes", in Kent, A. (Ed.), Encyclopedia of Library and
Information Science, Vol. 5, pages 16-41, Marcel Dekker, New York, 1971.
Fig 1. Input screenshot
Fig 2. Sample output with clustering and text span
Fig 3. Sample output without clustering and text span