Timestamped Graphs: Evolutionary Models of Text for Multi-document Summarization
Ziheng Lin and Min-Yen Kan
Department of Computer Science, National University of Singapore, Singapore
TextGraphs 2 at HLT/NAACL 2007

Summarization
• Traditionally, heuristics for extractive summarization:
– Cue/stigma phrases
– Sentence position (relative to document, section, paragraph)
– Sentence length
– TF×IDF, TF scores
– Similarity (with title, context, query)
• With the advent of machine learning, the heuristic weights for the different features are tuned by supervised learning
• In the last few years, graphical representations of text have shed new light on the summarization problem

Prestige as sentence selection
• One motivation for using graphical methods was to model the problem as finding the prestige of nodes in a social network
• PageRank used a random walk to smooth the effect of non-local context
• HITS and SALSA model hubs and authorities
• In summarization, these led to TextRank and LexRank
• Contrast with previous graphical approaches (Salton et al., 1994)
• Did we leave anything out of our representation for summarization? Yes: the notion of an evolving network. Social networks change!
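The prestige computation these systems rely on is a variant of PageRank's random walk. A minimal power-iteration sketch over a toy sentence graph (an illustration, not the authors' implementation; the damping value and adjacency matrix are assumptions):

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-8):
    """Power iteration for node prestige on a (possibly weighted) graph.

    adj[i][j] > 0 means an edge from sentence i to sentence j.
    """
    n = adj.shape[0]
    # Row-normalize so each node distributes its score over its out-edges.
    out = adj.sum(axis=1, keepdims=True)
    out[out == 0] = 1.0                      # guard against dangling nodes
    transition = adj / out
    scores = np.full(n, 1.0 / n)
    while True:
        new = (1 - damping) / n + damping * transition.T @ scores
        if np.abs(new - scores).sum() < tol:
            return new
        scores = new

# Toy 3-sentence graph: sentence 2 is pointed to by both others,
# so it should come out as the most prestigious node.
adj = np.array([[0., 0., 1.],
                [0., 0., 1.],
                [1., 0., 0.]])
ranks = pagerank(adj)
```

In a summarizer, `adj` would hold inter-sentence similarities rather than 0/1 links; the iteration itself is unchanged.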
Natural evolving networks (Dorogovtsev and Mendes, 2001)
– Citation networks: new papers can cite old ones, but the old network is static
– The Web: new pages are added, each with an old page connecting it to the web graph; old pages may update their links

Evolutionary models for summarization
• Writers and readers often follow conventional rhetorical styles – articles are not written or read in an arbitrary way
• Consider the evolution of texts using a very simplistic model:
– Writers write from the first sentence of a text onwards
– Readers read from the first sentence of a text onwards
• A simple model: sentences get added incrementally to the graph

Timestamped Graph Construction
• Approach
– These assumptions suggest iteratively adding sentences to the graph in chronological order.
– At each iteration, consider which edges to add to the graph.
– For a single document this is simple and straightforward: add the 1st sentence, followed by the 2nd, and so forth, until the last sentence is added
– For multiple documents, treat the input as multiple instances of single documents which evolve in parallel; i.e., add the 1st sentences of all documents, followed by all the 2nd sentences, and so forth
• This doesn't really model the chronological ordering between articles; we fix this later

Timestamped Graph Construction
• Model:
– Documents as columns: di = document i
– Sentences as rows: sj = the jth sentence of a document
• A multi-document example: doc1, doc2, doc3 with sent1, sent2, sent3
• An example TSG: DUC 2007 cluster D0703A-A

Timestamped Graph Construction
• These are just single instances of TSGs; let's generalize and formalize them
• Def: a timestamped graph algorithm tsg(M) is a 9-tuple (d, e, u, f, σ, t, i, s, τ) that specifies a resulting algorithm that takes the set of texts M as input and outputs a graph G
– d, e, u, f: properties of edges
– σ, t, i, s: properties of nodes
– τ: input text transformation function

Edge properties (d, e, u, f)
• Edge direction (d): forward, backward, or undirected
• Edge number (e): the number of edges to instantiate per timestep
• Edge weight (u): weighted or unweighted edges
• Inter-document factor (f): a penalty factor for links between documents in multi-document sets
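The parallel construction described above (at timestep j, the j-th sentence of every document enters the graph, and each new node draws e edges to existing nodes) might look like this as a toy sketch. The Jaccard similarity and the within-timestep ordering are simplifying assumptions, not the paper's exact choices:

```python
from itertools import zip_longest

def jaccard(a, b):
    """Token-overlap similarity (one of the similarity choices in the model)."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def build_tsg(docs, e=1):
    """Sketch of timestamped-graph construction, multi-document case.

    docs: list of documents, each a list of sentence strings.
    At timestep j, the j-th sentence of every document is added;
    each new node then connects to its e most similar existing nodes.
    """
    nodes, edges = [], []
    for layer in zip_longest(*docs):          # j-th sentences, in parallel
        fresh = [s for s in layer if s is not None]
        for s in fresh:
            ranked = sorted(nodes, key=lambda t: jaccard(s, t), reverse=True)
            edges += [(s, t) for t in ranked[:e]]
            nodes.append(s)
    return nodes, edges

docs = [["the cat sat", "cats like mats"],
        ["a dog sat", "dogs bark loudly"]]
nodes, edges = build_tsg(docs, e=1)
```

The first sentence finds an empty graph and adds no edge; every later sentence contributes e edges, matching the e-edges-per-timestep parameter of the formalism.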
Node properties (σ, t, i, s)
• Vertex selection function σ(u, G)
– One strategy: among those nodes not yet connected to u in G, choose the one with the highest similarity to u
– Similarity functions: Jaccard, cosine, concept links (Ye et al., 2005)
• Text unit type (t): most extractive algorithms use sentences as elementary units
• Node increment factor (i): how many nodes get added at each timestep
• Skew degree (s)
– Models how nodes in multi-document graphs are added
– Skew degree = how many iterations to wait before adding the 1st sentence of the next document

Skew Degree Examples
• Assume time(d1) < time(d2) < time(d3) < time(d4); documents d1–d4 can be skewed by 1, skewed by 2, or freely skewed
• Freely skewed = only add a new document when it would be linked to by some node under the vertex selection function σ

Input text transformation function (τ)
• Document segmentation function τ
– Problem observed in some clusters: some documents in a multi-document cluster are very long
– It takes many timesteps to introduce all of their sentences, causing too many edges to be drawn
– τ(G) segments long documents into several sub-documents (e.g., d5 → d5a, d5b)
• This solution is rather ad hoc; we hope to investigate it further in current and future work

Timestamped Graph Construction
• Representations
– We can model a number of different algorithms using this 9-tuple formalism (d, e, u, f, σ, t, i, s, τ)
– The given toy example: (f, 1, 0, 1, max-cosine-based, sentence, 1, 0, null)
– LexRank graphs: (u, N, 1, 1, cosine-based, sentence, Lmax, 0, null), where N = the total number of sentences in the
cluster and Lmax = the maximum document length; i.e., all sentences are added to the graph in one timestep, each is connected to all others, and cosine scores are assigned as edge weights

TSG-based summarization: Methodology, Evaluation, Analysis

System Overview
• Sentence splitting
– Detect and mark sentence boundaries
– Annotate each sentence with the document ID and the sentence number
– E.g., XIE19980304.0061: 4 March 1998 from Xinhua News; XIE19980304.0061-14: the 14th sentence of this document
• Graph construction
– Construct the TSG in this phase
• Sentence ranking
– Apply a topic-sensitive random walk on the graph to redistribute the weights of the nodes
• Sentence extraction
– Extract the top-ranked sentences
– Two different modified MMR rerankers are used, depending on whether it is the main or the update task

Evaluation
• Datasets: DUC 2005, 2006 and 2007
• Evaluation tool: ROUGE, an n-gram based automatic evaluation
• Each dataset contains 50 or 45 clusters; each cluster contains a query and 25 documents
• We evaluate several parameters:
– Do different e values affect the summarization process?
– How do topic-sensitivity and edge weighting perform in running PageRank?
– How does skewing the graph affect the information flow in the graph?
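The topic-sensitive random walk used in the ranking phase can be sketched by biasing the teleport step toward query-relevant sentences. A minimal sketch, assuming a precomputed per-sentence query-relevance vector and an illustrative damping value (neither taken from the paper):

```python
import numpy as np

def topic_sensitive_pagerank(adj, bias, damping=0.85, tol=1e-8):
    """Random walk whose teleport step favors query-relevant sentences.

    bias: nonnegative relevance of each sentence to the query (e.g. its
    cosine similarity with the query); it replaces the uniform jump of
    plain PageRank.
    """
    n = adj.shape[0]
    bias = bias / bias.sum()                 # teleport distribution
    out = adj.sum(axis=1, keepdims=True)
    out[out == 0] = 1.0
    transition = adj / out
    scores = np.full(n, 1.0 / n)
    while True:
        new = (1 - damping) * bias + damping * transition.T @ scores
        if np.abs(new - scores).sum() < tol:
            return new
        scores = new

# Fully connected toy graph; sentence 0 is the most query-relevant,
# so the bias should pull the stationary distribution toward it.
adj = np.ones((3, 3)) - np.eye(3)
bias = np.array([3.0, 1.0, 1.0])
r = topic_sensitive_pagerank(adj, bias)
```

Setting `bias` uniform recovers ordinary PageRank, which matches the generic-vs-topic-sensitive comparison in the evaluation.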
Evaluation on number of edges (e)
• We tried different e values
• Optimal performance: e = 2
• At e = 1, the graph is too loosely connected and not suitable for PageRank → very low performance
• At e = N, the system reduces to a LexRank system

Evaluation (other edge parameters)
• PageRank: generic vs. topic-sensitive
• Edge weight (u): unweighted vs. weighted
• Optimal performance: topic-sensitive PageRank with weighted edges

Topic-sensitive  Weighted edges  ROUGE-1  ROUGE-2
No               No              0.39358  0.07690
Yes              No              0.39443  0.07838
No               Yes             0.39823  0.08072
Yes              Yes             0.39845  0.08282

Evaluation on skew degree (s)
• Different skew degrees: s = 0, 1 and 2
• Optimal performance: s = 1
• s = 2 introduces a delay interval that is too large
• We still need to try freely skewed graphs

Skew degree  ROUGE-1  ROUGE-2
0            0.36982  0.07580
1            0.37268  0.07682
2            0.36998  0.07489

Holistic Evaluation in DUC
• We participated in DUC 2007 with an extractive TSG system
• Main task: 12th for ROUGE-2, 10th for ROUGE-SU4 among 32 systems
• Update task: 3rd for ROUGE-2, 4th for ROUGE-SU4 among 24 systems
• Used a modified version of maximal marginal relevance to penalize links to previously read articles
– An extension of the inter-document factor (f)
• The TSG formalism is better tailored to deal with update / incremental text tasks
• A new method that may be competitive with current approaches
– Other top-scoring systems may do sentence compression, not just extraction

Conclusion
• Proposed a timestamped graph model for text understanding and summarization
– Adds
sentences to the graph one at a time
• A parameterized model with nine variables
– Canonicalizes the representation of several graph-based summarization algorithms

Future Work
• The freely skewed model
• Empirical and theoretical properties of TSGs (e.g., in-degree distribution)

Backup Slides
(25-minute talk, 26 Apr 2007, 11:50–12:15)

Differences for main and update task processing
• Main task:
1. Construct a TSG for the input cluster
2. Run topic-sensitive PageRank on the TSG
3. Apply the first modified version of MMR to extract sentences
• Update task:
– Cluster A: construct a TSG for cluster A; run topic-sensitive PageRank on the TSG; apply the second modified version of MMR to extract sentences
– Cluster B: construct a TSG for clusters A and B; run topic-sensitive PageRank on the TSG, retaining only sentences from B; apply the second modified version of MMR to extract sentences
– Cluster C: construct a TSG for clusters A, B and C; run topic-sensitive PageRank on the TSG, retaining only sentences from C; apply the second modified version of MMR to extract sentences

Sentence Ranking
• Once a timestamped graph is built, we want to compute a prestige score for each node
• PageRank: an iterative method that lets the weights of the nodes redistribute until stability is reached
• Similarities become weighted edges; the query makes the walk topic-sensitive (a topic-sensitive (Q) portion plus the standard random walk term)

Sentence Extraction – Main task
• The original MMR integrates a penalty for the maximal similarity between the candidate document and one selected document
• Ye et al.
(2005) introduced a modified MMR that instead integrates a penalty for the total similarity between the candidate sentence and all selected sentences
– Penalty: similarity with all previously selected sentences
• Score(s) = the PageRank score of s; S = the set of selected sentences
• This is used in the main task

Sentence Extraction – Update task
• The update task assumes readers have already read the previous cluster(s)
– This implies we should not select sentences whose information is redundant with the previous cluster(s)
• We propose a modified MMR for the update task:
– Consider the total similarity of the candidate sentence with all selected sentences and with sentences in the previously read cluster(s)
– Penalty: overlap with the previous cluster(s)
• P contains some top-ranked sentences from the previous cluster(s)

References
• Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research, (22).
• Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of EMNLP 2004.
• S.N. Dorogovtsev and J.F.F. Mendes. 2001. Evolution of networks. Submitted to Advances in Physics on 6 March 2001.
• Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7).
• Jon M. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1999.
• Shiren Ye, Long Qiu, Tat-Seng Chua, and Min-Yen Kan. 2005. NUS at DUC 2005: Understanding documents via concept links. In Proceedings of DUC 2005.
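As a final backup sketch, the two modified MMR rerankers described earlier (total-similarity penalty for the main task; the same penalty seeded with the previously read pool P for the update task) can be illustrated as follows. The penalty weight 0.7 and the Jaccard similarity are illustrative assumptions, not the paper's exact formulation:

```python
def jaccard(a, b):
    """Toy token-overlap similarity between two sentences."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def rerank(candidates, score, sim, selected_pool=(), penalty=0.7, k=3):
    """Greedy extraction with a total-similarity redundancy penalty.

    score: PageRank score per sentence. For the update task, seed
    selected_pool with top-ranked sentences from the previously read
    cluster(s) (the set P); for the main task leave it empty.
    """
    chosen, pool = [], list(selected_pool)
    remaining = list(candidates)
    while remaining and len(chosen) < k:
        # Penalize the TOTAL similarity with everything already covered,
        # not just the maximum, following the modified MMR idea.
        best = max(remaining,
                   key=lambda s: score[s] - penalty * sum(sim(s, t) for t in pool))
        chosen.append(best)
        pool.append(best)
        remaining.remove(best)
    return chosen

sents = ["a b c", "a b d", "x y z"]
scores = {"a b c": 0.50, "a b d": 0.45, "x y z": 0.20}
main_summary = rerank(sents, scores, jaccard, k=2)
# Update task: "a b e" stands in for a previously read sentence from P.
update_summary = rerank(sents, scores, jaccard, selected_pool=["a b e"], k=2)
```

In the main run the near-duplicate "a b d" is skipped in favor of the novel "x y z"; in the update run the overlap with P also demotes "a b c".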