Where does this new information belong?
From developing mining algorithms to supporting knowledge discovery

Bettina Berendt
Thanks for joint work with and support from Ilija Subašić, Mathias Verbeke, Siegfried Nijssen, and Luc De Raedt
K.U. Leuven

Yes we can!

The problem

The solution? Automatic topic detection
[Figure: topics detected over four periods. Labels in slide order: Period 1: Healthcare agenda, Climate agenda; Period 2: A healthcare vote; Period 3: Opposition to healthcare reform, Peace Nobel Prize, Copenhagen climate summit; Period 4: Another healthcare vote. Example topic words: Health, Care, Insurance, American, Uninsured, Families, Working; Green, energy, plan. Word weights shown: 0.017, 0.015, 0.013, 0.013, 0.009, 0.008, 0.005.]

Same event/document; different interpretations & categorisations

Similar problems in science and learning
• Topic detection in time-indexed corpora of news texts
• Conference programme

Similar problems in other areas
• Music collections, multimedia collections: see Andreas Nürnberger's talk at SML 2010

The solution? Context-aware systems / personalisation
• Political activist
• Female
• Has problems with anger management
• "You probably do / should think about it this way: ..."

What users want
• ... to structure the world how they see it → interactivity
• ... to re-use their categories (that they worked so hard to find) → semantics
• ... to acknowledge that others see the world differently → social similarity / diversity
• ... to be able to see through their eyes → perspective-taking
• ... to provide data mining methods to do all that!
[Figure labels: left / right; squares / circles; green / not green; is (nearly) green]

Research agenda
• The problem: automatic topic detection
• Our solution approach: support sense-making = provide methods / tools for Knowledge Discovery (in the full sense)
• Interactivity
• Semantics
• Social similarity / diversity
• Perspective-taking
• ... to provide data mining methods to do all that!

STORIES: functionality basics

STORIES: mining basics (1)
Graphical summarisation of multiple text documents
Document / text pre-processing
• Template recognition
• Multi-document named entities
• Stopword removal, lemmatization
• "Fact (assertion) recognition"
Similarity measure to determine salient relations; document summarization strategy
• time relevance, a "temporal co-occurrence lift"
• no topics, but salient concepts & relations
• time window; word-span window
Selection approach for concepts
• concepts = words or named entities
• salient concept = high TF & involved in a salient relation, time-indexed
Burstiness measure
• bursty co-occurrence

STORIES: mining basics (2)
Graph analysis for query recommendation
• Aim: highlight subgraphs that represent an event
• Topological properties
• Change: subgraph new in this period

STORIES: evaluation
1. Information retrieval quality
• Edges – events: up to 80% recall, ca. 30% precision
2. Search quality
• Subgraphs index coherent document clusters
3. Learning effectiveness
• Document search with story graphs leads to averages of 67-75% accuracy on judgments of story fact truth
• On average, 1.3-4.7 queries with 3.4-5.2 nodes/words per query
4. Comparison with other temporal text mining methods
• New (and only) framework for cross-method comparison
• Recall- and precision-style metrics lead to different method rankings

Damilicious: functionality basics
• Apply my grouping for "rfid" (Security/privacy, Group 2, ...) to the following new search result
• Show users and how similarly they group
• Apply U4's grouping to my new search result

Damilicious: mining basics (1)
Methods and process
1. Query
2. Automatic clustering
3. Manual regrouping
4. Re-use
   1. Learn classifier & present way(s) of grouping
   2. Transfer the constructed concepts
Features/methods for the conceptual/predictive clustering:
• Lingo phrases, Lingo clustering, Ripper
• co-citation, bibliometric coupling, word or LSA similarity, combinations; k-means, hierarchical

Damilicious: mining basics (2)
Measures of grouping and user diversity
Diversity = 1 – similarity = 1 – normalized mutual information (NMI, an entropy-based measure)
• "How similarly do two users group documents?"
• For each query q, consider their groupings gr (slide example: NMI = 0)
• For several queries: aggregate

Damilicious: evaluation
• Clustering: does it generate meaningful document groups?
– yes (tradition in bibliometrics)
– but: data?
– Small expert evaluation of CiteseerCluster
• Choosing the clustering and classification methods for conceptual clustering
– Experiments: different features, clustering methods, classification methods → quality of reconstruction and extension-over-time (NMI)
• Technology acceptance
– End-user experiment (clustering & regrouping)
– 5-person formative user study (transfer of own results)

Conclusions and (some) questions
• Sense-making involves (with the corresponding KD approach):
– Extracting information from texts: text mining
– Extracting structural information between entities: graph mining
– Creating, using and modifying categories: semantics
– Interacting with external representations: interactivity
– Acknowledging diversity and perspective-taking: usage mining and "model-processing" (conceptual / predictive clustering)
– ...
• Appropriate mining methods, measures, ...?
• More/better evaluation methods and frameworks?
• Use cases?

• Subašić, I. & Berendt, B. (2009). Discovery of interactive graphs for understanding and searching time-indexed corpora. Knowledge and Information Systems. DOI 10.1007/s10115-009-0227-x
• Berendt, B. & Subašić, I. (2009). STORIES in time: a graph-based interface for news tracking and discovery. In N. Cristianini & M. Turchi (Eds.), Proceedings of Intelligent Analysis and Processing of Web News Content (IAPWNC) at The 2009 IEEE/WIC/ACM International Conferences Web Intelligence (WI'09) / Intelligent Agent Technology (IAT'09), 15 September 2009, Milan, Italy (Proceedings of WI-IAT 2009, pp. 531-534). DOI 10.1109/WI-IAT.2009.342
• Verbeke, M., Berendt, B., & Nijssen, S. (2009). Data mining, interactive semantic structuring, and collaboration: A diversity-aware method for sense-making in search. In G. Boato & C. Niederee (Eds.), Proceedings of First International Workshop on Living Web, collocated with the 8th International Semantic Web Conference (ISWC 2009), Washington D.C., USA, October 26, 2009. CEUR Workshop Proceedings Vol. 515.
• Berendt, B. (2010). Diversity in search: what, how, and what for? Talk at Barcelona Media / Yahoo! Research and UPF, 4 March 2010.
• Berendt, B., Krause, B., & Kolbe-Nusser, S. (2010). Intelligent scientific authoring tools: Interactive data mining for constructive uses of citation networks. Information Processing & Management, 46(1), 1-10.
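
Supplementary sketch for "Damilicious: mining basics (2)": the slide defines diversity as 1 minus the normalized mutual information of two users' groupings, but gives no formula. The Python sketch below is one possible reading, not the authors' implementation. It assumes that both users grouped the same documents in the same order, that NMI is normalized by the arithmetic mean of the two entropies (several other normalizations exist), and that "aggregate" over several queries means a plain average. All function names are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a list of group labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two groupings of the same documents."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * log((c * n) / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI, normalized by the mean entropy (an assumption, see lead-in)."""
    if len(a) != len(b) or not a:
        raise ValueError("groupings must be non-empty and of equal length")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        # Both users put every document into one single group.
        return 1.0
    return mutual_information(a, b) / ((h_a + h_b) / 2)


def diversity(a, b):
    """Diversity = 1 - similarity = 1 - NMI, as stated on the slide."""
    return 1.0 - nmi(a, b)


def aggregate_diversity(groupings_u, groupings_v):
    """Average diversity over the queries both users grouped.

    Averaging is an assumption; the slide only says "aggregate".
    Each argument maps a query string to a list of group labels.
    """
    shared = sorted(groupings_u.keys() & groupings_v.keys())
    if not shared:
        raise ValueError("the two users share no queries")
    return sum(diversity(groupings_u[q], groupings_v[q]) for q in shared) / len(shared)


# Illustrative data only: two users grouping four documents per query.
user_a = {"rfid": ["g1", "g1", "g2", "g2"], "ontology": ["g1", "g1", "g2", "g2"]}
user_b = {"rfid": ["x", "x", "y", "y"], "ontology": ["x", "y", "x", "y"]}

same_partition = diversity(user_a["rfid"], user_b["rfid"])
unrelated_partition = diversity(user_a["ontology"], user_b["ontology"])
overall = aggregate_diversity(user_a, user_b)
```

In this toy data the "rfid" groupings are the same partition under different group names, so NMI is 1 and diversity is 0. The "ontology" groupings are statistically independent, which corresponds to the NMI = 0 case named on the slide.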