
Untangling Text Data Mining
Stanford Digital Libraries Seminar, May 11, 1998
Marti Hearst, UC Berkeley SIMS
www.sims.berkeley.edu/~hearst

Caveat Emptor
I do information access. I do not do text data mining (yet). This talk is an attempt to explore the relationship between the two.

Talk Outline
• Definitions
  – What is Data Mining?
  – What is Information Access?
  – What is Text Data Mining?
• Empirical Computational Linguistics
• Real text data mining tasks
• Conclusions and Future Directions

The Knowledge Discovery from Data Process (KDD)
KDD: The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. (Fayyad, Piatetsky-Shapiro, & Smyth, CACM 96)
Note: data mining is just one step in the process.

What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97)
• Fitting models to or determining patterns from very large datasets.
• A “regime” which enables people to interact effectively with massive data stores.
• Deriving new information from data.
  – finding patterns across large datasets
  – discovering heretofore unknown information

What is Data Mining?
• Potential point of confusion:
  – The extracting ore from rock metaphor does not really apply to the practice of data mining
  – If it did, then standard database queries would fit under the rubric of data mining
    • Find all employee records in which the employee earns $300/month less than their manager
  – In practice, DM refers to:
    • finding patterns across large datasets
    • discovering heretofore unknown information

Why Data Mining?
• Because the data is there.
• Because current DBMS technology does not support data analysis.
• Because
  – larger disks
  – faster CPUs
  – high-powered visualization
  – networked information
  are becoming widely available.

DM Touchstone Applications (CACM 39 (11) Special Issue)
• Finding patterns across data sets:
  – Reports on changes in retail sales
    • to improve sales
  – Patterns of sizes of TV audiences
    • for marketing
  – Patterns in NBA play
    • to alter, and so improve, performance
  – Deviations in standard phone calling behavior
    • to detect fraud
    • for marketing

DM Touchstone Applications (CACM 39 (11) Special Issue)
• Separating signal from noise:
  – Classifying faint astronomical objects
  – Finding genes within DNA sequences
  – Discovering novel tectonic activity

What’s new here?
• Sounds like statistical modeling or machine learning.
• Main Difference: scale and availability (Fayyad 97)
  – Datasets too large for classical analysis
  – Increased opportunity for access
    • end user is often not a statistician
  – New issues in sampling

Statistician’s Viewpoint (David Hand 97)
• What’s new about DM?
  – Returns statisticians to their empirical roots
    • exploration rather than modeling
  – Hypothesis testing may be irrelevant
    • given the large data sizes, everything is significant
  – Data was collected for some purpose other than the one it is now being analyzed for

Talk Outline
• Definitions
  – What is Data Mining?
  – What is Information Access?
  – What is Text Data Mining?
• Empirical Computational Linguistics
• Real text data mining tasks
• Conclusions and Future Directions

Information Access (Information Retrieval more broadly construed)
• Problem:
  – Huge amounts of online textual information
• Goal:
  – Build systems to help people discover, create, use, reuse, and understand information
• Approach:
  – Leverage off of users’ smarts
  – Combine stats, text analysis, user interfaces

Information Retrieval
A restricted form of Information Access
• The system has available only pre-existing, “canned” text passages.
• Its response is limited to selecting from these passages and presenting them to the user.
• It must select, say, 10 or 20 passages out of millions!

Needles in Haystacks
• The emphasis in IR (and standard DB) is on answering ad hoc queries.

IA vs. KDD Process
• Query/Information Need
• Match query against transformed data
• Show results ranked in relevance order

Talk Outline
• Definitions
  – What is Data Mining?
  – What is Information Access?
  – What is Text Data Mining?
• Empirical Computational Linguistics
• Real text data mining tasks
• Conclusions and Future Directions

What is Text Data Mining?
• People’s first thought:
  – Make it easier to find things on the Web.
  – But this is information retrieval!
• The metaphor of extracting ore from rock:
  – Does make sense for extracting documents of interest from a huge pile.
  – But does not reflect notions of DM in practice:
    • finding patterns across large collections
    • discovering heretofore unknown information

Real Text DM
• What would finding a pattern across a large text collection really look like?

Bill Gates + MS-DOS in the Bible!
(William Gates, agitator, leader)
From: “The Internet Diary of the man who cracked the Bible Code”, Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil

Real Text DM
• The point:
  – Discovering heretofore unknown information is not what we usually do with text.
  – (If it weren’t known, it could not have been written by someone!)
• However:
  – There is a field whose goal is to learn about patterns in text for its own sake ...

Observation
Research that exploits patterns in text does so mainly in the service of computational linguistics, rather than for learning about and exploring text collections.

Talk Outline
• Definitions
• Empirical Computational Linguistics
  – Special and important properties of text
  – Relationship to TDM
  – Examples of TDM as CL
• Real text data mining tasks
• Conclusions and Future Directions

Recent Trends in NLP (CL)
• Previously: AI, full understanding
• Current: Corpus-based, Statistical
    • ACL proceedings: from 3 corpus-based papers in 1991 to at least half in 1996
    • Stat NLP was tried long ago (Z. Harris)
• Simple Often Wins
    • Echoes results in IR
• Interesting direction:
    • Statistics + Linguistics (Klavans & Resnik 96)

Text Analysis (CL) Tasks
• Word Sense Disambiguation
• Automatic Lexicon Augmentation
• Discourse Analysis
• Parsing
    • Phrase Identification
    • Phrase Attachments
    • Predicate/Argument Structure
    • Scope of Conjunctions
    • ...

Why Text is Tough
  – Abstract concepts difficult to represent (AI-Complete)
  – “Countless” combinations of subtle, abstract relationships among concepts
  – Many ways to represent similar concepts
      space ship, flying saucer, UFO, figment of imagination
  – Concepts are difficult to visualize
  – High dimensionality
      Tens or hundreds of thousands of features

Why Text is Tough
• Language is:
  – ambiguous (many different meanings for the same words and phrases)
  – different combinations imply different meanings

Why Text is Tough
• I saw Pathfinder on Mars with a telescope.
• Pathfinder photographed Mars.
• The Pathfinder photograph mars our perception of a lifeless planet.
• The Pathfinder photograph from Ford has arrived.
• The Pathfinder forded the river without marring its paint job.

Why Text is Easy
• Highly redundant in bulk
• Just about any simple algorithm can get “good” results for coarse tasks
  – Pull out “important” phrases
  – Find “meaningfully” related words
  – Create summary from document
  – Major problem: Evaluation

Stupid Text Tricks
  – Coarse IR, Clustering
    • Don’t need dimension reduction (except stopwords)
    • Don’t need morphological analysis
    • Don’t need word sense disambiguation
  – Partial parsing:
    • Simple, greedy transformation rules
    • Cascading finite state machines
  – Categorization
    • Assume independence

Text “Data Cleaning”
Pre-process text as follows:
• Tokenization
• Morphological Analysis (Stemming)
    inflectional, derivational, or crude IR methods
• Part-of-Speech Tagging
    I/Pro see/VP Pathfinder/PN on/P Mars/PN ...
• Phrase Boundary Identification
    [Subj I] [VP saw] [DO Pathfinder] [PP on Mars] [PP with a telescope].

CCL Methodology
• Describe here the standard methodology for corpus-based computational linguistics algorithms

CCL Examples
• Place here examples of the kinds of output generated for computational linguistics applications

Inducing MetaData for Documents
• Assigning bibliographic metadata
  – author, genre, time, region
• Subject/Topic assignments
  – category labels: MeSH, LoC, ACM keywords
• Information Extraction (MUC)
  – MUC: terrorist incidents
    • who did the bombing
    • where did the bombing take place
    • what weapon(s) were used
    • when did it happen

Inducing MetaData for Collections
• Indexes
• Hierarchical Categorization
• Overviews of Connectivity
    • hyperlinks
    • co-citation links
• Overviews of Subject Matter
    • 2D
    • 3D
    • dynamic

A Main Point:
• Empirical CL is usually not helpful for improving Information Access.
• However, it can produce
  – metadata
  – overviews
  – associations
  that are indirectly useful for IA.
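
As a rough illustration of the first two steps on the Text “Data Cleaning” slide (tokenization and crude, IR-style stemming), here is a minimal Python sketch. The stopword list, the suffix rules, and the function names are all assumptions made for this example; they are not taken from the talk, and no part-of-speech tagging or phrase identification is attempted.

```python
import re

# Hypothetical stopword list and suffix rules, chosen only for illustration.
STOPWORDS = {"a", "an", "the", "on", "with", "i", "of"}
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean(text):
    """Tokenize, drop stopwords, and stem what remains."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOPWORDS]


print(clean("I saw Pathfinder on Mars with a telescope."))
```

Note that a rule this crude turns the planet “Mars” into “mar”, the same stem it would give the verb in “mars our perception”. That is the kind of ambiguity the “Why Text is Tough” slides point at, and the kind of error that coarse tasks can often tolerate.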

Talk Outline
• Definitions
• Empirical Computational Linguistics
• Real text data mining tasks
  – TDM not using text
  – TDM using text
• Conclusions and Future Directions

TDM using Metadata (instead of Text) (Dagan, Feldman, and Hirsh, SDAIR ‘96)
  – Data:
    • Reuters newswire (22,000 articles, late 1980s)
    • Categories: commodities, time, countries, people, and topic
  – Goals:
    • distributions of categories across time (trends)
    • distributions of categories between collections
    • category co-occurrence (e.g., topic|country)
  – Interactive Interface:
    • lists, pie charts, 2D line plots

Combining Text with Metadata (images, hyperlinks)
• Examples
  – Text + Links to find “authority pages” (Kleinberg at Cornell, Page at Stanford)
  – Usage + Time + Links to study evolution of web and information use (Pitkow et al. at PARC)
  – Images + Text to improve image search

Talk Outline
• Definitions
• The New Empirical Computational Linguistics
• Real text data mining tasks
  – TDM not using text
  – TDM using text
• Conclusions and Future Directions

Ore-Filled Text Collections
• Newspaper/Newswire
• Medical Articles
  – Patterns associated with symptoms, drugs
• Patent Law
  – Recent Study Justifying Scientific Funding
  – Hypotheses for New Inventions
• “Corporate Memory”

True Text Data Mining: Don Swanson’s Medical Work
• Given
  – medical titles and abstracts
  – a problem (incurable rare disease)
  – some medical expertise
• find causal links among titles
  – symptoms
  – drugs
  – results

Swanson Example (1991)
• Problem: Migraine headaches (M)
  – stress associated with M
  – stress leads to loss of magnesium
  – calcium channel blockers prevent some M
  – magnesium is a natural calcium channel blocker
  – spreading cortical depression (SCD) implicated in M
  – high levels of magnesium inhibit SCD
  – M patients have high platelet aggregability
  – magnesium can suppress platelet aggregability
• All extracted from medical journal titles
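
The chain of evidence in the migraine example can be sketched as a search for concepts that never share a title with the problem term but are connected to it through intermediate concepts. The Python below is only an illustration of that idea, not Swanson's actual procedure: the "titles" are hand-reduced to concept sets paraphrasing the bullets above, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Each "title" is reduced by hand to the set of concepts it mentions.
# These sets paraphrase the slide's bullets; a real system would have to
# extract them from journal titles, which this sketch does not attempt.
TITLES = [
    {"stress", "migraine"},
    {"stress", "magnesium"},
    {"calcium channel blockers", "migraine"},
    {"magnesium", "calcium channel blockers"},
    {"spreading cortical depression", "migraine"},
    {"magnesium", "spreading cortical depression"},
    {"platelet aggregability", "migraine"},
    {"magnesium", "platelet aggregability"},
]


def cooccurrence(titles):
    """Map each concept to the set of concepts it shares a title with."""
    neighbors = defaultdict(set)
    for concepts in titles:
        for a, b in combinations(sorted(concepts), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def indirect_links(titles, target):
    """Find concepts never seen with `target` but reachable via one bridge."""
    neighbors = cooccurrence(titles)
    direct = neighbors[target]
    candidates = defaultdict(set)
    for bridge in direct:
        for other in neighbors[bridge]:
            if other != target and other not in direct:
                candidates[other].add(bridge)
    return dict(candidates)


links = indirect_links(TITLES, "migraine")
for concept, bridges in links.items():
    print(concept, "via", sorted(bridges))
```

On this toy input the only candidate is magnesium, reached through all four intermediate concepts. Co-occurrence alone says nothing about the direction or plausibility of a link, so judging whether such a candidate is a sensible hypothesis still takes domain expertise.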

Swanson’s TDM
• Two of his hypotheses have received some experimental verification.
• His technique
  – Only partially automated
  – Required medical expertise
• Few people are working on this.

Text Collection Overviews
• Clusters/Unsupervised Overviews
  – Chalmers: BEAD, Networks of Words
  – Lin, Chen: Kohonen Feature Maps
  – Xerox PARC: Local Clusters
  – Pacific Northwest: ThemeScapes
  – Rennison: Galaxy of News

Text Overviews
  – Huge 2D maps may be inappropriate focus for information retrieval
    • can’t see what documents are about
    • documents forced into one position in semantic space
    • space difficult to browse for IR purposes
  – Perhaps more suited for pattern discovery
    • problem: often only one view on the space

Talk Outline
• Definitions
• The New Empirical Computational Linguistics
• Real text data mining tasks
  – TDM not using text
  – TDM using text
• Conclusions and Future Directions

Conclusions
• Currently, what might be construed as Text Data Mining is really Computational Linguistics
  – Text is tricky to process, but rich and abundant (now)
  – There are many CL tools available
• Data Mining directly from text
  – tells us about language
  – produces meta-information that may be useful for information access

Conclusions, continued
• Information Access != Text Data Mining
  – IA = finding needle in haystack
  – TDM = finding patterns or discovering new information
• However, Information Access may potentially be served by Text Data Mining techniques:
  – automated metadata assignment
  – collection overviews
• The synthesis of ideas from TDM and IA:
  – Perhaps a new field of exploratory data analysis over text!

Promising Research Directions
• Text Data Mining Problems:
  – Patterns within sets of documents:
    • What is the latest in this field?
    • How is this field related to that field?
  – Chains of evidence embedded in text:
    • What drugs have been tested for this symptom?
    • What effects did this funding have on that field?
  – Human use of information over time:
    • How does information diffuse across the web?
