
Text Data Mining
Prof. Marti Hearst
UC Berkeley SIMS
Guest Lecture, ME 290M
Prof. Agogino
May 4, 1999

There’s Lots of Text Out There
• Is it Information Overload?
• Why not TURBO-Text?
• How can we SYNTHESIZE what’s there to make new discoveries?

Talk Outline
• Definitions
  – What is Data Mining?
  – What is Text Data Mining?
• Text data mining examples
  – Lexical knowledge acquisition
  – Merging textual records
  – Finding cures for diseases (from medical literature)
• Future Directions

What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97)
• Fitting models to or determining patterns from very large datasets.
• A “regime” which enables people to interact effectively with massive data stores.
• Deriving new information from data.
  – finding patterns across large datasets
  – discovering heretofore unknown information

What is Data Mining?
• Potential point of confusion:
  – The extracting ore from rock metaphor does not really apply to the practice of data mining
  – If it did, then standard database queries would fit under the rubric of data mining
    » Find all employee records in which employee earns $300/month less than their managers
  – In practice, DM refers to:
    » finding patterns across large datasets
    » discovering heretofore unknown information

Why Data Mining?
• Because the data is there.
• Because current DBMS technology does not support data analysis.
• Because
  – larger disks
  – faster CPUs
  – high-powered visualization
  – networked information
  are becoming widely available.
DM Touchstone Applications (CACM 39 (11) Special Issue)
• Finding patterns across data sets:
  – Reports on changes in retail sales
    » to improve sales
  – Patterns of sizes of TV audiences
    » for marketing
  – Patterns in NBA play
    » to alter, and so improve, performance
  – Deviations in standard phone calling behavior
    » to detect fraud
    » for marketing

DM Touchstone Applications (CACM 39 (11) Special Issue)
• Separating signal from noise:
  – Classifying faint astronomical objects
  – Finding genes within DNA sequences
  – Discovering novel tectonic activity

What is Text Data Mining?
• People’s first thought:
  – Make it easier to find things on the Web.
  – This is information retrieval!
• The metaphor of extracting ore from rock does make sense for extracting documents of interest from a huge pile.
• But it does not reflect notions of DM in practice:
  – finding patterns across large collections
  – discovering heretofore unknown information

Text DM ≠ IR
• Data Mining:
  – Patterns, Nuggets, Exploratory Analysis
• Information Retrieval:
  – Finding and ranking documents that match users’ information need
    » ad hoc query
    » filtering/standing query
  – Rarely Patterns, Exploratory Analysis

Real Text DM
• The point:
  – Discovering heretofore unknown information is not what we usually do with text.
  – (If it weren’t known, it could not have been written by someone.)
• However:
  – There is a field whose goal is to learn about patterns in text for its own sake ...

Computational Linguistics
• Goal: automated language understanding
  – this isn’t possible
  – instead, go for subgoals, e.g.,
    » word sense disambiguation
    » phrase recognition
    » semantic associations
• Current approach:
  – statistical analyses of very large text collections

WordNet: A Lexical Database
A list of hypernyms for each sense of “crow”

Lexicographic Knowledge Acquisition
• Given a large lexical database ...
  – WordNet: Miller, Fellbaum et al. at Princeton
  – http://www.cogsci.princeton.edu/~wn
• … and a huge text collection
  – How to automatically add new relations?

Idea: Use Simple Lexico-Syntactic Analysis
• Patterns of the following type work:
    NP0 such as NP1, {NP2 …, (and | or) NPi}, i >= 1
  implies
    for all NPi, i >= 1: hyponym(NPi, NP0)
• Example:
  – “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use.”
  – implies hyponym(“Gelidium”, “red algae”)

More Examples
• “Felonies, such as shootings and stabbings …” implies
  – hyponym(shootings, felonies)
  – hyponym(stabbings, felonies)
• Is this in the WordNet hierarchy?

Linking Killing to Felonies

Another Example
• Einstein is (was) a physicist.
• Is/was he a genius?

Making Einstein a Genius

Results from the “such as” lexico-syntactic relation

Results with the “or other” lexico-syntactic relation

Procedure
• Discover a pattern that indicates a lexical relationship
• Scan through a large collection; extract sentences that match the pattern
• Extract the NPs from the sentence
  – requires some phrase parsing
• Check if the suggested relation is in WordNet or not
  – this part is not automated, but could be

Discovering New Patterns
• Suggested algorithm:
  – Decide on a lexical relation of interest, e.g., hyponymy
  – Derive a list of word pairs from WordNet that are known to hold that relation
    » e.g., (crow, bird)
  – Extract sentences from the text collection in which both terms occur
  – Find commonalities among the lexico-syntactic contexts
  – Test these out against other word pairs known to hold the relationship in WordNet

Text Merging Example: Discovering Hypocritical Congresspersons

Discovering Hypocritical Congresspersons
• Feb 1, 1996
  – US House of Reps votes to pass Telecommunications Reform Act
  – this contains the CDA (Communications Decency Act)
  – violators subject to fines of $250,000 and 5 years in prison
  – eventually struck down by court

Discovering Hypocritical Congresspersons
• Sept 11, 1998
  – US House of Reps votes to place the Starr report online
  – the content would (most likely) have violated the CDA
• 365 people were members for both votes
  – 284 members voted aye both times
    » 185 (94%) Republicans voted aye both times
    » 96 (57%) Democrats voted aye both times

How to find Hypocritical Congresspersons?
• This must have taken a lot of work
  – Hand cutting and pasting
  – Lots of picky details
    » Some people voted on one but not the other bill
    » Some people share the same name
      • Check for different county/state
      • Still messed up on “Bono”
  – Taking stats at the end on various attributes
    » Which state
    » Which party
• Tools should help streamline, reuse results

How to find Hypocritical Congresspersons?
• The hard part?
  – Knowing to compare these two sets of voting records.

How to find causes of disease? Don Swanson’s Medical Work
• Given
  – medical titles and abstracts
  – a problem (incurable rare disease)
  – some medical expertise
• find causal links among titles
  – symptoms
  – drugs
  – results

Swanson Example (1991)
• Problem: Migraine headaches (M)
  – stress associated with M
  – stress leads to loss of magnesium
  – calcium channel blockers prevent some M
  – magnesium is a natural calcium channel blocker
  – spreading cortical depression (SCD) implicated in M
  – high levels of magnesium inhibit SCD
  – M patients have high platelet aggregability
  – magnesium can suppress platelet aggregability
• All extracted from medical journal titles

Swanson’s TDM
• Two of his hypotheses have received some experimental verification.
• His technique
  – Only partially automated
  – Required medical expertise
• Few people are working on this.

How to Automate This?
• Idea: mixed-initiative interaction
  – User applies tools to help explore the hypothesis space
  – System runs suites of algorithms to help explore the space, suggest directions

Our Proposed Approach
• Three main parts
  – UI for building/using strategies
  – Backend for interfacing with various databases and translating different formats
  – Content analysis/machine learning for figuring out good hypotheses/throwing out bad ones

The UI part
• Need support for building strategies
• Mixed-initiative system
  – Trade-off between user-initiated hypothesis exploration and system-initiated suggestions
• Information visualization
  – Another way to show lots of choices

Candidate Associations
Suggested Strategies
Current Retrieval Results

Lindi: Linking Information for Novel Discovery and Insight
• Just starting up now (fall 98)
• Initial work: Hao Chen, Ketan Mayer-Patel, Shankar Raman

Summary
• Text Data Mining:
  – Extracting heretofore undiscovered information from large text collections
  – Not the same as information retrieval
• Examples
  – Lexicographic knowledge acquisition
  – Merging of text representations
  – Linking related information
• The truth is out there!
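
The “such as” pattern from the lexico-syntactic slides can be sketched in a few lines of Python. This is an illustration only, not the system behind the results shown in the talk. The talk notes that extracting the NPs requires some phrase parsing; here that step is replaced by a crude assumption, namely that a noun phrase is a run of words that are not in a small stop list. The stop list and the names leading_np, trailing_np and such_as_hyponyms are all hypothetical.

```python
import re

# Hypothetical stop list: words assumed never to be part of a noun phrase.
STOP = {"a", "an", "the", "of", "for", "from", "in", "on", "to", "with",
        "is", "are", "was", "were", "and", "or", "such", "as"}

WORD = re.compile(r"[A-Za-z][A-Za-z-]*")
BREAK = re.compile(r"[.,;:!?…]")                    # punctuation that ends a phrase
LIST_SEP = re.compile(r",?\s+(?:and|or)\s+|,\s*")   # separators inside the NP list
SUCH_AS = re.compile(r",?\s+such as\s+")


def leading_np(text):
    """Leading run of non-stop words in text, or None (crude NP stand-in)."""
    words = []
    for tok in WORD.findall(text):
        if tok.lower() in STOP:
            break
        words.append(tok)
    return " ".join(words) or None


def trailing_np(text):
    """Trailing run of non-stop words in text, or None (crude NP stand-in)."""
    words = []
    for tok in reversed(WORD.findall(text)):
        if tok.lower() in STOP:
            break
        words.append(tok)
    return " ".join(reversed(words)) or None


def such_as_hyponyms(sentence):
    """Return (hyponym, hypernym) pairs suggested by 'NP0 such as NP1, ...'."""
    m = SUCH_AS.search(sentence)
    if not m:
        return []
    # NP0: the last phrase before "such as".
    hypernym = trailing_np(BREAK.split(sentence[:m.start()])[-1])
    if hypernym is None:
        return []
    # NP1..NPi: list items after "such as", up to the end of the sentence.
    tail = re.split(r"[.;:!?…]", sentence[m.end():])[0]
    pairs = []
    for item in LIST_SEP.split(tail):
        np = leading_np(item)
        if np is None:        # item starts with a stop word: the list has ended
            break
        pairs.append((np, hypernym))
    return pairs


print(such_as_hyponyms(
    "Agar is a substance prepared from a mixture of red algae, "
    "such as Gelidium, for laboratory or industrial use."))
print(such_as_hyponyms("Felonies, such as shootings and stabbings …"))
```

On the two example sentences from the slides this suggests hyponym(Gelidium, red algae), hyponym(shootings, Felonies) and hyponym(stabbings, Felonies). Checking the suggested pairs against WordNet, the last step of the procedure, is left out.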
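
The linking step in the migraine example can be sketched as a search for indirect connections: concepts that share intermediates with the problem but are never linked to it directly. This is a minimal sketch, not Swanson’s actual technique, which the talk describes as only partially automated. It assumes the concept pairs have already been pulled out of the titles; the LINKS list simply paraphrases the bullets on the Swanson Example slide, and the function name indirect_links is hypothetical.

```python
from collections import defaultdict

# Concept pairs paraphrased from the migraine slide. Extracting such pairs
# from real titles is the hard part and is not attempted here.
LINKS = [
    ("stress", "migraine"),
    ("stress", "magnesium"),
    ("calcium channel blockers", "migraine"),
    ("magnesium", "calcium channel blockers"),
    ("spreading cortical depression", "migraine"),
    ("magnesium", "spreading cortical depression"),
    ("platelet aggregability", "migraine"),
    ("magnesium", "platelet aggregability"),
]


def indirect_links(links, start):
    """Rank concepts tied to `start` only through intermediate concepts.

    Returns a list of (concept, set_of_intermediates), with the concepts
    that have the most intermediates first.
    """
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    direct = neighbours[start]
    via = defaultdict(set)
    for middle in direct:
        for candidate in neighbours[middle]:
            # Keep only concepts never linked to `start` directly.
            if candidate != start and candidate not in direct:
                via[candidate].add(middle)
    return sorted(via.items(), key=lambda kv: (-len(kv[1]), kv[0]))


for concept, intermediates in indirect_links(LINKS, "migraine"):
    print(concept, "via", sorted(intermediates))
```

With "migraine" as the start, magnesium is the only indirect candidate, reached through all four intermediates listed on the slide. Deciding which candidates are medically plausible still needs the expertise the talk mentions.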
