Text Data Mining
Prof. Marti Hearst
UC Berkeley SIMS
Guest Lecture, ME 290M
Prof. Agogino
May 4, 1999

There’s Lots of Text Out There
Is it Information Overload?
Why not TURBO-Text?
How can we SYNTHESIZE what’s there to make new discoveries?

Talk Outline
Definitions
  – What is Data Mining?
  – What is Text Data Mining?
Text data mining examples
  – Lexical knowledge acquisition
  – Merging textual records
  – Finding cures for diseases (from medical literature)
Future Directions

What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97)
Fitting models to or determining patterns from very large datasets.
A “regime” which enables people to interact effectively with massive data stores.
Deriving new information from data.
  – finding patterns across large datasets
  – discovering heretofore unknown information

What is Data Mining?
Potential point of confusion:
  – The extracting ore from rock metaphor does not really apply to the practice of data mining
  – If it did, then standard database queries would fit under the rubric of data mining
    » Find all employee records in which employee earns $300/month less than their managers
  – In practice, DM refers to:
    » finding patterns across large datasets
    » discovering heretofore unknown information

Why Data Mining?
Because the data is there.
Because current DBMS technology does not support data analysis.
Because
  – larger disks
  – faster CPUs
  – high-powered visualization
  – networked information
are becoming widely available.

DM Touchstone Applications (CACM 39 (11) Special Issue)
Finding patterns across data sets:
  – Reports on changes in retail sales
    » to improve sales
  – Patterns of sizes of TV audiences
    » for marketing
  – Patterns in NBA play
    » to alter, and so improve, performance
  – Deviations in standard phone calling behavior
    » to detect fraud
    » for marketing

DM Touchstone Applications (CACM 39 (11) Special Issue)
Separating signal from noise:
  – Classifying faint astronomical objects
  – Finding genes within DNA sequences
  – Discovering novel tectonic activity

What is Text Data Mining?
People’s first thought:
  – Make it easier to find things on the Web.
  – This is information retrieval!
The metaphor of extracting ore from rock does make sense for extracting documents of interest from a huge pile.
But it does not reflect notions of DM in practice:
  – finding patterns across large collections
  – discovering heretofore unknown information

Text DM ≠ IR
Data Mining:
  » Patterns, Nuggets, Exploratory Analysis
Information Retrieval:
  – Finding and ranking documents that match users’ information need
    » ad hoc query
    » filtering/standing query
  – Rarely Patterns, Exploratory Analysis

Real Text DM
The point:
  – Discovering heretofore unknown information is not what we usually do with text.
  – (If it weren’t known, it could not have been written by someone.)
However:
  – There is a field whose goal is to learn about patterns in text for its own sake ...

Computational Linguistics
Goal: automated language understanding
  – this isn’t possible
  – instead, go for subgoals, e.g.,
    » word sense disambiguation
    » phrase recognition
    » semantic associations
Current approach:
  – statistical analyses of very large text collections

WordNet: A Lexical Database
A list of hypernyms for each sense of “crow”

Lexicographic Knowledge Acquisition
Given a large lexical database ...
  – WordNet: Miller, Fellbaum et al. at Princeton
  – http://www.cogsci.princeton.edu/~wn
… and a huge text collection
  – How to automatically add new relations?

Idea: Use Simple Lexico-Syntactic Analysis
Patterns of the following type work:
  NP0 such as NP1, {NP2 …, (and | or) NPi}, i >= 1
implies
  for all NPi, i >= 1: hyponym(NPi, NP0)
Example:
  – “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use.”
  – implies hyponym(“Gelidium”, “red algae”)

More Examples
“Felonies, such as shootings and stabbings …” implies
  – hyponym(shootings, felonies)
  – hyponym(stabbings, felonies)
Is this in the WordNet hierarchy?

Linking Killing to Felonies

Another Example
Einstein is (was) a physicist.
Is/was he a genius?

Making Einstein a Genius

Results from “such as” lexico-syntactic relation

Results with the “or other” lexico-syntactic relation

Procedure
Discover a pattern that indicates a lexical relationship
Scan through a large collection; extract sentences that match the pattern
Extract the NPs from the sentence
  – requires some phrase parsing
Check if suggested relation is in WordNet or not
  – this part not automated, but could be

Discovering New Patterns
Suggested algorithm:
  – Decide on a lexical relation of interest, e.g., hyponymy
  – Derive a list of word pairs from WordNet that are known to hold that relation
    » e.g., (crow, bird)
  – Extract sentences from the text collection in which both terms occur
  – Find commonalities among their lexico-syntactic contexts
  – Test these out against other word pairs known to hold the relationship in WordNet

Text Merging Example: Discovering Hypocritical Congresspersons

Discovering Hypocritical Congresspersons
Feb 1, 1996
  – US House of Reps votes to pass Telecommunications Reform Act
  – this contains the CDA (Communications Decency Act)
  – violators subject to fines of $250,000 and 5 years in prison
  – eventually struck down by court

Discovering Hypocritical Congresspersons
Sept 11, 1998
  – US House of Reps votes to place the Starr report online
  – the content would (most likely) have violated the CDA
365 people were members for both votes
  – 284 members voted aye both times
    » 185 (94%) Republicans voted aye both times
    » 96 (57%) Democrats voted aye both times

How to find Hypocritical Congresspersons?
This must have taken a lot of work
  – Hand cutting and pasting
  – Lots of picky details
    » Some people voted on one but not the other bill
    » Some people share the same name
      Check for different county/state
      Still messed up on “Bono”
  – Taking stats at the end on various attributes
    » Which state
    » Which party
Tools should help streamline, reuse results

How to find Hypocritical Congresspersons?
The hard part?
  – Knowing to compare these two sets of voting records.

How to find causes of disease?
Don Swanson’s Medical Work
Given
  – medical titles and abstracts
  – a problem (incurable rare disease)
  – some medical expertise
find causal links among titles
  – symptoms
  – drugs
  – results

Swanson Example (1991)
Problem: Migraine headaches (M)
  – stress associated with M
  – stress leads to loss of magnesium
  – calcium channel blockers prevent some M
  – magnesium is a natural calcium channel blocker
  – spreading cortical depression (SCD) implicated in M
  – high levels of magnesium inhibit SCD
  – M patients have high platelet aggregability
  – magnesium can suppress platelet aggregability
All extracted from medical journal titles

Swanson’s TDM
Two of his hypotheses have received some experimental verification.
His technique
  – Only partially automated
  – Required medical expertise
Few people are working on this.

How to Automate This?
Idea: mixed-initiative interaction
  – User applies tools to help explore the hypothesis space
  – System runs suites of algorithms to help explore the space, suggest directions

Our Proposed Approach
Three main parts
  – UI for building/using strategies
  – Backend for interfacing with various databases and translating different formats
  – Content analysis/machine learning for figuring out good hypotheses/throwing out bad ones

The UI part
Need support for building strategies
Mixed-initiative system
  – Trade-off between user-initiated hypothesis exploration and system-initiated suggestions
Information visualization
  – Another way to show lots of choices

Candidate Associations
Suggested Strategies
Current Retrieval Results

Lindi: Linking Information for Novel Discovery and Insight
Just starting up now (fall 98)
Initial work: Hao Chen, Ketan Mayer-Patel, Shankar Raman

Summary
Text Data Mining:
  – Extracting heretofore undiscovered information from large text collections
  – Not the same as information retrieval
Examples
  – Lexicographic knowledge acquisition
  – Merging of text representations
  – Linking related information

The truth is out there!
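
Sketch: the “such as” pattern in code
The “such as” pattern from the lexicographic knowledge acquisition example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation behind these slides: it assumes noun phrases have already been chunked by some upstream parser and marked with square brackets (the slides note that extracting the NPs requires some phrase parsing), and the names SUCH_AS and extract_hyponyms are made up for this example.

```python
import re

# Assumption: an upstream chunker has already marked noun phrases as [NP].
# SUCH_AS and extract_hyponyms are hypothetical names for this sketch.
NP = r"\[([^\]]+)\]"

# NP0 (,) such as NP1 {, NP2 ... (and | or) NPi}
SUCH_AS = re.compile(
    r"\[(?P<hypernym>[^\]]+)\],?\s+such as\s+"
    r"(?P<hyponyms>\[[^\]]+\]"
    r"(?:(?:\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+)\[[^\]]+\])*)"
)


def extract_hyponyms(chunked_sentence):
    """Return (hyponym, hypernym) pairs suggested by the "such as" pattern."""
    pairs = []
    for match in SUCH_AS.finditer(chunked_sentence):
        hypernym = match.group("hypernym")
        # Every bracketed NP in the list after "such as" is a candidate hyponym.
        for hyponym in re.findall(NP, match.group("hyponyms")):
            pairs.append((hyponym, hypernym))
    return pairs


AGAR = ("[Agar] is [a substance] prepared from [a mixture] of [red algae], "
        "such as [Gelidium], for [laboratory or industrial use].")
FELONIES = "[Felonies], such as [shootings] and [stabbings] ..."

print(extract_hyponyms(AGAR))      # [('Gelidium', 'red algae')]
print(extract_hyponyms(FELONIES))  # [('shootings', 'Felonies'), ('stabbings', 'Felonies')]
```

A fuller system would replace the bracket convention with a real noun-phrase chunker and then check each suggested pair against WordNet, which is the step the Procedure slide describes as not yet automated.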