SIMS 290-2:
Applied Natural Language Processing
Marti Hearst
October 20, 2004
1
Untangling Text Data Mining
(updated from a 1999 lecture)
2
Outline
Untangling several different fields
DM, CL, IA, TDM
TDM examples
TDM as Exploratory Data Analysis
New Problems for Computational Linguistics
Our current efforts
3
Classifying Application Types
                      Non-textual data           Textual data
Patterns              Standard data mining       Computational linguistics
Non-novel nuggets     Database queries           Information retrieval
Novel nuggets         Automated reasoning (AI)   Real text data mining
4
What is Data Mining?
(Fayyad & Uthurusamy 96, Fayyad 97)
Fitting models to or determining patterns from very
large datasets.
A “regime” which enables people to interact
effectively with massive data stores.
Deriving new information from data.
5
Why Data Mining?
Because the data is there.
Because
larger disks
faster cpus
high-powered visualization
networked information
are now widely available.
6
Knowledge Discovery from Data
(KDD)
KDD: The non-trivial process of identifying valid,
novel, potentially useful, and ultimately
understandable patterns in data.
(Fayyad, Piatetsky-Shapiro, & Smyth, CACM 96)
Note: data mining is just one step in the process
7
Data Mining Applications
(CACM 39 (11) Special Issue)
Finding patterns across data sets:
Reports on changes in retail sales
– to improve sales
Patterns of sizes of TV audiences
– for marketing
Patterns in NBA play
– to alter, and so improve, performance
Deviations in standard phone calling behavior
– to detect fraud
– for marketing
8
What is Data Mining?
Potential point of confusion:
The extracting ore from rock metaphor does not
really apply to the practice of data mining
If it did, then standard database queries would
fit under the rubric of data mining
In practice, DM refers to:
– finding patterns across large datasets
– discovering heretofore unknown information
9
What is Text Data Mining?
Many people’s first thought:
Make it easier to find things on the Web.
But this is information retrieval!
10
Needles in Haystacks
The emphasis in IR is on finding documents that already contain
answers to questions.
11
Information Retrieval
A restricted form of Information Access
The system has only pre-existing, “canned” text passages.
Its response is limited to selecting from these passages and
presenting them to the user.
It must select, say, 10 or 20 passages out of millions.
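A minimal sketch of that selection step, ranking toy passages against a query with TF-IDF and cosine similarity; the passages, query, and use of scikit-learn are illustrative assumptions, not part of the lecture:

    # Rank "canned" passages against a query and keep the top k.
    # Passages and query are invented; any small collection would do.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    passages = [
        "Magnesium deficiency has been linked to migraine headaches.",
        "Calcium channel blockers are used to treat hypertension.",
        "The NBA season begins in October.",
    ]
    query = "migraine and magnesium"

    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(passages)
    query_vector = vectorizer.transform([query])

    scores = cosine_similarity(query_vector, doc_vectors)[0]
    for i in scores.argsort()[::-1][:2]:      # the 2 best passages
        print(f"{scores[i]:.3f}  {passages[i]}")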
12
What is Text Data Mining?
The metaphor of extracting ore from rock:
Does make sense for extracting documents of
interest from a huge pile.
But does not reflect notions of DM in practice:
– finding patterns across large collections
– discovering heretofore unknown information
What would finding a pattern across a large text collection
really look like …?
13
Bill Gates + MS-DOS in the Bible!
From: “The Internet Diary of the man who cracked the Bible Code”
Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil
[Figure: Bible-code letter grid decoding to "William Gates, agitator, leader"]
14
From: “The Internet Diary of the man who cracked the Bible Code”
Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil
More info: http://cs.anu.edu.au/~bdm/dilugim/gatesdet.txt
http://cs.anu.edu.au/~bdm/dilugim/torah.html
15
Real Text DM
The point:
Discovering heretofore unknown information is not
what we usually do with text.
(If it weren’t known, it could not have been written
by someone!)
However:
There is a field whose goal is to learn about patterns
in text for their own sake ...
16
Computational Linguistics!
Goal: automated language understanding
this isn’t possible (yet)
instead, go for subgoals, e.g.,
– word sense disambiguation
– phrase recognition
– semantic associations
Common current approach:
statistical analyses over very large text collections
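As a sketch of what such a statistical analysis looks like, here is pointwise mutual information (PMI) computed over a toy three-sentence corpus; the corpus and the simple sentence-level co-occurrence normalization are simplifying assumptions:

    # PMI for word pairs: the kind of association statistic behind
    # collocation findings like the one on the next slide.
    import math
    from collections import Counter
    from itertools import combinations

    sentences = [
        "cloying jar jar binks scenes".split(),
        "jar jar binks is cloying".split(),
        "the film score is excellent".split(),
    ]

    word_counts = Counter(w for s in sentences for w in s)
    pair_counts = Counter(frozenset(p) for s in sentences
                          for p in combinations(sorted(set(s)), 2))
    n = sum(word_counts.values())

    def pmi(w1, w2):
        joint = pair_counts[frozenset((w1, w2))] / n  # sentence co-occurrence
        return math.log2(joint / ((word_counts[w1] / n) * (word_counts[w2] / n)))

    print(f"PMI(cloying, binks) = {pmi('cloying', 'binks'):.2f}")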
17
Why CL Isn’t TDM
A linguist finds it interesting that “cloying” co-occurs
significantly with “Jar Jar Binks” ...
… But this doesn’t really answer a question relevant
to the world outside the text itself.
18
Why CL Isn’t TDM
We need to use the text indirectly to answer
questions about the world
Direct:
Analyze patent text; determine which word patterns
indicate various subject categories.
Indirect:
Analyze patent text; find out whether private or
public funding leads to more inventions.
19
Why CL Isn’t TDM
Direct:
Cluster newswire text; determine which terms are
predominant
Indirect:
Analyze newswire text; gather evidence about which
countries/alliances are dominating which financial sectors
20
Nuggets vs. Patterns
TDM: we want to discover new information …
… As opposed to discovering which statistical patterns
characterize occurrence of known information.
Example: WSD
not TDM: computing statistics over a corpus to
determine what patterns characterize Sense S.
TDM: discovering the meaning of a new sense of a word.
21
Nuggets vs. Patterns
Nugget:
a new, heretofore unknown item of information.
Pattern:
distributions or rules that characterize the occurrence
(or non-occurrence) of a known item of information.
Application of rules can create nuggets in some
circumstances.
22
Example: Lexicon Augmentation
Application of a lexico-syntactic pattern:
NP0 such as NP1 {, NP2 ..., (and | or) NPi}, i >= 1,
implies that, for all NPi with i >= 1, hyponym(NPi, NP0)
Extracts a new hyponymy relation:
“Agar is a substance prepared from a mixture of red algae, such
as Gelidium, for laboratory or industrial use.”
implies hyponym(“Gelidium”, “red algae”)
However, this fact was already known to the author
of the text.
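A minimal sketch of applying this pattern with a regular expression; a real system would use a part-of-speech tagger and noun-phrase chunker, and the two-word heuristic for NP0 is an assumption that happens to work for this one sentence:

    import re

    sentence = ("Agar is a substance prepared from a mixture of red algae, "
                "such as Gelidium, for laboratory or industrial use.")

    # Crude stand-in for NP recognition: words before ", such as" and the
    # single capitalized word after it.
    m = re.search(r"([\w ]+?),? such as (\w+)", sentence)
    if m:
        np0 = " ".join(m.group(1).split()[-2:])  # last two words: "red algae"
        np1 = m.group(2)
        print(f"hyponym({np1!r}, {np0!r})")      # hyponym('Gelidium', 'red algae')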
23
The Quandary
How do we use text to both
Find new information not known to the author of
the text
Find information that is not about the text itself
24
Idea: Exploratory Data Analysis
Use large text collections to gather evidence to
support (or refute) hypotheses
Not known to author: links across many texts
Not self-referential: work within the domain of
discourse
25
Example: Etiology
Given
medical titles and abstracts
a problem (incurable rare disease)
some medical expertise
find causal links among titles
symptoms
drugs
results
26
Swanson Example (1991)
Problem: Migraine headaches (M)
Facts extracted from medical journal titles:
stress associated with M
stress leads to loss of magnesium
calcium channel blockers prevent some M
magnesium is a natural calcium channel blocker
spreading cortical depression (SCD) implicated in M
high levels of magnesium inhibit SCD
M patients have high platelet aggregability
magnesium can suppress platelet aggregability
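A minimal sketch of the linking step, over hand-coded triples paraphrasing these facts; in Swanson's actual work the facts came from manual reading of journal titles, not from structured data:

    from collections import Counter

    # (A, relation, C) triples paraphrasing the extracted facts.
    facts = [
        ("stress", "associated with", "migraine"),
        ("stress", "leads to loss of", "magnesium"),
        ("calcium channel blockers", "prevent", "migraine"),
        ("magnesium", "is a", "calcium channel blockers"),
        ("spreading cortical depression", "implicated in", "migraine"),
        ("magnesium", "inhibits", "spreading cortical depression"),
        ("platelet aggregability", "high in", "migraine"),
        ("magnesium", "suppresses", "platelet aggregability"),
    ]

    # Terms with a direct link to migraine.
    direct = {a for a, _, c in facts if c == "migraine"}

    # Terms linked to migraine only indirectly, counted by number of paths.
    support = Counter(a for a, _, c in facts if c in direct and a not in direct)
    for term, paths in support.most_common():
        print(f"{term}: {paths} indirect links to migraine")
    # -> magnesium: 3 indirect links to migraine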
27
Gathering Evidence
[Diagram: four evidence chains, each linking migraine to magnesium through
one intermediate: stress, calcium channel blockers (CCB), spreading cortical
depression (SCD), and platelet aggregability (PA)]
28
Gathering Evidence
[Diagram: the chains merged into a single graph connecting stress, CCB, SCD,
and PA to both migraine and magnesium]
29
Swanson’s TDM
Two of his hypotheses have received some
experimental verification.
His technique
Only partially automated
Required medical expertise
Some researchers are pursuing this further.
30
How to find functions of genes?
Important problem in molecular biology
Have the genetic sequence
Don’t know what it does
But …
– Know which genes it coexpresses with
– Some of these have known function
So … Infer function based on function of coexpressed genes
– This idea suggested to me by Michael Walker and
others at Incyte Pharmaceuticals
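A minimal sketch of that inference, as a majority vote over the known functions of co-expressed genes; the gene names echo the next slide, but the function labels and co-expression list are invented:

    from collections import Counter

    # Which genes the mystery gene co-expresses with, and what is known
    # about them. All values are illustrative placeholders.
    coexpressed = ["PSA", "PAP", "kallikrein", "geneX"]
    known_function = {
        "PSA": "prostate tumor marker",
        "PAP": "prostate tumor marker",
        "kallikrein": "serine protease",
        # "geneX" has no known function
    }

    votes = Counter(known_function[g] for g in coexpressed if g in known_function)
    guess, n = votes.most_common(1)[0]
    print(f"Inferred function: {guess} ({n} of {sum(votes.values())} votes)")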
31
Gene Co-expression:
Role in the genetic pathway
[Diagram: alternative placements of unknown genes g and h in a pathway
involving kallikrein, PSA, and PAP]
Other possibilities as well
32
Make use of the literature
Look up what is known about the other genes.
Different articles in different collections
Look for commonalities
Similar topics indicated by Subject Descriptors
Similar words in titles and abstracts
adenocarcinoma, neoplasm, prostate, prostatic
neoplasms, tumor markers, antibodies ...
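A minimal sketch of the commonality check: intersect the subject descriptors attached to each gene's retrieved documents. The descriptor sets below are invented, loosely echoing the terms above:

    # Subject descriptors observed in the literature for each known gene.
    descriptors = {
        "PSA": {"prostatic neoplasms", "tumor markers", "antibodies"},
        "PAP": {"prostatic neoplasms", "tumor markers", "adenocarcinoma"},
        "kallikrein": {"prostatic neoplasms", "tumor markers",
                       "serine proteases"},
    }

    common = set.intersection(*descriptors.values())
    print(common)   # -> {'prostatic neoplasms', 'tumor markers'}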
33
Developing Strategies
Different strategies seem needed for different
situations
First: see what is known about
Kallikrein.
7341 documents. Too many
AND the result with “disease” category
– If result is non-empty, this might be an
interesting gene
Now get 803 documents
AND the result with PSA
– Get 11 documents. Better!
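A minimal sketch of this narrowing strategy as reusable code; `search` and the document attributes are hypothetical stand-ins for whatever retrieval system is available:

    def narrow(search, base_query, filters, target_size=20):
        """Start broad, then AND in filters until the set is scannable."""
        results = search(base_query)
        for keep in filters:
            if len(results) <= target_size:
                break
            results = [d for d in results if keep(d)]
        return results

    # Mirroring the slide (hypothetical document objects):
    # docs = narrow(search, "kallikrein",
    #               [lambda d: "disease" in d.categories,   # 7341 -> 803
    #                lambda d: "PSA" in d.text])            #  803 -> 11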
34
Developing Strategies
Look for commonalities among these documents
Manual scan through ~100 category labels
Would have been better if
– Automatically organized
– Intersections of “important” categories scanned for first
35
Try a new tack
Researcher uses knowledge of field to realize these
are related to prostate cancer and diagnostic tests
New tack: intersect search on all three known genes
Hope they all talk about diagnostics and
prostate cancer
Fortunately, 7 documents returned
Bingo! A relation to regulation of this
cancer
36
Formulate a Hypothesis
Hypothesis: mystery gene has to do with regulation
of expression of genes leading to prostate cancer
New tack: do some lab tests
See if mystery gene is similar in molecular structure
to the others
If so, it might do some of the same things they do
37
38
Strategies again
In hindsight, combining all three genes was a good
strategy.
Store this for later
Might not have worked
Need a suite of strategies
Build them up via experience and a good UI
39
Text Merging Example
Discovering Hypocritical Congresspersons
40
Discovering Hypocritical
Congresspersons
Feb 1, 1996
US House of Reps votes to pass Telecommunications
Reform Act
This contains the CDA (Communications Decency Act)
– Sought to criminalize posting to the Internet any material
deemed indecent and patently offensive, with no exception for
socially redeeming material.
Violators subject to fines of $250,000 and 5 years in
prison
Eventually struck down by courts
http://www.tbtf.com/resource/hypocrites.html
41
Discovering Hypocritical
Congresspersons
Sept 11, 1998
US House of Reps votes to place the Starr report online
The content would (most likely) have violated the CDA
365 people were members for both votes
284 members voted aye both times
– 185 (94%) Republicans voted aye both times
– 96 (57%) Democrats voted aye both times
http://www.tbtf.com/resource/hypocrites.html
42
http://www.tbtf.com/resource/hypocrites.html
43
http://www.tbtf.com/resource/hypocrites.html
44
How to find Hypocritical
Congresspersons?
This must have taken a lot of work
Hand cutting and pasting
Lots of picky details
– Some people voted on one but not the other bill
– Some people share the same name
  – Check for different county/state
  – Still messed up on “Bono”
Taking stats at the end on various attributes
– Which state
– Which party
Tools should help streamline, reuse results
The hardest part?
Knowing to compare these two sets of voting records in the first
place.
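A minimal sketch of the comparison once the records are cleaned up; the member tuples and votes below are invented, and real data needs exactly the name disambiguation described above:

    from collections import Counter

    # (name, state, party) -> vote, for each roll call. Invented sample.
    cda_vote = {("Smith", "TX", "R"): "aye", ("Jones", "CA", "D"): "aye",
                ("Lee", "NY", "D"): "nay"}
    starr_vote = {("Smith", "TX", "R"): "aye", ("Jones", "CA", "D"): "aye",
                  ("Doe", "FL", "R"): "aye"}

    both = set(cda_vote) & set(starr_vote)   # members present for both votes
    aye_both = [m for m in both
                if cda_vote[m] == "aye" and starr_vote[m] == "aye"]

    print(len(both), "members in both votes;", len(aye_both), "voted aye twice")
    print(Counter(party for _, _, party in aye_both))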
45
Summary
Text Data Mining:
Extracting heretofore undiscovered information from
large text collections
Information Access ≠ TDM
IA: locating already known information that is
currently of interest
Finding patterns across text is already done in CL
Tells us about the behavior of language
Helps build very useful tools!
46
Summary on Text Data Mining
The future: analyzing what the text is about
We don’t know how; text is tough!
Idea: bring the user into the loop.
Build up piecewise evidence to support hypotheses
Make use of partial domain models.
The Truth is Out There!
47