Text Tango: A New Text Data Mining Project
Marti A. Hearst
GUIR Meeting, Sept 17, 1998

Talk Outline
  What is Data Mining?
  What isn’t Text Data Mining?
  What is Text Data Mining
    Examples
  A proposal for a system for Text Data Mining

Marti A. Hearst UC Berkeley SIMS 1998

What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97)
  Fitting models to or determining patterns from very large datasets.
  A “regime” which enables people to interact effectively with massive data stores.
  Deriving new information from data.
    finding patterns across large datasets
    discovering heretofore unknown information

What is Data Mining?
  Potential point of confusion:
    The extracting ore from rock metaphor does not really apply to the practice of data mining
    If it did, then standard database queries would fit under the rubric of data mining
      Find all employee records in which the employee earns $300/month less than their manager
  In practice, DM refers to:
    finding patterns across large datasets
    discovering heretofore unknown information

DM Touchstone Applications (CACM 39 (11) Special Issue)
  Finding patterns across data sets:
    Reports on changes in retail sales
      to improve sales
    Patterns of sizes of TV audiences
      for marketing
    Patterns in NBA play
      to alter, and so improve, performance
    Deviations in standard phone calling behavior
      to detect fraud
      for marketing

What is Text Data Mining?
  People’s first thought:
    Make it easier to find things on the Web.
    This is information retrieval!
  The metaphor of extracting ore from rock does make sense for extracting documents of interest from a huge pile.
  But it does not reflect notions of DM in practice:
    finding patterns across large collections
    discovering heretofore unknown information

Text DM != IR
  Data Mining:
    Patterns, Nuggets, Exploratory Analysis
  Information Retrieval:
    Finding and ranking documents that match users’ information need
      ad hoc query
      filtering/standing query

Real Text DM
  What would finding a pattern across a large text collection really look like?

Bill Gates + MS-DOS in the Bible!
  (William Gates, agitator, leader)
  From: “The Internet Diary of the man who cracked the Bible Code”, Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil

Real Text DM
  The point:
    Discovering heretofore unknown information is not what we usually do with text.
    (If it weren’t known, it could not have been written by someone.)
  However:
    There are some interesting problems of this type!

Combining Data Types for Novel Tasks
  Text + Links to find “authority pages” (Kleinberg at Cornell, Page at Stanford)
  Usage + Time + Links to study evolution of web and information use (Pitkow et al. at PARC)

Ore-Filled Text Collections
  Congressional Voting Records
    Answer questions like: Who are the most hypocritical congresspeople?
  Medical Articles
    Create hypotheses about causes of rare diseases
    Create hypotheses about gene function
  Patent Law
    Answer questions like: Is government funding of research worthwhile?

How to find Hypocritical Congresspersons?
  This must have taken a lot of work
    Hand cutting and pasting
    Lots of picky details
      Some people voted on one but not the other bill
      Some people share the same name
        Check for different county/state
        Still messed up on “Bono”
    Taking stats at the end on various attributes
      Which state
      Which party

How to find functions of genes?
  Important problem in molecular biology
    Have the genetic sequence
    Don’t know what it does
    But …
      Know which genes it coexpresses with
      Some of these have known function
    So …
      Infer function based on function of co-expressed genes
  This is new work by Michael Walker and others at Incyte Pharmaceuticals

Gene Co-expression: Role in the genetic pathway
  [Figure: pathway diagrams with nodes Kall., PSA, PAP and unknown genes g?, h?]
  Other possibilities as well

Make use of the literature
  Look up what is known about the other genes.
  Different articles in different collections
  Look for commonalities
    Similar topics indicated by Subject Descriptors
    Similar words in titles and abstracts
      adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...

Developing Strategies
  Different strategies seem needed for different situations
  First: see what is known about Kallikrein.
    7341 documents. Too many
  AND the result with “disease” category
    If result is non-empty, this might be an interesting gene
    Now get 803 documents
  AND the result with PSA
    Get 11 documents. Better!

Developing Strategies
  Look for commonalities among these documents
    Manual scan through ~100 category labels
    Would have been better if
      Automatically organized
      Intersections of “important” categories scanned for first
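
The step-by-step narrowing described on the Developing Strategies slides (query, AND with a category, AND with another gene, watching the hit count shrink) can be sketched roughly as below. This is only an illustration: the `narrow` function, the toy index, and its document IDs are hypothetical, and the counts are invented rather than the 7341 / 803 / 11 reported in the talk.

```python
from typing import Dict, List, Optional, Set, Tuple


def narrow(index: Dict[str, Set[int]], terms: List[str]) -> List[Tuple[str, int]]:
    """AND the terms together left to right, recording the hit count after each step."""
    trace: List[Tuple[str, int]] = []
    current: Optional[Set[int]] = None
    for term in terms:
        hits = index.get(term, set())
        # First term seeds the result set; later terms intersect with it.
        current = set(hits) if current is None else current & hits
        trace.append((term, len(current)))
    return trace


# Hypothetical toy index: term -> IDs of documents tagged with that term.
toy_index = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 4, 7},
    "PSA": {3, 4, 8},
}

trace = narrow(toy_index, ["kallikrein", "disease", "PSA"])
for term, count in trace:
    print(f"after AND {term}: {count} documents")
```

Keeping the whole trace, rather than only the final set, mirrors the way the slides judge each step ("Too many", "Better!") before deciding what to AND in next.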

Try a new tack
  Researcher uses knowledge of field to realize these are related to prostate cancer and diagnostic tests
  New tack: intersect search on all three known genes
    Hope they all talk about diagnostics and prostate cancer
    Fortunately, 7 documents returned
    Bingo! A relation to regulation of this cancer

Formulate a Hypothesis
  Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer
  New tack: do some lab tests
    See if mystery gene is similar in molecular structure to the others
    If so, it might do some of the same things they do

Strategies again
  In hindsight, combining all three genes was a good strategy.
    Store this for later
  Might not have worked
  Need a suite of strategies
  Build them up via experience and a good UI

The System
  Doing the same query with slightly different values each time is time-consuming and tedious
  Same goes for cutting and pasting results
  IR systems don’t support varying queries like this very well.
  Each situation is a bit different
  Some automatic processing is needed in the background to eliminate/suggest hypotheses

The System
  Three main parts
    UI for building/using strategies
    Backend for interfacing with various databases and translating different formats
    Content analysis/machine learning for figuring out good hypotheses/throwing out bad ones

The UI part
  Need support for building strategies
  Lots of info lying around, so a nice option is ...
    Two-handed interface
    Big table display
  Mixed-initiative system
    Trade off between user-initiated hypothesis exploration and system-initiated suggestions
  Information visualization
    Another way to show lots of choices

[Figure: UI mockup with panels: Candidate Associations, Suggested Strategies, Current Retrieval Results]
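
One way to picture "store this for later" and "a suite of strategies" is a small registry of named, reusable query strategies. The sketch below is an assumption about how that might look, not the Text Tango design: the `strategies` registry, the `intersect_all` function, and the toy index are all hypothetical, and the gene names are simply the labels from the pathway figure.

```python
from functools import reduce
from typing import Callable, Dict, List, Set

Index = Dict[str, Set[int]]
Strategy = Callable[[Index, List[str]], Set[int]]


def intersect_all(index: Index, genes: List[str]) -> Set[int]:
    """Return the documents that mention every one of the given genes."""
    if not genes:
        return set()
    return reduce(lambda a, b: a & b, (index.get(g, set()) for g in genes))


# Hypothetical strategy store: name -> reusable strategy function.
strategies: Dict[str, Strategy] = {
    "intersect_all_known_genes": intersect_all,
}

# Hypothetical toy index: gene name -> IDs of documents mentioning it.
toy_index: Index = {
    "Kallikrein": {1, 2, 3, 5},
    "PSA": {2, 3, 4, 5},
    "PAP": {3, 5, 6},
}

shared = strategies["intersect_all_known_genes"](
    toy_index, ["Kallikrein", "PSA", "PAP"]
)
print(sorted(shared))
```

Because a strategy is just a named function over an index and a list of terms, the same stored strategy could be rerun on a different set of co-expressed genes without retyping each query by hand.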

Other applications
  Patent example
  Political example
  The truth’s out there!

Text Tango
  Just starting up now.
  Let me know if you’d like to work on it!