Download Database Application Design - CLAIR

November 30, 2000
Language and Information
Handout #5
(C) 2000, The University of Michigan

Course Information
• Instructor: Dragomir R. Radev ([email protected])
• Office: 305A, West Hall
• Phone: (734) 615-5225
• Office hours: TTh 3-4
• Course page: http://www.si.umich.edu/~radev/760
• Class meets on Thursdays, 5-8 PM in 311 West Hall

Clustering (Cont’d)

Using similarity in visualization
• Dendrograms (see Figure 14.1 of M&S, page 496)

Types of clustering
• Hierarchical: agglomerative, divisive
• Soft & hard
• Similarity functions:
  – Single link: most similar members
  – Complete link: least similar members
  – Group-average: average similarity
• Applications:
  – improving language models
  – etc.

HITS-type algorithms

Hyperlinks and resource communities
• Jon Kleinberg (Almaden, Cornell)
• authoritative sources
• www.harvard.edu --> Harvard
• conferring authority via links
• global properties of authority pages

Hubs and authorities
[Figure: hubs, authorities, unrelated pages]

Authorities
• Java: www.gamelan.com, java.sun.com, sunsite.unc.edu/javafaq/javafaq.html
• Censorship: www.eff.org, www.eff.org/blueribbon.html, www.aclu.org
• Search engines: www.yahoo.com, www.excite.com, www.mckinley.com, www.lycos.com

Related pages
• www.honda.com: www.ford.com, www.toyota.com, www.yahoo.com
• www.nyse.com: www.amex.com, www.liffe.com, update.wsj.com

Collocations
• Idioms
• Free word combinations
• Know a word by the company that it keeps (Firth)
• Common use
• No general syntactic or semantic rules
• Important for non-native speakers

Examples
Idioms                Collocations            Free-word combinations
To kick the bucket    To trade actively       To take the bus
Dead end              Table of contents       The end of the road
To catch up           Orthogonal projection   To buy a house

Uses
• Disambiguation (e.g., “bank”/“loan”, “river”)
• Translation
• Generation

Properties
• Arbitrariness
• Language- and dialect-specific
• Common in technical language
• Recurrent in context (see Smadja 83)

Arbitrariness
• Make an effort vs. *make an exertion
• Running commentary vs. *running discussion
• Commit treason vs. *commit treachery

Cross-lingual properties
• Régler la circulation = direct traffic
• Russian, German, Serbo-Croatian: direct translation is used
• AE: set the table, make a decision
• BE: lay the table, take a decision
• “semer le désarroi” - “to sow disarray” - “to wreak havoc”

Types of collocations
• Grammatical: come to, put on; afraid that, fond of, by accident, witness to
• Semantic (only certain synonyms)
• Flexible: find/discover/notice by chance

Base/collocator pairs
Base        Collocator    Example
Noun        verb          Set the table
Noun        adjective     Warm greetings
Verb        adverb        Struggle desperately
Adjective   adverb        Sound asleep
Verb        preposition   Put on

Extracting collocations
• Mutual information: $I(x;y) = \log_2 \frac{P(x,y)}{P(x)\,P(y)}$
• What if I(x;y) = 0?
• What if I(x;y) < 0?
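The two questions on the mutual information slide can be explored numerically. The sketch below is only an illustration, not part of the original handout: it assumes probabilities are estimated as simple relative frequencies from counts, and the function name `mutual_information` and the toy counts are hypothetical.

```python
import math


def mutual_information(c_xy, c_x, c_y, n):
    """I(x;y) = log2( P(x,y) / (P(x) P(y)) ), with probabilities
    estimated as relative frequencies over n observations.

    c_xy: number of times x and y occur together (assumed input)
    c_x, c_y: number of times x and y occur at all (assumed inputs)
    """
    if c_xy == 0:
        # log2(0) is undefined; treat "never seen together" as -infinity.
        return float("-inf")
    p_xy = c_xy / n
    p_x = c_x / n
    p_y = c_y / n
    return math.log2(p_xy / (p_x * p_y))


# I > 0: x and y co-occur more often than chance predicts (collocation candidate).
print(mutual_information(2, 2, 2, 8))
# I = 0: P(x,y) = P(x)P(y), so x and y look independent.
print(mutual_information(1, 2, 2, 4))
# I < 0: x and y co-occur less often than chance predicts.
print(mutual_information(1, 4, 4, 8))
```

Under these assumptions, a score of zero means the pair behaves as if the two words were independent, and a negative score means the words appear together less often than independence would predict. With very small counts the estimate is unstable, so a minimum frequency threshold is usually a sensible addition.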
Yule’s coefficient
A - frequency of lemma pairs involving both Li and Lj
B - frequency of pairs involving Li only
C - frequency of pairs involving Lj only
D - frequency of pairs involving neither
$\mathrm{YUL} = \frac{AD - BC}{AD + BC}$, with $-1 \le \mathrm{YUL} \le 1$

Specific mutual information
• Used in extracting bilingual collocations: $I(e,f) = \frac{p(e,f)}{p(e)\,p(f)}$
• p(e,f) - probability of finding both e and f in aligned sentences
• p(e), p(f) - probabilities of finding the word in one of the languages

Example from the Hansard corpus (Brown, Lai, and Mercer)
French word   Mutual information
sein          5.63
bureau        5.63
trudeau       5.34
premier       5.25
résidence     5.12
intention     4.57
no            4.53
session       4.34

Flexible and rigid collocations
• Example (from Smadja): “free” and “trade”
Total   p-5   p-4   p-3   p-2   p-1    p+1   p+2   p+3   p+4   p+5
8031    7     6     13    5     7918   0     12    20    26    24

Xtract (Smadja)
• The Dow Jones Industrial Average
• The NYSE’s composite index of all its listed common stocks fell *NUMBER* to *NUMBER*

Translating Collocations
• Brush up a lesson, repasser une leçon
• Bring about/осуществлять
• Hansards: late spring: fin du printemps, Atlantic Canada Opportunities Agency, Agence de promotion économique du Canada atlantique

The eSseNSe system
[Figure: Offline processing / Online processing. Labels: user query, cached documents, cluster 1 … cluster n, hitlist, summary 1 … summary n, summary]

Sample summary (each sentence is followed by its score where one was extracted)
• The idea behind data mining then is the non-trivial process of identifying valid novel potentially useful and ultimately understandable patterns in data 18 2
• The term knowledge discovery in databases KDD was formalized in 1989 in reference to the general concept of being broad and 'high level' in the pursuit of seeking knowledge from data [494.92]
• The term data mining is then this high-level application techniques / tools used to present and analyze data for decision makers [509.11]
• This term data mining has been used by statisticians data analyst and the MIS management information systems community whereas KDD has been mostly used by artificial intelligence and machine learning researchers [487.92]
• These are : -the untapped value in large databases consolidation of database records tending towards a single customer view concept of an information or data warehouse from the consolidation of databases dramatic drop in the cost/performance ratio of hardware systems - for data storage and processing [576.60]
• Intense competition in an increasing saturated marketplace the ability to custom manufacture market and advertise to small market segments and individuals 4 and the market for data mining products is estimated at about 500 million in early 1994 12
• Data mining technologies are characterized by intensive computations on large volumes of data [486.92]
• Data mining versus traditional database queries Traditional database queries contrasts with data mining since these are typified by the simple question such as what were the sales of orange juice in January 1995 for the Boston area [520.53]
• Data mining on the other hand through the use of specific algorithms or search engines attempts to source out discernable patterns and trends in the data and infers rules from these patterns [500.80]

Clusters
Cluster ID   Number of URLs
00085        63
00044        167
00086        115
00657        135
00766        6
00127        4
Centroid words [as extracted; the assignment of words to clusters is not recoverable]: retrieval information university ir text research systems science semantic evaluation document pp library; vol pp no retrieval proceedings conference user ir information query; information retrieval systems university workshop papers research library science submission chair text edu conference computer applications language processing data libraries; web internet site design hosting search online sites commerce page business meta your information content server marketing you electronic pages; user interface users ariadne hypertext search computer interfaces we designer system laurel information nelson; knowledge collaboration representation interaction process systems neural systems computational data evolutionary intelligent networks learning artificial knowledge intelligence

mining (84.54), data (64.13), knowledge (14.25), discovery (11.98), advertised (11.20), databases (9.69), information (6.98), research (6.96), analysis (6.95), text (6.05), patterns (5.30), algorithms (4.39).
• Text Mining is a new and exciting research area that tries to solve the information overload problem by using techniques from data mining machine learning information retrieval natural language understanding case-based reasoning statistics and knowledge management to help people gain insight into large quantities of semi-structured or unstructured text [183.73]
• Current Projects - Aims The Data Mining Program has two projects: DMITL - Data Mining in the Large Conducting practical case studies for clients involving the analysis of large and complex data sets ParAlg – Parallel Algorithms The main computing facility for these projects consists of a secure multiprocessor Sun E4000 [156.56]
• The DMITL project aims to develop knowledge and techniques relevant to the data mining of large and complex datasets using high performance computers [163.78]
• Megaputer develops software and solutions for data mining text analysis and knowledge discovery in databases [188.96]

Cross-language information access
[Figure, built up over six slides: English and Chinese sides with nodes labeled CE, SE, QE, DE, QC, SC, CC, DC, and SC->E]

Objectives
• Produce summaries using multiple algorithms
• Evaluate summarization and translation separately
• Intrinsic and extrinsic language-independent evaluation metrics
• Establish correlation between evaluation metrics
• Build parallel C-E doc+summary 9K docs (Hong Kong news)

Participants
• Full time
  – K.-L. Kwok, Queens College
  – Dragomir Radev, U. Michigan
  – Wai Lam, Chinese University of HK
  – Simone Teufel, Columbia
• “Consultants”
  – Chin-Yew Lin, ISI
  – Tomek Strzalkowski, Albany
  – Jade Goldstein, CMU
  – Jian-Yun Nie, U. Montréal
• Supporters
  – TIDES roadmap group: Ed Hovy, Daniel Marcu, Kathy McKeown

Techniques and parameters
• Summarization:
  – position, TF*IDF, centroids, longest common subsequence, keywords
• Evaluation:
  – intrinsic: percent agreement, relative utility, precision/recall
  – extrinsic: document rank, question answering
• Length of documents/summaries

The parallel corpus
• English and Chinese (Hong Kong News)
• Already there:
  – 9000 documents and their translations
  – list of 300 queries in English and their translations
• We will create before the workshop:
  – document relevance judgements
    • 50 queries, 5 hrs/query, $10/hr -> $2,500
  – sentence relevance judgements
    • 4 doc/hr, need 4000 rel. judgements -> $10,000
  – optional: manual abstracts

Creating the judgements
• For each query
  – submit to IR engine
  – discard unless it has 5-20 hits
  – get exhaustive document relevance judgements
  – consider top 100 documents
• Get sentence relevance judgements for
  – all relevant judgements
  – top 50 documents (including irrelevant ones!)

Experiments
• Experiment 1: (Validation) Compare preservation of ranking with other measures: judgement overlap, relative utility
• Experiment 2 & Experiment 3:
  – use with preservation of ranking
  – only possible due to new, parallel experimental design
  – factor out effects of
    • query translation
    • summarization
    • monolingual IR
• Baseline:
  – leading sentence summaries vs. documents
  – other summarization methods vs. documents
  – (ideal: manual summaries vs. documents)

Experiments
• Monolingual experiments
  – Effect of summarization
    • English Query -> English Doc (ranks)
    • English Query -> English Summary (ranks)
    • Chinese Query -> Chinese Doc (ranks)
    • Chinese Query -> Chinese Summary (ranks)
  – Baseline:
    • leading sentence summary vs. document
    • ideal: manual summary vs. document
  – Effect of language on IR
    • English Query -> English Doc
    • Chinese Query -> Chinese Doc
• Experiment 2: crosslingual
  – Effect of query translation
    • English Query -> English Doc
    • English Query -> Chinese Query -> Chinese Doc

Timeline
• Pre-workshop: build corpus
• Sentence segmenter, Chinese tokenizer, machine translation, IR system, eSseNSe summarizer
• Workshop: system integration, build toolkit, summarization, evaluation, correlation, system refinement, final evaluation

Workshop (W1 to W6; tasks listed in order, the assignment of tasks to W1-W6 is not recoverable)
• Set up experimental testbed
• Evaluation plan laid out
• Selection of training/test sub-corpora
• Alpha version of CLIA system tested on a small number of queries
• Baseline experiment
• Run experiment one
• Run experiment two
• Compute query translation quality
• Run experiment three
• Feedback from first three experiments
• System improvements
• Improved CLIA system ready
• Evaluation using unseen test data
• Draft of final report
• Additional experiments
• Wrap-up
• Final version of CLIA system released

Merit criteria
• Novelty: never done before, integration of CLIR and summarization
• Collaboration: participants wouldn’t work together otherwise
• Scientific merit: much-needed evaluation metrics, techniques for multi-document, multilingual summarization, incorporate utility, redundancy, subsumption
• Feasibility: uses existing work, specific plan for new work
• Community building: corpora, evaluation techniques, and software (CLIR, evaluation, and summarization), builds on prior evaluations (TDT, TREC, SUMMAC, DUC)
• Funder interest: multilingual systems, large amounts of data

What was dropped
• Interactive clustering of documents
• Evaluation of the quality of translated summaries
• Document translation
• Effects of document genre, length
• Evolving summaries

More…

What we didn’t talk about
• Hidden Markov models
• Part of speech tagging
• Probabilistic parsing
• Information retrieval
• Text classification
• etc.

THE END ?
http://perun.si.umich.edu/~radev/760/job/
