Download Database Application Design - CLAIR

Document related concepts
no text concepts found
Transcript
November 30, 2000
Language and Information
Handout #5
(C) 2000, The University
of Michigan
1
Course Information
•
•
•
•
•
•
Instructor: Dragomir R. Radev ([email protected])
Office: 305A, West Hall
Phone: (734) 615-5225
Office hours: TTh 3-4
Course page: http://www.si.umich.edu/~radev/760
Class meets on Thursdays, 5-8 PM in 311 West Hall
(C) 2000, The University
of Michigan
2
Clustering (Cont’d)
(C) 2000, The University
of Michigan
3
Using similarity in visualization
• Dendrograms (see Figure 14.1 of M&S,
page 496)
(C) 2000, The University
of Michigan
4
Types of clustering
• Hierarchical: agglomerative, divisive
• Soft & hard
• Similarity functions:
– Single link: most similar members
– Complete link: least similar members
– Group-average: average similarity
• Applications:
– improving language models
– etc.
(C) 2000, The University
of Michigan
5
HITS-type algorithms
(C) 2000, The University
of Michigan
6
Hyperlinks and resource
communities
•
•
•
•
•
Jon Kleinberg (Almaden, Cornell)
authoritative sources
www.harvard.edu --> Harvard
conferring authority via links
global properties of authority pages
(C) 2000, The University
of Michigan
7
Hubs and authorities
hubs
authorities
(C) 2000, The University
of Michigan
unrelated pages
8
Authorities
• Java: www.gamelan.com, java.sun.com,
sunsite.unc.edu/javafaq/javafaq.html
• Censorship: www.eff.org,
www.eff.org/blueribbon.html, www.aclu.org
• Search engines: www.yahoo.com,
www.excite.com, www.mckinley.com,
www.lycos.com
(C) 2000, The University
of Michigan
9
Related pages
• www.honda.com: www.ford.com,
www.toyota.com, www.yahoo.com
• www.nyse.com: www.amex.com,
www.liffe.com, update.wsj.com
(C) 2000, The University
of Michigan
10
Collocations
(C) 2000, The University
of Michigan
11
Collocations
• Idioms
• Free word combinations
• Know a word by the company that it keeps
(Firth)
• Common use
• No general syntactic or semantic rules
• Important for non-native speakers
(C) 2000, The University
of Michigan
12
Examples
Idioms
Collocations
Free-word combinations
To kick the bucket
Dead end
To catch up
To trade actively
Table of contents
Orthogonal projection
To take the bus
The end of the road
To buy a house
(C) 2000, The University
of Michigan
13
Uses
• Disambiguation (e.g, “bank”/”loan”,”river”)
• Translation
• Generation
(C) 2000, The University
of Michigan
14
Properties
•
•
•
•
•
Arbitrariness
Language- and dialect-specific
Common in technical language
Recurrent in context
(see Smadja 83)
(C) 2000, The University
of Michigan
15
Arbitrariness
• Make an effort vs. *make an exertion
• Running commentary vs. *running
discussion
• Commit treason vs. *commit treachery
(C) 2000, The University
of Michigan
16
Cross-lingual properties
• Régler la circulation = direct traffic
• Russian, German, Serbo-Croatian: direct
translation is used
• AE: set the table, make a decision
• BE: lay the table, take a decision
• “semer le désarroi” - “to sow disarray” - “to
wreak havoc”
(C) 2000, The University
of Michigan
17
Types of collocations
• Grammatical: come to, put on; afraid that,
fond of, by accident, witness to
• Semantic (only certain synonyms)
• Flexible: find/discover/notice by chance
(C) 2000, The University
of Michigan
18
Base/collocator pairs
Base
Collocator
Example
Noun
Noun
Verb
Adjective
Verb
verb
adjective
adverb
adverb
preposition
Set the table
Warm greetings
Struggle desperately
Sound asleep
Put on
(C) 2000, The University
of Michigan
19
Extracting collocations
• Mutual information
I (x;y) = log2
P(x,y)
P(x)P(y)
• What if I(x;y) = 0?
• What if I(x;y) < 0?
(C) 2000, The University
of Michigan
20
Yule’s coefficient
A - frequency of lemma pairs involving both
Li and Lj
B - frequency of pairs involving Li only
C - frequency of pairs involving Lk only
D - frequency of pairs involving neither
YUL =
(C) 2000, The University
of Michigan
AD - BC
AD + BC
-1  YUL  1
21
Specific mutual information
• Used in extracting bilingual collocations
p (e,f)
I (e,f) =
p(e) p(f)
• p(e,f) - probability of finding both e and f in
aligned sentences
• p(e), p(f) - probabilities of finding the word in one
of the languages
(C) 2000, The University
of Michigan
22
Example from the Hansard
corpus (Brown, Lai, and Mercer)
French word Mutual information
sein
5.63
bureau
5.63
trudeau
5.34
premier
5.25
(C) 2000, The University
of Michigan
résidence
intention
5.12
4.57
no
session
4.53
4.34
23
Flexible and rigid collocations
• Example (from Smadja): “free” and “trade”
Total p-5 p-4 p-3 p-2
8031
7
(C) 2000, The University
of Michigan
6
13
5
p-1
7918
p+1 p+2 p+3 p+4 p+5
0
12
20
26
24
24
Xtract (Smadja)
• The Dow Jones Industrial Average
• The NYSE’s composite index of all its
listed common stocks fell *NUMBER* to
*NUMBER*
(C) 2000, The University
of Michigan
25
Translating Collocations
• Brush up a lesson, repasser une leçon
• Bring about/осуществлять
• Hansards: late spring: fin du printemps,
Atlantic Canada Opportunities Agency,
Agence de promotion économique du
Canada atlantique
(C) 2000, The University
of Michigan
26
The eSseNSe system
(C) 2000, The University
of Michigan
27
Offline processing
Online processing
user query
cached documents
cluster 1
cluster 2
cluster 3
…
…
cluster n
hitlist
…
summary 1
summary 2
(C) 2000, The University
of Michigan
summary 3
summary n
summary
28
(C) 2000, The University
of Michigan
29
(C) 2000, The University
of Michigan
30
(C) 2000, The University
of Michigan
31
(C) 2000, The University
of Michigan
32
(C) 2000, The University
of Michigan
33
Sample summary
Sentence
Score
The idea behind data mining then is the non-trivial process of identifying valid novel potentially useful and ultimately
understandable patterns in data 18 2 The term knowledge discovery in databases KDD was formalized in 1989 in reference to the
general concept of being broad and 'high level' in the pursuit of seeking knowledge from data
494.92
The term data mining is then this high-level application techniques / tools used to present and analyze data for decision makers
509.11
This term data mining has been used by statisticians data analyst and the MIS management information systems community whereas
KDD has been mostly used by artificial intelligence and machine learning researchers
487.92
These are : -the untapped value in large databases consolidation of database records tending towards a single customer view concept
of an information or data warehouse from the consolidation of databases dramatic drop in the cost/performance ratio of hardware
systems - for data storage and processing
576.60
Intense competition in an increasing saturated marketplace the ability to custom manufacture market and advertise to small market
segments and individuals 4 and the market for data mining products is estimated at about 500 million in early 1994 12 Data mining
technologies are characterized by intensive computations on large volumes of data
486.92
Data mining versus traditional database queries Traditional database queries contrasts with data mining since these are typified by
the simple question such as what were the sales of orange juice in January 1995 for the Boston area
520.53
Data mining on the other hand through the use of specific algorithms or search engines attempts to source out discernable patterns
and trends in the data and infers rules from these patterns
500.80
(C) 2000, The University
of Michigan
34
Cluster
ID
Number
of URLs
Centroi d words
00085
63
00044
167
00086
115
retrieval in formation university ir text research systems science semantic
evaluation document pp library
00657
135
vol pp no retrieval proceedings conference user ir informat ion query
00766
6
00127
4
(C) 2000, The University
of Michigan
informat ion retrieval systems university workshop papers research library science
submission chair text edu conference computer applications language processing
data libraries
web internet site design hosting search online sites commerce page business meta
your information content server marketing you electronic pages
user interface users ariadne hypertext search computer interfaces we designer
system laurel informat ion nelson knowledge collaboration representation
interaction process systems
neural systems computational data evolutionary intelligent networks learning
artificial knowledge intelligence
35
(C) 2000, The University
of Michigan
36
(C) 2000, The University
of Michigan
37
(C) 2000, The University
of Michigan
38
(C) 2000, The University
of Michigan
39
(C) 2000, The University
of Michigan
40
mining (84.54), data (64.13), knowledge (14.25), discovery (11.98), advertised (11.20), databases
(9.69), information (6.98), research (6.96), analysis (6.95), text (6.05), patterns (5.30), algorithms
(4.39).
Text Min ing is a new and excit ing research area that tries to solve the informat ion overload
problem by using techniques from data mining mach ine learning info rmation retrieval naturallanguage understanding case-based reasoning statistics and knowledge management to help
people gain insight into large quantities of semi-structured or unstructured text
<BR> Current Projects - A ims The Data M ining Program has two projects: DM ITL - Data
Mining in the Large Conducting practical case studies for clients involving the analysis of
large and co mplex data sets ParAlg – Parallel Algorith ms The main co mputing facility for
these projects consists of a secure mu ltiprocessor Sun E4000
The DMITL project aims to develop knowledge and techniques relevant to the data mining of
large and co mplex datasets using high performance co mputers
Megaputer develops software and solutions for data mining text analysis and knowledge
discovery in databases
(C) 2000, The University
of Michigan
183.73
156.56
163.78
188.96
41
Cross-language information
access
(C) 2000, The University
of Michigan
42
CE
SE
QE
DE
SE
English
(C) 2000, The University
of Michigan
43
CE
SE
QC
QE
DE
SE
English
(C) 2000, The University
of Michigan
Chinese
44
CE
SE
SC
CC
QC
QE
DE
SE
English
(C) 2000, The University
of Michigan
SC
DC
Chinese
45
CE
SE
SC
CC
QC
QE
DE
SE
English
(C) 2000, The University
of Michigan
SC
DC
Chinese
46
CE
SE
SC
CC
QC
QE
DE
SE
English
(C) 2000, The University
of Michigan
SC
DC
Chinese
47
CE
SE
SC
CC
QC
QE
DE
SE
English
(C) 2000, The University
of Michigan
SC
SC->E
DC
Chinese
48
Objectives
• Produce summaries using multiple algorithms
• evaluate summarization and translation separately
• intrinsic and extrinsic language-independent
evaluation metrics
• establish correlation between evaluation metrics
• build parallel C-E doc+summary 9K docs (Hong
Kong news)
(C) 2000, The University
of Michigan
49
Participants
• Full time
–
–
–
–
K.-L. Kwok, Queens College
Dragomir Radev, U. Michigan
Wai Lam, Chinese University of HK
Simone Teufel, Columbia
• “Consultants”
–
–
–
–
Chin-Yew Lin, ISI
Tomek Strzalkowski, Albany
Jade Goldstein, CMU
Jian-Yun Nie, U. Montréal
• Supporters
– TIDES roadmap group: Ed Hovy, Daniel Marcu, Kathy McKeown
(C) 2000, The University
of Michigan
50
Techniques and parameters
• Summarization:
– position, TF*IDF, centroids, largest common
subsequence, keywords
• Evaluation:
– intrinsic: percent agreement, relative utility,
precision/recall
– extrinsic: document rank, question answering
• Length of documents/summaries
(C) 2000, The University
of Michigan
51
The parallel corpus
• English and Chinese (Hong Kong News)
• Already there:
– 9000 documents and their translations
– list of 300 queries in English and their translations
• We will create before the workshop:
– document relevance judgements
• 50 queries, 5 hrs/query, $10/hr -> $2,500
– sentence relevance judgements
• 4 doc/hr, need 4000 rel. judgements -> $10,000
– optional: manual abstracts
(C) 2000, The University
of Michigan
52
Creating the judgements
• For each query
–
–
–
–
submit to IR engine
discard unless it has 5-20 hits
get exhaustive document relevance judgements
consider top 100 documents
• get sentence relevance judgements for
– all relevant judgements
– top 50 documents (including irrelevant ones!)
(C) 2000, The University
of Michigan
53
Experiments
• Experiment 1: (Validation)
Compare preservation of ranking with other measures: judgement overlap,
relative utility
• Experiment 2 & Experiment 3:
– use with preservation of ranking
– only possible due to new, parallel experimental design
– factor out effects of
• query translation
• summarization
• monolingual IR
• Baseline:
– leading sentence summaries vs. documents
– other summarization methods vs. documents
– (ideal: manual summaries vs. documents)
(C) 2000, The University
of Michigan
54
Experiments
• Monolingual experiments
– Effect of summarization
• English Query -> English Doc (ranks)
• English Query-> English Summary (ranks)
• Chinese Query -> Chinese Doc (ranks)
• Chinese Query -> Chinese Summary (ranks)
– Baseline:
• leading sentence summary vs. document
• ideal: manual summary vs. document
– Effect of language on IR
• English Query -> English Doc
• Chinese Query-> Chinese Doc
• Experiment 2: crosslingual
– Effect of query translation
• English Query -> English Doc
• English Query -> Chinese Query -> Chinese Doc
(C) 2000, The University
of Michigan
55
Timeline
• Pre-workshop: build corpus
• Sentence segmenter, Chinese tokenizer, machine
translation, IR system, eSseNSe summarizer
• Workshop: system integration, build toolkit,
summarization, evaluation, correlation, system
refinement, final evaluation
(C) 2000, The University
of Michigan
56
Workshop
W1
W2
W3
W4
W5
W6
Set up experimental testbed
Evaluation plan laid out
Selection of training/test sub-corpora
Alpha version of CLIA system tested on a small number of queries
Baseline experiment
Run experiment one
Run experiment two
Compute query translation quality
Run experiment three
Feedback from first three experiments
System improvements
Improved CLIA system ready
Evaluation using unseen test data
Draft of final report
Additional experiments
Wrap-up
Final version of CLIA system released
(C) 2000, The University
of Michigan
57
Merit criteria
• Novelty: never done before, integration of CLIR and
summarization
(C) 2000, The University
of Michigan
58
Merit criteria
• Novelty: never done before, integration of CLIR and
summarization
• Collaboration: participants wouldn’t work together
otherwise
(C) 2000, The University
of Michigan
59
Merit criteria
• Novelty: never done before, integration of CLIR and
summarization
• Collaboration: participants wouldn’t work together otherwise
• Scientific merit: much-needed evaluation metrics,
techniques for multi-document, multilingual
summarization, incorporate utility, redundancy,
subsumption
(C) 2000, The University
of Michigan
60
Merit criteria
• Novelty: never done before, integration of CLIR and
summarization
• Collaboration: participants wouldn’t work together otherwise
• Scientific merit: much-needed evaluation metrics, techniques
for multi-document, multilingual summarization, incorporate
utility, redundancy, subsumption
• Feasibility: uses existing work, specific plan for new work
(C) 2000, The University
of Michigan
61
Merit criteria
• Novelty: never done before, integration of CLIR and
summarization
• Collaboration: participants wouldn’t work together otherwise
• Scientific merit: much-needed evaluation metrics, techniques
for multi-document, multilingual summarization, incorporate
utility, redundancy, subsumption
• Feasibility: uses existing work, specific plan for new work
• Community building: corpora, evaluation techniques, and
software (CLIR, evaluation, and summarization), builds
on prior evaluations (TDT, TREC, SUMMAC, DUC)
(C) 2000, The University
of Michigan
62
Merit criteria
• Novelty: never done before, integration of CLIR and
summarization
• Collaboration: participants wouldn’t work together otherwise
• Scientific merit: much-needed evaluation metrics, techniques
for multi-document, multilingual summarization, incorporate
utility, redundancy, subsumption
• Feasibility: uses existing work, specific plan for new work
• Community building: corpora, evaluation techniques, and
software (CLIR, evaluation, and summarization), builds on
prior evaluations (TDT, TREC, SUMMAC, DUC)
• Funder interest: multilingual systems, large amounts of
data
(C) 2000, The University
of Michigan
63
Merit criteria
• Novelty: never done before, integration of CLIR and
summarization
• Collaboration: participants wouldn’t work together otherwise
• Scientific merit: much-needed evaluation metrics, techniques
for multi-document, multilingual summarization, incorporate
utility, redundancy, subsumption
• Feasibility: uses existing work, specific plan for new work
• Community building: corpora, evaluation techniques, and
software (CLIR, evaluation, and summarization), builds on
prior evaluations (TDT, TREC, SUMMAC, DUC)
• Funder interest: multilingual systems, large amounts of data
(C) 2000, The University
of Michigan
64
What was dropped
• Interactive clustering of documents
• Evaluation of the quality of translated
summaries
• Document translation
• Effects of document genre, length
• Evolving summaries
(C) 2000, The University
of Michigan
65
More…
(C) 2000, The University
of Michigan
66
What we didn’t talk about
•
•
•
•
•
•
Hidden Markov models
Part of speech tagging
Probabilistic parsing
Information retrieval
Text classification
etc.
(C) 2000, The University
of Michigan
67
THE END ?
http://perun.si.umich.edu/~radev/760/job/
(C) 2000, The University
of Michigan
68