Representing and Accessing [Textual] Digital Information (COMS/INFO 630), Spring 2006
4/6/06: Lecture 18 handout
I. From Table 1 of Hofmann (1999): Four learned topic models (k = 128) drawn from the TDT-1 corpus
(15,862 documents of newswire text and transcripts), each represented by the words with highest probability
with respect to that model.

1. plane, airport, crash, flight, safety, aircraft, air, passenger, ...
2. space, shuttle, mission, astronauts, launch, station, crew, ...
3. home, family, like, love, kids, mother, life, happy, friends, ...
4. film, movie, music, new, best, hollywood, love, actor, entertainment, ...
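For concreteness, here is a minimal sketch of how such a table can be read off a fitted model: given a
topic-word probability matrix P(w | z) of the kind PLSI's EM procedure estimates, each topic is summarized
by its highest-probability words. The matrix and vocabulary below are invented toy stand-ins, not the
actual TDT-1 fit.

import numpy as np

def top_words_per_topic(p_w_given_z, vocab, n=8):
    """For each topic z, list the n words with highest P(w | z)."""
    tops = []
    for z in range(p_w_given_z.shape[0]):
        best = np.argsort(p_w_given_z[z])[::-1][:n]  # indices, descending prob.
        tops.append([vocab[w] for w in best])
    return tops

# Toy stand-ins (NOT the TDT-1 model): two topics over six words;
# each row sums to 1, i.e., each row is a distribution P(w | z).
vocab = ["plane", "crash", "shuttle", "launch", "film", "actor"]
p_w_given_z = np.array([[0.40, 0.25, 0.15, 0.10, 0.06, 0.04],
                        [0.04, 0.06, 0.40, 0.30, 0.12, 0.08]])
print(top_words_per_topic(p_w_given_z, vocab, n=3))
# -> [['plane', 'crash', 'shuttle'], ['shuttle', 'launch', 'film']]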
II. From Figure 2 of Pereira, Tishby, and Lee (1993): A portion of a cluster hierarchy learned from
Grolier’s Encyclopedia (ten million words). Each node in the tree represents a cluster by listing
the five words most closely associated with that cluster; cluster splits are represented by parent-child
relationships in the tree.

[Tree layout not reproduced; the seven nodes shown, one per line:]
number, diversity, structure, concentration, level
speed, level, velocity, size, extent
spread, zenith, depth, velocity, height
number, concentration, strength, ratio, rate
change, failure, variation, structure, shift
pollution, failure, increase, infection, loss
structure, relationship, aspect, system, adaption
Note: In these experiments, the 1000 most common nouns played the role of documents, and verbs that
take those nouns as direct objects played the role of terms. Thus, the noun “gun” could be characterized
by its appearing in the verb/direct-object relationship with the verbs “fire”, “polish”, “buy”, and “ban”;
whereas the noun “banana” could be characterized by its appearing in the verb/direct-object relationship
with the verbs “peel”, “eat”, “buy”, and, conceivably, “ban”.
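A minimal sketch of this representation, with invented (verb, direct-object) pairs standing in for the
Grolier’s data: each noun becomes a distribution over the verbs that take it as an object, and two nouns
can then be compared by (smoothed) KL divergence, the dissimilarity on which Pereira et al.’s clustering
is built.

import numpy as np

# Invented (verb, noun) direct-object pairs, echoing the examples above.
pairs = [("fire", "gun"), ("polish", "gun"), ("buy", "gun"), ("ban", "gun"),
         ("peel", "banana"), ("eat", "banana"), ("buy", "banana"), ("ban", "banana")]

verbs = sorted({v for v, _ in pairs})
nouns = sorted({n for _, n in pairs})
v_idx = {v: i for i, v in enumerate(verbs)}
n_idx = {n: i for i, n in enumerate(nouns)}

# Count matrix: rows = nouns (the "documents"), columns = verbs (the "terms").
counts = np.zeros((len(nouns), len(verbs)))
for v, n in pairs:
    counts[n_idx[n], v_idx[v]] += 1

# Normalize rows to get p(verb | noun), the distributions being clustered.
p_v_given_n = counts / counts.sum(axis=1, keepdims=True)

def kl(p, q, eps=1e-10):
    """KL divergence D(p || q), smoothed so that log(0) never occurs."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

print(kl(p_v_given_n[n_idx["gun"]], p_v_given_n[n_idx["banana"]]))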
III. Evaluation note: Inferred structure(s) should be shown to provide improvement in (more or less)
practical applications, rather than only being “eyeballed” for correspondence to intuitions regarding
language semantics. Recall, after all, that the singular vectors may not be “interpretable”, yet SVD-based
subspace projections often are very effective.
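As an illustration of that last point, a small sketch (random stand-in data, not any corpus): documents
and a query are projected into a k-dimensional SVD subspace and ranked by cosine similarity there, and
retrieval quality in that subspace can be evaluated empirically even though the individual singular
vectors resist linguistic interpretation.

import numpy as np

# Random stand-in term-document matrix: 50 terms x 20 documents.
rng = np.random.default_rng(0)
A = rng.random((50, 20))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5                              # target subspace dimension
docs_k = np.diag(s[:k]) @ Vt[:k]   # documents in the k-dim latent space

# Fold a query (a term-count vector) into the same subspace and rank
# documents by cosine similarity there.
q = rng.random(50)
q_k = U[:, :k].T @ q
sims = (docs_k.T @ q_k) / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k) + 1e-12)
print(np.argsort(sims)[::-1][:5])  # indices of the top-5 documents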
References
Hofmann, Thomas. 1999. Probabilistic latent semantic indexing. In Proceedings of SIGIR, pages 50–57.
Pereira, Fernando, Naftali Tishby, and Lillian Lee. 1993. Distributional clustering of English words. In
Proceedings of the 31st Annual Meeting of the ACL, pages 183–190. Association for Computational
Linguistics, Somerset, NJ.