Download PowerPoint - Cornell Computer Science

CS 430: Information Discovery
Lecture 21
Thesauruses and Gazetteers
Course Administration
Lexicon and Thesaurus
Lexicon contains information about words, their
morphological variants, and their grammatical usage.
Thesaurus relates words by meaning:
ship, vessel, sail; craft, navy, marine, fleet, flotilla
book, writing, work, volume, tome, tract, codex
search, discovery, detection, find, revelation
(From Roget's Thesaurus, 1911)
Thesaurus in Information Retrieval
Use of a thesaurus in indexing (precoordination)
A. Manual
Used to guide a human indexer in assigning standard terms and
associations.
computer-aided instruction
see also education
UF teaching machines
BT educational computing
TT computer applications
RT education
RT teaching
From: INSPEC Thesaurus
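The entry above can be read as a small record linking one standard term to other terms. The sketch below is only an illustration of how such a record might be stored and used. The names `ENTRIES`, `standard_term`, and `expand` are hypothetical, and the choice to expand a query with broader (BT) and related (RT) terms is an assumption, not something the slide prescribes.

```python
# Relationship codes as they appear in the entry above:
# UF = used for, BT = broader term, TT = top term, RT = related term.
ENTRIES = [
    {
        "term": "computer-aided instruction",
        "UF": ["teaching machines"],
        "BT": ["educational computing"],
        "TT": ["computer applications"],
        "RT": ["education", "teaching"],
    },
]


def standard_term(term, entries=ENTRIES):
    """Return the standard term for `term`, following UF references."""
    for entry in entries:
        if term == entry["term"] or term in entry["UF"]:
            return entry["term"]
    return None


def expand(term, entries=ENTRIES, relations=("BT", "RT")):
    """Return the standard term plus the terms linked by `relations`.

    Terms that are not in the thesaurus are returned unchanged.
    """
    preferred = standard_term(term, entries)
    if preferred is None:
        return [term]
    entry = next(e for e in entries if e["term"] == preferred)
    expanded = [preferred]
    for relation in relations:
        expanded.extend(entry[relation])
    return expanded


print(standard_term("teaching machines"))
print(expand("teaching machines"))
```

An indexer who types "teaching machines" is led to the standard term, which is the precoordination role described above.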
Thesaurus in Information Retrieval
Use of a thesaurus in indexing (precoordination)
B. Automatic
Divide terms into thesaurus classes. Replace similar terms by a
thesaurus class.
408 dislocation, junction, minority-carrier, n-p-n, p-n-p, point-contact, recombine, transition, unijunction
409 blast-cooled, heat-flow, heat-transfer
410 anneal, strain
From: Salton and McGill
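Replacing terms by their thesaurus class can be sketched in a few lines. The class membership below is read off the two-column layout of the slide (classes 408, 409, and 410), so treat it as an assumption. The `class_NNN` labels and the function name `index_terms` are invented for this illustration.

```python
# Thesaurus classes as read from the slide (attributed to Salton and McGill).
THESAURUS_CLASSES = {
    408: ["dislocation", "junction", "minority-carrier", "n-p-n", "p-n-p",
          "point-contact", "recombine", "transition", "unijunction"],
    409: ["blast-cooled", "heat-flow", "heat-transfer"],
    410: ["anneal", "strain"],
}

# Invert the table so that each term points to its class number.
TERM_TO_CLASS = {
    term: class_id
    for class_id, terms in THESAURUS_CLASSES.items()
    for term in terms
}


def index_terms(tokens):
    """Replace each token that belongs to a thesaurus class by a class label.

    Tokens outside every class are lowercased and otherwise kept as they are
    (a choice made for this sketch only).
    """
    result = []
    for token in tokens:
        key = token.lower()
        if key in TERM_TO_CLASS:
            result.append(f"class_{TERM_TO_CLASS[key]}")
        else:
            result.append(key)
    return result


print(index_terms(["heat-flow", "in", "p-n-p", "junction"]))
```

After this step, documents that mention "p-n-p" and documents that mention "junction" share the index term for class 408, so a query on either term can match both.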
Desirable Properties for Information Retrieval
• Thesaurus is specific to a subject area. Contains only
terms of interest for identification within that subject
area.
• Ambiguous terms are coded only for the senses
important for that field.
• Target is that each thesaurus class should include
terms of moderate frequency. Ideally the classes
should have similar frequency.
Art and Architecture Thesaurus
• Controlled vocabulary for describing and retrieving information:
fine art, architecture, decorative art, and material culture.
• Almost 120,000 terms for objects, textual materials, images,
architecture and culture from all periods and all cultures.
• Used by archives, museums, and libraries to describe items in their
collections.
• Used to search for materials.
• Used by computer programs, for information retrieval, and natural
language processing.
A project of the J. Paul Getty Trust
Art and Architecture Thesaurus
Provides the terminology for objects, and the vocabulary necessary to
describe them, such as style, period, shape, color, construction, or
use, and scholarly concepts, such as theories, or criticism.
Concept:
a cluster of terms, one of which is established as the preferred term,
or descriptor.
Categories:
associated concepts, physical attributes, styles and periods, agents,
activities, materials, and objects.
Art and Architecture Thesaurus: Sample Record
Record ID: 198841
Descriptor: rhyta
Note: Refers to vessels from Ancient Greece, eastern Europe, or
the Middle East that typically have a closed form with two
openings, one at the top for filling and one at the base so that
liquid could stream out. They are often in the shape of a horn or
an animal's head, and were typically used as a drinking cup or for
pouring wine into another vessel.
Hierarchy:
Containers [TQ]
...<containers by function or context>
...........<culinary containers>
...................<containers for serving and consuming food>
Art and Architecture Thesaurus: Sample Record (continued)
Terms:
rhyta
rhyton (alternate, singular)
protomai
protome
rhea
rheon
rheons
Related concepts:
stirrup cups
sturzbechers
drinking vessels
ceremonial vessels
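The sample record illustrates the idea of a concept as a cluster of terms with one preferred descriptor. A minimal sketch of that structure follows. The class `Concept` and the helpers `build_lookup` and `to_descriptor` are hypothetical names for this illustration and are not part of any Getty software.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A cluster of terms, one of which is the preferred descriptor."""
    record_id: int
    descriptor: str
    terms: list
    related: list = field(default_factory=list)


# Values copied from the sample record above.
RHYTA = Concept(
    record_id=198841,
    descriptor="rhyta",
    terms=["rhyta", "rhyton", "protomai", "protome", "rhea", "rheon", "rheons"],
    related=["stirrup cups", "sturzbechers", "drinking vessels",
             "ceremonial vessels"],
)


def build_lookup(concepts):
    """Map every term (lowercased) to the concept that contains it."""
    lookup = {}
    for concept in concepts:
        for term in concept.terms:
            lookup[term.lower()] = concept
    return lookup


def to_descriptor(term, lookup):
    """Return the preferred descriptor for a term, or None if unknown."""
    concept = lookup.get(term.lower())
    return concept.descriptor if concept else None


lookup = build_lookup([RHYTA])
print(to_descriptor("Rhyton", lookup))
```

A cataloguer or a search interface can use such a lookup to normalize any variant ("rhyton", "rheon") to the single descriptor under which items are indexed.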
MeSH -- Medical Subject Headings
Controlled vocabulary for indexing articles, for cataloging
books and other holdings, and for searching MeSH-indexed
databases, including MEDLINE.
• About 19,000 primary subject headings
• Thesaurus of 110,000 chemical terms
• Total vocabulary over 300,000 terms
National Library of Medicine provides MeSH subject headings
for each of the 400,000 articles that it indexes every year.
"MeSH terminology provides a consistent way to retrieve
information that may use different terminology for the same
concepts."
MeSH -- Medical Subject Headings
MeSH hierarchy:
general terms, e.g., anatomy, organisms, diseases, biological
sciences;
anatomy is divided into sixteen topics, e.g., body regions and
musculoskeletal system;
body regions is divided into sections, e.g., abdomen, axilla, back
etc.
Example of MeSH hierarchy
Biological Sciences [G]
Biological Sciences [G01] +
Health Occupations [G02] +
Environment and Public Health [G03] +
Biological Phenomena, Cell Phenomena, and Immunity [G04] +
Genetics [G05] +
Biochemical Phenomena, Metabolism, and Nutrition [G06] +
Physiological Processes [G07] +
Reproductive and Urinary Physiology [G08] +
Circulatory and Respiratory Physiology [G09] +
Digestive, Oral, and Skin Physiology [G10] +
Musculoskeletal, Neural, and Ocular Physiology [G11] +
Chemical and Pharmacologic Phenomena [G12] +
Example of MeSH hierarchy (continued)
Physiological Processes [G07]
Adaptation, Physiological [G07.062] +
Aging [G07.168] +
Body Constitution [G07.265] +
Body Temperature [G07.315]
Body Temperature Regulation [G07.315.232] +
Skin Temperature [G07.315.753]
Chronobiology [G07.450] +
Electrophysiology [G07.453] +
Fluid Shifts [G07.503]
Growth and Embryonic Development [G07.553] +
Homeostasis [G07.621] +
Tensile Strength [G07.900]
Tropism [G07.950] +
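The tree numbers in the listing encode the hierarchy directly: each dot-separated segment adds one level. The sketch below shows how ancestors and descendants could be derived from the numbers alone. It uses only a handful of headings copied from the listing. The names `TREE`, `ancestors`, and `explode` are chosen for this illustration and do not describe a National Library of Medicine API.

```python
# A few headings copied from the listing above, keyed by tree number.
TREE = {
    "G07": "Physiological Processes",
    "G07.315": "Body Temperature",
    "G07.315.232": "Body Temperature Regulation",
    "G07.315.753": "Skin Temperature",
    "G07.450": "Chronobiology",
    "G07.900": "Tensile Strength",
}


def ancestors(tree_number):
    """Return the tree numbers of all broader headings, most general first."""
    parts = tree_number.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


def explode(tree_number, tree=TREE):
    """Return the heading and everything below it in the hierarchy."""
    prefix = tree_number + "."
    return sorted(
        number for number in tree
        if number == tree_number or number.startswith(prefix)
    )


print(ancestors("G07.315.753"))
print([TREE[n] for n in explode("G07.315")])
```

A search on "Body Temperature" that includes its narrower headings would then also retrieve articles indexed under "Skin Temperature".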
Example of MeSH hierarchy (continued)
MeSH Heading: Body Temperature
Tree Number: E01.370.600.120
Tree Number: G07.315
Entry Term: Organ Temperature
See Also: Fever
See Also: Thermography
See Also: Thermometers
Allowable Qualifiers: DE GE IM PH RE
Unique ID: D001831
Observations about Manually Maintained Thesaurus
• Permits a very rich structure of relationships
• Most effective when the user of the search system is skilled
in the discipline and trained in the use of the thesaurus
(e.g., medical librarian)
• Needs continual updating as a field develops new
terminology
• Expensive to create and maintain
Gazetteers
The Alexandria Digital Library (ADL): geolibrary at University
of California at Santa Barbara where a primary attribute of objects
is location on Earth (e.g., map, satellite photograph).
Geographic footprint: latitude and longitude values that
represent a point, a bounding box, a linear feature, or a complete
polygonal boundary.
Gazetteer: list of geographic names, with geographic locations
and other descriptive information.
Geographic name: proper name for a geographic place or feature
(e.g., Santa Barbara County, Mount Washington, St. Francis
Hospital, and Southern California)
Alexandria Thesaurus: Example
canals
A feature type category for places such as the Erie Canal.
Used for:
The category canals is used instead of any of the following.
canal bends
ditches
canalized streams
drainage canals
ditch mouths
drainage ditches
Broader Terms:
Canals is a sub-type of hydrographic structures.
... more ...
Alexandria Thesaurus: Example (continued)
canals (continued)
Related Terms:
The following is a list of other categories related to canals (non-hierarchical relationships).
channels
locks
transportation features
tunnels
Scope Note:
Manmade waterway used by watercraft or for drainage, irrigation,
mining, or water power.
Use of a Gazetteer
• Answers the "Where is" question; for example, "Where is
Santa Barbara?"
• Translates between geographic names and locations. A
user can find objects by matching the footprint of a
geographic name to the footprints of the collection
objects.
• Locates particular types of geographic features in a
designated area. For example, a user can draw a box
around an area on a map and find the schools, hospitals,
lakes, or volcanoes in the area.
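The "draw a box and find features" use can be sketched as a footprint intersection test. Everything in the example is hypothetical: the class `BoundingBox`, the helpers `point` and `find_features`, and the three gazetteer entries with invented names and coordinates. It also assumes footprints are points or boxes that do not cross the 180° meridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """A footprint given as west/south/east/north in decimal degrees."""
    west: float
    south: float
    east: float
    north: float

    def intersects(self, other):
        """True if the two boxes share at least one point."""
        return (self.west <= other.east and other.west <= self.east
                and self.south <= other.north and other.south <= self.north)


def point(lon, lat):
    """Represent a point footprint as a degenerate bounding box."""
    return BoundingBox(lon, lat, lon, lat)


# Hypothetical entries: (name, feature type, footprint). Not real data.
GAZETTEER = [
    ("School A", "school", point(-119.70, 34.42)),
    ("Lake B", "lake", BoundingBox(-119.95, 34.55, -119.90, 34.60)),
    ("Hospital C", "hospital", point(-118.25, 34.05)),
]


def find_features(query_box, gazetteer=GAZETTEER, feature_type=None):
    """Return names whose footprint intersects the box drawn by the user."""
    return [
        name
        for name, ftype, footprint in gazetteer
        if (feature_type is None or ftype == feature_type)
        and footprint.intersects(query_box)
    ]


user_box = BoundingBox(-120.0, 34.3, -119.5, 34.7)
print(find_features(user_box))
print(find_features(user_box, feature_type="lake"))
```

The same intersection test works in the other direction: look up the footprint of a geographic name, then match it against the footprints of collection objects such as maps or satellite photographs.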
Alexandria Gazetteer: Example from a search on "Tulsa"
Feature name                             State  County  Type     Latitude  Longitude
Tulsa                                    OK     Tulsa   pop pl   360914N   0955933W
Tulsa Country Club                       OK     Osage   locale   360958N   0960012W
Tulsa County                             OK     Tulsa   civil    360600N   0955400W
Tulsa Helicopters Incorporated Heliport  OK     Tulsa   airport  360500N   0955205W
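The coordinates in the Tulsa results appear to be packed degrees, minutes, and seconds followed by a hemisphere letter (for example, 360914N read as 36° 09' 14" N). Under that assumption, a conversion to signed decimal degrees could look like the sketch below. The function name `dms_to_decimal` is hypothetical.

```python
def dms_to_decimal(value):
    """Convert 'DDMMSSH' or 'DDDMMSSH' to signed decimal degrees.

    Assumes two digits each for minutes and seconds and a trailing
    hemisphere letter (N, S, E, or W). South and west become negative.
    """
    hemisphere = value[-1].upper()
    digits = value[:-1]
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere in {value!r}")
    if not digits.isdigit() or len(digits) not in (6, 7):
        raise ValueError(f"expected 6 or 7 digits in {value!r}")
    seconds = int(digits[-2:])
    minutes = int(digits[-4:-2])
    degrees = int(digits[:-4])
    decimal = degrees + minutes / 60 + seconds / 3600
    return -decimal if hemisphere in ("S", "W") else decimal


# The populated place "Tulsa" from the search results.
latitude = dms_to_decimal("360914N")
longitude = dms_to_decimal("0955933W")
print(round(latitude, 4), round(longitude, 4))
```

Decimal degrees are the form that bounding-box comparisons and most mapping software expect, so a gazetteer would typically normalize coordinates like this on ingest.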
Challenges for the Alexandria Gazetteer
Content standard: A standard conceptual schema for
gazetteer information.
Feature types: A type scheme to categorize individual
features that is rich in term variants and extensible.
Temporal aspects: Geographic names and attributes change
through time.
"Fuzzy" footprints: Extent of a geographic feature is often
approximate or ill-defined (e.g., Southern California).
Challenges for the Alexandria Gazetteer (continued)
Quality aspects:
(a) Indicate the accuracy of latitude and longitude data.
(b) Ensure that the reported coordinates agree with the other
elements of the description.
Spatial extents:
(a) Points do not represent the extent of the geographic
locations and are therefore only minimally useful.
(b) Bounding boxes often include too much territory (e.g., the
bounding box for California also includes Nevada).
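The bounding-box problem can be made concrete with a toy region. The sketch below uses a hypothetical L-shaped polygon in planar coordinates (not real state boundaries) and a standard ray-casting test. It treats coordinates as flat x/y values, which is a simplification for illustration only.

```python
def bounding_box(polygon):
    """Return (west, south, east, north) enclosing all polygon vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(pt, box):
    """True if the point lies inside or on the edge of the box."""
    x, y = pt
    west, south, east, north = box
    return west <= x <= east and south <= y <= north


def in_polygon(pt, polygon):
    """Ray casting: count how many edges a ray to the right of pt crosses."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through pt can cross.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical L-shaped region; the upper-right corner is outside it.
REGION = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]
BOX = bounding_box(REGION)

outside_point = (3, 3)
print(in_box(outside_point, BOX), in_polygon(outside_point, REGION))
```

The point (3, 3) falls inside the bounding box but outside the region itself, which is the kind of false match described in (b). A full polygonal footprint avoids it at the cost of more storage and a slower test.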
Examples of Gazetteers
Alexandria Digital Library
Linda L. Hill, James Frew, and Qi Zheng, Geographic Names:
The Implementation of a Gazetteer in a Georeferenced Digital
Library. D-Lib Magazine, 5(1), January 1999.
http://www.dlib.org/dlib/january99/hill/01hill.html
Getty Thesaurus of Geographic Names
http://www.getty.edu/research/tools/vocabulary/tgn/