INFO624 -- Week 8
Subject Indexing & Knowledge
Representation
Dr. Xia Lin
Assistant Professor
College of Information Science and Technology
Drexel University
Effective Information Retrieval
Data Structures
 Knowledge Representation
 From Document representation to
Knowledge representation
 User Interface and User Interaction

Document Representation
Vocabulary
 Semantics
 Implementation

Vocabulary


Controlled Vocabulary
 A list of terms selected for indexing purposes.
 The terms are processed to reduce
inconsistency and ambiguity.
 Established selection rules and indexing
rules
Uncontrolled vocabulary
 Subject keywords
 Metadata
Example: ACM record
Metadata




Data about data
Descriptive Data
 External to the meaning of the document
 Dublin Core Metadata Element Set
 Author, title, publisher, etc.
Semantic Metadata
 Subject keywords
Challenge: automatic generation of
metadata for documents
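
A minimal sketch of a descriptive metadata record, assuming the 15 elements of the Dublin Core Metadata Element Set. Note that Dublin Core calls the author element "creator". The helper name make_record is hypothetical, and the sample values are taken from this lecture's own title slide.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

def make_record(**fields):
    """Build a metadata record, rejecting element names outside Dublin Core.

    Hypothetical helper for illustration only.
    """
    unknown = set(fields) - DUBLIN_CORE_ELEMENTS
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return dict(fields)

# Descriptive metadata (external to meaning) plus semantic metadata (subject).
record = make_record(
    title="Subject Indexing & Knowledge Representation",
    creator="Xia Lin",
    publisher="Drexel University",
    subject=["Subject indexing", "Knowledge representation"],
)
```

The title, creator, and publisher fields are descriptive metadata; the subject field is the semantic part that is hard to generate automatically.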
Semantics

Semantics is the study of meaning
 Relational semantics
 Synonymy, hierarchical, etc.
 Referential semantics
 Homonyms, techniques used to limit the
meanings or referents of terms
 Category semantics
 Facets or other partitions
Example:

Mercury?
 Mercury (car)
 Mercury (planet)
 Mercury (metal)
 Mercury (Roman god)
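
The parenthetical qualifiers in the Mercury example can be sketched as a lookup from a bare term to its qualified headings. The table (a subset of the list above) and the function names are illustrative assumptions, not part of any real vocabulary.

```python
# Hypothetical mini-vocabulary: an ambiguous term maps to qualified headings.
QUALIFIED_HEADINGS = {
    "mercury": ["Mercury (car)", "Mercury (planet)", "Mercury (metal)"],
}

def candidate_headings(term):
    """Return every qualified heading a bare term could refer to."""
    return QUALIFIED_HEADINGS.get(term.strip().lower(), [])

def resolve(term, qualifier):
    """Pick the heading whose qualifier matches, or None if nothing matches."""
    wanted = f"({qualifier.strip().lower()})"
    for heading in candidate_headings(term):
        if heading.lower().endswith(wanted):
            return heading
    return None
```

For instance, resolve("Mercury", "planet") selects the astronomical sense, while a bare "Mercury" stays ambiguous until a qualifier is supplied.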
Implementation
Standards
 AACR2
 ISO Standard for Indexing (ISO 5963)
 ISO Standard for Thesaurus Construction
(ISO 2788)
 Rules
 Classification rules
 Evaluation rules

Subject Indexing

A human analytic process for identifying,
selecting, and representing document concepts
 Create indexing languages
 Using standardized, limited vocabularies for
indexing purposes.
 Assign indexing terms to documents
 Using only the terms in the indexing language
selected.
Basic Processes of Subject
Indexing




1. Identifying the concepts that represent the subject
and purpose of a document.
2. Deciding which of these concepts are important
for retrieval of this document.
3. Expressing the concepts needed for retrieval in the
indexing languages used.
4. Using uncontrolled vocabulary for concepts that are
not represented, or not represented specifically
enough, in the indexing languages.
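
The steps above can be sketched as a small routine: concepts the indexer has already selected are expressed in the indexing language where possible, and fall back to uncontrolled keywords otherwise. The vocabulary table and the function name are hypothetical; a real indexing language is far larger and is applied by a human indexer.

```python
# Hypothetical indexing language: lead-in phrase -> preferred descriptor.
INDEXING_LANGUAGE = {
    "cancer": "Neoplasms",
    "malignant tumor": "Neoplasms",
    "vitamin c": "Ascorbic Acid",
}

def index_document(concepts, vocabulary=INDEXING_LANGUAGE):
    """Split selected concepts into controlled descriptors and free keywords.

    Concepts found in the vocabulary become descriptors (step 3);
    concepts the vocabulary does not cover stay as uncontrolled
    keywords (step 4). Duplicates are dropped, order is preserved.
    """
    descriptors, free_keywords = [], []
    for concept in concepts:
        cleaned = concept.strip()
        descriptor = vocabulary.get(cleaned.lower())
        if descriptor is not None:
            if descriptor not in descriptors:
                descriptors.append(descriptor)
        elif cleaned not in free_keywords:
            free_keywords.append(cleaned)
    return descriptors, free_keywords
```

Two different wordings of the same concept collapse to one descriptor, which is the uniformity a controlled vocabulary is meant to enforce.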
Controlled Vocabulary

Goals:
 To permit easy location of documents by
topic.
 To define topic areas, and hence relate one
document to another.
 To provide multiple access points to
documents.
 To enforce uniformity throughout an
information retrieval system.
Controlled Vocabulary

Formats:
 Hierarchical Classified list
 hierarchical subject descriptors
 associative cross references
 classification notation (codes)
 Alphabetical list
 include both descriptors and other
lead-in terms
Main Components
in a Controlled Vocabulary
 Keyword/Descriptor
 Broader Term
 Narrower Term
 Related Term
 Synonymous Term
Example
Descriptor: Cancer
Broader Terms
 Diseases
 Neoplasms
Synonyms
 Malignancy
 Malignant tumor
 Cancer morphology
Related Terms
 Abdominal Neoplasms
 Hyperplasia
 Seminoma
Narrower Terms
 Malignant neoplasm of skin
 Breast cancer
 Primary malignant neoplasm of liver
Example:


MeSH – Medical Subject Headings
 22,568 descriptors
 139,000 headings (Supplementary Concept
Records)
 thousands of cross-references
 e.g., Vitamin C see Ascorbic Acid.
 Used for indexing MEDLINE
MeSH Browser
MeSH Tree Structures - 2004
1. Anatomy [A]
2. Organisms [B]
3. Diseases [C]
4. Chemicals and Drugs [D]
5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
6. Psychiatry and Psychology [F]
7. Biological Sciences [G]
8. Physical Sciences [H]
9. Anthropology, Education, Sociology and Social Phenomena [I]
10. Technology and Food and Beverages [J]
11. Humanities [K]
12. Information Science [L]
13. Persons [M]
14. Health Care [N]
15. Geographic Locations [Z]
ERIC Thesaurus


More than 10,000 terms or subject headings
used in indexing and searching ERIC records.
A supplemental list of over 55,000 terms or
subject headings including
 proper names (e.g., geographic, personal,
institutional, project, equipment, test, etc.,
names) or
 concepts not yet represented by the
controlled vocabulary of the ERIC
Thesaurus.
Controlled Vocabulary

Examples:
 Case studies: Descriptor
 SN: Detailed analyses, usually focusing on a
particular problem of an individual, group,
or organization (note: do not confuse with
“medical case histories”)
 NT:
Cross sectional studies
Longitudinal studies
Examples (Case Studies)


 BT:
Evaluation methods
Research
 RT:
Case records
Counseling
Qualitative research
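
The "Case studies" entry can be held in a simple record type. This is a sketch only: the class and field names are assumptions, and the values are copied from the example above.

```python
from dataclasses import dataclass, field

@dataclass
class ThesaurusEntry:
    """One descriptor with its scope note and term relationships."""
    descriptor: str
    scope_note: str = ""                           # SN
    narrower: list = field(default_factory=list)   # NT
    broader: list = field(default_factory=list)    # BT
    related: list = field(default_factory=list)    # RT

    def display(self):
        """Render the entry in the usual SN / NT / BT / RT layout."""
        lines = [self.descriptor]
        if self.scope_note:
            lines.append(f"  SN: {self.scope_note}")
        for label, terms in (("NT", self.narrower),
                             ("BT", self.broader),
                             ("RT", self.related)):
            for term in terms:
                lines.append(f"  {label}: {term}")
        return "\n".join(lines)

case_studies = ThesaurusEntry(
    descriptor="Case studies",
    scope_note=("Detailed analyses, usually focusing on a particular "
                "problem of an individual, group, or organization"),
    narrower=["Cross sectional studies", "Longitudinal studies"],
    broader=["Evaluation methods", "Research"],
    related=["Case records", "Counseling", "Qualitative research"],
)

print(case_studies.display())
```

Keeping each relationship type in its own list lets a search interface suggest broader, narrower, or related terms separately.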
Advantages of Subject Indexing


Facilitates concept search
 search by topics/subjects, not just by words
 link related documents by subject terms
 Make implicit information explicit
Provides a standard terminology to index and
search documents.
 Uses a small indexing vocabulary
 Helps the searcher find related terms
Disadvantages of Subject Indexing



Expensive manual operations
 To construct the controlled vocabulary
 To assign terms to documents
Difficult to keep up to date
 Terminology changes very fast
 New terms are added daily.
Inconsistent process of human indexing
 Same documents are assigned different indexing
terms by different indexers
 The user may not use the same terms to find
documents as the indexer would use to index the
documents.
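
Inconsistency between indexers can be quantified. One commonly cited option, which is an addition here and not something these slides specify, is Hooper's measure: the number of terms both indexers assigned, divided by the number of distinct terms either assigned. A minimal sketch, with a hypothetical function name:

```python
def indexing_consistency(terms_a, terms_b):
    """Hooper-style consistency between two indexers' term sets.

    Returns a value in [0, 1]: 1.0 means identical term sets,
    0.0 means no terms in common. Comparison ignores case and
    surrounding whitespace.
    """
    a = {t.strip().lower() for t in terms_a}
    b = {t.strip().lower() for t in terms_b}
    if not a and not b:
        return 1.0  # two empty assignments agree trivially
    return len(a & b) / len(a | b)
```

For example, if one indexer assigns "Neoplasms" and "Diagnosis" and another assigns "Neoplasms" and "Therapy", one term out of three distinct terms is shared.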
Document Representation


Inverted Indexing
 Represent a document as a list of the terms
that occur in the document
 computer-based indexing
 statistical-based indexing
Subject Indexing
 Represent a document as a list of subject
terms drawn from a controlled vocabulary.
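
For contrast with subject indexing, a minimal sketch of computer-based inverted indexing: every word found in a document becomes an index term pointing back to that document. The tokenizer (lowercase alphanumeric runs) and the sample documents are simplifying assumptions.

```python
import re
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the sorted list of ids of documents containing it.

    documents: dict of doc_id -> text.
    """
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical two-document collection.
docs = {
    1: "Cancer morphology and diagnosis",
    2: "Diagnosis of skin cancer",
}
index = build_inverted_index(docs)
```

No vocabulary control is applied here: a document that says "malignant tumor" would not be found under "cancer", which is the gap subject indexing addresses.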
Considerations of
Document Representation

Any format of document representation
needs to maintain a balance of its
 Discriminating power
 Descriptiveness
 Similarity identification
 Conciseness
Considerations of DR

Discriminating power
 to identify a document uniquely
 to reduce ambiguity
 Examples:
• ISBNs for books
• bar codes for products
Considerations of DR

Descriptiveness
 describe all the information as completely as
possible
 full text
 abstracts
 extracts
 reviews
 Completeness and correctness
Considerations of DR

Similarity Identification
 to group similar documents
 keywords or subject indexing
 book classification numbers

Difficulty for the computer to assign
keywords, subject descriptors, or
classification numbers to documents
Considerations of DR

Conciseness
 simple and clear
 reduce process time and storage space
 Examples:
 authors and titles




Relationships of four
considerations
Higher discriminating power may lower the
capability of identifying similarities among
documents.
Good descriptiveness may defeat conciseness.
What’s good for the computer may not always be
good for the user.
A good representation should seek a balance of the
four, and take into consideration both the computer
and the user.
What’s missing in DR?
Intelligent Reasoning!
 Knowledge-base
 Ontology
 Semantic Networks
 Uncertainty (imprecision) handling

Knowledge Representation

Encoding human knowledge - in all its
various forms - in such a way that the
knowledge can be used.
 A successful representation of some
knowledge must be in a form that is
understandable by humans, and must
cause the system using the knowledge
to behave as if it knows it.
Knowledge Representation



A knowledge representation (KR) is most
fundamentally a surrogate, a substitute for the
thing itself.
It is a set of ontological commitments, i.e., an
answer to the question: In what terms should I
think about the world?
It is a fragmentary theory of intelligent reasoning,
expressed in terms of three components: (i) the
representation's fundamental conception of
intelligent reasoning; (ii) the set of inferences the
representation sanctions; and (iii) the set of
inferences it recommends.
Knowledge Representation


It is a medium for pragmatically efficient
computation, i.e., the computational environment
in which thinking is accomplished. One
contribution to this pragmatic efficiency is
supplied by the guidance a representation
provides for organizing information so as to
facilitate making the recommended inferences.
It is a medium of human expression, i.e., a
language in which we say things about the world.
 From http://medg.lcs.mit.edu/ftp/psz/krep.html
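
A tiny semantic network illustrates two of these roles: the table is a surrogate for things in the world, and following "is-a" links is the kind of inference the representation sanctions. The facts and function names are hypothetical, chosen to echo the earlier Cancer example.

```python
# Hypothetical "is-a" facts: each concept points to its broader concept.
IS_A = {
    "breast cancer": "cancer",
    "cancer": "neoplasm",
    "neoplasm": "disease",
}

def ancestors(concept, is_a=IS_A):
    """Follow is-a links upward and return the chain of broader concepts."""
    chain = []
    seen = {concept}
    while concept in is_a:
        concept = is_a[concept]
        if concept in seen:  # guard against accidental cycles
            break
        seen.add(concept)
        chain.append(concept)
    return chain

def is_a_kind_of(concept, category, is_a=IS_A):
    """True if category is reachable from concept via is-a links."""
    return category in ancestors(concept, is_a)
```

Nothing in the table states that breast cancer is a disease; the system behaves as if it knows this because the representation permits chaining the links.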
Intelligent Information Retrieval
Information retrieval supported by
knowledge representation, rather than
document representation.
 Useful links
 Stanford
 Agent-based IR
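
One simple form of knowledge-supported retrieval is query expansion: the searcher's word is mapped to its preferred descriptor and then widened with narrower terms. This is a sketch under assumptions; the two tables and the function name are hypothetical and stand in for a real thesaurus.

```python
# Hypothetical thesaurus fragments.
PREFERRED = {"cancer": "neoplasms", "malignancy": "neoplasms"}
NARROWER = {"neoplasms": ["abdominal neoplasms", "skin neoplasms"]}

def expand_query(term):
    """Return the preferred descriptor for a term plus its narrower terms."""
    key = term.strip().lower()
    preferred = PREFERRED.get(key, key)
    return [preferred] + NARROWER.get(preferred, [])
```

A search for "Cancer" would then also retrieve documents indexed only under the narrower headings, which plain word matching would miss.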
