INFO624 -- Week 8
Subject Indexing & Knowledge Representation
Dr. Xia Lin
Assistant Professor
College of Information Science and Technology
Drexel University

Effective Information Retrieval
  - Data Structures
  - Knowledge Representation
      - From document representation to knowledge representation
  - User Interface and User Interaction

Document Representation
  - Vocabulary
  - Semantics
  - Implementation

Vocabulary
  - Controlled vocabulary
      - A list of terms selected for indexing purposes.
      - The terms are processed to reduce inconsistency and ambiguity.
      - Established selection rules and indexing rules
  - Uncontrolled vocabulary
      - Subject keywords
  - Metadata
      - Example: ACM record

Metadata
  - Data about data
  - Descriptive metadata
      - External to the meaning of the document
      - Dublin Core Metadata Element Set
      - Author, title, publisher, etc.
  - Semantic metadata
      - Subject keywords
  - Challenge: automatic generation of metadata for documents

Semantics
  - Semantics is the study of meaning.
  - Relational semantics
      - Synonymy, hierarchy, etc.
  - Referential semantics
      - Homonyms; techniques used to limit the meanings or referents of terms
  - Category semantics
      - Facets or other partitions

Example: Mercury?
  - Mercury (car)
  - Mercury (planet)
  - Mercury (metal)
  - Mercury (Roman god)

Implementation
  - Standards
      - AACR2
      - ISO Standard for Indexing (ISO 5963)
      - ISO Standard for Thesaurus Construction (ISO 2788)
  - Rules
      - Classification rules
      - Evaluation rules

Subject Indexing
  - A human analytic process for identifying, selecting, and representing document concepts
  - Create indexing languages
      - Using standardized, limited vocabularies for indexing purposes.
  - Assign indexing terms to documents
      - Using only the terms in the selected indexing language.

Basic Processes of Subject Indexing
  - Identifying concepts that represent the subject and purpose of a document.
  - Deciding which of these concepts are important for retrieval of this document.
  - Expressing the concepts needed for retrieval in the indexing language used.
  - Using uncontrolled vocabulary for concepts that are not represented, or not represented specifically enough, in the indexing language.

Controlled Vocabulary Goals
  - To permit easy location of documents by topic.
  - To define topic areas, and hence relate one document to another.
  - To provide multiple access points to documents.
  - To enforce uniformity throughout an information retrieval system.

Controlled Vocabulary Formats
  - Hierarchical classified list
      - Hierarchical subject descriptors
      - Associative cross references
      - Classification notation (codes)
  - Alphabetical list
      - Includes both descriptors and other lead-in terms

Main Components in a Controlled Vocabulary
  - Keyword / Descriptor
  - Broader Term
  - Narrower Term
  - Synonymous Term
  - Related Term

Example
  - Descriptor: Neoplasms
  - Broader Terms: Diseases
  - Synonyms: Malignancy; Malignant tumor; Cancer morphology; Cancer
  - Related Terms: Abdominal Neoplasms; Hyperplasia; Seminoma
  - Narrower Terms: Malignant neoplasm of skin; Breast Cancer; Primary malignant neoplasm of liver

Example: MeSH (Medical Subject Headings)
  - 22,568 descriptors
  - 139,000 headings (Supplementary Concept Records)
  - Thousands of cross-references, e.g., Vitamin C see Ascorbic Acid
  - Used to index MEDLINE
  - MeSH Browser

MeSH Tree Structures - 2004
   1. Anatomy [A]
   2. Organisms [B]
   3. Diseases [C]
   4. Chemicals and Drugs [D]
   5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
   6. Psychiatry and Psychology [F]
   7. Biological Sciences [G]
   8. Physical Sciences [H]
   9. Anthropology, Education, Sociology and Social Phenomena [I]
  10. Technology and Food and Beverages [J]
  11. Humanities [K]
  12. Information Science [L]
  13. Persons [M]
  14. Health Care [N]
  15. Geographic Locations [Z]

ERIC Thesaurus
  - More than 10,000 terms or subject headings used in indexing and searching ERIC records.
  - A supplemental list of over 55,000 terms or subject headings, including proper names (e.g., geographic, personal, institutional, project, equipment, and test names) and concepts not yet represented by the controlled vocabulary of the ERIC Thesaurus.

Controlled Vocabulary Examples
  Case studies (descriptor)
  - SN: Detailed analyses, usually focusing on a particular problem of an individual, group, or organization (note: do not confuse with “medical case histories”)
  - NT: Cross sectional studies; Longitudinal studies

Examples (Case Studies)
  - BT: Evaluation methods; Research
  - RT: Case records; Counseling; Qualitative research

Advantages of Subject Indexing
  - Facilitates concept search
      - Search by topics/subjects, not just by words
      - Link related documents by subject terms
  - Makes implicit information explicit
  - Provides a standard terminology to index and search documents
  - Uses a small indexing vocabulary
  - Helps the searcher find related terms

Disadvantages of Subject Indexing
  - Expensive manual operations
      - To construct the controlled vocabulary
      - To assign terms to documents
  - Difficult to keep up to date
      - Terminology changes very fast
      - New terms are added daily
  - Inconsistent process of human indexing
      - The same documents are assigned different indexing terms by different indexers
  - The user may not use the same terms to find documents as the indexer would use to index them.

Document Representation
  - Inverted Indexing
      - Represent a document as a list of terms that occur in the document
      - Computer-based indexing
      - Statistical-based indexing
  - Subject Indexing
      - Represent a document as a list of subject terms that occur in a controlled vocabulary.
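
The contrast between the two document representations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three documents, the LEAD_IN table, and the function names are hypothetical, and simple substring matching stands in for the human analytic process of subject indexing described above.

```python
from collections import defaultdict

# Hypothetical lead-in table: lead-in term -> preferred descriptor,
# in the spirit of "Vitamin C see Ascorbic Acid".
LEAD_IN = {
    "vitamin c": "Ascorbic Acid",
    "ascorbic acid": "Ascorbic Acid",
    "cancer": "Neoplasms",
    "malignant tumor": "Neoplasms",
    "neoplasms": "Neoplasms",
}

# Hypothetical document collection (titles only).
DOCS = {
    "d1": "Vitamin C intake and cancer risk",
    "d2": "Ascorbic acid levels in malignant tumor tissue",
    "d3": "Cancer screening guidelines",
}


def inverted_index(docs):
    """Map each word that occurs in a document to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def subject_index(docs, lead_in):
    """Map each controlled descriptor to the documents that mention one of its lead-in terms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        lowered = text.lower()
        for term, descriptor in lead_in.items():
            if term in lowered:  # crude stand-in for a human indexer's judgment
                index[descriptor].add(doc_id)
    return index


word_index = inverted_index(DOCS)
subj_index = subject_index(DOCS, LEAD_IN)

print(sorted(word_index["cancer"]))     # word search misses d2 ("malignant tumor")
print(sorted(subj_index["Neoplasms"]))  # descriptor search groups all three documents
```

The word-level index finds only documents that literally contain "cancer", while the descriptor "Neoplasms" also collects the document that uses a synonym. That is the "concept search" advantage listed above, bought at the cost of building and maintaining the lead-in table.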
Considerations of Document Representation
  Any format of document representation needs to maintain a balance of its
  - Discriminating power
  - Descriptiveness
  - Similarity identification
  - Conciseness

Considerations of DR: Discriminating power
  - To identify a document uniquely
  - To reduce ambiguity
  - Examples:
      - ISBN number for books
      - Bar codes for products

Considerations of DR: Descriptiveness
  - Describe all the information as completely as possible
      - Full text
      - Abstracts
      - Extracts
      - Reviews
  - Completeness and correctness

Considerations of DR: Similarity identification
  - To group similar documents
      - Keywords or subject indexing
      - Book classification numbers
  - It is difficult for the computer to assign keywords, subject descriptors, or classification numbers to documents.

Considerations of DR: Conciseness
  - Simple and clear
  - Reduces processing time and storage space
  - Examples: authors and titles

Relationships among the four considerations
  - Higher discriminating power may lower the capability of identifying similarities among documents.
  - Good descriptiveness may defeat conciseness.
  - What's good for the computer may not always be good for the user.
  - A good representation should seek a balance of the four, and take into consideration both the computer and the user.

What's missing in DR?
  Intelligent Reasoning!
  - Knowledge base
  - Ontology
  - Semantic networks
  - Uncertainty (imprecision) handling

Knowledge Representation
  - Encoding human knowledge, in all its various forms, in such a way that the knowledge can be used.
  - A successful representation of some knowledge must be in a form that is understandable by humans, and must cause the system using the knowledge to behave as if it knows it.

Knowledge Representation
  - A knowledge representation (KR) is most fundamentally a surrogate, a substitute for the thing itself.
  - It is a set of ontological commitments, i.e., an answer to the question: In what terms should I think about the world?
  - It is a fragmentary theory of intelligent reasoning, expressed in terms of three components:
      (i) the representation's fundamental conception of intelligent reasoning;
      (ii) the set of inferences the representation sanctions; and
      (iii) the set of inferences it recommends.

Knowledge Representation
  - It is a medium for pragmatically efficient computation, i.e., the computational environment in which thinking is accomplished. One contribution to this pragmatic efficiency is supplied by the guidance a representation provides for organizing information so as to facilitate making the recommended inferences.
  - It is a medium of human expression, i.e., a language in which we say things about the world.
  From http://medg.lcs.mit.edu/ftp/psz/krep.html

Intelligent Information Retrieval
  - Information retrieval supported by knowledge representation, rather than document representation.

Useful links
  - Stanford Agent-based IR
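
One way to picture retrieval supported by knowledge representation is a thesaurus hierarchy that sanctions a single inference: a document indexed under a narrower term is also relevant to its broader terms. The Python sketch below is a minimal illustration, not a description of any particular system. The NARROWER links, the SUBJECT_INDEX contents, and the function names are all hypothetical, loosely modeled on the Neoplasms example earlier in the deck.

```python
# Hypothetical narrower-term (NT) links: descriptor -> its narrower descriptors.
NARROWER = {
    "Diseases": ["Neoplasms"],
    "Neoplasms": ["Breast Cancer", "Primary malignant neoplasm of liver"],
}

# Hypothetical subject index: descriptor -> ids of documents assigned that descriptor.
SUBJECT_INDEX = {
    "Neoplasms": {"d1"},
    "Breast Cancer": {"d2"},
    "Primary malignant neoplasm of liver": {"d3"},
    "Hyperplasia": {"d4"},
}


def expand(descriptor, narrower):
    """Return the descriptor together with all transitively narrower descriptors."""
    seen = set()
    stack = [descriptor]
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(narrower.get(term, []))
    return seen


def retrieve(descriptor, narrower, subject_index):
    """Collect documents indexed under the descriptor or any narrower descriptor."""
    docs = set()
    for term in expand(descriptor, narrower):
        docs |= subject_index.get(term, set())
    return docs


hits = retrieve("Diseases", NARROWER, SUBJECT_INDEX)
print(sorted(hits))
```

A plain lookup of "Diseases" in the subject index would return nothing, because no document carries that exact descriptor. The hierarchy lets the system behave as if it knows that breast cancer is a disease.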