Indexing Mathematical Abstracts by Metadata and Ontology

IMA Workshop, April 26-27, 2004
Su-Shing Chen, University of Florida
[email protected]

Abstract

• OAI extensions to federated search and other services for MathML-based metadata indexing and subject classification of mathematical abstracts.
• Construction of ontology or conceptual maps of mathematics. Mathematical formulas are considered as elements of the ontology.
• Ontology indexing by clustering mathematical abstracts or full papers into an information visualization interface so that users may select using ontology as well as metadata.

A DL Server with OAI Extensions: Managing the Metadata Complexity

[Figure: DL Server architecture. Labels: Harvest API, Harvester, Data Provider, Service Provider (Federated Search, Data Mining), OAI_DC, OAI_XXX, User, Internet, Service Provider 1 … Service Provider N, Java DataBase Connectivity (JDBC), Harvested Metadata, Digested Metadata, Service Providers' Data]

Built-in capabilities:
• Harvester: harvest various OAI-compliant data providers
• Data provider: expose harvested and existing metadata sets
• Service provider: federated search and data mining capabilities on metadata sets

Harvester

[Figure: harvester diagram. Labels: Harvest API, Data Providers, Harvester, Harvested metadata, DL Server]

Harvester Interface:
• URL to harvest
• Selective harvesting parameters

[Figure: Harvester Interface]

Data Provider

• Expose single or combined metadata sets harvested to other harvesters
• Reformat metadata from different data providers to be harvested by other service providers (e.g., originally Dublin Core, reformat to MARC before exposing)

Service Provider: Federated Search

• Emulating a federated search
service on existing and combined harvested metadata sets
• Federated search across potentially other search protocols

[Figure: Federated Search]

Service Provider: Data Mining

• Knowledge discovery on harvested metadata sets
• Metadata classification using the Self-Organizing Map (SOM) algorithm
• Improving retrieval effectiveness by providing concept browsing and search services

Self-Organizing Map Algorithm

• Competitive and unsupervised learning algorithm
• Artificial neural network algorithm for visualizing and interpreting complex data sets
• Providing a mapping from a high-dimensional input space to a two-dimensional output space

Data Mining Service Provider System Architecture

[Figure: system architecture. Labels: Browser (concept browsing request, concept search request, response), Concept Harvester, SOM Categorizer, Input Vector Generator, Noun Phraser, Metadata Database (fetch metadata, save SOM)]

[Figure: Screenshot of the SOM Categorizer]

Construction of Two-level Concept Hierarchy

• Constructing the SOM for each harvested metadata set
• SOMs of the lower layer are added to the upper-layer SOM.

[Figure: Top-level Concept Browsing (label: VTETD)]
[Figure: Bottom-level Concept Browsing]

MEDLINE Database

• Developed by the National Library of Medicine (NLM)
• Bibliographic citations and abstracts from more than 4,600 biomedical journals published in the United States and 70 other countries
• Covering the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences
• Over 12 million citations
• Searchable via PubMed or the NLM Gateway

MeSH (Medical Subject Headings)

• MEDLINE uses MeSH as its controlled vocabulary for indexing database articles
• Indexers scan an entire article and assign MeSH headings (or MeSH descriptors) to each article
• MeSH descriptors are arranged in both an alphabetic list and a hierarchical structure.
• Updated annually to reflect the changes in medicine and medical terminology

Our Experimentation

Problems
• It is well known that searching by descriptors will greatly improve the search precision. However, it is very difficult for naïve users to know and use exact MeSH descriptors to search.
• In addition, as the database of MEDLINE grows, information overload would prevent users from finding relevant information of their interest.

Proposed Approach
• Categorizations according to MeSH terms, MeSH major topics, and the co-occurrence of MeSH descriptors
• Clustering using the results of MeSH term categorization through the Knowledge Grid
• Visualization of categories and hierarchical clusters

[Figure: Data Access Services]
[Figure: MeSH Major Topic Tree View]
[Figure: SOM Tree View]

Knowledge Grid

[Figure: Knowledge Grid Architecture.
High-level K-Grid layer: DAS (Data Access Service), TAAS (Tools and Algorithms Access Service), EPM (Execution Plan Management), RPS (Result Presentation Service).
Core K-Grid layer: KDS (Knowledge Directory Service), RAEM (Resource Allocation and Execution Management), KBR, KEPR, KMR.
Generic and Data Grid Services.]
Courtesy of Cannataro and Talia (Knowledge Grid: An Architecture for Distributed Knowledge Discovery)

Future Directions

• Develop a federated search service for OAI-compliant mathematical abstracts.
• Develop an ontology or conceptual maps for mathematics.
• Develop an ontology search service for mathematical abstracts and full papers.
• Develop an interoperable architecture with other services, such as OCR of mathematical formulas.

Acknowledgement

• Many thanks to the NSF NSDL Program.
• Collaborators: Joe Futrelle (NCSA), Ed Fox (Virginia Tech)
• Student Team: Hyunki Kim, Chee Yoong Choo, Xiaoou Fu, Yu Chen
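
Illustration: selective harvesting request

The Harvester interface described earlier takes a URL to harvest plus selective harvesting parameters. A minimal sketch of how such a request could be assembled is given here, assuming the standard OAI-PMH `ListRecords` verb with its `metadataPrefix`, `from`, `until`, and `set` arguments. The function name `build_list_records_url` and the base URL `http://example.org/oai` are hypothetical and are not taken from the DL Server itself.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords URL (hypothetical helper for illustration)."""
    # verb and metadataPrefix are required for an initial ListRecords request.
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # from / until / set are the optional selective harvesting arguments.
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Assumed example: harvest Dublin Core records of a "math" set added since 2004.
url = build_list_records_url("http://example.org/oai",
                             from_date="2004-01-01", set_spec="math")
print(url)
```

A real harvester would also need to follow resumption tokens and parse the XML response; both are omitted from this sketch.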
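
Illustration: Self-Organizing Map

The Self-Organizing Map slide describes a competitive, unsupervised mapping from a high-dimensional input space to a two-dimensional output space. The following is a small generic sketch of that idea in plain Python, not the SOM Categorizer used in the project. The toy term-presence vectors, the grid size, the learning-rate schedule, and the function names `train_som` and `best_matching_unit` are all assumptions chosen for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(weights,
               key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)))


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small rows x cols SOM on equal-length numeric vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # One weight vector per grid node, initialised randomly in [0, 1).
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        # Learning rate and neighbourhood radius shrink as training proceeds.
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.5)
        order = list(vectors)
        rng.shuffle(order)
        for x in order:
            # Competitive step: find the winning node for this input.
            bmu = best_matching_unit(weights, x)
            # Cooperative step: pull the winner and its grid neighbours toward x.
            for node, w in weights.items():
                grid_dist2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-grid_dist2 / (2.0 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * influence * (x[i] - w[i])
    return weights


# Assumed toy input: each row marks which of four terms occur in an abstract.
abstracts = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
som = train_som(abstracts, rows=2, cols=2)
placements = [best_matching_unit(som, a) for a in abstracts]
print(placements)  # one (row, col) grid cell per abstract
```

Each abstract ends up assigned to one cell of the two-dimensional grid, which is the kind of placement a concept-browsing interface can display. In the architecture described in the slides, the input vectors would come from the Noun Phraser and Input Vector Generator rather than from hand-written lists.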