Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness

K. Maly, M. Zubair, M. Nelson
Department of Computer Science, Old Dominion University, Norfolk, VA 23529
In Collaboration With Los Alamos National Laboratory (R. Luce) & American Physical Society (M. Doyle)
JISC/NSF PI Meeting, June 24-25

Motivation
Lack of a federation service that provides a unified interface to diverse collections in the physics domain whose metadata differ in richness, syntax, and semantics

Motivation
• Dissemination and discovery of physics resources
• Contributors: LANL, APS, AIP, CERN, researchers, teachers
• Users: students, teachers, researchers

Arc: The Basic Federation Engine
[Diagram. Labeled components: Harvester; User Interface; Data Normalization; Search Engine (Servlet); Cache; History Harvest; Daily Harvest; JDBC; Oracle; MySQL; Data Provider (two)]

Arc: The Basic Federation Engine
[Diagram. Labeled components: Grouper; Local Query Cache and Session Related Data; Session Manager; Database (Metadata & Index); Searcher; Displayer]

Challenges
• Resource Discovery
  – Diversity in metadata richness
  – Lack of controlled vocabulary
  – Ease of discovery (formula-based discovery)
  – Cross-linking support
  – Classification
• Creation and Maintenance
  – Freshness of metadata
  – Dynamic nature of collections
  – Filtering
• Economic Sustainability
  – Rights management
  – Who pays? For what?

Issues – No controlled vocabulary
• Different subject classifications
• Same authors but different rendering
• Same affiliation but different form

Interactive resource discovery approach: components
[Diagram. Labeled components: Harvester; Harvested Metadata; Index Generator for Union of Key Metadata Fields; Indexed field contents; Search Engine; User Interface]
1 The user interacts with the interface to identify all the collections to be searched and the options to apply.
2 The user executes the search based on the selected options.

Issues - Equation based search
• Representing the search query
• Rendering equations and embedding them into the HTML display
• Integrating into the search interface
• Identifying equations inside the metadata
• Filtering equations
• Equation storage
[Diagram. Labeled components: Servlet oai.search.Search; EqnSearch; Image Converter; DisplayEqn; Eqn2Gif; MathEqn; Formula Extractor; DC Metadata; cHotEqn; Eqn Data; Img2Gif; EqnExtractor; EqnRecorder; EqnCleaner; EqnFilter; Acme.JPM.Encoders.GifEncoder; Formula Filter]

Filtering Equations
• Errors in equation encoding, some examples:
  – missing "$" in LaTeX representation
  – illegal LaTeX symbols
• Simple equations like "n=3"

Filtering/categorizing Equations
Approach: use a "Stop Equation File", similar to the "Stop Word File" used for indexing. In the equation filtering context, the stop equation file consists of rules in the form of regular expressions that describe the LaTeX strings to be dropped. The regular expression approach makes it easy to describe a variety of strings to be filtered.

How to search for records using equations?
Three search alternatives (or any combination of these) for the user:
• Search for docs containing all formulae found in a) abstracts b) subject fields of documents containing user input 'keywords'
• Search for docs containing formulae defined by category (e.g. integrals, moments, limits)
• Browse formulae by various categorizations and search for docs containing selected formulae

Issues - Cross Linking References
• Obtaining references from full-text documents or parallel metadata sets
• Bad format of such references when obtained from full text
• Need a standard way to represent references across collections

Issues – Name similarity
• Authors use different names for themselves and their affiliation
• Could use authority files, but these are difficult to create and maintain across different collections

Similarity approach
Clustering: iterative refinement approach
• Coarse-level clusters based on approximate string matching (edit distance, Soundex, n-gram)
• Refining clusters based on affiliation where available
Presentation: allow the user to follow a search by clicking on authors and then selecting the appropriate one, i.e., no authority files

Homogenizing User Space
• Enabling Web users to discover information in OAI collections (DP-9 Service)
  – http://arc.cs.odu.edu:8080/dp9/
• Enabling OAI users to discover information in Web-enabled non-OAI-compliant collections/databases/web sites

DP-9 Service for Exposing OAI Collections to Web

Vac: Gateway Service for Harvesting Non-OAI Collections
[Diagram. Labeled components: Web Enabled Non-OAI Compliant Collections/Databases/Web Sites (three); WIDL Description (XML based language) (three); Gateway to Non-OAI Collections; OAI Service Provider]

Sample Description in WIDL of a Web enabled Non-OAI Collection
  <WIDL NAME="NonOAIGateway" Template="TRcollector" BASEURL="http://www.princeton.edu" VERSION="2.0">
    <SERVICE NAME="getURL" METHOD="GET" URL="" INPUT="" OUTPUT="urlOutput" />
    <BINDING NAME="urlOutput" TYPE="OUTPUT">
      <VARIABLE NAME="link" TYPE="String" REFERENCE="doc.p[1].text" />
      <VARIABLE NAME="title" TYPE="String" REFERENCE="title" />
      <VARIABLE NAME="author" TYPE="String" REFERENCE="author" />
      <VARIABLE NAME="description" TYPE="String" REFERENCE="abstract" />
    </BINDING>
  </WIDL>

Federation/archives Consistency
[Diagram. Same labeled components as the first "Arc: The Basic Federation Engine" slide: Harvester; User Interface; Data Normalization; Search Engine (Servlet); Cache; History Harvest; Daily Harvest; JDBC; Oracle; MySQL; Data Provider (two)]

Future Tasks
• Post-processing of search results for easier navigation
• Exploiting richer metadata and handling diversity in metadata across all participating collections
• Concentrate on interactive search interface for resource discovery
• Data normalization, authority files, filtering
• Investigating different schemes for maintaining federation/archives consistency
• More high-level services beyond formula-based search and cross-linking
• User testing!!!!

Links
• ODU DL research group:
  – http://dlib.cs.odu.edu/
• Main federation engine:
  – http://arc.cs.odu.edu/
• NSDL research:
  – http://archon.cs.odu.edu/
• ITR/IM research:
  – http://kepler.cs.odu.edu/

Not used
[Diagram. Labeled components: Los Alamos Collection; American Physical Society Collection; Arc Service Provider; TRI Service Provider; OAI Layer (four); Registration Server (XML mapping for each DP); Harvester; Harvested Metadata; Metadata Processor; Search Engine; User Interface; Unified and Normalized Metadata; Name authority file]
Automated metadata mapping approach
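
Illustration: stop equation file
The "Filtering/categorizing Equations" slide describes a stop equation file made of regular-expression rules that name the LaTeX strings to drop. The slides do not list the actual rules, so the sketch below is only one possible reading: the three rules, the function names, and the assumption that equations arrive with their "$" delimiters already stripped are all hypothetical.

```python
import re

# Hypothetical stop-equation rules (not taken from the slides).
# Each rule is a regular expression describing a LaTeX string to drop.
STOP_EQUATION_RULES = [
    r"^\s*[A-Za-z]\s*=\s*\d+(\.\d+)?\s*$",  # trivial assignments such as "n=3"
    r"^\s*\\?[A-Za-z]+\s*$",                # a lone symbol or command, e.g. "\alpha"
    r"^\s*\d+(\.\d+)?\s*$",                 # a bare number
]


def load_rules(lines):
    """Compile every non-empty line of a stop equation file into a regex."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def is_stop_equation(latex, rules):
    """Return True if any rule matches the LaTeX string."""
    return any(rule.search(latex) for rule in rules)


def filter_equations(equations, rules):
    """Keep only the equations that no stop rule matches."""
    return [eq for eq in equations if not is_stop_equation(eq, rules)]


rules = load_rules(STOP_EQUATION_RULES)
extracted = ["n=3", r"\alpha", "E = mc^2", r"\int_0^1 x^2 dx"]
kept = filter_equations(extracted, rules)
print(kept)
```

As with a stop word file, the rule list can grow without code changes: adding a line to the file is enough to drop a new family of uninteresting strings.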
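
Illustration: author name clustering
The "Similarity approach" slide describes coarse clusters built by approximate string matching, then refined by affiliation where available. The sketch below is a minimal, assumed version of that idea: it uses only edit distance (not Soundex or n-grams), a greedy single pass, and a hypothetical threshold of 2. The sample records, including "Example Institute", are invented for illustration.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalize(name):
    """Lowercase, drop punctuation, collapse whitespace."""
    kept = "".join(c for c in name.lower() if c.isalnum() or c.isspace())
    return " ".join(kept.split())


def coarse_clusters(records, max_dist=2):
    """Greedy pass: a record joins the first cluster whose first
    member's name is within max_dist edits (assumed threshold)."""
    clusters = []
    for rec in records:
        key = normalize(rec[0])
        for cluster in clusters:
            if edit_distance(key, normalize(cluster[0][0])) <= max_dist:
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


def refine_by_affiliation(cluster):
    """Split a coarse cluster when it holds several known affiliations."""
    by_affil = defaultdict(list)
    unknown = []
    for name, affil in cluster:
        if affil:
            by_affil[affil.lower()].append((name, affil))
        else:
            unknown.append((name, affil))
    groups = list(by_affil.values())
    if len(groups) <= 1:
        return [cluster]  # nothing to separate
    # Records without an affiliation stay in their own group.
    return groups + ([unknown] if unknown else [])


# Invented sample records: (author name, affiliation or None).
records = [
    ("M. Zubair", "Old Dominion University"),
    ("M. Zubiar", None),
    ("K. Maly", "Old Dominion University"),
    ("M Zubair", "Old Dominion University"),
    ("K. Maly", "Example Institute"),
]
coarse = coarse_clusters(records)
refined = [group for c in coarse for group in refine_by_affiliation(c)]
print(refined)
```

The refined groups are candidates to show the user rather than final identities, which matches the slide's point that the user selects the appropriate author instead of relying on authority files.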