Archon - A Digital Library that Federates Physics Collections
with Varying Degrees of Metadata Richness
Department of Computer Science
Old Dominion University, Norfolk, VA 23529
K. Maly, M. Zubair, M. Nelson
In Collaboration With
Los Alamos National Laboratory (R. Luce)
&
American Physical Society (M. Doyle)
JISC/NSF PI Meeting, June 24-25
Motivation
Lack of a federation service that
provides a unified interface to
diverse collections in the physics
domain having metadata that differ
in richness, syntax, and semantics
Motivation
Dissemination and discovery of Physics
resources
•Contributors
LANL, APS, AIP, CERN
researchers, teachers
•Users
Students, teachers, researchers
Basic Federation Engine
[Diagram: Arc architecture. Labels: Harvester (Daily Harvest, History Harvest), Data Providers, Data Normalization, Database (Metadata & Index; Oracle or MySQL via JDBC), Search Engine (Servlet), Session Manager, Searcher, Grouper, Displayer, Cache (Local Query Cache and Session Related Data), User Interface]
Challenges
• Resource Discovery
– Diversity in metadata richness
– Lack of controlled vocabulary
– Ease of discovering (formula based
discovery)
– Cross linking support
– Classification
• Creation and Maintenance
– Freshness of metadata
– Dynamic nature of collections
– Filtering
• Economic Sustainability
– Rights management
– Who pays? For what?
Issues – No controlled vocabulary
• Different subject classification
• Same authors but different rendering
• Same affiliation but different form
Interactive resource discovery approach components
[Diagram labels: Harvester, Harvested Metadata, Index Generator for Union of Key Metadata Fields, Indexed field contents, Search Engine, User Interface]
1. The user interacts with the interface to identify the collections to be
searched and the options to apply.
2. The user executes the search based on the selected
options.
Need interactive demo here
Issues - Equation based search
• Representing search query
• Rendering of equations and embedding them
into the HTML display.
• Integrating into search interface
• Identifying equations inside the metadata.
• Filtering equations.
• Equation storage
[Diagram: equation search components. Labels: Servlet, oai.search.Search, EqnSearch, Image Converter, DisplayEqn, Eqn2Gif, MathEqn, Formula Extractor, DC Metadata, cHotEqn, Eqn Data, Img2Gif, EqnExtractor, EqnRecorder, EqnCleaner, EqnFilter, Acme.JPM.Encoders.GifEncoder, Formula Filter]
Filtering Equations
• Errors in equation encoding, for example:
– missing "$" in the LaTeX representation
– illegal LaTeX symbols
• Simple equations like "n=3"
Filtering Equations
Approach:
Use a "Stop Equation File", similar to the "Stop Word File" used for
indexing.
In the equation filtering context, the stop equation file consists of
rules in the form of regular expressions that describe the LaTeX
strings to be dropped. The regular expression approach gives us
the flexibility to easily describe a variety of strings to be filtered.
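A minimal sketch of how such a stop equation file might be applied, written in Python. The rule contents, the function names, and the check for unbalanced "$" are assumptions made for illustration; the slides do not show the actual file format or the EqnFilter implementation.

```python
import re

# Hypothetical stop equation rules, one regular expression per line,
# in the spirit of a stop word file. These rules are illustrative only.
STOP_EQUATION_RULES = r"""
# a single letter assigned a number, e.g. n=3
^\s*[A-Za-z]\s*=\s*\d+(\.\d+)?\s*$
# a bare number
^\s*\d+(\.\d+)?\s*$
"""


def load_rules(text):
    """Compile every line that is neither empty nor a '#' comment."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            rules.append(re.compile(line))
    return rules


def has_balanced_dollars(latex):
    """True when the number of unescaped '$' delimiters is even."""
    return len(re.findall(r"(?<!\\)\$", latex)) % 2 == 0


def filter_equations(equations, rules):
    """Drop badly delimited strings and strings matched by a stop rule."""
    kept = []
    for eq in equations:
        if not has_balanced_dollars(eq):
            continue  # encoding error: missing "$"
        body = eq.strip().strip("$")
        if any(rule.search(body) for rule in rules):
            continue  # trivial equation listed in the stop file
        kept.append(eq)
    return kept


rules = load_rules(STOP_EQUATION_RULES)
candidates = ["$n=3$", "$E = mc^2$", "$\\int_0^1 x dx", "$42$"]
print(filter_equations(candidates, rules))
```

Keeping the rules in a separate text file means new classes of uninteresting equations can be excluded without touching the filter code.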
How to search for records using
equations?
Three search alternatives (or any combination of these) for the
user:
•Search for docs containing all formulae found in a)
abstracts b) subject fields of documents containing user
input ‘keywords’
•Search for docs containing formulae defined by category
(e.g. integrals, moments, limits)
• Browse formulae by various categorizations and search
for docs containing selected formulae
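The first alternative can be read as a two-step lookup: collect the formulae from documents whose abstract matches the keyword, then find documents containing those formulae. The sketch below illustrates this on a small in-memory collection. The record layout, document ids, function names, and the `require_all` switch are hypothetical and are not taken from the Archon implementation.

```python
# Hypothetical records: id -> (abstract text, set of LaTeX formulae).
RECORDS = {
    "doc1": ("Harmonic oscillator spectrum", {r"E = \hbar\omega"}),
    "doc2": ("Quantum field notes", {r"E = \hbar\omega", r"\int x dx"}),
    "doc3": ("Classical limits", {r"\lim_{x \to 0} f(x)"}),
}


def formulae_for_keyword(records, keyword):
    """Union of formulae from records whose abstract mentions the keyword."""
    needle = keyword.lower()
    found = set()
    for abstract, formulae in records.values():
        if needle in abstract.lower():
            found |= formulae
    return found


def docs_with_formulae(records, wanted, require_all=True):
    """Ids of records containing all (or any) of the wanted formulae."""
    if not wanted:
        return []  # no formulae found for the keyword, so nothing to match
    hits = []
    for doc_id, (_abstract, formulae) in records.items():
        if require_all:
            matched = wanted <= formulae
        else:
            matched = bool(wanted & formulae)
        if matched:
            hits.append(doc_id)
    return sorted(hits)


wanted = formulae_for_keyword(RECORDS, "oscillator")
print(docs_with_formulae(RECORDS, wanted))
```

Formulae are compared here as exact strings; a real service would presumably need to normalize equivalent LaTeX spellings first.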
Issues - Cross Linking References
• Obtaining references from full-text
documents or parallel metadata sets
• Bad format of such references when
obtained from full text
• Need a standard way to represent references across
collections
• Need image of similar links
Issues – Name similarity
• Authors use different names for themselves
and their affiliation
• Could use authority files, difficult to create
and maintain across different collections
Similarity approach
Clustering
Iterative refinement approach:
•Coarse level clusters based on approximate string matching (edit-distance, soundex, n-gram)
•Refining clusters based on affiliation where available
Presentation
Allow the user to follow up a search by clicking an author and then selecting the appropriate name variants, i.e., no authority files
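The iterative refinement can be sketched as two passes. The example below uses character bigram overlap, one of the approximate matching options listed above, for the coarse pass and then splits each cluster by affiliation. The names, affiliations, threshold value, and function names are all hypothetical, and records without an affiliation are simply kept in their own group.

```python
import re
from collections import defaultdict


def normalize(name):
    """Lowercase, drop punctuation, sort tokens: 'Smith, J.' == 'J. Smith'."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return " ".join(sorted(tokens))


def bigrams(text):
    """Set of overlapping two-character substrings."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def similarity(a, b):
    """Jaccard overlap of character bigrams, between 0 and 1."""
    ga, gb = bigrams(normalize(a)), bigrams(normalize(b))
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def coarse_clusters(records, threshold=0.4):
    """Greedy pass: join the first cluster whose first member is similar."""
    clusters = []
    for record in records:
        for cluster in clusters:
            if similarity(record[0], cluster[0][0]) >= threshold:
                cluster.append(record)
                break
        else:
            clusters.append([record])
    return clusters


def refine_by_affiliation(cluster):
    """Split a coarse cluster by affiliation where one is available."""
    groups = defaultdict(list)
    for name, affiliation in cluster:
        key = affiliation.lower() if affiliation else None
        groups[key].append((name, affiliation))
    return list(groups.values())


records = [
    ("Smith, J.", "Example University"),
    ("J. Smith", "Example University"),
    ("John Smith", None),
    ("A. Kumar", "Example University"),
]
for cluster in coarse_clusters(records):
    print(refine_by_affiliation(cluster))
```

The unresolved group is what the presentation step would hand to the user, who picks the matching variants instead of relying on an authority file.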
Homogenizing User Space
• Enabling Web users to discover
information in OAI collections (DP-9
Service)
• Enabling OAI users to discover
information in Web enabled non-OAI
compliant collections/databases/web sites
DP-9 Service for Exposing OAI
Collections to Web
Gateway Service for Harvesting Non-OAI
Collections
[Diagram: three Web-enabled non-OAI compliant collections/databases/web sites, each with a WIDL description (XML-based language), served by the Gateway to Non-OAI Collections and an OAI Service Provider]
Sample Description in WIDL of a Web enabled
Non-OAI Collection
<WIDL NAME="NonOAIGateway" Template="TRcollector"
BASEURL="http://www.princeton.edu" VERSION="2.0">
<SERVICE NAME="getURL" METHOD="GET" URL="" INPUT=""
OUTPUT="urlOutput" />
<BINDING NAME="urlOutput" TYPE="OUTPUT">
<VARIABLE NAME="link" TYPE="String" REFERENCE="doc.p[1].text" />
<VARIABLE NAME="title" TYPE="String" REFERENCE="title" />
<VARIABLE NAME="author" TYPE="String" REFERENCE="author" />
<VARIABLE NAME="description" TYPE="String"
REFERENCE="abstract" />
</BINDING>
</WIDL>
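To illustrate how a gateway might read such a description, the sketch below parses a cleaned-up copy of the sample with Python's standard XML parser and lists, for each service, the output variables and the page references they are bound to. The function name and the returned dictionary layout are assumptions; the slides do not describe how the gateway itself processes WIDL.

```python
import xml.etree.ElementTree as ET

# Cleaned-up, well-formed copy of the WIDL sample above.
WIDL_SAMPLE = """
<WIDL NAME="NonOAIGateway" Template="TRcollector"
      BASEURL="http://www.princeton.edu" VERSION="2.0">
  <SERVICE NAME="getURL" METHOD="GET" URL="" INPUT="" OUTPUT="urlOutput" />
  <BINDING NAME="urlOutput" TYPE="OUTPUT">
    <VARIABLE NAME="link" TYPE="String" REFERENCE="doc.p[1].text" />
    <VARIABLE NAME="title" TYPE="String" REFERENCE="title" />
    <VARIABLE NAME="author" TYPE="String" REFERENCE="author" />
    <VARIABLE NAME="description" TYPE="String" REFERENCE="abstract" />
  </BINDING>
</WIDL>
"""


def output_bindings(widl_text):
    """Map each service name to {variable name: reference} of its output binding."""
    root = ET.fromstring(widl_text.strip())
    result = {}
    for service in root.findall("SERVICE"):
        wanted = service.get("OUTPUT")
        for binding in root.findall("BINDING"):
            if binding.get("NAME") == wanted:
                result[service.get("NAME")] = {
                    var.get("NAME"): var.get("REFERENCE")
                    for var in binding.findall("VARIABLE")
                }
    return result


for service, variables in output_bindings(WIDL_SAMPLE).items():
    for name, reference in variables.items():
        print(service, name, "<-", reference)
```

The variable names (title, author, description) line up naturally with Dublin Core elements, which is presumably what lets the gateway expose the extracted fields as OAI metadata.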
Federation/archives Consistency
Synchronous
Model
List
Asynchronous
Order
Pull
Action
Retrieval
Model
(n/a)
Push
Action
(n/a)
Register -> Subscribe
Notify
->
Retrieval
Table 2. Content Delivery Model
Register ->
Notify
->
Delivery
Future Tasks
• Post processing of search results for easier navigation
• Exploiting richer metadata and handling diversity in
metadata across all participating collections
• Concentrate on interactive search interface for resource
discovery
• Data normalization, authority files, filtering
• Investigating different schemes for maintaining
federation/archives consistency
• More high level services beyond formula based search
and cross-linking
• User testing!!!!
Links
• Main ODU DL research: http://dlib.cs.odu.edu
• Main federation engine: http://arc.cs.odu.edu
• Main NSDL research: http://archon.cs.odu.edu
• Main ITR/IM research: http://kepler.cs.odu.edu
Not used
NSDL Physics Collection – Prototype in Development
[Diagram labels: Los Alamos Collection, American Physical Society Collection, Arc Service Provider, TRI Service Provider, four OAI Layers. Legend: Web enabled non-OAI Collection, OAI Collection]
Automated metadata mapping approach
[Diagram labels: Registration Server (XML mapping for each DP), Harvester, Harvested Metadata, Metadata Processor, Name authority file, Unified and Normalized Metadata, Search Engine, User Interface]