Extracting XML from Unicorn with OAI and SRU
European Unicorn User Group Conference
Glasgow Caledonian University, September 7th & 8th, 2006
Benoit PAUWELS
Université Libre de Bruxelles (ULB), Brussels

Agenda
• Introduction
  – Unicorn interfaces
• Part 1: An OAI frontend for Unicorn
• Part 2: An SRU frontend for Unicorn
  – Short description of OAI and SRU protocols
  – Overview of technical implementation
  – Use cases and demos

Introduction
• OAI and SRU are ‘open’ protocols that permit exchange of metadata between information systems
• Well-known Unicorn interfaces:
  – Unicorn API server
  – Unicorn Webcat/iBistro/iLink server
  – Unicorn Z39.50 server
• All comply with the philosophy of request/response sequences

Unicorn interfaces: API server
[Diagram: client system, TCP/IP socket carrying API requests and API responses (data codes/values), API server (SirsiDynix) on the Unicorn server, catalogue database (records and indexes)]
• Client system: Character client, C Workflows client, Java Themes client
• Communication protocol: TCP/IP socket
• Information exchange protocol: proprietary SirsiDynix API requests/responses
• Returned record structure: proprietary SirsiDynix format (data codes and values)

Unicorn interfaces: iLink
[Diagram: client system, HTTP carrying iLink requests (URL) and HTML pages, Web server with iLink on the Unicorn server, catalogue database (records and indexes)]
• Client system: any Web browser
• Communication protocol: HTTP
• Information exchange protocol: URL requests / HTML responses
• Returned record structure: HTML

Unicorn interfaces: Z39.50
[Diagram: client system, Z39.50 requests and responses (MARC21), Z39.50 server on the Unicorn server, catalogue database (records and indexes)]
• Client system: any Z39.50 client
• Communication protocol: Z39.50 specific
• Information exchange protocol: Z39.50 specific
• Returned record structure: typically MARC21

Unicorn interfaces
• API: proprietary
  – low interoperability level
• HTML: record data not well structured
  – low reusability level
• Z39.50: protocol specific
  – more difficult to implement (steep learning curve)
  – Z39.50 is stateful
These interfaces are difficult to integrate into today’s web services environments, which favour:
• communication: use HTTP
• information exchange: use open protocols (like OAI and SRU)
• record data structure: use XML (according to a well-defined XML Schema)

2 new Unicorn interfaces
• HTTP / Open / XML
• OAI-PMH: Open Archives Initiative – Protocol for Metadata Harvesting
• SRU: Search and Retrieve via URL

OAI-PMH: the protocol
[Diagram: Service Provider, HTTP embedded OAI requests and responses, Data Provider (Web server, OAI frontend, document archive)]

OAI-PMH: the protocol
• ‘Harvester collects metadata from archives’
• Stateless protocol: sequence of OAI requests/responses over HTTP
• Just harvesting -- NOT searching

OAI-PMH: the protocol -- OAI requests
• HTTP GET|POST requests
• Syntax
  – BASE URL
    • host + port + path of OAI request handler
  – key=value pairs
• Examples:
  – http://www.cible.ulb.ac.be:80/cgi-bin/OAI20/catalog?verb=Identify
  – http://www.biomedcentral.com/oai/1.1/bmcoai.asp?verb=GetRecord&identifier=oai:bmc:1471-2105-11&metadataPrefix=oai_dc

OAI-PMH: the protocol -- OAI responses
• XML encoded bytestreams, containing the records
• Record = triplet
  – header (unique OAI identifier)
  – metadata
  – about
• Metadata schemes
  – XML Schema
  – Minimum: unqualified Dublin Core
  – Community specific
• Example of a record (catkey 450000 from ULB catalogue):
  – oai_dc, marc21, umods

OAI-PMH: the protocol -- Simple: 6 OAI requests/responses
(parameters follow each verb; square brackets mark optional parameters)
• Identify
  – http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=Identify
• ListMetadataFormats: [identifier]
  – http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListMetadataFormats
• ListSets
  – http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListSets
• GetRecord: identifier, metadataPrefix
  – http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:245000&metadataPrefix=marc21
• ListRecords: metadataPrefix, [from, until, set]
  – http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc
  – http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=mhld21&set=elper
  – http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=marc21&from=2006-08-01
• ListIdentifiers: metadataPrefix, [from, until, set]
  – http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListIdentifiers&metadataPrefix=oai_dc

OAI frontend for Unicorn
• Implementation of the data provider functionality (2001)
• http://www.openarchives.org/tools/tools.html: pick a template and interface with Unicorn through Unicorn database tools
• Our choice: Object Oriented Perl frontend (H. Suleman, Virginia Tech)

OAI frontend for Unicorn
[Diagram: HTTP embedded OAI request, HTTP server with CGI on the Unicorn server, OAI C wrapper, fork in ‘sirsi’ environment, OAI.pl, Unicorn database, HTTP embedded OAI response]
• call the appropriate OAI request handler
• retrieve metadata from Unicorn database
• format in XML

OAI frontend for Unicorn
Example: implementation of the GetRecord request
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:245000&metadataPrefix=oai_dc
1. Get metadata from Unicorn for catkey 245000
    $record = `echo $catkey | catalogdump -of | filtermarc -iALL -od -Ds`;
    @dates = split('\|', `echo $catkey | selcatalog -iK opr`);
2. Convert ANSEL character set into ISO-LATIN-1
3. Map from MARC to oai_dc
4. Format into XML

OAI frontend for Unicorn
Example: implementation of the ‘set’ parameter of the ListRecords request
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc&set=elper
• Precompile set as a file of catkeys
  – name of file: « <name of set>_catkeys »
    • einstein_albert_catkeys
    • elper_catkeys
    • sd_catkeys
    • all_catkeys
  – through periodic execution of « mkoaisets » custom report

OAI frontend for Unicorn
Example: implementation of the ‘from/until’ parameters of the ListRecords request
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc&from=2006-08-01&until=2006-08-31
• BRS index on creation/modification date?
• Every Unicorn record that gets created or modified is ‘touched’ in the ‘textedit’ and ‘browsedit’ directories
• Custom report ‘cadutext’
  – saves catkeys to <ud>/Savedkeys/adutext/rptid
  – adds line ‘rptid|date|status’ to <ud>/Lastruns/cadutext
• Example: « from=2006-08-01&until=2006-08-31 »
  – obtain report ids for all runs of cadutext after 2006-08-01 and before 2006-08-31 from the file <ud>/Lastruns/cadutext
  – for each of these report ids: obtain catkeys from <ud>/Savedkeys/adutext/rptid and save them to the randomnumber_catkeys file
  – sort and uniq the randomnumber_catkeys file

OAI frontend for Unicorn
• Limitations of implementation:
  – ListRecords/ListIdentifiers:
    • The from and until parameters are not permitted if the set parameter is given on the request
    • The from and until parameters are permitted if the set parameter is not given on the request, but their values should fall within a certain date range (at this moment arbitrarily set to ‘today - 2 months’ and ‘today’)
  – Deleted records
• Complete source code and documentation available on the API Repository (http://sirsiapi.org)

OAI frontend - use cases @ ULB
Use case 1: Vlink - OpenURL resolver system
• joint project with Vrije Universiteit Brussel (VUB)
[Diagram: OpenURL; ISI Web of Science, OVID, WebSpirs, ULB iLink, Elsevier ScienceDirect, JSTOR; Vlink and the Vlink knowledge base; HTML extended services]

OAI frontend - use cases @ ULB
Use case 1: Vlink - OpenURL resolver system
• OpenURL sent from iLink
  http://bibdev.vub.ac.be/cgi-bin/openurlulb?sid=ULB:Webcat&id=oai:ulbcat:617924
• This OpenURL does not contain enough metadata for the specific item
  ==> Vlink does a fetch back to Unicorn through an OAI GetRecord request to obtain a full MARC21 bibliographic description
  http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:617924&metadataPrefix=marc21

OAI frontend - use cases @ ULB
Use case 1: Vlink - OpenURL resolver system
• Feed Vlink Knowledge Base through OAI harvesting
[Diagram: Unicorn, OAI-PMH, Vlink, Vlink Knowledge Base]
  http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=mhld21&set=elper

OAI frontend - use cases @ ULB
Use case 2: Unicat - Virtual Union Catalog of Belgium
[Diagram: End User, HTML, Unicat WWW Gateway, Search/Browse indexes, Unicat Indexer, Union OAI Archive (Central Repository), Unicat Harvester, OAI and SRU links; data providers: University library catalogs (Unicorn, Aleph, VIRTUA, VUBIS), Public, Museum, Other]

SRU: the protocol
[Diagram: client system, HTTP carrying SRU requests and SRU responses (XML), Web server with SRU frontend on the Unicorn server, catalogue database (records and indexes)]
• Communication protocol: HTTP
• Information exchange protocol: SRU
• Returned record structure: XML

SRU: the protocol
• ‘Client searches and retrieves metadata records from an archive’
• Stateless protocol: sequence of SRU requests/responses over HTTP
• Search and Retrieve (<-> OAI: harvesting)

SRU: the protocol -- SRU requests
• HTTP GET requests
• Syntax
  – BASE URL
    • host + port + path of SRU request handler
  – key=value pairs
• 3 possible requests (operations)
  – explain
    • records the facilities available at an SRU server
    • used by clients to self-configure
    • returned explain record is in XML and follows the ZeeRex Schema
    • Example: http://z3950.loc.gov:7090/voyager?version=1.1&operation=explain
  – scan
    • allows the client to request a range of the available terms at a given point within a list of indexed terms
    • enables clients to present an ordered list of values and, if supported, how many hits there would be for a search on that term
  – searchRetrieve

SRU: the protocol -- searchRetrieve operation
• searchRetrieve (principal) parameters
  – version: version of the request; current protocol version: 1.1
  – query: query expressed in CQL
  – startRecord: position within the sequence of matched records of the first record to be returned
  – maximumRecords: number of records requested to be returned
  – recordSchema: schema requested for the records to be returned
  – stylesheet: URL for an XML stylesheet. The client requests that the server simply return this URL in the response.
• CQL
  « Traditionally, query languages have fallen into two camps: Powerful, expressive languages, not easily readable nor writable by non-experts (e.g. SQL, PQF, and XQuery); or simple and intuitive languages not powerful enough to express complex concepts (e.g. CCL and google). CQL tries to combine simplicity and intuitiveness of expression for simple, every day queries, with the richness of more expressive languages to accommodate complex concepts when necessary. » (http://www.loc.gov/standards/sru/cql)

SRU: the protocol -- searchRetrieve operation
Examples of CQL queries:
• Simple queries
    dinosaur
    title = "complete dinosaur"
    title exact "the complete dinosaur"
    dinosaur not reptile
    dinosaur and bird or dinobird
    publicationYear < 1980
• title all "complete dinosaur"
  title contains all of the words: ‘complete’, and ‘dinosaur’
• title any "dinosaur bird reptile"
  title contains any of the words: ‘dinosaur’, ‘bird’, or ‘reptile’
• ribs prox/distance<=5 chevrons
  a more specific proximity query: ‘ribs’ within 5 words of ‘chevrons’

SRU: the protocol -- searchRetrieve operation -- examples
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&query=author=einstein
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author=einstein
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author=einstein&recordSchema=dc
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author all "einstein albert"
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleCanevas.xsl
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl

SRU frontend for Unicorn
[Diagram: client system, HTTP carrying SRU requests and SRU responses (XML), Web server with SRU frontend on the Unicorn server, catalogue database (records and indexes)]

SRU frontend for Unicorn
[Diagram: client system, HTTP carrying SRU requests and SRU responses (XML), SRU/Z39.50 gateway on a Web server, Z39.50 requests and responses, Z39.50 frontend on the Unicorn server, catalogue database (records and indexes)]

SRU frontend for Unicorn
• SRU/Z39.50 Gateway: YAZ Proxy (Index Data)
  – Implemented at ULB: 7/2006 (2 days)
  – config.xml
      <target name="cible" default="1">
        <url>bib7.ulb.ac.be:2200</url>
        <xi:include href="explain.xml"/>
        <cql2rpn>pqf.properties</cql2rpn>
      </target>
      <target name="slavko" default="1">
        <url>velma.library.mun.ca:2200</url>
        <xi:include href="explain.slavko.xml"/>
        <cql2rpn>pqf.slavko.properties</cql2rpn>
      </target>
  – explain.xml
    • ZeeRex XML record as response to ‘explain’ operation
  – pqf.properties
    • specifies the mapping of various CQL indexes, relations, etc. into Type-1 query attributes

SRU frontend for Unicorn
• YAZ Proxy
  – http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
  – http://bib49.ulb.ac.be:9000/Slavko?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl

SRU frontend: use case @ ULB
• Seamless integration of catalog searches in CMS
• Typo3
• Example
  – HTML page containing biography of famous Belgian historian Henri Pirenne
  – frame pointing to the following URL:
    http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=pirenne%20and%20epub-dnu-*&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
• Project
  – Unicorn contains descriptions of databases, websites, etc. with local thematic classification codes in field 653
  – create thematic websites within our CMS, containing frames that list available databases per theme
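
Sketch: building OAI and SRU request URLs

Every OAI and SRU request in this talk has the same shape: a base URL followed by key=value pairs. A CMS frame such as the Pirenne example has to percent-encode the CQL query before placing it in the URL. The Python sketch below is one possible way to do that on the client side; it is not part of the ULB implementation. The helper names (build_request_url, oai_get_record_url, sru_search_url) are hypothetical; the base URLs, parameter names and example values are taken from the slides.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def build_request_url(base_url, params):
    """Append key=value pairs to a base URL, percent-encoding each value.

    Hypothetical helper, not part of any OAI or SRU library.
    """
    # Leave out optional parameters that were not supplied.
    supplied = {key: value for key, value in params.items() if value is not None}
    # quote_via=quote encodes spaces as %20 (as in the slides); ':', '/' and '*'
    # are legal inside a query string, so they are left readable.
    return base_url + "?" + urlencode(supplied, safe=":/*", quote_via=quote)


def oai_get_record_url(base_url, identifier, metadata_prefix):
    """URL of an OAI-PMH GetRecord request (both arguments are required)."""
    return build_request_url(base_url, {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })


def sru_search_url(base_url, cql_query, start_record=1, maximum_records=10,
                   record_schema=None, stylesheet=None):
    """URL of an SRU 1.1 searchRetrieve request for a CQL query."""
    return build_request_url(base_url, {
        "version": "1.1",
        "operation": "searchRetrieve",
        "maximumRecords": maximum_records,
        "startRecord": start_record,
        "query": cql_query,
        "recordSchema": record_schema,
        "stylesheet": stylesheet,
    })


if __name__ == "__main__":
    # The frame URL of the Henri Pirenne page.
    print(sru_search_url("http://bib49.ulb.ac.be:9000/Cible",
                         "pirenne and epub-dnu-*",
                         stylesheet="http://bib49.ulb.ac.be/cibleTypo3.xsl"))
    # The GetRecord request that Vlink sends back to Unicorn.
    print(oai_get_record_url("http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog",
                             "oai:ulbcat:617924", "marc21"))
    # '=' and '"' inside a CQL query are encoded, so the server can still
    # split the request into its key=value pairs.
    url = sru_search_url("http://bib49.ulb.ac.be:9000/Cible", "author=einstein")
    print(parse_qs(urlsplit(url).query)["query"])
```

Encoding each value separately keeps characters such as '=', '&' and '"' inside a CQL query from being mistaken for request syntax, which is why queries like title all "einstein albert" are safer to generate this way than to type by hand.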