XML Structured Document Retrieval and Distributed Resource Discovery

Ray R. Larson
School of Information Management & Systems
University of California, Berkeley
[email protected]

NASA Ames Lecture -- August 22, 2001

Context
• NSF/JISC International Digital Library Grant
  – Cross-Domain Resource Discovery: Integrated Discovery and Use of Textual, Numeric and Spatial Data
• UC Berkeley DLI2 Grant:
  – ReInventing Scholarly Information Access
• UC Berkeley working with the University of Liverpool/Manchester Computing, with participation from:
  – De Montfort University (MASTER)
  – Arts and Humanities Data Service (http://ahds.ac.uk/)
    • OTA (Oxford), HDS (Essex), PADS (Glasgow), ADS (York), VADS (Surrey & Northumbria)
  – Consortium of University Research Libraries (CURL)
  – UC Berkeley Library (and California Digital Library)
    • Making of America II
    • Online Archive of California
  – British Natural History Museum, London
  – NESSTAR (NEtworked Social Science Tools and Resources)

Research Areas
• Goals are:
  – Practical application of existing DL technologies to some large-scale cross-domain collections
  – Theoretical examination and evaluation of next-generation designs for systems architecture and distributed cross-domain searching for DLs

Approach
• For the first goal, we are implementing a distributed search system based on international standards (Z39.50 and SGML/XML) using the Cheshire II information retrieval system
• Databases include:
  – HE Archives Hub
  – Arts and Humanities Data Service (AHDS)
  – MASTER
  – CURL (Consortium of University Research Libraries)
  – Online Archive of California (OAC)
  – Making of America II (MOA2)

Current Usage of Cheshire II
• Web clients for:
  – Berkeley NSF/NASA/ARPA Digital Library
  – World Conservation Digital Library
  – SunSite (UC Berkeley Science Libraries)
  – University of Liverpool
  – Higher Education Archives Hub
    • Glasgow, Edinburgh, Bath, Liverpool, Kings College London, University College London, Nottingham, Durham, School of Oriental and African Studies, Manchester, Southampton, Warwick and others (to be expanded)
  – University of Essex, HDS (part of AHDS)
  – Oxford Text Archive (test only)
  – California Sheet Music Project
  – Cha-Cha (Berkeley Intranet Search Engine)
  – Berkeley Metadata project cross-language demo
  – Univ. of Virginia (test implementations)
• The Cheshire ranking algorithm is the basis for the original Inktomi

Current and Upcoming Usage of Cheshire II
• DIEPER (Digitized European Periodicals) project
  – http://gdz.sub.uni-goettingen.de/dieper/
• NESSTAR (Networked Social Science Tools and Resources)
  – http://www.nesstar.org/
• FASTER (Flexible Access to Statistics Tables and Electronic Resources; continuation of NESSTAR)
  – http://www.faster-data.org/
• MASTER (Manuscript Access through Standards for Electronic Records)
  – http://www.cta.dmu.ac.uk/projects/master/

Upcoming Usage of Cheshire II
• ZETOC (prototype of the Electronic Table of Contents from the British Library)
  – http://zetoc.mimas.ac.uk/
• Archives Hub
  – http://www.archiveshub.ac.uk/
• RSLP Palaeography project
  – http://www.palaeography.ac.uk/
• British Natural History Museum, London
• JISC data services directory hosted by MIMAS
• Resource Discovery Network (RDN), where it will be used to harvest RDN records from the various hubs using OAI and provide search
Client/Server Architecture
• Server supports:
  – Database storage
  – Indexing
  – Z39.50 access to local data
  – Boolean and probabilistic searching
  – Relevance feedback
  – External SQL database support
• Client supports:
  – Programmable (Tcl/Tk) graphical user interface
  – Z39.50 access to remote servers
  – SGML/XML & MARC formatting
• Combined client/server CGI scripting via WebCheshire is used for web applications

SGML/XML Support
• The underlying native format for all data is SGML/XML
• The DTD defines the file format for each file
• Full SGML/XML parsing
• XML configuration files define the database
• USMARC DTD and MARC-to-SGML conversion (and back again)
• Access to full text via special SGML tags
• Support for SGML/XML component definition and indexing

SGML/XML Support
• Example XML record for a DL document:

<ELIB-BIB>
  <BIB-VERSION>ELIB-v1.0</BIB-VERSION>
  <ID>756</ID>
  <ENTRY>June 12, 1996</ENTRY>
  <DATE>June 1996</DATE>
  <TITLE>Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada</TITLE>
  <ORGANIZATION>University of California</ORGANIZATION>
  <TYPE>report</TYPE>
  <AUTHOR-INSTITUTIONAL>USDA Forest Service</AUTHOR-INSTITUTIONAL>
  <AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL>
  <AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL>
  <AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL>
  <PROJECT>SNEP</PROJECT>
  <SERIES>Vol 3</SERIES>
  <PAGES>40</PAGES>
  <TEXT-REF>/elib/data/docs/0700/756/HYPEROCR/hyperocr.html</TEXT-REF>
  <PAGED-REF>/elib/data/docs/0700/756/OCR-ASCII-NOZONE</PAGED-REF>
</ELIB-BIB>
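Because every record is native XML, field extraction can be illustrated with any standard XML parser; the following minimal Python sketch pulls indexable fields out of an abridged copy of the ELIB-BIB record above (the parsing code is ours for illustration, not Cheshire's actual indexer):

```python
import xml.etree.ElementTree as ET

# abridged from the example ELIB-BIB record above
record = """<ELIB-BIB>
  <ID>756</ID>
  <TITLE>Cumulative Watershed Effects</TITLE>
  <AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL>
  <AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL>
  <AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL>
</ELIB-BIB>"""

root = ET.fromstring(record)
doc_id = root.findtext("ID")                      # text of the <ID> element
title = root.findtext("TITLE")
authors = [a.text for a in root.findall("AUTHOR-PERSONAL")]  # repeatable field
```

A DTD-driven system would know which elements are repeatable and which are full-text; here that knowledge is hard-coded.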
SGML/XML Support
• Example SGML/MARC record:

<USMARC Material="BK" ID="00000003">
<Leader><LRL>00722</LRL><RecStat>n</RecStat><RecType>a</RecType>
<BibLevel>m</BibLevel><UCP></UCP><IndCount>2</IndCount><SFCount>2</SFCount>
<BaseAddr>00229</BaseAddr><EncLevel> </EncLevel><DscCatFm></DscCatFm>
<LinkRec></LinkRec><EntryMap><FLength>4</FLength><SCharPos>5</SCharPos>
<IDLength>0</IDLength><EMUCP></EMUCP></EntryMap></Leader>
<Directry>00100140000000500170001400800410003101000140007203500200008603500170010610000190012324501050014225000110024726000320025830000330029050400500032365000360037370000220040970000220043195000320045399800070
0485</Directry>
<VarFlds><VarCFlds><Fld001>CUBGGLAD1282B</Fld001>
<Fld005>19940414143202.0</Fld005>
<Fld008>830810 1983 nyu eng u</Fld008></VarCFlds>
<VarDFlds><NumbCode><Fld010 I1="Blank" I2="Blnk"><a>82019962 </a></Fld010>
<Fld035 I1="Blank" I2="Blnk"><a>(CU)ocm08866667</a></Fld035>
<Fld035 I1="Blank" I2="Blnk"><a>(CU)GLAD1282</a></Fld035></NumbCode>
<MainEnty><Fld100 NameType="Single" I2=""><a>Burch, John G.</a></Fld100></MainEnty>
<Titles><Fld245 AddEnty="Yes" NFChars="0"><a>Information systems :</a>
<b>theory and practice /</b><c>John G. Burch, Jr., Felix R. Strater, Gary Grudnitski</c></Fld245></Titles>
<EdImprnt><Fld250 I1="Blank" I2="Blnk"><a>3rd ed</a></Fld250>
<Fld260 I1="" I2="Blnk"><a>New York :</a><b>J. Wiley,</b><c>1983</c></Fld260></EdImprnt>
<PhysDesc><Fld300 I1="Blank" I2="Blnk"><a>xvi, 632 p. :</a><b>ill. ;</b><c>24 cm</c></Fld300></PhysDesc>
<Series></Series>
<Notes><Fld504 I1="Blank" I2="Blnk"><a>Includes bibliographical references and index</a></Fld504></Notes>
<SubjAccs><Fld650 SubjLvl="NoInfo" SubjSys="LCSH"><a>Management information systems.</a></Fld650> ...

SGML/XML Support
• Configuration files for the server are also SGML/XML:
  – They include tags describing all of the data files and indexes for the database.
  – They also include instructions on how data is to be extracted for indexing, and how Z39.50 attributes map to the indexes for a given database.
  – They include definitions of components and component indexes.

Component Extraction and Retrieval
• Any sub-element of an SGML/XML document can be defined as a separately indexed "component".
• Components can be ranked and retrieved independently of the source document (but linked back to their original source).
• For example, paragraphs and abstracts in the full text of documents could be defined as components to provide paragraph-level search.
• Example: the Glasier archive...

Component Extraction and Retrieval
• The Glasier archive is an EAD document (1.9 MB in size)
• It contains series-, subseries-, and item-level descriptions of things in the archive

Excerpt from Glasier Archive

<c level="subseries">
  <did>
    <head>GP-1-1: General correspondence. Public letters.</head>
    <unitid id="gp-1-1">GP-1-1</unitid>
    <unittitle>Glasier Papers. General correspondence. Public letters.</unittitle>
  </did>
  <arrangement>
    <head>Arrangement</head>
    <p>Public letters arranged alphabetically within each year</p>
  </arrangement>
  <c level="item" langmaterial="eng">
    <did>
      <unitid id="gp-1-1-0001">GP-1-1-0001</unitid>
      <unittitle>Letter from Richard Murray. <geogname>Glasgow</geogname>;
        <unitdate>7 Apr 1879</unitdate>.</unittitle>
      <origination><persname>Murray, Richard</persname></origination>
      <physdesc><extent>1 letter</extent></physdesc>
    </did>
    <note><p>Employment reference for J.B.G. as draughtsman
      <subject>Glasier, John Bruce</subject></p></note>
  </c>
  ETC....
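In this spirit, pulling each item-level <c> element out of an EAD file so it can be indexed as a component might look like the following sketch, run against a trimmed version of the excerpt above (illustrative only; this is not Cheshire's actual component extractor):

```python
import xml.etree.ElementTree as ET

# trimmed from the Glasier EAD excerpt above
ead = """<c level="subseries">
  <did><unitid id="gp-1-1">GP-1-1</unitid></did>
  <c level="item" langmaterial="eng">
    <did>
      <unitid id="gp-1-1-0001">GP-1-1-0001</unitid>
      <unittitle>Letter from Richard Murray</unittitle>
    </did>
  </c>
</c>"""

root = ET.fromstring(ead)
# every <c level="item"> becomes a separately indexable "component",
# keyed by its unitid so it can be linked back to the source document
components = {
    c.findtext("did/unitid"): c.findtext("did/unittitle")
    for c in root.iter("c")
    if c.get("level") == "item"
}
```

The enclosing subseries <c> is skipped by the level="item" filter, matching the COMPSTARTTAG definition shown on the next slide.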
Example Component Definition

<COMPONENTS>
  <COMPONENTDEF>
    <COMPONENTNAME>
      /home/ray/Work/Glasier_test/indexes/COMPONENT_DB1
    </COMPONENTNAME>
    <COMPONENTNORM>NONE</COMPONENTNORM>
    <COMPSTARTTAG>
      <tagspec>
        <FTAG> c </FTAG><ATTR> level <VALUE>item</VALUE></ATTR>
      </tagspec>
    </COMPSTARTTAG>
    <COMPONENTINDEXES>
      <!-- First index def -->
      ...

Components
• Both individual tags and "ranges" with a starting tag and a (different) ending tag can be used as components
• Components permit parts of complex SGML/XML documents to be treated as separate documents

Cheshire II Searching
[Diagram: local and remote clients search text and scanned-image databases over the Internet via Z39.50]

Boolean Search Capability
• All Boolean operations are supported
  – "zfind author x and (title y or subject z) not subject A"
• Named sets are supported and stored on the server
• Boolean operations between stored sets are supported
  – "zfind SET1 and subject widgets or SET2"
• Nested parentheses and truncation are supported
  – "zfind xtitle Alice#"

Probabilistic Models
• A rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
• Retrieved documents are ranked according to this probability of relevance (the Probability Ranking Principle)
• Relies on accurate estimates of probabilities
Probability Ranking Principle
• "If a reference retrieval system's response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."
  -- Stephen E. Robertson, J. Documentation, 1977

Probabilistic Models: Logistic Regression
• Estimates of relevance are based on a log-linear model with various statistical measures of document content as independent variables.

The log odds of relevance is a linear function of attributes:

  log O(R | q_i, d_j, t_k) = c_0 + c_1*v_1 + c_2*v_2 + ... + c_n*v_n

Term contributions are summed:

  log O(R | q_i, d_j) = sum_{k=1..m} [ log O(R | q_i, d_j, t_k) - log O(R) ]

The probability of relevance is the inverse of the log odds:

  P(R | q_i, d_j) = 1 / (1 + e^(-log O(R | q_i, d_j)))

[Figure: logistic estimate of relevance (0-100) plotted against term frequency in document (0-60)]

Probabilistic Retrieval: Logistic Regression
• The probability of relevance is based on logistic regression from a sample set of documents, used to determine the values of the coefficients (trained on TREC data). At retrieval time the probability estimate is obtained from:

  log O(R | Q, D) = c_0 + sum_{i=1..6} c_i * X_i

for the six X attribute measures shown on the next slide.
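A minimal Python sketch of how the formulas above fit together at retrieval time; the per-term log odds and the prior log O(R) here are invented numbers for illustration, not trained Cheshire coefficients:

```python
import math

def probability_of_relevance(per_term_log_odds, prior_log_odds):
    """Combine per-term log odds into P(R | q_i, d_j) per the slide formulas."""
    # log O(R|q,d) = sum_k [ log O(R|q,d,t_k) - log O(R) ]
    log_odds = sum(lo - prior_log_odds for lo in per_term_log_odds)
    # invert the log odds: P = 1 / (1 + e^(-log O))
    return 1.0 / (1.0 + math.exp(-log_odds))

# two matching terms, each slightly favoring relevance over an invented prior
score = probability_of_relevance([0.4, 0.7], prior_log_odds=0.1)
```

With no evidence (an empty term list, or terms exactly at the prior) the log odds sum to zero and the estimate is 0.5, as the inverse-logit form requires.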
Probabilistic Retrieval: Logistic Regression Attributes

  X_1 = (1/M) * sum_{j=1..M} log QAF_{t_j}    (Average Absolute Query Frequency)
  X_2 = sqrt(QL)                              (Query Length)
  X_3 = (1/M) * sum_{j=1..M} log DAF_{t_j}    (Average Absolute Document Frequency)
  X_4 = sqrt(DL)                              (Document Length)
  X_5 = (1/M) * sum_{j=1..M} log IDF_{t_j}    (Average Inverse Document Frequency)
        with IDF_{t_j} = (N - n_{t_j}) / n_{t_j}   (Inverse Document Frequency)
  X_6 = log M    (Number of terms in common between query and document, logged)

Cheshire Probabilistic Retrieval
• Uses the logistic regression ranking method developed at Berkeley, with a new algorithm for weight calculation at retrieval time
• The Z39.50 "relevance" operator is used to indicate a probabilistic search
• Probabilistic searching can be performed on any index:
  – zfind topic @ "cheshire cats, looking glasses, march hares and other such things"
  – zfind title @ caucus races
• Boolean and probabilistic elements can be combined:
  – zfind topic @ government documents and title guidebooks

Combining Search Types
• It is also possible to combine the results of multiple independent searches into a single result set (using the Z39.50 SORT service of the Cheshire system), e.g.:
  – Search of full text (probabilistic)
  – Search of full text (Boolean)
  – Search of components (probabilistic)
  – Search of titles (probabilistic)
  – Search of subject headings (probabilistic)
• All result sets are merged and re-ranked to produce the final list.

Relevance Feedback
• Any records in a result set can be used for relevance feedback
• Uses the "set name" to receive feedback instructions:
  – zfind SET1:2,5-9,30,45
  – zfind SET2:6
• Chosen records are used to build a new probabilistic query
• Ranked results are returned
• Planned support for (modified) Rocchio RF
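The planned Rocchio-style feedback can be sketched as a textbook Rocchio update over term-weight dictionaries, with the conventional alpha/beta/gamma values; the modified form Cheshire will actually use is not specified here, so treat this as a generic illustration:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Build a new query vector from the original query plus judged documents.

    All vectors are {term: weight} dicts; relevant/nonrelevant are lists of
    such dicts (the records the user selected from the result set).
    """
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_query = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:       # move toward the centroid of chosen records
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:    # move away from the centroid of rejected records
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        new_query[t] = max(w, 0.0)   # negative weights are clamped to zero
    return new_query

new_q = rocchio({"cat": 1.0}, relevant=[{"cat": 1.0, "hat": 1.0}], nonrelevant=[])
```

Terms from the fed-back records ("hat" above) enter the new query even though the user never typed them, which is the point of the technique.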
Cheshire II -- Two-Stage Retrieval (EVM Generation)
• Example: using the LC Classification System
  – A pseudo-document is created for each LC class, containing terms derived from "content-rich" portions of the documents in that class (subject headings, titles, etc.)
  – This permits searching by any term in the class
  – Ranked probabilistic retrieval techniques attempt to present the "best matches" to a query first
  – The user selects classes to feed back for the "second stage" search of documents (which includes information from the first-stage selections)
• Can be used with any classified/indexed collection and controlled vocabulary

Automatic Class Assignment
• Automatic class assignment: polythetic, exclusive or overlapping, usually ordered; clusters are order-independent and usually based on an intellectually derived scheme
  1. Create pseudo-documents representing the intellectually derived classes
  2. Search using the document's contents
  3. Obtain a ranked list
  4. Assign the document to the N categories ranked over a threshold, OR assign it to the top-ranked category

Cheshire II -- Cluster Generation
• Define the basis for clustering records:
  – Select the field to form the basis of the cluster
  – Select the evidence fields to use as contents of the pseudo-documents
• During indexing, cluster keys are generated with basis and evidence from each record
• Cluster keys are sorted and merged on the basis, and a pseudo-document is created for each unique basis element, containing all evidence fields
• Pseudo-documents (class clusters) are indexed on the combined evidence fields

Cheshire II -- Two-Stage Retrieval
• Using the MeSH subject heading system
  – A pseudo-document is created for each MeSH heading, containing terms derived from "content-rich" portions of the documents in that class (other subject headings, titles, abstracts, etc.)
  – This permits searching by any term in the class
  – Ranked probabilistic retrieval techniques attempt to present the "best matches" to a query first
  – The user selects classes to feed back for the "second stage" search of documents
• Can be used with any classified/indexed collection

Distributed Search: The Problem
• Hundreds or thousands of servers, with databases ranging widely in content, topic, and format
  – Broadcast search is expensive, both in bandwidth and in processing too many irrelevant results
  – How to select the "best" ones to search?
    • What to search first
    • Which to search next
  – Topical/domain constraints on the search selections
  – Variable contents of databases (metadata only, full text, ...)

An Approach for Cross-Domain Resource Discovery
• MetaSearch
  – A new approach to building metasearch based on Z39.50
  – Instead of broadcast search, we use two Z39.50 services:
    • Identification of database metadata using Z39.50 Explain
    • Extraction of distributed indexes using Z39.50 SCAN
• Evaluation
  – How efficiently can we build distributed indexes?
  – How effectively can we choose databases using the index?
  – How effective is merging search results from multiple sources?
  – Hierarchies of servers (general / meta-topical / individual)?

Z39.50 Overview
[Diagram: a client UI maps the user's query into Z39.50, the query travels over the Internet to a remote search engine, and the results are mapped back for display in the UI]

Z39.50 Explain
• Explain supports searches for:
  – Server-level metadata
    • Server name
    • IP addresses
    • Ports
  – Database-level metadata
    • Database name
    • Search attributes (indexes and combinations)
  – Support metadata (record syntaxes, etc.)
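The MetaSearch approach just outlined (Explain to discover each database's indexes, SCAN to harvest its terms) can be sketched as merging per-database term/frequency lists into one term -> {database: frequency} map, which then ranks databases for a query term. The database names and term lists below are invented, and the real system stores the harvested terms in XML "collection documents" rather than a Python dict:

```python
# term/frequency lists as SCAN might return them, per database and index
scan_results = {
    "db1": {"topic": [("cat", 706), ("catalogu", 37)]},
    "db2": {"title": [("cat", 27), ("catalan", 19)]},
}

# merge everything into a single distributed index: term -> {db: freq}
distributed_index = {}
for db, indexes in scan_results.items():
    for index_name, terms in indexes.items():
        for term, freq in terms:
            per_db = distributed_index.setdefault(term, {})
            per_db[db] = per_db.get(db, 0) + freq

# pick the most promising databases for a query term by harvested frequency
ranked = sorted(distributed_index.get("cat", {}).items(),
                key=lambda kv: kv[1], reverse=True)
```

The query-time half of the problem (choosing the "best" databases) then becomes a lookup in this merged structure instead of a broadcast search.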
Z39.50 SCAN
• Originally intended to support browsing
• The query consists of:
  – Database
  – Attributes plus term (i.e., index and start point)
  – Step size
  – Number of terms to retrieve
  – Position in response set
• Results:
  – Number of terms returned
  – List of terms and their frequency in the database (for the given attribute combination)

Z39.50 SCAN Results
• Syntax: zscan indexname term stepsize number_of_terms pref_pos

  % zscan topic cat 1 20 1
  {SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}}
  {cat 706} {cat-and-mouse 19} {cat-burglar 1} {cat-carrying 1}
  {cat-egory 1} {cat-fight 1} {cat-gut 1} {cat-litter 1}
  {cat-lovers 2} {cat-pee 1} {cat-run 1} {cat-scanners 1} ...

  % zscan title cat 1 20 1
  {SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}}
  {cat 27} {cat-fight 1} {catalan 19} {catalogu 37}
  {catalonia 8} {catalyt 2} {catania 1} {cataract 1}
  {catch 173} {catch-all 3} {catch-up 2} ...

MetaSearch Server Index Creation
• For all servers, or a topical subset:
  – Get Explain information (especially DC mappings)
  – For each index (or each DC index):
    • Use SCAN to extract terms and frequencies
    • Add term + frequency + source index + database metadata to the metasearch "collection document" (XML)
  – Planned extensions:
    • Post-process indexes (especially geographic names, etc.) for special types of data
      – e.g. create "geographical coverage" indexes

MetaSearch Approach
[Diagram: the MetaSearch server sends Explain and SCAN queries over the Internet to the search engines of the individual databases and stores the mapped results in its distributed index; user queries are then mapped against that index]

Known Problems
• Not all Z39.50 servers support SCAN or Explain
• Solutions:
  – Probing for attributes instead of Explain (e.g.
    DC attributes or analogs)
  – We also support OAI and can extract OAI metadata from servers that support it
• Collection documents are static and need to be replaced when the associated collection changes

Evaluation
• Test environment
  – TREC Tipster and FT data (approx. 3.5 GB)
  – Partitioned into 236 smaller collections based on source and (for TIPSTER) date by month (the distributed search testbed built by French et al.)
    • High size variability (ranging from 1 to thousands of docs)
    • 21,225,299 words and 142,345,670 characters total for the harvested records
• Efficiency (old data)
  – Average of 23.07 seconds per database to SCAN each database (3.4 indexes on average)
  – Average of 14.07 seconds excluding FT (131 seconds for the FT database, which has 7 indexes)
  – We are now collecting more information, so harvest times are longer, but still under one minute on average

Evaluation
• Effectiveness
  – Still working on an evaluation comparing our DB ranking with the TIPSTER relevance judgements
  – Can be compared with published selection methods (CORI, GlOSS, etc.) using the same testbed

Future
• Testing of variant algorithms for ranking collections
• Application to real systems and testing in a production environment (Archives Hub)
• Logically clustering servers by topic
• Meta-meta servers (treating the MetaSearch database as just another database)

Distributed Metadata Servers
[Diagram: general servers sit above meta-topical servers (with replicated servers), which in turn cover the individual database servers]

Conclusion
• A lot of interesting work to be done:
  – Redesign and development of the Cheshire II system
  – Evaluating new meta-indexing methods
  – Developing and evaluating methods for merging cross-domain results (or, perhaps, deciding when to keep them separate)
Further Information
• The full Cheshire II client and server source is available at ftp://cheshire.berkeley.edu/pub/cheshire/
  – Includes HTML documentation
  – Also on the Berkeley Digital Library Software Distribution CD
• Project web site: http://cheshire.berkeley.edu/