Download CHESHIRE II - University of California, Berkeley

Cheshire II: Recent Additions & Cheshire III: Design and System Overview Ray R. Larson School of Information Management and Systems University of California, Berkeley October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Overview • Cheshire II – Feature overview – Current usage – Recent Additions • • • • • • Distributed Search and Indexing Geographic Operators and Search Ranking XML Schemas and Element Retrieval MySQL and PostgreSQL interfaces CORI, Okapi BM-25 ranking algorithms Result Set sorting, merging and ranking operators, bitmapped indexes • Cheshire III Design and Development October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Overview of Cheshire II • It supports SGML and XML • It is a client/server application • Uses the Z39.50 Information Retrieval Protocol, support for SRW, OAI, SOAP, SDLIP also implemented • Server supports a Relational Database Gateway • Supports Boolean searching of all servers • Supports probabilistic ranked retrieval in the Cheshire search engine as well as Boolean and proximity search • Search engine supports ``nearest neighbor'' searches and relevance feedback • GUI interface on X window displays and Windows NT • WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire • Scriptable clients using Tcl and (new) Python • Store SGML/XML as files or “Datastore” database October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Current Usage • Over 100 Databases in the UK, including – – – – AHDS/History Data Service Mersey Libraries ZETOC Archives Hub • Distributed Archives Hub – JISC Resource Discovery Network (RDN) • (OAI-MHP Harvesting with Cheshire Search) – Planned use with TEL being developed by the BL • Also being used at Harvard and Berkeley • California Sheet Music Project • Los Alamos National Lab (genomics metadata) October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Distributed Search October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson The Problem • The Digital Library vision -- Access to everyone for “all human knowledge” • Lyman and Varian’s estimates of the “Dark Web” • Hundreds or Thousands of servers with databases ranging widely in content, topic, format – Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results – How to select the “best” ones to search? • Which resource to search first? • Which to search next if more is wanted? – Topical /domain constraints on the search selections – Variable contents of database (metadata only, full text, multimedia…) October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Distributed Search Tasks • Resource Description – How to collect metadata about digital libraries and their collections or databases • Resource Selection – How to select relevant digital library collections or databases from a large number of databases • Distributed Search – How to perform parallel or sequential searching over the selected digital library databases • Data Fusion – How to merge query results from different digital libraries with their different search engines, differing record structures,CDLetc. -- Cheshire II & III -- Ray R. Larson October 3, 2003 An Approach for Distributed Resource Discovery • Distributed resource representation and discovery – New approach to building resource descriptions based on Z39.50 – Instead of using broadcast search across resources we are using two Z39.50 Services • Identification of database metadata using Z39.50 Explain • Extraction of distributed indexes using Z39.50 SCAN • Evaluation – How efficiently can we build distributed indexes? – How effectively can we choose databases using the index? – How effective is merging search results from multiple sources? – Can we build hierarchies of servers (general/metatopical/individual)? October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Z39.50 Explain • Explain supports searches for – Server-Level metadata • Server Name • IP Addresses • Ports – Database-Level metadata • Database name • Search attributes (indexes and combinations) – Support metadata (record syntaxes, etc) October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Z39.50 SCAN • Originally intended to support Browsing • Query for – – – – – Database Attributes plus Term (i.e., index and start point) Step Size Number of terms to retrieve Position in Response set • Results – Number of terms returned – List of Terms and their frequency in the database (for the given attribute combination) October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Z39.50 SCAN Results Syntax: zscan indexname1 term stepsize number_of_terms pref_pos zscan topic cat 1 20 1 % zscan title cat 1 20 1 {SCAN {Status 0} {SCAN {Status 0} {Terms 20} {Terms 20} {StepSize 1} {StepSize 1} {Position 1}} {Position 1}} {cat 706} {cat 27} {cat-and-mouse 19} {cat-fight 1} {cat-burglar 1} {catalan 19} {cat-carrying 1} {cat-egory 1} {catalogu 37} {cat-fight 1} {catalonia 8} {cat-gut 1} {catalyt 2} {cat-litter 1} {catania 1} {cat-lovers 2} {cataract 1} {cat-pee 1} {catch 173} {cat-run 1} {catch-all 3} {cat-scanners 1} … {catch-up 2} … CDL -- Cheshire II & III -- Ray R. Larson October 3, 2003 Resource Index Creation • For all servers, or a topical subset… – Get Explain information – For each index • Use SCAN to extract terms and frequency • Add term + freq + source index + database metadata to the XML “Collection Document” for the resource – Planned extensions: • Post-Process indexes (especially Geo Names, etc) for special types of data – e.g. create “geographical coverage” indexes October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson MetaSearch Approach Map Query Map Explain And Scan Queries Search Engine Map Results MetaSearch Server Internet DB 1 Map Results Distributed Index October 3, 2003 DB2 Map Query Search Engine Db 5 Db 6 CDL -- Cheshire II & III -- Ray R. Larson Search Engine Map Results DB 3 DB 4 Known Issues and Problems • Not all Z39.50 Servers support SCAN or Explain • Solutions that appear to work well: – Probing for attributes instead of explain (e.g. DC attributes or analogs) – We also support OAI and can extract OAI metadata for servers that support OAI – Query-based sampling (Callan) • Collection Documents are static and need to be replaced when the associated collection changes October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Evaluation • Test Environment – TREC Tipster data (approx. 3 GB) – Partitioned into 236 smaller collections based on source and date by month (no DOE) • High size variability (from 1 to thousands of records) • Same database as used in other distributed search studies by J. French and J. Callan among others – Used TREC topics 51-150 for evaluation (these are the only topics with relevance judgements for all 3 TIPSTER disks October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Harvesting Efficiency • Tested using the databases on the previous slide + the full FT database (210,158 records ~ 600 Mb) • Average of 23.07 seconds per database to SCAN each database (3.4 indexes on average) and create a collection representative, over the network • Average of 14.07 seconds • Also tested larger databases (E.g. TREC FT database ~600 Mb with 7 indexes was harvested in 131 seconds. October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Our Collection Ranking Approach • We attempt to estimate the probability of relevance for a given collection with respect to a query using the Logistic Regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with new algorithm for weight calculation at retrieval time • Estimates from multiple extracted indexes are combined to provide an overall ranking score for a given resource (I.e., fusion of multiple query results) October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Probabilistic Retrieval: Logistic Regression Probability of relevance for a given index is based on logistic regression from a sample set documents to determine values of the coefficients (TREC). At retrieval the probability estimate is obtained by: 6 P( R | Q, C )  c0   ci X i i 1 October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Probabilistic Retrieval: Logistic Regression attributes 1 X1  M M  log QAF tj 1 Query Length X 2  QL 1 X3  M Average Absolute Query Frequency M  log CAF Average Absolute Collection Frequency tj 1 CL 10 1 M X5  log ICFt j  M 1 N  nt j ICF  nt j Collection size estimate X4  X 6October  log M 3, 2003 Average Inverse Collection Frequency Inverse Document Frequency (N = Number of collections M= Number of Terms in common CDL -- Cheshire II & III -- Ray R. Larson between query and document Evaluation • Effectiveness – Tested using the collection representatives described above (as harvested from over the network) and the TIPSTER relevance judgements – Testing by comparing our approach to known algorithms for ranking collections – Results were measured against reported results for the Ideal and CORI algorithms and against the optimal “Relevance Based Ranking” (MAX) – Recall analog (How many of the Rel docs occurred in the top n databases – averaged) October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Titles only (short query) R̂ October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Future • Logically Clustering servers by topic • Meta-Meta Servers (treating the MetaSearch database as just another database) October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Distributed Metadata Servers General Servers October 3, 2003 Meta-Topical Servers Replicated servers CDL -- Cheshire II & III -- Ray R. Larson Database Servers Geographic Operators and Search Ranking October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson The GEO Operations • Operators established for the GEO Z39.50 profile • Implemented using special operations on indexes • Indexing allows extraction of geographic coordinates and dates from SGML/XML data in a variety of formats • Normalized internal representation in indexes • Search using geographic and time elements as primary or limiting search elements October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson The GEO Operations • X-based interfaces permit (simple) map drawing and search • Interface to MapServer for web-based map searching October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson GEO Geographic operators >=< >#< <#> <># ++ :<: :<=: Overlap Fully Enclosed Encloses Fully Outside Near Before Before or During :>=: During or After :>: After October 3, 2003 Search region and data Overlap Data fully enclosed in search reg. Data fully encloses search region Data outside of search region Data is near search region Data date is before search date Data date is before or during search date Data date is during or after search date Data date is after search date CDL -- Cheshire II & III -- Ray R. Larson Overlaps search October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Fully Enclosed Search October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Map-Based Search October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson GeoSearch Web Interface October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson XML Schemas and Element Retrieval October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson XML Schema Support • XML Schemas can now be used to define the data contents • Tested with a wide variety of schemas including METS (with various supporting schemas) October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson XML Element Extraction • A new search “ElementSetName” is XML_ELEMENT_ • Any Xpath, element name, or regular expression can be included following the final underscore when submitting a present request • The matching elements are extracted from the records matching the search and delivered in a simple format.. October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson XML Extraction % zselect sherlock 372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372} % zfind topic mathematics {OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}} % zset recsyntax XML % zset elementset XML_ELEMENT_Fld245 % zdisplay {OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}} { <RESULT_DATA DOCID="1"> <ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]"> <Fld245 AddEnty="No" NFChars="0"><a>Singularitâes áa Cargáese</a></Fld245> </ITEM> <RESULT_DATA> … etc… October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson MySQL and PostgreSQL October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson RDBMS Support • There are two reasons for RDBMS support – IR systems are not meant for LOTS of update transactions – Some application need to have access to both relational data and text data via Z39.50 • Both MySQL and PostgreSQL are popular open source RDBMS and now either can now be used via Cheshire – Z39.50 mappings to RDBMS columns – “ZQL” submission of SQL as Z39.50 Type 0 query October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Protocol Support October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Protocols • In Cheshire II most protocols (except Z39.50) are implemented using scripting • Example scripts to support the following are included in the distribution – – – – October 3, 2003 OAI SRW (Python version) SOAP SDLIP CDL -- Cheshire II & III -- Ray R. Larson CORI, Okapi BM-25 ranking algorithms October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Why additional ranking methods • CORI is extremely hard to beat as a distributed search method • OKAPI BM-25 is now the “default” retrieval algorithm in experimental IR • New operators (later) let us mix and match ranking methods and Boolean operations October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson CORI ranking df T df  50  150  cw / cw  | DB | 0.5   log  cf   I log | DB | 1.0 p(rk | dbi )  0.4  0.6  T  I where : df is number of documents containing rk cf is number of databases containing rk | DB | is the number of databases being ranked cw is the number of words in dbi CDL --of Cheshire II & databases III -- Ray R. Larson cw is the average cw the being ranked October 3, 2003 Okapi BM25 (k1  1)tf (k3  1)qtf w  K  tf k3  qtf T Q (1) • • • • • • Where: Q is a query containing terms T K is k1((1-b) + b.dl/avdl) k1, b and k3 are parameters , usually 1.2, 0.75 and 7-1000 tf is the frequency of the term in a specific document qtf is the frequency of the term in a topic from which Q was derived • dl and avdl are the document length and the average document  r  0.5  length measured in some convenient unit   R  r  0 . 5 (1)   (1) • w is the Robertson-Sparck Jones weight. w  log October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson n  r  0.5      N  n  R  r  0.5  Result Set sorting, merging and ranking operators, bitmapped indexes October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Sorting • Support for Z39.50 Sort functions • Merge multiple resultsets and sort new set – Sort by index name/key (ATTRIBUTE) – Sort by rank (ELEMENTS) • Merges ranked results and Boolean results – Sort by XML/SGML Tag contents (TAG) October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Merging and Ranking Operators • Extends the capabilities of merging to include merger operations in queries like Boolean operators • Fuzzy Logic Operators – !FUZZY_AND – !FUZZY_OR – !FUZZY_NOT • Restrict components to particular parents • Merge Operators – !MERGE_SUM – !MERGE_MEAN October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Bitmapped Indexes • Bitmap indexes can be used for Boolean operations where the data has only a few values and very large numbers of items with each value • Only one bit per record stored in the index • Processed on a demand basis so only blocks with the bits needed to resolve a query are fetched October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire III Design and Development October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire III Goals • Retain or reproduce (and refine) all Cheshire II features – “Spring cleaning” of code base – Add Full Unicode Support – Store most system and content data in the database • Permit easy and efficient integration in Web Services • Use threaded server for economy of resource usage • Enhanced Multiprotocol support • Support for distributed processing (I.e. GRID clusters) • Enhance expandability and “drop in’ functionality • Interfaces and/or APIs for Java, Python, C/C++ October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire II Design Overview Z SERVER CONT CONFIG INDEX CLUSTER BUILD ASSOC XML DOCS INDEX CHESHIRE XML DIRECTORY October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Cheshire III Server Overview Cheshire III SERVER C O N F Normalization I C G I L S & N U S C E O D S A C N E T R A N T X E C R I R O N I H L G N G SERVER CONTROL API A U T H E N T I C A T I O N X S L T R E C O R D T R A N S F O R M S DB API USER INFO A P Native calls A C H P H Z39.50 R A SOAP E O N OAI I SRW T D Fetch ID N O L Put ID T E C EOpenURLR O R UDDI F WSRP A L OGIS C E JDBC N E T W O R K STAFF UI CONFIG User/ Client Clients REMOTE SYSTEMS (any protocol) LOCAL DB XML October 3, 2003 CONFIG RESULT INDEXES & Metadata SETS CDL -- Cheshire II & III -- Ray R. Larson INFO ACCESS INFO Cheshire III SERVER C O N F I G & C O N T R O L SERVER CONTROL API Normalization I N D E X I N G C L U S T E R I N G S E A R C H A U T S H E C N A T N I C A T I O N X S L T T R R E A C N O S R F D O R M S USER INFO Native calls P R O T O C O L H A N D L E R Z39.50 SOAP OAI SRW Fetch ID Put ID OpenURL UDDI WSRP OGIS A P A C H E I N T E R F A C E JDBC DB API N E T W O R K STAFF UI CONFIG User/ Client Clients REMOTE SYSTEMS (any protocol) LOCAL DB XML October 3, 2003 INDEXES CONFIG RESULT & Metadata SETS CDL -- Cheshire II & III -- Ray R. Larson INFO ACCESS INFO Retain Features • The intent is to permit all of the types of in indexing, searching and record formatting available now, while making it easier to add new capabilities • The new system will also support full UNICODE for content and for metadata • Store metadata and content in the database (including config information, etc.) October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Permit easy integration of Web Services • The assumption is that the web server will be the central server mechanism in the future. • The new design relies on the session handling, threading and load management tools available in Apache (2.0.40+) • The Cheshire server is dynamically loaded as part of the Web Server October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Multiprotocol Support • The Web server handles the network issues and passes requests in various protocols along to the Cheshire Server. • Individual Protocol “plugins” and the Protocol Handler convert search, display, and metadata requests in a particular protocol to the internal Cheshire III control language, and convert outgoing message and data to the appropriate protocol form October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson • Distributed Processing (RESEARCH) The server will support protocols for interchange of partial results and collection statistics with a single “Master” controlling the actions of a large number of “Slave” servers • These will run in parallel in a GRID environment • This is still “research” but will probably be using “Storage Grid” technology from SDSC with our own applications • Non-Grid use of the same protocols, etc will be possible (but definitely slower) October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Enhanced Expanability • Clearly defined APIs for interacting with the server will permit easy addition of new functionality, or to replace or upgrade existing functionality • Interactive user interface for database configuration and setup – We want to make it easier for a user/administrator to create and manage the database October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Multilingual APIs • The system is being developed in a multilingual environment. • We will include the ability to interface with (at a minimum) Java, Python and C/C++ applications. • APIs for developing new functions will be available in these languages as well October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson Development • Currently work is going on here (RRL) and (primarily) in the UK • We have incomplete (Alpha) versions of the system, but haven’t been distributing it in the current form (changing constantly) • First release version is expected in mid-’04 October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson For More Information • http://Cheshire.berkeley.edu • ftp://Cheshire.berkeley.edu for source • Contact [email protected] October 3, 2003 CDL -- Cheshire II & III -- Ray R. Larson

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download CHESHIRE II - University of California, Berkeley