Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Grid-based Digital Libraries and Cheshire3 Ray R. Larson University of California, Berkeley School of Information ISGC - Taiwan 2006.05.04 SLIDE 1 Overview • The Grid, Text Mining and Digital Libraries – Grid Architecture – Grid IR Issues • Cheshire3: Building an Architecture and Bringing Search to Grid-Based Digital Libraries – Overview – Cheshire3 Architecture – Distributed Workflows – Grid Experiments • Many of the slides in this presentation were created by others, including Ian Foster, Eric Yen, Michael Buckland and Reagan Moore, Paul Watry and Clare Llewellyn ISGC - Taiwan 2006.05.04 SLIDE 2 Quality, economies of scale The Grid: On-Demand Access to Electricity Source: Ian Foster ISGC - Taiwan Time 2006.05.04 SLIDE 3 By Analogy, A Computing Grid • Decouples production and consumption – Enable on-demand access – Achieve economies of scale – Enhance consumer flexibility – Enable new devices • On a variety of scales – Department – Campus – Enterprise – Internet ISGC - Taiwan Source: Ian Foster 2006.05.04 SLIDE 4 What is the Grid? “The short answer is that, whereas the Web is a service for sharing information over the Internet, the Grid is a service for sharing computer power and data storage capacity over the Internet. The Grid goes well beyond simple communication between computers, and aims ultimately to turn the global network of computers into one vast computational resource.” Source: The Global Grid Forum ISGC - Taiwan 2006.05.04 SLIDE 5 The Foundations are Being Laid Edinburgh Glasgow DL Belfast Newcastle Manchester Cambridge Oxford Cardiff RAL Hinxton London Soton Tier0/1 facility Tier2 facility Tier3 facility 10 Gbps link 2.5 Gbps link 622 Mbps link Other link ISGC - Taiwan 2006.05.04 SLIDE 6 ISGC - Taiwan Astrophysics .…. ..… Remote sensors Portals Combustion (Dr. Eric Yen, Academia Sinica, Taiwan.) Collaboratories Cosmology Remote Visualization Remote Computing Application Toolkits Data Grid Grid middleware Applications High energy physics Grid Architecture -- Grid Services Protocols, authentication, policy, instrumentation, Resource management, discovery, events, etc. Grid Fabric Storage, networks, computers, display devices, etc. and their associated local services 2006.05.04 SLIDE 7 But… what about… • Applications and data that are NOT for scientific research? • Things like: ISGC - Taiwan 2006.05.04 SLIDE 8 Humanities computing … Astrophysics Text Mining … Remote sensors Digital Libraries Metadata management Bio-Medical Search & Retrieval Combustion (ECAI/AS Grid Digital Library Workshop) Portals Cosmology Collaboratories Remote Visualization Remote Computing Application Toolkits Data Grid Grid middleware Applications High energy physics Grid Architecture Grid Services Protocols, authentication, policy, instrumentation, Resource management, discovery, events, etc. Grid Fabric Storage, networks, computers, display devices, etc. and their associated local services ISGC - Taiwan 2006.05.04 SLIDE 9 Grid-Based Digital Libraries: Needs • Large-scale distributed storage requirements and technologies • Organizing distributed digital collections • Shared Metadata – standards and requirements • Managing distributed digital collections • Security and access control • Collection Replication and backup • Distributed Information Retrieval support and algorithms ISGC - Taiwan 2006.05.04 SLIDE 10 The Cheshire Information Lifecycle Management Achitecture • The Cheshire 3 Information Life Cycle Management Architecture seeks to integrate recent advances from the data grid, digital library, and persistent archive communities in order to support large scale digital repositories across domains and formats. • The primary aim is to integrate data grid, digital library and persistant archive technologies to support all processes of information life-cycle management. This will include datagrid technologies (SRB), digital library technologies (Cheshire), and persistant archive technologies(Multivalent). The flow of information between these technologies is controlled using easily configurable workflow systems, such as Kepler, the workflows support the integration of text mining technologies from NaCTeM for knowledge generation. ISGC - Taiwan 2006.05.04 SLIDE 11 Cheshire3 Environment ISGC - Taiwan 2006.05.04 SLIDE 12 Text and Data Mining Systems • The Cheshire system is being used in the UK National Text Mining Centre (NaCTeM) as a primary means of integrating information retrieval systems with text mining and data analysis systems. • The framework will seek to integrate text and data mining tools, and implement these on highly parallel grid infrastructures. We intend to further supplement these capabilities by incorporating support for highly dimensional data sets. • Such support will extend the capabilities of the Cheshire system to extract and relate semantic information, efficiently and effectively. ISGC - Taiwan 2006.05.04 SLIDE 13 Digital Preservation Technologies • The architecture is now being used for multiple persistent archives projects, including the NARA Persistant Archives and NPACI collaboration project; the NSDL Digital Preservation Life-cycle Management project; the NSDL Persistent Archive Testbed project; and AHDS prototype • We are working on a knowledge generation infrastructure, which may be used to characterise knowledge relationships. The outcome of this development may make it possible to extract and process knowledge from massive data collections which will increase the rate of discovery • The architecture incorporates the Multivalent digital object management system as a novel solution to long term preservation requirements. This component will validate, parse, and render objects from the original bit stream ISGC - Taiwan 2006.05.04 SLIDE 14 Data Grid Technologies • Storage capabilities of the framework will be primarily based on the data grid technologies provided by the SDSC Storage Resource Broker (SRB). • The SRB offers storage, archiving and version control of heterogeneous data in containers to allow aggregated data movement. It is able to create multiple geographically distributed backups of data via a transparent interface. • The SRB is emerging as the de-facto standard for datagrid applications, and is already in use by: – – – – The World University Network The Biomedical Informations Research Network (BIRN) The UK eScience Centre (CCLRC) The National Partnership for Advanced Computational Infrastructure (NPACI) – NASA information power grid. ISGC - Taiwan 2006.05.04 SLIDE 15 Data Grid Problem • “Enable a geographically distributed community [of thousands] to pool their resources in order to perform sophisticated, computationally intensive analyses on Petabytes of data” • Note that this problem: – Is common to many areas of science – Overlaps strongly with other Grid problems ISGC - Taiwan 2006.05.04 SLIDE 16 Examples of Desired Data Grid Functionality • • • • • High-speed, reliable access to remote data Automated discovery of “best” copy of data Manage replication to improve performance Co-schedule compute, storage, network “Transparency” wrt delivered performance – Particularly important for IR applications where an end user may be waiting for results • Enforce access control on data • Allow representation of “global” resource allocation policies ISGC - Taiwan 2006.05.04 SLIDE 17 SRB as a Solution • The Storage Resource Broker is a middleware • It virtualizes resource access • It mediates access to distributed heterogeneous resources • It uses a MetaCATalog to facilitate the brokering • It integrates data and metadata Application MCAT SRB Server HRM DB2, Oracle, Illustra, ObjectStore HPSS, ADSM, UniTree UNIX, NTFS, HTTP, FTP Distributed Storage Resources (database systems, archival storage systems, file systems, ftp, http, …) Source: Arcot Rajasekar (SDSC) ISGC - Taiwan 2006.05.04 SLIDE 18 SDSC Storage Resource Broker & Meta-data Catalog Application Resource, User User Defined C, C++, Linux I/O Unix Shell Java, NT Browsers Prolog Python Web SRB MCAT Dublin Core Archives HPSS, ADSM, HRM UniTree, DMF File Systems Databases Unix, NT, Mac OSX Third-party copy Remote Proxies DB2, Oracle, Sybase DataCutter Application Meta-data Source: Arcot Rajasekar (SDSC) ISGC - Taiwan 2006.05.04 SLIDE 19 SRB Concepts • Abstraction of User Space – Single sign-on – Multiple authentication schemes • certificates, (secure) passwords, tickets, group permissions, roles • Virtualization of Resources – Resource Location, Type & Access transparency – Logical Resource Definitions - bundling • Abstraction of Data and Collections • – Virtual Collections: Persistent Identifier and Global Name Space – Replication & Segmentation Data Discovery – system & application metadata – User-defined Metadata – Structural & Descriptive – Attribute-based Access (path names become irrelevant) • Uniform Access Methods – APIs, Command Line, GUI Browsers, Web-Access (Portal,WSDL, CGI) – Parallel Access with both Client and Server-driven strategies Source: Arcot Rajasekar (SDSC) ISGC - Taiwan 2006.05.04 SLIDE 20 DL Collection Management • Content management systems such as Dspace and Fedora, are currently being extended to make use of the SRB for data grid storage. This will ensure their collections can in future be of virtually unlimited size, and be stored, replicated, and accessible via federated grid technologies. • By supporting the SRB, we have ensured that the Cheshire framework will be able to integrate with these systems, thereby facilitating digital content ingestion, resource discovery, content management, dissemination, and preservation, within a data-grid environment. ISGC - Taiwan 2006.05.04 SLIDE 21 Workflow Environments • Workflow environments, such as Kepler, are designed to allow researchers to design and execute flexible processing sequences for complex data analysis. They provide a Graphical User Interface to allow any level of user from a variety of disciplines to design these workflows in a drag-and-drop manner. • In particular we intend to provide a platform which may integrate text mining techniques and methodologies, either as part of an internal Cheshire workflow, or as external workflow configured using a system such as Kepler. ISGC - Taiwan 2006.05.04 SLIDE 22 Grid IR Issues • Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (I.e. speed) • Very large-scale distribution of resources is (still) a challenge for sub-second retrieval • Different from most other typical Grid processes, IR is potentially less computing intensive and more data intensive • In many ways Grid IR replicates the process (and problems) of metasearch or distributed search ISGC - Taiwan 2006.05.04 SLIDE 23 Cheshire Digital Library System • Cheshire was originally created at UC Berkeley and more recently co-developed at the University of Liverpool. The system itself is widely used in the United Kingdom for production digital library services including: – – – – Archives Hub JISC Information Environment Service Registry Resource Discovery Network British Library ISTC service • The Cheshire system has recently gone through a complete redesign into its current incarnation, Cheshire3 enabling Grid-based IR over the Data Grid ISGC - Taiwan 2006.05.04 SLIDE 24 Cheshire3 IR Overview • XML Information Retrieval Engine – 3rd Generation of the UC Berkeley Cheshire system, as co-developed at the University of Liverpool. – Uses Python for flexibility and extensibility, but uses C/C++ based libraries for processing speed – Standards based: XML, XSLT, CQL, SRW/U, Z39.50, OAI to name a few. – Grid capable. Uses distributed configuration files, workflow definitions and PVM or MPI to scale from one machine to thousands of parallel nodes. – Free and Open Source Software. – http://www.cheshire3.org/ (under development!) ISGC - Taiwan 2006.05.04 SLIDE 25 Cheshire3 Server Overview Cheshire3 SERVER C O N F Normalization A I C U G I L T S &N U S H E E C D S C N O A E T A T N R I T X E C N C R I R A H ON I T L G N I O G N SERVER CONTROL API X S L T R E C O R D T R A N S F O R M S DB API USER INFO A P Native calls A C P H Z39.50 H R A SOAP E O N OAI I SRW T DFetch ID N O L Put ID T E C EOpenURLR UDDI O R WSRP F A L OGIS C E JDBC N STAFF UI E T CONFIG W User/ O Client Clients R K REMOTE SYSTEMS (any protocol) LOCAL DB XML ISGC - Taiwan INDEXES CONFIG & Metadata INFO RESULT SETS ACCESS INFO 2006.05.04 SLIDE 26 Cheshire3 Object Model Protocol Handler ConfigStore Ingest Process Documents Object Transformer Server Records User Document Query UserStore Document Group ResultSet Database PreParser PreParser PreParser Query Document Index Extracter RecordStore Parser Normaliser Terms IndexStore ISGC - Taiwan Record DocumentStore 2006.05.04 SLIDE 27 Workflow Objects • Workflows are first class objects in Cheshire3 (though not represented in the model diagram) and now support integration with Kepler workflows • All Process and Abstract objects have individual XML configurations with a common base schema with extensions • We can treat configurations as Records and store in regular RecordStores, allowing access via regular IR protocols. ISGC - Taiwan 2006.05.04 SLIDE 28 Workflow References • Workflows contain a series of instructions to perform, with reference to other Cheshire3 objects • Reference is via pseudo-unique identifiers … Pseudo because they are unique within the current context (Server vs Database) • Workflows are objects, so this enables server level workflows to call database specific workflows with the same identifier ISGC - Taiwan 2006.05.04 SLIDE 29 Distributed Processing • Each node in the cluster instantiates the configured architecture, potentially through a single ConfigStore • Master nodes then run a high level workflow to distribute the processing amongst Slave nodes by reference to a subsidiary workflow • As object interaction is well defined in the model, the result of a workflow is equally well defined. This allows for the easy chaining of workflows, either locally or spread throughout the cluster ISGC - Taiwan 2006.05.04 SLIDE 30 External Integration • Cheshire workflows now allow integration with existing cross-service workflow systems, in particular Kepler/Ptolemy • Integration at two levels: – Cheshire3 as a Kepler service (black box) ... Identify a workflow to call – Cheshire3 objects as a services (duplicate existing C3 workflow functions using Kepler) ISGC - Taiwan 2006.05.04 SLIDE 31 Cheshire3 Grid Tests • Initial experiments: 20 processors in Liverpool using PVM (Parallel Virtual Machine) • Using 16 processors with one “master” and 22 “slave” processes we were able to parse and index MARC data at about 13000 records per second • 610 Mb of TEI data can be parsed and indexed in seconds • More recently, 30 processors using MPI has shown even better results, unsurprisingly. ISGC - Taiwan 2006.05.04 SLIDE 32 SRB and SDSC Experiments • In conjunction with SDSC, we have implemented SRB support for processed data storage and data acquisition • We are continuing work with SDSC to run further evaluations using the TeraGrid through a “small” grant for 30000 CPU hours – SDSC's TeraGrid cluster currently consists of 256 IBM cluster nodes, each with dual 1.5 GHz Intel® Itanium® 2 processors, for a peak performance of 3.1 teraflops. The nodes are equipped with four gigabytes (GBs) of physical memory per node. The cluster is running SuSE Linux and is using Myricom's Myrinet cluster interconnect network. • Large-scale test collections now include MEDLINE, NSDL, the NARA preservation prototype, and we hope to use CiteSeer and the “million books” collections of the Internet Archive • Tests with Medline showed that we were able to index 750,000 Medline records, including part of speech tagging in 2 minutes, 45 seconds (using 1 master,59 slaves) ISGC - Taiwan 2006.05.04 SLIDE 33 NARA Prototype • The NARA (National Archives and Records Administration) preservation prototype technology demonstration… • The current demo version is not designed to show the fastest possible search times of the collection, but instead to show how the system can scale while retaining usable search times and interacting directly with the preserved, original objects rather than a migrated version. • The collection was generated by a web crawl for documents relating to the Space Shuttle Columbia tragedy in 2003, and includes both federal documents (such as the CAIB site and other information from NASA) as well as third party discussion such as news sites (BBC, CNN etc) and reputable information sources (Wikipedia, Space.com). As such, it is a preservation of the popular memory of the events as well as the official perspective. ISGC - Taiwan 2006.05.04 SLIDE 34 NARA Prototype: Introduction • To ensure scalability and longevity, the design criteria for the prototype were: – Use the TeraGrid for processing – Use the Storage Resource Broker for all data storage – Use Cheshire3 for a standards based, flexible information architecture – Use Multivalent for document processing and display – Use standards where ever possible to promote architecture independence. • The use of the prototype system has two distinct phases. First, the Teragrid is used to process the documents stored in the SRB and store the processed information back in to SRB collections. In the second phase, the user performs a search on the processed data stored in the SRB, and then interacts with the original documents directly. ISGC - Taiwan 2006.05.04 SLIDE 35 Teragrid Indexing • The following figure shows the initial data processing steps performed on the Teragrid. – Once machines in the grid have been allocated for processing, each machine builds an instance of the Cheshire3 architecture. The architecture is as we described previously – One machine is designated as the 'Master' and controls the behaviour of all other machines, called 'Slaves'. For much larger scale datasets, multiple master nodes could be assigned – The Master acquires the list of documents to process from the SRB and distributes these amongst the slaves according to the SRB URL scheme. ISGC - Taiwan 2006.05.04 SLIDE 36 Teragrid Indexing Master1 File Paths SRB (Web Crawl) File Path1 Object1 ObjectN Slave1 Extracted Data1 ISGC - Taiwan File PathN SlaveN GPFS Temp Storage Extracted DataN 2006.05.04 SLIDE 37 Teragrid indexing • Upon receiving a URL to process, A slave node retrieves the document from the SRB directly and steps through a configurable workflow of information processing steps. This includes analysis of every sentence in the text down to the part of speech level, as well as typical keyword and phrase indexing • As part of the workflow, the slave node will store temporary indexing data into the TeraGrid wide GPFS storage system. This data will be then merged in the following phase • Within a single slave the processing includes: ISGC - Taiwan 2006.05.04 SLIDE 38 Teragrid Indexing: Slave MVD Document Parser Phrase Detection Maste r1 Data Cleaning SRB (Web Crawl) Slave 1 GPFS Temp Storage Noun/Verb Filter Slave N NLP Tagger Proximity XML Parser etc. XPath Extraction ISGC - Taiwan 2006.05.04 SLIDE 39 Teragrid Indexing 2 • Once all of the documents have been processed, the information that was extracted must be merged into inverted index files to be stored in the SRB. The following graphic shows this process • In the second phase of the indexing process, the master first directs each slave to sort all of the temporary data files that it has created • The master then directs one node per index to merge those data files into one combined file • Then those nodes are directed to create inverted indexes from the data. These indexes are created in small chunks, one chunk per initial character in the term extracted. After each chunk is built, the slave uploads it into the SRB, and the temporary data files are deleted • After all indexes have been loaded, the processing stage is complete ISGC - Taiwan 2006.05.04 SLIDE 40 Teragrid Indexing 2 Master1 SRB (Indexes) Sort/Load Request Merged Data Merged Data Slave1 Extracted Data ISGC - Taiwan Sort/Load Request SlaveN GPFS Temp Storage Extracted Data 2006.05.04 SLIDE 41 NARA Search Process • In the Discovery phase, a user interacts with a front end to the system. This phase does not use the Teragrid for processing as it cannot be predicted when a user would want to do their search, and hence when the use of the grid should be requested is unknown • In the first step, the user performs a search via the interface. The front end system retrieves the chunks of indexes necessary to fulfil the users request from the SRB and returns a set of SRB URLs. • The interaction between the Fab4 Browser and the front end is performed via the SRW/U information retrieval standard. The response is XML which is rendered as HTML via XSLT stylesheets. By modifying the stylesheets, different user interfaces can quickly be generated without modifying the server. By using a standard protocol, we can ensure that the response is able to be interpreted by different clients or generated by different front end architectures. To verify this, you may load the pages with a regular web browser and select 'View Source' to see the underlying XML. • The search may sometimes be a little slower than expected if the index chunks needed have not been retrieved from the data grid. Please note that the data is not replicated, so the chunks are being transfered from San Diego to Liverpool in this particular setup. ISGC - Taiwan 2006.05.04 SLIDE 42 Search Phase Search Request Index Sections Multivalent Browser Web Interface SRB (Indexes) Liverpool San Diego SRB URIs Berkeley ISGC - Taiwan 2006.05.04 SLIDE 43 Search Process 2 • Once the SRB URLs have been returned to the users, they may interact directly with the original documents stored in the SRB via Fab4. Fab4 connects to the SRB and retrieves the document for display, rather than through a web proxy interface • (Note that this display is not available in conventional browsers) ISGC - Taiwan 2006.05.04 SLIDE 44 Search Phase2 SRB URI of Object Multivalent Browser SRB (Web Crawl) Original Object ISGC - Taiwan 2006.05.04 SLIDE 45 Multivalent Document Model • MVD Model – “media adapters”, “behaviors” and “layers” • Media adapters generate a “document tree” for internal representation of a particular document type • behaviors implement the active elements of a specific document or document type, as well as shared behaviors (like annotation) that can be applied to all document types • The document tree is interpreted as layers for presentation (next page) • MVD has media adapters giving direct support for interpretation and presentation of many different document formats – HTML, PDF, scanned paper images and OCR, TeX DVI, Appleworks, XML, Open Office, etc) • Supports active, distributed, composable transformations of multimedia documents • Enables sophisticated annotations, intelligent result handling, user-modifiable interface, composite documents ISGC - Taiwan 2006.05.04 SLIDE 46 Multivalent Documents Cheshire Layer GIS Layer Valence: 2: The relative capacity to unite, react, or interact (as with antigens or a biological substrate). Webster’s 7th Collegiate Dictionary Table Layer History of The Classical World kdk dkd kdk The jsfj sjjhfjs jsjj jsjhfsjf sjhfjksh sshf jsfksfjk sjs jsjfs kj sjfkjsfhskjf sjfhjksh skjfhkjshfjksh jsfhkjshfjkskjfhsfh skjfksjflksjflksjflksf sjfksjfkjskfjskfjklsslk slfjlskfjklsfklkkkdsj ksfksjfkskflk sjfjksf kjsfkjsfkjshf sjfsjfjks ksfjksfjksjfkthsjir\\ ks ksfjksjfkksjkls’ks klsjfkskfksjjjhsjhuu sfsjfkjs taksksh sksksk skksksk Document content extraction OCR Layer OCR Mapping Layer Modernjsfj sjjhfjs jsjj jsjhfsjf sslfjksh sshf jsfksfjk sjs jsjfs kj sjfkjsfhskjf sjfhjksh skjfhkjshfjksh jsfhkjshfjkskjfhsfh skjfksjflksjflksjflksf sjfksjfkjskfjskfjklsslk slfjlskfjklsfklkkkdsj Scanned Page Image kdjjdkd kdjkdjkd kj kdkdk kdkd dkk jdjjdj clclc ldldl Table 1. ISGC - Taiwan 2006.05.04 SLIDE 47 NARA Prototype • Fab4 Multivalent browser available at http://bodoni.lib.liv.ac.uk:8080/fab4/ • Nara prototype at http://srw.cheshire3.org/services/nara ISGC - Taiwan 2006.05.04 SLIDE 48 Summary • Indexing and IR work very well in the Grid environment, with the expected scaling behavior for multiple processes • We are working on other large-scale indexing projects (such as the NSDL) and will also be running evaluations of retrieval performance using IR test collections such as the TREC “Terabyte track” collection ISGC - Taiwan 2006.05.04 SLIDE 49 Thank you! Available via http://www.cheshire3.org ISGC - Taiwan 2006.05.04 SLIDE 50