Grid-based Digital Libraries
and Cheshire3
Ray R. Larson
University of California, Berkeley
School of Information
ISGC - Taiwan
2006.05.04 SLIDE 1
Overview
• The Grid, Text Mining and Digital Libraries
– Grid Architecture
– Grid IR Issues
• Cheshire3: Building an Architecture and
Bringing Search to Grid-Based Digital
Libraries
– Overview
– Cheshire3 Architecture
– Distributed Workflows
– Grid Experiments
• Many of the slides in this presentation were created by others,
including Ian Foster, Eric Yen, Michael Buckland and Reagan
Moore, Paul Watry and Clare Llewellyn
The Grid: On-Demand Access to Electricity
[Figure: quality and economies of scale of electricity delivery improving over time]
Source: Ian Foster
By Analogy, A Computing Grid
• Decouples production and consumption
– Enable on-demand access
– Achieve economies of scale
– Enhance consumer flexibility
– Enable new devices
• On a variety of scales
– Department
– Campus
– Enterprise
– Internet
Source: Ian Foster
What is the Grid?
“The short answer is that, whereas the Web
is a service for sharing information over
the Internet, the Grid is a service for
sharing computer power and data storage
capacity over the Internet. The Grid goes
well beyond simple communication
between computers, and aims ultimately to
turn the global network of computers into
one vast computational resource.”
Source: The Global Grid Forum
The Foundations are Being Laid
[Figure: map of the UK e-Science Grid, with sites at Edinburgh, Glasgow, Belfast, Newcastle, Manchester, Cambridge, Oxford, Cardiff, RAL, Hinxton, London, and Southampton, connected by links ranging from 622 Mbps to 10 Gbps and marked as Tier0/1, Tier2, or Tier3 facilities]
Grid Architecture (Dr. Eric Yen, Academia Sinica, Taiwan)
[Figure: layered Grid architecture]
• Applications: astrophysics, combustion, cosmology, high energy physics, remote sensors, …
• Application Toolkits: portals, collaboratories, remote visualization, remote computing, data grid
• Grid Services (Grid middleware): protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
• Grid Fabric: storage, networks, computers, display devices, etc., and their associated local services
But… what about…
• Applications and data that are NOT for
scientific research?
• Things like:
(ECAI/AS Grid Digital Library Workshop)
[Figure: the same layered Grid architecture, extended with non-scientific applications]
• Applications: humanities computing, text mining, digital libraries, metadata management, bio-medical search & retrieval, alongside astrophysics, combustion, cosmology, high energy physics, remote sensors, …
• Application Toolkits: portals, collaboratories, remote visualization, remote computing, data grid
• Grid Services (Grid middleware): protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
• Grid Fabric: storage, networks, computers, display devices, etc., and their associated local services
Grid-Based Digital Libraries: Needs
• Large-scale distributed storage
requirements and technologies
• Organizing distributed digital collections
• Shared Metadata – standards and
requirements
• Managing distributed digital collections
• Security and access control
• Collection Replication and backup
• Distributed Information Retrieval
support and algorithms
The Cheshire Information Lifecycle Management Architecture
• The Cheshire3 Information Life Cycle Management Architecture seeks to integrate recent advances from the data grid, digital library, and persistent archive communities in order to support large-scale digital repositories across domains and formats.
• The primary aim is to integrate data grid, digital library, and persistent archive technologies to support all processes of information life-cycle management. This will include data grid technologies (SRB), digital library technologies (Cheshire), and persistent archive technologies (Multivalent). The flow of information between these technologies is controlled using easily configurable workflow systems, such as Kepler; the workflows support the integration of text mining technologies from NaCTeM for knowledge generation.
Cheshire3 Environment
Text and Data Mining Systems
• The Cheshire system is being used in the UK National
Text Mining Centre (NaCTeM) as a primary means of
integrating information retrieval systems with text mining
and data analysis systems.
• The framework will seek to integrate text and data mining tools, and implement these on highly parallel grid infrastructures. We intend to further supplement these capabilities by incorporating support for high-dimensional data sets.
• Such support will extend the capabilities of the Cheshire system to extract and relate semantic information efficiently and effectively.
Digital Preservation Technologies
• The architecture is now being used for multiple persistent archives projects, including the NARA Persistent Archives and NPACI collaboration project; the NSDL Digital Preservation Life-cycle Management project; the NSDL Persistent Archive Testbed project; and an AHDS prototype.
• We are working on a knowledge generation infrastructure, which may be used to characterise knowledge relationships. The outcome of this development may make it possible to extract and process knowledge from massive data collections, which will increase the rate of discovery.
• The architecture incorporates the Multivalent digital object management system as a novel solution to long-term preservation requirements. This component will validate, parse, and render objects from the original bit stream.
Data Grid Technologies
• Storage capabilities of the framework will be primarily
based on the data grid technologies provided by the
SDSC Storage Resource Broker (SRB).
• The SRB offers storage, archiving and version control of
heterogeneous data in containers to allow aggregated
data movement. It is able to create multiple
geographically distributed backups of data via a
transparent interface.
• The SRB is emerging as the de facto standard for data grid applications, and is already in use by:
– The World University Network
– The Biomedical Informatics Research Network (BIRN)
– The UK eScience Centre (CCLRC)
– The National Partnership for Advanced Computational Infrastructure (NPACI)
– The NASA Information Power Grid
Data Grid Problem
• “Enable a geographically distributed
community [of thousands] to pool their
resources in order to perform
sophisticated, computationally intensive
analyses on Petabytes of data”
• Note that this problem:
– Is common to many areas of science
– Overlaps strongly with other Grid problems
Examples of Desired Data Grid Functionality
• High-speed, reliable access to remote data
• Automated discovery of “best” copy of data
• Manage replication to improve performance
• Co-schedule compute, storage, network
• “Transparency” with respect to delivered performance
– Particularly important for IR applications where an end user may be waiting for results
• Enforce access control on data
• Allow representation of “global” resource allocation policies
SRB as a Solution
• The Storage Resource Broker is middleware
• It virtualizes resource access
• It mediates access to distributed heterogeneous resources
• It uses a MetaCATalog (MCAT) to facilitate the brokering
• It integrates data and metadata
[Figure: applications connect through the SRB Server and MCAT to distributed storage resources: database systems (DB2, Oracle, Illustra, ObjectStore), archival storage systems (HPSS, ADSM, UniTree, HRM), and file systems (UNIX, NTFS, HTTP, FTP)]
Source: Arcot Rajasekar (SDSC)
SDSC Storage Resource Broker & Meta-data Catalog
[Figure: applications (C/C++, Linux I/O, Unix shell, Java, NT browsers, Prolog, Python, Web) access the SRB, which consults the MCAT for resource, user, and user-defined metadata (e.g. Dublin Core) and brokers access to archives (HPSS, ADSM, HRM, UniTree, DMF), file systems (Unix, NT, Mac OS X), and databases (DB2, Oracle, Sybase), with support for third-party copy, remote proxies, and DataCutter]
Source: Arcot Rajasekar (SDSC)
SRB Concepts
• Abstraction of User Space
– Single sign-on
– Multiple authentication schemes: certificates, (secure) passwords, tickets, group permissions, roles
• Virtualization of Resources
– Resource location, type & access transparency
– Logical resource definitions (bundling)
• Abstraction of Data and Collections
– Virtual collections: persistent identifier and global name space
– Replication & segmentation
• Data Discovery – system & application metadata
– User-defined metadata: structural & descriptive
– Attribute-based access (path names become irrelevant)
• Uniform Access Methods
– APIs, command line, GUI browsers, web access (portal, WSDL, CGI)
– Parallel access with both client- and server-driven strategies
Source: Arcot Rajasekar (SDSC)
DL Collection Management
• Content management systems such as DSpace and Fedora are currently being extended to make use of the SRB for data grid storage. This will ensure their collections can in future be of virtually unlimited size, and be stored, replicated, and accessible via federated grid technologies.
• By supporting the SRB, we have ensured that the Cheshire framework will be able to integrate with these systems, thereby facilitating digital content ingestion, resource discovery, content management, dissemination, and preservation within a data grid environment.
Workflow Environments
• Workflow environments, such as Kepler, are
designed to allow researchers to design and
execute flexible processing sequences for
complex data analysis. They provide a Graphical
User Interface to allow any level of user from a
variety of disciplines to design these workflows
in a drag-and-drop manner.
• In particular, we intend to provide a platform which may integrate text mining techniques and methodologies, either as part of an internal Cheshire workflow, or as an external workflow configured using a system such as Kepler.
Grid IR Issues
• Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e. speed)
• Very large-scale distribution of resources is (still) a challenge for sub-second retrieval
• Unlike most other typical Grid processes, IR is potentially less computing-intensive and more data-intensive
• In many ways Grid IR replicates the process (and problems) of metasearch or distributed search
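The metasearch parallel above comes down to fusing ranked lists from many nodes. A minimal sketch, assuming per-node scores are already comparable (this is an illustration of result fusion in general, not the Cheshire3 implementation):

```python
# Hypothetical sketch: fusing ranked result lists from several search
# nodes, as a metasearch-style Grid IR front end might. Scores are
# assumed to be normalized and comparable across nodes.

def merge_results(node_results, limit=10):
    """Merge {doc_id: score} dicts from many nodes into one ranking.

    A document found by several nodes gets the sum of its scores
    (a simple CombSUM-style fusion strategy).
    """
    fused = {}
    for results in node_results:
        for doc_id, score in results.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + score
    # Sort by descending score, breaking ties on the identifier.
    ranking = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranking[:limit]

node_a = {"doc1": 0.8, "doc2": 0.4}
node_b = {"doc2": 0.5, "doc3": 0.3}
merged = merge_results([node_a, node_b])
```

A document seen by two nodes ("doc2" here) outranks one seen by a single node with a similar score, which is the usual intent of sum-based fusion.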
Cheshire Digital Library System
• Cheshire was originally created at UC Berkeley and more recently co-developed at the University of Liverpool. The system itself is widely used in the United Kingdom for production digital library services, including:
– Archives Hub
– JISC Information Environment Service Registry
– Resource Discovery Network
– British Library ISTC service
• The Cheshire system has recently gone through a complete redesign into its current incarnation, Cheshire3, enabling Grid-based IR over the Data Grid
Cheshire3 IR Overview
• XML Information Retrieval Engine
– 3rd Generation of the UC Berkeley Cheshire system,
as co-developed at the University of Liverpool.
– Uses Python for flexibility and extensibility, but uses
C/C++ based libraries for processing speed
– Standards based: XML, XSLT, CQL, SRW/U, Z39.50,
OAI to name a few.
– Grid capable. Uses distributed configuration files,
workflow definitions and PVM or MPI to scale from
one machine to thousands of parallel nodes.
– Free and Open Source Software.
– http://www.cheshire3.org/ (under development!)
Cheshire3 Server Overview
[Figure: the Cheshire3 server. A server control API sits over configuration, workflow, normalization, and indexing components, with XSLT record transforms, a DB API, native calls, and user information. Protocol handlers expose Z39.50, SOAP, OAI, SRW, OpenURL, UDDI, WSRP, OGIS, and JDBC interfaces (plus Fetch ID / Put ID operations) over the network to user clients, a staff UI, and remote systems speaking any protocol. The local database holds XML records, indexes, configuration and metadata, result sets, and access information]
Cheshire3 Object Model
[Figure: the ingest process. A Server hosts Databases and stores (ConfigStore, UserStore, RecordStore, DocumentStore, IndexStore). Documents arrive via a Protocol Handler as a DocumentGroup, pass through one or more PreParsers and a Parser to become Records, from which an Extracter and Normaliser produce Terms for an Index. Queries from a User run against a Database and return a ResultSet; Transformers convert Records back into Documents]
Workflow Objects
• Workflows are first class objects in
Cheshire3 (though not represented in the
model diagram) and now support
integration with Kepler workflows
• All Process and Abstract objects have
individual XML configurations with a
common base schema with extensions
• We can treat configurations as Records
and store in regular RecordStores,
allowing access via regular IR protocols.
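The idea of configurations-as-records can be sketched in a few lines. This is an illustration, not the actual Cheshire3 API; the class and method names are assumptions:

```python
# Illustrative sketch: treating XML object configurations as records
# in a store, retrievable by identifier like any other record.

import xml.etree.ElementTree as ET

class SimpleRecordStore:
    """Minimal stand-in for a Cheshire3 RecordStore."""
    def __init__(self):
        self._records = {}

    def store(self, identifier, xml_text):
        # Validate that the configuration is well-formed XML on the way in.
        ET.fromstring(xml_text)
        self._records[identifier] = xml_text

    def fetch(self, identifier):
        return self._records[identifier]

store = SimpleRecordStore()
store.store("indexWorkflow",
            "<config id='indexWorkflow'><step ref='parser'/></config>")
root = ET.fromstring(store.fetch("indexWorkflow"))
```

Because configurations live in an ordinary store, the same retrieval machinery (and IR protocols) that serves documents can serve them too.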
Workflow References
• Workflows contain a series of instructions
to perform, with reference to other
Cheshire3 objects
• Reference is via pseudo-unique identifiers
… Pseudo because they are unique within
the current context (Server vs Database)
• Workflows are objects, so this enables
server level workflows to call database
specific workflows with the same identifier
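The context-scoped lookup described above can be sketched as a chain of namespaces: resolve against the current Database first, then fall back to the Server. Names here are illustrative, not Cheshire3's:

```python
# Hypothetical sketch of pseudo-unique identifier resolution: the same
# identifier can name different objects at the Database and Server
# levels, with the narrower context winning.

class Context:
    def __init__(self, objects, parent=None):
        self.objects = objects      # identifier -> object
        self.parent = parent        # enclosing context, if any

    def resolve(self, identifier):
        if identifier in self.objects:
            return self.objects[identifier]
        if self.parent is not None:
            return self.parent.resolve(identifier)
        raise KeyError(identifier)

server = Context({"loadWorkflow": "server-level loader"})
database = Context({"loadWorkflow": "database-specific loader"},
                   parent=server)

db_obj = database.resolve("loadWorkflow")   # database-level wins
srv_obj = server.resolve("loadWorkflow")    # server level directly
```

This is why a server-level workflow can call "loadWorkflow" and get the right database-specific behaviour in each database context.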
Distributed Processing
• Each node in the cluster instantiates the
configured architecture, potentially through a
single ConfigStore
• Master nodes then run a high level workflow to
distribute the processing amongst Slave nodes
by reference to a subsidiary workflow
• As object interaction is well defined in the model,
the result of a workflow is equally well defined.
This allows for the easy chaining of workflows,
either locally or spread throughout the cluster
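Because each workflow's output is well defined, workflows compose like functions. A toy sketch of that chaining idea (the "workflows" here are trivial stand-ins, not Cheshire3 objects):

```python
# Sketch: workflows with well-defined inputs and outputs compose
# directly, whether they run locally or on another cluster node.

def chain(*workflows):
    """Return a workflow that feeds each step's output to the next."""
    def chained(data):
        for wf in workflows:
            data = wf(data)
        return data
    return chained

def normalize(text):
    return text.lower()

def tokenize(text):
    return text.split()

pipeline = chain(normalize, tokenize)
tokens = pipeline("Grid IR Scales")
```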
External Integration
• Cheshire workflows now allow integration
with existing cross-service workflow
systems, in particular Kepler/Ptolemy
• Integration at two levels:
– Cheshire3 as a Kepler service (black box): identify a workflow to call
– Cheshire3 objects as services (duplicate existing C3 workflow functions using Kepler)
Cheshire3 Grid Tests
• Initial experiments: 20 processors in Liverpool using PVM (Parallel Virtual Machine)
• Using 16 processors with one “master” and 22 “slave” processes, we were able to parse and index MARC data at about 13,000 records per second
• 610 MB of TEI data can be parsed and indexed in seconds
• More recently, 30 processors using MPI have shown even better results, unsurprisingly.
SRB and SDSC Experiments
• In conjunction with SDSC, we have implemented SRB support for processed data storage and data acquisition
• We are continuing work with SDSC to run further evaluations using the TeraGrid through a “small” grant for 30,000 CPU hours
– SDSC's TeraGrid cluster currently consists of 256 IBM cluster nodes, each with dual 1.5 GHz Intel® Itanium® 2 processors, for a peak performance of 3.1 teraflops. The nodes are equipped with four gigabytes (GB) of physical memory per node. The cluster is running SuSE Linux and is using Myricom's Myrinet cluster interconnect network.
• Large-scale test collections now include MEDLINE, NSDL, and the NARA preservation prototype, and we hope to use CiteSeer and the “million books” collections of the Internet Archive
• Tests with MEDLINE showed that we were able to index 750,000 MEDLINE records, including part-of-speech tagging, in 2 minutes 45 seconds (using 1 master and 59 slaves)
NARA Prototype
• The NARA (National Archives and Records
Administration) preservation prototype technology
demonstration…
• The current demo version is not designed to show the
fastest possible search times of the collection, but
instead to show how the system can scale while
retaining usable search times and interacting directly
with the preserved, original objects rather than a
migrated version.
• The collection was generated by a web crawl for documents relating to the Space Shuttle Columbia tragedy in 2003, and includes both federal documents (such as the CAIB site and other information from NASA) as well as third-party discussion such as news sites (BBC, CNN, etc.) and reputable information sources (Wikipedia, Space.com). As such, it is a preservation of the popular memory of the events as well as the official perspective.
NARA Prototype: Introduction
• To ensure scalability and longevity, the design criteria for
the prototype were:
– Use the TeraGrid for processing
– Use the Storage Resource Broker for all data storage
– Use Cheshire3 for a standards based, flexible information
architecture
– Use Multivalent for document processing and display
– Use standards wherever possible to promote architecture independence.
• The use of the prototype system has two distinct phases. First, the TeraGrid is used to process the documents stored in the SRB and store the processed information back into SRB collections. In the second phase, the user performs a search on the processed data stored in the SRB, and then interacts with the original documents directly.
Teragrid Indexing
• The following figure shows the initial data
processing steps performed on the Teragrid.
– Once machines in the grid have been allocated for
processing, each machine builds an instance of the
Cheshire3 architecture. The architecture is as we
described previously
– One machine is designated as the 'Master' and
controls the behaviour of all other machines, called
'Slaves'. For much larger scale datasets, multiple
master nodes could be assigned
– The Master acquires the list of documents to process
from the SRB and distributes these amongst the
slaves according to the SRB URL scheme.
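The distribution step above can be sketched with a simple round-robin dealer. The real system partitions by the SRB URL scheme; the URL format and function below are illustrative assumptions:

```python
# Hypothetical sketch: the master deals document URLs from the SRB
# collection out to the slaves, one list per slave.

def distribute(urls, n_slaves):
    """Round-robin assignment of URLs to slaves."""
    assignments = [[] for _ in range(n_slaves)]
    for i, url in enumerate(urls):
        assignments[i % n_slaves].append(url)
    return assignments

urls = [f"srb://zone/crawl/doc{i}.html" for i in range(7)]
per_slave = distribute(urls, 3)
```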
Teragrid Indexing
[Figure: the master (Master1) reads the file paths from the SRB web-crawl collection and sends one path to each slave (Slave1 … SlaveN); each slave fetches its objects and writes extracted data to GPFS temporary storage]
Teragrid Indexing
• Upon receiving a URL to process, a slave node retrieves the document from the SRB directly and steps through a configurable workflow of information processing steps. This includes analysis of every sentence in the text down to the part-of-speech level, as well as typical keyword and phrase indexing
• As part of the workflow, the slave node will store temporary indexing data into the TeraGrid-wide GPFS storage system. This data will then be merged in the following phase
• Within a single slave the processing includes:
Teragrid Indexing: Slave
[Figure: within each slave, a document fetched from the SRB web-crawl collection passes through an MVD document parser or XML parser, XPath extraction, data cleaning, an NLP tagger, a noun/verb filter, phrase detection, proximity processing, etc., with results written to GPFS temporary storage]
Teragrid Indexing 2
• Once all of the documents have been processed, the
information that was extracted must be merged into
inverted index files to be stored in the SRB. The
following graphic shows this process
• In the second phase of the indexing process, the master
first directs each slave to sort all of the temporary data
files that it has created
• The master then directs one node per index to merge
those data files into one combined file
• Then those nodes are directed to create inverted
indexes from the data. These indexes are created in
small chunks, one chunk per initial character in the term
extracted. After each chunk is built, the slave uploads it
into the SRB, and the temporary data files are deleted
• After all indexes have been loaded, the processing stage
is complete
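The merge-and-chunk phase just described can be sketched in memory (the real system works on sorted files in GPFS and uploads each chunk to the SRB; this is a simplified illustration):

```python
# Sketch of the merge phase: sorted postings from all slaves are merged
# into an inverted index, then split into chunks keyed by each term's
# initial character, as described above.

from collections import defaultdict

def build_chunks(posting_lists):
    inverted = defaultdict(set)
    for postings in posting_lists:            # one list per slave
        for term, doc_id in postings:
            inverted[term].add(doc_id)
    chunks = defaultdict(dict)
    for term, docs in inverted.items():
        chunks[term[0]][term] = sorted(docs)  # one chunk per initial char
    return dict(chunks)

slave1 = [("columbia", "d1"), ("shuttle", "d1")]
slave2 = [("columbia", "d2"), ("crew", "d2")]
chunks = build_chunks([slave1, slave2])
```

Chunking by initial character means a search only needs to fetch the chunks covering its query terms, which matters later when chunks are pulled from the data grid on demand.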
Teragrid Indexing 2
[Figure: the master (Master1) sends sort/load requests to the slaves (Slave1 … SlaveN); each slave sorts its extracted data in GPFS temporary storage and loads the merged data into the SRB index collection]
NARA Search Process
• In the Discovery phase, a user interacts with a front end to the system. This phase does not use the TeraGrid for processing, as it cannot be predicted when a user will want to search, and hence when use of the grid should be requested
• In the first step, the user performs a search via the interface. The front-end system retrieves the chunks of indexes necessary to fulfil the user's request from the SRB and returns a set of SRB URLs.
• The interaction between the Fab4 browser and the front end is performed via the SRW/U information retrieval standard. The response is XML, which is rendered as HTML via XSLT stylesheets. By modifying the stylesheets, different user interfaces can quickly be generated without modifying the server. By using a standard protocol, we can ensure that the response can be interpreted by different clients or generated by different front-end architectures. To verify this, you may load the pages with a regular web browser and select 'View Source' to see the underlying XML.
• The search may sometimes be a little slower than expected if the index chunks needed have not yet been retrieved from the data grid. Please note that the data is not replicated, so the chunks are being transferred from San Diego to Liverpool in this particular setup.
Search Phase
[Figure: the Multivalent browser sends a search request to the web interface; the interface fetches the needed index sections from the SRB index collection and returns SRB URIs to the browser. The components are distributed across Liverpool, San Diego, and Berkeley]
Search Process 2
• Once the SRB URLs have been returned
to the users, they may interact directly with
the original documents stored in the SRB
via Fab4. Fab4 connects to the SRB and
retrieves the document for display, rather
than through a web proxy interface
• (Note that this display is not available in
conventional browsers)
Search Phase 2
[Figure: the Multivalent browser passes the SRB URI of an object to the SRB web-crawl collection and receives the original object for display]
Multivalent Document Model
• MVD Model: “media adapters”, “behaviors” and “layers”
• Media adapters generate a “document tree” for internal representation of a particular document type
• Behaviors implement the active elements of a specific document or document type, as well as shared behaviors (like annotation) that can be applied to all document types
• The document tree is interpreted as layers for presentation (next page)
• MVD has media adapters giving direct support for interpretation and presentation of many different document formats
– HTML, PDF, scanned paper images and OCR, TeX DVI, AppleWorks, XML, Open Office, etc.
• Supports active, distributed, composable transformations of multimedia documents
• Enables sophisticated annotations, intelligent result handling, user-modifiable interface, composite documents
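The adapter/behavior split above can be sketched in miniature. These class and function names are assumptions for illustration only, not the Multivalent API:

```python
# Illustrative sketch of the MVD model: a media adapter builds a
# document tree, and a shared behavior attaches an annotation layer
# without modifying the underlying content.

class Node:
    def __init__(self, kind, text=""):
        self.kind, self.text, self.children = kind, text, []

def plain_text_adapter(raw):
    """Toy media adapter: one tree node per line of input."""
    root = Node("document")
    for line in raw.splitlines():
        root.children.append(Node("line", line))
    return root

def annotation_layer(tree, notes):
    """Shared behavior: attach annotations as child nodes."""
    for idx, note in notes.items():
        tree.children[idx].children.append(Node("annotation", note))
    return tree

tree = plain_text_adapter("first line\nsecond line")
tree = annotation_layer(tree, {1: "see Table 1"})
```

The adapter owns the format-specific parsing; behaviors like annotation work against the tree and therefore apply to any document type with an adapter.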
Multivalent Documents
[Figure: a scanned page from “History of The Classical World”, shown as stacked layers: the scanned page image, an OCR layer, an OCR mapping layer, a table layer, a GIS layer, and a Cheshire document content extraction layer. A dictionary annotation gives Webster's 7th Collegiate definition of valence: “2: The relative capacity to unite, react, or interact (as with antigens or a biological substrate).”]
NARA Prototype
• Fab4 Multivalent browser available at http://bodoni.lib.liv.ac.uk:8080/fab4/
• NARA prototype at http://srw.cheshire3.org/services/nara
Summary
• Indexing and IR work very well in the Grid
environment, with the expected scaling
behavior for multiple processes
• We are working on other large-scale
indexing projects (such as the NSDL) and
will also be running evaluations of retrieval
performance using IR test collections such
as the TREC “Terabyte track” collection
Thank you!
Available via http://www.cheshire3.org