A Data-Driven Service Platform for Building Digital
Libraries
Chunxiao Xing¹, Ming Zhang², Yong Zhang¹, Lemen Chao¹, Lizhu Zhou¹
¹Research Institute of Information Technology, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
{xingcx,zhangyong05}@tsinghua.edu.cn, [email protected], [email protected]
²Department of Computer Science and Technology, Peking University, Beijing 100871, China
[email protected]
Abstract:
Digital libraries face new opportunities and challenges brought by the latest technological advances: cross-domain distributed networks and heterogeneous, data-driven demands. To meet these challenges, we are pursuing a Cross-Domain Sharing and Service Support Platform project. The major goals of the project are: (1) to conduct a concrete requirement analysis for data-driven applications and cross-domain sharing; (2) to extend the digital object model by object decomposition to form a digital service object model that supports digital library development; and (3) to design an architecture for the cross-domain sharing and service support platform as data access middleware for building digital libraries conveniently. This paper presents the major contributions of our work in digital resource service modeling and platform architecture. As a showcase for the platform, the paper briefly describes iDLib, a digital library system that offers cross-domain data access and personalized services.
Keywords: digital library, data-driven, cross-domain sharing, cross-domain service, iDLib
1. Introduction
With the development of information technology, data are becoming an important factor in driving organizational behaviors. In the networked world, where the amount of data grows exponentially, it is increasingly important to solve crucial problems such as how to share distributed, heterogeneous, dynamic, and vast amounts of data across domains, and how to provide efficient data services to support organizational activities. Digital libraries are important data resources in modern society, and their development raises the following three new requirements.
(1) Data-Driven
Currently, applications in large-scale information systems are gradually moving from computation-driven to data-driven, and data-driven applications have dominated the construction of digital libraries. The characteristics of data-driven applications are summarized as follows:
• data-centric design of the application system;
• tight integration with comprehensive data management (database/data warehousing, XML data, digital storage) and technical services (ingestion, analysis, mining, decision, and presentation);
• the need for comprehensive plans combining architecture, reference model, and infrastructure, which together serve as a data management and service support platform enabling end users to make decisions or develop applications. In fact, how to build such platforms has attracted many researchers interested in the construction of future digital libraries.
(2) Cross-domain sharing and services
In distributed digital libraries, organization and management are both based on "domains", which provide services as independent systems. A domain can be a general industry network or a topic-sensitive network. Nowadays, both intra-domain and cross-domain management are urgently needed for large-scale digital libraries. Managing metadata and object data that cross regions at the national or global level is a challenging task.
(3) Dynamic management of massive data
Web technologies for digital libraries have progressed from the simple Web 1.0 model to the complex Web 2.0 model. Web 2.0 models provide more dynamic interactivity and more applications such as blogs, tags, SNS, RSS, and wikis. Digital libraries have accumulated widespread, varied, and massive data that are still used inefficiently [1].
Given the cross-domain distribution, heterogeneity, and data-driven features of digital libraries, this paper investigates the key techniques needed to support the construction of a cross-domain sharing and service platform for building digital libraries.
The organization of the paper is as follows: Section 1 introduces the motivation of our research by summarizing the requirements for cross-domain sharing and data-driven services. Section 2 discusses the digital resource service model featuring five service components. Section 3 describes the architecture of the cross-domain sharing and service support platform for building digital libraries. Section 4 briefs iDLib, a digital service system serving as a showcase for the platform implementation. The last section concludes.
2. Modeling Digital Resource Service Objects
2.1 The definition of digital resources
In order to manage digital resources effectively, we need a model to represent the relevant information and classify the related functions. From the viewpoint of institutional repositories, there are two typical models for describing digital resources, from Fedora [2] and DSpace [3] respectively.
Fedora defines a generic digital object model that can express many kinds of objects. The basic components of a Fedora digital object are: PID, Object Properties, DataStream(s), and Disseminator(s). Although every Fedora digital object conforms to the Fedora object model, three distinct types of Fedora digital objects can be stored in a Fedora repository: Data Objects, Behavior Definition Objects, and Behavior Mechanism Objects. In DSpace, data is arranged as communities containing collections of items, which bundle bitstreams together. A community is the highest level of the DSpace content hierarchy; communities correspond to parts of the organization such as departments, labs, research centers, or schools. An item is an "archival atom" consisting of grouped, related content and associated descriptions (metadata). An item's exposed metadata is indexed for browsing and searching. Items are organized into collections of logically related material.
However, neither of the aforementioned models can be applied directly to cross-domain sharing and services. We therefore design a model that supports cross-domain functions by adopting the Service Component Architecture (SCA) to maximize the model's flexibility. The SCA specifications [4] and white paper [5] define SCA as a set of specifications for building SOA applications, covering how to create components and how to combine those components into complete applications. If services are built upon SCA, applications are not affected even when changes occur. On this basis, we propose the Digital Resource Service Component (DRSC) model to describe the properties, services, and references of digital resources.
DRSC is an extension of traditional digital object models. For a general institutional repository, an object ID can be a list of incremental numbers or other predefined numbers within a single domain. But an object ID used in cross-domain applications has to be strictly unique no matter when and where it is produced. That is why we use UUIDs (Universally Unique Identifiers) as identifiers. A UUID is 128 bits long and can be guaranteed unique across space and time. UUIDs were first used in the Apollo Network Computing System, then in the Open Software Foundation's (OSF) Distributed Computing Environment (DCE), and later in Microsoft Windows platforms as GUIDs (Globally Unique Identifiers). We follow the same pattern in DRSC.
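As a minimal illustration (the paper shows no code; the class and method names below are our own), a DRSC identifier can be generated in Java, the language iDLib is implemented in, using only the standard library:

```java
import java.util.UUID;

public class DrscIdGenerator {
    // A version-4 (random) UUID: 128 bits, unique across space and time
    // for all practical purposes, so identifiers minted independently in
    // different domains will not collide.
    public static String newObjectId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        System.out.println("New DRSC object ID: " + newObjectId());
    }
}
```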
An SCA component consists of Services, References, and Properties, as shown in Figure 1. A component offers functions to other components through its Services, consumes the services of other components through its References, and is configured through its Properties.
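To make the Services/References/Properties split concrete, here is a sketch of a DRSC-style component assuming the OSOA SCA Java annotations [6]; the interface and field names are our own illustrations, not taken from the paper:

```java
import org.osoa.sca.annotations.Property;
import org.osoa.sca.annotations.Reference;
import org.osoa.sca.annotations.Service;

// Illustrative service interface that this component exposes.
interface MetadataService {
    String getDublinCore(String objectId);
}

// Illustrative interface of another component this one depends on.
interface IdentifierService {
    boolean exists(String objectId);
}

@Service(MetadataService.class)   // Services: what the component offers
public class MetadataComponent implements MetadataService {

    @Reference                    // References: services it consumes
    protected IdentifierService identifiers;

    @Property                     // Properties: configurable attributes
    protected String repositoryUrl;

    public String getDublinCore(String objectId) {
        if (!identifiers.exists(objectId)) {
            throw new IllegalArgumentException("Unknown DRSC object: " + objectId);
        }
        // Fetching the Dublin Core record from repositoryUrl is omitted.
        return "<dc:title>...</dc:title>";
    }
}
```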
[Figure: an SCA component with its Properties, Services, References, and component type; a DRSC object composed of ID, metadata, content, log, and annotation components.]
Figure 1. SCA component
Figure 2. DRSC object
2.2 The atomic components
We define a DRSC object as an integration of five atomic service components: identifier, metadata, content, log, and annotation, as shown in Figure 2.
(1) Identifier (ID) Component. The properties of this component include a unique identifier, registration information, and a list of pointers to the services provided by the same object. The identifier is a UUID. Registration information includes the agent, registration timestamp, approval, and so on. Basic search is supported by Dublin Core metadata. The pointers locate the other four components; thus the ratio of the identifier component to the other components is 1:N. The services provided include registration, search, and location.
(2) Metadata Component. This component contains the information about the DRSC object itself and the relationships between this object and other objects. Properties include Dublin Core metadata and other metadata. The links between objects are expressed as binary tuples; e.g., (ID2, parent) indicates that the current DRSC object is a parent of the ID2 DRSC object. The metadata may contain some redundant information, such as the location of the DRSC. There are two groups of services: data manipulation and data access.
(3) Content Component. The properties are the multiple versions of the resource object. Each individual version carries information on the issue date, format, and creator/modifier. The component provides uploading and downloading services. During upload, it creates a component for a new resource or, for an existing resource, tracks the entire history of the digital resource and saves the version update information. Hence, the component supports the content of a DRSC object throughout its life cycle.
(4) Log Component. Properties include the operator, operation type, operation text, and operation results. There are two types of logs: the access log, which does not change the DRSC object's metadata or content, and the operation log, which revises the DRSC object's metadata or content. The access log records the visiting history and hence can be used to analyze users' behavior patterns and preferences, while the operation log tracks the updates of a DRSC object for the purposes of auditing and recovery. Thus, this component provides two services: log recording and log analysis.
(5) Annotation Component. Properties of this component include ratings, tags, comments, and usage status. From our point of view, the differences between tags and metadata are: a) metadata is given by experts, while tags are annotated by non-professional users; b) metadata is usually selected from a formal vocabulary, while tags are informally chosen from casual terms by users; c) metadata quality is guaranteed, while the quality of tags is not. However, the advantage of tags is that they can be modified over time to reflect the current understanding of DRSCs, while metadata remains unchanged.
Ratings are used to measure the quality of a DRSC object and to rank search results. Comments are reviews written by real users; they can be used as resources for data mining in conjunction with the tag text. For a DRSC object, the usage status is optional and helps users manage their learning process; in addition, it helps calculate the relationships between users and resources. Tags can also be used in recommendation methods such as collaborative filtering. We use a six-element tuple to describe a user's interaction with the corresponding DRSC object: <user, resource, rating, tags, comments, usage status>. From the tags, user similarity and tag similarity can be calculated.
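The paper does not fix a particular similarity measure; one simple realization of tag-based user similarity is the Jaccard coefficient over two users' tag sets (a sketch under that assumption):

```java
import java.util.HashSet;
import java.util.Set;

public class TagSimilarity {
    // Jaccard similarity |A ∩ B| / |A ∪ B| between two users' tag sets;
    // 0 means no shared tags, 1 means identical tag sets.
    public static double jaccard(Set<String> tagsA, Set<String> tagsB) {
        if (tagsA.isEmpty() && tagsB.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(tagsA);
        intersection.retainAll(tagsB);
        Set<String> union = new HashSet<>(tagsA);
        union.addAll(tagsB);
        return (double) intersection.size() / union.size();
    }
}
```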
2.3 Composition of DRSC Components
The above five components compose a DRSC object (Figure 2). These components need not reside on a single computer; they may be spread over a set of distributed computers on the network. The property configuration of a DRSC object, such as the link to the corresponding component, records this placement. SCA supports implementations with multiple interfaces to ensure the flexibility of a DRSC object. Application components can be implemented in different languages, such as Java, C++, COBOL, etc. The platform complies with the OSOA standard [6].
The metadata, content, annotation, and log components are all based on the ID component. Pointers in the ID component locate the other four components. When the status of any of the four components changes, it notifies the ID component to update the registration information. The high flexibility of a DRSC object allows automatic integration with other transfer protocols, e.g., Web services, MQ, HTML, REST, and so on.
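The notification path from the four components back to the ID component is essentially an observer pattern; a minimal sketch (the listener interface and class names are our own assumptions):

```java
// Change notifications flow from the metadata, content, annotation, and log
// components to the ID component, which refreshes its registration info.
interface ComponentChangeListener {
    void onComponentChanged(String componentName, String objectId);
}

class IdComponent implements ComponentChangeListener {
    @Override
    public void onComponentChanged(String componentName, String objectId) {
        // Update the registration information, e.g. a last-modified timestamp
        // and the pointer to the changed component (persistence omitted).
        System.out.println("Registration of " + objectId
                + " updated after change in " + componentName);
    }
}
```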
The service component model of digital resources has the following advantages: (1) it simplifies the process of development, composition, and deployment; (2) it improves portability, reusability, and flexibility; and (3) it reduces the burden on the organization by hiding the supporting technology.
For a digital resource, there are two steps to create a DRSC object: divide the properties into the five components, and then build a management system. In the first step, the metadata includes Dublin Core metadata and metadata extracted from the digital resource. In the content component, we need to add additional information, such as the format type, file size, creation date, version number, and so on. In addition to the original format, we also have to take possible target formats into consideration; if a module for automatic conversion does not exist, the format conversion function should be added manually. An annotation of a digital resource may not correspond to exactly one record in the metadata component. The relationships are shown in Figure 3. The ratio of the ID component to each of the other four components is 1:N, and the ratio of metadata elements to content components is also 1:N. In some cases, however, objects can be retrieved directly by visiting the ID component for efficiency.
[Figure: the 1:N relationships between the ID component and each of the metadata, annotation, content, and log components, and the 1:N relationship between metadata and content.]
Figure 3. The relationships between components
In the second step, there are several ways to build the system. The first method is based on analysis and uses an assistant tool to create the source code and distribution packages. The second method is to reuse open-source atomic components downloaded from the Internet; assistant tools can help users develop applications. Compared with the first method, the second can greatly reduce the overall cost: users only need to install a lightweight Graphical User Interface (GUI) without storing local data. The third way is to create objects for digital resources directly on the Web; users only need to provide the specific information of the digital resources, and a SaaS GUI is provided to manage them. In general, the three methods range from locally installed systems to remote services, and from complex to simple (Figure 4).
[Figure: three ways of assembling a DRSC object from its five components, with the cost of construction and operation decreasing from one configuration to the next.]
Figure 4. Construction of a DRSC object
For a created DRSC object, there are two types of users: end users and developers. If end users only need basic functions, such as add, delete, modify, and search, we can provide a Web interface. But end users usually need more functions or various GUIs, so developers can write different interfaces and new features on top of the interfaces provided by DRSCs. Furthermore, besides serving a new digital resource management system, DRSCs can also be invoked by pre-existing software systems.
3. Cross-domain network service platform supporting data-driven applications
3.1 Architecture
The architecture of the platform consists of the following parts: common data access, a catalogue and exchange system, integrated data management, a data-driven engine, a data service engine, and an enterprise service bus. The following subsections elaborate the three essential components of the architecture, namely the Integrated Data Management layer, the Data-Driven Engine, and the Data Service Engine, as shown in Figure 5.
(1) Integrated Data Management layer. This layer is responsible for the integrated management of generalized data: data, information, and knowledge. Typical data-driven applications for sharing and services involve the integrated management of three types of data: structured (relational databases), semi-structured (XML data), and unstructured (digital object storage) data. The key part is a high-performance engine providing data integration services. The basic functions of the layer are as follows: heterogeneous data access, data cleaning, data modeling, data transformation, data loading, mass migration, updated-data alerts (moving only the updated data for performance), built-in data partitioning, data distribution, multi-point data source systems, high availability, fault tolerance, and quick failover.
(2) Data-Driven Engine. This engine contains a rule library and a model library. By dynamically monitoring the rules in the library, dynamic integrated services and cross-domain, dynamic business transactions can be performed (a minimal sketch follows this list).
(3) Data Service Engine. This engine provides functions for applications, such as data monitoring, a message mechanism, object components, object containers, data analysis, data mining, data archiving, and event response. The major issue in building this engine is how to coordinate different types of transactions, together with the relevant principles for real-time data update monitoring, messaging, and data-driven applications, so that the engine can offer the necessary services beneath the ESB layer.
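The paper does not detail the rule engine; the sketch below, under our own event and rule types, shows how rules in the library could map detected data changes to service invocations:

```java
import java.util.List;
import java.util.function.Predicate;

// A change detected on the integrated data management layer (illustrative).
record DataEvent(String objectId, String changeType) {}

// A rule pairs a condition on incoming events with an action,
// e.g. a cross-domain service call (illustrative).
record Rule(String name, Predicate<DataEvent> condition, Runnable action) {}

class DataDrivenEngine {
    private final List<Rule> ruleLibrary;

    DataDrivenEngine(List<Rule> ruleLibrary) {
        this.ruleLibrary = ruleLibrary;
    }

    // Fire every rule whose condition matches the incoming event.
    void onDataChange(DataEvent event) {
        for (Rule rule : ruleLibrary) {
            if (rule.condition().test(event)) {
                rule.action().run();
            }
        }
    }
}
```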
[Figure: academic, government, and enterprise resource applications sit on an Enterprise Service Bus (ESB); below it are the Data Service Engine, the Data-Driven Engine, the Integrated Data Management Layer, the Catalogue and Exchange System, and Common Data Access (cross-domain management and security authentication) over the data interfaces of the individual domains.]
Figure 5. The architecture of the cross-domain network service platform supporting data-driven applications
As illustrated in Figure 5, the data service engine can be built not only on the integrated data management layer but also on the data-driven engine. In the latter case, the data-driven engine, whose core technology is a rule engine, detects changes in the data management layer dynamically. Data changes on the integrated data management layer trigger the data-driven engine, which can then provide the data service engine with specific data services.
3.2 Key technologies
The key technologies can be divided into three categories: aggregating cross-domain data resources, integrating and sharing demand-based data, and supporting large-scale concurrent users.
3.2.1 Aggregating cross-domain data resources
(1) Unified Data Model
As mentioned above, data can be divided into three types: structured, semi-structured, and unstructured, managed respectively by relational databases, XML, and digital objects. We present a method to provide unified management: more specifically, we create links among the different types of data in the catalogue system and process each with the corresponding method after pre-checking the data type.
(2) Metadata-based catalogue system and exchange system
In order to support the unified data model, we studied a management and service system for multiple digital object identifiers, based on the national catalogue and exchange systems. Furthermore, we proposed the CDOI system based on the DOI system.
(3) Multi-document summarization
This technique extracts the key content from multiple documents and generates a summary report automatically. Taking novelty search as an example, the summarization algorithm proceeds as follows:
i) calculate the importance of every sentence by MMR (the standard criterion is recalled below);
ii) pick the sentences with the highest scores to ensure high relevance and low redundancy [7][8].
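For reference, step i) alludes to the standard Maximal Marginal Relevance criterion, which the paper does not reproduce; in the usual formulation,

```latex
\mathrm{MMR} = \arg\max_{D_i \in R \setminus S}
  \Big[ \lambda \,\mathrm{Sim}_1(D_i, Q)
        - (1-\lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \Big]
```

where R is the candidate sentence set, S the set of sentences already selected, Q the query, and λ trades relevance against redundancy.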
(4) Semantics-based cross-domain large-scale information retrieval
The major issues in this area include:
i) how to help users express their search intent accurately;
ii) how to resolve ambiguity;
iii) how to adjust the retrieval range;
iv) how to add user-specific elements; and
v) how to search metadata.
We adopt multiple domain ontologies and Web 2.0 technologies to address these problems.
(5) Ontology Storage and Query
By taking advantage of Jena, SDB, Joseki, SPARQL, and Tomcat, we store OWL files in a relational database. Users can then search and reason over heterogeneous, dynamic, and massive data through a web application.
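As an illustration of the query path (using current Apache Jena package names, which postdate the paper's implementation; the file name and query are hypothetical), loading an OWL file and running a SPARQL query looks like:

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class OntologyQueryDemo {
    public static void main(String[] args) {
        // Load an OWL ontology into an in-memory model; the platform stores
        // models in a relational database via SDB, which is omitted here.
        Model model = ModelFactory.createDefaultModel();
        model.read("domain-ontology.owl"); // hypothetical file name

        String sparql = "SELECT ?s ?label WHERE { "
                + "?s <http://www.w3.org/2000/01/rdf-schema#label> ?label } LIMIT 10";
        try (QueryExecution qe = QueryExecutionFactory.create(sparql, model)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next());
            }
        }
    }
}
```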
3.2.2 Demand-based data integration and sharing
(1) Service composition and verification
We use Chu spaces to formally model and verify WS-BPEL. The process is as follows:
i) convert the WS-BPEL to its control-flow framework, obtaining a BPEL-CF procedure;
ii) based on the process algebra of Chu spaces, map the procedure to a Chu space;
iii) define a specification language to describe the properties of the procedure, and provide algorithms for verification.
For Web service composition modeling and verification, the process is as follows:
i) convert the user-submitted WS-BPEL to its control-flow framework, obtaining a BPEL-CF procedure;
ii) based on the process algebra of Chu spaces, map the procedure to a Chu space with semantic data, as the first input;
iii) according to the user's requirements, define a specification language to describe the properties of the procedure, and provide algorithms for verification, as the second input;
iv) scan the inputs and verify them with the Chu space verification tools; if the properties are satisfied, return true, otherwise report the detailed error message [9][10][11].
(2) Collaborative filtering based on social tags [12]
We present IBeST (Item-Based with Social Tags), built on item-based CF. IBeST is an algorithm framework that extends item-based CF to the social-tag level. Unlike traditional CF, which evaluates item-item relevance only, IBeST computes an additional relevance score from social tags and ratings, and then incorporates that score into the original relevance. The steps are as follows:
i) pre-process the social tags into formatted data for the subsequent steps;
ii) treat metadata as weighted tags to add semantic information;
iii) optimize the weights and calculate the weighted average;
iv) calculate the prediction score by traditional CF.
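The incorporation step can be pictured as a weighted blend of the two relevance scores; a minimal sketch (alpha is an illustrative weight — the paper optimizes the weights rather than fixing them):

```java
public class IBeSTRelevance {
    // Blend the classic rating-based item-item relevance with the
    // tag-based relevance; alpha in [0, 1] controls the mix.
    public static double combined(double ratingSim, double tagSim, double alpha) {
        return alpha * ratingSim + (1 - alpha) * tagSim;
    }
}
```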
(3) Graph-based personalized recommendation [13][14]
We make use of the properties of social tags to calculate similarity among users, and add the similarity scores into the graph to overcome the sparsity problem. Extensive experiments on real data show that our algorithm effectively improves the precision of personalized recommendation.
(4) Long-term preservation of digital resources based on risk management
We take the following risks into consideration: format and version, software, operating system, hardware, etc. The proposed method is combined with existing open-source projects to help monitor and mitigate multiple risks in digital resource management and storage; moreover, it provides an integrated risk management model covering risk planning, risk identification, risk analysis, risk handling, and risk monitoring and control.
3.2.3 Support of Large-Scale Concurrent Users
(1) Massive data mining based on Map-Reduce
We take Map-Reduce as the unified mechanism for data mining on massive unstructured resources. More specifically, we implement methods for classification, clustering, association mining, etc. within the Map-Reduce framework, with Hadoop as the base system [15].
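The mining code itself is not published; as a minimal illustration of the Map-Reduce pattern on Hadoop, counting social tags across resources (assuming one whitespace-separated tag list per input line, a format we introduce for the example) could look like:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (tag, 1) for every tag on the input line.
class TagCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text tag = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String t : value.toString().split("\\s+")) {
            if (!t.isEmpty()) {
                tag.set(t);
                context.write(tag, ONE);
            }
        }
    }
}

// Reduce phase: sum the counts for each tag.
class TagCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```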
(2) Architecture for large-scale concurrent transactions
We surveyed many architectures for large-scale concurrent transactions and devised a model to support our web-based applications. The model handles large-scale access by balancing the load across the web server and application server tiers. For data services, to improve efficiency, it adopts a key-value method in the style of the IBM Data Facility for core transactions, while adopting an in-memory database, caches, and a customized memory architecture for other transactions.
3.3 Prototype platform
Based on the system architecture and the related key technologies, we implemented a platform for cross-domain sharing and service support, as shown in Figure 6.
[Figure: three layers behind an application generation tool with service reliability/usability/standards — a Service Component Layer with ingest/process/integrate (data cleaning, transformation, modeling), organization (categorized, full-text, and assistant indexing), analysis (behavior mining, classification/clustering, text summarization), service (data query, personalization, service composition), and management (long-term preservation, component management, privilege management) modules; a Data Service Cloud Layer exposing identifier, metadata, file, annotation, and log management through a Data Service Access Interface; and a Data Layer of RDB, XML DB, LDAP, and file system storage.]
Figure 6. The architecture of the prototype platform for cross-domain sharing and service support
There are three layers in Figure 6: the data layer, the data service cloud layer, and the service component layer. The data layer provides distributed storage; the data service cloud layer provides DRSCs (Digital Resource Service Components) for object services; and the service component layer encapsulates multiple software modules that can be integrated into different digital library applications.
4. Cross-Domain Digital Resource Personalized Services --- iDLib
To put the architecture of the platform into practice, we developed a cross-domain digital resource service application, iDLib. The digital resources of the data layer are obtained from various domains on the Internet, such as the National Library of China, the China National Knowledge Infrastructure (CNKI), the China Academic Library & Information System (CALIS), Tsinghua University Library, and Peking University Library.
4.1 Functions
Based on resources from various digital resource domains, iDLib provides cross-domain personalized services for digital resources. These services support academic activities, respond to requirements automatically, and discover the latest hot resources and disciplines. There are four main functions:
(1) Retrieval: search relevant resources by keywords.
(2) Analysis: novelty consulting to generate a report; recommendation, topic tracking, etc.
(3) Feedback: receive users' feedback, analyze users' preferences, and provide personalized services based on users' interests.
(4) Management: classification of resources, a tagging service to organize resources, etc.
4.2 Core components
iDLib aims to build a system for cross-domain sharing and service support. There are five core components in iDLib: the identifier, metadata, content, log, and annotation components. The system is developed in Java with the SSH framework and MySQL. Figure 7 shows the user interface of iDLib.
Figure 7. Web UI of iDLib
5. Conclusion
In this paper, we propose a platform for cross-domain sharing and service support for building digital libraries. The platform addresses the following specific needs: data-driven applications, cross-domain large-scale data sharing, and dynamic management. We formally model digital resource objects and design a generalized architecture consisting of universal data access, catalogue and exchange systems, integrated data management, a data service engine, a data-driven engine, and an enterprise service bus.
The development and public use of iDLib provide a test bed for the feasibility and efficiency of the proposed platform. In the future, we will continue to improve the platform by developing other novel applications for digital libraries, such as a Digital Library for Chinese Science History.
References
[1] Xing Chun-Xiao, Zeng Chun, Li Chao, Zhou Li-Zhu. Study on architecture of massive information management for digital library. Journal of Software, 2004, 15(1):76-85.
[2] Carl Lagoze, Sandy Payette, Edwin Shin, Chris Wilper. Fedora: an architecture for complex objects and their relationships. International Journal on Digital Libraries, Volume 6, Number 2, April 2006, pp. 124-138.
[3] MacKenzie Smith, Mary R. Barton, Margret Branschofsky, Greg McClellan, Michael J. Bass, David Stuve, Julie Harford Walker, Robert Tansley. DSpace: An Open Source Dynamic Digital Repository. D-Lib Magazine, 2003(9).
[4] Service Component Architecture Specifications (2009). http://www.osoa.org/display/Main/Service+Component+Architecture+Specifications.
[5] SCA White Paper (technical). http://www.ibm.com/developerworks/library/specification/ws-sca/
[6] OSOA, the Open Service Oriented Architecture collaboration [2010-5-18]. http://www.osoa.org/display/Main/Home.
[7] Qinglin Guo, Ming Zhang. Multi-documents Automatic Abstracting based on text clustering and semantic analysis. Knowledge-Based Systems, Volume 22, Issue 6, August 2009, pp. 482-485.
[8] Yanxing Zhang, Ming Zhang, Zhihong Deng. Users' comments automatic summarization based on user features. Computer Research and Development, 46(Suppl.):520-525, 2009.
[9] Xutao Du, Chunxiao Xing, Lizhu Zhou. A Chu Spaces Semantics of BPEL-Like Fault Handling. FCST 2009, Shanghai, pp. 317-323.
[10] Xutao Du, Chunxiao Xing, Lizhu Zhou. A Chu Spaces Semantics of Control Flow in BPEL. APSCC 2009, Singapore.
[11] Xutao Du, Chunxiao Xing, Lizhu Zhou. A Chu Spaces Process Algebra. Computer Science, 2009.
[12] Li Zhou, Yong Zhang, Chunxiao Xing. A Collaborative Filtering Algorithm Based on Global and Domain Authorities. ICADL 2008, Springer, Bali, Indonesia, pp. 164-173.
[13] Ziqi Wang, Yuwei Tan, Ming Zhang. Graph-Based Recommendation on Social Networks. In Wook-Shin Han et al. (Eds.): Advances in Web Technologies and Applications, Proceedings of the 12th Asia-Pacific Web Conference, APWeb 2010, Busan, Korea, 6-8 April 2010. IEEE Computer Society, ISBN 978-0-7695-4012-2, pp. 116-122.
[14] Ming Zhang, Ziqi Wang. Graph-based personalized recommendation algorithm and system for social network. Invention patent, registration no. 201010102050.2.
[15] Guojun Liu, Ming Zhang, Fei Yan. Large-scale Social Network Analysis based on MapReduce. International Conference on Computational Aspects of Social Networks, September 26-28, 2010, Taiyuan, China.