A Data-Driven Service Platform for Building Digital Libraries

1 Chunxiao Xing, 2 Ming Zhang, 1 Yong Zhang, 1 Lemen Chao, 1 Lizhu Zhou

1 Research Institute of Information Technology, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
{xingcx,zhangyong05}@tsinghua.edu.cn, [email protected], [email protected]
2 Department of Computer Science and Technology, Peking University, Beijing 100871, China
[email protected]

Abstract: Digital libraries face new opportunities and challenges brought by the latest technological advances: cross-domain distributed networks and heterogeneous, data-driven demands. To meet these challenges, we are pursuing a Cross-domain Sharing and Service Support Platform project. The major goals of the project are: (1) to conduct a concrete requirement analysis for data-driven applications and cross-domain sharing; (2) to extend the Digital Object Model by object decomposition and to form the Digital Services Object Model to support digital library development; and (3) to design an architecture for the cross-domain sharing and service support platform as the data access middleware for building digital libraries conveniently. This paper presents the major contributions of our work in digital resource service modeling and platform architecture. As a showcase of the platform in use, the paper briefly describes iDLib, a digital library system that offers cross-domain data access and personalized services.

Keywords: digital library, data-driven, cross-domain sharing, cross-domain service, iDLib

1. Introduction

With the development of information technology, data are becoming an important factor in driving organizational behavior. In a networked world where the amount of data grows exponentially, it is increasingly important to solve crucial problems such as how to share distributed, heterogeneous, dynamic, and massive data across domains, and how to provide efficient data services that support organizational activities. Digital libraries are important data resources in modern society, and their development raises the following three new requirements.

(1) Data-driven applications. Applications in large-scale information systems are gradually moving from computation-driven to data-driven, and data-driven applications now dominate the construction of digital libraries. Their characteristics can be summarized as follows: data-centric design of the application system; tight integration of comprehensive data management (databases/data warehousing, XML data, digital storage) with technical services (ingestion, analysis, mining, decision support and presentation); and the need for comprehensive planning that combines architecture, reference models and infrastructure into a data management and service support platform on which end users can make decisions or develop applications. How to build such platforms has attracted many researchers interested in the construction of future digital libraries.

(2) Cross-domain sharing and services. In distributed digital libraries, organization and management are both based on the "domain", and each domain provides services as an independent system. A domain can be a general industry network or a topic-specific network. Both intra-domain and cross-domain management are now urgently needed for large-scale digital libraries.
It is a challenging task to manage metadata and object data that cross regional boundaries at the national or global level.

(3) Dynamic management of massive data. Web technologies for digital libraries have progressed from the simple Web 1.0 model to the more complex Web 2.0 model, which provides richer interactivity and applications such as blogs, tags, SNS, RSS and wikis. Digital libraries have accumulated widespread, varied and massive data that are still used inefficiently [1].

Given the cross-domain distribution, heterogeneity and data-driven nature of digital libraries, this paper investigates the key techniques for building a cross-domain sharing and service platform for digital libraries. The paper is organized as follows. Section 1 introduces the motivation of our research by summarizing the requirements for cross-domain sharing and data-driven services. Section 2 discusses the digital resource service model featuring five service components. Section 3 describes the architecture of the cross-domain sharing and service support platform for building digital libraries. Section 4 briefs iDLib, a digital service system that serves as a showcase of the platform implementation. The last section concludes the paper.

2. Modeling Digital Resource Service Objects

2.1 The definition of digital resources

To manage digital resources effectively, we need a model that represents the relevant information and classifies the related functions. From the viewpoint of institutional repositories, there are two typical models for describing digital resources, from Fedora [2] and DSpace [3] respectively.

Fedora defines a generic digital object model that can express many kinds of objects. The basic components of a Fedora digital object are the PID, Object Properties, Datastream(s) and Disseminator(s). Although every Fedora digital object conforms to this model, three distinct types of objects can be stored in a Fedora repository: Data Objects, Behavior Definition Objects, and Behavior Mechanism Objects. In DSpace, data is organized into communities, collections and items, where items bundle bitstreams together. A community is the highest level of the DSpace content hierarchy; communities correspond to parts of an organization such as departments, labs, research centers or schools. An item is an "archival atom" consisting of grouped, related content and associated descriptions (metadata). An item's exposed metadata is indexed for browsing and searching, and items are organized into collections of logically related material.

However, neither of these two models can be directly applied to cross-domain sharing and services. We therefore design a model that supports cross-domain functions by building on the Service Component Architecture (SCA) to maximize the model's flexibility. According to the SCA specifications [4] and white paper [5], SCA is a set of specifications for building SOA applications; it defines how to create components and how to combine those components into complete applications. If services are built upon SCA, applications are not affected even when the underlying implementations change. On this basis we propose the Digital Resource Service Component model (DRSC) to describe the properties, services and references of digital resources. DRSC is an extension of traditional digital object models.

In a typical institutional repository, an object ID can be an incremental number or another predefined number within a single domain. For object IDs used in cross-domain applications, however, the identifier has to be strictly unique no matter when and where it is produced. That is why we use UUIDs (Universally Unique Identifiers) as identifiers. A UUID is 128 bits long and can be guaranteed unique across space and time. UUIDs were first used in the Apollo Network Computing System, then in the Open Software Foundation's (OSF) Distributed Computing Environment (DCE), and later on Microsoft Windows platforms as GUIDs (Globally Unique Identifiers). We follow the same pattern in DRSC.
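As a minimal illustration (not part of the platform's actual code), a cross-domain identifier can be minted with the standard Java UUID class; the Identifier class and its registration fields shown here are hypothetical placeholders for the registration information described above:

import java.util.UUID;

// Minimal sketch: generating a globally unique identifier for a new DRSC object.
// The Identifier class and its registration fields are illustrative assumptions,
// not the platform's actual classes.
public class DrscIdentifierExample {

    static class Identifier {
        final String uuid;          // 128-bit UUID in its canonical text form
        final String registeredBy;  // registering agent
        final long registeredAt;    // registration timestamp (epoch milliseconds)

        Identifier(String uuid, String registeredBy, long registeredAt) {
            this.uuid = uuid;
            this.registeredBy = registeredBy;
            this.registeredAt = registeredAt;
        }
    }

    static Identifier register(String agent) {
        // UUID.randomUUID() yields a version-4 UUID that is, for all practical
        // purposes, unique across space and time, so identifiers minted in
        // different domains never need coordination.
        return new Identifier(UUID.randomUUID().toString(), agent, System.currentTimeMillis());
    }

    public static void main(String[] args) {
        Identifier id = register("tsinghua-lib-ingest-agent");
        System.out.println(id.uuid + " registered by " + id.registeredBy);
    }
}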
An SCA component consists of Services, References and Properties, as shown in Figure 1. A component offers its functionality to other components through its Services interfaces, consumes the services of other components through its References, and is configured through its Properties.
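For illustration only, the following minimal sketch shows what a DRSC-style component might look like when written against SCA's Java annotations (org.osoa.sca.annotations, as used by implementations such as Apache Tuscany). The interface and field names are hypothetical, not the platform's actual code:

import org.osoa.sca.annotations.Property;
import org.osoa.sca.annotations.Reference;
import org.osoa.sca.annotations.Service;

// Hypothetical service interface exposed by the metadata component.
interface MetadataService {
    String getDublinCoreField(String objectId, String field);
}

// Hypothetical service interface of the identifier component that this
// component depends on for registration and location.
interface IdentifierService {
    String locate(String objectId);
}

// A minimal SCA component: @Service declares what it offers,
// @Reference declares what it consumes, @Property configures it.
@Service(MetadataService.class)
public class MetadataComponent implements MetadataService {

    @Reference
    protected IdentifierService identifierService;   // wired to the ID component

    @Property
    protected String metadataSchema = "dublin-core";  // configurable property

    public String getDublinCoreField(String objectId, String field) {
        // Resolve the object's location through the ID component, then
        // (in a real implementation) read the requested metadata field.
        String location = identifierService.locate(objectId);
        return "[" + metadataSchema + "] " + field + " of " + objectId + " at " + location;
    }
}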
Figure 1. SCA component

Figure 2. DRSC object

2.2 The atom components

We define a DRSC object as an integration of five atomic service components: identifier, metadata, content, log and annotation, as shown in Figure 2.

(1) Identifier (ID) component. Its properties include a unique identifier, registration information and a list of pointers to the services provided by the same object. The identifier is generated as a UUID. Registration information includes the agent, the registration timestamp, approval status and so on. Basic search is supported by Dublin Core metadata, and the pointers point to the other four components; thus the ratio of the identifier component to the other components is 1:N. The services provided include registration, search and location.

(2) Metadata component. This component contains information about the DRSC object itself and the relationships between this object and other objects. Its properties include Dublin Core metadata and other metadata. A link between objects is expressed as a binary pair; for example, (ID2, parent) indicates that the current DRSC object is a parent of the object ID2. The metadata may contain some redundant information, such as the location of the DRSC. There are two groups of services: data manipulation and data access.

(3) Content component. Its properties are the multiple versions of the resource object; each version carries its issue date, format, and creator/modifier. The component provides uploading and downloading services. During upload, it creates a component for a new resource, tracks the entire history of an existing resource, and saves the version update information. The component thus supports the content of a DRSC object throughout its life cycle.

(4) Log component. Its properties include the operator, operation type, operation text and operation result. There are two types of logs: the access log, which does not change the DRSC object's metadata or content, and the operation log, which records revisions of the DRSC object's metadata or content. The access log records the visiting history and can therefore be used to analyze users' behavior patterns and preferences, while the operation log tracks the updates of a DRSC object for auditing and recovery. The component provides two services: log recording and log analysis.

(5) Annotation component. Its properties include ratings, tags, comments and usage status. In our view, the differences between tags and metadata are: a) metadata is given by experts, while tags are annotated by non-professional users; b) metadata is usually selected from a formal vocabulary, while tags are informal terms chosen casually by users; c) metadata quality is guaranteed, while the quality of tags is not. The advantage of tags, however, is that they can be modified over time to reflect the current understanding of a DRSC, whereas metadata remains unchanged. Ratings are used to measure the quality of a DRSC object and to rank search results. Comments are reviews written by real users; together with the tag text, they can serve as resources for data mining. The usage status is optional and helps users manage their learning process; it also helps calculate the relationships between users and resources. Tags can further be used in recommendation methods such as collaborative filtering. We use a six-element tuple to describe a user's annotation of a DRSC object: <user, resource, rating, tags, comments, usage status>. From the tags, user similarity and tag similarity can be calculated.
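To make the last point concrete, the sketch below shows one way user similarity could be derived from the tag sets in the <user, resource, rating, tags, comments, usage status> tuples. The Annotation class and the Jaccard measure are our illustrative assumptions, not necessarily the similarity measure used in iDLib:

import java.util.*;

// Minimal sketch: computing user-user similarity from social tags.
// Annotation mirrors the six-element tuple; Jaccard overlap of tag sets
// is one simple choice of similarity measure (an assumption, not the
// platform's prescribed formula).
public class TagSimilarityExample {

    static class Annotation {
        final String user, resource;
        final int rating;
        final Set<String> tags;

        Annotation(String user, String resource, int rating, Set<String> tags) {
            this.user = user; this.resource = resource;
            this.rating = rating; this.tags = tags;
        }
    }

    // Collect all tags a user has ever applied.
    static Set<String> tagProfile(List<Annotation> annotations, String user) {
        Set<String> profile = new HashSet<>();
        for (Annotation a : annotations) {
            if (a.user.equals(user)) profile.addAll(a.tags);
        }
        return profile;
    }

    // Jaccard similarity between two users' tag profiles.
    static double userSimilarity(List<Annotation> annotations, String u, String v) {
        Set<String> a = tagProfile(annotations, u), b = tagProfile(annotations, v);
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        List<Annotation> data = Arrays.asList(
            new Annotation("alice", "doc1", 5, new HashSet<>(Arrays.asList("digital-library", "metadata"))),
            new Annotation("bob",   "doc1", 4, new HashSet<>(Arrays.asList("metadata", "sca"))),
            new Annotation("alice", "doc2", 3, new HashSet<>(Arrays.asList("sca"))));
        System.out.println(userSimilarity(data, "alice", "bob"));  // prints 0.666...
    }
}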
2.3 Composition of DRSC Components

The five components above compose a DRSC object (Figure 2). These components need not reside in a single computer; they may be spread across a set of distributed computers on the network. The property configuration of a DRSC object, such as the link to the corresponding component, records this placement. SCA supports implementations with multiple interfaces, which ensures the flexibility of a DRSC object, and application components can be implemented in different languages such as Java, C++ and COBOL. The platform is compliant with the OSOA standard [6]. The metadata, content, annotation and log components are all based on the ID component: the pointers in the ID component point to the other four components, and when the status of any of the four components changes, it notifies the ID component to update the registration information. The high flexibility of a DRSC object allows automatic integration with other transfer protocols, e.g., Web services, MQ, HTML, REST and so on. The service component model of digital resources has the following advantages: (1) it simplifies development, composition and deployment; (2) it improves portability, reusability and flexibility; and (3) it reduces the organizational burden by hiding the supporting technology.

For a digital resource, there are two steps to create a DRSC object: divide its properties into the five components, and then build a management system. In the first step, the metadata includes Dublin Core metadata and metadata extracted from the digital resource. In the content component, we need to add additional information such as format type, file size, creation date, version number and so on. Besides the original format, possible target formats also have to be taken into consideration; if no automatic conversion module exists, format conversion has to be added manually. An annotation of a digital resource does not necessarily correspond to a single record in the metadata component. The relationships are shown in Figure 3: the ratio of the ID component to each of the other four components is 1:N, and the ratio of a metadata element to the content component is also 1:N. In some cases, however, objects can be retrieved directly through the ID component for efficiency.

Figure 3. The relationships between components

In the second step, there are several ways to build the system. The first method starts from analysis and uses an assistant tool to create the source code and distribution packages. The second method reuses open-source atomic components downloaded from the Internet, with assistant tools helping users develop their applications; compared with the first method, it greatly reduces the overall cost, since users only need to install a lightweight graphical user interface (GUI) without storing local data. The third method directly creates objects for digital resources on the Web: users only need to provide the specific information of the digital resources, and a SaaS GUI is provided to manage them. In general, the three methods range from locally installed systems to remote services, and from complex to simple (Figure 4).

Figure 4. Construction of DRSC objects

For a created DRSC object, there are two types of users: end users and developers. If end users only need basic functions such as add, delete, modify and search, we can provide a Web interface. Usually, however, end users need more functions or different GUIs, so developers can build different interfaces and new features through the interfaces provided by the DRSCs. Furthermore, besides being used to build new digital resource management systems, DRSCs can also be invoked by pre-existing software systems.

3. Cross-domain network service platform supporting data-driven applications

3.1 Architecture

The architecture of the platform consists of the following parts: common data access, catalogue and exchange system, integrated data management, data-driven engine, data service engine, and enterprise service bus. The following subsections elaborate the three essential components of the architecture, namely the integrated data management layer, the data-driven engine and the data service engine, as shown in Figure 5.

(1) Integrated data management layer. This layer is responsible for the integrated management of generalized data: data, information and knowledge. Typical data-driven applications for sharing and services require the integrated management of three types of data: structured (relational databases), semi-structured (XML data) and unstructured (digital object storage) data. The key part is a high-performance engine that provides data integration services. The basic functions of the layer are: heterogeneous data access, data cleaning, data modeling, data transformation, data loading, mass migration, updated-data alerts (moving only the updated data for performance), built-in data partitioning, data distribution, multi-point data source systems, high availability, fault tolerance, and quick failover.

(2) Data-driven engine. This engine contains a rule library and a model library. By dynamically monitoring the rules in the library, it can perform dynamic integrated services and cross-domain, dynamic business transactions; a minimal sketch of this rule-triggered style of processing is given below.

(3) Data service engine. This engine provides functions for applications, such as data monitoring, messaging, object components, object containers, data analysis, data mining, data archiving, and event response. The major issue in building this engine is how to coordinate different types of transactions, together with the relevant principles for real-time data update monitoring, messaging and data-driven applications, so that the engine can offer the necessary services underneath the ESB layer.
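The following sketch is purely illustrative of the rule-monitoring idea described in (2): a data change reported by the integrated data management layer is matched against registered rules, and matching rules invoke a data service. The class and method names are hypothetical and are not taken from the platform's code:

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Minimal sketch of a data-driven engine: rules watch for data-change events
// and trigger data services when their conditions match.
public class DataDrivenEngineSketch {

    // A change reported by the integrated data management layer (hypothetical shape).
    static class DataChangeEvent {
        final String domain, objectId, changeType;
        DataChangeEvent(String domain, String objectId, String changeType) {
            this.domain = domain; this.objectId = objectId; this.changeType = changeType;
        }
    }

    // A rule from the rule library: a condition plus the data service it triggers.
    static class Rule {
        final String name;
        final Predicate<DataChangeEvent> condition;
        final Consumer<DataChangeEvent> action;
        Rule(String name, Predicate<DataChangeEvent> condition, Consumer<DataChangeEvent> action) {
            this.name = name; this.condition = condition; this.action = action;
        }
    }

    private final List<Rule> ruleLibrary = new ArrayList<>();

    void register(Rule rule) { ruleLibrary.add(rule); }

    // Called whenever the integrated data management layer signals a change.
    void onDataChange(DataChangeEvent event) {
        for (Rule rule : ruleLibrary) {
            if (rule.condition.test(event)) {
                rule.action.accept(event);   // hand the work over to a data service
            }
        }
    }

    public static void main(String[] args) {
        DataDrivenEngineSketch engine = new DataDrivenEngineSketch();
        // Example rule: whenever metadata is updated in any domain, refresh the search index.
        engine.register(new Rule("refresh-index-on-metadata-update",
                e -> "METADATA_UPDATED".equals(e.changeType),
                e -> System.out.println("Re-indexing " + e.objectId + " from domain " + e.domain)));
        engine.onDataChange(new DataChangeEvent("calis", "obj-42", "METADATA_UPDATED"));
    }
}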
Figure 5. The architecture of the cross-domain network service platform supporting data-driven applications

As illustrated in Figure 5, the data service engine can be built not only on the integrated data management layer but also on the data-driven engine. In the latter case, the data-driven engine, whose core technology is a rule engine, dynamically detects changes in the data management layer: data changes on the integrated data management layer trigger the data-driven engine, which then provides the data service engine with the corresponding data services.

3.2 Key technologies

The key technologies fall into three categories: aggregating cross-domain data resources, integrating and sharing demand-based data, and supporting large numbers of concurrent users.

3.2.1 Aggregating cross-domain data resources

(1) Unified data model. As mentioned above, data can be divided into three types, structured, semi-structured and unstructured, managed by relational databases, XML and digital objects respectively. We present a method for unified management: links among the different types of data are created in the catalogue system, and each item is processed with the corresponding method after its data type has been determined.

(2) Metadata-based catalogue and exchange system. To support the unified data model, we studied a management and service system for multiple digital object identifiers, based on the national catalogue and exchange systems, and further proposed the CDOI system based on the DOI system.

(3) Multi-document summarization. This extracts the key content from multiple documents and generates a summary report automatically. Taking novelty search as an example, the summarization algorithm proceeds as follows: i) calculate the importance of every sentence by MMR (Maximal Marginal Relevance); ii) pick the sentences with the highest scores to ensure high relevance and low redundancy [7][8].

(4) Semantic-based cross-domain large-scale information retrieval. The major issues in this area are: i) how to help users express their search intent accurately; ii) how to resolve ambiguity; iii) how to adjust the retrieval range; iv) how to add user-specific elements; and v) how to search metadata. We adopt multiple domain ontologies and Web 2.0 technologies to address these problems.

(5) Ontology storage and query. Using Jena, SDB, Joseki, SPARQL and Tomcat, we store OWL files in a relational database; users can then search and reason over heterogeneous, dynamic and massive data through a web application.
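As a minimal illustration of this ontology query path (assuming the Apache Jena API; the ontology file name and the query are made up for the example, and the SDB relational backend and Joseki/Tomcat HTTP layer mentioned above are omitted to keep the sketch self-contained):

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

// Minimal sketch: load an OWL ontology into a Jena model and run a SPARQL query.
// In the platform described above the model would be backed by a relational
// store (SDB) and exposed over HTTP (Joseki/Tomcat); here an in-memory model
// keeps the example self-contained.
public class OntologyQueryExample {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.read("library-ontology.owl");   // hypothetical ontology file

        String sparql =
            "PREFIX dc: <http://purl.org/dc/elements/1.1/> " +
            "SELECT ?resource ?title WHERE { ?resource dc:title ?title } LIMIT 10";

        try (QueryExecution qe = QueryExecutionFactory.create(QueryFactory.create(sparql), model)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.nextSolution();
                System.out.println(row.get("resource") + " : " + row.get("title"));
            }
        }
    }
}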
3.2.2 Demand-based data integration and sharing

(1) Service composition and verification. We use Chu spaces to formally model and verify WS-BPEL. The modeling process is as follows: i) convert WS-BPEL to its control-flow framework, the BPEL-CF procedure; ii) map the procedure to a Chu space based on the process algebra of Chu spaces; iii) define a specification language to describe properties of the procedure, and provide verification algorithms. For Web service composition modeling and verification, the process is: i) convert the user-submitted WS-BPEL to its BPEL-CF procedure; ii) map the procedure, together with the semantic data, to a Chu space based on the process algebra of Chu spaces, as the first input; iii) according to the user's requirements, define a specification language that describes the properties of the procedure, and provide verification algorithms, as the second input; iv) scan the inputs and verify them with the Chu space verification tools; if the properties are satisfied, return true, otherwise report the detailed exception message [9][10][11].

(2) Collaborative filtering based on social tags [12]. We present IBeST (Item-Based with Social Tags), an algorithm framework that extends item-based CF to the social tag level. Unlike traditional CF, which evaluates only item-item relevance, IBeST computes an additional relevance score from social tags and ratings and then incorporates that score into the original relevance. The steps are: i) pre-process the social tags into the format required by the subsequent steps; ii) treat metadata as weighted tags to add semantic information; iii) optimize the weights and calculate the weighted average; iv) calculate the prediction score with traditional CF.

(3) Graph-based personalized recommendation [13][14]. We use the properties of social tags to calculate similarity among users and add the similarity scores into the graph to overcome the sparsity problem. Extensive experiments on real data show that the algorithm effectively improves the precision of personalized recommendation.

(4) Long-term preservation of digital resources based on risk management. We consider risks such as format and version, software, operating system, and hardware. The proposed method is combined with existing open-source projects to help monitor and mitigate multiple risks in digital resource management and storage, and provides an integrated risk management model covering risk planning, risk identification, risk analysis, risk handling, and risk monitoring and control.

3.2.3 Support of Large-Scale Concurrent Users

(1) Massive data mining based on MapReduce. We use MapReduce as the unified mechanism for data mining on massive unstructured resources; methods for classification, clustering, association mining and so on are implemented in the MapReduce framework, with Hadoop as the underlying system [15]. A minimal sketch of the mapper/reducer structure is given at the end of this subsection.

(2) Architecture for large-scale concurrent transactions. We surveyed a range of architectures for large-scale concurrent transactions and designed a model to support our web-based applications. The model handles large-scale access by load balancing across the web server and application server tiers. For data services, it adopts the IBM Data Facility method, which is based on key-value access, for core transactions to improve efficiency, and uses an in-memory database, caches and a customized memory architecture for the other transactions.
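The sketch below illustrates the MapReduce style referred to in 3.2.3 (1) with a deliberately simple job: counting how often each social tag is applied across resources. The job itself is our illustrative example, not one of the mining algorithms actually implemented on the platform, but it shows the Hadoop mapper/reducer structure they share:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal Hadoop MapReduce job: each input line is "resourceId<TAB>tag1,tag2,...";
// the job counts how many times each tag occurs over the whole collection.
public class TagCountJob {

    public static class TagMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text tag = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            if (parts.length < 2) return;                 // skip malformed lines
            for (String t : parts[1].split(",")) {
                tag.set(t.trim());
                context.write(tag, ONE);                  // emit (tag, 1)
            }
        }
    }

    public static class TagReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();  // aggregate counts per tag
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "tag count");
        job.setJarByClass(TagCountJob.class);
        job.setMapperClass(TagMapper.class);
        job.setCombinerClass(TagReducer.class);
        job.setReducerClass(TagReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}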
3.3 Prototype platform

Based on the system architecture and the related key technologies, we implemented a platform for cross-domain sharing and service support, as shown in Figure 6.

Figure 6. The architecture of the prototype platform for cross-domain sharing and service support

There are three layers in Figure 6: the data layer, the data service cloud layer and the service component layer. The data layer provides distributed storage over relational databases, XML databases, LDAP and file systems. The data service cloud layer provides the DRSC (Digital Resource Service Component) object services, namely identifier, metadata, file, annotation and log management, behind a data service access interface. The service component layer encapsulates multiple software modules for ingestion, processing and integration, organization, analysis and service (such as data cleaning, data transformation, data modeling, categorized and full-text indexing, behavior mining, classification/clustering, text summarization, data query, personalization, service composition, long-term preservation, component management and privilege management), which can be integrated into different digital library applications through an application generation tool.

4. Cross-Domain Digital Resource Personalized Services: iDLib

To put the platform architecture into practice and test it, we developed a cross-domain digital resource service application, iDLib. The digital resources in the data layer are obtained from various domains on the Internet, such as the National Library of China, the China National Knowledge Infrastructure (CNKI), the China Academic Library & Information System (CALIS), Tsinghua University Library, and Peking University Library.

4.1 Functions

Based on resources from these digital resource domains, iDLib provides cross-domain personalized services for digital resources. The services support academic activities, respond to requirements automatically, and discover the latest hot resources and disciplines. There are four main functions: (1) retrieval: search relevant resources by keywords; (2) analysis: novelty consulting that generates a report, recommendation, topic tracking, etc.; (3) feedback: receive users' feedback, analyze users' preferences, and provide personalized services based on users' interests; (4) management: classification of resources, a tagging service to organize resources, etc.

4.2 Core components

iDLib is built as a system for cross-domain sharing and service support. It has five core components: the identifier, metadata, content, log and annotation components. The system is developed in Java with the SSH framework and MySQL. Figure 7 shows the user interface of iDLib.

Figure 7. Web UI of iDLib

5. Conclusion

In this paper, we propose a platform for cross-domain sharing and service support for building digital libraries. The platform addresses the following specific needs: data-driven applications, cross-domain large-scale data sharing, and dynamic management. We formally model digital resource objects and design a generalized architecture consisting of common data access, catalogue and exchange systems, integrated data management, a data service engine, a data-driven engine and an enterprise service bus. The development and public use of iDLib provide a test bed for the feasibility and efficiency of the proposed platform. In the future, we will continue to improve the platform by developing other novel applications for digital libraries, such as a Digital Library for Chinese Science History.

References

[1] Chunxiao Xing, Chun Zeng, Chao Li, Lizhu Zhou. Study on architecture of massive information management for digital library. Journal of Software, 2004, 15(1): 76-85.
[2] Carl Lagoze, Sandy Payette, Edwin Shin, Chris Wilper. Fedora: an architecture for complex objects and their relationships. International Journal on Digital Libraries, 2006, 6(2): 124-138.
[3] MacKenzie Smith, Mary R. Barton, Margret Branschofsky, Greg McClellan, Michael J. Bass, David Stuve, Julie Harford Walker, Robert Tansley. DSpace: an open source dynamic digital repository. D-Lib Magazine, 2003(9).
[4] Service Component Architecture specifications (2009). http://www.osoa.org/display/Main/Service+Component+Architecture+Specifications
[5] SCA white paper (technical). http://www.ibm.com/developerworks/library/specification/ws-sca/
[6] OSOA, the Open Service Oriented Architecture collaboration [2010-5-18]. http://www.osoa.org/display/Main/Home
[7] Qinglin Guo, Ming Zhang. Multi-documents automatic abstracting based on text clustering and semantic analysis. Knowledge-Based Systems, 2009, 22(6): 482-485.
[8] Yanxing Zhang, Ming Zhang, Zhihong Deng. Users' comments automatic summarization based on user features. Computer Research and Development, 2009, 46(Suppl.): 520-525.
[9] Xutao Du, Chunxiao Xing, Lizhu Zhou. A Chu spaces semantics of BPEL-like fault handling. FCST 2009, Shanghai, pp. 317-323.
[10] Xutao Du, Chunxiao Xing, Lizhu Zhou. A Chu spaces semantics of control flow in BPEL. APSCC 2009, Singapore.
[11] Xutao Du, Chunxiao Xing, Lizhu Zhou. A Chu spaces process algebra. Computer Science, 2009.
[12] Li Zhou, Yong Zhang, Chunxiao Xing. A collaborative filtering algorithm based on global and domain authorities. ICADL 2008, Springer, Bali, Indonesia, pp. 164-173.
[13] Ziqi Wang, Yuwei Tan, Ming Zhang. Graph-based recommendation on social networks. In Wook-Shin Han et al. (Eds.): Advances in Web Technologies and Applications, Proceedings of the 12th Asia-Pacific Web Conference (APWeb 2010), Busan, Korea, 6-8 April 2010. IEEE Computer Society, ISBN 978-0-7695-4012-2, pp. 116-122.
[14] Ming Zhang, Ziqi Wang. Graph-based personalized recommendation algorithm and system for social network. Invention patent, registration No. 201010102050.2.
[15] Guojun Liu, Ming Zhang, Fei Yan. Large-scale social network analysis based on MapReduce. International Conference on Computational Aspects of Social Networks 2010, September 26-28, 2010, Taiyuan, China.