Bamboo CI-HUB

Bill Parod, NUIT Architect for Software Development Group, Northwestern University
May 12, 2013

This document is an architecture summary for the “Collection Interoperability Hub” (CI-Hub) developed for the Bamboo Technology Project.

Contents

Apache Chemistry OpenCMIS
Bamboo Collection Interoperability Hub (CI-Hub)
    Bamboo Services Platform (BSP)
    Resource Oriented Architecture (ROA)
    Service Oriented Architecture (SOA)
    CI-Hub FileShare Repository
    CI-Hub Locator Extensions
        Perseus Connector
        HathiTrust Connector
        TCP/Fedora Connector
Configuration
    Service factory
    Accounts
    Locator classes
    Locator configurations
    Bamboo CMIS types
    Repository root file path
    Request and Response folders
Source Code
    Resource Oriented Architecture (ROA) Source
        ROA BSP Classes
    Service Oriented Architecture (SOA) Source
        SOA BSP Classes
        Apache Chemistry FileShare Repository Override
        Locator Extensions for External Repositories
        Custom CMIS Types and Configuration

Apache Chemistry OpenCMIS

Apache Chemistry OpenCMIS is an open source implementation of the Content Management Interoperability Services (CMIS) standard. OpenCMIS includes a server framework for layering CMIS over other content repositories and a client framework for integrating consuming applications with CMIS-compliant repositories. The Apache Chemistry distribution comes with two example repository implementations: "InMemory" and "FileShare".

The Bamboo CI-Hub is an extended version of the FileShare implementation. The FileShare implementation, as the name implies, persists CMIS folders, documents, and properties as filesystem folders and files. It declares its own package but also leverages other Apache Chemistry server modules.

Apache Chemistry exposes the core CMIS domain services through three binding options: Atom Publishing Protocol (APP, or AtomPub), Web Services, and local Java class bindings. Bamboo uses the AtomPub binding to provide HTTP access to CI-Hub. A simplified diagram of the AtomPub over FileShareRepository processing is shown below.

Figure 1.

Figure 1 above shows the main classes involved in AtomPub request processing. The org.apache.chemistry.opencmis.server.impl.atompub package includes the CMISAtomPubServlet servlet class for handling HTTP requests. Its initialization establishes a dispatch table, mapping CMIS requests to their associated CMIS service classes (RepositoryService, NavigationService, ObjectService, VersioningService, and DiscoveryService) within the package. In handling HTTP requests, the servlet forwards the incoming HttpServletRequest and HttpServletResponse objects, as well as a CMIS CallContext object, through this dispatcher to the class serving the specific CMIS request.
These AtomPub classes in turn invoke the associated methods on the configured repository’s CMIS service classes and write return values as AtomPub XML to the forwarded HttpServletResponse object. This is how AtomPub request parsing and reply formatting are accomplished. The underlying CMIS implementation is responsible for managing content within the repository. It invokes implementations of core CMIS service classes like those described in the AtomPub binding (RepositoryService, NavigationService, ObjectService, VersioningService, and DiscoveryService).

Bamboo Collection Interoperability Hub (CI-Hub)

As mentioned above, the Bamboo CI-Hub is based on the OpenCMIS FileShare Repository. Adapting OpenCMIS for Bamboo involved two major efforts: 1) extending OpenCMIS to normalize content from external repositories and 2) refactoring OpenCMIS for OSGi deployment on the Bamboo Services Platform.

Bamboo Services Platform (BSP)

Apache Chemistry OpenCMIS is typically deployed as a web application. Deploying the CI-Hub on the Bamboo Services Platform (BSP) required splitting HTTP request processing and core service functionality into separate OSGi bundles, functioning as BSP Resources and a BSP Service respectively.

Resource Oriented Architecture (ROA)

The CI-Hub ROA layer defines a service for the BSP’s CXF servlet, exposing the AtomPub binding to the CMIS services described above. ROA uses a Spring beans.xml file to declare and configure its service, and Java annotations in the implementing class to expose the Java methods that implement HTTP methods.

The ROA layer beans.xml file (fragments shown below) defines a single bean implemented by the CIHubResource class, bound to the root (“/cihub”) CXF path. This class provides a functional replacement, in the BSP, for the CMISAtomPubServlet class used in the Apache Chemistry webapp implementation. CIHubResource is modeled on the CMISAtomPubServlet class, but is invoked by BSP CXF rather than directly by a servlet container.
Instead of a web.xml file to configure servlet properties for initialization, our ROA bean obtains properties from its beans.xml file. ICMISRepositoryServiceFactory is the CI-Hub SOA layer interface that our ROA layer uses to obtain an SOA CmisServiceFactory. The CmisServiceFactory creates a CmisService based on the specific factory class configured in the Apache Chemistry FileShare repository configuration file, cihub.properties. CI-Hub configures that property to use its own org.projectbamboo.cihub.northwestern.domain.FileShareServiceFactory class, replacing the default Apache Chemistry class. FileShareServiceFactory can then substitute the custom org.projectbamboo.cihub.northwestern.domain.FileShareService class to achieve custom CI-Hub behavior.

Another important configuration property in the CI-Hub is atomPubAddedPath. The OpenCMIS FileShare Repository is implemented to run as a servlet in a servlet container. It therefore forms URLs in AtomPub replies based on the servlet context. However, in our BSP deployment, we are executing as a JAX-RS service on our own path (/cihub) under a CXF servlet in an OSGi container. Consequently, in the BSP environment we need to provide a more extensive URL path in AtomPub replies. The extra path information is configured in the ROA bean definition’s atomPubAddedPath property, reflecting the BSP ROA deployment and the specific path of our ROA service’s JAX-RS address. These are shown below in the CI-Hub ROA bean file.

<jaxrs:server id="cihub" address="/cihub">
  <jaxrs:serviceBeans>
    <ref bean="ciHubResource"/>
  </jaxrs:serviceBeans>
</jaxrs:server>

<bean id="ciHubResource" init-method="create"
      class="org.projectbamboo.cihub.northwestern.resources.CIHubResource">
  <property name="callContextHandlerClass"
            value="org.apache.chemistry.opencmis.server.shared.BasicAuthCallContextHandler"/>
  <property name="cmisRepositoryServiceFactory" ref="CMISRepositoryServiceFactory"/>
  <property name="serviceCatalog" ref="serviceCatalog"/>
  <!-- this reflects the jaxrs address above -->
  <property name="atomPubAddedPath" value="/services/bsp/cihub/"/>
  <!-- username for fileShare login -->
  <property name="fileShareUsername" value="test"/>
  <!-- fileShare password -->
  <property name="fileSharePassword" value="test"/>
</bean>

Major elements of the ROA beans.xml file.

The SOA service instance obtained by the ROA layer is passed through the ROA-level AtomPub package classes. Deep in the AtomPub package classes, binding-neutral CMIS methods are invoked on the passed-in CmisService object, which in our case is an SOA layer service.

Service Oriented Architecture (SOA)

The CI-Hub SOA layer defines a service used by other BSP services and by the CI-Hub ROA layer resource. The beans.xml file for the SOA layer (fragments shown below) defines a single bean for the CMISRepositoryServiceFactory class and exposes it as an OSGi service supporting the ICMISRepositoryServiceFactory interface. This class implements a single method: getCMISRepositoryService(CallContext context). This method simply returns an Apache Chemistry CmisService object. Consumers of this service then use the CmisService API defined by that class. ROA and SOA processing is shown below in Figure 2.

<osgi:service ref="cmisRepositoryServiceFactory"
              interface="org.projectbamboo.cihub.northwestern.service.ICMISRepositoryServiceFactory"
              ranking="1">
  <osgi:service-properties>
    <entry key="service.pid" value="urn:uuid:418E1B99-5ABE-4693-8AAD-FC9DA164A581"/>
    <entry key="serviceDescriptionLocation"
           value="https://wikihub.berkeley.edu/display/pbamboo/CI+Hub+Service+Contract+Description++v0.9-alpha"/>
    <entry key="service.description" value="CMIS Service Factory"/>
    <entry key="service.vendor" value="Northwestern University"/>
    <entry key="serviceProviderName" value="Bamboo CI Hub OSGi Service Implementation"/>
    <entry key="serviceVersion" value="1.0"/>
    <entry key="serviceProviderType" value="functional"/>
    <entry key="defaultServiceProvider" value="true"/>
    <entry key="serviceProviderSupportedVersionsRange" value="[1.0.0,2.0.0)"/>
    <entry key="serviceProviderContact" value="[email protected]"/>
  </osgi:service-properties>
</osgi:service>

<bean id="cmisRepositoryServiceFactory"
      class="org.projectbamboo.cihub.northwestern.service.CMISRepo…">
  <property name="repositoryConfigFile" value="/cihub.properties"/>
  <property name="repositoryId" value="content"/>
</bean>

Major elements of the SOA beans.xml file.

Figure 2. ROA and SOA processing across the packages org.projectbamboo.cihub.northwestern.resources.*, org.projectbamboo.cihub.northwestern.service.*, and org.projectbamboo.cihub.northwestern.domain.*.

CI-Hub FileShare Repository

The Apache Chemistry FileShare Repository supports configuration of the factory class to use for the CMIS FileShare service implementation. The CI-Hub uses this setting to substitute its own custom service factory class (FileShareServiceFactory). This custom class in turn instantiates custom CI-Hub versions of the FileShareService and FileShareRepository classes in order to achieve custom CI-Hub behavior, as shown above.

The CI-Hub FileShareRepository is provided to customize behavior when new files are submitted to the CMIS repository.
FileShareRepository examines submitted CMIS files for indications that they are Zotero format bibliography files. When a Zotero file is detected, FileShareRepository presents the Zotero references to each configured “Locator” class in the CI-Hub. Locator or “connector” classes provide the specific processing needed to import content from external repositories. These class relationships are shown below in Figure 3.

CI-Hub Locator Extensions

In addition to repository-specific reference detection, CI-Hub “Locator” processing involves retrieving content from the referenced repository, converting it, and placing it into a Bamboo Book Model structure in the CI-Hub. An item’s citation reference alone is usually not sufficient to retrieve the item’s complete content. Each “Locator” must understand the specific repository’s application programming interface (API) and form additional URL references, based on the cited URL’s identifier, to retrieve additional description and content for the item.

The Bamboo Book Model defines the folder structure, file naming conventions, and set of CMIS properties for all content files constituting a Bamboo “Book”. Each locator class understands its respective repository’s content model and its mapping to the Bamboo Book Model. The CI-Hub provides such locator services for Perseus, HathiTrust, and a Fedora implementation of selected Text Creation Partnership (TCP) texts running at the University of Illinois.

Figure 3. Locator class relationships within the packages org.projectbamboo.cihub.northwestern.service.* and org.projectbamboo.cihub.northwestern.domain.*.

In addition to the Bamboo Book Model conversion, the connectors save source files from their respective repositories in the CMIS repository. The specific behavior of each connector is described in more detail in the sections below. The URL pattern recognized by each repository’s locator is listed at the start of its section.
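
The matching step described above, in which a cited URL is offered to each locator until one recognizes it, can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical and are not taken from the CI-Hub LocatorServiceAPI, and the regular expressions are stricter restatements of the informal URL patterns given in the connector sections.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical sketch; not the CI-Hub locator API. */
final class LocatorDispatchSketch {

    // Locator name -> regular expression for the cited URL.
    // The expressions are assumptions derived from the documented URL patterns.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("perseus",
            Pattern.compile("^http://www\\.perseus\\.tufts\\.edu/hopper/text\\?doc=.*"));
        PATTERNS.put("hathi",
            Pattern.compile("^http://hdl\\.handle\\.net/2027/.*"));
        PATTERNS.put("tcp",
            Pattern.compile("^http://ramman\\.grainger\\.uiuc\\.edu.*"));
    }

    /** Returns the name of the first locator whose pattern matches the URL. */
    static Optional<String> locatorFor(String citedUrl) {
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            if (entry.getValue().matcher(citedUrl).matches()) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty(); // no configured locator recognizes this reference
    }

    public static void main(String[] args) {
        // The identifiers in these URLs are placeholders, not real items.
        System.out.println(locatorFor("http://hdl.handle.net/2027/example.id"));
        System.out.println(locatorFor("http://example.org/unknown"));
    }
}
```

A reference that no locator recognizes yields an empty result here; how CI-Hub itself reports unrecognized references is not described in this document.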

Perseus Connector

http://www.perseus.tufts.edu/hopper/text?doc=.*

The Perseus connector is based on a generic Fedora connector. As such, it obtains all of the referenced Fedora object’s datastreams and stores them in the Bamboo Book Model’s book/source directory. It then forms a URL to the object’s TEI datastream, using URL pattern heuristics coded in the connector, to obtain the object’s TEI transcript. The connector also uses the Fedora object’s “MODS” datastream to obtain basic bibliographic metadata for the object. This basic descriptive metadata is used in CMIS property files within the Book Model.

The Bamboo Book Model is a page-based model; that is, it organizes book content into separate constituent pages. Perseus transcripts, however, as is characteristic of classics texts, are not naturally paginated, and so their TEI transcripts do not contain any page break markup. In order to supply “pages” for the Bamboo Book Model, the CI-Hub Perseus connector performs its own pagination by breaking the TEI datastream at its <div/> elements, creating a Bamboo page for each <div/>. For each of these “pages”, the connector creates files for its TEI XML representation, an XHTML representation, and a plain text representation. The connector also creates a .cmis file containing the relevant properties for each of these page representation files. If the specific Perseus TEI transcript does not contain <div/> elements, the connector provides the same representation types, using <l/> elements for pagination. The connector also creates a plain text “volume level” version of the entire transcript by concatenating the plain text pages into a single file.

HathiTrust Connector

http://hdl.handle.net/2027/*

The HathiTrust connector uses the HathiTrust API to obtain the referenced item’s bibliographic description and a ZIP file containing its scanned pages.
It uses two HathiTrust API calls to do this: one to obtain the item’s bibliographic description as JSON-formatted MARC, and the other to obtain a ZIP file containing the full set of scanned pages. CMIS bibliographic descriptive properties are obtained by parsing the JSON-formatted string for MARC fields, using the following MARC codes and subcodes for basic description:

title     = 245$a
creator   = 100$a
publisher = 260
issued    = 260$c

The Bamboo Book Model also accommodates JPEG page images at various widths. However, many HathiTrust volumes provide their page images in the JPEG2000 format. Consequently, the HathiTrust connector must reformat JPEG2000 images as JPEG for Bamboo Book Model conformance. The HathiTrust connector uses a Djatoka JPEG2000 server to decode HathiTrust JPEG2000 files as JPEG images at the various resolutions defined by the Bamboo Book Model.

TCP/Fedora Connector

http://ramman.grainger.uiuc.edu

The Text Creation Partnership (TCP) connector, like the Perseus connector, leverages the Fedora Repository access classes included in the CI-Hub. Its Fedora object model, however, is different from that of Perseus. It holds separate Fedora objects for the main bibliographic entity, the TEI transcript, and the MorphAdorned TEI transcript. Page-level representations, whether TEI, MorphAdorned TEI, plain text, or JPEG at various resolutions, are all obtained using parameterized methods in Fedora disseminations. The TCP connector forms all relevant URLs internally to access the desired Bamboo Book page representations and, as with HathiTrust and Perseus content, forms all desired representations that are possible for the source materials.

The Bamboo Book Model representations provided by each connector are summarized in the table below. For each repository column, "Repository" means that the particular CMIS Type is obtained directly from the repository. "Connector" means that the CMIS Type is manufactured by the connector from other content obtained from the repository.
For example, the Perseus connector creates plaintext, xhtml, and TEI xml pages from the Perseus source TEI transcript.

CMIS Type                mime-type
page-plaintext           text/plain
page-xhtml               text/html
page-tei                 text/xml
page-morphadorned        text/xml
page-image               image/jpeg
page-thumb150            image/jpeg
book-tei                 text/xml
book-plaintext           text/plain
source-mets              text/xml
source-aggregate         application/zip
source-bib-marc          application/json
source-page-image-jp2    image/jp2
source-page-xml          text/xml
source-page-ocr          text/plain

Repository column entries, in source order:
TCP (UIUC): repository repository repository repository
Hathi: connector
repository repository repository connector connector connector connector connector
Perseus: connector connector connector repository connector repository repository repository

Configuration

The Apache Chemistry OpenCMIS distribution requires only minor configuration but offers considerable flexibility and customization. Its main configuration file (cihub.properties) requires local deployment settings for the file system path to the repository content area and for initial account credentials. Beyond that, it allows local substitution of the major factory class for the repository, facilitating custom implementations and local extensibility of CMIS content types. These properties and their Bamboo settings are given below.

The cihub.properties file is maintained outside of the OSGi container and discovered at runtime by the CI-Hub, which forms a path by combining the $BSPLOCALSTORE_HOME environment variable and the repositoryConfigFile property defined in the SOA beans.xml file. Care should be taken to define the BSPLOCALSTORE_HOME shell variable in the BSP execution environment, to define the repositoryConfigFile path relative to $BSPLOCALSTORE_HOME, and to place the cihub.properties file in that location.

Service factory

This property declares the class responsible for creating the principal class used for CMIS processing, as described above.

class=org.projectbamboo.cihub.northwestern.domain.FileShareServiceFactory

Accounts

Apache Chemistry performs security checks on each CMIS request. The properties below define the accounts, their passwords, and their read-write or read-only permissions.

login.1 = test:PASSWORD
login.2 = cmisuser:PASSWORD
login.3 = reader:PASSWORD
repository.cihub.readwrite = test, cmisuser
repository.cihub.readonly = reader

Locator classes

This is where “Locator” classes, or repository connectors, are declared for the CI-Hub. Associating a class name with a property name that ends with “.locator” includes that class in the list of potential repository locator services for the local CMIS repository whose repository identifier is REPOSITORY_ID, where REPOSITORY_ID is obtained from the property name with this regular expression: ‘/repository\.REPOSITORY_ID\..*\.locator’.

# repository.cihub.hathi.locator =
#     org.projectbamboo.cihub.northwestern.domain.HathiLocatorService
repository.cihub.perseus.locator = org.projectbamboo.cihub.northwestern.domain.PerseusLocatorService
repository.cihub.tcp.locator = org.projectbamboo.cihub.northwestern.domain.TCPLocatorService

Locator configurations

This property names the properties file that provides connection information for the locator classes listed above.

repository.cihub.connectorConfig = /config/connector.properties

Bamboo CMIS types

In CMIS, folders and documents can be assigned properties. Properties are contained in a separate file with a special naming convention. One of the properties associated with all Bamboo content is its type. Each type is defined by a repository-wide configuration file which describes the properties for objects of that type. Those type definitions are enumerated, and the paths to their definitions listed, below.

type.01 = /CMISTypes{file.separator}bamboo-page-document.xml
type.02 = /CMISTypes{file.separator}book.xml
type.03 = /CMISTypes{file.separator}contents.xml
type.04 = /CMISTypes{file.separator}example-type.xml
type.05 = /CMISTypes{file.separator}metadata.xml
type.06 = /CMISTypes{file.separator}page.xml
type.07 = /CMISTypes{file.separator}page-image.xml
type.08 = /CMISTypes{file.separator}page-tei.xml
type.09 = /CMISTypes{file.separator}page-thumb150.xml
type.10 = /CMISTypes{file.separator}page-xhtml.xml
type.11 = /CMISTypes{file.separator}source-mets.xml
type.12 = /CMISTypes{file.separator}source-page-image.xml
type.13 = /CMISTypes{file.separator}source-page-ocr.xml
type.14 = /CMISTypes{file.separator}source-page-xml.xml
type.15 = /CMISTypes{file.separator}bamboo-folder.xml
type.16 = /CMISTypes{file.separator}page-morphadorned.xml
type.17 = /CMISTypes{file.separator}page-plaintext.xml
type.18 = /CMISTypes{file.separator}volume-plaintext.xml
type.19 = /CMISTypes{file.separator}source-aggregate.xml
type.20 = /CMISTypes{file.separator}userfolder.xml
type.21 = /CMISTypes{file.separator}page-image-jp2.xml

Repository root file path

This is the path on the file system where CI-Hub will store and retrieve the folders and files it manages as a CMIS repository. Care should be taken to coordinate file ownership and permissions on this directory with the process owner of CI-Hub’s execution, e.g. the BSP process owner.

repository.cihub = /var/bamboo/cmis/content

Request and Response folders

The initial design of the CI-Hub intended submitted files, that is, Zotero files containing references representing Bamboo Book creation/resolution requests, to arrive in the req folder, and processing status responses to be written by CI-Hub into an associated file in the res folder. Zotero file submission can occur anywhere in the CMIS folder structure. Book processing status messages are written to a similarly named file in the res directory.

repository.cihub.request = req
repository.cihub.response = res

Source Code

The CI-Hub source code is organized into ROA and SOA source trees and includes Java and Scala source as well as Spring XML files. This section describes that source code organization. All source code, with the exception of the cihub.properties configuration file, is relative to ci-hub-service/. The cihub.properties file is not compiled and deployed in the OSGi bundles; it is placed in the BSP file system relative to the path indicated by the $BSPLOCALSTORE_HOME environment variable. In the source tree it is found in the ci-hub-config directory, at the same level as ci-hub-service/.

Resource Oriented Architecture (ROA) Source

The ROA source is found under ci-hub-service/resource. It contains Java and Spring beans.xml source files.

/ci-hub-service/resource/src/main/

ROA BSP Classes

The CI-Hub ROA layer contains only the two Java class files needed for the BSP, one for the ROA interface class and one for the ROA implementation class:

./java/org/projectbamboo/bsp/services/cihub/resources/
    ICIHubResource.java
    CIHubResource.java

The ROA resource is declared and configured with its associated Spring beans file:

./resources/META-INF/spring/
    beans.xml

Service Oriented Architecture (SOA) Source

Source code for the CI-Hub SOA layer is found in /ci-hub-service/service/src/main/. It provides four major aspects of the CI-Hub customization of the Apache Chemistry FileShare implementation:

1. SOA interface and implementation classes for BSP architecture
2. Apache Chemistry class overrides and custom configuration
3. Locator extensions for external repositories
4. Custom CMIS type definitions and bindings

There are two Java packages, a Scala source package, a folder structure of resource files for configuration and CMIS types, and a webapp folder for deployment as a web application in a servlet container.

SOA BSP Classes

Like the CI-Hub ROA layer, the SOA layer contains two Java class files referenced in the layer’s beans.xml, one for the SOA interface class and one for the SOA implementation class. Both are implemented in Java:

./java/org/projectbamboo/cihub/northwestern/service/
    CMISRepositoryServiceFactory.java
    ICMISRepositoryServiceFactory.java

Apache Chemistry FileShare Repository Override

As described in the first section above, the CI-Hub extends the Apache Chemistry FileShare Repository with custom processing of Zotero files. The Java source files that override similar Chemistry classes are found in:

./java/org/projectbamboo/cihub/northwestern/domain/
    FileShareRepository.java
    FileShareService.java
    FileShareServiceFactory.java
    MIMETypes.java
    RepositoryMap.java
    RepositoryService.java
    TypeManager.java

CI-Hub also overrides org.apache.chemistry.opencmis.server.impl.atompub classes. The modification here is slight, but it affects several classes, which in turn must be overridden to reference the new class. This Apache Chemistry package contains a utility class, AtomPubUtils, which is used by several other classes in the package to form URLs in AtomPub replies. AtomPubUtils by default assumes that the execution environment is a servlet container running CMISAtomPubServlet. It therefore forms URLs based on the current domain and servlet context. However, in our BSP deployment, we are executing as a JAX-RS service on our own path (/cihub) under a CXF servlet in an OSGi container. Consequently, in the BSP environment we need to provide a different URL path in AtomPub replies, as the default URLs formed in AtomPubUtils would be incorrect. We override AtomPubUtils in CI-Hub, as well as the other classes in its package that use AtomPubUtils, to provide appropriate paths in AtomPub replies. The extra path is passed in the HttpServletRequest object and is configured at the ROA layer in its bean definition’s atomPubAddedPath property. These custom classes are found in:

./java/org/projectbamboo/cihub/northwestern/domain/atompub
    AtomPubUtils.java
    NavigationService.java
    ObjectService.java
    RepositoryService.java

Locator Extensions for External Repositories

The CI-Hub source code is mixed-language, though both languages (Java and Scala) are JVM-based. All classes are part of the same org.projectbamboo.cihub.northwestern.domain package. The classes that implement interactions with external repositories, and some utility classes, are in Java. The external repository “locator” classes and their support classes are in Scala. The language choice here is likely historical, reflecting different developers’ preferences over the life of the project rather than any specific advantage for the task.

Java utility classes for external repository API encapsulation:

./java/org/projectbamboo/cihub/northwestern/domain/
    fedora
        DataStream.java
        FedoraConnector.java
        FedoraConnectorREST.java
        HttpInputStream.java
    hathi
        HathiConnector.java

Scala “locator” and support classes:

./scala/org/projectbamboo/cihub/northwestern/domain/
    BambooRequest.scala
    BambooRequestImpl.scala
    BambooType.scala
    ConnectionException.scala
    HathiLocatorService.scala
    InvalidIDException.scala
    LocatorServiceAPI.scala
    PerseusLocatorService.scala
    RenderingType.scala
    TCPLocatorService.scala
    TextImageConverter.scala
    TimeoutException.scala
    ZipProcessor.scala
    ZoteroFileParser.scala

Custom CMIS Types and Configuration

./resources

These files provide definitions of custom CMIS objects. They serve as XML templates that “locator” classes use to create CMIS metadata files for the Bamboo Book Model folders and files holding external repository content.

./CMISTemplates
    cmis.xml
    cmis.xml.folder
    cmis.xml.item
    cmis.xml.locator
    cmis.xml.page
    cmis.xml.tcp

These files provide CMIS definitions of custom Bamboo types.

./CMISTypes
    bamboo-folder.xml
    bamboo-page-document.xml
    book.xml
    contents.xml
    example-type.xml
    metadata.xml
    page-image-jp2.xml
    page-image.xml
    page-morphadorned.xml
    page-plaintext.xml
    page-tei.xml
    page-thumb150.xml
    page-xhtml.xml
    page.xml
    source-aggregate.xml
    source-mets.xml
    source-page-image.xml
    source-page-ocr.xml
    source-page-xml.xml
    userfolder.xml
    volume-plaintext.xml

The configuration file for external repositories referenced by CI-Hub connectors is found in the config directory:

./config
    connector.properties
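
To tie the source layout back to the Configuration section, the “.locator” naming convention for declaring locator classes can be illustrated with a short sketch. It assumes only what that section states: a property named repository.REPOSITORY_ID.NAME.locator declares a locator class for the repository with that identifier. The class and method names below are hypothetical, and the actual CI-Hub parsing code may well differ.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of locator discovery; not the CI-Hub implementation. */
final class LocatorConfigSketch {

    /** Collects locator name -> class name for one repository identifier. */
    static Map<String, String> locatorsFor(Properties props, String repositoryId) {
        // Mirrors the documented convention: repository.REPOSITORY_ID.<name>.locator
        Pattern key = Pattern.compile(
            "repository\\." + Pattern.quote(repositoryId) + "\\.(.*)\\.locator");
        Map<String, String> result = new TreeMap<>();
        for (String name : props.stringPropertyNames()) {
            Matcher m = key.matcher(name);
            if (m.matches()) {
                result.put(m.group(1), props.getProperty(name).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Inline stand-in for a cihub.properties fragment; the commented-out
        // line is skipped by Properties.load, as in the documented file.
        String text = String.join("\n",
            "# repository.cihub.hathi.locator = org.projectbamboo.cihub.northwestern.domain.HathiLocatorService",
            "repository.cihub.perseus.locator = org.projectbamboo.cihub.northwestern.domain.PerseusLocatorService",
            "repository.cihub.tcp.locator = org.projectbamboo.cihub.northwestern.domain.TCPLocatorService",
            "repository.cihub.connectorConfig = /config/connector.properties");
        Properties props = new Properties();
        props.load(new StringReader(text));
        System.out.println(locatorsFor(props, "cihub"));
    }
}
```

Properties such as repository.cihub.connectorConfig do not end in “.locator” and are therefore ignored by the match, as are locator entries that belong to a different repository identifier.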