EGI User Forum, 13 April 2011, Vilnius (Lithuania)
e-Infrastructure Integration with gCube
Andrea Manzi (CERN)
Pasquale Pagano (ISTI-CNR)
www.d4science.eu

Outline
• D4Science II Ecosystem
• gCube architecture
• Interoperability approaches
  ◦ Resource Discovery
  ◦ Data Storage & Access
  ◦ Data Discovery
  ◦ Data Process
  ◦ Security
• Applications
  ◦ AquaMaps
  ◦ Time Series

D4Science II Ecosystem
• Heterogeneous resources
• Heterogeneous computational platforms
• Rich set of legacy applications
• Multiple administrative domains
• Evolving communities
[Diagram labels: D4Science infrastructure, FAO Geonetwork, FAO FIGIS, INSPIRE, AquaMaps, Hadoop, EGEE/EGI, DRIVER, GENESI-DR, Portal, Community A, Community B, Community C]

gCube architecture
[Diagram labels: gCube run-time environment; gCube Definition and Management Services; gCube Application Services; Presentation Services; Portlets; Information Organization Services; Collection / Content / Metadata / Annotation / … Management; Ontology Management; Process Execution Management; VRE Management; Storage Management; gCube Container; Information Access Services; Search Framework; Personalization Service; Index Management Framework; User Services; Application Support Layer; DIR Support Framework; Information System; Security; gCore Framework]

Virtual Organization
A Virtual Organization (VO) specifies how a set of users can access a set of resources:
• what is shared
• who is allowed to share
• the conditions under which sharing can occur
The VO concept is not adequate to cover some common scenarios:
• Data needs to be assessed before it is made publicly exploitable by the VO members.
• A restricted set of users has to collaborate to refine processes and implement showcases.
• Products generated through elaboration of data or simulation have to be validated by expert users.

Virtual Research Environment
[Diagram labels: VRE 1, VRE 2, VO]
A Virtual Research Environment (VRE) is a distributed and dynamically created environment where a subset of resources can be assigned to a subset of users via interfaces
• for a limited timeframe
• at little or no cost for the providers of the infrastructure
Integrated with cloud systems (OpenNebula)
gCube is a first example of a VRE management system

Interoperability: Assumptions
• Very rich applications and data collections are currently maintained by a multitude of authoritative providers
• Different problems require different execution paradigms: batch, map-reduce, synchronous call, message-queue, …
• Key distributed computation technologies exist: grid (gLite and Globus), distributed resource management (Condor), clusters (Hadoop), …
• Several standards are adopted in the same domain

Interoperability: Landscape
[Diagram labels: Security, Data Process, Data Access, Data Storage, Resource Discovery, Data Discovery]
• Unstructured Data: blob (binary) and textual files
• Structured Data: tabular, statistical, geospatial, temporal, and textual data
• Compound Data: data composed of unstructured and structured data entities

Interoperability: gCube Vision
gCube objectives:
• hide heterogeneity, i.e. abstract over differences in location, protocol, and model
• embrace heterogeneity, i.e.
  allow for multiple locations, protocols, and models
Technical goals:
• no bottlenecks: scale no less than the interfaced resources
• no outages: keep failures partial and temporary
• autonomicity: system reacts and recovers

Hiding Heterogeneity
• Heterogeneous resources are virtually accessible in a common ecosystem of resources
  ◦ regardless of their locations, technologies, and protocols
• Different communities have access to different views
  ◦ according to the conditions under which the sharing can occur
• Each community can define its own VRE
  ◦ for a limited timeframe and at no cost for the providers of the resource
• Several VREs can coexist
  ◦ without interfering with each other, even when competing for the same resources

Embracing Heterogeneity
Approaches and solutions to achieve interoperability:
• Blackboard-based
  ◦ asynchronous communication between components in a system
  ◦ one protocol to read/write and one language to specify messages
• Wrapper/Mediator-based
  ◦ translates one interface for a component into a compatible interface
• Adaptor-based
  ◦ provides a unified interface to a set of other component interfaces and encapsulates how this set of objects interacts

gCube interoperability framework: the solution

Interoperability Approaches: Resource Discovery
Each resource is represented by a profile (metadata) characterising:
• the interface
• the state
• the list of dependencies
• the run-time status
• the policies
• the configuration
• the pending tasks to execute
A resource profile
• is published by the resource owner
• is discovered by the resource consumers asynchronously through a common resource-independent protocol
gCube offers a distributed and scalable Information System (blackboard) to store, discover, and access resource profiles
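
The publish/discover cycle described for resource profiles can be sketched in a few lines of Python. This is only an illustration of the blackboard pattern: the `ResourceProfile` fields mirror the list on the slide, while the `InformationSystem` class, its `publish` and `discover` methods, and the sample resource names are hypothetical and are not the gCube Information System API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ResourceProfile:
    """Metadata describing one resource (fields follow the slide's list)."""
    resource_id: str
    interface: str
    state: str = "ready"
    dependencies: List[str] = field(default_factory=list)
    runtime_status: str = "unknown"
    policies: List[str] = field(default_factory=list)
    configuration: Dict[str, str] = field(default_factory=dict)
    pending_tasks: List[str] = field(default_factory=list)


class InformationSystem:
    """Hypothetical in-memory blackboard: owners publish, consumers query."""

    def __init__(self) -> None:
        self._profiles: Dict[str, ResourceProfile] = {}

    def publish(self, profile: ResourceProfile) -> None:
        # Re-publishing under the same id overwrites the previous profile.
        self._profiles[profile.resource_id] = profile

    def discover(
        self, predicate: Callable[[ResourceProfile], bool]
    ) -> List[ResourceProfile]:
        # Consumers never contact the owner; they only read the blackboard.
        return [p for p in self._profiles.values() if predicate(p)]


# Resource owners publish their profiles (sample data, invented for the sketch).
info_system = InformationSystem()
info_system.publish(
    ResourceProfile("index-01", interface="search", runtime_status="running"))
info_system.publish(
    ResourceProfile("store-01", interface="storage", runtime_status="running"))
info_system.publish(
    ResourceProfile("store-02", interface="storage", runtime_status="down"))

# A consumer later asks for every storage resource that is currently running.
running_storage = info_system.discover(
    lambda p: p.interface == "storage" and p.runtime_status == "running"
)
print([p.resource_id for p in running_storage])
```

Publisher and consumer share only the profile format and the two operations, which is what makes the protocol resource-independent. A real deployment would add distribution, persistence, and expiry of stale profiles, all omitted here.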

Interoperability Approaches: Content Interoperability [1/2]
gCube Open Content Management Architecture (OCMA)
Assumption
• data stored in different storage back-ends
• diverse locations, models, access types
• few common primitives: documents, collections, repositories
gCube makes it possible to
• reach content that lies outside the system
• expose content (reachable from) inside the system
• perform coarse-grained as well as fine-grained retrieval, update, and addressing
Runtime
• scalability: autonomic read-only state replication
• maximize throughput, minimize response time: discovery-time load balancing (through the IS)
• reduce latencies
Software
• plugin-based architecture to reduce development costs (plugins over storage systems)

Interoperability Approaches: Content Interoperability [2/2]
Content Manager Service (OCMA Service)
• Adapts the gCube document model (gDoc) to an unbounded number of back-end types
[Diagram labels: gDoc factory, gDoc Write, gDoc Read, adapts T1, adapts T2, …]

Interoperability Approaches: Data Discovery
gCube offers several index types:
• Forward indexing, which supports ultra-fast lookups on tabular typed metadata
• XML indexing, which supports semi-structured lookups on content metadata
• Textual field indexing, which supports full-text and qualified lookups on (mainly) textual metadata
• Metadata full-text indexing, which enables full-text lookups on metadata
• Content full-text indexing, which enables full-text lookups on text extracted from content
• Geospatial/temporal indexing, which enables geospatial proximity and coverage queries over geospatial/temporal metadata
• Feature indexing, which enables high-dimension vector indexing for feature lookup (currently inactive)

Interoperability
Approaches: Process Execution [1/2]
gCube offers solutions to:
• Decouple the business domain and infrastructure-specific logic from the core “execution” functionality
• Invoke a wide range of logic components: SOAP and REST Web Services, shell scripts, executable binaries, POJOs, …
• Support most execution paradigms: batch, map-reduce, synchronous call
• Bridge key distributed computation technologies: grid (gLite and Globus), Condor, Hadoop
• Control and monitor the execution of a processing flow
• Stage data among different storage providers
• Stream data among computation elements

Interoperability Approaches: Process Execution [2/2]
Adaptors that operate on a specific third-party language and translate it into native constructs allow the creation of complex workflows that exploit several diverse technologies deployed on different infrastructures

Interoperability Approaches: Security [1/2]
gCube offers solutions:
• To secure access to gCube resources for interoperable external systems (incoming security)
• To ensure interoperability of gCube security mechanisms with standards-compliant security systems (reuse)
• To facilitate secure access to external resources for gCube services (outgoing)

Interoperability Approaches: Security [2/2]
Authz:
• XACML for the authz request/response protocol and policy definition
• SAML assertions to transport user/service authN information
• Argus-based approach (EMI Authz framework) with a pluggable design to integrate additional PIPs
• SAML Profile for XACML 2.0 following the OASIS Authorization Interoperability Profile Specification
AuthN:
• Production-level SSL/HTTPS support
• Key- and Trust-Manager
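
The pluggable design mentioned in the Authz bullets can be illustrated with a toy policy decision point. The sketch below is plain Python, not XACML, SAML, or the Argus API. The `Request` type, the `role_pip` attribute provider, the rule functions, and the sample users and resource name are all hypothetical. They only show how policy information points (PIPs) can enrich a request before rules are evaluated.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Attributes = Dict[str, str]


@dataclass
class Request:
    subject: str
    resource: str
    action: str
    attributes: Attributes = field(default_factory=dict)


# A PIP inspects a request and returns extra attributes about it.
Pip = Callable[[Request], Attributes]
# A rule returns "Permit", "Deny", or None when it does not apply.
Rule = Callable[[Request], Optional[str]]


class PolicyDecisionPoint:
    """Toy decision point: run every PIP, then take the first applicable rule."""

    def __init__(self, pips: List[Pip], rules: List[Rule]) -> None:
        self._pips = pips
        self._rules = rules

    def decide(self, request: Request) -> str:
        for pip in self._pips:
            request.attributes.update(pip(request))
        for rule in self._rules:
            decision = rule(request)
            if decision is not None:
                return decision
        return "Deny"  # nothing applied: deny by default


# Hypothetical attribute source; a real PIP might query a VO membership service.
ROLES = {"alice": "vre-manager", "bob": "guest"}


def role_pip(request: Request) -> Attributes:
    return {"role": ROLES.get(request.subject, "unknown")}


def managers_may_deploy(request: Request) -> Optional[str]:
    if request.action != "deploy":
        return None
    is_manager = request.attributes.get("role") == "vre-manager"
    return "Permit" if is_manager else "Deny"


def anyone_may_read(request: Request) -> Optional[str]:
    return "Permit" if request.action == "read" else None


pdp = PolicyDecisionPoint(
    pips=[role_pip], rules=[managers_may_deploy, anyone_may_read])
print(pdp.decide(Request("alice", "aquamaps-service", "deploy")))  # Permit
print(pdp.decide(Request("bob", "aquamaps-service", "deploy")))    # Deny
print(pdp.decide(Request("bob", "aquamaps-service", "read")))      # Permit
```

Adding a new attribute source means appending another function to `pips`, and the rules stay unchanged. That separation is the point of a pluggable PIP design.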

Species Distribution Maps Generation
AquaMaps is an application* tailored to predict global distributions of marine species. Initially designed for marine mammals and subsequently generalised to marine species, it generates color-coded species range maps on half-degree latitude and longitude blocks by interfacing several databases and repository providers.
* Algorithm by Kaschner et al. 2006

Species Distribution Maps Generation
AquaMaps execution is based on the gCube Ecological Niche Modelling Suite, which allows the extrapolation of known species occurrences
◦ to determine environmental envelopes (species tolerances)
◦ to predict future distributions by matching species tolerances against local environmental conditions (e.g. climate change and sea pollution)
Very large volume of input and output data:
• HSPEC native range: 56,468,301
• HSPEC suitable range: 114,989,360
Very large number of computations:
• One multispecies map computed on 6,188 half-degree cells (out of over 170k) and 2,540 species requires 125 million computations (Eli E. Agbayani, FishBase Project/INCOFISH WP1, WorldFish Center)

Time Series Management
• Offers a set of tools to manage capture statistics
• Supports the complete TS lifecycle
• Supports validation, curation, and analysis
• Provides support for data reallocation
• Produces a uniform data set

Time Series and R statistical software integration
The main aims are to:
• provide a complete, fully working environment for the R language
• give users methods to automatically extract data from the time series they are working on
• give users the possibility to perform queries on the time series database
• provide a service distributed on the infrastructure.
  Multiple instances can be managed on the infrastructure VREs, with the distribution transparent to the users (SaaS model)

Conclusions
• gCube System: stable software, improved over the last 5 years (end of DILIGENT -> D4Science -> D4Science II)
• gCube offers a variety of patterns, tools, and solutions
  ◦ to deliver interoperability solutions and interconnect heterogeneous digital content, heterogeneous repository systems, and heterogeneous computation platforms
  ◦ to decrease the cost of adoption
  ◦ to deal with several standards

Questions