1 Distributed Software Systems: Cyberinfrastructure and Geoinformatics
Chaitan Baru
San Diego Supercomputer Center

2 Integrated Cyberinfrastructure System
Source: Dr. Deborah Crawford, Chair, NSF CI Working Committee
Diagram labels:
• Education and Training
• Discovery & Innovation
• Applications: Geosciences, Environmental Sciences, Neurosciences, High Energy Physics, …
• Development Tools & Libraries
• Domain-specific Cybertools (software)
• Shared Cybertools (software)
• Middleware Services
• Hardware
• Distributed Resources (computation, storage, communication, etc.)

3 Community Cyberinfrastructure Projects
Adapted from: Prof. Mark Ellisman, UC San Diego
Diagram labels:
• Projects: Ecological Observatories (NEON), Earthquake Engineering (NEES), Ocean Observing (ORION), Geosciences (GEON), Biomedical Informatics (BIRN), High Energy Physics (GriPhyN)
• Your Specific Tools & User Apps.
• Friendly Work-Facilitating Portals: Authentication - Authorization - Auditing - Workflows - Visualization - Analysis
• Shared Tools, Science Domains
• Development Tools & Libraries
• Middleware Services
• Hardware
• Distributed Computing, Instruments and Data Resources

4 Data, Tools, & Computation
• Data
  – Field observations
  – Laboratory analyses
  – Sensor-based data (land, airborne, satellite)
• Tools
  – QA/QC, simple transformations and analyses
  – Complex models
• Computation
  – Community codes
  – Access to high-performance computing
  – Data Intensive Computing

5 Variety of Geoinformatics Efforts
• Data collection
  – Digital data collection in the field
  – “When does it become cyberinfrastructure?”
• Database curation
  – E.g. EarthChem, Paleobiology, MorphoBank, Paleo Pollen, etc.
  – When does it become “tools” and “community codes”?
• Software Development
  – Tools: gravity and magnetics, paleogeography, geochemistry, seismic data products, …
  – Community codes: SCEC-CME, CIG, …

6 Variety of Geoinformatics Efforts
• High Performance Computing
  – LiDAR data management
  – Seismic analyses
  – Petascale initiative
• Data Integration
  – E.g.
CUAHSI HIS
  – Also, a pressing need in projects like EarthScope

7 Cyberinfrastructure: The Common Platform Across Distributed Projects
Diagram labels: Cyberinfrastructure; Data Collection; Data Management and Curation; Modeling and Integration; Tool Development
To provide access to all of these “resources” and support “interoperability” among them

8 Example: USArray Data Flow
• Deploy field sensor arrays
  – Across US
• Collect data from sensor arrays and perform QA/QC
  – One of the sites is SIO, San Diego
• Archive data for community access
  – IRIS, Seattle
EarthScope/USArray: Single project, multiple participants.

9 Example: LiDAR Workflow
Courtesy: Chris Crosby, ASU; D. Harding, NASA
Workflow: Survey → Point Cloud (x, y, z, …) → Interpolate / Grid → Analyze / “Do Science”
Single goal: Multiple projects, multiple participants, e.g. NCALM, GEON, ASU, NASA, USGS, …

10 GEON Cyberinfrastructure
• Funded by NSF IT Research program
• Multi-institution collaboration between IT and Earth Science researchers
• GEON Cyberinfrastructure provides:
  – Authenticated access to data and Web services
  – Registration of data sets, tools, and services with metadata
  – Search for data, tools, and services, using ontologies
  – Scientific workflow environment and access to HPC
  – Data and map integration capability
  – Scientific data visualization and GIS mapping

11 Key Informatics Areas
• Portals
  – Authenticated, role-based access to cyber resources: data, tools, models, model outputs, collaboration spaces, …
• Data Integration
  – Search, discovery and integration of data from heterogeneous information sources (“mediation” and “semantic integration”)
• Use of workflow systems, and access to HPC
  – Ability to “program” at a higher level of abstraction
  – Sharing of models, along with “provenance” information
  – Gateways to HPC environments
• Management of Geospatial Information
  – Using GIS capabilities, map services, geospatial data integration
• Visualization of 3D, 4D geospatial data and information

12 Distributed System Definition
• A
Distributed System is
  – one in which the hardware and software components in networked computers communicate and coordinate their activities only by passing messages, e.g. the Internet
• A Distributed Database System is
  – one in which data is stored at several sites, each managed by a database system (DBMS) that can run independently

13 Distributed System Models
• Client – Server
  Diagram labels: Client A, Client B, Client C, Network, Server 1, invocation, response
• Peer to Peer
  Diagram labels: Process 1, Process 2, Process 3, Network

14 Remote Service Invocation
• TCP/IP
  – Basic Internet protocol for computer communications
  – Platform for building a number of other open or proprietary, “higher-level” communications protocols
• Communication at a higher level of abstraction
• http
  – Open protocol based on TCP/IP for the Web
  – Fixed set of “verbs” (actions) used to transfer HTML documents
• CORBA, Java RMI
  – Protocols based on an object model

15 SDSC Storage Resource Broker
“Virtualizing” storage
Diagram labels:
• Client interfaces: C, C++, Linux I/O; Unix Shell; Java, NT Browsers; Prolog Predicate; Web
• SRB, MCAT
• Metadata: Dublin Core; Resource, Mthd, User; User Defined; Application Meta-data
• Archives: HPSS, ADSM, UniTree, DMF
• File Systems: Unix, NT, Mac OSX
• Databases: DB2, Oracle, Sybase
• Metadata Extraction, Remote Proxies, DataCutter
http://www.sdsc.edu/srb

16 SRB Client/Server Model
Data are requested using an SRB ID and a “file abstraction” (open, close, read, write)
Diagram labels: SRB Client, Network, SRB Server, HPSS Client, HPSS server, Oracle Client, Oracle Server, SRB peer-to-peer protocol, SRB Server B

17 OpenDAP
• Client/Server model
Diagram labels: OpenDAP Servers, Network, OpenDAP Clients

18 OpenDAP
From: Peter Cornillon & Jim Gallagher, http://www.opendap.org/support/stennis_tutorial.html
Servers (data formats): CODAR, netCDF, HDF4, Matlab, DSP, Tables, SQL, FITS, CDF, Flat Binary, JGOFS, JDBC, CEDAR, General, ESML, FreeForm
Clients: netCDF C, Ferret, GrADS, netCDF Java, IDV,
VisAD, ncBrowse, Matlab Client, IDL Client, Matlab, IDL, Access, Excel

19 OpenDAP Data Request
• Data are requested with a URL.
• http://www.cdc.noaa.gov/cgi-bin/nph-nc/datasets/Reynolds_sst?sst[10:10][0:90][0:180]
  – Protocol: http
  – Machine name: www.cdc.noaa.gov
  – OPeNDAP server: cgi-bin/nph-nc
  – Directory: datasets
  – File name: Reynolds_sst
  – Constraint: sst[10:10][0:90][0:180]
• The user can impose a constraint on the data to be acquired from a data set by appending a constraint expression to the end of the URL

20 Remote Service Invocation with Web Services
• A Web Service is a simple protocol for invoking remote services on the Web. It is:
  – A network “endpoint”, i.e. server, that implements one or more “ports”.
    • Each port is defined by the message types it accepts and the messages it returns.
  – Specified by a “Web Services Description Language” (WSDL) xml document.
    • Given the WSDL for a web service, you know all you need to interact with it.
    • Web Service Standards also exist for security, policy, reliability, addressing, notification, choreography and workflow.
  – It is the basis for MS .NET, IBM Websphere, SUN, Oracle, BEA, HP, …
  – It is the basis for the new Grid standards like WSRF and OGSA.

21 Web Site vs Web Service
From: “Building Grid Applications and Portals, An Approach Based on Components, Web Services and Workflow Tools,” Gannon et al, Euro-Par 2004
• Web Site
  – Designed to pass http get/post/put requests between a browser and a web server.
  – Google has a web site.
• Web Service
  – Designed for services to talk to other services by exchanging xml messages
  – Google also provides a web service so Google may be used in distributed apps
Diagram labels: Client’s Browser, Web Server, Web Service, Web Service, Web Service

22 Grid Services
From: “Building Grid Applications and Portals, An Approach Based on Components, Web Services and Workflow Tools,” Gannon et al, Euro-Par 2004
• Grid: A distributed, heterogeneous set of resources
  – Integrated by a pervasive layer of services
  – Goal: allow users to view it as a single system
• More than the Internet (which forms part of the resource layer)
• Builds on the Web by building on web services
Diagram labels:
• Open Grid Service Architecture Layer: Registries and Name binding; Reservations and Scheduling; Data Management Service; Security Policy; Administration & Monitoring; Event Service; Logging; Accounting Service; Grid Orchestration
• Web Services Resource Framework – Web Services Notification
• Physical Resource Layer

23 Access Interfaces and Levels of Access
• Web service, native application program interface, ODBC/JDBC, filesystem
Diagram labels:
• SOAP server stack: WSDL and SOAP
• Web Server “stack”: URLs and http
• Application Program (SRB, OpenDAP, etc…): Application can also be “wrapped” as a Web Service
• DBMS: Expose ODBC/JDBC interface (and full SQL)
• filesystem: Mount remote filesystems

24 Authentication
• Client – Server models
Diagram labels: User, Client A, Client-side authentication, Network, Server 1 (?), Server 3 (?),
Server-side authentication, Server 2

25 Common Authentication
Diagram labels: Client, Certificate Authority, Obtain Credentials, Invoke with Credentials, Verify Credentials, Server 1, Server 2, Server 3

26 Grid Account Management Architecture (GAMA): Single sign-on in GEON (also used in a number of other projects)
Karan Bhatia, Kurt Mueller, Choonhan Youn, Sandeep Chandra
Diagram labels:
• Portal server 1, Portal server 2: gama gridportlets, GridSphere, DB, Servlet container, Java keystore
• GAMA server: OGSA Grid services wrapper, CACL, Myproxy, CAS, …, Servlet container, Java keystore
• Operations: import user, create user, retrieve credential
• Stand-alone applications

27 Systems Issues
• Load Balancing, Failover, Replication
Diagram labels: Client, Server 1, Server 2, Server 3
Multiple servers for load balancing, failover
Data replication

28 Distributed Data Access
• What is the issue?
• Ability to access data stored in multiple, different databases using a single request, e.g.
  – Get geologic information from multiple geologic databases
  – Get employee information from all branches
• Ability to update data stored in multiple databases, e.g.
  – Transfer salary amount from University to my bank account
  – Transfer funds from Visa account to vendor’s account

29 Distributed data access
Diagram labels: Client, Database 1, Database 2, Database 3
• Homogeneous: mySQL, mySQL, mySQL
• Heterogeneous: mySQL, Oracle, DB2, Excel, ASCII flat file
How about creating a “cached” local copy?
Sources may be data repositories or metadata catalogs

30 Data Warehousing
Diagram labels: Client, Data Warehouse (common schema), ETL, Data Source 1, Data Source 2, Data Source 3
1. Load data from sources to warehouse
  – Extract
  – Transform
  – Load
2. Query processing interaction only between client and warehouse
But, warehouse data could be “stale”, i.e. out of sync with source data…

31 Data integration via middleware
Diagram labels: Client, Data integration Middleware (aka Mediator), Database 1
1. Each client request goes to sources, via middleware
2.
Result collected by middleware and returned to client
Diagram labels: Database 2, Database 3

32 Warehousing vs Mediation
• Warehousing: Use ETL to “massage” local data to fit into a common global, warehouse schema
• Mediation: Modify user query to match schemas exported by each source
  – But, which schema does the user query?
  – The Integrated View Schema
  – Sources “export” a view (the export schema)
• Federated databases
  – Local sources belong to different “administrative domains”, i.e. different owners.
  – Local autonomy

33 The Canonical Mediator / Wrapper Architecture
Diagram labels:
• Client Application: sends query Q1
• Mediator (Integrated view in mediator data model, e.g. relational, XML): Cached data; sends queries Q11, Q12, Q13, Q14
• Wrapper (one per source): Export view in mediator data model; Local view in local data model; q14
• Local schema (one per source)
• Data source 1, Data source 2, Data source 3, Data source 4
Wrapper processes could execute at sources, at mediator, or elsewhere

34 Example: A Relational Mediator
Diagram labels: Client Application; Mediator (Relational data model); Wrapper, Wrapper; Relational DBMS (e.g. PostGIS); Shape file

35 Example: A Shape-file Based Mediator
Diagram labels: Client Application; Mediator (Shape file-based data model); Wrapper, Wrapper; Relational DBMS (e.g. PostGIS); Shape file

36 Example: An XML Mediator
Diagram labels: User / Applications; Mediator (XML-based data model, e.g. GML); Wrapper, Wrapper, Wrapper; Relational DBMS (e.g. PostGIS); Shape file; XML file (e.g. ArcXML)

37 User Authentication and Access Control
How about using GAMA for authentication?
1. User authenticates to system
2. User connects to mediator (passes credentials to mediator)
3. Mediator connects to sources
  a) Using original user credentials
  b) Or, mapped credentials (role-based access)
4. Need to define users or roles in sources
Diagram labels: Client Application, Mediator, Wrapper, Wrapper, Data source 1, Data source 2

38 Different types of heterogeneity in data integration
• Platform heterogeneity: different OS platforms
• DBMS heterogeneity: different database systems, e.g.
SQLServer, mySQL, DB2
• Data type heterogeneity
• Schema heterogeneity
• Heterogeneity in units, accuracy, resolution
• Semantic heterogeneity

39 Schema Integration
• A long-standing Computer Science problem
• Simple case
  Source 1 Table: Sample ID: varchar; Rock type: varchar; Age: int; …
  Source 2 Table: Sample ID: varchar; Rock type: varchar; Age: varchar; …
  Source 1: Wrapper
  Source 2: Wrapper: convert between int and varchar for Age
  – Mediator View: (SampleID varchar, Rock_Type varchar, Age int)
  – In Source2 Table, map Age to int

40 Another integration scenario
  Source 1 Table: Sample ID: varchar; Rock type: varchar; Eon: varchar (e.g. Phanerozoic); Era: varchar (e.g. Mesozoic); Period: varchar (e.g. Jurassic)
  Source 2 Table: Sample ID: varchar; Rock type: varchar; Age: varchar (e.g. “Phanerozoic/mesozoic;jur”)
  – Mediator View: (SampleID varchar, Rock_Type varchar, Age varchar, Era varchar, Period varchar)
  – In Source 2 Table, parse Age to obtain sub-components of the field

41 A more advanced integration scenario
  Source 1 Table: Sample ID: varchar; Rock type: varchar; Eon: varchar (e.g. Phanerozoic); Era: varchar (e.g. Mesozoic); Period: varchar (e.g. Jurassic)
  Source 2 Table: Sample ID: varchar; Rock type: varchar; Age: int (e.g. 150)
• Mediator View: (SampleID varchar, Rock_Type varchar, Eon varchar, Era varchar, Period varchar)
  – Same as Source1 table schema
• Query: Get rock types for all rocks from the Jurassic period

42 Doing the integration
• Query sent to mediator:
  SELECT DISTINCT(Rock_Type) FROM Mediator_View WHERE Period='Jurassic'
• Query to Source 1:
  SELECT DISTINCT(Rock_Type) FROM Source1_Table WHERE Period='Jurassic'
• For Source2, need to map Period='Jurassic' to Age values
  Source 2 Table: Sample ID: varchar; Rock type: varchar; Age: int
  Geologic_Time Table: Eon: varchar; Era: varchar; Period: varchar; Min: int; Max: int

43 Query “fragment” sent to Source 2
Where is the Geologic_Time table stored?
• SELECT DISTINCT (S2.Rock_Type)
  FROM Source2_Table S2, Geologic_Time_Table GT
  WHERE GT.Period = 'Jurassic' AND (S2.Age >= GT.Min) AND (S2.Age <= GT.Max)

44 Data Integration Carts™
• Integrating data sets without explicitly creating views
• An example request: Plot all gravity data points that fall within the spatial extent of rocks of a given type, in the Rocky Mountain testbed region
  – Use GEONsearch to find all gravity and geologic data using bounding box for “Rocky Mountain testbed region”
    • Need gazetteer / spatial ontology to determine Rocky Mountain region
    • Need to know classification of datasets (as gravity and geology)
    • Intersect extent of gravity and geologic datasets (from metadata) with extent of Rocky Mountain region
  – Plot gravity point data that fall within polygons of rocks of given type

45 Ad hoc integration
Diagram labels: GEONsearch (“Geologic and gravity data in Rocky Mountains”); Search Metadata Catalog; Data Integration Cart™; Query; Plot map; Map

46 Data Registration
Diagram labels:
• Spatial Ontology: Location; Point, Polygon; Latitude, Longitude
• Rock Classification Ontology: Igneous; Granite, Quartzmonzonite
• Item Registration (Schema registration)
• Item Detail Registration
• Gravity dataset: Metadata; (X, Y)
• Geologic dataset: Metadata; Lat, Long, RockType

48 Another complex query
• Query: Get rock types for all rocks from the Mesozoic era
  – Easy to do for Source 1: Era = “Mesozoic”
  – For Source 2:
    • Need to find numeric age range for Mesozoic
      – Find age range across all subclasses of Mesozoic (Cretaceous, Jurassic, Triassic)
    • Select all Source 2 Table records whose age falls within the Mesozoic age range
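
The Source 2 rewrites on slides 42, 43 and 48 can be sketched in Python with an in-memory SQLite database. This is only an illustration under stated assumptions, not the GEON mediator: the function names, the sample rows, and the rounded age boundaries (in millions of years) are hypothetical, the columns are named MinAge and MaxAge instead of Min and Max to avoid confusion with the SQL aggregate functions, and the Geologic_Time table is assumed to be stored alongside Source 2, which is exactly the question slide 43 leaves open.

```python
import sqlite3

# Hypothetical, rounded period boundaries in millions of years (illustrative only).
GEOLOGIC_TIME = [
    ("Phanerozoic", "Mesozoic", "Cretaceous", 66, 145),
    ("Phanerozoic", "Mesozoic", "Jurassic", 145, 201),
    ("Phanerozoic", "Mesozoic", "Triassic", 201, 252),
    ("Phanerozoic", "Paleozoic", "Permian", 252, 299),
]

# Hypothetical Source 2 records: (Sample ID, Rock type, Age as int).
SOURCE2_ROWS = [
    ("S2-1", "Granite", 150),
    ("S2-2", "Basalt", 100),
    ("S2-3", "Sandstone", 230),
    ("S2-4", "Shale", 280),
    ("S2-5", "Granite", 160),
]


def make_demo_db():
    """Create an in-memory database holding Source 2 and the lookup table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Source2_Table (Sample_ID TEXT, Rock_Type TEXT, Age INTEGER)"
    )
    conn.execute(
        "CREATE TABLE Geologic_Time_Table "
        "(Eon TEXT, Era TEXT, Period TEXT, MinAge INTEGER, MaxAge INTEGER)"
    )
    conn.executemany("INSERT INTO Source2_Table VALUES (?, ?, ?)", SOURCE2_ROWS)
    conn.executemany(
        "INSERT INTO Geologic_Time_Table VALUES (?, ?, ?, ?, ?)", GEOLOGIC_TIME
    )
    return conn


def rock_types_for_period(conn, period):
    """Slide 43: join Source 2 with the lookup table on the period's age range."""
    rows = conn.execute(
        """
        SELECT DISTINCT S2.Rock_Type
        FROM Source2_Table S2, Geologic_Time_Table GT
        WHERE GT.Period = ?
          AND S2.Age >= GT.MinAge
          AND S2.Age <= GT.MaxAge
        """,
        (period,),
    )
    return sorted(row[0] for row in rows)


def rock_types_for_era(conn, era):
    """Slide 48: take the age range across all periods of the era, then filter."""
    low, high = conn.execute(
        "SELECT MIN(MinAge), MAX(MaxAge) FROM Geologic_Time_Table WHERE Era = ?",
        (era,),
    ).fetchone()
    if low is None:  # unknown era: the lookup table has no matching rows
        return []
    rows = conn.execute(
        "SELECT DISTINCT Rock_Type FROM Source2_Table WHERE Age BETWEEN ? AND ?",
        (low, high),
    )
    return sorted(row[0] for row in rows)


if __name__ == "__main__":
    db = make_demo_db()
    print(rock_types_for_period(db, "Jurassic"))
    print(rock_types_for_era(db, "Mesozoic"))
```

Because the comparisons are inclusive, as on slide 43, a sample whose age sits exactly on a boundary would match two adjacent periods; a real lookup table would need a convention for that case.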