* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Welcome to the OMII
Survey
Document related concepts
Transcript
Distributed Data Linking using OGSA-DAI and OGSA-DQP Ally Hume 4th International Conference on e-Social Science 18 June 2008 Web: www.omii.ac.uk Email: [email protected] Motivation • Grid is about sharing resources. • OGSA-DAI is concerned with sharing structured data. Relational Database XML Database Indexed File Web: www.omii.ac.uk Email: [email protected] Sharing data via website download • Zip up the dataset and put it on a website. • Pros: o Easy for data provider. • Cons: o o o Possible very large download of only small portion required. User has to install data into a local database to use it. Static snapshot. Web: www.omii.ac.uk Email: [email protected] Sharing data via direct access • Tell client the URL of the database and a username and password, e.g: o mycomputer.epcc.ed.ac.uk:3306, userID, password • Pros: o Fast. • Cons: o o o o o Firewall issues. User and password management is hard. No consistent security model. Hard to use in Grid/Web service workflows. Continued on next slide… Web: www.omii.ac.uk Email: [email protected] Sharing data via direct access • Cons (continued) o o o o No server side layer in which to standardize database heterogeneities. Client needs to know, and have installed, correct driver for the database. Different drivers for Java, C#, C++, Fortran etc. Totally different API for different database types, e.g. JDBC for Relational, XMLDB for XML, Lucene for indexed files. Web: www.omii.ac.uk Email: [email protected] Domain-specific web service • Web service that gives access to the data using domain-specific operations, e.g. o o o Book findByISBN(ISBN) List<Book> findByAuthor(Author) List<Book> findByKeyword(Word) • Pros: o o o o Fits with grid/web service approach. Abstraction hides back-end database details. Web services are programming language neutral. Operations likely to map well to authorization policies. Web: www.omii.ac.uk Email: [email protected] Domain-specific web service cont. • Cons: o Slower than direct access • SOAP overhead. • SOAP very poor encoding for large result sets. o Domain-specific API prevents use of generic data exploration, mining and manipulation tools. Books Cancer Generic Data Linking Application Books written by University employees Web: www.omii.ac.uk University Employees University employees in 1932 who have since died of cancer Email: [email protected] OGSA-DAI generic web service • OGSA-DAI gives access to data via a generic web service. • Clients sees the data in its ‘raw’ format, e.g. o o Tables, columns, rows for relational data. Collections, elements etc. for XML data • Clients can obtain the schema of the data. • Clients send queries in appropriate query language, e.g. SQL, XPath. Relational Database request OGSA-DAI XML Database data Indexed File Web: www.omii.ac.uk Email: [email protected] OGSA-DAI • Pros: o o o Fits with grid/web service approach. Web services are programming language neutral. Access to schema and raw data supports generic tools. Web: www.omii.ac.uk Email: [email protected] Reducing use of SOAP • But SOAP is still slow for data delivery. • OGSA-DAI supports workflows to reduce data movement and the use of SOAP. OGSA-DAI Transform Web Service FTP Server Web: www.omii.ac.uk OGSA-DAI Query -> Transform -> DeliverToFTP FTP Server Email: [email protected] OGSA-DAI workflows • • • • Workflows composed of pipelined activities. Activities are installed at the server. Data streams between activities. Activities for data querying, data transforms, data integration and data delivery. • Allows some computation to be moved closer to the data. Web: www.omii.ac.uk Email: [email protected] OGSA-DAI Workflow Example Access SQLQuery ObtainFromHTTP SELECT * FROM Bands WHERE name = Bangles; http://www.someplace.org/styl esheets/webRowSetToHTML.xsl tuples Transform TupleToWebRowSetCharArrays XSL WebRowSet XML XSLTransform HTML Deliver DeliverToURL ftp://www.musicplace.org/bands/Bangles.html Web: www.omii.ac.uk Email: [email protected] Data linking scenario 1: VOTES • Virtual Organisations for Trials and Epidemiological Studies (VOTES). o • • • • http://labserv.nesc.gla.ac.uk/projects/votes/ Linking records related to medial trials. Linking key is Scotland’s CHI number. Data stored in relational databases. Perform joins over distributed data. Web: www.omii.ac.uk Email: [email protected] Data linking scenario 1 : VOTES CHI Name Sex DOB 13101955448 John Smith M 14/8/1965 23045635248 Mary Jones F 24/11/2000 CHI Diagnosis 13101955448 Lung cancer 23045635248 Leukaemia CHI Sex DOB Diagnosis 13101955448 M 14/8/1965 Lung cancer 23045635248 F 24/11/2000 Leukaemia Web: www.omii.ac.uk Email: [email protected] Data linking scenario 1 : VOTES workflow DB1 OGSA DAI DB2 SELECT CHI, Sex, DOB FROM Patients ORDER BY CHI SQLQuery (DB1) (CHI, Sex, DOB) (CHI, Sex, DOB, Diagnosis) Ordered data streams SQLQuery (DB2) Merge Join (CHI, Diagnosis) SELECT CHI, Diagnosis FROM TrialX ORDER BY CHI Web: www.omii.ac.uk Email: [email protected] Deliver Data linking scenario 1 : VOTES workflow DB1 OGSA DAI DB2 SELECT CHI, Sex, DOB FROM Patients WHERE CHI in (?) ORDER BY CHI Ordered data stream SQLQuery (DB2) SQL Nested In Clause Join (DB1) (CHI, Sex, DOB, Diagnosis) (CHI, Diagnosis) SELECT CHI, Diagnosis FROM TrialX ORDER BY CHI Web: www.omii.ac.uk Email: [email protected] Deliver Data linking across multiple servers • Data linking across OGSA-DAI services Workflow 1 OGSA DAI DB1 Data Workflow 2 SQLQuery (DB1) Deliver to OGSA-DAI OGSA DAI DB2 Receive from OGSA-DAI JOIN SQLQuery (DB2) Web: www.omii.ac.uk Email: [email protected] Deliver Data linking scenario 2 - CDC • • • • Centers for Disease Control and Prevention Real-time Outbreak and Disease Surveillance Unions over distributed data. Data in Health Level 7 format: Date 17/5/2008 17/5/2008 Reason Chest pain Flu Age 43 64 Sex M F Zip 15234 15112 • Task: For each postcode count the number of occurrences of reason ‘X’. Web: www.omii.ac.uk Email: [email protected] CDC RODS scenario Flu occurrence 17/5/2007 Zip Count 15112 2 15144 1 DB2 DB1 Date 17/5/2008 17/5/2008 Reason Chest pain Flu Age 43 64 Sex M F Web: www.omii.ac.uk Zip 15234 15112 Date 17/5/2008 17/5/2008 Reason Flu Flu Age 51 23 Email: [email protected] Sex M F Zip 15144 15112 CDC RODS Scenario: Merging using project workflow SELECT zip, count(*) as total FROM Cases WHERE Reason = “Flu” GROUP BY zip ORDER BY zip SQLQuery (DB1) SQLQuery (DB2) SELECT zip, count(*) as total FROM Cases WHERE Reason = “Flu” GROUP BY zip ORDER BY zip Web: www.omii.ac.uk (15112, 1) Ordered data streams (15112, 1) (15144, 1) DB1 OGSA DAI DB2 Project (15112, 2) (15144, 1) (group by zip and sum(total)) Email: [email protected] New database in Open MRS format DB3 encounter id date 500 17/6/2008 patient_id 10000 gender birth M 24/11/2005 person_address Person_id zip 2000 15112 concept id name Flu 23 FLU observation encounter_id concept_id 500 23 person id 2000 patient id person_id 10000 2000 Define view events DB4 date reason 17/6/2008 Flu Web: www.omii.ac.uk age sex zip 2 M 15112 Email: [email protected] CDC scenario: using a view DB1 workflow SELECT zip, count(*) as total FROM Cases WHERE Reason = “Flu” GROUP BY zip ORDER BY zip SQLQuery (DB1) SQLQuery (DB2) SELECT zip, count(*) as total FROM Cases WHERE Reason = “Flu” GROUP BY zip ORDER BY zip OGSA DAI DB2 DB4 (15112, 1) Ordered data streams (15112, 1) (15144, 1) SQLQuery (DB4) Web: www.omii.ac.uk Project (15112, 2) (15144, 1) (group by zip and sum(total)) View Project (group by zip and sum(total)) (15112, 1) Email: [email protected] DB3 (15112, 3) (15144, 1) OGSA-DQP • Distributed Query Processing • Allows tables in multiple databases to appear as tables in one database. Can do joins and unions over the tables. OGSA-DAI Virtual Green DB OGSA-DQP OGSA-DAI Blue DB OGSA-DAI Yellow DB Web: www.omii.ac.uk Email: [email protected] CDC Scenario: using OGSA-DQP DB1 workflow OGSA DAI DB5 DB2 OGSADQP DB4 SELECT zip, count(*) as total FROM DB1.Cases UNION DB2.Cases UNION DB4.Cases WHERE Reason = “Flu” GROUP BY zip ORDER BY zip (15112, 3) (15144, 1) SQLQuery (DB5) Web: www.omii.ac.uk Email: [email protected] View DB3 CDC Scenario: View over OGSA-DQP workflow DB1 OGSA DAI DB6 View DB5 DB2 OGSADQP SELECT zip, count(*) as total FROM Cases WHERE Reason = “Flu” GROUP BY zip ORDER BY zip SQLQuery (DB6) Web: www.omii.ac.uk (15112, 3) (15144, 1) DB4 View DB3 Cases: SELECT * FROM DB1.Cases UNION DB2.Cases UNION DB4.Cases Email: [email protected] OGSA-DAI views for schema transforms Date Site 11/10/1993 11/10/1993 31/10/1993 N654 PIRU N654 BEPA N654 PIRU Date Site 11/10/1993 N654 13/10/1993 1 13/10/1993 1 Web: www.omii.ac.uk Species Area Count 2 1 2 26 3 29 Date Site Picrub Betpap 13/10/1993 14/11/1993 1 1 13.5 8.4 1.6 1.8 Species Density Picea rubens Picea rubens Betula papyifera 13 13.5 1.6 Email: [email protected] OGSA-DAI views for schema transforms • View to map from TS2 to TS1: o SELECT Date, Site, "Picea rubens", Picrub FROM TS2 UNION SELECT Date, Site, "Betula papyifera", Betpap FROM TS2 • View to map from TS3 to TS2: o SELECT s1.Date, s1.Site, s1.Density, s2.Density FROM (SELECT * FROM TS3 WHERE Species="Picea rubens") AS s1 JOIN (SELECT * FROM TS3 WHERE Species="Betula papyifera") AS s2 ON s1.Date = s2.Date Web: www.omii.ac.uk Email: [email protected] OGSA-DAI views for security • Views can be used for security • A resource can use different views for different types of user, e.g. nurses and doctors could get different views on the same data. • The rows a user can see can be restricted, e.g. o SELECT patients.* FROM patients, doctor WHERE patients.doc_id = doctor.id AND doctor.dn = $DN$ • The views extension to OGSA-DAI will be released soon. Web: www.omii.ac.uk Email: [email protected] Data linking example 3: SEE-GEO • SEcurE access to GEOspatial services o o o http://edina.ac.uk/projects/seesaw/seegeo/index.html EDINA, MIMAS, NeSC, NCeSS UK JISC project • Geographical information systems • Virtual integration of and access control to o o o Census data Borders data Data hosted by other organisations and exposed as services • OGSA-DAI for federation of heterogeneous data resources Web: www.omii.ac.uk Email: [email protected] SEE-GEO – geo-linking service portal 1: GLSQuery submited via portal e.g. “Leeds population distribution by census output area” GLS Portal Maps 5: Portal gets image using URL 4: URL of image is returned to portal – avoids costly SOAP/HTTP transfer of image MIMAS Census OGSA-DAI Get Join Transform Deliver Get UK BORDERS 2: Workflow is populated with query parameters and run Web: www.omii.ac.uk Image Creation Service Email: [email protected] 3: Image is placed on a map server SEE-GEO – client’s view GeoLinking Service Portal Leeds population distribution by census output area? Client Web: www.omii.ac.uk Email: [email protected] Security • OGSA-DAI on Globus Toolkit: o o o o Message and transport level authentication, integrity and encryption. Message level proxy delegation. Transport level delegation on TODO list Authorization based to GT authorization framework • Plug-in PIPs and PDPs o Work done to integrate VOMS with OGSA-DAI as part of OMIIEurope project. • OGSA-DAI on OMII-UK: o o Transport level security Currently no support for resource-based authorization. Web: www.omii.ac.uk Email: [email protected] Current status of OGSA-DAI/DQP • Views functionality released soon • Continued work on DQP o o Group By not yet supported Derived tables not yet supported, so views on top of DQP does not yet work • Limited support XML-based data linking o o Work done in Japan Plan to look into this with Arts and Humanities project • We always endeavour to work closely with users to fill in any gaps. Web: www.omii.ac.uk Email: [email protected] That’s all folks! • • • • OMII-UK booth ENGAGE OGSA-DAI website: www.ogsadai.org.uk Any questions? Web: www.omii.ac.uk Email: [email protected]