Scalable Semantic Web Data Management Using Vertical Partitioning

Daniel Abadi (2,1), Adam Marcus (2), Samuel Madden (2), and Kate Hollenbach (2)
(1) Yale University   (2) MIT
May 22, 2017


RDF Data Is Proliferating

  Semantic Web vision: make Web machine-readable
  RDF is the data model behind Semantic Web
  Increasing amount of data published using RDF
    Swoogle indexes 2,271,350 Semantic Web documents
  Biologists seem sold on Semantic Web
    Integrated data from Swiss-Prot, TrEMBL, and PIR protein databases available in RDF (500 million statements)

5/22/2017   Daniel Abadi - Yale


DBFacebook: A New Social Networking Application

  RDF Data Model

  [Figure: RDF graph. PersonID1 and PersonID2 (both of type Person, named Mike Stonebraker and David DeWitt) are linked by knows edges, carry likes/dislikes values, and point via authorOf edges to Pub101, Pub102, and Pub103, each of which has a title and a venue (SIGMOD or VLDB).]


DBFacebook: A New Social Networking Application

  [Figure: the same graph with full identifiers. Nodes are URIs such as http://DBFaceBook.com/PersonID1, http://DBFaceBook.com/PersonID2, http://DBFaceBook.com/Pub101, http://DBFaceBook.com/Pub102, and http://DBFaceBook.com/Pub103; edges are foaf:name, foaf:knows, rdf:type, dbfb:likes, dbfb:dislikes, dbfb:authorOf, dbfb:title, and dbfb:venue; the class is foaf:Person and the venues are dbfb:SIGMOD and dbfb:VLDB.]


RDF Data Management

  Early projects built their own RDF stores
  Trend now towards storing in RDBMSs
  Paper examines 3 approaches for storing RDF data in an RDBMS …


DBFacebook RDF Graph

  [Figure: the DBFacebook RDF graph shown above, repeated before each of the three approaches.]


Approach 1: Triple Stores

  Subject     Property   Object
  PersonID1   type       Person
  PersonID1   name       "Mike Stonebraker"
  PersonID1   likes      "Things found in nature (streams, sequoias, auroras)"
  PersonID1   dislikes   "Elastic/Velcro/Anything 'One-size-fits-all'"
  PersonID1   authorOf   Pub101
  PersonID1   authorOf   Pub102
  PersonID2   type       Person
  PersonID2   name       "David DeWitt"
  PersonID2   dislikes   "Double blind reviewing"
  PersonID2   authorOf   Pub102
  PersonID2   authorOf   Pub103
  Pub101      title      "The Design of Postgres"
  Pub101      venue      SIGMOD
  Pub102      title      "Implementation Techniques for Main Memory Databases"
  Pub102      venue      SIGMOD
  Pub103      title      "GAMMA – A High Performance Dataflow Database"
  Pub103      venue      VLDB


Approach 2: Property Tables

  Subject     name               likes                                                  dislikes
  PersonID1   Mike Stonebraker   Things found in nature (streams, sequoias, auroras)   Elastic/Velcro/Anything 'One-size-fits-all'
  PersonID2   David DeWitt       NULL                                                   Double Blind Reviewing

  Subject   title                                                   venue
  Pub101    "The Design of Postgres"                                SIGMOD
  Pub102    "Implementation Techniques for Main Memory Databases"   SIGMOD
  Pub103    "GAMMA – A High Performance Dataflow Database"          VLDB


Approach 3: One-table-per-property

  name
  Subject     Object
  PersonID1   Mike Stonebraker
  PersonID2   David DeWitt

  dislikes
  Subject     Object
  PersonID1   Elastic/Velcro/Anything 'One-size-fits-all'
  PersonID2   Double Blind Reviewing

  likes
  Subject     Object
  PersonID1   Things found in nature (streams, sequoias, auroras)

  authorOf
  Subject     Object
  PersonID1   Pub101
  PersonID1   Pub102
  PersonID2   Pub102
  PersonID2   Pub103


Paper Contributions

  Explores advantages/disadvantages of these approaches
    Triple stores are the dominant choice
    Property tables implemented by Jena and Oracle
    We propose the one-table-per-property approach
  Shows how a column-store can be extended to implement the one-table-per-property approach
  Introduces benchmark for evaluating RDF stores


Results Synopsis

  Triple-store really slow on benchmark with 50M triples
  Property-tables and one-table-per-property approaches are a factor of 3 faster
  One-table-per-property with column-store yields another factor of 10


Querying RDF Data

  SPARQL is the dominant language
  Examples:

    SELECT ?name
    WHERE { ?x type Person .
            ?x name ?name }

    SELECT ?likes ?dislikes
    WHERE { ?x title "Implementation Techniques for Main Memory Databases" .
            ?y authorOf ?x .
            ?y likes ?likes .
            ?y dislikes ?dislikes }


Translation to SQL over triples is easy

  [Table: the Subject / Property / Object triples table from Approach 1: Triple Stores.]


SPARQL → SQL (over triple store)

  Query 1 SPARQL:

    SELECT ?name
    WHERE { ?x type Person .
            ?x name ?name }

  Query 1 SQL:

    SELECT B.object
    FROM triples AS A, triples AS B
    WHERE A.subject = B.subject
      AND A.property = 'type'
      AND A.object = 'Person'
      AND B.property = 'name'


SPARQL → SQL (over triple store)

  Query 2 SPARQL:

    SELECT ?likes ?dislikes
    WHERE { ?x title "Implementation Techniques for Main Memory Databases" .
            ?y authorOf ?x .
            ?y likes ?likes .
            ?y dislikes ?dislikes }

  Query 2 SQL:

    SELECT C.object, D.object
    FROM triples AS A, triples AS B, triples AS C, triples AS D
    WHERE A.subject = B.object
      AND A.property = 'title'
      AND A.object = 'Implementation Techniques for Main Memory Databases'
      AND B.property = 'authorOf'
      AND B.subject = C.subject
      AND C.property = 'likes'
      AND C.subject = D.subject
      AND D.property = 'dislikes'


Triple Stores

  Accessing multiple properties for a resource requires subject-subject joins
  Path expressions require subject-object joins
  Can improve performance by:
    Indexing each column
    Dictionary encoding string data
  Ultimately: do not scale


Property Tables Can Reduce Joins

  Subject     name               likes                                                  dislikes
  PersonID1   Mike Stonebraker   Things found in nature (streams, sequoias, auroras)   Elastic/Velcro/Anything 'One-size-fits-all'
  PersonID2   David DeWitt       NULL                                                   Double Blind Reviewing

  Left-over triples

  Subject     Property   Object
  PersonID1   authorOf   Pub101
  PersonID1   authorOf   Pub102
  PersonID2   authorOf   Pub102
  PersonID2   authorOf   Pub103
  …           …          …


Property Tables

  Complex to design
    If narrow: reduces nulls, increases unions/joins
    If wide: reduces unions/joins, increases nulls
  Implemented in Jena and Oracle
    But main representation of data is still triples


Table-Per-Property Approach

  [Tables: the name, dislikes, likes, and authorOf tables from Approach 3: One-table-per-property.]

  + Nulls not stored
  + Easy to handle multi-valued attributes
  + Only need to read relevant properties
  Still need joins (but they are linear merge joins)


Materialized Paths

  [Figure: the DBFacebook RDF graph with additional authorOf:title edges that connect PersonID1 and PersonID2 directly to the titles of the publications they authored.]


Accelerating Path Expressions

  Materialize common paths
  Improved property table performance by 18-38%
  Improved one-table-per-property performance by 75-84%
  Use automatic database designer (e.g., C-Store/Vertica) to decide what to materialize

  authorOf:title
  Subject     Object
  PersonID1   The Design of Postgres
  PersonID1   Implementation Techniques for Main Memory Database Systems
  PersonID2   Implementation Techniques for Main Memory Database Systems
  PersonID2   GAMMA – A High Performance Dataflow Database Machine


One-table-per-property → Column-Store

  Can think of one-table-per-property as vertically partitioning a super-wide property table
  Column-store is a natural storage layer to use for vertical partitioning
  Advantages:
    Tuple headers stored separately
    Column-oriented data compression
    Do not necessarily have to store the subject column
    Carefully optimized merge-join code


Library Benchmark

  Data
    Real library data (50 million RDF triples)
    Data acquired from a variety of diverse sources (some quite unstructured)
  Queries
    Automatically generated from the Longwell RDF browser
  Details in paper …


Results

  [Figure: benchmark results chart.]


Conclusions and Future Work

  Experimented with storing RDF data using different schemas in RDBMSs (both row- and column-oriented)
  Future work: build a fully-functional RDF database
    Extracts and loads RDF data from structured, semi-structured, and unstructured data sources
    Translates SPARQL to queries over vertical schema
    Performs reasoning inside the DB
    Use with biology research
  Excited about this work? Then …


Come To Yale!
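
Appendix sketch: Query 2 over both schemas

  The deck gives the SQL for Query 2 over the triples table but not over the one-table-per-property schema. The following is a minimal sketch of how the two might compare, using Python's standard sqlite3 module and the DBFacebook example data. The helper names (load, run_query2) and the rewritten vertical query are illustrative assumptions, not the authors' code, and SQLite is a row store, so this only shows the shape of the schemas and queries, not the column-store performance reported in the talk.

```python
import sqlite3

# DBFacebook example triples, as listed under "Approach 1: Triple Stores".
TRIPLES = [
    ("PersonID1", "type", "Person"),
    ("PersonID1", "name", "Mike Stonebraker"),
    ("PersonID1", "likes", "Things found in nature (streams, sequoias, auroras)"),
    ("PersonID1", "dislikes", "Elastic/Velcro/Anything 'One-size-fits-all'"),
    ("PersonID1", "authorOf", "Pub101"),
    ("PersonID1", "authorOf", "Pub102"),
    ("PersonID2", "type", "Person"),
    ("PersonID2", "name", "David DeWitt"),
    ("PersonID2", "dislikes", "Double blind reviewing"),
    ("PersonID2", "authorOf", "Pub102"),
    ("PersonID2", "authorOf", "Pub103"),
    ("Pub101", "title", "The Design of Postgres"),
    ("Pub101", "venue", "SIGMOD"),
    ("Pub102", "title", "Implementation Techniques for Main Memory Databases"),
    ("Pub102", "venue", "SIGMOD"),
    ("Pub103", "title", "GAMMA – A High Performance Dataflow Database"),
    ("Pub103", "venue", "VLDB"),
]

TITLE = "Implementation Techniques for Main Memory Databases"

# Query 2 from the deck: a four-way self-join over the triples table.
QUERY2_TRIPLES = """
SELECT C.object, D.object
FROM triples AS A, triples AS B, triples AS C, triples AS D
WHERE A.subject = B.object
  AND A.property = 'title' AND A.object = ?
  AND B.property = 'authorOf'
  AND B.subject = C.subject AND C.property = 'likes'
  AND C.subject = D.subject AND D.property = 'dislikes'
"""

# Assumed rewrite over the vertical schema: each property filter
# becomes a choice of table, so only the relevant tables are read.
QUERY2_VERTICAL = """
SELECT l.object, d.object
FROM "title" AS t, "authorOf" AS a, "likes" AS l, "dislikes" AS d
WHERE t.object = ?
  AND a.object = t.subject
  AND l.subject = a.subject
  AND d.subject = a.subject
"""


def load(conn):
    """Create the triples table plus one (subject, object) table per property."""
    conn.execute("CREATE TABLE triples (subject TEXT, property TEXT, object TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", TRIPLES)
    properties = [row[0] for row in
                  conn.execute("SELECT DISTINCT property FROM triples")]
    for prop in properties:
        # Property names come from the fixed list above, so quoting them
        # as identifiers is safe here.
        conn.execute(f'CREATE TABLE "{prop}" (subject TEXT, object TEXT)')
        conn.execute(
            f'INSERT INTO "{prop}" SELECT subject, object FROM triples '
            "WHERE property = ? ORDER BY subject",
            (prop,),
        )


def run_query2(conn):
    """Run Query 2 against both schemas and return both result lists."""
    over_triples = conn.execute(QUERY2_TRIPLES, (TITLE,)).fetchall()
    over_vertical = conn.execute(QUERY2_VERTICAL, (TITLE,)).fetchall()
    return over_triples, over_vertical


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load(connection)
    triples_rows, vertical_rows = run_query2(connection)
    print(triples_rows)
    print(vertical_rows)
```

  On this data only PersonID1 contributes a row, because PersonID2 has no likes value and the joins are inner joins. Nothing is stored for the missing value in the per-property tables, which is the "nulls not stored" advantage listed for the table-per-property approach.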