Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
The JTC1 SC32N1634 Graph Query Language: Towards a Unification of Graph Query Approaches David Silberberg [email protected] 443-778-6231 Outline Goals & Example Scenario Key Features of GQL Computational Complexity of Query Execution Future Directions 2 Heritage Style Viewgraphs Goals of the Graph Query Language (GQL) Project To unify disparate graph query approaches into a single, seamless, and declarative language Supports semantic search over graph data structures represented by schemas Supports traditional graph algorithms that systematically follow edges to discover interesting subgraphs (e.g., shortest path, minimal spanning tree, etc.) Supports metrics-oriented graphs algorithms (e.g., social network analysis, etc.) Supports special commands tailored to analysis of graphs Supports ontology-assisted query To quantify the scalability of this type of language 3 Heritage Style Viewgraphs Assumptions Data model is a typed graph that adheres to a schema Not XML – graphs tend to be more highly connected Not a semantic model – inference cannot, in general, be performed on the schema Data graphs can be large Query languages are only an abstract representation of questions The object is finding the right abstraction for the way people think about interacting with graphs Other query languages onto other data models will work – but do those languages help facilitate or hinder the formulation of those requests or the interpretation of the results? Algorithms are external to the graph management system There are too many algorithms New algorithms may be implemented or modified regularly We are not the experts in writing efficient algorithms 4 Heritage Style Viewgraphs Graph Interaction Methods Graph interactions take many forms Browse One-step-at-a-time exploration of a graph Semantic Schema-Based Search Several-steps-at-a-time graph query Algorithms Find subgraphs Calculate graph metrics Analysis Hypothesis expressions, etc. GQL is a declarative graph query language for integrating all these approaches! 6 Heritage Style Viewgraphs Example Scenario Farmer Jones' lettuce crop did well this year, but few other farmers did well. Why? First, find Farmer Jones. (Browsing) Jones 7 Heritage Style Viewgraphs Example Scenario Rabbits usually eat lettuce. Let's find the rabbits that ate Farmer Jones' lettuce. (Semantic Schema-Based Search) Prize Roman Jones Bugs Icy Harvey 8 Heritage Style Viewgraphs Example Scenario Let's look at all the farmers, and their locations, whose lettuce was eaten by fewer than 5 rabbits. (Semantic Schema-Based Search) Prize Roman Bugs Jones Leafy Harvey Icy Soft Smith Peter Crispy Smalltown, USA Green Harris Tasty 9 Heritage Style Viewgraphs Example Scenario What commonalities do the farmers have with each other and with the rabbits? (Semantic and Algorithmic Search) Acme Rent-aFox Prize Roman Red Bugs Leafy Jones Icy Harvey Soft Smith Sly Crispy Peter Smalltown, USA Green Harris Tasty 10 Heritage Style Viewgraphs Example Scenario If Fred fox ate Prize lettuce, what else would we learn? (Analysis-specific Methods, Semantic Search, and Algorithmic Search) Fox Enterprises Acme Rent-aFox Prize Fred Roman Brer Bugs Leafy Jones Icy Harvey Soft Smith Red Crispy Peter Smalltown, USA Green Harris Sly Tasty 11 Heritage Style Viewgraphs Outline Goals & Example Scenario Key Features of GQL Computational Complexity of Query Execution Future Directions 12 Heritage Style Viewgraphs Related Work Four categories of graph query languages and examples 1. Knowledge base (subject-predicate-object) query languages SPARQL, RQL, RAL, RDF Query Language 2. Graph reasoning query languages OWL-QL, GraphLog, Query and Inference Service for RDF 3. Query languages with graph operators GOQL GRAM 4. Graphical user interface query language QGRAPH 13 Heritage Style Viewgraphs Features of GQL that Support Analysis Schema-based graph query Returns a single graph or a set of graphs (not tables or XML files) Aliasing Graph exploration through wildcard search Embedded queries (helps achieve first order logic expressiveness) Creates new graph structures in query results Query over defined patterns (of activity or behavior, for example) Special commands tailored to analysis Hypothesis expressions Composite vertices (of vertices and edges) External algorithms that return graphs (e.g., shortest path) External algorithms that return metrics (e.g., social network analysis) Ontology-assisted graph query NEXT 14 Heritage Style Viewgraphs Example Graph Model time Chases Fox name Fox: fox1 name: George age: 3 Rabbit Eats age time name age time Eats Lettuce name Eats Carrot name time Chases: chases1 Rabbit: rabbit1 time: 2pm name: Peter age: 2 Eats: eats1 Eats: eats3 Lettuce: lettuce1 time: 8am name: PrizeLettuce Eats: eats4 Carrot: carrot1 time: 3pm Chases: chases2 time: 5pm Fox: fox2 name: Fred 15 Eats: eats2 age: 2 time: 9am Rabbit: rabbit2 name: Bugs age: 4 Rabbit: rabbit3 name: Jack age: 1 time: 7pm name: CarrotTop Eats: eats5 Carrot: carrot2 time: 7am name: BigCarrot Eats: eats6 Lettuce: lettuce2 time: 8am name: Icy Heritage Style Viewgraphs GQL Operators - Overview Basic Syntax SUBGRAPH clause Finds a subgraph in the source graph CONSTRAINT clause Filters the subgraph based on property constraints RETURN clause Describes the resulting graph or sets of graphs to return Syntax for analysis ASSUME clause Supports hypothesis statements PATTERN clause Defines search patterns BACK 16 Heritage Style Viewgraphs Simple Query that Returns a Single Graph SUBGRAPH Fox Chases Rabbit AND Fox Eats Rabbit CONSTRAINT Chases.Time < Eats.Time RETURN Fox Chases Rabbit AND Fox Eats Rabbit Fox: fox1 Chases: chases1 Rabbit: rabbit1 time: 2pm name: George age: 3 name: Peter age: 2 Eats: eats1 time: 3pm Type represents variable Motivated by languages like SQL In constrast to (Fox ?f1) 18 Heritage Style Viewgraphs Returning a Set of Graphs Can be done with edge expansion or joins in the RETURN clause Can be seamlessly integrated with non-graph expansion expressions Any query can be returned as a set of graphs if desired SUBGRAPH Fox Chases Rabbit RETURN Fox Chases# Rabbit Fox: fox1 Chases: chases1 Rabbit: rabbit1 result graph 1 time: 2pm name: George age: 3 Fox: fox1 name: Peter Chases: chases2 age: 2 Rabbit: rabbit2 result graph 2 time: 5pm name: George 19 age: 3 name: Bugs age: 4 BACK Heritage Style Viewgraphs Aliasing SUBGRAPH Fox ALIAS ChasingFox Chases Rabbit AND Fox ALIAS EatingFox Eats Rabbit CONSTRAINT ChasingFox.name <> EatingFox.name RETURN ChasingFox Chases Rabbit AND EatingFox Eats Rabbit If our graph had an additional edge in which George Fox chased Jack Rabbit at 8 a.m., the result would look like: Fox: fox1 name: George age: 3 Chases: chases3 time: 8am Eats: eats2 Fox: fox2 name: Fred 20 age: 2 time: 9am Rabbit: rabbit3 name: Jack age: 1 BACK Heritage Style Viewgraphs Embedded Queries Significant component of first order logic expressiveness To request the first fox that ate a rabbit, the following existential query is formulate: SUBGRAPH CONSTRAINT RETURN Fox Eats ALIAS E1 Rabbit NOT EXISTS (SUBGRAPH Fox Eats ALIAS E2 Rabbit CONSTRAINT E1.time > E2.time) Fox Eats Rabbit Eats: eats2 Fox: fox2 name: Fred age: 2 time: 9am Rabbit: rabbit3 name: Jack age: 1 BACK 21 Heritage Style Viewgraphs New Result Graph Structure Query SUBGRAPH RETURN Fox Eats Rabbit AND Rabbit Eats Lettuce Fox new(Ingests) Lettuce Fox: fox1 name: George Ingests: ingests1 age: 3 Lettuce: lettuce1 name: PrizeLettuce Fox: fox2 name: Fred age: 2 Ingests: ingests3 Lettuce: lettuce2 name: Icy BACK 22 Heritage Style Viewgraphs Hypothesis Expressions Enables queries on hypothetical data SUBGRAPH Fox Chases Rabbit AND Fox Eats Rabbit AND Rabbit Eats Lettuce CONSTRAINT Chases.time < ‘8am’ RETURN Fox new(Ingests) Lettuce ASSUME EDGE Chases [NEW time = ‘7am’] FROM Fox[CONSTRAINT name= ‘Fred’] TO Rabbit[CONSTRAINT name= ‘Jack’] Motivated by OWL-QL 23 BACK Heritage Style Viewgraphs Composite Vertices Composite vertices Composed of vertices and edges Contained vertices can be composite as well name Place location OccuredAt HuntingEvent time time Chases Fox Eats name Eats Lettuce name Eats Carrot name Rabbit age time name age time time 24 Heritage Style Viewgraphs Composite Vertex Queries - continued SUBGRAPH CONSTRAINT RETURN HuntingEvent OccuredAt Place AND HuntingEvent DIRECTLY CONTAINS Rabbit AND Rabbit Eats Lettuce Place.name = ‘Smith Game Park’ Rabbit Eats Lettuce time Eats Lettuce name Rabbit name age Addresses a subset of Harel's Higraphs Multiple hops CONTAINS or IS-CONTAINED-BY Feasible because of the hierarchy 25 BACK Heritage Style Viewgraphs Wildcard Queries SUBGRAPH Fox * ALIAS InterestingEdge Rabbit RETURN Fox InterestingEdge Rabbit Fox: fox1 name: George age: 3 Chases: chases1 Rabbit: rabbit1 time: 2pm name: Peter age: 2 Eats: eats1 time: 3pm Chases: chases2 time: 5pm Eats: eats2 Fox: fox2 name: Fred age: 2 time: 9am Rabbit: rabbit2 name: Bugs age: 4 Rabbit: rabbit3 name: Jack age: 1 One edge wildcard queries Multiple hops May be computationally expensive in a graph Can be handled by an external AllPath() algorithm 26 BACK Heritage Style Viewgraphs Pattern Definition Assigns names to interesting graph patterns Can be reused in multiple queries PATTERN Predator (Fox new(PreysUpon) Rabbit) = SUBGRAPH Fox Chases Rabbit AND Fox Eats Rabbit CONSTRAINT Chases.time < Eats.time RETURN Fox new(PreysUpon) Rabbit 27 Heritage Style Viewgraphs Pattern Use Query: SUBGRAPH RETURN Predator(Fox PreysUpon Rabbit) AND Rabbit Eats Lettuce Fox new(Ingests) Lettuce Is evaluated as if it were: SUBGRAPH CONSTRAINT RETURN Fox Chases Rabbit AND Fox Eats Rabbit AND Rabbit Eats Lettuce Chases.time < Eats.time Fox new(Ingests) Lettuce BACK 28 Heritage Style Viewgraphs External Graph Algorithms that Return Subgraphs Shortest Path SUBGRAPH RETURN GameWarden Chases Fox AND ShortestPath(Fox, Rabbit) ALIAS SP_alias AND Rabbit Eats Lettuce GameWarden Chases Fox AND SP_alias AND Rabbit Eats Lettuce Adjacent Vertices SUBGRAPH CONSTRAINT RETURN 29 AdjacentVertices(Rabbit) ALIAS AV_alias count_edges(Rabbit) > 10 AV_alias BACK Heritage Style Viewgraphs External Graph Algorithms that Return Metrics Centrality: Find the Foxes that eventually Eat the Rabbits, who play a central role in the garden activities SUBGRAPH CONSTRAINT RETURN Fox Eats Rabbit Centrality (Fox, Rabbit, Lettuce) > .8 Fox Eats Rabbit Clustering Coefficient: Find the Foxes that are likely to work together when Chasing Rabbits SUBGRAPH CONSTRAINT RETURN 30 Fox ALIAS Fox1 Chases Rabbit AND Fox ALIAS Fox2 Chases Rabbit ClusteringCoefficient (Fox1, Fox2) > .6 AND Fox1 <> Fox2 Fox Eats Rabbit Heritage Style Viewgraphs Some Issues with External Algorithms Algorithms do not filter results, they operate direction on the graph and tie into the rest of the results Algorithms need to return a set of graphs (or a graph under some circumstances) in a standard format Order of query execution No current way to refer to the result vertices and edges of algorithms that are not specifically identified in the query SUBGRAPH AdjacentVertices(Rabbit) ALIAS AV_alias CONSTRAINT ClusteringCoefficient (<Vertex1 ?>, <Vertex2 ?>) > .6 RETURN <Vertex1 ?> <Edge1 ?> Rabbit AND <Vertex2 ?> <Edge2 ?> Rabbit BACK 31 Heritage Style Viewgraphs Ontology Assisted Query Organism isA isA Animal isA Carnivore isA Wolf Ontology isA Chases Eats Herbivore Eats isA isA isA Fox Hare Sheep Vegetable isA Lettuce isA Carrot Mappings time Chases Fox name 32 Eats age time Rabbit name age time Eats Lettuce name Eats Carrot name Graph Schema time Heritage Style Viewgraphs Ontology-Assisted Query Result SUBGRAPH RETURN Carnivore Eats Herbivore AND Herbivore Eats Vegetable Carnivore new(Ingests) Vegetable Fox: fox1 name: George Ingests: ingests1 age: 3 Lettuce: lettuce1 name: PrizeLettuce Fox: fox2 name: Fred age: 2 Ingests: ingests3 Lettuce: lettuce2 name: Icy 33 Heritage Style Viewgraphs Some Issues of Ontology-Assisted Query Why not just have an ontology query language? Performance issues? Scaling issues? Capitalize on features that semantics bring to bear on a graph query language Semantic abstraction (e.g., subsumption, hierarchy) Use inference to create semantically consistent models Impose semantic on the graph model BACK 34 Heritage Style Viewgraphs Outline Goals & Example Scenario Key Features of GQL Computational Complexity of Query Execution Future Directions 35 Heritage Style Viewgraphs Query Optimization Query execution time is the key to success for any query language – GQL is no exception We apply relational database optimization techniques to graph queries Optimization issues Addressed query optimization on a per path-segment basis – yes Address path-segment ordering – initial thoughts Address the management of large amounts of intermediate results of a query – not yet Address incorporating external algorithms – not yet Address ontology elaboration performance – not yet 36 Heritage Style Viewgraphs Query Optimization Query plan representations are used to define query execution plans Query plan representations are constructed to optimize the query execution time Via graph algebra Via graph statistics to estimate query costs for each operation Query optimizer determines The best algorithm to execute each operation The best operation ordering to optimize overall query execution time 37 Heritage Style Viewgraphs Graph Statistics Estimating costs requires statistical knowledge of the graph We estimate the cost of the path segment operator One of the most common and costly operations Statistics that we initially considered useful: Vertex Cardinality: The number of vertices of type v is count(v) or just V. Vertex Edge Set Cardinality: The total number of edges e that emanate from all vertices of type v is count(ev) or just EV. Edge Cardinality: The number of edges of type e is count(e) or just E. Edge Distribution: The number of different vertex type pairs that edges of type e connect of just ED. Selectivity Factor: The percentage of vertices or edges that match a property constraint is sel(), where is the property constraint. Uniformity assumption Independence assumption 39 Heritage Style Viewgraphs Path Segment – Vertex Search, No Indices Algorithm Iterate through a set of vertices of type v in O(V) time For each vertex, iterate through its edge list to find edges of type e in O(EV/V) time Follow the edge to vertex w in constant time Execution time is O(V*(EV/V)) = O(EV) 40 Heritage Style Viewgraphs Path Segment – Indices on Vertex Edge Set Requires each edge set to be indexed through a logarithmic-time search tree (e.g., B+ tree) Next values are (virtually) collocated with the matching value Enables a constant time search for the next value(s) Algorithm Iterate through vertices of type v in time O(V) Find matching edge(s) in logarithmic time O(log(EV/V) Iterate through the matching edges in time O(E/EDV) Execution time is O(V * (log(EV/V) + E/EDV) ) = O(V*log(EV/V) + E/ED) If ED E (i.e., one edge of type e emanates from each v), then the algorithm tends to operate in time O(V*log(EV/V)) If ED E and EV V, the algorithm tends operate in time O(V) If ED E and EV >> V, the algorithm tends to operate in time O(V*log(EV)) If ED >> E, then the algorithm tends to operate in time O(E/ED) 41 Heritage Style Viewgraphs Path Segment – Edge Indices, Constraint Beneficial when the query includes a constraint v on an indexed property of vertices of type v Vertex edge sets are indexed as well Algorithm Logarithmic-time search through the indexed properties v in time O(log(V)) Iterate through vertices (collocated in the index) that satisfy the constraint in time O(sel(v)*V) Performs a logarithmic-time search on the edges of each matching vertex in time O(log(EV/V)) Iterate through the matching edges in time O(E/EDV) Execution time is O(log(V) + (sel(v)*V*(log(EV/V) + E/EDV)) ) = O(log(V) + sel(v)*V*log(EV/V) + sel(v)*E/ED) If sel(v) 0, the dominant factor is the search for vertices or O(log(V)) If the selectivity factor is higher, the execution time approaches the times of the previous slide 42 Heritage Style Viewgraphs Path Segment – Edge Search, No Indices Algorithm Iterate over edge types e and select those that connect v to w in time O(E) Find the corresponding vertices in constant time Execution time is O(E) 43 Heritage Style Viewgraphs Path Segment – Edge Search, Constraint Beneficial when the query statement includes a constraint e on an indexed property of edges of type e Algorithm Performs a logarithmic-time search through properties to find the first matching edge in time O(log(E)) Performs a linear search through all subsequent matching edges in time O(sel(e)*E) Find both vertices attached to each edge in constant time Execution time is O(log(E) + sel(e)*E) If sel(e) 0, the algorithm tends to an execution time of O(log(E)) Otherwise, the algorithm tends to an execution time of O(E) 44 Heritage Style Viewgraphs Varying Number of Vertices per Vertex Type Varying Number of Vertices per Vertex Type with Constraints 10000 Execution Time (ms) per 1000 Iterations high property constraint selectivity low property constraint selectivity 1000 100 10 Number of Vertices per Vertex Type 1 45 100 1000 10000 100 1000 10000 Algorithm 2 192 315 2713 184 342 2671 Algorithm 3 188 334 560 47 43 45 Algorithm 4 152 144 146 163 146 144 Algorithm 5 147 160 145 34 33 35 Heritage Style Viewgraphs Varying Number of Edges per Vertex Varying Number of Edges per Vertex with Constraints 1000 Execution Time (ms) per 1000 Iterations high property constraint selectivity low property constraint selectivity 100 10 Number of Edges per Vertex 1 46 5 10 50 100 500 1000 5 10 50 100 500 1000 Algorithm 2 356 320 289 298 275 284 348 329 291 292 289 277 Algorithm 3 27 25 78 77 82 81 9 10 15 22 75 80 Algorithm 4 71 67 66 68 65 66 70 68 65 75 63 65 Algorithm 5 13 20 70 72 73 71 8 8 9 10 21 21 Heritage Style Viewgraphs Varying Edge Types with Constraints Varying Number of Edge Types with Constraints 100000 high property constraint selectivity low property constraint selectivity Execution Time (ms) per 1000 Iterations 10000 1000 100 10 Number of Edge Types 1 47 1 10 100 1000 1 10 100 1000 Algorithm 2 349 187 192 32 323 181 183 32 Algorithm 3 352 204 183 39 15 11 13 12 Algorithm 4 29342 1709 137 17 30231 1779 157 19 Algorithm 5 146 148 146 22 10 10 10 10 Heritage Style Viewgraphs Path Segment Ordering Assume the following query SUBGRAPH Fox Chases Rabbit AND Rabbit Eats Lettuce CONSTRAINT Rabbit.age < 3 RETURN Fox new(Ingests) Lettuce Query processing produces the following query execution plan p Fox new (Ingests) Lettuce s Rabbit.age < 3 Eats Lettuce Fox Chases Rabbit 48 Heritage Style Viewgraphs Path Segment Execution Order Choice Which is more efficient? p Fox p Fox new Ingests Lettuce s Rabbit.age < 3 s Rabbit.age < 3 Eats Lettuce Fox Chases Rabbit 49 new Ingests Lettuce or Fox Chases Rabbit Eats Lettuce Heritage Style Viewgraphs Execution Order Heuristics In simple terms Identify the path segment operation that promises to return the least number of results Then identify the next operation that promises to return the next least number of results It is actually more complicated than this Need to search an exponential number of orderings to find the most efficient ordering Heuristics can make this search tractable 50 Heritage Style Viewgraphs Path-Segment Ordering Metric Order the path segment operators to return the fewest results Rough heuristic: If predicates v, e, and w are applied to V, E and W respectively Start with V and use selectivity factors to estimate execution time Execution time is: V * sel(v) * (E/EDV) * sel(e) * (WED/E) * sel(w) Or, sel(v) * sel(e) * sel(w) * W Use this formula to determine whether Fox Chases Rabbit should precede or follow Rabbit Eats Lettuce 51 Heritage Style Viewgraphs Outline Goals & Example Scenario Key Features of GQL Computational Complexity of Query Execution Future Directions 52 Heritage Style Viewgraphs Prototype Implementation Schedule Currently Implemented Schema search returning a single graph Pattern matching Aliasing Ontology assisted graph query Next to be implemented – within approximately 6 months Externally defined functions Wildcard search Hypothesis expressions Future Return a set of graphs (instead of a single graph) Embedded queries Return new graph structures in query results Composite vertices (of vertices and edges) Predefined patterns Query Optimization 53 Heritage Style Viewgraphs Future Work Relate GQL to a graphical interface Enables analysts to express queries through graphical means Can leverage several technologies (QGraph, Conceptual Graphs, etc.) Augment GQL to include Uncertainty, Geospatial and Temporal operators and data structures Address query optimization techniques Create a generic (as much as possible) back-end API to integrate with data sources Relational Different graph approaches 54 Heritage Style Viewgraphs