Graph Data Partition Models for Online Social Networks

Prima Chairunnanda, Simon Forsyth, Khuzaima Daudjee
David R. Cheriton School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada
{pchairun, swforsyt, kdaudjee}@uwaterloo.ca

ABSTRACT
Online social networks have become important vehicles for connecting people for work and leisure. As these networks grow, the data stored in them also grow, and managing these data becomes a challenge. Graph data models are a natural fit for representing online social networks but need to support distribution to allow the associated graph databases to scale while offering acceptable performance. We provide scalability by considering methods for partitioning graph databases and implement one within the Neo4j architecture based on distributing the vertices of the graph. We evaluate its performance in several simple scenarios and demonstrate that it is possible to partition a graph database without incurring significant overhead other than that required by network delays. We identify and discuss several methods to reduce the observed network delays in our prototype.

Categories and Subject Descriptors
H.2.4 [Systems]: Distributed Databases; E.2 [Data Structures]: Graphs and networks

General Terms
Design, Performance, Experimentation

Keywords
Distributed graph database, Graph representation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. HT’12, June 25–28, 2012, Milwaukee, Wisconsin, USA. Copyright 2012 ACM 978-1-4503-1335-3/12/06 ...$10.00.

1. INTRODUCTION
During the last decade, online social networks (OSNs) have moved to the forefront of the Internet. Based on a recent Internet traffic analysis [1], three of the ten most frequently visited websites are OSNs. In an OSN, users are connected to each other via edges. Edges can be undirected (e.g. “Friends” in Facebook) or directed (e.g. “Follows” in Twitter). How to store this information is central to any OSN, and is an active research area. Key-value storage systems and Relational Database Management Systems (RDBMS) appear to be favoured choices, as is evident from Facebook using Cassandra [6] and MySQL, and MySpace using Microsoft SQL Server. However, a graph can naturally represent many OSN constructs, with users and objects as vertices connected via edges. Furthermore, many services offered by an OSN are equivalent to traversing this graph. Listing a tweet’s followers is traversing all “Follows” edges in the reverse direction. Viewing a friend’s photo album can be seen as traversing all “UploadPhoto” and “TaggedIn” edges. Models that fit the data they represent are often easier to understand and are potentially more efficient. For example, a graph database can ensure that well-connected sub-graphs remain in the same partition on the assumption that they will be frequently accessed together. A graph database is therefore able to naturally provide bounds on the number of servers needed to provide a complete answer to many queries.

There have been many studies of distributed RDBMSs, key-value stores, and column-store databases, but graph DBMSs have received little attention. Among the current graph DBMSs, there are variations in how a graph is viewed. At the simplest level, a graph consists of vertices connected via edges. Clearly this does not suffice for real-life applications, as there need to be labels on the nodes and edges themselves. Neo4j adds the notion of edge type and properties to further describe nodes and edges. Relaxing the restriction on edges, HypergraphDB [5] allows a single edge to connect more than two vertices to express more complex semantics. Resource Description Framework (RDF) is yet another alternative, where information is encoded in triplets (subject, predicate, object). In essence, each triplet represents a directed edge between the subject and object. How the triplets are stored again varies among RDF databases: some store them as graphs (e.g. AllegroGraph), some as tuples in an underlying relational database (e.g. Virtuoso [4]), and a number of others use proprietary formats.

Regardless of the underlying physical representation of the data, a single centralized graph DBMS will quickly become a bottleneck. OSNs deal with huge amounts of data, potentially consisting of trillions of vertices and edges. Thus, methods to effectively scale graph DBMSs are needed to improve their utility to an OSN. One popular approach to overcome the bottleneck is to use multiple instances of a DBMS, each holding a shard of the database. Averbuch and Neumann [2] explored the problem of partitioning in the Neo4j graph database, but their experiments used an emulator relying on graph colouring. We demonstrate that there exist partition models that can be implemented with minimal computation overhead and remove physical limits from the size of the graph. Moreover, such models provide potential for load-balancing and increased parallelism for queries that do not require access to the entire graph. Our implementation, called PNeo4j, extends the Neo4j graph database to support partitioning. Pregel [7] introduces a computational model for distributed graph traversal, but does not specifically address the challenges of partitioning the graph in the first place.

This paper is organized as follows: we first visit important design decisions we made for PNeo4j in Section 2, followed by our specific implementation details in Section 3. We then present our experimental results in Section 4, and finally we conclude in Section 5.

2.2.4 Discussion
As the goal is to create a scalable shared-nothing database, we examined each method in terms of perceived scalability and the amount of data that must be shared. For the case of vertex partitioning, there is no physical limit on graph size. However, since an edge that crosses a shard boundary must be accessible from both sides, at the minimum it must be duplicated on the two partitions hosting the endpoint vertices. For edge partitioning, the maximum size is limited by the largest overlay graph consisting of one edge type. Edge partitioning also duplicates vertices that have edges not located on the same server. Finally, property partitioning requires the entire graph to be contained within a single machine, limiting graph size. Partitioning the properties does allow for the least amount of required duplication. These options are not mutually exclusive and all of them may be used within the context of a single graph, though with an increase in complexity for database design and development.

We chose to implement partitioning across vertices so as to remove the physical limit on graph size for all graphs. However, we note that edge partitioning has a potentially useful property. The partitions generated from edge partitioning may naturally be load-balanced because vertices will likely be present in multiple partitions. As a consequence, queries involving such a vertex can be answered by any of those partitions, potentially speeding up traversals since the initial vertex is likely to be on the originating server. When queries mainly involve only one edge type, partitioning by edge type might look attractive as cross-partition traversals can be avoided.
However, the server hosting that edge type could become a hotspot.

When seen from an OSN perspective, vertex partitioning also has an additional advantage. The types of queries operating on an OSN are usually a form of traversal from a particular starting vertex, i.e. they exhibit spatial locality. After a particular vertex is visited, each of its neighbouring vertices will have an increased chance to be accessed next. It is, therefore, beneficial to have neighbours hosted on the same machine. Several systems, such as SPAR [11], exploit this behaviour in their replication design for OSNs.

The requirement of spatial locality becomes much more important when more OSN entities are represented as nodes in the graph. For instance, the school a user goes to can either be stored as a property of the user, or as a node with inward edges of type “AttendSchool” coming from the students of the school. The latter representation has the advantage that the graph is more complete and can be interpreted on its own, as edges explicitly indicate a connection, while the interpretation of property values is application dependent. This has a direct consequence for the length of the traversal path. Let us consider the query to find the school attended by the largest number of a user’s friends. When school is represented by a property and friendship by an edge, the query involves traversing an edge, then querying a node’s property. The query will instead traverse two edges if both are represented as edges.

If there is an edge connecting two vertices and the vertices are located in different subgraphs, the edge is a crossing-edge. We assume assignment of vertices to partitions should be chosen to minimize network traffic and it is therefore beneficial to minimize the number of crossing-edges. As the number of crossing-edges increases, the probability that a traversal must cross a partition boundary increases, incurring additional network costs.
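The crossing-edge definition above can be made concrete with a short sketch. The function name `count_crossing_edges`, the dictionary-based vertex assignment, and the four-vertex example graph are illustrative assumptions, not part of PNeo4j.

```python
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]  # (source vertex, destination vertex)


def count_crossing_edges(edges: Iterable[Edge], partition_of: Dict[str, str]) -> int:
    """Count edges whose two endpoints are assigned to different partitions."""
    return sum(1 for src, dst in edges if partition_of[src] != partition_of[dst])


# Hypothetical example graph with four vertices and four directed edges.
edges = [("V1", "V2"), ("V2", "V3"), ("V3", "V1"), ("V3", "V4")]

# Two candidate assignments of the same vertices to partitions P1 and P2.
assignment_a = {"V1": "P1", "V2": "P1", "V3": "P2", "V4": "P2"}
assignment_b = {"V1": "P1", "V2": "P1", "V3": "P1", "V4": "P2"}

crossing_a = count_crossing_edges(edges, assignment_a)
crossing_b = count_crossing_edges(edges, assignment_b)
```

Under these assumptions the second assignment keeps the V1, V2, V3 cycle on one server, so fewer traversals would need a network hop, at the price of a less even vertex distribution.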
There have been studies on using graph min-cut algorithms to minimize the number of crossing-edges, e.g. [2][3]. However, we are specifically interested in the behaviour at partition boundaries, and so manually cut the graphs to ensure our traversals cross par- DESIGN 2.1 System Architecture We use the following system model assumptions as a basis for evaluating partitioning techniques. We are not interested in faulttolerance for the initial implementation but are interested in achieving efficiency. Therefore, we attempt to minimize the amount of information sharing in the design. Clients and their queries are assumed to run in the context of a single server. The client will not start a transaction on one partition and complete it on another, instead the first server contacted as part of a transaction will be responsible for all queries that make up the transaction. We assume that a method exists for the client to find the initial server. The servers hosting the partitions can have knowledge of all other servers. We do not require all servers to be available to function, but do require the server hosting an object to be available when that object is accessed. 2.2 Partition Model We consider three methods for partitioning a graph database: across vertices, across edges, and among the properties associated with the edges and/or vertices in the graph. 2.2.1 Vertex Partitioning The most studied method [2][7][8] for splitting a graph is cutting the graph. That is, the graph is partitioned into subgraphs, where each vertex belongs to exactly one subgraph. The subgraphs then become shards of the original graph. 2.2.2 Edge Partitioning Another possibility is to split the graph into subgraphs by their edges. Each partition contains a subset of the edges from the original graph, with the total graph being reconstructed from all the partitions. In some applications, edges may have types to give more semantics to the relationship. 
To give a concrete example, in Facebook, two friends are connected by a “Friend” edge, while a user and an event are connected by an “Attending” edge. For these graphs, it is also reasonable to partition the graph into subgraphs, each containing the edges and vertices for a unique set of edge types.

2.2.3 Property Partitioning
The last possibility has been considered by Neo Technology for version 2 of Neo4j [10]. Vertices and edges require relatively little storage space, but properties have arbitrary length and so may require significantly more space, limiting the maximum graph size. By storing the properties in a separate key-value store and entering the much shorter keys into the graph database, a single server may store a bigger graph. However, as the entire graph and the property keys must still fit onto a single server, the number of vertices and edges is still limited by the storage capacity of a single server.

Identifying a ghost vertex is also a challenge. For example, consider the situation of V1 connected to V2 as above. Later, say, a new edge is to be created between another vertex V3, owned by P2, and vertex V1. In this case, P2 can either create a new ghost vertex or reuse its existing ghost for V1. The first option increases the required storage and makes avoiding duplicate vertices during traversal more difficult, while reuse complicates edge creation since a check must be made to see if the vertex already exists, costing additional time or space.

2.3.2 Dangling Edge Model
Instead of materializing ghost vertices, we can leave one end of the crossing-edge unconnected (Figure 1(c)). To achieve this, we require a method to identify vertices that are not part of the current partition. A simple method is to incorporate an identifier for the partition each vertex is assigned to as part of the vertex ID and then use that ID to identify if the vertex is local.
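The ID-based locality check just described can be sketched as follows. The 16-bit partition field in the most significant bits of a 64-bit ID, and the reserved partition value 0, follow the layout this paper gives in Section 3.2. The function names, the unsigned arithmetic, and the use of Python rather than Java are illustrative assumptions and not PNeo4j's code.

```python
PID_BITS = 16                       # partition identifier width (Section 3.2)
LOCAL_BITS = 64 - PID_BITS          # remaining bits hold the per-partition ID
LOCAL_MASK = (1 << LOCAL_BITS) - 1


def make_gid(pid: int, local_id: int) -> int:
    """Pack a partition ID and a per-partition ID into one global ID.

    Treated as an unsigned 64-bit value here; in a signed Java long the
    same bit pattern is negative when the top bit of the PID is set.
    """
    if not 0 < pid < (1 << PID_BITS):
        # PID 0 is rejected because the paper reserves it for plain Neo4j IDs.
        raise ValueError("pid must be in 1..65535")
    if not 0 <= local_id <= LOCAL_MASK:
        raise ValueError("local_id does not fit in the lower 48 bits")
    return (pid << LOCAL_BITS) | local_id


def pid_of(gid: int) -> int:
    """Extract the owning partition from a global ID."""
    return gid >> LOCAL_BITS


def local_id_of(gid: int) -> int:
    """Extract the per-partition ID from a global ID."""
    return gid & LOCAL_MASK


def is_local(gid: int, my_pid: int) -> bool:
    """True if the vertex named by gid is owned by the partition my_pid."""
    return pid_of(gid) == my_pid


gid = make_gid(2, 12345)
```

With this encoding, deciding whether the far end of an edge is local needs only a shift and a comparison, with no lookup table.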
A separate lookup table could be used, in which case, since most vertices are expected to be internal to the partition, a flag should be set within the edge to avoid making many failed lookups. To avoid duplication of data, one of the two dangling edges is demoted to a ghost edge. Like a ghost vertex, a ghost edge stores only the location and ID of the actual edge, and all requests are forwarded to the remote partition. In PNeo4j, we always choose the incoming dangling edge to be the ghost edge. Due to this characteristic, it might be more beneficial to choose the semantic of the edge so that it mimics the natural order of traversal between the vertices, avoiding traversal of the ghost edges. For instance, supposing the query is to find all events created by a particular user, having the edges point outward from the user to the events (denoting a “Creates” semantic) might yield some performance benefit compared to the edges pointing inward (denoting a “Created by” semantic). By removing the ghost vertices, we eliminate the storage cost of a crossing-edge to one extra edge. More importantly, we never have to synchronize vertices between partitions since they are not shared. By not storing properties at the ghost edge, we restrict synchronization to edge deletion. V2 (c) P1 P2 V1 V2 SS (d) SS P1 P2 Figure 1: Visualization of crossing-edge representations. (a) shows the original graph, while (b), (c), and (d) show the graph partitioned over P1 and P2, for the Ghost Vertex, Dangling Edge, and Super Source/Sink models respectively. Dotted lines indicate ghost objects. In (d), the vertex marked “SS” is the Super Source/Sink. titions. We defer the study of different min-cut algorithms for our system to other work. 2.3 Crossing-Edge Representation We consider three different kinds of representations to model a crossing-edge in our system. We discuss the advantages and disadvantages of each model as well as the additional cost imposed by the model. 
All examples in this subsection assume a crossing-edge connecting two vertices V1 and V2 located in partitions P1 and P2, respectively. The original graph is shown in Figure 1(a). Without loss of generality, we assume the edge is directed from V1 to V2.

2.3.3 Super Source/Sink Model
The Dangling Edge model has the problem that a partitioned graph is not actually a graph on any of the servers, as some endpoint vertices are missing. One possible solution to this is to materialize a single vertex that is used as a marker for all crossing-edges, as shown in Figure 1(d). Borrowing from flow network terminology, we call this a Super Source/Sink (SuperSS) vertex, because it consumes all incoming and outgoing flows to/from crossing-edges. The additional cost to represent a crossing-edge is thus only one extra edge, since only one SuperSS is created regardless of the number of crossing-edges in the partition.

This solution has a serious pitfall. To modify an edge, both vertices must be locked to ensure that a vertex is not deleted and to prevent two concurrent edge additions from leaving one edge unassociated with the vertex. The SuperSS vertex is connected to all edges that have one end in a remote partition. It follows that any change to any crossing-edge in the partition would require locking the SuperSS vertex, greatly reducing the concurrency of the system and increasing the potential for deadlock for transactions that cross partitions. Therefore, such an option is not advisable for a graph database.

2.3.1 Ghost Vertex Model
In the Ghost Vertex model, we create a ghost vertex in P1 to represent V2, as depicted in Figure 1(b). A ghost vertex is a vertex with a single property containing the location of the real vertex. The edge connecting the two vertices is duplicated across both partitions. In essence, V1 is now connected to the ghost of V2 (denoted as V2'). On the P2 side, another ghost vertex V1' is created to represent V1, together with an edge from V1' to V2.
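A minimal sketch of this bookkeeping follows, assuming partitions are plain in-memory objects. The `Partition` class, the `ghost_id` naming with a trailing prime, and the choice to reuse an existing ghost are all hypothetical and do not reflect any Neo4j or PNeo4j API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Partition:
    name: str
    vertices: Dict[str, dict] = field(default_factory=dict)     # id -> properties
    edges: List[Tuple[str, str]] = field(default_factory=list)  # (src, dst)


def ghost_id(vertex_id: str) -> str:
    """Name the ghost of a vertex by appending a prime, as in Figure 1(b)."""
    return vertex_id + "'"


def add_crossing_edge(src: str, src_part: Partition, dst: str, dst_part: Partition) -> None:
    """Record the edge src -> dst in both partitions using ghost vertices."""
    # Source side: the real source points at a ghost of the destination.
    # setdefault reuses a ghost that already exists (one of two possible policies).
    src_part.vertices.setdefault(ghost_id(dst), {"location": dst_part.name})
    src_part.edges.append((src, ghost_id(dst)))
    # Destination side: a ghost of the source points at the real destination.
    dst_part.vertices.setdefault(ghost_id(src), {"location": src_part.name})
    dst_part.edges.append((ghost_id(src), dst))


def is_valid_graph(part: Partition) -> bool:
    """Check that every edge endpoint is a vertex stored in this partition."""
    return all(s in part.vertices and d in part.vertices for s, d in part.edges)


p1 = Partition("P1", {"V1": {}})
p2 = Partition("P2", {"V2": {}})
add_crossing_edge("V1", p1, "V2", p2)
```

In this sketch each ghost carries only a location property, so any other request about the vertex would have to be forwarded to the partition named there.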
A major advantage of this model is that all partitions contain valid graphs. This model can thus be implemented as middleware on top of any single-machine graph database without modifying the underlying system. When a client arrives at a ghost vertex, the middleware will return a proxy which will contact the remote partition for any requests related to that vertex. However, a middleware approach is typically slower than a more intrusive one because the middleware needs to do additional processing on top of the regular database access.

3. IMPLEMENTATION

Table 1: Time to traverse one edge 1,000,000 times (ms)

                          |            Interpartition
Database | Intrapartition | Local source | Local dest | Fully remote
Neo4j    | 1130           | -            | -          | -
PNeo4j   | 1157           | 31068        | 77646      | 166189

3.1 Neo4j
Neo4j¹ is an open source graph DBMS specialized for high-speed graph traversal. It supports ACID transactions at a read-committed isolation level and provides for manual locking to allow the user to achieve higher levels of isolation. Transactions are logged to ensure durability and the system provides some consistency guarantees, such as preventing connections to non-existing vertices. In addition, it supports a High-Availability mode [9], in which the database is fully replicated across several systems. One system is designated as the master. The other systems are slaves and may have stale copies of the data. Writes are supported on slaves, but they synchronize with the master using the two-phase commit protocol on every write. Still, such replication does not completely address the scaling out of the database, because the total amount of data is limited by the storage capacity of the smallest server.

partitions involved in a transaction after the first will not contact any other partition with respect to that transaction. Because the global transaction IDs are hidden from the client and because none of the functions in our current design create recursive calls between partitions, this assumption is valid for our implementation.
As a concrete example, when we are traversing a crossing-edge from a remote partition P2 to another remote partition P3, the originating partition P1 will first get information about the edge from P2, determine that it crosses to another partition P3, then contact P3 to continue the traversal. Consequently, the originating server will act as the coordinator for the two-phase commit. Note that PNeo4j therefore imposes no connectivity requirement beyond that of two-phase commit: participants need to be contactable only from the coordinator, not from one another. While our implementation does not handle failure of the originating server, additional logging on the remote partitions would address this issue.

3.2 Vertex/Edge Identifier
In Neo4j, each object is uniquely identified by an identifier (ID) generated by the system. Once assigned, an object’s ID will not change, but if the object is deleted, its ID may be reassigned to a new object. As of version 1.3, Neo4j supports 2^34 vertices and 2^35 edges. Neo4j uses the long Java data type for an ID, theoretically allowing 2^64 unique objects. The upper bits of the vertex ID space are unused and always 0. Our implementation uses the most significant 16 bits of an ID to record the partition that owns a vertex.

Each partition in PNeo4j is assigned a unique 16-bit partition identifier (PID). When the PID is present in the object’s ID, the ID is a global identifier (GID), and it ensures the object is uniquely identifiable across all partitions. The PID value of 0 is reserved to preserve compatibility with Neo4j. The PID is not persisted to disk. Instead, when an object is loaded, PNeo4j adds the PID to create the GID of the object.

Internally, Neo4j assigns an ID to properties attached to objects. As this ID is not exposed to clients, we do not need to modify Neo4j’s handling of property IDs.

4. PERFORMANCE EVALUATION

4.1 Test Environment
All experiments were performed on a single computer running Linux, with each partition hosted on a different network port. Latency averaged 0.01 ms between partitions. The baseline Neo4j system used for comparison is version 1.3M03, and our modifications were made to the slightly newer 1.3M04 milestone, released two weeks later.

4.2 Cross-Partition Traversal
Averbuch and Neumann [2] performed one test on their implementation of a partitioned graph database: traverse one edge in a two-vertex database 1,000,000 times. For their remaining experiments they used an emulator that kept a single partition and marked virtual partitions with a colouring property. As part of our motivation was to see if the performance problem they observed was surmountable, we repeated their experiment with our implementation.

We test all possible arrangements of the source and destination vertices. Both the source vertex and the destination vertex can be either local to the partition that initiates the transaction, or remote from it. Since a remote source in our design implies that the edge is also remote, the cases are ordered in the table by increasing remoteness. Each trial was run with a five-second warm-up period and each case presents the mean of five runs. We also show the results of the same test with an unmodified copy of Neo4j (with both vertices local by definition).

The results in Table 1 show that increasing the quantity of remote information increases the time required for the traversal. Indeed, as desired, the costs reflect the number of network messages that need to be sent for each case: zero for fully local, one for a remote destination (to access the remote vertex), and two for a remote source (one to access the remote vertex and one to access the remote edge). Importantly, the overhead in the purely local case is minimal.
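To put the Table 1 totals on a per-traversal scale, the arithmetic can be sketched as below. The millisecond totals are copied from Table 1. The dictionary layout and variable names are assumptions made for illustration, and the derived figures are simple ratios, not additional measurements.

```python
TRAVERSALS = 1_000_000

# Total time in ms for 1,000,000 traversals, as reported in Table 1.
table1_ms = {
    ("Neo4j", "intrapartition"): 1130,
    ("PNeo4j", "intrapartition"): 1157,
    ("PNeo4j", "local source"): 31068,
    ("PNeo4j", "local dest"): 77646,
    ("PNeo4j", "fully remote"): 166189,
}

# Mean cost of a single traversal in microseconds (1 ms = 1000 us).
per_traversal_us = {case: ms * 1000 / TRAVERSALS for case, ms in table1_ms.items()}

# Relative overhead of PNeo4j over unmodified Neo4j when everything is local.
neo4j_local = table1_ms[("Neo4j", "intrapartition")]
pneo4j_local = table1_ms[("PNeo4j", "intrapartition")]
local_overhead_pct = 100 * (pneo4j_local - neo4j_local) / neo4j_local

# Slowdown of each PNeo4j case relative to the PNeo4j local case.
slowdown = {
    case[1]: ms / pneo4j_local
    for case, ms in table1_ms.items()
    if case[0] == "PNeo4j"
}
```

Read this way, the purely local case costs on the order of a microsecond per traversal in both systems, while every case that involves a remote partition is slower by more than an order of magnitude.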
This is in contrast to results in [2], which showed a significant performance penalty even for fully local traversals, attributed partly to the increased software stack in their implementation.

3.3 Partition Policy
In the interest of avoiding extra complexity, all vertices are created in the partition that receives the request. If a user wants to create a vertex in a specific partition, that partition must be contacted. This is sufficient to let us create partitioned graphs to test partition traversals.

Because we include the partition identifier as part of the ID, an automatic repartitioning scheme could generate errors for missing vertices, as a vertex’s ID could change after a read since the only isolation level we support is read committed. This problem can be avoided by implementing additional isolation levels (a read lock is required to ensure that a vertex does not move during a transaction).

3.4 Transactions
A single transaction may require operations across multiple partitions. Our implementation generates a global transaction ID for a transaction the first time a partition contacts another partition. This ID is then used in all further communication related to that transaction. The two-phase commit protocol is employed to provide consistency across partitions.

We assume that all operations related to a transaction will originate from the partition that created the transaction. That is, all

1 http://neo4j.org

5. CONCLUSIONS
We examined three methods for partitioning a graph database and identified vertex-based partitioning as the only one that does not impose a storage limit on scalability. Of the three methods presented for implementing vertex-based partitioning, the dangling edge model would have the least overhead associated with edge modification. We implemented the dangling-edge scheme in a graph database. Our tests show that performance within a single partition is maintained, and performance is affected by the network overhead associated with communication between partitions. For scenarios where spatial locality is observed, such as traversals within a close group of friends in OSNs or route-finding in a road network on the assumption that most desired routes are local, traversals are unlikely to cross multiple partitions, and thus PNeo4j incurs only a minimal performance penalty.

Optimizations such as those in [8] may reduce the cross-partition performance hit caused by network communication. Our current implementation does not ship traversal processing to servers other than the one contacted by the client, generating network traffic proportional to the number of vertices not present on that server. Methods to ship processing of the traversal to each remote server are expected to improve performance.

6. REFERENCES
[1] Alexa. Alexa top 500 global sites. http://www.alexa.com/topsites.
[2] A. Averbuch and M. Neumann. Partitioning graph databases. Technical report, 2010.
[3] C. Curino, E. Jones, Y. Zhang, and S. Madden. Schism: a workload-driven approach to database replication and partitioning. Proc. VLDB Endow., September 2010.
[4] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In T. Pellegrini, S. Auer, K. Tochtermann, and S. Schaffert, editors, Networked Knowledge - Networked Media, volume 221 of Studies in Computational Intelligence, pages 7–24. Springer Berlin / Heidelberg, 2009.
[5] B. Iordanov. HyperGraphDB: a generalized graph database. In Proceedings of the 2010 international conference on Web-age information management, WAIM’10, pages 25–36, Berlin, Heidelberg, 2010. Springer-Verlag.
[6] A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev., April 2010.
[7] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. SIGMOD ’10, 2010.
[8] V. Muntés-Mulero, N. Martínez-Bazán, J.-L. Larriba-Pey, E. Pacitti, and P. Valduriez. Graph partitioning strategies for efficient BFS in shared-nothing parallel systems. WAIM’10, 2010.
[9] Neo Technology. 7.1 architecture. http://docs.neo4j.org/chunked/stable/ha-architecture.html.
[10] Neo Technology. Roadmap. http://wiki.neo4j.org/content/Roadmap.
[11] J. M. Pujol, V. Erramilli, G. Siganos, X. Yang, N. Laoutaris, P. Chhabra, and P. Rodriguez. The little engine(s) that could: scaling online social networks. In Proceedings of the ACM SIGCOMM 2010 conference, SIGCOMM ’10, pages 375–386, New York, NY, USA, 2010. ACM.