Graph Data Partition Models for Online Social Networks
Prima Chairunnanda, Simon Forsyth, Khuzaima Daudjee
David R. Cheriton School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada
{pchairun, swforsyt, kdaudjee}@uwaterloo.ca
ABSTRACT

Online social networks have become important vehicles for connecting people for work and leisure. As these networks grow, the data stored over them also grow, and management of these data becomes a challenge. Graph data models are a natural fit for representing online social networks but need to support distribution to allow the associated graph databases to scale while offering acceptable performance. We provide scalability by considering methods for partitioning graph databases and implement one within the Neo4j architecture based on distributing the vertices of the graph. We evaluate its performance in several simple scenarios and demonstrate that it is possible to partition a graph database without incurring significant overhead other than that required by network delays. We identify and discuss several methods to reduce the observed network delays in our prototype.

Categories and Subject Descriptors
H.2.4 [Systems]: Distributed Databases; E.2 [Data Structures]: Graphs and networks

General Terms
Design, Performance, Experimentation

Keywords
Distributed graph database, Graph representation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
HT'12, June 25–28, 2012, Milwaukee, Wisconsin, USA.
Copyright 2012 ACM 978-1-4503-1335-3/12/06 ...$10.00.

1. INTRODUCTION

During the last decade, online social networks (OSNs) have risen to the forefront of the Internet. According to a recent Internet traffic analysis [1], three of the ten most frequently visited websites are OSNs. In an OSN, users are connected to each other via edges. Edges can be undirected (e.g. "Friends" in Facebook) or directed (e.g. "Follows" in Twitter).

How to store this information is central to any OSN, and is an active research area. Key-value storage systems and Relational Database Management Systems (RDBMSs) appear to be the favoured choices, as is evident from Facebook using Cassandra [6] and MySQL, and MySpace using Microsoft SQL Server. However, a graph can naturally represent many OSN constructs, with users and objects as vertices connected via edges. Furthermore, many services offered by an OSN are equivalent to traversing this graph. Listing a Twitter user's followers is traversing all "Follows" edges in the reverse direction. Viewing a friend's photo album can be seen as traversing all "UploadPhoto" and "TaggedIn" edges.

Models that fit the data they represent are often easier to understand and are potentially more efficient. For example, a graph database can ensure that well-connected subgraphs remain in the same partition on the assumption that they will be frequently accessed together. A graph database is therefore able to naturally provide bounds on the number of servers needed to provide a complete answer to many queries. There have been many studies of distributed RDBMSs, key-value stores, and column-store databases, but graph DBMSs have received little attention.

Among current graph DBMSs, there are variations in how a graph is viewed. At the simplest level, a graph consists of vertices connected via edges. Clearly this does not suffice for real-life applications, as there need to be labels on the nodes and edges themselves. Neo4j adds the notion of edge types and properties to further describe nodes and edges. Relaxing the restriction on edges, HypergraphDB [5] allows a single edge to connect more than two vertices to express more complex semantics. The Resource Description Framework (RDF) is yet another alternative, where information is encoded in triples (subject, predicate, object). In essence, each triple represents a directed edge between the subject and object. How the triples are stored again varies among RDF databases: some store them as graphs (e.g. AllegroGraph), some as tuples in an underlying relational database (e.g. Virtuoso [4]), and a number of others use proprietary formats.

Regardless of the underlying physical representation of the data, employing a single centralized graph DBMS will quickly become a bottleneck. OSNs deal with huge amounts of data, potentially consisting of trillions of vertices and edges. Thus, methods to effectively scale graph DBMSs are needed to improve their utility to an OSN.

One popular approach to overcome the bottleneck is to use multiple instances of a DBMS, each holding a shard of the database. Averbuch and Neumann [2] explored the problem of partitioning in the Neo4j graph database, but their experiments used an emulator relying on graph colouring. We demonstrate that there exist partition models that can be implemented with minimal computation overhead and remove physical limits on the size of the graph. Moreover, such models provide potential for load-balancing and increased parallelism for queries that do not require access to the entire graph. Our implementation, called PNeo4j, extends the Neo4j graph database to support partitioning. Pregel [7] introduces a computational model for distributed graph traversal, but does not specifically address the challenges of partitioning the graph in the first place.

This paper is organized as follows: we first visit the important design decisions we made for PNeo4j in Section 2, followed by our specific implementation details in Section 3. We then present our experimental results in Section 4, and finally we conclude in Section 5.
2. DESIGN

2.1 System Architecture

We use the following system model assumptions as a basis for evaluating partitioning techniques. We are not interested in fault-tolerance for the initial implementation but are interested in achieving efficiency. Therefore, we attempt to minimize the amount of information sharing in the design. Clients and their queries are assumed to run in the context of a single server. The client will not start a transaction on one partition and complete it on another; instead, the first server contacted as part of a transaction will be responsible for all queries that make up the transaction. We assume that a method exists for the client to find the initial server.

The servers hosting the partitions can have knowledge of all other servers. We do not require all servers to be available to function, but do require the server hosting an object to be available when that object is accessed.

2.2 Partition Model

We consider three methods for partitioning a graph database: across vertices, across edges, and among the properties associated with the edges and/or vertices in the graph.

2.2.1 Vertex Partitioning

The most studied method [2][7][8] for splitting a graph is cutting the graph. That is, the graph is partitioned into subgraphs, where each vertex belongs to exactly one subgraph. The subgraphs then become shards of the original graph.

2.2.2 Edge Partitioning

Another possibility is to split the graph into subgraphs by their edges. Each partition contains a subset of the edges from the original graph, with the total graph being reconstructed from all the partitions. In some applications, edges may have types to give more semantics to the relationship. To give a concrete example, in Facebook, two friends are connected by a "Friend" edge, while a user and an event are connected by an "Attending" edge. For these graphs, it is also reasonable to partition the graph into subgraphs, each containing the edges and vertices for a unique set of edge types.

2.2.3 Property Partitioning

The last possibility has been considered by Neo Technology for version 2 of Neo4j [10]. Vertices and edges require relatively little storage space, but properties have arbitrary length and so may require significantly more space, limiting the maximum graph size. By storing the properties in a separate key-value store and entering the much shorter keys into the graph database, a single server may store a bigger graph. However, as the entire graph and the property keys must still fit onto a single server, the number of vertices and edges is still limited by the storage capacity of a single server.

2.2.4 Discussion

As the goal is to create a scalable shared-nothing database, we examined each method in terms of perceived scalability and the amount of data that must be shared. For vertex partitioning, there is no physical limit on graph size. However, since an edge that crosses a shard boundary must be accessible from both sides, at the minimum it must be duplicated on the two partitions hosting the endpoint vertices. For edge partitioning, the maximum size is limited by the largest overlay graph consisting of one edge type. Edge partitioning also duplicates vertices that have edges not located on the same server. Finally, property partitioning requires the entire graph to be contained within a single machine, limiting graph size. Partitioning the properties does, however, allow for the least amount of required duplication.

These options are not mutually exclusive, and all of them may be used within the context of a single graph, though with an increase in complexity for database design and development.

We chose to implement partitioning across vertices so as to remove the physical limit on graph size for all graphs. However, we note that edge partitioning has a potentially useful property. The partitions generated from edge partitioning may naturally be load-balanced because vertices will likely be present in multiple partitions. As a consequence, queries involving such a vertex can be answered by any of those partitions, potentially speeding up traversals since the initial vertex is likely to be on the originating server. When queries mainly involve only one edge type, partitioning by edge type might look attractive, as cross-partition traversals can be avoided. However, the server hosting that edge type could become a hotspot.

When seen from an OSN perspective, vertex partitioning also has an additional advantage. The types of queries operating on an OSN are usually a form of traversal from a particular starting vertex, i.e. they exhibit spatial locality. After a particular vertex is visited, each of its neighbouring vertices has an increased chance of being accessed next. It is, therefore, beneficial to have neighbours hosted on the same machine. Several systems, such as SPAR [11], exploit this behaviour in their replication design for OSNs.

The requirement of spatial locality becomes much more important when more OSN entities are represented as nodes in the graph. For instance, the school a user attends can either be stored as a property of the user, or as a node with inward edges of type "AttendSchool" coming from the students of the school. The latter representation has the advantage that the graph is more complete and can be interpreted on its own, as edges explicitly indicate a connection, while the interpretation of property values is application-dependent. This has a direct consequence for the length of the traversal path. Consider the query to find the school attended by the largest number of a user's friends. When school is represented by a property and friendship by an edge, the query involves traversing an edge, then querying a node's property. The query will instead traverse two edges if both are represented as edges.

If there is an edge connecting two vertices and the vertices are located in different subgraphs, the edge is a crossing-edge. We assume the assignment of vertices to partitions should be chosen to minimize network traffic, and it is therefore beneficial to minimize the number of crossing-edges. As the number of crossing-edges increases, the probability that a traversal must cross a partition boundary increases, incurring additional network costs.

There have been studies on using graph min-cut algorithms to minimize the number of crossing-edges, e.g. [2][3]. However, we are specifically interested in the behaviour at partition boundaries, and so we manually cut the graphs to ensure our traversals cross partitions. We defer the study of different min-cut algorithms for our system to other work.
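The min-cut objective discussed above can be stated concretely: given an assignment of vertices to partitions, count the edges whose endpoints fall in different partitions. The Python sketch below shows the quantity such algorithms minimize; the names and data layout are illustrative assumptions, not PNeo4j code.

```python
# Hypothetical sketch: the crossing-edge count that min-cut partitioners
# (e.g. as used in [2][3]) try to minimize. 'edges' is a list of (u, v)
# vertex pairs; 'partition_of' maps each vertex to its partition ID.

def count_crossing_edges(edges, partition_of):
    """Count edges whose endpoints live in different partitions."""
    return sum(1 for u, v in edges if partition_of[u] != partition_of[v])

# A toy 4-vertex cycle split across two partitions:
edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
partition_of = {1: 0, 2: 0, 3: 1, 4: 1}
print(count_crossing_edges(edges, partition_of))  # edges (2,3) and (4,1) cross -> 2
```

Every crossing edge counted here is a candidate for a network round trip during traversal, which is why the Discussion above treats this count as the main driver of cross-partition cost.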
2.3 Crossing-Edge Representation

We consider three different kinds of representations to model a crossing-edge in our system. We discuss the advantages and disadvantages of each model as well as the additional cost imposed by the model.

All examples in this subsection assume a crossing-edge connecting two vertices V1 and V2 located in partitions P1 and P2, respectively. The original graph is shown in Figure 1(a). Without loss of generality, we assume the edge is directed from V1 to V2.

[Figure 1: Visualization of crossing-edge representations. (a) shows the original graph, while (b), (c), and (d) show the graph partitioned over P1 and P2 for the Ghost Vertex, Dangling Edge, and Super Source/Sink models, respectively. Dotted lines indicate ghost objects. In (d), the vertex marked "SS" is the Super Source/Sink.]

2.3.1 Ghost Vertex Model

In the Ghost Vertex model, we create a ghost vertex in P1 to represent V2, as depicted in Figure 1(b). A ghost vertex is a vertex with a single property containing the location of the real vertex. The edge connecting the two vertices is duplicated across both partitions. In essence, V1 is now connected to the ghost of V2 (denoted as V2'). On the P2 side, another ghost vertex V1' is created to represent V1, along with an edge from V1' to V2.

A major advantage of this model is that all partitions contain valid graphs. This model can thus be implemented as middleware on top of any single-machine graph database without modifying the underlying system. When a client arrives at a ghost vertex, the middleware will return a proxy which will contact the remote partition for any requests related to that vertex. However, a middleware approach is typically slower than a more intrusive one because the middleware needs to do additional processing on top of the regular database access.

Identifying a ghost vertex is also a challenge. For example, consider the situation of V1 connected to V2 as above. Later, say, a new edge is to be created between another vertex V3 owned by P2 and vertex V1. In this case, P2 can either create a new ghost vertex or reuse its existing ghost for V1. The first option increases the required storage and makes avoiding duplicate vertices during traversal more difficult, while reuse complicates edge creation, since a check must be made to see if the ghost vertex already exists, costing additional time or space.

2.3.2 Dangling Edge Model

Instead of materializing ghost vertices, we can leave one end of the crossing-edge unconnected (Figure 1(c)). To achieve this, we require a method to identify vertices that are not part of the current partition. A simple method is to incorporate an identifier for the partition each vertex is assigned to as part of the vertex ID and then use that ID to determine whether the vertex is local. A separate lookup table could also be used, in which case, since most vertices are expected to be internal to the partition, a flag should be set within the edge to avoid making many failed lookups.

To avoid duplication of data, one of the two dangling edges is demoted to a ghost edge. Like a ghost vertex, a ghost edge stores only the location and ID of the actual edge, and all requests are forwarded to the remote partition. In PNeo4j, we always choose the incoming dangling edge to be the ghost edge. Due to this characteristic, it might be more beneficial to choose the semantics of the edge so that it mimics the natural order of traversal between the vertices, avoiding traversal of the ghost edges. For instance, supposing the query is to find all events created by a particular user, having the edges point outward from the user to the events (denoting a "Creates" semantic) might yield some performance benefit compared to the edges pointing inward (denoting a "Created by" semantic).

By removing the ghost vertices, we reduce the storage cost of a crossing-edge to one extra edge. More importantly, we never have to synchronize vertices between partitions since they are not shared. By not storing properties at the ghost edge, we restrict synchronization to edge deletion.

2.3.3 Super Source/Sink Model

The Dangling Edge model has the problem that a partitioned graph is not actually a graph on any of the servers, as some endpoint vertices are missing. One possible solution is to materialize a single vertex that is used as a marker for all crossing-edges, as shown in Figure 1(d). Borrowing from flow network terminology, we call this a Super Source/Sink (SuperSS) vertex, because it consumes all incoming and outgoing flows to/from crossing-edges. The additional cost to represent a crossing-edge is thus only one extra edge, since only one SuperSS is created regardless of the number of crossing-edges in the partition.

This solution has a serious pitfall. To modify an edge, both of its vertices must be locked, to ensure that a vertex is not deleted and to prevent concurrent edge additions from leaving an edge unassociated with the vertex. The SuperSS vertex is connected to all edges that have one end in a remote partition. It follows that any change to any crossing-edge in the partition would require locking the SuperSS vertex, greatly reducing the concurrency of the system and increasing the potential for deadlock for transactions that cross partitions. Therefore, such an option is not advisable for a graph database.
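The Dangling Edge model's locality test can be made concrete with the identifier scheme described in Section 3.2: embed a 16-bit partition ID in the most significant bits of a 64-bit object ID. The Python sketch below illustrates the idea; the exact bit layout and the function names are assumptions for illustration, not PNeo4j's actual code.

```python
# Hypothetical sketch: a 16-bit partition ID (PID) packed into the upper
# bits of a 64-bit global identifier (GID), as in Section 3.2. A dangling
# edge endpoint is resolved locally iff its PID matches the current partition.

PID_BITS = 16
LOCAL_BITS = 64 - PID_BITS          # low bits hold the partition-local ID

def make_gid(pid, local_id):
    """Combine a partition ID and a partition-local ID into a GID."""
    return (pid << LOCAL_BITS) | local_id

def pid_of(gid):
    """Recover the owning partition from a GID."""
    return gid >> LOCAL_BITS

def is_local(gid, my_pid):
    """Decide whether a dangling-edge endpoint can be resolved locally."""
    return pid_of(gid) == my_pid

v = make_gid(3, 42)        # vertex 42, owned by partition 3
print(pid_of(v))           # 3
print(is_local(v, 3))      # True: resolve the endpoint locally
print(is_local(v, 1))      # False: forward the request to partition 3
```

Because the PID occupies otherwise-unused high bits, a PID of 0 leaves the ID unchanged, which is how Section 3.2's compatibility with plain Neo4j IDs falls out of this layout.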
3. IMPLEMENTATION

3.1 Neo4j

Neo4j (http://neo4j.org) is an open-source graph DBMS specialized for high-speed graph traversal. It supports ACID transactions at a read-committed isolation level and provides for manual locking to allow the user to achieve higher levels of isolation. Transactions are logged to ensure durability, and the system provides some consistency guarantees, such as preventing connections to non-existing vertices. In addition, it supports a High-Availability mode [9], in which the database is fully replicated across several systems. One system is designated as the master. The other systems are slaves and may have stale copies of the data. Writes are supported on slaves, but slaves synchronize with the master using the two-phase commit protocol on every write. Still, such replication does not completely address the scaling out of the database, because the total amount of data is limited by the storage capacity of the smallest server.

3.2 Vertex/Edge Identifier

In Neo4j, each object is uniquely identified by an identifier (ID) generated by the system. Once assigned, an object's ID will not change, but if the object is deleted, its ID may be reassigned to a new object.

As of version 1.3, Neo4j supports 2^34 vertices and 2^35 edges. Neo4j uses the long Java data type for an ID, theoretically allowing 2^64 unique objects. The upper bits of the vertex ID space are unused and always 0. Our implementation uses the most significant 16 bits of an ID to record the partition that owns a vertex.

Each partition in PNeo4j is assigned a unique 16-bit partition identifier (PID). When the PID is present in the object's ID, the ID is a global identifier (GID), and it ensures the object is uniquely identifiable across all partitions. The PID value of 0 is reserved to preserve compatibility with Neo4j. The PID is not persisted to disk. Instead, when an object is loaded, PNeo4j adds the PID to create the GID of the object.

Internally, Neo4j assigns an ID to properties attached to objects. As this ID is not exposed to clients, we do not need to modify Neo4j's handling of property IDs.

3.3 Partition Policy

In the interest of avoiding extra complexity, all vertices are created in the partition that receives the request. If a user wants to create a vertex in a specific partition, that partition must be contacted. This is sufficient to let us create partitioned graphs to test partition traversals.

Because we include the partition identifier as part of the ID, an automatic repartitioning scheme could generate errors for missing vertices, as a vertex's ID could change after a read, since the only isolation level we support is read committed. This problem can be avoided by implementing additional isolation levels (a read lock is required to ensure that a vertex does not move during a transaction).

3.4 Transactions

A single transaction may require operations across multiple partitions. Our implementation generates a global transaction ID for a transaction the first time a partition contacts another partition. This ID is then used in all further communication related to that transaction. The two-phase commit protocol is employed to provide consistency across partitions.

We assume that all operations related to a transaction will originate from the partition that created the transaction. That is, partitions involved in a transaction other than the first will not contact any other partition with respect to that transaction. Because the global transaction IDs are hidden from the client, and because none of the functions in our current design create recursive calls between partitions, this assumption is valid for our implementation. As a concrete example, when we are traversing a crossing-edge from a remote partition P2 to another remote partition P3, the originating partition P1 will first get information about the edge from P2, determine that it crosses on to another partition P3, then contact P3 to continue the traversal. Consequently, the originating server will act as the coordinator for the two-phase commit. Note that this means PNeo4j imposes no additional connectivity requirement on top of that required for two-phase commit: participants need not be contactable from participants other than the coordinator. While our implementation does not handle failure of the originating server, additional logging on the remote partitions would address this issue.

4. PERFORMANCE EVALUATION

4.1 Test Environment

All experiments were performed on a single computer running Linux, with each partition hosted on a different network port of the same machine. Latency averaged 0.01 ms between partitions. The baseline Neo4j system used for comparison is version 1.3M03; our modifications were made to the slightly newer 1.3M04 milestone, released two weeks later.

4.2 Cross-Partition Traversal

Averbuch and Neumann [2] performed one test on their implementation of a partitioned graph database: traverse one edge in a two-vertex database 1,000,000 times. For their remaining experiments, they used an emulator with a single partition and a colouring property to indicate virtual partitions. As part of our motivation was to see if the performance problem they observed was surmountable, we repeated their experiment with our implementation.

We test all possible arrangements of the source and destination vertices. Both the source vertex and the destination vertex can be either local to the partition that initiates the transaction, or remote from it. Since a remote source in our design implies that the edge is also remote, the cases are ordered in the table by increasing remoteness. Each trial was run with a five-second warm-up period, and each case presents the mean of five runs. We also show the results of the same test with an unmodified copy of Neo4j (with both vertices local by definition).

Table 1: Time to traverse one edge 1,000,000 times (ms)

Database | Intrapartition | Interpartition, local source | Interpartition, local dest | Fully remote
Neo4j    | 1130           | -                            | -                          | -
PNeo4j   | 1157           | 31068                        | 77646                      | 166189

The results in Table 1 show that increasing the quantity of remote information increases the time required for the traversal. Indeed, as desired, the costs reflect the number of network messages that need to be sent for each case: zero for fully local, one for a remote destination (to access the remote vertex), and two for a remote source (one to access the remote vertex and one to access the remote edge). Importantly, the overhead in the purely local case is minimal. This is in contrast to results in [2], which showed a significant performance penalty even for fully local traversals, attributed partly to the increased software stack in their implementation.
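The commit protocol of Section 3.4, in which the originating partition coordinates two-phase commit over every partition it contacted during the transaction, can be illustrated with a minimal in-memory sketch. All names here are hypothetical; a real implementation would exchange network messages and log votes and decisions for recovery.

```python
# Hypothetical sketch of two-phase commit as coordinated by the originating
# partition (Section 3.4). Participants are in-memory stand-ins for the
# remote partitions contacted during the transaction.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.state = name, can_commit, "active"

    def prepare(self):
        # Phase 1: vote. A participant that cannot commit aborts immediately.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit):
        # Phase 2: apply the coordinator's decision.
        if self.state != "aborted":
            self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    """Originating server's view: collect votes, then broadcast the decision."""
    votes = [p.prepare() for p in participants]   # phase 1: prepare requests
    decision = all(votes)                         # commit only if unanimous
    for p in participants:                        # phase 2: commit/abort
        p.finish(decision)
    return decision

parts = [Participant("P1"), Participant("P2"), Participant("P3", can_commit=False)]
print(two_phase_commit(parts))     # False: P3 voted no, so everyone aborts
print([p.state for p in parts])    # ['aborted', 'aborted', 'aborted']
```

Note how the coordinator is the only process that talks to every participant, matching the paper's observation that participants need not be reachable from one another.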
5. CONCLUSIONS

We examined three methods for partitioning a database and identified vertex-based partitioning as the only one that does not impose a storage limit on scalability. Of the three models presented for implementing vertex-based partitioning, the Dangling Edge model would have the least overhead associated with edge modification. We implemented the dangling-edge scheme in a graph database. Our tests show that performance within a single partition is maintained, while performance across partitions is affected by the network overhead associated with communication between them. For scenarios where spatial locality is observed, such as traversals within a close group of friends in OSNs, or route-finding in a road network on the assumption that most desired routes are local, traversals are unlikely to cross multiple partitions, and thus PNeo4j incurs only a minimal performance penalty. Optimizations such as those in [8] may be able to reduce the cross-partition performance hit due to network communication.

Our current implementation does not ship traversal processing to servers other than the one contacted by the client, generating network traffic proportional to the number of vertices not present in that server. Methods to ship processing of the traversal to each remote server are expected to improve performance.

6. REFERENCES

[1] Alexa. Alexa top 500 global sites. http://www.alexa.com/topsites.
[2] A. Averbuch and M. Neumann. Partitioning graph databases. Technical report, 2010.
[3] C. Curino, E. Jones, Y. Zhang, and S. Madden. Schism: a workload-driven approach to database replication and partitioning. Proc. VLDB Endow., September 2010.
[4] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In T. Pellegrini, S. Auer, K. Tochtermann, and S. Schaffert, editors, Networked Knowledge - Networked Media, volume 221 of Studies in Computational Intelligence, pages 7-24. Springer Berlin / Heidelberg, 2009.
[5] B. Iordanov. HypergraphDB: a generalized graph database. In Proceedings of the 2010 International Conference on Web-Age Information Management, WAIM'10, pages 25-36, Berlin, Heidelberg, 2010. Springer-Verlag.
[6] A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev., April 2010.
[7] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. SIGMOD '10, 2010.
[8] V. Muntés-Mulero, N. Martínez-Bazán, J.-L. Larriba-Pey, E. Pacitti, and P. Valduriez. Graph partitioning strategies for efficient BFS in shared-nothing parallel systems. WAIM'10, 2010.
[9] Neo Technology. 7.1 Architecture. http://docs.neo4j.org/chunked/stable/ha-architecture.html.
[10] Neo Technology. Roadmap. http://wiki.neo4j.org/content/Roadmap.
[11] J. M. Pujol, V. Erramilli, G. Siganos, X. Yang, N. Laoutaris, P. Chhabra, and P. Rodriguez. The little engine(s) that could: scaling online social networks. In Proceedings of the ACM SIGCOMM 2010 Conference, SIGCOMM '10, pages 375-386, New York, NY, USA, 2010. ACM.