DiploCloud: Efficient and Scalable Management of RDF Data in the Cloud
ABSTRACT
Despite recent advances in distributed RDF data management, processing large amounts of RDF data in the cloud is still very challenging. In spite of its seemingly
simple data model, RDF actually encodes rich and complex graphs mixing both
instance and schema-level data. Sharding such data using classical techniques or
partitioning the graph using traditional min-cut algorithms leads to very inefficient
distributed operations and to a high number of joins. In this paper, we describe
DiploCloud, an efficient and scalable distributed RDF data management system for
the cloud. Contrary to previous approaches, DiploCloud runs a physiological
analysis of both instance and schema information prior to partitioning the data. In
this paper, we describe the architecture of DiploCloud, its main data structures, as
well as the new algorithms we use to partition and distribute data. We also present
an extensive evaluation of DiploCloud showing that our system is often two orders
of magnitude faster than state-of-the-art systems on standard workloads.
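The cost of classical sharding mentioned above can be illustrated with a small sketch. It is not part of DiploCloud; it assumes a naive scheme that hashes each triple on its subject, and the names node_for, shard_by_subject and nodes_needed are hypothetical. A query that follows a path across several subjects may then have to join data held on several nodes.

```python
import zlib


def node_for(subject, num_nodes):
    # crc32 is deterministic across runs, unlike the built-in hash() on str.
    return zlib.crc32(subject.encode("utf-8")) % num_nodes


def shard_by_subject(triples, num_nodes):
    """Naive sharding (assumed scheme): place each triple by hashing its subject."""
    shards = {n: [] for n in range(num_nodes)}
    for s, p, o in triples:
        shards[node_for(s, num_nodes)].append((s, p, o))
    return shards


def nodes_needed(subjects, num_nodes):
    """Set of nodes that must cooperate to join the triples of these subjects."""
    return {node_for(s, num_nodes) for s in subjects}
```

With this scheme, all triples of one subject stay together, but a path such as student -> course -> professor can touch up to three different nodes, each hop adding a distributed join.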
EXISTING SYSTEM
The complexity of scaling out an application in the cloud (i.e., adding new
computing nodes to accommodate the growth of some process) very much depends
on the process to be scaled. Often, the task at hand can be easily split into a large
series of subtasks to be run independently and concurrently. Such operations are
commonly called embarrassingly parallel. Embarrassingly parallel problems can be
relatively easily scaled out in the cloud by launching new processes on new
commodity machines. There are however many processes that are much more
difficult to parallelize, typically because they consist of sequential processes.
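The contrast between the two kinds of processes can be sketched as follows. This is only an illustration: the function names are hypothetical, counting items per chunk stands in for any independent subtask, and Newton's iteration for a square root is chosen here as an example of a sequential process.

```python
from concurrent.futures import ThreadPoolExecutor


def count_items(chunk):
    """Independent subtask: needs no data from any other chunk."""
    return len(chunk)


def parallel_count(chunks, workers=4):
    # Embarrassingly parallel: chunks can be handled concurrently,
    # and the partial results are simply summed at the end.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_items, chunks))


def newton_sqrt(x, steps=20):
    # Sequential: every iteration depends on the previous guess,
    # so adding more machines does not shorten the chain of steps.
    # Assumes x > 0.
    guess = x if x > 1 else 1.0
    for _ in range(steps):
        guess = 0.5 * (guess + x / guess)
    return guess
```

Adding workers speeds up parallel_count on large inputs, whereas newton_sqrt gains nothing from extra workers because its steps cannot be reordered or run side by side.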
Disadvantages of Existing System:
1. Difficult to parallelize in practice.
2. Heavy Computation Cost.
PROPOSED SYSTEM
We propose DiploCloud, an efficient, distributed and scalable RDF data processing
system for distributed and cloud environments. Our storage system in DiploCloud
can be seen as a hybrid structure extending several ideas from existing RDF storage approaches. Our
system is built on three main structures: RDF molecule clusters (which can be seen
as hybrid structures borrowing both from property tables and RDF subgraphs),
template lists (storing literals in compact lists as in a column-oriented database
system) and an efficient key index indexing URIs and literals based on the clusters
they belong to.
Contrary to many distributed systems, DiploCloud uses a resolutely non-relational
storage format.
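The three structures described above can be pictured with a toy in-memory sketch. It is a simplification under stated assumptions, not DiploCloud's actual implementation: the class ToyStore and its methods are hypothetical, each molecule cluster is identified by a root URI, template lists are keyed simply by predicate, and each triple carries a flag saying whether its object is a literal.

```python
from collections import defaultdict


class ToyStore:
    """Toy model of molecule clusters, template lists and a key index."""

    def __init__(self):
        # Molecule clusters: root URI -> triples grouped around that root.
        self.clusters = {}
        # Template lists: literals stored compactly, column-style
        # (keyed by predicate here, which is an assumption).
        self.template_lists = defaultdict(list)
        # Key index: URI or literal -> roots of the clusters containing it.
        self.key_index = defaultdict(set)

    def add_molecule(self, root, triples):
        """triples: iterable of (subject, predicate, object, object_is_literal)."""
        self.clusters[root] = [(s, p, o) for s, p, o, _ in triples]
        for s, p, o, is_literal in triples:
            self.key_index[s].add(root)
            self.key_index[o].add(root)
            if is_literal:
                self.template_lists[p].append(o)

    def clusters_for(self, term):
        """Roots of all clusters in which the URI or literal appears."""
        return sorted(self.key_index.get(term, ()))
```

A lookup such as clusters_for("ex:Course1") returns every cluster that mentions the course, so a query can go straight to the relevant clusters instead of scanning all triples.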
Advantages of Proposed System:
1. A new hybrid storage model that efficiently and effectively partitions an RDF
graph.
2. Low computation required for hashing, template lists, and the key index.
SYSTEM ARCHITECTURE
SYSTEM REQUIREMENTS
HARDWARE REQUIREMENTS:
Hardware : Pentium
Speed : 1.1 GHz
RAM : 1 GB
Hard Disk : 20 GB
SOFTWARE REQUIREMENTS:
Operating System : Windows Family
Technology : Java and J2EE
Web Technologies : HTML, JavaScript, CSS
Web Server : Tomcat
Database : MySQL
Java Version : JDK 1.7 or 1.8