epiC: an extensible and scalable system for
processing big data
Dawei Jiang, Gang Chen, Beng Chin Ooi, Kian Lee Tan, Sai Wu
School of Computing, National University of Singapore
College of Computer Science and Technology, Zhejiang University
The Big Data problem is characterized by the so-called 3V features: Volume, a huge amount of data; Velocity, a high data ingestion rate; and Variety, a mix of structured, semi-structured, and unstructured data. State-of-the-art solutions to the Big Data problem are largely based on the MapReduce framework and its open-source implementation, Hadoop. Although Hadoop handles the data volume challenge successfully, it does not deal with data variety well, since its programming interface and the associated data processing model are inconvenient and inefficient for handling structured data and graph data.
Prof Ooi and his team have designed and implemented epiC, an extensible system that tackles Big Data's data variety challenge. epiC introduces a general Actor-like concurrent programming model, independent of any data processing model, for specifying parallel computations. It provides two forms of abstraction: a concurrent programming model and a data processing model. The concurrent programming model defines a set of abstractions (i.e., interfaces) with which users specify parallel computations consisting of independent computations and the dependencies between them. A data processing model defines a set of abstractions with which users specify data manipulation operations. Users process multi-structured datasets with the appropriate epiC extension: an implementation of the data processing model best suited to the data type, plus auxiliary code that maps that data processing model onto epiC's concurrent programming model. As with Hadoop, programs written in this way are automatically parallelized, and the runtime system takes care of fault tolerance and inter-machine communication.
Many existing database applications can be migrated to epiC by extending its SQL interface. epiC can be deployed on top of a pool of virtual machines to provide 24x7 service. All epiC data are maintained by its storage engine (epiCS), an enhanced key-value store built on top of the DFS. epiCS creates secondary indices to facilitate non-key access; these indices are distributed across the cluster via the Cayley graph protocol. In epiCS, all schema information is stored in the meta-store and asynchronously replicated to all processing nodes. epiCX, the execution engine of epiC, organizes operators into a directed acyclic graph, and each operator is processed by multiple compute nodes in parallel. Users can create customized operators by extending epiC's Java APIs. Compared to existing engines, epiCX offers a more flexible parallel processing model and can efficiently support graph-based processing, such as the data mining algorithms used in data analytics.
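The operator extension mechanism might look roughly like the following Java sketch. Since this summary does not show the actual epiCX API, `Operator` and `LambdaOp` are hypothetical stand-ins: users subclass an operator type, and the engine wires instances into a graph and executes them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: the real epiCX Java API is not shown in this
// summary, so "Operator" and "LambdaOp" are illustrative stand-ins.
abstract class Operator {
    private final List<Operator> downstream = new ArrayList<>();

    // Wire this operator to the next one; for brevity this sketch builds
    // a linear chain rather than a general directed acyclic graph.
    Operator to(Operator next) { downstream.add(next); return next; }

    abstract List<String> apply(List<String> rows);

    // Run this operator, then its successors. epiCX would instead run
    // each operator on many compute nodes in parallel.
    List<String> execute(List<String> input) {
        List<String> out = apply(input);
        for (Operator op : downstream) out = op.execute(out);
        return out;
    }
}

// A customized operator defined by supplying a row-transformation function.
class LambdaOp extends Operator {
    private final Function<List<String>, List<String>> fn;
    LambdaOp(Function<List<String>, List<String>> fn) { this.fn = fn; }
    @Override List<String> apply(List<String> rows) { return fn.apply(rows); }
}

public class DagSketch {
    public static void main(String[] args) {
        Operator filter = new LambdaOp(rows ->
                rows.stream().filter(r -> r.contains("a")).toList());
        Operator count = new LambdaOp(rows ->
                List.of(Integer.toString(rows.size())));
        filter.to(count);   // graph: filter -> count
        System.out.println(filter.execute(List.of("apple", "berry", "banana")));
        // prints: [2]
    }
}
```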
epiC has been benchmarked against various existing systems and has been shown to be efficient and scalable. The software stack of epiC is shown in Figure 1. Apart from epiCX and epiCS, the epiC software stack includes three other components: epiCP, a performance monitoring subsystem for ensuring the efficiency of the system; epiCT, a trusted data service component for security and data privacy; and epiCG, a graph engine for supporting the iterative computations required for data analytics.
Figure 1. The epiC Software Stack.
Various applications can be developed on top of epiC, such as consumer analytics for pushing advertisements to users' mobile devices based on location and interests, and healthcare analytics for providing better and more effective healthcare. epiC is being extended to support an interface for the R programming language, and to provide a dashboard configuration tool that lets users dynamically configure its multi-window visualization by dragging and dropping.
The early development of epiC was supported by a Singapore Ministry of Education grant, and follow-on work is supported by a National Research Foundation Competitive Research Programme (NRF CRP) grant. Collaborators include the National University Health System and telecommunication companies. Most of this work has been published in top database conferences such as ACM SIGMOD, VLDB, and IEEE ICDE, and it won a VLDB 2014 Best Paper Award. epiC is being commercialized.
------------------------------------------------------------------------------------------------------------------------
The full manuscript can be accessed via:
http://www.comp.nus.edu.sg/~ooibc/epicvldb14.pdf