epiC: an extensible and scalable system for processing big data
Dawei Jiang, Gang Chen, Beng Chin Ooi, Kian Lee Tan, Sai Wu
School of Computing, National University of Singapore; College of Computer Science and Technology, Zhejiang University

The Big Data problem is characterized by the so-called 3V features: Volume, a huge amount of data; Velocity, a high data ingestion rate; and Variety, a mix of structured, semi-structured, and unstructured data. The state-of-the-art solutions to the Big Data problem are largely based on the MapReduce framework (or its open source implementation, Hadoop). Although Hadoop handles the data volume challenge successfully, it does not deal well with data variety, since its programming interface and the associated data processing model are inconvenient and inefficient for handling structured data and graph data.

Prof Ooi and his team have designed and implemented epiC, an extensible system to tackle Big Data's data variety challenge. epiC introduces a general Actor-like concurrent programming model, independent of the data processing models, for specifying parallel computations. It provides two forms of abstraction: a concurrent programming model and a data processing model. The concurrent programming model defines a set of abstractions (i.e., interfaces) for users to specify parallel computations consisting of independent computations and the dependencies between them. A data processing model defines a set of abstractions for users to specify data manipulation operations. Users process multi-structured datasets with the appropriate epiC extensions; each extension implements the data processing model best suited to a data type, together with auxiliary code that maps that processing model onto epiC's concurrent programming model. As in Hadoop, programs written in this way are automatically parallelized, and the runtime system takes care of fault tolerance and inter-machine communication. Many existing database applications can be migrated to epiC by extending its SQL interface. epiC can be deployed on top of a pool of virtual machines to provide 24x7 service.

All epiC data are maintained by its storage engine, epiCS, an enhanced key-value store built on top of a distributed file system (DFS). epiCS creates secondary indices to facilitate non-key access; these indices are distributed across the cluster via the Cayley graph protocol. In epiCS, all schema information is stored in the meta-store and asynchronously replicated to all processing nodes. epiCX is the execution engine of epiC; it organizes operators into a directed acyclic graph, and each operator is processed by multiple compute nodes in parallel. Users can create their own customized operators by extending epiC's Java APIs. Compared to existing engines, epiCX offers a more flexible parallel processing model and can efficiently support graph-based processing, such as the data mining algorithms used in data analytics. epiC has been benchmarked against various existing systems and has been shown to be efficient and scalable.

The software stack of epiC is shown in Figure 1. Apart from epiCX and epiCS, the epiC software stack includes three other components: epiCP, a performance monitoring subsystem for ensuring the efficiency of the system; epiCT, a trusted data service component for security and data privacy; and epiCG, a graph engine supporting the iterative computations required for data analytics.

Figure 1. The epiC Software Stack.
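To give a flavour of the concurrent programming model described above, the sketch below shows, in Java, what a computation "unit" in an Actor-like model of this kind might look like. The names used here (Unit, MessageService, Message, WordCountMapUnit) are illustrative assumptions for exposition rather than the actual epiC API; the point is that each unit is an independent computation that exchanges only lightweight metadata messages with other units, while the bulk data flows through the storage engine.

    // Hypothetical sketch of an Actor-like "unit", in the spirit of epiC's
    // concurrent programming model. All interface and class names here are
    // illustrative assumptions, not the actual epiC Java API.

    import java.util.List;

    /** A message carries only metadata, e.g. locations of data blocks in the DFS. */
    final class Message {
        final String blockLocation;
        Message(String blockLocation) { this.blockLocation = blockLocation; }
    }

    /** Abstraction of the runtime's message service that connects independent units. */
    interface MessageService {
        List<Message> pull();                     // messages addressed to this unit
        void push(String targetUnit, Message m);  // notify a downstream unit
    }

    /** An independent computation; the runtime schedules many instances in parallel. */
    interface Unit {
        void run(MessageService service);
    }

    /** Example unit: scans the data blocks named in its inbox and emits partial counts. */
    class WordCountMapUnit implements Unit {
        @Override
        public void run(MessageService service) {
            for (Message m : service.pull()) {
                // Load the block from storage, tokenize, and write partial counts
                // back to storage (omitted here); then tell a downstream unit where
                // its input lives by sending metadata only.
                String outputLocation = "counts/" + m.blockLocation;
                service.push("WordCountAggregateUnit", new Message(outputLocation));
            }
        }
    }

A data-model-specific extension (for example, a relational or a graph extension) would then be written by mapping its operators onto units of this kind, which the runtime schedules in parallel and restarts on failure.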
Various applications, such as consumer analytics for pushing advertisements to users on their mobile devices based on location and interests, and healthcare analytics for providing better and more effective healthcare, can be developed on top of epiC. epiC is being extended to support an interface for the R programming language, and to provide a dashboard configuration tool that lets users dynamically configure its multi-window visualization by dragging and dropping. The early development of epiC was supported by a Singapore Ministry of Education grant, and follow-on work is supported by a National Research Foundation Competitive Research Programme (NRF CRP) grant. Collaborators include the National University Health System and telecommunication companies. Most of this work has been published in top database conferences such as ACM SIGMOD, VLDB and IEEE ICDE, and the epiC paper won a VLDB 2014 best paper award. epiC is being commercialized.

The full manuscript can be accessed via: http://www.comp.nus.edu.sg/~ooibc/epicvldb14.pdf