Research Proposal
Parallel and Distributed Analytics Approaches on Big Data Clouds
JAMIA MILIA ISLAMIA UNIVERSITY, NEW DELHI-110025
Submitted By: Mahboob Alam
Submitted To: Department of Computer Engineering, Jamia Milia Islamia University

Research Synopsis

Overview: Big Data refers to very large and complex data sets that are difficult to process using traditional, sequential data processing applications. Data-intensive, parallel and distributed approaches are typically employed, such as the MapReduce programming paradigm (e.g., Apache Hadoop). However, one of the most interesting challenges lies not in the storage and management of the data but in the insights and impact that the analysis of the data can generate. From this perspective, providing effective and efficient algorithms and tools for Big Data analytics and mining is fundamental. The potential of Big Data lies in our ability to provide solutions to business and to the scientific community based on the approach known as ‘data-driven discovery’.

The project will investigate, develop and test distributed formulations of data mining algorithms that are suitable for parallel and distributed computing paradigms. Depending on ongoing collaborations, the project may contribute to multi-disciplinary applications for the analysis of very large data sets in one of the following domains: Climate Science, Neuroscience, or Finance.

Computer simulation is widely used in scientific areas as diverse as earth sciences, drug design, healthcare and manufacturing. Executing simulation applications can be computationally demanding, requiring special computing resources such as clusters, grids or clouds. However, it is not only the execution of the simulation that raises technical challenges. Simulation programs typically generate large amounts of data that need to be processed and analyzed.

Objective: The objective of the proposed PhD research is to design novel architectures and algorithms to store, process and analyze large volumes of structured and unstructured data. As a recent approach, cloud computing technologies have been proposed to execute large-scale distributed simulations (e.g., the CloudSME European project). Demonstrating the interdisciplinary nature of the CloudSME simulation platform (originally designed for manufacturing and engineering), the proposed research will investigate how to apply these results to big datasets and how to extend them with cloud-based big data storage and analytical tools. Simulation analytics can be performed in areas such as biomolecular simulation, financial data generation and healthcare data. The research will focus on the problem of storing, processing and retrieving meaningful insight from petabytes of data. A multilayer architecture can be designed between the data-generating sources and the end users, ensuring that each layer uses the best-of-breed technology for its specific task.

Recent publications relevant to the project

[1] T. Kiss, P. Borsody, G. Terstyanszky, S. Winter, P. Greenwell, S. McEldowney, H. Heindl: Large-scale virtual screening experiments on Windows Azure-based cloud resources, Concurrency and Computation: Practice and Experience, DOI: 10.1002/cpe.3113, 2013.
[2] T. Kiss, P. Greenwell, H. Heindl, G. Terstyanszky, N. Weingarten: Parameter Sweep Workflows for Modelling Carbohydrate Recognition, Journal of Grid Computing, Vol. 8, No. 4, pp. 587-601, DOI: 10.1007/s10723-010-9166-8, 2010.
[3] S. J. E. Taylor, T. Kiss, G. Terstyanszky, P. Kacsuk, N. Fantini: Cloud Computing for Simulation in Manufacturing and Engineering: Introducing the CloudSME Simulation Platform, to be published in the proceedings of ANSS 14.
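
Illustrative sketch: To make the idea of a "distributed formulation of a data mining algorithm" from the Overview concrete, the following minimal Python sketch expresses one k-means iteration in MapReduce style. It is an assumption-laden illustration, not the method the project will adopt: the choice of k-means, the function names (map_partition, merge, kmeans_step) and the toy data are all hypothetical, and the partitions are processed in a single process rather than on a Hadoop cluster. Each partition independently produces partial sums and counts per cluster (the map step), and the partial results are then merged to obtain the new centroids (the reduce step).

```python
from collections import defaultdict
from functools import reduce


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_partition(partition, centroids):
    """Map step: per-partition partial sums and counts, keyed by cluster index."""
    partial = defaultdict(lambda: ([0.0] * len(centroids[0]), 0))
    for point in partition:
        k = nearest(point, centroids)
        sums, count = partial[k]
        partial[k] = ([s + p for s, p in zip(sums, point)], count + 1)
    return dict(partial)


def merge(a, b):
    """Reduce step: merge two partial results by adding sums and counts."""
    merged = dict(a)
    for k, (sums, count) in b.items():
        if k in merged:
            old_sums, old_count = merged[k]
            merged[k] = ([x + y for x, y in zip(old_sums, sums)], old_count + count)
        else:
            merged[k] = (sums, count)
    return merged


def kmeans_step(partitions, centroids):
    """One k-means iteration; a cluster with no points keeps its old centroid."""
    # Each call to map_partition is independent, so in a real deployment
    # these calls could run on different workers.
    partials = [map_partition(p, centroids) for p in partitions]
    totals = reduce(merge, partials, {})
    return [
        [s / totals[k][1] for s in totals[k][0]] if k in totals else list(centroids[k])
        for k in range(len(centroids))
    ]


# Toy data (hypothetical): two partitions of 2-D points, two initial centroids.
partitions = [
    [(1.0, 1.0), (9.0, 9.0)],
    [(2.0, 2.0), (10.0, 10.0)],
]
centroids = [(0.0, 0.0), (8.0, 8.0)]
new_centroids = kmeans_step(partitions, centroids)
print(new_centroids)  # [[1.5, 1.5], [9.5, 9.5]]
```

Because only small per-cluster sums and counts leave each partition, the amount of data exchanged between workers does not grow with the size of the data set, which is the property that makes formulations of this kind attractive for very large data.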