Synopsis on Deep Computation Model for Unsupervised Feature Learning on Big Data

Submitted By: Vikram Yadav
Under the Supervision of: Prof. Prasanta K. Jana
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
INDIAN SCHOOL OF MINES, DHANBAD – 826004

Optimization of Big Data Clusters Using Machine Learning Techniques

1. Background of Machine Learning

Machine learning (ML) is a branch of artificial intelligence that systematically applies algorithms to synthesize the underlying relationships among data and information. For example, ML systems can be trained within automatic speech recognition systems (such as the iPhone's Siri) to convert the acoustic information in a sequence of speech data into a semantic structure expressed as a string of words. ML is already finding widespread use in web search, ad placement, credit scoring, stock market prediction, gene sequence analysis, behavior analysis, smart coupons, drug development, weather forecasting, big data analytics, and many more applications, and it will play a decisive role in the development of a host of user-centric innovations.

ML owes its burgeoning adoption to its ability to characterize underlying relationships within large arrays of data in ways that solve problems in big data analytics, behavioral pattern recognition, and information evolution. ML systems can, moreover, be trained to categorize the changing conditions of a process so as to model variations in operating behavior. As bodies of knowledge evolve under the influence of new ideas and technologies, ML systems can identify disruptions to the existing models and redesign and retrain themselves to adapt to, and co-evolve with, the new knowledge.

The computational characteristic of ML is to generalize from the training experience (or examples) and output a hypothesis that estimates the target function. This generalization property allows an ML system to perform well on unseen data instances by accurately predicting future data. Unlike conventional optimization problems, ML does not come with a single well-defined objective function that can be optimized directly; instead, the error measured on the training data serves as a surrogate for the learning error on unseen data. Generalization is typically realized through classifiers that take discrete or continuous feature vectors as input and output a class. The goal of ML is to predict future events or scenarios that are unknown to the computer, and the learning process generalizes the problem by acting on historical experience. Experience exists in the form of training datasets, which aid in achieving accurate results on new and unseen tasks. A training dataset covers an existing problem domain that the learner uses to build a general model about that domain, enabling the model to produce largely accurate predictions in new cases.

Machine learning aids the development of programs that improve their performance on a given task through experience and training. Many big data applications rely on ML to operate at the highest efficiency. The sheer volume, diversity, and speed of data flow have made it impracticable to rely on the natural capability of human beings to analyze data in real time. The surge in social networking and the wide use of Internet-based applications have resulted not only in a greater volume of data but also in increased data complexity. To preserve data resolution and avoid data loss, these data streams need to be analyzed in real time.
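To make the notion of generalization above concrete, the following minimal Python sketch (purely illustrative, not part of the proposed work) trains a 1-nearest-neighbour classifier on synthetic data and measures accuracy on held-out, unseen instances. The data, the classifier, and the split ratio are assumptions chosen only for this example.

    # Illustrative sketch: generalization = performance on unseen (held-out) data.
    # 1-nearest-neighbour classifier on synthetic 2-D data; requires only numpy.
    import numpy as np

    rng = np.random.default_rng(0)

    # Two Gaussian classes in 2-D act as the "training experience".
    X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(3.0, 1.0, (100, 2))])
    y = np.array([0] * 100 + [1] * 100)

    # Hold out 25% of the instances as "unseen" data.
    idx = rng.permutation(len(X))
    split = int(0.75 * len(X))
    train, test = idx[:split], idx[split:]

    def predict_1nn(X_train, y_train, X_query):
        """Label each query point with the class of its nearest training point."""
        d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
        return y_train[d.argmin(axis=1)]

    # Training accuracy is trivially perfect for 1-NN (each training point is its
    # own nearest neighbour), which is exactly why performance must be checked on
    # the held-out set to estimate how well the learner generalizes.
    train_acc = np.mean(predict_1nn(X[train], y[train], X[train]) == y[train])
    test_acc = np.mean(predict_1nn(X[train], y[train], X[test]) == y[test])
    print(f"training accuracy: {train_acc:.2f}, accuracy on unseen data: {test_acc:.2f}")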
The heterogeneity of the big data stream and the massive computing power available today present abundant opportunities to foster learning methodologies that can identify best practices for a given business problem. Modern computing machines are sophisticated enough to handle large data volumes, greater complexity, and terabytes of storage, and intelligent program flows running on these machines can process and combine many such complex data streams to develop predictive models and extract intrinsic patterns from otherwise noisy data.

When the goal is to predict or forecast a target value, supervised learning is the appropriate choice. The next step is to decide, depending on the target value, between classification (for a discrete target value) and regression (for a numerical target value). It is also important to judge whether ML is a suitable approach for a given problem at all: by its nature, ML cannot deliver perfect accuracy, so for solutions requiring highly accurate results within a bounded time period, it may not be the preferred approach. In general, the following conditions are favorable to the application of ML: (a) very high accuracy is not required; (b) large volumes of data contain undiscovered patterns or information to be synthesized; (c) the problem itself is not well understood, owing to a lack of knowledge or historical information on which to base suitable algorithms; and (d) the problem needs to adapt to changing environmental conditions.

2. Popularity of Machine Learning

Machine learning has gained popularity because of its end-to-end algorithms. The process of developing an ML solution consists of all the steps: collection, preprocessing, and transformation of data; training and testing of the algorithm; applying reinforcement learning; and execution. Common learning settings include supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, transductive learning, and inductive inference.

Popular machine learning algorithms: the top 10 most influential data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006 are C4.5, k-means, support vector machines (SVMs), Apriori, expectation maximization (EM), PageRank, AdaBoost, k-nearest neighbors (k-NN), naive Bayes, and classification and regression trees (CART).
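As a concrete illustration of one algorithm from the ICDM list above, the sketch below implements k-means (Lloyd's algorithm) in Python. It is only an illustrative sketch: the toy data, k = 2, the random initialization, and the convergence test are assumptions made for this example, and the experiments proposed later in this synopsis are planned in MATLAB rather than Python.

    # Illustrative k-means sketch (Lloyd's algorithm); requires only numpy.
    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        """Cluster the rows of X into k groups; returns (centroids, labels)."""
        rng = np.random.default_rng(seed)
        # Initialise centroids with k distinct data points chosen at random.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # Assignment step: each point joins its nearest centroid.
            dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dist.argmin(axis=1)
            # Update step: each centroid moves to the mean of its assigned points.
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(new_centroids, centroids):
                break  # centroids stopped moving: converged
            centroids = new_centroids
        return centroids, labels

    # Toy usage: two well-separated Gaussian blobs in 2-D.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    centroids, labels = kmeans(X, k=2)
    print("estimated centroids:\n", centroids)

The alternation between the assignment and update steps until the centroids stop moving is the essence of the algorithm; everything else here is bookkeeping.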
3. Background of Big Data

IDC is a pioneer in studying big data and its impact. It defined big data in a 2011 report sponsored by EMC (the cloud computing leader) [1]: ``Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.'' This definition delineates the four salient features of big data, i.e., volume, variety, velocity, and value. As a result, this ``4Vs'' definition has been used widely to characterize big data.

A similar description appeared in a 2001 research report [2] in which META Group (now Gartner) analyst Doug Laney noted that data growth challenges and opportunities are three-dimensional, i.e., increasing volume, velocity, and variety. Although this description was not originally meant to define big data, Gartner and much of the industry, including IBM [3] and certain Microsoft researchers [4], continued to use this ``3Vs'' model to describe big data ten years later [5].

The National Institute of Standards and Technology (NIST) [6] suggests that ``Big data is where the data volume, acquisition velocity, or data representation limits the ability to perform effective analysis using traditional relational approaches or requires the use of significant horizontal scaling for efficient processing.''

To address the challenge of web-scale data management and analysis, Google created the Google File System (GFS) [7] and the MapReduce [8] programming model. GFS and MapReduce enable automatic data parallelization and the distribution of large-scale computation applications to large clusters of commodity servers. In June 2011, EMC published a report entitled ``Extracting Value from Chaos'' [1], in which the concept of big data and its potential were discussed throughout. In March 2012, the Obama administration announced that the US would invest 200 million dollars to launch a big data research plan involving a number of federal agencies, including DARPA, the National Institutes of Health, and the National Science Foundation [9].

Data management refers to mechanisms and tools that provide persistent data storage and highly efficient management, such as distributed file systems and SQL or NoSQL data stores. The programming model implements the abstraction of the application logic and facilitates data analysis applications. MapReduce [8], Dryad [10], Pregel [11], and Dremel [12] exemplify such programming models.

4. Big Data System Challenges

Designing and deploying a big data analytics system is not a trivial or straightforward task. As one of its definitions suggests, big data is beyond the capability of current hardware and software platforms; the new hardware and software platforms in turn demand new infrastructure and models to address the wide range of challenges posed by big data. Big data also presents significant challenges to deep learning, including large scale, heterogeneity, noisy labels, and non-stationary distributions, among many others. Depending on the usage, processing the data can involve various operations such as culling, tagging, highlighting, indexing, searching, and faceting. It is not possible for a single machine, or even a few machines, to store or process such huge amounts of data within a finite time period. Recent works [13], [14], [15] have discussed potential obstacles to the growth of big data applications.

5. Objective

Keeping the above in view, we propose to carry out the Ph.D. work with the following objectives:
(a) developing a tensor-based deep learning model for performing feature learning and forming multiple levels of representations of big data;
(b) simulating a high-order back-propagation algorithm, as an extension of the conventional back-propagation algorithm to the high-order tensor space, for training the parameters of tensor auto-encoders;
(c) comparing the simulation results with those of existing approaches.
An attempt will also be made to design the high-order back-propagation algorithm for feature learning on noisy and highly non-linear distributions of big data, i.e., heterogeneous data.

6. Work Plan

At the beginning, we aim to survey existing back-propagation algorithms and gather their pros and cons. We will then attempt to propose a high-order back-propagation algorithm by extending the conventional back-propagation algorithm from the vector space to the high-order tensor space for feature learning. We shall use typical deep learning models (stacked auto-encoders) that are built by stacking components such as restricted Boltzmann machines, sparse auto-encoders, denoising auto-encoders, and predictive sparse decomposition. We shall perform the experiments through simulation runs using MATLAB on the server and compare the results with those of existing related algorithms using various performance metrics.
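For illustration only, the sketch below shows one of the building blocks named in the work plan, a denoising auto-encoder with a single hidden layer, trained with conventional (vector-space) back-propagation on a squared reconstruction error. This is a hedged baseline sketch, not the proposed tensor auto-encoder or its high-order back-propagation; the layer sizes, masking-noise rate, learning rate, tied weights, and toy data are all assumptions chosen for brevity.

    # Illustrative sketch: single-hidden-layer denoising auto-encoder trained with
    # conventional (vector-space) back-propagation; numpy only. This is NOT the
    # proposed tensor auto-encoder -- only a baseline that a high-order (tensor)
    # extension would generalize. All sizes and rates below are arbitrary choices.
    import numpy as np

    rng = np.random.default_rng(0)
    n_visible, n_hidden, lr, noise = 20, 8, 0.1, 0.3

    W = rng.normal(0, 0.1, (n_visible, n_hidden))  # tied encoder/decoder weights
    b = np.zeros(n_hidden)                         # encoder bias
    c = np.zeros(n_visible)                        # decoder bias

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    X = rng.random((500, n_visible))               # toy data in [0, 1]

    for epoch in range(30):
        for x in X:
            x_noisy = x * (rng.random(n_visible) > noise)   # masking corruption
            h = sigmoid(x_noisy @ W + b)                     # encode
            x_hat = sigmoid(h @ W.T + c)                     # decode (tied weights)

            # Back-propagate the squared reconstruction error 0.5 * ||x_hat - x||^2.
            d_out = (x_hat - x) * x_hat * (1.0 - x_hat)      # delta at output units
            d_hid = (d_out @ W) * h * (1.0 - h)              # delta at hidden units
            W -= lr * (np.outer(x_noisy, d_hid) + np.outer(d_out, h))  # tied-weight gradient
            c -= lr * d_out
            b -= lr * d_hid

    recon = sigmoid(sigmoid(X @ W + b) @ W.T + c)
    print("mean squared reconstruction error:", np.mean((recon - X) ** 2))

The hidden activations h learned in this way serve as the features of the corrupted input; stacking such layers yields the stacked auto-encoders mentioned in the work plan, and the proposed research would replace the vector-valued weights and deltas above with their high-order tensor counterparts.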
7. References

1. J. Gantz and D. Reinsel, ``Extracting value from chaos,'' in Proc. IDC iView, 2011, pp. 1–12.
2. J. Manyika et al., Big Data: The Next Frontier for Innovation, Competition, and Productivity. San Francisco, CA, USA: McKinsey Global Institute, 2011, pp. 1–137.
3. P. Zikopoulos and C. Eaton, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. New York, NY, USA: McGraw-Hill, 2011.
4. E. Meijer, ``The world according to LINQ,'' Commun. ACM, vol. 54, no. 10, pp. 45–51, Aug. 2011.
5. D. Laney, ``3D data management: Controlling data volume, velocity and variety,'' Gartner, Stamford, CT, USA, White Paper, 2001.
6. M. Cooper and P. Mell. (2012). Tackling Big Data [Online]. Available: http://csrc.nist.gov/groups/SMA/forum/documents/june2012presentations/fcsm_june2012_cooper_mell.pdf
7. S. Ghemawat, H. Gobioff, and S.-T. Leung, ``The Google file system,'' in Proc. 19th ACM Symp. Operating Syst. Principles, 2003, pp. 29–43.
8. J. Dean and S. Ghemawat, ``MapReduce: Simplified data processing on large clusters,'' Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.
9. W. House. (2012, Mar.). Fact Sheet: Big Data Across the Federal Government [Online]. Available: http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf
10. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, ``Dryad: Distributed data-parallel programs from sequential building blocks,'' in Proc. 2nd ACM SIGOPS/EuroSys Eur. Conf. Comput. Syst., Jun. 2007, pp. 59–72.
11. T. White, Hadoop: The Definitive Guide. Sebastopol, CA, USA: O'Reilly Media, 2012.
12. G. Malewicz et al., ``Pregel: A system for large-scale graph processing,'' in Proc. ACM SIGMOD Int. Conf. Manag. Data, Jun. 2010, pp. 135–146.
13. A. Labrinidis and H. V. Jagadish, ``Challenges and opportunities with big data,'' Proc. VLDB Endowment, vol. 5, no. 12, pp. 2032–2033, Aug. 2012.
14. S. Chaudhuri, U. Dayal, and V. Narasayya, ``An overview of business intelligence technology,'' Commun. ACM, vol. 54, no. 8, pp. 88–98, 2011.