Synopsis
on
Deep Computation Model for Unsupervised
Feature Learning on Big Data
Submitted By
Vikram Yadav
Under the Supervision of
Prof. Prasanta K. Jana
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
INDIAN SCHOOL OF MINES
DHANBAD – 826004
Optimization of Big Data Clusters Using Machine Learning Techniques
1. Background of Machine Learning
Machine learning (ML) is a branch of artificial intelligence that systematically applies
algorithms to synthesize the underlying relationships among data and information. For
example, ML systems can be trained as automatic speech recognition systems (such as the iPhone's Siri) to convert the acoustic information in a sequence of speech data into a semantic structure expressed as a string of words. ML is already finding widespread
uses in web search, ad placement, credit scoring, stock market prediction, gene sequence
analysis, behavior analysis, smart coupons, drug development, weather forecasting, big
data analytics, and many more applications. ML will play a decisive role in the
development of a host of user-centric innovations. ML owes its burgeoning adoption to
its ability to characterize underlying relationships within large arrays of data in ways
that solve problems in big data analytics, behavioral pattern recognition, and information
evolution. ML systems can moreover be trained to categorize the changing conditions of
a process so as to model variations in operating behavior. As bodies of knowledge
evolve under the influence of new ideas and technologies, ML systems can identify
disruptions to the existing models and redesign and retrain themselves to adapt to and co-evolve with the new knowledge. The computational characteristic of ML is to generalize
the training experience (or examples) and output a hypothesis that estimates the target
function. The generalization attribute of ML allows the system to perform well on
unseen data instances by accurately predicting future data. Unlike many other optimization problems, ML does not come with a single well-defined function to be optimized; instead, training errors serve as a measurable proxy for the errors the learner will make on unseen data. The process of generalization
requires classifiers that input discrete or continuous feature vectors and output a class.
The goal of ML is to predict future events or scenarios that are unknown to the
computer. The learning process plays a crucial role in generalizing the problem by
acting on its historical experience. Experience exists in the form of training datasets,
which aid in achieving accurate results on new and unseen tasks. The training datasets
encompass an existing problem domain that the learner uses to build a general model
about that domain. This enables the model to generate largely accurate predictions in
new cases.
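To make this train-then-generalize loop concrete, the following minimal Python sketch (an illustrative addition to the synopsis; it assumes scikit-learn is available and uses a synthetic dataset as a stand-in for a real problem domain) fits a classifier on historical examples and then measures its accuracy on held-out, unseen instances:

    # Minimal sketch: generalizing from training experience to unseen data.
    # Assumes scikit-learn; the synthetic dataset stands in for a real
    # problem domain's historical experience.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # "Experience": historical examples (feature vectors with class labels).
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Hold out unseen instances to estimate how well the hypothesis generalizes.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier(max_depth=5, random_state=0)
    model.fit(X_train, y_train)          # generalize the training experience

    print("training accuracy:", model.score(X_train, y_train))
    print("accuracy on unseen data:", model.score(X_test, y_test))

The gap between the two accuracies is the practical face of the generalization problem: a model that merely memorizes its training set scores well on the first line and poorly on the second.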
Machine learning aids in the development of programs that improve their performance
for a given task through experience and training. Many big data applications leverage
ML to operate at the highest efficiency. The sheer volume, diversity, and speed of data flow have made it impracticable to exploit the natural capability of human beings to analyze
data in real time. The surge in social networking and the wide use of Internet-based
applications have resulted not only in greater volume of data, but also increased
complexity of data. To preserve data resolution and avoid data loss, these streams of
data need to be analyzed in real time. The heterogeneity of the big data stream and the
massive computing power we possess today present us with abundant opportunities to
foster learning methodologies that can identify best practices for a given business
problem. Modern computing machines are sophisticated enough to handle large data volumes, greater complexity, and terabytes of storage. Additionally, intelligent program flows that run on these machines can process and combine many such complex data
streams to develop predictive models and extract intrinsic patterns in otherwise noisy
data. When you need to predict or forecast a target value, supervised learning is the appropriate choice. The next step is to decide, depending on the nature of the target value, between classification (for a discrete target value) and regression (for a numerical target value), as the sketch below illustrates.
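As a minimal sketch of this decision (again an illustrative addition assuming scikit-learn and synthetic data), a discrete target calls for a classifier while a numerical target calls for a regressor:

    # Minimal sketch: classification for discrete targets,
    # regression for numerical targets.
    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    X = np.random.rand(200, 3)                    # feature vectors

    # Discrete target value -> classification.
    y_discrete = (X[:, 0] > 0.5).astype(int)
    clf = LogisticRegression().fit(X, y_discrete)
    print(clf.predict(X[:2]))                     # class labels, e.g. [1 0]

    # Numerical target value -> regression.
    y_numeric = 3.0 * X[:, 0] + 0.1 * np.random.randn(200)
    reg = LinearRegression().fit(X, y_numeric)
    print(reg.predict(X[:2]))                     # real-valued estimates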
It is also important to judge whether ML is a suitable approach for solving a given
problem. By its nature, ML cannot deliver perfect accuracy. For solutions requiring
highly accurate results in a bounded time period, ML may not be the preferred
approach. In general, the following conditions are favorable to the application of ML:
(a) very high accuracy is not required; (b) large volumes of data contain undiscovered
patterns or information to be synthesized; (c) the problem itself is not very well
understood owing to lack of knowledge or historical information as a basis for
developing suitable algorithms; and (d) the problem needs to adapt to changing
environmental conditions.
2. Popularity of Machine Learning
Machine learning has gained popularity because it supports end-to-end development: the process of building an ML application spans all the steps, from collecting, preprocessing, and transforming data, through training and testing the algorithm, to applying it in execution. Popular machine learning paradigms include supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, transductive learning, and inductive inference.
Popular machine learning algorithms: This section lists the top 10 most influential data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-means, support vector machines (SVMs), Apriori, expectation maximization (EM), PageRank, AdaBoost, k-nearest neighbors (k-NN), naive Bayes, and classification and regression trees (CART).
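As an illustration of one of these algorithms, the following is a minimal k-means sketch in plain NumPy (an addition to the synopsis, not drawn from it; in practice a library implementation such as scikit-learn's KMeans would be used):

    # Minimal k-means sketch: alternate between assigning each point to its
    # nearest centroid and moving each centroid to the mean of its points.
    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Assignment step: nearest centroid for every point.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: recompute each centroid (keep it if its cluster is empty).
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break                              # converged
            centroids = new_centroids
        return centroids, labels

    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
    centroids, labels = kmeans(X, k=2)
    print(centroids)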
3. Background of Big Data
IDC is a pioneer in studying big data and its impact. It defines big data in a 2011 report
that was sponsored by EMC (the cloud computing leader) [1]: ``Big data technologies
describe a new generation of technologies and architectures, designed to economically
extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.'' This definition delineates the four salient features of big data, i.e., volume, variety, velocity, and value. As a result, the ``4Vs'' definition has been used widely to characterize big data.
A similar description appeared in a 2001 research report [5] in which META group (now
Gartner) analyst Doug Laney noted that data growth challenges and opportunities are
three-dimensional, i.e., increasing volume, velocity, and variety. Although this
description was not meant originally to define big data, Gartner and much of the
industry, including IBM [3] and certain Microsoft researchers [4], continue to use this
``3Vs'' model to describe big data 10 years later [2]. The National Institute of Standards
and Technology (NIST) [6] suggests that, ``Big data is where the data volume,
acquisition velocity, or data representation limits the ability to perform effective
analysis using traditional relational approaches or requires the use of significant
horizontal scaling for efficient processing.'' To address the challenge of web-scale data
management and analysis, Google created the Google File System (GFS) [7] and the MapReduce [8] programming model. GFS and MapReduce enable automatic data parallelization and the distribution of large-scale computation applications to large clusters of commodity servers. In June 2011, EMC published a report entitled ``Extracting Value from Chaos'' [1]. The concept of big data and its potential were
discussed throughout the report. In March 2012, the Obama administration announced that the US would invest 200 million dollars to launch a big data research plan. The effort involves a number of federal agencies, including DARPA, the National Institutes of Health, and the National Science Foundation [9]. Data management refers to mechanisms and tools that provide persistent data storage and highly efficient management, such as distributed file systems and SQL or NoSQL data stores. The programming model abstracts the application logic and facilitates the data
analysis applications. MapReduce [8], Dryad [10], Pregel [11], and Dremel [12]
exemplify programming models.
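To make the MapReduce abstraction concrete, here is a minimal single-process word-count sketch in Python (an illustrative addition, not from the synopsis; a real deployment distributes the map and reduce phases across a cluster, with the framework performing the shuffle):

    # Minimal MapReduce sketch: map emits (key, value) pairs, the shuffle
    # groups pairs by key, and reduce aggregates each key's values.
    from collections import defaultdict

    def map_phase(document):
        # Emit (word, 1) for every word in one input split.
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_phase(word, counts):
        # Aggregate all values emitted for one key.
        return (word, sum(counts))

    documents = ["big data needs big clusters", "data parallel models scale"]

    # Shuffle: group intermediate values by key, as the framework would.
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_phase(doc):
            grouped[word].append(count)

    results = dict(reduce_phase(w, c) for w, c in grouped.items())
    print(results)    # {'big': 2, 'data': 2, 'needs': 1, ...}

Because each map call touches only its own input split and each reduce call touches only one key's values, both phases parallelize across commodity servers with no shared state, which is precisely what GFS and MapReduce exploit.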
4. Big Data System Challenges
Designing and deploying a big data analytics system is not a trivial or straightforward
task. As one of its definitions suggests, big data is beyond the capability of current
hardware and software platforms. The new hardware and software platforms in turn
demand new infrastructure and models to address the wide range of challenges of big
data. Big data presents significant challenges to deep learning, including large scale, heterogeneity, noisy labels, and non-stationary distributions, among many others. Processing the data can involve a variety of operations depending on usage, such as culling, tagging, highlighting, indexing, searching, and faceting. It is not possible for a single machine, or even a few machines, to store or process such huge amounts of data within a reasonable time. Recent works [13], [14], [15] have discussed potential obstacles to the growth of big data applications.
5. Objectives
Keeping the above in view, we propose here to carry out the Ph.D. work with the following objectives:

• Developing a tensor-based deep learning model for performing feature learning and forming multiple levels of representation of big data

• Simulating a high-order back-propagation algorithm, as an extension of the conventional back-propagation algorithm to the high-order tensor space, for training the parameters in tensor autoencoders (a conventional vector-space autoencoder is sketched after this list for reference)

• Comparing the simulation results with those of existing algorithms

We will attempt to design the high-order back-propagation algorithm for feature learning on noisy and highly non-linear distributions of big data, i.e., heterogeneous data.
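For reference, the following is a minimal conventional (vector-space) autoencoder trained by back-propagation, written in plain NumPy. This is an illustrative baseline added here, not the proposed tensor model; the layer sizes, learning rate, and toy data are arbitrary assumptions:

    # Minimal sketch of a conventional autoencoder trained by back-propagation.
    # The proposed work would extend this vector-space formulation to tensors.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((256, 32))                 # toy data: 256 samples, 32 features

    n_in, n_hidden, lr = 32, 8, 0.1
    W1 = rng.normal(0, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)  # encoder
    W2 = rng.normal(0, 0.1, (n_hidden, n_in)); b2 = np.zeros(n_in)      # decoder

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for _ in range(500):
        # Forward pass: encode to a low-dimensional feature, then reconstruct.
        h = sigmoid(X @ W1 + b1)
        x_hat = sigmoid(h @ W2 + b2)
        # Back-propagate the squared reconstruction error.
        d_out = (x_hat - X) * x_hat * (1 - x_hat)
        d_hid = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
        W1 -= lr * X.T @ d_hid / len(X); b1 -= lr * d_hid.mean(axis=0)

    print("reconstruction MSE:", np.mean((x_hat - X) ** 2))

The proposed high-order algorithm would replace the matrix-vector products and their gradients with tensor operations while preserving this forward/backward structure.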
6. Work Plan
At the beginning, we aim to survey the back-propagation algorithm to catalogue its pros and cons. We will then attempt to propose a high-order back-propagation algorithm by extending the conventional back-propagation algorithm from the vector space to the high-order tensor space for feature learning. We shall use a typical deep learning model (stacked autoencoders) that is established by stacking building blocks such as restricted Boltzmann machines, sparse autoencoders, denoising autoencoders, and predictive sparse decomposition (a greedy layer-wise stacking loop is sketched below). We shall perform the experiments through simulation runs using Matlab on the server and compare them with the existing related algorithms using various performance metrics.
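To show what stacking means in practice, the following minimal greedy layer-wise pretraining loop (an illustrative NumPy addition with assumed layer sizes, not the proposed method) trains one autoencoder per layer and feeds its hidden codes to the next:

    # Minimal sketch of greedy layer-wise stacking: each autoencoder is
    # trained on the representation produced by the previous one.
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_autoencoder(X, n_hidden, lr=0.1, epochs=300):
        # One-hidden-layer autoencoder trained by back-propagation,
        # as in the sketch following Section 5.
        n_in = X.shape[1]
        W1 = rng.normal(0, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
        W2 = rng.normal(0, 0.1, (n_hidden, n_in)); b2 = np.zeros(n_in)
        for _ in range(epochs):
            h = sigmoid(X @ W1 + b1)
            x_hat = sigmoid(h @ W2 + b2)
            d_out = (x_hat - X) * x_hat * (1 - x_hat)
            d_hid = (d_out @ W2.T) * h * (1 - h)
            W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
            W1 -= lr * X.T @ d_hid / len(X); b1 -= lr * d_hid.mean(axis=0)
        return W1, b1

    # Stack 32 -> 16 -> 8, training one layer at a time.
    representation = rng.random((256, 32))
    for n_hidden in (16, 8):
        W, b = train_autoencoder(representation, n_hidden)
        representation = sigmoid(representation @ W + b)  # input to next layer

    print("final feature shape:", representation.shape)   # (256, 8)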
7. References
1. J. Gantz and D. Reinsel, ``Extracting value from chaos,'' IDC iView, 2011, pp. 1-12.
2. J. Manyika et al., Big Data: The Next Frontier for Innovation, Competition, and Productivity. San Francisco, CA, USA: McKinsey Global Institute, 2011, pp. 1-137.
3. P. Zikopoulos and C. Eaton, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. New York, NY, USA: McGraw-Hill, 2011.
4. E. Meijer, ``The world according to LINQ,'' Commun. ACM, vol. 54, no. 10, pp. 45-51, Aug. 2011.
5. D. Laney, ``3-D data management: Controlling data volume, velocity and variety,'' Gartner, Stamford, CT, USA, White Paper, 2001.
6. M. Cooper and P. Mell. (2012). Tackling Big Data [Online]. Available: http://csrc.nist.gov/groups/SMA/forum/documents/june2012presentations/fcsm_june2012_cooper_mell.pdf
7. S. Ghemawat, H. Gobioff, and S.-T. Leung, ``The Google file system,'' in Proc. 19th ACM Symp. Operating Syst. Principles, 2003, pp. 29-43.
8. J. Dean and S. Ghemawat, ``MapReduce: Simplified data processing on large clusters,'' Commun. ACM, vol. 51, no. 1, pp. 107-113, 2008.
9. The White House. (2012, Mar.). Fact Sheet: Big Data Across the Federal Government [Online]. Available: http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf
10. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, ``Dryad: Distributed data-parallel programs from sequential building blocks,'' in Proc. 2nd ACM SIGOPS/EuroSys Eur. Conf. Comput. Syst., Jun. 2007, pp. 59-72.
11. T. White, Hadoop: The Definitive Guide. Sebastopol, CA, USA: O'Reilly Media, 2012.
12. G. Malewicz et al., ``Pregel: A system for large-scale graph processing,'' in Proc. ACM SIGMOD Int. Conf. Manag. Data, Jun. 2010, pp. 135-146.
13. A. Labrinidis and H. V. Jagadish, ``Challenges and opportunities with big data,'' Proc. VLDB Endowment, vol. 5, no. 12, pp. 2032-2033, Aug. 2012.
14. S. Chaudhuri, U. Dayal, and V. Narasayya, ``An overview of business intelligence technology,'' Commun. ACM, vol. 54, no. 8, pp. 88-98, 2011.