Vol. 04, Special Issue 01, 2013
http://ijesr.in/
International Journal of Engineering Sciences Research (IJESR)
ACICE-2013
ISSN: 2230-8504; e-ISSN: 2230-8512
Data Mining: Exploring Big Data Using Hadoop and MapReduce
M.JAYASREE
M.Tech Scholar, Dept. of CSE, Sri Venkateswara College of Engineering (SVCE)
Karakambadi Road, Tirupati 517507, Andhra Pradesh, India
[email protected]
ABSTRACT
With the rapid growth of networks, organizations nowadays accumulate collections of millions of data records in a large number of combinations. This big data poses challenges for business problems and demands high-performance analysis. This paper discusses the new methods of Hadoop and MapReduce from the data mining perspective.
Keywords: Data Mining, Big Data, BI, Big Data Analytics, OLAP, EDA, Neural Networks, Hadoop, MapReduce, Advantages, Disadvantages.
1. INTRODUCTION
Data mining [4, 5] is designed to explore data by identifying systematic relationships among variables and then validating the findings by applying the detected patterns to new subsets of the data. The mechanism of data mining is to apply standard algorithms to large data sets to extract information from them. Such a large set of data related to business or markets is known as "Big Data [1]". The aim of business intelligence (BI) is to transform data into increased value for the enterprise. Rather than collecting information on what organizations are actually doing, it is better to understand how organizations view big data and to what extent they currently use it to benefit their business. Organizations have now begun to understand and explore how to process and analyze this big data. Big data and data mining are spreading not only in BI but also in science fields such as meteorology, petroleum exploration, and bioinformatics. Such volumes of data require the support of software, hardware, and sophisticated algorithms to enable a new level of analysis.
2. TRADITIONAL DATA MINING TECHNIQUES
In general, data mining is organized as a three-phase process: (1) initial exploration, (2) pattern identification with validation/verification, and (3) deployment, in which the identified pattern is applied to new data to generate predictions.
Fig 1: Different phases of Data Mining
The first phase involves data cleaning, data transformation, and selection operations on data fields to bring the variables down to a manageable set. Depending on the nature of the analytic problem, this exploration is then elaborated using a collection of graphical and statistical methods [7, 8] known as Exploratory Data Analysis (EDA). EDA helps in deciding on the nature and complexity of the variables and on which variables to carry into the next stage. The second phase covers model building and validation. This is a lengthy process that mainly involves predicting performance, i.e., refining the details of the question until the results are stable over samples. These topics fall under predictive data mining [9], which includes techniques [6] such as Bagging, Boosting, Stacking, and Meta-Learning. The final phase is deployment, in which the best model selected in the previous phase is applied to new data in order to generate the expected outcome.
2.1 On-Line Analytic Processing (OLAP)
OLAP is also characterized as Fast Analysis of Shared Multidimensional Information (FASMI). It allows end users of multidimensional databases to generate computer-accessible summaries of the data and to issue other analytic queries. Although the name says "on-line", OLAP analyses need not run in real time; the term simply refers to analyzing multidimensional databases. OLAP [10] facilities can be integrated into enterprise-wide database systems that allow analysts and managers to monitor the performance of business aspects such as the manufacturing process and the various kinds of transactions completed at different locations. In this sense, data mining can be seen as an analytic extension of OLAP.
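To make "generating summaries of multidimensional data" concrete, here is a minimal, self-contained Java sketch (illustrative only; the Sale record and its fields are hypothetical, not from the paper) that rolls up a small fact table along two dimensions, the kind of aggregate an OLAP query answers:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class OlapRollup {
  // Hypothetical fact record with two dimensions and one measure.
  record Sale(String region, String product, double amount) {}

  public static void main(String[] args) {
    List<Sale> facts = Arrays.asList(
        new Sale("South", "Laptop", 1200.0),
        new Sale("South", "Laptop", 800.0),
        new Sale("North", "Phone", 300.0));

    // Roll up the measure over the (region, product) dimension pair.
    Map<String, Map<String, Double>> cube = facts.stream().collect(
        Collectors.groupingBy(Sale::region,
            Collectors.groupingBy(Sale::product,
                Collectors.summingDouble(Sale::amount))));

    System.out.println(cube); // {North={Phone=300.0}, South={Laptop=2000.0}}
  }
}

A real OLAP engine would precompute and index such aggregates across many dimension combinations rather than scanning the facts on demand.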
2.2 Exploratory Data Analysis (EDA)
Exploratory Data Analysis involves both general approaches and specific techniques. It is useful to understand how data mining differs from traditional Exploratory Data Analysis. Data mining is more focused on the general nature of the underlying phenomena and is less concerned with identifying the specific relations among variables; instead, its focus is on producing a solution that can generate useful predictions. Thus data mining [11] is an approach to exploring data that uses not only traditional EDA techniques but also methods such as Neural Networks, which can generate valid predictions but cannot identify the relations among the variables on which those predictions are based.
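To illustrate the contrast, here is a short sketch of a typical EDA-style computation: Pearson's correlation coefficient, which measures exactly the kind of specific relation between two variables that traditional EDA targets. The sample values are made up for the example, not taken from the paper.

public class PearsonCorrelation {
  // Pearson's r between two equally sized samples x and y.
  static double correlation(double[] x, double[] y) {
    int n = x.length;
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
      sx += x[i]; sy += y[i];
      sxx += x[i] * x[i]; syy += y[i] * y[i];
      sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;                        // n times the covariance
    double vx = sxx - sx * sx / n, vy = syy - sy * sy / n; // n times the variances
    return cov / Math.sqrt(vx * vy);
  }

  public static void main(String[] args) {
    double[] x = {1, 2, 3, 4, 5};
    double[] y = {2, 4, 5, 4, 5};
    System.out.printf("r = %.3f%n", correlation(x, y)); // prints r = 0.775
  }
}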
2.3 Neural Networks
Neural Networks [12] are one of the data mining techniques. They are analytical techniques inspired by biological nervous systems such as the brain. Modeled on the functions of the brain, such a system can predict new observations from other observations (i.e., one variable from another) through a process of learning from existing data.
Fig 2: Neural Networks model
The network architecture is designed to include a definite number of layers, each consisting of a certain number of neurons. Each stage involves trial and error. After training, the resulting "network" developed in this learning process describes a pattern detected in the data; the network thus functions similarly to a traditional model. Using Neural Network techniques, explanatory models can be built. The major advantage of this approach is that it can approximate any continuous function, so no hypotheses about the underlying variables are required. A drawback is that the final solution depends on the initial conditions of the network, and it is practically impossible to interpret the solution in conventional analytic terms.
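The following is a minimal, illustrative sketch of that learning process (not the paper's own model): a single neuron whose weights are adjusted by trial and error over a toy data set until its predictions approach the training targets. The AND-gate data, learning rate, and initial weights are all assumptions made for the example.

public class SingleNeuron {
  public static void main(String[] args) {
    // Toy training data: logical AND of two inputs.
    double[][] x = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
    double[] target = {0, 0, 0, 1};

    double w1 = 0.1, w2 = -0.1, bias = 0.0; // initial conditions shape the final solution
    double rate = 0.5;

    // Each pass adjusts the weights toward smaller prediction error.
    for (int epoch = 0; epoch < 5000; epoch++) {
      for (int i = 0; i < x.length; i++) {
        double z = w1 * x[i][0] + w2 * x[i][1] + bias;
        double out = 1.0 / (1.0 + Math.exp(-z));   // sigmoid activation
        double err = target[i] - out;
        double grad = err * out * (1 - out);       // delta rule update
        w1 += rate * grad * x[i][0];
        w2 += rate * grad * x[i][1];
        bias += rate * grad;
      }
    }

    // Predictions now approach the 0/0/0/1 targets.
    for (double[] in : x) {
      double z = w1 * in[0] + w2 * in[1] + bias;
      System.out.printf("%s -> %.2f%n", java.util.Arrays.toString(in),
          1.0 / (1.0 + Math.exp(-z)));
    }
  }
}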
3. HADOOP AND MAPREDUCE TECHNIQUE
Big data [2] plays an important role in business applications. Its coverage includes data acquisition, cleaning, distribution, and best practices, and it supports tasks such as risk and threat analysis, predicting network failures, and trade control. Big data analytics has found Apache Hadoop to be a preferred solution to the problems of traditional data mining: it provides an extensible way of recovering from failures of data storage and processing in a distributed system. Apache Hadoop is an open-source software framework [13] for storing and processing large data sets on clusters of commodity hardware. Hadoop is designed on the assumption that hardware failures will occur and should be handled automatically by the software framework. The main components of Hadoop are the Hadoop Distributed File System (HDFS), which is suited to large files and provides high-bandwidth clustered storage, and MapReduce, which acts as the heart of Hadoop. MapReduce performs two different tasks in Hadoop programs. The first job is map, which takes a collection of data and transforms it into another set of data broken into tuples (key/value pairs). The reduce job takes the output of the map job as its input and combines those tuples into smaller sets of tuples. The reduce job therefore always runs after the map job.
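As a sketch of these two jobs in code, here is a minimal version of the classic word-count example written against the Hadoop Java MapReduce API (input and output paths are supplied on the command line):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: split each input line into words and emit a (word, 1) tuple per token.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // key/value pair sent to the shuffle
      }
    }
  }

  // Reduce: combine all counts for the same word into a single total.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The map task emits a (word, 1) tuple for every token; the framework groups the tuples by key, and the reduce task collapses each group into a single (word, total) pair, illustrating the map-then-reduce ordering described above.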
4. HADOOP WITH OTHER TECHNOLOGIES
4.1 Using Hadoop Techniques with Parallel Databases [14]
In the early days of Hadoop and MapReduce there were several performance problems, but current versions have been combined with various data management techniques to reduce the performance gap. In this context Hadoop
is studied in terms of its similarities to and differences from parallel databases. The discussion focuses on parallel-database techniques such as job optimization, data layouts, and indexes. Although using Hadoop requires only a little database background, it cannot compete with parallel databases in query-processing efficiency. Many researchers have found it effective to use parallel databases in combination with Hadoop MapReduce. In earlier versions of Hadoop, most of the problems lay in the physical organization of data, including data layouts and indexes. In general, Hadoop MapReduce suffered from a row-oriented layout, so alternative data layout techniques have been proposed for it. Similarly, Hadoop lacks appropriate indexes, and a good body of indexing techniques has been proposed for Hadoop MapReduce [15, 16, 17, 18].
4.2 Hadoop and Data Warehouse [19]
With the hype around Hadoop unrestrained, practitioners are easily swayed by claims such as "Hadoop is becoming the new data warehouse." It is not really so: there are many differences between Hadoop and a data warehouse. This section explores when to use Hadoop and when to use a data warehouse. Consider a firm that uses Hadoop to preprocess the raw click streams generated by customers using its website. The click streams are passed to the data warehouse, whose processing provides a view of customer preferences. The data warehouse combines these customer preferences with marketing campaigns and recommendation engines to offer investment suggestions and analysis to customers. The data warehouse can in turn be used as a source for complex Hadoop jobs, bringing the advantages of the two systems together. Ultimately, choosing between Hadoop and a data warehouse depends on the requirements of the organization; in most cases Hadoop and the data warehouse work together as partners in the information supply chain.
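As a hedged sketch of the preprocessing step in this example, the mapper below assumes a hypothetical comma-separated click-log line of the form timestamp,customerId,url (the format and class names are illustrative, not from the paper). It drops malformed records and keys each click by customer for the downstream warehouse load:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only preprocessing: keep well-formed clicks, key them by customer,
// and leave aggregation to the downstream data warehouse load.
public class ClickStreamCleanerMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private final Text customerId = new Text();
  private final Text click = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Hypothetical raw line: "2013-01-15T09:30:00,cust-42,/products/funds"
    String[] fields = value.toString().split(",");
    if (fields.length != 3 || fields[1].isEmpty()) {
      return; // drop malformed records during preprocessing
    }
    customerId.set(fields[1]);
    click.set(fields[0] + "\t" + fields[2]); // timestamp + page visited
    context.write(customerId, click);
  }
}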
5. HADOOP IN OTHER SCIENCES
5.1 Biological data analysis
As biological databases grow larger, the amount of genomic information to be handled is increasing tremendously. To cope with this big data, Hadoop is used mostly in next-generation sequencing. High-throughput expression data for genes, proteins, and metabolites can benefit from big data analysis with Hadoop. Advances in bioinformatics [21] bring challenges in processing, storing, and analyzing the data; these can be addressed by using Hadoop in combination with the CloudBurst algorithm. Hadoop is also being examined for protein sequence clustering [22], where a MapReduce algorithm has been found useful for handling more complex clusters with a high degree of parallelism.
5.2 Geographical analysis
This category of data includes location and is enriched with geographic information in a structured form; in other words, it is spatial data. Such data requires an understanding of geometry and of the operations that can be performed on it. GIS tools built on a spatial framework for Hadoop [23] allow users to analyze spatial data. Some network problems with a geographic dimension [24], such as Distributed Denial of Service (DDoS) attacks, a common form of security hacking, can also be analyzed using Hadoop and MapReduce.
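A plausible shape for such an analysis (a sketch under assumed inputs, not the method of [24]) is to count requests per source IP in the map phase and flag unusually active sources in the reduce phase; the log format and threshold below are illustrative:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DdosDetector {
  // Hypothetical access-log line: "<sourceIp> <timestamp> <requestPath>"
  public static class IpMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text sourceIp = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\\s+");
      if (fields.length == 0 || fields[0].isEmpty()) return;
      sourceIp.set(fields[0]);
      context.write(sourceIp, ONE); // one hit per request line
    }
  }

  // Emit only the IPs whose request count exceeds an assumed threshold.
  public static class ThresholdReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private static final int SUSPICIOUS_THRESHOLD = 10000; // illustrative value

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable v : values) total += v.get();
      if (total > SUSPICIOUS_THRESHOLD) {
        context.write(key, new IntWritable(total));
      }
    }
  }
}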
6. HADOOP MAPREDUCE ADVANTAGES
The main advantage of Hadoop MapReduce is that it allows users, even non-experts, to easily run analytical tasks over big data. It gives them complete control over the processing of the input data sets. Developers can use MapReduce without much database knowledge; only a little Java is needed. It delivers satisfactory performance when scaling to large clusters.
• It supports distributed data and computation.
• Computation is performed local to the data, which prevents network overload.
• Tasks are independent, so it can easily handle partial failures: when nodes fail, tasks are automatically restarted.
• It is a simple programming model; the end-user programmer writes only MapReduce tasks.
• HDFS stores vast amounts of information.
• HDFS has a simple and robust coherency model and thus stores data reliably.
• HDFS provides streaming read performance.
• Flat scalability [20].
• It can process large amounts of data in parallel.
• HDFS can replicate files, which makes it easy to handle situations like software and hardware failure.
• In HDFS data is written once and read many times (see the sketch after this list).
• It is more economical, as data and processing are distributed across clusters of personal computers.
• It can be offered as an on-demand service, for example as part of Amazon's EC2 cluster computing service.
• MapReduce programs can be written in Java, a language that even many non-computer scientists can learn with sufficient capability to meet powerful data-processing needs.
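The write-once, read-many pattern referenced above looks as follows with the HDFS Java client API (a minimal sketch; the path /tmp/example.txt and the cluster configuration are assumed):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceReadMany {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up site configuration files
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/tmp/example.txt"); // illustrative path

    // Write once: create the file and stream data into it.
    try (FSDataOutputStream out = fs.create(file)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read many times: reopen the same file for streaming reads.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}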
7. DISADVANTAGES OR LIMITATIONS OF HADOOP
These are the major areas where the Hadoop framework is found wanting.
• Because both HDFS and the MapReduce software are under active development, they can be unstable.
• The prevention of shared (central) data leads to a restrictive programming model.
• HDFS is weak at handling small files and lacks transparent compression. Because HDFS is optimized for sustained throughput, it does not work well with random reads on small files.
• Job flow must be managed explicitly when there is intermediate data.
• Managing the cluster is hard: operations such as debugging, distributing software, and collecting logs are difficult.
• Because of the single-master model, it requires extra care and may limit scaling.
• Hadoop offers a security model, but it is hard to implement because of its complexity.
• MapReduce is a batch-based architecture, which means it does not lend itself to use cases that need real-time data access.
CONCLUSION
To keep track of the current state of a business, advanced big data analysis techniques such as predictive analysis, data mining, statistics, and natural language processing need to be examined. New big data techniques such as Hadoop and MapReduce create alternatives to traditional data warehousing. Hadoop in combination with newer technologies opens a new scope of study in various fields of science and technology.
REFERENCES
[1] O'Reilly Media (2013). Disruptive Possibilities: How Big Data Changes Everything.
[2] McKinsey Global Institute. Big Data: The Next Frontier for Innovation, Competition, and Productivity.
[3] Berry, M. J. A., & Linoff, G. S. (2000). Mastering Data Mining. New York: Wiley.
[4] Edelstein, H. A. (1999). Introduction to Data Mining and Knowledge Discovery (3rd ed.). Potomac, MD: Two Crows Corp.
[5] Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in Knowledge Discovery and Data Mining. Cambridge, MA: MIT Press.
[6] Han, J., & Kamber, M. (2000). Data Mining: Concepts and Techniques. New York: Morgan Kaufmann.
[7] Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.
[8] Pregibon, D. (1997). Data Mining. Statistical Computing and Graphics, 7, 8.
[9] Weiss, S. M., & Indurkhya, N. (1997). Predictive Data Mining: A Practical Guide. New York: Morgan Kaufmann.
[10] Westphal, C., & Blaxton, T. (1998). Data Mining Solutions. New York: Wiley.
[11] Witten, I. H., & Frank, E. Data Mining. New York: Morgan Kaufmann.
[12] Haykin (1994), Masters (1995), Ripley (1996), and Welstead (1994): Neural Networks.
[13] http://en.wikipedia.org/wiki/Apache_Hadoop
[14] https://infosys.uni-saarland.de/publications/BigDataTutorial.pdf
[15] Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., & Schad, J. (2010). Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). PVLDB, 3(1):519–529.
[16] Dittrich, J., Quiané-Ruiz, J.-A., Richter, S., Schuh, S., Jindal, A., & Schad, J. (2012). Only Aggressive Elephants are Fast Elephants. PVLDB, 5.
[17] Jiang, D., et al. (2010). The Performance of MapReduce: An In-depth Study. PVLDB, 3(1-2):472–483.
[18] Lin, J., et al. (2011). Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics. MapReduce Workshop.
[19] http://www.teradata.com
[20] http://www.j2eebrain.com/java-J2ee-hadoop-advantages-and-disadvantages.html
[21] http://ctovision.com/2011/05/hadoop-for-bioinformatics/
[22] http://www.nersc.gov/users/analytics-and-visualization/hadoop/
[23] http://esri.github.io/gis-tools-for-hadoop/
[24] http://www.hadoopsphere.com/2012/07/detecting-ddos-hacking-attempt-with.html