DATA MINING OVER LARGE DATASETS USING HADOOP IN CLOUD ENVIRONMENT
V. Nappinna Lakshmi1, N. Revathi2*
1PG Scholar, 2Assistant Professor
Department of Information Technology,
Sri Venkateswara College of Engineering,
Sriperumbudur – 602105, Chennai, INDIA.
1[email protected]
2[email protected]*
*Corresponding author
Abstract- There is drastic growth of data in web applications and social networking, and such data is referred to as Big Data. Hive queries integrated with Hadoop are used to generate report analysis for thousands of datasets, but retrieving those datasets is very time consuming and the approach lacks performance. To overcome this problem, Market Basket Analysis, a very popular data mining algorithm, is applied in the Amazon cloud environment by integrating it with the Hadoop ecosystem and HBase. The objective is to store the data persistently along with the past history of the dataset and to perform report analysis on those datasets. The main aim of this system is to improve performance through parallelization of various operations such as loading the data, building indexes and evaluating the queries. The performance analysis is done with a minimum of three nodes within the Amazon cloud environment. HBase is an open-source, non-relational, distributed database model that runs on top of Hadoop and associates a single key with multiple values. Looping is avoided when retrieving a particular record from huge datasets, so less execution time is consumed. The HDFS file system stores the data after the map and reduce operations are performed, and the execution time decreases as the number of nodes increases. The performance is tuned with parameters such as the HBase heap memory and the caching parameter.
Keywords- HBase, Cloud computing, Hadoop
ecosystem, mining algorithm
I INTRODUCTION
Currently, dataset sizes for applications are growing at an incredible rate, and datasets growing beyond a few hundred terabytes have no conventional solutions for their management and analysis. Services such as social networking need approaches that achieve these goals with a minimum amount of effort in terms of software, CPU and network. Cloud computing is associated with a new paradigm for provisioning computing infrastructure: the paradigm shifts the location of the infrastructure to the network, reducing the costs associated with managing hardware and software resources. Cloud computing can be described as a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.
Hadoop is a framework for running a large number of applications and includes HDFS for storing large datasets. HadoopDB tries to achieve fault tolerance and the ability to operate in heterogeneous environments by inheriting the scheduling and job-tracking implementation from Hadoop. The main aim of these systems is to improve performance through parallelization of various operations such as loading the datasets, building indexes and evaluating the queries. These systems are usually designed to run on top of a shared-nothing architecture, where data may be stored in a distributed fashion and input/output speeds are improved by using multiple CPUs and disks in parallel and network links with high available bandwidth.
HadoopDB tries to achieve the performance of parallel databases by doing most of the query processing inside the database engine. Hadoop is an open-source framework that is used in cloud environments for efficient data analysis and storage. It supports data-intensive applications by realizing the implementation of the Map Reduce framework, inspired by Google's architecture. The integration of Hadoop and Hive is used to store and retrieve datasets in an efficient manner. For more efficient data storage and retail business transactions, the Hadoop ecosystem integrated with HBase in the cloud environment is used to store and retrieve the datasets persistently. The performance analysis is done with Map Reduce parameters such as the HBase heap memory and the caching parameter.
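As a sketch of how these two tuning knobs surface in practice (assuming an HBase 0.9x-era Java client; the table name here is hypothetical, and the region server heap itself is set through HBASE_HEAPSIZE in conf/hbase-env.sh rather than in client code), the caching parameter can be raised per scan:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ScanCachingExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table; the heap size is not set here but in hbase-env.sh.
        HTable table = new HTable(conf, "transactions");
        Scan scan = new Scan();
        // Caching parameter: rows fetched per RPC; a larger value trades
        // client/server memory for fewer round trips during full scans.
        scan.setCaching(500);
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result row : scanner) {
                System.out.println(row);
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}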
A. Cloud computing
Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g. networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
Fig 1.1 Hadoop Ecosystem (components include Pig, Ambari, Oozie, Sqoop, Flume and HCatalog)
B. Hadoop
Hadoop is designed to run on cheap commodity hardware. It automatically handles data replication and node failure, so the user can focus on processing the data. It is used to store copies of internal logs and dimension data sources, and it is used for reporting, analytics and machine learning.
C. Hive
Hive is a system for managing and querying structured data, built on top of Hadoop for data warehousing and analytics.
D. Hive QL
HiveQL is the SQL-like query subset of the Hive data warehouse. It makes use of a parser to execute queries as Map Reduce programs and to store the large datasets in Hadoop's HDFS.
E. Map Reduce
Map Reduce is a programming model for processing large data sets; its original implementation was developed at Google. Map Reduce is typically used for distributed computing on clusters of computers: it is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware). Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured), and Map Reduce can take advantage of data locality, processing the datasets on or near the storage assets to decrease the transmission of data.
"Map" step - The master node takes the input file and divides the dataset into smaller sub-problems, which it distributes to the worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node.
"Reduce" step - The master node collects the answers to all the sub-problems and combines them to form the output: the answer to the problem the job was originally trying to solve.
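A minimal sketch of these two steps in Hadoop's Java MapReduce API may help; this is the classic word-count job under the Hadoop 2.x API, not this paper's algorithm, and the input/output paths are supplied on the command line:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // "Map" step: each worker turns its input split into <word, 1> pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String tok : value.toString().split("\\s+")) {
                if (tok.isEmpty()) continue;
                word.set(tok);
                ctx.write(word, ONE);
            }
        }
    }

    // "Reduce" step: the framework groups pairs by key; each reducer sums them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}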
F. HBase
HBase is a distributed, versioned, column-oriented database built on top of Hadoop and Zookeeper. The HBase master works out the placement of regions on the nodes so that it can configure and split the data correctly, and mapper jobs are run on the region servers containing the regions to ensure data locality. The HBase server reads incoming requests using one listener thread and a pool of deserialization threads; input/output is handled asynchronously, and the listener sleeps on a set of socket descriptors.
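As an illustration of the single-key, multiple-values model mentioned in the abstract, a hedged sketch using the HBase 0.9x-era Java client (the table "basket" and its column family "items" are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table "basket" with a single column family "items";
        // one row key carries multiple qualifier/value pairs.
        HTable table = new HTable(conf, "basket");
        Put put = new Put(Bytes.toBytes("txn-0001"));
        put.add(Bytes.toBytes("items"), Bytes.toBytes("milk"), Bytes.toBytes("1"));
        put.add(Bytes.toBytes("items"), Bytes.toBytes("bread"), Bytes.toBytes("2"));
        table.put(put);

        // Read the row back by key; no looping over the whole dataset is needed.
        Result r = table.get(new Get(Bytes.toBytes("txn-0001")));
        byte[] milk = r.getValue(Bytes.toBytes("items"), Bytes.toBytes("milk"));
        System.out.println("milk = " + Bytes.toString(milk));
        table.close();
    }
}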
G. Zookeeper
Zookeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers (these registers are called znodes), organized much like a file system. Unlike normal file systems, Zookeeper provides its clients with high throughput, low latency, high availability and strictly ordered access to the znodes. The performance of Zookeeper has been tested by coordinating large numbers of distributed systems.
The main differences between Zookeeper and standard file systems are that every znode can have data associated with it and that znodes are limited in the amount of data they can hold. Zookeeper was designed to store coordination data such as status information, configuration and location information. This kind of metadata is usually measured in kilobytes, if not bytes. Zookeeper has a built-in sanity check of 1 MB to prevent it from being used as a huge data store; in
general it is used to store much smaller pieces of data. Splitting a large number of datasets across znodes keeps each dataset easy to reach.
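A small sketch of the znode model using the ZooKeeper Java client (the ensemble address, znode path and payload are illustrative): a client connects, creates a znode carrying a few bytes of status data, and reads it back.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeExample {
    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        // Hypothetical ensemble address; 3000 ms session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();
        // A znode holds a small payload (status/configuration, kilobytes at most).
        byte[] status = "region-server-1:up".getBytes("UTF-8");
        zk.create("/demo-status", status,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        byte[] read = zk.getData("/demo-status", false, null);
        System.out.println(new String(read, "UTF-8"));
        zk.close();
    }
}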
H. HDFS
A Hadoop Distributed File System cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients, together with a number of DataNodes, usually one per node in the cluster. The DataNodes manage the storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files; a single file is split into one or more blocks, and these blocks are stored on DataNodes.
Goal - improve reliability, availability and network bandwidth utilization.
Data topology - there are many racks, and communication between racks goes through switches. Network bandwidth between machines on the same rack is greater than between machines on different racks; the NameNode determines the rack id of each DataNode.
DataNodes - serve read and write requests and perform block creation, deletion and replication upon instruction from the NameNode.
Replica placement - replicas are placed on nodes across the local racks. Replica selection - for a READ operation, HDFS tries to minimize bandwidth consumption and latency: if there is a replica on the reader's node, that replica is preferred, and since an HDFS cluster may span multiple data centers, a replica in the local data center is preferred over a remote one. File system metadata - the HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to record every change that occurs to the file system metadata; the entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in a file called FsImage in the NameNode's local file system.
Fig.1.2 HDFS architecture (NameNode holding the metadata, clients issuing read and block operations, and DataNodes with rack-aware replica placement across Rack 1 and Rack 2)
The NameNode maintains the file system, and any metadata changes to the file system are recorded by the NameNode. An application can specify the number of replicas of a file that it needs (the replication factor of the file); this information is stored in the NameNode. HDFS is designed to store very large files across machines in a large cluster. Each file is a sequence of blocks, and all blocks in a file except the last are of the same size. Blocks are replicated for fault tolerance, and the block size and number of replicas are configurable per file. The NameNode receives a Heartbeat and a BlockReport from each DataNode in the cluster; a BlockReport contains all the blocks on a DataNode. The placement of the replicas is critical to HDFS performance, and optimizing replica placement distinguishes HDFS from other distributed file systems.
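Since the block size and replication factor are configurable per file, a hedged sketch with the HDFS Java FileSystem API (the path is hypothetical) might look like:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/data.txt"); // hypothetical path
        // Per-file replication factor of 3 and a 64 MB block size,
        // as discussed above; both are configurable per file.
        FSDataOutputStream out = fs.create(file, true, 4096,
                (short) 3, 64L * 1024 * 1024);
        out.writeUTF("hello hdfs");
        out.close();
        System.out.println("replication = " +
                fs.getFileStatus(file).getReplication());
        fs.close();
    }
}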
II RELATED WORKS
D. Abadi [1] - In this paper, large-scale data analysis is done with a traditional DBMS. The data management is scalable, but data is replicated; this replication of data provides fault tolerance.
Y. Xu, P. Kostamaa, and L. Gao [3] - This paper deploys the Teradata parallel DBMS for large data warehouses. In recent years data volumes have increased, and some data, such as web logs and sensor data, is not managed by the Teradata EDW (Enterprise Data Warehouse). Researchers agree that both the parallel DBMS and the Map Reduce paradigm of Hadoop have advantages and disadvantages for the various business applications, and that both will exist for a long time. The integration of optimization opportunities is not possible for a DBMS running on a single node.
Farah Habib Chanchary [6] - Large datasets are efficiently stored among clusters of machines in cloud storage systems, so that holding the same information on more than one system keeps the datasets available even if one system's power fails.
J. Abadi and Avi Silberschatz [7] - Analysing massive datasets on very large clusters is done within the HadoopDB architecture for real-world applications such as business data warehousing. It approaches parallel databases in performance, but it still does not scale well and consumes a large amount of execution time.
III HBASE-MINING ARCHITECTURE
Fig 3.1 HBase-Mining Architecture (datasets and queries flow through the Hadoop/Hive cluster of name nodes, data nodes, job and task trackers and Hive query processors to the HDFS-HBase region servers, with an HBase master, in the Amazon cloud, producing the data report analysis with the market basket algorithm)
Hive queries are used to store and retrieve the MovieLens datasets. Efficient database storage with the market basket algorithm is used to maintain the heavy transactions of a retail supermarket business, and the aim of this system is to improve performance through parallelization of various operations such as loading the datasets, building indexes and evaluating queries; the Hadoop database integrated with HBase is used to store and retrieve the huge datasets without any loss of transactions. The Amazon cloud holds the HDFS file system storing the name node and data nodes along with the region servers where the datasets are stored once they are split from the HBase table. A Hive Query Processor (HQP) is also considered as one of the data nodes.
A. Advantage of having the HADOOP database
An RDBMS can store only a limited amount of data, uses SQL operations and cannot process data concurrently: there is no parallel processing, it is a single-threaded process, and it can handle only one dataset. Hadoop with HBase data storage, in contrast, can store huge amounts of data, up to petabytes; it makes use of NoSQL operations and executes operations on the data concurrently, achieving parallelization by splitting the work according to the number of processors. It maps the data into <key, value> pairs, which are then processed in the reduce function. RDBMSs store only structured data, whereas Hadoop data storage systems store both structured and unstructured data (e.g. mail, audio, video, medical transactions).
Table 1 Difference between NoSQL and SQL
SQL (RDBMS): limited data volume; SQL operations; single-threaded, no parallel execution; one dataset at a time; structured data only.
NoSQL (Hadoop with HBase): up to petabytes of data; NoSQL operations; concurrent execution parallelized across processors; structured and unstructured data.
IV MARKET BASKET ALGORITHMS
Market basket analysis is one of the most popular data mining algorithms. It is a cross-selling promotional technique that generates combinations of items, and association rules are used to identify pairs of similar items. The advantages of this algorithm are that its computations are simple and that different forms of data can be analysed; it also widens the selection of promotions for purchasing and the joint promotional opportunities.
The actionable information identified by market basket analysis includes the profitability of each purchase profile; its uses for marketing purposes include layouts and catalogs, selecting products for promotions, space allocation and product placement. Purchase patterns are identified from the items that tend to be purchased together and the items purchased sequentially.
A. Market Basket Analysis for Mapper
The input file is loaded into the Hadoop distributed file system, and the datasets are stored in blocks of 64 MB along with their <key, value> pairs. The mapping function is then performed, and each mapped dataset is stored in its respective HBase table.
B. Algorithm
1. The input file is loaded into HDFS.
2. The file is split into blocks of 64 MB.
3. The mapping function is performed on the basis of <k1, v1> pairs.
4. The mapped output is stored in the HBase tables.
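A sketch of what such a mapper could look like, assuming one transaction per input line with comma-separated items (an illustration, not the authors' exact code): it emits a <k1, v1> pair for every pair of items bought together.

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <k1, v1> = <"itemA,itemB", 1> for every pair of items appearing
// together in one transaction line.
public class BasketMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] items = value.toString().split(",");
        Arrays.sort(items); // canonical order so "bread,milk" == "milk,bread"
        for (int i = 0; i < items.length; i++) {
            for (int j = i + 1; j < items.length; j++) {
                pair.set(items[i].trim() + "," + items[j].trim());
                ctx.write(pair, ONE);
            }
        }
    }
}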
C. Market Basket Analysis for Reducer
The HBase market basket table is split across region servers to store the datasets, with the help of Zookeeper, and <k1, v1> pairs are allotted to each dataset involved in the transactions. The reduce function is performed to get the output file in the form of <k2, v2> pairs, and grouping and sorting functions are then performed to obtain the final <k3, v3> pairs in the HBase table.
D. Algorithm
1. The HBase market basket table is split into region servers to store the datasets with <k1, v1> pairs.
2. The <k2, v2> output pairs are obtained by the reduce function, in alphabetical order.
3. Sorting and grouping functions are performed by counting the value counts to obtain the final <k3, v3> pairs in the HBase table.
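A matching reducer sketch (again illustrative; in the system described above its output would typically be written to the HBase table, e.g. via HBase's TableOutputFormat, rather than to plain files):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts per item pair; the framework's shuffle has already grouped
// and sorted the <k1, v1> pairs by key, so the reducer only aggregates them
// into <k2, v2> pairs.
public class BasketReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable v : values) {
            count += v.get();
        }
        ctx.write(key, new IntWritable(count));
    }
}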
V RESULTS
The HBase product table holds around 1 million records, and 5.1 million records were used for the algorithm analysis. Performance results for processing 1 million records:
1 node - 4 min 37 sec
2 nodes - 3 min 31 sec
3 nodes - 2 min 56 sec
5 nodes - 2 min
The performance parameters tuned were the HBase heap memory and the caching parameter.
Fig:5.1 Performance analysis of data sets
VI CONCLUSION
The integration of the Hadoop ecosystem with HBase is used to store and retrieve huge datasets without any loss, and parallelization is achieved in loading the datasets, building the indexes and evaluating the queries. The Hadoop ecosystem can store both structured and unstructured datasets, and datasets can be added and deleted in parallel. The Map Reduce function splits the datasets and stores them according to the number of processors, and the market basket analysis mapper and reducer functions store and retrieve millions of records along with the <key, value> pairs allotted to each dataset. The resulting datasets are stored in the HBase table, and the performance analysis was carried out with up to five nodes.
VII FUTURE WORK
The Hadoop ecosystem integrated with the HBase database could be applied in various fields such as telecommunications, banking, insurance and medicine, to maintain public records in an efficient manner and to help avoid fraud.
VIII REFERENCES
[1] D. Abadi, "Data management in the cloud: Limitations and opportunities", IEEE Transactions on Data Engineering, Vol. 32, No. 1, March 2009.
[2] Daniel Warneke and Odej Kao, "Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud", IEEE Transactions on Parallel and Distributed Systems, Vol. 22, No. 6, June 2011.
[3] Y. Xu, P. Kostamaa, and L. Gao, "Integrating Hadoop and Parallel DBMS", Proceedings of the ACM SIGMOD International Conference on Data Management, New York, NY, USA, 2010.
[4] Huiqi Xu, Zhen Li et al., "CloudVista: Interactive and Economical Visual Cluster Analysis for Big Data in the Cloud", IEEE Conference on Cloud Computing, Vol. 5, No. 12, August 2012.
[5] Robert M. Losee and Lewis Church Jr., "Information Retrieval with Distributed Databases: Analytic Models of Performance", IEEE Transactions on Parallel and Distributed Systems, Vol. 15, No. 1, January 2004.
[6] Farah Habib Chanchary, "Data Migration: Connecting databases in the cloud", IEEE Journal on Computer Science, Vol. 40, pp. 450-455, March 2012.
[7] Kamil Bajda et al., "Efficient processing of data warehousing queries in a split execution environment", Journal on Data Warehousing, Vol. 35, ACM, June 2011.
[8] J. Abadi and Avi Silberschatz, "HadoopDB in action: Building real world applications", SIGMOD Conference on Database, Vol. 44, USA, September 2011.
[9] S. Chen, "Cheetah: A High Performance, Custom Data Warehouse on Top of Map Reduce", In Proceedings of VLDB, Vol. 23, pp. 922-933, September 2010.
[10] R. Vernica, M. Carey, and C. Li, "Efficient Parallel Set-Similarity Joins Using Map Reduce", In Proceedings of SIGMOD, Vol. 56, pp. 165-178, March 2010.
[11] Aster Data, "SQL Map Reduce framework", http://www.asterdata.com/product/advancedanalytics.php.
[12] Apache HBase, http://hbase.apache.org/.
[13] J. Lin and C. Dyer, "Data-Intensive Text Processing with Map Reduce", Morgan & Claypool Publishers, 2010.
[14] GNU Cord, http://www.coordguru.com/.