ISSN No: 2309-4893
International Journal of Advanced Engineering and Global Technology
Vol-04, Issue-06, November 2016
Big Data Mining: A Study
Hitesh Kataria¹, Shubham Grover¹
¹Student, Computer Science Engineering, Maharaja Agrasen Institute of Technology, India
Abstract: The last decade has seen an explosive growth of data.
Our pace of analysing data is a lot slower than its rate of production.
Data mining is the technique used to discover and predict useful
insights from the data, making it valuable and powerful. This
paper presents an overall study of this process, the methods used
and its major applications in today’s world.
I. INTRODUCTION
Data mining can be viewed as a result of the natural
evolution of information technology [1].
Data mining is a knowledge discovery process involving the
extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data. It is also referred to as knowledge discovery
(mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archaeology, data dredging,
information harvesting, business intelligence, etc. It is basically a
two-step process: rules are developed from the behaviour of a
given system (data sets), and these rules are then used to evaluate the
behaviour/outcome for given circumstances.
TABLE I
STEPS IN THE EVOLUTION OF DATA MINING [2]

| Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics |
|---|---|---|---|---|
| Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery |
| Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level |
| Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, MicroStrategy | Retrospective, dynamic data delivery at multiple levels |
| Data Mining (Emerging Today) | "What’s likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry) | Prospective, proactive information delivery |

It is an iterative process, consisting of the following major
steps:
Figure 1 Data Mining steps
www.ijaegt.com
A. Selection:
Data relevant to the task are retrieved from appropriate
sources
B. Preprocessing:
• Data cleaning: Fill in missing values, smooth
noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration: Integration of multiple databases,
data cubes, or files
• Data transformation: Normalization and aggregation
• Data reduction: Obtains a reduced representation in
volume that produces the same or similar analytical
results
• Data discretization: Part of data reduction, but
of particular importance, especially for
numerical data
C. Transformation:
Transform/consolidate the data into a new format for
processing [3].
D. Data mining:
Essential process in which intelligent methods are
applied in order to extract useful results.
E. Interpretation / evaluation:
Interpret the result/query to give a meaningful
report/information.
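The preprocessing operations listed under step B above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure of any particular tool: the function names, the choice of mean imputation for cleaning, min-max scaling for normalization, equal-width bins for discretization, and the sample `ages` attribute are all assumptions made for the example.

```python
from statistics import mean

def fill_missing(values):
    """Data cleaning: replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: nothing to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, bins):
    """Data discretization: map values in [0, 1] to equal-width bin indices."""
    return [min(int(v * bins), bins - 1) for v in values]

ages = [20, None, 40, 60]             # hypothetical attribute with one missing value
cleaned = fill_missing(ages)
normalized = min_max_normalize(cleaned)
binned = discretize(normalized, bins=2)
print(cleaned, normalized, binned)
```

Each function consumes the output of the previous one, mirroring the order in which cleaning, transformation and discretization are listed above.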
II TECHNIQUES
Several major data mining techniques have been developed.
We briefly examine them here to give an overview [4].
A. Clustering
Clustering is the process of grouping related records
together on the basis of their having similar values for
attributes. This process can be very effective if the data is
clustered, but not if the data is “smeared”. The technique is
based on unsupervised learning (i.e. the desired output for a
given input is not known). The most commonly used
algorithms are:
• Enhanced K-Means
• Orthogonal Partitioning
• Expectation Maximization

B. Classification
Classification is the process of assigning an object to a
certain class based on its similarity to previous
examples of other objects. It can be done with
reference to the original data or based on a model of that
data. Classification is similar to clustering in that it
also segments customer records into distinct segments,
called classes. But unlike clustering, a classification
analysis requires that the end-user/analyst know ahead
of time how the classes are defined. For example, classes
can be defined to represent the likelihood that a
customer defaults on a loan (Yes/No). Classification is
a supervised learning process. The most commonly used
algorithms are:
• Logistic Regression
• Naive Bayes
• Support Vector Machine
• Decision Tree
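The contrast drawn above between clustering (no labels) and classification (classes known in advance) can be illustrated on a single numeric attribute. The sketch below is a simplification under stated assumptions: it uses plain k-means rather than the enhanced variant named above, and a one-threshold decision stump as the simplest possible decision tree. The `debt_ratio` values, the `defaulted` labels and all function names are hypothetical.

```python
def kmeans_1d(points, k, iterations=10):
    """Unsupervised: plain k-means on 1-D points (requires k >= 2)."""
    lo, hi = min(points), max(points)
    # Start with centroids spread evenly between the smallest and largest value.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

def train_stump(values, labels):
    """Supervised: choose the threshold with the fewest training errors,
    where a value above the threshold is predicted as "Yes"."""
    best_threshold, best_errors = None, len(values) + 1
    for t in sorted(set(values)):
        errors = sum((v > t) != (label == "Yes") for v, label in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold

def predict(value, threshold):
    return "Yes" if value > threshold else "No"

debt_ratio = [0.1, 0.2, 0.25, 0.7, 0.8, 0.9]           # hypothetical customers
defaulted = ["No", "No", "No", "Yes", "Yes", "Yes"]    # known only in the supervised case

clusters = kmeans_1d(debt_ratio, k=2)                  # groups found without labels
threshold = train_stump(debt_ratio, defaulted)         # rule learned from labels
print(clusters, threshold, predict(0.5, threshold))
```

The clustering call never sees `defaulted`; it only groups similar values. The stump, by contrast, cannot be trained without the Yes/No classes being defined ahead of time.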
C. Summarization
Summarization is the generalization or abstraction of
data. It is the process of reducing a large volume of
information to a summary or abstract, preserving only
the most essential items. For example, the long-distance
calls of a customer can be summarized into
total minutes, total calls, total spending, etc., instead of
detailed call records.
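The call-record example above amounts to a group-and-aggregate operation. A minimal sketch follows, assuming each call is a `(customer, minutes, cost)` tuple; the record layout, the sample values and the `summarize` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical detailed call records: (customer, minutes, cost)
calls = [
    ("alice", 10, 1.50),
    ("alice", 5, 0.75),
    ("bob", 20, 3.00),
]

def summarize(calls):
    """Reduce detailed call records to one summary row per customer."""
    summary = defaultdict(
        lambda: {"total_calls": 0, "total_minutes": 0, "total_spending": 0.0}
    )
    for customer, minutes, cost in calls:
        row = summary[customer]
        row["total_calls"] += 1
        row["total_minutes"] += minutes
        row["total_spending"] += cost
    return dict(summary)

summary = summarize(calls)
print(summary)
```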
D. Association Rules
Association is one of the most popular data mining techniques;
it finds the most frequent item sets. Association strives to
discover patterns in data which are based upon
relationships between items in the same transaction.
Because of its nature, association is sometimes referred
to as the “relation technique”. These types of findings are
often used for targeting coupons/deals or advertising.
Apriori is the commonly used algorithm.
Fig 2
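The two quantities that association mining relies on, the support of an item set and the confidence of a rule, can be computed directly on a small set of transactions. The sketch below is not the Apriori algorithm itself: it counts all item sets of one fixed size by brute force, without Apriori's candidate pruning. The baskets and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, one set of items per transaction
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def frequent_itemsets(transactions, size, min_support):
    """Return item sets of the given size whose support is at least min_support."""
    counts = Counter()
    for basket in transactions:
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    n = len(transactions)
    return {itemset: c / n for itemset, c in counts.items() if c / n >= min_support}

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    with_antecedent = [b for b in transactions if antecedent <= b]
    if not with_antecedent:
        return 0.0
    return sum(consequent <= b for b in with_antecedent) / len(with_antecedent)

pairs = frequent_itemsets(transactions, size=2, min_support=0.5)
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
print(pairs, rule_confidence)
```

A rule such as "bread implies milk" would be a candidate for a coupon or advertising campaign only if both its support and its confidence exceed thresholds chosen by the analyst.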
E. Anomaly detection
In a large data set it is possible to get a picture of what
the data tends to look like in a typical case. Statistics
can be used to determine if something is notably
different from this pattern. For instance, the IRS could
model typical tax returns and use anomaly detection to
identify specific returns that differ from this for review
and audit.
F. Regression
Regression finds a function that models the data with
minimal error. It is a statistical methodology that is most
often used for numeric prediction. Regression analysis
is widely used for prediction and forecasting, where its
use has substantial overlap with the field of machine
learning. Regression analysis is also used to understand
which among the independent variables are related to
the dependent variable, and to explore the forms of
these relationships. In restricted circumstances,
regression analysis can be used to infer causal
relationships between the independent and dependent
variables. However, this can lead to illusions or false
relationships, so caution is advisable [5].
G. Prediction
Prediction, as its name implies, is a data
mining technique that discovers relationships among
independent variables and between
dependent and independent variables. For instance,
prediction analysis can be used in sales to
predict future profit: if we treat sales as the
independent variable, profit is the dependent
variable.
H. Time series Analysis
A time series is an ordered sequence of data points. It
consists of sequences of values or events changing
with time. Time series analysis is the process of using
statistical techniques to generate predictions (forecasts)
for future events based on known past events.
I. Attribute Importance
Attribute Importance provides an automated solution
for improving the speed and possibly the accuracy of
classification models built on data tables with a large
number of attributes. Using this technique, the analyst
can determine which of the attributes are most
useful/relevant for building models to solve a particular
problem [6]. By extracting as much information as
possible from a given data table using the smallest
number of attributes, a user can save significant
computing time and often build better models.
J. Sequence Discovery
Sequential pattern analysis is a data mining
technique that seeks to discover or identify similar
patterns, regular events or trends in transaction data
over a business period. For example, in sales, with historical
transaction data, businesses can identify a set of items
that customers buy together at different times in a year.
Businesses can then use this information to
recommend that customers buy these items with better deals,
based on their purchasing frequency in the past.
III APPLICATIONS
Various industries have been adopting data mining in
their mission-critical business processes to gain
competitive advantages [4]:
A. Intrusion detection
Data mining can help improve intrusion detection by
adding a level of focus to anomaly detection. By
identifying bounds for valid network activity, data
mining will aid an analyst in his/her ability to
distinguish attack activity from common everyday
traffic on the network.
B. Retail industry
The retail industry collects large amounts of data on sales
and customer shopping history. The quantity of data
collected continues to expand rapidly, especially due to
the increasing ease, availability and popularity of
business conducted on the web, or e-commerce. The retail
industry provides a rich source for data mining. Retail
data mining can help identify customer behavior,
discover customer shopping patterns and trends,
improve the quality of customer service, achieve better
customer retention and satisfaction, enhance goods
consumption ratios, design more effective goods
transportation and distribution policies, and reduce the
cost of business.
C. Telecommunications
Due to the development of new computer and
communication technologies, the telecommunication
industry is rapidly expanding. This is why
data mining has become very important to help
understand the business. Data mining in the
telecommunication industry helps in identifying
telecommunication patterns, catching fraudulent activities,
making better use of resources, and improving quality of
service.
D. Finance
With increasing economic globalization and
improvements in IT, large amounts of financial data
are being generated and stored. These can be subjected
to data mining techniques to discover hidden patterns
and obtain predictions for future trends and the
behaviour of the financial markets. This in turn would
result in improved marketplace responsiveness and
awareness, leading to reduced costs and increased
revenue.
Analytics can contribute to solving business problems
in banking and finance by finding patterns, causalities,
and correlations in business information and market
prices that are not immediately apparent to managers
because the data volume is too large or is generated too
quickly to be screened by experts. The managers of
banks may go a step further to find the sequences,
episodes and periodicity of the transaction behaviour of
their customers, which may help them in better
segmenting, targeting, acquiring, retaining and
maintaining a profitable customer base.
E. Cloud computing
Data mining techniques are used in cloud computing.
Implementing data mining techniques through
cloud computing allows users to retrieve
meaningful information from a virtually integrated data
warehouse, which reduces the costs of infrastructure and
storage [7]. Cloud computing uses Internet services
that rely on clouds of servers to handle tasks. Data
mining techniques help cloud computing provide
efficient, reliable and secure services for its users [8].
G. Biological Data Analysis
Nowadays there is vast growth in fields of
biology such as genomics, proteomics, functional
genomics and biomedical research. Biological data
mining is a very important part of bioinformatics.
Data mining contributes to biological data analysis
in the following aspects:
• Semantic integration of heterogeneous, distributed
genomic and proteomic databases.
• Alignment, indexing, similarity search and
comparative analysis of multiple nucleotide
sequences.
• Discovery of structural patterns and analysis of
genetic networks and protein pathways [10].
H. Agriculture
Data mining is emerging in the agriculture field for crop
yield analysis with respect to four parameters, namely
year, rainfall, production and area of sowing. Yield
prediction is a very important agricultural problem that
remains to be solved based on the available data. The
yield prediction problem can be solved by employing
data mining techniques such as K-Means, K-nearest
neighbor (KNN), artificial neural networks and
support vector machines (SVM) [11].
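As an illustration of one of the techniques named above, the following sketch applies K-nearest neighbor to yield prediction. It is a toy example under several assumptions: the records are invented, only rainfall and area of sowing are used as features (the year is ignored), the features are left unscaled, and the prediction is simply the mean production of the k closest past records.

```python
import math

# Hypothetical records: (year, rainfall_mm, area_ha, production_t)
records = [
    (2010, 800, 100, 250),
    (2011, 600, 100, 180),
    (2012, 820, 110, 270),
    (2013, 590, 90, 160),
]

def predict_production(rainfall, area, k=2):
    """KNN regression: average the production of the k most similar past records."""
    def distance(record):
        # Euclidean distance in the (rainfall, area) plane
        return math.hypot(record[1] - rainfall, record[2] - area)
    nearest = sorted(records, key=distance)[:k]
    return sum(record[3] for record in nearest) / len(nearest)

estimate = predict_production(rainfall=810, area=105)
print(estimate)
```

In practice the features would normally be rescaled before computing distances, since rainfall and area are measured in different units.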
F. Biomedical analysis
In recent years, data mining has been widely used in
areas of medical science such as biomedicine, DNA,
genetics and medicine. In the area of genetics, an
important goal is to understand the mapping
relationship between variation in human DNA
sequences and disease susceptibility. Data mining
is a very important tool to help improve the diagnosis,
prevention and treatment of diseases [9].
IV CONCLUSIONS
In this paper, we discussed the data mining process and
briefly presented the major steps involved in it. The
most frequently used techniques for knowledge discovery are
also mentioned. Lastly, we have given a short description of the
major applications of this field.
REFERENCES
[1] Jiawei Han, Micheline Kamber, Jian Pei, Data Mining:
Concepts and Techniques, 2nd edition
[2] Dr. Borne, 2005, UMUC Data Mining Lecture 21, Data Mining UMUC
CSMN 667 Lecture #2.
[3] Nong Ye, Data Mining: Theories, Algorithms, and Examples
[4] Aakanksha Bhatnagar, Shweta P. Jadye, Madan Mohan Nagar, “Data
Mining Techniques & Distinct Applications: A Literature Review”,
International Journal of Engineering Research & Technology
(IJERT), Vol. 1, Issue 9, November 2012
[5] R. Kaur, S. Kaur, A. Kaur, R. Kaur, A. Kaur, “An Overview of
Database management System, Data warehousing and Data Mining”,
IJARCCE, Vol. 2, Issue 7, July 2013.
[6] Java Data Mining: Strategy, Standard, and Practice: A Practical
Guide for By Mark F. Hornick, Erik Marcadé, Sunil Venkayala
[7] Ruxandra-Ştefania Petre, “Data mining in Cloud Computing”,
Database Systems Journal, vol. III, no. 3/2012
[8] Smita, Priti Sharma, “Use of Data Mining in Various Field: A Survey
Paper”, IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN:
2278-0661, p-ISSN: 2278-8727, Volume 16, Issue 3, Ver. V (May-Jun. 2014), PP 18-21
[9] Simmi Bagga, Dr. G.N. Singh, “Applications of Data Mining”,
International Journal for Science and Emerging Technologies with
Latest Trends, 2012
[10] S.D. Gheware, A.S. Kejkar, S.M. Tondare, “Data Mining: Task, Tools,
Techniques and Applications”, International Journal
of Advanced Research in Computer and Communication
Engineering, Vol. 3, Issue 10, October 2014
[11] D. Ramesh, B. Vishnu Vardhan, “Data Mining Techniques and
Applications to Agricultural Yield Data”, International Journal of
Advanced Research in Computer and Communication Engineering,
Vol. 2, Issue 9, September 2013