International Journal of Research In Science & Engineering
Special Issue: Techno-Xtreme 16
e-ISSN: 2394-8299
p-ISSN: 2394-8280
REVIEW ON DATA MINING
Madhuri Chawale1, Mayuri Lahe2, Prof. S. A. Zunke3
1 Student, I.T. Department, J.D.I.E.T. Yavatmal, [email protected]
2 Student, I.T. Department, J.D.I.E.T. Yavatmal, [email protected]
3 Assistant Professor, I.T. Department, J.D.I.E.T. Yavatmal, [email protected]
ABSTRACT
This paper focuses on data mining and the current trends associated with it. It presents an overview of
data mining systems and clarifies how data mining and knowledge discovery in databases (KDD) are related both
to each other and to related fields. Data mining is a technology used for knowledge discovery: it
searches for significant relationships such as patterns, associations and changes among variables in
databases. These techniques facilitate useful data interpretations for the banking sector to avoid customer
attrition. Customer retention is one of the most important factors to be analysed in today’s competitive business
environment. Fraud is also a significant problem in the banking sector. Detecting and preventing fraud is
difficult because fraudsters develop new schemes all the time, and the schemes grow more and more
sophisticated to elude easy detection. We have also tried to identify research areas in data mining
where further work can be continued.
Keywords: Data Mining, KDD
1. INTRODUCTION
In today’s computer age, data storage has grown to such a size that only
computerized methods can find information among the large repositories of data available to
organizations, whether online or offline. Data mining was conceptualized in the 1990s as a means of
addressing the problem of analyzing the vast repositories of data that are available to mankind and are being
added to continuously.
Data mining is necessary to extract hidden, useful information from the large datasets in a given
application. This usefulness relates to the user's goal; in other words, only the user can determine whether the
resulting knowledge answers that goal. Data mining (DM) is also referred to as analytical intelligence and business intelligence.
Because data mining is a relatively new concept, it has been defined in various ways by various authors in the
recent past. Some widely used techniques in data mining include artificial neural networks, genetic
algorithms, the K-nearest neighbor method, decision trees, and data reduction.
Data mining is part of the process of knowledge discovery in databases (KDD), whose goal is to find
hidden and interesting information. KDD involves several important steps that help to convert raw data
into knowledge. Data mining is just one step in KDD; it
is used to extract interesting patterns from data that are easy to perceive, interpret, and manipulate. Several
major kinds of data mining methods, including generalization, characterization, classification, clustering,
association, evolution, pattern matching, data visualization, and meta-rule guided mining, will be reviewed.
The explosive growth of databases makes the scalability of data mining techniques increasingly important.
Data mining algorithms are able to mine vast amounts of data rapidly. Essentially, the two
types of data mining approaches differ in whether they seek to build models or to find patterns. The first
approach, concerned with building models, is, apart from the problems inherent in the large size of the
data sets, similar to conventional exploratory statistical methods. The objective is to produce an overall
summary of a set of data in order to identify and describe the main features of the shape of the distribution. The
second type of data mining approach, pattern detection, seeks to identify small departures from the norm, to
detect unusual patterns of behavior.
IJRISE| www.ijrise.org|[email protected] [537-543]
Scientific data mining distinguishes itself in the sense that the nature of the datasets is often very
different from traditional market-driven data mining applications. In this work, a detailed survey is carried
out on data mining applications in the healthcare sector, the types of data used and the details of the information
extracted. Data mining algorithms applied in the healthcare industry play a significant role in the prediction and
diagnosis of diseases. A large number of data mining applications are found in medical
areas such as the medical device industry, the pharmaceutical industry and hospital management. The
purpose behind the application of data mining is to find useful and hidden knowledge in the database.
2. LITERATURE REVIEW
A literature review is a text that covers the critical points of current knowledge, including substantive
findings as well as theoretical and methodological contributions to a particular topic. Literature reviews are secondary
sources and do not report any new or original experimental work. Hian Chye Koh and Gerald Tan mainly
discuss data mining and its applications in major areas such as treatment effectiveness, management of
healthcare, detection of fraud and abuse, and customer relationship management. Jayanthi Ranjan presents how
data mining discovers and extracts useful patterns from large data sets. That paper
demonstrates the ability of data mining to improve the quality of the decision-making process in the pharma
industry. One of the issues in the pharma industry is adverse reactions to drugs.
M. Durairaj and K. Meena illustrate a hybrid prediction system consisting of Rough Set Theory (RST) and
an Artificial Neural Network (ANN) for processing medical data. The process of developing a new data
mining technique and software to assist competent solutions for medical data analysis is explained.
Experiments are carried out on a spermatological data set for predicting the quality of animal semen. The
proposed hybrid prediction system is applied to pre-process the medical database and to train the ANN for
production prediction. The prediction accuracy is assessed by comparing the observed and predicted cleavage
rates.
K. Srinivas, B. Kavitha Rani and Dr. A. Goverdhan mainly examine the potential use of
classification-based data mining techniques such as rule-based, decision tree, Naïve Bayes and artificial
neural network methods on the massive volume of healthcare data. Using medical profiles such as age, sex, blood pressure and blood sugar,
such a system can predict the likelihood of patients getting heart disease.
3. DATA MINING AND KDD PROCESS:
Data mining is the process of analysing large amounts of data and picking out the relevant
information. It refers to extracting or mining knowledge from large amounts of data. Data mining is the
fundamental stage in the process of extracting useful, comprehensible and previously
unknown knowledge from large quantities of data stored in different formats, with the objective of improving the
decisions of the companies and organizations where the data are collected. However, data mining, and the overall
process known as Knowledge Discovery from Databases (KDD), is usually expensive, especially
in the stages of business objectives elicitation, data mining objectives elicitation, and data preparation. This is
especially the case each time data mining is applied to a blood bank. Data mining can be defined as the
extraction of the relevant information, i.e. knowledge, from large repositories of data, which is
why it is also called knowledge mining. Other terms linked with data mining include
knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology and data
dredging. Data mining is also popularly treated as a synonym for Knowledge Discovery in Databases (KDD), which refers to the
nontrivial extraction of implicit, previously unknown and potentially useful information from data in
databases.
The KDD process starts with selecting the data needed for the data mining process, which may be obtained
from many different and heterogeneous data sources. Preprocessing includes finding incorrect or missing data,
and many different activities may be performed at this stage. Erroneous data may be corrected or removed,
whereas missing data must be supplied. Preprocessing also includes removal of noise or outliers, collecting
necessary information to model or account for noise, and accounting for time sequence information and known
changes. Transformation converts the data into a common format for processing. Some data may be
encoded or transformed into a more usable format. Data reduction, dimensionality reduction and data
transformation methods may be used to reduce the number of possible data values being considered. Data
mining is the task performed to generate the desired result. Interpretation/evaluation is how the data
mining results are presented to the users, which is extremely important because the usefulness of the result
depends on it. Different kinds of knowledge require different kinds of representation, e.g. classification,
clustering, association rules etc. Many people treat data mining as a synonym for another popularly used
term, Knowledge Discovery from Data. Fig. 1 shows data mining as simply an
essential step in the process of KDD, i.e. Knowledge Discovery from Data.
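The preprocessing and transformation activities described above can be illustrated with a minimal Python sketch. The record layout, the field name `age` and the plausible value range are hypothetical, and mean imputation is only one assumed way of supplying missing data.

```python
from statistics import mean

def preprocess(records, field, low, high):
    """Remove erroneous values, supply missing ones, and rescale to [0, 1]."""
    # Erroneous data: drop records whose value lies outside the plausible range.
    kept = [r for r in records if r[field] is None or low <= r[field] <= high]
    # Missing data: supply the mean of the observed values (an assumed strategy).
    observed = [r[field] for r in kept if r[field] is not None]
    fill = mean(observed)
    values = [fill if r[field] is None else r[field] for r in kept]
    # Transformation: min-max scaling into a common [0, 1] format.
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    return [(v - lo) / span for v in values]

# Hypothetical patient ages; None marks a missing value, 430 a typing error.
patients = [{"age": 20}, {"age": None}, {"age": 430}, {"age": 60}, {"age": 40}]
scaled = preprocess(patients, "age", 0, 120)
```

In this sketch the record with age 430 is removed as erroneous, and the missing age is replaced by the mean of the remaining observed ages before scaling.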
4. STEPS OF KDD PROCESS:
The knowledge discovery in databases (KDD) process comprises a few steps leading from raw
data collections to some form of new knowledge. The iterative process consists of the following steps:
4.1 Data cleaning:
Also known as data cleansing, this is a fundamental step in which noisy data and irrelevant data are removed
from the collection.
4.2 Data integration:
In this step, multiple data sources, often heterogeneous, may be combined in a common source. Data
selection also takes place at this step: the data relevant to the analysis is decided on and retrieved from the data collection.
4.3 Data transformation:
Also known as data consolidation, this is the stage in which the selected data is transformed into forms
appropriate for the mining procedure.
4.4 Data mining:
It is the core step of the KDD process, in which clever techniques are applied to extract potentially
useful patterns.
4.5 Pattern evaluation:
In this step, strictly interesting patterns representing knowledge are identified based on given measures.
4.6 Knowledge representation:
This is the final step, in which the discovered knowledge is visually represented to the user. It is an
essential step that uses visualization techniques to help users understand and interpret the data mining results.
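The sequence of steps above can be sketched as a chain of small Python functions. The two data sources, the field names and the support threshold are hypothetical, and the mining step is a plain frequency count chosen only to keep the example short; it is not a method proposed in this paper.

```python
from collections import Counter

def clean(rows):
    # Data cleaning: drop rows with missing fields.
    return [r for r in rows if all(v is not None for v in r.values())]

def integrate(*sources):
    # Data integration: combine several sources into one collection.
    return [row for source in sources for row in source]

def select(rows, fields):
    # Data selection: keep only the attributes relevant to the analysis.
    return [{f: r[f] for f in fields} for r in rows]

def transform(rows):
    # Data transformation: bring text values into one common form.
    return [{k: str(v).strip().lower() for k, v in r.items()} for r in rows]

def mine(rows, field):
    # Data mining: count how often each value occurs.
    return Counter(r[field] for r in rows)

def evaluate(counts, total, min_support):
    # Pattern evaluation: keep values whose relative frequency is high enough.
    return {v: c / total for v, c in counts.items() if c / total >= min_support}

# Hypothetical records from two clinics.
clinic_a = [{"id": 1, "diagnosis": "Flu "}, {"id": 2, "diagnosis": None}]
clinic_b = [{"id": 3, "diagnosis": "flu"}, {"id": 4, "diagnosis": "Asthma"}]

combined = integrate(clean(clinic_a), clean(clinic_b))
rows = transform(select(combined, ["diagnosis"]))
patterns = evaluate(mine(rows, "diagnosis"), len(rows), min_support=0.5)
print(patterns)  # knowledge representation, here as plain text
```

Each function corresponds to one step, so a step can be replaced (for example, a different mining technique) without touching the others.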
5. DATA ANALYSIS TASKS AND TECHNIQUES:
Several data mining problem types or analysis tasks are typically encountered during a data mining
project. Depending on the desired outcome, several data analysis techniques with different goals may be
applied successively to achieve a desired result. For example, to determine which customers are likely to buy
a new product, a business analyst may need first to use cluster analysis to segment the customer database, and
then apply regression analysis to predict buying behavior for each cluster.
5.1 Data Summarization:
Data summarization gives the user an overview of the structure of the data and is generally carried out in the early
stages of a project. This type of initial exploratory data analysis can help to understand the nature of the data
and to find potential hypotheses for hidden information. Simple descriptive statistical and visualization
techniques generally apply.
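As an illustration of the simple descriptive statistics mentioned above, the following sketch summarizes one numeric attribute with Python's standard `statistics` module. The blood pressure readings are hypothetical.

```python
from statistics import mean, median, stdev

def summarize(values):
    """Return basic descriptive statistics for one numeric attribute."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation
    }

# Hypothetical systolic blood pressure readings.
summary = summarize([110, 120, 130, 140, 150])
```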
5.2 Segmentation:
Segmentation separates the data into interesting and meaningful sub-groups or classes. In this case,
the analyst can hypothesize certain subgroups as relevant for the business question based on prior knowledge
or based on the outcome of data description and summarization. Automatic clustering techniques can detect
previously unsuspected and hidden structures in data that allow segmentation. Clustering techniques,
visualization and neural nets generally apply.
5.3 Classification:
Classification assumes that a set of objects characterized by some attributes or features belongs to different classes.
The class label is a discrete qualitative identifier; for example, large, medium, or small. The objective is to
build classification models that assign the correct class to previously unseen and unlabeled objects.
Classification models are mostly used for predictive modeling. Discriminant analysis, decision tree, rule
induction methods, and genetic algorithms generally apply.
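A minimal sketch of assigning a class to a previously unseen object is given below. It uses the K-nearest neighbor method named in the introduction rather than the methods listed in this subsection, because it fits in a few lines. The training objects and their two numeric features are hypothetical.

```python
from collections import Counter
from math import dist

def knn_classify(training, point, k=3):
    """Assign `point` the majority class among its k nearest labelled objects."""
    # Sort labelled objects by Euclidean distance to the unseen object.
    nearest = sorted(training, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical objects described by two features, each with a size class.
training = [
    ((1.0, 1.0), "small"), ((1.2, 0.9), "small"), ((0.8, 1.1), "small"),
    ((5.0, 5.0), "medium"), ((5.2, 4.9), "medium"), ((4.8, 5.1), "medium"),
    ((9.0, 9.0), "large"), ((9.1, 8.8), "large"), ((8.9, 9.2), "large"),
]
label = knn_classify(training, (5.1, 5.0))
```

The class label is discrete, as described above; the model simply reuses the labels of the most similar known objects.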
5.4 Prediction:
Prediction is very similar to classification. The difference is that in prediction, the class is not a
qualitative discrete attribute but a continuous one. The goal of prediction is to find the numerical value of
the target attribute for unseen objects; this problem type is also known as regression, and if the prediction
deals with time series data, then it is often called forecasting. Regression analysis, decision trees, and neural
nets generally apply. The least-squares criterion is a common method used in regression analysis; it finds
the regression coefficients that minimize the sum of the squared deviations of the predicted values of the
model from the observed values of the data.
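The least-squares criterion can be made concrete for the simplest case, a straight line y = a + b*x with one input attribute. The sketch below uses the standard closed-form estimates; the data points are hypothetical and chosen to lie exactly on a line.

```python
def fit_line(xs, ys):
    """Fit y = a + b*x by minimizing the sum of squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares estimates for simple linear regression.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(model, x):
    # Numerical prediction for an unseen object.
    intercept, slope = model
    return intercept + slope * x

# Hypothetical data lying exactly on y = 1 + 2x.
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

Unlike the classification case, the predicted target here is a continuous value.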
5.5 Dependency analysis:
Dependency analysis deals with finding a model that describes significant dependencies (or
associations) between data items or events. Dependencies can be used to predict the value of an item given
information on other data items. Dependency analysis has close connections with classification and
prediction because the dependencies are implicitly used for the formulation of predictive models.
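One common way to quantify a dependency between data items is through the support and confidence of an association rule. The sketch below assumes that formulation, which this paper does not spell out, and uses hypothetical prescription baskets.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent`."""
    both = support(transactions, set(antecedent) | set(consequent))
    # Assumes the antecedent occurs at least once; otherwise this divides by zero.
    return both / support(transactions, antecedent)

# Hypothetical prescription baskets.
baskets = [
    {"aspirin", "statin"},
    {"aspirin", "statin", "insulin"},
    {"aspirin"},
    {"insulin"},
]
rule_support = support(baskets, {"aspirin", "statin"})
rule_confidence = confidence(baskets, {"aspirin"}, {"statin"})
```

A high confidence for the rule aspirin -> statin would mean that knowing one item helps to predict the other, which is the link to predictive modeling noted above.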
6. CLUSTERING:
A cluster is a subset of objects which are “similar”: a subset of objects such that the distance
between any two objects in the cluster is less than the distance between any object in the cluster
and any object not located inside it. Clustering is a process of partitioning a set of data (or objects)
into a set of meaningful sub-classes, called clusters. The quality of a clustering result depends
on both the similarity measure used by the method and its implementation. The quality of a
clustering method is also measured by its ability to discover some or all of the hidden patterns.
Fig. process of clustering
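The partitioning of objects into clusters can be sketched with k-means, a widely used distance-based method that this paper does not name explicitly. The points and the starting centroids are hypothetical, and Euclidean distance is assumed as the similarity measure.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Partition `points` into clusters around the given starting centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical two-dimensional objects forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
clusters, centroids = k_means(points, centroids=[(0, 0), (10, 10)])
```

Changing the distance function changes which objects count as similar, which is why the result depends on the similarity measure.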
7. DATA MINING APPLICATIONS IN HEALTHCARE SECTOR:
The healthcare industry today generates large amounts of complex data about patients,
hospital resources, disease diagnosis, electronic patient records, medical devices etc.
These large amounts of data are a key resource to be processed and analysed for knowledge
extraction that supports cost savings and decision making. Data mining
applications in healthcare can be grouped into the following broad categories:
7.1 Treatment effectiveness:
Data mining applications can be developed to evaluate the effectiveness of medical
treatments. Data mining can deliver an analysis of which course of action proves effective
by comparing and contrasting causes, symptoms, and courses of treatment.
7.2 Healthcare management:
To aid healthcare management, data mining applications can be developed to better identify and track chronic disease states and
high-risk patients, design appropriate interventions, and reduce the number of hospital admissions and claims.
Data mining can also be used to analyze massive volumes of data and statistics to search
for patterns that might indicate an attack by bio-terrorists.
7.3 Customer relationship management:
Customer relationship management is a core approach to managing interactions between commercial
organizations, typically banks and retailers, and their customers, and it is no less important in a healthcare
context. Customer interactions may occur through call centers, physicians’ offices, billing departments,
inpatient settings, and ambulatory care settings.
7.4 Fraud and abuse:
Data mining applications that attempt to detect fraud and abuse often establish norms and then identify unusual or abnormal patterns of claims by
physicians, clinics, or others. Such fraud and abuse
applications can highlight inappropriate prescriptions or referrals and fraudulent insurance and medical
claims.
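The idea of establishing a norm and then flagging departures from it can be sketched as follows. The claim amounts are hypothetical, and the rule of flagging values more than three standard deviations from the mean is an assumed, deliberately simple criterion rather than a method taken from the cited work.

```python
from statistics import mean, pstdev

def flag_unusual(history, new_claims, threshold=3.0):
    """Return the new claims lying more than `threshold` deviations from the norm."""
    # Establish the norm from past claim amounts.
    centre = mean(history)
    spread = pstdev(history)
    # A claim is unusual when it lies far from the norm relative to the spread.
    return [c for c in new_claims if abs(c - centre) > threshold * spread]

# Hypothetical claim amounts for one provider.
history = [100, 110, 90, 105, 95, 100]
flagged = flag_unusual(history, [98, 104, 400])
```

Flagged claims are candidates for manual review only; an unusual amount is not by itself evidence of fraud.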
7.5 Medical Device Industry:
Medical devices are an important part of the healthcare system and are widely used for
communication. Mobile communications and low-cost wireless bio-sensors have paved the way for the
development of mobile healthcare applications that supply a convenient, safe and constant way of monitoring
the vital signs of patients. Ubiquitous Data Stream Mining (UDM) techniques such as lightweight, one-pass
data stream mining algorithms can perform real-time analysis on board small/mobile devices while
taking into account available resources such as battery charge and memory.
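To show what a lightweight, one-pass computation looks like, the sketch below maintains a running mean and variance of sensor readings using Welford's method. This is a generic streaming technique chosen for illustration, not a specific UDM algorithm from the literature, and the heart-rate readings are hypothetical.

```python
class RunningStats:
    """One-pass mean and variance using constant memory (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Each reading is processed once and then discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; reported as 0.0 until two readings have arrived.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# Hypothetical heart-rate readings arriving from a wearable sensor.
stats = RunningStats()
for reading in [70, 72, 68, 71, 69]:
    stats.update(reading)
```

Because only three numbers are stored regardless of how many readings arrive, memory use stays constant, which suits devices with limited resources.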
7.6 Pharmaceutical Industry:
Data mining technology is being used to help pharmaceutical firms manage their inventories and
develop new products and services. A deep understanding of the knowledge hidden in pharma data is vital
to a firm’s competitive position and organizational decision-making.
8. LIMITATIONS:
8.1 Security issues:
Although companies have a lot of personal information about us available online, they do not have sufficient
security systems in place to protect that information.
8.2 Misuse of information:
Trends obtained through data mining that are intended for marketing or some other ethical
purpose may be misused. Unethical businesses or people may use the information obtained through data
mining to take advantage of vulnerable people.
9. CONCLUSION
This paper presents an overview of data mining and its techniques, which are used to extract
interesting patterns and to discover significant relationships among variables stored in huge datasets. Data
mining is needed in many fields to extract useful information from large amounts of data. Large
amounts of data are maintained in every field to keep different records such as medical data, scientific data,
educational data, demographic data, financial data, marketing data etc. Therefore, different ways have been
found to automatically analyze the data, to summarize it, to discover and characterize trends in it and to
automatically flag anomalies. Several data mining techniques have been introduced by different researchers.
These techniques are used for classification, for clustering, and for finding interesting patterns. In our future
work, data mining techniques will be implemented on a blood donor data set collected from a blood bank
center, in order to predict blood donors' behavior and attitude.
REFERENCES
[1] Ankit Bhardwaj, Arvind Sharma, V.K. Shrivastava, Data Mining Techniques and Their Implementation in
Blood Bank Sector, “International Journal of Engineering Research and Applications (IJERA)”, ISSN:
2248-9622, www.ijera.com, Vol. 2, Issue 4, July-August 2012, pp. 1303-1309
[2] S.D. Gheware, A.S. Kejkar, S.M. Tondare, Data Mining: Task, Tools, Techniques and Applications,
“International Journal of Advanced Research in Computer and Communication Engineering”, Vol. 3,
Issue 10, October 2014
[3] Shyam Sundaram and Santhanam T, Data Mining Techniques, “IJCSI International Journal of
Computer Science Issues”, Vol. 8, Issue 5, No 2, September 2011, ISSN (Online): 1694-0814
[4] M. Durairaj, V. Ranjani, Data Mining Applications in Healthcare Sector, “International Journal of
Scientific & Technology Research”, Volume 2, Issue 10, October 2013