International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, Volume 2, Issue 2, February 2012)
Survey on Data Mining
Vibha Maduskar1, Prof. Yashovardhan Kelkar2
1 A-4/1 Mahananda Nagar, Ujjain (M.P.)
2 M.I.T.S., Ujjain
1 [email protected]
2 [email protected]
Abstract - In the Information Technology era, information
plays a vital role in every sphere of human life. It is very
important to gather data from different data sources, store
and maintain the data, and generate information and
knowledge from it. Analyzing this vast amount of data and
drawing fruitful conclusions and inferences requires special
tools called data mining tools. This paper gives an overview
of data mining, the research that could be done in different
areas, and some of its applications.
I. INTRODUCTION
Data Mining is the process of extracting hidden knowledge
from large volumes of raw data. The knowledge must be new,
not obvious, and usable. Data mining has been defined as
“the nontrivial extraction of implicit, previously unknown,
and potentially useful information from data” [1]. It is “the
science of extracting useful information from large
databases”. Data mining is one of the tasks in the process
of knowledge discovery from databases (KDD).
The steps involved in knowledge discovery are:
1. Data Selection: The data relevant to the analysis is
decided on and retrieved from the various data locations.
2. Data Preprocessing: In this stage data cleaning and
data integration are carried out.
3. Data Cleaning: Also known as data cleansing; in this
phase noisy data and irrelevant data are removed from the
collected data.
4. Data Integration: In this stage, multiple data sources,
often heterogeneous, are combined into a common source.
5. Data Transformation: In this phase the selected data
is transformed into forms appropriate for the mining
procedure.
6. Data Mining: This is the crucial step in which clever
techniques are applied to extract potentially useful
patterns. The decision is made about the data mining
technique to be used.
7. Interpretation and Evaluation: In this step,
interesting patterns representing knowledge are identified
based on given measures. The discovered knowledge is
visually presented to the user. This essential step uses
visualization techniques to help users understand the
results.
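The knowledge discovery steps described in this section can be sketched as a small pipeline. The following is only an illustrative sketch under stated assumptions: the toy records, the field names and the helper functions (select, clean, integrate, transform, mine, evaluate) are hypothetical and do not come from the paper, and the "mining" step is reduced to simple pattern counting.

```python
from collections import Counter

# Hypothetical toy records from two sources; None marks a noisy/missing value.
source_a = [{"id": 1, "age": 34, "city": "Ujjain"},
            {"id": 2, "age": None, "city": "Indore"}]
source_b = [{"id": 3, "age": 51, "city": "Ujjain"},
            {"id": 4, "age": 47, "city": "Ujjain"}]

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for s in sources for r in s]

def transform(records):
    """Data transformation: bin the numeric age into a categorical band."""
    return [{"age_band": "40+" if r["age"] >= 40 else "under 40",
             "city": r["city"]} for r in records]

def mine(records):
    """Data mining (simplified): count co-occurring (age_band, city) pairs."""
    return Counter((r["age_band"], r["city"]) for r in records)

def evaluate(patterns, min_count=2):
    """Interpretation/evaluation: keep patterns that meet a frequency measure."""
    return {p: c for p, c in patterns.items() if c >= min_count}

def kdd(sources, fields):
    selected = [select(s, fields) for s in sources]   # step 1
    cleaned = [clean(s) for s in selected]            # steps 2-3
    integrated = integrate(*cleaned)                  # step 4
    transformed = transform(integrated)               # step 5
    return evaluate(mine(transformed))                # steps 6-7

patterns = kdd([source_a, source_b], ["age", "city"])
print(patterns)
```

In a real setting each stage would be far richer; the point of the sketch is only the order of the stages and the fact that each one consumes the output of the previous one.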
II. APPLICATIONS OF DATA MINING
As data mining matures, new and increasingly innovative
applications for it emerge. Although a wide variety of data
mining scenarios can be described, for the purpose of this
paper the applications of data mining are divided into the
following categories: • Healthcare • Finance • Retail
industry • Telecommunication • Text Mining & Web Mining
• Higher Education.
2.1 Medical: The past decade has seen explosive growth in
biomedical research, ranging from the development of new
pharmaceuticals and cancer therapies to the identification
and study of the human genome by discovering large-scale
sequencing patterns and gene functions. Recent research has
also addressed approaches for disease diagnosis, prevention
and treatment.
Fig.1.1 KDD process
This work presented three search constraints that had the
following objectives: producing only medically useful rules,
reducing the number of discovered rules and improving
running time. First, data set attributes are constrained to
belong to user-specified groups to eliminate uninteresting
value combinations and to reduce the combinatorial explosion
of rules. Second, attributes are constrained to appear
either in the antecedent or in the consequent to discover
only predictive rules. Third, rules are constrained to have
a threshold on the number of attributes to produce fewer
and simpler rules.
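The three search constraints can be illustrated with a small brute-force rule miner. This is a hedged sketch, not the algorithm of the cited work: the transactions, item names, group assignments and thresholds are all hypothetical, and the group constraint is read here as "at most one item per user-specified group in a rule", which is one plausible interpretation of the description above.

```python
from itertools import combinations

# Hypothetical binned patient records; item names are illustrative only.
transactions = [
    {"age>=60", "smoker", "artery_diseased"},
    {"age>=60", "high_chol", "artery_diseased"},
    {"age<60", "artery_healthy"},
    {"age>=60", "smoker", "high_chol", "artery_diseased"},
    {"age<60", "smoker", "artery_healthy"},
]

# Constraint 1 (assumed reading): every item belongs to a user-specified
# group, and a rule may use at most one item from each group.
group = {"age>=60": "age", "age<60": "age", "smoker": "risk",
         "high_chol": "chol", "artery_diseased": "artery",
         "artery_healthy": "artery"}
# Constraint 2: these items may appear only in the consequent; all other
# items may appear only in the antecedent.
consequent_items = {"artery_diseased", "artery_healthy"}
# Constraint 3: maximum number of items in a rule (antecedent + consequent).
MAX_SIZE = 3

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def mine_rules(min_support=0.4, min_confidence=0.8):
    antecedent_items = sorted(set(group) - consequent_items)
    rules = []
    for size in range(1, MAX_SIZE):  # antecedent sizes 1 .. MAX_SIZE - 1
        for ante in combinations(antecedent_items, size):
            if len({group[i] for i in ante}) < len(ante):
                continue  # two items from the same group: skip
            for cons in sorted(consequent_items):
                s = support(set(ante) | {cons})
                if s < min_support:
                    continue
                conf = s / support(set(ante))  # confidence of ante => cons
                if conf >= min_confidence:
                    rules.append((ante, cons, s, conf))
    return rules

rules = mine_rules()
for ante, cons, s, conf in rules:
    print(f"{' & '.join(ante)} => {cons} (support={s:.2f}, confidence={conf:.2f})")
```

Each constraint prunes the search space before support is counted, which is where the reduction in rule count and running time described above would come from.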
Experimental results provide evidence that decision trees
are less effective than constrained association rules at
predicting disease with several related target attributes,
due to low confidence factors (i.e. low reliability), slight
overfitting, rule complexity for unrestricted trees (i.e.
long rules) and data set fragmentation (i.e. small data
subsets). Therefore, constrained association rules can be
an alternative to other statistical and machine learning
techniques applied in medical problems where there is a
requirement to predict several target attributes based on
subsets of independent numeric and categorical attributes
[2].
[2.1.1]Heart Disease Prediction
Some hospitals use decision support systems, but these are
largely limited. They can answer simple queries like
“What is the average age of patients who have heart
disease?”, “How many surgeries had resulted in hospital
stays longer than 10 days?”, “Identify the female patients
who are single, above 30 years old, and who have been
treated for cancer.” However, they cannot answer complex
queries like “Given patient records, predict the
probability of patients getting a heart disease.” [10].
[2.1.1.1] By Naïve Bayes
The Naïve Bayes classifier technique is particularly
suited when the dimensionality of the inputs is high.
Despite its simplicity, Naïve Bayes can often outperform
more sophisticated classification methods. The Naïve Bayes
model identifies the characteristics of patients with heart
disease. It shows the probability of each input attribute
for the predictable state [10]. A Naïve Bayes
implementation is preferred:
1) When the dimensionality of the data is high.
2) When the attributes are independent of each other.
3) When more efficient output is wanted, as compared to
the output of other methods.
The Decision Support in Heart Disease Prediction System
(DSHDPS) is developed using the Naïve Bayesian
classification technique. The system extracts hidden
knowledge from a historical heart disease database. Its
authors report it as the most effective model for predicting
patients with heart disease. The model can answer complex
queries, each with its own strength with respect to ease of
model interpretation, access to detailed information and
accuracy. DSHDPS can be further enhanced and expanded.
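To make the Naïve Bayes idea concrete, the sketch below trains a tiny categorical classifier and returns the probability of heart disease for a new patient. It is not the DSHDPS implementation: the attributes, values and training records are hypothetical, and Laplace (add-one) smoothing is an assumption added so that unseen attribute values do not force a probability of zero.

```python
from collections import Counter, defaultdict

# Hypothetical training records: (patient attributes, has heart disease?)
train = [
    ({"chest_pain": "typical", "smoker": "yes", "age_band": "60+"}, "yes"),
    ({"chest_pain": "typical", "smoker": "no", "age_band": "60+"}, "yes"),
    ({"chest_pain": "none", "smoker": "yes", "age_band": "40-59"}, "yes"),
    ({"chest_pain": "none", "smoker": "no", "age_band": "40-59"}, "no"),
    ({"chest_pain": "none", "smoker": "no", "age_band": "under 40"}, "no"),
    ({"chest_pain": "atypical", "smoker": "no", "age_band": "under 40"}, "no"),
]

def fit(records):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(label for _, label in records)
    value_counts = defaultdict(Counter)  # (attribute, label) -> value counts
    values = defaultdict(set)            # attribute -> observed values
    for attrs, label in records:
        for a, v in attrs.items():
            value_counts[(a, label)][v] += 1
            values[a].add(v)
    return class_counts, value_counts, values

def predict_proba(model, attrs):
    """Return P(class | attrs), treating attributes as independent given the class."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        p = n / total  # prior P(class)
        for a, v in attrs.items():
            # Laplace-smoothed estimate of P(value | class)
            p *= (value_counts[(a, label)][v] + 1) / (n + len(values[a]))
        scores[label] = p
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

model = fit(train)
proba = predict_proba(model, {"chest_pain": "typical", "smoker": "yes",
                              "age_band": "60+"})
print(proba)
```

The per-attribute conditional probabilities stored in the model are what the text refers to as "the probability of each input attribute for the predictable state".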
[2.1.1.2]Heart Disease Diagnosis By Association Rule
and Decision Tree
The goal is to link perfusion measurements and risk
factors to artery disease. Some rules were expected,
confirming valid medical knowledge, and some rules
were surprising, having the potential to enrich medical
knowledge. The most important discovered predictive rules
were grouped in two sets: (1) if there is a low perfusion
measurement or no risk factor then the arteries are healthy;
(2) if there exists a risk factor or a high perfusion
measurement then the arteries are diseased. The maximum
association size κ was 4 [2]. In this study decision trees
were not as powerful as association rules at exploiting a
set of manually binned numeric attributes, categorical
attributes and several related target attributes. Decision
trees do not work well with combinations of several target
variables (arteries), which require defining one class
attribute for each combination of values. Decision trees
fail to identify many medically relevant combinations of
independent numeric variable ranges and categorical values.
[2.2]Higher Education
[2.2.1]By Web Based mining
The search to determine significant relationships
among variables in the data has become a slow and
subjective process. As a possible solution to this
problem, the concept of Knowledge Discovery in
Databases (KDD) has emerged. The process of forming
significant models and assessing them within KDD is
referred to as data mining. Data mining is used to uncover
hidden or unknown information. In one study, university
students were grouped according to their characteristics,
forming clusters. The clustering process was carried out
using a K-means algorithm [1].
(ii) Web based educational data mining: Web-based
educational systems collect large amounts of student data,
from web logs to much more semantically rich data contained
in student models. We have shown how the discovery of
different patterns through different data mining algorithms
and visualization techniques suggests to us a simple
pedagogical policy.
Data exploration focused on the number of attempted
exercises, combined with classification, led us to identify
students at risk: those who have not trained enough.
Clustering and cluster visualization led us to identify a
particular behavior among failing students, when students
try out the logic rules of the pop-up menu of the tool. A
timely and appropriate warning to students at risk could
help prevent failure in the final exam [3]. Therefore it
seems to us that data mining has a lot of potential for
education. The way we have performed clustering may seem
rough, as only a few variables, namely the number and type
of mistakes and the number of exercises, have been used to
cluster students into homogeneous groups. This is due to our
particular data. All exercises are about formal proofs. Even
if they differ in their difficulty, they do not
fundamentally differ in the concepts students have to grasp.
We have discovered a behavior rather than particular
abilities. In a different context, clustering students to
find homogeneous groups regarding skills should take into
account answers to a particular set of exercises.
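Clustering students on a few variables, as described above, can be sketched with a plain K-means loop. This is a minimal illustration under stated assumptions: the two features per student (number of mistakes, number of exercises attempted), the data values and the choice of initial centroids are hypothetical and are not taken from the cited studies.

```python
import math

# Hypothetical per-student features: (number of mistakes, exercises attempted)
students = [(2, 12), (3, 10), (1, 11), (9, 3), (8, 2), (10, 4)]

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, centroids, iterations=10):
    """Alternate assignment and update steps until the centroids stop moving."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Initial centroids are picked by hand here to keep the run deterministic.
centroids, clusters = kmeans(students, centroids=[students[0], students[3]])
print(centroids)
print(clusters)
```

With these toy values the two groups separate into students with few mistakes and many attempted exercises, and students with many mistakes and few attempts, which mirrors the "behavior rather than abilities" observation in the text.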
Other possible uses of information technology in the
field of pharmaceuticals include pricing (two-tier pricing
strategy) and exchange of information between vertically
integrated drug companies for mutual benefit.
Nevertheless, the challenge remains. Data mining, fondly
called pattern analysis on large sets of data, uses tools
like association, clustering, segmentation and
classification to support better manipulation of the data,
helping pharma firms compete on lower costs while improving
the quality of drug discovery and delivery methods. The
paper presents how data mining discovers and extracts useful
patterns from this large data, and demonstrates its ability
to improve the quality of the decision-making process in the
pharma industry. Drug development in general proceeds in
stages: discovery finds new drugs, development tests and
predicts drug behavior, clinical trials test the drug in
humans, and commercialization takes the drug and sells it to
likely consumers (doctors and patients). A simple
association technique could help measure the outcomes that
would greatly enhance the patient’s quality of life, for
example faster restoration of the body’s normal
functioning. This could be a benefit much sought after by
the patient and could help the firm better position the
drug vis-à-vis the competition. The paper describes how
these techniques can be easily and successfully used.
[2.2.2]Predicting Students Performance
The main objective of higher education institutions is
to provide quality education to their students. One way to
achieve the highest level of quality in a higher education
system is by discovering knowledge for prediction
regarding the enrolment of students in a particular course,
alienation of the traditional classroom teaching model,
detection of unfair means used in online examinations,
detection of abnormal values in students’ result sheets,
prediction of students’ performance and so on. This
knowledge is hidden in the educational data set and is
extractable through data mining techniques. The
classification task is used to evaluate students’
performance, and as there are many approaches to data
classification, the decision tree method is used here.
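A decision tree classifier for student performance can be sketched as follows. This is an ID3-style illustration written for this survey's reader, not the method of any cited study: the attributes (attendance, assignments), the records and the pass/fail labels are hypothetical, and the split criterion (lowest weighted entropy, i.e. highest information gain) is an assumed choice.

```python
import math
from collections import Counter

# Hypothetical student records: attributes -> final result
data = [
    ({"attendance": "high", "assignments": "done"}, "pass"),
    ({"attendance": "high", "assignments": "missed"}, "pass"),
    ({"attendance": "low", "assignments": "done"}, "pass"),
    ({"attendance": "low", "assignments": "missed"}, "fail"),
    ({"attendance": "low", "assignments": "missed"}, "fail"),
]

def entropy(rows):
    """Shannon entropy of the class labels in rows."""
    counts = Counter(label for _, label in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows))
                for c in counts.values())

def split(rows, attr):
    """Partition rows by the value they take on attr."""
    parts = {}
    for attrs, label in rows:
        parts.setdefault(attrs[attr], []).append((attrs, label))
    return parts

def build(rows, attrs):
    """Recursively build a tree: a leaf is a label, a node is (attr, branches)."""
    labels = Counter(label for _, label in rows)
    if len(labels) == 1 or not attrs:
        return labels.most_common(1)[0][0]  # leaf: majority class

    def remainder(a):
        # Weighted entropy left after splitting on attribute a.
        return sum(len(p) / len(rows) * entropy(p)
                   for p in split(rows, a).values())

    best = min(attrs, key=remainder)  # highest information gain
    rest = [a for a in attrs if a != best]
    return (best, {v: build(p, rest) for v, p in split(rows, best).items()})

def classify(tree, attrs):
    """Follow branches until a leaf label is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[attrs[attr]]
    return tree

tree = build(data, ["attendance", "assignments"])
print(tree)
print(classify(tree, {"attendance": "low", "assignments": "missed"}))
```

The sketch does not handle attribute values that were never seen during training; a practical implementation would fall back to the majority class in that case.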
[2.3.2]Data Mining of Market Knowledge in the
Pharmaceutical Industry
The paper provides an overview and examples of some Data
Mining SAS applications in the pharmaceutical industry. It
describes a series of problems typical of the industry and
the analytical issues involved, suggests approaches to
solving them, and presents results to illustrate these
techniques and the power that these tools bring. In
particular, the areas explored are customer segmentation at
product launch, ROI and the optimal timing of complementary
promotional efforts, analyses of managed care impact on
sales and customer targeting, and studies of resource
allocation in response to a competitor's product launch. In
the era of post-genomic drug development, extracting
and applying knowledge from chemical, biological, and
clinical data is one of the greatest challenges facing the
pharmaceutical industry. Pharmaceutical Data Mining
brings together contributions from leading academic and
industrial scientists, who address both the implementation
of new data mining technologies and application issues in
the industry.
[2.2.3]Prediction of Student Performance
The proposed model is used to extract information about
student behavior in writing the code of assignments and to
find statistical patterns or predictors that can be used to
enhance students’ performance in writing code. The results
obtained suggest that aspects such as student work habits,
and even code quality, have little bearing on the student’s
performance [9].
[2.3] PHARMACEUTICAL INDUSTRY
[2.3.1]The implications are such that a simple process
of merging drug usage and cost of medicines (after
completing the legal requirements) with the patient care
records of doctors and hospitals helps firms to conduct
nationwide trials for their new drugs.
[2.4] Security
One of the applications suggested by IDOT personnel
is collusion detection among contractors, similar to the
fraud detection application of data mining currently used
by many insurance companies.
[2.4.1]Intrusion Detection
As network attacks have increased in number and
severity over the past few years, the intrusion detection
system (IDS) is increasingly becoming a critical
component for securing the network. Due to the large
volumes of security audit data as well as the complex and
dynamic properties of intrusion behaviors, optimizing the
performance of IDS has become an important open problem
that is receiving more and more attention from the
research community. The cited work makes a quick and
up-to-date literature survey of attempts to design
intrusion detection systems using the KDD dataset [4]
(classifiers, evaluation setup and performance comparison).
[2.6]Agricultural
Applying data mining technology to modern agricultural
logistics decisions aims to raise the circulation speed of
agricultural products, lower logistics cost, promote
value-added logistics services, raise international
competitiveness, improve the management level of the
agricultural product industry and raise farmers' income.
Its notable advantage lies in raising the level, accuracy
and efficiency of modern agricultural logistics decisions
and in reducing the subjectivity and blindness of decisions.
The paper applies a decision support system within the
information system, built on an agricultural logistics data
mining system, which provides strong decision support to
the main agricultural logistics businesses, administrators
and decision makers so that they can adapt to the
development demands of modern agricultural logistics [6].
[2.4.2]Cyber security
MINDS is a suite of data mining algorithms which
can be used as a tool by network analysts to defend the
network against attacks and emerging cyber threats. The
various components of MINDS, such as the scan detector,
anomaly detector and profiling module, detect
different types of attacks and intrusions on a computer
network. The scan detector aims at detecting scans, which
are the precursors to any network attack [7]. The
anomaly detection algorithm is very effective in detecting
behavioral anomalies in the network traffic, which
typically translate to malicious activities such as DoS
traffic, worms, policy violations and inside abuse.
(ii) Spatial Data Mining
The paper gives an overview of a data clustering method
that uses cluster analysis and thereby generates
patterns/rules. “Association Rule Mining Analysis” usually
sounds like something very smart, difficult to understand,
and useful only to researchers and professors wearing thick
glasses. But the reality is just the opposite. Although we
might not be aware of it, pattern analysis using association
rule mining is present in many aspects of our everyday life.
The paper considers only two dimensions, and on the basis of
these two dimensions it clusters the data objects [11]. The
approach can be extended to more than two dimensions. The
current method of finding the distance between a data point
and a cluster is Euclidean distance. This method gives
circular clusters.
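The Euclidean distance mentioned above can be written out explicitly. The symbols are notation introduced here for illustration and do not appear in the cited paper: x is a data point, c is a cluster centre, and n is the number of dimensions.

```latex
d(x, c) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2}
\qquad\text{and, for } n = 2:\qquad
d(x, c) = \sqrt{(x_1 - c_1)^2 + (x_2 - c_2)^2}
```

All points at the same distance r from c satisfy (x_1 - c_1)^2 + (x_2 - c_2)^2 = r^2, which is the equation of a circle; this is why assigning points by Euclidean distance yields circular clusters in two dimensions. The same formula applies unchanged when n is greater than 2.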
[2.5]Transportation
Data-mining analysis was applied to a typical
construction database containing information about
asphalt projects in Illinois. A case study was presented to
test the applicability of data mining as an analysis
method. A database was constructed with data collected
from IDOT sources. Data mining techniques were used to
analyze the created dataset and generate rules. Based
on the generated results and their interpretation, certain
previously unknown patterns were discovered [5]. The
study shows that data mining can provide information on
a dataset or database beyond what statistical methods alone
provide, and can be a source of valuable information (that
could not have been detected otherwise) to support
decision-making. If the time-consuming data collection
process can be reduced, the method can extract information
faster than other analysis methods. Another important
extension to this research is exploring the validity of the
previously unknown patterns that were discovered. This could
entail mining the data using other software as well as
conducting a long-term study to verify those rules.
Furthermore, other applications of data mining to the
construction industry could be developed.
III. CONCLUSION
In this paper we present applications of data mining
in the main research areas. The goal of this paper is to
identify the applications of data mining in different areas
of research. Most previous studies on data mining
applications in various fields use a variety of data types,
ranging from text to images, stored in a variety of
databases and data structures. Different data mining
methods are used to extract patterns, and thus knowledge,
from these varied databases. Selecting the data and methods
for data mining is an important task in this process and
requires knowledge of the domain.
Several attempts have been made to design and
develop a generic data mining system, but no system has
been found to be completely generic. Thus, for every domain
the domain expert’s assistance is mandatory. The domain
experts shall be guided by the system to effectively apply
their knowledge in using data mining systems to generate
the required knowledge. The domain experts are required to
determine the variety of data that should be collected in
the specific problem domain, the selection of specific data
for data mining, the cleaning and transformation of data,
the extraction of patterns for knowledge generation and,
finally, the interpretation of the patterns and knowledge
generation.
References
[1] K. Srinivas, B. Kavitha Rani, Dr. A. Govardhan, “Applications of
Data Mining Techniques in Healthcare and Prediction of Heart
Attacks”, (IJCSE) International Journal on Computer Science and
Engineering, Vol. 02, No. 02, 2010, 250-255.
[2] Carlos Ordonez, University of Houston, Houston, TX, USA,
“Comparing Association Rules and Decision Trees for Disease
Prediction”.
[3] Şenol Zafer ERDOĞAN, Mehpare TİMOR, “A Data Mining Application
in a Student Database”, Journal of Aeronautics and Space
Technologies, July 2005, Volume 2, Number 2 (53-57).
[4] Agathe MERCERON and Kalina YACEF, “Educational Data Mining: a
Case Study”.
[5] Huy Anh Nguyen and Deokjai Choi, Chonnam National University,
Computer Science Department, “Application of Data Mining to
Network Intrusion Detection: Classifier Selection Model”.
[6] Khaled Nassar, Associate Professor, Department of Architectural
Engineering, University of Sharjah, “Application of Data-Mining
to State Transportation Agencies’ Projects Databases”.
[7] LIU Dejun, ZHANG Guangsheng, Shenyang Agricultural University,
Shenyang, “Application of Data Mining Technology in Modern
Agricultural Logistics Management Decision”.
[8] Varun Chandola, Eric Eilertson, Levent Ertöz, György Simon and
Vipin Kumar, Department of Computer Science, University of
Minnesota, “Data Mining for Cyber Security”.
[9] Qasem A. Al-Radaideh, Emad M. Al-Shawakfa, and Mustafa I.
Al-Najjar, “Mining Student Data Using Decision Trees”.
[10] Mrs. G. Subbalakshmi (M.Tech), Mr. K. Ramesh (M.Tech, Asst.
Professor), Mr. M. Chinna Rao (M.Tech, (Ph.D.), Asst. Professor),
“Decision Support in Heart Disease Prediction System using
Naive Bayes”.
[11] D. Rajesh, AP-SITE, VIT University, Vellore-14, “Application of
Spatial Data Mining for Agriculture”.