IPASJ International Journal of Information Technology (IIJIT)
Web Site: http://www.ipasj.org/IIJIT/IIJIT.htm
Email: [email protected]
ISSN 2321-5976
A Publisher for Research Motivation ........
Volume 2, Issue 4, April 2014
APRIORI algorithm based medical data mining
for frequent disease identification
Gitanjali J1, C. Ranichandra2, M. Pounambal3
School of Information Technology and Engineering,
VIT UNIVERSITY, Vellore-632014, Tamil Nadu, India
Abstract
Data mining is the analysis of large data sets from various perspectives to obtain a summary of useful information.
This information can be turned into knowledge about historical and future trends. Data mining plays a very important
role in the information technology domain. The health care sector today generates huge amounts of complex data,
including details about diseases, patients, diagnosis methods, electronic patient records, hospital resources, etc. Data
mining methods are very helpful in making medical decisions about treating diseases. The vast data collected by the
healthcare industry are not mined, so the information they contain remains hidden and, as a result, decision making is
not effective. The knowledge discovered can be used by healthcare administrators to enhance service quality. In this
paper, a method is proposed for identifying the frequency of diseases in a particular geographical location over a given
period of time using the Apriori data mining technique, which is based on association rules.
Keywords: KDD, Bayesian classification, Genetic Algorithm.
1. INTRODUCTION
Data mining is the extraction of knowledge from large amounts of data; it is also called knowledge mining from data.
Several other terms carry a similar meaning, such as knowledge extraction, data archaeology and data/pattern analysis.
Another widely used term is knowledge discovery from data, or KDD. Decision making is supported by converting mined
data into knowledge, and this process is called knowledge discovery. Knowledge discovery is an iterative sequence of the
following steps: (1) data cleaning [inconsistent data and noise are removed]; (2) data integration [multiple data sources
are combined]; (3) data selection [the data relevant to the analysis task are extracted from the database]; (4) data
transformation [the data are transformed into forms appropriate for mining]; (5) data mining [data patterns are
extracted by applying intelligent methods]; (6) pattern evaluation [the interesting patterns representing knowledge are
identified according to some measures]; (7) knowledge presentation [visualization and knowledge representation
techniques are used to present the mined knowledge]. Data mining includes many functions, such as classification,
association, clustering and prediction. Relationships and hidden patterns are discovered using advanced data mining
techniques. Mining association rules is one of the important data mining applications. Since 1993, association rules
have been used to identify relationships among item sets in databases; these relationships are not inherent properties
of the data. In the medical field, association rule mining is used to find the most frequently occurring diseases in
different geographical locations in a given time period. Hence medical data is analyzed in this research work.
2. LITERATURE REVIEW
Jyothi Soni et al. [1] provided a survey of recent techniques for predicting heart disease using data mining techniques
of knowledge discovery. Many experiments were conducted to compare performance and determine outcomes. The survey
reveals that, in terms of accuracy, Bayesian classification gives results similar to those of the decision tree, and
both perform well compared with other methods such as neural networks and classification based on clustering. The
decision tree algorithm and Bayesian classification are improved by applying a genetic algorithm: optimal data sets are
obtained by reducing the actual data size, which is useful in predicting heart disease. Carlos Ordonez [2] studied how
to constrain association rules in order to predict heart disease. He proposed three constraints to decrease the number
of patterns. First, attributes are required to appear on only one side of the rule. Second, the attributes are divided
into uninteresting groups. Third, the number of rules applied is reduced. Maria-Luiza Antonie et al. [3] investigated
data mining techniques such as association rule mining and neural networks for the detection of tumors in digital
mammography. Both techniques performed well in terms of accuracy, which gave 70%
3. APRIORI ALGORITHM
The Apriori algorithm is used for finding frequent item sets. It was proposed by R. Agrawal and R. Srikant in 1994. The
algorithm is named Apriori because it uses prior knowledge of frequent item sets. The input is the dataset D together
with min_sup, the minimum support count threshold. The output is L, the frequent item sets in D. The procedure for the
algorithm is as follows:
Step 1: Scan D for the count of each candidate and generate C1. List all the candidate item sets in C1 with their
corresponding support counts.
Step 2: Compare each candidate support count with the minimum support count to obtain L1. Generate the candidates C2
from L1.
Step 3: Scan D for the count of each candidate in C2. Compare each candidate support count with the minimum support
count; the candidates remaining form L2. This process is continued with larger item sets until no further frequent item
sets can be produced.
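The level-wise procedure above can also be stated compactly. In the sketch below, D and min_sup are as defined in the text; σ(X) for the support count of an item set X is notation introduced here for illustration, and C_k and L_k denote the candidate and frequent item sets of size k. The last line is the property that justifies building candidates only from frequent item sets: an item set can never be more frequent than any of its subsets.

```latex
\begin{align*}
\sigma(X) &= \bigl|\{\, t \in D : X \subseteq t \,\}\bigr| \\
L_k &= \{\, c \in C_k : \sigma(c) \ge \mathit{min\_sup} \,\} \\
C_{k+1} &= \{\, a \cup b : a, b \in L_k,\ |a \cup b| = k+1,\
           \text{every } k\text{-subset of } a \cup b \text{ is in } L_k \,\} \\
X \subseteq Y &\implies \sigma(X) \ge \sigma(Y)
\end{align*}
```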
4. PROPOSED WORK
Apriori algorithm can be used for mining the disease occurrence details for a specific time range.
Algorithm
Mn : candidate medical data item sets of size n
Fn : frequent item sets of size n
F1 = {frequent items};
for (n = 1; Fn != ∅; n++) do begin
    Mn+1 = candidate medical data item sets generated from Fn;
    for each transaction t in the database do
        increment the count of all candidates in Mn+1 that are contained in t;
    Fn+1 = candidates in Mn+1 with min_support;
end
return ∪n Fn;
Result of the Research
The proposed approach is very helpful in identifying the frequently occurring diseases in large volumes of medical data.
As a result, practitioners can reach medical conclusions and decisions regarding frequent diseases more accurately. Data
for analysis is obtained from different geographical areas during various time ranges.
5. EXPERIMENTAL RESULT
This research uses a data set containing the electronic medical details of different patients, including each patient's
name, disease name, age, sex, date, address, etc., for a particular year. Fig. 1 shows a bar graph of the number of
diseases affecting the patients each month. Fig. 2 depicts the number of patients affected by various diseases each
month. It reveals that in a particular month some patients are affected by the same disease.
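One plausible way to turn such patient records into monthly transactions for mining is sketched below. The record layout, field names, sample values and the year are all hypothetical; the paper does not describe its preprocessing, so this is only an assumption about how per-month disease sets and patient counts could be derived.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records; only the kinds of fields are taken from the text.
records = [
    {"patient": "P1", "disease": "Asthma", "date": date(2013, 1, 5)},
    {"patient": "P2", "disease": "Pain", "date": date(2013, 1, 17)},
    {"patient": "P3", "disease": "Asthma", "date": date(2013, 1, 20)},
    {"patient": "P1", "disease": "Insomnia", "date": date(2013, 2, 2)},
]


def monthly_transactions(records):
    """Map each month number to the set of diseases recorded in that month."""
    by_month = defaultdict(set)
    for r in records:
        by_month[r["date"].month].add(r["disease"])
    return dict(by_month)


def monthly_patient_counts(records):
    """Map each month number to the number of distinct patients seen."""
    patients = defaultdict(set)
    for r in records:
        patients[r["date"].month].add(r["patient"])
    return {month: len(p) for month, p in patients.items()}


print(monthly_transactions(records))
print(monthly_patient_counts(records))
```

Each monthly disease set can then serve as one transaction, and the two summaries correspond to the quantities plotted per month.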
Fig. 1. Bar graph of the number of diseases affecting the patients monthly (y-axis: no of diseases, 0 to 14; x-axis: months, Jan to Nov)
Fig. 2. Number of patients affected by various diseases monthly (y-axis: no of patients, 0 to 160; x-axis: months, Jan to Nov)
Relation: Disease
Instances: 12
January
February
March
April
May
June
July
August
September
October
November
December
Apriori
Minimum support: 0.35 (4 instances)
Minimum metric: 0.9
Number of cycles performed: 13
Attributes: 29
AIDS
Allergies
Heart disease
Asthma
HIV
human papilloma virus
hypertension
Impotence
Insomnia
Jaundice
Kidney Disease
Leukemia
Liver cancer
Liver Disease
Lung Cancer
Lupus
Overweight
Eye Disease
Pain
Pertussis
Pregnancy
Raynauds Phenomenon
sexually transmitted diseases
sleep disorders
smoking
stroke
Thrush
Thyroid disorders
Whooping Cough
Number of large item sets L(1): 20
Large item sets L(1):
Item set                          Support count
Allergies                         4
Heart disease                     7
Jaundice                          5
HIV                               4
Hypertension                      5
Impotence                         6
Thrush                            4
Whooping Cough                    7
Insomnia                          6
sexually transmitted diseases     5
sleep disorders                   5
Smoking                           9
Raynauds Phenomenon               4
Pregnancy                         4
Pain                              7
Overweight                        5
Lung Cancer                       4
Liver Disease                     4
Number of large item sets L(2): 21
Large item sets L(2):
Item set                                  Support count
Heart disease, hypertension               4
Heart disease, Insomnia                   5
Heart disease, Kidney Disease             4
Heart disease, Liver Disease              4
Heart disease, Overweight                 5
Heart disease, Pain                       5
Heart disease, smoking                    6
Asthma, Thrush                            4
HIV, smoking                              4
Hypertension, smoking                     4
Impotence, Raynauds Phenomenon            4
Impotence, smoking                        4
Impotence, Whooping Cough                 5
Insomnia, smoking                         5
Liver Disease, Overweight                 4
Liver Disease, smoking                    4
Pain, sexually transmitted diseases       5
Overweight, smoking                       4
Pain, smoking                             5
sexually transmitted diseases, smoking    4
Smoking, Whooping Cough                   5
Number of large item sets L(3): 7
Large item sets L(3):
Item set                                        Support count
Heart disease, Insomnia, smoking                4
Heart disease, Liver Disease, Overweight        4
Heart disease, Liver Disease, smoking           4
Heart disease, Overweight, smoking              4
Impotence, smoking, Whooping Cough              4
Liver Disease, Overweight, smoking              4
Pain, sexually transmitted diseases, smoking    4
Number of large item sets L(4): 1
Large item sets L(4):
Item set                                             Support count
Heart disease, Liver Disease, Overweight, smoking    4
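The support counts listed above are enough to check how confident an association rule is. The sketch below is illustrative only: it assumes that the counts pair with the item sets in listing order, and that the "Minimum metric" of 0.9 in the run summary refers to rule confidence, which the paper does not state. The helper names are hypothetical.

```python
# Support counts read from the L(1) and L(4) listings (pairing of names
# and counts is assumed to follow listing order).
support = {
    frozenset({"Heart disease"}): 7,
    frozenset({"Liver Disease"}): 4,
    frozenset({"Overweight"}): 5,
    frozenset({"smoking"}): 9,
    frozenset({"Heart disease", "Liver Disease", "Overweight", "smoking"}): 4,
}

MIN_CONFIDENCE = 0.9  # assumed meaning of "Minimum metric" in the run summary


def confidence(antecedent, consequent, support):
    """Confidence of antecedent => consequent: support(A and B) / support(A)."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    return support[both] / support[antecedent]


def single_antecedent_rules(itemset, support, min_confidence):
    """Rules {item} => rest of the item set that reach min_confidence."""
    itemset = frozenset(itemset)
    rules = []
    for item in sorted(itemset):
        rest = itemset - {item}
        conf = confidence({item}, rest, support)
        if conf >= min_confidence:
            rules.append((item, tuple(sorted(rest)), conf))
    return rules


largest = {"Heart disease", "Liver Disease", "Overweight", "smoking"}
for item, rest, conf in single_antecedent_rules(largest, support, MIN_CONFIDENCE):
    print(f"{item} => {', '.join(rest)} (confidence {conf:.2f})")
```

Under these assumptions, only the rule with Liver Disease as antecedent reaches the threshold (4/4 = 1.0); with Overweight as antecedent the confidence is 4/5 = 0.8.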
6. CONCLUSION
This research work applies Apriori data mining based on association rules to generate the frequency of diseases
affecting patients and the number of patients affected by these diseases. The study is based on various geographical
areas and various time periods. Existing electronic medical details obtained from hospitals are used as the training
data set for analysis. The analysis concluded that patients are frequently affected by 4 different diseases in different
geographical areas during a particular year.
References
[1] Jyothi Soni, et al., “Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease Prediction”
[2] Carlos Ordonez, “Improving Heart Disease Prediction Using Constrained Association Rules”
[3] Maria-Luiza Antonie et al., “Application of Data Mining Techniques for Medical Image Classification”
[4] M. Ilayaraja, T. Meyyappan, “Mining Medical Data to Identify Frequent Diseases using Apriori
[5] Murugesan K., Md. Rukunuddin Ghalib, Gitanjali J., Indumathi J., Manjula D. (2009), “A pioneering Cryptic
Random Projection based approach for Privacy Preserving Data Mining”, in Proceedings of the IEEE
International Conference on Information Reuse and Integration (IEEE IRI-09), July 10-12, Las Vegas, USA, pp.
437-439.
[6] “Sprouting Modus Operandi for Selection of the Best PPDM Technique for Health Care Domain”, International
Journal Conference in Recent Trends in Computer Science. Vol. 1, No. 1, pp. 627-629.
AUTHORS
GITANJALI J received her M.Tech in IT Networking from Vellore Institute of Technology, India, in 2008. She works at
Vellore Institute of Technology as an Assistant Professor Senior and is currently pursuing her PhD at VIT University,
Vellore. Her research interests include Security for Data Mining, Networks, Software Engineering and Ontology.
C. RANICHANDRA is working as Assistant Professor Selection Grade at VIT University, Vellore, Tamil Nadu, India. She has
fourteen years of teaching experience at VIT. Ranichandra was born in 1975 in Madurai District. She graduated with a
B.Tech (CSE) from Vellore Engineering College in 1997 and received her M.Tech (CSE) from VIT University in 2008. She
began research work in 2009 on Grid Databases and is currently working on database issues in the Cloud.
M. POUNAMBAL is Assistant Professor Selection Grade in the School of Information Technology and Engineering at VIT
University, Vellore, India. She received B.E and M.Tech degrees in Computer Science. Her area of interest includes
Wireless Networks.