International Journal of Soft Computing and Engineering (IJSCE)
ISSN: 2231-2307, Volume-4, Issue-ICCIN-2K14, March 2014
ICCIN-2K14 | January 03-04, 2014
Bhagwan Parshuram Institute of Technology, New Delhi, India
A Hybrid C-Tree Algorithm for Privacy Preserving Data Mining
Shweta Taneja, Shashank Khanna, Sugandha Tilwalia, Ankita
Abstract: Data mining is the process of discovering useful
information or knowledge from a data warehouse. Today, people are
increasingly concerned about their privacy and secrecy. Various
privacy preserving data mining algorithms have been developed to
preserve privacy and hide sensitive data. For example, information
held in government or patient records, such as name, age, sex,
disease and unique identification number, should be protected. In
this paper, we propose a new algorithm for privacy preserving data
mining. First, we rearrange the records of the data set using a novel
‘C-Tree’ procedure and perturb the primary attribute. Then, we
encrypt sensitive attributes using ASCII codes and special
characters. The algorithm is implemented and tested on microdata of
patient records. Sensitive data is perturbed efficiently so that an
individual’s identity is not revealed. Also, the original data can be
reconstructed from the perturbed data, which keeps the data usable.
Keywords: Privacy Preserving; Data Mining; Sensitive
Attribute; C-Tree; Encryption.
I. INTRODUCTION
Data mining [1] refers to extracting or “mining” knowledge from
large amounts of data. It is the process of discovering interesting
knowledge from large amounts of data stored in databases, data
warehouses or other information repositories. By performing data
mining, interesting knowledge, regularities or high-level information
can be extracted from a database and viewed or browsed from
different angles. The discovered knowledge can be applied to
decision making, process control, information management and query
processing. Data mining is considered one of the most important
frontiers in database systems and one of the most promising
interdisciplinary developments in the information industry. Data
mining, with its promise to efficiently discover valuable,
non-obvious information from large databases [2], is particularly
vulnerable to misuse. So, there may be a conflict between data mining
and privacy.
Today, privacy is a major concern in protecting sensitive data.
People are very concerned about sensitive information which they do
not want to share with anyone. Privacy protects sensitive data values
from unauthorized use [2]; it therefore differs from other areas of
security, such as data security and access control, which prevent
disclosure of information by illegitimate means. The main aim of
privacy preservation is to prevent unauthorized access to data or
information [7][11].
The privacy and security of individuals must be taken care of in
order to address ethical, legal and social issues. Large volumes of
data are regularly collected and analyzed through data mining. These
data include medical histories and important policies of governments
and organizations; they support decision making and provide social
benefits such as crime reduction, national security and medical
research. Such data are very important to data analysts and
researchers who analyze reports from hospitals, government agencies
and insurance companies [12]. With the rapid development of internet
technology, privacy preserving data mining has become one of the most
important topics and a serious concern for people who want their data
to be secure, since hackers can easily find techniques on the
internet for stealing someone’s personal sensitive data. To secure
such data, researchers have developed a new field known as privacy
preserving data mining (PPDM). Data mining techniques have been
developed successfully to extract knowledge from huge databases. As
data mining becomes more pervasive, privacy concerns are increasing.
To secure data, many techniques and methods have been introduced,
such as k-anonymity, data perturbation, association rule mining,
adding unknown values, encryption and the condensation approach. All
of these techniques have advantages as well as disadvantages: some
are very costly, some cannot handle huge amounts of data and some may
lose information. To overcome these problems, we introduce a new
technique called the Hybrid C-Tree technique.
In this paper, we discuss our novel algorithm and some related
techniques. The paper is organized as follows. In Section I, we
introduce privacy and the need for privacy preserving data mining.
Section II reviews related work in the field of privacy preserving
data mining. In Section III, we propose our algorithm. Section IV
presents the experiments conducted on an original data set and the
results obtained. Finally, we conclude in Section V and present the
future prospects of our algorithm.
II. RELATED WORK
Our main focus is on hiding sensitive or crucial data. A considerable
amount of work has been done in the field of privacy preserving data
mining, and different authors have proposed different techniques. The
cryptography-based technique [7][9] encrypts sensitive information
and so secures it, but it has several limitations: for example, it
cannot protect the outputs of a computation, and it does not produce
good results when many parties are involved.
The blocking-based technique replaces known values with unknown
values (‘?’) by randomization [8]. Its weakness is that the unknown
values may be guessed, so the original value behind an unknown value
can be easy to recover. The replacement of sensitive values does not
depend on any specific rule; there may be different sensitive rules
according to the requirements. The main aim of this technique is to
protect sensitive data from unauthorized access [7].
Manuscript received March 2014.
Shweta Taneja, Dept. of Computer Science Engineering, Bhagwan
Parshuram Institute of Technology, Indraprastha University, India.
Shashank Khanna, Dept. of Computer Science Engineering, Bhagwan
Parshuram Institute of Technology, Indraprastha University, India.
Sugandha Tilwalia, Dept. of Computer Science Engineering, Bhagwan
Parshuram Institute of Technology, Indraprastha University, India.
Ankita, Dept. of Computer Science Engineering, Bhagwan Parshuram
Institute of Technology, Indraprastha University, India.
Published By:
Blue Eyes Intelligence Engineering
& Sciences Publication Pvt. Ltd.
The perturbation technique adds noise to the original data, which
distorts sensitive information and thereby preserves privacy [5][7].
It distorts sensitive data values by adding to them, subtracting from
them or applying some other mathematical formula [6]. This technique
does not reconstruct the original data values; it reconstructs only
their distribution. Table I shows some privacy preserving techniques
with their merits and demerits.
Our paper considers all these techniques and takes a hybrid approach
[7][3] to preserving privacy in data mining. We use a measure called
‘Information Gain’ [1][14] to find the most sensitive attribute in
our data set, which is then used for classification.
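As an illustration of the Information Gain measure [1], the following Python sketch computes the gain of an attribute over a list of records. The function names (entropy, information_gain), the toy records and the class label "risk" are assumptions made for this example; they are not taken from the paper's Java implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Info(D): expected information needed to classify a record in D."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(records, attribute, target):
    """Gain(A) = Info(D) - Info_A(D) for a single attribute A."""
    info_d = entropy([r[target] for r in records])
    # Partition D into subsets D_j, one per distinct value of the attribute.
    partitions = {}
    for r in records:
        partitions.setdefault(r[attribute], []).append(r[target])
    info_a = sum(len(p) / len(records) * entropy(p) for p in partitions.values())
    return info_d - info_a


# Hypothetical toy records, not data from the paper.
records = [
    {"age_group": "old", "smoker": "yes", "risk": "high"},
    {"age_group": "old", "smoker": "no", "risk": "high"},
    {"age_group": "young", "smoker": "yes", "risk": "low"},
    {"age_group": "young", "smoker": "no", "risk": "low"},
]

# The attribute with the largest gain becomes the classifying attribute.
best = max(["age_group", "smoker"], key=lambda a: information_gain(records, a, "risk"))
print(best)  # age_group
```

An attribute whose values are all distinct splits the data into single-record subsets, so Info_A(D) is zero and its gain equals Info(D), the largest value possible.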
Table I: Merits and Demerits of Privacy Preserving Data Mining Techniques

S.No | Technique of PPDM | Merits | Demerits
1. | Perturbation | This approach is especially difficult to scale when more than a few parties are involved. Different attributes are treated independently by the perturbation approach. [2][3][5][7] | The method does not reconstruct the original data values. Also, it does not hold good for large databases. [7]
2. | Condensation | This approach works on pseudo-data rather than perturbed data, which helps in better preservation of privacy than techniques which simply use modifications of the original data. [10] | Pseudo-data no longer have an effect on the data mining algorithm, because they have the same format as the original data. [7][13]
3. | Cryptographic | Cryptography is a technique through which sensitive data can be encrypted. There is also a proper toolset of cryptographic algorithms in the field of privacy preserving data mining. [9] | This approach is especially difficult to scale when more parties are involved. Also, it does not hold good for large databases. [7]
4. | Blocking-based technique | To preserve the privacy of individuals, sensitive transactions are replaced by unknown values. [7][8] | Unknown values help in preserving privacy, but reconstruction of the original data set is quite difficult. [7]

III. PROPOSED ALGORITHM
We propose a hybrid algorithm which combines the merits of different
techniques and helps to preserve the privacy of sensitive data. The
techniques involved are encryption, perturbation and replacement
with unknown values.
The Hybrid C-Tree Algorithm is as follows:
1) Input: the original training data set. Calculate Gain using
   Information Gain [1]:
   $$\mathrm{Info}(D) = -\sum_{i} p_i \log_2(p_i)$$
   $$\mathrm{Info}_A(D) = \sum_{j} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$
   $$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$
   where D is the training data set, A is an attribute, p_i is the
   proportion of records in D that belong to class i, and D_j is the
   subset of D holding the j-th value of A.
   The attribute with the greatest Gain is chosen as the classifying
   attribute for the tree.
2) Output: the perturbed data, in sequence from the left cluster to
   the right cluster of the last level of the C-Tree.
3) Method:
   a) Calculate the Information Gain value for attributes having
      random values.
   b) Select a unique classifying attribute.
   c) Classify this level of the tree by testing the random-valued
      attribute against a threshold, e.g. for Age, whether Age > 45.
   d) On the classified data, divide the records according to whether
      Age is even or odd.
   e) In each cluster after classification:
        if Age is even
            Age := Age + 15
        else
            Age := Age - 15
   f) Traverse the tree from the left cluster to the right cluster of
      the last level of the C-Tree.
   g) Repeat steps a to f and obtain the perturbed data.
4) Encrypt Character function
   a) Input a character attribute from the training data set.
   b) Starting from the first letter of the attribute value, convert
      each character into its ASCII code k.
   c) Integer M := 155 - k.
   d) Convert M into a character.
   e) Replace the current character with the converted character.
   f) Return the string.
5) Encrypt ID function
   a) Check whether the character is a digit.
   b) Replace each digit with the special character in the
      corresponding position of the list below, which follows the
      order of the special characters in the ASCII chart.
        0 !
        1 #
        2 $
        3 %
        4 &
        5 *
        6 +
        7 -
        8 /
        9 =
   c) Return the ID.
6) Stop.
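The classification and perturbation steps (3c to 3f) of the algorithm in Section III can be sketched in Python as follows. This is a minimal sketch, not the authors' Java code: the function names are hypothetical, and the leaf order (Age <= 45 on the left, even before odd) is an assumption chosen because it reproduces the row order of Table III. The restore_age helper is likewise not given in the paper; it rests on the observation that an odd offset such as 15 flips the parity of the age.

```python
def perturb_age(age, offset=15):
    """Step 3e: add the offset to even ages, subtract it from odd ages."""
    return age + offset if age % 2 == 0 else age - offset


def restore_age(perturbed, offset=15):
    """Inverse of perturb_age (assumes an odd offset, which flips parity)."""
    return perturbed - offset if perturbed % 2 == 1 else perturbed + offset


def c_tree_order(records, threshold=45):
    """Steps 3c-3f: split on the threshold, then on parity, and read the
    four leaf clusters from left to right."""
    leaves = [[], [], [], []]
    for r in records:
        side = 0 if r["age"] <= threshold else 2  # assumed: Age <= 45 is the left branch
        parity = 0 if r["age"] % 2 == 0 else 1    # assumed: even cluster before odd cluster
        leaves[side + parity].append(r)
    return [r for leaf in leaves for r in leaf]


# (name, age) pairs taken from Table II.
patients = [
    ("ALBERT", 29), ("DERRICK", 30), ("HANNAH", 25), ("SUMMER", 36),
    ("ROBERT", 43), ("PEARL", 34), ("KRISTINE", 47), ("SMITH", 48),
    ("SAMANTHA", 21),
]
records = [{"name": n, "age": a} for n, a in patients]

perturbed = [
    {"name": r["name"], "age": perturb_age(r["age"])}
    for r in c_tree_order(records)
]
for row in perturbed:
    print(row["name"], row["age"])
```

Under these assumptions the perturbed ages come out as 45, 51, 49, 14, 10, 28, 6, 63, 32, the same sequence as the AGE column of Table III.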
IV. EXPERIMENTS CONDUCTED
We have implemented our proposed algorithm on an original patient
data set [4], shown in Table II. This data set has an alphanumeric
attribute (UID), two character attributes (Name and Disease), a
numeric attribute (Age) and an attribute which gives the gender of
the patient (Sex).

Table II: Patient Medical Dataset - Original

UID | NAME | AGE | SEX | DISEASE
IN5612 | ALBERT | 29 | M | CANCER
IN5622 | DERRICK | 30 | M | FLU
IN5627 | HANNAH | 25 | F | HEART DISEASE
IN5634 | SUMMER | 36 | F | CANCER
IN5646 | ROBERT | 43 | M | HEART DISEASE
IN5647 | PEARL | 34 | F | HEART DISEASE
IN5655 | KRISTINE | 47 | F | CANCER
IN5672 | SMITH | 48 | M | FLU
IN5675 | SAMANTHA | 21 | F | FLU

From the above medical data, a medical researcher may be keen to find
patterns and, while mining, might publicly reveal someone’s personal
and sensitive details. To prevent this, we implement our algorithm,
which hides the sensitive information and keeps a perturbed data set
in the warehouse, so that an individual’s identity cannot easily be
revealed while mining. The technique preserves all sensitive
information and gives a specific miner only the attributes that help
in his or her research, while maintaining privacy. We have
implemented the algorithm in Java, with MySQL as the back end.
The steps followed are given below.
Step 1. We first find the attribute which is best suited for
        decision making and classification. This is the attribute
        with the largest Gain, using the concept of Information
        Gain. The Gain value for the ‘Age’ attribute is the highest
        and therefore we classify our tree using ‘Age’.
Step 2. We divide the primary attribute, i.e. Age, into two
        categories,
        Age > 45 or Age <= 45,
        and store the IDs in two different lists.
Step 3. We further classify each list into two, i.e. Even and Odd.
        This classification produces the clusters in the leaves, as
        shown in Fig. 1.
Step 4. We traverse the leaves from the leftmost cluster to the
        rightmost cluster and store the records in the database.
Step 5. We add 15 to the age in all records with an even age,
        subtract 15 from the age in all records with an odd age, and
        update the ages in the database.
Step 6. We encrypt the Names and Diseases with the conversion step
        given in the algorithm. The encrypted Names and Diseases are
        updated in the database.
Step 7. The ID attribute is encrypted as explained in the algorithm.
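The two encryption functions used in Steps 6 and 7 can be sketched in Python as shown below. The names encrypt_text, encrypt_id and DIGIT_SYMBOLS are hypothetical. Two points are assumptions inferred from Table III rather than stated in the algorithm: only the letters A to Z are converted (the space in HEART DISEASE is left unchanged), and the digit 7 maps to '-'. Because the ASCII codes of A (65) and Z (90) sum to 155, the rule M = 155 - k maps A to Z, B to Y and so on, so applying it twice returns the original text.

```python
# Digit-to-symbol mapping from step 5 of the algorithm (symbols in ASCII order).
DIGIT_SYMBOLS = dict(zip("0123456789", "!#$%&*+-/="))


def encrypt_text(value):
    """Step 4: replace each character with ASCII code k by chr(155 - k)."""
    # Assumption: only A-Z are converted; any other character is kept as is.
    return "".join(chr(155 - ord(c)) if "A" <= c <= "Z" else c for c in value)


def encrypt_id(uid):
    """Step 5: replace each digit with its special character; keep letters."""
    return "".join(DIGIT_SYMBOLS.get(c, c) for c in uid)


# One record from Table II: IN5622, DERRICK, FLU.
print(encrypt_id("IN5622"))     # IN*+$$
print(encrypt_text("DERRICK"))  # WVIIRXP
print(encrypt_text("FLU"))      # UOF

# The letter mapping is its own inverse.
print(encrypt_text(encrypt_text("DERRICK")))  # DERRICK
```

Since both mappings are fixed one-to-one substitutions, anyone who knows the rule can reverse them; no key is involved in this sketch.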
The results of the above experiment are shown in Table III.

Table III: Encrypted Patient Medical Dataset - using C-Tree Hybrid Algorithm

UID | NAME | AGE | SEX | DISEASE
IN*+$$ | WVIIRXP | 45 | M | UOF
IN*+%& | HFNNVI | 51 | F | XZMXVI
IN*+&- | KVZIO | 49 | F | SVZIG WRHVZHV
IN*+#$ | ZOYVIG | 14 | M | XZMXVI
IN*+$- | SZMMZS | 10 | F | SVZIG WRHVZHV
IN*+&+ | ILYVIG | 28 | M | SVZIG WRHVZHV
IN*+-* | HZNZMGSZ | 6 | F | UOF
IN*+-$ | HNRGS | 63 | M | UOF
IN*+** | PIRHGRMV | 32 | F | XZMXVI

Figure 1: C-Tree depicting clusters in the leaf nodes
V. CONCLUSION AND FUTURE PROSPECTS
Privacy has become an important concern. Today, no individual wants
to give up his or her privacy. While mining data from the warehouse,
an analyst might find knowledge that includes someone’s sensitive
information, such as health status or bank account details. It may be
published, although the individual might not want that information to
be revealed. In the field of privacy preserving data mining, we have
therefore contributed a novel hybrid algorithm which perturbs the
sensitive data. The benefit of our proposed algorithm is that, while
mining, analysts will not be able to reveal sensitive information,
which preserves the privacy and secrecy of individuals. In future we
would like to present a hybrid algorithm which will help in
reconstructing the original data.
REFERENCES
[1] J. Han and M. Kamber, “Data Mining: Concepts and Techniques”,
    2nd ed., The Morgan Kaufmann Series in Data Management Systems,
    Jim Gray, Series Editor, 2006.
[2] M. B. Malik, M. A. Ghazi and R. Ali, “Privacy Preserving Data
    Mining Techniques: Current Scenario and Future Prospects”, in
    proceedings of Third International Conference on Computer and
    Communication Technology, IEEE 2012.
[3] P. Deivanai, J. Jesu Vedha Nayahi and V. Kavitha, “A Hybrid Data
    Anonymization integrated with Suppression for Preserving Privacy
    in mining multi party data”, in proceedings of International
    Conference on Recent Trends in Information Technology, IEEE 2011.
[4] M. Prakash and G. Singaravel, “A New Model for Privacy Preserving
    Sensitive Data Mining”, in proceedings of ICCCNT Coimbatore,
    India, IEEE 2012.
[5] J. Liu, J. Luo and J. Z. Huang, “Rating: Privacy Preservation for
    Multiple Attributes with Different Sensitivity requirements”, in
    proceedings of 11th IEEE International Conference on Data Mining
    Workshops, IEEE 2011.
[6] H. Kargupta, S. Datta, Q. Wang and K. Sivakumar, “On the Privacy
    Preserving Properties of Random Data Perturbation Techniques”, in
    proceedings of the Third IEEE International Conference on Data
    Mining, IEEE 2003.
[7] S. Lohiya and L. Ragha, “Privacy Preserving in Data Mining Using
    Hybrid Approach”, in proceedings of 2012 Fourth International
    Conference on Computational Intelligence and Communication
    Networks, IEEE 2012.
[8] A. Parmar, U. P. Rao and D. R. Patel, “Blocking based approach
    for classification Rule hiding to Preserve the Privacy in
    Database”, in proceedings of International Symposium on Computer
    Science and Society, IEEE 2011.
[9] Y. Lindell and B. Pinkas, “Privacy preserving data mining”, in
    proceedings of Journal of Cryptology, 5(3), 2000.
[10] C. Aggarwal and P. S. Yu, “A condensation approach to privacy
    preserving data mining”, in proceedings of International
    Conference on Extending Database Technology (EDBT), pp. 183–199,
    2004.
[11] R. Agrawal and R. Srikant, “Privacy-preserving data mining”, in
    proceedings of SIGMOD00, pp. 439-450.
[12] T. Jahan, G. Narsimha and C. V. Guru Rao, “Data Perturbation and
    Features Selection in Preserving Privacy”, in proceedings of
    978-14673-1989-8/12, IEEE 2012.
[13] G. Nayak and S. Devi, “A Survey on Privacy Preserving Data
    Mining: Approaches and Techniques”, in proceedings of
    International Journal of Engineering Science and Technology
    (IJEST), 2011.
[14] X. Xiao and H. Ding, “Enhancement of K-nearest Neighbour
    Algorithm Based on Weighted Entropy of Attribute Value”, in
    proceedings of 5th International Conference on BioMedical
    Engineering and Informatics, IEEE 2012.