ISSN(Online): 2320-9801
ISSN (Print): 2320-9798
International Journal of Innovative Research in Computer and Communication Engineering
An ISO 3297: 2007 Certified Organization
Vol.3, Special Issue 3, April 2015
2nd National Conference On Emerging Trends In Electronics And Communication Engineering (NCETECE’15)
Organized by
Dept. of ECE, New Prince Shri Bhavani College Of Engineering & Technology, Chennai-600073, India during 6th & 7th April 2015
Privacy Preserving Distributed Classification
Using C4.5 Decision Tree
S. Merlin, J. Jesu Vedha Nayhi
2nd Year ME (CSE), Regional Centre of Anna University, Tirunelveli, Tamil Nadu, India
Asst. Professor, Regional Centre of Anna University, Tirunelveli, Tamil Nadu, India
ABSTRACT: In distributed classification, maintaining the privacy of individual data is a major issue. The Random Decision Tree (RDT) concept is used to reduce information leakage, but RDT also has privacy violations, because its structure sharing rests on several assumptions. Every such assumption has drawbacks for privacy: even if the structure is kept unknown, other parties may guess information from the schema, and if a number of new instances are classified to the same leaf node, it is easy to figure out the structure of the branch leading to that leaf. The number of trees created in RDT also increases the time complexity. The proposed system uses the C4.5 decision tree algorithm. Every party creates one single optimal tree, and the class labels along with the new instance are exchanged between the parties, meeting the needs of distributed classification using secure sum or threshold homomorphic encryption.
KEYWORDS: C4.5 decision tree, secure sum, homomorphic encryption
I. INTRODUCTION
The basic idea of privacy preserving data mining is to extend traditional data mining techniques to work with data modified to mask sensitive information. The large-scale collection of personal data has led to concerns that it may be misused for a variety of purposes. In order to alleviate these concerns, a number of techniques have recently been proposed to perform data mining tasks in a privacy preserving way. At present, great advances in networking and database technologies make it easy to distribute data across multiple parties and to collect data on a large scale for sharing information. Distributed data mining techniques such as association rule mining and decision tree learning are widely used by global enterprises to obtain accurate information about underlying markets for their business decisions. Although different enterprises are willing to collaborate with each other to mine the union of their data, due to legal constraints or competition among enterprises they do not want to reveal their sensitive and private information to others during the data mining process. There has been growing concern that using this technology to gain knowledge from vast quantities of data violates individual privacy. Privacy preserving data mining (PPDM) has emerged to address this problem and has become a challenging research area in the field of data mining (DM). The method of privacy preserving data mining depends on the data mining task (i.e., association rule mining, classification, clustering, etc.) and the manner in which the data sources are distributed (i.e., centralized, where all transactions are stored by only one party; horizontal, where every involved party has only a subset of the transaction records, but every record contains all attributes; vertical, where every involved party has the same transaction records, but every record contains only a subset of the attributes). We next consider the problem of constructing a decision tree classifier in a distributed environment. Experimental evidence shows that creating random decision trees provides good privacy and good accuracy, particularly for small datasets: using random decision trees, the resulting classifiers have good prediction accuracy without compromising privacy, even for small datasets. In contrast to ID3 trees, however, random decision tree classifiers are not suitable for applications in which it is necessary to learn which combinations of attribute values are most predictive of the class label, because random decision trees do not provide this information.
II. RELATED WORK
In distributed privacy preserving classification there is a large possibility of a privacy violation when a new instance needs to be classified. Privacy preserving via dataset complementation fails if all training datasets are leaked
Copyright @ IJIRCCE
www.ijircce.com
because the dataset reconstruction algorithm is generic. The difficulty of preserving individual privacy is compounded by the availability of auxiliary information, which defeats straightforward approaches based on anonymization. The ID3 tree is not suitable for applications that need to identify which combinations of attributes are most predictive of a sensitive class label, and RDT does not provide that information either. Encryption speed is the most important factor when looking at the speed of such an algorithm: in order to make significant improvements, either the encryption speed must be increased or the number of encryptions used must be decreased. Possible routes are to use specialized hardware to speed up the encryption, or to use elliptic curve encryption techniques to decrease the necessary key space. Reconstruction of perturbed data can also be complex. The adversary should not be able to prevent the correct result from being computed, and should learn nothing more than the result and the inputs of the corrupted parties; because the set of corrupted parties is fixed from the start, such an adversary is called static or non-adaptive. Reconstruction of a given dataset need not be exact: a given reconstruction algorithm may not always converge, and even when it converges, there is no guarantee that it provides a reasonable estimate of the original distribution, although a robust estimate of the original distribution is the goal. The randomness structure may itself be used to compromise privacy unless careful attention is paid. Moreover, such approaches only reconstruct the distribution of the original data from data perturbed by random value distortion; they do not consider estimating the individual values of the data points. Cryptography-based work for privacy-preserving data mining is still too slow to be effective for large scale datasets facing today's big data challenge. Random Decision Trees (RDT) show that it is possible to generate equivalent and accurate models at much smaller cost, but RDT has a drawback: accuracy may increase as the number of trees constructed increases, yet more trees increase the time complexity. The C4.5 decision tree algorithm improves the time complexity and accuracy, because only a single tree is constructed by every party, which reduces the time, and error pruning techniques are used during tree construction to improve the accuracy.
III. PRIVACY PRESERVING METHODS
The field of privacy has seen rapid advances in recent years because of increases in the ability to store data. In particular, recent advances in the data mining field have led to increased concerns about privacy. While the topic of privacy has traditionally been studied in the context of cryptography and information hiding, the recent emphasis on data mining has led to renewed interest in the field. The problem of privacy-preserving data mining has become more important in recent years because of the increasing ability to store personal data about users and the increasing sophistication of data mining algorithms that leverage this information. A number of techniques, such as randomization and k-anonymity, are used to perform privacy-preserving data mining. Furthermore, the problem has been discussed in multiple communities, such as the database community, the statistical disclosure control community, and the cryptography community. In some cases, the different communities have explored parallel lines of work which are quite similar.
A) K-Anonymity
This is a method for directly building a k-anonymous decision tree from a given dataset. The algorithm is basically an improvement of the classical decision tree building algorithm, combining mining and anonymization in a single process. At initialization time, the decision tree is composed of a unique root node representing all the attributes in the given dataset. At each step, the algorithm inserts a new splitting node into the tree by choosing the attribute in the quasi-identifier that is most useful for classification purposes, and updates the tree accordingly. If the tree obtained is not k-anonymous, the node insertion is rolled back. The algorithm stops when no node can be inserted without violating k-anonymity, or when the classification obtained is considered satisfactory.
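A minimal sketch of the rollback check described above (the function and attribute names here are our own illustration, not the cited algorithm itself): a candidate split is kept only if every quasi-identifier group still contains at least k records.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values that occurs
    in the data is shared by at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

records = [
    {"age": "30-40", "zip": "600*", "class": "yes"},
    {"age": "30-40", "zip": "600*", "class": "no"},
    {"age": "30-40", "zip": "600*", "class": "yes"},
    {"age": "40-50", "zip": "601*", "class": "no"},
]

# A node insertion that splits on ("age", "zip") would be rolled back
# for k = 2, because one group contains only a single record.
print(is_k_anonymous(records, ["age", "zip"], 2))
```

In the tree-building loop, this check runs after each tentative node insertion; a False result triggers the rollback.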
B) Randomization
This method has been discussed for decision tree classification with the use of aggregate distributions reconstructed from the randomized distribution. The key idea is to construct the distributions separately for the different classes. Then, the splitting condition for the decision tree uses the relative presence of the different classes, which is derived from the aggregate distributions. Since the probabilistic behavior is encoded in aggregate data distributions, it can also be used to construct a naive Bayes classifier. In such a classifier, the approach of randomized response with partial hiding is used in order to perform the classification. This approach is effective both empirically and analytically.
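The reconstruction idea can be illustrated with the classical two-sided randomized response model (a sketch under our own assumptions, not the cited scheme's exact mechanism): each party reports its true bit with probability theta and the flipped bit otherwise, and the aggregate proportion of 1s is recovered from the observed responses.

```python
import random

def randomize(bit, theta, rng):
    """Report the true bit with probability theta, the flipped bit otherwise."""
    return bit if rng.random() < theta else 1 - bit

def reconstruct(responses, theta):
    """Invert the randomization in aggregate: if p is the true proportion,
    the observed proportion is theta*p + (1-theta)*(1-p), so
    p = (p_obs + theta - 1) / (2*theta - 1)."""
    p_obs = sum(responses) / len(responses)
    return (p_obs + theta - 1) / (2 * theta - 1)

rng = random.Random(42)
theta = 0.8
true_bits = [1] * 3000 + [0] * 7000          # true proportion of 1s is 0.3
responses = [randomize(b, theta, rng) for b in true_bits]
print(reconstruct(responses, theta))          # close to 0.3
```

Only the aggregate distribution is recovered; no individual bit can be inferred with certainty, which is what makes the method suitable for the class-wise distributions used by the splitting condition.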
The basic problem in distributed classification is to train a classifier from the distributed data and then classify each new instance. For distributed decision tree classification, the objective is to create a decision tree classifier from the distributed data. In the privacy-preserving case, the additional constraint is that the process of building the classifier, or of classifying an instance, should not leak any additional information beyond what is learned from the result. RDT classifier concepts can be used to obtain this privacy.
IV. PRIVACY PRESERVING IN DISTRIBUTED ENVIRONMENT
I) Partitioning Data
The complete dataset is partitioned either horizontally or vertically.
A) Horizontal Partitioning
In horizontally partitioned data sets, different sites contain different sets of records with the same (or a highly overlapping) set of attributes, which are used for mining purposes. A popular decision tree induction method has been constructed for this setting with the use of approximations of the best splitting attributes. Subsequently, a variety of classifiers have been generalized to the problem of horizontally partitioned privacy preserving mining. An extreme solution for the horizontally partitioned case performs privacy-preserving classification in a fully distributed setting, where each customer has private access to only their own record. A host of other data mining applications have likewise been generalized to the problem of horizontally partitioned data sets.
B) Vertical Partitioning
In vertically partitioned data sets, a portion of each instance is present at each site, but no site contains complete information for any instance. The method presented here works for any number of parties, and the class attribute (or other attributes) need be known only to one party. The method is trivially extendible to the simplified case where all parties know the class attributes.
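As an illustration of the two partitioning manners (the table, attribute names, and values are invented for this example):

```python
# Full table: each row is one transaction, each key one attribute.
table = [
    {"id": 1, "age": 25, "income": "low",  "buys": "no"},
    {"id": 2, "age": 40, "income": "high", "buys": "yes"},
    {"id": 3, "age": 35, "income": "med",  "buys": "yes"},
    {"id": 4, "age": 50, "income": "high", "buys": "no"},
]

# Horizontal partitioning: each party holds different records,
# but every record carries all attributes.
party1_h = table[:2]
party2_h = table[2:]

# Vertical partitioning: each party holds all records but only some
# attributes; here only party 2 sees the class attribute "buys".
party1_v = [{"id": r["id"], "age": r["age"]} for r in table]
party2_v = [{"id": r["id"], "income": r["income"], "buys": r["buys"]} for r in table]
```

Note that in the vertical case a shared record identifier is needed to line the fragments up, which is itself a source of linkage risk the privacy techniques must account for.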
V. C4.5 DECISION TREE
Decision tree classification predicts the class attribute value of a new transaction based on a given training set of transactions, in which each transaction consists of several general attributes and a class attribute. The crucial problem of classification is how to build a decision tree in which each leaf node represents a classification result (a class attribute value) and each non-leaf node represents a testing attribute (a general attribute). Decision trees are commonly built using the ID3 and C4.5 algorithms. In fact, C4.5 can be regarded as an extension of ID3, and the main differences between the two algorithms are: 1) C4.5 can deal with attributes having discrete and continuous values, whereas ID3 can only cope with attributes having discrete values; 2) C4.5 uses the information gain ratio to determine the nodes of the decision tree, instead of the information gain used in ID3; 3) compared with ID3, C4.5 adds the processes of pruning the decision tree and deriving rule sets from it.
The pseudocode below gives the original C4.5 tree construction algorithm for a data set distributed among n parties, in which the n parties share the set of general attributes A and the class attribute C = {c1, …, cm}, and party Pi (i∈[1, n]) has a training set of transactions Ti, a part of the whole training set T.
Algorithm 1 (Pseudocode of the distributed C4.5 tree-construction algorithm)
BuildTree(P1: T1, C, A; …; Pn: Tn, C, A)
1) Compute the frequency freq(cj, T) for each j∈[1, m]:
   For each i∈[1, n], Pi counts Ti(cj) = {t | t.C = cj, t∈Ti};
   P1, …, Pn jointly compute freq(cj, T);
2) Find freq(cmax, T) = Max{freq(cj, T) | j∈[1, m]};
3) If freq(cmax, T) = 1, or |T| is less than a certain value,
   Create a leaf node L with the class cmax;
   Compute the classification error of the node L;
   Return L;
4) Create a decision tree node N;
5) For each attribute a∈A, P1, …, Pn jointly compute the information gain ratio GainRatio(a):
   P1, …, Pn jointly determine whether a is a discrete or continuous attribute;
   If a is discrete and has l possible attribute values,
     P1, …, Pn jointly compute GainRatio(a) for the l-splitting of T;
   If a is continuous and has l attribute values v1, …, vl,
     P1, …, Pn jointly compute GainRatio(a) for all the l−1 possible 2-splittings of T;
     P1, …, Pn jointly find the 2-splitting with the best GainRatio;
6) N.test = AttributeWithBestGainRatio;
7) For each T′ = T1′ ∪ … ∪ Tn′ in the splitting of T:
   If T′ is empty,
     the child of N is a leaf;
   Else
     Child of N = BuildTree(P1: T1′, C, A−{a}; …; Pn: Tn′, C, A−{a});
8) Compute the classification error of the node N;
9) Return N;
In step 5 of the algorithm, GainRatio(a) is calculated as follows:
Entropy(T) = −Σ_{j=1..m} freq(cj, T) · log2(freq(cj, T));
Entropy(T|a) = Σ_{i} (|T(a = vi)| / |T|) · Entropy(T(a = vi));
Gain(a) = Entropy(T) − Entropy(T|a);
Split(a) = −Σ_{i} (|T(a = vi)| / |T|) · log2(|T(a = vi)| / |T|);
GainRatio(a) = Gain(a) / Split(a);
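As a single-site sketch of the formulas above (discrete attributes only; the dataset and names are illustrative, not the distributed protocol itself):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(T) = -sum_j freq(c_j, T) * log2(freq(c_j, T))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr, class_attr):
    """Gain(a) / Split(a) for a discrete attribute a, as in step 5 above."""
    n = len(rows)
    cond = 0.0   # Entropy(T|a): weighted entropy of each value's subset
    split = 0.0  # Split(a): entropy of the partition sizes themselves
    for value in set(r[attr] for r in rows):
        subset = [r[class_attr] for r in rows if r[attr] == value]
        w = len(subset) / n
        cond += w * entropy(subset)
        split -= w * log2(w)
    gain = entropy([r[class_attr] for r in rows]) - cond
    return gain / split if split > 0 else 0.0

rows = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rain",  "play": "yes"},
    {"outlook": "rain",  "play": "yes"},
]
print(gain_ratio(rows, "outlook", "play"))  # 1.0: the split is perfectly pure
```

Dividing by Split(a) is what distinguishes C4.5 from ID3: it penalizes attributes that fragment T into many small subsets, which raw information gain would otherwise favor.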
VI. PRIVACY
The class labels of the trees carry the information counts for the overall database, so each party focuses on providing privacy for the class labels. This paper handles privacy with two techniques: secure sum and threshold additive homomorphic encryption.
A) Secure Sum
Secure sum is often given as a simple example of secure multiparty computation. Distributed data mining algorithms frequently calculate the sum of values from individual sites. Assuming three or more parties and no collusion, the following method securely computes such a sum. Assume that the value v = Σ_{i=1..s} vi to be computed is known to lie in the range [0..n]. One site is designated the master site, numbered 1; the remaining sites are numbered 2..s. Site 1 generates a random number R, uniformly chosen from [0..n]. Site 1 adds R to its local value v1 and sends the sum R + v1 to site 2, so site 2 learns nothing about the actual value of v1. Likewise, each subsequent site adds its own value and passes the running sum along; finally, site s sends the result to site 1. Site 1, knowing R, can subtract R to get the actual result.
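A minimal single-process simulation of the ring protocol above (an illustrative sketch; taking all arithmetic modulo n + 1, so that every partial sum a site sees is uniformly random, is our assumption about the standard formulation):

```python
import random

def secure_sum(values, n_max):
    """Ring-based secure sum. Site 1 adds a random mask R, each site in
    turn adds its private value, and site 1 removes R at the end. All
    arithmetic is mod (n_max + 1), so no partial sum reveals anything."""
    modulus = n_max + 1
    R = random.randrange(modulus)          # known only to site 1
    running = (R + values[0]) % modulus    # site 1 sends R + v1 to site 2
    for v in values[1:]:                   # sites 2..s each add their value
        running = (running + v) % modulus  # what each site forwards
    return (running - R) % modulus         # site 1 subtracts R

values = [12, 7, 30]                       # private inputs of sites 1..3
print(secure_sum(values, n_max=100))       # 49
```

The no-collusion assumption matters: sites i−1 and i+1 together could subtract their messages and recover vi, which is one reason the paper also considers the homomorphic variant below.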
B) Threshold Homomorphic Encryption
Homomorphic encryption is described as follows. Let Epk(.) denote the encryption function with public key pk and Dsk(.) denote the decryption function with private key sk. A secure public key cryptosystem is called homomorphic if it satisfies the following requirements:
(1) Given the encryptions of m1 and m2, Epk(m1) and Epk(m2), there exists an efficient algorithm to compute the public key encryption of m1 + m2, denoted Epk(m1 + m2) := Epk(m1) +h Epk(m2).
(2) Given a constant k and the encryption of m, Epk(m), there exists an efficient algorithm to compute the public key encryption of k·m, denoted Epk(k·m) := k ×h Epk(m).
Secure sum securely calculates the sum of values from the individual sites. Assume that each site i has some value vi and all m sites want to securely compute v = Σ_{i=1..m} vi, where v is known to be in the range [0..n]. Homomorphic encryption can be used to calculate the secure sum as follows:
Algorithm 2
1: Site 1 creates a homomorphic encryption public and private key pair, and sends the public key to all sites
2: Site 1 sets V1 = E(v1)
3: Each site i, where m ≥ i > 1, gets V(i−1) from site i−1 and computes Vi = V(i−1) +h E(vi) using the additive property of the homomorphic encryption
4: Site m sends Vm to site 1
5: Site 1 sends D(Vm) to all parties
The protocol is secure because no party other than site 1 can decrypt the values. It also correctly calculates the summation, because Vm = E(Σ_{i=1..m} vi) and D(Vm) = v.
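The additive property can be realized with the Paillier cryptosystem [7], where +h is multiplication of ciphertexts mod n². The following toy, deliberately insecure sketch (tiny primes, single process, function names of our own choosing) runs the steps of Algorithm 2:

```python
import random
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

def keygen(p, q):
    """Toy Paillier key generation with the standard g = n + 1 shortcut."""
    n = p * q
    lam = lcm(p - 1, q - 1)
    g = n + 1
    # mu = (L(g^lam mod n^2))^-1 mod n, with L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu, n)

def encrypt(pk, m):
    n, g = pk
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(sk, c):
    lam, mu, n = sk
    return ((pow(c, lam, n * n) - 1) // n) * mu % n

# Algorithm 2: site 1 makes the keys; each site in turn folds in the
# encryption of its value; site 1 decrypts the final ciphertext.
pk, sk = keygen(17, 19)            # n = 323; the sum must stay below n
values = [12, 7, 30]               # private values of sites 1..3
V = encrypt(pk, values[0])         # step 2, at site 1
for v in values[1:]:               # step 3, at sites 2..m
    V = (V * encrypt(pk, v)) % (pk[0] ** 2)
print(decrypt(sk, V))              # 49: the sum, never seen in the clear
```

Unlike the masked ring protocol, only site 1 ever holds a decryption key, so intermediate sites learn nothing even if they collude (short of including site 1).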
C) Distributed Privacy Preserving Classification with the C4.5 Decision Tree
The partitioned data are distributed among n parties. Each party individually constructs one single optimal C4.5 decision tree. A new instance (i.e., test data) is sent between the parties in the distributed environment in encrypted form due to privacy concerns. Based on the new instance, each party performs classification individually; the results of each classification, i.e., the class labels from each tree, are distributed between the parties in encrypted form using secure sum and homomorphic encryption. From this, every party obtains the information about all parties' databases without violating individual privacy.
VII. CONCLUSION
C4.5 decision tree classification is well suited for distributed privacy preserving classification. It improves the accuracy of decision making, and it reduces the time consumed because every party constructs a single tree. In scenarios where many parties participate to perform global data mining without compromising their privacy, the algorithm decreases the costs of communication and computation compared with cryptography-based approaches. We have considered the security and privacy implications of dealing with distributed data that is partitioned either horizontally or vertically across multiple sites, and the challenges of performing data mining tasks on such data. In a distributed environment, C4.5 provides privacy; the additional constraint is that the process of building the classifier, or of classifying an instance, should not leak any additional information beyond what is learned from the result.
For future research, we will investigate the possibility of developing more effective and efficient algorithms. We also plan to extend our research to other tasks of data mining, such as clustering and association rule mining.
REFERENCES
[1] Jaideep Vaidya, Basit Shafiq, Wei Fan, Danish Mehmood, and David Lorenzi, "A Random Decision Tree Framework for Privacy-preserving Data Mining," MSIS Department, Rutgers University, Newark, NJ, USA, 2014.
[2] J. Vaidya, C. Clifton, M. Kantarcioglu, and A. S. Patterson, "Privacy-preserving decision trees over vertically partitioned data," ACM Trans. Knowl. Discov. Data, vol. 2, 2008.
[3] R. Cramer, I. Damgard, and J. B. Nielsen, "Multiparty computation from threshold homomorphic encryption," Springer-Verlag, May 2001.
[4] Alka Gangrade, "Building privacy-preserving C4.5 decision tree classifier on multiparties," International Journal on Computer Science and Engineering, vol. 1, no. 3, 2009.
[5] Charu C. Aggarwal and Philip S. Yu, "Privacy Preserving Data Mining: Models and Algorithms."
[6] Weiwei Fang, Bingru Yang, and Dingli Song, "Preserving Private Knowledge in Decision Tree Learning," Journal of Computers, vol. 5, no. 5, 2010.
[7] Pascal Paillier, "Public-Key Cryptosystems Based on Composite Degree Residuosity Classes," in J. Stern, Ed., Advances in Cryptology, vol. 1592 of Lecture Notes in Computer Science, Springer-Verlag, 1999.
[8] Ian H. Witten and Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations," Morgan Kaufmann Publishers, 2000.
[9] Dakshi Agrawal and Charu C. Aggarwal, "On the Design and Quantification of Privacy Preserving Data Mining Algorithms," IBM T. J. Watson Research Center, Yorktown Heights, NY.
[10] Hillol Kargupta, Souptik Datta, Qi Wang, and Krishnamoorthy Sivakumar, "On the Privacy Preserving Properties of Random Data Perturbation Techniques," University of Maryland Baltimore County, Baltimore, MD, USA.
[11] Zhengli Huang, Wenliang Du, and Biao Chen, "Deriving Private Information from Randomized Data," Syracuse University, Syracuse, NY.
[12] Wenliang Du and Zhijun Zhan, "Building Decision Tree Classifier on Private Data," Syracuse University, 2002.
[13] Murat Kantarcioglu and Chris Clifton, "Privacy-preserving Distributed Mining of Association Rules on Horizontally Partitioned Data," IEEE Transactions on Knowledge and Data Engineering.
[14] Amit Dhurandhar and Alin Dobra, "Probabilistic Characterization of Random Decision Trees," Journal of Machine Learning Research, 2008.
[15] Luong The Dung and Ho Tu Bao, "Privacy Preserving EM-based Clustering," Vietnam Government Information Security Commission, Hanoi, Vietnam; Japan Advanced Institute of Science and Technology, Japan.
[16] Jaideep Vaidya and Chris Clifton, "Privacy Preserving K-Means Clustering over Vertically Partitioned Data," Purdue University, West Lafayette, IN.
[17] Geetha Jagannathan and Rebecca N. Wright, "Privacy-Preserving Distributed K-Means Clustering over Arbitrarily Partitioned Data," Stevens Institute of Technology, Hoboken, NJ, USA.
[18] Jaideep Vaidya and Chris Clifton, "Privacy Preserving Association Rule Mining in Vertically Partitioned Data," Purdue University, West Lafayette, IN.
[19] Murat Kantarcioglu and Jaideep Vaidya, "An Architecture for Privacy-preserving Mining of Client Information," Purdue University.