ISSN (Online): 2320-9801, ISSN (Print): 2320-9798
International Journal of Innovative Research in Computer and Communication Engineering
An ISO 3297: 2007 Certified Organization, Vol. 3, Special Issue 3, April 2015
2nd National Conference on Emerging Trends in Electronics and Communication Engineering (NCETECE'15)
Organized by Dept. of ECE, New Prince Shri Bhavani College of Engineering & Technology, Chennai-600073, India, 6th & 7th April 2015

Privacy Preserving Distributed Classification Using C4.5 Decision Tree

S. Merlin, 2nd Year ME (CSE), Regional Centre of Anna University, Tirunelveli, Tamil Nadu, India
J. Jesu Vedha Nayhi, Asst. Professor, Regional Centre of Anna University, Tirunelveli, Tamil Nadu, India

ABSTRACT: In distributed classification, maintaining the privacy of individual data is a major issue. The Random Decision Tree (RDT) concept is used to reduce information leakage, but RDT still has privacy weaknesses, because its structure sharing rests on assumptions, and each assumption has drawbacks for privacy: if the structure is kept unknown, other parties may still guess information from the schema, and if a number of new instances are classified to the same leaf node, it is easy to infer the structure of the branch leading to that leaf. Moreover, the number of trees constructed in RDT increases the time complexity. The proposed system uses the C4.5 decision tree algorithm: every party builds one single optimal tree, and the class labels for a new instance are exchanged between the parties, with the distributed classification secured using secure sum or threshold homomorphic encryption.

KEYWORDS: C4.5 decision tree, secure sum, homomorphic encryption

I. INTRODUCTION

The basic idea of privacy preserving data mining is to extend traditional data mining techniques to work with data modified to mask sensitive information.
This has led to concerns that personal data may be misused for a variety of purposes. In order to alleviate these concerns, a number of techniques have recently been proposed to perform data mining tasks in a privacy preserving way. At present, great advances in networking and database technologies make it easy to distribute data across multiple parties and to collect data on a large scale for information sharing. Distributed data mining techniques such as association rule mining and decision tree learning are widely used by global enterprises to obtain accurate market information for their business decisions. Although different enterprises are willing to collaborate with each other to mine the union of their data, due to legal constraints or competition among enterprises they do not want to reveal their sensitive and private information to others during the data mining process. There has been growing concern that the technology for gaining knowledge from vast quantities of data is violating individual privacy. Privacy preserving data mining (PPDM) has emerged to address this problem, and has become a challenging research area in the field of data mining (DM). The method of privacy preserving data mining depends on the data mining task (i.e., association rules, classification, clustering, etc.) and on the manner in which the data sources are distributed: centralized, where all transactions are stored at only one party; horizontal, where every participating party has only a subset of the transaction records, but every record contains all attributes; and vertical, where every participating party has the same set of transaction records, but every record contains only some of the attributes. We next consider the problem of constructing a decision tree classifier in a distributed environment. We first present experimental evidence that random decision trees provide good privacy and good accuracy, particularly for small datasets.
Using random decision trees, the algorithm produces classifiers that have good prediction accuracy without compromising privacy, even for small datasets. In contrast to ID3 trees, random decision tree classifiers are not suitable for applications in which it is necessary to learn which combinations of attribute values are most predictive of the class label, because random decision trees do not provide this information.

II. RELATED WORK

In distributed privacy preserving classification there is a large possibility of a privacy violation when a new instance needs to be classified. Privacy preserving via dataset complementation fails if all training datasets are leaked, because the dataset reconstruction algorithm is generic. The difficulty of preserving individual privacy is compounded by the availability of auxiliary information, which undermines straightforward approaches based on anonymization. The ID3 tree is not suitable for applications that must identify which attribute combinations predict a sensitive class label, because RDT does not provide that information. Encryption speed is the most important factor in the overall speed of the algorithm. In order to make significant improvements to this algorithm, either the encryption speed must be increased or the number of encryptions used must be decreased. Possible routes are to use specialized hardware to speed up the encryption, or to use elliptic curve encryption techniques to decrease the necessary key space.
Reconstructing perturbed data is also complex. The adversary should not be able to prevent the correct result from being computed, and should learn nothing more than the result and the inputs of the corrupted parties. Because the set of corrupted parties is fixed from the start, such an adversary is called static or non-adaptive. Reconstruction of a given dataset need not be exact: a given reconstruction algorithm may not always converge, and even when it converges, there is no guarantee that it provides a reasonable estimate of the original distribution; at best the algorithm provides a robust estimate of that distribution. The randomness structure may itself be used to compromise privacy unless careful attention is paid. Random value distortion only reconstructs the distribution of the original data from the perturbed data; it does not consider estimating the individual values of the data points. Cryptography-based work for privacy-preserving data mining is still too slow to be effective for the large-scale datasets of today's big data challenge. Random Decision Trees (RDT) show that it is possible to generate equivalent and accurate models at much smaller cost, but RDT has a drawback: accuracy can be raised by constructing more trees, yet more trees increase the time complexity. The C4.5 decision tree algorithm improves both the time complexity and the accuracy, because each party constructs only a single tree, which reduces the time, and error pruning techniques applied before the tree is finalized improve the accuracy.

III. PRIVACY PRESERVING METHODS

The field of privacy has seen rapid advances in recent years because of increases in the ability to store data. In particular, recent advances in the data mining field have led to increased concerns about privacy.
While the topic of privacy has traditionally been studied in the context of cryptography and information hiding, the recent emphasis on data mining has led to renewed interest in the field. The problem of privacy-preserving data mining has become more important in recent years because of the increasing ability to store personal data about users, and the increasing sophistication of data mining algorithms that leverage this information. A number of techniques such as randomization and k-anonymity are used to perform privacy-preserving data mining. Furthermore, the problem has been discussed in multiple communities such as the database community, the statistical disclosure control community and the cryptography community. In some cases, the different communities have explored parallel lines of work which are quite similar.

A) K-Anonymity

This is a method for directly building a k-anonymous decision tree from a given dataset. The algorithm is basically an improvement of the classical decision tree building algorithm, combining mining and anonymization in a single process. At initialization time, the decision tree is composed of a unique root node, representing all the attributes in the given dataset. At each step, the algorithm inserts a new splitting node into the tree by choosing the attribute in the quasi-identifier that is most useful for classification purposes, and updates the tree accordingly. If the resulting tree is not k-anonymous, the node insertion is rolled back. The algorithm stops when no node can be inserted without violating k-anonymity, or when the classification obtained is considered satisfactory.

B) Randomization

This method has been discussed for decision tree classification using aggregate distributions reconstructed from the randomized distribution. The key idea is to construct the distributions separately for the different classes.
Then the splitting condition for the decision tree uses the relative presence of the different classes, which is derived from the aggregate distributions. Since the probabilistic behavior is encoded in the aggregate data distributions, it can also be used to construct a naive Bayes classifier. In such a classifier, the approach of randomized response with partial hiding is used in order to perform the classification. This approach is effective both empirically and analytically.

The basic problem in distributed classification is to train a classifier from the distributed data and then classify each new instance. For distributed decision tree classification, the objective is to create a decision tree classifier from the distributed data. In the privacy-preserving case, the additional constraint is that the process of building the classifier, or of classifying an instance, should not leak any additional information beyond what is learned from the result. RDT classifier concepts have been used to provide this kind of privacy.

IV. PRIVACY PRESERVING IN DISTRIBUTED ENVIRONMENT

I) Partitioning Data

The complete dataset can be partitioned horizontally or vertically.

A) Horizontal Partitioning

In horizontally partitioned data sets, different sites contain different sets of records with the same (or highly overlapping) set of attributes, which are used for mining purposes. A popular decision tree induction method is constructed with the use of approximations of the best splitting attributes.
Subsequently, a variety of classifiers have been generalized to the problem of horizontally partitioned privacy preserving mining. An extreme solution for the horizontally partitioned case performs privacy-preserving classification in a fully distributed setting, where each customer has private access to only their own record. A host of other data mining applications have been generalized to the problem of horizontally partitioned data sets.

B) Vertical Partitioning

A portion of each instance is present at each site, but no site contains complete information for any instance. The method presented here works for any number of parties, and the class attribute (or other attributes) need be known to only one party. The method is trivially extendible to the simplified case where all parties know the class attribute.

V. C4.5 DECISION TREE

Decision tree classification predicts the class attribute value of a new transaction based on a given training set of transactions, in which each transaction consists of several general attributes and a class attribute. The crucial problem of classification is how to build the decision tree, in which each leaf node represents a classification result, that is, a class attribute value, and each non-leaf node represents a testing attribute, that is, a general attribute. A decision tree can be built using the algorithms ID3 and C4.5. In fact, C4.5 can be regarded as an extension of ID3, and the main differences between the two algorithms are: 1) C4.5 can deal with attributes with both discrete and continuous values, while ID3 can only cope with attributes with discrete values; 2) C4.5 uses the information gain ratio to determine the nodes of the decision tree, instead of the information gain used in ID3; 3) compared with ID3, C4.5 adds the processes of pruning the decision tree and deriving rule sets from it.
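To make difference 2) concrete, the following minimal Python sketch (using the standard C4.5 formulas and a hypothetical toy dataset) computes both the information gain, which ID3 maximizes, and the gain ratio, which C4.5 maximizes, for a single discrete attribute:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_and_gain_ratio(values, labels):
    """Information gain (ID3 criterion) and gain ratio (C4.5 criterion)
    for a discrete attribute given as parallel value/label lists."""
    n = len(labels)
    base = entropy(labels)
    cond = 0.0    # Entropy(T|a): weighted entropy of each value subset
    split = 0.0   # Split(a): entropy of the partition sizes themselves
    for v in set(values):
        subset = [lbl for x, lbl in zip(values, labels) if x == v]
        w = len(subset) / n
        cond += w * entropy(subset)
        split -= w * math.log2(w)
    gain = base - cond
    return gain, gain / split if split > 0 else 0.0

# Hypothetical toy data: does "outlook" predict "play"?
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
play    = ["no",    "no",    "yes",  "no",   "yes",      "yes"]
g, gr = gain_and_gain_ratio(outlook, play)   # g = 2/3; gr = g / log2(3)
```

The denominator Split(a) penalizes attributes that fragment the data into many small subsets, which is why C4.5 prefers the gain ratio over the raw gain.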
The pseudocode of the original C4.5 tree construction algorithm is given below for a data set distributed among n parties, in which the n parties share the set of general attributes A and the class attribute C = {c1, …, cm}, and each party Pi (i ∈ [1, n]) has a training set of transactions Ti, a part of the whole training set T.

Algorithm 1 (Pseudocode of the distributed C4.5 tree-construction algorithm)

BuildTree(P1: T1, C, A; …; Pn: Tn, C, A)
1) Compute the frequency freq(cj, T) for each j ∈ [1, m]:
   for each i ∈ [1, n], counts_i(cj) = {t | t.C = cj, t ∈ Ti};
   P1, …, Pn jointly compute freq(cj, T);
2) Find freq(c_max, T) = Max{freq(cj, T) | j ∈ [1, m]};
3) If freq(c_max, T) = 1, or |T| is less than a certain value,
   Create a leaf node L with the class c_max;
   Compute the classification error of the node L;
   Return L;
4) Create a decision tree node N;
5) For each attribute a ∈ A, P1, …, Pn jointly compute the information gain ratio GainRatio(a):
   P1, …, Pn jointly determine whether a is a discrete or a continuous attribute;
   If a is discrete and has l possible attribute values,
     P1, …, Pn jointly compute GainRatio(a) for the l-splitting of T;
   If a is continuous and has l attribute values v1, …, vl,
     P1, …, Pn jointly compute GainRatio(a) for all the l−1 possible 2-splittings of T, and
     P1, …, Pn jointly find the 2-splitting with the best GainRatio;
6) N.test = AttributeWithBestGainRatio;
7) For each T' = T1' ∪ … ∪ Tn' in the splitting of T:
   If T' is empty, the child of N is a leaf;
   Else the child of N = BuildTree(P1: T1', C, A − {a}; …; Pn: Tn', C, A − {a});
8) Compute the classification error of the node N;
9) Return N;
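The recursive structure of Algorithm 1 can be sketched for a single party as follows. This is an illustrative sketch only: the joint computations between P1, …, Pn are abstracted into a plain local gain_ratio callback, and MIN_SIZE and the row format are hypothetical choices, not part of the original protocol.

```python
from collections import Counter

MIN_SIZE = 2  # stopping threshold: "|T| is less than a certain value"

def majority_class(rows):
    """The most frequent class label among the node's transactions."""
    return Counter(r["class"] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, gain_ratio):
    """Recursive skeleton of Algorithm 1 for one party.
    `gain_ratio(rows, a)` is assumed to return the C4.5 gain ratio of
    attribute `a`; in the distributed protocol that call would be a
    joint computation between P1..Pn (step 5)."""
    classes = {r["class"] for r in rows}
    # Step 3: pure node, too few transactions, or no attributes -> leaf
    if len(classes) == 1 or len(rows) < MIN_SIZE or not attributes:
        return {"leaf": majority_class(rows)}
    # Steps 4-6: pick the attribute with the best gain ratio
    best = max(attributes, key=lambda a: gain_ratio(rows, a))
    node = {"test": best, "children": {}}
    # Step 7: split T and recurse with A - {a}
    remaining = [a for a in attributes if a != best]
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        node["children"][value] = build_tree(subset, remaining, gain_ratio)
    return node
```

Note that the "T' is empty" branch of step 7 cannot arise in this sketch, because children are only created for attribute values actually observed at the node.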
In step 5 of the algorithm, GainRatio(a) is calculated as follows:

Entropy(T) = − Σ_{j=1..m} freq(cj, T) · log2 freq(cj, T)
Entropy(T|a) = Σ_{k=1..l} (|T(a = vk)| / |T|) · Entropy(T(a = vk))
Gain(a) = Entropy(T) − Entropy(T|a)
Split(a) = − Σ_{k=1..l} (|T(a = vk)| / |T|) · log2(|T(a = vk)| / |T|)
GainRatio(a) = Gain(a) / Split(a)

where freq(cj, T) denotes the relative frequency of class cj in T, and T(a = vk) denotes the subset of T in which attribute a takes the value vk.

VI. PRIVACY

The class labels of the tree carry the information counts for the overall database, so each party must provide privacy for the class labels. This paper uses two privacy techniques: secure sum and threshold additive homomorphic encryption.

A) Secure Sum

Secure sum is often given as a simple example of secure multiparty computation. Distributed data mining algorithms frequently calculate the sum of values from individual sites. Assuming three or more parties and no collusion, the following method securely computes such a sum. Assume that the value v = Σ vi to be computed is known to lie in the range [0, n], with all additions performed modulo n + 1. One site is designated the master site, numbered 1; the remaining sites are numbered 2, …, s. Site 1 generates a random number R, uniformly chosen from [0, n], adds it to its local value v1, and sends the sum R + v1 to site 2, so site 2 learns nothing about the actual value of v1. Each site in turn adds its own value and passes the running total on; finally, site s sends the result to site 1. Site 1, knowing R, can subtract R to obtain the actual result.

B) Threshold Homomorphic Encryption

Homomorphic encryption can be described as follows. Let E_pk(·) denote the encryption function with public key pk and D_pr(·) denote the decryption function with private key pr. A secure public key cryptosystem is called homomorphic if it satisfies the following requirements:
(1) Given the encryptions of m1 and m2, E_pk(m1) and E_pk(m2), there exists an efficient algorithm to compute the public key encryption of m1 + m2, denoted E_pk(m1 + m2) := E_pk(m1) +h E_pk(m2).
(2) Given a constant k and the encryption of m1, E_pk(m1), there exists an efficient algorithm to compute the public key encryption of k · m1, denoted E_pk(k · m1) := k ×h E_pk(m1).

Secure sum securely calculates the sum of values from individual sites. Assume that each site i has some value vi and all sites want to securely compute v = Σ vi, where v is known to lie in the range [0, n]. Homomorphic encryption can be used to calculate the secure sum as follows:

Algorithm 2
1: Site 1 creates a homomorphic encryption public and private key pair, and sends the public key to all sites
2: Site 1 sets V1 = E_pk(v1)
3: Each site i, where m ≥ i > 1, gets V_{i−1} from site i−1 and computes Vi = V_{i−1} +h E_pk(vi) using the additive property of the homomorphic encryption
4: Site m sends Vm to site 1
5: Site 1 sends D_pr(Vm) to all parties

The protocol is secure because no party other than site 1 can decrypt the values. It also correctly calculates the summation, because Vm = E_pk(v1 + … + vm) and D_pr(Vm) = v.

C) Distributed Privacy Preserving Classification with the C4.5 Decision Tree

The partitioned data are distributed among n parties. Each party individually constructs one single optimal C4.5 decision tree. A new instance (i.e., the test data) is sent between the parties in the distributed environment in encrypted form, due to privacy concerns. Based on the new instance, each party performs classification individually, and the results of the classifications, i.e., the class labels from each tree, are distributed between the parties in encrypted form using secure sum and homomorphic encryption. From this, every party obtains information about all parties' databases without violating individual privacy.
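As a concrete illustration of Algorithm 2, the sketch below instantiates the additive homomorphic scheme with a textbook Paillier cryptosystem. The tiny hard-coded primes make it completely insecure and are for illustration only, and the site loop simulates the ring of sites within a single process; the original protocol does not fix a particular cryptosystem, although Paillier is the standard example.

```python
import math
import random

# Textbook Paillier cryptosystem -- tiny, INSECURE demo primes.
p, q = 499, 547
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
g = n + 1                                          # standard generator choice
mu = pow(lam, -1, n)                               # modular inverse (Python 3.8+)

def encrypt(m):
    """E(m) = g^m * r^n mod n^2, for a random r coprime to n."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """D(c) = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    return ((pow(c, lam, n2) - 1) // n) * mu % n

def add_cipher(c1, c2):
    """The '+h' operation: multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % n2

# Algorithm 2, simulated in one process: each site folds its private
# value into the running ciphertext without seeing the other values.
values = [12, 7, 30, 1]          # private v_i held by sites 1..m
running = encrypt(values[0])     # step 2: site 1 sets V1 = E(v1)
for v in values[1:]:             # step 3: site i computes Vi = V(i-1) +h E(vi)
    running = add_cipher(running, encrypt(v))
total = decrypt(running)         # steps 4-5: site 1 decrypts Vm and shares it
```

Each intermediate ciphertext is re-randomized by the fresh factor r^n, so a site relaying the running value learns nothing about the partial sum it carries.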
VII. CONCLUSION

C4.5 decision tree classification is well suited for distributed privacy preserving classification. It improves the accuracy of decision making, and it reduces the time consumed because every party constructs a single tree. In scenarios where many parties participate to perform global data mining without compromising their privacy, the algorithm decreases the costs of communication and computation compared with purely cryptography-based approaches. We have considered the security and privacy implications of dealing with distributed data that is partitioned either horizontally or vertically across multiple sites, and the challenges of performing data mining tasks on such data. In a distributed environment, C4.5 provides privacy under the additional constraint that the process of building the classifier, or of classifying an instance, should not leak any additional information beyond what is learned from the result. For future research, we will investigate the possibility of developing more effective and efficient algorithms. We also plan to extend our research to other data mining tasks, such as clustering and association rule mining.