Survey
International Journal of Computer Application
Available online on http://www.rspublication.com/ijca/ijca_index.htm
Issue 4, Volume 1 (February 2014) ISSN: 2250-1797

A Survey of Safeguarding the Privacy in Data Mining Using Decision Tree Learning Algorithms

Mrs.G.Shoba, Rajeswari.M, Kalaitchelvi.S

#1 Asst. Prof (CSE Dept), Christ College of Engineering and Technology, Affiliated to Pondicherry University, Puducherry-605010.
#2 Final Year M.Tech Scholar (CSE Dept), Christ College of Engineering and Technology, Affiliated to Pondicherry University, Puducherry-605010.
#3 Final Year M.Tech Scholar (CSE Dept), Christ College of Engineering and Technology, Affiliated to Pondicherry University, Puducherry-605010.

ABSTRACT

Privacy is an important concern in many data mining applications that deal with health care, security, financial and other types of sensitive data. On one hand, the data mining process yields knowledge that can support a variety of domains such as marketing, weather forecasting, and medical diagnosis. On the other hand, easy access to personal data poses a danger to individual privacy. People's real concern is that their private information should not be misused behind the scenes without their knowledge. Because of the risk of privacy intrusion during data mining operations, it is often not possible to exploit huge databases for scientific or financial research. To address this problem, several privacy-preserving data mining techniques are used. The aim of privacy preserving data mining (PPDM) is to extract relevant knowledge from large amounts of data while at the same time protecting sensitive information. This paper focuses on a general classification technique, the decision tree classifier, for preserving privacy. It presents a survey of decision tree learning under various privacy techniques.

Keywords: Privacy, Data Mining, Decision Tree Classifier, PPDM.

1. INTRODUCTION

Data mining is the process of extracting knowledge or patterns from large amounts of data. It is widely used by researchers in science and business. Data collected from information providers are significant for pattern recognition and decision making. The data collection process takes time and effort, hence sample datasets are sometimes stored for reuse. However, attackers may attempt to steal these sample datasets, and private information may be leaked from the stolen datasets. Therefore, privacy preserving data mining techniques have been developed to convert sensitive datasets into sanitized versions in which private or sensitive information is hidden from unauthorized retrievers [1]. Privacy preserving data mining [15] is a special data mining technique that has emerged to deal with the privacy issue in data mining. PPDM uses special techniques to protect the privacy of sensitive data while still giving valid data mining results. In this paper we propose a novel method to preserve privacy by perturbing the original data using a randomized data perturbation privacy preserving data mining technique and then constructing a decision tree classifier on the perturbed data.

The rest of this paper is structured as follows: the next section describes the decision tree classifier and gives a brief introduction to the decision tree algorithm. Section 3 introduces different privacy preservation techniques based on the decision tree approach. Section 4 provides an overall summary of this paper and suggests directions for further research on this topic.

2. DECISION TREE CLASSIFIER

A decision tree [15] is defined as "a predictive modeling technique from the field of machine learning and statistics that builds a simple tree-like structure to model the underlying pattern of data". The decision tree is a popular method that can handle both categorical and numerical data and performs classification with little computation. Decision trees are often easier to interpret.

A decision tree is a classifier in the form of a directed tree in which one node, called the root, has no incoming edges. All other nodes have exactly one incoming edge. Each non-leaf node, called an internal node or splitting node, contains a decision, and each leaf node represents the most appropriate target value, assigned to one class [15]. A decision tree classifier can break down a complex decision-making process into a collection of simpler decisions. The complex decision is subdivided into simpler decisions on the basis of splitting criteria, which divide the whole training set into smaller subsets. Information gain, gain ratio and Gini index are three basic splitting criteria used to select an attribute as a splitting point. Because decision trees can be built from historical data, they are often used for explanatory analysis as well as a form of supervised learning. The algorithm is designed to work on all the data that is available and to fit it as well as possible.

According to Breiman et al., tree complexity has a crucial effect on accuracy. Tree complexity is explicitly controlled by the pruning method employed and the stopping criteria used. Usually, tree complexity is measured by one of the following metrics:
• The total number of nodes;
• The total number of leaves;
• Tree depth.

3. PRIVACY PRESERVING DECISION TREE LEARNING

This section explores privacy preserving decision tree learning techniques in which the data is first modified using different data modification and perturbation-based approaches, and decision tree mining is then applied to the modified or sanitized dataset.

(i) Decision tree classifier for privacy preserving.
(ii) Efficient decision tree construction in unrealized data set using C4.5 algorithm.
(iii) Efficient decision tree based privacy preserving approach for unrealized data sets.
(iv) Privacy preserving decision tree from perturbed data.
(v) Privacy preserving decision tree over vertically partitioned data.
(vi) Privacy preserving decision tree learning using unrealized data sets.
(vii) Noise Addition scheme in decision tree for privacy preserving data mining.
(viii) Privacy Preserving Two-Layer Decision Tree Classifier for Multiparty Databases.
(ix) Privacy Preservation Decision Tree Based On Data Set Complementation.
(x) Building Privacy-Preserving C4.5 Decision Tree Classifier On Multiparty.

3.1 Decision Tree Classifier for Privacy Preserving

The goal is to protect the sensitive information in data while extracting knowledge from large amounts of data. The C4.5 algorithm is used for general classification in a secure manner, which yields a privacy-preserving decision tree classifier. To preserve privacy via dataset complementation, the original dataset is replaced by unreal datasets from which the original samples cannot be reconstructed without the entire group of unreal data sets; the Unrealized Training Set algorithm is used for this conversion. This approach is applied directly to the data storage during the data collection process, as soon as the first sample is collected.
An accurate decision tree can be built directly from those unreal datasets. During the privacy preserving process, this set of perturbed data sets is dynamically modified. These perturbed data sets are stored as the sanitized version of the original samples and are used by a modified decision tree data mining method [6].

3.2 Efficient Decision Tree Construction in Unrealized Data Set using C4.5 algorithm

Privacy preservation in data mining activities is of significant importance for many applications. However, the privacy preserving process sometimes reduces the utility of training datasets, which causes inaccurate data mining results. The problem of privacy-preserving data mining has become more important in recent years because of the increasing ability to store personal data about users and the increasing sophistication of data mining algorithms that leverage this information. A C4.5 decision tree can be used to generate a reasonably small and very accurate decision function. The mean performance penalty on existing performance data was within the measurement error for all trees considered. These trees were also able to produce decision functions with less than 2.5% relative performance penalty for both collectives. This indicates that it is possible to use information about one MPI collective operation to generate a reasonably good decision function for another collective. ID3 and C4.5 branch on every value and use an entropy minimization heuristic to select the best attribute [7].

3.3 Decision tree based privacy preserving approach for unrealized data sets

This privacy preserving approach (PPDM) works through data set complementation; it ensures the effectiveness of training data sets for decision tree learning using ID3 and Improved ID3. During the privacy preserving process, the set of perturbed datasets is dynamically modified. These perturbed datasets, derived from the original data samples, are stored to allow a modified decision tree based data mining method.
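The entropy-based attribute selection that ID3 and C4.5 rely on can be illustrated with a small sketch. This is a minimal illustration under stated assumptions, not code from any of the surveyed papers: the function names (`entropy`, `information_gain`, `best_attribute`) and the toy records are hypothetical, and only plain information gain is shown (C4.5's gain ratio and its handling of continuous attributes are omitted).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attr, target):
    """Entropy of the target minus the weighted entropy after splitting on attr."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_attribute(rows, attrs, target):
    """Pick the attribute with the highest information gain (ID3-style choice)."""
    return max(attrs, key=lambda a: information_gain(rows, a, target))


# Hypothetical toy records: 'outlook' separates the classes, 'windy' does not.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "yes"},
]
chosen = best_attribute(rows, ["outlook", "windy"], "play")
```

In this toy data the split on `outlook` removes all uncertainty about `play`, so it is the attribute an entropy-minimizing heuristic would select; the same computation could be run on perturbed or unrealized data once the class counts are available.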
This method is guaranteed to provide the same data mining outcomes as the originals, which is shown both by proof and by a test using one set of sample datasets. From the viewpoint of privacy preservation, the original datasets can only be reconstructed in their entirety if someone has all perturbed datasets, which is not supposed to be the case for an unauthorized party. The RIPPER algorithm is reported to be well suited for learning once the unrealized dataset is complete. An improved RIPPER rule learner is created and applied to the resulting unrealized dataset [8].

3.4 Privacy Preserving Decision Tree Mining from Perturbed Data

Li Liu, Murat Kantarcioglu and Bhavani Thuraisingham [13] propose a new perturbation-based technique that modifies the data mining algorithm. They build a classifier for the original data set from the perturbed training data set. They propose a modified C4.5 [9] decision tree classifier that is suitable for privacy preserving data mining and can be used to classify both the original and the perturbed data. This technique considers the splitting point of the attribute as well as the bias of the noise data set. It calculates the bias whenever it tries to find the best attribute, find the best split point, and partition the training data. The algorithm is based on the perturbation scheme but skips the step of reconstructing the original data distribution. The proposed technique increases privacy protection with less computation time. Privacy as a security issue in the data mining area is still a challenge.

3.5 Privacy-Preserving Decision Trees over Vertically Partitioned Data

Jaideep Vaidya, Chris Clifton, Murat Kantarcioglu and A. Scott Patterson [14] introduce a generalized privacy-preserving variant of the ID3 algorithm for vertically partitioned data distributed over two or more parties. The algorithm presented in their paper, implemented as a working program, is a significant step forward in creating usable, distributed, privacy preserving data mining algorithms. The paper presents a new protocol to construct a decision tree on vertically partitioned data with an arbitrary number of parties where only one party has the class attribute. It presents a general framework for constructing a system in which distributed classification would work, and it serves to show that the methods can actually be built and are feasible. This work provides an upper bound on the complexity of building privacy preserving decision trees. Significant work is still required to find a tight upper bound on the complexity.

3.6 Privacy Preserving Decision Tree Learning Using Unrealized Data Sets

Pui K. Fong and Jens H. Weber-Jahnke [8] introduce a new perturbation and randomization based approach that protects centralized sample data sets utilized for decision tree data mining. They introduce a new privacy preserving approach via data set complementation. Dataset complementation preserves the utility of training data sets for decision tree learning. This approach converts the original sample data sets into a group of unreal data sets. The original samples cannot be reconstructed without the entire group of unreal data sets. An accurate decision tree can be built directly from those unreal data sets. This approach can be applied directly to the data storage as soon as the first sample is collected, and at any time during the data collection process. In order to mitigate the threat of inadvertent disclosure or theft, privacy preservation is applied to sanitize the samples prior to their release to third parties.
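The counting idea behind data set complementation can be sketched as follows. This is a simplified illustration only, not the unrealization algorithm of the cited work: it assumes a universal set containing every possible attribute combination exactly once, a hypothetical parameter `copies` giving how many universal sets the complement is taken against, and hypothetical function names. It shows only that the original tuple counts, which are all a decision tree learner needs for its entropy calculations, follow arithmetically from the complement.

```python
from collections import Counter
from itertools import product


def universal_set(domains):
    """Every possible tuple over the attribute domains, each exactly once."""
    return [tuple(values) for values in product(*domains)]


def complement(samples, domains, copies):
    """Multiset (copies x universal set) minus the sample multiset."""
    counts = Counter({t: copies for t in universal_set(domains)})
    counts.subtract(Counter(samples))
    if any(c < 0 for c in counts.values()):
        raise ValueError("copies is too small or a sample is outside the domains")
    return counts


def recover_counts(comp, copies):
    """Original count of each tuple = copies - count in the complement."""
    return Counter({t: copies - c for t, c in comp.items() if copies - c > 0})


# Hypothetical toy data: (outlook, play) pairs.
domains = [("sunny", "rain"), ("no", "yes")]
samples = [("sunny", "no"), ("sunny", "no"), ("rain", "yes")]
comp = complement(samples, domains, copies=2)
restored = recover_counts(comp, copies=2)
```

Here `restored` holds the same tuple counts as `samples`, although the stored multiset `comp` contains none of the original rows as such. In this simplified form anyone who knows `copies` and holds the whole complement can recover the counts, which is consistent with the remark that the original samples can be reconstructed only from the entire group of unreal data sets.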
In contrast to other sanitization methods, this technique does not affect the accuracy of data mining results. The decision tree can be built directly from the sanitized data sets, with no need to reconstruct the original dataset.

3.7 A Noise Addition Scheme in Decision Tree for Privacy Preserving Data Mining

Mohammad Ali Kadampur and Somayajulu D.V.L.N. [10] propose a strategy that protects data privacy during decision tree analysis in the data mining process. It is basically a noise addition framework specifically tailored toward the classification task in data mining. They propose to add specific noise to the numeric attributes after exploring the decision tree of the original data. The modified data is then presented to the second party for decision tree analysis. The decision trees obtained on the original data and on the obfuscated data are similar, but with this method the actual data is not revealed to the second party during the mining process, and hence privacy is preserved. The method also preserves averages and a few other statistical parameters, thus making the modified data set useful for both data mining and statistical purposes. The paper uses Quinlan's [15] C5.0 decision tree builder on the selected data set [16] to obtain the decision tree of the original data set, applies a unique method of listing the nodes (attributes) touched on the path from the root of the tree to each leaf, and then uses a noise addition strategy for each of those attributes. The approach integrates both categorical and numeric data types and focuses on privacy preservation during decision tree analysis.
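One way to see how noise can be added to a numeric attribute while its average is preserved is the following minimal sketch. It is an assumption-laden illustration, not the scheme of the cited paper: the Gaussian noise, the re-centering step, the `scale` parameter and the function name `add_mean_preserving_noise` are all hypothetical choices made here for demonstration.

```python
import random


def add_mean_preserving_noise(values, scale, rng):
    """Perturb each value with zero-centred noise, keeping the sample mean.

    The raw Gaussian noise is shifted by its own average so that the noise
    terms sum to zero; the perturbed values therefore have the same mean as
    the originals (up to floating-point rounding).
    """
    noise = [rng.gauss(0.0, scale) for _ in values]
    shift = sum(noise) / len(noise)
    return [v + n - shift for v, n in zip(values, noise)]


# Hypothetical numeric attribute (for example, ages of records in one leaf).
rng = random.Random(0)  # fixed seed so the sketch is repeatable
ages = [23.0, 35.0, 41.0, 58.0]
perturbed = add_mean_preserving_noise(ages, scale=2.0, rng=rng)
```

Applying such a perturbation separately to the records that reach each leaf would keep per-leaf averages unchanged; how much noise can be added before the tree structure changes depends on the split points and would need to be checked against the original tree.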
3.8 Privacy Preserving Two-Layer Decision Tree Classifier for Multiparty Databases

This paper addresses privacy preserving data mining in a distributed setting. In particular, it focuses on a privacy preserving two-layer decision tree classifier on horizontally partitioned data. The objective of privacy preserving data classification is to build accurate classifiers without disclosing private information in the data being mined. The performance of privacy preserving techniques should be analyzed and compared in terms of both the privacy protection of individual data and the predictive accuracy of the constructed classifiers [11].

3.9 Privacy Preservation Decision Tree Based On Data Set Complementation

This paper focuses on privacy protection of the training samples used for decision tree data mining by means of a data set complementation algorithm. The aim of this algorithm is to protect the sensitive information within a large data set. The privacy preservation of the data set can be expressed in the form of a decision tree, a cluster or an association rule. The paper proposes privacy preservation based on data set complementation algorithms that store the information of the real dataset, so that the private data is safe from an unauthorized party even if some portion of the data is lost; the original data set can then be reconstructed from the unrealized dataset and the perturbing data set [9].

3.10 Building Privacy-Preserving C4.5 Decision Tree Classifier on Multiparties

This paper addresses the privacy-preserving classification problem in a multi-party setting.
It focuses on general classification in a secure manner and introduces a privacy-preserving decision tree classifier using the C4.5 algorithm without involving a third party. The C4.5 algorithm is a software extension of the basic ID3 algorithm designed by Quinlan. The protocol used in this paper is reported to be considerably more efficient than existing solutions [12].

4. CONCLUSION

In this paper, we present the decision tree classification technique for privacy preservation in data mining. We have surveyed different approaches used in estimating the effectiveness of privacy preserving data mining algorithms using decision tree classifiers. The work presented in this paper indicates the ever increasing interest of researchers in the area of securing sensitive data and knowledge from malicious users. Many privacy preserving algorithms for decision tree mining have been proposed by researchers; however, privacy preserving technology needs to be researched further because of the complexity of the privacy problem. We conclude by analyzing the existing privacy preserving decision tree mining algorithms in Table 1, with remarks and pitfalls for each. Future research needs to address these remarks.

Table 1: Analysis of decision tree algorithms for privacy preserving in data mining

Columns: S.no. / Title / Purpose / Technique and algorithm used / Remarks / Pitfalls

1. Title: Decision tree classifier for privacy preserving.
   Purpose: Convert the original sample data into a group of unreal data sets.
   Technique and algorithm used: Data set complementation approach for converting the sample data set into sanitized or altered data sets, and a decision tree classifier based on the C4.5 algorithm.
   Remarks: Considers both continuous valued attributes and discrete value sets. It uses both the ID3 and C4.5 algorithms to analyze the perturbed data set.
   Pitfalls: Requires extra storage for the perturbed data set and the complement of the sample data set.

2. Title: Efficient decision tree construction in unrealized data set using C4.5 algorithm.
   Purpose: Privacy preservation of collected data samples and of information from the sample database; the original data samples cannot be reconstructed if an unauthorized party were to steal some portion of the data.
   Technique and algorithm used: Reduces the storage requirement associated with the derived data set complementation approach, using the C4.5 algorithm.
   Remarks: Execution time and accuracy of the C4.5 algorithm are high when compared to the ID3 algorithm, even if the size is increased.
   Pitfalls: The unrealized data set does not perform well with sample data sets with low frequency.

3. Title: Efficient decision tree based privacy preserving approach for unrealized data sets.
   Purpose: -
   Technique and algorithm used: RIPPER algorithm and modified ID3 algorithms are used for unrealized data sets.
   Remarks: The RIPPER algorithm increases the number of rules to cover the non-negative rates as well as positive rates with information gain. Accuracy is increased to 83.4%, even if the number of instances is increased.
   Pitfalls: Privacy preservation by data set complementation fails if all the training data samples are leaked.

4. Title: Privacy preserving decision tree from perturbed data.
   Purpose: To handle numeric continuous attributes.
   Technique and algorithm used: A modified C4.5 decision tree classifier and a perturbation technique are used to build the decision tree.
   Remarks: The ID3 algorithm will handle the training data with missing attributes.
   Pitfalls: Various ways are needed to build a classifier that can be used to classify the perturbed data set.

5. Title: Privacy preserving decision tree over vertically partitioned data.
   Purpose: To tackle the problem of classification.
   Technique and algorithm used: A generalized privacy preserving variant of the ID3 algorithm for vertically partitioned data distributed over two or more parties is used.
   Remarks: Trivially extendible to the case where all the parties have the class attribute, which in fact causes a significant increase in the efficiency of the protocol.
   Pitfalls: A tight upper bound on the complexity still has to be found.

6. Title: Privacy preserving decision tree learning using unrealized data sets.
   Purpose: An accurate decision tree can be built directly from the unreal data sets.
   Technique and algorithm used: A new perturbation and randomization based approach via data set complementation is used.
   Remarks: With the ID3 algorithm and discrete valued attributes, an accurate decision tree can be built from these unreal data sets.
   Pitfalls: Privacy preservation via data set complementation fails if all the training data sets are leaked, because the data set reconstruction algorithm is generic.

7. Title: Noise Addition scheme in decision tree for privacy preserving data mining.
   Purpose: To add specific noise to numeric attributes after exploring the decision tree of the original data.
   Technique and algorithm used: Categorical attribute perturbation and a perturbation noise addition strategy are used with the C5.0 decision tree builder.
   Remarks: The data quality of the perturbed data set is considered to be high when the perturbed data set is similar to the original data set.
   Pitfalls: It works on numeric data only; further work is needed on data quality and security level measurement.

8. Title: Privacy Preserving Two-Layer Decision Tree Classifier for Multiparty Databases.
   Purpose: To address the issue of privacy preserving data mining in a distributed manner.
   Technique and algorithm used: The TLPPHPID3 (Two Layer Privacy Preserving Horizontally Partitioned ID3) algorithm is used. An untrusted third party (UTP) allows well-designed solutions that meet privacy constraints and achieve acceptable performance.
   Remarks: Performance of the two-layer horizontally partitioned ID3 decision tree classifier is better than that of the basic ID3 decision tree classifiers.
   Pitfalls: Joining multi-party attributes uses a trusted third party and an untrusted third party. The authors were still implementing the entire protocol in Java on huge databases, which should be the first working code in the area of privacy preserving decision tree classifiers on horizontally partitioned data using an untrusted third party.

9. Title: Privacy Preservation Decision Tree Based On Data Set Complementation.
   Purpose: To protect private data from an unauthorized party if some portion of the data is lost.
   Technique and algorithm used: Privacy preservation based on data set complementation algorithms that store the information of the real dataset.
   Remarks: It requires less memory space. It also provides fast and easy calculations when the samples are evenly distributed, as the storage requirement is increased.
   Pitfalls: Data set complementation fails if all training datasets are leaked, because the dataset reconstruction algorithm is generic.

10. Title: Building privacy preserving C4.5 decision tree classifier on multi parties.
   Purpose: To build a decision tree classifier for more than one party.
   Technique and algorithm used: Privacy-preserving C4.5 decision tree classification on vertically partitioned data without using a third party.
   Remarks: It is feasible to construct a privacy-preserving decision tree classifier using SMC techniques.
   Pitfalls: -

5. REFERENCES

[1] Smita D Patel, Sanjay Tiwari, "Privacy Preserving Data Mining", (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 4 (1), 2013, 139-141.
[2] R. Agrawal and R. Srikant, "Privacy-Preserving Data Mining", in ACM SIGMOD International Conference on Management of Data, pages 439-450, ACM, 2000.
[3] Lior Rokach and Oded Maimon, "Top-Down Induction of Decision Trees Classifiers: A Survey", IEEE Transactions on Systems, Man and Cybernetics: Part C, Vol. 1, No. 11, November 2002.
[4] Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001.
[5] Lior Rokach, Oded Maimon, "Data Mining and Knowledge Discovery Handbook", Second Edition, pages 167-192, Springer Science + Business Media, 2010.
[6] Tejaswini Pawar, Prof. Snehal Kamlapur, "Decision Tree Classifier for Privacy Preservation", International Journal of Emerging Technologies in Computational and Applied Sciences, 3(3), Dec. 12-Feb. 13, pp. 309-314.
[7] A.P.Subapriya, M.Kalimuthu, "Efficient Decision Tree Construction in Unrealized Dataset Using C4.5 Algorithm", Research Journal of Computer Systems Engineering (RJCSE), Vol. 04, Special Issue, June 2013.
[8] Ms. S.Nithya, Mrs. P.Senthil Vadivu, "Efficient Decision Tree Based Privacy Preserving Approach for Unrealized Data Sets", International Journal of Advances in Computer Science and Technology, Volume 2, No. 6, June 2013.
[9] Madhusmita Sahu, Debasis Gountia, Neelamani Samal, "Privacy Preservation Decision Tree Based On Data Set Complementation", International Journal of Innovative Research in Computer and Communication Engineering, Vol. 1, Issue 2, April 2013.
[10] Mohammad Ali Kadampur, Somayajulu D.V.L.N., "A Noise Addition Scheme in Decision Tree for Privacy Preserving Data Mining", Journal of Computing, Volume 2, Issue 1, January 2010.
[11] Alka Gangrade, Ravindra Patel, "Privacy Preserving Two-Layer Decision Tree Classifier for Multiparty Databases", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
[12] Alka Gangrade, Ravindra Patel, "Building Privacy-Preserving C4.5 Decision Tree Classifier On Multiparties", International Journal on Computer Science and Engineering, Vol. 1(3), 2009, 199-205.
[13] Li Liu, Murat Kantarcioglu and Bhavani Thuraisingham, "Privacy Preserving Decision Tree Mining From Perturbed Data", Proceedings of the 42nd Hawaii International Conference on System Sciences (HICSS '09), 2009.
[14] J. Vaidya and C. Clifton, "Privacy-preserving decision trees over vertically partitioned data", in Proceedings of the 19th Annual IFIP WG 11.3 Working Conference on Data and Applications Security, Storrs, Connecticut, 2005, Springer.
[15] Tejaswini Pawar, Prof. Snehal Kamalapur, "A Survey on Privacy Preserving Decision Tree Classifier", International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 2, Issue 6, November-December 2012, pp. 843-847.