International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, Volume-4, Issue-ICCIN-2K14, March 2014
ICCIN-2K14 | January 03-04, 2014, Bhagwan Parshuram Institute of Technology, New Delhi, India

A Hybrid C-Tree Algorithm for Privacy Preserving Data Mining

Shweta Taneja, Shashank Khanna, Sugandha Tilwalia, Ankita

Abstract: Data mining is the process of discovering useful information or knowledge from a data warehouse. Today, people are increasingly concerned about their privacy and secrecy. Various privacy preserving data mining algorithms have been developed to preserve privacy and hide sensitive data. For example, information in government records or patient records, such as name, age, sex, disease, and unique identification number, should be protected. In this paper, we propose a new algorithm for privacy preserving data mining. First, we rearrange the records of the data set using a novel 'C-Tree' procedure and perturb the primary attribute. Then, we encrypt the sensitive attributes using ASCII codes and special characters. The algorithm is implemented and tested on microdata of patient records. Sensitive data is perturbed efficiently so that no individual's identity is revealed. Also, the original data can be reconstructed from the perturbed data, which keeps the data usable.

Keywords: Privacy Preserving; Data Mining; Sensitive Attribute; C-Tree; Encryption.

I. INTRODUCTION

Data mining [1] refers to extracting or "mining" knowledge from large amounts of data. It is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. By performing data mining, interesting knowledge, regularities, or high-level information can be extracted from a database and viewed or browsed from different angles. The discovered knowledge can be applied to decision making, process control, information management, and query processing. Data mining is considered one of the most important frontiers in database systems and one of the most promising interdisciplinary developments in the information industry.

Data mining, with its promise to efficiently discover valuable, non-obvious information from large databases [2], is particularly vulnerable to misuse, so there can be a conflict between data mining and privacy. Today, privacy is the major concern in protecting sensitive data. People are very much concerned about sensitive information that they do not want to share with anyone. Privacy preservation secures sensitive data values against use by unauthorized parties [2]. In this it differs from other areas of data security, such as access control, which prevent information disclosure through illegitimate means. The main aim of privacy preservation is to prevent unauthorized access to data or information [7][11].

The privacy and security of individuals must be protected in order to address ethical, legal, and social issues. Large volumes of data are regularly collected and analyzed through data mining. These data include medical histories and important policies of governments and organizations, which help them in the decision making process and also provide social benefits such as crime reduction, national security, and medical research. These data are very important for data analysts and researchers who analyze reports from hospitals, government agencies, and insurance companies [12].

With the rapid development of internet technology, privacy preserving data mining has become one of the most important topics and a serious concern for people who want their data to be secure, since hackers can easily find techniques on the internet with which to steal someone's sensitive personal data. To secure such data, researchers have developed a new field known as privacy preserving data mining (PPDM). Data mining techniques have been developed successfully to extract knowledge from huge databases. As data mining becomes more pervasive, privacy concerns are increasing. To secure data, many techniques and methods have been introduced, such as k-anonymity, data perturbation, association rule mining, adding unknown values, encryption, and the condensation approach. All of these techniques have advantages as well as disadvantages: some are very costly, some cannot handle huge amounts of data, and some may lose information.

To overcome these problems, we introduce a new technique called the Hybrid C-Tree technique. In this paper, we discuss our novel algorithm and some existing techniques. The paper is organized as follows. In Section 1, we introduce privacy and the need for privacy preserving data mining. Section 2 reviews related work in the field of privacy preserving data mining. In Section 3, we propose our algorithm. Section 4 describes the experiments conducted on an original data set and the results obtained. Finally, we conclude in Section 5 and present the future prospects of our algorithm.

II. RELATED WORK

Our main focus is the hiding of sensitive or crucial data. A lot of work has been done on preserving privacy in data mining, and different authors have proposed different techniques in the literature. The cryptography based technique [7][9] encrypts information and provides security for sensitive information, but it has several limitations: it cannot successfully protect the outputs of a computation, and it does not produce good results when many parties are involved. The blocking based technique replaces known values with unknown values ('?') by randomization; its weakness is that the original value behind an unknown value can be guessed fairly easily [8].
The replacement of sensitive values does not depend on any specific rule; there may be different sensitive rules according to the requirements. The main aim of this technique is to protect sensitive data from unauthorized access [7].

The perturbation technique adds noise to the original data, which perturbs sensitive information and hence preserves privacy [5][7]. It distorts sensitive data values by adding, subtracting, or applying some other mathematical formula [6]. This technique does not reconstruct the original data; it reconstructs only the distribution.

Table I below shows some privacy preserving techniques with their merits and demerits. Our paper considers all these techniques and takes a hybrid approach [7][3] to preserving privacy in data mining. We use 'Information Gain' [1][14] to find the most sensitive attribute in our data set, which is then used for classification.

Table I: Merits and Demerits of Privacy Preserving Data Mining Techniques

1. Perturbation
   Merits: Different attributes are treated independently by the perturbation approach. [2][3][5][7]
   Demerits: The method does not reconstruct the original data values. Also, it does not hold good for large databases. [7]

2. Condensation
   Merits: This approach works on pseudo-data rather than perturbed data, which helps in better preservation of privacy than techniques which simply use modifications of the original data. [10]
   Demerits: Pseudo-data no longer have an effect on the data mining algorithm, because they have the same format as the original data. [7][13]

3. Cryptographic
   Merits: Cryptography is a technique through which sensitive data can be encrypted. There is also a proper toolset of cryptographic algorithms in the field of privacy preserving data mining. [9]
   Demerits: This approach is especially difficult to scale when more than a few parties are involved. Also, it does not hold good for large databases. [7]

4. Blocking based technique
   Merits: To preserve the privacy of an individual, sensitive transactions are replaced by unknown values. [7][8]
   Demerits: Unknown values help in preserving privacy, but reconstruction of the original data set is quite difficult. [7]

Manuscript received March 2014. Shweta Taneja, Shashank Khanna, Sugandha Tilwalia, and Ankita are with the Dept. of Computer Science Engineering, Bhagwan Parshuram Institute of Technology, Indraprastha University, India.
Published By: Blue Eyes Intelligence Engineering & Sciences Publication Pvt. Ltd.

III. PROPOSED ALGORITHM

We propose a hybrid algorithm which combines the merits of different techniques and helps in preserving the privacy of sensitive data. The techniques involved are encryption, perturbation, and unknown values.

The Hybrid C-Tree Algorithm is as follows:

1) Input: the original training dataset. Calculate the gain of each attribute using Information Gain [1]:

   $$Info(D) = -\sum_{i} p_i \log_2(p_i)$$
   $$Info_A(D) = \sum_{j} \frac{|D_j|}{|D|} \times Info(D_j)$$
   $$Gain(A) = Info(D) - Info_A(D)$$

   where D is the training data set, A is an attribute, D_j is the subset of D having the j-th value of A, and p_i is the probability that a record in D belongs to class i. The attribute with the greatest gain is chosen as the classifying attribute for the tree.

2) Output: the perturbed data, in sequence from the left cluster to the right cluster of the last level of the C-Tree.

3) Method:
   a) Calculate the Information Gain value for the attributes having random values.
   b) Select a unique classifying attribute.
   c) Classify this level of the tree by testing the random-value attribute, e.g. for Age, whether Age > 45.
   d) On the classified data, divide according to whether Age is even or odd.
   e) In each cluster after classification, if Age is even then Age := Age + 15, else Age := Age - 15.
   f) Traverse the tree from the left cluster to the right cluster of the last level of the C-Tree.
   g) Repeat steps a to f and obtain the perturbed data.

4) Encrypt Character Function
   a) Input a character attribute from the training dataset.
   b) Starting from the first letter of the attribute value, convert each character into its ASCII code (k).
   c) Integer M := 155 - k.
   d) Convert M into a character.
   e) Replace the current character with the converted character.
   f) Return the string.

5) Encrypt ID Function
   a) Check whether the character is a digit.
   b) Replace each digit with a special character, taken in order of precedence in the ASCII chart, as given below:

      Digit:   0  1  2  3  4  5  6  7  8  9
      Symbol:  !  #  $  %  &  *  +  -  /  =

   c) Return the ID.

6) Stop.

IV. EXPERIMENTS CONDUCTED

We have implemented our proposed algorithm on an original patient data set [4], shown in Table II. This dataset has an alphanumeric attribute (UID), character attributes (Name and Disease), a numeric attribute (Age), and an attribute which records the gender of the patient (Sex).
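The transformations of Section III can be illustrated on attributes of these types. The following Python sketch is one possible reading of steps 3 to 5 of the algorithm, not the authors' Java implementation. All function and constant names are hypothetical. The left-to-right cluster order (Age <= 45 before Age > 45, even before odd) and the symbol '-' for the digit 7 are inferred from the encrypted records in Table III, and leaving characters other than uppercase letters unchanged is an assumption based on the space retained in the encrypted disease names.

```python
# Sketch of the Hybrid C-Tree steps under the assumptions stated above.

# Digit-to-symbol chart of the Encrypt ID function (7 -> '-' is inferred).
DIGIT_TO_SYMBOL = dict(zip("0123456789", "!#$%&*+-/="))

AGE_THRESHOLD = 45  # first-level split of the C-Tree (step 3c)
AGE_SHIFT = 15      # added to even ages, subtracted from odd ages (step 3e)


def cluster_key(age):
    """Sort key for the leaf clusters, left to right:
    (<=45, even), (<=45, odd), (>45, even), (>45, odd)."""
    return (age > AGE_THRESHOLD, age % 2 == 1)


def perturb_age(age):
    """Step 3e: shift even ages up and odd ages down."""
    return age + AGE_SHIFT if age % 2 == 0 else age - AGE_SHIFT


def encrypt_text(text):
    """Step 4: replace each uppercase letter with chr(155 - ASCII code).
    Other characters are kept as they are (assumption)."""
    return "".join(chr(155 - ord(c)) if "A" <= c <= "Z" else c for c in text)


def encrypt_uid(uid):
    """Step 5: replace digits with special characters, keep letters."""
    return "".join(DIGIT_TO_SYMBOL.get(c, c) for c in uid)


def hybrid_c_tree(records):
    """Order records by leaf cluster, then perturb and encrypt each one.
    sorted() is stable, so input order is kept inside a cluster."""
    ordered = sorted(records, key=lambda r: cluster_key(r["age"]))
    return [
        {
            "uid": encrypt_uid(r["uid"]),
            "name": encrypt_text(r["name"]),
            "age": perturb_age(r["age"]),
            "sex": r["sex"],
            "disease": encrypt_text(r["disease"]),
        }
        for r in ordered
    ]


if __name__ == "__main__":
    sample = [
        {"uid": "IN5627", "name": "HANNAH", "age": 25, "sex": "F",
         "disease": "HEART DISEASE"},
        {"uid": "IN5672", "name": "SMITH", "age": 48, "sex": "M",
         "disease": "FLU"},
        {"uid": "IN5622", "name": "DERRICK", "age": 30, "sex": "M",
         "disease": "FLU"},
    ]
    for row in hybrid_c_tree(sample):
        print(row)
```

Under these assumptions both transformations are invertible. Since 155 - (155 - k) = k, applying encrypt_text twice returns the original string. An even age becomes odd after adding 15 and an odd age becomes even after subtracting 15, so a perturbed odd age is restored by subtracting 15 and a perturbed even age by adding 15.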
Table II: Patient Medical Dataset (Original)

UID      NAME      AGE  SEX  DISEASE
IN5612   ALBERT    29   M    CANCER
IN5622   DERRICK   30   M    FLU
IN5627   HANNAH    25   F    HEART DISEASE
IN5634   SUMMER    36   F    CANCER
IN5646   ROBERT    43   M    HEART DISEASE
IN5647   PEARL     34   F    HEART DISEASE
IN5655   KRISTINE  47   F    CANCER
IN5672   SMITH     48   M    FLU
IN5675   SAMANTHA  21   F    FLU

From the above medical data, a medical researcher may be keen to find some pattern, and while mining they might publicly reveal someone's personal and sensitive information. To preserve privacy, we apply our algorithm, which hides the sensitive information and keeps a perturbed dataset in the warehouse, so that an individual's identity cannot easily be revealed while mining. The technique protects all sensitive information and provides a specific miner with only those attributes which help in the research, while maintaining privacy.

We have implemented the algorithm in Java, with MySQL as the backend. The steps followed are given below.

Step 1. We first find the attribute which is most suitable for decision making and classification, that is, the attribute having the largest gain according to Information Gain. The gain value for the 'Age' attribute is the highest, and therefore we classify our tree using 'Age'.

Step 2. We divide the primary attribute, Age, into two categories, Age > 45 and Age <= 45, and store the IDs in two different lists.

Step 3. We further classify each list into two, Even and Odd. This classification produces the clusters in the leaves, as shown in Fig. 1.

Step 4. We traverse the leaves from the leftmost cluster to the rightmost cluster and store the records in the database.

Step 5. We add 15 to the ages of all even records and subtract 15 from the ages of all odd records, and update the ages in the database.

Step 6. We encrypt the Names and Diseases with the conversion step given in the algorithm. The encrypted Names and Diseases are updated in the database.

Step 7. The ID attribute is encrypted as explained in the algorithm.

The results of the above experiment are shown in Table III.

Table III: Encrypted Patient Medical Dataset using the C-Tree Hybrid Algorithm

UID      NAME      AGE  SEX  DISEASE
IN*+$$   WVIIRXP   45   M    UOF
IN*+%&   HFNNVI    51   F    XZMZVI
IN*+&-   KVZIO     49   F    SVZIG WRHVZHV
IN*+#$   ZOYVIG    14   M    XZMZVI
IN*+$-   SZMMZS    10   F    SVZIG WRHVZHV
IN*+&+   ILYVIG    28   M    SVZIG WRHVZHV
IN*+-*   HZNZMGSZ  6    F    UOF
IN*+-$   HNRGS     63   M    UOF
IN*+**   PIRHGRMV  32   F    XZMZVI

Figure 1: C-Tree depicting clusters in the leaf nodes

V. CONCLUSION AND FUTURE PROSPECTS

Privacy has become an important concern. Today, no individual wants to give up his or her privacy. While mining data from the warehouse, an analyst might find knowledge that contains someone's sensitive information, such as health status or bank account details. Such knowledge may be published even though the individual would not want the information to be revealed. In the field of privacy preserving data mining, we have therefore contributed a novel hybrid algorithm which perturbs the sensitive data. The benefit of our proposed algorithm is that analysts cannot reveal sensitive information while mining, which preserves the privacy and secrecy of the individual. In future, we would like to present a hybrid algorithm which will help in reconstructing the original data.

REFERENCES

[1] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", 2nd ed., The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor, 2006.
[2] M. B. Malik, M. A. Ghazi and R. Ali, "Privacy Preserving Data Mining Techniques: Current Scenario and Future Prospects", in proceedings of Third International Conference on Computer and Communication Technology, IEEE 2012.
[3] P. Deivanai, J. Jesu Vedha Nayahi and V. Kavitha, "A Hybrid Data Anonymization integrated with Suppression for Preserving Privacy in mining multi party data", in proceedings of International Conference on Recent Trends in Information Technology, IEEE 2011.
[4] M. Prakash and G. Singaravel, "A New Model for Privacy Preserving Sensitive Data Mining", in proceedings of ICCCNT, Coimbatore, India, IEEE 2012.
[5] J. Liu, J. Luo and J. Z. Huang, "Rating: Privacy Preservation for Multiple Attributes with Different Sensitivity requirements", in proceedings of 11th IEEE International Conference on Data Mining Workshops, IEEE 2011.
[6] H. Kargupta, S. Datta, Q. Wang and K. Sivakumar, "On the Privacy Preserving Properties of Random Data Perturbation Techniques", in proceedings of the Third IEEE International Conference on Data Mining, IEEE 2003.
[7] S. Lohiya and L. Ragha, "Privacy Preserving in Data Mining Using Hybrid Approach", in proceedings of 2012 Fourth International Conference on Computational Intelligence and Communication Networks, IEEE 2012.
[8] A. Parmar, U. P. Rao and D. R. Patel, "Blocking based approach for classification Rule hiding to Preserve the Privacy in Database", in proceedings of International Symposium on Computer Science and Society, IEEE 2011.
[9] Y. Lindell and B. Pinkas, "Privacy preserving data mining", Journal of Cryptology, 5(3), 2000.
[10] C. Aggarwal and P. S. Yu, "A condensation approach to privacy preserving data mining", in proceedings of International Conference on Extending Database Technology (EDBT), pp. 183-199, 2004.
[11] R. Agrawal and R. Srikant, "Privacy-preserving data mining", in proceedings of SIGMOD 2000, pp. 439-450.
[12] T. Jahan, G. Narsimha and C. V. Guru Rao, "Data Perturbation and Features Selection in Preserving Privacy", in proceedings of 978-14673-1989-8/12, IEEE 2012.
[13] G. Nayak and S. Devi, "A Survey on Privacy Preserving Data Mining: Approaches and Techniques", International Journal of Engineering Science and Technology (IJEST), 2011.
[14] X. Xiao and H. Ding, "Enhancement of K-nearest Neighbour Algorithm Based on Weighted Entropy of Attribute Value", in proceedings of 5th International Conference on BioMedical Engineering and Informatics, IEEE 2012.