Survey

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Survey

Transcript

ISSN 2319-8885 Vol.03,Issue.36 November-2014, Pages:7219-7223 www.ijsetr.com Fortify Fragile Labels of Data in Interglot SANGADI TEJASWI1, LEELAVATHI AREPALLI2 1 PG Scholar, Dept of CSE, Sri Vasavi Engineering College, Tadepalligudem, JNTUK, AP, India, E-mail: [email protected] 2 Asst Prof, Dept of CSE, Sri Vasavi Engineering College, Tadepalligudem, JNTUK, AP, India, E-mail: [email protected] Abstract: Privacy is one of the major concerns when publishing or sharing social network data for social science research and business analysis. Recently, researchers have developed privacy models similar to k-anonymity to prevent node reidentification through structure information. However, even when these privacy models are enforced, an attacker may still be able to infer one’s private information if a group of nodes largely share the same sensitive labels (i.e., attributes). In other words, the labelnode relationship is not well protected by pure structure anonymization methods. Furthermore, existing approaches, which rely on edge editing or node clustering, may significantly alter key graph properties. In this paper, we define a k-degree-l-diversity anonymity model that considers the protection of structural information as well as sensitive labels of individuals. We further propose a novel anonymization methodology based on adding noise nodes. We develop a new algorithm by adding noise nodes into the original graph with the consideration of introducing the least distortion to graph properties. Most importantly, we provide a rigorous analysis of the theoretical bounds on the number of noise nodes added and their impacts on an important graph property. We conduct extensive experiments to evaluate the effectiveness of the proposed technique. Keywords: Privacy, Secure Multiparty Computation, Non-Cooperative Computation. I. INTRODUCTION We propose a novel idea to preserve important graph properties, such as distances between nodes by adding certain “noise” nodes into a graph. This idea is based on the following key observation. In Our proposed system, privacy preserving goal is to prevent an attacker from reidentifying a user and finding the fact that a certain user has a specific sensitive value. To achieve this goal, we define a k-degree-ldiversity (KDLD) model for safely publishing a labeled graph, and then develop corresponding graph anonymization algorithms with the least distortion to the properties of the original graph, such as degrees and distances between nodes. Privacy and security, particularly maintaining confidentiality of data, have become a challenging issue with advances in information and communication technology. The ability to communicate and share data has many benefits, and the idea of an omniscient data source carries great value to research and building accurate data analysis models. For example, for credit card companies to build more comprehensive and accurate fraud detection system, credit card transaction data from various companies may be needed to generate better data analysis models. Department of Energy supports research on building much more efficient diesel engines [7]. Such an ambitious task requires the collaboration of geographically distributed industries, national laboratories, and universities. Those institutions (including potentially competing industry partners) need to share their private data for building data analysis models to understand the underlying physical phenomena. An omniscient data source eases misuse, such as the growing problem of identity theft. To prevent misuse of data, there is a recent surge in laws mandating protection of confidential data, such as the European Community privacy standards [9], U.S. health-care laws [17], and California SB1386. However, this protection comes with a real cost through both added security expenditure and penalties and costs associated with disclosure. What we need is the ability to compute the desired “beneficial outcome” of data sharing for analyzing without having to actually share or disclose data. This would maintain the security provided by separation of control while still obtaining the benefits of a global data source. Secure multiparty computation (SMC) [11], [44], [45] has recently emerged as an answer to this problem. Informally, if a protocol meets the SMC definitions, the participating parties learn only the final result and whatever can be inferred from the final result and their own inputs. A simple example is Yao’s millionaire problem [44]: two millionaires, Alice and Bob, want to learn who is richer without disclosing their actual wealth to each other. Recognizing this, the research community has developed many SMC protocols, for applications as diverse as forecasting [5], decision tree analysis [33] and auctions [37] among others. Nevertheless, the SMC model does not guarantee that data provided by participating parties are truthful. In many reallife situations, data needed for building data analysis models are distributed among multiple parties with potentially conflicting interests. For instance, a credit card company that Copyright @ 2014 IJSETR. All rights reserved. SANGADI TEJASWI, LEELAVATHI AREPALLI has a superior data analysis model for fighting credit card functionality. If the answer is positive, then there is no need fraud may increase its profits as compared to its peers. An to design complicated and generally inefficient SMC-based engine design company may want to exclusively learn the protocols. In this paper, we assume that the number of data analysis models that may enable it to build much more malicious or dishonest participating parties can be at most n _ efficient diesel engines. Clearly, as described above, building 1, where n is the number of parties. This assumption is very data analysis models is generally performed among parties general since most existing works in the area of privacythat have conflicting interests. In SMC, we generally assume preserving data analysis assume either all participating that participating parties provide truthful inputs. This parties are honest (or semi-honest) or the majority of assumption is usually justified by the fact that learning the participating parties are honest. Thus, we extend the non correct data analysis models or results is in the best interest cooperative computation definitions to incorporate cases of all participating parties. Since SMC-based protocols where there are multiple dishonest parties. require participating parties to perform expensive In addition, we show that from incentive compatibility computations, if any party does not want to learn data models point of view, most data analysis tasks need to be analyzed and analysis results, the party should not participate in the only for two party cases. Furthermore, to show the protocol. applicability of our developed theorems, we use these Still, this assumption does not guarantee the truthfulness of theorems to analyze under what conditions, common data the private input data when participating parties want to learn analysis tasks, such as mean and covariance matrix the final result exclusively. For example, a drug company estimation, can be executed in an incentive compatible may lie about its private data so that it can exclusively learn manner. The paper is organized as follows: Section 2 the data analysis. model. Although SMC protocols guarantee provides an overview of the works closely related to this that nothing other than the final data analysis result is paper and background regarding the concept of non revealed, it is impossible to verify whether or not cooperative computation. In Section 3, we propose several participating parties are truthful about their private input data. important theorems along with formal proofs. Based on these In other words, unless proper incentives are set, current SMC theorems, Section 4 analyzes some common distributed data techniques cannot prevent input modification by participating analysis tasks that are either incentive compatible or not parties. To better illustrate this problem, we consider a case incentive compatible in the context of this paper. Section 6 from management where competing companies (e.g., Texas concludes the paper with a discussion of possible future Instruments, IBM and Intel) establish a consortium (e.g., research directions. Semiconductor Manufacturing Technology1). The companies B. KDLD Algorithm send the consortium their sales data, and key manufacturing k-degree anonymity with l-diversity to prevent not only costs and times. Then, the consortium analyzes the data and the reidentification of individual nodes but also the statistically summarizes them in a report of industry trends, revelation of a sensitive attribute associated with each node. which is made available back to consortium members. In this If the k-degree-l-diversity constraint satisfies create KDLD case, it is in the interest of companies to learn true industry graph. A KDLD graph protects two aspects of each user trends while revealing their private data as little as possible. when an attacker uses degree information to attack A novel Even though SMC protocols can prevent the revelation of the graph construction technique which makes use of noise nodes private data, they do not guarantee that companies send their to preserve utilities of the original graph. Two key properties true sales data and other required information. Assume that n are considered: Add as few noise edges as possible. Change companies would like to learn the sample mean and variance the distance between nodes as less as possible. The noise of the sales data for a particular type of product. edges/nodes added should connect nodes that are close with respect to the social distance. There exist a large number of A. Our Contributions In this paper, we analyze what types of distributed low degree vertices in the graph which could be used to hide functionalities could be implemented in an incentive added noise nodes from being re-identified. By carefully compatible fashion. In other words, we explore which inserting noise nodes, some graph properties could be better functionalities can be implemented in a way that participating preserved than a pure edge-editing method. A graph is kparties have the incentive to provide their true private inputs degree anonymous if and only if for any node in this graph, upon engaging in the corresponding SMC protocols. We there exist at least k - 1 other node with the same degree. If show how tools from theoretical computer science in general an adversary knows that one person has three friends in the and non cooperative computation (NCC) [40] in particular graph, he can immediately know that node 2 is that person could be used to analyze incentive issues in distributed data and the related attributes of node 2 are revealed. k-degree analysis framework. This is significant because input anonymity can be used to prevent such structure attacks. modification cannot be prevented before the execution of any However, in many applications, a social network where SMC-based protocol. (Input modification could be prevented each node has sensitive attributes should be published. For during the execution of some SMC based protocols, but these example, a graph may contain the user salaries which are protocols are generally expensive.) The theorems developed sensitive. In this case, k-degree alone is not sufficient to in the paper can be adopted to analyze whether or not input prevent the inference of sensitive attributes of individuals. A modification could occur for computing a distributed International Journal of Scientific Engineering and Technology Research Volume.03, IssueNo.36, November-2014, Pages: 7219-7223 Fortify Fragile Labels of Data in Interglot graph that satisfies 2- degree anonymity but node labels are Among those research issues, algorithmic mechanism design not considered. In it, nodes 2 and 3 have the same degree 3, and non cooperative computation are closely related to our but they both have the label “80K.” If an attacker knows work. How private preferences of many parties could be someone has three friends in the social network, he can combined to find a global and socially optimal solution [38]. conclude that this person’s salary is 80K without exactly Usually, in algorithmic mechanism design, there exists a function that needs to be maximized based on the private reidentifying the node. inputs of the parties, and the goal is to devise mechanisms II. RELATED WORK AND BACKGROUND and payment schemes that force individuals to tell their true We begin with an overview of privacy-preserving private values. In our case, since it is hard to measure the distributed data analysis. Then, we briefly discuss the monetary value of the data analysis results, devising a concept of non cooperative computation. Table 1 provides payment scheme that is required by many mechanism design common notations and terminologies used extensively for the models is not viable (e.g., Vickrey-Groves-Clarke rest of this paper. In addition, the terms secure and privacy mechanisms [38]). Instead, we adopt the non cooperative preserving are interchangeable thereafter. computation model [40] that is designed for parties who want to jointly compute the correct function results on their private A. Privacy-Preserving Data Analysis inputs. Since data analysis algorithms can be seen as a special Many privacy-preserving data analysis protocols have case, modifying non cooperative computation model for our been designed using cryptographic techniques. Data are purposes is a natural choice as shown in Figs.1 and 2. generally assumed to be either vertically or horizontally partitioned. In the case of horizontally partitioned data, The non cooperative computation model can be seen as an different sites collect the same set of information about example of applying game theoretical ideas in a distributed different entities. For example, different credit card computation setting [40]. In NCC, each party participates in a companies may collect credit card transactions of different protocol to learn the output of some function f over the joint individuals. Privacy-preserving distributed protocols have inputs of the parties. First, all participating parties send their been developed for horizontally partitioned data for many private inputs securely to a trusted third party (TTP), then different data mining tasks such as building decision trees, TTP computes f and sends back the result to every [32], mining association rules, [26], and generating k-means participating party. The NCC model makes the following clusters [31] and k-nn classifiers. In the case of vertically assumptions: partitioned data, we assume that different sites collect information about the same set of entities, but they collect Correctness: The first priority for every participating party different feature sets. For example, both a university pay roll is to learn the correct result. and the university’s student health center may collect Exclusiveness: If possible, every participating party prefers information about a student. Again, privacy-preserving to learn the correct result exclusively. protocols for the vertically partitioned case have been developed for many different data mining tasks such as association rules, [41], building decision trees [8] and kmeans clusters [19]. (See [43] for a survey of the results.) To our knowledge, all the existing techniques assume that each participating party use its true data during the distributed data mining protocol execution. In addition to existing techniques that consider honest butcurious model, there are techniques developed against malicious adversaries, such as [16], [24]. Especially, in [16], authors discuss how to prevent lying about inputs using “input-consistency checks.” Basically, authors suggest checking whether the inputs satisfy some conditions that are known to be true about the inputs (e.g., a binary input vector cannot consist of all zeros). Although such approach could be useful in practice, it cannot prevent lying about inputs that satisfy the domain constraints (e.g., an adversary can lie about its binary vector by making sure that he does not use binary vectors that consists of all zeros as input). In our case, we suggest a different solution. Basically, we try to “incentivize” truth telling instead of preventing lying. Fig.1. Publish a graph with degree and label anonymity. B. Non cooperative Computation Recently, research issues at the intersection of computer Fig.2. Example for adding noise node. science and game theory have been studied extensively. International Journal of Scientific Engineering and Technology Research Volume.03, IssueNo.36, November-2014, Pages: 7219-7223 SANGADI TEJASWI, LEELAVATHI AREPALLI III. CHARACTERISTICS OF DNCC FUNCTIONS IV. ANALYZING DATA ANALYSIS TASKS UNDER As discussed previously, the term incentive compatible THE DNCC MODEL means that participating parties have the incentive or So far, we have developed techniques to prove whether or motivation to provide their actual inputs when they compute not a function is in DNCC. Combining the two concepts functionality. Although SMC-based privacy-preserving data DNCC and SMC, we can analyze privacy-preserving data analysis protocols (under the malicious adversary model) can analysis tasks (without utilizing a TTP) that are incentive prevent participating parties from modifying their inputs once compatible. We next prove several such important tasks that the protocols are initiated, they cannot prevent the parties either satisfy or do not satisfy the DNCC model. Also, note from modifying their inputs before the execution. On the that the data analysis tasks analyzed next have practical other hand, parties are expected to provide their true inputs to SMC-implementations. correctly evaluate a function that satisfies the NCC model. A. Privacy-Preserving Decision Tree Classification Therefore, any functionality that satisfies the NCC model is The aim of a decision tree is to provide classification inherently incentive compatible under the assumption that criteria based on the attributes of a data set. At each node of a participating parties prefers to learn the function result decision tree, the data are “split” into several subsets of data correctly and if possible exclusively. Now, the question is based on the criterion of information gain. The information which functionalities or data analysis tasks satisfy the NCC gain of a split is defined as the average difference in entropy model. For the rest of the paper, we first develop certain key between the original data set, and each data set formed by the theorems regarding NCC functions. Based on these theorems, split. In the Id3 and C4.5 decision trees [15], the sets for the we subsequently analyze functionalities that can be split are chosen based on a single attribute. The construction implemented under (or satisfy) the NCC model. of the tree proceeds as follows: 1) Determine the attribute that has the greatest information gain (or equivalently A. Collusion Regarding the NCC Model When evaluating certain privacy-preserving protocols in minimum conditional entropy). 2) Use that attribute to split practice, we need to consider the case where an adversary the data set. 3) Repeat this process within each subset until may control a subset of the parties involved in the protocol. no splits give significant information gain. At this point the Such an adversary may force the parties it controls to submit tree is complete. wrong inputs as shown in Figs.3 and 4. In order to analyze B. Using Non-DNCC Functions in an Incentivefunctionalities that are incentive compatible when collusion Compatible Way is possible, the current DNCC model needs to be extended to According to our previous analyses, some functions like include the possibility of collusion. In other words, we need sum, set union and set intersection are not in DNCC, but they to understand that given f is in DNCC, whether or not f is can still be used in an incentive-compatible way in PPDA still in DNCC when collusion occurs. To analyze this, we applications. The reason is that in many applications, continue to follow the correctness and exclusiveness primitives like sum, set union and intersection are not used assumptions under the NCC model. Based on the alone, and they often act as subroutines. To have a assumptions, we next define the DNCC functions for the case completely secure protocol, the subroutines can only return where an adversary controls at most t fixed parties. random shares of the expected results. For example, suppose a PPDA application uses the set intersection as a subroutine to compute the intersection between two sets D1 and D2. To achieve the best security, the subroutine produces two random numbers _1 and _2 (from a certain field), Subsequently, _1 and _2 will be used as input values to the next subroutine in the PPDA application. Because of this observation, we have the following claim. Fig.3. Edge editing with neighbourhood rule. V. CONCLUSION AND FUTURE WORK Even though privacy-preserving data analysis techniques guarantee that nothing other than the final result is disclosed, whether or not participating parties provide truthful input data cannot be verified. In this paper, we have investigated what kinds of PPDA tasks are incentive compatible under the NCC model. Based on our findings, there are several important PPDA tasks that are incentive driven. Table 2 classifies the common data analysis tasks studied in this paper into DNCC or Non-DNCC categories. Most often, data partition schemes can make a difference in determining DNCC or Non-DNCC classifications. For any functions, Theorems 3.1 and 3.2 provide a general way to determine if a function is in DNCC. In addition, Claim 5.1 can be used to Fig.4. Strategies to adding noise nodes. International Journal of Scientific Engineering and Technology Research Volume.03, IssueNo.36, November-2014, Pages: 7219-7223 Fortify Fragile Labels of Data in Interglot analyze if a composite function is in DNCC, and it also [9] “Directive 95/46/EC of the European Parliament and of provides a method to design a PPDA protocol that guarantees the Council of 24 Oct. 1995 on the Protection of Individuals to be incentive compatible under the DNCC definition. For with Regard to the Processing of Personal Data and on the instance, PPDA task can have many variations, and one Free Movement of Such Data,” Official J. European common variation is to place a filter at the last step of the Communities, vol. 281, pp. 31-50, Oct. 1995. task to make the PPDA protocols more secure (e.g., secure [10] K. Fukunaga, Introduction to Statistical Pattern similar document detection versus its threshold-based Recognition. Academic Press, 1990. variation). According to Claim 5.1, as long as the last step in [11] O. Goldreich, S. Micali, and A. Wigderson, “How to a PPDA task is in DNCC, it is always possible to make the Play Any Mental Game - A Completeness Theorem for entire PPDA task satisfying the DNCC model. Therefore, Protocols with Honest Majority,” Proc. 19th ACM Symp. the when designing a PPDA protocol, it is in our best interests to Theory of Computing, pp. 218-229, 1987. make the last step of the PPDA task incentive-compatible [12] O. Goldreich, “General Cryptographic Protocols,” The whenever possible. As a part of future research direction, we Foundations of Cryptography, vol. 2, Cambridge Univ. Press, will investigate incentive issues in other data analysis tasks, 2004. and extend the proposed theorems under the probabilistic [13] S.D. Gordon and J. Katz, “Rational Secret Sharing, NCC model. Another important direction that we would like Revisited,” Proc. Int’l Conf. Security and Cryptography for to pursue is to create more efficient Secure Multiparty Networks, p. 229, 2006. Computation techniques tailored toward implementing the [14] J. Halpern and V. Teague, “Rational Secret Sharing and Multiparty Computation: Extended Abstract,” Proc. Ann. data analysis tasks that are in DNCC. ACM Symp. Theory of Computing (STOC ’04), pp. 623-632, VI. ACKNOWLEDGMENTS 2004. The first author’s contribution to this work was partially [15] J. Han and M. Kamber, Data Mining: Concepts and supported by Air Force Office of Scientific Research MURI Techniques. Morgan Kaufmann, 2006. Grant FA9550-08-1-0265, National Institutes of Health Grant [16] S. Han and W.K. Ng, “Preemptive Measures against 1R01LM009989, US National Science Foundation (NSF) Malicious Party in Privacy-Preserving Data Mining,” Proc. Grant Career-CNS-0845803, and NSF Grants CNSSIAM Int’l Conf. 0964350, CNS-1016343. The second author’s contribution to Data Mining (SDM), pp. 375-386, 2008. this work was supported by the Office of Naval Research [17] “Standard for Privacy of Individually Identifiable Health under Award No. N000141110256 and NSF under Award Information,” Fed. Register, vol. 67, no. 157, pp. 53181No. CNS-1011984. 53273, Aug. 2002. VII. REFERENCES [18] S. Izmalkov, S. Micali, and M. Lepinski, “Rational [1] I. Abraham, D. Dolev, R. Gonen, and J. Halpern, Secure Computation and Ideal Mechanism Design,” Proc. “Distributed Computing Meets Game Theory: Robust 46th Ann. IEEE Symp. Foundations of Computer Science Mechanisms for Rational Secret Sharing and Multiparty (FOCS ’05), pp. 585-594, 2005. Computation,” Proc. 25th Ann. ACM Symp. Principles of Author’s Profile: Distributed Computing, pp. 53-62, 2006. S.Tejaswi received B.tech degree in CSE [2] R. Agrawal and E. Terzi, “On Honesty in Sovereign from AU, AP-India in 2012. At present Information Sharing,” Proc. Int’l Conf. Advances in M.tech(CSE) from B.tech(2012-2014) in Database Technology, pp. 240- 256, 2006. Sri Vasavi Engineering college from [3] R. Agrawal and R. Srikant, “Fast Algorithms for Mining JNTUK. Association Rules,” Proc. Int’l Conf. Very Large Data Bases (VLDB ’94) pp. 487-499, Sept. 1994. [4] I. Ashlagi, A. Klinger, and M. Tenneholtz, “K-NCC: Stability Against Group Deviations in Non-Cooperative A.Leelavathi received BSc degree in Computation,” Proc. Third Int’l Conf. Internet and Network Computers from AU, AP_India in 2001, Economics, pp. 564-569, 2007. Msc degree in computers from AU, in [5] M.J. Atallah, M. Bykova, J. Li, and M. Karahan, “Private 2003, and M.Tech degree in CSE from Collaborative Forecasting and Benchmarking,” Proc. Second JNTUK in 2010. I was worked as a ACM Workshop Privacy in the Electronic Soc. (WPES), Oct. teaching assistant lecturer’s assistant 2004. professor with Dept of CSE, AU and [6] B. Chor and E. Kushilevitz, “A Zero-One Law for Jawaharlal technological university, in Boolean Privacy,” Proc. 21st Ann. ACM Symp. Theory of 2003, 2005 and 2009 respectively. And present I am working Computing (STOC ’89), pp. 62-72, 1989. as an assistant Professor in Sri Vasavi Engineering college [7] Doe News, www.doe.gov, Feb. 2005. from JNTUK. I was participated in workshops and [8] W. Du and Z. Zhan, “Building Decision Tree Classifier coordinated many campus selection training programs for on Private Data,” Proc. IEEE Int’l Conf. Data Mining students in Sri Vasavi Engineering college. I am interested in Workshop Privacy, Security, and Data Mining, C. Clifton oop(C++, Java) and designing. and V. Estivill-Castro, eds., vol. 14, pp. 1-8, Dec. 2002. International Journal of Scientific Engineering and Technology Research Volume.03, IssueNo.36, November-2014, Pages: 7219-7223