ISSN 2319-8885
Vol.03,Issue.36
November-2014,
Pages:7219-7223
www.ijsetr.com
Fortify Fragile Labels of Data in Interglot
SANGADI TEJASWI¹, LEELAVATHI AREPALLI²
¹PG Scholar, Dept of CSE, Sri Vasavi Engineering College, Tadepalligudem, JNTUK, AP, India, E-mail: [email protected]
²Asst Prof, Dept of CSE, Sri Vasavi Engineering College, Tadepalligudem, JNTUK, AP, India, E-mail: [email protected]
Abstract: Privacy is one of the major concerns when publishing or sharing social network data for social science research and
business analysis. Recently, researchers have developed privacy models similar to k-anonymity to prevent node reidentification
through structure information. However, even when these privacy models are enforced, an attacker may still be able to infer
one’s private information if a group of nodes largely share the same sensitive labels (i.e., attributes). In other words, the label-node relationship is not well protected by pure structure anonymization methods. Furthermore, existing approaches, which rely
on edge editing or node clustering, may significantly alter key graph properties. In this paper, we define a k-degree-l-diversity
anonymity model that considers the protection of structural information as well as sensitive labels of individuals. We further
propose a novel anonymization methodology based on adding noise nodes. We develop a new algorithm by adding noise nodes
into the original graph with the consideration of introducing the least distortion to graph properties. Most importantly, we
provide a rigorous analysis of the theoretical bounds on the number of noise nodes added and their impacts on an important
graph property. We conduct extensive experiments to evaluate the effectiveness of the proposed technique.
Keywords: Privacy, Secure Multiparty Computation, Non-Cooperative Computation.
I. INTRODUCTION
We propose a novel idea to preserve important graph
properties, such as distances between nodes, by adding certain
“noise” nodes into a graph. This idea is based on the
following key observation. In our proposed system, the
privacy-preserving goal is to prevent an attacker from
reidentifying a user and from learning that a certain user has
a specific sensitive value. To achieve this goal, we define a
k-degree-l-diversity (KDLD) model for safely publishing a labeled
graph, and then develop corresponding graph anonymization
algorithms with the least distortion to the properties of the
original graph, such as degrees and distances between nodes.
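To make the KDLD goal concrete, the following is a minimal sketch of how one might test a small labeled graph against it. It assumes that a graph is k-degree anonymous when every degree value is shared by at least k nodes, and that l-diversity means each group of same-degree nodes carries at least l distinct sensitive labels; the second reading is an assumption, since the text does not give a formal definition. The function names and the toy graph are hypothetical.

```python
from collections import defaultdict


def degree_groups(adjacency):
    """Group node ids by degree; `adjacency` maps node -> set of neighbours."""
    groups = defaultdict(list)
    for node, neighbours in adjacency.items():
        groups[len(neighbours)].append(node)
    return dict(groups)


def is_k_degree_anonymous(adjacency, k):
    """True if every degree value is shared by at least k nodes."""
    return all(len(nodes) >= k for nodes in degree_groups(adjacency).values())


def is_kdld(adjacency, labels, k, l):
    """k-degree anonymity plus (assumed) distinct l-diversity per degree group."""
    for nodes in degree_groups(adjacency).values():
        if len(nodes) < k:
            return False
        if len({labels[n] for n in nodes}) < l:
            return False
    return True


# Hypothetical toy graph: nodes 2 and 3 both have degree 3 and the same label.
adjacency = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {2, 3},
}
labels = {1: "60K", 2: "80K", 3: "80K", 4: "100K"}

print(is_k_degree_anonymous(adjacency, 2))  # degree anonymity only
print(is_kdld(adjacency, labels, 2, 2))     # degree anonymity and label diversity
```

In this toy graph the degree check passes for k = 2, but the combined check fails for l = 2 because the two degree-3 nodes share one label, which is the kind of leak the KDLD model is meant to rule out.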
Privacy and security, particularly maintaining confidentiality
of data, have become a challenging issue with advances in
information and communication technology. The ability to
communicate and share data has many benefits, and the idea
of an omniscient data source carries great value to research
and building accurate data analysis models. For example, for
credit card companies to build more comprehensive and
accurate fraud detection systems, credit card transaction data
from various companies may be needed to generate better
data analysis models. As another example, the Department of
Energy supports research on building much more efficient
diesel engines [7].
Such an ambitious task requires the collaboration of
geographically distributed industries, national laboratories,
and universities. Those institutions (including potentially
competing industry partners) need to share their private data
for building data analysis models to understand the
underlying physical phenomena.
However, an omniscient data source also eases misuse, such as the
growing problem of identity theft. To prevent misuse of data,
there is a recent surge in laws mandating protection of
confidential data, such as the European Community privacy
standards [9], U.S. health-care laws [17], and California
SB1386. However, this protection comes with a real cost
through both added security expenditure and penalties and
costs associated with disclosure. What we need is the ability
to compute the desired “beneficial outcome” of data sharing
for analyzing without having to actually share or disclose
data. This would maintain the security provided by separation
of control while still obtaining the benefits of a global data
source. Secure multiparty computation (SMC) [11], [44],
[45] has recently emerged as an answer to this problem.
Informally, if a protocol meets the SMC definitions, the
participating parties learn only the final result and whatever
can be inferred from the final result and their own inputs. A
simple example is Yao’s millionaire problem [44]: two
millionaires, Alice and Bob, want to learn who is richer
without disclosing their actual wealth to each other.
Recognizing this, the research community has developed
many SMC protocols, for applications as diverse as
forecasting [5], decision tree analysis [33] and auctions [37]
among others.
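As a rough illustration of the SMC idea that parties learn only the final result, the sketch below simulates a secure sum based on additive secret sharing. This is a standard textbook primitive chosen for illustration; it is not Yao's protocol and not a protocol described in this paper. The modulus, the function names, and the local simulation of all parties in one process are assumptions.

```python
import secrets

PRIME = 2_147_483_647  # illustrative field modulus (2**31 - 1)


def make_shares(value, n_parties, modulus=PRIME):
    """Split `value` into n additive shares that sum to it modulo `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares


def secure_sum(private_inputs, modulus=PRIME):
    """Simulate the parties locally: each party shares its input, then every
    party adds the shares it received and publishes only that partial sum."""
    n = len(private_inputs)
    all_shares = [make_shares(x, n, modulus) for x in private_inputs]
    # Party j receives the j-th share from every party i.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % modulus for j in range(n)
    ]
    return sum(partial_sums) % modulus


print(secure_sum([12, 30, 8]))
```

Each individual share is uniformly random, so a single share reveals nothing about the input it came from; only the combination of all partial sums yields the total. The sketch is correct only for non-negative inputs whose true sum is below the modulus.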
Copyright @ 2014 IJSETR. All rights reserved.
International Journal of Scientific Engineering and Technology Research
Volume.03, IssueNo.36, November-2014, Pages: 7219-7223
Nevertheless, the SMC model does not guarantee that data provided by participating parties are truthful. In many real-life situations, data needed for building data analysis models are distributed among multiple parties with potentially conflicting interests. For instance, a credit card company that has a superior data analysis model for fighting credit card fraud may increase its profits as compared to its peers. An engine design company may want to exclusively learn the data analysis models that may enable it to build much more efficient diesel engines. Clearly, as described above, building data analysis models is generally performed among parties that have conflicting interests. In SMC, we generally assume that participating parties provide truthful inputs. This assumption is usually justified by the fact that learning the correct data analysis models or results is in the best interest of all participating parties. Since SMC-based protocols require participating parties to perform expensive computations, if any party does not want to learn data models and analysis results, the party should not participate in the protocol.
Still, this assumption does not guarantee the truthfulness of the private input data when participating parties want to learn the final result exclusively. For example, a drug company may lie about its private data so that it can exclusively learn the data analysis model. Although SMC protocols guarantee that nothing other than the final data analysis result is revealed, it is impossible to verify whether or not participating parties are truthful about their private input data. In other words, unless proper incentives are set, current SMC techniques cannot prevent input modification by participating parties. To better illustrate this problem, we consider a case from management where competing companies (e.g., Texas Instruments, IBM and Intel) establish a consortium (e.g., Semiconductor Manufacturing Technology). The companies send the consortium their sales data, and key manufacturing costs and times. Then, the consortium analyzes the data and statistically summarizes them in a report of industry trends, which is made available back to consortium members. In this case, it is in the interest of companies to learn true industry trends while revealing their private data as little as possible. Even though SMC protocols can prevent the revelation of the private data, they do not guarantee that companies send their true sales data and other required information. Assume that n companies would like to learn the sample mean and variance of the sales data for a particular type of product.
A. Our Contributions
In this paper, we analyze what types of distributed functionalities could be implemented in an incentive compatible fashion. In other words, we explore which functionalities can be implemented in a way that participating parties have the incentive to provide their true private inputs upon engaging in the corresponding SMC protocols. We show how tools from theoretical computer science in general and non-cooperative computation (NCC) [40] in particular could be used to analyze incentive issues in a distributed data analysis framework. This is significant because input modification cannot be prevented before the execution of any SMC-based protocol. (Input modification could be prevented during the execution of some SMC-based protocols, but these protocols are generally expensive.) The theorems developed in the paper can be adopted to analyze whether or not input modification could occur for computing a distributed functionality. If the answer is positive, then there is no need to design complicated and generally inefficient SMC-based protocols. In this paper, we assume that the number of malicious or dishonest participating parties can be at most n - 1, where n is the number of parties. This assumption is very general since most existing works in the area of privacy-preserving data analysis assume either that all participating parties are honest (or semi-honest) or that the majority of participating parties are honest. Thus, we extend the non-cooperative computation definitions to incorporate cases where there are multiple dishonest parties.
In addition, we show that, from an incentive compatibility point of view, most data analysis tasks need to be analyzed only for two-party cases. Furthermore, to show the applicability of our developed theorems, we use these theorems to analyze under what conditions common data analysis tasks, such as mean and covariance matrix estimation, can be executed in an incentive compatible manner. The paper is organized as follows: Section 2 provides an overview of the works closely related to this paper and background regarding the concept of non-cooperative computation. In Section 3, we propose several important theorems along with formal proofs. Based on these theorems, Section 4 analyzes some common distributed data analysis tasks that are either incentive compatible or not incentive compatible in the context of this paper. Section 5 concludes the paper with a discussion of possible future research directions.
B. KDLD Algorithm
The KDLD model combines k-degree anonymity with l-diversity to prevent not only the reidentification of individual nodes but also the revelation of a sensitive attribute associated with each node. If the k-degree-l-diversity constraint is satisfied, a KDLD graph is created. A KDLD graph protects two aspects of each user when an attacker uses degree information to attack. We use a novel graph construction technique that makes use of noise nodes to preserve the utility of the original graph. Two key properties are considered: add as few noise edges as possible, and change the distances between nodes as little as possible. The noise edges/nodes added should connect nodes that are close with respect to the social distance. There exist a large number of low-degree vertices in the graph which could be used to hide added noise nodes from being re-identified. By carefully inserting noise nodes, some graph properties could be better preserved than with a pure edge-editing method. A graph is k-degree anonymous if and only if, for any node in this graph, there exist at least k - 1 other nodes with the same degree. If an adversary knows that one person has three friends in the graph, he can immediately know that node 2 is that person, and the related attributes of node 2 are revealed. k-degree anonymity can be used to prevent such structure attacks.
However, in many applications, a social network where each node has sensitive attributes should be published. For example, a graph may contain the user salaries, which are sensitive. In this case, k-degree anonymity alone is not sufficient to prevent the inference of sensitive attributes of individuals. A
graph that satisfies 2-degree anonymity but ignores node labels illustrates the problem. In it, nodes 2 and 3 have the same degree, 3, and both carry the label “80K.” If an attacker knows someone has three friends in the social network, he can conclude that this person’s salary is 80K without exactly reidentifying the node.
II. RELATED WORK AND BACKGROUND
We begin with an overview of privacy-preserving distributed data analysis. Then, we briefly discuss the concept of non-cooperative computation. Table 1 provides common notations and terminologies used extensively for the rest of this paper. In addition, the terms secure and privacy preserving are used interchangeably hereafter.
A. Privacy-Preserving Data Analysis
Many privacy-preserving data analysis protocols have been designed using cryptographic techniques. Data are generally assumed to be either vertically or horizontally partitioned. In the case of horizontally partitioned data, different sites collect the same set of information about different entities. For example, different credit card companies may collect credit card transactions of different individuals. Privacy-preserving distributed protocols have been developed for horizontally partitioned data for many different data mining tasks such as building decision trees [32], mining association rules [26], and generating k-means clusters [31] and k-nn classifiers. In the case of vertically partitioned data, we assume that different sites collect information about the same set of entities, but they collect different feature sets. For example, both a university payroll and the university’s student health center may collect information about a student. Again, privacy-preserving protocols for the vertically partitioned case have been developed for many different data mining tasks such as association rules [41], building decision trees [8] and k-means clusters [19]. (See [43] for a survey of the results.) To our knowledge, all the existing techniques assume that each participating party uses its true data during the distributed data mining protocol execution.
In addition to existing techniques that consider the honest-but-curious model, there are techniques developed against malicious adversaries, such as [16], [24]. In particular, in [16], the authors discuss how to prevent lying about inputs using “input-consistency checks.” Basically, the authors suggest checking whether the inputs satisfy some conditions that are known to be true about the inputs (e.g., a binary input vector cannot consist of all zeros). Although such an approach could be useful in practice, it cannot prevent lying about inputs that satisfy the domain constraints (e.g., an adversary can lie about its binary vector by making sure that it does not use a binary vector that consists of all zeros as input). In our case, we suggest a different solution. Basically, we try to “incentivize” truth telling instead of preventing lying.
B. Non-Cooperative Computation
Recently, research issues at the intersection of computer science and game theory have been studied extensively. Among those research issues, algorithmic mechanism design and non-cooperative computation are closely related to our work. Algorithmic mechanism design studies how private preferences of many parties could be combined to find a global and socially optimal solution [38]. Usually, in algorithmic mechanism design, there exists a function that needs to be maximized based on the private inputs of the parties, and the goal is to devise mechanisms and payment schemes that force individuals to tell their true private values. In our case, since it is hard to measure the monetary value of the data analysis results, devising a payment scheme that is required by many mechanism design models is not viable (e.g., Vickrey-Clarke-Groves mechanisms [38]). Instead, we adopt the non-cooperative computation model [40] that is designed for parties who want to jointly compute the correct function results on their private inputs. Since data analysis algorithms can be seen as a special case, modifying the non-cooperative computation model for our purposes is a natural choice as shown in Figs.1 and 2.
Fig.1. Publish a graph with degree and label anonymity.
Fig.2. Example for adding noise node.
The non-cooperative computation model can be seen as an example of applying game theoretical ideas in a distributed computation setting [40]. In NCC, each party participates in a protocol to learn the output of some function f over the joint inputs of the parties. First, all participating parties send their private inputs securely to a trusted third party (TTP), then the TTP computes f and sends back the result to every participating party. The NCC model makes the following assumptions:
Correctness: The first priority for every participating party is to learn the correct result.
Exclusiveness: If possible, every participating party prefers to learn the correct result exclusively.
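The correctness and exclusiveness assumptions can be illustrated with a small simulation of the TTP setting. The sketch below uses the sum function, which the paper later classifies as not being in DNCC, and shows one party misreporting its input while still recovering the correct result for itself. The function names, input values, and offset are hypothetical, and the TTP is simulated by a plain function call.

```python
def ttp_evaluate(f, reported_inputs):
    """Trusted third party: computes f on whatever the parties report and
    returns the same result to every party."""
    return f(reported_inputs)


def total(values):
    """The jointly computed functionality: a plain sum."""
    return sum(values)


true_inputs = [120, 75, 310]  # hypothetical private values of three parties
offset = 1000                 # arbitrary distortion chosen by party 0

honest_result = ttp_evaluate(total, true_inputs)

# Party 0 misreports its input; the other parties report truthfully.
reported = [true_inputs[0] + offset] + true_inputs[1:]
distorted_result = ttp_evaluate(total, reported)

# Party 0 undoes its own distortion and recovers the correct sum, while the
# other parties are left holding the distorted value.
recovered_by_liar = distorted_result - offset

print(honest_result, distorted_result, recovered_by_liar)
```

Because the lying party learns the correct result and the others do not, truthful reporting is not the best strategy for a party that values exclusiveness, which is why a bare sum is not incentive compatible in this model.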
III. CHARACTERISTICS OF DNCC FUNCTIONS
As discussed previously, the term incentive compatible means that participating parties have the incentive or motivation to provide their actual inputs when they compute a functionality. Although SMC-based privacy-preserving data analysis protocols (under the malicious adversary model) can prevent participating parties from modifying their inputs once the protocols are initiated, they cannot prevent the parties from modifying their inputs before the execution. On the other hand, parties are expected to provide their true inputs to correctly evaluate a function that satisfies the NCC model. Therefore, any functionality that satisfies the NCC model is inherently incentive compatible under the assumption that participating parties prefer to learn the function result correctly and, if possible, exclusively. Now, the question is which functionalities or data analysis tasks satisfy the NCC model. For the rest of the paper, we first develop certain key theorems regarding NCC functions. Based on these theorems, we subsequently analyze functionalities that can be implemented under (or satisfy) the NCC model.
A. Collusion Regarding the NCC Model
When evaluating certain privacy-preserving protocols in practice, we need to consider the case where an adversary may control a subset of the parties involved in the protocol. Such an adversary may force the parties it controls to submit wrong inputs as shown in Figs.3 and 4. In order to analyze functionalities that are incentive compatible when collusion is possible, the current DNCC model needs to be extended to include the possibility of collusion. In other words, we need to understand, given that f is in DNCC, whether or not f is still in DNCC when collusion occurs. To analyze this, we continue to follow the correctness and exclusiveness assumptions under the NCC model. Based on the assumptions, we next define the DNCC functions for the case where an adversary controls at most t fixed parties.
Fig.3. Edge editing with neighbourhood rule.
IV. ANALYZING DATA ANALYSIS TASKS UNDER THE DNCC MODEL
So far, we have developed techniques to prove whether or not a function is in DNCC. Combining the two concepts DNCC and SMC, we can analyze privacy-preserving data analysis tasks (without utilizing a TTP) that are incentive compatible. We next analyze several such important tasks that either satisfy or do not satisfy the DNCC model. Also, note that the data analysis tasks analyzed next have practical SMC implementations.
A. Privacy-Preserving Decision Tree Classification
The aim of a decision tree is to provide classification criteria based on the attributes of a data set. At each node of a decision tree, the data are “split” into several subsets of data based on the criterion of information gain. The information gain of a split is defined as the average difference in entropy between the original data set and each data set formed by the split. In the ID3 and C4.5 decision trees [15], the sets for the split are chosen based on a single attribute. The construction of the tree proceeds as follows:
1) Determine the attribute that has the greatest information gain (or equivalently minimum conditional entropy).
2) Use that attribute to split the data set.
3) Repeat this process within each subset until no splits give significant information gain.
At this point the tree is complete.
B. Using Non-DNCC Functions in an Incentive-Compatible Way
According to our previous analyses, some functions like sum, set union and set intersection are not in DNCC, but they can still be used in an incentive-compatible way in PPDA applications. The reason is that in many applications, primitives like sum, set union and intersection are not used alone, and they often act as subroutines. To have a completely secure protocol, the subroutines can only return random shares of the expected results. For example, suppose a PPDA application uses the set intersection as a subroutine to compute the intersection between two sets D1 and D2. To achieve the best security, the subroutine produces two random numbers _1 and _2 (from a certain field). Subsequently, _1 and _2 will be used as input values to the next subroutine in the PPDA application. Because of this observation, we have the following claim.
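The idea of a subroutine that returns only random shares of its result can be sketched as follows. This is a hedged illustration, not the paper's protocol: it assumes the two random numbers are additive shares over a prime field, that the shared value is the intersection size, and that the next step is a threshold test. All names and the modulus are hypothetical, and the intersection is computed in the clear as a stand-in for a secure subroutine.

```python
import secrets

PRIME = 2_147_483_647  # illustrative field modulus; the text does not fix one


def share_result(value, modulus=PRIME):
    """Split a subroutine result into two random additive shares."""
    share_1 = secrets.randbelow(modulus)
    share_2 = (value - share_1) % modulus
    return share_1, share_2


def intersection_size_shares(d1, d2):
    """Stand-in for a secure set-intersection subroutine: the result is
    computed in the clear here, but only random shares are handed back."""
    return share_result(len(set(d1) & set(d2)))


def threshold_step(share_1, share_2, threshold, modulus=PRIME):
    """Stand-in for the next subroutine: consumes both shares and reveals
    only whether the intersection size reaches the threshold."""
    return (share_1 + share_2) % modulus >= threshold


s1, s2 = intersection_size_shares([1, 2, 3, 4], [3, 4, 5])
print(threshold_step(s1, s2, threshold=2))
```

Neither party can read the intermediate intersection size from its own share, so the only value that becomes visible is the output of the last step, which is the step whose incentive properties then matter.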
Fig.4. Strategies for adding noise nodes.
V. CONCLUSION AND FUTURE WORK
Even though privacy-preserving data analysis techniques guarantee that nothing other than the final result is disclosed, whether or not participating parties provide truthful input data cannot be verified. In this paper, we have investigated what kinds of PPDA tasks are incentive compatible under the NCC model. Based on our findings, there are several important PPDA tasks that are incentive driven. Table 2 classifies the common data analysis tasks studied in this paper into DNCC or Non-DNCC categories. Most often, data partition schemes can make a difference in determining DNCC or Non-DNCC classifications. For any function, Theorems 3.1 and 3.2 provide a general way to determine if the function is in DNCC. In addition, Claim 5.1 can be used to
analyze if a composite function is in DNCC, and it also provides a method to design a PPDA protocol that is guaranteed to be incentive compatible under the DNCC definition. For instance, a PPDA task can have many variations, and one common variation is to place a filter at the last step of the task to make the PPDA protocols more secure (e.g., secure similar document detection versus its threshold-based variation). According to Claim 5.1, as long as the last step in a PPDA task is in DNCC, it is always possible to make the entire PPDA task satisfy the DNCC model. Therefore, when designing a PPDA protocol, it is in our best interests to make the last step of the PPDA task incentive-compatible whenever possible. As a part of our future research direction, we will investigate incentive issues in other data analysis tasks, and extend the proposed theorems under the probabilistic NCC model. Another important direction that we would like to pursue is to create more efficient Secure Multiparty Computation techniques tailored toward implementing the data analysis tasks that are in DNCC.
VI. ACKNOWLEDGMENTS
The first author’s contribution to this work was partially supported by Air Force Office of Scientific Research MURI Grant FA9550-08-1-0265, National Institutes of Health Grant 1R01LM009989, US National Science Foundation (NSF) Grant Career-CNS-0845803, and NSF Grants CNS-0964350, CNS-1016343. The second author’s contribution to this work was supported by the Office of Naval Research under Award No. N000141110256 and NSF under Award No. CNS-1011984.
VII. REFERENCES
[1] I. Abraham, D. Dolev, R. Gonen, and J. Halpern, “Distributed Computing Meets Game Theory: Robust Mechanisms for Rational Secret Sharing and Multiparty Computation,” Proc. 25th Ann. ACM Symp. Principles of Distributed Computing, pp. 53-62, 2006.
[2] R. Agrawal and E. Terzi, “On Honesty in Sovereign Information Sharing,” Proc. Int’l Conf. Advances in Database Technology, pp. 240-256, 2006.
[3] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” Proc. Int’l Conf. Very Large Data Bases (VLDB ’94), pp. 487-499, Sept. 1994.
[4] I. Ashlagi, A. Klinger, and M. Tenneholtz, “K-NCC: Stability Against Group Deviations in Non-Cooperative Computation,” Proc. Third Int’l Conf. Internet and Network Economics, pp. 564-569, 2007.
[5] M.J. Atallah, M. Bykova, J. Li, and M. Karahan, “Private Collaborative Forecasting and Benchmarking,” Proc. Second ACM Workshop Privacy in the Electronic Soc. (WPES), Oct. 2004.
[6] B. Chor and E. Kushilevitz, “A Zero-One Law for Boolean Privacy,” Proc. 21st Ann. ACM Symp. Theory of Computing (STOC ’89), pp. 62-72, 1989.
[7] Doe News, www.doe.gov, Feb. 2005.
[8] W. Du and Z. Zhan, “Building Decision Tree Classifier on Private Data,” Proc. IEEE Int’l Conf. Data Mining Workshop Privacy, Security, and Data Mining, C. Clifton and V. Estivill-Castro, eds., vol. 14, pp. 1-8, Dec. 2002.
[9] “Directive 95/46/EC of the European Parliament and of the Council of 24 Oct. 1995 on the Protection of Individuals with Regard to the Processing of Personal Data and on the Free Movement of Such Data,” Official J. European Communities, vol. 281, pp. 31-50, Oct. 1995.
[10] K. Fukunaga, Introduction to Statistical Pattern Recognition. Academic Press, 1990.
[11] O. Goldreich, S. Micali, and A. Wigderson, “How to Play Any Mental Game - A Completeness Theorem for Protocols with Honest Majority,” Proc. 19th ACM Symp. the Theory of Computing, pp. 218-229, 1987.
[12] O. Goldreich, “General Cryptographic Protocols,” The Foundations of Cryptography, vol. 2, Cambridge Univ. Press, 2004.
[13] S.D. Gordon and J. Katz, “Rational Secret Sharing, Revisited,” Proc. Int’l Conf. Security and Cryptography for Networks, p. 229, 2006.
[14] J. Halpern and V. Teague, “Rational Secret Sharing and Multiparty Computation: Extended Abstract,” Proc. Ann. ACM Symp. Theory of Computing (STOC ’04), pp. 623-632, 2004.
[15] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.
[16] S. Han and W.K. Ng, “Preemptive Measures against Malicious Party in Privacy-Preserving Data Mining,” Proc. SIAM Int’l Conf. Data Mining (SDM), pp. 375-386, 2008.
[17] “Standard for Privacy of Individually Identifiable Health Information,” Fed. Register, vol. 67, no. 157, pp. 53181-53273, Aug. 2002.
[18] S. Izmalkov, S. Micali, and M. Lepinski, “Rational Secure Computation and Ideal Mechanism Design,” Proc. 46th Ann. IEEE Symp. Foundations of Computer Science (FOCS ’05), pp. 585-594, 2005.
Author’s Profile:
S. Tejaswi received the B.Tech degree in CSE from AU, AP, India, in 2012. At present, Tejaswi is pursuing the M.Tech degree in CSE (2012-2014) at Sri Vasavi Engineering College, JNTUK.
A. Leelavathi received the BSc degree in Computers from AU, AP, India, in 2001, the MSc degree in Computers from AU in 2003, and the M.Tech degree in CSE from JNTUK in 2010. Leelavathi worked as a teaching assistant, lecturer, and assistant professor with the Dept of CSE, AU, and Jawaharlal Technological University in 2003, 2005, and 2009, respectively, and is at present working as an Assistant Professor at Sri Vasavi Engineering College, JNTUK. Leelavathi has participated in workshops, has coordinated many campus selection training programs for students at Sri Vasavi Engineering College, and is interested in OOP (C++, Java) and design.