Privacy-Preserving Data Mining Systems
Nan Zhang
University of Texas at Arlington
Wei Zhao
Rensselaer Polytechnic Institute
Although successful in many applications, data mining poses special concerns for private
data. An integrated architecture takes a systemic view of the problem, implementing
established protocols for data collection, inference control, and information sharing.
Data mining successfully extracts knowledge to support a variety of domains—marketing, weather forecasting, medical diagnosis, and national security—but it is still a challenge to mine certain kinds of data without violating the data owners’ privacy.1 How to mine patients’ private data, for example, is an ongoing problem in healthcare applications. In recognition of the growing privacy concern, directives such as the US Health Insurance Portability and Accountability Act (HIPAA) and the European Union Privacy Directive mandate privacy protection for data management and analysis systems.
As data mining becomes more pervasive, such concerns are increasing. Online data collection systems are
an example of new applications that threaten individual privacy. Already companies are sharing data mining
models to obtain a richer set of data about mutual customers and their buying habits.
The computing community must address data mining
privacy before data mining techniques become widespread and the threat to private information spirals out
of control. The sticking point is how to protect privacy
while preserving the usefulness of data mining results.
Much research is under way to address obstacles, but
practical privacy-preserving data mining systems are
largely in the research and prototyping stages. Many
techniques for privacy-preserving data mining concentrate on algorithmic solutions and underlying mathematical tools,2,3 rather than focusing on system issues.
52
Computer
Our goal in investigating privacy preservation issues was to take a systemic view of architectural requirements and design principles and explore possible solutions that would lead to guidelines for building practical privacy-preserving data mining systems.
FOUNDATIONAL DESIGN
As Figure 1 shows, privacy-preserving data mining
usually has multiple steps that translate to a three-tiered
architecture: At the bottom tier are the data providers,
the data owners, which are often physically distributed.
The data providers submit their private data to the data
warehouse server. This server, which constitutes the middle tier, supports online analytical data processing to
facilitate data mining by translating raw data from the
data providers into aggregate data that the data mining
servers can more quickly process.
The data warehouse server stores the data collected
in disciplined physical structures, such as a multidimensional data cube, and aggregates and precomputes
the data in various forms, such as sum, average, max,
and min. In an online survey system, for example, the
survey respondents would be data providers who submit
their data to the survey analyzer’s data warehouse server;
an aggregated data point might be the average age of all
survey respondents. The aggregated data is more efficient to process than raw data from the providers.
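The precomputation step described above can be sketched in a few lines. This is only a minimal illustration, with hypothetical record layout and attribute names, of how a warehouse server might precompute sum, count, min, and max cells of a data cube so that data mining servers query aggregates rather than raw records.

```python
from collections import defaultdict

# Hypothetical raw records submitted by data providers (e.g., survey respondents).
records = [
    {"state": "TX", "year": 2006, "age": 23},
    {"state": "TX", "year": 2007, "age": 31},
    {"state": "CA", "year": 2006, "age": 45},
    {"state": "CA", "year": 2007, "age": 27},
]

def build_cube(rows, dims):
    """Precompute sum/count/min/max of `age` for every cell of the cube
    defined by the dimension attributes in `dims`."""
    cube = defaultdict(lambda: {"sum": 0, "count": 0, "min": None, "max": None})
    for r in rows:
        cell = tuple(r[d] for d in dims)
        agg = cube[cell]
        agg["sum"] += r["age"]
        agg["count"] += 1
        agg["min"] = r["age"] if agg["min"] is None else min(agg["min"], r["age"])
        agg["max"] = r["age"] if agg["max"] is None else max(agg["max"], r["age"])
    return dict(cube)

cube = build_cube(records, dims=("state",))
# An aggregated data point, such as the average age of Texas respondents:
avg_tx = cube[("TX",)]["sum"] / cube[("TX",)]["count"]
```

A mining server that needs only the average age per state now touches two precomputed cells instead of every raw record.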
Published by the IEEE Computer Society, April 2007
0018-9162/07/$25.00 © 2007 IEEE

At the top tier are the data mining servers, which perform the actual data mining. In a privacy-preserving data mining system, these servers do not have free access to all data in the data warehouse. In a hospital system, the accounting department can mine patients’ financial data, for example, but cannot access patients’ medical records. Developing and validating effective rules for the data mining servers’ access to the data warehouse is an open research problem.4

Besides constructing data mining models on its local data warehouse server, a data mining server might share information with data mining servers from other systems. The motivation for this sharing is to build data mining models that span systems. For example, several retail companies might opt to share their local data mining models on customer records to build a global data mining model about consumer behavior that would benefit all the companies. As Figure 1 shows, sharing occurs in the top tier, where each data mining server holds the data mining model of its own system. Thus, “sharing” means sharing local data mining models rather than raw data.

Figure 1. Basic architecture for privacy-preserving data mining. The architecture typically has three tiers: data providers, which are the data owners; the data warehouse server, which supports online analytical processing; and the data mining servers that perform data mining tasks and share information. The challenge is to control private information transmitted among entities without impeding data mining.

“Minimum necessary” design principle
Any design of a privacy-preserving data mining system requires a clear definition of privacy. The common interpretation is that a data point is private if its owner has the right to choose whether or not, to what extent, and for what purpose to disclose the data point to others. In privacy-preserving data mining literature, most authors assume (either implicitly or explicitly) that a data owner generally chooses not to disclose its private data unless data mining requires it. This assumption and the accepted information-privacy definition form the basis of the “minimum necessary” design principle:

In a data mining system, disclosed private information (from one entity to another) should be the minimum necessary for data mining.

Minimum in this context is a qualitative, not a quantitative, measure. Since the quantitative measure of privacy disclosure varies among systems, minimum captures the idea that all unnecessary private information (unnecessary in the context of how accurate the data mining results must be) should not be disclosed. Minimum thus means that privacy disclosure is on a need-to-know basis. Many privacy regulations, including HIPAA, mandate this minimum necessary rule.

Privacy protocols
On the basis of the architecture in Figure 1 and the minimum necessary design principle, we have evolved a basic strategy for building a privacy-preserving data mining system. Central to the strategy are three protocols that govern privacy disclosure among entities:

• Data collection protects privacy during data transmission from the data providers to the data warehouse server.
• Inference control manages privacy protection between the data warehouse server and data mining servers.
• Information sharing controls information shared among the data mining servers in different systems.

Given the minimum necessary rule, a common goal of these protocols is to transmit the minimum private information necessary for data mining from one entity to another to build accurate data mining models. In reality, it is often difficult to build an efficient system that protects private information perfectly. Consequently, there are always tradeoffs between data privacy and data mining model accuracy. These protocols are based on established methods that the system designer can tailor to particular requirements, choosing the most beneficial tradeoffs. The data collection protocol, for example, can draw from one of two established collection methods, each with its advantages and drawbacks.
Figure 2. Data collection protocol taxonomy. A designer can choose which of two methods—value- or dimension-based—and its attendant approaches best serve the design.

DATA COLLECTION PROTOCOL
The data collection protocol lets data providers identify the minimum necessary part of private information—what must be disclosed to build accurate data mining models—and ensures that they transmit only that part of the information to the data warehouse server.

Several requirements shape the data collection protocol. First, it must be scalable, since a data warehouse server can deal with as many as hundreds of thousands of data providers, as in an online survey system. Second, the computational cost to data providers must be small because they have considerably lower computational power than the data warehouse server, and a higher cost could discourage them from participating in data mining. Finally, the protocol must be robust; it must deliver relatively accurate data mining results while protecting data providers’ privacy, even if data providers behave erratically. For example, if some data providers in an online survey system deviate from the protocol or submit meaningless data, the data collection protocol must control the influence of such erroneous behavior and ensure that global data mining results remain sufficiently accurate.

Figure 2 shows a data collection protocol taxonomy based on two data collection methods.

Value-based method
With the value-based method,5 a data provider manipulates the value of each data attribute or item independently using one of two approaches. The perturbation-based approach3 adds noise directly to the original data values, such as changing age 23 to 30 or Texas to California. The aggregation-based approach generalizes data according to the relevant domain hierarchy, such as changing age 23 to age range 21-25 or Texas to the US.

The perturbation-based approach is highly suitable for arbitrary data, while the aggregation-based approach relies on knowledge of the domain hierarchy, but can be effective in guaranteeing the data’s anonymity6—k-anonymity, for example, means that each perturbed data record is indistinguishable from the perturbed values of at least k–1 other data records.

The value-based method assumes that it would be difficult, if not impossible, for the data warehouse server to rediscover the original private data from the manipulated values but that the server would still be able to recover the original data distribution from the perturbed data, thereby supporting the construction of accurate data mining models.5
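The two value-based approaches, and the k-anonymity property just mentioned, can be sketched as follows. The noise spread, bucket width, and sample ages here are illustrative assumptions, not values from the article.

```python
import random
from collections import Counter

def perturb(age, spread=10, rng=random):
    """Perturbation-based approach: report the true value plus random noise
    (e.g., age 23 might become 30). With zero-mean noise of known shape, the
    server can still estimate the original distribution over many providers."""
    return age + rng.randint(-spread, spread)

def aggregate(age, width=5):
    """Aggregation-based approach: generalize the value up its domain
    hierarchy, e.g., age 23 becomes the range 21-25."""
    low = ((age - 1) // width) * width + 1
    return (low, low + width - 1)

def is_k_anonymous(released, k):
    """k-anonymity holds if every released (generalized) record is identical
    to at least k-1 others, i.e., each distinct value occurs >= k times."""
    return all(count >= k for count in Counter(released).values())

ages = [21, 22, 23, 24, 25, 41, 42, 43]
released = [aggregate(a) for a in ages]   # five records map to (21, 25), three to (41, 45)
```

A provider applies one of these transforms locally before submission, so the data warehouse server never sees the raw value.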
Dimension-based method
The dimension-based method is so called because the
data to be mined usually has many attributes, or dimensions. The basic idea is to remove part of the private
information from the original data by reducing the number of dimensions. The blocking-based approach3
accomplishes this by truncating some private attributes
without releasing them to the data warehouse server.
However, this approach could result in information loss,
preventing data mining servers from constructing accurate data mining models. The more complicated projection-based approach7 overcomes this problem by
projecting the original data into a carefully designed,
low-dimensional subspace in a way that retains only the
minimum information necessary to construct accurate
data mining models.
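A minimal sketch of the dimension-reduction idea follows, using a random linear projection as a stand-in for the carefully designed subspace the projection-based approach requires (choosing that subspace to fit the mining task is exactly the hard part). The attribute names and dimensions are hypothetical.

```python
import random

def random_projection_matrix(d_in, d_out, seed=0):
    """An illustrative projection; a deployed scheme would instead pick the
    subspace that retains exactly the information the mining task needs."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]

def project(record, matrix):
    """Map a d_in-dimensional record to d_out dimensions. The dropped
    dimensions, and the private values along them, are never transmitted."""
    return [sum(w * x for w, x in zip(row, record)) for row in matrix]

record = [23.0, 70.5, 1.8, 120.0]        # e.g., age, weight, height, blood pressure
M = random_projection_matrix(d_in=4, d_out=2)
disclosed = project(record, M)           # only two numbers leave the provider
```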
Advantages and drawbacks
Each method and attendant approach has pluses and
minuses. The value-based method is independent of the
data mining task, which makes it suitable for applications involving multiple data mining tasks or tasks
unknown at data collection. In contrast, the dimension-based method fits better with individual data mining
tasks because the information to be retained after
dimension reduction usually depends on the particular
task.
So far, research has not defined an effective and universally applicable projection-based approach. Even so,
the projection-based approach promises strong advantages over value-based methods in terms of the tradeoff
between accuracy and privacy protection.
Most value-based approaches treat different attributes independently and separately, so at least some
attributes that are less necessary for data mining are
always disclosed to the data warehouse server to the
same extent as other attributes. Indeed, a recent study revealed that, with the perturbation-based randomization approach, the data warehouse server could use privacy intrusion techniques to filter noise from the perturbed data, thereby rediscovering part of the original private data.8 The projection-based approach avoids this problem by exploiting the relationship among attributes and disclosing only those necessary for data mining.

Guiding data submission can also reduce unnecessary privacy disclosure, enhancing the performance of data perturbation. In earlier work,7 we and colleague Shengquan Wang proposed a guidance-based dimension reduction scheme for dynamic systems, such as online survey systems, in which data providers (survey respondents and so on) join the system and submit their data asynchronously. To guide data providers that have not yet submitted data, the scheme analyzes the data already collected and estimates the attributes necessary for data mining. The system then sends the estimated useful attributes to data providers as guidance. Our work shows that this guidance-based scheme is more effective than approaches without such guidance.

Item    April     May       June      July      Sum
Book    10        Known     15        Known     Q5 = 25
CD      20        Known     27        Known     Q6 = 47
DVD     Known     35        16        36        Q7 = 87
Game    Known     25        Known     14        Q8 = 39
Sum     Q1 = 30   Q2 = 60   Q3 = 58   Q4 = 50

Figure 3. Inference that discloses private information. If the data mining server becomes an adversary, it might be able to infer from the query answers and certain cells (Known) the number of DVDs a data provider sold in June (which is private and should not be disclosed) by computing Q1 + Q3 – (Q5 + Q6) = 88 – 72 = 16, where Q1 to Q8 are query answers.

INFERENCE CONTROL PROTOCOL
Protecting private data in the data warehouse server requires controlling the information disclosed to the data mining servers—which is the aim of the inference control protocol. Following the minimum necessary rule, the inference control protocol ensures that the data warehouse server answers the queries necessary for data mining yet minimizes privacy disclosure.

Several requirements drive the inference control protocol’s design and implementation. One is the need to block inferences. If a data mining server becomes an adversary, it will try to infer private information from the query answers it has already received. Figure 3 gives an example.

Further, the inference control protocol must be efficient enough to satisfy the data warehouse server’s required online response time—the time between issuing a query and answering it. The time that an inference control protocol uses is part of that response time. It must be controlled so that the data warehouse server can maintain its reduced response time.

To meet these requirements, inference control protocols must restrict the information included in the query answers so that the data mining server cannot infer private data from received query answers.

Figure 4. An inference control protocol taxonomy. A designer can choose which of two methods—query- or data-oriented—best serves the design.

Figure 4 shows an inference control protocol taxonomy based on two inference control methods.

Query-oriented method
The query-oriented method4 is centered on the concept of a safe query set, which says that query set <Q1, Q2, …, Qn> is safe if a data mining server cannot infer private data from the answers to Q1, Q2, …, Qn. Thus, query-oriented inference control means that when the data warehouse server receives a query, it will answer the query only if the union set of query history—the set of all queries already answered—and the recently received query are safe. Otherwise, it will reject the query. Relative to query-oriented inference control in statistical databases, inference control in data warehouses involves significantly more data. Consequently, the burden is on inference control protocols to process queries more efficiently.

Because dynamically determining a query set’s safety (online query history check) can be time-consuming, a static version of the query-oriented method might be more suitable. The static version determines a safe set of queries offline (before any query is actually received). If a query set is safe, then any one of its subsets is also safe. At runtime, when the data warehouse server receives the query, it answers only if the query is in the predetermined safe set. Otherwise, it will reject the query. On the downside, the static method is conservative in selecting a safe set, which might cause it to reject some queries unnecessarily.
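The inference in Figure 3, and a safety check that blocks it, can be modeled with elementary linear algebra: a private cell is compromised once its indicator vector lies in the span of the answered queries, with publicly known cells treated as already-answered singleton queries. This is a simplified sketch of the safe-query-set idea, not the scalable techniques of the cited work.

```python
from fractions import Fraction

def _rank(rows):
    """Row rank via Gaussian elimination in exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank, ncols = 0, len(m[0]) if m else 0
    for col in range(ncols):
        pivot = next((i for i in range(rank, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(len(m)):
            if i != rank and m[i][col] != 0:
                f = m[i][col] / m[rank][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[rank])]
        rank += 1
    return rank

def inferable(answered, cell_vec):
    """A cell is inferable if its indicator vector is a linear
    combination of the answered query vectors."""
    return _rank(answered) == _rank(answered + [cell_vec])

# Model the Figure 3 table: 4 items x 4 months = 16 cells.
items, months = ["Book", "CD", "DVD", "Game"], ["April", "May", "June", "July"]
idx = {(i, m): r * 4 + c for r, i in enumerate(items) for c, m in enumerate(months)}

def vec(cells):
    v = [0] * 16
    for cell in cells:
        v[idx[cell]] = 1
    return v

known = [("Book", "May"), ("Book", "July"), ("CD", "May"), ("CD", "July"),
         ("DVD", "April"), ("Game", "April"), ("Game", "June")]
q1 = vec([(i, "April") for i in items])   # column sum Q1
q3 = vec([(i, "June") for i in items])    # column sum Q3
q5 = vec([("Book", m) for m in months])   # row sum Q5
q6 = vec([("CD", m) for m in months])     # row sum Q6

answered = [vec([c]) for c in known] + [q1, q5, q6]
dvd_june = vec([("DVD", "June")])

# Answering Q3 on top of this history would expose the DVD-June cell,
# so a query-oriented server must reject it.
safe_to_answer_q3 = not inferable(answered + [q3], dvd_june)
```

The online version runs this check against the query history per request; the static version would instead precompute a family of query sets that passes it.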
Data-oriented method
With the data-oriented method of inference control,9 the data warehouse server perturbs the stored raw data and estimates the query answers as accurately as possible on the basis of the perturbed data. As Figure 4 shows, the data collection protocol can handle perturbation unless the application requires storing original data in the data warehouse server. In that case, the data warehouse server might have to perturb the data when processing the query.

The data-oriented method assumes that perturbation can protect private information from being disclosed, enabling the data warehouse server to answer all queries freely on the basis of the perturbed data. Research has shown that the query answers estimated from the perturbed data can still support the construction of accurate data mining models.5

Advantages and disadvantages
The two methods have unique performance considerations. The data-oriented method offers query responsiveness, since the data warehouse server will answer all queries. The query-oriented method, in contrast, normally rejects a substantial number of queries,9 which means that some data mining servers might be unable to complete their data mining tasks.

On the plus side, the query-oriented method can provide more accurate answers than the data-oriented method. When the data warehouse server answers a query, its answer will always be precise. The data-oriented method, in contrast, answers queries with estimation, so it might not be accurate enough to support data mining, particularly when the construction of data mining models requires highly accurate query answers.

Efficiency is an important advantage for the static version of the query-oriented method, which has the shortest response time because most of its computational cost is offline. The dynamic version must trade off efficiency and query responsiveness: To answer more queries, the data warehouse server must spend more time analyzing the query history. The data-oriented method also suffers from low efficiency, since the computational overhead for query estimation can be several orders of magnitude higher than for query answering.

One way to enhance inference control protocol performance is to integrate query- and data-oriented methods. Introducing the query answer-or-reject scheme to the data-oriented method would let the data warehouse server reject some privacy-divulging queries (such as Q3 in Figure 3). This, in turn, would effectively downgrade the data perturbation level yet retain the same degree of privacy protection. Because the data is perturbed, the server would have to reject far fewer queries and could thus answer most queries fairly accurately while continuing to protect private information.

INFORMATION SHARING PROTOCOL
Because each data mining server constructs local data mining models in its own system, these servers are likely to share their local data mining models rather than the raw data in the data warehouses. Local data mining models can be sensitive, especially when the local models are not globally valid.

To protect the privacy of individual data mining systems, some mechanism must control the disclosure of private information in local data mining models. This mechanism is the information sharing protocol, which again follows the minimum necessary rule. The protocol’s objective is to enable data mining servers across multiple systems to construct global data mining models while disclosing only the minimum private information about local data mining models necessary for information sharing.

Many information sharing protocols exist for applications other than data mining, such as database interoperation or data integration.10 Information sharing is necessary for most distributed data mining systems, and much work has focused on designing specific information sharing protocols for data mining tasks.

A major design concern of the information sharing protocol is defending against adversaries that behave arbitrarily within the capability allocated to them. The defense strategy depends on the adversary model—the set of assumptions about an adversary’s intent and behavior. Two of the more popular adversary models are semihonest10 and beyond semihonest.

Semihonest adversaries
An adversary is semihonest if it properly follows the designated protocol but records all intermediate computation and communication, thereby providing a way to derive private information.

Cryptographic encryption has proved effective in defending against semihonest adversaries.2,10,11 In this method, each data mining server encrypts its local data mining model and exchanges the encrypted model with other data mining servers.

Some encryption scheme properties, such as the Rivest-Shamir-Adleman (RSA) cryptosystem’s commutative encryption property, make it possible to design algorithms for data mining servers to perform certain data mining tasks and set operations without knowing the private keys of other entities.2,10,11
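To make the commutative property concrete, here is a toy sketch of commutative encryption by modular exponentiation (a Pohlig-Hellman-style scheme with the same commutativity the article attributes to RSA-based constructions), applied to a two-party set intersection. The prime, keys, and item ids are illustrative only; a real protocol needs large parameters and proper hashing of items into the group.

```python
import math
import random

P = 2_147_483_647          # a Mersenne prime, far too small for real use

def keygen(rng):
    """Pick a secret exponent invertible modulo P - 1."""
    while True:
        k = rng.randrange(3, P - 1)
        if math.gcd(k, P - 1) == 1:
            return k

def enc(x, k):
    """E_k(x) = x^k mod P.  Since x^(a*b) = x^(b*a), E_a(E_b(x)) = E_b(E_a(x))."""
    return pow(x, k, P)

rng = random.Random(7)
ka, kb = keygen(rng), keygen(rng)        # each party keeps its own key secret

set_a = {101, 202, 303}                  # party A's (hashed) item ids
set_b = {202, 303, 404}                  # party B's (hashed) item ids

# Each party encrypts its own items, exchanges the ciphertexts, and the other
# party encrypts them again; matches among the double-encrypted values reveal
# the intersection without either party learning the other's key.
double_a = {enc(enc(x, ka), kb) for x in set_a}
double_b = {enc(enc(x, kb), ka) for x in set_b}
shared = double_a & double_b             # blinded representatives of {202, 303}
```

Because exponentiation with an exponent coprime to P - 1 is a bijection on the group, distinct items never collide, so the number of matches equals the true intersection size.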
Tasks include classification, association rule mining, clustering, and collaborative filtering; set operations include set intersection, set union, and element reduction.

Because it is not possible to recover the original (local) data mining models from their encrypted values without knowing the private keys, this method is a secure defense against semihonest adversaries. Researchers have already evolved a detailed taxonomy and cryptographic encryption methods for various system settings.2,3

Beyond semihonest adversaries
An adversary is considered beyond semihonest if it deviates from the designated protocol, changes its input data, or both.

Because it is difficult if not impossible to defend against an adversary that is behaving arbitrarily, dealing with beyond semihonest adversaries requires more refined models. One such model is the intent-based adversary model,12 which formulates an adversary’s intent as combining the intent to obtain accurate data mining results with compromising other entities’ private information. A game-theoretic method is then developed to defend against adversaries that weigh the accuracy of data mining results over compromising other parties’ privacy.12

The basic idea is to design the information sharing protocol in a way that no adversary can both obtain accurate data mining results and intrude on other servers’ privacy. Adversaries that are more concerned with the accuracy of data mining results will be forced not to intrude on the privacy of others to get that accuracy.

OPEN RESEARCH ISSUES
Several issues require additional research to ensure the optimum performance of the techniques described.

Protocol integration
Many systems need a seamless integration of the three protocols, yet little research has addressed this need. Our proposed integrated architecture could serve as a platform for studying protocol interaction. Such insights can pave the way for effective and efficient integration.

Heterogeneous privacy requirements
Privacy-preserving data mining techniques depend on respecting the privacy protection levels that data providers require. Most existing studies assume homogenous privacy requirements—that all data owners need the same privacy level for all their data and its attributes. This assumption is unrealistic in practice and could even degrade system performance unnecessarily. Designing and implementing techniques that exploit heterogeneous privacy requirements is a challenge with much potential return.

Privacy measurements
The accuracy versus protection tradeoff inherent in privacy-preserving data mining means that some mechanism must accurately measure the degree of privacy protection. Although extensive work has focused on privacy measurement, as yet no one has proposed a commonly accepted measurement technique for generic privacy-preserving data mining systems. Proper privacy protection measurement has three criteria: It must

• reflect system settings (adversaries might have different levels of interest in different data values, such as being more concerned with patients that have contagious diseases than other diseases),
• account for data providers’ diverse privacy concerns (some might consider age as private information, while others are willing to disclose it publicly), and
• satisfy the minimum necessary rule.

A comprehensive study of privacy measurement for all three protocols would be a huge step toward improving the performance of privacy-preserving data mining techniques.

Anomaly detection
A common application of data mining is to detect data-set anomalies, as in mining log file data to detect intrusions. However, few researchers have considered privacy protection in detecting anomalies. Research on anomaly detection is an important part of data mining and can contribute to multiple disciplines, such as security, biology, and finance. Thoroughly investigating issues related to the design of privacy-preserving data mining techniques for anomaly detection would be extremely beneficial.

Multiple protection levels
In some cases, multiple levels of private information must be protected. The first level might be a data point value, and the second level, the data point sensitivity (knowledge of whether or not a data point is private). Most existing studies focus on protecting the first level and assume that all entities already know the second level. Research has yet to answer how to protect the second level (and higher levels) of private information.

Our work is an important first step in addressing the critical systemic issues of privacy preservation in data mining. Much research remains to realize the
potential of the architecture and design principles we
have described. Much literature already addresses privacy-preserving data mining, but clearly the ideas must
cross considerable ground to become practical systems.
Studies are needed for the design of privacy-preserving
data mining techniques in real-world scenarios, in which
data owners can freely address their individual privacy
concerns without the data miner’s consent. Also critical
is work that more closely incorporates designs with specialized applications such as healthcare, market analysis, and finance. Our hope is that others will continue
efforts in this important area. ■
References
1. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
2. C. Clifton et al., “Tools for Privacy Preserving Distributed
Data Mining,” SIGKDD Explorations, vol. 4, no. 2, 2003,
pp. 28-34.
3. V.S. Verykios et al., “State-of-the-Art in Privacy Preserving
Data Mining,” SIGMOD Record, vol. 33, no. 1, 2004, pp.
50-57.
4. L. Wang, S. Jajodia, and D. Wijesekera, “Securing OLAP Data
Cubes against Privacy Breaches,” Proc. 25th IEEE Symp.
Security and Privacy, IEEE Press, 2004, pp. 161-175.
5. R. Agrawal and R. Srikant, “Privacy-Preserving Data Mining,” Proc. 19th ACM SIGMOD Int’l Conf. Management of
Data, ACM Press, 2000, pp. 439-450.
6. R.J. Bayardo and R. Agrawal, “Data Privacy through Optimal
k-Anonymization,” Proc. 21st Int’l Conf. Data Eng., IEEE
Press, 2005, pp. 217-228.
7. N. Zhang, S. Wang, and W. Zhao, “A New Scheme on Privacy-Preserving Data Classification,” Proc. 11th ACM
SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, ACM Press, 2005, pp. 374-383.
8. Z. Huang, W. Du, and B. Chen, “Deriving Private Information
from Randomized Data,” Proc. 24th ACM SIGMOD Int’l
Conf. Management of Data, ACM Press, 2005, pp. 37-48.
9. R. Agrawal, R. Srikant, and D. Thomas, “Privacy-Preserving
OLAP,” Proc. 25th ACM SIGMOD Int’l Conf. Management
of Data, ACM Press, 2005, pp. 251-262.
10. R. Agrawal, A. Evfimievski, and R. Srikant, “Information
Sharing across Private Databases,” Proc. 22nd ACM SIGMOD Int’l Conf. Management of Data, ACM Press, 2003,
pp. 86-97.
11. Y. Lindell and B. Pinkas, “Privacy Preserving Data Mining,”
Proc. 12th Ann. Int’l Conf. Advances in Cryptology, Springer-Verlag, 2000, pp. 36-54.
12. N. Zhang and W. Zhao, “Distributed Privacy Preserving Information Sharing,” Proc. 31st Int’l Conf. Very Large Data
Bases, ACM Press, 2005, pp. 889-900.
Nan Zhang is an assistant professor of computer science
and engineering at the University of Texas at Arlington. His
research interests include databases and data mining, information security and privacy, and distributed systems. Zhang
received a PhD in computer science from Texas A&M University. He is a member of the IEEE. Contact him at
[email protected].
Wei Zhao is a professor of computer science and the dean
for the School of Science at Rensselaer Polytechnic Institute. His research interests include distributed computing,
real-time systems, computer networks, and cyberspace security. Zhao received a PhD in computer and information sciences from the University of Massachusetts, Amherst. He is
a Fellow of the IEEE and a member of the IEEE Computer
Society and the ACM. Contact him at [email protected].