COVER FEATURE

Privacy-Preserving Data Mining Systems

Nan Zhang, University of Texas at Arlington
Wei Zhao, Rensselaer Polytechnic Institute

Although successful in many applications, data mining poses special concerns for private data. An integrated architecture takes a systemic view of the problem, implementing established protocols for data collection, inference control, and information sharing.

Data mining successfully extracts knowledge to support a variety of domains (marketing, weather forecasting, medical diagnosis, and national security), but it is still a challenge to mine certain kinds of data without violating the data owners’ privacy [1]. How to mine patients’ private data, for example, is an ongoing problem in healthcare applications. In recognition of the growing privacy concern, directives such as the US Health Insurance Portability and Accountability Act (HIPAA) and the European Union Privacy Directive mandate privacy protection for data management and analysis systems.

As data mining becomes more pervasive, such concerns are increasing. Online data collection systems are an example of new applications that threaten individual privacy. Already companies are sharing data mining models to obtain a richer set of data about mutual customers and their buying habits. The computing community must address data mining privacy before data mining techniques become widespread and the threat to private information spirals out of control.

The sticking point is how to protect privacy while preserving the usefulness of data mining results. Much research is under way to address obstacles, but practical privacy-preserving data mining systems are largely in the research and prototyping stages. Many techniques for privacy-preserving data mining concentrate on algorithmic solutions and underlying mathematical tools [2, 3] rather than focusing on system issues.
Our goal in investigating privacy preservation issues was to take a systemic view of architectural requirements and design principles and explore possible solutions that would lead to guidelines for building practical privacy-preserving data mining systems.

Published by the IEEE Computer Society, 0018-9162/07/$25.00 © 2007 IEEE. Computer, April 2007.

FOUNDATIONAL DESIGN

As Figure 1 shows, privacy-preserving data mining usually has multiple steps that translate to a three-tiered architecture.

At the bottom tier are the data providers, the data owners, which are often physically distributed. The data providers submit their private data to the data warehouse server. This server, which constitutes the middle tier, supports online analytical data processing to facilitate data mining by translating raw data from the data providers into aggregate data that the data mining servers can more quickly process.

The data warehouse server stores the data collected in disciplined physical structures, such as a multidimensional data cube, and aggregates and precomputes the data in various forms, such as sum, average, max, and min. In an online survey system, for example, the survey respondents would be data providers who submit their data to the survey analyzer’s data warehouse server; an aggregated data point might be the average age of all survey respondents. The aggregated data is more efficient to process than raw data from the providers.

At the top tier are the data mining servers, which perform the actual data mining. In a privacy-preserving data mining system, these servers do not have free access to all data in the data warehouse. In a hospital system, the accounting department can mine patients’ financial data, for example, but cannot access patients’ medical records.
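
As a rough illustration of the middle tier, the Python sketch below shows a warehouse that accepts raw records from providers and hands out only precomputed aggregates (sum, average, max, min), subject to a per-requester access rule like the hospital example. The class name `DataWarehouse`, the `access_rules` mapping, and the sample records are all hypothetical, introduced here for illustration; the article does not prescribe an implementation.

```python
class DataWarehouse:
    """Middle tier: stores submitted records and serves only aggregates."""

    def __init__(self, access_rules):
        # access_rules (an assumed structure) maps a requester name to the
        # set of attributes that requester is allowed to query.
        self._records = []
        self._access_rules = access_rules

    def submit(self, record):
        """Called by a data provider to hand over one record."""
        self._records.append(dict(record))

    def aggregate(self, requester, attribute):
        """Return aggregates for one attribute, or refuse the request."""
        if attribute not in self._access_rules.get(requester, set()):
            raise PermissionError(f"{requester} may not query {attribute}")
        values = [r[attribute] for r in self._records if attribute in r]
        if not values:
            raise ValueError(f"no data for {attribute}")
        return {
            "sum": sum(values),
            "average": sum(values) / len(values),
            "max": max(values),
            "min": min(values),
        }


# Hypothetical hospital setting: accounting sees charges, not diagnoses.
warehouse = DataWarehouse({"accounting": {"charges"}, "survey_analyst": {"age"}})
for record in (
    {"age": 34, "charges": 1200, "diagnosis": "flu"},
    {"age": 51, "charges": 300, "diagnosis": "asthma"},
    {"age": 29, "charges": 900, "diagnosis": "flu"},
):
    warehouse.submit(record)

charges = warehouse.aggregate("accounting", "charges")
average_age = warehouse.aggregate("survey_analyst", "age")["average"]
```

In this sketch the data mining servers never touch `_records` directly; a request such as `warehouse.aggregate("accounting", "diagnosis")` raises `PermissionError`. Deciding what the access rules should contain is the harder question.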
Developing and validating effective rules for the data mining servers’ access to the data warehouse is an open research problem [4].

Besides constructing data mining models on its local data warehouse server, a data mining server might share information with data mining servers from other systems. The motivation for this sharing is to build data mining models that span systems. For example, several retail companies might opt to share their local data mining models on customer records to build a global data mining model about consumer behavior that would benefit all the companies. As Figure 1 shows, sharing occurs in the top tier, where each data mining server holds the data mining model of its own system. Thus, “sharing” means sharing local data mining models rather than raw data.

[Figure 1. Basic architecture for privacy-preserving data mining. The architecture typically has three tiers: data providers, which are the data owners; the data warehouse server, which supports online analytical processing; and the data mining servers that perform data mining tasks and share information. The challenge is to control private information transmitted among entities without impeding data mining. Diagram labels: Data Mining System 1 and Data Mining System 2, each with data providers, a data warehouse server, and data mining servers; information sharing links the data mining servers of the two systems.]

“Minimum necessary” design principle

Any design of a privacy-preserving data mining system requires a clear definition of privacy. The common interpretation is that a data point is private if its owner has the right to choose whether or not, to what extent, and for what purpose to disclose the data point to others. In privacy-preserving data mining literature, most authors assume (either implicitly or explicitly) that a data owner generally chooses not to disclose its private data unless data mining requires it. This assumption and the accepted information-privacy definition form the basis of the “minimum necessary” design principle:

    In a data mining system, disclosed private information (from one entity to another) should be the minimum necessary for data mining.

Minimum in this context is a qualitative, not a quantitative, measure. Since the quantitative measure of privacy disclosure varies among systems, minimum captures the idea that all unnecessary private information (unnecessary in the context of how accurate the data mining results must be) should not be disclosed. Minimum thus means that privacy disclosure is on a need-to-know basis. Many privacy regulations, including HIPAA, mandate this minimum necessary rule.

Privacy protocols

On the basis of the architecture in Figure 1 and the minimum necessary design principle, we have evolved a basic strategy for building a privacy-preserving data mining system. Central to the strategy are three protocols that govern privacy disclosure among entities:

• Data collection protects privacy during data transmission from the data providers to the data warehouse server.
• Inference control manages privacy protection between the data warehouse server and data mining servers.
• Information sharing controls information shared among the data mining servers in different systems.

Given the minimum necessary rule, a common goal of these protocols is to transmit the minimum private information necessary for data mining from one entity to another to build accurate data mining models. In reality, it is often difficult to build an efficient system that protects private information perfectly. Consequently, there are always tradeoffs between data privacy and data mining model accuracy. These protocols are based on established methods that the system designer can tailor to particular requirements, choosing the most beneficial tradeoffs. The data collection protocol, for example, can draw from one of two established collection methods, each with its advantages and drawbacks.

DATA COLLECTION PROTOCOL

The data collection protocol lets data providers identify the minimum necessary part of private information (what must be disclosed to build accurate data mining models) and ensures that they transmit only that part of the information to the data warehouse server.

Several requirements shape the data collection protocol. First, it must be scalable, since a data warehouse server can deal with as many as hundreds of thousands of data providers, as in an online survey system. Second, the computational cost to data providers must be small because they have considerably lower computational power than the data warehouse server, and a higher cost could discourage them from participating in data mining. Finally, the protocol must be robust; it must deliver relatively accurate data mining results while protecting data providers’ privacy, even if data providers behave erratically. For example, if some data providers in an online survey system deviate from the protocol or submit meaningless data, the data collection protocol must control the influence of such erroneous behavior and ensure that global data mining results remain sufficiently accurate.

Figure 2 shows a data collection protocol taxonomy based on two data collection methods.

[Figure 2. Data collection protocol taxonomy. A designer can choose which of two methods (value- or dimension-based) and its attendant approaches best serve the design. The value-based method comprises the perturbation-based and aggregation-based approaches; the dimension-based method comprises the blocking-based and projection-based approaches.]

Value-based method

With the value-based method [5], a data provider manipulates the value of each data attribute or item independently using one of two approaches. The perturbation-based approach [3] adds noise directly to the original data values, such as changing age 23 to 30 or Texas to California. The aggregation-based approach generalizes data according to the relevant domain hierarchy, such as changing age 23 to age range 21-25 or Texas to the US.

The perturbation-based approach is highly suitable for arbitrary data, while the aggregation-based approach relies on knowledge of the domain hierarchy, but can be effective in guaranteeing the data’s anonymity [6]: k-anonymity, for example, means that each perturbed data record is indistinguishable from the perturbed values of at least k–1 other data records.

The value-based method assumes that it would be difficult, if not impossible, for the data warehouse server to rediscover the original private data from the manipulated values but that the server would still be able to recover the original data distribution from the perturbed data, thereby supporting the construction of accurate data mining models [5].

Dimension-based method

The dimension-based method is so called because the data to be mined usually has many attributes, or dimensions. The basic idea is to remove part of the private information from the original data by reducing the number of dimensions. The blocking-based approach [3] accomplishes this by truncating some private attributes without releasing them to the data warehouse server. However, this approach could result in information loss, preventing data mining servers from constructing accurate data mining models. The more complicated projection-based approach [7] overcomes this problem by projecting the original data into a carefully designed, low-dimensional subspace in a way that retains only the minimum information necessary to construct accurate data mining models.

Advantages and drawbacks

Each method and attendant approach has pluses and minuses.
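
Before weighing those tradeoffs, it may help to see the simpler approaches in miniature. The Python sketch below is illustrative only: the function names, the uniform noise with a spread of 10, the five-year age bands, and the small region hierarchy are assumptions made for this example, not part of any published protocol. It shows perturbation and aggregation of single values, blocking of whole attributes, and a much simpler stand-in for distribution recovery: because the added noise has zero mean, the server can still estimate the mean age from perturbed submissions.

```python
import random

# Hypothetical domain hierarchy used by the aggregation-based approach.
REGION_HIERARCHY = {"Texas": "US", "California": "US", "Ontario": "Canada"}


def perturb_age(age, rng, spread=10.0):
    """Perturbation-based: add zero-mean uniform noise to a single value."""
    return age + rng.uniform(-spread, spread)


def generalize_age(age, width=5):
    """Aggregation-based: replace an age with its band, e.g. 23 -> '21-25'."""
    low = ((age - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


def generalize_region(region):
    """Aggregation-based: move one level up the (assumed) hierarchy."""
    return REGION_HIERARCHY.get(region, "Unknown")


def block(record, private_attributes):
    """Blocking-based: drop private attributes before submission."""
    return {k: v for k, v in record.items() if k not in private_attributes}


# Simulated providers: each perturbs its own age before submitting it.
rng = random.Random(0)
true_ages = [rng.randint(18, 80) for _ in range(10_000)]
submitted = [perturb_age(age, rng) for age in true_ages]

# The server never sees true_ages, yet its estimate of the mean stays close.
true_mean = sum(true_ages) / len(true_ages)
estimated_mean = sum(submitted) / len(submitted)
```

Here `generalize_age(23)` yields the band `"21-25"` from the article's example, and `block` removes attributes outright. Estimating a mean is far weaker than reconstructing a full distribution, so this only hints at why perturbed data can remain useful.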
The value-based method is independent of the data mining task, which makes it suitable for applications involving multiple data mining tasks or tasks unknown at data collection. In contrast, the dimension-based method fits better with individual data mining tasks because the information to be retained after dimension reduction usually depends on the particular task.

So far, research has not defined an effective and universally applicable projection-based approach. Even so, the projection-based approach promises strong advantages over value-based methods in terms of the tradeoff between accuracy and privacy protection. Most value-based approaches treat different attributes independently and separately, so at least some attributes that are less necessary for data mining are always disclosed to the data warehouse server to the same extent as other attributes.

Indeed a recent study revealed that, with the perturbation-based randomization approach, the data warehouse server could use privacy intrusion techniques to filter noise from the perturbed data, thereby rediscovering part of the original private data [8]. The projection-based approach avoids this problem by exploiting the relationship among attributes and disclosing only those necessary for data mining.

Guiding data submission can also reduce unnecessary privacy disclosure, enhancing the performance of data perturbation. In earlier work [7], we and colleague Shengquan Wang proposed a guidance-based dimension reduction scheme for dynamic systems, such as online survey systems, in which data providers (survey respondents and so on) join the system and submit their data asynchronously. To guide data providers that have not yet submitted data, the scheme analyzes the data already collected and estimates the attributes necessary for data mining. The system then sends the estimated useful attributes to data providers as guidance. Our work shows that this guidance-based scheme is more effective than approaches without such guidance.

INFERENCE CONTROL PROTOCOL

Protecting private data in the data warehouse server requires controlling the information disclosed to the data mining servers, which is the aim of the inference control protocol. Following the minimum necessary rule, the inference control protocol ensures that the data warehouse server answers the queries necessary for data mining yet minimizes privacy disclosure.

Several requirements drive the inference control protocol’s design and implementation. One is the need to block inferences. If a data mining server becomes an adversary, it will try to infer private information from the query answers it has already received. Figure 3 gives an example.

    Item   April     May       June      July      Sum
    Book   10        Known     15        Known     Q5 = 25
    CD     20        Known     27        Known     Q6 = 47
    DVD    Known     35        16        36        Q7 = 87
    Game   Known     25        Known     14        Q8 = 39
    Sum    Q1 = 30   Q2 = 60   Q3 = 58   Q4 = 50

Figure 3. Inference that discloses private information. If the data mining server becomes an adversary, it might be able to infer from the query answers and certain cells (Known) the number of DVDs a data provider sold in June (which is private and should not be disclosed) by computing Q1 + Q3 – (Q5 + Q6) = 88 – 72 = 16, where Q1 to Q8 are query answers.

Further, the inference control protocol must be efficient enough to satisfy the data warehouse server’s required online response time (the time between issuing a query and answering it). The time that an inference control protocol uses is part of that response time. It must be controlled so that the data warehouse server can maintain its reduced response time.

To meet these requirements, inference control protocols must restrict the information included in the query answers so that the data mining server cannot infer private data from received query answers.

Figure 4 shows an inference control protocol taxonomy based on two inference control methods.

[Figure 4. An inference control protocol taxonomy. A designer can choose which of two methods (query- or data-oriented) best serves the design. Query-oriented method: classify safe and unsafe sets offline, or check query history online. Data-oriented method: do perturbation by data collection, or do perturbation online when a query is received.]

Query-oriented method

The query-oriented method [4] is centered on the concept of a safe query set, which says that query set <Q1, Q2, …, Qn> is safe if a data mining server cannot infer private data from the answers to Q1, Q2, …, Qn. Thus, query-oriented inference control means that when the data warehouse server receives a query, it will answer the query only if the union set of query history (the set of all queries already answered) and the recently received query are safe. Otherwise, it will reject the query. Relative to query-oriented inference control in statistical databases, inference control in data warehouses involves significantly more data. Consequently, the burden is on inference control protocols to process queries more efficiently.

Because dynamically determining a query set’s safety (online query history check) can be time-consuming, a static version of the query-oriented method might be more suitable. The static version determines a safe set of queries offline (before any query is actually received). If a query set is safe, then any one of its subsets is also safe. At runtime, when the data warehouse server receives the query, it answers only if the query is in the predetermined safe set. Otherwise, it will reject the query.
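
The Figure 3 inference and the answer-or-reject rule can be sketched in Python using the private cells from that figure. The sketch rests on a simplifying assumption that is not taken from the article: a set of sum queries counts as unsafe exactly when some single private cell is a linear combination of the queries, which is tested by row-reducing the query matrix and looking for a row with one nonzero entry. All names (`CELLS`, `exposed_cells`, `QueryOrientedControl`) are hypothetical, and the controller shown is the online history-check variant; a static variant would run the same test offline over candidate query sets.

```python
from fractions import Fraction

# Private cells from Figure 3; cells marked "Known" are not part of the sums.
CELLS = {
    ("Book", "April"): 10, ("Book", "June"): 15,
    ("CD", "April"): 20, ("CD", "June"): 27,
    ("DVD", "May"): 35, ("DVD", "June"): 16, ("DVD", "July"): 36,
    ("Game", "May"): 25, ("Game", "July"): 14,
}
MONTH_QUERIES = {"Q1": "April", "Q2": "May", "Q3": "June", "Q4": "July"}
ITEM_QUERIES = {"Q5": "Book", "Q6": "CD", "Q7": "DVD", "Q8": "Game"}


def cells_of(query):
    """Return the set of private cells that a named sum query covers."""
    if query in MONTH_QUERIES:
        return {c for c in CELLS if c[1] == MONTH_QUERIES[query]}
    return {c for c in CELLS if c[0] == ITEM_QUERIES[query]}


def answer(query):
    return sum(CELLS[c] for c in cells_of(query))


def exposed_cells(queries):
    """Cells computable as a linear combination of the given sum queries."""
    order = sorted(CELLS)
    rows = [[Fraction(int(c in cells_of(q))) for c in order] for q in queries]
    pivot_row = 0
    for col in range(len(order)):  # Gauss-Jordan elimination, exact arithmetic
        pivot = next((r for r in range(pivot_row, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[pivot_row], rows[pivot] = rows[pivot], rows[pivot_row]
        lead = rows[pivot_row][col]
        rows[pivot_row] = [v / lead for v in rows[pivot_row]]
        for r in range(len(rows)):
            if r != pivot_row and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
    exposed = set()
    for row in rows:
        nonzero = [i for i, v in enumerate(row) if v != 0]
        if len(nonzero) == 1:  # a reduced row that isolates one cell
            exposed.add(order[nonzero[0]])
    return exposed


class QueryOrientedControl:
    """Answer a query only if history plus the new query stays safe."""

    def __init__(self):
        self.history = []

    def ask(self, query):
        if exposed_cells(self.history + [query]):
            return None  # reject: answering would expose a private cell
        self.history.append(query)
        return answer(query)


control = QueryOrientedControl()
replies = [control.ask(q) for q in ("Q1", "Q5", "Q6", "Q3")]
```

With this ordering the first three queries are answered and Q3 is rejected, because Q1 + Q3 – (Q5 + Q6) would isolate the June DVD cell. The linear-algebra test catches only exact inference from sums; other definitions of safety would need a different check.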
On the downside, the static method is conservative in selecting a safe set, which might cause it to reject some queries unnecessarily.

Data-oriented method

With the data-oriented method of inference control [9], the data warehouse server perturbs the stored raw data and estimates the query answers as accurately as possible on the basis of the perturbed data. As Figure 4 shows, the data collection protocol can handle perturbation unless the application requires storing original data in the data warehouse server. In that case, the data warehouse server might have to perturb the data when processing the query.

The data-oriented method assumes that perturbation can protect private information from being disclosed, enabling the data warehouse server to answer all queries freely on the basis of the perturbed data. Research has shown that the query answers estimated from the perturbed data can still support the construction of accurate data mining models [5].

Advantages and disadvantages

The two methods have unique performance considerations. The data-oriented method offers query responsiveness, since the data warehouse server will answer all queries. The query-oriented method, in contrast, normally rejects a substantial number of queries [9], which means that some data mining servers might be unable to complete their data mining tasks.

On the plus side, the query-oriented method can provide more accurate answers than the data-oriented method. When the data warehouse server answers a query, its answer will always be precise. The data-oriented method, in contrast, answers queries with estimation, so it might not be accurate enough to support data mining, particularly when the construction of data mining models requires highly accurate query answers.

Efficiency is an important advantage for the static version of the query-oriented method, which has the shortest response time because most of its computational cost is offline. The dynamic version must trade off efficiency and query responsiveness: To answer more queries, the data warehouse server must spend more time analyzing the query history. The data-oriented method also suffers from low efficiency, since the computational overhead for query estimation can be several orders of magnitude higher than for query answering.

One way to enhance inference control protocol performance is to integrate query- and data-oriented methods. Introducing the query answer-or-reject scheme to the data-oriented method would let the data warehouse server reject some privacy-divulging queries (such as Q3 in Figure 3). This, in turn, would effectively downgrade the data perturbation level yet retain the same degree of privacy protection. Because the data is perturbed, the server would have to reject far fewer queries and could thus answer most queries fairly accurately while continuing to protect private information.

INFORMATION SHARING PROTOCOL

Because each data mining server constructs local data mining models in its own system, these servers are likely to share their local data mining models rather than the raw data in the data warehouses. Local data mining models can be sensitive, especially when the local models are not globally valid.

To protect the privacy of individual data mining systems, some mechanism must control the disclosure of private information in local data mining models. This mechanism is the information sharing protocol, which again follows the minimum necessary rule. The protocol’s objective is to enable data mining servers across multiple systems to construct global data mining models while disclosing only the minimum private information about local data mining models necessary for information sharing.

Many information sharing protocols exist for applications other than data mining, such as database interoperation or data integration [10]. Information sharing is necessary for most distributed data mining systems, and much work has focused on designing specific information sharing protocols for data mining tasks.

A major design concern of the information sharing protocol is defending against adversaries that behave arbitrarily within the capability allocated to them. The defense strategy depends on the adversary model, that is, the set of assumptions about an adversary’s intent and behavior. Two of the more popular adversary models are semihonest [10] and beyond semihonest.

Semihonest adversaries

An adversary is semihonest if it properly follows the designated protocol but records all intermediate computation and communication, thereby providing a way to derive private information.

Cryptographic encryption has proved effective in defending against semihonest adversaries [2, 10, 11]. In this method, each data mining server encrypts its local data mining model and exchanges the encrypted model with other data mining servers.

Some encryption scheme properties, such as the Rivest-Shamir-Adleman (RSA) cryptosystem’s commutative encryption property, make it possible to design algorithms for data mining servers to perform certain data mining tasks and set operations without knowing the private keys of other entities [2, 10, 11]. Tasks include classification, association rule mining, clustering, and collaborative filtering; set operations include set intersection, set union, and element reduction. Because it is not possible to recover the original (local) data mining models from their encrypted values without knowing the private keys, this method is a secure defense against semihonest adversaries.
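
To make the commutative property concrete, here is a minimal Python sketch of a two-party set intersection. It swaps in a plain exponentiation cipher over a shared prime modulus (in the style of Pohlig-Hellman) as a stand-in for the commutative encryption the text mentions; it is not RSA itself and not a vetted protocol. The modulus, key generation, hashing step, and function names are assumptions chosen for illustration, and the parameters are toy-sized.

```python
import hashlib
import math
import secrets

# Shared public modulus: the Mersenne prime 2**127 - 1 (toy size, assumed).
P = 2**127 - 1


def keygen():
    """Pick a secret exponent that is invertible modulo P - 1."""
    while True:
        key = secrets.randbelow(P - 3) + 2
        if math.gcd(key, P - 1) == 1:
            return key


def to_group(item):
    """Hash an item to a nonzero residue modulo P."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return 1 + int.from_bytes(digest, "big") % (P - 1)


def encrypt(value, key):
    """Commutative: encrypt(encrypt(v, a), b) == encrypt(encrypt(v, b), a)."""
    return pow(value, key, P)


def private_intersection(items_a, items_b):
    """Simulate both servers; returns the items of A that B also holds."""
    key_a, key_b = keygen(), keygen()
    ordered_a = sorted(items_a)

    # Server A sends its items under key_a only.
    a_once = [encrypt(to_group(x), key_a) for x in ordered_a]
    # Server B adds its own layer (keeping order) and sends its set under key_b.
    a_twice = [encrypt(c, key_b) for c in a_once]
    b_once = [encrypt(to_group(y), key_b) for y in items_b]
    # Server A adds its layer to B's values; equal items now collide.
    b_twice = {encrypt(c, key_a) for c in b_once}

    return {x for x, c in zip(ordered_a, a_twice) if c in b_twice}


shared = private_intersection({"alice", "bob", "carol"}, {"bob", "carol", "dave"})
```

Neither side ever sees the other's plaintext items or key in this simulation; each sees only values under at least one layer it cannot remove. A real deployment would need vetted group parameters and a treatment of what set sizes and repeated runs reveal.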
Researchers have already evolved a detailed taxonomy and cryptographic encryption methods for various system settings [2, 3].

Beyond semihonest adversaries

An adversary is considered beyond semihonest if it deviates from the designated protocol, changes its input data, or both.

Because it is difficult if not impossible to defend against an adversary that is behaving arbitrarily, dealing with beyond semihonest adversaries requires more refined models. One such model is the intent-based adversary model [12], which formulates an adversary’s intent as combining the intent to obtain accurate data mining results with compromising other entities’ private information. A game-theoretic method is then developed to defend against adversaries that weigh the accuracy of data mining results over compromising other parties’ privacy [12].

The basic idea is to design the information sharing protocol in a way that no adversary can both obtain accurate data mining results and intrude on other servers’ privacy. Adversaries that are more concerned with the accuracy of data mining results will be forced not to intrude on the privacy of others to get that accuracy.

OPEN RESEARCH ISSUES

Several issues require additional research to ensure the optimum performance of the techniques described.

Protocol integration

Many systems need a seamless integration of the three protocols, yet little research has addressed this need. Our proposed integrated architecture could serve as a platform for studying protocol interaction. Such insights can pave the way for effective and efficient integration.

Heterogeneous privacy requirements

Privacy-preserving data mining techniques depend on respecting the privacy protection levels that data providers require. Most existing studies assume homogeneous privacy requirements: that all data owners need the same privacy level for all their data and its attributes. This assumption is unrealistic in practice and could even degrade system performance unnecessarily. Designing and implementing techniques that exploit heterogeneous privacy requirements is a challenge with much potential return.

Privacy measurements

The accuracy versus protection tradeoff inherent in privacy-preserving data mining means that some mechanism must accurately measure the degree of privacy protection. Although extensive work has focused on privacy measurement, as yet no one has proposed a commonly accepted measurement technique for generic privacy-preserving data mining systems.

Proper privacy protection measurement has three criteria: It must

• reflect system settings (adversaries might have different levels of interest in different data values, such as being more concerned with patients that have contagious diseases than other diseases),
• account for data providers’ diverse privacy concerns (some might consider age as private information, while others are willing to disclose it publicly), and
• satisfy the minimum necessary rule.

A comprehensive study of privacy measurement for all three protocols would be a huge step toward improving the performance of privacy-preserving data mining techniques.

Anomaly detection

A common application of data mining is to detect data-set anomalies, as in mining log file data to detect intrusions. However, few researchers have considered privacy protection in detecting anomalies.

Research on anomaly detection is an important part of data mining and can contribute to multiple disciplines, such as security, biology, and finance. Thoroughly investigating issues related to the design of privacy-preserving data mining techniques for anomaly detection would be extremely beneficial.

Multiple protection levels

In some cases, multiple levels of private information must be protected. The first level might be a data point value, and the second level, the data point sensitivity (knowledge of whether or not a data point is private). Most existing studies focus on protecting the first level and assume that all entities already know the second level. Research has yet to answer how to protect the second level (and higher levels) of private information.

Our work is an important first step in addressing the critical systemic issues of privacy preservation in data mining. Much research remains to realize the potential of the architecture and design principles we have described.

Much literature already addresses privacy-preserving data mining, but clearly the ideas must cross considerable ground to become practical systems. Studies are needed for the design of privacy-preserving data mining techniques in real-world scenarios, in which data owners can freely address their individual privacy concerns without the data miner’s consent. Also critical is work that more closely incorporates designs with specialized applications such as healthcare, market analysis, and finance. Our hope is that others will continue efforts in this important area.

References

1. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
2. C. Clifton et al., “Tools for Privacy Preserving Distributed Data Mining,” SIGKDD Explorations, vol. 4, no. 2, 2003, pp. 28-34.
3. V.S. Verykios et al., “State-of-the-Art in Privacy Preserving Data Mining,” SIGMOD Record, vol. 33, no. 1, 2004, pp. 50-57.
4. L. Wang, S. Jajodia, and D. Wijesekera, “Securing OLAP Data Cubes against Privacy Breaches,” Proc. 25th IEEE Symp. Security and Privacy, IEEE Press, 2004, pp. 161-175.
5. R. Agrawal and R. Srikant, “Privacy-Preserving Data Mining,” Proc. 19th ACM SIGMOD Int’l Conf. Management of Data, ACM Press, 2000, pp. 439-450.
6. R.J. Bayardo and R. Agrawal, “Data Privacy through Optimal k-Anonymization,” Proc. 21st Int’l Conf. Data Eng., IEEE Press, 2005, pp. 217-228.
7. N. Zhang, S. Wang, and W. Zhao, “A New Scheme on Privacy-Preserving Data Classification,” Proc. 11th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, ACM Press, 2005, pp. 374-383.
8. Z. Huang, W. Du, and B. Chen, “Deriving Private Information from Randomized Data,” Proc. 24th ACM SIGMOD Int’l Conf. Management of Data, ACM Press, 2005, pp. 37-48.
9. R. Agrawal, R. Srikant, and D. Thomas, “Privacy-Preserving OLAP,” Proc. 25th ACM SIGMOD Int’l Conf. Management of Data, ACM Press, 2005, pp. 251-262.
10. R. Agrawal, A. Evfimievski, and R. Srikant, “Information Sharing across Private Databases,” Proc. 22nd ACM SIGMOD Int’l Conf. Management of Data, ACM Press, 2003, pp. 86-97.
11. Y. Lindell and B. Pinkas, “Privacy Preserving Data Mining,” Proc. 12th Ann. Int’l Conf. Advances in Cryptology, Springer-Verlag, 2000, pp. 36-54.
12. N. Zhang and W. Zhao, “Distributed Privacy Preserving Information Sharing,” Proc. 31st Int’l Conf. Very Large Data Bases, ACM Press, 2005, pp. 889-900.

Nan Zhang is an assistant professor of computer science and engineering at the University of Texas at Arlington. His research interests include databases and data mining, information security and privacy, and distributed systems. Zhang received a PhD in computer science from Texas A&M University. He is a member of the IEEE. Contact him at [email protected].

Wei Zhao is a professor of computer science and the dean for the School of Science at Rensselaer Polytechnic Institute. His research interests include distributed computing, real-time systems, computer networks, and cyberspace security. Zhao received a PhD in computer and information sciences from the University of Massachusetts, Amherst. He is a Fellow of the IEEE and a member of the IEEE Computer Society and the ACM. Contact him at [email protected].