International Journal of Data Warehousing and Mining, 7(4), 64-85, October-December 2011
DOI: 10.4018/jdwm.2011100104

Preserving Privacy in Time Series Data Mining

Ye Zhu, Cleveland State University, USA
Yongjian Fu, Cleveland State University, USA
Huirong Fu, Oakland University, USA

ABSTRACT

Time series data mining poses new challenges to privacy. Through extensive experiments, the authors find that existing privacy-preserving techniques such as aggregation and adding random noise are insufficient due to privacy attacks such as data flow separation attack. This paper also presents a general model for publishing and mining time series data and its privacy issues. Based on the model, a spectrum of privacy preserving methods is proposed. For each method, effects on classification accuracy, aggregation error, and privacy leak are studied. Experiments are conducted to evaluate the performance of the methods. The results show that the methods can effectively preserve privacy without losing much classification accuracy and within a specified limit of aggregation error.

Keywords: Aggregation Error, Classification Accuracy, Privacy, Privacy Attacks, Time Series Data Mining

1. INTRODUCTION

Privacy has been identified as an important issue in data mining. The challenge is to enable data miners to discover knowledge from data, while protecting data privacy. On one hand, data miners want to find interesting global patterns. On the other hand, data providers do not want to reveal the identity of individual data. This leads to the study of privacy-preserving data mining (Agrawal & Srikant, 2000).

Two common approaches in privacy-preserving data mining are data perturbation and data partitioning. In data perturbation, the original data is modified by adding noise, aggregating, transforming, obscuring, and so on. Privacy is preserved by mining the modified data instead of the original data.
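
Data perturbation by additive noise, and its side effect on aggregates, can be sketched as follows. This is only a minimal illustration: the three series, the noise level `sigma`, and the use of mean absolute error as the aggregation-error measure are assumptions made for the example, not the definitions or methods used later in the paper.

```python
import random


def perturb(series, sigma, rng):
    """Return a copy of `series` with zero-mean Gaussian noise added to each point."""
    return [x + rng.gauss(0.0, sigma) for x in series]


def aggregate(series_list):
    """Point-wise sum of equally long series."""
    return [sum(points) for points in zip(*series_list)]


def mean_abs_error(a, b):
    """Mean absolute difference between two equally long series."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Three hypothetical data providers, each holding one short time series.
originals = [
    [10.0, 12.0, 11.0, 13.0, 12.0, 14.0],
    [5.0, 5.5, 6.0, 5.5, 5.0, 5.5],
    [20.0, 19.0, 21.0, 20.0, 22.0, 21.0],
]

# Each provider publishes a perturbed series instead of the original.
published = [perturb(s, sigma=1.0, rng=rng) for s in originals]

# The miner only sees the published data; the aggregate it computes
# deviates from the true aggregate by the summed noise.
true_total = aggregate(originals)
noisy_total = aggregate(published)
aggregation_error = mean_abs_error(true_total, noisy_total)
print(f"mean absolute aggregation error: {aggregation_error:.3f}")
```

Raising `sigma` hides individual values more thoroughly but also increases the deviation of the aggregate, which is the trade-off the paper sets out to balance.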
In data partitioning, data is split among multiple parties, who securely compute interesting patterns without sharing data.

However, privacy issues in time series data mining go beyond data identity. In time series data mining, characteristics of a time series can be regarded as private information. These characteristics can be trends, peaks, and troughs in the time domain or periodicity in the frequency domain. For example, a company's sales data may show periodicity, which competitors can use to infer promotion periods. Certainly, the company does not want to share such data. Moreover, existing approaches to preserving privacy in data mining may not protect privacy in time series data mining. In particular, aggregation and naively adding noise to time series data are prone to privacy attacks.

Copyright © 2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

In this paper, we study privacy issues in time series data mining. The objective of this research is to identify effective privacy-preserving methods for time series data mining. We first present a model for publishing and mining time series data and then discuss potential attacks on privacy. As a countermeasure to privacy threats, we propose adding noise to the original data to preserve privacy. The effects of noise on preserving privacy and on data mining performance are studied. The data mining task in our study is classification, and its performance is measured by classification accuracy. We propose a spectrum of methods for adding noise. For each method, we first explain the intuition behind the idea and then present its algorithm. The methods are implemented and evaluated in experiments in terms of their impact on privacy preservation, classification accuracy, and aggregation error.
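
The earlier remark that periodicity is private information, and that naively added noise may not hide it, can be illustrated with a small sketch. It is not the data flow separation attack studied in the paper, only a simpler spectral illustration. The synthetic sales series, the weekly period, the noise level, and the helper `dominant_period` are all assumptions made for this example.

```python
import cmath
import math
import random


def dominant_period(series):
    """Return the period (in samples) of the strongest non-constant frequency.

    Uses a direct discrete Fourier transform over the mean-centered series
    and picks the frequency bin with the largest magnitude.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    best_k, best_mag = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return n / best_k


rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_days = 140

# Hypothetical daily sales with a weekly (7-day) promotion cycle.
sales = [100.0 + 10.0 * math.sin(2 * math.pi * t / 7) for t in range(n_days)]

# Naive protection: independent Gaussian noise whose standard deviation
# equals the amplitude of the weekly cycle.
noisy_sales = [x + rng.gauss(0.0, 10.0) for x in sales]

print("period found in original data:", dominant_period(sales))
print("period found in noisy data:   ", dominant_period(noisy_sales))
```

Independent noise spreads its energy across all frequencies, while the periodic component concentrates in one, so the cycle can remain visible in the spectrum even when individual points are heavily distorted.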
Our experiments show that these methods can preserve privacy without seriously sacrificing classification accuracy or increasing aggregation error.

The contributions of our paper are:
(a) We identify privacy issues in time series data mining and propose a general model for protecting privacy in time series data mining.
(b) We propose a set of methods for preserving privacy by adding noise. Their performance is evaluated against real data sets.
(c) We analyze the effect of noise on preserving privacy and its impact on data mining performance in order to strike a balance between the two.

The rest of the paper is organized as follows. In Section 2, we discuss related work in privacy preservation and time series data mining. A general model for publishing and mining time series data is proposed in Section 3, along with a discussion of its privacy concerns. Methods for preserving privacy by adding noise are proposed in Section 4. The effects of noise on privacy preservation, classification accuracy, and aggregation error are studied in Section 5. Related issues are discussed in Section 6. Section 7 concludes the study and gives a few future research directions.

2. RELATED WORK

Privacy Preserving Data Mining

To preserve privacy in data mining, researchers have proposed many approaches, which can be categorized into two main groups: data perturbation and data partitioning. In data perturbation approaches, data is modified by adding noise, aggregation, suppression, transformation, and so on. Data mining is performed on the modified data instead of the original data to preserve privacy. Random noise is added to preserve privacy in decision tree construction (Agrawal & Srikant, 2000) and in mining association rules (Evfimievski, Srikant, Agrawal, & Gehrke, 2002). The effects of random noise on privacy preservation and data mining performance are studied in Du and Zhan (2003), as are effective approaches to randomization (Huang, Du, & Chen, 2005; Zhu & Liu, 2004).
Generalization is proposed to achieve k-anonymity, where each record is identical to at least k-1 other records in the data set (Bayardo & Agrawal, 2005; LeFevre, DeWitt, & Ramakrishnan, 2006; Iyengar, 2002). Anonymization for classification is studied in Fung and Wang (2007). A privacy-preserving protocol for computing aggregation queries using randomized algorithms is proposed in She, Want, Fu, and Yabo (2008).

In data partitioning approaches, data is distributed among multiple parties. To preserve privacy, the parties do not share their data but cooperate to find global patterns. In most cases, secure multi-party computation (Du & Atallah, 2001; Yildizli, Pedersen, Saygin, Savas, & Levi, 2011) is employed, and many approaches also use encryption. Secure multi-party computation has been introduced for building decision trees (Lindell & Pinkas, 2000), mining association rules (Vaidya & Clifton, 2002; Kantarcioglu & Clifton, 2004), clustering with k-means (Vaidya & Clifton, 2003; Jagannathan & Wright, 2005), and