International Journal of Data Warehousing and Mining, 7(4), 64-85, October-December 2011
Preserving Privacy in Time Series Data Mining
Ye Zhu, Cleveland State University, USA
Yongjian Fu, Cleveland State University, USA
Huirong Fu, Oakland University, USA
ABSTRACT
Time series data mining poses new challenges to privacy. Through extensive experiments, the authors find that
existing privacy-preserving techniques such as aggregation and adding random noise are insufficient against
privacy attacks such as the data flow separation attack. This paper also presents a general model for publishing
and mining time series data and discusses its privacy issues. Based on the model, a spectrum of privacy-preserving
methods is proposed. For each method, the effects on classification accuracy, aggregation error, and privacy
leak are studied. Experiments are conducted to evaluate the performance of the methods. The results show
that the methods can effectively preserve privacy without losing much classification accuracy and within a
specified limit of aggregation error.
Keywords:
Aggregation Error, Classification Accuracy, Privacy, Privacy Attacks, Time Series Data Mining
1. INTRODUCTION
Privacy has been identified as an important issue in data mining. The challenge is to enable
data miners to discover knowledge from data,
while protecting data privacy. On one hand, data
miners want to find interesting global patterns.
On the other hand, data providers do not want
to reveal the identity of individual data. This
leads to the study of privacy-preserving data
mining (Agrawal & Srikant, 2000).
DOI: 10.4018/jdwm.2011100104
Two common approaches in privacy-preserving data mining are data perturbation
and data partitioning. In data perturbation, the
original data is modified by adding noise, aggregating, transforming, obscuring, and so on.
Privacy is preserved by mining the modified data
instead of the original data. In data partitioning, data is split among multiple parties, who
securely compute interesting patterns without
sharing data.
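As a rough illustration of data perturbation, the sketch below applies two of the modifications named above, additive noise and aggregation, to a short series. The function names `add_noise` and `aggregate`, the Gaussian noise model, the window size, and the sample values are all assumptions made for this example; they are not the methods proposed in this paper.

```python
import random
import statistics


def add_noise(series, sigma, seed=0):
    """Perturb each value with zero-mean Gaussian noise (assumed noise model)."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in series]


def aggregate(series, window):
    """Replace each non-overlapping window of values with its mean."""
    return [
        statistics.fmean(series[i:i + window])
        for i in range(0, len(series), window)
    ]


# Hypothetical sample series, invented for illustration only.
original = [10.0, 12.0, 11.0, 13.0, 30.0, 32.0, 31.0, 33.0]

noisy = add_noise(original, sigma=1.0)   # same length, values shifted
coarse = aggregate(original, window=4)   # [11.5, 31.5]

print(noisy)
print(coarse)
```

A data miner would then work on `noisy` or `coarse` rather than on `original`. Both transformations trade detail for privacy: larger `sigma` or `window` values hide more of the individual values but also distort the patterns to be mined.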
Copyright © 2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
However, privacy issues in time series data
mining go beyond data identity. In time series
data mining, characteristics of a time series can
be regarded as private information. These characteristics can be trends, peaks, and troughs in the time
domain or periodicity in the frequency domain.
For example, a company's sales data may show
periodicity that competitors can use to
infer promotion periods. Certainly, the company
does not want to share such data. Moreover,
existing approaches to preserving privacy in
data mining may not protect privacy in time
series data mining. In particular, aggregation
and naively adding noise to time series data
are prone to privacy attacks.
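The sales example can be made concrete with a small sketch showing why naively added noise may not hide periodicity. The code below builds a synthetic series with a 12-step cycle, adds independent Gaussian noise, and recovers the period from the largest discrete Fourier transform coefficient. The series, the noise level, and the helper `dominant_period` are assumptions for illustration; this is not the data flow separation attack studied in the paper, only a simple demonstration that a frequency-domain characteristic can survive time-domain noise.

```python
import cmath
import math
import random


def dominant_period(series):
    """Return the period (in samples) of the strongest non-constant DFT component."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # remove the constant (k = 0) component
    best_k, best_mag = 1, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return n / best_k


# Synthetic "sales" series with a 12-step cycle (hypothetical data).
rng = random.Random(42)
n, period = 120, 12
sales = [100 + 20 * math.sin(2 * math.pi * t / period) for t in range(n)]

# Naive perturbation: independent Gaussian noise on every sample.
noisy_sales = [x + rng.gauss(0, 10) for x in sales]

print(dominant_period(sales))        # period of the clean series
print(dominant_period(noisy_sales))  # period still recoverable after noise
```

Independent noise spreads its energy across all frequencies, while the periodic component concentrates its energy in one frequency bin, so the peak stays visible unless the noise is very large.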
In this paper, we study privacy issues in
time series data mining. The objective of this research is to identify effective privacy-preserving
methods for time series data mining. We first
present a model for publishing and mining time
series data and then discuss potential attacks
on privacy. As a countermeasure to privacy
threats, we propose adding noise to the original
data to preserve privacy. The effects of noise
on preserving privacy and on data mining performance are studied. The data mining task in
our study is classification and its performance
is measured by classification accuracy.
We propose a spectrum of methods for
adding noise. For each method, we first explain
the intuition behind the idea and then present its
algorithm. The methods are implemented and
evaluated in terms of their impacts on privacy
preservation, classification accuracy, and aggregation error in experiments. Our experiments
show that these methods can preserve privacy
without seriously sacrificing classification accuracy or increasing aggregation error.
The contributions of our paper are: (a)
We identify privacy issues in time series data
mining and propose a general model for protecting privacy in time series data mining. (b)
We propose a set of methods for preserving
privacy by adding noise. Their performance is
evaluated against real data sets. (c) We analyze
the effect of noise on preserving privacy and the
impact on data mining performance for striking
a balance between the two.
The rest of the paper is organized as follows. In Section 2, we discuss related work in
privacy preserving and time series data mining.
A general model for publishing and mining time
series data is proposed in Section 3, along with
discussion on its privacy concerns. Methods
for preserving privacy by adding noise are
proposed in Section 4. The effects of noise on
privacy preserving, classification accuracy,
and aggregation error are studied in Section
5. Related issues are discussed in Section 6.
Section 7 concludes the study and gives a few
future research directions.
2. RELATED WORK
Privacy Preserving Data Mining
To preserve privacy in data mining, researchers have proposed many approaches which
can be categorized into two main groups: data
perturbation and data partitioning.
In data perturbation approaches, data is
modified by adding noise, aggregation, suppression, transformation, and so on. Data
mining is performed on modified data instead
of original data to preserve privacy. Random
noise is added to preserve privacy in decision
tree construction (Agrawal & Srikant, 2000)
and association rules (Evfimievski, Srikant,
Agrawal, & Gehrke, 2002). The effects of
random noise on privacy preserving and data
mining performance are studied in Du and
Zhan (2003), as well as effective approaches to
randomization (Huang, Du, & Chen, 2005;
Zhu & Liu, 2004). Generalization is proposed
to achieve k-anonymity where each record is
identical to at least k-1 other records in the
data set (Bayardo & Agrawal, 2005; LeFevre,
DeWitt, & Ramakrishnan, 2006; Iyengar, 2002).
Anonymization for classification is studied in
Fung and Wang (2007). A privacy-preserving
protocol for computing aggregation queries
using randomized algorithms is proposed in
She, Want, Fu, and Yabo (2008).
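The k-anonymity condition mentioned above can be checked mechanically. The sketch below is a minimal illustration, assuming that records are dictionaries and that "identical" is judged on a chosen set of attributes (commonly called quasi-identifiers, a term not used in this paper). The helper names, the age-banding rule, and the sample records are hypothetical and are not taken from the cited work.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())


def generalize_age(record, width=10):
    """Generalize an exact age into a band such as '20-29' (assumed rule)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Hypothetical records, invented for illustration only.
records = [
    {"age": 23, "zip": "441", "diagnosis": "flu"},
    {"age": 27, "zip": "441", "diagnosis": "cold"},
    {"age": 34, "zip": "442", "diagnosis": "flu"},
    {"age": 38, "zip": "442", "diagnosis": "asthma"},
]

generalized = [generalize_age(r) for r in records]

print(is_k_anonymous(records, ["age", "zip"], k=2))      # False: every age is unique
print(is_k_anonymous(generalized, ["age", "zip"], k=2))  # True: two records per group
```

Generalization makes records indistinguishable on the chosen attributes at the cost of precision, which is the same privacy and utility trade-off that noise-based methods face.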
In data partitioning approaches, data is
distributed among multiple parties. To preserve
privacy, the parties do not share their data, but
cooperate to find global patterns. In most cases,
secure multi-party computation (Du & Atallah,
2001; Yildizli, Pedersen, Saygin, Savas, & Levi,
2011) is employed, and many approaches also use
encryption. Secure multi-party computation has been
introduced for building decision trees (Lindell &
Pinkas, 2000), mining association rules (Vaidya
& Clifton, 2002; Kantarcioglu & Clifton, 2004),
clustering with k-means (Vaidya & Clifton,
2003; Jagannathan & Wright, 2005), and