Privacy-Preserving Data Mining
Rebecca Wright
Computer Science Department
Stevens Institute of Technology
www.cs.stevens.edu/~rwright
PORTIA Site Visit
12 May, 2005
The Data Revolution
• The current data revolution is fueled by the perceived,
actual, and potential usefulness of the data.
• Most electronic and physical activities leave some kind
of data trail. These trails can provide useful information
to various parties.
• However, there are also concerns about appropriate
handling and use of sensitive information.
• Privacy-preserving methods of data handling seek to
provide sufficient privacy as well as sufficient utility.
Advantages of Privacy Protection
• protection of personal information
• protection of proprietary or sensitive information
• enables collaboration between different data
owners (since they may be more willing or able to
collaborate if they need not reveal their
information)
• compliance with legislative policies
Overview
• Introduction
• Primitives
• Higher-level protocols
– Distributed data mining
– Publishable data
– Coping with massiveness
– Beyond privacy-preserving data mining
• Implementation and experimentation
• Lessons learned, conclusions
Models for Distributed Data Mining, I
• Horizontally Partitioned: parties P1, P2, P3 each hold the same attributes (columns) for a disjoint subset of the records (rows).
• Vertically Partitioned: parties P1 and P2 each hold a disjoint subset of the attributes for the same set of records.
(A toy illustration of both splits appears below.)
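A minimal sketch in Python of how the same table is split under the two models; the table contents and field layout are hypothetical:

```python
# Hypothetical toy table: each record is (id, age, zip, diagnosis).
records = [
    (1, 34, "07030", "flu"),
    (2, 51, "07030", "cold"),
    (3, 29, "10027", "flu"),
    (4, 62, "10027", "cold"),
]

# Horizontal partitioning: each party holds all attributes
# for a disjoint subset of the records.
p1_rows = records[:2]   # P1 holds records 1-2
p2_rows = records[2:]   # P2 holds records 3-4

# Vertical partitioning: each party holds a disjoint subset of the
# attributes for every record, linked by a shared identifier.
p1_cols = [(rid, age, zip_) for rid, age, zip_, _ in records]  # id, age, zip
p2_cols = [(rid, diag) for rid, _, _, diag in records]         # id, diagnosis
```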
Models for Distributed Data Mining, II
• Fully Distributed: each of the parties P1, P2, …, Pn-1, Pn holds its own database (in the extreme, its own single record). [Diagram: n peer parties, each holding its own data]
• Client/Server(s): one or more servers hold the data; the client wishes to compute on the servers' data. [Diagram: CLIENT querying SERVER(S)]
Cryptography vs. Randomization
[Diagram: the tradeoff between the two approaches]
• Cryptographic approach: strong privacy and exact results, but suffers from inefficiency.
• Randomization approach: efficient, but incurs some privacy loss and some inaccuracy.
(A concrete sketch of the randomization approach follows.)
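As a concrete instance of the randomization approach, a minimal sketch of classic randomized response (parameter names are illustrative, and this is not a protocol from the talk): each respondent flips their sensitive bit with known probability, and the analyst inverts the expected bias, trading some accuracy for privacy.

```python
import random

def randomize(bit, p_truth=0.75):
    """Report the true bit with probability p_truth, else its flip."""
    return bit if random.random() < p_truth else 1 - bit

def estimate_frequency(reports, p_truth=0.75):
    """Unbiased estimate of the true fraction of 1s from noisy reports.
    E[report] = p*f + (1-p)*(1-f), so f = (mean - (1-p)) / (2p - 1)."""
    mean = sum(reports) / len(reports)
    return (mean - (1 - p_truth)) / (2 * p_truth - 1)

# Example: 10,000 respondents, true frequency 0.3.
truth = [1 if random.random() < 0.3 else 0 for _ in range(10_000)]
reports = [randomize(b) for b in truth]
print(estimate_frequency(reports))  # close to 0.3, with some inaccuracy
```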
Secure Multiparty Computation
• Allows n players to privately compute a function f of
their inputs.
[Diagram: parties P1, P2, …, Pn jointly computing f]
• Overhead is polynomial in size of inputs and complexity
of f [Yao86, GMW87, BGW88, CCD88, ...]
• In theory, can solve any private distributed data mining
problem. In practice, not efficient for large data.
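A toy instance of the paradigm for the simplest possible f, a sum: a minimal additive-masking sketch (illustrative only, secure against a single honest-but-curious party, not the general circuit-based constructions cited above).

```python
import random

MOD = 2**64  # arithmetic modulo a large number: partial sums look random

def secure_sum(inputs):
    """Simulates the ring-based secure-sum protocol in one process:
    party 1 adds a random mask, each party adds its own input to the
    running total it receives, and party 1 removes the mask at the end.
    No party ever sees an unmasked partial sum."""
    mask = random.randrange(MOD)        # party 1's secret mask
    running = mask
    for x in inputs:                    # each party's local addition step
        running = (running + x) % MOD
    return (running - mask) % MOD       # party 1 unmasks the total

print(secure_sum([10, 20, 30]))  # 60, with no single input revealed
```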
Primitives for PPDM
• Common tools include secret sharing,
homomorphic encryption, secure scalar product,
secure set intersection, secure sums, and other
statistics.
• PORTIA work:
– [BGN05]: homomorphic encryption of 2-DNF
formulas (arbitrary additions, one multiplication),
based on bilinear maps. (P)
– [AMP04]: Medians, kth ranked element. (P)
– [FNP04]: set intersection and cardinality.
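A minimal sketch of the first primitive on the list, additive secret sharing (the field size and function names are illustrative): a value is split into random shares that individually reveal nothing, and shares of different secrets can be added locally, which is how intermediate values stay protected in the protocols below.

```python
import random

P = 2**61 - 1  # a Mersenne prime; any sufficiently large prime works

def share(secret, n):
    """Split `secret` into n additive shares modulo P."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

s = share(42, 3)
print(reconstruct(s))  # 42
# Shares are additively homomorphic: sharewise sums reconstruct to the sum.
t = share(100, 3)
print(reconstruct([(a + b) % P for a, b in zip(s, t)]))  # 142
```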
Higher-Level Protocols
• [LP00]: private protocol for ln x and x ln x
• Various protocols to search remote, encrypted, or
access controlled data (e.g. for keywords, items in
common): [BBA03(P), Goh03, FNP04, BCOP04,
ABG+05(P), EFH#]
• [YZW05]: frequency mining protocol. (P)
Data Mining Models
• [WY04,YW05]: privacy-preserving construction of
Bayesian networks from vertically partitioned data.
• [YZW05]: classification from frequency mining in fully
distributed model (naïve Bayes classification, decision
trees, and association rule mining). (P)
• [JW#]: privacy-preserving k-means clustering for
arbitrarily partitioned data. (In vertically partitioned
case, similar to two-party [VC03].)
• [AST05]: privacy-preserving computation of
multidimensional aggregates on vertically or horizontally
partitioned data using randomization.
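A minimal sketch of the idea behind the [YZW05]-style pipeline in the naïve Bayes case: once per-class frequencies are available (computed here in the clear for illustration; the point of the protocol is to obtain them privately), classification is plain arithmetic on counts. All names and the smoothing choice are illustrative.

```python
from collections import Counter

# Toy labeled data: (feature tuple, class label); illustrative only.
data = [((1, 0), "spam"), ((1, 1), "spam"), ((0, 0), "ham"), ((0, 1), "ham")]

def train_counts(data):
    """Frequencies that a private frequency-mining protocol would supply."""
    class_counts = Counter(label for _, label in data)
    feat_counts = Counter()
    for features, label in data:
        for j, v in enumerate(features):
            feat_counts[(label, j, v)] += 1
    return class_counts, feat_counts

def classify(x, class_counts, feat_counts):
    """Naive Bayes with add-one smoothing, using only the counts."""
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, nc in class_counts.items():
        score = nc / total
        for j, v in enumerate(x):
            score *= (feat_counts[(c, j, v)] + 1) / (nc + 2)
        if score > best_score:
            best, best_score = c, score
    return best

cc, fc = train_counts(data)
print(classify((1, 0), cc, fc))  # "spam"
```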
Privacy-Preserving Bayes Networks [WY04,YW05]
Goal: Cooperatively learn the Bayesian network structure on the combination of DB_A and DB_B, ideally without either party learning anything except the Bayesian network itself.
[Diagram: Alice holds database DB_A; Bob holds database DB_B]
K2 Algorithm for BN Learning
• Determining the best BN structure for a given data set is NP-hard,
so heuristics are used in practice.
• The K2 algorithm [CH92] is a widely used BN structure-learning
algorithm, which we use as the starting point for our solution.
• Considers nodes in sequence; adds the new parent that most increases a score function f, up to a maximum number of parents per node.
• For binary variables, the score is a product over the instantiations of the candidate parent set π(i), where α₀ and α₁ count the records in which node i is 0 or 1 under a given instantiation:

f(i, π(i)) = ∏ α₀! α₁! / (α₀ + α₁ + 1)!
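A minimal sketch of evaluating this score in the clear (the private computation of it is the paper's contribution): working in log space via lgamma avoids the enormous factorials.

```python
from math import lgamma

def log_k2_term(alpha0, alpha1):
    """ln of one factor: ln( a0! a1! / (a0 + a1 + 1)! ).
    lgamma(n + 1) == ln(n!), so no huge intermediate factorials."""
    return (lgamma(alpha0 + 1) + lgamma(alpha1 + 1)
            - lgamma(alpha0 + alpha1 + 2))

def log_f(configs):
    """ln f: sum over parent-set instantiations of (a0, a1) counts."""
    return sum(log_k2_term(a0, a1) for a0, a1 in configs)

# Example: two parent instantiations with counts (3, 5) and (7, 2).
print(log_f([(3, 5), (7, 2)]))
```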
Our Solution: Approximate Score
Modified score function: approximates the same relative
ordering, and lends itself well to private computation.
• Apply natural log to f and use Stirling’s approximation
• Drop constant factor and bounded term. Result is:
g(i, π(i)) = ∑ ( ½ (ln α₀ + ln α₁ − ln t) + (α₀ ln α₀ + α₁ ln α₁ − t ln t) )

where t = α₀ + α₁ + 1
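A quick numerical sanity check (illustrative, not part of the protocol) that the approximation preserves the exact score's relative ordering on some sample counts:

```python
from math import lgamma, log

def exact(a0, a1):
    """ln( a0! a1! / (a0 + a1 + 1)! ) via lgamma."""
    return lgamma(a0 + 1) + lgamma(a1 + 1) - lgamma(a0 + a1 + 2)

def approx(a0, a1):
    """Stirling-based g term; requires a0, a1 >= 1."""
    t = a0 + a1 + 1
    return (0.5 * (log(a0) + log(a1) - log(t))
            + a0 * log(a0) + a1 * log(a1) - t * log(t))

pairs = [(3, 5), (7, 2), (10, 10), (1, 20)]
print(sorted(pairs, key=lambda p: exact(*p))
      == sorted(pairs, key=lambda p: approx(*p)))  # True
```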
Our Solution: Components
Sub-protocols used:
• Privacy-preserving scalar product protocol: based on
homomorphic encryption
• Privacy-preserving computation of α-parameters: uses
scalar product
• Privacy-preserving score computation: uses α-parameters and the [LP00] protocols for ln x and x ln x
• Privacy-preserving score comparison: uses [Yao86]
All intermediate values (scores and parameters) are
protected using secret sharing. [YW05] improves on
[MSK04] for parameter computation.
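A minimal sketch of a homomorphic-encryption-based scalar product in the two-party setting, using the third-party python-paillier (phe) package; the masking detail and variable names are illustrative simplifications, not the exact protocol of [WY04]. Bob exploits the additive homomorphism to return an encryption of the dot product plus a random mask, so the two parties end up with additive secret shares of the result, matching the secret-shared intermediate values described above.

```python
# pip install phe  (python-paillier: additively homomorphic encryption)
import random
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

alice_vec = [1, 0, 1, 1]   # Alice's private bit vector
bob_vec   = [5, 3, 2, 7]   # Bob's private vector

# Alice -> Bob: encryptions of her entries under her public key.
enc_alice = [public_key.encrypt(a) for a in alice_vec]

# Bob: homomorphically compute E(a.b + r) without seeing Alice's entries.
r = random.randrange(2**32)              # Bob's random mask
enc_result = public_key.encrypt(r)
for e, b in zip(enc_alice, bob_vec):
    enc_result = enc_result + e * b      # E(x) * b == E(x * b)

# Bob -> Alice: the masked ciphertext. The parties now hold additive
# shares of the scalar product: Alice has a.b + r, Bob has -r.
alice_share = private_key.decrypt(enc_result)
bob_share = -r
print(alice_share + bob_share)  # 1*5 + 0*3 + 1*2 + 1*7 = 14
```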
Overview
• Introduction
• Primitives
• Higher-level protocols
– Distributed data mining
– Publishable data
– Coping with massiveness
– Beyond privacy-preserving data mining
• Implementation and experimentation
• Lessons learned, conclusions
Publishable Data
• Goal: Modify data before publishing so that results have
good privacy and good utility.
– Some situations favor one more than the other.
– May prevent some things from being learned at all.
• [DN04]: Extends privacy definitions of [EGS03,DN03]
relating a priori and a posteriori knowledge, and provides
solutions in a moderated publishing model.
• [CDMSW04]: provide quantifiable definitions of privacy
and utility. One’s privacy is guaranteed to the extent
that one does not stand out from others.
Publishable Data: k-Anonymity
• Modify the database before publishing so that the quasi-identifier of every record is identical to that of at least k − 1 other records [Swe02, MW04].
• [AFK+05]: optimal k-anonymization is NP-hard even if
the data values are ternary. Presents efficient
approximation algorithms for k-anonymization. (P)
• [ZYW05]: in two formulations, present solutions for a
data publisher to learn a k-anonymized version of a fully
distributed database without learning anything else. (P)
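Operationally, the k-anonymity condition is easy to state in code (column names hypothetical): group the published records by quasi-identifier and require every group to have size at least k.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifier, k):
    """True iff every quasi-identifier combination occurs >= k times."""
    groups = Counter(tuple(r[col] for col in quasi_identifier)
                     for r in records)
    return all(count >= k for count in groups.values())

published = [
    {"age": "30-39", "zip": "070**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "070**", "diagnosis": "cold"},
    {"age": "60-69", "zip": "100**", "diagnosis": "flu"},
]
print(is_k_anonymous(published, ("age", "zip"), 2))  # False: one lone group
```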
Coping with Massiveness
• Data mining on massive data sets is an important field
in its own right.
• It is also privacy-relevant, because:
– Massive data sets are likely to be distributed and multiply
owned.
– Efficiency improvements are needed in order to have any
hope of adding overhead for privacy.
• [FKMSZ05]: Stream algorithms for massive graphs (P)
• [DKM04]: Approximate massive-matrix computations
(P)
Beyond Privacy-Preserving Data Mining
Enforce policies about what kind of queries or
computations on data are allowed.
• [JW#]: Extends private inference control of [WS04] to
work with more complex query functions. Client learns
query result if and only if inference rule is met (and
learns nothing else).
• [KMN05]: Simulatable auditing to ensure that query
denials do not leak information. (P)
• [ABG+04]: P4P: Paranoid Platform for Privacy
Preferences. Mechanism for ensuring released data is
usable only for allowed tasks. (P)
Overview
• Introduction
• Primitives
• Higher-level protocols
– Distributed data mining
– Publishable data
– Coping with massiveness
– Beyond privacy-preserving data mining
• Implementation and experimentation
• Lessons learned, conclusions
Implementation and Experimentation
• secure scalar product protocol [SWY04]
• MySQL private information retrieval (PIR) [BBFS#]
• Fairplay: a system implementing Yao's two-party secure
function evaluation [MNPS04]
• Bayesian network implementation [KRFW#] (D)
• secure computation of surveys using Fairplay and use for
Taulbee survey [FPRS04] (P,D)
Survey Software [FPRS04] (P,D)
• User-friendly, open-source, free implementation using Fairplay
[MNPS04], suitable for use with CRA’s Taulbee salary survey.
Not adopted.
• CRA’s reasons:
– Need for data cleaning, multiyear comparisons, unanticipated use
– “Perhaps most member departments will trust us.”
• Provost Offices’ reasons:
– No legal basis for using this privacy-preserving protocol on data that
we otherwise don’t disclose
– Correctness and security claims are hard and expensive to assess,
despite open-source implementation.
– All-or-none adoption by Ivy+ peer group. Can’t make decision
unilaterally.
Future Directions in Experimentation
• Combine these and others into a general-purpose privacy-preserving
data mining experimental platform. Useful for:
– fast prototyping of new protocols
– efficiency, accuracy comparisons of different approaches
• Experiment with real data and real uses.
– need to find a user community that has explicitly expressed
interest, and that could potentially accomplish something via
PPDM that it currently cannot accomplish.
– [Scha04]: genetics researchers may form such a community
Other Future Directions
• Preprocessing of data for PPDM.
• Privacy-preserving data solutions that use both
randomization and cryptography in order to gain
some of the advantages of both.
• Policies for privacy-preserving data mining:
languages, reconciliation, and enforcement.
• Incentive-compatible privacy-preserving data
mining.
Conclusions
• Increasing use of computers and networks has led to a
proliferation of sensitive data.
• Without proper precautions, this data could be misused.
• Many technologies exist for supporting proper data handling,
but much work remains, and some barriers must be overcome
in order for them to be deployed.
• Cryptography is a useful component, but not the whole
solution.
• Technology, policy, and education must work together.