On Privacy, Data Mining Technology and Human Rights
© Dr. Ramon C. Barquin
Conference on Data Mining and Human Rights in the Fight Against Terrorism
Universitat Zurich, Zurich, Switzerland
10-11 June 2010
AGENDA
• Introduction: Framing the Issue
• Data Mining and the U.S. Government
• What is Data Mining?
• Informational Privacy: Basic Issues
• What is Privacy Preserving Data Mining?
• The Role of Trust in Data Collection
• Concerns as we enter the future
• Terrorism and Human Rights
• The Role of Ethics
• Conclusion
Data, data everywhere…
• Number of documents on the web: over 1 trillion?
• Total new information in 2003: ≈ 5 exabytes
  - Equivalent to half a million new Libraries of Congress
  - Enough to capture every word ever spoken by all humans
• 161 exabytes produced in 2006
  - 3 million times the storage of all books written
  - 12 stacks of books from the Earth to the Sun
Source: UC Berkeley study, 2003, and IDC study, 2007
And many good reasons to extract as much knowledge as possible from that data
• Data mining
• Health care
• Law enforcement
• Education
• Logistics
• Customer service
Investigation at Stillwater State Correctional Facility, Minnesota
• Data mining software was applied to phone records from the prison
• A pattern linking calls between prisoners and a recent parolee was discovered
• The calling data was then mined again together with records of prisoners’ financial accounts
• The result: a large drug smuggling ring was discovered
Source: Yehuda Lindell, Bar-Ilan University
New York Times: “Reaping Results: Data-Mining Goes Mainstream,” by Steve Lohr, May 20, 2007
“The technology, for example, pointed to a high rate of robberies on paydays in Hispanic neighborhoods [in Richmond], where fewer people use banks and where customers leaving check-cashing stores were easy targets for robbers. Elsewhere, there were clusters of random-gunfire incidents at certain times of night. So extra police were deployed in those areas when crimes were predicted.”
But nothing is perfect…
“No. Now all our pillaging is done electronically from a centralized office.”
The Headlines
• “TSA viewed as 'profiling' in new screening program,” National Journal, 5/21/10
• “Web Start-Ups Offer Bargains for Users’ Data,” by S. Clifford, NY Times, 5/30/10
• “Shoppers Who Can’t Have Secrets,” by N. Singer, NY Times, 4/30/10
• “Review of Terrorism Database Finds Flaws,” by M. Sherman, Washington Post, 6/14/05
• “How Privacy Vanishes Online,” by S. Lohr, NY Times, 3/16/10
• “Internet censorship proves counterproductive in curtailing terrorist recruitment,” Jill R. Aitoro, National Journal, 05/26/10
The Data Revolution
• The current data revolution is fueled by the perceived, actual, and potential usefulness of the data.
• Most electronic and physical activities leave some kind of data trail. These trails can provide useful information to various parties.
• However, there are also concerns about appropriate handling and use of sensitive information.
• Privacy-preserving methods of data handling seek to provide sufficient privacy as well as sufficient utility.
Source: Rebecca Wright, Stevens Institute of Technology
Framing the Issue
• Where would we be without data?
• Data as a double-edged sword
• The post-9/11 environment
• Focus on government data collections
“No, I’m not backing up my files… I’m just assuming the FBI is making copies.”
Data Mining and the U.S. Government
Federal Agency Data Mining Reporting Act of 2007 – U.S. federal law
Privacy Compliance Documents
• Privacy Threshold Analysis (PTA): Identifies whether a system, program, or project is a Privacy Sensitive System (i.e., a system that collects or maintains PII or otherwise impacts privacy) and determines whether a Privacy Impact Assessment (PIA) or System of Records Notice (SORN) is required.
• Privacy Impact Assessment (PIA): Method by which federal agencies review system management activities in key areas such as security and how/when information is collected, used, and shared. The PIA determines whether an existing SORN appropriately covers the activity or a new SORN is required.
• System of Records Notice (SORN): Provides notice to the public regarding Privacy Act information collected by a system of records, as well as insight into how information is used, retained, and may be corrected.
Source: Federal Agency Data Mining Reporting Act, 42 U.S.C. § 2000ee-3(b)(1)
Department of Homeland Security
• Automated Targeting System (ATS)
• Data Analysis and Research for Trade Transparency System (DARTTS)
• Freight Assessment System (FAS)
Source: 2009 Data Mining Report to Congress, Department of Homeland Security, December 2009
But what exactly is data mining?
The morass of definitions…
Data Mining
A program involving pattern-based queries, searches, or other analyses of one or more electronic databases, where—
(A) a department or agency of the Federal Government, or a non-Federal entity acting on behalf of the Federal Government, is conducting the queries, searches, or other analyses to discover or locate a predictive pattern or anomaly indicative of terrorist or criminal activity on the part of any individual or individuals;
(B) the queries, searches, or other analyses are not subject-based and do not use personal identifiers of a specific individual, or inputs associated with a specific individual or group of individuals, to retrieve information from the database or databases; and
(C) the purpose of the queries, searches, or other analyses is not solely—
  (i) the detection of fraud, waste, or abuse in a Government agency or program; or
  (ii) the security of a Government computer system.
Source: Federal Agency Data Mining Reporting Act, 42 U.S.C. § 2000ee-3(b)(1)
Data Mining/Knowledge Discovery
• Process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions. (Two Crows Corp.)
• Analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. (Hand, Mannila, Smyth)
• Non-trivial extraction of implicit, previously unknown, and potentially useful information from large data sets or databases. [W. Frawley, G. Piatetsky-Shapiro and C. Matheus, 1992]
• Use of information technology to attempt to derive useful knowledge from (usually) very large data sets. (DETECTER, Work Package # 6)
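The definitions above share one idea: discovering unsuspected patterns in large data sets. A minimal, hypothetical sketch of the simplest kind of pattern discovery, counting item pairs that co-occur frequently across records, in the spirit of the prison call-record example earlier (the data and the function name are invented for illustration):

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Count co-occurring item pairs and keep those meeting min_support.

    A toy illustration of pattern discovery: the 'pattern' here is
    simply a pair of items that appears together in many records.
    """
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical call records: each set is the parties on one call chain.
calls = [
    {"inmate_A", "parolee_X"},
    {"inmate_A", "parolee_X", "inmate_B"},
    {"inmate_B", "parolee_X"},
    {"inmate_C", "lawyer"},
]
print(frequent_pairs(calls, min_support=2))
```

With these records, only the pairs linking inmates to the parolee cross the support threshold; everything else is noise, which is the essence of "non-trivial extraction" in the Frawley definition.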
List of Data Mining Systems (2)
• BioSense – DHHS/CDC
• Foreign Terrorist Tracking Task Force Activity – FBI
• NETLEADS – DHS/ICE & CBP
• ICE Pattern Analysis and Information Collection System (ICEPIC) – DHS/ICE
• Intelligence and Information Fusion (I2F) – DHS/OIA
• ProActive Intelligence (PAINT) – DHS/OIA
• Knowledge Discovery and Dissemination – IARPA
• Video Analysis and Content Extraction (VACE) – IARPA
• Rapid Knowledge Formulation – DARPA
• Analysis, Dissemination, Visualization, Insight and Semantic Enhancement (ADVISE) – DHS
• Able Danger – Army
• Threat and Local Observation Notice (TALON) – DOD
• TIDE (Datamart) – DHS
• FBI Intelligence Community Data Marts – FBI
• Investigative Data Warehouse (IDW) – FBI
Source: DETECTER, Work Package No. 6
List of Data Mining Systems (1)
• Computer Assisted Passenger Pre-Screening System II (CAPPS II) – DHS/TSA
• Secure Flight – DHS/TSA
• Automated Targeting System (ATS) – DHS
• Total Information Awareness/Terrorist Information Awareness (TIA) – DARPA
• Multi-State Anti Terrorism Information Exchange (MATRIX) – Multi-State Consortium
• Novel Intelligence From Massive Data (NIMD) – NSA
• Analyst Notebook I2 – DHS
• Secure Collaborative Operational Prototype Environment (SCOPE) – FBI
• Insight Smart Discovery – DIA
• Verity K2 Enterprise – DIA
• PATHFINDER – DIA
• Autonomy – DIA
• Counterintelligence Automated Investigative Management System (CI-AIMS) – DOE
• Autonomy – DOE
• Counterintelligence Analytical Research Data System (CARDS) – DOE
Source: DETECTER, Work Package No. 6
European/International Data Mining Efforts
• CAHORS – NATO
• Creation of European Terrorist Profiles
• European Passenger Name Records System
• European Security Research
• Terrorist Rasterfahndung – Bundeskriminalamt
Source: DETECTER, Work Package No. 6
For observation
• REVEAL (US)
• SCION (US)
• National Security Branch Analysis Center (US)
• Guardian (US)
• Eurodac (EU)
• Schengen Information System II (EU)
• Europol Information System (EU)
• Visa Information System (EU)
• EDVIGE/EDVIPR (FR)
• CHRISTINA (FR)
• Project Rich Picture (UK)
• National Public Order Intelligence Unit Database (UK)
Source: DETECTER, Work Package No. 6
So how do we have our cake and eat it too?
• It starts with the need to protect our data
  - Security
  - Data integrity
  - Privacy protection
• And progresses to bigger and better things
  - Privacy-Preserving Data Mining (PPDM)
Information Privacy
Degrees of Touchiness
As the type of personal information grows more intimate, the percentage of people who want to keep it at home rises:
• Basic personal information (name, address, phone number): 42%
• Social Security number or driver’s license number: 51%
• Major purchases: 56%
• Internet behavior: 62%
• Employee records: 64%
• Credit or debit card number: 69%
• Banking or home mortgage records: 74%
• Patient health records: 83%
Source: CIO Magazine, July 15, 2006; Ponemon Institute
How should this information be protected?
• Medical information
• Financial information
• Credit scores
• Criminal record
• School grades
• Job performance ratings
Information Privacy
• Concept applied to the collection, use and maintenance of personal information with the advent of database technology
• Central component is the power of the individual to control the use of sensitive information
• The issue of identity
  - Sensitivity
  - Complexity
Benefits of Data Access
• Reinforcement of open scientific inquiry
• Verification, refutation, or refinement of original results
• Promotion of new research through existing data; improvements of measurements and data collection methods and analytic techniques
• A climate in which scientific research confronts decision making
Source: Committee on National Statistics, National Research Council
Issues
• The data
  - Mandatory vs. voluntary
  - Sensitive vs. non-sensitive
  - Anonymous vs. named
  - Individually identifiable vs. non-identifiable
  - The question of identity
• The usage
  - Administrative
  - Statistical
Fair Information Practice Principles (FIPPs)
• Transparency
• Individual Participation
• Purpose Specification
• Data Minimization
• Use Limitation
• Data Quality and Integrity
• Security
• Accountability and Auditing
Start with the basics…
1. Generalization
2. De-identification and re-identification
3. “Anonymization”
4. Cryptography
But the problem of privacy breaches is big…
“In fact, 87% of the population of the United States is uniquely identified by date of birth (e.g., month, day and year), gender, and their 5-digit ZIP codes. The point is that data that may look anonymous is not necessarily anonymous.”
Source: Latanya Sweeney at meeting of the Department of Homeland Security DPIAC.
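Sweeney’s point can be checked on any data set by counting how many records are unique on their quasi-identifiers. A hypothetical sketch (the records and the function name are invented for illustration):

```python
from collections import Counter

def unique_fraction(records, quasi_ids):
    """Fraction of records whose quasi-identifier combination is unique,
    i.e., records a linkage attack could single out."""
    combos = Counter(tuple(r[k] for k in quasi_ids) for r in records)
    return sum(n for n in combos.values() if n == 1) / len(records)

# "Anonymous" records: no names, yet some rows are still unique.
records = [
    {"dob": "1970-01-01", "sex": "F", "zip": "20036"},
    {"dob": "1970-01-01", "sex": "F", "zip": "20036"},
    {"dob": "1980-05-05", "sex": "M", "zip": "20036"},
    {"dob": "1990-09-09", "sex": "M", "zip": "10001"},
]
print(unique_fraction(records, ["dob", "sex", "zip"]))
```

Here half the rows are uniquely re-identifiable from date of birth, gender, and ZIP alone, despite containing no name at all.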
Hence, Privacy-Preserving Data Mining (PPDM)
Privacy-Preserving Data Mining, or PPDM, is a research area concerned with protecting the privacy of “personally identifiable information” when it is used for data mining. (Wikipedia)
Two major approaches to PPDM
• Randomization approach
  - Application: Web demographics
• Cryptographic approach
  - Application: Inter-enterprise data mining
Sources: Ramakrishnan Srikant, IBM; Rebecca Wright, Stevens Institute of Technology
Major Approaches to Privacy-Preserving Data Mining
• The Randomization Method
• Group Based Anonymization
• Distributed Privacy-Preserving Data Mining
• Privacy-Preservation of Application Results
Source: Charu C. Aggarwal (IBM), Philip S. Yu (University of Illinois), Privacy-Preserving Data Mining: Models and Algorithms
The Randomization Method
• Privacy Quantification
• Adversarial Attacks on Randomization
• Randomization Methods for Data Streams
• Multiplicative Perturbations
• Data Swapping
Source: Charu C. Aggarwal (IBM), Philip S. Yu (University of Illinois), Privacy-Preserving Data Mining: Models and Algorithms
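Of the techniques listed, data swapping is perhaps the simplest to illustrate: exchange the sensitive value between randomly chosen records, so aggregate statistics survive exactly while the link between individual and value is perturbed. A hedged sketch (hypothetical records and parameter names, not code from the book):

```python
import random

def swap_sensitive(records, field, rate, rng):
    """Randomly swap a sensitive field among a subset of records.

    The marginal distribution of `field` is preserved exactly (the
    swapped values form a permutation), while the association between
    any individual record and its value is weakened.
    """
    out = [dict(r) for r in records]          # leave input untouched
    idx = [i for i in range(len(out)) if rng.random() < rate]
    shuffled = idx[:]
    rng.shuffle(shuffled)
    for i, j in zip(idx, shuffled):
        out[i][field] = records[j][field]
    return out

people = [{"name": n, "salary": s}
          for n, s in [("a", 1), ("b", 2), ("c", 3), ("d", 4)]]
swapped = swap_sensitive(people, "salary", rate=0.5, rng=random.Random(7))
```

Because the values are only permuted, any statistic computed over the `salary` column alone (mean, histogram, quantiles) is unchanged; only record-level linkage degrades.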
Group Based Anonymization
• The k-Anonymity Framework
• Personalized Privacy-Preservation
• Utility Based Privacy Preservation
• Sequential Releases
• The l-diversity Method
• The t-closeness Model
• Models for Text, Binary and String Data
Source: Charu C. Aggarwal (IBM), Philip S. Yu (University of Illinois), Privacy-Preserving Data Mining: Models and Algorithms
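The k-anonymity framework can be sketched in a few lines: generalize the quasi-identifiers until every combination occurs at least k times, so no record stands out. The generalization steps here (truncating ZIP codes, banding ages by decade) are invented illustrative choices, not the book’s algorithm:

```python
from collections import Counter

def generalize_to_k(records, k):
    """Truncate ZIP codes one digit at a time until every
    (ZIP-prefix, age-band) group contains at least k records."""
    zips = [r["zip"] for r in records]
    bands = [r["age"] // 10 * 10 for r in records]   # 23 -> 20, 45 -> 40
    for cut in range(5, -1, -1):                     # try longest prefix first
        groups = Counter((z[:cut], b) for z, b in zip(zips, bands))
        if min(groups.values()) >= k:
            return [{"zip": z[:cut], "age_band": b}
                    for z, b in zip(zips, bands)]
    return None  # k unreachable even when fully generalized

records = [{"zip": "20036", "age": 23}, {"zip": "20037", "age": 27},
           {"zip": "20045", "age": 45}, {"zip": "20046", "age": 41}]
anonymized = generalize_to_k(records, k=2)
```

With these four records, full 5-digit ZIPs make every row unique; truncating to a 4-digit prefix yields two groups of size 2, satisfying 2-anonymity. Note this guards only linkage via these quasi-identifiers; the l-diversity and t-closeness models above address the further leakage k-anonymity still permits.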
Distributed Privacy-Preserving Data Mining
• Distributed Algorithms over Horizontally Partitioned Data Sets
• Distributed Algorithms over Vertically Partitioned Data
• Distributed Algorithms for k-Anonymity
Source: Charu C. Aggarwal (IBM), Philip S. Yu (University of Illinois), Privacy-Preserving Data Mining: Models and Algorithms
Privacy-Preservation of Application Results
• Association Rule Hiding
• Downgrading Classifier Effectiveness
• Query Auditing and Inference Control
Source: Charu C. Aggarwal (IBM), Philip S. Yu (University of Illinois), Privacy-Preserving Data Mining: Models and Algorithms
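Query auditing and inference control can be as simple as refusing to answer counting queries over too-small groups, since a count of one pinpoints an individual. A minimal sketch (the threshold of 5 and the data are arbitrary illustrative choices, not a complete auditing scheme; real auditors must also track combinations of answered queries):

```python
def audited_count(records, predicate, threshold=5):
    """Answer a counting query only when the matching group is large
    enough; otherwise refuse, to limit inference about individuals."""
    n = sum(1 for r in records if predicate(r))
    return n if n >= threshold else None

patients = [{"age": a} for a in [22, 25, 31, 44, 45, 46, 47, 48]]

print(audited_count(patients, lambda r: r["age"] > 40))   # answered
print(audited_count(patients, lambda r: r["age"] < 30))   # refused
```

The large cell (5 patients over 40) is released; the small cell (2 under 30) is suppressed because releasing it would narrow an attacker’s guess to two people.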
Example: Web Demographics
• The Volvo S40 website targets people in their 20s
  - Are visitors in their 20s or 40s?
  - Which demographic groups like/dislike the website?
Source: Ramakrishnan Srikant, IBM
Randomization Approach: Overview
[Diagram: each record (e.g., age 30 | salary 70K, or 50 | 40K) passes through a randomizer before release, emerging perturbed (e.g., 65 | 20K, 25 | 60K); from the randomized records the distributions of Age and Salary are reconstructed and fed to data mining algorithms to build a model.]
Source: Ramakrishnan Srikant, IBM
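The randomize-then-reconstruct flow can be illustrated with additive noise: each record releases its value plus independent uniform noise, and the analyst recovers aggregate statistics by subtracting the known noise moments. This is a simplified moment-based stand-in for the full distribution-reconstruction algorithms; all names, parameters, and data are invented for illustration:

```python
import random
import statistics

def randomize(values, spread, rng):
    """Each data owner adds independent uniform noise before release."""
    return [v + rng.uniform(-spread, spread) for v in values]

def reconstruct_mean_var(randomized, spread):
    """Recover aggregate statistics from the noisy release.

    Uniform(-spread, spread) noise has mean 0 and variance spread**2/3,
    so the analyst subtracts those known moments back out.
    """
    m = statistics.fmean(randomized)
    v = statistics.pvariance(randomized) - spread ** 2 / 3
    return m, v

rng = random.Random(0)
ages = [rng.gauss(35, 10) for _ in range(10_000)]     # true, private data
noisy = randomize(ages, spread=20, rng=rng)           # what is released
est_mean, est_var = reconstruct_mean_var(noisy, spread=20)
```

Any single noisy value says little about its owner (noise spread ±20 years dwarfs the signal), yet over 10,000 records the population mean and variance come back to within sampling error, which is exactly the privacy/utility trade the randomization approach is after.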
Reconstruction Problem
Many relevant algorithms:
• [WY04, YW05]: privacy-preserving construction of Bayesian networks from vertically partitioned data
• [YZW05]: classification from frequency mining in the fully distributed model (naïve Bayes classification, decision trees, and association rule mining)
• [JW#]: privacy-preserving k-means clustering for arbitrarily partitioned data
• [AST05]: privacy-preserving computation of multidimensional aggregates on vertically or horizontally partitioned data using randomization
Sources: Ramakrishnan Srikant, IBM; Rebecca Wright, Stevens Institute of Technology
Then attempt to reconstruct as accurately as possible…
[Chart: number of people (0–1200) by age (20–60), comparing the original, randomized, and reconstructed distributions.]
Source: Ramakrishnan Srikant, IBM
Suggested Architecture for
(Cryptographic) PPDM
Source: Privacy-Preserving Data Mining Systems, Nan Zhang and Wei Zhao, COMPUTER, April 2007.
Examples of Secure Computation Tasks
• Authentication protocols
• Online payments
• Auctions
• Elections
• Privacy-preserving data mining
Source: Yehuda Lindell, Bar-Ilan University
Secure Multiparty Computation
• A set of parties with private inputs
• Parties wish to jointly compute a function of their inputs so that certain security properties (like privacy and correctness) are preserved
  - E.g., secure elections, auctions, online payments
• Properties must be ensured even if some of the parties maliciously attack the protocol
Source: Yehuda Lindell, Bar-Ilan University
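A classic building block behind such protocols is an additive secret-sharing sum: each party splits its input into random shares, no single share reveals anything about the input, yet the shares jointly determine the total. A hedged sketch of the idea (not any specific deployed protocol; it assumes honest-but-curious parties and a trusted tally step):

```python
import random

def share(secret, n_parties, modulus, rng):
    """Split a secret into n additive shares: random values whose sum
    is the secret mod `modulus`. Any n-1 shares look uniformly random."""
    shares = [rng.randrange(modulus) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % modulus)
    return shares

def secure_sum(secrets, modulus=2**61 - 1, rng=None):
    """Each party shares its input among all parties; each party sums
    the shares it holds; the partial sums reveal only the grand total."""
    rng = rng or random.Random()
    n = len(secrets)
    all_shares = [share(s, n, modulus, rng) for s in secrets]
    # Party j holds the j-th share from every party (including itself).
    partial = [sum(row[j] for row in all_shares) % modulus
               for j in range(n)]
    return sum(partial) % modulus
```

This mirrors the ideal-model picture: the parties learn the output of the agreed function (here, the sum) and nothing else about each other's inputs, which is the "privacy" property; correctness holds because the shares of each secret sum back to it by construction.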
Secure Multiparty Computation: Ideal Model
[Diagram: two parties hold private inputs x and y; each sends its input to a trusted party, which computes and returns f1(x,y) to one party and f2(x,y) to the other.]
Source: Yehuda Lindell, Bar-Ilan University
So what is the role of trust?
Role of Trust in Data Collection
• Trust is a 3-part relation
  - Truster
  - Entrusted good
  - Trusted
• Three key principles: the truster must see the trusted as
  - Having goodwill
  - Encapsulating the interests of others
  - Competent to handle the entrusted good
Source: A. Baier, Trust and Antitrust
Role of Trust in Data Collection
[Diagram: the truster entrusts a good to the trusted; trust is the relation binding truster, entrusted good, and trusted.]
Role of Trust in Data Collection
T(A:B) = f(G, I, C)
where:
T(A:B) – trustworthiness of person A (trusted), as perceived by person B (truster), in relation to the entrusted good
G – goodwill of person A
I – degree to which A is able to represent the interests of B
C – competence of A in handling the entrusted good
Attributes of Goodwill
Persons or institutions can both be trustworthy.
Persons:
• Part of the trustworthy person’s motive in handling the entrusted good is to fulfill the truster’s interests
• In fulfilling the truster’s interests, the trustworthy person looks beyond the interests that would only benefit him or herself
• The trustworthy person has demonstrated that he or she cares about the management of others’ entrusted goods in the past
• The actions, motives, and interests, and the principles which guide the trustworthy person’s actions, are clear and easily understood
Attributes of Goodwill
Persons or institutions can both be trustworthy.
Institutions:
• The trustworthy institution’s goal is to uphold the interests of trusters even though it may not have an interest in doing so and even if doing so conflicts with certain interests of the institution
• The institution has clear policies which guide the behavior of its members and demonstrate that the goal of the institution is to preserve the public’s interests
• The institution’s implicit or explicit code of conduct demonstrates that the institution upholds the interests of the public
Encapsulating the interests of others
• Trustworthy persons
  - Can be relied upon with other people’s entrusted goods
  - Their interests are clearly observable and understood
  - How their interests lead to the fulfillment of the truster’s interests is clearly observable and understood
• Trustworthy institutions
  - Have a history of reliability in the management of the public’s entrusted goods
  - Their interests are clear, open and understood
  - How their interests lead to the fulfillment of the truster’s interests is also clear, open and understood
Competence
• Webster’s Dictionary:
  - Having requisite or adequate ability or qualities
  - Legally qualified
• What are the skills and tools of a competent data collector?
• When dealing with government-mandated collection of sensitive data where individuals are identifiable, do we need “legal qualification”?
• Competence of institutions: technology and policy
Concerns as we enter the future
If men were angels…
“If men were angels, no government would be necessary. If angels were to govern men, neither external nor internal controls on government would be necessary. In framing a government which is to be administered by men over men, the great difficulty lies in this: you must first enable the government to control the governed; and in the next place oblige it to control itself.”
James Madison, Federalist #51
Categorization of Surveillance
• Surveillance
• Dataveillance
• Überveillance
Source: Roger Clarke, University of New South Wales/Australian National University
Surveillance and Dataveillance
…and then there is Überveillance
Categorization of Surveillance
• Of what?
• For whom?
• By whom?
• Why?
• How?
• Where?
• When?
Source: Roger Clarke, University of New South Wales/Australian National University
Überveillance
“an above and beyond omnipresent 24/7 surveillance where the explicit concerns for misinformation, misinterpretation and information manipulation, are ever more multiplied and where potentially the technology is embedded in our bodies.”
Michael and Katina Michael, University of Wollongong
Principles necessary to consider when dealing with issues having to do with human dignity, the right to the integrity of the person and the protection of personal data
• Precautionary principle
• Purpose specification principle
• Data minimization principle
• Proportionality principle
• Integrity and inviolability of the body principle
• Dignity principle
Source: European Group on Ethics in Science and New Technologies, Opinion on ICT Implants in the Human Body, 2007
Terrorism and Human Rights
How far should the pendulum swing?
Counterveillance Principles?
• Independent evaluation of technology
• A moratorium on technology deployments
• Open information flows
• Justification for proposed measures
• Consultation and participation
• Evaluation
• Design principles: balance, independent controls, nymity and multiple identities
• Rollback
Source: Roger Clarke, University of New South Wales/Australian National University
Is it a basic human right to be safe from terrorism?
What should governments do?
The Role of Ethics
• Strongly linked to trust
• Institutional programs
  - Leadership commitment
  - Codes of conduct
  - Policies and practices
• The bioethics model
Finding a Balance
• Technology
• Policy
• Ethics
Ten Commandments of Computer Ethics
1. Thou shalt not use a computer to harm other people
2. Thou shalt not interfere with other people’s computer work
3. Thou shalt not snoop around in other people’s computer files
4. Thou shalt not use a computer to steal
5. Thou shalt not use a computer to bear false witness
6. Thou shalt not copy or use proprietary software for which you haven’t paid
7. Thou shalt not use other people’s computer resources without authorization or proper compensation
8. Thou shalt not appropriate other people’s intellectual output
9. Thou shalt think about the social consequences of the program you are writing or the system you are designing
10. Thou shalt always use a computer in ways that insure consideration and respect for your fellow humans
Source: Computer Ethics Institute, www.computerethicsinstitute.org
Conclusion
The Move to Action
[Diagram: the DATA → INFORMATION → KNOWLEDGE → WISDOM hierarchy, annotated with INTELLIGENCE and the questions “Decisions?” and “Actions?”]
Copyright Dr. Ramon C. Barquin
Today’s Analyst
Conclusion
“Briefly stated, the major task in control over our
destiny is to make as many second-order
consequences as possible intended, anticipated
and desirable; and reduce to a practical minimum
those that are unintended, unanticipated and
undesirable.”
Raymond Bauer, Second-Order Consequences
Questions
?
Barquin International
1707 L Street NW, Suite 1030
Washington, DC 20036
Phone: (202) 296-7147
Fax: (202)296-8903
[email protected]
www.barquin.com
Additional Slides
Goals a PPDM should enforce
1. It should prevent the discovery of sensitive information.
2. It should be resistant to the various data mining techniques.
3. It should not compromise the access and use of non-sensitive data.
4. It should not have exponential computational complexity.
Source: Elisa Bertino, Dan Lin, and Wei Jiang, Purdue University
Criteria on which to evaluate a PPDM algorithm
• Privacy level: how closely the sensitive hidden information can be estimated
• Hiding failure: the portion of sensitive information not hidden by the application of a privacy preservation technique
• Data quality: the quality of the data and of the data mining results after the hiding strategy is applied
• Complexity: the ability of a privacy preserving algorithm to execute with good performance in terms of all the resources implied by the algorithm
Source: Elisa Bertino, Dan Lin, and Wei Jiang, Purdue University
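The hiding-failure criterion, for instance, is directly computable: the fraction of sensitive patterns that remain discoverable after sanitization. A toy sketch (the pattern representation and data are invented for illustration):

```python
def hiding_failure(sensitive_patterns, patterns_after):
    """Fraction of sensitive patterns still discoverable after the
    sanitization technique has been applied (0.0 = all hidden)."""
    still = set(sensitive_patterns) & set(patterns_after)
    return len(still) / len(sensitive_patterns)

# Patterns marked sensitive before sanitization (as item pairs)...
sensitive = [("beer", "diapers"), ("a", "b"), ("c", "d"), ("e", "f")]
# ...and the patterns a miner still finds in the sanitized data.
found_after = [("beer", "diapers"), ("x", "y")]

print(hiding_failure(sensitive, found_after))
```

A complementary check would measure data quality the same way, comparing the non-sensitive patterns found before and after sanitization, since hiding everything is trivial if utility is allowed to collapse.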
Protocols governing privacy disclosure among entities
• Data collection: protects privacy during data transmission from the data providers to the data warehouse server
• Inference control: manages privacy protection between the data warehouse server and data mining servers
• Information sharing: controls information shared among the data mining servers in different systems
Source: Privacy-Preserving Data Mining Systems, Nan Zhang and Wei Zhao, COMPUTER, April 2007
Topics in PPDM Bibliography
1. Privacy Preserving Data Mining: Philosophical Issues
2. Privacy Preserving Data Mining: Survey/General Issues
3. Additive Data Perturbation
4. Multiplicative Data Perturbation
5. Categorical or General Data Perturbation
6. Data Anonymization
7. Data Swapping
8. Randomized Response
9. Cryptographic/Secure Multi-Party Computation (SMC)
10. Privacy Preserving Classification
11. Privacy Preserving Association Rule/Frequent Itemsets Mining
12. Privacy Preserving Clustering
13. Privacy Preserving Bayes Classifier/Bayesian Network
14. Privacy Preserving Multivariate Statistical Analysis
15. Privacy Information Retrieval and Database Application
16. Privacy Preserving Collaborative Filtering
17. Privacy Preserving Data Stream Mining
18. Privacy in P2P or Large-scale Distributed Environments
19. Hiding Sensitive Rules
20. Information Theory in Privacy Preserving Data Mining
21. Security and Privacy in RFID Systems
22. Privacy Preserving Case Study
23. Link Farm
Source: Charu C. Aggarwal (IBM), Philip S. Yu (University of Illinois)