On Privacy, Data Mining Technology and Human Rights
© Dr. Ramon C. Barquin
Conference on Data Mining and Human Rights in the Fight Against Terrorism
Universität Zürich, Zurich, Switzerland, 10-11 June 2010

AGENDA
Introduction: Framing the Issue
Data Mining and the U.S. Government
What is Data Mining?
Informational Privacy: Basic Issues
What is Privacy-Preserving Data Mining?
The Role of Trust in Data Collection
Concerns as We Enter the Future
Terrorism and Human Rights
The Role of Ethics
Conclusion

Data, data everywhere…
Number of documents on the web: over 1 trillion?
Total new information in 2003: roughly 5 exabytes, equivalent to half a million new Libraries of Congress, and enough to capture every word ever spoken by all humans.
161 exabytes produced in 2006: 3 million times the storage of all books ever written, or 12 stacks of books reaching from the Earth to the Sun.
Sources: UC Berkeley study, 2003; IDC study, 2007

And many good reasons to extract as much knowledge as possible from that data.
Data mining applications: health care, law enforcement, education, logistics, customer service.

Investigation at Stillwater State Correctional Facility, Minnesota
Data mining software was applied to phone records from the prison.
A pattern linking calls between prisoners and a recent parolee was discovered.
The calling data was then mined again together with records of prisoners' financial accounts.
The result: a large drug smuggling ring was discovered.
Source: Yehuda Lindell, Bar-Ilan University

New York Times, "Reaping Results: Data-Mining Goes Mainstream," by Steve Lohr, May 20, 2007:
"The technology, for example, pointed to a high rate of robberies on paydays in Hispanic neighborhoods [in Richmond], where fewer people use banks and where customers leaving check-cashing stores were easy targets for robbers. Elsewhere, there were clusters of random-gunfire incidents at certain times of night. So extra police were deployed in those areas when crimes were predicted."

But nothing is perfect…
Cartoon caption: "No. Now all our pillaging is done electronically from a centralized office."

The Headlines
"TSA viewed as 'profiling' in new screening program," National Journal, 5/21/10
"Web Start-Ups Offer Bargains for Users' Data," by S. Clifford, NY Times, 5/30/10
"Shoppers Who Can't Have Secrets," by N. Singer, NY Times, 4/30/10
"Review of Terrorism Database Finds Flaws," by M. Sherman, Washington Post, 6/14/05
"How Privacy Vanishes Online," by S. Lohr, NY Times, 3/16/10
"Internet censorship proves counterproductive in curtailing terrorist recruitment," by Jill R. Aitoro, National Journal, 5/26/10

The Data Revolution
The current data revolution is fueled by the perceived, actual, and potential usefulness of data. Most electronic and physical activities leave some kind of data trail, and these trails can provide useful information to various parties. However, there are also concerns about the appropriate handling and use of sensitive information. Privacy-preserving methods of data handling seek to provide sufficient privacy as well as sufficient utility.
Source: Rebecca Wright, Stevens Institute of Technology

Framing the Issue
Where would we be without data?
Data as a double-edged sword
The post-9/11 environment
Focus on government data collections
Cartoon caption: "No, I'm not backing up my files… I'm just assuming the FBI is making copies."

Data Mining and the U.S. Government
Federal Agency Data Mining Reporting Act of 2007 (U.S. federal law)

Privacy Compliance Documents
Privacy Threshold Analysis (PTA): identifies whether a system, program, or project is a Privacy Sensitive System (i.e., a system that collects or maintains PII or otherwise impacts privacy) and determines whether a Privacy Impact Assessment (PIA) or System of Records Notice (SORN) is required.
Privacy Impact Assessment (PIA): the method by which a federal agency reviews system management activities in key areas, such as security and how and when information is collected, used, and shared.
The PIA determines whether an existing SORN appropriately covers the activity or a new SORN is required.
System of Records Notice (SORN): provides notice to the public regarding Privacy Act information collected by a system of records, as well as insight into how that information is used, retained, and may be corrected.
Source: Federal Agency Data Mining Reporting Act, 42 U.S.C. § 2000ee-3(b)(1)

Department of Homeland Security examples:
Automated Targeting System (ATS)
Data Analysis and Research for Trade Transparency System (DARTTS)
Freight Assessment System (FAS)
Source: 2009 Data Mining Report to Congress, Department of Homeland Security, December 2009

But what exactly is data mining? The morass of definitions…

Data Mining (statutory definition)
A program involving pattern-based queries, searches, or other analyses of one or more electronic databases, where:
(A) a department or agency of the Federal Government, or a non-Federal entity acting on behalf of the Federal Government, is conducting the queries, searches, or other analyses to discover or locate a predictive pattern or anomaly indicative of terrorist or criminal activity on the part of any individual or individuals;
(B) the queries, searches, or other analyses are not subject-based and do not use personal identifiers of a specific individual, or inputs associated with a specific individual or group of individuals, to retrieve information from the database or databases; and
(C) the purpose of the queries, searches, or other analyses is not solely (i) the detection of fraud, waste, or abuse in a Government agency or program, or (ii) the security of a Government computer system.
Source: Federal Agency Data Mining Reporting Act, 42 U.S.C. § 2000ee-3(b)(1)

Data Mining / Knowledge Discovery: other definitions
"Process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions." (Two Crows Corp.)
"Analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner." (Hand, Mannila, Smyth)
"Non-trivial extraction of implicit, previously unknown, and potentially useful information from large data sets or databases." (W. Frawley, G. Piatetsky-Shapiro, and C. Matheus, 1992)
"Use of information technology to attempt to derive useful knowledge from (usually) very large data sets." (DETECTER, Work Package No. 6)

List of Data Mining Systems (1)
Computer Assisted Passenger Pre-Screening System II (CAPPS II) - DHS/TSA
Secure Flight - DHS/TSA
Automated Targeting System (ATS) - DHS
Total Information Awareness / Terrorist Information Awareness (TIA) - DARPA
Multi-State Anti-Terrorism Information Exchange (MATRIX) - multi-state consortium
Novel Intelligence From Massive Data (NIMD) - NSA
Analyst Notebook I2 - DHS
Secure Collaborative Operational Prototype Environment (SCOPE) - FBI
Insight Smart Discovery - DIA
Verity K2 Enterprise - DIA
PATHFINDER - DIA
Autonomy - DIA
Counterintelligence Automated Investigative Management System (CI-AIMS) - DOE
Autonomy - DOE
Counterintelligence Analytical Research Data System (CARDS) - DOE
Source: DETECTER, Work Package No. 6

List of Data Mining Systems (2)
BioSense - DHHS/CDC
Foreign Terrorist Tracking Task Force Activity - FBI
NETLEADS - DHS/ICE & CBP
ICE Pattern Analysis and Information Collection System (ICEPIC) - DHS/ICE
Intelligence and Information Fusion (I2F) - DHS/OIA
ProActive Intelligence (PAINT) - DHS/OIA
Knowledge Discovery and Dissemination - IARPA
Video Analysis and Content Extraction (VACE) - IARPA
Rapid Knowledge Formulation - DARPA
Analysis, Dissemination, Visualization, Insight and Semantic Enhancement (ADVISE) - DHS
Able Danger - Army
Threat and Local Observation Notice (TALON) - DOD
TIDE (Datamart) - DHS
FBI Intelligence Community Data Marts - FBI
Investigative Data Warehouse (IDW) - FBI
Source: DETECTER, Work Package No. 6

European/International Data Mining Efforts
CAHORS - NATO
Creation of European Terrorist Profiles
European Passenger Name Records System
European Security Research
Terrorist Rasterfahndung - Bundeskriminalamt
Source: DETECTER, Work Package No. 6

For Observation
REVEAL (US)
SCION (US)
National Security Branch Analysis Center (US)
Guardian (US)
Eurodac (EU)
Schengen Information System II (EU)
Europol Information System (EU)
Visa Information System (EU)
EDVIGE/EDVIPR (FR)
CHRISTINA (FR)
Project Rich Picture (UK)
National Public Order Intelligence Unit Database (UK)
Source: DETECTER, Work Package No. 6

So how do we have our cake and eat it too?
It starts with the need to protect our data: security, data integrity, privacy protection.
And it progresses to bigger and better things: Privacy-Preserving Data Mining (PPDM).

Information Privacy: Degrees of Touchiness
As the type of personal information grows more intimate, the percentage of people who want to keep it at home rises:
Basic personal information (name, address, phone number): 42%
Social Security number or driver's license number: 51%
Major purchases: 56%
Internet behavior: 62%
Employee records: 64%
Credit or debit card number: 69%
Banking or home mortgage records: 74%
Patient health records: 83%
Source: CIO Magazine, July 15, 2006; Ponemon Institute

How should this information be protected?
Medical information
Financial information
Credit scores
Criminal records
School grades
Job performance ratings

Information Privacy
A concept applied to the collection, use, and maintenance of personal information that arose with the advent of database technology.
Its central component is the power of the individual to control the use of sensitive information.
The issue of identity: sensitivity; complexity.

Benefits of Data Access
Reinforcement of open scientific inquiry
Verification, refutation, or refinement of original results
Promotion of new research through existing data; improvement of measurement and data collection methods and analytic techniques
A climate in which scientific research confronts decision making
Source: Committee on National Statistics, National Research Council

Issues
The data: mandatory vs. voluntary; sensitive vs. non-sensitive; anonymous vs. named; individually identifiable vs. non-identifiable; the question of identity.
The usage: administrative vs. statistical.

Fair Information Practice Principles (FIPPs)
Transparency
Individual Participation
Purpose Specification
Data Minimization
Use Limitation
Data Quality and Integrity
Security
Accountability and Auditing

Start with the basics…
1. Generalization
2. De-identification and re-identification
3. "Anonymization"
4. Cryptography

But the problem of privacy breaches is big…
"In fact, 87% of the population of the United States is uniquely identified by date of birth (e.g., month, day and year), gender, and their 5-digit ZIP codes. The point is that data that may look anonymous is not necessarily anonymous."
Source: Latanya Sweeney, at a meeting of the Department of Homeland Security DPIAC

Hence, Privacy-Preserving Data Mining (PPDM)
Privacy-preserving data mining, or PPDM, is a research area concerned with protecting the privacy of "personally identifiable information" when that information is used for data mining. (Wikipedia)

Two Major Approaches to PPDM
The randomization approach; example application: Web demographics.
The cryptographic approach; example application: inter-enterprise data mining.
Sources: Ramakrishnan Srikant, IBM; Rebecca Wright, Stevens Institute of Technology

Major Approaches to Privacy-Preserving Data Mining
The Randomization Method
Group-Based Anonymization
Distributed Privacy-Preserving Data Mining
Privacy-Preservation of Application Results

The Randomization Method
Privacy Quantification
Adversarial Attacks on Randomization
Randomization Methods for Data Streams
Multiplicative Perturbations
Data Swapping

Group-Based Anonymization
The k-Anonymity Framework
Personalized Privacy-Preservation
Utility-Based Privacy Preservation
Sequential Releases
The l-Diversity Method
The t-Closeness Model
Models for Text, Binary and String Data

Distributed Privacy-Preserving Data Mining
Distributed Algorithms over Horizontally Partitioned Data Sets
Distributed Algorithms over Vertically Partitioned Data
Distributed Algorithms for k-Anonymity

Privacy-Preservation of Application Results
Association Rule Hiding
Downgrading Classifier Effectiveness
Query Auditing and Inference Control

Source for the taxonomy above: Charu C. Aggarwal (IBM) and Philip S. Yu (University of Illinois), Privacy-Preserving Data Mining: Models and Algorithms

Example: Web Demographics
The Volvo S40 website targets people in their 20s.
Are visitors in their 20s or 40s?
Which demographic groups like/dislike the website?
Source: Ramakrishnan Srikant, IBM

Randomization Approach: Overview
[Figure: true records such as (age 30, salary 70K) and (age 50, salary 40K) each pass through a randomizer before leaving the user; the server sees only perturbed records such as (65, 20K) and (25, 60K), reconstructs the distributions of Age and Salary from them, and feeds the reconstructed distributions to the data mining algorithms. Source: Ramakrishnan Srikant, IBM]
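The randomize-then-reconstruct pipeline shown above can be illustrated in a few lines. This is a minimal sketch, not the full distribution-reconstruction algorithm from the literature: it assumes each user adds zero-mean Gaussian noise with a publicly known standard deviation to a toy "age" value, and the server recovers only aggregates (the mean passes through the noise unchanged, and the true variance is the reported variance minus the known noise variance). All data and parameters here are hypothetical.

```python
import random

random.seed(1)

# True ages, known only to the individual users (toy data).
true_ages = [random.gauss(35, 8) for _ in range(50_000)]

NOISE_SD = 15.0  # std dev of the perturbation; a public parameter

# Randomization: each user reports age + zero-mean Gaussian noise.
reported = [a + random.gauss(0, NOISE_SD) for a in true_ages]

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Aggregate reconstruction: the mean survives the noise, and the
# true variance is the reported variance minus the known noise variance.
est_mean = mean(reported)
est_var = variance(reported) - NOISE_SD ** 2

print(round(est_mean, 1))        # close to the true mean of 35
print(round(est_var ** 0.5, 1))  # close to the true std dev of 8
```

The server never sees any individual's true age, yet the aggregates a mining algorithm needs are recoverable to good accuracy.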
Model Reconstruction Problem: many relevant algorithms
[WY04, YW05]: privacy-preserving construction of Bayesian networks from vertically partitioned data.
[YZW05]: classification from frequency mining in the fully distributed model (naïve Bayes classification, decision trees, and association rule mining).
[JW#]: privacy-preserving k-means clustering for arbitrarily partitioned data.
[AST05]: privacy-preserving computation of multidimensional aggregates on vertically or horizontally partitioned data using randomization.
Sources: Ramakrishnan Srikant, IBM; Rebecca Wright, Stevens Institute of Technology

Then attempt to reconstruct as accurately as possible…
[Chart: number of people by age, comparing the original, randomized, and reconstructed distributions. Source: Ramakrishnan Srikant, IBM]

Suggested Architecture for (Cryptographic) PPDM
Source: "Privacy-Preserving Data Mining Systems," Nan Zhang and Wei Zhao, COMPUTER, April 2007

Examples of Secure Computation Tasks
Authentication protocols
Online payments
Auctions
Elections
Privacy-preserving data mining
Source: Yehuda Lindell, Bar-Ilan University

Secure Multiparty Computation
A set of parties with private inputs wish to jointly compute a function of their inputs so that certain security properties (such as privacy and correctness) are preserved, e.g., in secure elections, auctions, and online payments.
These properties must be ensured even if some of the parties maliciously attack the protocol.
Source: Yehuda Lindell, Bar-Ilan University

Secure Multiparty Computation: the Ideal Model
[Diagram: two parties with private inputs x and y each send their input to a trusted party, which computes f1(x,y) and f2(x,y) and returns f1(x,y) to the first party and f2(x,y) to the second. Source: Yehuda Lindell, Bar-Ilan University]

So what is the role of trust?
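Before turning to trust between people and institutions, it is worth one concrete glance at how real protocols replace the trusted party of the ideal model. A minimal sketch of one classic building block, additive secret sharing over a public prime modulus, assuming three honest-but-curious parties who want only the sum of their (hypothetical) inputs; a full protocol with guarantees against malicious parties is considerably more involved.

```python
import random

P = 2**61 - 1  # a public prime; all arithmetic is done mod P

def share(secret, n=3):
    """Split `secret` into n additive shares that sum to it mod P."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

# Three parties with private inputs (e.g., salaries).
inputs = [70_000, 40_000, 55_000]

# Each party splits its input and sends one share to each party.
all_shares = [share(x) for x in inputs]

# Party i locally adds the shares it received (column i).
# Any single column is uniformly random, so on its own it
# reveals nothing about the individual inputs.
partial_sums = [sum(col) % P for col in zip(*all_shares)]

# Publishing only the partial sums reveals the total, and nothing
# beyond what the total itself implies about any one input.
total = sum(partial_sums) % P
print(total)  # 165000
```

No party ever sees another's input, yet the jointly computed sum matches what the ideal model's trusted party would have returned.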
The Role of Trust in Data Collection
Trust is a three-part relation: truster, entrusted good, trusted.
Three key principles: the truster must see the trusted as
(1) having goodwill,
(2) encapsulating the interests of others, and
(3) competent to handle the entrusted good.
Source: A. Baier, "Trust and Antitrust"

[Diagram: the truster places an entrusted good, via trust, with the trusted.]

The Role of Trust in Data Collection
T(A:B) = f(G, I, C)
where:
T(A:B) is the trustworthiness of person A (trusted), as perceived by person B (truster), in relation to the entrusted good;
G is the goodwill of person A;
I is the degree to which A is able to represent the interests of B;
C is the competence of A in handling the entrusted good.

Attributes of Goodwill
Persons and institutions can both be trustworthy.
Persons:
Part of the trustworthy person's motive in handling the entrusted good is to fulfill the truster's interests.
In fulfilling the truster's interests, the trustworthy person looks beyond the interests that would only benefit him or herself.
The trustworthy person has demonstrated that he or she cares about the management of others' entrusted goods in the past.
The actions, motives, and interests of the trustworthy person, and the principles which guide his or her actions, are clear and easily understood.
Institutions:
The trustworthy institution's goal is to uphold the interests of trusters even though it may not have an interest in doing so, and even if doing so conflicts with certain interests of the institution.
The institution has clear policies which guide the behavior of its members and demonstrate that the goal of the institution is to preserve the public's interests.
The institution's implicit or explicit code of conduct demonstrates that the institution upholds the interests of the public.

Encapsulating the Interests of Others
Trustworthy persons:
Can be relied upon with other people's entrusted goods.
Their interests are clearly observable and understood.
How their interests lead to the fulfillment of the truster's interests is clearly observable and understood.
Trustworthy institutions:
Have a history of reliability in the management of the public's entrusted goods.
Their interests are clear, open, and understood.
How their interests lead to the fulfillment of the truster's interests is also clear, open, and understood.

Competence
Webster's Dictionary: "having requisite or adequate ability or qualities"; "legally qualified."
What are the skills and tools of a competent data collector?
When dealing with government-mandated collection of sensitive data where individuals are identifiable, do we need "legal qualification"?
Competence of institutions: technology and policy.

Concerns as We Enter the Future

If men were angels…
"If men were angels, no government would be necessary. If angels were to govern men, neither external nor internal controls on government would be necessary. In framing a government which is to be administered by men over men, the great difficulty lies in this: you must first enable the government to control the governed; and in the next place oblige it to control itself."
James Madison, Federalist No. 51

Categorization of Surveillance
Surveillance
Dataveillance
Überveillance
Source: Roger Clarke, University of New South Wales / Australian National University

Surveillance and dataveillance… and then there is überveillance.

Categorization of Surveillance
Of what? For whom? By whom? Why? How? Where? When?
Source: Roger Clarke, University of New South Wales / Australian National University

Überveillance
"An above and beyond, omnipresent, 24/7 surveillance where the explicit concerns for misinformation, misinterpretation and information manipulation are ever more multiplied, and where potentially the technology is embedded in our bodies."
Michael and Katina Michael, University of Wollongong
Source: Roger Clarke, University of New South Wales / Australian National University

Principles to consider when dealing with issues of human dignity, the right to the integrity of the person, and the protection of personal data:
Precautionary principle
Purpose specification principle
Data minimization principle
Proportionality principle
Integrity and inviolability of the body principle
Dignity principle
Source: European Group on Ethics in Science and New Technologies, Opinion on ICT Implants in the Human Body, 2007

Terrorism and Human Rights
How far should the pendulum swing?

Counterveillance Principles?
Independent evaluation of technology
A moratorium on technology deployments
Open information flows
Justification for proposed measures
Consultation and participation
Evaluation
Design principles: balance, independent controls, nymity and multiple identities
Rollback
Source: Roger Clarke, University of New South Wales / Australian National University

Is it a basic human right to be safe from terrorism? What should governments do?

The Role of Ethics
Strongly linked to trust
Institutional programs: leadership commitment, codes of conduct, policies and practices
The bioethics model

Finding a Balance: technology, policy, ethics

Ten Commandments of Computer Ethics
1. Thou shalt not use a computer to harm other people.
2. Thou shalt not interfere with other people's computer work.
3. Thou shalt not snoop around in other people's computer files.
4. Thou shalt not use a computer to steal.
5. Thou shalt not use a computer to bear false witness.
6. Thou shalt not copy or use proprietary software for which you haven't paid.
7. Thou shalt not use other people's computer resources without authorization or proper compensation.
8. Thou shalt not appropriate other people's intellectual output.
9. Thou shalt think about the social consequences of the program you are writing or the system you are designing.
10. Thou shalt always use a computer in ways that insure consideration and respect for your fellow humans.
Source: Computer Ethics Institute, www.computerethicsinstitute.org

Conclusion

The Move to Action
[Diagram: DATA, INFORMATION, KNOWLEDGE, and WISDOM feed INTELLIGENCE, leading to decisions and actions.]

Today's Analyst
[Slide: image]

Conclusion
"Briefly stated, the major task in control over our destiny is to make as many second-order consequences as possible intended, anticipated and desirable; and reduce to a practical minimum those that are unintended, unanticipated and undesirable."
Raymond Bauer, Second-Order Consequences

Questions?
Barquin International
1707 L Street NW, Suite 1030
Washington, DC 20036
Phone: (202) 296-7147
Fax: (202) 296-8903
[email protected]
www.barquin.com

Additional Slides

Goals a PPDM Algorithm Should Enforce
1. It should prevent the discovery of sensitive information.
2. It should be resistant to the various data mining techniques.
3. It should not compromise the access and use of non-sensitive data.
4. It should not have exponential computational complexity.
Source: Elisa Bertino, Dan Lin, and Wei Jiang, Purdue University

Criteria on Which to Evaluate a PPDM Algorithm
Privacy level: how closely the sensitive hidden information can be estimated.
Hiding failure: the portion of sensitive information not hidden by the application of a privacy preservation technique.
Data quality: the quality of the data and of the data mining results after the hiding strategy is applied.
Complexity: the ability of a privacy-preserving algorithm to execute with good performance in terms of all the resources the algorithm requires.
Source: Elisa Bertino, Dan Lin, and Wei Jiang, Purdue University

Protocols Governing Privacy Disclosure Among Entities
Data collection: protects privacy during data transmission from the data providers to the data warehouse server.
Inference control: manages privacy protection between the data warehouse server and the data mining servers.
Information sharing: controls the information shared among the data mining servers in different systems.
Source: "Privacy-Preserving Data Mining Systems," Nan Zhang and Wei Zhao, COMPUTER, April 2007
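The "hiding failure" criterion above lends itself to a one-function sketch. Assuming a toy representation in which mined association rules are plain strings (the rule names and data here are hypothetical), hiding failure is simply the fraction of sensitive rules a miner can still discover after the sanitization step:

```python
def hiding_failure(sensitive_rules, rules_after_sanitization):
    """Fraction of sensitive rules still discoverable after sanitization.

    0.0 means every sensitive rule was hidden; 1.0 means none were.
    """
    if not sensitive_rules:
        return 0.0
    still_found = sensitive_rules & rules_after_sanitization
    return len(still_found) / len(sensitive_rules)

# Toy example: rules mined before vs. after a hiding strategy is applied.
sensitive = {"diapers->beer", "statins->heart_disease"}
mined_after = {"bread->butter", "statins->heart_disease"}

print(hiding_failure(sensitive, mined_after))  # 0.5: one of two rules still leaks
```

A real evaluation would weigh this against the data-quality criterion, since hiding more rules usually means distorting the data more.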
Topics in PPDM Bibliography
1. Privacy Preserving Data Mining Philosophical Issues
2. Privacy Preserving Data Mining Survey/General Issues
3. Additive Data Perturbation
4. Multiplicative Data Perturbation
5. Categorical or General Data Perturbation
6. Data Anonymization
7. Data Swapping
8. Randomized Response
9. Cryptographic/Secure Multi-Party Computation (SMC)
10. Privacy Preserving Classification
11. Privacy Preserving Association Rule/Frequent Itemsets Mining
12. Privacy Preserving Clustering
13. Privacy Preserving Bayes Classifier/Bayesian Network
14. Privacy Preserving Multivariate Statistical Analysis
15. Privacy Information Retrieval and Database Application
16. Privacy Preserving Collaborative Filtering
17. Privacy Preserving Data Stream Mining
18. Privacy in P2P or Large-scale Distributed Environments
19. Hiding Sensitive Rules
20. Information Theory in Privacy Preserving Data Mining
21. Security and Privacy in RFID Systems
22. Privacy Preserving Case Study
23. Link Farm
Source: Charu C. Aggarwal (IBM), Philip S. Yu (University of Illinois)