Macro Trends in Counter-Terrorism Technologies
And Thoughts on Responsible Innovation
DETECTER Project, Brussels
September 7th, 2011
Jeff Jonas, IBM Distinguished Engineer
Chief Scientist, IBM Entity Analytics
[email protected]

Today’s Material
• Background
• Macro Trends
• Detecting Bad Guys in Big Data
• Challenging Privacy and Civil Liberties Issues
• Privacy by Design (PbD) Considerations
• Questions and Answers

Background
• Early 80’s: Founded Systems Research & Development (SRD), a custom software consultancy
• 1989 – 2003: Built numerous systems for Las Vegas casinos including a technology known as Non-Obvious Relationship Awareness (NORA)
• 2001/2003: Funded by In-Q-Tel
• 2005: SRD acquired by IBM; now Chief Scientist, IBM Entity Analytics
• Cumulatively: I have had a hand in a number of systems with multi-billions of rows describing 100’s of millions of entities

Roles
• Member, Markle Foundation Task Force on National Security in the Information Age
• Board Member, US Geospatial Intelligence Foundation (USGIF), the GEOINT organizing body
• Senior Associate, Center for Strategic and International Studies (CSIS)
• Member, EPIC advisory board
• Advisor, Privacy International

Current Primary Area of Interest
• Making sense of information in large data sets, across complex ecosystems, with emphasis on privacy and civil liberties protections
  – 1996: Created an identity-centric customer repository based on 4,200 disparate systems … >100 million resolved identities
  – 2001: Assistance in various post-9/11 data analysis programs for public and private sector
  – 2005: Missing persons project following Hurricane Katrina, resulting in the reunification of >100 loved ones

A Late Bloomer to Privacy
• 1980 – 2001: No clue whatsoever
• 2001 – 2006: Slowly waking up
• 2007 – 2011: Today, at best, a student of privacy

A Journey Fraught with Reflection and Rethinking
The greater my privacy and civil liberties awareness, the greater the number of imperfections that appear in my rearview mirror.

Katrina – Missing Persons Reunification Project
• Information about the status of persons quickly ends up scattered across countless databases
  – Over 50 such web sites/organizations were identified as having victim-related data
  – Many people were registered multiple times in the same database
  – Many people were registered multiple times across databases
  – Many people were registered as missing in one database and found in another database
• Connecting found persons previously reported as missing becomes nearly impossible
  – Too many databases
  – Constantly changing data

Katrina Reunification Project Statistics
• Total data sources: 15
• Usable records: 1,570,000
• Unique persons: 36,815
• Total loved ones reunited: >100

Katrina – Missing Persons Reunification Project
• Privacy by Design (PbD)
  – Contractually authorized to delete all the data after the reunification office completed its work
  – Hence, a few months later, all collected data and reporting products were deleted
DESTRUCTION OF EVIDENCE!
Data Decommissioning – Destruction of Accountability

Macro Trends

Good News: The World is Not More Dangerous
• Avg Age: 37 (1900: Western Europe) vs. 67 (Today: Global Average)
• Number Dead: 75M, ~17+% (1300’s: “Black Death”) vs. 300M, ~4.5% (Today: if America sank into the ocean and everyone died)

Prediction
Your doctor is 102 and this is not weird.

Bad News: “More Death Cheaper in Future” Graph
[Figure: axes are Complexity of Execution and Death; plotted points are 10 Kiloton Nuke and 1918 Spanish Influenza]

1918 Spanish Influenza Genome

“More Death Cheaper in Future” Graph = Bad
[Figure: same axes and points (10 Kiloton Nuke, 1918 Spanish Influenza), annotated “Easier” and “More Death”]

Jerome Kerviel – US$7B
www.chinapost.com.tw/news_images/20080127/p1d.jpg

Jerome Kerviel – US$7B
[Figure: timeline spanning 1 Day; labels: Back it out, Back it in, Analytic Checkpoint (each shown twice)]

2050 Predictions
A single person can kill 100M people for <$1,000.
State of the Union: Enterprise Amnesia

Amnesia, definition
A defect in memory, especially resulting from brain damage.

US National Security Amnesia Events
• 9/11: Two known terrorists were admitted into the US (only discovered after the fact).
• Christmas Day Bomber: Abdulmutallab possessed a multi-entry visa while at the same time being on the terrorist watch list (only discovered after the fact).

Trend: Organizations Are Getting Dumber
“Every two days now we create as much information as we did from the dawn of civilization up until 2003.” ~ Eric Schmidt, CEO Google
[Figure: Computing Power Growth over Time; curves: Available Observation Space and Sensemaking Algorithms; the gap between them is labeled Context / Enterprise Amnesia]

Trend: Organizations Are Getting Dumber
[Figure: same chart, with the gap between Available Observation Space and Sensemaking Algorithms labeled “WHY?”]

Algorithms at Dead End. You Can’t Squeeze Knowledge Out of a Pixel.

No Context
[email protected]

Context, definition
Better understanding something by taking into account the things around it.

Information without context is hardly actionable.

Lack of Context – Consequences
• Alert queues grow faster than humans can address them, and are filled mostly with false positives
• The top item in the queue is not the most relevant item
• Items require so much investigative effort that they are often abandoned prematurely
• Risk assessment becomes the risk

Information in Context … and Accumulating
[Figure: [email protected] linked to Job Applicant, No Fly List, Most Trusted Source, Known Terrorist]

The Puzzle Metaphor
• Imagine an ever-growing pile of puzzle pieces of varying sizes, shapes and colors
• What it represents is unknown – there is no picture
• Is it one puzzle, 15 puzzles, or 1,500 different puzzles?
• Some pieces are duplicates, missing, incomplete, low quality, or have been misinterpreted
• Some pieces may even be professionally fabricated lies
• Until you take the pieces to the table and attempt assembly, you don’t know what you are dealing with

Puzzling: 4 Puzzles, 620 Useful Pieces
• 270 pieces (90%)
• 200 pieces (66%)
• 150 pieces (50%)
• +36 Useless Pieces!: 30 pieces, 10% (duplicates) and 6 pieces, 2% (pure noise)

First Discovery

More Data Finds Data

Duplicates in Front Of Your Eyes

First Duplicate Found Here

Incremental Context – Incremental Discovery
• 6:40pm: START
• 22min: “Hey, this one is a duplicate!”
• 35min: “I think some pieces are missing.”
• 37min: “Looks like a bunch of hillbillies on a porch.”
• 44min: “Hillbillies, playing guitars, sitting on a porch, near a barber sign … and a banjo!”
[Image label: 150 pieces, 50%]

Incremental Context – Incremental Discovery
• 47min: “We should take the sky and grass off the table.”
• 2hr: “Let’s switch sides, and see if we can make sense of this from different perspectives.”
• 2hr10m: “Wait, there are three … no, four puzzles.”
• 2hr17m: “We need a bigger table.”
• 2hr18m: “I think you threw in a few random pieces.”

Trend: Big Data [in context] = New Physics
• More data: better the predictions
  – Lower false positives
  – Lower false negatives
• More data: bad data … good
  – Suddenly glad your data was not perfect
• More data: less compute

From Pixels to Pictures to Insight
[Figure labels: Observations, Contextualization, Persistent Context, Relevance Detection, Consumer (an analyst, a system, the sensor itself, etc.)]

One Form of Context is “Expert Counting”
• Is it 5 people each with 1 account … or is it 1 person with 5 accounts?
• Is it 20 cases of H1N1 in 20 cities … or one case reported 20 times?
• If one cannot count … one cannot estimate vector or velocity (direction and speed).
• Without vector and velocity … prediction is nearly impossible.
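
The counting question on the “Expert Counting” slide can be illustrated with a small sketch. This is only an illustration under loud assumptions: the records, the field names, and the `count_entities` function are hypothetical, and the linking rule (two records belong to the same entity if they share any exact non-name feature value) is deliberately naive. It is not the NORA or IBM Entity Analytics method, which the slides do not describe in detail.

```python
def count_entities(records):
    """Count distinct entities in a list of feature dictionaries.

    Assumption (hypothetical rule): records that share any exact
    non-name feature value are treated as the same entity.
    """
    parent = list(range(len(records)))  # union-find forest, one node per record

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (field, value) -> index of first record carrying it
    for idx, record in enumerate(records):
        for field, value in record.items():
            if field == "name":
                continue  # assumption: names alone are too ambiguous to link on
            feature = (field, value)
            if feature in first_seen:
                union(first_seen[feature], idx)
            else:
                first_seen[feature] = idx

    return len({find(i) for i in range(len(records))})


# Hypothetical data: five accounts opened under five different names.
accounts = [
    {"name": "Bob Jones", "phone": "555-0101", "address": "POB 13452"},
    {"name": "Robert Jones", "address": "POB 13452", "email": "rj@example.com"},
    {"name": "R. Jones", "email": "rj@example.com"},
    {"name": "Ken Wells", "phone": "555-0199"},
    {"name": "K. Wells", "phone": "555-0199", "address": "POB 999911"},
]

print(count_entities(accounts))  # 2 entities behind 5 accounts
```

Note that the first and third accounts share no feature directly; they are linked only through the second account. That is the sense in which accumulated context changes the count.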
Skilled adversaries engage in “channel separation.”
[Figure: Cell Phone #1, Cell Phone #2, Bank Acct #1 and Passport #1 appear to belong to separate holders: Billy K., William A., Unknown, Unknown]

Hence, detection requires “channel consolidation.”
William A aka Billy K.
• Cell Phone #1
• Cell Phone #2
• Bank Acct #1
• Passport #1

Expert Counting: Degrees of Difficulty
[Figure: pairs of records in increasing order of difficulty]
• Exactly Same: Bob Jones 123455 / Bob Jones 123455
• Fuzzy: Bob Jones 123455 / Robert T Jonnes 000123455
• Incompatible Features: Bob Jones 123455 / bjones@hotmail
• Deceit: Bob Jones 123455 / Ken Wells 550119

Deceit Detection Using Context Accumulation
[Figure: Feature Accumulation. Records for Bob Jones / Robert Jones (123455, POB 13452, DOB 03/12/73, [email protected]) and Ken Wells (550119, POB 999911, DOB 03/12/73, [email protected]) accumulate features until Robert Jones 123455 and Ken Wells 550119 are marked “Resolved!”]

3 Models for Information Sharing

1. Bulk Transfer
• Large collections are passed along to appropriate third parties
• May be required if the recipient must commingle the data in secret
• The recipients must have a capacity much larger than their own native requirements
• The more copies, the more difficult it is to maintain information currency across the ecosystem
• The more copies, the more difficult it is to prevent unintended disclosure
• Useful when the number of recipients and transactional volumes are very small

2. Services for Inquiry
• Owners enable third party inquiry (human or machine lookups)
• When lots of systems are integrated, federated search can be automated to search all third party data sources based on a single user/machine search
• Each system in the federation must be sized for all volume
• Third party systems often lack the necessary indexes
• Nearly impossible to ensure each federated system is on-line
• Useful for periodic, on-demand inquiry using each third party data source like a reference system – particularly appropriate for narrow investigative work and/or forensic analysis
• Not that useful for detect/preempt missions

3. Central Catalog/Index
• Parties interested in information sharing supply metadata to a central catalog (index)
• Inquiries can discover the location of all available documents using a single lookup
• Card catalogs provide pointers to source systems and documents, enabling efficient/scalable lookup (aka federated fetch)
• Easier to keep the data current … than bulk transfer
• Scales massively
• Easier to secure

Discovery at the Library
[Figure: card catalog lookup by Subject, Title, Author]

Enterprise Discovery
[Figure: catalog lookup by Who, What, Where, When, How]

The Policy Focus Becomes … “Discoverability”
If you don’t publish your metadata (who, what, where, when) to the enterprise catalog …
Information is not discoverable …
Therefore, the value of your operational system to the broad strategic interests of the enterprise is effectively ZERO!

Are You Playing Well With Others?
SHARING SCORECARD(*): DISCOVERABILITY

Organization      Records   Discoverable   %
This org          5B        2.5B           50%
That org          120B      6B             5%
The other org     3B        1B             33%
Their org         1B        750M           75%
Their other org   1B        500M           50%

(*) Any resemblance to real organizations and real numbers would be coincidental

Challenging Privacy and Civil Liberties Issues

Issue #1: Essential Secrets vs. Transparency
• To detect professionally fabricated lies, using only data, one must either:
  1. Collect observations the adversary doesn’t know you have
  2. Or, be able to perform compute over your observations in a manner the adversary cannot fathom
• The Challenge: How can organizations catch bad guys if there is transparency over their observational space and what is computable?

Issue #2: More Data Good
• The good news: Those in the counterterrorism business and the privacy community equally detest false positives
  – The government recognizes that false positives waste government resources
  – The privacy community recognizes that false positives place the innocent under undeserved government scrutiny
• The challenge: Two remedies for false positives
  1. Change the rules to reduce the number of alerts (which increases the false negatives)
  2. Add more information such that the additional context permits greater discrimination
• The more data, the lower the false positives and the lower the false negatives

Issue #3: Necessity of Central Indexes
• Federated search is extremely limited
  – Does not scale when the mission is to get “left of boom” (detection)
• Central card catalogs (indexes) are the only viable way forward
  – Only the metadata is centralized, with pointers; not all the data
• The Challenge: General reaction to central databases, even if just an index

Issue #4: Lone Gunmen Surveillance
• Rare events planned by one person or a small group are more difficult to detect
• The size of the observation space needed to detect lone gunmen planning acts of terrorism … approaches ubiquitous surveillance
• Risk-based surveillance
  – A car bomb in a public place
  – A sector of national infrastructure at risk
  – WMD over a major city
• The Challenge: At some point when one person can create extraordinary damage, cheaply, without a trace … then what?

Issue #5: Fewer Secrets Lead to Chilling Effects?
• It is becoming harder and harder to have secrets
• Will this chill behavior?
  – Will population behavior gravitate towards the center of the bell curve?
  – Or, will mankind become more tolerant of diversity?
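
The central catalog idea behind Issue #3 (only metadata is centralized, with pointers back to the source systems) can be sketched as follows. This is a minimal illustration, not a description of any real system: the `CentralCatalog` class, its method names, and the sample records are all hypothetical, and metadata values are assumed to be plain strings.

```python
from collections import defaultdict


class CentralCatalog:
    """Hypothetical central index: holds metadata keys and pointers, never full records."""

    def __init__(self):
        # (field, normalized value) -> set of (organization, record id) pointers
        self._index = defaultdict(set)

    def publish(self, org, record_id, metadata):
        """A data holder publishes selected metadata (e.g. who/what/where/when) for one record."""
        for field, value in metadata.items():
            self._index[(field, value.lower())].add((org, record_id))

    def discover(self, field, value):
        """A single lookup returns pointers to every holder with a matching record."""
        return sorted(self._index.get((field, value.lower()), set()))


catalog = CentralCatalog()
catalog.publish("Employee Database", "A-701", {"who": "Mark Smith", "where": "Houston"})
catalog.publish("Fraud Database", "B-9103", {"who": "Mark Smith"})
catalog.publish("Fraud Database", "B-9200", {"who": "Ken Wells"})

# One lookup discovers where matching records live; fetching the record itself
# would still require a separate, permissioned request to each original holder.
print(catalog.discover("who", "mark smith"))
```

The design choice worth noticing is that `discover` returns only pointers. The catalog never stores the underlying record, so access to the full data remains a decision made by the original data holder at the time of the request.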
Privacy by Design (PbD) Considerations

Universal Declaration of Human Rights
• Article 9: No one shall be subjected to arbitrary arrest, detention or exile.
• Article 12: No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honor and reputation. Everyone has the right to the protection of the law against such interference or attacks.
• Article 15: (1) Everyone has the right to a nationality. (2) No one shall be arbitrarily deprived of his nationality nor denied the right to change his nationality.
• Article 17: (1) Everyone has the right to own property alone as well as in association with others. (2) No one shall be arbitrarily deprived of his property.

PbD: Information Attribution
• Avoid the receipt of any data that does not come with an ability to track its pedigree/attribution.
• When passing your data into secondary systems, pass the data pedigree/attribution along to the recipient (even if that means only a pointer to your copy).
• If the ‘chain of where data came from’ is not maintained in the information sharing ecosystem, there is no hope of keeping the data current, and it is very difficult to reconcile cross-system consistency.
More here:
  Full Attribution, Don’t Leave Home Without It
  Out-bound Record-level Accountability in Information Sharing Systems

PbD: Data Destruction
• When the data is no longer needed or there is a mandate … purge it.
• For example, at the close of a special information analysis project, consider decommissioning the data sets in proportion to the consequences of unintended disclosure or misuse.
• If there is a legal requirement to retain data, or long term accountability is necessary, consider pushing the data to forms of retrieval useful only in the context of forensic/investigatory purposes.
More here:
  Decommissioning Data: Destruction of Accountability

PbD: Limit Data Transfers
• If you don’t have to move the entire record: don’t.
• Using information sharing systems as an example, it is best not to send all the data to each (and every) information sharing partner. Better to create a central index with prescribed fields. The index then points to the original data holder, and getting access to the original record requires permission at that time, from the original data holder. This ensures a degree of transparency.
More here:
  Discoverability: The First Information Sharing Principle

PbD: Data Tethering
• When data is moved from systems of record out into secondary systems, these secondary systems should be notified as the source data changes (adds, changes and deletes).
• If the secondary systems have themselves forwarded the data to tertiary systems, these same changes should be passed through the entire food chain.
More here:
  Data Tethering: Managing the Echo

PbD: Obfuscate Data
• For every copy there is an increasing risk of unintended disclosure.
• When there is an opportunity to perform data masking, anonymization, encryption … do it.
• Techniques now exist whereby data can be first obfuscated (e.g., encrypted, anonymized, masked, etc.) before information transfer ... while still maintaining a capability of performing deep analytics (e.g., data matching) post obfuscation.
More here:
  To Anonymize or Not Anonymize, That is the Question

Maximizing Discovery - Minimizing Disclosure
[Figure: obfuscated observations flow from sensors into a persistent context. Record #A-701 (Employee Database) carries the features Cd5dced41028cb…, 00c9782a552a2…, 7f2b6e48ea7d0…; Record #B-9103 (Fraud Database) carries 0d06b31faa7c…, B5e341a4b0c…, 00c9782a552…. The two records share the feature 00c9782a552….]

Maximizing Discovery - Minimizing Disclosure
[Figure: the same observations in the clear, behind policy controls. Record #A-701 (Employee Database): Mark Randy Smith, DOB: 06/07/74, 123 Main Street, 713 731 5577. Record #B-9103 (Fraud Database): M. Randal Smith, DOB: 06/07/74, 713 731 5577. Discovery: Record #A-701 Matches Record #B-9103.]

PbD: Build Accountability into Systems
• Opt for the use of tamper-resistant audit logs. The greater the lack of transparency, the greater the need for immutable logs: mandated or not.
More here:
  Immutable Audit Logs (IAL’s)
  Found: An Immutable Audit Log

Comments on: Data Mining
• Data mining is not bad. There are settings where data mining is very valuable and saves lives
• Predictive Data Mining
  – Limited efficacy without volumes of training data
• Predicate Triage Data
  – Used to organize data sets containing only “subjects of interest”
More here:
  Effective Counter-Terrorism and the Limited Role of Predictive Data Mining
  Data Mining, Predicate Triage and NSA Domestic Surveillance

Data Mining Defined (humorous)
“Torturing the data until it confesses … and if you torture it enough, you can get it to confess to anything.”
ACM SIGKDD Conference, Philadelphia 2006

Comments on: Link Analysis
• Link analysis is very powerful when used in a narrow fashion: inspection of “subjects of interest” outward.
• Predicate-based link analysis: Big social maps are not useful unless one has an entrance point.
• Link analysis: prune early
More here:
  Hunting Bad Guys, Phone Records and a Few Good Dead Men
  Predicate-based Link Analysis: A Post 9/11 Analysis (1+1= 13)
  Sometimes a Big Picture is Worth a 1,000 False Positives

Comments on: Watch Listing and False Positives
• There is a difference between wrongly named and wrongly matched
• Low fidelity watch lists are the single biggest cause of false positives - solving this ambiguity involves additional data
• Minimize collection, maximize consumer participation and election
• Provide a redress process
More here:
  Precision in TSA’s Terrorist Watch List
  Comments on the TSA No-Fly and Selectee Watch List Process

Closing Thoughts

“The data must find the data … and the relevance must find the user.”

In Closing
• There are going to be more sensors, more data
• This data will be commingled for greater accuracy, to serve consumers and protect countries
• What data is collected/observed and when … will be the debate
• Chief privacy principle: Avoid consumer surprise
• If it has been collected, the holder has the obligation to make sense of it
• Organizations must harness data to be smart, efficient, and survive … but how smart do they need to be and do we trust them?
• Hence the tension

Related Papers
• Heritage Foundation: Paul Rosenzweig/Jeff Jonas, “Correcting False Positives: Redress and the Watch List Conundrum”
• Cato Institute: Jeff Jonas/Jim Harper, “Effective Counterterrorism and the Limited Role of Predictive Data Mining”
• Steptoe & Johnson: Stewart Baker, “Anonymization, Data-Matching and Privacy: A Case Study”
• IEEE Security and Privacy: Jeff Jonas, “Threat and Fraud Intelligence: Las Vegas Style”
• Giannino Bassetti Foundation: Jeff Ubois, “Transparency, Privacy and Responsibility: An Interview with Jeff Jonas”
• Markle Foundation, “Nation At Risk: Policy Makers Need Better Information to Protect the Country”

Related Blog Posts
• Algorithms At Dead-End: Cannot Squeeze Knowledge Out Of A Pixel
• Puzzling: How Observations Are Accumulated Into Context
• When Risk Assessment is the Risk
• Big Data. New Physics.
• The Christmas Day Intelligence Failure – Part II: Jeff Jonas’ Christmas Wish List
• Decommissioning Data: Destruction of Accountability
• Source Attribution, Don’t Leave Home Without It
• Data Tethering: Managing the Echo
• Out-bound Record-level Accountability in Information Sharing Systems
• To Anonymize or Not Anonymize, That is the Question
• Immutable Audit Logs (IAL’s)
• The Information Sharing Paradox
• Discoverability: The First Information Sharing Principle
• When Federated Search Bites
• Using Transparency As A Mask