Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
eCommerce Technology 20-751 Data Mining and Privacy Technology 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Coping with Information • Computerization of daily life produces data – Point-of-sale, Internet shopping (& browsing), credit cards, banks . . . – Info on credit cards, purchase patterns, product preferences, payment history, sites visited . . . • Travel. One trip by one person generates info on destination, airline preferences, seat selection, hotel, rental car, name, address, restaurant choices . . . • Data cannot be processed or even inspected manually 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Data Overload • Only a small portion of data collected is analyzed (estimate: 5%) • Vast quantities of data are collected and stored out of fear that important info will be missed • Data volume grows so fast that old data is never analyzed • Database systems do not support queries like – “Who is likely to buy product X” – “List all reports of problems similar to this one” – “Flag all fraudulent transactions” • But these may be the most important questions! 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Data Mining “The key in business is to know something that nobody else knows.” — Aristotle Onassis “To understand is to perceive patterns.” PHOTO: LUCINDA DOUGLAS-MENZIES PHOTO: HULTON-DEUTSCH COLL — Sir Isaiah Berlin 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Data Mining • Extracting previously unknown relationships from large datasets – summarize large data sets – discover trends, relationships, dependencies – make predictions • Differs from traditional statistics – Huge, multidimensional datasets – High proportion of missing/erroneous data – Sampling unimportant; work with whole population • Sometimes called – KDD (Knowledge Discovery in Databases) – OLAP (Online Analytical Processing) 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Taxonomy of Data Mining Methods Data Mining Methods Predictive Modeling • Decision Trees • Neural Networks • Naive Bayesian • Branching criteria Database Segmentation Link Analysis Text Mining Deviation Detection Semantic Maps • Clustering • K-Means Rule Associa tion Visualization SOURCE: WELGE & REINCKE, NCSA 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Predictive Modeling • Objective: use data about the past to predict future behavior • Sample problems: – Will this (new) customer pay his bill on time? (classification) – What will the Dow-Jones Industrial Average be on October 15? (prediction) • Technique: supervised learning – decision trees – neural networks – naive Bayesian 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Predictive Modeling Honest Tridas Vickie Mike Wally Waldo Barney Crooked Which characteristics distinguish the two groups? SOURCE: WELGE & REINCKE, NCSA 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Learned Rules in Predictive Modeling Tridas Vickie Mike Honest = has round eyes and a smile SOURCE: WELGE & REINCKE, NCSA 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Rule Induction Example Data: height short tall tall short tall tall tall short hair blond blond red dark dark blond dark blond eyes blue brown blue blue blue blue brown brown class A B A B B A B B Devise a predictive rule to classify a new person as A or B SOURCE: WELGE & REINCKE, NCSA 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Build a Decision Tree hair dark blond red short, blue = B tall, blue = B tall, brown= B {tall, blue = A} short, blue = A tall, brown = B tall, blue = A short, brown = B Does not completely classify blonde-haired people. More work is required Completely classifies dark-haired and red-haired people SOURCE: WELGE & REINCKE, NCSA 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Build a Decision Tree hair dark blond red short, blue = B tall, blue = B tall, brown= B {tall, blue = A} Decision tree is complete because 1. All 8 cases appear at nodes 2. At each node, all cases are in the same class (A or B) short, blue = A tall, brown = B tall, blue = A short, brown = B eye blue short = A tall = A brown tall = B short = B SOURCE: WELGE & REINCKE, NCSA 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Learned Predictive Rules hair dark blond red B A eyes blue A brown B SOURCE: WELGE & REINCKE, NCSA 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Decision Trees • Good news: a decision tree can always be built from training data • Any variable can be used at any level of the tree • Bad news: every data point may wind up at a leaf (tree has not compressed the data) height tall short eyes blue brown B hair blonde A brown B dark eyes hair blonde B B blue dark red A B 8 cases, 7 nodes. This tree has not summarized the data effectively 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Database Segmentation (Clustering) • “The art of finding groups in data” Kaufman & Rousseeuw • Objective: gather items from a database into sets according to (unknown) common characteristics • Much more difficult than classification since the classes are not known in advance (no training) • Examples: – Demographic patterns – Topic detection (words about the topic often occur together) • Technique: unsupervised learning 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Clustering Example • Are there natural clusters in the data (36,10), (12,8), (38,42), (13,6), (36,38), (16,9), (40,36), (35,19), (37,7), (39,8)? 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Clustering • K-means algorithm • To divide a set into K clusters • Pick K points at random. Use them to divide the set into K clusters based on nearest distance • Loop: – Find the mean of each cluster. Move the point there. – Redefine the clusters. – If no point changes cluster, done • Clustering demo • Agglomerative clustering: start with N clusters & merge • Agglomerative clustering demo 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Neural Networks Networks of processing units called neurons. This is the j th neuron: Neuron computes a linear function of the inputs n INPUTS x1, …, xn 1 OUTPUT yj depends only on the linear function Neurons are easy to simulate n WEIGHTS w1j , …, wnj SOURCE: CONSTRUCTING INTELLIGENT AGENTS WITH JAVA 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Neural Networks INPUT LAYER HIDDEN LAYER INPUTS: 1 PER INPUT LAYER NEURON OUTPUT LAYER OUTPUTS: 1 PER OUTPUT LAYER NEURON DISTINGUISHED OUTPUT (THE “ANSWER”) 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Neural Networks Learning through back-propagation 1. Network is trained by giving it many inputs whose output is known 2. Deviation is “fed back” to the neurons to adjust their weights 3. Network is then ready for live data DEVIATION SOURCE: CONSTRUCTING INTELLIGENT AGENTS WITH JAVA 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Neural Network Classification “Which factors determine a pet’s favorite food?” Species = Dog food: Chum Breed = Mixed food: Mr. Dog Owner’s age > 45 Owner’s sex = F 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Neural Network Demos • Demo: Notre Dame football, Handwriting analyzer • Financial applications: – Churning: are trades being instituted just to generate commissions? – Fraud detection in credit card transactions – Kiting: isolate float on uncollected funds – Money Laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network) • Insurance applications: – Auto Insurance: detect a group of people who stage accidents to collect on insurance – Medical Insurance: detect professional patients and ring of doctors and ring of references 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Rule Association • Try to find rules of the form IF <left-hand-side> THEN <right-hand-side> • (This is the reverse of a rule-based agent, where the rules are given and the agent must act. Here the actions are given and we have to discover the rules!) • Prevalence = probability that LHS and RHS occur together (sometimes called “support factor,” “leverage” or “lift”) • Predictability = probability of RHS given LHS (sometimes called “confidence” or “strength”) 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Association Rules from Market Basket Analysis • <Dairy-Milk-Refrigerated> <Soft Drinks Carbonated> – prevalence = 4.99%, predictability = 22.89% • <Dry Dinners - Pasta> <Soup-Canned> – prevalence = 0.94%, predictability = 28.14% • <Paper Towels - Jumbo> <Toilet Tissue> – prevalence = 2.11%, predictability = 38.22% • <Dry Dinners - Pasta> <Cereal - Ready to Eat> – prevalence = 1.36%, predictability = 41.02% • <American Cheese Slices > <Cereal - Ready to Eat> – prevalence = 1.16%, predictability = 38.01% 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Use of Rule Associations • Coupons, discounts – Don’t give discounts on 2 items that are frequently bought together. Use the discount on 1 to “pull” the other • Product placement – Offer correlated products to the customer at the same time. Increases sales • Timing of cross-marketing – Send camcorder offer to VCR purchasers 2-3 months after VCR purchase • Discovery of patterns – People who bought X, Y and Z (but not any pair) bought W over half the time 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Finding Rule Associations • Example: grocery shopping • For each item, count # of occurrences (say out of 100,000) apples 1891, caviar 3, ice cream 1088, pet food 2451, … • Drop the ones that are below a minimum support level apples 1891, ice cream 1088, pet food 2451, … • Make a table of each item against each other item: apples ice cream pet food apples 1891 685 24 ice cream ----- 1088 322 pet food ----- ----- 2451 • Discard cells below support threshold. Now make a cube for triples, etc. Add 1 dimension for each product on LHS. 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Rule Association Demos • Magnum Opus (RuleQuest, free download) • See5/C5.0 (RuleQuest, free download) • Cubist numerical rule finder (RuleQuest, free download) • IBM Interactive Miner 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Text Mining • Objective: discover relationships among people & things from their appearance in text • Topic detection, term detection – When has a new term been seen that is worth recording? • Generation of “knowledge map”, a graph representing terms/topics and their relationships • SemioMap demo (Semio Corp.) – – – – Phrase extraction Concept clustering (through co-occurrence) not by document Graphic navigation (link means concepts co-occur) Processing time: 90 minutes per gigabyte 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Major Data Mining Ideas • • • • There’s too much data We don’t understand what it means It can be handled without human intervention Relationships can be discovered automatically 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Privacy Policies • Privacy policy: – what data we collect – what we do with the data – how we control its release • A privacy policy is written in natural language, e.g. English • How can a machine understand it? • Answer: express the policy in a formal, machinereadable way • P3P for website policies • EPAL for enterprise policies 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS P3P • Platform for Privacy Preferences • Idea: allow the browser to control the release of data to a website • Website states its policy in machine-readable form (P3P) • User expresses his privacy preferences in machinereadable form • Browser compares Website policy with user preferences and controls release of data 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Simple HTTP Transaction Web Server GET /index.html HTTP/1.1 Host: www.att.com . . . Request web page HTTP/1.1 200 OK Content-Type: text/html . . . Send web page 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS SOURCE: LORRIE CRANOR HTTP With P3P 1.0 GET /w3c/p3p.xml HTTP/1.1 Host: www.att.com Request Policy Reference File Web Server Send Policy Reference File Request P3P Policy Send P3P Policy GET /index.html HTTP/1.1 Host: www.att.com . . . Request web page HTTP/1.1 200 OK Content-Type: text/html . . . Send web page 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS SOURCE: LORRIE CRANOR Assertions in a P3P policy • General assertions – Location of human-readable policies and opt-out mechanisms – discuri, opturi attributes of <POLICY> – Indication that policy is for testing only – <TEST> (optional) – Web site contact information – <ENTITY> – Access information – <ACCESS> – Information about dispute resolution – <DISPUTES> (optional) • Data-Specific Assertions – Consequence of providing data – <CONSEQUENCE> (optional) – Indication that no identifiable data is collected – <NON-IDENTIFIABLE> (optional) – How data will be used – <PURPOSE> – With whom data may be shared – <RECIPIENT> – Whether opt-in and/or opt-out is available – required attribute of <PURPOSE> and <RECIPIENT> – Data retention policy – <RETENTION> – What kind of data is collected – <DATA> 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS SOURCE: LORRIE CRANOR P3P/XML encoding Statement P3P version <POLICIES xmlns="http://www.w3.org/2002/01/P3Pv1"> <POLICY discuri="http://p3pbook.com/privacy.html" Location of name="policy"> human-readable P3P policy name <ENTITY> <DATA-GROUP> privacy policy <DATA Site’s ref="#business.contact-info.online.email">[email protected] name </DATA> and <DATA ref="#business.contact-info.online.uri">http://p3pbook.com/ contact </DATA> info <DATA ref="#business.name">Web Privacy With P3P</DATA> </DATA-GROUP> Access disclosure </ENTITY> Human-readable <ACCESS><nonident/></ACCESS> explanation <STATEMENT> <CONSEQUENCE>We keep standard web server logs.</CONSEQUENCE> <PURPOSE><admin/><current/><develop/></PURPOSE> How data may <RECIPIENT><ours/></RECIPIENT> be used <RETENTION><indefinitely/></RETENTION> <DATA-GROUP> Data recipients <DATA ref="#dynamic.clickstream"/> <DATA ref="#dynamic.http"/> Data retention policy </DATA-GROUP> </STATEMENT> Types of data collected </POLICY> 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS </POLICIES> SOURCE: LORRIE CRANOR P3P Increases Privacy Awareness http://www.att.com/accessatt/ • P3P clients can check a privacy policy each time it changes • P3P clients can check privacy policies on all objects in a web page, including ads and invisible images http://adforce.imgis.com/?adlink|2|68523|1|146|ADFORCE 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS SOURCE: LORRIE CRANOR P3P in IE6 Automatic processing of compact policies only; third-party cookies without compact policies blocked by default Privacy icon on status bar indicates that a cookie has been blocked – pop-up appears the first time the privacy icon appears 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS SOURCE: LORRIE CRANOR Users can click on privacy icon for list of cookies; privacy summaries are available at sites that are P3P-enabled 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS SOURCE: LORRIE CRANOR Typical Anonymizer Request Request Anonymizer Reply Reply Client Server • • • • Acts as a proxy for users Hides information from end servers Sees all web traffic Adds ads to pages (free service; subscription service also available) http://www.anonymizer.com SOURCE: LORRIE CRANOR 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Mixes Sender Destination B, C dest,msg kC kB kA Mix C msg dest,msg kC Mix A C dest,msg kC kB Mix B kX = encrypted with public key of Mix X Sender routes message randomly through network of “Mixes”, using layered public-key encryption. SOURCE: LORRIE CRANOR 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Crowds • Users join a Crowd of other users • Web requests from the crowd cannot be linked to any individual • Protection from – – – – end servers other crowd members system administrators eavesdroppers • First system to hide data shadow on the web without trusting a central authority http://www.research.att.com/projects/crowds/ SOURCE: LORRIE CRANOR 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS EPAL • P3P is designed for web privacy • EPAL describes internal privacy policies • Data Users: Individuals who are accessing or receiving data. Examples: nurse, doctor • Actions: Examples: create a customer record, read a record. • Data Categories: High-level descriptions of types of data. Examples: customer contact information, medical information • Purposes: How the data will be used by the recipient. Examples: marketing, fulfillment. • Conditions: Examples: age < 13, consent obtained. • Obligations: Additional steps to be taken when access occurs. Examples: log the access, delete data after 1 year if no additional business 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS EPAL • ALLOW [Data User] TO PERFORM [Action] ON [Data Type] FOR [Purpose] IF [Condition] AND CARRY OUT [Obligation] • DENY [Data User] TO PERFORM [Action] ON [Data Type] FOR [Purpose] IF [Condition] AND CARRY OUT [Obligation] 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS An EPAL Rule • English: “Email can be used for the book-of-month club only if consent has been given and age is more than 13.” <rule id="rule1" ruling="allow"> <data-user id="borderless-books"/> <data-category id="email"/> <purpose id="book-of-the-month-club"/> <action id="read"/> <condition id="consentToBookClub"/> <condition id="olderThan13"/> <obligation id="retention"> <parameter id="days">5</parameter> </obligation> </rule> 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS EPAL SOURCE: IBM, W3C 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Q&A 20-751 ECOMMERCE TECHNOLOGY SUMMER 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS