Data Mining Overview
Professor P. Batchelor
Furman University

Overview
* Introduction
* Explanation of Data Mining
* Techniques
* Advantages
* Applications
* Privacy

Data Mining
What is Data Mining?
* “The process of semi-automatically analyzing large databases to find useful patterns” (Silberschatz)
* KDD – “Knowledge Discovery in Databases”
* “Attempts to discover rules and patterns from data”
* Discover rules
* Make predictions

Areas of Use
* Internet – Discover needs of customers
* Economics – Predict stock prices
* Science – Predict environmental change
* Medicine – Match patients with similar problems to find a cure

Example of Data Mining
A credit card company wants to discover information about its clients from databases. It wants to find:
* Clients who respond to promotions in “junk mail”
* Clients who are likely to change to a competitor
* Clients who are likely not to pay
* Services that clients use, in order to promote services affiliated with the credit card company
* Anything else that may help the company provide or promote services to help its clients and ultimately make more money

Data Mining & Data Warehousing
Data Warehouse: “is a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site.” (Silberschatz)
* Collect data
* Store in a single repository
* Allows for easier query development, as a single repository can be queried
Data Mining: Analyzing databases or data warehouses to discover patterns in the data and so gain knowledge. Knowledge is power.

Discovery of Knowledge

Data Mining Techniques
* Classification
* Clustering
* Regression (we have already looked at this)
* Association Rules

Classification
Classification: Given a set of items that each belong to one of several classes, and given past instances (training instances) with their associated classes, classification is the process of predicting the class of a new item.
Classify the new item and identify to which class it belongs.
Example: A bank wants to classify its home loan customers into groups according to their response to bank advertisements. The bank might use the classifications “Responds Rarely”, “Responds Sometimes” and “Responds Frequently”. The bank will then attempt to find rules about the customers that respond Frequently and Sometimes. The rules could be used to predict the needs of potential customers.

Technique for Classification
Decision-Tree Classifiers
[Figure: decision tree predicting the credit risk of a person with the jobs specified. The root node splits on Job (Engineer, Carpenter, Doctor); each branch leads to an Income node whose leaves are Bad or Good. Income splits shown: <30K Bad, >50K Good; <40K Bad, >90K Good; <50K Bad, >100K Good.]

Clustering
“Clustering algorithms find groups of items that are similar. … It divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other.”
Example: An insurance company could use clustering to group clients by their age, location and types of insurance purchased.
The categories are unspecified, and this is referred to as ‘unsupervised learning’.

Clustering
* Group data into clusters
* Similar data is grouped in the same cluster
* Dissimilar data is grouped in a different cluster
How is this achieved?
* Hierarchical: Group data into t-trees
* K-Nearest Neighbor: A classification method that classifies a point by calculating the distances between the point and the points in the training data set. It then assigns the point to the class that is most common among its k nearest neighbors (where k is an integer).

Association Rules
“An association algorithm creates rules that describe how often events have occurred together.”
Example: When a customer buys a hammer, then 90% of the time they will buy nails.

Association Rules
Support: “is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule”
Example: People who buy hotdog buns also buy hotdog sausages in 99% of cases.
= High support
People who buy hotdog buns buy hangers in 0.005% of cases. = Low support
Situations where there is high support for the antecedent are worth careful attention.
E.g. Hotdog sausages should be placed near hotdog buns in supermarkets if there is also high confidence.

Association Rules
Confidence: “is a measure of how often the consequent is true when the antecedent is true.”
Example: 90% of hotdog bun purchases are accompanied by hotdog sausages.
High confidence is meaningful as we can derive rules: Hotdog bun => Hotdog sausage
Two rules may have different confidence levels and have the same support. E.g. Hotdog sausage => Hotdog bun may have a much lower confidence than Hotdog bun => Hotdog sausage, yet they both can have the same support.

Advantages of Data Mining
* Provides new knowledge from existing data
  * Public databases
  * Government sources
  * Company databases
* Old data can be used to develop new knowledge
* New knowledge can be used to improve services or products
* Improvements lead to:
  * Bigger profits
  * More efficient service

Uses of Data Mining
* Sales/Marketing
  * Diversify target market
  * Identify clients’ needs to increase response rates
* Risk Assessment
  * Identify customers that pose high credit risk
* Fraud Detection
  * Identify people misusing the system. E.g.
People who have two Social Security Numbers
* Customer Care
  * Identify customers likely to change providers
  * Identify customer needs

Applications of Data Mining
Source: IDC 1998

Privacy Concerns
* Effective data mining requires large sources of data
* To achieve a wide spectrum of data, link multiple data sources
* Linking sources can be problematic for privacy. If the following histories of a customer were linked:
  * Shopping history
  * Credit history
  * Bank history
  * Employment history
  the user’s life story could be painted from the collected data

Linking to Re-identify Data
* Medical Data: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge
* Voter List: Name, Address, Date registered, Party affiliation, Date last voted
* Shared by both: ZIP, Birth date, Sex
L. Sweeney. Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicine and Ethics. 1997, 25:98-110.
{date of birth, gender, 5-digit ZIP} uniquely identifies 87.1% of the U.S. population.

Perceived Concerns
* Data mining lets you find out about my private life
  * I don’t want you, my insurance company, the government knowing everything
* Data mining doesn’t always get it right
  * I don’t want to be put in jail because data mining said so
  * I don’t want to be denied credit, a job, insurance because data mining said so

Real Concerns
* Data mining lets you find out about my private life
  * Learned models allow conjectures
  * Learning the model requires collecting data
* Data mining doesn’t always get it right
  * Our legal system is supposed to ensure due process
  * Data mining typically allows businesses to take risks they otherwise wouldn’t
    * Identify people we can give instant credit
  * But without data mining, decisions would be slower and probably more restrictive
    * Why is credit so easy to get, even though bankruptcies are up?

Data Mining and Terrorism
Total Information Awareness (TIA)
The Information Awareness Office (IAO) was established by the Defense Advanced Research Projects Agency in January 2002 to bring together several DARPA projects focused on applying surveillance and information technology to track and monitor terrorists and other threats to U.S. national security, by achieving Total Information Awareness (TIA).
Following public criticism that the development and deployment of this technology could potentially lead to a mass surveillance system, the IAO was defunded by Congress in 2003. However, several IAO projects continued to be funded and were merely run under different names.

Evidence Extraction and Link Discovery
* Development of technologies and tools for automated discovery, extraction and linking of sparse evidence contained in large amounts of classified and unclassified data sources (such as phone call records from the NSA call database, internet histories, or bank records).
* Design systems with the ability to extract data from multiple sources (e.g., text messages, social networking sites, financial records, and web pages).
* Detect patterns comprising multiple types of links between data items or people communicating (e.g., financial transactions, communications, travel, etc.).
* Designed to link items relating to potential "terrorist" groups and scenarios, and to learn patterns of different groups or scenarios to identify new organizations and emerging threats.

Scalable Social Network Analysis
Aimed at developing techniques based on social network analysis for modeling the key characteristics of terrorist groups and discriminating these groups from other types of societal groups.
Sean McGahan, of Northeastern University, said the following in his study of SSNA:
The purpose of the SSNA algorithms program is to extend techniques of social network analysis to assist with distinguishing potential terrorist cells from legitimate groups of people ...
In order to be successful SSNA will require information on the social interactions of the majority of people around the globe. Since the Defense Department cannot easily distinguish between peaceful citizens and terrorists, it will be necessary for them to gather data on innocent civilians as well as on potential terrorists.
Does this worry you or make you feel more secure?

Human ID project
The Human Identification at a Distance (HumanID) project developed automated biometric identification technologies to detect, recognize and identify humans at great distances for "force protection", crime prevention, and "homeland security/defense" purposes. Its goals included programs to:
* Develop algorithms for locating and acquiring subjects out to 150 meters (500 ft) in range.
* Fuse face and gait recognition into a 24/7 human identification system.
* Develop and demonstrate a human identification system that operates out to 150 meters (500 ft) using visible imagery.
* Develop a low power millimeter wave radar system for wide field of view detection and narrow field of view gait classification.
* Characterize gait performance from video for human identification at a distance.
* Develop a multi-spectral infrared and visible face recognition system.

Solutions
* Data mining lets you find out about my private life
  * Privacy-preserving data mining
* Data mining doesn’t always get it right
  * Data scientists know it and are working on it
  * Educate the user

Privacy-Preserving Data Mining
Data Perturbation
* Construct a data set with noise added
  * Can be released without revealing private data
* Miners given the perturbed data set
  * Reconstruct distribution to improve results
* Solutions out there
  * Decision trees, association rules
* Debate: Does it really preserve privacy?
  * Can we prove impossibility of noise removal?
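
The data perturbation idea above (add noise, release the noisy data, then reconstruct the distribution) can be illustrated with a minimal sketch. It is not the method used in any particular paper: it assumes zero-mean Gaussian noise with a publicly known standard deviation, and it "reconstructs" only the mean and spread of the original values rather than the full distribution. The attribute (customer ages), the sample size and the names `perturb`, `estimate_original_stats` and `NOISE_SD` are all hypothetical.

```python
import random
import statistics


def perturb(values, noise_sd, rng):
    """Return a copy of values with independent zero-mean Gaussian noise added."""
    return [v + rng.gauss(0.0, noise_sd) for v in values]


def estimate_original_stats(perturbed, noise_sd):
    """Estimate the mean and standard deviation of the original data.

    Because the noise is zero-mean and independent of the data:
        mean(perturbed) ~ mean(original)
        var(perturbed)  ~ var(original) + noise_sd ** 2
    so the known noise variance can be subtracted back out.
    """
    mean = statistics.fmean(perturbed)
    variance = statistics.pvariance(perturbed) - noise_sd ** 2
    return mean, max(variance, 0.0) ** 0.5


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical private attribute: ages of 10,000 customers.
ages = [rng.uniform(20, 70) for _ in range(10_000)]

NOISE_SD = 15.0  # assumed to be public knowledge
released = perturb(ages, NOISE_SD, rng)  # this is what the miner receives

est_mean, est_sd = estimate_original_stats(released, NOISE_SD)
true_mean, true_sd = statistics.fmean(ages), statistics.pstdev(ages)

print(f"true mean {true_mean:.2f}, estimated from perturbed data {est_mean:.2f}")
print(f"true sd   {true_sd:.2f}, estimated from perturbed data {est_sd:.2f}")
```

Each released value can be far from the true age, yet the aggregate statistics remain recoverable. The debate on the slide is about how much an attacker can recover about individual records by removing the noise, which this sketch does not address.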
Privacy-Preserving Data Mining
Distributed Data Mining
* Data owners keep their data
* Encryption techniques to preserve privacy
* Collaborate to get data mining results
* Proofs that private data is not disclosed
* Solutions for decision trees, association rules, clustering
* Different solutions needed depending on how the data is distributed and on the privacy constraints

What Next?
* Data mining lets you find out about my private life
  * Constraints that allow us to restrict what models can be learned
  * Can we ensure that data mining won’t produce results that are amenable to misuse? (e.g., 100% confidence models)
  * Redlining example
* Data mining doesn’t always get it right
  * Educate the public
  * What data mining does (and doesn’t do)

Do You Agree?
* There is a great difference between an inanimate machine knowing your secrets and a person knowing the same.
* Political solutions can control how and why information goes from the machine to trusted analysts who can act on the knowledge.