CSE4412 & CSE6412 3.0 Data Mining Instructor: Nick Cercone – 3050 LAS – [email protected], Tuesdays, Thursdays 1:00-2:30 – LAS 3033 Fall Semester, 2014

Introduction to the Data Mining course
What is data mining all about

CSE4412, CSE6412 – Data Mining
Instructor: Nick Cercone
TA: Nastaran Babanejad
WIKI: https://wiki.eecs.yorku.ca/course_archive/2014-15/F/4412/
Read Chapter 1 of Han et al. (textbook): Introduction
• 1.1 What Motivated Data Mining? Why Is It Important?
• 1.2 So, What Is Data Mining?
• 1.3 Data Mining: On What Kind of Data?
• 1.4 Data Mining Functionalities: What Kinds of Patterns Can Be Mined?
• 1.5 Are All of the Patterns Interesting?
• 1.6 Classification of Data Mining Systems
• 1.7 Data Mining Task Primitives
• 1.8 Integration of a Data Mining System with a Database or Data Warehouse System
• 1.9 Major Issues in Data Mining

CSE4412, CSE6412 – Requirements
Homework: small assignment (10%), big assignment (15%); not accepted late
In-class: presentation (10%)
Final Exam: undergraduates (25%), graduate students (10%)
Project: start early, takes lots of time (40%)
Paper: graduate students (15%)
Basic Knowledge: databases, statistics, algorithms, programming

CSE4412, CSE6412 – Data Mining Project
Project: Software implementation related to course subject matter
Should involve an original component or experiment
More later about available data and computing
resources
It’s going to be fun and hard work

CSE4412, CSE6412 – Projects
• Many projects deal with collaborative filtering (advice based on what is done by similar people)
• Others deal with engineering solutions to machine-learning problems
• Project suggestions on wiki
• Data and infrastructure available on wiki, elsewhere
• Projects require thought: (1) Tell important pages from unimportant (PageRank); (2) Tell real news from publicity (how?); (3) Distinguish positive from negative product reviews (how?); (4) Feature generation in ML; etc.

CSE4412, CSE6412 – Data Mining Projects
Working in pairs is okay, but no more than two per project.
Expect more from a pair than from an individual, and documentation should make clear who did what.
The effort should be roughly evenly distributed.

CSE4412, CSE6412 – Data Mining
Course Outline: see calendar on wiki

What is data mining all about
Data, Information, Knowledge, Understanding & Wisdom
The DIKW Pyramid, also known variously as the "DIKW Hierarchy", the "Wisdom Hierarchy", the "Knowledge Hierarchy", the "Information Hierarchy", and the "Knowledge Pyramid", refers loosely to a class of models for representing structural and/or functional relationships between data, information, knowledge, and wisdom.

What is data mining all about
Data are symbols or signs, representing stimuli or signals.
Data pervade our daily lives. Such data, properly interpreted and visualized, become information, which, once contextualized, turns into knowledge to be understood and applied as wisdom and (business and government) intelligence.

The DIKW Pyramid
[Figure: the DIKW pyramid, followed by five alternative depictions; images not reproduced]

What is data mining all about
• Data: symbols
• Information: data that are processed to be useful; provides answers to "who", "what", "where", and "when" questions
• Knowledge: application of data and information; answers "how" questions
• Understanding: appreciation of "why"
• Wisdom: evaluated understanding.

What is data mining (traditional)
“Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.”
“The application of specific algorithms for extracting patterns from data, it is a part of knowledge discovery from databases”
“Data mining is a process, not just a series of statistical analyses.”

Traditionally
Computer Science
• (Semi-)automated application of algorithms for pattern discovery
• Algorithms developed in the field of Artificial Intelligence (machine learning)
• Part of the process of knowledge discovery
Statistics
• Process of discovering patterns in data
• (Manual) application of a series of statistical techniques (among which machine learning)
• Incorporates:
  – Exploration
  – Sampling
  – Modeling
  – Validation

What can data mining do?
1. Classification: discrete outcomes (bird, cat, or fish)
2. Estimation: continuously valued outcomes (height, income, or weight)
3. Prediction: estimated future value (predicting the number of children in the next year, or which telephone subscribers will order three-way calling)
4. Affinity Grouping (market basket analysis): grouping which things go together (pizza and soft drink, coffee and coffee maid, or cat food and kitty litter)
5. Clustering: segmenting a heterogeneous population into a number of more homogeneous subgroups (a cluster of symptoms might indicate different diseases)
6.
Description: describing what is going on (women support Democrats in greater numbers than do men)

Differences between data mining and machine learning
1. ML is a broader field (not only learning from examples, but also reinforcement learning, evolutionary computing, …)
2. There is some overlap in the algorithms used and the problems addressed
3. Data mining is concerned with finding understandable knowledge, while ML is concerned with improving the performance of an agent (e.g. training a neural network to balance a pole is part of ML, but not data mining)
4. Data mining deals with large sets of real-world examples (realistic examples are normally very large and noisy)

Data Mining Application Examples
Areas where data mining has been applied recently include:
• Science, Astronomy, Bioinformatics, Drug Discovery, ...
• Business, Advertising, Customer modeling and Customer Relationship Management (CRM), e-Commerce, Fraud Detection, Targeted Marketing
• Health care
• Investments
• Manufacturing
• Sports/entertainment
• Telecom (telephone and communications)

Data Mining Application Examples (cont)
• Web: Search engines, bots, …
• Government, Anti-terrorism efforts (we will discuss controversy over privacy later), Law enforcement, Profiling tax cheaters
One of the most important and widespread business applications of data mining is Customer Modeling, also called Predictive Analytics. This includes tasks such as those listed next.

Predictive Analytics
• Predicting attrition or churn: find which customers are likely to terminate service
• Targeted marketing:
  – Customer acquisition: find which prospects are likely to become customers
  – Cross-sell: for a given customer and product, find which other product(s) they are likely to buy
• Credit risk: identify the risk that this customer will not pay back the loan or credit card
• Fraud detection: is this transaction fraudulent?
The largest users of Customer Analytics are industries such as banking, telecom, and retail, where businesses with large numbers of customers make extensive use of these technologies.

Customer Attrition: Case Study
Let's consider a case study of a mobile phone company.
The typical attrition (also called churn) rate for mobile phone customers is around 25-30% a year!
The task is:
• Given customer information for the past N months (N can range from 2 to 18), predict who is likely to attrite in the next month or two.
• Also, estimate customer value and the most cost-effective offer to make to this customer.

Customer Attrition: Case Study (cont)
• Verizon Wireless is the largest wireless service provider in the United States, with a customer base of 34.6 million subscribers (see http://www.kdnuggets.com/news/2003/n19/22i.html).
• Verizon built a customer data warehouse that
  – Identified potential attriters
  – Developed multiple, regional models
  – Targeted customers with high propensity to accept the offer
  – Reduced attrition rate from over 2%/month to under 1.5%/month (huge impact over 34 million subscribers)

Assessing Credit Risk: Case Study
Consider a situation where a person applies for a loan. Should a bank approve the loan?
Note: People who have the best credit don't need the loans, and people with the worst credit are not likely to repay. A bank's best customers are in the middle.
Banks develop credit models using a variety of machine learning methods.
Mortgage and credit card proliferation are the results of being able to successfully predict whether a person is likely to default on a loan.
Credit risk assessment is universally used in the US and widely deployed in most developed countries.
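
To put the monthly attrition figures from the Verizon case study into the same yearly terms as the "25-30% a year" figure, the following sketch converts a monthly churn rate into an annual one. It is an illustration only: it assumes the monthly rate is constant and applies independently each month, and the function names are hypothetical rather than anything from the case study.

```python
def annual_churn(monthly_rate: float) -> float:
    """Fraction of an initial customer base lost after 12 months,
    assuming the same monthly churn rate applies every month."""
    return 1.0 - (1.0 - monthly_rate) ** 12


def customers_retained_per_month(subscribers: float,
                                 old_rate: float,
                                 new_rate: float) -> float:
    """Extra customers kept in one month when churn drops from old_rate to new_rate."""
    return subscribers * (old_rate - new_rate)


if __name__ == "__main__":
    before = annual_churn(0.02)    # 2% per month
    after = annual_churn(0.015)    # 1.5% per month
    kept = customers_retained_per_month(34.6e6, 0.02, 0.015)
    print(f"annual churn before: {before:.1%}")
    print(f"annual churn after:  {after:.1%}")
    print(f"customers retained per month: {kept:,.0f}")
```

Under these assumptions, 2% a month compounds to roughly 21.5% a year and 1.5% a month to roughly 16.6% a year; on 34.6 million subscribers, the half-point monthly improvement corresponds to about 173,000 customers kept each month.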

Successful e-commerce - Case Study
• Amazon.com is the largest on-line retailer, which started with books and expanded into music, electronics, and other products. Amazon.com has an active data mining group, which focuses on personalization. Why personalization? Consider a person who buys a book (product) at Amazon.com.
• Task: Recommend other books (and perhaps products) this person is likely to buy
• Amazon's initial and quite successful effort used clustering based on books bought.

Successful e-commerce - Case Study (cont)
• For example, customers who bought "Advances in Knowledge Discovery and Data Mining", by Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, also bought "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations", by Witten and Frank.
• The recommendation program is quite successful, and more advanced programs are being developed.

Unsuccessful e-commerce - Case Study
• Of course, application of data mining is no guarantee of success, and during the Internet bubble of 1999-2000 we saw plenty of examples.
• Consider the legwear and legcare e-tailer Gazelle.com, whose clickstream and purchase data were the subject of the KDD Cup 2000 competition (http://www.ecn.purdue.edu/KDDCUP/)
• One of the questions was: Characterize visitors who spend more than $12 on an average order at the site

Unsuccessful e-commerce - Case Study (cont)
• The data included a dataset of 3,465 purchases by 1,831 customers
• Very interesting and illuminating analysis was done by dozens of Cup participants. The total time spent was thousands of hours, which would have been equivalent to millions of dollars in consulting fees.
• However, the total sales of Gazelle.com were only a few thousand dollars, and no amount of data mining could help them. Not surprisingly, Gazelle.com went out of business in Aug 2000.

Data Mining, Security & Fraud Detection
• There are currently numerous applications of data mining for security and fraud detection. One of the most common is Credit Card Fraud Detection. Almost all credit card purchases are scanned by special algorithms that identify suspicious transactions for further action. I recently received such a call from my bank when I used a credit card to pay for a journal published in England. This was an unusual transaction for me (first purchase in the UK on this card) and the software flagged it.
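
The credit card anecdote above (a first purchase in a new country gets flagged) can be illustrated with a very small rule-based check. This is a toy sketch, not how any bank's system actually works: the `Transaction` type, the `is_suspicious` function, and the z-score threshold of 3 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Sequence


@dataclass(frozen=True)
class Transaction:
    amount: float
    country: str


def is_suspicious(history: Sequence[Transaction],
                  new: Transaction,
                  z_threshold: float = 3.0) -> bool:
    """Flag a transaction that looks unlike the card's past activity.

    Two illustrative rules:
      1. the country has never appeared on this card before;
      2. the amount is more than z_threshold standard deviations
         away from the card's historical mean amount.
    """
    if not history:
        return True  # nothing to compare against, so ask for review

    if new.country not in {t.country for t in history}:
        return True

    amounts = [t.amount for t in history]
    mu = mean(amounts)
    sigma = pstdev(amounts)
    if sigma == 0:
        return new.amount != mu
    return abs(new.amount - mu) / sigma > z_threshold


if __name__ == "__main__":
    past = [Transaction(40.0, "CA"), Transaction(55.0, "CA"), Transaction(60.0, "US")]
    print(is_suspicious(past, Transaction(50.0, "GB")))   # new country
    print(is_suspicious(past, Transaction(50.0, "CA")))   # ordinary purchase
```

Real systems learn such rules from labelled transactions rather than hard-coding them; the point here is only that "unusual for this card" can be made operational with simple statistics over the card's history.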

Data Mining, Security & Fraud Detection (cont)
• Other applications include detection of money laundering: a notable system, called FAIS, was developed by Ted Senator for the US Treasury [Se96].
• The National Association of Securities Dealers (NASD), which runs NASDAQ, has developed a system called Sonar that uses data mining for monitoring insider trading and fraud through misrepresentation (http://www.kdnuggets.com/news/2003/n18/13i.html)
• Many telecom companies, including AT&T, Bell Atlantic, and British Telecom/MCI, have developed systems for catching phone fraud.

Data Mining, Security & Fraud Detection (cont)
• Data mining and security was in the headlines with US Government efforts on using data mining for terrorism detection, as part of the now-closed Total Information Awareness Program. However, the problem of terrorism is unlikely to go away soon, and US government efforts are continuing.
• Less controversial is the use of data mining for bio-terrorism detection, as was done at the Salt Lake Olympics 2002 (the only thing that was found was a small outbreak of tropical diseases).

Problems Suitable for Data Mining
The areas where data mining applications are likely to be successful have these characteristics:
• require knowledge-based decisions
• have a changing environment
• have sub-optimal current methods
• have accessible, sufficient, and relevant data
• provide high payoff for the right decisions
• Also, if the problem involves people, then proper consideration should be given to privacy

Knowledge Discovery
We define Knowledge Discovery in Data (KDD) as the nontrivial process of identifying
• valid
• novel
• potentially useful
• and ultimately understandable
patterns in data
(from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, Chapter 1, AAAI/MIT Press 1996)

Knowledge Discovery (cont)
• Knowledge Discovery is an interdisciplinary field, which builds upon a foundation provided by databases and statistics and applies methods from machine learning and visualization in order to find useful patterns.
• Data mining has much in common with Statistics and with Machine Learning. However, there are differences.
• Statistics provides a theory for dealing with randomness and tools for testing hypotheses. Statistics does not study topics such as data preprocessing or results visualization, which are part of data mining.

Knowledge Discovery (cont)
• Machine learning has a more heuristic approach and focuses on improving the performance of a learning agent. ML encompasses other areas, such as real-time learning and robotics, which are not part of data mining.
• Data Mining and Knowledge Discovery integrates theory and heuristics, focusing on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results.

Knowledge Discovery Process
• Knowledge Discovery emphasizes process. KDD is not a single-step solution of applying a machine learning method to a dataset, but a continuous process with loops and feedback. This process has been formalized by an industry group called CRISP-DM, which stands for CRoss Industry Standard Process for Data Mining

Knowledge Discovery Process (cont)
The main steps in the KD process include:
1. Business (or Problem) Understanding
2. Data Understanding
3. Data Preparation (including data cleaning & preprocessing)
4. Modeling (applying machine learning and data mining algorithms)
5. Evaluation (checking the performance of these algorithms)
6. Deployment
7. Monitoring
See www.crisp-dm.org for more information on CRISP-DM.

Historical Note: Many names of Data Mining
• Data Mining and Knowledge Discovery has many names.
• In the 1960s, statisticians used terms like "Data Fishing" or "Data Dredging" to refer to what they considered the bad practice of analyzing data without an a priori hypothesis.
• The term "Data Mining" appeared around 1990 in the database community. The phrase "database mining"™ was trademarked by HNC, and researchers turned to "data mining". Other terms used include Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, etc.

Historical Note: Many names of Data Mining (cont)
• Gregory Piatetsky-Shapiro coined the term "Knowledge Discovery in Databases" for the first workshop on the same topic (1989) and this term became popular.
• The term data mining became more popular in the business community and in the press.
• Data Mining and Knowledge Discovery are used interchangeably, and we use these terms as synonyms.

Data Mining Tasks
• Data mining is about many different types of patterns, and there are correspondingly many types of data mining tasks. Some of the most popular are
• Classification: predicting an item class
• Clustering: finding clusters in data
• Associations: e.g.
A & B & C occur frequently
• Visualization: to facilitate human discovery
• Summarization: describing a group
• Deviation Detection: finding changes
• Estimation: predicting a continuous value
• Link Analysis: finding relationships

Brief History of Data Mining
• The term "Data mining" was introduced in the 1990s, but data mining is the evolution of a field with a long history.
• Data mining roots are traced back along three family lines: classical statistics, artificial intelligence, and machine learning.
• Statistics is the foundation of most technologies on which data mining is built, e.g. regression analysis, standard distribution, standard deviation, standard variance, discriminant analysis, cluster analysis, and confidence intervals. All of these are used to study data and data relationships.

Brief History of Data Mining (cont)
• Artificial intelligence (AI) attempts to apply human-thought-like processing to statistical problems. Certain AI concepts were adopted for commercial products, e.g., query optimization modules for Relational DB Management Systems
• Machine learning combines statistics and AI, because it blends AI heuristics with advanced statistical analysis. Machine learning techniques learn about the data they study, such that programs make different decisions based on the qualities of the studied data, using statistics for fundamental concepts, and adding more advanced AI heuristics and algorithms to achieve goals.

Brief History of Data Mining (cont)
• Data mining, in many ways, is fundamentally the adaptation of machine learning techniques to business applications. Data mining is best described as the union of historical and recent developments in statistics, AI, and machine learning. These techniques are then used together to study data and find previously hidden trends or patterns within.

Computer science conferences on data mining
• CIKM Conference – ACM Conference on Information and Knowledge Management
• DMIN Conference – International Conference on Data Mining
• DMKD Conference – Research Issues on Data Mining and Knowledge Discovery
• ECDM Conference – European Conference on Data Mining
• ECML-PKDD Conference – European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
• EDM Conference – International Conference on Educational Data Mining
• ICDM Conference – IEEE International Conference on Data Mining
• KDD Conference – ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Computer science conferences on data mining (cont)
• MLDM Conference – Machine Learning and Data Mining in Pattern Recognition
• PAKDD Conference – The annual Pacific-Asia Conference on Knowledge Discovery and Data Mining
• PAW Conference – Predictive Analytics World
• SDM Conference – SIAM International Conference on Data Mining
• SSTD Symposium – Symposium on Spatial
and Temporal Databases
• WSDM Conference – ACM Conference on Web Search and Data Mining
Data mining topics are also present at many data management/database conferences, such as the ICDE Conference, the SIGMOD Conference, and the International Conference on Very Large Data Bases

Data mining involves 6 common classes of tasks:
• Anomaly detection (Outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or data errors that require further investigation.
• Association rule learning (Dependency modeling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as “market basket analysis”.
• Clustering – The task of discovering groups and structures in the data that are in some way or another similar, without using known structures in the data

Data mining involves 6 common classes of tasks: (cont)
• Classification – The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as “legitimate” or “spam”.
• Regression – Attempts to find a function that models the data with the least error.
• Summarization – Providing a more compact representation of the data set, including visualization and report generation.
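
As a concrete illustration of the association rule learning task listed above, the sketch below counts item pairs in a handful of shopping baskets and reports rules of the form "A → B" using the two standard measures: support (fraction of baskets containing both items) and confidence (fraction of baskets with A that also contain B). The baskets, thresholds, and the `pair_rules` function are made up for illustration; practical miners such as Apriori also handle itemsets larger than pairs.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) for every pair rule lhs -> rhs
    that meets both thresholds. Only single-item rules are considered."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))               # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


if __name__ == "__main__":
    baskets = [
        {"pizza", "soft drink"},
        {"pizza", "soft drink", "coffee"},
        {"cat food", "kitty litter"},
        {"pizza", "coffee"},
    ]
    for lhs, rhs, support, confidence in pair_rules(baskets):
        print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

With these toy baskets, "cat food → kitty litter" is perfectly reliable but appears in only one basket in four, so the support threshold filters it out; tuning the two thresholds is how an analyst trades coverage against reliability.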

Data mining consists of five major elements:
• Extract, transform, and load transaction data onto the data warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information technology professionals.
• Analyze the data by application software.
• Present the data in a useful format, such as a graph or table.

Different levels of analysis are available:
• Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
• Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
• Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits, while CHAID segments using chi square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
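
To make the "2-way split" idea concrete, here is a minimal sketch of how a CART-style tree could choose one binary split on a single numeric attribute. It uses Gini impurity, the criterion commonly associated with CART for classification; the function names and the toy data are assumptions for illustration, and a real implementation would search all attributes and recurse on each side.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a collection of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(values, labels):
    """Find the threshold t that minimises the weighted Gini impurity of the
    two groups {value <= t} and {value > t}.

    Returns (threshold, weighted_impurity), or (None, inf) if all values are equal.
    """
    pairs = list(zip(values, labels))
    n = len(pairs)
    distinct = sorted(set(values))
    best_threshold, best_impurity = None, float("inf")
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


if __name__ == "__main__":
    months_as_customer = [1, 2, 3, 10, 11, 12]
    churned = ["yes", "yes", "yes", "no", "no", "no"]
    print(best_binary_split(months_as_customer, churned))
```

The resulting rule ("if months_as_customer <= threshold then ... else ...") is exactly the kind of if-then rule the slide says can be applied to new, unclassified records.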

Different levels of analysis are available: (cont)
• Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
• Rule induction: The extraction of useful if-then rules from data based on statistical significance.
• Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.

Data Mining: Issues
• One of the key issues raised by data mining technology is not a business or technological one, but a social one. It is the issue of individual privacy. Data mining makes it possible to analyze routine business transactions and glean a significant amount of information about individuals' buying preferences.
• Another issue is that of data integrity. Clearly, data analysis can only be as good as the data that is being analyzed. A key implementation challenge is integrating conflicting or redundant data from different sources. For example, a bank may maintain credit card accounts on several different databases. The addresses (or even the names) of a single cardholder may be different in each. Software must translate data from one system to another and select the address most recently entered.
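
The bank example above (several databases holding different addresses for the same cardholder) can be sketched as a small reconciliation step that keeps the most recently entered address per customer. The record layout (`customer_id`, `address`, `entered_on`) and the function name are hypothetical; real integration also has to match customers whose names or identifiers differ between systems, which this sketch does not attempt.

```python
from datetime import date


def latest_address(records):
    """Given address records pooled from several source systems, return a
    dict mapping each customer_id to its most recently entered address."""
    latest = {}
    for rec in records:
        cid = rec["customer_id"]
        if cid not in latest or rec["entered_on"] > latest[cid]["entered_on"]:
            latest[cid] = rec
    return {cid: rec["address"] for cid, rec in latest.items()}


if __name__ == "__main__":
    pooled = [
        {"customer_id": 17, "address": "12 Elm St", "entered_on": date(2012, 3, 1)},
        {"customer_id": 17, "address": "98 Oak Ave", "entered_on": date(2014, 6, 15)},
        {"customer_id": 42, "address": "5 Main St", "entered_on": date(2013, 1, 9)},
    ]
    print(latest_address(pooled))
```

"Most recent wins" is only one possible conflict-resolution policy; it is used here because it is the one the slide names.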

Data Mining: Issues (cont)
• A technical issue is whether it is better to set up a relational database or a multidimensional structure. In a relational database, data is stored in tables, permitting ad hoc queries. In a multidimensional structure, sets of cubes are arranged in arrays, with subsets created according to category. While multidimensional structures facilitate multidimensional data mining, relational structures have performed better in client/server environments.
• Finally, there is cost. Although system hardware costs have dropped dramatically, data mining and data warehousing tend to be self-reinforcing. The more powerful the data mining queries, the greater the utility of the information being gleaned from the data, and the greater the pressure to increase the amount of data being collected and maintained, which increases the pressure for faster, more powerful data mining queries. This increases pressure for larger, faster systems, which are more expensive.

Next Class
• Course Introduction (continued): data mining terminology and introduction to data mining concepts.
• Data Preprocessing: Why Preprocess the Data?; Descriptive Data Summarization; Data Cleaning; Data Integration and Transformation; Data Reduction; Data Discretization and Concept Hierarchy Generation

Concluding Remarks
The Road to Wisdom
The road to wisdom?
– Well, it's plain and simple to express:
Err and err and err again but less and less and less.