Data Mining
CS 341, Spring 2007

Prof. Xiaoyan Li
Visiting Assistant Professor of Computer Science
Mount Holyoke College

Course Information
• Instructor: Xiaoyan Li
• Lecture: Mon & Wed, 2:40pm – 3:55pm
  – Room: Kendade Hall 107
• Office hour: Tu/Th 10:00am – 11:00am (or by appointment)
  – Office: Clapp 227
  – Email: [email protected]

© Prentice Hall

Course Information
• Textbook
  – Data Mining: Introductory and Advanced Topics
    » by Margaret H. Dunham, ISBN 0-13-088892-3
• Topics
  – Related Concepts & Basic Techniques
  – Core Topics
    » Classification, Clustering and Association Rules
  – Advanced Topics
    » Web Mining, Spatial Mining & Temporal Mining

Course Structure
• The course is divided into 3 parts
  – Related concepts and basic techniques
  – Core topics
    » Classification, clustering, association rules
  – Perl programming language, final projects
• The first 2/3 are lectures; the remaining 1/3 are seminars.

Tentative schedule: CS-341 Data Mining

Grading
• Class participation: 20%
• Four homework assignments: 20%
• One midterm: 20%
• One final project: 40%

Some slides are adapted from:
DATA MINING: Introductory and Advanced Topics, Part I
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Companion slides for the text by Dr. M.H. Dunham, Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.

Introduction Outline
Goal: Provide an overview of data mining.
• Define data mining
• Basic data mining tasks
• Data mining vs. database & KDD
• Data mining development
• Data mining issues

Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated information
• How?

Data Mining Definition
• Finding hidden information in a database
• Similar terms
  – Exploratory data analysis
  – Data driven discovery
  – Deductive learning

UNCOVER HIDDEN INFORMATION: DATA MINING

Example 1.1
• A credit card company must determine whether to authorize credit card purchases.
• Four classes:
  – 1) Authorize
  – 2) Ask for further identification before authorization
  – 3) Do not authorize
  – 4) Do not authorize but contact police
• How to classify a purchase?
  – Examine historical data and determine how the data fit into the four classes.
  – Apply the model to new purchases.

Data Mining Algorithm
• Purpose: Fit data to a model
• Preference
  – Criteria to choose the best model
• Search
  – Technique to search the data

Data Mining Models
• Predictive:
  – A predictive model makes a prediction about values of data using known results found from different data.
• Descriptive:
  – A descriptive model identifies patterns or relationships in data.

[Figure: Data Mining Models and Tasks]

Basic Data Mining Tasks
• Classification maps data into predefined groups or classes
  – Pattern recognition
• Example 1.1 is a general classification problem
• Example 1.2 is an example of pattern recognition
  – Airport screening is used to determine whether passengers are potential terrorists or criminals.
  – Basic patterns: distance between eyes, size and shape of mouth, etc.

Basic Data Mining Tasks
• Regression is used to map a data item to a real valued prediction variable.
  – Assume some known type of function (e.g. linear) and select the best one.
• Example 1.3
  – A college professor wishes to reach a certain level of savings before her retirement.
  – She predicts what her retirement savings will be based on their current value and several past values.
  – She uses a linear regression formula to predict her retirement savings.
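
A minimal sketch of Example 1.3, assuming an ordinary least-squares line fitted to a few past savings values. The data, the year indexing, and the helper names `fit_line` and `predict` are hypothetical illustrations, not taken from the slides or the textbook.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Evaluate the fitted line at x."""
    return intercept + slope * x


# Hypothetical history: year index -> savings in thousands of dollars.
years = [0, 1, 2, 3, 4]
savings = [50.0, 60.0, 70.0, 80.0, 90.0]

intercept, slope = fit_line(years, savings)
forecast = predict(intercept, slope, 10)  # projected savings at year 10
print(f"slope={slope:.2f}, intercept={intercept:.2f}, forecast={forecast:.2f}")
```

The "known type of function" here is a straight line, and "select the best one" means choosing the slope and intercept that minimize squared error on the past values. Real savings data would rarely be this regular, so the forecast is only as good as the linear assumption.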
Basic Data Mining Tasks (cont'd)
• Clustering groups similar data together into clusters. (The clusters are not predefined.)
• Example 1.6
  – A department store chain creates special catalogs targeted to various demographic groups based on attributes such as income, location, etc.

Basic Data Mining Tasks (cont'd)
• Summarization maps data into subsets with associated simple descriptions.
  – Characterization
  – Generalization
• Example 1.7
  – The average SAT score is one of the criteria used to compare universities by the U.S. News & World Report.

Basic Data Mining Tasks (cont'd)
• Link Analysis uncovers relationships among data.
  – Affinity analysis
  – Association rules: identify items that are frequently purchased together.
• Example 1.8
  – A grocery store retailer is trying to decide whether to put bread on sale.
  – Using association rules, he finds that 60% of the time that bread is sold, pretzels are also sold, and 70% of the time jelly is also sold.
  – Decisions?

Ex: Time Series Analysis
• Example: Stock Market
• Predict future values
• Determine similar patterns over time
• Classify behavior

Database Processing vs. Data Mining Processing
• Database processing
  – Query: well defined; SQL
  – Data: operational data
  – Output: precise; subset of database
• Data mining processing
  – Query: poorly defined; no precise query language
  – Data: not operational data
  – Output: fuzzy; not a subset of database

Data Mining vs. Database Processing: Query Examples
• Database
  – Find all credit applicants with last name of Smith.
  – Identify customers who have purchased more than $10,000 in the last month.
  – Find all customers who have purchased milk.
• Data Mining
  – Find all credit applicants who are poor credit risks. (classification)
  – Identify customers with similar buying habits. (clustering)
  – Find all items which are frequently purchased with milk. (association rules)

Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
• Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.
• Another opinion:
  – There is no difference.

KDD Process (modified from [FPSS96C])
• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format. Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to user in meaningful manner.

Data Mining Development
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines
• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques
• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures
• Neural Networks
• Decision Tree Algorithms

Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time

Database Perspective on Data Mining (what is a good data mining tool?)
• Scalability
• Real World Data
• Updates
• Ease of Use

Social Issues
• Privacy?

Announcements:
• Next Lecture:
  – Database, Decision Support System & Warehousing
• Reading assignments:
  – Chapter 2
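
Worked sketch for Example 1.8 (association rules): the 60% and 70% figures can be read as rule confidences, that is, the fraction of baskets containing bread that also contain pretzels or jelly. The baskets below are hypothetical data constructed to reproduce those percentages, and the helper names `support` and `confidence` are illustrative assumptions rather than anything defined in the slides.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with `antecedent` that also hold `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    both = sum(1 for t in with_antecedent if consequent in t)
    return both / len(with_antecedent)


# Hypothetical baskets: 10 contain bread (6 with pretzels, 7 with jelly),
# plus 2 baskets without bread.
baskets = (
    [{"bread", "pretzels", "jelly"}] * 4
    + [{"bread", "pretzels"}] * 2
    + [{"bread", "jelly"}] * 3
    + [{"bread"}]
    + [{"milk"}, {"pretzels", "milk"}]
)

print("conf(bread -> pretzels):", confidence(baskets, "bread", "pretzels"))
print("conf(bread -> jelly):   ", confidence(baskets, "bread", "jelly"))
print("support(bread, pretzels):", support(baskets, ["bread", "pretzels"]))
```

Confidence is conditioned on bread being in the basket, while support is measured over all baskets, so the two answer different questions when the retailer weighs the sale decision.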