COMP1942 Exploring and Visualizing Data
Overview
Prepared by Raymond Wong
Presented by Raymond Wong
raywong@cse

Course Details
  Instructor: Dr. Raymond Wong
  TA: Kai Ho CHAN, Dandan LIN, Junqiu WEI
  Webpage: http://course.cse.ust.hk/comp1942/

  Lecture
    Time: Monday (1:30pm - 2:50pm) and Friday (9:00am - 10:20am)
    Venue: G010 (CYT Building)

  Tutorial (will be announced via email)
    Time: Monday (12:30pm-1:20pm)
    Venue: Room 5583 (LT 29-30) (Academic Building) or CSE Lab 3 (Rm 4213 (Academic Building))
    Time: Tuesday (12:30pm-1:20pm)
    Venue: Rm 2302 (LT 17-18) (Academic Building) or CSE Lab 3 (Rm 4213 (Academic Building))

  Textbook
    Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. Galit Shmueli, Nitin R. Patel and Peter C. Bruce, Wiley 2010 (2nd edition)

  Reference books/materials
    Data Mining: Concepts and Techniques. Jiawei Han, Micheline Kamber and Jian PEI. Morgan Kaufmann Publishers (3rd edition)
    Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Boston: Pearson Addison Wesley (2006)

Common Core Requirement
  My ability to use quantitative methods to define, analyze and solve problems in daily life has been enhanced.
  I am more able to process quantitative data and to use the data to reach a conclusion in a logical way.
  The course has aroused my interest in learning more about mathematical models or quantitative methods.

Grading Scheme
  Assignment 10%
  Project 20%
  In-class Participation 10%
  Mid-Term Exam 20%
  Final Exam 40%

Assignment
  2 assignments
    Assignment 1: Content before the mid-term exam
    Assignment 2: Content after the mid-term exam
  NOTE: No late submissions are allowed.
  If a student answers a selected question in class correctly, I will give him/her a coupon for each correct answer.
  This coupon can be used to waive one question in an assignment, which means that s/he can get full marks for this question without answering it.

Assignment Guideline
  For each assignment, each student can waive at most one question.
    S/he can waive any question s/he wants and obtain full marks for this question (no matter whether s/he answers this question or not).
    S/he may also answer this question. We will still mark it, but will give full marks for it.
  When submitting the assignment, please
    staple the coupon to the submitted assignment
    write down on the coupon the number of the question to be waived

Project
  Phase 1 (Excel file)
  Phase 2 (Design Report)
  Phase 3 (Final Report and Output files)

  You are required to form a group.
    Each group contains 1 or 2 members. A 3-member group is NOT allowed.
  Please fill in the following information for each member at https://goo.gl/forms/WguJdkHO8TkpOFYn1
    student ID
    student name
    Email
  Each group needs to submit the grouping information only ONCE.
  The group forming deadline is 15 Feb (Wed) 1pm.

  Data Mining Tool: XLMiner (in MS Excel)
    Installed in CSE Lab 3 (Rm 4213)
    All non-CSE and non-CPEG students need to apply for a CSE account. You can see the details on our course webpage.

  In Phase 3 (the last phase), you are required to hand in some output files.
    We will check the output files.
    You can use at most one coupon to obtain full marks for all output files.
    Each group can use at most one coupon.
    Please staple your coupon to your final report.
In-class Participation
  In each lesson, you are required to bring one of the following with you:
    your smart phone with iPRS (Internet enabled Personal Response System) installed, or
    your PRS device

  If you have a smart phone (Android/iOS), please install the app called “HKUST iLearn” on it.
  If you do not have a smart phone, you have to borrow a PRS device.
    Please visit the ITSC Service Desk at Rm 2021 (Lift 2) to borrow one.

  In each lesson, you may be asked some multiple-choice questions (e.g., 1-3 questions).
  You have to use your iPRS to answer the questions.

  You can obtain 1 unit for in-class participation when you answer a question in class with your iPRS (no matter whether you answer it correctly or not).
  Those questions may be in the midterm exam and the final exam.

  In some cases,
    some students may be absent from class for some reason
    the iPRS system may fail to record your answer (e.g., your smart device or the iPRS system crashes)
  You are required to obtain 20 units in order to obtain the full score (10%) for in-class participation.
  We will give at least 25 questions in the course.

Midterm and Final Exam
  You are allowed to bring a calculator with you.
  Please remember to prepare a calculator for the exam.

Midterm Exam
  In-class Midterm
  Date: 17 March, 2017 (Fri)
  Time: 9:00-10:20
  Venue: G010 (CYT Building)
         Rm 5619 (LT 31/32) (Academic Building)

Major Topics
  In this course, you are expected to learn something related to “Exploring and Visualizing Data”.
  Not only this! In this course, you are expected to learn how to solve problems and how to analyze problems.
  This is very important to your future.

Major Topics
  1. Association
  2. Clustering
  3. Classification
  4. Data Warehouse
  5. Dimension Reduction
  6. Web Databases

1. Association

  Customer   Apple    Orange   Milk
  Raymond    Apple    Orange
  Ada                 Orange   Milk
  Grace      Apple    Orange
  …          …        …        …

  Items/Itemsets     Frequency
  Apple              2          Frequent Pattern (or Frequent Item)
  Orange             3          Frequent Pattern (or Frequent Item)
  Milk               1
  {Apple, Orange}    2          Frequent Pattern (or Frequent Itemset)
  {Orange, Milk}     1
  …

  We are interested in the items/itemsets with frequency >= 2.

  Association Rules:
    1. Apple → Orange (100% of the customers who buy apple also buy orange.)
    2. Orange → Apple (67% of the customers who buy orange also buy apple.)

  Problem: to find all frequent patterns and association rules

2. Clustering

  Student    Computer   History
  Raymond    100        40
  Louis      90         45
  Wyman      20         95
  …          …          …

  [Figure: scatter plot with axes Computer and History]
    Cluster 1 (e.g., high score in Computer and low score in History)
    Cluster 2 (e.g., high score in History and low score in Computer)

  Problem: to find all clusters

3. Classification

  Suppose there is a person.

  Race     Income   Child   Insurance
  white    high     no      ?

  Decision tree:
    root
      child=yes: 100% Yes, 0% No
      child=no:
        Income=high: 100% Yes, 0% No
        Income=low: 0% Yes, 100% No
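The frequencies and rule percentages in the Association example can be reproduced with a few lines of Python. This is only a minimal sketch, not the method taught in the course: it assumes the three baskets implied by the example (Raymond: Apple, Orange; Ada: Orange, Milk; Grace: Apple, Orange), counts itemsets by brute force, and uses hypothetical names (`itemset_frequencies`, `rule_percentage`, `MIN_FREQUENCY`).

```python
from collections import Counter
from itertools import combinations

# Baskets assumed from the example table (one set of items per customer).
transactions = {
    "Raymond": {"Apple", "Orange"},
    "Ada": {"Orange", "Milk"},
    "Grace": {"Apple", "Orange"},
}
MIN_FREQUENCY = 2  # threshold used in the example: frequency >= 2


def itemset_frequencies(baskets, max_size=2):
    """Count how many baskets contain each itemset of size 1..max_size."""
    counts = Counter()
    for items in baskets:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                counts[frozenset(combo)] += 1
    return counts


def rule_percentage(counts, left, right):
    """Fraction of baskets containing `left` that also contain `right`."""
    both = counts[frozenset(left) | frozenset(right)]
    return both / counts[frozenset(left)]


counts = itemset_frequencies(transactions.values())
frequent = {s: n for s, n in counts.items() if n >= MIN_FREQUENCY}
apple_to_orange = rule_percentage(counts, {"Apple"}, {"Orange"})
orange_to_apple = rule_percentage(counts, {"Orange"}, {"Apple"})

for itemset, n in sorted(frequent.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), n)
print(f"Apple -> Orange: {apple_to_orange:.0%}")
print(f"Orange -> Apple: {orange_to_apple:.0%}")
```

Brute-force enumeration is fine for three baskets; it grows quickly with the number of items, which is why finding all frequent patterns is posed as a problem in its own right.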
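The decision tree in the Classification example can also be written out as plain conditionals. The sketch below is an illustration only, with a hypothetical function name (`predict_insurance`); it hard-codes the example tree rather than learning it from data, and it ignores Race because the example tree never splits on that attribute.

```python
def predict_insurance(child, income):
    """Follow the example tree: split on Child first, then on Income."""
    if child == "yes":
        return "Yes"      # leaf: 100% Yes, 0% No
    if income == "high":
        return "Yes"      # leaf: 100% Yes, 0% No
    if income == "low":
        return "No"       # leaf: 0% Yes, 100% No
    raise ValueError(f"income value not covered by the example tree: {income!r}")


# The person from the example: Race = white, Income = high, Child = no.
person = {"race": "white", "income": "high", "child": "no"}
prediction = predict_insurance(person["child"], person["income"])
print("Insurance:", prediction)
```

Under this tree, the person in the example follows the child=no branch and then the Income=high branch.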
4. Data Warehouse

  Without a data warehouse: users send a query to the databases and need to wait for a long time (e.g., 1 day to 1 week).
  With a data warehouse: the databases feed a data warehouse that holds pre-computed results, and users get their answers from it.

5. Dimension Reduction

  Suppose we have the following data set.
  [Figure: scatter plot of the data set]

  According to the data, we find the following vectors (marked in red): e1 and e2.

  Consider that the data points are projected on e1.

  Suppose all data points are projected on vector e1.
  [Figure annotations: "This corresponds to the information loss" and "This corresponds to another information loss"]

  After all data points are projected on vector e1, the total information loss is small.

  Thus, we can use only 1 dimension (i.e., vector e1) to represent all data points.

6. Web Databases

  [Figure: Raymond Wong]

  How to rank the webpages?
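The projection idea from the Dimension Reduction slides (find vectors e1 and e2, project onto e1, check that the information loss is small) can be sketched for 2-D data in plain Python. Everything concrete here is an assumption: the five data points are made up, the names are hypothetical, e1 is taken to be the direction of largest spread, and "information loss" is measured as the squared distance lost along e2. The course may define these differently.

```python
import math

# Hypothetical data set: five points lying close to a straight line.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]


def principal_axes(pts):
    """Center the points and return (centered points, e1, e2).

    e1 is the unit vector along the direction of largest spread and
    e2 is the unit vector perpendicular to it.
    """
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    centered = [(x - mean_x, y - mean_y) for x, y in pts]
    sxx = sum(dx * dx for dx, _ in centered)
    syy = sum(dy * dy for _, dy in centered)
    sxy = sum(dx * dy for dx, dy in centered)
    # Angle that maximizes the spread of the projected points (2x2 case).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    e1 = (math.cos(theta), math.sin(theta))
    e2 = (-math.sin(theta), math.cos(theta))
    return centered, e1, e2


centered, e1, e2 = principal_axes(points)

# 1-D representation: the coordinate of each point along e1.
coords_1d = [dx * e1[0] + dy * e1[1] for dx, dy in centered]
# What the projection discards: the coordinate of each point along e2.
residuals = [dx * e2[0] + dy * e2[1] for dx, dy in centered]

kept = sum(c * c for c in coords_1d)
lost = sum(r * r for r in residuals)
loss_fraction = lost / (kept + lost)

print("e1 =", e1)
print("1-D coordinates:", coords_1d)
print("fraction of spread lost:", loss_fraction)
```

For data that lies close to a line, the lost fraction is tiny, which is the sense in which one dimension is enough to represent all data points.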