Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
COMP5331 Knowledge Discovery in Databases Overview Prepared by Raymond Wong Presented by Raymond Wong raywong@cse COMP5331 1 Course Details Reference books/materials: Papers Data Mining: Concepts and Techniques. Jiawei Han and Micheline Kamber. Morgan Kaufmann Publishers (3rd edition) Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, Vipin Kumar Boston : Pearson Addison Wesley (2006) COMP5331 2 Area DB or AI This course can count towards one of the areas ONLY and cannot be double counted towards the required credits COMP5331 3 Course Details Grading Scheme: Assignment 30% Project 30% Final Exam 40% COMP5331 4 Assignment If the students can answer the selected questions in class correctly, for each corrected answer, I will give him/her a coupon This coupon can be used to waive one question in an assignment which means that s/he can get full marks for this question without answering this question COMP5331 5 Assignment Guideline For each assignment, each student can waive at most one question only. s/he can waive any question he wants and obtain full marks for this question (no matter whether s/he answer this question or not) s/he may also answer this question. But, we will also mark it but will give full marks to this question. When the student submits the assignment, please staple the coupon to the submitted assignment please write down the question no. s/he wants to waive on the coupon COMP5331 6 Project Each project is completed by a group. The number of students in a group depends on the class size. The duration of each presentation depends on the class size. It will be announced soon. COMP5331 7 Project Project Type (One of the following) Survey Implementation-oriented Project Your group only needs to read about 2~5 papers Your group only needs to read about 1~2 papers Research-oriented Project You can read some papers and conduct research COMP5331 8 Project Project Type (One of the following) 1. 2. 3. Proposal Presentation Final report Survey Implementation-oriented Project Research-oriented 1. 2. 3. COMP5331 4. 1. 2. Project 3. 4. Full Score = 80% Full Score = 90% Proposal Presentation Final report Coding Proposal Full Score = 100% Presentation Final report (containing your proposed methodology) 9 Coding (if any) Project Project Topic Some pre-selected topics/papers Your own choice For fairness, please do not choose the topic which is closely related to your own research COMP5331 10 Exam You are allowed to bring a calculator with you. Please remember to prepare a calculator for the exam COMP5331 11 Major Topics 1. 2. 3. 4. 5. 6. 7. Association Clustering Classification Data Warehouse Data Mining over Data Streams Web Databases Multi-criteria Decision Making COMP5331 12 1. Association Customer Apple Orange Raymond Apple Orange Ada Grace Orange Apple Orange … … … Items/Itemsets Frequency Apple 2 Orange 3 Milk 1 {Apple, Orange} 2 {Orange, Milk} COMP5331 1 Milk Milk We are interested in the items/itemsets with frequency >= 2 … Frequent Pattern (or Frequent Item) Frequent Pattern (or Frequent Item) Frequent Pattern (or Frequent Itemset) 13 1. Association Customer Apple Orange Raymond Apple Orange Ada Grace Orange Apple Orange … … … Items/Itemsets Frequency Apple 2 Orange 33 Milk 1 {Apple, Orange} 22 Milk Milk We are interested in the items/itemsets with frequency >= 2 Association Rule: …Apple Orange 1. ( 100% customers who buy apple will probably buy orange.) 2. Orange Apple ( 67% customer who buy orange will probably buy apple.) Problem: toMilk} find all frequent {Orange, 1 patterns and association rules COMP5331 14 Major Topics 1. 2. 3. 4. 5. 6. 7. Association Clustering Classification Data Warehouse Data Mining over Data Streams Web Databases Multi-criteria Decision Making COMP5331 15 2. Clustering Raymond Louis Wyman … Computer 100 History 40 90 20 45 95 … Cluster 2 (e.g. High Score in History and Low Score in Computer) History … Cluster 1 (e.g. High Score in Computer and Low Score in History) Computer Problem: to find all clusters COMP5331 16 Major Topics 1. 2. 3. 4. 5. 6. 7. Association Clustering Classification Data Warehouse Data Mining over Data Streams Web Databases Multi-criteria Decision Making COMP5331 17 3. Classification Suppose there is a person. Race Income Child Insurance white high no ? child=yes root child=no 100% Yes 0% No Income=high 100% Yes 0% No Income=low 0% Yes 100% No Decision tree COMP5331 18 Major Topics 1. 2. 3. 4. 5. 6. 7. Association Clustering Classification Data Warehouse Data Mining over Data Streams Web Databases Multi-criteria Decision Making COMP5331 19 4. Warehouse Query Users Databases Need to wait for a long time (e.g., 1 day to 1 week) Databases Data Warehouse Users Pre-computed results COMP5331 20 Major Topics 1. 2. 3. 4. 5. 6. 7. Association Clustering Classification Data Warehouse Data Mining over Data Streams Web Databases Multi-criteria Decision Making COMP5331 21 5. Data Mining over Static Data 1. Association 2. Clustering 3. Classification Static Data COMP5331 Output (Data Mining Results) 22 5. Data Mining over Data Streams 1. Association 2. Clustering 3. Classification … Unbounded Data COMP5331 Output (Data Mining Results) Real-time Processing 23 Major Topics 1. 2. 3. 4. 5. 6. 7. Association Clustering Classification Data Warehouse Data Mining over Data Streams Web Databases Multi-criteria Decision Making COMP5331 24 6. Web Databases Raymond Wong COMP5331 25 How to rank the webpages? COMP5331 26 Major Topics 1. 2. 3. 4. 5. 6. 7. Association Clustering Classification Data Warehouse Data Mining over Data Streams Web Databases Multi-criteria Decision Making COMP5331 27 7. Multi-criteria Decision Making Suppose we want to look for a hotel which is close to a beach. 3 hotels Hotel Price Distance to beach (km) a 1000 4 b 2400 5 c 3000 1 COMP5331 Suppose we compare hotel a and hotel b We know that hotel a is “better” than hotel b because 1. Price of hotel a is smaller 2. Distance of hotel a is smaller We have two attributes. Which hotel should we select? 28 7. Multi-criteria Decision Making Suppose we want to look for a hotel which is close to a beach. 3 hotels Hotel Price Distance to beach (km) a 1000 4 b 2400 5 c 3000 COMP5331 Suppose we compare hotel a and hotel c We cannot determine hotel a is “better” 1 than hotel c (wrt two attributes). We cannot determine hotel c is “better” than hotel a (wrt two attributes).. This is because 1. Price of hotel a is smaller 2. Distance of hotel c is smaller We have two attributes. Which hotel should we select? 29