IONA School of Arts & Science
Computer Science Department
CS 761 Data Mining
Fall 2012
Monday 6:30PM - 8:30PM, Murphy Center 209

Office hours: Monday: 5:00-6:30pm; Wednesday: 5:00-6:30pm; by appointment
Professor: Dr. Smiljana Petrovic
Office: Murphy Center, 1st Floor, 113-I
Phone: 914-633-2561
Email: [email protected]

Course Description:
This course will introduce popular data mining methods for extracting knowledge from data. It will cover the principles of data mining methods and also give students hands-on experience in developing data mining solutions to scientific and business problems. Topics include: knowledge representation, data processing, machine learning and statistical methods (association mining, classification and prediction using Bayesian learning, decision trees, instance-based learning, support vector machines, neural networks, genetic algorithms, cluster analysis), performance evaluation, and meta-learning algorithms. Ethical implications of data mining applications are considered. Applications are drawn from a variety of real-life examples from different areas.

Texts:
Introduction to Data Mining; Pang-Ning Tan, Michael Steinbach, Vipin Kumar; Addison Wesley, 2006; ISBN-13: 978-0321321367
Data Mining: Concepts and Techniques, Third Edition (Series in Data Management Systems); Jiawei Han, Micheline Kamber, Jian Pei; The Morgan Kaufmann Publishers, 2010; ISBN-13: 978-0123814791

References:
Data Mining: Practical Machine Learning Tools and Techniques, Third Edition; Ian H. Witten, Eibe Frank, Mark A. Hall; Morgan Kaufmann Publishers, 2005; ISBN-13: 978-0123748560

Software:
Weka 3: Data Mining Software in Java, open source: http://www.cs.waikato.ac.nz/ml/weka/

Course Objectives:
Each course objective is listed with the CS Program Goals and Outcomes it supports.

1. Understand the various data mining techniques
   SLO1: work independently to analyze the requirements of problems of appropriate complexity and then design and implement solutions.
2. Implement data mining algorithms in a high level programming language
   SLO1: work independently to analyze the requirements of problems of appropriate complexity and then design and implement solutions.
3. Use data mining techniques and tools to solve a real world problem
   SLO1: work independently to analyze the requirements of problems of appropriate complexity and then design and implement solutions.
4. Evaluate emerging data mining technologies
   SLO2: communicate clearly and effectively on technical issues in both oral and written form.
   SLO4: adapt readily to new technologies and/or disciplines

Assessment Criteria/Tools:
Assessment for this course is accomplished through direct and indirect measures. The direct measures are as follows.

Assessment Tool     Percentage of Final Grade     Course Objectives Measured
Assignments         20%                           1, 2, 4
Project             30%                           1, 3
Two Class Exams     20% (8% + 12%)                1
Final Exam          30%                           1

Assignments: Weekly written assignments assess students’ understanding of data mining concepts covered in class. In the programming assignment, students will implement a data mining method of their choice. Students will also present a new data mining trend or research area of their choice.

Project: The project involves all steps of the data mining process: collecting and preparing data, preprocessing, mining, and reporting and discussing results.

Midterm Exam: The test is designed to assess the student’s comprehension of lecture material, data mining concepts, and applications. In addition to assessment, the tests provide feedback to both the instructor and the student.

Final exam: The final exam is a cumulative assessment of the material for the entire course. Students are responsible for all material covered during the lectures and assigned as homework.

The indirect measures consist of student and faculty evaluations done at the end of each semester.
Students complete an online evaluation form developed by Prof. D’Alessio that asks them to evaluate the extent to which the course has met each of its objectives. Each faculty member submits a spreadsheet indicating the extent to which each of the assessment tools met the desired course objectives. The results of these assessments are analyzed to identify problems. The department discusses the problems and suggests solutions to address them.

Course Outline:
Each entry lists the week, lecture topics, references, and assignments/project milestones. Chapter references are to Tan, Steinbach, and Kumar unless marked "Han".

Week 1 (9/10)
  Introduction
    What Is Data Mining?
    Motivating Challenges
    The Origins of Data Mining
    Data Mining Tasks
  References: Ch 1

Week 2 (9/17)
  Probability and Statistics
  Data
    Types of Data Sets and Attribute Values
    Basic Statistical Descriptions of Data
    Data Visualization
    Measuring Data Similarity
  Data Preprocessing
    Aggregation
    Sampling
    Dimensionality Reduction
    Feature Subset Selection
    Feature Creation
  References: Appendix C, Ch 2, Han Ch 2

Week 3 (9/24)
  Exploring Data
    Summary Statistics
    OLAP Multidimensional Data Analysis
  Visualization
    Motivations for Visualization
    General Concepts
    Techniques
    Visualizing Higher-Dimensional Data
  References: Ch 3

Week 4 (10/1)
  Classification
  Decision Tree Induction
    How a Decision Tree Works
    How to Build a Decision Tree
    Methods for Expressing Attribute Test Conditions
    Measures for Selecting the Best Split
    Algorithm for Decision Tree Induction
  Model Overfitting
    Overfitting issue
    Estimation of Generalization Errors
    Handling Overfitting in Decision Tree Induction
  References: Ch 4

Weeks 5-9 (10/8, 10/15, 10/22, 10/29, 11/5)
  Evaluating the Performance of a Classifier
    Holdout Method
    Random Subsampling
    Cross-Validation
  Methods for Comparing Classifiers
    Estimating a Confidence Interval for Accuracy
    Comparing the Performance of Two Models
    Comparing the Performance of Two Classifiers
  Rule-Based Classifiers
  Nearest-Neighbor Classifiers
  Bayesian Classifiers
    Bayes Theorem
    Naïve Bayes Classifier
    Bayes Error Rate
    Bayesian Belief Networks
  Linear Algebra
  Artificial Neural Network (ANN)
    Perceptron
    Multilayer Artificial Neural Network
  Support Vector Machine (SVM)
    Maximum Margin Hyperplanes
    Linear SVM
    Nonlinear SVM
  Ensemble Methods
    Rationale for Ensemble Method
    Methods for Constructing an Ensemble Classifier
    Bagging and Boosting
    Empirical Comparison among Ensemble Methods
  References: Ch 4, Appendix C, Ch 5, Appendix A, Han Ch 11
  Assignments/Project: Class Exam 1; Class Exam 2; Due: Project Proposal

Weeks 10-15 (11/12, 11/19, 11/26, 12/3, 12/10, 12/17)
  Social Impacts of Data Mining
  Association Analysis (References: Ch 6, 7)
    Frequent Itemset Generation
    The Apriori Principle
    Frequent Itemset Generation in the Apriori Algorithm
    Candidate Generation and Pruning
    Support Counting
    Computational Complexity
    Rule Generation
  Cluster Analysis (References: Ch 8, 9)
    Different Types of Clustering and Clusters
    K-means Algorithm
    K-means: Additional Issues
    K-means as an Optimization Problem
    Extended Cluster Analysis
    Due: Programming Assignment
  Advanced topics in data mining (References: Han Ch 8, 9, 10, 11)
    Mining of stream data, time-series data, and sequence data
    Social network analysis and multirelational data mining
    Mining object, spatial, multimedia, text, and Web data
    Data mining for biological and biomedical data analysis and other scientific applications
    Due: Final Project Report
  Project presentations
  Trends and Research Frontiers in Data Mining (References: Han Ch 11)
    Mining Complex Types of Data
    Advanced Data Mining Applications
    Data Mining System Products and Research Prototypes
    Trends in Data Mining
    Due: Essay and Presentation on a Topic About Trends or Research in Data Mining
  FINAL EXAM

College Policy for all courses and students:
(Full explanations of policy may be found in the College Catalog.)

Plagiarism: The unauthorized use or close imitation of the language and thoughts of another author/person and the representation of them as one's own original work. Iona College policy stipulates that students may be failed for the assignment or course, with no option for resubmission or re-grading of said assignment.
A second instance of plagiarism may result in dismissal from the College.

Attendance: All students are required to attend all classes. Iona has an attendance policy for which all students are accountable. While class absence may be explained, it is never excused. Failure to attend 20% or more of the total class meetings will result in a failure of the class for attendance (FA). The FA grade weighs as an F would in the final official transcript.

Course and Teacher Evaluation (CTE): Iona College now uses an on-line CTE system. This system is administered by an outside company and all of the data is collected confidentially. No student name or information will be linked to any feedback received by the instructor. The information collected will be compiled in aggregate form by the agency and distributed back to the Iona administration and faculty, with select information made available to students who complete the CTE. Your feedback in this process is an essential part of improving our course offerings and instructional effectiveness. We want and value your point of view. (You will receive several emails at your Iona email account about how and when the CTE will be administered, with instructions on how to proceed.)