IONA
School of Arts & Science
Computer Science Department
CS 761 Data Mining
Fall 2012
Class meetings: Monday 6:30PM - 8:30PM, Murphy Center 209
Office hours: Monday 5:00-6:30pm, Wednesday 5:00-6:30pm, or by appointment
Professor: Dr. Smiljana Petrovic
Office: Murphy Center, 1st Floor, 113-I
Phone: 914-633-2561
Email: [email protected]
Course Description:
This course introduces popular data mining methods for extracting knowledge from data. It covers the
principles of data mining methods and gives students hands-on experience in developing data mining
solutions to scientific and business problems. Topics include knowledge representation, data processing,
machine learning and statistical methods (association mining, classification and prediction using Bayesian
learning, decision trees, instance-based learning, support vector machines, neural networks, genetic algorithms,
and cluster analysis), performance evaluation, and meta-learning algorithms. Ethical implications of data mining
applications are considered. Applications are drawn from real-life examples in a variety of areas.
Texts
• Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar; Addison Wesley, 2006; ISBN-13: 978-0321321367
• Data Mining: Concepts and Techniques, Third Edition (Series in Data Management Systems); Jiawei Han,
Micheline Kamber, Jian Pei; Morgan Kaufmann Publishers, 2010; ISBN-13: 978-0123814791
References
• Data Mining: Practical Machine Learning Tools and Techniques, Third Edition, Ian H. Witten, Eibe Frank,
Mark A. Hall; Morgan Kaufmann Publishers, 2011; ISBN-13: 978-0123748560
Software:
• Weka 3: Data Mining Software in Java, open source: http://www.cs.waikato.ac.nz/ml/weka/
Course Objectives:
Each course objective is followed by the CS Program Goals and Outcomes it supports.
1. Understand the various data mining techniques
   SLO1: work independently to analyze the requirements of problems
   of appropriate complexity and then design and implement solutions.
2. Implement data mining algorithms in a high level programming language
   SLO1: work independently to analyze the requirements of problems
   of appropriate complexity and then design and implement solutions.
3. Use data mining techniques and tools to solve a real world problem
   SLO1: work independently to analyze the requirements of problems
   of appropriate complexity and then design and implement solutions.
   SLO2: communicate clearly and effectively on technical issues
   in both oral and written form.
4. Evaluate emerging data mining technologies
   SLO2: communicate clearly and effectively on technical issues
   in both oral and written form.
   SLO4: adapt readily to new technologies and/or disciplines
Assessment Criteria/Tools:
Assessment for this course is accomplished through direct and indirect measures.
The direct measures are as follows.
Assessment Tool      Percentage of Final Grade   Course Objectives Measured
Assignments          20%                         1, 2, 4
Project              30%                         1, 3
Two Class Exams      20% (8% + 12%)              1
Final Exam           30%                         1
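As an illustration of how the percentages above combine, the following minimal Python sketch computes a weighted course grade. The weights come from the table; the component names, the 0-100 score scale, and the assignment of 8% to the first class exam and 12% to the second are assumptions made for this example only, since the table does not say which exam carries which weight.

```python
# Weights from the assessment table. Which class exam counts 8% and which
# counts 12% is an assumption; the syllabus only gives "20% (8% + 12%)".
WEIGHTS = {
    "assignments": 0.20,
    "project": 0.30,
    "class_exam_1": 0.08,  # assumed
    "class_exam_2": 0.12,  # assumed
    "final_exam": 0.30,
}


def final_grade(scores):
    """Return the weighted grade for component scores on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example_scores = {
    "assignments": 90,
    "project": 85,
    "class_exam_1": 70,
    "class_exam_2": 80,
    "final_exam": 88,
}
print(round(final_grade(example_scores), 2))
```

The weights sum to 100%, so a student who scores the same value on every component receives that value as the course grade.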
Assignments: Weekly written assignments assess students’ understanding of data mining concepts covered in
class. In the programming assignment, students will implement a data mining method of their choice. Students will
also present a new data mining trend or research area of their choice.
Project: The project involves all steps of the data mining process: collecting and preparing data, preprocessing,
mining, reporting and discussing results.
Class Exams: Each test is designed to assess the student’s comprehension of lecture material, data mining concepts
and applications. The purpose of the tests is, in addition to assessment, to provide feedback to both the instructor
and the student.
Final exam: The final exam is a cumulative assessment of the material for the entire course. Students are
responsible for all material covered during the lectures and assigned as homework.
The indirect measures consist of student and faculty evaluations done at the end of each semester. Students
complete an online evaluation form developed by Prof. D’Alessio that asks them to evaluate the extent to which
the course has met each of its objectives. Each faculty member submits a spreadsheet indicating the extent to
which each of the assessment tools met the desired course objectives. The results of these assessments are
analyzed to identify problems. The department discusses the problems and suggests solutions to address them.
Course Outline:
Week 1 (9/10)
Introduction
• What Is Data Mining?
• Motivating Challenges
• The Origins of Data Mining
• Data Mining Tasks
References: Ch 1
Week 2 (9/17)
Probability and Statistics
Data
• Types of Data Sets and Attribute Values
• Basic Statistical Descriptions of Data
• Data Visualization
• Measuring Data Similarity
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature Subset Selection
• Feature Creation
References: Appendix C, Ch 2, Han Ch 2
Week 3 (9/24)
Exploring Data
• Summary Statistics
• OLAP
• Multidimensional Data Analysis
Visualization
• Motivations for Visualization
• General Concepts
• Techniques
• Visualizing Higher-Dimensional Data
References: Ch 3
Week 4 (10/1)
Classification
Decision Tree Induction
• How a Decision Tree Works
• How to Build a Decision Tree
• Methods for Expressing Attribute Test Conditions
• Measures for Selecting the Best Split
• Algorithm for Decision Tree Induction
Model Overfitting
• Overfitting issue
• Estimation of Generalization Errors
• Handling Overfitting in Decision Tree Induction
References: Ch 4
5
10/8
Class Exam 1
Ch 4
6
10/15
7
10/22
8
10/29
9
11/5
Evaluating the Performance of a Classifier
• Holdout Method
• Random Subsampling
• Cross-Validation
Methods for Comparing Classifiers
• Estimating a Confidence Interval for Accuracy
• Comparing the Performance of Two Models
• Comparing the Performance of Two Classifiers
Rule-Based Classifiers
Nearest-Neighbor Classifiers
Bayesian Classifiers
• Bayes Theorem
• Naïve Bayes Classifier
• Bayes Error Rate
• Bayesian Belief Networks
Linear Algebra
Artificial Neural Network (ANN)
• Perceptron
• Multilayer Artificial Neural Network
Support Vector Machine (SVM)
• Maximum Margin Hyperplanes
• Linear SVM
• Nonlinear SVM
Ensemble Methods
• Rationale for Ensemble Method
• Methods for Constructing an Ensemble Classifier
• Bagging and Boosting
• Empirical Comparison among Ensemble Methods
Class Exam 2
Due Project Proposal
Class Exam 1
Appendix C
Ch 5
Appendix A
Ch 5
Ch 5
Han Ch 11
Class Exam 2
Weeks 10-15 (11/12, 11/19, 11/26, 12/3, 12/10, 12/17)
Social Impacts of Data Mining
Association Analysis
References: Ch 6, 7
• Frequent Itemset Generation
• The Apriori Principle
• Frequent Itemset Generation in the Apriori Algorithm
• Candidate Generation and Pruning
• Support Counting
• Computational Complexity
• Rule Generation
Cluster Analysis
References: Ch 8, 9
Assignments/Project: Due Programming Assignment
• Different Types of Clustering and Clusters
• K-means Algorithm
• K-means: Additional Issues
• K-means as an Optimization Problem
• Extended Cluster Analysis
Advanced topics in data mining
References: Han Ch 8, 9, 10, 11
• Mining of stream data, time-series data, and sequence data
• Social network analysis and multirelational data mining
• Mining object, spatial, multimedia, text, and Web data
• Data mining for biological and biomedical data analysis and other scientific applications
Assignments/Project: Due Final Project Report
Project presentations
Trends and Research Frontiers in Data Mining
• Mining Complex Types of Data
• Advanced Data Mining Applications
• Data Mining System Products and Research Prototypes
• Trends in Data Mining
References: Han Ch 11
Assignments/Project: Due Essay and Presentation on a Topic About Trends or Research in Data Mining
FINAL EXAM
College Policy for all courses and students: (full explanations of policy may be found in the College Catalog)
Plagiarism: Plagiarism is the unauthorized use or close imitation of the language and thoughts of another author/person and the
representation of them as one's own original work. Iona College policy stipulates that students may be failed for the
assignment or course, with no option for resubmission or re-grading of said assignment. A second instance of plagiarism
may result in dismissal from the College.
Attendance: All students are required to attend all classes. Iona has an attendance policy for which all students are
accountable. While class absence may be explained, it is never excused. Missing 20% or more of the total class
meetings will result in a failing grade for attendance (FA). The FA grade carries the same weight as an F on the final official
transcript.
Course and Teacher Evaluation (CTE): Iona College now uses an online CTE system. This system is administered by an
outside company and all of the data is collected confidentially. No student name or information will be linked to any
feedback received by the instructor. The information collected will be compiled in aggregate form by the agency and
distributed back to the Iona administration and faculty, with select information made available to students who complete the
CTE. Your feedback in this process is an essential part of improving our course offerings and instructional effectiveness.
We want and value your point of view. (You will receive several emails at your Iona email account about how and when the
CTE will be administered, with instructions on how to proceed.)