Decision Sciences Department
COURSE NUMBER: DNSC 6279
COURSE TITLE: Data Mining
COURSE DESCRIPTION: This course provides an in-depth exposure to various supervised and
unsupervised data mining techniques that can be used both to discover relationships in large data
sets and to build predictive models. Techniques covered include regression models, decision trees,
neural networks, clustering, and association analysis.
COURSE PRE-REQS: Stochastics for Analytics I, Statistics for Analytics, or equivalent (JUD/DAD),
MSBA Program Candidacy or instructor approval.
PROFESSORS:
Mr. Patrick Hall
Phone: 336-693-4481
E-mail: [email protected]
Office Hours: Th: 5:30-7:00pm and by appointment
RECOMMENDED TEXTBOOKS:
Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar
http://www-users.cs.umn.edu/~kumar/dmbook/index.php
An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor
Hastie and Robert Tibshirani
http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Fourth%20Printing.pdf
Text Analytics Using SAS® Text Miner Course Notes
SAS Institute
(Available to class participants from instructor)
COURSE OBJECTIVES: How can organizations make better use of the increasing amounts of data
they are collecting? How can they convert data into information and processes that are useful for
managerial decision making? We will attempt to answer these questions by examining several data
mining and data analysis methods and tools for exploring and analyzing data sets.
LEARNING OBJECTIVES:
- Develop a solid foundation in supervised and unsupervised learning techniques
- Learn how to build and assess different types of predictive models
- Learn how to manipulate data and build models using real data and standard data mining
software tools
READING ASSIGNMENTS: The student is responsible for studying and understanding all assigned
materials. If reading generates questions that are not discussed in class, the student has the
responsibility of addressing the instructor privately or raising the issue in a discussion section on
Blackboard. Additional reading, including technical papers and on-line material, may be assigned
during the course.
SOFTWARE: The course will primarily use SAS Enterprise Miner and the SAS, Python, and R
languages. Occasionally, we may use other packages as well; details will be made available on
Blackboard.
TENTATIVE SCHEDULE: (Readings will be posted on Blackboard)

Weeks 1-3 (Jan 14/21/28)
- Intro to data mining, supervised and unsupervised learning
- Intro to software and tools
- Project team formation
- Initial project proposal

Weeks 3-4 (Jan 28/Feb 4)
- Data pre-processing
Software references: SAS EM Sample and Modify nodes; R dplyr package; Python Pandas library

Week 5 (Feb 11)
- Dimension reduction
- PCA
- Feature extraction
Software references: SAS EM PCA and Variable Selection nodes; Python scikit-learn library

Week 6 (Feb 18)
- Linear regression
- Logistic regression
Software references: SAS EM Regression and LARS nodes; R core and stats, glmnet packages

Week 7 (Feb 25)
- Model selection
- Penalized regression
- Model diagnostics
- Model assessment
Software references: SAS EM Regression and LARS nodes; PROCs REG, LOGISTIC, GLMSELECT; R core and
stats, glmnet packages

Week 8 (Mar 3)
- Decision trees
- Decision tree ensembles: random forest, gradient boosting
Software references: SAS EM Decision Tree and Ensemble nodes; R rpart, party, randomForest, gbm
packages; Python scikit-learn library; H2O.ai library

Weeks 9-15
Dates: weeks 9 and 10 and spring break fall on Mar 10, Mar 17, and Mar 24; week 11 is Mar 31;
week 12 is Apr 7; week 13 is Apr 14; week 14 is Apr 21; week 15 is finals week.
Topics, in order:
- Neural networks and deep learning
Software references: SAS EM Neural node; Python scikit-learn library; H2O.ai library
- Project proposal presentations
- Spring break
- Clustering
Software references: SAS EM Clustering node; SAS PROCs CLUSTER, FASTCLUS; R cluster package; Python
scikit-learn library
- Frequent pattern mining
Software references: SAS EM Association and Link Analysis nodes; R arules package
- Text mining
Software references: SAS EM Text Miner nodes; Python scikit-learn and nltk libraries
- Miscellaneous topics as needed, e.g., memory-based reasoning (neighbors), support vector
machines, naïve Bayes, distributed computing, data visualization
- Project presentations
- Final exam

Students are expected to come to class: 1) having read the material for the current lecture and
prepared to discuss it; 2) having reviewed the material of the previous lectures. Participation in
class discussions is expected from all students.
GRADING: The course grade will be based on homework assignments, quizzes, a final exam, and a
team project. Each grading component is described in detail below.
Quizzes
There will be several in-class quizzes, typically every week. They will be based on current and prior
assigned readings and material covered in the class sessions. The lowest quiz score will be dropped.
Homework Assignments
Homework assignments will typically require the use of software.
A typical homework assignment will consist of a few problems with several parts and will be given
one week before it is due. Solutions will be posted on the course web site. No late homework
assignments will be accepted.
In preparing the submissions, please follow these guidelines:
- Ensure any computer program solutions are commented and runnable in a standard Python,
R, or SAS environment.
- Ensure any written solutions are typed or easily readable by anyone.
- Ensure a clear logical flow and mark your answers.
- Print/type your name(s) in the top right-hand corner of every page or in a header of any
computer program submitted.
Final Exam
The final exam is individual and will be scheduled during finals week. No make-up final exam will
be given.
Project
The project is designed to serve as an exercise in applying one or more of the data mining
techniques covered in the course to analyze real life data sets. A primary objective is to understand
the complexities that arise in mining massive, real life datasets that are often inconsistent,
incomplete, and unclean. Students can use a variety of software tools to perform the analysis,
including SAS Enterprise Miner and/or various Python and R packages.
This is a semester-long project, and students have the option to work in two- or three-person teams. The
deliverables include a formal project proposal (due mid-semester), and a final report (due at the
end of the semester at the time of your final project presentation - Session 14). Students may select
a current Kaggle contest (https://www.kaggle.com/) or their MSBA practicum project as the project
for this class.
Grading Weights
- Quizzes: 25%
- Homework assignments: 25%
- Final exam: 25%
- Project: 25%
The final exam and all quizzes are individual. No make-up exam / quiz / homework assignment will
be given.
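As a rough illustration of how the four equal weights combine, the sketch below computes a course
score in Python. The 0-100 scale, the plain averaging of quiz and homework scores, and all
function and variable names are assumptions made for illustration; only the 25% weights and the
dropped lowest quiz come from this syllabus.

```python
# Illustrative sketch only. The 0-100 scale and the simple averaging of quiz
# and homework scores are assumptions, not rules stated in the syllabus.
WEIGHTS = {
    "quizzes": 0.25,
    "homework": 0.25,
    "final_exam": 0.25,
    "project": 0.25,
}


def quiz_average(quiz_scores):
    """Average the quiz scores after dropping the single lowest one."""
    if len(quiz_scores) < 2:
        raise ValueError("need at least two quiz scores to drop the lowest")
    kept = sorted(quiz_scores)[1:]  # discard the lowest score
    return sum(kept) / len(kept)


def course_grade(quiz_scores, homework_scores, final_exam, project):
    """Combine the four components using the syllabus weights."""
    components = {
        "quizzes": quiz_average(quiz_scores),
        "homework": sum(homework_scores) / len(homework_scores),
        "final_exam": final_exam,
        "project": project,
    }
    return sum(WEIGHTS[name] * score for name, score in components.items())


if __name__ == "__main__":
    # Hypothetical scores: the quiz of 60 is dropped, so quizzes average 90.
    print(course_grade([60, 80, 90, 100], [85, 95], 88, 92))
```

With the hypothetical scores shown, each component contributes a quarter of its value, so the
result is (90 + 90 + 88 + 92) / 4.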
ACADEMIC INTEGRITY: Cheating and plagiarism will not be tolerated. Any case will automatically
result in loss of all the points for the assignment, and may be a reason for a failing grade and/or
grounds for dismissal. In case of a group assignment, all group members will receive a zero grade.
Any suspected case of cheating or plagiarism or behavior in violation of the rules of this course will
be reported to the Office of Academic Integrity. Students are expected to know and understand all
college policies, especially the code of academic integrity available at:
http://www.gwu.edu/~ntegrity/code.html
DISABILITY SERVICES: Please contact the Disability Support Services office to establish eligibility and
to coordinate reasonable accommodation. For additional information, refer to
http://disabilitysupport.gwu.edu/
ATTENDANCE: The George Washington University Bulletin, Graduate Programs, 2009–2010:
"Regular attendance is expected. Students may be dropped from any class for undue absence …
Students are held responsible for all of the work of the courses in which they are registered, and all
absences must be excused by the instructor before provision is made to make up the work missed."
CHANGES: The instructor reserves the right to make revisions to any item on this syllabus,
including, but not limited to, any class policy, course outline and schedule, grading policy, tests, etc.
Note that the requirements for deliverables may be clarified and expanded in class, via email, or
on Blackboard. Students are expected to complete the deliverables incorporating such additions
and to check email and Blackboard announcements frequently.