Decision Sciences Department

COURSE NUMBER: DNSC 6279

COURSE TITLE: Data Mining

COURSE DESCRIPTION: This course provides an in-depth exposure to various supervised and unsupervised data mining techniques that can be used both to discover relationships in large data sets and to build predictive models. Techniques covered include regression models, decision trees, neural networks, clustering, and association analysis.

COURSE PRE-REQS: Stochastics for Analytics I, Statistics for Analytics, or equivalent (JUD/DAD), MSBA Program Candidacy or instructor approval.

PROFESSORS: Mr. Patrick Hall
Phone: 336-693-4481
E-mail: [email protected]
Office Hours: Th: 5:30-7:00pm and by appointment

RECOMMENDED TEXTBOOKS:
- Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar
  http://www-users.cs.umn.edu/~kumar/dmbook/index.php
- An Introduction to Statistical Learning with Applications in R, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
  http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Fourth%20Printing.pdf
- Text Analytics Using SAS® Text Miner Course Notes, SAS Institute (available to class participants from the instructor)

COURSE OBJECTIVES: How can organizations make better use of the increasing amounts of data they are collecting? How can they convert data into information and processes that are useful for managerial decision making? We will attempt to answer these questions by examining several data mining and data analysis methods and tools for exploring and analyzing data sets.

LEARNING OBJECTIVES:
- Develop a solid foundation in supervised and unsupervised learning techniques
- Learn how to build and assess different types of predictive models
- Learn how to manipulate data and build models using real data and standard data mining software tools

READING ASSIGNMENTS: The student is responsible for studying and understanding all assigned materials.
If reading generates questions that are not discussed in class, the student is responsible for addressing the instructor privately or raising the issue in a discussion section on Blackboard. Additional reading, including technical papers and online material, may be assigned during the course.

SOFTWARE: The course will primarily involve using SAS Enterprise Miner and the SAS, Python, and R languages. Occasionally, we may use other packages as well (details will be made available on Blackboard).

TENTATIVE SCHEDULE: (Readings will be posted on Blackboard)

Weeks 1-3 (Jan 14/21/28)
- Intro to data mining, supervised and unsupervised learning
- Intro to software and tools
- Project team formation
- Initial project proposal
- Data pre-processing
Software references: SAS EM Sample and Modify nodes; R dplyr package; Python Pandas library

Weeks 3-4 (Jan 28/Feb 4)
- Dimension Reduction
- PCA
- Feature Extraction
Software references: SAS EM PCA and Variable Selection nodes; Python scikit-learn library

Week 5 (Feb 11)
- Linear regression
- Logistic regression
Software references: SAS EM Regression and LARS nodes; R core and stats, glmnet packages

Week 6 (Feb 18)
- Model selection
- Penalized regression
- Model diagnostics
- Model assessment
Software references: SAS EM Regression and LARS nodes; SAS PROCs REG, LOGISTIC, GLMSELECT; R core and stats, glmnet packages

Week 7 (Feb 25)
- Decision Trees
- Decision Tree Ensembles: Random Forest, Gradient Boosting
Software references: SAS EM Decision Tree and Ensemble nodes; R rpart, party, randomForest, gbm packages; Python scikit-learn library; H2O.ai library

Week 8 (Mar 3)
- Neural networks and deep learning
Software references: SAS EM Neural node; Python scikit-learn library; H2O.ai library

Week 9 (Mar 10)
- Project proposal presentations

Mar 17
- Spring break

Week 10 (Mar 24)
- Clustering
Software references: SAS EM Clustering node; SAS PROCs CLUSTER, FASTCLUS; R cluster package; Python scikit-learn library

Week 11 (Mar 31)
- Frequent Pattern Mining
Software references: SAS EM Association and Link Analysis nodes; R arules package

Week 12 (Apr 7)
- Text Mining
Software references: SAS EM Text Miner nodes; Python scikit-learn and nltk libraries

Week 13 (Apr 14)
Miscellaneous topics as needed, e.g.:
- Memory Based Reasoning (Neighbors)
- Support vector machines
- Naïve Bayes
- Distributed computing
- Data visualization

Week 14 (Apr 21)
- Project Presentations

Week 15 (Finals week)
- Final Exam

Students are expected to come to class: 1) having read and prepared to discuss the material for the current lecture; 2) having reviewed the material of the previous lectures. Participation in class discussions is expected from all students.

GRADING: The course grade will be based on homework assignments, quizzes, a final exam, and a team project. Each grading component is described in detail below.

Quizzes
There will be several in-class quizzes, typically one every week. They will be based on current and prior assigned readings and material covered in the class sessions. The lowest quiz score will be dropped.

Homework Assignments
Homework assignments will typically require the use of software. A typical homework assignment will consist of a few problems with several parts and will be given one week before it is due. Solutions will be posted on the course web site. No late homework assignments will be accepted. In preparing the submissions, please follow these guidelines:
- Ensure any computer program solutions are commented and runnable in a standard Python, R, or SAS environment.
- Ensure any written solutions are typed or easily readable by anyone.
- Ensure a clear logical flow and mark your answers.
- Print/type your name(s) on the top right-hand corner of every page or in a header of any computer program submitted.

Final Exam
The final exam is individual and will be scheduled during finals week. No make-up final exam will be given.

Project
The project is designed to serve as an exercise in applying one or more of the data mining techniques covered in the course to analyze real life data sets.
A primary objective is to understand the complexities that arise in mining massive, real life datasets that are often inconsistent, incomplete, and unclean. Students can use a variety of software tools to perform the analysis, including SAS Enterprise Miner and/or various Python and R packages. This is a semester-long project, and students have the option to work in two- or three-person teams. The deliverables include a formal project proposal (due mid-semester) and a final report (due at the end of the semester at the time of your final project presentation - Session 14). Students may select a current Kaggle contest (https://www.kaggle.com/) or their MSBA practicum project as the project for this class.

Grading Weights
- Quizzes: 25%
- Homework assignments: 25%
- Final exam: 25%
- Project: 25%
The final exam and all quizzes are individual. No make-up exam / quiz / homework assignment will be given.

ACADEMIC INTEGRITY: Cheating and plagiarism will not be tolerated. Any case will automatically result in loss of all the points for the assignment, and may be a reason for a failing grade and/or grounds for dismissal. In the case of a group assignment, all group members will receive a zero grade. Any suspected case of cheating or plagiarism or behavior in violation of the rules of this course will be reported to the Office of Academic Integrity. Students are expected to know and understand all college policies, especially the code of academic integrity available at: http://www.gwu.edu/~ntegrity/code.html

DISABILITY SERVICES: Please contact the Disability Support Services office to establish eligibility and to coordinate reasonable accommodation. For additional information, refer to http://disabilitysupport.gwu.edu/

ATTENDANCE: The George Washington University Bulletin, Graduate Programs, 2009–2010: "Regular attendance is expected.
Students may be dropped from any class for undue absence … Students are held responsible for all of the work of the courses in which they are registered, and all absences must be excused by the instructor before provision is made to make up the work missed."

CHANGES: The instructor reserves the right to make revisions to any item on this syllabus, including, but not limited to, any class policy, course outline and schedule, grading policy, tests, etc. Note that the requirements for deliverables may be clarified and expanded in class, via email, or on Blackboard. Students are expected to complete the deliverables incorporating such additions and to check email and Blackboard announcements frequently.