University of Sydney
Discipline of Business Analytics
Fall 2013

QBUS 6810: Statistical Learning and Data Mining
We 6-9 pm
Room H04-159 (Seminar Room 6, Ground floor, Merewether Building)

Description
It is common for businesses to have access to very rich data sets, often generated automatically as a by-product of the main institutional activity of a firm or business unit. Data mining, or statistical learning, deals with inferring and validating patterns, structures and relationships in data, as a tool to support decisions in the business environment. This postgraduate course in statistical learning offers a survey of the main statistical methodologies for visualization and analysis of business and market data. It provides the tools necessary to extract information required for specific tasks such as credit scoring, prediction and classification, market segmentation and product positioning. Emphasis will be given to business applications of data mining using modern software tools.

The goals are that students
(1) know which business analytic tool is most relevant for what type of business problem,
(2) know the advantages and limitations of each method,
(3) can extract information from large volumes of data readily available from the business environment,
(4) can obtain and interpret a meaningful analytical result using a software package such as STATA®, Gauss, Matlab or SAS,
(5) can present and write about findings effectively.

Lecturer
Artem Prokhorov, PhD
Tel.: 02 9351 6584
Office: Merewether-499
Email: [email protected]
Web: http://alcor.concordia.ca/~aprokhor

Office Hours
We 10-12 or by appointment. Emailing me your questions is often the fastest way to get an answer. Also, I am in the office most of the time and can usually talk to students without an appointment.

Book
The Elements of Statistical Learning: Data Mining, Inference and Prediction, by T. Hastie, R. Tibshirani and J. Friedman (2002), Springer Series in Statistics, 2009.
Freely available at: http://www-stat.stanford.edu/~tibs/ElemStatLearn/.

Grading
Exam = 50%
Presentations = 25%
Group Project = 25%
I reserve the right to adjust the final grade distribution as I see appropriate.

Exam
I will announce the time and place and provide a practice exam at the end of the semester.

Presentations
Presentations are in-class lectures given by a student to other students. They will cover chapters of the Book that I will assign and can include any other material you find or I provide you with. They will be followed by unmarked clicker-based multiple-choice quizzes composed by the presenter.

Group Project
Your own statistical analysis of your own problem using your own business, economic or financial data, carried out in groups of 2-3 people. Involves:
1. finding a topic of interest to you (it can come from the applied portion of your previous course paper, from a recent paper you saw, from the examples we cover in class, or from me).
2. finding data for it (e.g., section Databases at alcor.concordia.ca/~aprokhor/links.html); those who do not have a topic (and data) by the Easter break will be assigned one.
3. choosing an appropriate method (choose from those we study and talk to me).
4. estimating the model in the software of your choice. Do this by May 1 if you want my feedback.
5. presenting the results in class in the last couple of weeks of the course.
6. incorporating any feedback you receive and writing up the results in a 10-15 page summary (background, method, findings, interpretation and limitations). The deadline for email submission of the summary is June 1.

General Info
There are no make-ups for presentations, group projects or the exam. Not showing up for a presentation or the exam, or not turning in the project, automatically gets you a zero for the relevant part of the grade unless there is a well-documented medical excuse, in which case the weight of the missing part is spread over the remaining parts.
Students must notify the instructor about religious observances at the beginning of the semester so that they can be accommodated.

The following outline is tentative. We may add topics from the Book.

Tentative Outline
I. Introduction and linear algebra for data analysis: Introduction to statistical learning. Vector spaces, inner product, matrices, matrix inverse. Covariance matrix.
II. Data visualization and introduction to supervised learning: The spectral and the singular value decomposition of a matrix: the biplot. Optimal linear prediction. Loss functions.
III. Linear regression model: Representation, inference (estimation and testing).
IV. Variable selection and shrinkage methods: Stepwise selection, ridge regression, lasso, LARS, principal components regression.
V. Linear methods for classification: Linear probability model and logistic regression.
VI. Linear methods for classification: Canonical variates and discriminant analysis.
VII. Semiparametric regression: Regression splines and smoothing splines.
VIII. Kernel smoothing methods: Kernel smoothing. Local polynomial regression.
IX. Model assessment and selection; model inference and averaging.
X. Classification trees: Regression and classification trees, boosting.
XI. Neural networks: Neural networks, training.
XII. Unsupervised learning: Association analysis. Market basket analysis, distance and similarity, multidimensional scaling.
XIII. Cluster analysis: K-means, Gaussian mixtures, hierarchical methods.