Department of Statistics
STATS 784SC Statistical Data Mining
Study Guide 2015

Lecturer
Thomas Yee, Department of Statistics. Rm 221, Bldg 303. Ph. 923-8811, [email protected]

Aim of the Course
This course will look at some statistical theory and practical aspects of data mining. It will have a significant coursework component, most of it being computer work. You will be exposed to at least one ‘large’ data set, and there may be a little bit of programming. Students are assumed to have a good working knowledge of R and a good grade in STATS 210; STATS 310 is preferable.

The aim of the course is to give students an understanding of a few common statistical techniques used for data mining. By the end of the course students should be able to apply these methods confidently on a large data set.

Topics
The chapters will probably be something like:
1. What is data mining?
2. Handling large data sets in Unix and R†
3. Data visualization†
4. Decision trees
5. The classification problem
6. Neural networks (possibly)
7. Cluster analysis (possibly)
8. GLMs and GAMs (possibly)

There is a coursebook of notes given out in class in Week 1. Reading lists in each chapter are provided. Students need a reasonable familiarity with R and computing in general.

Class Web Site
The class’s web site is http://www.stat.auckland.ac.nz/~yee/784. This will be updated all the time, and students are expected to look there every time they log on to the computer. The site contains data sets, announcements, hints, office hours, etc.

Computer Work
We will use R, not because it’s the best for data mining, but because it’s elegant, free and allows the statistical aspects of the various methods that will be taught to be easily seen. For students who know SAS already, we would like to give you some practical work involving SAS Enterprise Miner if the UoA licence makes this feasible.

Assignments
There will be fortnightly assignments, plus a possible project (worth about 1 or 2 assignments).
All students must do their work independently. Any copying or cheating will result in negative marks for the entire assignment, e.g., 80% will be −80%. You should ask us for help during office hours. We will not respond to e-mail questions unless there is a problem/error on our part.

Terms Test
This will be a closed book test of about one hour’s duration, held during the middle of the course.

Examination
It will be a closed book exam, 2 hours long.

Texts
There is no single suitable text, but some background reading includes the following.

Bishop, C. M. (2006) Pattern Recognition and Machine Learning, New York, USA, Springer.
Hand, D., Mannila, H. and Smyth, P. (2001) Principles of Data Mining, Cambridge, MA, USA, MIT Press.
Hastie, T. J., Tibshirani, R. J. and Friedman, J. H. (2009) Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd Ed., New York, USA, Springer-Verlag.
Huber, P. J. (2011) Data Analysis: What Can Be Learned from the Past 50 Years, Hoboken, NJ, USA, Wiley.
Hurwitz, J., Nugent, A., Halper, F. and Kaufman, M. (2013) Big Data for Dummies, Hoboken, NJ, USA, Wiley.
Izenman, A. J. (2008) Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning, New York, USA, Springer-Verlag.
James, G., Witten, D., Hastie, T. J. and Tibshirani, R. J. (2013) An Introduction to Statistical Learning with Applications in R, New York, USA, Springer.
Kuhn, M. and Johnson, K. (2013) Applied Predictive Modeling, New York, USA, Springer.
Larose, D. T. (2006) Data Mining Methods and Models, Hoboken, NJ, USA, Wiley-Interscience.
Tan, P.-N., Steinbach, M. and Kumar, V. (2006) Introduction to Data Mining, Boston, USA, Pearson Addison Wesley.
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S, 4th Ed., New York, USA, Springer-Verlag.

You’ll need to make use of the R online help.
Special Learning Requirements
Students are asked to discuss privately any impairment-related requirements face-to-face and/or in written form with the course convenor, lecturer or tutor.

Assessment
Coursework = 5 or 6 Assignments (20%) + Terms Test (20%)
Final mark = 60% final exam + 40% coursework.
It is very important that you attempt ALL assignments and sit the test. You must obtain 50% or more in the final exam in order to pass the course.
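
As a worked illustration of the assessment weights above, here is a minimal sketch in Python. The function name `final_mark`, its arguments, and the sample scores are hypothetical. It also assumes that every component is scored on a 0-100 scale and that assignments are averaged with equal weight, neither of which the guide states.

```python
def final_mark(assignments_pct, test_pct, exam_pct):
    """Combine component scores (each out of 100) using the guide's weights.

    Assumption: assignments are averaged with equal weight; the guide
    does not say how individual assignments are combined.
    """
    assignments_avg = sum(assignments_pct) / len(assignments_pct)
    # Coursework = assignments (20%) + terms test (20%); final exam = 60%.
    mark = 0.20 * assignments_avg + 0.20 * test_pct + 0.60 * exam_pct
    # The guide requires 50% or more in the final exam to pass the course.
    meets_exam_requirement = exam_pct >= 50
    return mark, meets_exam_requirement


# Hypothetical scores: five assignments, one test, one exam.
mark, ok = final_mark([70, 80, 90, 60, 75], test_pct=65, exam_pct=48)
print(round(mark, 1), ok)  # prints: 56.8 False
```

In this made-up case the weighted mark is above 50, but the exam score of 48 falls short of the 50% exam requirement, so the student would not pass.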