Department of Statistics
STATS 784SC Statistical Data Mining
Study Guide 2015
Lecturer
Thomas Yee, Department of Statistics. Rm 221, Bldg 303 Ph. 923-8811, [email protected]
Aim of the Course
This course will look at some statistical theory and practical aspects of data mining. It will have a significant
coursework component, most of it being computer work. You will be exposed to at least one ‘large’ data
set, and there may be a little bit of programming. Students are assumed to have a good working knowledge
of R and a good grade in STATS 210; STATS 310 is preferable.
The aim of the course is to give students an understanding of a few common statistical techniques used for
data mining. By the end of the course students should be able to apply these methods confidently on a large
data set.
Topics
The chapters will probably be something like:
1. What is data mining?
2. Handling large data sets in Unix and R†
3. Data visualization†
4. Decision trees
5. The classification problem
6. Neural networks (possibly)
7. Cluster analysis (possibly)
8. GLMs and GAMs (possibly)
A coursebook of notes will be given out in class in Week 1. Each chapter provides a reading list.
Students need a reasonable familiarity with R and computing in general.
Class Web Site
The class’s web site is http://www.stat.auckland.ac.nz/~yee/784. It will be updated
frequently, and students are expected to check it every time they log on to a computer. The site contains
data sets, announcements, hints, office hours, etc.
Computer Work
We will use R, not because it’s the best for data mining, but because it’s elegant, free and allows the
statistical aspects of the various methods that will be taught to be easily seen.
For students who know SAS already, we would like to give you some practical work involving SAS Enterprise Miner if the UoA licence makes this feasible.
Assignments
There will be fortnightly assignments, plus a possible project (worth about 1 or 2 assignments). All students
must do their work independently. Any copying or cheating will result in negative marks for the entire
assignment, e.g., 80% will be −80%. You should ask us for help during office hours. We will not respond
to e-mail questions unless there is a problem/error on our part.
Terms Test
This will be a closed book test of about one hour’s duration, held during the middle of the course.
Examination
The exam will be closed book and 2 hours long.
Texts
There is no single suitable text, but some background reading includes the following.
Bishop, C. M. (2006) Pattern Recognition and Machine Learning, New York, USA, Springer.
Hand, D., Mannila, H., Smyth, P. (2001) Principles of Data Mining. Cambridge, MA, USA,
MIT Press.
Hastie, T. J., Tibshirani, R. J. and Friedman, J. H. (2009) Elements of Statistical Learning:
Data Mining, Inference and Prediction, 2nd Ed. New York, USA, Springer-Verlag.
Huber, P. J. (2011) Data analysis: what can be learned from the past 50 years, Hoboken, NJ,
USA, Wiley.
Hurwitz, J., Nugent, A., Halper, F. and Kaufman, M. (2013) Big Data for Dummies,
Hoboken, NJ, USA, Wiley.
Izenman, A. J. (2008) Modern Multivariate Statistical Techniques: Regression, Classification,
and Manifold Learning, New York, USA, Springer-Verlag.
James, G., Witten, D., Hastie, T. J., Tibshirani, R. J. (2013) An Introduction to Statistical
Learning with Applications in R, New York, USA, Springer.
Kuhn, M. and Johnson, K. (2013) Applied Predictive Modeling, New York, USA, Springer.
Larose, D. T. (2006) Data Mining Methods and Models, Hoboken, NJ, USA, Wiley-Interscience.
Tan, P.-N., Steinbach, M. and Kumar, V. (2006) Introduction to Data Mining, Boston, USA,
Pearson Addison Wesley.
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S, 4th Ed. New York,
USA, Springer-Verlag.
You’ll need to make use of the R online help.
Special Learning Requirements
Students are asked to discuss privately any impairment-related requirements face-to-face and/or in written
form with the course convenor, lecturer or tutor.
Assessment
Coursework = 5 or 6 Assignments (20%) + Terms Test (20%)
Final mark = 60% final exam + 40% coursework.
It is very important that you attempt ALL assignments and sit the test. You must obtain 50% or more in the
final exam in order to pass the course.
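As an illustration only, the weighting above can be sketched in a few lines of Python. The function name and arguments are hypothetical, and the sketch assumes each component is already expressed as a percentage between 0 and 100 (with the assignments combined into a single percentage, since the guide does not say how individual assignments are weighted). It reports the final mark and whether the 50% exam requirement is met; it does not decide an overall pass, because the guide does not state an overall pass mark.

```python
def final_mark(assignments_pct, test_pct, exam_pct):
    """Combine component percentages into a final mark.

    Weights follow the Assessment section: assignments 20%,
    terms test 20%, final exam 60%. How the assignments are
    combined into one percentage is an assumption left to the caller.
    """
    # Integer weights keep the arithmetic exact for whole-number inputs.
    final = (20 * assignments_pct + 20 * test_pct + 60 * exam_pct) / 100
    # The guide requires 50% or more in the final exam to pass the course.
    meets_exam_requirement = exam_pct >= 50
    return final, meets_exam_requirement


if __name__ == "__main__":
    mark, exam_ok = final_mark(assignments_pct=80, test_pct=70, exam_pct=45)
    print(f"Final mark: {mark:.1f}%, exam requirement met: {exam_ok}")
```

In this example the coursework is strong, but the exam score of 45% falls short of the 50% requirement, so the student would not pass regardless of the combined mark.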