Data Analytics
CMIS Short Course part II
Day 1 Part 1: Introduction
Sam Buttrey
December 2015

Who Am I?
• A.B., Princeton, Statistics; M.A., Ph.D., U. California-Berkeley, Statistics
• Naval Postgraduate School, Department of Operations Research, 1996-Present
• Data Analysis, Data Mining, Big Data Analytics, Classification, Modeling and Applications…
• Married, one child…

Tentative Schedule
• Today: Trees and Ensembles
  – 9:00 – 10:00: Recap part I (Whitaker); Overview
  – 10:00 – 2:00: Regression & Classification trees
  – 2:00 – 3:00: Ensemble models
  – 3:00 – 4:00: *Evaluating classifiers: ROC and F1
• Tomorrow: Unsupervised Models
  – 9:00 – 11:00: Principal components
  – 11:00 – 2:00: Clustering
  – 2:00 – 4:00? Association Rules
• If Time Remains
  – Simple forecasting models

The Big Picture
• These courses are intended to help you visualize, predict, classify and find patterns in large data sets
• Often these are constructed by combining different data sets from different sources
  – Inconsistencies, redundancies, noise
• Data sets in the course are small enough to be used quickly, but we want automatic techniques that scale up to huge-ish data

Data
• Will normally appear as a rectangular array: rows are observations, columns are measurements (variables): n × p
  – Data that is not already rectangular will be wrestled into this form!
  – columns of pixels, counts of terms in documents, etc.
• Each column has the same sort of measurement: numeric (incl. date/time), categorical, logical (True/False), text
• Data might be missing

Types of Models (i)
• Often one response (target) variable is the primary measure of interest (“Y”)
• We want to predict the value of Y in new data where predictors (X’s) are known
• When Y is numeric, this is regression
  – E.g. size of error in TACNAV data
• When Y is categorical, this is classification
  – E.g. digit recognition (0, 1, …, 9)
• These models are called “supervised” because the true Y’s are known

Recap of Part I
• Whitaker, September 2015
  1. R and RStudio
  2. Linear regression
    – Comparison to Nearest-Neighbor Methods
  3. Logistic Regression
  4. Controlling Complexity
    – Training set/test set; cross-validation
    – Lasso and other regularization
  5. Intro. to Neural Networks
• These all involve linear combinations of predictor variables!

Constant Concerns
• Modern models like trees are very flexible and therefore prone to over-fitting
• Control complexity by:
  1. Using separate training and validation sets, the latter to compare models
    • Then evaluate prediction error with test set
  2. Cross-validation across, say, 10 folds
  3. Regularization (shrinkage of coefficients) via ridge or lasso
• A constant theme in big data

Use the model built with the training data to predict a new set of data
[Figure: two scatterplots of y vs. x with fitted curves. One panel: Under-Fitting (High Bias – Low Variance). Other panel: Over-Fitting (Low Bias – High Variance).]
[Figure: Prediction Error vs. Measure of Complexity (Less Complex to More Complex), with one curve for the Training Set and one for An Independent Test Set. Prediction error is, e.g., $\sum_i (y_i - \hat{y}_i)^2 / n$.]

Types of Models (ii)
• Unsupervised models have no particular response variable
• Goals are to find groups (clustering, source separation), or anomalies, or relationships (association rules), or reduce dimensionality for visualization
• Generally more difficult and less satisfying than supervised models
  – Hard to evaluate or compare quality

R and RStudio Recap
• R is a very popular free open-source statistical environment
• RStudio is a free front-end that makes managing scripts and graphics easier
• Our variables come in vectors; a rectangular set of vectors makes up a data.frame
• Example – beer (35 x 11)

R Basics Restated
• R is case-sensitive (but Windows isn’t)
• help(thing) or ?thing for help
• a <- b assigns value of b to a
  – Subsequent changes to b don’t affect a
• Recall earlier commands with up-arrow
  – history(100) shows last 100
• Use forward slash for file names
• Special characters: \n, \t, \\, \"
  – Single or double-quotes okay; # for comment
• == for “is equal”, != for “not equal”

Materials
• The disc has Slides, R Scripts, Data sets and Libraries (plus a few random things)
  – library(<name>) looks in default places; library(<name>, lib.loc=<place>)
• Ctrl-Enter executes a line from a script, but…
• Lots of commands are already given to you
  – for best results make sure you understand them

R Refresher
• Get beer data into R from Excel
• Data frames and variable types
• Simple exploration
  – Plot of Calories vs. Alcohol
• Simple linear regression model of Calories vs. Alcohol
• Drawing the response “surface”
  – To be used to compare the linear model with the tree model
• Let’s do this!
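
The refresher steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the course script: the beer data set is not reproduced here, so the code simulates a stand-in data frame called beer.sim (a hypothetical name). The column names Calories and Alcohol come from the slides; the simulated values, the seed, the 25/10 split and the object names are all illustrative assumptions. The last few lines compute the test-set prediction error in the mean-squared form, sum of (y - yhat)^2 divided by n.

```r
# Stand-in for the beer data (the real file is on the course disc).
# All values here are simulated for illustration only.
set.seed(1)
n <- 35                                             # same number of rows as the beer example
Alcohol  <- runif(n, min = 3, max = 6)              # assumed range
Calories <- 20 + 25 * Alcohol + rnorm(n, sd = 10)   # assumed linear trend plus noise
beer.sim <- data.frame(Alcohol = Alcohol, Calories = Calories)

# Data frames and variable types
str(beer.sim)

# Simple exploration: plot of Calories vs. Alcohol
plot(beer.sim$Alcohol, beer.sim$Calories,
     xlab = "Alcohol", ylab = "Calories")

# Simple linear regression of Calories on Alcohol
beer.lm <- lm(Calories ~ Alcohol, data = beer.sim)

# Response "surface": with one predictor this is just the fitted line,
# evaluated on a grid of Alcohol values and drawn over the scatterplot
grid <- data.frame(Alcohol = seq(min(beer.sim$Alcohol),
                                 max(beer.sim$Alcohol),
                                 length.out = 100))
lines(grid$Alcohol, predict(beer.lm, newdata = grid))

# Training set / test set: fit on 25 rows, measure error on the other 10
train.rows <- sample(n, 25)
train <- beer.sim[train.rows, ]
test  <- beer.sim[-train.rows, ]
train.lm <- lm(Calories ~ Alcohol, data = train)
pred <- predict(train.lm, newdata = test)
test.mse <- mean((test$Calories - pred)^2)          # sum of squared errors / n
test.mse
```

The same grid-and-predict pattern should carry over to a tree model, which is why the response is drawn from predictions on a grid rather than with abline().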