
Data Analytics
CMIS Short Course part II
Day 1 Part 1: Introduction
Sam Buttrey
December 2015
Who Am I?
• A.B., Princeton, Statistics; M.A., Ph.D.,
U. California-Berkeley, Statistics
• Naval Postgraduate School, Department of
Operations Research, 1996-Present
• Data Analysis, Data Mining, Big Data
Analytics, Classification, Modeling and
Applications…
• Married, one child…
Tentative Schedule
• Today: Trees and Ensembles
– 9:00 – 10:00: Recap part I (Whitaker); Overview
– 10:00 – 2:00: Regression & Classification trees
– 2:00 – 3:00: Ensemble models
– 3:00 – 4:00: *Evaluating classifiers: ROC and F1
• Tomorrow: Unsupervised Models
– 9:00 – 11:00: Principal components
– 11:00 – 2:00: Clustering
– 2:00 – 4:00? Association Rules
• If Time Remains
– Simple forecasting models
The Big Picture
• These courses are intended to help you
visualize, predict, classify and find patterns
in large data sets
• Often these are constructed by combining
different data sets from different sources
– Inconsistencies, redundancies, noise
• Data sets in the course are small enough
to be used quickly, but we want automatic
techniques that scale up to huge-ish data
Data
• Will normally appear as a rectangular
array: rows are observations, columns
are measurements (variables): n × p
– Data that is not already rectangular will be
wrestled into this form! – columns of pixels,
counts of terms in documents, etc.
• Each column has the same sort of
measurement: numeric (incl. date/time),
categorical, logical (True/False), text
• Data might be missing
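A minimal sketch of the rectangular layout described above, using made-up values. The data frame `obs`, its column names, and the two toy "documents" are all hypothetical; they only illustrate one type per column, a missing value, and how term counts can be put into rows-by-columns form.

```r
# Hypothetical toy data: 4 observations (rows) by 5 variables (columns).
obs <- data.frame(
  speed    = c(12.5, 9.0, NA, 14.2),                        # numeric, one value missing
  seen     = as.Date(c("2015-12-01", "2015-12-02",
                       "2015-12-03", "2015-12-04")),        # date
  platform = factor(c("ship", "sub", "ship", "aircraft")),  # categorical
  active   = c(TRUE, FALSE, TRUE, TRUE),                    # logical
  note     = c("ok", "late report", "", "ok"),              # text
  stringsAsFactors = FALSE
)

dim(obs)              # n rows, p columns
sapply(obs, class)    # each column holds one sort of measurement
colSums(is.na(obs))   # number of missing values per column

# Non-rectangular data wrestled into shape: counts of terms in documents.
docs   <- c("big data big models", "data mining")   # made-up documents
words  <- strsplit(docs, " ")
vocab  <- sort(unique(unlist(words)))
counts <- t(vapply(words,
                   function(w) as.integer(table(factor(w, levels = vocab))),
                   integer(length(vocab))))
colnames(counts) <- vocab   # one row per document, one column per term
counts
```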
Types of Models (i)
• Often one response (target) variable is the
primary measure of interest (“Y ”)
• We want to predict the value of Y in new
data where predictors (X ’s) are known
• When Y is numeric, this is regression
– E.g. size of error in TACNAV data
• When Y is categorical, this is classification
– E.g. digit recognition (0, 1, …, 9)
• These models are called “supervised”
because the true Y ’s are known
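The two supervised cases above can be sketched in base R. Everything here is simulated for illustration: the predictors `x1` and `x2`, the responses, and the choice of `lm()` and `glm()` as the models are assumptions, not the TACNAV or digit examples from the slides.

```r
set.seed(1)                        # simulated data, for illustration only
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)

# Regression: Y is numeric
y_num   <- 1 + 2 * x1 - x2 + rnorm(n)
reg_fit <- lm(y_num ~ x1 + x2)

# Classification: Y is categorical (two classes here)
y_cat   <- factor(ifelse(x1 + x2 + rnorm(n) > 0, "yes", "no"))
cls_fit <- glm(y_cat ~ x1 + x2, family = binomial)

# New data where the X's are known but Y is not
new_x     <- data.frame(x1 = c(-1, 0, 1), x2 = c(0.5, 0, -0.5))
pred_num  <- predict(reg_fit, newdata = new_x)                     # predicted numeric Y
pred_prob <- predict(cls_fit, newdata = new_x, type = "response")  # estimated P(Y = "yes")
pred_cat  <- ifelse(pred_prob > 0.5, "yes", "no")                  # predicted class
```

Both fits are "supervised" in the sense of the slide: the true `y_num` and `y_cat` were available when the models were built.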
Recap of Part I
• Whitaker, September 2015
1. R and RStudio
2. Linear regression
– Comparison to Nearest-Neighbor Methods
3. Logistic Regression
4. Controlling Complexity
– Training set/test set; cross-validation
– Lasso and other regularization
5. Intro. to Neural Networks
These all involve linear combinations of predictor variables!
Constant Concerns
• Modern models like trees are very flexible
and therefore prone to over-fitting
• Control complexity by:
1. Using separate training and validation sets,
the latter to compare models
• Then evaluate prediction error with test set
2. Cross-validation across, say, 10 folds
3. Regularization (shrinkage of coefficients) via
ridge or lasso
• A constant theme in big data
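The three controls listed above can be sketched in base R under some loud assumptions: the data are simulated, polynomial degree stands in for "complexity", the 60/20/20 split is arbitrary, and the ridge step uses the textbook closed-form solution rather than any package the course may use.

```r
set.seed(42)                                   # simulated data, for illustration only
n     <- 300
dat   <- data.frame(x = runif(n, 0, 10))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)

mse <- function(fit, newd) mean((newd$y - predict(fit, newdata = newd))^2)

# 1. Training / validation / test sets (60% / 20% / 20%)
part  <- sample(rep(c("train", "valid", "test"), times = c(180, 60, 60)))
train <- dat[part == "train", ]
valid <- dat[part == "valid", ]
test  <- dat[part == "test", ]

degrees  <- 1:8
fits     <- lapply(degrees, function(deg) lm(y ~ poly(x, deg), data = train))
val_err  <- sapply(fits, mse, newd = valid)    # compare models on the validation set
best     <- which.min(val_err)
test_err <- mse(fits[[best]], test)            # then evaluate the chosen model on the test set

# 2. Cross-validation across 10 folds
k      <- 10
fold   <- sample(rep(1:k, length.out = nrow(dat)))
cv_err <- sapply(degrees, function(deg) {
  mean(sapply(1:k, function(j) {
    fit <- lm(y ~ poly(x, deg), data = dat[fold != j, ])
    mse(fit, dat[fold == j, ])                 # error on the held-out fold
  }))
})

# 3. Ridge shrinkage in closed form on standardized features
X  <- scale(sapply(1:4, function(p) dat$x^p))
yc <- dat$y - mean(dat$y)
ridge <- function(lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, yc))
}
c(unshrunk = sum(ridge(0)^2), shrunk = sum(ridge(100)^2))  # larger lambda, smaller coefficients
```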
Use the model built with the training data to
predict a new set of data
[Figure: two panels plotting y against x. Over-Fitting: Low Bias – High Variance. Under-Fitting: High Bias – Low Variance.]
[Figure: Prediction Error, e.g. $\sum_i (y_i - \hat{y}_i)^2 / n$, plotted against Measure of Complexity (Less Complex to More Complex), with one curve for the Training Set and one for An Independent Test Set.]
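A rough sketch of how a curve of prediction error against complexity might be produced. The data are simulated, polynomial degree is an assumed stand-in for the complexity measure, and `pred_error` is a hypothetical helper implementing the mean squared prediction error, $\sum_i (y_i - \hat{y}_i)^2 / n$.

```r
set.seed(11)                                   # simulated data, for illustration only
make_data <- function(n) {
  x <- runif(n, 0, 10)
  data.frame(x = x, y = sin(x) + rnorm(n, sd = 0.3))
}
train <- make_data(30)
test  <- make_data(1000)                       # an independent test set

# Prediction error: sum_i (y_i - yhat_i)^2 / n
pred_error <- function(fit, newd) mean((newd$y - predict(fit, newdata = newd))^2)

degrees   <- 1:10                              # less complex ... more complex
fits      <- lapply(degrees, function(deg) lm(y ~ poly(x, deg), data = train))
train_err <- sapply(fits, pred_error, newd = train)
test_err  <- sapply(fits, pred_error, newd = test)

matplot(degrees, cbind(train_err, test_err), type = "b", log = "y",
        pch = 1:2, lty = 1:2, col = 1:2,
        xlab = "Measure of complexity (polynomial degree)",
        ylab = "Prediction error")
legend("topright", legend = c("Training set", "Independent test set"),
       pch = 1:2, lty = 1:2, col = 1:2)
```

Because the polynomial models are nested and fitted by least squares, the training error can only go down as the degree rises; the test error has no such guarantee, which is why it is the one to watch.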
Types of Models (ii)
• Unsupervised models have no particular
response variable
• Goals are to find groups (clustering,
source separation), or anomalies, or
relationships (association rules), or reduce
dimensionality for visualization
• Generally more difficult and less satisfying
than supervised models
– Hard to evaluate or compare quality
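A small sketch of two of the unsupervised goals above, finding groups and reducing dimensionality for visualization. The data are simulated with two well-separated groups, and `kmeans()` and `prcomp()` are just one possible choice of method, assumed here for illustration.

```r
set.seed(7)                                    # simulated data, for illustration only
g1 <- matrix(rnorm(50 * 4, mean = 0), ncol = 4)
g2 <- matrix(rnorm(50 * 4, mean = 5), ncol = 4)
X  <- rbind(g1, g2)                            # 100 x 4, with no response column

# Find groups: k-means clustering with k = 2
km <- kmeans(X, centers = 2, nstart = 10)
table(km$cluster)

# Reduce dimensionality for visualization: first two principal components
pc     <- prcomp(X, scale. = TRUE)
scores <- pc$x[, 1:2]
plot(scores, col = km$cluster, pch = 19,
     main = "Clusters shown on the first two principal components")
```

Here the grouping can be checked only because the data were simulated. With real data there is no known answer to compare against, which is the evaluation difficulty the slide points to.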
R and RStudio Recap
• R is a very popular free open-source
statistical environment
• RStudio is a free front-end that makes
managing scripts and graphics easier
• Our variables come in vectors; a
rectangular set of vectors makes up a
data.frame
• Example – beer (35 x 11)
R Basics Restated
• R is case-sensitive (but Windows isn’t)
• help (thing) or ?thing for help
• a <- b assigns value of b to a
– Subsequent changes to b don’t affect a
• Recall earlier commands with up-arrow
– history(100) shows last 100
• Use forward slash for file names
• Special characters: \n, \t, \\, \"
– Single or double-quotes okay; # for comment
• == for “is equal”, != for “not equal”
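The basics above, as a short runnable sketch. The object names and the file path are hypothetical and chosen only for illustration.

```r
# help(mean)      # or ?mean, for help on a function

b <- c(1, 2, 3)
a <- b            # a is assigned the value of b
b[1] <- 99        # a later change to b...
a                 # ...does not affect a: still 1 2 3

x <- 5
X <- 10           # R is case-sensitive: x and X are different objects
x == X            # FALSE
x != X            # TRUE

# Special characters inside strings; single or double quotes both work
cat("tab:\there\nbackslash: \\ quote: \"\n")
cat('single quotes work too\n')   # this text after the # is a comment

# Forward slashes in file names (hypothetical path)
path <- "C:/course/data/example.csv"
```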
Materials
• The disc has Slides, R Scripts, Data sets
and Libraries (plus a few random things)
– library (<name>) looks in default places;
library (<name>, lib.loc=<place>)
• Ctrl-Enter executes a line from a script,
but…
• Lots of commands are already given to
you – for best results make sure you
understand them
R Refresher
• Get beer data into R from Excel
• Data frames and variable types
• Simple exploration
– Plot of Calories vs. Alcohol
• Simple linear regression model of Calories
vs. Alcohol
• Drawing the response “surface”
– To be used to compare the linear model with
the tree model
• Let’s do this!
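A sketch of the refresher steps, under heavy assumptions. The course's beer data are not reproduced here, so `fake_beer` is a simulated stand-in with made-up values: only the 35 rows and the column names `Calories` and `Alcohol` follow the slides. The commented `read.csv()` line and its file name are likewise hypothetical; saving the Excel sheet as CSV is just one common route into R.

```r
# One possible way to bring Excel data into R (file name is hypothetical):
# beer <- read.csv("beer.csv")

# Simulated stand-in so the sketch runs on its own. These are NOT the real values.
set.seed(3)
fake_beer <- data.frame(Alcohol = round(runif(35, 3, 6), 1))
fake_beer$Calories <- round(30 + 25 * fake_beer$Alcohol + rnorm(35, sd = 10))

str(fake_beer)                                   # data frame and variable types
summary(fake_beer)                               # simple exploration

plot(Calories ~ Alcohol, data = fake_beer)       # plot of Calories vs. Alcohol
fit <- lm(Calories ~ Alcohol, data = fake_beer)  # simple linear regression
abline(fit)                                      # add the fitted line to the plot

# Response "surface": predictions over a grid of Alcohol values
grid      <- data.frame(Alcohol = seq(3, 6, by = 0.5))
grid$pred <- predict(fit, newdata = grid)
grid
```

With a single predictor the "surface" is just a line, but the same grid-and-predict pattern can be reused to draw the predictions of a tree model for comparison.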