STAT 462/862 Unit 2: Overview
Devon Lin
Department of Mathematics and Statistics
Queen’s University, Sept 23
Topics
• Types of data, terminology
• Statistical data analysis
• Various problems
Reference:
Sec 1, 2.1, 2.2 of Hastie, Tibshirani and Friedman
Auto Dataset
• http://www-bcf.usc.edu/~gareth/ISL/Auto.data
• data(Auto) in the R package ISLR
A data frame with 397 observations on the following 9 variables.
mpg: miles per gallon
cylinders: Number of cylinders between 4 and 8
displacement: Engine displacement (cu. inches)
horsepower: Engine horsepower
weight: Vehicle weight (lbs.)
acceleration: Time to accelerate from 0 to 60 mph (sec.)
year: Model year (modulo 100)
origin: Origin of car (1. American, 2. European, 3. Japanese)
name: Vehicle name
• What variables impact mpg, and in what way? Treat mpg as the response
and all other variables except name as the predictors.
• Predict whether a given car gets high or low gas mileage. Create a binary
variable, mpg01, that is 1 if mpg is above its median and 0 if mpg is
below its median.
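The mpg01 construction can be sketched in R as follows. This is a minimal illustration on a small made-up mpg vector (the values are hypothetical, not taken from the Auto data). The slide does not say how to code a car whose mpg equals the median, so the sketch assumes such a car is assigned 0.

```r
# Hypothetical mpg values, used only for illustration
mpg <- c(18, 15, 36, 24, 30, 14)

# 1 if mpg is above its median, 0 otherwise
# (assumption: a value exactly equal to the median is coded as 0)
mpg01 <- as.integer(mpg > median(mpg))

# Show the original values next to the new binary variable
data.frame(mpg = mpg, mpg01 = mpg01)
```

With the real data, the same line should apply to the mpg column of the Auto data frame once it is loaded.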
Datasets
• http://www.statsci.org/datasets.html
• https://stat.ethz.ch/R-manual/Rdevel/library/datasets/html/00Index.html
• http://mldata.org/
What does data look like?
[Table: five rows x1, x2, x3, x4, x5 (observations); columns X1, X2, X3 (inputs) and Y (output).]
• Data/points/instances/examples/samples/records: rows
• Input variables/independent variables/features/attributes/covariates/predictors/regressors/factors: columns
• Output variable/outcome/response/label/dependent variable: special
column that is observed or to be predicted
Data types
• Continuous: quantitative, a number like weight or length
• Discrete: qualitative, a symbol like ‘cat’ or ‘dog’, ‘0’ or ‘1’,
{0,1,2,3}, {small, medium, large}
Output types
• Independent identically distributed (iid)
• Spatial
• Time series
• Spatial-temporal
• Functional
• Image (matrices)
• Variable-size non-vector data (e.g. graphs, texts)
Statistical data analysis
• Step 1: Define the problem and state the objective.
• Step 2: Collect data.
• Step 3: Analyze data: exploratory methods and
confirmatory methods (including model fitting and
assessment of the model assumptions).
• Step 4: Draw conclusions and make
recommendations.
Main goals of data analysis
• Estimation: build a model from observations, or training
data (x_i, y_i), i = 1, 2, ..., n, through a process called
learning (or estimation).
• Prediction: use that model to predict something
about data you haven’t seen before, called test data, that
come from the same distribution as the training data.
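The two goals can be illustrated with a short R sketch. It assumes simulated data from a straight-line model; the sample size, the 70/30 split and the true coefficients are arbitrary choices made for illustration. A model is estimated on the training data and then used to predict the held-out test data.

```r
set.seed(462)

# Simulated data (assumption): y = 2 + 3x + noise
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 1)
dat <- data.frame(x = x, y = y)

# Split into training data and test data from the same distribution
train_idx <- sample(n, size = 70)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Estimation: learn the model from the training data only
fit <- lm(y ~ x, data = train)

# Prediction: apply the fitted model to data it has not seen
pred <- predict(fit, newdata = test)

# Mean squared prediction error on the test data
test_mse <- mean((test$y - pred)^2)
test_mse
```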
Main learning tasks
1. Regression: predict a continuous output variable
2. Classification: predict a discrete output variable
3. Density estimation: predict the distribution
4. Clustering: predict clusters
5. Dimensionality reduction: find a smaller set of new
variables each being a combination of the input variables
Supervised learning: predicting an output variable for which we
get to see examples. (regression, classification)
Unsupervised learning: predicting a target variable for which we
never get to see examples. (density estimation, clustering,
dimensionality reduction)
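As an illustration of dimensionality reduction, the following sketch applies principal component analysis (one common technique, chosen here as an example) to the four iris measurements using base R's prcomp. Keeping two components is an arbitrary choice for illustration. Each new variable is a linear combination of the input variables, and the species labels are not used.

```r
data(iris)

# Use only the four numeric input variables (no output variable)
measurements <- iris[, 1:4]

# Principal component analysis on centered, scaled inputs
pca <- prcomp(measurements, center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_explained), 3)

# Keep the first two new variables (an arbitrary choice here)
scores <- pca$x[, 1:2]

# Weights of the input variables in each new variable
pca$rotation[, 1:2]
```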
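Regression
[Diagram: a training dataset with inputs X1, X2, X3 and a known continuous output/response Y (3, 4.5, 5.6, 2.9, 7) is used to learn a model. The model is then applied to a test dataset whose Y values are unknown (?), giving the predictions 5.1, 3.9, 7.9, 2.9, 4.]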
Classical example of a regression model: least squares regression
An example of regression
• Data: measurements of the girth, height and volume
of timber in 31 felled black cherry trees. Note that
girth is the diameter of the tree (in inches) measured
at 4 ft 6 in above the ground.
> data(trees)
> attach(trees)
> plot(Volume ~ Girth, data = trees, log = "xy")
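Because the plot uses log scales on both axes, one natural follow-up is a least squares fit of log(Volume) on log(Girth). The sketch below is only one possible model, not necessarily the one used in the course, and the girth of 12 inches used for prediction is a hypothetical value.

```r
data(trees)

# Least squares fit on the log-log scale
fit <- lm(log(Volume) ~ log(Girth), data = trees)
coef(fit)

# Predict the volume of a hypothetical tree with a girth of 12 inches,
# back-transforming from the log scale
new_tree <- data.frame(Girth = 12)
pred_volume <- exp(predict(fit, newdata = new_tree))
pred_volume
```

Calling summary(fit) gives standard errors and the coefficient of determination for the fitted line.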
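Classification
[Diagram: a training dataset with inputs X1, X2, X3 and a known discrete output/target/class Y (A, B, B, A, B) is used to learn a model. The model is then applied to a test dataset whose Y values are unknown (?), giving the predicted classes B, A, A, B, B.]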
An example of classification
• Data: gives the measurements in centimeters of the
variables sepal length and width and petal length and
width, respectively, for 50 flowers from each of 3
species of iris. The species are Iris setosa, versicolor,
and virginica.
> install.packages("e1071")
> library(e1071)
> data(iris)
> attach(iris)
> ## classification mode
> # default with factor response:
> model <- svm(Species ~ ., data = iris)
> print(model)
> summary(model)
An example of density estimation
• Data: Eruption time in minutes for the Old Faithful
geyser in Yellowstone National Park, Wyoming, USA.
> data(faithful)
> attach(faithful)
> eruptions
> d = density(eruptions, bw = "sj")
> plot(d)
Clustering
• Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
• Similarity measures:
– Euclidean distance if variables are continuous.
– Other problem-specific measures: city-block distance, Mahalanobis
distance, etc.
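A minimal R sketch of these ideas, assuming two simulated groups of points in the plane (the group means and spreads are made up for illustration). It computes Euclidean and city-block distances with dist, then clusters the points with k-means, which works with Euclidean distance.

```r
set.seed(1)

# Two simulated groups of 20 points each (hypothetical data)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

# Pairwise distances under two different measures
d_euclid <- dist(x, method = "euclidean")
d_city <- dist(x, method = "manhattan")  # city-block distance

# K-means with two clusters and several random starts
km <- kmeans(x, centers = 2, nstart = 10)

# Cluster sizes, and within-cluster versus total sum of squares
table(km$cluster)
c(within = km$tot.withinss, total = km$totss)
```

Mahalanobis distance is also available in base R through the mahalanobis function.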
Illustrating clustering
[Figure: Euclidean distance-based clustering in 3-D space. Within-cluster distances are minimized; between-cluster distances are maximized.]
Illustrating document clustering
• Clustering points: 3204 articles from the Los Angeles Times.
• Similarity measure: how many words these documents have in
common (after some word filtering).
Category        Total Articles   Correctly Placed
Financial       555              364
Foreign         341              260
National        273              36
Metro           943              746
Sports          738              573
Entertainment   354              278
Anomaly or Outlier Detection
• Detect significant deviations from normal behavior
• Applications:
– Credit Card Fraud Detection
– Online Fraud Detection
– Network Intrusion Detection
Typical network traffic at University level may reach over 100 million connections per day
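One very simple way to flag deviations from normal behavior is to measure how far each value lies from the median in units of the median absolute deviation. The sketch below uses made-up transaction amounts and an arbitrary cutoff of 3; it illustrates the idea only and is not a fraud or intrusion detection method.

```r
# Hypothetical transaction amounts with one unusual value
amounts <- c(25, 40, 31, 28, 35, 30, 2500, 33, 27, 38)

# Robust center and spread
center <- median(amounts)
spread <- mad(amounts)

# Flag values more than 3 robust spreads from the center
# (the cutoff of 3 is an assumption made for this example)
is_outlier <- abs(amounts - center) / spread > 3
amounts[is_outlier]
```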
Data collection
• Observational study: observe and measure the variables
without changing their conditions
• Experimental design: manipulate the conditions of the
experiments and control the factors
• Survey sampling: the process of choosing samples from a
population
• Simulation: use a mathematical or computer model to
represent the physical process.
Data cleaning
• Raw data need to be accurately entered for successful
evaluation of information
• Check character variables have valid values
• Check numeric variables are within range
• Check for missing values
• Check for and eliminate duplicates
• Check for unique values (ID variables)
• Check for invalid dates or observations
• Combine multiple files
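Several of these checks can be sketched in base R. The data frame below is hypothetical, and the valid codes for sex and the plausible age range of 0 to 120 are assumptions made for illustration.

```r
# Hypothetical raw records with deliberate problems
records <- data.frame(
  id = c(1, 2, 2, 4, 5),
  sex = c("F", "M", "M", "X", "F"),
  age = c(34, 29, 29, 210, NA),
  stringsAsFactors = FALSE
)

# Character variable has valid values (assumed codes: "F" and "M")
invalid_sex <- !(records$sex %in% c("F", "M"))

# Numeric variable is within range (assumed range: 0 to 120)
out_of_range <- !is.na(records$age) &
  (records$age < 0 | records$age > 120)

# Missing values per column
n_missing <- colSums(is.na(records))

# Duplicate rows and non-unique ID values
dup_rows <- duplicated(records)
dup_ids <- duplicated(records$id)

# Eliminate exact duplicate rows
cleaned <- records[!dup_rows, ]
cleaned
```

Combining multiple files can usually be handled with rbind for files that share columns, or merge for files that share a key.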
Some questions for data analysis
• Which model class should be fit to the data?
• Which notion of error to use? (loss functions)
• How to make sure the error on future data is
minimized? (generalization)
• Which model to use? (model selection)
• How does the model perform when its assumptions
do not hold? (robustness)
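To make the notion of error concrete, the sketch below computes three common loss functions on small made-up vectors of observed and predicted values: squared error and absolute error for a continuous output, and 0-1 loss for a discrete output. All values are hypothetical.

```r
# Hypothetical observed and predicted continuous outputs
y <- c(1, 2, 3, 4)
y_hat <- c(1, 2, 3, 8)

# Mean squared error: penalizes large errors heavily
squared_loss <- mean((y - y_hat)^2)

# Mean absolute error: less sensitive to a single large error
absolute_loss <- mean(abs(y - y_hat))

# Hypothetical observed and predicted class labels
labels <- c("A", "B", "B", "A")
labels_hat <- c("A", "B", "A", "A")

# 0-1 loss: proportion of misclassified observations
zero_one_loss <- mean(labels != labels_hat)

c(squared = squared_loss, absolute = absolute_loss,
  zero_one = zero_one_loss)
```

The single large error in the last prediction affects the squared loss far more than the absolute loss, which is one reason the choice of loss matters for robustness.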
Challenges of data analysis
• Scalability
– a method that works on a small data set may not work on a larger
one (is the algorithm efficient and practical for larger data sets?)
• Dimensionality
• Complex and massive data
• Data quality
• Data ownership and distribution
• Privacy preservation
Topics
• Regression and regularized methods
• Classification
– Linear, quadratic, linear discriminant analysis
– Logistic regression
– Support vector machines
• Clustering
– K-means, hierarchical, self-organizing maps
• Singular Value Decomposition, Principal Component Analysis.