Dr. Eick
COSC 4335 “Data Mining” Spring 2017
Assignment1: (Exploratory) Data Analysis
for a Vehicle Silhouette Dataset
Group Project (typically 2-3 students per group)
Due: Saturday, February 18, 11 p.m. (electronic submission)
Last Updated: January 24, 2016, 11:11a
Learning Objectives:
1. Learn how to manage and preprocess datasets and how to compute basic statistics
and to create basic data visualizations (using R)
2. Learn how to interpret popular displays, such as histograms, scatter plots, box plots,
density plots,…
3. Get some practical experience in exploratory data analysis
4. Learn how to create background knowledge for a dataset
5. Learn to distinguish expected from unexpected results in data analysis and data
mining—in general, this task is quite challenging, as it requires background
knowledge with respect to the employed data mining technique, and also practical
experience.
Download the Statlog (Vehicle Silhouettes) dataset from
http://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes). Limit your analysis
to the following subset of the dataset; use all examples to create the subset
and do not change the order of the examples in the dataset:
i. Groups 1-5 analyze the following attributes (1st, 4th, 8th, and 11th attribute) and the
class attribute:
   COMPACTNESS: (average perim)**2/area
   RADIUS RATIO: (max.rad-min.rad)/av.radius
   ELONGATEDNESS: area/(shrink width)**2
   SCALED VARIANCE ALONG MAJOR AXIS: (2nd order moment about minor axis)/area
ii. Groups 6 and higher analyze the following attributes (1st, 2nd, 11th, and 18th attribute)
and the class attribute:
   COMPACTNESS: (average perim)**2/area
   CIRCULARITY: (average radius)**2/area
   SCALED VARIANCE ALONG MAJOR AXIS: (2nd order moment about minor axis)/area
   HOLLOWS RATIO: (area of hollows)/(area of bounding polygon)
5 Examples in the raw Vehicle Silhouette Dataset:
96 55 103 201 65 9 204 32 23 166 227 624 246 74 6 2 186 194 opel
89 36 51 109 52 6 118 57 17 129 137 206 125 80 2 14 181 185 van
99 41 77 197 69 6 177 36 21 139 202 485 151 72 4 10 198 199 bus
104 54 100 186 61 10 216 31 24 173 225 686 220 74 5 11 185 195 saab
101 56 100 215 69 10 208 32 24 169 227 651 223 74 6 5 186 193 opel
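As a rough sketch of how the subset could be built in R, the snippet below parses the five example rows with `read.table` and keeps the columns assigned to groups 1-5. The column names and variable names are illustrative assumptions, not part of the assignment; with the real download, a file path would replace the `text` argument.

```r
# The five example rows from the handout, one instance per line:
# 18 numeric attributes followed by the class label.
raw_text <- "
96 55 103 201 65 9 204 32 23 166 227 624 246 74 6 2 186 194 opel
89 36 51 109 52 6 118 57 17 129 137 206 125 80 2 14 181 185 van
99 41 77 197 69 6 177 36 21 139 202 485 151 72 4 10 198 199 bus
104 54 100 186 61 10 216 31 24 173 225 686 220 74 5 11 185 195 saab
101 56 100 215 69 10 208 32 24 169 227 651 223 74 6 5 186 193 opel
"
vehicles <- read.table(text = raw_text, header = FALSE, stringsAsFactors = FALSE)

# Groups 1-5: attributes 1, 4, 8, 11 plus the class label (column 19).
# Groups 6 and higher would use c(1, 2, 11, 18, 19) instead.
keep <- c(1, 4, 8, 11, 19)
subset_df <- vehicles[, keep]

# Hypothetical column names chosen for readability.
names(subset_df) <- c("compactness", "radius_ratio", "elongatedness",
                      "scaled_var_major", "class")
str(subset_df)
```

Selecting columns by position leaves the row order untouched, which matches the requirement not to change the order of the examples.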
Assignment1 Tasks:
Apply the following exploratory data analysis techniques using R to your dataset:
0. Compute the mean value and standard deviation of the 4 numerical attributes (see
footnote 1). 1 point
1. Compute the covariance matrix for each pair of your 4 numerical attributes; next,
compute the correlations for each of the 6 pairs of the 4 attributes. Interpret the
statistical findings! 6 points
2. Create a scatter plot for COMPACTNESS and SCALED VARIANCE of your
dataset. Interpret the scatter plot! 3 points
3. Create histograms for each of the 4 attributes. Then create the same histograms for
the 4 attributes for the instances of each of the 4 classes; interpret the obtained 20
histograms. 10 points
4. Create box plots for the first and last numerical attribute of your dataset for the
instances of the 4 classes and the whole dataset. Interpret and compare the obtained 5
boxplots for each of the two attributes! 8 points
5. Create supervised scatter plots/supervised density plots for all pairs of your numerical
attributes. Next create a 3D-scatterplot using the first 3 numerical attributes and the
last 3 numerical attributes of your dataset—that is two 3D-scatterplots have to be
created. Interpret the obtained plots; in particular address what can be said about the
difficulty in predicting the correct class of the vehicle silhouette. Assess the
usefulness of the 3D scatterplot compared to the 2D plots! 10 points
6. Create a star plot for the first 10 instances of class OPEL and the first 10 instances of
class VAN (based on the order in the file); interpret the 20 star plots. Star plots should be
constructed for the 4 continuous attributes! 3 points
7. Create a new dataset ZVS from your original dataset by transforming the 4
continuous attributes into z-scores; next convert the class attribute as follows:
OPEL→1, SAAB→2, VAN→3, BUS→4. Next, fit a linear model that predicts the
modified class attribute using the four z-scored, continuous attributes as independent
variables. Report the R² of the linear model and the coefficients of each attribute in
the obtained regression function. Do the coefficients tell you anything about the
importance of the attribute in predicting the four classes of vehicle silhouettes? 8 points
8. Create 3 decision tree models with 20 or fewer nodes for your dataset (the total number of
nodes should be less than 21; do not submit models with more than 20 nodes!).
Explain how the 3 decision tree models were obtained. Report the training accuracy
and the testing accuracy of this decision tree; interpret the learnt decision tree. What
does it tell you about the importance of the 4 continuous attributes for the
classification problem? 6 points
9. Write a conclusion (at most 13 sentences!) summarizing the most important findings
of the assignment; in particular, address the findings obtained related to predicting the
class attribute. 4 points (and up to 4 extra points)
Remark: About 25-33% of the Assignment1 points will be allocated to interpreting
statistical findings and visualizations.
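For tasks 0 and 1, a minimal sketch of the basic statistics in base R follows. It uses only attributes 1, 4, 8, and 11 of the five example rows shown in the handout, so the numbers it prints are not the values expected for the full dataset; the column names are illustrative assumptions.

```r
# Attributes 1, 4, 8 and 11 of the five example rows from the handout.
# Column names are hypothetical; the assignment does not prescribe them.
x <- data.frame(
  compactness      = c(96, 89, 99, 104, 101),
  radius_ratio     = c(201, 109, 197, 186, 215),
  elongatedness    = c(32, 57, 36, 31, 32),
  scaled_var_major = c(227, 137, 202, 225, 227)
)

means <- sapply(x, mean)  # task 0: mean of each attribute
sds   <- sapply(x, sd)    # task 0: sample standard deviation of each attribute

cov_mat <- cov(x)         # task 1: 4 x 4 covariance matrix
cor_mat <- cor(x)         # task 1: 4 x 4 Pearson correlation matrix

# Task 1: list the 6 distinct attribute pairs with their correlation.
pair_names <- t(combn(names(x), 2))
cor_pairs <- data.frame(
  attr1 = pair_names[, 1],
  attr2 = pair_names[, 2],
  r     = cor_mat[pair_names]   # index the matrix by (row name, column name)
)

print(round(means, 2))
print(round(sds, 2))
print(round(cov_mat, 2))
print(cor_pairs)
```

Listing the pairs in a separate table is one way to make sure each of the 6 correlations is reported exactly once.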
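For tasks 2 and 3, the scatter plot and the 20 histograms could be produced roughly as sketched below with base graphics. The data frame `toy` is synthetic stand-in data generated on the spot, not the real Statlog values, and its column names are assumptions.

```r
set.seed(1)  # synthetic stand-in data, NOT the real dataset
cls <- sample(rep(c("opel", "saab", "van", "bus"), each = 30))
toy <- data.frame(
  compactness      = rnorm(120, 94, 8),
  radius_ratio     = rnorm(120, 170, 30),
  elongatedness    = rnorm(120, 40, 8),
  scaled_var_major = rnorm(120, 190, 30),
  class            = cls
)
attrs <- setdiff(names(toy), "class")

# Task 2: plain scatter plot of two attributes.
plot(toy$compactness, toy$scaled_var_major,
     xlab = "compactness", ylab = "scaled_var_major",
     main = "Compactness vs. scaled variance")

# Task 3: one row of panels for the whole dataset plus one row per class,
# one column per attribute: 5 x 4 = 20 histograms.
groups <- c(list(all = toy), split(toy, toy$class))
op <- par(mfrow = c(length(groups), length(attrs)), mar = c(3, 3, 2, 1))
hists <- list()
for (g in names(groups)) {
  for (a in attrs) {
    # Same bin edges in every row so the panels are directly comparable.
    brks <- pretty(range(toy[[a]]), n = 10)
    hists[[paste(g, a)]] <- hist(groups[[g]][[a]], breaks = brks,
                                 main = paste(g, a), xlab = a)
  }
}
par(op)
```

Fixing the bin edges per attribute is a design choice made here so that differences between classes show up as differences in shape rather than in axis scaling.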
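For task 4, one possible way to get the 5 box plots per attribute (4 classes plus the whole dataset) into a single panel is to stack a copy of the data labelled "all" on top of the per-class data. The sketch below does this on synthetic stand-in data with assumed column names.

```r
set.seed(2)  # synthetic stand-in data, NOT the real dataset
cls <- sample(rep(c("opel", "saab", "van", "bus"), each = 30))
toy <- data.frame(
  compactness      = rnorm(120, 94, 8),
  scaled_var_major = rnorm(120, 190, 30),
  class            = cls
)

# Every instance appears twice: once under "all", once under its own class.
stacked <- rbind(
  data.frame(group = "all",     toy[c("compactness", "scaled_var_major")]),
  data.frame(group = toy$class, toy[c("compactness", "scaled_var_major")])
)
stacked$group <- factor(stacked$group,
                        levels = c("all", "opel", "saab", "van", "bus"))

# One panel per attribute, 5 boxes per panel.
op <- par(mfrow = c(1, 2))
bp_first <- boxplot(compactness ~ group, data = stacked,
                    main = "compactness")
bp_last  <- boxplot(scaled_var_major ~ group, data = stacked,
                    main = "scaled_var_major")
par(op)

# boxplot() also returns the five-number summaries it draws.
print(bp_first$stats)
```

The returned `stats` matrix (one column per box) can help when writing the comparison, since it gives the whisker ends, hinges, and median as numbers.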
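For task 5, the sketch below shows one way to draw class-coloured ("supervised") scatter plots, per-class density curves, and the two 3D scatter plots. It assumes the `lattice` package (a recommended package bundled with standard R installations) for the 3D plots; other 3D plotting packages would work as well. The data are synthetic and the column names are assumptions.

```r
library(lattice)  # assumed available; used only for the 3D plots

set.seed(3)  # synthetic stand-in data, NOT the real dataset
cls <- sample(rep(c("opel", "saab", "van", "bus"), each = 30))
toy <- data.frame(
  compactness      = rnorm(120, 94, 8),
  radius_ratio     = rnorm(120, 170, 30),
  elongatedness    = rnorm(120, 40, 8),
  scaled_var_major = rnorm(120, 190, 30),
  class            = cls
)
attrs <- setdiff(names(toy), "class")
cols <- c(opel = "red", saab = "blue", van = "darkgreen", bus = "orange")

# Supervised scatter plots for all attribute pairs: colour encodes the class.
pairs(toy[attrs], col = cols[toy$class], pch = 19,
      main = "Supervised scatter plots")

# Supervised density plot for one attribute: one curve per class.
dens <- lapply(split(toy$compactness, toy$class), density)
ymax <- max(sapply(dens, function(d) max(d$y)))
plot(dens[[1]], xlim = range(toy$compactness), ylim = c(0, ymax),
     col = cols[names(dens)[1]], main = "compactness by class")
for (k in names(dens)[-1]) lines(dens[[k]], col = cols[k])
legend("topright", legend = names(dens), col = cols[names(dens)], lty = 1)

# 3D scatter plots: first three and last three attributes, grouped by class.
p_first <- cloud(elongatedness ~ compactness * radius_ratio,
                 data = toy, groups = class, auto.key = TRUE)
p_last  <- cloud(scaled_var_major ~ radius_ratio * elongatedness,
                 data = toy, groups = class, auto.key = TRUE)
print(p_first)  # lattice plots must be printed explicitly in scripts
print(p_last)
```

The density loop would be repeated for each of the 4 attributes in an actual submission; only one attribute is shown here to keep the sketch short.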
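For task 6, base R provides `stars()` for star plots. The sketch below picks the first 10 OPEL and the first 10 VAN instances in file order from a synthetic stand-in data frame; the data and column names are assumptions.

```r
set.seed(4)  # synthetic stand-in data, NOT the real dataset
cls <- sample(rep(c("opel", "saab", "van", "bus"), each = 30))
toy <- data.frame(
  compactness      = rnorm(120, 94, 8),
  radius_ratio     = rnorm(120, 170, 30),
  elongatedness    = rnorm(120, 40, 8),
  scaled_var_major = rnorm(120, 190, 30),
  class            = cls
)
attrs <- setdiff(names(toy), "class")

# First 10 instances of a class, keeping the order of the data frame.
first10 <- function(k) head(toy[toy$class == k, ], 10)
star_data <- rbind(first10("opel"), first10("van"))

# One star per instance, one spoke per continuous attribute.
# By default stars() rescales each column to [0, 1] before drawing.
stars(star_data[attrs],
      labels = paste(star_data$class, rownames(star_data)),
      nrow = 4, ncol = 5,
      main = "First 10 OPEL and first 10 VAN instances")
```

The `key.loc` argument of `stars()` can be used to add a key that shows which spoke belongs to which attribute.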
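For task 7, the z-score transformation and the linear model can be sketched with `scale()` and `lm()`. The data below are synthetic (with an artificial class-dependent shift so the fit is not trivial), and the names `zvs` and `class_code` are illustrative assumptions.

```r
set.seed(5)  # synthetic stand-in data, NOT the real dataset
cls <- sample(rep(c("opel", "saab", "van", "bus"), each = 30))
shift <- unname(c(opel = 0, saab = 2, van = -10, bus = -4)[cls])
toy <- data.frame(
  compactness      = rnorm(120, 94, 6) + shift,
  radius_ratio     = rnorm(120, 170, 30) + 3 * shift,
  elongatedness    = rnorm(120, 40, 6) - shift,
  scaled_var_major = rnorm(120, 190, 30) + 2 * shift,
  class            = cls
)
attrs <- setdiff(names(toy), "class")

# Z-score each continuous attribute: (x - mean) / sd, column by column.
zvs <- as.data.frame(scale(toy[attrs]))

# Recode the class as required: OPEL -> 1, SAAB -> 2, VAN -> 3, BUS -> 4.
class_code <- c(opel = 1, saab = 2, van = 3, bus = 4)
zvs$class <- unname(class_code[toy$class])

# Linear model with the numeric class code as the response.
fit   <- lm(class ~ ., data = zvs)
r2    <- summary(fit)$r.squared
coefs <- coef(fit)

print(round(r2, 3))
print(round(coefs, 3))
```

Because all four predictors are on the same z-score scale, their coefficients are expressed in comparable units, which is what makes the question about attribute importance meaningful to discuss.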
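For task 8, the assignment does not name a decision tree implementation. The sketch below assumes the `rpart` package (a recommended package bundled with standard R installations) and obtains three models by varying its control parameters, which is only one of several reasonable ways to get three different trees. The data are synthetic and the 70/30 train/test split is an assumption.

```r
library(rpart)  # assumed tree learner; the handout does not prescribe one

set.seed(6)  # synthetic stand-in data, NOT the real dataset
cls <- sample(rep(c("opel", "saab", "van", "bus"), each = 50))
shift <- unname(c(opel = 0, saab = 2, van = -10, bus = -4)[cls])
toy <- data.frame(
  compactness      = rnorm(200, 94, 6) + shift,
  radius_ratio     = rnorm(200, 170, 30) + 3 * shift,
  elongatedness    = rnorm(200, 40, 6) - shift,
  scaled_var_major = rnorm(200, 190, 30) + 2 * shift,
  class            = factor(cls)
)

# Hold out 30% of the instances for testing.
test_idx <- sample(nrow(toy), size = 60)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Three settings; a binary tree of depth <= 3 has at most 15 nodes.
controls <- list(
  depth2            = rpart.control(maxdepth = 2),
  depth3            = rpart.control(maxdepth = 3),
  depth3_big_leaves = rpart.control(maxdepth = 3, minbucket = 15)
)

# Fraction of instances whose predicted class equals the true class.
accuracy <- function(model, data) {
  mean(predict(model, data, type = "class") == data$class)
}

results <- t(sapply(controls, function(ctrl) {
  fit <- rpart(class ~ ., data = train, method = "class", control = ctrl)
  c(nodes     = nrow(fit$frame),   # internal nodes plus leaves
    train_acc = accuracy(fit, train),
    test_acc  = accuracy(fit, test))
}))

stopifnot(all(results[, "nodes"] <= 20))  # enforce the 20-node limit
print(round(results, 3))
```

Capping `maxdepth` at 3 guarantees the node limit regardless of the data; printing a fitted model or inspecting its `variable.importance` component are possible starting points for the interpretation.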
Footnote 1: This is more of a verification that you have the correct dataset!