ISC471/HCI571
Fall 2012
Assignment 3
Classification
Due date: Sunday, November 11, 2012, midnight
The goal of this assignment is to practice classification methods and to apply them to
a dataset using SPSS data analysis tools.
Assignment preparation
This assignment uses the SPSS data analysis tool.
Heart disease datasets
The dataset studied is the heart disease dataset, provided in several formats: heart-c.arff (Cleveland data) and heart-h.arff (Hungarian data) from the UCI repository (datasets-UCI.jar) for the nominal version, and cleveland.arff and hungarian.arff from the
UCI numeric repository (datasets-numeric.jar). These datasets describe factors of heart
disease. They can be downloaded from:
http://www.cs.waikato.ac.nz/~ml/weka/index_datasets.html.
The goal of the data mining project is to better understand the risk factors for heart
disease, as represented in the 14th attribute: num (<50 means no disease, and values <50-1 to <50-4 represent increasing levels of heart disease).
The question on which this data analysis study concentrates is whether it is
possible to predict heart disease from the other known data about a patient. The data
mining task of choice to answer this question will be classification/prediction, and
several different algorithms will be used to find which one provides the best predictive
power.
1. Data preparation - integration
We want to merge the two datasets (heart-c.arff and heart-h.arff) into one, in a step
called data integration. We also want to transform these files to text format.
a. Is there an entity identification or schema integration (difference in database
schema) problem in this dataset? If yes, how can it be fixed?
b. Integrate the two datasets into one single dataset, called heart-ch.txt, which will
be used as a starting point for the next questions. Keep in this dataset only the
lines after “@data”. Add at the top of the file a line with the list of attribute
names separated by commas (in the ARFF files, each attribute is listed on a separate
line preceded by the word “@attribute”). The remainder of each ARFF header explains
what each attribute means, and you will need to become familiar with the meaning of
each attribute for the rest of this study. How many instances do you have? How many attributes?
c. Import the file heart-ch.txt into SPSS. A wizard will guide you through reading
the file as a comma-delimited file.
One option to check is that variable names are included at the top of the file.
Also, there is only one delimiter between the variables, which is the comma.
Save the file in SPSS format for future use: heart-ch.sav.
Paste below a screenshot of the SPSS screen showing the bottom of the file
imported.
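The integration step in 1b can also be sketched outside SPSS. The following Python sketch is only an illustration, not a required part of the assignment: the function names parse_arff and merge_arff are hypothetical, the two sample strings are tiny made-up stand-ins for heart-c.arff and heart-h.arff, and it assumes attribute names contain no spaces.

```python
def parse_arff(text):
    """Return (attribute_names, data_rows) from the text of an ARFF file."""
    names, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comment lines
        lower = line.lower()
        if in_data:
            rows.append(line)  # everything after @data is an instance
        elif lower.startswith("@attribute"):
            # the second token is the attribute name, possibly quoted
            names.append(line.split()[1].strip("'\""))
        elif lower.startswith("@data"):
            in_data = True
    return names, rows


def merge_arff(texts):
    """Merge ARFF texts that share one schema into comma-delimited text."""
    header, merged = None, []
    for text in texts:
        names, rows = parse_arff(text)
        if header is None:
            header = names
        elif names != header:
            # a schema integration problem: resolve it before merging
            raise ValueError("attribute lists differ; fix the schema first")
        merged.extend(rows)
    return "\n".join([",".join(header)] + merged) + "\n"


# Made-up samples (3 attributes only) standing in for the real files.
cleveland = """@relation cleveland
@attribute age numeric
@attribute sex {female,male}
@attribute num {no,yes}
@data
63,male,no
67,male,yes
"""
hungarian = """@relation hungarian
@attribute age numeric
@attribute sex {female,male}
@attribute num {no,yes}
@data
28,male,no
"""

merged = merge_arff([cleveland, hungarian])
lines = merged.splitlines()
print(lines[0])        # header line with attribute names
print(len(lines) - 1)  # number of instances
```

With the real files, the same function would be applied to the contents of heart-c.arff and heart-h.arff and the result written to heart-ch.txt; the check on the attribute lists is one way to detect the schema problem asked about in 1a.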
2. Descriptive data summarization
Before preprocessing the data, an important step is to get acquainted with the data – also
called data understanding.
a. Select the Analyze tab and choose Descriptive Statistics, then the Descriptives feature.
Study, for example, the age attribute. What is its mean? Its standard deviation? Its
min and max?
b. Provide the five-number summary of this attribute (use the Frequencies feature).
c. Specify which attributes are numeric, which are ordinal, and which are
categorical/nominal.
d. Create a histogram showing the number of patients in age categories with stacks
corresponding to heart disease. For this, select the Graphs menu, then Legacy
Dialogs, Bar, Stacked. Accept the default options, then set age as Category Axis and
num as Define Stacks by. Explain what this graphic represents, and what each
color represents. What do you learn from this graphic? Paste it here.
e. In the same manner, visualize the other attributes one by one, always keeping num
as the stacking variable. Stop when you find one variable that shows a
correlation with heart disease. State which variable this is, and paste the histogram
here (note: there is more than one valid answer, and you may try to be strategic about
the order in which you explore the variables).
f. Does any pair of different attributes seem correlated?
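As a cross-check on what the SPSS dialogs in this section compute, the summary statistics and the counts behind a stacked bar chart can be sketched in Python. This is a minimal illustration on made-up records (the ages and the no/yes disease labels below are hypothetical, not taken from the heart disease files), and quartile conventions can differ slightly between tools.

```python
import statistics
from collections import Counter

# Hypothetical (age, disease) records; not taken from the real dataset.
records = [
    (29, "no"), (34, "no"), (41, "no"), (45, "yes"), (52, "no"), (54, "yes"),
    (58, "yes"), (60, "yes"), (63, "no"), (67, "yes"), (77, "yes"),
]
ages = [age for age, _ in records]

# Descriptives: mean and sample standard deviation (n - 1 denominator).
mean_age = statistics.mean(ages)
sd_age = statistics.stdev(ages)

# Five-number summary: min, first quartile, median, third quartile, max.
q1, median, q3 = statistics.quantiles(ages, n=4)
five_number = (min(ages), q1, median, q3, max(ages))


def age_band(age):
    """Group an age into a decade label such as '50s'."""
    return f"{age // 10 * 10}s"


# Counts per (age band, disease) pair: the heights of the stacked segments.
stack_counts = Counter((age_band(age), disease) for age, disease in records)

print(round(mean_age, 2), round(sd_age, 2))
print(five_number)
for band in sorted({age_band(a) for a in ages}):
    print(band, stack_counts[(band, "no")], stack_counts[(band, "yes")])
```

Each printed row corresponds to one bar of the stacked chart: the two counts are the sizes of the two colored segments for that age band.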
3. Classification in SPSS
The dataset heart-ch.txt is ready to be processed for classification. The target attribute is
the last one by default: num. Evaluation of a classifier generally uses a training set first
to train the model, then a test set on which predictions by the model are compared with
known classes.
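The train-then-test evaluation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than what SPSS does internally: the rows are made up, the helper names split, predict_1nn and accuracy are hypothetical, the classifier is a plain 1-nearest-neighbor rule with squared Euclidean distance, and the features are not standardized.

```python
import random


def split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of the rows and cut it into (training, test) parts."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def predict_1nn(train, features):
    """Return the label of the training row closest to the given features."""
    nearest = min(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], features)),
    )
    return nearest[1]


def accuracy(train, evaluation):
    """Fraction of evaluation rows whose predicted label matches the known one."""
    hits = sum(predict_1nn(train, feats) == label for feats, label in evaluation)
    return hits / len(evaluation)


# Hypothetical rows: ((age, maximum heart rate), disease label).
data = [
    ((29, 202), "no"), ((41, 172), "no"), ((45, 132), "yes"), ((52, 168), "no"),
    ((54, 116), "yes"), ((58, 111), "yes"), ((63, 150), "no"), ((67, 108), "yes"),
]

train, test = split(data)
print(len(train), len(test))
print(accuracy(train, test))   # held-out accuracy: the informative number
print(accuracy(train, train))  # accuracy when scored on the training rows
```

One design point the sketch makes visible: when all training rows have distinct feature values, a 1-nearest-neighbor model scored on its own training rows always finds each row as its own nearest neighbor, so that accuracy is 100% by construction. Only the accuracy on the held-out test set says something about predictive power.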
a. Switch to the Analyze tab. This dataset is going to be analyzed with Binary
Logistic Regression. The options by default can be used, however it is
recommended to use 95 for CI for exp(B). By selecting num as the dependent
variable, and all the other variables as covariates, which result do you obtain
(Block 1 Classification Table)?
b. By choosing a different set of variables, provide the results you get in the table
below.
Method                 Significance    R Square    Accuracy
Logistic regression
c. Split the data into 75% for the training set and 25% for the test set, and compare
results between Logistic Regression and Nearest-neighbor (both from the Analyze
tab).
Method                 Accuracy
Logistic regression    65.8%
Nearest-neighbor       100%
d. What do you conclude about the predictability of heart disease? Explain.
e. Which factors are most predictive of heart disease?