ISC471/HCI571 Fall 2012
Assignment 3: Classification
Due date: Sunday, November 11, 2012, midnight

The goal of this assignment is to practice classification methods and to apply them to a dataset using the SPSS data analysis tool.

Assignment preparation
This assignment uses the SPSS data analysis tool.

Heart disease datasets
The dataset studied is the heart disease dataset, provided in several formats: heart-c.arff (Cleveland data) and heart-h.arff (Hungarian data) from the UCI repository (datasets-UCI.jar) for the nominal version, and cleveland.arff and hungarian.arff from the UCI numeric repository (datasets-numeric.jar). These datasets describe factors of heart disease. They can be downloaded from: http://www.cs.waikato.ac.nz/~ml/weka/index_datasets.html.

The goal of the data mining project is to better understand the risk factors for heart disease, as represented in the 14th attribute, num (<50 means no disease, and values >50_1 to >50_4 represent increasing levels of heart disease). The question this data analysis study concentrates on is whether heart disease can be predicted from the other known data about a patient. The data mining task chosen to answer this question is classification/prediction, and several different algorithms will be used to find which one provides the best predictive power.

1. Data preparation: integration
We want to merge the two datasets (heart-c.arff and heart-h.arff) into one, in a step called data integration. We also want to convert these files to text format.
a. Is there an entity identification or schema integration (difference in database schema) problem in this dataset? If yes, how can it be fixed?
b. Integrate the two datasets into a single dataset, called heart-ch.txt, which will be used as the starting point for the next questions. Keep in this dataset only the lines after “@data”.
Add at the top of the file a line listing the attribute names separated by commas (in the ARFF files, each attribute is listed on a separate line preceded by the word “@attribute”). The remainder of the ARFF file explains what each attribute means, and you will need to become familiar with the meaning of each attribute for the rest of this study. How many instances do you have? How many attributes?
c. Import the file heart-ch.txt into SPSS. A wizard will guide you through reading the file as a comma-delimited file. One option to check is that variable names are included at the top of the file. Also, there is only one delimiter between the variables, which is the comma. Save the result in SPSS format for future use: heart-ch.sav. Paste below a screenshot of the SPSS screen showing the bottom of the imported file.

2. Descriptive data summarization
Before preprocessing the data, an important step is to get acquainted with the data, a step also called data understanding.
a. Open the Analyze menu and choose Descriptive Statistics, then Descriptives. Study, for example, the age attribute. What is its mean? Its standard deviation? Its min and max?
b. Provide the five-number summary of this attribute (use the Frequencies feature).
c. Specify which attributes are numeric, which are ordinal, and which are categorical/nominal.
d. Create a histogram showing the number of patients in each age category, with stacks corresponding to heart disease levels. To do this, open the Graphs menu, then select Legacy Dialogs, Bar, Stacked. Accept the default options, then choose age as Category Axis and num as Define Stacks by. Explain what this graphic represents, and what each color represents. What do you learn from this graphic? Paste it here.
e. In the same manner, visualize the other attributes one by one, always keeping num as the stacking variable. Stop when you find one variable that shows a correlation with heart disease.
State which variable this is, and paste the histogram here (note: there is more than one valid answer, and you may want to be strategic about the order in which you explore the variables).
f. Does any pair of different attributes seem correlated?

3. Classification in SPSS
The dataset heart-ch.txt is ready to be processed for classification. The target attribute is the last one by default: num. Evaluating a classifier generally uses a training set first to train the model, then a test set on which the model's predictions are compared with the known classes.
a. Open the Analyze menu. This dataset is going to be analyzed with Binary Logistic Regression. The default options can be used; however, setting CI for exp(B) to 95 is recommended. With num selected as the dependent variable and all the other variables as covariates, what result do you obtain (Block 1 Classification Table)?
b. Choose a different set of variables and report the results you get in the table below.

Method         Logistic regression
Significance
R Square
Accuracy

c. Split the data into 75% for the training set and 25% for the test set, and compare results between Logistic Regression and Nearest-neighbor (both from the Analyze menu).

Method         Logistic regression    Nearest-neighbor
Accuracy       65.8%                  100%

d. What do you conclude about the predictability of heart disease? Explain.
e. Which factors are most predictive of heart disease?
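
The integration described in step 1b can also be scripted rather than done by hand. The following is a minimal Python sketch under stated assumptions, not part of the SPSS workflow the assignment prescribes. The function names parse_arff and merge_arff are hypothetical, the two embedded ARFF snippets are made-up toy data with only three attributes (not records from heart-c.arff or heart-h.arff), and attribute names are assumed to contain no spaces. The sketch collects the names from the “@attribute” lines, keeps only the lines after “@data”, and refuses to merge files whose attribute lists differ, which is one simple way to surface the schema question raised in step 1a.

```python
def parse_arff(text):
    """Return (attribute_names, data_lines) from the text of one ARFF file."""
    names, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and ARFF comments
            continue
        lower = line.lower()
        if in_data:
            rows.append(line)  # data lines are kept unchanged
        elif lower.startswith("@attribute"):
            # The name is the second token; it may be quoted (assumed: no spaces).
            names.append(line.split(None, 2)[1].strip("'\""))
        elif lower.startswith("@data"):
            in_data = True
    return names, rows


def merge_arff(texts):
    """Merge several ARFF texts into one comma-delimited text with a header line."""
    header, merged = None, []
    for text in texts:
        names, rows = parse_arff(text)
        if header is None:
            header = names
        elif names != header:
            raise ValueError(f"schema mismatch: {names} vs {header}")
        merged.extend(rows)
    return "\n".join([",".join(header)] + merged) + "\n"


# Toy stand-ins for the two files (hypothetical values, three attributes only).
CLEVELAND_TOY = """\
@relation toy-c
@attribute age numeric
@attribute sex {female,male}
@attribute num {'<50','>50_1'}
@data
63,male,'<50'
67,male,'>50_1'
"""

HUNGARIAN_TOY = """\
@relation toy-h
@attribute 'age' numeric
@attribute 'sex' {female,male}
@attribute 'num' {'<50','>50_1'}
@data
% a comment line, ignored by the parser
28,male,'<50'
?,female,'<50'
"""

merged = merge_arff([CLEVELAND_TOY, HUNGARIAN_TOY])
lines = merged.splitlines()
print(lines[0])  # age,sex,num
print(len(lines) - 1, "instances,", len(lines[0].split(",")), "attributes")
# 4 instances, 3 attributes
```

With the real files, the two texts would be read from disk and the merged string written to heart-ch.txt. Missing values (written as ? in ARFF) and quoted class labels are passed through unchanged here, so they may still need handling when the file is imported into SPSS.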