Matthew Kolker
CSIS 5420
Week 2 Assignment
1. (Computational Question #2, page 102) Answer the following:
a. Write the production rules for the decision tree shown in Figure 3.4 (page 74).
If Age > 43, then Life Insurance Promotion = No
If Age <= 43 & Sex = Female, then Life Insurance Promotion = Yes
If Age <= 43 & Sex = Male & Credit Card Insurance = No, then Life Insurance Promotion = No
If Age <= 43 & Sex = Male & Credit Card Insurance = Yes, then Life Insurance Promotion = Yes
b. Can you simplify any of your rules without compromising rule accuracy?
Yes. The Age <= 43 condition can be dropped from the third rule without compromising
accuracy: If Sex = Male & Credit Card Insurance = No, then Life Insurance Promotion = No.
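Expressed as code, the rules above amount to a small nested classifier. This is only a sketch; the function name and the string values for the attributes are illustrative conventions, not anything defined in the text:

```python
def life_insurance_promotion(age, sex, credit_card_insurance):
    """Apply the production rules for the Figure 3.4 decision tree."""
    if age > 43:
        return "No"
    if sex == "Female":
        return "Yes"
    # For males, the Age <= 43 test can be dropped from the
    # CCI = No rule without losing accuracy (part b).
    if credit_card_insurance == "Yes":
        return "Yes"
    return "No"
```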
2. (Computational Question #4, page 102) Using the attribute age, draw an expanded
version of the decision tree in Figure 3.4 (page 74) that correctly classifies all training
data instances.
Age
  = 29 → Yes
  > 43 or < 29 → No
  > 29 and <= 43 → Sex
      Female → Yes
      Male → Credit Card Insurance
          Yes → Yes
          No → No
3. (Computational Question #8, page 103) Use the information in Table 3.5 (page 82) to
list three two-item set rules. Use the data in Table 3.3 (page 80) to compute confidence
and support values for each of your rules.
Set1:
If Magazine Promotion = Yes
Then Watch Promotion = No (4/7)
If Watch Promotion = No
Then Magazine Promotion = Yes (4/6)
Set2:
If Magazine Promotion = Yes
Then Credit Card Insurance = No (5/7)
If Credit Card Insurance = No
Then Magazine Promotion = Yes (5/8)
Set3:
If Credit Card Insurance = No
Then Sex = Female (4/8)
If Sex = Female
Then Credit Card Insurance = No (4/4)
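The fractions above are confidence values: the count of instances matching both antecedent and consequent, divided by the count matching the antecedent alone. Support divides the joint count by the total number of instances instead. A minimal sketch of both measures (the function names are illustrative; the counts in the example line come from Set 3 above):

```python
def confidence(joint_count, antecedent_count):
    # Confidence of "If A Then B" = count(A and B) / count(A)
    return joint_count / antecedent_count

def support(joint_count, total_instances):
    # Support of "If A Then B" = count(A and B) / total instances
    return joint_count / total_instances

# Set 3, second rule: If Sex = Female Then Credit Card Insurance = No
print(confidence(4, 4))  # 1.0, i.e. the 4/4 reported above
```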
4. (Computational Question #10, page 103) Perform the third iteration of the K-Means
algorithm for the example given in the section titled An Example Using K-Means (page
84). What are the new cluster centers?
Computations for the third iteration are:
Distance (C1 – 1) = 1.05
Distance (C2 – 1) = 3.42
Distance (C1 – 2) = 2.03
Distance (C2 – 2) = 2.38
Distance (C1 – 3) = 1.20
Distance (C2 – 3) = 2.83
Distance (C1 – 4) = 1.20
Distance (C2 – 4) = 1.42
Distance (C1 – 5) = 1.67
Distance (C2 – 5) = 1.54
Distance (C1 – 6) = 5.07
Distance (C2 – 6) = 2.61
The third iteration results in a modified clustering:
C1 contains instances 1, 2, 3, and 4
C2 contains instances 5 and 6
Next, we compute the new centers for each cluster.
For cluster C1:
X=(1+1+2+2)/4 = 1.5
Y=(1.5+4.5+1.5+3.5)/4 = 2.75
For cluster C2:
X=(3+5)/2 = 4
Y=(2.5+6)/2 = 4.25
Center of C1 is (1.5, 2.75)
Center of C2 is (4.0, 4.25)
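The center update can be checked with a few lines of code. The coordinates below are the ones implied by the mean computations above (instances 1 through 6 in cluster order); each new center is simply the per-axis mean of its members:

```python
# Cluster membership after the third iteration, with coordinates
# implied by the center computations above.
c1 = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5)]  # instances 1-4
c2 = [(3.0, 2.5), (5.0, 6.0)]                           # instances 5-6

def center(points):
    # New cluster center = mean of member coordinates on each axis
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

print(center(c1))  # (1.5, 2.75)
print(center(c2))  # (4.0, 4.25)
```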
5. (Data Mining Question #1, page 141) This exercise demonstrates how erroneous data
is displayed in a spreadsheet file.
a. Copy the CreditCardPromotion.xls dataset into a new spreadsheet.
Done
b. Modify the instance on line 17 to contain one or more illegal characters for age.
Done
c. Add one or more blank lines to the spreadsheet.
Done
d. Remove values from one or more spreadsheet cells.
Done
e. Using life insurance promotion as the output attribute, initiate a data mining session
with ESX. When asked if you wish to continue mining the good instances, answer No.
Open the Word document located in the upper-left corner of your spreadsheet to examine
the error messages.
Line 8: ~Line is blank
Line 18: ~Bad numerical data for attribute Age
6. (Data Mining Question #2, page 141) Suppose you suspect marked differences in
promotional purchasing trends between female and male Acme credit card customers.
You wish to confirm or refute your suspicion. Perform a supervised data mining session
using the CreditCardPromotion.xls dataset with sex as the output attribute. Designate all
other attributes as input attributes, and use all 15 instances for training. Because there is
no test data, the RES TST and RES MTX sheets will not be created. Write a summary
confirming or refuting your hypothesis. Base your analysis on:
a. Class resemblance scores.
b. Class predictability and predictiveness scores.
c. Rules created for each class. You may wish to use the rerule feature.
The hypothesis is that purchasing trends differ between men and women. The class
resemblance scores indicate that this may be true. It is interesting to note that women
have a greater resemblance score than the population as a whole. The class predictability
and predictiveness scores show that men and women are fairly similar except for the life
insurance promotion, where the scores are flipped. This indicates that preferences are
largely shared except on this promotion; further research could determine whether some
other factor explains the difference. It is also interesting that the rule for males
essentially makes not taking the life insurance promotion a key requisite, whereas for
females the rule is based entirely on age.
7. (Data Mining Question #4, page 141) For this exercise you will use ESX to perform a
data mining session with the cardiology patient data described in Chapter 2, page 37
(files: CardiologyNumericals.xls and CardiologyCategorical.xls). Load the
CardiologyCategorical.xls file into a new MS Excel spreadsheet. This is the mixed form
of the dataset containing both categorical and numeric data. Recall that the data contains
303 instances representing patients who have a heart condition (sick) as well as those
who do not.
Save the spreadsheet to a new file and perform a supervised learning mining session
using class as the output attribute. Use 1.0 for the real-tolerance setting and select 203
instances as training data. The final 100 instances represent the test dataset. Generate
rules using the default settings for the rule generator. Answer the following based on your
results:
a. Provide the domain resemblance score as well as the resemblance score for each class
(sick and healthy).
Class Sick: .553
Class Healthy: .581
Domain: .52
b. What percent of the training data is female?
31%
c. What is the most commonly occurring domain value for the attribute slope?
Flat
d. What is the average age of those individuals in the healthy class?
51.95
e. What is the most common healthy class value for the attribute thal?
Normal
f. Specify blood pressure values for the two most typical sick class instances.
125 and 130
g. What percent of the sick class is female?
17%
h. What is the predictiveness score for the sick class attribute value angina = true? In your
own words, explain what this value means.
.75 or 75%
This means that if the value of angina is true, there is a 75% chance that the person is in
the sick class.
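Predictiveness can be illustrated with a toy computation: it is the fraction of instances carrying the attribute value that belong to the class, i.e. P(sick | angina = true). The counts below are invented purely for illustration and are not taken from the cardiology data:

```python
# Hypothetical labeled instances as (class, angina) pairs -- illustrative only.
instances = [("sick", True)] * 3 + [("healthy", True)] * 1 + [("sick", False)] * 2

# Restrict to instances where angina = true, then take the sick fraction.
angina_true = [cls for cls, angina in instances if angina]
predictiveness = angina_true.count("sick") / len(angina_true)
print(predictiveness)  # 0.75: given angina = true, 75% of instances are sick
```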
i. List one sick class attribute value that is highly sufficient for class membership.
Thal = rev
j. What percent of the test set instances were correctly classified?
82%
k. Give the 95% confidence limits for the test set. State what the confidence limit values
mean.
The error rate is between 10.3% and 25.7%. With 82% accuracy (18% error) on the 100
test instances, the limits are 18% ± 2·sqrt(0.18 × 0.82 / 100) ≈ 18% ± 7.7%. This means
we can be 95% confident that the model's true error rate on new data of this type lies
within this range.
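The 25.7% figure is the upper confidence limit on the test-set error. A sketch of the computation, assuming the usual error ± 2 standard errors approximation with 18% error (82% accuracy) on the 100 test instances:

```python
from math import sqrt

error, n = 0.18, 100  # 18% test-set error on 100 test instances
standard_error = sqrt(error * (1 - error) / n)
lower = error - 2 * standard_error
upper = error + 2 * standard_error
print(round(lower, 3), round(upper, 3))  # 0.103 0.257
```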
l. How many test set instances were classified as being sick when in reality they were
from the healthy class?
5
m. List a rule with multiple conditions for the sick class that covers at least 50% of the
instances and is accurate in at least 85% of all cases.
If thal = rev and chest pain type = asymptomatic, then sick = true
8. (Data Mining Question #6, page 142) In this exercise you will use instance typicality
to determine class prototypes. You will then employ the prototype instances to build and
test a supervised learner model.
a. Open the CardiologyCategorical.xls spreadsheet file.
b. Save the file to a new spreadsheet.
c. Perform a supervised mining session on the class attribute. Use all instances for
training.
d. When learning is complete, open the RES TYP sheet and copy the most typical sick
class instance to a new spreadsheet. Save the spreadsheet and return to the RES TYP
sheet.
e. Now, copy the most typical healthy class instance to the spreadsheet created in the
previous step and currently holding the single most typical sick class instance. Save the
updated spreadsheet file.
f. Delete columns A, B, and Q of the new spreadsheet file that now contains the most
typical healthy class instance and the most typical sick class instance. Copy the two
instances contained in the spreadsheet.
g. Return to the original sheet 1 data sheet and insert two blank rows after row three.
Paste the two instances copied in step f into sheet 1.
h. Your sheet 1 spreadsheet now contains 305 instances. The first instance in sheet 1 (row
4) is the most typical sick class instance. The second instance is the most typical healthy
class instance.
i. Perform a data mining session using the first two instances as training data. The
remaining 303 instances will make up the test set.
j. Analyze the test set results by examining the confusion matrix. What can you
conclude?
Using the most typical cases to generate the rules provides 81% reliability in the
generated rules. This is only 1% less than using a larger data set, which means that the
most typical cases are very representative of the data.
k. Repeat the above steps but this time extract the least typical instance from each class.
How do your results compare with those of the first experiment?
This results in a 34% reliability, which would be bad in practice because it would produce
many misdiagnoses. For the most part, the results are almost the reverse of those obtained
using the most typical cases.
9. (Data Mining Question #8, page 143) Perform a supervised data mining session with
the Titanic.xls dataset. Use 1500 instances for training and the remaining instances as test
data.
a. Are any of the input attributes highly predictive of class membership for either class?
Females had a predictiveness score of .81 for surviving. This seems plausible given the
"women and children first" motto; however, children only had a score of .63. Third class,
crew, and males all had predictiveness scores greater than .75 for not surviving.
b. What is the test set accuracy of the model?
77%
c. Use the confidence limits to state the 95% test set accuracy range.
The error rate is between 19.8% and 26.2%.
d. Why is the classification accuracy so much lower for this experiment than for the quick
mine experiment given in Section 4.8 (pages 135 - 139)?
The test set was randomly generated.