Monica Nusskern
5420 – Data Mining
Week 2 Assignment
June 5, 2005
Total Score 85 out of 90
Score 10 out of 10
1. (Computational Question #2, page 102) Answer the following:
a. Write the production rules for the decision tree shown in Figure 3.4 (page
74).
If Age > 43
Then Life Insurance Promotion = No   (Good)
If Age <= 43 and Sex = Male and Credit Card Insurance = No
Then Life Insurance Promotion = No   (Good)
If Age <= 43 and Sex = Male and Credit Card Insurance = Yes
Then Life Insurance Promotion = Yes   (Good)
If Age <= 43 and Sex = Female and Credit Card Insurance = Yes
Then Life Insurance Promotion = Yes   (Good)
b. Can you simplify any of your rules without compromising rule accuracy?
If Age > 43
Then Life Insurance Promotion = No
If Age <= 43 and Sex = Male and Credit Card Insurance = No
Then Life Insurance Promotion = No
If Age <= 43 and (Sex = Female or Sex = Male) and Credit Card Insurance = Yes
Then Life Insurance Promotion = Yes   (OK)
If the individual is MALE then we can ignore the attribute AGE.
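As a quick illustration, the four rules from part (a) can be written as a small Python function (a sketch only; the parameter names are shorthand and not part of the assignment or of ESX):

def life_insurance_promotion(age, sex, cc_insurance):
    # Each path of the Figure 3.4 tree becomes one production rule.
    if age > 43:
        return "No"
    if sex == "Male" and cc_insurance == "No":
        return "No"
    if sex == "Male" and cc_insurance == "Yes":
        return "Yes"
    if sex == "Female" and cc_insurance == "Yes":
        return "Yes"
    return None  # no rule fires (e.g., Female with Credit Card Insurance = No)

print(life_insurance_promotion(35, "Male", "Yes"))  # -> Yes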
Score 9 out of 10
2. (Computational Question #4, page 102) Using the attribute age, draw an expanded version of the decision tree in Figure 3.4 (page 74) that correctly classifies all training data instances.
[Decision tree diagram: the root node splits on Age (<= 43 / > 43); the Age <= 43 branch splits on Sex (Male / Female), and the Male branch splits on Credit Card Insurance (Yes / No). Leaf labels shown in the drawing: No (2/0), Yes (6/0), No (3/1), Yes (2/0).]
One possibility is to split the Credit Card Insurance = No branch on age > 29 and age <= 29. The two instances following age > 29 will have life insurance promotion = no. The two instances following age <= 29 once again split on attribute age. This time, the split is age <= 27 and age > 27.
Score 10 out of 10
3. (Computational Question #8, page 103) Use the information in Table 3.5
(page 82) to list three two-item set rules. Use the data in Table 3.3 (page 80)
to compute confidence and support values for each of your rules.
If Magazine Promotion = Yes
Then Life Insurance Promotion = Yes (5/7)
If Watch Promotion = Yes
Then Magazine Promotion = Yes (4/7)
If Life Insurance Promotion = Yes
Then Magazine Promotion = Yes (5/5)
Very good
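To see how the confidence and support values above are computed, here is a small Python sketch. The five instances are made up for illustration; they are not the actual Table 3.3 data. Confidence is the fraction of instances satisfying the rule antecedent that also satisfy the consequent, and support is the fraction of all instances satisfying both.

# Toy data standing in for Table 3.3 (illustrative only)
instances = [
    {"magazine": "Yes", "life": "Yes"},
    {"magazine": "Yes", "life": "Yes"},
    {"magazine": "Yes", "life": "No"},
    {"magazine": "No",  "life": "Yes"},
    {"magazine": "No",  "life": "No"},
]

# Rule: If Magazine Promotion = Yes Then Life Insurance Promotion = Yes
antecedent = sum(1 for i in instances if i["magazine"] == "Yes")
both = sum(1 for i in instances if i["magazine"] == "Yes" and i["life"] == "Yes")

confidence = both / antecedent   # covered instances that also satisfy the consequent
support = both / len(instances)  # instances satisfying both, out of all instances
print(f"confidence = {confidence:.2f}, support = {support:.2f}")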
Score 9 out of 10
4. (Computational Question #10, page 103) Perform the third iteration of the
K-Means algorithm for the example given in the section titled An Example
Using K-Means (page 84). What are the new cluster centers?
Cluster Centers:
C1 = (1.8, 2.7)
C2 = (5, 6)
The new cluster center for cluster 1 is (1.5, 5). The new cluster center for cluster 2 is (4.0, 4.25).
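For reference, one iteration of K-Means can be sketched in a few lines of Python: assign each point to its nearest center, then recompute each center as the mean of its assigned points. The data points below are placeholders, not the textbook's example data.

import math

def kmeans_iteration(points, centers):
    # Assign each point to the nearest center (Euclidean distance).
    clusters = [[] for _ in centers]
    for p in points:
        distances = [math.dist(p, c) for c in centers]
        clusters[distances.index(min(distances))].append(p)
    # New center = mean of the points assigned to the cluster.
    new_centers = []
    for cluster, old in zip(clusters, centers):
        if not cluster:
            new_centers.append(old)  # keep an empty cluster's old center
        else:
            new_centers.append((sum(x for x, _ in cluster) / len(cluster),
                                sum(y for _, y in cluster) / len(cluster)))
    return new_centers

points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]
print(kmeans_iteration(points, [(1.8, 2.7), (5.0, 6.0)]))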
Score 10 out of 10
Good
5. (Data Mining Question #1, page 141) This exercise demonstrates how
erroneous data is displayed in a spreadsheet file.
a. Copy the CreditCardPromotion.xls dataset into a new spreadsheet.
b. Modify the instance on line 17 to contain one or more illegal characters
for age.
c. Add one or more blank lines to the spreadsheet.
d. Remove values from one or more spreadsheet cells.
e. Using life insurance promotion as the output attribute, initiate a data
mining session with ESX. When asked if you wish to continue mining the
good instances, answer No. Open the Word document located in the upper-left corner of your spreadsheet to examine the error messages.
Line 9  ~ Line is blank
Line 14 ~ Line is blank
Line 19 ~ Bad numerical data for attribute Age
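These are the same kinds of checks one could script directly. The sketch below (not part of the ESX output) uses pandas on a few made-up rows to flag blank lines and non-numeric age values; the column names and rows are illustrative.

import pandas as pd

# Toy rows standing in for a damaged CreditCardPromotion spreadsheet;
# the real file would be loaded with pd.read_excel instead.
rows = pd.DataFrame({
    "Age": [45, None, "37x", 29],
    "Life Ins Promo": ["Yes", None, "No", "Yes"],
})

blank_lines = rows[rows.isna().all(axis=1)].index.tolist()
bad_age = rows[pd.to_numeric(rows["Age"], errors="coerce").isna()
               & rows["Age"].notna()].index.tolist()

print("Blank lines:", blank_lines)
print("Bad numerical data for attribute Age on lines:", bad_age)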
Score 10 out of 10
Very good
6. (Data Mining Question #2, page 141) Suppose you suspect marked
differences in promotional purchasing trends between female and male
Acme credit card customers. You wish to confirm or refute your suspicion.
Perform a supervised data mining session using the
CreditCardPromotion.xls dataset with sex as the output attribute.
Designate all other attributes as input attributes, and use all 15 instances
for training. Because there is no test data, the RES TST and RES MTX
sheets will not be created. Write a summary confirming or refuting your
hypothesis. Base your analysis on:
a. Class resemblance scores.
CLASS RESEMBLANCE STATISTICS
                     Class Male   Class Female   Domain
Res. Score:          0.429        0.484          0.46
No. of Inst.:        8            7              15
Class Significance:  0.07         0.05
b. Class predictability and predictiveness scores.
Female: Categorical Attribute Summary
Name            Value          Frequency   Predictability   Predictiveness
Income Range    "30-40,000"    2           0.29             0.40
Income Range    "50-60,000"    2           0.29             1.00
Income Range    "20-30,000"    2           0.29             0.50
Income Range    "40-50,000"    1           0.14             0.25

Male: Categorical Attribute Summary
Name            Value          Frequency   Predictability   Predictiveness
Income Range    "40-50,000"    3           0.38             0.75
Income Range    "30-40,000"    3           0.38             0.60
Income Range    "20-30,000"    2           0.25             0.50
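The predictability and predictiveness values in these tables follow directly from the value counts. Assuming the usual definitions (predictability = frequency of the value within the class divided by the class size; predictiveness = frequency of the value within the class divided by its frequency across all classes), the female scores can be reproduced with a short sketch:

# Income range value -> (female count, male count), taken from the tables above
counts = {
    "20-30,000": (2, 2),
    "30-40,000": (2, 3),
    "40-50,000": (1, 3),
    "50-60,000": (2, 0),
}
n_female = 7  # instances in the female class

for value, (f, m) in counts.items():
    predictability = f / n_female  # P(value | Female)
    predictiveness = f / (f + m)   # P(Female | value)
    print(f"{value}: predictability = {predictability:.2f}, "
          f"predictiveness = {predictiveness:.2f}")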
c. Rules created for each class. You may wish to use the rerule feature.
*******************************
Rules for Class Male
8 instances
*******************************
Score 9 out of 10
Good
7. (Data Mining Question #4, page 141) For this exercise you will use ESX to perform a data mining session with the cardiology patient data described in Chapter 2, page 37 (files: CardiologyNumericals.xls and CardiologyCategorical.xls). Load the CardiologyCategorical.xls file into a new MS Excel spreadsheet. This is the mixed form of the dataset, containing both categorical and numeric data. Recall that the data contains 303 instances representing patients who have a heart condition (sick) as well as those who do not.
Save the spreadsheet to a new file and perform a supervised learning mining session using class as the output attribute. Use 1.0 for the real-tolerance setting and select 203 instances as training data. The final 100 instances represent the test dataset. Generate rules using the default settings for the rule generator. Answer the following based on your results:
a. Provide the domain resemblance score as well as the resemblance score
for each class (sick and healthy).
CLASS RESEMBLANCE STATISTICS
                     Class Sick   Class Healthy   Domain
Res. Score:          0.553        0.581           0.52
No. of Inst.:        93           110             203
Class Significance:  0.07         0.13
b. What percent of the training data is female?
33% 31% of the training data is female.
c. What is the most commonly occurring domain value for the attribute
slope?
The most commonly occurring domain value for the attribute slope is flat.
d. What is the average age of those individuals in the healthy class?
The average age of those individuals in the healthy class is
51.95.
e. What is the most common healthy class value for the attribute thal?
The most common healthy class value for the attribute thal is normal.
f. Specify blood pressure values for the two most typical sick class
instances.
The blood pressure values for the most typical sick class instances are
125 and 130.
g. What percent of the sick class is female?
31% of the sick class is female. 17%
h. What is the predictiveness score for the sick class attribute value angina
= true? In your own words, explain what this value means.
About 67% of patients will be sick when angina is equal to true. 75%
i. List one sick class attribute value that is highly sufficient for class membership.
Cholesterol is a sick class attribute value that is highly sufficient for class membership.
j. What percent of the test set instances were correctly classified?
82% of the test set instances were correctly classified.
k. Give the 95% confidence limits for the test set. State what the confidence
limit values mean.
Confidence limits give an interval estimate for the mean of a data set. Instead of a single mean estimate, a confidence interval places upper and lower limits around the mean, so the 95% limits bracket the range within which the true test set accuracy is expected to fall.
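A sketch of the calculation, assuming the standard normal-approximation formula (accuracy plus or minus two standard errors, with SE = sqrt(accuracy * (1 - accuracy) / n)) and the 100-instance test set and 82% accuracy from this session:

import math

def accuracy_interval(acc, n):
    # 95% confidence limits: acc +/- 2 * sqrt(acc * (1 - acc) / n)
    se = math.sqrt(acc * (1 - acc) / n)
    return acc - 2 * se, acc + 2 * se

low, high = accuracy_interval(0.82, 100)  # 82% accuracy, 100 test instances
print(f"95% confidence limits: {low:.3f} to {high:.3f}")

Under these assumptions the limits come out to roughly 0.74 and 0.90.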
l. How many test set instances were classified as being sick when in reality
they were from the healthy class?
There are 13 5 test set instances classified as being sick when in reality they were from the healthy class.
m. List a rule with multiple conditions for the sick class that covers at least 50% of the instances and is accurate in at least 85% of all cases.
angina = TRUE
and chest pain type = Asymptomatic
:rule accuracy 87.50%
:rule coverage 52.69%
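Assuming the usual definitions (coverage = sick instances the rule covers divided by all sick instances; accuracy = correctly covered instances divided by everything the rule covers), the reported percentages are consistent with the following counts, which are inferred for illustration rather than read from the ESX output:

covered_total = 56    # training instances satisfying both rule conditions (inferred)
covered_sick = 49     # of those, instances that really belong to the sick class (inferred)
sick_class_size = 93  # sick instances in the training data (part a)

accuracy = covered_sick / covered_total    # 49 / 56 = 0.8750
coverage = covered_sick / sick_class_size  # 49 / 93 = 0.5269
print(f"rule accuracy = {accuracy:.2%}, rule coverage = {coverage:.2%}")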
Score 10 out of 10
Very good
8. (Data Mining Question #6, page 142) In this exercise you will use
instance typicality to determine class prototypes. You will then employ the
prototype instances to build and test a supervised learner model.
a. Open the CardiologyCategorical.xls spreadsheet file.
b. Save the file to a new spreadsheet.
c. Perform a supervised mining session on the class attribute. Use all
instances for training.
d. When learning is complete, open the RES TYP sheet and copy the most
typical sick class instance to a new spreadsheet. Save the spreadsheet and
return to the RES TYP sheet.
e. Now, copy the most typical healthy class instance to the spreadsheet created in the previous step and currently holding the single most typical sick class instance. Save the updated spreadsheet file.
f. Delete columns A, B, and Q of the new spreadsheet file that now contains
the most typical healthy class instance and the most typical sick class
instance. Copy the two instances contained in the spreadsheet.
g. Return to the original sheet 1 data sheet and insert two blank rows after row three. Paste the two instances copied in step f into sheet 1.
h. Your sheet 1 spreadsheet now contains 305 instances. The first instance
in sheet 1 (row 4) is the most typical sick class instance. The second
instance is the most typical healthy class instance.
i. Perform a data mining session using the first two instances as training
data. The remaining 303 instances will make up the test set.
j. Analyze the test set results by examining the confusion matrix. What can
you conclude?
The accuracy rate is 81%. 31 sick instances were classified as healthy and 26 healthy instances were classified as sick. Training on just the two most typical instances therefore still gives a good representation of the overall class structure.
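A quick check of the 81% figure from the confusion matrix counts above:

n_test = 303          # test set size
sick_as_healthy = 31  # sick instances classified as healthy
healthy_as_sick = 26  # healthy instances classified as sick

correct = n_test - sick_as_healthy - healthy_as_sick
print(f"accuracy = {correct / n_test:.1%}")  # about 81%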
k. Repeat the above steps but this time extract the least typical instance
from each class. How do your results compare with those of the first
experiment?
The accuracy rate decreased to only 34%. 82 sick instances were classified as healthy and 117 healthy instances were classified as sick.
Score 8 out of 10
9. (Data Mining Question #8, page 143) Perform a supervised data mining
session with the Titanic.xls dataset. Use 1500 instances for training and the
remaining instances as test data.
a. Are any of the input attributes highly predictive of class membership for
either class?
Yes. The attribute sex is highly predictive of class membership for both classes. The predictiveness score for sex = female is 0.81 for the survivors. The predictiveness score for sex = male is 0.77 for the non-survivors. For the non-survivors, class = third has a predictiveness score of 0.72 and class = crew has a predictiveness score of 0.76.
b. What is the test set accuracy of the model?
The test set accuracy of the model was 77%.
Good
c. Use the confidence limits to state the 95% test set accuracy range.
The results have an upper-bound error of 1.5% and a lower-bound error of
0.5%.
The lower-bound accuracy is 73.8%. The upper-bound accuracy is 80.2%.
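The same interval formula used in question 7 reproduces this range, assuming the test set holds the 701 instances left over after training (2,201 Titanic instances minus 1,500 training instances; the dataset total is an assumption about Titanic.xls, not stated above):

import math

acc, n_test = 0.77, 701  # 77% test set accuracy, assumed 701 test instances
se = math.sqrt(acc * (1 - acc) / n_test)
print(f"{acc - 2 * se:.3f} to {acc + 2 * se:.3f}")  # about 0.738 to 0.802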
d. Why is the classification accuracy so much lower for this experiment
than for the quick mine experiment given in Section 4.8 (pages 135 - 139)?
These results have a lower-bound error of 19.8% and an
upper-bound error of 26.2%. These rates are higher than the error
rates from the quick mining experiment given in Section 4.8 and
have a much lower confidence limit around the mean.
The test set for the example in Section 4.8 contains 190 non-survivors and 10 survivors. That is, 95% of the test data holds non-survivor instances. The test data does not reflect the ratio of survivors to non-survivors seen in the entire dataset. The test set for this problem contains 77% non-survivors and 23% survivors. This non-survivor to survivor ratio more closely matches the ratio seen in the entire domain of data instances.