Monica Nusskern
CSIS 5420 – Data Mining
Week 1 Assignment
June 3, 2005
Total Score 103 out of 110
Score 10 out of 10
1. (Question #2, page 30) For each of the following problem scenarios,
decide if a solution would best be addressed with supervised learning,
unsupervised clustering, or database query. As appropriate, state any
initial hypothesis you would like to test. If you decide that supervised
learning or unsupervised clustering is the best answer, list several input
attributes you believe to be relevant for solving the problem.
a. What characteristics differentiate people who have had
back surgery and have returned to work from those who
have had back surgery and have not returned to their jobs?
The best solution for this would be supervised learning
because this scenario clearly offers an attribute whose value
represents a set of predefined output classes. The initial
hypothesis could be, if a person is over 40 years of age and
performs a job with a high percentage of physical labor, then they
are less likely to return to work than someone who is under 40 and
performs a low amount of physical labor with their job. Good
Input Attributes: Worker ID, Job Type, Age, Sex
b. A major automotive manufacturer recently initiated a tire
recall for one of their top-selling vehicles. The automotive
company blames the tires for the unusually high accident
rate seen with their top-seller. The company producing the
tires claims the high accident rate only occurs when their
tires are on the vehicle in question. Who is to blame?
The best solution would be supervised learning and a decision
tree could be utilized to reach a conclusion. The initial hypothesis
could be, if the vehicle has an accident without these tires, then the
automotive manufacturer is to blame. The opposite could also be
asked, if different types of vehicles use the tires in question and
have a high accident rate, then the tire maker is to blame. Good
Input Attributes: Car ID, Questionable Tires on Vehicle, Type of
Vehicle
c. When customers visit my web site, what products are they
most likely to buy together?
The best solution would be unsupervised clustering because it
is possible to use data mining with a company’s data to gain
insight into possible patterns in the database. An initial hypothesis
could be, if a customer purchases Product A, then Product B will
likely be purchased as well. Good
Input Attributes: Customer ID, Transaction Method, Item
Purchased
d. What percent of my employees miss one or more days of
work per month?
A database query could be performed on a database with
human resource data stored there. Once the data for each
employee is extracted, simple calculations could be performed to
determine what percentage of employees miss one or more days of
work per month. Good
e. What relationships can I find between an individual's
height, weight, age, and favorite spectator sport?
Unsupervised clustering would be the best method for this
scenario because it is necessary to build models without predefined
classes. The initial hypothesis would be that a relationship exists between
individual demographics and favorite spectator sport. It is quite possible that
no valid relationship exists. Good
Input Attributes: Person ID, Height, Weight, Age, Favorite Sport
Score 10 out of 10
2. (Question #3, page 30) Medical doctors are experts at disease diagnosis
and surgery. Explain how medical doctors use induction to help develop
their skills.
Induction-based learning is the process of forming a general concept
definition by observing specific examples of the concept to be learned. Induction
allows doctors to determine diagnosis based on observations of particular
patterns and to establish necessary treatment based on these observations of
recurring patterns. Once these patterns are learned doctors can use this
experience to treat similar patterns in their future patients. Very good
Score 10 out of 10
3. (Question #6, page 31) What happens when you try to build a decision
tree for the data in Table 1.1 without employing the attributes Swollen
Glands and Fever?
Table 1.1 Hypothetical Training Data for Disease Diagnosis
Patient ID  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1           Yes          Yes    Yes             Yes         Yes       Strep throat
2           No           No     No              Yes         Yes       Allergy
3           Yes          Yes    No              Yes         No        Cold
4           Yes          No     Yes             No          No        Strep throat
5           No           Yes    No              Yes         No        Cold
6           No           No     No              Yes         No        Allergy
7           No           No     Yes             No          No        Strep throat
8           Yes          No     No              Yes         Yes       Allergy
9           No           Yes    No              Yes         Yes       Cold
10          Yes          Yes    No              Yes         Yes       Cold
Patients could be misdiagnosed without using the symptoms of swollen
glands and fever. The patient could be diagnosed with strep throat and actually
have a cold. This could lead to taking unnecessary medication. Good
Let's pick sore throat as the top-level node. The only possibilities are yes and no.
Instances 1, 3, 4, 8, and 10 follow the yes path. The no path holds instances
2, 5, 6, 7, and 9. The path for sore throat = yes has representatives from all three classes, as
does sore throat = no.
Next we follow the sore throat = yes path and choose headache. We need only concern
ourselves with instances 1, 3, 4, 8, and 10. For headache = yes we have instances 1 (strep
throat), 8 (allergy), and 10 (cold). For headache = no we have instances 3 (cold) and 4 (strep
throat).
Next follow headache = yes and choose congestion, the only remaining attribute. All
three instances show congestion = yes, so the tree is unable to further differentiate
them. Following headache = no, congestion does separate instance 3 (congestion = yes, cold)
from instance 4 (congestion = no, strep throat). Even so, the path following sore throat = yes
cannot assign a single diagnosis to instances 1, 8, and 10. The
problem repeats itself for the path sore throat = no. In general, any top-level node choice
of sore throat, congestion, or headache gives a similar result.
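The argument above can be checked mechanically. The following is a minimal sketch, not part of the original assignment: it groups the ten instances of Table 1.1 by the three attributes that remain once Fever and Swollen Glands are dropped, and reports every group whose members carry more than one diagnosis. The names `INSTANCES` and `conflicting_groups` are illustrative assumptions.

```python
from collections import defaultdict

# Table 1.1 restricted to the attributes left after removing Fever and
# Swollen Glands: (sore throat, congestion, headache) -> diagnosis.
INSTANCES = {
    1: (("Yes", "Yes", "Yes"), "Strep throat"),
    2: (("No", "Yes", "Yes"), "Allergy"),
    3: (("Yes", "Yes", "No"), "Cold"),
    4: (("Yes", "No", "No"), "Strep throat"),
    5: (("No", "Yes", "No"), "Cold"),
    6: (("No", "Yes", "No"), "Allergy"),
    7: (("No", "No", "No"), "Strep throat"),
    8: (("Yes", "Yes", "Yes"), "Allergy"),
    9: (("No", "Yes", "Yes"), "Cold"),
    10: (("Yes", "Yes", "Yes"), "Cold"),
}


def conflicting_groups(instances):
    """Return attribute combinations shared by instances with different diagnoses."""
    groups = defaultdict(list)
    for patient_id, (attributes, diagnosis) in instances.items():
        groups[attributes].append((patient_id, diagnosis))
    return {
        attributes: members
        for attributes, members in groups.items()
        if len({diagnosis for _, diagnosis in members}) > 1
    }


conflicts = conflicting_groups(INSTANCES)
for attributes, members in conflicts.items():
    print(attributes, members)
```

Any group printed here contains instances that no decision tree over these three attributes can tell apart, regardless of which attribute is chosen as the top-level node.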
Score 10 out of 10
4. (Question #6, page 63) Suppose you have used data mining to develop
two alternative models designed to accept or reject home mortgage
applications. Both models show an 85% test set classification correctness.
The majority of errors made by model A are false accepts whereas the
majority of errors made by model B are false rejects. Which model should
you choose? Justify your answer.
The model that I would choose would be model B because false rejects
would cost the firm much less money than false accepts. If the majority of the
errors for model A were false accepts, that means that people who were not
qualified candidates for home loans would be accepted regardless. This could
be detrimental to the company, as these applicants would not pay their mortgage
bills resulting in less income for the company while large expenses would be
incurred. OK, but consider this perspective, since a mortgage is secured
credit, is there much risk in false accepts?
Score 10 out of 10
5. (Question #7, page 63) Suppose you have used data mining to develop
two alternative models designed to decide whether or not to drill for oil.
Both models show an 85% test set classification correctness. The majority
of errors made by model A are false accepts whereas the majority of errors
made by model B are false rejects. Which model should you choose?
Justify your answer.
The model that I would choose would be model A because the company
could be missing out on large income by not drilling in a certain area when in
actuality, they should. If a company drilled for oil where none existed, this could
be used for future knowledge to apply to the models that were developed.
However, while I chose model A for this question, I do see the benefits of utilizing
model B, such as the environmental impacts of drilling where no oil exists. OK,
but consider if the cost of drilling for oil is very high, Model B is the best
choice.
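Both this question and the previous one turn on the relative cost of the two error types. A minimal sketch of that comparison follows. Every number in it other than the 85% correctness is a hypothetical assumption chosen for illustration: the 80/20 split of each model's errors and the cost values are not given in the question.

```python
ERROR_RATE = 0.15  # both models are 85% correct on the test set
SHARE_A = 0.8      # hypothetical: 80% of model A's errors are false accepts
SHARE_B = 0.2      # hypothetical: 20% of model B's errors are false accepts


def expected_error_cost(error_rate, false_accept_share, cost_false_accept, cost_false_reject):
    """Average error cost per decision for a two-class accept/reject model."""
    false_accept_rate = error_rate * false_accept_share
    false_reject_rate = error_rate * (1.0 - false_accept_share)
    return false_accept_rate * cost_false_accept + false_reject_rate * cost_false_reject


def cheaper_model(cost_false_accept, cost_false_reject):
    """Return "A" or "B", whichever has the lower expected error cost."""
    cost_a = expected_error_cost(ERROR_RATE, SHARE_A, cost_false_accept, cost_false_reject)
    cost_b = expected_error_cost(ERROR_RATE, SHARE_B, cost_false_accept, cost_false_reject)
    return "A" if cost_a < cost_b else "B"


# Hypothetical costs: a false accept (drilling a dry well) is 10x a false reject.
print(cheaper_model(cost_false_accept=10.0, cost_false_reject=1.0))
# Hypothetical costs: a false reject (missing an oil field) is 10x a false accept.
print(cheaper_model(cost_false_accept=1.0, cost_false_reject=10.0))
```

Under these assumed figures the preferred model flips with the cost ratio, which is the point of the grader's comment: equal accuracy does not settle the choice.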
Score 10 out of 10
6. (Question #8, page 63) Explain how unsupervised clustering can be used
to evaluate the likely success of a supervised learner model.
Unsupervised clustering can be used to evaluate the likely success of a
supervised learner model by:
• Utilizing a confusion matrix to compute model accuracy: sum the
values found on the main diagonal and divide this sum by the total number of
test set instances.
• Using two-class error analysis to denote false accepts and false rejects.
• Utilizing mean absolute error and mean squared error to evaluate supervised
models having numeric output.
OK, but let me suggest a simpler answer.
In a supervised learner model, we pre-determine which attributes will be
used to classify our data and what specific clusters we will accept. In other
words, we assume that a chosen set of attributes will classify our data
under a chosen output attribute.
If our unsupervised learner determines that the same input attributes will
form clusters that differentiate the values of the output attribute, then the
complementary results verify the supervised learner assumptions.
Score 10 out of 10
7. (Question #9, page 63) Explain how supervised learning can be used to
help evaluate the results of an unsupervised clustering model.
Supervised learning can be used to help evaluate the results of an
unsupervised clustering model by the following technique:
• Perform an unsupervised clustering. Designate each cluster as a class and
assign each an arbitrary name such as C1, C2, and C3.
• Choose a random sample of instances from each of the classes formed by
the instance clustering. Each class should be represented in the random
sample in the same ratio as it is represented in the entire dataset. A good
sample is two-thirds of all instances.
• Build a supervised learner model with the class name as the output attribute
using the randomly sampled instances as training data. Employ the
remaining instances to test the supervised model for classification
correctness. Very good
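The sampling step in the procedure above can be sketched as follows. This is only an illustration under stated assumptions: the function name `stratified_split`, the fixed seed, and the twelve example cluster labels are all hypothetical, and the supervised learner itself is left out because the answer does not name one.

```python
import random
from collections import defaultdict


def stratified_split(cluster_labels, train_fraction=2 / 3, seed=0):
    """Split instance indices into train/test, sampling each cluster in proportion."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        by_cluster[label].append(index)

    train, test = [], []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])  # training data for the supervised model
        test.extend(members[cut:])   # held out to test classification correctness
    return sorted(train), sorted(test)


# Hypothetical clustering result for 12 instances; C1, C2, C3 are arbitrary names.
labels = ["C1"] * 6 + ["C2"] * 3 + ["C3"] * 3
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

The cluster name of each training instance would then serve as the output attribute for whichever supervised learner is chosen, with the held-out indices used to measure classification correctness.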
Score 7 out of 10
8. (Computational Question #1, page 63) Consider the following three-class
confusion matrix. The matrix shows the classification results of a
supervised model that uses previous voting records to determine the
political party affiliation (Republican, Democrat, or Independent) of
members of the United States Senate.
        Computed Decision
        Rep   Dem   Ind
Rep      42     2     1
Dem       5    40     3
Ind       0     3     4
a. What percent of the instances were correctly
classified?
86% Good
b. According to the confusion matrix, how many
Democrats are in the Senate? How many
Republicans? How many Independents?
40 Democrats, 42 Republicans, 4 Independents
Correction: 48 Democrats, 45 Republicans, 7 Independents.
Add across the rows. There are 100 total senators.
c. How many Republicans were classified as
belonging to the Democratic Party?
2 Republicans Good
d. How many Independents were classified as
Republicans?
0 Independents Good
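The four answers can be reproduced directly from the matrix. The sketch below assumes, as the grader's "add across the rows" remark implies, that rows are actual parties and columns are computed decisions. The variable names are illustrative.

```python
CLASSES = ["Rep", "Dem", "Ind"]
# Rows are actual parties, columns are computed decisions.
CONFUSION = [
    [42, 2, 1],
    [5, 40, 3],
    [0, 3, 4],
]

total = sum(sum(row) for row in CONFUSION)
correct = sum(CONFUSION[i][i] for i in range(len(CLASSES)))
accuracy = correct / total                      # part a: main diagonal over total

# Part b: each row sum is the actual number of senators in that party.
actual_counts = {name: sum(row) for name, row in zip(CLASSES, CONFUSION)}

rep_as_dem = CONFUSION[0][1]                    # part c: actual Rep, computed Dem
ind_as_rep = CONFUSION[2][0]                    # part d: actual Ind, computed Rep

print(f"accuracy = {accuracy:.0%}")
print(actual_counts, rep_as_dem, ind_as_rep)
```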
Score 7 out of 10
9. (Computational Question #2, page 64) Suppose we have two
classes each with 100 instances. The instances in one class
contain information about individuals who currently have
credit card insurance. The instances in the second class
include information about individuals who have at least one
credit card but are without credit card insurance. Use the
following to answer the questions below:
IF Life Insurance = Yes & Income > $50K
THEN Credit Card Insurance = Yes
Rule Accuracy = 80%
Rule Coverage = 40%
a. How many individuals represented by the
instances in the class of credit card insurance
holders have life insurance and make more
than $50,000 per year?
80 individuals 40 instances
b. How many instances representing individuals
who do not have credit card insurance have
life insurance and make more than $50,000 per
year?
80 individuals 10 instances
Score 10 out of 10
10. (Computational Question #3, page 64) Consider the
confusion matrices shown below.
a. Compute the lift for Model X.
Lift = 2.00785 ≈ 2.008
b. Compute the lift for Model Y.
Lift = 2.25 Good
Model X    Computed Accept    Computed Reject
Accept                  46                 54
Reject               2,245              7,655

Model Y    Computed Accept    Computed Reject
Accept                  45                 55
Reject               1,955              7,945
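Both lift values can be recomputed from the matrices. This is a minimal sketch in which lift is taken as the acceptance rate among instances the model selects, divided by the acceptance rate in the whole population. The function name `lift` and its argument order (row by row through the matrix) are assumptions for illustration.

```python
def lift(c11, c12, c21, c22):
    """Lift of a two-class model given its confusion matrix, read row by row."""
    population = c11 + c12 + c21 + c22
    sample_rate = c11 / (c11 + c21)             # actual accepts among computed accepts
    population_rate = (c11 + c12) / population  # actual accepts in the whole population
    return sample_rate / population_rate


lift_x = lift(46, 54, 2245, 7655)
lift_y = lift(45, 55, 1955, 7945)
print(round(lift_x, 3), round(lift_y, 3))
```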
Score 9 out of 10
11. (Computational Question #4, page 65) A certain mailing list
consists of P names. Suppose a model has been built to
determine a select group of individuals from the list who will
receive a special flyer. As a second option, the flyer can be
sent to all individuals on the list. Use the notation given in the
confusion matrix below to show that the lift for choosing the
model over sending out the flyer to the entire population can
be computed with the equation:
Send Flyer?    Computed Send    Computed Don't Send
Send           C11              C12
Don't Send     C21              C22

Lift = P(C11 | Sample) / P(C11 | Population)

Send Flyer?    Computed Send         Computed Don't Send
Send           c11                   c12                         Sum(Send)
Don't Send     c21                   c22                         Sum(Don't Send)
               Sum(Computed Send)    Sum(Computed Don't Send)    Sum(Total)

Lift = ( c11 / Sum(Computed Send) ) / ( Sum(Send) / Sum(Total) )
So Lift = ( C11 / (C11 + C21) ) / ( (C11+C12) / (C11+C12+C21 +C22) )
and we know that (C11+C12+C21 +C22) = the total number of names P.
Therefore, using substitution …
Lift = ( C11 / (C11 + C21) ) / ( (C11+C12) / P )
Lift = ( C11 / (C11 + C21) ) * ( P / (C11 + C12) )
Lift = ( C11 * P ) / ((C11 + C12) * (C11+ C21) )
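As a sanity check on the algebra, the final expression can be compared with the starting definition using exact rational arithmetic. The sketch below is illustrative only: the function names and the example counts (30, 20, 10, 40) are hypothetical and do not come from the assignment.

```python
from fractions import Fraction


def lift_by_definition(c11, c12, c21, c22):
    """( C11 / (C11 + C21) ) / ( (C11 + C12) / P ), the first line of the derivation."""
    p = c11 + c12 + c21 + c22
    return Fraction(c11, c11 + c21) / Fraction(c11 + c12, p)


def lift_closed_form(c11, c12, c21, c22):
    """( C11 * P ) / ( (C11 + C12) * (C11 + C21) ), the last line of the derivation."""
    p = c11 + c12 + c21 + c22
    return Fraction(c11 * p, (c11 + c12) * (c11 + c21))


# Hypothetical confusion matrix counts, chosen only to exercise both forms.
counts = (30, 20, 10, 40)
print(lift_by_definition(*counts), lift_closed_form(*counts))
```

Using `Fraction` avoids floating-point rounding, so the two forms can be compared for exact equality.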