Matthew Kolker
CSIS 5420
Assignment 1
1. (Question #2, page 30) For each of the following problem scenarios, decide if a solution would
best be addressed with supervised learning, unsupervised clustering, or database query. As
appropriate, state any initial hypothesis you would like to test. If you decide that supervised
learning or unsupervised clustering is the best answer, list several input attributes you believe to
be relevant for solving the problem.
a. What characteristics differentiate people who have had back surgery and have returned to
work from those who have had back surgery and have not returned to their jobs?
This is best solved with supervised learning because the outcome (returned to work or not) is
known for each patient, and the goal is to find the attributes that differentiate the two groups.
My initial hypothesis to test would be blue collar versus white collar differences. The attributes
of concern are: age, gender, height, weight, type of job (degree of physical and mental activity),
education level, number of years at the company, marital status, number of children, and income
level.
b. A major automotive manufacturer recently initiated a tire recall for one of their top-selling
vehicles. The automotive company blames the tires for the unusually high accident rate seen with
their top-seller. The company producing the tires claims the high accident rate only occurs when
their tires are on the vehicle in question. Who is to blame?
A database query can be used to provide statistics on how many vehicles crashed with or without
the particular tires as well as how many other vehicles with these tires crashed or did not crash.
However, to get a more detailed analysis of the problem, supervised learning is more appropriate.
The initial hypothesis would be to test if the tires caused these vehicles to crash. Relevant
attributes will include: vehicle type, year, miles driven, driver gender, driver age, last known tire
pressure, wear and tear on tires, number of passengers when crashed, and weather conditions at
time of crash. This data should be collected for all vehicles by the manufacturer as well as all
vehicles that had the tires regardless of manufacturer.
c. When customers visit my web site, what products are they most likely to buy together?
Since this is looking to find a commonality, it is best addressed by unsupervised clustering. An
initial hypothesis I would check would be the differences between men and women. The
attributes of concern are: age, gender, income level, state of residence, marital status, number of
children, types of items bought, number of items bought per month (or year), and hobbies.
d. What percent of my employees miss one or more days of work per month?
This is a database query because it asks a direct question whose answer can be computed
directly from the data.
e. What relationships can I find between an individual's height, weight, age, and favorite spectator
sport?
This is best ascertained by supervised learning. The basic initial hypothesis will be to test if an
individual’s height, weight, age, and favorite spectator sport are related. The attributes are given
as height, weight, age, and favorite spectator sport. I would venture to suspect that location
(country, state, and/or city) and gender are relevant as well.
2. (Question #3, page 30) Medical doctors are experts at disease diagnosis and surgery. Explain
how medical doctors use induction to help develop their skills.
Induction is the process of using specific examples to make general observations. Medical doctors use
induction to identify commonalities amongst conditions and causes (i.e., disease and other problems). They
learn which conditions accompany each disease so that when a patient presents certain conditions, they can
use an induction tree to work their way to the likely disease. Based on this, they can order the right
procedure or prescription.
3. (Question #6, page 31) What happens when you try to build a decision tree for the data in
Table 1.1 without employing the attributes Swollen Glands and Fever?
Table 1.1 Hypothetical Training Data for Disease Diagnosis
Patient ID   Sore Throat   Swollen Glands   Fever   Congestion   Headache   Diagnosis
1            Yes           Yes              Yes     Yes          Yes        Strep throat
2            No            No               No      Yes          Yes        Allergy
3            Yes           Yes              No      Yes          No         Cold
4            Yes           No               Yes     No           No         Strep throat
5            No            Yes              No      Yes          No         Cold
6            No            No               No      Yes          No         Allergy
7            No            No               Yes     No           No         Strep throat
8            Yes           No               No      Yes          Yes        Allergy
9            No            Yes              No      Yes          Yes        Cold
10           Yes           Yes              No      Yes          Yes        Cold
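Removing Swollen Glands and Fever makes it impossible to build a decision tree that classifies
every training instance correctly, because the remaining attributes no longer separate the
diagnoses. For example, patients 1, 8, and 10 all have all 3 remaining symptoms (sore throat,
congestion, and headache), yet they are diagnosed with strep throat, allergy, and cold
respectively. Even worse, strep throat now occurs both with all 3 symptoms (patient 1) and with
none at all (patient 7). Without attributes that lead to unambiguous terminal nodes, a reliable
decision tree cannot be made.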
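The conflict described above can be checked mechanically. The following is a minimal sketch, assuming the rows of Table 1.1 are typed in by hand and that only Sore Throat, Congestion, and Headache are kept; the function name `conflicting_groups` is made up for illustration. It groups patients by their remaining symptoms and reports every symptom pattern that maps to more than one diagnosis.

```python
from collections import defaultdict

# Rows of Table 1.1 restricted to (Sore Throat, Congestion, Headache, Diagnosis).
# Swollen Glands and Fever are deliberately left out.
ROWS = [
    ("Yes", "Yes", "Yes", "Strep throat"),  # patient 1
    ("No", "Yes", "Yes", "Allergy"),        # patient 2
    ("Yes", "Yes", "No", "Cold"),           # patient 3
    ("Yes", "No", "No", "Strep throat"),    # patient 4
    ("No", "Yes", "No", "Cold"),            # patient 5
    ("No", "Yes", "No", "Allergy"),         # patient 6
    ("No", "No", "No", "Strep throat"),     # patient 7
    ("Yes", "Yes", "Yes", "Allergy"),       # patient 8
    ("No", "Yes", "Yes", "Cold"),           # patient 9
    ("Yes", "Yes", "Yes", "Cold"),          # patient 10
]


def conflicting_groups(rows):
    """Return symptom patterns that occur with more than one diagnosis."""
    diagnoses = defaultdict(set)
    for *symptoms, diagnosis in rows:
        diagnoses[tuple(symptoms)].add(diagnosis)
    return {pattern: found for pattern, found in diagnoses.items() if len(found) > 1}


conflicts = conflicting_groups(ROWS)
for pattern, found in conflicts.items():
    print(pattern, "->", sorted(found))
```

Any pattern listed in `conflicts` is a leaf that no tree over these three attributes can make pure, which is why the reduced table cannot be classified without error.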
4. (Question #6, page 63) Suppose you have used data mining to develop two alternative
models designed to accept or reject home mortgage applications. Both models show an 85% test
set classification correctness. The majority of errors made by model A are false accepts whereas
the majority of errors made by model B are false rejects. Which model should you choose? Justify
your answer.
This depends somewhat on the degree of risk that the lender is willing to accept. However, since
mortgages are secured loans, I would suspect that model A is more desirable. The average cost of
foreclosing on a mortgage will likely be far less than the average cost of rejecting good
applicants.
5. (Question #7, page 63) Suppose you have used data mining to develop two alternative
models designed to decide whether or not to drill for oil. Both models show an 85% test set
classification correctness. The majority of errors made by model A are false accepts whereas the
majority of errors made by model B are false rejects. Which model should you choose? Justify
your answer.
Model A should be chosen because the average cost of drilling at a bad location will be much
less than the average cost of lost revenue from missing a good location. It is most likely that the
bad locations will still produce oil, but maybe not enough to turn a profit for that location.
However, missing a very profitable location is the more costly error.
6. (Question #8, page 63) Explain how unsupervised clustering can be used to evaluate the likely
success of a supervised learner model.
The basic idea here is to add an attribute that is intended to group the conditions in the
supervised learner model. This is demonstrated by the heart patient data example by adding an
“is sick or is healthy” attribute. If the model is good, then putting the data through an
unsupervised clustering test should cause the like groups to cluster together. The unsupervised
clustering can also be used to identify outliers.
7. (Question #9, page 63) Explain how supervised learning can be used to help evaluate the
results of an unsupervised clustering model.
There are three basic steps:
1. Perform unsupervised clustering; assign arbitrary names to the clusters.
2. Choose a random sample from each cluster. Each cluster should be represented in the
sample in the same proportion as in the clustering. The book suggests about two-thirds of
all instances should be chosen for a sample.
3. Use the sample to create a supervised learning model then run the remaining instances
through the model.
Doing this presents the data in a structured form and gives some idea of how reliably the
unsupervised clustering model groups the instances.
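The three steps can be sketched in a few lines of plain Python. Everything here is an illustrative assumption rather than the book's procedure: the two-blob data set is made up, k-means stands in for the unsupervised clustering, a nearest-class-mean classifier stands in for the supervised learner, and the starting centers are picked by hand so the run is deterministic.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: dist2(point, centers[i]))


def kmeans(points, initial_centers, iterations=10):
    """Step 1: cluster the points; cluster names are just arbitrary indices."""
    centers = list(initial_centers)
    for _ in range(iterations):
        labels = [nearest(p, centers) for p in points]
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = mean(members)
    return [nearest(p, centers) for p in points]


def evaluate_clustering(points, labels, train_fraction=2 / 3, seed=0):
    """Steps 2 and 3: stratified sample, train, then test on the rest."""
    rng = random.Random(seed)
    train, test = [], []
    for c in sorted(set(labels)):
        members = [p for p, l in zip(points, labels) if l == c]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)  # about two-thirds per cluster
        train += [(p, c) for p in members[:cut]]
        test += [(p, c) for p in members[cut:]]
    # Supervised model: one mean per cluster name, learned from the sample only.
    classes = sorted({c for _, c in train})
    class_means = [mean([p for p, c in train if c == cls]) for cls in classes]
    hits = sum(classes[nearest(p, class_means)] == c for p, c in test)
    return hits / len(test)  # share of held-out instances assigned to their cluster


# Hypothetical data: two well-separated blobs of 30 points each.
rng = random.Random(1)
blob_a = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(30)]
blob_b = [(10 + rng.uniform(-1, 1), 10 + rng.uniform(-1, 1)) for _ in range(30)]
points = blob_a + blob_b

labels = kmeans(points, initial_centers=[points[0], points[-1]])
score = evaluate_clustering(points, labels)
print("held-out agreement with the clustering:", score)
```

A high agreement on the held-out third suggests the clusters are well defined; a low one suggests the clustering does not reflect stable structure in the data.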
8. (Computational Question #1, page 63) Consider the following three-class confusion matrix. The
matrix shows the classification results of a supervised model that uses previous voting records to
determine the political party affiliation (Republican, Democrat, or Independent) of members of the
United States Senate.
         Computed Decision
         Rep    Dem    Ind
Rep      42     2      1
Dem      5      40     3
Ind      0      3      4
a. What percent of the instances were correctly classified?
(42+40+4)/100 * 100% = 86%
b. According to the confusion matrix, how many Democrats are in the Senate? How many
Republicans? How many Independents?
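Taking each row as the actual party (the same reading used in parts c and d), the row totals give:
Democrats – 48 (5 + 40 + 3)
Republicans – 45 (42 + 2 + 1)
Independents – 7 (0 + 3 + 4)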
c. How many Republicans were classified as belonging to the Democratic Party?
2
d. How many Independents were classified as Republicans?
0
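The arithmetic for this question can be reproduced with a short sketch. It assumes the usual convention that each row is the actual party and each column the computed decision, which is the reading parts c and d rely on; the function names are made up for illustration. Row totals then count actual members of each party, while column totals count how many senators the model assigned to each party.

```python
LABELS = ["Rep", "Dem", "Ind"]

# MATRIX[i][j] = senators of actual party i classified as party j (assumed layout).
MATRIX = [
    [42, 2, 1],
    [5, 40, 3],
    [0, 3, 4],
]


def accuracy(matrix):
    """Share of instances on the main diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def actual_totals(matrix, labels):
    """Row sums: how many instances truly belong to each class."""
    return {label: sum(row) for label, row in zip(labels, matrix)}


def computed_totals(matrix, labels):
    """Column sums: how many instances the model assigned to each class."""
    return {label: sum(col) for label, col in zip(labels, zip(*matrix))}


print(accuracy(MATRIX))                # 0.86
print(actual_totals(MATRIX, LABELS))   # {'Rep': 45, 'Dem': 48, 'Ind': 7}
print(computed_totals(MATRIX, LABELS)) # {'Rep': 47, 'Dem': 45, 'Ind': 8}
print(MATRIX[0][1], MATRIX[2][0])      # parts c and d: 2 0
```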
9. (Computational Question #2, page 64) Suppose we have two classes each with 100 instances.
The instances in one class contain information about individuals who currently have credit card
insurance. The instances in the second class include information about individuals who have at
least one credit card but are without credit card insurance. Use the following to answer the
questions below:
IF Life Insurance = Yes & Income > $50K
THEN Credit Card Insurance = Yes
Rule Accuracy = 80%
Rule Coverage = 40%
a. How many individuals represented by the instances in the class of credit card insurance
holders have life insurance and make more than $50,000 per year?
64
b. How many instances representing individuals who do not have credit card insurance have life
insurance and make more than $50,000 per year?
16
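The answers 64 and 16 follow if rule coverage is read as the share of all 200 instances that satisfy the rule's antecedent; that reading is an assumption, since the question does not spell out the definition. If coverage is instead read as a within-class measure (the share of the 100 credit card insurance holders that satisfy the antecedent), the same figures give 40 and 10. The sketch below, with hypothetical function names, works through both readings.

```python
def counts_with_population_coverage(n_total, accuracy, coverage):
    """Coverage = share of ALL instances satisfying the antecedent (assumed reading)."""
    covered = coverage * n_total        # instances that satisfy the antecedent
    correct = accuracy * covered        # of those, the ones in the rule's class
    return round(correct), round(covered - correct)


def counts_with_class_coverage(n_class, accuracy, coverage):
    """Coverage = share of the rule's own class satisfying the antecedent."""
    correct = coverage * n_class        # class members that satisfy the antecedent
    covered = correct / accuracy        # all instances that satisfy the antecedent
    return round(correct), round(covered - correct)


# (with credit card insurance, without credit card insurance)
print(counts_with_population_coverage(200, 0.80, 0.40))  # (64, 16)
print(counts_with_class_coverage(100, 0.80, 0.40))       # (40, 10)
```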
10. (Computational Question #3, page 64) Consider the confusion matrices shown below.
a. Compute the lift for Model X.
(46/2291)/(100/10000) = 2.008
b. Compute the lift for Model Y.
(45/2000)/(100/10000) = 2.25
Model X    Computed Accept    Computed Reject
Accept     46                 54
Reject     2,245              7,655

Model Y    Computed Accept    Computed Reject
Accept     45                 55
Reject     1,955              7,945
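Both lift values can be recomputed directly from the two matrices. This is a minimal sketch, assuming rows are actual classes and columns computed classes, with lift taken as the accept rate among instances the model selects divided by the accept rate in the whole population; the function name `lift` is chosen for illustration.

```python
def lift(matrix):
    """Lift for a 2x2 confusion matrix [[TA, FR], [FA, TR]]."""
    (true_accept, false_reject), (false_accept, true_reject) = matrix
    population = true_accept + false_reject + false_accept + true_reject
    # Accept rate among the instances the model chose to accept.
    sample_rate = true_accept / (true_accept + false_accept)
    # Accept rate in the population as a whole.
    population_rate = (true_accept + false_reject) / population
    return sample_rate / population_rate


MODEL_X = [[46, 54], [2245, 7655]]
MODEL_Y = [[45, 55], [1955, 7945]]

print(f"{lift(MODEL_X):.3f}")  # 2.008
print(f"{lift(MODEL_Y):.3f}")  # 2.250
```

Model Y accepts one fewer true responder than Model X but selects far fewer instances overall, which is why its lift is higher.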
11. (Computational Question #4, page 65) A certain mailing list consists of P names. Suppose a
model has been built to determine a select group of individuals from the list who will receive a
special flyer. As a second option, the flyer can be sent to all individuals on the list. Use the
notation given in the confusion matrix below to show that the lift for choosing the model over
sending out the flyer to the entire population can be computed with the equation:
Lift = (C11 * P) / ( (C11 + C12) * (C11 + C21) )
Send Flyer?    Computed Send    Computed Don't Send
Send           C11              C12
Don't Send     C21              C22
Lift = (C11 / (C11 + C21) ) / ( (C11 + C12) / P )
Lift = (C11 / (C11 + C21) ) * ( P / (C11 + C12) )
Lift = (C11 * P) / ( (C11 + C21) * (C11 + C12) )
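As a sanity check on the algebra, the ratio form and the closed form can be compared with exact rational arithmetic. The sketch assumes P = C11 + C12 + C21 + C22, i.e. every name on the list falls into exactly one cell; the function names are made up for illustration, and the sample counts are the Model X and Model Y matrices from question 10.

```python
from fractions import Fraction


def lift_by_definition(c11, c12, c21, c22):
    """(C11 / (C11 + C21)) / ((C11 + C12) / P), computed exactly."""
    p = c11 + c12 + c21 + c22
    return Fraction(c11, c11 + c21) / Fraction(c11 + c12, p)


def lift_closed_form(c11, c12, c21, c22):
    """(C11 * P) / ((C11 + C12) * (C11 + C21)), computed exactly."""
    p = c11 + c12 + c21 + c22
    return Fraction(c11 * p, (c11 + c12) * (c11 + c21))


for cells in [(46, 54, 2245, 7655), (45, 55, 1955, 7945)]:
    # Exact fractions avoid any floating-point doubt about the equality.
    assert lift_by_definition(*cells) == lift_closed_form(*cells)
    print(cells, float(lift_closed_form(*cells)))
```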