Matthew Kolker
CSIS 5420
Assignment 1

1. (Question #2, page 30) For each of the following problem scenarios, decide if a solution would best be addressed with supervised learning, unsupervised clustering, or database query. As appropriate, state any initial hypothesis you would like to test. If you decide that supervised learning or unsupervised clustering is the best answer, list several input attributes you believe to be relevant for solving the problem.

a. What characteristics differentiate people who have had back surgery and have returned to work from those who have had back surgery and have not returned to their jobs?
This is best solved with supervised learning because the outcome (returned to work or not) is already known for each person, and the goal is to find the attributes that separate the two groups. My initial hypothesis to test would be blue collar versus white collar differences. The attributes of concern are: age, gender, height, weight, type of job (degree of physical and mental activity), education level, number of years at the company, marital status, number of children, and income level.

b. A major automotive manufacturer recently initiated a tire recall for one of their top-selling vehicles. The automotive company blames the tires for the unusually high accident rate seen with their top-seller. The company producing the tires claims the high accident rate only occurs when their tires are on the vehicle in question. Who is to blame?
A database query can be used to provide statistics on how many of these vehicles crashed with or without the particular tires, as well as how many other vehicles with these tires crashed or did not crash. However, to get a more detailed analysis of the problem, supervised learning is more appropriate. The initial hypothesis would be to test if the tires caused these vehicles to crash. Relevant attributes include: vehicle type, year, miles driven, driver gender, driver age, last known tire pressure, wear and tear on tires, number of passengers at the time of the crash, and weather conditions at the time of the crash.
This data should be collected for all vehicles made by the manufacturer as well as all vehicles that had the tires, regardless of manufacturer.

c. When customers visit my web site, what products are they most likely to buy together?
Since this is looking to find a commonality, it is best addressed by unsupervised clustering. An initial hypothesis I would check is whether purchases differ between men and women. The attributes of concern are: age, gender, income level, state of residence, marital status, number of children, types of items bought, number of items bought per month (or year), and hobbies.

d. What percent of my employees miss one or more days of work per month?
This is a database query because it asks a direct question that can be answered directly from the data.

e. What relationships can I find between an individual's height, weight, age, and favorite spectator sport?
This is best ascertained by supervised learning. The basic initial hypothesis will be to test if an individual's height, weight, age, and favorite spectator sport are related. The attributes are given as height, weight, age, and favorite spectator sport. I suspect that location (country, state, and/or city) and gender are relevant as well.

2. (Question #3, page 30) Medical doctors are experts at disease diagnosis and surgery. Explain how medical doctors use induction to help develop their skills.
Induction is the process of using specific examples to make general observations. Medical doctors use induction to identify commonalities amongst conditions and causes (i.e., disease and other problems). They learn which conditions accompany each disease so that when a patient presents with certain conditions, they can use an induction tree to work their way to the likely disease. Based on this, they can issue the right procedure or prescription.

3.
(Question #6, page 31) What happens when you try to build a decision tree for the data in Table 1.1 without employing the attributes Swollen Glands and Fever?

Table 1.1 Hypothetical Training Data for Disease Diagnosis

Patient ID  Sore Throat  Swollen Glands  Fever  Congestion  Headache  Diagnosis
1           Yes          Yes             Yes    Yes         Yes       Strep throat
2           No           No              No     Yes         Yes       Allergy
3           Yes          Yes             No     Yes         No        Cold
4           Yes          No              Yes    No          No        Strep throat
5           No           Yes             No     Yes         No        Cold
6           No           No              No     Yes         No        Allergy
7           No           No              Yes    No          No        Strep throat
8           Yes          No              No     Yes         Yes       Allergy
9           No           Yes             No     Yes         Yes       Cold
10          Yes          Yes             No     Yes         Yes       Cold

Removing Swollen Glands and Fever makes it impossible to build a proper decision tree because it removes terminal nodes. For example, both cold and strep throat have all 3 remaining symptoms (patients 1 and 10). Even worse, strep throat can now be diagnosed by having all 3 symptoms or by having none at all (patients 1 and 7). Without the ability to identify terminal nodes, a decision tree cannot be made.

4. (Question #6, page 63) Suppose you have used data mining to develop two alternative models designed to accept or reject home mortgage applications. Both models show an 85% test set classification correctness. The majority of errors made by model A are false accepts whereas the majority of errors made by model B are false rejects. Which model should you choose? Justify your answer.
This depends somewhat on the degree of risk that the lender wants. However, since mortgages are secured loans, I would suspect that model A is more desirable. The average cost of foreclosing on a mortgage will likely be far less than the average cost of rejecting good applicants.

5. (Question #7, page 63) Suppose you have used data mining to develop two alternative models designed to decide whether or not to drill for oil. Both models show an 85% test set classification correctness. The majority of errors made by model A are false accepts whereas the majority of errors made by model B are false rejects. Which model should you choose? Justify your answer.
Model A should be chosen because the average cost of drilling at a bad location will be much less than the average cost of lost revenue from missing a good location. It is most likely that the bad locations will still produce oil, but maybe not enough to turn a profit for that location. However, the chance of missing a very profitable location is undesirable.

6. (Question #8, page 63) Explain how unsupervised clustering can be used to evaluate the likely success of a supervised learner model.
The basic idea here is to add an attribute that is intended to group the conditions in the supervised learner model. This is demonstrated by the heart patient data example by adding an "is sick or is healthy" attribute. If the model is good, then putting the data through an unsupervised clustering test should cause the like groups to cluster together. The unsupervised clustering can also be used to identify outliers.

7. (Question #9, page 63) Explain how supervised learning can be used to help evaluate the results of an unsupervised clustering model.
There are three basic steps:
   1. Perform unsupervised clustering; assign arbitrary names to the clusters.
   2. Choose a random sample from each cluster. Each cluster should be represented in the sample in the same ratio as in the clustering. The book suggests that about two-thirds of all instances should be chosen for the sample.
   3. Use the sample to create a supervised learning model, then run the remaining instances through the model.
Doing this will help to show the data in a structured system as well as provide some idea of how reliably the unsupervised clustering model classifies the instances.

8. (Computational Question #1, page 63) Consider the following three-class confusion matrix. The matrix shows the classification results of a supervised model that uses previous voting records to determine the political party affiliation (Republican, Democrat, or Independent) of members of the United States Senate.

        Computed Decision
        Rep   Dem   Ind
Rep      42     2     1
Dem       5    40     3
Ind       0     3     4

a. What percent of the instances were correctly classified?
(42 + 40 + 4) / 100 * 100% = 86%

b. According to the confusion matrix, how many Democrats are in the Senate? How many Republicans? How many Independents?
The rows give the actual party, so the counts are the row totals: Democrats: 48, Republicans: 45, Independents: 7. (The column totals, 45 Democrats, 47 Republicans, and 8 Independents, are the numbers the model computed for each party.)

c. How many Republicans were classified as belonging to the Democratic Party?
2

d. How many Independents were classified as Republicans?
0

9. (Computational Question #2, page 64) Suppose we have two classes each with 100 instances. The instances in one class contain information about individuals who currently have credit card insurance. The instances in the second class include information about individuals who have at least one credit card but are without credit card insurance. Use the following to answer the questions below:

IF Life Insurance = Yes & Income > $50K
THEN Credit Card Insurance = Yes
Rule Accuracy = 80%
Rule Coverage = 40%

a. How many individuals represented by the instances in the class of credit card insurance holders have life insurance and make more than $50,000 per year?
64

b. How many instances representing individuals who do not have credit card insurance have life insurance and make more than $50,000 per year?
16

10. (Computational Question #3, page 64) Consider the confusion matrices shown below.

Model X    Computed Accept    Computed Reject
Accept                  46                 54
Reject               2,245              7,655

Model Y    Computed Accept    Computed Reject
Accept                  45                 55
Reject               1,955              7,945

a. Compute the lift for Model X.
(46/2291) / (100/10000) = 2.008

b. Compute the lift for Model Y.
(45/2000) / (100/10000) = 2.25

11. (Computational Question #4, page 65) A certain mailing list consists of P names. Suppose a model has been built to determine a select group of individuals from the list who will receive a special flyer. As a second option, the flyer can be sent to all individuals on the list.
Use the notation given in the confusion matrix below to show that the lift for choosing the model over sending out the flyer to the entire population can be computed with the equation:

Lift = (C11 * P) / ( (C11 + C12) * (C11 + C21) )

Send Flyer?    Computed Send    Computed Don't Send
Send           C11              C12
Don't Send     C21              C22

Lift is the response rate among the individuals the model selects, C11 / (C11 + C21), divided by the response rate in the whole population, (C11 + C12) / P:

Lift = (C11 / (C11 + C21) ) / ( (C11 + C12) / P )
Lift = (C11 / (C11 + C21) ) * ( P / (C11 + C12) )
Lift = (C11 * P) / ( (C11 + C21) * (C11 + C12) )
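
The lift arithmetic in questions 10 and 11 can be checked numerically. The following is a minimal sketch, not part of the assignment: the function names `lift` and `lift_closed_form` are hypothetical, and it assumes the matrix layout used above (rows are the actual class, columns are the computed class, and P is the sum of all four cells).

```python
def lift(c11, c12, c21, c22):
    """Lift as a ratio of two response rates (assumed layout: rows actual, columns computed)."""
    p = c11 + c12 + c21 + c22          # P: size of the whole population
    model_rate = c11 / (c11 + c21)     # responders among those the model selects
    baseline_rate = (c11 + c12) / p    # responders in the whole population
    return model_rate / baseline_rate


def lift_closed_form(c11, c12, c21, c22):
    """The closed form from question 11: (C11 * P) / ((C11 + C12) * (C11 + C21))."""
    p = c11 + c12 + c21 + c22
    return (c11 * p) / ((c11 + c12) * (c11 + c21))


# Confusion matrices from question 10, as (C11, C12, C21, C22).
model_x = (46, 54, 2245, 7655)
model_y = (45, 55, 1955, 7945)

print(round(lift(*model_x), 3), round(lift(*model_y), 3))  # prints 2.008 2.25
```

Both functions should agree on any matrix with nonzero denominators, which is what the derivation in question 11 claims.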