Monica Nusskern
CSIS 5420 – Data Mining
Week 1 Assignment
June 3, 2005

Total Score 103 out of 110

Score 10 out of 10
1. (Question #2, page 30) For each of the following problem scenarios, decide if a solution would best be addressed with supervised learning, unsupervised clustering, or database query. As appropriate, state any initial hypothesis you would like to test. If you decide that supervised learning or unsupervised clustering is the best answer, list several input attributes you believe to be relevant for solving the problem.

a. What characteristics differentiate people who have had back surgery and have returned to work from those who have had back surgery and have not returned to their jobs?

The best solution for this would be supervised learning because this scenario clearly offers an attribute whose value represents a set of predefined output classes (returned to work or did not return). The initial hypothesis could be: if a person is over 40 years of age and performs a job with a high percentage of physical labor, then they are less likely to return to work than someone who is under 40 and performs a job with a low amount of physical labor.

Instructor comment: Good

Input Attributes: Worker ID, Job Type, Age, Sex

b. A major automotive manufacturer recently initiated a tire recall for one of their top-selling vehicles. The automotive company blames the tires for the unusually high accident rate seen with their top-seller. The company producing the tires claims the high accident rate only occurs when their tires are on the vehicle in question. Who is to blame?

The best solution would be supervised learning, and a decision tree could be used to reach a conclusion. The initial hypothesis could be: if the vehicle has an accident without these tires, then the automotive manufacturer is to blame. The opposite could also be asked: if different types of vehicles use the tires in question and have a high accident rate, then the tire maker is to blame.
Instructor comment: Good

Input Attributes: Car ID, Questionable Tires on Vehicle, Type of Vehicle

c. When customers visit my web site, what products are they most likely to buy together?

The best solution would be unsupervised clustering because data mining can be applied to a company’s data to gain insight into possible patterns in the database. An initial hypothesis could be: if a customer purchases Product A, then Product B will likely be purchased as well.

Instructor comment: Good

Input Attributes: Customer ID, Transaction Method, Item Purchased

d. What percent of my employees miss one or more days of work per month?

A database query could be run against a database that stores human resource data. Once the data for each employee is extracted, simple calculations could determine what percentage of employees miss one or more days of work per month.

Instructor comment: Good

e. What relationships can I find between an individual's height, weight, age, and favorite spectator sport?

Unsupervised clustering would be the best method for this scenario because it is necessary to build models without predefined classes. The initial hypothesis would be that a relationship exists between individual demographics and favorite spectator sport. It is quite possible that no valid relationship exists.

Instructor comment: Good

Input Attributes: Person ID, Height, Weight, Age, Favorite Sport

Score 10 out of 10
2. (Question #3, page 30) Medical doctors are experts at disease diagnosis and surgery. Explain how medical doctors use induction to help develop their skills.

Induction-based learning is the process of forming a general concept definition by observing specific examples of the concept to be learned. Induction allows doctors to reach a diagnosis based on observations of particular patterns and to establish the necessary treatment based on these observations of recurring patterns.
Once these patterns are learned, doctors can use this experience to treat similar patterns in their future patients.

Instructor comment: Very good

Score 10 out of 10
3. (Question #6, page 31) What happens when you try to build a decision tree for the data in Table 1.1 without employing the attributes Swollen Glands and Fever?

Table 1.1 Hypothetical Training Data for Disease Diagnosis

Patient ID  Sore Throat  Swollen Glands  Fever  Congestion  Headache  Diagnosis
1           Yes          Yes             Yes    Yes         Yes       Strep throat
2           No           No              No     Yes         Yes       Allergy
3           Yes          Yes             No     Yes         No        Cold
4           Yes          No              Yes    No          No        Strep throat
5           No           Yes             No     Yes         No        Cold
6           No           No              No     Yes         No        Allergy
7           No           No              Yes    No          No        Strep throat
8           Yes          No              No     Yes         Yes       Allergy
9           No           Yes             No     Yes         Yes       Cold
10          Yes          Yes             No     Yes         Yes       Cold

Patients could be misdiagnosed without using the symptoms of swollen glands and fever. The patient could be diagnosed with strep throat and actually have a cold. This could lead to taking unnecessary medication.

Instructor comment: Good. Let's pick sore throat as the top-level node. The only possibilities are yes and no. Instances 1, 3, 4, 8, and 10 follow the yes path; instances 2, 5, 6, 7, and 9 follow the no path. The path for sore throat = yes has representatives from all three classes, as does sore throat = no. Next we follow the sore throat = yes path and choose headache, so we need only concern ourselves with instances 1, 3, 4, 8, and 10. For headache = yes we have instances 1 (strep throat), 8 (allergy), and 10 (cold). For headache = no we have instances 3 (cold) and 4 (strep throat). Next follow headache = yes and choose congestion, the only remaining attribute. All three instances show congestion = yes, so the tree is unable to further differentiate them. Following headache = no, congestion does separate instance 3 (cold, congestion = yes) from instance 4 (strep throat, congestion = no), but the headache = yes instances remain mixed. Therefore, the path following sore throat = yes is unable to fully differentiate the five instances.
The problem repeats itself for the path sore throat = no. In general, any top-level node choice of sore throat, congestion, or headache gives a similar result.

Score 10 out of 10
4. (Question #6, page 63) Suppose you have used data mining to develop two alternative models designed to accept or reject home mortgage applications. Both models show an 85% test set classification correctness. The majority of errors made by model A are false accepts whereas the majority of errors made by model B are false rejects. Which model should you choose? Justify your answer.

I would choose model B because false rejects would cost the firm much less money than false accepts. If the majority of the errors for model A were false accepts, people who were not qualified candidates for home loans would be accepted anyway. This could be detrimental to the company, as these applicants would not pay their mortgage bills, resulting in less income for the company while large expenses would be incurred.

Instructor comment: OK, but consider this perspective: since a mortgage is secured credit, is there much risk in false accepts?

Score 10 out of 10
5. (Question #7, page 63) Suppose you have used data mining to develop two alternative models designed to decide whether or not to drill for oil. Both models show an 85% test set classification correctness. The majority of errors made by model A are false accepts whereas the majority of errors made by model B are false rejects. Which model should you choose? Justify your answer.

I would choose model A because the company could be missing out on large income by not drilling in an area where, in actuality, it should. If a company drilled for oil where none existed, the result could be used as future knowledge to apply to the models that were developed.
However, while I chose model A for this question, I do see the benefits of using model B, such as avoiding the environmental impact of drilling where no oil exists.

Instructor comment: OK, but consider: if the cost of drilling for oil is very high, Model B is the best choice.

Score 10 out of 10
6. (Question #8, page 63) Explain how unsupervised clustering can be used to evaluate the likely success of a supervised learner model.

Unsupervised clustering can be used to evaluate the likely success of a supervised learner model by:
- Using a confusion matrix to compute model accuracy by summing the values found on the main diagonal and dividing this sum by the total number of test set instances.
- Using two-class error analysis to denote false accepts and false rejects.
- Using mean absolute error and mean squared error to evaluate supervised models having numeric output.

Instructor comment: OK, but let me suggest a simpler answer. In a supervised learner model, we pre-determine which attributes will be used to classify our data and what specific clusters we will accept. In other words, we assume that a chosen set of attributes will classify our data under a chosen output attribute. If our unsupervised learner determines that the same input attributes will form clusters that differentiate the values of the output attribute, then the complementary results verify the supervised learner assumptions.

Score 10 out of 10
7. (Question #9, page 63) Explain how supervised learning can be used to help evaluate the results of an unsupervised clustering model.

Supervised learning can be used to help evaluate the results of an unsupervised clustering model by the following technique:
- Perform an unsupervised clustering. Designate each cluster as a class and assign each an arbitrary name such as C1, C2, and C3.
- Choose a random sample of instances from each of the classes produced by the instance clustering.
Each class should be represented in the random sample in the same ratio as it is represented in the entire dataset. A good sample is two-thirds of all instances.
- Build a supervised learner model with the class name as the output attribute using the randomly sampled instances as training data.
- Employ the remaining instances to test the supervised model for classification correctness.

Instructor comment: Very good

Score 7 out of 10
8. (Computational Question #1, page 63) Consider the following three-class confusion matrix. The matrix shows the classification results of a supervised model that uses previous voting records to determine the political party affiliation (Republican, Democrat, or Independent) of members of the United States Senate.

        Computed Decision
        Rep   Dem   Ind
Rep     42    2     1
Dem     5     40    3
Ind     0     3     4

a. What percent of the instances were correctly classified?
86%
Instructor comment: Good

b. According to the confusion matrix, how many Democrats are in the Senate? How many Republicans? How many Independents?
40 Democrats, 42 Republicans, 4 Independents
Instructor comment: 48 Democrats, 45 Republicans, 7 Independents. Add across the rows. There are 100 total senators.

c. How many Republicans were classified as belonging to the Democratic Party?
2 Republicans
Instructor comment: Good

d. How many Independents were classified as Republicans?
0 Independents
Instructor comment: Good

Score 7 out of 10
9. (Computational Question #2, page 64) Suppose we have two classes each with 100 instances. The instances in one class contain information about individuals who currently have credit card insurance. The instances in the second class include information about individuals who have at least one credit card but are without credit card insurance. Use the following to answer the questions below:

IF Life Insurance = Yes & Income > $50K
THEN Credit Card Insurance = Yes
Rule Accuracy = 80%
Rule Coverage = 40%

a.
How many individuals represented by the instances in the class of credit card insurance holders have life insurance and make more than $50,000 per year?
80 individuals
Instructor comment: 40 instances

b. How many instances representing individuals who do not have credit card insurance have life insurance and make more than $50,000 per year?
80 individuals
Instructor comment: 10 instances

Score 10 out of 10
10. (Computational Question #3, page 64) Consider the confusion matrices shown below.

Model X     Computed Accept   Computed Reject
Accept      46                54
Reject      2,245             7,655

Model Y     Computed Accept   Computed Reject
Accept      45                55
Reject      1,955             7,945

a. Compute the lift for Model X.
Lift = 2.00785
Instructor comment: 2.008

b. Compute the lift for Model Y.
Lift = 2.25
Instructor comment: Good

Score 9 out of 10
11. (Computational Question #4, page 65) A certain mailing list consists of P names. Suppose a model has been built to determine a select group of individuals from the list who will receive a special flyer. As a second option, the flyer can be sent to all individuals on the list. Use the notation given in the confusion matrix below to show that the lift for choosing the model over sending out the flyer to the entire population can be computed with the equation:

Lift = P(C11 | Sample) / P(C11 | Population)

Send Flyer?   Computed Send   Computed Don't Send
Send          C11             C12
Don't Send    C21             C22

Send Flyer?   Computed Send        Computed Don't Send
Send          c11                  c12                        Sum(Send)
Don't Send    c21                  c22                        Sum(Don't Send)
              Sum(Computed Send)   Sum(Computed Don't Send)   Sum(Total)

Lift = ( c11 / Sum(Computed Send) ) / ( Sum(Send) / Sum(Total) )

So
Lift = ( C11 / (C11 + C21) ) / ( (C11 + C12) / (C11 + C12 + C21 + C22) )
and we know that (C11 + C12 + C21 + C22) = P, the total number of names. Therefore, using substitution:
Lift = ( C11 / (C11 + C21) ) / ( (C11 + C12) / P )
Lift = ( C11 / (C11 + C21) ) * ( P / (C11 + C12) )
Lift = ( C11 * P ) / ( (C11 + C12) * (C11 + C21) )
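
As a quick numeric check of the final expression, the short Python sketch below evaluates lift both as the ratio of the two rates and in the closed form, using the Model X and Model Y confusion matrices from question 10. The function names `lift` and `lift_from_rates` are hypothetical, chosen only for this illustration, and the sketch assumes every row and column sum is nonzero.

```python
def lift_from_rates(c11: int, c12: int, c21: int, c22: int) -> float:
    """Lift as (accept rate among computed accepts) / (accept rate overall)."""
    p = c11 + c12 + c21 + c22              # P: total number of instances
    sample_rate = c11 / (c11 + c21)        # C11 / Sum(Computed Send)
    population_rate = (c11 + c12) / p      # Sum(Send) / Sum(Total)
    return sample_rate / population_rate


def lift(c11: int, c12: int, c21: int, c22: int) -> float:
    """Closed form: (C11 * P) / ((C11 + C12) * (C11 + C21))."""
    p = c11 + c12 + c21 + c22
    return (c11 * p) / ((c11 + c12) * (c11 + c21))


# Confusion matrices from question 10 (rows: actual, columns: computed).
model_x = lift(46, 54, 2245, 7655)
model_y = lift(45, 55, 1955, 7945)

print(round(model_x, 3))  # 2.008
print(round(model_y, 3))  # 2.25
```

Both functions return the same value for a given matrix, which is what the substitution steps above establish algebraically.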