EECS 647: Introduction to Database Systems
Instructor: Luke Huan
Spring 2009

Administrative
• Homework 6 will be posted at the class website.
    • There is no due date.
• Final project demonstrations are scheduled on May 6th.
    • Everyone is strongly encouraged to do a demonstration.
    • If you want to give one, please send me the following information by May 5th:
        • Your team name, your name, and your partner's name
• Final project report is due on May 12th at 1:30.

5/3/2009 Luke Huan Univ. of Kansas

Summer Research Assistant Position
• I have several summer research assistant positions
    • Focusing on developing and applying data mining techniques to biological data
• Hands-on experience of interdisciplinary research
    • Interactions with biologists and chemists for drug development!
    • KU has a $20M NIH center for exploring the interface of chemistry with biology
• A good starting point for graduate study at KU
• If interested, send me an email to schedule an appointment

Illustrating Classification Task
Training Data -> Classification Algorithms -> Classifier (Model)

Training Data
NAME    RANK             YEARS   TENURED
Mike    Assistant Prof   3       no
Mary    Assistant Prof   7       yes
Bill    Professor        2       yes
Jim     Associate Prof   7       yes
Dave    Assistant Prof   6       no
Anne    Associate Prof   3       no

Classifier (Model)
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Apply Model to Data
The classifier is applied first to the testing data and then to unseen data.

Testing Data
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data
(Jeff, Professor, 4) -> Tenured?

Classification: Application 1
• Direct Marketing
    • Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
    • Approach:
        • Use the data for a similar product introduced before.
        • We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
        • Collect various demographic, lifestyle, and company-interaction related information about all such customers.
            • Type of business, where they stay, how much they earn, etc.
        • Use this information as input attributes to learn a classifier model.
From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 2
• Fraud Detection
    • Goal: Predict fraudulent cases in credit card transactions.
    • Approach:
        • Use credit card transactions and the information on the account holder as attributes.
            • When does a customer buy, what does he buy, how often does he pay on time, etc.
        • Label past transactions as fraud or fair transactions. This forms the class attribute.
        • Learn a model for the class of the transactions.
        • Use this model to detect fraud by observing credit card transactions on an account.

More Examples of Classification
• Predicting tumor cells as benign or malignant
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.

Decision Tree Training Dataset
age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no

Output: A Decision Tree for “buys_computer”
age?
    <=30: student?
        no: no
        yes: yes
    31…40: yes
    >40: credit rating?
        excellent: no
        fair: yes

Tree Induction
• Greedy strategy.
    • Split the records based on an attribute test that optimizes a certain criterion.
• Issues
    • Determine how to split the records
        • How to specify the attribute test condition?
        • How to determine the best split?
    • Determine when to stop splitting

Splitting Based on Continuous Attributes
[Figure]

How to Determine the Best Split
Before splitting: 10 records of class 0, 10 records of class 1
Which test condition is the best?
[Figure]

Decision Tree Based Classification
• Advantages:
    • Inexpensive to construct
    • Extremely fast at classifying unknown records
    • Easy to interpret for small-sized trees
    • Accuracy is comparable to other classification techniques for many simple data sets

Overfitting due to Noise
The decision boundary is distorted by a noise point.
[Figure]

Decision Boundary
• The border line between two neighboring regions of different classes is known as the decision boundary
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time

Oblique Decision Trees
[Figure: a test node x + y < 1 separating Class = + from the second class]
• Test condition may involve multiple attributes
• More expressive representation
• Finding the optimal test condition is computationally expensive
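
The rule on the "Classifier (Model)" slide can be checked against the testing data from the "Apply Model to Data" slide. The following is a minimal sketch; the function and variable names (`predict_tenured`, `TESTING_DATA`, `accuracy`) are illustrative and do not come from the slides.

```python
# Rule from the "Classifier (Model)" slide:
#   IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank: str, years: int) -> str:
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


# Testing data from the "Apply Model to Data" slide: (name, rank, years, tenured)
TESTING_DATA = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]


def accuracy(rows) -> float:
    """Fraction of rows whose predicted label matches the recorded label."""
    correct = sum(
        predict_tenured(rank, years) == label for _, rank, years, label in rows
    )
    return correct / len(rows)


if __name__ == "__main__":
    for name, rank, years, label in TESTING_DATA:
        print(name, "predicted:", predict_tenured(rank, years), "actual:", label)
    print("test accuracy:", accuracy(TESTING_DATA))
    # Unseen record from the slide: (Jeff, Professor, 4)
    print("Jeff:", predict_tenured("Professor", 4))
```

The rule misclassifies Merlisa (7 years, not tenured), which is why a model is evaluated on testing data before it is applied to unseen records such as Jeff.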
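
The "Tree Induction" slide asks how to determine the best split but does not name a criterion. As one possible illustration, the sketch below scores each attribute of the buys_computer training dataset by information gain (entropy reduction). Using entropy here is an assumption, as are all names in the code; other impurity measures would be equally valid.

```python
from collections import Counter
from math import log2

COLUMNS = ("age", "income", "student", "credit_rating")

# Training dataset from the slides; the last field is buys_computer.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def entropy(labels) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, col: int) -> float:
    """Entropy of the whole set minus the weighted entropy after splitting on col."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


GAINS = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS)}
BEST = max(GAINS, key=GAINS.get)

if __name__ == "__main__":
    for name, gain in sorted(GAINS.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {gain:.3f}")
    print("best first split:", BEST)
```

Under this criterion, age scores highest, which is consistent with age being the root test of the output tree shown on the slides.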
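
The output tree for "buys_computer" can be written down as a small nested structure and walked from the root to a leaf, which illustrates why classifying a record with a decision tree is fast: it costs one lookup per level. This is a sketch only; the tuple-and-dict encoding and the names `TREE` and `classify` are choices made for the example, not something taken from the slides.

```python
COLUMNS = ("age", "income", "student", "credit_rating")

# Internal node: (attribute, {attribute value: subtree}); leaf: a class label.
TREE = ("age", {
    "<=30": ("student", {"no": "no", "yes": "yes"}),
    "31..40": "yes",
    ">40": ("credit_rating", {"excellent": "no", "fair": "yes"}),
})

# Training dataset from the slides; the last field is buys_computer.
TRAINING = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def classify(node, record: dict) -> str:
    """Follow the branch matching the record's attribute value until a leaf."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[record[attribute]]
    return node


def training_errors() -> int:
    """Count training rows whose predicted label differs from the recorded one."""
    errors = 0
    for row in TRAINING:
        record = dict(zip(COLUMNS, row[:-1]))
        errors += classify(TREE, record) != row[-1]
    return errors


if __name__ == "__main__":
    print("training errors:", training_errors())
```

Each test in this tree looks at a single attribute, which is the property that makes the decision boundaries of such trees parallel to the axes.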