EECS 647: Introduction to Database Systems
Instructor: Luke Huan
Spring 2009
Administrative
• Homework 6 will be posted on the class website.
    • There is no due date.
• Final project demonstrations are scheduled for May 6th.
    • Everyone is strongly encouraged to do a demonstration.
    • If you want to give one, please send me the following information by May 5th:
        • Your team name, your name, and your partner's name
• The final project report is due on May 12th at 1:30.
5/3/2009
Luke Huan Univ. of Kansas
Summer Research Assistant Position
• I have several summer research assistant positions
• Focusing on developing and applying data mining techniques to biological data
• Hands-on experience in interdisciplinary research
• Interactions with biologists and chemists for drug development!
• KU has a $20M NIH center for exploring the interface of chemistry with biology
• A good starting point for graduate study at KU
• If interested, send me an email to schedule an appointment
Illustrating Classification Task

Training Data -> Classification Algorithms -> Classifier (Model)

Training Data:
NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Classifier (Model):
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
Apply Model to Data

Testing Data -> Classifier
Unseen Data -> Classifier

Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4)
Tenured?
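As a minimal sketch, the rule classifier from the previous slide can be applied to the testing data and to the unseen record for Jeff. The function name `predict_tenured` and the exact-match comparison on the rank string are assumptions made for illustration; the records themselves are the ones listed on the slide.

```python
def predict_tenured(rank, years):
    """Rule from the slide: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    # Assumption: the rank must match "Professor" exactly, so "Assistant Prof"
    # and "Associate Prof" do not satisfy the first condition.
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, actual tenured label)
TESTING_DATA = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

correct = sum(
    predict_tenured(rank, years) == actual
    for _, rank, years, actual in TESTING_DATA
)
accuracy = correct / len(TESTING_DATA)
print(f"accuracy on testing data: {accuracy:.2f}")

# The unseen record (Jeff, Professor, 4)
print("Jeff ->", predict_tenured("Professor", 4))
```

The rule misclassifies Merlisa (7 years, but not tenured), so the testing data gives an estimate of how well the model generalizes rather than a perfect score.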
Classification: Application 1
• Direct Marketing
    • Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
    • Approach:
        • Use the data for a similar product introduced before.
        • We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
        • Collect various demographic, lifestyle, and company-interaction related information about all such customers.
            • Type of business, where they stay, how much they earn, etc.
        • Use this information as input attributes to learn a classifier model.
From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application 2
• Fraud Detection
    • Goal: Predict fraudulent cases in credit card transactions.
    • Approach:
        • Use credit card transactions and the information on the account holder as attributes.
            • When a customer buys, what they buy, how often they pay on time, etc.
        • Label past transactions as fraud or fair transactions. This forms the class attribute.
        • Learn a model for the class of the transactions.
        • Use this model to detect fraud by observing credit card transactions on an account.
More Examples of Classification
• Predicting tumor cells as benign or malignant
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Decision Tree

Training Dataset:
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Output: A Decision Tree for “buys_computer”

age?
    <=30   -> student?
                  no  -> no
                  yes -> yes
    31…40  -> yes
    >40    -> credit rating?
                  excellent -> no
                  fair      -> yes
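The tree for “buys_computer” can be written as nested tests and checked against the 14 training records. This is a sketch for illustration: the function name `predict` and the ASCII spelling `"31..40"` for the middle age range are assumptions, while the records and the tree structure come from the slides.

```python
# Training records from the slides:
# (age, income, student, credit_rating, buys_computer)
RECORDS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def predict(age, income, student, credit_rating):
    """Walk the tree: test age at the root, then student or credit_rating."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    # Remaining branch: age > 40. Income is never tested by this tree.
    return "yes" if credit_rating == "fair" else "no"

errors = [r for r in RECORDS if predict(*r[:4]) != r[4]]
print(f"{len(RECORDS) - len(errors)} of {len(RECORDS)} training records classified correctly")
```

Each record follows exactly one root-to-leaf path, which is why classifying with a built tree is fast: the cost is bounded by the tree depth, not by the size of the training set.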
Tree Induction
• Greedy strategy.
    • Split the records based on an attribute test that optimizes a certain criterion.
• Issues
    • Determine how to split the records
        • How to specify the attribute test condition?
        • How to determine the best split?
    • Determine when to stop splitting
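To make the greedy step concrete, the sketch below scores each attribute of the buys_computer training set and picks the best one for the root. The slides do not name the criterion, so information gain (entropy reduction) is used here as an assumption; other criteria such as the Gini index would follow the same pattern. The helper names are hypothetical.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["age", "income", "student", "credit_rating"]

# buys_computer training records; the last field is the class label
RECORDS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(records, attr_index):
    """Entropy of the parent minus the weighted entropy of the partitions."""
    labels = [r[-1] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[attr_index], []).append(r[-1])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

gains = {name: information_gain(RECORDS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name:14s} gain = {gain:.3f}")
print("greedy choice for the root:", best)
```

Under this criterion the root test is on age, which agrees with the tree shown for “buys_computer”. A full induction procedure would repeat the same selection on each partition until a stopping condition holds, for example when all records in a partition share one class.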
5/3/2009
Luke Huan Univ. of Kansas
11
Splitting Based on Continuous Attributes
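One common way to split on a continuous attribute is a binary test of the form `value <= t`, where the candidate thresholds `t` are midpoints between consecutive distinct values. The sketch below assumes that approach and scores candidates with the weighted Gini index; neither choice is stated on the slide. The data are the YEARS and TENURED columns from the tenure training table earlier in the lecture.

```python
# (years, tenured) pairs from the tenure training data
SAMPLES = [(3, "no"), (7, "yes"), (2, "yes"), (7, "yes"), (6, "no"), (3, "no")]

def gini(labels):
    """Gini index of a list of class labels (0.0 for an empty or pure list)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_threshold(samples):
    """Return (threshold, weighted Gini) of the best binary split value <= t."""
    values = sorted({v for v, _ in samples})
    # Candidate thresholds: midpoints between consecutive distinct values
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in candidates:
        left = [label for v, label in samples if v <= t]
        right = [label for v, label in samples if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(samples)
        if best is None or score < best[1]:
            best = (t, score)
    return best

threshold, score = best_threshold(SAMPLES)
print(f"best test: years <= {threshold} (weighted Gini {score:.3f})")
```

On these six records the chosen threshold falls between 6 and 7 years, which lines up with the `years > 6` condition in the tenure rule.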
How to Determine the Best Split
Before splitting: 10 records of class 0, 10 records of class 1
Which test condition is the best?
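A test condition is usually judged by how pure the resulting partitions are. The sketch below uses the Gini index as the impurity measure, which is an assumption since the slide does not name one. The parent counts (10 and 10) are from the slide; the two candidate splits are hypothetical and only illustrate the comparison.

```python
def gini(counts):
    """Gini index of a node given its per-class record counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):
    """Impurity after a split: child Gini values weighted by child size."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini(child) for child in children)

parent = (10, 10)  # from the slide: 10 records of class 0, 10 of class 1

# Hypothetical candidate splits (NOT from the slide).
# Each child is (records of class 0, records of class 1).
candidates = {
    "split_a": [(6, 4), (4, 6)],
    "split_b": [(9, 1), (1, 9)],
}

print(f"parent Gini = {gini(parent):.2f}")
for name, children in candidates.items():
    print(f"{name}: weighted Gini = {weighted_gini(children):.2f}")

best = min(candidates, key=lambda name: weighted_gini(candidates[name]))
print("best split:", best)
```

The split whose children are closest to single-class nodes has the lowest weighted impurity, so it gives the largest reduction relative to the parent.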
Decision Tree Based Classification
• Advantages:
    • Inexpensive to construct
    • Extremely fast at classifying unknown records
    • Easy to interpret for small-sized trees
    • Accuracy is comparable to other classification techniques for many simple data sets
Overfitting due to Noise
The decision boundary is distorted by a noise point
Decision Boundary
• The border line between two neighboring regions of different classes is known as the decision boundary
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time
Oblique Decision Trees
[Figure: an oblique split on the test x + y < 1, with Class = + on one side and the other class on the other side]
• The test condition may involve multiple attributes
• More expressive representation
• Finding the optimal test condition is computationally expensive
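The difference between the two kinds of test can be sketched with the slide's condition x + y < 1. Which side of the boundary is labeled `+`, the label `"other"` for the second class, the 0.5 threshold, and the sample points are all assumptions for illustration.

```python
def oblique_test(x, y):
    """Single oblique node: one test on a combination of both attributes."""
    # Assumption: points with x + y < 1 belong to class "+".
    return "+" if x + y < 1 else "other"

def axis_parallel_stump(x, y, threshold=0.5):
    """Single axis-parallel node: tests only x (hypothetical threshold)."""
    return "+" if x < threshold else "other"

# Hypothetical sample points
points = [(0.2, 0.3), (0.6, 0.1), (0.4, 0.9), (0.9, 0.9)]
for x, y in points:
    print((x, y), "oblique:", oblique_test(x, y), "axis-parallel:", axis_parallel_stump(x, y))
```

A single axis-parallel test disagrees with the oblique boundary on points such as (0.6, 0.1) and (0.4, 0.9); an axis-parallel tree can only approximate a diagonal boundary with a staircase of many single-attribute tests.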