Classification
Part I
Applications of Classification

 Predicting tumor cells as benign or malignant
 Classifying credit card transactions as legitimate or fraudulent
 Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
 Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification: Definition

 Given a collection of records (training set)
   – Each record contains a set of attributes; one of the attributes is the class.
 Find a model for the class attribute as a function of the values of the other attributes.
 Goal: previously unseen records should be assigned a class as accurately as possible.
   – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
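The train-and-validate workflow just described can be sketched in a few lines. This is a minimal illustration assuming scikit-learn; the toy records and the numeric attribute encoding are my own, loosely following the Refund / Marital Status / Taxable Income table used later in these slides.

# Minimal sketch of the train/test workflow described above (assumes scikit-learn).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy records: (Refund, Marital Status code, Taxable Income in K) and a class label.
X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
     [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

# Divide the given data set into training and test sets ...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# ... build the model on the training set ...
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# ... and use the test set to estimate how well unseen records are classified.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))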
Illustrating Classification Task

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

The training set is fed to a learning algorithm, which induces (learns) a model.

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

The learned model is then applied to the test set (deduction) to assign a class to each record.
Classification Techniques

 Decision Tree based Methods
 Naïve Bayes and Bayesian Belief Networks
 Rule-based Methods
 Memory based reasoning
 Neural Networks
 Support Vector Machines
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

  Refund?
    Yes -> NO
    No  -> MarSt?
             Married          -> NO
             Single, Divorced -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES
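For readability, the tree above can also be written as nested if/else tests. This is a minimal sketch; the function name and the handling of an income of exactly 80K are my own choices, since the slide only shows the three splits.

def classify_cheat(refund, marital_status, taxable_income_k):
    # Refund? Yes -> NO
    if refund == "Yes":
        return "No"
    # MarSt? Married -> NO
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income
    return "No" if taxable_income_k < 80 else "Yes"

# Record 5 (Refund = No, Divorced, 95K) ends up in the "> 80K" leaf:
print(classify_cheat("No", "Divorced", 95))   # -> "Yes"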
Another Example of Decision Tree

Same training data as before.

Model: Decision Tree

  MarSt?
    Married          -> NO
    Single, Divorced -> Refund?
                          Yes -> NO
                          No  -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES

There could be more than one tree that fits the same data!
Apply Model to Test Data

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branch that matches the test record at each node: Refund = No sends the record to the MarSt node.
MarSt = Married then leads to the leaf NO: assign Cheat to “No”.
Pigeon Problem 1

[Figure: example objects, each a pair of bars (left bar, right bar), labelled as examples of class A or examples of class B, together with two unlabelled objects.]

What class is this object? What about this one, A or B?

This is a B! Here is the rule: if the left bar is smaller than the right bar, it is an A; otherwise it is a B.
Pigeon Problem 2

[Figure: examples of class A and class B, together with two unlabelled objects.]

Oh! This one's hard! Even I know this one.

The rule is as follows: if the two bars are equal sizes, it is an A; otherwise it is a B. So this one is an A.
Pigeon Problem 3

[Figure: examples of class A and class B, together with an unlabelled object.]

This one is really hard! What is this, A or B?

It is a B! The rule is as follows: if the square of the sum of the two bars is less than or equal to 100, it is an A. Otherwise it is a B.
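The three rules stated for Pigeon Problems 1-3 are simple enough to write down directly. A small sketch (the function names are mine):

def pigeon_rule_1(left, right):
    # Problem 1: left bar smaller than right bar -> A, otherwise B.
    return "A" if left < right else "B"

def pigeon_rule_2(left, right):
    # Problem 2: the two bars are equal sizes -> A, otherwise B.
    return "A" if left == right else "B"

def pigeon_rule_3(left, right):
    # Problem 3: square of the sum of the two bars <= 100 -> A, otherwise B.
    return "A" if (left + right) ** 2 <= 100 else "B"

print(pigeon_rule_1(3, 4))   # left < right            -> "A"
print(pigeon_rule_2(5, 5))   # equal bars              -> "A"
print(pigeon_rule_3(6, 6))   # (6 + 6)**2 = 144 > 100  -> "B"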
Pigeon Problem 4

[Figure: the class A and class B examples plotted as points, with Left Bar on the vertical axis and Right Bar on the horizontal axis, both running from 1 to 10.]

Here is the rule again. If the left bar is smaller than the right bar, it is an A, otherwise it is a B.
Pigeon Problem 5

[Figure: the class A and class B examples plotted in the Left Bar / Right Bar space, both axes running from 1 to 10.]

Let me look it up… here it is: the rule is, if the two bars are equal sizes, it is an A. Otherwise it is a B.
Pigeon Problem 6

[Figure: the class A and class B examples plotted in the Left Bar / Right Bar space, both axes running from 10 to 100.]

The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A. Otherwise it is a B.
Supervised vs. Unsupervised Learning

 Supervised learning (classification)
   – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
   – New data are classified based on the training set.
 Unsupervised learning (clustering)
   – The class labels of the training data are unknown.
   – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
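The contrast can be made concrete with a short sketch, assuming scikit-learn: a supervised learner is given class labels at training time, while an unsupervised learner only sees the observations.

from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X = [[1.0, 2.0], [1.2, 1.9], [8.0, 9.0], [8.2, 9.1]]   # observations / measurements
y = ["A", "A", "B", "B"]                               # class labels

# Supervised: the labels accompany the training data.
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[1.1, 2.1]]))                       # e.g. ['A']

# Unsupervised: no labels; the algorithm looks for clusters in the data.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                                      # e.g. [0 0 1 1]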
Issues Regarding Classification and Prediction (1): Data Preparation

 Data cleaning
   – Preprocess data in order to reduce noise and handle missing values
 Relevance analysis (feature selection)
   – Remove irrelevant or redundant attributes
 Data transformation
   – Generalize and/or normalize data
Issues Regarding Classification and Prediction (2): Evaluating Classification Methods

 Predictive accuracy
 Speed and scalability
   – time to construct the model
   – time to use the model
 Robustness
   – handling noise and missing values
 Scalability
   – efficiency in disk-resident databases
 Interpretability
   – understanding and insight provided by the model
 Goodness of rules
   – decision tree size
   – compactness of classification rules
Training Dataset

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Output: A Decision Tree for “buys_computer”

  age?
    <=30  -> student?
               no  -> no
               yes -> yes
    31…40 -> yes
    >40   -> credit_rating?
               excellent -> no
               fair      -> yes
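Written out as nested tests, the tree above is just the following (the function name is mine; the branches follow the slide):

def buys_computer(age, student, credit_rating):
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31…40":
        return "yes"
    # age > 40
    return "yes" if credit_rating == "fair" else "no"

print(buys_computer("<=30", "yes", "fair"))     # -> "yes"
print(buys_computer(">40", "no", "excellent"))  # -> "no"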
Algorithm for Decision Tree Induction

 Basic algorithm (a greedy algorithm)
    Tree is constructed in a top-down, recursive, divide-and-conquer manner
    At the start, all the training examples are at the root
    Attributes are categorical (if continuous-valued, they are discretized in advance)
    Examples are partitioned recursively based on selected attributes
    Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
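A minimal sketch of this greedy, top-down, recursive procedure, using information gain as the selection measure. The helper names and the dict-based tree representation are my own; the slides specify only the overall strategy.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(records, labels, attr):
    # Entropy before the split minus the expected entropy after partitioning on attr.
    expected = 0.0
    for value in set(r[attr] for r in records):
        subset = [lab for r, lab in zip(records, labels) if r[attr] == value]
        expected += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - expected

def build_tree(records, labels, attributes):
    if len(set(labels)) == 1:                      # all samples in the same class
        return labels[0]
    if not attributes:                             # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(records, labels, a))
    tree = {best: {}}
    for value in set(r[best] for r in records):    # partition recursively on the best attribute
        idx = [i for i, r in enumerate(records) if r[best] == value]
        tree[best][value] = build_tree([records[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attributes if a != best])
    return tree

records = [{"refund": "Yes", "marital": "Single"},
           {"refund": "No",  "marital": "Married"},
           {"refund": "No",  "marital": "Single"}]
labels = ["No", "No", "Yes"]
print(build_tree(records, labels, ["refund", "marital"]))
# e.g. {'refund': {'Yes': 'No', 'No': {'marital': {'Married': 'No', 'Single': 'Yes'}}}}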
Partitioning the training examples on age:

age <=30:
income  student  credit_rating  class
high    no       fair           no
high    no       excellent      no
medium  no       fair           no
low     yes      fair           yes
medium  yes      excellent      yes

age 31…40:
income  student  credit_rating  class
high    no       fair           yes
low     yes      excellent      yes
medium  no       excellent      yes
high    yes      fair           yes

age >40:
income  student  credit_rating  class
medium  no       fair           yes
low     yes      fair           yes
low     yes      excellent      no
medium  yes      fair           yes
medium  no       excellent      no
Algorithm for Decision Tree Induction (cont’d)

 Conditions for stopping partitioning
    All samples for a given node belong to the same class
    There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
    There are no samples left
Notes on Decision Trees

 Training Data
 Choosing Splitting Attributes
 Ordering of Splitting Attributes
 Splits
 Tree Structure
 Stopping Criteria
 Pruning
Grasshoppers vs. Katydids

[Figure: scatter plot of insects with Abdomen Length on the horizontal axis and Antenna Length on the vertical axis (both 1-10); the Grasshoppers and Katydids form two separate groups.]
A previously unseen instance = (Insect ID 11, Abdomen Length 5.1, Antenna Length 7.0, class ???)

• We can “project” the previously unseen instance into the same space as the database.
• We have now abstracted away the details of our particular problem. It will be much easier to talk about points in space.

[Figure: the unseen instance plotted among the Grasshopper and Katydid points in the Abdomen Length / Antenna Length space.]
Simple Linear Classifier

If a previously unseen instance lies above the line, then its class is Katydid; else its class is Grasshopper.

[Figure: the same scatter plot with a straight line separating the Katydid points (above) from the Grasshopper points (below).]
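A sketch of this decision rule, assuming (purely for illustration) that the separating line is antenna length = abdomen length; the slide only shows that some straight line separates the two classes.

def classify_insect(abdomen_length, antenna_length):
    # "Above the line" -> Katydid, otherwise -> Grasshopper
    # (the line antenna_length = abdomen_length is an assumed placeholder).
    return "Katydid" if antenna_length > abdomen_length else "Grasshopper"

print(classify_insect(5.1, 7.0))   # the unseen instance lies above the assumed line -> "Katydid"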
Information/Entropy

 Given probabilities p1, p2, …, ps whose sum is 1.
 Entropy is defined as:
     H(p1, p2, …, ps) = Σ_i pi log2(1/pi) = -Σ_i pi log2(pi)
 Entropy measures the amount of randomness or surprise or uncertainty.
 Goal in classification: no surprise, i.e. entropy = 0.
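A small sketch of the definition (the function name is mine); the two-event values match the examples on the next slides.

import math

def entropy(probs):
    # H(p1, ..., ps) = sum(pi * log2(1/pi)), taking 0 * log2(1/0) as 0.
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0   (maximum uncertainty for two events)
print(entropy([0.1, 0.9]))   # ~0.469
print(entropy([0.0, 1.0]))   # 0.0   (no surprise)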
Entropy

[Figure: plots of log(1/p) and of the binary entropy H(p, 1-p) as functions of p.]
Example of Entropy

Entropy is a measure of ‘uncertainty’ in a probability distribution.

[Bar chart 1]  Probability(event 1) = 0.5, Probability(event 2) = 0.5, Entropy = 1.0
[Bar chart 2]  Probability(event 1) = 0.1, Probability(event 2) = 0.9, Entropy = 0.469
Example of Entropy (cont’d)

[Bar chart]  Probability(event 1) = 0, Probability(event 2) = 1, Entropy = 0.0

This is zero entropy, i.e. zero uncertainty about which event is the ‘true’ one.
Example of Entropy (cont’d)

Entropy can be measured for a set, e.g.:
S = {a, a, a, a, a, a, a, a, b, b, b, b, b}   (8 a’s and 5 b’s, 13 total)

entropy(S) = -(8/13) log2(8/13) - (5/13) log2(5/13) ≈ 0.96124

where the first term is for the a’s and the second for the b’s. Remember the negative sign!
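A quick check of this arithmetic:

import math

# 8 a's and 5 b's out of 13 elements:
print(-(8/13) * math.log2(8/13) - (5/13) * math.log2(5/13))   # ~0.96124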
Attribute Selection Measure: Information Gain (ID3/C4.5)

 Select the attribute with the highest information gain.
 S contains si tuples of class Ci for i = {1, …, m}.
 The information required to classify any arbitrary tuple is
     I(s1, s2, …, sm) = -Σ_{i=1}^{m} (si/s) log2(si/s)
 The entropy of attribute A with values {a1, a2, …, av} is
     E(A) = Σ_{j=1}^{v} ((s1j + … + smj)/s) I(s1j, …, smj)
 The information gained by branching on attribute A is
     Gain(A) = I(s1, s2, …, sm) - E(A)
Example (Step by Step)

[The buys_computer training data from the earlier slide: 14 examples, of which 9 have buys_computer = yes and 5 have buys_computer = no.]

Entropy = I(9, 5) = 0.940   (from 14 examples)
Splitting on age:

  age <=30:  5 tuples, 2 yes / 3 no, I(2,3) = 0.971
  age 31…40: 4 tuples, 4 yes / 0 no, I(4,0) = 0
  age >40:   5 tuples, 3 yes / 2 no, I(3,2) = 0.971

E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Gain(age) = I(9,5) - E(age) = 0.246
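A quick check of these numbers (the helper name I() is mine):

import math

def I(*counts):
    # I(s1, ..., sm) = -sum((si/s) * log2(si/s))
    s = sum(counts)
    return -sum(c / s * math.log2(c / s) for c in counts if c > 0)

E_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)
print(round(I(9, 5), 3))           # 0.94
print(round(E_age, 3))             # 0.694
print(round(I(9, 5) - E_age, 3))   # 0.247 (the slide's 0.246 comes from the rounded 0.940 - 0.694)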
Attribute Selection by Information Gain Computation

 Class P: buys_computer = “yes”
 Class N: buys_computer = “no”
 I(p, n) = I(9, 5) = 0.940
 Compute the entropy for age:

  age      pi  ni  I(pi, ni)
  <=30     2   3   0.971
  31…40    4   0   0
  >40      3   2   0.971

E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Here (5/14) I(2,3) means that “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence

Gain(age) = I(p, n) - E(age) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
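A self-contained sketch that recomputes these gains from the buys_computer table; the helper names are mine.

import math
from collections import Counter

# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]
attrs = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def I(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(idx):
    labels = [row[-1] for row in data]
    expected = sum(
        len([r for r in data if r[idx] == v]) / len(data)
        * I([r[-1] for r in data if r[idx] == v])
        for v in set(r[idx] for r in data)
    )
    return I(labels) - expected

for name, idx in attrs.items():
    print(name, round(gain(idx), 3))
# Prints roughly: age 0.247, income 0.029, student 0.152, credit_rating 0.048
# (the slide's 0.246 and 0.151 come from rounding intermediate values before subtracting).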
Advantages/Disadvantages of Decision Trees
• Advantages:
– Easy to understand
– Easy to generate rules
• Disadvantages:
– May suffer from overfitting.
– Classifies by rectangular partitioning (so does
not handle correlated features very well).
– Can be quite large – pruning is necessary.
– Does not handle streaming data easily
Overfitting Problem

• With few data points, a decision tree may perfectly classify the training data.
• The model built does not generalize to future datasets.

[Figure: a tree that splits on whether a person wears green, with Yes -> Female and No -> Male. Wears green -> female??]
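A small demonstration of the problem, assuming scikit-learn: on a small, noisy sample an unpruned tree typically fits the training data perfectly yet scores much lower on held-out data.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A small, noisy toy problem (flip_y adds label noise).
X, y = make_classification(n_samples=30, n_features=5, flip_y=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)   # grown without pruning
print("Training accuracy:", tree.score(X_tr, y_tr))   # typically 1.0
print("Test accuracy:    ", tree.score(X_te, y_te))   # typically noticeably lower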