Chapter 6. Classification and Prediction

• What is classification? What is prediction?
• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian classification
• Classification by backpropagation
• Classification based on concepts from association rule mining
• Other classification methods
• Prediction
• Classification accuracy
• Summary
Classification by Decision Tree Induction



• Decision tree
  – A flow-chart-like tree structure
  – Internal nodes denote a test on an attribute
  – Branches represent outcomes of the test
  – Leaf nodes represent class labels or class distributions
• Decision tree generation consists of two phases
  – Tree construction
    • At the start, all the training examples are at the root
    • Partition the examples recursively based on selected attributes
  – Tree pruning
    • Identify and remove branches that reflect noise or outliers
• Use of a decision tree: classifying an unknown sample
  – Test the attribute values of the sample against the decision tree
Terminology
A Simple Decision Tree Example: Predicting Responses to a Credit Card Marketing Campaign

    Income?
    ├─ Low  → Debts?
    │         ├─ Low  → Non-Responder
    │         └─ High → Responder
    └─ High → Gender?
              ├─ Male   → Children?
              │           ├─ Many → Responder
              │           └─ Few  → Non-Responder
              └─ Female → Non-Responder

Root node: Income. The internal nodes (Income, Debts, Gender, Children) each test an attribute; each branch (arc) corresponds to one outcome of the test; the leaf nodes carry the class labels (Responder / Non-Responder).

This decision tree says that people with low income and high debts, and high-income males with many children, are likely responders.
Decision Tree (Formally)
Given:
• D = {t1, …, tn} where ti = <ti1, …, tih>
• A database schema containing the attributes {A1, A2, …, Ah}
• Classes C = {C1, …, Cm}
A decision (or classification) tree is a tree associated with D such that:
• each internal node is labeled with an attribute Ai,
• each arc is labeled with a predicate that can be applied to the attribute at its parent, and
• each leaf node is labeled with a class Cj.
Using a Decision Tree
Classifying a New Example for our Credit Card Marketing Campaign
Feed the new example into the root of the tree and follow the relevant path towards a leaf, based on the attribute values of the example (using the credit-card tree from the previous slide).

Assume Sarah has high income. Starting at the root (Income = High) and following the branches that match her remaining attribute values leads to a Non-Responder leaf: the tree predicts that Sarah will not respond to our campaign.
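A short sketch of this lookup in Python (not from the slides): the nested-dict encoding of the credit-card tree and the attribute names are illustrative assumptions, but they mirror the tree drawn above.

    # Hypothetical encoding: internal node = ("attribute", {value: subtree}), leaf = class label.
    credit_tree = ("Income", {
        "Low":  ("Debts", {"Low": "Non-Responder", "High": "Responder"}),
        "High": ("Gender", {
            "Male":   ("Children", {"Many": "Responder", "Few": "Non-Responder"}),
            "Female": "Non-Responder",
        }),
    })

    def classify(node, example):
        """Follow the branch matching the example's value for each tested attribute."""
        while isinstance(node, tuple):           # still at an internal node
            attribute, branches = node
            node = branches[example[attribute]]  # take the matching arc
        return node                              # leaf = predicted class

    # A high-income female example is routed High -> Female -> Non-Responder.
    print(classify(credit_tree, {"Income": "High", "Gender": "Female"}))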
Algorithms for Decision Trees
• ID3, ID4, ID5, C4.0, C4.5, C5.0, ACLS, and ASSISTANT:
  use information gain or gain ratio as the splitting criterion.
• CART (Classification And Regression Trees):
  uses the Gini diversity index or the Twoing criterion as the measure of impurity when deciding on a split.
• CHAID:
  a statistical approach that uses the Chi-squared test (of correlation/association, dealt with in an earlier lecture) when deciding on the best split. Other statistical approaches, such as those by Goodman and Kruskal, and Zhou and Dillon, use the Asymmetrical Tau or Symmetrical Tau to choose the best discriminator.
• Hunt's Concept Learning System (CLS), and MINIMAX:
  minimize the cost of classifying examples correctly or incorrectly.
(See Sestito & Dillon, Chapter 3; Witten & Frank, Section 4.3; or Dunham, Section 4.4, if you're interested in learning more.)
(If interested see also: http://www.kdnuggets.com/software/classification-tree-rules.html)
Algorithm for Decision Tree Induction


• Basic algorithm (a greedy algorithm); see the sketch below
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At the start, all the training examples are at the root
  – Attributes are categorical (continuous-valued attributes are discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping the partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning (majority voting is employed to label the leaf)
  – There are no samples left
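A minimal Python sketch of this greedy, recursive procedure, assuming categorical attributes and a caller-supplied scoring function (e.g., information gain); the function and variable names are illustrative, not taken from any particular system.

    from collections import Counter

    def build_tree(examples, attributes, score):
        """examples: list of (features_dict, label); score(examples, attr) ranks candidate splits."""
        labels = [label for _, label in examples]
        if not examples:                                  # no samples left
            return None
        if len(set(labels)) == 1:                         # all samples in one class
            return labels[0]
        if not attributes:                                # no attributes left: majority vote
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: score(examples, a))
        branches = {}
        for value in {feats[best] for feats, _ in examples}:
            subset = [(f, l) for f, l in examples if f[best] == value]
            branches[value] = build_tree(subset, [a for a in attributes if a != best], score)
        return (best, branches)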
DT Induction
(Figure: illustration of decision-tree induction.)
DT Issues
• How to choose splitting attributes: the performance of the decision tree depends heavily on the choice of the splitting attributes. This includes
  – choosing the attributes that will be used as splitting attributes, and
  – ordering the splitting attributes.
• Number of splits: the number of branches at each internal node.
• Tree structure: in many cases, a balanced tree is desirable.
• Stopping criteria: when to stop splitting? There is an accuracy vs. performance trade-off; stopping earlier also helps prevent overfitting.
• Training data: too small a training set yields an inaccurate tree; too large a set can lead to overfitting.
• Pruning: to improve performance during classification.
Training Set
Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N

Which attribute to select?
Height Example Data
Name       Gender  Height  Output1  Output2
Kristina   F       1.6m    Short    Medium
Jim        M       2m      Tall     Medium
Maggie     F       1.9m    Medium   Tall
Martha     F       1.88m   Medium   Tall
Stephanie  F       1.7m    Short    Medium
Bob        M       1.85m   Medium   Medium
Kathy      F       1.6m    Short    Medium
Dave       M       1.7m    Short    Medium
Worth      M       2.2m    Tall     Tall
Steven     M       2.1m    Tall     Tall
Debbie     F       1.8m    Medium   Medium
Todd       M       1.95m   Medium   Medium
Kim        F       1.9m    Medium   Tall
Amy        F       1.8m    Medium   Medium
Wynette    F       1.75m   Medium   Medium
Comparing DTs

• Tree structure: in many cases, a balanced tree is desirable.
(Figure: two decision trees for the same data, one balanced and one deep.)
Building a compact tree



• The key to building a decision tree is which attribute to choose in order to branch (choosing the splitting attributes).
• The heuristic is to choose the attribute with the maximum information gain, based on information theory.
• Decision tree induction is therefore often based on information theory.
Information theory


• Information theory provides a mathematical basis for measuring information content.
• To understand the notion of information, think of it as providing the answer to a question, for example, whether a coin will come up heads.
  – If one already has a good guess about the answer, then the actual answer is less informative.
  – If one already knows that the coin is unbalanced, so that it will come up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin.
Information theory (cont …)


• For a fair (honest) coin, you have no information, and you are willing to pay more (say, in dollars) for advance information: the less you know, the more valuable the information.
• Information theory uses this same intuition, but instead of measuring the value of information in dollars, it measures information content in bits. One bit of information is enough to answer a yes/no question about which one has no idea, such as the flip of a fair coin.
Information theory

• In general, if the possible answers v_i have probabilities P(v_i), then the information content I (entropy) of the actual answer is given by

  I = -\sum_{i=1}^{n} P(v_i) \log_2 P(v_i)

• For example, for the tossing of a fair coin we get

  I(\text{coin toss}) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit}

• If the coin is loaded to give 99% heads, we get I ≈ 0.08 bits, and as the probability of heads goes to 1, the information content of the actual answer goes to 0.
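A one-function Python sketch (not from the slides) that evaluates this formula and reproduces the two figures above; the function name is illustrative.

    import math

    def entropy(probs):
        """I = -sum p*log2(p), skipping zero-probability outcomes."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))    # fair coin   -> 1.0 bit
    print(entropy([0.99, 0.01]))  # loaded coin -> about 0.08 bits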
Binary entropy function
If X can assume the values 0 and 1, the entropy of X is defined as H(X) = -Pr(X=0) log2 Pr(X=0) - Pr(X=1) log2 Pr(X=1). It has the value 0 if Pr(X=0)=1 or Pr(X=1)=1, and it reaches its maximum when Pr(X=0)=Pr(X=1)=1/2 (the entropy is then 1).
When p = 1/2, the uncertainty is at a maximum; if one were to place a fair bet on the outcome in this case, there is no advantage to be gained from prior knowledge of the probabilities. In this case, the entropy is at its maximum value of 1 bit.
Attribute Selection Measure: Information Gain
(ID3/C4.5)



• Select the attribute with the highest information gain.
• Suppose S contains s_i tuples of class C_i, for i = 1, …, m.
• The expected information needed to classify an arbitrary tuple is

  I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s}\log_2\frac{s_i}{s}

• The entropy of attribute A with values {a_1, a_2, …, a_v} is

  E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s}\, I(s_{1j}, \ldots, s_{mj})

• The information gained by branching on attribute A is

  Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)
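A hedged Python sketch of these three quantities for categorical attributes; the data layout (a list of (features_dict, label) pairs) and the function names are assumptions made for illustration.

    import math
    from collections import Counter

    def info(labels):
        """I(s1,...,sm): entropy of a list of class labels."""
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def expected_info(examples, attribute):
        """E(A): entropy after partitioning the examples on `attribute`."""
        parts = {}
        for feats, label in examples:
            parts.setdefault(feats[attribute], []).append(label)
        return sum(len(p) / len(examples) * info(p) for p in parts.values())

    def gain(examples, attribute):
        """Gain(A) = I(...) - E(A)."""
        return info([label for _, label in examples]) - expected_info(examples, attribute)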
Building a Tree : Choosing a Split Example
Example: applicants who defaulted on loans (i.e., had payment difficulties):

ApplicantID  City    Children  Income  Status
1            Philly  Many      Medium  DEFAULTS
2            Philly  Many      Low     DEFAULTS
3            Philly  Few       Medium  PAYS
4            Philly  Few       High    PAYS

Try a split on the Children attribute: the Many branch receives the two DEFAULTS applicants, and the Few branch receives the two PAYS applicants.

Try a split on the Income attribute: the Low branch receives one DEFAULTS applicant, the Medium branch receives one DEFAULTS and one PAYS applicant, and the High branch receives one PAYS applicant.

Notice how the split on the Children attribute gives purer partitions. It is therefore chosen as the first (and in this case only) split.
Building a Tree
Choosing a Split: Worked Example (Entropy)

Split on the Children attribute. The Few subnode contains the two PAYS examples; the Many subnode contains the two DEFAULTS examples. (In each formula the two terms correspond to the probabilities of the two classes at that node.)

(1) Entropy of each subnode:

  \text{Entropy of subnode 1 (Few)} = -\tfrac{0}{2}\log_2\tfrac{0}{2} - \tfrac{2}{2}\log_2\tfrac{2}{2} = 0

  \text{Entropy of subnode 2 (Many)} = -\tfrac{2}{2}\log_2\tfrac{2}{2} - \tfrac{0}{2}\log_2\tfrac{0}{2} = 0

(2) Weighted entropy of this split option (each subnode's entropy weighted by the proportion of examples it contains):

  \text{Weighted entropy of split on Children} = \tfrac{2}{4}\cdot 0 + \tfrac{2}{4}\cdot 0 = 0
Building a Tree
Choosing a Split: Worked Example (Entropy)

Split on the Income attribute. The Low subnode contains one DEFAULTS example, the Medium subnode one DEFAULTS and one PAYS example, and the High subnode one PAYS example.

(1) Entropy of each subnode:

  \text{Entropy of subnode 1 (Low)} = -\tfrac{0}{1}\log_2\tfrac{0}{1} - \tfrac{1}{1}\log_2\tfrac{1}{1} = 0

  \text{Entropy of subnode 2 (Medium)} = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1

  \text{Entropy of subnode 3 (High)} = -\tfrac{1}{1}\log_2\tfrac{1}{1} - \tfrac{0}{1}\log_2\tfrac{0}{1} = 0

(2) Weighted entropy of this split option (each subnode's entropy weighted by its proportion of the examples):

  \text{Weighted entropy of split on Income} = \tfrac{1}{4}\cdot 0 + \tfrac{2}{4}\cdot 1 + \tfrac{1}{4}\cdot 0 = 0.5
Building a Tree
Choosing a Split: Worked Example

(3) Information gain of the split options:

  \text{Entropy of parent node} = -\tfrac{2}{4}\log_2\tfrac{2}{4} - \tfrac{2}{4}\log_2\tfrac{2}{4} = 1

• Split on Children: weighted entropy of this split option = 0 (calculated earlier), so
  Information gain = 1 - 0 = 1
• Split on Income: weighted entropy of this split option = 0.5 (calculated earlier), so
  Information gain = 1 - 0.5 = 0.5

Notice how the split on the Children attribute gives the higher information gain (purer partitions) and is therefore the preferred split.
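These numbers can be reproduced with a few lines of Python; the code below is self-contained for the four-applicant example (the attribute names follow the table above, everything else is illustrative).

    import math
    from collections import Counter

    def info(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def gain(examples, attribute):
        parts = {}
        for feats, label in examples:
            parts.setdefault(feats[attribute], []).append(label)
        e = sum(len(p) / len(examples) * info(p) for p in parts.values())
        return info([label for _, label in examples]) - e

    applicants = [
        ({"Children": "Many", "Income": "Medium"}, "DEFAULTS"),
        ({"Children": "Many", "Income": "Low"},    "DEFAULTS"),
        ({"Children": "Few",  "Income": "Medium"}, "PAYS"),
        ({"Children": "Few",  "Income": "High"},   "PAYS"),
    ]
    print(gain(applicants, "Children"))  # 1.0
    print(gain(applicants, "Income"))    # 0.5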
Building a decision tree: an example training dataset (ID3)

age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no
Output: A Decision Tree for “buys_computer”
    age?
    ├─ <=30  → student?
    │          ├─ no  → no
    │          └─ yes → yes
    ├─ 31…40 → yes
    └─ >40   → credit_rating?
               ├─ excellent → no
               └─ fair      → yes
Attribute Selection by Information Gain
Computation (ID3)

• Class P: buys_computer = "yes"
• Class N: buys_computer = "no"
• I(p, n) = I(9, 5) = 0.940
Attribute Selection by Information Gain Computation
• Class P: buys_computer = "yes"
• Class N: buys_computer = "no"
• I(p, n) = I(9, 5) = 0.940
• Compute the entropy for age (using the training data shown earlier):

  age     p_i  n_i  I(p_i, n_i)
  <=30    2    3    0.971
  31…40   4    0    0
  >40     3    2    0.971

  E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

  Here \frac{5}{14} I(2,3) means that "age <= 30" covers 5 of the 14 samples, with 2 yes's and 3 no's. Hence

  Gain(age) = I(p, n) - E(age) = 0.246

• Similarly,

  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048
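The same gains can be checked directly against the 14-tuple training set; the short Python script below is self-contained (the row layout and helper names are illustrative).

    import math
    from collections import Counter

    # (age, income, student, credit_rating, buys_computer)
    rows = [
        ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
        ("31…40", "high", "no", "fair", "yes"),        (">40", "medium", "no", "fair", "yes"),
        (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
        ("31…40", "low", "yes", "excellent", "yes"),   ("<=30", "medium", "no", "fair", "no"),
        ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
        ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
        ("31…40", "high", "yes", "fair", "yes"),       (">40", "medium", "no", "excellent", "no"),
    ]

    def info(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def gain(col):
        labels = [r[-1] for r in rows]
        values = Counter(r[col] for r in rows)
        e = sum(cnt / len(rows) * info([r[-1] for r in rows if r[col] == v])
                for v, cnt in values.items())
        return info(labels) - e

    for name, col in [("age", 0), ("income", 1), ("student", 2), ("credit_rating", 3)]:
        # prints roughly: age 0.247, income 0.029, student 0.152, credit_rating 0.048
        # (the slide's 0.246 and 0.151 come from rounding intermediate values)
        print(name, round(gain(col), 3))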
Attribute Selection: Age Selected
(Figure: the training data partitioned on age, the attribute with the highest information gain; the 31…40 partition is pure, while the <=30 and >40 partitions are split further.)
Example 2: ID3 (Output1)
(Training data: the height example data shown earlier, with columns Name, Gender, Height, Output1, Output2.)
Example 2: ID3 Example (Output1)

• Starting state entropy (base-10 logarithms are used throughout this example):
  4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
• Gain using gender:
  Female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
  Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
  Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152
  Gain: 0.4384 - 0.34152 = 0.09688
• Gain using height:
  – Divide the continuous data into ranges:
    (0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)
  – Two values fall in (1.9, 2.0], with entropy (1/2)(0.301) + (1/2)(0.301) = 0.301
  – The entropies of the other ranges are zero
  – Gain: 0.4384 - (2/15)(0.301) = 0.3983
• Choose height as the first splitting attribute.
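A self-contained Python check of these figures (note the base-10 logarithms and the height ranges taken from the slide; the variable names are illustrative):

    import math
    from collections import Counter

    heights = [1.6, 2.0, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 2.2, 2.1, 1.8, 1.95, 1.9, 1.8, 1.75]
    genders = ["F", "M", "F", "F", "F", "M", "F", "M", "M", "M", "F", "M", "F", "F", "F"]
    output1 = ["Short", "Tall", "Medium", "Medium", "Short", "Medium", "Short", "Short",
               "Tall", "Tall", "Medium", "Medium", "Medium", "Medium", "Medium"]

    def info10(labels):
        n = len(labels)
        return -sum(c / n * math.log10(c / n) for c in Counter(labels).values())

    def gain(keys, labels):
        parts = {}
        for k, lab in zip(keys, labels):
            parts.setdefault(k, []).append(lab)
        return info10(labels) - sum(len(p) / len(labels) * info10(p) for p in parts.values())

    bins = [(0, 1.6), (1.6, 1.7), (1.7, 1.8), (1.8, 1.9), (1.9, 2.0), (2.0, float("inf"))]
    height_bin = [next(b for b in bins if b[0] < h <= b[1]) for h in heights]

    print(round(info10(output1), 4))            # ~0.4385 (slide rounds to 0.4384)
    print(round(gain(genders, output1), 4))     # ~0.0969 (slide: 0.09688)
    print(round(gain(height_bin, output1), 4))  # ~0.3983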
Example 3: ID3 on the Weather Data Set
(Figures: ID3 applied step by step to the weather data set, worked through over several slides.)
Attribute Selection Measures



• Information gain (ID3)
  – All attributes are assumed to be categorical
  – Can be modified for continuous-valued attributes
• Gini index (IBM IntelligentMiner)
  – All attributes are assumed to be continuous-valued
  – Assumes there exist several possible split values for each attribute
  – May need other tools, such as clustering, to get the possible split values
  – Can be modified for categorical attributes
• Information gain ratio (C4.5)
Gini Index (IBM IntelligentMiner)

• If a data set T contains examples from n classes, the gini index gini(T) is defined as

  gini(T) = 1 - \sum_{j=1}^{n} p_j^2

  where p_j is the relative frequency of class j in T.
• If the data set T is split into two subsets T1 and T2, with sizes N1 and N2 respectively, then the gini index of the split data is defined as

  gini_{split}(T) = \frac{N_1}{N}\,gini(T_1) + \frac{N_2}{N}\,gini(T_2)

• The attribute that provides the smallest gini_{split}(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
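A brief Python sketch of the two formulas above (function and variable names are illustrative):

    from collections import Counter

    def gini(labels):
        """gini(T) = 1 - sum of squared relative class frequencies."""
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    def gini_split(left, right):
        """Size-weighted gini index of a binary split, given the labels on each side."""
        n = len(left) + len(right)
        return len(left) / n * gini(left) + len(right) / n * gini(right)

    print(gini_split(["A", "A"], ["B", "B"]))  # 0.0  (pure partitions)
    print(gini_split(["A", "B"], ["A", "B"]))  # 0.5  (no improvement over the parent)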
C4.5 Algorithm (Gain Ratio) - Dunham, page 102
• An improvement over the ID3 decision tree algorithm.
  – Splitting: the gain value used in ID3 favors attributes with high cardinality. C4.5 takes the cardinality of the attribute into consideration and uses the gain ratio rather than the gain (see below).
  – Missing data: during tree construction, training records with missing values are simply ignored; only records that have attribute values are used to calculate the gain ratio. (Note: the formula for computing the gain ratio changes slightly to accommodate the missing data.) During classification, if the splitting attribute of the query is missing, we descend into every child node; the final class assigned to the query depends on the probabilities associated with the classes at the leaf nodes.
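For reference, the gain ratio is usually defined as the information gain normalized by the "split information" of the attribute (a sketch of the standard definition, not quoted from the slides):

  GainRatio(A) = \frac{Gain(A)}{SplitInfo(A)}, \qquad SplitInfo(A) = -\sum_{j=1}^{v} \frac{|S_j|}{|S|}\log_2\frac{|S_j|}{|S|}

where S_1, …, S_v are the subsets produced by splitting S on the v values of attribute A.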
C4.5 Algorithm (Gain Ratio)
• Continuous data: break the continuous attribute values into ranges.
• Pruning: two strategies, applied when the resulting increase in classification error stays within a limit:
  – Subtree replacement: a subtree is replaced by a leaf node (bottom-up).
  – Subtree raising: a subtree is replaced by its most used subtree.
• Rules: C4.5 allows classification directly via the decision trees or via rules generated from them. In addition, there are techniques to simplify complex or redundant rules.
CART Algorithm (Dunham, page 102)





• Acronym for Classification And Regression Trees.
• Binary splits only (creates a binary tree).
• Uses entropy.
• Uses a goodness measure to choose the split point s for a node t (see the formula below), where P_L and P_R are the probabilities that a tuple in the training set falls on the left or right side of the split.
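The goodness-of-split formula referred to above is commonly given (e.g., in Dunham) in essentially this form, reproduced here as a reference sketch rather than a quotation from the slide:

  \Phi(s \mid t) = 2\,P_L\,P_R \sum_{j=1}^{m} \bigl|\,P(C_j \mid t_L) - P(C_j \mid t_R)\,\bigr|

where t_L and t_R are the left and right children produced by split s, the sum runs over the m classes, and the split with the largest Φ(s | t) is chosen.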
Extracting Classification Rules from Trees






• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example (from the buys_computer tree; see the code sketch below):
  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
  IF age = "31…40" THEN buys_computer = "yes"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
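A small sketch of this path-to-rule idea in Python, reusing the nested-dict tree encoding assumed earlier (the encoding and names are illustrative):

    def extract_rules(node, conditions=()):
        """Yield one IF-THEN rule per root-to-leaf path of a tree encoded as
        ("attribute", {value: subtree_or_leaf_label, ...})."""
        if isinstance(node, tuple):                  # internal node: follow every arc
            attribute, branches = node
            for value, child in branches.items():
                yield from extract_rules(child, conditions + ((attribute, value),))
        else:                                        # leaf: emit the accumulated conjunction
            body = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
            yield f'IF {body} THEN class = "{node}"'

    buys_tree = ("age", {
        "<=30": ("student", {"no": "no", "yes": "yes"}),
        "31…40": "yes",
        ">40": ("credit_rating", {"excellent": "no", "fair": "yes"}),
    })
    for rule in extract_rules(buys_tree):
        print(rule)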
Extracting Classification Rules from Trees
for Credit Card Marketing Campaign
For each leaf in the tree, read the rule from the root to that leaf. You will arrive at a set of rules (here for the credit-card tree shown earlier):

IF Income=Low AND Debts=Low THEN Non-Responder
IF Income=Low AND Debts=High THEN Responder
IF Income=High AND Gender=Male AND Children=Many THEN Responder
IF Income=High AND Gender=Male AND Children=Few THEN Non-Responder
IF Income=High AND Gender=Female THEN Non-Responder
Avoid Overfitting in Classification


• The generated tree may overfit the training data
  – Too many branches, some of which may reflect anomalies due to noise or outliers
  – The result is poor accuracy on unseen samples
• Two approaches to avoid overfitting
  – Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    • It is difficult to choose an appropriate threshold
  – Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
    • Use a set of data different from the training data to decide which is the "best pruned tree"
Approaches to Determine the Final Tree Size

• Separate training (2/3) and testing (1/3) sets
• Use cross-validation, e.g., 10-fold cross-validation
• Use all the data for training,
  – but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
• Use the minimum description length (MDL) principle:
  – halt growth of the tree when the encoding is minimized
Enhancements to basic decision tree induction



• Allow for continuous-valued attributes
  – Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
• Handle missing attribute values
  – Assign the most common value of the attribute
  – Assign a probability to each of the possible values
• Attribute construction
  – Create new attributes based on existing ones that are sparsely represented
  – This reduces fragmentation, repetition, and replication
Classification in Large Databases



• Classification: a classical problem extensively studied by statisticians and machine learning researchers
• Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
• Why decision tree induction in data mining?
  – relatively fast learning speed (compared with other classification methods)
  – convertible to simple and easy-to-understand classification rules
  – can use SQL queries for accessing databases
  – classification accuracy comparable with other methods
Scalable Decision Tree Induction Methods in Data Mining Studies
• SLIQ (EDBT'96 — Mehta et al.)
  – builds an index for each attribute; only the class list and the current attribute list reside in memory
• SPRINT (VLDB'96 — J. Shafer et al.)
  – constructs an attribute-list data structure
• PUBLIC (VLDB'98 — Rastogi & Shim)
  – integrates tree splitting and tree pruning: stops growing the tree earlier
• RainForest (VLDB'98 — Gehrke, Ramakrishnan & Ganti)
  – separates the scalability aspects from the criteria that determine the quality of the tree
  – builds an AVC-list (attribute, value, class label)
Data Cube-Based Decision-Tree Induction



• Integration of generalization with decision-tree induction (Kamber et al., 1997)
• Classification at primitive concept levels
  – e.g., precise temperature, humidity, outlook, etc.
  – low-level concepts, scattered classes, bushy classification trees
  – semantic interpretation problems
• Cube-based multi-level classification
  – relevance analysis at multiple levels
  – information-gain analysis with dimension + level
Presentation of Classification Results
(Figure: example presentation of classification results.)