Information Retrieval and Data Mining
Summer Semester 2015
TU Kaiserslautern
Prof. Dr.-Ing. Sebastian Michel
Databases and Information Systems
Group (AG DBIS)
http://dbis.informatik.uni-kl.de/
Information Retrieval and Data Mining, SoSe 2015, S. Michel
Chapter VI: Classification
1. Motivation and Definitions
2. Decision Trees
3. Bayes Classifier
4. Support Vector Machines (only as teaser)
Tan, Steinbach & Kumar, Chapter 8
1. Classification: Example Classifier

age?
├── youth → student?
│     ├── no  → no
│     └── yes → yes
├── middle_aged → yes
└── senior → credit_rating?
      ├── fair      → no
      └── excellent → yes

A decision tree for the concept buys_computer, indicating whether a customer at
an electronics shop is likely to purchase a computer.
source: Han & Kamber
Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes, one of the attributes
is the class.
• Find a model for class attribute as a function of
the values of other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test
sets, with training set used to build the model and test set
used to validate it.
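The train/test workflow described above can be sketched in a few lines of Python. The toy records, the 70/30 split ratio, and the majority-class "model" are illustrative assumptions, not part of the slides:

```python
import random

# Toy records (attribute dict, class label); values and ratio are assumptions.
records = [({"refund": r, "income": i}, c) for r, i, c in [
    ("Yes", 125, "No"), ("No", 100, "No"), ("No", 70, "No"),
    ("Yes", 120, "No"), ("No", 95, "Yes"), ("No", 60, "No"),
    ("Yes", 220, "No"), ("No", 85, "Yes"), ("No", 75, "No"),
    ("No", 90, "Yes")]]

random.seed(0)                              # reproducible split
random.shuffle(records)
train, test = records[:7], records[7:]      # e.g. a 70/30 split

# A deliberately trivial model: always predict the training majority class.
labels = [c for _, c in train]
majority = max(set(labels), key=labels.count)

# Accuracy on the held-out test set estimates performance on unseen records.
accuracy = sum(c == majority for _, c in test) / len(test)
```

Any real classifier replaces the majority-class predictor, but the build-on-train, validate-on-test structure stays the same.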
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions
as legitimate or fraudulent
• Categorizing news stories as finance,
weather, entertainment, sports, etc
• Classifying persons into tax evaders and tax
payers.
Illustrating Classification Task

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Training Set → Learning algorithm (Induction) → Learn Model → Model
Test Set → Apply Model (Deduction) → Class

© Tan, Steinbach & Kumar
Classification model evaluation

• Much the same measures as with IR methods
– Focus on accuracy and error rate
– But also precision, recall, F-scores, …

                        Predicted class
                        Class = 1   Class = 0
Actual   Class = 1      f11         f10
class    Class = 0      f01         f00
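The evaluation measures follow directly from the four confusion-matrix counts. A minimal sketch, with made-up example counts for illustration:

```python
# fij = number of records of actual class i predicted as class j.
f11, f10, f01, f00 = 40, 10, 5, 45   # example counts, not from the slides

total = f11 + f10 + f01 + f00
accuracy   = (f11 + f00) / total     # fraction of correct predictions
error_rate = (f10 + f01) / total     # accuracy + error_rate == 1
precision  = f11 / (f11 + f01)       # treating class 1 as the "relevant" class
recall     = f11 / (f11 + f10)
f1         = 2 * precision * recall / (precision + recall)
```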
Overview Classification Techniques

• Decision-Tree-based Methods
• Rule-based Methods
• Naïve Bayes
• Support Vector Machines
• …
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
├── Yes → NO
└── No  → MarSt?
      ├── Single, Divorced → TaxInc?
      │     ├── < 80K → NO
      │     └── > 80K → YES
      └── Married → NO

© Tan, Steinbach & Kumar
2. Decision Trees

A second tree that fits the same training data, splitting on MarSt first:

MarSt?
├── Married → NO
└── Single, Divorced → Refund?
      ├── Yes → NO
      └── No  → TaxInc?
            ├── < 80K → NO
            └── > 80K → YES

There could be more than one tree that fits
the same data!
Decision Tree Classification Task

The same training and test sets as before, now with a tree induction
algorithm as the learning algorithm:

Training Set → Tree Induction algorithm (Induction) → Learn Model → Decision Tree
Test Set → Apply Model (Deduction) → Class
Apply Model to Test Data

Start from the root of the tree.

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund?
├── Yes → NO
└── No  → MarSt?
      ├── Single, Divorced → TaxInc?
      │     ├── < 80K → NO
      │     └── > 80K → YES
      └── Married → NO

Refund = No, so follow the No branch; MarSt = Married leads to the leaf NO.
Assign Cheat to “No”.

© Tan, Steinbach & Kumar
Classifying a Record with a
Decision Tree

• Given a decision tree, how do we classify a test record?
• Start at the root node, apply its test condition
to the record, and follow the appropriate
branch.
• If this leads to an internal node, again apply its test
condition and follow the matching branch.
• Repeat until a leaf node is reached; assign the class
of that leaf node to the record.
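The traversal above is a short loop once the tree is encoded as a data structure. A minimal sketch, assuming a nested-dict encoding of the Refund/MarSt/TaxInc tree (the attribute names, the combined "Single/Divorced" value, and the pre-computed "TaxInc<80K" test are simplifications for illustration):

```python
# Internal nodes test one attribute; leaves are plain class-label strings.
tree = {"attr": "Refund",
        "branches": {
            "Yes": "NO",
            "No": {"attr": "MarSt",
                   "branches": {
                       "Married": "NO",
                       "Single/Divorced": {"attr": "TaxInc<80K",
                                           "branches": {"yes": "NO",
                                                        "no": "YES"}}}}}}

def classify(node, record):
    # Follow branches until a leaf (a plain string) is reached.
    while isinstance(node, dict):
        node = node["branches"][record[node["attr"]]]
    return node

classify(tree, {"Refund": "No", "MarSt": "Married"})   # -> "NO"
```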
Decision Tree Classification Task

Training Set → Tree Induction algorithm (Induction) → Learn Model → Decision Tree
Test Set → Apply Model (Deduction) → Class
Constructing a Decision Tree

• There are exponentially many decision trees
for the training data.
• Finding the optimal tree is computationally
infeasible.
• Instead, use greedy algorithms: a series of local
split operations grows the tree. Not optimal,
but there are efficient algorithms that create
sufficiently accurate trees.
General Structure of Hunt’s
Algorithm

• Let Dt be the set of training records
that reach a node t
• General Procedure:
– If Dt contains only records that belong
to the same class yt, then t is a leaf
node labeled as yt
– If Dt is an empty set, then t is a
leaf node labeled by the default
class, yd
– If Dt contains records that belong
to more than one class, use an
attribute test to split the data
into smaller subsets. Recursively
apply the procedure to each
subset.

© Tan, Steinbach & Kumar
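The three cases of the procedure map directly onto a recursive function. A sketch under simplifying assumptions: the split attribute is chosen naively (first in a given list) rather than by an impurity measure, and records are (attribute-dict, label) pairs:

```python
from collections import Counter

def hunt(records, attributes, default):
    """Sketch of Hunt's procedure; real implementations pick the best
    split attribute using an impurity measure instead of list order."""
    if not records:                          # empty D_t -> default-class leaf
        return default
    labels = [c for _, c in records]
    if len(set(labels)) == 1:                # pure D_t -> leaf labeled y_t
        return labels[0]
    if not attributes:                       # no tests left -> majority class
        return Counter(labels).most_common(1)[0][0]
    attr, rest = attributes[0], attributes[1:]
    majority = Counter(labels).most_common(1)[0][0]
    # Split on attr and recurse on each subset of records.
    return {attr: {v: hunt([(r, c) for r, c in records if r[attr] == v],
                           rest, majority)
                   for v in {r[attr] for r, _ in records}}}
```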
Hunt’s Algorithm

Start with the most frequent class as the default class:
a single leaf node predicting “Don’t Cheat”.
Hunt’s Algorithm (2)

Split on Refund:

Refund?
├── Yes → Don’t Cheat
└── No  → Don’t Cheat
Hunt’s Algorithm (3)

Refine the Refund = No branch by splitting on Marital Status:

Refund?
├── Yes → Don’t Cheat
└── No  → Marital Status?
      ├── Single, Divorced → Cheat
      └── Married → Don’t Cheat
Hunt’s Algorithm (4)

Refine the Single/Divorced branch by splitting on Taxable Income:

Refund?
├── Yes → Don’t Cheat
└── No  → Marital Status?
      ├── Single, Divorced → Taxable Income?
      │     ├── < 80K  → Don’t Cheat
      │     └── >= 80K → Cheat
      └── Married → Don’t Cheat
Tree Induction

• Greedy strategy:
– Split the records based on an attribute test that
optimizes a certain criterion.
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
How to Specify Test Condition?

• Depends on attribute type
– Nominal
– Ordinal
– Continuous
• Depends on number of ways to split
– 2-way split
– Multi-way split
Splitting Based on Nominal
Attributes

• Multi-way split: Use as many partitions as
distinct values.

CarType? → Family | Sports | Luxury

• Binary split: Divides values into two subsets.
Need to find the optimal partitioning.

CarType? → {Sports, Luxury} | {Family}
   OR
CarType? → {Family, Luxury} | {Sports}
Splitting Based on Continuous
Attributes

• Different ways of handling continuous attributes
– Discretization to form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal-interval
bucketing, equal-frequency bucketing
(percentiles), or clustering.
– Binary decision: (A < v) or (A ≥ v)
• consider all possible splits and find the best cut
• can be more compute-intensive
Splitting Based on Continuous
Attributes

(i) Binary split:    Taxable Income > 80K? → Yes | No
(ii) Multi-way split: Taxable Income? → < 10K | [10K,25K) | [25K,50K) | [50K,80K) | > 80K
Tree Induction

• Greedy strategy:
– Split the records based on an attribute test that
optimizes a certain criterion.
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
How to determine the Best Split

Before splitting: 10 records of class 0, 10 records of class 1.

Own Car?     Yes: C0: 6, C1: 4      No: C0: 4, C1: 6
Car Type?    Family: C0: 1, C1: 3   Sports: C0: 8, C1: 0   Luxury: C0: 1, C1: 7
Student ID?  c1 … c10: C0: 1, C1: 0 each    c11 … c20: C0: 0, C1: 1 each

Which test condition is the best?
How to determine the Best Split

• Greedy approach:
– Nodes with homogeneous class distribution are
preferred
• Need a measure of node impurity:

C0: 5, C1: 5 — non-homogeneous, high degree of impurity
C0: 9, C1: 1 — homogeneous, low degree of impurity
Selecting the Best Split
• Let p(i | t) be the fraction of records
belonging to class i at node t
• Best split is selected based on the degree
of impurity of the child nodes
– p(0 | t) = 0 and p(1 | t) = 1 has high purity
– p(0 | t) = 1/2 and p(1 | t) = 1/2 has the smallest purity (highest
impurity)
• Intuition: high purity ⇒ small value of
impurity measures ⇒ better split
Example of Purity

(figure: two example class distributions, one with high impurity, one with
high purity)
Impurity Measures

Entropy(t) = – Σi p(i|t) log2 p(i|t)
Gini(t) = 1 – Σi p(i|t)²
Classification error(t) = 1 – maxi p(i|t)
Examples for Computing Entropy

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0

C1 = 1, C2 = 5:  P(C1) = 1/6,  P(C2) = 5/6
Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

C1 = 2, C2 = 4:  P(C1) = 2/6,  P(C2) = 4/6
Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
Examples for Computing GINI

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
Gini = 1 – P(C1)² – P(C2)² = 1 – 0 – 1 = 0

C1 = 1, C2 = 5:  P(C1) = 1/6,  P(C2) = 5/6
Gini = 1 – (1/6)² – (5/6)² = 0.278

C1 = 2, C2 = 4:  P(C1) = 2/6,  P(C2) = 4/6
Gini = 1 – (2/6)² – (4/6)² = 0.444
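Both impurity measures are two-line functions over the class counts at a node, so the slide values are easy to reproduce:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    ps = [c / n for c in counts if c]        # 0 * log2(0) is taken as 0
    return -sum(p * log2(p) for p in ps)

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# The example nodes from the slides:
round(entropy([0, 6]), 2)   # -> 0.0
round(entropy([1, 5]), 2)   # -> 0.65
round(entropy([2, 4]), 2)   # -> 0.92
round(gini([1, 5]), 3)      # -> 0.278
round(gini([2, 4]), 3)      # -> 0.444
```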
Comparing Conditions

• The quality of a split is the change in impurity,
called the gain of the test condition:

Δ = I(p) – Σj=1..k [N(vj) / N] · I(vj)

where
• I( ) is the impurity measure
• k is the number of attribute values
• p is the parent node, vj is the j-th child node
• N is the total number of records at the parent node
• N(vj) is the number of records associated with child node vj

• Maximizing the gain ⇔ minimizing the weighted
average impurity measure of the child nodes
• If I() = Entropy(), then Δ = Δinfo is called the information gain
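The gain formula above can be sketched directly, here with Gini as the impurity measure and the "Own Car?" split from the earlier example (a (10, 10) parent split into (6, 4) and (4, 6)):

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gain(parent, children, impurity=gini):
    # Delta = I(parent) - sum_j [N(v_j)/N] * I(v_j)
    n = sum(parent)
    return impurity(parent) - sum(sum(ch) / n * impurity(ch) for ch in children)

# Gini(parent) = 0.5, each child has Gini 0.48, so the gain is about 0.02.
delta = gain([10, 10], [[6, 4], [4, 6]])
```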
How to Find the Best Split

Before splitting, the parent node holds N00 records of class C0 and N01
records of class C1, with impurity M0.

Split A?  Yes → node N1 (C0: N10, C1: N11), impurity M1
          No  → node N2 (C0: N20, C1: N21), impurity M2
          weighted impurity of children: M12

Split B?  Yes → node N3 (C0: N30, C1: N31), impurity M3
          No  → node N4 (C0: N40, C1: N41), impurity M4
          weighted impurity of children: M34

Gain = M0 – M12  vs.  M0 – M34
Problems of Maximizing Δ

(figure: splitting into many small partitions yields higher purity)
Problems of Maximizing Δ

• Impurity measures favor attributes with a large
number of values
• A test condition with a large number of
outcomes might not be desirable
– The number of records in each partition is too small to make predictions
• Solution 1: gain ratio = Δinfo / SplitInfo, with
SplitInfo = – Σi=1..k P(vi) log2 P(vi)
where P(vi) is the fraction of records at child vi and k is the total number of splits
• Solution 2: restrict the splits to binary
Tree Induction

• Greedy strategy:
– Split the records based on an attribute test that
optimizes a certain criterion.
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
Stopping Criteria for Tree Induction
• Stop expanding a node when all the
records belong to the same class
• Stop expanding a node when all the
records have the same or similar attribute
values. In this case the class with
“majority” wins.
Overfitting and Tree Pruning

• A common problem with decision trees is that the
tree might be too tightly tailored to the training
data (and thus possibly to noise in the data).
– Good: error on training data might be very low
– But what about previously unseen test data?
• Idea: Avoid the tree becoming too fine-grained.
• Solution 1: Stop splitting nodes early (i.e.,
preprocessing)
• Solution 2: Build the tree regularly and then prune
parts of it (i.e., postprocessing)
Example: Training Data

(figure: example of overfitting due to noisy training data; *) marks points
with the wrong class)
Example: Two Different Decision
Trees

(figure: two decision trees, M1 and M2, fit to the training data)
Example: Test Data

Let’s see how the trees M1 and M2 perform on test and training data.
M1: 0% error on training data, but 30% error on test data!
M2: 20% error on training data, but 10% error on test data!
table source: Tan, Steinbach & Kumar
3. (Naïve) Bayes Classifier

• A probabilistic framework for solving
classification problems
• Conditional probability:

P(C | A) = P(A, C) / P(A)
P(A | C) = P(A, C) / P(C)

• Bayes’ theorem:

P(C | A) = P(A | C) P(C) / P(A)
Example of Bayes’ Theorem

• Given:
– A doctor knows that meningitis causes a stiff neck 50% of the time
– The prior probability of any patient having meningitis is 1/50,000
– The prior probability of any patient having a stiff neck is 1/20
• If a patient has a stiff neck, what’s the
probability he/she has meningitis?

P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
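The arithmetic of the meningitis example is a one-line application of Bayes' theorem:

```python
# Bayes' theorem applied to the stiff-neck/meningitis example
p_s_given_m = 0.5            # P(stiff neck | meningitis)
p_m = 1 / 50_000             # prior P(meningitis)
p_s = 1 / 20                 # prior P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s    # -> 0.0002
```

Even though meningitis explains half of all stiff necks, the tiny prior keeps the posterior probability very low.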
Bayesian Classifiers

• Consider each attribute and the class label as
random variables
• Given a record with attributes (A1, A2, …, An)
– Goal is to predict class C
– Specifically, we want to find the value of C that
maximizes P(C | A1, A2, …, An)
• Can we estimate P(C | A1, A2, …, An) directly from
the data?
Bayesian Classifiers

• Approach:
– compute the posterior probability P(C | A1, A2, …, An) for all
values of C using Bayes’ theorem:

P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)

– Choose the value of C that maximizes
P(C | A1, A2, …, An)
– Equivalent to choosing the value of C that maximizes
P(A1, A2, …, An | C) P(C)
• How to estimate P(A1, A2, …, An | C)?
Naïve Bayes Classifier

• Assume independence among the attributes Ai when
the class is given:

P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)

– Can estimate P(Ai | Cj) for all Ai and Cj.
– A new point is classified as Cj if P(Cj) Πi P(Ai | Cj)
is maximal.
How to Estimate Probabilities from
Data?

• Class prior: P(C) = Nc / N
– e.g., P(No) = 7/10,
P(Yes) = 3/10
• For discrete attributes:
P(Ai | Ck) = |Aik| / Nck
– where |Aik| is the number of instances
having attribute value Ai and belonging
to class Ck, and Nck is the number of
instances of class Ck
– Examples (from the training table above):
P(Status=Married | No) = 4/7
P(Refund=Yes | Yes) = 0
How to Estimate Probabilities from
Data?

• For continuous attributes:
– Discretize the range into bins
• one ordinal attribute per bin
• violates the independence assumption
– Two-way split: (A < v) or (A > v)
• choose only one of the two splits as the new attribute
– Probability density estimation:
• Assume the attribute follows a normal distribution
• Use the data to estimate the parameters of the
distribution (e.g., mean and standard deviation)
• Once the probability distribution is known, we can use it to
estimate the conditional probability P(Ai | c)
How to Estimate Probabilities from
Data?

• Normal distribution:

P(Ai | cj) = 1 / sqrt(2π σij²) · exp( – (Ai – μij)² / (2 σij²) )

– One (μij, σij²) pair for each (Ai, cj) combination
• For (Income, Class=No):
– sample mean = 110
– sample variance = 2975
Example of Naïve Bayes Classifier

Given a test record:

X = (Refund = No, Married, Income = 120K)

Naïve Bayes classifier:

P(Refund=Yes | No) = 3/7
P(Refund=No | No) = 4/7
P(Refund=Yes | Yes) = 0
P(Refund=No | Yes) = 1
P(Marital Status=Single | No) = 2/7
P(Marital Status=Divorced | No) = 1/7
P(Marital Status=Married | No) = 4/7
P(Marital Status=Single | Yes) = 2/7
P(Marital Status=Divorced | Yes) = 1/7
P(Marital Status=Married | Yes) = 0

For taxable income:
If class = No: sample mean = 110, sample variance = 2975
If class = Yes: sample mean = 90, sample variance = 25

P(X | Class=No) = P(Refund=No | Class=No)
 × P(Married | Class=No)
 × P(Income=120K | Class=No)
 = 4/7 × 4/7 × 0.0072 = 0.0024

P(X | Class=Yes) = P(Refund=No | Class=Yes)
 × P(Married | Class=Yes)
 × P(Income=120K | Class=Yes)
 = 1 × 0 × 1.2 × 10⁻⁹ = 0

Since P(X | No) P(No) > P(X | Yes) P(Yes),
P(No | X) > P(Yes | X)  =>  Class = No
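The whole worked example can be reproduced in a few lines. The conditional probabilities below are read off the slides; hard-coding them (rather than estimating from the table) is a simplification for illustration:

```python
import math

def normal_pdf(x, mean, var):
    # Density of the normal distribution assumed for Taxable Income
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

prior        = {"No": 7/10, "Yes": 3/10}
p_refund_no  = {"No": 4/7,  "Yes": 1.0}              # P(Refund=No | class)
p_married    = {"No": 4/7,  "Yes": 0.0}              # P(Married   | class)
income_stats = {"No": (110, 2975), "Yes": (90, 25)}  # (mean, variance)

def posterior_score(cls, income=120):
    # P(X | class) * P(class), using the naive independence assumption
    mean, var = income_stats[cls]
    likelihood = p_refund_no[cls] * p_married[cls] * normal_pdf(income, mean, var)
    return likelihood * prior[cls]

# posterior_score("No") > posterior_score("Yes")  =>  predict Class = No
```

The zero factor P(Married | Yes) = 0 wipes out the "Yes" score entirely; in practice this is softened with Laplace smoothing.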
4. Support Vector Machines

Idea: Find a linear hyperplane (decision boundary) that
will separate the data.
Support Vector Machines

(figures: several possible separating hyperplanes, e.g. B1 and B2)

• Which one is better? B1 or B2?
• How do you define better?
Support Vector Machines

(figure: boundaries B1 and B2 with their margins, delimited by b11, b12 and
b21, b22)

• Find the hyperplane that maximizes the margin => B1 is better than B2
Support Vector Machines

(figure: boundary B1 with margin hyperplanes b11 and b12)

w · x + b = 0        (decision boundary)
w · x + b = +1
w · x + b = –1

f(x) = +1 if w · x + b ≥ 1
       –1 if w · x + b ≤ –1

Margin = 2 / ||w||₂
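The decision function and margin can be sketched directly from these equations. The weight vector and bias below are hypothetical stand-ins, since the slides give no concrete numbers:

```python
import math

# Hypothetical learned parameters (not from the slides)
w = [2.0, 1.0]
b = -5.0

def f(x):
    s = sum(wi * xi for wi, xi in zip(w, x)) + b   # w . x + b
    if s >= 1:
        return 1
    if s <= -1:
        return -1
    return 0          # point falls inside the margin

margin = 2 / math.sqrt(sum(wi * wi for wi in w))   # 2 / ||w||_2
```

Training an SVM means choosing w and b to maximize this margin while keeping all training points on the correct side, which is beyond this teaser.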
Summary Data Mining
• Frequent Itemset and Association Rule Mining:
– Apriori Principle and Algorithm
• Clustering:
– K-means
– Hierarchical clustering
– DBSCAN (density based clustering)
• Classification:
– Decision trees
– Naïve Bayes Classifier
– Support Vector Machines (SVMs)