Algorithms for Big Data
Classification
Dr. Jianye Hao
Associate Professor
School of Software
Review
• Decision Tree Classifier
– decision tree induction algorithm
• a greedy approach
• how to specify the attribute splitting condition?
– Binary vs multi-way splitting
• how to select the optimal splitting?
– Impurity measures: Gini index, entropy, classification error
• When to stop splitting?
– Overfitting vs underfitting
Review
• How to evaluate and compare the performance
of a classifier?
– Accuracy, precision, recall…
– Confidence interval
• the accuracy rate of a classifier
• the accuracy difference between two models (statistically
significant?)
• the accuracy difference between two algorithms (statistically
significant?)
Classical classification methods
• Bayes Classifier max(P(wi|x))
• Linear discriminant classifier
• Quadratic discriminant classifier
• Minimum Distance Classifier min(Dj(x))
• Minimum Error Rate Classifier min(ri(x))
• K-Nearest Neighbor Classifier
Joint probability &
conditional probability
• If we have two variables, X and Y, some
probabilities involve both X and Y
• Joint Probability:
p(x,y) is the probability that X takes the value x and Y takes the value y together
• Conditional Probability:
p(x|y) is the probability that X takes the value x given that Y takes the value y.
• The relationship:
p(x,y) = p(x|y)p(y)
Bayes rule
• From the formula relating joint probability with conditional
probability, we know
p(x|y) p(y) = p(x,y) = p(y|x) p(x)
So,
p(y|x) = p(x|y) p(y) / p(x)
This is the Bayes rule. Using it we can express p(y|x) in terms of p(x|y).
An example of Bayes reasoning
We know that rain implies wet ground, but wet ground does not imply rain
(the ground can also be wet because somebody poured water on it).
But usually we say it is raining when we see the ground is wet.
Why do we make this prediction?
An example of Bayes reasoning
• Bayes rule:
P(rain | wet ground) = P(wet ground | rain) × P(rain) / P(wet ground)
where P(wet ground | rain) = 1 (always true)
• In fact
– If P(wet ground)>>P(rain), for instance,
somebody frequently pours water on the ground,
we will not say it is raining when seeing the wet
ground.
– If P(wet ground) ≈ P(rain), we may conclude that
it is raining when seeing the wet ground
– This is exactly what Bayes rule tells us!
Another example of Bayes
reasoning
• Given:
– A doctor knows that meningitis causes stiff neck 50% of the
time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20
• If a patient has stiff neck, what’s the probability
he/she has meningitis?
P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
Bayes Classifiers
• Consider the attribute and the class label as random variables x
and w
• Given the attribute x
– Goal is to predict class w
– Specifically, we want to find the value of i that
maximizes P(wi|x), say k, because given x, wk is the most
probable class
• Can we estimate P(wi|x) directly from data?
Bayes Classifiers
• Approach:
– Compute the posterior probability P(wi|x) for all values of wi
using the Bayes rule
P(wi | x) = P(x | wi) P(wi) / P(x)
(posterior = likelihood × prior / evidence)
– Choose the value of i that maximizes P(wi|x)
– Equivalent to choosing the value of i that maximizes P(x|wi)P(wi),
because P(x) is a constant
• P(x|wi) and P(wi) are given or can be easily estimated from data
An Illustrative Example
• In US, 0.8% of the people have this form of cancer
• When the disease is present, the test returns a correct POS result 98% of the time
• It returns a correct NEG result 97% of the time when the disease is not present
• Suppose Anne comes to the doctor's office, runs a blood test, and the test result is POS
• What is the likelihood that Anne has this form of cancer or not?
– This looks bad for Anne. After all, the test is 98% accurate
An Illustrative Example
• In US, 0.8% of the people have this form of cancer
• When the disease is present, the test returns a correct POS result 98% of the time
• It returns a correct NEG result 97% of the time when the disease is not present
• Suppose Anne comes to the doctor's office, runs a blood test, and the test result is POS
• What is the likelihood that Anne has this form of cancer or not?
P(C | POS) = P(POS | C) P(C) / P(POS) = (0.98 × 0.008) / P(POS) = 0.0078 / P(POS)
P(NC | POS) = P(POS | NC) P(NC) / P(POS) = (0.03 × 0.992) / P(POS) = 0.0298 / P(POS)
• Since P(NC | POS) > P(C | POS), we classify Anne as not having cancer.
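A minimal sketch in plain Python of the computation above (numbers taken from the slide; the variable names are illustrative):

```python
# Minimal sketch of the cancer-test example (numbers from the slide).
p_cancer = 0.008            # prior P(C)
p_no_cancer = 1 - p_cancer  # prior P(NC) = 0.992
p_pos_given_c = 0.98        # P(POS | C), test sensitivity
p_pos_given_nc = 0.03       # P(POS | NC) = 1 - 0.97

# Unnormalized posteriors (numerators of Bayes rule)
num_c = p_pos_given_c * p_cancer        # 0.0078
num_nc = p_pos_given_nc * p_no_cancer   # 0.0298

# P(POS) is the same denominator for both, so we can compare the numerators
# directly, or normalize to get the actual posteriors.
p_pos = num_c + num_nc
print("P(C | POS) =", num_c / p_pos)    # ~0.21
print("P(NC | POS) =", num_nc / p_pos)  # ~0.79
```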
How to Estimate Probabilities from Data?

(Refund and Marital Status are categorical attributes, Taxable Income is continuous, and Evade is the class label.)

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
• Class prior: P(C) = Nc/N
– e.g., P(No) = 7/10, P(Yes) = 3/10
• For discrete attributes:
P(Ai | Ck) = |Aik| / Nc
– where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
– Examples:
P(Status=Married | No) = 4/7
P(Refund=Yes | Yes) = 0
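A small sketch in plain Python of how these count-based estimates could be computed from the 10-record table (the record encoding and helper names are illustrative assumptions):

```python
# Each record: (Refund, Marital Status, Taxable Income, Evade)
records = [
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes","Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]

def prior(cls):
    # P(C) = Nc / N
    return sum(1 for r in records if r[3] == cls) / len(records)

def cond(attr_idx, value, cls):
    # P(Ai | Ck) = |Aik| / Nc
    in_class = [r for r in records if r[3] == cls]
    return sum(1 for r in in_class if r[attr_idx] == value) / len(in_class)

print(prior("No"))               # 0.7
print(cond(1, "Married", "No"))  # 4/7
print(cond(0, "Yes", "Yes"))     # 0.0
```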
How to handle continuous
attributes?
• For continuous attributes:
– Discretize the range into bins
• one ordinal attribute per bin
– Two-way split: (A < v) or (A > v)
• choose only one of the two splits as new attribute
– Example:
• Age (<18, 18-22, 23-30,30-40,>40)
• Annual Salary (>200,000, 120,000-200,000, 60,000-120,000,
<60,000)
How to handle continuous
attributes?
• For continuous attributes:
– Probability density estimation:
• Assume attribute follows a normal distribution
• Use data to estimate parameters of distribution
– Mean: sample mean
  x̄ = (1/d) Σi xi
– Standard deviation: sample standard deviation
  sd = sqrt( Σi (xi - x̄)² / (d - 1) )
• Once probability distribution is known, can use it
to estimate the conditional probability P(Ai|c)
How to handle continuous attributes?

(Training data: the same 10-record table as above; Taxable Income is the continuous attribute and Evade is the class label.)
• Normal distribution:
  P(Ai | cj) = 1 / sqrt(2π σij²) · exp( -(Ai - μij)² / (2σij²) )
  – One distribution for each (Ai, cj) pair
• For (Income, Class=No):
  – sample mean = 110
  – sample variance = 2975 (standard deviation = 54.54)
  P(Income=120 | No) = 1 / (sqrt(2π) × 54.54) × exp( -(120 - 110)² / (2 × 2975) ) = 0.0072
Example of Naive Bayes
Classifier
Given a Test Record:
X = (Refund = No, Marital Status = Married, Income = 120K)
naive Bayes Classifier:
P(Refund=Yes|No) = 3/7
P(Refund=No|No) = 4/7
P(Refund=Yes|Yes) = 0
P(Refund=No|Yes) = 1
P(Marital Status=Single|No) = 2/7
P(Marital Status=Divorced|No)=1/7
P(Marital Status=Married|No) = 4/7
P(Marital Status=Single|Yes) = 2/7
P(Marital Status=Divorced|Yes)=1/7
P(Marital Status=Married|Yes) = 0
For taxable income:
If class=No: sample mean = 110, sample variance = 2975
If class=Yes: sample mean = 90, sample variance = 25
P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No)
             = 4/7 × 4/7 × 0.0072 = 0.0024
P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes)
              = 1 × 0 × 1.2×10^-9 = 0
Since P(X|No)P(No) > P(X|Yes)P(Yes),
P(No|X) > P(Yes|X)
=> Class = No
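A compact sketch of the whole naive Bayes decision for this test record, using the probabilities and Gaussian parameters listed above (dictionary names are illustrative):

```python
import math

# Conditional probabilities estimated from the 10-record training table
p_refund_no = {"No": 4/7, "Yes": 1.0}                # P(Refund=No | class)
p_married   = {"No": 4/7, "Yes": 0.0}                # P(Married   | class)
income_stats = {"No": (110, 2975), "Yes": (90, 25)}  # (mean, variance)
prior = {"No": 7/10, "Yes": 3/10}

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def score(cls, income=120):
    # P(X | class) * P(class), assuming conditional independence of attributes
    return (p_refund_no[cls] * p_married[cls]
            * gaussian_pdf(income, *income_stats[cls]) * prior[cls])

print(max(["No", "Yes"], key=score))  # -> "No"
```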
Bayes Classifiers
• Robust to isolated noise points
• Handle missing values by ignoring the instance during probability
estimate calculations
• Robust to irrelevant attributes
• Bayes classification → decision boundaries
Quadratic Discriminant Classifier
• Using the Bayes classifier, assign x to class j:
  j = arg max_i P(wi | x) = arg max_i P(x | wi) P(wi)
• Assume the data points in each class follow the multivariate Gaussian distribution
• Then the Bayes classifier reduces to assigning x to class j:
  j = arg max_i di(x), where
  di(x) = ln P(wi) - (n/2) ln(2π) - (1/2) ln |Ci| - (1/2) (x - mi)^T Ci^{-1} (x - mi)
• di(x) is a quadratic function of x; such functions are called quadratic
discriminant functions (QDA)
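A small numpy sketch of the quadratic discriminant function di(x) above; the class means, covariances and priors here are made-up placeholders, since in practice they would be estimated from training data:

```python
import numpy as np

def quadratic_discriminant(x, mean, cov, prior):
    # d_i(x) = ln P(w_i) - (n/2) ln 2pi - (1/2) ln|C_i| - (1/2)(x-m_i)^T C_i^-1 (x-m_i)
    n = len(x)
    diff = x - mean
    return (np.log(prior)
            - 0.5 * n * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(cov))
            - 0.5 * diff @ np.linalg.inv(cov) @ diff)

# Example with two hypothetical classes in 2D
x = np.array([1.0, 2.0])
classes = [
    (np.array([0.0, 0.0]), np.eye(2), 0.5),           # (mean, covariance, prior)
    (np.array([2.0, 2.0]), np.diag([2.0, 0.5]), 0.5),
]
scores = [quadratic_discriminant(x, m, C, p) for m, C, p in classes]
print("assign x to class", int(np.argmax(scores)))
```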
Demo of decision boundaries
• Consider a simple case
– x is two-dimensional
– Ci is diagonal
– with two classes
• Then the decision boundary is a quadratic equation of x1 and x2,
and the coefficient of x1x2 is zero
Demo of decision boundaries
Linear Discriminant Classifier
• If we assume that the covariance matrix Σ is the same for all classes, then
  δk(x) = -(1/2) log |Σ| - (1/2) (x - μk)^T Σ^{-1} (x - μk) + log πk
  which, after dropping the terms that do not depend on k, reduces to the linear discriminant function
  δk(x) = x^T Σ^{-1} μk - (1/2) μk^T Σ^{-1} μk + log πk
• The decision boundary between any two classes is the set of x where their discriminant functions are equal, which is linear in x.
Linear Discriminant Classifier
• In practice the parameters of the Gaussian distribution
can be estimated using the training data:
  π̂k = Nk / N, where Nk is the number of instances in class k
  μ̂k = Σ_{gi = k} xi / Nk
  Σ̂ = Σ_{k=1}^{K} Σ_{gi = k} (xi - μ̂k)(xi - μ̂k)^T / (N - K)
Minimum Distance Classifier
• Compute the mean vector μi of each class i
• Given a new instance x, calculate the Euclidean distance
between x and each mean vector:
  d = ||x - μi|| = sqrt( (x - μi)^T (x - μi) )
• Assign x to the class with the smallest distance
Minimum Distance Classifier
• Can be considered as a special case of the linear
discriminant classifier when the covariance matrix of every class is Σ = σ²I
and the prior probability of each class is the same.
  linear discriminant function:  δk(x) = x^T Σ^{-1} μk - (1/2) μk^T Σ^{-1} μk + log πk
  MDC decision function:         δk(x) = x^T μk - (1/2) μk^T μk
Minimum error rate classifiers
• Given item x, the probability that item x belongs to class
wi is P(wi|x).
• If item x is wrongly classified as class wj (when it actually
belongs to class wk), then the incurred loss is Lkj
Minimum error rate classifiers
• Assign x to class j which minimizes the conditional average
risk (loss):
  j = arg min_i ri(x) = arg min_i Σ_{k=1}^{M} Lki P(wk | x)
• This choice minimizes the expected risk
• If we use the 0-1 loss (Lki = 1 for k ≠ i and Lii = 0), we have ri(x) = 1 - P(wi|x)
• Then minimizing ri(x) is equivalent to the Bayes classifier (maximizing P(wi|x))
Overview of the classification methods
• Minimum error rate classifiers (with 0-1 loss) → Bayes classifiers
• Bayes classifiers + Gaussian distribution → hyperquadrics as the decision boundaries (quadratic discriminant classifiers)
• The same Ci for all i → linear discriminant classifiers
• Ci = σ²I and P(wi) = P(wj) → minimum distance classifiers
Instance-Based Classifiers
(figure: a set of stored cases with attributes Atr1 … AtrN and class labels A/B/C, and an unseen case with the same attributes)
• Store the training records
• Use the training records to predict the class label of unseen cases
Instance Based Classifiers
• Examples:
– Rote-learner
• Memorizes entire training data and performs
classification only if attributes of record match one
of the training examples exactly
– Nearest neighbor
• Uses k “closest” points (nearest neighbors) for
performing classification
Nearest-Neighbor Classifiers
• Basic idea:
– If it walks like a duck, quacks like a duck, then
it's probably a duck
(figure: for a test record, compute the distance to the training records and choose the k "nearest" ones)
Nearest-Neighbor Classifiers
• Requires three things
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
– Compute the distance to the training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the
class label of the unknown record (e.g., by taking a majority vote)
Definition of Nearest Neighbor
(figures: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor regions around a test point X)
The k-nearest neighbors of a record x are the data points
that have the k smallest distances to x
(figure: the 1-nearest-neighbor decision regions form a Voronoi diagram)
Nearest Neighbor Classification
• Compute distance between two points:
– Euclidean distance: d(p, q) = sqrt( Σ_{i=1}^{n} (pi - qi)² )
• Determine the class from the nearest neighbor list
– take the majority vote of class labels among the k nearest neighbors
– weigh the vote according to distance
• weight factor: w = 1/d²
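A minimal plain-Python sketch of the k-NN procedure just described (Euclidean distance plus majority vote); the toy training points are illustrative:

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, x, k=3):
    """train is a list of (feature_vector, class_label) pairs."""
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Tiny example with hypothetical 2-D points
train = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_predict(train, (2, 1), k=3))  # -> "A"
```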
Nearest Neighbor Classification
• Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points from
other classes
Nearest Neighbor Classification
• Scaling issues
– Attributes may have to be scaled to prevent
distance measures from being dominated by one
of the attributes
– Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
Nearest Neighbor Classification
• Problem with Euclidean measure:
– High-dimensional data
• curse of dimensionality
– Can produce counter-intuitive results, e.g.
  111111111110 vs 011111111111 → d = 1.4142
  100000000000 vs 000000000001 → d = 1.4142
• Solutions:
– Normalize the vectors to unit length
– Use cosine similarity
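A short sketch contrasting Euclidean distance with cosine similarity on the binary vectors from this slide:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm

a = [int(c) for c in "111111111110"]
b = [int(c) for c in "011111111111"]
c = [int(c) for c in "100000000000"]
d = [int(c) for c in "000000000001"]

print(euclidean(a, b), cosine_similarity(a, b))  # 1.4142..., ~0.91 (very similar)
print(euclidean(c, d), cosine_similarity(c, d))  # 1.4142..., 0.0   (nothing in common)
```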
Example: PEBLS
• PEBLS: Parallel Exemplar-Based Learning
System (Cost & Salzberg)
– Works with both continuous and nominal
features
• For nominal features, distance between two
nominal values is computed using modified value
difference metric (MVDM)
– Each record is assigned a weight factor
– Number of nearest neighbor, k = 1
Example: PEBLS

(Training data: the same 10-record table, with Refund and Marital Status as nominal attributes and Cheat as the class label.)

Class counts per attribute value:

Marital Status:
Class   Single  Married  Divorced
Yes     2       0        1
No      2       4        1

Refund:
Class   Yes  No
Yes     0    3
No      3    4

Distance between nominal attribute values (modified value difference metric):

d(V1, V2) = Σi | n1i/n1 - n2i/n2 |

d(Single, Married)   = |2/4 - 0/4| + |2/4 - 4/4| = 1
d(Single, Divorced)  = |2/4 - 1/2| + |2/4 - 1/2| = 0
d(Married, Divorced) = |0/4 - 1/2| + |4/4 - 1/2| = 1
d(Refund=Yes, Refund=No) = |0/3 - 3/7| + |3/3 - 4/7| = 6/7
Example: PEBLS

Two records:
X: Refund=Yes, Marital Status=Single, Taxable Income=125K, Cheat=No
Y: Refund=No, Marital Status=Married, Taxable Income=100K, Cheat=No

Distance between record X and record Y:

D(X, Y) = wX wY Σ_{i=1}^{d} d(Xi, Yi)²

where:
wX = (number of times X is used for prediction) / (number of times X predicts correctly)
wX ≈ 1 if X makes accurate predictions most of the time
wX > 1 if X is not reliable for making predictions
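A small sketch of the MVDM nominal-value distance computed from the same 10-record table (the record encoding is an illustrative assumption):

```python
from collections import Counter

# Each record: (Refund, Marital Status, Cheat)
records = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]
classes = ["Yes", "No"]

def mvdm(attr_idx, v1, v2):
    """d(V1, V2) = sum over classes of |n1i/n1 - n2i/n2|."""
    n1 = [r for r in records if r[attr_idx] == v1]
    n2 = [r for r in records if r[attr_idx] == v2]
    c1, c2 = Counter(r[-1] for r in n1), Counter(r[-1] for r in n2)
    return sum(abs(c1[c] / len(n1) - c2[c] / len(n2)) for c in classes)

print(mvdm(1, "Single", "Married"))   # 1.0
print(mvdm(1, "Single", "Divorced"))  # 0.0
print(mvdm(0, "Yes", "No"))           # 6/7 = 0.857...
```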
Collaborative Filtering
• Making recommendations based on the nearest neighbor
approach.
• Basic idea: to recommend music to Anne, find the person A
who is most similar to Anne and
recommend to Anne the music that person A is listening to.
• We use the Pearson correlation coefficient to compute the
similarity (the opposite of distance) between them.
Collaborative Filtering
• Suppose Amanda, Eric and Sally rated the band, the
Grey Wardens, and their similarity with Anne is as follows:

Person   Grey Wardens Rating
Amanda   4.5
Eric     5
Sally    3.5

Person   Pearson (similarity with Anne)
Sally    0.8
Eric     0.7
Amanda   0.5
Collaborative Filtering
Person   Grey Wardens Rating   Influence
Amanda   4.5                   25%
Eric     5                     35%
Sally    3.5                   40%

• K = 3: the projected rating for Anne =
(4.5 × 0.25) + (5 × 0.35) + (3.5 × 0.4) = 4.275
• K = 2: the projected rating for Anne =
(3.5 × 0.8/1.5) + (5 × 0.7/1.5) = 4.2
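A minimal sketch of the projected-rating computation above, weighting each of the k most similar neighbors by its Pearson similarity:

```python
ratings = {"Amanda": 4.5, "Eric": 5.0, "Sally": 3.5}
similarity = {"Amanda": 0.5, "Eric": 0.7, "Sally": 0.8}  # Pearson with Anne

def projected_rating(k):
    # Take the k most similar neighbors and weight their ratings
    neighbors = sorted(similarity, key=similarity.get, reverse=True)[:k]
    total_sim = sum(similarity[p] for p in neighbors)
    return sum(ratings[p] * similarity[p] / total_sim for p in neighbors)

print(projected_rating(3))  # 4.275
print(projected_rating(2))  # 4.2
```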
Nearest neighbor Classification
• k-NN classifiers are lazy learners
– They do not build models explicitly
– Unlike eager learners such as decision tree induction
and rule-based systems
– Classifying unknown records is relatively expensive
Classification methods
• The methods are related
– Minimum distance classifier
– Linear discriminant classifier
– Quadratic discriminant classifier
– Bayes classifier
– Minimum error rate classifier
• These methods are based on the statistical properties of the data set
Automated Negotiation
• A negotiation consists of
– A number of agents (agent space)
– A negotiation domain D (outcome space):
D = {I1, I2, … In}, and each issue consists of k values, Ii = {v1, v2, …, vk}
– A number of utility spaces (preference profiles)
• A laptop negotiation domain
– Two negotiation agents: agent A and agent B
– Three issues: brand, hard disk, monitor
– Each issue contains a number of discrete values
– One bidding instance is a Dell laptop with an 80 GB hard disk and a 17-inch
monitor.
Automated Negotiation
• Utility Space
– Specifies the preference of each agent over each outcome
(figure: the agents' utility space with the Pareto optimal bids marked)
• The optimal bid of an agent is the bid that
gives the maximum utility to that agent
• The utilities of negotiation agents are often
contradictory, i.e., one agent's gain is another
agent's pain
Automated Negotiation
• Optimal goal of a negotiation
– Maximizing individual payoff
– Maximizing social welfare (the sum of the payoffs of all
parties involved in the negotiation)
Automated Negotiation
• Negotiation protocol
– Defines the rules to regulate how the negotiation proceeds
between negotiation agents.
– Agents are obliged to follow the protocol, and any deviation from
the protocol will be penalized.
• Negotiation Strategy
– Specifies how an agent should behave during a negotiation under
the regulation of a negotiation protocol.
Negotiation Protocol
• Bilateral Negotiation Protocol (Alternating Offer Protocol)
– Involves two parties: agent A and agent B
– One agent (e.g., agent A) starts the negotiation first
– The agents take turns to negotiate
– At its turn, each agent is allowed to take one of the following
three actions:
• Accept – accept the current proposal from the negotiation partner
• Offer – propose a new offer to the partner
• EndNegotiation – terminate the negotiation without reaching
an agreement
• Reservation Value
– The value that an agent obtains if no agreement is reached by
the end of negotiation or the negotiation terminates earlier
• Time pressure
– The utility decreases with the passing of negotiation time
Overall Structure of a Negotiation
Strategy
"Decoupling Negotiating Agents to Explore the Space of Negotiation Strategies",
Novel Insights in Agent-based Complex Automated Negotiation, 2014
Design Effective Negotiation Strategies
using Mining Techniques
• Benefits of predicting the opponent's behavior
– Cooperative environments -> better coordination with others
– Competitive environments -> opportunity to take exploitative actions to
maximize one's own payoff
• How to predict the opponent's behavior?
– Regression techniques (non-linear regression, Gaussian process regression) –
predict the opponent's concession degree [Using Gaussian processes to
optimise concession in complex negotiations against unknown opponents,
IJCAI, 2011]
– Chebyshev polynomials – predict the opponent's decision function [Modeling
opponent decision in repeated one-shot negotiations. Proceedings of the fourth
international joint conference on Autonomous agents and multiagent systems.
ACM, 2005: 397-403]
– Neural networks [Predicting Opponent's Moves in Electronic
Negotiations Using Neural Networks, Group Decision and Negotiation
Conference, 2006]
Design Effective Negotiation Strategies
using Mining Techniques
• Benefits of predicting the opponent’s preference
– Increase the chance of reaching win-win negotiation outcomes
– Better understanding of the opponent’s behaviors
• How to predict the opponent’s preference?
– Bayesian learning – predict the opponent's issue weights and
evaluation function [Opponent modelling in automated multi-issue
negotiation using Bayesian learning, AAMAS, 2008]
– Bayesian learning – predict the opponent's reservation value
[Sycara K, Zeng D. Benefits of learning in negotiation, AAAI, 1997]
Bayesian Learning in
Negotiation
• A Buyer and a Seller
• Task: learn the reservation value of the seller using Bayesian
learning
• Two hypotheses on the seller’s reservation value: H1 = 100, H2
=130
• Prior knowledge of the buyer: P(H1) = P(H2) = 0.5
• Domain Knowledge of the buyer: a seller usually offers a price which
is above their reservation price by 17%
– P(152|H2) = 0.95, P(152|H1) = 0.75
• Given a new price offered by the seller, apply Bayesian learning
P(H1 | e) = P(e | H1) P(H1) / P(e) = P(e | H1) P(H1) / ( P(e | H1) P(H1) + P(e | H2) P(H2) )
P(H2 | e) = P(e | H2) P(H2) / P(e) = P(e | H2) P(H2) / ( P(e | H1) P(H1) + P(e | H2) P(H2) )
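A tiny sketch of this Bayesian update using the slide's priors and likelihoods:

```python
# H1: reservation value 100, H2: reservation value 130
prior = {"H1": 0.5, "H2": 0.5}
likelihood = {"H1": 0.75, "H2": 0.95}  # P(offer = 152 | hypothesis)

evidence = sum(likelihood[h] * prior[h] for h in prior)                 # P(e)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
print(posterior)  # H2 becomes more probable: {'H1': ~0.44, 'H2': ~0.56}
```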
Select Effective Negotiation
Strategies using Mining Techniques
• Given a new domain, how do we predict which existing
negotiation strategy performs best?
– Artificial Neural Network
– Decision Tree
– Linear/Logistic Regression
Basic Procedures
• Extract the features of a domain
• Build the training set
• Choose a data mining algorithm and train the model
• Apply the model to a testing set and evaluate the performance of the
prediction results.

Discounting  Avg No. of values  Average utility     No. of  Domain  Class
factor       in each issue      over all proposals  issues  size
0.3          0.3                0.1                 3       1000    Strategy 1
0.8          0.45               0.35                10      2888    Strategy 2
0.1          0.32               0.56                21      33      Strategy 1
0.2          0.76               0.8                 4       200     Strategy 3
0.3          0.678              0.95                8       34      Strategy 5
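A minimal sketch, assuming scikit-learn is available, of training a decision tree on this table to predict the best strategy for a new domain (the new-domain feature vector is made up):

```python
from sklearn.tree import DecisionTreeClassifier

# Features: discounting factor, avg no. of values per issue,
# average utility over all proposals, no. of issues, domain size
X = [
    [0.3, 0.3,   0.1,  3,  1000],
    [0.8, 0.45,  0.35, 10, 2888],
    [0.1, 0.32,  0.56, 21, 33],
    [0.2, 0.76,  0.8,  4,  200],
    [0.3, 0.678, 0.95, 8,  34],
]
y = ["Strategy 1", "Strategy 2", "Strategy 1", "Strategy 3", "Strategy 5"]

model = DecisionTreeClassifier().fit(X, y)
# Predict the best strategy for a new (hypothetical) domain
print(model.predict([[0.25, 0.5, 0.4, 5, 500]]))
```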
Project Introduction
Genius Negotiation Platform
• A negotiation environment for heterogeneous negotiation agents
• A set of negotiation problems (domains)
• A set of negotiation agents (strategies)
• A set of analytical tools to evaluate an agent's performance
• Webpage: http://ii.tudelft.nl/genius/
• User manual: https://ii.tudelft.nl/genius/sites/default/files/userguide.pdf
• Resources: http://mmi.tudelft.nl/negotiation/index.php/Automated_Negotiating_Agents_Competition_(ANAC)
A Reminder
• Start early!
• Send the group information to
[email protected] ASAP
Exercise II
• Use the Bayes classifier and k-NN to classify the following
task and compare their performance.
• Pima Indians Diabetes
– the number of times pregnant,
– plasma glucose concentration,
– blood pressure,
– triceps skin fold thickness,
– serum insulin level,
– body mass index,
– diabetes pedigree function,
– age,
– Class label: '1' – diabetes; '0' – not.