Bayesian Classification
CS690L
Data Mining: Classification (2)
Reference:
J. Han and M. Kamber, Data Mining: Concepts and Techniques
• Probabilistic learning: Calculate explicit probabilities for
hypotheses; among the most practical approaches to
certain types of learning problems
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with observed
data.
• Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
• Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard of
optimal decision making against which other methods can
be measured
Bayesian Theorem: Basics
• Let X be a data sample whose class label is unknown
• Let H be a hypothesis that X belongs to class C
• For classification problems, determine P(H|X): the
probability that the hypothesis holds given the observed
data sample X
• P(H): prior probability of hypothesis H (i.e. the initial
probability before we observe any data, reflects the
background knowledge)
• P(X): probability that sample data is observed
• P(X|H) : probability of observing the sample X, given that
the hypothesis holds
Bayesian Theorem
• Given training data X, the posterior probability of a
hypothesis H, P(H|X), follows Bayes' theorem:
P(H|X) = P(X|H) P(H) / P(X)
• Informally, this can be written as
posterior = likelihood × prior / evidence (a small numeric illustration follows this slide)
• MAP (maximum a posteriori) hypothesis: choose the hypothesis h
that maximizes the posterior P(h|X), equivalently the one with the
largest P(X|h) P(h)
• Practical difficulty: requires initial knowledge of
many probabilities and significant computational cost
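As a quick numeric illustration of the formula, here is a tiny Python sketch; the diagnostic scenario and all numbers in it are invented purely for illustration:

```python
# Bayes' theorem on a toy diagnostic example (illustrative numbers, not from the slides)
# H = "patient has the disease", X = "test is positive"
p_H = 0.01            # prior P(H)
p_X_given_H = 0.95    # likelihood P(X|H)
p_X = 0.06            # evidence P(X)

p_H_given_X = p_X_given_H * p_H / p_X   # posterior = likelihood * prior / evidence
print(round(p_H_given_X, 3))            # 0.158
```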
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally
independent given the class:
P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)
• The joint probability of, say, two attribute values x1 and x2,
given that the class is Ci, is the product of the probabilities of
each value taken separately given that class:
P([x1, x2] | Ci) = P(x1 | Ci) × P(x2 | Ci)
• No dependence relation between attributes is assumed
• Greatly reduces the computation cost: only class-conditional
value counts are needed
• Once the probability P(X|Ci) is known, assign X to the
class with maximum P(X|Ci) × P(Ci) (see the counting sketch below)
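Below is a minimal counting-based sketch of this rule in Python; it is not from the slides, and the function names and data layout (rows as tuples of attribute values, labels as a parallel list) are assumptions made for illustration. No smoothing is applied.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(Ci) and P(xk | Ci) by simple counting (no smoothing)."""
    class_counts = Counter(labels)                      # count of each class Ci
    value_counts = defaultdict(Counter)                 # counts of (attribute index, value) per class
    for row, label in zip(rows, labels):
        for k, value in enumerate(row):
            value_counts[label][(k, value)] += 1
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    return priors, value_counts, class_counts

def classify(x, priors, value_counts, class_counts):
    """Return the class maximizing P(X|Ci) * P(Ci) under the independence assumption."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        likelihood = 1.0
        for k, value in enumerate(x):
            likelihood *= value_counts[c][(k, value)] / class_counts[c]
        score = likelihood * prior
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Because raw counts are used, an attribute value never seen with a class drives the whole product to zero; in practice a Laplace correction is usually added to the counts.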
Training Dataset
Class:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'
Data sample:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no
Naïve Bayesian Classifier: Example
• Compute P(X|Ci) for each class:
P(age="<=30" | buys_computer="yes") = 2/9 = 0.222
P(age="<=30" | buys_computer="no") = 3/5 = 0.6
P(income="medium" | buys_computer="yes") = 4/9 = 0.444
P(income="medium" | buys_computer="no") = 2/5 = 0.4
P(student="yes" | buys_computer="yes") = 6/9 = 0.667
P(student="yes" | buys_computer="no") = 1/5 = 0.2
P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667
P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4
X = (age<=30, income=medium, student=yes, credit_rating=fair)
P(X|Ci):
P(X | buys_computer="yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer="no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|Ci) × P(Ci):
P(X | buys_computer="yes") × P(buys_computer="yes") = 0.044 × 9/14 = 0.028
P(X | buys_computer="no") × P(buys_computer="no") = 0.019 × 5/14 = 0.007
Since 0.028 > 0.007, X belongs to class "buys_computer = yes" (this calculation is reproduced in the script below)
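The worked example above can be reproduced with a short script; this is a minimal sketch assuming the 14-row training table is available as a list of tuples (the variable names are mine):

```python
from collections import Counter

# (age, income, student, credit_rating, buys_computer) rows from the training table above
data = [
    ("<=30", "high", "no", "fair", "no"),        ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),        (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),       (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")            # the query sample X

class_counts = Counter(row[-1] for row in data)
for c in class_counts:                           # "yes" and "no"
    rows_c = [row for row in data if row[-1] == c]
    likelihood = 1.0
    for k, value in enumerate(x):
        likelihood *= sum(1 for row in rows_c if row[k] == value) / len(rows_c)
    posterior = likelihood * class_counts[c] / len(data)
    print(c, round(likelihood, 3), round(posterior, 3))   # yes: 0.044, 0.028; no: 0.019, 0.007
```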
Naïve Bayes: Continuous Value (1)

Outlook    Temp(ºF)  Humidity  Windy  Class
sunny      85        85        false  Don't Play
sunny      80        90        true   Don't Play
overcast   83        86        false  Play
rainy      70        96        false  Play
rainy      68        80        false  Play
rainy      65        70        true   Don't Play
overcast   64        65        true   Play
sunny      72        95        false  Don't Play
sunny      69        70        false  Play
rainy      75        80        false  Play
sunny      75        70        true   Play
overcast   72        90        true   Play
overcast   81        75        false  Play
rainy      71        91        true   Don't Play
Naïve Bayes: Continuous Value (2)
Per-class counts (and the raw numeric values) from the table above:

Outlook            yes   no
  sunny             2     3
  overcast          4     0
  rainy             3     2
Temperature (ºF)   yes: 83, 70, 68, 64, 69, 75, 75, 72, 81
                   no:  85, 80, 65, 72, 71
Humidity           yes: 86, 96, 80, 65, 70, 80, 70, 90, 75
                   no:  85, 90, 70, 95, 91
Windy              yes   no
  false             6     2
  true              3     3
Play                9     5
Naïve Bayes: Continuous Value (3)
Relative frequencies, and mean (µ) / standard deviation (σ) of the numeric attributes, per class:

Outlook            yes    no
  sunny            2/9    3/5
  overcast         4/9    0/5
  rainy            3/9    2/5
Temperature        yes    no
  µ                73     74.6
  σ                6.2    7.9
Humidity           yes    no
  µ                79.1   86.2
  σ                10.2   9.7
Windy              yes    no
  false            6/9    2/5
  true             3/9    3/5
Play               9/14   5/14
(the script below shows how the µ and σ values are computed)
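These summary statistics follow directly from the weather table; here is a minimal sketch (the list literal and variable names are mine; statistics.stdev is the sample standard deviation, which is what the values above use):

```python
from statistics import mean, stdev

# (Temp, Humidity, Class) triples taken from the weather table above ("yes" = Play, "no" = Don't Play)
rows = [(85, 85, "no"), (80, 90, "no"), (83, 86, "yes"), (70, 96, "yes"), (68, 80, "yes"),
        (65, 70, "no"), (64, 65, "yes"), (72, 95, "no"), (69, 70, "yes"), (75, 80, "yes"),
        (75, 70, "yes"), (72, 90, "yes"), (81, 75, "yes"), (71, 91, "no")]

for cls in ("yes", "no"):
    temps = [t for t, h, c in rows if c == cls]
    hums = [h for t, h, c in rows if c == cls]
    print(cls, round(mean(temps), 1), round(stdev(temps), 1),   # yes: 73.0, 6.2   no: 74.6, 7.9
          round(mean(hums), 1), round(stdev(hums), 1))          # yes: 79.1, 10.2  no: 86.2, 9.7
```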
Naïve Bayes: Continuous Value (4)

Outlook   Temp   Humidity   Windy   Class
sunny     66     90         true    ?
Naïve Bayes: Continuous Value (5)
Gaussian (normal) density function:
f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}
f(Temp = 66 \mid Class = yes) = \frac{1}{\sqrt{2\pi}\,(6.2)} \, e^{-\frac{(66 - 73)^2}{2(6.2)^2}} = 0.0340
f(Humidity = 90 \mid Class = yes) = 0.0221
• P(Class = yes | Outlook = sunny ∧ Temp = 66 ∧ Humidity = 90 ∧ Windy = true):
P_yes = (2/9 × 0.0340 × 0.0221 × 3/9 × 9/14) / P(Outlook = sunny ∧ Temp = 66 ∧ Humidity = 90 ∧ Windy = true)
      = 0.000036 / P(Outlook = sunny ∧ Temp = 66 ∧ Humidity = 90 ∧ Windy = true)
Naïve Bayes: Continuous Value (6)
• P(Class = no | Outlook = sunny ∧ Temp = 66 ∧ Humidity = 90 ∧ Windy = true):
P_no = (3/5 × 0.0291 × 0.0380 × 3/5 × 5/14) / P(Outlook = sunny ∧ Temp = 66 ∧ Humidity = 90 ∧ Windy = true)
     = 0.000136 / P(Outlook = sunny ∧ Temp = 66 ∧ Humidity = 90 ∧ Windy = true)
• Normalizing so the two probabilities sum to 1:
P_yes = 0.000036 / (0.000036 + 0.000136) = 20.9%
P_no = 79.1%
(the full calculation is reproduced in the script below)
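The whole continuous-value calculation can be reproduced in a few lines; a minimal sketch using the priors, frequencies, and Gaussian parameters from the tables above (the dictionary layout and names are mine):

```python
import math

def gaussian(x, mu, sigma):
    """Gaussian (normal) density f(x) with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Per-class statistics from the "Continuous Value (3)" table
stats = {
    "yes": {"outlook_sunny": 2 / 9, "windy_true": 3 / 9, "prior": 9 / 14,
            "temp": (73.0, 6.2), "humidity": (79.1, 10.2)},
    "no":  {"outlook_sunny": 3 / 5, "windy_true": 3 / 5, "prior": 5 / 14,
            "temp": (74.6, 7.9), "humidity": (86.2, 9.7)},
}

scores = {}
for cls, s in stats.items():
    scores[cls] = (s["outlook_sunny"]
                   * gaussian(66, *s["temp"])        # f(Temp = 66 | class)
                   * gaussian(90, *s["humidity"])    # f(Humidity = 90 | class)
                   * s["windy_true"]
                   * s["prior"])

total = sum(scores.values())
for cls in scores:
    # roughly 0.21 for yes and 0.79 for no, matching the slides' 20.9% / 79.1% up to rounding
    print(cls, round(scores[cls] / total, 3))
```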
Naïve Bayesian Classifier: Evaluation
• Advantages:
– Easy to implement
– Good results obtained in most cases
• Disadvantages:
– Assumption of class-conditional independence, therefore loss of
accuracy
– In practice, dependencies exist among variables
– E.g., hospital patients: profile (age, family history, etc.),
symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
– Dependencies among these cannot be modeled by a naïve
Bayesian classifier
• How to deal with these dependencies?
– Bayesian Belief Networks
Bayesian Networks
• A Bayesian belief network allows a subset of the variables to be
conditionally independent
• A graphical model of causal relationships
– Represents dependency among the variables
– Gives a specification of the joint probability distribution
• Small example: nodes X, Y, Z, P with links X → Z, Y → Z, and Y → P
– Nodes: random variables
– Links: dependency
– X and Y are the parents of Z, and Y is the parent of P
– There is no dependency between Z and P
– The graph has no loops or cycles
Bayesian Belief Network: An Example
[Figure: a belief network over FamilyHistory, Smoker, LungCancer,
Emphysema, PositiveXRay, and Dyspnea; the parents of LungCancer are
FamilyHistory (FH) and Smoker (S)]
The conditional probability table (CPT) for the variable LungCancer
shows the conditional probability for each possible combination of the
values of its parents:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC       0.8       0.5        0.7        0.1
~LC      0.2       0.5        0.3        0.9

The joint probability of a tuple (z1, …, zn) of values of the
variables Z1, …, Zn is
P(z_1, \dots, z_n) = \prod_{i=1}^{n} P(z_i \mid Parents(Z_i))
(a small worked example in Python follows the next slide)
Learning Bayesian Networks
• Several cases
– Given both the network structure and all variables
observable: learn only the CPTs
– Network structure known, some hidden variables:
method of gradient descent, analogous to neural
network learning
– Network structure unknown, all variables observable:
search through the model space to reconstruct the graph
topology
– Unknown structure, all hidden variables: no good
algorithms known for this purpose
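To show how the factorization P(z1, …, zn) = ∏ P(zi | Parents(Zi)) is evaluated, here is a minimal sketch over the small X, Y, Z, P structure described above; every probability value in it is invented purely for illustration:

```python
# Joint probability of an assignment under P(z1,...,zn) = ∏ P(zi | Parents(Zi)),
# for the toy structure X → Z ← Y and Y → P. All CPT numbers are made-up examples.
P_X = {True: 0.3, False: 0.7}                     # P(X)
P_Y = {True: 0.6, False: 0.4}                     # P(Y)
P_Z = {(True, True): 0.9, (True, False): 0.5,     # P(Z=True | X, Y)
       (False, True): 0.4, (False, False): 0.1}
P_P = {True: 0.8, False: 0.2}                     # P(P=True | Y)

def joint(x, y, z, p):
    pz_true = P_Z[(x, y)]
    pp_true = P_P[y]
    return (P_X[x] * P_Y[y]
            * (pz_true if z else 1 - pz_true)
            * (pp_true if p else 1 - pp_true))

print(joint(True, False, True, False))            # P(X=T, Y=F, Z=T, P=F) = 0.3*0.4*0.5*0.8 = 0.048
```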
Prediction
• Prediction is similar to classification
– First, construct a model
– Second, use model to predict unknown value
• Major method for prediction is regression
– Linear and multiple regression
– Non-linear regression
• Prediction is different from classification
– Classification predicts categorical class labels
– Prediction models continuous-valued functions
Regression Analysis and Log-Linear Models in Prediction
• Linear regression: Y = α + β X
– Two parameters, α and β, specify the line and are to
be estimated by using the data at hand
– They are fit by applying the least squares criterion to the known
values of Y1, Y2, …, X1, X2, …
• Multiple regression: Y = b0 + b1 X1 + b2 X2 (a brief numpy sketch follows the next slide)
– Many nonlinear functions can be transformed into the
above
• Log-linear models:
– The multi-way table of joint probabilities is
approximated by a product of lower-order tables
– Probability: p(a, b, c, d) = α_{ab} β_{ac} χ_{ad} δ_{bcd}
Predictive Modeling in Databases
• Predictive modeling: predict data values or construct
generalized linear models based on the database data
• One can only predict value ranges or category
distributions
• Method outline:
– Minimal generalization
– Attribute relevance analysis
– Generalized linear model construction
– Prediction
• Determine the major factors which influence the
prediction
– Data relevance analysis: uncertainty measurement, entropy
analysis, expert judgment, etc.
• Multi-level prediction: drill-down and roll-up analysis
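Multiple regression Y = b0 + b1 X1 + b2 X2, mentioned on the previous slide, can be fit with ordinary least squares; here is a minimal numpy sketch on invented data, constructed to lie exactly on Y = 1 + 2·X1 + 3·X2:

```python
import numpy as np

# Fit Y = b0 + b1*X1 + b2*X2 by least squares; the data is illustrative only.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
Y  = np.array([9.0, 8.0, 19.0, 18.0, 26.0])       # exactly 1 + 2*X1 + 3*X2

A = np.column_stack([np.ones_like(X1), X1, X2])    # design matrix with intercept column
(b0, b1, b2), *_ = np.linalg.lstsq(A, Y, rcond=None)
print(round(b0, 2), round(b1, 2), round(b2, 2))    # 1.0 2.0 3.0
```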
Regression
• Method of Least Squares
• Principle: minimize \sum_{i=1}^{n} [\hat{Y}(X_i) - Y_i]^2
• Assume \hat{Y}(X_i) = a + b X_i (linear relationship)
• So minimize \sum_{i=1}^{n} [Y_i - (a + b X_i)]^2
• Minimizing with respect to "a" gives:
\sum_{i=1}^{n} Y_i = n a + b \sum_{i=1}^{n} X_i    (1)
• Minimizing with respect to "b" gives:
\sum_{i=1}^{n} X_i Y_i = a \sum_{i=1}^{n} X_i + b \sum_{i=1}^{n} X_i^2    (2)
Linear regression
• Solving (1) and (2) we get:
b = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2}
a = \frac{1}{n} \left( \sum Y - b \sum X \right)
Linear Regression: Example
• Determine the linear regression line, with Y linearly dependent on X:

X (years)   Y (Salary)
3           30
8           57
9           64
13          72
3           36
6           43
11          59
21          90
1           20
16          83

\bar{X} = \frac{1}{n} \sum X = 9.1,   \bar{Y} = \frac{1}{n} \sum Y = 55.4
b = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}
b = \frac{(3 - 9.1)(30 - 55.4) + (8 - 9.1)(57 - 55.4) + \dots + (16 - 9.1)(83 - 55.4)}{(3 - 9.1)^2 + (8 - 9.1)^2 + \dots + (16 - 9.1)^2} \approx 3.5
a = \bar{Y} - b\bar{X} = 55.4 - (3.5)(9.1) = 23.6
(the computation is checked in the script below)
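The closed-form solution can be checked with a few lines of Python on the ten (X, Y) pairs above (variable names are mine):

```python
# Least-squares fit of Y = a + b*X for the years-of-experience / salary example above
X = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
Y = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
n = len(X)

x_bar = sum(X) / n                                   # 9.1
y_bar = sum(Y) / n                                   # 55.4
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum((x - x_bar) ** 2 for x in X)
a = y_bar - b * x_bar

print(round(b, 2), round(a, 2))                      # 3.54 23.21; the slides round b to 3.5, giving a ≈ 23.6
```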