Classification
©Jiawei Han and Micheline Kamber
http://www-sal.cs.uiuc.edu/~hanj/bk2/
Chp 6
Integrated with slides from Prof. Andrew W. Moore
http://www.cs.cmu.edu/~awm/tutorials
modified by Donghui Zhang
2017年5月22日星期一
Data Mining: Concepts and Techniques
1
Content






What is classification?
Decision tree
Naïve Bayesian Classifier
Bayesian Networks
Neural Networks
Support Vector Machines (SVM)
2017年5月22日星期一
Data Mining: Concepts and Techniques
2
Classification vs. Prediction



Classification:
 models categorical class labels (discrete or nominal)
 e.g. given a new customer, does she belong to the
“likely to buy a computer” class?
Prediction:
 models continuous-valued functions
 e.g. how many computers will a customer buy?
Typical Applications
 credit approval
 target marketing
 medical diagnosis
 treatment effectiveness analysis
2017年5月22日星期一
Data Mining: Concepts and Techniques
3
Classification—A Two-Step Process


Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
 The set of tuples used for model construction is training set
 The model is represented as classification rules, decision trees,
or mathematical formulae
Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
 The known label of test sample is compared with the
classified result from the model
 Accuracy rate is the percentage of test set samples that are
correctly classified by the model
 Test set is independent of training set, otherwise over-fitting
will occur
 If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
2017年5月22日星期一
Data Mining: Concepts and Techniques
4
Classification Process (1): Model
Construction
Training Data → Classification Algorithm → Classifier (Model)

Training Data:
NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Classifier (Model):
IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Data Mining: Concepts and Techniques
5
Classification Process (2): Use the
Model in Prediction
Classifier

Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
6
Supervised vs. Unsupervised
Learning

Supervised learning (classification)



Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)


The class labels of training data are unknown
Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
2017年5月22日星期一
Data Mining: Concepts and Techniques
7
Evaluating Classification Methods






Predictive accuracy
Speed and scalability
 time to construct the model
 time to use the model
Robustness
 handling noise and missing values
Scalability
 efficiency in disk-resident databases
Interpretability:
 understanding and insight provided by the model
Goodness of rules
 decision tree size
 compactness of classification rules
2017年5月22日星期一
Data Mining: Concepts and Techniques
8
Content






What is classification?
Decision tree
Naïve Bayesian Classifier
Bayesian Networks
Neural Networks
Support Vector Machines (SVM)
2017年5月22日星期一
Data Mining: Concepts and Techniques
9
Training Dataset
This follows an example from Quinlan's ID3.

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
10
Output: A Decision Tree for “buys_computer”
age?
  <=30: student?
    no:  buys_computer = no
    yes: buys_computer = yes
  31…40: buys_computer = yes
  >40: credit rating?
    excellent: buys_computer = no
    fair:      buys_computer = yes
2017年5月22日星期一
Data Mining: Concepts and Techniques
11
Extracting Classification Rules from Trees

Represent the knowledge in the form of IF-THEN rules

One rule is created for each path from the root to a leaf

Each attribute-value pair along a path forms a conjunction

The leaf node holds the class prediction

Rules are easier for humans to understand

Example
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
2017年5月22日星期一
Data Mining: Concepts and Techniques
12
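For illustration, these rules can be applied directly as a classifier. A minimal sketch in Python (the attribute names follow the training table; the exact age strings are illustrative):

def classify(record):
    """Apply the extracted decision-tree rules to one record (a dict of attribute values)."""
    if record["age"] == "<=30":
        return "yes" if record["student"] == "yes" else "no"
    if record["age"] == "31...40":
        return "yes"
    # age > 40: the decision depends on credit_rating
    return "no" if record["credit_rating"] == "excellent" else "yes"

# Example: a young student with fair credit is predicted to buy a computer.
print(classify({"age": "<=30", "student": "yes", "credit_rating": "fair"}))  # yes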
Algorithm for Decision Tree Induction


Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-conquer
manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are
discretized in advance)
 Examples are partitioned recursively based on selected attributes
 Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
 There are no samples left
2017年5月22日星期一
Data Mining: Concepts and Techniques
13
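A compact, illustrative sketch of the greedy top-down induction just described, selecting the split attribute by lowest conditional entropy (equivalently, highest information gain). This is a sketch, not the book's reference implementation:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def cond_entropy(rows, labels, attr):
    # Weighted entropy of the class labels after partitioning on attr.
    n = len(rows)
    h = 0.0
    for v in set(r[attr] for r in rows):
        sub = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        h += len(sub) / n * entropy(sub)
    return h

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                       # all samples in one class
        return labels[0]
    if not attrs:                                   # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = min(attrs, key=lambda a: cond_entropy(rows, labels, a))
    tree = {"attr": best, "branches": {}}
    for v in set(r[best] for r in rows):            # partition recursively
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [lab for r, lab in zip(rows, labels) if r[best] == v]
        tree["branches"][v] = build_tree(sub_rows, sub_labels,
                                         [a for a in attrs if a != best])
    return tree

# e.g. tree = build_tree(rows, labels, ["age", "income", "student", "credit_rating"])
# where rows is a list of dicts and labels the matching buys_computer values.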
Note to other teachers and users of these slides.
Andrew would be delighted if you found this source
material useful in giving your own lectures. Feel free
to use these slides verbatim, or to modify them to fit
your own needs. PowerPoint originals are available. If
you make use of a significant portion of these slides in
your own lecture, please include this message, or the
following link to the source repository of Andrew’s
tutorials: http://www.cs.cmu.edu/~awm/tutorials .
Comments and corrections gratefully received.
Information gain slides adapted from
Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599
2017年5月22日星期
一
Data Mining: Concepts and
Techniques
14
Bits
You are watching a set of independent random samples of X
You see that X has four possible values
P(X=A) = 1/4
P(X=B) = 1/4
P(X=C) = 1/4
P(X=D) = 1/4
So you might see: BAACBADCDADDDA…
You transmit data over a binary serial link. You can encode each reading
with two bits (e.g. A = 00, B = 01, C = 10, D = 11)
0100001001001110110011111100…
2017年5月22日星期一
Data Mining: Concepts and Techniques
15
Fewer Bits
Someone tells you that the probabilities are not equal
P(X=A) = 1/2
P(X=B) = 1/4
P(X=C) = 1/8
P(X=D) = 1/8
It’s possible…
…to invent a coding for your transmission that only uses
1.75 bits on average per symbol. How?
2017年5月22日星期一
Data Mining: Concepts and Techniques
16
Fewer Bits
Someone tells you that the probabilities are not equal
P(X=A) = 1/2
P(X=B) = 1/4
P(X=C) = 1/8
P(X=D) = 1/8
It’s possible…
…to invent a coding for your transmission that only uses
1.75 bits on average per symbol. How?
A → 0,  B → 10,  C → 110,  D → 111
(This is just one of several ways.)
2017年5月22日星期一
Data Mining: Concepts and Techniques
17
Fewer Bits
Suppose there are three equally likely values…
P(X=B) = 1/3
P(X=C) = 1/3
P(X=D) = 1/3
Here’s a naïve coding, costing 2 bits per symbol
B → 00,  C → 01,  D → 10
Can you think of a coding that would need only 1.6 bits
per symbol on average?
In theory, it can in fact be done with 1.58496 bits per
symbol.
2017年5月22日星期一
Data Mining: Concepts and Techniques
18
General Case
Suppose X can have one of m values V1, V2, …, Vm, with P(X=V1) = p1, P(X=V2) = p2, …, P(X=Vm) = pm.

What's the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It's

H(X) = -p1 log2 p1 - p2 log2 p2 - … - pm log2 pm = -Σ_{j=1}^{m} p_j log2 p_j

H(X) = the entropy of X.

"High Entropy" means X is from a uniform (boring) distribution.
"Low Entropy" means X is from a varied (peaks and valleys) distribution.
2017年5月22日星期一
Data Mining: Concepts and Techniques
19
General Case (continued)

For a high-entropy X, a histogram of the frequency distribution of values of X would be flat, and so the values sampled from it would be all over the place.
For a low-entropy X, a histogram of the frequency distribution of values of X would have many lows and one or two highs, and so the values sampled from it would be more predictable.
2017年5月22日星期一
Data Mining: Concepts and Techniques
21
Entropy in a nut-shell

Low Entropy: the values (locations of soup) are sampled entirely from within the soup bowl.
High Entropy: the values (locations of soup) are unpredictable... almost uniformly sampled throughout our dining room.
Data Mining: Concepts and Techniques
23
Exercise:



Suppose 100 customers have two classes:
“Buy Computer” and “Not Buy Computer”.
Uniform distribution: 50 buy. Entropy?
Skewed distribution: 100 buy. Entropy?
2017年5月22日星期一
Data Mining: Concepts and Techniques
24
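A quick way to check the bit counts above, and the exercise on this slide, is to compute the entropy directly. A small sketch (entropy in bits):

from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

print(H([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits   (four equally likely symbols)
print(H([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits  (the skewed A/B/C/D distribution)
print(H([1/3, 1/3, 1/3]))            # 1.58496... bits (three equally likely values)
print(H([0.5, 0.5]))                 # 1.0 bit   (exercise: 50 of 100 customers buy)
print(H([1.0]))                      # 0.0 bits  (exercise: all 100 customers buy)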
Specific Conditional Entropy
Suppose I'm trying to predict output Y and I have input X.

X = College Major
Y = Likes "Gladiator"

X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes

Let's assume this reflects the true probabilities. E.g., from this data we estimate:
• P(LikeG = Yes) = 0.5
• P(Major = Math & LikeG = No) = 0.25
• P(Major = Math) = 0.5
• P(LikeG = Yes | Major = History) = 0

Note:
• H(X) = 1.5
• H(Y) = 1
Data Mining: Concepts and Techniques
25
Specific Conditional Entropy
X = College Major
Y = Likes "Gladiator"

X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes

Definition of Specific Conditional Entropy:
H(Y|X=v) = the entropy of Y among only those records in which X has value v
Data Mining: Concepts and Techniques
26
Specific Conditional Entropy
X = College Major
Y = Likes "Gladiator"
(same table as above)

Definition of Specific Conditional Entropy:
H(Y|X=v) = the entropy of Y among only those records in which X has value v

Example:
• H(Y|X=Math) = 1
• H(Y|X=History) = 0
• H(Y|X=CS) = 0
2017年5月22日星期一
Data Mining: Concepts and Techniques
27
Conditional Entropy
X = College Major
Y = Likes "Gladiator"
(same table as above)

Definition of Conditional Entropy:
H(Y|X) = the average specific conditional entropy of Y
= if you choose a record at random, what will be the conditional entropy of Y, conditioned on that row's value of X
= expected number of bits to transmit Y if both sides will know the value of X
= Σ_j P(X=v_j) H(Y | X=v_j)
2017年5月22日星期一
Data Mining: Concepts and Techniques
28
Conditional Entropy
X = College Major
Y = Likes "Gladiator"
(same table as above)

Definition of (general) Conditional Entropy:
H(Y|X) = the average specific conditional entropy of Y = Σ_j P(X=v_j) H(Y | X=v_j)

Example:
v_j      P(X=v_j)   H(Y | X=v_j)
Math     0.5        1
History  0.25       0
CS       0.25       0

H(Y|X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5
Data Mining: Concepts and Techniques
29
Information Gain
X = College Major
Y = Likes "Gladiator"
(same table as above)

Definition of Information Gain:
IG(Y|X) = I must transmit Y. How many bits on average would it save me if both ends of the line knew X?
IG(Y|X) = H(Y) - H(Y|X)

Example:
• H(Y) = 1
• H(Y|X) = 0.5
• Thus IG(Y|X) = 1 - 0.5 = 0.5
Data Mining: Concepts and Techniques
30
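A small check of these numbers on the eight-record Major/Gladiator table. A sketch, not part of the slides:

from collections import Counter
from math import log2

data = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

H_Y = entropy([y for _, y in data])                          # 1.0
H_Y_given_X = sum(
    (sum(1 for x, _ in data if x == v) / len(data)) *
    entropy([y for x, y in data if x == v])
    for v in set(x for x, _ in data))                        # 0.5
print(H_Y, H_Y_given_X, H_Y - H_Y_given_X)                   # 1.0 0.5 0.5  (IG = 0.5)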
What is Information Gain used for?
Suppose you are trying to predict whether someone is going to live past 80 years. From historical data you might find…
•IG(LongLife | HairColor) = 0.01
•IG(LongLife | Smoker) = 0.2
•IG(LongLife | Gender) = 0.25
•IG(LongLife | LastDigitOfSSN) = 0.00001
IG tells you how interesting a 2-d contingency table is
going to be.
2017年5月22日星期一
Data Mining: Concepts and Techniques
31
Conditional entropy H(C|age)
age     buy  no
<=30    2    3
31…40   4    0
>40     3    2

H(C | age<=30) = 2/5 * lg(5/2) + 3/5 * lg(5/3) = 0.971
H(C | age in 31…40) = 1 * lg 1 + 0 * lg(1/0) = 0
H(C | age>40) = 3/5 * lg(5/3) + 2/5 * lg(5/2) = 0.971

H(C | age) = 5/14 * H(C | age<=30) + 4/14 * H(C | age in 31…40) + 5/14 * H(C | age>40) = 0.694
2017年5月22日星期一
Data Mining: Concepts and Techniques
32
Select the attribute with lowest
conditional entropy





H(C|age) = 0.694
H(C|income) = 0.911
H(C|student) = 0.789
H(C|credit_rating) = 0.892

Select "age" to be the tree root!

age?
  <=30: student?  (no → no, yes → yes)
  31…40: yes
  >40: credit rating?  (excellent → no, fair → yes)
33
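A short script that recomputes H(C | attribute) for all four attributes of the 14-record training set, as a check on the four values above (illustrative only):

from collections import Counter
from math import log2

rows = [
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31...40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31...40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31...40","medium","no","excellent","yes"),
    ("31...40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
attrs = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

for i, a in enumerate(attrs):
    h = 0.0
    for v in set(r[i] for r in rows):
        sub = [r[-1] for r in rows if r[i] == v]        # class labels in this partition
        h += len(sub) / len(rows) * entropy(sub)
    print(a, round(h, 3))   # age 0.694, income 0.911, student 0.789, credit_rating 0.892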
Goodness in Decision Tree Induction




relatively faster learning speed (than other classification
methods)
convertible to simple and easy to understand classification
rules
can use SQL queries for accessing databases
comparable classification accuracy with other methods
2017年5月22日星期一
Data Mining: Concepts and Techniques
34
Scalable Decision Tree Induction
Methods in Data Mining Studies




SLIQ (EDBT’96 — Mehta et al.)
 builds an index for each attribute and only class list and
the current attribute list reside in memory
SPRINT (VLDB’96 — J. Shafer et al.)
 constructs an attribute list data structure
PUBLIC (VLDB’98 — Rastogi & Shim)
 integrates tree splitting and tree pruning: stop growing
the tree earlier
RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
 separates the scalability aspects from the criteria that
determine the quality of the tree
 builds an AVC-list (attribute, value, class label)
2017年5月22日星期一
Data Mining: Concepts and Techniques
35
Visualization of a Decision Tree in
SGI/MineSet 3.0
2017年5月22日星期一
Data Mining: Concepts and Techniques
36
Content






What is classification?
Decision tree
Naïve Bayesian Classifier
Bayesian Networks
Neural Networks
Support Vector Machines (SVM)
2017年5月22日星期一
Data Mining: Concepts and Techniques
37
Bayesian Classification: Why?




Probabilistic learning: Calculate explicit probabilities for
hypothesis, among the most practical approaches to
certain types of learning problems
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with observed
data.
Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other methods
can be measured
2017年5月22日星期一
Data Mining: Concepts and Techniques
38
Bayesian Classification






X: a data sample whose class label is unknown, e.g.
X =(Income=medium, Credit_rating=Fair, Age=40).
Hi: a hypothesis that a record belongs to class Ci, e.g.
Hi = a record belongs to the “buy computer” class.
P(Hi), P(X): probabilities.
P(Hi|X): a conditional probability: among all records with medium income and fair credit rating, what's the probability of buying a computer?
This is what we need for classification! Given X, P(Hi|X) tells us the probability that it belongs to each class.
What if we need to determine a single class for X?
2017年5月22日星期一
Data Mining: Concepts and Techniques
39
Bayesian Theorem




Another concept, P(X|Hi): the probability of observing the sample X given that the hypothesis holds. E.g., among all people who buy a computer, what percentage has the same attribute values as X?
We know P(X ^ Hi) = P(Hi|X) P(X) = P(X|Hi) P(Hi), so

P(Hi|X) = P(X|Hi) P(Hi) / P(X)

We should assign X to the class Ci for which P(Hi|X) is maximized, which is equivalent to maximizing P(X|Hi) P(Hi), since P(X) is the same for every class.
2017年5月22日星期一
Data Mining: Concepts and Techniques
40
Basic Idea





Read the training data,
 Compute P(Hi) for each class.
 Compute P(Xk|Hi) for each distinct instance of X among
records in class Ci.
To predict the class for a new data X,
 for each class, compute P(X|Hi) P(Hi).
 return the class which has the largest value.
Any Problem?
Too many combinations of Xk. E.g., 50 ages, 50 credit ratings, 50 income levels → 125,000 combinations of Xk!
2017年5月22日星期一
Data Mining: Concepts and Techniques
41
Naïve Bayes Classifier

A simplified assumption: attributes are conditionally
independent:
P(X | Ci) = Π_{k=1}^{n} P(x_k | Ci)



The probability of observing, say, two attribute values y1 and y2 together, given that the class is C, is the product of the probabilities of each value taken separately given the same class: P([y1, y2] | C) = P(y1 | C) * P(y2 | C)
No dependence relation between attributes
Greatly reduces the number of probabilities to maintain.
2017年5月22日星期一
Data Mining: Concepts and Techniques
42
Sample quiz questions
1. What data does the naïve Bayesian classifier maintain?
2. Given X = (age<=30, Income=medium, Student=yes, Credit_rating=Fair), buy or not buy?
2017年5月22日星期一
(Training data: the same 14-record buys_computer table shown earlier.)
43
Naïve Bayesian Classifier: Example

Compute P(X|Ci) for each class
P(age=“<30” | buys_computer=“yes”) = 2/9=0.222
P(age=“<30” | buys_computer=“no”) = 3/5 =0.6
P(income=“medium” | buys_computer=“yes”)= 4/9 =0.444
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“yes)= 6/9 =0.667
P(student=“yes” | buys_computer=“no”)= 1/5=0.2
P(credit_rating=“fair” | buys_computer=“yes”)=6/9=0.667
P(credit_rating=“fair” | buys_computer=“no”)=2/5=0.4
X=(age<=30 ,income =medium, student=yes,credit_rating=fair)
P(X|Ci): P(X|buys_computer="yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer=“no”)= 0.6 x 0.4 x 0.2 x 0.4 =0.019
P(X|Ci)*P(Ci ) : P(X|buys_computer=“yes”) * P(buys_computer=“yes”)=0.028
P(X|buys_computer=“no”) * P(buys_computer=“no”)=0.007
X belongs to class “buys_computer=yes”
2017年5月22日星期一
Pitfall: forget P(Ci)
Data Mining: Concepts and Techniques
44
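A quick check of the arithmetic above. A sketch; the fractions are the counts from the 14-record table:

p_yes, p_no = 9/14, 5/14
likelihood_yes = (2/9) * (4/9) * (6/9) * (6/9)   # P(X | yes) ~= 0.044
likelihood_no  = (3/5) * (2/5) * (1/5) * (2/5)   # P(X | no)  ~= 0.019
print(likelihood_yes * p_yes)   # ~= 0.028
print(likelihood_no * p_no)     # ~= 0.007  -> predict buys_computer = yes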
Naïve Bayesian Classifier: Comments



Advantages :
 Easy to implement
 Good results obtained in most of the cases
Disadvantages
 Assumption: class conditional independence , therefore loss of
accuracy
 Practically, dependencies exist among variables
E.g., in hospitals: a patient's profile (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.)
 Dependencies among these cannot be modeled by Naïve Bayesian
Classifier
How to deal with these dependencies?
 Bayesian Belief Networks
2017年5月22日星期一
Data Mining: Concepts and Techniques
45
Content






What is classification?
Decision tree
Naïve Bayesian Classifier
Bayesian Networks
Neural Networks
Support Vector Machines (SVM)
2017年5月22日星期一
Data Mining: Concepts and Techniques
46
Note to other teachers and users of these slides.
Andrew would be delighted if you found this source
material useful in giving your own lectures. Feel free
to use these slides verbatim, or to modify them to fit
your own needs. PowerPoint originals are available. If
you make use of a significant portion of these slides in
your own lecture, please include this message, or the
following link to the source repository of Andrew’s
tutorials: http://www.cs.cmu.edu/~awm/tutorials .
Comments and corrections gratefully received.
Bayesian Networks slides adapted from
Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599
2017年5月22日星期
一
Data Mining: Concepts and
Techniques
47
What we’ll discuss





Recall the numerous and dramatic benefits of
Joint Distributions for describing uncertain worlds
Reel with terror at the problem with using Joint
Distributions
Discover how Bayes Net methodology allows us to
build Joint Distributions in manageable chunks
Discover there’s still a lurking problem…
…Start to solve that problem
2017年5月22日星期一
Data Mining: Concepts and Techniques
48
Why this matters
In Andrew's opinion, the most important technology in the Machine Learning / AI field to have emerged in the last 10 years.
• A clean, clear, manageable language and methodology for expressing what you're certain and uncertain about
• Already, many practical applications in medicine, factories, helpdesks:
  P(this problem | these symptoms)
  anomalousness of this observation
  choosing next diagnostic test | these observations
(Sidebar applications: Active Data Collection, Inference, Anomaly Detection)
2017年5月22日星期一
Data Mining: Concepts and Techniques
49
Ways to deal with Uncertainty






Three-valued logic: True / False / Maybe
Fuzzy logic (truth values between 0 and 1)
Non-monotonic reasoning (especially focused on
Penguin informatics)
Dempster-Shafer theory (and an extension known
as quasi-Bayesian theory)
Possibilistic Logic
Probability
2017年5月22日星期一
Data Mining: Concepts and Techniques
50
Discrete Random Variables





A is a Boolean-valued random variable if A
denotes an event, and there is some degree of
uncertainty as to whether A occurs.
Examples
A = The US president in 2023 will be male
A = You wake up tomorrow with a headache
A = You have Ebola
2017年5月22日星期一
Data Mining: Concepts and Techniques
51
Probabilities



We write P(A) as “the fraction of possible worlds
in which A is true”
We could at this point spend 2 hours on the
philosophy of this.
But we won’t.
2017年5月22日星期一
Data Mining: Concepts and Techniques
52
Visualizing A
Event space of all possible worlds: its total area is 1.
Worlds in which A is true: P(A) = area of the reddish oval.
Worlds in which A is false: the rest of the event space.
2017年5月22日星期一
Data Mining: Concepts and Techniques
53
Interpreting the axioms




0 <= P(A) <= 1
P(True) = 1
P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
The area of A can’t get
any smaller than 0
And a zero area would
mean no world could
ever have A true
2017年5月22日星期一
Data Mining: Concepts and Techniques
54
Interpreting the axioms




0 <= P(A) <= 1
P(True) = 1
P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
The area of A can’t get
any bigger than 1
And an area of 1 would
mean all worlds will have
A true
2017年5月22日星期一
Data Mining: Concepts and Techniques
55
Interpreting the axioms




0 <= P(A) <= 1
P(True) = 1
P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
A
B
2017年5月22日星期一
Data Mining: Concepts and Techniques
56
Interpreting the axioms




0 <= P(A) <= 1
P(True) = 1
P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
(Venn diagram: the area of "A or B" equals area(A) + area(B) minus the double-counted overlap P(A and B). Simple addition and subtraction.)
2017年5月22日星期一
Data Mining: Concepts and Techniques
57
These Axioms are Not to be Trifled With

There have been attempts to do different
methodologies for uncertainty





Fuzzy Logic
Three-valued logic
Dempster-Shafer
Non-monotonic reasoning
But the axioms of probability are the only system
with this property:
If you gamble using them you can’t be unfairly exploited by an
opponent using some other system [de Finetti 1931]
2017年5月22日星期一
Data Mining: Concepts and Techniques
58
Theorems from the Axioms


0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
From these we can prove:
P(not A) = P(~A) = 1-P(A)

How?
2017年5月22日星期一
Data Mining: Concepts and Techniques
59
Another important theorem


0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
From these we can prove:
P(A) = P(A ^ B) + P(A ^ ~B)

How?
2017年5月22日星期一
Data Mining: Concepts and Techniques
60
Conditional Probability

P(A|B) = Fraction of worlds in which B is true that
also have A true
H = “Have a headache”
F = “Coming down with Flu”
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
F
H
2017年5月22日星期一
"Headaches are rare and flu is rarer, but if you're coming down with flu there's a 50-50 chance you'll have a headache."
Data Mining: Concepts and Techniques
61
Conditional Probability
P(H|F) = Fraction of flu-inflicted
worlds in which you have a
headache
F
H
= (#worlds with flu and headache) / (#worlds with flu)
= (area of "H and F" region) / (area of "F" region)
= P(H ^ F) / P(F)

H = "Have a headache"
F = "Coming down with Flu"
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
Data Mining: Concepts and Techniques
62
Definition of Conditional Probability
P(A|B) = P(A ^ B) / P(B)

Corollary: The Chain Rule
P(A ^ B) = P(A|B) P(B)
2017年5月22日星期一
Data Mining: Concepts and Techniques
63
Bayes Rule
P(B|A) = P(A ^ B) / P(A) = P(A|B) P(B) / P(A)

This is Bayes Rule.

Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418
2017年5月22日星期一
Data Mining: Concepts and Techniques
64
Using Bayes Rule to Gamble
R R B B
$1.00
The “Win” envelope has a
dollar and four beads in
it
R B B
The “Lose” envelope
has three beads and
no money
Trivial question: someone draws an envelope at random and offers to
sell it to you. How much should you pay?
2017年5月22日星期一
Data Mining: Concepts and Techniques
65
Using Bayes Rule to Gamble
$1.00
The “Win” envelope has a
dollar and four beads in
it
The “Lose” envelope
has three beads and
no money
Interesting question: before deciding, you are allowed to see one bead
drawn from the envelope.
Suppose it’s black: How much should you pay?
Suppose it’s red: How much should you pay?
2017年5月22日星期一
Data Mining: Concepts and Techniques
66
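One way to work the "interesting question" numerically with Bayes rule, assuming the Win envelope holds two red and two black beads (R R B B) and the Lose envelope one red and two black (R B B), each envelope drawn with probability 1/2. A sketch:

p_win = p_lose = 0.5
p_black_given_win, p_black_given_lose = 2/4, 2/3

# Bayes rule: P(Win | black) = P(black | Win) P(Win) / P(black)
p_black = p_black_given_win * p_win + p_black_given_lose * p_lose
print(p_black_given_win * p_win / p_black)   # ~= 0.4286 -> pay up to about $0.43

p_red_given_win, p_red_given_lose = 2/4, 1/3
p_red = p_red_given_win * p_win + p_red_given_lose * p_lose
print(p_red_given_win * p_win / p_red)       # 0.6 -> pay up to about $0.60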
Another Example




Your friend told you that she has two children (not twins).
 Probability that both children are male?
You asked her whether she has at least one son,
and she said yes.
 Probability that both children are male?
You visited her house and saw one son of hers.
 Probability that both children are male?
Note: P(SeeASon|OneSon) and P(TellASon|OneSon)
are different!
2017年5月22日星期一
Data Mining: Concepts and Techniques
67
Multivalued Random Variables



Suppose A can take on more than 2 values
A is a random variable with arity k if it can take
on exactly one value out of {v1,v2, .. vk}
Thus…
P(A = v_i ∧ A = v_j) = 0 if i ≠ j
P(A = v_1 ∨ A = v_2 ∨ … ∨ A = v_k) = 1
2017年5月22日星期一
Data Mining: Concepts and Techniques
68
An easy fact about Multivalued Random Variables:
 Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)

And assuming that A obeys:
P(A = v_i ∧ A = v_j) = 0 if i ≠ j
P(A = v_1 ∨ A = v_2 ∨ … ∨ A = v_k) = 1

It's easy to prove that
P(A = v_1 ∨ A = v_2 ∨ … ∨ A = v_i) = Σ_{j=1}^{i} P(A = v_j)

And thus we can prove
Σ_{j=1}^{k} P(A = v_j) = 1
Data Mining: Concepts and Techniques
70
Another fact about Multivalued Random Variables:
 Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)

And assuming that A obeys:
P(A = v_i ∧ A = v_j) = 0 if i ≠ j
P(A = v_1 ∨ A = v_2 ∨ … ∨ A = v_k) = 1

It's easy to prove that
P(B ∧ [A = v_1 ∨ A = v_2 ∨ … ∨ A = v_i]) = Σ_{j=1}^{i} P(B ∧ A = v_j)

And thus we can prove
P(B) = Σ_{j=1}^{k} P(B ∧ A = v_j)
2017年5月22日星期一
Data Mining: Concepts and Techniques
72
More General Forms of Bayes Rule
P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|~A) P(~A) ]

P(A | B ^ X) = P(B | A ^ X) P(A ^ X) / P(B ^ X)
2017年5月22日星期一
Data Mining: Concepts and Techniques
73
More General Forms of Bayes Rule
P(A = v_i | B) = P(B | A = v_i) P(A = v_i) / Σ_{k=1}^{n_A} P(B | A = v_k) P(A = v_k)
74
Useful Easy-to-prove facts
P(A|B) + P(~A|B) = 1

Σ_{k=1}^{n_A} P(A = v_k | B) = 1
Data Mining: Concepts and Techniques
75
From Probability to Bayesian Net
• Suppose there are some diseases, symptoms
and related facts.
• E.g. flu, headache, fever, lung cancer, smoker.
• We are interested to know some (conditional or
unconditional) probabilities such as
• P( flu )
• P( lung cancer | smoker )
• P( flu | headache ^ ~fever )
• What shall we do?
2017年5月22日星期一
Data Mining: Concepts and Techniques
76
The Joint Distribution
Example: Boolean
variables A, B, C
Recipe for making a joint distribution
of M variables:
2017年5月22日星期一
Data Mining: Concepts and Techniques
77
The Joint Distribution
Recipe for making a joint distribution
of M variables:
1. Make a truth table listing all
combinations of values of your
variables (if there are M Boolean variables then the table will have 2^M rows).

Example: Boolean variables A, B, C

A  B  C
0  0  0
0  0  1
0  1  0
0  1  1
1  0  0
1  0  1
1  1  0
1  1  1
Data Mining: Concepts and Techniques
78
The Joint Distribution
Recipe for making a joint distribution
of M variables:
1. Make a truth table listing all
combinations of values of your
variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.

Example: Boolean variables A, B, C

A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10
Data Mining: Concepts and Techniques
79
The Joint Distribution
Recipe for making a joint distribution
of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.

Example: Boolean variables A, B, C

A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10

(Venn diagram: the same eight probabilities laid out over the regions of A, B and C.)
80
Using the
Joint
Once you have the JD you can ask for the probability of any logical expression involving your attributes:

P(E) = Σ_{rows matching E} P(row)
Data Mining: Concepts and Techniques
81
Using the
Joint
P(Poor ^ Male) = 0.4654

P(E) = Σ_{rows matching E} P(row)
2017年5月22日星期一
Data Mining: Concepts and Techniques
82
Using the
Joint
P(Poor) = 0.7604

P(E) = Σ_{rows matching E} P(row)
2017年5月22日星期一
Data Mining: Concepts and Techniques
83
Inference
with the Joint
P(E1 | E2) = P(E1 ^ E2) / P(E2) = [ Σ_{rows matching E1 and E2} P(row) ] / [ Σ_{rows matching E2} P(row) ]
2017年5月22日星期一
Data Mining: Concepts and Techniques
84
Inference
with the Joint
P(E1 | E2) = P(E1 ^ E2) / P(E2) = [ Σ_{rows matching E1 and E2} P(row) ] / [ Σ_{rows matching E2} P(row) ]

P(Male | Poor) = 0.4654 / 0.7604 = 0.612
2017年5月22日星期一
Data Mining: Concepts and Techniques
85
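A tiny illustration of "inference with the joint", reusing the A, B, C joint distribution from the earlier slide (a sketch only):

joint = {
    (0,0,0): 0.30, (0,0,1): 0.05, (0,1,0): 0.10, (0,1,1): 0.05,
    (1,0,0): 0.05, (1,0,1): 0.10, (1,1,0): 0.25, (1,1,1): 0.10,
}

def P(match):
    """P(E) = sum of P(row) over rows matching the predicate E."""
    return sum(p for row, p in joint.items() if match(row))

def P_cond(match1, match2):
    """P(E1 | E2) = P(E1 and E2) / P(E2)."""
    return P(lambda r: match1(r) and match2(r)) / P(match2)

print(P(lambda r: r[0] == 1))                             # P(A)     = 0.5
print(P_cond(lambda r: r[1] == 1, lambda r: r[0] == 1))   # P(B | A) = 0.35 / 0.5 = 0.7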
Joint distributions

Good news: Once you have a joint distribution, you can ask important questions about stuff that involves a lot of uncertainty.

Bad news: Impossible to create for more than about ten attributes because there are so many numbers needed when you build the damn thing.
Data Mining: Concepts and Techniques
86
Using fewer numbers
Suppose there are two events:


M: Manuela teaches the class (otherwise it’s Andrew)
S: It is sunny
The joint p.d.f. for these events contains four entries.
If we want to build the joint p.d.f. we’ll have to invent those four numbers.
OR WILL WE??


We don’t have to specify with bottom level conjunctive
events such as P(~M^S) IF…
…instead it may sometimes be more convenient for us
to specify things like: P(M), P(S).
But just P(M) and P(S) don’t derive the joint distribution. So you can’t
answer all questions.
2017年5月22日星期一
Data Mining: Concepts and Techniques
87
Independence
“The sunshine levels do not depend on and do not influence
who is teaching.”
This can be specified very simply:
P(S | M) = P(S)
This is a powerful statement!
It required extra domain knowledge. A different kind of
knowledge than numerical probabilities. It needed an
understanding of causation.
2017年5月22日星期一
Data Mining: Concepts and Techniques
89
Independence
From P(S | M) = P(S), the rules of probability imply (can you prove these?):

P(~S | M) = P(~S)
P(M | S) = P(M)
P(M ^ S) = P(M) P(S)
P(~M ^ S) = P(~M) P(S),  P(M ^ ~S) = P(M) P(~S),  P(~M ^ ~S) = P(~M) P(~S)
2017年5月22日星期一
Data Mining: Concepts and Techniques
90
Independence
From P(S | M) = P(S), the rules of probability imply (can you prove these?):

P(~S | M) = P(~S)
P(M | S) = P(M)
P(M ^ S) = P(M) P(S)
P(~M ^ S) = P(~M) P(S),  P(M ^ ~S) = P(M) P(~S),  P(~M ^ ~S) = P(~M) P(~S)

And in general: P(M=u ^ S=v) = P(M=u) P(S=v) for each of the four combinations of u = True/False and v = True/False.
2017年5月22日星期一
Data Mining: Concepts and Techniques
91
Independence
We've stated:
P(M) = 0.6
P(S) = 0.3
P(S | M) = P(S)

From these statements, we can derive the full joint pdf:

S  M  Prob
T  T  ?
T  F  ?
F  T  ?
F  F  ?

And since we now have the joint pdf, we can make any queries we like.
2017年5月22日星期一
Data Mining: Concepts and Techniques
92
A more interesting case



M : Manuela teaches the class
S : It is sunny
L : The lecturer arrives slightly late.
Assume both lecturers are sometimes delayed by bad weather. Andrew is
more likely to arrive later than Manuela.
2017年5月22日星期一
Data Mining: Concepts and Techniques
93
A more interesting case



M : Manuela teaches the class
S : It is sunny
L : The lecturer arrives slightly late.
Assume both lecturers are sometimes delayed by bad weather. Andrew is
more likely to arrive late than Manuela.
Let’s begin with writing down knowledge we’re happy about:
P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6
Lateness is not independent of the weather and is not independent of the
lecturer.
2017年5月22日星期一
Data Mining: Concepts and Techniques
94
A more interesting case



M : Manuela teaches the class
S : It is sunny
L : The lecturer arrives slightly late.
Assume both lecturers are sometimes delayed by bad weather. Andrew is
more likely to arrive late than Manuela.
Let’s begin with writing down knowledge we’re happy about:
P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6
Lateness is not independent of the weather and is not independent of the
lecturer.
We already know the Joint of S and M, so all we need now is
P(L  S=u, M=v)
in the 4 cases of u/v = True/False.
2017年5月22日星期一
Data Mining: Concepts and Techniques
95
A more interesting case



M : Manuela teaches the class
S : It is sunny
L : The lecturer arrives slightly late.
Assume both lecturers are sometimes delayed by bad weather. Andrew is
more likely to arrive late than Manuela.
P(S | M) = P(S)
P(S) = 0.3
P(M) = 0.6
P(L | M ^ S) = 0.05
P(L | M ^ ~S) = 0.1
P(L | ~M ^ S) = 0.1
P(L | ~M ^ ~S) = 0.2

Now we can derive a full joint p.d.f. with a "mere" six numbers instead of seven*
*Savings are larger for larger numbers of variables.
2017年5月22日星期一
Data Mining: Concepts and Techniques
96
A more interesting case



M : Manuela teaches the class
S : It is sunny
L : The lecturer arrives slightly late.
Assume both lecturers are sometimes delayed by bad weather. Andrew is
more likely to arrive late than Manuela.
P(S | M) = P(S)
P(S) = 0.3
P(M) = 0.6
P(L | M ^ S) = 0.05
P(L | M ^ ~S) = 0.1
P(L | ~M ^ S) = 0.1
P(L | ~M ^ ~S) = 0.2

Question: Express P(L=x ^ M=y ^ S=z) in terms that only need the above expressions, where x, y and z may each be True or False.
2017年5月22日星期一
Data Mining: Concepts and Techniques
97
A bit of notation
P(S | M) = P(S)
P(S) = 0.3
P(M) = 0.6
P(L | M^S) = 0.05
P(L | M^~S) = 0.1
P(L | ~M^S) = 0.1
P(L | ~M^~S) = 0.2

(Diagram: S → L ← M, with P(S)=0.3 attached to node S, P(M)=0.6 to node M, and the four P(L | …) values to node L.)
98
A bit of notation (continued)

P(S | M) = P(S)
P(S) = 0.3
P(M) = 0.6
P(L | M^S) = 0.05
P(L | M^~S) = 0.1
P(L | ~M^S) = 0.1
P(L | ~M^~S) = 0.2

(Diagram: S → L ← M)

Read the absence of an arrow between S and M to mean "it would not help me predict M if I knew the value of S".
Read the two arrows into L to mean that if I want to know the value of L it may help me to know M and to know S.
This kind of stuff will be thoroughly formalized later.
Data Mining: Concepts and Techniques
99
An even cuter trick
Suppose we have these three events:
 M : Lecture taught by Manuela
 L : Lecturer arrives late
 R : Lecture concerns robots
Suppose:
 Andrew has a higher chance of being late than Manuela.
 Andrew has a higher chance of giving robotics lectures.
What kind of independence can we find?
How about:
• P(L | M) = P(L) ?
• P(R | M) = P(R) ?
• P(L | R) = P(L) ?

2017年5月22日星期一
Data Mining: Concepts and Techniques
100
Conditional independence
Once you know who the lecturer is, then whether they arrive
late doesn’t affect whether the lecture concerns robots.
P(R | M, L) = P(R | M)  and  P(R | ~M, L) = P(R | ~M)
We express this in the following way:
“R and L are conditionally independent given M”
..which is also notated
by the following
diagram.
2017年5月22日星期一
M
L
R
Data Mining: Concepts and Techniques
Given knowledge of M,
knowing anything else in
the diagram won’t help
us with L, etc.
101
Conditional Independence formalized
R and L are conditionally independent given M if
for all x, y, z in {T, F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)

More generally:
Let S1, S2 and S3 be sets of variables.
Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets,
P(S1's assignments | S2's assignments & S3's assignments) = P(S1's assignments | S3's assignments)
2017年5月22日星期一
Data Mining: Concepts and Techniques
102
Example:
"Shoe-size is conditionally independent of Glove-size given height, weight and age"
means:
for all s, g, h, w, a:
P(ShoeSize=s | Height=h, Weight=w, Age=a) = P(ShoeSize=s | Height=h, Weight=w, Age=a, GloveSize=g)
2017年5月22日星期一
Data Mining: Concepts and Techniques
103
Example:
"Shoe-size is conditionally independent of Glove-size given height, weight and age"
does not mean:
for all s, g, h:
P(ShoeSize=s | Height=h) = P(ShoeSize=s | Height=h, GloveSize=g)
2017年5月22日星期一
Data Mining: Concepts and Techniques
104
Conditional independence

(Diagram: L ← M → R; 'R and L conditionally independent given M')

We can write down P(M). And then, since we know L is only directly influenced by M, we can write down the values of P(L|M) and P(L|~M) and know we've fully specified L's behavior. Ditto for R.

P(M) = 0.6
P(L | M) = 0.085
P(L | ~M) = 0.17
P(R | M) = 0.3
P(R | ~M) = 0.6
2017年5月22日星期一
Data Mining: Concepts and Techniques
105
Conditional independence
(Diagram: L ← M → R)

P(M) = 0.6
P(L | M) = 0.085
P(L | ~M) = 0.17
P(R | M) = 0.3
P(R | ~M) = 0.6

Conditional independence: P(R | M, L) = P(R | M),  P(R | ~M, L) = P(R | ~M)

Again, we can obtain any member of the joint prob dist that we desire:
P(L=x ^ R=y ^ M=z) = P(M=z) P(L=x | M=z) P(R=y | M=z)
2017年5月22日星期一
Data Mining: Concepts and Techniques
106
Assume five variables
T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny




T only directly influenced by L (i.e. T is conditionally
independent of R,M,S given L)
L only directly influenced by M and S (i.e. L is conditionally
independent of R given M & S)
R only directly influenced by M (i.e. R is conditionally
independent of L,S, given M)
M and S are independent
2017年5月22日星期一
Data Mining: Concepts and Techniques
107
T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny
Making a Bayes net
M
S
L
R
T
Step One: add variables.
• Just choose the variables you’d like to be included in the
net.
2017年5月22日星期一
Data Mining: Concepts and Techniques
108
T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny
Making a Bayes net
M
S
L
R
T
Step Two: add links.
• The link structure must be acyclic.
• If node X is given parents Q1,Q2,..Qn you are promising
that any variable that’s a non-descendent of X is
conditionally independent of X given {Q1,Q2,..Qn}
2017年5月22日星期一
Data Mining: Concepts and Techniques
109
T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny
Making a Bayes net
P(S)=0.3
P(M)=0.6
P(L | M^S)=0.05,  P(L | M^~S)=0.1,  P(L | ~M^S)=0.1,  P(L | ~M^~S)=0.2
P(R | M)=0.3,  P(R | ~M)=0.6
P(T | L)=0.3,  P(T | ~L)=0.8

(Diagram: S → L ← M,  M → R,  L → T)
Step Three: add a probability table for each node.
• The table for node X must list P(X|Parent Values) for each
possible combination of parent values
2017年5月22日星期一
Data Mining: Concepts and Techniques
110
T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny
Making a Bayes net
P(S)=0.3
P(M)=0.6
P(L | M^S)=0.05,  P(L | M^~S)=0.1,  P(L | ~M^S)=0.1,  P(L | ~M^~S)=0.2
P(R | M)=0.3,  P(R | ~M)=0.6
P(T | L)=0.3,  P(T | ~L)=0.8

(Diagram: S → L ← M,  M → R,  L → T)
• Two unconnected variables may still be correlated.
• Each node is conditionally independent of all non-descendants in the tree, given its parents.
• You can deduce many other conditional independence relations from a Bayes net. See the next lecture.
2017年5月22日星期一
Data Mining: Concepts and Techniques
111
Bayes Nets Formalized
A Bayes net (also called a belief network) is an augmented
directed acyclic graph, represented by the pair V , E where:


V is a set of vertices.
E is a set of directed edges joining vertices. No
loops of any length are allowed.
Each vertex in V contains the following information:


The name of a random variable
A probability distribution table indicating how
the probability of this variable’s values depends
on all possible combinations of parental values.
2017年5月22日星期一
Data Mining: Concepts and Techniques
112
Building a Bayes Net
1. Choose a set of relevant variables.
2. Choose an ordering for them.
3. Assume they're called X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc.)
4. For i = 1 to m:
   1. Add the Xi node to the network.
   2. Set Parents(Xi) to be a minimal subset of {X1 … Xi-1} such that we have conditional independence of Xi and all other members of {X1 … Xi-1} given Parents(Xi).
   3. Define the probability table of P(Xi = k | Assignments of Parents(Xi)).
2017年5月22日星期一
Data Mining: Concepts and Techniques
113
Computing a Joint Entry
How to compute an entry in a joint distribution?
E.g.: What is P(S ^ ~M ^ L ^ ~R ^ T)?

P(S)=0.3
P(M)=0.6
P(L | M^S)=0.05,  P(L | M^~S)=0.1,  P(L | ~M^S)=0.1,  P(L | ~M^~S)=0.2
P(R | M)=0.3,  P(R | ~M)=0.6
P(T | L)=0.3,  P(T | ~L)=0.8

(Diagram: S → L ← M,  M → R,  L → T)
2017年5月22日星期一
Data Mining: Concepts and Techniques
114
Computing with Bayes Net
P(S)=0.3
P(M)=0.6
P(L | M^S)=0.05,  P(L | M^~S)=0.1,  P(L | ~M^S)=0.1,  P(L | ~M^~S)=0.2
P(R | M)=0.3,  P(R | ~M)=0.6
P(T | L)=0.3,  P(T | ~L)=0.8

(Diagram: S → L ← M,  M → R,  L → T)

P(T ^ ~R ^ L ^ ~M ^ S)
= P(T | ~R ^ L ^ ~M ^ S) * P(~R ^ L ^ ~M ^ S)
= P(T | L) * P(~R ^ L ^ ~M ^ S)
= P(T | L) * P(~R | L ^ ~M ^ S) * P(L ^ ~M ^ S)
= P(T | L) * P(~R | ~M) * P(L ^ ~M ^ S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M ^ S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M | S) * P(S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M) * P(S)
2017年5月22日星期一
Data Mining: Concepts and Techniques
115
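A small numeric check of this derivation, using the CPT values from the slide (a sketch, not part of the original material):

P_S, P_M = 0.3, 0.6
P_L = {(True, True): 0.05, (True, False): 0.1, (False, True): 0.1, (False, False): 0.2}  # P(L | M, S)
P_R = {True: 0.3, False: 0.6}   # P(R | M)
P_T = {True: 0.3, False: 0.8}   # P(T | L)

def joint(s, m, l, r, t):
    """P(S=s ^ M=m ^ L=l ^ R=r ^ T=t) via the factorization given by the net."""
    p  = P_S if s else 1 - P_S
    p *= P_M if m else 1 - P_M
    pl = P_L[(m, s)]
    p *= pl if l else 1 - pl
    pr = P_R[m]
    p *= pr if r else 1 - pr
    pt = P_T[l]
    p *= pt if t else 1 - pt
    return p

# P(T ^ ~R ^ L ^ ~M ^ S) = 0.3 * 0.4 * 0.1 * 0.4 * 0.3 = 0.00144
print(joint(s=True, m=False, l=True, r=False, t=True))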
The general case
P(X1=x1 ^ X2=x2 ^ … ^ Xn-1=xn-1 ^ Xn=xn)
= P(Xn=xn ^ Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 | Xn-2=xn-2 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-2=xn-2 ^ … ^ X2=x2 ^ X1=x1)
= …
= Π_{i=1}^{n} P(Xi=xi | Xi-1=xi-1 ^ … ^ X1=x1)
= Π_{i=1}^{n} P(Xi=xi | Assignments of Parents(Xi))

So any entry in the joint pdf table can be computed. And so any conditional probability can be computed.
2017年5月22日星期一
Data Mining: Concepts and Techniques
116
Where are we now?




We have a methodology for building Bayes nets.
We don’t require exponential storage to hold our probability table.
Only exponential in the maximum number of parents of any node.
We can compute probabilities of any given assignment of truth values
to the variables. And we can do it in time linear with the number of
nodes.
So we can also compute answers to any questions.
P(S)=0.3
P(M)=0.6
P(L | M^S)=0.05,  P(L | M^~S)=0.1,  P(L | ~M^S)=0.1,  P(L | ~M^~S)=0.2
P(R | M)=0.3,  P(R | ~M)=0.6
P(T | L)=0.3,  P(T | ~L)=0.8

(Diagram: S → L ← M,  M → R,  L → T)

E.g.: What could we do to compute P(R | T, ~S)?
2017年5月22日星期一
Data Mining: Concepts and Techniques
117
Where are we now?
Step 1: Compute P(R ^ T ^ ~S)
Step 2: Compute P(T ^ ~S)
Step 3: Return P(R ^ T ^ ~S) / P(T ^ ~S)

(Recap: We have a methodology for building Bayes nets. We don't require exponential storage to hold our probability table; storage is only exponential in the maximum number of parents of any node. We can compute probabilities of any given assignment of truth values to the variables, in time linear with the number of nodes. So we can also compute answers to any questions.)

E.g.: What could we do to compute P(R | T, ~S)?
2017年5月22日星期一
Data Mining: Concepts and Techniques
118
Where are we now?
Step 1: Compute P(R ^ T ^ ~S)  (sum of all the rows in the Joint that match R ^ T ^ ~S)
Step 2: Compute P(T ^ ~S)  (sum of all the rows in the Joint that match T ^ ~S)
Step 3: Return P(R ^ T ^ ~S) / P(T ^ ~S)

E.g.: What could we do to compute P(R | T, ~S)?
2017年5月22日星期一
Data Mining: Concepts and Techniques
119
Where are we now?
Step 1: Compute P(R ^ T ^ ~S)  (sum of all the rows in the Joint that match R ^ T ^ ~S: 4 joint computes)
Step 2: Compute P(T ^ ~S)  (sum of all the rows in the Joint that match T ^ ~S: 8 joint computes)
Step 3: Return P(R ^ T ^ ~S) / P(T ^ ~S)

Each of those joint entries is obtained by the "computing a joint probability entry" method of the earlier slides.

E.g.: What could we do to compute P(R | T, ~S)?
2017年5月22日星期一
Data Mining: Concepts and Techniques
120
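For illustration, the slow enumeration method applied to this query, assuming the CPT values from the slides (all 2^5 joint entries of the S, M, L, R, T net). A sketch:

from itertools import product

P_L = {(True, True): 0.05, (True, False): 0.1, (False, True): 0.1, (False, False): 0.2}

def joint(s, m, l, r, t):
    def pick(p, v):                      # P(var = v) given P(var = True) = p
        return p if v else 1 - p
    return (pick(0.3, s) * pick(0.6, m) * pick(P_L[(m, s)], l)
            * pick(0.3 if m else 0.6, r) * pick(0.3 if l else 0.8, t))

def prob(match):
    return sum(joint(*vals) for vals in product([True, False], repeat=5) if match(*vals))

num = prob(lambda s, m, l, r, t: r and t and not s)   # P(R ^ T ^ ~S)
den = prob(lambda s, m, l, r, t: t and not s)         # P(T ^ ~S)
print(num / den)                                       # P(R | T, ~S)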
The good news
We can do inference. We can compute any conditional
probability:
P( Some variable  Some other variable values )
P(E1 | E2) = P(E1 ^ E2) / P(E2) = [ Σ_{joint entries matching E1 and E2} P(joint entry) ] / [ Σ_{joint entries matching E2} P(joint entry) ]
2017年5月22日星期一
Data Mining: Concepts and Techniques
121
The good news
We can do inference. We can compute any conditional
probability:
P( Some variable  Some other variable values )
P(E1 | E2) = P(E1 ^ E2) / P(E2) = [ Σ_{joint entries matching E1 and E2} P(joint entry) ] / [ Σ_{joint entries matching E2} P(joint entry) ]
Suppose you have m binary-valued variables in your Bayes
Net and expression E2 mentions k variables.
How much work is the above computation?
2017年5月22日星期一
Data Mining: Concepts and Techniques
122
The sad, bad news
Conditional probabilities by enumerating all matching entries in the joint
are expensive:
Exponential in the number of variables.
2017年5月22日星期一
Data Mining: Concepts and Techniques
123
The sad, bad news
Conditional probabilities by enumerating all matching entries in the joint
are expensive:
Exponential in the number of variables.
But perhaps there are faster ways of querying Bayes nets?
 In fact, if I ever ask you to manually do a Bayes Net inference, you’ll
find there are often many tricks to save you time.
 So we’ve just got to program our computer to do those tricks too, right?
2017年5月22日星期一
Data Mining: Concepts and Techniques
124
The sad, bad news
Conditional probabilities by enumerating all matching entries in the joint
are expensive:
Exponential in the number of variables.
But perhaps there are faster ways of querying Bayes nets?
 In fact, if I ever ask you to manually do a Bayes Net inference, you’ll
find there are often many tricks to save you time.
 So we’ve just got to program our computer to do those tricks too, right?
Sadder and worse news:
General querying of Bayes nets is NP-complete.
2017年5月22日星期一
Data Mining: Concepts and Techniques
125
Bayes nets inference algorithms
A poly-tree is a directed acyclic graph in which no two nodes have more than one path
between them.
(Figure: the net S → L ← M, M → R, L → T is a poly-tree; a net in which two nodes are connected by more than one path is not a poly-tree, but is still a legal Bayes net.)

• If the net is a poly-tree, there is a linear-time algorithm (see a later Andrew lecture).
• The best general-case algorithms convert a general net to a poly-tree (often at huge expense) and call the poly-tree algorithm.
• Another popular, practical approach (doesn't assume a poly-tree): Stochastic Simulation.
2017年5月22日星期一
Data Mining: Concepts and Techniques
126
Sampling from the Joint Distribution
P(S)=0.3
P(M)=0.6
P(L | M^S)=0.05,  P(L | M^~S)=0.1,  P(L | ~M^S)=0.1,  P(L | ~M^~S)=0.2
P(R | M)=0.3,  P(R | ~M)=0.6
P(T | L)=0.3,  P(T | ~L)=0.8

(Diagram: S → L ← M,  M → R,  L → T)
It’s pretty easy to generate a set of variable-assignments at random with the same
probability as the underlying joint distribution.
How?
2017年5月22日星期一
Data Mining: Concepts and Techniques
127
Sampling from the Joint Distribution
P(S)=0.3
P(M)=0.6
P(L | M^S)=0.05,  P(L | M^~S)=0.1,  P(L | ~M^S)=0.1,  P(L | ~M^~S)=0.2
P(R | M)=0.3,  P(R | ~M)=0.6
P(T | L)=0.3,  P(T | ~L)=0.8

(Diagram: S → L ← M,  M → R,  L → T)
1. Randomly choose S. S = True with prob 0.3
2. Randomly choose M. M = True with prob 0.6
3. Randomly choose L. The probability that L is true depends on
the assignments of S and M. E.G. if steps 1 and 2 had produced
S=True, M=False, then probability that L is true is 0.1
4. Randomly choose R. Probability depends on M.
5. Randomly choose T. Probability depends on L
2017年5月22日星期一
Data Mining: Concepts and Techniques
128
A general sampling algorithm
Let’s generalize the example on the previous slide to a general Bayes Net.
As in Slides 16-17 , call the variables X1 .. Xn, where Parents(Xi) must be a subset
of {X1 .. Xi-1}.
For i = 1 to n:
1. Find the parents, if any, of Xi. Assume n(i) parents. Call them Xp(i,1), Xp(i,2), … Xp(i,n(i)).
2. Recall the values that those parents were randomly given: xp(i,1), xp(i,2), … xp(i,n(i)).
3. Look up in the lookup-table: P(Xi=True | Xp(i,1)=xp(i,1), Xp(i,2)=xp(i,2), …, Xp(i,n(i))=xp(i,n(i)))
4. Randomly set xi = True according to this probability.
x1, x2,…xn are now a sample from the joint distribution of X1, X2,…Xn.
2017年5月22日星期一
Data Mining: Concepts and Techniques
129
Stochastic Simulation Example
Someone wants to know P(R = True  T = True ^ S = False )
We’ll do lots of random samplings and count the number of occurrences of
the following:



Nc : Num. samples in which T=True and S=False.
Ns : Num. samples in which R=True, T=True and S=False.
N : Number of random samplings
Now if N is big enough:
Nc /N is a good estimate of P(T=True and S=False).
Ns /N is a good estimate of P(R=True ,T=True , S=False).
P(RT^~S) = P(R^T^~S)/P(T^~S), so Ns / Nc can be a good estimate of
P(RT^~S).
2017年5月22日星期一
Data Mining: Concepts and Techniques
130
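A sketch of this stochastic-simulation estimate for P(R | T, ~S) on the S, M, L, R, T net, assuming the CPT values shown on the slides (sample each variable given its already-sampled parents, then count):

import random

P_L = {(True, True): 0.05, (True, False): 0.1, (False, True): 0.1, (False, False): 0.2}

def sample():
    s = random.random() < 0.3
    m = random.random() < 0.6
    l = random.random() < P_L[(m, s)]
    r = random.random() < (0.3 if m else 0.6)
    t = random.random() < (0.3 if l else 0.8)
    return s, m, l, r, t

N, Nc, Ns = 100_000, 0, 0
for _ in range(N):
    s, m, l, r, t = sample()
    if t and not s:          # sample matches E2 = (T ^ ~S)
        Nc += 1
        if r:                # sample also matches E1 = R
            Ns += 1
print(Ns / Nc)               # estimate of P(R | T, ~S); Nc is large here, so no divide-by-zero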
General Stochastic Simulation
Someone wants to know P(E1  E2 )
We’ll do lots of random samplings and count the number of occurrences of
the following:



Nc : Num. samples in which E2
Ns : Num. samples in which E1 and E2
N : Number of random samplings
Now if N is big enough:
Nc /N is a good estimate of P(E2).
Ns /N is a good estimate of P(E1 , E2).
P(E1 | E2) = P(E1 ^ E2) / P(E2), so Ns / Nc can be a good estimate of P(E1 | E2).
2017年5月22日星期一
Data Mining: Concepts and Techniques
131
Likelihood weighting
Problem with Stochastic Sampling:
With lots of constraints in E, or unlikely events in E, then most of the
simulations will be thrown away, (they’ll have no effect on Nc, or Ns).
Imagine we’re part way through our simulation.
In E2 we have the constraint Xi = v
We're just about to generate a value for Xi at random. Given the values assigned to the parents, we see that P(Xi = v | parents) = p.
Now we know that with stochastic sampling:
• we'll generate "Xi = v" a proportion p of the time, and proceed.
• we'll generate a different value a proportion 1-p of the time, and that simulation will be wasted.
Instead, always generate Xi = v, but weight the answer by weight "p" to compensate.
2017年5月22日星期一
Data Mining: Concepts and Techniques
132
Likelihood weighting
Set Nc := 0, Ns := 0.
1. Generate a random assignment of all variables that matches E2. This process returns a weight w.
2. Define w to be the probability that this assignment would have been generated instead of an unmatching assignment during its generation in the original algorithm. (Fact: w is a product of all likelihood factors involved in the generation.)
3. Nc := Nc + w.
4. If our sample matches E1, then Ns := Ns + w.
5. Go to 1.
Again, Ns / Nc estimates P(E1 | E2).
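Here is a hedged sketch of the weighted version, again reusing the CPT and ORDER tables defined in the earlier sampling sketch; evidence variables are clamped rather than sampled, and each pass contributes its weight w to Nc (and to Ns when E1 also holds).

```python
import random

def likelihood_weighting(evidence, query_var, query_val, n_samples=100_000):
    """Estimate P(query_var = query_val | evidence) by likelihood weighting."""
    nc = ns = 0.0
    for _ in range(n_samples):
        sample, w = {}, 1.0
        for var in ORDER:                             # same topological order as before
            p = CPT[var](sample)                      # P(var = True | sampled parent values)
            if var in evidence:                       # clamp evidence variables to their values...
                sample[var] = evidence[var]
                w *= p if evidence[var] else 1.0 - p  # ...and multiply in their likelihood factor
            else:
                sample[var] = random.random() < p
        nc += w                                       # every weighted sample matches E2 by construction
        if sample[query_var] == query_val:
            ns += w
    return ns / nc

# e.g. estimate P(R=True | T=True, S=False):
# likelihood_weighting({"T": True, "S": False}, "R", True)
```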
Case Study I
Pathfinder system (Heckerman 1991, Probabilistic Similarity Networks, MIT Press, Cambridge MA).
- Diagnostic system for lymph-node diseases.
- 60 diseases and 100 symptoms and test results.
- 14,000 probabilities.
- Experts consulted to make the net:
  - 8 hours to determine the variables.
  - 35 hours for the net topology.
  - 40 hours for the probability table values.
- Apparently, the experts found it quite easy to invent the causal links and probabilities.
- Pathfinder is now outperforming the world experts in diagnosis, and is being extended to several dozen other medical domains.
Questions
- What are the strengths of probabilistic networks compared with propositional logic?
- What are the weaknesses of probabilistic networks compared with propositional logic?
- What are the strengths of probabilistic networks compared with predicate logic?
- What are the weaknesses of probabilistic networks compared with predicate logic?
- (How) could predicate logic and probabilistic networks be combined?
What you should know
- The meanings and importance of independence and conditional independence.
- The definition of a Bayes net.
- Computing probabilities of assignments of variables (i.e. members of the joint p.d.f.) with a Bayes net.
- The slow (exponential) method for computing arbitrary conditional probabilities.
- The stochastic simulation method and likelihood weighting.
Content
- What is classification?
- Decision tree
- Naïve Bayesian Classifier
- Bayesian Networks
- Neural Networks
- Support Vector Machines (SVM)
Neural Networks
- Assume each record X is a vector of length n.
- There are two classes, valued 1 and -1; e.g. class 1 means "buy computer" and class -1 means "does not buy".
- We can incrementally change weights to learn to produce these outputs using the perceptron learning rule (see the sketch below).
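A minimal sketch of the perceptron learning rule mentioned above, assuming numpy, labels in {+1, -1}, and a constant input of 1 appended so the bias is learned as an extra weight; the learning rate and epoch count are illustrative.

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=1.0):
    """X: (m, n) array of records, y: (m,) array of +1/-1 class labels."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])   # append a constant 1 for the bias
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:            # misclassified (or on the boundary)
                w += lr * yi * xi                  # perceptron update: nudge w toward the example
    return w

def predict(w, X):
    X = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.where(X @ w >= 0, 1, -1)             # output class 1 or -1
```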
A Neuron
[Figure: a neuron. The input vector x = (x0, x1, …, xn) and weight vector w = (w0, w1, …, wn) feed a weighted sum Σ, a bias term -μk is added, and an activation function f produces the output y.]
- The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping.
A Neuron
[Figure: the same neuron as above, with input vector x, weight vector w, weighted sum, bias -μk, activation function, and output y.]
For example: y = sign( Σ_{i=0..n} w_i x_i - μ_k )
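As a tiny numeric illustration of this formula, assuming numpy (the weights, inputs, and bias below are made up):

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])       # input vector x0..xn
w = np.array([0.2, 0.4, 0.1])        # weight vector w0..wn
mu_k = 0.05                          # bias term
y = np.sign(np.dot(w, x) - mu_k)     # y = sign(sum_i w_i * x_i - mu_k)
print(y)                             # weighted sum -0.1, minus mu_k gives -0.15, so y = -1.0
```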
Multi-Layer Perceptron
[Figure: a feed-forward network with input nodes (input vector x_i), a layer of hidden nodes, and output nodes producing the output vector; w_ij is the weight on the connection from node i to node j.]
Net input to node j: I_j = Σ_i w_ij x_i + θ_j
Network Training
- The ultimate objective of training:
  - obtain a set of weights that makes almost all the tuples in the training data classified correctly
- Steps (a sketch follows below):
  - Initialize weights with random values
  - Feed the input tuples into the network one by one
  - For each unit:
    - Compute the net input to the unit as a linear combination of all the inputs to the unit
    - Compute the output value using the activation function
    - Compute the error
    - Update the weights and the bias
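The steps above can be sketched for a single hidden layer as follows, assuming numpy, sigmoid activations, and the usual squared-error/backpropagation updates; the layer sizes, learning rate, and training tuple are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_tuple(x, target, W1, b1, W2, b2, lr=0.1):
    # feed the tuple forward: net input = linear combination of inputs, output = activation(net input)
    h = sigmoid(W1 @ x + b1)                 # hidden-layer outputs
    o = sigmoid(W2 @ h + b2)                 # output-layer outputs
    # compute the error at each unit and propagate it backwards
    err_o = (target - o) * o * (1 - o)       # output-unit error
    err_h = (W2.T @ err_o) * h * (1 - h)     # hidden-unit error
    # update the weights and the biases
    W2 = W2 + lr * np.outer(err_o, h);  b2 = b2 + lr * err_o
    W1 = W1 + lr * np.outer(err_h, x);  b1 = b1 + lr * err_h
    return W1, b1, W2, b2

# initialize weights with random values, then feed the training tuples one by one
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.5, (3, 4)), np.zeros(3)   # 4 inputs, 3 hidden units
W2, b2 = rng.normal(0, 0.5, (1, 3)), np.zeros(1)   # 1 output unit
x, target = np.array([0.1, 0.9, 0.2, 0.7]), np.array([1.0])
W1, b1, W2, b2 = train_one_tuple(x, target, W1, b1, W2, b2)
```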
Content
- What is classification?
- Decision tree
- Naïve Bayesian Classifier
- Bayesian Networks
- Neural Networks
- Support Vector Machines (SVM)
Linear Support Vector Machines
[Figure: two classes of points divided by a plane; "Margin" labels the gap between them. Legend: value of one point type = -1, e.g. does not buy computer; value of the other point type = 1, e.g. buys computer.]
- Goal: find a plane that divides the two sets of points.
- Also, need to maximize the margin.
Linear Support Vector Machines
[Figure: the same data separated by two candidate planes, one with a small margin and one with a large margin; the points lying on the margin are labelled "Support Vectors".]
Linear Support Vector Machines
- The hyperplane is defined by vector w and value b.
- For any point x where value(x) = -1, w·x + b ≤ -1.
- For any point x where value(x) = 1, w·x + b ≥ 1.
- Here, letting x = (xh, xv) and w = (wh, wv), we have: w·x = wh xh + wv xv.
Linear Support Vector Machines
[Figure: the two point classes separated by a line making a 60° angle with the horizontal axis; the horizontal axis is marked near 10, 11, 12 and the vertical axis near 10√3, 11√3, 12√3; M marks the margin.]
- For instance, the hyperplane is defined by vector w = (1, 1/√3) and value b = -11.
- Some examples:
  - x = (10, 0)  →  w·x + b = -1
  - x = (0, 12√3)  →  w·x + b = 1
  - x = (0, 0)  →  w·x + b = -11
- A little more detail:
  - w is perpendicular to the minus plane,
  - margin M = 2 / |w|, e.g. M = 2 / √(1² + (1/√3)²) = √3.
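A quick numeric check of this example, assuming numpy (the numbers are the ones on the slide):

```python
import numpy as np

w, b = np.array([1.0, 1.0 / np.sqrt(3)]), -11.0
for x in [np.array([10.0, 0.0]), np.array([0.0, 12 * np.sqrt(3)]), np.array([0.0, 0.0])]:
    print(x, np.dot(w, x) + b)   # -> approximately -1.0, 1.0, -11.0
print(2 / np.linalg.norm(w))     # margin M = 2/|w| = sqrt(3) ≈ 1.732
```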
Linear Support Vector Machines
- Each record xi in the training data has a vector and a class:
  - There are two classes: 1 and -1.
  - E.g. class 1 = "buy computer", class -1 = "not buy".
  - E.g. a record (age, salary) = (45, 50k), class = 1.
- Read the training data; find the hyperplane defined by w and b.
- Should maximize the margin and minimize the error. This can be done by a quadratic programming technique (a sketch follows below).
- Predict the class of a new data point x by looking at the sign of w·x + b. Positive: class 1. Negative: class -1.
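A hedged sketch of this read/train/predict workflow, assuming scikit-learn is available (its SVC solves the underlying quadratic program); the (age, salary) records and labels below are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[45, 50_000], [30, 20_000], [52, 90_000], [23, 15_000]])  # (age, salary)
y = np.array([1, -1, 1, -1])                 # 1 = "buy computer", -1 = "not buy"

clf = SVC(kernel="linear", C=1.0).fit(X, y)  # fits the maximum-margin hyperplane (w, b)
w, b = clf.coef_[0], clf.intercept_[0]
x_new = np.array([[40, 60_000]])
print(np.sign(x_new @ w + b))                # class = sign of w·x + b
print(clf.predict(x_new))                    # the same decision via the fitted model
```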
SVM – Cont.
- What if the data is not linearly separable?
- Project the data to a high-dimensional space where it is linearly separable, and then we can use linear SVM (using kernels).
[Figure: a one-dimensional data set with points at -1, 0, +1 labelled +, -, +, which no single threshold can separate, and its projection into two dimensions (points near (0,0), (0,1), (1,0)) where the + and - points become linearly separable.]
Non-Linear SVM
- Classification using SVM (w, b): test whether xi · w + b ≥ 0.
- In the non-linear case we can see this as testing whether K(xi, w) + b ≥ 0.
- Kernel: can be thought of as doing a dot product in some high-dimensional space (a sketch follows below).
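A brief sketch of the kernel idea, assuming scikit-learn and the usual reading of the one-dimensional example on the previous slide (+ at -1 and +1, - at 0): an RBF-kernel SVM separates it by implicitly taking dot products in a higher-dimensional space.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1.0], [0.0], [1.0]])     # the 1-D data set
y = np.array([1, -1, 1])                 # classes +, -, + : not linearly separable in 1-D

clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)   # K(x, x') = exp(-gamma * |x - x'|^2)
print(clf.predict([[-0.9], [0.1], [0.8]]))             # predicts +1, -1, +1 for these nearby points
```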
Results
[Figure: experimental results slide; the image is not recoverable from the transcript.]
SVM vs. Neural Network
- SVM
  - Relatively new concept
  - Nice generalization properties
  - Hard to learn – learned in batch mode using quadratic programming techniques
  - Using kernels, can learn very complex functions
- Neural Network
  - Quite old
  - Generalizes well but doesn't have a strong mathematical foundation
  - Can easily be learned in incremental fashion
  - To learn complex functions, use a multilayer perceptron (not that trivial)
SVM Related Links
- http://svm.dcs.rhbnc.ac.uk/
- http://www.kernel-machines.org/
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
- SVMlight – software (in C): http://ais.gmd.de/~thorsten/svm_light
- Book: An Introduction to Support Vector Machines. N. Cristianini and J. Shawe-Taylor. Cambridge University Press.
Content
- What is classification?
- Decision tree
- Naïve Bayesian Classifier
- Bayesian Networks
- Neural Networks
- Support Vector Machines (SVM)
- Bagging and Boosting
Bagging and Boosting
- General idea:
[Diagram: the training data is fed to a classification method (CM) to produce Classifier C; altered versions of the training data are each fed to CM to produce Classifiers C1, C2, …; an aggregation step combines them into the final Classifier C*.]
Bagging
- Given a set S of s samples.
- Generate a bootstrap sample T from S. Cases in S may not appear in T or may appear more than once.
- Repeat this sampling procedure, getting a sequence of k independent training sets.
- A corresponding sequence of classifiers C1, C2, …, Ck is constructed for each of these training sets, by using the same classification algorithm.
- To classify an unknown sample X, let each classifier predict or vote.
- The bagged classifier C* counts the votes and assigns X to the class with the most votes (a sketch follows below).
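A short sketch of this procedure, assuming numpy and scikit-learn decision trees as the (fixed) classification algorithm; k, the random seed, and the choice of base learner are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    rng = np.random.default_rng(seed)
    s = len(X)
    classifiers = []
    for _ in range(k):
        idx = rng.integers(0, s, size=s)       # bootstrap sample T: s cases drawn with replacement
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return classifiers

def bagging_predict(classifiers, X):
    votes = np.array([c.predict(X) for c in classifiers])   # one row of votes per classifier
    def majority(col):                                       # the class with the most votes wins
        vals, counts = np.unique(col, return_counts=True)
        return vals[counts.argmax()]
    return np.array([majority(votes[:, j]) for j in range(votes.shape[1])])
```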
Boosting Technique — Algorithm
- Assign every example an equal weight 1/N.
- For t = 1, 2, …, T do:
  - Obtain a hypothesis (classifier) h(t) under the weights w(t).
  - Calculate the error of h(t) and re-weight the examples based on the error. Each classifier is dependent on the previous ones; samples that are incorrectly predicted are weighted more heavily.
  - Normalize w(t+1) to sum to 1 (weights assigned to different classifiers sum to 1).
- Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set (an AdaBoost-style sketch follows below).
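The loop above matches the AdaBoost schema; here is an AdaBoost-style sketch, assuming numpy, scikit-learn decision stumps as the hypotheses, and labels in {+1, -1} (T and the choice of base learner are illustrative).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=20):
    n = len(X)
    w = np.full(n, 1.0 / n)                       # assign every example an equal weight 1/N
    hypotheses, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = w[pred != y].sum()                  # weighted error of h(t)
        if err == 0:                              # perfect hypothesis: keep it and stop
            hypotheses.append(h); alphas.append(1.0)
            break
        if err >= 0.5:                            # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)     # hypothesis weight grows with its accuracy
        w *= np.exp(-alpha * y * pred)            # incorrectly predicted samples get heavier weights
        w /= w.sum()                              # normalize w(t+1) to sum to 1
        hypotheses.append(h); alphas.append(alpha)
    return hypotheses, alphas

def adaboost_predict(hypotheses, alphas, X):
    # output the sign of the weighted sum of all hypotheses
    scores = sum(a * h.predict(X) for a, h in zip(alphas, hypotheses))
    return np.sign(scores)
```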
Summary
- Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks).
- Classification is probably one of the most widely used data mining techniques, with a lot of extensions.
- Scalability is still an important issue for database applications: thus combining classification with database techniques should be a promising topic.
- Research directions: classification of non-relational data, e.g. text, spatial, multimedia, etc.