Probabilistic and Bayesian Analytics
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
Copyright © Andrew W. Moore
Slide 1
Probability
• The world is a very uncertain place
• 30 years of Artificial Intelligence and
Database research danced around this fact
• And then a few AI researchers decided to
use some ideas from the eighteenth century.
• We will review the fundamentals of
probability.
Copyright © Andrew W. Moore
Slide 2
Discrete Random Variables: The Binary Case
• A is a Boolean-valued random variable if A
denotes an event, and there is some degree
of uncertainty as to whether A occurs.
• Examples
• A = The US president in 2023 will be male
• A = You wake up tomorrow with a headache
• A = You have Ebola
Copyright © Andrew W. Moore
Slide 3
Probabilities
• We denote by P(A), the probability of A, "the fraction of all occurrences in which A is true".
• There are several ways of assigning probabilities, but we will not discuss that now.
Copyright © Andrew W. Moore
Slide 4
Venn Diagram for Visualizing A
• Sample space of all possible occurrences; its total area is 1.
• Occurrences in which A is true: P(A) = area of the oval.
• Occurrences in which A is false: the rest of the sample space.
Copyright © Andrew W. Moore
Slide 5
The Axioms of Probability
• 0 <= P(A) <= 1
• P(A is true in every occurrence) = 1
• P(A is true in no occurrence) = 0
• P(A or B) = P(A) + P(B) if A and B are disjoint.
These axioms were introduced by the Russian school at the beginning of the twentieth century: Kolmogorov, Lyapunov, Khinchin, Chebyshev.
Copyright © Andrew W. Moore
Slide 6
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B)
The area of A can't get any smaller than 0, and a zero area would mean no world could ever have A true.
Copyright © Andrew W. Moore
Slide 7
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B)
The area of A can't get any bigger than 1, and an area of 1 would mean all worlds have A true.
Copyright © Andrew W. Moore
Slide 8
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
[Venn diagram of two overlapping events A and B]
Copyright © Andrew W. Moore
Slide 9
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
[Venn diagram: the area of "A or B" equals the area of A plus the area of B minus the double-counted area of "A and B"]
Simple addition and subtraction.
Copyright © Andrew W. Moore
Slide 10
Theorems from the Axioms
• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
From these we can prove:
P(not A) = P(~A) = 1-P(A)
• How?
Copyright © Andrew W. Moore
Slide 11
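One possible argument, sketched from the axioms above: A and ~A are disjoint and (A or ~A) is True, so
$1 = P(\mathrm{True}) = P(A \vee \neg A) = P(A) + P(\neg A)$, and hence $P(\neg A) = 1 - P(A)$.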
Another important theorem
• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
From these we can prove:
P(A) = P(A ^ B) + P(A ^ ~B)
• How?
Copyright © Andrew W. Moore
Slide 12
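One possible argument, again sketched from the axioms: (A ^ B) and (A ^ ~B) are disjoint and their disjunction is A, so
$P(A) = P((A \wedge B) \vee (A \wedge \neg B)) = P(A \wedge B) + P(A \wedge \neg B)$.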
Multivalued Random Variables
• Suppose A can take on more than 2 values
• A is a random variable with arity k if it can
take on exactly one value out of {v1, v2, ..., vk}
• Thus…
$P(A = v_i \wedge A = v_j) = 0$ if $i \neq j$
$P(A = v_1 \vee A = v_2 \vee \cdots \vee A = v_k) = 1$
Copyright © Andrew W. Moore
Slide 13
An easy fact about Multivalued
Random Variables:
• Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
• And assuming that A obeys…
$P(A = v_i \wedge A = v_j) = 0$ if $i \neq j$
$P(A = v_1 \vee A = v_2 \vee \cdots \vee A = v_k) = 1$
• It's easy to prove that
$P(A = v_1 \vee A = v_2 \vee \cdots \vee A = v_i) = \sum_{j=1}^{i} P(A = v_j)$
Copyright © Andrew W. Moore
Slide 14
An easy fact about Multivalued
Random Variables:
• Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
• And assuming that A obeys…
$P(A = v_i \wedge A = v_j) = 0$ if $i \neq j$
$P(A = v_1 \vee A = v_2 \vee \cdots \vee A = v_k) = 1$
• It's easy to prove that
$P(A = v_1 \vee A = v_2 \vee \cdots \vee A = v_i) = \sum_{j=1}^{i} P(A = v_j)$
• And thus we can prove
$\sum_{j=1}^{k} P(A = v_j) = 1$
Copyright © Andrew W. Moore
Slide 15
Another fact about Multivalued
Random Variables:
• Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
• And assuming that A obeys…
$P(A = v_i \wedge A = v_j) = 0$ if $i \neq j$
$P(A = v_1 \vee A = v_2 \vee \cdots \vee A = v_k) = 1$
• It's easy to prove that
$P(B \wedge [A = v_1 \vee A = v_2 \vee \cdots \vee A = v_i]) = \sum_{j=1}^{i} P(B \wedge A = v_j)$
Copyright © Andrew W. Moore
Slide 16
Another fact about Multivalued
Random Variables:
• Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
• And assuming that A obeys…
$P(A = v_i \wedge A = v_j) = 0$ if $i \neq j$
$P(A = v_1 \vee A = v_2 \vee \cdots \vee A = v_k) = 1$
• It's easy to prove that
$P(B \wedge [A = v_1 \vee A = v_2 \vee \cdots \vee A = v_i]) = \sum_{j=1}^{i} P(B \wedge A = v_j)$
• And thus we can prove
$P(B) = \sum_{j=1}^{k} P(B \wedge A = v_j)$
Copyright © Andrew W. Moore
Slide 17
Elementary Probability in Pictures
• P(~A) + P(A) = 1
Copyright © Andrew W. Moore
Slide 18
Elementary Probability in Pictures
• P(B) = P(B ^ A) + P(B ^ ~A)
Copyright © Andrew W. Moore
Slide 19
Elementary Probability in Pictures
$\sum_{j=1}^{k} P(A = v_j) = 1$
Copyright © Andrew W. Moore
Slide 20
Elementary Probability in Pictures
$P(B) = \sum_{j=1}^{k} P(B \wedge A = v_j)$
Copyright © Andrew W. Moore
Slide 21
Conditional Probability
• P(A|B) = the fraction of occurrences in which B is true that also have A true.
H = "Have a headache"
F = "Coming down with flu"
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
"Headaches are rare and flu is rarer, but if you're coming down with flu there's a 50-50 chance you'll have a headache."
Copyright © Andrew W. Moore
Slide 22
Conditional Probability
P(H|F) = fraction of flu sufferers who also have a headache
= #occurrences with flu and headache / #occurrences with flu
= Area of "H and F" region / Area of "F" region
= P(H ^ F) / P(F)
H = "Have a headache"
F = "Coming down with flu"
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
Copyright © Andrew W. Moore
Slide 23
Definition of Conditional Probability
P(A|B) = P(A ^ B) / P(B)
Corollary: The Chain Rule
P(A ^ B) = P(A|B) P(B)
Copyright © Andrew W. Moore
Slide 24
Probabilistic Inference
H = "Have a headache"
F = "Coming down with flu"
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
One day you wake up with a headache. You think: "Drat! 50% of flus are associated with headaches so I must have a 50-50 chance of coming down with flu."
Is this reasoning good?
Copyright © Andrew W. Moore
Slide 25
Probabilistic Inference
H = "Have a headache"
F = "Coming down with flu"
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
P(F ^ H) = P(F) P(H|F) = (1/40)(1/2) = 1/80
P(F|H) = P(F ^ H)/P(H) = (1/80)/(1/10) = 1/8 = 0.125
Copyright © Andrew W. Moore
Slide 26
Another way to understand the
intuition
Thanks to Jahanzeb Sherwani for contributing this explanation:
Copyright © Andrew W. Moore
Slide 27
What we just did…
P(B|A) = P(A ^ B) / P(A) = P(A|B) P(B) / P(A)
This is Bayes Rule. P(B) is called the prior probability, P(A|B) is called the likelihood, and P(B|A) is the posterior probability.
Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418
Copyright © Andrew W. Moore
Slide 28
More General Forms of Bayes Rule
$P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)}$

$P(A \mid B \wedge X) = \dfrac{P(B \mid A \wedge X)\,P(A \wedge X)}{P(B \wedge X)}$
Copyright © Andrew W. Moore
Slide 29
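As an illustration of the first form, here is a small R sketch (not from the original slides) that applies two-hypothesis Bayes Rule to the headache/flu numbers of Slide 26; P(H|~F) is derived from P(H) = P(H ^ F) + P(H ^ ~F):

# Bayes Rule with two hypotheses: P(A|B) = P(B|A)P(A) / (P(B|A)P(A) + P(B|~A)P(~A))
posterior <- function(prior, lik, lik_not) {
  lik * prior / (lik * prior + lik_not * (1 - prior))
}
p_F      <- 1/40                                 # prior P(F)
p_HgivF  <- 1/2                                  # likelihood P(H|F)
p_H      <- 1/10                                 # marginal P(H)
p_HgivnF <- (p_H - p_HgivF * p_F) / (1 - p_F)    # P(H|~F), about 0.0897
posterior(p_F, p_HgivF, p_HgivnF)                # 0.125, matching P(F|H) on Slide 26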
More General Forms of Bayes Rule
P( A  vi |B) 
P( B | A  vi ) P( A  vi )
nA
 P( B | A  v ) P( A  v )
k 1
Copyright © Andrew W. Moore
k
k
Slide 30
Useful Easy-to-prove facts
P( A | B)P(~ A | B)  1
nA
 P( A  v
k 1
Copyright © Andrew W. Moore
k
| B)  1
Slide 31
The Joint Distribution
Example: Boolean
variables A, B, C
Recipe for making a joint distribution
of M variables:
Copyright © Andrew W. Moore
Slide 32
Using Bayes Rule in Games
The "Winner" envelope contains a dollar and 4 beads (2 red and 2 black).
The "Loser" envelope contains 3 beads (one red and two black) and no money.
Question: Someone draws an envelope at random and offers to sell it to you. How much should you pay for the envelope?
Copyright © Andrew W. Moore
Slide 33
Using Bayes Rule to Gamble
The "Winner" envelope contains a dollar and 4 beads (2 red and 2 black).
The "Loser" envelope contains 3 beads (one red and two black) and no money.
Interesting question: Before deciding, you are allowed to see one bead drawn from the envelope.
Suppose it is black: how much should you pay for the envelope?
Suppose it is red: how much should you pay for the envelope?
Copyright © Andrew W. Moore
Slide 34
Calculation…
P(Winner | Black) = P(Winner ^ Black)/P(Black) = P(Winner) P(Black|Winner)/P(Black)
P(Black) = P(Black ^ Winner) + P(Black ^ Loser)
= P(Winner) P(Black|Winner) + P(Loser) P(Black|Loser)
= (1/2)(2/4) + (1/2)(2/3) = 7/12
P(Winner | Black) = (1/4)/(7/12) = 3/7 ≈ 0.43, so the envelope is worth about $0.43 and you should pay no more than that.
Similarly, P(Red) = (1/2)(2/4) + (1/2)(1/3) = 5/12, so
P(Winner | Red) = P(Winner ^ Red)/P(Red) = (1/4)/(5/12) = 3/5 = 0.60.
Copyright © Andrew W. Moore
Slide 35
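A small R sketch (not part of the original slides) that checks these posteriors numerically, using the bead counts stated above:

p_winner <- 1/2                                  # each envelope is equally likely
p_black_winner <- 2/4;  p_red_winner <- 2/4      # bead composition of the Winner envelope
p_black_loser  <- 2/3;  p_red_loser  <- 1/3      # bead composition of the Loser envelope

p_black <- p_winner * p_black_winner + (1 - p_winner) * p_black_loser   # 7/12
p_red   <- p_winner * p_red_winner   + (1 - p_winner) * p_red_loser     # 5/12

p_winner_black <- p_winner * p_black_winner / p_black   # 3/7, about 0.43
p_winner_red   <- p_winner * p_red_winner   / p_red     # 3/5 = 0.60
c(black = p_winner_black, red = p_winner_red)            # value of the envelope in dollars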
Joint Distributions
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
Example: Boolean variables A, B, C
A B C
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
A = 1 if the person is male and 0 if female; B = 1 if the person is married and 0 if not; C = 1 if the person is sick and 0 if not.
Copyright © Andrew W. Moore
Slide 36
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
Example: Boolean variables A, B, C
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
Copyright © Andrew W. Moore
Slide 37
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
Example: Boolean variables A, B, C
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
[Venn diagram showing the same eight probabilities as areas of the regions defined by A, B and C]
Copyright © Andrew W. Moore
Slide 38
Data for the JD example
• The Census dataset:
48842 instances, mix of continuous and discrete features (train = 32561, test = 16281); 45222 instances remain if those with unknown values are removed (train = 30162, test = 15060).
Available at: http://ftp.ics.uci.edu/pub/machine-learning-databases
Donors: Ronny Kohavi and Barry Becker (1996).
Copyright © Andrew W. Moore
Slide 39
Variables in Census
• 1-age: continuous.
• 2-workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
• 3-fnlwgt (final weight): continuous.
• 4-education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
• 5-education-num: continuous.
• 6-marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
• 7-occupation:
• 8-relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
Copyright © Andrew W. Moore
Slide 40
Variables in Census
• 9-race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
• 10-sex: Female [0], Male [1].
• 11-capital-gain: continuous.
• 12-capital-loss: continuous.
• 13-hours-per-week: continuous.
• 14-native-country: nominal.
• 15-salary: >50K (rich) [2], <=50K (poor) [1].
Copyright © Andrew W. Moore
Slide 41
First rows of census (train)
> traincensus[1:5,]
[Output: the first five training records, with numerically coded columns age, workcl, fwgt, educ, educn, marital, occup, relat, race, sex, gain, loss, hpwk, country, salary.]
Copyright © Andrew W. Moore
Slide 42
Computing frequencies for sex
> countsex=table(traincensus$sex)
> probsex=countsex/32561
> countsex
    0     1
10771 21790
> probsex
        0         1
0.3307945 0.6692055
Copyright © Andrew W. Moore
Slide 43
Computing the joint probabilities
> jdcensus=traincensus[,c(10,13,15)]
> dim(jdcensus)
[1] 32561     3
> jdcensus[1:3,]
  sex hpwk salary
1   1   40      1
2   1   13      1
3   1   40      1
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk<=40.5 & jdcensus$salary==1,])[1]
[1] 8220
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk<=40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.2524492
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk<=40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.02484567
Copyright © Andrew W. Moore
Slide 44
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk>40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.0421363
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk>40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.01136329
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk<=40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.3309174
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk<=40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.09754
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk>40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.1336875
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk>40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.1070606
Copyright © Andrew W. Moore
Slide 45
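The same eight joint probabilities can also be tabulated in one step; a possible R sketch (an assumption about one way to do it, not code from the slides), reusing the jdcensus data frame above:

# Discretize hours-per-week at 40.5, then tabulate relative frequencies in one step
jdcensus$hpwkcat <- cut(jdcensus$hpwk, breaks = c(-Inf, 40.5, Inf),
                        labels = c("<=40.5", ">40.5"))
joint <- xtabs(~ sex + hpwkcat + salary, data = jdcensus) / nrow(jdcensus)
joint        # the 2 x 2 x 2 joint distribution; entries match the eight numbers above
sum(joint)   # should be 1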
Using the Joint
Once you have the JD you can ask for the probability of any logical expression involving your attributes:

$P(E) = \sum_{\text{rows matching } E} P(\text{row})$

Copyright © Andrew W. Moore
Slide 46
Using the Joint
P(Poor Male) = 0.4654

$P(E) = \sum_{\text{rows matching } E} P(\text{row})$
Copyright © Andrew W. Moore
Slide 47
Using the Joint
P(Poor) = 0.7604

$P(E) = \sum_{\text{rows matching } E} P(\text{row})$
Copyright © Andrew W. Moore
Slide 48
Inference with the Joint

$P(E_1 \mid E_2) = \dfrac{P(E_1 \wedge E_2)}{P(E_2)} = \dfrac{\sum_{\text{rows matching } E_1 \text{ and } E_2} P(\text{row})}{\sum_{\text{rows matching } E_2} P(\text{row})}$
Copyright © Andrew W. Moore
Slide 49
Inference with the Joint

$P(E_1 \mid E_2) = \dfrac{P(E_1 \wedge E_2)}{P(E_2)} = \dfrac{\sum_{\text{rows matching } E_1 \text{ and } E_2} P(\text{row})}{\sum_{\text{rows matching } E_2} P(\text{row})}$

P(Male | Poor) = 0.4654 / 0.7604 = 0.612
Copyright © Andrew W. Moore
Slide 50
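As a small illustration (not from the original slides), here is an R sketch that applies both formulas to the Boolean A, B, C joint distribution of Slide 37:

# The joint distribution from Slide 37, one row per combination of A, B, C
jd <- data.frame(A = c(0,0,0,0,1,1,1,1),
                 B = c(0,0,1,1,0,0,1,1),
                 C = c(0,1,0,1,0,1,0,1),
                 prob = c(0.30,0.05,0.10,0.05,0.05,0.10,0.25,0.10))

# P(E): sum the probabilities of the rows matching E, e.g. E = (A and not C)
sum(jd$prob[jd$A == 1 & jd$C == 0])                               # 0.05 + 0.25 = 0.30

# P(E1|E2): ratio of two such sums, e.g. P(A | B)
sum(jd$prob[jd$A == 1 & jd$B == 1]) / sum(jd$prob[jd$B == 1])     # 0.35 / 0.50 = 0.70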
Inference is an important problem
• I have this evidence; what is the probability that this conclusion is true?
• I have a stiff neck: what is the probability that I have meningitis?
• The lights in my house are on and it is 9 pm: what is the probability that my wife is already asleep?
• There are many applications of Bayesian inference in medicine, pharmacy, machine fault diagnosis, etc.
Copyright © Andrew W. Moore
Slide 51
Where do Joint Distributions
come from?
• Idea One: Expert Humans
• Idea Two: Simpler probabilistic facts and
some algebra
Example: Suppose you knew
P(A) = 0.7
P(B|A) = 0.2      P(B|~A) = 0.1
P(C|A^B) = 0.1    P(C|A^~B) = 0.8
P(C|~A^B) = 0.3   P(C|~A^~B) = 0.1
Then you can automatically compute the JD using the chain rule:
P(A=x ^ B=y ^ C=z) = P(C=z | A=x ^ B=y) P(B=y | A=x) P(A=x)
In another lecture: Bayes Nets, a systematic way to do this.
Copyright © Andrew W. Moore
Slide 52
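A minimal R sketch of this chain-rule construction (again not code from the slides), plugging in the numbers given above:

pA <- 0.7
pB_given_A  <- function(a)    if (a == 1) 0.2 else 0.1                  # P(B=1 | A=a)
pC_given_AB <- function(a, b) {                                         # P(C=1 | A=a, B=b)
  if (a == 1 && b == 1) 0.1 else if (a == 1 && b == 0) 0.8
  else if (a == 0 && b == 1) 0.3 else 0.1
}

# Chain rule: P(A=x ^ B=y ^ C=z) = P(C=z | A=x ^ B=y) P(B=y | A=x) P(A=x)
jd <- expand.grid(A = 0:1, B = 0:1, C = 0:1)
jd$prob <- apply(jd, 1, function(r) {
  pa <- if (r["A"] == 1) pA else 1 - pA
  pb <- if (r["B"] == 1) pB_given_A(r["A"]) else 1 - pB_given_A(r["A"])
  pc <- if (r["C"] == 1) pC_given_AB(r["A"], r["B"]) else 1 - pC_given_AB(r["A"], r["B"])
  pa * pb * pc
})
jd            # the full 8-row joint distribution
sum(jd$prob)  # 1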
Where do Joint Distributions
come from?
• Idea Three: Learn them from data!
Prepare to see one of the most impressive learning
algorithms you’ll come across in the entire course….
Copyright © Andrew W. Moore
Slide 53
Learning a joint distribution
Build a JD table for your attributes in which the probabilities are unspecified, then fill in each row with

$\hat{P}(\text{row}) = \dfrac{\text{records matching row}}{\text{total number of records}}$

Before filling in (all probabilities unspecified):
A B C Prob
0 0 0 ?
0 0 1 ?
0 1 0 ?
0 1 1 ?
1 0 0 ?
1 0 1 ?
1 1 0 ?
1 1 1 ?
After filling in from the data:
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
For example, 0.25 is the fraction of all records in which A and B are True but C is False.
Copyright © Andrew W. Moore
Slide 54
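A generic version of this counting step as a short R sketch (one possible implementation, not code from the slides); records can be any data frame of discrete attributes:

# Learn a joint distribution: relative frequency of every combination of attribute values
learn_joint <- function(records) {
  counts <- table(records)     # multi-way contingency table over all columns
  counts / sum(counts)         # turn counts into probabilities that sum to 1
}

# Tiny illustration with three simulated Boolean attributes
set.seed(1)
records <- data.frame(A = rbinom(100, 1, 0.5),
                      B = rbinom(100, 1, 0.3),
                      C = rbinom(100, 1, 0.7))
learn_joint(records)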
Example of Learning a Joint
• This Joint was obtained by learning from three attributes in the UCI "Adult" Census Database [Kohavi 1995]
Copyright © Andrew W. Moore
Slide 55
Where are we?
• We have recalled the fundamentals of
probability
• We have become content with what JDs are
and how to use them
• And we even know how to learn JDs from
data.
Copyright © Andrew W. Moore
Slide 56
Density Estimation
• Our Joint Distribution learner is our first
example of something called Density
Estimation
• A Density Estimator learns a mapping from a
set of attributes to a Probability
Input Attributes → Density Estimator → Probability
Copyright © Andrew W. Moore
Slide 57
Density Estimation
• Compared with the other two main kinds of models:
Input Attributes → Classifier → Prediction of categorical output
Input Attributes → Density Estimator → Probability
Input Attributes → Regressor → Prediction of real-valued output
Copyright © Andrew W. Moore
Slide 58
Evaluating density estimation
Test-set criterion for estimating performance on future data*
* See the Decision Tree or Cross Validation lecture for more detail
Input Attributes → Classifier → Prediction of categorical output: Test-set Accuracy
Input Attributes → Density Estimator → Probability: ?
Input Attributes → Regressor → Prediction of real-valued output: Test-set Accuracy
Copyright © Andrew W. Moore
Slide 59
Evaluating a density estimator
• Given a record x, a density estimator M can
tell you how likely the record is:
$\hat{P}(x \mid M)$
• Given a dataset with R records, a density
estimator can tell you how likely the dataset
is:
(Under the assumption that all records were independently
generated from the Density Estimator’s JD)
$\hat{P}(\text{dataset} \mid M) = \hat{P}(x_1 \wedge x_2 \wedge \cdots \wedge x_R \mid M) = \prod_{k=1}^{R} \hat{P}(x_k \mid M)$
Copyright © Andrew W. Moore
Slide 60
The Auto-mpg dataset
Donor: Quinlan, R. (1993)
Number of instances: 398 minus 6 with missing values = 392 (training: 196, test: 196)
Number of attributes: 9 including the class attribute.
Attribute information:
1. mpg: continuous (discretized: bad <= 25, good > 25)
2. cylinders: multi-valued discrete
3. displacement: continuous (discretized: low <= 200, high > 200)
4. horsepower: continuous (discretized: low <= 90, high > 90)
5. weight: continuous (discretized: low <= 3000, high > 3000)
6. acceleration: continuous (discretized: low <= 15, high > 15)
7. model year: multi-valued discrete (discretized: 70-74, 75-77, 78-82)
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
Note: horsepower has 6 missing values
Copyright © Andrew W. Moore
Slide 61
The auto-mpg dataset
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle
malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
16.0 8 304.0 150.0 3433. 12.0 70 1 "amc rebel sst"
17.0 8 302.0 140.0 3449. 10.5 70 1 "ford torino"
…………………………………………………………………..
27.0 4 140.0 86.00 2790. 15.6 82 1 "ford mustang gl"
44.0 4 97.00 52.00 2130. 24.6 82 2 "vw pickup"
32.0 4 135.0 84.00 2295. 11.6 82 1 "dodge rampage"
28.0 4 120.0 79.00 2625. 18.6 82 1 "ford ranger"
31.0 4 119.0 82.00 2720. 19.4 82 1 "chevy s-10"
Copyright © Andrew W. Moore
Slide 62
A subset of auto-mpg
levelmpg modelyear maker
[1,] "bad" "70-74" "america"
[2,] "bad" "70-74" "america"
[3,] "bad" "70-74" "america"
[4,] "bad" "70-74" "america"
[5,] "bad" "70-74" "america"
[6,] "bad" "70-74" "america“
………………………………………………
……………………………………………..
[7,] "good" "78-82" "america"
[8,] "good" "78-82" "europe"
[9,] "good" "78-82" "america"
[10,] "good" "78-82" "america"
[11,] "good" "78-82" "america"
Copyright © Andrew W. Moore
Slide 63
MPG: Training set and test set
indices_samples=sample(1:392,196)
mpgtrain=smallmpg[indices_samples,]
dim(mpgtrain)
[1] 196 3
mpgtest=smallmpg[-indices_samples,]
#computing the first joint probability
a1=dim(mpgtrain[mpgtrain[,1]=="bad" &
mpgtrain[,2]=="70-74" &
mpgtrain[,3]=="america",])[1]/dim(mpgtrain)[1]
Copyright © Andrew W. Moore
Slide 64
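Rather than computing the 18 cells one at a time as above, the whole estimate can be tabulated at once; a possible R sketch (an assumption, not code from the slides), reusing mpgtrain:

# Relative frequency of every (levelmpg, modelyear, maker) combination
jointmpg <- table(levelmpg  = mpgtrain[,1],
                  modelyear = mpgtrain[,2],
                  maker     = mpgtrain[,3]) / nrow(mpgtrain)
ftable(jointmpg)   # flat view of the 2 x 3 x 3 = 18-cell joint
sum(jointmpg)      # 1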
The joint density estimate
levelmpg  modelyear  maker    probability
bad       70-74      America  0.2397959
bad       70-74      Asia     0.01530612
bad       70-74      Europe   0.03061224
bad       75-77      America  0.2091837
bad       75-77      Asia     0.01020408
bad       75-77      Europe   0.04591837
bad       78-82      America  0.04591837
bad       78-82      Asia     0.005102041
bad       78-82      Europe   0
good      70-74      America  0.02040816
good      70-74      Asia     0.01530612
good      70-74      Europe   0.03061224
good      75-77      America  0.04081633
good      75-77      Asia     0.04081633
good      75-77      Europe   0.02551020
good      78-82      America  0.1071429
good      78-82      Asia     0.07142857
good      78-82      Europe   0.04591837
Copyright © Andrew W. Moore
Slide 65
$\hat{P}(\text{mpgtest} \mid M) = \hat{P}(x_1 \wedge x_2 \wedge \cdots \wedge x_{196} \mid M) = \prod_{k=1}^{196} \hat{P}(x_k \mid M) = \text{(in this case)} \approx 1.45 \times 10^{-222}$
This probability has been computed by treating every joint probability as at least 10^-20.
Copyright © Andrew W. Moore
Slide 66
A small dataset: Miles Per Gallon
192 Training Set Records
mpg   modelyear  maker
good  75to78     asia
bad   70to74     america
bad   75to78     europe
bad   70to74     america
bad   70to74     america
bad   70to74     asia
bad   70to74     asia
bad   75to78     america
:     :          :
bad   70to74     america
good  79to83     america
bad   75to78     america
good  79to83     america
bad   75to78     america
good  79to83     america
good  79to83     america
bad   70to74     america
good  75to78     europe
bad   75to78     europe
From the UCI repository (thanks to Ross Quinlan)
Copyright © Andrew W. Moore
Slide 67
A small dataset: Miles Per Gallon
192 Training Set Records
[Same table of mpg, modelyear and maker records as the previous slide]
Copyright © Andrew W. Moore
Slide 68
A small dataset: Miles Per Gallon
192 Training Set Records
[Same table of mpg, modelyear and maker records as on Slide 67]

$\hat{P}(\text{dataset} \mid M) = \hat{P}(x_1 \wedge x_2 \wedge \cdots \wedge x_{196} \mid M) = \prod_{k=1}^{196} \hat{P}(x_k \mid M) = \text{(in this case)} \approx 3.4 \times 10^{-203}$

Here "dataset" can be mpgtrain or mpgtest.
Copyright © Andrew W. Moore
Slide 69
Log Probabilities
Since probabilities of datasets get so
small we usually use log probabilities
$\log \hat{P}(\text{dataset} \mid M) = \log \prod_{k=1}^{R} \hat{P}(x_k \mid M) = \sum_{k=1}^{R} \log \hat{P}(x_k \mid M)$
Copyright © Andrew W. Moore
Slide 70
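A possible R sketch of this evaluation (not from the slides), scoring a dataset against a joint estimate stored as a three-way table such as the jointmpg sketch above; the pmax() floor mirrors the 10^-20 lower bound mentioned on Slides 66 and 76:

# Log-likelihood of a dataset under a joint density estimate stored as a 3-way table
loglik <- function(dataset, joint, floor = 1e-20) {
  p <- apply(dataset, 1, function(r) joint[r[1], r[2], r[3]])   # look up each record's cell
  sum(log(pmax(p, floor)))                                      # floor zero cells at 1e-20
}
loglik(mpgtrain, jointmpg)   # log-likelihood of the training set
loglik(mpgtest,  jointmpg)   # log-likelihood of the test set (worse, as Slide 74 notes)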
A small dataset: Miles Per Gallon
192 Training Set Records
[Same table of mpg, modelyear and maker records as on Slide 67]

$\log \hat{P}(\text{dataset} \mid M) = \log \prod_{k=1}^{R} \hat{P}(x_k \mid M) = \sum_{k=1}^{R} \log \hat{P}(x_k \mid M) = \text{(in this case)} -466.19$

In our case = -510.79
Copyright © Andrew W. Moore
Slide 71
Summary: The Good News
• We have seen one way to learn a density estimator from data.
• Density estimators can be used to:
• Sort records by probability and thereby detect anomalous cases (outliers).
• Do inference: compute P(E1|E2)
• Build Bayesian classifiers (seen later).
Copyright © Andrew W. Moore
Slide 72
Summary: The Bad News
• Density estimation by directly learning the
joint is trivial, mindless and dangerous
Copyright © Andrew W. Moore
Slide 73
Using a test set
An independent test set with 196 cars has a worse log likelihood
(actually it’s a billion quintillion quintillion quintillion quintillion
times less likely)
….Density estimators can overfit. And the full joint density
estimator is the overfittiest of them all!
Copyright © Andrew W. Moore
Slide 74
Overfitting Density Estimators
If this ever happens, it means there are certain combinations that we learn are impossible:

$\log \hat{P}(\text{testset} \mid M) = \log \prod_{k=1}^{R} \hat{P}(x_k \mid M) = \sum_{k=1}^{R} \log \hat{P}(x_k \mid M) = -\infty \text{ if } \hat{P}(x_k \mid M) = 0 \text{ for any } k$
Copyright © Andrew W. Moore
Slide 75
Using a test set
The only reason that our test set didn't score -infinity is that my code is hard-wired to always predict a probability of at least one in 10^20.
We need Density Estimators that are less prone
to overfitting
Copyright © Andrew W. Moore
Slide 76
Naïve Density Estimation
The problem with the joint density estimator is that it simply mirrors the training sample.
We need something that generalizes more usefully.
The naïve model generalizes more effectively: it assumes that each attribute is distributed independently of the others.
Copyright © Andrew W. Moore
Slide 77
A note about independence
• Assume A and B are Boolean Random
Variables. Then
“A and B are independent”
if and only if
P(A|B) = P(A)
• "A and B are independent" is often notated as A ⊥ B
Copyright © Andrew W. Moore
Slide 78
Independence Theorems
• Assume P(A|B) = P(A). Then P(A^B) = ?
P(A|B) = P(A^B)/P(B); by independence, P(A) = P(A^B)/P(B), so P(A^B) = P(A) P(B).
• Assume P(A|B) = P(A). Then P(B|A) = ?
P(B|A) = P(A^B)/P(A); by the previous result, P(B|A) = P(A) P(B)/P(A) = P(B).
Copyright © Andrew W. Moore
Slide 79
Independence Theorems
• Assume P(A|B) = P(A). Then P(~A|B) = ?
P(~A|B) = P(~A^B)/P(B) = [P(B) - P(A^B)]/P(B) = 1 - P(A^B)/P(B) = 1 - P(A|B) = 1 - P(A) = P(~A).
• Assume P(A|B) = P(A). Then P(A|~B) = ?
P(A|~B) = P(A^~B)/P(~B) = [P(A) - P(A^B)]/[1 - P(B)] = [P(A) - P(A)P(B)]/[1 - P(B)] = P(A)[1 - P(B)]/[1 - P(B)] = P(A).
Copyright © Andrew W. Moore
Slide 80
Multivalued Independence
For multivalued random variables A and B, A ⊥ B if and only if
$\forall u, v : P(A = u \mid B = v) = P(A = u)$
from which you can then prove things like…
$\forall u, v : P(A = u \wedge B = v) = P(A = u)\,P(B = v)$
$\forall u, v : P(B = v \mid A = u) = P(B = v)$
Copyright © Andrew W. Moore
Slide 81
Independently Distributed Data
• Let x[i] denote the i'th field of record x.
• The independently distributed assumption says that for any i, v, u1, u2, … ui-1, ui+1, … uM:
$P(x[i] = v \mid x[1] = u_1, \ldots, x[i-1] = u_{i-1}, x[i+1] = u_{i+1}, \ldots, x[M] = u_M) = P(x[i] = v)$
• Or in other words, x[i] is independent of {x[1], x[2], … x[i-1], x[i+1], … x[M]}
• This is often written as
$x[i] \perp \{x[1], x[2], \ldots, x[i-1], x[i+1], \ldots, x[M]\}$
Copyright © Andrew W. Moore
Slide 82
Back to Naïve Density Estimation
• Let x[i] denote the i'th field of record x.
• Naïve DE assumes x[i] is independent of {x[1], x[2], … x[i-1], x[i+1], … x[M]}
• Example:
• Suppose that each record is generated by randomly shaking a green die and a red die
• Dataset 1: A = red value, B = green value
• Dataset 2: A = red value, B = sum of values
• Dataset 3: A = sum of values, B = difference of values
• Which of these datasets violates the naïve assumption?
Copyright © Andrew W. Moore
Slide 83
Using the Naïve Distribution
• Once you have a Naïve Distribution you can easily
compute any row of the joint distribution.
• Suppose A, B, C and D are independently
distributed. What is P(A^~B^C^~D)?
Copyright © Andrew W. Moore
Slide 84
Using the Naïve Distribution
• Once you have a Naïve Distribution you can easily
compute any row of the joint distribution.
• Suppose A, B, C and D are independently
distributed. What is P(A^~B^C^~D)?
= P(A|~B^C^~D) P(~B^C^~D)
= P(A) P(~B^C^~D)
= P(A) P(~B|C^~D) P(C^~D)
= P(A) P(~B) P(C^~D)
= P(A) P(~B) P(C|~D) P(~D)
= P(A) P(~B) P(C) P(~D)
Copyright © Andrew W. Moore
Slide 85
Naïve Distribution General Case
• Suppose x[1], x[2], … x[M] are independently distributed.
$P(x[1] = u_1, x[2] = u_2, \ldots, x[M] = u_M) = \prod_{k=1}^{M} P(x[k] = u_k)$
• So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand.
• So we can do any inference
• But how do we learn a Naïve Density Estimator?
Copyright © Andrew W. Moore
Slide 86
Learning a Naïve Density Estimator
$\hat{P}(x[i] = u) = \dfrac{\text{number of records in which } x[i] = u}{\text{total number of records}}$
Another trivial learning algorithm!
Copyright © Andrew W. Moore
Slide 87
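A minimal R sketch of a naïve density estimator (an illustrative assumption, not code from the slides): it learns one marginal frequency table per attribute and multiplies them to score a record, as in the general-case formula above. Using mpgtrain from Slide 64:

# Learn one marginal distribution per attribute
learn_naive <- function(records) {
  lapply(seq_len(ncol(records)), function(j) table(records[, j]) / nrow(records))
}

# Probability of a single record under the naïve model: product of its marginal probabilities
naive_prob <- function(record, model) {
  prod(mapply(function(value, marginal) marginal[value], record, model))
}

nde <- learn_naive(mpgtrain)
naive_prob(c("bad", "70-74", "america"), nde)   # compare with the joint estimate 0.2397959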
Comparison
Joint DE:
• Can model anything
• Can model a noisy copy of the original sample
• Given 100 records and more than 6 Boolean attributes it will screw up badly
Naïve DE:
• Can model only very boring distributions
• Cannot model such a copy
• Given 100 records and 10,000 multivalued attributes it will be fine
Copyright © Andrew W. Moore
Slide 88
Empirical Results: “MPG”
The “MPG” dataset consists of 392 records and 8 attributes
A tiny part of
the DE
learned by
“Joint”
The DE
learned by
“Naive”
Copyright © Andrew W. Moore
Slide 89
Empirical Results: “MPG”
The “MPG” dataset consists of 392 records and 8 attributes
A tiny part of
the DE
learned by
“Joint”
The DE
learned by
“Naive”
Copyright © Andrew W. Moore
Slide 90
Empirical Results: “Weight vs. MPG”
Suppose we train only from the “Weight” and “MPG” attributes
The DE
learned by
“Joint”
Copyright © Andrew W. Moore
The DE
learned by
“Naive”
Slide 91
Empirical Results: “Weight vs. MPG”
Suppose we train only from the “Weight” and “MPG” attributes
The DE
learned by
“Joint”
Copyright © Andrew W. Moore
The DE
learned by
“Naive”
Slide 92
“Weight vs. MPG”: The best that Naïve can do
The DE
learned by
“Joint”
Copyright © Andrew W. Moore
The DE
learned by
“Naive”
Slide 93
Reminder: The Good News
• We now have two ways to learn a density estimator from data.
• In other lectures we will see more sophisticated density estimators (Mixture Models, Bayesian Networks, Density Trees, Kernel Densities and many more).
• Density estimators can be used to:
• Detect anomalies (outliers).
• Do inference: compute P(E1|E2)
• Build Bayesian classifiers.
Copyright © Andrew W. Moore
Slide 94