Machine Learning
Probability Theory
WS 2013/2014 – Dmitrij Schlesinger, Carsten Rother
Probability space
A probability space is a triple $(\Omega, \sigma, P)$ with:
• $\Omega$ – the set of elementary events
• $\sigma$ – a $\sigma$-algebra over $\Omega$
• $P$ – a probability measure

A $\sigma$-algebra over $\Omega$ is a system of subsets, i.e. $\sigma \subseteq \mathcal{P}(\Omega)$ ($\mathcal{P}$ is the power set), with:
• $\Omega \in \sigma$
• $A \in \sigma \;\Rightarrow\; \Omega \setminus A \in \sigma$
• $A_i \in \sigma,\ i = 1 \ldots \infty \;\Rightarrow\; \bigcap_{i=1}^{\infty} A_i \in \sigma$

$\sigma$ is closed with respect to the complement and countable conjunction (intersection). It follows that $\emptyset \in \sigma$ and that $\sigma$ is also closed with respect to countable disjunction (union), due to De Morgan's laws.
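As a quick illustration, here is a minimal Python sketch that checks the $\sigma$-algebra axioms on a small finite $\Omega$; the helper `is_sigma_algebra` and the example sets are illustrative assumptions (in the finite case, closure under finite intersection suffices).

```python
def is_sigma_algebra(omega, sigma):
    """Check the sigma-algebra axioms for a finite ground set omega."""
    sigma = {frozenset(a) for a in sigma}
    omega = frozenset(omega)
    # 1) Omega itself belongs to sigma
    if omega not in sigma:
        return False
    # 2) closed under complement
    if any(omega - a not in sigma for a in sigma):
        return False
    # 3) closed under intersection (finite case)
    return all(a & b in sigma for a in sigma for b in sigma)

omega = {1, 2, 3, 4}
A = frozenset({1, 2})
minimal = [set(), A, omega - A, omega]             # {∅, A, Ω\A, Ω}
print(is_sigma_algebra(omega, minimal))            # True
print(is_sigma_algebra(omega, [set(), A, omega]))  # False: complement of A is missing
```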
Probability space
Examples:
• $\sigma = \{\emptyset, \Omega\}$ (the smallest) and $\sigma = \mathcal{P}(\Omega)$ (the largest) $\sigma$-algebras over $\Omega$
• the minimal $\sigma$-algebra over $\Omega$ containing a particular subset $A \subseteq \Omega$ is $\sigma = \{\emptyset, A, \Omega \setminus A, \Omega\}$
• $\Omega$ is discrete and finite, $\sigma = 2^{\Omega}$
• $\Omega = \mathbb{R}$, the Borel algebra (contains all intervals, among others)
• etc.
Probability measure
$P: \sigma \rightarrow [0,1]$ is a "measure" with the normalization $P(\Omega) = 1$.

$\sigma$-additivity: let $A_i \in \sigma$ be pairwise disjoint subsets, i.e. $A_i \cap A_{i'} = \emptyset$, then
$$P\Bigl(\bigcup_i A_i\Bigr) = \sum_i P(A_i)$$
Note: there are sets for which there is no measure.
Examples: the set of irrational numbers, function spaces $\mathbb{R}^{\infty}$ etc.
Banach–Tarski paradox (see Wikipedia).
(For us) practically relevant cases
• The set $\Omega$ is "good-natured", e.g. $\mathbb{R}^n$, discrete finite sets etc.
• $\sigma = \mathcal{P}(\Omega)$, i.e. the algebra is the power set
• We often consider a (composite) "event" $A \subseteq \Omega$ as the union of elementary ones
• The probability of an event is
$$P(A) = \sum_{\omega \in A} P(\omega)$$
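A minimal sketch of this finite setting, assuming a small hand-made probability table: event probabilities are obtained by summing elementary ones, and additivity is checked for two disjoint events.

```python
# Elementary events of a (slightly loaded) four-sided die with their probabilities
P_elem = {"1": 0.1, "2": 0.2, "3": 0.3, "4": 0.4}

def P(event):
    """Probability of a composite event A ⊆ Ω as the sum over its elementary events."""
    return sum(P_elem[w] for w in event)

A = {"1", "2"}
B = {"4"}
print(P(A))                                    # ≈ 0.3
print(P(set(P_elem)))                          # ≈ 1.0 -> normalization P(Ω) = 1
print(abs(P(A | B) - (P(A) + P(B))) < 1e-12)   # True: additivity for disjoint A, B
```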
Random variables
Here a special case – real-valued random variables.

A random variable $X$ for a probability space $(\Omega, \sigma, P)$ is a mapping $X: \Omega \rightarrow \mathbb{R}$ satisfying
$$\{\omega : X(\omega) \le r\} \in \sigma \quad \forall r \in \mathbb{R}$$
(this always holds for power sets).

Note: elementary events are not numbers – they are elements of a general set $\Omega$. Random variables, in contrast, are numbers, i.e. they can be summed up, subtracted, squared etc.
Distributions
Cumulative distribution function of a random variable $X$:
$$F_X(r) = P(\{\omega : X(\omega) \le r\})$$
Probability distribution of a discrete random variable $X: \Omega \rightarrow \mathbb{Z}$:
$$p_X(n) = P(\{\omega : X(\omega) = n\})$$
Probability density of a continuous random variable $X: \Omega \rightarrow \mathbb{R}$:
$$p_X(r) = \frac{dF_X(r)}{dr}$$
Distributions
Why is it necessary to proceed in such an indirect way (through the cumulative distribution function)?

Example – a Gaussian.

The probability of any particular real value is zero, so a "direct" definition of a "probability distribution" is senseless. It is, however, possible through the cumulative distribution function.
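To make this concrete, a short numerical sketch with a standard Gaussian (the helper names are illustrative): the CDF is expressed via the error function, and its finite-difference derivative matches the usual density formula, even though each single value has probability zero.

```python
import math

def gauss_cdf(r, mu=0.0, s=1.0):
    """Cumulative distribution function of N(mu, s^2), via the error function."""
    return 0.5 * (1.0 + math.erf((r - mu) / (s * math.sqrt(2.0))))

def gauss_pdf(r, mu=0.0, s=1.0):
    """Density of N(mu, s^2)."""
    return math.exp(-0.5 * ((r - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

r, h = 0.7, 1e-5
numeric = (gauss_cdf(r + h) - gauss_cdf(r - h)) / (2 * h)   # dF_X(r)/dr numerically
print(numeric, gauss_pdf(r))                 # both ≈ 0.3123 for r = 0.7
# P(X == 0.7) itself is zero: the CDF has no jump at 0.7
print(gauss_cdf(r) - gauss_cdf(r - 1e-12))   # ≈ 0.0
```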
Mean
A mean (expectation, average ...) of a random variable $X$ is
$$\mathbb{E}_P(X) = \sum_{\omega \in \Omega} P(\omega) \cdot X(\omega) = \sum_{x} P(\{\omega : X(\omega) = x\}) \cdot x = \sum_{x} p_X(x) \cdot x$$
The arithmetic mean is a special case:
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i = \sum_{i} p_X(i) \cdot i$$
with $x_i \equiv i$ and $p_X(i) = \frac{1}{N}$ (uniform probability distribution).
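A brief sketch of the two readings of the mean, assuming a toy probability space: once as a sum over elementary events weighted by $P(\omega)$, once as a sum over values weighted by $p_X(x)$.

```python
from collections import defaultdict

# Toy probability space: elementary events with probabilities, and a random variable X
P_elem = {"w1": 0.2, "w2": 0.3, "w3": 0.5}
X = {"w1": 1.0, "w2": 1.0, "w3": 4.0}

# E[X] as a sum over elementary events
mean_omega = sum(P_elem[w] * X[w] for w in P_elem)

# E[X] as a sum over values x, weighted by p_X(x) = P({ω: X(ω) = x})
p_X = defaultdict(float)
for w, x in X.items():
    p_X[x] += P_elem[w]
mean_values = sum(p * x for x, p in p_X.items())

print(mean_omega, mean_values)   # both 2.5
```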
Mean
The probability of an event $A \subseteq \Omega$ can be expressed as the mean value of a corresponding "indicator" variable:
$$P(A) = \sum_{\omega \in A} P(\omega) = \sum_{\omega \in \Omega} I(\omega) \cdot P(\omega)$$
with
$$I(\omega) = \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{otherwise} \end{cases}$$
Often, the set of elementary events can be associated with a random variable (just enumerate all $\omega \in \Omega$). Then one can speak about a "probability distribution over $\Omega$" (instead of the probability measure).
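A tiny check, on an assumed toy space, that $P(A)$ equals the mean of the corresponding indicator variable:

```python
# P(A) as the expectation of an indicator variable, on a small assumed space
P_elem = {"w1": 0.2, "w2": 0.3, "w3": 0.5}
A = {"w1", "w3"}

indicator = {w: 1.0 if w in A else 0.0 for w in P_elem}
prob_direct = sum(P_elem[w] for w in A)                       # Σ_{ω∈A} P(ω)
prob_as_mean = sum(indicator[w] * P_elem[w] for w in P_elem)  # E[I]
print(prob_direct, prob_as_mean)   # both 0.7
```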
Example 1 – the number of a die

The set of elementary events (the six faces):
$$\Omega = \{a, b, c, d, e, f\}$$
Probability measure:
$$P(a) = \ldots = P(f) = \tfrac{1}{6}, \qquad P(\{a, b\}) = \tfrac{1}{3} \; \ldots$$
Random variable (the number shown by the die):
$$X(a) = 1,\; X(b) = 2\; \ldots\; X(f) = 6$$
Cumulative distribution:
$$F_X(3) = \tfrac{1}{2}, \qquad F_X(4.5) = \tfrac{2}{3} \; \ldots$$
Probability distribution:
$$p_X(1) = p_X(2) = \ldots = p_X(6) = \tfrac{1}{6}$$
Mean value:
$$\mathbb{E}_P(X) = 3.5$$
Another random variable (the squared number of the die):
$$X'(a) = 1,\; X'(b) = 4\; \ldots\; X'(f) = 36$$
Mean value:
$$\mathbb{E}_P(X') = 15\tfrac{1}{6}$$
Note: $\mathbb{E}_P(X') \ne \mathbb{E}_P(X)^2$
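The sketch below reproduces these numbers by brute force over the six faces, using exact fractions: the CDF values, $\mathbb{E}_P(X) = 3.5$, $\mathbb{E}_P(X') = 91/6 \approx 15.17$, and the fact that $\mathbb{E}[X^2] \ne \mathbb{E}[X]^2$.

```python
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)                                     # uniform probability of each face

F = lambda r: sum(p for x in faces if x <= r)          # cumulative distribution F_X(r)
E_X  = sum(p * x for x in faces)                       # mean of X
E_X2 = sum(p * x * x for x in faces)                   # mean of X' = X^2

print(F(3), F(4.5))        # 1/2 2/3
print(E_X)                 # 7/2  (= 3.5)
print(E_X2)                # 91/6 (≈ 15.17)
print(E_X2 == E_X ** 2)    # False: E[X^2] ≠ E[X]^2
```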
Example 2 – two independent dice numbers

The set of elementary events (6×6 faces):
$$\Omega = \{a, b, c, d, e, f\} \times \{a, b, c, d, e, f\}$$
Probability measure:
$$P\bigl((a,b)\bigr) = \tfrac{1}{36}, \qquad P\bigl(\{(a,b), (b,a)\}\bigr) = \tfrac{1}{18} \; \ldots$$
Two random variables:
1) the number of the first die: $X_1\bigl((a,b)\bigr) = 1,\; X_1\bigl((a,c)\bigr) = 1,\; X_1\bigl((e,b)\bigr) = 5 \; \ldots$
2) the number of the second die: $X_2\bigl((a,b)\bigr) = 2,\; X_2\bigl((a,c)\bigr) = 3,\; X_2\bigl((e,f)\bigr) = 6 \; \ldots$
Probability distributions:
$$p_{X_1}(1) = p_{X_1}(2) = \ldots = p_{X_1}(6) = \tfrac{1}{6}, \qquad p_{X_2}(1) = p_{X_2}(2) = \ldots = p_{X_2}(6) = \tfrac{1}{6}$$
Example 2 – two independent dice numbers (continued)

Consider the new random variable $X = X_1 + X_2$.
The probability distribution $p_X$ is not uniform anymore:
$$p_X \propto (1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)$$
The mean value is $\mathbb{E}_P(X) = 7$.
In general, for mean values:
$$\mathbb{E}_P(X_1 + X_2) = \sum_{\omega \in \Omega} P(\omega) \cdot \bigl(X_1(\omega) + X_2(\omega)\bigr) = \mathbb{E}_P(X_1) + \mathbb{E}_P(X_2)$$
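A small enumeration of all 36 equally likely outcomes confirms the uniform marginals, the triangular distribution of the sum proportional to $(1, 2, \ldots, 6, \ldots, 2, 1)$, and $\mathbb{E}_P(X_1 + X_2) = \mathbb{E}_P(X_1) + \mathbb{E}_P(X_2) = 7$ (the variable names in the sketch are illustrative).

```python
from collections import Counter
from fractions import Fraction
from itertools import product

p = Fraction(1, 36)
outcomes = list(product(range(1, 7), repeat=2))   # all 36 pairs (first die, second die)

# Marginal distribution of the first die: uniform 1/6
p_X1 = Counter()
for d1, d2 in outcomes:
    p_X1[d1] += p
print(p_X1[1], p_X1[6])                        # 1/6 1/6

# Distribution of the sum X = X1 + X2: proportional to (1,2,3,4,5,6,5,4,3,2,1)
p_sum = Counter()
for d1, d2 in outcomes:
    p_sum[d1 + d2] += p
print([p_sum[s] * 36 for s in range(2, 13)])   # [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]

# Linearity of the mean
E = lambda f: sum(p * f(d1, d2) for d1, d2 in outcomes)
print(E(lambda a, b: a + b), E(lambda a, b: a) + E(lambda a, b: b))   # 7 7
```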
Random variables of higher dimension
Analogously: let $X: \Omega \rightarrow \mathbb{R}^n$ be a mapping ($n = 2$ for simplicity), with $X = (X_1, X_2)$, $X_1: \Omega \rightarrow \mathbb{R}$ and $X_2: \Omega \rightarrow \mathbb{R}$.

Cumulative distribution function:
$$F_X(r, s) = P\bigl(\{\omega : X_1(\omega) \le r\} \cap \{\omega : X_2(\omega) \le s\}\bigr)$$
Joint probability distribution (discrete):
$$p_{X=(X_1,X_2)}(m, n) = P\bigl(\{\omega : X_1(\omega) = m\} \cap \{\omega : X_2(\omega) = n\}\bigr)$$
Joint probability density (continuous):
$$p_{X=(X_1,X_2)}(r, s) = \frac{\partial^2 F_X(r, s)}{\partial r\, \partial s}$$
Independence
Two events $A \in \sigma$ and $B \in \sigma$ are independent if
$$P(A \cap B) = P(A) \cdot P(B)$$
Interesting: the events $A$ and $\bar{B} = \Omega \setminus B$ are independent if $A$ and $B$ are independent.

Two random variables are independent if
$$F_{X=(X_1,X_2)}(r, s) = F_{X_1}(r) \cdot F_{X_2}(s) \quad \forall r, s$$
It follows (example for continuous $X$):
$$p_X(r, s) = \frac{\partial^2 F_X(r, s)}{\partial r\, \partial s} = \frac{dF_{X_1}(r)}{dr} \cdot \frac{dF_{X_2}(s)}{ds} = p_{X_1}(r) \cdot p_{X_2}(s)$$
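As a check on the two-dice space: the pair $(X_1, X_2)$ factorizes into its marginals, while $X_1$ and the sum $X_1 + X_2$ do not; the sketch below verifies $p(m, n) = p(m) \cdot p(n)$ by enumeration (the helper names are illustrative).

```python
from collections import Counter
from fractions import Fraction
from itertools import product

p = Fraction(1, 36)
outcomes = list(product(range(1, 7), repeat=2))

def joint_and_marginals(f, g):
    """Joint distribution of (f, g) and the two marginals, by enumeration."""
    joint, pf, pg = Counter(), Counter(), Counter()
    for d1, d2 in outcomes:
        joint[(f(d1, d2), g(d1, d2))] += p
        pf[f(d1, d2)] += p
        pg[g(d1, d2)] += p
    return joint, pf, pg

def independent(f, g):
    joint, pf, pg = joint_and_marginals(f, g)
    return all(joint[(a, b)] == pf[a] * pg[b] for a in pf for b in pg)

print(independent(lambda a, b: a, lambda a, b: b))       # True:  X1 and X2
print(independent(lambda a, b: a, lambda a, b: a + b))   # False: X1 and X1 + X2
```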
Conditional probabilities
Conditional probability:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
Independence (almost equivalent): $A$ and $B$ are independent if
$$P(A \mid B) = P(A) \quad \text{and/or} \quad P(B \mid A) = P(B)$$
Bayes' theorem (formula, rule):
$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$
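A quick numeric illustration on the two-dice space, with assumed events $A$ = "the first die shows an even number" and $B$ = "the sum equals 7": the conditional probability is computed from the definition and recovered again via Bayes' rule.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
P = lambda event: Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] % 2 == 0          # first die is even
B = lambda o: o[0] + o[1] == 7       # sum equals 7
A_and_B = lambda o: A(o) and B(o)

P_A_given_B = P(A_and_B) / P(B)      # definition of conditional probability
P_B_given_A = P(A_and_B) / P(A)
via_bayes = P_B_given_A * P(A) / P(B)   # Bayes' rule gives the same value
print(P_A_given_B, via_bayes)           # 1/2 1/2
```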
Further definitions (for random variables)
Shorthand: $p(x, y) \equiv p_X(x, y)$

Marginal probability distribution:
$$p(x) = \sum_{y} p(x, y)$$
Conditional probability distribution:
$$p(x \mid y) = \frac{p(x, y)}{p(y)}$$
Note: $\sum_{x} p(x \mid y) = 1$

Independent probability distribution:
$$p(x, y) = p(x) \cdot p(y)$$
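A minimal sketch with an assumed 2×3 joint table: marginalization sums out $y$, conditioning divides by the marginal, and each conditional column sums to one.

```python
import numpy as np

# Assumed joint distribution p(x, y): rows index x, columns index y
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

p_x = p_xy.sum(axis=1)                # marginal p(x) = Σ_y p(x, y)
p_y = p_xy.sum(axis=0)                # marginal p(y)
p_x_given_y = p_xy / p_y              # conditional p(x | y) = p(x, y) / p(y)

print(p_x)                            # [0.4 0.6]
print(p_x_given_y.sum(axis=0))        # [1. 1. 1.] -> Σ_x p(x|y) = 1 for every y
print(np.allclose(p_xy, np.outer(p_x, p_y)))  # False: this table is not independent
```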
Example
Let the probability of being ill be
$$p(\mathrm{ill}) = 0.02$$
Let the conditional probability of having a temperature in that case be
$$p(\mathrm{temp} \mid \mathrm{ill}) = 0.9$$
However, one may also have a temperature without any illness, i.e.
$$p(\mathrm{temp} \mid \overline{\mathrm{ill}}) = 0.05$$
What is the probability of being ill, given that one has a temperature?
Example
Bayes' rule:
$$p(\mathrm{ill} \mid \mathrm{temp}) = \frac{p(\mathrm{temp} \mid \mathrm{ill}) \cdot p(\mathrm{ill})}{p(\mathrm{temp})}$$
(marginal probability in the denominator)
$$= \frac{p(\mathrm{temp} \mid \mathrm{ill}) \cdot p(\mathrm{ill})}{p(\mathrm{temp} \mid \mathrm{ill}) \cdot p(\mathrm{ill}) + p(\mathrm{temp} \mid \overline{\mathrm{ill}}) \cdot p(\overline{\mathrm{ill}})} = \frac{0.9 \cdot 0.02}{0.9 \cdot 0.02 + 0.05 \cdot 0.98} \approx 0.27$$
→ not as high as one might expect; the reason is the very low prior probability of being ill.
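The same computation in a few lines, using the numbers from the example:

```python
p_ill = 0.02
p_temp_given_ill = 0.9
p_temp_given_healthy = 0.05

# Marginal probability of a temperature (law of total probability)
p_temp = p_temp_given_ill * p_ill + p_temp_given_healthy * (1 - p_ill)

# Bayes' rule
p_ill_given_temp = p_temp_given_ill * p_ill / p_temp
print(round(p_ill_given_temp, 3))   # 0.269
```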
Further topics
The model

Let two random variables be given:
• the first one is typically discrete ($k \in K$) and is called the "class"
• the second one is often continuous ($x \in X$) and is called the "observation"

Let the joint probability distribution $p(x, k)$ be "given". As $k$ is discrete, it is often specified by $p(x, k) = p(k) \cdot p(x \mid k)$ (a small sketch of such a model follows below).

The recognition task: given $x$, estimate $k$.

Usual problems (questions):
• How to estimate $k$ from $x$?
• The joint probability is not always explicitly specified.
• The set $K$ is sometimes huge.
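A toy version of this model, with assumed priors and one-dimensional Gaussian class-conditionals (all names and numbers are illustrative): the posterior $p(k \mid x)$ is obtained by Bayes' rule, which is exactly the computation the recognition task requires.

```python
import math

# Assumed generative model: p(x, k) = p(k) * p(x | k) with Gaussian class-conditionals
priors = {"healthy": 0.7, "ill": 0.3}                  # p(k)
params = {"healthy": (36.8, 0.4), "ill": (38.5, 0.8)}  # (mean, std) of p(x | k)

def gauss(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def posterior(x):
    """p(k | x) = p(k) p(x|k) / Σ_k' p(k') p(x|k')."""
    joint = {k: priors[k] * gauss(x, *params[k]) for k in priors}
    z = sum(joint.values())
    return {k: v / z for k, v in joint.items()}

# Recognition: given an observation x, estimate the class k
x = 38.2
post = posterior(x)
print(post)                          # posterior probabilities of both classes
print(max(post, key=post.get))       # the most probable class ('ill' for this x)
```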
Further topics
The learning task:
Often (almost always) the probability distribution is known only up to free parameters. How to choose them (i.e. learn them from examples)?

Next classes:
1. Recognition, Bayesian decision theory
2. Probabilistic learning, the maximum-likelihood principle
3. Discriminative models, recognition and learning
…