Machine Learning
Probability Theory
WS2013/2014 – Dmitrij Schlesinger, Carsten Rother
Probability space
is a triple (Ω, 𝜎, 𝑃) with:
• Ω – the set of elementary events
• 𝜎 – a 𝜎-algebra over Ω
• 𝑃 – a probability measure
A 𝜎-algebra over Ω is a system of subsets of Ω, i.e. 𝜎 ⊆ 𝒫(Ω) (𝒫 is the power set), with:
• Ω ∈ 𝜎
• 𝐴 ∈ 𝜎 ⇒ Ω ∖ 𝐴 ∈ 𝜎
• 𝐴_𝑖 ∈ 𝜎 for 𝑖 = 1, 2, … ⇒ ⋃_𝑖 𝐴_𝑖 ∈ 𝜎
𝜎 is closed with respect to complement and countable union. It follows that ∅ ∈ 𝜎 and that 𝜎 is also closed with respect to countable intersection (by De Morgan's laws).
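As a small illustration (not from the slides; the sets and the helper function below are made up), the axioms can be checked mechanically for a finite system of subsets:

```python
# A small sketch (assumed example): check the sigma-algebra axioms for a
# finite system of subsets over Ω = {a, b, c, d}.
from itertools import combinations

omega = frozenset("abcd")
A = frozenset("ab")
sigma = {frozenset(), A, omega - A, omega}          # {∅, A, Ω∖A, Ω}

def is_sigma_algebra(system, omega):
    if omega not in system:
        return False
    if any(omega - s not in system for s in system):                  # closed under complement
        return False
    if any(s | t not in system for s, t in combinations(system, 2)):  # closed under union
        return False
    return True

print(is_sigma_algebra(sigma, omega))                    # True
print(is_sigma_algebra({frozenset(), A, omega}, omega))  # False: Ω∖A is missing
```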
Probability space
Examples:
• 𝜎 = {∅, Ω} (the smallest) and 𝜎 = 𝒫(Ω) (the largest) 𝜎-algebra over Ω
• the minimal 𝜎-algebra over Ω containing a particular subset 𝐴 ⊆ Ω is 𝜎 = {∅, 𝐴, Ω ∖ 𝐴, Ω}
• Ω discrete and finite: 𝜎 = 2^Ω
• Ω = ℝ: the Borel 𝜎-algebra (contains, among others, all intervals)
β€’ etc.
Probability measure
𝑃: 𝜎 → [0, 1] is a “measure” with the normalization 𝑃(Ω) = 1.
𝜎-additivity: let 𝐴_𝑖 ∈ 𝜎 be pairwise disjoint subsets, i.e. 𝐴_𝑖 ∩ 𝐴_𝑖′ = ∅ for 𝑖 ≠ 𝑖′; then
𝑃(⋃_𝑖 𝐴_𝑖) = ∑_𝑖 𝑃(𝐴_𝑖)
Note: there are sets to which no measure can be assigned.
Examples: non-measurable subsets of ℝ (e.g. Vitali sets), infinite-dimensional function spaces such as ℝ^∞, etc.
Banach–Tarski paradox (see Wikipedia).
(For us) practically relevant cases
β€’ The set Ξ© is β€žgood-naturedβ€œ, e.g. ℝ𝑛 , discrete finite sets etc.
β€’ 𝜎 = 𝒫 Ξ© , i.e. the algebra is the power set
β€’ We often consider a (composite) β€ževentβ€œ 𝐴 βŠ† Ξ© as the union of
elemantary ones
β€’ Probability of an event is
𝑃 𝐴 =
𝑃(πœ”)
πœ”βˆˆπ΄
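A minimal sketch of this summation (the die and the event below are assumed examples, not from the slides):

```python
# A minimal sketch (assumed example): a fair die, P(ω) = 1/6 for each elementary
# event, and the composite event A = "the number is even".
from fractions import Fraction

P = {w: Fraction(1, 6) for w in [1, 2, 3, 4, 5, 6]}   # probabilities of elementary events
A = {2, 4, 6}                                         # composite event as a set of elementary ones

print(sum(P[w] for w in A))                           # P(A) = sum of P(ω) over ω in A = 1/2
```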
Random variables
Here a special case – real-valued random variables.
A random variable 𝜉 for a probability space (Ω, 𝜎, 𝑃) is a mapping
𝜉: Ω → ℝ satisfying
{𝜔: 𝜉(𝜔) ≤ 𝑟} ∈ 𝜎 for all 𝑟 ∈ ℝ
(this always holds if 𝜎 is the power set).
Note: elementary events are not numbers – they are elements of a general set Ω.
Random variables, in contrast, are numbers, i.e. they can be added, subtracted, squared, etc.
Distributions
Cumulative distribution function of a random variable 𝜉:
𝐹_𝜉(𝑟) = 𝑃({𝜔: 𝜉(𝜔) ≤ 𝑟})
Probability distribution of a discrete random variable 𝜉: Ω → ℤ:
𝑝_𝜉(𝑟) = 𝑃({𝜔: 𝜉(𝜔) = 𝑟})
Probability density of a continuous random variable 𝜉: Ω → ℝ:
𝑝_𝜉(𝑟) = ∂𝐹_𝜉(𝑟) / ∂𝑟
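A minimal sketch of the relation between density and cumulative distribution function, assuming a standard Gaussian as the example and using scipy.stats.norm:

```python
# A minimal sketch (assumed example): for a standard Gaussian the density p_ξ(r)
# agrees with the derivative of the cumulative distribution function F_ξ(r).
from scipy.stats import norm

r, h = 1.0, 1e-5
dF_dr = (norm.cdf(r + h) - norm.cdf(r - h)) / (2 * h)  # numerical dF_ξ(r)/dr
print(dF_dr, norm.pdf(r))                              # both ≈ 0.2420
```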
Distributions
Why is this indirect definition (via the cumulative distribution function) necessary?
Example – a Gaussian:
the probability of any particular real value is zero, so a “direct” definition of a “probability distribution” is meaningless.
Via the cumulative distribution function, however, such a definition (as a density) is indeed possible.
Mean
The mean (expectation, average, …) of a random variable 𝜉 is
𝔼_𝑃(𝜉) = ∑_{𝜔∈Ω} 𝑃(𝜔) ⋅ 𝜉(𝜔) = ∑_𝑟 ∑_{𝜔: 𝜉(𝜔)=𝑟} 𝑃(𝜔) ⋅ 𝑟 = ∑_𝑟 𝑝_𝜉(𝑟) ⋅ 𝑟
The arithmetic mean is a special case:
𝑥̄ = (1/𝑁) ∑_{𝑖=1}^{𝑁} 𝑥_𝑖 = ∑_𝑟 𝑝_𝜉(𝑟) ⋅ 𝑟
with 𝑥 ≡ 𝑟 and 𝑝_𝜉(𝑟) = 1/𝑁 (uniform probability distribution).
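A minimal sketch with an assumed sample, checking that the arithmetic mean coincides with the expectation under the empirical (uniform over observations) distribution:

```python
# A minimal sketch (assumed sample): the arithmetic mean equals the sum of p_ξ(r)·r
# when each observed value r gets the empirical probability p_ξ(r) = (#occurrences of r)/N.
from collections import Counter

x = [2, 3, 3, 5, 7]                                # assumed sample, N = 5
N = len(x)
p = {r: c / N for r, c in Counter(x).items()}      # empirical distribution p_ξ(r)

print(sum(x) / N)                                  # arithmetic mean
print(sum(p_r * r for r, p_r in p.items()))        # expectation, same value (4.0)
```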
Mean
The probability of an event 𝐴 ⊆ Ω can be expressed as the mean value of a corresponding “indicator” variable:
𝑃(𝐴) = ∑_{𝜔∈𝐴} 𝑃(𝜔) = ∑_{𝜔∈Ω} 𝑃(𝜔) ⋅ 𝜉(𝜔)
with
𝜉(𝜔) = 1 if 𝜔 ∈ 𝐴, and 𝜉(𝜔) = 0 otherwise.
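A minimal sketch of this identity, reusing the fair-die example assumed above:

```python
# A minimal sketch (same assumed die as above): P(A) as the mean of the indicator
# variable ξ(ω) = 1 if ω ∈ A, else 0.
from fractions import Fraction

P = {w: Fraction(1, 6) for w in [1, 2, 3, 4, 5, 6]}
A = {2, 4, 6}
xi = lambda w: 1 if w in A else 0                 # indicator variable of A

print(sum(P[w] * xi(w) for w in P))               # 1/2 = P(A)
```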
Often, the set of elementary events can be associated with a
random variable (just enumerate all πœ” ∈ Ξ© ).
Then one can speak of a “probability distribution over Ω” (instead of the probability measure).
Example 1 – numbers of a die
The set of elementary events:
Ω = {𝑎, 𝑏, 𝑐, 𝑑, 𝑒, 𝑓}
Probability measure:
𝑃(𝑎) = 1/6, 𝑃({𝑐, 𝑓}) = 1/3, …
Random variable (the number shown by the die):
𝜉(𝑎) = 1, 𝜉(𝑏) = 2, …, 𝜉(𝑓) = 6
Cumulative distribution:
𝐹_𝜉(3) = 1/2, 𝐹_𝜉(4.5) = 2/3, …
Probability distribution:
𝑝_𝜉(1) = 𝑝_𝜉(2) = … = 𝑝_𝜉(6) = 1/6
Mean value:
𝔼_𝑃(𝜉) = 3.5
Another random variable (the squared number of the die):
𝜉′(𝑎) = 1, 𝜉′(𝑏) = 4, …, 𝜉′(𝑓) = 36
Mean value:
𝔼_𝑃(𝜉′) = 91/6 ≈ 15.17
Note: 𝔼_𝑃(𝜉′) ≠ 𝔼_𝑃(𝜉)²
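A minimal sketch verifying these numbers (the variable names are chosen for this sketch only):

```python
# A minimal sketch verifying the numbers of this example with exact fractions.
from fractions import Fraction

P = {w: Fraction(1, 6) for w in "abcdef"}
xi = dict(zip("abcdef", [1, 2, 3, 4, 5, 6]))          # number shown by the die
xi_sq = {w: xi[w] ** 2 for w in P}                    # squared number

F = lambda r: sum(P[w] for w in P if xi[w] <= r)      # cumulative distribution F_ξ(r)
print(F(3), F(4.5))                                   # 1/2 2/3

E = lambda rv: sum(P[w] * rv[w] for w in P)           # mean value
print(E(xi), E(xi_sq), E(xi) ** 2)                    # 7/2 91/6 49/4  ->  E(ξ′) ≠ E(ξ)²
```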
Example 2 – two independent dice numbers
The set of elementary events (6×6 faces):
Ω = {𝑎, 𝑏, 𝑐, 𝑑, 𝑒, 𝑓} × {𝑎, 𝑏, 𝑐, 𝑑, 𝑒, 𝑓}
Probability measure: 𝑃(𝑎𝑏) = 1/36, 𝑃({𝑐𝑑, 𝑓𝑎}) = 1/18, …
Two random variables:
1) The number of the first die: 𝜉1(𝑎𝑏) = 1, 𝜉1(𝑎𝑐) = 1, 𝜉1(𝑒𝑓) = 5, …
2) The number of the second die: 𝜉2(𝑎𝑏) = 2, 𝜉2(𝑎𝑐) = 3, 𝜉2(𝑒𝑓) = 6, …
Probability distributions:
𝑝_𝜉1(1) = 𝑝_𝜉1(2) = … = 𝑝_𝜉1(6) = 1/6
𝑝_𝜉2(1) = 𝑝_𝜉2(2) = … = 𝑝_𝜉2(6) = 1/6
Example 2 – two independent dice numbers
Consider the new random variable: πœ‰ = πœ‰1 + πœ‰2
The probability distribution 𝑝_𝜉 is not uniform anymore:
𝑝_𝜉 ∝ (1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)
Mean value: 𝔼_𝑃(𝜉) = 7
In general for mean values:
𝔼_𝑃(𝜉1 + 𝜉2) = ∑_{𝜔∈Ω} 𝑃(𝜔) ⋅ (𝜉1(𝜔) + 𝜉2(𝜔)) = 𝔼_𝑃(𝜉1) + 𝔼_𝑃(𝜉2)
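A minimal sketch verifying the distribution of the sum and the additivity of the mean:

```python
# A minimal sketch verifying the distribution and the mean of ξ = ξ1 + ξ2.
from collections import Counter
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4, 5, 6]
counts = Counter(d1 + d2 for d1, d2 in product(faces, faces))   # 36 equally likely outcomes

print([counts[s] for s in range(2, 13)])                        # [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]
print(sum(Fraction(c, 36) * s for s, c in counts.items()))      # 7 = E(ξ1) + E(ξ2)
```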
Random variables of higher dimension
Analogously: let 𝜉: Ω → ℝ^𝑛 be a mapping (𝑛 = 2 for simplicity),
with 𝜉 = (𝜉1, 𝜉2), 𝜉1: Ω → ℝ and 𝜉2: Ω → ℝ
Cumulative distribution function:
𝐹_𝜉(𝑟, 𝑠) = 𝑃({𝜔: 𝜉1(𝜔) ≤ 𝑟} ∩ {𝜔: 𝜉2(𝜔) ≤ 𝑠})
Joint probability distribution (discrete):
𝑝_𝜉(𝑟, 𝑠) = 𝑝_𝜉1,𝜉2(𝑟, 𝑠) = 𝑃({𝜔: 𝜉1(𝜔) = 𝑟} ∩ {𝜔: 𝜉2(𝜔) = 𝑠})
Joint probability density (continuous):
𝑝_𝜉(𝑟, 𝑠) = 𝑝_𝜉1,𝜉2(𝑟, 𝑠) = ∂²𝐹_𝜉(𝑟, 𝑠) / (∂𝑟 ∂𝑠)
Independence
Two events 𝐴 ∈ 𝜎 and 𝐵 ∈ 𝜎 are independent if
𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐴) ⋅ 𝑃(𝐵)
Interesting: the events 𝐴 and Ω ∖ 𝐵 are independent if 𝐴 and 𝐵 are independent.
Two random variables are independent if
𝐹_𝜉(𝑟, 𝑠) = 𝐹_𝜉1,𝜉2(𝑟, 𝑠) = 𝐹_𝜉1(𝑟) ⋅ 𝐹_𝜉2(𝑠) for all 𝑟, 𝑠
It follows (example for a continuous 𝜉):
𝑝_𝜉(𝑟, 𝑠) = ∂²𝐹_𝜉(𝑟, 𝑠) / (∂𝑟 ∂𝑠) = (∂𝐹_𝜉1(𝑟) / ∂𝑟) ⋅ (∂𝐹_𝜉2(𝑠) / ∂𝑠) = 𝑝_𝜉1(𝑟) ⋅ 𝑝_𝜉2(𝑠)
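A minimal sketch for the two-dice example, checking that the joint distribution factorizes into the product of its marginals:

```python
# A minimal sketch: for two fair dice the joint distribution of (ξ1, ξ2)
# factorizes into the product of its marginals.
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4, 5, 6]
joint = {(r, s): Fraction(1, 36) for r, s in product(faces, faces)}
p1 = {r: sum(joint[r, s] for s in faces) for r in faces}        # marginal of ξ1
p2 = {s: sum(joint[r, s] for r in faces) for s in faces}        # marginal of ξ2

print(all(joint[r, s] == p1[r] * p2[s] for r, s in joint))      # True
```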
Conditional probabilities
Conditional probability:
𝑃(𝐴 | 𝐵) = 𝑃(𝐴 ∩ 𝐵) / 𝑃(𝐵)
Independence (almost equivalent): 𝐴 and 𝐵 are independent if
𝑃(𝐴 | 𝐵) = 𝑃(𝐴) and/or 𝑃(𝐵 | 𝐴) = 𝑃(𝐵)
Bayes' theorem (formula, rule):
𝑃(𝐴 | 𝐵) = 𝑃(𝐵 | 𝐴) ⋅ 𝑃(𝐴) / 𝑃(𝐵)
Further definitions (for random variables)
Shorthand: 𝑝(𝑥, 𝑦) ≡ 𝑝_𝜉(𝑥, 𝑦)
Marginal probability distribution:
𝑝(𝑥) = ∑_𝑦 𝑝(𝑥, 𝑦)
Conditional probability distribution:
𝑝(𝑥 | 𝑦) = 𝑝(𝑥, 𝑦) / 𝑝(𝑦)
Note: ∑_𝑥 𝑝(𝑥 | 𝑦) = 1
Independent probability distribution:
𝑝(𝑥, 𝑦) = 𝑝(𝑥) ⋅ 𝑝(𝑦)
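A minimal sketch with an assumed joint table 𝑝(𝑥, 𝑦), computing a marginal and a conditional distribution:

```python
# A minimal sketch (assumed joint table): marginal and conditional distributions
# from p(x, y), and the check that p(x|y) sums to 1 over x.
from fractions import Fraction

joint = {("a", 0): Fraction(1, 10), ("a", 1): Fraction(3, 10),   # assumed values of p(x, y)
         ("b", 0): Fraction(2, 10), ("b", 1): Fraction(4, 10)}

p_y1 = sum(p for (x, y), p in joint.items() if y == 1)           # marginal p(y = 1) = 7/10
cond = {x: joint[x, 1] / p_y1 for x in ("a", "b")}               # conditional p(x | y = 1)

print(cond)                 # p(a|1) = 3/7, p(b|1) = 4/7
print(sum(cond.values()))   # 1
```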
Example
Let the probability of being taken ill be
𝑝(ill) = 0.02
Let the conditional probability of having a temperature in that case be
𝑝(temp | ill) = 0.9
However, one may have a temperature without being ill, i.e.
𝑝(temp | ¬ill) = 0.05
What is the probability of being taken ill given that one has a temperature?
Example
Bayes’ rule:
𝑝(ill | temp) = 𝑝(temp | ill) ⋅ 𝑝(ill) / 𝑝(temp)
(marginal probability in the denominator)
= 𝑝(temp | ill) ⋅ 𝑝(ill) / (𝑝(temp | ill) ⋅ 𝑝(ill) + 𝑝(temp | ¬ill) ⋅ 𝑝(¬ill))
= 0.9 ⋅ 0.02 / (0.9 ⋅ 0.02 + 0.05 ⋅ 0.98) ≈ 0.27
– not as high as one might expect; the reason is the very low prior probability of being taken ill.
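A minimal sketch reproducing this computation:

```python
# A minimal sketch reproducing the computation with Bayes' rule.
p_ill = 0.02
p_temp_given_ill = 0.9
p_temp_given_not_ill = 0.05

p_temp = p_temp_given_ill * p_ill + p_temp_given_not_ill * (1 - p_ill)  # marginal p(temp)
print(p_temp_given_ill * p_ill / p_temp)                                # ≈ 0.2687
```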
Further topics
The model
Let two random variables be given:
β€’ The first one is typically discrete (i.e. π‘˜ ∈ 𝐾) and is called β€œclass”
β€’ The second one is often continuous (π‘₯ ∈ 𝑋) and is called
β€œobservation”
Let the joint probability distribution 𝑝(π‘₯, π‘˜) be β€œgiven”.
As 𝑘 is discrete, it is often specified as 𝑝(𝑥, 𝑘) = 𝑝(𝑘) ⋅ 𝑝(𝑥|𝑘).
The recognition task: given π‘₯, estimate π‘˜.
Usual problems (questions):
β€’ How to estimate π‘˜ from π‘₯ ?
β€’ The joint probability is not always explicitly specified.
β€’ The set 𝐾 is sometimes huge.
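A minimal sketch of such a model (the two classes, the prior and the Gaussian class-conditional densities below are assumptions made for this sketch, not part of the lecture):

```python
# A minimal sketch (assumed model): two classes k with p(x, k) = p(k)·p(x|k),
# using hypothetical Gaussian class-conditional densities. "Recognition" here
# simply means computing the posterior p(k|x) for an observed x.
from scipy.stats import norm

p_k = {0: 0.7, 1: 0.3}                              # assumed class prior p(k)
p_x_given_k = {0: norm(0.0, 1.0).pdf,               # assumed p(x | k = 0)
               1: norm(2.0, 1.0).pdf}               # assumed p(x | k = 1)

def posterior(x):
    joint = {k: p_k[k] * p_x_given_k[k](x) for k in p_k}   # p(x, k) = p(k)·p(x|k)
    p_x = sum(joint.values())                              # marginal p(x)
    return {k: joint[k] / p_x for k in joint}              # p(k | x) by Bayes' rule

print(posterior(1.5))   # the second class is the more probable one near x = 2
```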
Further topics
The learning task:
Often (almost always) the probability distribution is known only up to
free parameters. How to choose them (learn them from examples)?
Next classes:
1. Recognition, Bayesian Decision Theory
2. Probabilistic learning, Maximum Likelihood principle
3. Discriminative models, recognition and learning
…