Machine Learning: Probability Theory
WS2013/2014 – Dmitrij Schlesinger, Carsten Rother

Probability space

A probability space is a triple (Ω, σ, P) with:
• Ω – the set of elementary events
• σ – a σ-algebra
• P – a probability measure

A σ-algebra over Ω is a system of subsets, i.e. $\sigma \subseteq \mathcal{P}(\Omega)$ ($\mathcal{P}$ is the power set), with:
• $\Omega \in \sigma$
• $A \in \sigma \Rightarrow \Omega \setminus A \in \sigma$
• $A_i \in \sigma \ (i = 1, \ldots, \infty) \Rightarrow \bigcap_{i=1}^{\infty} A_i \in \sigma$

So σ is closed with respect to the complement and to countable conjunction (intersection). It follows that $\emptyset \in \sigma$ and that σ is also closed with respect to countable disjunction (union), due to De Morgan's laws.

Examples:
• $\sigma = \{\emptyset, \Omega\}$ (smallest) and $\sigma = \mathcal{P}(\Omega)$ (largest) σ-algebras over Ω
• the minimal σ-algebra over Ω containing a particular subset $A \subseteq \Omega$ is $\sigma = \{\emptyset, A, \Omega \setminus A, \Omega\}$
• Ω discrete and finite: $\sigma = 2^{\Omega}$
• $\Omega = \mathbb{R}$: the Borel algebra (contains, among others, all intervals)
• etc.

Probability measure

$P: \sigma \to [0, 1]$ is a measure, with the normalization $P(\Omega) = 1$.
σ-additivity: let $A_i \in \sigma$ be pairwise disjoint subsets, i.e. $A_i \cap A_{i'} = \emptyset$; then
$$P\Big(\bigcup_i A_i\Big) = \sum_i P(A_i)$$
Note: there are sets to which no measure can be assigned (non-measurable sets); defining measures is also problematic on function spaces such as $\mathbb{R}^{\mathbb{R}}$. See the Banach–Tarski paradox (Wikipedia).

(For us) practically relevant cases
• The set Ω is "good-natured", e.g. $\mathbb{R}^n$, discrete finite sets, etc.
• $\sigma = \mathcal{P}(\Omega)$, i.e. the algebra is the power set.
• We often consider a (composite) "event" $A \subseteq \Omega$ as the union of elementary ones.
• The probability of an event is
$$P(A) = \sum_{\omega \in A} P(\omega)$$

Random variables

Here a special case: real-valued random variables. A random variable X for a probability space (Ω, σ, P) is a mapping $X: \Omega \to \mathbb{R}$ satisfying
$$\{\omega : X(\omega) \le r\} \in \sigma \quad \forall r \in \mathbb{R}$$
(this always holds for power sets).
Note: elementary events are not numbers; they are elements of a general set Ω. Random variables, in contrast, are numbers, i.e. they can be summed, subtracted, squared, etc.

Distributions

Cumulative distribution function of a random variable X:
$$F_X(r) = P(\{\omega : X(\omega) \le r\})$$
Probability distribution of a discrete random variable $X: \Omega \to \mathbb{Z}$:
$$p_X(x) = P(\{\omega : X(\omega) = x\})$$
Probability density of a continuous random variable $X: \Omega \to \mathbb{R}$:
$$p_X(r) = \frac{dF_X(r)}{dr}$$

Why is it necessary to proceed in such a roundabout way (through the cumulative distribution function)? Example: a Gaussian. The probability of any particular real value is zero, so a "direct" definition of a "probability distribution" is meaningless for continuous variables. Through the cumulative distribution function, however, the definition works.

Mean

The mean (expectation, average, ...) of a random variable X is
$$\mathbb{E}_P(X) = \sum_{\omega \in \Omega} P(\omega) \cdot X(\omega) = \sum_x \sum_{\omega : X(\omega) = x} P(\omega) \cdot x = \sum_x p_X(x) \cdot x$$
The arithmetic mean is a special case:
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i = \sum_i p_X(i) \cdot X(i)$$
with $X(i) \equiv x_i$ and $p_X(i) = 1/N$ (uniform probability distribution).

The probability of an event $A \subseteq \Omega$ can be expressed as the mean value of a corresponding "indicator" variable:
$$P(A) = \sum_{\omega \in A} P(\omega) = \sum_{\omega \in \Omega} P(\omega) \cdot I(\omega)
\quad \text{with} \quad
I(\omega) = \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{otherwise} \end{cases}$$
Often the set of elementary events can be associated with a random variable (just enumerate all ω ∈ Ω). Then one can speak of a "probability distribution over Ω" (instead of the probability measure).
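A minimal Python sketch of the finite, power-set case above (not part of the original slides; all names are illustrative): the measure P is stored as the probabilities of the elementary events, the probability of a composite event A is the sum over its elements, and it coincides with the mean of the indicator variable of A. The fair die anticipates Example 1 below.

```python
from fractions import Fraction

# Elementary events of a fair die and their probabilities P(omega)
omega = ["a", "b", "c", "d", "e", "f"]
P = {w: Fraction(1, 6) for w in omega}

def prob(A):
    """Probability of a composite event A as a sum over elementary events."""
    return sum(P[w] for w in A)

def mean(X):
    """Expectation of a random variable X: Omega -> R under P."""
    return sum(P[w] * X[w] for w in omega)

# Random variable "number on the die": X(a) = 1, ..., X(f) = 6
X = {w: i + 1 for i, w in enumerate(omega)}

A = {"a", "b"}                               # the event {a, b}
indicator = {w: int(w in A) for w in omega}  # indicator variable of A

print(prob(A))                     # 1/3
print(mean(X))                     # 7/2, i.e. 3.5
print(prob(A) == mean(indicator))  # True: P(A) equals the mean of the indicator
```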
Example 1 – numbers of a die

The set of elementary events: $\Omega = \{a, b, c, d, e, f\}$
Probability measure: $P(a) = 1/6$, $P(\{a, b\}) = 1/3$, ...
Random variable (number on the die): $X(a) = 1$, $X(b) = 2$, ..., $X(f) = 6$
Cumulative distribution: $F_X(3) = 1/2$, $F_X(4.5) = 2/3$, ...
Probability distribution: $p_X(1) = p_X(2) = \ldots = p_X(6) = 1/6$
Mean value: $\mathbb{E}_P(X) = 3.5$

Another random variable (squared number of the die): $X'(a) = 1$, $X'(b) = 4$, ..., $X'(f) = 36$
Mean value: $\mathbb{E}_P(X') = 15\tfrac{1}{6}$
Note: $\mathbb{E}_P(X') \ne \mathbb{E}_P(X)^2$

Example 2 – two independent dice

The set of elementary events (6×6 faces): $\Omega = \{a, b, c, d, e, f\} \times \{a, b, c, d, e, f\}$
Probability measure: $P(aa) = 1/36$, $P(\{ab, ba\}) = 1/18$, ...
Two random variables:
1) the number of the first die: $X_1(ab) = 1$, $X_1(ac) = 1$, $X_1(ea) = 5$, ...
2) the number of the second die: $X_2(ab) = 2$, $X_2(ac) = 3$, $X_2(af) = 6$, ...
Probability distributions:
$p_{X_1}(1) = p_{X_1}(2) = \ldots = p_{X_1}(6) = 1/6$
$p_{X_2}(1) = p_{X_2}(2) = \ldots = p_{X_2}(6) = 1/6$

Consider the new random variable $S = X_1 + X_2$. Its probability distribution $p_S$ is not uniform anymore:
$$p_S \propto (1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)$$
The mean value is $\mathbb{E}_P(S) = 7$.
In general, for mean values:
$$\mathbb{E}_P(X_1 + X_2) = \sum_{\omega \in \Omega} P(\omega) \cdot \big(X_1(\omega) + X_2(\omega)\big) = \mathbb{E}_P(X_1) + \mathbb{E}_P(X_2)$$
(A short numerical check of this example is sketched further below.)

Random variables of higher dimension

Analogously, let $X: \Omega \to \mathbb{R}^n$ be a mapping ($n = 2$ for simplicity), with $X = (X_1, X_2)$, $X_1: \Omega \to \mathbb{R}$ and $X_2: \Omega \to \mathbb{R}$.
Cumulative distribution function:
$$F_X(r, s) = P\big(\{\omega : X_1(\omega) \le r\} \cap \{\omega : X_2(\omega) \le s\}\big)$$
Joint probability distribution (discrete):
$$p_{X=(X_1,X_2)}(x, y) = P\big(\{\omega : X_1(\omega) = x\} \cap \{\omega : X_2(\omega) = y\}\big)$$
Joint probability density (continuous):
$$p_{X=(X_1,X_2)}(r, s) = \frac{\partial^2 F_X(r, s)}{\partial r\, \partial s}$$

Independence

Two events $A \in \sigma$ and $B \in \sigma$ are independent if
$$P(A \cap B) = P(A) \cdot P(B)$$
Interesting: the events $A$ and $\bar{B} = \Omega \setminus B$ are independent if $A$ and $B$ are independent.
Two random variables are independent if
$$F_{X=(X_1,X_2)}(r, s) = F_{X_1}(r) \cdot F_{X_2}(s) \quad \forall r, s$$
It follows (example for continuous X):
$$p_X(r, s) = \frac{\partial^2 F_X(r, s)}{\partial r\, \partial s} = \frac{dF_{X_1}(r)}{dr} \cdot \frac{dF_{X_2}(s)}{ds} = p_{X_1}(r) \cdot p_{X_2}(s)$$

Conditional probabilities

Conditional probability:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
Independence (almost equivalent): A and B are independent if $P(A \mid B) = P(A)$ and/or $P(B \mid A) = P(B)$.
Bayes' theorem (formula, rule):
$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

Further definitions (for random variables)

Shorthand: $p(x, y) \equiv p_X(x, y)$
Marginal probability distribution:
$$p(x) = \sum_y p(x, y)$$
Conditional probability distribution:
$$p(x \mid y) = \frac{p(x, y)}{p(y)}$$
Note: $\sum_x p(x \mid y) = 1$
Independent probability distributions:
$$p(x, y) = p(x) \cdot p(y)$$

Example

Let the probability of being ill be $P(\mathrm{ill}) = 0.02$.
Let the conditional probability of having a temperature in that case be $P(\mathrm{temp} \mid \mathrm{ill}) = 0.9$.
However, one may have a temperature without any illness: $P(\mathrm{temp} \mid \overline{\mathrm{ill}}) = 0.05$.
What is the probability of being ill, given that one has a temperature?
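Before the answer to this question, here is the numerical check of Example 2 referred to above: a short Python sketch (illustrative, with names of my own choosing) that builds the joint distribution of two independent fair dice, verifies the factorization p(x, y) = p(x)·p(y) from the independence slide, and computes the distribution and mean of the sum S = X1 + X2.

```python
from fractions import Fraction
from collections import defaultdict
from itertools import product

# Joint distribution of two independent fair dice: p(x, y) = 1/36 for all pairs
faces = range(1, 7)
joint = {(x, y): Fraction(1, 36) for x, y in product(faces, faces)}

# Marginals p(x) = sum_y p(x, y) and p(y) = sum_x p(x, y)
p1 = defaultdict(Fraction)
p2 = defaultdict(Fraction)
for (x, y), p in joint.items():
    p1[x] += p
    p2[y] += p

# Independence: the joint factorizes into the product of its marginals
assert all(joint[x, y] == p1[x] * p2[y] for x, y in joint)

# Distribution of the sum S = X1 + X2: not uniform anymore
p_sum = defaultdict(Fraction)
for (x, y), p in joint.items():
    p_sum[x + y] += p

mean_sum = sum(s * p for s, p in p_sum.items())
print([int(p_sum[s] * 36) for s in range(2, 13)])  # [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]
print(mean_sum)                                    # 7
```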
Example (continued)

Bayes' rule, with the marginal probability in the denominator:
$$P(\mathrm{ill} \mid \mathrm{temp}) = \frac{P(\mathrm{temp} \mid \mathrm{ill}) \cdot P(\mathrm{ill})}{P(\mathrm{temp})}
= \frac{P(\mathrm{temp} \mid \mathrm{ill}) \cdot P(\mathrm{ill})}{P(\mathrm{temp} \mid \mathrm{ill}) \cdot P(\mathrm{ill}) + P(\mathrm{temp} \mid \overline{\mathrm{ill}}) \cdot P(\overline{\mathrm{ill}})}
= \frac{0.9 \cdot 0.02}{0.9 \cdot 0.02 + 0.05 \cdot 0.98} \approx 0.27$$
This is not as high as one might expect; the reason is the very low prior probability of being ill. (A numerical check is sketched at the end of this section.)

Further topics – the model

Let two random variables be given:
• The first one is typically discrete (i.e. $k \in K$) and is called the "class".
• The second one is often continuous ($x \in X$) and is called the "observation".
Let the joint probability distribution $p(x, k)$ be "given". As k is discrete, it is often specified by
$$p(x, k) = p(k) \cdot p(x \mid k)$$
The recognition task: given x, estimate k.
Usual problems (questions):
• How to estimate k from x?
• The joint probability is not always explicitly specified.
• The set K is sometimes huge.

Further topics – the learning task

Often (almost always) the probability distribution is known only up to free parameters. How should they be chosen (learned from examples)?

Next classes:
1. Recognition, Bayesian decision theory
2. Probabilistic learning, the maximum likelihood principle
3. Discriminative models, recognition and learning
...
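Finally, a minimal sketch of the recognition step described above, for a discrete class variable k (the function and variable names are my own, not from the slides): the posterior p(k | x) is obtained from the prior p(k) and the likelihood p(x | k) via Bayes' rule, with the marginal p(x) in the denominator. Applied to the illness example, it reproduces the value of roughly 0.27 computed above.

```python
def posterior(prior, likelihood):
    """Bayes' rule for a discrete class variable k.

    prior:      dict k -> p(k)
    likelihood: dict k -> p(x | k) for the observed x
    returns:    dict k -> p(k | x) = p(k) * p(x | k) / p(x)
    """
    evidence = sum(prior[k] * likelihood[k] for k in prior)  # marginal p(x)
    return {k: prior[k] * likelihood[k] / evidence for k in prior}

# The illness example: k in {ill, healthy}, observed x = "temperature"
prior = {"ill": 0.02, "healthy": 0.98}
likelihood = {"ill": 0.9, "healthy": 0.05}   # p(temp | k)

print(posterior(prior, likelihood)["ill"])   # about 0.269, as on the slide
```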