Review
• Counting & permutations
• Random events
• Axioms, conditional probability, marginalisation, Bayes, independence
• Random variables
• Definition, Bernoulli & Binomial RVs, indicator RVs
• Mean & variance, correlation & conditional expectation
• Inequalities
• Markov, Chebyshev, Chernoff
• Sample mean, weak law of large numbers
• Continuous random variables, Normal distribution, CLT
• Statistical modelling: logistic regression & linear regression
Inequalities
• Markov’s Inequality. For X a non-negative random variable:
P(X ≥ a) ≤ E[X]/a for all a > 0
• Chebyshev’s Inequality. For X a random variable with mean E[X] = µ and variance Var(X) = σ²:
P(|X − µ| ≥ k) ≤ σ²/k² for all k > 0
• Chernoff’s Inequality. For X a random variable:
P(X ≥ a) ≤ min_{t>0} exp(−ta + log E[e^{tX}])
(this is the basis for large deviations theory: https://en.wikipedia.org/wiki/Large_deviations_theory)
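As a quick worked instance (not on the slides): for X standard Normal, E(e^{tX}) = e^{t²/2}, and optimising over t recovers the familiar Gaussian tail bound:

```latex
\text{For } X \sim N(0,1):\quad \log E\!\left(e^{tX}\right) = \tfrac{t^2}{2},
\qquad\text{so for } a > 0:\quad
P(X \ge a) \;\le\; \min_{t>0} e^{-ta + t^2/2} \;=\; e^{-a^2/2}
\quad\text{(minimum at } t = a\text{).}
```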
Example: Markov
An elevator can carry a load of at most 1000Kg. The average weight of a
person is 80Kg. Suppose 10 people are in the elevator; use Markov’s
inequality to upper bound the probability that it is overloaded.
• Load S = Σ_{i=1}^{10} X_i where X_i is the weight of the i’th person.
• E[S] = E[Σ_{i=1}^{10} X_i] = Σ_{i=1}^{10} E[X_i] = 10 × 80 = 800
• By Markov’s inequality, P(S ≥ 1000) ≤ E[S]/1000 = 800/1000 = 0.8
• Note that we need only information about the mean – no need for
any knowledge of the distribution of people’s weights
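A small simulation sketch (not part of the slides) illustrates the bound; the Exponential weight distribution is purely an illustrative assumption, since Markov only needs the mean:

```python
import numpy as np

# The Markov bound P(S >= 1000) <= 0.8 holds for any non-negative weight
# distribution with mean 80Kg. Here, purely for illustration, weights are
# assumed Exponential with mean 80Kg.
rng = np.random.default_rng(0)
weights = rng.exponential(scale=80.0, size=(1_000_000, 10))
loads = weights.sum(axis=1)              # S = X_1 + ... + X_10 for each trial
print(np.mean(loads >= 1000))            # empirical P(S >= 1000), well below 0.8
```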
Confidence Intervals
• Recall that when a random variable lies in an interval a ≤ X ≤ b
with a specified probability we call this a confidence interval e.g.
p − 0.05 ≤ Y ≤ p + 0.05 with probability at least 0.95.
[Figure: PDF of IQ scores, P(IQ) plotted against IQ Score over the range 40 to 140]
• Chebyshev inequality allows us to calculate confidence intervals given
the mean and variance of a random variable.
• For sample mean X̄ = (1/N) Σ_{k=1}^{N} X_k, Chebyshev inequality tells us
P(|X̄ − µ| ≥ ε) ≤ σ²/(Nε²) where µ is the mean of X_k and σ² is its variance.
• E.g. When ε = σ/√(0.05N) then σ²/(Nε²) = 0.05 and Chebyshev tells us that
µ − σ/√(0.05N) ≤ X̄ ≤ µ + σ/√(0.05N) with probability at least 0.95.
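A minimal sketch (not part of the slides) of this interval, assuming for illustration that the X_k are Uniform(0, 1):

```python
import numpy as np

# Chebyshev 95% interval for the sample mean of N i.i.d. draws.
# The Xk are assumed Uniform(0,1) for illustration (mu = 0.5, sigma^2 = 1/12).
rng = np.random.default_rng(0)
N = 100
mu, sigma = 0.5, np.sqrt(1 / 12)
eps = sigma / np.sqrt(0.05 * N)              # half-width: sigma^2/(N*eps^2) = 0.05

xbar = rng.uniform(0, 1, size=(20_000, N)).mean(axis=1)
print(eps)                                   # ~0.129
print(np.mean(np.abs(xbar - mu) < eps))      # coverage >= 0.95 (Chebyshev is conservative)
```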
Laws of Large Numbers
Consider N independent and identically distributed (i.i.d.) random
variables X_1, ..., X_N each with mean µ and variance σ². Let
X̄ = (1/N) Σ_{k=1}^{N} X_k.
• Weak Law of Large Numbers. For any ε > 0:
P(|X̄ − µ| ≥ ε) → 0 as N → ∞
That is, X̄ concentrates around the mean µ as N increases.
Follows from Chebyshev inequality.
• Central Limit Theorem.
X̄ ∼ N(µ, σ²/N) as N → ∞
That is, as N increases the distribution of X̄ converges to a Normal
(or Gaussian) distribution. The variance σ²/N → 0 as N → ∞, so the
distribution concentrates around the mean µ as N increases.
• CLT gives us another way to estimate a confidence interval i.e. using
the properties of the Normal distribution
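A small simulation sketch (not part of the slides), assuming Exponential(1) samples for illustration:

```python
import numpy as np

# The sample mean of N i.i.d. Exponential(1) draws (mu = 1, sigma^2 = 1)
# is approximately N(mu, sigma^2/N) for moderate N.
rng = np.random.default_rng(0)
N = 30
xbar = rng.exponential(1.0, size=(100_000, N)).mean(axis=1)
print(xbar.mean())                   # close to mu = 1
print(xbar.std(), 1 / np.sqrt(N))    # close to sigma/sqrt(N) ~ 0.183
```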
Continuous Random Variables
Continuous random variables:
• Take on real values
• e.g. travel time to work, temperature of this room, fraction of Irish
population supporting Scotland in the rugby
• Cumulative distribution function (CDF) FY (y ) = P(Y ≤ y ) can be
used with both discrete and continuous RVs
For continuous random variable Y we have probability density function
(PDF) fY (y ):
• fY(y) ≥ 0
• ∫_{−∞}^{∞} fY(y) dy = 1 (total area under PDF is 1)
• P(a ≤ Y ≤ b) = ∫_a^b fY(y) dy (area under PDF between a and b is
the probability that a ≤ Y ≤ b)
• FY(b) = ∫_{−∞}^{b} fY(y) dy
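A quick numerical check (not part of the slides) of these properties, assuming for illustration the density fY(y) = 2y on [0, 1]:

```python
import numpy as np

# Numerical check of the PDF properties for the assumed density
# fY(y) = 2y on [0, 1] (and 0 elsewhere).
dy = 1e-5
y = np.arange(0.0, 1.0, dy)
f = 2 * y                                      # fY(y) >= 0 everywhere

print(np.sum(f) * dy)                          # total area under the PDF: ~1
a, b = 0.2, 0.5
print(np.sum(f[(y >= a) & (y < b)]) * dy)      # P(a <= Y <= b) = b^2 - a^2 = 0.21
```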
Example: Uniform Random Variables
Y is a uniform random variable when it has PDF:
fY(y) = 1/(β − α) when α ≤ y ≤ β, and 0 otherwise
[Figure: PDF of Y, constant at 1/(β − α) between α and β; CDF of Y, rising from 0 at α to 1 at β]
• For α ≤ a ≤ b ≤ β: P(a ≤ Y ≤ b) = (b − a)/(β − α)
• The rand() function in Matlab generates uniform random variables on (0, 1).
• A bus arrives at a stop every 10 minutes. You turn up at the stop at
a time selected uniformly at random during the day and wait for 5
minutes. What is the probability that the bus turns up?
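A simulation sketch (not part of the slides) of the bus question, assuming buses arrive exactly every 10 minutes so that your waiting time until the next bus is Uniform(0, 10):

```python
import numpy as np

# Arriving at a uniformly random moment in the 10-minute cycle means the time
# until the next bus is Uniform(0, 10); you catch it if that time is <= 5 minutes.
rng = np.random.default_rng(0)
wait_until_bus = rng.uniform(0.0, 10.0, size=1_000_000)
print(np.mean(wait_until_bus <= 5.0))          # ~0.5 = (5 - 0)/(10 - 0)
```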
Expectation and Variance
Just replace sums with integrals when using continuous RVs
For discrete RV X:
E[X] = Σ_x x P(X = x),   E[X^n] = Σ_x x^n P(X = x)
For continuous RV X:
E[X] = ∫_{−∞}^{∞} x f(x) dx,   E[X^n] = ∫_{−∞}^{∞} x^n f(x) dx
For both discrete and continuous random variables:
E [aX + b] = aE [X ] + b
Var (X ) = E [(X − µ)2 ] = E [X 2 ] − (E [X ])2
Var (aX + b) = a2 Var (X )
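A minimal check (not part of the slides) of the integral formulas for a Uniform(α, β) random variable, using scipy's quad for the integration:

```python
from scipy.integrate import quad

# Integral formulas applied to a Uniform(alpha, beta) random variable.
alpha, beta = 2.0, 5.0
f = lambda x: 1.0 / (beta - alpha)             # PDF on [alpha, beta], 0 elsewhere

EX, _ = quad(lambda x: x * f(x), alpha, beta)
EX2, _ = quad(lambda x: x * x * f(x), alpha, beta)
var = EX2 - EX ** 2                            # Var(X) = E[X^2] - (E[X])^2

print(EX, (alpha + beta) / 2)                  # both 3.5
print(var, (beta - alpha) ** 2 / 12)           # both 0.75
```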
Example
A detector looks for edges in an image. Conditioned on an edge being
present, the detector response X is Gaussian with mean 0 and variance
σ 2 . When no edge is present, the detector response is Gaussian with
mean 0 and variance 1. An image has an edge with probability p. What
are the mean and variance of the detector response?
• Let F be the event that an edge is present.
• E[X] = E[X|F]P(F) + E[X|F^c]P(F^c) = 0 × p + 0 × (1 − p) = 0
• Var(X) = E[X²] (since E[X] = 0) = E[X²|F]P(F) + E[X²|F^c]P(F^c) = σ²p + (1 − p)
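A simulation sketch (not part of the slides) of this example; the values of p and σ below are illustrative assumptions:

```python
import numpy as np

# Mixture of two Gaussians for the detector response; p and sigma are
# illustrative values, not from the slides.
rng = np.random.default_rng(0)
p, sigma, n = 0.3, 2.0, 1_000_000

edge = rng.random(n) < p                              # event F: an edge is present
x = np.where(edge, rng.normal(0.0, sigma, n),         # X | F   ~ N(0, sigma^2)
                   rng.normal(0.0, 1.0, n))           # X | F^c ~ N(0, 1)

print(x.mean())                                       # ~0
print(x.var(), p * sigma**2 + (1 - p))                # both ~1.9
```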
Joint and Conditional Probability Density Functions
Joint and conditional PDFs behave much the same as probabilities:
• Joint PDF of X and Y is: fXY (x, y )
• Conditional PDF is defined as: fX|Y(x|y) = fXY(x, y)/fY(y)
Chain rule holds for PDFs:
fXY(x, y) = fX|Y(x|y) fY(y) = fY|X(y|x) fX(x)
We can marginalise PDFs:
∫_{−∞}^{∞} fXY(x, y) dy = fX(x)
Bayes Rule holds:
fY|X(y|x) = fX|Y(x|y) fY(y) / fX(x)
X and Y are independent when: fXY(x, y) = fX(x) fY(y)
Example
Suppose random variable Y = X + M, where M ∼ N(0, 1). Conditioned
on X = x, what is the PDF of Y?
• fY|X(y|x) = (1/√(2π)) exp(−(y − x)²/2)
Suppose that X ∼ N(0, σ²). What is fX|Y(x|y)?
• Use Bayes Rule:
fX|Y(x|y) = fY|X(y|x) fX(x) / fY(y)
= [(1/√(2π)) exp(−(y − x)²/2)] × [(1/(σ√(2π))) exp(−x²/(2σ²))] / fY(y)
• fY (y ) is just a normalising constant (so that the area under
fX |Y (x|y ) is 1).
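Completing the square in the exponent (a step not spelled out on the slide) identifies the posterior explicitly:

```latex
f_{X|Y}(x \mid y) \;\propto\; \exp\!\Big(-\tfrac{1}{2}\big[(y-x)^2 + x^2/\sigma^2\big]\Big)
\;\propto\; \exp\!\Big(-\tfrac{1+\sigma^2}{2\sigma^2}\Big(x - \tfrac{\sigma^2 y}{1+\sigma^2}\Big)^{2}\Big),
\qquad\text{so}\quad X \mid Y = y \;\sim\; N\!\Big(\tfrac{\sigma^2 y}{1+\sigma^2},\; \tfrac{\sigma^2}{1+\sigma^2}\Big).
```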
Classification: Logistic Regression
• Label Y only takes values 0 or 1. Real-valued vector X⃗ of m
observed features X^(1), X^(2), ..., X^(m)
• In Logistic regression our statistical model is that:
P(Y = 1 | Θ⃗ = θ⃗, X⃗ = x⃗) = 1/(1 + exp(−z))  with  z = Σ_{i=1}^{m} θ^(i) x^(i)
P(Y = 0 | Θ⃗ = θ⃗, X⃗ = x⃗) = 1 − P(Y = 1 | Θ⃗ = θ⃗, X⃗ = x⃗) = exp(−z)/(1 + exp(−z))
• Model has m parameters θ^(1), θ^(2), ..., θ^(m). We gather these
together into a vector θ⃗
• Training data is RV D. Consists of n observations
d = {(x⃗_1, y_1), ..., (x⃗_n, y_n)}
• Maximum Likelihood estimate: select the value of θ⃗ which
maximises P(D|θ⃗).
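A minimal sketch (not part of the slides) of this model and a Maximum Likelihood fit by gradient ascent; the synthetic data, step size and iteration count are illustrative choices:

```python
import numpy as np

# Logistic model P(Y=1|theta,x) and the log-likelihood of a data set,
# maximised by simple gradient ascent on synthetic data.
def p_y1(theta, X):
    z = X @ theta                         # z = sum_i theta^(i) x^(i), one z per row
    return 1.0 / (1.0 + np.exp(-z))       # P(Y = 1 | theta, x)

def log_likelihood(theta, X, y):
    p = p_y1(theta, X)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # log P(D | theta)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))             # n = 200 observations, m = 2 features
theta_true = np.array([1.5, -2.0])
y = (rng.random(200) < p_y1(theta_true, X)).astype(float)

theta = np.zeros(2)
for _ in range(500):
    theta += 0.01 * X.T @ (y - p_y1(theta, X))    # gradient of the log-likelihood
print(theta, log_likelihood(theta, X, y))         # theta ends up near theta_true
```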
Linear Regression
• Assume a linear relationship between x and Y
Y = Σ_{i=1}^{m} Θ^(i) x^(i) + M
• Θ⃗ is a vector of unknown (perhaps random) parameters and M is
random “noise” e.g. M ∼ N(0, 1), Θ^(i) ∼ N(0, λ) with the value of
λ known.
• Training data D is a set of n observed pairs
d = {(x1 , y1 ), · · · , (xn , yn )}
• Maximum Likelihood estimate: select the value of θ⃗ which
maximises P(D|θ⃗).
• Maximum a posteriori (MAP) estimate: select the value of θ⃗
which maximises P(θ⃗|D).
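A minimal numpy sketch (not part of the slides): under this model the Maximum Likelihood estimate is ordinary least squares, and with the Θ^(i) ∼ N(0, λ) prior the MAP estimate is ridge regression with penalty 1/λ:

```python
import numpy as np

# ML vs MAP for the linear model Y = sum_i Theta^(i) x^(i) + M with
# M ~ N(0,1) and prior Theta^(i) ~ N(0, lambda); sizes below are illustrative.
rng = np.random.default_rng(0)
n, m, lam = 50, 3, 0.5
X = rng.normal(size=(n, m))                       # rows are the observed x vectors
theta_true = rng.normal(0.0, np.sqrt(lam), size=m)
y = X @ theta_true + rng.normal(size=n)           # noisy observations

theta_ml = np.linalg.solve(X.T @ X, X.T @ y)                      # maximises P(D | theta)
theta_map = np.linalg.solve(X.T @ X + np.eye(m) / lam, X.T @ y)   # maximises P(theta | D)
print(theta_ml)
print(theta_map)                                  # shrunk towards 0 relative to ML
```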