Statistics
Sampled from Morris H. DeGroot & Mark J. Schervish, "Probability and Statistics", 3rd Edition, Addison Wesley
Probability: Continuous distributions and variables
• Continuous distributions
  – Random variables
  – Probability density function
  – Uniform, normal and exponential distributions
  – Expectations and variance
  – Law of large numbers
  – Central limit theorem
  – Probability density functions of more than one variable
  – Rejection and transformation methods for sampling distributions (section)
Statistics
• Estimators: mean, standard deviation
• Maximum likelihood
• Confidence intervals
• χ² statistics
• Regression
• Goodness of fit
Continuous random variables
• A random variable X is a real-valued function defined on a sample space S.
• X is a continuous random variable if a non-negative function f, defined on the real line, exists such that the integral of f over a domain A is the probability that X takes a value in A (A is, for example, the interval [a,b]):

Pr(a < X < b) = \int_a^b f(x) dx
Probability density function
• f is called the probability density function (p.d.f.). Note that the units of the p.d.f. below are 1/length; only after multiplication by a length element do we get a probability.
• For every p.d.f. we have

f(x) ≥ 0

\int_{-\infty}^{\infty} f(x) dx = 1
Examples of p.d.f.s
• A car is driving in a circle at a constant speed. What is the probability that it will be found in the interval between 1 and 2 radians?
• A computer generates, with equal probability density, random numbers between 0 and 1. What is the probability of obtaining 0.75?
• A protein folds at a constant rate (the probability that a protein will fold in the time interval [t,t+dt] is a constant αdt). If we have N₀ protein molecules at time zero, what is the probability that all protein molecules will have folded after time t′?
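A worked sketch of the third example (assuming the molecules fold independently): the probability p that a given molecule is still unfolded satisfies dp/dt = −αp, so

p(\text{unfolded at } t′) = \exp(−α t′), \qquad p(\text{folded by } t′) = 1 − \exp(−α t′)

and for N₀ independent molecules

Pr(\text{all folded by } t′) = \left(1 − \exp(−α t′)\right)^{N_0}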
Uniform distribution on an interval
• Consider an experiment in which a point X is selected from an interval S = {x : a ≤ x ≤ b} in such a way that the probability of finding X in a given subinterval is proportional to the subinterval's length (hence the p.d.f. is a constant). This distribution is called the uniform distribution. For this distribution we must have

\int_{-\infty}^{\infty} f(x) dx = \int_a^b f(x) dx = 1
Uniform distribution (continued)

f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise

[Figure: f(x) is the constant 1/(b−a) on the interval [a,b] and zero elsewhere.]
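A quick numerical sanity check of the uniform p.d.f. (a minimal sketch using numpy; the interval [a,b] = [2,5], the subinterval, and the sample size are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 5.0
x = rng.uniform(a, b, size=100_000)
# The fraction of samples in a subinterval should equal its length times 1/(b-a).
lo, hi = 2.5, 3.5
print(np.mean((x >= lo) & (x <= hi)))   # empirical probability
print((hi - lo) / (b - a))              # analytic: length / (b - a)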
Examples of simple distributions
• X is a random variable distributed uniformly on a circle of radius a. Find f(x).
• Check that the following function satisfies the conditions to be a p.d.f.:

f(x) = (2/3) x^{−1/3} for 0 < x < 1, and 0 otherwise
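A worked check of the second example: the function is non-negative on (0,1), and

\int_0^1 \frac{2}{3} x^{−1/3} dx = \frac{2}{3} \left[ \frac{3}{2} x^{2/3} \right]_0^1 = 1

so both p.d.f. conditions hold.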
Exponential distribution

f(x) = a exp(−ax) for 0 < x < ∞

[Figure: f(x) decays exponentially from its maximum value a at x = 0.]
Normal distribution

f(x) = (a/π)^{1/2} exp(−a(x − x₀)²) for −∞ < x < ∞

[Figure: a bell-shaped curve centered at x₀; its width is set by a.]
Continuous distribution functions
The distribution function is defined as F(x) = Pr(X ≤ x) for −∞ < x < ∞.
F(x) is a monotonic, non-decreasing function of x (can you show it?) that can be written in terms of its corresponding p.d.f.:

F(x) = Pr(X ≤ x) = \int_{-\infty}^{x} f(t) dt

or

dF/dx = f(x)
Distribution function: Example

f(x) = a exp(−ax) for 0 < x < ∞, and 0 otherwise

F(x) = \int_{-\infty}^{x} f(t) dt = \int_0^x a \exp(−at) dt = \left[ −\exp(−at) \right]_0^x = 1 − \exp(−ax)
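A Monte Carlo check of this result (a minimal sketch using numpy; the rate a = 2 and the test point x = 0.5 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
a, x = 2.0, 0.5
samples = rng.exponential(scale=1.0 / a, size=100_000)  # numpy parameterizes by 1/a
empirical = np.mean(samples <= x)        # fraction of samples with X <= x
analytic = 1.0 - np.exp(-a * x)          # F(x) from the slide
print(f"empirical {empirical:.4f} vs analytic {analytic:.4f}")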
Expectation
• For a random variable X with a p.d.f. f(x) the expectation E(X) is defined as

E(X) = \int_{-\infty}^{\infty} x f(x) dx

The expectation exists if and only if the integral converges absolutely, i.e.

\int_{-\infty}^{\infty} |x| f(x) dx < ∞
Expectation (example)

f(x) = 2x for 0 < x < 1, and 0 otherwise

E(X) = \int_{-\infty}^{\infty} x f(x) dx = \int_0^1 x \cdot 2x \, dx = 2 \left[ \frac{x^3}{3} \right]_0^1 = \frac{2}{3}

Even if the p.d.f. satisfies the requirements, it is not obvious that the expectation exists (next slide).
The Cauchy p.d.f.

f(x) = \frac{1}{π(1 + x^2)} for −∞ < x < ∞   (f(x) ≥ 0)

F(x) = \int_{-\infty}^{x} \frac{dt}{π(1 + t^2)} = \frac{1}{π} \left( \arctan(x) − \left( −\frac{π}{2} \right) \right)

F(∞) = \frac{1}{π} \left( \frac{π}{2} + \frac{π}{2} \right) = 1 \qquad \left( \int_{-\infty}^{\infty} \frac{dx}{π(1 + x^2)} = 1 \right)
Cauchy distribution: Expectation
Test for the existence of the expectation:

E(X) = \int_{-\infty}^{\infty} x f(x) dx = \int_{-\infty}^{\infty} \frac{x}{π(1 + x^2)} dx, \qquad \int_{-\infty}^{\infty} \frac{|x|}{π(1 + x^2)} dx → ∞

The expectation does not exist for the Cauchy distribution.
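A numerical illustration of this failure (a minimal sketch using numpy; the sample sizes are arbitrary): the running sample mean of Cauchy draws never settles down.

import numpy as np

rng = np.random.default_rng(1)
for n in (10**2, 10**4, 10**6):
    print(n, np.mean(rng.standard_cauchy(n)))
# Unlike distributions with a finite mean, these averages keep jumping around
# as n grows, because occasional huge samples dominate the sum.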
Some properties of expectations
• Expectation is linear:

E(aX + bY) = aE(X) + bE(Y)

• If the random variables X and Y are independent (f(x,y) = f(x) f(y)), then

E(X \cdot Y) = E(X) \cdot E(Y)
Expectation of a function
• Is essentially the same as the expectation of a variable:

E(r(X)) = \int_{-\infty}^{\infty} r \, g(r) dr = \int_{-\infty}^{\infty} r(x) f(x) dx

Of special interest is the expectation value of moments:

\text{variance} ≡ E(X^2) − [E(X)]^2 = \int_{-\infty}^{\infty} x^2 f(x) dx − \left[ \int_{-\infty}^{\infty} x f(x) dx \right]^2

Can you show that the variance is always non-negative?
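One way to answer the question on this slide: expanding the square shows the two forms agree,

\text{var}(X) = E\left( [X − E(X)]^2 \right) = E(X^2) − 2E(X)E(X) + [E(X)]^2 = E(X^2) − [E(X)]^2

and the middle expression is the expectation of a non-negative quantity ([x − E(X)]² ≥ 0 and f(x) ≥ 0), so the variance is always non-negative.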
Functions of several random variables
• We consider a p.d.f. f(x₁,...,xₙ) of several random variables X₁,...,Xₙ.
The p.d.f. satisfies (of course)

f(x_1,...,x_n) ≥ 0

\int_{a_1}^{b_1} \int_{a_2}^{b_2} ... \int_{a_n}^{b_n} f(x_1,...,x_n) \, dx_1 ... dx_n = 1

(where the intervals [aᵢ,bᵢ] cover the whole domain)
Expectation of function of several variables
• Similarly to the one-variable case, expectations of functions of several variables are computed as

E(Y = r(X_1,...,X_n)) = \int_{-\infty}^{\infty} ... \int_{-\infty}^{\infty} r(x_1,...,x_n) f(x_1,...,x_n) \, dx_1 ... dx_n
Example: expectation of more than one variable

f(x,y) = 1 for (x,y) ∈ S, and 0 otherwise

S is a square: 0 < x < 1, 0 < y < 1.

E(X^2 + Y^2) = \int_0^1 \int_0^1 (x^2 + y^2) f(x,y) \, dx \, dy = \int_0^1 \int_0^1 (x^2 + y^2) \, dx \, dy = \frac{2}{3}
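A Monte Carlo check of this value (a minimal sketch using numpy; seed and sample size are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
x, y = rng.random(100_000), rng.random(100_000)  # uniform on the unit square
print(np.mean(x**2 + y**2))                      # should be close to 2/3 ≈ 0.6667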
Markov Inequality
• X is a random variable such that Pr(X ≥ 0) = 1.
• For every t > 0:

Pr(X ≥ t) ≤ \frac{E(X)}{t}

• Prove it.
• Why is the case E(X) > t not interesting?
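A standard argument for the two questions above (a sketch): since X is non-negative,

E(X) = \int_0^{\infty} x f(x) dx ≥ \int_t^{\infty} x f(x) dx ≥ t \int_t^{\infty} f(x) dx = t \, Pr(X ≥ t)

and dividing by t gives the inequality. When E(X) > t the bound E(X)/t exceeds 1, which every probability satisfies trivially, so the inequality carries no information in that case.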
Chebyshev Inequality
is a special case of the Markov inequality.
• X is a random variable for which the variance exists. For t > 0:

Pr\left( [X − E(X)]^2 ≥ t^2 \right) ≤ \frac{\text{var}(X)}{t^2}

• Substitute Y = [X − E(X)]² (so E(Y) = var(X)) and replace t by t² in the Markov inequality to obtain the Chebyshev inequality.
The law of large numbers I
• Consider a set of n random variables X₁,...,Xₙ, i.i.d. Each of the random variables has mean (expectation value) µ and variance σ².
• The arithmetic average of n samples is defined as \bar{X}_n = (1/n)(X_1 + ... + X_n). It defines a new random variable that we call the sample mean.
• The expectation value of the sample mean:

E(\bar{X}_n) = \frac{1}{n} \sum_i E(X_i) = \frac{1}{n} \cdot nµ = µ
The Law of Large Numbers II
• The variance of \bar{X}_n:

\text{var}(\bar{X}_n) = E(\bar{X}_n^2) − E^2(\bar{X}_n) = \frac{1}{n^2} \sum_{i,j} E(X_i X_j) − \frac{1}{n^2} \left[ \sum_i E(X_i) \right]^2

Since X_i and X_j are independent for i ≠ j, E(X_i X_j) = E(X_i) E(X_j), and

\text{var}(\bar{X}_n) = \frac{1}{n^2} \left[ n E(X^2) + (n^2 − n) E^2(X) − n^2 E^2(X) \right]

\text{var}(\bar{X}_n) = \left( E(X^2) − E^2(X) \right) / n = \text{var}(X) / n

which means that the variance decreases as 1/n with the number of sampled points.
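An empirical check that var(\bar{X}_n) ≈ var(X)/n (a minimal sketch using numpy, with X uniform on (0,1) so var(X) = 1/12; seed and sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
var_x = 1.0 / 12.0                                # variance of Uniform(0,1)
for n in (10, 100, 1000):
    means = rng.random((5_000, n)).mean(axis=1)   # 5,000 sample means of size n
    print(n, means.var(), var_x / n)              # empirical vs var(X)/n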
Law of Large Numbers III
Chebyshev Inequality:

1 − Pr\left( (\bar{X}_n − µ)^2 ≥ ε^2 \right) = Pr\left( (\bar{X}_n − µ)^2 < ε^2 \right) ≥ 1 − \frac{\text{var}(X)}{nε^2} \quad \text{for } ε > 0

⇒ \bar{X}_n → µ
Central Limit Theorem
• Statement without proof:
• Given a set of random variables X₁,...,Xₙ with means µᵢ and variances σᵢ², we define a new random variable Y = \sum_{i=1,...,n} X_i.
• For very large n, the distribution of \sum_{i=1,...,n} X_i is normal with mean \sum_{i=1,...,n} µ_i and variance \sum_{i=1,...,n} σ_i^2.
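A quick illustration (a minimal sketch using numpy; summing n = 100 i.i.d. uniform variables, with arbitrary seed and counts):

import numpy as np

rng = np.random.default_rng(4)
n = 100
sums = rng.random((50_000, n)).sum(axis=1)      # 50,000 realizations of Y = sum of n uniforms
print("mean", sums.mean(), "expected", n * 0.5)     # sum of means: n * 1/2
print("var ", sums.var(),  "expected", n / 12.0)    # sum of variances: n * 1/12
# A histogram of `sums` would closely follow the corresponding normal curve.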
Statistical Inference
• Data are generated from an unknown probability distribution, and statements about the unknown distribution are sought. Determine parameters (e.g. β for the exponential distribution, µ and σ for the normal distribution).
• Prediction of new experiments.
Estimation of parameters
• Notation: f(x|θ) is the probability density of sampling x given (conditioned on) parameters θ.
• For a set of n independent and identically distributed samples the probability density is:

f(x_1,...,x_n | θ) = \prod_{i=1,...,n} f(x_i | θ) ≡ f(\mathbf{x} | θ)

• However, what we want to determine now are the parameters… For example, assuming the distribution is normal, we seek the mean µ and the variance σ²:

f(x | µ, σ^2) = \frac{1}{\sqrt{2πσ^2}} \exp\left( −\frac{(x − µ)^2}{2σ^2} \right)
Bayesian arguments
• What we want is the function f(θ|x): given a set of observations x, what is the probability that the set of parameters is θ?
• Bayesian statistics: Think of the parameters like other random variables with probability ξ(θ).
The joint probability f(x,θ) ≡ f(x|θ) ξ(θ) is also

f(x,θ) ≡ f(θ|x) g(x)
The likelihood function
• We can formally write

ξ(θ|x) = \frac{f(x|θ) ξ(θ)}{g(x)}

which is the probability of having a particular set of parameters for the p.d.f., provided a set of observations (what we wanted). Note that our prime interest here is in the parameter set θ, and the sample x is given. Since g(x) is independent of θ we can write the likelihood function

ξ(θ|x) ∝ f(x|θ) ξ(θ)
Example: Likelihood function I
• Consider the exponential distribution:

f(x|β) = β \exp(−βx) for x > 0, and 0 otherwise

f(\mathbf{x}|β) = β^n \exp\left( −β \sum_{i=1,...,n} x_i \right) for x > 0, and 0 otherwise

• And assume the p.d.f. of the parameter β is a Gaussian with mean 0 and variance 1:

ξ(β) = \frac{1}{\sqrt{2π}} \exp\left( −\frac{β^2}{2} \right)
Example: Likelihood function II

ξ(β|\mathbf{x}) ∝ \frac{1}{(2π)^{1/2}} \exp\left( −\frac{β^2}{2} \right) β^n \exp\left( −β \sum_{i=1,...,n} x_i \right)
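A numerical sketch of locating the mode of this (unnormalized) posterior, assuming numpy and scipy are available; the simulated data (true rate β = 2, n = 50) and seed are arbitrary choices:

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
x = rng.exponential(scale=0.5, size=50)      # draws from an exponential with rate beta = 2

def neg_log_posterior(beta):
    # -log of xi(beta|x) up to a constant: beta^2/2 - n*log(beta) + beta*sum(x)
    if beta <= 0:
        return np.inf
    return beta**2 / 2 - len(x) * np.log(beta) + beta * x.sum()

res = minimize_scalar(neg_log_posterior, bounds=(1e-6, 10.0), method="bounded")
print("posterior mode:", res.x)              # near the true rate for moderate n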
Maximum Likelihood
We look for a maximum of the function L(θ) = log(f_n(x|θ)) as a function of the parameters θ.
As a concrete example we consider the normal distribution:

L(θ) = \log\left[ f_n(x | µ, σ^2) \right] = −\frac{n}{2} \log(2π) − \frac{n}{2} \log(σ^2) − \frac{1}{2σ^2} \sum_{i=1,...,n} (x_i − µ)^2

To find the most likely set of parameters we determine the maximum of L(θ).
Maximum of L(θ) for normal distribution

\frac{dL}{dµ} = \frac{1}{σ^2} \sum_{i=1,...,n} (x_i − µ) = \frac{1}{σ^2} \left( \sum_{i=1,...,n} x_i − nµ \right) = 0

⇒ µ = \frac{1}{n} \sum_{i=1,...,n} x_i

\frac{dL}{dσ^2} = −\frac{n}{2σ^2} + \frac{1}{2(σ^2)^2} \sum_{i=1,...,n} (x_i − µ)^2 = 0

⇒ σ^2 = \frac{1}{n} \sum_{i=1,...,n} (x_i − µ)^2
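These estimators are easy to check numerically (a minimal sketch using numpy; the true µ = 3 and σ = 2 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)
mu_hat = x.mean()                       # (1/n) * sum(x_i)
sigma2_hat = np.mean((x - mu_hat)**2)   # (1/n) * sum((x_i - mu_hat)^2)
print(mu_hat, sigma2_hat)               # close to 3 and 4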
Determine a most likely parameter for the uniform distribution

f(x|θ) = \frac{1}{θ} for 0 ≤ x ≤ θ, and 0 otherwise

f(\mathbf{x}|θ) = \frac{1}{θ^n} for 0 ≤ x_i ≤ θ (i = 1,...,n), and 0 otherwise

It is clear that θ must be larger than all the xᵢ and at the same time maximize the monotonically decreasing function 1/θⁿ, hence

θ = \max[x_1,...,x_n]
Potential problems in the maximum likelihood procedure
• The value of θ is underestimated (note that θ should be larger than all x, not only the ones we have sampled so far).
• No guarantee that a solution exists: for the distribution below, θ must be larger than any x but at the same time equal to the maximal x. This is not possible and hence there is no solution.

f(x|θ) = \frac{1}{θ} for 0 < x < θ, and 0 otherwise

• The solution is not necessarily unique:

f_n(\mathbf{x}|θ) = 1 for θ ≤ x_i ≤ θ + 1 (i = 1,...,n), and 0 otherwise

⇒ f_n(\mathbf{x}|θ) = 1 for \max(x_1,...,x_n) − 1 ≤ θ ≤ \min(x_1,...,x_n), and 0 otherwise

⇒ \max(x_1,...,x_n) − 1 ≤ θ ≤ \min(x_1,...,x_n)
The χ² distribution with n degrees of freedom

Γ(n) = \int_0^{\infty} t^{n−1} e^{−t} dt, \quad n > 0

f(x) = \frac{1}{2^{n/2} Γ(n/2)} x^{(n/2)−1} \exp(−x/2), \quad x > 0

E(x) = n, \qquad \text{var}(x) = 2n

There is a useful relation between the χ² and the normal distributions.
Theorem connecting χ² and normal distributions
If the random variables X₁,…,Xₙ are i.i.d. and each of these variables has the standard normal distribution, then the sum of squares

Y = X_1^2 + ... + X_n^2

has a χ² distribution with n degrees of freedom.
The distribution function (for n = 1):

F(y) = Pr(Y ≤ y) = Pr(X^2 ≤ y) = Pr(−y^{1/2} ≤ X ≤ y^{1/2}) = Φ(y^{1/2}) − Φ(−y^{1/2})

The p.d.f. is obtained by differentiating both sides: f(y) = F′(y), φ(y) = Φ′(y). Note φ(y^{1/2}) = (2π)^{−1/2} \exp(−y/2). We have

f(y) = φ(y^{1/2}) \left( \tfrac{1}{2} y^{−1/2} \right) + φ(−y^{1/2}) \left( \tfrac{1}{2} y^{−1/2} \right) = (2π)^{−1/2} y^{−1/2} \exp(−y/2)

which is the χ² distribution with one degree of freedom.
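A sampling check of the one-degree-of-freedom case (a minimal sketch assuming numpy and scipy; the test point y = 1.5 and sample size are arbitrary):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
z = rng.standard_normal(200_000)
y = 1.5
print(np.mean(z**2 <= y))          # empirical Pr(Z^2 <= y)
print(stats.chi2.cdf(y, df=1))     # analytic chi-square(1) CDF; the two should agree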
Normal distribution: Parameters
Let X₁,…,Xₙ be a random sample from a normal distribution having mean µ and variance σ². Then the sample mean (hat denotes the M.L.E.)

\hat{µ} = \bar{X}_n = \frac{1}{n} \sum_{i=1,...,n} X_i

and the sample variance

\hat{σ}^2 = \frac{1}{n} \sum_{i=1,...,n} (X_i − \bar{X}_n)^2

are independent random variables.
\hat{µ} has a normal distribution with mean µ and variance σ²/n.
n\hat{σ}^2/σ^2 has a chi-square distribution with n−1 degrees of freedom.
Why n−1? (next slide)
Parameters of the normal distribution: Note 1
• Let x₁,...,xₙ be a vector of n random numbers sampled from the normal distribution.
• Let y₁,...,yₙ be another vector of n random numbers, related to the previous vector by a linear transformation A (AAᵗ = I):

y = Ax

• Consider now the calculation of the variance (next slide).
Variance
• The formula we should use for the variance:

\text{var}(X) = \frac{1}{n} \sum_{i=1,...,n} (X_i − µ)^2

• However, we do not know the exact mean, and therefore we use

\text{var}(X) = \frac{1}{n} \sum_{i=1}^{n} (X_i − \bar{X}_n)^2

• What are the consequences of using this approximation?
Variance is not changing upon linear transformations
• Consider the expression

\sum_{i=1,...,n} (Y_i − \bar{Y}_n)^2 = \sum_{i=1,...,n} (AX_i − A\bar{X}_n)^t (AX_i − A\bar{X}_n)
= \sum_{i=1,...,n} (X_i − \bar{X}_n)^t A^t A (X_i − \bar{X}_n) = \sum_{i=1,...,n} (X_i − \bar{X}_n)^2

• The analysis is based on the unitarity of A. Hence, a linear (orthogonal) transformation does not change the variance of the distribution. This makes it possible to exploit the difference between \bar{X}_n and µ.
The n−1 (versus n) factor
• Since A is arbitrary (as long as it is unitary), we can choose one of the transformation vectors a to be (1,…,1)/n^{1/2}.
• The scalar product of a with the deviation vector,

a^t (X − \bar{X}_n) = 0,

is identically zero (remember how we compute the mean?).
• Hence, since we computed the average from the same sample from which we computed the variance, the variance lost one degree of freedom.
The n−1 factor II
• Note that the n−1 makes sense. Consider only a single sample point, which is of course very poor and leaves a high degree of uncertainty regarding the values of the parameters. If we use n then the estimated variance becomes zero, while if we use n−1 the estimate is undefined (0/0), which is more appropriate to the problem at hand, for which we have no information to determine the variance.
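The bias of the 1/n estimator is also easy to see numerically (a minimal sketch using numpy; n = 5 and the trial count are arbitrary):

import numpy as np

rng = np.random.default_rng(8)
n, trials = 5, 200_000
x = rng.normal(0.0, 1.0, (trials, n))     # true variance is 1
print(np.mean(x.var(axis=1, ddof=0)))     # 1/n estimator: biased low, ~ (n-1)/n = 0.8
print(np.mean(x.var(axis=1, ddof=1)))     # 1/(n-1) estimator: ~ 1.0, unbiased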
The t distribution
(in preparation for confidence intervals)
• Consider two random variables Y and Z, such that Y has a χ² distribution with n degrees of freedom and Z has a standard normal distribution. The variable X is defined by

X = \frac{Z}{(Y/n)^{1/2}}

Then the distribution of X is the t distribution with n degrees of freedom.
The t distribution
• The function is tabulated and can be written in terms of the Γ function:

Γ(α) = \int_0^{\infty} x^{α−1} \exp(−x) dx

t_n(x) = \frac{Γ\left( \frac{n+1}{2} \right)}{(nπ)^{1/2} Γ\left( \frac{n}{2} \right)} \left( 1 + \frac{x^2}{n} \right)^{−(n+1)/2} \quad \text{for } −∞ < x < ∞

• The t distribution approaches the normal distribution as n → ∞. It has the same mean but longer tails.
Confidence Interval
• Confidence intervals provide an alternative to using a point estimator in place of the actual value of an unknown parameter. We can find an interval (A,B) that we think has a high probability of containing the desired parameter. The length of the interval gives us an idea of how well we can estimate the parameter value.
Confidence interval: Example
The sample distribution is normal with mean µ and standard deviation σ. We expect to find a sample S in the intervals

µ_S ± σ; \quad µ_S ± 2σ; \quad µ_S ± 3σ

about 68.27%, 95.45% and 99.73% of the time, respectively.
Confidence interval for means
If the statistic S is the sample mean \bar{X}, then 95% and 99% confidence limits for estimation of the population mean are given by \bar{X} ± 1.96 σ_{\bar{X}} and \bar{X} ± 2.58 σ_{\bar{X}}, respectively. For large samples (n ≥ 30) we can write (depending on the level of confidence we are interested in)

\bar{X} ± z_c \frac{σ}{\sqrt{n}}

For small samples we need the t distribution.
Confidence interval for the mean of the normal distribution
• Let X₁,...,Xₙ be a random sample from a normal distribution with unknown mean and unknown variance. Let t_{n−1}(x) denote the p.d.f. of the t distribution with n−1 degrees of freedom, and let c be a constant such that

\int_{−c}^{c} t_{n−1}(x) dx = γ

• For every value of n, the value of c can be found from the table of the t distribution to fit the confidence (probability) γ.
Confidence interval for means: small sample (n < 30)
• We use the t distribution (instead of the direct normal-distribution expression) since the variance is estimated from the sample too. For 95% confidence in the mean:

−t_{0.975} < \frac{\bar{X} − µ}{\hat{σ}/\sqrt{n}} < t_{0.975}

and the mean can be estimated as (with 95% confidence)

\bar{X} − t_{0.975} \frac{\hat{σ}}{\sqrt{n}} < µ < \bar{X} + t_{0.975} \frac{\hat{σ}}{\sqrt{n}}
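Putting this recipe into code (a minimal sketch assuming numpy and scipy; the simulated data with true mean 10, standard deviation 3, and n = 20 are arbitrary choices):

import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
x = rng.normal(10.0, 3.0, size=20)               # small sample, n < 30
n, mean, sd = len(x), x.mean(), x.std(ddof=1)    # ddof=1: the n-1 variance estimate
t_c = stats.t.ppf(0.975, df=n - 1)               # two-sided 95% critical value
half = t_c * sd / np.sqrt(n)
print(f"{mean - half:.2f} < mu < {mean + half:.2f}")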