Linear dependence (correlation)
Suppose we observe n pairs of values
$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$$
The covariance of x and y is the value
$$\sigma_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
The correlation of x and y is the value
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
where $\sigma_x$ and $\sigma_y$ are the standard deviations of x and y.
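A minimal sketch in Python of these two formulas (assuming NumPy is available; the data vectors are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical sample
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Covariance with the 1/n normalization used above
sigma_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Correlation: covariance over the product of the standard deviations
rho_xy = sigma_xy / (x.std() * y.std())    # np.std also divides by n by default

print(rho_xy)                   # close to 1: almost perfect positive linear relation
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in gives the same value
```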
Correlation
Takes values in [-1, 1].
Values close to 1 (-1) indicate an almost perfect linear relation between x and y with positive (negative) slope.
Values close to 0 indicate the absence of a linear relation between the two variables.
Warning
Zero correlation does not exclude every form of relation between x and y, only the absence of a linear relation.
Corr(SP100, NASDAQ) = 0.87
Percentage daily returns from 2.1.1998 to 3.6.2011
Quantiles / percentiles
The k-th percentile is the value that separates the smallest k% of the data from the largest (100 − k)%.
Formally, q is the k-th percentile if
$$\frac{\#\{x_i < q\}}{n} \le \frac{k}{100} \qquad \text{and} \qquad \frac{\#\{x_i > q\}}{n} \le 1 - \frac{k}{100}$$
The median is the quantile 0.5, or the 50-th percentile.
The quartiles are the quantiles that divide the ordered
data in 4 equal parts: percentiles 25, 50, 75
The deciles are the quantiles that divide the ordered data in 10 equal parts: percentiles 10, 20, …, 90.
Example with numeric data
Find the quartiles in
{800, 223, 417, 114, 413, 299, 415, 830, 796}
Observation   Perc < xi   Perc > xi
114           0           .889
223           .111        .778
299           .222        .667
413           .333        .556
415           .444        .444
417           .556        .333
796           .667        .222
800           .778        .111
830           .889        0

Applying the definition: 299 is the 25th percentile, 415 the median, and 796 the 75th percentile.
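A short Python check of the table (assuming NumPy; the cutoff comparisons follow the definition above):

```python
import numpy as np

data = np.array([800, 223, 417, 114, 413, 299, 415, 830, 796])
n = len(data)

for q in np.sort(data):
    below = np.sum(data < q) / n   # fraction strictly below q
    above = np.sum(data > q) / n   # fraction strictly above q
    print(q, round(below, 3), round(above, 3))

# q is the k-th percentile when below <= k/100 and above <= 1 - k/100,
# which singles out 299 (k=25), 415 (k=50) and 796 (k=75) as above.
```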
Elements of probability
Probability space
Ω sample space, the set of elementary events. For ex. for a die there are 6 elementary events.
F sigma-field, the subsets of Ω (events) to which we want to assign a probability.
P probability measure, a value in [0,1] assigned to each event in F.
Sample space
Any non-empty set.
Examples
Coin toss: heads, tails
Die toss: the 6 faces
Throwing a piece of chalk against a blackboard: any two-dimensional point on the blackboard
Sigma-field (collection of events)
Set F of subsets of the sample space Ω such that:
The set Ω belongs to F.
If A belongs to F then its complement with respect to Ω also belongs to F.
If A1, A2, … belong to F then their union A1∪A2∪… also belongs to F.
Consequences: the empty set ∅ and intersections of events are in F.
Try to build examples using the sample spaces of the previous slide.
Probability measure
Set function that maps any element of F to a number in
[0,1] such that:
P(Ω) = 1
If A1, A2,… are disjoint sets (empty intersection), then
P(A1∪A2∪…) = P(A1) + P(A2) + …
There are different approaches to the construction of
probabilities:
Classic
Frequentist
Bayesian
Approaches to probabilities
Classic.
Tied to gambling. The probability of each elementary event equals 1/(number of elementary events).
For ex. each face of a die has probability 1/6.
Frequentist.
The probability of an event is the relative frequency of that event as the number of trials goes to infinity.
For ex. I throw a die n times and check the relative frequency of each face as n diverges.
Bayesian.
Subjectivist: it is a probability based on one's own opinion grounded in past experience.
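The frequentist idea can be illustrated with a small simulation, a sketch assuming NumPy (the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Throw a fair die n times and track the relative frequency of each face
for n in (100, 10_000, 1_000_000):
    throws = rng.integers(1, 7, size=n)               # faces 1..6
    freqs = np.bincount(throws, minlength=7)[1:] / n
    print(n, np.round(freqs, 3))                      # tends to 1/6 ≈ 0.167 per face
```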
Probability of «A or B»
We’ve seen that if A and B are disjoint the probability
P(A or B) = P(A∪B) = P(A) + P(B)
If A and B intersect, the probability of the intersection must be subtracted from the total (otherwise it is counted twice):
P(A or B) = P(A∪B) = P(A) + P(B) − P(A and B)
Probability of «A and B» and dependence
Two events are dependent if there is some connection between the two. In this case knowing that A happened tells us something about the occurrence of B, and vice versa.
For ex. if I know that in a die toss an even number came up (A), this tells me something about whether the number 3 came up (B).
The probability that B occurs knowing that A happened is termed the conditional probability and is indicated with P(B | A):
P(A and B) = P(A) P(B|A) = P(B) P(A|B)
Independence
Two events A and B are said to be independent when knowing that A happened does not give any information on the occurrence of B, and vice versa.
For ex. knowing that in a die toss "an even number came up" (A) does not provide any information on the event "the number is in the interval 1-4" (B).
In this case P(B | A) = P(B) and
P(A and B) = P(A) P(B)
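The die example can be checked by enumerating the six equally likely outcomes; a minimal sketch:

```python
# Sample space of one die toss; classical probability 1/6 per face
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}       # "an even number came up"
B = {1, 2, 3, 4}    # "the number is in the interval 1-4"

p = lambda event: len(event) / len(omega)
print(p(A & B))     # 1/3
print(p(A) * p(B))  # 1/2 * 2/3 = 1/3, so A and B are independent
```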
Exercises
Compute the probability that in one dice toss one gets
{1 or 2 or 3}.
A: 1/6 + 1/6 + 1/6 = 1/2
Compute the probability that in one dice toss one gets {1
or 2 or 3} knowing that an even number is extracted.
A: P(1 or 2 or 3 | 2 or 4 or 6) = P(2)/P(2 or 4 or 6) =
(1/6) / (1/2) = 1/3
Compute the probability that tossing two dice the sum of the results is 3.
A: P({D1=1} and {D2=2}) + P({D1=2} and {D2=1}) =
P{D1=1} P{D2=2} + P{D1=2} P{D2=1} = 2/36
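The last exercise can also be verified by enumerating all 36 equally likely outcomes of the two dice (sketch):

```python
from itertools import product

# All 36 equally likely outcomes of two independent die tosses
outcomes = list(product(range(1, 7), repeat=2))
favorable = [(d1, d2) for d1, d2 in outcomes if d1 + d2 == 3]
print(len(favorable), "/", len(outcomes))   # 2 / 36
```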
Random variables (r.v.)
Given a probability space (Ω, F, P), a (measurable) function that associates numbers to events in Ω is called a random variable. If we call ω the generic element of Ω, we can indicate a r.v. with X(ω), but for brevity usually only X is used. For ex. for the die toss:
ω:     ⚀  ⚁  ⚂  ⚃  ⚄  ⚅
X(ω):  1  2  3  4  5  6
Cumulative probability function
It is a non-decreasing function that returns the probability of a r.v. X being less than or equal to a given value x:
$$F_X(x) = \Pr(X \le x)$$
[Figure: cumulative distribution function for the die toss]
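For the die toss, F is a step function; a minimal sketch of its evaluation:

```python
def F(x):
    """CDF of a fair die: Pr(X <= x) = (number of faces <= x) / 6."""
    return sum(1 for face in range(1, 7) if face <= x) / 6

for x in (0, 1, 2.5, 6, 7):
    print(x, F(x))   # 0.0, 1/6, 2/6, 1.0, 1.0 — non-decreasing, from 0 to 1
```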
Discrete vs. continuous r.v.
When X ranges only over a finite or countable set of values, X is discrete.
When X ranges over a set with the cardinality of the continuum (real numbers or intervals), then X is continuous.
There are also «intermediate» situations in which X can be seen as the sum of a continuous and a discrete r.v. We will not see any example of this kind in this course.
Probability function
For a discrete r.v. it is possible to define the function p(x) that maps each outcome to its probability.
E.g. for the die toss we have:
$$p(x) = \begin{cases} 1/6 & \text{for } x = 1, 2, \ldots, 6 \\ 0 & \text{otherwise} \end{cases}$$
Of course,
$$F(x) = \sum_{i \le x} p(i)$$
Bernoulli r.v.
Takes only the two values {0, 1}.
The probability function is
$$p(x) = \begin{cases} p^x (1-p)^{1-x} & \text{for } x = 0, 1 \\ 0 & \text{otherwise} \end{cases}$$
with p in [0, 1].
[Figure: Bernoulli probability function (p = 0.7)]
Binomial r.v.
Takes the values {0, 1, …, n}.
It is the sum of n independent Bernoulli r.v.
It has two parameters, p and n:
$$p(x) = \begin{cases} \dfrac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x} & \text{for } x = 0, 1, \ldots, n \\ 0 & \text{otherwise} \end{cases}$$
[Figure: binomial probability function (p = 0.7, n = 10)]
Poisson r.v.
Takes the values {0, 1, 2, …}.
Used for modeling the probability of occurrence of a number of events in a given time span (e.g. the number of calls in a call-center).
Only one parameter λ > 0:
$$p(x) = \begin{cases} \dfrac{\lambda^x \exp(-\lambda)}{x!} & \text{for } x = 0, 1, \ldots \\ 0 & \text{otherwise} \end{cases}$$
[Figure: Poisson probability function (λ = 3.5)]
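The three probability functions can be evaluated with scipy.stats (a sketch; the parameter values echo the figures above):

```python
from scipy import stats

print(stats.bernoulli.pmf([0, 1], p=0.7))   # [0.3, 0.7]
print(stats.binom.pmf(7, n=10, p=0.7))      # P(X = 7) for the binomial above
print(stats.poisson.pmf(3, mu=3.5))         # P(X = 3) for the Poisson above
```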
Expectation
The mean or expectation of a discrete r.v. is
$$E(X) = \sum_{x \in \mathcal{X}} x \cdot p(x)$$
where $\mathcal{X}$ represents the range of X, that is, the support of X.
As a consequence, the variance is given by
$$\mathrm{Var}(X) = \sum_{x \in \mathcal{X}} [x - E(X)]^2 \cdot p(x)$$
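A direct application of these two sums to the fair-die r.v. (sketch):

```python
# Support and probability function of a fair die
support = range(1, 7)
p = {x: 1 / 6 for x in support}

mean = sum(x * p[x] for x in support)               # E(X) = 3.5
var = sum((x - mean) ** 2 * p[x] for x in support)  # Var(X) = 35/12 ≈ 2.917
print(mean, var)
```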
Exercises
Compute mean and variance of Bernoulli, Binomial and
Poisson r.v.
Solutions:
r.v.        E(X)   Var(X)
Bernoulli   p      p(1 − p)
Binomial    np     np(1 − p)
Poisson     λ      λ
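A numerical check of the table, applying the definitions of mean and variance over each support (a sketch assuming scipy; the Poisson support is truncated at a point where the remaining mass is negligible):

```python
import numpy as np
from scipy import stats

cases = [
    ("Bernoulli", stats.bernoulli(0.7), np.arange(2)),
    ("Binomial",  stats.binom(10, 0.7), np.arange(11)),
    ("Poisson",   stats.poisson(3.5),   np.arange(200)),  # truncated support
]
for name, dist, support in cases:
    p = dist.pmf(support)
    m = np.sum(support * p)                # E(X) by definition
    v = np.sum((support - m) ** 2 * p)     # Var(X) by definition
    print(name, round(m, 4), round(v, 4))  # 0.7, 0.21 | 7.0, 2.1 | 3.5, 3.5
```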
Probability density function
The probability that a continuous random variable takes exactly a given value is zero, so no probability function can be defined.
For continuous r.v. one uses the probability density function f(x): the area underneath the curve over an interval gives the probability that the r.v. assumes a value in that interval.
The relation with the cumulative probability function is
$$F(x) = \int_{-\infty}^{x} f(z)\, dz \qquad\qquad f(x) = \frac{d\,F(x)}{dx}$$
Example
The probability of the interval [5, 10] is 0.25
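Numerically, such an interval probability is the area under the density; a sketch for the standard normal (assuming scipy):

```python
from scipy import stats
from scipy.integrate import quad

# Area under the standard normal density between -1 and 1
area, _ = quad(stats.norm.pdf, -1, 1)
print(area)                                     # ≈ 0.6827
print(stats.norm.cdf(1) - stats.norm.cdf(-1))   # same value via F(b) - F(a)
```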
Mean and variance
Mean
$$E(X) = \int_{-\infty}^{\infty} x \cdot f(x)\, dx$$
Variance
$$\mathrm{var}(X) = \int_{-\infty}^{\infty} [x - E(X)]^2 \cdot f(x)\, dx$$
One notation for the expectation
Note that one can compute the expectation of any function g(X) of a r.v. (if it exists), and the notation can be made uniform as
$$E[g(X)] = \int_{-\infty}^{\infty} g(x)\, dF(x)$$
This is a Riemann-Stieltjes integral, but those who do not know this definition can just read it as:
continuous r.v.: $\int_{-\infty}^{\infty} g(x) \cdot f(x)\, dx$
discrete r.v.: $\sum_{x \in \mathcal{X}} g(x) \cdot p(x)$
Normal or Gaussian r.v.
Density function: it has two parameters, µ and σ.
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right\}$$
Mean = µ
Variance = σ²
In the standard normal
µ = 0, σ = 1
Interval (-1.64, 1.64)
contains 90% of prob.
Interval (-1.96, 1.96)
contains 95% of prob.
Interval (-2.58, 2.58)
contains 99% of prob.
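These intervals can be verified from the standard normal CDF (sketch, assuming scipy):

```python
from scipy import stats

for z in (1.64, 1.96, 2.58):
    prob = stats.norm.cdf(z) - stats.norm.cdf(-z)
    print(f"(-{z}, {z}) contains {prob:.1%} of the probability")
# approximately 90%, 95% and 99%
```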
Note: standardization
Let X be a r.v. with mean µ and standard deviation σ.
Then
$$Y = \frac{X - \mu}{\sigma}$$
is a r.v. with mean 0 and standard deviation 1.
If X is normal, then Y is standard normal.
Chi Square r.v.
It has one param. κ > 0 named degrees of freedom (dof).
Takes only positive values.
Summing the squares of k independent standard normal r.v. gives a Chi Square r.v. with κ = k dof.
Mean = κ
Variance = 2κ
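A quick simulation of the sum-of-squared-normals construction (sketch; the number of replications is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5   # degrees of freedom

# Sum the squares of k independent standard normals, 100,000 times
samples = np.sum(rng.standard_normal((100_000, k)) ** 2, axis=1)
print(samples.mean(), samples.var())   # close to k = 5 and 2k = 10
```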
Student’s t r.v.
It has one param. κ > 0 named degrees of freedom (dof).
Takes values in ℝ (real numbers).
If Z is standard normal and S² is Chi Square with κ dof, independent of Z, then
$$\frac{Z}{\sqrt{S^2 / \kappa}}$$
is Student's t with κ dof.
As κ → ∞, Student's t converges to a std. normal.
Mean = 0
Variance = κ / (κ - 2) if κ > 2, infinite otherwise.
Student’s t density
Moments
The p-th moment of a r.v. is defined as
$$\mu_p = E(X^p) = \int_{-\infty}^{\infty} x^p \, dF(x)$$
The p-th central moment of a r.v. is defined as
$$m_p = E[(X - \mu_1)^p] = \int_{-\infty}^{\infty} (x - \mu_1)^p \, dF(x)$$
Obviously, $\mu_1$ is the mean and $m_2$ the variance.
Symmetry
A r.v. X is symmetric when, for all x,
$$\Pr\{X - \text{med} < -x\} = \Pr\{X - \text{med} > x\}$$
where med stands for the median.
In other words, a random variable is symmetric when its density is a mirror image with respect to the vertical axis through the median.
Odd central moments are zero in symmetric distributions.
Skewness
It is the third standardized central moment:
$$\gamma_1 = E\left[ \left( \frac{X - \mu}{\sigma} \right)^3 \right] = \frac{m_3}{\sigma^3}$$
Kurtosis or «tail thickness»
The fourth standardized central moment is a measure of «tail thickness»:
$$\mathrm{kurt} = E\left[ \left( \frac{X - \mu}{\sigma} \right)^4 \right] = \frac{m_4}{\sigma^4}$$
In the normal distribution the kurtosis is 3, so the kurtosis index (or excess kurtosis) is
$$\gamma_2 = \frac{m_4}{\sigma^4} - 3$$
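Sample versions of both indices follow directly from the standardized moments; a sketch assuming NumPy, with simulated data standing in for real returns:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)   # simulated data; replace with real returns

z = (x - x.mean()) / x.std()         # standardize the sample
gamma1 = np.mean(z ** 3)             # skewness: ≈ 0 for normal data
kurt = np.mean(z ** 4)               # kurtosis: ≈ 3 for normal data
print(gamma1, kurt, kurt - 3)        # the last value is the excess kurtosis
```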
Same mean, same variance
[Figure: three densities with the same mean and variance: leptokurtic (kurt > 3), mesokurtic (kurt = 3), platykurtic (kurt < 3)]
Same picture in log scale
[Figure: the same densities on a logarithmic scale]
SP100 returns vs. normal density
[Figure: density of SP100 returns vs. normal density (mean = 0.0005, st. dev. = 1.34); Kurt = 10.28]