2001/02: Lecture 10. Statistical Methods for Data Analysis




Simple statistics (mean value, variance, histogram, covariance, correlation)
Regression analysis
PCA, ICA
Fourier series, wavelets
Simple statistics
Let $x_1, x_2, \ldots, x_n$ be a sample. From a natural-science point of view, the sample is the
result of repeated, independent measurements of some quantity.
From a mathematical point of view, the sample is the result of n independent repetitions
of a random experiment with a random variable $\xi$, which has the distribution
function $F(x) = \Pr\{\xi \le x\}$ or the density function $f(x) = \frac{dF(x)}{dx}$.

The mean of a random variable: $M = \int_{-\infty}^{\infty} x f(x)\,dx$

The variance of a random variable: $D = \int_{-\infty}^{\infty} (x - M)^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - M^2$

Examples of random variables:
Binomial random variable (discrete). Let us toss a coin m times. Suppose that
the probability of getting heads (1) is p and the probability of getting tails (0) is q = 1 - p.
The binomial random variable $\xi$ is the number of heads (ones) among the m results of
tossing the coin.
$\Pr\{1\} = p; \quad \Pr\{0\} = q = 1 - p;$
$\Pr\{\xi = k\} = C_m^k p^k q^{m-k}$
$M = mp, \quad D = mpq.$
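The formulas M = mp and D = mpq can be checked numerically by summing over the binomial probabilities. The following is a minimal sketch; the function name `binomial_pmf` and the values m = 10, p = 0.3 are illustrative assumptions, not part of the lecture.

```python
from math import comb

def binomial_pmf(m, p):
    """Return [Pr{xi = k} for k = 0..m] for m tosses with head probability p."""
    q = 1.0 - p
    return [comb(m, k) * p**k * q**(m - k) for k in range(m + 1)]

m, p = 10, 0.3  # illustrative values (assumption)
pmf = binomial_pmf(m, p)

# Mean and variance computed directly from the definition for a discrete variable.
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(sum(pmf))                    # total probability, close to 1
print(mean, m * p)                 # the two values should agree
print(variance, m * p * (1 - p))   # the two values should agree
```

The agreement is up to floating-point rounding only, since the sums are exact expectations rather than simulation results.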
Uniform random variable (rectangular distribution) on the interval [0, 1]:
$$f(x) = \begin{cases} 1 & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$
$M = 1/2, \quad D = 1/12$
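The values 1/2 and 1/12 follow from the integral definitions of the mean and the variance, applied to a density equal to 1 on [0, 1]. A short worked derivation:

```latex
M = \int_0^1 x \, dx = \frac{1}{2}, \qquad
D = \int_0^1 x^2 \, dx - M^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}.
```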
Gaussian (normal) random variable:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - M)^2}{2\sigma^2}}$$
The mean is M, the variance is $\sigma^2$.
Let $x_1, x_2, \ldots, x_n$ be a sample.
Estimator of the mean (midrange): $q = \frac{x_{\max} + x_{\min}}{2}$
Another estimator, the sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
Sample standard deviation: $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
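These sample statistics translate directly into code. The sketch below is illustrative only: the function names and the data are assumptions, and the first estimator is read as the midrange, (x_max + x_min) / 2.

```python
from math import sqrt

def midrange(xs):
    """Estimate the mean as the midpoint between the largest and smallest value."""
    return (max(xs) + min(xs)) / 2

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Unbiased sample variance with the 1 / (n - 1) factor."""
    x_bar = sample_mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)

def sample_std(xs):
    return sqrt(sample_variance(xs))

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data (assumption)
print(midrange(xs))         # 5.5
print(sample_mean(xs))      # 5.0
print(sample_variance(xs))  # 32 / 7, about 4.571
print(sample_std(xs))
```

The midrange depends only on the two extreme values, so it is far more sensitive to outliers than the sample mean.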
Histogram:
Given $x_{\min}$, $x_{\max}$ and the bin width $\mathrm{bin} = \frac{x_{\max} - x_{\min}}{k}$, set
$\mathrm{counter}_i = \mathrm{counter}_i + 1$ if $x_{\min} + (i - 1) \cdot \mathrm{bin} \le x_j < x_{\min} + i \cdot \mathrm{bin}$,
$j = 1, 2, \ldots, n, \quad i = 1, 2, \ldots, k, \quad k \approx n / 20$
The value $\mathrm{counter}_i$ is the number of sample elements inside the i-th bin.
The histogram is used to estimate the density function of the random variable $\xi$.
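The counting rule for the histogram can be sketched as a short function. This is a minimal illustration under stated assumptions: bins are treated as half-open on the right, the largest value is counted in the last bin, and the function name and data are invented for the example.

```python
def histogram(xs, k):
    """Count how many sample elements fall into each of k equal-width bins."""
    x_min, x_max = min(xs), max(xs)
    bin_width = (x_max - x_min) / k
    counters = [0] * k
    for x in xs:
        if bin_width == 0:
            i = 0  # all values identical: put everything in the first bin
        else:
            i = int((x - x_min) / bin_width)  # zero-based bin index
        i = min(i, k - 1)  # x_max lies on the right edge; count it in the last bin
        counters[i] += 1
    return counters

xs = [0.0, 0.1, 0.2, 0.25, 0.5, 0.6, 0.9, 1.0]  # made-up data (assumption)
print(histogram(xs, 4))  # [3, 1, 2, 2]
```

Dividing each counter by n times the bin width turns the counts into a density estimate whose total area is 1.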
Suppose that the sample consists of pairs $(x_i, y_i)$, $i = 1, 2, \ldots, n$.
Bivariate Normal Distribution
$$f(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \frac{(x - M_x)^2}{\sigma_x^2} - 2\rho \frac{(x - M_x)(y - M_y)}{\sigma_x \sigma_y} + \frac{(y - M_y)^2}{\sigma_y^2} \right] \right\}$$
Sample covariance: $\mathrm{cov} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
Note that if X and Y are independent, their covariance is zero (the sample covariance is
then close to zero, though usually not exactly zero).
The converse does not hold in general: if the covariance between X and Y is zero, we
cannot conclude that X and Y are independent.
If X and Y are jointly normal (bivariate normal) random variables and their covariance is
zero, then X and Y are independent.
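A small example shows why zero covariance does not imply independence. In the sketch below (made-up data, hypothetical function name `sample_cov`), Y is completely determined by X through Y = X^2, yet the sample covariance is zero because the x values are symmetric about zero.

```python
def sample_cov(xs, ys):
    """Sample covariance with the 1 / (n - 1) factor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]   # symmetric about zero (assumption)
ys = [x * x for x in xs]           # Y is a deterministic function of X

print(sample_cov(xs, ys))  # 0.0
```

Covariance measures only linear association, so a purely quadratic dependence such as this one goes undetected.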
Sample correlation: $\rho = \frac{\mathrm{cov}}{\sqrt{s_x^2 s_y^2}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\sqrt{s_x^2 s_y^2}}, \quad -1 \le \rho \le 1$
Regression analysis (multiple regression)
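Independent variables (predictors): $X_1, X_2, \ldots, X_p$
Dependent variable (response): $Y$
Regression model:
$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \ldots + \beta_p X_{pi} + \varepsilon_i, \quad i = 1, 2, \ldots, n, \quad n > p$
$\beta_1$ is the intercept; $\beta_2, \ldots, \beta_p$ are the regression slope coefficients.
$\varepsilon_i$ is the residual term, a normal random variable with $M(\varepsilon_i) = 0$ and $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \ne j$.
In matrix form: $Y = X\beta + \varepsilon$, with $X_1 \equiv 1$.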
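Least squares method: choose $\beta$ to minimize
$\varepsilon'\varepsilon = (Y - X\beta)'(Y - X\beta)$
Setting the derivative with respect to $\beta$ to zero gives the normal equations
$(X'X)\beta = X'Y$
$\beta = (X'X)^{-1} X'Y$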
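The least squares estimate can be sketched in plain Python by forming X'X and X'Y and solving the resulting linear system. This is an illustration under stated assumptions, not the lecture's own code: the helper names and the data are invented, the system is solved by Gaussian elimination rather than by forming the inverse explicitly, and X'X is assumed to be non-singular.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def least_squares(X, Y):
    """Return beta solving the normal equations (X'X) beta = X'Y."""
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * y for row, y in zip(X, Y)) for j in range(p)]
    return solve_linear_system(xtx, xty)

# The first column is the constant X_1 = 1 (intercept); the data are made up.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [1.0, 3.0, 5.0, 7.0]  # generated exactly as Y = 1 + 2 * X_2
beta = least_squares(X, Y)
print(beta)  # approximately [1.0, 2.0]
```

For larger or ill-conditioned problems the normal equations lose numerical accuracy, and a library routine based on a matrix factorization, such as `numpy.linalg.lstsq`, is usually the safer choice.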
Multiple coefficient of determination:
$$R^2 = 1 - \frac{\sum (Y_i - Y_i^m)^2}{\sum (Y_i - \bar{Y})^2}, \quad 0 \le R^2 \le 1$$
where $Y_i^m$ is the value given by the regression model for the i-th observation.
The numerator is the error sum of squares and the denominator is the total
variation. If $R^2$ is close to 0, the regression model explains little more than a simple
mean-value model. If $R^2$ is close to 1, the regression model fits the data well and the
error is small.
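The two limiting cases of the coefficient of determination are easy to reproduce in code. A minimal sketch, assuming the model values are already available; the function name `r_squared` and the data are illustrative.

```python
def r_squared(y, y_model):
    """Coefficient of determination: 1 - (error sum of squares) / (total variation)."""
    y_bar = sum(y) / len(y)
    ss_err = sum((yi - mi) ** 2 for yi, mi in zip(y, y_model))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_err / ss_tot

y = [1.0, 3.0, 5.0, 7.0]          # made-up observations (assumption)
print(r_squared(y, y))            # perfect fit: 1.0
print(r_squared(y, [4.0] * 4))    # model equal to the mean of y: 0.0
```

The function is undefined when all observations are equal, because the total variation in the denominator is then zero.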
Principal Component Analysis (PCA)
http://www.cis.hut.fi/projects/ica/fastica/
Independent component analysis
http://www.cis.hut.fi/projects/ica/
Fourier Series and Wavelets
http://www.amara.com/current/wavelet.html
(Four iterations of a Daubechies wavelet)