Download Sampling - math.fme.vutbr.cz

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Randomness wikipedia , lookup

Probability box wikipedia , lookup

Law of large numbers wikipedia , lookup

Bias of an estimator wikipedia , lookup

German tank problem wikipedia , lookup

Transcript
STATISTICS
People sometimes use statistics to describe the results of an
experiment or an investigation. This process is referred to as
data analysis or descriptive statistics.
Statistics is also used in another way: if the entire population
of interest is not accessible for some reason, often only a
portion of the population (a sample) is observed and statistics
is then used to answer questions about the whole population.
This process is called inferential statistics.
Statistical inference will be the main focus of this lesson.
Example 1
We choose ten people from a population and for each individual
measure his or her height. We obtain the following results in cm:
178, 180, 158, 166, 190, 180, 177, 178, 182, 160
It is known that the the random variable describing the height of
an individual from that particular population has a normal
distribution N  ,100
What is an unbiased estimate of the average height  of the
population?
Considering that the arithmetic mean of the the sample arithmetic
mean is 174.9, can we maintain that, with a probability of 1   , the
hypothesis that the population average mean equals 176 cm is true?
The answer to thefirst question is related to point estimation First
we will introduce the notion of a random sample.
We view the raw data x1 , x2 ,, xn as an implementation of a
random vector X 1 , X 2 ,, X n with the random variables being
independent and each random variable having the same probability
distribution F(x). This random vector is called a random sample or
a sample.
In a similar way, we also define a multidimensional random
sample, for example  X 1 , Y1 ,  X 2 , Y2 ,,  X n , Yn .
A function Y  S  X 1 , X 2 ,, X n  that does not explicitly depend on
any parameter of the distribution F(x) is called a statistic of the
sample X 1 , X 2 ,, X n
Various statistics are then used as point estimators of parameters
of the distribution F(x). The point estimation is obtained by
implementing the point estimator statistic, that is, by substituting
the raw data for the random variables in the formula.
The following are examples of the most frequently used statistics:
Sample arithmetic mean
n
X
X
i 1
n
i
Sample variance
n
S '2 
2


X

X
 i
i
n 1
Sample standard deviation
n
S'
 X
X
2
i
i
n 1
Sample correlation coefficient
1 n
1 n
 1 n 
X iYi    X i   Yi 

n i 1
n i 1  n i 1 

R
2
2
n
n
1 n 2 1 n



1
1



2
  X i    X i    Yi    Yi  
 n i 1
 n i 1

n
n




i

1
i

1



It is desirable for a point estimate to be:
(1) Consistent. The larger the sample size, the more accurate the
estimate.
(2) Unbiased. For each value  of the parameter of the
population distribution F(x) the sample expectancy E(Y) of the
point estimator Y equals  .
(3) Most efficient or best unbiased: of all consistent, unbiased
estimates, the one possessing the smallest variance.
It can be shown that the following statistics are consistent,
unbiased, and best unbiased estimators.
sample arithmetic mean for E(X)
sample variance for D(X)
sample standard deviation for D X 
sample correlation coefficient for   X , Y 
Returning to Example 1, we can now say that, based on the
data given, 174.9 is a consistent, best unbiased estimation of
the average height of the entire population.
Example 2
We choose ten people from a population and for each individual
measure his or her height. We obtain the following results in cm:
178, 180, 158, 166, 190, 180, 177, 178, 182, 160
It is known that the the random variable describing the height of
an individual from that particular population has a normal
distribution N  , 100
For a given probability 1   , is there an interval I such that the
average height  of the population lies in I with a probality 1   ?
Let us take the sample arithmetic mean
n
X
X
i 1
i
n
If each X i has a normal distribution N  ,  2  , it can be proved
that X has a distribution
 2
N  , 
 n 
We will show this for n = 2:
2
Let X0 has a distribution N 0,  x  and Y0 N 0,  y2 
X 0  Y0
Put Z 0 
denote by F(z) the distribution of Z0 and
2
calculate : F  z   PZ 0  z   P X 0  Y0  z 
Since X0 and Y0 are independent, we can write
 X 0  Y0

P
 z 
 2


x y
x2
z
2
 2
1
e 2
2 
y2
 2
1
e 2 dxdy
2 

x y
x2
z
2
 2
1
e 2
2 
y2
x2
 2
 2
1
1
2
e
dxdy  
e 2
2 
2 
A
z
2
A
z
2
y2
 2
1
e 2 dxdy
2 
To calculate
x2
y2
 2
1
e 2
2 
I  
A
 2
1
e 2 dxdy
2 
we will use the transformation x  t , y  s  t where J = 1.
I
z
1
2

 ds  e
2

z
1
2

2
2
2

dt 
2 2


 ds  e

t 2   s t 2


t 2 ts  s
2
2
4
e

z
1
2
s
2
2
4 2
dt 

2
 ds  e


2 t 2  2 ts  s 2
2 2
dt 

z
1
2
2
2
2  s
4 2
e


ds  e

t  s 2 
2

2
dt 

u
1

2
e


u2
2

ds  e

t2
2
dt 

1

u
e


u2
2
ds 
1
2
u

2
e

u2
  
2

 2

The second equation from the left is due to the fact that

e


t2
2
dt   
 2
Thus we can conclude that Z0 has a distribution N  0, 
 2 
2
ds
To conclude the proof, we will show that if X  N  , 2 , then
Y  X  c  N   c, 2 
F ( y)  PY  y   P X  c  y   P X  y  c  

1
2 
y c


e

 x   2
2 2
dx  x  t  c 
1
2 
y


e

 t     c  2
2 2
dt
Let us calculate what expectancy 1 the population must have for
the probability that the value of the sample arithmetic mean is less
than x is less than 1   2 and, what expectancy  2 the population
must have for the probability that the value of the sample arithmetic mean is greater than x is less than  2 .
X  N 1 ,  2 n
P X  x   1   2
 2
1
X  N 2 ,  2 n
x
P X  x   1   2
 2
x
2
 X  1 x  1 
 n x  1  
P  X  x   P

  




  n  n
 X  2 x  2 
 n x   2  
P  X  x   1  P

  1  




  n  n
 n x  1  

  1 2



 n x   2  
1  
  1 2



 n x  1  

  1 2



n x   2  

 
  1 2



n x  1 
 u1 2

1  x 

n


n

2  x 
u1 2
x
n x   2 
u1 2    x 

n
u1 2
 u1 2

n
u1 2
Now we can solve Example 2:
We have n  10,   10, x  174.9
If we choose 1    0.95 we get u1 2  u0.975  1.96
and so we can write
10
10
174.9 
1.96    174.9 
1.96
10
10
168.7    181.1
Thus, we can conclude that, with probability 0.95, the average
height ofte population in question lies between 168.7 and 181.1
We can think of the formulas
x

n
u1 2
and
  x

n
u1 2
as of implementations of the statistics
X

n
u1 2
and
X

n
u1 2

X   u

,
X

u
1 2
1 2  is called a confidence interval

n
n


x   u

,
s

u
and its value 
1 2
1 2  for a particular
n
n


sample is called an intercal estimate at confidence level 1  
The conficence interval has been derived on the assumption that
the population distribution is normal. With such an assumption,
other confidence intervals can also be derived:
S
X  S t

,
X

t
1 2
1 2 

n
n

is a confidence interval for the expectancy if the population
variance and expectancy are not known.
Here t1 2 denotes the quantile of the Student's t-distribution
with k  n  1 degrees of freedom.
 n  1S 2 n  1S 2 
,
 2

2


 2
 1 2

is a confidence interval for the variance if the population
variance and expectancy are not known.
2
2


 2
Here 1 2 and
are quantiles of Pearson’s chi-
squared distribution with k = n-1 degrees of freedom.