Download Sampling - math.fme.vutbr.cz

STATISTICS People sometimes use statistics to describe the results of an experiment or an investigation. This process is referred to as data analysis or descriptive statistics. Statistics is also used in another way: if the entire population of interest is not accessible for some reason, often only a portion of the population (a sample) is observed and statistics is then used to answer questions about the whole population. This process is called inferential statistics. Statistical inference will be the main focus of this lesson. Example 1 We choose ten people from a population and for each individual measure his or her height. We obtain the following results in cm: 178, 180, 158, 166, 190, 180, 177, 178, 182, 160 It is known that the the random variable describing the height of an individual from that particular population has a normal distribution N  ,100 What is an unbiased estimate of the average height  of the population? Considering that the arithmetic mean of the the sample arithmetic mean is 174.9, can we maintain that, with a probability of 1   , the hypothesis that the population average mean equals 176 cm is true? The answer to thefirst question is related to point estimation First we will introduce the notion of a random sample. We view the raw data x1 , x2 ,, xn as an implementation of a random vector X 1 , X 2 ,, X n with the random variables being independent and each random variable having the same probability distribution F(x). This random vector is called a random sample or a sample. In a similar way, we also define a multidimensional random sample, for example  X 1 , Y1 ,  X 2 , Y2 ,,  X n , Yn . A function Y  S  X 1 , X 2 ,, X n  that does not explicitly depend on any parameter of the distribution F(x) is called a statistic of the sample X 1 , X 2 ,, X n Various statistics are then used as point estimators of parameters of the distribution F(x). The point estimation is obtained by implementing the point estimator statistic, that is, by substituting the raw data for the random variables in the formula. The following are examples of the most frequently used statistics: Sample arithmetic mean n X X i 1 n i Sample variance n S '2  2   X  X  i i n 1 Sample standard deviation n S'  X X 2 i i n 1 Sample correlation coefficient 1 n 1 n  1 n  X iYi    X i   Yi   n i 1 n i 1  n i 1   R 2 2 n n 1 n 2 1 n    1 1    2   X i    X i    Yi    Yi    n i 1  n i 1  n n     i  1 i  1    It is desirable for a point estimate to be: (1) Consistent. The larger the sample size, the more accurate the estimate. (2) Unbiased. For each value  of the parameter of the population distribution F(x) the sample expectancy E(Y) of the point estimator Y equals  . (3) Most efficient or best unbiased: of all consistent, unbiased estimates, the one possessing the smallest variance. It can be shown that the following statistics are consistent, unbiased, and best unbiased estimators. sample arithmetic mean for E(X) sample variance for D(X) sample standard deviation for D X  sample correlation coefficient for   X , Y  Returning to Example 1, we can now say that, based on the data given, 174.9 is a consistent, best unbiased estimation of the average height of the entire population. Example 2 We choose ten people from a population and for each individual measure his or her height. We obtain the following results in cm: 178, 180, 158, 166, 190, 180, 177, 178, 182, 160 It is known that the the random variable describing the height of an individual from that particular population has a normal distribution N  , 100 For a given probability 1   , is there an interval I such that the average height  of the population lies in I with a probality 1   ? Let us take the sample arithmetic mean n X X i 1 i n If each X i has a normal distribution N  ,  2  , it can be proved that X has a distribution  2 N  ,   n  We will show this for n = 2: 2 Let X0 has a distribution N 0,  x  and Y0 N 0,  y2  X 0  Y0 Put Z 0  denote by F(z) the distribution of Z0 and 2 calculate : F  z   PZ 0  z   P X 0  Y0  z  Since X0 and Y0 are independent, we can write  X 0  Y0  P  z   2   x y x2 z 2  2 1 e 2 2  y2  2 1 e 2 dxdy 2   x y x2 z 2  2 1 e 2 2  y2 x2  2  2 1 1 2 e dxdy   e 2 2  2  A z 2 A z 2 y2  2 1 e 2 dxdy 2  To calculate x2 y2  2 1 e 2 2  I   A  2 1 e 2 dxdy 2  we will use the transformation x  t , y  s  t where J = 1. I z 1 2   ds  e 2  z 1 2  2 2 2  dt  2 2    ds  e  t 2   s t 2   t 2 ts  s 2 2 4 e  z 1 2 s 2 2 4 2 dt   2  ds  e   2 t 2  2 ts  s 2 2 2 dt   z 1 2 2 2 2  s 4 2 e   ds  e  t  s 2  2  2 dt   u 1  2 e   u2 2  ds  e  t2 2 dt   1  u e   u2 2 ds  1 2 u  2 e  u2    2   2  The second equation from the left is due to the fact that  e   t2 2 dt     2 Thus we can conclude that Z0 has a distribution N  0,   2  2 ds To conclude the proof, we will show that if X  N  , 2 , then Y  X  c  N   c, 2  F ( y)  PY  y   P X  c  y   P X  y  c    1 2  y c   e   x   2 2 2 dx  x  t  c  1 2  y   e   t     c  2 2 2 dt Let us calculate what expectancy 1 the population must have for the probability that the value of the sample arithmetic mean is less than x is less than 1   2 and, what expectancy  2 the population must have for the probability that the value of the sample arithmetic mean is greater than x is less than  2 . X  N 1 ,  2 n P X  x   1   2  2 1 X  N 2 ,  2 n x P X  x   1   2  2 x 2  X  1 x  1   n x  1   P  X  x   P           n  n  X  2 x  2   n x   2   P  X  x   1  P    1         n  n  n x  1      1 2     n x   2   1     1 2     n x  1      1 2    n x   2        1 2    n x  1   u1 2  1  x   n   n  2  x  u1 2 x n x   2  u1 2    x   n u1 2  u1 2  n u1 2 Now we can solve Example 2: We have n  10,   10, x  174.9 If we choose 1    0.95 we get u1 2  u0.975  1.96 and so we can write 10 10 174.9  1.96    174.9  1.96 10 10 168.7    181.1 Thus, we can conclude that, with probability 0.95, the average height ofte population in question lies between 168.7 and 181.1 We can think of the formulas x  n u1 2 and   x  n u1 2 as of implementations of the statistics X  n u1 2 and X  n u1 2  X   u  , X  u 1 2 1 2  is called a confidence interval  n n   x   u  , s  u and its value  1 2 1 2  for a particular n n   sample is called an intercal estimate at confidence level 1   The conficence interval has been derived on the assumption that the population distribution is normal. With such an assumption, other confidence intervals can also be derived: S X  S t  , X  t 1 2 1 2   n n  is a confidence interval for the expectancy if the population variance and expectancy are not known. Here t1 2 denotes the quantile of the Student's t-distribution with k  n  1 degrees of freedom.  n  1S 2 n  1S 2  ,  2  2    2  1 2  is a confidence interval for the variance if the population variance and expectancy are not known. 2 2    2 Here 1 2 and are quantiles of Pearson’s chi- squared distribution with k = n-1 degrees of freedom.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Sampling - math.fme.vutbr.cz