Inferential statistics
Suppose we have a bag of nuts. I pick one nut, crack it, and find it empty. What can I conclude?
The optimist says: "So what! Only one nut was bad, and I happened to pull it out. At least we got rid of it."
The pessimist says: "This is what I was afraid of: the bag is full of bad nuts."
What will the statistician say? "I declare that both the pessimist and the optimist may be right. To determine whether the nuts in the bag are bad, it is enough to select a few nuts from different places in the bag and crack them …"
doc.Ing. Zlata Sojková, CSc.
Statistical inference is based on sample investigation
Statistical inference is the process of using sample results to draw conclusions about the parameters of a population.
The sample should be representative of the population.
In the picture that is not the case ...
Examples of inferential statistics
• Household accounts
• Marketing research on consumer behavior patterns
• Sample surveys of agricultural enterprises
• Public opinion surveys
• Quality control
Inferential statistics (or statistical inference)
• Assume that we work with a sample and calculate sample statistics such as the sample mean, the sample variance, and the sample standard deviation.
• Based on the sample we draw conclusions about the properties of the population.
• That is, the values of the sample statistics are used to estimate the unknown values of the population parameters.
• Usually we estimate population parameters such as the population mean, the population variance, and the population standard deviation.
Graphically
Symbols:
Population of size N, or ∞ (infinite): population parameters $\mu$, $\sigma^2$, $\sigma$; generally Q
Sample of size n: sample characteristics $\bar{x}$, $s^2$, $s$; generally $u_n$
Statistical inference (SI) has two basic tasks
• Statistical estimation: unknown population parameters are estimated from sample characteristics.
• Statistical hypothesis testing: we state assumptions about the unknown parameters of the population. If these assumptions can be formulated as statistical hypotheses and their validity can be verified by statistical procedures, the process is called statistical hypothesis testing.
Some other tasks of SI
• To determine the sample size (n) that is sufficient for a reliable estimate of the parameters
• To determine methods for sampling statistical units from the population
Explanation: the sample characteristics are deterministic with respect to the sample, but they are random variables with respect to the population, so they have a probability distribution.
This means it is important to choose the right model for the distribution of the sample characteristic used in statistical inference (statisticians have worked this out for us). The standardized sample mean, with the variance estimated from the sample, usually follows Student's distribution, but for large samples (n > 30) Student's distribution can be approximated by the normal distribution.
Random sampling
Many methods can be used to select a sample from a population.
• By repetition:
  • selection with replacement
  • selection without replacement
• By subdivision of the population:
  • simple random sample (from a finite or infinite population)
  • composite sample, for example based on choosing groups, quota sampling, etc.
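The difference between selection with and without replacement can be sketched in Python. This is only an illustration: the population of 20 numbered units, the sample size of 8, and the helper names are assumptions, not part of the slides.

```python
import random


def sample_with_replacement(population, n, rng):
    """Draw n units; the same unit may be drawn more than once."""
    return rng.choices(population, k=n)


def sample_without_replacement(population, n, rng):
    """Draw n distinct units; n cannot exceed the population size."""
    return rng.sample(population, n)


rng = random.Random(42)          # fixed seed so the run is reproducible
population = list(range(1, 21))  # hypothetical population of 20 numbered units

with_repl = sample_with_replacement(population, 8, rng)
without_repl = sample_without_replacement(population, 8, rng)

print("with replacement:   ", with_repl)
print("without replacement:", without_repl)
```

With replacement, repeated units are possible; without replacement, every selected unit is distinct.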
Theory of Estimation (TO)
Recap: the main goal of the theory of estimation is to estimate population parameters such as $\mu$ and $\sigma$ by using sample characteristics.
There are two types of estimates:
• Point estimate
• Interval estimate
Point estimation of a population parameter Q (in general)
• A point estimator is a single numerical value used as an estimate of the population parameter Q; geometrically it is one point.
• Estimate (estimator) is abbreviated est.
Notation: $\mathrm{est}\, Q = u_n$, or $Q \approx u_n$
Mostly we estimate:
• the population mean $\mu$
• the population variance $\sigma^2$ and the population standard deviation $\sigma$
Properties of point estimators
The best estimator satisfies the following conditions:
• Unbiasedness
• Consistency
• Efficiency
• Sufficiency
We explain the first two conditions.
 Unbiasedness
$E(u_n - Q) = 0$, that is, $E(u_n) = Q$
If we repeat the sampling many times, each time we get a different error, and so a different sample mean $\bar{x}$.
Unbiasedness requires that the expected value of all these errors equals zero. We require that all errors are purely random, so that we neither systematically underestimate nor overestimate the population mean.
[Figure: sample means $\bar{x}$ from repeated samples scattered around the population mean]
An asymptotically unbiased estimator of Q is a sample characteristic that satisfies the condition:
$\lim_{n \to \infty} E(u_n) = Q$
Consistency
$\lim_{n \to \infty} P(|u_n - Q| < \varepsilon) = 1$
The principle of consistency rests on the law of large numbers. In statistical practice, consistency ensures that the estimation error decreases as the sample size increases. For large samples the estimation error is very small.
A sufficient condition for consistency is that $u_n$ is an asymptotically unbiased estimator and satisfies the condition:
$\lim_{n \to \infty} D(u_n) = 0$
Efficiency of a point estimator (PE)
• Any sample characteristic is a random variable with some variance.
• If we have two unbiased point estimators of the same population parameter, the estimator with the smaller variance is said to have greater efficiency.
$D(u_n) \to \min$
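Point estimator of the population mean $\mu$
$E(\bar{x}) = \mu$, $D(\bar{x}) = \dots = \frac{\sigma^2}{n}$
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ is the standard deviation of the sample mean, i.e. the standard error of the estimate.
Since $\bar{x}$ is an unbiased estimator of $\mu$ and
$\lim_{n \to \infty} D(\bar{x}) = \lim_{n \to \infty} \frac{\sigma^2}{n} = 0$,
the sufficient condition for consistency is satisfied, and $\bar{x}$ is an unbiased and consistent estimator of the population mean:
$\mathrm{est}\, \mu = \bar{x}$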
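The two properties $E(\bar{x}) = \mu$ and $D(\bar{x}) = \sigma^2 / n$ can be checked exactly on a toy case by listing every possible sample. The sketch below assumes a hypothetical population of four values and samples of size 2 drawn with replacement; neither comes from the slides.

```python
from itertools import product
from statistics import mean, pvariance

population = [2, 4, 6, 8]  # hypothetical small population
n = 2                      # sample size

mu = mean(population)           # population mean
sigma2 = pvariance(population)  # population variance (divides by N)

# Every equally likely ordered sample of size n drawn with replacement.
sample_means = [mean(s) for s in product(population, repeat=n)]

expected_mean = mean(sample_means)          # E(x_bar) over all samples
variance_of_mean = pvariance(sample_means)  # D(x_bar) over all samples

print("E(x_bar) =", expected_mean, " mu =", mu)
print("D(x_bar) =", variance_of_mean, " sigma^2 / n =", sigma2 / n)
```

Because all samples are enumerated, the agreement is exact rather than approximate, unlike a random simulation.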
Point estimator of the variance $\sigma^2$ and the standard deviation $\sigma$
$E(s^2) = \dots = \frac{n-1}{n}\, \sigma^2$
The sample variance $s^2$ is not an unbiased estimator of the population variance $\sigma^2$: it gives a negatively biased estimate. The bias equals $-\frac{1}{n}\, \sigma^2$.
The sample variance is an asymptotically unbiased estimator of $\sigma^2$, since
$\lim_{n \to \infty} E(s^2) = \lim_{n \to \infty} \frac{n-1}{n}\, \sigma^2 = \sigma^2$
So the unbiased point estimator of the population variance $\sigma^2$ is the sample variance $s_1^2$, computed as:
$s_1^2 = \frac{n}{n-1}\, s^2 = \frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar{x})^2$
The factor $\frac{n}{n-1}$ is Bessel's correction.
Conclusion
The difference between $s_1^2$ and $s^2$ decreases as the sample size n increases. For sample sizes greater than 50 (n > 50) the difference is negligible.
$\mathrm{est}\, \mu = \bar{x}$
$\mathrm{est}\, \sigma^2 = s_1^2$
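The effect of Bessel's correction can be verified exactly on a toy case. The sketch below assumes a hypothetical four-value population and samples of size 2 drawn with replacement; it uses `statistics.pvariance` for $s^2$ (division by n) and `statistics.variance` for $s_1^2$ (division by n − 1).

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [2, 4, 6, 8]  # hypothetical small population
n = 2                      # sample size
sigma2 = pvariance(population)  # population variance

# Every equally likely ordered sample of size n drawn with replacement.
samples = list(product(population, repeat=n))

# s^2 divides by n; s_1^2 divides by n - 1 (Bessel's correction).
mean_s2 = mean(pvariance(s) for s in samples)
mean_s1_2 = mean(variance(s) for s in samples)

print("sigma^2            =", sigma2)
print("E(s^2)             =", mean_s2, "(biased, equals (n-1)/n * sigma^2)")
print("E(s_1^2)           =", mean_s1_2, "(unbiased)")
```

With n = 2 the bias is large, which makes the correction easy to see; for larger n the two averages move closer together.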
Example: In 400 randomly selected households in one of the regions of the SR, expenditures on alcoholic drinks and cigarettes were investigated. We make a point estimate of the mean and of the standard error.
$\mathrm{est}\, \mu = \bar{x} = 973$ Sk
$\mathrm{est}\, \sigma = s_1 = 286$ Sk
$\sigma_{\bar{x}} \approx \frac{s_1}{\sqrt{n}} = \frac{286}{20} = 14.3$ Sk
The estimated standard error of the mean is relatively small: only about 1.5% of the mean. We can expect that the error in estimating the average expenditure on alcoholic drinks and cigarettes is not large.
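The arithmetic of this example can be reproduced in a few lines of Python. The numbers are those of the example; the variable names are illustrative.

```python
from math import sqrt

n = 400        # number of households in the sample
x_bar = 973.0  # sample mean, in Sk
s1 = 286.0     # sample standard deviation, in Sk

standard_error = s1 / sqrt(n)            # 286 / 20
relative_error = standard_error / x_bar  # standard error as a share of the mean

print(round(standard_error, 1))        # 14.3
print(round(100 * relative_error, 1))  # 1.5 (percent)
```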
Comparison of the distribution of the attribute X in the population with the distribution of the sample mean $\bar{x}$
[Figure: densities f(x) and f($\bar{x}$); the sample mean has standard deviation $\sigma / \sqrt{n}$]
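Interval estimate of the parameter Q
$P(q_1 \le Q \le q_2) = 1 - \alpha$
$q_1, q_2$: lower and upper limits of the interval (random)
$\alpha$: risk of the estimate
$(1 - \alpha)$: confidence level
[Figure: density f(g) with area $1 - \alpha$ between $q_1$ and $q_2$ and area $\alpha/2$ in each tail]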
Interval estimation of the population mean $\mu$
Suppose that the statistical attribute has a normal distribution, $X \sim N(\mu, \sigma^2)$.
If we draw a sample of size n, the arithmetic mean also has a normal distribution, $\bar{x} \sim N(\mu, \sigma^2 / n)$.
The confidence interval for $\mu$ depends on the available information and on the sample size:
a) If the population variance is known (a theoretical assumption), we can form the standardized normal variable
$u = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
u has the N(0,1) distribution, independent of the estimated value $\mu$.


$P\left( -u_{1-\alpha/2} \le \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \le u_{1-\alpha/2} \right) = 1 - \alpha$
[Figure: density f(u) with area $1 - \alpha$ between $-u_{1-\alpha/2}$ and $u_{1-\alpha/2}$]
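After transformation we get
$P\left( \bar{x} - u_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + u_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \right) = 1 - \alpha$
$\Delta = u_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}$ is the sampling error: half the width of the interval, which determines the accuracy of the estimate.
The interval estimate is actually the point estimate $\pm \Delta$, i.e.
$\bar{x} - \Delta \le \mu \le \bar{x} + \Delta$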
b) The population variance is unknown, $\mathrm{est}\, \sigma^2 = s_1^2$, and the sample size is large (n > 30):
$\bar{x} \pm u_{1-\alpha/2} \frac{s_1}{\sqrt{n}}$
We can use N(0,1).
c) The population variance is unknown, $\mathrm{est}\, \sigma^2 = s_1^2$, and the sample size is small ($n \le 30$):
$\bar{x} \pm t_{\alpha}(n-1) \frac{s_1}{\sqrt{n}}$
$t_{\alpha}(n-1)$ is the two-sided critical value of Student's distribution at level $\alpha$ with n − 1 degrees of freedom.
Example: Based on the point estimate of household expenditure on cigarettes and alcohol, we construct an interval estimate with 95% probability.
n = 400, $\bar{x} = 973$ Sk, $s_1 = 286$ Sk
$\sigma_{\bar{x}} \approx \frac{s_1}{\sqrt{n}} = \frac{286}{\sqrt{400}} = 14.3$
$u_{1-\alpha/2} = u_{1-0.025} = u_{0.975} = 1.96$ (Excel: NORMSINV(0.975))
$\Delta = u_{1-\alpha/2} \frac{s_1}{\sqrt{n}} = 1.96 \cdot 14.3 = 28.03$
$973 - 28.03 < \mu < 973 + 28.03$, i.e. $944.97 < \mu < 1001.03$
With 95% probability we estimate the average expenditure to lie between 945 Sk and 1001 Sk.
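As a cross-check of this example, the same interval can be computed in Python, where `statistics.NormalDist().inv_cdf` plays the role of Excel's NORMSINV. The function name `mean_ci_normal` is a hypothetical helper, and the sketch assumes the large-sample case in which N(0,1) quantiles are used.

```python
from math import sqrt
from statistics import NormalDist


def mean_ci_normal(x_bar, s1, n, confidence=0.95):
    """Large-sample confidence interval for the mean using N(0,1) quantiles."""
    alpha = 1 - confidence
    u = NormalDist().inv_cdf(1 - alpha / 2)  # u_{1 - alpha/2}, about 1.96 for 95%
    delta = u * s1 / sqrt(n)                 # sampling error (half-width)
    return x_bar - delta, x_bar + delta, delta


lower, upper, delta = mean_ci_normal(x_bar=973, s1=286, n=400)
print(round(delta, 2), round(lower, 2), round(upper, 2))
```

The result agrees with the hand calculation to two decimal places; small differences beyond that come from using the unrounded quantile instead of 1.96.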
Example: A study investigated the weight loss of carrots after one week of storage. Twenty samples, each weighing 1 kg at the beginning of storage, were analyzed and the weight loss was recorded. The average weight loss was 49 g with a sample standard deviation of 4 g. We assume that the weight loss is normally distributed. We estimate the average weight loss with 95% confidence.
Because n < 30 we use
$\bar{x} \pm t_{\alpha}(n-1) \frac{s_1}{\sqrt{n}}$
where $t_{\alpha}(n-1)$ is the quantile of Student's distribution, $t_{0.05}(19) = 2.09$ (Excel: TINV(0.05;19)).
$49 \pm 2.09 \cdot \frac{4}{\sqrt{20}}$, so $\Delta \approx 1.9$ and $47.1 \le \mu \le 50.9$
With 95% confidence, the average weight loss of a 1 kg carrot sample lies in the interval from 47.1 g to 50.9 g.
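The carrot example can be recomputed with a short Python sketch. The Python standard library has no Student quantile function, so the critical value 2.09 is passed in as given in the example; `mean_ci_student` is a hypothetical helper name.

```python
from math import sqrt


def mean_ci_student(x_bar, s1, n, t_crit):
    """Small-sample confidence interval for the mean.

    t_crit is the two-sided critical value of Student's distribution
    with n - 1 degrees of freedom, supplied by the caller from a table.
    """
    delta = t_crit * s1 / sqrt(n)  # sampling error (half-width)
    return x_bar - delta, x_bar + delta, delta


# Values from the carrot example: n = 20, mean 49 g, s1 = 4 g, t = 2.09.
lower, upper, delta = mean_ci_student(x_bar=49, s1=4, n=20, t_crit=2.09)
print(round(delta, 1), round(lower, 1), round(upper, 1))
```

If SciPy is available, `scipy.stats.t.ppf(0.975, 19)` could supply the critical value instead of a table lookup.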
The size of the sampling error $\Delta$ depends on:
• the confidence probability $(1 - \alpha)$
• the standard error of the mean, which depends on:
  • the variability of the attribute, which we cannot change
  • the sample size, which we can change
The sample size needed to achieve the required reliability and accuracy can be determined from the formula:
$n = u_{1-\alpha/2}^2 \cdot \frac{s_1^2}{\Delta^2}$
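The sample-size formula $n = u_{1-\alpha/2}^2 \, s_1^2 / \Delta^2$ can be sketched in Python as follows. The helper name `required_sample_size` and the target accuracy of 20 Sk are assumptions chosen for illustration; $s_1 = 286$ Sk is taken from the household example, and the result is rounded up because a sample size must be a whole number.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(s1, delta, confidence=0.95):
    """Smallest n for which the sampling error does not exceed delta."""
    u = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # u_{1 - alpha/2}
    return ceil((u * s1 / delta) ** 2)


# Hypothetical target: estimate mean expenditure to within 20 Sk at 95%.
n_needed = required_sample_size(s1=286, delta=20)
print(n_needed)

# Consistency check against the household example (delta = 28.03 Sk).
n_check = required_sample_size(s1=286, delta=28.03)
print(n_check)
```

Halving the permitted error roughly quadruples the required sample size, since n grows with $1 / \Delta^2$.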
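Confidence interval for the variance $\sigma^2$ and the standard deviation $\sigma$
$P\left( \chi^2_{1-\alpha/2} \le \frac{(n-1) s_1^2}{\sigma^2} \le \chi^2_{\alpha/2} \right) = 1 - \alpha$
$\chi^2_{1-\alpha/2}$ and $\chi^2_{\alpha/2}$ are critical values of the chi-square distribution with n − 1 degrees of freedom; $\chi^2_{p}$ denotes the value exceeded with probability p, so $\chi^2_{1-\alpha/2} < \chi^2_{\alpha/2}$.
[Figure: density f($\chi^2$) with area $1 - \alpha$ between the two critical values and area $\alpha/2$ in each tail]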
After transformation we obtain:
$P\left( \frac{(n-1) s_1^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1) s_1^2}{\chi^2_{1-\alpha/2}} \right) = 1 - \alpha$
and, correspondingly, the confidence interval for the standard deviation:
$P\left( \sqrt{\frac{(n-1) s_1^2}{\chi^2_{\alpha/2}}} \le \sigma \le \sqrt{\frac{(n-1) s_1^2}{\chi^2_{1-\alpha/2}}} \right) = 1 - \alpha$
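A minimal Python sketch of the interval for the variance follows. It is an illustration under stated assumptions: the slides give no numerical example here, so the carrot data (n = 20, $s_1$ = 4 g) are reused, and the two chi-square critical values for 19 degrees of freedom (32.852 and 8.907) are taken from a standard table rather than from the slides. The larger critical value, exceeded with probability $\alpha/2$, gives the lower limit.

```python
from math import sqrt


def variance_ci(s1, n, chi2_large, chi2_small):
    """Confidence interval for the population variance.

    chi2_large: critical value exceeded with probability alpha/2
    chi2_small: critical value exceeded with probability 1 - alpha/2
    Both have n - 1 degrees of freedom and come from a table.
    """
    sum_of_squares = (n - 1) * s1 ** 2
    return sum_of_squares / chi2_large, sum_of_squares / chi2_small


# Assumed inputs: carrot data, table values for 19 degrees of freedom, 95%.
var_lower, var_upper = variance_ci(s1=4, n=20, chi2_large=32.852, chi2_small=8.907)
sd_lower, sd_upper = sqrt(var_lower), sqrt(var_upper)

print(round(var_lower, 2), round(var_upper, 2))
print(round(sd_lower, 2), round(sd_upper, 2))
```

The interval is not symmetric around $s_1^2 = 16$, because the chi-square distribution is skewed.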
Questions
• What is the essential difference between point and interval estimation? How do the interval limits depend on the confidence level? How does the confidence level influence the accuracy of the confidence interval?
• How can we ensure an interval estimate of the mean with a chosen confidence level and accuracy?