Basic Statistical Concepts
Statistical Population
• The entire underlying set of observations
from which samples are drawn.
– Philosophical meaning: all observations that could
ever be taken for range of inference
• e.g. all barnacle populations that have ever existed, that
exist or that will exist
– Practical meaning: all observations within a
reasonable range of inference
• e.g. barnacle populations on that stretch of coast
Statistical Sample
• A representative subset of a population.
– What counts as being representative
• Unbiased and hopefully precise
Strategies
• Define survey objectives: what is the goal of the survey
or experiment? What are your hypotheses?
• Define population parameters to estimate (e.g.
number of individuals, growth, color etc).
• Implement sampling strategy
– measure every individual (think of implications in terms of
cost, time, practicality especially if destructive)
– measure a representative portion of the population (a
sample)
Sampling
• Goal:
– Every unit and combination of units in the population (of
interest) has an equal chance of selection.
• This is a fundamental assumption in all estimation procedures
• How:
– Many ways if underlying distribution is not uniform
» In the absence of information about underlying
distribution the only safe strategy is random sampling
» Costs: sometimes difficult, and may lead to its own
source of bias (if sample size is low). Much more
about this later
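A minimal sketch of simple random sampling, where every unit and every combination of units has the same chance of selection. The sampling frame (100 numbered quadrats), the sample size, and the function name are hypothetical and only for illustration.

```python
import random


def simple_random_sample(units, n, seed=None):
    """Draw n distinct units; every combination of n units is equally likely."""
    rng = random.Random(seed)
    return rng.sample(units, n)


# Hypothetical sampling frame: 100 numbered quadrats on a stretch of coast.
quadrats = list(range(1, 101))
chosen = simple_random_sample(quadrats, 10, seed=1)
print(sorted(chosen))
```

Fixing the seed only makes the draw repeatable; in a real survey the seed would normally be left unset.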
Sampling Objectives
• To obtain an unbiased estimate of a
population mean
• To assess the precision of the estimate (i.e.
calculate the standard error of the mean)
• To obtain as precise an estimate of the
parameters as possible for time, effort and
money spent
Measures of location
• Population mean ($\mu$) - the average value
• Sample mean ($\bar{y}$) estimates $\mu$
• Population median - the middle value
• Sample median estimates population median
• In a normal distribution the mean = median (also the
mode); this is not ensured in other distributions
[Figure: two distributions of Y, one in which the mean and median coincide and one in which they differ.]
Measures of dispersion
• Population variance ($\sigma^2$) - average of squared
deviations from mean
• Measured sample variance ($s^2$) estimates population
variance:
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$
• Standard deviation ($s$)
– square root of variance
– same units as original variable
Measures (statistics) of Dispersion
• Population Sum of Squares: $SS = \sum (x_i - \mu)^2$
• Sample Sum of Squares: $SS = \sum (x_i - \bar{x})^2$
• Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$
– Note, units are squared
– Denominator is (n)
• Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$
– Note, units are squared
– Denominator is (n-1)
• Sample standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$
– Note, units are not squared
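The dispersion statistics above can be sketched in a few lines. The five observations are hypothetical values chosen so the arithmetic is easy to follow by hand; the variable names are illustrative only.

```python
import math
import statistics

# Hypothetical observations (not from the slides)
x = [12, 15, 11, 18, 14]
n = len(x)
mean = sum(x) / n

# Sum of squares: squared deviations from the sample mean
ss = sum((xi - mean) ** 2 for xi in x)

pop_var = ss / n            # denominator n (population formula)
sample_var = ss / (n - 1)   # denominator n-1 (sample formula)
sample_sd = math.sqrt(sample_var)  # back in the original units

print(ss, pop_var, sample_var, sample_sd)
```

The standard library functions `statistics.pvariance`, `statistics.variance` and `statistics.stdev` give the same three quantities directly.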
More Statistics of Dispersion
• Standard error of the mean: $s_{\bar{x}} = \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}}$
– This is also the standard deviation of the sample means
• Coefficient of variation: $CV = \frac{s}{\bar{x}}$
– Measurement of variation independent of units
– Expressed as a percentage of mean
• Covariance: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1}$
– Measure of how two variables covary
– Range is between $-\infty$ and $+\infty$
– Value depends in part on range in data: bigger numbers yield bigger values of
covariance
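A short sketch of the three statistics on this slide, assuming small hypothetical paired data. The function names are made up for illustration. The last lines show the point about covariance depending on the scale of the data: multiplying both variables by 10 multiplies the covariance by 100.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: s / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def coefficient_of_variation(values):
    """Sample SD expressed as a percentage of the mean."""
    return 100 * statistics.stdev(values) / statistics.fmean(values)


def covariance(xs, ys):
    """Sample covariance with an n-1 denominator."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) - 1)


# Hypothetical paired observations
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

print(standard_error(x), coefficient_of_variation(x), covariance(x, y))

# Same data in units ten times smaller: covariance grows 100-fold
x10 = [10 * v for v in x]
y10 = [10 * v for v in y]
print(covariance(x10, y10))
```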
Types of estimates
• Point estimate
– Single value estimate of the parameter, e.g. $\bar{y}$ is
a point estimate of $\mu$, $s$ is a point estimate of $\sigma$
• Interval estimate
– Range within which the parameter is believed to lie
with some degree of confidence, e.g. the 95%
confidence interval is an interval estimate of $\mu$
Sampling distribution
The frequency (or probability) distribution of a
statistic (e.g. sample mean):
• Many samples (size n) from population
• Calculate all the sample means
• Plot frequency distribution of sample means
(sampling distribution)
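The three steps above can be sketched as a small simulation. Everything here is an assumption for illustration: a hypothetical, roughly normal population with mean 25 and SD 5, samples of size 10, and 2,000 repeated samples. The standard deviation of the resulting sample means should come out close to the population SD divided by the square root of n.

```python
import math
import random
import statistics

rng = random.Random(42)

# Hypothetical population (assumed, not from the slides)
population = [rng.gauss(25, 5) for _ in range(10_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

n = 10               # size of each sample
num_samples = 2_000  # number of repeated samples

# Many samples of size n; calculate each sample mean
sample_means = [
    statistics.fmean(rng.sample(population, n)) for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_se = pop_sd / math.sqrt(n)

print(f"population mean {pop_mean:.2f}, mean of sample means {mean_of_means:.2f}")
print(f"SD of sample means {sd_of_means:.2f}, sigma/sqrt(n) {expected_se:.2f}")
```

A histogram of `sample_means` would be the sampling distribution of the mean for this population and sample size.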
[Figure: multiple samples drawn from a population distribution P(y), each giving its own sample mean; together the means form the sampling distribution of sample means. Example sample means: 21.5, 22.3, 23.0, 23.9, 24.9, 25.1, 25.8, 26.5, 27.8, 29.9. True mean = 25. Histogram axes: number of cases against estimate of mean.]
Sampling distribution of mean
• The sampling distribution of the sample mean
approaches a normal distribution as n gets
larger (Central Limit Theorem).
• The mean of this sampling distribution is $\mu$,
the mean of the original population.
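[Figure: histogram of the estimates of the mean from a large number of samples (number of cases and proportion per bar against estimate of mean, 15 to 35), alongside the corresponding probability distribution of the estimate of the mean ($\bar{x}$).]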
Sampling distribution of mean
• The standard deviation of this sampling distribution is
approximated by $s/\sqrt{n}$, the standard deviation of any
given sample divided by the square root of the sample size:
the standard error of the mean.
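Standard deviation can be calculated for any distribution
The standard deviation of the distribution of sample means can be
calculated the same as for a given sample:
$s_{\bar{x}} = \sqrt{\frac{\sum (\bar{x}_i - \bar{\bar{x}})^2}{N-1}}$
Where:
1. $\bar{\bar{x}}$ = mean of the means, and ~ $\mu$
2. N = number of means used in distribution
[Figure: probability distribution of the estimate of the mean ($\bar{x}$), with 2.5% of the area in each tail, beyond ~2 $s_{\bar{x}}$ on either side of the centre.]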
However:
To do so would require an
immense sampling effort, hence
an approximation is used:
$s_{\bar{x}} \approx SEM = \frac{s}{\sqrt{n}}$
Where:
s = sample standard deviation
and
n = number of replicates in the
sample
[Figure: probability distribution of the estimate of the mean ($\bar{x}$), with 2.5% of the area in each tail, beyond ~2 SEM on either side of the centre.]
Standard error of mean
• population SE (the standard deviation of the sampling
distribution) estimated by sample SE: $s/\sqrt{n}$
• measures precision of sample mean
• how close sample mean is likely to be to
true population mean
Standard error of mean
• If SE is low:
– repeated samples would produce similar sample
means
– therefore, any single sample mean likely to be
close to population mean
• If SE is high:
– repeated samples would produce very different
sample means
– therefore, any single sample mean may not be
close to population mean
[Figure: Effect of standard error on the estimate of $\mu$ (assume df is large). Two panels, 1 SEM = 2 and 1 SEM = 5, each plotting probability against estimate of mean (0 to 40), with 2.5% in each tail beyond ~2 SEM on either side of $\mu$.]
Worked example
Lovett et al. (2000) measured the
concentration of $SO_4^{2-}$ in 39 North American
forested streams (qk2002, Box 2.2)

Stream          $SO_4^{2-}$ (mmol.L$^{-1}$)
Santa Cruz      50.6
Colgate         55.4
Halsey          56.5
Batavia Hill    57.5

Statistic          Value
Sample mean        61.92
Sample median      62.10
Sample variance    27.46
Sample SD          5.24
SE of mean         0.84
Interval estimate
• How confident are we in a single sample estimate of
$\mu$, i.e. how close do we think our sample mean is to
the unknown population mean?
• Remember $\mu$ is a fixed, but unknown, value.
• Interval (range of values) within which we are 95%
(for example) sure $\mu$ occurs - a confidence interval
Distribution of sample means
[Figure: distribution of sample means, P($\bar{y}$) against $\bar{y}$, with the central 95% and 99% regions marked.]
Calculate the proportion of sample means within a
range of values.
Transform distribution of means to a distribution
with mean = 0 and standard deviation = 1
t statistic: $t = \frac{\bar{y} - \mu}{s/\sqrt{n}}$
[Figure: null distribution of t, probability (0.0 to 0.4) against t from -5 to 5.]
t statistic – interpretation and
units
• The deviation between the
sample and population
mean is expressed in terms
of standard errors (i.e.
standard deviations of the
sampling distribution)
• Hence values of t are in units of
standard errors
• For example, t = 2 indicates
that the deviation ($\bar{y} - \mu$) is
equal to 2 x the standard error
$t = \frac{\bar{y} - \mu}{s/\sqrt{n}}$
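As a quick illustration of t being a deviation measured in standard errors, here is a minimal sketch. The sample values and the hypothesised mean of 25 are hypothetical; `t_statistic` is an illustrative helper, not a library function.

```python
import math
import statistics


def t_statistic(sample, mu):
    """Deviation of the sample mean from mu, in units of standard errors."""
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return (statistics.fmean(sample) - mu) / se


# Hypothetical sample and hypothesised population mean
sample = [23, 25, 21, 28, 24]
t = t_statistic(sample, mu=25)
print(f"t = {t:.3f}")
```

Here the sample mean (24.2) sits a little under 0.7 standard errors below the hypothesised mean, so t is about -0.69.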
The t statistic
• This t statistic follows a t-distribution, which
has a mathematical formula.
• Approximately the same as the normal distribution for n > 30;
otherwise flatter and more spread out than the normal
distribution.
• Different t distributions for different sample
sizes < 30 (actually for different df, which is n-1).
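The "mathematical formula" can be written out directly. The sketch below uses the standard density of the t-distribution and compares it with the standard normal density at the centre (t = 0) and in the tail (t = 3), for df = 2 (n = 3) and df = 29 (n = 30). The function names are illustrative.

```python
import math


def t_pdf(t, df):
    """Probability density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def normal_pdf(z):
    """Probability density of the standard normal distribution."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)


for df in (2, 29):
    print(f"df={df:2d}  density at 0: {t_pdf(0, df):.4f}  at 3: {t_pdf(3, df):.4f}")
print(f"normal density at 0: {normal_pdf(0):.4f}  at 3: {normal_pdf(3):.4f}")
```

With small df the curve is lower in the middle and higher in the tails than the normal curve; by df = 29 the two are close but not identical.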
[Figure: null distributions of t for N = 30 and N = 3, probability (0.0 to 0.4) against t from -5 to 5.]
Two tailed t-values
Probabilities of $t = \frac{\bar{y} - \mu}{s/\sqrt{n}}$ occurring outside the range
$-t_{df}$ to $+t_{df}$

                      Probability
Degrees of Freedom    .01      .02      .05      .10      .20
1                     63.66    31.82    12.71    6.314    3.078
2                     9.925    6.965    4.303    2.920    1.886
3                     5.841    4.541    3.182    2.353    1.638
4                     4.604    3.747    2.776    2.132    1.533
5                     4.032    3.365    2.571    2.015    1.476
10                    3.169    2.764    2.228    1.812    1.372
15                    2.947    2.602    2.132    1.753    1.341
20                    2.845    2.528    2.086    1.725    1.325
25                    2.787    2.485    2.060    1.708    1.316
z                     2.575    2.326    1.960    1.645    1.282

[Figure: t distribution for 4 df; 95% of t values lie between -2.78 and +2.78.]
One and two tailed t-values (df 4)

                      Probability (1 tailed/2 tailed)
Degrees of Freedom    .005/.01  .01/.02  .025/.05  .05/.10  .10/.20
1                     63.66     31.82    12.71     6.314    3.078
2                     9.925     6.965    4.303     2.920    1.886
3                     5.841     4.541    3.182     2.353    1.638
4                     4.604     3.747    2.776     2.132    1.533
5                     4.032     3.365    2.571     2.015    1.476
10                    3.169     2.764    2.228     1.812    1.372
15                    2.947     2.602    2.132     1.753    1.341
20                    2.845     2.528    2.086     1.725    1.325
25                    2.787     2.485    2.060     1.708    1.316
z                     2.575     2.326    1.960     1.645    1.282

[Figure: t distributions for 4 df, t from -5 to 5. 2 tailed: 95% of t values lie between -2.78 and +2.78. 1 tailed: 95% of t values lie below +2.132; equivalently, 95% lie above -2.132.]
The t statistic
• The proportion of t values lying between particular t
values yields a confidence estimate (the likelihood
that the true mean is in the range)
For n = 5 (df = 4), 95% of all t values occur
between t = -2.78 and t = +2.78
[Figure: Pr(t) against t from -5 to 5, with the central 95% between -2.78 and +2.78.]
• Probability is 95% that t is between -2.78 and
+2.78
• Probability is 95% that $\frac{\bar{y} - \mu}{s/\sqrt{n}}$ is between -2.78 and
+2.78
• Rearrange equation to solve for $\mu$
Rearrange to solve for $\mu$
1. $t = \frac{\bar{y} - \mu}{s/\sqrt{n}}$
2. $t(s/\sqrt{n}) = (\bar{y} - \mu)$
3. For two tailed test: $\mu \geq \bar{y} - t(s/\sqrt{n})$ and $\mu \leq \bar{y} + t(s/\sqrt{n})$
Solve for $\mu$ (using df):
1. Calculated t values
2. Desired confidence level
(to determine range in
values that are likely to
contain $\mu$)
$\Pr[\bar{y} - t(s/\sqrt{n}) \leq \mu \leq \bar{y} + t(s/\sqrt{n})]$
For 95% CI, use the t value between which 95% of
all t values occur, for specific df (n-1):
$P[\bar{y} - t(s/\sqrt{n}) \leq \mu \leq \bar{y} + t(s/\sqrt{n})] = 0.95$
This is a confidence interval.
• If CIs were calculated from repeated samples of size n, 95% of
the CIs would contain $\mu$ and 5% wouldn't.
• Equivalently, we are 95% confident that this interval includes the
true population mean.
Worked example (Lovett et al. 2000)
• Sample mean: 61.92
• Sample SD: 5.24
• SE: 0.84
• The t value (95%, 38 df) = 2.02 (from a t-table)
• 2.5% of t values are greater than 2.02
• 2.5% of t values are less than -2.02
• 95% of t values are between -2.02 and +2.02
$P\{61.92 - 2.02(5.24/\sqrt{39}) < \mu < 61.92 + 2.02(5.24/\sqrt{39})\} = 0.95$
$P\{60.22 < \mu < 63.62\} = 0.95$
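The arithmetic in this worked example can be checked with a few lines. The function name is illustrative, and the critical t value is typed in from the t-table rather than computed. Because the slide rounds the SE to 0.84 before multiplying, the limits computed here may differ from 60.22 and 63.62 in the last decimal place.

```python
import math


def confidence_interval(mean, sd, n, t_crit):
    """Two tailed CI for the mean: mean +/- t * (sd / sqrt(n))."""
    se = sd / math.sqrt(n)
    half_width = t_crit * se
    return mean - half_width, mean + half_width


# Values from the worked example; t for 95% and 38 df read from the table
lower, upper = confidence_interval(mean=61.92, sd=5.24, n=39, t_crit=2.02)
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```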
Confidence Interval (2 tailed), assuming a
95% CI is desired
$\Pr[\bar{y} - t(s/\sqrt{n}) \leq \mu \leq \bar{y} + t(s/\sqrt{n})]$
Lovett et al. (2000), 38 df

                      Probability
Degrees of Freedom    .01      .02      .05      .10      .20
1                     63.66    31.82    12.71    6.314    3.078
2                     9.925    6.965    4.303    2.920    1.886
3                     5.841    4.541    3.182    2.353    1.638
4                     4.604    3.747    2.776    2.132    1.533
5                     4.032    3.365    2.571    2.015    1.476
10                    3.169    2.764    2.228    1.812    1.372
15                    2.947    2.602    2.132    1.753    1.341
20                    2.845    2.528    2.086    1.725    1.325
25                    2.787    2.485    2.060    1.708    1.316
38                    2.705    2.426    2.020    1.685    1.302

Sample mean    61.92
SEM            0.84
DF             38

$\bar{y} - t(s/\sqrt{n})$ = 61.92 – 2.02(0.84) = 60.22
$\bar{y} + t(s/\sqrt{n})$ = 61.92 + 2.02(0.84) = 63.62
60.22 < $\mu$ < 63.62
[Figure: 95% interval centred on the sample mean, 61.92.]
• Intervals calculated this way will contain $\mu$ 95%
of the time.
• We are 95% confident that the interval 60.22
– 63.62 contains $\mu$.
Effect on Confidence Interval

Case         Mean    Sample      Standard    Standard   Probability   Lower        Upper
                     size (SS)   deviation   Error      (%)           Confidence   Confidence
                                 (SD)                                 Limit        Limit
Reference    61.92   39          5.24        0.84       95%           60.22        63.62
Double SD    61.92   39          10.48       1.68       95%           58.53        65.31
Reduce SS    61.92   20          5.24        1.17       95%           59.47        64.37
Increase %   61.92   39          5.24        0.84       99%           59.65        64.20
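The four cases in the table can be reproduced with the same interval formula. This is a sketch under stated assumptions: the t values for 38 df (2.02 at 95%, 2.705 at 99%) are the ones in the slide's t-table, while the value for 19 df (2.093) is not given in the slides and is taken from a standard t-table. Small differences in the last decimal place come from rounding.

```python
import math


def confidence_interval(mean, sd, n, t_crit):
    """Two tailed CI for the mean: mean +/- t * (sd / sqrt(n))."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# (mean, SD, n, critical t)
cases = {
    "Reference": (61.92, 5.24, 39, 2.02),
    "Double SD": (61.92, 10.48, 39, 2.02),
    "Reduce SS": (61.92, 5.24, 20, 2.093),   # 19 df value: assumption, not in the slides
    "Increase %": (61.92, 5.24, 39, 2.705),  # 99% value for 38 df from the slide's table
}

results = {name: confidence_interval(*args) for name, args in cases.items()}
for name, (lower, upper) in results.items():
    print(f"{name:<11} {lower:.2f} to {upper:.2f}")
```

Doubling the SD, shrinking the sample, or asking for more confidence each widens the interval.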
Estimating other parameters
• Logic of interval estimation of population
mean using t-distribution can be extended to
resampling
– For example: confidence interval of the mean
Confidence Interval – using
resampling vs t-test
• CI from t distribution is based on a theoretical
distribution built from the mean and standard deviation
calculated from the sample data
• CI from resampling is based directly on the sample data
• For example, assume we have the following
observations and want to determine if the
mean is different from 10
– 9, 8, 9, 10, 9, 8, 9, 7, 11, 11
Confidence Interval – using t distribution
[Figure: distribution of the estimate of the mean, Prob(y) against y from 5 to 15, with $\bar{y} = 9.14$, $s = 1.30$ and Ho: µ = 10 marked, alongside the t distribution, Prob(t) against t from -5 to 5.]
• Sample mean: 9.139
• Sample SD: 1.30
• SE: 0.412
• The t value (95%, 9 df) = 2.26 (from a t-table)
• 2.5% of t values are greater than 2.26
• 2.5% of t values are less than -2.26
• 95% of t values are between -2.26 and +2.26
$P\{9.14 - 2.26(1.30/\sqrt{10}) < \mu < 9.14 + 2.26(1.30/\sqrt{10})\} = 0.95$
$P\{8.22 < \mu < 10.07\} = 0.95$
Confidence Interval – using resampling
• Resample many times, with replacement, each
with 10 observations
• Calculate means of all samples
• Generate distribution of means and determine
empirical confidence interval
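The steps above can be sketched as a percentile bootstrap. Assumptions to note: the ten observations are the values as listed on the earlier slide, which appear to be rounded (their mean is 9.1, not 9.139), so the limits will not match the slide's output exactly. The number of resamples, the seed, and the percentile method for reading off the limits are choices made here, not taken from the slides.

```python
import random
import statistics


def bootstrap_ci_mean(data, num_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement, n observations each time, and keep the means
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(num_resamples)
    )
    alpha = (1 - level) / 2
    lower = means[round(alpha * num_resamples)]
    upper = means[round((1 - alpha) * num_resamples) - 1]
    return lower, upper


observations = [9, 8, 9, 10, 9, 8, 9, 7, 11, 11]
lower, upper = bootstrap_ci_mean(observations)
print(f"95% bootstrap CI for the mean: {lower:.2f} to {upper:.2f}")
```

The sorted list `means` is the empirical distribution of means; the confidence limits are simply its 2.5th and 97.5th percentiles.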
[Figure: histogram of the estimates of mean, count and proportion per bar against mean value (8 to 11).]
95.0% Confidence Interval for Mean
Variable ¦    Mean    Lower    Upper
---------+----------------------------
y        ¦   9.142    8.453    9.912
Compare approaches

Statistic                   Using t-distribution   Using resampling
Mean                        9.139                  9.142
Upper Confidence Limit      10.07                  9.91
Lower Confidence Limit      8.22                   8.45
Accept Ho: µ = 10
(is 10 within 95% CI)       YES                    NO
Confidence Intervals using
resampling
• The same technique may be used to set
confidence limits to any statistic
e.g. the median,
the average (absolute) deviation,
standard deviation (s),
coefficient of variation, or
skewness.
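A minimal sketch of that idea: the same percentile bootstrap, with the statistic passed in as a function. The data are the ten observations listed earlier (as printed, so probably rounded); the function name, resample count and seed are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_ci(data, statistic, num_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap CI for any statistic computed from a sample."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        statistic(rng.choices(data, k=n)) for _ in range(num_resamples)
    )
    alpha = (1 - level) / 2
    lower = estimates[round(alpha * num_resamples)]
    upper = estimates[round((1 - alpha) * num_resamples) - 1]
    return lower, upper


data = [9, 8, 9, 10, 9, 8, 9, 7, 11, 11]
median_ci = bootstrap_ci(data, statistics.median)
sd_ci = bootstrap_ci(data, statistics.stdev)
print("median:", median_ci)
print("standard deviation:", sd_ci)
```

Any function that maps a list of observations to a single number (coefficient of variation, skewness, and so on) can be passed as `statistic`.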