Research Methodology and Statistics Module
Statistical Sampling
Interdepartmental Programme of Postgraduate Studies
Techno-Economic Systems
Δημήτρης Φουσκάκης (Dimitris Fouskakis)
What do you think about Statistics?
Introduction to Statistics
- Why do we need statistics?
- Descriptive statistics.
- Inferential statistics.
Why do we need statistics?
- Why indeed?
  "A distinctive function of statistics is this: it enables the scientist to make a numerical evaluation of the uncertainty of his conclusion." (Snedecor, 1950)
The fundamental problem: sampling
- How representative is my sample?
The fundamental problem: sampling
- Statistics can tell us how good the chances are that the characteristics of a given sample represent the characteristics of the target population, if
- each individual of the target population has the same chance of being sampled! (Assumption of randomness)
The fundamental problem: sampling
- Population: the set of all units of interest; X is a random variable. X follows a distribution f with unknown mean µ and standard deviation σ, and more generally with an unknown parameter θ.
- Random Sample: X1, …, Xn independent and identically distributed random variables (they follow the same distribution as X).
- The observed values of the random sample (sample values, i.e. the sample data) x1, …, xn help us make inferences.
Descriptive and inferential statistics
- Descriptive statistics: helps to describe the characteristics of a sample.
- Inferential statistics: a collection of methods that help to quantify how certain we can be when we make inferences from a given sample.
Types of data
- Categorical
  • nominal (married, single, divorced, …)
  • ordinal (minimal, moderate, severe, …)
  • binary (success, failure)
- Quantitative
  • discrete (0, 1, 2, 3, 4, 5, …), e.g. number of road accidents
  • continuous, e.g. height
Descriptive statistics
Measures of location:
- Sample Mean x̄ = Σxᵢ/n (the sum of all the scores divided by the number of observations).
- Median (the score that lies at the midpoint when the data are ranked in order).
- Mode (the most frequently occurring score).
- Trimmed Mean (some of the largest and smallest observations are removed before calculating the mean).
Descriptive statistics (continued)
Measures of spread:
- Range (the lowest and highest values).
- Centiles (two values that encompass most, rather than all, of the data values, e.g. quartiles).
- Standard Deviation (SD) s (the idea is based on averaging the distance of each value from the mean); SD = √Variance.
- Variance s² = Σ(xᵢ − x̄)²/(n − 1) (the square of the SD).
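The measures of location and spread above can be computed with Python's standard statistics module. A minimal sketch with made-up sample values (the trimmed_mean helper is an illustration, not a standard-library function):

```python
import statistics

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]  # illustrative sample values

mean = statistics.mean(data)          # sum of scores / number of observations
median = statistics.median(data)      # midpoint of the ranked data
mode = statistics.mode(data)          # most frequently occurring score
variance = statistics.variance(data)  # s^2, with n - 1 in the denominator
sd = statistics.stdev(data)           # s = sqrt(variance)

def trimmed_mean(xs, k=1):
    """Trimmed mean: drop the k smallest and k largest values first."""
    xs = sorted(xs)[k:len(xs) - k]
    return statistics.mean(xs)

print(mean, median, mode, round(sd, 2))
```

Note that statistics.variance and statistics.stdev already use the n − 1 denominator defined above.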
Graphical representations of variability
- Histogram
- Boxplot
- Frequency polygon
- Stem-and-leaf diagram

Example stem-and-leaf diagram (split stems):
  0 | 111
  0 | 222333
  0 | 445
  0 | 666666677
  0 | 89
  1 | 000000011111111
  1 | 22222222222233333333
Estimate the shape of the p.d.f. of X
- To estimate the shape of the p.d.f. f(x) of X, one can create a frequency table of the sample values x1, …, xn. This is done by dividing the range of the values x1, …, xn into a set of intervals. Then create a histogram and use it as an estimate of the shape of the p.d.f. f(x) of X.
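The frequency-table construction described above can be sketched in Python (the data and the number of intervals are illustrative):

```python
import random

random.seed(1)
sample = [random.gauss(0, 1) for _ in range(1000)]  # illustrative data

# Divide the range of the sample values into k equal-width intervals and count.
k = 10
lo, hi = min(sample), max(sample)
width = (hi - lo) / k
counts = [0] * k
for x in sample:
    i = min(int((x - lo) / width), k - 1)  # clamp the maximum into the last bin
    counts[i] += 1

# Relative frequency divided by bin width estimates the height of f(x) over each bin.
densities = [c / (len(sample) * width) for c in counts]
```

Plotting counts (or densities) per bin gives the histogram used as the estimate of f(x).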
Estimating probabilities
- Suppose that we want to estimate the probability p = P(a ≤ X ≤ b). Let p̂ denote the fraction of the sample data x1, …, xn that lie between the values a and b. Then p̂ is an estimate of the required probability.
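The estimator p̂ described above can be sketched with simulated data (estimate_prob is an illustrative helper; the distribution and endpoints are made up):

```python
import random

random.seed(0)
sample = [random.gauss(0, 1) for _ in range(10_000)]  # illustrative data

def estimate_prob(xs, a, b):
    """Estimate p = P(a <= X <= b) by the fraction of sample values in [a, b]."""
    return sum(a <= x <= b for x in xs) / len(xs)

p_hat = estimate_prob(sample, -1.96, 1.96)  # should be close to 0.95 for N(0, 1)
```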
Estimate the mean and the variance
- The observed sample mean x̄ = Σxᵢ/n can be used to estimate the true mean µ of X.
- The observed variance s² = Σ(xᵢ − x̄)²/(n − 1) can be used to estimate the true variance σ² of X.
Sample Mean
- The definitions of the observed sample mean and variance pertain to the observed values x1, …, xn. Let us instead look at the problem before the random sample is collected. Recall that before the sample is collected the random variables X1, …, Xn denote the uncertain values that will be obtained from the random sample.
Sample Mean
Before sampling, the sample mean and sample variance are themselves RANDOM VARIABLES:
  X̄ = (X1 + … + Xn)/n
  S² = Σ(Xᵢ − X̄)²/(n − 1)
Sample Mean
  E(X̄) = µ,  Var(X̄) = σ²/n,  E(S²) = σ²
How good an estimate of the mean µ is the observed sample mean x̄? How reliable is this estimate?
From the Central Limit Theorem (for n ≥ 30):
  X̄ ~ N(µ, σ/√n)
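The facts E(X̄) = µ and Var(X̄) = σ²/n can be illustrated by simulation (all parameters are made up):

```python
import random
import statistics

random.seed(42)
mu, sigma, n = 50, 10, 30

# Draw many random samples of size n and record each observed sample mean.
means = [statistics.mean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(5000)]

print(statistics.mean(means))   # close to mu
print(statistics.stdev(means))  # close to sigma / sqrt(n) ~ 1.83
```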
Example
Berkshire Power Company (BPC) is an electric utility company that provides electric power. It has recently implemented a variety of incentive programs to encourage households to conserve energy in the winter months. The company would like to estimate the mean µ and standard deviation σ of the distribution of household electricity consumption for January.
Sample of n = 100 households.
Example
[Figure omitted.]
Example
  x̄ = Σxᵢ/n = 3011 KWH
  s² = Σ(xᵢ − x̄)²/(n − 1) = 540483.7 ⇒ s = 735.18 KWH
Suppose we now choose a different sample of 100 households. How different would the answers be? What if instead we chose n = 10?
Remember that from the Central Limit Theorem, X̄ ~ N(µ, σ/√n).
The standard deviation of the distribution of the sample mean is lower when n is larger.
Example
[Figure omitted.]
Confidence Intervals for the Mean for Large Sample Size
The observed sample mean x̄ will be a more reliable estimate for µ when the sample size n is larger. We can quantify this intuitive notion of the reliability of an estimate by developing the concept of a confidence interval (C.I.).
Consider the following problem: compute the quantity b such that
  p = P(µ − b ≤ X̄ ≤ µ + b) = 0.95
  p = P(−b/(σ/√n) ≤ (X̄ − µ)/(σ/√n) ≤ b/(σ/√n)) = 0.95
Since Z = (X̄ − µ)/(σ/√n) ~ N(0, 1) for n > 29,
  P(−1.96 ≤ Z ≤ 1.96) = 0.95 ⇒ P(X̄ − 1.96σ/√n ≤ µ ≤ X̄ + 1.96σ/√n) = 0.95
Confidence Intervals for the Mean for Large Sample Size
If n ≥ 30 then a 95% confidence interval for the mean µ is the interval
  [x̄ − 1.96 s/√n, x̄ + 1.96 s/√n]
Interpretation of a confidence interval: since both the sample mean X̄ and the sample variance S² are random variables, each time we take a random sample we find different values for the observed sample mean x̄ and the observed variance s². This results in a different confidence interval each time we sample. A 95% confidence interval means that 95% of the resulting intervals will contain the actual mean µ.
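This interpretation can be checked by simulation: draw many samples, build the 95% interval from each, and count how often the true µ is covered (the population parameters here are made up):

```python
import math
import random
import statistics

random.seed(7)
mu, sigma, n, trials = 100, 15, 50, 2000

covered = 0
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar, s = statistics.mean(xs), statistics.stdev(xs)
    half = 1.96 * s / math.sqrt(n)        # half-width of the 95% C.I.
    if xbar - half <= mu <= xbar + half:  # does this interval contain mu?
        covered += 1

print(covered / trials)  # close to 0.95
```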
Confidence Intervals for the Mean for Large Sample Size
In our previous example with the Berkshire Power Company, with a sample size of n = 100 we get the following 95% confidence interval for the true mean:
  [x̄ − 1.96 s/√n, x̄ + 1.96 s/√n] = [2866.9, 3155.1]
If instead our sample size were smaller, our uncertainty about the true value of µ would be larger, and thus we should expect a wider confidence interval.
Confidence Intervals for the Mean for Large Sample Size
Suppose that x̄ is the observed sample mean and s² is the observed variance. If n ≥ 30 then a β% confidence interval for the mean µ is the interval:
  [x̄ − zα/2 s/√n, x̄ + zα/2 s/√n]
where zα/2 is such that P(−zα/2 ≤ Z ≤ zα/2) = β/100 and α = 1 − (β/100).
  For β = 90%, α = 0.10, zα/2 = 1.645
  For β = 95%, α = 0.05, zα/2 = 1.960
  For β = 98%, α = 0.02, zα/2 = 2.326
  For β = 99%, α = 0.01, zα/2 = 2.576
Thus in our previous example, with our sample of 100 households, a 99% confidence interval for the true mean is:
  [x̄ − 2.576 s/√n, x̄ + 2.576 s/√n] = [2821.6, 3200.3]
which is wider than the 95% one.
Normal Table
[Table of the standard Normal distribution omitted.]
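For reference, the Normal-table lookups used throughout these slides can be reproduced with Python's statistics.NormalDist (a sketch):

```python
from statistics import NormalDist

def z_alpha_over_2(beta_percent):
    """z such that P(-z <= Z <= z) = beta/100, i.e. the (1 - alpha/2) quantile."""
    alpha = 1 - beta_percent / 100
    return NormalDist().inv_cdf(1 - alpha / 2)

for beta in (90, 95, 98, 99):
    print(beta, round(z_alpha_over_2(beta), 3))  # 1.645, 1.96, 2.326, 2.576
```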
Confidence Intervals for the Mean for Small Sample Size
What if our sample size is less than 30? The procedure for constructing a confidence interval for the true mean is the same as before, but this time
  T = (X̄ − µ)/(S/√n)
follows approximately a t-distribution with k = n − 1 degrees of freedom (this approximation works well only if the Xi are approximately Normally distributed).
Thus the β% confidence interval for the true mean is:
  [x̄ − c s/√n, x̄ + c s/√n]
where c is such that P(−c ≤ T ≤ c) = β/100 and T follows the t-distribution with n − 1 degrees of freedom.
Example
In the Berkshire Power Company example let us suppose that our sample was from only n = 10 households, and gave us an observed sample mean of 3056 KWH and an observed sample standard deviation of 800 KWH. Then a 99% C.I. for the true mean is:
  [x̄ − c s/√n, x̄ + c s/√n] = [3056 − 3.250×800/√10, 3056 + 3.250×800/√10] ≈ [2233.8, 3878.2]
where the value 3.250 can be easily obtained from the tables of the t-distribution with k = 10 − 1 = 9 degrees of freedom and β = 99%.
Student Table
[Table of the Student t-distribution omitted.]
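The small-sample interval can be sketched as a short function; the t-table constant c must still be looked up by hand (here the slide's value 3.250 for 9 degrees of freedom):

```python
import math

def t_confidence_interval(xbar, s, n, c):
    """C.I. for the mean with a small sample: xbar -/+ c*s/sqrt(n),
    where c comes from a t-table with n - 1 degrees of freedom."""
    half = c * s / math.sqrt(n)
    return xbar - half, xbar + half

# Slide example: n = 10, xbar = 3056, s = 800, c = 3.250 (99%, 9 d.f.)
lo, hi = t_confidence_interval(3056, 800, 10, c=3.250)
print(round(lo, 1), round(hi, 1))  # 2233.8 3878.2
```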
Confidence Interval for the population proportion
Suppose that the National Institutes of Health (NIH) would like to estimate the proportion of teenagers that smoke. They randomly sampled 1000 teenagers and found that 253 of them are smokers. Thus the observed sample proportion is p̂ = 253/1000 = 0.253.
We would like to construct a C.I. for the estimate of the true proportion of teenagers that smoke.
Let X be the number of teenagers in the sample of size n that smoke. Then X ~ B(n, p) and therefore E(X) = np and Var(X) = np(1 − p).
If P̂ = X/n is the sample proportion (a random variable) then E(P̂) = p and Var(P̂) = p(1 − p)/n.
Confidence Interval for the population proportion
If np̂ ≥ 5 and n(1 − p̂) ≥ 5, then from the Central Limit Theorem we have that
  Z = (P̂ − p)/√(P̂(1 − P̂)/n)
obeys approximately the standard Normal distribution.
Using the above fact we can derive the following result:
If p̂ is the observed sample proportion in a sample of size n, and np̂ ≥ 5 and n(1 − p̂) ≥ 5, then a β% C.I. for the population proportion p is:
  [p̂ − zα/2 √(p̂(1 − p̂)/n), p̂ + zα/2 √(p̂(1 − p̂)/n)]
where zα/2 is such that P(−zα/2 ≤ Z ≤ zα/2) = β/100 and α = 1 − (β/100).
Confidence Interval for the population proportion
So in our example let us compute a 99% C.I. for the proportion of teenagers that smoke. Note that np̂ ≥ 5 and n(1 − p̂) ≥ 5, so we can use the preceding method. From the tables of the standard Normal distribution we find that c = 2.576, and thus the required C.I. is:
  [p̂ − c√(p̂(1 − p̂)/n), p̂ + c√(p̂(1 − p̂)/n)]
  = [0.253 − 2.576√(0.253(1 − 0.253)/1000), 0.253 + 2.576√(0.253(1 − 0.253)/1000)]
  = [0.218, 0.288].
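The proportion interval can be sketched as follows, reproducing the NIH example:

```python
import math

def proportion_ci(x, n, z):
    """beta% C.I. for a population proportion
    (valid if n*p_hat >= 5 and n*(1 - p_hat) >= 5)."""
    p_hat = x / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

# NIH example: 253 smokers out of 1000 teenagers, 99% C.I. (z = 2.576)
lo, hi = proportion_ci(253, 1000, 2.576)
print(round(lo, 3), round(hi, 3))  # 0.218 0.288
```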
Experimental Design for Estimating the Mean µ
The sample size n affects the width of the C.I. How large should n be in order to satisfy a pre-specified tolerance in the width of the β% C.I.? This is a question of experimental design.
  n = zα/2² s² / L²
where L is the tolerance level, i.e. our estimate x̄ is within plus or minus L of the true value µ with probability β/100, and zα/2 is such that P(−zα/2 ≤ Z ≤ zα/2) = β/100 and α = 1 − (β/100).
Experimental Design for Estimating the Mean µ
- If the value of n computed from the previous expression is less than 30, then we set n = 30.
- One difficulty in using the previous expression is that we have to know the value of the sample standard deviation in advance. However, one can typically obtain a rough estimate of the sample standard deviation s by conducting a small pilot sample first.
Example
Suppose that a marketing research firm wants to conduct a survey to estimate the mean µ of the distribution of the amount spent on entertainment by each adult who visits a certain popular resort. The firm would like to estimate the mean of this distribution to within $120.00 with 95% confidence. From data regarding past operations at the resort, it has been estimated that the standard deviation of entertainment expenditures is no more than $400.00. How large should the sample size be?
  n = zα/2² s² / L² = (1.96² × 400²) / 120² = 42.68 ≈ 43
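A sketch of the sample-size rule, including the n ≥ 30 floor from the previous slide, applied to the resort example:

```python
import math

def sample_size_for_mean(z, s, L):
    """n = z^2 * s^2 / L^2, rounded up; at least 30 so the CLT applies."""
    n = math.ceil((z * s / L) ** 2)
    return max(n, 30)

# Resort example: within L = $120 with 95% confidence, s <= $400
print(sample_size_for_mean(1.96, 400, 120))  # 43
```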
Experimental Design for Estimating the Proportion p
Suppose we want a β% C.I. for the proportion with a tolerance level of L. Then we obtain:
  n = c² p̂(1 − p̂) / L²
where c is such that P(−c ≤ Z ≤ c) = β/100.
The problem with using the above formula directly is that we do not know the value of the observed sample proportion p̂ in advance. However, it can easily be proved that p̂(1 − p̂) ≤ 1/4. Thus if we use the value of 1/4 instead of p̂(1 − p̂), we obtain the "conservative" estimate:
  n = zα/2² / (4L²)
Example
Suppose that a major American television network is interested in estimating the proportion p of American adults who are in favor of a particular national issue such as handgun control. They would like to compute a 95% C.I. whose tolerance level is plus or minus 3%. How many adults would the television network need to poll?
  n = zα/2² / (4L²) = 1.96² / (4 × 0.03²) = 1067.11 ≈ 1068
This is a rather remarkable fact. No matter how small or large the proportion we want to estimate is, if we randomly sample 1,068 adults, then in 19 cases out of 20 (95%), the results based on such a sample will differ by no more than 3% in either direction from what would have been obtained by polling all American adults.
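The conservative rule can be sketched and checked against the poll example:

```python
import math

def conservative_sample_size(z, L):
    """n = z^2 / (4 * L^2), using the bound p_hat * (1 - p_hat) <= 1/4."""
    return math.ceil(z ** 2 / (4 * L ** 2))

# Poll example: 95% confidence, tolerance of 3 percentage points
print(conservative_sample_size(1.96, 0.03))  # 1068
```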
Comparing Estimates of the Mean of Two Distributions
Suppose that a national department store chain is considering whether or not to promote its products via a direct mail promotion campaign. They have chosen two randomly selected groups of consumers with n1 and n2 consumers in each group. They plan to mail the promotional material to all the consumers in the first group but not to any of the second group. Then they plan to monitor the spending of each consumer in each group in their stores in the coming month in order to estimate the effectiveness of the promotional campaign.
Suppose that the true mean of the first group is µ1 with a standard deviation of σ1, and for the second group µ2 with a standard deviation of σ2. Our objective is to estimate the difference µ1 − µ2.
Suppose that we plan to randomly sample n1 observations X1, …, Xn1 from the first population and n2 observations Y1, …, Yn2 from the second population.
Comparing Estimates of the Mean of Two Distributions
The two sample means then are:
  X̄ = (X1 + … + Xn1)/n1,  Ȳ = (Y1 + … + Yn2)/n2
and:
  E(X̄ − Ȳ) = µ1 − µ2,  Var(X̄ − Ȳ) = σ1²/n1 + σ2²/n2
From the Central Limit Theorem we then have that:
  Z = (X̄ − Ȳ − (µ1 − µ2)) / √(σ1²/n1 + σ2²/n2) ~ N(0, 1) when n1, n2 ≥ 30
Comparing Estimates of the Mean of Two Distributions
If x̄, ȳ are the two observed sample means and s1, s2 the two observed standard deviations, then the estimate for µ1 − µ2 is the difference between the observed sample means, x̄ − ȳ.
A β% C.I. for the true difference µ1 − µ2 of the two population means is:
  [x̄ − ȳ − zα/2 √(s1²/n1 + s2²/n2), x̄ − ȳ + zα/2 √(s1²/n1 + s2²/n2)]
where zα/2 is such that P(−zα/2 ≤ Z ≤ zα/2) = β/100 and α = 1 − (β/100).
Comparing Estimates of the Mean of Two Distributions
Back in our example, suppose that n1 = 500 and n2 = 400 consumers. Suppose that the observed sample mean of consumer sales in the first group is $387 and in the second group $365, with an observed standard deviation in the first group of $233 and in the second of $274. Let us compute a 98% C.I. for the difference between the means µ1 − µ2 of the distributions of sales of the two groups.
  [x̄ − ȳ − zα/2 √(s1²/n1 + s2²/n2), x̄ − ȳ + zα/2 √(s1²/n1 + s2²/n2)]
  = [387 − 365 − 2.326√(233²/500 + 274²/400), 387 − 365 + 2.326√(233²/500 + 274²/400)]
  ≈ [−$18.04, $62.04]
Because this C.I. contains zero, we are not 98% confident that the promotional campaign will result in any increase in consumer spending.
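The computation can be sketched as a short function using the slide's inputs (assuming both samples are large enough for the Normal approximation to apply):

```python
import math

def two_mean_ci(xbar, ybar, s1, s2, n1, n2, z):
    """beta% C.I. for mu1 - mu2 when n1, n2 >= 30."""
    half = z * math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    diff = xbar - ybar
    return diff - half, diff + half

# Department store example: 98% C.I. (z = 2.326)
lo, hi = two_mean_ci(387, 365, 233, 274, 500, 400, 2.326)
print(round(lo, 2), round(hi, 2))
```

With these inputs the interval straddles zero, matching the slide's conclusion that the campaign's effect is not established at the 98% level.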
Comparing Estimates of the Population Proportion of Two Populations
We need to estimate the difference p1 − p2 between the proportions of two independent populations. Suppose we sample from both populations, obtaining n1 and n2 observations respectively. Let X denote the number of observations in the first population with the characteristic of interest and Y the number of observations in the second population with the characteristic of interest. The sample proportions of the two populations then are:
  P̂1 = X/n1,  P̂2 = Y/n2
and:
  E(P̂1 − P̂2) = p1 − p2,  Var(P̂1 − P̂2) = p1(1 − p1)/n1 + p2(1 − p2)/n2
From the Central Limit Theorem we then have that:
  Z = (P̂1 − P̂2 − (p1 − p2)) / √(P̂1(1 − P̂1)/n1 + P̂2(1 − P̂2)/n2) ~ N(0, 1)
Comparing Estimates of the Population Proportion of Two Populations
If the observed sample proportions are p̂1, p̂2 then the estimate for the difference between the proportions p1 − p2 is the difference between the observed sample proportions, p̂1 − p̂2.
If also n1p̂1, n2p̂2, n1(1 − p̂1), n2(1 − p̂2) ≥ 5, then a β% C.I. for the difference between the proportions p1 − p2 is:
  [p̂1 − p̂2 − zα/2 √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2), p̂1 − p̂2 + zα/2 √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)]
where zα/2 is such that P(−zα/2 ≤ Z ≤ zα/2) = β/100 and α = 1 − (β/100).
Example
In a ten-year study, 3,806 middle-aged men with high cholesterol levels but no known heart problems were randomly divided into two equal groups. Members of the first group received a new drug designed to lower cholesterol levels, while the second group received daily dosages of a placebo. Besides lowering cholesterol levels, the drug appeared to be effective in reducing the incidence of heart attacks. During the 10 years, 155 of those in the first group had a heart attack, compared to 187 in the second group. Let p1 denote the proportion of middle-aged men with high cholesterol who will suffer a heart attack within ten years if they receive the new drug, and let p2 denote the proportion of middle-aged men with high cholesterol who will suffer a heart attack within ten years if they do not receive the new drug.
Let us compute the 90% C.I. of the difference between the proportions p1 − p2.
Here we have: n1 = 1,903, n2 = 1,903 and
  p̂1 = 155/1903 = 0.08145,  p̂2 = 187/1903 = 0.09827
For β = 90% we find that c = 1.645. Therefore a 90% C.I. is:
Example
  [p̂1 − p̂2 − c√(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2), p̂1 − p̂2 + c√(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)]
  = [−0.032, −0.0016].
Note that this entire range is less than zero; therefore we are 90% confident that the new drug is effective in reducing the incidence of heart attacks in middle-aged men with high cholesterol.
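A sketch reproducing the drug-study interval:

```python
import math

def two_proportion_ci(x1, n1, x2, n2, z):
    """beta% C.I. for p1 - p2 (valid when n*p_hat >= 5 and
    n*(1 - p_hat) >= 5 in both groups)."""
    p1, p2 = x1 / n1, x2 / n2
    half = z * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - half, diff + half

# Drug study: 155/1903 vs 187/1903 heart attacks, 90% C.I. (z = 1.645)
lo, hi = two_proportion_ci(155, 1903, 187, 1903, 1.645)
print(round(lo, 3), round(hi, 4))  # -0.032 -0.0016
```

Since both endpoints are negative, the interval agrees with the slide's conclusion that the drug group had the lower heart-attack rate.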