Download The paired sample experiment - Department of Mathematics

The paired sample experiment
The paired t test
Frequently one is interested in comparing the
effects of two treatments (drugs, etc…) on a
response variable.
The two treatments determine two different
populations
– Popn 1 cases treated with treatment 1.
– Popn 2 cases treated with treatment 2
The response variable is assumed to have a
normal distribution within each population
differing possibly in the mean (and also possibly
in the variance)
Two independent sample design
A sample of n cases is selected from
population 1 (cases receiving treatment 1) and a
second sample of m cases is selected from
population 2 (cases receiving treatment 2).
The data
– x1, x2, x3, …, xn from population 1.
– y1, y2, y3, …, ym from population 2.
The test that is used is the t-test for two
independent samples
The test statistic (if equal variances are assumed):
$$ t = \frac{\bar{y} - \bar{x}}{s_{Pooled}\sqrt{\frac{1}{n} + \frac{1}{m}}} $$
where
$$ s_{Pooled} = \sqrt{\frac{(n-1)s_x^2 + (m-1)s_y^2}{n + m - 2}} $$
The matched pair experimental design (The paired
sample experiment)
Prior to assigning the treatments the subjects are grouped
into pairs of similar subjects.
Suppose that there are n such pairs (a total of 2n = n + n
subjects or cases). The two treatments are then randomly
assigned within each pair: one member of a pair receives
treatment 1, while the other receives treatment 2. The data
collected are as follows:
– (x1, y1), (x2, y2), (x3, y3), …, (xn, yn).
xi = the response for the case in pair i that receives
treatment 1.
yi = the response for the case in pair i that receives
treatment 2.
Let di = yi - xi. Then
d1, d2, d3, …, dn
is a sample from a normal distribution with mean
$\mu_d = \mu_2 - \mu_1$,
variance
$$ \sigma_d^2 = \sigma_x^2 + \sigma_y^2 - 2\,\mathrm{cov}(x, y) = \sigma_x^2 + \sigma_y^2 - 2\rho_{xy}\sigma_x\sigma_y $$
and standard deviation
$$ \sigma_d = \sqrt{\sigma_x^2 + \sigma_y^2 - 2\rho_{xy}\sigma_x\sigma_y} $$
Note: if the x and y measurements are positively
correlated (this will be true if the cases in each pair are
matched effectively) then σd will be small.
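To see how the correlation shrinks the standard deviation of the differences, here is a small sketch that evaluates the formula for σd at a few correlations. The function name `sd_of_difference` and the unit standard deviations are assumptions chosen only for illustration.

```python
import math


def sd_of_difference(sigma_x, sigma_y, rho_xy):
    """Standard deviation of d = y - x for correlated x and y."""
    return math.sqrt(
        sigma_x**2 + sigma_y**2 - 2 * rho_xy * sigma_x * sigma_y
    )


# Illustrative values: sigma_x = sigma_y = 1 and increasing correlation
for rho in (0.0, 0.5, 0.9):
    print(f"rho = {rho:.1f}  sigma_d = {sd_of_difference(1.0, 1.0, rho):.4f}")
```

The stronger the positive correlation within pairs, the smaller σd becomes, which is what makes the paired design sensitive.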
Testing H0: μ1 = μ2 is equivalent to testing H0: μd = 0.
(we have converted the two sample problem into a single
sample problem).
The test statistic is the single sample t-test on the
differences
d1, d2, d3 , … , dn
namely
$$ t_d = \frac{\bar{d} - 0}{s_d / \sqrt{n}} $$
with df = n - 1
Example
We are interested in comparing the effectiveness of two
methods for reducing high cholesterol.
The methods
1. Use of a drug.
2. Control of diet.
The 2n = 8 subjects were paired into 4 matched pairs.
In each matched pair one subject was given the
drug treatment, the other subject was given the diet
control treatment. Assignment of treatments was
random.
The data
Reduction in cholesterol after a 6-month period

Pair   Drug treatment   Diet control treatment
1      30.3             25.7
2      10.2              9.4
3      22.3             24.6
4      15.0              8.9
Differences

Pair   Drug treatment   Diet control treatment   di (drug - diet)
1      30.3             25.7                      4.6
2      10.2              9.4                      0.8
3      22.3             24.6                     -2.3
4      15.0              8.9                      6.1

$\bar{d} = 2.3$, $s_d = 3.792$
$$ t_d = \frac{\bar{d} - 0}{s_d / \sqrt{n}} = \frac{2.3}{3.792 / \sqrt{4}} = 1.213 $$
t0.025 = 3.182 for df = n - 1 = 3. Since |td| = 1.213 < 3.182, we accept H0.
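The arithmetic of this example can be checked with a short sketch. The data and the critical value 3.182 are taken from the text; the variable names are illustrative choices, and the differences are taken as drug minus diet, as in the table of differences.

```python
import math
from statistics import mean, stdev

# Reductions in cholesterol for the 4 matched pairs (from the example)
drug = [30.3, 10.2, 22.3, 15.0]
diet = [25.7, 9.4, 24.6, 8.9]

d = [a - b for a, b in zip(drug, diet)]   # differences: drug - diet
n = len(d)
d_bar = mean(d)
s_d = stdev(d)                            # sample sd (divisor n - 1)
t_d = (d_bar - 0) / (s_d / math.sqrt(n))

T_CRIT = 3.182  # t_{0.025} with df = 3, as quoted in the text
reject = abs(t_d) > T_CRIT

print(f"d_bar = {d_bar:.1f}, s_d = {s_d:.3f}, t_d = {t_d:.3f}, reject H0: {reject}")
```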
Nonparametric
Statistical Methods
Many statistical procedures make assumptions
The t test, z test make the assumption that the
populations being sampled are normally
distributed. (True for both the one sample and
the two sample test).
This assumption for large sample sizes is not
critical.
(Reason: The Central Limit Theorem)
The sample mean, the statistic z will have
approximately a normal distribution for large
sample sizes even if the population is not
normal.
For small sample sizes the departure from the
assumption of normality could affect the
performance of a statistical procedure that
assumes normality.
For testing, the probability of a type I error may
not be the desired value of α = 0.05 or 0.01.
For confidence intervals the probability of
capturing the parameter may not be the desired value
(95% or 99%) but a value considerably smaller.
Example: Consider the z-test
For α = 0.05 we reject the hypothesized value of
the mean if z < -1.96 or z > 1.96, where
$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$
Suppose the population is an exponential population
with parameter λ (μ = 1/λ and σ = 1/λ).
[Figure: density of the actual (exponential) population compared with the assumed (normal) population]
Suppose the population is an exponential population
with parameter λ (μ = 1/λ and σ = 1/λ).
It can be shown (use mgf's) that the sampling distribution of x̄
is the Gamma distribution with rate parameter n/μ (= nλ)
and shape parameter α = n.
The distribution of x̄ is not the normal distribution
with
$$ \mu_{\bar{x}} = \mu \quad \text{and} \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{\mu}{\sqrt{n}} $$
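Because the sampling distribution of x̄ is an exact Gamma distribution here, the true type I error of the two-sided z-test (nominal α = 0.05) can be computed rather than simulated. The sketch below assumes H0 is true and σ = 1/λ is known, and uses the closed-form CDF of a Gamma distribution with integer shape. The function names are hypothetical.

```python
import math

Z_CRIT = 1.96  # two-sided 5% critical value used in the text


def erlang_cdf(t, shape):
    """P(G <= t) for G ~ Gamma(shape, rate 1), integer shape."""
    if t <= 0:
        return 0.0
    partial = sum(t**k / math.factorial(k) for k in range(shape))
    return 1.0 - math.exp(-t) * partial


def actual_type1_error(n):
    """Exact tail rejection probabilities of the z-test for exponential data.

    With S = n * lambda * x_bar ~ Gamma(n, 1), the rule |z| > 1.96 becomes
    S < n - 1.96 * sqrt(n)  or  S > n + 1.96 * sqrt(n).
    """
    half_width = Z_CRIT * math.sqrt(n)
    lower = erlang_cdf(n - half_width, n)
    upper = 1.0 - erlang_cdf(n + half_width, n)
    return lower, upper


for n in (2, 5, 20):
    lower, upper = actual_type1_error(n)
    print(f"n = {n:2d}: lower tail = {lower:.4f}, "
          f"upper tail = {upper:.4f}, total = {lower + upper:.4f}")
```

For small n the two tails are far from the nominal 0.025 each, since the Gamma distribution is skewed to the right; the imbalance fades as n grows.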
[Figure: sampling distribution of x̄ for n = 2, actual distribution versus the distribution assuming normality]
[Figure: sampling distribution of x̄ for n = 5, actual distribution versus the distribution assuming normality]
[Figure: sampling distribution of x̄ for n = 20, actual distribution versus the distribution assuming normality]
Definition
When the data are generated from a process
(model) that is known except for a finite
number of unknown parameters, the model is
called a parametric model.
Otherwise, the model is called a nonparametric model.
Statistical techniques that assume a nonparametric model are called nonparametric.
The sign test
A nonparametric test for the central
location of a distribution
We want to test:
H0: median = m0
against
HA: median ≠ m0
(or against a one-sided alternative)
• The assumption will be only that the
distribution of the observations is
continuous.
• Note for symmetric distributions the
mean and median are equal if the mean
exists.
• For non-symmetric distributions, the
median is probably a more appropriate
measure of central location.
The Sign test:
1. The test statistic:
S = the number of observations
that exceed m0
Comment: If H0: median = m0 is true
we would expect 50% of the
observations to be above m0, and 50%
of the observations to be below m0.
[Figure: a density with 50% of the area on each side of median = m0]
If H0 is true then S will have a binomial distribution
with p = 0.50, n = sample size.
If H0 is not true then S will still have a binomial
distribution. However p will not be equal to 0.50.
p = the probability that an observation is greater
than m0.
– If m0 > median then p < 0.50.
– If m0 < median then p > 0.50.
[Figures: densities showing p as the area to the right of m0, for m0 > median and for m0 < median]
Summarizing: If H0 is true then S will have a
binomial distribution with p = 0.50, n = sample size.
For n = 10:

x      p(x)
0      0.0010
1      0.0098
2      0.0439
3      0.1172
4      0.2051
5      0.2461
6      0.2051
7      0.1172
8      0.0439
9      0.0098
10     0.0010

[Figure: bar chart of p(x) for n = 10]
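The binomial probabilities for n = 10 and p = 0.50 can be reproduced with a few lines of Python. This is a sketch using only the standard library; the helper name `binomial_pmf` is an illustrative choice.

```python
from math import comb


def binomial_pmf(x, n, p=0.5):
    """P(S = x) for S ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n = 10
table = {x: round(binomial_pmf(x, n), 4) for x in range(n + 1)}
for x, px in table.items():
    print(f"{x:2d}  {px:.4f}")
```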
The critical and acceptance region:
[Figure: table and bar chart of the binomial probabilities p(x) for n = 10, p = 0.50]
Choose the critical region so that α is close to 0.05 or 0.01.
e.g. If the critical region is {0,1,9,10} then α = .0010 + .0098 +
.0098 + .0010 = .0216
e.g. If the critical region is {0,1,2,8,9,10} then α = .0010 + .0098
+ .0439 + .0439 + .0098 + .0010 = .1094
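The size α of any candidate critical region is just the sum of the binomial probabilities of its points. A small sketch follows; the function name `size_of_region` is hypothetical. Note that the exact values differ slightly in the fourth decimal from sums of the rounded table entries (for example 0.0215 rather than .0216).

```python
from math import comb


def size_of_region(region, n, p=0.5):
    """Type I error probability: P(S in region) when S ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in region)


alpha_narrow = size_of_region({0, 1, 9, 10}, n=10)        # 22/1024
alpha_wide = size_of_region({0, 1, 2, 8, 9, 10}, n=10)    # 112/1024

print(f"{{0,1,9,10}}:     alpha = {alpha_narrow:.4f}")
print(f"{{0,1,2,8,9,10}}: alpha = {alpha_wide:.4f}")
```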
• If one can't determine a fixed critical region to
achieve a fixed significance level α, one then randomizes
the choice of the critical region.
• In the example with n = 10, if the critical region is
{0,1,9,10} then α = .0010 + .0098 + .0098 + .0010 = .0216
• If the values 2 and 8 are added to the critical region the
value of α increases to 0.0216 + 2(.0439) = 0.0216 + 0.0878
= 0.1094
• Note 0.05 = 0.0216 + 0.3235(.0878)
Consider the following critical region
1. Reject H0 if the test statistic is in {0,1,9,10}.
2. If the test statistic is in {2,8} perform a success-failure
experiment with p = P[success] = 0.3235. If the
experiment is a success, reject H0.
3. Otherwise we accept H0.
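The randomized rule above can be sketched as follows. The names `gamma`, `inner`, `boundary` and `randomized_sign_test` are hypothetical. The randomization probability is computed here from the exact binomial probabilities, so it comes out near 0.3244 instead of the 0.3235 obtained from the rounded table values; either way the overall size is 0.05 by construction.

```python
import random
from math import comb

ALPHA = 0.05
N = 10


def pmf(x, n=N):
    """P(S = x) for S ~ Binomial(n, 0.5)."""
    return comb(n, x) / 2**n


inner = {0, 1, 9, 10}   # always reject
boundary = {2, 8}       # reject with probability gamma

alpha_inner = sum(pmf(x) for x in inner)
p_boundary = sum(pmf(x) for x in boundary)
gamma = (ALPHA - alpha_inner) / p_boundary


def randomized_sign_test(s, rng=random):
    """Return True if H0 is rejected for the observed value s of S."""
    if s in inner:
        return True
    if s in boundary:
        # the success-failure experiment with P[success] = gamma
        return rng.random() < gamma
    return False


size = alpha_inner + gamma * p_boundary
print(f"gamma = {gamma:.4f}, overall size = {size:.4f}")
```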
Example
Suppose that we are interested in determining
if a new drug is effective in reducing
cholesterol.
Hence we administer the drug to n = 10
patients with high cholesterol and measure the
reduction.
The data

Case   Initial cholesterol   Final cholesterol   Reduction
1      240                   228                 12
2      237                   222                 15
3      264                   262                  2
4      233                   224                  9
5      236                   240                 -4
6      234                   237                 -3
7      264                   264                  0
8      241                   219                 22
9      261                   252                  9
10     256                   254                  2
Let S = the number of negative reductions = 2
If H0 is true then S will have a binomial distribution
with p = 0.50, n = 10.
We would expect
S to be small if
H0 is false.
Choosing the critical region to be {0, 1, 2}, the
probability of a type I error would be
α = 0.0010 + 0.0098 + 0.0439 = 0.0547
Since S = 2 lies in this region, the null hypothesis
should be rejected.
Conclusion: There is a significant positive reduction
(α = 0.0547) in cholesterol.
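The sign test for this example can be reproduced with a short sketch. The data are from the table; the variable names are illustrative. Following the text, the case with a reduction of 0 stays in the sample (n = 10) and is not counted as negative.

```python
from math import comb

initial = [240, 237, 264, 233, 236, 234, 264, 241, 261, 256]
final = [228, 222, 262, 224, 240, 237, 264, 219, 252, 254]

reductions = [i - f for i, f in zip(initial, final)]
n = len(reductions)

# Test statistic: number of negative reductions
S = sum(1 for r in reductions if r < 0)

# Size of the one-sided critical region {0, 1, ..., S} under p = 0.5
alpha = sum(comb(n, x) for x in range(S + 1)) / 2**n

print(f"S = {S}, P(S <= {S}) = {alpha:.4f}")
```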
If n is large we can use the Normal approximation
to the Binomial.
Namely S has a Binomial distribution with p = ½
and n = sample size.
Hence for large n, S has approximately a Normal
distribution with
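mean
$$ \mu_S = np = \frac{n}{2} $$
and standard deviation
$$ \sigma_S = \sqrt{npq} = \sqrt{n \left(\tfrac{1}{2}\right) \left(\tfrac{1}{2}\right)} = \frac{\sqrt{n}}{2} $$
Hence for large n, use as the test statistic (in
place of S)
$$ z = \frac{S - \mu_S}{\sigma_S} = \frac{S - \frac{n}{2}}{\frac{\sqrt{n}}{2}} $$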
Choose the critical region for z from the
standard Normal distribution,
i.e. reject H0 if z < -z(α/2) or z > z(α/2)
(two-tailed; a one-tailed test can also be set up).
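A minimal sketch of the large-sample version of the sign test, using `statistics.NormalDist` from the Python standard library for the critical value. The function names and the sample values (n = 100, S = 40) are hypothetical and chosen only to illustrate the calculation.

```python
import math
from statistics import NormalDist


def sign_test_z(s, n):
    """Normal-approximation z statistic for the sign test (p = 1/2)."""
    mu_s = n / 2
    sigma_s = math.sqrt(n) / 2
    return (s - mu_s) / sigma_s


def two_tailed_reject(z, alpha=0.05):
    """Reject H0 if z < -z_{alpha/2} or z > z_{alpha/2}."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > z_crit


# Hypothetical sample: 40 of n = 100 observations exceed m0
z = sign_test_z(40, 100)
print(f"z = {z:.2f}, reject H0 at alpha = 0.05: {two_tailed_reject(z)}")
```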