CTSI BERD Research Methods Seminar Series
Statistical Analysis II
Mosuk Chow, Ph.D.
Senior Scientist and Professor
Statistics Department
University Park
November 8, 2016
Basic statistical concepts (from Stat I)
Descriptive statistics (numeric/graphical)
Population distribution vs. sampling distribution
Standard deviation vs. standard error
Estimation of population mean
Confidence interval
Hypothesis testing
P-value
Outline for Stat II
Estimate population proportion
Paired design
1-sample t
Non-paired design
2-sample t
Pooled variance versus non-pooled variance
Estimation of population proportion (p)
Examples:
 Proportion of patients who became infected
 Proportion of patients who are cured
 Proportion of individuals positive on a blood test
 Proportion of adverse drug reactions
 Proportion of premature infants who survive
Sampling Distribution of Sample Proportion


Sampling distribution of sample proportion can be
approximated by normal distribution when sample
size is sufficiently large (central limit theorem)
The standard error of a sample proportion $\hat{p}$ is
estimated by:
$$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
95% Confidence Interval for a Proportion
$$\hat{p} \pm 2 \times SE(\hat{p})$$
The rule of thumb for good normal approximation is
$n \hat{p} \ge 5$ and $n (1 - \hat{p}) \ge 5$
Example

In a study of 200 patients, 90 patients experienced
adverse drug reactions

The estimated proportion who experience an
adverse drug reaction is
$$\hat{p} = \frac{90}{200} = 0.45$$
95% confidence interval for the population
proportion is
$$0.45 \pm 2 \sqrt{\frac{0.45 \times 0.55}{200}} = (0.38, 0.52)$$
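The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `proportion_ci` and its arguments are made up for this example, and the multiplier 2 is the slide's approximation to the normal critical value.

```python
import math

def proportion_ci(successes, n, multiplier=2.0):
    """Normal-approximation CI for a proportion (hypothetical helper).

    multiplier=2.0 follows the slide's "p-hat +/- 2 x SE" rule.
    """
    p_hat = successes / n
    # Rule of thumb from the slide: both expected counts should be at least 5.
    if n * p_hat < 5 or n * (1 - p_hat) < 5:
        raise ValueError("normal approximation may be poor for these counts")
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se, p_hat - multiplier * se, p_hat + multiplier * se

# Adverse drug reaction example: 90 of 200 patients.
p_hat, se, lower, upper = proportion_ci(90, 200)
print(f"p_hat = {p_hat:.2f}, SE = {se:.4f}, 95% CI = ({lower:.2f}, {upper:.2f})")
# prints: p_hat = 0.45, SE = 0.0352, 95% CI = (0.38, 0.52)
```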
Paired design

Self-pairing:
Measurements are taken at two distinct points in
time from a single subject (e.g. Before vs. After)
Matched pairs (e.g., twins, eyes, subjects matched
on important characteristics such as age and
gender)
Why pairing?



Control extraneous noise
Control confounding factors that affect the
comparison
Make comparison more precise
Example: Blood Pressure and Oral
Contraceptive Use (n=10 women)
Paired samples
Participant                  BP Before OC    BP After OC    After-Before
                             (1st sample)    (2nd sample)
1                            126             132             6
2                            105             109             4
3                            104             102            -2
4                            115             117             2
…
Sample Mean:                 115.6           120.4
Sample Standard Deviation:   Sb=11.3         Sa=13.1
Example (cont.)
Scientific questions:
 What is the mean change in blood pressure after
oral contraceptives (OC) use in a population of
women who use OC?
 Estimate the mean change by a confidence
interval approach
 Is there any change in mean blood pressure after
oral contraceptives use in a population of women
who use OC?
 Hypothesis testing
Inference on mean change


Due to the design of the study, we can
reduce the BP information on two samples
(women’s BP prior to OC use and the same
subject’s BP after OC use) into one piece of
information: information on the differences in
BP between the time points for the same
subject.
Perform the one sample inference on the
difference for the relevant research question.
Inference on mean change

Reduce the BP information on two samples
(women prior to OC use, women after OC use)
into one piece of information: information on the
differences in BP between the time points.

The sample average of the differences: $\bar{x}_{diff}$

Sample standard deviation of the differences: $S_d$
95% confidence interval for mean change in BP:
$$\bar{x}_{diff} \pm t_{n-1,\,0.975} \times \frac{S_d}{\sqrt{n}}$$
where n is the number of pairs, $t_{n-1,\,0.975}$ is the critical
value from t distribution with df=n-1.

The sample average of the differences is 4.8, which can
also be obtained by
$$\bar{x}_{diff} = \bar{x}_{after} - \bar{x}_{before}$$
(4.8 = 120.4 – 115.6)

The sample standard deviation of the differences is
$$S_d = \sqrt{\frac{\sum_{i=1}^{n} (x_{diff,i} - \bar{x}_{diff})^2}{n - 1}} = 4.6$$
Example: Blood Pressure and Oral
Contraceptive Use (n=10 women)
Participant     BP Before OC                  BP After OC                  After-Before
1               126                           132                           6
2               105                           109                           4
3               104                           102                          -2
4               115                           117                           2
…
Sample Mean:    $\bar{x}_{before}$ = 115.6    $\bar{x}_{after}$ = 120.4    $\bar{x}_{diff}$ = 4.8
SD:             Sb=11.3                       Sa=13.1                       Sd=4.6
95% CI of mean change in BP
$$4.8 \pm t_{9,\,0.975} \times \frac{S_d}{\sqrt{n}}$$
$$4.8 \pm 2.26 \times \left(\frac{4.6}{\sqrt{10}}\right)$$
$$4.8 \pm 2.26 \times 1.45$$
1.52 to 8.08
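As a rough check, the same interval can be reproduced from the summary statistics. The helper name `paired_t_ci` is hypothetical, and the critical value 2.26 is passed in by hand from the slide rather than computed. Because the code keeps the standard error unrounded (about 1.4546 instead of 1.45), its endpoints come out near 1.51 and 8.09, slightly different from the slide's 1.52 to 8.08.

```python
import math

def paired_t_ci(mean_diff, sd_diff, n_pairs, t_crit):
    """CI for the mean of paired differences (hypothetical helper).

    t_crit is the t critical value with n_pairs - 1 degrees of freedom,
    supplied by the caller (2.26 for df = 9 on the slide).
    """
    se = sd_diff / math.sqrt(n_pairs)   # standard error of the mean difference
    margin = t_crit * se
    return se, mean_diff - margin, mean_diff + margin

# Blood pressure example: mean difference 4.8, Sd = 4.6, n = 10 pairs.
se, lower, upper = paired_t_ci(mean_diff=4.8, sd_diff=4.6, n_pairs=10, t_crit=2.26)
print(f"SE = {se:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
# prints: SE = 1.45, 95% CI = (1.51, 8.09)
```

Either way, the interval excludes 0.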
Notes


The number 0 is NOT in the confidence interval
(1.52, 8.08)
Because 0 is not in the interval, this suggests there
is a non-zero change in BP over time.

The BP change could be due to factors other than
oral contraceptives.

A control group of comparable women who were
not taking oral contraceptives but were taking a
placebo would strengthen this study.
Comparison of Two Independent Samples


A Low Carbohydrate as Compared with a
Low Fat Diet in Severe Obesity1
 132 severely obese participants
randomized to one of two diet groups
 Participants followed for a six-month
period
At the end of the study period
 Participants on the low carbohydrate diet
lost more weight than those on a low fat
diet.
1 Samaha, F., et al. A Low-Carbohydrate as Compared with a Low-Fat
Diet in Severe Obesity, New England Journal of Medicine 348;21
Comparison of Two Independent Samples
Diet Group                                   Low Fat    Low Carb
Number of Subjects                           68         64
Mean Weight Change (kg),                     -1.8       -5.7
Post-diet less pre-diet
Standard Deviation of Weight Change (kg)     3.9        8.6
Is weight loss associated with diet
type?
Comparison of Two Independent Samples




In statistical terms, is there a difference in the
average weight loss for the participants on the low
fat diet as compared to participants on the low
carbohydrate diet?
Although there are paired pre/post measurements
on each participant, the comparison of interest is
not paired.
For each participant we compute a change in
weight (after-diet weight minus before-diet weight).
However, we are comparing the changes in weight
between two independent diet groups.
Comparison of Two Independent Samples




We have two samples: $\{x_{11}, x_{12}, x_{13}, \ldots, x_{1n_1}\}$
and $\{x_{21}, x_{22}, x_{23}, \ldots, x_{2n_2}\}$ drawn from
populations with means $\mu_1$ and $\mu_2$ and
variances $\sigma_1^2$ and $\sigma_2^2$, respectively.
The two samples are independent; there is no
pairing of observations.
We would like to estimate the difference of the
population means, $\mu_2 - \mu_1$.
Using the confidence interval, we can decide
whether the two means are different.
Comparison of Two Independent Samples
We know our best estimate for the
mean (of a single population) is the
sample mean, $\bar{x}$.
It would seem sensible to estimate $\mu_1$
with $\bar{x}_1$, and $\mu_2$ with $\bar{x}_2$
and $\mu_2 - \mu_1$ with $\bar{x}_2 - \bar{x}_1$.
Sampling Distribution of the Difference
in Sample Means

Since we have largish samples (both
greater than 30) we know the sampling
distributions of the sample means in
both groups are approximately normal

It turns out that the difference of two independent
quantities, each (approximately) normally
distributed, is also (approximately) normally
distributed.
Sampling Distribution of the Difference
in Sample Means

So, the good news is . . .

The sampling distribution of the difference of
two sample means, each based on large
samples, approximates a normal distribution.

This sampling distribution is centered at the
true mean difference, µ2 - µ1.
Confidence Interval for ($\mu_2 - \mu_1$)
We can construct a confidence interval for
$\mu_2 - \mu_1$ using the (pivotal) quantity
$$T = \frac{(\bar{X}_2 - \bar{X}_1) - (\mu_2 - \mu_1)}{\text{Standard Error}(\bar{X}_2 - \bar{X}_1)}$$
Two Independent (Unpaired) Samples

The standard error of the difference
for two independent samples is
calculated differently than we did for
paired designs.

The formula for the standard error of
the difference depends on the
sample sizes in both groups and
standard deviations in both groups.
Comparison of Two Independent
Samples

The formula is
$$\sigma_{\bar{x}_2 - \bar{x}_1} = \sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}$$
If we follow the same reasoning we did for the
one sample case, we could substitute $s_1$ and $s_2$
for $\sigma_1$ and $\sigma_2$, respectively, to give an estimate
of
$$s_{\bar{x}_2 - \bar{x}_1} = \sqrt{s_1^2 / n_1 + s_2^2 / n_2}$$
Comparison of Two Independent
Samples

The distribution of
$$T_s = \frac{(\bar{X}_2 - \bar{X}_1) - (\mu_2 - \mu_1)}{\sqrt{S_1^2 / n_1 + S_2^2 / n_2}}$$
can be approximated by the t distribution where
the degrees of freedom are calculated as
$$d = \frac{(s_1^2 / n_1 + s_2^2 / n_2)^2}{(s_1^2 / n_1)^2 / (n_1 - 1) + (s_2^2 / n_2)^2 / (n_2 - 1)}$$

You may see this referred to as Welch’s or
Satterthwaite’s approximation.
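To get a feel for the degrees-of-freedom formula, here is a small sketch that plugs in the summary statistics from the diet table (low fat: s = 3.9, n = 68; low carb: s = 8.6, n = 64). The function name `welch_se_and_df` is made up for this illustration.

```python
import math

def welch_se_and_df(s1, n1, s2, n2):
    """Unpooled standard error and approximate df (hypothetical helper)."""
    v1 = s1 ** 2 / n1   # estimated variance of the first sample mean
    v2 = s2 ** 2 / n2   # estimated variance of the second sample mean
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return se, df

# Group 1 = low fat, group 2 = low carb (values from the diet table).
se, df = welch_se_and_df(s1=3.9, n1=68, s2=8.6, n2=64)
print(f"unpooled SE = {se:.2f}, approximate df = {df:.1f}")
```

With these inputs the approximate degrees of freedom come out at roughly 87, well below n1 + n2 - 2 = 130, because the two sample standard deviations are quite different.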
Confidence Interval for ($\mu_2 - \mu_1$)

We can construct a confidence interval for $\mu_2 - \mu_1$
using the (pivotal) quantity
$$\frac{(\bar{X}_2 - \bar{X}_1) - (\mu_2 - \mu_1)}{\sqrt{S_1^2 / n_1 + S_2^2 / n_2}}$$
An approximate $(1 - \alpha) \times 100\%$ confidence interval
is given by
$$\bar{X}_2 - \bar{X}_1 \pm t_{d,\,1-\alpha/2} \sqrt{S_1^2 / n_1 + S_2^2 / n_2}$$
Comparison of Two Independent Samples with
equal variance ($\sigma_1^2 = \sigma_2^2 = \sigma^2$)
If $\sigma_1^2$ and $\sigma_2^2$ are unknown, but equal to a
common value $\sigma^2$, we could “pool” our samples
to obtain an estimate of $\sigma^2$ to estimate the
standard error of the difference in sample
means:
$$\sigma_{\bar{x}_2 - \bar{x}_1} = \sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}$$
The previous estimate we were working with
$$s_{\bar{x}_2 - \bar{x}_1} = \sqrt{s_1^2 / n_1 + s_2^2 / n_2}$$
is an unpooled estimate because we obtained
estimates of $\sigma_1^2$ and $\sigma_2^2$ separately.
Comparison of Two Independent Samples
(cont.)
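A pooled estimate of $\sigma^2$ is
$$s_p^2 = \frac{\sum_{i=1}^{n_1} (x_{1i} - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_{2j} - \bar{x}_2)^2}{n_1 - 1 + n_2 - 1} = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}.$$
When $\sigma_1^2 = \sigma_2^2 = \sigma^2$, we have
$$\sigma_{\bar{x}_2 - \bar{x}_1} = \sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2} = \sqrt{\sigma^2 / n_1 + \sigma^2 / n_2}$$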
Comparison of Two Independent
Samples (cont.)

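If we substitute the pooled estimator of $\sigma^2$ into
$$\frac{(\bar{X}_2 - \bar{X}_1) - (\mu_2 - \mu_1)}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}} = \frac{(\bar{X}_2 - \bar{X}_1) - (\mu_2 - \mu_1)}{\sqrt{\sigma^2 / n_1 + \sigma^2 / n_2}},$$
we have
$$T_P = \frac{(\bar{X}_2 - \bar{X}_1) - (\mu_2 - \mu_1)}{\sqrt{S_P^2 / n_1 + S_P^2 / n_2}} = \frac{(\bar{X}_2 - \bar{X}_1) - (\mu_2 - \mu_1)}{\sqrt{S_P^2 (1 / n_1 + 1 / n_2)}}$$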
Comparison of Two Independent
Samples (cont.)

$T_P$ follows a t distribution with $n_1 + n_2 - 2$ degrees
of freedom.
A $(1 - \alpha) \times 100\%$ confidence interval is given by
$$(\bar{X}_2 - \bar{X}_1) \pm t_{n_1 + n_2 - 2,\,1-\alpha/2} \sqrt{S_P^2 (1 / n_1 + 1 / n_2)}$$
Choosing when to Pool



One rule of thumb is to use the pooled
variance as long as the ratio of the sample
standard deviations (larger s/smaller s) is no more than 2,
but this cutoff is somewhat arbitrary.
Usually the results are not that different.
If you are unsure of which one to use, go with
the separate variance as that is more
conservative.
Diet and Weight Loss Example

A 95% confidence interval is
$$(\bar{X}_2 - \bar{X}_1) \pm t_{n_1 + n_2 - 2,\,1-\alpha/2} \sqrt{S_P^2 (1 / n_1 + 1 / n_2)}$$
$$= (-5.7 - (-1.8)) \pm t_{68 + 64 - 2,\,0.975} \times 1.15$$
$$= -3.9 \pm 1.98 \times 1.15$$
$$= -3.9 \pm 2.277$$
$$= (-6.2, -1.6) \text{ kg}$$
Back to Blood Pressure and Oral
Contraceptive Use (n=10 women)
Participant     BP Before OC                  BP After OC                  After-Before
1               126                           132                           6
2               105                           109                           4
3               104                           102                          -2
4               115                           117                           2
…
Sample Mean:    $\bar{x}_{before}$ = 115.6    $\bar{x}_{after}$ = 120.4    $\bar{x}_{diff}$ = 4.8
SD:             Sb=11.3                       Sa=13.1                       Sd=4.6

If we do not realize that we should
use the paired t but use the two
sample t procedure to obtain the CI,
will the interval be wider or
narrower?

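Answer:
Paired t Confidence Interval:
$$(\bar{X}_2 - \bar{X}_1) \pm t_{n-1,\,1-\alpha/2} \sqrt{S_d^2 / n}$$
2-sample t Confidence Interval:
$$(\bar{X}_2 - \bar{X}_1) \pm t_{n_1 + n_2 - 2,\,1-\alpha/2} \sqrt{S_P^2 (1 / n_1 + 1 / n_2)}$$
For these data the 2-sample interval is wider: Sd=4.6 is much
smaller than Sb=11.3 and Sa=13.1, because pairing removes
the between-subject variability.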
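A quick numerical comparison of the two margins of error, using only the summary statistics from the blood pressure table. This is a sketch under stated assumptions: 2.26 is the slide's critical value for df = 9, while 2.101 is the standard t-table value for df = 18, which does not appear on the slides.

```python
import math

n = 10                                        # number of women (pairs)
s_before, s_after, s_diff = 11.3, 13.1, 4.6   # Sb, Sa, Sd from the table
mean_diff = 4.8

# Paired t: standard error is based on the SD of the differences.
paired_margin = 2.26 * s_diff / math.sqrt(n)

# Two-sample pooled t (wrongly treating the samples as independent).
sp2 = ((n - 1) * s_before ** 2 + (n - 1) * s_after ** 2) / (2 * n - 2)
two_sample_margin = 2.101 * math.sqrt(sp2 * (1 / n + 1 / n))

print(f"paired:     {mean_diff} +/- {paired_margin:.2f}")
print(f"two-sample: {mean_diff} +/- {two_sample_margin:.2f}")
```

The two-sample margin comes out at roughly 11.5 against roughly 3.3 for the paired analysis, so the two-sample interval would include 0 even though the paired interval does not.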


It is very important to know the
design and use the appropriate
statistical technique to analyze the
data.
If we have a control group for the OC
example, then we will use two
sample t to compare the mean
change in blood pressure in the two
groups.
THE END
Want to learn more statistics
or have consultations, contact:
http://ctsi.psu.edu/ctsiprograms/biostatisticsepidemiologyresearch-design/