Ph.D. COURSE IN BIOSTATISTICS
DAY 3
Example: Serum triglyceride measurements in cord blood from
282 babies (Data file: triglyceride.dta)
[Figure: histogram of triglyceride (frequency, 0 to 80) over the range 0 to 2, with m-2s, m-s, m, m+s and m+2s marked on the axis]
m = sample mean
s = sample standard deviation
1
The distribution of serum triglyceride does not look like a normal
distribution. The distribution is skew to the right or positively skew.
A Q-Q plot clearly shows the lack of fit.
[Figure: Q-Q plot of triglyceride against Inverse Normal]
How should we summarize these data?
How should we compare two groups of measurements
for which the variation is positively skew?
2
Solution 1: A “non-parametric” approach
Mean and standard deviations are mainly useful for symmetric
distributions. For a skew distribution the “upward spread” differs
from the “downward” spread, and a single number summary of
the spread is therefore not sufficient.
Summary statistics useful for skew distributions:
median, 25, 75 percentiles or e.g. median, 5, 95 percentiles
The skewness can be identified from these statistics.
tabstat trigly , stat(p5 med p95)
    variable |        p5       p50       p95
-------------+------------------------------
      trigly |       .27       .46       .88
--------------------------------------------
Note:
If a distribution is symmetric then mean = median.
If a distribution is skew to the right then mean > median.
If a distribution is skew to the left then mean < median.
3
Solution 2: Transformation of data
A transformation of the data may lead to data that conform with a
normal distribution. Statistical methods for data from a normal
distribution may then be applied to the transformed data.
In principle many different transformations could be considered,
but interpretation of results based on transformed data may be
complicated, so in practice only a few simple transformations, e.g. the
power transformations and in particular logarithm transformations,
are used. Here we look at logarithm transformations:
Data, original scale: $x_1, x_2, \ldots, x_n$
Data log-transformed: $z_1, z_2, \ldots, z_n$ with $z_i = \log(x_i)$
Logarithms satisfy
$\log(a \cdot b) = \log(a) + \log(b)$
$\log(a / b) = \log(a) - \log(b)$
Natural logarithms are usually preferred for calculations.
Figures with a logarithmic scale usually apply a logarithm to base 10.
Basic facts about logarithms: see
www.biostat.au.dk/teaching/praegrad/opgaver/biostat/logaritmefunktioner.pdf
4
On a log-scale pairs of points with the same ratio have the same
distance, e.g. log(10) - log(5) = log(100) – log(50). Log-transformations
therefore remove (or reduce) positive skewness.
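The equal-ratio property can be checked numerically. The following is a minimal sketch in Python (the course itself uses Stata; the variable names are chosen here for illustration only):

```python
import math

# Pairs with the same ratio (here 2) are equally far apart on a log-scale.
d_small = math.log(10) - math.log(5)
d_large = math.log(100) - math.log(50)
print(round(d_small, 4), round(d_large, 4))  # both equal log(2)

# On the original scale the same two pairs are 5 and 50 units apart.
print(10 - 5, 100 - 50)
```

The large values in the right tail are therefore pulled closer together on the log-scale, which is why the skewness is reduced.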
histogram trigly , ///
    xscale(log) start(0.1) width(0.1) freq
[Figure: histogram of triglyceride (frequency) on a log-scale with original units]
The figure suggests that on a log-scale the variation in the triglyceride
data may be described by a normal distribution.
5
To obtain a more satisfactory histogram the log-transformed data
must be generated explicitly (log is the natural logarithm and can also be written as ln)
gene logtri=log(trigly)
histogram logtri , freq start(-1.86) normal
qnorm logtri , mco(blue)
[Figure: histogram of log(triglyceride) with fitted normal curve, and Q-Q plot of log(triglyceride) against Inverse Normal; the axis is shown both on a log-scale with natural log-units and in original units (0.14, 0.38, 1.00)]
On a log-scale these data look like a sample from a normal distribution,
but can we interpret results derived from log-transformed data?
6
ANALYSIS OF LOG-TRANSFORMED DATA.
INTERPRETATION OF RESULTS
On a log-scale the triglyceride data can be described by a normal
distribution, so on a log-scale the data are usefully summarized by
a mean and a standard deviation,
but what do these numbers mean?
Consider a random variable X and introduce Z = log(X).
The logarithm is here the natural logarithm, sometimes also called ln.
Percentiles of the distribution of X are transformed to the analogous
percentiles of the distribution of Z. In particular
$\log(\mathrm{median}(X)) = \mathrm{median}(\log(X)) = \mathrm{median}(Z)$
The exponential function exp is the inverse of the natural logarithm,
so percentiles are easily “back-transformed”, in particular
$\exp(\mathrm{median}(Z)) = \mathrm{median}(X)$
7
Similar results are not true for mean and standard deviation:
$\exp(\mathrm{mean}(Z)) = \exp(\mu_Z) \neq \mathrm{mean}(X) = \mu_X$
$\exp(\mathrm{sd}(Z)) = \exp(\sigma_Z) \neq \mathrm{sd}(X) = \sigma_X$
If the transformed variable Z follows a symmetric distribution we
have
$\mathrm{mean}(Z) = \mathrm{median}(Z)$
so, when “back transformed”, the sample mean of the log-transformed
observations becomes an estimate of the median on the original scale,
i.e.
$\exp(\bar{z})$ is an estimate of median(X).
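A small numerical illustration of this back-transformation, as a sketch in Python. The five data values are hypothetical (they are not taken from triglyceride.dta) and were chosen so that the logs are exactly symmetric:

```python
import math
import statistics

# Hypothetical positively skew sample, invented for illustration.
x = [1.0, 2.0, 4.0, 8.0, 16.0]
z = [math.log(v) for v in x]                   # log-transformed data

back_median = math.exp(statistics.median(z))   # exp(median(Z))
back_mean = math.exp(statistics.mean(z))       # exp(z-bar)

print("median(x)      :", statistics.median(x))
print("exp(median(z)) :", round(back_median, 3))
print("exp(mean(z))   :", round(back_mean, 3))
print("mean(x)        :", statistics.mean(x))
```

In this constructed sample both back-transformed values agree with the median of x, while the mean of x is larger, as expected for a distribution that is skew to the right.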
If the transformed variable follows a normal distribution further results
are available:
$\mu_X = \exp(\mu_Z) \cdot \sqrt{\exp(\sigma_Z^2)}$
$\sigma_X = \mu_X \cdot \sqrt{\exp(\sigma_Z^2) - 1}$
8
If the transformed variable Z follows a normal distribution, the
standard deviation of Z can be used to estimate c.v.(X), the
coefficient of variation of X. The relations above show that
$\mathrm{c.v.}(X) = \sigma_X / \mu_X = \sqrt{\exp(\sigma_Z^2) - 1} \approx \sigma_Z$
The approximation is accurate if the variation is small, e.g. $\sigma_Z < 0.3$.
The coefficient of variation is often used to describe the variation of
positively skew distributions.
The result above shows that - unless the variation is large - the standard
deviation of the log-transformed data, $s_Z$, can be used as an estimate
of the coefficient of variation of the observations on the original scale.
If the variation is large the coefficient of variation of X is estimated by
$\sqrt{\exp(s_Z^2) - 1}$
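How good the approximation is can be seen by evaluating the exact expression for a few values of the standard deviation on the log-scale. This is a sketch in Python; the function name `cv_from_log_sd` is introduced here for illustration:

```python
import math

def cv_from_log_sd(s_z):
    """c.v. of X when Z = log(X) is normal with standard deviation s_z."""
    return math.sqrt(math.exp(s_z ** 2) - 1.0)

# Compare the exact value with the approximation c.v.(X) ~ s_z.
for s_z in (0.1, 0.3, 0.5, 1.0):
    print(f"s_z = {s_z:.1f}   c.v. = {cv_from_log_sd(s_z):.3f}")
```

For small values the two numbers are nearly identical, and the gap widens as the variation on the log-scale grows.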
9
Confidence intervals
Assume that the log-transformed data follow a normal distribution.
Confidence intervals for mean(Z) are ”back-transformed” to confidence
intervals for median(X).
Let $\hat{\mu}_L$ and $\hat{\mu}_U$ denote the lower and upper 95% confidence limits for
$\mu_Z = \mathrm{mean}(Z)$, then
$\hat{\mu}_L = \bar{z} - t_{0.975} \, s_z / \sqrt{n}$
$\hat{\mu}_U = \bar{z} + t_{0.975} \, s_z / \sqrt{n}$
and $\exp(\hat{\mu}_L)$ and $\exp(\hat{\mu}_U)$ are 95% confidence limits of the median(X).
Similarly, by inserting the confidence limits for $\sigma_Z$ in the relation
$\mathrm{c.v.}(X) = \sqrt{\exp(\sigma_Z^2) - 1}$
confidence limits for the coefficient of variation of X are obtained.
10
Two independent samples
Consider two independent samples $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$.
Assume that the sample distributions are positively skew, but that the
log-transformed samples $v_1, \ldots, v_n$ and $w_1, \ldots, w_m$ can be described by
normal distributions.
The hypothesis $\mu_V = \mu_W$ of identical means on a log-scale corresponds
to the hypothesis median(X) = median(Y).
The ”back-transformed” difference $\exp(\bar{v} - \bar{w})$ is an estimate of the
ratio median(X) / median(Y).
Similarly, the confidence interval for $\mu_V - \mu_W$ transforms back to a
confidence interval for the ratio of the medians.
The F-test of the hypothesis $\sigma_V = \sigma_W$ is equivalent to testing the
hypothesis c.v.(X) = c.v.(Y).
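The back-transformation for two independent samples can be sketched as follows in Python. The data are hypothetical, the function name `median_ratio_ci` is introduced here, and the interval assumes equal variances on the log-scale (a pooled standard deviation). The t quantile is passed in by the caller because the Python standard library has no t distribution:

```python
import math
import statistics

def median_ratio_ci(x, y, t_quantile):
    """Estimate median(X)/median(Y) with confidence limits via logs.

    t_quantile: 0.975 quantile of the t distribution with
    len(x) + len(y) - 2 degrees of freedom (supplied by the caller).
    """
    v = [math.log(a) for a in x]
    w = [math.log(b) for b in y]
    n, m = len(v), len(w)
    diff = statistics.mean(v) - statistics.mean(w)
    # Pooled variance on the log-scale (assumes equal variances).
    pooled_var = ((n - 1) * statistics.variance(v)
                  + (m - 1) * statistics.variance(w)) / (n + m - 2)
    se = math.sqrt(pooled_var * (1.0 / n + 1.0 / m))
    lower, upper = diff - t_quantile * se, diff + t_quantile * se
    # Back-transform the difference and its limits to a ratio.
    return math.exp(diff), math.exp(lower), math.exp(upper)

# Hypothetical data, invented for illustration only.
x = [0.8, 1.1, 1.5, 2.4, 3.9]
y = [0.5, 0.7, 0.9, 1.3, 2.2]
ratio, lo, hi = median_ratio_ci(x, y, t_quantile=2.306)  # 8 degrees of freedom
print(round(ratio, 2), round(lo, 2), round(hi, 2))
```

A ratio of 1 corresponds to a difference of 0 on the log-scale, so an interval that contains 1 is compatible with identical medians.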
11
Example – triglyceride data
Summary statistics for skew data
Solution 1 (a non-parametric approach)
tabstat trigly , stat(p5 p25 p50 p75 p95)
    variable |        p5       p25       p50       p75       p95
-------------+--------------------------------------------------
      trigly |       .27       .36       .46        .6       .88
----------------------------------------------------------------
The population median is estimated by the sample median of 0.46.
For later comparison we also compute the sample c.v. of X:
tabstat trigly , stat(mean sd cv)
    variable |      mean        sd        cv
-------------+------------------------------
      trigly |   .507153  .2184926  .4308218
--------------------------------------------
The population c.v. is estimated by the sample c.v. of 43%.
12
Example – triglyceride data
Summary statistics for skew data
Solution 2 (log-transformation of data)
tabstat logtri , stat(p5 p25 p50 p75 p95)
    variable |        p5       p25       p50       p75       p95
-------------+--------------------------------------------------
      logtri | -1.309333 -1.021651 -.7765288 -.5108256 -.1278334
----------------------------------------------------------------
Percentiles transform back to percentiles, e.g. exp(-1.309333) = 0.27
and exp(-0.7765288) = 0.46
Confidence interval for the median of X
tabstat logtri , stat(mean semean sd var)
    variable |      mean  se(mean)        sd  variance
-------------+----------------------------------------
      logtri | -.7570465  .0231264  .3876687  .1502871
------------------------------------------------------
We compute a confidence interval for mean(Z) and transform the limits
back to the original scale.
13
lower limit: $\hat{\mu}_L = -0.757 - 1.968 \cdot 0.0231 = -0.8026$
upper limit: $\hat{\mu}_U = -0.757 + 1.968 \cdot 0.0231 = -0.7115$
95% confidence limits for median(X) therefore become exp(-0.8026) =
0.45 and exp(-0.7115) = 0.49.
We can also compute a 90% prediction interval based on a normal
distribution for Z and transform back to the original scale.
$\exp(-0.757 - 1.650 \cdot 0.388) = \exp(-1.397) = 0.25$
$\exp(-0.757 + 1.650 \cdot 0.388) = \exp(-0.117) = 0.89$
As expected these values are very similar to the 5 and 95 percentiles.
Estimation of c.v.(X):
The standard deviation is here a rather crude estimate of c.v.(X),
instead we use
$\sqrt{\exp(0.1502871) - 1} = 0.403$
The estimate is very similar to the estimate computed from the sample
mean and the sample standard deviation of the original data.
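The calculations on this page can be retraced from the summary statistics alone. The sketch below is in Python rather than Stata; the numbers are copied from the tabstat output for logtri, and the two quantiles (1.968 and 1.650) are the ones used in the text:

```python
import math

# Summary statistics of logtri, copied from the tabstat output.
mean_z = -0.7570465
se_z = 0.0231264
sd_z = 0.3876687
var_z = 0.1502871

t_975 = 1.968   # t quantile used in the text for the 95% confidence interval
q_95 = 1.650    # quantile used in the text for the 90% prediction interval

# Confidence limits for mean(Z), back-transformed to limits for median(X).
ci_median = (math.exp(mean_z - t_975 * se_z), math.exp(mean_z + t_975 * se_z))
# 90% prediction interval on the log-scale, back-transformed.
pred_90 = (math.exp(mean_z - q_95 * sd_z), math.exp(mean_z + q_95 * sd_z))
# Coefficient of variation of X from the variance of Z.
cv_x = math.sqrt(math.exp(var_z) - 1.0)

print("median(X) estimate :", round(math.exp(mean_z), 2))
print("95% CI for median  :", [round(v, 2) for v in ci_median])
print("90% prediction int.:", [round(v, 2) for v in pred_90])
print("c.v.(X)            :", round(cv_x, 3))
```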
14
Why use transformations of data?
Why ”change the data to fit the method”? – instead of ”change the
method to fit the data”?
For simple problems a simple solution is OK:
Use sample percentiles as descriptive statistics and non-parametric
tests for statistical inference.
Transformation of data
For more complicated problems the methods available assume that
data are from a normal population. The choice is then between
• A simple, but unsatisfactory analysis
• A satisfactory analysis based on transformed data.
The methods based on a normal distribution are appealing because
they are more powerful and cover a wider range of problems.
Disadvantages: Interpretation and presentation of results from analyses
of transformed data is not always straightforward.
15
THE TWO-SAMPLE PROBLEM WITH PAIRED DATA
Example: Measurement of body temperature
The Stata file temp.dta contains measurements of body temperature
in 96 patients using Rectal Hg thermometer, Oral Hg thermometer,
and an electronic device (Craft-thermometer).
list id hgrectal hgoral craft in 1/5
     +--------------------------------+
     | id   hgrectal   hgoral   craft |
     |--------------------------------|
  1. |  1       38.1     37.1    37.4 |
  2. |  2       38.1     37.8    37.8 |
  3. |  3       38.9     38.2    38.2 |
  4. |  4       38.4     38.2    38.1 |
  5. |  5       40.3       40    40.1 |
     +--------------------------------+
The data were collected as part of a study to assess the validity of the
electronic thermometer.
Here we consider the two series of Hg thermometer readings.
16
A plot of 94 pairs of readings of oral and rectal Hg temperature (two
patients had a missing rectal reading)
[Figure: "Oral Hg versus Rectal Hg": scatter plot of hgoral against hgrectal, both axes 35 to 41, with the identity line]
17
Questions:
• What is the difference between oral and rectal body temperature?
• How much do the differences vary between patients?
The situation resembles the two-sample problem discussed in lecture 2,
but here the same patients are measured with both methods. We no
longer have two independent samples, but two paired samples.
The “obvious” analysis
1. For each patient compute the difference between the two readings
2. Provided the assumptions are satisfied the differences are modeled as
a sample from a normal distribution: compute confidence intervals
for the expected difference and the standard deviation and perhaps a
one-sample t-test of the hypothesis that the expected difference is 0.
If the normal distribution does not apply, use a non-parametric approach
instead (Lecture 5).
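Steps 1 and 2 can be sketched in a few lines of Python. For illustration the sketch uses only the five patients listed from temp.dta (the course analysis uses all 94 complete pairs), and the t quantile 2.776 for 4 degrees of freedom is entered by hand because the standard library has no t distribution:

```python
import math
import statistics

# The five listed patients only; not the full data set.
hgrectal = [38.1, 38.1, 38.9, 38.4, 40.3]
hgoral = [37.1, 37.8, 38.2, 38.2, 40.0]

# Step 1: one difference per patient.
d = [r - o for r, o in zip(hgrectal, hgoral)]

# Step 2: mean, standard deviation, confidence interval and t statistic.
n = len(d)
mean_d = statistics.mean(d)
sd_d = statistics.stdev(d)
se_d = sd_d / math.sqrt(n)
t_975 = 2.776                     # 0.975 quantile of t with n - 1 = 4 df
ci = (mean_d - t_975 * se_d, mean_d + t_975 * se_d)
t_stat = mean_d / se_d            # hypothesis: expected difference is 0

print("mean difference:", round(mean_d, 3))
print("sd of differences:", round(sd_d, 3))
print("95% CI:", [round(v, 3) for v in ci])
print("t statistic:", round(t_stat, 3))
```

With only five pairs the interval is wide; the point of the sketch is the sequence of calculations, not the numerical result.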
18
Note:
• The obvious approach was used for the analysis of the data on
change in diastolic blood pressure (Lecture 2).
• The data may be used to identify systematic differences between
the methods (accuracy), but replicate measurements are needed
to describe the random measurement error (precision).
What kind of model considerations leads to this analysis?
Consider a given patient and introduce the following notation
X = A reading of the rectal Hg temperature
Y = A reading of the oral Hg temperature
and let
D = X – Y = the difference between the two readings.
The temperature readings X and Y can be considered as a sum of
two components:
X=A+E
Y=B+F
19
Here
A = the “true” rectal temperature, i.e. the hypothetical value we
would obtain as the average reading if the “experiment” was
repeated a large number of times
B = the “true” oral temperature
E = the random measurement error associated with a
rectal temperature reading using a Hg thermometer
F = the random measurement error associated with an
oral temperature reading using a Hg thermometer
The difference D is therefore
D = (A – B) + (E – F)
The first term, A – B, represents the difference between true rectal and
oral temperature for the given patient on a given day.
The second term, E – F, is the combined effect of measurement errors.
20
In a population of patients the difference of true temperatures, A – B,
may vary between individuals and also between days for a given
individual. Let $\delta$ denote the population mean, then
A – B = $\delta$ + C,
where C represents inter- and intra-individual variation in the difference
in true temperature.
Returning to the difference between the readings we therefore have the
following decomposition
D = X – Y = $\delta$ + (C + E – F) = $\delta$ + U,
where $\delta$ is the population mean and U describes random variation,
which has three components: inter- and intra-individual variation in true differences,
measurement error of rectal reading, measurement error of oral reading.
The statistical model underlying the ”obvious” approach describes
the data as a random sample from a normal distribution with
mean $\delta$ and variance $\sigma^2$.
21
Note:
The model describes the differences and does not specify anything
about the variation of temperature (rectal or oral) in the population of
patients (A or B is not assumed to vary according to a normal distribution).
Checking the validity of the assumptions:
• Independence: Reasonable here, since each patient contributes with
one difference only.
• Same distribution:
• The expected difference between the two readings does not
depend on the overall level of the temperature. Is this a reasonable
assumption?
• The random variation in the difference has the same size for all
temperatures. Is this a reasonable assumption?
• Normal distribution: Is this a reasonable assumption?
22
Important plots that should always be made:
a. A plot of Y against X
b. Difference against average (or against sum)
c. Histogram of differences
d. Q-Q plot of differences
[Figure: four diagnostic panels. a: "Oral Hg versus Rectal Hg" (hgoral against hgrectal); b: "difference versus average" (difro against hgmean); c: "Rectal Hg - Oral Hg" (histogram of difro, density); d: "Q-Q plot of difference" (difro against Inverse Normal)]
23
The plots were made using a Stata do-file with the following commands
scatter hgoral hgrectal , ///
    title("Oral Hg versus Rectal Hg") ///
    ylabel(35(1)41) xlabel(35(1)41) ///
    saving(g1, replace) mco(gs3)
gene difro = hgrectal-hgoral
gene hgmean=(hgrectal+hgoral)/2
scatter difro hgmean , ///
    title("difference versus average") ///
    ylabel(-0.8(0.4)1.6) xlabel(37(1)41) ///
    saving(g2,replace) mco(gs3)
histogram difro , ///
    title("Rectal Hg - Oral Hg") ///
    start(-0.8) width(0.2) ///
    saving(g3,replace)
qnorm difro , title("Q-Q plot of difference") ///
    saving(g4,replace) mco(gs3)
graph combine g1.gph g2.gph g3.gph g4.gph , ///
    saving(g5,replace)
24
Plot a: Oral reading versus rectal reading
[Figure: "Oral Hg versus Rectal Hg": hgoral against hgrectal, both axes 35 to 41, with the identity line (red) and a line with slope 1 (black)]
Comments
No systematic differences are present if the points scatter
around the identity line (red line).
The plot suggests that rectal temperature is systematically
higher, but the size of the difference does not seem to vary
systematically over the range of temperatures.
The points scatter around a line with slope 1 (black line) and the
variation is fairly constant over the range of temperatures.
25
Plot b: Difference versus average. Often called a Bland-Altman plot.
[Figure: "difference versus average": difro against hgmean, with a horizontal line at 0 (red) and a horizontal line at approx. 0.5 (black)]
Comments
If no systematic differences are present the points scatter around
the horizontal line at 0 (red line).
The plot suggests that rectal temperature is systematically higher,
but the size of the difference does not seem to vary systematically
over the range of temperatures.
The points scatter around a horizontal line at approx. 0.5 (black line)
and the variation is fairly constant over the range of temperatures.
26
Plot c and d: The histogram and the Q-Q plot of the difference.
[Figure: "Rectal Hg - Oral Hg": histogram of difro (density); "Q-Q plot of difference": difro against Inverse Normal]
Comments:
These plots are mainly used for assessing the validity of a
normal distribution.
The distribution looks fairly symmetric with a modest
departure from the normal curve.
The slightly S-shaped pattern in the Q-Q plot indicates that both
tails are too long. This could be the consequence of a few gross
errors.
27
Analysis using Stata:
A paired two-sample problem can be analyzed with the command
ttest directly, i.e. without first generating the difference
ttest hgrectal=hgoral
(== is also allowed)
Paired t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
hgrectal |      94    38.06808    .0852441    .8264723    37.89881    38.23736
  hgoral |      94    37.57447    .0855485    .8294235    37.40459    37.74435
---------+--------------------------------------------------------------------
    diff |      94    .4936168    .0330279    .3202179    .4280298    .5592038
------------------------------------------------------------------------------
     mean(diff) = mean(hgrectal - hgoral)                         t =  14.9454
 Ho: mean(diff) = 0                              degrees of freedom =       93

 Ha: mean(diff) < 0           Ha: mean(diff) != 0           Ha: mean(diff) > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000
Notes on the output:
• diff, Mean: average difference
• diff, [95% Conf. Interval]: 95% confidence limits for the expected difference
• Pr(|T| > |t|): p-value of the hypothesis that the expected difference is 0
28
Analysis, continued
Estimates obtained from the output
$\hat{\delta} = 0.49$ (95% confidence limits: 0.43, 0.56)
$\hat{\sigma} = 0.32$ (i.e. $\hat{\sigma}^2 = 0.32^2 = 0.102$)
95% confidence limits for $\sigma$ become (see Lecture 2 page 31):
$\sigma_L = 0.28$, $\sigma_U = 0.37$
Conclusion:
Rectal Hg temperature is on the average 0.5 degree higher than oral
Hg temperature. The random variation (measurement errors and inter-
and intra-individual variation) has a standard deviation of 0.32 degree.
The confidence interval of the expected difference $\delta$ does not describe
the agreement between the two methods.
Limits of agreement are defined as $\hat{\delta} \pm 2\hat{\sigma}$ (i.e. mean diff. ± 2·s.d.).
This is an approx. 95% prediction interval. The limits are rather wide:
-0.15 and 1.13, reflecting the relatively large variation in the differences.
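The limits of agreement follow directly from the "diff" row of the paired t test output. A minimal sketch in Python, with the two numbers copied from that output:

```python
mean_diff = 0.4936168   # "diff" row, Mean
sd_diff = 0.3202179     # "diff" row, Std. Dev.

# Limits of agreement: mean difference +/- 2 standard deviations.
lower = mean_diff - 2 * sd_diff
upper = mean_diff + 2 * sd_diff
print(round(lower, 2), round(upper, 2))
```

Note that the standard deviation, not the standard error, enters here: the limits describe where a single new difference is expected to fall, not the uncertainty of the mean difference.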
29
Example Urinary albumin excretion rate
The file albu.dta contains data on urinary albumin excretion rate
(µg/min) measured by two different methods (“one-hour” and “night”)
in 15 patients.
Diagnostic plots
[Figure: four panels. a: onehour against night (both axes 0 to 27); b: difference against average; c: histogram of difference (density); d: Q-Q plot of difference against Inverse Normal]
30
Example T4 and T8 cell concentration
The file tcounts.dta contains data on the number of T4 and T8 cells/mm³
in blood samples from 20 patients in remission from Hodgkin’s disease.
Diagnostic plots
[Figure: four panels. a: t8 against t4 (both axes 0 to 2500); b: Difference against Average; c: histogram of Difference (density); d: Q-Q plot of Difference against Inverse Normal]
31
In both of these examples the difference between the two values
increases with the size.
[Figure: the two difference-versus-average plots side by side: albumin data (difference against average) and T4/T8 data (Difference against Average)]
Also the variation in the difference seems to increase with the size
of the measurements.
The ”obvious” analysis may not be appropriate, but do we have
any alternatives?
32
PAIRED DATA:
ABSOLUTE OR RELATIVE CHANGE/DIFFERENCE?
Data: a sample of size n of pairs of observations (x,y).
Such data arise in many situations, e.g.
• Comparison of measurement methods
• Comparison of pre-treatment and post-treatment measurements
• Studies of inter-observer or intra-observer variation
Problem: Do the x’s differ from the y’s? If yes, how?
First reaction: Look at differences (the obvious analysis above)
BUT should we consider absolute or relative difference/change?
Absolute change $= y - x$
Relative change $= \frac{y - x}{x} = \frac{y}{x} - 1$
Note: the relative change is just the ratio ”adjusted” such that 0 becomes the
”no change” value.
33
Absolute or relative change?
The choice should reflect the structure of the systematic variation in
the data.
The relationship between the two series of measurements may identify
the most appropriate choice.
Two simple relations between y and x
1. $y = x + a$
2. $y = x \cdot b$
In situation 1 differences will be relatively stable.
If the variation in the data conforms to this relation absolute change
is the appropriate choice (from a statistical perspective).
In situation 2 ratios will be relatively stable.
If the variation in the data conforms to this relation relative change
is the appropriate choice (from a statistical perspective).
34
The paired t-test and the non-parametric analog are primarily intended
for situation 1, since these tests are based on the differences.
Situation 2 is often best handled by taking logarithms:
$y = x \cdot b$ implies $\log(y) = \log(x) + \log(b)$
i.e. the relationship between log(x) and log(y) is additive (situation 1)
with a = log(b).
In short: If the variation in a series of pairs (x,y) is multiplicative
(situation 2) the best approach is usually to compute the differences
$\log(y) - \log(x) = \log(y / x)$
and base the statistical analyses on these differences.
This corresponds to working with ratios on a log-scale.
35
Paired data: Difference of logs versus relative change
Why not use relative change in situation 2?
Example
Patient      x      y    (y-x)/x     y/x    log(y)-log(x)
1          100    200      100%     2.00        0.693
2          200    100      -50%     0.50       -0.693
Average    150    150       25%     1.25        0
The second person is just a ”reversed version” of the first person, so
”nothing happens”.
• The first average (x) is identical to the second average (y).
• The average relative change is 25% indicating a slight increase.
• The average ratio is larger than 1 indicating a slight increase.
• The average of difference of logs – the ratio measured on a log-scale –
is 0, indicating the sensible result: no change.
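The three averages in the example can be recomputed with a short Python sketch (the variable names are chosen here for illustration):

```python
import math
import statistics

pairs = [(100, 200), (200, 100)]   # (x, y) for patients 1 and 2

rel_change = [(y - x) / x for x, y in pairs]
ratio = [y / x for x, y in pairs]
log_diff = [math.log(y) - math.log(x) for x, y in pairs]

avg_rel_change = statistics.mean(rel_change)
avg_ratio = statistics.mean(ratio)
avg_log_diff = statistics.mean(log_diff)

print("average relative change   :", avg_rel_change)
print("average ratio             :", avg_ratio)
print("average difference of logs:", round(avg_log_diff, 3))
```

Only the average on the log-scale treats a doubling and a halving as changes of equal size in opposite directions.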
36
Paired data: Difference of logs versus relative change
Working with ratios on a log-scale is convenient, since e.g.
$\log(y / x) = -\log(x / y)$
The relative change does not have this property:
$\frac{y - x}{x} \neq -\frac{x - y}{y}$
When the ratio y/x is close to 1 then
$\log(y) - \log(x) = \log(y / x) \approx \frac{y}{x} - 1 = \frac{y - x}{x}$
When the ratio is close to 1 (e.g. within ±20%) then the difference
between the natural logarithms is approximately equal to the relative
change.
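The quality of this approximation can be tabulated for a few ratios. A sketch in Python; the helper `log_vs_relative` is a name introduced here:

```python
import math

def log_vs_relative(ratio):
    """Return (relative change, difference of natural logs) for a ratio y/x."""
    return ratio - 1.0, math.log(ratio)

for ratio in (0.8, 0.9, 1.0, 1.1, 1.2, 2.0):
    rel_change, log_diff = log_vs_relative(ratio)
    print(f"y/x = {ratio:.1f}   relative change = {rel_change:+.3f}   "
          f"log difference = {log_diff:+.3f}")
```

Within ±20% the two columns differ by about 0.02 or less, whereas for a doubling (ratio 2) the relative change is 100% while the difference of logs is only about 0.69.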
Paired data with a multiplicative structure (situation 2) usually also have
an increased variability for large observations. Working with data on
a log-scale stabilizes the variation and may therefore lead to a simpler
description of the random variation (a coefficient of variation).
37
Example T4 and T8 cell concentration
The diagnostic plots (page 31) indicated that an additive relationship
was inappropriate. Data are now log-transformed to evaluate the fit of
a multiplicative structure.
Diagnostic plots based on log-transformed data
[Figure: four panels. a: log(t8) against log(t4) (both axes 5 to 8); b: Difference of logs against Average of logs; c: histogram of Difference of logs (density); d: Q-Q plot of Difference of logs against Inverse Normal]
38
Example T4 and T8 cell concentration – continued
Log-transformed data: Both systematic and random variation agree
better with the assumptions.
Stata analysis of log-transformed data
gene logt4=log(t4)
gene logt8=log(t8)
ttest logt4=logt8
Paired t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
   logt4 |      20    6.486932    .1583811    .7083018    6.155437    6.818427
   logt8 |      20    6.237297    .1371353    .6132875     5.95027    6.524325
---------+--------------------------------------------------------------------
    diff |      20    .2496345    .1271238    .5685149   -.0164387    .5157076
------------------------------------------------------------------------------
     mean(diff) = mean(logt4 - logt8)                             t =   1.9637
 Ho: mean(diff) = 0                              degrees of freedom =       19

 Ha: mean(diff) < 0           Ha: mean(diff) != 0           Ha: mean(diff) > 0
 Pr(T < t) = 0.9678         Pr(|T| > |t|) = 0.0644          Pr(T > t) = 0.0322
39
ANALYSIS OF LOG-TRANSFORMED, PAIRED SAMPLES
INTERPRETATION OF RESULTS
Summary statistics in output describe $\log(T_4) - \log(T_8) = \log(T_4 / T_8)$,
i.e. the ratio of the counts on a log-scale.
The log-transformed ratios are a sample from a normal distribution. The
relationships on pages 7-10 show how to translate the results to results
about the ratios. We have
Ratio: Log scale                                | Ratio: Original scale
mean = 0.250                                    | median = exp(0.250) = 1.28
conf. limits of mean: -0.016, 0.516             | conf. limits of median: 0.98, 1.67
standard deviation: 0.569                       | coefficient of variation: 62%
approx. 95% prediction interval: -0.89, 1.39    | approx. 95% prediction interval: 0.41, 4.00
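The right-hand column can be recomputed from the "diff" row of the paired t test output. A sketch in Python, with the input values copied from that output; the factor 2 in the prediction interval is the approximation behind "approx. 95%":

```python
import math

mean_d = 0.2496345                # mean of logt4 - logt8
ci_d = (-0.0164387, 0.5157076)    # 95% confidence limits of the mean
sd_d = 0.5685149                  # standard deviation of the differences

median_ratio = math.exp(mean_d)
ci_ratio = tuple(math.exp(v) for v in ci_d)
cv_ratio = math.sqrt(math.exp(sd_d ** 2) - 1.0)
pred_log = (mean_d - 2 * sd_d, mean_d + 2 * sd_d)
pred_ratio = tuple(math.exp(v) for v in pred_log)

print("median ratio          :", round(median_ratio, 2))
print("conf. limits of median:", [round(v, 2) for v in ci_ratio])
print("coefficient of var.   :", round(cv_ratio, 2))
print("prediction int. (log) :", [round(v, 2) for v in pred_log])
print("prediction int.       :", [round(v, 2) for v in pred_ratio])
```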
40
The estimates above may be compared with estimates based on the
sample of ratios.
gene ratio=t4/t8
tabstat ratio , stat(med cv min max)
Output
    variable |       p50        cv       min       max
-------------+----------------------------------------
       ratio |  1.193993  .6085281  .4736842  3.818452
------------------------------------------------------
Note: the sample size is 20, so the ”non-parametric” 95% prediction
limits are min and max.
The two sets of estimates agree reasonably well.
Note:
Using log-transformation in the analysis of two paired samples leads
to an estimate of the median ratio.
Using log-transformation in the analysis of two independent samples
leads to an estimate of the ratio of the medians.
41
WHY USE TRANSFORMATION OF DATA?
Overall purpose
Log transformation – and other transformations – of the data are often
used because a simpler description of the transformed data is available.
Two aspects
Simpler description of the random variation:
• A positively skew distribution was replaced by a symmetric distribution
• Bland-Altman plot: on a log scale the size of variation was independent
of the level of the outcome
A simpler, and more valid, description of the systematic variation:
• For paired data the multiplicative structure was replaced by an
additive structure
More advanced statistical methodology for continuous variables,
e.g. regression models and analysis of variance models, assumes
constant standard deviation and additivity of effects. Transformation
of data may therefore be necessary to apply these methods.
42
Note:
Whether or not to use t-test is often believed to be the most important
question in analysis of two-sample problems.
BUT the choice of appropriate measure of change is usually much
more important than the choice of test statistic.
Always look at diagnostic plots!
However, sometimes there is no clear answer, because
1. Neither the original scale nor a log-scale is appropriate
2. If the variation between pairs is modest it can be very difficult to
distinguish between an additive structure and a multiplicative
structure in the data. In such situations the results are usually very
similar for transformed and untransformed data.
3. The transformation that achieves additivity is not the transformation
that achieves constant variation.
4. It is just a mess
43
STATA – some interactive commands
Both ttest and sdtest have interactive versions (called immediate commands
in Stata), ttesti and sdtesti, respectively. These commands do not require
the full data set; instead the relevant summary statistics are entered
directly after the command.
Some examples
one-sample problem: n = 213, mean = 1.901, sd = 7.529, H: mean = 0
ttesti 213 1.901 7.529 0
two-sample problem:
n1 = 213, mean1 = 1.901, sd1 = 7.529,
n2 = 217, mean2 = 2.194, sd2 = 8.365
H: mean1 = mean2
ttesti 213 1.901 7.529 217 2.194 8.365
one-sample problem: n = 213, sd = 7.529, H: sd = 7
(means are not necessary, write a period.)
sdtesti 213 . 7.529 7
two-sample problem:
n1 = 213, sd1 = 7.529, n2 = 217, sd2 = 8.365
H: sd1 = sd2
sdtesti 213 . 7.529 217 . 8.365
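What the two ttesti examples compute from the summary statistics can be sketched in Python. The function names are introduced here, the two-sample version assumes equal variances (a pooled standard deviation), and p-values are left out because the standard library has no t distribution:

```python
import math

def one_sample_t(n, mean, sd, mu0):
    """t statistic for H: mean = mu0, from summary statistics."""
    return (mean - mu0) / (sd / math.sqrt(n))

def two_sample_t(n1, mean1, sd1, n2, mean2, sd2):
    """Equal-variance two-sample t statistic from summary statistics."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / se

# Same numbers as in the ttesti examples.
t_one = one_sample_t(213, 1.901, 7.529, 0)
t_two = two_sample_t(213, 1.901, 7.529, 217, 2.194, 8.365)
print(round(t_one, 3), "with", 213 - 1, "degrees of freedom")
print(round(t_two, 3), "with", 213 + 217 - 2, "degrees of freedom")
```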
44