STAT 602: 5 - Consequences of the Log Transformation
Spring 2017
Estimating the typical value of a single population
Example 5.1: Mercury Concentrations of Minnesota Walleyes
Data File: Walleyes (1990-1998) Major Waterways
These data come from walleyes sampled from major waterways in Minnesota during the
years 1990 – 1998. One of the major characteristics of interest to fishery biologists is the
mercury contamination (in parts per million or PPM) found in the tissues of walleyes.
We begin by examining a histogram and summary statistics for the mercury
contaminations found in the sampled walleyes.
Clearly the distribution of mercury concentrations is extremely skewed to the right. For
variables with markedly skewed distributions the median is generally a better measure
of typical value than the mean because the mean is inflated by the extreme cases in the
tail of the distribution. The median mercury contamination found in the sampled
walleyes is .24 ppm, while the mean is .354 ppm. When working with an extremely
right-skewed distribution it is common practice to work with the characteristic of
interest in the logarithmic scale; the base of the logarithm is unimportant.
To transform a variable in JMP you can use the JMP Calculator which allows you to
perform a variety of data transformations and manipulations. To create a new column
containing a function of another column double-click to the right of the last column to
add a new column to the spreadsheet. Next double-click at the top of the column to
obtain the Column Info window. In the window change the name of the new column to
log10(Hg) and select Formula from the New Property pull-down menu and click Edit
Formula.
The Column Info box in JMP
The JMP Calculator should then appear on the screen. To take the base 10 logarithm of
the HGPPM variable, first select Transcendental from the menu to the right of the
calculator keypad because the logarithm is a transcendental (non-algebraic) function. In
the list that appears in the rightmost menu select base 10 logarithm (i.e. log10). In the
formula window you should see log10. Now you need to supply the name of the variable
you wish to take the logarithm of, which is HGPPM in this case, by selecting it from the
variable list on the left of the calculator window.
The JMP Calculator
When finished the formula window will then look like:
Log10(HGPPM)
Finally click Apply or Ok and close the calculator window. The new column you
created should now contain the base 10 logarithm of the mercury concentrations. The
histogram and summary statistics for the log10(Hg) readings are shown below. We can
clearly see approximate normality has been achieved through the log transformation.
Histogram, Boxplot, and Normal Quantile Plot for log10(Hg)
Summary Statistics for log10(Hg)
Here we see that the median (-.620) and the mean (-.621) are nearly equal in the
log base 10 scale.
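For readers working outside JMP, the same column transformation and summary can be sketched in a few lines of Python. The concentrations below are made up for illustration (they are not the walleye data), and the names `hg_ppm`, `log10_hg`, and `summarize` are hypothetical.

```python
import math
import statistics

# Hypothetical right-skewed concentrations in ppm (NOT the walleye data).
hg_ppm = [0.08, 0.12, 0.15, 0.21, 0.24, 0.30, 0.41, 0.65, 1.90]

# Equivalent of the JMP formula Log10(HGPPM): one new value per row.
log10_hg = [math.log10(y) for y in hg_ppm]


def summarize(values):
    """Return (mean, median) of a list of numbers."""
    return statistics.mean(values), statistics.median(values)


mean_orig, median_orig = summarize(hg_ppm)
mean_log, median_log = summarize(log10_hg)

print(f"original scale: mean = {mean_orig:.3f}, median = {median_orig:.3f}")
print(f"log10 scale:    mean = {mean_log:.3f}, median = {median_log:.3f}")
```

In this toy sample the mean sits well above the median in the original scale, while the two are much closer together after the transformation, which is the pattern described above.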
Back-Transforming the Mean and Median to the Original Scale
We can back-transform the mean and median values for the log base 10 mercury level as
follows:
Median back-transformed to the original scale: $10^{-.620} = .240$ ppm, which is the sample
median we found when looking at the data in the original scale above! This is an
extremely important fact.
Mean back-transformed to the original scale: $10^{-.621} = .239$ ppm, which is well below the
sample mean in the original scale above ($\bar{Y} = .354$ ppm)! This is an extremely important
observation also. What we have seen is that the median of the data in the original scale
is the same as the back-transformed median of the data in the log scale. Put another
way, we see that the log base 10 of the sample median in the original scale is the same as
the sample median of the data in the log base 10 scale.
However, the mean in the original scale is NOT the same as the back-transformed mean
of the data in the log scale. In other words, we see that the log base 10 of the sample
mean in the original scale is NOT the same as the sample mean of the data in the log
base 10 scale.
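A short simulation can make the contrast concrete. The sketch below draws a hypothetical lognormal sample (the seed, sample size, and parameters are assumptions chosen only for illustration) and compares each statistic with its back-transformed log-scale counterpart.

```python
import math
import random
import statistics

random.seed(602)  # arbitrary seed so the sketch is reproducible

# Simulated right-skewed sample: log10(y) is normal, so y is lognormal.
# The parameters are hypothetical and only loosely echo the example.
n = 501  # odd n, so the sample median is a single observed value
sample = [10 ** random.gauss(-0.62, 0.35) for _ in range(n)]
log_sample = [math.log10(y) for y in sample]

median_orig = statistics.median(sample)
mean_orig = statistics.mean(sample)
median_log = statistics.median(log_sample)
mean_log = statistics.mean(log_sample)

# Median: back-transforming recovers the original-scale median.
print(f"median = {median_orig:.4f}, 10**median_log = {10 ** median_log:.4f}")

# Mean: back-transforming gives the geometric mean, which falls
# below the arithmetic mean for any non-constant positive sample.
print(f"mean   = {mean_orig:.4f}, 10**mean_log   = {10 ** mean_log:.4f}")
```

With an even sample size the median is the average of the two middle values, so the agreement between the two medians is close but not exact.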
If we define the following:
$\bar{Y}$ = sample mean (original scale)
$m$ = sample median (original scale)
$\log_{10}(\bar{Y})$ = log of the sample mean
$\log_{10}(m)$ = log of the sample median
$\bar{Y}_{\log}$ = sample mean in the log scale
$m_{\log}$ = sample median in the log scale
$\mu$ = pop. mean (original scale)
$\mu_{\log}$ = pop. mean (log scale)
$M$ = pop. median (original scale)
$M_{\log}$ = pop. median (log scale)
For the sample median we have:
$\log_{10}(m) = m_{\log}$
or equivalently,
$m = 10^{m_{\log}}$
This also holds true for the population medians: $\log_{10}(M) = M_{\log}$ and
$M = 10^{M_{\log}}$.
In contrast, for the sample mean we have $\log_{10}(\bar{Y}) \neq \bar{Y}_{\log}$ and for the
population mean, $\log_{10}(\mu) \neq \mu_{\log}$.
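The reason the median and the mean behave differently can be sketched as follows. This is a supporting derivation rather than part of the original notes; it assumes Y is positive with a continuous distribution, and it uses the standard arithmetic-geometric mean and Jensen inequalities.

```latex
% Median: \log_{10} is strictly increasing, so it preserves order.
P(Y \le M) = P\bigl(\log_{10} Y \le \log_{10} M\bigr) = \tfrac{1}{2}
\quad\Longrightarrow\quad M_{\log} = \log_{10}(M).

% Sample mean: the back-transformed log-scale mean is the geometric mean.
10^{\bar{Y}_{\log}}
  = 10^{\frac{1}{n}\sum_{i=1}^{n}\log_{10} Y_i}
  = \Bigl(\prod_{i=1}^{n} Y_i\Bigr)^{1/n}
  \le \frac{1}{n}\sum_{i=1}^{n} Y_i = \bar{Y}.

% Population mean: Jensen's inequality for the concave function \log_{10}.
\mu_{\log} = E[\log_{10} Y] \le \log_{10} E[Y] = \log_{10}(\mu).
```

Equality in the last two lines holds only when the values do not vary, so for real data the back-transformed log-scale mean falls below the mean in the original scale, as observed in the walleye example.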
If we have a symmetric distribution after log transformation, the median in the log scale
is the same as the mean in the log scale. Thus any inferences (e.g. CIs and hypothesis
tests) made for the mean in the log scale can be thought of as inferences for the median in
the log scale as well.
Using the notation above we have:
$\mu_{\log} = M_{\log}$
if the distribution in the log scale is symmetric. For example, if the log-transformed
values are approximately normal then we have symmetry, because the normal
distribution is symmetric.
Confidence Interval for the Typical Mercury Level
For our example, a 95% CI for $\mu_{\log}$, the population mean Hg concentration in the
log scale (and hence also for the population median $M_{\log}$), is (-.666, -.575).
From JMP
Back-transforming the endpoints of this interval to the original scale gives the following
interval (.216 ppm, .266 ppm). THIS IS A CONFIDENCE INTERVAL FOR THE
POPULATION MEDIAN IN THE ORIGINAL SCALE! (Again, this is because
$10^{M_{\log}} = M$, the median in the original scale.)
So we estimate that the median mercury level found in walleyes in major fisheries in
Minnesota is between .216 ppm and .266 ppm with 95% confidence.
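The back-transformation of the interval endpoints is simple arithmetic, sketched here in Python. The helper name `back_transform_interval` is hypothetical; the endpoints are the ones reported in the text.

```python
def back_transform_interval(lower_log10, upper_log10):
    """Map a log10-scale interval to the original scale.

    Because 10**x is increasing, the endpoints keep their order.
    """
    return 10 ** lower_log10, 10 ** upper_log10


# Endpoints reported in the text for the log10-scale 95% CI.
ci_log10 = (-0.666, -0.575)
lower_ppm, upper_ppm = back_transform_interval(*ci_log10)
print(f"95% CI for the median: ({lower_ppm:.3f} ppm, {upper_ppm:.3f} ppm)")
# prints: 95% CI for the median: (0.216 ppm, 0.266 ppm)
```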
Hypothesis Testing
Suppose we wish to test:
$H_0$: The typical mercury level of walleyes in MN $\le$ .20 ppm
$H_a$: The typical mercury level of walleyes in MN > .20 ppm
Because our data is so right skewed the typical mercury level is best measured by the
population median. To make an inference for the median for right-skewed data we can
use the log transformation again. Restating our hypotheses in the log scale we have:
(Note: $\log_{10}(.20) = -.699$)
$H_0$: The typical log mercury level of walleyes in MN $\le$ -.699 log base 10 ppm
$H_a$: The typical log mercury level of walleyes in MN > -.699 log base 10 ppm
Using the Test Mean... option from the log10(Hg) pull-down menu we obtain the
following results.
We have extremely strong evidence against the null hypothesis in favor of the
alternative hypothesis. Hence we would conclude that the median Hg concentration
(original scale) found in Minnesota walleyes exceeds .20 ppm.
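As a rough illustration of what such a test computes, the sketch below forms the one-sample t statistic in the log base 10 scale. The data are made up (not the walleye sample) and the function name `log10_t_statistic` is hypothetical; only the standard library is used, so the p-value itself is not computed.

```python
import math
import statistics


def log10_t_statistic(values_ppm, threshold_ppm):
    """t statistic for H0: median <= threshold, computed in the log10 scale."""
    logs = [math.log10(y) for y in values_ppm]
    n = len(logs)
    std_err = statistics.stdev(logs) / math.sqrt(n)
    return (statistics.mean(logs) - math.log10(threshold_ppm)) / std_err


# Hypothetical concentrations in ppm (NOT the walleye data).
hg_ppm = [0.11, 0.16, 0.19, 0.22, 0.24, 0.27, 0.31, 0.38, 0.52, 0.95]

print(f"log10(0.20) = {math.log10(0.20):.3f}")
t_stat = log10_t_statistic(hg_ppm, 0.20)
print(f"t = {t_stat:.2f} on {len(hg_ppm) - 1} degrees of freedom")
```

A large positive t statistic favors the alternative hypothesis. The one-sided p-value is the upper-tail area of the t distribution with n - 1 degrees of freedom, which a statistics package or t table would supply.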
Comparative Analyses in the Log Scale
We have seen that the consequence of the log transformation for single
population inference is that our inferences are being made about the median in
the original scale rather than the mean. When comparing two (or more) populations
where the variable of interest has a right-skewed distribution the log
transformation again is frequently used. The consequences of the log
transformation on comparative analysis are similar in nature to the single
population case discussed above. Our inferences will be about how the
population medians compare in the original scale.
Example: Mercury Levels in Walleyes from Fish Lake vs. Island Lake
Data File: Walleyes Fish vs. Island
The key property of logarithms we will be using in our discussion is as follows:
$\log(x) - \log(y) = \log\left(\frac{x}{y}\right)$
i.e. the difference of two variables, x and y, in the log scale is equivalent to the
log of their ratio.
Comparative Analyses in JMP
Comparative Display and Summary Statistics in the Original Scale
Both distributions appear to be
right-skewed. For both lakes the
sample mean exceeds the sample
median. It also appears that the
mercury levels in Island Lake are
more spread out, i.e. the population
variance/standard deviation appears
to be larger.
Comparative Display and Summary Statistics in Log 10 Scale
Both distributions in the log
scale appear to be
approximately normal,
however the Fish Lake
Flowage distribution shows
evidence of kurtosis. The
means and medians are closer
in value in the log scale.
Comparing the Population Variances/Standard Deviations (log 10 scale)
We have strong evidence that the
population variances/standard
deviations are not equal.
Independent Samples Test for Comparing Means/Medians (log 10 scale)
We have strong evidence that the
population means/medians in the log 10
scale significantly differ (p < .0001).
A 95% CI for $(\mu_{\log}^{\text{Island}} - \mu_{\log}^{\text{Fish}})$, or equivalently
$(M_{\log}^{\text{Island}} - M_{\log}^{\text{Fish}})$, is given by (.484, .728). Using the fact that
$\log_{10}(M) = M_{\log}$ and the difference of
logarithms property, we find that this is also a confidence interval for the
following:
$(M_{\log}^{\text{Island}} - M_{\log}^{\text{Fish}}) = (\log_{10}(M^{\text{Island}}) - \log_{10}(M^{\text{Fish}})) = \log_{10}\left(\frac{M^{\text{Island}}}{M^{\text{Fish}}}\right)$
So (.484, .728) is a confidence interval for the log base 10 of the ratio of the
population median Hg level for Island Lake to the population median Hg level
for Fish Lake Flowage. If we back-transform the endpoints of this interval we
will obtain a confidence interval for the ratio of medians in the original scale, i.e.
$M^{\text{Island}} / M^{\text{Fish}}$.
Doing this we obtain:
$(10^{.484}, 10^{.728}) = (3.05, 5.35)$.
Therefore we estimate with 95% confidence, that the median Hg level found in
walleyes from Island Lake is between 3.05 and 5.35 times larger than the median
Hg level found in walleyes from Fish Lake Flowage. Put another way, a typical
walleye from Island Lake Reservoir has between 3.05 and 5.35 times as much
mercury in its tissues as a typical walleye from Fish Lake Flowage.
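The arithmetic behind the ratio interval can be checked with a short Python sketch. The helper name `ratio_interval_from_log10` is hypothetical; the endpoints are the ones reported in the text.

```python
def ratio_interval_from_log10(lower_diff, upper_diff):
    """Back-transform a CI for a log10 difference into a CI for a ratio."""
    return 10 ** lower_diff, 10 ** upper_diff


# CI for (M_log Island - M_log Fish) reported in the text.
lower_ratio, upper_ratio = ratio_interval_from_log10(0.484, 0.728)
print(f"median ratio (Island / Fish): ({lower_ratio:.2f}, {upper_ratio:.2f})")
# prints: median ratio (Island / Fish): (3.05, 5.35)
```

A difference of 0 in the log scale corresponds to a ratio of 1 in the original scale, so a ratio interval that excludes 1 indicates that the two population medians differ.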
We will see this type of comparative analysis of data in the logarithmic scale
when we examine pairwise comparisons in ANOVA in the next section of notes.