STAT 602: 5 - Consequences of the Log Transformation
Spring 2017

Estimating the typical value of a single population

Example 5.1: Mercury Concentrations of Minnesota Walleyes
Data File: Walleyes (1990-1998) Major Waterways

These data come from walleyes sampled from major waterways in Minnesota during the years 1990-1998. One of the major characteristics of interest to fishery biologists is the mercury contamination (in parts per million, or ppm) found in the tissues of walleyes. We begin by examining a histogram and summary statistics for the mercury concentrations found in the sampled walleyes.

Clearly the distribution of mercury concentrations is extremely skewed to the right. For variables with markedly skewed distributions the median is generally a better measure of typical value than the mean, because the mean is inflated by the extreme cases in the tail of the distribution. The median mercury concentration found in the sampled walleyes is .24 ppm, while the mean is .354 ppm.

When working with an extremely right-skewed distribution it is common practice to work with the characteristic of interest in the logarithmic scale; the base of the logarithm is unimportant.

To transform a variable in JMP you can use the JMP Calculator, which allows you to perform a variety of data transformations and manipulations. To create a new column containing a function of another column, double-click to the right of the last column to add a new column to the spreadsheet. Next, double-click at the top of the column to obtain the Column Info window. In the window, change the name of the new column to log10(Hg), select Formula from the New Property pull-down menu, and click Edit Formula.

[Figure: The Column Info box in JMP]

The JMP Calculator should then appear on the screen.
To take the base 10 logarithm of the HGPPM variable, first select Transcendental from the menu to the right of the calculator keypad, because the logarithm is a transcendental (non-algebraic) function. In the list that appears in the rightmost menu, select the base 10 logarithm (i.e. Log10). In the formula window you should see Log10. Now you need to supply the name of the variable you wish to take the logarithm of, HGPPM in this case, by selecting it from the variable list on the left of the calculator window.

[Figure: The JMP Calculator]

When finished, the formula window will look like: Log10(HGPPM)

Finally, click Apply or OK and close the calculator window. The new column you created should now contain the base 10 logarithm of the mercury concentrations. The histogram and summary statistics for the log10(Hg) readings are shown below. We can clearly see that approximate normality has been achieved through the log transformation.

[Figure: Histogram, Boxplot, and Normal Quantile Plot for log10(Hg)]
[Figure: Summary Statistics for log10(Hg)]

Here we see that both the median and mean are approximately -.620 log base 10 ppm.

Back-Transforming the Mean and Median to the Original Scale

We can back-transform the mean and median values for the log base 10 mercury level as follows:

Median back-transformed to the original scale: $10^{-.620} = .240$ ppm, which is the sample median we found when looking at the data in the original scale above! This is an extremely important fact.

Mean back-transformed to the original scale: $10^{-.621} = .239$ ppm, which is well below the sample mean in the original scale above ($\bar{y} = .354$ ppm)! This is an extremely important observation also.

What we have seen is that the median of the data in the original scale is the same as the back-transformed median of the data in the log scale.
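The contrast between the back-transformed median and the back-transformed mean can be reproduced outside JMP. The following is a minimal Python sketch, not the analysis from the notes: it uses a simulated right-skewed (lognormal) sample rather than the walleye data, and the seed, sample size, and distribution parameters are arbitrary assumptions chosen only for illustration.

```python
import math
import random
import statistics

random.seed(602)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed sample (lognormal); NOT the walleye data.
# An odd sample size makes the sample median an actual observation,
# so the median commutes exactly with the (monotone) log transformation.
y = [random.lognormvariate(-1.4, 0.8) for _ in range(501)]
log_y = [math.log10(v) for v in y]

median_y = statistics.median(y)
mean_y = statistics.mean(y)

# Back-transform the log-scale summaries to the original scale.
back_median = 10 ** statistics.median(log_y)
back_mean = 10 ** statistics.mean(log_y)

print(f"median (original scale):   {median_y:.4f}")
print(f"back-transformed median:   {back_median:.4f}")
print(f"mean (original scale):     {mean_y:.4f}")
print(f"back-transformed mean:     {back_mean:.4f}")
```

The back-transformed mean of the logs is the geometric mean of the data, which can never exceed the ordinary (arithmetic) mean; for right-skewed data it typically falls well below it and close to the median, as seen with the walleye values above.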
Put another way, we see that the log base 10 of the sample median in the original scale is the same as the sample median of the data in the log base 10 scale.

However, the mean in the original scale is NOT the same as the back-transformed mean of the data in the log scale. In other words, the log base 10 of the sample mean in the original scale is NOT the same as the sample mean of the data in the log base 10 scale.

If we define the following:

$\bar{Y}$ = sample mean (original scale)
$m$ = sample median (original scale)
$\log_{10}(\bar{Y})$ = log of the sample mean
$\log_{10}(m)$ = log of the sample median
$\bar{Y}_{\log}$ = sample mean in the log scale
$m_{\log}$ = sample median in the log scale
$\mu$ = pop. mean (original scale)
$\mu_{\log}$ = pop. mean (log scale)
$M$ = pop. median (original scale)
$M_{\log}$ = pop. median (log scale)

For the sample median we have:

$\log_{10}(m) = m_{\log}$ or equivalently, $m = 10^{m_{\log}}$

This also holds true for the population medians:

$\log_{10}(M) = M_{\log}$ and $M = 10^{M_{\log}}$

In contrast, for the sample mean we have

$\log_{10}(\bar{Y}) \neq \bar{Y}_{\log}$

and for the population mean,

$\log_{10}(\mu) \neq \mu_{\log}$

If we have a symmetric distribution after log transformation, the median in the log scale is the same as the mean in the log scale. Thus any inferences (e.g. CIs and hypothesis tests) made for the mean in the log scale can be thought of as inferences for the median in the log scale as well. Using the notation above we have:

$\mu_{\log} = M_{\log}$ if the distribution in the log scale is symmetric.

For example, if the log transformed values are approximately normal then we have symmetry, because the normal distribution is symmetric.

Confidence Interval for the Typical Mercury Level

For our example, a 95% CI for $\mu_{\log}$, the population mean Hg concentration in the log scale (and hence also for the population median $M_{\log}$), is given by (-.666, -.575).
[Figure: From JMP]

Back-transforming the endpoints of this interval to the original scale gives the interval (.216 ppm, .266 ppm).

THIS IS A CONFIDENCE INTERVAL FOR THE POPULATION MEDIAN IN THE ORIGINAL SCALE! (Again, this is because $10^{M_{\log}} = M$, the median in the original scale.)

So we estimate that the median mercury level found in walleyes in major fisheries in Minnesota is between .216 ppm and .266 ppm with 95% confidence.

Hypothesis Testing

Suppose we wish to test:

$H_o$: The typical mercury level of walleyes in MN $\le$ .20 ppm
$H_a$: The typical mercury level of walleyes in MN > .20 ppm

Because our data are so right-skewed, the typical mercury level is best measured by the population median. To make an inference for the median for right-skewed data we can use the log transformation again. Restating our hypotheses in the log scale we have (note: $\log_{10}(.20) = -.699$):

$H_o$: The typical log mercury level of walleyes in MN $\le$ -.699 log base 10 ppm
$H_a$: The typical log mercury level of walleyes in MN > -.699 log base 10 ppm

Using the Test Mean... option from the log10(Hg) pull-down menu we obtain the following results.

We have extremely strong evidence against the null hypothesis in favor of the alternative hypothesis. Hence we would conclude that the median Hg concentration (original scale) found in Minnesota walleyes exceeds .20 ppm.

Comparative Analyses in the Log Scale

We have seen that the consequence of the log transformation for single population inference is that our inferences are being made about the median in the original scale rather than the mean. When comparing two (or more) populations where the variable of interest has a right-skewed distribution, the log transformation is again frequently used.
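The arithmetic in the single-population example, back-transforming the confidence interval endpoints and restating the null value in the log scale, can be checked with a few lines of Python. This is only a sketch: the interval (-.666, -.575) and the null value .20 ppm are taken from the notes, while the variable names are illustrative.

```python
import math

# 95% CI for the log-scale mean/median, as reported in the notes.
ci_log = (-0.666, -0.575)

# Back-transform each endpoint: 10**(log-scale value) -> original scale (ppm).
ci_median = tuple(10 ** endpoint for endpoint in ci_log)

# Null value of the hypothesis test, restated in the log base 10 scale.
null_ppm = 0.20
null_log = math.log10(null_ppm)

print(f"CI for the median (ppm): ({ci_median[0]:.3f}, {ci_median[1]:.3f})")
print(f"log10({null_ppm}) = {null_log:.3f}")
```

Because the whole back-transformed interval (.216, .266) lies above .20 ppm, it agrees with the conclusion of the one-sided test that the median exceeds .20 ppm.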
The consequences of the log transformation on comparative analysis are similar in nature to the single population case discussed above. Our inferences will be about how the population medians compare in the original scale.

Example: Mercury Levels in Walleyes from Fish Lake vs. Island Lake
Data File: Walleyes Fish vs. Island

The key property of logarithms we will be using in our discussion is as follows:

$\log(x) - \log(y) = \log\left(\frac{x}{y}\right)$

i.e. the difference of two variables, x and y, in the log scale is equivalent to the log of their ratio.

Comparative Analyses in JMP

Comparative Display and Summary Statistics in the Original Scale

Both distributions appear to be right-skewed. For both lakes the sample mean exceeds the sample median. It also appears that the mercury levels in Island Lake are more spread out, i.e. the population variance/standard deviation appears to be larger.

Comparative Display and Summary Statistics in Log 10 Scale

Both distributions in the log scale appear to be approximately normal; however, the Fish Lake Flowage distribution shows evidence of kurtosis. The means and medians are closer in value in the log scale.

Comparing the Population Variances/Standard Deviations (log 10 scale)

We have strong evidence that the population variances/standard deviations are not equal.

Independent Samples Test for Comparing Means/Medians (log 10 scale)

We have strong evidence that the population means/medians in the log 10 scale differ significantly (p < .0001).

A 95% CI for $(\mu^{\text{Island}}_{\log} - \mu^{\text{Fish}}_{\log})$, or equivalently $(M^{\text{Island}}_{\log} - M^{\text{Fish}}_{\log})$, is given by (.484, .728).

Using the fact that $\log_{10}(M) = M_{\log}$ and the difference of logarithms property, we find that this is also a confidence interval for the following:

$(M^{\text{Island}}_{\log} - M^{\text{Fish}}_{\log}) = (\log_{10}(M^{\text{Island}}) - \log_{10}(M^{\text{Fish}})) = \log_{10}\left(\frac{M^{\text{Island}}}{M^{\text{Fish}}}\right)$

So (.484, .728) is a confidence interval for the log base 10 of the ratio of the population median Hg level for Island Lake to the population median Hg level for Fish Lake Flowage. If we back-transform the endpoints of this interval we will obtain a confidence interval for the ratio of medians in the original scale, i.e. $M^{\text{Island}} / M^{\text{Fish}}$. Doing this we obtain:

$(10^{.484}, 10^{.728}) = (3.05, 5.35)$

Therefore we estimate with 95% confidence that the median Hg level found in walleyes from Island Lake is between 3.05 and 5.35 times as large as the median Hg level found in walleyes from Fish Lake Flowage. Put simply, walleyes in Island Lake Reservoir typically have between 3.05 and 5.35 times as much mercury in their tissues as those in Fish Lake Flowage.

We will see this type of comparative analysis of data in the logarithmic scale when we examine pairwise comparisons in ANOVA in the next section of notes.
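The back-transformation of the two-sample interval can be sketched in Python as well. The interval (.484, .728) comes from the notes; the helper function name `back_transform_ratio_ci` and the numbers used to check the difference-of-logs property are hypothetical, chosen only for illustration.

```python
import math


def back_transform_ratio_ci(lower_log10, upper_log10):
    """Turn a CI for a difference of log10 medians into a CI for the ratio of medians."""
    return 10 ** lower_log10, 10 ** upper_log10


# Difference-of-logs property, checked on arbitrary positive numbers:
# log10(x) - log10(y) equals log10(x / y).
x, y = 0.8, 0.2
assert math.isclose(math.log10(x) - math.log10(y), math.log10(x / y))

# 95% CI for the difference in log10 medians (Island minus Fish), from the notes.
ratio_ci = back_transform_ratio_ci(0.484, 0.728)
print(f"CI for the ratio of medians: ({ratio_ci[0]:.2f}, {ratio_ci[1]:.2f})")
```

A ratio interval that excludes 1 corresponds to a log-scale difference interval that excludes 0, so the two scales lead to the same conclusion about whether the medians differ.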