Consequences of the Log Transformation

Estimating the typical value of a single population

Example: Mercury Concentrations of Minnesota Walleyes
Data File: Walleyes (1990-1998) Major Waterways

These data come from walleyes sampled from major waterways in Minnesota during the years 1990-1998. One of the major characteristics of interest to fishery biologists is the mercury contamination (in parts per million, or ppm) found in the tissues of walleyes. We begin by examining a histogram and summary statistics for the mercury concentrations found in the sampled walleyes.

Clearly the distribution of mercury concentrations is extremely skewed to the right. For variables with very skewed distributions the median is generally a better measure of typical value than the mean, because the mean is inflated by the extreme cases in the tail of the distribution. The median mercury concentration found in the sampled walleyes is .25 ppm while the mean is .365 ppm.

When working with an extremely right-skewed distribution it is common practice to work with the characteristic of interest in the logarithmic scale, the base of which is unimportant. To transform a variable in JMP you must use the JMP Calculator, which allows you to perform a variety of data transformations and manipulations.

To create a new column containing a function of another column, double-click to the right of the last column to add a new column to the spreadsheet. Next, double-click at the top of the column to obtain the Column Info window. In the window, change the name of the new column to log10(Hg), select Formula from the New Property pull-down menu, and click Edit Formula.

[Figure: The Column Info box in JMP]

The JMP Calculator should then appear on the screen. To take the base 10 logarithm of the HGPPM variable, first select Transcendental from the menu to the right of the calculator keypad, because the logarithm is a transcendental (non-algebraic) function.
In the list that appears in the rightmost menu, select the base 10 logarithm (i.e. log10). In the formula window you should see log10. Now you need to supply the name of the variable whose logarithm you wish to take, which is HGPPM in this case, by selecting it from the variable list on the left of the calculator window.

[Figure: The JMP Calculator]

When finished, the formula window will look like:

Log10(HGPPM)

Finally, click Apply and close the calculator window. The new column you created should now contain the base 10 logarithm of the mercury concentrations. The histogram and summary statistics for the log 10 Hg readings are shown below. We can clearly see that approximate normality has been achieved through the log transformation.

[Figure: Histogram, Boxplot, and Normal Quantile Plot for log10(Hg)]
[Table: Summary Statistics for log10(Hg)]

Here we see that both the median and mean are approximately -.600 in the log base 10 scale (i.e. in log base 10 ppm).

Back-Transforming the Mean and Median to the Original Scale

We can back-transform the mean and median values for the log base 10 mercury level as follows:

Median back-transformed to the original scale = $10^{-.602} = .250$, which is the median we found when looking at the data in the original scale above! This is an extremely important observation.

Mean back-transformed to the original scale = $10^{-.599831} = .2513$, which is well below the sample mean of .365 in the original scale above! This is an extremely important observation also.

What we have seen is that the median of the data in the original scale is the same as the back-transformed median of the data in the log scale. Put another way, the log base 10 of the sample median in the original scale is the same as the sample median of the data in the log base 10 scale. However, the mean in the original scale is NOT the same as the back-transformed mean of the data in the log scale.
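These two observations can be checked numerically. The following is a minimal Python sketch and is not part of the JMP workflow described here. The sample `hg` is simulated from a lognormal distribution with hypothetical parameters; it is NOT the walleye data. The sketch first reproduces the back-transformation arithmetic quoted in the text, then shows that the median survives the round trip through the log scale while the back-transformed log-scale mean (the geometric mean) falls below the ordinary mean.

```python
import math
import random
import statistics

# Back-transform the log-scale summaries quoted in the text.
median_back_text = 10 ** -0.602      # about .250
mean_back_text = 10 ** -0.599831     # about .2513

# Hypothetical right-skewed sample (assumption: lognormal), NOT the walleye data.
# An odd sample size makes the median a single observed value, so the
# median identity holds up to floating-point rounding.
random.seed(1)
hg = [random.lognormvariate(math.log(0.25), 0.8) for _ in range(201)]
log_hg = [math.log10(x) for x in hg]

median_orig = statistics.median(hg)
median_back = 10 ** statistics.median(log_hg)   # back-transformed log-scale median

mean_orig = statistics.mean(hg)
mean_back = 10 ** statistics.mean(log_hg)       # geometric mean of hg

print(f"median: original {median_orig:.4f}, back-transformed {median_back:.4f}")
print(f"mean:   original {mean_orig:.4f}, back-transformed {mean_back:.4f}")
```

With an even sample size the median averages the two middle values, so the median identity then holds only approximately.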
Equivalently, the log base 10 of the sample mean in the original scale is NOT the same as the sample mean of the data in the log base 10 scale. If we define the following:

$\bar{X}$ = sample mean (original scale)
$\log_{10}(\bar{X})$ = log of the sample mean
$Med$ = sample median (original scale)
$\log_{10}(Med)$ = log of the sample median
$\bar{X}_{\log}$ = sample mean (log scale)
$Med_{\log}$ = sample median (log scale)

For the median we have:

$\log_{10}(Med) = Med_{\log}$, or equivalently, $Med = 10^{Med_{\log}}$

In contrast, for the mean we have:

$\log_{10}(\bar{X}) \neq \bar{X}_{\log}$

For the walleye data, for example, $\log_{10}(.365) = -.438$, whereas the sample mean in the log scale is approximately -.600.

Furthermore, the median in the log scale is approximately the same as the mean in the log scale, because the distribution in the log scale is approximately normal and hence symmetric. Thus any inferences (e.g. CIs and hypothesis tests) made for the mean in the log scale can be thought of as inferences for the median in the log scale as well. Using the notation above we have:

$\bar{X}_{\log} \approx Med_{\log}$

A 95% CI for the population mean Hg concentration in the log scale, and hence the population median, is given by (-.633, -.567). Back-transforming the endpoints of this interval to the original scale gives the interval (.233 ppm, .271 ppm). THIS IS A CONFIDENCE INTERVAL FOR THE MEDIAN IN THE ORIGINAL SCALE! (Again, this is because $Med = 10^{Med_{\log}}$.)

Hypothesis Testing Example

Suppose we wish to test:

$H_o$: The typical mercury level of walleyes in MN ≤ .20 ppm
$H_a$: The typical mercury level of walleyes in MN > .20 ppm

Because our data are so right-skewed, the typical mercury level is best measured by the population median. To make an inference for the median of right-skewed data we can use the log transformation again. Restating our hypotheses in the log scale we have (note: $\log_{10}(.20) = -.699$):

$H_o$: The typical log mercury level of walleyes in MN ≤ -.699 log base 10 ppm
$H_a$: The typical log mercury level of walleyes in MN > -.699 log base 10 ppm

Using the Test Mean... option from the log10(Hg) pull-down menu we obtain the following results. We have extremely strong evidence against the null hypothesis in favor of the alternative hypothesis.
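The back-transformation arithmetic in this section can be verified with a few lines of Python. This is only a sketch of the arithmetic, using the interval endpoints and null value quoted in the text; it does not reproduce the JMP Test Mean output, and the variable names are illustrative.

```python
import math

# 95% CI for the log-scale mean (and hence median), as reported in the text.
ci_log = (-0.633, -0.567)

# Back-transform each endpoint; 10**x is increasing, so the order is preserved.
ci_median = tuple(10 ** endpoint for endpoint in ci_log)
print([round(v, 3) for v in ci_median])   # [0.233, 0.271]

# Null value of .20 ppm restated in the log base 10 scale.
null_log = math.log10(0.20)
print(round(null_log, 3))                 # -0.699
```

Note that the null value of -.699 lies below the entire log-scale interval (-.633, -.567), which agrees with the direction of the test result.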
Based on this test, we conclude that the median Hg concentration (original scale) found in Minnesota walleyes exceeds .20 ppm.

Comparative Analyses in the Log Scale

We have seen that the consequence of the log transformation for single population inference is that our inferences are being made about the median in the original scale rather than the mean. When comparing two (or more) populations where the variable of interest has a right-skewed distribution, the log transformation is again frequently used. The consequences of the log transformation for comparative analysis are similar in nature to the single population case discussed above. Our inferences will be about how the population medians compare in the original scale.

Example: Mercury Levels in Walleyes from Fish Lake vs. Island Lake
Data File: Walleyes Fish vs. Island

The key property of logarithms we will be using in our discussion is as follows:

$\log(x) - \log(y) = \log\left(\frac{x}{y}\right)$

i.e. the difference of two variables, x and y, in the log scale is equivalent to the log of their ratio.

Comparative Analyses in JMP

Comparative Display and Summary Statistics in the Original Scale

Both distributions appear to be right-skewed. For both lakes the sample mean exceeds the sample median. It also appears that the mercury levels in Island Lake are more spread out, i.e. the population variance/standard deviation appears to be larger.

Comparative Display and Summary Statistics in Log 10 Scale

Both distributions in the log scale appear to be approximately normal; however, the Fish Lake Flowage distribution shows evidence of non-normal kurtosis. The means and medians are closer in value in the log scale.

Comparing the Population Variances/Standard Deviations (log 10 scale)

We have strong evidence that the population variances/standard deviations are not equal.
Independent Samples Test for Comparing Means/Medians (log 10 scale)

We have strong evidence that the population means/medians in the log 10 scale differ significantly (p < .0001).

A 95% CI for $(\bar{X}_{\log}^{Island} - \bar{X}_{\log}^{Fish})$, or equivalently $(Med_{\log}^{Island} - Med_{\log}^{Fish})$, is given by (.490, .722).

Using the fact that $\log_{10}(Med) = Med_{\log}$ and the difference of logarithms property above, we can say this is also a confidence interval for the following:

$\log_{10}(Med^{Island}) - \log_{10}(Med^{Fish}) = \log_{10}\left(\frac{Med^{Island}}{Med^{Fish}}\right)$

So (.490, .722) is a confidence interval for the log base 10 of the ratio of the population median Hg level for Island Lake to the population median Hg level for Fish Lake Flowage. If we back-transform the endpoints of this interval we obtain a confidence interval for the ratio of medians in the original scale, i.e. $\frac{Med^{Island}}{Med^{Fish}}$. Doing this we obtain:

$(10^{.490}, 10^{.722}) = (3.09, 5.27)$

Therefore we estimate with 95% confidence that the median Hg level found in walleyes from Island Lake is between 3.09 and 5.27 times as large as the median Hg level found in walleyes from Fish Lake Flowage.

We will see this type of comparative analysis of data in the logarithmic scale when we examine pair-wise comparisons in ANOVA later in the course.
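The ratio interpretation can be illustrated with a short Python sketch, offered as a check on the arithmetic and not as a replacement for the JMP analysis. The interval endpoints are those quoted in the text. The samples `island` and `fish` are simulated from lognormal distributions with hypothetical parameters; they are NOT the lake data and serve only to show that a difference of log-scale medians equals the log of the ratio of original-scale medians.

```python
import math
import random
import statistics

# 95% CI for the difference in log-scale medians (Island minus Fish), from the text.
ci_log_diff = (0.490, 0.722)

# Back-transforming a difference of logs yields a ratio in the original scale.
ci_ratio = tuple(10 ** endpoint for endpoint in ci_log_diff)
print([round(v, 2) for v in ci_ratio])    # [3.09, 5.27]

# Hypothetical samples (assumption: lognormal), NOT the lake data.
# Odd sample sizes make each median a single observed value.
random.seed(2)
island = [random.lognormvariate(math.log(0.8), 0.7) for _ in range(51)]
fish = [random.lognormvariate(math.log(0.2), 0.4) for _ in range(51)]

# Difference of medians computed in the log base 10 scale.
diff_of_log_medians = (
    statistics.median([math.log10(x) for x in island])
    - statistics.median([math.log10(x) for x in fish])
)

# Log base 10 of the ratio of medians computed in the original scale.
log_of_median_ratio = math.log10(statistics.median(island) / statistics.median(fish))

print(math.isclose(diff_of_log_medians, log_of_median_ratio, abs_tol=1e-12))
```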