Data transformations

Data transformation can "correct" deviation from normality and uneven variance (heteroscedasticity). The right transformation can allow you to use parametric statistics.

The need for transformations

We should always check the assumptions that the data follow a Normal distribution with uniform variance.
*If the data meet the assumptions, we can analyze the raw data as described.
*If they are not met, we have two possible strategies:
1-We can use a method which does not require these assumptions, such as a rank-based method.
2-We can transform the data mathematically to make them fit the assumptions more closely before analysis.

There are three commonly used transformations for quantitative data: the logarithm, the square root, and the reciprocal. We call these transformations variance-stabilizing because their purpose is to make variances the same.

If we have several groups of subjects and calculate the mean and variance for each group, we can plot variability against mean. We might have one of these situations:
-Variability and mean are unrelated. We do not usually have a problem and can treat the variances as uniform; we do not need a transformation.
-Variance is proportional to the mean. A square root transformation should remove the relationship between variability and mean.
-Standard deviation is proportional to the mean. A logarithmic transformation should remove the relationship between variability and mean.
-Standard deviation is proportional to the square of the mean. A reciprocal transformation should remove the relationship between variability and mean.
For most data encountered in healthcare research, the first or third situation applies.

Variance-stabilizing transformations also tend to make distributions Normal. There is a mathematical reason for this, as for so much in statistics. It can be shown that if we take several samples from the same population, the means and variances of these samples will be independent if and only if the distribution is Normal. This means that uniform variance tends to go with a Normal distribution. A transformation which makes variance uniform will often also make data follow a Normal distribution, and vice versa.

Logarithmic transformation

*The most frequently used transformation is the logarithm.
*It is particularly useful for concentrations of substances in blood.
*The reason for this is that blood is very dynamic, with reactions happening continuously. Many of the substances we measure are part of a metabolic chain, both being synthesized and metabolized to something else.
*The rates at which these reactions happen depend on the amounts of other substances in the blood, and the consequence is that the various factors which determine the concentration of the substance are multiplied together.
*Multiplying and dividing tend to produce skew distributions.
*If we take the logarithm of several numbers multiplied together, we get the sum of their logarithms.
**So a log transformation produces something where the various influences are added together, and addition tends to produce a Normal distribution, as the simulation sketch below illustrates.
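As a minimal Python illustration of this multiplicative argument (the simulated data and the choice of five uniform factors are invented for illustration, not taken from the source), each "measurement" below is a product of positive random factors; the raw products are right-skewed, while their logarithms are close to symmetric:

    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(0)

    # Each measurement is the product of 5 positive random factors,
    # mimicking a concentration shaped by several multiplicative influences.
    factors = rng.uniform(0.5, 2.0, size=(10_000, 5))
    values = factors.prod(axis=1)

    # The product is right-skewed; its log (a sum of logs) is nearly symmetric.
    print("skewness of raw values:", round(float(skew(values)), 2))       # clearly positive
    print("skewness of log values:", round(float(skew(np.log(values))), 2))  # near 0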
For example, the following figure shows serum cholesterol in stroke patients. As we have seen, for these data the log transformation gives a good fit to the Normal. What happens if we analyze the logarithm of serum cholesterol and then try to transform back to the natural scale? For the raw data, serum cholesterol: mean = 6.34, SD = 1.40. For log (base e) serum cholesterol: mean = 1.82, SD = 0.22.

If we take the mean on the transformed scale and back-transform by taking the antilog, we get exp(1.82) = 6.17. This is less than the mean for the raw data: the antilog of the mean log is not the same as the untransformed arithmetic mean. What we get instead is the geometric mean, which is found by multiplying all n observations together and then taking the nth root. For example, the geometric mean of 4 and 9 is 6, found by multiplying 4 by 9 to give 36 and taking the square (or second) root. The geometric mean is usually smaller than the arithmetic mean, which for 4 and 9 is (4 + 9)/2 = 6.5. Thus the mean of the logs is the log of the geometric mean; the sketch below checks this numerically.

What about the units for the geometric mean? If cholesterol is measured in mmol/L, the log of a single observation is the log of a measurement in mmol/L, and the antilog is back in the original units, mmol/L.
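A minimal Python check of this arithmetic (the 4-and-9 example is from the text; the cholesterol values below are hypothetical, invented for illustration):

    import numpy as np

    def geometric_mean(x):
        # nth root of the product of n observations, computed as the
        # antilog of the mean of the logs (the two are algebraically equal).
        x = np.asarray(x, dtype=float)
        return float(np.exp(np.log(x).mean()))

    print(geometric_mean([4, 9]))   # ~6.0, the square root of 4 * 9 = 36
    print((4 + 9) / 2)              # 6.5, the (larger) arithmetic mean

    chol = np.array([4.9, 5.3, 5.8, 6.4, 7.1, 8.6])   # hypothetical mmol/L values
    print(geometric_mean(chol) < chol.mean())          # True: GM <= AM always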
Even if a transformation does not produce a really good fit to the Normal distribution, it may still make the data much more amenable to analysis. The following figure shows a histogram and Normal plot for the area of venous ulcers at recruitment. The raw data have a very skew distribution, and the small number of very large ulcers might lead to problems in analysis. Although the log-transformed data are still skew, the skewness is much less and the data are much easier to analyze.

Making a distribution more like the Normal is not the only reason for using a transformation. The following figure shows prostate specific antigen (PSA) for three groups of prostate patients: with benign conditions, with prostatitis, and with prostate cancer. A log transformation of the PSA gives a much clearer picture: the variability is now much more similar in the three groups.

The square root

The square root is best for fairly weak relationships between variability and magnitude, i.e. variance proportional to the mean or standard deviation proportional to the square root of the mean. The square root can be used for variables which are greater than or equal to zero; the log and the reciprocal can only be used for variables which are strictly greater than zero, because neither the logarithm nor the reciprocal of zero is defined.

Example: arm lymphatic flow in rheumatoid arthritis with oedema. The distribution is positively skew and the variability is clearly greater in the groups with greater lymphatic activity. A square root transformation makes the data less skew and the variation more uniform. In these data, a log transformation proved to have too great an effect, making the distribution negatively skew, and so the square root of the data was used in the analysis.

Reciprocal transformation

The reciprocal removes the relationship between variability and mean. It is best for very strong relationships, where the standard deviation is proportional to the square of the mean. The reciprocal can only be used for variables which are strictly greater than zero. Of the three transformations, the square root removes the least skewness and the reciprocal removes the most.

Can all data be transformed?

Not all data can be transformed successfully.
1-Sometimes we have very long tails at both ends of the distribution, which makes transformation by log, square root or reciprocal ineffective. For example, the distribution of blood sodium in ITU patients is fairly symmetrical but has longer tails than a Normal distribution; the shape of the Normal plot is first convex then concave.
2-Sometimes we have a bimodal distribution, which makes transformation by log, square root or reciprocal ineffective. For example, systolic blood pressure in a sample of ITU patients.
3-Sometimes we have a large number of identical observations, which will all transform to the same value whatever transformation we use. These are often at one extreme of the distribution, usually at zero. For example, in the distribution of coronary artery calcium in a large group of patients, more than half of the observations were equal to zero. Any transformation would leave half the observations with the same value, at the extreme of the distribution. It is impossible to transform these data to a Normal distribution.
4-Sometimes different transformations lead to different p-values, so the conclusion depends on an arbitrary choice of transformation.

So, what can we do if we cannot transform data? It is usually safer to use methods that do not require such assumptions. These include the non-parametric methods.

Hypothesis testing procedures

-Parametric: Z test, t test, one-way ANOVA
-Nonparametric: Wilcoxon rank sum test, Kruskal-Wallis H-test
-Many more tests exist!

Types of data and analysis

Data may be nominal, ordinal, discrete or continuous, and the analysis may be parametric or non-parametric.
-Nominal: no numerical value
-Ordinal: order or rank
-Discrete: counts
-Continuous: interval, ratio

Parametric test procedures

1-Involve population parameters. Example: the population mean.
2-Require an interval or ratio scale (whole numbers or fractions). Example: height in inches (72, 60.5, 54.7).
3-Have stringent assumptions. Example: a Normal distribution.

Nonparametric test procedures

A nonparametric test is a hypothesis test that does not require any specific conditions about the shape of the populations or the value of any population parameters. These tests are often called "distribution-free" tests.

Why non-parametric statistics?

-To analyse 'crude' data (nominal, ordinal)
-Data derived from small samples
-Data that do not follow a normal distribution
-Data of unknown distribution

Advantages of nonparametric tests

1-Can be used with all scales.
2-Easier to compute.
3-Make fewer assumptions.
4-Suitable for small sample sizes.
5-Analysis can involve outlier values.
6-No need for population parameters.
7-Results may be as exact as parametric procedures.

Disadvantages of nonparametric tests

1-May waste information if the data permit using parametric procedures. Example: converting data from a ratio to an ordinal scale.
2-Difficult to compute by hand for large samples.
3-Tables not widely available.

What is a parameter and why should I care?

Most statistical tests, like the t test, assume some kind of underlying distribution, like the Normal distribution. If you know the mean and the standard deviation of a Normal distribution, then you know how to calculate probabilities. Means and standard deviations are called parameters; all theoretical distributions have parameters. Statistical tests that assume a distribution and use parameters are called parametric tests. Statistical tests that don't assume a distribution or use parameters are called nonparametric tests.

Ranks

Many nonparametric procedures are based on ranks. Data are ranked by ordering them from lowest to highest and assigning them, in order, the integer values from 1 to the sample size. Ties are resolved by assigning tied values the mean of the ranks they would have received if there were no ties.
Example: 117, 119, 119, 125, 128 becomes 1, 2.5, 2.5, 4, 5. If the two 119s were not tied, they would have been assigned the ranks 2 and 3; the mean of 2 and 3 is 2.5. The sketch below reproduces this midrank rule in code.
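A minimal sketch using scipy.stats.rankdata, whose default tie method is exactly the averaging rule described above:

    from scipy.stats import rankdata

    data = [117, 119, 119, 125, 128]

    # method='average' (the default) assigns tied values the mean of the
    # ranks they would have received if they were not tied.
    print(rankdata(data))   # [1.  2.5 2.5 4.  5. ]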
Procedure: replace the original data with their ranks across subjects and then perform the parametric test. For large samples, many nonparametric techniques can be viewed as the usual normal-theory-based procedures applied to ranks:
-t test for independent samples -> Mann-Whitney U test / Wilcoxon rank-sum test (compares two independent samples)
-Paired t test -> Wilcoxon matched pairs signed-rank test (examines a set of differences)
-Pearson correlation coefficient -> Spearman rank correlation coefficient (assesses the linear association between two variables)
-One way analysis of variance (F test) -> Kruskal-Wallis analysis of variance by ranks (compares three or more groups)
-Two way analysis of variance -> Friedman two way analysis of variance (compares groups classified by two different factors)

Wilcoxon signed rank test

Used to test a difference between paired data.
STEP 1: Exclude any differences which are zero. Put the rest of the differences in ascending order, ignoring their signs, and assign them ranks. If any differences are equal, average their ranks.
STEP 2: Count up the ranks of the positive differences as T+ and the ranks of the negative differences as T-.
STEP 3: If there is no difference between drug (T+) and placebo (T-), then T+ and T- would be similar. If there were a difference, one sum would be much smaller and the other much larger than expected. The smaller sum is denoted T: T = smaller of T+ and T-.
STEP 4: Compare the value obtained with the critical values (5%, 2% and 1%) in the table. N is the number of differences that were ranked, not the total number of differences, so the zero differences are excluded.

In summary, the Wilcoxon signed rank test assumes the distribution is continuous and symmetric. Discard any observations that equal the null value M0 and adjust n. Look at the differences between the observations and M0 (for paired data, the differences within pairs). Rank the absolute values of the differences from low to high; ties receive the average rank. T+ is the sum of the ranks of the positive differences and T- is the sum of the ranks of the negative differences. The p-values for one-sided tests are in the table, and apply only if the results are in the correct "direction"; double the table value to get the p-value for a two-sided test.

One-sample example: test, at a = .05, whether the median age of students finishing a Masters degree in biostatistics is greater than 25.
H0: M = 25
H1: M > 25

Age   Age - 25   Rank
26        1        1
30        5        5
37       12        7
23       -2        2
42       17        8
25        0        (excluded)
28        3        3.5
33        8        6
28        3        3.5

T+ = 34 and T- = 2, so T = 2, giving a one-sided p-value of .0118.
*Because the calculated T corresponds to a p-value less than 0.05 in the tables, the difference is significant.
*We can reject H0. The sketch below reproduces this test in code.
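A minimal Python sketch of the same steps; note that with ties or zero differences scipy may fall back to a normal approximation, so its p-value can differ slightly from the exact table value:

    import numpy as np
    from scipy.stats import rankdata, wilcoxon

    ages = np.array([26, 30, 37, 23, 42, 25, 28, 33, 28])
    d = ages - 25                        # differences from the null median, M0 = 25
    d = d[d != 0]                        # STEP 1: exclude zero differences

    ranks = rankdata(np.abs(d))          # rank |differences|; ties get midranks
    t_plus = ranks[d > 0].sum()          # STEP 2: sum of ranks of positives
    t_minus = ranks[d < 0].sum()
    print(t_plus, t_minus)               # 34.0 2.0, so T = 2

    # scipy runs the same test; with the tied ranks here it may use a normal
    # approximation, so its p-value can differ slightly from the exact .0118.
    print(wilcoxon(ages - 25, alternative='greater'))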
For paired data, the computation can be laid out as a table with columns X1i, X2i, Di = X1i - X2i, |Di|, the rank Ri of |Di|, the sign of Di, and the signed rank (plus or minus Ri); the totals of the positive and negative signed ranks give T+ and T-.

Hours of sleep

Patient   Drug   Placebo   Difference   Rank (ignoring sign)
1         6.1    5.2        0.9         3.5*
2         7.0    7.9       -0.9         3.5*
3         8.2    3.9        4.3         10
4         7.6    4.7        2.9         7
5         6.5    5.3        1.2         5
6         8.4    5.4        3.0         8
7         6.9    4.2        2.7         6
8         6.7    6.1        0.6         2
9         7.4    3.8        3.6         9
10        5.8    6.3       -0.5         1
*The 3rd and 4th ranks are tied and hence averaged.

T is the smaller of T+ (50.5) and T- (4.5), so here T = 4.5, which is significant at the 2% level, indicating that the drug (a hypnotic) is more effective than placebo. The sketch below reproduces this calculation.

A second worked computation table:

X1i    X2i    Di      |Di|    Ri     Signed rank
9.98   9.88   +0.10   0.10    4      +4
9.88   9.86   +0.02   0.02    1      +1
9.90   9.83   +0.07   0.07    2.5    +2.5
9.99   9.80   +0.19   0.19    5      +5
9.94   9.87   +0.07   0.07    2.5    +2.5
9.84   9.84    0.00   (discarded)
The two differences of 0.07 would have received ranks 2 and 3, so each gets the average rank 2.5. Totals: T+ = 15, T- = 0.

Wilcoxon signed rank table (portion)

[Table: critical values of T for one-tailed a = .05, .025, .01, .005 (two-tailed a = .10, .05, .02, .01) and sample sizes n = 5 to 13; the individual entries are not recoverable from this transcript.]

There are two ways of using the tables for the Wilcoxon signed rank test:
1-Using critical values: the calculated T (the smaller sum) is compared with the tabulated value at the given N and significance level. The difference is significant (the null hypothesis is rejected) if the calculated T is less than or equal to the tabulated T.
2-Using p-values: find the p-value at the given N that matches the calculated T. If this p-value is greater than the specified level (0.05, for example), H0 cannot be rejected, i.e. the result is not significant. It is only significant if p is less than or equal to the assumed level.
**For a two-tailed test, p = 2 x the one-tailed p.
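A minimal sketch reproducing the hours-of-sleep calculation with the numbers from the table above:

    import numpy as np
    from scipy.stats import rankdata, wilcoxon

    drug    = np.array([6.1, 7.0, 8.2, 7.6, 6.5, 8.4, 6.9, 6.7, 7.4, 5.8])
    placebo = np.array([5.2, 7.9, 3.9, 4.7, 5.3, 5.4, 4.2, 6.1, 3.8, 6.3])

    # Round the differences so that 0.9 and -0.9 tie exactly despite
    # floating-point noise (6.1 - 5.2 is not exactly 0.9 in binary).
    d = np.round(drug - placebo, 1)

    ranks = rankdata(np.abs(d))
    t_plus, t_minus = ranks[d > 0].sum(), ranks[d < 0].sum()
    print(t_plus, t_minus)               # 50.5 4.5, so T = 4.5

    # Two-sided scipy test on the same differences; with tied ranks the
    # p-value may come from a normal approximation rather than exact tables.
    print(wilcoxon(d))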