Topic 10: Non-parametric Statistics
(also known as robust statistics or sturdy statistics)

Descriptive Statistics

We are often interested in statistics that summarise data. For example, we may want to know where the "centre" or "middle" of a distribution is, so we compute the average of the data values. If we want to know about the spread of a data set, we compute the variance or standard deviation. In many cases these statistics are informative, but in other cases we need to be aware of their properties, such as their sensitivity to outliers.

Consider a small data set consisting of the prices of houses sold in a suburb on a particular day.

Table 1: Housing prices in a suburb (to the nearest $1,000)

  365  376  421  470  560  589  650  680  695  715  724  735  752  950  1505

The average (mean) of the data set is $702 (x 1,000). There are 15 sample values in Table 1: nine of them are below the average and six are above it. One can argue that the average (702) is not quite in the middle. The average is large because two data points in Table 1 lie somewhat further away from the other 13. In general, whenever there are "outliers" (data values far away from the majority of the observations), the average can be influenced by them, particularly when the sample is small.

One way to prevent outliers from influencing the statistic is to find the middle of the data distribution by ranking all observations and taking the value in the middle. In our example, the middle observation is the 8th observation in Table 1, $680 (x 1,000). This statistic is called the median. The median (or 50th percentile) is not as sensitive to outliers as the average. Suppose our highest observation were $1,800 (x 1,000) instead of $1,505 (x 1,000): the median would not change, as it is still the 8th observation in Table 1, but the average would increase.
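As an illustration (a minimal Python sketch using the standard-library statistics module, with the Table 1 values):

```python
from statistics import mean, median

# Housing prices from Table 1 (in $1,000s)
prices = [365, 376, 421, 470, 560, 589, 650, 680, 695,
          715, 724, 735, 752, 950, 1505]

# Replace the highest observation (1505) with an even larger value
modified = prices[:-1] + [1800]

# The median is the 8th of the 15 ranked values in both cases
print(median(prices))    # 680
print(median(modified))  # 680

# The mean, however, is pulled up by the larger outlier
print(mean(modified) > mean(prices))  # True
```

The median is unchanged by the more extreme observation, while the mean increases.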
The median is sometimes called a "robust" statistic, or a "sturdy" statistic (robust or sturdy against outliers). In many non-parametric statistical procedures, robust statistics involve the ranks of the observation values. For the spread of a distribution, the variance and standard deviation are also sensitive to outliers; the inter-quartile range (the range between the 25th and 75th percentiles) is a more robust statistic for measuring spread. These robust statistics are often known as non-parametric statistics.

An Example of a Non-parametric Statistical Test: Comparing the "Middles" of the Distributions for Two Matched Samples

Consider a data set with students' pre-test and post-test scores.

Table 2: Pre-test and post-test scores

  Student   Pre-test   Post-test   Sign of difference
  1         60         65          +
  2         57         56          -
  3         89         87          -
  4         67         78          +
  5         45         60          +
  6         60         68          +
  7         76         74          -
  8         53         60          +
  9         90         88          -
  10        70         78          +
  11        49         56          +
  12        92         93          +
  13        61         60          -
  14        60         65          +
  15        76         74          -

In Topic 7, we discussed using the paired-sample t-test to compare the means of two samples where the observations are matched. The paired-sample t-test essentially computes the difference for each pair of observations, and then tests whether the mean of these paired differences is zero. In this way, the magnitudes of the differences are taken into account in computing the test statistic: if one pair of observations has a large difference, it will influence the t-test statistic. To construct a more "robust" statistic that is not sensitive to outliers, we can ignore the actual magnitude of each difference and focus only on its "sign" (negative or positive), as shown in Table 2. Of the 15 differences, 9 are positive and 6 are negative. Is there evidence that the two distributions have different "middle" values?
To establish the level of statistical significance, we first form the null hypothesis that the two distributions have the same middle value. Under this null hypothesis, any difference between the matched observations is due to random chance, so there is a 50% chance of observing a positive difference and a 50% chance of observing a negative difference. (For the moment, let's ignore ties.)

Under the null hypothesis, what is the probability of observing 9 or more positives out of 15 differences? To compute this probability, we use a binomial distribution where the number of trials is 15, the number of successes is 9 (a positive difference is regarded as a success), and the probability of success on each trial is 0.5 (under the null hypothesis that there is no difference between the two samples). In Excel, the probability of observing 9 or more successes out of 15 trials is

  =1 - BINOMDIST(8, 15, 0.5, 1) = 0.3036

This is the probability for a one-tailed test (that is, we are only interested in whether there are more positive differences than negative differences). If our alternative hypothesis is that there is a difference between the two samples, irrespective of whether sample A is higher or lower than sample B, then we need a two-tailed test. The probability for a two-tailed test is simply twice the one-tailed probability: in this case, 2 x 0.3036 = 0.607.

The statistical procedure we have described is called a Sign Test.

Carrying out a Sign Test in SPSS

In SPSS, enter the data so that the matched values are in two columns. Select Analyze -> Nonparametric Tests -> Related Samples, as shown below. Click on the Fields tab, and move the two variables, t1 and t2, into Test Fields, as shown below. Click on the Settings tab and check the Sign Test box, as shown below. Then click on Run. SPSS shows the following output.
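The same binomial calculation can be done outside Excel. A minimal Python sketch using only the standard library:

```python
from math import comb

n, k = 15, 9   # 15 paired differences, 9 of them positive
# Under H0, P(positive difference) = 0.5 for each pair

# One-tailed p-value: P(X >= 9) for X ~ Binomial(15, 0.5)
p_one_tailed = sum(comb(n, x) for x in range(k, n + 1)) / 2 ** n
p_two_tailed = 2 * p_one_tailed

print(round(p_one_tailed, 4))  # 0.3036
print(round(p_two_tailed, 4))  # 0.6072
```

This reproduces the Excel result, since 1 - BINOMDIST(8, 15, 0.5, 1) is exactly P(X >= 9).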
Difference between the Paired-Sample t-test and the Sign Test

Carry out a paired-sample t-test, and compare its statistical significance with that of the Sign Test. The paired-sample t-test produces a significant result, while the Sign Test produces a non-significant result. So which test should we use?

Because the Sign Test ignores the magnitudes of the differences and focuses only on their signs, it is less powerful at detecting differences. However, the Sign Test is not sensitive to outlying observations. The t-test, on the other hand, can be influenced by aberrant observations such as one or two very large differences, and may produce a significant result even when there is no difference for the majority of the observations. So the decision about which test to use depends on whether we want to guard against outliers and "play safe".

A non-parametric test that is more powerful than the Sign Test is Wilcoxon's Signed Rank Test. This test takes into account not only the signs of the paired differences but also the ranks of their magnitudes. In this way, the order of magnitude of the paired differences is accounted for to some degree.

Non-parametric Statistics for Correlations

In Topic 9, for the data set of PISA and TIMSS country mean scores, it was shown that two outliers (Indonesia and Tunisia, with considerably lower scores than the other countries in the data set) can influence the (Pearson) correlation coefficient.
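Returning briefly to Wilcoxon's Signed Rank Test mentioned above: a minimal Python sketch of how the statistic is built from the Table 2 data (tied absolute differences receive average ranks; zero differences would be dropped, though there are none here):

```python
# Pre-test and post-test scores from Table 2
pre  = [60, 57, 89, 67, 45, 60, 76, 53, 90, 70, 49, 92, 61, 60, 76]
post = [65, 56, 87, 78, 60, 68, 74, 60, 88, 78, 56, 93, 60, 65, 74]

# Paired differences, discarding any zeros
diffs = [b - a for a, b in zip(pre, post) if b != a]

# Rank the absolute differences, averaging the ranks over ties
ordered = sorted(abs(d) for d in diffs)
rank_of = {}
for v in set(ordered):
    positions = [i + 1 for i, x in enumerate(ordered) if x == v]
    rank_of[v] = sum(positions) / len(positions)

# Sum the ranks separately for positive and negative differences
w_plus  = sum(rank_of[abs(d)] for d in diffs if d > 0)
w_minus = sum(rank_of[abs(d)] for d in diffs if d < 0)

print(w_plus, w_minus)  # 94.0 26.0
```

The smaller of the two rank sums would then be compared against tabulated critical values (or a normal approximation) to obtain a p-value; statistics packages such as SPSS do this step for you.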
Table 3: PISA and TIMSS country mean scores

  Code   Country                  PISA   TIMSS
  AUS    Australia                524    505
  BFL    Belgium (Flemish)        553    537
  BSQ    Basque Country, Spain    502    487
  ENG    England                  507    498
  HKG    Hong Kong                550    586
  HUN    Hungary                  490    529
  IDN    Indonesia                360    411
  ITA    Italy                    466    484
  JPN    Japan                    534    570
  KOR    Korea                    542    589
  LVA    Latvia                   483    508
  NLD    The Netherlands          538    536
  NOR    Norway                   495    461
  NZL    New Zealand              523    494
  ONT    Ontario, Canada          531    521
  QUE    Quebec, Canada           541    543
  RUS    Russia                   468    508
  SCO    Scotland                 524    498
  SVK    Slovak Republic          498    508
  SWE    Sweden                   509    499
  TUN    Tunisia                  359    410
  USA    United States            483    504

A scatter plot of the PISA versus TIMSS country mean scores is shown below. Compute the correlation between the PISA and TIMSS country mean scores with all countries included. Then compute the correlation again with Tunisia and Indonesia removed from the data set. What is the reduction in the correlation coefficient when Tunisia and Indonesia are removed?

How can we construct a correlation statistic that is not so sensitive to outliers? Following the same line of thought as for the median and inter-quartile range, we can use ranks instead of the actual observations to form correlations. That is, we assign a rank to each observation within each sample, and then compute the correlation using the ranks. An observation with a large magnitude will not influence the correlation much, since we only use the ranks, which go from 1 to N. This is exactly Spearman's rank correlation.

In SPSS, the Correlate module has an option for computing Spearman's rank correlation; see the dialog box below. For the data set containing the PISA and TIMSS country mean scores, compute Spearman's rank correlation for all countries, and then with Indonesia and Tunisia removed. Did the correlation change very much? How does this change compare with the change when Pearson's correlation was used?
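Spearman's rank correlation is just Pearson's correlation applied to ranks. A small Python sketch (using synthetic data, not the PISA/TIMSS values) illustrates why ranking blunts the effect of an outlier:

```python
def pearson(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Rank observations from 1 to N, averaging ranks over ties."""
    ordered = sorted(values)
    return [sum(i + 1 for i, v in enumerate(ordered) if v == x) /
            ordered.count(x) for x in values]

def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 100]   # strictly increasing, but with one extreme value

# The ranks of y are simply 1..5, so Spearman's correlation is exactly 1,
# while the outlier pulls Pearson's correlation below 1.
print(round(spearman(x, y), 3))     # 1.0
print(round(pearson(x, y), 3) < 1)  # True
```

The same comparison can be run on the Table 3 data (with and without Indonesia and Tunisia) to answer the exercise questions above.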
Non-parametric Statistics for Other Tests of Significance

There are many other non-parametric statistics. For example, for testing the equality of the medians of two independent samples, the Mann-Whitney test uses ranks instead of the actual observations. In summary, you should consider using non-parametric statistics when there are outliers, or when the sample size is small.
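A minimal sketch of how the Mann-Whitney U statistic is built from rank sums (hypothetical small samples, chosen with no ties for simplicity; tied values would need average ranks):

```python
# Two small independent samples (assumed illustrative values)
a = [12, 15, 18]
b = [20, 25, 30]

# Rank all observations in the combined sample (no ties here)
combined = sorted(a + b)
rank = {v: i + 1 for i, v in enumerate(combined)}

r1 = sum(rank[v] for v in a)       # rank sum of sample a
n1, n2 = len(a), len(b)

u1 = r1 - n1 * (n1 + 1) / 2        # U statistic for sample a
u2 = n1 * n2 - u1                  # U statistic for sample b

print(u1, u2)  # 0.0 9.0
```

Here every value in sample a is below every value in sample b, giving the most extreme possible U of 0; in practice the smaller U is compared against tabulated critical values or a normal approximation.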