Topic 10 Non-parametric Statistics
Also known as Robust Statistics and Sturdy Statistics
Descriptive Statistics
We are often interested in statistics that summarise data. For example, to locate the “centre” or “middle” of a distribution, we often compute the average of the data values. To describe the spread of a data set, we compute the variance or standard deviation. In many cases these statistics are informative, but in other cases we need to be aware of their properties, such as their sensitivity to outliers.
Let us consider a small data set, consisting of the prices of houses sold in a suburb on a particular day.
Table 1 Housing prices in a suburb (to the nearest $1,000)
365, 376, 421, 470, 560, 589, 650, 680, 695, 715, 724, 735, 752, 950, 1505
The average (mean) of the data set is $702 (×$1,000). There are 15 sample values in Table 1. Nine of the 15 are below the average, and six are above it, so one can argue that the average (702) is not quite in the middle. The average is pulled upwards because two data points in Table 1 lie considerably further away from the other 13. In general, whenever there are “outliers” (data values far away from the majority of the observations), the average can be influenced by these outliers, particularly when the sample is small.
One way to prevent outliers from influencing the statistic is to locate the middle of the data distribution by ranking all observations and taking the value in the middle. In our example, the middle observation is the 8th of the 15 ranked observations in Table 1, $680 (×$1,000). This statistic is called the median.
The median (or 50th percentile) is not as sensitive to outliers as the average. If our highest observation were $1,800 (×$1,000) instead of $1,505 (×$1,000), the median would not change, as it would still be the 8th observation in Table 1, but the average would increase. The median is therefore sometimes called a “robust” statistic, or a “sturdy” statistic (robust or sturdy against outliers). In many non-parametric statistical procedures, robust statistics involve the ranks of the observation values.
For the spread of a distribution, the variance and standard deviation are likewise sensitive to outliers. The inter-quartile range (the distance between the 25th and 75th percentiles) is a more robust measure of the spread of the distribution. These “robust” statistics are often known as non-parametric statistics.
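These claims are easy to check by hand. The short Python sketch below (an illustration added here, using numpy, not part of the original EXCEL/SPSS workflow) computes the mean, median, and inter-quartile range of the Table 1 prices, then repeats the calculation after changing the highest price from 1,505 to 1,800: the mean moves, while the median and inter-quartile range do not.

    import numpy as np

    # House prices from Table 1, in units of $1,000
    prices = np.array([365, 376, 421, 470, 560, 589, 650, 680,
                       695, 715, 724, 735, 752, 950, 1505])

    def summarise(label, x):
        q25, q75 = np.percentile(x, [25, 75])
        print(f"{label}: mean = {np.mean(x):.1f}, median = {np.median(x):.1f}, "
              f"IQR = {q75 - q25:.1f}")

    summarise("original data      ", prices)

    # Replace the highest price (1505) with 1800: the mean increases,
    # but the median and the inter-quartile range are unchanged.
    prices2 = prices.copy()
    prices2[prices2.argmax()] = 1800
    summarise("with larger outlier", prices2)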
An Example of a Non-parametric Statistical Test – Comparing the “middle” of the distributions for two matched samples
Consider a data set with students’ pre-test and post-test scores.
Table 2 Pre-test and post-test scores for 15 students

Student       Pre-test score   Post-test score   Sign of difference (post - pre)
student 1     60               65                +
student 2     57               56                -
student 3     89               87                -
student 4     67               78                +
student 5     45               60                +
student 6     60               68                +
student 7     76               74                -
student 8     53               60                +
student 9     90               88                -
student 10    70               78                +
student 11    49               56                +
student 12    92               93                +
student 13    61               60                -
student 14    60               65                +
student 15    76               74                -
In Topic 7, we discussed using the paired-sample t-test to compare the means of two samples where the observations are matched. The paired-sample t-test essentially computes the difference for each pair of observations, and then tests whether the mean of these paired differences is zero. In this way, the magnitude of the differences is taken into account in computing the test statistic. If one pair of observations has a large difference, it will influence the t-test statistic.
To construct a more “robust” statistic that is not sensitive to outliers, we can ignore the actual magnitude of each difference and focus only on its “sign” (i.e., positive or negative), as shown in Table 2.
Of the 15 differences, 9 are positive and 6 are negative. Is there evidence that the two distributions have different “middle” values? To establish the level of statistical significance, we first form a null hypothesis that the two distributions have the same middle value. Under this null hypothesis, any difference between the matched observations is due to random chance, so there is a 50% chance of observing a positive difference and a 50% chance of observing a negative difference. (For the moment, let us ignore ties.) Under the null hypothesis, what is the probability of observing 9 or more positives out of 15 differences?
To compute this probability, we use a binomial distribution where the number of trials
is 15, the number of successes (a positive difference is regarded as a success) is 9, and
the probability of a success for each trial is 0.5 (under the null hypothesis that there is
no difference between the two samples).
In EXCEL, the probability of observing 9 or more positives out of 15 trials is
=1 - BINOMDIST(8, 15, 0.5, TRUE) = 0.3036
This is the probability for a one-tailed test (that is, we are only interested in whether there are more positive differences than negative differences). If our alternative hypothesis is that there is a difference between the two samples, irrespective of whether sample A is higher or lower than sample B, then we need a two-tailed test. The probability for a two-tailed test is simply twice the one-tailed probability; in this case, 2 × 0.3036 = 0.607.
The statistical procedure we have described is called a Sign Test.
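For readers working outside EXCEL, the same tail probability can be checked in Python; here is a minimal sketch using scipy.stats.binom (an illustration, not part of the original notes).

    from scipy.stats import binom

    # Under H0 the number of positive differences is Binomial(n=15, p=0.5).
    # binom.sf(8, 15, 0.5) is P(X > 8), i.e. P(9 or more positives).
    p_one_tailed = binom.sf(8, 15, 0.5)
    p_two_tailed = 2 * p_one_tailed

    print(f"one-tailed p = {p_one_tailed:.4f}")   # 0.3036
    print(f"two-tailed p = {p_two_tailed:.4f}")   # 0.6072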
Carry out a Sign Test in SPSS
In SPSS, enter the data so that the matched values are in two columns.
Select Analyze -> Nonparametric Tests -> Related Samples, as shown below.
Click on the Fields tab, and move the two variables, t1 and t2, into Test Fields, as shown below.
Click on the Settings tab and check the Sign Test box, as shown below. Then click on Run.
SPSS shows the following output.
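If SPSS is not to hand, the whole Sign Test can be reproduced from the raw Table 2 scores. The sketch below uses scipy.stats.binomtest (available from scipy 1.7 onwards; in older versions, 2 * binom.sf(8, 15, 0.5) gives the same two-sided value here).

    import numpy as np
    from scipy.stats import binomtest

    # Pre-test and post-test scores from Table 2
    pre  = np.array([60, 57, 89, 67, 45, 60, 76, 53, 90, 70, 49, 92, 61, 60, 76])
    post = np.array([65, 56, 87, 78, 60, 68, 74, 60, 88, 78, 56, 93, 60, 65, 74])

    diff = post - pre
    n_pos = int(np.sum(diff > 0))   # 9 positive differences
    n_neg = int(np.sum(diff < 0))   # 6 negative differences
    n = n_pos + n_neg               # ties (zero differences) are dropped

    # Sign Test: under H0 each non-zero difference is positive with p = 0.5
    result = binomtest(n_pos, n, 0.5, alternative='two-sided')
    print(f"{n_pos} positives, {n_neg} negatives, "
          f"two-sided p = {result.pvalue:.4f}")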
Difference between the paired-sample t-test and the Sign Test
Carry out a paired-sample t-test, and compare its statistical significance with that of the Sign Test. The paired-sample t-test produces a significant result, while the Sign Test produces a non-significant result.
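As an illustration (again not from the original notes), a paired-sample t-test on the same Table 2 scores can be run with scipy.stats.ttest_rel; on these data it gives t ≈ 2.67 and a two-sided p of roughly .018, against the Sign Test’s 0.607.

    import numpy as np
    from scipy.stats import ttest_rel

    pre  = np.array([60, 57, 89, 67, 45, 60, 76, 53, 90, 70, 49, 92, 61, 60, 76])
    post = np.array([65, 56, 87, 78, 60, 68, 74, 60, 88, 78, 56, 93, 60, 65, 74])

    # The t-test uses the magnitudes of the paired differences, so a few
    # large differences (e.g. +11 and +15 here) can drive the result.
    t_stat, p_value = ttest_rel(post, pre)
    print(f"t = {t_stat:.3f}, two-sided p = {p_value:.3f}")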
So which test should we use?
Because the Sign Test ignores the magnitudes of the differences and focuses only on their signs, it is less powerful in detecting differences. However, it is not sensitive to outlier observations. On the other hand, if there are aberrant observations, such as one or two very large differences, the t-test can be influenced by these outliers and produce a significant result even when there is no difference for the majority of the observations.
So the decision on which test to use depends on whether we want to guard against outliers and “play safe”.
A non-parametric test that is more powerful than the Sign Test is Wilcoxon’s Signed Rank Test. This test takes into account not only the signs of the paired differences but also their ranks. In this way, the magnitude of the paired differences is accounted for to some degree.
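A sketch of Wilcoxon’s Signed Rank Test on the same scores, using scipy.stats.wilcoxon (illustrative only; the exact p-value is left for the reader to inspect):

    import numpy as np
    from scipy.stats import wilcoxon

    pre  = np.array([60, 57, 89, 67, 45, 60, 76, 53, 90, 70, 49, 92, 61, 60, 76])
    post = np.array([65, 56, 87, 78, 60, 68, 74, 60, 88, 78, 56, 93, 60, 65, 74])

    # The absolute differences are ranked, and the ranks of the positive and
    # negative differences are summed separately, so the magnitude of each
    # difference enters only through its rank.
    stat, p_value = wilcoxon(post, pre)
    print(f"W = {stat:.1f}, two-sided p = {p_value:.3f}")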
Non-parametric Statistics for Correlations
In Topic 9, for the data set of PISA and TIMSS country mean scores, it was shown
that the two outliers (Indonesia and Tunisia) (with significantly lower scores than the
other countries in the data set) can influence the (Pearson) correlation coefficient.
Table 3 PISA and TIMSS country mean scores

Country Code   Country Name             PISA country mean   TIMSS country mean
AUS            Australia                524                 505
BFL            Belgium (Flemish)        553                 537
BSQ            Basque Country, Spain    502                 487
ENG            England                  507                 498
HKG            Hong Kong                550                 586
HUN            Hungary                  490                 529
IDN            Indonesia                360                 411
ITA            Italy                    466                 484
JPN            Japan                    534                 570
KOR            Korea                    542                 589
LVA            Latvia                   483                 508
NLD            The Netherlands          538                 536
NOR            Norway                   495                 461
NZL            New Zealand              523                 494
ONT            Ontario, Canada          531                 521
QUE            Quebec, Canada           541                 543
RUS            Russia                   468                 508
SCO            Scotland                 524                 498
SVK            Slovak Republic          498                 508
SWE            Sweden                   509                 499
TUN            Tunisia                  359                 410
USA            United States            483                 504
A scatter plot of the PISA versus TIMSS country mean scores is shown below:
Compute the correlation between PISA and TIMSS country mean scores, with all
countries included.
Then compute the correlation again with Tunisia and Indonesia removed from the
data set.
What is the reduction in the correlation coefficient when Tunisia and Indonesia are
removed?
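One way to carry out this exercise in Python is sketched below, using scipy.stats.pearsonr with the countries in Table 3 order (an illustration; SPSS’s Correlate module gives the same coefficients).

    import numpy as np
    from scipy.stats import pearsonr

    # Country means from Table 3; Indonesia is index 6, Tunisia is index 20
    pisa  = np.array([524, 553, 502, 507, 550, 490, 360, 466, 534, 542, 483,
                      538, 495, 523, 531, 541, 468, 524, 498, 509, 359, 483])
    timss = np.array([505, 537, 487, 498, 586, 529, 411, 484, 570, 589, 508,
                      536, 461, 494, 521, 543, 508, 498, 508, 499, 410, 504])

    r_all, _ = pearsonr(pisa, timss)

    keep = np.delete(np.arange(len(pisa)), [6, 20])   # drop IDN and TUN
    r_trimmed, _ = pearsonr(pisa[keep], timss[keep])

    print(f"Pearson r, all countries:       {r_all:.3f}")
    print(f"Pearson r, without IDN and TUN: {r_trimmed:.3f}")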
How can we construct a correlation statistic that is not so sensitive to outliers? Following the same line of thought as for the median and inter-quartile range, one can use ranks instead of the actual observations to form the correlation. That is, we assign a rank to each observation within each sample, and then compute the correlation using the ranks. An observation with a large magnitude will not influence the correlation very much, since we only use the ranks, which run from 1 to N. Spearman’s rank correlation is exactly this.
In SPSS, the Correlate module has an option for the computation of Spearman’s rank
correlation. See the dialog box below.
For the data set containing PISA and TIMSS country mean scores, compute
Spearman’s rank correlation for all countries, and then with Indonesia and Tunisia
removed. Did the correlation change very much? How does this change compare with
the change when Pearson’s correlation was used?
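The rank-based version of the same exercise is sketched below with scipy.stats.spearmanr (the data arrays repeat those of the Pearson sketch above).

    import numpy as np
    from scipy.stats import spearmanr

    pisa  = np.array([524, 553, 502, 507, 550, 490, 360, 466, 534, 542, 483,
                      538, 495, 523, 531, 541, 468, 524, 498, 509, 359, 483])
    timss = np.array([505, 537, 487, 498, 586, 529, 411, 484, 570, 589, 508,
                      536, 461, 494, 521, 543, 508, 498, 508, 499, 410, 504])

    rho_all, _ = spearmanr(pisa, timss)

    keep = np.delete(np.arange(len(pisa)), [6, 20])   # drop IDN and TUN
    rho_trimmed, _ = spearmanr(pisa[keep], timss[keep])

    # Only ranks enter the calculation, so removing the two low-scoring
    # countries typically changes rho much less than it changes Pearson's r.
    print(f"Spearman rho, all countries:       {rho_all:.3f}")
    print(f"Spearman rho, without IDN and TUN: {rho_trimmed:.3f}")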
Non-parametric Statistics for Other Tests of Significance
There are many other non-parametric statistics. For example, for testing the equality
of the medians of independent samples, the Mann-Whitney test uses ranks instead of
actual observations.
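As a sketch (with made-up scores purely for illustration), the Mann-Whitney test for two independent samples is available in Python as scipy.stats.mannwhitneyu:

    import numpy as np
    from scipy.stats import mannwhitneyu

    # Hypothetical test scores for two independent groups of students
    group_a = np.array([56, 60, 61, 65, 70, 74, 78, 90])
    group_b = np.array([45, 49, 53, 57, 60, 67, 76, 89])

    # The pooled observations are ranked, so a single extreme score in
    # either group has only a limited influence on the test statistic.
    u_stat, p_value = mannwhitneyu(group_a, group_b, alternative='two-sided')
    print(f"U = {u_stat:.1f}, two-sided p = {p_value:.3f}")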
In summary, you should consider using non-parametric statistics when there are
outliers, or when the sample size is small.