1 z-scores
A data set is normalised by subtracting the mean from each observation and then dividing by the standard
deviation. The resulting observations are called normalised scores or z-scores. The new data set will have
mean 0 and standard deviation 1.
If x is a value and z is the corresponding z-score,
\[ z = \frac{x - \mu}{\sigma} \]
and
\[ x = \sigma z + \mu. \]
In Lecture 3, we considered the data set 2, 3, 6, 9, 10. We found that µ = 6 and σ ≈ 3.16. The observation 3
has a z-score of
\[ z \approx \frac{3 - 6}{3.16} \approx -0.949, \]
so we say that this observation is 0.949 standard deviations below the mean.
The mean corresponds to a z-score of 0.
The value at two standard deviations above the mean is
x ≈ 3.16 · 2 + 6 ≈ 12.3.
No observations were this high.
z-scores give a standardised measure of value.
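The conversion in both directions can be checked with a short Python sketch. It assumes σ is the population standard deviation (dividing by N), which is what reproduces σ ≈ 3.16 for this data set; the function names `z_score` and `from_z` are illustrative only.

```python
from statistics import fmean, pstdev

data = [2, 3, 6, 9, 10]
mu = fmean(data)       # mean of the data set
sigma = pstdev(data)   # population standard deviation (assumed convention)

def z_score(x):
    """Number of standard deviations that x lies above the mean."""
    return (x - mu) / sigma

def from_z(z):
    """Value that sits z standard deviations above the mean."""
    return sigma * z + mu

# Normalising every observation gives a data set with mean 0 and
# standard deviation 1.
z_scores = [z_score(x) for x in data]

print(round(z_score(3), 3))  # z-score of the observation 3
print(round(from_z(2), 1))   # value two standard deviations above the mean
```

The rounded results should agree with the values worked out by hand above.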
2 Percentiles, Deciles, and Quartiles
Consider the data set 4, 24, 25, 27, 27, 27, 29, 30, 32, 32, 33, 33, 33, 35, 35, 37, 37, 39, 55, 100 consisting of
20 observations. Notice that this data is presented in increasing order.
Suppose we want to find a value such that 37% of the observations are less than this value. This value is
called the 37th percentile and written as P37 . To do this, we multiply the number of observations by 37%
(0.37) and obtain 7.4. Since 7.4 is not a whole number, we look at the 8th observation. The 8th observation
is 30. So we say that the 37th percentile is P37 = 30. Notice that 7 observations (which is actually 35% of
20) are less than 30. This is as good as we can do.
Now say we want to find a value such that 85% of the observations are less than this value. Again, multiply
the number of observations by 85% (0.85) and obtain 17. Since 17 is a whole number, take the average of
the 17th and 18th observations on the list. That is, we say that the 85th percentile is
\[ P_{85} = \frac{37 + 39}{2} = 38. \]
Notice that 17 observations (which is exactly 85% of 20) are less than 38.
In general, to find the pth percentile, we compute
\[ i = \frac{Np}{100}, \]
where N is the size of the data set. If i is a whole number, report the average of the observations in the ith
and (i + 1)st positions. If i is not a whole number, report the observation whose position is the next integer
larger than i.
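The rule can be written as a small Python function. This is a minimal sketch that assumes the data are already sorted and that p is a whole number strictly between 0 and 100; the name `percentile` is illustrative. Integer arithmetic is used so that the "is i a whole number?" test is exact.

```python
def percentile(sorted_data, p):
    """p-th percentile using the position rule i = N*p/100."""
    n = len(sorted_data)
    if not 0 < p < 100:
        raise ValueError("p must satisfy 0 < p < 100")
    i, remainder = divmod(n * p, 100)
    if remainder == 0:
        # i is whole: average the i-th and (i+1)-th observations
        # (1-based positions, hence indices i-1 and i).
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    # i is not whole: take the next position above i, which is
    # position i+1 (1-based), i.e. index i.
    return sorted_data[i]

data = [4, 24, 25, 27, 27, 27, 29, 30, 32, 32,
        33, 33, 33, 35, 35, 37, 37, 39, 55, 100]

print(percentile(data, 37))  # 8th observation
print(percentile(data, 85))  # average of the 17th and 18th observations
```

Note that many software packages use different interpolation conventions, so their percentiles may differ slightly from the ones produced by this rule.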
A similar calculation works to obtain deciles (where the data is broken up into 10 equal pieces) and quartiles
(where the data is broken up into 4 equal pieces).
We can think of computing percentiles as putting the data into 100 bins (0th percentile to 99th percentile),
deciles as putting the data into 10 bins (0th decile to 9th decile), and quartiles as putting the data into 4 bins
(0th quartile to 3rd quartile). These bins are reported by their minimum value. That is, Q3 is the minimum
value of the 3rd quartile bin.
To find the dth decile, compute
\[ i = \frac{Nd}{10}, \]
and to find the qth quartile, compute
\[ i = \frac{Nq}{4}. \]
Alternatively, use the fact that, for example, the third decile is the same as the 30th percentile and the third
quartile is the same as the 75th percentile.
As an example, to compute the 5th decile, we find $i = \frac{20 \cdot 5}{10} = 10$. Thus the 5th decile is the average of the
10th and 11th observations, which is $D_5 = \frac{32 + 33}{2} = 32.5$.
To compute the 1st quartile, we find $i = \frac{20 \cdot 1}{4} = 5$. Thus the 1st quartile is the average of the 5th and 6th
observations, which is $Q_1 = \frac{27 + 27}{2} = 27$.
We could think of the 3rd quartile as the 75th percentile and find $i = \frac{20 \cdot 75}{100} = 15$. Thus the 3rd quartile is
the average of the 15th and 16th observations, which is $Q_3 = \frac{35 + 37}{2} = 36$.
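The three worked examples share one position rule, with only the number of pieces changing. The following sketch makes that explicit; the function name `quantile` and its parameter `parts` (100 for percentiles, 10 for deciles, 4 for quartiles) are assumptions made for illustration, and the data are assumed to be sorted.

```python
def quantile(sorted_data, k, parts):
    """k-th of `parts` quantiles using the position rule i = N*k/parts."""
    n = len(sorted_data)
    if not 0 < k < parts:
        raise ValueError("k must satisfy 0 < k < parts")
    i, remainder = divmod(n * k, parts)
    if remainder == 0:
        # Whole i: average the i-th and (i+1)-th observations (1-based).
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    # Otherwise use the next position above i (1-based i+1, index i).
    return sorted_data[i]

data = [4, 24, 25, 27, 27, 27, 29, 30, 32, 32,
        33, 33, 33, 35, 35, 37, 37, 39, 55, 100]

d5 = quantile(data, 5, 10)   # 5th decile
q1 = quantile(data, 1, 4)    # 1st quartile
q3 = quantile(data, 3, 4)    # 3rd quartile
p75 = quantile(data, 75, 100)  # 75th percentile, same position as Q3

print(d5, q1, q3, p75)
```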
We can also give a percentile, decile, or quartile for each observation. To do this, find the fraction of
observations which are smaller than this observation and multiply this by 100 for percentile, 10 for decile,
and 4 for quartile. Truncate, do not round, the result.
Consider the observation 100. There are 19 out of 20 observations which are smaller than this. Since
$\frac{19}{20} \times 100 = 95$, we say that this is in the 95th percentile. The maximum observation is not at the
100th percentile because it is not larger than itself.
Consider the two observations of 32. There are 8 out of 20 observations which are smaller than these. Since
$\frac{8}{20} \times 4 = 1.6$, we say that this is in the 1st quartile. We do not round up. The second quartile starts at 33.
Since $\frac{8}{20} \times 10 = 4$, we say that this is in the 4th decile.
Percentiles, deciles, and quartiles give standardised measures of rank.
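Going the other way, from an observation to its bin, can be sketched as follows. The function name `rank_bin` and its parameter `parts` are illustrative assumptions; integer floor division performs the truncation described above.

```python
def rank_bin(data, x, parts):
    """Bin of x: (fraction of observations smaller than x) * parts, truncated."""
    smaller = sum(1 for value in data if value < x)
    # Floor division truncates, so 1.6 becomes 1 rather than rounding to 2.
    return smaller * parts // len(data)

data = [4, 24, 25, 27, 27, 27, 29, 30, 32, 32,
        33, 33, 33, 35, 35, 37, 37, 39, 55, 100]

print(rank_bin(data, 100, 100))  # percentile of the maximum
print(rank_bin(data, 32, 4))     # quartile of the observations equal to 32
print(rank_bin(data, 32, 10))    # decile of the observations equal to 32
print(rank_bin(data, 33, 4))     # quartile of the observations equal to 33
```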
3 Inter-Quartile Range
The inter-quartile range is
\[ \mathrm{IQR} = Q_3 - Q_1. \]
This is a measure of spread of the data and indicates the range of the middle half of the observations. It is
more robust than the ordinary range, which takes into account all the observations, and can be significantly
affected by outliers.
In the above example, the IQR is 36 − 27 = 9. The range is 100 − 4 = 96, and depends entirely on the two
most extreme observations.
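To illustrate the robustness claim, the sketch below compares the IQR and the range for the example data and for a hypothetical variant in which the largest observation is changed from 100 to 1000. The variant data set and the helper names are assumptions made for illustration; the quartiles use the position rule from Section 2.

```python
def quartile(sorted_data, q):
    """q-th quartile using the position rule i = N*q/4."""
    i, remainder = divmod(len(sorted_data) * q, 4)
    if remainder == 0:
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    return sorted_data[i]

def iqr(sorted_data):
    """Inter-quartile range Q3 - Q1."""
    return quartile(sorted_data, 3) - quartile(sorted_data, 1)

def value_range(sorted_data):
    """Ordinary range: maximum minus minimum."""
    return sorted_data[-1] - sorted_data[0]

data = [4, 24, 25, 27, 27, 27, 29, 30, 32, 32,
        33, 33, 33, 35, 35, 37, 37, 39, 55, 100]

# Hypothetical variant: only the most extreme observation changes.
stretched = data[:-1] + [1000]

print(iqr(data), value_range(data))
print(iqr(stretched), value_range(stretched))
```

The range changes with the single altered observation, while the IQR does not, because the 5th, 6th, 15th, and 16th observations are untouched.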
4 Outliers
Any observation which is at least 1.5 IQRs above Q3 but no more than 3 IQRs above Q3 is considered a
possible outlier. The same goes for any observation which is at least 1.5 IQRs below Q1 but no more than
3 IQRs below Q1. Since the IQR is 9 here, 1.5 IQRs is 13.5 and 3 IQRs is 27. So possible outliers are between
36 + 13.5 = 49.5 and 36 + 27 = 63 on the high end, and between 27 − 27 = 0 and 27 − 13.5 = 13.5 on the
low end. That is, 4 and 55 are possible outliers.
Any observation which is more than 3 IQRs above Q3 is considered a probable outlier. The same goes for any
observation which is more than 3 IQRs below Q1. So probable outliers are more than 63 on the high end,
and less than 0 on the low end. That is, 100 is a probable outlier.
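The two thresholds can be applied mechanically. The following sketch takes Q1 and Q3 as given (27 and 36 here, from Section 2) and sorts each observation into "possible" or "probable" by its distance outside the box; the function name `classify_outliers` is illustrative.

```python
def classify_outliers(data, q1, q3):
    """Split observations into possible and probable outliers."""
    iqr = q3 - q1
    possible, probable = [], []
    for x in data:
        # Distance beyond the nearer quartile; negative or small inside the box.
        distance = max(x - q3, q1 - x)
        if distance > 3 * iqr:
            probable.append(x)       # more than 3 IQRs out
        elif distance >= 1.5 * iqr:
            possible.append(x)       # at least 1.5 but at most 3 IQRs out
    return possible, probable

data = [4, 24, 25, 27, 27, 27, 29, 30, 32, 32,
        33, 33, 33, 35, 35, 37, 37, 39, 55, 100]

possible, probable = classify_outliers(data, q1=27, q3=36)
print(possible)
print(probable)
```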
5 Five Number Summary
The five number summary consists of the minimum, Q1, the median (also known as Q2 or D5 or P50), Q3,
and the maximum. In the example considered here, it is 4, 27, 32.5, 36, 100. One quarter of the data is
between the minimum and Q1, one quarter of the data is between Q1 and the median, one quarter of the
data is between the median and Q3, and the remaining quarter of the data is between Q3 and the maximum.
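As a sketch, the summary can be assembled from the quartile position rule of Section 2. The data are assumed to be sorted, and the helper names are illustrative.

```python
def quartile(sorted_data, q):
    """q-th quartile using the position rule i = N*q/4."""
    i, remainder = divmod(len(sorted_data) * q, 4)
    if remainder == 0:
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    return sorted_data[i]

def five_number_summary(sorted_data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    return (sorted_data[0],
            quartile(sorted_data, 1),
            quartile(sorted_data, 2),   # the median is Q2
            quartile(sorted_data, 3),
            sorted_data[-1])

data = [4, 24, 25, 27, 27, 27, 29, 30, 32, 32,
        33, 33, 33, 35, 35, 37, 37, 39, 55, 100]

summary = five_number_summary(data)
print(summary)
```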
6 Box and Whisker Plot
A box is drawn between Q1 and Q3. The median is indicated with a line through this box. Whiskers extend
from the box out to the most extreme non-outliers. Possible outliers are indicated with × and probable
outliers are indicated with +.
Here, the largest non-outlier is 39 and the smallest non-outlier is 24. Therefore, the whisker coming off the
top of the box extends up to 39 and the whisker coming off the bottom of the box extends down to 24.
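The whisker ends can be found by discarding everything that is at least 1.5 IQRs outside the box and taking the extremes of what remains. This sketch takes Q1 = 27 and Q3 = 36 as given; the function name `whisker_ends` is illustrative.

```python
def whisker_ends(data, q1, q3):
    """Smallest and largest observations that are not outliers."""
    step = 1.5 * (q3 - q1)
    # Outliers are at least 1.5 IQRs outside the box, so non-outliers
    # lie strictly inside the interval (q1 - step, q3 + step).
    inside = [x for x in data if q1 - step < x < q3 + step]
    return min(inside), max(inside)

data = [4, 24, 25, 27, 27, 27, 29, 30, 32, 32,
        33, 33, 33, 35, 35, 37, 37, 39, 55, 100]

low, high = whisker_ends(data, q1=27, q3=36)
print(low, high)
```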
[Figure: box and whisker plot of the data set; vertical axis from 0 to 110.]
One quarter of the data is in the lower part of the box. One quarter of the data is in the upper part of the
box. Half the data is within the box. Although the two parts of the box may not have the same size, they
each contain the same number of observations.