S. Soleymani
Math 52
Percentiles and Outliers
Consider the following data set:
38
41
43
44
44
45
47
48
51
52
57
57
58
62
68
79
85
98
100
The histogram below suggests that the distribution of the data is right-skewed (positively skewed).
[Figure: Histogram of the data set; vertical axis: Frequency.]
Some descriptive statistics are found using Excel:
Median = 52
Mean = 58.8
Standard Deviation = 18.9.
Because the data are right-skewed, the median is less than the mean: a few large data
points on the right of the distribution pull the mean up. As discussed earlier, the
median of 52 is therefore a better measure of the center of this data set.
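The statistics above can be cross-checked outside Excel. The following is a minimal sketch in Python, using the standard `statistics` module in place of Excel. It assumes the handout's standard deviation is the sample standard deviation (n - 1 in the denominator), which is the version that reproduces 18.9 for this data set.

```python
import statistics

# The 19 data values from this handout, already sorted.
data = [38, 41, 43, 44, 44, 45, 47, 48, 51, 52,
        57, 57, 58, 62, 68, 79, 85, 98, 100]

median = statistics.median(data)
mean = statistics.mean(data)
# Assumption: sample standard deviation (divides by n - 1).
sd = statistics.stdev(data)

print(f"Median = {median}")
print(f"Mean = {mean:.1f}")
print(f"Standard Deviation = {sd:.1f}")
```

The mean is larger than the median here, which is consistent with the right skew described above.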
Definition: The kth percentile ($P_k$) of a data set is the data value that separates the
lower k% of the data from the upper $(100 - k)$%.
To find $P_k$ of a data set, first sort the data and then evaluate the locator $L$, which gives the
position of $P_k$ in the sorted data. The locator is found by:
$$L = \frac{k}{100} \cdot n$$
where k is the percent being used and n is the number of data points.
CASE 1: If $L$ is not a whole number, round it up. The value of $P_k$ is the $L$th value, counting
from the lowest data value.
EXAMPLE: Find the 50th percentile.
Here k = 50 and n = 19, so $L = \frac{50}{100} \cdot 19 = 9.5$. Round L up to L = 10. Now count
10 values from the minimum value. The 10th value is 52, therefore $P_{50} = 52$. Note that the
50th percentile is the same as the median.
CASE 2: If $L$ is a whole number, then $P_k$ is the average of the $L$th value and the next value,
the $(L+1)$th, in the sorted data set.
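The two cases above can be sketched as a short Python function. The name `percentile` is a hypothetical helper, not a library call, and the sketch assumes k is a whole number with 0 < k < 100. Note that this locator rule is the one defined in this handout; statistical software often uses other interpolation rules and may return slightly different values.

```python
def percentile(data, k):
    """Return P_k using the handout's locator rule L = (k / 100) * n.

    Hypothetical helper; assumes k is an integer with 0 < k < 100.
    """
    if not 0 < k < 100:
        raise ValueError("k must satisfy 0 < k < 100")
    values = sorted(data)
    n = len(values)
    # Integer arithmetic: whole is floor(L), remainder is nonzero
    # exactly when L is not a whole number.
    whole, remainder = divmod(k * n, 100)
    if remainder != 0:
        # CASE 1: round L up to whole + 1; in 0-based indexing
        # that value sits at index `whole`.
        return values[whole]
    # CASE 2: average the L-th value and the next one
    # (0-based indices whole - 1 and whole).
    return (values[whole - 1] + values[whole]) / 2


data = [38, 41, 43, 44, 44, 45, 47, 48, 51, 52,
        57, 57, 58, 62, 68, 79, 85, 98, 100]

print("P50 =", percentile(data, 50))
# A small made-up list where L is a whole number (CASE 2): L = 2.
print("P50 of [1, 2, 3, 4] =", percentile([1, 2, 3, 4], 50))
```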
Sometimes we are interested in finding the quartiles of a data set. These quartiles are:
$Q_1 = P_{25}$ = 25th percentile
$Q_2 = P_{50}$ = Median
$Q_3 = P_{75}$ = 75th percentile
Definition: The interquartile range (IQR) is a measure of variation, computed as the
difference of $Q_3$ and $Q_1$, that is,
$$\text{IQR} = Q_3 - Q_1.$$
For our data set, $Q_1 = 44$ (L = 4.75, rounded up to 5) and $Q_3 = 68$ (L = 14.25, rounded up to 15),
so $\text{IQR} = 68 - 44 = 24$.
In most cases we are interested in detecting and eliminating outliers in a data set. Outliers are
extreme, unusual values that influence the mean and standard deviation of a data set. The borders
that separate the outliers from the rest of the data are called fences, and they are computed as
follows:
$$\text{upper fence} = Q_3 + 1.5 \cdot \text{IQR}$$
$$\text{lower fence} = Q_1 - 1.5 \cdot \text{IQR}$$
Find the outliers for the data set given in this handout.
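One way to check an answer to this exercise is the sketch below. It recomputes the quartiles with the handout's locator rule, then the IQR and the fences, and lists any values outside the fences. The helper `percentile` and the variable names are hypothetical, and the helper assumes a whole-number k with 0 < k < 100.

```python
def percentile(data, k):
    """P_k by the handout's locator rule (hypothetical helper)."""
    values = sorted(data)
    whole, remainder = divmod(k * len(values), 100)
    if remainder != 0:
        # L is not whole: round up and take that value.
        return values[whole]
    # L is whole: average the L-th value and the next one.
    return (values[whole - 1] + values[whole]) / 2


data = [38, 41, 43, 44, 44, 45, 47, 48, 51, 52,
        57, 57, 58, 62, 68, 79, 85, 98, 100]

q1 = percentile(data, 25)
q3 = percentile(data, 75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Outliers are the values strictly outside the two fences.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"lower fence = {lower_fence}, upper fence = {upper_fence}")
print("outliers:", outliers)
```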
SOME PROPERTIES OF MEASURES OF CENTER AND VARIATION
If every data point is increased or decreased by the same constant, then the measures of center
increase or decrease by that constant, while the measures of variation remain the same.
If the data are rescaled by multiplying each data point by a positive constant, then the measures
of center and the measures of variation (such as the standard deviation and IQR) rescale by the
same constant.
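These properties can be checked numerically. The sketch below uses Python's `statistics` module with illustrative constants chosen for this example (a shift of 10 and a scale factor of 2, neither of which comes from the handout), and uses the sample standard deviation as the measure of variation.

```python
import statistics

data = [38, 41, 43, 44, 44, 45, 47, 48, 51, 52,
        57, 57, 58, 62, 68, 79, 85, 98, 100]

shift = 10  # illustrative constant added to every data point
scale = 2   # illustrative positive constant multiplying every data point

shifted = [x + shift for x in data]
scaled = [x * scale for x in data]

for label, values in [("original", data), ("shifted", shifted), ("scaled", scaled)]:
    print(
        f"{label:>8}: "
        f"mean = {statistics.mean(values):.2f}, "
        f"median = {statistics.median(values)}, "
        f"sd = {statistics.stdev(values):.2f}"
    )
```

In the printed rows, shifting moves the mean and median by 10 and leaves the standard deviation alone, while scaling doubles all three.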