Ch2.2, Ch2.3 Numerical Summary Measures of Variability for Data
Topics:
• Measures of variability (spread):
o Deviation
o Variance/standard deviation
o Interquartile range
• Statistical definition of outliers: the 1.5 x IQR criterion for outliers
-----------------------------------------------------------------------------------------------------------
II: Measures of Variability (Spread)
(1) Deviation: the difference between the observation and the mean
The deviation of $x_i$ is defined as $d_i = x_i - \bar{x}$.
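What is the mean deviation ($\bar{d}$)?
$$\bar{d} = \frac{1}{n}(d_1 + d_2 + \dots + d_n) = \frac{1}{n}\left[(x_1 - \bar{x}) + (x_2 - \bar{x}) + \dots + (x_n - \bar{x})\right]$$
$$= \frac{1}{n}\left[(x_1 + x_2 + \dots + x_n) - n\bar{x}\right] = \frac{1}{n}(n\bar{x} - n\bar{x}) = 0$$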
So $\bar{d}$ does not measure the spread in the data. We might instead use the average of the absolute deviations:
$$\frac{1}{n}\left(|d_1| + |d_2| + \dots + |d_n|\right).$$
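As a quick numerical check of the two quantities above, here is a minimal Python sketch. It uses the nine homework grades that appear in the examples further down in these notes; the variable names are chosen only for illustration.

```python
# Homework grades used in the examples later in these notes.
grades = [86, 85, 81, 82, 84, 84, 83, 84, 87]

n = len(grades)
mean = sum(grades) / n

# Deviations d_i = x_i - mean. Positive and negative values cancel,
# so their average is zero (up to floating-point rounding).
deviations = [x - mean for x in grades]
mean_deviation = sum(deviations) / n

# Averaging |d_i| instead avoids the cancellation.
mean_abs_deviation = sum(abs(d) for d in deviations) / n

print(mean_deviation)
print(mean_abs_deviation)
```

The first printed value should be zero for any data set, while the second is positive unless all observations are equal.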
But this measure is mathematically inconvenient, so we consider the following definition instead.
(2) Variance / Standard Deviation
Sample Variance: the sample variance $s^2$ of $n$ observations is the average (using $n-1$) of the squared deviations:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
► An alternative form convenient for calculation:
$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - n(\bar{x})^2\right]$$
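One way to see that the two forms agree is to expand the square and use $\sum_{i=1}^{n} x_i = n\bar{x}$:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n(\bar{x})^2 = \sum_{i=1}^{n} x_i^2 - 2n(\bar{x})^2 + n(\bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n(\bar{x})^2$$

Dividing both sides by $n-1$ gives the alternative form.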
Sample Standard Deviation (SD): the sample standard deviation $s$ is the square root of $s^2$:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
Ex. Grades of 9 students on a HW assignment: 86, 85, 81, 82, 84, 84, 83, 84, 87.
$\bar{x} = 84$. What is the SD? (Also know that $\sum_{i=1}^{9} x_i^2 = 63532$.)
$$s^2 = \left[\sum_{i=1}^{9} x_i^2 - n(\bar{x})^2\right] / 8 = (63532 - 9 \times 84^2)/8 = 28/8 = 3.5$$
$$s = \sqrt{s^2} = \sqrt{3.5} \approx 1.87$$
► Interpretation: A typical student's score is about 1.87 points away from the mean score (84).
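The calculation above can be reproduced with a short Python sketch. It computes the sample variance from the definition and from the computational form, then cross-checks against the standard library's `statistics` module (whose `variance` and `stdev` functions also divide by n-1). Variable names are illustrative only.

```python
import math
import statistics

grades = [86, 85, 81, 82, 84, 84, 83, 84, 87]
n = len(grades)
mean = sum(grades) / n

# Definition: sum of squared deviations divided by n - 1.
var_definition = sum((x - mean) ** 2 for x in grades) / (n - 1)

# Computational form: (sum of x_i^2 - n * mean^2) / (n - 1).
var_shortcut = (sum(x * x for x in grades) - n * mean ** 2) / (n - 1)

# The standard deviation is the square root of the variance.
sd = math.sqrt(var_definition)

print(var_definition, var_shortcut)
print(round(sd, 2))

# Cross-check with the standard library (sample versions, n - 1 divisor).
print(statistics.variance(grades), statistics.stdev(grades))
```

Both variance computations should agree with the hand calculation, and the rounded SD should match the value in the example.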
Comments:
1. SD should be used as a measure of spread only when the mean is used as the measure of center
2. SD=0 implies all data points are the same (no variability)
3. Like the mean, SD is strongly influenced by outliers
Ex. If a coding error turns 87 into 870, then $\bar{x} = 171$ and the SD becomes about 262.
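The effect of a single miscoded value can be checked directly. The sketch below, assuming the nine homework grades listed earlier, replaces 87 with 870 and recomputes the mean and SD with the standard library.

```python
import statistics

grades = [86, 85, 81, 82, 84, 84, 83, 84, 87]

# Simulate the coding error: the single score 87 is entered as 870.
miscoded = [870 if x == 87 else x for x in grades]

# Mean and sample SD before the error.
print(statistics.mean(grades), statistics.stdev(grades))

# Mean and sample SD after the error: both are pulled far from the bulk of the data.
print(statistics.mean(miscoded), statistics.stdev(miscoded))
```

One wrong observation out of nine is enough to move both summaries far away from where most of the scores lie.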
(3) Interquartile Range (IQR)
(a) Quartiles: Q1 = first quartile = median of the lower half of the
data. Q3 = third quartile = median of the upper half of the data
(Q2 = median).
IQR = Q3 – Q1 (good spread measure in the presence of outliers)
To find quartiles:
1. Sort the data and divide the data points into 2 halves. (If there is an odd number of observations, include the median in each half.)
2. Lower quartile = Q1 = median of the lower half of the data.
3. Upper quartile = Q3 = median of the upper half of the data.
(b) Inter-Quartile Range (IQR)

IQR = Q3 – Q1

Interpretation of IQR: Measure of variability (spread) of data,
similar to s but usually larger than s.
Ex. Grades of 9 students on a HW assignment: 86, 85, 81, 82, 84, 84, 83, 84, 87.
The sorted HW scores are:
81 82 83 84 84 84 85 86 87. So median = 84, Q1 = 83 (median of 81 82 83 84 84), Q3 = 85 (median of 84 84 85 86 87), and IQR = 85 – 83 = 2.
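The quartile rule used in these notes (median of each half, with the overall median included in both halves when n is odd) can be sketched in Python as follows. The function name `quartiles_inclusive` is a hypothetical helper, not a library function; statistical software often uses other quartile conventions that can give slightly different values.

```python
import statistics

def quartiles_inclusive(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    When n is odd, the overall median is included in both halves,
    following the rule stated in these notes.
    """
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2  # size of each half; the median is shared when n is odd
    lower = xs[:half]
    upper = xs[n - half:]
    return statistics.median(lower), statistics.median(upper)

grades = [86, 85, 81, 82, 84, 84, 83, 84, 87]
q1, q3 = quartiles_inclusive(grades)
iqr = q3 - q1
print(q1, q3, iqr)
```

For the homework grades this reproduces the quartiles and IQR found by hand.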
Ex. Rainfall in NC in some 15 months (stem: ones digit; leaf: tenths digit)
1 | 0
2 | 25
3 | 45
4 | 11667
5 | 449
6 | 0
7 |
8 | 2
Find the quartiles and the IQR.
Remark 1: The 1.5 x IQR Criterion for Outliers
An observation is called an outlier if it lies more than 1.5*IQR above Q3 or more than 1.5*IQR below Q1.
An extreme outlier (an observation more than 3*IQR above Q3 or more than 3*IQR below Q1) may indicate a data entry error or unusual characteristics in the data that need careful investigation.
Ex. Use the 1.5 x IQR rule to check if there is any outlier in the Rainfall dataset.
Q1 = 3.45, Q3 = 5.4, IQR = 5.4 – 3.45 = 1.95, so 1.5*IQR = 2.925.
Q1 – 2.925 = 3.45 – 2.925 = 0.525 (no data point smaller than 0.525)
Q3 + 2.925 = 5.4 + 2.925 = 8.325 (no data point larger than 8.325)
So there are no outliers by this rule.
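The same check can be sketched in Python. The 15 rainfall values below are read off the stemplot (stem = ones digit, leaf = tenths digit), and the quartiles follow the rule in these notes, with the median included in both halves when n is odd. Names such as `lower_fence` and `upper_fence` are illustrative.

```python
import statistics

# Rainfall values read from the stemplot.
rainfall = [1.0, 2.2, 2.5, 3.4, 3.5, 4.1, 4.1, 4.6,
            4.6, 4.7, 5.4, 5.4, 5.9, 6.0, 8.2]

xs = sorted(rainfall)
n = len(xs)
half = (n + 1) // 2  # each half includes the median when n is odd

q1 = statistics.median(xs[:half])
q3 = statistics.median(xs[n - half:])
iqr = q3 - q1

# 1.5 x IQR fences: anything outside them is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in xs if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)
print(lower_fence, upper_fence)
print(outliers)
```

Replacing 1.5 with 3 in the two fence lines gives the corresponding check for extreme outliers.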
Remark 2: Standard numerical summaries of a data set include the sample size, center, and spread.
For a reasonably symmetric distribution with no outliers, use $\bar{x}$ and $s$.
For all other situations, use the median and the IQR.
Remark 3: The 5-number summary
min, Q1, Median, Q3, max
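A minimal Python sketch of the 5-number summary, applied to the rainfall data from the stemplot above. The helper name `five_number_summary` is hypothetical, and the quartiles use the rule from these notes (median included in both halves when n is odd).

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the halves-based quartile rule."""
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2  # each half includes the median when n is odd
    q1 = statistics.median(xs[:half])
    q3 = statistics.median(xs[n - half:])
    return xs[0], q1, statistics.median(xs), q3, xs[-1]

rainfall = [1.0, 2.2, 2.5, 3.4, 3.5, 4.1, 4.1, 4.6,
            4.6, 4.7, 5.4, 5.4, 5.9, 6.0, 8.2]

print(five_number_summary(rainfall))
```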
Remark 4: Change of Unit
1. Adding (or subtracting) a constant to each observation will NOT change the measures of spread, such as SD and IQR (but it does change the measures of center and the quartiles).
If the new unit = 60 + the old unit, does the spread of the data change?
2. Multiplying each observation by a constant a multiplies the measures of spread (SD and IQR) by |a|.
Conclusion: If the new unit is aX + b, then the new spread (SD or IQR) is |a| times the original spread.
(Recall that the new center is $a\bar{x} + b$.)
Ex. Temperatures are read in Fahrenheit; the SD of the temperatures is s and the IQR is r. What are the new SD and IQR if we switch to Centigrade? Note that
$$C = \frac{5}{9}(F - 32).$$
The new SD is 5s/9.
The new IQR is 5r/9.
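The change-of-unit rules can be checked numerically. In the sketch below the five Fahrenheit readings are made-up values chosen only for illustration, and `iqr` is a hypothetical helper that uses the quartile rule from these notes.

```python
import statistics

def iqr(data):
    """IQR with quartiles taken as medians of the two halves
    (the median is included in both halves when n is odd)."""
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2
    return statistics.median(xs[n - half:]) - statistics.median(xs[:half])

# Hypothetical temperature readings in Fahrenheit (illustration only).
fahrenheit = [50.0, 59.0, 68.0, 77.0, 95.0]

# Linear change of unit: C = (5/9)(F - 32), so a = 5/9 and b = -160/9.
celsius = [(f - 32) * 5 / 9 for f in fahrenheit]

# Pure shift: new unit = 60 + old unit (a = 1, b = 60).
shifted = [f + 60 for f in fahrenheit]

a = 5 / 9
# Scaling by a multiplies SD and IQR by |a|.
print(statistics.stdev(celsius), a * statistics.stdev(fahrenheit))
print(iqr(celsius), a * iqr(fahrenheit))

# Adding a constant leaves SD and IQR unchanged.
print(statistics.stdev(shifted), statistics.stdev(fahrenheit))
print(iqr(shifted), iqr(fahrenheit))
```

Each printed pair should agree up to floating-point rounding.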