2.5 – Using the Mean and Standard Deviation to Describe Data
The standard deviation is a measure of the variability of sample data. The smaller the standard
deviation, the larger the percentage of measurements that lie close to the mean.
Chebyshev’s Rule:
For any number k > 1, at least $(1 - 1/k^2)$ of the measurements are within k standard
deviations of the mean; that is, at least $(1 - 1/k^2)$ of the measurements are in the interval
$(\bar{x} - ks,\ \bar{x} + ks)$.
Consequently:
a) If k = 2, then at least 3/4 = 75% of the measurements are within 2s of the mean.
b) If k = 3, then at least 8/9 ≈ 89% of the measurements are within 3s of the mean.
Chebyshev’s rule applies to a sample from any distribution.
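As a quick check of the two cases above, here is a minimal Python sketch of the Chebyshev bound. The function name chebyshev_lower_bound is illustrative and not part of the original notes.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum fraction of measurements within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the expression 1 - 1/k^2 is zero or negative, so the rule says nothing.
        raise ValueError("Chebyshev's Rule is informative only for k > 1")
    return 1 - 1 / k**2


for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1%} of the measurements")
```

The bound holds for any data set, which is why it is much weaker than the Empirical Rule percentages for mound-shaped data.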
Empirical Rule:
If a sample is taken from a mound-shaped distribution (symmetric and bell-like), then
a) Approximately 68% of the measurements will fall within 1 standard deviation of
the mean, that is, within the interval $(\bar{x} - s,\ \bar{x} + s)$.
b) Approximately 95% of the measurements will fall within 2 standard deviations
of the mean, that is, within the interval $(\bar{x} - 2s,\ \bar{x} + 2s)$.
c) Approximately 99.7% (essentially all) of the measurements will fall within 3
standard deviations of the mean, that is, within the interval $(\bar{x} - 3s,\ \bar{x} + 3s)$.
Range approximation for s (range = max - min):
$4s \le \text{range} \le 6s \;\Rightarrow\; \frac{\text{range}}{6} \le s \le \frac{\text{range}}{4}$
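The range approximation can be sketched in a few lines of Python. The function name rough_sd_bounds and the minimum and maximum used in the example (10 and 34) are hypothetical values chosen only for illustration.

```python
def rough_sd_bounds(minimum: float, maximum: float) -> tuple[float, float]:
    """Return (range/6, range/4), a rough interval for s in mound-shaped data."""
    data_range = maximum - minimum
    return data_range / 6, data_range / 4


# Hypothetical sample with min = 10 and max = 34, so range = 24.
low, high = rough_sd_bounds(10, 34)
print(f"{low} <= s <= {high}")
```

This is only a plausibility check on a computed s, not a substitute for computing it.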
Exercise: (Ex. 2.11, p. 80) The percentages of revenues spent on R&D by 50 companies are listed
below (already sorted).
5.2 5.6 5.9 6.0 6.5 6.5 6.5 6.6 6.8 6.9 6.9 6.9 7.1 7.1 7.2 7.2 7.4 7.5
7.5 7.7 7.7 7.8 7.9 8.0 8.0 8.1 8.2 8.2 8.2 8.4 8.5 8.8 9.0 9.2 9.4
9.5 9.5 9.6 9.7 9.9 10.1 10.5 10.5 10.6 11.1 11.3 11.7 13.2 13.5 13.5
1. Calculate the range and use it to obtain a rough approximation of s.
Ans: …… ≤ s ≤ ………
2. Compute $\bar{x}$ = …………… and s = ……………….
3. Calculate the intervals $(\bar{x} - ks,\ \bar{x} + ks)$ for k = 1, 2, 3, and for each interval give
a. The percentage estimated by Chebyshev’s Rule
b. The percentage estimated by the Empirical Rule
c. The actual percentage of observations in the interval. Compare it with a. and b.
k | Chebyshev’s Rule | Empirical Rule | Actual
1 |                  |                |
2 |                  |                |
3 |                  |                |
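One way to work through this exercise with software is sketched below in Python, using only the standard library. The data are the 50 values listed in the exercise; the helper name percent_within is illustrative. Note that for k = 1 the Chebyshev expression evaluates to 0, meaning the rule gives no information.

```python
import statistics

# The 50 R&D percentages from the exercise (already sorted).
rd_percentages = [
    5.2, 5.6, 5.9, 6.0, 6.5, 6.5, 6.5, 6.6, 6.8, 6.9, 6.9, 6.9, 7.1, 7.1, 7.2, 7.2, 7.4, 7.5,
    7.5, 7.7, 7.7, 7.8, 7.9, 8.0, 8.0, 8.1, 8.2, 8.2, 8.2, 8.4, 8.5, 8.8, 9.0, 9.2, 9.4,
    9.5, 9.5, 9.6, 9.7, 9.9, 10.1, 10.5, 10.5, 10.6, 11.1, 11.3, 11.7, 13.2, 13.5, 13.5,
]

# Part 1: rough bounds for s from the range.
data_range = max(rd_percentages) - min(rd_percentages)
print(f"{data_range / 6:.2f} <= s <= {data_range / 4:.2f}")

# Part 2: sample mean and sample standard deviation (divisor n - 1).
mean = statistics.mean(rd_percentages)
s = statistics.stdev(rd_percentages)
print(f"mean = {mean:.2f}, s = {s:.2f}")


def percent_within(data, center, spread, k):
    """Percentage of observations in the interval (center - k*spread, center + k*spread)."""
    low, high = center - k * spread, center + k * spread
    inside = sum(low <= x <= high for x in data)
    return 100 * inside / len(data)


# Part 3: compare the two rules with the actual percentages.
empirical_rule = {1: 68.0, 2: 95.0, 3: 99.7}
for k in (1, 2, 3):
    chebyshev = 100 * (1 - 1 / k**2)  # 0 for k = 1: the rule is uninformative there
    actual = percent_within(rd_percentages, mean, s, k)
    print(f"k={k}: Chebyshev >= {chebyshev:.1f}%, Empirical ~ {empirical_rule[k]}%, actual = {actual:.1f}%")
```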
2.6 – Measures of Relative Standing
Key concepts: percentile, z-score
z-score = the number of standard deviations that x lies above (if positive) or below
(if negative) the mean.
The (sample) z-score for a measurement x is the number
$z = \frac{x - \bar{x}}{s}$
• The mean of the z-scores is always 0, $\bar{z} = 0$ (see Exercise in Lecture 2.5).
• The standard deviation of the z-scores is always 1, $s_z = 1$.
• For a mound-shaped distribution:
1. Approximately 68% of the measurements will have a z-score between -1 and 1.
2. Approximately 95% of the measurements will have a z-score between -2 and 2.
3. Approximately 99.7% (almost all) of the measurements will have a z-score
between -3 and 3.
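The two "always" properties of z-scores are easy to confirm numerically. The sketch below uses a small hypothetical sample (not taken from the notes) and the sample standard deviation with divisor n - 1.

```python
import statistics


def z_scores(data):
    """Return the sample z-score (x - mean) / s for every measurement."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation
    return [(x - mean) / s for x in data]


sample = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]  # hypothetical measurements
z = z_scores(sample)

# Up to floating-point rounding, the mean is 0 and the standard deviation is 1.
print(f"mean of z-scores = {statistics.mean(z):.6f}")
print(f"sd of z-scores   = {statistics.stdev(z):.6f}")
```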
Exercise. Tom’s SAT z-score is 2.0. Assuming a mound-shaped distribution, this means
that
a) approximately 2.5% of the students who took the SAT scored better than Tom
b) approximately 97.5% of the students who took the SAT scored worse than Tom
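The 2.5% and 97.5% figures come from the Empirical Rule: about 95% of z-scores lie between -2 and 2, and by symmetry the remaining 5% splits evenly between the two tails. As a rough cross-check, the sketch below assumes the scores are exactly normal, which is a stronger assumption than "mound-shaped", and uses Python's statistics.NormalDist.

```python
from statistics import NormalDist

# Assumption: scores follow an exactly normal distribution, so z is standard normal.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

z = 2.0
fraction_below = standard_normal.cdf(z)
fraction_above = 1 - fraction_below

print(f"scored worse:  {fraction_below:.1%}")
print(f"scored better: {fraction_above:.1%}")
```

Under exact normality the tail above z = 2 is a little under 2.5%, which is consistent with the rounded Empirical Rule figures.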
The pth percentile (of a data set). Intuitively, the pth percentile is a number such that
approximately p% of the measurements (arranged in ascending order) fall below the pth
percentile and (100 - p)% fall above it.
Methods of computing vary; different software or calculators may give different answers.
Important percentiles:
• The median = 50th percentile
• The lower quartile QL = 25th percentile
• The upper quartile QU = 75th percentile
[Figure: number line marking Min, QL, median, QU, Max]
2.7 – Methods for Detecting Outliers: Box Plot and z-Scores
Key concepts: interquartile range, inner and outer fences, outlier, box plot
The interquartile range (IQR) is the distance between the lower and upper quartiles:
IQR = QU - QL
Inner fences:
Lower inner fence (LIF) = QL - 1.5 IQR
Upper inner fence (UIF) = QU + 1.5 IQR
Outer fences:
Lower outer fence (LOF) = QL - 3 IQR
Upper outer fence (UOF) = QU + 3 IQR
An outlier is an observation (or measurement) that is unusually large or small relative to
the other values in a data set.
Rules of Thumb for Detecting Outliers
• Box Plot Method:
  o Observations falling beyond the inner fences are called outliers.
  o Observations falling between the inner fences and the outer fences are called
    suspect outliers.
  o Observations falling beyond the outer fences are deemed highly suspect
    outliers.
• z-scores: Observations with z-scores greater than 3 in absolute value are considered
outliers.
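The box plot rule of thumb can be sketched as a small Python function. The name classify_by_fences, the label strings, and the quartiles in the example (QL = 10, QU = 20) are hypothetical choices made for illustration.

```python
def classify_by_fences(x: float, q_l: float, q_u: float) -> str:
    """Classify one observation using the inner and outer fences."""
    iqr = q_u - q_l
    lower_inner, upper_inner = q_l - 1.5 * iqr, q_u + 1.5 * iqr
    lower_outer, upper_outer = q_l - 3 * iqr, q_u + 3 * iqr

    if x < lower_outer or x > upper_outer:
        return "highly suspect outlier"  # beyond the outer fences
    if x < lower_inner or x > upper_inner:
        return "suspect outlier"         # between the inner and outer fences
    return "not an outlier"              # inside the inner fences


# Hypothetical quartiles: IQR = 10, inner fences (-5, 35), outer fences (-20, 50).
for value in (30, 40, 60):
    print(value, "->", classify_by_fences(value, q_l=10, q_u=20))
```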
The numbers minimum, QL, median, QU, and maximum are called the five-number summary.
Box Plot: a graphical display of the five-number summary (minimum, QL, median, QU,
maximum) and any outliers. A box plot can be drawn horizontally or vertically.
[Figure: box plot with min, QL, median, QU, max, and outliers marked]
Example. Data: 45 46 49 35 76 80 89 109 37 61 62 64 68 56 57 57 59 71 72
(n=19)
Sorted Data: 35 37 45 46 49 56 57 57 59 61 62 64 68 71 72 76 80 89 109
Mean = 62.79, s = 18.00, Min = 35, Max = 109, Median = 61
TI-83:
QL = 49, QU = 72, IQR = 72 - 49 = 23
LIF = 49 - 1.5×23 = 14.5, UIF = 72 + 1.5×23 = 106.5
LOF = 49 - 3×23 = -20, UOF = 72 + 3×23 = 141
Excel:
QL = 52.5, QU = 71.5, IQR = 71.5 - 52.5 = 19
LIF = 52.5 - 1.5×19 = 24, UIF = 71.5 + 1.5×19 = 100
LOF = 52.5 - 3×19 = -4.5, UOF = 71.5 + 3×19 = 128.5
(Note that Excel gives different quartiles and fences, and hence possibly different outliers.)
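The dependence on the quartile method can be reproduced with Python's statistics.quantiles, which offers two interpolation methods. For this particular data set the "exclusive" method happens to give the TI-83 quartiles and the "inclusive" method the Excel ones. That is an observation about these 19 values, not a claim that either tool uses exactly these algorithms. The helper name fences is illustrative.

```python
import statistics

data = [45, 46, 49, 35, 76, 80, 89, 109, 37, 61, 62, 64, 68, 56, 57, 57, 59, 71, 72]


def fences(q_l, q_u):
    """Inner and outer fences computed from the lower and upper quartiles."""
    iqr = q_u - q_l
    return {
        "LIF": q_l - 1.5 * iqr,
        "UIF": q_u + 1.5 * iqr,
        "LOF": q_l - 3 * iqr,
        "UOF": q_u + 3 * iqr,
    }


results = {}
for method in ("exclusive", "inclusive"):
    # n=4 splits the data into quarters and returns [QL, median, QU].
    q_l, median, q_u = statistics.quantiles(data, n=4, method=method)
    results[method] = (q_l, median, q_u, fences(q_l, q_u))
    print(method, "QL =", q_l, "median =", median, "QU =", q_u, results[method][3])
```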
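1. Determine if there are any outliers (using the Box Plot method).
Answer: 109 is a suspect outlier; there are no highly suspect outliers.
2. Determine if there are any outliers (using the z-scores method).
Answer: $\bar{x} - 3s$ = 62.79 - 3(18.00) = 8.79 and $\bar{x} + 3s$ = 62.79 + 3(18.00) = 116.79.
All observations lie between these values, so there are no outliers in the z-score sense.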
3. Make a box plot
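The z-score check in part 2 can be verified with a short Python sketch over the same 19 observations. It uses the sample standard deviation with divisor n - 1; the variable names are illustrative.

```python
import statistics

data = [35, 37, 45, 46, 49, 56, 57, 57, 59, 61, 62, 64, 68, 71, 72, 76, 80, 89, 109]

mean = statistics.mean(data)
s = statistics.stdev(data)  # sample standard deviation

# An observation is flagged when its z-score exceeds 3 in absolute value.
z_outliers = [x for x in data if abs((x - mean) / s) > 3]
largest_abs_z = max(abs((x - mean) / s) for x in data)

print(f"interval: ({mean - 3 * s:.2f}, {mean + 3 * s:.2f})")
print(f"largest |z| = {largest_abs_z:.2f}, outliers: {z_outliers}")
```

The box plot method and the z-score method can disagree, as they do here for the value 109.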