Chapter 2 Part 2

MEASURES OF SPREAD
1. Range: Maximum - Minimum
2. IQR (interquartile range) = $Q_3 - Q_1$
   (Five-number summary: Minimum, $Q_1$, Median, $Q_3$, Maximum)
3. Standard Deviation
PERCENTILES
- p-th percentile: the value below which (roughly) p% of the data points lie. When p = 50 we get the median.
- The 25th percentile is called the FIRST QUARTILE and is denoted by $Q_1$.
- The 75th percentile is called the THIRD QUARTILE and is denoted by $Q_3$.
These are useful in determining "outliers" and can be represented in a "box plot".
These were discussed in your recitation.
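As a rough illustration of the five-number summary and the IQR, the sketch below uses Python's standard `statistics.quantiles`. The data values and the helper name `five_number_summary` are made up for illustration, and the `inclusive` method is only one of several quartile conventions, so a textbook or other software may report slightly different values of $Q_1$ and $Q_3$.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

# Hypothetical data set, chosen only for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
minimum, q1, median, q3, maximum = five_number_summary(data)

data_range = maximum - minimum  # Range: Maximum - Minimum
iqr = q3 - q1                   # IQR: Q3 - Q1

print("five-number summary:", (minimum, q1, median, q3, maximum))
print("range:", data_range, "IQR:", iqr)
```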
STANDARD DEVIATION
Standard deviation is a measure of the amount of deviation
from the mean.
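Computation
Let $x_1, x_2, \ldots, x_n$ be a data set with mean $\bar{x}$.

| $x$-values | $(x - \bar{x})$ | $(x - \bar{x})^2$ |
| $x_1$ | $(x_1 - \bar{x})$ | $(x_1 - \bar{x})^2$ |
| $x_2$ | $(x_2 - \bar{x})$ | $(x_2 - \bar{x})^2$ |
| $x_3$ | $(x_3 - \bar{x})$ | $(x_3 - \bar{x})^2$ |
| $\vdots$ | $\vdots$ | $\vdots$ |
| $x_n$ | $(x_n - \bar{x})$ | $(x_n - \bar{x})^2$ |
| Sum | $0$ | $\sum_{i=1}^{n} (x_i - \bar{x})^2$ |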
- Population variance: $\sigma^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$
- Population standard deviation: $\sigma = \sqrt{\text{Population variance}}$
- Sample variance: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
- Sample standard deviation: $s = \sqrt{\text{Sample variance}}$
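The population and sample formulas differ only in the divisor ($n$ versus $n-1$). A minimal Python sketch of both, assuming a small made-up data set; the function names are hypothetical, and the standard library's `statistics.pvariance` and `statistics.variance` are used only as a cross-check.

```python
import math
import statistics

def sum_squared_deviations(data):
    """Sum of (x_i - mean)^2 over the data set."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)

def population_variance(data):
    # Divide by n when the data set is the whole population.
    return sum_squared_deviations(data) / len(data)

def sample_variance(data):
    # Divide by n - 1 when the data set is a sample.
    return sum_squared_deviations(data) / (len(data) - 1)

# Hypothetical data, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = population_variance(data)
pop_sd = math.sqrt(pop_var)
samp_var = sample_variance(data)
samp_sd = math.sqrt(samp_var)

# Cross-check against the standard library.
assert math.isclose(pop_var, statistics.pvariance(data))
assert math.isclose(samp_var, statistics.variance(data))

print("population variance:", pop_var, "population s.d.:", pop_sd)
print("sample variance:", samp_var, "sample s.d.:", samp_sd)
```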
Why the square in $(x - \bar{x})^2$?
$\sum (x_i - \bar{x}) = \sum x_i - n\bar{x} = 0$.
Some of the deviations from the mean are positive and some are negative, so when summed they cancel each other out. Squaring makes every term nonnegative, so the deviations can no longer cancel.
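The cancellation can be seen numerically. The sketch below uses a made-up data set: the raw deviations sum to zero, while the squared deviations do not.

```python
# Hypothetical data, chosen only for illustration.
data = [3, 7, 8, 12, 15]
mean = sum(data) / len(data)

deviations = [x - mean for x in data]

# Positive and negative deviations cancel each other out.
total = sum(deviations)

# Squared deviations are all nonnegative, so they accumulate.
squared_total = sum(d ** 2 for d in deviations)

print("deviations:", deviations)
print("sum of deviations:", total)
print("sum of squared deviations:", squared_total)
```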
Why take the square root of the variance?
This ensures that the measure of variability – standard
deviation – is in the same unit as the data.
Problem 2.57
Find the range, variance and standard deviation of the data 4, 2, 1, 0, 1.
- Maximum is 4, minimum is 0, so the range is 4 − 0 = 4.
- $\bar{x} = 1.6$
1. Table of deviations:

| $x$-values | $(x - \bar{x})$ | $(x - \bar{x})^2$ |
| 4 | $(4 - 1.6)$ | $(4 - 1.6)^2 = 5.76$ |
| 2 | $(2 - 1.6)$ | $(2 - 1.6)^2 = 0.16$ |
| 1 | $(1 - 1.6)$ | $(1 - 1.6)^2 = 0.36$ |
| 0 | $(0 - 1.6)$ | $(0 - 1.6)^2 = 2.56$ |
| 1 | $(1 - 1.6)$ | $(1 - 1.6)^2 = 0.36$ |
| Sum | 0 | 9.20 |

2. Sample variance = 9.20/4 = 2.3
3. Sample s.d. = $\sqrt{2.3} \approx 1.52$
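The arithmetic in Problem 2.57 can be double-checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample ($n-1$) divisor. This is only a check of the numbers above, not part of the original solution.

```python
import statistics

data = [4, 2, 1, 0, 1]

data_range = max(data) - min(data)   # 4 - 0
mean = statistics.mean(data)         # x-bar
s2 = statistics.variance(data)       # sample variance (divides by n - 1)
s = statistics.stdev(data)           # sample standard deviation

print("range:", data_range)
print("mean:", mean)
print("sample variance:", round(s2, 2))
print("sample s.d.:", round(s, 2))
```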
Properties of Variance
- $\mathrm{Var}(x_1, x_2, \ldots, x_n) \ge 0$, and it equals 0 only when $x_1 = x_2 = \ldots = x_n$.
- $\mathrm{Var}(x_1 + b, x_2 + b, \ldots, x_n + b) = \mathrm{Var}(x_1, x_2, \ldots, x_n)$. Adding a constant does not change the variance.
- $\mathrm{Var}(ax_1, ax_2, \ldots, ax_n) = a^2\,\mathrm{Var}(x_1, x_2, \ldots, x_n)$
- $\mathrm{Var}(ax_1 + b, ax_2 + b, \ldots, ax_n + b) = a^2\,\mathrm{Var}(x_1, x_2, \ldots, x_n)$
Properties of Standard Deviation
- $\mathrm{S.D}(x_1, x_2, \ldots, x_n) \ge 0$, and it equals 0 only when $x_1 = x_2 = \ldots = x_n$.
- $\mathrm{S.D}(x_1 + b, x_2 + b, \ldots, x_n + b) = \mathrm{S.D}(x_1, x_2, \ldots, x_n)$. Adding a constant does not change the standard deviation.
- $\mathrm{S.D}(ax_1, ax_2, \ldots, ax_n) = |a|\,\mathrm{S.D}(x_1, x_2, \ldots, x_n)$
- $\mathrm{S.D}(ax_1 + b, ax_2 + b, \ldots, ax_n + b) = |a|\,\mathrm{S.D}(x_1, x_2, \ldots, x_n)$
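These shift and scale properties are easy to check numerically. The sketch below is only a spot check on one made-up data set with arbitrarily chosen constants `a` and `b`; it is not a proof.

```python
import math
import statistics

# Hypothetical data and constants, chosen only for illustration.
data = [2, 5, 7, 10]
a, b = -3, 10

shifted = [x + b for x in data]          # x_i + b
scaled = [a * x for x in data]           # a * x_i
transformed = [a * x + b for x in data]  # a * x_i + b

var = statistics.variance(data)
sd = statistics.stdev(data)

checks = {
    "Var(x + b) = Var(x)": math.isclose(statistics.variance(shifted), var),
    "Var(ax) = a^2 Var(x)": math.isclose(statistics.variance(scaled), a**2 * var),
    "Var(ax + b) = a^2 Var(x)": math.isclose(statistics.variance(transformed), a**2 * var),
    "SD(x + b) = SD(x)": math.isclose(statistics.stdev(shifted), sd),
    "SD(ax) = |a| SD(x)": math.isclose(statistics.stdev(scaled), abs(a) * sd),
    "SD(ax + b) = |a| SD(x)": math.isclose(statistics.stdev(transformed), abs(a) * sd),
}

for name, holds in checks.items():
    print(name, "->", holds)
```

A negative `a` is used on purpose: it shows why the standard deviation scales by $|a|$ rather than by $a$.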
What can we say about the histogram if we know the mean and s.d.?
CHEBYSHEV'S RULE
- At least $\left(1 - \dfrac{1}{k^2}\right)$ of the histogram lies within $k$ standard deviations ($ks$) of the mean.
- k = 2: at least 75% of the observations lie within 2 standard deviations of the mean.
- k = 3: at least 8/9, approximately 89%, of the observations lie within 3 standard deviations of the mean.
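Chebyshev's bound $1 - 1/k^2$ can be computed directly and compared with the actual fraction of a data set that lies within $k$ standard deviations of the mean. In the sketch below the function names and the skewed data set are made up for illustration.

```python
import statistics

def chebyshev_lower_bound(k):
    """Chebyshev's guaranteed minimum fraction within k s.d.'s of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2

def fraction_within(data, k):
    """Observed fraction of data within k sample s.d.'s of the sample mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    inside = sum(1 for x in data if abs(x - mean) <= k * s)
    return inside / len(data)

# Hypothetical, deliberately skewed data: the rule needs no bell shape.
data = [1, 1, 1, 2, 2, 3, 4, 5, 8, 20]

for k in (2, 3):
    print(f"k={k}: bound {chebyshev_lower_bound(k):.3f}, "
          f"observed {fraction_within(data, k):.3f}")
```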
If the histogram is bell shaped, then
- Approximately 68% of the observations lie within $(\bar{x} - s,\ \bar{x} + s)$
- Approximately 95% of the observations lie within $(\bar{x} - 2s,\ \bar{x} + 2s)$
- Approximately 99.7% of the observations lie within $(\bar{x} - 3s,\ \bar{x} + 3s)$
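The 68 / 95 / 99.7 percentages can be illustrated by simulation. The sketch below assumes a normally distributed (bell-shaped) sample generated with Python's `random` module; the mean of 50, the s.d. of 10, the sample size and the seed are arbitrary choices, so the printed fractions will only be close to the stated percentages.

```python
import random
import statistics

def fraction_within(data, k):
    """Observed fraction of data within k sample s.d.'s of the sample mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    inside = sum(1 for x in data if abs(x - mean) <= k * s)
    return inside / len(data)

# Simulated bell-shaped data (arbitrary parameters, fixed seed for repeatability).
rng = random.Random(0)
sample = [rng.gauss(50, 10) for _ in range(20_000)]

fractions = {k: fraction_within(sample, k) for k in (1, 2, 3)}
for k, frac in fractions.items():
    print(f"within {k} s.d.: {frac:.3f}")
```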
prob 153, 144
Problem 2.144
If the range of a set of data is 20, find a rough approximation to the s.d. of the data set.
- At least 75% of the data falls within $(\bar{x} - 2s,\ \bar{x} + 2s)$
- i.e. within an interval of length $(\bar{x} + 2s) - (\bar{x} - 2s) = 4s$
- so $4s \approx \text{range} = 20$
- so $s \approx \dfrac{20}{4} = 5$
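The rule of thumb used in this problem, $4s \approx \text{range}$, amounts to a one-line calculation. The function name below is hypothetical, and the result is only a rough approximation, as the problem statement says.

```python
def rough_sd_from_range(data_range):
    """Rough s.d. estimate from the range, using 4s ≈ range."""
    return data_range / 4

print(rough_sd_from_range(20))  # 5.0
```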
Numerical measures of relative standing
Let
$\mu$ be the mean of a data set when the data set is the population
$\sigma$ be the s.d. of a data set when the data set is the population
$\bar{x}$ be the mean of a data set when the data set is a sample
$s$ be the s.d. of a data set when the data set is a sample
For any value $x$,
The population z-score of $x$ is $z = \dfrac{x - \mu}{\sigma}$
The sample z-score of $x$ is $z = \dfrac{x - \bar{x}}{s}$
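Both z-score formulas translate directly into code. In the sketch below the function names and the example numbers are made up for illustration; the sample version estimates the mean and s.d. from the sample itself using the standard `statistics` module.

```python
import statistics

def population_z_score(x, mu, sigma):
    """z = (x - mu) / sigma, with known population mean and s.d."""
    return (x - mu) / sigma

def sample_z_score(x, sample):
    """z = (x - x_bar) / s, with x_bar and s computed from the sample."""
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample s.d. (divides by n - 1)
    return (x - x_bar) / s

# Hypothetical population parameters: mean 100, s.d. 15.
print(population_z_score(130, 100, 15))  # 2.0, i.e. two s.d.'s above the mean

# Hypothetical sample.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(round(sample_z_score(9, sample), 2))
```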
The z-score is a measure of "how many s.d.'s x is away from the mean".
"x is more than k standard deviations away from the mean" is the same as $|x - \mu| > k\sigma$.
For example, if $x - \mu > k\sigma$ then $\dfrac{x - \mu}{\sigma} > k$, i.e. $z > k$.
So "x is larger than $\mu + k\sigma$" is equivalent to "the z-value of x is larger than $k$".
Similarly, "x is smaller than $\mu - k\sigma$" is equivalent to "the z-value of x is smaller than $-k$".
So in terms of z-values
- At least 75% of the observations have z-values less than 2 in absolute value
- At least 8/9, approximately 89%, of the observations have z-values less than 3 in absolute value
- Put differently, at most 1/9, approximately 11%, of the observations have z-values larger than 3 in absolute value
If the histogram is bell shaped, then
- Approximately 68% of the observations have z-values less than 1 in absolute value
- Approximately 95% of the observations have z-values less than 2 in absolute value
- Approximately 99.7% of the observations have z-values less than 3 in absolute value
Since values that are far away from the mean have very large or very small (negative) z-scores, we can use z-scores to define "outliers".
Observations with z-scores greater than 3 in absolute value are
considered outliers.
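The outlier rule above can be sketched as a small filter. The function name, the default threshold argument and the data set are made up for illustration; sample z-scores are used, with the mean and s.d. computed from the data itself.

```python
import statistics

def z_score_outliers(data, threshold=3):
    """Return the observations whose sample z-score exceeds threshold in absolute value."""
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)
    return [x for x in data if abs((x - x_bar) / s) > threshold]

# Hypothetical data: nineteen values near 11 and one value far from the rest.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 10, 11, 13, 12, 11, 10, 12, 11, 60]

outliers = z_score_outliers(data)
print("outliers:", outliers)
```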
problems 139,140,161