Numerical Summary Measures
Lecture 03: Measures of Variation
and Interpretation, and Measures of
Relative Position
Measures of Variation
• Consider the following three data sets:
– Data 1: 1, 2, 3, 4, 5
– Data 2: 1, 1, 3, 5, 5
– Data 3: 3, 3, 3, 3, 3
• For these data sets, the mean and the median are
clearly identical.
• But, they are different data sets!
• Hence the need to measure the variation in the data.
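The point can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the dictionary and variable names are illustrative only, and the variance reported is the sample variance with divisor n-1.

```python
import statistics

# The three data sets from the slide.
data_sets = {
    "Data 1": [1, 2, 3, 4, 5],
    "Data 2": [1, 1, 3, 5, 5],
    "Data 3": [3, 3, 3, 3, 3],
}

summary = {}
for name, values in data_sets.items():
    # (mean, median, sample variance) for each data set
    summary[name] = (
        statistics.mean(values),
        statistics.median(values),
        statistics.variance(values),
    )
    print(name, summary[name])
```

All three data sets share the same mean and median, but their variances differ, which is what distinguishes them.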
On the Perils of an “Average Value”
• Situation: Man has his
head in a very hot
compartment, and his
feet feeling very cold.
• Question: Mr., how
are you feeling?
• Reply: Oh, on the
average, I am just
fine! …
• Crash! Dead!
Sample Variance
• To measure the degree of variation, one could look at
the values of the deviations of the observations
from their sample mean.
• The sample variance, denoted by S^2, is defined to
be the ‘average’ of the squared deviations of the
observations from their sample mean:

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2$$
Computational Formula
• Definitional formula not very efficient for
purposes of computation of the sample variance.
• The computational formula is oftentimes used.

$$S^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right] = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)^2\right]$$
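As a rough check that the definitional and computational formulas agree, here is a minimal Python sketch. The function names and the small sample are illustrative assumptions, not part of the lecture.

```python
def variance_definitional(xs):
    """Sample variance as the sum of squared deviations over n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_computational(xs):
    """Sample variance from the sum of X and the sum of X^2 only."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total ** 2 / n) / (n - 1)


sample = [1, 1, 3, 5, 5]
print(variance_definitional(sample), variance_computational(sample))
```

The computational form needs only two running sums, which is why it is convenient by hand or in a spreadsheet. In floating-point arithmetic it can lose accuracy when the mean is large relative to the spread, so the definitional form is often the safer choice in software.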
Properties
• It has squared units … which leads to defining the
standard deviation.
• It is always nonnegative, and equals zero if and
only if all the observations are identical.
• The larger the value, the more variation in the
data.
• The divisor of (n-1) instead of n makes the sample
variance “unbiased” for the population variance
(σ^2) … will be explained when we get into
inference.
Standard Deviation
• The sample standard deviation, denoted by S, is
the positive square root of the sample variance.
• Purpose: to have a measure with the same units of
measurements as the original observations.

$$S = \sqrt{S^2}$$
Illustration of Computation
• Data set in the example for the mean and median.
• Data: 122, 135, 110, 126, 100, 110, 110, 126, 94,
124, 108, 110, 92, 98, 118, 110, 102, 108, 126,
104, 110, 120, 110, 118, 100, 110, 120, 100, 120,
92
• We illustrate computations using the definitional
and computational formulas in a spreadsheet-type
format.
Example … continued
• The spreadsheet-type table on the next slide is
obtained from an Excel worksheet.
• The first three columns illustrate the computation
using the definitional formula.
• The last column is used to illustrate the
computation using the computational formula.
• Details will be provided in class!
Stat 700
Computation of the Variance and Standard Deviation

X      Dev=X-Mean   Dev^2     X^2
122    10.90        118.81    14884
135    23.90        571.21    18225
110    -1.10        1.21      12100
126    14.90        222.01    15876
100    -11.10       123.21    10000
110    -1.10        1.21      12100
110    -1.10        1.21      12100
126    14.90        222.01    15876
94     -17.10       292.41    8836
124    12.90        166.41    15376
108    -3.10        9.61      11664
110    -1.10        1.21      12100
92     -19.10       364.81    8464
98     -13.10       171.61    9604
118    6.90         47.61     13924
110    -1.10        1.21      12100
102    -9.10        82.81     10404
108    -3.10        9.61      11664
126    14.90        222.01    15876
104    -7.10        50.41     10816
110    -1.10        1.21      12100
120    8.90         79.21     14400
110    -1.10        1.21      12100
118    6.90         47.61     13924
100    -11.10       123.21    10000
110    -1.10        1.21      12100
120    8.90         79.21     14400
100    -11.10       123.21    10000
120    8.90         79.21     14400
92     -19.10       364.81    8464

Sum_Of_X = 3333
Mean_Of_X = 111.1
Sum_Of_Dev = 0.00
Sum_Of_Dev^2 = 3580.70
Sum_Of_X^2 = 373877.00
Variance_Of_X = 123.47 (from the Dev^2 column) and 123.47 (from the X^2 column)
Standard_Dev_Of_X = 11.11 (from the Dev^2 column) and 11.11 (from the X^2 column)
Explanations of Columns in the Sheet
• Column 1: contains the values of X, Sum of X,
and Sample Mean.
• Column 2: contains the deviations, Dev = X - SampleMean, and the Sum of Deviations.
• Column 3: contains the squared deviations, Sum
of squared deviations, variance, and the standard
deviation (via definitional formula).
• Column 4: contains the squared X; sum of squared
X, and the variance (via the computational
formula).
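The four columns can also be reproduced outside a spreadsheet. The sketch below is one possible way to do it in plain Python; the variable names (`dev`, `dev_sq`, `x_sq`, and so on) are chosen here to mirror the column headings and are not taken from the original worksheet.

```python
import math

# Column 1: the 30 observations.
x = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124,
     108, 110, 92, 98, 118, 110, 102, 108, 126, 104,
     110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

n = len(x)
sum_x = sum(x)
mean_x = sum_x / n

# Column 2: deviations from the sample mean (they sum to zero up to rounding).
dev = [xi - mean_x for xi in x]

# Column 3: squared deviations, used by the definitional formula.
dev_sq = [d ** 2 for d in dev]
var_definitional = sum(dev_sq) / (n - 1)

# Column 4: squared observations, used by the computational formula.
x_sq = [xi ** 2 for xi in x]
var_computational = (sum(x_sq) - sum_x ** 2 / n) / (n - 1)

std_dev = math.sqrt(var_definitional)

print(sum_x, round(mean_x, 1), round(sum(dev_sq), 2), sum(x_sq))
print(round(var_definitional, 2), round(var_computational, 2), round(std_dev, 2))
```

Both routes should give the same variance, matching the bottom rows of the sheet.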
Population Parameters (Analogs)
• If the quantities are computed from the population
values, then we obtain population parameters such
as the mean, variance, and standard deviation.
• The notation is as follows:

Symbols used for the                       Mean        Variance      Standard Deviation
Sample (based on sample values)            $\bar{X}$   $S^2$         $S$
Population (based on population values)    $\mu$       $\sigma^2$    $\sigma$
Information from Mean and Standard
Deviation
• Empirical Rule: For symmetric mound-shaped
distributions:
– Percentage of all observations within 1 standard
deviation of the mean is approximately 68%.
– Percentage of all observations within 2 standard
deviations of the mean is approximately 95%.
– Percentage of all observations within 3 standard
deviations of the mean is approximately 100%.
– Thus, usually no observations will be more than 3
standard deviations from the mean!
Information … continued
• Chebyshev’s Rule: For any distribution (be it symmetric,
skewed, bi-modal, etc.), we always have that:
– Percentage of all observations within 1 standard
deviation of the mean is at least 0%.
– Percentage of all observations within 2 standard
deviations of the mean is at least 75%.
– Percentage of all observations within 3 standard
deviations of the mean is at least 88.89%.
– More generally, for k ≥ 1, the percentage of observations within k
standard deviations of the mean is at least 100(1 - 1/k^2)%.
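The bounds listed above follow directly from the general expression. A small Python sketch, with a hypothetical helper name `chebyshev_lower_bound`, makes the arithmetic explicit:

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    # For k < 1 the expression is negative, so the bound is reported as 0%.
    return max(0.0, 100 * (1 - 1 / k ** 2))


for k in (1, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 2))
```

These are guaranteed minimums for any distribution, so observed percentages are typically well above them.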
Illustration of these Rules
• Consider the sample data with 30 observations considered
earlier.
• Data: 122, 135, 110, 126, 100, 110, 110, 126, 94, 124, 108,
110, 92, 98, 118, 110, 102, 108, 126, 104, 110, 120, 110,
118, 100, 110, 120, 100, 120, 92
• Recall that:
– Sample mean = 111.1
– Sample standard deviation = 11.11
• We compute the percentages of observations in intervals of the form
[Mean - kS, Mean + kS] for k = 1, 2, 3.
Percentages in Certain Intervals
Interval             Limits of Interval   Number of Observations   Percentage of Observations
Within 1S of Mean    [100, 122.2]         21                       70.00
Within 2S of Mean    [88.9, 133.3]        29                       96.67
Within 3S of Mean    [77.8, 144.4]        30                       100.00

Lower Limit = (Sample mean) - 2(Std Dev) = 111.1 - 2(11.1) = 88.9
Upper Limit = (Sample mean) + 2(Std Dev) = 111.1 + 2(11.1) = 133.3
By going through the 30 observations, 29 of the observations are between
88.9 and 133.3, which is (29/30)(100) = 96.67% of all the observations.
Note that the observed percentages certainly satisfy the lower bounds
provided by Chebyshev's Inequality.
Also, note that the observed percentages are very close to the percentages
specified by the Empirical Rule. This is because the histogram is somewhat
symmetric.
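The counts in the table can be verified with a short script. This is a sketch using Python's standard `statistics` module; the helper `percent_within` is a name introduced here for illustration.

```python
import statistics

data = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124,
        108, 110, 92, 98, 118, 110, 102, 108, 126, 104,
        110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample standard deviation (divisor n - 1)


def percent_within(values, center, spread, k):
    """Count and percentage of values in [center - k*spread, center + k*spread]."""
    lower, upper = center - k * spread, center + k * spread
    count = sum(lower <= v <= upper for v in values)
    return count, 100 * count / len(values)


results = {k: percent_within(data, mean, sd, k) for k in (1, 2, 3)}
for k, (count, pct) in results.items():
    print(f"within {k}S: {count} observations, {pct:.2f}%")
```

The only observation outside the 2S interval is 135.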
Measure of Relative Standing:
Z-Score
Given a data set, the z-score, called the standardized score,
associated with an observation whose value is x is given by

$$Z = \frac{x - \bar{X}}{S}.$$

It measures the distance of x from the sample mean in terms
of the number of standard deviations. A negative (positive)
value indicates the value x is smaller (larger) than the sample
mean.
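As an illustration, the z-scores of the largest and smallest values in the 30-observation data set used earlier can be computed as follows. This is a minimal sketch; the function name `z_score` is an assumption made for the example.

```python
import statistics

data = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124,
        108, 110, 92, 98, 118, 110, 102, 108, 126, 104,
        110, 120, 110, 118, 100, 110, 120, 100, 120, 92]


def z_score(x, values):
    """Distance of x from the sample mean, in units of the sample standard deviation."""
    return (x - statistics.mean(values)) / statistics.stdev(values)


z_max = z_score(max(data), data)  # positive: 135 lies above the mean
z_min = z_score(min(data), data)  # negative: 92 lies below the mean
print(round(z_max, 2), round(z_min, 2))
```

The largest value, 135, is a little more than 2 standard deviations above the mean, consistent with it being the one observation outside the 2S interval.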
Percentiles
• Given a set of n observations, the 100p-th
percentile, where 0 < p < 1, is the value that is
larger than 100p% of the observations and smaller
than 100(1-p)% of the observations.
• For example, the 95th percentile is the value larger
than 95% of all the observations and it is smaller
than 5% of all the observations.
Measures of Relative Standing: Quartiles
• The first quartile, denoted by Q1, is the 25th
percentile of the data set.
• The third quartile, denoted by Q3, is the
75th percentile of the data set.
• The second quartile, which is the 50th
percentile, is simply the median of the data
set, M.
Computing the Quartiles
• Divide the arranged data set into two parts using
the median as cut-off.
• If the sample size n is odd, then the median should
be included in each group; while if n is even then
the median is not included in either group.
• First quartile (Q1) is the median of the lower
group.
• Third quartile (Q3) is the median of the upper
group.
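The rule above can be sketched in Python as follows. The function name `quartiles` is illustrative. Note that library routines such as `statistics.quantiles` use different interpolation conventions and may return slightly different values, so this sketch implements the lecture's median-split rule directly.

```python
import statistics


def quartiles(values):
    """Return (Q1, M, Q3) using the median-split rule from the slide."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # Odd n: the median is included in both the lower and upper groups.
        lower = ordered[: half + 1]
        upper = ordered[half:]
    else:
        # Even n: the median falls between two values and is in neither group.
        lower = ordered[:half]
        upper = ordered[half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


data = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124,
        108, 110, 92, 98, 118, 110, 102, 108, 126, 104,
        110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

q1, m, q3 = quartiles(data)
print(q1, m, q3)
```

For an odd-sized sample such as 1, 2, 3, 4, 5, the same function puts the median 3 in both halves, giving Q1 = 2 and Q3 = 4.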
Example: Quartile Computation
• Arranged Data:
• 92, 92, 94, 98, 100, 100, 100, 102, 104, 108, 108, 110, 110,
110, 110, 110, 110, 110, 110, 118, 118, 120, 120, 120, 122,
124, 126, 126, 126, 135
• M = 110 = average of 15th and 16th values.
• Q1 = in 8th position = 102
• Q3 = in 23rd position = 120.
Box Plots
• Another graphical summary of the data is provided by the
boxplot. This provides information about the presence of
outliers.
• Steps in constructing a boxplot are as follows:
– Calculate M, Q1, Q3, and the minimum and maximum
values.
– Form a box with left and right ends being at Q1 and Q3,
respectively.
– Draw a vertical line in the box at the location of the
median.
– Connect the min and max values to the box by lines.
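To make the construction steps concrete, here is a rough text-only sketch that draws a one-line box plot from a five-number summary. The function `ascii_boxplot`, its `width` parameter, and the characters used are all assumptions made for this illustration; statistical software such as Minitab produces a proper graphic instead.

```python
def ascii_boxplot(minimum, q1, median, q3, maximum, width=44):
    """Draw a one-line text box plot from a five-number summary."""
    span = maximum - minimum
    if span <= 0:
        raise ValueError("maximum must exceed minimum")

    def col(value):
        # Map a data value to a character column in [0, width - 1].
        return round((value - minimum) / span * (width - 1))

    chars = [" "] * width
    for c in range(col(minimum), col(q1)):
        chars[c] = "-"  # line connecting the minimum to the box
    for c in range(col(q1), col(q3) + 1):
        chars[c] = "="  # the box from Q1 to Q3
    for c in range(col(q3) + 1, col(maximum) + 1):
        chars[c] = "-"  # line connecting the box to the maximum
    chars[col(q1)] = "["
    chars[col(q3)] = "]"
    chars[col(median)] = "|"  # median line inside the box
    return "".join(chars)


# Five-number summary of the 30-observation data set used earlier.
plot = ascii_boxplot(92, 102, 110, 120, 135)
print(plot)
```

With the default width of 44 characters, each column corresponds to one unit of the data for this particular summary, since the range is 135 - 92 = 43.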
The BoxPlot
• For the systolic blood pressure data set, the resulting
boxplot, obtained using Minitab, is shown below.
[Figure: Minitab boxplot of the systolic blood pressure data, with the positions HV, Q3, M, Q1, and LV labeled.]
Comparative BoxPlots
The boxplot could also be used to make a comparison of
the distributions of different groups. This could be
achieved by presenting the boxplots of the different
groups in a side-by-side manner.
We demonstrate this idea using the Beanie Babies Data
on page 91. This data set contains the following variables:
Name: name of beanie baby
Age: in months, since 9/98
Status: R=retired, C=current
Value: Value of baby
Comparative BoxPlots of Value by
Status
[Figure: side-by-side boxplots of Value (vertical axis, 0 to 2000) for Status C and R.]
Distributions for both groups very right-skewed!
Comparative BoxPlots of Log(Value) by
Status
[Figure: side-by-side boxplots of LogValue (vertical axis, 2 to 8) for Status C and R.]
Relationship Between Age and Value
[Figure: scatterplot of Value (vertical axis, 0 to 2000) against Age0998 (horizontal axis, 0 to 70).]
Relationship Between Age and Log(Value)
[Figure: scatterplot of LogValue (vertical axis, 2 to 8) against Age0998 (horizontal axis, 0 to 70).]