Chapter 7: Summarizing and Displaying Measurement (quantitative) Data
Salaries (in 100K) for Big Ten Head Football Coaches
School     Position  Name        Salary (in 100K)
Minn       Coach     Kill        12
Indiana    Coach     Wilson      13
NW         Coach     Fitzgerald  13
Illinois   Coach     Beckman     16
Wisc       Coach     Andersen    18
MSU        Coach     Dantonio    19
Purdue     Coach     Hazell      20
PSU        Coach     O'Brien     23
Nebraska   Coach     Pelini      29
Mich       Coach     Hoke        31
Iowa       Coach     Ferentz     38
OSU        Coach     Meyer       43
Numbers to describe data:
- mode which is the most frequently observed data value (13)
- median and quartiles
- mean
The position of the median is found by (n+1)/2, which for our data set is (12+1)/2 = 6.5. This
means that the median is located halfway between the 6th and 7th observations within the
ordered data. To calculate the median for our data we take the midpoint of the 6th and 7th
observations, which for this example is the midpoint between 19 and 20, or 19.5. The
interpretation is that about 50% of the salaries for Big 10 head football coaches are at or below
19.5 or $1,950,000.
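As a minimal sketch of the position rule above, the short Python example below locates the median at position (n+1)/2 in the ordered data. The function name median_by_position is made up for this illustration and is not part of the course material.

```python
import math

# Salaries in units of $100K, taken from the table above.
salaries = [12, 13, 13, 16, 18, 19, 20, 23, 29, 31, 38, 43]

def median_by_position(values):
    """Return the (n + 1) / 2 position and the median found at that position."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                    # 1-based position, 6.5 for n = 12
    below = ordered[math.floor(position) - 1] # 6th observation when position is 6.5
    above = ordered[math.ceil(position) - 1]  # 7th observation when position is 6.5
    return position, (below + above) / 2      # midpoint of the two neighbours

position, med = median_by_position(salaries)
print(position, med)  # 6.5 19.5
```

When n is odd the position is a whole number, so the two neighbours are the same observation and the midpoint is simply that value.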
Similarly:
The position of the quartiles can be found by calculating the median of the median! That is,
the position of the first quartile, called Q1, is found by (n+1)/4 = (12+1)/4 = 3.25. For us this
places the first quartile between the third and fourth ordered observations, 13 and 16. Taking
the value halfway between them makes quartile 1, or Q1, equal to 14.5 or $1,450,000. The
interpretation is that about 25% of the salaries for Big 10 head football coaches are at or below
14.5 or $1,450,000.
The position of the third quartile, called Q3, is found by 3*(n+1)/4 = 3*(12+1)/4 = 9.75. For us
this places the third quartile, Q3, between the 9th and 10th ordered observations, 29 and 31.
Taking the value halfway between them makes Q3 equal to 30.0 or $3,000,000.
The interpretation is that about 75% of the salaries for Big 10 head football coaches are at or
below 30.0 or $3,000,000.
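One way to read "the median of the median" is that Q1 is the median of the lower half of the ordered data and Q3 is the median of the upper half. The sketch below assumes that reading; the helper name quartiles_by_halves is invented for this example. It reproduces the values 14.5 and 30.0 found above.

```python
from statistics import median

# Salaries in units of $100K, taken from the table above.
salaries = [12, 13, 13, 16, 18, 19, 20, 23, 29, 31, 38, 43]

def quartiles_by_halves(values):
    """Q1 and Q3 as the medians of the lower and upper halves (assumed reading)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]        # the 6 smallest salaries for n = 12
    upper_half = ordered[n - half:]    # the 6 largest salaries for n = 12
    return median(lower_half), median(upper_half)

q1, q3 = quartiles_by_halves(salaries)
print(q1, q3)  # 14.5 30.0
```

Software that interpolates at the fractional positions 3.25 and 9.75 may report slightly different quartiles for the same data.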
- mean (sum of all values divided by the total number summed): the sum is 275, and dividing by 12
gives a mean of 22.92 or about $2,292,000
- variability measures:
i. range (max – min = 43 – 12 = 31),
ii. Interquartile Range or IQR = Q3 – Q1 = 30.0 – 14.5 = 15.5. The IQR represents how
spread out the middle 50% of the observations are, since this measure is the difference between
the third and first quartiles (i.e. 75% - 25% = 50%).
iii. variance and standard deviation (or SD or Std. Dev.). These two are measures of “how
spread out the values are from the mean” and are related: standard deviation is the square root of
the variance. It is not important to know how to calculate these, but it is important to know their
meaning. A basic interpretation of the standard deviation is that it is roughly the average distance
of the observed values from the mean. In our example the SD is 10.2, meaning that roughly the
average difference between each salary and the mean salary is about $1,020,000. Also, the
variance would be 10.2^2 or 104.04.
- Five number summary: min (12), Q1 (14.5), Median (19.5), Q3 (30) and max (43)
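The summary numbers in this list can be checked with Python's standard statistics module. This is only a sketch: the quartiles are entered by hand as the values found in these notes, and the variable names are chosen for this example. Note that stdev and variance use the sample formulas (dividing by n - 1), which is what matches the SD of 10.2 quoted here.

```python
from statistics import mean, median, stdev, variance

# Salaries in units of $100K, taken from the table above.
salaries = [12, 13, 13, 16, 18, 19, 20, 23, 29, 31, 38, 43]

avg = mean(salaries)                         # 275 / 12
data_range = max(salaries) - min(salaries)   # max minus min
q1, q3 = 14.5, 30.0                          # quartiles as found in these notes
iqr = q3 - q1                                # spread of the middle 50%
sd = stdev(salaries)                         # sample standard deviation
var = variance(salaries)                     # sample variance, equal to sd squared

five_number = (min(salaries), q1, median(salaries), q3, max(salaries))

print(round(avg, 2), data_range, iqr)  # 22.92 31 15.5
print(round(sd, 1), round(var, 2))     # 10.2 104.08
print(five_number)                     # (12, 14.5, 19.5, 30.0, 43)
```

The variance prints as 104.08 rather than 104.04 because 104.04 comes from squaring the already rounded SD of 10.2.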
NOTE: Sometimes these values are not possible outcomes (e.g. the mean number of children in
a US household is 2.2). We do NOT round such a number to a whole number (e.g. we would not
round this to 2). The value is important as it tells us that on average the number of children
is less than 3 but more than 2.
Graphing measurement data and shape:
- stem and leaf
- histogram
- boxplot
- symmetric (or bell shaped), skewed, and outliers
Stem-and-Leaf Display: Salary(in 100K)
Stem-and-leaf of Salary   N = 12
Leaf Unit = 1.0
1  233
1  689
2  03
2  9
3  1
3  8
4  3
[Figure: Histogram of Salary. Horizontal axis: Salary (10 to 45); vertical axis: Frequency (0 to 3)]
This histogram would be interpreted as right-skewed or positively skewed since the extreme
observations are “pulling” or “stretching” the data to the right or in a positive direction. For such
distributions we would expect the mean to be more than, or to the right of, the median. This is
the case for this example as the mean of 22.92 is greater than the median of 19.5.
Building a “fence” around the data to determine extreme observations or outliers.
We can use the quartiles and IQR to build a fence around our data in order to determine if any
observations in our data set can be considered extreme or outliers. This fence is built by:
Lower Fence: Q1 - 1.5*IQR
Upper Fence: Q3 + 1.5*IQR
For this data set with Q1 of 14.5, Q3 of 30, and IQR of 15.5 the fence is:
Lower: 14.5 - 1.5*15.5 = 14.5 – 23.25 = -8.75, or in this case 0 since a salary cannot be negative
Upper: 30 + 1.5*15.5 = 30 + 23.25 = 53.25
Looking at the data we see none of the values lie outside this “fence” meaning that there are no
outliers or extreme observations in our data.
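The fence calculation can be sketched in a few lines of Python. The quartiles are entered as the values found earlier in these notes, and the variable names are chosen only for this illustration.

```python
# Salaries in units of $100K, taken from the table above.
salaries = [12, 13, 13, 16, 18, 19, 20, 23, 29, 31, 38, 43]

q1, q3 = 14.5, 30.0              # quartiles as found in these notes
iqr = q3 - q1                    # 15.5

lower_fence = q1 - 1.5 * iqr     # 14.5 - 23.25
upper_fence = q3 + 1.5 * iqr     # 30 + 23.25

# Any observation outside the fence is flagged as an outlier.
outliers = [x for x in salaries if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence, outliers)  # -8.75 53.25 []
```

The empty list confirms that no salary falls outside the fence.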
Creating a Box Plot
A box plot is simply a picture of the median and the quartiles, which form a “box”, and
“whiskers”, which reach toward the fence. This plot is sometimes referred to as a box-and-whisker
plot. Below is a box plot of our data. The box itself runs from Q1 to Q3 with the line within the box
being the median (note how the big box goes from 14.5 to 30, the Q1 and Q3, with the line within
this box being at 19.5, the median). The whiskers extend out both sides of the box but only
extend to the observed value closest to the lower and upper fence without exceeding these values.
For example, the lower fence is at 0 but the closest observation to this value without going below
it is 12; thus the whisker stops at 12. The other whisker goes to 43 since this is the closest
observed value that does not exceed the upper fence of 53.25. Any outliers will be marked by an
asterisk *
[Figure: Boxplot of Salary. Vertical axis: Salary (10 to 45)]
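The whisker rule described for the box plot can be sketched as follows: each whisker stops at the most extreme observation that is still inside the fence. The fence values are the ones computed in these notes, and the variable names are chosen only for this example.

```python
# Salaries in units of $100K, taken from the table above.
salaries = [12, 13, 13, 16, 18, 19, 20, 23, 29, 31, 38, 43]

lower_fence, upper_fence = -8.75, 53.25   # fences as computed in these notes

# Observations inside the fence decide where the whiskers end.
inside = [x for x in salaries if lower_fence <= x <= upper_fence]
lower_whisker = min(inside)
upper_whisker = max(inside)

# Observations outside the fence would be drawn as asterisks.
asterisks = [x for x in salaries if x < lower_fence or x > upper_fence]

print(lower_whisker, upper_whisker, asterisks)  # 12 43 []
```

Using a lower fence of 0 instead of -8.75 gives the same whisker, since the smallest salary is 12 either way.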
Outliers have the greatest effect on the mean and on the standard deviation/variance, since the
latter use the mean in their calculation. This effect is due to the mean taking into account
the values of all observations, whereas the median and quartiles are first determined by
the number of observations, that is, by position in the ordered data. Thus outliers have less of
an impact on the median, quartiles and IQR. The range, too, is greatly affected by outliers.
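To see this effect with numbers, the sketch below replaces the largest salary (43) with an invented value of 143 and compares the summaries. The value 143 is purely hypothetical and does not come from the data; the helper name summarize is also made up for this example.

```python
from statistics import mean, median, stdev

# Salaries in units of $100K, taken from the table above.
salaries = [12, 13, 13, 16, 18, 19, 20, 23, 29, 31, 38, 43]

# Hypothetical data: the largest salary replaced by an invented outlier of 143.
with_outlier = salaries[:-1] + [143]

def summarize(values):
    """Return the mean, sample SD, range and median of a list of values."""
    return {
        "mean": round(mean(values), 2),
        "sd": round(stdev(values), 2),
        "range": max(values) - min(values),
        "median": median(values),
    }

original = summarize(salaries)
changed = summarize(with_outlier)

# Print each measure before and after adding the outlier.
for key in original:
    print(f"{key:>6}: {original[key]} -> {changed[key]}")
```

The mean, SD and range all jump, while the median stays at 19.5. The 3rd, 4th, 9th and 10th ordered observations are also unchanged, so Q1, Q3 and the IQR found by the method in these notes stay the same.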
Empirical Rule
When the data is symmetric or bell-shaped the use of SD is quite helpful. You would find that for
data shaped this way roughly 68% of the observations fall within +/- one SD
from the mean; 95% of the observations fall within +/- two standard deviations from the mean;
and almost all (99.7%) of the observations fall within +/- three standard deviations from the mean.
For example the math portion of the SAT test is typically bell-shaped with a mean of 500 and
standard deviation of 100. Thus we would expect that 68% of all scores would be between 400
and 600; 95% would fall between 300 and 700; and almost all (99.7%) would fall between 200
and 800.
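The SAT intervals above can be reproduced with a short Python sketch. It assumes the scores follow an exactly bell-shaped (normal) model with mean 500 and SD 100, which is an idealization, and uses NormalDist from the standard statistics module to show where the 68%, 95% and 99.7% figures come from.

```python
from statistics import NormalDist

mean_score, sd_score = 500, 100            # SAT math values used in these notes
sat = NormalDist(mu=mean_score, sigma=sd_score)

results = []
for k in (1, 2, 3):
    low = mean_score - k * sd_score        # k standard deviations below the mean
    high = mean_score + k * sd_score       # k standard deviations above the mean
    share = sat.cdf(high) - sat.cdf(low)   # proportion of scores between low and high
    results.append((k, low, high, share))
    print(f"within {k} SD: {low} to {high}, share about {share:.3f}")
```

The shares come out near 0.683, 0.954 and 0.997, which the Empirical Rule rounds to 68%, 95% and 99.7%.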