Math 3307
Module 1:
Representations of Data
Descriptive Statistics
Central Tendency
Spread
Fractiles
Rates of change
Z score
Representations
Dot diagrams
Charts
Stem and Leaf Plots
Box and Whisker Plots
Scatter Plots
Descriptive Statistics
Descriptive statistics take raw data and present it in a way that highlights the important
material WITHOUT drawing inferences or generalizations for the viewer. Predictive
statistics provide a means to make a judgment, a prediction, or an inference about a
situation.
For example, taking the 2010 census data, noting that the population of the United
States is now 307,006,550 people, and reporting this is descriptive. Even noting that in
2000 the population was 281,421,906 and that the new number is 1.091 times the
population in 2000 is descriptive. Taking these numbers, plotting a linear regression
line, and predicting the population of the USA in 2015 is NOT descriptive; it is making an
estimate, a prediction. Making generalizations is NOT descriptive either. If you come to
a generalization about a situation, you are out of the area of descriptive statistics.
Problem DS1
Which of the following conclusions may be obtained from the following data by purely
descriptive methods and which require generalizations?
A student in my Spring Pre-calculus class took 4 consecutive daily quizzes and got the
following scores: 3, 8, 10, and 12.
a.) On only 1 day did he get less than 5 right.
b.) The student’s number correct increased on each successive quiz.
c.) The student got better at guessing what I was going to ask each day.
d.) On the last day the student copied his answers from his neighbor.
Problem DS2
Smith and Jones are hairdressers. On a recent day, Smith cut the hair of 4 male clients
and 2 female clients. While Jones cut hair on 3 males and 3 females.
a.) The amount of time it takes Smith and Jones to do a haircut is
    approximately the same.
b.) Smith always cuts hair on more males than females.
c.) The two always have the same number of clients per day.
d.) Over a week, Smith averages 6 clients a day.
Problem DS3
Which of the following conclusions can be obtained by descriptive methods and which
require generalizations?
Driving the same model of car, 5 different drivers averaged 15.5, 14.7, 16.0, 15.5 and
14.8 mpg.
a.) None of the drivers averaged more than 16 mpg.
b.) The second driver must have driven on rural roads.
c.) 15.5 is the average mpg most often achieved.
d.) The third driver drove faster than the other 4.
Typical types of summaries of data
Measures of Central Tendency – these are the numbers that describe what is normal,
usual, and in the middle or the center. These terms are very loose and need firming up
mathematically, of course.
The most popular measure of “centeredness” is the Mean (sometimes called the average).
The mean of n numbers is the sum of the numbers divided by n. If you are working with
a data set of measurements, the mean is denoted $\bar{x}$.
There are some very cogent reasons for its popularity:
It can always be calculated and it’s easy to calculate.
It is unique: there is only ONE mean for a data set.
It uses EVERY data point; nothing is eliminated.
It doesn’t depend on chance or luck.
There are some equally important reasons to take the mean with a grain of salt:
It is heavily affected by outliers!
Recall the data on the number of pets owned by the 3307 population!
Problem CT1
An elevator in PGH is designed to carry a maximum load of 3,200 pounds. If it is loaded
with 18 people with a mean weight of 166 pounds, is it in any danger of being
overloaded?
Weighted Mean
Sometimes each data point is not “equal” in weight, meaning some have more importance
than others. For example, in my Math 3379 class there are 4 papers; the first 3 are 10%
of the grade and the fourth is a term paper worth 20% of the grade. In order to calculate a
student’s average on this 50% of the course grade, I would take the 3 grades and TWICE
the term paper grade and divide by 5. Note that you use “proportional” multiplication to
even things up!
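The weighting described above can be sketched in a few lines of Python. This is only an illustration: the function name `weighted_mean` and the four grades are made up for the example, and the weights 1, 1, 1, 2 stand for the three 10% papers and the 20% term paper.

```python
def weighted_mean(values, weights):
    """Return sum(weight * value) / sum(weights) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical grades: three papers worth 10% each and a term paper worth 20%.
grades = [80, 90, 70, 85]
weights = [1, 1, 1, 2]  # the term paper counts twice

print(weighted_mean(grades, weights))  # (80 + 90 + 70 + 2*85) / 5 = 82.0
```

Using the percentages themselves as weights (10, 10, 10, 20) gives the same answer, because only the proportions matter.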
Problem CT2
Having received a bonus of $20,000 for accepting early retirement, a company’s sales
representative invested $6,000 in a bond paying 3.75%, $10,000 in a mutual fund paying
3.96%, and $4,000 in a CD paying 3.25%. Find the weighted mean of these percentages.
Problem CT3
A lecturer counts the final exam in a course 4 times as much as each of the 3 small exams
during the semester. Which of the following students has the higher average?
          Mikey   Lizbeth
Test 1     72       81
Test 2     80       87
Test 3     65       75
Final      82       78
Problem CT4
A home appliance store has the following inventory:
Refrigerator   # in stock   Size in CuFt   Price - $
A                  18            15           416
B                  12            21           549
C                   9            19           649
D                  14            21           716
E                  25            24           799
A. What is the average size of these refrigerators?
B. What will the average income per unit be if they sell them all?
C. What is the average price for a refrigerator?
Another measure of central tendency is the Median:
The median is the value that is at the numerical middle of the data if there are an odd
number of data points and they are arranged in order by size. It is the mean of the 2
middle data points if the number of data points is even and arranged in order by size.
The formula for finding the location of the median for n data points is
0.5(n + 1). The process is to order the data and then find the measurement at that
location.
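As a rough sketch of that process, the Python below orders the data, computes the location 0.5(n + 1), and reads off the measurement there. The function name and the sample lists are invented for the illustration; when the location ends in .5 the two neighbouring measurements are averaged, as in the definition above.

```python
def median(data):
    """Median found by ordering the data and going to location 0.5 * (n + 1)."""
    if not data:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(data)
    n = len(ordered)
    location = 0.5 * (n + 1)  # position counted from 1, not from 0
    if location.is_integer():
        # odd n: the location is a whole number, so one measurement sits there
        return ordered[int(location) - 1]
    # even n: the location falls halfway between two measurements
    lower = ordered[int(location) - 1]
    upper = ordered[int(location)]
    return (lower + upper) / 2


print(median([7, 3, 9, 1, 5]))  # n = 5, location 3, third ordered value: 5
print(median([7, 3, 9, 1]))     # n = 4, location 2.5, mean of 3 and 7: 5.0
```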
Problem CT5
In golf the holes are rated for a recommended number of strokes needed to sink the golf
ball into the hole. A score of par means the golfer used the recommended number, a
birdie is one fewer than recommended, a bogey is one more than the recommended
number, an eagle is 2 fewer strokes.
At a recent televised tournament, 7 golfers had the following scores, ranked
alphabetically by last name: par, birdie, par, par, birdie, bogey, and eagle.
What was the median score?
Problem CT6
Find the median location for
A. n = 19
B. n = 52
The final measure of central tendency is the Mode. This is the number that occurs most
frequently in a data set.
Problem CT7
What is the mode for the data in Problem CT5?
The Mode is the measurement in the data set that occurs most often.
Problem CT8
Which of the following bars shows the mode in this histogram?
[Histogram: “Age and saying No”; vertical axis: Number of No's per hour (0 to 6); horizontal axis: Age (1 to 6)]
Relationships among Mean, Median, and Mode:
Problem CT 9
x axis   STTR   STTL   Symm
1        1      1      1
2        2      2      2
3        4      3      3
4        5      4      4
5        4      5      5
6        3      6      5
7        2      8      4
8        2      5      3
9        1      4      2
10       1      3      1
Calculate mean, median, and mode for these 3 charts. Mark on the x-axis where each
goes.
[Bar chart: Skewed to the right; vertical axis 0 to 6; horizontal axis 1 to 10]
[Bar chart: Skewed to the left; vertical axis 0 to 9; horizontal axis 1 to 10]
[Bar chart: Symmetric; vertical axis 0 to 6; horizontal axis 1 to 10]
Summarize your results with a mnemonic device.
Which measurement is more sensitive to outliers: the mean or the median?
What does it mean to say “most sensitive”?
Discuss this idea using the salaries of baseball players.
Problem CT 10
The data shown in the table are the median prices of existing homes in the USA from
1981 through 1986. If the average prices of existing homes were calculated for each of
these years, how do you think these values would compare to the median prices shown?
Would the average price be higher, lower, or the same?
Year     Median
1981     66,460
1982     67,800
1983     70,300
1984     72,400
1985     75,500
1986     80,300
Problem CT 11
Car A    Car B    Car C
27.9     31.2     28.6
30.4     28.7     29.1
30.6     31.3     28.5
31.4     28.7     32.1
31.7     31.3     29.7
Above is mileage data from 3 compact cars on 5 trials each. Each car was manufactured
by a different car company.
If the manufacturers of Car A want to advertise how fuel efficient their car is, what
statistics might they use to substantiate their claim?
If the manufacturers of Car B want to advertise how fuel efficient their car is, what
statistics might they use to substantiate their claim?
What about the maker of Car C?
Measures of Variability
A measure of variability is a number that describes the spread or the variety of
measurements in a data set.
The range of a data set is equal to the largest measurement minus the smallest
measurement.
The sample variance is calculated with the following formula for n data points:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
First calculate the sample mean,
then subtract the mean from each measurement individually and
square the answer.
Add up all the squares and divide by n - 1.
The standard deviation for a set of data is the square root of the variance: s.
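The steps for the sample variance can be written out as a short Python sketch. The function names and the five-number sample are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
import math


def sample_variance(data):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least 2 data points")
    mean = sum(data) / n                                 # step 1: sample mean
    squared_devs = [(x - mean) ** 2 for x in data]       # step 2: subtract and square
    return sum(squared_devs) / (n - 1)                   # step 3: add up, divide by n - 1


def sample_std(data):
    """Standard deviation s: the square root of the sample variance."""
    return math.sqrt(sample_variance(data))


sample = [2, 4, 4, 6, 9]  # hypothetical data; the mean is 5
print(sample_variance(sample))  # (9 + 1 + 1 + 1 + 16) / 4 = 7.0
print(sample_std(sample))       # square root of 7, about 2.65
```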
Problem MV 1
Calculate the mean for each sample below. Calculate the variance for each sample.
Discuss the information available in the variance.
[Bar chart: N = 5; vertical axis 0 to 1.2; horizontal axis 1 to 5]
[Bar chart: N = 5; vertical axis 0 to 3.5; horizontal axis 1 to 5]
Problem MV 2
Here is a data set: (8, 2, 2, 7, 4, 6, 5, 3, 4)
Describe this data set using mean, median, mode, range, and standard deviation.
Problem MV 3
Three sets of data are shown below. What are the number of data points in each set?
What is the mean for each set (do this WITHOUT a calculator!). Rank the sets from the
most variable to the least variable and tell why you made those choices. (again:
calculator free).
Hint: use the formula for variance to help you reason it out!
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
[Bar chart: Data set 1; vertical axis: Frequency (0 to 7); horizontal axis: Measurement (1 to 11)]
[Bar chart: Data Set 2; vertical axis: Frequency (0 to 6); horizontal axis: Measurement (1 to 11)]
[Bar chart: Data Set 3; vertical axis: Frequency (0 to 10); horizontal axis: Measurement (1 to 11)]
Problem MV 4
Consider the following 2 samples:
Sample A:
10, 0, 1, 9, 10, 0
Sample B:
0, 5, 10, 5, 5, 5
Describe these data sets using mean, median, mode, range, and variance. What statistics
are the same and what statistics are different? Which data set is the more variable and
why? Which is the better predictor of variability: range or variance?
Grouped Data for Variance calculations
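If f is the frequency of a data measurement, then the following formula calculates the
variance for the data:
$$s^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{n - 1}$$
where the sum runs over the k distinct measurement values $x_i$, $f_i$ is the frequency of
$x_i$, and n is the total number of measurements (the sum of the frequencies).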
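A small Python sketch of the grouped calculation follows. It assumes that the n in the denominator is the total number of measurements (the sum of the frequencies); the function name and the tiny distribution are invented for the illustration, and the result is cross-checked against the same data written out in full.

```python
import statistics


def grouped_mean_and_variance(values, freqs):
    """Mean and sample variance of a distribution given as values and frequencies."""
    n = sum(freqs)  # total number of measurements
    if n < 2:
        raise ValueError("need at least 2 measurements")
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / (n - 1)
    return mean, variance


# Hypothetical distribution: the value 1 occurs twice, 2 three times, 3 once.
values = [1, 2, 3]
freqs = [2, 3, 1]
mean, variance = grouped_mean_and_variance(values, freqs)
print(mean, variance)

# Cross-check: the same six measurements listed one by one.
expanded = [1, 1, 2, 2, 2, 3]
print(statistics.mean(expanded), statistics.variance(expanded))
```

Both lines print the same pair of numbers (up to rounding), which is the point of the frequency shortcut: it saves writing out repeated measurements.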
Problem MV 5
The data in the following table are for the inner diameters of some tubes manufactured by
a machine. This table is called a “distribution” because it gives the values and their
frequency. Find the mean diameter and the variance for the tubes.
D, inches    frequency
2.0          2
2.2          4
2.3          6
2.8          3
3.0          5
Problem MV 7
The following table is a distribution of the top speeds in mph at which 30 racers were
clocked in an auto race. Find the mean and variance for the race.
Top Speed    Number of racers
145          9
150          8
160          11
170          2
Fractiles and Percentages
A fractile ranking means that a given number of measurements lie below the given
measurement and a given number above.
Suppose your child comes home to tell you that she’s in the 90th percentile of her class on
a particular test. This means that 90% of the children have lower scores or the same
score as she does and 10% have higher scores. You do need to be a little careful with
these measurements of relative ranking, though. It could be that 91% of the children
failed the test and 9% passed. In this scenario, of course, being in the 90th percentile
isn’t much to brag about. You need absolute measures AND relative measures to
evaluate a situation about fractiles.
Deciles divide the measurements into 10ths and quartiles divide the measurements into
quarters. The median is both a decile and a quartile ranking.
Let’s look at quartiles:
Q1 is the median of all measurements less than the median of the data set.
Q3 is the median of all measurements greater than the median of the data set.
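One way to turn these two definitions into a calculation is sketched below in Python. It makes an assumption the text does not spell out: "less than the median" is read by position, so the ordered data is cut into a lower half and an upper half, and the middle measurement is left out of both halves when n is odd. The function names and sample data are hypothetical.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in order."""
    n = len(ordered)
    mid = n // 2
    if n % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-each-half method."""
    if len(data) < 2:
        raise ValueError("need at least 2 data points")
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                # measurements below the median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # measurements above it
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)


print(quartiles([1, 3, 5, 7, 9, 11, 13]))    # (3, 7, 11)
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))   # (2.5, 4.5, 6.5)
```

Other textbooks and software use slightly different quartile conventions, so small differences from a calculator or spreadsheet are expected.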
Problem FP 1
The 21 meetings of the West U Orchid Breeders club had the following attendances:
22, 24, 23, 24, 27, 25, 24, 19, 24, 26, 28, 32, 21, 24, 25, 23, 26, 25, 18, 24
Find all 3 measures of central tendency, Q1, Q3, and the standard deviation for the data
set.
Problem FP 2
Find the positions of the median, Q1, and Q3 for
A. n = 32
B. n = 35
Problem FP 3
The following numbers are weekly lumber production (in million board feet) for a
company in Oregon. Find the first quartile and the 90th percentile for the data.
390  406  447  410  370  338  410  320  359  392  315  480
Percentage change in a measurement:
The percent change in a measurement is often of interest to managers, doctors, and
teachers. It is used as a measure of efficacy.
The calculation is
$$\text{percent change} = \frac{\text{final} - \text{initial}}{\text{initial}}$$
Suppose you have a student who was reading poorly – 15 words a minute. You train the
student using your favorite method and test him again to find him reading 27 words a
minute. The percent change is
$$\frac{27 - 15}{15} = 0.80$$
which is 80%.
You would then report an 80% improvement in speed.
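The reading-speed calculation can be checked with a few lines of Python. The function name `percent_change` is just a label for this sketch; it returns the change already multiplied by 100.

```python
def percent_change(initial, final):
    """Percent change from initial to final: 100 * (final - initial) / initial."""
    if initial == 0:
        raise ValueError("percent change is undefined when the initial value is 0")
    return 100 * (final - initial) / initial


print(percent_change(15, 27))  # reading speed example: 80.0
```

A negative result means the measurement went down rather than up.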
Problem PC 1
You’ve been looking at a sweater in the store but it costs $135 and that’s too much. BUT
one day you go and check and it’s been marked down to $65…what is the percent
change?
Problem PC2
A student has been working with a tutor on his math skills. His weekly quiz average was
a 65% when he started with the help program.
His quizzes are 30 points each. During the program his weekly grades are
20, 23, 21, 28, 27, 29
What is the percent change in his average? Would you say that the tutoring helped?
Z score
Z scores are used on data that are collected from populations that have a normal
distribution for the property under scrutiny. A z-score tells you how far from the mean a
particular measurement is. A z-score is calculated with the following formula:
$$z = \frac{x - \mu}{\sigma}$$
where x is a particular data measurement and the other 2 symbols, $\mu$ and $\sigma$, stand for
the mean and standard deviation of a particular population. Note that standard deviation
is the square root of variance.
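As a minimal sketch, the formula translates directly into Python. The function name and the numbers (a score of 75 on a test with mean 70 and standard deviation 10) are hypothetical.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std_dev


print(z_score(75, 70, 10))  # 0.5: half a standard deviation above the mean
print(z_score(55, 70, 10))  # -1.5: one and a half standard deviations below
```

Because a z-score is measured in standard deviations, it lets you compare measurements that come from different scales.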
Sketch a normal distribution here:
The Empirical Rule for normally distributed data:
Approximately 68.3 percent of the observations will fall within one standard deviation of
the mean ($\bar{x} \pm s$). Approximately 95.4 percent of the observations will fall within 2
standard deviations of the mean. Approximately 99.7 percent of the observations will fall
within 3 standard deviations of the mean.
A rough estimate of the range is the mean ± 3 standard deviations. Why is this true?
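The three percentages in the Empirical Rule come from the normal curve itself. The sketch below uses a standard fact that is not stated in the handout: for a normal distribution, the fraction of observations within k standard deviations of the mean equals erf(k / sqrt(2)), where erf is the error function in Python's `math` module. The function name is made up for the example.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(100 * fraction_within(k), 1))
# 1 68.3
# 2 95.4
# 3 99.7
```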
ZS Problem 1
If you have 2 students applying for entrance to a G&T program and you have room for
only one, which one will you pick based on the following test information?
Gina got a 78 on a test with an average of 72 and a standard deviation of 5.
Mike got an 87 on a test with an average of 85 and standard deviation 1.5.
Who is the stronger student and how do you know?
ZS Problem 2
Given the following distribution
Measurement    number
1              0
2              3
3              1
4              5
5              2
6              7
7              5
8              6
9              3
10             0
11             1
12             0
13             2
Discuss the measures of central tendency
• mean
• median
• mode
the measures of variability
• range
• variance
• standard deviation
and give the z score for the measurement 7.
Verify the Empirical Rule by making a dot or bar chart of the data and marking off where
each of the standard deviations from the mean are. (s, 2s, 3s)
ZS Problem 3
The mean salary of the employees at a high school in Missouri is $28,500 with a
standard deviation of $2,100.
Discuss the Empirical Rule and who might fit where on a bar chart of employee salaries.
The state announces a flat raise of $500 per employee for the next year. Find the mean
and standard deviation of the new salaries.
Who will benefit the most in a percentage change analysis?
ZS Problem 4
Given that the mean is 90 and the standard deviation is 1.4 give the numbers of the 2,000
data points that should be within 1, 2, and 3 standard deviations of the mean. Then count
the numbers that actually ARE within these bounds.
Value    Frequency
0        1
1        2
2        4
3        8
4        20
5        35
6        60
7        120
8        25
9        500
10       1000
ZS Problem 5
For 50 days, the number of vehicles using a particular road was tracked by a city
engineer. She found that the mean was 385 and the standard deviation was 15 vehicles.
Suppose you are interested in opening a franchise shop along the road and you know you
need traffic between 340 and 430 cars per day to be successful. How many days have
this much traffic? Is this a good location or a marginal location?
ZS Problem 6:
Analyze the following nuclear reactor data (@2010)
In operation
Country
Electr. net
output
MW
2
1
7
2
2
18
935
375
5,926
1,884
1,906
12,569
1
1
2
-
Electr. net
output
MW
692
1,245
1,906
-
13
10,048
27
27,230
6
6
4
58
17
4
20
54
21
2
1
2
2
32
4
1
2
8
10
5
6
15
19
104
442
4,980
3,722
2,716
63,130
20,490
1,889
4,391
46,823
18,665
1,300
487
425
1,300
22,693
1,792
666
1,800
7,514
9,303
3,238
4,980
13,107
10,137
100,747
374,958
2
1
1
5
1
2
5
1
11
2
2
2
1
65
2,600
Number
Argentina
Armenia
Belgium
Brazil
Bulgaria
Canada
China


Mainland
Taiwan
Czech Republic
Finland
France
Germany
Hungary
India
Iran
Japan
Korea, Republic
Mexico
Netherlands
Pakistan
Romania
Russian Federation
Slovakian Republic
Slovenia
South Africa
Spain
Sweden
Switzerland
Taiwan
Ukraine
United Kingdom
USA
Total
Under construction
Number
1,600
1,600
3,564
915
2,650
5,560
300
9,153
782
2,600
1,900
1,165
62,862
Work:
Some thoughts:
A histogram for the number per country?
Calculate the measures of center, the variability
Check the Empirical Rule?
An average output for each reactor?
A z-score for the USA?
Representations
Dot diagrams
Charts
Stem and Leaf Plots
Box and Whisker Plots
Scatter Plots
A.
Dot diagrams:
These summarize data visually and quickly. Put one dot for each observation.
Note that you don’t need to sort the data to make a dot diagram.
For example:
If I toss a die 6 times and get: 1 4 5 6 1 2
I’d put a horizontal line down and mark off the 6 possible numbers and then put a
dot above each recorded value:
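A plain-text version of that picture can be produced with a short Python sketch. The function name `dot_diagram` is invented here, asterisks stand in for the dots, and the layout assumes every category label is a single character so the columns line up.

```python
from collections import Counter


def dot_diagram(data, categories):
    """Text dot diagram: one '*' stacked above a category for each observation."""
    counts = Counter(data)
    height = max(counts.values(), default=0)
    lines = []
    for level in range(height, 0, -1):  # draw the top row of dots first
        row = ["*" if counts[c] >= level else " " for c in categories]
        lines.append(" ".join(row).rstrip())
    lines.append(" ".join(str(c) for c in categories))  # the horizontal axis
    return "\n".join(lines)


tosses = [1, 4, 5, 6, 1, 2]  # the six die tosses from the example
print(dot_diagram(tosses, range(1, 7)))
# *
# * *   * * *
# 1 2 3 4 5 6
```

Note that the data went in unsorted; counting the observations per value is all the diagram needs.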
DD Problem 1
2150132071342412251343110241132352244
These data give the number of delayed takeoffs per week at a small regional airport
with 48 flights per day.
Make a dot diagram and analyze the data completely.
Dot diagrams are also useful with qualitative or categorical data.
DD Problem 2:
At a recent televised tournament, 7 golfers had the following scores, ranked
alphabetically by last name: par, birdie, par, par, birdie, bogey, and eagle.
Analyze this with a dot diagram.
B.
Charts
Example:
Here is a distribution of information about Americans aged 18 or older:
Marital status    Count (in millions)    Percent
Single            41.8                   22.6
Married           113.3                  61.1
Widowed           13.9                   7.5
Divorced          16.3                   8.8
There are a couple of ways to display this information graphically. One is a histogram or
bar chart and another is a pie chart.
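Before drawing either chart, the percentages have to come from the counts. A minimal Python sketch of that step, using the four counts (in millions) from the marital status distribution above, is shown here; the function name `to_percentages` is hypothetical.

```python
def to_percentages(counts):
    """Convert a dict of category counts into percentages of the total, to 1 decimal."""
    total = sum(counts.values())
    return {category: round(100 * count / total, 1) for category, count in counts.items()}


# Counts in millions, Americans aged 18 or older, as given in the example.
counts_millions = {"Single": 41.8, "Married": 113.3, "Widowed": 13.9, "Divorced": 16.3}

print(to_percentages(counts_millions))
# {'Single': 22.6, 'Married': 61.1, 'Widowed': 7.5, 'Divorced': 8.8}
```

Each percentage is also the share of the circle that the category gets in a pie chart.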
Pie chart
Histogram
Why was it important to use the percentages and not the raw counts?
Charts Problem 1
Here’s some 2000 Census data: the population of each state and its percent of the US
population. Note that the list is not in strictly descending order. The states were ranked
in descending order by the 1990 census; when I cut out the intervening years, the states
whose percentages had dropped since 1990 ended up “out of order.”
How would you display this data in a small box in the middle of a report? You don’t
want visual distortion; you do want to avoid a histogram with 51 bars, though!
                             April 1, 2000      %
   United States              281,421,906
 1 California                  33,871,648   11.04
 2 Texas                       20,851,820    7.41
 3 New York                    18,976,457    6.74
 4 Florida                     15,982,378    5.68
 5 Illinois                    12,419,293    4.41
 6 Pennsylvania                12,281,054    4.36
 7 Ohio                        11,353,140    4.03
 8 Michigan                     9,938,444    3.53
 9 Georgia                      8,186,453    2.91
10 North Carolina               8,049,313    2.86
11 New Jersey                   8,414,350    2.99
12 Virginia                     7,078,515    2.52
13 Washington                   5,894,121    2.09
14 Arizona                      5,130,632    1.82
15 Massachusetts                6,349,097    2.26
16 Indiana                      6,080,485    2.16
17 Tennessee                    5,689,283    2.02
18 Missouri                     5,595,211    1.99
19 Maryland                     5,296,486    1.88
20 Wisconsin                    5,363,675    1.91
21 Minnesota                    4,919,479    1.75
22 Colorado                     4,301,261    1.53
23 Alabama                      4,447,100    1.58
24 South Carolina               4,012,012    1.43
25 Louisiana                    4,468,976    1.59
26 Kentucky                     4,041,769    1.44
27 Oregon                       3,421,399    1.22
28 Oklahoma                     3,450,654    1.23
29 Connecticut                  3,405,565    1.21
30 Iowa                         2,926,324    1.04
31 Mississippi                  2,844,658    1.01
32 Arkansas                     2,673,400    0.95
33 Kansas                       2,688,418    0.96
34 Utah                         2,233,169    0.79
35 Nevada                       1,998,257    0.71
36 New Mexico                   1,819,046    0.65
37 West Virginia                1,808,344    0.64
38 Nebraska                     1,711,263    0.61
39 Idaho                        1,293,953    0.46
40 New Hampshire                1,235,786    0.44
41 Maine                        1,274,923    0.45
42 Hawaii                       1,211,537    0.43
43 Rhode Island                 1,048,319    0.37
44 Montana                        902,195    0.32
45 Delaware                       783,600    0.28
46 South Dakota                   754,844    0.27
47 Alaska                         626,932    0.22
48 North Dakota                   642,200    0.23
49 Vermont                        608,827    0.22
50 District of Columbia           572,059    0.20
51 Wyoming                        493,782    0.18
   Puerto Rico                  3,808,610    1.35
Charts Problem 2
United States
AGE DISTRIBUTION
When drawn as a "population pyramid," age distribution can hint at patterns of growth.
A top heavy pyramid, like the one for Grant County, North Dakota, suggests negative population
growth that might be due to any number of factors, including high death rates, low birth rates,
and increased emigration from the area.
A bottom heavy pyramid, like the one drawn for Orange County, Florida, suggests high birthrates,
falling or stable death rates, and the potential for rapid population growth.
But most areas fall somewhere between these two extremes and have a population pyramid
that resembles a square, indicating slow and sustained growth with the birth rate exceeding
the death rate, though not by a great margin.
Discuss this representation of ages from the census 10 years ago. What kind of
difficulties did the authors overcome with this particular version of a histogram?
What kinds of ancillary information can be drawn from this data?
Charts Problem 3
Although there have been advances in medical technology and donation, the demand for
organ, eye and tissue donation still vastly exceeds the number of donors. More than
100,000 men, women and children currently need life-saving organ transplants.
• Every 10 minutes another name is added to the national organ transplant waiting
  list.
• An average of 18 people die each day from the lack of available organs for
  transplant.
• In 2009, there were 8,021 deceased organ donors and 6,610 living organ donors
  resulting in 28,465 organ transplants.
• Last year, more than 42,000 grafts were made available for transplant by eye
  banks within the United States.
• According to research, 98% of all adults have heard about organ donation and
  86% have heard of tissue donation.
• 90% of Americans say they support donation, but only 30% know the essential
  steps to take to be a donor.
Statistics
110,541 Patients Waiting*
60,758 Multicultural Patients*
1,785 Pediatric Patients*
28,663 Organ Transplants Performed in 2010
14,502 Organ Donors in 2010
Waiting list candidates as of 5pm 6/13/11
All                 111,671
Kidney               89,060
Pancreas              1,369
Kidney/Pancreas       2,191
Liver                16,291
Intestine               266
Heart                 3,178
Lung                  1,770
Heart/Lung               66
All candidates will be less than the sum due to candidates waiting for multiple organs
Transplants performed January - March 2011
Total                 6,709
Deceased Donor        5,276
Living Donor          1,433
Based on OPTN data as of 06/03/2011
Donors recovered January - March 2011
Total                 3,346
Deceased Donor        1,921
Living Donor          1,425
Based on OPTN data as of 06/03/2011
Let’s try to think of a more compelling way to present this data. How would you arrange
this information in a more visual style?
Presentation:
Charts Problem 4
Fifty-four candidates entering an astronaut training program were given a psychological
profile test measuring bravery. NASA grouped the data to make it more compact.
Note that the scores are grouped into units of the SAME length. Why is this important?
Would you present this as a pie chart?
A dot diagram?
A bar chart or histogram?
Score in points    # of candidates
60 - 79            8
80 - 99            16
100 - 119          18
120 - 139          8
140 - 159          6
What do you think about the extreme values on the results?
C.
Stem and Leaf Plots
An improvement on dot diagrams, stem and leaf plots work on data with many
various measurements. It is fairly low tech and can be quickly done in a meeting or on
the fly. I find them exceptionally useful in small classes (n < 50) for a quick grade
analysis.
The stems are the 10’s and the leaves are the single digits in each measurement. It can be
useful to organize the leaves in order, too.
Here is one of my classes, a final:
10 123
09 45779
08 327758
07 459
06 78
BELOW 1111
Turn the page sideways (clockwise)…note the resemblance to a dot diagram! What does
this tell you about my class?
Note that in each case, there was somebody pretty close to the next level.
What grade is “BELOW”?
Sometimes if the data is unusually condensed, you might split the stems making more
rows rather than fewer rows.
Here are some quiz grades out of 130 points:
112 114 114 116 118 119 120 121 122 123 124 125 125 126 127 127 129
The best data presentation is to show 110 – 114, 115 – 119, 120 – 124, 125 – 129 rather
than just 2 stems with LOOOOONG leaf lines:
11 244
11 689
12 01234
12 556779
Note that the stems are now both a hundreds and a tens digit!
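The split-stem plot above can be reproduced with a short Python sketch. The function name `stem_and_leaf` and its `split` switch are made up for this illustration, and the code assumes non-negative whole-number data. Stems with no leaves are simply skipped.

```python
from collections import defaultdict


def stem_and_leaf(data, split=False):
    """Text stem and leaf plot; with split=True each stem gets a 0-4 row and a 5-9 row."""
    rows = defaultdict(list)
    for value in sorted(data):          # sorting puts the leaves in order
        stem, leaf = divmod(value, 10)  # the last digit is the leaf, the rest is the stem
        half = 1 if (split and leaf >= 5) else 0
        rows[(stem, half)].append(str(leaf))
    return "\n".join(
        f"{stem:02d} {''.join(leaves)}" for (stem, half), leaves in sorted(rows.items())
    )


quiz = [112, 114, 114, 116, 118, 119, 120, 121, 122, 123, 124,
        125, 125, 126, 127, 127, 129]  # the quiz grades out of 130 points

print(stem_and_leaf(quiz, split=True))
# 11 244
# 11 689
# 12 01234
# 12 556779
```

Calling `stem_and_leaf(quiz)` without the split gives the two long leaf lines that the split is meant to avoid.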
SL Problem 1
A hotel has 85 rooms. In February of last year they had the following rental statistics:
75 79 37 57 60 64 35 73 62 81 43 72 78 54 69 75 78 49 59 80 58 76 52 49 42 62 81 77
Produce a stem and leaf plot of this data.
SL Problem 2
The following weights are ounces packed in 30 one pound bags. Display the data and
analyze the data.
15.6 15.9 16.2 16.0 15.6 15.9 16.0 15.6 15.6 16.0 15.6 15.9 16.2 15.6 16.2
16.0 15.8 15.9 16.2 15.8 15.8 16.2 16.2 16.0 16.2 15.9 16.2 15.8 16.2 16.0
SL Problem 3
Decide which representation you’d like to use with this data to show the age of the
presidents at inauguration.
Consider doing a time plot*, too. Are we electing younger people than earlier in our
history?
How could you present the categorical data? Party affiliation, home state, religion…
*a chronological presentation with time on the x axis.
Presidents
Find information about U.S. presidents, including party affiliation, term in office, age at
inauguration, age at death, and more.
State
of
birth
Born
Died
1789–
1797
Va.
2/22/1732
12/14/1799 Episcopalian
J. Adams
(F)
1797–
1801
Mass.
10/30/1735 7/4/1826
3.
Jefferson
(DR)
1801–
1809
Va.
4/13/1743
4.
Madison
(DR)
1809–
1817
Va.
5.
Monroe
(DR)
1817–
1825
6.
7.
Name and
(party)1
Term
1.
Washington
(F)3
2.
Religion2
Age
Age
at
at
inaug. death
57
67
Unitarian
61
90
7/4/1826
Deist
57
83
3/16/1751
6/28/1836
Episcopalian
57
85
Va.
4/28/1758
7/4/1831
Episcopalian
58
73
J. Q. Adams 1825–
(DR)
1829
Mass.
7/11/1767
2/23/1848
Unitarian
57
80
Jackson (D)
S.C.
3/15/1767
6/8/1845
Presbyterian
61
78
1829–
1837
8.
Van Buren
(D)
1837–
1841
N.Y.
12/5/1782
7/24/1862
Reformed Dutch
54
79
9.
W. H.
Harrison
(W)4
1841
Va.
2/9/1773
4/4/1841
Episcopalian
68
68
10. Tyler (W)
1841–
1845
Va.
3/29/1790
1/18/1862
Episcopalian
51
71
11. Polk (D)
1845–
1849
N.C.
11/2/1795
6/15/1849
Methodist
49
53
12. Taylor (W)4
1849–
1850
Va.
11/24/1784 7/9/1850
Episcopalian
64
65
13. Fillmore (W)
1850–
1853
N.Y.
1/7/1800
Unitarian
50
74
14. Pierce (D)
1853–
1857
N.H.
11/23/1804 10/8/1869
Episcopalian
48
64
Buchanan
(D)
1857–
1861
Pa.
4/23/1791
6/1/1868
Presbyterian
65
77
16. Lincoln (R)5
1861–
1865
Ky.
2/12/1809
4/15/1865
Liberal
52
56
A. Johnson
(U)6
1865–
1869
N.C.
12/29/1808 7/31/1875
(7)
56
66
18. Grant (R)
1869–
1877
Ohio
4/27/1822
7/23/1885
Methodist
46
63
19. Hayes (R)
1877–
1881
Ohio
10/4/1822
1/17/1893
Methodist
54
70
20. Garfield (R)5
1881
Ohio
11/19/1831 9/19/1881
Disciples of Christ
49
49
21. Arthur (R)
1881–
1885
Vt.
10/5/1829
11/18/1886 Episcopalian
50
56
22.
Cleveland
(D)
1885–
1889
N.J.
3/18/1837
6/24/1908
Presbyterian
47
71
23.
B. Harrison
(R)
1889–
1893
Ohio
8/20/1833
3/13/1901
Presbyterian
55
67
1893–
N.J.
3/18/1837
6/24/1908
Presbyterian
55
71
15.
17.
24. Cleveland
3/8/1874
(D)8
1897
25.
McKinley
(R)5
1897–
1901
Ohio
1/29/1843
26.
T. Roosevelt 1901–
(R)
1909
N.Y.
27. Taft (R)
1909–
1913
28. Wilson (D)
9/14/1901
Methodist
54
58
10/27/1858 1/6/1919
Reformed Dutch
42
60
Ohio
9/15/1857
Unitarian
51
72
1913–
1921
Va.
12/28/1856 2/3/1924
Presbyterian
56
67
29. Harding (R)4
1921–
1923
Ohio
11/2/1865
8/2/1923
Baptist
55
57
30. Coolidge (R)
1923–
1929
Vt.
7/4/1872
1/5/1933
Congregationalist
51
60
31. Hoover (R)
1929–
1933
Iowa
8/10/1874
10/20/1964 Quaker
54
90
F. D.
32. Roosevelt
(D)4
1933–
1945
N.Y.
1/30/1882
4/12/1945
51
63
33. Truman (D)
1945–
1953
Mo.
5/8/1884
12/26/1972 Baptist
60
88
34.
Eisenhower
(R)
1953–
1961
Tex.
10/14/1890 3/28/1969
62
78
35.
Kennedy
(D)5
1961–
1963
Mass.
5/29/1917
11/22/1963 Roman Catholic
43
46
36.
L. B.
Johnson (D)
1963–
1969
Tex.
8/27/1908
1/22/1973
Disciples of Christ
55
64
37. Nixon (R)9
1969–
1974
Calif.
1/9/1913
4/22/1994
Quaker
56
81
38. Ford (R)
1974–
1977
Neb.
7/14/1913
12/26/2006 Episcopalian
61
—
39. Carter (D)
1977–
1981
Ga.
10/1/1924
—
Southern Baptist
52
—
40. Reagan (R)
1981–
1989
Ill.
2/6/1911
6/5/2004
Disciples of Christ
69
93
3/8/1930
Episcopalian
Presbyterian
1989–
1993
Mass.
6/12/1924
—
Episcopalian
64
—
1993–
2001
Ark.
8/19/1946
—
Baptist
46
—
G. W. Bush
(R)
2001–
2009
Conn.
July 6,
1946
—
Methodist
54
—
44. Obama (D)
2009–
Hawaii
Aug. 4,
1961
—
United Church of
Christ
47
41.
G.H.W.
Bush (R)
42. Clinton (D)
43.
NOTE:
1. F—Federalist; DR—Democratic-Republican; D—Democratic; W—Whig; R—Republican; U—Union.
2. Religious affiliation at election. Several presidents changed religions during their lifetimes.
3. No party for first election. The party system in the U.S. made its appearance during Washington's first term.
4. Died in office.
5. Assassinated in office.
6. The Republican National Convention of 1864 adopted the name Union Party. It renominated Lincoln for president; for vice
president it nominated Johnson, a War Democrat. Although frequently listed as a Republican vice president and president,
Johnson undoubtedly considered himself strictly a member of the Union Party. When that party broke apart after 1868, he
returned to the Democratic Party.
7. Johnson was not a professed church member; however, he admired the Baptist principles of church government.
8. Second nonconsecutive term.
9. Resigned Aug. 9, 1974.
D. Box and Whisker plots:
Disability Adjusted Life Expectancy, 1999
age      freq
25       1
30       9
35       11
40       11
45       10
50       40
55       37
60       33
65       20
70       15
75       4
Total    191
Sierra Leone: 29.5
Japan: 73.8
The UN calculated the Disability Adjusted Life Expectancy for citizens of the 191
member countries in 1999. The table above shows their findings.
Five number summary:
Max
Q3
Median
Q1
Min
IQR: interquartile range (upper quartile – lower quartile), that is, Q3 – Q1
Upper fence: 1.5 IQR above Q3;
Lower fence: 1.5 IQR below Q1.
[never really show these in your box plot]
Box: Q3, Median, Q1. Make a rectangle; any width works fine
Whiskers: lines to the most extreme value inside the fences, top with horizontal “stop”
Show outlier values as asterisks.
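The numbers needed to draw a box and whisker plot can be collected with the Python sketch below. It assumes the median-of-each-half rule for Q1 and Q3 (the middle measurement is left out of both halves when n is odd); the function names and the seven data points are hypothetical.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in order."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def box_plot_numbers(data):
    """Five number summary, IQR, fences, whisker ends, and outliers for a box plot."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    q1 = median_of_sorted(ordered[:half])
    q3 = median_of_sorted(ordered[half + 1:] if n % 2 else ordered[half:])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr  # 1.5 IQR below Q1
    upper_fence = q3 + 1.5 * iqr  # 1.5 IQR above Q3
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    return {
        "min": ordered[0], "q1": q1, "median": median_of_sorted(ordered),
        "q3": q3, "max": ordered[-1], "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),  # most extreme values inside the fences
        "outliers": [x for x in ordered if x < lower_fence or x > upper_fence],
    }


summary = box_plot_numbers([1, 3, 5, 7, 9, 11, 40])  # hypothetical data
print(summary["q1"], summary["median"], summary["q3"])  # 3 7 11
print(summary["fences"])     # (-9.0, 23.0)
print(summary["whiskers"])   # (1, 11)
print(summary["outliers"])   # [40]
```

In this example the upper whisker stops at 11, not at the maximum of 40, because 40 lies outside the upper fence and is drawn as an asterisk.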
Vertical display:
Horizontal display under histogram:
BW01
Here are some pre-lesson grades (10 points) plotted with a “double stem”: each 10’s
category is broken into 2 parts: 0 – 4 and 5 – 9
3 57
4 0023
4 5666899
5 234
5 56789
6 1224
6 9
7 23
7 8
8 1
Recreate the first 4 measurements from either end:
Find Q1 and Q3 and the Median
Find the “fences”
Do a horizontal box and whisker plot.
Are there any outliers in this data?
BW02
Comparing groups with box and whisker plots.
A student designed an experiment to test the efficiency of 4 coffee containers from
different manufacturers by pouring coffee at 180°F into each container and then measuring
the temperature difference after 30 minutes. She did the experiment 5 times – using
different cups of the same type each time (she didn’t reuse any of the cups). So she used
20 cups total, 5 from each manufacturer.
The 5 number summary average temperature differences are in the table below
         Min    Q1      Median   Q3      Max    IQR
Cup 1    6°F    6       8.33     14.25   18.5   8.25
Cup 2    0°F    1       2        4.5     7      3.5
Cup 3    9°F    11.5    14.25    21.75   24.5   10.25
Cup 4    6°F    6.50    8.50     14.25   17.5   7.75
Using VERTICAL box and whisker diagrams and a vertical axis of Temperature Change,
Compare the data. Which cup has the best heat retention property?
Scatter Plots, Time Plots, and Line Plots
Plotting: Problem 1
The Bureau of Labor Statistics tracks the buying power of our currency by using a fixed
basket of goods and services. It prices the items and records how much the same items
cost over time. The base period is the average cost of the basket for some given period of
time of the time series. The base period for the following data is 1982 – 1984 – the
basket costs about $100 for the time period.
Year    Cost of basket
1970    $39
1973    $44
1976    $57
1979    $73
1982    $97
1985    $108
1988    $118
Plot the data and see if you can discuss both trend and rate of change of the purchasing
power of $100.
Plotting: Problem 2
Deaths from cancer:
Year    Deaths per 100,000 people
1940    120
1945    137
1950    140
1955    147
1960    148
1965    153
1970    160
1975    170
1980    182
1985    190
1990    201
Plot the data and think critically about whether cancer is getting much, much more
prevalent or if there’s something else going on socially, too!
Plotting: Problem 3
Sometimes 2 different views can each provide information for decisions.
Here are 20 measurements, taken over 20 hours IN ORDER. They measure the tension
on a wire grid behind an electronic display. If the tension is too high or too low, the
display quits working for safety reasons.
265.5 297.0 269.6 283.3 304.8 280.4 283.5 257.4 317.5 327.4 264.7 307.7
310.0 343.3 328.1 342.6 338.8 340.1 374.6 336.1
Make a stem plot using 2 digits for the stem PLUS make a time plot. What items of
interest to the managers do you see in EACH display? Describe the distributions and
what the management might need to do.
Scatter plots
Here is some data taken after an airport opened near a neighborhood. The first column is
the number of weeks since the airport opened and the second column is the sound
frequency range to which the person’s hearing will respond.
Weeks (x)    Range    Formula
47           15.1     14.4775
56           14.1     14.32
116          13.2     13.27
178          12.7     12.185
19           14.6     14.9675
75           13.8     13.9875
160          11.9     12.5
31           14.8     14.7575
12           15.3     15.09
164          12.6     12.43
43           14.7     14.5475
74           14.0     14.005
We’ll graph these with the vertical axis going from 12 to 15 and the horizontal axis going
from 0 to 200.
Linear regression line:
y = -0.0175x + 15.3
If you plot this line THROUGH your scatter plot it will APPROXIMATELY follow the direction
the data points are going in (the points on this line are listed under “Formula” in the table).
Naturally the data points will be off the plotted line. How far off is recorded in a
statistic called “r”, the correlation coefficient. An r of 0 is very bad: your data is
basically a cloud. An r of 1 or -1 is PERFECT: every point is on the line.
The r for this data is -0.88: a negative slope to the line, and the points are quite close to
the line.
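For readers who want to see where a regression line and an r value come from, here is a minimal least-squares sketch in Python using the twelve (weeks, range) pairs from the table. The handout does not say how its line was obtained, so the usual least-squares formulas are an assumption here, and the function and variable names are invented for the example.

```python
import math

weeks = [47, 56, 116, 178, 19, 75, 160, 31, 12, 164, 43, 74]
hearing = [15.1, 14.1, 13.2, 12.7, 14.6, 13.8, 11.9, 14.8, 15.3, 12.6, 14.7, 14.0]


def linear_regression(xs, ys):
    """Least-squares slope and intercept, plus the correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


slope, intercept, r = linear_regression(weeks, hearing)
print(f"y = {slope:.4f}x + {intercept:.2f}")  # y = -0.0175x + 15.32
print(f"r = {r:.2f}")
```

The sign of r always matches the sign of the slope, so a downward-sloping line goes with a negative r.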
[Scatter plot: Hearing (vertical axis, 0 to 18) against Weeks (horizontal axis, 0 to 200)]
[Scatter plot with fitted line, Linear (Series1): Hearing (vertical axis, 0 to 16) against Weeks (horizontal axis, 0 to 200)]