ES 25 Quantitative Thinking
Lab 4: Data Description: Summary Statistics and Histograms
Due: Tuesday, May 1st, before 12 noon (E-mail to your facilitator)
Investigation #1: ES 25 Students' Water Consumption
1. Summary Statistics
• Generate “summary statistics” for the list of students' water consumption (measured in gallons/day).
• Tools/Data Analysis/Descriptive Statistics/Input range = (select data)/Output range (select destination cell on same worksheet)/Check “summary statistics”
• You should get the following chart (if you do NOT have the Data Analysis Toolpak on your computer, you can get the values in the chart using the formulas given):
Summary Statistics
Statistic            Value     Interpretation
Mean                 76.93     The "average" of students' water consumption is 77 gal/day.
Median               65.04     One half of the students reported water consumption values that were less than 65 gal/day, and the other half of the students reported consuming more than 65 gal/day.
Standard Deviation   47.68     On average, the daily consumption values were 47.7 gal/day more or less than the mean value of 77 gal/day. Note: this is not a very good measure of spread, since the distribution is not symmetric (Normal).
Minimum              24.71     The smallest reported water use was 24.7 gal/day.
Maximum              217.00    The largest reported water use was 217 gal/day.
Sum                  3539.00   The total daily water consumption for ES 25 was 3539 gal/day.
Count                46        Forty-six students participated in this exercise.
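As a quick consistency check on the chart above, the mean should equal the sum divided by the count. A minimal Python sketch using only the values reported in the chart (the variable names are illustrative and not part of the lab):

```python
# Values copied from the summary statistics chart above.
total_gal_per_day = 3539.00  # Sum
n_students = 46              # Count

# The mean is the sum of the observations divided by the number of observations.
mean_gal_per_day = total_gal_per_day / n_students
print(round(mean_gal_per_day, 2))  # 76.93
```

The result agrees with the reported mean of 76.93 gal/day.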
2. Frequency table and Histogram
• Calculate the “frequencies” for the water consumption data, using the following bins: (0-40], (40-80], (80-120], (120-160], (160-200], (200-240]
• NOTE: for Excel to recognize the bins above, you must type them in as: 40, 80, 120, 160, 200, 240… It does not recognize parentheses as number values.
• Frequencies can be calculated using Tools/Data Analysis/Histogram:
o input = water consumption data, bins = bins in the left-hand column below, output = click on one cell above where you want the values to start
• Calculate the “relative frequency” and the “density” to generate the following table (replace the one below with your completed table):
Bin Width = 40
Bin    Range        Frequency   Relative Frequency   Density
40     (0-40]       14          0.304348             0.007609
80     (40-80]      16          0.347826             0.008696
120    (80-120]     8           0.173913             0.004348
160    (120-160]    3           0.065217             0.001630
200    (160-200]    4           0.086957             0.002174
240    (200-240]    1           0.021739             0.000543
Sum                 46          1
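The relative frequency and density columns follow directly from the frequency column: relative frequency = frequency / total count, and density = relative frequency / bin width. A minimal Python sketch of that arithmetic, using the frequencies from the table above (the variable names are illustrative; the lab itself uses Excel):

```python
# Frequencies copied from the table above; every bin is 40 gal/day wide.
bin_labels = ["(0-40]", "(40-80]", "(80-120]", "(120-160]", "(160-200]", "(200-240]"]
frequencies = [14, 16, 8, 3, 4, 1]
bin_width = 40

total = sum(frequencies)  # number of students

# Relative frequency: the share of all observations that fall in each bin.
relative_frequencies = [f / total for f in frequencies]

# Density: relative frequency per unit of bin width (per gal/day).
densities = [rf / bin_width for rf in relative_frequencies]

for label, f, rf, d in zip(bin_labels, frequencies, relative_frequencies, densities):
    print(f"{label:>10} {f:>3} {rf:.6f} {d:.6f}")

# In a density histogram the bar areas (density * width) add up to 1.
total_area = sum(d * bin_width for d in densities)
```

Because density divides out the bin width, density histograms stay comparable when the bin width changes, which is not true of frequency histograms.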
• Use the “Chart” to make a “Frequency histogram.” Try to make it look like this:
[Figure: "Histogram: ES 25 Student Water Consumption". x-axis: Total Daily Water Consumption (gal/day), bins (0-40] through (200-240]; y-axis: Frequency, 0 to 18.]
• Use the “Chart” to make a Relative Frequency histogram. Paste below.
[Figure: "Relative Frequency Histogram: ES 25 Student Water Consumption". x-axis: Total Daily Water Consumption (gal/day), bins (0-40] through (200-240]; y-axis: Relative Frequency, 0 to 0.4.]
• Use the “Chart” to make a Density histogram. Paste below.
[Figure: "Density Function: ES 25 Student Water Consumption". x-axis: Total Daily Water Consumption (gal/day), bins (0-40] through (200-240]; y-axis: Relative Frequency/Bin Width, 0 to 0.01.]
3. Investigate effects of bin width:
• Make Relative Frequency histograms for the bin widths below. Format the histograms like the previous group, for easier comparison. Paste below (resize them so that you can easily and accurately compare the graphs).
o Bin width = 25 and Bin width = 10
[Figure: "Relative Frequency Histogram of ES 25 Daily Water Consumption" (bin width = 25). x-axis: Water Consumption (gal/day), bins (0-25] through (200-225]; y-axis: Relative Frequency, 0 to 0.4.]
[Figure: "Relative Frequency Histogram of ES 25 Daily Water Consumption" (bin width = 10). x-axis: Water Consumption (gal/day), bins (0-10] through (210-220]; y-axis: Relative Frequency, 0 to 0.25.]
• Which bin width do you think is most appropriate for displaying the water consumption data? Write a paragraph justifying your choice (think about what patterns you can see that may be “masked” by larger or smaller bin choices).
Answers will vary. (With a bin width of 10, the distribution looks more bimodal: you get an extra peak at (70-80], but lose the big trend of lots of observations in (25-50]. The highest observation also shows up as an interesting outlier with the smaller bins.)
4. Draw conclusions (answer in complete sentences):
• Write a paragraph description of the distribution (think about type, outliers, clusters, maximums, minimums; see pages 26-27 of the workbook).
Right-skewed, which makes sense because you can’t consume negative amounts of
water (therefore, a “normal” distribution would be difficult). From the wide binned
histogram, it looks like a vast majority of students consume less than 100 gallons per
day, but there are a few outliers that seem to consume way more than everyone else. I
wonder if there was measurement error, or if student water consumption really varies
as much as we see in this histogram.
• What proportion of students used greater than or equal to 120 gallons of water per day? Explain how you got your answer (which histogram did you use, did you have to add numbers up, did you compare areas, etc.)
[Figure: "Frequency Histogram of ES 25 Daily Water Consumption" (bin width = 10). x-axis: Water Consumption (gal/day), bins (0-10] through (210-220]; y-axis: Frequency, 0 to 10.]
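I used the frequency histogram above. I added the number of observations >120 gal/day, which was: 1+2+3+1+1 = 8. So, there were 8 students out of 46 who reported consuming more than 120 gal/day, which is 17.4%. (This agrees with the frequency table for bin width = 40, where the bins above 120 hold 3+4+1 = 8 observations.)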
• What percentage of students consume less water than the mean (you may have to calculate this by hand, or assume that values are uniformly distributed within bins)?
58.7% (I counted 27 out of 46 less than the mean)
What percentage of students consume less water than the median?
By definition, 50% of the observations are less than the median, and 50% are greater.
Would you say “most students consume less than average?”
YES! Since the distribution is skewed right, the mean is influenced by the heavy water users, and is not a good measure of center. Also, since the standard deviation is measured using the mean, it is also not a good measure of spread (there is a measure called the interquartile range which is better).
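The question above allows an estimate that assumes values are uniformly distributed within bins. A minimal Python sketch of that estimate, using the bin width = 40 frequency table and the reported mean of 76.93 gal/day (the function name and its parameters are hypothetical, introduced only for illustration):

```python
def estimate_proportion_below(x, upper_edges, frequencies, bin_width):
    """Estimate the share of observations below x from binned counts,
    assuming observations are spread uniformly inside each bin."""
    total = sum(frequencies)
    count_below = 0.0
    for upper, freq in zip(upper_edges, frequencies):
        lower = upper - bin_width
        if x >= upper:
            # The whole bin lies below x.
            count_below += freq
        elif x > lower:
            # x falls inside this bin: take the matching fraction of its count.
            count_below += freq * (x - lower) / bin_width
    return count_below / total

# Bin width = 40 table from the water consumption data.
upper_edges = [40, 80, 120, 160, 200, 240]
frequencies = [14, 16, 8, 3, 4, 1]
mean = 76.93

estimate = estimate_proportion_below(mean, upper_edges, frequencies, 40)
counted = 27 / 46  # the hand count reported in the answer

print(round(estimate, 3))  # 0.625
print(round(counted, 3))   # 0.587
```

The estimate (about 62.5%) differs from the hand count (58.7%) because the observations in (40-80] are not actually spread evenly, but both support the same conclusion: more than half of the students consume less than the mean.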
• What is the probability that a randomly selected student in ES 25 would use less than 50 gallons of water per day? Explain how you got your answer (which histogram did you use, did you have to add numbers up, did you compare areas, etc.)
The proportion of students who consume less than 50 gallons per day is 17/46 (37%), so the probability that a randomly selected student would consume less than 50 gallons per day is 37%. You can get this value from the relative frequency histogram just by looking at the heights of the bars below 50 gal/day and adding them up (true, you have to make sure 50.00 is not in the dataset, since the question asks for less than 50, and your bin ends with 50]…).

Pretend that we actually did this experiment for five days, and every single day a different, randomly selected student reported using less than 50 gal/day. How would you explain the discrepancy between the probability that you calculated above and the observed phenomenon? (Clearly, there is no “right answer”; we are looking for logical reasoning.)
Perhaps students started practicing water conservation! (Or they lied.)
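The caveat about 50.00 in the probability answer comes from the bins being closed on the right: a value of exactly 50 is counted in (25-50], not in (50-75]. A minimal Python sketch of that rule, using the standard library's bisect module (the helper name bin_label and the three test values are hypothetical, chosen only to show the edge case):

```python
from bisect import bisect_left

def bin_label(x, edges):
    """Return the right-closed bin "(lower-upper]" that contains x,
    or None if x lies outside all bins."""
    # bisect_left gives the index of the first edge that is >= x,
    # which is the upper edge of the bin (lower, upper] holding x.
    i = bisect_left(edges, x)
    if i == 0 or i == len(edges):
        return None  # x <= first edge, or x > last edge
    return f"({edges[i - 1]}-{edges[i]}]"

edges = [0, 25, 50, 75]  # first few edges for bin width = 25

for x in [49.9, 50.0, 50.1]:  # hypothetical values around the bin edge
    print(x, bin_label(x, edges))
```

Here 49.9 and 50.0 both land in (25-50], while 50.1 lands in (50-75], so adding the bars up to 50 counts "less than or equal to 50" rather than strictly "less than 50".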
• What is the z-score associated with your personal water consumption (report your water consumption, in gal/day)? If you did not do HW1, choose a value, report the value (in gal/day), and calculate the associated z-score. Interpret your answer for this problem.
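A z-score is the number of standard deviations an observation lies above or below the mean: z = (x - mean) / standard deviation. A minimal Python sketch using the class mean and standard deviation from the summary statistics; the personal value of 100 gal/day is hypothetical and should be replaced with your own:

```python
# Class summary statistics from Investigation #1 (gal/day).
mean = 76.93
std_dev = 47.68

def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

my_consumption = 100.0  # hypothetical personal value, in gal/day
z = z_score(my_consumption, mean, std_dev)
print(round(z, 2))  # 0.48
```

A z-score of 0.48 would mean this student uses about half a standard deviation more water than the class average. Since the distribution is right-skewed, the standard deviation is not a very good measure of spread here, so the z-score should be interpreted with that caution in mind.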
Investigation #2: Distribution of Carbon Dioxide for a One-year Period at Mauna Loa
1. Copy the 2004 monthly CO2 data from the “MaunaLoaCO2” spreadsheet into a new
spreadsheet (name it 2004CO2). Paste special/transpose to get the data in a column
rather than a row.
2. Summary Statistics
• Generate “summary statistics” for the distribution of CO2 during the year 2004. Paste your findings (in chart form) below.
Summary Statistics
Statistic            Value     Interpretation
Mean                 377.64    The "average" carbon dioxide concentration over the 12 months of 2004 was 377.64 ppm.
Median               377.43    One half of the CO2 concentrations were less than 377.43 ppm, and the other half of the CO2 concentrations were more than 377.43 ppm.
Standard Deviation   1.95      On average, the CO2 concentrations varied by about 1.95 ppm (more or less than the mean value of 377.64 ppm). Note: this seems like a pretty good measure of spread, since the median is close to the mean, and the distribution looks symmetric (Normal).
Minimum              374.06    The smallest observed concentration of CO2 in 2004 was 374.06 ppm.
Maximum              380.63    The largest observed concentration of CO2 in 2004 was 380.63 ppm.
Range                6.57      The observations span 6.57 ppm (see min and max above). This, like standard deviation, is also a measure of spread (notice that it is much greater than the standard deviation; you should understand why).
Count                12        There are 12 months in a year. If you got 14, how do you explain it?
• Interpret the values: mean, median, standard deviation, and range, in terms of this problem. You may report your answer in the chart above, if you like.
3. Frequency table and Histogram
• Using a bin width of 2, calculate the “frequencies,” “relative frequency” and the “density,” to generate the following table (paste your completed version, below):
Bin    Bin Label    Frequency   Relative Frequency   Density
373    (371-373]    0           0                    0
375    (373-375]    1           0.08333333           0.04166667
377    (375-377]    3           0.25                 0.125
379    (377-379]    5           0.41666667           0.20833333
381    (379-381]    3           0.25                 0.125
Sum                 12          1
• Using the chart tool, make a “relative frequency” histogram (format it like you did in the water problem, with no space between the bins). Paste below.
(below, next to 1 ppm bin width histogram)
4. Investigate effects of bin width:
• Make a new frequency table (like the one above) and relative frequency histogram for a bin width = 1 ppm. Paste below.
• Which bin width (1 ppm or 2 ppm) is best at revealing patterns of variation in the data? Justify your choice.
Answers will vary; look for changes in the pattern of the histogram.
[Figure: "Relative Frequency Histogram, CO2, Mauna Loa, 2004" (bin width = 2 ppm). x-axis: CO2 (ppm), bins (371-373] through (379-381]; y-axis: labeled Relative Frequency, with ticks from 0 to 6.]
[Figure: "Relative Frequency Histogram, CO2, Mauna Loa, 2004" (bin width = 1 ppm). x-axis: CO2 (ppm), bins (373-374] through (380-381]; y-axis: Relative Frequency, 0 to 0.35.]
5. Is the distribution changing over time?
• Generate a new frequency table and histogram, for the CO2 distribution in the year that you were born (your “birth year”). If you were born before 1958, use the year 1960 data.
• Your histogram should be of “relative frequency,” and should use a bin width of 1 ppm. Paste the frequency table, and histogram, below.
[Figure: "Distribution of CO2 at Mauna Loa, 1975". x-axis: CO2 (ppm), bins (328-329] through (332-333]; y-axis: Relative Frequency, 0 to 0.35.]
Bin    Bin Label    Frequency   Relative Frequency
328    (327-328]    0           0
329    (328-329]    2           0.2
330    (329-330]    1           0.1
331    (330-331]    3           0.3
332    (331-332]    3           0.3
333    (332-333]    1           0.1
Draw conclusions (answer in complete sentences):
• In the year 2004, what proportion of months had carbon dioxide concentrations greater than 379 ppm? 0.25 (25%)
• In your birth year, what proportion of months had carbon dioxide concentrations greater than 379 ppm? 0
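Both proportions can be read off the frequency tables by adding up the bins that lie entirely above 379 ppm and dividing by the total count. A minimal Python sketch using the 2004 table (bin width = 2 ppm) and the 1975 table (bin width = 1 ppm); the function name is hypothetical, and it assumes the threshold coincides with a bin edge:

```python
def proportion_above(threshold, upper_edges, frequencies, bin_width):
    """Share of observations in bins lying entirely above the threshold.
    Assumes the threshold is a bin edge, so no bin straddles it."""
    total = sum(frequencies)
    above = sum(
        freq
        for upper, freq in zip(upper_edges, frequencies)
        if upper - bin_width >= threshold  # the bin's lower edge is at or above the threshold
    )
    return above / total

# 2004 table: bins (371-373] through (379-381], width 2 ppm.
p_2004 = proportion_above(379, [373, 375, 377, 379, 381], [0, 1, 3, 5, 3], 2)

# 1975 table: bins (327-328] through (332-333], width 1 ppm.
p_1975 = proportion_above(379, [328, 329, 330, 331, 332, 333], [0, 2, 1, 3, 3, 1], 1)

print(p_2004)  # 0.25
print(p_1975)  # 0.0
```

Because the bins are closed on the right, the (379-381] bin contains exactly the observations greater than 379 ppm, so no within-bin assumption is needed here.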
Compare the ‘typical’ CO2 observation in 2004 to the ‘typical’ CO2
observation in your birth year (think: should you be using the mean, or the
median?) Explain why the values (from your birth year to 2004) are so
different (hint: see Keeling Curve, page 4 of your workbook).
The concentration of CO2 is increasing in the atmosphere. Therefore, the
average amount in 2004 will be much higher than that in 1975. I expected
the distribution to be skewed left, since many observed high values of CO2
are “piling up” on the right hand side, as CO2 increases over time. If it was
skewed, it is better to use the median than the mean (however, my birth
year is far enough back that it looks pretty symmetrical—and 12 data points
is really not enough to get a smooth distribution).
• Compare the range and standard deviation for the 2004 data and your birth year data. Offer logical explanations (they don’t need to be “correct”) for the differences that you observe.
Statistic            1975      2004
Mean                 331.15    377.64
Median               331.15    377.43
Range                5.62      6.57
Standard deviation   1.81      1.95
All values are in ppm.
It appears as if the carbon dioxide concentrations are becoming more spread out within one year than they used to be. In other words, we see a broader range of measurements (which are not as concentrated around the “average” concentration). I think that this could be because of CO2 fertilization (plants may be growing more in the summer, taking up more CO2 than in the past, and thus respiring more CO2 in the winter).
• Compare the shape of the distribution to what you predicted it would look like (in HW3). Offer a reasonable hypothesis for the observed shape.
I thought the distribution would have been bimodal, but in fact it seems to be Normal or skewed left. I thought bimodal because of the summer and winter extreme CO2 concentrations, but now I realize that most of the time, CO2 concentration is between these two values (so that the bulk of the observations are in the middle). As previously mentioned, the distribution is becoming more left skewed as the concentration of CO2 increases, forcing more observations to the right side of the distribution (with a few trailing observations on the left from the first few months of the year).