Old Faithful
Name: _________________
Old Faithful is neither the biggest nor the most predictable geyser at Yellowstone National Park in Wyoming,
but over 85% of visitors to the park stop by to see it erupt, spraying anywhere from 3,700 to 8,400 gallons of
boiling water up to 184 feet in the air. In order to increase tourism, park officials try to predict when Old
Faithful will erupt. According to the National Park Service, the eruptions can be predicted within 10 minutes
90% of the time.
We're going to begin investigating Old Faithful eruptions using data collected from 272 eruptions in August
1985. The following table displays the time between successive eruptions, measured in seconds:
1) Take a look at the data and write out any conclusions you can make. What number best summarizes this
data?
2) For this data the mean interval between eruptions is 209.26 seconds. Is this a useful summary of the data? Does it
describe the typical amount of time we must wait for the next eruption? Defend your answer to both
questions.
3) In general, how is the mean affected by the distribution of data?
Below is the same data organized from minimum interval to maximum interval:
4) Along with the mean, the median is often used to represent a dataset. What does the median for our data set
(240) represent? Is the mean or median a more appropriate indicator of the center of our data? Support
your claim.
5) State the 5 Number Summary for the Old Faithful data.
6) Are there any outliers in the data? Check mathematically and show ALL work you do.
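As a sketch of the kind of check question 6 asks for, one common convention flags a value as an outlier when it lies more than 1.5 × IQR below Q1 or above Q3. The example below uses a small made-up sample (not the Old Faithful data) and assumes quartiles are taken as the medians of the lower and upper halves of the sorted data; other software may interpolate quartiles differently. The function names are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the middle position
    upper = data[half + n % 2:]    # values above the middle position
    return data[0], median(lower), median(data), median(upper), data[-1]

def outlier_fences(q1, q3):
    """Return the lower and upper fences of the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical intervals in seconds, invented for illustration only.
sample = [110, 120, 125, 130, 235, 240, 245, 250, 255, 600]
minimum, q1, med, q3, maximum = five_number_summary(sample)
low_fence, high_fence = outlier_fences(q1, q3)
outliers = [x for x in sample if x < low_fence or x > high_fence]
print(minimum, q1, med, q3, maximum)
print(low_fence, high_fence, outliers)
```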
7) Convert your 5 Number Summary and the Mean to minutes. How does this affect our summary statistics?
8) Suppose we realize there was a measurement error that resulted in the intervals between eruptions being 3
seconds off. To fix the error we add 3 seconds to each observation. How will this impact the 5 Number
Summary and the Mean?
9) Suppose we now notice that our observation of 300 seconds (5 minutes) was actually a typo. That
observation was supposed to be 3000 (50 minutes). What impact, if any, would this have on the 5 Number
Summary and the mean? Which summary statistics are more resistant to the presence of outliers?
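Questions 7 to 9 can also be explored numerically. The sketch below uses an invented sample (not the Old Faithful data) and Python's standard `statistics` module to compare the mean and median after a unit change, a constant shift, and the replacement of one value by a much larger one. The variable names are hypothetical.

```python
from statistics import mean, median

# Hypothetical intervals in seconds, invented for illustration only.
sample = [120, 180, 240, 270, 300]

in_minutes = [x / 60 for x in sample]   # unit change: divide by 60
shifted = [x + 3 for x in sample]       # add 3 seconds to every value
with_typo_fixed = [3000 if x == 300 else x for x in sample]  # 300 -> 3000

print(mean(sample), median(sample))
print(mean(in_minutes), median(in_minutes))
print(mean(shifted), median(shifted))
print(mean(with_typo_fixed), median(with_typo_fixed))
```

Comparing the four printed lines shows which statistics move with each change in this made-up sample.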
10) Create a boxplot and explain what it tells us about Old Faithful.
11) Boxplots are more useful when you want to compare two distributions (or a variable across groups). Below
is a series of boxplots displaying the monthly unemployment rates for each state from 1976-2010. What can
you interpret from this visualization?
12) Looking at the following histograms that display the frequency of the geyser eruptions, choose the one you
feel best describes the data. Explain your choice and what the graph tells you about Old Faithful.
13) While there is no "best" number of bins to use, as it varies with the data set and what the author is
attempting to demonstrate, one common way of selecting the number of bins is to use the square root of the
number of observations. Create a histogram with 10 bins. How wide should each bin be?
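One way to approach the bin-width calculation is to divide the range of the data by the number of bins. The sketch below assumes equal-width bins and uses invented minimum and maximum values, not the actual Old Faithful extremes. Rounding the square root up to a whole number of bins is also an assumption, and both function names are hypothetical.

```python
import math

def bin_width(minimum, maximum, bins):
    """Width of each bin when the range is split into equal-width bins."""
    return (maximum - minimum) / bins

def sqrt_rule_bins(n):
    """Number of bins suggested by the square-root rule, rounded up."""
    return math.ceil(math.sqrt(n))

# Hypothetical extremes in seconds, invented for illustration only.
width = bin_width(100, 350, 10)
print(width)                # 25.0
print(sqrt_rule_bins(100))  # 10
```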
14) Two of the ways we can measure variability are variance and standard deviation. Formulas for both are
listed below (we will only use the population versions, in accordance with IB). Explain in writing what the
formulas represent.
Variance: $\sigma^2 = \dfrac{\sum_{i=1}^{n} f_i (x_i - \bar{X})^2}{n}$
Standard Deviation: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{n} f_i (x_i - \bar{X})^2}{n}}$
15) For Old Faithful the variance is 4690 seconds squared and the standard deviation is 68.5 seconds. Explain
what these values represent.
16) Suppose we, once again, decided to convert our data into minutes by dividing each value by 60. What
would happen to the variance and standard deviation? Explain.
17) Suppose we, once again, added 3 seconds to each value due to our measurement error. What would happen to
the variance and standard deviation?
18) Suppose, once again, we needed to change the value of 300 to 3000. Which of the following statistics
would be most impacted by this outlier: range, IQR, variance, or standard deviation?
19) Create a frequency table with 10 classes. What is the interval width of each class?
Class Interval | Frequency
20) Based on the frequency table you created, calculate the mean for the data 3 times, using the lower class
bounds, the upper class bounds, and the class midpoints.
Lower Bound Mean: _________________
Upper Bound Mean: _________________
Class Midpoint Mean: _________________
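The three estimates differ only in which representative value stands in for every observation in a class. The sketch below uses an invented frequency table (not one built from the Old Faithful data); the class bounds, frequencies, and the `grouped_mean` helper are all hypothetical.

```python
def grouped_mean(representatives, freqs):
    """Mean estimated from one representative value per class."""
    total = sum(freqs)
    return sum(r * f for r, f in zip(representatives, freqs)) / total

# Hypothetical classes (lower bound, upper bound) and frequencies,
# invented for illustration only.
classes = [(100, 150), (150, 200), (200, 250), (250, 300)]
freqs = [4, 2, 6, 8]

lower_mean = grouped_mean([lo for lo, hi in classes], freqs)
upper_mean = grouped_mean([hi for lo, hi in classes], freqs)
midpoint_mean = grouped_mean([(lo + hi) / 2 for lo, hi in classes], freqs)
print(lower_mean, midpoint_mean, upper_mean)
```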
21) Explain why using the midpoint will generally give us the best estimate for the mean.
22) How could you estimate the median for data compiled in a frequency table? Use your process to find the
median for the frequency table you created.
23) Calculate by hand the standard deviation for your frequency table, then compare it with the original standard
deviation of 68.5 seconds.
24) Based on the previous problems create a list of pros and cons for using frequency tables when working with
large amounts of data.
25) Researchers believed that a relationship exists between the interval between eruptions and the duration of
the previous eruption. The following scatter plot displays the duration of geyser eruptions and the interval
before the next eruption. Describe the relationship between these two variables.
26) Below is the same graph but with a regression line added to the graph. Interpret the slope and intercept of
our regression line.
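For reference, a least-squares regression line can be fitted from the usual formulas: the slope is the sum of products of deviations divided by the sum of squared deviations in x, and the intercept makes the line pass through the point of means. The sketch below uses invented duration and interval pairs (not the Old Faithful measurements); the units and the `least_squares_line` helper are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical (duration in minutes, next interval in minutes) pairs,
# invented for illustration only.
durations = [2.0, 3.0, 4.0, 5.0]
intervals = [55.0, 65.0, 75.0, 85.0]
slope, intercept = least_squares_line(durations, intervals)
print(slope, intercept)  # 10.0 35.0
```

In this made-up sample, the slope says each additional minute of eruption adds 10 minutes to the predicted wait, and the intercept is the predicted wait for a duration of zero.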
Search the internet for a small dataset with between 25 and 100 individual observations and do the following:
1. Calculate the 5 Number Summary by hand then check your work by using a TI-Nspire.
Website of data:
Min: _________
Q1: _________
Med: _________
Q3: _________
Max: _________
2. Using your calculator, find the mean, standard deviation, and variance.
μ: _________
σ: _________
σ²: _________
3. Create a frequency table for the data (with appropriately sized classes) and calculate an estimated mean
and standard deviation.
μ: _________
σ: _________
4. Create at least 2 visualizations of the data (boxplot, histogram, stem plot etc.) and attach them.
5. Under each of the graphs you created write a paragraph summarizing what you learned about the data
along with the advantages/disadvantages of the graph you created.