Old Faithful

Name: _________________

Old Faithful is neither the biggest nor the most predictable geyser at Yellowstone National Park in Wyoming, but over 85% of visitors to the park stop by to see it erupt, spraying anywhere from 3,700 to 8,400 gallons of boiling water up to 184 feet in the air. In order to increase tourism, park officials try to predict when Old Faithful will erupt. According to the National Park Service, the eruptions can be predicted within 10 minutes 90% of the time. We're going to begin investigating Old Faithful eruptions using data collected from 272 eruptions in August 1985.

The following table displays the time between subsequent eruptions, measured in seconds:

1) Take a look at the data and write out any conclusions you can make. What number best summarizes this data?
2) For this data the mean interval between eruptions is 209.26 seconds. Is this a useful summary of the data? Does it describe the typical amount of time we must wait for the next eruption? Defend your answer to both questions.
3) In general, how is the mean affected by the distribution of data?

Below is the same data organized from minimum interval to maximum interval:

4) Along with the mean, the median is often used to represent a data set. What does the median for our data set (240) represent? Is the mean or the median a more appropriate indicator of the center of our data? Support your claim.
5) State the 5 Number Summary for the Old Faithful data.
6) Are there any outliers in the data? Check mathematically and show ALL work you do.
7) Convert your 5 Number Summary and the Mean to minutes. How does this affect our summary statistics?
8) Suppose we realize there was a measurement error that resulted in the intervals between eruptions being 3 seconds off. To fix the error we add 3 seconds to each observation. How will this impact the 5 Number Summary and the Mean?
9) Suppose we now notice that our observation of 300 seconds (5 minutes) was actually a typo.
That observation was supposed to be 3000 seconds (50 minutes). What impact, if any, would this have on the 5 Number Summary and the mean? Which summary statistics are more resistant to the presence of outliers?
10) Create a boxplot and explain what it tells us about Old Faithful.
11) Boxplots are more useful when you want to compare two distributions (or a variable across groups). Below is a series of boxplots displaying the monthly unemployment rates for each state from 1976 to 2010. What can you interpret from this visualization?
12) Looking at the following histograms that display the frequency of the geyser eruptions, choose the one you feel best describes the data. Explain your choice and what the graph tells you about Old Faithful.
13) While there is no "best" number of bins to use, as it varies with the data set and what the author is attempting to demonstrate, one common way of selecting the number of bins is to use the square root of the quantity of data. Create a histogram with 10 bins. How wide should each bin be?
14) Some of the ways we can measure variability are variance and standard deviation. Formulas for both are listed below (we will only use the population versions, in accordance with IB). Explain in writing what the formulas represent.

Variance: $\sigma^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{n}$

Standard Deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{n}}$

15) For Old Faithful the variance is 4690 seconds squared and the standard deviation is 68.5 seconds. Explain what these values represent.
16) Suppose we, once again, decided to convert our data into minutes by dividing each value by 60. What would happen to the variance and standard deviation? Explain.
17) Suppose we, once again, added 3 seconds to each value due to our measurement error. What would happen to the variance and standard deviation?
18) Suppose, once again, we needed to change the value of 300 to 3000. Which of the following statistics would be most impacted by this outlier: range, IQR, variance, or standard deviation?
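Questions 16 to 18 ask how rescaling, shifting, and a single extreme value affect the variance and standard deviation. One way to check an answer is to try it on a short list of numbers. The sketch below is only an illustration: the `intervals` list is made up for the example (it is not the 1985 Old Faithful data), and `spread` is a hypothetical helper built on the population functions in Python's standard `statistics` module.

```python
from statistics import pstdev, pvariance

# Made-up intervals in seconds, for illustration only (NOT the 1985 data).
intervals = [110, 120, 130, 240, 250, 260, 270, 300]


def spread(values):
    """Return (population variance, population standard deviation)."""
    return pvariance(values), pstdev(values)


# Original data, in seconds.
var_s, sd_s = spread(intervals)

# Rescale: seconds -> minutes (divide every value by 60).
var_min, sd_min = spread([x / 60 for x in intervals])

# Shift: add 3 seconds to every value.
var_shift, sd_shift = spread([x + 3 for x in intervals])

# Outlier: the value 300 is replaced by 3000.
var_out, sd_out = spread([3000 if x == 300 else x for x in intervals])

print("original:      ", var_s, sd_s)
print("in minutes:    ", var_min, sd_min)
print("shifted by 3:  ", var_shift, sd_shift)
print("with 3000:     ", var_out, sd_out)
```

Comparing the printed rows shows which changes leave the spread alone and which do not. The same experiment can be repeated with the range and IQR to answer question 18.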
19) Create a frequency table with 10 classes. What is the interval width of each class?

Class Interval    Frequency

20) Based on the frequency table you created, calculate the mean for the data 3 times, using the lower class bounds, the upper class bounds, and the class midpoints.

Lower Bound Mean: _________________
Upper Bound Mean: _________________
Class Midpoint Mean: _________________

21) Explain why using the midpoint will generally give us the best estimate for the mean.
22) How could you estimate the median for data compiled in a frequency table? Use your process to find the median for the frequency table you created.
23) Calculate by hand the standard deviation for your frequency table, then compare it with the original standard deviation of 68.5 seconds.
24) Based on the previous problems, create a list of pros and cons for using frequency tables when working with large amounts of data.
25) Researchers believed that a relationship exists between the interval between eruptions and the duration of the previous eruption. The following scatter plot displays the duration of geyser eruptions and the interval before the next eruption. Describe the relationship between these two variables.
26) Below is the same graph but with a regression line added. Interpret the slope and intercept of our regression line.

Search the internet for a small dataset with between 25 and 100 individual observations and do the following:

1. Calculate the 5 Number Summary by hand, then check your work by using a TI-Nspire.
Website of data:
Min: _________ Q1: _________ Med: _________ Q3: _________ Max: _________
2. Using your calculator, find the mean, standard deviation, and variance.
μ: _________ σ: _________ σ²: _________
3. Create a frequency table for the data (with appropriately sized classes) and calculate an estimated mean and standard deviation.
μ: _________ σ: _________
4. Create at least 2 visualizations of the data (boxplot, histogram, stem plot, etc.) and attach them.
5. Under each of the graphs you created, write a paragraph summarizing what you learned about the data, along with the advantages/disadvantages of the graph you created.
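If you want a second check on your hand calculations besides the calculator, the steps can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the `sample` values and the class boundaries are made up for illustration, the function names are hypothetical, and the quartiles use the median-of-halves convention (the middle value is left out of both halves when the count is odd). Other tools may use a different quartile rule and give slightly different Q1 and Q3, so confirm which rule your calculator follows.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]          # values below the middle position
    upper = data[half + n % 2:]  # values above it (middle value skipped if n is odd)
    return data[0], median(lower), median(data), median(upper), data[-1]


def outlier_fences(q1, q3):
    """Return the lower and upper fences from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def grouped_mean(classes):
    """Estimate the mean from (lower bound, upper bound, frequency) rows
    by treating every value in a class as if it sat at the class midpoint."""
    total = sum(freq for _, _, freq in classes)
    return sum((lo + hi) / 2 * freq for lo, hi, freq in classes) / total


# Made-up sample, for illustration only.
sample = [12, 15, 17, 18, 21, 22, 24, 27, 30, 31, 58]

summary = five_number_summary(sample)
low_fence, high_fence = outlier_fences(summary[1], summary[3])
outliers = [x for x in sample if x < low_fence or x > high_fence]

# Frequency table for the same sample, with classes of width 10.
table = [(10, 20, 4), (20, 30, 4), (30, 40, 2), (40, 50, 0), (50, 60, 1)]
estimated_mean = grouped_mean(table)

print("5 Number Summary:", summary)
print("Fences:", low_fence, high_fence, "Outliers:", outliers)
print("Estimated mean from table:", estimated_mean)
```

Comparing `estimated_mean` with the mean of the raw values shows how much information is lost by grouping, which is worth mentioning in your write-up.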