Amy Mault
Math 1040 Term Project

The goal of this project is to draw statistically meaningful conclusions from a sample of Skittles bags purchased by our class. First we will organize the categorical data on the colors found in the sample using pie charts and Pareto charts. Then we will organize the quantitative data on the number of candies found in each bag, using it to find the mean, the standard deviation and the 5 number summary, and then create a frequency histogram and a boxplot from that information. Using these tools we hope to gain an idea of the real world application of statistical data, both categorical and quantitative.

[Figure: Skittles Pie Chart by Candy Color]
[Figure: Skittles Pareto Chart]

The charts above represent the overall counts of the colors found in our sample of 2435 candies in 38 bags. Comparing the overall data to my own sample of candies, I am surprised that the colors appeared in almost equal numbers. My sample had only 4 purple candies, and somehow I thought that purple was the 'most special' of the colors, with very few per bag.

Color      Class Data   My Bag
Red            500         15
Yellow         474          7
Orange         446         17
Green          503         18
Purple         512          4

Quantitative data: candies per bag

Statistic               Value
Mean                    64.078947 (64.1)
Std. dev.               13.19988 (13.20)
Min                     45
Q1                      59
Median (Q2)             61
Q3                      62
Max                     114
Total number of bags    38
Total candies           2435

[Figure: Boxplot of candies per bag]

The frequency histogram is right skewed, the mean number of candies per bag was 64.1, and there were a couple of bags with an outrageous number of candies in the 110-120 class. My own bag, at 61 candies, fell just under the mean. It seems most of the bags in the sample were close to the mean. I am a little surprised to see there were no bags at all in the 80-90 candies per bag class.
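As a quick check on the color counts, the short Python sketch below recomputes the totals, the share of each color, and the descending order a Pareto chart would use. The counts are copied from the class data and my bag; the function names are illustrative and not part of the original project.

```python
# Color counts copied from the "Class Data" and "My Bag" figures.
CLASS_COUNTS = {"Red": 500, "Yellow": 474, "Orange": 446, "Green": 503, "Purple": 512}
MY_BAG_COUNTS = {"Red": 15, "Yellow": 7, "Orange": 17, "Green": 18, "Purple": 4}


def proportions(counts):
    """Return each color's share of the total number of candies."""
    total = sum(counts.values())
    return {color: n / total for color, n in counts.items()}


def pareto_order(counts):
    """Return (color, count) pairs sorted from most to least frequent."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print("Class total:", sum(CLASS_COUNTS.values()))
    print("My bag total:", sum(MY_BAG_COUNTS.values()))
    for color, count in pareto_order(CLASS_COUNTS):
        share = proportions(CLASS_COUNTS)[color]
        print(f"{color:<7} {count:>4}  {share:.1%}")
```

Sorting by count before plotting is what distinguishes a Pareto chart from an ordinary bar chart, which is why the ordering gets its own helper here.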
All in all, this data is well represented by the graphs used here. Categorical data, used to show the colors of candies in each bag, is best represented with pie charts and Pareto charts, because you can see the individual colors and their general distribution across the sample. To represent the quantitative data, a frequency histogram makes better sense. The number of candies per bag is shown best with this type of chart, and you can clearly see that the mean number of candies is in the 60-70 class, with only a couple of bags in the extremely low (40-50) and extremely high (110-120) classes. Calculating a mean color per bag would not make sense because candy color is nominal data: you can count the candies in each category, but the categories cannot be ordered or measured. The number of candies per bag, on the other hand, is quantitative data that can be measured, and the differences between values are meaningful.

Confidence Intervals

A confidence interval is a range, given by a lower and an upper value, that is likely to contain a population parameter. The general purpose of using confidence intervals is to estimate a parameter with a stated level of ‘confidence’, with the width of the interval based on the standard error of the estimate. The end result shows how well the sample point estimate can be used to approximate the true population parameter. The following problems use confidence intervals to estimate the true proportion of yellow candies (99% confidence interval), the true mean number of candies per bag (95% confidence interval), and the standard deviation of the number of candies per bag (98% confidence interval).

The results of the first problem, the true proportion of yellow candies, tell us that we can be 99% confident that between 17% and 21% of all Skittles are yellow.
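As a rough check on the proportion and mean intervals in this section, the sketch below recomputes them from the summary figures. It assumes the usual normal-approximation formula for a proportion and a t-based formula for a mean; the yellow count of 474 out of 2435 comes from the class totals, and the t critical value of about 2.026 (95% confidence, 37 degrees of freedom) is a table value supplied here as an assumption, since the original calculations are not shown.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, n, confidence):
    """Normal-approximation (z) confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def mean_interval(mean, std_dev, n, t_critical):
    """Confidence interval for a mean, given a t critical value from a table."""
    margin = t_critical * std_dev / sqrt(n)
    return mean - margin, mean + margin


# Assumption: 474 yellow candies out of 2435, taken from the class totals.
yellow_low, yellow_high = proportion_interval(474, 2435, 0.99)

# Assumption: t critical value of about 2.026 (95% confidence, 37 degrees of freedom).
mean_low, mean_high = mean_interval(64.078947, 13.19988, 38, 2.026)

print(f"99% CI for proportion of yellow: {yellow_low:.3f} to {yellow_high:.3f}")
print(f"95% CI for mean candies per bag: {mean_low:.1f} to {mean_high:.1f}")
```

With these inputs the intervals come out at roughly 17% to 22% for the proportion of yellow candies and roughly 60 to 68 candies per bag.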
The first interval appears to be only somewhat accurate because, looking at the data, most bags of Skittles contained approximately 15-20 yellow Skittles with 60 candies per bag. I believe there were outliers (bags with over 100 candies) that push the true proportion beyond where it should be. For the second problem, determining the true mean number of candies per bag, we can say with 95% confidence that the true mean is between 60 and 68 candies per bag. This appears to be accurate, with the exception (once again) of the outliers possibly pushing the upper end of the interval beyond where it should be. For the third confidence interval problem, we can say with 98% confidence that the true standard deviation of the number of candies per bag is between 11 and 20. This seems to be accurate considering that the sample standard deviation is just over 13, with the outliers (bags with over 100 Skittles) pushing the upper end of the interval beyond what may be expected.

Hypothesis Testing

Hypothesis testing is a way of using sample statistics to judge whether a claim about a population is plausible. The following examples test whether a claim (here also the null hypothesis, Ho) should be rejected, by following these 4 steps.

1. Formulate the null hypothesis Ho (commonly, that the observations are the result of pure chance) and the alternative hypothesis H1, and determine which hypothesis is the claim.
2. Identify a test statistic that can be used to assess the truth of the null hypothesis.
3. Compute the P-value, which is the probability that a test statistic at least as extreme as the one observed would be obtained, assuming that the null hypothesis were true. The smaller the P-value, the stronger the evidence against the null hypothesis.
4. Compare the P-value to an acceptable significance level alpha (sometimes called an alpha value).
If p <= alpha, the observed effect is statistically significant, the null hypothesis is rejected, and the alternative hypothesis is supported.

For the first example, we test the claim that 20% of all candies are yellow. The claim is Ho: p = 0.20, and the alternative hypothesis is H1: p ≠ 0.20. After finding a P-value of 0.218697 with a 0.05 significance level, we see that the P-value is larger than the significance level. This means we should not reject the claim that 20% of all Skittles are yellow. We also have insufficient evidence to support the alternative hypothesis. In summary, the data are consistent with the claim that 20% of Skittles candies are yellow.

For the second example, we test the claim that the mean number of candies per bag is 55. First we identify the claim as Ho: μ = 55, and the alternative hypothesis as H1: μ ≠ 55. After finding a P-value of 0.000139 with a 0.01 significance level, we see that the P-value is smaller than the significance level. This means we should reject the claim that the mean number of candies per bag is 55. We also have sufficient evidence to support the alternative hypothesis.

Reflection

In order to conduct a reasonably good interval estimate and hypothesis test, certain requirements must be met. First, the sampling method must be a simple random sample. Second, we need a large enough sample. To get a reasonable estimate, our sample size must be greater than 30. If the sample size is 30 or less, then the data must be normally distributed in order to obtain a reasonable estimate.

For hypothesis testing, we also have requirements that must be met. First, as with interval estimates, we need a simple random sample. Also, the sample size multiplied by the proportion of successes must be greater than or equal to 5 (np >= 5), and the sample size multiplied by the proportion of failures must also be greater than or equal to 5 (nq >= 5).
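The four steps and the np >= 5 and nq >= 5 requirements can be sketched for the yellow-candy claim as follows. This is a minimal two-tailed, one-proportion z-test, not necessarily the procedure used for the figures reported in this project. The inputs behind the reported P-value are not shown, so the class totals (474 yellow out of 2435) are plugged in as an assumption and the resulting P-value may differ from the one reported.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0, alpha):
    """Two-tailed z-test of Ho: p = p0 against H1: p != p0.

    Returns (z, p_value, reject_null).
    """
    # Requirement check: expected successes and failures under Ho.
    if n * p0 < 5 or n * (1 - p0) < 5:
        raise ValueError("np >= 5 and nq >= 5 are not both satisfied")

    p_hat = successes / n
    standard_error = sqrt(p0 * (1 - p0) / n)  # uses the claimed proportion p0
    z = (p_hat - p0) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value <= alpha


# Assumption: 474 yellow candies out of 2435, taken from the class totals.
z, p_value, reject_null = one_proportion_z_test(474, 2435, 0.20, 0.05)
print(f"z = {z:.3f}, P-value = {p_value:.3f}, reject Ho: {reject_null}")
```

With these inputs the P-value is well above 0.05, so the claim is not rejected, which matches the conclusion reached for the first example.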
In our class data, we had a simple random sample of 2435 candies, and our counts of successes and failures (depending on the data used) met all the conditions needed to conduct a proper interval estimate and hypothesis test for the Skittles project.

Possible errors could easily be made (I know, because I made a few) using these methods. In addition to rounding errors and incorrect z score calculations, a person could easily use the wrong critical value in constructing a confidence interval or carrying out a hypothesis test. Errors made when conducting a hypothesis test include those mentioned above as well as Type I errors, which result from rejecting a true null hypothesis. This could easily have happened in our first test, regarding the claim that 20% of all candies are yellow. The data did not contradict this claim, but we could easily have gotten confused and rejected it. A Type II error results from failing to reject a false null hypothesis. This error could easily have been made in the second hypothesis test, regarding the mean number of candies per bag. The claim was that the mean was equal to 55. Although it is easy to see that the mean number of candies per bag is closer to 64, an inaccurate number in the calculations might cause someone to fail to reject this hypothesis.

The sampling method could be improved by doubling the sample size we had in the class. With twice the sample, the confidence intervals would become narrower and the hypothesis testing conclusions more reliable. Also, by somehow eliminating the outliers, we could again make our conclusions more accurate. In looking at the data sheet, though, there were only a few bags with a larger number of candies, so they did not affect the numbers as drastically as 10 (or more) larger bags would have.
From the research I have done with the class Skittles data, I have discovered that purple candies are not the most precious color in the bags; they are actually fairly equally distributed throughout the population of candies. I have also determined that approximately 17-22% of the candies are yellow, that the mean number of candies per bag is approximately 60-68, and that the standard deviation of the number of candies per bag is approximately 11-21. I have also learned that it is plausible that 20% of all Skittles are yellow, and that the mean number of candies per bag is not 55.

Reflection on the Term Project

During the process of doing this project I have changed the way I look at math in real world applications. Never before did I know you could take a bunch of students, make them buy bags of Skittles, make them count the candy (and also count the colors), and then give them all kinds of statistical projects to do that might make them crazy! No really, now that it is all done I am kind of perplexed by the data that came out of the class sample.

Before completing this project, I was like most people when listening to the news. I heard numbers of all kinds and just assumed they must be true and reflective of the data given to whoever did the math. I hear statistical references now and I am more suspicious. I think to myself “Sure, and just who did you ask these questions, buddy? I bet you picked a bunch of devout atheists to poll about what type of people use the ‘F’ word the most…” In other words, I am now more likely to question the statistically related absolutes that I see not only on the news, but also in commercials. As if I needed a reason to become more cynical.

Ultimately, I believe what I have learned about the wide world of numbers over my last couple of years of schooling has made me a better citizen. I am now okay with figuring out appropriate percentages (in my head even!)
to tip my server while going out to eat, as well as listening with discernment to the numbers that are thrown around while watching the news reports on the efficacy of the flu vaccine, among other things that may be important to me and my direction in life. I can’t believe I am about to say this, knowing the difficulty this semester of statistics has brought me, but I believe that as a society we should all learn how to look at the information we are given with a certain level of statistical ‘discernment’. It might actually change some things in this country for the better.