Amy Mault
Math 1040
Term Project
The goal of this project is to apply the statistical methods from class to a sample of Skittles
bags purchased by our class. First we will organize the categorical data, the colors of the candies
found in the sample, using pie charts and Pareto charts. Then we will organize the quantitative data,
the number of candies found in each bag, by finding the mean, the standard deviation, and the
five-number summary, and by creating a frequency histogram and a boxplot with that information.
Using these tools we hope to gain an idea of the real-world application of statistical data, both
categorical and quantitative.
Skittles Pie Chart by Candy Color
Skittles Pareto Chart
The charts above represent the overall count of each color found in our sample of 2435 candies in 38
bags. Comparing the overall data to my own bag of candies, I am surprised the colors were almost
equally represented. My bag had only 4 purple candies, and somehow I thought that the color
purple was the 'most special' of the colors, with very few per bag.
Color     Class Data   My Bag
Red       500          15
Yellow    474          7
Orange    446          17
Green     503          18
Purple    512          4
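As a rough illustration of how the counts above turn into the shares shown in a pie chart and the ordering shown in a Pareto chart, here is a minimal Python sketch. The counts are copied from the table; the variable and function names are made up for this example.

```python
# Color counts copied from the table above.
class_counts = {"Red": 500, "Yellow": 474, "Orange": 446, "Green": 503, "Purple": 512}
my_bag = {"Red": 15, "Yellow": 7, "Orange": 17, "Green": 18, "Purple": 4}


def proportions(counts):
    """Return each color's share of the total (the slices of a pie chart)."""
    total = sum(counts.values())
    return {color: n / total for color, n in counts.items()}


class_props = proportions(class_counts)
bag_props = proportions(my_bag)

# A Pareto chart lists the categories from most to least frequent.
pareto_order = sorted(class_counts, key=class_counts.get, reverse=True)

for color in pareto_order:
    print(f"{color:<7} class {class_props[color]:6.1%}   my bag {bag_props[color]:6.1%}")
```

Each class share comes out near one fifth, while the single bag is much more uneven, which is the contrast described above.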
Quantitative data: candies per bag

Column: Total Candies in Bag
Mean        Std. dev.   Min   Q1   Median   Q3   Max
64.078947   13.19988    45    59   61       62   114

Total # of bags: 38
Total candies: 2435

Boxplot: candies per bag

5 Number Summary:
Min   Q1   Q2   Q3   Max
45    59   61   62   114
The frequency histogram is right skewed. The mean number of candies per bag was 64.1, and
there were a couple of bags with an unusually large number of candies in the 110-120 class. My own
bag fell just under the mean at 61 candies. It seems most of the bags in the
sample were close to the mean. I am a little surprised to see there were no bags at all in the 80-90
candies per bag class.
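One way to check the impression that a few bags are unusually large is the common 1.5 x IQR rule for flagging outliers. The sketch below assumes that rule (the project does not say which rule, if any, was used) and takes its inputs from the summary statistics reported above.

```python
# Summary statistics reported for candies per bag.
minimum, q1, median, q3, maximum = 45, 59, 61, 62, 114
mean = 64.078947

# Interquartile range and the usual 1.5 x IQR fences (an assumed convention).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print("max is an outlier:", maximum > upper_fence)
print("min is an outlier:", minimum < lower_fence)

# A mean above the median is consistent with a right-skewed distribution.
print("mean > median:", mean > median)
```

Under this rule both the smallest and the largest bag fall outside the fences, and the mean sitting above the median agrees with the right skew seen in the histogram.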
All in all this data is well represented by the graphs used here. Categorical data, used to show
the colors of candies in each bag, is best represented with pie charts and Pareto charts, because you can
see the individual colors and their general distribution across the sample. To represent
the quantitative data, a frequency histogram makes better sense. The numbers of candies
per bag are shown best with this type of chart, and you can clearly see that the mean number of
candies is in the 60-70 class, with only a couple of bags in the extremely low (40-50) and extremely high
(110-120) classes. Calculating a mean for the colors would not make sense due to the
nature of the data: color is nominal data, which can be counted and sorted into categories but has no
natural order and cannot be measured. The number of candies per bag, on the other hand, is
quantitative data, so it can be measured and the differences between values are meaningful.
Confidence Intervals
A confidence interval is a range of values, usually given as a lower and an upper bound, that is likely
to contain a population parameter. The confidence level (for example, 95%) describes how often
intervals built this way from repeated samples would capture the true parameter. The interval is
centered on the sample point estimate, and its width is a margin of error calculated from the standard
error. The end result shows how well the sample point estimate can be used to approximate
the true population parameter.
The following are problems using confidence intervals to determine the true proportion of yellow candies per bag (99% confidence interval), the true mean number of candies per bag (95% confidence interval), and the standard deviation of the number of candies per bag (98% confidence interval).
The results of the first problem, the true proportion of yellow candies, tell us that we can be 99%
confident that between 17% and 21% of all Skittles are yellow. This appears to be only somewhat
accurate because in looking at the data, most bags of Skittles contained approximately 15-20 yellow
candies out of about 60 candies per bag. I believe there were outliers (bags with over 100 candies)
that push the estimated proportion away from where it should be.
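For reference, a one-proportion z interval can be sketched in a few lines of Python. This is only an illustration under the usual normal approximation, not necessarily the exact procedure used for the project; the function name is made up, and the counts (474 yellow out of 2435 candies) come from the class data.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n
    # Two-sided critical value, e.g. about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


low, high = proportion_ci(successes=474, n=2435, confidence=0.99)
print(f"99% CI for the proportion of yellow candies: {low:.3f} to {high:.3f}")
```

With the class counts the interval lands close to the range reported above; small differences can come from rounding.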
For the second problem, determining the true mean number of candies per bag, we can say with
95% confidence that the true mean is between 60 and 68 candies per bag. This
appears to be accurate, with the exception (once again) of the outliers possibly pushing
the upper end of the interval beyond where it should be.
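The interval for the mean can be sketched the same way using a t interval. The critical value 2.026 below is an assumed table value for 95% confidence with 37 degrees of freedom (38 bags minus 1); the mean and standard deviation are the sample statistics reported earlier, and the function name is made up for this example.

```python
from math import sqrt


def mean_ci(sample_mean, sample_sd, n, t_critical):
    """t-based confidence interval for a population mean."""
    margin = t_critical * sample_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Assumed t table value for a two-sided 95% interval with df = 37.
T_CRITICAL_95_DF37 = 2.026

low, high = mean_ci(64.078947, 13.19988, 38, T_CRITICAL_95_DF37)
print(f"95% CI for the mean candies per bag: {low:.1f} to {high:.1f}")
```

Rounded to whole candies, this agrees with the 60 to 68 range stated above.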
For the third confidence interval problem, we can say with 98% confidence that the true standard deviation of the number of candies per bag is between 11 and 20. This seems to be accurate considering
the sample standard deviation is just over 13, with the outliers (bags of candy with over 100
Skittles per bag) pushing the upper end of the interval beyond what may be expected.
Hypothesis Testing
Hypothesis testing is a way of using sample statistics to decide whether there is enough evidence to
reject a given hypothesis. The following examples test whether a claim (here, the null hypothesis Ho)
should be rejected by following these 4 steps.
1. Formulate the null hypothesis Ho (commonly, that the observations are the result of pure chance)
and the alternative hypothesis H1, and determine which hypothesis is the claim.
2. Identify a test statistic that can be used to assess the null hypothesis.
3. Compute the P-value, which is the probability that a test statistic at least as extreme as the one
observed would be obtained assuming that the null hypothesis were true. The smaller the P-value, the
stronger the evidence against the null hypothesis.
4. Compare the P-value to a chosen significance level alpha (sometimes called an alpha value). If
p <= alpha, the observed effect is statistically significant and the null hypothesis is rejected in
favor of the alternative hypothesis. If p > alpha, we fail to reject the null hypothesis.
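Step 4 is a simple comparison, which the short Python sketch below spells out. The function name is made up for illustration; the P-values and significance levels are the ones reported in the two examples that follow.

```python
def decide(p_value, alpha):
    """Apply the step 4 decision rule: reject Ho when p <= alpha."""
    if p_value <= alpha:
        return "reject Ho"
    return "fail to reject Ho"


# First example: claim that 20% of candies are yellow, alpha = 0.05.
print(decide(p_value=0.218697, alpha=0.05))

# Second example: claim that the mean number of candies per bag is 55, alpha = 0.01.
print(decide(p_value=0.000139, alpha=0.01))
```

Note that "fail to reject Ho" is not the same as proving Ho; it only means the sample did not provide strong enough evidence against it.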
For the first example, we test the claim that 20% of all candies are yellow. The claim is the null hypothesis, Ho: p = .20, and the alternative hypothesis is H1: p not = .20. After finding a p value of .218697 with a
0.05 significance level, we see that the p value is larger than the significance level. This means
we fail to reject the claim that 20% of all Skittles are yellow. We also have insufficient evidence to
support the alternative hypothesis. In summary, we could say that the data are consistent with the claim
that 20% of Skittles candies are yellow.
For the second example, we test the claim that the mean number of candies per bag is 55. First we
identify the claim as Ho: μ = 55, and the alternative hypothesis as H1: μ not = 55. After finding a p value
of 0.000139 with a 0.01 significance level, we find that the p value is smaller than the significance
level. This means we should reject the claim that the mean number of candies per bag is 55. We also
have sufficient evidence to support the alternative hypothesis.
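To show where such a small p value comes from, the sketch below computes the one-sample t test statistic from the sample statistics reported earlier (mean 64.078947, standard deviation 13.19988, 38 bags). The function name is made up, and converting the statistic to a p value would still require a t table or statistical software.

```python
from math import sqrt


def t_statistic(sample_mean, claimed_mean, sample_sd, n):
    """Test statistic for a one-sample t test of a claimed population mean."""
    standard_error = sample_sd / sqrt(n)
    return (sample_mean - claimed_mean) / standard_error


t = t_statistic(sample_mean=64.078947, claimed_mean=55, sample_sd=13.19988, n=38)
print(f"t = {t:.2f} with {38 - 1} degrees of freedom")
```

The sample mean sits more than four standard errors above the claimed value of 55, which is consistent with the very small p value reported above.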
Reflection
In order to conduct a reasonably good interval estimate and hypothesis test, we need to have certain
requirements met. First, the sampling method must be a simple random sample. Second, we need a
large enough sample. In order to get a reasonable estimate, our sample size must be greater
than 30. If the sample size is 30 or less, then the data must be
normally distributed in order to obtain a reasonable estimate.
For hypothesis testing, we also have requirements that must be met. First, as with interval estimates,
we need to have a simple random sample. Also, the sample size multiplied by the proportion of
successes must be greater than or equal to 5 (np >= 5), and the sample size multiplied by the proportion
of failures must also be greater than or equal to 5 (nq >= 5).
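The np >= 5 and nq >= 5 requirement is easy to check directly. Here is a minimal Python sketch with a made-up function name, using the class sample size of 2435 candies and the claimed proportion of 0.20:

```python
def meets_proportion_requirements(n, p, minimum=5):
    """Check that expected successes (np) and failures (nq) are both at least `minimum`."""
    q = 1 - p
    return n * p >= minimum and n * q >= minimum


# Class sample: 2435 candies, claimed proportion of yellow = 0.20.
print(meets_proportion_requirements(n=2435, p=0.20))
```

With a sample this large both products are in the hundreds, so the requirement is met comfortably.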
In our class data, we had a simple random sample of 2435 candies in 38 bags, and our counts of successes
and failures (depending on the data used) met all the conditions to conduct a proper interval estimate
and hypothesis test for the Skittles project.
Errors could easily be made (I know, because I made a few) using these methods. In addition to
rounding errors and incorrect z score calculations, a person could easily use the wrong critical value in
determining a confidence interval or conducting a hypothesis test.
Errors made when conducting a hypothesis test would include the errors mentioned above as well as the
typical Type I error, which results from rejecting a true null hypothesis. This could have easily happened
in our first test regarding the claim that 20% of all candies are yellow. The data were consistent with this
claim, but we could have easily gotten confused and rejected it. A Type II error results from
failing to reject a false null hypothesis. This error could have easily been made in the second hypothesis
test regarding the mean number of candies per bag. The claim was that the mean was equal to 55.
Although it is easy to see that the mean number of candies per bag is closer to 64, an inaccurate
number in the calculations might cause someone to fail to reject this hypothesis.
The sampling method could be improved by increasing the sample size to twice the
number that we had in the class. By doubling the sample size, we would expect the confidence
intervals to become narrower and the hypothesis testing conclusions to become more reliable. Also, by
somehow eliminating the outliers, we could again make our conclusions more accurate. In looking at the
data sheet, though, there were only a few bags with a larger number of candies, so they did not affect
the numbers as drastically as 10 (or more) larger bags would have.
From the research I have done with the class Skittles data, I have discovered that purple candies are not
the most precious color in the bags; the colors are actually fairly equally distributed throughout the
population of candies. I have also determined that approximately 17-22% of the candies are yellow,
that the mean number of candies per bag is approximately 60-68, and that the approximate standard
deviation of the number of candies per bag is between 11 and 21. I have also learned that it is plausible
that 20% of all Skittles are yellow, and that the mean number of candies per bag is not 55.
Reflection on the Term Project
During the process of doing this project I have changed the way I look at math in real world
applications. Never before did I know you could take a bunch of students, make them buy bags of
skittles, make them count the candy (and also count the colors), and then give them all kinds of
statistical projects to do that might make them crazy! No really, now that it is all done I am kind of
perplexed by the data that came out of the class sample.
Before completing this project, I was like most people when listening to the news. I heard numbers
of all kinds and I just assumed they must be true and reflective of the data given to whoever did the
math. I hear statistical references now and I am more suspicious. I think to myself “Sure, and just who
did you ask these questions, buddy? I bet you picked a bunch of devout atheists to poll about what type
of people use the ‘F’ word the most…” In other words, I am now more likely to question the
statistical absolutes that I see not only on the news, but also in commercials. As if I needed a
reason to become more cynical.
Ultimately, I believe what I have learned about the wide world of numbers over my last couple
years of schooling has made me a better citizen. I am now okay with figuring out appropriate
percentages (in my head even!) to tip my server while going out to eat, as well as listening with
discernment to the numbers that are thrown around while watching the news reports on the efficacy of
the flu vaccine, amongst other things that may be important to me and my direction in life. I can’t
believe I am about to say this…knowing the difficulty this semester of statistics has brought to me, but
I believe that as a society we should all learn how to look at information we are given with a certain
level of statistical ‘discernment’. It might actually change some things in this country for the better.