Jennifer Harris
Statistics 1040 Group Project

Introduction

This project demonstrates many of the concepts covered in my Statistics 1040 class: collecting samples, organizing and analyzing data on baseball players and the Hall of Fame, drawing conclusions, and presenting my work. In this part of the project I created summary statistics on each player's Hall of Fame membership status and on how members were selected. There are three categories: not a member; elected by the Baseball Writers Association of America (BBWAA, which we will refer to as the "media" for the remainder of this project); and chosen by the Old Timers or Veterans Committee (labeled "Elected by Players" in the charts). The project presents baseball data in a way that anyone can interpret, using visuals such as Pareto charts, pie charts, histograms and boxplots. I take samples of this data and calculate sampling errors in order to estimate the population values.

Pareto Charts: Categorical Data for Hall of Fame Membership

[Figure: Pareto charts of Hall of Fame status for the population, the random sample, and the systematic sample. Frequencies shown in the charts:]

Hall of Fame status    Population   Random sample   Systematic sample
Not a member           1216         34              35
Elected by Players     67           3               2
Elected by Media       57           3               3

As we can see, most of the baseball players in the population are not members of the Hall of Fame. Players elected to the Hall of Fame are approximately equally likely to have been elected by the players or by the media.

My samples were obtained in two ways. For simple random sampling, I used Excel to generate random numbers to reorder my data. For systematic sampling, I started at a randomly chosen position, 21, and then took every 34th player until I had 40 players, which is the sample size. The interval comes from N/40 = 1340/40 = 33.5, which I rounded up to 34.
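The systematic selection described above can be sketched in a few lines of Python. This is only an illustration, not the Excel procedure actually used. The function name is hypothetical. Note that with a start of 21 and a step of 34, the 40th position would be 21 + 39 × 34 = 1347, which is past the end of a 1340-row list. The project does not say how this was handled, so the sketch assumes the count wraps around to the top of the list.

```python
import math

def systematic_positions(population_size, sample_size, start):
    """Return the step and the 1-based positions of a systematic sample.

    The step is population_size / sample_size, rounded up.
    Wrapping past the end of the list is an assumption of this sketch.
    """
    step = math.ceil(population_size / sample_size)
    positions = []
    for i in range(sample_size):
        raw = start + i * step                       # 21, 55, 89, ...
        wrapped = (raw - 1) % population_size + 1    # stay within 1..population_size
        positions.append(wrapped)
    return step, positions

step, positions = systematic_positions(1340, 40, 21)
print(step)            # sampling interval
print(positions[:3])   # first few selected positions
print(positions[-1])   # last position, after any wrap-around
```

Rounding the step down to 33 instead would keep every position inside the list without wrapping, at the cost of never reaching the last few rows.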
The Pareto charts (above) and the pie charts (below) visually display my results. The samples appear to be good estimates of the population data. This is typical when the sample size is greater than 30.

[Figure: pie charts of Hall of Fame status for the population, the random sample, and the systematic sample. Percentages shown in the charts:]

Hall of Fame status    Population   Random sample   Systematic sample
Not a member           90.75%       85%             87.5%
Elected by Media       4.25%        7.5%            7.5%
Elected by Players     5%           7.5%            5%

Quantitative Variable Analysis: Home Run Data

The quantitative variable I obtained from the population was home run data for professional baseball players. I used StatCrunch to find statistics, including the population mean, the population standard deviation, and the five-number summary. I then took a random sample and a systematic sample using the same sampling techniques described above, each with a sample size of forty, and computed the following summary statistics for each sample.

Summary sample statistics for quantitative data

Column              n    Mean      Variance    Std. Dev.   Median   Range   Min   Max   Q1     Q3
Systematic sample   40   100.475   22327.64    149.42436   40.5     754     1     755   15.5   140.5
Random sample       40   82.3      7074.2153   84.10835    50.5     337     2     339   23     121

Summary population parameters for quantitative data

Column      n      Mean      Variance   Std. Dev.   Median   Range   Min   Max   Q1   Q3
Home Runs   1340   85.1097   9590.293   97.930046   51       755     0     755   22   108

The systematic sample has a higher mean and standard deviation than both the random sample and the population. This could be because the data is skewed right and the systematic sample happened to include an unusual number of great players who hit a lot of home runs.

[Figure: boxplots of home runs for the population, the random sample, and the systematic sample]

I constructed these boxplots using the five-number summaries above. The data appears to be skewed right.
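One way to back up the visual impression from the boxplots is to compute the upper outlier fence, Q3 + 1.5 × IQR, from each five-number summary and compare it with the maximum. The 1.5 × IQR rule is a common boxplot convention and is an assumption here, since the project does not say which rule its boxplots used. The quartiles and maxima are the ones reported in the summary statistics above.

```python
def upper_fence(q1, q3, k=1.5):
    """Upper outlier fence Q3 + k * IQR (k = 1.5 is the usual convention)."""
    return q3 + k * (q3 - q1)

# (Q1, Q3, Max) as reported in the project's summary statistics.
summaries = {
    "population": (22, 108, 755),
    "random sample": (23, 121, 339),
    "systematic sample": (15.5, 140.5, 755),
}

fences = {name: upper_fence(q1, q3) for name, (q1, q3, _) in summaries.items()}
has_high_outlier = {name: summaries[name][2] > fences[name] for name in summaries}

for name in summaries:
    print(name, "upper fence:", fences[name], "max beyond fence:", has_high_outlier[name])
```

In all three data sets the maximum lies beyond the upper fence, which is consistent with a long right tail.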
The right skew makes sense because very few players, even of Hall of Fame caliber, hit more than 500 home runs. Most Major League Baseball players hit far fewer home runs than the few top players. The major difference between the random sample and the systematic sample is that the maximum in the random sample is 339, while the maximum in the systematic sample is 755, which is also the maximum in the entire population. It is unusual to pick the best home run hitter in the population in a sample of only 40 players.

[Figure: histograms of home runs for the population, the random sample, and the systematic sample]

The histograms also show that the data is skewed right, which is consistent with the explanations above. The outlier in the systematic sample is clearly visible in its histogram.

Confidence Intervals

95% confidence intervals of the population proportion

Two confidence intervals were constructed, one from the random sample and one from the systematic sample, at the 95% level for the population proportion. The margin of error for the simple random sample is approximately 8.16%, and the margin of error for the systematic sample is approximately 6.75%. For the random sample we are 95% confident that the true population proportion of players elected by the Old Timers Committee is between 0% and 15.66%. For the systematic sample we are 95% confident that the true population proportion of players elected by the Old Timers Committee is between 0% and 11.75%. The true population proportion is 5%, which is within both of these intervals.

95% confidence intervals of the population mean

Two confidence intervals were constructed, one from the random sample and one from the systematic sample, at the 95% level for the population mean. The margin of error is half the width of each interval: approximately 26.9 for the random sample and 47.8 for the systematic sample.
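The proportion margins of error reported earlier in this section can be checked with the usual normal-approximation formula, z × sqrt(p̂(1 − p̂)/n). The sketch below assumes z = 1.96 and takes the counts of players elected by the Old Timers Committee from the charts (3 of 40 in the random sample, 2 of 40 in the systematic sample). Cutting the lower bound off at 0 is also an assumption, chosen because the reported intervals start at 0%.

```python
import math

Z_95 = 1.96  # standard normal critical value for 95% confidence

def proportion_interval(successes, n, z=Z_95):
    """Return (margin, lower, upper) for a normal-approximation interval."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    lower = max(0.0, p_hat - margin)   # a proportion cannot be negative
    upper = min(1.0, p_hat + margin)
    return margin, lower, upper

random_ci = proportion_interval(3, 40)       # counts read from the charts
systematic_ci = proportion_interval(2, 40)

for label, (margin, lower, upper) in [("random", random_ci), ("systematic", systematic_ci)]:
    print(f"{label}: margin {margin:.2%}, interval {lower:.2%} to {upper:.2%}")
```

With so few "successes" in each sample, the normal approximation is rough, which is why the unadjusted lower bounds fall below zero.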
For the random sample we are 95% confident that the true population mean number of home runs hit by Major League Baseball players is between 55.4 and 109.2. For the systematic sample we are 95% confident that the true population mean is between 52.69 and 148.27. The true population mean is 85.11, which is within both of these intervals.

95% confidence intervals of the population standard deviation

Two confidence intervals were constructed, one from the random sample and one from the systematic sample, at the 95% level for the population standard deviation. For the random sample we are 95% confident that the true population standard deviation of home runs hit by Major League Baseball players is between 68.19 and 106.27. For the systematic sample, the lower bound of the 95% interval is 121.13. The true population standard deviation is 97.93, which is within the random-sample interval but not within the systematic-sample interval. This could be because of the outlier of 755 home runs contained in the systematic sample.

Proportion Hypothesis Test (level of significance .05)

Hypothesis tests were run for both the systematic and random samples, with the null hypothesis that the population proportion of players elected by the Old Timers Committee equals .05. For the random sample we do not reject the null hypothesis because the P-value of .468 > .05. We conclude that we do not have evidence to suggest the true population proportion is different from .05. The true population proportion is .05, so we have come to the correct conclusion. For the systematic sample we also do not reject the null hypothesis because the P-value of .99 > .05. We conclude that we do not have evidence to suggest the true population proportion is different from .05. In both cases we have not made any errors, assuming that our true population data is correct.
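The proportion tests above can be reproduced with a two-sided one-proportion z-test. This is a minimal sketch, not the StatCrunch output itself. The function names are hypothetical, and the sample counts (3 of 40 and 2 of 40) are read from the charts.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def one_proportion_z_test(successes, n, p0):
    """Two-sided z-test of H0: p = p0. Returns (z, p_value)."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)   # standard error under the null hypothesis
    z = (p_hat - p0) / se
    p_value = 2 * (1 - normal_cdf(abs(z)))
    return z, p_value

z_random, p_random = one_proportion_z_test(3, 40, 0.05)
z_systematic, p_systematic = one_proportion_z_test(2, 40, 0.05)

print(f"random sample:     z = {z_random:.3f}, P-value = {p_random:.3f}")
print(f"systematic sample: z = {z_systematic:.3f}, P-value = {p_systematic:.3f}")
```

The systematic sample proportion equals the hypothesized value exactly, so its test statistic is 0 and its P-value is as large as it can be, in line with the reported .99.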
Hypothesis Test for the Population Mean (level of significance .05)

Hypothesis tests were run for both the systematic and random samples, with the null hypothesis that the population mean equals 85.11 home runs per player. For the random sample we do not reject the null hypothesis because the P-value of .833 > .05. We conclude that we do not have evidence to suggest the true population mean is different from 85.11. The true population mean is 85.11, so we have come to the correct conclusion. For the systematic sample we also do not reject the null hypothesis because the P-value of .52 > .05. We conclude that we do not have evidence to suggest the true population mean is different from 85.11. In both cases we have not made any errors (Type I or Type II), assuming that our true population data is correct.

Reflection

This project was very interesting because I was able to apply the concepts that I learned in class to a real-world situation. This could be useful in my future career. In social work it is very important to compute summary statistics, hypothesis tests, and confidence intervals relating to substance abuse and to which public policies are effective in helping people.

Our samples for the proportion meet the sampling conditions because the way we drew the samples is unbiased. The samples are also less than 5% of the total population. The samples that we drew for the mean number of home runs (quantitative data) are also unbiased, because of the simple random and systematic selection processes we used. In this case the sample sizes are greater than 30, which allows us to construct confidence intervals and hypothesis tests even if the population is not normally distributed. Our conclusions about our hypothesis tests make sense because in this case we know the true population data and we tested against the true values.
If we were to reject any of our hypotheses based on the tests that we conducted, we would be making a Type I error. In all of our samples we were not able to reject the null hypothesis, so we did not make any Type I errors.

Extra Credit Regression

This regression shows the positive correlation between hits and home runs. As a player gets more hits, the regression predicts that the player will also hit more home runs. A player with 1500 career hits will on average have hit 106 home runs. A player who is a member of the 3000 hit club will on average have hit 231 home runs. We can see that most of our data points are close to the origin. This is because only a select few great players get many hits and hit many home runs. As we move up and to the right on the graph, we would expect to find more Hall of Famers.
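The fitted regression equation is not given in the text, but the two predictions quoted above are enough to back out an approximate line. The sketch below does that arithmetic. Because both predictions are rounded, the slope and intercept are only approximations of the fitted values, and the function names are hypothetical.

```python
def line_through(p1, p2):
    """Slope and intercept of the straight line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept

# (career hits, predicted home runs) as quoted in the project.
slope, intercept = line_through((1500, 106), (3000, 231))

def predicted_home_runs(hits):
    """Predicted career home runs from the back-solved line."""
    return intercept + slope * hits

print(f"home runs = {intercept:.1f} + {slope:.4f} * hits")
print(round(predicted_home_runs(2000)))   # prediction at a hypothetical 2000 hits
```

The slope works out to about one additional home run for every 12 hits. The intercept is negative, so the line should not be read literally for players with very few hits.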