Jennifer Harris
Statistics 1040
Group Project
Introduction
This project demonstrates many of the concepts covered in my Statistics 1040 class, including
collecting samples, organizing and analyzing data on baseball players and the Hall of Fame, drawing
conclusions, and presenting my work. In this part of the project I created summary statistics
for the players' Hall of Fame membership status and for how members were selected. There are
three categories: the first is not a member, the second is elected by the Baseball Writers'
Association of America (BBWAA, which we will refer to as the “media” for the remainder of this
project), and the third is chosen by the Old Timers or Veterans Committee (labeled “players” in
the charts). The project presents baseball data in a way that anyone can interpret, through
visuals such as Pareto charts, pie charts, histograms, and boxplots. I also take samples of this
data and calculate sampling errors to see how well the samples estimate the population.
Pareto Charts
Categorical Data for Hall of Fame Membership
[Pareto chart: Population Hall of Fame Frequency. Not a member: 1216; Elected by Players: 67; Elected by Media: 57]
[Pareto chart: Random Sample Hall of Fame. Not a member: 34; Elected by Media: 3; Elected by Players: 3]
[Pareto chart: Systematic Sample Hall of Fame. Not a member: 35; Elected by Media: 3; Elected by Players: 2]
As we can see, most of the baseball players in the population are not members of the Hall of Fame.
Players who are in the Hall of Fame were elected in roughly equal numbers by the players and by the
media. I obtained my samples in two ways. For the simple random sample, I used Excel to generate
random numbers to reorder my data. For the systematic sample, I started at a randomly chosen
position, 21, and then took every 34th player until I had 40 players, which is the sample size.
The step comes from N/40 = 1340/40 = 33.5, which I rounded up to 34.
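The two selection procedures described above can be sketched in Python. This is a minimal illustration, not the Excel workflow actually used: the function names are hypothetical, the random seed is arbitrary, and positions are numbered 1 to N. One detail is an assumption of this sketch: starting at 21 with a step of 34, the 40th position would be 21 + 39 × 34 = 1347, which is past the end of a list of 1340, so the sketch wraps around to the start of the list. The source does not say how this case was handled.

```python
import math
import random

N = 1340   # population size (from the text)
n = 40     # sample size (from the text)

def simple_random_positions(pop_size, sample_size, seed=None):
    """Pick sample_size distinct positions in 1..pop_size at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, pop_size + 1), sample_size))

def systematic_positions(pop_size, sample_size, start):
    """Take every k-th position beginning at start, where k = ceil(N / n).

    Wrapping past the end of the list is an assumption of this sketch.
    """
    k = math.ceil(pop_size / sample_size)
    return [((start - 1 + i * k) % pop_size) + 1 for i in range(sample_size)]

step = math.ceil(N / n)                                   # 33.5 rounded up to 34
systematic = systematic_positions(N, n, start=21)
random_sample = simple_random_positions(N, n, seed=1040)  # seed is arbitrary
```

The systematic positions begin 21, 55, 89 and continue in steps of 34; only the final position is affected by the wrap-around.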
The Pareto charts (above) and the pie charts (below) display my results visually. The samples appear
to be good estimates of the population data. This is generally what we expect when the sample size
is greater than 30.
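The percentages shown in the pie charts follow directly from the frequencies in the Pareto charts. As a small sketch, the population frequencies (1216, 67 and 57) can be converted to shares of the total as follows; the dictionary and function names are hypothetical.

```python
# Population frequencies read from the Pareto chart.
population_counts = {
    "Not a member": 1216,
    "Elected by Players": 67,
    "Elected by Media": 57,
}

def shares(counts):
    """Return each category's percentage of the total, rounded to 2 places."""
    total = sum(counts.values())
    return {name: round(100 * count / total, 2) for name, count in counts.items()}

population_total = sum(population_counts.values())
population_shares = shares(population_counts)
```

The same function applies to the sample counts, where the total is 40.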
[Pie chart: Population Hall of Fame. Not a member: 90.75%; Elected by Players: 5%; Elected by Media: 4.25%]
[Pie chart: Random Sample Hall of Fame. Not a member: 85%; Elected by Media: 7.5%; Elected by Players: 7.5%]
[Pie chart: Systematic Sample Hall of Fame. Not a member: 87.5%; Elected by Media: 7.5%; Elected by Players: 5%]
Quantitative Data for Home Runs
Quantitative Variable Analysis: Home Run Data
The quantitative variable I obtained from the population was home run data for
professional baseball players. I used StatCrunch to find statistics, including the population
mean, population standard deviation, and the five-number summary. I then took a
random sample and a systematic sample, using the same sampling techniques described
above, with a sample size of forty, and computed the same summary statistics for each
sample.
Summary Sample Statistics: For Quantitative Data
Column               n    Mean     Variance   Std. Dev.  Median  Range  Min  Max  Q1    Q3
Systematic Sampling  40   100.475  22327.64   149.42436  40.5    754    1    755  15.5  140.5
Random Sample        40   82.3     7074.2153  84.10835   50.5    337    2    339  23    121

Summary Population Parameters: For Quantitative Data
Column     n     Mean     Variance  Std. Dev.  Median  Range  Min  Max  Q1  Q3
Home Runs  1340  85.1097  9590.293  97.930046  51      755    0    755  22  108
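The columns in these tables can be computed with Python's standard library. The sketch below is only an illustration: the function name is hypothetical, the data is made up, and quartile conventions differ between programs, so StatCrunch may report slightly different Q1 and Q3 values than this method.

```python
import statistics

def summarize(values):
    """Return the summary statistics listed in the tables for one column."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # default "exclusive" method
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "variance": statistics.variance(values),  # sample variance (divides by n - 1)
        "std_dev": statistics.stdev(values),
        "median": median,
        "range": max(values) - min(values),
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "q3": q3,
    }

# Hypothetical home run totals, made up for illustration only.
example = [1, 2, 3, 4, 5, 6, 7, 8]
summary = summarize(example)
```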
The systematic sample has a higher mean and standard deviation than both the random
sample and the population. This could be because the data is skewed right and the systematic
sample happened to include an unusual number of great players who hit a lot of
home runs.
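The size of the gap can be expressed as a sampling error, that is, the sample statistic minus the population parameter. A short sketch using the means reported in the tables (variable names are hypothetical):

```python
population_mean = 85.1097   # from the population table
sample_means = {
    "Systematic Sampling": 100.475,
    "Random Sample": 82.3,
}

# Sampling error = sample statistic minus the population parameter.
sampling_errors = {
    name: round(mean - population_mean, 4) for name, mean in sample_means.items()
}
```

The systematic sample overestimates the population mean by about 15.4 home runs, while the random sample underestimates it by about 2.8.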
[Boxplots: Population Home Runs; Random Sample Home Runs; Systematic Sample Home Runs]
I constructed the boxplots from the five-number summaries above. The data appears
to be skewed right. This makes sense because very few players, even of Hall of Fame caliber, hit
more than 500 home runs. Most Major League Baseball players hit far fewer home runs than the few top
players. The major difference between the random sample and the systematic sample is the maximum:
it is 339 in the random sample and 755 in the systematic sample, which is also the maximum number
of home runs in the entire population. It is an anomaly that a sample of only 40 players happened
to include the best home run hitter in the history of baseball.
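One common way to judge whether a maximum is an outlier is the 1.5 × IQR rule. The sketch below applies it to the quartiles and maxima from the summary tables. Whether the boxplot software used exactly this rule is an assumption, and the names are hypothetical.

```python
def upper_fence(q1, q3, k=1.5):
    """Upper outlier fence: Q3 + k * IQR (k = 1.5 is the usual convention)."""
    return q3 + k * (q3 - q1)

# (Q1, Q3, maximum) for each data set, taken from the summary tables.
quartiles = {
    "Population": (22, 108, 755),
    "Random Sample": (23, 121, 339),
    "Systematic Sample": (15.5, 140.5, 755),
}

fences = {name: upper_fence(q1, q3) for name, (q1, q3, _) in quartiles.items()}
max_is_outlier = {name: mx > fences[name] for name, (_, _, mx) in quartiles.items()}
```

Under this rule the systematic sample's maximum of 755 lies far above its fence of 328. The random sample's maximum of 339 also exceeds its fence of 268, though by much less.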
[Histograms: Population Home Runs; Random Sample Home Runs; Systematic Sample Home Runs]
The histograms also show that the data is skewed right and support the
explanations above. The outlier in the systematic sample is clearly visible in its histogram.
Confidence Intervals
95% confidence intervals of the population proportion
Two confidence intervals were constructed, one from the random sample and one from the
systematic sample, at the 95% level for the population proportion of players elected by the
Old Timers Committee. The margin of error for the simple random sample is approximately 8.16%,
and the margin of error for the systematic sample is approximately 6.75%. For the random sample
we are 95% confident that the true population proportion is between 0% and 15.66%. For the
systematic sample we are 95% confident that the true population proportion is between 0% and
11.75%. The true population proportion is 5%, which is within both of these intervals.
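The margins of error above are consistent with the usual normal-approximation interval for a proportion, p̂ ± z·sqrt(p̂(1 − p̂)/n). The sketch below assumes that formula, takes the counts of 3 and 2 committee-elected players out of 40 from the charts, and clips the lower bound at 0 because a proportion cannot be negative. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Normal-approximation interval for a proportion, clipped to [0, 1].

    Returns (p_hat, margin, lower, upper).
    """
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin, max(0.0, p_hat - margin), min(1.0, p_hat + margin)

# Players elected by the committee: 3 of 40 (random), 2 of 40 (systematic).
random_ci = proportion_interval(3, 40)
systematic_ci = proportion_interval(2, 40)
```

In both samples the unclipped lower bound is negative, which is why the intervals are reported as starting at 0%.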
95% confidence intervals of the population mean
Two confidence intervals were constructed, one from the random sample and one from the
systematic sample, at the 95% level for the population mean. The margin of error, taken as half
the width of each interval, is approximately 26.9 for the random sample and 47.79 for the
systematic sample. For the random sample we are 95% confident that the true population mean
number of home runs hit by major league baseball players is between 55.4 and 109.2. For the
systematic sample we are 95% confident that the true population mean is between 52.69 and
148.27. The true population mean is 85.11, which is within both of these intervals.
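The two intervals can be reproduced, to within rounding, with the t interval x̄ ± t·s/sqrt(n) and the sample statistics from the summary table. The sketch below assumes that method. The critical value 2.0227 is the t table value for 95% confidence with 39 degrees of freedom, hard-coded because the standard library has no t distribution, and the function name is hypothetical.

```python
from math import sqrt

T_CRIT_39 = 2.0227   # t table value for 95% confidence, 39 degrees of freedom

def mean_interval(mean, std_dev, n, t_crit):
    """t interval for a mean: mean +/- t_crit * s / sqrt(n).

    Returns (margin, lower, upper).
    """
    margin = t_crit * std_dev / sqrt(n)
    return margin, mean - margin, mean + margin

random_margin, random_low, random_high = mean_interval(82.3, 84.10835, 40, T_CRIT_39)
sys_margin, sys_low, sys_high = mean_interval(100.475, 149.42436, 40, T_CRIT_39)
```

The margins come out near 26.9 and 47.8, matching the half-widths of the two reported intervals. The systematic margin is larger because that sample's standard deviation is larger.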
95% confidence intervals of the population standard deviation
Two confidence intervals were constructed, one from the random sample and one from the
systematic sample, at the 95% level for the population standard deviation. For the
random sample we are 95% confident that the true population standard deviation of
home runs hit by major league baseball players is between 68.19 and 106.27. For the systematic
sample we are 95% confident that the true population standard deviation is between 121.13
and approximately 188.8. The true population standard deviation is 97.93, which is within the
interval from the simple random sample but not within the interval from the systematic sample.
This could be because of the outlier of 755 home runs contained in the systematic sample.
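An interval for a standard deviation is usually built from the chi-square distribution: sqrt((n − 1)s²/χ²_R) to sqrt((n − 1)s²/χ²_L). The sketch below assumes that method and uses the sample variances from the summary table. The critical values are chi-square table values for the 40 degrees of freedom row, which appear to reproduce the reported bounds of 68.19 and 106.27; that the original calculation used this row is an assumption, and with n = 40 the exact row would be 39 degrees of freedom. The function name is hypothetical.

```python
from math import sqrt

# Chi-square table values for 95% confidence, df = 40 row (an assumption of
# this sketch; the exact degrees of freedom for n = 40 would be 39).
CHI2_RIGHT = 59.342
CHI2_LEFT = 24.433

def std_dev_interval(variance, n, chi2_right, chi2_left):
    """Interval for sigma: sqrt((n-1)s^2 / chi2_right) to sqrt((n-1)s^2 / chi2_left)."""
    scaled = (n - 1) * variance
    return sqrt(scaled / chi2_right), sqrt(scaled / chi2_left)

random_sd_low, random_sd_high = std_dev_interval(7074.2153, 40, CHI2_RIGHT, CHI2_LEFT)
sys_sd_low, sys_sd_high = std_dev_interval(22327.64, 40, CHI2_RIGHT, CHI2_LEFT)
```

With these inputs the lower bound for the systematic sample is already above the population value of 97.93, so the whole interval lies above it.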
Proportion Hypothesis Test (level of significance .05)
Hypothesis tests were run on both the systematic and random samples, with the null hypothesis
that the population proportion of players elected by the Old Timers Committee equals .05. For
the random sample we do not reject the null hypothesis because the P-value of .468 > .05. We
conclude that we do not have evidence to suggest the true population proportion is different
from .05. The true population proportion is .05, so we have come to the correct conclusion. For
the systematic sample we also do not reject the null hypothesis because the P-value of
.99 > .05. We conclude that we do not have evidence to suggest the true population proportion
is different from .05. In both cases we have not made any errors, assuming that our true
population data is correct.
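The reported P-value of .468 is consistent with a two-sided one-sample z test for a proportion, z = (p̂ − p₀)/sqrt(p₀(1 − p₀)/n). The sketch below assumes that test and takes the counts of 3 and 2 out of 40 from the charts; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(successes, n, p0):
    """Two-sided one-sample z test for a proportion; returns (z, p_value)."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

random_z, random_p = proportion_z_test(3, 40, 0.05)          # sample proportion 7.5%
systematic_z, systematic_p = proportion_z_test(2, 40, 0.05)  # sample proportion 5%
```

For the systematic sample the sample proportion equals the hypothesized value, so z is 0 and the P-value is essentially 1.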
Hypothesis Test for the Population Mean (level of significance .05)
Hypothesis tests were run on both the systematic and random samples, with the null hypothesis
that the population mean equals 85.11 home runs per player. For the
random sample we do not reject the null hypothesis because the P-value of .833 > .05. We
conclude that we do not have evidence to suggest the true population mean is different from
85.11. The true population mean is 85.11, so we have come to the correct conclusion. For the
systematic sample we also do not reject the null hypothesis because the P-value of
.52 > .05. We conclude that we do not have evidence to suggest the true
population mean is different from 85.11. In both cases we have not made any errors (Type I or
Type II), assuming that our true population data is correct.
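The same decisions can be checked with the critical-value form of the one-sample t test, which rejects the null hypothesis only when |t| exceeds the critical value. The sketch below assumes that test, uses the sample statistics from the summary table, and hard-codes the t table value for 39 degrees of freedom, since the standard library has no t distribution from which to compute P-values. The function name is hypothetical.

```python
from math import sqrt

T_CRIT_39 = 2.0227   # two-sided 5% critical value, 39 degrees of freedom (t table)

def t_statistic(sample_mean, sample_sd, n, mu0):
    """One-sample t statistic: (x_bar - mu0) / (s / sqrt(n))."""
    return (sample_mean - mu0) / (sample_sd / sqrt(n))

random_t = t_statistic(82.3, 84.10835, 40, 85.11)
systematic_t = t_statistic(100.475, 149.42436, 40, 85.11)

# Reject H0 only if |t| exceeds the critical value.
reject_random = abs(random_t) > T_CRIT_39
reject_systematic = abs(systematic_t) > T_CRIT_39
```

Both statistics are well inside the critical value, which agrees with the decision not to reject the null hypothesis in either sample.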
Reflection
This project was very interesting because I was able to apply the concepts that I learned
in class to a real-world situation. This could be useful in my future career. In
social work it is very important to use summary statistics, hypothesis tests, and confidence
intervals relating to substance abuse and to which public policies are effective in helping people.
Our samples for the proportion meet the sampling conditions because the way we
selected the samples is unbiased. The samples are also less than 5% of the total population.
The samples that we took for the mean home runs (quantitative data) are also unbiased
because of the simple random and systematic selection processes we used. In this case the sample
sizes are greater than 30, which allows us to construct confidence intervals and hypothesis tests
even if the population is not normally distributed.
Our conclusions about our hypothesis tests make sense because in this case we know
the true population data and we tested against the true values. If we had rejected any of
our null hypotheses based on the tests that we conducted, we would have made a Type I error,
because the null hypotheses are true. In all of our samples we did not reject the null
hypothesis, so we did not make any Type I errors.
Extra Credit Regression
This regression shows the positive correlation between hits and home runs. As a player gets
more hits, the regression predicts that the player will also hit more home runs. A player with 1500 hits
during his career will on average hit 106 home runs. A player who is a member of the 3000 hit club will
on average have hit 231 home runs. We can see that most of the data points are close to the origin. This
is because only a select few great players get many hits and hit many home runs. We
would expect to find more Hall of Famers as we go up and to the right on the graph.
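The regression equation itself is not given, but the two predictions quoted above are enough to recover an approximate line. The sketch below fits the line through those two points. Because the quoted predictions are rounded, the slope and intercept are approximations rather than the fitted coefficients, and the names are hypothetical.

```python
# Two predictions quoted in the text: (career hits, predicted home runs).
x1, y1 = 1500, 106
x2, y2 = 3000, 231

slope = (y2 - y1) / (x2 - x1)   # home runs per additional hit
intercept = y1 - slope * x1

def predict_home_runs(hits):
    """Predicted career home runs on the line through the two quoted points."""
    return intercept + slope * hits
```

The implied slope is about 0.083, or roughly one additional home run for every 12 additional hits.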