Statistics 103 Probability and Statistical Inference
Instructions for lab 3
Lab Objective
To explore data with histograms, boxplots, and summary statistics.
Lab Procedures
Open the file agpop from the course directory. This file is taken from the 1992 U.S. Census of
Agriculture. It contains data on agricultural characteristics of all 3,078 counties in the United
States. Variables include:
acres92 (number of acres devoted to farming in 1992)
farms92 (number of farms in 1992)
largef92 (number of farms with more than 1,000 acres)
smallf92 (number of farms with fewer than 9 acres),
and similar variables for the 1987 and 1982 censuses. Also included are county and state names,
and a variable indicating the county's region of the country (West, Northeast, North Central,
South). For more information on the Census of Agriculture, including data from the 1997 census,
you can visit the web site of the National Agricultural Statistics Service.
Normal Approximation
1. Construct a histogram for the number of farms in 1992 across all counties. Describe the
distribution by answering the following questions.
a. What are the mean and median?
b. Is the distribution right-skewed or left-skewed?
c. What is the interquartile range?
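As a rough illustration of the summaries asked for in question 1, the Python sketch below computes the mean, median, and interquartile range for a small made-up list of farm counts. The values in `farms92` are hypothetical stand-ins, not the agpop data. Quartile conventions also differ between packages, so the lab software may report slightly different quartiles than `statistics.quantiles`.

```python
import statistics

# Hypothetical farm counts for a handful of counties (NOT the agpop data).
farms92 = [120, 250, 310, 400, 480, 560, 700, 950, 1400, 2600]

mean = statistics.mean(farms92)
median = statistics.median(farms92)

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(farms92, n=4)
iqr = q3 - q1

# A mean well above the median is one informal sign of right skew.
skew_hint = "right" if mean > median else "left or symmetric"

print(f"mean={mean:.1f} median={median:.1f} IQR={iqr:.1f} skew hint={skew_hint}")
```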
2. Often we wish to have data that follow a Normal distribution. One way to handle skewed data
is to transform the data.
a. Create a new variable that transforms farms92 by taking the square root. Call it
SqFarms92.
b. Construct a histogram for the new variable and answer the following questions.
i. What are the mean, median, and standard deviation of the transformed variable?
ii. Is the transformed variable more symmetric? What summary statistics
support this claim?
c. Transform the mean and median of SqFarms92 back to the original scale by squaring
the statistics. Are they similar to the mean and median calculated using the raw data
farms92? Why?
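The transformation and back-transformation in question 2 can be sketched in Python as follows. The data are hypothetical (not the agpop file), and the variable names simply mirror the lab's naming.

```python
import math
import statistics

# Hypothetical, deliberately right-skewed farm counts (NOT the agpop data).
farms92 = [120, 250, 310, 400, 480, 560, 700, 950, 1400, 2600, 3100]

# Square-root transform, mirroring the SqFarms92 variable from the lab.
sq_farms92 = [math.sqrt(x) for x in farms92]

raw_mean = statistics.mean(farms92)
raw_median = statistics.median(farms92)

# Back-transform the summaries of the transformed variable by squaring.
back_mean = statistics.mean(sq_farms92) ** 2
back_median = statistics.median(sq_farms92) ** 2

print(f"raw mean={raw_mean:.1f}, back-transformed mean={back_mean:.1f}")
print(f"raw median={raw_median:.1f}, back-transformed median={back_median:.1f}")
```

In this toy example the back-transformed median agrees with the raw median because the square root preserves the ordering of the data, while the back-transformed mean falls below the raw mean.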
3. Assume that SqFarms92 follows approximately a Normal distribution with mean and
standard deviation estimated in question 2(b)(i). Use this Normal approximation to answer
the following questions. You will need a Z-table, either from the textbook or online.
a. Approximately, what proportion of counties had more than 500 farms? (Note this is
on the original scale).
b. Approximately, what proportion of counties had between 500 and 800 farms?
c. Find an interval for the number of farms that includes the middle 50% of the
distribution.
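The calculations in question 3 can also be checked in software instead of a Z-table. The sketch below uses Python's `statistics.NormalDist`. The mean and standard deviation are placeholder values chosen only for illustration, so substitute the estimates from question 2. The key step is converting each cutoff on the original scale to the square-root scale before using the Normal curve.

```python
import math
from statistics import NormalDist

# HYPOTHETICAL mean and standard deviation of SqFarms92 (placeholders only;
# substitute the values computed in the lab).
mu, sigma = 25.0, 10.0
sq_model = NormalDist(mu, sigma)

# (a) P(farms > 500) equals P(sqrt(farms) > sqrt(500)) on the transformed scale.
p_over_500 = 1 - sq_model.cdf(math.sqrt(500))

# (b) P(500 < farms < 800), converting both cutoffs to the square-root scale.
p_500_to_800 = sq_model.cdf(math.sqrt(800)) - sq_model.cdf(math.sqrt(500))

# (c) Middle 50%: the 25th and 75th percentiles on the square-root scale,
# squared to return to the original scale. Squaring preserves order only
# when both percentiles are non-negative, as they are for these placeholders.
lo_sq, hi_sq = sq_model.inv_cdf(0.25), sq_model.inv_cdf(0.75)
middle_50 = (lo_sq ** 2, hi_sq ** 2)

print(p_over_500, p_500_to_800, middle_50)
```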
4. In question 3, we used the Normal distribution to approximate the distribution of farms92
after a square-root transformation. However, in this analysis, we actually have all the data!
a. Repeat question 3(a) by looking at the percentiles of farms92. For example, one way
to do this is to create a new variable, farms92_greater500, that takes the value 1 if
farms92 is greater than 500. Then simply calculate the proportion of
farms92_greater500 equal to 1.
b. How well does the Normal approximation perform?
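The indicator-variable approach described in question 4(a) amounts to averaging a column of 0s and 1s. A minimal Python sketch, using hypothetical farm counts rather than the agpop data:

```python
# Hypothetical farm counts (NOT the agpop data).
farms92 = [120, 250, 310, 400, 480, 560, 700, 950, 1400, 2600]

# Indicator variable: 1 if the county has more than 500 farms, else 0.
farms92_greater500 = [1 if x > 500 else 0 for x in farms92]

# The mean of a 0/1 indicator is the proportion of counties with the value 1.
prop_over_500 = sum(farms92_greater500) / len(farms92_greater500)
print(prop_over_500)
```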
Binomial Probability
5. The counties are divided into four regions.
a. How many counties are in the South? What is the proportion?
6. There are 3078 counties in total and we are interested in taking a sample of 5 counties
without replacement.
a. What is the probability of having at least 1 in the South?
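When sampling without replacement, the number of Southern counties in the sample follows a hypergeometric distribution, and "at least 1" is most easily found as one minus the probability of "none". The sketch below shows the counting argument in Python. The count `K` of Southern counties is a hypothetical placeholder, to be replaced with the answer to 5(a).

```python
from math import comb

N = 3078   # total number of counties (from the lab text)
K = 1000   # HYPOTHETICAL count of Southern counties; use the answer to 5(a)
n = 5      # sample size

# P(no Southern county) = C(N - K, n) / C(N, n) when sampling without replacement.
p_none = comb(N - K, n) / comb(N, n)
p_at_least_one = 1 - p_none
print(p_at_least_one)
```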
7. Because our original sample is so large, let’s assume the probability of a county being in the
South stays constant as estimated in 5(a), even though we are sampling without replacement.
Use the Binomial distribution formula to answer the following questions.
a. In a sample of 5 counties, what is the probability of having at least 1 county in the
South?
b. In a sample of 10 counties, what is the probability of having 4 counties in the South?
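The Binomial formula used in question 7 is P(X = k) = C(n, k) p^k (1 - p)^(n - k). The Python sketch below evaluates it for both parts. The proportion `p_south` is a hypothetical placeholder rather than the value from 5(a), and part (b) is read here as "exactly 4", which is an assumption about the question's intent.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_south = 0.4  # HYPOTHETICAL proportion; substitute the value from 5(a)

# 7(a): at least 1 Southern county in 5 draws = 1 - P(X = 0).
p_at_least_one = 1 - binom_pmf(0, 5, p_south)

# 7(b): 4 Southern counties in 10 draws, read here as "exactly 4".
p_exactly_four = binom_pmf(4, 10, p_south)
print(p_at_least_one, p_exactly_four)
```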
8. Let’s draw some samples ourselves. Click on Table → Subset, choose Random Sample
Size, and type in 10. Make sure you select All Columns, then click OK. A new data table will
appear. Repeat this 10 times, recording the number of counties in the South in each sample.
a. Is it close to the estimate in 7(b)?
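The repeated sampling in question 8 can be imitated in code. The Python sketch below draws 10 samples of 10 counties each and counts the Southern counties in every sample. The count `K` of Southern counties is hypothetical, the seed is arbitrary, and comparing the share of samples with 4 Southern counties against 7(b) is one possible reading of the question.

```python
import random

N, K = 3078, 1000  # K is a HYPOTHETICAL count of Southern counties
population = ["South"] * K + ["Other"] * (N - K)

rng = random.Random(103)  # arbitrary fixed seed so the sketch is reproducible

# Draw 10 samples of 10 counties each, without replacement within a sample,
# and record how many Southern counties each sample contains.
south_counts = [rng.sample(population, 10).count("South") for _ in range(10)]

# Share of the 10 samples that contained 4 Southern counties.
share_with_four = south_counts.count(4) / len(south_counts)
print(south_counts, share_with_four)
```

With only 10 repetitions the observed share is a noisy estimate, so it may differ noticeably from the Binomial probability.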