Exam
It is highly recommended that you answer the exam using Rmarkdown (you can simply use
the exam Rmarkdown file as a starting point).
Part I: Estimating probabilities
Remember to load the mosaic package first:
library(mosaic)
Chile referendum data
In this part we will use the dataset Chile. Remember to read the description of the dataset
as well as the Wikipedia entry about the background.
Chile <- read.table("http://asta.math.aau.dk/dan/static/datasets?file=Chile.dat",
header=TRUE, quote="\"")
NB: This dataset has several missing values (NA). To remove these when you use tally you
can add the argument useNA = "no".
•
Do a cross tabulation of the variables vote and sex.
•
Estimate the probability of vote=N.
•
Make a 95% confidence interval for the probability of vote=N.
•
Estimate the probability of vote=N given that sex=F.
•
What would these probabilities satisfy if vote and sex were statistically independent?
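The steps in the list above can be sketched on a small made-up data frame. This is only an illustration under stated assumptions: the data frame toy and its values are hypothetical (not the Chile data), base R table and prop.table are used instead of mosaic's tally, and the interval is the simple normal-approximation interval, which is unreliable for a sample this small.

```r
# Hypothetical toy data (NOT the Chile data): two variables, one missing vote
toy <- data.frame(
  vote = c("N", "Y", "N", "Y", "N", NA, "Y", "N"),
  sex  = c("F", "F", "M", "M", "F", "M", "F", "M")
)

# Cross tabulation; useNA = "no" leaves out rows with a missing value
tab <- table(toy$vote, toy$sex, useNA = "no")
tab

# Estimated P(vote = N): share of N among the non-missing votes
n_obs <- sum(!is.na(toy$vote))
p_hat <- sum(toy$vote == "N", na.rm = TRUE) / n_obs

# Approximate 95% confidence interval: p_hat +/- z * standard error
se <- sqrt(p_hat * (1 - p_hat) / n_obs)
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se

# Estimated P(vote = N | sex = F): proportions within each column of tab
cond <- prop.table(tab, margin = 2)
p_N_given_F <- cond["N", "F"]
```

On the real data the same steps apply with Chile in place of toy; comparing p_hat with p_N_given_F is one way to look at the independence question.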
Part II: Sampling distributions and the central limit theorem
This is a purely theoretical exercise where we investigate the random distribution of
samples from a known population.
Waiting times in a queue
We start by sampling data from the so-called exponential distribution - also called the
negative exponential distribution. The exponential distribution is the most common
distribution used to describe the waiting time between arrivals in a queue. It has one
parameter, which is the number of arrivals per time unit, also called the arrival rate. In our
case we set it to 1 arrival per time unit. Since the arrival rate in our theoretical population
is 1, the mean waiting time for the population will be 𝜇 = 1. Furthermore, it can be shown
that the standard deviation is 𝜎 = 1.
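For reference, the standard results behind these two values can be written out as follows. The symbol \lambda for the arrival rate is notation introduced here and does not appear in the exam text.

```latex
f(y) = \lambda e^{-\lambda y}, \quad y \ge 0, \qquad
\mu = \mathrm{E}[Y] = \int_0^\infty y \, \lambda e^{-\lambda y} \, dy = \frac{1}{\lambda}, \qquad
\sigma = \sqrt{\mathrm{Var}[Y]} = \sqrt{\frac{1}{\lambda^{2}}} = \frac{1}{\lambda}
```

With \lambda = 1 this gives \mu = 1 and \sigma = 1, as stated.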
The following commands randomly sample 25 waiting times y and calculate their mean
y_bar.
y <- rexp(25, rate = 1)
y
## [1] 0.43411264 2.96845468 2.00201953 2.40607838 0.01393748 0.32013741
## [7] 0.20025879 0.68169148 0.78435122 0.71508338 0.88599114 0.18889305
## [13] 1.18766120 1.54653673 2.31660386 2.93045280 0.67177111 2.64166444
## [19] 0.45262537 0.29007238 1.02436731 2.79151352 1.54055438 0.01879488
## [25] 1.06047869
y_bar <- mean(y)
y_bar
## [1] 1.202964
Note: Since it is a random sample from the population your numbers will be different. Try
to rerun the commands a few times.
The following command replicates the sampling experiment 1000 times and saves the
result as a matrix y with 25 rows and 1000 columns:
y <- replicate(1000, rexp(25, rate = 1))
The mean is calculated for each of the 1000 replications and saved as y_bar (i.e. each entry
in y_bar is the average of the 25 values in the corresponding column):
y_bar <- colMeans(y)
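To see what replicate and colMeans return, it can help to run them on a much smaller case. The sketch below uses hypothetical names (y_small, y_bar_small) and an arbitrary seed so that it does not overwrite y and y_bar.

```r
set.seed(1)  # assumption: arbitrary seed, only to make the run reproducible

# 4 replications of a sample of size 3: one column per replication
y_small <- replicate(4, rexp(3, rate = 1))
dim(y_small)  # number of rows, number of columns

# One mean per column, i.e. one mean per replication
y_bar_small <- colMeans(y_small)
length(y_bar_small)

# Check: the second entry is the mean of the second column
all.equal(y_bar_small[2], mean(y_small[, 2]))
```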
Make a histogram of all the sampled waiting times using a command like
histogram(as.numeric(y), breaks = 40) inserted in a new code chunk (try to do
experiments with the number of breaks):
•
Explain how a histogram is constructed.
•
Does this histogram look like a normal distribution?
Now we focus on the mean waiting times y_bar.
•
Based on the known population parameters 𝜇 = 1 and 𝜎 = 1 what is the mean,
standard deviation and approximate distribution of y_bar according to the CLT?
•
What are the theoretical quartiles based on this approximate distribution of y_bar?
•
Compare the predicted values of mean, standard deviation and quartiles with the
observed values (you can use favstats to calculate these from y_bar).
•
Make a histogram of the sample means (y_bar). Does it look like a normal
distribution?
•
Make a boxplot of the sample means and explain how a boxplot is constructed.
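One possible way to check the comparison in the list above is sketched here in base R. The names y_sim, y_bar_sim, q_theory and observed and the seed are hypothetical choices; quantile, mean and sd stand in for favstats, and the normal approximation is the one suggested by the CLT.

```r
mu <- 1; sigma <- 1; n <- 25

# CLT: y_bar is approximately normal with mean mu and sd sigma / sqrt(n)
se <- sigma / sqrt(n)

# Theoretical lower and upper quartile of that normal distribution
q_theory <- qnorm(c(0.25, 0.75), mean = mu, sd = se)

# Simulated sample means for comparison (seed is an arbitrary assumption)
set.seed(2024)
y_sim <- replicate(1000, rexp(n, rate = 1))
y_bar_sim <- colMeans(y_sim)
observed <- c(mean = mean(y_bar_sim), sd = sd(y_bar_sim),
              quantile(y_bar_sim, c(0.25, 0.75)))

se
q_theory
observed
```

The observed values vary from run to run, so agreement with the theoretical values should only be approximate.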
Part III: Theoretical boxplot for a normal distribution
Finally, consider the theoretical boxplot of a general normal distribution with mean 𝜇 and
standard deviation 𝜎, and find the probability of being an outlier according to the 1.5⋅IQR
criterion:
•
First find the 𝑧-score of the lower/upper quartile. I.e. the value of 𝑧 such that 𝜇 ± 𝑧𝜎 is
the lower/upper quartile.
•
Use this to find the IQR (expressed in terms of 𝜎).
•
Now find the 𝑧-score of the maximal extent of the whisker. I.e. the value of 𝑧 such that
𝜇 ± 𝑧𝜎 is the endpoint of lower/upper whisker.
•
Find the probability of being an outlier.
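The four steps above can be checked numerically for the standard normal case (𝜇 = 0, 𝜎 = 1), since the 𝑧-scores do not depend on 𝜇 and 𝜎. This is a sketch with hypothetical variable names, taking the whisker endpoint to lie at the quartile plus or minus 1.5⋅IQR.

```r
# z-score of the upper quartile (the lower quartile is at -z_q by symmetry)
z_q <- qnorm(0.75)

# IQR in units of sigma: distance between the upper and lower quartile
iqr_in_sigma <- 2 * z_q

# z-score of the maximal whisker endpoint: quartile plus 1.5 * IQR
z_whisker <- z_q + 1.5 * iqr_in_sigma

# Probability of falling outside either whisker endpoint
p_outlier <- 2 * pnorm(-z_whisker)

c(z_q = z_q, iqr_in_sigma = iqr_in_sigma,
  z_whisker = z_whisker, p_outlier = p_outlier)
```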