Exam
It is highly recommended that you answer the exam using Rmarkdown (you can simply use
the exam Rmarkdown file as a starting point).
Part I: Estimating probabilities
Favrskov school transport data
In this part we will use the Favrskov dataset (remember to read the description of the
dataset when you download it):
Favrskov <- read.csv("http://asta.math.aau.dk/dan/static/datasets?file=Favrskov.dat", sep = "")
• Do a cross tabulation of the variables klassetrin and transport.
• Estimate the probability of transport="cyklet".
• Make a 95% confidence interval for the probability of transport="cyklet".
• Estimate the probability of transport="cyklet" given that klassetrin="indskolingen".
• What do these probabilities satisfy if transport and klassetrin are statistically independent?
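The calculations asked for above can be sketched on a small made-up data frame. This is only an illustration under stated assumptions: toy is hypothetical and is not the Favrskov data, and only the column names klassetrin and transport and the values "cyklet" and "indskolingen" come from the exercise. The interval shown is the large-sample (Wald) interval, which is one common choice and may not be the one intended by the course.

```r
# Hypothetical toy data, NOT the Favrskov dataset. The values "bus", "gik"
# and "udskolingen" are made up for illustration.
toy <- data.frame(
  klassetrin = c("indskolingen", "indskolingen", "indskolingen", "indskolingen",
                 "udskolingen", "udskolingen", "udskolingen", "udskolingen",
                 "udskolingen", "udskolingen"),
  transport  = c("cyklet", "bus", "gik", "cyklet",
                 "cyklet", "cyklet", "bus", "cyklet", "gik", "cyklet")
)

# Cross tabulation of the two variables (rows: klassetrin, columns: transport)
tab <- table(toy$klassetrin, toy$transport)

# Estimated probability of transport == "cyklet": the sample proportion
n     <- nrow(toy)
p_hat <- mean(toy$transport == "cyklet")

# Approximate 95% Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
se <- sqrt(p_hat * (1 - p_hat) / n)
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se

# Conditional proportion of "cyklet" within klassetrin == "indskolingen"
p_cond <- tab["indskolingen", "cyklet"] / sum(tab["indskolingen", ])
```

With only 10 made-up rows the Wald interval is very wide; it is shown here for the mechanics, not for the numbers.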
Part II: Sampling distributions and the central limit theorem
This is a purely theoretical exercise where we investigate the random distribution of
samples from a known population.
Waiting times in a queue
We start by sampling data from the so-called exponential distribution - also called the
negative exponential distribution. The exponential distribution is the most common
distribution used to describe the waiting time between arrivals in a queue. It has one
parameter, which is the number of arrivals per time unit, also called the arrival rate. In our
case we set it to 1 arrival per time unit. Since the arrival rate in our theoretical population
is 1, the mean waiting time for the population will be 𝜇 = 1. Furthermore, it can be shown
that the standard deviation is 𝜎 = 1.
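For reference, the stated mean and standard deviation follow from the standard density of the exponential distribution. The symbol λ for the arrival rate is introduced here for the derivation; the text itself does not name it.

```latex
f(y) = \lambda e^{-\lambda y}, \quad y \ge 0,
\qquad
\mu = \int_0^\infty y \,\lambda e^{-\lambda y}\, dy = \frac{1}{\lambda},
\qquad
\sigma^2 = \int_0^\infty y^2 \,\lambda e^{-\lambda y}\, dy - \mu^2
         = \frac{2}{\lambda^2} - \frac{1}{\lambda^2}
         = \frac{1}{\lambda^2}.
```

Setting λ = 1 gives 𝜇 = 1 and 𝜎 = 1, as used in the rest of this part.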
The following commands randomly sample 25 waiting times y and calculate their mean y_bar.
y <- rexp(25, rate = 1)
y
## [1] 0.11440572 0.82283062 0.08347796 0.21030060 0.85696115 2.09756011
## [7] 2.36595784 0.59501424 0.81090968 1.89092012 0.84391031 0.49074162
## [13] 0.36983323 1.24641370 1.02974101 1.93451403 0.98985420 0.07010304
## [19] 0.34308119 0.10522128 0.02916098 3.61961480 1.36999000 2.54954789
## [25] 0.42302329
y_bar <- mean(y)
y_bar
## [1] 1.010524
Note: Since it is a random sample from the population your numbers will be different. Try
to rerun the commands a few times.
The following command replicates the sampling experiment 1000 times and saves the
result as a matrix y with 25 rows and 1000 columns:
y <- replicate(1000, rexp(25, rate = 1))
The mean y_bar is calculated for each of the 1000 replications (i.e. each entry in y_bar is
the average of the 25 values in the corresponding column of y):
y_bar <- colMeans(y)
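To see what replicate and colMeans produce, the same steps can be run on a much smaller experiment. This is a sketch only: the sizes 3 and 5, the seed, and the names y_small, y_bar_small and y_bar_check are chosen here for illustration and are not part of the exam.

```r
set.seed(1)  # arbitrary seed, only to make this sketch reproducible

# Smaller experiment: 5 replications of 3 waiting times each
y_small <- replicate(5, rexp(3, rate = 1))
dim(y_small)  # 3 rows (observations) and 5 columns (replications)

# One sample mean per replication, i.e. per column
y_bar_small <- colMeans(y_small)

# The same means computed explicitly, column by column
y_bar_check <- apply(y_small, 2, mean)
all.equal(y_bar_small, y_bar_check)
```

Each call to rexp fills one column, which is why the replications end up as columns rather than rows.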
Make a histogram of all the sampled waiting times using a command like
histogram(as.numeric(y), breaks = 40) inserted in a new code chunk (try to do
experiments with the number of breaks):
• Explain how a histogram is constructed.
• Does this histogram look like a normal distribution?
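As a rough illustration of what a histogram consists of, the sketch below uses base R's hist with plot = FALSE (a swapped-in alternative to the histogram command mentioned above) to return the bin boundaries and counts instead of drawing them. The sample size, seed, number of breaks and the names x, h and manual are assumptions made for this example.

```r
set.seed(1)  # arbitrary seed for reproducibility of the sketch
x <- rexp(200, rate = 1)

# With plot = FALSE, hist() returns the bins instead of drawing them
h <- hist(x, breaks = 10, plot = FALSE)

h$breaks  # bin boundaries
h$counts  # number of observations falling in each bin

# Counting by hand with cut() reproduces the bar heights
# (hist() uses right-closed bins and includes the lowest value by default)
manual <- as.vector(table(cut(x, breaks = h$breaks, include.lowest = TRUE)))
```

Note that breaks = 10 is treated by hist as a suggestion, so the actual number of bins may differ slightly.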
Now we focus on the mean waiting times y_bar.
• Based on the known population parameters 𝜇 = 1 and 𝜎 = 1, what are the mean, standard deviation and approximate distribution of y_bar according to the CLT?
• What are the theoretical quartiles based on this approximate distribution of y_bar?
• Compare the predicted values of mean, standard deviation and quartiles with the observed values (you can use favstats to calculate these from y_bar).
• Make a histogram of the sample means (y_bar). Does it look like a normal distribution?
• Make a boxplot of the sample means and explain how a boxplot is constructed.
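One possible way to set up the comparison between the CLT prediction and the simulation is sketched below in base R. It assumes the usual CLT approximation (y_bar roughly normal with mean 𝜇 and standard deviation 𝜎/√n); the seed and the names se, q_theory and obs are introduced here for illustration, and base quantile is used in place of favstats.

```r
mu    <- 1   # population mean (from the text)
sigma <- 1   # population standard deviation (from the text)
n     <- 25  # sample size (from the text)

# CLT approximation: y_bar is roughly normal with mean mu and sd sigma / sqrt(n)
se <- sigma / sqrt(n)

# Theoretical quartiles (and median) of the approximating normal distribution
q_theory <- qnorm(c(0.25, 0.5, 0.75), mean = mu, sd = se)

# Simulated counterpart, following the commands in the text
set.seed(1)  # arbitrary seed
y     <- replicate(1000, rexp(n, rate = 1))
y_bar <- colMeans(y)

# Observed mean, standard deviation and quartiles of the 1000 sample means
obs <- c(mean = mean(y_bar), sd = sd(y_bar),
         quantile(y_bar, c(0.25, 0.5, 0.75)))
```

Printing q_theory next to obs gives the side-by-side comparison; the observed values vary from run to run unless the seed is fixed.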
Part III: Theoretical boxplot for a normal distribution
Finally, consider the theoretical boxplot of a general normal distribution with mean 𝜇 and
standard deviation 𝜎, and find the probability of being an outlier according to the 1.5⋅IQR
criterion:
• First find the 𝑧-score of the lower/upper quartile, i.e. the value of 𝑧 such that 𝜇 ± 𝑧𝜎 is the lower/upper quartile.
• Use this to find the IQR (expressed in terms of 𝜎).
• Now find the 𝑧-score of the maximal extent of the whisker, i.e. the value of 𝑧 such that 𝜇 ± 𝑧𝜎 is the endpoint of the lower/upper whisker.
• Find the probability of being an outlier.
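The four steps above can be sketched numerically with qnorm and pnorm. This is a minimal illustration, assuming the whisker is allowed to extend exactly 1.5⋅IQR beyond each quartile; the names z_q, iqr, z_w and p_out are introduced here and are not part of the exam.

```r
# Work on the standard normal scale; the result does not depend on mu or sigma
z_q   <- qnorm(0.75)      # z-score of the upper quartile (the lower one is -z_q)
iqr   <- 2 * z_q          # IQR in units of sigma
z_w   <- z_q + 1.5 * iqr  # z-score of the maximal whisker endpoint
p_out <- 2 * pnorm(-z_w)  # probability of falling beyond either whisker
```

Because the normal distribution is symmetric, the two tails contribute equally, which is why the lower-tail probability is simply doubled.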