Exam

It is highly recommended that you answer the exam using Rmarkdown (you can simply use the exam Rmarkdown file as a starting point).

Part I: Estimating probabilities

Remember to load the mosaic package first:

    library(mosaic)

Chile referendum data

In this part we will use the dataset Chile. Remember to read the description of the dataset as well as the Wikipedia entry about the background.

    Chile <- read.table("http://asta.math.aau.dk/dan/static/datasets?file=Chile.dat", header=TRUE, quote="\"")

NB: This dataset has several missing values (NA). To remove these when you use tally you can add the argument useNA = "no".

• Do a cross tabulation of the variables vote and sex.
• Estimate the probability of vote=N.
• Make a 95% confidence interval for the probability of vote=N.
• Estimate the probability of vote=N given that sex=F.
• What would these probabilities satisfy if vote and sex were statistically independent?

Part II: Sampling distributions and the central limit theorem

This is a purely theoretical exercise where we investigate the random distribution of samples from a known population.

Waiting times in a queue

We start by sampling data from the so-called exponential distribution (also called the negative exponential distribution). The exponential distribution is the most common distribution used to describe the waiting time between arrivals in a queue. It has one parameter, which is the number of arrivals per time unit, also called the arrival rate. In our case we set it to 1 arrival per time unit. Since the arrival rate in our theoretical population is 1, the mean waiting time for the population will be 𝜇 = 1. Furthermore, it can be shown that the standard deviation is 𝜎 = 1.

The following commands randomly sample 25 waiting times y and calculate their mean y_bar:
    y <- rexp(25, rate = 1)
    y

    ##  [1] 0.43411264 2.96845468 0.68169148 1.54653673 0.29007238 2.00201953
    ##  [7] 0.20025879 0.78435122 2.31660386 1.02436731 2.40607838 0.71508338
    ## [13] 1.18766120 2.93045280 2.79151352 0.01393748 0.88599114 0.67177111
    ## [19] 0.45262537 1.54055438 0.32013741 0.18889305 2.64166444 0.01879488
    ## [25] 1.06047869

    y_bar <- mean(y)
    y_bar

    ## [1] 1.202964

Note: Since it is a random sample from the population your numbers will be different. Try to rerun the commands a few times.

The following command replicates the sampling experiment 1000 times and saves the result as a matrix (y) with 25 rows and 1000 columns:

    y <- replicate(1000, rexp(25, rate = 1))

The mean (y_bar) is calculated for each of the 1000 replications (i.e. each entry in y_bar is the average of the 25 values in the corresponding column):

    y_bar <- colMeans(y)

Make a histogram of all the sampled waiting times using a command like histogram(as.numeric(y), breaks = 40) inserted in a new code chunk (try experimenting with the number of breaks):

• Explain how a histogram is constructed.
• Does this histogram look like a normal distribution?

Now we focus on the mean waiting times y_bar.

• Based on the known population parameters 𝜇 = 1 and 𝜎 = 1, what are the mean, standard deviation and approximate distribution of y_bar according to the CLT?
• What are the theoretical quartiles based on this approximate distribution of y_bar?
• Compare the predicted values of mean, standard deviation and quartiles with the observed values (you can use favstats to calculate these from y_bar).
• Make a histogram of the sample means (y_bar). Does it look like a normal distribution?
• Make a boxplot of the sample means and explain how a boxplot is constructed.
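
The comparison between predicted and observed values asked for above could be set up along the following lines. This is only a minimal sketch in base R, not a model answer: the fixed seed, the helper names (clt_mean, clt_sd, clt_quartiles, comparison) and the use of quantile() instead of favstats are assumptions introduced here for illustration.

```r
# Sketch only: base R, with a fixed seed (an assumption, added so that
# repeated runs give the same numbers).
set.seed(1)

n     <- 25    # sample size used in the exam text
reps  <- 1000  # number of replications used in the exam text
mu    <- 1     # population mean of the exponential distribution with rate 1
sigma <- 1     # population standard deviation of the same distribution

# Same sampling experiment as in the text: a 25 x 1000 matrix of waiting times
y     <- replicate(reps, rexp(n, rate = 1))
y_bar <- colMeans(y)

# Values predicted by the CLT: y_bar is approximately N(mu, sigma / sqrt(n))
clt_mean      <- mu
clt_sd        <- sigma / sqrt(n)
clt_quartiles <- qnorm(c(0.25, 0.75), mean = clt_mean, sd = clt_sd)

# Observed values from the simulated sample means
obs_quartiles <- quantile(y_bar, c(0.25, 0.75))

comparison <- data.frame(
  quantity  = c("mean", "sd", "Q1", "Q3"),
  predicted = c(clt_mean, clt_sd, clt_quartiles),
  observed  = c(mean(y_bar), sd(y_bar), unname(obs_quartiles))
)
print(comparison)
```

With the mosaic package loaded, favstats(y_bar) should report the same observed mean, standard deviation and quartiles in a single call.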
Part III: Theoretical boxplot for a normal distribution

Finally, consider the theoretical boxplot of a general normal distribution with mean 𝜇 and standard deviation 𝜎, and find the probability of being an outlier according to the 1.5⋅IQR criterion:

• First find the 𝑧-score of the lower/upper quartile, i.e. the value of 𝑧 such that 𝜇 ± 𝑧𝜎 is the lower/upper quartile.
• Use this to find the IQR (expressed in terms of 𝜎).
• Now find the 𝑧-score of the maximal extent of the whisker, i.e. the value of 𝑧 such that 𝜇 ± 𝑧𝜎 is the endpoint of the lower/upper whisker.
• Find the probability of being an outlier.
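
The four steps above can be checked numerically with the standard normal functions qnorm and pnorm from base R. The following is a hedged sketch rather than a model answer; the variable names (z_q, iqr_in_sigma, z_whisker, p_outlier) are illustrative assumptions, and everything is expressed in units of 𝜎 so that the result holds for any 𝜇 and 𝜎.

```r
# Sketch only: work with the standard normal, so all values are z-scores.

# Step 1: z-score of the upper quartile (the lower quartile is -z_q by symmetry)
z_q <- qnorm(0.75)

# Step 2: the IQR spans from mu - z_q * sigma to mu + z_q * sigma
iqr_in_sigma <- 2 * z_q

# Step 3: a whisker reaches at most 1.5 * IQR beyond the quartile
z_whisker <- z_q + 1.5 * iqr_in_sigma

# Step 4: an outlier lies beyond either whisker endpoint (two symmetric tails)
p_outlier <- 2 * pnorm(-z_whisker)

print(c(z_q = z_q, iqr_in_sigma = iqr_in_sigma,
        z_whisker = z_whisker, p_outlier = p_outlier))
```

Because only z-scores enter the calculation, the resulting probability does not depend on the particular values of 𝜇 and 𝜎.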