ISQS 5347 Homework #6

1.A) In the customer satisfaction distribution, the numbers 0.05, 0.15, 0.15, 0.15, and 0.50 mean that about 5 out of 100 customers will rate their satisfaction as 1, about 15 out of 100 will rate it as 2, about 15 out of 100 will give a rating of 3, about 15 out of 100 will give a rating of 4, and about 50 out of 100 will give a rating of 5. These numbers mathematically model the process that produced the customer satisfaction data by specifying the frequencies with which each y value is generated.

B) The expected value of the distribution, E(Y), is a probability-weighted average of the y values. For instance, the value y = 5 carries 50 percent of the weight of the expected value. Since the distribution is discrete, we use the discrete expected value formula:

E(Y) = Σ y·p(y) = 1(0.05) + 2(0.15) + 3(0.15) + 4(0.15) + 5(0.50) = 3.9.

The variance of the distribution, Var(Y), is a probability-weighted measure of the squared differences of the y values from the expected value of the distribution. Since the distribution is discrete, the variance is computed using the discrete variance formula:

Var(Y) = Σ (y − E(Y))²·p(y).

This can be computed manually or by using the following R code.

y = c(1, 2, 3, 4, 5)
p = c(0.05, 0.15, 0.15, 0.15, 0.50)
mu.y = sum(y * p)                  # E(Y) = 3.9, needed below
sigma2.y = sum((y - mu.y)^2 * p)

This code gives a variance of 1.69 for the distribution.

C) The Law of Large Numbers, applied to this situation, tells us that when we generate n hypothetical customer ratings from this distribution, the average of those n values will get closer to the expected value of the distribution, 3.9, as n approaches infinity.

D) We can find the mean of the distribution of Ȳ by expressing E(Ȳ) in terms of E(Y), which we know to be 3.9. First, by substitution, E(Ȳ) = E((1/1000)(Y1 + Y2 + ⋯ + Y1000)). Then, by the linearity property of expected value, this quantity equals (1/1000) E(Y1 + Y2 + ⋯ + Y1000). Next, by the additivity property of expected value, the quantity becomes (1/1000)[E(Y1) + E(Y2) + ⋯ + E(Y1000)].
Now, since the Yi are identically distributed, each of the 1000 E(Yi) terms is identical. The expression then reduces to (1/1000)[1000·E(Y)] = E(Y). Hence we have shown that E(Ȳ) = E(Y), so we can conclude that E(Ȳ) = 3.9. Note that independence was not needed in this calculation, only identical distributions of the random variables.

E) The variance of the distribution of Ȳ can be found similarly by expressing Var(Ȳ) as a function of Var(Y), which I have already calculated to be 1.69. First, by substitution, Var(Ȳ) = Var((1/1000)(Y1 + Y2 + ⋯ + Y1000)). Then, by the scaling property of variance, this expression equals (1/1000²) Var(Y1 + Y2 + ⋯ + Y1000). By the additivity property of variance, and using the fact that all covariance terms equal zero because the 1000 random variables are independent, the quantity becomes (1/1000²)[Var(Y1) + Var(Y2) + ⋯ + Var(Y1000)]. Now, since the Yi are identically distributed, each of the 1000 Var(Yi) terms is identical. The expression then reduces to (1/1000²)[1000·Var(Y)] = Var(Y)/1000. Plugging in Var(Y) = 1.69 gives Var(Ȳ) = 1.69/1000 = 0.00169. Both independence and identical distribution of the random variables were needed to calculate the variance of the sample average.

F) The standard deviation of the sample average is the square root of Var(Ȳ): √0.00169 ≈ 0.04111. Chebyshev's inequality, applied in this case with k = 3, states that Pr(E(Ȳ) − 3·sd(Ȳ) < Ȳ < E(Ȳ) + 3·sd(Ȳ)) > 1 − 1/3². Plugging in the values calculated earlier, the inequality becomes Pr(3.9 − 3(0.04111) < Ȳ < 3.9 + 3(0.04111)) > 1 − 1/9, that is, Pr(3.77667 < Ȳ < 4.02333) > 0.8889. This tells us that at least 88.89% of the Ȳ values produced by this distribution will be between 3.77667 and 4.02333. That is, at least 88.89% of the Ȳ produced will be within ±3 standard deviations of the mean.

G) A total of 1000 × 10000 data points will be generated, so I first created this sample and assigned it to the variable all.y using the following code.
n = 1000
Nrep = 10000
all.y = sample(y, n * Nrep, replace = TRUE, prob = p)   # prob must be named, or it is taken as the replace argument

It was then necessary to group the data into a matrix with 1000 columns, which allowed the calculation of 10000 random samples of Ȳ, as shown in the code below.

matrix.y = matrix(all.y, ncol = n)
ybar = rowMeans(matrix.y)

We can check how well Chebyshev's inequality works with the data that was produced by checking how many of the 10000 sampled Ȳ values are between 3.77667 and 4.02333. This was done using the code below.

check = (3.77667 < ybar & ybar < 4.02333)
mean(check)

This gave a result of 0.9969. Thus 99.69%, or 9969 of the 10000 Ȳ values produced by the distribution, were between 3.77667 and 4.02333. This is consistent with the prediction of Chebyshev's inequality in part F, since 99.69% > 88.89%; the inequality guarantees only a lower bound, and the data easily satisfy it, so Chebyshev's inequality works appropriately in describing the data produced.

H) The Ȳ values were graphed in a histogram using the code below to produce Figure 1.

hist(ybar, breaks = 50, freq = F)

Figure 1

The q-q plot was created using the code below and is shown in Figure 2.

qqnorm(ybar)
qqline(ybar)

Figure 2

Figure 1 shows a fairly symmetrical bell shape, indicative of a normal distribution; Figure 2 shows that the sample quantiles of the generated Ȳ values fit very closely with the quantiles of a normal distribution, also indicating that the Ȳ values were produced by an approximately normal distribution. From Figures 1 and 2, it appears that the Central Limit Theorem did work in this case, because the distribution of the sample averages is approximately normal. This is made possible by the large sample size of n = 1000.

I) The 68-95-99.7 rule was checked for the Ȳ values using code similar to the code that checked Chebyshev's inequality. Roughly 68% of the data produced by a normal distribution will be within 1 standard deviation of the mean; it is expected that this will hold for the Ȳ values that were generated, given the analysis of the data's histogram and q-q plot.
The code below computed the percentage of Ȳ values in this range.

within1sd = abs(ybar - 3.9) < 1 * 0.04111
prop.1sd = mean(within1sd)

The percentages of Ȳ values within 2 and 3 standard deviations of the mean were similarly calculated using the code below.

within2sd = abs(ybar - 3.9) < 2 * 0.04111
prop.2sd = mean(within2sd)
within3sd = abs(ybar - 3.9) < 3 * 0.04111
prop.3sd = mean(within3sd)

This code tells us that 69.23% of the generated data was within 1 standard deviation of the mean, 95.3% was within 2 standard deviations, and 99.69% was within 3 standard deviations. Although the 68-95-99.7 rule did not predict the spread of the data with complete accuracy, due to chance, the percentages are close enough that the rule worked reasonably well.

J) The 68-95-99.7 rule is more useful in this case for most purposes because it gives point predictions for the probabilities of data being within 1, 2, and 3 standard deviations of the mean, and those predictions were correct to within about 2 percentage points. The distribution of Ȳ is approximately normal, which enabled the correct use of the 68-95-99.7 rule. Chebyshev's inequality, on the other hand, gives only a bound on the true probabilities, so it is usually less precise. However, if a guarantee is needed, Chebyshev's inequality is the better interpretation: predictions from the 68-95-99.7 rule are only approximate because the distribution of Ȳ is only approximately normal, but Chebyshev's inequality holds for all distributions, so we can be certain that its bound is correct.
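As a final reproducibility check, the steps from parts B through I can be run end to end in one R script. The set.seed(1) call is my addition so the run is repeatable; because the results above came from an unseeded run, the simulated proportions here will differ slightly from those reported, though the theoretical quantities (3.9, 1.69, 0.04111, and the 88.89% Chebyshev bound) are exact.

```r
# End-to-end check of the homework simulation (seed added for reproducibility)
set.seed(1)
y = c(1, 2, 3, 4, 5)
p = c(0.05, 0.15, 0.15, 0.15, 0.50)

mu.y     = sum(y * p)                # theoretical E(Y) = 3.9
sigma2.y = sum((y - mu.y)^2 * p)     # theoretical Var(Y) = 1.69

n = 1000
Nrep = 10000
sd.ybar = sqrt(sigma2.y / n)         # theoretical sd(Ybar), about 0.04111

# Draw n*Nrep ratings and form Nrep sample averages of size n each
all.y = sample(y, n * Nrep, replace = TRUE, prob = p)
ybar  = rowMeans(matrix(all.y, ncol = n))

# Chebyshev (part F/G): at least 1 - 1/3^2 = 88.89% should lie within 3 sd
cheb.prop = mean(abs(ybar - mu.y) < 3 * sd.ybar)

# 68-95-99.7 rule (part I): proportions within 1, 2, and 3 sd of the mean
props = sapply(1:3, function(k) mean(abs(ybar - mu.y) < k * sd.ybar))
print(c(mu.y = mu.y, sigma2.y = sigma2.y, cheb = cheb.prop))
print(props)
```

With a different seed the individual proportions move around slightly, but the Chebyshev proportion always stays above the 0.8889 lower bound, while the three proportions track 0.68, 0.95, and 0.997 only approximately, which is the contrast drawn in part J.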