ISQS 5347
Homework #6
1.A) In the customer satisfaction distribution, the numbers 0.05, 0.15, 0.15, 0.15, and
0.50 mean that about 5 out of 100 customers will rate their satisfaction as 1, about
15 out of 100 will rate their satisfaction as 2, about 15 out of 100 will give a
rating of 3, about 15 out of 100 will give a rating of 4, and about 50 out of 100 will
give a rating of 5. These numbers mathematically model the process that
produced the customer satisfaction data by specifying the relative frequencies with
which each y value is generated.
B) The expected value of the distribution, E(Y), is a probability-weighted average of
the y values. For instance, the value y = 5 holds 50 percent of the weight of
the expected value. Since the distribution is discrete, we should use the discrete
expected value formula: E(Y) = ∑ y ∗ p(y) = 1 ∗ 0.05 + 2 ∗ 0.15 + 3 ∗ 0.15 + 4 ∗ 0.15 + 5 ∗
0.50 = 3.9.
The variance of the distribution, Var(Y), is a probability-weighted measure of the
average squared difference of the y values from the expected value of the distribution.
Since the distribution is discrete, the variance should be computed using the discrete
variance formula: ∑ (y − E(Y))² ∗ p(y). This can be computed manually or by using
the following R code.
y = c(1, 2, 3, 4, 5)
p = c(.05, .15, .15, .15, .5)
mu.y = sum(y * p)                # E(Y) = 3.9
sigma2.y = sum((y - mu.y)^2 * p)
This code gives a variance of 1.69 for the distribution.
C) The Law of Large Numbers, applied in this situation, tells us that when we
generate n hypothetical customer ratings from this distribution, the average of those n
data values will get closer to the expected value of the distribution, 3.9, as n
approaches infinity.
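This convergence can be illustrated with a short simulation. The sketch below (not part of the assignment; the seed and the variable names draws and running.mean are my own choices) draws ratings from the distribution defined in part B and tracks the running sample mean, which settles near 3.9 as n grows.

```r
# Sketch: watch the running sample mean approach E(Y) = 3.9, as the LLN predicts.
set.seed(1)                       # assumed seed, for reproducibility
y = c(1, 2, 3, 4, 5)              # y and p as defined in part B
p = c(.05, .15, .15, .15, .5)
draws = sample(y, 100000, replace = TRUE, prob = p)
running.mean = cumsum(draws) / seq_along(draws)
running.mean[c(10, 1000, 100000)] # early averages wander; later ones hug 3.9
```

The early entries of running.mean can be far from 3.9, but the later ones cluster tightly around it, which is exactly the behavior the Law of Large Numbers describes.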
D) We can find the mean of the distribution of Ȳ by expressing E(Ȳ) in terms of
E(Y), which we know to be 3.9. First, by substitution, E(Ȳ) = E((1/1000)(Y₁ + Y₂ + ⋯ +
Y₁₀₀₀)). Then, by the linearity property of expected value, this quantity equals
(1/1000) ∗ E(Y₁ + Y₂ + ⋯ + Y₁₀₀₀). Next, by the additivity property of expected value, the
quantity becomes (1/1000) ∗ [E(Y₁) + E(Y₂) + ⋯ + E(Y₁₀₀₀)]. Now, since each Yᵢ is
identically distributed, each of the 1000 E(Yᵢ) terms is identical. The expression
can now be reduced to (1/1000) ∗ [1000 ∗ E(Y)] = E(Y). Hence we have shown that
E(Ȳ) = E(Y), so we can conclude that E(Ȳ) = 3.9. Note that independence was not
needed in these calculations, only identical distributions of the random variables.
E) The variance of the distribution of Ȳ can be found similarly by expressing Var(Ȳ)
as a function of Var(Y), which I have already calculated to be 1.69. First, by
substitution, Var(Ȳ) = Var((1/1000)(Y₁ + Y₂ + ⋯ + Y₁₀₀₀)). Then, by the scaling
property of variance, this expression equals (1/1000²) ∗ Var(Y₁ + Y₂ + ⋯ + Y₁₀₀₀). By the
additivity property of variance, and using the fact that all covariance terms equal
zero because the 1000 random variables are independent, the quantity becomes
(1/1000²) ∗ [Var(Y₁) + Var(Y₂) + ⋯ + Var(Y₁₀₀₀)]. Now, since each Yᵢ is identically
distributed, each of the 1000 Var(Yᵢ) terms is identical. The expression can now be
reduced to (1/1000²) ∗ [1000 ∗ Var(Y)] = Var(Y)/1000. Plugging in Var(Y) = 1.69 gives
Var(Ȳ) = 1.69/1000 = 0.00169. Both independence and identical distributions of the random
variables were needed to calculate the variance of the sample average.
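The Var(Ȳ) = Var(Y)/1000 result can also be checked empirically. The sketch below (my own quick check, not part of the assignment; the seed and the name ybars are assumptions) simulates many independent samples of size 1000 and compares the variance of their averages against 0.00169.

```r
# Sketch: empirical check that Var(Ybar) is close to Var(Y)/1000 = 0.00169.
set.seed(2)                       # assumed seed
y = c(1, 2, 3, 4, 5)              # y and p as defined in part B
p = c(.05, .15, .15, .15, .5)
ybars = replicate(5000, mean(sample(y, 1000, replace = TRUE, prob = p)))
var(ybars)                        # lands near 1.69/1000 = 0.00169
```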
F) The standard deviation of the sample average can be found by taking the square
root of Var(Ȳ): √0.00169 = 0.04111. Chebyshev's inequality, applied in this case,
states that Pr(E(Ȳ) − 3 ∗ SD(Ȳ) < Ȳ < E(Ȳ) + 3 ∗ SD(Ȳ)) > 1 − 1/3². Plugging in the
values calculated earlier, the inequality becomes
Pr(3.9 − 3 ∗ 0.04111 < Ȳ < 3.9 + 3 ∗ 0.04111) > 1 − 1/9,
i.e., Pr(3.77667 < Ȳ < 4.02333) > 0.8889.
This tells us that at least 88.89% of the Ȳ values produced by this
distribution will be between 3.77667 and 4.02333. That is, at least 88.89% of the Ȳ values produced will be within ±3 standard deviations of the mean.
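The arithmetic above can be reproduced in a few lines of R. This is a sketch of my own (variable names lower, upper, and bound are assumptions, not from the assignment), using the mean and variance derived in parts D and E.

```r
# Sketch: compute the 3-standard-deviation Chebyshev interval for Ybar.
sd.ybar = sqrt(1.69 / 1000)       # SD(Ybar) = 0.04111...
lower = 3.9 - 3 * sd.ybar         # 3.77667
upper = 3.9 + 3 * sd.ybar         # 4.02333
bound = 1 - 1 / 3^2               # Chebyshev lower bound: 8/9 = 0.8889
c(lower, upper, bound)
```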
G) A total of 1000 ∗ 10000 data points will be generated, so I first created this
sample and assigned it to the variable all.y using the following code.
n = 1000
Nrep = 10000
all.y = sample(y, n * Nrep, replace = TRUE, prob = p)
It was then necessary to group the data into a matrix with 1000 columns, which
allowed the calculation of 10000 random samples of Ȳ as shown in the code below.
matrix.y = matrix(all.y, ncol = n)
ybar = rowMeans(matrix.y)
We can check how well Chebyshev's inequality works with the data that was
produced by checking how many Ȳ values out of the 10000 sampled are between
3.77667 and 4.02333. This was done using the code below.
check = (3.77667 < ybar & 4.02333 > ybar)
mean(check)
This gave a result of 0.9969. Thus 99.69%, or 9969 of the 10,000 Ȳ values produced by
the distribution, were between 3.77667 and 4.02333. This is consistent with the
prediction of Chebyshev's inequality in part F, since 99.69% > 88.89%, so
Chebyshev's inequality works appropriately in describing the data produced.
H) The Ȳ values were graphed in a histogram using the code
below to produce Figure 1.
hist(ybar, breaks=50, freq=F)
Figure 1
The q-q plot was created using the code below and is shown in Figure 2.
qqnorm(ybar)
qqline(ybar)
Figure 2
Figure 1 shows a fairly symmetrical bell shape, indicative of a normal distribution;
Figure 2 shows that the sample quantiles of the generated Ȳ values fit very closely
with the quantiles of a normal distribution, also indicating that the sample averages
follow an approximately normal distribution. From Figures 1 and 2, it appears that the
Central Limit Theorem did work in this case because the distribution of the sample
averages is approximately normal. This is made possible by the high n value of 1000.
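A further visual check, not required by the assignment, is to overlay the theoretical normal density implied by the CLT, N(3.9, 0.04111²), on the histogram of the Ȳ values. The sketch below regenerates ybar as in part G so that it is self-contained (the seed is an assumption).

```r
# Sketch: overlay the CLT's predicted normal density on the histogram of ybar.
set.seed(3)                       # assumed seed
y = c(1, 2, 3, 4, 5)              # y and p as defined in part B
p = c(.05, .15, .15, .15, .5)
ybar = rowMeans(matrix(sample(y, 1000 * 10000, replace = TRUE, prob = p),
                       ncol = 1000))
hist(ybar, breaks = 50, freq = FALSE)                 # density-scale histogram
curve(dnorm(x, mean = 3.9, sd = 0.04111), add = TRUE) # theoretical N(3.9, 0.04111^2)
```

If the CLT is working, the curve should trace the top of the histogram bars closely.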
I) The 68-95-99.7 rule was checked for the Ȳ values using code similar to the code
that checked Chebyshev's inequality. Roughly 68% of the data produced by a
normal distribution will be within 1 standard deviation of the mean; based on the
analysis of the data's histogram and q-q plot, this is expected to hold for the Ȳ
values that were generated. The code below checked the percentage of Ȳ values
that were in this range.
within1sd = abs(ybar-3.9) < 1*0.04111
prop.1sd = mean(within1sd)
The percentage of values within 2 and 3 standard deviations of the mean were
similarly calculated using the code below.
within2sd = abs(ybar-3.9) < 2*0.04111
prop.2sd = mean(within2sd)
within3sd = abs(ybar-3.9) < 3*0.04111
prop.3sd = mean(within3sd)
This code tells us that 69.23% of the data that was generated was within 1 standard
deviation of the mean, 95.3% of the data was within 2 standard deviations of the
mean, and 99.69% of the data was within 3 standard deviations of the mean. Although
the 68-95-99.7 rule did not predict the spread of the data with complete accuracy,
due to chance, the percentages are close enough that the rule worked reasonably
well.
J) The 68-95-99.7 rule is more useful in this case for most purposes because it gives
exact predictions for the probabilities of data being within 1, 2, and 3 standard
deviations of the mean, and these predictions were correct to within 2%. The
distribution of Ȳ is approximately normal, which enabled the correct use of the
68-95-99.7 rule. Chebyshev's inequality, on the other hand, gives only a range that
the true probabilities could fall in, so it is usually less precise.
However, if complete accuracy is needed, Chebyshev's inequality is the better
interpretation. Predictions from the 68-95-99.7 rule are only approximate because
the distribution of Ȳ is only approximately normal; but Chebyshev's inequality is
true for all distributions, so we can be certain that it makes accurate predictions.