Definitions

1. Define sampling distribution and explain how it relates to the standard error of the mean. (2 Points)

Sampling distribution: The probability distribution that describes how a statistic, such as the mean, varies from sample to sample. The location and scale of the sampling distribution of $\bar{X}$ are estimated by $\bar{X}$ and by the standard error of $\bar{X}$, which is $s_{\bar{X}} = s/\sqrt{n}$. The standard error of the mean therefore measures how much the mean varies from sample to sample.

2. Compare and contrast two-sample comparisons with paired comparisons. (3 Points)

Two-sample comparison: A test for the difference between the means of two different samples. No relationship exists between the observations in the two samples; the samples are independent of each other.

Paired comparison: A comparison of two treatments using dependent samples designed to be similar. Observations in each sample naturally associate with each other. This often occurs by exposing the same experimental unit to two different treatments.

Both the two-sample and paired comparisons are used to measure the difference between two groups. Pairing isolates the effect of the treatment, reducing the random variation that can hide a difference.

3. Define the Tukey Bulging Rule and explain how it can help determine appropriate transformations. (3 Points)

Tukey’s bulging rule uses the accompanying diagram to suggest appropriate transformations whenever we encounter a nonlinear pattern. We match the pattern found in our data with the shape shown in one of the four quadrants of the diagram, then move up or down the ladder of powers for the x-axis and/or y-axis to transform the nonlinear pattern into a straight line. The process is usually iterative.

[Diagram: Tukey’s bulging rule, with each quadrant labeled by a direction on the ladder of powers. Y up: $Y^2$, $Y^3$. X up: $X^2$, $X^3$. X down: $\sqrt{X}$, $\log X$. Y down: $\sqrt{Y}$, $\log Y$.]

4.
Define elasticity and explain how you can use simple regression to estimate a value for elasticity. (2 Points)

Elasticity: The ratio of the percentage change in Y, the response variable, to the percentage change in X, the explanatory variable. The slope coefficient in a log-log regression gives an estimate of the elasticity.

5. Compare and contrast one-tailed versus two-tailed hypothesis tests. Use an example to illustrate your answer. (3 Points)

One-sided hypothesis: Hypotheses in which the null hypothesis allows any value of a parameter larger (or smaller) than a specified value, for example $H_0: \mu \le \mu_0$ versus $H_A: \mu > \mu_0$.

Two-sided hypothesis: Hypotheses in which the null hypothesis asserts a specific value for the population parameter, for example $H_0: \mu = \mu_0$ versus $H_A: \mu \ne \mu_0$.

Graphically, the two-tailed test corresponds to the top t-test and includes the areas in both the left and right tails of the distribution. The one-tailed test for the less-than null hypothesis is the middle t-test and corresponds to the probability in the right tail. The one-tailed test for the greater-than null hypothesis corresponds to the probability below the t-value.

Problems

Before purchasing videoconferencing equipment, a company ran tests of its current internal computer network. The goal of the tests was to measure how rapidly data moved through the network given the current demand on the network. Eighty files ranging in size from 20 to 100 megabytes (MB) were transmitted over the network at various times of day, and the time to send the files was recorded. The attached JMP output reports the results of a simple regression model with Y given by Transfer Time and X given by File Size. Use the JMP output to answer the following questions.

1. Define, discuss, and compare the conditional and unconditional means of the response variable Transfer Time. (3 Points)

Unconditional mean: The expected value or mean of a distribution without reference to any possible explanatory or predictor variables: $E(Y) = \bar{Y} = 25$.
There is no value for file size specified, so we use the simple average over the entire sample.

Conditional mean: The average or expected value of one variable given that another variable takes on a specific value: $E(Y \mid X = x) = \beta_0 + \beta_1 x$, estimated here by $7.3 + 0.3x$. We condition our expected value for Transfer Time on the file size.

2. Use the appropriate parts of the JMP output to discuss and illustrate how total variation relates to statistical models. (3 Points)

The statistical model divides total variation into a part that is explained by the model and a part that is not. The total variation to be explained is the sum of the squared deviations from the mean,

$\sum_{i=1}^{n} (Y_i - \bar{Y})^2$.

This is the variation around the unconditional mean. The unexplained variation is the sum of the squared residuals,

$\sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2 = \sum_{i=1}^{n} e_i^2$.

This is the variation around the conditional mean. The square root of the average of this sum (the sum divided by its $n - 2$ degrees of freedom) is the Root Mean Squared Error (RMSE), or the standard error of the regression, which is 6.24. The R-Squared tells us the percentage of total variation that is explained by the statistical model. In this case, the statistical model explains approximately 62% of the variation. It follows that about 38% of the variation around the unconditional mean is left unexplained by the model.

3. Evaluate and explain the five conditions for the simple regression model (SRM). (5 Points)

a) Linear: The scatterplot suggests that the data conform to a straight line.
b) No lurking variables: In a simple regression model, there are almost always lurking variables. In this case, the time of day, the type of file, or a number of other possible explanatory variables might also affect transfer time.
c) Equal variance: The residuals for small, medium, and large file sizes appear to be approximately the same distance, on average, from the regression line.
d) Independence: In order to gauge independence, we need a scatterplot of the residuals.
If the scatterplot has no patterns and appears random, then we have independence among the observations. Because we don’t have a scatterplot, we can’t judge whether we have independence or not. Sorry about the omission.
e) Normal: The normal-quantile plot shows that most residuals lie along the straight line within the confidence band, with some deviations from the line in the middle of the graph. The skewness coefficient and the coefficient of excess aren’t extremely large. The Shapiro-Wilk test fails to reject the proposition that the errors are normally distributed. We can therefore continue under the assumption that the residuals are normally distributed, but because there is some concern about normality we should proceed cautiously.

4. Use the information about the slope coefficient to explain:

a) How the t-ratio is calculated and interpreted. (1 Point)
The t-ratio is calculated by dividing the estimated slope parameter by its standard error: $t = \hat{\beta}_1 / \mathrm{se}(\hat{\beta}_1) = 0.31/0.03 \approx 11.4$. (The rounded figures shown here give about 10.3; the reported 11.4 presumably comes from the unrounded estimates in the output.)
b) Specify the hypothesis that is being tested. (1 Point)
$H_0: \beta_1 = 0$
c) Explain the results of the hypothesis test. (2 Points)
The estimated slope is 11.4 standard errors from the hypothesized value of zero. This gives the very low p-value reported in the output. The low p-value means that we should reject the null hypothesis.

5. Interpret the intercept and slope coefficients. Your answer should include a discussion of the units in which they are measured. (2 Points)

The intercept term might be interpreted as the computer’s setup time before the actual file transfer begins. This means that no matter what the file size, the computer takes approximately 7 time units before the file transfer starts. The units of the intercept term are time units. The slope coefficient of 0.31 means that increasing the file size by 1 megabyte increases the file transfer time by 0.31 time units.
Since the time units aren’t specified, we can only say that the units of the slope coefficient are time units per megabyte.

6. Use the regression output to define and discuss confidence intervals versus the prediction intervals for the regression line. (3 Points)

As mentioned, the statistical model decomposes total variation into a part explained by the model and a part left unexplained. The unexplained part, the residuals, is therefore one source of uncertainty. The other sources of uncertainty arise because the intercept and slope have been estimated, and both have standard errors.

The statistical model is a conditional expectation, or mean. If we transfer multiple files of the same size and average the transfer times, that average corresponds to the conditional mean for that file size. The confidence interval reflects the uncertainty about this average for a given file size. The standard errors of the intercept and slope contribute to this uncertainty. Because we are taking an average, the residuals cancel each other out, so the standard error of the regression isn’t included in this calculation.

If we don’t have the benefit of multiple file transfers and are predicting only a single transfer of one specific file of a given size, then the residuals can’t cancel each other out and more uncertainty exists. The standard errors of the intercept, slope, and regression are then all part of the uncertainty summarized in the prediction interval. Adding the standard error of the regression to the standard errors of the slope and intercept makes the prediction interval wider than the confidence interval. The prediction interval accounts for all variation, both explained and unexplained.

7. Explain how you would use the estimated model to predict file transfer time.
(1 Point)

We use the estimated equation and substitute in different file sizes to predict how long it will take to transfer each file:

$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i = 7.3 + 0.3 X_i$

For example, a 10-megabyte file should take 10.3 time units to transfer: $10.3 = 7.3 + 0.3(10)$.
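The substitution described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the rounded coefficients quoted in the answer (intercept 7.3, slope 0.3) rather than the unrounded JMP estimates, and the function name `predict_transfer_time` is hypothetical.

```python
# Rounded estimates quoted in the answer above (an assumption of this sketch);
# the JMP output holds more decimal places, so results are approximate.
INTERCEPT = 7.3  # time units, interpreted as setup time
SLOPE = 0.3      # time units per megabyte


def predict_transfer_time(file_size_mb: float) -> float:
    """Return the fitted transfer time (in time units) for one file size."""
    return INTERCEPT + SLOPE * file_size_mb


if __name__ == "__main__":
    # 10 MB reproduces the worked example; the other sizes lie inside
    # the 20 to 100 MB range covered by the test transfers.
    for size in (10, 20, 50, 100):
        print(f"{size:>3} MB -> {predict_transfer_time(size):.1f} time units")
```

The tested files ranged from 20 to 100 MB, so the 10 MB prediction extrapolates below the observed data and deserves extra caution.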