Statistic exam 2013/2014
Statistic Synopsis – International Business & Politics 2013
Peter Dalgaard and Carmine Gioia
Esben Linnet Burkandt, Mads Saabye Jørgensen, Valdemar Gaarn Rasmussen, Sidsel Green Pedersen, Sheryne Hafez

Question 1. Give a brief description of the variables in the data set. Pay particular attention to the logSize variable. Notice that this variable, originally given in square feet, has been transformed to logarithmic scale (base 10 logarithm). Report the minimum and maximum store size in square feet and square meter.

A variable is any characteristic that is recorded for a study. The data set is obtained from Enterprise Surveys (http://enterprisesurveys.org), The World Bank. The variables are store type, city, size of store in square feet (on a log-10 scale), level of competition on a scale from 1 to 4, perception, efficiency 3 years ago (log-10 scale), efficiency today (log-10 scale), total sales 3 years ago (log-10 scale), total sales today (log-10 scale) and whether the store has a computer or not.

The variables store type, city, perception and computer each assign an observation to a category. Thus they are categorical variables. The variables store size, competition, efficiency 3 years ago, efficiency today, total sales 3 years ago and total sales today are numerical, and thus we can describe them as quantitative variables. Efficiency today and efficiency 3 years ago are dependent (paired) samples, as each observation in one sample has a matched observation in the other. The same goes for total sales 3 years ago and total sales today.

The two research questions are “Competition and labor productivity in India’s retail stores” and “Are labor regulations driving computer usage in India’s retail stores?”. In the first question the main explanatory variable is competition, and the main response variable is labor productivity.
In the second question labor regulations are the main explanatory variable, and computer usage is the main response variable.

Given logSize (size of store in square feet on a log-10 scale), we recover the minimum and maximum values in square feet as 10^logSize and convert them to square meters by multiplying by 0,0929:

Minimum square feet: 12
Maximum square feet: 14.997
Minimum square meters: 12 * 0,0929 = 1,1
Maximum square meters: 14.997 * 0,0929 = 1.393

Descriptive statistics for Computer

Computer is a categorical, binary variable and is illustrated in a bar chart. We code it as a dummy variable, where 0 and 1 respectively describe the absence or presence of a computer in the retail store. The bar chart shows that 86 % of the stores in the sample do not have a computer, whereas 14 % do.

Computer usage   Count   Probability
No (0)           340     0,85642
Yes (1)          57      0,14358
Total            397     1,0000

Descriptive statistics for LogSize (store size)

As a quantitative variable, store size can be described with a histogram. The histogram suggests that store size on the log scale is approximately normally distributed, with a mean of 2,1982 on the log-10 scale (157,8 square feet) and a median of 2,176 on the log-10 scale (150 square feet). The difference between the mean and the median is only 157,8 - 150 = 7,8 square feet, so the two match closely, which indicates an approximately symmetric distribution. This suggests that the distribution of store size (on the log scale) is approximately normal.

Question 2. Compare Efficiency between the store types. First make a two-group comparison between the two largest groups (Traditional and Consumer Durable) and then a comparison between all three groups.

Efficiency is the quantitative response variable and store type is the categorical explanatory variable.
First we test whether efficiency depends on store type, and then we produce a 95 % confidence interval for the difference between the two efficiency means.

Two-sided significance test

We now perform a two-sided significance test of whether efficiency depends on store type.

1) Assumptions:
- Quantitative response variable for two groups
- Independent, random samples
- Approximately normal sampling distribution for each group mean, given the central limit theorem, as we have large sample sizes

2) Hypotheses:
The null hypothesis: efficiency does not depend on store type, so there is no difference between the efficiency means. H0: (μ1 - μ2) = 0
The alternative hypothesis: efficiency does depend on store type, so the means of the two groups differ. Ha: (μ1 - μ2) ≠ 0

3) Test statistic: $t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{se}$

4) P-value = 0,001
The P-value is the probability, assuming the null hypothesis is true, that the test statistic equals the observed value or a value even more extreme. To get the P-value we use the t distribution with degrees of freedom = 103.

5) Conclusion: Since the P-value is below the significance level of 0,05, we can reject the null hypothesis, which supports the claim that efficiency differs between the two store types. We can thus investigate the size of the difference through a 95 % confidence interval for the difference between the means.

Assumptions for a 95 % confidence interval
- Independent random samples
- Quantitative response variable for two groups
- Approximately normal distribution for both groups (given by the central limit theorem)

Confidence interval

The formula for the 95 % confidence interval for the difference between two population means (Traditional Fmcg and Consumer Durable stores) is

$(\bar{x}_1 - \bar{x}_2) \pm t_{.025}(se)$

The value of $t_{.025}$ is taken from the t table as 1,960. This is the large-sample value; with DF = 103 the exact table value is slightly larger, about 1,98.
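The interval formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the standard error is taken to be the unpooled one, sqrt(s1²/n1 + s2²/n2), the function name `two_sample_ci` is ours, and the summary statistics passed in are hypothetical placeholders rather than values from the data set.

```python
import math

def two_sample_ci(mean1, sd1, n1, mean2, sd2, n2, t_crit=1.960):
    """Confidence interval for mean1 - mean2 from summary statistics.

    Assumes the unpooled standard error sqrt(sd1^2/n1 + sd2^2/n2);
    t_crit is the table value t.025 (1.960 as quoted in the text).
    """
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    diff = mean1 - mean2
    return diff - t_crit * se, diff + t_crit * se

# Hypothetical summary statistics, NOT taken from the data set.
lower, upper = two_sample_ci(mean1=4.0, sd1=0.50, n1=200,
                             mean2=4.4, sd2=0.45, n2=60)

# An interval that excludes zero indicates a difference between the means.
excludes_zero = lower > 0 or upper < 0
print(f"95% CI: ({lower:.4f}, {upper:.4f}); excludes zero: {excludes_zero}")
```

With real data, the group means, standard deviations and sample sizes from the software output would replace the placeholder arguments.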
The computed confidence interval, from -0,51811 to -0,27553, does not contain zero. We can therefore be 95 % confident that the mean efficiency of Traditional Fmcg stores is between 0,27553 and 0,51811 lower than that of Consumer Durable stores. (For further summary statistics see appendix, question 2.)

We extend our analysis to compare all three groups of store types. This is done with a one-way ANOVA test, since we are considering only one factor that can impact efficiency, namely store type.

ANOVA - comparing several means

1. Assumptions
- The population distributions of the response variable for the groups are normal, with the same standard deviation for each group.
- Randomization: in a survey sample, independent random samples are selected from each of the g populations.
- For an experiment, subjects are randomly assigned separately to the g groups.

2. Hypotheses
The null hypothesis states that the population means of the three store types (Traditional Fmcg, Consumer Durable stores and Modern Format stores) are equal: H0: μ1 = μ2 = μ3.
The alternative hypothesis states that at least two of the population means are unequal, in this case among Traditional Fmcg, Consumer Durable stores and Modern Format stores. Ha: at least two of the population means are different.

3. Test statistic: F = 29,23
The F sampling distribution has DF1 = g - 1 = 3 - 1 = 2 and DF2 = N - g = 389 - 3 = 386.

4. P-value < 0,0001

5. Conclusion
In the computation done in SAS JMP, the F ratio is the ratio of the two mean squares: F = 5,93626/0,20309 = 29,2303. This F ratio gives a P-value < 0,0001. Since the P-value is smaller than our significance level of 0,05, there is strong evidence against the null hypothesis.

We found a small P-value in our F test, but the test does not specify which means are different or how different they are. We therefore estimate confidence intervals comparing pairs of means for all three store types. For two groups, e.g.
Traditional Fmcg and Modern Format stores, with sample means $\bar{y}_1$ and $\bar{y}_2$ and sample sizes $n_1$ and $n_2$, the 95 % confidence interval computed by SAS JMP is given by the following formula:

$\bar{y}_1 - \bar{y}_2 \pm t_{.025} \, s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$

We infer that the efficiency for Consumer Durable stores is between 0,2769 and 0,5166 higher than the efficiency in Traditional Fmcg stores. Since the confidence interval contains only positive numbers, this suggests that the difference between the two population means is greater than 0. Further, the comparison between Modern Format stores and Traditional Fmcg shows that the efficiency for Modern Format stores is between 0,2251 and 0,5155 higher than the efficiency in Traditional Fmcg stores.

In the last comparison, between Consumer Durable stores and Modern Format stores, we have a confidence interval from -0,1461 to 0,1991. Because the confidence interval contains 0, there is not enough evidence to conclude that a difference exists.

In line with our alternative hypothesis, which states that at least two of our population means are different, we conclude that the mean for Traditional Fmcg stores differs significantly from the means of the other two store types.

Question 3. Similarly, compare the probability of computer use as recorded in the Computer variable. Again, both do a two-group comparison and the full three-group comparison. (store types)

Two-group comparison - Traditional and Consumer Durable:

When comparing the proportion of computer usage for different store types, the explanatory variable is store type, and the response variable is computer usage. The response variable is now categorical, as opposed to question 2. First we do a two-sided significance test to find out whether there is a difference in computer usage between the two store types; then we calculate a confidence interval to estimate, with 95 % confidence, how big the difference is. For the full three-group comparison we use a chi-squared test.

TWO-SIDED SIGNIFICANCE TEST

A two-sided significance test is done through five steps.
Assumptions:
- A categorical response variable for two groups
- Independent random samples
- n1 and n2 are large enough that there are at least five successes and five failures in each group

Hypotheses:
H0: p1 = p2, meaning that computer usage is alike across the two store types.
Ha: p1 ≠ p2, meaning that computer usage differs according to store type.

Test statistic:
$z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{se_0}$, with $se_0 = \sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$, where $\hat{p}$ is the pooled estimate, i.e. the total number of stores using computers relative to all stores in the two groups.

P-value: P = 0.0001

Conclusion: The P-value obtained is far below the significance level of 0,05, giving strong evidence against the null hypothesis. This supports the alternative hypothesis, so we can conclude that there is an association between computer usage and store type.

CONFIDENCE INTERVAL

A 95 % confidence interval for the difference in computer usage between two population proportions (Traditional and Consumer Durable stores) is calculated as follows:

$(\hat{p}_1 - \hat{p}_2) \pm z \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$, where z = 1.96

We use $\hat{p}$ instead of p, as we are dealing with sample proportions, which serve as estimates of the population proportions.

Assumptions:
- A categorical response variable for two groups
- Independent random samples for the two groups, either from random sampling or a randomized experiment
- Large enough samples that there are at least 10 successes and 10 failures

Calculating the 95 % confidence interval: The upper and lower limits of the 95 % confidence interval have been computed through software, due to the higher precision of those calculations.

Conclusion: Looking at the lower and upper confidence limits, we infer that computer usage is approximately between 12,69 and 33,68 percentage points higher for Consumer Durable stores than for Traditional stores.

Chi-squared test statistic:

The chi-squared test statistic compares the observed cell counts to the expected cell counts, testing the independence of two categorical variables.
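As a rough illustration of how observed and expected counts are compared, the Pearson statistic can be computed by hand in Python. The cell counts below are an assumption: they are reconstructed from the percentages with a computer reported further down (26,76 %, 62,79 % and 3,89 %) together with the totals of 57 computer users among 397 stores, and are not copied from the JMP output. The function name `pearson_chi_squared` is ours.

```python
# Observed counts (with computer, without computer) per store type.
# ASSUMPTION: reconstructed from the reported percentages and the totals
# (57 of 397 stores with a computer); not copied from the JMP output.
observed = {
    "Consumer Durable": (19, 52),
    "Modern Format": (27, 16),
    "Traditional Fmcg": (11, 272),
}

def pearson_chi_squared(table):
    """Return the Pearson chi-squared statistic and its degrees of freedom."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(row_totals)
    chi2 = 0.0
    for r, row in enumerate(rows):
        for c, obs in enumerate(row):
            # Expected count under independence: row total * column total / n
            expected = row_totals[r] * col_totals[c] / n
            chi2 += (obs - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return chi2, df

chi2, df = pearson_chi_squared(observed)
print(f"chi-squared = {chi2:.3f}, DF = {df}")
```

Under this reconstruction the statistic agrees, up to rounding, with the Pearson value of 116,147 reported from JMP, with DF = (3-1)(2-1) = 2.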
The test compares the cell counts in the contingency table with the counts we would expect to see if the null hypothesis of independence were true. The chi-squared test gives the three-group comparison between all three store types: Traditional, Consumer Durable and Modern Format stores. It is also carried out in five steps.

Expected cell count: (row total) * (column total) / (total sample size)

1) Assumptions:
- Categorical response variable for three groups
- Independent random samples
- Large enough sample sizes, so that there are at least 5 “successes” and 5 “failures” in each group

2) Hypotheses:
H0: Store type and computer are independent
Ha: Store type and computer are dependent

3) Chi-squared test statistic:

$\chi^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$

DF = (r-1)(c-1) = (3-1)(2-1) = 2

Test               Chi Square   Prob>ChiSq
Likelihood Ratio   94,394       < 0,001
Pearson            116,147      < 0,001

4) P-value: With DF = 2 we get a chi-squared statistic of 116,147. Thus we get a P-value of < 0,001.

5) Conclusion: With a P-value < 0,001 we can reject the null hypothesis. Due to the small P-value, we infer that the variables store type and computer are associated.

Strength of association:

Analyzing the strength of the association lets us determine whether the association between store type and computer is significant and important, or significant but weak and thus of little use in our analysis.

Percentage with computer:
Consumer Durable stores: 26,76
Modern Format stores: 62,79
Traditional stores: 3,89

Conclusion: As none of the proportions are numerically close to each other, each pairwise difference between proportions is far from 0. Thus we can conclude that there is a strong, significant association between store type and computer.

Question 4. Compare the variables Efficiency and Efficiency3yr. Has there been a significant increase in efficiency? Give a confidence interval for the average increase.
The variables Efficiency and Efficiency3yr are dependent samples, meaning that each observation in one sample has a matched observation in the other sample. For dependent samples the mean of the differences equals the difference of the means, so the difference ($\bar{x}_1 - \bar{x}_2$) between the means of the two samples equals the mean $\bar{x}_d$ of the difference scores. We thus construct a new variable, d, which is the difference between the two efficiency measurements for each store:

$\bar{x}_d = \bar{x}_1 - \bar{x}_2 = 0,7230655$

$se = \frac{s_d}{\sqrt{n}} = \frac{1,8101355}{\sqrt{397}} = 0,0908481$

To compare means with dependent samples, we construct a confidence interval and do a significance test using the single sample of difference scores. The 95 % confidence interval and the test statistic are the same as for a single sample.

Significance test:

1) Assumptions:
- Quantitative response variable
- Random sample of difference scores
- Approximately normal distribution of the difference scores (mostly relevant for small sample sizes)

2) Hypotheses:
Null hypothesis: There is no difference between efficiency today and efficiency 3 years ago. H0: μ1 - μ2 = 0
Alternative hypothesis: The stores are more efficient today than three years ago. Ha: μ1 - μ2 > 0

3) Test statistic:

$t = \frac{\bar{x}_d - 0}{s_d / \sqrt{n}} = 7,9591$

4) P-value = 0,0001, as reported for our test statistic in JMP

5) Conclusion: The P-value tells us that there is a very small probability of observing a t-value of 7,9591 or more extreme if the null hypothesis were true. Thus we can reject the null hypothesis and conclude that there has been a significant increase in efficiency.

Computation of 95 % confidence interval

The 95 % confidence interval is reported by SAS JMP. It is calculated by the following formula:

Sample mean difference $\pm \, t_{.025}(se)$ = 0,7230655 ± 1,960 * 0,0908481 = (0,545003; 0,901128)

The critical value t = 1,960 is found in the t table with DF = n - 1 = 397 - 1 = 396.
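The paired-sample arithmetic can be re-checked with a short Python sketch. It uses only the figures quoted in the text (mean difference, standard deviation of the differences, n and the table value 1,960); the variable names are ours.

```python
import math

mean_diff = 0.7230655   # mean of the difference scores (from the text)
sd_diff = 1.8101355     # standard deviation of the difference scores
n = 397                 # number of paired observations
t_crit = 1.960          # table value t.025 used in the text

se = sd_diff / math.sqrt(n)        # standard error of the mean difference
t_stat = (mean_diff - 0) / se      # test statistic under H0: no change
lower = mean_diff - t_crit * se
upper = mean_diff + t_crit * se

print(f"se = {se:.7f}, t = {t_stat:.4f}")
print(f"95% CI: ({lower:.6f}, {upper:.6f})")
```

Software such as JMP uses the exact t quantile for DF = 396 rather than the rounded table value, which is why its reported limits differ slightly in the third decimal.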
The standard error is reported by SAS JMP and calculated by the following formula: $se = \frac{s_d}{\sqrt{n}} = \frac{1,8101355}{\sqrt{397}} = 0,0908481$. We use the interval reported by SAS JMP, since it is calculated more precisely. In conclusion, we can be 95 % confident that efficiency has increased by between 0,544461 and 0,90167 units within the last 3 years.

Question 5. Fit a simple linear regression in which Efficiency is described by logSize. Compute a 95% confidence interval for the slope of the regression and interpret the result. Discuss possible violations of the model assumptions.

See appendix 5.

We fit a simple linear regression where Efficiency is the response variable and logSize the explanatory variable.

r² = 0,11 (11 %). This value shows that the model has 11 % less error than ȳ in predicting efficiency.

1. Assumptions:
1) The population means of y at different values of x have a straight-line relationship with x, that is $\mu_y = \alpha + \beta x$, estimated by $\hat{y} = a + bx$.
2) The data are gathered using randomization, such as random sampling or a randomized experiment.
3) The population values of y at each value of x follow a normal distribution, with the same standard deviation at each x value.

The randomization is described in the report from the World Bank, and the normal distributions for logSize and Efficiency respectively are shown in Questions 1 and 4. The first assumption, regarding linearity, can be questioned, as $r = \sqrt{r^2} = \sqrt{0,114} \approx 0,34$, which does not indicate a strong correlation. The simple linear regression model is thus not particularly good in this case; we will however use it as part of our analysis. Later on we strengthen the model by including more variables in a multiple regression analysis, as seen in question 6.

2. Hypotheses: $H_0: b = 0$, $H_a: b \neq 0$

3. Test statistic: From the values reported by SAS JMP, a b coefficient of 0,3390092 and a standard error of 0,048042, we calculate our t score as $t = \frac{b - 0}{se} = \frac{0,3390092}{0,048042} \approx 7,06$.

4. P-value = 0,0001.

5. Conclusion: For our test statistic, JMP reports a P-value of 0,0001.
Thus we can reject our H0 and state that there is a relationship between logSize and efficiency.

Computation of 95 % confidence interval

The t table gives a critical value of 1,960 with DF = 387 and a confidence level of 95 %. We construct the 95 % confidence interval by the following formula:

$b \pm t_{.025}(se)$ = 0,3390092 ± 1,960 * 0,048042 ≈ (0,2445541; 0,4334643)

Thus we can be 95 % confident that the slope falls within this interval. On average, efficiency increases by between 0,2445541 and 0,4334643 for each 1-unit increase in logSize, i.e. for each 10-fold increase in store size in square feet.

Question 6. Extend the linear regression to a multiple linear regression by further including StoreType, City, and Competition.

There are several explanatory variables, such as store type, city, competition and store size, that may have an impact on a store's efficiency. We combine these variables in a multiple regression model, where the idea is that more than one explanatory variable predicts the response variable. Our R² increases from 0,114 to 0,260 when we add the extra variables. This implies a greater reduction in error, when predicting y from the explanatory variables rather than from ȳ alone, than when we only looked at store size in the previous question.

(a) Fit an additive model to data and explain the most important parts of the output.

In the effect test reported by SAS JMP, the overall P-value for the City variable is 0,5197, which is above our significance level of 0,05. We exclude the City variable from our analysis, since it is not statistically significant. By excluding City, our R² decreases from 0,26 to 0,25. We consider this reduction negligible. (See appendix, question 6 for summary statistics.)

The multiple regression equation is set up as $\mu_y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$, etc. SAS JMP reports the following values: The y-intercept, α = 3,22611011872101.
Including all remaining variables, the multiple regression equation uses x1 = Competition, x2 = logSize, x3 = Consumer Durable stores, x4 = Modern Format stores and x5 = Traditional Fmcg stores. Each β value describes how the predicted efficiency changes with the corresponding x, holding the other variables fixed. From the output we can assess the fit of the model through the squared correlation, R², which is 0,25. This means that the multiple regression equation has 25 % less error than ȳ.

(b) Check the model assumptions. Pay attention to possibly nonlinear effects and interactions. Extend the model if required.

Assumptions of the model:
1) Each explanatory variable has a straight-line relation with $\mu_y$, with the same slope for all combinations of values of the other predictors in the model
2) Data gathered with randomization
3) Normal distribution for y with the same standard deviation at each combination of values of the predictors in the model

1) Linearity: We check for nonlinear effects by plotting the residuals against the different explanatory variables. When looking at Competition and Store Type we see that neither of these, when squared, shows any sign of a 'banana' shape, nor is there any obvious change in variation as the x-value increases. LogSize does display some change in variation, as the variance appears larger for small and large values of x. This does not invalidate the use of multiple regression. We must however be critical towards inferences about efficiency based on logSize.

2) Data gathered with randomization: The data are a cross section of 1948 stores spread over 16 states and 41 cities of India. They are collected by the World Bank, and we assume that randomization has been used so that a statistical analysis is valid.

3) Normal distribution: As can be seen on the graph of the bivariate fit of studentized residuals, the residuals appear normally distributed with a constant standard deviation. The residuals fall within 1 standard deviation.
Check for interaction (see appendix, question 6):

For our additive multiple regression model to be valid, we want to check that there is no interaction between the explanatory variables. No interaction means that the effect of either factor on the response variable is the same at each category of the other factor.

The graph (see appendix) displays an interaction between logSize and StoreType. This is confirmed by the P-value for StoreType*logSize, which is < 0,001 and thus gives strong evidence of interaction between the two variables. The effect of Competition on Efficiency, however, seems to be independent of both StoreType and logSize.

Lurking variables for efficiency could be Perception or Computer. However, in the effect test in JMP both have a P-value above our significance level of 0,05. This implies that we cannot infer an effect of either Computer or Perception on efficiency, and we will thus not extend our model any further.

(c) Discuss the statistical significance of the predictors, and state a 95% confidence interval for the effect of Competition.

2. Hypotheses: The null hypothesis is H0: β1 = 0. Since there is no prior prediction about whether the effect of competition is positive or negative (for fixed values of x2 and x3), we use the two-sided significance test with the alternative hypothesis Ha: β1 ≠ 0.

3. Test statistic: The parameter estimates calculated in SAS JMP give a slope estimate of 0.5604006 for competition and a standard error of se = 0.095327. JMP also reports the t test statistic, t = 0.5604006/0.095327 = 5.88.

4. P-value: The parameter estimates calculated in SAS JMP give a P-value = 0.0001. This is the two-tailed probability of a t statistic above 5.88 or below -5.88 if H0: β1 = 0 were true.

5. Conclusion: The P-value of 0.0001 makes it possible for us to reject our null hypothesis that β1 = 0 at the common significance level of 0.05. Competition does have a significant effect on efficiency.

The significance test only tells us that the effect of competition differs from zero; it does not tell us how large the effect is. We therefore construct a confidence interval to estimate the effect competition has on efficiency.

We consider the multiple regression analysis of y = efficiency with predictors x1 = competition, x2 = logSize and x3 = store type. We now find and interpret a 95 % confidence interval for β1, the effect of competition while controlling for logSize and store type. From the parameter estimates calculated in SAS JMP, b1 = 0.5604006 with se = 0.095327. The confidence interval equals:

$b_1 \pm t_{.025}(se)$ = 0.5604006 ± 1.96 * 0.095327 ≈ (0.37; 0.75)

At fixed values of x2 and x3, we infer with 95 % confidence that the level of efficiency increases by between 0,37 and 0,75 each time competition increases by 1 unit.

Question 7. Fit a logistic regression model predicting Computer from logSize. Compute the odds ratio corresponding to a 10-fold increase in size, and give a 95% confidence intervals for the odds ratio.

JMP computes a logistic regression model for predicting Computer from logSize. The model shows that as the logSize of a store increases, so does the chance of computer usage. To find the odds ratio of computer usage for a 10-fold increase in size we need a starting point. Therefore we first calculate the probability of having a computer for an average-sized store. Further, we need to know α and β. JMP provides these numbers:

Mean logSize: x̄ = 2.198
α = -9.83
β = 3.35

Computer is a binary response variable, allowing us to use the following equations to predict the probability of computer usage for an average-sized store:

$\ln\left(\frac{p}{1 - p}\right) = \alpha + \beta x$, so that $p = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$

$P_1 = \frac{e^{-9,83 + 3,35 \cdot 2.198}}{1 + e^{-9,83 + 3,35 \cdot 2.198}} = 0,078$ = 7,8 % chance of computer usage for an average-sized store

To calculate the chance of computer usage after a 10-fold increase in size we follow the same procedure.
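The probability calculation can be sketched in Python as follows. The coefficients and the mean logSize are the values quoted from JMP above; the function names `probability` and `odds` are ours, and a store ten times as large is represented by adding 1 to logSize.

```python
import math

alpha = -9.83          # intercept reported in the text
beta = 3.35            # slope for logSize reported in the text
mean_log_size = 2.198  # mean logSize reported in the text

def probability(log_size):
    """Predicted probability of computer use from the logistic model."""
    z = alpha + beta * log_size
    return math.exp(z) / (1 + math.exp(z))

def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

p1 = probability(mean_log_size)       # average-sized store
p2 = probability(mean_log_size + 1)   # store ten times as large
odds_ratio = odds(p2) / odds(p1)      # equals exp(beta) for a 1-unit step

print(f"p1 = {p1:.3f}, p2 = {p2:.3f}, odds ratio = {odds_ratio:.2f}")
```

Because the model is linear in the log odds, the odds ratio for a 1-unit step in logSize does not depend on the starting point and is simply exp(β), about 28.5 here; a ratio computed from rounded probabilities can differ slightly in the second decimal.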
A 10-fold increase in size corresponds to a logSize of 2.198 + 1 = 3.198, due to the base 10 logarithm.

$P_2 = \frac{e^{-9,83 + 3,35 \cdot 3.198}}{1 + e^{-9,83 + 3,35 \cdot 3.198}} = 0,707$ = 70,7 % chance of computer usage after a 10-fold increase in size

To infer further from these results we calculate the relative odds (odds ratio). This lets us compare the odds of having a computer in a store with logSize 3.198 with those of a store with logSize 2.198:

Odds ratio = $\frac{P_2 / (1 - P_2)}{P_1 / (1 - P_1)} = \frac{0,707 / 0,293}{0,078 / 0,922} \approx 28.52$

From this it can be concluded that the odds of using a computer are 28.52 times as high for a store with logSize 3.198 as for a store with logSize 2.198. This indicates that there is an association between the size of a store and whether or not it uses computers.

The 95 % confidence interval for the odds ratio ranges from 12.64 to 70.99. Hence, the odds of having a computer are between 12.64 and 70.99 times as high in a store with logSize 3.198. This confirms with 95 % confidence that there is a positive association between logSize and computer usage, since the whole interval lies above 1 and thus does not contain 1.

(b) Extend the model to a multiple logistic regression using the predictors logSize, City, StoreType, and Perception.

According to the parameter estimates reported by SAS JMP we can exclude the City variable, due to its high P-value, which makes it statistically insignificant. Our model is now reduced and consists only of the remaining variables with a P-value below 0.05. These are logSize, StoreType and Perception, which are statistically significant in predicting computer use. The computation by SAS JMP is reported below.

Knowing the significant explanatory variables, we can now report the prediction equation for the response variable computer usage, with x1 being logSize, x2 being Perception, x3 being Consumer Durable stores, x4 being Modern Format stores and x5 being Traditional Fmcg stores.
The equation states the probability of computer usage for the chosen values of x for the explanatory variables.

(c) Describe the effect of Perception, both in terms of statistical significance and in real-world terms (i.e., if there is an effect, what does it mean?).

We already described in part (b) that Perception is statistically significant, when we kept it as an explanatory variable due to its P-value < 0.05. Specifically, the Perception P-value = 0,0115 tells us that, if Perception had no effect, there would be only a 1,15 % chance of observing an estimate at least as extreme as ours, which is evidence of an association. This can be supported further. Looking at the data below there is an obvious tendency that as Perception grows the proportion of Computer Usage decreases.

This association should be understood, in real-world terms, as follows. Stores perceiving labor regulations as a problem tend to replace labor with computers in order to avoid the issues related to labor. In the data we are given the proportion of stores in each city that regard labor regulations as a problem. Naturally this means that in a city where a large share of stores view labor regulations as a problem, the proportion of computer usage will be higher than in the opposite case.