Examiners’ commentaries 2016

ST104a Statistics 1

Important note

This commentary reflects the examination and assessment arrangements for this course in the academic year 2015–16. The format and structure of the examination may change in future years, and any such changes will be publicised on the virtual learning environment (VLE).

Information about the subject guide and the Essential reading references

Unless otherwise stated, all cross-references will be to the latest version of the subject guide (2014). You should always attempt to use the most recent edition of any Essential reading textbook, even if the commentary and/or online reading list and/or subject guide refer to an earlier edition. If different editions of Essential reading are listed, please check the VLE for reading supplements – if none are available, please use the contents list and index of the new edition to find the relevant section.

General remarks

Learning outcomes

At the end of the course and having completed the Essential reading and activities you should:

• be familiar with the key ideas of statistics that are accessible to a student with a moderate mathematical competence
• be able to routinely apply a variety of methods for explaining, summarising and presenting data and interpreting results clearly using appropriate diagrams, titles and labels when required
• be able to summarise the ideas of randomness and variability, and the way in which these link to probability theory to allow the systematic and logical collection of statistical techniques of great practical importance in many applied areas
• have a grounding in probability theory and some grasp of the most common statistical methods
• be able to perform inference to test the significance of common measures such as means and proportions and conduct chi-square tests of contingency tables
• be able to use simple linear regression and correlation analysis and know when it is appropriate to do so.
Planning your time in the examination

You have two hours to complete this paper, which is in two parts. The first part, Section A, is compulsory; it covers several subquestions and accounts for 50 per cent of the total marks. Section B contains three questions, each worth 25 per cent, from which you are asked to choose two.

Remember that each of the Section B questions is likely to cover more than one topic. In 2016, for example, the first part of Question 2 asked for a chi-squared test and survey design problems appeared in the second part. Question 3 had a series of questions involving drawing diagrams (such as histograms), hypothesis testing (in particular paired sample t tests) and confidence intervals. The first part of Question 4 was on linear regression and involved drawing a diagram, while the second part was a hypothesis test comparing population means using the sample data given. This means that it is really important that you make sure you have a reasonable idea of what topics are covered before you start work on the paper!

We suggest you divide your time as follows during the examination.

• Spend the first 10 minutes annotating the paper. Note the topics covered in each question and subquestion.
• Allow yourself 45 minutes for Section A. Do not allow yourself to get stuck on any one question, but do not just give up after two minutes!
• Once you have chosen your two Section B questions, give them about 25 minutes each.
• This leaves you with 15 minutes. Do not leave the examination hall at this point! Check over any questions you may not have completely finished. Make sure you have labelled and given a title to any tables or diagrams which were required and, if you did more than the two questions required in Section B, decide which one to delete. Remember that only two of your answers will be given credit in Section B and that you must choose which these are!

What are the examiners looking for?
The examiners are looking for very simple demonstrations from you. They want to be sure that you:

• have covered the syllabus as described and explained in the subject guide
• know the basic formulae given there and when and how to use them
• understand and answer the questions set.

You are not expected to write long essays where explanations or descriptions of sample design are required, and note-form answers are acceptable. However, clear and accurate language, both mathematical and written, is expected and marked. The explanations below and in the specific Examiners’ commentaries for the papers for each zone should make these requirements clear.

Key steps to improvement

The most important thing you can do is answer the question set! This may sound very simple, but these are some of the things that candidates did not do, though asked, in the 2016 examinations! Remember the following.

• If you are asked to label a diagram (which is almost always the case!), please do so. Writing ‘Histogram’ or ‘Stem-and-leaf diagram’ in itself is insufficient. What do the data describe? What are the units? What are the x-axis and y-axis?
• If you are specifically asked to carry out a hypothesis test, or to calculate a confidence interval, do so. It is not acceptable to do one rather than the other! If you are asked to find a 5% critical value, this is what will be marked.
• Do not waste time calculating things which are not required by the examiners. If you are asked to find the line of best fit, you will get no additional marks for calculating the correlation coefficient as well. If you are asked to use the confidence interval you have just calculated to comment on the results, carrying out an additional hypothesis test will not gain you marks.

How should you use the specific comments on each question given in the Examiners’ commentaries?

We hope that you find these useful.
For each question and subquestion, they give:

• further guidance for each question on the points made in the last section
• the answers, or keys to the answers, which the examiners were looking for
• the relevant detailed reference to P. Newbold, W.L. Carlson and B.M. Thorne Statistics for business and economics. (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060] and the subject guide
• where appropriate, suggested activities from the subject guide which should help you to prepare, and similar questions from Newbold (2012).

Any further references you might need are given in the part of the subject guide to which you are referred for each answer.

Memorising from the Examiners’ commentaries

It was noted recently that a small number of candidates appeared to be memorising answers from previous years’ Examiners’ commentaries, for example plots, and reproduced them exactly without looking at the current year’s examination paper questions! Note that this is very easy to spot. The Examiners’ commentaries should be used as a guide to practise on sample examination questions and it is pointless to attempt to memorise them.

Examination revision strategy

Many candidates are disappointed to find that their examination performance is poorer than they expected. This may be due to a number of reasons. The Examiners’ commentaries suggest ways of addressing common problems and improving your performance. One particular failing is ‘question spotting’, that is, confining your examination preparation to a few questions and/or topics which have come up in past papers for the course. This can have serious consequences.

We recognise that candidates may not cover all topics in the syllabus in the same depth, but you need to be aware that the examiners are free to set questions on any aspect of the syllabus. This means that you need to study enough of the syllabus to enable you to answer the required number of examination questions.
The syllabus can be found in the Course information sheet in the section of the VLE dedicated to each course. You should read the syllabus carefully and ensure that you cover sufficient material in preparation for the examination. Examiners will vary the topics and questions from year to year and may well set questions that have not appeared in past papers. Examination papers may legitimately include questions on any topic in the syllabus. So, although past papers can be helpful during your revision, you cannot assume that topics or specific questions that have come up in past examinations will occur again.

If you rely on a question-spotting strategy, it is likely you will find yourself in difficulties when you sit the examination. We strongly advise you not to adopt this strategy.

Note that in what follows the symbol • corresponds to 1 mark unless stated otherwise.
Comments on specific questions – Zone A

Candidates should answer THREE of the following FOUR questions: QUESTION 1 of Section A (50 marks) and TWO questions from Section B (25 marks each).

Section A

Answer all parts of question 1 (50 marks in total).

Question 1

(a) A random sample of the heights of buildings has a sample mean of 24.96 metres. State the units of measurements for the summaries below and justify your answers.
i. sample variance
ii. sample standard deviation.
(4 marks)

Reading for this question

This question requires knowledge regarding measures of location and spread. Hence reading of Sections 4.8 and 4.9 in the subject guide is essential and in particular Section 4.9.3. For example, candidates should gain familiarity with the sample mean, median, variance and standard deviation.

Approaching the question

The first thing to do is check the formulae for the sample variance and standard deviation. It is then not hard to note that the sample variance, \(s^2\), involves squared deviations of the observations about the sample mean:

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2. \]

The units of measurement will therefore be metres squared, m\(^2\). The formula for the standard deviation, \(s\), involves the square root of the sample variance:

\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

hence we return to the original units of measurement, i.e. metres, m.

Some candidates did not provide a justification for their choices, for example just reporting metres or metres squared. Justification was required, however, so the formulae had to be mentioned to get full marks.

(b) Suppose that \(x_1 = 8\), \(x_2 = -1\), \(x_3 = -6\), \(x_4 = 5\), \(x_5 = 0\), and \(y_1 = -7\), \(y_2 = 3\), \(y_3 = 0\), \(y_4 = 1\), \(y_5 = -3\). Calculate the following quantities:
i. \(\sum_{i=2}^{4} x_i^2\)
ii. \(\sum_{i=1}^{3} 2 x_i y_i\)
iii. \(y_5^3 + \sum_{i=3}^{4} \frac{y_i^4}{x_i}\).
(6 marks)

Reading for this question

This question refers to the basic bookwork which can be found in Section 2.9 of the subject guide, and in particular Activity A1.6.
Approaching the question

Be careful to leave the \(x_i\)s and \(y_i\)s in the order given and only cover the values of \(i\) asked for. This question was generally well done. The answers are as follows.

i. \(\sum_{i=2}^{4} x_i^2 = (-1)^2 + (-6)^2 + 5^2 = 1 + 36 + 25 = 62\).
ii. \(\sum_{i=1}^{3} 2 x_i y_i = 2 \sum_{i=1}^{3} x_i y_i = 2((8 \times -7) + (-1 \times 3) + (-6 \times 0)) = 2(-56 - 3 + 0) = -118\).
iii. \(y_5^3 + \sum_{i=3}^{4} y_i^4 / x_i = (-3)^3 + (0 + 1/5) = -27 + 0.2 = -26.8\).

(c) A population is normally distributed with a population mean of 138 and a population standard deviation of 21.
i. State the distribution of the sample mean for simple random samples of size n = 25.
ii. Given a simple random sample of size n = 25, determine the probability that the sample mean will be less than 128.
(4 marks)

Reading for this question

This section examines the ideas of the normal random variable. Read the relevant section of Chapter 6 of the subject guide and work out the examples and activities of this section. The Sample examination questions are quite relevant. For the first part of the question it is essential to check Section 6.9 of the subject guide.

Approaching the question

The first part just requires knowledge of the fact that if \(X\) is a normal random variable with mean \(\mu\) and variance \(\sigma^2\), the sample mean from a sample of size \(n\), \(\bar{X}\), is also a normal random variable with mean \(\mu\) and variance \(\sigma^2/n\). Direct application of this fact then yields that:

\[ \bar{X} \sim N\left(138, \frac{(21)^2}{25}\right) = N(138, 17.64). \]

For the second part, the basic property of the normal random variable for this question is that if \(X \sim N(\mu, \sigma^2)\), then \(Z = (X - \mu)/\sigma \sim N(0, 1)\). Note also that:

* \(P(Z < a) = P(Z \le a) = \Phi(a)\)
* \(P(Z > a) = P(Z \ge a) = 1 - P(Z \le a) = 1 - P(Z < a) = 1 - \Phi(a)\)
* \(P(a < Z < b) = P(a \le Z < b) = P(a < Z \le b) = P(a \le Z \le b) = \Phi(b) - \Phi(a)\).

The above is all you need to find the requested probability. We can write:

\[ P(\bar{X} < 128) = P\left(Z < \frac{128 - 138}{\sqrt{17.64}}\right) = P(Z < -2.38) = 1 - \Phi(2.38) = 1 - 0.99134 = 0.00866. \]
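The calculation in part (c) can be checked numerically. The following is a minimal sketch, not part of the marking scheme: it assumes Python's standard library only, and the helper name normal_cdf is a hypothetical name introduced here for illustration. It evaluates Φ via the error function rather than statistical tables.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, Phi(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Values taken from Question 1(c).
mu, sigma, n = 138.0, 21.0, 25

var_xbar = sigma**2 / n        # variance of the sample mean, sigma^2 / n
se = math.sqrt(var_xbar)       # standard deviation of the sample mean
z = (128.0 - mu) / se          # standardised value of 128
p = normal_cdf(z)              # P(X-bar < 128)

print(f"Var(X-bar) = {var_xbar:.2f}, z = {z:.2f}, P(X-bar < 128) = {p:.5f}")
```

Because the code keeps the unrounded value of z instead of rounding to −2.38 before looking up Φ, the probability it prints can differ slightly from the table-based 0.00866 in the later decimal places.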
(d) Classify each one of the following variables as measurable (continuous) or categorical. If a variable is categorical, further classify it as nominal or ordinal. Justify your answer. (Note that no marks will be awarded without justification.)
i. The weight of a cereal packet produced in a factory.
ii. The order an athlete finishes a marathon.
iii. The colour of a pair of shoes.
iv. Currency exchange rates.
(8 marks)

Reading for this question

This question requires identifying types of variables so reading the relevant section in the subject guide (Section 4.6) is essential. Candidates should gain familiarity with the notion of a variable and be able to distinguish between discrete and continuous (measurable) data. In addition to identifying whether a variable is categorical or measurable, candidates should distinguish further between ordinal and nominal categorical variables.

Approaching the question

A general tip for identifying continuous and categorical variables is to think of the possible values they can take. If these are finite and represent specific entities the variable is categorical. Otherwise, if these consist of numbers corresponding to measurements, the data are continuous and the variable is measurable. Such variables may also have measurement units or can be measured to various decimal places.

i. Measurable because the weight can be measured, for example, in grammes to several decimal places such as 499.28 g.
ii. The observations consist of the athletes finishing in a specific order (1st, 2nd etc.). It is therefore a categorical ordinal variable.
iii. Each colour (black, white, red, etc.) is a category. Also, there is no natural ordering between the colours, for example we cannot really say that ‘blue is higher than red’. This is therefore a categorical nominal variable.
iv. Measurable because exchange rates are quoted to several decimal places, for example US$1.45 to the £.
Weak candidates did not provide a justification for their choices, classified measurable variables as nominal or categorical, and sometimes answered ordinal when their justification pointed to a nominal variable. There were also phrases like ‘It is measurable because it can be measured’ that were not awarded any marks.

(e) The random variable X takes the values 0, 1 and 4 according to the following probability distribution:

x          0     1     4
p_X(x)     0.2   k     k

i. Determine the constant k.
ii. Find E(X), the expected value of X.
iii. Find Var(X), the variance of X.
(5 marks)

Reading for this question

This is a question on probability, exploring the concepts of relative frequency, conditional probability and probability distribution. Reading from Chapter 5 of the subject guide is suggested with focus on the sections on these topics. Try Activity A5.1 and the exercises on probability trees.

Approaching the question

i. \(\sum_i p(x_i) = 1\), so \(0.2 + 2k = 1\) and hence \(k = 0.4\).
ii. \(E(X) = \sum_i x_i \, p(x_i) = 0 \times 0.2 + 1 \times 0.4 + 4 \times 0.4 = 2.0\).
iii. \(E(X^2) = \sum_i x_i^2 \, p(x_i) = 0^2 \times 0.2 + 1^2 \times 0.4 + 4^2 \times 0.4 = 6.8\). Hence:

\[ \mathrm{Var}(X) = 6.8 - 2^2 = 2.8. \]

An alternative method to find the variance is through the formula \(\sum_i (x_i - \mu)^2 \, p(x_i)\), where \(\mu\) is found in part ii.

(f) An engine encounters a standard environment with a probability of 0.95, and a severe environment with a probability of 0.05. In a normal environment the probability of failure is 0.02, whereas in the severe environment this probability is 0.5.
i. What is the probability of failure?
ii. Given that failure has occurred, what is the probability that the environment encountered was severe?
(4 marks)

Reading for this question

This is a question on probability and targets mostly the material of Chapter 5 in the subject guide. It is essential to practise on such exercises through the learning activities and exercises of this chapter as well as the material on the VLE. In particular you can attempt Learning activity A5.6 and Sample examination question 5.
It is also useful to familiarise yourself with probability trees as they can be quite handy in such exercises.

Approaching the question

The first part was straightforward for candidates familiar with this section, requiring the use of the law of total probability (although it can also be calculated using common intuition). Part ii. requires knowledge of the conditional probability definition or, alternatively, knowledge of Bayes’ theorem. The working of the exercise is given below, where F denotes failure, N the normal (standard) environment and S the severe environment.

i. We have:

\[ P(F) = P(F \mid N) \, P(N) + P(F \mid S) \, P(S) = 0.02 \times 0.95 + 0.5 \times 0.05 = 0.044. \]

ii. We have:

\[ P(S \mid F) = \frac{P(F \mid S) \, P(S)}{P(F)} = \frac{0.025}{0.044} = \frac{25}{44} = 0.5682. \]

(g) A museum conducts a survey of its visitors in order to assess the popularity of a device which is used to provide information on the museum exhibits. The device will be withdrawn if fewer than 20% of all of the museum’s visitors make use of it. Of a random sample of 100 visitors, 15 chose to use the device.
i. Carry out an appropriate hypothesis test at the 5% significance level to see if the device should be withdrawn and state your conclusions.
ii. Calculate the p-value of the test.
(7 marks)

Reading for this question

This question refers to a one-sided hypothesis test examining whether the proportion of all museum visitors who use the device is less than 20%. While the entire chapter (Chapter 8 of the subject guide) on hypothesis testing is relevant, one can focus on the relevant section for a single proportion, Section 8.14. Note also that reading on one-tailed (and two-tailed) hypothesis tests is located in Section 8.10. The second part of the question looks at p-values, and the relevant section in the subject guide is Section 8.11.

Approaching the question

It is essential to identify the type of hypothesis test required for this question. Since there is only one variable involved it will have to be a test for a single proportion, and the test statistic can be found in the formula sheet.
Make sure to substitute the relevant quantities carefully and avoid any numerical errors in the calculation. The remaining steps involve finding the critical values from the corresponding statistical table for the relevant significance level, deciding whether to reject \(H_0\), and interpreting the results in the context of the problem. The working of the first part of the exercise is given below.

• \(H_0: \pi = 0.2\) vs. \(H_1: \pi < 0.2\).
• The sample proportion is \(p = 15/100 = 0.15\). The standard error of the sample proportion is \(\sqrt{0.2 \times 0.8/100} = 0.04\). The test statistic value is:

\[ t = \frac{0.15 - 0.2}{0.04} = -1.25. \]

• For \(\alpha = 0.05\), the critical value is −1.645.
• Decision: do not reject \(H_0\).
• There is no evidence that fewer than 20% of visitors make use of the device.

The second part of the question requires the use of p-values and challenged most candidates. The exercise does not require lengthy calculations and can be derived in a relatively straightforward manner if one is familiar with the material of Section 8.11 of the subject guide. Once the test statistic is calculated (\(t = -1.25\) from the first part) one simply needs to calculate, where \(Z \sim N(0, 1)\):

\[ P(Z \le -1.25) = 1 - \Phi(1.25) = 1 - 0.8944 = 0.1056. \]

Note: The last three marks of the first part can also be awarded for correct use of the p-value, as follows.

• The p-value is higher than \(\alpha = 0.05\).
• Decision: do not reject \(H_0\).
• There is no evidence that fewer than 20% of visitors make use of the device.

(h) State whether the following are true or false and give a brief explanation. (Note that no marks will be awarded for a simple true/false answer.)
i. The interquartile range of a sample is influenced by extreme values.
ii. A sampling distribution is the probability distribution of a population parameter.
iii. A sample correlation coefficient close to 1 indicates a strong positive linear relationship between two categorical variables.
iv. A p-value of 0.08 represents a highly significant hypothesis test result.
v. Rejection of a null hypothesis might indicate that a Type II error has been committed.
vi. A quota sample is the non-random equivalent of a systematic random sample.
(12 marks)

Reading for this question

This question contains material from various parts of the subject guide. Here, a good intuitive understanding of the relevant concepts matters more than technical skill in computation. Part i. concerns measures of spread that can be found in Section 4.9 of the subject guide. Part ii. enquires about the sampling distribution which is defined in Section 6.9. Part iii. is about correlation (see Section 12.8) and types of variables (see Section 4.6). Part iv. targets the concept of a p-value covered in Section 8.11, whereas part v. looks at types of error in hypothesis testing (Section 8.7). Finally, part vi. requires material from Chapter 10 and in particular Section 10.7 on types of sampling.

Approaching the question

Candidates always find this type of question tricky. It requires a brief explanation of the reason for a true/false answer and not just a choice between the two. Some candidates lost marks for long rambling explanations without a decision as to whether a statement was true or false.

i. False. The interquartile range of a sample is defined as the range of the central 50% of the values in a dataset, so any extreme values would lie below the lower quartile and/or above the upper quartile.
ii. False. A sampling distribution is the probability distribution of a sample statistic.
iii. False. A value of r close to 1 indicates a strong, positive linear relationship between two measurable (continuous) variables.
iv. False. A p-value less than 0.01 represents a highly significant hypothesis test result; 0.08 is merely weakly significant.
v. False. Rejection of a true null hypothesis might indicate that a Type I error has been committed.
vi. False. A quota sample is the non-random equivalent of a stratified random sample.
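Returning to the single-proportion test in part (g) of Question 1, the test statistic and p-value can be reproduced with a short calculation. This is a minimal sketch under stated assumptions: it uses only Python's standard library, the helper normal_cdf is a hypothetical name introduced here, and the critical value −1.645 is typed in from the statistical tables rather than computed.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, Phi(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Values taken from Question 1(g).
pi0 = 0.20      # proportion under H0
n = 100         # sample size
users = 15      # visitors in the sample who used the device

p_hat = users / n
se = math.sqrt(pi0 * (1 - pi0) / n)   # standard error under H0
stat = (p_hat - pi0) / se             # test statistic
p_value = normal_cdf(stat)            # lower-tailed test, H1: pi < 0.2

critical = -1.645                     # 5% lower-tail value from tables
reject = stat < critical

print(f"statistic = {stat:.2f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

Both routes lead to the same decision here: the statistic is not below the critical value, and the p-value exceeds 0.05.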
Section B

Answer two out of the three questions from this section (25 marks each).

Question 2

(a) A factory uses four different machines to manufacture a particular type of machine component. A random sample of 400 components is selected from the output of the factory. Each component in the sample is inspected to determine whether or not it is faulty. The machine that produced the component is also recorded. The results are as follows:

                  Outcome
              Faulty   Non-faulty   Total
Machine 1        4         96        100
Machine 2        2         98        100
Machine 3       11         89        100
Machine 4       14         86        100
Total           31        369        400

i. Based on the data in the table, and without conducting any significance test, would you say there is an association between the machine number and the component being faulty?
ii. Calculate the \(\chi^2\) statistic and use it to test for independence, using a 5% significance level. What do you conclude?
(14 marks)

Reading for this question

This part targets Chapter 8 of the subject guide on contingency tables and chi-squared tests. Note that part i. of the question does not require any calculations, just understanding and interpreting contingency tables. Candidates can attempt Activity A8.4 to practise. Part ii. is a straightforward chi-squared test and the reading is also given in Chapter 8. Look also at Activity A8.4.

Approaching the question

i. There are some differences in the proportions of faulty components for each machine. More specifically, 2% of the components from Machine 2 are faulty, whereas the corresponding proportion for Machine 3 is 11%, and for Machine 4 is 14%. Hence, there seems to be an association between machine number and the component being faulty, although this needs to be investigated further. (Note: the conclusion of the last sentence must be stated to get full marks.)

ii. Set out the null hypothesis that there is no association between machine number and the component being faulty against the alternative that there is an association.
Be careful to get these the correct way round!

\(H_0\): No association between the machine number and the component being faulty.
vs.
\(H_1\): Association between machine number and the component being faulty.

Work out the expected values to obtain the table below.

              Faulty   Non-faulty
Machine 1      7.75      92.25
Machine 2      7.75      92.25
Machine 3      7.75      92.25
Machine 4      7.75      92.25

The test statistic formula is:

\[ \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \]

which gives a value of 13.53. This is a 4 × 2 contingency table, so the degrees of freedom are (4 − 1) × (2 − 1) = 3. For \(\alpha = 0.05\), the critical value is 7.815, hence we reject \(H_0\). We conclude that there is evidence of an association between machine number and the component being faulty. Many candidates looked up the tables incorrectly and so failed to follow through their earlier accurate work.

(b) i. Describe how stratified random sampling is performed and explain how it differs from quota sampling.
ii. A company producing handheld electronic devices (tablets, mobile phones etc.) wants to understand how people of different ages rate its products. For this reason, the company’s management has decided to use a survey of its customers and has asked you to devise an appropriate random sampling scheme. Outline the key components of your sampling scheme.
(11 marks)

Reading for this question

This question was on basic material on survey designs. Background reading is given in Chapters 10 and 11 of the subject guide which, along with the recommended reading, should be looked at carefully. Candidates were expected to have studied and understood the main constituents of design in random sampling. It is also a good idea to try the Learning activities of Chapter 10.

Approaching the question

One of the main things to avoid in this part is to write essays without any structure. This exercise asks for specific things and each one of them requires one or two lines. If you are unsure of what these things are, do not write lengthy essays.
This gains you nothing and wastes valuable examination time. If you can identify what is being asked, keep in mind that the answer should not be long. Note also that in some cases there is no unique answer to the question. The marking scheme and some model answers are given below.

i. Description of stratified random sampling: the population is divided into strata, natural groupings within the population, and a simple random sample is taken from each stratum. See page 162 of the subject guide for a more detailed description. Stratified random sampling is different from quota sampling in the following ways.
∗ Stratified random sampling is probability sampling, whereas quota sampling is non-probability sampling.
∗ In stratified random sampling a sampling frame is required, whereas in quota sampling pre-chosen frequencies in each category are sought.

ii. As mentioned earlier, it is crucial in this type of question to avoid long answers. Also, note that there is no unique answer. A possible set of ‘ingredients’ of an answer is given below (each bullet point corresponds to a mark).
• Propose stratified sampling since customers of all ages are to be surveyed.
• Sampling frame could be the company’s customer database.
• Take a simple random sample from each stratum.
• Stratification factors should include age.
• Other stratification factors could be gender, country of residence, etc.
• Contact method: mail, telephone or email (likely to have all details on database).
• Minimise non-response through a suitable incentive, such as discount off the next purchase.

Question 3

(a) The data below represent heights, measured in centimetres, of women from an adult female population:

162  166  167  168  169  170
164  166  167  168  169  171
164  166  167  168  169  172
165  167  168  168  170  184
165  167  168  169  170  185

i. Carefully construct, draw and label a histogram of these data on the graph paper provided.
ii. Find the median height among these women and the upper quartile. What percentage of women were below 165 cm?
iii. Comment on the data given the shape of the histogram without doing any further calculations.
iv. Name two other types of graphical displays that would be suitable to represent the data.
(13 marks)

Reading for this question

Chapter 4 provides all the relevant material for this question. More specifically, reading on histograms can be found in Section 4.7.3, but the entire Sections 4.7, 4.8 and 4.9 are highly relevant.

Approaching the question

i. A histogram compatible with what the examiners were expecting to see is shown below. Marks were awarded for including the title, labelling correctly and accurately drawing the figure. Note that it is essential (and more convenient) to draw the figure on the graph paper provided; marks will be deducted otherwise.

[Figure: ‘Histogram of Heights’. x-axis: Heights of women in centimetres (160 to 185); y-axis: Frequency densities (0.00 to 0.12).]

ii. • Median: 168 centimetres. Note: Raw data should be used, not grouped data. Also, make sure to mention the units to get the full marks.
• Upper quartile: 169 centimetres. Note: Same as above.
• Percentage: 3/30 = 10%. Note: As the question asks for a percentage, make sure to report 10%, not just 3/30 or anything else.

iii. Based on the shape of the histogram, we can see that the distribution of the data is positively skewed. Also two women, with heights of 184 cm and 185 cm, may be regarded as outliers. Note: It is important to identify the specific outliers (184 cm and 185 cm), not just write ‘there are two outliers’.

iv. A boxplot, stem-and-leaf diagram or dot plot are other suitable graphical displays, because height is a measurable variable and these graphs are suitable for displaying the distribution of such variables.

(b) A random sample of 9 people tried a specific diet that lasted 2 months to lose weight.
The weights of these people, measured in kilograms, were measured both at the beginning and the end of the diet, and are shown in the table below: Weight before diet 75 76 90 92 89 63 65 80 90 Weight after diet 73 72 92 93 89 61 62 76 84 i. Carry out an appropriate hypothesis test to determine whether the diet is effective in helping people lose weight. State the test hypotheses, and specify your test statistic and its distribution under the null hypothesis. Comment on your findings. ii. State any assumptions you made in i. iii. Give a 90% confidence interval for the difference between the means of the weights before and after the diet. (12 marks) Reading for this question Look up the sections about hypothesis testing for testing a difference between two population means. However, it is essential for this part to focus on the section regarding paired samples (Section 8.16.4). Approaching the question i. Regarding hypotheses, note that the wording ‘effective’ suggests a one-sided test. Hence we test: H0 : µbefore = µafter vs. H1 : µbefore < µafter . In this part, it is also essential to realise that we have a paired sample, as we have two observations for each person (before and after the diet). Hence the difference for each person should be calculated: −2 −4 2 1 0 −2 −3 −4 −6 The next step is to calculate sd = 2.598 and s̄d = −2.0, in order to obtain the value of the test statistic: x̄d − 0 √ = −2.309. t= sd / n 13 ST104a Statistics 1 We have a t distribution with 8 degrees of freedom, hence the critical value (for a one-sided test) is −1.860. Note: This is clearly a t distribution, make sure not to use the standard normal distribution. Hence, we reject H0 at the 5% significance level. Testing at the 1% significance level gives a critical value of t8, 0.99 = −2.896. Therefore, we do not reject H0 and conclude that there is moderate evidence that the diet is effective. ii. • Differences are normally distributed. • Pairs of observations are independent. iii. 
This is a standard exercise for confidence intervals given the appropriate formula from the formula sheet (make sure to be able to recognise it). The requested confidence interval is (−3.610, −0.390).

Question 4

(a) The director of a local Tourism Authority would like to know whether a family’s annual expenditure on recreation (y), measured in $000s, is related to their annual income (x), also measured in $000s. In order to explore this potential relationship, the variables x and y were recorded for 10 randomly selected families that visited the area last year. The results were as follows:

Family     x       y
#1        41.2     2.4
#2        50.1     2.7
#3        52.0     2.8
#4        62.0     8.0
#5        44.5     3.1
#6        37.7     2.1
#7        73.5    12.1
#8        37.5     2.0
#9        56.7     3.9
#10       65.2     8.9

The summary statistics for these data are:

Sum of x data: 520.4
Sum of the squares of x data: 28431.42
Sum of y data: 48
Sum of the squares of y data: 343.74
Sum of the products of x and y data: 2858.63

i. Draw a scatter diagram of these data on the graph paper provided. Label the diagram carefully.
ii. Calculate the sample correlation coefficient. Interpret your findings.
iii. Calculate the least squares line of y on x and draw the line on the scatter diagram.
iv. Do you find the analyses in ii. and iii. appropriate? Justify your answer and suggest any alternative ways to model the relationship between x and y.
(13 marks)

Reading for this question

This is a standard linear regression question and the reading is to be found in Chapter 12 of the subject guide. Section 12.6 provides details for scatter diagrams and is suitable for part i., whereas the remaining parts are on correlation and regression, which are covered in Sections 12.8, 12.9 and 12.10 of the subject guide. Section 12.7 is also relevant. Sample examination question 2 of this chapter is also recommended for practice on questions of this type.

Approaching the question

i.
Candidates are reminded that they are asked to draw and label the scatter diagram, which should include a title (‘Scatter diagram’ alone will not suffice) and labelled axes which also give their units. Far too many candidates threw away marks by neglecting these points and consequently were only given one mark out of the possible four allocated for this part of the question. Another common way of losing marks was failing to use the graph paper which was provided, and required, in the question. Candidates who drew on the ordinary paper in their answer booklet were not awarded marks for this part of the question.

[Figure: scatter diagram ‘Annual family recreation expenditure vs. Annual family income’. Horizontal axis: annual family income in $000s. Vertical axis: annual family recreation expenditure in $000s.]

ii. The summary statistics can be substituted into the formula for the sample correlation coefficient (make sure you know which one it is!) to obtain the value 0.9222. An interpretation of this value is the following: the data suggest that the higher the family annual income, the higher the family annual recreation expenditure. The fact that the value is very close to 1 suggests that this is a strong, positive linear relationship. Many candidates did not mention all three words (strong, positive, linear). Note that all of these words provide useful information on interpreting the relationship and are therefore required to obtain full marks.

iii. The regression line can be written as \(\hat{y} = a + bx\) or \(y = a + bx + \varepsilon\). The formula for b is:

\[ b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} \]

and by substituting the summary statistics we get b = 0.267. The formula for a is \(a = \bar{y} - b\bar{x}\), so we get a = −9.107. Hence the regression line can be written as \(\hat{y} = -9.107 + 0.267x\) or \(y = -9.107 + 0.267x + \varepsilon\). It should also be plotted on the scatter diagram. Many candidates incorrectly reported the regression line as \(y = -9.107 + 0.267x\).
This expression is incorrect, as it omits both the hat on y and the error term ε; one of the two expressions above is required.

iv. In this case, one can note in the scatter diagram that the points seem to be ‘scattered’ around a non-linear curve rather than a straight line. Another, equivalent, way to note this is the presence of two outliers. Hence a linear regression model does not seem to be a good model for the relationship between family annual income and family annual recreation expenditure. Alternative approaches may involve Spearman’s rank correlation coefficient or transformations of the data, for example a log-transformation.

(b) The fuel consumption of two different car models (A and B) was compared in the following way. A random sample of 20 cars from model A and 35 cars from model B were taken and the fuel consumption (in miles per gallon) was measured for each car. The results are summarised in the table below.

                             Car Model A   Car Model B
Sample size                      20            35
Sample mean                      30.9          27.1
Sample standard deviation        6.11          6.41

i. Use an appropriate hypothesis test to determine whether the model A cars can do more miles per gallon than model B cars. State clearly the hypotheses, the test statistic and its distribution under the null hypothesis, and carry out the test at two appropriate significance levels. Comment on your findings.
ii. State clearly any assumptions you made in i.
iii. Provide a 95% confidence interval for the difference between the mean fuel consumption of the two car models.
(12 marks)

Reading for this question

The first two parts of the question refer to a one-sided hypothesis test comparing two population means. While the entire chapter on hypothesis testing is relevant (Chapter 8), one can focus on Section 8.16 and in particular Sections 8.16.2 and 8.16.3 as the variances are unknown. The last part of the question requires a confidence interval for the difference between two population means, therefore Sections 7.13.2 and 7.13.3 are most relevant.

Approaching the question

i.
Let \(\mu_A\) denote the mean fuel consumption for car model A and \(\mu_B\) the mean fuel consumption for car model B. The wording ‘can do more miles per gallon than’ implies a one-sided test, hence the hypotheses can be written as:

\[ H_0: \mu_A = \mu_B \quad \text{vs.} \quad H_1: \mu_A > \mu_B. \]

The test statistic formulae, depending on whether a pooled variance is used or not, are provided in the formula sheet. Writing \(\bar{x}\) and \(\bar{y}\) for the sample means of models A and B, they are:

\[ \frac{\bar{x} - \bar{y}}{\sqrt{s_A^2/n_A + s_B^2/n_B}} \quad \text{or} \quad \frac{\bar{x} - \bar{y}}{\sqrt{s_p^2\,(1/n_A + 1/n_B)}}. \]

If equal variances are assumed, the test statistic value is 2.150 (the pooled variance is 39.74). If equal variances are not assumed, the test statistic value is 2.179. Since the variances are unknown and the sample size is not large enough, the \(t_{50}\) distribution is used. The critical value at the 5% significance level is 1.676, hence we reject the null hypothesis. If we take a (smaller) α of 1%, the critical value is 2.403, so we do not reject H0. We conclude that there is moderate evidence that the mean number of miles per gallon is higher for model A than for model B.

ii. The assumptions for i. were the following.
• Assumption about equal variances.
• Assumption about whether \(n_A + n_B\) is ‘large’ so that the normality assumption is satisfied.
• Assumption about independent samples.

Some candidates stated assumptions in this part that were not made in part i. Marks were not awarded in such cases. Some other candidates just copied the phrase ‘assumption about equal variances’ and naturally were not awarded any marks. One should state whether the calculations were based on the assumption that the unknown variances are equal or unequal.

iii. Based on the \(t_{50}\) distribution and using the correct formula from the formula sheet (make sure to be able to recognise it), the requested 95% confidence interval is (0.251, 7.349).

Note: In the solution above, the \(t_{50}\) distribution was used but the use of the standard normal distribution is also justified as the sample size is relatively large.
Hence a solution based on the standard normal distribution is also acceptable.

Note that in what follows the symbol • corresponds to 1 mark unless stated otherwise.

Comments on specific questions – Zone B

Candidates should answer THREE of the following FOUR questions: QUESTION 1 of Section A (50 marks) and TWO questions from Section B (25 marks each).

Section A

Answer all parts of question 1 (50 marks in total).

Question 1

(a) A random sample of athletes’ times to run 200 metres has a sample mean of 24.96 seconds. State the units of measurement for the summaries below and justify your answers.
i. sample variance
ii. sample standard deviation.
(4 marks)

Reading for this question

This question requires knowledge of measures of location and spread. Hence reading Sections 4.8 and 4.9 of the subject guide is essential, in particular Section 4.9.3. For example, candidates should gain familiarity with the sample mean, median, variance and standard deviation.
Approaching the question

The first thing to do is check the formulae for the sample variance and standard deviation. It is then not hard to note that the sample variance, \(s^2\), involves squared deviations of the observations about the sample mean:

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2. \]

The units of measurement will therefore be seconds squared. The formula for the standard deviation, s, involves the square root of the sample variance:

\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

hence we return to the original units of measurement, i.e. seconds.

Some candidates did not provide a justification for their choices, for example just reporting seconds or seconds squared. Justification is essential, however, and therefore mentioning the formulae was necessary to get full marks.

(b) Suppose that \(x_1 = 4\), \(x_2 = -3\), \(x_3 = -7\), \(x_4 = 6\), \(x_5 = 2\), and \(y_1 = -6\), \(y_2 = 4\), \(y_3 = -4\), \(y_4 = 0\), \(y_5 = 1\). Calculate the following quantities:
i. \(\sum_{i=2}^{4} x_i^2\)
ii. \(\sum_{i=1}^{3} 3x_i y_i\)
iii. \(y_3^3 + \sum_{i=4}^{5} \frac{y_i^4}{x_i}\).
(6 marks)

Reading for this question

This question refers to the basic bookwork which can be found in Section 2.9 of the subject guide, and in particular Activity A1.6.

Approaching the question

Be careful to leave the \(x_i\)s and \(y_i\)s in the order given and only cover the values of i asked for. This question was generally well done. The answers are as follows.

i. \(\sum_{i=2}^{4} x_i^2 = (-3)^2 + (-7)^2 + 6^2 = 9 + 49 + 36 = 94\).
ii. \(\sum_{i=1}^{3} 3x_i y_i = 3\sum_{i=1}^{3} x_i y_i = 3\big((4 \times -6) + (-3 \times 4) + (-7 \times -4)\big) = 3(-24 - 12 + 28) = -24\).
iii. \(y_3^3 + \sum_{i=4}^{5} y_i^4/x_i = (-4)^3 + (0^4/6 + 1^4/2) = -64 + 0.5 = -63.5\).

(c) A population is normally distributed with a population mean of 76 and a population standard deviation of 12.
i. State the distribution of the sample mean for simple random samples of size n = 100.
ii. Given a simple random sample of size n = 100, determine the probability that the sample mean will be less than 75.
(4 marks)

Reading for this question

This section examines the ideas of the normal random variable.
Read the relevant section of Chapter 6 of the subject guide and work through the examples and activities of this section. The Sample examination questions are quite relevant. For the first part of the question it is essential to check Section 6.9 of the subject guide.

Approaching the question

The first part just requires knowledge of the fact that if X is a normal random variable with mean \(\mu\) and variance \(\sigma^2\), the sample mean from a sample of size n, \(\bar{X}\), is also a normal random variable with mean \(\mu\) and variance \(\sigma^2/n\). Direct application of this fact then yields that:

\[ \bar{X} \sim N\left(76, \frac{(12)^2}{100}\right) = N(76, 1.44). \]

For the second part, the basic property of the normal random variable for this question is that if \(X \sim N(\mu, \sigma^2)\), then \(Z = (X - \mu)/\sigma \sim N(0, 1)\). Note also that:

* \(P(Z < a) = P(Z \le a) = \Phi(a)\)
* \(P(Z > a) = P(Z \ge a) = 1 - P(Z \le a) = 1 - P(Z < a) = 1 - \Phi(a)\)
* \(P(a < Z < b) = P(a \le Z < b) = P(a < Z \le b) = P(a \le Z \le b) = \Phi(b) - \Phi(a)\).

The above is all you need to find the requested probability. We can write:

\[ P(\bar{X} < 75) = P\left(Z < \frac{75 - 76}{\sqrt{1.44}}\right) = P(Z < -0.83) = 1 - \Phi(0.83) = 1 - 0.7967 = 0.2033. \]

(d) Classify each one of the following variables as measurable (continuous) or categorical. If a variable is categorical, further classify it as nominal or ordinal. Justify your answer. (Note that no marks will be awarded without justification.)
i. The weight of a chocolate bar produced in a factory.
ii. Responses to ‘what is your age group?’ in a questionnaire.
iii. The colour of a car.
iv. Inflation rates.
(8 marks)

Reading for this question

This question requires identifying types of variables, so reading the relevant section in the subject guide (Section 4.6) is essential. Candidates should gain familiarity with the notion of a variable and be able to distinguish between discrete and continuous (measurable) data.
In addition to identifying whether a variable is categorical or measurable, candidates should further distinguish between ordinal and nominal categorical variables.

Approaching the question

A general tip for identifying continuous and categorical variables is to think of the possible values they can take. If these are finite and represent specific entities, the variable is categorical. Otherwise, if these consist of numbers corresponding to measurements, the data are continuous and the variable is measurable. Such variables may also have measurement units or can be measured to various decimal places.

i. Measurable, because the weight can be measured, for example, in grammes to several decimal places such as 499.28 g.
ii. Age groups are in a ranked order, for example [18, 30), [30, 40) etc. It is therefore a categorical ordinal variable.
iii. Each colour (black, white, red, etc.) is a category. Also, there is no natural ordering between the colours, for example we cannot really say that ‘blue is higher than red’. This is therefore a categorical nominal variable.
iv. Measurable, because inflation rates are quoted to several decimal places, for example 1.50%.

Weak candidates did not provide a justification for their choices, classified measurable variables as nominal or categorical, and sometimes answered ordinal when their justification pointed to a nominal variable. There were also phrases like ‘It is measurable because it can be measured’ that were not awarded any marks.

(e) The random variable X takes the values 0, 1 and 3 according to the following probability distribution:

x          0     1     3
p_X(x)     0.4   k     k

i. Determine the constant k.
ii. Find E(X), the expected value of X.
iii. Find Var(X), the variance of X.
(5 marks)

Reading for this question

This is a question on probability, exploring the concepts of relative frequency, conditional probability and probability distribution.
Reading from Chapter 5 of the subject guide is suggested, with a focus on the sections on these topics. Try Activity A5.1 and the exercises on probability trees.

Approaching the question

i. \(\sum_i p(x_i) = 1\), so \(0.4 + 2k = 1\) and hence \(k = 0.3\).
ii. \(E(X) = \sum_i x_i\,p(x_i) = 0 \times 0.4 + 1 \times 0.3 + 3 \times 0.3 = 1.2\).
iii. \(E(X^2) = \sum_i x_i^2\,p(x_i) = 0^2 \times 0.4 + 1^2 \times 0.3 + 3^2 \times 0.3 = 3.0\). Hence:

\[ \text{Var}(X) = 3.0 - (1.2)^2 = 1.56. \]

An alternative method to find the variance is through the formula \(\sum_i (x_i - \mu)^2\,p(x_i)\), where \(\mu = E(X)\) is found in part ii.

(f) An engine encounters a standard environment with a probability of 0.9, and a severe environment with a probability of 0.1. In a normal environment the probability of failure is 0.03, whereas in the severe environment this probability is 0.5.
i. What is the probability of failure?
ii. Given that failure has occurred, what is the probability that the environment encountered was severe?
(4 marks)

Reading for this question

This is a question on probability and targets mostly the material of Chapter 5 in the subject guide. It is essential to practise on such exercises through the learning activities and exercises of this chapter as well as the material on the VLE. In particular you can attempt Learning activity A5.6 and Sample examination question 5. It is also useful to familiarise yourself with probability trees as they can be quite handy in such exercises.

Approaching the question

The first part was straightforward for candidates familiar with this section, requiring the use of the law of total probability (although it can also be calculated using common intuition). Part ii. requires knowledge of the definition of conditional probability or, alternatively, knowledge of Bayes’ theorem.

The working of the exercise is given below, where F denotes failure, N the standard (normal) environment and S the severe environment.

i. We have:

\[ P(F) = P(F \mid N)\,P(N) + P(F \mid S)\,P(S) = 0.03 \times 0.9 + 0.5 \times 0.1 = 0.077. \]

ii. We have:

\[ P(S \mid F) = \frac{P(F \mid S)\,P(S)}{P(F)} = \frac{0.05}{0.077} = \frac{50}{77} = 0.6494 \approx 0.65. \]

(g) A museum conducts a survey of its visitors in order to assess the popularity of a device which is used to provide information on the museum exhibits. The device will be withdrawn if fewer than 25% of all of the museum’s visitors make use of it. Of a random sample of 100 visitors, 20 chose to use the device.
i. Carry out an appropriate hypothesis test at the 5% significance level to see if the device should be withdrawn and state your conclusions.
ii. Calculate the p-value of the test.
(7 marks)

Reading for this question

This question refers to a one-sided hypothesis test examining whether the proportion of all museum visitors who use the device is less than 25%. While the entire chapter (Chapter 8 of the subject guide) on hypothesis testing is relevant, one can focus on the relevant section for a single proportion, Section 8.14. Note also that reading on one-tailed (and two-tailed) hypothesis tests is located in Section 8.10. The second part of the question looks at p-values, and the relevant section in the subject guide is Section 8.11.

Approaching the question

It is essential to identify the type of hypothesis test required for this question. Since there is only one variable involved it will have to be a test for a single proportion, and the test statistic can be found in the formula sheet. Make sure to substitute the relevant quantities carefully and avoid any numerical errors in the calculation. The remaining steps involve finding the critical value from the corresponding statistical table for the relevant significance level, deciding whether to reject H0, and interpreting the results in the context of the problem.

The working of the first part of the exercise is given below.

• \(H_0: \pi = 0.25\) vs. \(H_1: \pi < 0.25\).
• The sample proportion is p = 20/100 = 0.20. The standard error of the sample proportion is \(\sqrt{0.25 \times 0.75/100} = 0.0433\). The test statistic value is:

\[ t = \frac{0.2 - 0.25}{0.0433} = -1.15. \]

• For α = 0.05, the critical value is −1.645.
• Decision: do not reject H0.
• No evidence that fewer than 25% of visitors make use of the device.

The second part of the question requires the use of p-values and challenged most candidates. The exercise does not require lengthy calculations and can be derived in a relatively straightforward manner if one is familiar with the material of Section 8.11 of the subject guide. Once the test statistic is calculated (t = −1.15 from the first part) one simply needs to calculate, where \(Z \sim N(0, 1)\):

\[ P(Z \le -1.15) = 1 - \Phi(1.15) = 1 - 0.8749 = 0.1251. \]

Note: The last three marks of the first part can also be awarded for correct use of the p-value, see below.

• The p-value is higher than α = 0.05.
• Decision: do not reject H0.
• No evidence that fewer than 25% of visitors make use of the device.

(h) State whether the following are true or false and give a brief explanation. (Note that no marks will be awarded for a simple true/false answer.)
i. The range of a sample is influenced by extreme values.
ii. A sampling distribution is the probability distribution of a population parameter.
iii. A sample correlation coefficient close to −1 indicates a strong negative linear relationship between two categorical variables.
iv. A p-value of 0.007 represents a weakly significant hypothesis test result.
v. Failure to reject a null hypothesis might indicate that a Type I error has been committed.
vi. A stratified random sample is the random equivalent of a convenience sample.
(12 marks)

Reading for this question

This question contains material from various parts of the subject guide. Here, it is more important to have a good intuitive understanding of the relevant concepts than technical skill in computations. Part i. concerns measures of spread that can be found in Section 4.9 of the subject guide. Part ii. enquires about the sampling distribution, which is defined in Section 6.9. Part iii.
is about correlation (see Section 12.8) and types of variables (see Section 4.6). Part iv. targets the concept of a p-value covered in Section 8.11, whereas part v. looks at types of error in hypothesis testing (Section 8.7). Finally, part vi. requires material from Chapter 10 and in particular Section 10.7 on types of sampling.

Approaching the question

Candidates always find this type of question tricky. It requires a brief explanation of the reason for a true/false answer and not just a choice between the two. Some candidates lost marks for long rambling explanations without a decision as to whether a statement was true or false.

i. True. The range is defined as \(x_{(n)} - x_{(1)}\), so any extreme values would be \(x_{(1)}\) and/or \(x_{(n)}\), hence influencing the range.
ii. False. A sampling distribution is the probability distribution of a sample statistic.
iii. False. A value of r close to −1 indicates a strong, negative linear relationship between two measurable (continuous) variables.
iv. False. A p-value of 0.007 represents a highly significant hypothesis test result. Weakly significant means a p-value between 0.05 and 0.10.
v. False. Failure to reject a null hypothesis might indicate that a Type II error has been committed.
vi. False. A quota sample is the non-random equivalent of a stratified random sample.

Section B

Answer two out of the three questions from this section (25 marks each).

Question 2

(a) A sample consisting of 400 randomly selected students was classified in terms of personality type (introvert or extrovert) and in terms of their favourite colour (red, yellow, green or blue). Their responses are summarised in the table below:

               Personality type
            Introvert   Extrovert   Total
Red             32          68       100
Yellow          26          74       100
Green           21          79       100
Blue            46          54       100
Total          125         275       400

i.
Based on the data in the table, and without conducting any significance test, would you say there is an association between the student’s type of personality and colour preference?
ii. Calculate the \(\chi^2\) statistic and use it to test for independence, using a 5% significance level. What do you conclude?
(14 marks)

Reading for this question

This part targets Chapter 8 of the subject guide on contingency tables and chi-squared tests. Note that part i. of the question does not require any calculations, just understanding and interpreting contingency tables. Candidates can attempt Activity A8.4 to practise. Part ii. is a straightforward chi-squared test and the reading is also given in Chapter 8. Look also at Activity A8.4.

Approaching the question

i. There are some differences in the rates of introvert students for each colour preference. More specifically, 21% of the students who prefer green are introverts, whereas the corresponding proportion for students who prefer red is 32%, and for students who prefer blue it is 46%. Hence, there seems to be an association between personality type and colour preference, although this needs to be investigated further. (Note: the conclusion of the last sentence must be stated to get full marks.)

ii. Set out the null hypothesis that there is no association between personality type and colour preference against the alternative that there is an association. Be careful to get these the correct way round!

H0: No association between personality type and colour preference.
vs.
H1: Association between personality type and colour preference.

Work out the expected values to obtain the table below.

            Introvert   Extrovert
Red           31.25       68.75
Yellow        31.25       68.75
Green         31.25       68.75
Blue          31.25       68.75

The test statistic formula is:

\[ \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \]

which gives a value of 16.33. This is a 4 × 2 contingency table, so the degrees of freedom are (4 − 1) × (2 − 1) = 3. For α = 0.05, the critical value is 7.815, hence we reject H0.
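As a numerical check of the working above, the following is a minimal Python sketch that recomputes the expected frequencies, the test statistic and the degrees of freedom from the observed table. The variable names are illustrative and are not taken from the subject guide, and the critical value 7.815 is simply copied from the text rather than computed.

```python
# Observed counts from the table: (introvert, extrovert) for each favourite colour.
observed = {
    "Red": (32, 68),
    "Yellow": (26, 74),
    "Green": (21, 79),
    "Blue": (46, 54),
}

# Row totals (per colour), column totals (per personality type) and grand total.
row_totals = {colour: sum(counts) for colour, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values()) for j in range(2)]
grand_total = sum(row_totals.values())

# Chi-squared statistic: sum over all cells of (O - E)^2 / E,
# where E = row total * column total / grand total.
chi_sq = 0.0
for colour, counts in observed.items():
    for j, obs in enumerate(counts):
        expected = row_totals[colour] * col_totals[j] / grand_total
        chi_sq += (obs - expected) ** 2 / expected

# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
dof = (len(observed) - 1) * (len(col_totals) - 1)

# Assumption: 5% critical value for 3 degrees of freedom, as quoted in the commentary.
CRITICAL_5PCT_3DF = 7.815
reject_h0 = chi_sq > CRITICAL_5PCT_3DF

print(round(chi_sq, 2), dof, reject_h0)
```

Under these assumptions the statistic agrees with the quoted 16.33 to two decimal places, and the comparison with 7.815 leads to the same decision.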
We conclude that there is evidence of an association between personality type and colour preference. Many candidates looked up the tables incorrectly and so failed to follow through their earlier accurate work.

(b) i. Describe how quota sampling is performed and explain how it differs from stratified random sampling.
ii. A company producing handheld electronic devices (tablets, mobile phones etc.) wants to understand how men and women rate its products. For this reason, the company’s management has decided to use a survey of its customers and has asked you to devise an appropriate random sampling scheme. Outline the key components of your sampling scheme.
(11 marks)

Reading for this question

This question was on basic material on survey designs. Background reading is given in Chapters 10 and 11 of the subject guide which, along with the recommended reading, should be looked at carefully. Candidates were expected to have studied and understood the main constituents of design in random sampling. It is also a good idea to try the Learning activities of Chapter 10.

Approaching the question

One of the main things to avoid in this part is writing essays without any structure. This exercise asks for specific things and each one of them requires one or two lines. If you are unsure of what these things are, do not write lengthy essays. They gain no marks and waste your valuable examination time. If you can identify what is being asked, keep in mind that the answer should not be long. Note also that in some cases there is no unique answer to the question. The marking scheme and some model answers are given below.

i. Description of quota sampling: the interviewer is given specific quota controls on certain specified characteristics, such as age, gender, social class etc., and then interviews people until these quotas are reached. See page 159 of the subject guide for a more detailed description.
Quota sampling is different from stratified random sampling in the following ways.
∗ Stratified random sampling is probability sampling, whereas quota sampling is non-probability sampling.
∗ In stratified random sampling a sampling frame is required, whereas in quota sampling pre-chosen frequencies in each category are sought.

ii. As mentioned earlier, it is crucial in this type of question to avoid long answers. Also, note that there is no unique answer. A possible set of ‘ingredients’ of an answer is given below (each bullet point corresponds to a mark).

• Propose stratified sampling since customers of all ages are to be surveyed.
• Sampling frame could be the company’s customer database.
• Take a simple random sample from each stratum.
• Stratification factors should include gender.
• Other stratification factors could be age, country of residence, etc.
• Contact method: mail, telephone or email (likely to have all details on database).
• Minimise non-response through a suitable incentive, such as a discount off the next purchase.

Question 3

(a) A policeman recorded the speed of 30 cars on a road with a 30 miles per hour speed limit. The recorded data are shown below:

25.6   26.2   27.9   28.8   29.2   30.1
25.7   26.9   27.9   28.9   29.3   30.1
25.7   27.5   28.3   28.9   29.5   30.2
25.8   27.7   28.4   29.0   29.7   36.2
25.8   27.8   28.5   29.1   29.8   36.9

i. Carefully construct, draw and label a histogram of these data on the graph paper provided.
ii. Find the median speed among these cars and the upper quartile. What percentage of drivers were exceeding the 30 miles per hour speed limit?
iii. Comment on the data given the shape of the histogram without doing any further calculations.
iv. Name two other types of graphical displays that would be suitable to represent the data.
(13 marks)

Reading for this question

Chapter 4 provides all the relevant material for this question.
More specifically, reading on histograms can be found in Section 4.7.3, but the whole of Sections 4.7, 4.8 and 4.9 is highly relevant.

Approaching the question

i. A histogram compatible with what the examiners were expecting to see is shown below. Marks were awarded for including the title, labelling correctly and accurately drawing the figure. Note that it is essential (and more convenient) to draw the figure on the graph paper provided; marks will be withdrawn otherwise.

[Figure: ‘Histogram of Speeds’. Horizontal axis: speeds in miles per hour (24 to 38). Vertical axis: frequency densities (0.00 to 0.20).]

ii. • Median: 28.65 miles per hour. Note: Raw data should be used, not grouped data. Also, make sure to mention the units to get full marks.
• Upper quartile: 29.45 miles per hour. Note: Same as above.
• Percentage: 5/30 = 16.67%. Note: As the question asks for a percentage, make sure to report 16.67% (17% is also fine), not just 5/30 or anything else.

iii. Based on the shape of the histogram, we can see that the distribution of the data is positively skewed. Also, two cars, with speeds of 36.2 and 36.9 miles per hour, may be regarded as outliers. Note: It is important to identify the specific outliers (36.2 and 36.9 miles per hour), not just write ‘there are two outliers’.

iv. A boxplot, stem-and-leaf diagram or dot plot are other suitable types of graphical display. The reason is that the variable speed is measurable and these graphs are suitable for displaying the distribution of such variables.

(b) A random sample of 9 students received special training to improve their performance on IQ tests. Each of the 9 students took an IQ test before and after the training and their scores are shown in the table below:

IQ score before training   105   116   120    93   119   133    75    86    90
IQ score after training    107   120   118    92   119   135    78    90    96

i.
Carry out an appropriate hypothesis test to determine whether the special training is effective for increasing the average IQ score. State the test hypotheses, and specify your test statistic and its distribution under the null hypothesis. Comment on your findings.
ii. State any assumptions you made in i.
iii. Give a 90% confidence interval for the difference between the means of the IQ scores before and after training.
(12 marks)

Reading for this question

Look up the sections on hypothesis testing for a difference between two population means. For this part, however, it is essential to focus on the section regarding paired samples (Section 8.16.4).

Approaching the question

i. Regarding hypotheses, note that the wording ‘increasing’ suggests a one-sided test. Hence we test:

\[ H_0: \mu_{\text{before}} = \mu_{\text{after}} \quad \text{vs.} \quad H_1: \mu_{\text{before}} < \mu_{\text{after}}. \]

In this part, it is also essential to realise that we have a paired sample, as we have two observations for each person (before and after the special training). Hence the difference (after minus before) for each person should be calculated:

2   4   −2   −1   0   2   3   4   6

The next step is to calculate \(s_d = 2.598\) and \(\bar{x}_d = 2.0\), in order to obtain the value of the test statistic:

\[ t = \frac{\bar{x}_d - 0}{s_d / \sqrt{n}} = 2.309. \]

We have a t distribution with 8 degrees of freedom, hence the critical value (for a one-sided test at the 5% significance level) is 1.860. Note: This is clearly a t distribution, make sure not to use the standard normal distribution. Since 2.309 > 1.860, we reject H0 at the 5% significance level. Testing at the 1% significance level gives a critical value of \(t_{8,\,0.01} = 2.896\). Since 2.309 < 2.896, we do not reject H0 at the 1% level, concluding that there is moderate evidence that the special training is effective.

ii. • Differences are normally distributed.
• Pairs of observations are independent.

iii. This is a standard exercise for confidence intervals given the appropriate formula from the formula sheet (make sure to be able to recognise it). The requested confidence interval is (0.390, 3.610).
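The paired-sample working for Question 3(b) can be reproduced with a short calculation. The sketch below is illustrative only: it uses Python's standard library, the variable names are assumptions rather than notation from the subject guide, and the tabulated value 1.860 (upper 5% point of the t distribution with 8 degrees of freedom) is copied from the commentary rather than computed.

```python
import math
import statistics

# IQ scores from the table in part (b), in the order given.
before = [105, 116, 120, 93, 119, 133, 75, 86, 90]
after = [107, 120, 118, 92, 119, 135, 78, 90, 96]

# Paired differences (after minus before), one per student.
diffs = [a - b for a, b in zip(after, before)]

n = len(diffs)
mean_d = statistics.mean(diffs)   # sample mean of the differences
sd_d = statistics.stdev(diffs)    # sample standard deviation (n - 1 divisor)
std_err = sd_d / math.sqrt(n)     # estimated standard error of the mean difference

# Test statistic for H0: mean difference = 0.
t_stat = (mean_d - 0) / std_err

# Assumption: tabulated upper 5% point of t with 8 degrees of freedom.
T8_UPPER_5PCT = 1.860

# One-sided test at the 5% level, and the 90% two-sided confidence interval,
# both of which use the same tabulated value.
reject_at_5pct = t_stat > T8_UPPER_5PCT
ci_90 = (mean_d - T8_UPPER_5PCT * std_err, mean_d + T8_UPPER_5PCT * std_err)

print(diffs)
print(round(sd_d, 3), round(t_stat, 3), reject_at_5pct)
print(tuple(round(limit, 2) for limit in ci_90))
```

The interval endpoints agree with the quoted (0.390, 3.610) to two decimal places; small differences in the third decimal place come from rounding intermediate values.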
Question 4

(a) An insurance company wants to relate the amount of fire damage (y) in major residential fires to the distance between the residence and the nearest fire station (x). For this reason, a study was conducted in a large suburb of a major city based on a sample of 10 recent fires in this suburb. For each of these fires, the variables x and y were recorded and are shown in the table below:

Fire     x      y
#1      3.4    2.6
#2      1.8    1.8
#3      4.6    5.9
#4      2.3    2.3
#5      3.1    2.8
#6      5.5    8.6
#7      0.7    1.4
#8      3.0    2.3
#9      2.6    2.0
#10     4.3    5.7

The summary statistics for these data are:

Sum of x data: 31.3
Sum of the squares of x data: 115.85
Sum of y data: 35.4
Sum of the squares of y data: 175.64
Sum of the products of x and y data: 138.08

i. Draw a scatter diagram of these data on the graph paper provided. Label the diagram carefully.
ii. Calculate the sample correlation coefficient. Interpret your findings.
iii. Calculate the least squares line of y on x and draw the line on the scatter diagram.
iv. Do you find the analyses in ii. and iii. appropriate? Justify your answer and suggest any alternative ways to model the relationship between x and y.
(13 marks)

Reading for this question

This is a standard linear regression question and the reading is to be found in Chapter 12 of the subject guide. Section 12.6 provides details for scatter diagrams and is suitable for part i., whereas the remaining parts are on correlation and regression which are covered in Sections 12.8, 12.9 and 12.10 of the subject guide. Section 12.7 is also relevant. Sample examination question 2 of this chapter is also recommended for practice on questions of this type.

Approaching the question

i. Candidates are reminded that they are asked to draw and label the scatter diagram which should include a title (‘Scatter diagram’ alone will not suffice) and labelled axes which give their units in addition.
Far too many candidates threw away marks by neglecting these points and consequently were only given one mark out of the possible four allocated for this part of the question. Another common way of losing marks was failing to use the graph paper which was provided, and required, in the question. Candidates who drew on the ordinary paper in their answer booklet were not awarded marks for this part of the question.

[Figure: scatter diagram titled ‘Amount of fire damage vs. Distance from nearest fire station’; horizontal axis ‘Distance between residence and the nearest fire station’, vertical axis ‘Amount of fire damage’.]

ii. The summary statistics can be substituted into the formula for the sample correlation coefficient (make sure you know which one it is!) to obtain the value 0.9093. An interpretation of this value is the following: the data suggest that the greater the distance of the residence from the nearest fire station, the higher the amount of fire damage. The fact that the value is very close to 1 suggests that this is a strong, positive linear relationship. Many candidates did not mention all three words (strong, positive, linear). Note that all of these words provide useful information on interpreting the relationship and are therefore required to obtain full marks.

iii. The regression line can be written as ŷ = a + bx or y = a + bx + ε. The formula for b is:

$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}$$

and by substituting the summary statistics (n = 10, x̄ = 3.13, ȳ = 3.54) we get b = 27.278/17.881 = 1.526. The formula for a is a = ȳ − bx̄, so we get a = −1.235. Hence the regression line can be written as ŷ = −1.235 + 1.526x or y = −1.235 + 1.526x + ε. It should also be plotted on the scatter diagram. Many candidates incorrectly reported the regression line as y = −1.235 + 1.526x. This expression is wrong because it has neither the hat on y nor the error term ε; one of the two expressions above is required.

iv. In this case, one can note in the scatter diagram that the points seem to be ‘scattered’ around a non-linear curve rather than a straight line.
Another, equivalent, way to note this is the presence of two outliers. Hence a linear regression model does not seem to be a good model for the relationship between the amount of fire damage and the distance from the nearest fire station. Alternative approaches may involve Spearman’s rank correlation coefficient or transformations of the data, for example the log-transformation.

(b) The 55 university students on a certain course were randomly assigned to two class groups of size 30 and 25 students respectively. At the end of the year, all students took the examination and their marks are summarised in the table below.

                 Sample size   Sample mean   Sample standard deviation
Class Group 1        30           75.33               7.61
Class Group 2        25           71.40               6.37

i. Use an appropriate hypothesis test to determine whether the students of class group 1 were better in terms of examination marks. State clearly the hypotheses, the test statistic and its distribution under the null hypothesis, and carry out the test at two appropriate significance levels. Comment on your findings.
ii. State clearly any assumptions you made in i.
iii. Provide a 95% confidence interval for the difference between the mean exam marks of the two class groups.
(12 marks)

Reading for this question

The first two parts of the question refer to a two-sample hypothesis test comparing two population means. While the entire chapter on hypothesis testing is relevant (Chapter 8), one can focus on Section 8.16 and in particular Sections 8.16.2 and 8.16.3 as the variances are unknown. The last part of the question requires a confidence interval for the difference between two population means, therefore Sections 7.13.2 and 7.13.3 are most relevant.

Approaching the question

i. Let µA denote the mean examination mark for class group 1 and µB the mean examination mark for class group 2.
The wording ‘were better in terms of examination marks’ implies a one-sided test, hence the hypotheses can be written as:

H0 : µA = µB vs. H1 : µA > µB.

The test statistic formulae, depending on whether a pooled variance is used or not, are provided in the formula sheet:

$$\frac{\bar{x} - \bar{y}}{\sqrt{s_A^2/n_A + s_B^2/n_B}} \quad \text{or} \quad \frac{\bar{x} - \bar{y}}{\sqrt{s_p^2\,(1/n_A + 1/n_B)}}.$$

If equal variances are assumed, the test statistic value is 2.0511 (the pooled variance is 50.06). If equal variances are not assumed, the test statistic value is 2.0848. Since the variances are unknown and the sample size is not large enough, the t50 distribution is used. The critical value at the 5% significance level is 1.676, hence we reject the null hypothesis. If we take a (smaller) α of 1%, the critical value is 2.390, so we do not reject H0. We conclude that there is moderate evidence that the mean examination mark of class group 1 is higher than that of class group 2.

ii. The assumptions for i. were the following.
• Assumption about equal variances.
• Assumption about whether nA + nB is ‘large’ so that the normality assumption is satisfied.
• Assumption about independent samples.
Some candidates stated assumptions in this part that were not made in part i. Marks were not awarded in such cases. Some other candidates just copied the phrase ‘assumption about equal variances’ and naturally were not awarded any marks. One should state whether the calculations were based on the assumption that the unknown variances are equal or unequal.

iii. Based on the t50 distribution and using the correct formula from the formula sheet (make sure to be able to recognise it), the requested 95% confidence interval is (0.082, 7.778).
Note: In the solution above, the t50 distribution was used but the use of the standard normal distribution is also justified as the sample size is relatively large. Hence a solution based on the standard normal distribution is also acceptable.
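The two-sample calculations in Question 4(b) can be checked from the summary table alone. The following is a minimal sketch in Python using only the standard library; the variable names are illustrative, and the two-sided 5% critical value 2.009 for the t50 distribution is an assumed table value that the commentary does not quote.

```python
import math

# Summary statistics from the question (class group 1 and class group 2).
n_a, mean_a, sd_a = 30, 75.33, 7.61
n_b, mean_b, sd_b = 25, 71.40, 6.37

diff = mean_a - mean_b

# Equal variances assumed: pooled variance estimate and its standard error.
pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
se_pooled = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_pooled = diff / se_pooled

# Equal variances not assumed: separate variance estimates.
se_unpooled = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
t_unpooled = diff / se_unpooled

# One-sided critical values for t50 as quoted in the commentary.
CRIT_5_PERCENT = 1.676
CRIT_1_PERCENT = 2.390

reject_at_5_percent = t_pooled > CRIT_5_PERCENT
reject_at_1_percent = t_pooled > CRIT_1_PERCENT

# Assumption: upper 2.5% point of t50 from standard tables (not in the commentary).
T50_UPPER_2_5_PERCENT = 2.009
ci_95 = (diff - T50_UPPER_2_5_PERCENT * se_pooled,
         diff + T50_UPPER_2_5_PERCENT * se_pooled)

print(f"pooled variance = {pooled_var:.2f}")
print(f"t (pooled) = {t_pooled:.4f}, t (unpooled) = {t_unpooled:.4f}")
print(f"reject H0 at 5%: {reject_at_5_percent}, at 1%: {reject_at_1_percent}")
print(f"95% CI: ({ci_95[0]:.3f}, {ci_95[1]:.3f})")
```

With this table value the interval is approximately (0.08, 7.78), in line with the reported (0.082, 7.778); differences in the third decimal place come from rounding of the critical value.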