Examiners’ commentaries 2016 - University of London International

Examiners’ commentaries 2016
ST104a Statistics 1
Important note
This commentary reflects the examination and assessment arrangements for this course in the
academic year 2015–16. The format and structure of the examination may change in future years,
and any such changes will be publicised on the virtual learning environment (VLE).
Information about the subject guide and the Essential reading
references
Unless otherwise stated, all cross-references will be to the latest version of the subject guide (2014).
You should always attempt to use the most recent edition of any Essential reading textbook, even if
the commentary and/or online reading list and/or subject guide refer to an earlier edition. If
different editions of Essential reading are listed, please check the VLE for reading supplements – if
none are available, please use the contents list and index of the new edition to find the relevant
section.
General remarks
Learning outcomes
At the end of the course and having completed the Essential reading and activities you should:
• be familiar with the key ideas of statistics that are accessible to a student with a moderate
  mathematical competence
• be able to routinely apply a variety of methods for explaining, summarising and presenting
  data and interpreting results clearly using appropriate diagrams, titles and labels when
  required
• be able to summarise the ideas of randomness and variability, and the way in which these
  link to probability theory to allow the systematic and logical collection of statistical
  techniques of great practical importance in many applied areas
• have a grounding in probability theory and some grasp of the most common statistical
  methods
• be able to perform inference to test the significance of common measures such as means and
  proportions and conduct chi-square tests of contingency tables
• be able to use simple linear regression and correlation analysis and know when it is
  appropriate to do so.
Planning your time in the examination
You have two hours to complete this paper, which is in two parts. The first part, Section A, is
compulsory, covers several subquestions and accounts for 50 per cent of the total marks.
Section B contains three questions, each worth 25 per cent, from which you are asked to choose two.
Remember that each of the Section B questions is likely to cover more than one topic. In 2016, for
example, the first part of Question 2 asked for a chi-squared test and survey design problems
appeared in the second part. Question 3 had a series of questions involving drawing diagrams, such
as histograms, hypothesis testing, in particular paired sample t tests, and confidence intervals. The
first part of Question 4 was on linear regression and involved drawing a diagram, while the second
part was a hypothesis test comparing population means using the sample data given. This means
that it is really important that you make sure you have a reasonable idea of what topics are covered
before you start work on the paper! We suggest you divide your time as follows during the
examination.
• Spend the first 10 minutes annotating the paper. Note the topics covered in each question
  and subquestion.
• Allow yourself 45 minutes for Section A. Do not allow yourself to get stuck on any one
  question, but do not just give up after two minutes!
• Once you have chosen your two Section B questions, give them about 25 minutes each.
• This leaves you with 15 minutes. Do not leave the examination hall at this point! Check
  over any questions you may not have completely finished. Make sure you have labelled and
  given a title to any tables or diagrams which were required and, if you did more than the
  two questions required in Section B, decide which one to delete. Remember that only two of
  your answers will be given credit in Section B and that you must choose which these are!
What are the examiners looking for?
The examiners are looking for very simple demonstrations from you. They want to be sure that you:
• have covered the syllabus as described and explained in the subject guide
• know the basic formulae given there and when and how to use them
• understand and answer the questions set.
You are not expected to write long essays where explanations or descriptions of sample design
are required, and note-form answers are acceptable. However, clear and accurate language, both
mathematical and written, is expected and marked. The explanations below and in the specific
Examiners’ commentaries for the papers for each zone should make these requirements clear.
Key steps to improvement
The most important thing you can do is answer the question set! This may sound very simple, but
these are some of the things that candidates did not do, though asked, in the 2016 examinations!
Remember the following.
• If you are asked to label a diagram (which is almost always the case!), please do so. Writing
  ‘Histogram’ or ‘Stem-and-leaf diagram’ in itself is insufficient. What do the data describe?
  What are the units? What are the x-axis and y-axis?
• If you are specifically asked to carry out a hypothesis test, or to calculate a confidence
  interval, do so. It is not acceptable to do one rather than the other! If you are asked to
  find a 5% critical value, this is what will be marked.
• Do not waste time calculating things which are not required by the examiners. If you are
  asked to find the line of best fit, you will get no additional marks if you calculate the
  correlation coefficient as well. If you are asked to use the confidence interval you have just
  calculated to comment on the results, carrying out an additional hypothesis test will not
  gain you marks.
How should you use the specific comments on each question given in the
Examiners’ commentaries?
We hope that you find these useful. For each question and subquestion, they give:
• further guidance for each question on the points made in the last section
• the answers, or keys to the answers, which the examiners were looking for
• the relevant detailed reference to P. Newbold, W.L. Carlson and B.M. Thorne Statistics for
  business and economics. (London: Prentice–Hall, 2012) eighth edition [ISBN
  9780273767060] and the subject guide
• where appropriate, suggested activities from the subject guide which should help you to
  prepare, and similar questions from Newbold (2012).
Any further references you might need are given in the part of the subject guide to which you are
referred for each answer.
Memorising from the Examiners’ commentaries
It was noted recently that a small number of candidates appeared to be memorising answers from
previous years’ Examiners’ commentaries, for example plots, and produced the exact same image of
them without looking at the current year’s examination paper questions! Note that this is very easy
to spot. The Examiners’ commentaries should be used as a guide to practise on sample examination
questions and it is pointless to attempt to memorise them.
Examination revision strategy
Many candidates are disappointed to find that their examination performance is poorer than they
expected. This may be due to a number of reasons. The Examiners’ commentaries suggest ways of
addressing common problems and improving your performance. One particular failing is ‘question
spotting’, that is, confining your examination preparation to a few questions and/or topics which
have come up in past papers for the course. This can have serious consequences.
We recognise that candidates may not cover all topics in the syllabus in the same depth, but you
need to be aware that the examiners are free to set questions on any aspect of the syllabus. This
means that you need to study enough of the syllabus to enable you to answer the required number of
examination questions.
The syllabus can be found in the Course information sheet in the section of the VLE dedicated to
each course. You should read the syllabus carefully and ensure that you cover sufficient material in
preparation for the examination. Examiners will vary the topics and questions from year to year and
may well set questions that have not appeared in past papers. Examination papers may legitimately
include questions on any topic in the syllabus. So, although past papers can be helpful during your
revision, you cannot assume that topics or specific questions that have come up in past examinations
will occur again.
If you rely on a question-spotting strategy, it is likely you will find yourself in difficulties
when you sit the examination. We strongly advise you not to adopt this strategy.
Note that in what follows the symbol • corresponds to 1 mark unless stated otherwise.
Comments on specific questions – Zone A
Candidates should answer THREE of the following FOUR questions: QUESTION 1 of Section
A (50 marks) and TWO questions from Section B (25 marks each).
Section A
Answer all parts of question 1 (50 marks in total).
Question 1
(a) A random sample of the heights of buildings has a sample mean of 24.96 metres.
State the units of measurements for the summaries below and justify your
answers.
i. sample variance
ii. sample standard deviation.
(4 marks)
Reading for this question
This question requires knowledge regarding measures of location and spread. Hence reading
of Sections 4.8 and 4.9 in the subject guide is essential and in particular Section 4.9.3. For
example, candidates should gain familiarity with the sample mean, median, variance and
standard deviation.
Approaching the question
The first thing to do is check the formulae for the sample variance and standard deviation.
It is then not hard to note that the sample variance, $s^2$, involves squared deviations of the
observations about the sample mean:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$
The units of measurement will therefore be metres squared, m².
The formula for the standard deviation, $s$, involves the square root of the sample variance:
$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
hence we return to the original units of measurement, i.e. metres, m.
Some candidates did not provide a justification for their choices, for example just reporting
metres or metres squared. The question asks for justification, so reference to the formulae
was required to get full marks.
(b) Suppose that x1 = 8, x2 = −1, x3 = −6, x4 = 5, x5 = 0, and y1 = −7, y2 = 3,
y3 = 0, y4 = 1, y5 = −3. Calculate the following quantities:
i. $\sum_{i=2}^{4} x_i^2$
ii. $\sum_{i=1}^{3} 2 x_i y_i$
iii. $y_5^3 + \sum_{i=3}^{4} \frac{y_i^4}{x_i}$.
(6 marks)
Reading for this question
This question refers to the basic bookwork which can be found in Section 2.9 of the subject
guide, and in particular Activity A1.6.
Approaching the question
Be careful to leave the $x_i$s and $y_i$s in the order given and only cover the values of $i$ asked for.
This question was generally well done. The answers are as follows.
i. $\sum_{i=2}^{4} x_i^2 = (-1)^2 + (-6)^2 + 5^2 = 1 + 36 + 25 = 62$.
ii. $\sum_{i=1}^{3} 2 x_i y_i = 2 \sum_{i=1}^{3} x_i y_i = 2((8 \times -7) + (-1 \times 3) + (-6 \times 0)) = 2(-56 - 3 + 0) = -118$.
iii. $y_5^3 + \sum_{i=3}^{4} y_i^4 / x_i = (-3)^3 + (0 + 1/5) = -26.8$.
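When revising, the three sums in part (b) can be checked with a short Python sketch. It is only an illustration: the variable names are not from the subject guide, and index 0 is padded with a placeholder (an assumption made here) so that list positions match the subscripts 1 to 5.

```python
# Values from the question; index 0 is a placeholder so that x[i] matches x_i.
x = [None, 8, -1, -6, 5, 0]
y = [None, -7, 3, 0, 1, -3]

# i. Sum of x_i^2 for i = 2, 3, 4.
part_i = sum(x[i] ** 2 for i in range(2, 5))

# ii. Sum of 2 * x_i * y_i for i = 1, 2, 3.
part_ii = sum(2 * x[i] * y[i] for i in range(1, 4))

# iii. y_5^3 plus the sum of y_i^4 / x_i for i = 3, 4.
part_iii = y[5] ** 3 + sum(y[i] ** 4 / x[i] for i in range(3, 5))

print(part_i, part_ii, round(part_iii, 1))
```

Note how each `range` covers only the values of i requested, which is exactly where marks tend to be lost by hand.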
(c) A population is normally distributed with a population mean of 138 and a
population standard deviation of 21.
i. State the distribution of the sample mean for simple random samples of size
n = 25.
ii. Given a simple random sample of size n = 25, determine the probability that
the sample mean will be less than 128.
(4 marks)
Reading for this question
This section examines the ideas of the normal random variable. Read the relevant section of
Chapter 6 of the subject guide and work through the examples and activities of this section. The
Sample examination questions are quite relevant. For the first part of the question it is
essential to check Section 6.9 of the subject guide.
Approaching the question
The first part just requires knowledge of the fact that if $X$ is a normal random variable with
mean $\mu$ and variance $\sigma^2$, the sample mean from a sample of size $n$, $\bar{X}$, is also a normal
random variable with mean $\mu$ and variance $\sigma^2/n$. Direct application of this fact then yields
that:
$$\bar{X} \sim N\left(138, \frac{(21)^2}{25}\right) = N(138, 17.64).$$
For the second part, the basic property of the normal random variable for this question is
that if $X \sim N(\mu, \sigma^2)$, then $Z = (X - \mu)/\sigma \sim N(0, 1)$. Note also that:
* P (Z < a) = P (Z ≤ a) = Φ(a)
* P (Z > a) = P (Z ≥ a) = 1 − P (Z ≤ a) = 1 − P (Z < a) = 1 − Φ(a)
* P (a < Z < b) = P (a ≤ Z < b) = P (a < Z ≤ b) = P (a ≤ Z ≤ b) = Φ(b) − Φ(a).
The above is all you need to find the requested probability. We can write:
$$P(\bar{X} < 128) = P\left(Z < \frac{128 - 138}{\sqrt{17.64}}\right) = P(Z < -2.38) = 1 - \Phi(2.38) = 1 - 0.99134 = 0.00866.$$
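The same calculation can be sketched in Python using the standard library's `statistics.NormalDist`. This is an illustration only, not part of the examination method; because it does not round z to two decimal places, the result differs slightly from the table-based value in the fourth significant figure.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 138, 21, 25

# Sampling distribution of the sample mean: N(mu, sigma^2 / n).
var_xbar = sigma ** 2 / n
xbar_dist = NormalDist(mu=mu, sigma=sqrt(var_xbar))

# P(X-bar < 128); statistical tables with z rounded to -2.38 give 0.00866.
prob = xbar_dist.cdf(128)

print(var_xbar, round(prob, 4))
```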
(d) Classify each one of the following variables as measurable (continuous) or
categorical. If a variable is categorical, further classify it as nominal or ordinal.
Justify your answer. (Note that no marks will be awarded without justification.)
i. The weight of a cereal packet produced in a factory.
ii. The order an athlete finishes a marathon.
iii. The colour of a pair of shoes.
iv. Currency exchange rates.
(8 marks)
Reading for this question
This question requires identifying types of variables so reading the relevant section in the
subject guide (Section 4.6) is essential. Candidates should gain familiarity with the notion
of a variable and be able to distinguish between discrete and continuous (measurable) data.
In addition to identifying whether a variable is categorical or measurable, candidates should
further classify each categorical variable as ordinal or nominal.
Approaching the question
A general tip for identifying continuous and categorical variables is to think of the possible
values they can take. If these are finite and represent specific entities the variable is
categorical. Otherwise, if these consist of numbers corresponding to measurements, the data
are continuous and the variable is measurable. Such variables may also have measurement
units or can be measured to various decimal places.
i. Measurable because the weight can be measured, for example, in grammes to several
decimal places such as 499.28 g.
ii. The observations consist of the athletes finishing in a specific order (1st, 2nd etc.). It is
therefore a categorical ordinal variable.
iii. Each colour (black, white, red, etc.) is a category. Also, there is no natural ordering
between the colours, for example we cannot really say that ‘blue is higher than red’. This
is therefore a categorical nominal variable.
iv. Measurable because exchange rates are quoted to several decimal places, for example
US$1.45 to the £.
Weak candidates did not provide a justification for their choices, classified measurable
variables as nominal or categorical, and sometimes answered ordinal when their justification
pointed to a nominal variable. Circular phrases such as ‘It is measurable because it
can be measured’ were not awarded any marks.
(e) The random variable X takes the values 0, 1 and 4 according to the following
probability distribution:
x          0     1     4
pX (x)     0.2   k     k
i. Determine the constant k.
ii. Find E(X), the expected value of X.
iii. Find Var(X), the variance of X.
(5 marks)
Reading for this question
This is a question on probability, exploring the concepts of relative frequency, conditional
probability and probability distribution. Reading from Chapter 5 of the subject guide is
suggested with focus on the sections on these topics. Try Activity A5.1 and the exercises on
probability trees.
Approaching the question
i. $\sum_i p(x_i) = 1$, so $0.2 + 2k = 1$ and hence $k = 0.4$.
ii. $E(X) = \sum_i x_i \, p(x_i) = 0 \times 0.2 + 1 \times 0.4 + 4 \times 0.4 = 2.0$.
iii. $E(X^2) = \sum_i x_i^2 \, p(x_i) = 0^2 \times 0.2 + 1^2 \times 0.4 + 4^2 \times 0.4 = 6.8$. Hence:
$$\text{Var}(X) = 6.8 - 2^2 = 2.8.$$
An alternative method to find the variance is through the formula $\sum_i (x_i - \mu)^2 \, p(x_i)$,
where $\mu$ is found in part ii.
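As a check on part (e), the following Python sketch computes k, E(X) and Var(X), and confirms that the two variance formulae agree. The variable names are illustrative, and the step `k = (1 - p_zero) / 2` simply encodes the requirement that the probabilities sum to 1.

```python
# Probability distribution from the question, with k still to be found.
values = [0, 1, 4]
p_zero = 0.2

# The probabilities must sum to 1, and the two unknown entries are both k.
k = (1 - p_zero) / 2
probs = [p_zero, k, k]

mean = sum(v * p for v, p in zip(values, probs))           # E(X)
mean_sq = sum(v ** 2 * p for v, p in zip(values, probs))   # E(X^2)

var_shortcut = mean_sq - mean ** 2                         # E(X^2) - (E(X))^2
var_definition = sum((v - mean) ** 2 * p for v, p in zip(values, probs))

print(k, round(mean, 4), round(var_shortcut, 4), round(var_definition, 4))
```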
(f ) An engine encounters a standard environment with a probability of 0.95, and a
severe environment with a probability of 0.05. In a normal environment the
probability of failure is 0.02, whereas in the severe environment this probability
is 0.5.
i. What is the probability of failure?
ii. Given that failure has occurred, what is the probability that the environment
encountered was severe?
(4 marks)
Reading for this question
This is a question on probability and targets mostly the material of Chapter 5 in the subject
guide. It is essential to practise on such exercises through the learning activities and
exercises of this chapter as well as the material on the VLE. In particular you can attempt
Learning activity A5.6 and Sample examination question 5. It is also useful to familiarise
yourself with probability trees as they can be quite handy in such exercises.
Approaching the question
The first part was straightforward for candidates familiar with this section, requiring the use
of the law of total probability (although it can also be calculated using common intuition).
Part ii. requires knowledge of the conditional probability definition or, alternatively,
knowledge of Bayes’ theorem.
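The working of the exercise is given below, where F denotes failure, N the normal (standard)
environment and S the severe environment.
i. We have:
$$P(F) = P(F \mid N) \, P(N) + P(F \mid S) \, P(S) = 0.02 \times 0.95 + 0.5 \times 0.05 = 0.044.$$
ii. We have:
$$P(S \mid F) = \frac{P(F \mid S) \, P(S)}{P(F)} = \frac{0.025}{0.044} = \frac{25}{44} = 0.5682.$$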
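Both parts of (f) reduce to a few lines of arithmetic, sketched here in Python for revision purposes. The variable names are illustrative and not taken from the subject guide.

```python
# Probabilities given in the question.
p_normal, p_severe = 0.95, 0.05
p_fail_given_normal, p_fail_given_severe = 0.02, 0.5

# i. Law of total probability: weight each conditional failure
#    probability by the probability of its environment.
p_fail = p_fail_given_normal * p_normal + p_fail_given_severe * p_severe

# ii. Bayes' theorem: P(S | F) = P(F | S) P(S) / P(F).
p_severe_given_fail = p_fail_given_severe * p_severe / p_fail

print(round(p_fail, 3), round(p_severe_given_fail, 4))
```

Although the severe environment is rare, it accounts for more than half of all failures, which is why a probability tree is a useful sanity check here.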
(g) A museum conducts a survey of its visitors in order to assess the popularity of a
device which is used to provide information on the museum exhibits. The
device will be withdrawn if fewer than 20% of all of the museum’s visitors make
use of it. Of a random sample of 100 visitors, 15 chose to use the device.
i. Carry out an appropriate hypothesis test at the 5% significance level to see if
the device should be withdrawn and state your conclusions.
ii. Calculate the p-value of the test.
(7 marks)
Reading for this question
This question refers to a one-sided hypothesis test examining whether the proportion of all
museum visitors who use the device is less than 20%. While the entire chapter (Chapter 8 of
the subject guide) on hypothesis testing is relevant, one can focus on the relevant section for
a single proportion, Section 8.14. Note also that reading on one-tailed (and two-tailed)
hypothesis tests is located in Section 8.10. The second part of the question looks at p-values,
and the relevant section in the subject guide is Section 8.11.
Approaching the question
It is essential to identify the type of hypothesis test required for this question. Since there is
only one variable involved it will have to be a test for a single proportion, and the test
statistic can be found in the formula sheet. Make sure to substitute the relevant quantities
carefully and avoid any numerical errors in the calculation.
The remaining steps involve finding the critical values from the corresponding statistical
table for the relevant significance level, deciding whether to reject H0 , and interpreting the
results in the context of the problem. The working of the first part of the exercise is given
below.
• H0 : π = 0.2 vs. H1 : π < 0.2.
• The sample proportion is p = 15/100 = 0.15. The standard error of the sample
  proportion is $\sqrt{0.2 \times 0.8/100} = 0.04$. The test statistic value is:
  $$t = \frac{0.15 - 0.2}{0.04} = -1.25.$$
• For α = 0.05, the critical value is −1.645.
• Decision: do not reject H0 .
• No evidence that fewer than 20% of visitors make use of the device.
The second part of the question requires the use of p-values and challenged most candidates.
The exercise does not require lengthy calculations and can be derived in a relatively
straightforward manner if one is familiar with the material of Section 8.11 of the subject
guide. Once the test statistic is calculated (t = −1.25 from the first part) one simply needs
to calculate, where Z ∼ N (0, 1):
P (Z ≤ −1.25) = 1 − Φ(1.25) = 1 − 0.8944 = 0.1056.
Note: The last three marks of the first part can also be awarded by correct use of the
p-value, see below.
• The p-value is higher than α = 0.05.
• Decision: do not reject H0 .
• No evidence that fewer than 20% of visitors make use of the device.
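The test and its p-value can be reproduced with the following Python sketch, which uses `statistics.NormalDist` from the standard library in place of statistical tables. It is an illustration under the same assumptions as the working (normal approximation, standard error evaluated under H0); the variable names are not from the subject guide.

```python
from math import sqrt
from statistics import NormalDist

n, users = 100, 15
pi_0 = 0.2     # proportion under H0
alpha = 0.05   # significance level

p_hat = users / n
std_error = sqrt(pi_0 * (1 - pi_0) / n)   # standard error under H0
z = (p_hat - pi_0) / std_error            # the test statistic (called t in the working)

std_normal = NormalDist()
critical_value = std_normal.inv_cdf(alpha)   # lower-tail critical value, about -1.645
p_value = std_normal.cdf(z)                  # lower-tail p-value, P(Z <= z)

reject_h0 = z < critical_value
print(round(z, 2), round(critical_value, 3), round(p_value, 4), reject_h0)
```

Comparing `z` with the critical value and comparing `p_value` with `alpha` are two routes to the same decision.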
(h) State whether the following are true or false and give a brief explanation. (Note
that no marks will be awarded for a simple true/false answer.)
i. The interquartile range of a sample is influenced by extreme values.
ii. A sampling distribution is the probability distribution of a population
parameter.
iii. A sample correlation coefficient close to 1 indicates a strong positive linear
relationship between two categorical variables.
iv. A p-value of 0.08 represents a highly significant hypothesis test result.
v. Rejection of a null hypothesis might indicate that a Type II error has been
committed.
vi. A quota sample is the non-random equivalent of a systematic random sample.
(12 marks)
Reading for this question
This question contains material from various parts of the subject guide. Here, it is more
important to have a good intuitive understanding of the relevant concepts than the technical
level in computations. Part i. concerns measures of spread that can be found in Section 4.9
of the subject guide. Part ii. enquires about the sampling distribution which is defined in
Section 6.9. Part iii. is about correlation (see Section 12.8) and types of variables (see
Section 4.6). Part iv. targets the concepts of a p-value covered in Section 8.11, whereas part
v. looks at types of error in hypothesis testing (Section 8.7). Finally, part vi. requires
material from Chapter 10 and in particular Section 10.7 on types of sampling.
Approaching the question
Candidates always find this type of question tricky. It requires a brief explanation of the
reason for a true/false answer and not just a choice between the two. Some candidates lost
marks for long rambling explanations without a decision as to whether a statement was true
or false.
i. False. The interquartile range of a sample is defined as the range of the central 50% of
the values in a dataset. Any extreme values lie below the lower quartile and/or
above the upper quartile, so they do not influence it.
ii. False. A sampling distribution is the probability distribution of a sample statistic.
iii. False. A value of r close to 1 indicates a strong, positive linear relationship between two
measurable (continuous) variables.
iv. False. A p-value less than 0.01 represents a highly significant hypothesis test result; 0.08
is merely weakly significant.
v. False. Rejection of a true null hypothesis is a Type I error. A Type II error is the failure
to reject a false null hypothesis, so it cannot occur when the null hypothesis is rejected.
vi. False. A quota sample is the non-random equivalent of a stratified random sample.
Section B
Answer two out of the three questions from this section (25 marks each).
Question 2
(a) A factory uses four different machines to manufacture a particular type of
machine component. A random sample of 400 components is selected from the
output of the factory. Each component in the sample is inspected to determine
whether or not it is faulty. The machine that produced the component is also
recorded. The results are as follows:
                  Outcome
             Faulty   Non-faulty   Total
Machine 1         4           96     100
Machine 2         2           98     100
Machine 3        11           89     100
Machine 4        14           86     100
Total            31          369     400
i. Based on the data in the table, and without conducting any significance test,
would you say there is an association between the machine number and the
component being faulty?
ii. Calculate the χ2 statistic and use it to test for independence, using a 5%
significance level. What do you conclude?
(14 marks)
Reading for this question
This part targets Chapter 8 of the subject guide on contingency tables and chi-squared
tests. Note that part i. of the question does not require any calculations, just understanding
and interpreting contingency tables. Candidates can attempt Activity A8.4 to practise. Part
ii. is a straightforward chi-squared test and the reading is also given in Chapter 8. Look also
at Activity A8.4.
Approaching the question
i. There are some differences in the proportions of faulty components for each machine.
More specifically, 2% of the components from Machine 2 are faulty, whereas the
corresponding proportion for Machine 3 is 11%, and for Machine 4 is 14%. Hence, there
seems to be an association between machine number and the component being faulty,
although this needs to be investigated further. (Note: the conclusion of the last sentence
must be stated to get full marks.)
ii. Set out the null hypothesis that there is no association between machine number and the
component being faulty against the alternative that there is an association. Be careful to
get these the correct way round!
H0 : No association between the machine number and the component being faulty.
vs.
H1 : Association between machine number and the component being faulty.
Work out the expected values to obtain the table below.
             Faulty   Non-faulty
Machine 1      7.75        92.25
Machine 2      7.75        92.25
Machine 3      7.75        92.25
Machine 4      7.75        92.25
The test statistic formula is:
$$\sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$
which gives a value of 13.53. This is a 4 × 2 contingency table, so the degrees of freedom
are (4 − 1) × (2 − 1) = 3.
For α = 0.05, the critical value is 7.815, hence we reject H0 .
We conclude that there is evidence of an association between machine number and the
component being faulty.
Many candidates looked up the tables incorrectly and so failed to follow through their
earlier accurate work.
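The expected values and the test statistic for part (a) ii. can be verified with a short Python sketch. It uses only built-in functions; the critical value 7.815 is entered by hand from statistical tables rather than computed, and the variable names are illustrative.

```python
# Observed counts (faulty, non-faulty) for Machines 1 to 4.
observed = [[4, 96], [2, 98], [11, 89], [14, 86]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum of (O - E)^2 / E over all cells.
chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

critical_value = 7.815  # 5% point of chi-squared with 3 df, from statistical tables
print(expected[0], round(chi_sq, 2), df, chi_sq > critical_value)
```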
(b) i. Describe how stratified random sampling is performed and explain how it
differs from quota sampling.
ii. A company producing handheld electronic devices (tablets, mobile phones
etc.) wants to understand how people of different ages rate its products. For
this reason, the company’s management has decided to use a survey of its
customers and has asked you to devise an appropriate random sampling
scheme. Outline the key components of your sampling scheme.
(11 marks)
Reading for this question
This question was on basic material on survey designs. Background reading is given in
Chapters 10 and 11 of the subject guide which, along with the recommended reading, should
be looked at carefully. Candidates were expected to have studied and understood the main
important constituents of design in random sampling. It is also a good idea to try the
Learning activities of Chapter 10.
Approaching the question
One of the main things to avoid in this part is writing essays without any structure. This
exercise asks for specific things and each one of them requires one or two lines. If you are
unsure of what these things are, do not write lengthy essays. They will not gain you any
marks and are a waste of your valuable examination time. If you can identify what is
being asked, keep in mind that the answer should not be long. Note also that in some
cases there is no unique answer to the question.
The marking scheme and some model answers are given below.
i. Description of stratified random sampling: the population is divided into strata, natural
groupings within the population, and a simple random sample is taken from each
stratum. See page 162 of the subject guide for a more detailed description.
Stratified random sampling is different from quota sampling in the following ways.
∗ Stratified random sampling is probability sampling, whereas quota sampling is
non-probability sampling.
∗ In stratified random sampling a sampling frame is required, whereas in quota
sampling pre-chosen frequencies in each category are sought.
ii. As mentioned earlier, it is crucial in this type of question to avoid long answers. Also,
note that there is no unique answer. A possible set of ‘ingredients’ of an answer is given
below (each bullet point corresponds to a mark).
• Propose stratified sampling since customers of all ages are to be surveyed.
• Sampling frame could be the company’s customer database.
• Take a simple random sample from each stratum.
• Stratification factors should include age.
• Other stratification factors could be gender, country of residence, etc.
• Contact method: mail, telephone or email (likely to have all details on database).
• Minimise non-response through a suitable incentive, such as discount off the next
  purchase.
Question 3
(a) The data below represent heights, measured in centimetres, of women from an
adult female population:
162   164   164   165   165
166   166   166   167   167
167   167   167   168   168
168   168   168   168   169
169   169   169   170   170
170   171   172   184   185
i. Carefully construct, draw and label a histogram of these data on the graph
paper provided.
ii. Find the median height among these women and the upper quartile. What
percentage of women were below 165 cm?
iii. Comment on the data given the shape of the histogram without doing any
further calculations.
iv. Name two other types of graphical displays that would be suitable to
represent the data.
(13 marks)
Reading for this question
Chapter 4 provides all the relevant material for this question. More specifically, reading on
histograms can be found in Section 4.7.3, but the entire Sections 4.7, 4.8 and 4.9 are highly
relevant.
Approaching the question
i. A histogram compatible with what the examiners were expecting to see is shown below.
Marks were awarded for including the title, labelling correctly and accurately drawing
the figure. Note that it is essential (and more convenient) to draw the figure on the
graph paper provided; marks will be withdrawn otherwise.
[Figure: ‘Histogram of Heights’. Horizontal axis: ‘Heights of women in centimeters’, marked from 160 to 185. Vertical axis: ‘Frequency Densities’, marked from 0.00 to 0.12.]
ii. • Median: 168 centimetres. Note: Raw data should be used, not grouped data. Also,
make sure to mention the units to get the full marks.
• Upper quartile: 169 centimetres. Note: Same as above.
• Percentage: 3/30 = 10%. Note: As the question asks for a percentage, make sure to
report 10%, not just 3/30 or anything else.
iii. Based on the shape of the histogram, we can see that the distribution of the data is
positively skewed. Also two women, with heights of 184 cm and 185 cm, may be regarded
as outliers. Note: It is important to identify the specific outliers (184 cm and 185 cm)
not just write ‘there are two outliers’.
iv. A boxplot, stem-and-leaf diagram or dot plot are other types of suitable graphical
displays. Height is a measurable variable, and these graphs are suitable for displaying
the distribution of such variables.
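The summary values in part (a) ii. can be checked from the raw data with the Python sketch below. Quartile conventions differ between textbooks and software; this sketch assumes the ‘median of the upper half’ convention, which reproduces the examiners' value for these data but is not necessarily the definition used in the subject guide.

```python
from statistics import median

heights = [
    162, 164, 164, 165, 165,
    166, 166, 166, 167, 167,
    167, 167, 167, 168, 168,
    168, 168, 168, 168, 169,
    169, 169, 169, 170, 170,
    170, 171, 172, 184, 185,
]

ordered = sorted(heights)
n = len(ordered)

# Median of 30 values: the average of the 15th and 16th ordered values.
med = median(ordered)

# Upper quartile, taken here as the median of the upper half (one convention).
upper_quartile = median(ordered[n // 2:])

# Percentage of women strictly below 165 cm.
pct_below_165 = 100 * sum(h < 165 for h in ordered) / n

print(med, upper_quartile, pct_below_165)
```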
(b) A random sample of 9 people tried a specific diet that lasted 2 months to lose
weight. The weights of these people, measured in kilograms, were measured
both at the beginning and the end of the diet, and are shown in the table below:
Weight before diet   Weight after diet
        75                  73
        76                  72
        90                  92
        92                  93
        89                  89
        63                  61
        65                  62
        80                  76
        90                  84
i. Carry out an appropriate hypothesis test to determine whether the diet is
effective in helping people lose weight. State the test hypotheses, and specify
your test statistic and its distribution under the null hypothesis. Comment
on your findings.
ii. State any assumptions you made in i.
iii. Give a 90% confidence interval for the difference between the means of the
weights before and after the diet.
(12 marks)
Reading for this question
Look up the sections about hypothesis testing for testing a difference between two
population means. However, it is essential for this part to focus on the section regarding
paired samples (Section 8.16.4).
Approaching the question
i. Regarding hypotheses, note that the wording ‘effective’ suggests a one-sided test: the
diet is effective if the mean weight after the diet is lower than the mean weight before it.
Hence we test:
H0 : µbefore = µafter
vs.
H1 : µbefore > µafter .
In this part, it is also essential to realise that we have a paired sample, as we have two
observations for each person (before and after the diet). Hence the difference for each
person (weight after minus weight before) should be calculated:
−2, −4, 2, 1, 0, −2, −3, −4, −6
The next step is to calculate $s_d = 2.598$ and $\bar{x}_d = -2.0$, in order to obtain the value of
the test statistic:
$$t = \frac{\bar{x}_d - 0}{s_d/\sqrt{n}} = -2.309.$$
We have a t distribution with 8 degrees of freedom, hence the critical value (for a
one-sided test) at the 5% significance level is −1.860. Note: This is clearly a t
distribution; make sure not to use the standard normal distribution.
Hence, we reject H0 at the 5% significance level. Testing at the 1% significance level
gives a critical value of −2.896. Therefore, we do not reject H0 at the 1% level and
conclude that there is moderate evidence that the diet is effective.
ii. • Differences are normally distributed.
• Pairs of observations are independent.
iii. This is a standard exercise for confidence intervals given the appropriate formula from
the formula sheet (make sure to be able to recognise it). The requested confidence
interval is (−3.610, −0.390).
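The arithmetic in parts i. and iii. can be checked with a short Python sketch. It is only an illustration: the variable names are invented here, the differences are taken as weight after minus weight before, and the critical value 1.860 is copied from the t tables (8 degrees of freedom) rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

before = [75, 76, 90, 92, 89, 63, 65, 80, 90]
after = [73, 72, 92, 93, 89, 61, 62, 76, 84]

# Paired differences, taken as after - before (assumed sign convention).
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)

d_bar = mean(diffs)      # sample mean of the differences
s_d = stdev(diffs)       # sample standard deviation (divisor n - 1)
se = s_d / sqrt(n)       # estimated standard error of the mean difference
t_stat = d_bar / se

# Tabulated upper 5% point of the t distribution with 8 degrees of freedom.
T8_UPPER_5PCT = 1.860
reject_at_5pct = t_stat < -T8_UPPER_5PCT

# 90% confidence interval for the mean difference uses the same table value.
ci_90 = (d_bar - T8_UPPER_5PCT * se, d_bar + T8_UPPER_5PCT * se)

print(f"t = {t_stat:.3f}, reject at 5%: {reject_at_5pct}")
print(f"90% CI: ({ci_90[0]:.3f}, {ci_90[1]:.3f})")
```

The interval endpoints may differ from the quoted ones in the third decimal place, depending on where intermediate values are rounded.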
Question 4
(a) The director of a local Tourism Authority would like to know whether a family’s
annual expenditure on recreation (y), measured in $000s, is related to their
annual income (x), also measured in $000s. In order to explore this potential
relationship, the variables x and y were recorded for 10 randomly selected
families that visited the area last year. The results were as follows:
Week     x      y
#1      41.2    2.4
#2      50.1    2.7
#3      52.0    2.8
#4      62.0    8.0
#5      44.5    3.1
#6      37.7    2.1
#7      73.5   12.1
#8      37.5    2.0
#9      56.7    3.9
#10     65.2    8.9
The summary statistics for these data are:
Sum of x data: 520.4
Sum of the squares of x data: 28431.42
Sum of y data: 48
Sum of the squares of y data: 343.74
Sum of the products of x and y data: 2858.63
i. Draw a scatter diagram of these data on the graph paper provided. Label the
diagram carefully.
ii. Calculate the sample correlation coefficient. Interpret your findings.
iii. Calculate the least squares line of y on x and draw the line on the scatter
diagram.
iv. Do you find the analyses in ii. and iii. appropriate? Justify your answer and
suggest any alternative ways to model the relationship between x and y.
(13 marks)
Reading for this question
This is a standard linear regression question and the reading is to be found in Chapter 12 of
the subject guide. Section 12.6 provides details for scatter diagrams and is suitable for part
i., whereas the remaining parts are on correlation and regression which are covered in
Sections 12.8, 12.9 and 12.10 of the subject guide. Section 12.7 is also relevant. Sample
examination question 2 of this chapter is also recommended for practice on questions of this
type.
Approaching the question
i. Candidates are reminded that they are asked to draw and label the scatter diagram
which should include a title (‘Scatter diagram’ alone will not suffice) and labelled axes
which give their units in addition. Far too many candidates threw away marks by
neglecting these points and consequently were only given one mark out of the possible
four allocated for this part of the question. Another common way of losing marks was
failing to use the graph paper which was provided, and required, in the question.
Candidates who drew on the ordinary paper in their answer booklet were not awarded
marks for this part of the question.
[Figure: scatter diagram titled ‘Annual family recreation expenditure vs. Annual family income’. Horizontal axis: Annual family income in $000s. Vertical axis: Annual family recreation expenditure in $000s.]
ii. The summary statistics can be substituted into the formula for the sample correlation
coefficient (make sure you know which one it is!) to obtain the value 0.9222. An
interpretation of this value is the following: the data suggest that the higher family
annual income, the higher the family annual recreation expenditure. The fact that the
value is very close to 1, suggests that this is a strong, positive linear relationship.
Many candidates did not mention all three words (strong, positive, linear). Note that all
of these words provide useful information on interpreting the relationship and are
therefore required to obtain full marks.
iii. The regression line can be written by the equation $\hat{y} = a + bx$ or $y = a + bx + \varepsilon$. The
formula for b is:
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}$$
and by substituting the summary statistics we get b = 0.267.
The formula for a is $a = \bar{y} - b\bar{x}$, so we get a = −9.107.
Hence the regression line can be written as $\hat{y} = -9.107 + 0.267x$ or
$y = -9.107 + 0.267x + \varepsilon$. It should also be plotted on the scatter diagram.
Many candidates incorrectly reported the regression line as y = −9.107 + 0.267x. This
expression is wrong because it has neither the hat on y nor the error term; one of the two
expressions above is required.
iv. In this case, one can note in the scatter diagram that the points seem to be ‘scattered’
around a non-linear curve rather than a straight line. Another, equivalent, way to note
this is the presence of two outliers. Hence a linear regression model does not seem to be
a good model for the relationship between family annual income and family annual
recreation expenditure. Alternative approaches may involve the Spearman’s rank
correlation coefficient or transformations of the data, for example a log-transformation.
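The values quoted in parts ii. and iii. follow directly from the summary statistics given in the question. The Python sketch below (variable names are illustrative, not taken from the subject guide) shows one way to reproduce them.

```python
from math import sqrt

# Summary statistics given in the question.
n = 10
sum_x, sum_x2 = 520.4, 28431.42
sum_y, sum_y2 = 48.0, 343.74
sum_xy = 2858.63

x_bar = sum_x / n
y_bar = sum_y / n

# Corrected sums of squares and products.
s_xy = sum_xy - n * x_bar * y_bar
s_xx = sum_x2 - n * x_bar ** 2
s_yy = sum_y2 - n * y_bar ** 2

r = s_xy / sqrt(s_xx * s_yy)   # sample correlation coefficient
b = s_xy / s_xx                # least squares slope
a = y_bar - b * x_bar          # least squares intercept

print(f"r = {r:.4f}, b = {b:.3f}, a = {a:.3f}")
```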
(b) The fuel consumption of two different car models (A and B) was compared in
the following way. A random sample of 20 cars from model A and 35 cars from
model B were taken and the fuel consumption (in miles per gallon) was
measured for each car. The results are summarised in the table below.
                            Car Model A   Car Model B
Sample size                      20            35
Sample mean                      30.9          27.1
Sample standard deviation        6.11          6.41
i. Use an appropriate hypothesis test to determine whether the model A cars
can do more miles per gallon than model B cars. State clearly the
hypotheses, the test statistic and its distribution under the null hypothesis,
and carry out the test at two appropriate significance levels. Comment on
your findings.
ii. State clearly any assumptions you made in i.
iii. Provide a 95% confidence interval for the difference between the mean fuel
consumption of the two car models.
(12 marks)
Reading for this question
The first two parts of the question refer to a one-sided hypothesis test comparing two
population means. While the entire chapter on hypothesis testing is relevant (Chapter 8),
one can focus on Section 8.16 and in particular Sections 8.16.2 and 8.16.3 as the variances
are unknown. The last part of the question requires a confidence interval for the difference
between two population means, therefore Sections 7.13.2 and 7.13.3 are most relevant.
Approaching the question
i. Let µA denote the mean fuel consumption for car model A and µB the mean fuel
consumption for car model B.
The wording ‘can do more miles per gallon than’ implies a one-sided test, hence the
hypotheses can be written as:
H0 : µA = µB
vs.
H1 : µA > µB .
The test statistic formulae, depending on whether a pooled variance is used or not, are
provided in the formula sheet:
$$\frac{\bar{x} - \bar{y}}{\sqrt{s_A^2/n_A + s_B^2/n_B}} \quad \text{or} \quad \frac{\bar{x} - \bar{y}}{\sqrt{s_p^2\,(1/n_A + 1/n_B)}}.$$
If equal variances are assumed, the test statistic value is 2.150 (the pooled variance is
39.74). If equal variances are not assumed the test statistic value is 2.179.
Since the variances are unknown and the sample size is not large enough, the t50
distribution is being used. The critical value at the 5% significance level is 1.676, hence
we reject the null hypothesis. If we take a (smaller) α of 1%, the critical value is 2.390,
so we do not reject H0 . We conclude that there is moderate evidence of a difference in
the mean fuel consumption between the car models.
ii. The assumptions for i. were the following.
• Assumption about equal variances.
• Assumption about whether nA + nB is ‘large’ so that the normality assumption is
satisfied.
• Assumption about independent samples.
Some candidates stated assumptions in this part that were not made in part i. Marks
were not awarded in such cases. Also some other candidates just copied the phrase
‘assumption about equal variances’ and naturally were not awarded any marks. One
should state whether the calculations were based on the assumption that the unknown
variances are equal or unequal.
iii. Based on the t50 distribution and using the correct formula from the formula sheet (make
sure to be able to recognise it) the requested 95% confidence interval is (0.251, 7.349).
Note: In the solution above, the t50 distribution was used but the use of the standard
normal distribution is also justified as the sample size is relatively large. Hence a solution
based on the standard normal distribution is also acceptable.
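The test statistic values and the confidence interval above can be reproduced from the summary table. The following Python sketch is an illustration only: the names are invented here, and the value 2.009 for the upper 2.5% point of the t50 distribution is an assumed table value that is not quoted in the commentary.

```python
from math import sqrt

# Summary statistics from the table in the question.
n_a, mean_a, sd_a = 20, 30.9, 6.11
n_b, mean_b, sd_b = 35, 27.1, 6.41
diff = mean_a - mean_b

# Equal variances assumed: pooled variance and test statistic.
pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
se_pooled = sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_pooled = diff / se_pooled

# Equal variances not assumed.
se_unpooled = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
t_unpooled = diff / se_unpooled

# Assumed table value: upper 2.5% point of t with 50 degrees of freedom.
T50_UPPER_2_5PCT = 2.009
ci_95 = (diff - T50_UPPER_2_5PCT * se_pooled,
         diff + T50_UPPER_2_5PCT * se_pooled)

print(f"pooled variance = {pooled_var:.2f}")
print(f"t (pooled) = {t_pooled:.3f}, t (unpooled) = {t_unpooled:.3f}")
print(f"95% CI: ({ci_95[0]:.3f}, {ci_95[1]:.3f})")
```

Small differences in the third decimal place of the interval are possible, depending on how many digits of the table value and standard error are carried.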
Examiners’ commentaries 2016
ST104a Statistics 1
Important note
This commentary reflects the examination and assessment arrangements for this course in the
academic year 2015–16. The format and structure of the examination may change in future years,
and any such changes will be publicised on the virtual learning environment (VLE). Note that in
what follows the symbol • corresponds to 1 mark unless stated otherwise.
Information about the subject guide and the Essential reading
references
Unless otherwise stated, all cross-references will be to the latest version of the subject guide (2014).
You should always attempt to use the most recent edition of any Essential reading textbook, even if
the commentary and/or online reading list and/or subject guide refer to an earlier edition. If
different editions of Essential reading are listed, please check the VLE for reading supplements – if
none are available, please use the contents list and index of the new edition to find the relevant
section.
Comments on specific questions – Zone B
Candidates should answer THREE of the following FOUR questions: QUESTION 1 of Section
A (50 marks) and TWO questions from Section B (25 marks each).
Section A
Answer all parts of question 1 (50 marks in total).
Question 1
(a) A random sample of athletes’ times to run 200 metres has a sample mean of
24.96 seconds. State the units of measurements for the summaries below and
justify your answers.
i. sample variance
ii. sample standard deviation.
(4 marks)
Reading for this question
This question requires knowledge regarding measures of location and spread. Hence reading
of Sections 4.8 and 4.9 in the subject guide is essential and in particular Section 4.9.3. For
example, candidates should gain familiarity with the sample mean, median, variance and
standard deviation.
Approaching the question
The first thing to do is check the formulae for the sample variance and standard deviation.
It is then not hard to note that the sample variance, $s^2$, involves squared deviations of the
observations about the sample mean:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$
The units of measurement will therefore be seconds squared.
The formula for standard deviation s involves the square root of the sample variance:
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
hence we return to the original units of measurement, i.e. seconds.
Some candidates did not provide a justification for their choices, for example just reporting
seconds or seconds squared. Justification is essential however, and therefore the mention of
the formulae was essential to get full marks.
(b) Suppose that x1 = 4, x2 = −3, x3 = −7, x4 = 6, x5 = 2, and y1 = −6, y2 = 4,
y3 = −4, y4 = 0, y5 = 1. Calculate the following quantities:
i. $\sum_{i=2}^{4} x_i^2$
ii. $\sum_{i=1}^{3} 3x_i y_i$
iii. $y_3^3 + \sum_{i=4}^{5} \dfrac{y_i^4}{x_i}$.
(6 marks)
Reading for this question
This question refers to the basic bookwork which can be found in Section 2.9 of the subject
guide, and in particular Activity A1.6.
Approaching the question
Be careful to leave the $x_i$s and $y_i$s in the order given and only cover the values of i asked for.
This question was generally well done. The answers are as follows.
i. $\sum_{i=2}^{4} x_i^2 = (-3)^2 + (-7)^2 + 6^2 = 9 + 49 + 36 = 94$.
ii. $\sum_{i=1}^{3} 3x_i y_i = 3\sum_{i=1}^{3} x_i y_i = 3((4 \times -6) + (-3 \times 4) + (-7 \times -4)) = 3(-24 - 12 + 28) = -24$.
iii. $y_3^3 + \sum_{i=4}^{5} y_i^4/x_i = (-4)^3 + (0 + 1/2) = -63.5$.
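The main pitfall in this question is the range of the index i. A small Python sketch makes that explicit; the helper `index_range` is a hypothetical name introduced here to convert the 1-based summation limits into 0-based list positions.

```python
x = [4, -3, -7, 6, 2]
y = [-6, 4, -4, 0, 1]


def index_range(lower, upper):
    """Return 0-based positions for the 1-based summation limits lower..upper."""
    return range(lower - 1, upper)


# i. sum of x_i squared for i = 2, 3, 4
part_i = sum(x[i] ** 2 for i in index_range(2, 4))

# ii. sum of 3 * x_i * y_i for i = 1, 2, 3
part_ii = sum(3 * x[i] * y[i] for i in index_range(1, 3))

# iii. y_3 cubed plus the sum of y_i^4 / x_i for i = 4, 5
part_iii = y[2] ** 3 + sum(y[i] ** 4 / x[i] for i in index_range(4, 5))

print(part_i, part_ii, part_iii)
```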
(c) A population is normally distributed with a population mean of 76 and a
population standard deviation of 12.
i. State the distribution of the sample mean for simple random samples of size
n = 100.
ii. Given a simple random sample of size n = 100, determine the probability
that the sample mean will be less than 75.
(4 marks)
Reading for this question
This section examines the ideas of the normal random variable. Read the relevant section of
Chapter 6 of the subject guide and work out the examples and activities of this section. The
Sample examination questions are quite relevant. For the first part of the question it is
essential to check Section 6.9 of the subject guide.
Approaching the question
The first part just requires knowledge of the fact that if X is a normal random variable with
mean $\mu$ and variance $\sigma^2$, the sample mean from a sample of size n, $\bar{X}$, is also a normal
random variable with mean $\mu$ and variance $\sigma^2/n$. Direct application of this fact then yields
that:
$$\bar{X} \sim N\left(76, \frac{(12)^2}{100}\right) = N(76, 1.44).$$
For the second part, the basic property of the normal random variable for this question is
that if X ∼ N (µ, σ 2 ), then Z = (X − µ)/σ ∼ N (0, 1). Note also that:
* P (Z < a) = P (Z ≤ a) = Φ(a)
* P (Z > a) = P (Z ≥ a) = 1 − P (Z ≤ a) = 1 − P (Z < a) = 1 − Φ(a)
* P (a < Z < b) = P (a ≤ Z < b) = P (a < Z ≤ b) = P (a ≤ Z ≤ b) = Φ(b) − Φ(a).
The above is all you need to find the requested probability. We can write:
$$P(\bar{X} < 75) = P\left(Z < \frac{75 - 76}{\sqrt{1.44}}\right) = P(Z < -0.83) = 1 - \Phi(0.83) = 1 - 0.7967 = 0.2033.$$
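For readers who want to check the table lookup, the calculation can be sketched with `statistics.NormalDist` from the Python standard library. The variable names are illustrative. Note that software does not round z to two decimal places, so the unrounded answer differs slightly from the table-based value of 0.2033.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 76, 12, 100

# Standard deviation of the sample mean: sigma / sqrt(n) = 1.2.
se = sigma / sqrt(n)
z = (75 - mu) / se

# Probability using z rounded to two decimal places, as with statistical tables.
p_table = NormalDist().cdf(round(z, 2))

# Probability using the unrounded z value.
p_exact = NormalDist(mu, se).cdf(75)

print(f"z = {z:.4f}, table-style p = {p_table:.4f}, unrounded p = {p_exact:.4f}")
```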
(d) Classify each one of the following variables as measurable (continuous) or
categorical. If a variable is categorical, further classify it as nominal or ordinal.
Justify your answer. (Note that no marks will be awarded without justification.)
i. The weight of a chocolate bar produced in a factory.
ii. Responses to ‘what is your age group?’ in a questionnaire.
iii. The colour of a car.
iv. Inflation rates.
(8 marks)
Reading for this question
This question requires identifying types of variables so reading the relevant section in the
subject guide (Section 4.6) is essential. Candidates should gain familiarity with the notion
of a variable and be able to distinguish between discrete and continuous (measurable) data.
In addition to identifying whether a variable is categorical or measurable, further
distinctions between ordinal and nominal categorical variable should be made by candidates.
Approaching the question
A general tip for identifying continuous and categorical variables is to think of the possible
values they can take. If these are finite and represent specific entities the variable is
categorical. Otherwise, if these consist of numbers corresponding to measurements, the data
are continuous and the variable is measurable. Such variables may also have measurement
units or can be measured to various decimal places.
i. Measurable because the weight can be measured, for example, in grammes to several
decimal places such as 499.28 g.
ii. Age groups are in a ranked order, for example [18, 30), [30, 40) etc. It is therefore a
categorical ordinal variable.
iii. Each colour (black, white, red, etc.) is a category. Also, there is no natural ordering
between the colours, for example we cannot really say that ‘blue is higher than red’. This
is therefore a categorical nominal variable.
iv. Measurable because inflation rates are quoted to several decimal places, for example
1.50%.
Weak candidates did not provide a justification for their choices, classified measurable
variables as nominal or categorical, and sometimes answered ordinal when their justification
was pointing to a nominal variable. Phrases like ‘It is measurable because it can be
measured’ were also not awarded any marks.
(e) The random variable X takes the values 0, 1 and 3 according to the following
probability distribution:
x          0     1     3
pX(x)      0.4   k     k
i. Determine the constant k.
ii. Find E(X), the expected value of X.
iii. Find Var(X), the variance of X.
(5 marks)
Reading for this question
This is a question on probability, exploring the concepts of relative frequency, conditional
probability and probability distribution. Reading from Chapter 5 of the subject guide is
suggested with focus on the sections on these topics. Try Activity A5.1 and the exercises on
probability trees.
Approaching the question
i. $\sum_i p(x_i) = 1$, hence k = 0.3.
ii. $E(X) = \sum_i x_i\,p(x_i) = 0 \times 0.4 + 1 \times 0.3 + 3 \times 0.3 = 1.2$.
iii. $E(X^2) = \sum_i x_i^2\,p(x_i) = 0^2 \times 0.4 + 1^2 \times 0.3 + 3^2 \times 0.3 = 3.0$. Hence:
$$\text{Var}(X) = 3.0 - (1.2)^2 = 1.56.$$
An alternative method to find the variance is through the formula $\sum_i (x_i - \mu)^2\,p(x_i)$,
where $\mu$ is found in part ii.
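Both routes to the variance can be compared in a few lines of Python. This sketch uses exact fractions to avoid rounding noise; the variable names are illustrative.

```python
from fractions import Fraction

values = [0, 1, 3]
p_zero = Fraction(4, 10)

# The probabilities must sum to 1, and the two unknown probabilities are equal.
k = (1 - p_zero) / 2
probs = [p_zero, k, k]

expected = sum(x * p for x, p in zip(values, probs))
expected_sq = sum(x ** 2 * p for x, p in zip(values, probs))

# Variance via E(X^2) - E(X)^2 and via the definition; both should agree.
variance = expected_sq - expected ** 2
variance_alt = sum((x - expected) ** 2 * p for x, p in zip(values, probs))

print(float(k), float(expected), float(variance), float(variance_alt))
```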
(f ) An engine encounters a standard environment with a probability of 0.9, and a
severe environment with a probability of 0.1. In a normal environment the
probability of failure is 0.03, whereas in the severe environment this probability
is 0.5.
i. What is the probability of failure?
ii. Given that failure has occurred, what is the probability that the environment
encountered was severe?
(4 marks)
Reading for this question
This is a question on probability and targets mostly the material of Chapter 5 in the subject
guide. It is essential to practise on such exercises through the learning activities and
exercises of this chapter as well as the material on the VLE. In particular you can attempt
Learning activity A5.6 and Sample examination question 5. It is also useful to familiarise
yourself with probability trees as they can be quite handy in such exercises.
Approaching the question
The first part was straightforward for candidates familiar with this section, requiring the use
of the law of total probability (although it can also be calculated using common intuition).
Part ii. requires knowledge of the conditional probability definition or, alternatively,
knowledge of Bayes’ theorem.
The working of the exercise is given below.
i. We have:
P (F ) = P (F | N ) P (N ) + P (F | S) P (S) = 0.03 × 0.9 + 0.5 × 0.1 = 0.077.
ii. We have:
$$P(S \mid F) = \frac{P(F \mid S)\,P(S)}{P(F)} = \frac{0.05}{0.077} = \frac{50}{77} = 0.6494 \approx 0.65.$$
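The two steps, the law of total probability followed by Bayes' theorem, can be sketched in Python as follows. The variable names are illustrative.

```python
# Prior probabilities of each environment.
p_normal = 0.9
p_severe = 0.1

# Conditional probabilities of failure in each environment.
p_fail_given_normal = 0.03
p_fail_given_severe = 0.5

# i. Law of total probability.
p_fail = p_fail_given_normal * p_normal + p_fail_given_severe * p_severe

# ii. Bayes' theorem: probability the environment was severe, given failure.
p_severe_given_fail = p_fail_given_severe * p_severe / p_fail

print(f"P(F) = {p_fail:.3f}, P(S | F) = {p_severe_given_fail:.4f}")
```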
(g) A museum conducts a survey of its visitors in order to assess the popularity of a
device which is used to provide information on the museum exhibits. The
device will be withdrawn if fewer than 25% of all of the museum’s visitors
make use of it. Of a random sample of 100 visitors, 20 chose to use the device.
i. Carry out an appropriate hypothesis test at the 5% significance level to see if
the device should be withdrawn and state your conclusions.
ii. Calculate the p-value of the test.
(7 marks)
Reading for this question
This question refers to a one-sided hypothesis test examining whether the proportion of all
museum visitors who use the device is less than 25%. While the entire chapter (Chapter 8 of
the subject guide) on hypothesis testing is relevant, one can focus on the relevant section
for a single proportion, Section 8.14. Note also that reading on one-tailed (and two-tailed)
hypothesis tests is located in Section 8.10. The second part of the question looks at
p-values, and the relevant section in the subject guide is Section 8.11.
Approaching the question
It is essential to identify the type of hypothesis test required for this question. Since there is
only one variable involved it will have to be a test for a single proportion, and the test
statistic can be found in the formula sheet. Make sure to substitute the relevant quantities
carefully and avoid any numerical errors in the calculation.
The remaining steps involve finding the critical values from the corresponding statistical
table for the relevant significance level, deciding whether to reject H0 , and interpreting the
results in the context of the problem. The working of the first part of the exercise is given
below.
• H0 : π = 0.25 vs. H1 : π < 0.25.
• The sample proportion is p = 20/100 = 0.20. The standard error of the sample
proportion is $\sqrt{0.25 \times 0.75/100} = 0.0433$. The test statistic value is:
$$t = \frac{0.2 - 0.25}{0.0433} = -1.15.$$
• For α = 0.05, the critical value is −1.645.
• Decision: do not reject H0 .
• No evidence that fewer than 25% of visitors make use of the device.
The second part of the question requires the use of p-values and challenged most candidates.
The exercise does not require lengthy calculations and can be derived in a relatively
straightforward manner if one is familiar with the material of Section 8.11 of the subject
guide. Once the test statistic is calculated (t = −1.15 from the first part) one simply needs
to calculate, where Z ∼ N (0, 1):
P (Z ≤ −1.15) = 1 − Φ(1.15) = 1 − 0.8749 = 0.1251.
Note: The last three marks of the first part can also be awarded by correct use of the
p-value, see below.
• The p-value is higher than α = 0.05.
• Decision: do not reject H0 .
• No evidence that fewer than 25% of visitors make use of the device.
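The test and its p-value can be reproduced with a short Python sketch based on `statistics.NormalDist`. The names are illustrative, and the test statistic is rounded to two decimal places before the p-value is computed, to mimic the use of statistical tables. The unrounded statistic gives a marginally smaller p-value.

```python
from math import sqrt
from statistics import NormalDist

pi_0 = 0.25          # proportion under the null hypothesis
n, users = 100, 20
p_hat = users / n

# The standard error is evaluated under H0.
se = sqrt(pi_0 * (1 - pi_0) / n)
z = (p_hat - pi_0) / se

# Lower-tail critical value at the 5% significance level.
critical = NormalDist().inv_cdf(0.05)
reject_h0 = z < critical

# p-value for the lower-tailed test, table-style and unrounded.
p_value_table = NormalDist().cdf(round(z, 2))
p_value_exact = NormalDist().cdf(z)

print(f"z = {z:.2f}, critical value = {critical:.3f}, reject H0: {reject_h0}")
print(f"p-value (tables) = {p_value_table:.4f}, unrounded = {p_value_exact:.4f}")
```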
(h) State whether the following are true or false and give a brief explanation. (Note
that no marks will be awarded for a simple true/false answer.)
i. The range of a sample is influenced by extreme values.
ii. A sampling distribution is the probability distribution of a population
parameter.
iii. A sample correlation coefficient close to −1 indicates a strong negative linear
relationship between two categorical variables.
iv. A p-value of 0.007 represents a weakly significant hypothesis test result.
v. Failure to reject a null hypothesis might indicate that a Type I error has
been committed.
vi. A stratified random sample is the random equivalent of a convenience sample.
(12 marks)
Reading for this question
This question contains material from various parts of the subject guide. Here, it is more
important to have a good intuitive understanding of the relevant concepts than the technical
level in computations. Part i. concerns measures of spread that can be found in Section 4.9
of the subject guide. Part ii. enquires about the sampling distribution which is defined in
Section 6.9. Part iii. is about correlation (see Section 12.8) and types of variables (see
Section 4.6). Part iv. targets the concepts of a p-value covered in Section 8.11, whereas part
v. looks at types of error in hypothesis testing (Section 8.7). Finally, part vi. requires
material from Chapter 10 and in particular Section 10.7 on types of sampling.
Approaching the question
Candidates always find this type of question tricky. It requires a brief explanation of the
reason for a true/false answer and not just a choice between the two. Some candidates lost
marks for long rambling explanations without a decision as to whether a statement was true
or false.
i. True. The range is defined as $x_{(n)} - x_{(1)}$, so any extreme values would be $x_{(1)}$ and/or
$x_{(n)}$, hence influencing the range.
ii. False. A sampling distribution is the probability distribution of a sample statistic.
iii. False. A value of r close to −1 indicates a strong, negative linear relationship between
two measurable (continuous) variables.
iv. False. A p-value of 0.007 represents a highly significant hypothesis test result. Weakly
significant means a p-value between 0.05 and 0.10.
v. False. Failure to reject a null hypothesis might indicate that a Type II error has been
committed.
vi. False. A quota sample is the non-random equivalent of a stratified random sample.
Section B
Answer two out of the three questions from this section (25 marks each).
Question 2
(a) A sample consisting of 400 randomly selected students was classified in terms of
personality type (introvert or extrovert) and in terms of their favourite colour
(red, yellow, green or blue). Their responses are summarised in the table below:
                Personality type
           Introvert   Extrovert   Total
Red            32          68       100
Yellow         26          74       100
Green          21          79       100
Blue           46          54       100
Total         125         275       400
i. Based on the data in the table, and without conducting any significance test,
would you say there is an association between the student’s type of
personality and colour preference?
ii. Calculate the χ2 statistic and use it to test for independence, using a 5%
significance level. What do you conclude?
(14 marks)
Reading for this question
This part targets Chapter 8 of the subject guide on contingency tables and chi-squared
tests. Note that part i. of the question does not require any calculations, just understanding
and interpreting contingency tables. Candidates can attempt Activity A8.4 to practise. Part
ii. is a straightforward chi-squared test and the reading is also given in Chapter 8. Look also
at Activity A8.4.
Approaching the question
i. There are some differences in rates of introvert students for each colour preference. More
specifically, 21% of the students who prefer the green colour are introvert, whereas the
corresponding proportion for students who prefer red is 32%, and for students preferring
blue is 46%. Hence, there seems to be an association between personality type and
colour preference, although this needs to be investigated further. (Note: the conclusion
of the last sentence must be stated to get full marks.)
ii. Set out the null hypothesis that there is no association between personality type and
colour preference against the alternative that there is an association. Be careful to get
these the correct way round!
H0 : No association between the personality type and colour preference.
vs.
H1 : Association between personality type and colour preference.
Work out the expected values to obtain the table below.
           Introvert   Extrovert
Red          31.25       68.75
Yellow       31.25       68.75
Green        31.25       68.75
Blue         31.25       68.75
The test statistic formula is:
$$\sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$
which gives a value of 16.33. This is a 4 × 2 contingency table, so the degrees of freedom
are (4 − 1) × (2 − 1) = 3.
For α = 0.05, the critical value is 7.815, hence we reject H0 .
We conclude that there is evidence of an association between personality type and colour
preference.
Many candidates looked up the tables incorrectly and so failed to follow through their
earlier accurate work.
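The expected frequencies and the test statistic can be checked with the Python sketch below. It is an illustration with invented names; the critical value 7.815 is copied from the chi-squared tables quoted above rather than computed.

```python
# Observed counts: (introvert, extrovert) for each favourite colour.
observed = {
    "Red": (32, 68),
    "Yellow": (26, 74),
    "Green": (21, 79),
    "Blue": (46, 54),
}

row_totals = {colour: sum(counts) for colour, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values()) for j in range(2)]
grand_total = sum(row_totals.values())

chi_sq = 0.0
for colour, counts in observed.items():
    for j, obs in enumerate(counts):
        # Expected count under independence: row total * column total / n.
        exp = row_totals[colour] * col_totals[j] / grand_total
        chi_sq += (obs - exp) ** 2 / exp

df = (len(observed) - 1) * (len(col_totals) - 1)

# Tabulated 5% critical value of the chi-squared distribution with 3 df.
CRITICAL_5PCT = 7.815
reject_h0 = chi_sq > CRITICAL_5PCT

print(f"chi-squared = {chi_sq:.2f}, df = {df}, reject H0: {reject_h0}")
```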
(b) i. Describe how quota sampling is performed and explain how it differs from
stratified random sampling.
ii. A company producing handheld electronic devices (tablets, mobile phones
etc.) wants to understand how men and women rate its products. For this
reason, the company’s management has decided to use a survey of its
customers and has asked you to devise an appropriate random sampling
scheme. Outline the key components of your sampling scheme.
(11 marks)
Reading for this question
This question was on basic material on survey designs. Background reading is given in
Chapters 10 and 11 of the subject guide which, along with the recommended reading, should
be looked at carefully. Candidates were expected to have studied and understood the main
important constituents of design in random sampling. It is also a good idea to try the
Learning activities of Chapter 10.
Approaching the question
One of the main things to avoid in this part is writing essays without any structure. This
exercise asks for specific things and each one of them requires one or two lines. If you are
unsure of what these things are, do not write lengthy essays. Doing so gains no marks and
wastes valuable examination time. If you can identify what is being asked, keep in mind
that the answer should not be long. Note also that in some cases there is no unique
answer to the question.
The marking scheme and some model answers are given below.
i. Description of quota sampling: the interviewer is given specific quota controls on certain
specified characteristics, such as age, gender, social class etc. and then interviews people
until these quotas are reached. See page 159 of the subject guide for a more detailed
description.
Quota sampling differs from stratified random sampling in the following ways.
∗ Stratified random sampling is probability sampling, whereas quota sampling is
non-probability sampling.
∗ In stratified random sampling a sampling frame is required, whereas in quota
sampling pre-chosen frequencies in each category are sought.
ii. As mentioned earlier, it is crucial in this type of question to avoid long answers. Also,
note that there is no unique answer. A possible set of ‘ingredients’ of an answer is given
below (each bullet point corresponds to a mark).
• Propose stratified sampling since customers of all ages are to be surveyed.
• Sampling frame could be the company’s customer database.
• Take a simple random sample from each stratum.
• Stratification factors should include gender.
• Other stratification factors could be age, country of residence, etc.
• Contact method: mail, telephone or email (likely to have all details on database).
• Minimise non-response through a suitable incentive, such as discount off the next
purchase.
Question 3
(a) A policeman recorded the speed of 30 cars on a road with a 30 miles per hour
speed limit. The recorded data are shown below:
25.6   26.2   27.9   28.8   29.2   30.1
25.7   26.9   27.9   28.9   29.3   30.1
25.7   27.5   28.3   28.9   29.5   30.2
25.8   27.7   28.4   29.0   29.7   36.2
25.8   27.8   28.5   29.1   29.8   36.9
i. Carefully construct, draw and label a histogram of these data on the graph
paper provided.
ii. Find the median speed among these cars and the upper quartile. What
percentage of drivers were exceeding the 30 miles per hour speed limit?
iii. Comment on the data given the shape of the histogram without doing any
further calculations.
iv. Name two other types of graphical displays that would be suitable to
represent the data.
(13 marks)
Reading for this question
Chapter 4 provides all the relevant material for this question. More specifically, reading on
histograms can be found in Section 4.7.3, but the entire Sections 4.7, 4.8 and 4.9 are highly
relevant.
Approaching the question
i. A histogram compatible with what the examiners were expecting to see is shown below.
Marks were awarded for including the title, labelling correctly and accurately drawing
the figure. Note that it is essential (and more convenient) to draw the figure on the
graph paper provided; marks will be withdrawn otherwise.
[Figure: ‘Histogram of Speeds’. Horizontal axis: Speeds in miles per hour (24 to 38). Vertical axis: Frequency densities (0.00 to 0.20).]
ii. • Median: 28.65 miles per hour. Note: Raw data should be used, not grouped data.
Also, make sure to mention the units to get the full marks.
• Upper quartile: 29.45 miles per hour. Note: Same as above.
• Percentage: 5/30 = 16.67%. Note: As the question asks for a percentage, make sure
to report 16.67% (17% is also fine), not just 5/30 or anything else.
iii. Based on the shape of the histogram, we can see that the distribution of the data is
positively skewed. Also two cars, with speeds 36.2 and 36.9 miles per hour, may be
regarded as outliers. Note: It is important to identify the specific outliers (36.2 and 36.9
miles per hour) not just write ‘there are two outliers’.
iv. A boxplot, a stem-and-leaf diagram and a dot plot are other suitable types of graphical
display. The variable speed is measurable, and these graphs are suitable for displaying
the distribution of such variables.
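The figures quoted in part ii. can be checked with a short script. This is only a sketch using the Python standard library: the variable and function names are illustrative, and the quartile convention (linear interpolation between order statistics, the "inclusive" method) is an assumption that happens to reproduce the 29.45 reported above. Other quartile conventions give slightly different values.

```python
from statistics import median, quantiles

# Recorded speeds in miles per hour, copied from the question.
speeds = [
    25.6, 26.2, 27.9, 28.8, 29.2, 30.1,
    25.7, 26.9, 27.9, 28.9, 29.3, 30.1,
    25.7, 27.5, 28.3, 28.9, 29.5, 30.2,
    25.8, 27.7, 28.4, 29.0, 29.7, 36.2,
    25.8, 27.8, 28.5, 29.1, 29.8, 36.9,
]

SPEED_LIMIT = 30.0  # miles per hour


def summarise(data, limit):
    """Return (median, upper quartile, percentage of values above limit)."""
    med = median(data)
    # Assumed convention: linear interpolation between order statistics.
    _, _, upper_quartile = quantiles(data, n=4, method="inclusive")
    pct_over = 100 * sum(1 for v in data if v > limit) / len(data)
    return med, upper_quartile, pct_over


med, upper_q, pct_over = summarise(speeds, SPEED_LIMIT)
print(f"Median: {med:.2f} miles per hour")
print(f"Upper quartile: {upper_q:.2f} miles per hour")
print(f"Exceeding the limit: {pct_over:.2f}%")
```

Working from the raw list rather than from grouped class frequencies mirrors the note in part ii. that raw data should be used.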
(b) A random sample of 9 students received special training to improve their
performance on IQ tests. Each of the 9 students took an IQ test before and
after the training and their scores are shown in the table below:
IQ score before training   105   116   120    93   119   133    75    86    90
IQ score after training    107   120   118    92   119   135    78    90    96
i. Carry out an appropriate hypothesis test to determine whether the special
training is effective for increasing the average IQ score. State the test
hypotheses, and specify your test statistic and its distribution under the null
hypothesis. Comment on your findings.
ii. State any assumptions you made in i.
iii. Give a 90% confidence interval for the difference between the means of the
IQ scores before and after training.
(12 marks)
Reading for this question
Look up the sections about hypothesis testing for testing a difference between two
population means. However, it is essential for this part to focus on the section regarding
paired samples (Section 8.16.4).
Approaching the question
i. Regarding hypotheses, note that the wording ‘increasing’ suggests a one-sided test.
Hence we test:
$$H_0: \mu_{\text{before}} = \mu_{\text{after}} \quad \text{vs.} \quad H_1: \mu_{\text{before}} < \mu_{\text{after}}.$$
In this part, it is also essential to realise that we have a paired sample, as we have two
observations for each person (before and after the special training). Hence the difference
for each person should be calculated:
Difference (after − before):   2   4   −2   −1   0   2   3   4   6
The next step is to calculate $s_d = 2.598$ and $\bar{x}_d = 2.0$, in order to obtain the value of the
test statistic:
$$t = \frac{\bar{x}_d - 0}{s_d/\sqrt{n}} = 2.309.$$
We have a t distribution with 8 degrees of freedom, hence the critical value (for a
one-sided test at the 5% significance level) is 1.860. Note: This is clearly a t distribution;
make sure not to use the standard normal distribution.
Since 2.309 > 1.860, we reject H0 at the 5% significance level. Testing at the 1%
significance level gives a critical value of t8, 0.01 = 2.896. Since 2.309 < 2.896, we do not
reject H0 at the 1% level, and we conclude that there is moderate evidence that the
special training is effective.
ii. • Differences are normally distributed.
• Pairs of observations are independent.
iii. This is a standard exercise for confidence intervals given the appropriate formula from
the formula sheet (make sure to be able to recognise it). The requested confidence
interval is (0.390, 3.610).
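The paired-sample calculation in parts i. and iii. can be reproduced as follows. This is a minimal sketch using only the Python standard library. The function and variable names are illustrative, the critical values 1.860 and 2.896 are simply copied from the commentary above rather than computed, and the interval is taken for the 'after minus before' difference, which is an assumption about the intended direction.

```python
from math import sqrt
from statistics import mean, stdev

# IQ scores from the question, one pair per student.
before = [105, 116, 120, 93, 119, 133, 75, 86, 90]
after = [107, 120, 118, 92, 119, 135, 78, 90, 96]

# Critical values of the t distribution with 8 degrees of freedom,
# as quoted in the commentary (not computed here).
T8_UPPER_5PCT = 1.860  # one-sided 5% test; also gives a two-sided 90% interval
T8_UPPER_1PCT = 2.896  # one-sided 1% test


def paired_t(before_scores, after_scores):
    """Return (differences, mean difference, sd, standard error, t statistic)."""
    diffs = [a - b for a, b in zip(after_scores, before_scores)]
    d_bar = mean(diffs)
    s_d = stdev(diffs)  # sample standard deviation (divisor n - 1)
    se = s_d / sqrt(len(diffs))
    return diffs, d_bar, s_d, se, d_bar / se


diffs, d_bar, s_d, se, t_stat = paired_t(before, after)
ci_90 = (d_bar - T8_UPPER_5PCT * se, d_bar + T8_UPPER_5PCT * se)

print(f"t = {t_stat:.3f}")
print(f"Reject H0 at 5%: {t_stat > T8_UPPER_5PCT}")
print(f"Reject H0 at 1%: {t_stat > T8_UPPER_1PCT}")
print(f"90% CI: ({ci_90[0]:.3f}, {ci_90[1]:.3f})")
```

The end points produced this way agree with the interval quoted in part iii. up to rounding in the third decimal place.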
Question 4
(a) An insurance company wants to relate the amount of fire damage (y) in major
residential fires to the distance between the residence and the nearest fire
station (x). For this reason, a study was conducted in a large suburb of a major
city based on a sample of 10 recent fires in this suburb. For each of these fires,
the variables x and y were recorded and are shown in the table below:
Fire     x     y
#1      3.4   2.6
#2      1.8   1.8
#3      4.6   5.9
#4      2.3   2.3
#5      3.1   2.8
#6      5.5   8.6
#7      0.7   1.4
#8      3.0   2.3
#9      2.6   2.0
#10     4.3   5.7
The summary statistics for these data are:
Sum of x data: 31.3 Sum of the squares of x data: 115.85
Sum of y data: 35.4 Sum of the squares of y data: 175.64
Sum of the products of x and y data: 138.08
i. Draw a scatter diagram of these data on the graph paper provided. Label the
diagram carefully.
ii. Calculate the sample correlation coefficient. Interpret your findings.
iii. Calculate the least squares line of y on x and draw the line on the scatter
diagram.
iv. Do you find the analyses in ii. and iii. appropriate? Justify your answer and
suggest any alternative ways to model the relationship between x and y.
(13 marks)
Reading for this question
This is a standard linear regression question and the reading is to be found in Chapter 12 of
the subject guide. Section 12.6 provides details for scatter diagrams and is suitable for part
i., whereas the remaining parts are on correlation and regression which are covered in
Sections 12.8, 12.9 and 12.10 of the subject guide. Section 12.7 is also relevant. Sample
examination question 2 of this chapter is also recommended for practice on questions of this
type.
Approaching the question
i. Candidates are reminded that they are asked to draw and label the scatter diagram
which should include a title (‘Scatter diagram’ alone will not suffice) and labelled axes
which give their units in addition. Far too many candidates threw away marks by
neglecting these points and consequently were only given one mark out of the possible
four allocated for this part of the question. Another common way of losing marks was
failing to use the graph paper which was provided, and required, in the question.
Candidates who drew on the ordinary paper in their answer booklet were not awarded
marks for this part of the question.
[Figure: scatter diagram titled 'Amount of fire damage vs. Distance from nearest fire station'. Horizontal axis: Distance between residence and the nearest fire station. Vertical axis: Amount of fire damage.]
ii. The summary statistics can be substituted into the formula for the sample correlation
coefficient (make sure you know which one it is!) to obtain the value 0.9093. An
interpretation of this value is the following: the data suggest that the greater the
distance of the residence from the nearest fire station, the higher the amount of fire
damage. The fact that the value is very close to 1, suggests that this is a strong, positive
linear relationship.
Many candidates did not mention all three words (strong, positive, linear). Note that all
of these words provide useful information on interpreting the relationship and are
therefore required to obtain full marks.
iii. The regression line can be written by the equation $\hat{y} = a + bx$ or $y = a + bx + \varepsilon$. The
formula for b is:
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}$$
and by substituting the summary statistics we get b = 1.526.
The formula for a is $a = \bar{y} - b\bar{x}$, so we get a = −1.235.
Hence the regression line can be written as $\hat{y} = -1.235 + 1.526x$ or
$y = -1.235 + 1.526x + \varepsilon$. It should also be plotted on the scatter diagram.
Many candidates incorrectly reported the regression line as y = −1.235 + 1.526x. This
expression is incorrect, since it describes neither the fitted values nor the model with its
error term; one of the two expressions above is required.
iv. In this case, one can note in the scatter diagram that the points seem to be 'scattered'
around a non-linear curve rather than a straight line. Another way to note this is the
presence of two outliers. Hence a linear regression model does not seem to be
a good model for the relationship between the amount of fire damage and the distance
from the nearest fire station. Alternative approaches may involve Spearman's rank
correlation coefficient or transformations of the data, for example the log-transformation.
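The values in parts ii. and iii. follow directly from the summary statistics given in the question. The sketch below, with illustrative function and variable names, substitutes them into the usual formulae for the sample correlation coefficient and the least squares coefficients using only the Python standard library.

```python
from math import sqrt

# Summary statistics as given in the question.
n = 10
sum_x, sum_x2 = 31.3, 115.85
sum_y, sum_y2 = 35.4, 175.64
sum_xy = 138.08


def correlation_and_line(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Return (r, a, b) for the least squares line y-hat = a + b*x."""
    x_bar, y_bar = sum_x / n, sum_y / n
    s_xy = sum_xy - n * x_bar * y_bar   # corrected sum of products
    s_xx = sum_x2 - n * x_bar ** 2      # corrected sum of squares of x
    s_yy = sum_y2 - n * y_bar ** 2      # corrected sum of squares of y
    r = s_xy / sqrt(s_xx * s_yy)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return r, a, b


r, a, b = correlation_and_line(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(f"r = {r:.4f}")
print(f"y-hat = {a:.3f} + {b:.3f} x")
```

A correlation close to 1 does not by itself confirm that a straight line is appropriate, which is why part iv. asks candidates to look at the scatter diagram as well.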
(b) The 55 university students on a certain course were randomly assigned to two
class groups of size 30 and 25 students respectively. At the end of the year, all
students took the examination and their marks are summarised in the table
below.
                 Sample size   Sample mean   Sample standard deviation
Class Group 1        30           75.33               7.61
Class Group 2        25           71.40               6.37
i. Use an appropriate hypothesis test to determine whether the students of
class group 1 were better in terms of examination marks. State clearly the
hypotheses, the test statistic and its distribution under the null hypothesis,
and carry out the test at two appropriate significance levels. Comment on
your findings.
ii. State clearly any assumptions you made in i.
iii. Provide a 95% confidence interval for the difference between the mean exam
marks of the two class groups.
(12 marks)
Reading for this question
The first two parts of the question refer to a hypothesis test comparing two
population means. While the entire chapter on hypothesis testing is relevant (Chapter 8),
one can focus on Section 8.16 and in particular Sections 8.16.2 and 8.16.3 as the variances
are unknown. The last part of the question requires a confidence interval for the difference
between two population means, therefore Sections 7.13.2 and 7.13.3 are most relevant.
Approaching the question
i. Let µA denote the mean examination mark for class group 1 and µB the mean
examination mark for class group 2.
The wording ‘were better in terms of examination marks’ implies a one-sided test, hence
the hypotheses can be written as:
H0 : µA = µB
vs.
H1 : µA > µB .
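The test statistic formulae, depending on whether a pooled variance is used or not, are
provided in the formula sheet:
$$\frac{\bar{x} - \bar{y}}{\sqrt{s_A^2/n_A + s_B^2/n_B}} \quad \text{or} \quad \frac{\bar{x} - \bar{y}}{\sqrt{s_p^2\,(1/n_A + 1/n_B)}},$$
where $\bar{x}$ and $\bar{y}$ are the sample means of class groups 1 and 2, respectively.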
If equal variances are assumed, the test statistic value is 2.0511 (the pooled variance is
50.06). If equal variances are not assumed the test statistic value is 2.0848.
Since the variances are unknown and the sample sizes are not large, a t distribution is
used. The t50 distribution is used here (50 being close to the nA + nB − 2 = 53 degrees of
freedom of the pooled test). The critical value at the 5% significance level is 1.676, hence
we reject the null hypothesis. If we take a (smaller) α of 1%, the critical value is 2.403,
so we do not reject H0. We conclude that there is moderate evidence that the mean
examination mark of class group 1 is higher than that of class group 2.
ii. The assumptions made in i. were the following.
• Assumption about equal variances.
• Assumption about whether nA + nB is ‘large’ so that the normality assumption is
satisfied.
• Assumption about independent samples.
Some candidates stated assumptions in this part that were not made in part i. Marks
were not awarded in such cases. Also some other candidates just copied the phrase
‘assumption about equal variances’ and naturally were not awarded any marks. One
should state whether the calculations were based on the assumption that the unknown
variances are equal or unequal.
iii. Based on the t50 distribution and using the correct formula from the formula sheet (make
sure to be able to recognise it) the requested 95% confidence interval is (0.082, 7.778).
Note: In the solution above, the t50 distribution was used but the use of the standard
normal distribution is also justified as the sample size is relatively large. Hence a solution
based on the standard normal distribution is also acceptable.
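The two test statistic values and the pooled variance quoted in part i. can be checked from the summary table. The following is a sketch using only the Python standard library, with illustrative names. The 5% critical value 1.676 is copied from the commentary. The value 2.009 used for the interval is the standard tabulated upper 2.5% point of the t50 distribution, which the commentary does not quote, so treat it as an assumption. The interval is also assumed to use the pooled standard error.

```python
from math import sqrt

# Summary statistics from the table in the question.
n_a, mean_a, sd_a = 30, 75.33, 7.61  # class group 1
n_b, mean_b, sd_b = 25, 71.40, 6.37  # class group 2

T50_UPPER_5PCT = 1.676    # one-sided 5% critical value, quoted in the commentary
T50_UPPER_2_5PCT = 2.009  # assumed table value for a two-sided 95% interval


def pooled_variance(n1, s1, n2, s2):
    """Pooled estimate of a common variance from two samples."""
    return ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)


diff = mean_a - mean_b
s2_p = pooled_variance(n_a, sd_a, n_b, sd_b)

se_pooled = sqrt(s2_p * (1 / n_a + 1 / n_b))
se_unpooled = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)

t_pooled = diff / se_pooled      # equal variances assumed
t_unpooled = diff / se_unpooled  # equal variances not assumed

ci_95 = (diff - T50_UPPER_2_5PCT * se_pooled,
         diff + T50_UPPER_2_5PCT * se_pooled)

print(f"Pooled variance: {s2_p:.2f}")
print(f"t (pooled) = {t_pooled:.4f}, t (unpooled) = {t_unpooled:.4f}")
print(f"Reject H0 at 5% (pooled): {t_pooled > T50_UPPER_5PCT}")
print(f"95% CI: ({ci_95[0]:.3f}, {ci_95[1]:.3f})")
```

Under these assumptions the end points agree with the interval quoted in part iii. to within rounding. Replacing 2.009 with the standard normal value 1.96 gives the alternative solution mentioned in the note above.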