Response to the statistics components of the
NSW Draft Stage 6 Mathematics and Extensions Curriculum
PO Box 213
BELCONNEN ACT 2616
ABN 82 853 491 081
Phone: (02) 6251 3647
Email: [email protected]
31st August 2016
A response to the draft documents found at
http://www.boardofstudies.nsw.edu.au/syllabus_hsc/mathematics-advanced.html
I. Executive summary
II. Primary considerations for further developing and teaching the syllabus
III. General curriculum
IV. Glossary and definitions
Authors: Peter Howley and Scott A. Sisson.
Authorship contributions: Diana Combe, Nick Fisher, Helen MacGillivray, Garth Tarr & Neville Weber.
Communications regarding the content of this report can be made through the SSA’s Executive Officer:
[email protected]
I. Executive Summary
The Statistical Society of Australia (SSA) is highly supportive of the addition of substantial elements of statistics
within the new NSW BOSTES Stage 6 Mathematics Syllabus. With increased societal recognition of the value of
data and its transformation into information, and the rising need for statistical literacy and application across
university degrees, understanding statistical concepts and acquiring statistical skills are increasingly becoming
critical to accessing and succeeding in higher education. Further, with reports (e.g. Manyika et al., 2011*) identifying
the international undersupply of statisticians, the establishment of consolidated streams to introduce students
to such opportunities is invaluable; the Statistics and Probability strands within the school curriculum provide
a critical component towards achieving this.
Our major comments on the proposed curriculum are:
a) Increased engagement with and understanding of statistics in Stage 6 will be better served by a
revision to the statistical content and its delivery in the F-10 syllabus, in particular by developing an early
recognition of, and delight in, the concept of variation, statistical investigation and thinking statistically.
b) Statistical ideas are by nature multidisciplinary, and should not merely be considered as a subset of
mathematics. A key to reinforcing the learning of fundamental statistical ideas is to embed them as
part of the investigatory process in the curricula of other disciplines, particularly, but not limited to,
the Sciences.
c) In practice, and in the context of arousing interest in students, the use of a computer is essential to
engaging with statistics. For the statistical education of Australian students to be competitive with
their international peers, the statistics syllabus should be constructed with computer usage assumed
as default, much like a lab is for the Sciences, rather than being constrained by the lowest common
denominator.
d) There are a large number of errors and omissions in the Glossary, and very many casually or loosely
written technical descriptions across the draft syllabus. We strongly recommend engaging with the
SSA to develop an appropriate and complete glossary, along the lines of the ACARA F-10 Mathematics
Glossary (http://v7-5.australiancurriculum.edu.au/mathematics/glossary).
e) There is currently no proposed statistics component in Stage 6 Extension 2. While the above findings
take precedence, this is an excellent opportunity to develop inspiring statistical content that would be
appreciated by the more mathematically inclined students.
The Statistical Society of Australia is the national professional association of statisticians, with members who
are nationally and internationally recognised as experts in statistical education. It stands ready to directly assist
NSW BOSTES with the construction of a high quality state curriculum in statistics.
* Manyika J et al (2011). Big Data: The next frontier for innovation, competition and productivity. McKinsey Global Institute.
II. Primary considerations for further developing and teaching the syllabus
a) Developing an appreciation of statistical ideas:
There are three critical aspects to developing an understanding and appreciation of Statistics and Statistical
Analysis. The first is the exploration of methods within a context. As former Chief Scientist Ian Chubb
commented** “when they do study them (the sciences) at school … the best way to teach it inspirationally is to
teach it the way it’s practiced”. Statistical analysis in practice is always undertaken with a problem at hand,
which provides a context within which an investigation (including its design) and subsequent analysis is
conducted. The investigation cycle is: Plan (design), Collect, Analyse (explore), Draw Conclusions,
Communicate.
The second is the understanding of variation: an awareness that Statistics is about understanding and
modelling the real-world variation that exists, and methods for identifying, quantifying and utilising variation,
within the design, analysis and final reporting stages (indeed within all stages) of the investigation cycle.
The third is statistical literacy: whilst ordinarily considered the base level of statistical understanding,
addressing the above first two aspects at least at rudimentary levels increases the chance of becoming
statistically literate. Note: statistical literacy is used as implying a competence to understand, critically appraise
and draw accurate meaning from a description or reporting of the real world in terms of quantitative, and
statistical, information involving variability or chance. Development of students' statistical literacy calls for
them to recognise both valid and invalid statistical interpretations. The latter objective can be achieved both
directly by teaching the limitations of statistical techniques as they are being taught, and indirectly by inviting
students to criticise published statistical analyses (both visual and written) that have flaws.
These key principles should be utilised in the delivery of Statistics in schools. ‘Chance’ or ‘Probability’ is distinct
from Statistics, but must similarly be developed within a context, to incite interest, and to support the
interpretation of data.
** http://www.chiefscientist.gov.au/2015/07/interview-1233-abc-newcastle/
b) Integration of statistical ideas within mathematics and other disciplines:
A fundamental issue in designing a curriculum in any discipline is to ensure that it supports the evolution of the
subject matter and skills from the earliest stages in two ways. Firstly, it should provide a steady progression
from simpler towards more advanced material, and yet include time for reinforcement of learning. Secondly,
it should demonstrate coherence from year to year. Considering the underlying interdisciplinary nature of
statistics, the content and progression of statistical ideas in the Mathematics curriculum should ideally be
integrated with the development of relevant quantitative topics in other disciplinary curricula, in particular the
Sciences and TAS (Technological and Applied Sciences). In addition to enhancing analytic learning in these other
disciplines, this will reinforce the idea that statistics is highly relevant across multiple disciplines, and should
not be considered (and possibly rejected in some student minds) as “merely” a sub-discipline of Mathematics.
This should also increase the connection in young minds of other disciplines with Mathematics, a key within
the national STEM initiative.
c) On the use of computer technology (Information and Communication Technology):
Statistics is a relatively modern discipline, and one that is only increasing in societal importance. By nature, it
has made extensive use of computer technology and graphical systems. It is now inconceivable that everyday
practical usage of statistical methods should take place without the use of a computer. Students are expected
to be computer literate when leaving school, and are expected to be able to implement any statistical
techniques they have learned in practice. By construction, this requires previous experience of statistics with
computers.
Computers, therefore, need to be used appropriately to support student learning in statistics. It is unrealistic
to conceive of e.g. physics, chemistry or biology being taught without use of a laboratory. Attempting to teach
statistics without students having access to a corresponding facility (i.e. a computer lab) is in the same category.
We understand that it is not realistic to assume that all schools have equal access to such facilities. However,
it is now the second decade of the 21st Century, and if the curriculum is allowed to be dictated by the lowest
common denominator, there is a danger that Australian students will rapidly become disadvantaged with respect
to their peers internationally. The curriculum, as an educational enabling mechanism and a vehicle for
modernisation, should be moving as quickly as possible towards requiring that all students of statistics
have such access. The ability to explore data visually and to evaluate the quality of statistical models is essential
to developing an understanding of statistics, right from the start. Critically, it also brings the subject alive and
increases the time spent on more productive activity.
d) Errors in definitions and missing definitions:
There are a large number of errors in the Glossary, and very many casually or loosely written technical
descriptions across all of the draft syllabus. Teaching to poor definitions can produce confused or flawed
instruction from teachers, and result in student cohorts with partial or just plainly incorrect understanding of
the subject. We describe many of these in Section IV below; however, these sections and definitions will need
to be carefully rewritten and then checked by experts in the field. We strongly recommend engaging with the
SSA to develop an appropriate and complete glossary, along the lines of the ACARA F-10 Mathematics Glossary
(http://v7-5.australiancurriculum.edu.au/mathematics/glossary).
e) Statistical content in Stage 6 Extension 2
There is currently no proposed statistics component in Stage 6 Extension 2. We understand that content which
was previously identified for inclusion here was removed as it was incremental on the proposed Extension 1
content, and more importantly, it was very dry and tedious to the point that it would likely turn students away
from any further interest in the subject. From this perspective, we are happy that this previous content has
been removed.
The following suggestion is very much a secondary concern, in that the above issues of content, delivery
and correctness are considerably more important. However, the current blank slate on Extension 2 statistics
content provides an opportunity to include alternative statistical content that is both aligned with the
Mathematics syllabus, and that the more mathematically inclined Extension 2 students would be capable of
and interested in pursuing.
For example, this might include firstly introducing different types of distribution beyond the normal and
binomial, such as the Poisson (as derived from the Binomial), exponential (as derived from the Poisson),
uniform and distributions with arbitrary functional form, and analytically computing their expectations and
standard deviations through integration. This may further be followed by fitting these distributions to data
based on moment-based estimators (i.e. matching sample means and standard deviations with those of the
distributions), and then using the fitted distributions to make predictions and answer questions of interest.
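The moment-matching idea described above can be sketched as follows. This is a minimal illustration in Python (standard library only), not a proposed syllabus activity; the data values and function names are hypothetical. A Poisson distribution has a single parameter, so only the sample mean is matched, whereas a uniform distribution on (a, b) has two, so both the sample mean and the sample standard deviation are matched.

```python
import math
import statistics

def fit_poisson(counts):
    """Poisson(lam) has mean lam, so set lam to the sample mean."""
    return statistics.fmean(counts)

def fit_uniform(values):
    """Uniform(a, b) has mean (a + b)/2 and sd (b - a)/sqrt(12).
    Solving for a and b gives mean -/+ sqrt(3) * sd."""
    m = statistics.fmean(values)
    s = statistics.stdev(values)
    return m - math.sqrt(3) * s, m + math.sqrt(3) * s

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Hypothetical data: goals scored in ten matches
goals = [0, 1, 1, 2, 0, 3, 1, 2, 1, 1]
lam_hat = fit_poisson(goals)

# Use the fitted distribution to answer a question of interest:
# the estimated chance of a scoreless match.
p_scoreless = poisson_pmf(0, lam_hat)

# Hypothetical data: waiting times in minutes
waits = [2.0, 4.0, 6.0, 8.0]
a_hat, b_hat = fit_uniform(waits)
```

The same pattern (write the distribution's mean and standard deviation in terms of its parameters, then solve) would apply to the exponential and other distributions mentioned above.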
f) Support from the Statistical Society of Australia (SSA):
The SSA is the national professional association of statisticians. Its members are extensively drawn from
academia, business, government, industry and the education sectors. Many of its members are nationally and
internationally recognised as experts in statistical education, and they have previously worked directly with
ACARA and BOSTES on the National Curriculum (e.g. via Advisory Boards) either as individuals or more formally
as representatives of the SSA.
The SSA wishes to clearly communicate that it stands ready to directly assist NSW BOSTES with the construction
of a high quality state curriculum in statistics. Contact details can be found at the head of this document.
III. General curriculum
Notwithstanding the above considerations for its delivery, there are a number of issues with the draft syllabus
which need to be addressed. We have attempted to highlight many of these below. The below list is not
intended to be exhaustive.
The SSA remains ready to engage with NSW BOSTES to develop an appropriate and motivational statistics
syllabus.
Stage 6 Syllabus Draft
•
P.46. Replace “understand the difference between an outcome, an event and an experiment,...” with
“understand to what each of an outcome, an event, and an experiment refers…”
Rationale: The original wording has some problems: a) difference between three items? b) while the
intention of the original was likely regarding differences among, or perhaps the pairwise differences,
we note that an experiment may produce an outcome/s, and an event is usually a subset of the set of
all outcomes; however, an outcome may be an elementary event; c) ‘difference’ between an
experiment and an outcome - perhaps refer to differences between an experiment and an
observational study (each having outcomes and events). We have provided a suggested rewording in
Part IV - Glossary within ‘Outcome’.
•
P.46 “determine relative frequency as probability”: Does this mean to suggest that relative frequency
and probability are interchangeable (which is false), or rather to understand that relative frequencies
provide a way to define an ‘empirically-based’ probability?
NOTE: The ACARA Elementary Mathematics syllabus uses “identify relative frequency as probability”
in the context of simulations. The use of simulations to build understanding of random behaviour is
missing from this syllabus.
•
P. 47 Topic Focus: “This principal focus of this topic…” should start “The principal…”
•
P.47 “the sum of the probabilities is 1”: This would be more precisely expressed as e.g. “the sum of
the probabilities of all possible elementary events (or outcomes) within the sample space is 1”.
•
P. 47 E(X) is introduced for discrete variables, but it is not used for continuous variables. This introduces
another piece of notation which is specifically being used to refer to discrete variables. It is not clear if
this connection to the discrete case is intentional (as it is not necessary), and if so this should be clarified
in the glossary.
•
P.47/8 The draft uses both forms of notation, $\binom{n}{r}$ and nCr, for the number of ways of choosing r objects
from a set of n in the same subsection. Is this intentional?
•
P.63: “Including the use of the normal distribution”: this should be capitalised as “the Normal
distribution”. NOTE: elsewhere in document this is stated as “the Normal Distribution” (i.e., capital D
as well) - need consistency throughout.
•
P.63 The OUTCOMES listed are the same as those given within each of the TOPICS on pages 64 (for
Bivariate) and 65 (for Normal). However, these two topics each address a subset of these outcomes.
Is this a copy-paste issue - shouldn’t each topic have its own specific outcomes? For example, it seems
odd to state within M-S5 The Normal Distribution (p. 65) outcomes including “correlation of bivariate
data”. Similarly odd to state within M-S4 Bivariate (p. 64) outcomes including “use of the normal
distribution”. Why not state only the subset of outcomes relevant to the topic, within the topic?
•
P.63 First bullet point of OUTCOMES
“... and the correlation of bivariate data”: Since the line of best fit and least squares regression are
utilised, it is not only correlation that is assessed. Namely, one only undertakes regression once a
causal relationship has been established as plausible. Further, the content refers to identifying
patterns in two-way frequency tables (categorical data) - this doesn’t fall within the statistical
consideration of correlation, rather (for the tables being considered) association.
•
P.63 OUTCOMES “solves problems using appropriate statistical processes, including the use of the
normal distribution and the correlation of bivariate data”: Perhaps improve the precision of this
statement with “solves problems using statistical techniques, including use of the Normal distribution
for a single variable, associations for bivariate categorical data, and correlation and least squares
regression for bivariate numerical data”.
•
P.63 STRAND FOCUS “Statistical Analysis involves the relationships between two variables and analysis
of correlation in bivariate data sets.”: The references to ‘bivariate’ and ‘between two variables’ seem
redundant. The description is also inconsistent with the first listed student outcome which notes use
of the Normal distribution: this is not related to having two variables and may confuse. It is unclear
whether this quote is intended as a general comment on Statistical Analysis or not. Assuming the
reference is to ‘Statistical Analysis’ as the Strand, rather than the more general topic, perhaps replace
the Strand Focus paragraphs with:
“Statistical Analysis involves the understanding and use of probability distributions and density
functions and the exploration and quantification of linear relationships.
Knowledge of statistical analysis enables valid, reliable and appropriate interpretation of situations
and awareness of contributing factors to observed outcomes and the possible misrepresentation of
information by third parties.
Study of statistical analysis is important in developing students' ability to recognise, describe and
appropriately apply statistical relationships in order to predict outcomes. An appreciation of how
conclusions drawn from data can be used to inform decisions made by groups such as scientific
investigators, businesses and policy-makers”.
•
P.64 2nd of 3 parent bullet points, 2nd child bullet point “or the coefficient of determination”: The
coefficient should only be used within a regression (or line of best fit) context, and so should only be
undertaken when there is a perceived causal relationship. Hence, remove the quoted text and reinsert
it within the 3rd of the 3 parent bullet points (which refers to modelling a linear relationship).
•
P.64 The 3rd of 3 parent bullet points begins “model a linear relationship by fitting an appropriate line
of best fit”: the word ‘appropriate’ is key here. Perhaps the fourth child point “recognise that an
observed association between two variables does not necessarily mean that there is a causal
relationship between them” and fifth child point “identify possible non-causal explanations for an
association...” should not appear here, but instead earlier within the 2nd of the 3 parent bullet points.
•
P.64 Within the 3rd of 3 parent bullet points, we recommend including an additional point: “recognise
the differences between predicted and observed values, and consider the size of such differences to
assess the goodness of fit of models and introduce the concept of a residual”. Here ‘residual’ refers to
a measure of the difference between an observed and a predicted (by the least-squares regression
model) outcome. This should not pose too great an extension, but is an invaluable aspect for students
to conceptualise and begin to consider. Residual could also have a glossary term.
•
P.65 2nd parent/major bullet point: “given by integrals”: Should this be replaced with “obtained via
integration”? This could further include “and linking with area under a curve”. Although this may be
implicit, being explicit in the syllabus can only be beneficial.
•
P.65 The notation ($\mu$, $\sigma$, $s$, $\bar{x}$) introduced here is incomplete as the distinction between a population and
a sample has not been established. There needs to be clarification of why each notation is used. This
means that the concepts of sample mean and variance (statistics) and the concepts of the mean and
variance of a random variable (parameters) need both to be in the curriculum, together with an understanding
of how they relate to each other - that one is an estimate of the other etc. So the course needs E(X)
or $\mu$, Var(X) or $\sigma^2$, and $\bar{x}$ and $s$ (and $p$ and $\hat{p}$) etc. all defined and used appropriately. If this is not
established elsewhere in F-10 then it needs to be introduced here.
•
P.65 6th of 6 major bullet points: Consider excluding use of the continuity correction, in favour of just
estimating the probability using a normal approximation without the continuity correction, when
appropriate. In practice, we would use a computer to calculate binomial probabilities exactly if
necessary. We use the approximation to obtain an estimate much like we estimate 12 x 18 with 10 x
20, before we then use a calculator for an exact computation. NOTE: Commonly the rule of thumb is
greater than or equal to 5, not just greater than 5. Visual considerations should be utilised to support
the nature of using the normal approximation to the binomial, either with some examples, or with
simulations.
Further, why include the normal approximation to the binomial? This is formally based on the central
limit theorem which is well beyond the 2 unit course. At this level the binomial probabilities can all be
easily calculated using standard statistical software so there is little motivation for introducing an
approximation.
•
P.65 The section could optionally be retitled “Continuous random variables and the Normal
distribution” and develop material to complement the work on discrete random variables in Year 11
M-S3 (Probability Distributions). The uniform distribution could be featured along with the normal.
•
P.65 Care is needed with calculating z-scores when using the sample mean and standard deviation
versus the population mean and standard deviation. The distinction between population model and
sample estimate needs to be noted and carefully managed. (Perhaps this is a notation issue confusing
sample estimates and population parameters.)
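To illustrate the P.65 comment above on the normal approximation to the binomial, the following is a minimal sketch in Python (standard library only) comparing an exact binomial probability with the normal approximation, with and without the continuity correction. The values of n, p and k are illustrative assumptions, as is the helper name binomial_cdf.

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

# Illustrative values only: np = 12 and n(1 - p) = 28, both at least 5
n, p, k = 40, 0.3, 10

# Normal distribution with the same mean and standard deviation
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))

exact = binomial_cdf(k, n, p)
plain = approx.cdf(k)            # no continuity correction
corrected = approx.cdf(k + 0.5)  # with continuity correction

print(f"exact={exact:.4f}  plain={plain:.4f}  corrected={corrected:.4f}")
```

Since the exact value takes only a few lines to compute, a sketch like this may be most useful for showing students how rough the approximation is, rather than as a computational method in its own right.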
Stage 6 Extension 1
•
ME-S1 (p.37) It is worth noting that Permutations and Combinations are NOT Statistics and should not
be labelled as such (which has the potential to result in the disengagement of students with statistics,
as it becomes viewed as too mathematical and too difficult). This should be a separate topic (as it is in
the ACARA National Syllabus). Some probability problems do use counting techniques but they are
mostly used in many other areas of mathematics.
•
The STRAND FOCUS descriptions are problematic across syllabi. The same term ‘Statistical Analysis’ is
used, yet the descriptions vary. Is the intention to define the term universally across the curriculum,
or specifically to its use within a particular Strand? Perhaps the former is more appropriate, with each
strand having a clear more specific focus.
ME-S2 (p.45): Sampling and Estimates:
•
There are simpler principles surrounding sampling and representativeness that should precede this.
Perhaps these are addressed in F-10? If so, then the Stage 6 syllabus should utilise these when
introducing population and sample notation. If not, then the concept introduced here may be
problematic and consideration should be given to addressing sampling and estimates in Stage 6
General, or earlier Stages.
•
This Section should be supported by using computer simulation to develop the ideas of a sampling
distribution, which in turn will help convince students that the use of a normal distribution is plausible
for the construction of confidence intervals etc.
•
It is unclear why confidence intervals are introduced for the parameter p, rather than for the mean of
a continuous variable.
•
“define the approximate standard error $E = z\sqrt{\hat{p}(1-\hat{p})/n}$ and understand the trade-off between
margin of error and level of confidence ...”: Firstly ‘z’ must be removed since the expression with z
included is NOT the standard error, rather it is what is known as the margin of error. The margin of
error may be defined as $z\sqrt{\hat{p}(1-\hat{p})/n}$.
Secondly, rather than simply state “standard error”, this should be replaced with “standard error of
$\hat{p}$” as standard error must refer to the statistic. I.e. “define the approximate standard error of $\hat{p}$ as
$\mathrm{SE}(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$.”
•
The ACARA Mathematical Methods glossary includes the Central Limit Theorem. We recommend
including this key fundamental result that underpins many statistical ideas in both the (extension)
syllabus and the glossary. Teachers should be aware of the result and its link to the simulations that
are designed to illustrate the normal approximation to the sampling distribution of the sample
proportion.
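The simulation-based approach recommended for this section, together with the distinction between the standard error and the margin of error, can be sketched as follows. This is a minimal illustration in Python (standard library only); the true proportion, sample size, number of repetitions, seed and function name are all assumptions chosen for the example.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, stdev

random.seed(2016)  # fixed seed so the run is repeatable

p_true, n, reps = 0.4, 100, 5000  # illustrative values only

def sample_proportion(n, p):
    """Proportion of successes in n simulated Bernoulli trials."""
    return sum(random.random() < p for _ in range(n)) / n

# Simulated sampling distribution of the sample proportion
p_hats = [sample_proportion(n, p_true) for _ in range(reps)]

# Spread of the simulated proportions versus the theoretical value
simulated_sd = stdev(p_hats)
theoretical_sd = sqrt(p_true * (1 - p_true) / n)

# Standard error and margin of error computed from a single sample
p_hat = p_hats[0]
se = sqrt(p_hat * (1 - p_hat) / n)   # SE(p-hat): no z involved
z = NormalDist().inv_cdf(0.975)      # about 1.96 for 95% confidence
margin = z * se                      # margin of error E = z * SE(p-hat)
interval = (p_hat - margin, p_hat + margin)
```

A histogram of p_hats would show the approximately Normal shape that the Central Limit Theorem predicts, which is the visual argument for using a Normal distribution to construct the confidence interval.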
IV. Glossary and definitions
There are a large number of errors and omissions in the Glossary of each of the Stage 6 and the Extension 1
drafts. We have attempted to highlight many of these below. They include imprecisions that result in confused
or incorrect definitions, definitions that need a context to be properly understood (e.g. “parameter” and
“normal” both have different meanings in each of the statistical and mathematical contexts), or definitions
that are just plainly wrong. The below list is not intended to be exhaustive, nor do we wish to suggest that
correcting the definitions as indicated will produce robust definitions. Rather, as suggested above, we strongly
recommend engaging with the SSA to develop an appropriate and complete glossary, along the lines of the
ACARA F-10 Mathematics Glossary.
Stage 6
Bernoulli trial - Should have “only two possible outcomes, typically labelled ‘success’ and ‘failure’ (the opposite
of success). A success occurs with probability p and the probability of failure is q=1−p.”
Binomial distribution - Suggest replacing “n independent yes/no experiments” with “n independent Bernoulli
trials”.
Binomial random variable - Missing key concepts, and loose phrasing. Suggest replacing “in n trials with two
possible outcomes, called success and failure.” with “in n independent Bernoulli trials.” Also “In each Bernoulli
trial, the probability of success is …”
Box and whisker plot - The precision of this definition could be improved by including outliers (a component
of box-and-whisker plots), mention of the lower and upper quartiles, and including a glossary definition of “five
number summary”. Reference to “vertical” line for median incorrectly suggests that boxplots can’t be drawn
vertically (where the median would then be a horizontal line).
Categorical data - What is described here is a Categorical Variable. Further, the examples have a list of
outcomes which are not exhaustive: “Examples: blood group is a categorical variable; its values are: A, B, AB or
O. So too is construction type of a house; its values might be brick, concrete, timber or steel.” Suggest changing
to ‘major blood group type’ and ‘principal construction type’, as there are clearly missing options in the lists
provided, or other examples.
However, we suggest removing the postcode example, as postcodes which are numerically close typically
correspond to areas which are physically close. So here the numerical order does have a meaning (even if taking
an average would be an odd thing to do here).
The above confusion between ‘data’ and ‘variable’ needs to be addressed, perhaps through an explicit glossary
definition. E.g. “Data are the values observed for a particular variable. A variable is the characteristic (or
measurement) of interest which will vary depending upon the individual or item we happen to be observing,
or the point in time at which we do so.” The same data/variable definition problem also occurs in the glossary
terms for Nominal Data, Numerical Data and Ordinal Data, and these should be resolved in the same way.
Finally, do not absolutely equate categorical data to qualitative data. Suggest rephrasing as “Categorical data
is sometimes referred to as qualitative data, much as numerical is referred to as quantitative.”
Coefficient of Determination - This definition is incomplete as it does not explicitly identify attribution to a
linear relationship between the dependent and independent variables. Suggest replacing “It is a number that
indicates the proportion of the variance in the dependent variable that is predictable from the independent
variable.” with “It is a number that indicates the proportion of the variance in the dependent variable that is
predictable from the independent variable in a linear relationship between the two variables.”
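As an illustration of how the coefficient of determination relates to the residuals of a least-squares line, the following is a minimal sketch in Python (standard library only). The data values are hypothetical, and the variable names are chosen only for this example.

```python
from statistics import fmean

# Hypothetical bivariate numerical data
x = [1, 2, 3, 4, 5]
y = [52, 55, 61, 64, 68]

x_bar, y_bar = fmean(x), fmean(y)

# Least-squares slope and intercept
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residual = observed value minus the value predicted by the fitted line
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, predicted)]

# Coefficient of determination: proportion of the variance in y
# accounted for by the linear relationship with x
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot
```

For these values the fitted slope is 4.1 and the coefficient of determination is about 0.989; small residuals relative to the overall spread of y are what make it close to 1.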
Conditional Probability - Requires that P(B)>0.
Cumulative Frequency - Suggest replacing the existing definition with “The total of all frequencies (or counts) up to and
including the accumulation point of interest.”
Decile - Has the potential to be confusing. It is also not clear what sorted means (i.e. into order of size) and
what equal parts might mean. Suggest rewriting as “A decile is any of the nine values that divide the data (once
sorted into non-descending order) into ten equally-sized parts, so that each part represents 1/10 of the sample
or population. E.g., the first decile is the number with 10% of the data below it, the second decile is the number
with 20% of the data below it, and so on.”
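The suggested decile definition can be checked on a small dataset. The following is a minimal sketch using the quantiles function from Python's statistics module; the dataset is an illustrative assumption. Note that software packages use slightly different interpolation conventions, so decile values may differ a little between tools.

```python
from statistics import quantiles

# Illustrative dataset: the integers 1 to 100
data = list(range(1, 101))

# n=10 asks for the nine cut points that divide the data into ten parts
deciles = quantiles(data, n=10)

# Count how many observations fall below the first decile
below_first = sum(value < deciles[0] for value in data)
print(len(deciles), below_first)
```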
Dependent and independent variables - This is wrong, as which variable is considered as independent and
which as dependent depends solely on the context and what variable you wish to use to predict the other.
There is no absolute notion of “this variable is a dependent/independent variable”. Further, by construction
this usage of dependent and independent has a completely different meaning of independent to that in the
context of independent events.
Discrete variable - This is wrong as discrete random variables do not have to take a finite number of values
(e.g. think of a Poisson random variable which is discrete but has an infinite sample space). Perhaps adjust
wording to “A discrete variable is a numerical variable that may only assume distinct values, as opposed to any
or all values within a continuum”. (Alternatively link the definition to “countable sets”, with an appropriate
glossary term for countable sets.)
Independent Event - This is wrong on several levels. Firstly, events do not have outcomes. Secondly, it makes
no sense to say an event is independent - we can talk about two events being independent, or one event being
independent of another.
Mean - The definition refers to arithmetic mean, but fails to identify that this is what is being referred to simply
as the mean. As there are other types of mean, this should be clarified here.
Measures of central tendency - The definition states “measures of location summarise…” but doesn’t note
that the “measures of location” is synonymous with the “measures of central tendency”. In addition, reference
to a “typical” value is not always appropriate - many averages don’t exist as a typical value (e.g. the average
score on rolling a 6-sided die is 3.5). Perhaps replace with “Summary measures which indicate where a
distribution may be centred; mean, median and mode are the commonly used measures, although not all are
appropriate for every situation.”
Median - The first sentence suggests that the median must be one of the values of the dataset (which is
contradicted by the following two sentences). Perhaps replace with “The median divides an ordered dataset
into two equal parts.”
Mode - Defines the mode of a dataset but not the mode of a distribution, which is used in the glossary definition
of e.g. Normal Distribution (the histogram of which is presumably constructed from continuous data with no
repeated observations). The mode (and other similar measures on both data and distributions, such as means,
medians, variances etc.) should be separately defined for both sample data and distributions.
Nominal data - Need to resolve the data/variable definitional problem as for Categorical Data, Numerical Data
and Ordinal Data. Suggest replacing the existing definition with “A variable whose possible outcomes are
categories, and these categories have no natural ordering among them.”
Normally Distributed - What is listed are properties of the normal distribution. However, infinitely many
distributions have these properties, and the normal is only one of them. So this definition is
sufficiently incomplete to be wrong. Also, why is Normal Distribution put in quotation marks? These should be
removed. The third bullet point is redundant given the first bullet point. Replace definition with “referring to
a distribution which follows the Normal, or Gaussian, distribution. It is commonly referred to as a ‘bell curve’
given its appearance (much like a bell shape) and has several key properties, which include:
mean=median=mode; symmetry about the centre; approximately 68% of values lie within plus and minus one
standard deviation of the centre, approximately 95% of values lie within plus and minus two standard
deviations of the centre, and approximately 99.7% of values lie within plus and minus three standard deviations
of the centre.” This can be supplemented with an appropriate graphic.
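The 68%, 95% and 99.7% figures quoted in the suggested definition can be reproduced from the Normal cumulative distribution function. The sketch below uses the standard identity P(|X - mu| <= k sigma) = erf(k / sqrt(2)); the function name is an assumption chosen for this illustration.

```python
import math

def prob_within(k: float) -> float:
    """P(|X - mu| <= k * sigma) for a Normal random variable X."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```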
Normal - This definition refers to a different usage of the word “normal” to that used in statistics. In Statistics
“Normal” (note the capitalisation) often refers to the distribution or to the shape of the distribution of a
dataset. This glossary term really needs two definitions, delineated by their context. E.g. “Normal
(mathematics) refers to...” and “Normal (statistics) refers to …”
Normal random variable - Uppercase should be used (Normal random variable) as per Normal distribution.
Delete “and produces a bell curve” as superfluous; its inclusion suggests that some Normal random variables
may not produce a bell curve.
Numerical data - It is stated that “Numerical data is data associated with a numerical variable.” and also that
“Numerical variables are variables whose values are numbers”. Such an approach could be used for the glossary
definitions for categorical data, nominal data and ordinal data.
Within ‘Numerical variable’, replace “Numerical variables are variables whose values are numbers, and for
which arithmetic processes such as adding and subtracting, or calculating an average, make sense.” with
“Numerical variables are variables whose values are numbers which arise from a measuring process and
represent a quantity.”
Further, a Discrete Numerical Variable is defined within the Numerical Data glossary term; however, a) it should
be a glossary term on its own; and b) there is no corresponding definition of a continuous variable. For example:
“Discrete numerical variable: a numerical variable that may only assume distinct values, as opposed to any or
all values within a continuum.”
“Continuous numerical variable: a numerical variable for which the possible responses may assume any value
within some continuum.”
Ordinal data - Need to distinguish between ordinal data and an ordinal variable, as for Categorical Data,
Nominal Data and Numerical Data. Suggest replacing the current text with “If the categorical variable’s
responses have a natural order, then it is called an ordinal variable. E.g. the variable Investment Risk has the
categories Low, Medium and High.”
Outcome - The syllabus notes the importance of understanding differences between outcome, event and
experiment. An outcome is the elementary event of an experiment, while an event refers to a set of outcomes.
E.g., in rolling a 6-sided die where each side is labelled with one of the numbers 1 to 6, an outcome is any one
of 1, 2, 3, 4, 5 or 6 (this is also an elementary event), whereas an event may be ‘an odd number’ and hence is
the set of outcomes {1,3,5}.
Quartiles - “four (approximately) equal parts” - remove parentheses, as it may be interpreted as saying there
are approximately four, when in fact there are exactly four. These four parts are equal (approximately), or
approximately equal.
Quantiles/Quartiles/Deciles - These are all confusing, as some deal with quantiles of probability distributions
and others deal with datasets. Both should be defined (and without the problematic phrasings currently present).
Pearson’s correlation coefficient - Commonly denoted “r”, this is different to the “coefficient of
determination” which is “r²”. It is incorrect to have “or a coefficient of determination” in the definition. Remove
these words.
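The distinction between the correlation coefficient and the coefficient of determination can be seen numerically. The following is a minimal sketch that computes Pearson's r from its usual definition; the data values are hypothetical and were chosen so that r is negative.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient of paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired data with a decreasing trend.
x = [1, 2, 3, 4, 5]
y = [5, 3, 4, 1, 2]

r = pearson_r(x, y)
r_squared = r ** 2
print(f"r = {r:.2f}, r squared = {r_squared:.2f}")
```

Here r is negative while its square is positive, so the coefficient of determination carries no information about the direction of the relationship and the two terms cannot be used interchangeably.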
Relative Frequency - The relative frequency of a particular value (or group of values) in a data set is the
proportion of the observations in the data set which have that value (or group of values).
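As a small illustration of this definition, the sketch below computes relative frequencies for a hypothetical dataset of categorical observations. The function name and data are assumptions made for the example.

```python
from collections import Counter

def relative_frequencies(data):
    """Proportion of observations in the dataset taking each value."""
    counts = Counter(data)
    n = len(data)
    return {value: count / n for value, count in counts.items()}

# Hypothetical dataset of eight observations.
colours = ["red", "blue", "red", "green", "red", "blue", "red", "green"]
rel = relative_frequencies(colours)
print(rel)  # {'red': 0.5, 'blue': 0.25, 'green': 0.25}
```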
Standard deviation - Needs to distinguish between the sample standard deviation (of a dataset) and the
standard deviation of a random variable.
Variance - Needs to distinguish between the variance of a random variable and the variance of a dataset
(similarly to Expectation/Mean, mode, median etc).
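The difference between the variance (and standard deviation) of a random variable and that of a dataset can be sketched as follows. The random variable is the score on a fair 6-sided die; the list of observed rolls is a hypothetical sample, and the sample versions use the usual n - 1 divisor as implemented in Python's statistics module.

```python
import statistics
from fractions import Fraction

# Random variable: score on a fair die, each face with probability 1/6.
faces = range(1, 7)
p = Fraction(1, 6)
die_mean = sum(p * x for x in faces)                        # 7/2
die_variance = sum(p * (x - die_mean) ** 2 for x in faces)  # 35/12
die_sd = float(die_variance) ** 0.5

# Dataset: a hypothetical sample of six observed rolls.
rolls = [1, 3, 3, 6, 2, 5]
sample_variance = statistics.variance(rolls)  # divides by n - 1
sample_sd = statistics.stdev(rolls)

print(f"random variable: variance {float(die_variance):.4f}, sd {die_sd:.4f}")
print(f"sample:          variance {sample_variance:.4f}, sd {sample_sd:.4f}")
```

The first pair of values is a fixed property of the distribution, whereas the second pair changes from sample to sample.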
Stage 6 Extension 1
In addition to the above:
Confidence Interval - Suggest deleting the second sentence, which is vague, is an oversimplification, and
isn’t part of any definition regardless. Further, continue after the first sentence with “A range of values, based upon
sample data, within which we estimate the unknown population parameter lies, with some level of confidence.
Strictly, confidence relates to the repeated sampling and estimation process, so 95% confidence refers to the
expectation that upon repeated sampling and interval estimation, 95% of said intervals would contain the true
parameter.”
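The repeated-sampling interpretation of confidence can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: samples are drawn from a Normal population with known standard deviation, the interval is the usual z-interval for the mean, and the parameter values, function names and seed are all arbitrary choices.

```python
import math
import random

def z_interval(sample, sigma, z=1.96):
    """Approximate 95% interval for the mean when sigma is known."""
    n = len(sample)
    mean = sum(sample) / n
    half_width = z * sigma / math.sqrt(n)
    return mean - half_width, mean + half_width

def coverage(mu=50.0, sigma=10.0, n=30, trials=2000, seed=1):
    """Proportion of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma)
        if low <= mu <= high:
            hits += 1
    return hits / trials

print(coverage())  # expected to be close to 0.95
```

Any single interval either contains the true mean or does not; the 95% describes the long-run proportion of intervals that do.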
Interval Estimate - The given definition is for interval estimation, not interval estimate. Replace with “a range
of values representing an estimate for the parameter (a population value).” This also depends upon the correct
definition of “parameter” within this context (for which no correct glossary term exists - see below).
Normal Distribution - This definition appears within the Stage 6 Extension 1 glossary, yet is not in the Stage 6
glossary. Further, the Stage 6 Extension 1 definition given for Normal Distribution is the same definition
provided in Stage 6 for Normally Distributed (which refers to “Normal Distribution”). The Glossaries need to
be consistent and exhaustive.
Parameter - The definition provided doesn’t apply to its use within statistics, even though the expression
‘population parameter’ is used within the definition of Interval Estimate and Confidence Interval.
Standard Error - This definition is simply wrong. (Even allowing for possible confusion with “Margin of Error”
which is itself not defined anywhere.) A definition for ‘Standard Error’ could be “The standard error (SE) is the
standard deviation of the sampling distribution of a statistic.”
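The suggested definition can be illustrated for one particular statistic, the sample mean, whose standard error is sigma / sqrt(n). The following sketch approximates the sampling distribution by simulation; the Normal population, parameter values, function name and seed are assumptions made for the illustration.

```python
import math
import random
import statistics

def simulated_standard_error(mu=0.0, sigma=10.0, n=25,
                             replications=5000, seed=1):
    """Standard deviation of simulated sample means (size-n samples)."""
    rng = random.Random(seed)
    sample_means = []
    for _ in range(replications):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        sample_means.append(statistics.fmean(sample))
    return statistics.stdev(sample_means)

theoretical_se = 10.0 / math.sqrt(25)  # sigma / sqrt(n) = 2.0
print(f"simulated SE {simulated_standard_error():.3f}, "
      f"theoretical SE {theoretical_se:.3f}")
```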