Download Applied Quantitative Methods II. MBA course Montenegro

Document related concepts

Bootstrapping (statistics) wikipedia , lookup

History of statistics wikipedia , lookup

Misuse of statistics wikipedia , lookup

Time series wikipedia , lookup

Transcript
Applied Quantitative Methods
MBA course Montenegro
Peter Balogh
PhD
[email protected]
Descriptive statistics
• The first part of my presentation considered ways of
collecting and presenting data.
• The quality of the data is fundamental to the work that
we do, and yet data collection can be the very stage
where time and cost pressures are particularly
demanding.
• It is necessary for the researcher to ensure that data
collection methods are adequate and that the data can
inform the users in a meaningful way.
• Fancy analysis, now made easy by computers, cannot
compensate for poor data.
• It is often said (using a machine analogy) that the
results coming out can only be as good as the data
going in.
Descriptive statistics
• However, searching for data, the collection of data and the use of
charts and diagrams alone are not sufficient for most purposes.
• Generally, we need to do more than 'paint' an illustrative picture.
• We need to describe the data with a rigour that summary numbers
allow perhaps comparing one set or sub-set of data with another.
• Users will want numbers to work with, for example market share
and rate of increase, to test ideas and to use to identify possible
relationships.
• This process of exploring data using acceptable techniques and
theories is termed analysis.
• The calculation of numbers should clarify the data, revealing
similarities and differences that were not seen before.
• The use of numbers can complement the skills of insightful
observation and idea generation.
Descriptive statistics
• The following three chapters provide ways of
summarizing the mass of detail contained in data
sets.
• The use of computer packages, such as MINITAB,
SPSS or spreadsheets has made the calculation of
descriptive statistics relatively simple.
• This removal of the burden of calculation or
'number crunching' makes the selection and
interpretation of appropriate statistics even more
important.
5. Measures of location
• This chapter refers to 'measures of location' rather
than just the average, to emphasize that there is
more than one 'typical' summary value (the simple
average).
• What is seen and accepted as a typical value will
depend on the data we are considering, for
example, the most popular model of car, opinions
on service levels as expressed on a questionnaire
rating scale, or the weekly cost of groceries.
5. Measures of location
• We do not wish to question the use in 'everyday'
language of the word 'average' ('she played an
average game', 'he has an average job'), but when
calculating statistics we need to be more precise, in
a shopping survey, for example, we need to know
what exactly is meant by the average amount spent
or the average number of visits, so that we can use
this information more effectively for description or
inference.
5. Measures of location
• We also need to be aware of all the assumptions being
made - so for example, when we talk about the average
amount spent on cigarettes, do we mean the average
spent by smokers, or the average amount spent by all
respondents?
• Computer packages make such calculations relatively
simple, but they can mask the assumptions being
made.
• We have detailed the manual methods of calculation in
this chapter because these are still required by
academic courses, are still useful for small data sets,
can be necessary for checking results and can enhance
your understanding of the particular statistic.
5. Measures of location
5.1 The mean, median and mode
• The mean (or arithmetic mean) is the most often
used measure of location or average, with the
median and the mode being used for more specific
(special case) applications.
• Each of these statistics has its own characteristics
and will generally produce a different result for a
given set of data.
• For some sets of data it is useful to determine all of
these statistics (together they tell us something
about the differences in the data), but for other
data sets, not all of the calculable statistics may be
valid.
5.1 The mean, median and mode
• A major consideration will be the type of data we
are dealing with - categorical, ordinal or cardinal.
• As we will see, for example, it is not appropriate to
calculate a mean for categorical data.
• How can you have an average social class or country
of origin?
• Data can be discrete or continuous.
• Is giving the average number of children as 1.8 a
meaningful answer?
5.1 The mean, median and mode
• Knowing the average number of children could be
helpful in a population projection exercise, but less
helpful in allocating aircraft seats.
• We also need to consider the variation in the data
(is it all closely bunched together or are there
extreme values?).
• We also need to decide whether we are going to
include all values (do we trust all the values we
have, or do we wish to exclude some for being
unlikely or unrepresentative?).
5.1 The mean, median and mode
• The arithmetic mean (usually just shortened to
mean) is the name given to the 'simple average'
that most people calculate.
• It is easy to understand and a very effective way of
communicating an answer.
• It does not really apply to categorical data and its
interpretation can be difficult when used with
ordinal data, but it is used in this way and its use is
often justified for practical reasons.
5.1 The mean, median and mode
• The median is the middle value of an ordered list of
data.
• It is not as well known as the mean but can be more
appropriate for certain types of data.
• The mode is the most frequent value or item,
typical examples being the most popular model of
car or most common shoe size.
5.1.1 Untabulated data
• Untabulated data will usually be presented to us as
a list of numbers or rows of numbers (see Case
study 1 data on the number of cars entering a car
park during 10-minute intervals, file SDL1.XLS).
• This data can come in any order (unranked rather
than ranked) and can range from a few values to
several thousand or more.
• Suppose we consider just the first ten observations
from this data set (to make life easy):
10 22 31 9 24 27 29 9 23 12
The mean
• To calculate the mean, the numbers are added together to find the total,
and this total is divided by the number of values included. In this case
x
where
(pronounced x bar) is the symbol used to represent the mean.
• It is important to clarify the units in use and to give an interpretation to the
result.
• In this example, 0.6 of a car is only meaningful in terms of an average value
(unless we are actually cutting up cars).
• We would also need to decide whether to round our answer to an average
of 20 cars every 10 minutes for reporting purposes.
The mean
• As most statistics require some form of calculation, a
shorthand has developed to describe the necessary
steps.
• Using this shorthand, or notation, the calculation of the
mean would be written as follows:
• where x represents individual values,  (sigma) is an
instruction to sum values, and n is the number of
values.
The median
• The median is the value in the middle when
numbers have been listed in either ascending or
descending order (typically ascending order).
• The first step is to rank, or order, the values of
interest:
• The next step is a matter of counting from the left
or the right.
The median
• When working with a listing of values (but not with
continuous data) the position of the middle value is found
using the formula:
• This is easy to use with an odd number of values (e.g. given
five values the middle one would be given by
(5 + l)/2, i.e. the third one).
• When working with an even number of values, such as ten, a
1\2 emerges and we need to use the two adjacent values.
• In this case, (10 + l)/2 gives 5 1/2 and we need to consider
the fifth value of 22, and the sixth value of 23.
• Having found the two adjacent values, the practice is to
average these to determine a median.
• In this case the median would be given as 22.5.
The median
• This small data set can be used to illustrate two
important points.
• First, the averaging of the two middle values can
produce a value not possible in the original data
(22.5 cars) and secondly, the median is not
sensitive to changing values away from the centre.
• As an extreme example, suppose the largest value
of 31 had been wrongly recorded as 310, the mean
would change substantially but the median would
stay the same.
• In this sense, the median can be regarded as a more
'robust' statistic.
The mode
• The mode is the most frequently occurring
observation.
• Given the data:
10 22 31 9 24 27 29 9 23 12
it can easily be seen that 9 occurs twice and is
therefore the most frequent value, i.e. it is the
mode.
The mode
• One of the problems with the mode can be illustrated if
we consider the first 12 values given in the Case 1 data:
10 22 31 9 24 27 29 9 23 12 33 29
• In this case there are two modes, 9 and 29, since both
values occur twice.
• If an extra data value were to become available, say 10,
there would then be three modes.
• If the extra value had been a 9 rather than a 10, there
would only be one mode.
• For some sets of data the mode can be unstable.
• Again, the user should not just accept the statistic at
'face-value', but consider what it really means and what
interpretation can be given.
The mode
• The determination of the mean (using the Excel average
function), the median and the mode for this illustrative
untabulated data is shown in Figure 5.2.
5.1.2 Tabulated (ungrouped)
discrete data
• It is far easier to manage discrete data in a
tabulated form where the highest and lowest
values, and the most frequent values, can be
quickly identified.
• The number of working days lost by employees in
the last quarter may be of particular interest to a
manager.
• The data could be presented as shown in Table 5.1.
The mean
• To calculate the average number of working days lost
last quarter, we first need to find the total number of
days lost and then divide by the number of employees
included.
• In this example, 410 employees lost no days adding
zero to the overall total, 430 employees lost one day
adding 430 days to the overall total, 290 employees lost
two days adding 580 days to the overall total, and so
on.
• The calculation, as shown in Table 5.2, involves
multiplying the number of lost days, x, by the
frequency, f, to obtain a column of sub-totals, fx. This
new column, fx, is then summed to give a total  fx
which is then dividing by the number included, n
(where n = the sum of frequencies  f ).
The median
• The tabulation of the number of 'lost days' has
effectively ordered the data (the first 410
employees lost no days, the next 430 lost one day
and so on).
• To find the median, which is a statistic concerned
with order, we can either make a 'running' count or
use cumulative frequencies.
• Cumulative frequency is the number of items with
a given value or less.
• To calculate cumulative frequency, we just add the
next frequency to the running total - see Table 5.3.
The median
• As the data is discrete, the position of the median is
found using the formula (n + 1)/2 which gives the
720 1/2th ((1440 + 1)/2) ordered observation, i.e. it
will lie between the 720th and the 721st 'ordered'
employee.
• It can be seen from the cumulative frequency that
410 employees lost no days and 840 lost one or less
days.
• By deduction, the 720th and the 721st employee
both lost one day; the median is therefore one day
The mode
• The mode corresponds to the highest frequency
count, which is one day lost (which can be seen
easily on the original Table 5.1).
• In this case the median and the mode both give the
same value of 1 and the mean gives the slightly
higher value of 1.451.
• The effect of a few employees losing a higher
number of days is to pull the mean upwards.
5.1.3 Tabulated (grouped)
continuous data
• As discussed in Part 1, continuous measurement is
the result of using an instrument of measurement
and will give values like 5 or 5.2 or 5.1763,
depending on the requirements of the user (e.g. to
the nearest whole number or ±0.05 or ±0.00005)
and the accuracy of the measuring device.
• Values will either come as a long list (e.g. a data
file), or collated in a table where values are grouped
by non-overlapping intervals.
• Table 4.14, reproduced below as Table 5.4, is typical
of continuous data.
5.1.3 Tabulated (grouped)
continuous data
• If data is given to us in the form of Table 5.4, then we no
longer know the exact value of each observation.
• We only know, for example, that two respondents spent
under £5, but not how much under £5.
• In this type of case, we have to assume a value for each group
of respondents and estimate the descriptive statistic.
• In practice, we assume that all the values within a group are
evenly spread (the larger values tending to cancel the smaller
values) and can be reasonably represented by the mid-point
value.
• This is the only realistic assumption we can make unless we
have additional information.
• If we were to use the lower limit value, we are likely to underestimate the mean; if we were to use the upper limit value,
we are likely to over-estimate the mean.
5.1.3 Tabulated (grouped)
continuous data
• Looking at much of the published data, we often
find that the first and last groups are left as openended, for example, 'under £1500' or 'over £100
000'.
• In these cases it will be necessary to make
assumptions about the upper or lower limits before
we can calculate the mean.
• There are no specific rules for estimating such endpoints, but you should consider the data you are
trying to describe.
5.1.3 Tabulated (grouped)
continuous data
• If, for example, we were given data on the 'age of first
driving conviction' with a first group labelled 'under 17
years' it would hardly be realistic to use a lower limit of
zero! (It is quite difficult to drive at a few months old.)
• Looking at this data, we might decide to use the
minimum age at which a driving licence can normally
be obtained, but being caught driving one's parents'
Porsche around the M25 at the age of 15 would be
likely to lead to some form of conviction.
• There is no correct answer: it is a question of knowing,
or at least thinking about, the data.
The mean
• Once we have established the limits for each group
we can then find the mid-points and use these as
the x values in our calculations.
• The formula to use is
where x now represents the mid-point values.
• The procedure is shown in Table 5.5.
The mean
• Care must be taken to clarify the interval range and the midpoints.
• Generally, for continuous data, the mid-points can be easily found
by adding the upper and lower interval boundaries and dividing by
two.
• Discrete data can also be tabulated in a grouped format.
• If we were considering the number of visitors or enquiries during a
given time we might use an interval like '10 but under 20'.
• This would include 10, 11, 12, 13, 14, 15, 16, 17, 18 and 19 but not
20.
• The mid-point would be 14.5 and not 15.
• Particular care needs to be taken when working with 'age' data.
• Age is often given in the form: 10-14 years, 15-19 years.
• Since age is continuous (and we refer to age last birthday), the
mid-points would be 12.5 and 17.5.
The median
• We can determine the median either graphically or
by calculation.
• The first step in both cases is to find the cumulative
frequencies, as shown in Table 5.6.
• In this example, two respondents spent less than £5,
eight respondents spent less than £10 and so on.
• It should be noted that the cumulative frequency refers
to the upper boundary of the corresponding interval.
The median - the graphical method
• To find the median graphically, we plot cumulative
frequency against the upper boundary of the
corresponding interval and join the points with
straight lines (this is the graphical representation of
the assumption that values are evenly spread within
groups).
• The resultant frequency graph or ogive is shown as
Figure 5.3.
The median - the graphical method
• To identify the median value for continuous data
the formula n/2 is used (not [n + l)/2). As you will
see as you move to more advanced statistics, with
continuous data we are dividing a distribution (the
area under a curve) in two, and not a list of
numbers.
• In this case, the median is the value of the 22nd
observation, which can be read from the ogive as
£17.50.
The median – by calculation
• To calculate the median, we must first locate the group
that contains the 22nd observation, i.e. the median
group.
• Looking down the cumulative frequency column of
Table 5.6, we can see that 16 respondents spend less
than £15 and 28 respondents spend less than £20.
• The 22nd observation (respondent) must lie in the
group '£15 but less than £20'.
• The median must be £15 plus some fraction of the
interval of £5.
• The median observation lies six respondents into this
group (22 is the median observation minus the 16
observations that lie below this group).
The median – by calculation
• There are 12 respondents in this median group, so
the median lies 6/12ths of the way through the
interval.
• The median is equal to:
The median – by calculation
• In terms of a formula we can write:
• where l is the lower boundary of the median group,
i is the width of the median group, F is the
cumulative frequency up to the median group and f
is the frequency in the median group.
• Using the figures from the example above:
The mode
• We can determine the mode either graphically or by
calculation.
• As the data is continuous and given in intervals, the
mode can most easily be thought of as the point of
greatest density or concentration.
• To estimate the mode, we need to refer to the
histogram, originally drawn as Figure 4.15 and now
given as Figure 5.4.
The mode
• To estimate the mode, we first identify the tallest
block on the histogram (scalinghas already taken
place) and join the corner points, as shown.
• The point of intersection locates the mode.
• In this case the mode is £17 (to the nearest £1).
The mode - by calculation
• The formula for the mode can be given as
• where l is the lower boundary of the modal group,
fm is the (scaled) frequency of the modal group, fm-1
is the (scaled) frequency of the pre-modal group,
fm+1 is the (scaled) frequency of the post-modal
group and i is the width of the modal group.
The mode - by calculation
• The frequencies and the scaling effect can be seen
in Figure 5.4.
• The main adjustment to make is the use of five
rather than 10 (the interval where the width
doubled) for the post-modal frequency.
• The mode is then:
The mode - by calculation
• A spreadsheet showing the calculation of the mean,
median and mode for tabulated continuous data is
given as Figure 5.5.
• Excel does not provide specific functions for this
kind of tabulated data and the spreadsheet needs
to be constructed using equations.
• The cell references have not been shown for the
calculations of the median and the mode (which are
as above) as the identification of the median and
modal intervals are always an important
prerequisite and can be ignored if the spreadsheet
is modified.
5.2 Other measures of location
• The mean, median and mode as described are not
the only measures of location.
• The geometric mean and the harmonic mean are
included here for the completeness of the chapter
(and for reference purposes), but their use is limited
to particular types of data and is rather specialist.
• The weighted mean is used if we wish to adjust the
representation of the raw data; it is particularly
important in market research.
5.2.1
The geometric mean
• The geometric mean is defined to be 'the nth root of
the product of n numbers' and is particularly useful
when we are trying to average percentages. (It is also
used with index numbers.)
• Given the percentage of time spent on a certain task,
we have the following data:
30% 20% 25% 31% 25%
• Multiplying the five numbers together gives
30 x 20 x 25 x 31 x 25 = 11 625 000
and taking the fifth root gives the geometric mean as
25.8868%.
• A simple arithmetic mean would give the answer 26.2%
which is an over-estimate of the amount of time spent
on the task.
5.2.2
The harmonic mean
• The harmonic mean is used where we are looking at
ratio data, for example miles per gallon or output per
shift.
• It is defined as 'the reciprocal of the arithmetic mean of
the reciprocals of the data'.
• For simple data it is not too difficult to calculate.
• For example, if we have data on the number of miles
per gallon achieved by five company representatives:
23 25 26 29 23
• then, to calculate the harmonic mean, we find the
reciprocals of each number:
0.043478 0.04 0.038461 0.034483 0.043478
5.2.2
The harmonic mean
• find their average:
and then take the reciprocal of the answer: 25.0125.
Thus the average fuel consumption of the five
representatives is 25.0125 miles per gallon.
(The arithmetic mean would be 25.2 mpg.)
5.2.3
Weighted means
• Suppose that we were given a grouped frequency
distribution of weekly income for a particular group of
workers and in addition the average income within
each of these categories.
• In this case we could use the set of averages rather
than midpoints to calculate an overall mean.
• In terms of our notation we would need to write the
formula as:
i i
x f

x
•
n
where x remains the overall mean, xi is the mean in
category i and fi is the frequency in category i.
5.2.3
Weighted means
• The calculation using a set of averages is shown in
Table 5.7.
• We would note from the table, for example, that
the first 10 workers have an average weekly income
of £170, and together earn £1700.
• The overall mean is
5.2.3
Weighted means
• where fi/n are the weighting factors.
• These weighting factors can be thought of as a
measure of size or importance.
• They can be used to correct inadequacies in data or
to collate results from a survey which was not
completely representative.