Survey

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts

Transcript

Applied Quantitative Methods MBA course Montenegro Peter Balogh PhD [email protected] Descriptive statistics • The first part of my presentation considered ways of collecting and presenting data. • The quality of the data is fundamental to the work that we do, and yet data collection can be the very stage where time and cost pressures are particularly demanding. • It is necessary for the researcher to ensure that data collection methods are adequate and that the data can inform the users in a meaningful way. • Fancy analysis, now made easy by computers, cannot compensate for poor data. • It is often said (using a machine analogy) that the results coming out can only be as good as the data going in. Descriptive statistics • However, searching for data, the collection of data and the use of charts and diagrams alone are not sufficient for most purposes. • Generally, we need to do more than 'paint' an illustrative picture. • We need to describe the data with a rigour that summary numbers allow perhaps comparing one set or sub-set of data with another. • Users will want numbers to work with, for example market share and rate of increase, to test ideas and to use to identify possible relationships. • This process of exploring data using acceptable techniques and theories is termed analysis. • The calculation of numbers should clarify the data, revealing similarities and differences that were not seen before. • The use of numbers can complement the skills of insightful observation and idea generation. Descriptive statistics • The following three chapters provide ways of summarizing the mass of detail contained in data sets. • The use of computer packages, such as MINITAB, SPSS or spreadsheets has made the calculation of descriptive statistics relatively simple. • This removal of the burden of calculation or 'number crunching' makes the selection and interpretation of appropriate statistics even more important. 5. Measures of location • This chapter refers to 'measures of location' rather than just the average, to emphasize that there is more than one 'typical' summary value (the simple average). • What is seen and accepted as a typical value will depend on the data we are considering, for example, the most popular model of car, opinions on service levels as expressed on a questionnaire rating scale, or the weekly cost of groceries. 5. Measures of location • We do not wish to question the use in 'everyday' language of the word 'average' ('she played an average game', 'he has an average job'), but when calculating statistics we need to be more precise, in a shopping survey, for example, we need to know what exactly is meant by the average amount spent or the average number of visits, so that we can use this information more effectively for description or inference. 5. Measures of location • We also need to be aware of all the assumptions being made - so for example, when we talk about the average amount spent on cigarettes, do we mean the average spent by smokers, or the average amount spent by all respondents? • Computer packages make such calculations relatively simple, but they can mask the assumptions being made. • We have detailed the manual methods of calculation in this chapter because these are still required by academic courses, are still useful for small data sets, can be necessary for checking results and can enhance your understanding of the particular statistic. 5. Measures of location 5.1 The mean, median and mode • The mean (or arithmetic mean) is the most often used measure of location or average, with the median and the mode being used for more specific (special case) applications. • Each of these statistics has its own characteristics and will generally produce a different result for a given set of data. • For some sets of data it is useful to determine all of these statistics (together they tell us something about the differences in the data), but for other data sets, not all of the calculable statistics may be valid. 5.1 The mean, median and mode • A major consideration will be the type of data we are dealing with - categorical, ordinal or cardinal. • As we will see, for example, it is not appropriate to calculate a mean for categorical data. • How can you have an average social class or country of origin? • Data can be discrete or continuous. • Is giving the average number of children as 1.8 a meaningful answer? 5.1 The mean, median and mode • Knowing the average number of children could be helpful in a population projection exercise, but less helpful in allocating aircraft seats. • We also need to consider the variation in the data (is it all closely bunched together or are there extreme values?). • We also need to decide whether we are going to include all values (do we trust all the values we have, or do we wish to exclude some for being unlikely or unrepresentative?). 5.1 The mean, median and mode • The arithmetic mean (usually just shortened to mean) is the name given to the 'simple average' that most people calculate. • It is easy to understand and a very effective way of communicating an answer. • It does not really apply to categorical data and its interpretation can be difficult when used with ordinal data, but it is used in this way and its use is often justified for practical reasons. 5.1 The mean, median and mode • The median is the middle value of an ordered list of data. • It is not as well known as the mean but can be more appropriate for certain types of data. • The mode is the most frequent value or item, typical examples being the most popular model of car or most common shoe size. 5.1.1 Untabulated data • Untabulated data will usually be presented to us as a list of numbers or rows of numbers (see Case study 1 data on the number of cars entering a car park during 10-minute intervals, file SDL1.XLS). • This data can come in any order (unranked rather than ranked) and can range from a few values to several thousand or more. • Suppose we consider just the first ten observations from this data set (to make life easy): 10 22 31 9 24 27 29 9 23 12 The mean • To calculate the mean, the numbers are added together to find the total, and this total is divided by the number of values included. In this case x where (pronounced x bar) is the symbol used to represent the mean. • It is important to clarify the units in use and to give an interpretation to the result. • In this example, 0.6 of a car is only meaningful in terms of an average value (unless we are actually cutting up cars). • We would also need to decide whether to round our answer to an average of 20 cars every 10 minutes for reporting purposes. The mean • As most statistics require some form of calculation, a shorthand has developed to describe the necessary steps. • Using this shorthand, or notation, the calculation of the mean would be written as follows: • where x represents individual values, (sigma) is an instruction to sum values, and n is the number of values. The median • The median is the value in the middle when numbers have been listed in either ascending or descending order (typically ascending order). • The first step is to rank, or order, the values of interest: • The next step is a matter of counting from the left or the right. The median • When working with a listing of values (but not with continuous data) the position of the middle value is found using the formula: • This is easy to use with an odd number of values (e.g. given five values the middle one would be given by (5 + l)/2, i.e. the third one). • When working with an even number of values, such as ten, a 1\2 emerges and we need to use the two adjacent values. • In this case, (10 + l)/2 gives 5 1/2 and we need to consider the fifth value of 22, and the sixth value of 23. • Having found the two adjacent values, the practice is to average these to determine a median. • In this case the median would be given as 22.5. The median • This small data set can be used to illustrate two important points. • First, the averaging of the two middle values can produce a value not possible in the original data (22.5 cars) and secondly, the median is not sensitive to changing values away from the centre. • As an extreme example, suppose the largest value of 31 had been wrongly recorded as 310, the mean would change substantially but the median would stay the same. • In this sense, the median can be regarded as a more 'robust' statistic. The mode • The mode is the most frequently occurring observation. • Given the data: 10 22 31 9 24 27 29 9 23 12 it can easily be seen that 9 occurs twice and is therefore the most frequent value, i.e. it is the mode. The mode • One of the problems with the mode can be illustrated if we consider the first 12 values given in the Case 1 data: 10 22 31 9 24 27 29 9 23 12 33 29 • In this case there are two modes, 9 and 29, since both values occur twice. • If an extra data value were to become available, say 10, there would then be three modes. • If the extra value had been a 9 rather than a 10, there would only be one mode. • For some sets of data the mode can be unstable. • Again, the user should not just accept the statistic at 'face-value', but consider what it really means and what interpretation can be given. The mode • The determination of the mean (using the Excel average function), the median and the mode for this illustrative untabulated data is shown in Figure 5.2. 5.1.2 Tabulated (ungrouped) discrete data • It is far easier to manage discrete data in a tabulated form where the highest and lowest values, and the most frequent values, can be quickly identified. • The number of working days lost by employees in the last quarter may be of particular interest to a manager. • The data could be presented as shown in Table 5.1. The mean • To calculate the average number of working days lost last quarter, we first need to find the total number of days lost and then divide by the number of employees included. • In this example, 410 employees lost no days adding zero to the overall total, 430 employees lost one day adding 430 days to the overall total, 290 employees lost two days adding 580 days to the overall total, and so on. • The calculation, as shown in Table 5.2, involves multiplying the number of lost days, x, by the frequency, f, to obtain a column of sub-totals, fx. This new column, fx, is then summed to give a total fx which is then dividing by the number included, n (where n = the sum of frequencies f ). The median • The tabulation of the number of 'lost days' has effectively ordered the data (the first 410 employees lost no days, the next 430 lost one day and so on). • To find the median, which is a statistic concerned with order, we can either make a 'running' count or use cumulative frequencies. • Cumulative frequency is the number of items with a given value or less. • To calculate cumulative frequency, we just add the next frequency to the running total - see Table 5.3. The median • As the data is discrete, the position of the median is found using the formula (n + 1)/2 which gives the 720 1/2th ((1440 + 1)/2) ordered observation, i.e. it will lie between the 720th and the 721st 'ordered' employee. • It can be seen from the cumulative frequency that 410 employees lost no days and 840 lost one or less days. • By deduction, the 720th and the 721st employee both lost one day; the median is therefore one day The mode • The mode corresponds to the highest frequency count, which is one day lost (which can be seen easily on the original Table 5.1). • In this case the median and the mode both give the same value of 1 and the mean gives the slightly higher value of 1.451. • The effect of a few employees losing a higher number of days is to pull the mean upwards. 5.1.3 Tabulated (grouped) continuous data • As discussed in Part 1, continuous measurement is the result of using an instrument of measurement and will give values like 5 or 5.2 or 5.1763, depending on the requirements of the user (e.g. to the nearest whole number or ±0.05 or ±0.00005) and the accuracy of the measuring device. • Values will either come as a long list (e.g. a data file), or collated in a table where values are grouped by non-overlapping intervals. • Table 4.14, reproduced below as Table 5.4, is typical of continuous data. 5.1.3 Tabulated (grouped) continuous data • If data is given to us in the form of Table 5.4, then we no longer know the exact value of each observation. • We only know, for example, that two respondents spent under £5, but not how much under £5. • In this type of case, we have to assume a value for each group of respondents and estimate the descriptive statistic. • In practice, we assume that all the values within a group are evenly spread (the larger values tending to cancel the smaller values) and can be reasonably represented by the mid-point value. • This is the only realistic assumption we can make unless we have additional information. • If we were to use the lower limit value, we are likely to underestimate the mean; if we were to use the upper limit value, we are likely to over-estimate the mean. 5.1.3 Tabulated (grouped) continuous data • Looking at much of the published data, we often find that the first and last groups are left as openended, for example, 'under £1500' or 'over £100 000'. • In these cases it will be necessary to make assumptions about the upper or lower limits before we can calculate the mean. • There are no specific rules for estimating such endpoints, but you should consider the data you are trying to describe. 5.1.3 Tabulated (grouped) continuous data • If, for example, we were given data on the 'age of first driving conviction' with a first group labelled 'under 17 years' it would hardly be realistic to use a lower limit of zero! (It is quite difficult to drive at a few months old.) • Looking at this data, we might decide to use the minimum age at which a driving licence can normally be obtained, but being caught driving one's parents' Porsche around the M25 at the age of 15 would be likely to lead to some form of conviction. • There is no correct answer: it is a question of knowing, or at least thinking about, the data. The mean • Once we have established the limits for each group we can then find the mid-points and use these as the x values in our calculations. • The formula to use is where x now represents the mid-point values. • The procedure is shown in Table 5.5. The mean • Care must be taken to clarify the interval range and the midpoints. • Generally, for continuous data, the mid-points can be easily found by adding the upper and lower interval boundaries and dividing by two. • Discrete data can also be tabulated in a grouped format. • If we were considering the number of visitors or enquiries during a given time we might use an interval like '10 but under 20'. • This would include 10, 11, 12, 13, 14, 15, 16, 17, 18 and 19 but not 20. • The mid-point would be 14.5 and not 15. • Particular care needs to be taken when working with 'age' data. • Age is often given in the form: 10-14 years, 15-19 years. • Since age is continuous (and we refer to age last birthday), the mid-points would be 12.5 and 17.5. The median • We can determine the median either graphically or by calculation. • The first step in both cases is to find the cumulative frequencies, as shown in Table 5.6. • In this example, two respondents spent less than £5, eight respondents spent less than £10 and so on. • It should be noted that the cumulative frequency refers to the upper boundary of the corresponding interval. The median - the graphical method • To find the median graphically, we plot cumulative frequency against the upper boundary of the corresponding interval and join the points with straight lines (this is the graphical representation of the assumption that values are evenly spread within groups). • The resultant frequency graph or ogive is shown as Figure 5.3. The median - the graphical method • To identify the median value for continuous data the formula n/2 is used (not [n + l)/2). As you will see as you move to more advanced statistics, with continuous data we are dividing a distribution (the area under a curve) in two, and not a list of numbers. • In this case, the median is the value of the 22nd observation, which can be read from the ogive as £17.50. The median – by calculation • To calculate the median, we must first locate the group that contains the 22nd observation, i.e. the median group. • Looking down the cumulative frequency column of Table 5.6, we can see that 16 respondents spend less than £15 and 28 respondents spend less than £20. • The 22nd observation (respondent) must lie in the group '£15 but less than £20'. • The median must be £15 plus some fraction of the interval of £5. • The median observation lies six respondents into this group (22 is the median observation minus the 16 observations that lie below this group). The median – by calculation • There are 12 respondents in this median group, so the median lies 6/12ths of the way through the interval. • The median is equal to: The median – by calculation • In terms of a formula we can write: • where l is the lower boundary of the median group, i is the width of the median group, F is the cumulative frequency up to the median group and f is the frequency in the median group. • Using the figures from the example above: The mode • We can determine the mode either graphically or by calculation. • As the data is continuous and given in intervals, the mode can most easily be thought of as the point of greatest density or concentration. • To estimate the mode, we need to refer to the histogram, originally drawn as Figure 4.15 and now given as Figure 5.4. The mode • To estimate the mode, we first identify the tallest block on the histogram (scalinghas already taken place) and join the corner points, as shown. • The point of intersection locates the mode. • In this case the mode is £17 (to the nearest £1). The mode - by calculation • The formula for the mode can be given as • where l is the lower boundary of the modal group, fm is the (scaled) frequency of the modal group, fm-1 is the (scaled) frequency of the pre-modal group, fm+1 is the (scaled) frequency of the post-modal group and i is the width of the modal group. The mode - by calculation • The frequencies and the scaling effect can be seen in Figure 5.4. • The main adjustment to make is the use of five rather than 10 (the interval where the width doubled) for the post-modal frequency. • The mode is then: The mode - by calculation • A spreadsheet showing the calculation of the mean, median and mode for tabulated continuous data is given as Figure 5.5. • Excel does not provide specific functions for this kind of tabulated data and the spreadsheet needs to be constructed using equations. • The cell references have not been shown for the calculations of the median and the mode (which are as above) as the identification of the median and modal intervals are always an important prerequisite and can be ignored if the spreadsheet is modified. 5.2 Other measures of location • The mean, median and mode as described are not the only measures of location. • The geometric mean and the harmonic mean are included here for the completeness of the chapter (and for reference purposes), but their use is limited to particular types of data and is rather specialist. • The weighted mean is used if we wish to adjust the representation of the raw data; it is particularly important in market research. 5.2.1 The geometric mean • The geometric mean is defined to be 'the nth root of the product of n numbers' and is particularly useful when we are trying to average percentages. (It is also used with index numbers.) • Given the percentage of time spent on a certain task, we have the following data: 30% 20% 25% 31% 25% • Multiplying the five numbers together gives 30 x 20 x 25 x 31 x 25 = 11 625 000 and taking the fifth root gives the geometric mean as 25.8868%. • A simple arithmetic mean would give the answer 26.2% which is an over-estimate of the amount of time spent on the task. 5.2.2 The harmonic mean • The harmonic mean is used where we are looking at ratio data, for example miles per gallon or output per shift. • It is defined as 'the reciprocal of the arithmetic mean of the reciprocals of the data'. • For simple data it is not too difficult to calculate. • For example, if we have data on the number of miles per gallon achieved by five company representatives: 23 25 26 29 23 • then, to calculate the harmonic mean, we find the reciprocals of each number: 0.043478 0.04 0.038461 0.034483 0.043478 5.2.2 The harmonic mean • find their average: and then take the reciprocal of the answer: 25.0125. Thus the average fuel consumption of the five representatives is 25.0125 miles per gallon. (The arithmetic mean would be 25.2 mpg.) 5.2.3 Weighted means • Suppose that we were given a grouped frequency distribution of weekly income for a particular group of workers and in addition the average income within each of these categories. • In this case we could use the set of averages rather than midpoints to calculate an overall mean. • In terms of our notation we would need to write the formula as: i i x f x • n where x remains the overall mean, xi is the mean in category i and fi is the frequency in category i. 5.2.3 Weighted means • The calculation using a set of averages is shown in Table 5.7. • We would note from the table, for example, that the first 10 workers have an average weekly income of £170, and together earn £1700. • The overall mean is 5.2.3 Weighted means • where fi/n are the weighting factors. • These weighting factors can be thought of as a measure of size or importance. • They can be used to correct inadequacies in data or to collate results from a survey which was not completely representative.