Survey
Lecture 3: Summarising Data
For use in fall semester 2015. Lecture notes were originally designed by Nigel Halpern. This lecture set may be modified during the semester. Last modified: 4-8-2015
SCM300 Survey Design

Lecture Aim & Objectives
Aim
• To investigate pictorial & statistical methods of analysing quantitative data
Objectives
• Pictorial representation of quantitative data
• Statistical representation of quantitative data

Pictorial Representation
• Levels of measurement
• Tables & frequency distributions
• Charts, plots, graphs & pie-charts

3 (4) Levels of Measurement
• Nominal variables
• Ordinal variables
• Interval (& ratio) variables

Nominal
• Categories
  – e.g. gender (m/f), responses (y/n), class of travel (b/l)
• Usually presented as frequencies & categories or %’s
  – e.g. 45% male, 55% female
• Measure the existence (or not) of a characteristic
  – But contain limited information

Ordinal
• Ordered categories or preferences
  – e.g. ranked responses from a Likert scale
  – e.g. finishers in a race (1st, 2nd, 3rd, etc.)
  – e.g. preferred aircraft
• Measure intensity, order or degree
  – But still limited as they don’t imply distances
    • i.e. the distance between 1st & 2nd is unknown

Interval & Ratio
• Ordered & scaled (on equal intervals)
  – e.g. age in years, temperature
• Measure differences between values
  – Interval: arbitrary zero
    • e.g. temperature (+/-)
  – Ratio: absolute zero indicating absence of that variable
    • e.g. age, income
• High analytical capabilities
  – e.g. can compare means, unlike for nominal or ordinal data

Levels of Measurement Summary
Variable type    Description                                                      Examples
Nominal          Classification of responses into mutually exclusive categories   Male/Female, Yes/No
Ordinal          Categories are rank ordered                                      1st/2nd/3rd, Likert
Interval/Ratio   Distances between items on scale are equal                       Temperature, Age

Your turn…
• What levels of measurement would be derived from each of the following questions?
1. Gender (male/female)
2. Age in years and months (state years/months)
3. Do you smoke (yes/no)
4. How many cigarettes, on average, do you smoke a day (state no.)
5. Number of full years you’ve been smoking (state no.)
6. How many minutes exercise do you do, on average, each day (less than 30mins / 30-59mins / 60+mins)
7. To what extent do you think that smoking is bad for your health (Strongly agree / tend to agree / neither / tend to disagree / strongly disagree)
8. Rank the cigarette brands in order of quality (B&H, Silk Cut, Marlboro)

Tables
• Most straightforward pictorial representation
• Good method of storing information
• Summarise &/or show patterns in data
• Easily made using word-processing or spreadsheets
• Confusing if constructed poorly
• Confusing if they try to show too much

Table Considerations
• Should be clear & appropriate
• Should be chosen with a purpose in mind
  – Not just for the sake of it
• Must include a title & a source of data
• Must be referenced & discussed in the text
  – Don’t assume that everyone will understand them

Table Clarity
• Use a common system of data presentation
• Use percentages rather than raw scores for clarity & comparative capabilities
The above points are particularly relevant if the table includes more than one variable (AKA ‘cross-tabulation’), especially when the variables are measured in different units.

Data from a survey of pax at LGW, LHR & MAN (CAA, 2000):
- 34,650 Business Pax: A/B=18,607; C1=14,345; C2=1,386; D/E=312
- 130,350 Leisure Pax: A/B=43,407; C1=52,400; C2=21,508; D/E=13,035

Table 1. Passengers at LGW, LHR & MAN, 1999
Socio-economic status   Business passengers   Leisure passengers   Total
A/B                     18,607                43,407               62,014
C1                      14,345                52,400               66,745
C2                      1,386                 21,508               22,894
D/E                     312                   13,035               13,347
Total                   34,650                130,350              165,000

Use percentages instead?

Table 1. Passengers at LGW, LHR & MAN, 1999 (%)
Socio-economic status   Business passengers   Leisure passengers   Total
A/B                     54                    33                   38
C1                      41                    40                   40
C2                      4                     17                   14
D/E                     1                     10                   8
Total                   21                    79                   100
Note: each column shows % within that passenger type; the Total row shows each passenger type as a % of all passengers.

Easier to interpret?
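The percentages in Table 1 can be reproduced from the raw passenger counts. The sketch below is only an illustration: the dictionary layout and the function name `column_percentages` are assumptions made for this example, and each cell is rounded to a whole-number percentage of its column total.

```python
# Raw passenger counts from Table 1 (CAA, 2000).
counts = {
    "A/B": {"business": 18607, "leisure": 43407},
    "C1": {"business": 14345, "leisure": 52400},
    "C2": {"business": 1386, "leisure": 21508},
    "D/E": {"business": 312, "leisure": 13035},
}


def column_percentages(counts):
    """Return each cell as a rounded % of its column total, plus the totals."""
    totals = {"business": 0, "leisure": 0}
    for row in counts.values():
        for column, n in row.items():
            totals[column] += n
    totals["total"] = totals["business"] + totals["leisure"]

    table = {}
    for status, row in counts.items():
        # Add the row total so the 'Total' column can be expressed as a % too.
        cells = dict(row, total=row["business"] + row["leisure"])
        table[status] = {
            column: round(100 * n / totals[column]) for column, n in cells.items()
        }
    return table, totals


percentages, totals = column_percentages(counts)
for status, row in percentages.items():
    print(f"{status:<4} {row['business']:>3} {row['leisure']:>3} {row['total']:>3}")

# Share of each passenger type among all passengers (the Total row).
business_share = round(100 * totals["business"] / totals["total"])
leisure_share = round(100 * totals["leisure"] / totals["total"])
print(f"Total {business_share} {leisure_share} 100")
```

Percentages make the two passenger types directly comparable even though there are almost four times as many leisure passengers as business passengers.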
Frequency Distributions
• Standard frequency distribution
• Univariate frequency distribution
• Grouped frequency distribution
• Relative & cumulative frequency distribution

Standard Frequency
• Standard frequency distribution
  – Presents the data as collected
    • e.g. “How many return flights did you take last year?”
    • Answers from 50 pax as a standard frequency distribution:

Number of return flights taken last year:
7  3  10 3  2  4  3  3  6  3
5  2  3  4  2  5  4  3  6  8
4  12 1  3  4  15 5  1  3  1
4  2  3  5  2  3  8  3  4  4
6  3  5  2  4  2  3  2  5  1

Univariate Frequency
• Univariate frequency distribution
  – Lists data more clearly & with their frequency
  – Important for large sample sizes

Flights   Frequency
1         4
2         8
3         14
4         9
5         6
6         3
7         1
8         2
10        1
12        1
15        1

Grouped Frequency
• Grouped frequency distribution
  – Groups all data according to categories
  – Further improves clarity

Flights   Grouped frequency
1-3       26
4-6       18
7-9       3
10-12     2
13+       1
Total     50

Relative & Cumulative Frequency
• Relative & cumulative frequency distributions
  – Relative: each category as a % of the total
  – Cumulative: add each relative frequency to the preceding cumulative total

Flights   Grouped   Relative (%)   Cumulative (%)
1-3       26        52             52
4-6       18        36             88
7-9       3         6              94
10-12     2         4              98
13+       1         2              100
Total     50        100

Too many numbers…?
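The grouped, relative and cumulative distributions above can all be derived from the univariate frequency table. The following is a minimal sketch: the band boundaries come from the grouped table, while the function name `grouped_distribution` and the tuple layout are assumptions made for this example.

```python
# Univariate frequency table: number of return flights -> number of passengers.
univariate = {1: 4, 2: 8, 3: 14, 4: 9, 5: 6, 6: 3, 7: 1, 8: 2, 10: 1, 12: 1, 15: 1}

# Bands as (label, lowest value, highest value); None marks an open-ended band.
bands = [("1-3", 1, 3), ("4-6", 4, 6), ("7-9", 7, 9), ("10-12", 10, 12), ("13+", 13, None)]


def grouped_distribution(frequencies, bands):
    """Return rows of (label, grouped count, relative %, cumulative %)."""
    total = sum(frequencies.values())
    rows = []
    running_count = 0
    for label, low, high in bands:
        count = sum(
            n for value, n in frequencies.items()
            if value >= low and (high is None or value <= high)
        )
        running_count += count
        relative = 100 * count / total
        cumulative = 100 * running_count / total
        rows.append((label, count, relative, cumulative))
    return rows


rows = grouped_distribution(univariate, bands)
for label, count, relative, cumulative in rows:
    print(f"{label:<6} {count:>3} {relative:>5.0f} {cumulative:>5.0f}")
```

The cumulative column is computed from a running count rather than by adding rounded percentages, which avoids rounding drift when the relative frequencies are not whole numbers.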
Charts, Plots, Graphs & Pie-charts
• Simple bar charts
• Compound bar charts
• Histograms
• Scatter or dot plots
• Line graphs
• Pie-charts

Charts, Plots, Graphs & Pie-charts: Pros & Cons
• Easily made using word-processing or spreadsheets
• Ease of creation can lead to over-elaborate charts at the expense of clarity

Charts, Plots, Graphs & Pie-charts: Considerations
• Should be clear & appropriate
• Should be chosen with a purpose in mind
  – Not just for the sake of it
• Typically include
  – Title
  – Labelled axes
  – Key that explains the different segments
  – Source of data
• Must be referenced & discussed in the text
  – Do not assume that everyone will understand them
• Data type will restrict which method is chosen

Simple Bar Charts
• Simple bar charts
  – Horizontal or vertical charts of separate bars that represent size of data

Student results for SCM300 in 2007
Grade (%)            0-39   40-49   50-59   60-69   70+
Number of students   5      9       15      7       3

[Figure 1. Student results for SCM300 in 2007. Simple bar chart; x-axis: Grade (%); y-axis: Number of students]

Compound Bar Charts
• Compound bar charts
  – Show proportions/relative size of groups
  – Bars will always have the same height when %’s are used but not when figures are used
  – For 3+ components, pie-charts may be better

Student results for SCM300 in 2007
Grade (%)   0-39   40-49   50-59   60-69   70+
Male        4      7       7       2       0
Female      1      2       8       5       3

[Figure 1. Student results for SCM300 in 2007. Compound bar chart; x-axis: Grade (%); y-axis: Number of students (0-100%); key: Female, Male]

Histograms
• Histograms
  – Similar to bar charts but a better indication of variation & distribution
  – Bars are connected instead of separate

[Figure 1. Student results for SCM300 in 2007. Histogram; x-axis: Grade (%); y-axis: Number of students]

[Figure: repeat visits to Norway & tourists’ interest in returning] This figure indicates repeat visits to Norway & tourists’ interest in returning, but is it easy to understand…?

Scatter or Dot Plots
• Scatter or dot plots
  – Illustrate the exact distribution of data
  – Can be used to illustrate continuous data
    • BUT a line graph may be better
  – Effective for 2 related variables

[Figure 1. Passengers & Aircraft Movements at HiMolde Airport. Scatter plot; axes: Passengers, Aircraft movements]

Line Graphs
• Line graphs
  – Show trends over time
    • e.g. patterns, peaks & troughs, rates of incline/decline
  – Can show more than 1 variable at a time
    • This can indicate possible relationships
    • e.g. see next slide

[Figure: line graph example]

Pie Charts
• Pie-charts
  – Segments represent cases in each category
  – Best for 3-6 categories (no more, no less)
  – Labelling & shading sometimes difficult
  – Combining categories may improve clarity but loses detail

[Figure: pie chart. Segments: Retail 42%, Other 26%, Car park 21%, Catering 11%]

Too many pies…?
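The compound bar chart point that bars have the same height with percentages, but not with raw figures, can be checked from the male/female results table. This is a small sketch using those counts; the function name `percentage_split` is an assumption made for the example.

```python
# Student results for SCM300 in 2007, split by gender.
grade_bands = ["0-39", "40-49", "50-59", "60-69", "70+"]
male = [4, 7, 7, 2, 0]
female = [1, 2, 8, 5, 3]


def percentage_split(male, female):
    """Return (male %, female %) for each grade band."""
    shares = []
    for m, f in zip(male, female):
        band_total = m + f
        shares.append((100 * m / band_total, 100 * f / band_total))
    return shares


# Bar heights when raw figures are stacked: one total per band.
count_heights = [m + f for m, f in zip(male, female)]

# Bar heights when percentages are stacked: always 100.
shares = percentage_split(male, female)
percent_heights = [m_pct + f_pct for m_pct, f_pct in shares]

for band, height, (m_pct, f_pct) in zip(grade_bands, count_heights, shares):
    print(f"{band:<6} n={height:>2}  male={m_pct:5.1f}%  female={f_pct:5.1f}%")
```

The percentage version shows the gender mix within each grade band, but hides the fact that the bands contain very different numbers of students.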
Charts, Plots, Graphs & Pie-charts Summary
Variable type    Bar                Pie                Line
Nominal          Yes                Yes                No
Ordinal          Yes                Yes                No
Interval/ratio   Yes (if grouped)   Yes (if grouped)   Yes

Statistical Representation
• Measures of central tendency
• Measures of dispersion
• Normal distribution & skew

Measures of Central Tendency
• Raw data can be confusing & meaningless
• Measures of central tendency
  – AKA measures of location or average
  – Present the data in 1 single number
• 3 different measures; which to use depends on intention or data
  – See next slide

Measures of Central Tendency
• Mode
  – Definition: most commonly occurring value in a data set
  – Can be misleading if it is an extreme value, & there may be more than one mode (e.g. a bimodal distribution)
  – Data: any
• Median
  – Definition: central value representing the central point of a data set
  – When there is an even number of values, take the two middle values and find the mid-point between them. Extremes don’t distort it, but the data has to be ordered from lowest to highest in order to calculate it.
  – Data: ordinal or interval/ratio
• Mean
  – Definition: average value in a data set
  – Advantage is that it uses all values in a data set. Disadvantage is that it can only be used with interval/ratio data and, when there are few values in the data set, it can be distorted by extremes.
  – Data: interval/ratio

Example
Age of students
19 20 36 19 19 24 37 20 21 20
19 19 19 19 20 25 20 26 20 19
19 19 19 19 20 19 24 25 20 20
26 25 19 20 19 18 19 28 22 19
Mean 22 (21.5, rounded)
Median 20
Mode 19

Measures of Dispersion
• Measures of central tendency don’t show:
  – How closely related values are (i.e. clustered)
  – How representative they are of the data set
  – The range of values
  – The degree of distortion by extreme values

Salaries of office staff at HiMolde Airways:
- £11k, £15k, £15k, £18k, £25k, £30k, £32k, £38k
Salaries of office staff at HiMolde Airport:
- £20k, £21k, £22k, £23k, £23k, £24k, £25k, £26k
Mean salary at HiMolde Airways = £23k (£184k/8)
Mean salary at HiMolde Airport = £23k (£184k/8)

Measures of Dispersion
• Range
• Inter-quartile range
• Standard deviation

Range
• Simplest & crudest measure of dispersion
• Indicates spread of data
  – Place values in ascending order
  – Then subtract the smallest from the largest value
• Extreme values affect (determine) the outcome
• Range gives a greater insight into a data set than the mean alone
  – But gives no indication of the clustering of individual values

Range
Using the salary data above:
Range of salaries at HiMolde Airways = £38k - £11k = £27k
Range of salaries at HiMolde Airport = £26k - £20k = £6k

Inter-Quartile Range
• Most appropriate when using ordinal data
• Divides values into 4 equal parts (quartiles)
  – Is an extension of the idea of the median
• Represents the middle 50% of the values that fall between the 1st & 3rd quartiles
• Not affected by extremes
  – BUT doesn’t utilise all values
    • It discards 50% of the values & therefore provides a limited picture of the degree of clustering

Inter-Quartile Range
Min. value | 1st 25% of cases | Q1 | 2nd 25% of cases | Q2 (median value) | 3rd 25% of cases | Q3 | 4th 25% of cases | Max. value
Inter-quartile range = from Q1 to Q3

Inter-Quartile Range
Using the salary data above:
IQ Range of salaries at HiMolde Airways = £15k-£31k
IQ Range of salaries at HiMolde Airport = £22k-£24k

Standard Deviation
• Widely used in quantitative research
• Most useful measure of dispersion
• Utilises all data in the distribution
• Compares each value in the distribution with the mean
  – It examines the variance of the data around the mean
  – Therefore saying something about how representative the mean is for the data set

Standard Deviation
• Smaller SD = less variation
  – i.e. data is more concentrated around the mean
• Greater SD = greater variation
• However
  – Size of SD is in part a reflection of the size of the mean
    • So a large SD may simply be the product of a large mean
    • Because of this, both figures should be quoted
• Extreme numbers can distort the outcome
  – BUT have less of an impact than when using the range

Standard Deviation
Using the salary data above:
Standard deviation of salaries at HiMolde Airways ≈ £10k
Standard deviation of salaries at HiMolde Airport = £2k

Central Tendency & Dispersion Summary
                   Nominal       Ordinal                Interval/Ratio
Example            Male/Female   1st/2nd/3rd            Temperature
Central tendency   Mode          Median                 Mean
Dispersion         N/a           Inter-quartile range   Standard deviation

Normal Distribution & Skew
• Normal distribution
• Skew

Normal Distribution
• Normal if
  – Mean, median & mode coincide
  – Distribution is the same either side of the central values
    • e.g. see next slide
• Often referred to as a bell-shaped curve
  – 50% of the cases can be found either side of the central value
  – Values tend to be clustered around the mean
    • i.e. very few extreme values

[Figure: normal distribution curve; 50% of cases lie either side of the centre, where the mean, median & mode coincide]

Normal Distribution
• A normal distribution has certain properties
  – 68% of cases fall within 1 SD either side of the mean
  – 95% within 2 SDs
  – 99.7% within 3 SDs
    • e.g. see next slide
• Other % values can be calculated using statistical tables
  – Found in some statistics books
• Normal distribution is important for sampling & hypothesis testing
  – Many statistical tests assume data will be normally distributed

[Figure: normal curve marked from -3sd to +3sd around the mean; 68.27% of cases within ±1sd, 95.45% within ±2sd, 99.73% within ±3sd]

• Normal distribution is an ‘ideal’ type of distribution
• However, it is unlikely that real data sets will be exactly normal
• When they are not normal, they are often ‘skewed’

Skew
• +ve skew
  – Data set has a few very large values
    • i.e. most values cluster to the left
  – The mean will be larger than the median
• -ve skew
  – Data set has a few very small values
    • i.e. most values cluster to the right
  – The mean will be smaller than the median

[Figure: Positive Skew; the median lies to the left of the mean]

[Figure: Negative Skew; the mean lies to the left of the median]

Skew
• Skew is typically found where
  – Sample sizes are small
  – Bias has been introduced in the sampling process
• Skewed distributions can be determined
  – Visually using a histogram
  – Statistically by calculating a co-efficient of skewness (sk)

Co-efficient of Skewness
sk = 3 × (Mean – Median) / Standard Deviation
• Indicates the direction of the skew (+ve or –ve)
• Greater co-efficient = greater skew
• Normal distribution will have a co-efficient of 0

Summary
• Pictorial representation of quantitative data
  – 3 (4) levels of measurement
    • Nominal
    • Ordinal
    • Interval / ratio
  – Range of pictorial representation available
    • Choice is determined by the level of measurement

Summary
• Statistical representation of quantitative data
  – 3 measures of central tendency
    • Mean, median, mode
    • Choice is determined by the level of measurement
  – 3 measures of dispersion
    • Range, inter-quartile range, SD
    • Choice is determined by the level of measurement
  – Normal distribution & skew represent the distribution of responses

Recommended Reading
• Chapters 1-3 in Gaur, A.S. and Gaur, S.S. (2006). Statistical Methods for Practice and Research: A Guide to Data Analysis Using SPSS. New Delhi: Response Books.

“Thank you for your attention”
Questions…
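Worked example: the measures of central tendency, dispersion and skew covered in this lecture can be sketched with Python's standard `statistics` module, using the student ages and HiMolde salary data from the slides. This is an illustration under stated assumptions rather than the method used to produce the slide figures. It assumes the sample standard deviation (n - 1 denominator), which reproduces the quoted values of about 10 and 2. It also assumes quartiles taken as the medians of the lower and upper halves of the data; under that convention the Airport quartiles come out at 21.5 and 24.5, close to the rounded 22 and 24 quoted in the lecture. The helper names `quartiles` and `summarise` are hypothetical.

```python
import statistics

# Data from the lecture. Salaries are in £k.
ages = [
    19, 20, 36, 19, 19, 24, 37, 20, 21, 20,
    19, 19, 19, 19, 20, 25, 20, 26, 20, 19,
    19, 19, 19, 19, 20, 19, 24, 25, 20, 20,
    26, 25, 19, 20, 19, 18, 19, 28, 22, 19,
]
airways = [11, 15, 15, 18, 25, 30, 32, 38]
airport = [20, 21, 22, 23, 23, 24, 25, 26]


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumed convention: other methods interpolate between values and can
    give slightly different quartiles. Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return statistics.median(ordered[:half]), statistics.median(ordered[-half:])


def summarise(values):
    """Central tendency, dispersion and skew for interval/ratio data."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # sample SD (divides by n - 1)
    q1, q3 = quartiles(values)
    return {
        "mean": mean,
        "median": median,
        "modes": statistics.multimode(values),  # a list, as there may be several
        "range": max(values) - min(values),
        "q1": q1,
        "q3": q3,
        "sd": sd,
        "sk": 3 * (mean - median) / sd,  # co-efficient of skewness
    }


age_summary = summarise(ages)
airways_summary = summarise(airways)
airport_summary = summarise(airport)

for name, s in [("Ages", age_summary), ("Airways", airways_summary), ("Airport", airport_summary)]:
    print(
        f"{name}: mean={s['mean']:.1f} median={s['median']} modes={s['modes']} "
        f"range={s['range']} Q1={s['q1']} Q3={s['q3']} sd={s['sd']:.2f} sk={s['sk']:.2f}"
    )
```

Both salary sets share a mean of £23k, yet the range, quartiles and SD show that Airways salaries are far more spread out. The student ages have a positive co-efficient of skewness because a few older students pull the mean above the median.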