Measures of central tendency are statistics that express the most typical or average scores in a distribution. These measures are:
• The Mode
• The Median
• The Mean

The simplest measure of central tendency is the mode; the mode is the value that occurs with the greatest frequency within a data set.

Students weighed two different samples of broad beans and obtained the data shown below:

Sample A (g)   1.42  1.42  1.43  1.44  1.44  1.44  1.44  1.45  1.46  1.46
Sample B (g)   1.36  1.37  1.40  1.43  1.44  1.44  1.47  1.48  1.49  2.01

The most frequently occurring value (i.e. the mode) in both Sample A and Sample B is 1.44 g.

The Median is the central or middle value of a set of values when placed in order.

As Sample A includes an even number of values, the median is halfway between the middle two, i.e.
1.44 and 1.44; these values are the same and the median is therefore 1.44 g.

The Mean is obtained by adding up all of the values and then dividing their sum by the number of values. The formula for calculating the mean is:

\[ \bar{x} = \frac{\sum x}{n} \]

where
x̄ = the mean
Σ = the sum of
x = any value
n = number of values

Calculate the means for samples A and B.

For Sample A, the mean is:

\[ \bar{x}_A = \frac{1.42 + 1.42 + 1.43 + 1.44 + 1.44 + 1.44 + 1.44 + 1.45 + 1.46 + 1.46}{10} = \frac{14.40}{10} = 1.44 \text{ g} \]

For Sample B, the mean is:

\[ \bar{x}_B = \frac{1.36 + 1.37 + 1.40 + 1.43 + 1.44 + 1.44 + 1.47 + 1.48 + 1.49 + 2.01}{10} = \frac{14.89}{10} = 1.489 \text{ g} \]

The Mode and Grouped Data

When data are grouped, it is not possible to quote the mode precisely; the ‘modal class’ is used to describe the data instead. The modal class for the height data is 1.51 – 1.58 m.

Rule of Thumb

In general, the mean is used as a measure of central tendency with quantitative (interval) data, unless the distribution is markedly skewed. When summarising qualitative data, the mode or median is the most appropriate measure of central tendency. When the distribution of interval data is highly skewed, the most appropriate measure of central tendency is the median.

Measures of central tendency alone are insufficient to characterise the distribution of data. A measure of how much the data are dispersed or ‘spread out’ is also needed. Four statistics can be used to indicate dispersion:
• The range
• The variance
• The standard deviation
• The interquartile range

In most cases, the mean and standard deviation are used with quantitative (interval) data, while the mode or median and the interquartile range are used for qualitative variables.

Standard Deviation

The Standard Deviation (s) of a set of values is a measure of the
spread of the values from the mean. A formula for calculating the standard deviation is:

\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} \]

where
s = standard deviation
x − x̄ = the deviation of a value from the mean
Σ = the sum of
n = number of values

A quicker method for calculating the standard deviation is to use the equation shown below; this method is less tedious and less prone to error:

\[ s = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2} \]

where
s = standard deviation
x = any individual value
x̄ = the mean of a set of values
Σ = the sum of
n = number of values

Calculate the standard deviation for the bean samples A and B.

The standard deviation for Sample A is 0.013 g.
The standard deviation for Sample B is 0.179 g.
Sample B data display greater variation than Sample A data.

The standard error of the mean provides an estimate of how close a sample mean is likely to be to the true mean of a whole population. The standard error is calculated using a formula that takes into account the standard deviation of the sample (s) and the sample size (n):

\[ SE = \frac{s}{\sqrt{n}} \]

The formula shows that the larger the sample size, the smaller the standard error of the mean. A graph of standard error of the mean against sample size reveals an interesting trend: increasing the sample size by a few subjects makes a large difference to the standard error when the sample size is small, but makes much less of a difference when the sample size is large.

The standard error of the mean can be used to define confidence limits or intervals. A student measured the heights of 62 individuals and found the mean height to be 1.64 metres. The standard deviation of this sample was found to be 0.129 metres. The student then estimated the standard error of the mean:

\[ SE = \frac{s}{\sqrt{n}} = \frac{0.129}{\sqrt{62}} = 0.016 \]

The student can be 68% confident that the true mean of the population falls within the range ± 0.016 of the mean of the sample, i.e.
1.64 ± 0.016 (mean ± 1 SE). What this means is that the interval between 1.624 and 1.656 (the confidence limits) has a 68% probability of containing the true mean.

Researchers more commonly use the 95% confidence interval.

\[ SE = \frac{s}{\sqrt{n}} = \frac{0.129}{\sqrt{62}} = 0.016 \]

In this case, the student can be 95% confident that the true mean of the population falls within approximately two standard errors of the mean of the sample, i.e. 1.64 ± 0.032 (mean ± 2 SE). This means that the interval between 1.608 and 1.672 (the confidence limits) has a 95% probability of containing the true mean.

More accurate calculations make use of z scores for a normal distribution to estimate confidence intervals; for 95% confidence intervals, the standard error is multiplied by 1.96, giving mean ± 1.96 SE.

Broad Bean Samples

Sample A (g)   Sample B (g)
1.42           1.36
1.42           1.37
1.43           1.40
1.44           1.43
1.44           1.44
1.44           1.44
1.44           1.47
1.45           1.48
1.46           1.49
1.46           2.01

Estimate the standard error for Sample A and Sample B. You may check your answers by entering the data into a suitable statistics programme.

The two students obtained very different statistical values from their data even though the beans had been drawn from the same population; can you suggest reasons for these differences?
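As one way of checking the answers, the calculations described so far can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `summarise` and the dictionary keys are made up for this example, and the standard deviation uses the divisor n, as in the formula given earlier, rather than the n − 1 divisor that many statistics packages apply by default.

```python
import math
import statistics

# Broad bean masses (g) taken from the table above.
sample_a = [1.42, 1.42, 1.43, 1.44, 1.44, 1.44, 1.44, 1.45, 1.46, 1.46]
sample_b = [1.36, 1.37, 1.40, 1.43, 1.44, 1.44, 1.47, 1.48, 1.49, 2.01]


def summarise(values):
    """Return the mean, median, mode, standard deviation, standard error
    and approximate 95% confidence limits for a list of values.

    Assumption: the standard deviation divides by n (not n - 1),
    matching the formula quoted in the text.
    """
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    se = sd / math.sqrt(n)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "sd": sd,
        "se": se,
        # mean ± 1.96 SE, the z-score form of the 95% interval
        "ci95": (mean - 1.96 * se, mean + 1.96 * se),
    }


for name, data in (("A", sample_a), ("B", sample_b)):
    s = summarise(data)
    print(f"Sample {name}: mean={s['mean']:.3f} g  sd={s['sd']:.3f} g  se={s['se']:.4f} g")
```

With this divisor the sketch should reproduce the standard deviations quoted earlier (0.013 g and 0.179 g) and give standard errors of roughly 0.004 g for Sample A and 0.056 g for Sample B. The single heavy bean (2.01 g) in Sample B accounts for most of the difference.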
Samples A and B are sub-groups of the total population of broad beans and may not therefore be truly representative of the population as a whole. Variations between the samples and the original population may arise as a consequence of:
• Bias in sampling: the students may have unknowingly been selective when choosing beans to weigh. Random sampling methods should be used to eliminate bias from the results.
• Chance: the students may have, by chance, selected a particular set of beans. This is more likely to be the case when only one sample is taken and when the sample size is small. Taking at least three samples (replication), choosing appropriate sample sizes and obtaining mean results from these different samples helps to eliminate chance effects from experimental values.
• Measurement error: errors arising from taking any form of measurement are not uncommon. When the same material is measured or weighed on a different occasion, different values are often obtained.

More Data

A group of students measured the masses of individual French bean seeds and their results are shown in the table.
• Present these results in graphical form
• Calculate the mean, median and mode for these results
• Calculate the standard deviation
• Estimate the standard error of the mean
• Define the confidence limits for the mean of this set of data
• Check your answers with a suitable statistics programme

Mass of Bean Seeds (g)

A knowledge of the shape of the distribution of values obtained in an investigation is crucially important for choosing an appropriate statistical test for analysis.

A normal distribution is completely determined by the values of its mean and standard deviation. When the mean is zero and the standard deviation is one, the normal curve is said to be in ‘standard form’, and a characteristic ‘bell’ shaped graph is obtained.

The bell-shaped curve of a normal distribution has the following characteristics:
• It is symmetrical about the mean, so that
equal numbers of values fall above and below the mean (mean = median = mode)
• Relatively few values fall into the high or low categories of the distribution; about 68% of its values are within one standard deviation of the mean
• About 95% of its values are within two standard deviations of the mean
• About 99.7% of its values are within three standard deviations of the mean

Many investigations generate data which approximate to the normal distribution.

Skewed distributions deviate from the ‘normal’ distribution curve; their distributions are asymmetrical. The mean, mode and median differ in a skewed distribution: the mean and median values are less than the mode for a negatively skewed distribution, and greater than the mode for a positively skewed distribution.

The degree of skewness can be determined by calculating the coefficient of skewness (Sk):

\[ Sk = \frac{\text{mean} - \text{mode}}{s} \]

where s = the standard deviation.

When the distribution of interval data is highly skewed, the median and interquartile range should be used as measures of central tendency and dispersion.

A useful visual method for assessing whether a set of data can be assumed to have come from a normal distribution is to plot the data against their cumulative frequency distribution on special graph paper. The graph paper used for this plot is called normal probability paper. When the graphed data lie close to a straight line, we may assume that the distribution is approximately normal.

Class Interval (m)   Frequency   Cumulative Frequency   Percentage Cumulative Frequency
1.30 – 1.37           3
1.37 – 1.44          12
1.44 – 1.51          14
1.51 – 1.58          24
1.58 – 1.65          23
1.65 – 1.72          22
1.72 – 1.79          16
1.79 – 1.86           6

Use the human height data above to obtain the cumulative frequencies and the percentage cumulative frequencies. Percentage cumulative frequencies are obtained by dividing each cumulative frequency by the total frequency (here 120) and multiplying by 100.

Class Interval (m)   Frequency   Cumulative Frequency   Percentage Cumulative Frequency
1.30 – 1.37           3            3                      2.50
1.37 – 1.44          12           15                     12.50
1.44 – 1.51          14           29                     24.17
1.51 – 1.58          24           53                     44.17
1.58 – 1.65          23           76                     63.33
1.65 – 1.72          22           98                     81.67
1.72 – 1.79          16          114                     95.00
1.79 – 1.86           6          120                    100.00

Plot a graph of percentage cumulative frequency against the upper class boundary of the height data using the provided probability graph paper. Assess the normality of the distribution for the height data.

The probability plot for the height data shows that the points lie close to a straight line, and we may assume that the distribution is approximately normal.

Using the same method, test the bean data below for normality.

BEAN DATA

Class Interval (g)   Frequency   Cumulative Frequency   Percentage Cumulative Frequency
0.91 – 1.04           2            2                      2.86
1.04 – 1.17           5            7                     10.00
1.17 – 1.30           5           12                     17.14
1.30 – 1.43          14           26                     37.14
1.43 – 1.56          13           39                     55.71
1.56 – 1.69          11           50                     71.43
1.69 – 1.82           9           59                     84.29
1.82 – 1.95           7           66                     94.29
1.95 – 2.08           1           67                     95.71
2.08 – 2.21           3           70                    100.00

The probability plot for the bean data shows that the points lie close to a straight line, and we may assume that the distribution is approximately normal.
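Where probability paper is not to hand, the same check can be approximated in Python. The sketch below rebuilds the cumulative and percentage cumulative frequencies for the height data, then converts each percentage to a z value of the standard normal curve, which is in effect what the scale on normal probability paper does. The function names are made up for this example, and using a correlation coefficient to judge how straight the plot is an added assumption, not part of the graphical method described above.

```python
from statistics import NormalDist

# Upper class boundaries (m) and frequencies from the height table above.
upper_bounds = [1.37, 1.44, 1.51, 1.58, 1.65, 1.72, 1.79, 1.86]
frequencies = [3, 12, 14, 24, 23, 22, 16, 6]


def cumulative_percentages(freqs):
    """Return (cumulative frequencies, percentage cumulative frequencies)."""
    total = sum(freqs)
    cumulative, running = [], 0
    for f in freqs:
        running += f
        cumulative.append(running)
    return cumulative, [100 * c / total for c in cumulative]


def normal_scores(percentages):
    """Convert percentages strictly between 0 and 100 to standard normal z values."""
    std_normal = NormalDist()  # mean 0, standard deviation 1 ('standard form')
    return [std_normal.inv_cdf(p / 100) for p in percentages]


def pearson_r(xs, ys):
    """Correlation coefficient, used here only as a rough guide to straightness."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


cumulative, percentages = cumulative_percentages(frequencies)

# The last class always reaches 100%, which has no finite z value, so it is left out.
z_values = normal_scores(percentages[:-1])
r = pearson_r(upper_bounds[:-1], z_values)

for bound, c, p in zip(upper_bounds, cumulative, percentages):
    print(f"{bound:.2f}  {c:4d}  {p:7.2f}")
print(f"correlation between upper boundary and z value: {r:.3f}")
```

A correlation close to 1 corresponds to points lying close to a straight line on probability paper. Replacing the two lists with the bean class boundaries and frequencies allows the bean data to be tested in the same way.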