Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Measures of Central Tendency Measures of central tendency, or averages, are values that describe a set of data by identifying the central position within that set, e.g. the average life span of a human being. There are 3 different types: MEAN Use the mean to describe the middle of a set of data that does not have an outlier. Calculated by adding all the values together and dividing the total by the number of values in the data set, i.e. E.g. A darts player throws the following scores during a match: 80, 100, 62, 180, 21, 55 Her average score per throw = (80 + 100 + 62 + 180 + 21 + 55) ÷ 6 = 83 Advantages - Most popular measure in fields such as business, engineering and computer science. - It is unique - there is only one answer. - Useful when comparing sets of data. - Can be used with discrete or continuous data - The calculation includes every value in the data set. Disadvantages - Affected by extreme values (outliers) e.g. GDP per capita is often not a good measure of the average persons wealth because can be distorted by extremely rich peoples. - Should not be used if the frequency distribution of the data set is skewed MEDIAN Use the median to describe the middle of a set of data that does have an outlier. The middle number when the values are listed in order of magnitude E.g. The median of {1, 2, 4, 4, 5, 8, 9} is 4 If you have an even set of numbers you must take the mean of the middle two to get the median E.g. The median of {1, 2, 4, 5, 7, 9, 12, 20} is (5+7) ÷ 2 = 6. Advantages - Extreme values (outliers) do not affect the median as strongly as they do the mean. - Can be used with skewed data - Useful when comparing sets of data. - It is unique - there is only one answer. Disadvantages - Need to put all values in order first, which can be time consuming - Uses only one value in the data set, so may not be truly representative of the whole data set. If this value is changed so it the median, even if all other values stay the same. MODE Use the mode when the data is categorical or when asked to choose the most popular item. The most commonly occurring value On a histogram, the highest bar is the mode E.g. The mode of {1, 2, 4, 4, 5, 8, 9} is 4 as it occurs the most. Advantages - Extreme values (outliers) do not affect the mode. - Can be used with skewed data. - More useful for some types of data e.g. the average number of children per woman Disadvantages - Not necessarily unique - may be more than one answer - When there is more than one mode, it is difficult to interpret and/or compare. - When no values repeat in the data set, the mode is every value and is useless. - May be unrepresentative of the data set e.g. 1, 1, 1, 6, 5, 7, 11, 7, 9, 10, 8 – A mode of 1 isn’t very useful! - Only really useful for categorical data. - Can be changed significantly if only one number in the data set is changed. Measures of Dispersion Measures of dispersion, or spread, are used to describe the variability in a sample or population. For example: the range of different heights of children in a class; or the differences in amount of rainfall throughout the year. These measures can be used along with averages to help you analyse data in more depth and make more detailed comparisons between data sets. There are 3 different types: RANGE Use the mean to describe the spread of a dataset that does not have an outlier. This is the difference between the highest and the lowest figures in your data set E.g. A darts player throws the following scores during a match: 80, 100, 62, 180, 21, 55 Her average score per throw = (180-21) = 159 Advantages - Simple and quick to calculate - Useful when comparing sets of data. - Can be used with discrete or continuous data Disadvantages - Affected by extreme values (outliers) - Only gives a very basic indication of spread - Only uses two values in the data set iNTERQUARTILE RANGE This measure of dispersion can be used with date that does have outliers Looks at the spread of the middle 50% of the data List numbers in order of magnitude, then calculate the Upper Quartile (n+1)/4 and the Lower Quartile ((n+1)/4) x 3. (n = number of figures in data set) IQR = UQ – LQ E.g. 1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 50, 57, 68 UQ = 47; LQ = 19; IQR = (47-19) = 28 Advantages - Not affected by outliers - Allows a fairer comparison of data sets than the range. - Good measure to use in combination with the median Disadvantages - Need to put all values in order first, which can be time consuming - More difficult to calculate than the range STANDARD DEVIATION A more advanced measure of dispersion Indicates the extent to which the data is clustered around the mean A smaller SD score suggests a more reliable mean First calculate the variance - this is calculated as the average squared deviation of each number from the mean of a data set. E.g. for the numbers 1, 2, and 3 the mean is 2 and the variance is 0.667. [(1 - 2)2 + (2 2)2 + (3 - 2)2] ÷ 3 = 0.667 Then square root the variance to get the standard deviation, E.G. SQRT 0.667 = 0.817 The more widely spread the values are, the larger the standard deviation is SD tells you that 68% of the data lies within one standard deviation of the mean, 95% lie within 2 SD’s and 99% lie within 3SD’s. Anything that is 2 or more SDs away from the mean can be considered anomalous Advantages - Uses all values in a data set - Provides a statistical measure with which to identify outliers - Useful for comparing two sets of data that have similar means Disadvantages - Should only be used for data sets that have a normal distribution - More difficult and time consuming to calculate than the range - Can be distorted by outliers like the mean