Chapter 2 Summarising Numerical Data

2B The Median, Range and Interquartile Range (IQR)

The most versatile statistical tools for numerically describing the centre and spread of a distribution use:
o the median (the middle value) as the measure of centre, and
o two measures of spread:
   o the range (the maximum spread of the data values)
   o the interquartile range (the spread of the middle half of the data values)

Determining the Median (M)

To find the median:
First order the data. For n data values, the median is located at the $\frac{n+1}{2}$th position.
When:
o n is odd, the median will be the middle data value
o n is even, the median will be the average of the two middle data values

Example 3 (pg 47)
a) 2, 9, 1, 8, 3, 5, 3, 8, 1    Median =
b) 10, 1, 3, 4, 8, 6, 10, 1, 2, 9    Median =

Note: you should always check that you are correct by counting the number of data values on each side of the median. They should be equal.

Using a dot plot or stem plot to help locate medians

Locating the median on a dot or stem plot can be done on a calculator, but because the data is already in order it is relatively simple to do manually. Look at Example 4 and Example 5 in the textbook. (pg 47 – 48)

The Range

The range, R, is the simplest measure of spread of a distribution. It is the difference between the largest and smallest values in the data set.
Range = largest value in data – smallest value in data
Look at Example 6 in the textbook. (pg 49)

A problem with the range as a measure of spread

Because the range depends only on the two extreme values in the data, it is not always an informative measure of spread. For example, one or both of these two values may be outliers. A more refined measure of spread that overcomes this limitation of the range is the interquartile range (IQR).

The Interquartile Range (IQR)

The median divides the distribution in half, and the quartiles are the points that divide the distribution into quarters.
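Before moving on to the quartiles, the median and range rules from this section can be checked with a short Python sketch. This is only an illustration under stated assumptions, not the textbook's or the calculator's method: the function names `median` and `data_range` are made up for this sketch, and the two lists are the data from Example 3. The results are deliberately not printed here, so the example can still be worked by hand first.

```python
def median(values):
    """Return the median using the (n + 1) / 2 position rule."""
    ordered = sorted(values)              # first order the data
    n = len(ordered)
    middle = n // 2                       # 0-based index of the upper-middle value
    if n % 2 == 1:
        return ordered[middle]            # n odd: the single middle value
    # n even: average of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


def data_range(values):
    """Return the largest value minus the smallest value."""
    return max(values) - min(values)


# Data from Example 3 a) and b)
example_3a = [2, 9, 1, 8, 3, 5, 3, 8, 1]
example_3b = [10, 1, 3, 4, 8, 6, 10, 1, 2, 9]

for name, data in (("3a", example_3a), ("3b", example_3b)):
    print(name, "median =", median(data), "range =", data_range(data))
```

Sorting inside `median` mirrors the "first order the data" step, so the function can be given the values in any order.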
[Diagram: the range runs from 0% to 100% of the ordered data. Q1 (lower quartile) sits at 25%, Q2 (median) at 50% and Q3 (upper quartile) at 75%. The IQR runs from Q1 to Q3.]

The interquartile range (IQR) is the spread of the middle 50% of the data values, so that:
IQR = Q3 – Q1

Example 7 (pg 50)
[Plot from the textbook with Q1, the median, Q3, the IQR and the range marked.]

*******************Clarification *********************
When finding the quartiles of an odd number of data values, disregard the median before determining the lower and upper quartiles.
3 4 4 4 5 5 6 7 9 10 11 12 14

Why is the IQR a more useful measure of spread than the range?
The IQR is a good measure of spread because it is generally not affected by the presence of any outliers.

Exercise 2B pg 51 – 53 Q’s 1 – 8

2C The five-number summary and the box plot

The five-number summary

The five-number summary is the following: minimum value, Q1, M, Q3, maximum value.
This data can then be used to construct a new graph known as a box plot.

The box plot

The box plot (or box-and-whisker plot) is a graphical display of the five-number summary. A box plot is a very compact way to display the location, spread and general shape of a distribution. It is also a very useful tool for comparing distributions of related subgroups. Box plots can be drawn either vertically or horizontally.

[Blank vertical and horizontal axes, each scaled from 5 to 80 in steps of 5.]

Example 8 (pg 54) Constructing a box plot from a five-number summary
Min 52    Q1 66    M 73    Q3 75    Max 77
[Blank axis: Life Expectancy (Years), scaled from 5 to 80 in steps of 5.]

Box plot with Outliers

To display outliers on a box plot, we must first determine the location of what we call the upper and lower fences. These are imaginary lines drawn one and a half interquartile ranges above and below the box ends (Q1 and Q3). Data values outside these fences are then classified as possible outliers and are plotted separately. The whiskers end at the highest and lowest data values that lie within 1.5 IQRs of the ends of the box.
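The quartile rule from the Clarification and the fence description above can be sketched together in Python. This is a minimal sketch under stated assumptions: the helper names (`median_of_ordered`, `quartiles`, `fences`) are invented for this illustration, the quartiles are taken as the medians of the lower and upper halves (with the median disregarded when n is odd), and a calculator or software package may use a slightly different quartile convention. The data are the 13 values listed in the Clarification.

```python
def median_of_ordered(ordered):
    """Median of a list that is already in order."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def quartiles(values):
    """Return (Q1, M, Q3), disregarding the median when n is odd."""
    if len(values) < 2:
        raise ValueError("need at least two data values")
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                  # values below the median
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_of_ordered(lower), median_of_ordered(ordered), median_of_ordered(upper)


def fences(q1, q3):
    """Return (lower fence, upper fence), 1.5 IQRs beyond the box ends."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# The 13 data values from the Clarification (n is odd)
clarification_data = [3, 4, 4, 4, 5, 5, 6, 7, 9, 10, 11, 12, 14]

q1, m, q3 = quartiles(clarification_data)
lower_fence, upper_fence = fences(q1, q3)
possible_outliers = [x for x in clarification_data
                     if x < lower_fence or x > upper_fence]

print("Q1 =", q1, "M =", m, "Q3 =", q3, "IQR =", q3 - q1)
print("Fences:", lower_fence, "to", upper_fence)
print("Possible outliers:", possible_outliers)
```

For this data the median is the 7th value, so the lower half is the first six values and the upper half is the last six. Every value lies inside the fences, so no possible outliers are flagged.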
Upper Fence = Q3 + 1.5 × IQR
Lower Fence = Q1 – 1.5 × IQR

Try making a box plot with outliers manually from the data below, and then use the graphical calculator to check your work.

3, 21, 21, 22, 23, 24, 26, 27, 28, 30, 31, 31, 33, 34, 35, 35, 36, 43, 48

Min =        Q1 =        Med =        Q3 =        Max =
IQR =        Range =
Lower Fence = Q1 – 1.5 × IQR =
Upper Fence = Q3 + 1.5 × IQR =

[Working space: stems 0 to 4 for ordering the data, and a blank axis scaled from 5 to 50 in steps of 5.]

How to construct a box plot with outliers using the TI-Nspire CAS – do the example in the textbook. (pg 56)

Interpreting Box Plots

Do Examples 9 and 10 from the textbook. (pg 58 – 59)

Exercise 2C pg 59 – 61 Q’s 1 – 10

2D Relating a box plot to shape

Symmetric Distributions – are evenly spread out around the median. There is also a strong tendency for the data values to cluster around the centre of the distribution, and the whiskers of the box plot are approximately equal in length.

Positively Skewed Distributions – are characterised by a cluster of data values at the left-hand end of the distribution. These distributions are said to gradually tail off to the right. The left-hand whisker will be short, while the right-hand whisker will be long.

Negatively Skewed Distributions – are characterised by a cluster of data values at the right-hand end of the distribution. These distributions are said to gradually tail off to the left. The right-hand whisker will be short, while the left-hand whisker will be long.

Distributions with Outlier(s) – are characterised by gaps between the main body and data values in the tails.

Exercise 2D pg 63

2E Using box plots to describe and compare distributions

Box plots are a very powerful tool for describing a distribution in terms of shape*, centre and spread. They can also be used to help us identify outliers.

* Beware of using a box plot to identify the shape of a distribution when the data set has fewer than 20–30 values. In this case, use a dot plot to check instead.

Look at Examples 11, 12 and 13 on pages 64 – 65.
Exercise 2E pg 65 – 66 Q’s 1 – 3

2F Describing the centre and spread of symmetric distributions

We can use the median, IQR and range for describing the centre and spread of distributions of any shape. If you have a symmetric distribution without outliers, there is an alternative:
o measure of centre – the mean
o measure of spread – the standard deviation

The mean

Most people think of the mean as the “average”.
Mean = (sum of data values) / (total number of data values)
$\bar{x} = \frac{\sum x}{n}$
where $\bar{x}$ = mean, $\sum x$ = sum of data values, n = total number of data values

Example 14 pg 68
n =        $\sum x$ =        $\bar{x}$ =

The relationship between the mean and the median

The median lies at the midpoint of the distribution and the mean is the balance point. (pg 68)

When to use the median rather than the mean

Because the median is relatively unaffected by the presence of extreme values, it is said to be a resistant statistic. For this reason it is frequently used if the data is clearly skewed or is likely to contain an outlier. However, if a distribution is symmetric, the mean and the median will be approximately equal in value. The mean is often preferred because it is familiar to most people. If the data is skewed or there is an outlier, the mean will be dragged towards the tail or in the direction of the outlier.

Choosing between the mean and the median:
o Symmetric with no outliers – either the mean or the median can be used.
o Skewed and/or outliers – the median is the more reliable indicator of the centre.

Exercise 2F-1 pg 69 – 70 Q’s 1 – 8

The standard deviation

To measure the spread around the median we use the IQR. To measure the spread around the mean we use the standard deviation (s). We will usually use the calculator to work out the standard deviation, but it is also important to understand what the calculator is doing.
The formula for the standard deviation, s, is:
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Practice - How to calculate the mean and standard deviation using the TI-Nspire CAS (pg 71 – 72)

Exercise 2F-2 pg 73 – 74 Q’s 1 – 6

2G The normal distribution and the 68-95-99.7% rule

The normal distribution

Many data sets are roughly symmetrical and take on a bell-shaped curve. These types of data sets are called normal distributions.

The 68-95-99.7% rule

In normal distributions, the percentage of data that lies within a certain number of standard deviations (SD) of the mean can always be determined. For a normal distribution, approximately:
o 68% of the data lies within one standard deviation of the mean.
o 95% of the data lies within two standard deviations of the mean.
o 99.7% of the data lies within three standard deviations of the mean.

It is helpful to view this graphically, assuming the data is approximately normal. Look at the examples and graphs on pg 75 - 77.

Example 15 pg 76

Exercise 2G pg 78 - 79 Q’s 1 – 4 (Use WS on the next page.)

2H Standard scores (z-scores)

It is common to transform normally distributed data into a new set of units which show the number of standard deviations each data value lies from the mean of the distribution. This is called standardising, and these transformed data values are called standard scores or z-scores.

Standard score = (data value – mean) / (standard deviation), or $z = \frac{x - \bar{x}}{s}$

Standard scores can be positive, zero or negative:
o a positive z-score indicates that the data value lies above the mean.
o a zero z-score indicates that the data value is equal to the mean.
o a negative z-score indicates that the data value lies below the mean.

Study Example 16 pg 79 – 80

Using standard scores to compare performance

Standard scores are useful for making comparisons across data distributions which have different means and/or standard deviations.
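To see what the calculator is doing, the standard deviation formula, the 68-95-99.7% rule and the z-score rule can be sketched in Python. This is a minimal sketch, not the calculator's implementation: the function names are invented for this illustration, and all data values and subject marks are hypothetical, chosen so the arithmetic comes out evenly. They are not taken from the textbook examples.

```python
from math import sqrt


def mean(values):
    """Mean = sum of data values / total number of data values."""
    return sum(values) / len(values)


def standard_deviation(values):
    """Standard deviation s, using the n - 1 divisor from the formula."""
    x_bar = mean(values)
    sum_of_squares = sum((x - x_bar) ** 2 for x in values)
    return sqrt(sum_of_squares / (len(values) - 1))


def z_score(x, x_bar, s):
    """Number of standard deviations that x lies from the mean."""
    return (x - x_bar) / s


def rule_intervals(x_bar, s):
    """Intervals holding roughly 68%, 95% and 99.7% of normal data."""
    return {
        "68%": (x_bar - s, x_bar + s),
        "95%": (x_bar - 2 * s, x_bar + 2 * s),
        "99.7%": (x_bar - 3 * s, x_bar + 3 * s),
    }


# Hypothetical data values, invented for this sketch.
data = [2, 4, 4, 4, 5, 5, 5, 7, 9]
x_bar = mean(data)                  # 45 / 9 = 5.0
s = standard_deviation(data)        # sqrt(32 / 8) = 2.0
print("mean =", x_bar, "s =", s)
print("z-score of 9:", z_score(9, x_bar, s))
print(rule_intervals(x_bar, s))

# Comparing performance in two hypothetical subjects:
# (mark, class mean, class standard deviation)
english_z = z_score(75, 65, 8)      # 1.25
maths_z = z_score(70, 60, 5)        # 2.0
print("Stronger relative result:", "Maths" if maths_z > english_z else "English")
```

In the hypothetical comparison the Maths mark is lower than the English mark, but it has the higher z-score, so it is the stronger result relative to the rest of the class.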
Study Example 17 pg 81

Converting standardised scores into actual scores

By making the actual score the subject of the rule for calculating standard scores, we arrive at:
Actual score = mean + standard score × standard deviation
$x = \bar{x} + z \times s$

Study Example 18 pg 81

Exercise 2H pg 82 Q’s 1 – 4

Chapter 2 REVISION
Multiple-choice questions pg 85 - 88 Q’s 1 – 29
Extended-response questions pg 89 – 90 Q’s 1 - 5
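The rule for converting standardised scores into actual scores in section 2H can be checked with a short round-trip sketch in Python. The function names and the values used (a mean of 70, a standard deviation of 8 and a z-score of 1.5) are hypothetical and chosen only for illustration.

```python
def z_score(x, x_bar, s):
    """Standard score: (data value - mean) / standard deviation."""
    return (x - x_bar) / s


def actual_score(z, x_bar, s):
    """Actual score: mean + standard score * standard deviation."""
    return x_bar + z * s


# Hypothetical distribution: mean 70, standard deviation 8.
x_bar, s = 70, 8

score = actual_score(1.5, x_bar, s)     # 70 + 1.5 * 8 = 82.0
print("actual score:", score)

# Standardising the result again should give back the original z-score.
print("z-score:", z_score(score, x_bar, s))
```

Converting to an actual score and then standardising again returns the starting z-score, which is a quick way to confirm that the two rules are rearrangements of each other.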