Part 2: Summarising Data Numerically and Graphically

Matthew Sperrin and Juhyun Park

December 12, 2008

1 Introduction

How long do University students spend on social networking websites per day? A random sample of 50 students were asked to record their social networking website usage for one day. The results, in minutes spent, are given in Table 1. Looking at the table only, what can you learn from the data?

Figure 1 shows exactly the same data in a histogram, with minutes spent plotted along the horizontal axis, and the height of the bars representing the number of students in each region.

Exercise 1. Can you learn more about social networking site usage from looking at the graph than you can from the table? Is the graph or the table easier to interpret?

     0    4    2    0   17    0   51   13   10   17
     9    5   11   18   10    4   12   15    3   22
    21   21    4    4   24    2    6    6    2    7
    29    1    9    6   34   32   27   19    5    7
     9   68   42   26  185    6    4    3    2    3

Table 1: Minutes spent on social networking websites per day

[Figure 1: Minutes spent on social networking websites per day. Histogram; horizontal axis: minutes spent on social networking websites per day; vertical axis: frequency.]

Presenting data graphically can often help us to learn things from the data.

Table 2 gives the birthweights (in grams) of 44 babies born in the Mater Mothers’ Hospital in Brisbane, Queensland, Australia, on December 18, 1997.

Exercise 2. What is the variable being measured here? Is it quantitative? If so, is it continuous or discrete? [1] Suppose you are asked ‘Tell me about the birthweights of these babies’. What would you say?

[1] See Part 1 for an introduction to these concepts.
    3837  3334  3554  3838  3625  2208  1745  2846  3166  3520  3380
    3294  2576  3208  3521  3746  3523  2902  2635  3920  3690  3430
    3480  3116  3428  3783  3345  3034  2184  3300  2383  3428  4162
    3630  3406  3402  3500  3736  3370  2121  3150  3866  3542  3278

Table 2: Birthweights of babies born in the Mater Mothers’ Hospital in Brisbane, Queensland, Australia, on December 18, 1997

It would be time-consuming and uninformative to list the weights of every single baby; it would be better to specify a few numbers that summarise in some way the weights of the babies.

Graphical and numerical summaries of data come under the joint heading of exploratory data analysis. This part of the course has the following objectives:

1. To introduce some numerical and graphical summary methods.
2. To explore which graphical methods/summary statistics are useful in certain situations, and how to use them together sensibly.
3. To extend the ideas to situations where we are interested in the relationship between two variables.

2 Visualising the Data

Suppose that we are interested in analysing the birthweight data given in Table 2. Where should we start? Whatever question we are trying to answer, and whatever the purpose of the analysis, a useful first step is to look at the data. Looking at the table itself is probably not very helpful: it is very difficult to get any sort of intuition about what the data is like. Getting some visual impression of the data will help us in deciding what we can do with it, as we will see later.

A possible first step in visualising the data is to produce a histogram, so we briefly introduce histograms in this section. We work through the process of constructing a histogram using the birthweight data.

1. We divide the range of the data into (equally sized) bins. Here, the lightest baby has weight 1745 grams and the heaviest has weight 4162 grams.
We will use 6 bins: 1501-2000, 2001-2500, 2501-3000, 3001-3500, 3501-4000 and 4001-4500.

2. Record the number of observations that fall into each bin. Here, we need to draw up a frequency table, and tally the number of birthweights in each category. For example, the first weight listed in Table 2 is 3837 grams, so it falls into the category ‘3501-4000’.

    Bin          Number of Observations
    1501-2000     1
    2001-2500     4
    2501-3000     4
    3001-3500    19
    3501-4000    15
    4001-4500     1

3. Plot on a graph. The x axis is the range of the data and the y axis is the number of observations (count) in each bin. Figure 2 gives the final histogram.

Exercise 3. What do you learn about the babies’ birthweights from this histogram? For example, what sort of weight are the heaviest babies? What do the babies typically weigh?

This gives a very brief introduction to histograms, and more technical aspects will be given in Section 4.

3 Measuring Location and Spread

Recall that we have, in Table 2, the birthweights (in grams) of 44 babies. What sort of questions might be of interest regarding these birthweights?

[Figure 2: Histogram of Birthweights. Horizontal axis: birthweight (grams), 1500 to 4500; vertical axis: frequency.]

We might be interested in:

1. What does a typical baby weigh?
2. How spread out are the weights of the babies? Or, how light are ‘light’ babies, and how heavy are ‘heavy’ babies?

The first question can be answered by calculating a statistic that summarises location. We will introduce two ways to measure location: the median and the mean.

The second question can be answered by calculating a statistic that summarises spread. We will introduce three ways to measure spread: the range, the interquartile range, and the standard deviation.

Later, we will talk about the relative advantages and disadvantages of each measure, and discuss the circumstances when each might be used.

Before we get started, a brief digression on notation is necessary.
Some Notation

Firstly, we choose a letter to denote the variable we are measuring; $X$ is a common choice. It is good practice to make it clear what your variables mean, by writing at the beginning ‘let $X$ denote [the variable we are measuring]’.

Instead of saying ‘the first observation from the data’ we will simply write $x_1$. (If we had chosen $Y$ to denote the variable, this would be $y_1$.) The tenth observation is written as $x_{10}$, and so on. We call the total number of observations in our dataset $n$, so the final observation is $x_n$.

For example, in the birthweight data in Table 2, let $X$ be the weights of the babies. Then, reading across the first row we get $x_1 = 3837$, $x_2 = 3334$, $x_3 = 3554$, ..., $x_{44} = 3278$, and $n = 44$, as there are 44 births recorded.

Notice the difference here between upper case and lower case letters. Upper case letters are not numbers: they are variables, telling us what it is we are measuring. They are random because each time we take a measurement we may get a different answer. Lower case letters are numbers, because they tell us the value the variable takes for a specific measurement.

We use brackets if we have put the data in ascending numerical order: the first observation (which is now the smallest) is called $x_{(1)}$, the second observation (so second smallest) is called $x_{(2)}$, and so on, up to the largest observation $x_{(n)}$. Table 3 gives the birthweight data in ascending numerical order. Looking at this we can see that $x_{(1)} = 1745$, $x_{(2)} = 2121$, ..., and $x_{(44)} = 4162$.

    1745  2121  2184  2208  2383  2576  2635  2846  2902  3034  3116
    3150  3166  3208  3278  3294  3300  3334  3345  3370  3380  3402
    3406  3428  3428  3430  3480  3500  3520  3521  3523  3542  3554
    3625  3630  3690  3736  3746  3783  3837  3838  3866  3920  4162

Table 3: Birthweights of babies born in the Mater Mothers’ Hospital in Brisbane, Queensland, Australia, on December 18, 1997; in ascending numerical order.
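The notation above can be tried out on a computer. The following is only a minimal sketch in Python (our choice of language for illustration; these notes do not prescribe one). The names `birthweights`, `ordered` and `order_stat` are hypothetical. Python counts list positions from 0, so $x_1$ is stored at position 0.

```python
# Birthweights (grams) from Table 2, in the order recorded (reading across rows).
birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]

n = len(birthweights)          # total number of observations
x_1 = birthweights[0]          # x_1, the first recorded observation
x_n = birthweights[n - 1]      # x_n, the last recorded observation

# Data in ascending order, as in Table 3.
ordered = sorted(birthweights)


def order_stat(k):
    """Return x_(k), the k-th smallest observation (k counts from 1)."""
    return ordered[k - 1]


print(n, x_1, x_n)                                   # 44 3837 3278
print(order_stat(1), order_stat(2), order_stat(n))   # 1745 2121 4162
```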
3.1 The Median

The median is one way of answering the question ‘What does a typical baby weigh?’.

Exercise 4. Suppose we have placed the 44 birthweights in ascending numerical order (as in Table 3). Which of the following values would best reflect what a typical baby weighs?

• $x_{(2)}$, i.e. the 2nd smallest value?
• $x_{(22)}$, a value somewhere around the middle?
• $x_{(42)}$, one of the largest values?

The median value of a collection of data is the ‘middle’ value when the data is in numerical order. We use the symbol $x_m$ for the median.

Exercise 5. Looking at Table 3, find the ‘middle’ value for the birthweight.

We now give a general procedure for calculating the median, using the birthweight data as an example.

1. Place the data in numerical order. This is done in Table 3.
2. Take the total number of observations and add 1. Here there are 44 observations, so adding 1 gives 45.
3. Divide by 2 and call the result $t$. So $t = 45/2 = 22.5$.
4. If $t$ is a whole number, the median is $x_{(t)}$. Otherwise, the median is the average of the two observations either side of position $t$. We have $t = 22.5$, not a whole number, so our answer is the average of $x_{(22)}$ and $x_{(23)}$. So we get

\[ x_m = \frac{x_{(22)} + x_{(23)}}{2} = \frac{3402 + 3406}{2} = 3404 \]

So the median of the birthweight data is 3404 grams.

Exercise 6. Why do we add 1 to the total number of observations before we divide by 2? HINT: you can get intuition by considering a data set with 3 observations (i.e. $n = 3$).

Exercise 7. In calculating the median, does it matter whether the data is arranged in ascending or descending numerical order?

So, the median gives us an answer to the question ‘what does a typical baby weigh?’: the median birthweight is 3404 grams.

3.2 The Range and Interquartile Range

We now consider the question, ‘How spread out are the weights of the babies?
Or, how light are ‘light’ babies, and how heavy are ‘heavy’ babies?’ Given that the ‘typical baby’ weighs 3404 grams, would you be surprised to see a baby weighing 3600 grams, for example? These are the sorts of questions that we can answer by having an indication of the spread of the data.

One way of looking at the spread of the data is to take the difference between the heaviest baby and the lightest baby. This is called the range, and in our notation:

\[ \text{Range} = x_{(n)} - x_{(1)} \]

So, for the birthweight data, the lightest baby is $x_{(1)} = 1745$, and the heaviest baby is $x_{(n)} = 4162$. So the range is

\[ \text{Range} = x_{(n)} - x_{(1)} = 4162 - 1745 = 2417 \text{ grams} \]

Exercise 8. Table 4 gives the time 50 randomly selected University students spend on social networking websites in a day (in minutes), in ascending numerical order. Calculate the range of this data. Why might the range not be a good description of the spread of the data in this case?

     0    0    0    1    2    2    2    2    3    3
     3    4    4    4    4    4    5    5    6    6
     6    6    7    7    9    9    9   10   10   11
    12   13   15   17   17   18   19   21   21   22
    24   26   27   29   32   34   42   51   68  185

Table 4: Minutes spent on social networking websites per day, in ascending numerical order.

Since the range can be so sensitive to outliers (values that are unusually large or small), we consider a slightly different measure, called the Interquartile Range (IQR). The IQR takes the range of the middle 50% of the data, meaning it is not affected by the outliers.

• The lower quartile (LQ) is the value that has one quarter of the data smaller, and three quarters of the data larger than it. It is also known as the 25th percentile (the 0.25 quantile), as 25% of the data is smaller than it.
• Similarly, the upper quartile (UQ) is the value that has one quarter of the data larger, and three quarters of the data smaller than it. It is also known as the 75th percentile (the 0.75 quantile), as 75% of the data is smaller than it.
• The Interquartile Range (IQR) is the range of the middle 50% of the data, and is calculated by taking the difference between the UQ and the LQ: IQR = UQ − LQ.

Exercise 9. Looking at Table 3, find the lower quartile (one quarter of the way along the data), the upper quartile (three quarters of the way along the data) and the interquartile range (difference between the two).

As with the median, there are technicalities involved when the LQ and UQ lie ‘in-between’ two data points. However, there is no agreement on what to do in these cases. We suggest calculating the LQ as ‘the median of the lower half of the data’, and the UQ as ‘the median of the upper half of the data’. If there is an odd number of data points, the usual convention is to exclude the median from both calculations.

We will now give a general algorithm for calculating the LQ, UQ and IQR, using the birthweight data as an example.

The LQ is the median of the first half of the data, i.e. the top two rows of Table 3, or the first 22 observations. The total number of observations plus 1 is therefore 23. Dividing this by two, and calling the answer $t$, gives $t = 11.5$. So the result is the average of $x_{(11)}$ and $x_{(12)}$,

\[ \text{LQ} = \frac{x_{(11)} + x_{(12)}}{2} = \frac{3116 + 3150}{2} = 3133 \]

The UQ is the median of the second half of the data, i.e. the bottom two rows of Table 3, or observations 23 to 44, the last 22 observations. The total number of observations plus 1 is therefore 23. Divide this by two to get $t = 11.5$. So the result is the average of the 11th and 12th observations in the second half of the data. This is NOT the same as $x_{(11)}$ and $x_{(12)}$ because they are in the first half of the data! We add 22 to tell us where to find the UQ because we didn’t use the first 22 observations.
So 11.5 + 22 = 33.5, so the result is the average of $x_{(33)}$ and $x_{(34)}$,

\[ \text{UQ} = \frac{x_{(33)} + x_{(34)}}{2} = \frac{3554 + 3625}{2} = 3589.5 \]

Now the IQR is simply the difference between the UQ and the LQ:

\[ \text{IQR} = \text{UQ} - \text{LQ} = 3589.5 - 3133 = 456.5 \]

At this point, we can return to the original question that we were trying to answer by calculating the spread of the data: how heavy are ‘heavy’ babies and how light are ‘light’ babies? It remains somewhat subjective what we mean by light and heavy, but we could say that the upper 25% are ‘heavy’ and the lower 25% are ‘light’. We know these quantities: they are just the UQ and the LQ. So, ‘heavy’ babies have weight greater than 3589.5 grams, and ‘light’ babies have weight less than 3133 grams.

Exercise 10. Calculate the Median, LQ, UQ and IQR of the social networking website usage data. Table 4 gives the data in ascending numerical order.

Box Plots

The median, quartiles and range can be summarised in a graphical format, which can be useful when comparing one sample against another.

Firstly, the quantities can be summarised in a so-called five number summary. This consists of five numbers: the smallest observation, the lower quartile, the median, the upper quartile and the largest observation (in that order). For example, for the birthweight data we would write the five number summary as (1745, 3133, 3404, 3589.5, 4162).

This summary can also be drawn in a box plot. To create a box plot, the horizontal axis is the range of the data (in this case, the weights of the babies). Draw a small vertical line at each of the five numbers from the five number summary, then connect the ends of the line at the LQ to the ends of the line at the UQ, forming a box. Figure 3 gives the box plot for the birthweight data.

[Figure 3: Boxplot of Birthweights. Horizontal axis: birthweight, 2000 to 4000 grams.]

Exercise 11. Suppose that we separate the weights of the 44 babies into boys and girls.
We calculate the five number summaries for each group, and they are, for the boys (2121, 3166, 3404, 3630, 4162) and for the girls (1745, 2576, 3381, 3523, 3866). Draw the two boxplots, one below the other, on a single graph, using the same horizontal axis. Use this to compare the two sets of birthweights.

3.3 The Mean

We have discussed that it is useful to find location statistics, in order to answer a question such as, in our birthweight example, what a typical baby weighs. The mean is an alternative way to calculate this. We will explore the similarities and differences between the mean and median in Section 3.5.

The mean is calculated by adding up the values of all the observations, then dividing by the number of observations. This corresponds to sharing the values equally amongst all the observations. The sample mean is denoted by $\bar{x}$ (say ‘x-bar’). [2] The formula for the mean is

\[ \bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n} \]

In this formula, we have used ‘...’ to show that we have missed out all the middle terms, but you would of course fill these in when using the formula. There is a nicer way to indicate a sum of observations than this, by using the symbol $\sum$, which means ‘the sum of’. So we can rewrite the formula for the mean as

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

We use the area below the $\sum$ sign to indicate where the sum starts from, and the area above to indicate where the sum finishes. The $i$ in this case is called an index. It changes its value from the smallest value in the sum to the largest value, going through every whole number in between ($i = 1$, then $i = 2$, etc.).

As an example, let’s calculate the mean of the birthweight data. Putting the data into the formula we get

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{3837 + 3334 + \ldots + 3278}{44} = 3275.955 \]

so the mean is approximately $\bar{x} = 3276$. So, the mean birthweight is approximately 3276 grams.

[2] It is $x$ because $X$ is the letter used to denote the variable. If we had used $Y$, it would be $\bar{y}$.
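The hand calculations of Sections 3.1 to 3.3 can also be checked by computer. The following is a rough Python sketch, not a definitive implementation: the function names `median_of_sorted` and `five_number_summary` are hypothetical, and the quartiles use the ‘median of each half’ convention suggested in Section 3.2 (statistical packages may use other conventions and so give slightly different quartiles). It assumes at least two observations.

```python
# Birthweights (grams) from Table 2.
birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]


def median_of_sorted(values):
    """Median by the (n + 1) / 2 rule; `values` must already be in order."""
    n = len(values)
    t = (n + 1) / 2
    if t == int(t):
        # t is a whole number: the median is x_(t) (list positions start at 0).
        return values[int(t) - 1]
    # Otherwise average the two observations either side of position t.
    return (values[int(t) - 1] + values[int(t)]) / 2


def five_number_summary(data):
    """Smallest value, LQ, median, UQ, largest value (assumes n >= 2)."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]        # first half of the ordered data
    upper_half = ordered[n - half:]    # second half; skips the median if n is odd
    return (
        ordered[0],
        median_of_sorted(lower_half),
        median_of_sorted(ordered),
        median_of_sorted(upper_half),
        ordered[-1],
    )


smallest, lq, median, uq, largest = five_number_summary(birthweights)
iqr = uq - lq
data_range = largest - smallest
mean = sum(birthweights) / len(birthweights)

print(smallest, lq, median, uq, largest)   # 1745 3133.0 3404.0 3589.5 4162
print(data_range, iqr)                     # 2417 456.5
print(round(mean, 3))                      # 3275.955
```

The same functions could be applied to the data in Table 4 to check your answers to Exercises 8 and 10.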
3.4 The Variance and the Standard Deviation

Just as the mean is an alternative to the median for measuring location, there is also an alternative to the IQR for measuring spread.

Exercise 12. Using the time spent on social networking websites data in Table 4, demonstrate why the IQR cannot be said to use all of the data. HINT: Think about changing the value of the largest time (currently 185) to 285. Would the IQR change?

The sample variance is defined as the average squared distance from an observation to the sample mean. The ‘distance’ from an observation $x_i$ to the sample mean $\bar{x}$ is $(x_i - \bar{x})$. The ‘squaring’ removes any negative values here (e.g. if $x_i = 3$ but $\bar{x} = 5$, then $x_i - \bar{x} = -2$, but squaring this gives 4). We call the sample variance $s^2$, and the formula is

\[ s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

Since we have squared the $(x_i - \bar{x})$ part, the variance is not in the same units as the observations. For example, if our observations were in metres, the variance would be in square metres! This makes the variance difficult to interpret. The standard deviation remedies this problem by simply taking the square root of the variance. It is denoted by $s$, and its formula is given by

\[ s = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

If necessary, we can subscript by the letter of the variable, e.g. $s_X^2$ and $s_X$ for the variance and standard deviation of the variable $X$. It is useful to do this if we are dealing with more than one variable.

As an indication of how the standard deviation describes spread in a dataset, Figure 4 gives four examples of histograms, on the same scale, and the sample standard deviation in each case. Each of the datasets has a mean of 3. The relationship is that the further from the mean the data tends to be, the larger the standard deviation is. For example, in the bottom right histogram of Figure 4, all the values are far from 3, resulting in a large standard deviation.
On the other hand, in the bottom left histogram of Figure 4, the values are all fairly close to 3, resulting in a smaller standard deviation.

[Figure 4: Histograms of four datasets (each with 100 observations) and their associated standard deviations. Panel titles: Standard Deviation = 1.63, Standard Deviation = 1.06, Standard Deviation = 0.29 (bottom left), Standard Deviation = 2.54 (bottom right). Horizontal axis: x; vertical axis: frequency.]

As an example we will now calculate the variance and standard deviation of the birthweight data, giving a step-by-step approach to doing the calculations.

1. Produce a table with three columns: the original $x_i$ values, the $x_i - \bar{x}$ values, and finally the $(x_i - \bar{x})^2$ values.
2. Total up the final column of the table and divide by the number of observations. This is the variance.
3. Take the square root of the answer. This is the standard deviation.

These steps are summarised for the birthweight data in Table 5. We calculated earlier that $\bar{x} = 3276$ grams. Constructing a table like this is helpful when calculating the variance and standard deviation by hand.

    x_i     x_i - x̄                (x_i - x̄)^2
    3837    3837 - 3276 = 561      561^2 = 314721
    3334    3334 - 3276 = 58       58^2 = 3364
    3554    3554 - 3276 = 278      278^2 = 77284
    ...     ...                    ...
    3278    3278 - 3276 = 2        2^2 = 4

    Sum of (x_i - x̄)^2 over i = 1, ..., n:           11989186
    Variance (above line divided by n):               272481.5
    Standard Deviation (square root of variance):     522

Table 5: Calculation of variance and standard deviation for the birthweight data

So we conclude that the sample standard deviation of the birthweights is 522 grams. We have omitted many lines in the table; it would be a good exercise to check that you can reproduce the table in full. These calculations are somewhat time-consuming to do by hand. Most scientific calculators, and many computer packages, will do the calculation for you.
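As one illustration of letting a computer do the work, the three steps above might be sketched in Python as follows. This is only a sketch under the definitions used in these notes: it divides by $n$, whereas many packages divide by $n - 1$ by default and so give a slightly larger answer. The variable names are hypothetical. The code uses the unrounded mean, so the variance differs very slightly from the hand calculation in Table 5, which used $\bar{x} = 3276$.

```python
import math

# Birthweights (grams) from Table 2.
birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]

n = len(birthweights)
mean = sum(birthweights) / n

# Step 1: the (x_i - mean)^2 column of the table.
squared_deviations = [(x - mean) ** 2 for x in birthweights]

# Step 2: total the column and divide by n (the definition used in these notes).
variance = sum(squared_deviations) / n

# Step 3: take the square root.
standard_deviation = math.sqrt(variance)

print(round(variance, 1))          # 272481.5
print(round(standard_deviation))   # 522
```

The standard library function `statistics.pstdev` also divides by $n$ and should agree with `standard_deviation` here.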
The standard deviation can loosely be interpreted as ‘how far a typical observation is from the mean’.

Exercise 13. Suppose you wish to hire a typing assistant. The number of pages typed per day by Assistant A has mean 65 pages and standard deviation 3 pages. The number of pages typed per day by Assistant B has mean 75 pages and standard deviation 20 pages.

• Which assistant is the more consistent?
• Which assistant would you expect to type the most pages over a week?
• If these were the two applicants for the job of typing assistant, which would you hire (assuming you know nothing else about them)?

A small standard deviation means a high consistency or precision.

Exercise 14. Calculate the variance and standard deviation of the following times spent by University students on social networking websites: [3]

    4    6   51   17   11    4    3   21   24    3

[3] This is a subset of Table 4, to make the calculation more manageable.

Exercise 15. Is it possible for either the variance or the standard deviation to be negative?

3.5 Mean or Median?

We conclude this section by looking at the differences between the two location statistics we have introduced, the mean and the median, and discussing the situations in which each might be used.

Exercise 16. Suppose a student receives the following marks in nine courses (placed in ascending numerical order). The University awards a first class degree to any student who earns 70% or more overall.

    35   40   42   56   70   70   71   73   73

The exam board proposes that the overall degree classification is calculated based on the median of the nine marks. Here the median is 70, so the student is awarded a first class degree. Is this a fair result?

Exercise 17. Figure 5 gives the annual wage of 50 UK full time workers, chosen in a random sample. The median of the wages is 21.5 thousand pounds, and the mean is 28.7 thousand pounds. Explain why the mean is larger than the median.
Which of the mean and the median is more useful here?

[Figure 5: Histogram of the wages of 50 randomly selected full time workers in the UK. Horizontal axis: annual wage (in £1000s), 0 to 250; vertical axis: frequency.]

The above questions illustrate situations where it is possible to select one of the location statistics over the other: the mean is sometimes more useful than the median, but in other situations the median can be more useful than the mean.

In general, the problem with the mean is that it is affected by extreme values. In the wages example, there is one person in the sample earning £250 000, which is far more than everybody else. This causes the mean to be larger. Therefore, in cases where there are outliers in the data, the median is often a better choice.

The weakness of the median is that it cares only about the order that the data is in, and the value of the middle observation. In the degree classification example above, the median of 70% does not represent well what the data looks like. When there are no outliers in the data, the mean is often more useful.

So, it is important to look at the data (for example by viewing histograms) before deciding which summary statistics to report.

4 The Shape of the Data

We introduced the histogram in Section 2. In this section we look in more detail at the histogram, and introduce other graphical methods to visualise the shape of the data.

4.1 Histograms Revisited

When we first introduced the histogram in Section 2, we selected 6 equally sized bins to divide the data into, without explaining why we made this choice. In fact, choosing the number of bins to use is something of an art in producing histograms, and is often done by trial and error.

Figure 6 gives two histograms of the birthweight data: the first has two bins (each bin being 2000 grams wide), the second has 48 bins (each bin being 50 grams wide). Both were created using identical data, the birthweight data from Table 2.

Exercise 18.
Using Figures 2 and 6, comment on the consequences of making a poor choice for the number of bins.

You will find when you use statistical packages to produce histograms that an appropriate number of bins is selected automatically.

[Figure 6: Histogram of Birthweights, with 2 bins (top) and 48 bins (bottom). Horizontal axis: birthweight (grams); vertical axis: frequency.]

Exercise 19. Table 4 gives the time spent by 50 randomly chosen students on social networking websites (in minutes). Produce a histogram, with appropriate bin width, for this data. (Note: you may find it useful to exclude the largest time (185 minutes) to produce a more interesting histogram.) Interpret the results.

Density Histograms

Exercise 20. Suppose you take the names of all the babies born in the Mater Mothers’ Hospital in Brisbane, Queensland, Australia, on December 18, 1997, put them in a hat, and select one at random. What is the chance of the chosen baby having birthweight between 3001 and 3500 grams?

In Figure 7 we have the same histogram as in Figure 2, but this time we rescale the height of the bars so that the area of the bar in each bin represents the chance of a randomly chosen birthweight coming from that bin.

[Figure 7: Density Histogram of Birthweights. Horizontal axis: birthweight (grams), 1500 to 4500; vertical axis: density.]

How have we calculated this? The chance of a randomly chosen birthweight (or, in general, a randomly chosen observation) coming from a bin should be

\[ \frac{\text{Number of Observations in Bin}}{\text{Total Number of Observations}}. \]

The area of the bar is Height of bar × Width of bar. Since we want the chance and the area to match, this means that

\[ \text{Height of Bar} = \frac{\text{Number of Observations in Bin}}{\text{Width of bar} \times \text{Total Number of Observations}} \]

Exercise 21. Explain why the total area of all the bars in the new histogram is 1.
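The tallying from Section 2 and the height formula above can be sketched in a few lines of Python. This is an illustrative sketch only: the names `edges`, `counts` and `heights` are hypothetical, and we assume each bin such as ‘1501-2000’ contains the weights greater than 1500 and at most 2000, so that every bar is 500 grams wide.

```python
# Birthweights (grams) from Table 2.
birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]

# Bin edges for the six bins 1501-2000, 2001-2500, ..., 4001-4500.
edges = [1500, 2000, 2500, 3000, 3500, 4000, 4500]
n = len(birthweights)

counts = []    # number of observations in each bin (ordinary histogram)
heights = []   # bar heights for the density histogram
for low, high in zip(edges[:-1], edges[1:]):
    count = sum(1 for x in birthweights if low < x <= high)
    counts.append(count)
    # Height of bar = count / (width of bar * total number of observations).
    heights.append(count / ((high - low) * n))

print(counts)   # [1, 4, 4, 19, 15, 1]

# Area of each bar = height * width; the total can be checked numerically.
total_area = sum(h * (high - low)
                 for h, low, high in zip(heights, edges[:-1], edges[1:]))
```

Printing `total_area` gives a numerical check on Exercise 21, although it does not by itself explain why the result holds.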
4.2 Bar Charts

You are probably familiar with a method of displaying categorical data that is very similar to a histogram: the bar chart. A bar chart could be used, for example, to visualise and compare the number of people in a sample with different hair colours (this example was also used in Part 1). Suppose we collect a sample of 100 people and record their hair colour. We record the results in a frequency table:

    Hair Colour    Number of Observations
    Black          10
    Brown          51
    Fair           28
    Ginger          3
    Other           8

A bar chart is then constructed from the frequency table in exactly the same way as we did for the histogram. Figure 8 gives two possible bar charts we could construct from this frequency table. [4] The difference between the two is that we have changed the order of the bars along the horizontal axis.

[4] Note though, it is usual to arrange the bars in descending height order (as in Figure 8, left panel).

[Figure 8: Two barcharts showing the sample proportions of 100 people’s hair colour.]

Exercise 22. Why is it not sensible to change the order of the bars along the horizontal axis in a histogram, like that in Figure 7?

Bar charts are useful ways to display categorical data. They must not be confused with histograms, which are used to summarise quantitative data.

4.3 Empirical Distribution Functions

The histogram is very useful for estimating the chance of a randomly chosen observation being within a given range of values, or equivalently, the proportion of observations within that same region (see Section 2). However, we have already seen that the histogram is limited in that its shape depends on the number of bins we choose to classify our data into. In particular, if we are interested in estimating, say, the chance of a random observation being less than a certain value, we will get different answers depending on the bins we have chosen for the histogram.

The empirical distribution function (e.d.f.)
is a means of displaying all the quantiles of a set of data. Two special quantiles, the LQ (25th percentile) and the UQ (75th percentile), have already been introduced in Section 3.2. Producing the e.d.f. does not require any subjective decisions (like choosing the number of bins in the histogram setting). Therefore, there is one unique e.d.f. for any collection of data. There is an added bonus with the e.d.f.: we can read the median and quartiles from the graph quickly and easily, as we will see.

We label the e.d.f. as a capital letter with a tilde (‘˜’) above it. The formula for the e.d.f. is

\[ \tilde{F}(x) = \frac{\text{Number of observations smaller than } x}{\text{Total number of observations}} \]

For example, the 25th percentile, or lower quartile, is the value which has 25% of the data smaller than it. In Section 3.2 we calculated that the LQ for the birthweight data is 3133. So, $\tilde{F}(3133) = 0.25$ approximately, because the e.d.f. measures the proportion of the data smaller than a fixed value.

We will now work through the construction of the e.d.f. using the birthweight data.

1. We construct the graph using the ordered data (as in Table 3 for the birthweight data). The x axis covers the range of the data (so you could use the same range as the histogram, for example), and the y axis gives the value of $\tilde{F}(x)$. The graph is built like a staircase, starting from the y value of 0, on the left of the smallest observation on the x axis, and finishing at the y value of 1, on the right of the largest observation on the x axis.
2. Now, every time we encounter an observation, the y value goes up by one ‘step’. The amount we go up is $\frac{1}{n}$, with $n$ = number of observations, as usual. If there are ‘ties’ (i.e. more than one observation with the same value), we take a bigger step, whose size is $\frac{1}{n} \times$ number of ties.
3. The graph finishes when the y value reaches 1. If you run out of data before reaching 1, or go past 1, you know you have made a mistake somewhere!
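The construction above can be sketched in Python as follows. This is an illustrative sketch, not a definitive implementation: the function names `edf` and `staircase` are hypothetical, and `edf` follows the definition used in these notes (observations strictly smaller than $x$), whereas many textbooks and packages count observations less than or equal to $x$.

```python
from collections import Counter

# Birthweights (grams) from Table 2.
birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]


def edf(data, x):
    """Proportion of observations strictly smaller than x."""
    return sum(1 for value in data if value < x) / len(data)


def staircase(data):
    """List of (value, height just to the right of value) for each step.

    Each distinct value raises the height by (number of ties) / n.
    """
    n = len(data)
    ties = Counter(data)          # how many times each value occurs
    steps = []
    seen = 0                      # observations encountered so far
    for value in sorted(ties):
        seen += ties[value]
        steps.append((value, seen / n))
    return steps


steps = staircase(birthweights)
print(edf(birthweights, 3133))   # 0.25
print(steps[0])                  # (1745, 0.022727272727272728)
print(steps[-1])                 # (4162, 1.0)
```

Plotting the points in `steps` as a staircase gives the graph described in steps 1 to 3; the final height of 1.0 is the check mentioned in step 3.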
Figure 9 shows what the final e.d.f. for the birthweight data should look like. Ensure you understand why constructing the graph in this way gives the e.d.f. that is defined in Section 4.3. Also ensure that you understand that, unlike with histograms, there is no aspect of the e.d.f. that we can change to give a ‘different’ graph for the same set of data. Notice that the e.d.f. looks like a staircase: it moves up in ‘jumps’.

[Figure 9: e.d.f. of Birthweights. Horizontal axis: birthweight, 1500 to 4000; vertical axis: Fn(x), 0.0 to 1.0.]

Exercise 23. Two consecutive birthweights from Table 3 are 2208 and 2383. Explain why $\tilde{F}(2210)$ and $\tilde{F}(2300)$ must be the same.

Exercise 24. Produce an e.d.f. for the social networking website data (Table 4).

Reading the Median and Quantiles from the e.d.f.

Recall that the median of a collection of data is the ‘middle value’. The e.d.f. is defined in Section 4.3 as

\[ \tilde{F}(x) = \frac{\text{Number of observations smaller than } x}{\text{Total number of observations}} \]

Exercise 25. Explain why $\tilde{F}(x_m) = 0.5$, i.e. the value of the e.d.f. at the median is 0.5.

Since the value of the e.d.f. at the median is 0.5, we can easily read off an estimate of the median from our e.d.f. We do this by working backwards. As we know that $\tilde{F}(x_m) = 0.5$, if we draw a line across from the y-axis at 0.5 to the e.d.f., reading off the corresponding x-value gives us the median. This process is illustrated for the birthweight data in Figure 10, and we get an estimate for the median of around 3400.

Exercise 26. Show that, in the same way, the value of the e.d.f. at the LQ and UQ is 0.25 and 0.75 respectively. Use this to estimate from the e.d.f. in Figure 9 the LQ, UQ and IQR of the birthweight data. Compare with the estimates you already calculated from the data, and explain why they may not be exactly the same.
[Figure 10: e.d.f. of birthweights, with median line added. Horizontal axis: birthweight (1500 to 4000); vertical axis: Fn(x) (0.0 to 1.0).]

Exercise 27. Summarise the advantages and disadvantages of Empirical Distribution Functions versus Histograms.

5 The Relationship Between Two Variables

So far in this part of the course, we have looked at graphical methods and summary statistics for a single variable only. Can you think of situations where it may be of interest to look at the relationship between two variables?

There has been much said about the supposed link between smoking and lung cancer. We have data available, from 44 US states, on two variables.

• Smoke: Number of cigarettes smoked (hundreds per person) in 1960.

• Lung: Deaths per 100 000 population from lung cancer.

The data is given in Table 6. Looking at the table, can you see any relationship between the number of cigarettes smoked, and the rate of deaths from lung cancer? Clearly, this is difficult if not impossible to do.

    Smoke   Lung    Smoke   Lung    Smoke   Lung    Smoke   Lung
    18.20   17.05   25.82   19.80   18.24   15.98   28.60   22.07
    31.10   22.83   33.60   24.55   40.46   27.27   28.27   23.57
    20.10   13.58   27.91   22.80   26.18   20.30   22.12   16.59
    21.84   16.84   23.44   17.71   21.58   25.45   28.92   20.94
    25.91   26.48   26.92   22.04   24.96   22.72   22.06   14.20
    16.08   15.60   27.56   20.98   23.75   19.50   23.32   16.70
    42.40   23.03   28.64   25.95   21.16   14.59   29.14   25.02
    19.96   12.12   26.38   21.89   23.44   19.45   23.78   12.11
    29.18   23.68   18.06   17.45   20.94   14.11   20.08   17.60
    22.57   20.74   14.00   12.01   25.89   21.22   21.17   20.34
    21.25   20.55   22.86   15.53   28.04   15.92   30.34   25.88

Table 6: Numbers of cigarettes smoked (hundreds per person) in 1960 and deaths per 100 000 population from lung cancer, in 44 US states.

5.1 Scatter Graphs

A scatter graph is a visualisation of the relationship of two quantitative (numeric) variables.
So far, we have been using X to denote our variable. Now that there are two variables, we will use X to denote the number of cigarettes smoked, and Y to denote the number of deaths from lung cancer.

To construct a scatter graph, we plot the specific X and Y values for each state onto a graph as co-ordinates. The horizontal axis has the range of the X variable, and the vertical axis takes the range of the Y variable. Then, for example, for the first state in the table, we draw a cross on the graph at (18.20, 17.05). After plotting all of the points from Table 6, we end up with a scatter graph, as in Figure 11.

[Figure 11: Scatter plot of the number of cigarettes smoked against the number of deaths from lung cancer, for 44 US states. Horizontal axis: Cigarettes Smoked (hundreds per capita) in 1960; vertical axis: Deaths per 100 000 of Lung Cancer.]

Exercise 28. What does Figure 11 suggest about the relationship between cigarettes smoked and the number of deaths through lung cancer?

The diagonal line in Figure 11 is a line of best fit. This is drawn in such a way as to represent the underlying relationship between the X and Y variables. Since this line slopes upwards, we would say there is a positive relationship between X and Y, i.e. as the number of cigarettes smoked increases, so does the number of deaths through lung cancer.

Exercise 29. How might the line of best fit be used to estimate the death rate from lung cancer in a state for which we only know that the number of (hundreds of) cigarettes smoked per person was 30?

Exercise 30. Can you think of two variables that may have a negative relationship, i.e. as one increases in value, the other one decreases?

5.2 Correlation

Suppose you are asked to describe the relationship between smoking and lung cancer using Figure 11 over the telephone. What would you say? Is this easy to do?
As well as having a picture of the relationship between two variables, it is also useful to have some sort of numerical summary (this would be a lot easier to describe over the telephone!). The correlation between two variables is a number that describes the strength of the linear (i.e. straight line) relationship between them. We use the symbol r for the sample correlation, and subscript it with the two variables we are calculating the correlation between. For example, rXY is the sample correlation between smoking and lung cancer, from the previous section.

The correlation is always between −1 and 1. A positive number corresponds to a positive linear relationship between the two variables (i.e. as one increases, the other increases), and a negative number corresponds to a negative relationship (i.e. as one increases, the other decreases). Figure 12 shows some scatter graphs of the relationship between two variables, and the associated correlations, to give a feel for what the numbers mean.

The fact that correlation measures linear relationships between variables is important to remember. It is always useful to plot the data in a scatter graph first, to see whether there is an indication of a relationship between the two variables that is not linear. Figure 13 gives some examples of calculated correlations where the relationship between two variables is not linear. The correlation can be very misleading in these instances.
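To see how misleading the correlation can be for a non-linear relationship, here is a small Python sketch. Everything in it is an assumption made for illustration: the data are invented (they are not the data behind Figure 13), the function name `sample_correlation` is hypothetical, and the calculation uses the sample correlation formula given later in this section, with standard deviations that divide by n as elsewhere in this part of the course.

```python
from math import sqrt


def sample_correlation(xs, ys):
    """Sample correlation, with standard deviations that divide by n."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (n * s_x * s_y)


# Invented data: y is completely determined by x, but not by a straight line.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys_curve = [x ** 2 for x in xs]

# Invented data: y lies exactly on an upward-sloping straight line.
ys_line = [2 * x + 1 for x in xs]

r_curve = sample_correlation(xs, ys_curve)
r_line = sample_correlation(xs, ys_line)
print(r_curve, r_line)
```

In the first case Y is known exactly once X is known, yet the correlation comes out as zero, because the positive and negative products cancel. A scatter graph would reveal the curve immediately.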
[Figure 12: Scatter plots of nine datasets (each with 100 observations and 2 variables per observation) and their associated correlations. Panel titles: Correlation = 1, 0.95, 0.78, 0.45, 0.21, 0.08, −0.54, −0.79, −1.]

[Figure 13: Scatter plots of two datasets (each with 100 observations and 2 variables per observation) and their associated correlations, where the relationship between the two variables is not linear. Panel titles: Correlation = 0, −0.13, 0.02, −0.03.]

The formula for calculating the sample correlation between two variables X and Y is:

$$r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n \, s_X \, s_Y}$$

We will calculate the correlation for the lung cancer and smoking data. As we did for the variance, it is helpful to draw a table here. We have calculated the means and standard deviations in advance. For the ‘smoke’ variable, x̄ = 24.91 and sX = 5.57. For the ‘lung’ variable, ȳ = 19.65 and sY = 4.23 (feel free to verify these yourself!).

    xi (Smoke)   yi (Lung)   (xi − x̄)   (yi − ȳ)   (xi − x̄) × (yi − ȳ)
    18.20        17.05       −6.71      −2.60      17.48
    31.10        22.83        6.18       3.18      19.65
    20.10        13.58       −4.81      −6.07      29.24
    ...          ...          ...        ...       ...
    30.34        25.88        5.43       6.23      33.79

    Sum over i = 1 to n of (xi − x̄)(yi − ȳ): 706.66
    Correlation (above line divided by (n × sX × sY)): 0.68

A correlation of 0.68 is a fairly strong correlation.
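The table-based calculation above can be sketched in Python. Two things are shown, both under stated assumptions. First, a hypothetical helper `correlation_table` that builds the same columns as the table for any pair of lists; it is tried out here on a tiny invented dataset, not on the smoking data. Second, the final division of the worked example, using only the rounded figures quoted in the text (706.66, n = 44, sX = 5.57, sY = 4.23).

```python
from math import sqrt


def correlation_table(xs, ys):
    """Return the table rows and the sample correlation.

    Each row is (x, y, x - x_bar, y - y_bar, product of the two deviations).
    Standard deviations divide by n, as in this part of the course.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    rows = [(x, y, x - x_bar, y - y_bar, (x - x_bar) * (y - y_bar))
            for x, y in zip(xs, ys)]
    s_x = sqrt(sum(row[2] ** 2 for row in rows) / n)
    s_y = sqrt(sum(row[3] ** 2 for row in rows) / n)
    sum_of_products = sum(row[4] for row in rows)
    return rows, sum_of_products / (n * s_x * s_y)


# Invented illustration: y falls by 2 every time x rises by 1.
toy_rows, toy_r = correlation_table([1, 2, 3, 4], [10, 8, 6, 4])
for row in toy_rows:
    print(row)
print(toy_r)

# Last step of the worked example, from the rounded values in the text.
n, s_x, s_y = 44, 5.57, 4.23
r_from_text = 706.66 / (n * s_x * s_y)
print(round(r_from_text, 2))  # 0.68
```

The invented data lie exactly on a downward-sloping line, so the correlation is −1 up to rounding error, matching the extreme case in Figure 12.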
6 Summary

There were three objectives to this part of the course. Let’s summarise how we have tackled each one.

1. To introduce some basic graphical methods and summary statistics. We have introduced a range of techniques. This is not exhaustive; there are many other graphical methods and summary statistics we could have considered.

2. To motivate which graphical methods/summary statistics are useful in certain situations, and how to use them together sensibly. It is vital to identify which tools are useful to us for different purposes. Usually, we will use a combination of graphical representations and summary statistics to learn about the variables in the data we have collected.

3. To extend the ideas to situations where we are interested in measuring two variables. The final section introduces scatter plots and correlation, as methods specific to dealing with the relationship between two variables. Do remember though, it is still important to look at each of the variables individually, using the methods in the earlier sections.

A Hints and Answers to Exercises

Exercise 1: Hopefully, you believe the graph is easier to interpret. We can see that the majority of people spend less than ten minutes on social networking websites in a day, with a very small number spending longer than 50 minutes. There is one outlying observation: one person spends 185 minutes in the day, which is far longer than everybody else.

Exercise 2: The variable is ‘Weight of baby’. It is certainly quantitative, and since a weight is measured on a scale, it is continuous (although, in the data we have in Table 2, it is rounded off to the nearest gram).

Exercise 3: All babies are between 1500 and 4500 grams, with most weights lying between 3000 and 4000 grams. You may have other comments.

Exercise 4: x(22) would be the most sensible.

Exercise 5: This is worked through in the text.
Exercise 6: This is easily seen if we think about the case with three observations only. For example, suppose we have observations 4, 7 and 12. If we did not add one, we’d be looking for the median in between the 1st and 2nd value, which clearly doesn’t make sense. This peculiarity is because of the way we count: if we started counting from 0 rather than 1, we would not have this problem!

Exercise 7: No. Whichever way around the numbers are written, the same numbers will always be in the middle. However, it is conventional to write the numbers in ascending numerical order.

Exercise 8: Range = x(n) − x(1) = 185 − 0 = 185 minutes. This is not particularly useful because the largest time, x(n) = 185, does not fit in with the rest of the data. If we excluded this value the range would be only 68 minutes. It is undesirable that the range is so sensitive to this one, unusual value. This is why the interquartile range is more commonly used for measuring spread.

Exercise 9: Answered in the text.

Exercise 10: To calculate the median, total number of observations + 1 = 51. Then t = 51/2 = 25.5. So the median is the average of the 25th and 26th value in Table 4:

$$x_m = \frac{x_{(25)} + x_{(26)}}{2} = \frac{9 + 9}{2} = 9 \text{ minutes}$$

The LQ is the median of the first 25 observations, so LQ = x(13) = 4 minutes. The UQ is the median of the last 25 observations, so UQ = x(38) = 21 minutes. Then IQR = UQ − LQ = 21 − 4 = 17 minutes.

Exercise 11: Your box plot should look like Figure 14, although you may have the boxplots the other way around, which is fine. Comparing the two, the boys are heavier at birth on average (the median is slightly larger), and the IQR for the boys is much smaller, meaning that in general, boys’ weights cluster more closely around the median than the girls’.

[Figure 14: Boxplot of the birthweights of the boys (top) and the girls (bottom). Horizontal axis: 2000 to 4000.]

Exercise 12: Imagine partitioning the data into three groups:

1. Smaller than the LQ

2. Between the LQ and the UQ

3. Larger than the UQ

Then, changing the value of any observation would not affect the IQR, provided the observation stays in the same group. For example, changing the value of the largest time from 185 to 285 would not change the IQR, since the value clearly remains in the group ‘larger than the UQ’.

Exercise 13: Assistant A is the most consistent as the standard deviation is smaller, but Assistant B would be expected to type the most pages over the week as the mean of Assistant B is larger. Either answer is correct for the final part; it depends on your preferences. If quantity only is important, then Assistant B is the best choice. However, if consistent output on a daily basis is required then Assistant A should be chosen.

Exercise 14: We reproduce the table in full. First, calculate the mean and check you get 14.4. Also, the number of observations n = 10.

    xi     xi − x̄    (xi − x̄)²
    4      −10.4     108.16
    6      −8.4      70.56
    51     36.6      1339.56
    17     2.6       6.76
    11     −3.4      11.56
    4      −10.4     108.16
    3      −11.4     129.96
    21     6.6       43.56
    24     9.6       92.16
    3      −11.4     129.96

    Sum over i = 1 to n of (xi − x̄)²: 2040.4
    Variance (above line divided by n): 204.04
    Standard Deviation (square root of variance): 14.28

So the sample variance is 204.04 and the sample standard deviation is 14.28.⁵

⁵ If you check this result using a calculator or a computer, you may get a variance of 226.71 and a standard deviation of 15.06. Do not worry. The computer is calculating the unbiased estimate of the population variance/standard deviation; we will come on to this in Part 4.

Exercise 15: No. We are adding together lots of numbers which are zero or larger. Therefore, all variances and standard deviations are always zero or larger, and indeed it does not make sense to have a negative value. A variance or standard deviation of zero means there is ‘no spread’, i.e. all the observations have exactly the same value.

Exercise 16: This would seem a little generous to the student, as she only just got a first in five of the modules, and was some way off achieving a first in the other four.

Exercise 17: The mean has been distorted by the single large time (of 185 minutes). The median is not affected by this value. Therefore, for most purposes, such as giving the time a typical University student spends on social networking websites per day, the median would be a better measure.

Exercise 18: The top panel in Figure 6 is not very informative; Figure 2 gives a far more informative picture of the data. The bottom panel in Figure 6 is very difficult to interpret: it is not very smooth, and does not give a good overview of the shape of the data.

Exercise 19: Including the largest value, we took 14 bins, with each bin having width 10 minutes. The resulting histogram is that seen in the Introduction, reproduced in the top panel of Figure 15 for convenience. Excluding the largest value, we take bins of width 5 minutes to obtain the histogram in the bottom panel of Figure 15. We can say that the largest time is abnormally large compared to the rest of the data. Excluding this, the remaining times are between 0 and 70 minutes. Most of them are between 0 and 30 minutes.

Exercise 20: The chance is the same as the proportion of babies between 3001 and 3500 grams, which we can get straight from the histogram (or the frequency table).

$$\text{Probability[Randomly chosen baby between 3001 and 3500 grams]} = \frac{\text{Number of babies between 3001 and 3500}}{\text{Total number of babies}} = \frac{19}{44} = 0.43$$
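Returning briefly to Exercise 14 and its footnote: the two sets of figures can be reproduced with a short Python sketch. It assumes that the calculator or computer mentioned in the footnote divides the sum of squares by n − 1 rather than by n, which is what the quoted values 226.71 and 15.06 correspond to. The variable names are made up for this illustration.

```python
from math import sqrt

# The ten observations from the table in Exercise 14.
times = [4, 6, 51, 17, 11, 4, 3, 21, 24, 3]

n = len(times)
mean = sum(times) / n
sum_of_squares = sum((x - mean) ** 2 for x in times)

# Divisor n: the definition used in this part of the course.
variance_n = sum_of_squares / n
sd_n = sqrt(variance_n)

# Divisor n - 1: assumed here to be what the calculator or computer does.
variance_n_minus_1 = sum_of_squares / (n - 1)
sd_n_minus_1 = sqrt(variance_n_minus_1)

print(round(variance_n, 2), round(sd_n, 2))                  # 204.04 14.28
print(round(variance_n_minus_1, 2), round(sd_n_minus_1, 2))  # 226.71 15.06
```

The only difference between the two answers is the divisor, so the gap between them shrinks as the number of observations grows.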
[Figure 15: Histogram of daily time spent on social networking websites (top: all values included; bottom: largest value excluded) with bins of width 10 and 5 minutes respectively. Horizontal axis: Minutes spent on social networking websites per day; vertical axis: Frequency.]

Exercise 21: The total area is the same as the probability of a randomly chosen baby having any weight. The probability of any certain event is 1 (this is how probability is defined).

Exercise 22: There is a meaning to the position of the bars in a histogram: they are in numerical order. This is not the case in general for a bar chart.

Exercise 23: The value of the e.d.f. only changes when we encounter an observation. At this point it jumps up by 1/n (step 2 of the construction of the e.d.f.). So the e.d.f. went up by 1/n at 2208. It then stays the same until we get to 2383, when it jumps up by 1/n again. So the value of the e.d.f. at 2210 and 2300 must both be the same.

Exercise 24: The final e.d.f. is given in Figure 16.

[Figure 16: e.d.f. of State areas. Horizontal axis: State Area (0 to 600); vertical axis: Fn(x) (0.0 to 1.0).]

Exercise 25: The median is the middle value, and since the value of the e.d.f. is

$$\tilde{F}(x) = \frac{\text{Number of observations smaller than } x}{\text{Total number of observations}}$$

we expect half of the observations to be smaller than the middle value.

Exercise 26: The arguments for F̃(LQ) = 0.25 and F̃(UQ) = 0.75 are similar to the previous exercise. Figure 17 is the birthweight data e.d.f. with the lower quartile and upper quartile lines added. We can estimate from this that the LQ is around 3100 and the UQ around 3600, to give an estimate of the IQR of 500.
These do not match exactly with the values calculated from the data because we are only able to estimate from the graph.

[Figure 17: e.d.f. of birthweights, with lower quartile and upper quartile lines added. Horizontal axis: birthweight (1500 to 4000); vertical axis: Fn(x) (0.0 to 1.0).]

Exercise 27:

• The histogram requires a subjective decision to be made about the bin sizes, whereas there are no subjectivities involved in producing an e.d.f.

• It is easier to get an impression of the shape of the distribution from a (well constructed) histogram.

• It is easier to calculate from the e.d.f. probabilities of a random observation landing in any given interval, whereas with histograms we are restricted to the bins that we have defined.

Exercise 28: There is some positive relationship, i.e. in states where the amount of cigarettes smoked (hundreds per capita) is larger than average, we would also expect the number of deaths from lung cancer per 100 000 to be larger than average.

Exercise 29: We could read, from the scatter graph, the number of deaths per 100 000 that corresponds to 30 cigarettes smoked (hundreds per capita). We get an estimate of around 22 deaths per 100 000. See Figure 18 for a visualisation of how this works.

[Figure 18: Scatter plot of the number of cigarettes smoked against the number of deaths from lung cancer, for 44 US states. Looking up expected number of deaths for a state where 30 cigarettes (hundreds per capita) are smoked. Horizontal axis: Cigarettes Smoked (hundreds per capita) in 1960; vertical axis: Deaths per 100 000 of Lung Cancer.]

Exercise 30: There are many possible examples. For example, one variable being ‘temperature’ and the other ‘scarf sales’.